BLITZ Memory System - Perlegos


CS152 Computer Architecture

Lab 6:

MIPS Processor


Codename:

BLITZ





Pete Perlegos SID# 13520625

Suparn Vats SID# 13156688


David Wang

Tu 1-2













BLITZ Memory System


Abstract:

This report is an addendum to the previous reports describing the evolution of our design of the BLITZ MIPS RISC microprocessor. Specifically, this report describes the design of BLITZ's memory system. This system is a complete overhaul of the simplified memory used in the previous models. Previously, the instruction and data memories were modeled as separate single banks with unrealistically short access times; the complications of the memory system were abstracted away to aid in the design of the control and datapath. The realistic and practical memory system developed in this phase models the systems used in the real world. The key issue addressed in this part is organizing memory in the most efficient way so as to overcome the bottleneck caused by slow memory access times. By organizing memory hierarchically (smallest/fastest to largest/slowest), the processor sees a system that is both large and reasonably fast. The design has the right layers of abstraction, which gives us room for further improvement and scalability without changing the whole system. The design as a whole is also abstract enough to be easily adapted for use in most RISC load/store word architectures. The system incorporates a 4-way set associative cache with an optimized write-back policy and a 64-bit memory bus to two interleaved dram memory banks. The design is implemented using VHDL models of the following blocks: identical 4-way set associative caches for instructions and data (icache and dcache respectively), a data memory management unit (dmmu), an instruction memory management unit (immu), a data write back system (wbsystem), a dram access arbiter (memarbit), and two identical dram memory blocks (dram).


Division of Labor:

The complications of implementing the complex system described in the abstract were dealt with by dividing the system into the logical subcomponents listed near the end of the abstract. Because the memory system is extremely complicated as a whole, the goal was always to distribute the complexity over as many logically separable blocks as possible. The design problem was broken down into the following subcomponents:

Four-way set associative cache block:

This block was designed by Suparn Vats and Pete Perlegos and implemented by Pete Perlegos.

Data Memory Management Unit:

This block was designed by Suparn Vats and Pete Perlegos and implemented by Suparn Vats.

Instruction Memory Management Unit:

This block was designed by Suparn Vats and Pete Perlegos and implemented by Suparn Vats.

LRU Update Block:

This block was designed by Suparn Vats and Pete Perlegos and implemented by Suparn Vats.

Data Write Back System:

This block was designed by Suparn Vats and Pete Perlegos and implemented by Pete Perlegos.

Dram Access Arbiter:

This block was designed by Suparn Vats and Pete Perlegos and implemented by Pete Perlegos.


Detailed Strategy:

The following section gives a detailed strategy for our design. Each of the subcomponents mentioned above is completely described, and in conclusion the whole working system is described in detail.


Four-way set associative cache block:

The four-way set associative cache block consists of four separate two-word, two-block memory (SRAM) blocks. In addition to the standard signals of an asynchronous memory block, these blocks take a dirty bit to indicate whether a block has been written to by the processor, and two additional status bits to keep track of the usage of the blocks (in particular to record the Least Recently Used block). It should be noted that read/write access to the data and LRU bits is independent (i.e., while reading the cache data, the LRU bits can be written). Again, in addition to the outputs of any standard asynchronous memory block, the cache block outputs a valid bit, a tag, LRU bits, and a dirty bit. A simplified block-level diagram of the cache block is given below.

[Diagram: Cacheblock]
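For illustration, one way of such a cache could be declared with an interface along the lines of the VHDL sketch below. The port names and widths are assumptions of ours for this illustration rather than the exact ports of the cacheblk module, and the storage architecture is omitted.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of one way of the cache: a small asynchronous SRAM extended with
-- tag, valid, dirty, and 2-bit LRU status.  Port names and widths are
-- illustrative assumptions only; the storage architecture is omitted.
entity cache_way is
  port (
    index     : in  std_logic;                        -- selects one of the two blocks
    word_sel  : in  std_logic;                        -- selects one of the two words in a block
    wr_data   : in  std_logic_vector(31 downto 0);
    wr_en     : in  std_logic;
    tag_in    : in  std_logic_vector(28 downto 0);
    dirty_in  : in  std_logic;
    lru_in    : in  std_logic_vector(1 downto 0);
    lru_wr_en : in  std_logic;                        -- LRU bits writable while data is read
    rd_data   : out std_logic_vector(31 downto 0);
    tag_out   : out std_logic_vector(28 downto 0);
    valid_out : out std_logic;
    dirty_out : out std_logic;
    lru_out   : out std_logic_vector(1 downto 0)
  );
end entity cache_way;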


LRU Update Block:

The LRU Update block updates the LRU information in the dcache or icache when requested by the dmmu or immu respectively. This block reads in the LRU bits from the cache and is signaled by the dmmu/immu to change the LRU information for some designated block. For example, the dmmu/immu may ask the LRU Update Block to make some block the most recently used (MRU) block based on a hit or an LRU dump to the wbsystem. The LRU information is coded by two bits stored in the dcache/icache, with 00 indicating LRU and 11 indicating MRU. It should be noted that the LRU technique employed in this system is only approximate, as it was decided that the hardware required for an exact methodology would be overly complicated without resulting in much of a gain.
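One simple way to realize such an approximate policy is sketched below: the referenced way is marked MRU (11) and every other way's counter is aged toward LRU (00). This is only an illustration under assumed signal names and is not necessarily the exact rule implemented in lru_update_blk.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative combinational approximate-LRU update: the referenced way
-- becomes MRU ("11") and every other way's 2-bit counter is decremented
-- toward LRU ("00").  A sketch only, not the exact lru_update_blk rule.
entity lru_update_sketch is
  port (
    lru_in  : in  std_logic_vector(7 downto 0);   -- four 2-bit counters, ways 3..0
    way_sel : in  unsigned(1 downto 0);            -- way to mark as MRU
    lru_out : out std_logic_vector(7 downto 0)
  );
end entity lru_update_sketch;

architecture comb of lru_update_sketch is
begin
  process (lru_in, way_sel)
    variable cnt : unsigned(1 downto 0);
    variable nxt : std_logic_vector(7 downto 0);
  begin
    for w in 0 to 3 loop
      cnt := unsigned(lru_in(2*w + 1 downto 2*w));
      if w = to_integer(way_sel) then
        cnt := "11";                               -- most recently used
      elsif cnt /= "00" then
        cnt := cnt - 1;                            -- age every other way
      end if;
      nxt(2*w + 1 downto 2*w) := std_logic_vector(cnt);
    end loop;
    lru_out <= nxt;
  end process;
end architecture comb;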


Data Memory Management Unit:

The Data Memory Management Unit (dmmu) is the main memory controller for data access. It interacts with the processor, the four-way set associative data cache block, the write-back system, and the arbitrator to the dram to control the flow of data to and from the processor. Below is a basic flow chart of the control path/logic taken by the dmmu for processing load word (lw) and store word (sw) instructions.

[Diagram: Data_MMU]

Load word instructions

=> Upon a dcache hit the data is forwarded to the processor, and in the next cycle the LRU bits are updated by the LRU update block, which is also controlled by the dmmu. If, however, a dcache miss occurs, the dmmu checks the wbsystem. Upon a hit in the wbsystem the data is forwarded to the processor and in the next cycle written to the cache with an LRU update. It is important to note that the wbsystem automatically kills the hit data so that it does not have to be dumped to dram just yet. If the wbsystem also misses, the dmmu sends a dram request to the dram arbitrator. It stalls the processor pipeline until the data is received from the dram and forwarded to the processor. In the next cycle after receiving the data, the cache is written and the LRU bits updated.
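The order in which the three sources are tried (dcache, then wbsystem, then dram) can be summarized in the combinational sketch below. The signal names are illustrative assumptions, and the next-cycle cache fill and LRU update performed by the real new_data_mmu are omitted.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative load-word priority: dcache hit first, then a wbsystem hit
-- (whose entry is then killed), then a dram request that stalls the pipeline.
-- Signal names are assumptions; next-cycle sequencing is omitted.
entity lw_priority_sketch is
  port (
    is_lw       : in  std_logic;
    dcache_hit  : in  std_logic;
    wb_hit      : in  std_logic;
    dram_ready  : in  std_logic;
    fwd_from_dc : out std_logic;   -- forward dcache data to the processor
    fwd_from_wb : out std_logic;   -- forward wbsystem data (entry is then killed)
    dram_req    : out std_logic;   -- read request to the dram arbiter
    stall       : out std_logic    -- stall the pipeline until dram data arrives
  );
end entity lw_priority_sketch;

architecture comb of lw_priority_sketch is
begin
  fwd_from_dc <= is_lw and dcache_hit;
  fwd_from_wb <= is_lw and (not dcache_hit) and wb_hit;
  dram_req    <= is_lw and (not dcache_hit) and (not wb_hit);
  stall       <= is_lw and (not dcache_hit) and (not wb_hit) and (not dram_ready);
end architecture comb;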


Store word instructions

=> Upon a dcache hit the data is written to the cache and the LRU bits updated. If the dcache misses, the dmmu checks the dirty bit of the LRU block of the four 2-word blocks. The dirty bit indicates whether the data in that particular location of the cache has been written to by the processor but not yet updated in the dram. So, if the LRU block is clean (i.e., the data in dram is up to date) then the dmmu simply writes to the LRU block in the next cycle and marks that location dirty. However, if the LRU block is found to be dirty (i.e., dram needs to be updated) then the LRU block is dumped to the wbsystem, and the data from the processor is written to the LRU block and dirtied. Once data is dumped into the wbsystem by the dmmu from the dcache, it is automatically written to the dram by the wbsystem (the wbsystem is described later in the report).
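The corresponding store-word decision (clean victim overwritten, dirty victim dumped to the wbsystem first) is sketched below; the signal names are again illustrative assumptions, and the next-cycle sequencing is omitted.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative store-word decision: on a miss the victim is the LRU block;
-- a clean victim is simply overwritten, a dirty victim is first dumped to
-- the wbsystem (waiting if the wbsystem is full).  Names are our own.
entity sw_priority_sketch is
  port (
    is_sw       : in  std_logic;
    dcache_hit  : in  std_logic;
    lru_dirty   : in  std_logic;   -- dirty bit of the LRU (victim) block
    wb_full     : in  std_logic;   -- Full signal from the wbsystem
    cache_write : out std_logic;   -- write processor data into the cache, mark dirty
    wb_dump     : out std_logic;   -- dump the victim block to the wbsystem
    wait_on_wb  : out std_logic    -- hold off until the wbsystem has room
  );
end entity sw_priority_sketch;

architecture comb of sw_priority_sketch is
begin
  cache_write <= is_sw and (dcache_hit or (not lru_dirty) or (lru_dirty and not wb_full));
  wb_dump     <= is_sw and (not dcache_hit) and lru_dirty and (not wb_full);
  wait_on_wb  <= is_sw and (not dcache_hit) and lru_dirty and wb_full;
end architecture comb;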


Instruction Memory Management Unit:

The Instruction Memory Management Unit (immu) is very similar in logic and control to the dmmu, only simpler. The simplicity comes from the fact that instructions are never written back to memory by the processor. Hence the immu only interacts with the processor, the icache block, and the dram arbitrator (no wbsystem is required in this case). Below is a basic logic/control diagram of the operation of the immu.

[Diagram: Instr_MMU]

In summary, every clock cycle the immu checks for an icache hit. If there is a hit, the instruction is loaded into the instruction register of the processor and the LRU bits are updated in the next clock cycle. If there is a miss, the immu issues a dram request and stalls only the PC and the instruction register until the requested instruction from the dram is available. The rest of the pipeline still flows, as there is no reason to stop the evaluation of instructions already in the pipe. In fact, this may avoid the problem of stalling the processor on a load word dependency.
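The stall behaviour can be summarized in the small combinational sketch below; the signal names are illustrative assumptions rather than the actual immu ports.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative immu behaviour: an icache miss issues a dram read request and
-- freezes only the PC and the instruction register; the rest of the pipeline
-- keeps flowing.  Signal names are assumptions for illustration.
entity immu_sketch is
  port (
    icache_hit : in  std_logic;
    dram_ready : in  std_logic;   -- requested instruction available from dram
    ir_load    : out std_logic;   -- load the instruction register
    pc_enable  : out std_logic;   -- advance the PC
    dram_req   : out std_logic    -- read request to the dram arbiter
  );
end entity immu_sketch;

architecture comb of immu_sketch is
  signal have_instr : std_logic;
begin
  have_instr <= icache_hit or dram_ready;
  ir_load    <= have_instr;
  pc_enable  <= have_instr;       -- only the PC and IR are frozen on a miss
  dram_req   <= not icache_hit;
end architecture comb;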


Data Write Back System:

The Data Write Back System (wbsystem) is used to write modified data back to the dram upon ejection from the dcache. This block is more than simply a buffer because it contains some control logic which makes it quite self-sufficient. The system is implemented to model a First In First Out (FIFO) buffer of length four. However, it has an extremely important special feature, discussed later, which sets it apart from a simple FIFO. Below is a conceptual block-level diagram of the wbsystem with an explanation following.

[Diagram: WB_System]

Data and address are inserted automatically at the next available position upon a request from the dmmu (inserted at A, B, C, or D). If, however, the wbsystem is full, it generates a Full signal telling the dmmu to wait before dumping to the system. The wbsystem, if not empty, continually makes requests to the memarbit to dump data/addr into the dram until it is empty. The memarbit makes the wbsystem wait until its write request to dram is serviced. Now for the special feature: as discussed in the dmmu section above, if there is a hit in the wbsystem after a dcache miss, the dmmu forwards the data to the processor and writes it back to the cache. Also, as mentioned above, the hit data in the wbsystem should be deleted, as it does not need to be written back to dram at this time. So, the wbsystem not only detects a hit, it also deletes the hit entry in the next clock cycle. All these features make this block quite independent and self-sufficient.
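The special feature can be pictured with the sketch below, which covers only the associative probe and the next-cycle kill of a hit entry; insertion, FIFO ordering, and the memarbit handshake are left out, and all signal names are our own illustrative assumptions rather than the actual wbsystem ports.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch of the wbsystem hit-and-kill: an associative probe over
-- the four pending entries reports a hit and, on the next clock edge, marks
-- the matching entry for invalidation so it is never written to dram.
entity wb_hit_kill_sketch is
  port (
    clk        : in  std_logic;
    probe_en   : in  std_logic;                          -- dmmu probes after a dcache miss
    probe_addr : in  std_logic_vector(31 downto 0);
    addr0, addr1, addr2, addr3 : in std_logic_vector(31 downto 0);
    valid_in   : in  std_logic_vector(3 downto 0);
    hit        : out std_logic;
    hit_sel    : out std_logic_vector(3 downto 0);       -- one-hot: which entry hit
    kill       : out std_logic_vector(3 downto 0)        -- registered: entry to invalidate
  );
end entity wb_hit_kill_sketch;

architecture rtl of wb_hit_kill_sketch is
  signal match : std_logic_vector(3 downto 0);
begin
  match(0) <= '1' when probe_en = '1' and valid_in(0) = '1' and probe_addr = addr0 else '0';
  match(1) <= '1' when probe_en = '1' and valid_in(1) = '1' and probe_addr = addr1 else '0';
  match(2) <= '1' when probe_en = '1' and valid_in(2) = '1' and probe_addr = addr2 else '0';
  match(3) <= '1' when probe_en = '1' and valid_in(3) = '1' and probe_addr = addr3 else '0';

  hit     <= match(0) or match(1) or match(2) or match(3);
  hit_sel <= match;

  -- the entry that hit is invalidated on the next clock edge, as described above
  process (clk)
  begin
    if rising_edge(clk) then
      kill <= match;
    end if;
  end process;
end architecture rtl;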


Dram Access Arbiter:

The Dram Access Arbiter (memarbit) manages accesses to the dram. There are three different blocks, described above, that access the dram. The dmmu and immu only make read requests to the dram, and the wbsystem only makes write requests to the dram. The memarbit needs to schedule these requests in the most optimal order to minimize the processor stall time. Keeping this in mind, we chose the logic/flow described below to tackle the problem.

In essence, the memarbit gives the highest priority to the dmmu, followed by the immu, and the lowest priority to the wbsystem. The dmmu gets the highest priority because it stalls the whole processor while it waits for data from the dram. The immu only stalls the IF stage and the PC while it waits for the next instruction from the dram. Finally, the wbsystem receives the lowest priority because it does not directly stall any part of the processor.
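A minimal sketch of this fixed-priority selection is given below; the signal names are illustrative assumptions, and the real memarbit additionally holds a grant for the duration of a dram access.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative fixed-priority grant: dmmu over immu over wbsystem, matching
-- the ordering described above.  Names are assumptions for illustration.
entity memarbit_priority_sketch is
  port (
    dmmu_req : in  std_logic;
    immu_req : in  std_logic;
    wb_req   : in  std_logic;
    dmmu_gnt : out std_logic;
    immu_gnt : out std_logic;
    wb_gnt   : out std_logic
  );
end entity memarbit_priority_sketch;

architecture comb of memarbit_priority_sketch is
begin
  dmmu_gnt <= dmmu_req;
  immu_gnt <= immu_req and not dmmu_req;
  wb_gnt   <= wb_req and not dmmu_req and not immu_req;
end architecture comb;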



Interleaved Dram Block (64-bit Memory Bus):

To fully utilize the 2-word cache blocks used in our cache, the obvious choice was to use a 64-bit memory bus for an interleaved dram block system. This architecture was chosen as opposed to implementing burst-mode requests, because burst mode only increases the upstream bandwidth from the memory (only reads work in burst mode, not writes), whereas having interleaved memory and a 64-bit memory bus doubles both the upstream and downstream bandwidth.
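To make the idea concrete, the sketch below shows one possible word-level mapping: even words in one bank and odd words in the other, so both words of a 2-word block travel over the 64-bit bus in a single access. The entity and its signal names are illustrative assumptions, not a copy of the actual dram wiring.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative word-level interleaving: even words live in one dram bank and
-- odd words in the other, so a 2-word cache block moves over the 64-bit bus
-- in one access.  A sketch of the idea only, under assumed signal names.
entity interleave_sketch is
  port (
    addr       : in  std_logic_vector(31 downto 0);   -- byte address
    bank0_data : in  std_logic_vector(31 downto 0);   -- even word from bank 0
    bank1_data : in  std_logic_vector(31 downto 0);   -- odd word from bank 1
    bank_addr  : out std_logic_vector(28 downto 0);   -- block address sent to both banks
    bus64      : out std_logic_vector(63 downto 0)    -- two words side by side
  );
end entity interleave_sketch;

architecture comb of interleave_sketch is
begin
  -- drop the byte offset (bits 1..0) and the word-within-block bit (bit 2)
  bank_addr <= addr(31 downto 3);
  bus64     <= bank1_data & bank0_data;
end architecture comb;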


System Description:

The block-level diagram below shows all the different blocks described above interconnected to each other with key signals labeled between them.


[Diagram: BLITZ Memory System]











Results:

The critical path in this case will either be our old critical path, when forwarding occurs from the execution stage during a branch, or the cache that we added.

Old critical path (criticalpath):

reg32 (3 ns) + m32x2 (1 ns) + ALU (15 ns) + m32x2 (1 ns) + m32x3 (2 ns) + m32x4 (3 ns) + compare (2 ns) + m32x2 (1 ns) + adder (8 ns) + m32x2 (1 ns) = 37 ns

The data cache has the worst delay (cache_critical_path):

Reg32 (3 ns) + cachblk (10 ns) + compare29 (6 ns) + logic (2 ns) + new_data_MMU (10 ns) + LRU_update_blk (4 ns) = 35 ns

(The reason we said that we have not fully decided on this yet is because we have not yet finished fully testing.)


Conclusion:

As mentioned earlier in this report, the main strategy used to deal with the complexity was to distribute the responsibilities amongst the different blocks. So, in some cases we shifted the processing burden from one block to another so as to balance the complexity. Making the wbsystem functional was the most difficult task because it has to coordinate with the dmmu when getting a data/addr dump and providing hit data at the same time. Also, it was decided later in the design process that the dmmu/immu and LRU Update blocks would be coded in VHDL as simply combinational logic blocks (i.e., they do not see the clock). The timing was resolved externally by buffering the required outputs. This issue again falls under the complexity distribution strategy. Although we thoroughly tested each block separately, we were unsure about testing the processor as a whole because of the interleaved dram memory banks. In particular, we were not sure how to load instructions into the interleaved dram blocks. So, testing results will be delayed by one day.


Appendix I (notebooks):

Notebook



There is only one notebook since we were together virtually all of the time.


Appendix II (schematics):


pipecache

pipecache1

pipecache2

pipecache3

instrcache

datacache

WB_system

criticatpath

cache_critical_path


Appendix III (vhdl):

adder

alu

arbitrtr

brancher

bts1

bts32

cacheblk

compar28

compar29

compare

compartr

const031

const1

const16

const31

control

div4

extend

hazard_c

hazard_r

instr_mmu

jumpbuf

lru_update_blk

lw_stall

m16x2

m1x4

m32x2

m32x3

m32x4

m32x5

m5x2

memctrl

multfour

new_data_mmu

reg1

reg2

reg3

reg4

reg5

reg32

regfile

shiftr



Appendix IV (testing):

Coming soon (i.e., tomorrow (hopefully)).