
By

Ajay Mathews Cheriyan

Jian Wang

Shoaib Akram



Memory Hierarchy Overview


Memory Subsystem


Virtual Memory and Prefetching



IBM POWER:
- Performance Optimization With Enhanced RISC
- Used in many of IBM's servers, workstations and supercomputers


POWER to POWER3: (1990~1998)

POWER4: (2001)
- dual-core chips, up to 8 cores, most powerful at the time

POWER5: (2004)
- memory system improvements
- supports simultaneous multithreading (SMT)

POWER6: (May 2007)
- advanced interchip communication technology
- double the performance of POWER5

POWER7: (currently in development)

[Figure: POWER5 chip layout: two processor cores (P1, P2), shared L2, L3 controller (L3 Ctrl), and memory controller (MC)]


L1 caches are not shared between processors:
- each core has its own cache
- LRU replacement (vs. FIFO in POWER4)



L1 I-cache:
- size: 64KB/processor
- associativity: 2-way (vs. direct-mapped in POWER4)
- line size: 128 bytes
- write policy: N/A



L1 D-cache:
- size: 32KB/processor
- associativity: 4-way (vs. 2-way in POWER4)
- line size: 128 bytes
- write policy: write-through
- transfer rate: 4 words/cycle






L2 cache:
- three identical slices, shared between processors
- (memory address) mod 3 = slice ID (see the sketch below)
- total size: 1.9MB (vs. 1.4MB in POWER4)
- associativity: 10-way (vs. 8-way)
- latency: 13 cycles, 6.8 ns (vs. 12 cycles / 7.1 ns)
- line size: 128 bytes
- write policy: write-back
- transfer rate: 4 words/cycle


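A minimal sketch of the slice-selection rule. The slide gives only "(memory address) mod 3"; taking the mod at cache-line granularity, and the function name, are assumptions here:

    #include <stdint.h>

    /* L2 slice selection as stated above: three identical slices,
     * (memory address) mod 3 = slice ID.  The mod is taken at
     * 128-byte line granularity (an assumption), so that all bytes
     * of one line land in the same slice. */
    enum { L2_LINE_SIZE = 128 };

    static unsigned l2_slice_id(uint64_t addr)
    {
        return (unsigned)((addr / L2_LINE_SIZE) % 3);  /* 0, 1, or 2 */
    }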

L3 controller and L3:
- three independent controllers; each controller manages one slice
- each can deliver 32B/cycle (60.8GB/sec) to L1
- 3 identical slices; each attaches to one L2 slice
- off-chip, shared between processors



- total size: 36MB (vs. 32MB in POWER4)
- associativity: 12-way (vs. 8-way)
- latency: 87 cycles, 45.8 ns (vs. 123 cycles / 72.3 ns)
- line size: 256 bytes
- write policy: write-back
- transfer rate: <1 word/cycle
- the L3 directory/control is on-chip; this design reduces the off-chip delay



L3 removed from the path between the chip and the MC:
- why? heavy traffic on the FBC* (16 chips)
- reduces latency to the L3 (physically closer to the CPUs)
- L3 acts as a victim cache of L2 (see the sketch below)
- now operates at 1/2 the processor clock rate, versus 1/3 in POWER4
- this optimization increases bandwidth by 1/2 and reduces latency by roughly 1/3

* FBC: Fabric Bus Controller
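A rough sketch of the victim-cache arrangement, using toy direct-mapped caches rather than the real POWER5 structures (all names and sizes below are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    enum { SETS = 256 };               /* toy size, not POWER5's */
    typedef struct { uint64_t tag[SETS]; bool valid[SETS]; } toy_cache;

    static bool lookup(toy_cache *c, uint64_t line)
    {
        unsigned s = line % SETS;
        return c->valid[s] && c->tag[s] == line;
    }

    /* Install a line; if a valid line is cast out, report it. */
    static bool install(toy_cache *c, uint64_t line, uint64_t *victim)
    {
        unsigned s = line % SETS;
        bool cast_out = c->valid[s];
        if (cast_out)
            *victim = c->tag[s];
        c->tag[s] = line;
        c->valid[s] = true;
        return cast_out;
    }

    /* "Victim cache": an L2 miss probes L3 before memory, and lines
     * cast out of L2 are installed into L3. */
    static void access_line(toy_cache *l2, toy_cache *l3, uint64_t line)
    {
        if (lookup(l2, line))
            return;                    /* L2 hit */
        (void)lookup(l3, line);        /* L2 miss: probe L3 next */
        uint64_t victim;
        if (install(l2, line, &victim))    /* refill L2 */
            install(l3, victim, &victim);  /* cast-out goes to L3 */
    }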



Structural Optimizations over POWER4 (1)

MC integrated into the chip:
- separate paths for processor-to-L3 and processor-to-memory traffic
- increases potential operational parallelism and bandwidth
- significantly reduces latency by eliminating communication delays to an external controller

The benefit of these optimizations (POWER4 vs. POWER5):
- L3 latency: 72.3ns vs. 45.8ns
- memory latency: 206ns vs. 116ns
- bandwidth (4P): 8.37GB/s vs. 17.9GB/s



Structural Optimizations over POWER4 (2)

[Figure: two POWER5 chips, each with two processors (P), a shared L2, a fabric bus controller (FBC), and a memory controller (MC)]

Memory Hierarchy Overview

Memory Subsystem

Virtual Memory and Prefetching


Path followed by requests:
L2 controller → fabric controller → memory controller → physical memory


Memory controller structures (a toy model follows below):
- read/write reorder queues
- scheduler: selects operations from the queues
- FIFO-based arbiter queue

Read/write reorder queues:
- separate queues, eight entries per queue
- read and write reordering is done differently
- increasing capacity would increase the clock cycle

FIFO arbiter queue:
- centralized FIFO queue
- prevents CPU stalls when the memory controller is under stress

Scheduler:
- selects operations from the queues

Buses operate at twice the DRAM speed.

Memory protection:
- ECC
- memory scrubbing
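A toy model of these structures, under assumed semantics: the slides do not specify the scheduler's selection policy, so the reads-first choice and all names below are illustrative:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t addr; bool is_write; } mem_op;

    typedef struct {
        mem_op ops[8];                 /* eight entries per queue */
        int    count;
    } reorder_queue;

    typedef struct {
        mem_op ops[16];                /* centralized FIFO arbiter */
        int    head, tail, count;
    } fifo_arbiter;

    /* Scheduler: pick one operation from the reorder queues and
     * push it into the FIFO arbiter queue headed for the DRAM buses. */
    static bool schedule_one(reorder_queue *rd, reorder_queue *wr,
                             fifo_arbiter *arb)
    {
        reorder_queue *q = rd->count ? rd : (wr->count ? wr : NULL);
        if (q == NULL || arb->count == 16)
            return false;              /* nothing ready, or arbiter full */
        arb->ops[arb->tail] = q->ops[--q->count];
        arb->tail = (arb->tail + 1) % 16;
        arb->count++;
        return true;
    }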

Memory Controller

[Figure: memory controller linked by read/write/command buses to four SMI chips, each SMI chip driving two DIMMs]

- SMI chips match the bus width between the MC and the DIMMs
- DIMM: Dual In-line Memory Module

Fabric buses:
- ring topology
- snooping mechanism
- combined response

[Figure: processor (P) with fabric buses labeled AB, SB, RB, DB]


Memory Hierarchy Overview


Memory Subsystem


Virtual Memory and Prefetching



64-bit virtual address and 50-bit real address.

Two steps to address translation (see the sketch below):
- the effective address is translated to a virtual address using a 64-entry segment lookaside buffer (SLB)
- the virtual address is translated to a real address using the page table

The page table is cached in a 1024-entry, 4-way set-associative TLB.

For fast translation, two first-level translation tables are used: one for instructions and one for data. They provide fast effective-to-real address translation; the SLB and TLB are looked up only on a miss in the first-level translation.
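A sketch of the two-step translation. The structure sizes follow the slides (64-entry SLB; 1024-entry, 4-way TLB), but the 256MB segments, 4KB pages, and all names below are assumptions for concreteness:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t esid, vsid; bool valid; } slb_entry;
    typedef struct { uint64_t vpn, rpn;  bool valid; } tlb_entry;

    static slb_entry slb[64];        /* 64-entry SLB */
    static tlb_entry tlb[256][4];    /* 1024 entries, 4-way set associative */

    /* Step 1: effective -> virtual.  The effective segment ID (the
     * bits above the assumed 28-bit segment offset) is swapped for
     * a virtual segment ID found in the SLB. */
    static bool ea_to_va(uint64_t ea, uint64_t *va)
    {
        uint64_t esid = ea >> 28;
        for (int i = 0; i < 64; i++)
            if (slb[i].valid && slb[i].esid == esid) {
                *va = (slb[i].vsid << 28) | (ea & ((1ULL << 28) - 1));
                return true;
            }
        return false;                /* SLB miss */
    }

    /* Step 2: virtual -> real, via the TLB that caches the page
     * table; a miss would trigger a page-table walk (not modeled).
     * Real addresses are 50 bits on POWER5. */
    static bool va_to_ra(uint64_t va, uint64_t *ra)
    {
        uint64_t vpn = va >> 12;     /* assumed 4KB pages */
        tlb_entry *set = tlb[vpn % 256];
        for (int w = 0; w < 4; w++)
            if (set[w].valid && set[w].vpn == vpn) {
                *ra = ((set[w].rpn << 12) | (va & 0xFFF))
                      & ((1ULL << 50) - 1);
                return true;
            }
        return false;                /* TLB miss */
    }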


Data translation table:
- 128-entry fully associative array

Instruction translation table:
- 64-entry, two-way set-associative array

Entries in both tables are tagged with the thread number and are not shared between threads. TLB entries can be shared between threads.




When load instructions miss sequential cache lines, the prefetch engine initiates accesses to the following cache lines before they are referenced by future load instructions.

An L1 data cache prefetch is initiated when a load references data from a new cache line, and a new line is then loaded into L2 from memory.

The latency for retrieving data from memory is higher, so the prefetch engine requests data from memory 12 lines ahead of the line being referenced by the load.

Hardware ramps up prefetching slowly, requiring an additional 2 sequential misses before it reaches steady-state prefetch sequencing.

Software prefetching is also supported, to indicate the number of lines to prefetch using the hardware (an illustrative use of compiler prefetch hints follows below).
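For illustration only, software prefetching can be expressed with GCC's __builtin_prefetch, which maps to data-cache-touch instructions on POWER; mirroring the engine's 12-line lookahead here is our choice, not something the slides prescribe:

    /* Sum an array while hinting the line ~12 cache lines (128
     * bytes each) ahead of the current load, echoing the hardware
     * engine's 12-line lookahead described above. */
    static double sum_with_prefetch(const double *a, long n)
    {
        enum { AHEAD = 12 * 128 / sizeof(double) };   /* 192 elements */
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&a[i + AHEAD], 0, 0);  /* read hint */
            s += a[i];
        }
        return s;
    }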


Advantages of software prefetching:
- improves performance by eliminating the initial ramp-up
- only the required number of lines are prefetched
- eight software prefetch streams are supported per processor


Upon a cache miss:
- a biased guess is made as to the direction of the stream
- the guess is based on where in the cache line the address associated with the miss occurred (see the sketch below):
  - if it is in the first 3/4 of the line, the direction is guessed ascending
  - if it is in the last 1/4, the direction is guessed descending
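A minimal sketch of this heuristic, assuming the 128-byte line size from the earlier cache slides (function and type names are illustrative):

    #include <stdint.h>

    enum { LINE = 128 };              /* cache line size in bytes */
    typedef enum { ASCENDING, DESCENDING } stream_dir;

    /* Guess the stream direction from where in the line the missing
     * address falls: first 3/4 -> ascending, last 1/4 -> descending. */
    static stream_dir guess_direction(uint64_t miss_addr)
    {
        uint64_t offset = miss_addr % LINE;
        return offset < (3 * LINE) / 4 ? ASCENDING : DESCENDING;
    }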


Instruction prefetching is also present in the POWER5 processor, with 4 instruction prefetch buffers (2 per thread).


Both DDR and DDR2 DIMMs can be used with POWER5.

SMI chips support the connection of the DIMMs to the processor; 2 (2-SMI mode) or 4 (4-SMI mode) chips are supported.

Each SMI chip has two ports, and each port can support up to 2 DIMMs (so a 4-SMI configuration can drive up to 4 × 2 × 2 = 16 DIMMs).


THANK YOU