CSAIL
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Application-Specific Memory Management in Embedded Systems Using Software-Controlled Caches
Derek Chiou, Prabhat Jain,
Larry Rudolph, Srinivas Devadas
In the Proceedings of the 37th Design Automation Conference, June 2000
Computation Structures Group
Memo 448

The Stata Center, 32 Vassar Street, Cambridge, Massachusetts 02139
Application-Specific Memory Management for Embedded Systems Using Software-Controlled Caches
Derek Chiou, Prabhat Jain, Larry Rudolph, and Srinivas Devadas
Department of EECS
Massachusetts Institute of Technology
Cambridge, MA 02139
ABSTRACT

We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratchpad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software by mapping data regions to a specified set of cache "columns" or "ways." When a region of memory is exclusively mapped to an equivalent-sized partition of cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or each task within an application. Thus, software has much finer control of on-chip memory, providing the ability to dynamically trade off performance for on-chip memory.

1. INTRODUCTION

As time-to-market requirements of electronic systems demand ever faster design cycles, an ever increasing number of systems are built around a programmable embedded processor that implements an ever increasing amount of functionality in firmware running on the embedded processor. The advantages of doing so are twofold: software is simpler to implement than a dedicated hardware solution, and software can be easily changed to address design errors, late design changes, and product evolution. Only the most time-critical tasks need to be implemented in hardware.

On-chip memory, in the form of cache, scratchpad SRAM, and more recently embedded DRAM, or some combination of the three, is ubiquitous in programmable embedded systems to support software and to provide an interface between hardware and software. Most systems have both cache and scratchpad memory on-chip since each addresses a different need. Caches are transparent to software since they are accessed through the same address space as the larger backing storage. They often improve overall software performance but are unpredictable. Although the cache replacement hardware is known, predicting its performance depends on accurately predicting past and future reference patterns. Scratchpad memory is addressed via an independent address space and thus must be managed explicitly by software, oftentimes a complex and cumbersome problem, but it provides absolutely predictable performance. Thus, even though a pure cache system may perform better overall, scratchpad memories are necessary to guarantee that critical performance metrics are always met.

Of course, both caches and scratchpad memories should be available to embedded systems so that the appropriate memory structure can be used in each instance. A static division, however, is guaranteed to be suboptimal, as different applications have different requirements. Previous research has shown that even within a single application, dynamically varying the partitioning between cache and scratchpad memory can significantly improve performance.

We propose a way to dynamically allocate cache and scratchpad memories from a common pool of memory. In particular, we propose column caching, a simple modification that enables software to dynamically partition a cache into several distinct caches and scratchpad memories at a column granularity. In our reference implementation, each "way" of an n-way set-associative cache is a column. By exclusively allocating a region of address space to an equal-sized region of cache, column caching can emulate scratchpad memory. Column caching only restricts data placement within the cache during replacement; all other operations are unmodified.

Careful mapping can potentially reduce or eliminate replacement errors, resulting in improved performance. It not only enables a cache to emulate scratchpad memory, but also separate spatial/temporal caches, a separate prefetch buffer, separate write buffers, and other traditional statically-partitioned structures within the general cache as well.

The rest of this paper describes column caching and how it might be used.
[Figure 1 diagram. Labels: Virtual address, Op, TLB, Replacement Unit, BIU, Hit?, Data, Column 0, Column 1, Column 2, Column 3.]

Figure 1: Basic Column Caching. Three modifications to a set-associative cache, denoted by dotted lines in the figure, are necessary: (i) an augmented TLB to hold mapping information, (ii) a modified replacement unit that uses mapping information, and (iii) a path between the TLB and the replacement unit that carries that information.

2. COLUMN CACHING

The simplest implementation of column caching is derived from a set-associative cache, where lower-order bits are used to select a set of cache-lines, which are then associatively searched for the desired data. If the data is not found (a cache miss), the replacement algorithm selects a cache-line from the selected set.

During lookup, a column cache behaves exactly as a standard set-associative cache and thus incurs no performance penalty on a cache hit. Rather than allowing the replacement algorithm to always select from any cache-line in the set, however, column caching provides the ability to restrict the replacement algorithm to certain columns. Each column is one "way" or bank of the n-way set-associative cache (Figure 1). Embedded processors such as the ARM are highly set-associative to reduce power consumption, providing a large number of columns. A bit vector specifying the permissible set of columns is generated and passed to the replacement unit.

A modification to the bit vector repartitions the cache. Since every cache-line in the set is searched during every access, repartitioning is graceful and fast: if data is moved from one column to another (but always in the same set), the associative search will still find the data in the new location. A memory location can be cached in one column during one cycle, then remapped to another column on the next cycle. The cached data will not move to the new column instantaneously, but will remain in the old column until it is replaced. Once removed from the cache, it will be cached in a column to which it is mapped the next time it is accessed.

Column caching is implemented via three small modifications to a set-associative cache (Figure 1). The TLB must be modified to store the mapping information. The replacement unit must be modified to respect TLB-generated restrictions of replacement cache-line selection. A path to carry the mapping information from the TLB to the replacement unit must be provided. Similar control over the cache already exists for uncached data, since the cached/uncached bit resides in the TLB.

2.1 Partitioning and Repartitioning

Implementation is greatly simplified if the minimum mapping granularity is a page, since existing virtual memory translation mechanisms, including the ubiquitous translation lookaside buffers (TLBs), can be used to store mapping information that will be passed to the replacement unit. TLBs, accessed on every memory reference, are designed to be fast in order to minimize physical cache access time. Partitioning is supported by simply adding column caching mapping entries to the TLB data structures and providing a data path from those entries to the modified replacement unit. Therefore, in order to remap pages to columns, access to the page table entries is required.

Mapping a page to a cache partition, represented by a bit vector, is a two-phase process. Pages are mapped to a tint rather than to a bit vector directly. A tint is a virtual grouping of address spaces. For example, an entire streaming data structure could be mapped to a single tint, or all streaming data structures could be mapped to a single tint, or just the first page of several data structures could be mapped to a single tint. Tints are independently mapped to a set of columns, represented by a bit vector; such mappings can be changed quickly. Thus, tints, rather than bit vectors, are stored in page table entries.

The tint level of indirection is introduced (i) to isolate the user from machine-specific information such as the number of columns or the number of levels of the memory hierarchy and (ii) to make remapping easier.

2.2 Using Columns As Scratchpad Memory

Column caching can emulate scratchpad memory within the cache by dedicating a region of cache equal in size to a region of memory. No other memory regions are mapped to the same region of cache. Since there is a one-to-one mapping, once the data is brought into the cache it will remain there. In order to guarantee performance, software can perform a load on all cache-lines of data when remapping, as is required with a dedicated SRAM. That region of memory then behaves like a scratchpad memory. If and when the scratchpad memory is remapped to a different use, the data is automatically copied back (if backing RAM is provided).

2.3 Impact of Column Caching on Clock Cycle

The modifications required for column caching are limited to the cache replacement unit, which is not on the critical path. In realistic systems, data requested from a lower-level cache or main memory takes at least three cycles, but generally more, to return. The exact replacement cache-line does not need to be decided until the data returns, giving the replacement algorithm at least three cycles to make a decision, which should easily be sufficient for the minor additions to the replacement path.
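As an illustration of the tint indirection of Section 2.1 and the restricted victim choice of Section 2, the following Python sketch models both in software. It is not the authors' hardware design. The names `TintTable`, `permitted_columns`, and `choose_victim`, the four-column geometry, and the least-recently-used ordering are all assumptions made for this example; the paper does not fix a replacement policy.

```python
NUM_COLUMNS = 4  # assumed geometry: a 4-way cache, i.e. four columns


class TintTable:
    """Hypothetical model of the tint -> column bit vector mapping."""

    def __init__(self):
        self.masks = {}

    def set_columns(self, tint, columns):
        """Map a tint to a set of columns, stored as a bit vector."""
        mask = 0
        for c in columns:
            if not 0 <= c < NUM_COLUMNS:
                raise ValueError(f"column {c} out of range")
            mask |= 1 << c
        if mask == 0:
            raise ValueError("a tint needs at least one column")
        self.masks[tint] = mask

    def bit_vector(self, tint):
        return self.masks[tint]


def permitted_columns(mask):
    """Expand a bit vector into the list of columns it permits."""
    return [c for c in range(NUM_COLUMNS) if mask & (1 << c)]


def choose_victim(lru_order, mask):
    """Pick the least recently used column among those the mask permits.

    lru_order lists the columns of one set from least to most recently
    used (an assumed policy, used here only for illustration).
    """
    for c in lru_order:
        if mask & (1 << c):
            return c
    raise ValueError("bit vector permits no column")


# Pages carry a tint (as a page table entry would), not a bit vector.
page_tint = {0x1000: "stream", 0x2000: "stream", 0x3000: "critical"}

tints = TintTable()
tints.set_columns("stream", [3])
tints.set_columns("critical", [0, 1])

lru_order = [0, 2, 3, 1]
before = choose_victim(lru_order, tints.bit_vector(page_tint[0x1000]))

# Repartitioning changes one tint entry; no page table entry is touched.
tints.set_columns("stream", [2, 3])
after = choose_victim(lru_order, tints.bit_vector(page_tint[0x2000]))
```

In this sketch, remapping every page that carries the "stream" tint costs a single table update, which is the reason the text gives for storing tints rather than bit vectors in page table entries.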
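The scratchpad emulation of Section 2.2 can be explored with a small simulation. The sketch below is a toy model under stated assumptions, not the paper's implementation: the class `ColumnCache`, the function `reaccess_hits`, the cache geometry (4 sets, 4 columns, 16-byte lines), the address ranges, and the LRU policy are all hypothetical. It preloads a region equal in size to one column, runs unrelated traffic, and counts how many lines of the region still hit.

```python
class ColumnCache:
    """Toy set-associative cache whose replacement obeys a column bit vector."""

    def __init__(self, num_sets=4, num_columns=4, line_size=16):
        self.num_sets = num_sets
        self.num_columns = num_columns
        self.line_size = line_size
        # tags[s][c] is the line address cached in set s, column c (or None)
        self.tags = [[None] * num_columns for _ in range(num_sets)]
        # last_use[s][c] is a timestamp used for LRU among permitted columns
        self.last_use = [[0] * num_columns for _ in range(num_sets)]
        self.clock = 0

    def access(self, addr, mask):
        """Return True on a hit; on a miss, fill a column permitted by mask."""
        self.clock += 1
        line = addr // self.line_size
        s = line % self.num_sets
        # Lookup searches every column, as in a standard set-associative cache.
        for c in range(self.num_columns):
            if self.tags[s][c] == line:
                self.last_use[s][c] = self.clock
                return True
        # Miss: only columns named in the bit vector may be replaced.
        allowed = [c for c in range(self.num_columns) if mask & (1 << c)]
        victim = min(allowed, key=lambda c: self.last_use[s][c])
        self.tags[s][victim] = line
        self.last_use[s][victim] = self.clock
        return False


def reaccess_hits(exclusive):
    """Count hits on a preloaded region after unrelated traffic has run."""
    cache = ColumnCache()
    column_bytes = cache.num_sets * cache.line_size  # capacity of one column
    # With exclusive mapping the region owns column 0 and all other data is
    # confined to columns 1-3; otherwise everything may replace anywhere.
    scratch_mask = 0b0001 if exclusive else 0b1111
    other_mask = 0b1110 if exclusive else 0b1111
    scratch = range(0, column_bytes, cache.line_size)
    for addr in scratch:  # software preload of every line in the region
        cache.access(addr, scratch_mask)
    for addr in range(1024, 2048, cache.line_size):  # unrelated traffic
        cache.access(addr, other_mask)
    return sum(cache.access(addr, scratch_mask) for addr in scratch)


column_hits = reaccess_hits(exclusive=True)
standard_hits = reaccess_hits(exclusive=False)
```

With the exclusive mapping, every line of the preloaded region remains resident regardless of the other traffic, which mirrors the one-to-one mapping argument in the text. Without it, the same traffic evicts the region under this model's LRU policy.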
3. PREDICTABILITY IN MULTITASKING

Column caching enables predictable performance within a multitasking environment where multiple jobs are executing. Consider three compression (gzip) jobs simultaneously executing on one processor, each having access to the cache. To understand what is happening, we present the performance measurement of one gzip process (referred to as job A in the rest of the discussion) in this mixture. We present the results in terms of clocks per instruction (CPI), which is inversely correlated with performance (a lower CPI means higher performance). To compute the CPI, we assume a fixed cycle latency to memory and that a fixed fraction of instructions are memory operations. Figure 2 shows the variation in job A's CPI when the time quantum is varied. Results for both a standard cache and a mapped column cache are presented. The two sets of plots correspond to different sized (16K and 128K) caches.

[Figure 2 plot. x-axis: Context-Switch Time Quantum, 1 to 1048576, doubling at each tick. y-axis: Clocks Per Instruction, 1 to 2.8. Curves: gzip 16K cache, gzip 16K column cache, gzip 128K cache, gzip 128K column cache.]

Figure 2: Column caching provides predictable and superior performance to a standard cache over a wide range of scheduling time quanta. The clocks per instruction is a measure of performance: the smaller the number, the better. Except for very long time quantum periods, column caching provides superior performance. Also note that the performance of column caching is much less sensitive to time quantum times, as seen from the nearly horizontal curves for column caching.

Each point in this plot was generated by choosing a time quantum and performing a round-robin schedule of the three gzip jobs A, B, and C. There are two cases: (i) each job gets to use the entire cache while it is running (standard cache) and (ii) each job uses only its assigned columns (column cache). For the column cache, the critical job is permitted to use the entire cache while the other two jobs are restricted to using only a quarter of the cache. In the standard cache case, there is a significant difference in the CPI for job A as the time quantum varies. This variation is caused mainly by the cache hit rate for job A being affected by intervening cache accesses due to jobs B and C. The number of such accesses is dependent on the time quantum. Once column caching is introduced and most of job A's data is protected from replacement by jobs B and C's data, the CPI of job A is significantly less sensitive to the time quantum. Job A was considered critical and it was exclusively assigned a large fraction of the cache, hence the hit rate for job A is higher. Therefore, the CPI is significantly smaller for small time quanta. Of course, when the time quantum is really large, we effectively have batch processing and the CPIs are virtually the same for the top two plots. Overall system throughput may actually decrease due to an over-allocation of resources to a set of critical jobs, but the performance of those critical jobs is generally higher and has much less variation.

One may argue that the time quantum could be fixed for predictability, but in reality, due to interrupts and exceptions, the effective time quantum can vary significantly during the time that a job is running simultaneously with other jobs. Thus, column caching can improve the performance of a critical job as well as significantly reduce performance variation, even in the presence of interrupts or varying time quanta.

4. RELATED WORK

4.1 Cache Mechanisms

The idea of statically-partitioned caches is not new. The most common example is separate instruction and data caches. Some existing and proposed architectures support a pair of caches, one for spatial locality and one for temporal locality. These designs statically divide the two caches. Other processors support locking of data into the cache but do not include a way to tell if the desired data is in the cache. Sun Microsystems Corporation patented a mechanism very similar to column caching that allows partitioning of a cache between processes at cache column granularity by providing a bit mask associated with the running process, limiting it to partitioning the cache between processes.

A subset of column caching abilities is achievable without hardware support by page coloring, achieved by intelligently mapping virtual pages to physical pages. Column caching, however, is much faster at repartitioning (page coloring requires a memory copy), uses set-associative caches better, and enables such abilities as mapping a large contiguous region of address space to a small region in the cache (useful for memory-mapped devices).

4.2 Memory Exploration in Embedded Systems

Cache memory issues have been studied in the context of embedded systems. McFarling presents techniques of code placement in main memory to maximize instruction cache hit ratio. A model for partitioning an instruction cache among multiple processes has been presented.
Panda, Dutt, and Nicolau present techniques for partitioning on-chip memory into scratchpad memory and cache. The presented algorithm assumes a fixed amount of scratchpad memory and a fixed-size cache, identifies critical variables, and assigns them to scratchpad memory. The algorithm can be run repeatedly to find the optimum performance point.

5. CONCLUSIONS

The work described in this paper represents a confluence of two observations. The first observation is that, given heterogeneous applications with data streams that have significant variation in their locality properties, it is worthwhile to provide finer software control of the cache so the cache can be used more efficiently. To this end, we have proposed a column caching mechanism that enables cache partitioning so data with different locality characteristics can be isolated for improved performance. The second observation is that columns can emulate scratchpad memory, which is used extensively to improve predictability in embedded systems. A significant benefit of column caching is that through the execution of a program, the data stored in columns can be explicitly managed as in a scratchpad or can be implicitly managed as in a cache, and that the management can change dynamically and at very small intervals.

Acknowledgements: This paper describes research done at the Laboratory for Computer Science of the Massachusetts Institute of Technology. Funding for this work is provided in part by the Advanced Research Projects Agency of the Department of Defense under the Air Force Research Laboratory contract F[...].

6. REFERENCES

K. Asanovic. Vector Microprocessors. PhD thesis, University of California at Berkeley, May.

D. T. Chiou. Extending the Reach of Microprocessors: Column and Curious Caching. PhD thesis, Department of EECS, MIT, Cambridge, MA, Sept.

Cyrix. Cyrix XMX Processor, May.

G. Faanes. A CMOS Vector Processor with a Custom Streaming Cache. In Hot Chips, August.

Intel. Intel StrongARM SA Microprocessor, April.

Y. Li and W. Wolf. A Task-Level Hierarchical Memory Model for System Synthesis of Multiprocessors. In Proceedings of the Design Automation Conference, June.

B. Lynch and G. Lauterbach. UltraSPARC-III: A [...] MHz, [...]-bit Superscalar Processor for [...]-way Scalable Systems. In Hot Chips.

S. McFarling. Program Optimization for Instruction Caches. In Proceedings of the Int. Conference on Architectural Support for Programming Languages and Operating Systems, April.

Motorola. MPC Integrated Processor User Manual, July.

B. Nayfeh and Y. A. Khalidi. US patent: Apparatus and method to preserve data in a set associative memory device, Dec.

P. R. Panda, N. Dutt, and A. Nicolau. Memory Issues in Embedded Systems-on-Chip: Optimizations and Exploration. Kluwer Academic Publishers.

P. Paulin, C. Liem, T. C. May, and S. Sutarwala. DSP Design Tool Requirements for Embedded Systems: A Telecommunications Industrial Perspective. Journal of VLSI Signal Processing, January.

J. V. Praet, G. Goossens, D. Lanneer, and H. D. Man. Instruction Set Definition and Instruction Selection for ASIPs. In Proceedings of the IEEE/ACM International Symposium on High Level Synthesis, May.

F. Sanchez, A. Gonzalez, and M. Valero. Software Management of Selective and Dual Data Caches. In IEEE Computer Society Technical Committee on Computer Architecture: Special Issue on Distributed Shared Memory and Related Issues, Mar.

M. Tomasko, S. Hadjiyiannis, and W. Najjar. Experimental Evaluation of Array Caches. In IEEE Computer Society Technical Committee on Computer Architecture: Special Issue on Distributed Shared Memory and Related Issues, Mar.

H. Tomiyama and H. Yasuura. Code Placement Techniques for Cache Miss Rate Reduction. ACM Transactions on Design Automation of Electronic Systems, October.