Complex Systems 1 (1987) 939-965

Performance of VLSI Engines for Lattice Computations*

Steven D. Kugelmass
Richard Squier
Kenneth Steiglitz
Department of Computer Science, Princeton University,
Princeton, NJ 08544, USA
Abstract. We address the problem of designing and building efficient custom VLSI-based processors to do computations on large multi-dimensional lattices. The design tradeoffs for two architectures which provide practical engines for lattice updates are derived and analyzed. We find that I/O constitutes the principal bottleneck of processors designed for lattice computations, and we derive upper bounds on throughput for lattice updates based on Hong and Kung's graph-pebbling argument that models I/O. In particular, we show that R = O(BS^{1/d}), where R is the site update rate, B is the main memory bandwidth, S is the processor storage, and d is the dimension of the lattice.
1. Introduction
This paper deals with the problems of designing and building practical, custom VLSI-based computers for lattice calculations. These computational problems are characterized by being iterative, defined on a regular lattice of points, uniform in space and time, local, and relatively simple at each lattice point. Examples include numerical solution of differential equations, iterative image processing, and cellular automata. The recently studied lattice gas automata, which are microscopic models for fluid dynamics, are proposed as a test bed for the work.

The machines envisaged, which we call lattice engines, would typically consist of many instances of a custom chip and a general-purpose host machine for support. In many practical situations, the performance of such machines is limited,
"I'his work
was
support ed
in
part by NSF Grant
EC584 14674,
U.S.Army Resea rch
OfficeDurham
Contract
DAAG2985KOI9 1,
and
DARPA
Contract
N0 00 148 2 K ~ 05 4 9.
An ear lier version of
t his
paper app ears
in
[7].
© 1987 Complex Systems Publications, Inc.
not by the speed and size of the actual processing elements, but by the communication bandwidth on and off-chip and by the memory capacity of the chip.

A familiar example of lattice-based computational tasks is two-dimensional image processing. Many useful algorithms, such as linear filtering and median filtering, recompute values the same way everywhere on the image, and so are perfectly uniform; they are local in that the computation at a given point depends only on the immediate neighbors of the point in the two-dimensional image.
Another class of calculations, besides being uniform and local, has the additional important characteristic of using only a few bits to store the values at lattice points, and so is extremely simple. Further, the calculations operate on local data iteratively, which means that they are not as demanding of external data as many signal processing problems. These computational models, being uniform, local, simple, and iterative, are called cellular automata.

We will next describe a particular class of cellular automata, one that provides a good test bed for the general problems arising in the design of dedicated hardware for lattice-based computations.
2. A paradigm for lattice computations: the lattice gas model
Quite recently, there has been much attention given to a particularly promising kind of cellular automaton, the so-called lattice gases, because they can model fluid dynamics [14]. These are lattices governed by the following rules:

1. At each lattice site, each edge of the lattice incident to that site may have exactly zero or one particle traveling at unit speed away from that site, and, in some models, possibly a particle at rest at the lattice site.

2. There is a set of collision rules which determines, at each lattice site and at each time step, what the next particle configuration will be on its incident edges.

3. The collision rules satisfy certain physically plausible laws, especially particle-number (mass) conservation and momentum conservation.

These lattice gas models have an intrinsic exclusion principle, because no more than one particle can occupy a given directed lattice edge at any given time. It is therefore surprising that they can model fluid mechanics. In fact, in a two-dimensional hexagonally connected lattice, it has been shown that the Navier-Stokes equation is satisfied in the limit of large lattice size. This model is called the FHP model, after Frisch, Hasslacher, and Pomeau [3]. The older HPP model [4], which uses an orthogonal lattice, does not lead to isotropic solutions.
The idea of using hexagonal lattice gas models to predict features of fluid flow seems to be about two years old, and it is certainly premature to ask whether the general approach of simulating a lattice gas can ever be competitive with more familiar numerical solution of the Navier-Stokes equation.
Extensions to three-dimensional gases are just now being formulated [1], and quantitative experimental verification of the two-dimensional results is fragmentary. The Reynolds number achievable depends on the size of the lattices used, and very large Reynolds numbers will require huge lattices and correspondingly huge computation rates. For a discussion of the scaling of the lattice computations with Reynolds number, see [10].

What is clear is that the ultimate practicality of the approach will depend on the technology of special-purpose hardware implementations for the models involved. Furthermore, the uniformity, locality, and simplicity of the model mean that this is an ideal test bed for dedicated hardware that is based on custom chips.
We will therefore use the lattice gas problem as a running example in what follows. We especially want to study the interaction between the design of custom VLSI chips and the design of the overall system architecture for this class of problems.

We will present and compare two competing architectures for lattice gas cellular automata (LGCA) computations that are each based on VLSI custom processors. The analysis will focus on the permissible design space given the usual chip constraints of area and pin-out, and on the achievable performance within the design space. Following this, we will present some theoretical upper bounds for the computation rate over a lattice, based on a graph-pebbling argument.
3. Serial pipelined architectures for lattice processing
We are primarily interested in special-purpose, VLSI-based processor architectures that have more than one PE (processing element) per custom chip. It is important to recognize that if the PEs are not kept busy, then it might be more effective (in terms of overall throughput) to have fewer PEs per chip but to use them more efficiently. Although there are many architectures that have the property of using PEs efficiently, we will describe only two, both based on the idea of serial pipelining (see figure 1). This approach has the benefit that the bandwidth to the processor system is small even though the number of active PEs is large. This serial technique has been used for image processing, where the size of the two-dimensional grid is small and fixed [6,13,17], and has also been used to design a high-performance custom processor for a one-dimensional cellular automaton [16].
Consider what is required to pipeline a computation. We must guarantee that the appropriate site values of the correct ages are presented to the computing elements. In the case of the LGCA, we can express this data dependency as

v(a, t + 1) = f(N(a), t)

where v(a, t) is the value at lattice site a at time t, N(a) is the neighborhood of the lattice site a, and f is the function that determines the new value of a based on its neighborhood. The LGCA requires all the points in the neighborhood of a to be the same age in order to compute the new value,
Figure 1: One-dimensional pipeline.

Figure 2: Hexagonal neighborhood. The circled site is a; the sites with Xs constitute its neighborhood.
v(a, t + 1). The LGCA has a neighborhood that looks like the example given in figure 2.
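For readers who want to experiment, a hexagonal neighborhood like the one in figure 2 can be represented on an ordinary square array by shifting alternate rows half a cell. This offset encoding is our own illustration, not a layout taken from the paper.

```python
# Hexagonal neighbors on a square grid using "offset rows": odd rows are
# imagined shifted half a cell to the right, so the six neighbors of a site
# depend on its row parity.
EVEN = [(0, -1), (0, 1), (-1, -1), (-1, 0), (1, -1), (1, 0)]
ODD  = [(0, -1), (0, 1), (-1, 0), (-1, 1), (1, 0), (1, 1)]

def hex_neighbors(i, j):
    """Return the six neighbors N(a) of site a = (i, j)."""
    offs = EVEN if i % 2 == 0 else ODD
    return [(i + di, j + dj) for di, dj in offs]

assert len(hex_neighbors(2, 3)) == 6
# Hexagonal adjacency is symmetric: b is a neighbor of a iff a is one of b.
a = (2, 3)
assert all(a in hex_neighbors(*b) for b in hex_neighbors(*a))
```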
One-dimensional pipelining also requires a linear ordering of the sites in the array. That is, we wish to send the values associated with the sites one at a time into the one-dimensional pipeline and receive the sequence of sites in the same order, possibly some generations later. Therefore, we would like sites that are close together in the lattice to be close together in the stream. In this way, the serial PE requires a small local memory, because neighborhoods (sites that are close together in the array) will also be close together in the stream. Unfortunately, the lattice gas automaton can require a large amount of local memory per PE because there is no sublinear embedding of an array into a list [12].
The natural row-major embedding of the array into a list preserves 2-neighborhoods¹ with diameter 2n - 2. This means that a full neighborhood of a site from an n × n lattice is distributed in the list so that some elements of the neighborhood are at least 2n - 2 positions apart. This embedding is undesirable for two reasons. First, the amount of local memory required by a PE is a function of the problem instance, forcing us to decide in advance the size of one dimension of the lattice (one can actually process a prism array, finite in all but one dimension) because the chip will only work for a single problem size due to its fixed span. The second deficiency is due to the size of the span. If n = 1000, then each PE would require about 2000 sites' worth of memory. This puts a severe restriction on the number of PEs that can be placed on a chip.
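The effect of row-major ordering on neighborhood locality can be checked directly. The sketch below uses axial-style hexagonal offsets, which are our own convention rather than necessarily the paper's, to show that an interior site's neighbors span at least 2n - 2 positions of the stream.

```python
# Stream distance between members of a hexagonal neighborhood under the
# row-major embedding of an n x n lattice, illustrating why a serial
# pipeline stage must buffer about two lattice lines.
HEX = [(0, -1), (0, 1), (-1, 0), (1, 0), (-1, 1), (1, -1)]

def rowmajor_pos(i, j, n):
    return i * n + j

def neighborhood_span(i, j, n):
    """Max stream distance between any two neighbors of interior site (i, j)."""
    pos = [rowmajor_pos(i + di, j + dj, n) for di, dj in HEX]
    return max(pos) - min(pos)

n = 1000
# Some neighbors of an interior site are at least 2n - 2 stream positions
# apart, so the stage needs roughly two lines (about 2000 sites) of storage.
assert neighborhood_span(500, 500, n) >= 2 * n - 2
```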
Unfortunately, the 2n - 2 embedding is optimal. Rosenberg showed this bound holds for prism array realizations, but it has been unknown whether it is possible to do better for finite array realizations. Rosenberg's best lower bound for the finite array case has never been achieved, and he suspected that the row-major scheme was optimal. Sternberg [18] also questioned whether or not the storage requirement for a serial pipelined machine could be reduced. Supowit and Young [19] showed that the row-major embedding is optimal, and therefore a serial pipeline must use at least 2n - 2 storage.
Theorem 1. Place the numbers 1, ..., n² in a square array a(i,j), and define the span of the array to be

max{|a(i+1, j) - a(i, j)|, |a(i, j+1) - a(i, j)|}

taken over all i and j. Then span ≥ n.
Proof. Put the numbers in the array one at a time in order, starting with 1. When for the first time there is either a number in every row or a number in every column, stop. Without loss of generality, assume this happens with a number in every row.

We claim that there cannot be a full row. Suppose the contrary. The last number entered was placed in an empty row, so there must have been a full row before we stopped. This would mean there was a number in every column before there was a number in every row.

Since there is no full row, but a number in every row, there is at least one vacant place in every row that is adjacent to an occupied spot. Choose one such vacant place in each row, and call them set F (with |F| = n). Now, if we stopped after placing the number t, the places in F will get filled with numbers greater than t. The largest number that will be put in a location in F is ≥ t + n, and will be adjacent to a number ≤ t. Hence span ≥ n. ∎
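For small n, Theorem 1 can also be verified exhaustively. The sketch below checks that for n = 3 every placement of 1..9 has span at least 3, and that the row-major placement achieves exactly n.

```python
from itertools import permutations

def span(a, n):
    """Span of an n x n array a: max difference between array-adjacent entries."""
    s = 0
    for i in range(n):
        for j in range(n):
            if i + 1 < n:
                s = max(s, abs(a[i + 1][j] - a[i][j]))
            if j + 1 < n:
                s = max(s, abs(a[i][j + 1] - a[i][j]))
    return s

n = 3
# Minimum span over all 9! placements of 1..9 in a 3 x 3 array.
best = min(
    span([list(p[i * n:(i + 1) * n]) for i in range(n)], n)
    for p in permutations(range(1, n * n + 1))
)
row_major = [[i * n + j + 1 for j in range(n)] for i in range(n)]
assert best == n              # Theorem 1: span >= n, and n is achievable
assert span(row_major, n) == n
```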
The critical system parameters for the one-dimensional pipeline architecture, system area and total system throughput, can be varied over a range of values. The actual selection of the operating point on the throughput-area curve depends on several factors: for example, the problem instance size and total system cost.

¹Sites that are two edge traversals apart in the lattice.
The appealing aspects of the serial architecture are the simplicity of its design, its small area in comparison to other architectures, and the small input/output bandwidth requirement. The computation proceeds on a wavefront [8] through time and space, each succeeding PE using the data from the previous PE without the need for further external data.
4. Wide-serial architecture (WSA)
Throughput in a serial architecture can be improved by adding concurrency at each level of the pipeline. One way to accomplish this is to have each pipeline stage compute the new value of more than one site each clock period. For example, if the computation at PE i is at the point where site a, circled, is to be updated, then PE i contains the data indicated by strikeout in the following:
[Diagram: a band of lattice sites held by PE i, with the circled site a about to be updated and the buffered sites struck out.]
We could allow a second PE to compute site a + 1 at the same time if we store just one more data point:
[Diagram: the same band of sites, now with two circled sites, a and a + 1, updated simultaneously.]
The most attractive feature of this scheme is that performance is increased, but at a cost of only the incremental amount of memory needed to store the extra sites. The on-chip memory per PE is also improved dramatically; it decreases linearly with the number of PEs per chip. However, there is a price to pay: two new site values are required every clock period so that two site updates can be performed. The extra PEs require added bandwidth to and from the chip, and this implies that the main memory system must provide that bandwidth as pins or wires.
Figure 3: Wide-serial architecture.
As an example, figure 3 shows how two PEs on the same chip can cooperate on a computation. Each square of the shift register holds the value of one site in the lattice. Every clock period, two new site values are input to the chip, two sites are updated, and their values are output to the next chip in the pipeline.
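The mechanism can be sketched in software. The stage below accepts two new site values per tick and emits two updated values while buffering roughly two lattice lines; note that the update rule here is a stand-in (parity of three buffered values), not the FHP collision rule, and the class and its interface are our own invention for illustration.

```python
from collections import deque

class WideSerialStage:
    """One wide-serial pipeline stage with P PEs sharing a shift register."""
    def __init__(self, line_width, p=2):
        self.p = p
        # Enough storage for about two lattice lines plus the extra sites
        # needed to let p PEs update p adjacent sites per tick.
        size = 2 * line_width + p
        self.buf = deque([0] * size, maxlen=size)

    def tick(self, new_sites):
        """Accept p new site values; return p updated site values."""
        assert len(new_sites) == self.p
        out = []
        for v in new_sites:
            self.buf.append(v)          # oldest buffered value falls off
            w = len(self.buf)
            # Stand-in update: combine oldest, middle, and newest entries,
            # which occupy the relative positions of a site's neighbors.
            out.append((self.buf[0] + self.buf[w // 2] + v) % 2)
        return out

stage = WideSerialStage(line_width=8, p=2)
stream = [(1, 0), (0, 1), (1, 1), (0, 0)] * 2   # pairs of incoming site bits
outputs = [stage.tick(list(pair)) for pair in stream]
assert all(len(o) == 2 for o in outputs)         # two updates per clock tick
```

The point of the sketch is structural: doubling the number of PEs doubles the input, update, and output rate of the stage while adding only a constant amount of buffer storage.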
5. Sternberg partitioned architecture (SPA)
In reference [18], Sternberg proposes that a large array computation can be divided among several serial processors, each of which operates as described earlier. The array is divided into adjacent, nonoverlapping columnar slices, and a fully serial processor is assigned to each slice (see figure 4).

The processors are not exactly the same as those described above; they are augmented to provide a bidirectional synchronous communication channel between adjacent partitions, so that sites whose neighborhoods do not lie entirely in the storage of a single PE can be computed correctly and in step with other site updates. See reference [18] for details.
Dividing the work in this way accomplishes three things. First, it decreases the amount of storage that each PE needs in order to delay site values for correct operation of the pipeline. This comes about because each PE needs to delay only two lines of its slice, not the whole line width. Second, it increases the ratio of processing elements to the total number of sites, permitting an increase in the maximum throughput by a multiplicative constant equal to the number of slices. Third, it provides a degree of modularity and extensibility. It is possible to join two smaller machines along an edge to form a machine that handles a larger problem.

In the case of a VLSI implementation, decreasing the size of the local storage is extremely important because most of the silicon area in the implementation of a serial processor is shift register. Since each PE in the SPA architecture requires fewer shift register storage cells, it is possible to place several PEs on a chip, whereas if each serial PE were required to store two lines of the whole lattice, then only one or two PEs could be placed on a
Figure 4: Sternberg partitioned architecture.
VLSI chip with current technology. The only way around this limitation is to use another technology to implement the required storage, such as off-chip commercial memories, in which case we quickly encounter pin limitations.

It is important to recognize that the total amount of storage required under this organization is two lines of the whole lattice per pipeline stage. Thus, the total storage requirement under this implementation is not reduced below that of the fully serial approach presented earlier. We should also not forget that each column of serial processors requires its own data path to and from main memory. This data path is a relatively expensive commodity. In fact, as we will see in the upcoming analysis, the data path is the most expensive commodity in a VLSI implementation of this architecture.

The analysis will demonstrate an underlying principle of VLSI implementations of architectures for multidimensional spatial updates, namely that I/O pins are the critical resource of a VLSI chip.
6. Analysis and comparison of WSA and SPA
In this section, we analyze and compare the Sternberg partitioned architecture (SPA) with the wide-serial architecture (WSA) that we proposed in section 4. The analysis derives the optimum throughput and area of processing systems composed of VLSI chips for the two-dimensional FHP lattice gas problem. We define the design parameters for each system and derive the design curves and optimum values of those parameters. For the analysis, we assume that a memory system capable of providing full bandwidth to the processor system is available.² Finally, we compare the systems on the basis of maximum throughput, total system area, and throughput-to-area ratio. We also discuss the relative advantages and disadvantages of both architectures, with an emphasis on system complexity and ease of implementation.
6.1 Wide-serial architecture (WSA)

The WSA has system parameters (assuming one pipeline stage per chip, P processing elements wide)

N = k chips                      (System Area)
R = F · P · k sites/second       (System Throughput)

and chip constraints

2D · P ≤ Π                       (Chip Pins)
β(2L + 7P + 3) + γP ≤ α          (Chip Area)

where

N is the total number of chips constituting the processor,
P is the number of PEs per VLSI chip,
k is the total depth in PEs of the processor pipeline (path length),
F is the major cycle rate (clock frequency) of the chip,
D is the number of bits required to represent the state of a lattice site,
L is the number of sites along an edge of the square lattice,
Π is the total number of pins usable for input/output,
β is the area of a shift register that holds a site value, in λ²,
γ is the area of a PE, in λ²,
α is the total usable chip area, in λ².

For convenience, we also define the normalized site storage area B = β/α and the normalized processor area Γ = γ/α.
Less formally, this says that the number of chips that we need for the processor equals the total pipeline depth required, k. The processing rate that this system achieves is equal to the depth of the pipeline, multiplied by the number of processors at each depth, multiplied by the rate at which a processor computes new sites. We are assuming that each VLSI chip will
²This is a very important assumption.
contain only a single wide parallel pipeline stage. That is, the chip is not internally pipelined with wide-serial processors.
We wish to maximize R subject to having a fixed number of chips, N = N_0, and subject to constraints on the pin count and area of the VLSI custom chip. Notice that the problem is equivalent to maximizing P subject to the chip constraints, because R = F · P · k = F · P · N, where F and N are fixed (N is fixed at N_0).

The constraints are described in the L-P plane by the following two inequalities:

P ≤ Π / 2D
P ≤ (1 - 3B - 2BL) / (7B + Γ)
If we consider an example where D = 8, Π = 72, B = 576 × 10⁻⁶, and Γ = 19.4 × 10⁻³ (figures derived from our actual layouts), we get the following graph:

[Graph: the two constraint curves, with P (PEs/chip) from 0 to 40 plotted against L (sites) from 0 to 1000.]
The chip constraints require that the operating point determined by P and L lie below both curves. The intersection of the two curves is P ≈ 4 and L ≈ 785. Beyond that point, we need to decrease the number of processors on a chip to make room for more memory, an undesirable situation because throughput then drops off linearly. Furthermore, we want L to be as big as possible, so the corner is the logical choice of operating point.
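The corner of the design space can be reproduced numerically from the two constraints, using the constants quoted above (a sketch, not a design tool):

```python
# WSA design point: D = 8 bits/site, Pi = 72 pins, B = 576e-6, Gamma = 19.4e-3.
D, Pi, B, Gamma = 8, 72, 576e-6, 19.4e-3

# Pin constraint: P <= Pi / (2D); an integral number of PEs fits on the chip.
P = int(Pi // (2 * D))          # 4 PEs per chip
# Area constraint P <= (1 - 3B - 2BL) / (7B + Gamma), solved for L with
# both constraints tight at the corner.
L = (1 - 3 * B - P * (7 * B + Gamma)) / (2 * B)

assert P == 4
assert 780 < L < 790            # the corner: P ~ 4, L ~ 785
```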
We are also interested in the ultimate maximum performance that the architecture can deliver using any number of chips. It is easy to see that the maximum throughput for a fixed clock frequency, F, comes when the pipeline depth, k, is at a maximum. A maximum value, k_max = L, arises because at that point the pipeline contains all the values of the sites in the lattice and there is no new data to introduce into the processor pipeline. The maximum values for processor system area and processor system throughput are therefore:

N_max = L chips
R_max = (Π / 2D) · F · L sites/second
It is also interesting to note that there is an upper bound on L even if we were to accept arbitrarily slow computation. At a certain point, all the chip area would be used for memory, leaving no room for PEs.

The major limitation of this architecture is that the largest problem instance is fixed by the chip technology, but it has the redeeming features of simplicity, ease of implementation, and small main memory bandwidth.
6.2 Sternberg partitioned architecture (SPA)
This processor computes updates for a lattice L sites on a side by partitioning the lattice into nonoverlapping slices that are each W sites wide (there are L/W such slices). Each of the VLSI chips that compose the processor computes P_w slices, and the computation of each slice is pipelined on the chip to a depth P_k (see figure 4). It is then easy to see that the system has area and throughput:

N = Lk / (W · P_w · P_k) chips       (System Area)
R = F · k · L / W sites/second       (System Throughput)
To derive the constraints on the VLSI chip, notice that the communication path between chips in the direction of the data pipeline requires 2DP_w pins, and that the "slice-to-slice" path requires 2EP_k, where E is the number of bits required to complete the information contained in a single site's neighborhood when that neighborhood is split across a slice boundary. Moreover, the chip must use no more than α area, of which each processor requires γ and the memory to hold a site value requires β. Thus, the whole chip is governed by the constraints

2DP_w + 2EP_k ≤ Π                    (Chip Pins)
((2W + 9)β + γ) P_w P_k ≤ α          (Chip Area)
(Chip Area)
We agai n wish to maximize t hroughput wit h respect to a fixed number of
chips,
N =
No,
while at t he same t ime satisfying the VLSI chip constraint s of
area and bandwidt h.This agai n t urns out to be equivalent to maximizing t he
total number of processors on t he chip because we can easily verify by direct
subst it ution that
R
=
p.
k:
{(;
=
PU/P
k
·
F· No.
Since
F
and No are fixed,it
suffices to maximize t he product
P
wPk
=
P
subject to the const raint s above.
To evaluate the design space of SPA,it is helpful to view it in the
W

P
plane.We do t his via a change of variables:
Rewriting the chip inequalities yields

2DP_w + 2E(P / P_w) ≤ Π
((2W + 9)B + Γ) P ≤ 1
where P_w, P, and W are variables. This is the logical choice of variables for this architecture because they are the ones that are constrained by the chip technology and govern the optimal design of the chip. Once we know good values for them, a machine which can compute for an arbitrary lattice width L can be built by increasing the number of slices of width W.
When these curves are projected onto the W-P plane using the values for D, Π, and B from the previous example, and setting E to 3 (three bits must be passed to complete a neighborhood), we have:

[Graph: the projected constraint curves, with P (PEs/chip) from 0 to 40 plotted against W (sites) from 0 to 1000.]
The constant curve is a projection of the first constraint, where P_w is given the value which permits P to achieve its maximum value. For this example, this occurs at P_w = 9/4. As before, we need to operate below both curves, and the corner at P ≈ 13.5 and W ≈ 43 yields the best choice. Beyond this point, throughput drops off quite rapidly as the silicon real estate is used by memory.
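The SPA corner can likewise be reproduced from the two constraints, again using the constants quoted in the text (a sketch of the calculation):

```python
# SPA design point with the same constants and E = 3 boundary bits.
D, Pi, B, Gamma, E = 8, 72, 576e-6, 19.4e-3, 3

# Pin constraint: 2*D*Pw + 2*E*(P/Pw) <= Pi. For fixed P it is loosest when
# Pw maximizes Pw*(Pi - 2*D*Pw), i.e. at Pw = Pi/(4*D).
Pw = Pi / (4 * D)                          # = 9/4
P_max = Pw * (Pi - 2 * D * Pw) / (2 * E)   # = 13.5 PEs per chip
# Area constraint ((2W + 9)B + Gamma) * P <= 1, solved for W at the corner.
W = ((1 / P_max - Gamma) / B - 9) / 2

assert abs(Pw - 2.25) < 1e-9
assert abs(P_max - 13.5) < 1e-9
assert 42 < W < 44                         # W ~ 43
```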
6.3 Discussion
The above analysis gives us two different viewpoints from which to make a comparison between these two architectures. Ignoring extensibility, we can compare the two designs when they are optimized for throughput, as they were in the preceding. Taking a more general point of view, we can make the comparison by using a slight variant of WSA which allows for extensibility by sacrificing processing speed.

First, let us compare the designs optimized for throughput without regard to extensibility. The optimal WSA configuration limits the lattice length to L = 785. Both WSA and SPA systems have throughput rates which grow linearly with the number of chips. However, SPA is three times faster than WSA. (SPA has twelve processors per chip while WSA has four.) On the other hand, the SPA system requires four times as much main memory bandwidth as the WSA system: 256 bits/tick versus 64 bits/tick.
The above argument contains a bias in favor of SPA. System timing is an important consideration which can make it difficult to clock SPA as fast as WSA. The WSA architecture has connectivity only in one dimension, whereas the SPA system requires communication in both the pipeline direction and the synchronous side-to-side data paths. This added complexity is a more pronounced drawback for SPA when extensibility is considered, as we will mention below. The conclusion in both cases favors the WSA system when it comes to considering an implementation. There is also the matter of the data access pattern in the memory. The WSA machine accesses the data in a strict raster scan pattern, which is simpler than the row-staggered pattern that the SPA scheme requires for its operation.

The SPA architecture has one considerable advantage over the WSA scheme: extensibility. Smaller instances of an SPA machine can be joined together to form a machine that computes a larger lattice. This is not true for the WSA case, where computation is limited to lattice sizes which do not exceed L as given by the chip area constraint, because all the required data must fit on the chip. This requirement is relaxed in the SPA scheme because data can be moved between adjacent chips as W is adjusted to the chip constraints, and an arbitrary lattice width L can be supported by composing a suitable number of slices. In this respect, the two schemes seem incomparable.
Our second point of view on the comparison of these two architectures is facilitated by considering a slight variation of WSA which allows extension of the lattice size. The extension can be accomplished by moving a portion of the shift register off chip. The pin constraints given previously, with the same constants, allow only one processor per chip in this case. A stage in the pipeline consists of a processor chip and associated shift registers sufficient to hold the remainder of the 2L + 10 node values which do not fit onto the processor chip. We will call this version of WSA WSA-E.

Both SPA and WSA-E systems have throughput rates that grow linearly with the number of chips in the system for a fixed lattice size L. However, the constant of proportionality between the two rates grows with increasing lattice size. The reason is that the number of processors per unit chip area is independent of lattice size for SPA, whereas it decreases with increasing lattice size for WSA-E. So, for instance, given the same number of chips and a lattice size L ≈ 785, the SPA system is twelve times faster than WSA-E because it has twelve processors per chip as opposed to one per chip.

A better understanding of the contrasts between the two systems can be obtained by looking at requirements for main memory bandwidth and storage area per processor. WSA-E has a constant bandwidth requirement of 16 bits per clock tick and requires (2L + 10)B storage area per processor; SPA has a
main memory bandwidth requirement that grows linearly with the lattice size, and a storage area per processor that is independent of L. For a fixed processing rate, the penalty for larger lattice size is either linear growth in the number of chips for the WSA-E system, or linear growth in the main memory bandwidth in the SPA case. For example, if L = 1000, then WSA-E requires about twice as much area as SPA, while requiring about one twentieth as much bandwidth.
6.4 Summary
We have analyzed the critical parameters of two system architectures for high-performance computation on a cellular automaton lattice. We see that the WSA architecture offers good throughput at a modest system area and complexity, while the SPA architecture offers higher performance, at the price of increased complexity and memory bandwidth.

The preceding analysis suggests that the ultimate limit to the performance of these architectures, and any alternatives, will stem from chip pin bandwidth and storage requirements, not from processing requirements. For example, a chip in 3μ CMOS has been fabricated and tested for the wide-serial architecture in which about 4 percent of the area is used for processing. Any more processing on the chip would simply go unused because of storage and bandwidth constraints. We can expect this fraction to shrink as the lattice gets wider, and as we increase the dimensionality of the problems. This fact has recently become clear in the literature on systolic arrays, and in [5], Hong and Kung present a model and a bounding technique for quantifying this notion. In the next section, we will apply their results to the class of lattice computations.
7. Pebbling bounds
WSA and SPA are only two of many possible computation schemes for computing the evolution of a lattice gas cellular automaton (LGCA). Once a scheme has been selected from among the possibilities (for example, single-stream pipeline, wide pipeline, column parallel), the processors and local memory must be mapped to chips while maintaining pin, area, processing rate, and I/O bandwidth constraints. These constraints can be thought of as divided into hierarchical classes by scale: main memory bandwidth, total processor memory, and overall computation rate at large scale; processing element area and speed at small scale; and interchip communication and pin constraints somewhere in between. The question arises as to which scheme makes the best use of the resources given the multiscale constraints. To answer this partially, we would like to answer the general question, "What is the best that can be done, considering only the large scale constraints?" By "best" we mean "fastest overall computation rate." We want to ignore the particular method of progressing through the computation graph for a given LGCA and concentrate on the limits implied solely by the large scale constraints. We will use a pebble game to count the input/output requirements
of an LGCA computation.
Variants of the pebble game have been used as a tool to get space-time tradeoffs for computational problems. See, for instance, the papers by Pippenger [11] and Lengauer and Tarjan [9]. The red-blue pebble game described by Hong and Kung [5] models the computation and I/O steps in a sequential computation. They used it to get space-input/output tradeoffs for several problems, and to get upper bounds on speedup of a computation of these problems using a sequential machine. The red-blue game they describe was extended by Savage and Vitter [15] to the parallel-red and block-red-blue pebble games, which model parallel computation without input/output and block parallel input/output respectively. We will use a further variant of the red-blue game which allows parallel computation and parallel input/output of any size up to the processor's local memory capacity. We will use Hong and Kung's methods for the analysis of the red-blue game to derive from our variant a tradeoff among the minimum main memory bandwidth, the maximum overall computation rate, and the local processor memory size.

The red-blue pebble game is played on directed acyclic graphs with bounded indegree according to the following rules:
1. A pebble may be removed from a vertex at any time.

2. A red pebble may be placed on any vertex that has a blue pebble.

3. A blue pebble may be placed on any vertex that has a red pebble.

4. If all immediate predecessors of a vertex v are red pebbled, v may be red pebbled.
The "inputs" are those vertices which have no predecessors, and the "outputs" are those which have no successors. A vertex that is blue-pebbled represents the associated value's presence in main memory. A red-pebbled vertex represents presence in processor (chip) memory. Rules (2) and (3) represent I/O, and rule (4) represents the computation of a new value. The goal of the game is to blue-pebble the outputs given a starting configuration in which the inputs are blue-pebbled and the rest of the vertices are free of pebbles. We will delay the introduction of an extension of this game until we have established some further groundwork.
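As a concrete illustration of rules (1) through (4), here is a minimal Python sketch (ours, not part of the original paper; the class and vertex names are hypothetical) that enforces move legality and counts I/O moves on a tiny two-layer DAG.

```python
# Hypothetical illustration of the Hong-Kung red-blue pebble game, rules (1)-(4).

class RedBluePebbleGame:
    def __init__(self, preds):
        # preds: dict mapping each vertex to the list of its immediate predecessors
        self.preds = preds
        self.inputs = {v for v, p in preds.items() if not p}
        self.blue = set(self.inputs)   # inputs start blue-pebbled
        self.red = set()
        self.io_moves = 0

    def remove_red(self, v):           # rule (1)
        self.red.discard(v)

    def read(self, v):                 # rule (2): red pebble onto a blue-pebbled vertex
        assert v in self.blue
        self.red.add(v)
        self.io_moves += 1

    def write(self, v):                # rule (3): blue pebble onto a red-pebbled vertex
        assert v in self.red
        self.blue.add(v)
        self.io_moves += 1

    def compute(self, v):              # rule (4): all predecessors must be red-pebbled
        assert all(u in self.red for u in self.preds[v])
        self.red.add(v)

# A two-layer graph: c depends on a and b; the goal is to blue-pebble output c.
g = RedBluePebbleGame({"a": [], "b": [], "c": ["a", "b"]})
g.read("a"); g.read("b")   # two I/O moves bring the inputs into processor memory
g.compute("c")             # rule (4)
g.write("c")               # the output goes back to main memory
print(g.io_moves)          # 3 I/O moves in total
```

The rule checks are assertions, so an illegal move, such as computing c before both of its predecessors hold red pebbles, fails immediately.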
The computation graph for an LGCA is derived in the usual manner for a data dependency graph. An LGCA, 𝒢 = G(v), is defined by a lattice graph G = (V, E) contained in some d-dimensional finite volume, a value v(x, t) associated with each node x in the lattice, and a function giving v(x, t+1) = f(N(x), t), where N(x) is the "neighborhood" of x in G; that is, N(x) = {z | {x, z} is an edge in G} ∪ {x}. The value of a node at time t+1 depends on the values of its neighboring nodes at time t. For an LGCA that models real fluids, the lattice G must be isotropic with respect to conservation of momentum and energy. This
means G must be regular. We will make use of this regularity in the proof for the bound on the computation rate, although we will not require the satisfaction of the isotropy condition. We form the computation graph of the LGCA by identifying the vertices in each layer of the computation graph with the vertices in the lattice G. Each layer of the computation graph consists of a copy of G's vertex set with arcs to the next layer expressing the data dependency between the values associated with the vertices of the lattice at time t and those at time t+1. That is, if V = {1, 2, 3, ..., L} is the set of vertices in G, then the computation graph for G is C = (X, A), where X = {(x, t) | x ∈ V, and 0 ≤ t ≤ T}, and there is an arc from (u, t-1) to (v, t) in C if and only if u is in N(v). C is a layered graph of T+1 layers, each layer representing the LGCA at evolution time t = 0, 1, 2, ..., T (see figures 5 and 6). We are usually interested in seeing an image of the LGCA at periodic time steps in its evolution, say every k time steps, and we let T go to infinity. However, it is easy to see from the proofs that follow that forcing T = k will not alter the results. We will apply a variant of the red-blue pebble game to the computation graph C.
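The layered structure of C can be made concrete with a short sketch (ours, not from the paper; a one-dimensional path lattice with 0-indexed sites is assumed purely for illustration, and the function names are hypothetical):

```python
# Hypothetical sketch: the layered computation graph C = (X, A) for a
# one-dimensional lattice G with L sites and T time steps, as defined above.

def neighborhood(x, L):
    """N(x) = {z | {x,z} is an edge in G} ∪ {x} for a path graph on L sites."""
    return {z for z in (x - 1, x, x + 1) if 0 <= z < L}

def computation_graph(L, T):
    # one copy of G's vertex set per layer t = 0, 1, ..., T
    X = [(x, t) for t in range(T + 1) for x in range(L)]
    # an arc from (u, t-1) to (v, t) if and only if u is in N(v)
    A = [((u, t - 1), (v, t))
         for t in range(1, T + 1)
         for v in range(L)
         for u in neighborhood(v, L)]
    return X, A

X, A = computation_graph(L=4, T=2)
print(len(X))   # (T+1) * L = 12 vertices
# every arc goes from layer t-1 to layer t, so C is layered
assert all(b[1] - a[1] == 1 for a, b in A)
```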
Let us introduce some terms we will need and review the results of Hong and Kung. Results proved in [5] will be so indicated. A computation of an LGCA is said to be a complete computation if it begins with only the input values v(x, 0) known and at the end of the computation the values v(x, T) have been computed for all x in the lattice G of the LGCA. Thus, a pebbling P of the computation graph represents a complete computation of the LGCA. Given any complete computation of LGCA 𝒢 (a pebbling P of the associated computation graph C_𝒢), we assume the following, where memory and I/O are measured in units of storage required to store a single site value v(x, t) of the LGCA.
S = the number of red pebbles, i.e., the amount of processor memory. (We assume an inexhaustible supply of blue pebbles.)

q = the number of I/O moves required by P.

Q = the minimum number of I/O moves required to pebble C, over all pebblings using S or fewer red pebbles.
Definition: P′ is an S-I/O-division of P if P′ = {P_i | 1 ≤ i ≤ h}, where P_i is a consecutive subsequence of P such that P_i contains exactly q_i I/O moves, and where
Figure 5: A one-dimensional lattice of a cellular automaton 𝒢 = (G, v). Vertices 1 and r are boundary vertices of G. The neighborhood of vertex 2 is N(2) = {1, 2, 3}.
q_i = S for all i except that 0 < q_h ≤ S. We say the size of P′ is h.
Clearly, a lower bound on the I/O required by a complete computation of 𝒢 is determined by h̄ = min{h} over all pebblings of C using S or fewer red pebbles. That is, Q ≥ S · (h̄ - 1).
Hong and Kung have developed some methods for deriving a lower bound for h̄. The concepts depend explicitly on the definition of an S-I/O-division, which depends implicitly on the fact that the pebbling is linearly ordered. This is trivially true for the red-blue game because it is a strictly sequential game: a single rule from rules (1) through (4) is applied, and the resulting configuration determines the applicable rules for the next move. An immediate extension of the red-blue game simply considers a block of such moves as occurring in a single "time step". This allows a certain form of parallelism and is the extension used by Savage and Vitter [15] in the block-red-blue game. The actual play of the game is not altered; rather, the counting of moves is redefined.
It is easy to find a simple example of a graph for which the number of input/output steps can be reduced by allowing the red-pebbling moves to occur in truly parallel fashion. That is, any number of pebbles may be moved simultaneously, provided the configuration before the move satisfies the conditions of any rule employed in the move. With this in mind, we define the parallel red-blue pebble game and show that it models any computation which can be performed by a computer with arbitrary parallel capabilities (CRCW PRAM).³ The results of the analysis of this game will be applied to a machine model which has the same features as a CRCW PRAM, but has a limited communication bandwidth.
Consider a computation which proceeds by doing many steps in parallel in real time, and let us consider the necessary features of a pebble game that models it. The end result is a pebble game that can be described by a linearly ordered set of pebble moves, which will allow us to define an S-I/O-division for this game. In the following, we will use the following terminology: placing a red pebble on a node that contains no pebbles is a calculation. The node
³Such a machine model consists of an arbitrary number of processors communicating via a shared memory. This model is often referred to as a CRCW PRAM: Concurrent-Read Concurrent-Write Parallel Random Access Machine [2].
Figure 6: A computation graph C_𝒢(T) where 0 ≤ t ≤ T. The t-th row corresponds to 𝒢(t).
pebbled is called the dependent node, and the nodes with arcs ending at the dependent node are said to support the calculation by virtue of the fact that if they did not contain red pebbles, the calculation would not be possible.
We first decompose the computation into pieces which occur simultaneously. Let these pieces be designated C_i, and we say the complete computation C consists of their concatenation: C = C_1 ∘ C_2 ∘ ... ∘ C_N. Now let us consider the pebble moves within C_i. Consider a datum that is fetched from main memory by C_i. It is reasonable to assume that this datum could not simultaneously be used in a calculation of some dependent datum. We then require that a pebble move that places a red pebble on a node which only contains a blue pebble precedes any pebble move that uses the node as a supporting node for a calculation. We satisfy this ordering requirement by ordering all the pebble moves of this type (which model main memory reads occurring in C_i) after any other moves in C_i.
Consider the calculation of a datum during C_i. We assume the result datum must be written to a register in the processor memory. Therefore, we do not consider it possible in our model of computation to allow a main memory write of a datum to occur simultaneously with a calculation of the same datum. We can enforce this requirement in the pebble game by ordering
all main memory writes in C_i before all calculations in C_i. That is, a node must contain a red pebble before a blue pebble may be placed on it, and that red pebble must have been placed in a previous C_j (j < i).
At this point, we can say that the pebble game must proceed parallel move by parallel move in order, and that within each parallel move C_i the ordering is: place blue pebbles (write-to-main-memory phase), move red pebbles to unpebbled nodes (calculation phase), place red pebbles on nodes containing blue pebbles (read-from-main-memory phase). It now remains for us to find an ordering within these phases of C_i.
Consider the pebble moves in the two I/O phases. In real time, we assume they all happen simultaneously. Suppose we order them arbitrarily within each phase. Take first the write phase. Placing blue pebbles on nodes containing red pebbles in any order is permissible since there are no dependence constraints beyond the presence of the red pebbles. The nodes available for writing have red pebbles before the beginning of C_i.
The read phase is essentially the same, except that nodes containing blue pebbles receive red pebbles. We must be careful not to violate the timing constraints on any red pebbles used for this purpose. A register that is used to store the result of a calculation performed during C_i cannot also receive a datum from main memory during C_i. Actually, the red pebbles placed on dependent nodes in the calculation phase could theoretically be picked up and moved to a blue-pebbled node to effect a read during the read phase. However, the result is that the dependent node that had its red pebble removed did not really get calculated during C_i, and we are not violating the timing constraints on the real-time computation if we adopt this interpretation of such an event.
The next potential conflict comes from the overlapping of read and write operations. Suppose a register is used as a source to write to main memory. During the write phase, a blue pebble is placed on a node containing a red pebble. The red pebble represents the use of a register as a source for a write operation. However, the read phase may remove the red pebble and place it on some node containing a blue pebble, indicating the same register is both a source and a receiver of data simultaneously. We accept this as within our model of computation because hardware with this capability is easily realized. The only remaining sources for red pebbles are red pebbles that were placed prior to C_i and do not therefore represent any conflict with real computation. As the various sources for red pebbles do not have any mutual dependencies, and likewise, the placements of the red pebbles are not interdependent, we are free to order the movement of pebbles in the read phase arbitrarily.
We have established that the I/O phases may be linearly ordered. At this point, it appears that nothing has changed vis-à-vis the red-blue game. The real contribution of the parallel game comes in the calculation phase. Consider a calculation in which the result is written into one of the registers used as input. The input may be fanned out to many calculations, and all proceed in parallel. The red-blue game would block this type of activity
since lifting the red pebble from a supporting node and sliding it to one of the dependent nodes leaves the remaining dependent nodes without a full complement of supporting nodes. We define the calculation phase as the movement of red pebbles onto dependent nodes. We will add a new pebble (pink) to the game to avoid the blockage mentioned above. The pink pebble (placeholder pebble) allows fanout of the input by holding the contents of the calculation until the end of the calculation phase. The new pebble is not strictly required, but using it simplifies the definition of the new game.

The above discussion gives us a pebble game that can model an arbitrary parallel computation under the assumed model of computation. The game is sequential in the I/O phases, and taking the calculation phase as a single move, the S-I/O-division is well defined for this game.
Definition: The rules of the parallel-red-blue pebble game:

The game is identical to the red-blue pebble game with the addition of a new pebble (pink) and the following additional rules:

5. The game consists of cyclic repetition of three phases: write phase, calculate phase, read phase. The write phase consists of only rule (3) moves. The read phase consists of only rule (2) moves. The calculate phase comprises the following moves:

(a) a pink pebble is placed by rule (4).

(b) a red pebble replaces a pink pebble.

(c) no pink pebbles remain at the end of the phase.
With this definition of the parallel-red-blue game, we can proceed along the lines of [5] without altering their arguments. Their next step introduces the idea of partitioning the computation graph to get a lower bound on the number of subpebblings in an S-I/O-division.
Definition: A K-partition 𝒱 is a partition of the vertices of a directed acyclic graph G = (V, A) such that

1. For every V_i in 𝒱 there is a dominator set D_i ⊆ V and a minimum set M_i ⊆ V_i, both of size at most K, such that every path from the inputs to any element of V_i contains an element of D_i, and every v in V_i which has no children in V_i is in M_i.

2. There are no cyclic dependencies among the V_i. (V_j depends on V_i if there is an arc from an element of V_i to some element of V_j.)

We say g = |𝒱| is the size of the partition.
For every S-I/O-division of a pebbling P there is a 2S-partition determined in the following way: in P, consider every vertex that has never had a red pebble placed on it by any moves in P_i, i < k, and is red pebbled during P_k. This set of vertices is V_k. Property (2) is clearly satisfied by the set of all
such V_k's, 𝒱. The dominator D_k is then the set of all vertices which had red pebbles on them at the end of P_{k-1}, together with those vertices with blue pebbles on them at the end of P_{k-1} which get red pebbles during P_k. The size of D_k is at most 2S (there are S red pebbles and at most S I/O moves). The minimum set M_k consists of those vertices which were the "last" to be red pebbled during P_k (i.e., have no children which were red pebbled during P_k). At the end of P_k, any such vertex is either (i) still red pebbled, or (ii) now blue pebbled. Therefore, M_k can be at most of size 2S.
The above argument gives us the following theorem and lemma.

Theorem 2. [5] Let G be any directed acyclic graph and P be any red-blue pebbling of G with an S-I/O-division of size h using at most S red pebbles. Then, there is a 2S-partition of G of size g = h.

In particular, there is a partition such that g = h̄. From the comment made above concerning the minimum I/O requirements, and letting ḡ = min{g} over all 2S-partitions of G, we have

Lemma 1. [5] For any directed acyclic graph, Q ≥ S · (ḡ - 1).
The types of graphs represented by LGCA computation graphs have the nice feature that they are regular and "lined." Lines are simple paths from inputs to outputs. A vertex is said to lie on a line if the line contains the vertex. A line is covered by a set of vertices if the set contains a vertex that lies on the line. A lined graph is a graph in which a set of vertex-disjoint lines can be chosen so that every input is on some line in the set.
A complete set of lines is such a set of lines. For an LGCA computation graph, a path ((x, 0), (x, 1), (x, 2), ..., (x, T)) is a line l_x for any node x in the lattice G.
Suppose we have chosen a complete set of lines ℒ for some lined graph G. If we can bound from above the maximum number of vertices that lie on lines in ℒ and are contained in a single subset of any 2S-partition of G, and we can count the total number of vertices in G that are on lines, we will be able to lower bound ḡ. In applying this reasoning to LGCA computation graphs, we will choose the complete set of lines ℒ = {l_x | x ∈ V}. In the case of these graphs, every vertex lies on some line in ℒ.
Definition: The line-time τ(k) for a lined graph G is the maximum number of vertices that lie on a single line in any subset of any k-partition of G. That is, if we let K be the set of all k-partitions of G and ℒ be a complete set of lines in G, then

τ(k) = max over 𝒱 ∈ K, max over V_i ∈ 𝒱, max over l_j ∈ ℒ of |l_j ∩ V_i|.
By observing that a dominator set of size 2S or less can dominate at most 2S different lines, it is easy to conclude that the maximum number of vertices in a single subset of a 2S-partition that lie on lines is bounded from above by 2S · τ(2S); that is, |V_i*| ≤ 2S · τ(2S) in any 2S-partition of G, where V_i* is the smallest subset of V_i containing every vertex in V_i that lies on some line.
Consequently, we have

Lemma 2. [5] ḡ ≥ |X*| / (2S · τ(2S)) for a computation graph C = (X, A), where X* is the set of vertices of X that lie on lines.

This leads to Hong and Kung's second result:

Theorem 3. [5] Q = Ω(|X*| / τ(2S)).
For LGCA computations, we can express this bound in terms of main memory bandwidth B and processing rate R. Let the total time to perform the computation described by the computation graph be p. We then define R = |X|/p (for LGCA computations |X| = |X*|). Certainly, the total input/output traffic must be handled by the communication channel to main memory, so Bp ≥ Q, and the preceding bound becomes

R/B = O(τ(2S)),

or equivalently, R = O(B · τ(2S)). Using this result, we will show that for d-dimensional LGCA computations R = O(B · S^{1/d}). Specifically, we will show that τ(2S) < 2(d! · 2S)^{1/d} for their computation graphs.
In proving this, we will make the following simplifying assumptions, which are in any event worst case.

1. The graph G of a d-dimensional LGCA is an orthogonal grid defined on the integer lattice points contained in the d-cell in R^d consisting of the points {x | 0 ≤ x_i ≤ r (i = 1, 2, ..., d)}, where r is a non-negative integer. There are edges between a vertex and its nearest neighbors. We will refer to G as a lattice. Although G as defined above is inadequate for isotropic lattice gases [3], we are assuming the minimum connectivity for G in the sense that any lattice that satisfies isotropy requires at least the same degree of connectivity.
2. The boundaries of LGCAs can be handled in a variety of ways. They can be null (zero valued), independently random, dependently random or deterministic with truncated neighborhoods, or toroidally connected with full connectivity. In the first two cases above, the boundaries do not appear in C at all. We will assume boundary vertices appear in C with dependencies defined by the lattice. The boundaries can be thought of as deterministic or randomized, but dependent on their neighbors as defined in (1).
3. If the size of the vertex set of the lattice G is r^d, then we assume that the processor memory size S is less than r^d. In fact, if S = r^d, only 2S of main memory I/O is required to pebble C, and the bounds mentioned are irrelevant.
4. In the following, we will use the notation C_d when referring to a computation graph C_𝒢 for a d-dimensional LGCA 𝒢, with lattice G.
Let us derive some properties of the computation graph C_d.
Definition: A (u,v)-path is a path from vertex u to vertex v. The length of path p, l(p), is the number of edges in p. The distance d(u,v) between two vertices u and v is the minimum of l(p) over all (u,v)-paths p.
Lemma 3. In C_d, every (u,v)-path p has length d(u,v).

Proof. Since every arc in C_d goes from some layer t to a layer t+1, paths of different lengths starting from the same vertex end in different layers.
Lemma 4. In C_d, every vertex w whose distance from some specified vertex u is d(u,w) = ⌊d(u,v)/2⌋ lies on some (u,v)-path, provided u and v both lie on the same line in ℒ.
Proof. Let d(u,v) = 2k + δ, where k ≥ 0 and δ is either zero or one. Let u = (x, t) and v = (x, t+2k+δ). If k = 0, the result is trivial, so suppose that k > 0. There is some (u,w)-path p_1 = (u = u_0, u_1, ..., u_k = w). Let u_i = (x_i, t+i). Since there is an arc (u_i, u_{i+1}), x_i is in the neighborhood N(x_{i+1}), and vice versa. Consequently, there is a path p_2 = (w = v_k, v_{k-1}, ..., v_0 = (x, t+2k)), where v_i = (x_i, t+k+(k-i)). Thus, the path p = p_1 ∘ p_2 is a path from u to (x, t+2k) containing w. Concatenating the path ((x, t+2k), (x, t+2k+1), (x, t+2k+2), ..., (x, t+2k+δ)) onto the end of p gives us a (u,v)-path containing w.
Lemma 5. In C_d, every line l_x covered by a path of length at most j from some specified vertex u is covered by a path of length exactly j such that the last vertex on the path lies on l_x.
Proof. Let p be a path from u of length j or less covering line l_x. Let z be a vertex on path p such that z lies on l_x. Let p_1 be that portion of p from u to z. By assumption, l(p_1) = k ≤ j. Concatenating onto p_1 the path starting from z and continuing along l_x for j - k steps gives us the required path.
Lemma 6. In C_d, the number of lines covered by all paths of length j or less from a specified vertex u is equal to the number of vertices reachable from u in exactly j steps.
Proof. By the definition of ℒ, every vertex in a single layer lies on a unique line. By the argument of lemma 3, the end point of every path of length j lies in the same layer. So, for every vertex reachable in exactly j steps, there is a line covered by a path of length j. The lemma then follows from the previous lemma.
Lemma 7. If in C_d vertex v = (z, t+j) is reachable from vertex u = (x, t) in j steps, then in G vertex z is reachable from x in at most j steps. The converse holds if t ≤ T - j.
Proof. Consider a (u,v)-path p = (u = u_0, u_1, u_2, ..., u_j = v) in C_d, where u_i = (x_i, t+i). Since x_i ∈ N(x_{i+1}), either there is an edge {x_i, x_{i+1}} in G, or x_i = x_{i+1}. Deleting the self-loops from the path q = (x = x_0, x_1, x_2, ..., x_j = z) gives us an (x,z)-path in G of length at most j. Conversely, consider a path q = (x_0 = x, x_1, x_2, ..., x_i = z) in G where 0 ≤ i ≤ j. By hypothesis, t ≤ T - j, and consequently, the path p = ((x = x_0, t), (x_1, t+1), ..., (x_i = z, t+i), (z, t+i+1), ..., (z, t+j)) is a (u,v)-path in C_d.
Definition: The line-spread from a vertex u in graph G is

t_G(u, j) = ∞ if no vertex z exists such that d(u, z) = j, and otherwise t_G(u, j) = the number of lines covered by paths of length j or less.
Definition: The line-spread of a graph G = (V, E) is

T_G(j) = min over u ∈ V of {t_G(u, j)}.

If the graph G is C_d, we write T_d(j).
Lemma 8. T_d(j) > j^d / d!.
Proof. By lemmas 5, 6, and 7 we have shown that the number of lines covered by paths from some vertex u = (x, t) of length at most j in C_d is equal to the number of vertices reachable from x in at most j steps in G, provided at least one path of length j exists in C_d. By the definition of G, that is, an integer grid in the non-negative orthant, the minimum number of vertices reachable in j or fewer steps in G occurs when the origin is chosen as the specified vertex. The latter quantity is then given by
|Φ| > j^d / d! = vol(φ),

where φ is the region in R^d defined by the set {x | x_1 + x_2 + ... + x_d ≤ j, x_i ≥ 0}, and Φ is the set of integer lattice points in φ.
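The lattice-point count |Φ| can be checked against the volume bound by brute force for small d and j; the following sketch (ours, not from the paper; the function name is hypothetical) does exactly that.

```python
import math
from itertools import product

def lattice_points_in_simplex(j, d):
    """|Φ|: integer points x with x_i >= 0 and x_1 + ... + x_d <= j."""
    return sum(1 for x in product(range(j + 1), repeat=d) if sum(x) <= j)

# |Φ| exceeds vol(φ) = j^d / d! in every small case we try, as lemma 8 claims.
for d in (1, 2, 3):
    for j in (1, 2, 5):
        assert lattice_points_in_simplex(j, d) > j**d / math.factorial(d)
print(lattice_points_in_simplex(5, 2))   # 21 lattice points vs. volume 12.5
```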
We are now in a position to prove the main result:

Theorem 4. In C_d, τ(2S) < 2(d! · 2S)^{1/d}.
Proof.⁴ Suppose that τ(2S) ≥ 2(d! · 2S)^{1/d}. Let j = (d! · 2S)^{1/d}. Then there exist vertices u and v in some subset V_i of some 2S-partition 𝒱 of C_d such that d(u,v) = 2j, and u and v both lie on the same line in ℒ. Since the subsets of the partition 𝒱 are not cyclically dependent, every vertex z on any (u,v)-path is in V_i. By lemma 4, every vertex in the set Z = {x | d(u,x) = j} is on a (u,v)-path, and therefore Z ⊂ V_i. Then Z covers at least T_d(j) lines. The dominator for V_i must cover these lines. Since the lines in ℒ are disjoint, |D_i| ≥ T_d(j), and employing lemma 8 we have |D_i| > j^d / d! = 2S. This contradicts the assumption that V_i is an element of a 2S-partition, and we are done.
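To get a feel for how the bound behaves, the following sketch (ours; the storage and bandwidth values are illustrative examples, not measurements from the paper) evaluates the ceiling τ(2S) < 2(d! · 2S)^{1/d} of theorem 4, which via R = O(B · τ(2S)) caps the site-update rate per unit of main memory bandwidth.

```python
import math

def tau_bound(S, d):
    """Theorem 4's upper bound 2 * (d! * 2S)^(1/d) on the line-time tau(2S)."""
    return 2.0 * (math.factorial(d) * 2 * S) ** (1.0 / d)

# Illustrative values only: S site-values of processor storage, main memory
# bandwidth B in site-values per second.
S, B = 4096, 10_000_000
for d in (1, 2, 3):
    # The ceiling B * tau_bound(S, d) on R grows only as S^(1/d), so extra
    # on-chip storage buys progressively less in higher dimensions.
    print(d, round(tau_bound(S, d), 1), round(B * tau_bound(S, d) / 1e9, 2))
```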
8. Conclusions

We have described two architectures for lattice-update engines based on VLSI custom chips and derived their design curves and best operating points. The wide-serial architecture (WSA) has extremely simple support logic and data flow, while Sternberg's partitioned architecture (SPA) is perhaps more easily extensible to lattices of arbitrary sizes and provides higher throughput per custom chip, albeit at the expense of support logic and main memory bandwidth. Each has its preferred operating regime in different parts of the throughput vs. lattice-size plane. A prototype lattice-gas engine, using the WSA architecture, and based on a custom 3μ CMOS chip, is now being constructed. Each chip provides 20 million site-updates per second running at 10 MHz. It is unlikely, however, that the workstation host will be able to supply the 40 megabyte per second bandwidth required for this level of performance. We expect to realize approximately 1 million site updates/sec/chip from the prototype implementation.
We have also presented a graph-pebbling argument that gives upper bounds for the computation rate for lattice updates. The asymptotic upper bounds show clearly that memory bandwidth, and not processor speed or size, is the factor that limits performance. One goal for further research is the tightening of these pebbling-game arguments so that they give estimates of absolute, as well as asymptotic, performance. We will apply these estimates to get quantitative comparisons between competing architectures
⁴This proof follows the proof of theorem 5.1 in [5].
for lattice gas computations such as the Connection Machine, the CRAY X-MP, and special purpose machines. A further goal would be to discover an optimal pebbling for any problem in this class, and thereby discover an architecture which is optimal with regard to input/output complexity.

This work supports the growing recognition that communication bottlenecks at all scales of the architectural hierarchy are the critical limiting factors in the performance of highly pipelined, massively parallel machines. In our conservative VLSI design, not nearly at the limits of present integration technology, the processors themselves comprise only a small fraction of the total silicon area. As feature sizes shrink and problems are tackled with larger lattices in higher dimensions, this effect will become even more dramatic. This suggests that a search for more effective interconnection technologies, perhaps using optics, should have high priority.
References
[1] D. d'Humieres, P. Lallemand, and U. Frisch, "Lattice Gas Models for 3-D Hydrodynamics," Europhysics Letters, 4:2 (1986).

[2] S. Fortune and J. Wyllie, "Parallelism in Random Access Machines," Proc. 10th Annual ACM Symp. on the Theory of Computing, San Diego, CA, 1978.

[3] U. Frisch, B. Hasslacher, and Y. Pomeau, "A Lattice Gas Automaton for the Navier-Stokes Equation," Los Alamos National Laboratory preprint LA-UR-85-3503 (1985).

[4] J. Hardy, Y. Pomeau, and O. de Pazzis, "Time evolution of a two-dimensional model system. I. Invariant states and time correlation functions," J. Math. Phys., 14:12 (1973) 1746-1759.
[5] Jia-Wei Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proceedings of ACM Symp. on Theory of Computing, (1981) 326-333.

[6] Josef Kittler and Michael J. B. Duff, eds., Image Processing System Architectures, (Research Studies Press, Ltd., John Wiley and Sons, 1985).

[7] Steven D. Kugelmass, Richard Squier, and Kenneth Steiglitz, "Performance of VLSI Engines for Lattice Computations," Proc. 1987 Int. Conf. on Parallel Processing, St. Charles, IL, August 17-21, (Pennsylvania State University Press, University Park, PA, 1987) 684-691.

[8] S. Y. Kung, S. C. Lo, S. N. Jean, and J. N. Hwang, "Wavefront Array Processors - Concept to Implementation," Computer, 20:7 (IEEE, New York, 1987).

[9] T. Lengauer and R. E. Tarjan, "Upper and Lower Bounds on Time-Space Tradeoffs," ACM Symposium on the Theory of Computing, Atlanta, GA (1979) 262-277.
[10] Steven A. Orszag and Victor Yakhot, "Reynolds Number Scaling of Cellular Automaton Hydrodynamics," Physical Review Letters, 56:16 (1986) 1691-1693.

[11] N. Pippenger, "Pebbling," Proc. 5th IBM Symposium on Mathematical Foundations of Computer Science, Academic and Scientific Programs, IBM Japan, May 1980.

[12] Arnold L. Rosenberg, "Preserving Proximity in Arrays," SIAM J. Computing, 4:4 (1975) 443-460.

[13] Peter A. Ruetz and Robert W. Brodersen, "Architectures and Design Techniques for Real-Time Image-Processing IC's," IEEE Journal of Solid-State Circuits, SC-22:2 (1987).

[14] James B. Salem and Stephen Wolfram, "Thermodynamics and Hydrodynamics with Cellular Automata," in Theory and Applications of Cellular Automata, ed. S. Wolfram, (World Scientific Publishing Co., Hong Kong, 1986) 362-366.

[15] John E. Savage and Jeffrey Scott Vitter, "Parallelism in Space-Time Tradeoffs," in VLSI: Algorithms and Architectures, ed. F. Luccio, (Elsevier Science Publishers B.V., North Holland, 1985) 49-58.

[16] K. Steiglitz and R. R. Morita, "A Multi-Processor Cellular Automaton Chip," Proc. 1985 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Tampa, FL, 1985.

[17] Stanley R. Sternberg, "Computer Architectures Specialized for Mathematical Morphology," in Algorithmically Specialized Parallel Computers, ed. Howard Jay Siegel, (Academic Press, 1985) 169-176.

[18] Stanley R. Sternberg, "Pipeline Architectures for Image Processing," in Multicomputers and Image Processing, Algorithms and Programs, ed. Leonard Uhr, (Academic Press, 1982) 291-305.

[19] Kenneth Supowit and Neal Young, personal communication (1986).