Connex Integral Parallel Architecture & the 13 Berkeley Motifs (version 1.1)
Abstract: Connex Integral Parallel Architecture is a many-cell engine designed to solve intense computational problems. Connex technology is presented and analyzed from the point of view of each of the 13 computational motifs proposed in Berkeley's View [1]. We conclude that Connex technology works efficiently in almost all of the 13 computational motifs.
1. Introduction
Parallel processing offers two distinct solutions to the everlasting problems that challenge computer science: complexity & intensity. Coarse-grain networks of a few or dozens of big & complex processors performing multi-threaded computation are proposed for complex computation, while fine-grain networks of hundreds or thousands of small & simple processing cells are proposed for intense computation. The Connex architecture¹ is designed to perform intense computation.
We take into account the fundamental differences between multi-processors and many-processors [2]. We see multi-processors as performing complex computations, while many-processors are designed for intense computations².
Academia provides a comprehensive survey of parallel computing in the seminal research report known as Berkeley's View [1], where 13 computational motifs are identified as the main challenges for the emerging parallel paradigm.
This white paper investigates how the Connex architecture behaves in relation to the 13 computational motifs emphasized in Berkeley's View.
2. Connex Integral Parallel Architecture
Integral Parallel Architecture (IPA) is defined as a fine-grain cellular machine with structural resources for performing mainly data-parallel and time-parallel computations, and resources for computations requiring speculative execution. IPA considers two main data structures: vectors of scalars, processed in data parallel machines, and streams of scalars, processed in time parallel machines.
¹ See here the history of the project: http://arh.pub.ro/gstefan/conexMemory.html
² The distinction between the complex computation and the intense computation is defined in [12].
Connex IPA performs the following types of parallel computations:
• data parallel computation, working on vectors and having as result vectors, scalars (by parallel reduction operations), or streams (applied as inputs for time parallel computations)
• time parallel computation, working on streams and having as result streams, scalars (by accumulating operations), or vectors (applied as inputs for data parallel computations)
• speculative parallel computation (expanded mainly inside the time parallel computation, but sometimes inside the data parallel computation), working on scalars and having as result vectors reduced immediately by selection (the simplest reduction function) to a scalar
• reduction parallel computation, working on vectors with results on scalars.
Almost all parallel computations are data parallel (with the associated reduction operations), but some of them involve time parallel processes, supported by speculative computations if needed.
Data parallel engine
The Connex computational structure performing data parallel computation (with the associated reduction operations) is a fine-grain network of small & simple execution units (EU) working as a many-cell machine.
The current embodiment of the Connex many-core section (see Figure 1) is a linear array of n = 1024 EUs. Each EU is a machine working on words of p = 16 bits, with a 1 KB local data memory. This memory allows the storage of m = 512 16-bit vector components, one for each of the 512 vectors. The processing array works in parallel with an IO plane (IOP) used to transfer w = 128-bit data words between the array and the external memory. The architecture is scalable: all the previous parameters scale up or down (usually: n = 64 … 4096, p = 8 … 64, m = 64 … 16384, w = 32 … 1024).
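As a quick consistency check on these parameters, the following Python sketch recomputes the local and total data memory from n, p and m. It assumes that each EU stores exactly m words of p bits; the function names are hypothetical and only serve the illustration.

```python
def local_memory_bytes(m: int, p: int) -> int:
    """Bytes of data memory per EU, assuming m words of p bits each."""
    return m * p // 8


def total_memory_bytes(n: int, m: int, p: int) -> int:
    """Bytes of data memory in the whole array of n EUs."""
    return n * local_memory_bytes(m, p)


# Parameters of the current embodiment described in the text.
n, p, m = 1024, 16, 512
print(local_memory_bytes(m, p))      # 1024 bytes, i.e. 1 KB per EU
print(total_memory_bytes(n, m, p))   # 1048576 bytes, i.e. 1 MB in the array
```

Under this assumption the result agrees with the 1 KB local data memory quoted for each EU.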
The array is controlled by the Instruction Sequencer (IS), a simple controller, while the IOP transfers data under the control of another machine called the IO Controller (IOC). Thus, data processing and data transfer are two independent processes performed in parallel. Data exchange between the processing array and the IO plane is performed in one clock cycle and is synchronized by hardware mechanisms.
For time parallel computation a dynamically reconfigurable network of 8 simple processors is provided outside of the processing array. Speculative computation is performed in both networks.
Figure 1. Connex data parallel engine. The processing array is paralleled by the IO Plane, which performs data transfers transparently to the processing.
Data parallel architecture
The user's view of the data parallel machine is represented in Figure 2: a linear cellular machine containing 1024 EUs operates in parallel on the vectors stored in the two-dimensional array of scalars and Booleans.
The number of cells, n, provides the spatial dimension of the array, while the number of words stored in each cell provides the temporal dimension of the array (for example, in Figure 2 n = 1024, m = 512). On the spatial dimension the "distance" between two scalars in a vector is in O(n). On the temporal dimension the "distance" between two scalars is in O(1). Indeed, to perform an operation between two scalars stored in two different cells, the time needed to bring them into the same EU is proportional, in the worst case, to the number of cells, while two operands stored in the data memory of one EU are accessed in a few cycles using random addressing.
The two dimensions of the Connex architecture (the horizontal, spatial dimension and the vertical, temporal dimension) must be carefully used by the programmer in order to squeeze the maximum performance from ConnexArray™.
Two kinds of vectors are defined in this array: horizontal vectors (along the spatial dimension) and vertical vectors (along the temporal dimension). In the following we consider only horizontal vectors, called simply vectors. (When vertical vectors are considered, it will be specified.)
Operations on vectors are performed in a small fixed number of cycles. Some generic operations are exemplified in the following:
• PROCESSING OPERATIONS performed in the processing array under the control of IS:
  o full vector operation: {carry, v5} = v4 + v3;
    the corresponding integer components of the two operand vectors (v4 and v3) are added, and the result is stored in the scalar vector v5 and in the Boolean vector carry
  o Boolean operation: s3 = s3 & s5;
    the corresponding Boolean components of the two operand vectors, s3 and s5, are ANDed and the result is stored in the result vector s3
  o predicated execution: v1 = s2 ? v3 - v2 : v1;
    in any position where s2 = 1 the corresponding components are operated, while in the rest (i.e., elsewhere) the content of the result vector remains unchanged (it is a "spatial if" statement)
  o vector rotate: v7 = v7 >> n;
    the content of vector v7 is rotated n positions right, i.e., v7[i] = v7[(i+n) mod 1024]
Figure 2. The internal state of the Connex data parallel machine. There are m = 512 integer (horizontal) vectors, each having n = 1024 16-bit integer components (vi[j] is a 16-bit integer), and 8 selection vectors, each having 1024 Booleans (sk[j] is a Boolean).
• INPUT-OUTPUT OPERATIONS performed in IOP under the control of IOC:
  o strided load: load v5 address burst stride;
    the content of v5 is loaded with data from the external memory accessed starting from the address: address, using bursts of size: burst, on a stride of: stride
  o scattered load: sload v3 highAddress v9 addr stride;
    the content of v3 is loaded with data from the external memory indirectly accessed using the content of the address vector: v9, whose content is used starting from the index address: addr, on a stride of: stride; the address vector is structured in pairs of 16-bit words; each of the 512 resulting 32-bit words is organized as follows:
    {dummy, burst[5:0], address[24:0]}
    where: if dummy == 1, then a burst of {burst[5:0], 1'b0} dummy bytes is loaded, else a burst of {burst[5:0], 1'b0} bytes from the address {highAddress, addr, 1'b0} is loaded (it is a sort of indirect load)
  o strided store: store v7 address burst stride;
  o gathered store: gstore v4 highAddress v3 addr stride;
    (it is a sort of indirect store).
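The semantics of the operations listed above can be emulated in a few lines of plain Python. This is only a behavioral sketch, not the Connex implementation: vectors are modelled as lists, all function names are hypothetical, results are assumed to wrap to 16 bits, and `decode_descriptor` takes an already assembled 32-bit word because the text does not say which 16-bit word of a pair holds the upper half.

```python
N = 1024                        # number of EUs, i.e. components per vector
WORD_BITS = 16                  # word size p
MASK = (1 << WORD_BITS) - 1


def vadd(a, b):
    """{carry, r} = a + b: component-wise 16-bit add plus a Boolean carry vector."""
    sums = [x + y for x, y in zip(a, b)]
    return [s > MASK for s in sums], [s & MASK for s in sums]


def vand(s_a, s_b):
    """Boolean operation on two selection vectors."""
    return [x and y for x, y in zip(s_a, s_b)]


def predicated_sub(sel, a, b, old):
    """r = sel ? a - b : old, the "spatial if"; results wrap to 16 bits."""
    return [((x - y) & MASK) if s else o for s, x, y, o in zip(sel, a, b, old)]


def rotate(v, k):
    """new[i] = old[(i + k) mod len(v)], following the formula in the text."""
    size = len(v)
    return [v[(i + k) % size] for i in range(size)]


def decode_descriptor(word32):
    """Split a 32-bit word laid out as {dummy, burst[5:0], address[24:0]}."""
    dummy = (word32 >> 31) & 0x1
    burst = (word32 >> 25) & 0x3F
    address = word32 & 0x1FFFFFF
    burst_bytes = burst << 1    # {burst[5:0], 1'b0}
    return dummy, burst_bytes, address


# Usage: every component overflows, so carry is True and the sum wraps to 0.
carry, v5 = vadd([0xFFFF] * N, [1] * N)
print(carry[0], v5[0])          # True 0
```

Each function touches every component independently, which is what allows the array to execute such an operation in a small fixed number of cycles.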
VectorC: the programming language for data parallel architecture
The Connex data parallel engine is programmed in VectorC, a C++ language extension for parallel processing defined by Connex [8]. The extension is made by adding new primitive data types and by extending the existing operators to accept the new data types. In the VectorC programming language the conditional statements have become predication statements.
The new data primitives are:
• int vector: vector of integers (stored as a pair of 16-bit integer vectors)
• short vector: vector of shorts (stored as a 16-bit integer vector)
• byte vector: vector of bytes (two byte vectors are stored as an integer vector)
• selection: vector of Booleans
In order to explain how VectorC works, consider the following variable declarations:
    int i1, i2, i3;
    bool b1, b2, b3;
    int vector v1, v2, v3;
    selection s1, s2, s3;
Then a VectorC statement like:
    v3 = v1 + v2;
replaces this style of for statement:
    for (int i = 0; i < VECTOR_SIZE; i++)
        v3[i] = v1[i] + v2[i];
and
    s3 = s1 && s2;
replaces this for statement:
    for (int i = 0; i < VECTOR_SIZE; i++)
        s3[i] = s1[i] && s2[i];
The scalar statement
    if (b1) {i3 = i1 + i2;}
is written in VectorC as the following vector predication statement:
    WHERE (s1) {v3 = v1 + v2;}
replacing this nested for:
    for (int i = 0; i < VECTOR_SIZE; i++)
        if (s1[i])
            v3[i] = v1[i] + v2[i];
Similarly,
    i3 = (b1)? i1 : i2;
is extended to accept vectors:
    v3 = (s1)? v1 : v2;
Here is an example in VectorC computing the absolute difference of two vectors.
    vector absdiff(vector V1, vector V2);

    int main() {
        vector V1 = 2;
        vector V2 = 3;
        vector V;
        V = absdiff(V1, V2);
        return 0;
    }

    vector absdiff(vector V1, vector V2) {
        vector V;
        V = V1 - V2;
        WHERE (V < 0) {    // predication: only components with V < 0 are updated
            V = -V;
        }
        ENDW
        return V;
    }
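To make the effect of the WHERE construct concrete, here is a minimal plain-Python emulation of absdiff. It is a sketch only: vectors are modelled as lists, VECTOR_SIZE is assumed to be 1024 (one component per EU), and the helper `broadcast` is a hypothetical stand-in for a scalar initializer such as `vector V1 = 2;`.

```python
VECTOR_SIZE = 1024   # assumed: one component per EU


def broadcast(scalar):
    """Model `vector V = scalar;` as the same value in every component."""
    return [scalar] * VECTOR_SIZE


def absdiff(v1, v2):
    """Emulate the VectorC absdiff: subtract, then negate where negative."""
    v = [a - b for a, b in zip(v1, v2)]
    selection = [x < 0 for x in v]          # the Boolean vector (V < 0)
    # WHERE (selection) { V = -V; } ENDW: only selected positions change.
    return [-x if s else x for x, s in zip(v, selection)]


result = absdiff(broadcast(2), broadcast(3))
print(result[0])     # 1
```

The selection vector plays the role of the loop-and-if pair in scalar code: the comparison and the conditional negation are each a single vector operation.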
See a few introductory examples in [5], where the VectorC library is posted.
Connex data parallel engine by the numbers
The last implementation of ConnexArray™ provided the following measured performances:
• computation: 400 GOPS³ at 400 MHz (peak performance)
• external bandwidth: 6.4 GB/sec (peak performance)
• internal bandwidth: 800 GB/sec (peak performance)
• power: < 3 Watt
• area: < 50 mm² (1024-EU array, including 1 Mbyte of memory and the two controllers with their local program and data memory).
Design & technology parameters:
• 65nm technology
• Fully synthesized (no custom memories or layout)
• Standard Chartered Semiconductor "G" process
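The peak figures above can be cross-checked with simple arithmetic. The sketch below rests on assumptions that are not stated in the text: each EU completes one 16-bit operation and one 16-bit local memory access per clock cycle, and the IO plane moves one w = 128-bit word per clock cycle. Under these assumptions the computed values land close to the quoted 400 GOPS, 800 GB/sec and 6.4 GB/sec.

```python
N_EUS = 1024          # execution units in the array
CLOCK_HZ = 400e6      # 400 MHz
WORD_BYTES = 2        # p = 16 bits
IO_WORD_BYTES = 16    # w = 128 bits

# Assumption: one 16-bit operation per EU per clock cycle.
peak_gops = N_EUS * CLOCK_HZ / 1e9

# Assumption: one 16-bit local memory access per EU per clock cycle.
internal_gb_per_s = N_EUS * WORD_BYTES * CLOCK_HZ / 1e9

# Assumption: one w-bit word crosses the IO plane per clock cycle.
external_gb_per_s = IO_WORD_BYTES * CLOCK_HZ / 1e9

print(peak_gops, internal_gb_per_s, external_gb_per_s)
```

The small excess over 400 GOPS and 800 GB/sec comes from using 1024 rather than a rounded 1000 cells.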
Time parallel architecture
Amdahl's law is the argument used against the extensive use of parallel computation. But this very old argument (1967) was coined in the pioneering era of parallel computing. Meanwhile the theoretical and technological context has changed, and a lot of new data about how parallelism works has accumulated.
³ 16-bit operations
In 1998 Daniel Hillis expressed his opinion as follows:
"I now understand the flaw in Amdahl's argument. It lies in the assumption that a fixed portion of the computation, even just 10 percent, must be sequential. This estimate sounds plausible, but it turns out not to be true of most large computations. The false intuition came from a misunderstanding about how parallel processors would be used. … Even problems that most people think are inherently sequential can usually be solved efficiently on a parallel computer." ([4], pp. 114-115)
Indeed, for "large computations", pipelining, speculating, or speculating in a pipe on the structural support of a many-cell engine provides the architectural context for executing in parallel pure sequential computations when big streams of data are waiting to be processed.
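For reference, Amdahl's bound can be evaluated numerically. The sketch below uses the standard formula speedup = 1 / (s + (1 - s)/n), with the sequential fraction s = 0.10 taken from the quotation above and n = 1024 taken from the size of the array; the function name is hypothetical. It illustrates that the bound is driven by the assumed fixed sequential fraction rather than by the number of cells.

```python
def amdahl_speedup(sequential_fraction: float, n_cells: int) -> float:
    """Upper bound on speedup when a fixed fraction of the work is sequential."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_cells)


# With 10% sequential work the bound stays below 10x, even on 1024 cells.
print(amdahl_speedup(0.10, 1024))

# If the sequential fraction shrinks for large computations, the bound rises.
print(amdahl_speedup(0.001, 1024))
```

This is the quantitative side of Hillis's remark: once the sequential portion is no longer fixed at 10 percent, the bound stops being restrictive.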
Figure 3. Connex time parallel engine. a. The pipe without speculation. b. The pipe with speculation. The i-th stage in the pipe is computed by q PEs dynamically configured in a speculative linear array. $PE_{i+q}$ selects dynamically as input only one of the outputs of the speculative array.
Time parallel computation is supported in the current implementation of the Connex chip by a small configurable network of processing elements called the Stream Accelerator (SA). The network works like a pipe of processors in which at any point two or more machines are connected in parallel to support speculation. In the actual implementation 8 machines are used. Functions like the CABAC decoding procedure, a highly sequential computation with strong data dependency, are efficiently executed by a small program.
The computation of SA is defined by two mechanisms:
• stream of functions containing m programs $p_j(\sigma)$:
  $S(\sigma_{in}) = \langle p_0(\sigma_{in}), p_1(\sigma_0), \ldots, p_{m-1}(\sigma_{m-2}) \rangle = \sigma_{out}$
  applied to a stream of scalars $\sigma_{in}$, generating a stream of scalars $\sigma_{out}$ as output, where $p_j(\sigma)$ is a program which processes the stream $\sigma$ and generates the stream $\sigma_j$; this is a type of MIMD computation
• vector of functions containing q programs $p_j(\sigma)$:
  $V(\sigma_{in}) = [p_0(\sigma_{in}), \ldots, p_{q-1}(\sigma_{in})]$
  applied to a stream of scalars $\sigma_{in}$, generating a stream of q-component vectors; this is a true MISD computation.
The latency introduced by a stream of functions is m, but the stream is computed in real time (see Figure 3a). Vectors of functions are used to perform speculation when the computation requests it, in order to preserve the possibility of real time computation (see Figure 3b). For example:
$\langle p_0(\sigma_{in}), p_1(\sigma_0), \ldots, p_{i-1}(\sigma_{i-2}), V(\sigma_{i-1}), p_{i+q}(\sigma_?), \ldots, p_{m-1}(\sigma_{m-2}) \rangle$
is a computation which performs a speculation in stage i of the pipe. The program $p_{i+q}(\sigma_?)$ selects from the vectors generated by $V(\sigma_{i-1})$ only one component as input.
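The two mechanisms can be mimicked with Python generators. The sketch below is a simplification under stated assumptions: each program is modelled as a stateless function applied to one scalar at a time (real SA programs process whole streams), the emulation is sequential and does not model timing, and the names `speculate` and `choose` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Program = Callable[[int], int]   # simplified per-scalar model of a program p_j


def stream_of_functions(programs: Sequence[Program],
                        stream: Iterable[int]) -> Iterator[int]:
    """S(sigma_in): pass each scalar through p_0, p_1, ..., p_{m-1} in order."""
    for x in stream:
        for p in programs:
            x = p(x)
        yield x


def vector_of_functions(programs: Sequence[Program],
                        stream: Iterable[int]) -> Iterator[List[int]]:
    """V(sigma_in): apply all q programs to the same scalar, giving a q-vector."""
    for x in stream:
        yield [p(x) for p in programs]


def speculate(candidates: Sequence[Program], choose: Callable[[int], int],
              stream: Iterable[int]) -> Iterator[int]:
    """Speculative stage: compute every candidate, then keep one component."""
    for x in stream:
        outputs = [p(x) for p in candidates]   # all q results are produced
        yield outputs[choose(x)]               # the next stage selects one


def pipeline(sigma_in: Iterable[int]) -> Iterator[int]:
    """A three-stage pipe with speculation in the middle stage."""
    s = stream_of_functions([lambda x: x + 1], sigma_in)
    s = speculate([lambda x: x * 2, lambda x: x - 3], lambda x: x % 2, s)
    return stream_of_functions([lambda x: x * x], s)


print(list(pipeline([1, 2, 3])))   # [16, 0, 64]
```

In hardware the candidate programs run side by side, so the selection costs no extra time per scalar; here they are simply evaluated one after another.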
There is a Connex version having the stream accelerator functionality integrated with the data parallel engine.
3. Berkeley's View
Berkeley's View [1] provides a comprehensive presentation of the problems to be solved by the emerging actor on the computing market: the ubiquitous parallel paradigm. An academic topic for many decades, parallelism became an important actor on the market after 2001, when the clock rate race stopped. This research report presents 13 computational motifs⁴ which cover the main aspects of parallel computing. They are defined independently of any specific parallel architecture. In the next section we make a preliminary evaluation of them in the context of Connex's IPA.
⁴ Initially called dwarfs, they are renamed motifs in [9].
4. Connex's Performance
Connex's cellular network has the simplest possible interconnection network. This is both an advantage and a limitation. On one hand, the area of the system is minimized, and it is easy to hide the associated organization from the user, with no loss in programmability or in the efficiency of compilation. On the other hand, the limitation appears in certain application domains.
What follows are short comments about how the Connex architecture works for each of the 13 Berkeley's View motifs.
Motif 1: Dense linear algebra
The computation in this domain operates mainly on N×M matrices. The operations performed are: matrix addition, scalar multiplication, transposition of a matrix, dot product of vectors, matrix multiplication, determinant of a matrix, (forward & backward) Gaussian elimination, solving systems of linear equations, and inverse of an N×M matrix.
The internal representation of the matrix is decided depending on the product N×M. If the product is small enough (usually, no bigger than 128), each matrix can be expanded as a vertical vector and associated with one EU, resulting in 1024 matrices represented by N×M 1024-element horizontal vectors. But if the product N×M is big, then P EUs are associated with each matrix, resulting in parallel processing of 1024/P matrices represented in (N×M)/P 1024-element vectors.
For all the operations listed above the computation is usually accelerated almost 1024 times, but not under 1024/P times. This is possible because special hardware is provided for reduction operations in the ConnexArray™ (for example: adding 1024 16-bit numbers takes 4-20 clock cycles, depending on the actual implementation of the reduction network associated with the array).
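The first layout (one small matrix per EU) can be sketched in plain Python as follows. The function names are hypothetical, lists stand in for the array, and the pairwise tree used for the reduction is only an assumption about how a reduction network might be organized; the text does not describe its structure.

```python
N_EUS = 1024


def to_horizontal_vectors(matrices):
    """Lay out one N x M matrix per EU.

    Element (r, c) of every matrix lands in horizontal vector r * M + c,
    so the result is a list of N * M vectors with one component per EU.
    """
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[mat[r][c] for mat in matrices]
            for r in range(rows) for c in range(cols)]


def add_all(vectors_a, vectors_b):
    """Add two batches of matrices: one vector addition per matrix element."""
    return [[x + y for x, y in zip(va, vb)]
            for va, vb in zip(vectors_a, vectors_b)]


def tree_reduce_add(vector):
    """Sum a power-of-two-length vector in log2(len) pairwise steps."""
    steps = 0
    while len(vector) > 1:
        half = len(vector) // 2
        vector = [vector[i] + vector[i + half] for i in range(half)]
        steps += 1
    return vector[0], steps


# 1024 matrices of size 2 x 2 are added with only 4 vector additions.
a = to_horizontal_vectors([[[i, i], [i, i]] for i in range(N_EUS)])
b = to_horizontal_vectors([[[1, 2], [3, 4]]] * N_EUS)
c = add_all(a, b)
print(len(c), tree_reduce_add(c[0]))   # 4 (524800, 10)
```

The number of vector operations depends only on N×M, not on the number of matrices, which is where the acceleration of almost 1024 times comes from.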
Motif 2: Sparse linear algebra
There are two types of sparse matrices: (1) randomly distributed sparse arrays (represented by a few types of lists), and (2) band arrays (represented by a stream of short vectors).
For small random sparse arrays, converting them internally into dense arrays is a good solution. For big random sparse arrays the associated list is operated on using efficient search operations. For band arrays, systolic-like solutions are proposed.
Connex's intense computation engine handles these types of linear algebra problems very well. Acceleration is between 128 and 1024 times, depending on the density of the array.
Motif 3: Spectral methods
The typical examples are FFT or wavelet computation. Because of the "butterfly" data movement, how the FFT computation is implemented depends on the length of the sample. The spatial and the temporal dimensions of the Connex array help the programmer to easily adapt the data representation to obtain an almost linear acceleration, i.e., in the worst case the acceleration is not less than 80% of the linear acceleration, which is 1024.
Evaluation report: Using VectorC as the simulation environment, FFT computation was evaluated for the Connex architecture. For a ConnexArray™ version with n = 1024, p = 32, m = 512, w = 256, which can be implemented in 65nm on 1 cm², a 4096-sample FFT is computed at 1.6 clock cycles per sample, a 1024-sample FFT is computed at 1 clock cycle per sample, a 256-sample FFT is computed in less than 0.5 clock cycles per sample, and a 64-sample FFT is computed in less than 0.2 clock cycles per sample. At 400 MHz this results in a 10 Watt chip.
The algorithm loads each stream of samples as 16 64-element vertical vectors. Thus, the array works simultaneously on 64 FFTs. This is a good example of the optimizations allowed by the two dimensions of the Connex architecture. If only the spatial dimension is used, loading all the 1024 samples as a single horizontal vector, then the same computation is done in 8 clock cycles per sample instead of only one. Almost one order of magnitude is gained only by playing with the two dimensions of our architecture.
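The figures in this paragraph follow from a short calculation, sketched below with hypothetical function names. It assumes that the vertical vector length divides the FFT size and that the number of EUs per FFT divides the array size, as is the case for the 1024-sample example.

```python
def ffts_in_parallel(n_eus: int, fft_size: int, vertical_len: int) -> int:
    """Number of FFTs held by the array when each FFT is split into
    vertical vectors of vertical_len samples, one vertical vector per EU."""
    eus_per_fft = fft_size // vertical_len   # 1024 / 64 = 16 EUs per FFT
    return n_eus // eus_per_fft


def samples_per_second(clock_hz: float, cycles_per_sample: float) -> float:
    """Sustained sample rate for a given cost in clock cycles per sample."""
    return clock_hz / cycles_per_sample


print(ffts_in_parallel(1024, 1024, 64))   # 64

two_dimensional = samples_per_second(400e6, 1.0)   # spatial + temporal layout
spatial_only = samples_per_second(400e6, 8.0)      # single horizontal vector
print(two_dimensional / spatial_only)              # 8.0
```

The factor of 8 is the "almost one order of magnitude" mentioned above.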
Motif 4: N-Body method
This method fits perfectly on Connex's architecture, because for j = 0 to j = n-1 the following equation must be computed:
$U(x_j) = \sum_i F(x_j, X_i)$
Each function $F(x_j, X_i)$ is computed on a single EU, and then the sum is a reduction operation linearly accelerated by the array. Depending on the value of n, the data is distributed in the processing array using the spatial dimension only or, for large n, both the spatial and the temporal dimension. For this motif an almost linear acceleration also results.
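The mapping just described can be written down as a short Python sketch. The pair function used here (an inverse distance in one dimension) is a hypothetical placeholder for F, and the loop over i stands for work that the array would perform with one F evaluation per EU.

```python
def pair_function(x_j: float, x_i: float) -> float:
    """Hypothetical F(x_j, X_i): inverse distance, zero for identical points."""
    d = abs(x_j - x_i)
    return 0.0 if d == 0 else 1.0 / d


def n_body(positions):
    """U(x_j) = sum_i F(x_j, X_i), computed for every j."""
    result = []
    for x_j in positions:                      # x_j is broadcast to all EUs
        partial = [pair_function(x_j, x_i)     # one F evaluation per EU
                   for x_i in positions]
        result.append(sum(partial))            # add reduction over the array
    return result


print(n_body([0.0, 1.0, 3.0]))
```

Only the outer loop over j remains sequential; the inner evaluations and the sum correspond to one data parallel step and one reduction.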
Motif 5: Structured grids
The grid is distributed on the two dimensions of our array: the spatial dimension and the temporal dimension. Each processor is assigned a line of nodes (on the spatial dimension). It performs each update step locally and independently of the other lines of nodes. Each node only has to communicate with neighboring nodes on the grid, exchanging data at the end of each step. The system works as a cellular automaton. The computation is accelerated almost linearly on Connex's architecture.
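A single update step of such a grid can be sketched in plain Python. The update rule (average of the four neighbors, with wrap-around borders) is a hypothetical example, and the mapping of grid columns to EUs and grid rows to local memory is an assumption made for this illustration.

```python
def step(grid):
    """One update: every node becomes the average of its four neighbors.

    Columns are assumed to map to EUs (spatial dimension) and rows to the
    local memory of each EU (temporal dimension). Neighbors in the same
    column are local; neighbors in adjacent columns come from adjacent EUs.
    """
    rows, cols = len(grid), len(grid[0])
    return [[(grid[(r - 1) % rows][c] + grid[(r + 1) % rows][c] +
              grid[r][(c - 1) % cols] + grid[r][(c + 1) % cols]) / 4.0
             for c in range(cols)]
            for r in range(rows)]


grid = [[0, 0, 0],
        [0, 4, 0],
        [0, 0, 0]]
print(step(grid))   # [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

Every node reads only its immediate neighbors, so the exchange at the end of each step never needs more than a shift by one cell.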
Motif 6: Unstructured grids
Unstructured grid problems are described as updates on an irregular grid, where each grid element is updated from its neighbor grid elements. Parallel computation is disturbed when the problem size is large, and the non-uniformity of the data distribution would best utilize special access mechanisms. In order to solve the non-uniformity problem for the ConnexArray, a preprocessing step is required.
The algorithm for preprocessing the n-element unstructured grid representation starts from an initial list of grid elements $G = \{g_0, \ldots, g_{n-1}\}$ and provides the minimum number of vectors, following the steps sketched here:
• the n×n interconnection matrix for the n grid elements is generated
• by interchanging elements in the list G, a minimal band matrix is generated
• each diagonal of the band represents a vector loaded into the processing array
• the result is a grid with some dummy elements, but each actual grid element has its neighborhood located in a few adjacent EUs.
Depending on the list G, the preprocessing can be performed in the Connex array or in an external standard processor. The resulting acceleration is smaller than for the structured grid (depending on the computation involved in each grid element, 10-20% of the acceleration is lost).
Motif 7: Map reduce
The typical example of a map reduce computation is the Monte Carlo method. This method consists of many completely independent computations working on randomly generated data. This type of computation is highly parallel. Sometimes it requires the add reduction function, for which the Connex architecture has special accelerating hardware. The computation is linearly accelerated.
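A classic instance of this pattern is the Monte Carlo estimate of pi, sketched below in plain Python. The function name and parameters are hypothetical, and a single seeded generator is shared by all emulated cells for reproducibility, whereas independent cells would each draw their own random data.

```python
import random


def estimate_pi(n_cells: int = 1024, trials_per_cell: int = 100,
                seed: int = 0) -> float:
    """Monte Carlo estimate of pi: independent trials per cell (map),
    followed by an add reduction over all cells (reduce)."""
    rng = random.Random(seed)
    hits_per_cell = []
    for _ in range(n_cells):                   # map: cells work independently
        hits = 0
        for _ in range(trials_per_cell):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:           # point falls inside the circle
                hits += 1
        hits_per_cell.append(hits)
    total_hits = sum(hits_per_cell)            # reduce: the add reduction
    return 4.0 * total_hits / (n_cells * trials_per_cell)


print(estimate_pi())
```

The only communication between cells is the final sum, which is why the add reduction hardware is the one feature this motif needs.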
Motif 8: Combinational logic
There are a lot of very different problems falling in this class. We list here only the most important and the most frequently used ones:
• block processing, exemplified by AES encryption algorithms; it works on 4×4 arrays of bytes, each array is loaded in one EU, and the processing is completely SIMD-like, with linear acceleration on the ConnexArray
• stream processing, exemplified by convolution methods which do not use blocks, processing instead a continuous bit stream; it is computed very efficiently in Connex's time parallel accelerator (SA) with no speculation involved
• image rotation for black & white or color bit-mapped images, performed first by loading an m×m array of pixels into the processing array on both dimensions (spatial and temporal), second by executing a local transformation, and third by restoring the transformed image in the appropriate place; this is done very efficiently on the ConnexArray
• route lookup, used in networking; it supposes three database-like operations (longest match, insert, delete), all performed very efficiently by the Connex processing array.
Motif 9: Graph traversal
The array of 1024 machines can be used as a big "speculative device". Each EU starts with a full graph stored in its data memory, and the computation provides the result when one EU, if any, finds the solution. Limitations are generated by the size of the data memory of each EU. More investigation is needed to evaluate the actual power of Connex technology in solving this problem.
Some problems related to graphs are easily solved if matrix computation is involved (example: computing the distance between all the elements of a graph).
Motif 10: Dynamic programming
Viterbi decoding is the example presented in [1]. It best fits the modular feed-forward architecture of SA, built either as a distinct network, as currently implemented at Connex, or integrated into the main data parallel processing array. Very long streams of bits are computed in parallel by the pipeline structure of Connex's SA.
Motif 11: Back-track and branch & bound
Motif under investigation ("Berkeley's View" is silent regarding this motif).
Motif 12: Graphical models
Motif under investigation ("Berkeley's View" is silent regarding this motif).
Motif 13: Finite state machine (FSM)
The authors of "Berkeley's View" claim that for this motif "nothing helps". But we consider that a pipe of machines featuring speculative resources [6] provides benefits in acceleration. In fact, Connex's SA solves the problem if its speculative resources are activated.
Another way to use ConnexArray™ for FSM-oriented applications is to add to each cell, working as a PE, specific instructions for FSM emulation. The resulting system can be used as a speculative engine for deep packet search applications.
Connex's SA technology seems to be the first implementation of a machine able to deal with this rebellious motif.
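One generic way to speculate on an FSM is sketched below in plain Python. It is an illustration of the idea only and is not taken from the Connex SA design: the input is cut into chunks, each chunk is run from every possible entry state (the speculation), and a final selection keeps the single valid result per chunk. All names are hypothetical, and the transition table `delta` is a dictionary mapping each state to a dictionary from symbols to next states.

```python
def run_fsm(delta, state, symbols):
    """Plain sequential FSM: follow the transition table delta."""
    for s in symbols:
        state = delta[state][s]
    return state


def run_fsm_speculative(delta, start, symbols, n_chunks):
    """Speculate on the entry state of every chunk, then select."""
    size = max(1, -(-len(symbols) // n_chunks))        # ceiling division
    chunks = [symbols[i:i + size] for i in range(0, len(symbols), size)]
    # Speculation: each chunk is run from every state; chunks are independent.
    tables = [{q: run_fsm(delta, q, chunk) for q in delta} for chunk in chunks]
    # Selection: only one speculative result per chunk is kept.
    state = start
    for table in tables:
        state = table[state]
    return state


# Parity of the number of '1' symbols: state 0 = even, state 1 = odd.
parity = {0: {"0": 0, "1": 1}, 1: {"0": 1, "1": 0}}
print(run_fsm(parity, 0, "1101001"))                 # 0
print(run_fsm_speculative(parity, 0, "1101001", 4))  # 0
```

The speculative version does more total work (one run per state and chunk), but the chunks no longer wait for each other, which is the trade that speculative resources make possible.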
5. Concluding Remarks
1. Connex technology covers almost all motifs. Excepting motifs 11 and 12 (work on them is in progress), and possibly 9, the Connex technology performs very well.
2. The linear network connecting EUs is not a limitation. Because intense computational problems are characterized by an advanced locality, the simplest interconnection network is not a major limitation. The temporal dimension of the architecture in many cases helps to avoid the limitations imposed by the two simple interconnection networks.
3. The spatial & temporal dimensions are doing a good job together. The user's view of the machine is a two-dimensional array. Actually one dimension is in space (the 1024 EUs), and the other dimension is in time (the 512 16-bit words stored in each local randomly accessed memory). These two distinct dimensions allow Connex to optimize area resources, while the locality and the degree of parallelism are both kept at high values.
4. Time parallelism is rare, but unavoidable. Almost always, in a real complex application, all kinds of parallelism are involved. Some pure sequential processes sometimes represent uncomfortable corner cases solved only by the time parallel resources provided in the Connex architecture (see the 13th motif).
5. Connex organization is transparent. Because the interconnection network is simple, the internal organization of the machine is easy to make transparent to the user. The elegant solution offered by the VectorC language is a good proof of the high organizational transparency of the Connex technology.
References
[1] K. Asanovic, et al.: "The Landscape of Parallel Computing Research: A View from Berkeley", Technical Report No. UCB/EECS-2006-183, December 18, 2006. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
[2] Shekhar Y. Borkar, et al.: "Platform 2015: Intel Processor and Platform Evolution for the Next Decade", edited by R. M. Ramanathan and Vince Thomas, Intel Corporation, 2005.
[3] Pradeep Dubey: "A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera", Technology@Intel Magazine, Feb. 2005.
[4] W. Daniel Hillis: The Pattern on the Stone. The Simple Ideas that Make Computers Work, Basic Books, 1998.
[5] Mihaela Malita: http://www.anselm.edu/internet/compsci/Faculty_Staff/mmalita/HOMEPAGE/ResearchS07/WebsiteS07/index.html
[6] Mihaela Malita, Gheorghe Stefan, Dominique Thiebaut: "Not Multi-, but Many-Core: Designing Integral Parallel Architectures for Embedded Computation", in ACM SIGARCH Computer Architecture News, Volume 35, Issue 5, Dec. 2007, Special issue: ALPS '07 - Advanced low power systems; communication at International Workshop on Advanced Low Power Systems held in conjunction with 21st International Conference on Supercomputing, June 17, 2007, Seattle, WA, USA.
[7] Mihaela Malita, Gheorghe Stefan: "On the Many-Processor Paradigm", in: H. R. Arabnia (Ed.): Proceedings of the 2008 World Congress in Computer Science, Computer Engineering and Applied Computing, vol. PDPTA'08 (The 2008 International Conference on Parallel and Distributed Processing Techniques and Applications), 2008. http://arh.pub.ro/gstefan/pdpta08.pdf
[8] Bogdan Mîţu: "C Language Extension for Parallel Processing", BrightScale research report, 2008. http://arh.pub.ro/gstefan/VectorC.ppt
[9] David A. Patterson: "The Parallel Computing Landscape: A Berkeley View 2.0", keynote lecture at The 2008 World Congress in Computer Science, Computer Engineering and Applied Computing, Las Vegas, July 2008.
[10] Gheorghe Stefan: "The CA1024: A Massively Parallel Processor for Cost-Effective HDTV", in SPRING PROCESSOR FORUM JAPAN, June 8-9, 2006, Tokyo.
[11] Gheorghe Stefan: "The CA1024: SoC with Integral Parallel Architecture for HDTV Processing", invited paper at 4th International System-on-Chip (SoC) Conference & Exhibit, November 1 & 2, 2006, Radisson Hotel Newport Beach, CA.
[12] Gheorghe Stefan, Anand Sheel, Bogdan Mitu, Tom Thomson, Dan Tomescu: "The CA1024: A Fully Programmable System-On-Chip for Cost-Effective HDTV Media Processing", in Hot Chips: A Symposium on High Performance Chips, Memorial Auditorium, Stanford University, August 20 to 22, 2006.
[13] Gheorghe Stefan: "One-Chip TeraArchitecture", in Proceedings of the 8th Applications and Principles of Information Science Conference, Okinawa, Japan, 11-12 January 2009. http://arh.pub.ro/gstefan/teraArchitecture.pdf