

Connex Integral Parallel Architecture & the 13 Berkeley Motifs
(version 1.1)

Abstract: The Connex Integral Parallel Architecture is a many-cell engine designed to solve intense computational problems. The Connex technology is presented and analyzed from the point of view of each of the 13 computational motifs proposed in the Berkeley's View [1]. We conclude that the Connex technology works efficiently for almost all of the 13 computational motifs.




1. Introduction

Parallel processing offers two distinct solutions for the everlasting problems which challenge computer science: complexity and intensity. Coarse-grain networks of a few or dozens of big & complex processors performing multi-threaded computation are proposed for complex computation, while fine-grain networks of hundreds or thousands of small & simple processing cells are proposed for intense computation. The Connex architecture (see the history of the project at http://arh.pub.ro/gstefan/conexMemory.html) is designed to perform intense computation.


We take into account the fundamental differences between multi-processors and many-processors [2]. We see multi-processors as performing complex computations, while many-processors are designed for intense computations (the distinction between complex computation and intense computation is defined in [12]).


Academia provides a comprehensive survey on parallel computing in the seminal research report known as the Berkeley's View [1], where 13 computational motifs are identified as the main challenges for the emerging parallel paradigm.


This white paper investigates how the Connex architecture behaves with respect to the 13 computational motifs emphasized in the Berkeley's View.


2. Connex Integral Parallel Architecture

An Integral Parallel Architecture (IPA) is defined as a fine-grain cellular machine with structural resources for performing mainly data-parallel and time-parallel computations, and resources for computations requiring speculative execution. IPA considers two main data structures: vectors of scalars, processed in data parallel machines, and streams of scalars, processed in time parallel machines. Connex IPA performs the following types of parallel computations:





o data parallel computation: working on vectors and having as result vectors, scalars (by parallel reduction operations), or streams (applied as inputs for time parallel computations)

o time parallel computation: working on streams and having as result streams, scalars (by accumulating operations), or vectors (applied as inputs for data parallel computations)

o speculative parallel computation (expanded mainly inside the time parallel computation, but sometimes inside the data parallel computation): working on scalars and having as result vectors reduced immediately by selection (the simplest reduction function) to a scalar

o reduction parallel computation: working on vectors with scalar results.


Almost all parallel computations are data parallel (with the associated reduction operations), but some of them involve time parallel processes, supported by speculative computations if needed.
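As an illustration, here is a scalar C++ model of a data parallel operation combined with its reduction (a sketch, not Connex code: the elementwise addition stands for one step executed by all EUs at once, the accumulation stands for the hardware reduction):

    #include <cstdint>

    // Scalar model: one data parallel operation followed by an add reduction.
    int32_t add_and_reduce(const int16_t a[], const int16_t b[],
                           int16_t c[], int n) {
        int32_t sum = 0;
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];   // data parallel: all EUs do this in one step
            sum += c[i];          // reduction: done by dedicated hardware
        }
        return sum;               // the scalar result of the reduction
    }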


Data parallel engine

The Connex computational structure performing data parallel computation (with the associated reduction operations) is a fine-grain network of small & simple execution units (EU) working as a many-cell machine.


The current embodiment of the Connex many-core section (see Figure 1) is a linear array of n = 1024 EUs. Each EU is a machine working on words of p = 16 bits, with a 1 KB local data memory. This memory allows the storage of m = 512 vectors of 16-bit components. The processing array works in parallel with an IO plane (IOP) used to transfer w = 128-bit data words between the array and the external memory. The architecture is scalable: all the previous parameters scale up or down (usually: n = 64 ... 4096, p = 8 ... 64, m = 64 ... 16384, w = 32 ... 1024).


The array is controlled by the Instruction Sequencer (IS), a simple controller, while the IOP transfers data under the control of another machine called the IO Controller (IOC). Thus, data processing and data transfer are two independent processes performed in parallel. Data exchange between the processing array and the IO plane is performed in one clock cycle and is synchronized by hardware mechanisms.


For time parallel computation, a dynamically reconfigurable network of 8 simple processors is provided outside of the processing array.

Speculative computation is performed in both networks.






Figure 1. The Connex data parallel engine. The processing array is paralleled by the IO Plane, which performs data transfers transparent to the processing.


Data parallel architecture

The user's view of the data parallel machine is represented in Figure 2. The linear cellular machine containing 1024 EUs performs parallel operations on the vectors stored in the two-dimensional array of scalars and Booleans.


The number of cells, n, provides the spatial dimension of the array, while the number of words stored in each cell provides the temporal dimension of the array (for example, in Figure 2, n = 1024, m = 512). On the spatial dimension the "distance" between two scalars in a vector is in O(n). On the temporal dimension the "distance" between two scalars is in O(1). Indeed, for performing an operation between two scalars stored in two different cells, the time needed to bring them into the same EU is proportional, in the worst case, to the number of cells, while two operands stored in the data memory of an EU are accessed in a few cycles using random addressing.


The two dimensions of the Connex architecture (the horizontal, spatial dimension and the vertical, temporal dimension) must be carefully used by the programmer in order to squeeze the maximum of performance from ConnexArray™. Two kinds of vectors are defined in this array: horizontal vectors (along the spatial dimension) and vertical vectors (along the temporal dimension). In the following we consider only horizontal vectors, called simply vectors. (When vertical vectors are considered, it will be specified.)


Operations on vectors are performed in a small fixed number of cycles. Some generic operations are exemplified in the following:




PROCESSING OPERATIONS, performed in the processing array under the control of IS:

o full vector operation: {carry, v5} = v4 + v3;
  the corresponding integer components of the two operand vectors (v4 and v3) are added, and the result is stored in the scalar vector v5 and in the Boolean vector carry

o Boolean operation: s3 = s3 & s5;
  the corresponding Boolean components of the two operand vectors, s3 and s5, are ANDed, and the result is stored in the result vector s3



Figure 2. The internal state of the Connex data parallel machine. There are m = 512 integer (horizontal) vectors, each having n = 1024 16-bit integer components (vi[j] is a 16-bit integer), and 8 selection vectors, each having 1024 Booleans (sk[j] is a Boolean).



o predicated execution: v1 = s2 ? v3 - v2 : v1;
  in any position where s2 = 1 the corresponding components are operated on, while in the rest (i.e., elsewhere) the content of the result vector remains unchanged (it is a "spatial if" statement)

o vector rotate: v7 = v7 >> n;
  the content of vector v7 is rotated n positions right, i.e., v7[i] = v7[(i+n) mod 1024]
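For reference, a scalar C++ model of the vector rotate semantics given above (a sketch; on the array the shift is performed by the inter-cell network, not by a loop):

    #include <cstdint>

    // Rotate the 1024-component vector v7 n positions to the right:
    // v7[i] = v7[(i + n) mod 1024].
    void vector_rotate(uint16_t v7[1024], int n) {
        uint16_t tmp[1024];
        for (int i = 0; i < 1024; i++)
            tmp[i] = v7[(i + n) % 1024];  // read the shifted position
        for (int i = 0; i < 1024; i++)
            v7[i] = tmp[i];               // write back the rotated vector
    }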



INPUT-OUTPUT OPERATIONS, performed in IOP under the control of IOC:

o strided load: load v5 address burst stride;
  the content of v5 is loaded with data from the external memory, accessed starting from the address address, using bursts of size burst, on a stride of stride

o scattered load: sload v3 highAddress v9 addr stride;
  the content of v3 is loaded with data from the external memory, indirectly accessed using the content of the address vector v9, whose content is used starting from the index address addr, on a stride of stride; the address vector is structured in pairs of 16-bit words; each of the 512 resulting 32-bit words is organized as follows:

      {dummy, burst[5:0], address[24:0]}

  where: if dummy == 1, then a burst of {burst[5:0], 1'b0} dummy bytes is loaded, else a burst of {burst[5:0], 1'b0} bytes from the address {highAddress, addr, 1'b0} is loaded (it is a sort of indirect load)

o strided store: store v7 address burst stride;

o gathered store: gstore v4 highAddress v3 addr stride;
  (it is a sort of indirect store).
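A scalar C++ model may clarify the addressing pattern of the strided load (a sketch under the semantics stated above; actual transfers are 128-bit IOP bursts, and the function and parameter names are illustrative):

    #include <cstddef>
    #include <cstdint>

    // Fill vector v with bursts of 'burst' consecutive words taken from the
    // external memory every 'stride' words, starting at 'address'.
    void strided_load(uint16_t v[], size_t n, const uint16_t mem[],
                      size_t address, size_t burst, size_t stride) {
        size_t i = 0;
        while (i < n) {
            for (size_t j = 0; j < burst && i < n; j++)
                v[i++] = mem[address + j];  // one burst of consecutive words
            address += stride;              // skip to the start of the next burst
        }
    }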


VectorC: the programming language for the data parallel architecture

The Connex data parallel engine is programmed in VectorC, a C++ language extension for parallel processing defined by Connex [8]. The extension is made by adding new primitive data types and by extending the existing operators to accept the new data types. In the VectorC programming language the conditional statements become predication statements.
The new data primitives are:

o int vector: vector of integers (stored as a pair of 16-bit integer vectors)

o short vector: vector of shorts (stored as a 16-bit integer vector)

o byte vector: vector of bytes (two byte vectors are stored as an integer vector)

o selection: vector of Booleans
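The packing can be pictured with a scalar C++ sketch (illustrative only; the actual VectorC types hide this layout from the user): a 32-bit int vector occupies two 16-bit vectors, matching the p = 16 bit EU word.

    #include <cstdint>

    // Illustrative layout of an int vector on a 1024-EU, 16-bit-word array.
    struct IntVector {
        uint16_t lo[1024];  // low 16 bits of each 32-bit component
        uint16_t hi[1024];  // high 16 bits of each 32-bit component
    };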


In order to explain how VectorC works, let us consider the following variable declarations:

    int i1, i2, i3;
    bool b1, b2, b3;
    int vector v1, v2, v3;
    selection s1, s2, s3;


Then a VectorC statement like:

    v3 = v1 + v2;

replaces this style of for statement:

    for (int i = 0; i < VECTOR_SIZE; i++)
        v3[i] = v1[i] + v2[i];

and

    s3 = s1 && s2;

replaces this for statement:

    for (int i = 0; i < VECTOR_SIZE; i++)
        s3[i] = s1[i] && s2[i];


The scalar statement

    if (b1) {i3 = i1 + i2};

is written in VectorC as the following vector predication statement:

    WHERE (s1) {v3 = v1 + v2};

replacing this nested for:

    for (int i = 0; i < VECTOR_SIZE; i++)
        if (s1[i])
            v3[i] = v1[i] + v2[i];

Similarly,

    i3 = (b1)? i1 : i2;

is extended to accept vectors:

    v3 = (s1)? v1 : v2;


Here is an example in VectorC computing the absolute difference of two vectors.

    vector absdiff(vector V1, vector V2);

    int main() {
        vector V1 = 2;
        vector V2 = 3;
        vector V;
        V = absdiff(V1, V2);
        return 0;
    }

    vector absdiff(vector V1, vector V2) {
        vector V;
        V = V1 - V2;
        WHERE (V < 0) {
            V = -V;
        }
        ENDW
        return V;
    }
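Following the style of the scalar equivalents above, the loop the array executes in parallel inside absdiff can be sketched as (VECTOR_SIZE = 1024 in the current embodiment):

    for (int i = 0; i < VECTOR_SIZE; i++) {
        V[i] = V1[i] - V2[i];  // full vector subtraction
        if (V[i] < 0)          // the WHERE predication
            V[i] = -V[i];
    }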


See a few introductory examples in [5], where the VectorC library is posted.


Connex data parallel engine by the numbers

The latest implementation of ConnexArray™ provided the following measured performances:

o computation: 400 GOPS (16-bit operations) at 400 MHz (peak performance)

o external bandwidth: 6.4 GB/sec (peak performance)

o internal bandwidth: 800 GB/sec (peak performance)

o power: < 3 Watt

o area: < 50 mm² (1024-EU array, including 1 MByte of memory and the two controllers with their local program and data memory).

Design & technology parameters:

o 65 nm technology

o fully synthesized (no custom memories or layout)

o Standard Chartered Semiconductor "G" process


Time parallel architecture

Amdahl's law is the argument used against the extensive use of parallel computation. But this very old argument (1967) was coined in the pioneering era of parallel computing. Meanwhile the theoretical and technological context has changed, and a lot of new data about how parallelism works has accumulated. In 1998 Daniel Hillis expressed his opinion as follows:

"I now understand the flaw in Amdahl's argument. It lies in the assumption that a fixed portion of the computation, even just 10 percent, must be sequential. This estimate sounds plausible, but it turns out not to be true of most large computations. The false intuition came from a misunderstanding about how parallel processors would be used. ... Even problems that most people think are inherently sequential can usually be solved efficiently on a parallel computer." ([4], pp. 114-115)


Indeed, for "large computations" pipelining, speculating, or speculating in a pipe, on the structural support of a many-cell engine, provides the architectural context for executing in parallel pure sequential computations when big streams of data are waiting to be processed.


Figure 3. The Connex time parallel engine. a. The pipe without speculation. b. The pipe with speculation. The i-th stage in the pipe is computed by q PEs dynamically configured in a speculative linear array. PE i+q selects dynamically as input only one of the outputs of the speculative array.

Time parallel computation is supported in the current implementation of the Connex chip by a small configurable network of processing elements called the Stream Accelerator (SA). The network works like a pipe of processors in which at any point two or more machines can be connected in parallel to support speculation. In the actual implementation 8 machines are used. Functions like the CABAC decoding procedure, a highly sequential computation with strong data dependencies, are efficiently executed in a small program.


The computation of SA is defined by two mechanisms:

o stream of functions, containing m programs p_j(σ):

    S(σ_in) = <p_0(σ_in), p_1(σ_0), ..., p_{m-1}(σ_{m-2})> = σ_out

  applied to a stream of scalars σ_in, generating a stream of scalars σ_out as output, where p_j(σ) is a program which processes the stream σ and generates the stream σ_j; this is a type of MIMD computation

o vector of functions, containing q programs p_j(σ):

    V(σ_in) = [p_0(σ_in), ..., p_{q-1}(σ_in)]

  applied to a stream of scalars σ_in, generating a stream of q-component vectors; this is a true MISD computation.
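Per token, the stream of functions behaves like a composition of the stage programs; a scalar C++ sketch (illustrative: in hardware the m stages overlap across the stream, giving one result per cycle after a latency of m):

    #include <cstdint>

    // Apply the m stage programs p_0 ... p_{m-1} to one token of the stream.
    uint16_t stream_of_functions(uint16_t s, uint16_t (*p[])(uint16_t), int m) {
        for (int j = 0; j < m; j++)
            s = p[j](s);  // stage j consumes the output of stage j-1
        return s;         // one scalar of the output stream sigma_out
    }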


The latency introduced by a stream of functions is m, but the stream is computed in real time (see Figure 3a). Vectors of functions are used to perform speculation when the computation requests it, in order to preserve the possibility of real time computation (see Figure 3b). For example:

    <p_0(σ_in), p_1(σ_0), ..., p_{i-1}(σ_{i-2}), V(σ_{i-1}), p_{i+q}(σ_?), ..., p_{m-1}(σ_{m-2})>

is a computation which performs a speculation in stage i of the pipe. The program p_{i+q}(σ_?) selects from the vectors generated by V(σ_{i-1}) only one component as input.
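The selection step can be modeled in scalar C++ as follows (a sketch; 'choose' stands for the data-dependent selection performed by p_{i+q}, and the names are illustrative):

    #include <cstdint>

    // Stage i runs q candidate programs on the same input (the vector of
    // functions V); the next program keeps exactly one of the q results.
    uint16_t speculate(uint16_t s, uint16_t (*cand[])(uint16_t), int q,
                       int (*choose)(const uint16_t*, int)) {
        uint16_t out[64];            // q <= 64 assumed for the sketch
        for (int k = 0; k < q; k++)
            out[k] = cand[k](s);     // all candidates computed in parallel
        return out[choose(out, q)];  // p_{i+q} selects one component
    }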


There is a Connex version having the stream accelerator functionality integrated with the data parallel engine.


3. Berkeley's View

The Berkeley's View [1] provides a comprehensive presentation of the problems to be solved by the emerging actor on the computing market: the ubiquitous parallel paradigm. For many decades an academic topic, parallelism became an important actor on the market after 2001, when the clock rate race stopped. This research report presents 13 computational motifs (initially called dwarfs, they are renamed motifs in [9]) which cover the main aspects of parallel computing. They are defined independently of any specific parallel architecture. In the next section we make a preliminary evaluation of them in the context of Connex's IPA.

4. Connex's Performance

Connex's cellular network has the simplest possible interconnection network. This is both an advantage and a limitation. On one hand, the area of the system is minimized, and it is easy to hide the associated organization from the user, with no loss in programmability or in the efficiency of compilation. The limitation appears in certain application domains. What follows are short comments about how the Connex architecture works for each of the 13 Berkeley's View motifs.


Motif 1: Dense linear algebra

The computation in this domain operates mainly on N × M matrices. The operations performed are: matrix addition, scalar multiplication, transposition of a matrix, dot product of vectors, matrix multiplication, determinant of a matrix, (forward & backward) Gaussian elimination, solving systems of linear equations, and inverse of an N × M matrix.


Depending on the product N × M, the internal representation of the matrix is decided. If the product is small enough (usually, no bigger than 128), each matrix can be expanded as a vertical vector and associated to one EU, resulting in 1024 matrices represented by N × M 1024-element horizontal vectors. But, if the product N × M is big, then P EUs are associated with each matrix, resulting in the parallel processing of 1024/P matrices represented in (N × M)/P 1024-element vectors.


For all the operations listed above, the computation is usually accelerated almost 1024 times, and never less than 1024/P times. This is possible because special hardware is provided for reduction operations in the ConnexArray™ (for example, adding 1024 16-bit numbers takes 4-20 clock cycles, depending on the actual implementation of the reduction network associated to the array).
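For instance, a dot product maps to one elementwise multiply step plus one hardware reduction; its scalar C++ model is (a sketch, not Connex code):

    #include <cstdint>

    // Dot product of two 1024-component vectors: the multiply is one SIMD
    // step on the array; the sum is collapsed by the reduction network.
    int64_t dot(const int16_t a[1024], const int16_t b[1024]) {
        int64_t acc = 0;
        for (int i = 0; i < 1024; i++)
            acc += (int32_t)a[i] * b[i];  // multiply, then reduce
        return acc;
    }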


Motif 2: Sparse linear algebra

There are two types of sparse matrices: (1) randomly distributed sparse arrays (represented by a few types of lists), and (2) band arrays (represented by a stream of short vectors).

For small random sparse arrays, converting them internally into dense arrays is a good solution. For big random sparse arrays, the associated list is operated on using efficient search operations. For band arrays, systolic-like solutions are proposed.

Connex's intense computation engine handles these types of linear algebra problems very well. Acceleration is between 128 and 1024 times, depending on the density of the array.


Motif 3: Spectral methods

The typical examples are FFT or wavelet computation. Because of the "butterfly" data movement, how the FFT computation is implemented depends on the length of the sample. The spatial and the temporal dimensions of the Connex array help the programmer to easily adapt the data representation so as to obtain an almost linear acceleration, i.e., in the worst case the acceleration is not less than 80% of the linear acceleration, which is 1024.


Evaluation report:

Using VectorC as a simulation environment, FFT computation was evaluated for the Connex architecture. For a ConnexArray™ version with n = 1024, p = 32, m = 512, w = 256, which can be implemented in 65 nm on 1 cm², a 4096-sample FFT is computed at 1.6 clock cycles per sample, a 1024-sample FFT is computed at 1 clock cycle per sample, a 256-sample FFT is computed in less than 0.5 clock cycles per sample, and a 64-sample FFT is computed in less than 0.2 clock cycles per sample. At 400 MHz the result is a 10 Watt chip.



The algorithm loads each stream of samples as 16 64-element vertical vectors. Thus, the array works simultaneously on 64 FFTs. This is a good example of the optimizations allowed by the two dimensions of the Connex architecture. If only the spatial dimension is used, loading all the 1024 samples as a single horizontal vector, then the same computation is done in 8 clock cycles per sample instead of only one. Almost one order of magnitude is gained only by playing with the two dimensions of our architecture.


Motif 4: N-Body method

This method fits perfectly on Connex's architecture, because for j = 0 to j = n-1 the following equation must be computed:

    U(x_j) = Σ_i F(x_j, X_i)

Each function F(x_j, X_i) is computed on a single EU, and then the sum is a reduction operation linearly accelerated by the array. Depending on the value of n, the data is distributed in the processing array using the spatial dimension only, or, for large n, both the spatial and the temporal dimensions. This motif also results in an almost linear acceleration.
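The mapping can be summarized with a scalar C++ sketch (illustrative; F stands for whatever pairwise interaction the application defines): the products F(x_j, X_i) are distributed over the EUs, and the sum is the accelerated reduction.

    // Scalar model of the N-body update U(x_j) = sum_i F(x_j, X_i).
    void nbody(double U[], const double x[], const double X[], int n,
               double (*F)(double, double)) {
        for (int j = 0; j < n; j++) {   // one body per EU (or per EU group)
            U[j] = 0.0;
            for (int i = 0; i < n; i++)
                U[j] += F(x[j], X[i]);  // the sum is a hardware reduction
        }
    }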


Motif 5: Structured grids

The grid is distributed on the two dimensions of our array: the spatial dimension and the temporal dimension. Each processor is assigned a line of nodes (on the spatial dimension). It performs each update step locally and independently of other lines of nodes. Each node only has to communicate with neighboring nodes on the grid, exchanging data at the end of each step. The system works as a cellular automaton. The computation is accelerated almost linearly on Connex's architecture.
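A one-dimensional stencil shows the communication pattern in scalar C++ (a sketch; each EU would hold one line of nodes and exchange only boundary values with its neighbors):

    // One update step of a structured grid: every node reads only its
    // immediate neighbors, as in a cellular automaton.
    void grid_step(double next[], const double cur[], int n) {
        for (int i = 1; i < n - 1; i++)
            next[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0;  // local rule
    }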


Motif 6: Unstructured grids

Unstructured grid problems are described as updates on an irregular grid, where each grid element is updated from its neighbor grid elements. Parallel computation is disturbed when the problem size is large, and the non-uniformity of the data distribution would best utilize special access mechanisms. In order to solve the non-uniformity problem for the Connex Array, a preprocessing step is required.


The algorithm for preprocessing the n-element unstructured grid representation starts from an initial list of grid elements

    G = {g_0, ..., g_{n-1}}

and provides the minimum number of vectors, following the steps sketched here:

o the n × n interconnection matrix for the n grid elements is generated

o by interchanging elements in the list G, a minimal band matrix is generated

o each diagonal of the band represents a vector loaded into the processing array

o the result is a grid with some dummy elements, but each actual grid element has its neighborhood located in a few adjacent EUs.

Depending on the list G, the preprocessing can be performed in the Connex array or in an external standard processor. The resulting acceleration is smaller than for the structured grid (depending on the computation involved in each grid element, 10-20% of the acceleration is lost).


Motif 7: Map reduce

The typical example of a map reduce computation is the Monte Carlo method. This method consists of many completely independent computations working on randomly generated data. This type of computation is highly parallel. Sometimes it requires the add reduction function, for which the Connex architecture has special accelerating hardware. The computation is linearly accelerated.
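A scalar C++ sketch of the map reduce pattern (illustrative; each EU would draw its own random samples, and the final add is the hardware reduction):

    #include <cstdlib>

    // Monte Carlo estimate of pi: independent "map" trials, one add "reduce".
    double monte_carlo_pi(int samples) {
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            double x = (double)std::rand() / RAND_MAX;  // map: independent
            double y = (double)std::rand() / RAND_MAX;  // random experiments
            if (x * x + y * y <= 1.0) hits++;           // reduce: add
        }
        return 4.0 * hits / samples;
    }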


Motif 8: Combinational logic

There are a lot of very different problems falling in this class. We list here only the most important and most frequently used ones:

o block processing, exemplified by AES encryption algorithms; it works on 4 × 4 arrays of bytes, each array is loaded in one EU, and the processing is completely SIMD-like, with linear acceleration on the Connex Array

o stream processing, exemplified by convolution methods which do not use blocks, processing instead a continuous bit stream; it is computed very efficiently in Connex's time parallel accelerator (SA), with no speculation involved

o image rotation for black & white or color bit-mapped images is performed first by loading the m × m array of pixels into the processing array on both dimensions (spatial and temporal), second by executing a local transformation, and third by restoring the transformed image in the appropriate place; this is done very efficiently on the Connex Array

o route lookup, used in networking; it supposes three database-like operations: longest match, insert, delete, all performed very efficiently by the Connex processing array.


Motif 9: Graph traversal

The array of 1024 machines can be used as a big "speculative device". Each EU starts with a full graph stored in its data memory, and the computation provides the result when one EU, if any, finds the solution. Limitations are generated by the size of the data memory of each EU. More investigation is needed to evaluate the actual power of the Connex technology in solving this problem.

Some problems related to graphs are easily solved if matrix computation is involved (example: computing the distance between all the elements of a graph).


Motif 10: Dynamic programming

Viterbi decoding is the example presented in [1]. It best fits the modular feed-forward architecture of SA, built as a distinct network (as currently implemented at Connex) or integrated into the main data parallel processing array. Very long streams of bits are computed in parallel by the pipeline structure of Connex's SA.


Motif 11: Back-track and branch & bound

Motif under investigation (the Berkeley's View is silent regarding this motif).


Motif 12: Graphical models

Motif under investigation (the Berkeley's View is silent regarding this motif).


Motif 13: Finite state machine (FSM)

The authors of the Berkeley's View claim that for this motif "nothing helps". But we consider that a pipe of machines featured with speculative resources [6] provides benefits in acceleration. In fact, Connex's SA solves the problem if its speculative resources are activated.

Another way to use ConnexArray™ for FSM oriented applications is to add to each cell, working as a PE, specific instructions for FSM emulation. The resulting system can be used as a speculative engine for deep packet search applications.

Connex's SA technology seems to be the first implementation of a machine able to deal with this rebellious motif.


5. Concluding Remarks

1. The Connex technology covers almost all motifs. Excepting motifs 11 and 12 (work on them in progress), and possibly 9, the Connex technology performs very well.

2. The linear network connecting EUs is not a limitation. Because intense computational problems are characterized by an advanced locality, the simplest interconnection network is not a major limitation. The temporal dimension of the architecture in many cases helps to avoid the limitations imposed by the two simple interconnection networks.

3. The spatial & temporal dimensions are doing a good job together. The user's view of the machine is a two-dimensional array. Actually, one dimension is in space (the 1024 EUs), and the other dimension is in time (the 512 16-bit words stored in each local randomly accessed memory). These two distinct dimensions allow Connex to optimize area resources, while the locality and the degree of parallelism are both kept at high values.

4. Time parallelism is rare, but unavoidable. Almost any time in a real complex application all kinds of parallelism are involved. Some pure sequential processes sometimes represent uncomfortable corner cases, solved only by the time parallel resources provided in the Connex architecture (see the 13th motif).

5. The Connex organization is transparent. Because the interconnection network is simple, the internal organization of the machine is easy to make transparent to the user. The elegant solution offered by the VectorC language is a good proof of the high organizational transparency of the Connex technology.


References

[1] K. Asanovic, et al.: "The Landscape of Parallel Computing Research: A View from Berkeley", Technical Report No. UCB/EECS-2006-183, December 18, 2006. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf

[2] Shekhar Y. Borkar, et al.: "Platform 2015: Intel Processor and Platform Evolution for the Next Decade", edited by R. M. Ramanathan and Vince Thomas, Intel Corporation, 2005.

[3] Pradeep Dubey: "A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera", Technology@Intel Magazine, Feb. 2005.

[4] W. Daniel Hillis: The Pattern on the Stone. The Simple Ideas that Make Computers Work, Basic Books, 1998.

[5] Mihaela Malita: http://www.anselm.edu/internet/compsci/Faculty_Staff/mmalita/HOMEPAGE/ResearchS07/WebsiteS07/index.html

[6] Mihaela Malita, Gheorghe Stefan, Dominique Thiebaut: "Not Multi-, but Many-Core: Designing Integral Parallel Architectures for Embedded Computation", in ACM SIGARCH Computer Architecture News, Volume 35, Issue 5, Dec. 2007, Special issue: ALPS '07 - Advanced low power systems; communication at the International Workshop on Advanced Low Power Systems held in conjunction with the 21st International Conference on Supercomputing, June 17, 2007, Seattle, WA, USA.

[7] Mihaela Malita, Gheorghe Stefan: "On the Many-Processor Paradigm", in: H. R. Arabnia (Ed.): Proceedings of the 2008 World Congress in Computer Science, Computer Engineering and Applied Computing, vol. PDPTA'08 (The 2008 International Conference on Parallel and Distributed Processing Techniques and Applications), 2008. http://arh.pub.ro/gstefan/pdpta08.pdf

[8] Bogdan Mîţu: "C Language Extension for Parallel Processing", BrightScale research report, 2008. http://arh.pub.ro/gstefan/VectorC.ppt

[9] David A. Patterson: "The Parallel Computing Landscape: A Berkeley View 2.0", keynote lecture at The 2008 World Congress in Computer Science, Computer Engineering and Applied Computing, Las Vegas, July 2008.

[10] Gheorghe Stefan: "The CA1024: A Massively Parallel Processor for Cost-Effective HDTV", in SPRING PROCESSOR FORUM JAPAN, June 8-9, 2006, Tokyo.

[11] Gheorghe Stefan: "The CA1024: SoC with Integral Parallel Architecture for HDTV Processing", invited paper at the 4th International System-on-Chip (SoC) Conference & Exhibit, November 1 & 2, 2006, Radisson Hotel Newport Beach, CA.

[12] Gheorghe Stefan, Anand Sheel, Bogdan Mitu, Tom Thomson, Dan Tomescu: "The CA1024: A Fully Programmable System-On-Chip for Cost-Effective HDTV Media Processing", in Hot Chips: A Symposium on High Performance Chips, Memorial Auditorium, Stanford University, August 20 to 22, 2006.

[13] Gheorghe Stefan: "One-Chip TeraArchitecture", in Proceedings of the 8th Applications and Principles of Information Science Conference, Okinawa, Japan, 11-12 January 2009. http://arh.pub.ro/gstefan/teraArchitecture.pdf