
7 Nov 2013


Eine Zeitreise in die Welt der Computer.


Pipelined Vector Processing and Scientific
Computation



John G. Zabolitzky





Applications of High-Performance Computing

- weather prediction, climatic simulation
- fluid dynamics simulation (aerodynamics for aerospace, automobile, combustion, ...)
- basic science
  - cosmology
  - quantum mechanical many-body problems
    - chemistry
    - solid-state
    - quantum fluids
  - high-energy physics
- cryptography
- weapons research
- energy research
  - nuclear reactor simulation
  - fusion research
- many, many more



Terminal State of Scalar Computing: CDC 7600, 1968

Maximum RISC performance of 1 operation/cycle achieved

No further improvement possible without change of paradigm

36 MHz => 36 MIPS => 5 MFLOPS real


The CDC 7600 (designed by Seymour Cray) was the most powerful of all computers from 1968 to 1976, when the Cray-1 achieved more than 10 times its performance.


Pipelined Scalar Execution

time ====>
instruction   1         2         3         4         5         6         7
1             fetch     decode    execute
2                       fetch     decode    execute
3                                 fetch     decode    execute
4                                           fetch     decode    execute
5                                                     fetch     decode    execute
6                                                               fetch     decode
7                                                                         fetch

Pipelined execution on parallel functional units
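
The staggered schedule in the diagram can be reproduced with a small sketch. It assumes an idealized three-stage pipeline (fetch, decode, execute) that accepts one new instruction per cycle and never stalls; the names `pipeline_schedule` and `total_cycles` are illustrative and do not come from the slides.

```python
STAGES = ("fetch", "decode", "execute")

def pipeline_schedule(n_instructions, stages=STAGES):
    """Return {instruction: {cycle: stage}} for an ideal, stall-free pipeline."""
    schedule = {}
    for instr in range(1, n_instructions + 1):
        # Instruction k enters the first stage in cycle k
        # and advances by one stage per cycle.
        schedule[instr] = {instr + s: name for s, name in enumerate(stages)}
    return schedule

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles until the last instruction has left the last stage."""
    return n_instructions + n_stages - 1

if __name__ == "__main__":
    sched = pipeline_schedule(7)
    for instr, row in sched.items():
        # Show only the first seven cycles, as in the diagram.
        cells = [row.get(cycle, "").ljust(10) for cycle in range(1, 8)]
        print(f"{instr}  " + "".join(cells).rstrip())
    print("7 instructions complete after", total_cycles(7), "cycles")
```

Under these assumptions the cost of n instructions is n plus the pipeline depth minus one, so the fill time matters less and less as the instruction stream gets longer.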

Scalar Code Example

DO i=1,100 a(i)=b(i)*c(i)

load b, inc address

load c, inc address

multiply

store a, inc address

decrement count, loop?

5 instructions = 5 cycles (optimum) for one multiply

pipelined multiplier: could start one multiply each and every cycle => only 20% efficient use

expensive multiplier sits idle most of the time
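
The 20% figure follows directly from counting instructions. The sketch below is only an illustration: it assumes one instruction issued per cycle, and `LOOP_BODY`, `multiplier_utilization` and `multiplies_per_second` are made-up names. The 36 MHz clock used in the demo is the CDC 7600 figure quoted earlier in the deck.

```python
# Hypothetical model of the five-instruction loop body listed above.
LOOP_BODY = ["load b", "load c", "multiply", "store a", "count/branch"]

def multiplier_utilization(loop_body):
    """Fraction of cycles in which a new multiply starts,
    assuming exactly one instruction is issued per cycle."""
    multiplies = loop_body.count("multiply")
    return multiplies / len(loop_body)

def multiplies_per_second(clock_hz, loop_body):
    """Upper bound on multiply throughput for this loop at a given clock."""
    return clock_hz * multiplier_utilization(loop_body)

if __name__ == "__main__":
    print("utilization:", multiplier_utilization(LOOP_BODY))
    print("100 iterations take", 100 * len(LOOP_BODY), "cycles")
    print("multiplies/s at 36 MHz:", multiplies_per_second(36e6, LOOP_BODY))
```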


Architectural Alternatives

* Pipelined Scalar (RISC), as outlined before

* Pipelined Vector (the subject of the rest of this presentation)

* SIMD (Single Instruction Multiple Data) parallel arithmetic (e.g., ILLIAC IV):
  too expensive and inefficient, because it needs a larger number of lightly used multipliers

* Superscalar = multiple issue in one cycle:
  all modern single-chip CPUs (Intel to TI); the aim is to keep all functional units busy

* VLIW (Very Long Instruction Word) = variant of Superscalar

* MIMD (Multiple Instruction Multiple Data): truly parallel instruction streams, e.g. Cray T3E, IBM Blue Gene, IBM Cell; may be superimposed on top of ANY CPU architecture



Vector Computation

Scientific codes spend a high percentage of their time looping over simple data structures

DO i=1,100 a(i) = b*c(i) + d(i)

simple logical structure ==>

hardware can be set up to deliver one multiply per cycle

one instruction for the entire loop

MFLOP rate = cycle rate or a multiple thereof

specialized for scientific/engineering tasks



Vector Pipeline c(i)=a(i)*b(i)

time   fetch a(i++),   multip. 1   multip. 2   multip. 3   multip. 4   store c(i++)
       fetch b(i++)
 |     i=1
 |     2               1
 |     3               2           1
 |     4               3           2           1
 |     5               4           3           2           1
 V     6               5           4           3           2           1
       7               6           5           4           3           2
       8               7           6           5           4           3

Inventor: Henry Ford


Need to vectorize: some of it is automatic, but high quality requires hand-optimization

Naive scalar code for matrix multiply

    s = 0.0
    do j = 1, n
       s = s + a(i,j)*b(j,k)
    end do

Recursive on s => adder pipeline blocked

Vector code for matrix multiply

    do i = 1, n
       c(i,k) = c(i,k) + a(i,j)*b(j,k)
    end do

Independent vector elements, but 1.5x the memory bandwidth (load c and a, store c: three memory accesses per element instead of two)

Frequently a good idea: exchange inner and outer loops
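
The two loop orderings can be compared side by side in a small sketch. Python has no vector hardware, so this only illustrates the dependence structure and confirms that both orderings compute the same product; the function names are illustrative, and the indices are 0-based, unlike the Fortran-style fragments above.

```python
def matmul_inner_product(a, b):
    """Naive ordering: the innermost loop over j accumulates into the scalar s."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(p):
            s = 0.0
            for j in range(m):
                # Each addition needs the previous value of s (a recurrence).
                s = s + a[i][j] * b[j][k]
            c[i][k] = s
    return c

def matmul_column_update(a, b):
    """Interchanged ordering: the innermost loop over i updates independent c[i][k]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for k in range(p):
        for j in range(m):
            bjk = b[j][k]  # a scalar for the whole inner loop
            for i in range(n):
                # Different i never touch the same element, so there is no recurrence.
                c[i][k] = c[i][k] + a[i][j] * bjk
    return c

if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_inner_product(a, b))
    print(matmul_column_update(a, b))
```

In the second version the inner loop reads and writes c(i,k) on every pass, which is where the extra memory traffic mentioned above comes from.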


First Vector Computers

Control Data Corporation (CDC) STAR-100 [STring ARray, 100 MFLOPS]

memory-to-memory architecture

therefore long startup times (several hundred cycles)

very slow scalar unit (~2 MFLOPS)

overall disappointing performance

contracted 1967, announced 1972, delivered 1974

total of 4 machines, 2 of them at Lawrence Livermore Laboratory

Thornton (CDC) and Fernbach (LLL) lose their jobs



CDC STAR-100

Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis


Texas Instruments ASC


Advanced Scientific Computer, early 1970s


architecturally similar to CDC STAR-100


7 units sold


TI dropped out of mainframe computer manufacturing after this machine



Vector Performance I

MFLOP rate (MFLOPS) as a function of vector length n

scalar: ~constant (some loop overhead, then n * loop time)

vector: (n = length of vector)

# cycles = startup + n / nflop_per_cycle

rate/clock = #ops / #cycles ~ n / (startup + n)   (for nflop_per_cycle = 1)

half rate at vector length n ~ startup

full rate needs n >> startup => “Long Vector Machine”
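
The cycle-count model above is easy to tabulate. The sketch below implements exactly that formula; the function names and the sample startup values (10 and 100 cycles) are illustrative choices, not measurements of any particular machine.

```python
def vector_cycles(n, startup, results_per_cycle=1):
    """# cycles = startup + n / nflop_per_cycle"""
    return startup + n / results_per_cycle

def ops_per_clock(n, startup, results_per_cycle=1):
    """Achieved operations per clock for a vector of length n."""
    return n / vector_cycles(n, startup, results_per_cycle)

def half_rate_length(startup, results_per_cycle=1):
    """Vector length at which half the peak rate is reached.

    Solving n / (startup + n/r) = r/2 for n gives n = r * startup.
    """
    return startup * results_per_cycle

if __name__ == "__main__":
    for startup in (10, 100):
        for rate in (1, 4):
            row = [round(ops_per_clock(n, startup, rate), 2)
                   for n in (10, 50, 100, 500)]
            print(f"startup={startup:3d} rate={rate}: {row}, "
                  f"half rate at n={half_rate_length(startup, rate)}")
```

With several results per cycle the half-rate length grows in proportion, so a faster pipeline needs even longer vectors, or a shorter startup, to reach its peak.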



Performance vs. Startup, Length

[Figure: ops/clock (0 to 4) versus vector length (0 to 500); curves s10_r1, s100_r1, s10_r4, s100_r4 and scalar_0.2]

Vector Performance II

Vector/Scalar Subsections


ALL codes have some scalar (non-vectorizable) sections

total time per operation = (scalar fraction)/(scalar rate) + (vector fraction)/(vector rate)

effective rate = 1 / (total time per operation)

example: 10% scalar at 1 MFLOPS, 90% vector at 100 MFLOPS:

1 / (0.1/1 + 0.9/100) = 100 / (0.1 * 100 + 0.9 * 1) = 9.2 MFLOPS !!!
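
The worked example can be checked with a few lines of code. This is a minimal sketch of the time-weighted rate formula above; `effective_rate` is an illustrative name, and the fractions are taken to be fractions of the operation count.

```python
def effective_rate(scalar_fraction, scalar_rate, vector_rate):
    """Overall rate when a fraction of the operations runs at the scalar rate
    and the remainder at the vector rate (rates in MFLOPS)."""
    vector_fraction = 1.0 - scalar_fraction
    time_per_op = scalar_fraction / scalar_rate + vector_fraction / vector_rate
    return 1.0 / time_per_op

if __name__ == "__main__":
    # The slide's example: 10% at 1 MFLOPS, 90% at 100 MFLOPS.
    print(round(effective_rate(0.10, 1.0, 100.0), 1), "MFLOPS")
    # Sensitivity to the scalar fraction at a 100:1 vector-to-scalar ratio.
    for f in (0.0, 0.01, 0.05, 0.1, 0.5, 1.0):
        print(f, round(effective_rate(f, 1.0, 100.0), 2))
```

Even a small scalar fraction dominates the total time once the vector unit is much faster than the scalar unit.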


Vector Version of Amdahl’s Law

[Figure: performance versus scalar fraction (0 to 1); curves r5, r10, r20, r50 and r100]

Vector Computer Design Guide

Must have SHORT vector startup => can work with short vectors


Must have FASTEST POSSIBLE scalar unit => can afford scalar sections


irregular data structures ==> need gather, scatter, and merge operations (and a few more)

x(i) = a(index(i)) * b(i)   (gather)

y(index(i)) = c(i) + d(i)   (scatter)

where (a(i) > b(i)) c(i) = d(i)   (merge under a mask)
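
The three access patterns can be spelled out element by element. The sketch below is illustrative only: it uses plain Python lists with 0-based indices, the function names are made up, and the scatter assumes the index list contains no duplicates.

```python
def gather_multiply(a, b, index):
    """x(i) = a(index(i)) * b(i): read a through an index vector."""
    return [a[index[i]] * b[i] for i in range(len(b))]

def scatter_sum(y, c, d, index):
    """y(index(i)) = c(i) + d(i): write results through an index vector.

    Returns a new list; assumes index has no duplicate entries.
    """
    out = list(y)
    for i in range(len(c)):
        out[index[i]] = c[i] + d[i]
    return out

def masked_merge(a, b, c, d):
    """where (a(i) > b(i)) c(i) = d(i): take d where the mask is true, else keep c."""
    return [d[i] if a[i] > b[i] else c[i] for i in range(len(c))]

if __name__ == "__main__":
    index = [2, 0, 1]
    print(gather_multiply([10, 20, 30], [1, 2, 3], index))
    print(scatter_sum([0, 0, 0], [1, 2, 3], [10, 20, 30], index))
    print(masked_merge([1, 5, 3], [2, 4, 3], [0, 0, 0], [7, 8, 9]))
```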


Cray Research, Inc.

Founded in 1972 by Seymour Cray (father of the CDC 6600/7600), when the STAR-100 was already known

first Cray-1 delivered in 1976 to Los Alamos Scientific Laboratory (LASL)

8 vector registers of 64 elements each

Vector load/store instructions

fastest scalar computer of its time

160 MFLOPS peak rate (2 ops/cycle @ 80 MHz), only a few cycles of startup


Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis

Seymour Cray

Cray-1, 1976: single processor, 80 MFLOPS, 1 Mword = 8 Mbyte



Block Diagram Cray YMP-EL, only one of four identical CPUs shown, simplified

[Figure: block diagram; recoverable labels:]
- Vi: 8 vector registers, 64 elements, 64 bit; 4 vector execution units, 33 MHz
- Si: 8 scalar registers, 64 bit; scalar functional units; Tjk: 64 word buffer memory
- Ai: 8 address registers, 32 bit; address functional units; Bjk: 64 word buffer memory
- 8 instruction buffers, 32 words each; instruction issue
- Shared Main Memory: 128 MW, 64 bit = 1 Gbyte; 4 ports/proc, 4 x 4 x 33 MHz = 4.2 Gbyte/sec
- 48 shared registers
- IOS, Y1 channel, 40 Mbyte/sec

Large working set:

- 8 vector registers, 64 words
- 8 scalar registers
- 8 address registers
- large instruction buffer

Performance Features:

- vector processing: one operation affects 64 vector elements, streamed through functional unit
- small vector startup time
- chaining between vector ops
- large, fast semiconductor memory


Cray Research, Inc. cont'd

1982 Cray-XMP (Steve Chen improvements, up to 4 processors, shared memory)

1985 Cray-2, 256 Mword memory, 4 processors, immersion cooled

1988 Cray-YMP (last Chen machine)

1991 Cray C90 (up to 16 vector CPUs, shared memory)

1993 Cray T3D (massively parallel Alpha)
     one and only Cray-3 delivered to NCAR (Cray Computer Corporation)

1994 Cray J90 (up to 32 vector CPUs, shared memory), air cooled

1995 Cray T3E (most successful MPP machine), Cray T90 (parallel vector, immersion cooled)
     Cray-4 abandoned (Cray Computer Corporation files for Chapter 11)

1996 acquired by Silicon Graphics

1998 Cray SV1 (parallel vector, air cooled)

2000 acquired by Tera Computer Company => Cray, Inc.

2002 Cray X1, parallel vector, immersion spray cooled

2004 Cray X1e, enhanced version of X1
     Cray XT3, AMD-based 3D torus massively parallel machine



CDC Cyber 200 Family

- 1980, enhanced version of STAR-100
- reduced startup time, ~50 cycles
- fast scalar unit
- rich instruction repertoire
- still memory-to-memory, 400 MFLOPS peak
- Cyber 203, Cyber 205, ETA-10 [10 GFLOPS]
- vector FORTRAN language extensions provided
- terminated in 1989 since unprofitable
- around 40 Cyber 200, 34 ETA-10 sold



Minnesota Supercomputer Center

Minneapolis, 1986

Cray-2, CDC Cyber 205


NEC Japan

- 1983 SX-1: single processor, vector, 650 MFLOPS
- 1985 SX-2: single processor, vector, 1300 MFLOPS
- 1990 SX-3: four processors at ~5 GFLOPS each, 4 Gbyte = 0.5 Gword memory
- 1995 SX-4: 32 processors at ~2 GFLOPS each (CMOS; all previous ECL)
- 1998 SX-5: up to 512 processors, 8 GFLOPS each
- 2002 SX-6: up to 1024 processors, 8 GFLOPS each
- 2004 SX-7: up to 2048 processors, 8.8 GFLOPS each
- 2004 SX-8: up to 4096 processors, 16 GFLOPS each



IBM-Sony-Toshiba CELL processor

- 8 vector CPUs + GPU on single chip
- 256 kbyte = 32 kword local storage (very small !!)
- 12 word/cycle internal interconnect = 384 Gbyte/sec
- 24 Gbyte/sec = 3 Gword/sec main memory
- 76 Gbyte/sec = 9.5 Gword/sec communication
- @ 4 GHz clock: 256 GFLOPS (32 bit) peak
- 26 GFLOPS (64 bit) peak
- max 4.5 Gbyte addressable, 512 Mbyte implemented
- system interconnect ?
- used in the Sony PlayStation 3
- Mercury, IBM blades available; 512 Mbyte only
- highly imbalanced for scientific computation
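
One way to put a number on "imbalanced" is the ratio of memory words delivered to floating-point operations performed at peak. The sketch below uses only the bandwidth and peak figures listed above; the words-per-flop metric and the function names are illustrative assumptions, not taken from the slides.

```python
BYTES_PER_WORD = 8  # 64-bit words, as implied by "24 Gbyte/sec = 3 Gword/sec"

def gwords_per_sec(gbytes_per_sec, bytes_per_word=BYTES_PER_WORD):
    """Convert a bandwidth in Gbyte/sec to Gword/sec."""
    return gbytes_per_sec / bytes_per_word

def words_per_flop(gbytes_per_sec, gflops, bytes_per_word=BYTES_PER_WORD):
    """Memory words deliverable per floating-point operation at peak rates."""
    return gwords_per_sec(gbytes_per_sec, bytes_per_word) / gflops

if __name__ == "__main__":
    print("main memory:", gwords_per_sec(24.0), "Gword/s")
    print("communication:", gwords_per_sec(76.0), "Gword/s")
    print("64-bit balance:", round(words_per_flop(24.0, 26.0), 3), "words/flop")
    print("32-bit balance:",
          round(words_per_flop(24.0, 256.0, bytes_per_word=4), 3), "words/flop")
```

A loop such as a(i) = b*c(i) + d(i) moves three words for every two floating-point operations, so under these peak figures main memory could keep only a small fraction of the arithmetic units busy unless data is reused from local storage.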


IBM-Sony-Toshiba CELL processor

- 90 nm SOI, 8 layers Cu interconnect
- 234 M transistors
- 221 mm² die size

- significant potential in future revisions
- but: 80 W @ 1.1 V, 4.0 GHz is too much
- 180 W @ 1.4 V, 5.6 GHz is much too much
- work needed on power reduction
- larger internal memory needed
- 64-bit arithmetic must be improved


IBM-Sony-Toshiba CELL processor

From: S. Williams et al., Lawrence Berkeley Laboratory

- single Cell chip performance
- compared with a Cray X1E single vector processor and several commodity microprocessors (AMD, Intel)
- the current version already shows an impressive speedup, at the cost of significant programming complexity (explicit storage moves as opposed to caching)
- simulation of a slightly enhanced Cell (Cell+) gives a very significant additional speedup (more efficient double precision)
- current version insufficient for major impact
- future versions may change that; great potential