Slide - ACE home page - Ohio University

27 Nov 2013

Randy W. Morris, Jr.

Affiliation: EECS, Ohio University

E-mail: rm700603@ohio.edu

Advisor: Avinash Kodi

Thesis Defense

Outline

Motivation & Background
PROPEL: Architecture
PROPEL: Implementation
Performance Analysis
Conclusion

Why Chip Multi-Processor? (1/2)

After 2002, single-core designs show diminishing returns.

Courtesy: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kaufmann, San Francisco, 2007.

Why Chip Multi-Processor? (2/2)

Examples: RAW, Core 2 Duo, Quad Core, UltraSparc

Courtesy: G. Konstadinidis et al., “Architecture and Physical Implementation of a Third Generation 65 nm, 16 Core, 32 Thread Chip-Multithreading SPARC Processor”

Wire Delay Problem

[Figure: 20 mm dies divided into 4, 16, and 64 cores, labeled Past, Present, and Future]

Wire delay is proportional to the wire's RC constant.

       90 nm   65 nm   45 nm   32 nm   22 nm
R      122     220     312     382     455
C      170     165     160     155     150

Resistance increases while capacitance stays nearly constant.

Network-on-Chip (NoC)

[Figure: 16 cores (Core 0 to Core 15), each attached to a router, with links between routers. Router detail: +X, -X, -Y, and +Y input and output ports plus the processing core, connected through a crossbar switch and controlled by Route Computation (RC), Virtual Channel (VC), and Switch Allocator (SA) stages, with credits in/out.]

Power Dissipation

Intel Tera-Flops (65 nm)

Tile Power
Dual FPMACs            36%
Router & Links         28%
IMEM + DMEM            21%
Clock Distribution     11%
10-port RF              4%

Routing Power
Clocking               33%
Queues + Data Path     22%
Links                  17%
Crossbar               15%
Arbiters + control      7%
MSINT                   6%

28% of a tile's overall power goes to the router and links.

Link power will become a larger share of a router's overall power in future VLSI technologies.

Router and link power should be about 10-15% of the tile's power budget.

Potential solutions: optics, RF, and 3D stacking

Courtesy: Y. Hoskote, “A 5-GHz Mesh Interconnect for a Teraflops Processor,” IEEE Computer Society, 2007, pp. 51-61

Why use Optics?

Lower latency
Low power (1.1 mW/Gbps)
Bit-rate independent of distance
Lower cross-talk
Does not suffer from impedance mismatch and signal reflection
Low signal attenuation
Higher bandwidth (WDM, SDM & TDM)
Increased bandwidth density (compact parallel optics)


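Electrical Interconnect

[Figure: RC link. Inverters of size $s_{opt}$ drive wire segments of length $l_{opt}$; each stage is annotated with $C_p$, $C_0$, $R_s$, and the wire's R and C.]

$R$ = wire resistance per unit length
$C$ = wire capacitance per unit length
$C_p$ = inverter output capacitance
$C_0$ = inverter input capacitance
$R_s$ = inverter resistance
$s_{opt}$ = inverter size
$l_{opt}$ = wire distance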




ITRS 2007 Transistor & Link Parameters

Device                   90 nm    65 nm    45 nm    32 nm    22 nm
V_dd                     1.2      1.1      1        0.9      0.8
f_clk                    3.088    4.7      5.875    7.344    9.18
R                        122      220      312      382      455
C                        170      165      160      155      150
C_p                      1        0.9      0.8      0.712    0.544
C_0                      0.5      0.45     0.4      0.356    0.272
R_s                      1890     2200     3500     4700     6900
S_opt                    72.5     60.5     66.9     73.1     91.4
L_opt                    0.45     0.35     0.25     0.18     0.13
I_offn (nA/micron)       50       70       100      150      220
I_shortckt (nA/micron)   65       100      100      100      100

Increased wire delay due to the RC constant
Increase in the I_offn & I_shortckt current parameters

Electrical link device parameters for various VLSI technologies
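The RC trend in the table can be checked with a few lines of Python. This is only a sketch: it multiplies the tabulated R and C values and normalizes to the 90 nm node, so no unit assumptions are needed. The function name `rc_relative_to_first` is illustrative and not from the slides.

```python
# Wire R and C values copied from the parameter table (units as given there).
NODES = ["90 nm", "65 nm", "45 nm", "32 nm", "22 nm"]
R = [122, 220, 312, 382, 455]
C = [170, 165, 160, 155, 150]


def rc_relative_to_first(r_values, c_values):
    """Return each node's R*C product divided by the first node's R*C."""
    base = r_values[0] * c_values[0]
    return [r * c / base for r, c in zip(r_values, c_values)]


rc_trend = rc_relative_to_first(R, C)
for node, ratio in zip(NODES, rc_trend):
    print(f"{node}: RC = {ratio:.2f}x the 90 nm value")
```

Because wire delay is proportional to RC, the ratio printed for each node is also the relative delay of an unrepeated wire of fixed length, under the assumption that nothing else changes.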

Optical Interconnect

[Figure: optical link. An off-chip laser feeds an on-chip modulator, which is driven by a buffer chain in the electronics layer. The signal crosses the transmission medium in the optical layer to a photodetector, followed by a TIA, a limiting amplifier, and a driver for electronics.]

Transmitter             Power (mW/Gbps)   Area (mm^2)
VCSEL                   ~2-4              ~0.2
Micro-ring resonator    ~0.1              ~0.01
MZ modulator            ~0.1              ~0.0001

Micro-ring Resonators

Resonant wavelength ($\lambda_0$):

$\lambda_0 \times m = n_{\mathrm{eff}} \times 2\pi R$

$m$ = an integer
$n_{\mathrm{eff}}$ = effective refractive index
$R$ = radius of the ring resonator

[Figure: three micro-ring resonators with n+/p+/n+ doped regions between Input Port 0 and Output Port 0. The ring voltage $V_R$ is set to $V_{OFF}$ or $V_{ON}$; in the $V_{ON}$ case the light is routed to Output Port 1.]

CMOS compatible
Low power (0.1 mW)
Small footprint (10 um)
High bandwidth (10 Gbps)
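The resonance condition (resonant wavelength times the integer m equals the effective index times the ring circumference) can be evaluated numerically. The sketch below lists the resonances that fall inside a wavelength band. The values n_eff = 2.5 and R = 5 um are hypothetical placeholders chosen for illustration; they are not taken from the slides.

```python
import math


def resonant_wavelength(n_eff, radius_m, m):
    """Solve lambda_0 * m = n_eff * 2*pi*R for lambda_0 (in metres)."""
    if m <= 0:
        raise ValueError("mode number m must be a positive integer")
    return n_eff * 2.0 * math.pi * radius_m / m


def modes_in_band(n_eff, radius_m, lo_m, hi_m):
    """List (m, lambda_0) pairs whose resonance lies inside [lo_m, hi_m]."""
    optical_path = n_eff * 2.0 * math.pi * radius_m
    m_min = math.ceil(optical_path / hi_m)
    m_max = math.floor(optical_path / lo_m)
    return [(m, optical_path / m) for m in range(m_min, m_max + 1)]


# Hypothetical example values (assumptions, not from the slides).
N_EFF = 2.5
RADIUS = 5e-6  # metres

band = modes_in_band(N_EFF, RADIUS, 1.50e-6, 1.60e-6)
for m, lam in band:
    print(f"m = {m}: lambda_0 = {lam * 1e9:.1f} nm")
```

Changing the effective index slightly, which is what the applied voltage does in a modulator, shifts every listed resonance by the same fraction.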

Waveguide & Receiver

WAVEGUIDE       Pitch (um)   Propagation Time (ps)   Optical Loss (dB/cm)
Si [1]          5.5          10.45                   1.3
Polymer [1]     20           4.93                    1.0

RECEIVER                    Power (mW/Gbps)   Area (mm^2)
Si-CMOS-Amplifier [2]       1.1               0.02625
80 nm CMOS [3]              2.5               0.0625
SiGe BiCMOS [4]             24.5              1.07

[1] N. Kirman et al., “Leveraging Optical Technology in Future Bus-based Chip Multiprocessors,” 39th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2006, pg. 492
[2] S. Koester et al., “Ge-on-SOI-Detector/Si-CMOS-Amplifier Receivers for High-Performance Optical-Communication Applications,” Journal of Lightwave Technology, Vol. 25, No. 1, January 2007
[3] C. Kromer et al., “A 100-mW 4×10 Gb/s Transceiver in 80-nm CMOS for High-Density Optical Interconnects,” IEEE Journal of Solid-State Circuits, Vol. 40, No. 12, December 2005
[4] D. Kuchta et al., “120-Gb/s VCSEL-based parallel-optical interconnect and custom 120-Gb/s testing station,” Journal of Lightwave Technology, Vol. 22, No. 9, pp. 2200-2212, Sept. 2004

Electrical/Optical Comparison

[Plot: power-delay product at various technology nodes for a 5 mm link (core-to-core distance)]

Optics becomes more advantageous at 52 nm for global interconnects and at 45 nm for semi-global interconnects.

Critical Length

The critical length is the distance beyond which an optical link becomes more advantageous than an electrical link.


Advantages of PROPEL

Efficient use of optical components
Balance between optics and electronics
Simple network design: low diameter, DOR
Scalability
Fault tolerant

[Figure: 64 cores (0 to 63) grouped into tiles; each tile holds four cores, such as Core 0 to Core 3, around an L2 cache]

PROPEL's Design

[Figure: grid of tiles, starting with Tile 0. Each tile holds cores, an L2 cache, and a photonic transceiver. The tiles are joined by an optical interconnect fed from a broadband light source (λ0, λ1, λ2, …).]

PROPEL's Routing & Wavelength Assignment (x-direction)

[Figure: Tile 0 to Tile 3 (Core 0 to Core 15, one L2 cache per tile) with Home Channel 0 to Home Channel 3. A broadband signal feeds the tiles, and the channels carry wavelength labels of the form λi(x,y), such as λ1(0,0), λ2(2,0), λ0(1,0), and λ3(2,0).]

PROPEL's 64 Wavelength Design

[Figure: Tile 0 to Tile 3 (Core 0 to Core 15). Each tile has four cores with L1 caches, a shared L2, and X/Y transmitters and receivers. Optical inter-tile communication channels connect the tiles, and the laser supplies the wavelength groups λ(0-15), λ(16-31), λ(32-47), and λ(48-63).]

Research has shown that 64 wavelengths can traverse a single waveguide.

PROPEL's x- and y-direction Implementation

[Figure: grid of tiles fed by a laser. The expanded tile shows Core 0 to Core 3 with L1 caches, a shared L2, and X/Y transmitters and receivers. Memory labels: On-Chip, DRAM Bank 0 to Bank 3, Bank 4-15, Off-Chip.]

Memory Routing and Wavelength Assignment

[Figure: Bank 0 to Bank 3, each with a receiver and a transmitter. The wavelength groups λ0-15, λ16-31, λ32-47, and λ48-63 arrive from the CMP and from the laser and return to the CMP.]

Communication Example

Tile 3 communicates with Tile 8.

[Figure: grid of tiles fed by a laser, with the communication path highlighted. Tile detail: Core 0 to Core 3 with L1 caches, a shared L2, and X/Y transmitters and receivers. Router detail: inputs and outputs X0, X1, X2, Y0, Y1, Y2 and the L2 cache, connected through a crossbar switch with Route Computation (RC), Virtual Channel (VC), and Switch Allocator (SA) stages and credits in/out.]
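The Tile 3 to Tile 8 example can be traced with a small dimension-order routing (DOR) sketch. Two assumptions are made here that the slides do not state outright: the 16 tiles are numbered row by row in a 4 x 4 grid, and a tile can reach any other tile in its own row or column in a single optical hop (suggested by the three X and three Y router ports). The helper names `tile_xy` and `dor_route` are illustrative.

```python
GRID = 4  # assumed 4 x 4 arrangement of 16 tiles, numbered row by row


def tile_xy(tile):
    """Convert a tile number to (x, y) grid coordinates."""
    return tile % GRID, tile // GRID


def dor_route(src, dst):
    """Return the tiles visited, routing in x first and then in y."""
    sx, sy = tile_xy(src)
    dx, dy = tile_xy(dst)
    path = [src]
    if sx != dx:
        path.append(sy * GRID + dx)  # assumed single hop along the row
    if sy != dy:
        path.append(dy * GRID + dx)  # assumed single hop along the column
    return path


print(dor_route(3, 8))
```

Under these assumptions no route needs more than two hops, since one x hop aligns the column and one y hop aligns the row.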

Modulation Implementation

[Figure: a broadband signal passes banks of modulators for λ0 to λ15, λ16 to λ31, and λ32 to λ47, producing the wavelength groups λ0-15, λ16-31, and λ32-47]

Multicasting & Broadcasting

Multicasting: single tile to multiple tiles.

Broadcasting: single tile to all-tile communication.
Use 3 individual multicasts.

[Figure: Tile 0 to Tile 15, with the sending tile and the communication links marked]

Performance Evaluation

Cost & Component Comparison

Synthetic Traffic
    OPTISM
    Uniform, Bit-reversal, Butterfly, Complement, Matrix transpose, Perfect Shuffle

SPLASH-2
    Simics with GEMS and Garnet
    FFT, LU, Radiosity and Ocean

Network topologies evaluated
    Electrical: Mesh, Cmesh and Flattened-butterfly
    Optical: Circuit-switch, Shared-bus and Corona

Electronic Parameters

[Figure: router with +X, -X, -Y, and +Y input and output ports and a Processing Element (PE), connected through a crossbar switch with Route Computation (RC), Virtual Channel (VC), and Switch Allocator (SA) stages and credits in/out]

VC Buffer (4.03 mW/flit)

$P_{write} = P_{wordline} + (2 \times F \times P_{bitline}) + (F \times P_{memory\text{-}cell})$

$P_{read} = P_{wordline} + F \times (P_{bitline_r} + P_{chg})$

Crossbar (0.8 mW/flit)

$E_{sw} = w_f \times (C_{xbi} + C_{xbo}) \times V_{DD}^2$

Electrical Link (22 mW/mm)

$P_{link} = P_{dynamic} + P_{leakage} + P_{short\text{-}ckt}$
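The buffer, crossbar, and link expressions above translate directly into code. The sketch below mirrors the formulas term by term and then adds a rough per-hop estimate from the three headline figures (4.03 mW/flit, 0.8 mW/flit, 22 mW/mm). Summing those three figures for a single hop is an assumption made here for illustration, and `hop_power_mw` is a hypothetical helper, not part of the thesis.

```python
def buffer_write_power(p_wordline, p_bitline, p_memory_cell, f):
    """P_write = P_wordline + 2*F*P_bitline + F*P_memory-cell."""
    return p_wordline + 2 * f * p_bitline + f * p_memory_cell


def buffer_read_power(p_wordline, p_bitline_r, p_chg, f):
    """P_read = P_wordline + F*(P_bitline_r + P_chg)."""
    return p_wordline + f * (p_bitline_r + p_chg)


def crossbar_switch_energy(w_f, c_xbi, c_xbo, v_dd):
    """E_sw = w_f * (C_xbi + C_xbo) * V_DD^2."""
    return w_f * (c_xbi + c_xbo) * v_dd ** 2


def link_power(p_dynamic, p_leakage, p_short_ckt):
    """P_link = P_dynamic + P_leakage + P_short-ckt."""
    return p_dynamic + p_leakage + p_short_ckt


# Headline figures quoted on the slide.
VC_BUFFER_MW_PER_FLIT = 4.03
CROSSBAR_MW_PER_FLIT = 0.8
LINK_MW_PER_MM = 22.0


def hop_power_mw(link_length_mm):
    """Assumed estimate: buffer + crossbar + link figures for one hop."""
    return (VC_BUFFER_MW_PER_FLIT + CROSSBAR_MW_PER_FLIT
            + LINK_MW_PER_MM * link_length_mm)


print(f"One hop over a 5 mm link: {hop_power_mw(5.0):.2f} mW")
```

With these figures the link term dominates for any link longer than a fraction of a millimetre, which matches the motivation for replacing long electrical links.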

Optical Parameters

[Figure: optical link. An off-chip laser feeds an on-chip modulator, which is driven by a buffer chain in the electronics layer. The signal crosses the transmission medium in the optical layer to a photodetector, followed by a TIA, a limiting amplifier, and a driver for electronics.]

Micro-ring Modulator (0.1 mW)

Receiver Circuitry (1.1 mW/Gbps)
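A back-of-the-envelope estimate of on-chip optical link power follows from the two figures above. The sketch assumes one modulator and one receiver per wavelength, ignores the off-chip laser, and uses a 10 Gbps bit rate as an example operating point. All three are assumptions for illustration, and `optical_link_power_mw` is a hypothetical helper.

```python
MODULATOR_MW = 0.1          # micro-ring modulator figure from the slide
RECEIVER_MW_PER_GBPS = 1.1  # receiver circuitry figure from the slide


def optical_link_power_mw(bit_rate_gbps, wavelengths=1):
    """Assumed model: each wavelength needs one modulator and one receiver."""
    per_wavelength = MODULATOR_MW + RECEIVER_MW_PER_GBPS * bit_rate_gbps
    return wavelengths * per_wavelength


single = optical_link_power_mw(10.0)
group = optical_link_power_mw(10.0, wavelengths=16)
print(f"1 wavelength at 10 Gbps: {single:.1f} mW")
print(f"16 wavelengths at 10 Gbps: {group:.1f} mW")
```

Unlike the electrical link figure, nothing in this model depends on distance, which reflects the bit-rate-independent-of-distance property listed among the advantages of optics.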

Component Comparison

                          Shared-Bus   Circuit-Switch   Corona    PROPEL
Wavelengths               4            24               64        64
Waveguides                168          64               99        32
Micro-rings               2,688        16,576           72,192    3,072
Photodetectors            1,536        2,016            7,424     1,536
Power Loss                37           39.2             49.2      32.1
Optical Area (mm^2)       16           46               64.6      17
Electrical Area (mm^2)    60           55               195       50

PROPEL is the most cost-effective NoC.
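The comparison table can be queried programmatically to see which network is lowest on each metric. The sketch below copies the table values and reports the minimum per row, including ties. The function name `lowest` is illustrative.

```python
NETWORKS = ["Shared-Bus", "Circuit-Switch", "Corona", "PROPEL"]

# Values copied from the component comparison table.
TABLE = {
    "Waveguides": [168, 64, 99, 32],
    "Micro-rings": [2688, 16576, 72192, 3072],
    "Photodetectors": [1536, 2016, 7424, 1536],
    "Power loss": [37, 39.2, 49.2, 32.1],
    "Optical area (mm^2)": [16, 46, 64.6, 17],
    "Electrical area (mm^2)": [60, 55, 195, 50],
}


def lowest(metric):
    """Return the networks that share the smallest value for a metric."""
    values = TABLE[metric]
    best = min(values)
    return [name for name, v in zip(NETWORKS, values) if v == best]


for metric in TABLE:
    print(f"{metric}: lowest for {', '.join(lowest(metric))}")

corona_to_propel_rings = TABLE["Micro-rings"][2] / TABLE["Micro-rings"][3]
print(f"Corona uses {corona_to_propel_rings:.1f}x the micro-rings of PROPEL")
```

On these numbers PROPEL is lowest in waveguides, power loss, and electrical area, ties Shared-Bus on photodetectors, and is close to Shared-Bus on micro-rings and optical area while offering 64 wavelengths instead of 4.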

Synthetic Traffic Trace

Uniform traffic: every node is equally likely to be a packet's destination.

Bit-Reversal:
    Source:      $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$
    Destination: $a_0, a_1, \ldots, a_{n-2}, a_{n-1}$

Butterfly:
    Source:      $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$
    Destination: $a_0, a_{n-2}, \ldots, a_1, a_{n-1}$

Complement:
    Source:      $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$
    Destination: $a'_{n-1}, a'_{n-2}, \ldots, a'_1, a'_0$

Perfect-shuffle:
    Source:      $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$
    Destination: $a_{n-2}, a_{n-3}, \ldots, a_0, a_{n-1}$

Matrix Transpose:
    Source:      $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$
    Destination: $a_{n/2-1}, \ldots, a_0, a_{n-1}, a_{n-2}, \ldots, a_{n/2}$
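Each synthetic pattern is a permutation of the bits of the n-bit source address, so the destinations can be generated with simple bit operations. The sketch below implements the five patterns for an arbitrary address width. Matrix transpose is read as swapping the two halves of the address, which is the usual definition, and n is assumed to be even for that pattern. The function names are illustrative.

```python
def bit_reversal(src, n):
    """Destination bit i is source bit n-1-i."""
    dst = 0
    for i in range(n):
        if (src >> i) & 1:
            dst |= 1 << (n - 1 - i)
    return dst


def butterfly(src, n):
    """Swap the most and least significant bits (a_{n-1} and a_0)."""
    hi = (src >> (n - 1)) & 1
    lo = src & 1
    if hi != lo:
        src ^= (1 << (n - 1)) | 1
    return src


def complement(src, n):
    """Invert every address bit."""
    return ~src & ((1 << n) - 1)


def perfect_shuffle(src, n):
    """Rotate the address left by one bit."""
    mask = (1 << n) - 1
    return ((src << 1) & mask) | (src >> (n - 1))


def matrix_transpose(src, n):
    """Swap the upper and lower halves of the address (n assumed even)."""
    half = n // 2
    mask = (1 << n) - 1
    return ((src << half) & mask) | (src >> half)


N_BITS = 6  # 64 nodes, matching a 64-core network
for name, fn in [("bit-reversal", bit_reversal), ("butterfly", butterfly),
                 ("complement", complement), ("perfect-shuffle", perfect_shuffle),
                 ("matrix-transpose", matrix_transpose)]:
    print(f"{name}: node 1 sends to node {fn(1, N_BITS)}")
```

Because every pattern is a permutation, each node receives traffic from exactly one source, which is what makes these patterns useful for stressing specific links rather than spreading load evenly.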



Uniform Traffic Throughput

25% improvement over Mesh
9% improvement over Flattened-butterfly
Over 2× increase in performance over Circuit-switch, Cmesh and Shared-bus

[Plot: throughput (GBps) versus network load (0.1 to 0.9) for Mesh, Cmesh, Flattened-Butterfly, Circuit-switch, Shared-bus, Corona, and PROPEL]
Uniform Traffic Latency

PROPEL saturates at a network load of 0.5.
It saturates at a network load 0.1 higher than Flattened-butterfly.
It saturates at a 2× higher network load than Shared-bus and Circuit-switch.

[Plot: latency (ns) versus network load (0.1 to 0.9) for Mesh, Cmesh, Flattened-Butterfly, Circuit-switch, Shared-bus, Corona, and PROPEL]



All Traffic Saturation Throughput

[Bar chart: saturation throughput under Uniform, Bit-Reversal, Butterfly, Complement, Matrix Transpose, Perfect Shuffle, and Neighbor traffic for Mesh, Cmesh, Flattened-Butterfly, Circuit-Switch, Shared-Bus, Corona, and PROPEL]

Bit-Reversal Traffic Latency

PROPEL saturates at a network load of 0.25.
It saturates at a network load 0.25 higher than Flattened-butterfly.
It saturates at a 1.5× higher network load than Shared-bus and Circuit-switch.

[Plot: latency (ns) versus network load for Mesh, Cmesh, Flattened-Butterfly, Circuit-switch, Shared-bus, Corona, and PROPEL]



Complement Traffic Latency

Networks that concentrate cores create communication hotspots.

[Plot: latency (ns) versus network load for Mesh, Cmesh, Flattened-butterfly, Circuit-switch, Shared-bus, Corona, and PROPEL]



Matrix Transpose Traffic Latency

PROPEL saturates at a network load of 0.3.
Circuit-switch saturates at a higher load than the electrical networks.

[Plot: latency (ns) versus network load for Mesh, Cmesh, Flattened-butterfly, Circuit-switch, Shared-bus, Corona, and PROPEL]



Synthetic Traffic Power Dissipation

5× reduction in power

[Bar chart: power under Uniform, Bit-Reversal, Butterfly, Complement, Matrix Transpose, Perfect Shuffle, and Neighbor traffic for Mesh, Cmesh, Flattened-Butterfly, Circuit-Switch, Shared-Bus, Corona, and PROPEL]


Simics Parameters

Simics is a full-system simulator from Virtutech.

Parameter                Value
L2 cache size/assoc.     4 MB/4-way
L2 cache line size       64
L1 cache size/assoc.     64 KB/4-way
L1 cache line size       64
Core frequency (GHz)     2.5
Threads (core)           2
Issue policy             In-order
Memory size (GB)         4
Memory controllers       16


SPLASH-2 Benchmarks

The FFT kernel is a one-dimensional version of the radix-$\sqrt{n}$ six-step FFT algorithm.

The LU kernel factors a dense matrix into upper and lower triangular matrices.

Radiosity is a graphics kernel that computes the equilibrium distribution of light in a scene.

The Ocean application evaluates the boundary and eddy currents of large-scale ocean movements.


SPLASH-2 Speed-Up

[Bar chart: speed-up on FFT, LU, Radiosity, and Ocean for Mesh, Flattened-Butterfly, Shared-bus, Corona, and PROPEL]


Conclusion

PROPEL is a low-power, high-bandwidth NoC for future many-core processors.

PROPEL uses electronics for packet switching and optics for inter-router communication, which reduces the number of electrical and optical components.

PROPEL uses the fewest optical components and consumes the least area when compared to other opto-electronic networks.

PROPEL outperforms well-known network topologies while dissipating less power.

Future Work

Use optics to go to memory
Dynamic bandwidth
Dynamic voltage scaling
Application integration with the NoC

Examples of NoCs (1/2)

[Figure: Mesh and Torus topologies, each drawn with cores, routers, and links]

Mesh
    Advantages: simple to integrate on-chip; DOR routing
    Disadvantages: high hop count

Torus
    Advantages: reduced hop count; DOR routing
    Disadvantages: difficult to integrate on-chip

Examples of NoCs (2/2)

Cmesh
    Advantages: reduced network diameter; fewer routers
    Disadvantages: multiple cores share the same ports

Flattened-butterfly
    Advantages: max hop count of 2; reduced power dissipation
    Disadvantages: not easily scalable

PROPEL Multicasting Example

Multicast example: Tile 0 communicates the same data to Tiles 1, 2 & 3.

[Figure: Tile 0 to Tile 3 (Core 0 to Core 15) fed by a laser. Each tile has four cores with L1 caches, a shared L2, and X/Y transmitters and receivers.]

PROPEL's Implementation (3/4)

[Figure: transmitters and receivers of Tile 0 to Tile 3, fed by an off-chip laser, with paths to and from memory. Each tile's transmitters and receivers are arranged in the wavelength groups λ0-15, λ16-31, λ32-47, and λ48-63.]

PROPEL's Design: 64-Wavelength Assignment

Research has shown that 64 wavelengths can traverse a single waveguide.

The wavelengths used for PROPEL are extended from 4 to 64.

                      Source Tile 0   Source Tile 1   Source Tile 2   Source Tile 3
Destination Tile 0    -               λ0[16-31]       λ0[32-47]       λ0[48-63]
Destination Tile 1    λ1[0-15]        -               λ1[32-47]       λ1[48-63]
Destination Tile 2    λ2[0-15]        λ2[16-31]       -               λ2[48-63]
Destination Tile 3    λ3[0-15]        λ3[16-31]       λ3[32-47]       -
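The assignment table follows a regular rule that can be written as a short function. The sketch below reads the subscript as the destination tile's channel and the bracketed range as the 16 wavelengths reserved for the source tile. That reading is an interpretation of the table, and the names `wavelength_assignment`, `TILES`, and `BAND` are illustrative.

```python
TILES = 4   # tiles along one dimension, as in the table
BAND = 16   # wavelengths per source tile (64 wavelengths / 4 tiles)


def wavelength_assignment(src, dst):
    """Return (channel, first_wavelength, last_wavelength), or None if src == dst.

    Interpretation of the table: the channel index equals the destination
    tile, and source tile s owns wavelengths 16*s through 16*s + 15.
    """
    if not (0 <= src < TILES and 0 <= dst < TILES):
        raise ValueError("tile index out of range")
    if src == dst:
        return None  # a tile does not send to itself optically
    first = src * BAND
    return dst, first, first + BAND - 1


for dst in range(TILES):
    row = []
    for src in range(TILES):
        entry = wavelength_assignment(src, dst)
        row.append("-" if entry is None else f"λ{entry[0]}[{entry[1]}-{entry[2]}]")
    print(f"Destination Tile {dst}: " + "  ".join(row))
```

Since each source owns a disjoint band, several tiles can send to the same destination at once without contending for a wavelength.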

PROPEL Broadcasting

Single tile to all-tile communication.
Use 3 individual multicasts.

[Figure: Tile 0 to Tile 15, with the sending tile and the communication links marked]

Electrical Link Power Dissipation

Optical Power Dissipation