The Alpha Roadmap

How it applies to Alpha clusters

Ray Hookway

Compaq Computer Corporation

Littleton, MA

Ray.Hookway@compaq.com

Map Features


Alpha Processor Roadmap


Alpha Systems


Alpha Clusters


Processor Roadmap

References:

Pete Bannon, “Alpha 21364: A Scalable Single-chip SMP”, Microprocessor Forum 1998, http://www.digital.com/alphaoem/microprocessorforum.htm

Joel Emer, “Simultaneous Multithreading: Multiplying Alpha Performance”, Microprocessor Forum 1999


Alpha Roadmap

[Figure: processor generations plotted against time (1998 to 2003), with performance increasing ("Higher Performance").]

EV6 (21264): 0.35 µm
EV67: 0.28 µm
EV68: 0.18 µm
EV7: 0.18 µm
EV8: 0.125 µm
EV78: 0.125 µm

Alpha 21264 is performance leader

[Figure: SPECfp95 bar chart, axis 0 to 60, comparing the Compaq AlphaServer DS20, HP D380, Sun UE250, and IBM F50. Bar values shown: 58.7, 17.4, 20.6, 12.6.]

Alpha 21264 Systems

AlphaServer 8400 with EV6/575

Benchmark               CPUs  EV6/575  EV56/600  Ratio
SPECint95                  1     30.3      18.4    1.6
SPECfp95                   1     47.7      20.8    2.2
Linpack 100x100            1      460       280    1.5
TPC-C (K/Min)*             8     37.5      24.5    1.5
AIM 7 max users (K)        8     10.5       6.9    1.4
SPECweb (K conn/sec)**     8     12.2       7.8    1.5

* 37,541 tpmC at $79.4/tpmC for 8CPU 16GB; Sybase V11.9 available 12/98
** estimated
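
The Ratio column can be recomputed from the two result columns. The sketch below does that in Python with the figures transcribed from the table; the dictionary layout and the `speedup` helper are illustrative names, not part of the original slides. The recomputed quotients come out slightly above some listed ratios (about 1.64 for Linpack against the listed 1.5), so the slide may have rounded down or used other runs; that is only a guess.

```python
# Benchmark results transcribed from the table: (EV6/575, EV56/600).
results = {
    "SPECint95": (30.3, 18.4),
    "SPECfp95": (47.7, 20.8),
    "Linpack 100x100": (460, 280),
    "TPC-C (K/Min)": (37.5, 24.5),
    "AIM 7 max users (K)": (10.5, 6.9),
    "SPECweb (K conn/sec)": (12.2, 7.8),
}


def speedup(new, old):
    """Return how many times larger the new result is than the old one."""
    return new / old


# Higher is better for every benchmark listed, so a plain quotient suffices.
ratios = {name: speedup(ev6, ev56) for name, (ev6, ev56) in results.items()}

for name, ratio in ratios.items():
    print(f"{name:<22} {ratio:.2f}x")
```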

IA-64 vs. Alpha Philosophy

EPIC
Smart compiler and a dumb machine
Compiler creates record of execution
Machine plays record
Stall when compiler is wrong
Focus on vector programs
Compiler transforms scalar to vector
What about:
function calls, indirection
dynamic linking
C++, Java/JIT

ALPHA
Smart compiler, smart machine, and a GREAT circuit design
Compiler creates record of execution
Machine exploits additional information available at runtime
Works across barriers to compile-time analysis
Focus on scalar programs
Add resources for vector
Amdahl’s law
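
The reference to Amdahl's law can be made concrete with a small calculation. The sketch below uses the standard formula: if a fraction p of run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The function name and the sample workload (30% vectorizable, 8x vector speedup) are hypothetical and not taken from the slides.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when `accelerated_fraction` of run time gets `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


# Hypothetical workload: 30% of the time is vectorizable and runs 8x faster.
print(f"{amdahl_speedup(0.30, 8):.2f}x overall")

# Even an arbitrarily fast vector unit cannot beat 1 / (1 - 0.30).
print(f"{amdahl_speedup(0.30, 1e12):.2f}x upper bound")
```

Under these assumed numbers the scalar 70% dominates, which is the argument for concentrating on scalar performance first.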

Alpha 21364 Goals


Improve


Single processor performance, operating frequency, and memory system


SMP scaling


System performance density (computes/ft³)


Reliability and availability



Decrease


System cost


System complexity

“It’s the Memory, Stupid”

Dick Sites

[Figure: stacked bar chart, "Estimated time for TPC-C", axis 0 to 100, for the 164/600, 264/575, 264/1000, and 364/1000. Each bar is broken down into Issue, Mispred, Trap, Cache, and Memory. Callouts: New core; Higher MHz.]

Alpha 21364 Features


Alpha 21264 core with enhancements


Integrated L2 Cache


Integrated memory controller


Integrated network interface


Support for lock-step operation to enable high-availability systems.


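21364 Chip Block Diagram

[Figure: chip block diagram. Labeled blocks: 21264 Core with 64K Icache and 64K Dcache, 16 L1 Miss Buffers, 16 L1 Victim Buf, 16 L2 Victim Buf, L2 Cache, Memory Controller attached to RAMBUS, and Network Interface with N, S, E, W, and I/O ports. Labeled paths: Address In, Address Out.]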

21364 Core

[Figure: core pipeline diagram. Stage labels: FETCH, MAP, QUEUE, REG, EXEC, DCACHE, numbered 0 to 6. Labeled blocks: Branch Predictors, Next-Line Address, L1 Ins. Cache (64KB, 2-Set), Int Reg Map, FP Reg Map, Int Issue Queue (20), FP Issue Queue (15), two Reg Files (80), Reg File (72), four Exec units, two Addr units, FP ADD Div/Sqrt, FP MUL, L1 Data Cache (64KB, 2-Set), Victim Buffer, Miss Address, L2 cache (1.5MB, 6-Set). Annotations: 4 Instructions / cycle; 80 in-flight instructions plus 32 loads and 32 stores.]
Integrated L2 Cache


1.5 MB


6-way set associative


16 GB/s total read/write bandwidth


16 Victim buffers for L1 -> L2


16 Victim buffers for L2 -> Memory


ECC SECDED code


12ns load to use latency
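
The capacity and associativity above determine the number of cache sets once a line size is fixed. The sketch below works through that arithmetic in Python. The slides do not state the line size, so the 64-byte line is an assumption, as is reading "MB" as 2^20 bytes; `cache_sets` is an illustrative helper name.

```python
def cache_sets(capacity_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache: capacity / (ways * line size)."""
    lines, leftover = divmod(capacity_bytes, line_bytes)
    if leftover:
        raise ValueError("capacity is not a whole number of lines")
    sets, leftover = divmod(lines, ways)
    if leftover:
        raise ValueError("lines do not divide evenly among the ways")
    return sets


L2_CAPACITY = 3 * 512 * 1024  # 1.5 MB from the slide, read as binary megabytes (assumption)
L2_WAYS = 6                   # 6-way set associative, from the slide
LINE_BYTES = 64               # assumption: line size is not given in the slides

sets = cache_sets(L2_CAPACITY, L2_WAYS, LINE_BYTES)
index_bits = sets.bit_length() - 1  # exact because sets is a power of two here

print(f"{sets} sets, {index_bits} index bits")
```

Under these assumptions the unusual 6-way organization still yields a power-of-two set count, which keeps index selection a simple bit-field extraction.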


Integrated Memory Controller


Direct RAMbus


High data capacity per pin


800 MHz operation


30ns CAS latency pin to pin


6 GB/sec read or write bandwidth


100s of open pages


Directory based cache coherence


ECC SECDED


Integrated Network Interface


Direct processor-to-processor interconnect


10 GB/second per processor


15ns processor-to-processor latency


Out-of-order network with adaptive routing


Asynchronous clocking between processors


3 GB/second I/O interface per processor
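
As a rough way to read the latency and bandwidth figures together, the sketch below uses a first-order model: transfer time is the fixed latency plus payload size divided by bandwidth. The model, the `transfer_time_ns` helper, and the 64-byte payload are all assumptions. It also assumes a single transfer sees the full 10 GB/second quoted per processor, which the slides do not claim.

```python
def transfer_time_ns(payload_bytes, latency_ns, bandwidth_gb_per_s):
    """First-order estimate: fixed latency plus serialization time.

    With GB taken as 10**9 bytes, 1 GB/s equals 1 byte/ns, so dividing
    bytes by GB/s gives nanoseconds directly.
    """
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_ns + payload_bytes / bandwidth_gb_per_s


LINK_LATENCY_NS = 15.0      # processor-to-processor latency, from the slide
LINK_BANDWIDTH_GB_S = 10.0  # per-processor bandwidth, from the slide
PAYLOAD_BYTES = 64          # assumption: one cache-line-sized message

one_message = transfer_time_ns(PAYLOAD_BYTES, LINK_LATENCY_NS, LINK_BANDWIDTH_GB_S)
print(f"{one_message:.1f} ns for a {PAYLOAD_BYTES}-byte message")
```

For small messages the fixed latency dominates the estimate, so under this model the 15ns figure matters more than peak bandwidth for cache-line traffic.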

21364 System Block Diagram

[Figure: twelve interconnected 21364 processors (labeled 364), each with its own memory (M) and I/O (IO).]

Alpha 21364 Technology


0.18 µm CMOS


1000+ MHz


100 Watts @ 1.5 volts


3.5 cm²


6 Layer Metal


100 million transistors


8 million logic


92 million RAM

Alpha 21364 Performance/Status


70 SPECint95 (estimated)


140 SPECfp95 (estimated)


RTL model running


Tapeout 4Q99

21364 Summary


The 21364 integrated L2 cache and memory controller provide outstanding single processor performance


The 21364 integrated network interface enables high performance multi-processor systems


The high level of integration directly supports systems containing a large number of processors

21464 Overview


Enhanced out-of-order execution


8-wide superscalar


Large on-chip L2 cache


Direct RAMBUS interface


On-chip router for system interconnect


Glueless, directory-based, ccNUMA for up to 512-way SMP


4-way simultaneous multithreading (SMT)


[Figure: three panels showing instruction issue over time: Superscalar Instruction Issue, Multi-Threading, and Simultaneous Multi-Threading.]

What Changed?


Multiple Program Counters


Choose among them


More Architectural Register Space


Mapper


Register Files


Distinguished Per Thread Instruction State


Register mapping


Instruction Retire


Store Buffers


Abort and Restart Information

What Didn’t Change

Almost everything else



No basic functional changes in any stage



No partitioned instruction cache



No partitioned data caches



No partitioned off-chip caches



No extra register files



Little special branch prediction mechanism


Multi-threaded Scaling

[Figure: bar chart, axis 0 to 6, comparing "Average IPC of 4 single-threaded programs" with "IPC of 4 programs running multithreaded" for SPEC95 Integer, SPEC95 Integer/FP, SPEC95 Floating Point, and SQL Server. Speedups shown: 1.8x, 2.0x, 1.9x, 2.3x.]

AlphaServer Family Today

Switched based system - 64-bit PCI I/O subsystems - Very Large Memory
Scalable clusters on DIGITAL UNIX, OpenVMS
Modular system packaging - advanced systems management

AlphaServer DS Series
Uni and dual processor systems
Offerings scale to 8GB memory
Up to 6 PCI slots

AlphaServer ES Series
1-4 Processors
Up to 32GB of memory
Up to 10 PCI slots

AlphaServer GS Series
1-64 Processors
Up to 128GB of memory
Up to 224 PCI slots

AlphaServer DS10

Fast Memory Access
Large total RAM - 128MB up to 1GB
High bandwidth access - 1.3 GB/s

Flexible Internal Storage
Internal dual channel IDE storage included
Optional SCSI adapter supported
3 internal disk bays

Special Features
4 PCI I/O slots (3 64-bit, 1 32-bit)
300 watt power supply
3U Small footprint - Rack or Desktop
Dual embedded 10/100 Ethernet ports

New AlphaServer DS Series





Solution for project environment

Fastest Uni processor design in a 1U form factor

Fastest Memory Access with the Highest Bandwidth memory in its class

High speed I/O with 64 bit PCI

Sleek, compact and powerful package

Dual Purpose Solutions Support

Rack and desktop-ready for space constrained environments




AlphaServer DS Series




Fast Memory Access
Large total RAM - 64MB up to 1GB
High bandwidth access - 1.3 GB/s

Flexible Internal Storage
Internal dual channel IDE
Wide range of PCI Options supported
2 disk bays - 27 GB IDE / 18 GB SCSI

Special Features
Optional Slimline CD-Floppy Combo
Toolless features - snap out CD and Disk
Full 1 PCI I/O slot (64-bit)
150 watt power supply
1U (1.75”) Small footprint
Rack or Desktop
Dual embedded 10/100b Ethernet ports


Performance and Management Features


Remote management console


Serverworks and Compaq Insight Manager

New AlphaServer DS Series



Two ways to build an Alpha cluster

Sierra
Scalable, robust HPC platform
Maximum performance over broadest range of applications
Outstanding system management and reliability features

Beowulf
Complementary, low-cost, open source model.
Leadership performance over other Linux platforms.
Tru64 UNIX compatibility with common SWD tools
Support services through Compaq and partners

Sierra Architecture


Tera-scale systems derived from ASCI PathForward


Very large Distributed Shared Memory systems


High speed, scalable interconnect (Quadrics)


Exploit EV6, EV7 & EV8


Installed and administered as single system


System wide scheduler


High performance file systems (PFS, CFS, AdvFS)


Application availability


Sierra


ASCI Pathforward Project

Alpha Beowulf Clusters


Compaq ships 64-bit Linux on Alpha systems


Myrinet and other popular interconnects are supported


ServerNet-II available in late 1999


Compaq Tru64 UNIX (Digital UNIX) development tools ported in 1999 (!)

Prepackaged Beowulf Cluster



[Figure: rack layout of the prepackaged cluster in an H9A15 cabinet (41U, 71.75''). Components shown: two H7600-AA power inputs (L5-30P, 1U each), a Myricom switch (2U), a DECserver 90M terminal server, an Ethernet PORTswitch 900TP/12, a VT420 console, a Root Node DS10 and eight further DS10 systems (3U, 5.25'' each), two StorageWorks BA356 shelves (4U, 7.00'' each), a Platform 3 flat-panel monitor, a fan, and open space for air flow.

Callouts: Hub or switch for local network traffic and control. Terminal server attached to each console port to provide single console control. Myrinet high speed, low latency network.]
CT-D10MJ-SR

Starter DS10-based Beowulf cluster, including eight AlphaServer DS10 compute nodes, one AlphaServer DS10 management station with keyboard/trackball and display, Myrinet™ system area network, 73.1 GB JBOD UltraSCSI disk storage, Ethernet multiplexer for system management, and all Linux software required for basic Beowulf operation.


ServerNet-II Interconnect


Scalable high-performance network.


65,536 end nodes, 5 km range.


Multi-gigabit, low latency, low CPU, cheap.


VIA - Virtual Interface Architecture.


MPI - Message Passing Interface.


Open source Intel and Alpha Linux drivers.


NT, Tru64, NonStop Clusters, VxWorks.

Virtual Interface Architecture (VIA)

[Figure: VIA layer diagram. Labels: Applications, DBMS, Apps, OS Vendor API, VI Primitive Library (Open/Close/Map Memory, Send/Receive/Read/Write), VI Kernel Support, VI Kernel HW Interface, SAN Media Interface (ServerNet, Ethernet, ...), three VIs, and a CQ.]

65,536 VIs per node.
RDMA & send/recv.
Reliable reception.
< 2% CPU utilization.
Low latency/zero copy.
Thread-safe, protected.
Basis for COMPAQ’s “System I/O”.

Communication through “Virtual Interfaces (VI)” with associated “Completion Queues (CQ)”.

ServerNet-II Components

Beowulf.loc1.Tandem.com

ServerNet-II Hardware Components

[Figure: Router II with eight line cards, IBC logic, and two positions that each take an FCAL bridge, dual line card, or LAN bridge.]

Dual-port PCI interface (NIC)
VIA in hardware, DCE
negligible CPU cost
64 bit, 33 MHz & 66 MHz

12 port crossbar switch
wormhole routed
< 300 nsec latency
“fat pipe” channel bonding
bridges to fibre channel, gigabit ethernet

Gigabit ethernet cables
copper or fibre optic
5 meters to 5 km

ServerNet-II Hardware Performance

1.25+1.25 gigabit/s links 1999, doubles 2001.
< 300 nanosecond path formation per stage.
1M end nodes, 5 km fibre optic links.

                 Single VI   Multiple VIs
33 MHz PCI-64    166 MB/s    240 MB/s
66 MHz PCI-64    197 MB/s    350 MB/s
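
One way to read the measured throughput is against the theoretical peak of the PCI bus carrying it. The sketch below computes that peak as bus width times clock rate and then the fraction each measurement achieves. The helper name is illustrative, and the clocks are taken at the nominal 33 and 66 MHz quoted on the slide rather than the exact PCI frequencies.

```python
def pci_peak_mb_per_s(bus_bits, clock_mhz):
    """Theoretical peak of a parallel PCI bus in MB/s, one transfer per clock."""
    return bus_bits / 8 * clock_mhz


# Measured ServerNet-II throughput from the slide, in MB/s, keyed by PCI clock (MHz).
measured = {
    33: {"Single VI": 166, "Multiple VIs": 240},
    66: {"Single VI": 197, "Multiple VIs": 350},
}

efficiency = {}
for clock_mhz, rates in measured.items():
    peak = pci_peak_mb_per_s(64, clock_mhz)
    for mode, rate in rates.items():
        efficiency[(clock_mhz, mode)] = rate / peak
        print(f"{clock_mhz} MHz PCI-64, {mode:<12} {rate} of {peak:.0f} MB/s "
              f"({rate / peak:.0%})")
```

On these nominal figures the 33 MHz bus is close to saturated with multiple VIs, while the 66 MHz results sit well below the bus peak, which suggests (without proving) that something other than the PCI bus limits throughput there.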


Reliability


Dual port NICs, dual network topologies.


Link level CRC, in-band control protocol.


Strong packet ordering guarantees.


Every packet is acknowledged by receiver.


Automatic retry on transmission failure.


Avoids deadlock & livelock.

Linux Software Development Tools for AlphaServers

Boosted performance with Compaq Portable Math Library (CPML) for Linux on Alpha

Significantly increases the precision and speed of mathematical calculations, up to 10 times compared to other mathematical libraries currently available on Linux

Following the success of the Portable Math Library, now announcing plans for Compaq Extended Math Library to run on Linux AlphaServer systems

Compaq C compiler announced in April

Compaq Fortran compilers for Linux announced in April, available in July beta program

New Compaq C++ compiler

Makes it easy to support both Linux and Tru64 UNIX software

New Software Development Test-Drive capabilities

Test out the performance of your application over the Web

Get help from our leading Linux developers to optimize your application
[Figure: bar charts titled "SPEC CPU Benchmark*" (* not audited) and "Linpack (100x100) MFlops", axis 0 to 250, for P2/300, P2/450, EV56/500, and EV6/500. Legend: gcc, GEM.]

Summary


Alpha is the fastest processor available


Alpha is available in a full range of high performance systems


Sierra systems provide complete tera-scale solutions


Compaq wants to be involved in the Beowulf community