CptS 561 / EE 524
COMPUTER ARCHITECTURE

School of Electrical Engineering and Computer Science
Washington State University
Fall 2012

Course Objectives

Students in this course will be able to:

- Understand how modern computer systems work.
- Perform quantitative analysis of computer systems.
- Analyze, at the system level, the impact of changes to a computer system.
- Estimate the performance of a computer system.
- Recognize the need for further learning in this field (life-long learning).

Project



- A study of a multi-core processor
- Students will work in teams of two members
- Grade

Conventional Wisdom in Comp. Architecture


- Old Conventional Wisdom: power is free, transistors are expensive
- New Conventional Wisdom: the "power wall". Power is expensive, transistors are free
  (we can put more transistors on a chip than we can afford to turn on)
- Old CW: ILP can be increased sufficiently via compilers and innovation (out-of-order execution, speculation, VLIW, ...)
- New CW: the "ILP wall". Law of diminishing returns on more hardware for ILP

Conventional Wisdom in Comp. Architecture


- Old CW: multipliers are slow, memory access is fast
- New CW: the "memory wall". Memory is slow, multiplies are fast
  (about 200 clock cycles to DRAM, 4 clock cycles for a multiply)
- Old CW: uniprocessor performance doubles every 1.5 years
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  Uniprocessor performance now doubles only every 5(?) years
- Sea change in chip design: multiple "cores"
  (2X processors per chip every ~2 years)
  More, simpler processors that are more power efficient

Single Processor Performance

[Figure: growth in single-processor performance over time, annotated with the rise of RISC and the later move to multiprocessors]

Current Trends in Architecture


- Cannot continue to leverage Instruction-Level Parallelism (ILP)
  - Single-processor performance improvement ended in 2003
- New models for performance:
  - Data-level parallelism (DLP)
  - Thread-level parallelism (TLP)
  - Request-level parallelism (RLP), e.g. in data centers
- These require explicit restructuring of the application

Classes of Computers


- Personal Mobile Device (PMD)
  - e.g. smart phones, tablet computers
  - Emphasis on energy efficiency and real-time behavior
- Desktop Computing
  - Emphasis on price-performance
- Servers
  - Emphasis on availability, scalability, throughput
- Clusters / Warehouse-Scale Computers
  - Used for "Software as a Service" (SaaS)
  - Emphasis on availability and price-performance
  - Sub-class: supercomputers; emphasis on floating-point performance and fast internal networks
- Embedded Computers
  - Emphasis on price

Parallelism


- Classes of parallelism in applications:
  - Data-Level Parallelism (DLP)
  - Task-Level Parallelism (TLP)
- Classes of architectural parallelism:
  - Instruction-Level Parallelism (ILP)
  - Vector architectures / Graphics Processing Units (GPUs)
  - Thread-Level Parallelism

Flynn’s Taxonomy


- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD)
  - Vector architectures
  - Multimedia extensions
  - Graphics processing units
- Multiple instruction streams, single data stream (MISD)
  - No commercial implementation
- Multiple instruction streams, multiple data streams (MIMD)
  - Tightly-coupled MIMD
  - Loosely-coupled MIMD

Defining Computer Architecture


- "Old" view of computer architecture:
  - Instruction Set Architecture (ISA) design
  - i.e. decisions regarding registers, memory addressing, addressing modes, instruction operands, available operations, control-flow instructions, instruction encoding
- "Real" computer architecture:
  - Specific requirements of the target machine
  - Design to maximize performance within constraints: cost, power, and availability
  - Includes ISA, microarchitecture, hardware

Tracking Technology Performance Trends


- Drill down into four technologies:
  - Disks
  - Memory
  - Networks
  - Processors
- Compare ~1980 "archaic (nostalgic)" vs. ~2000 "modern (newfangled)"
  - Performance milestones in each technology
- Compare Bandwidth vs. Latency improvements in performance over time
  - Bandwidth: number of events per unit time
    e.g. Mbits/second over a network, MBytes/second from a disk
  - Latency: elapsed time for a single event
    e.g. one-way network delay in microseconds, average disk access time in milliseconds

Disks: Archaic (Nostalgic) vs. Modern (Newfangled)

- CDC Wren I, 1983
  - 3600 RPM
  - 0.03 GBytes capacity
  - Tracks/inch: 800
  - Bits/inch: 9,550
  - Three 5.25" platters
  - Bandwidth: 0.6 MBytes/sec
  - Latency: 48.3 ms
  - Cache: none
- Seagate 373453, 2003
  - 15,000 RPM (4X)
  - 73.4 GBytes (2500X)
  - Tracks/inch: 64,000 (80X)
  - Bits/inch: 533,000 (60X)
  - Four 2.5" platters (in a 3.5" form factor)
  - Bandwidth: 86 MBytes/sec (140X)
  - Latency: 5.7 ms (8X)
  - Cache: 8 MBytes

Latency Lags Bandwidth (for last ~20 years)

- Performance milestones
  - Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
- (Latency = simple operation without contention; BW = best case)

Memory: Archaic (Nostalgic) vs. Modern (Newfangled)

- 1980: DRAM (asynchronous)
  - 0.06 Mbits/chip
  - 64,000 transistors, 35 mm²
  - 16-bit data bus per module, 16 pins/chip
  - 13 MBytes/sec
  - Latency: 225 ns
  - No block transfer
- 2000: Double Data Rate Synchronous (clocked) DRAM
  - 256 Mbits/chip (4000X)
  - 256,000,000 transistors, 204 mm²
  - 64-bit data bus per DIMM, 66 pins/chip (4X)
  - 1600 MBytes/sec (120X)
  - Latency: 52 ns (4X)
  - Block transfers (page mode)

Latency Lags Bandwidth (last ~20 years)

- Performance milestones
  - Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  - Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
- (Latency = simple operation without contention; BW = best case)

LANs: Archaic (Nostalgic) vs. Modern (Newfangled)

- Ethernet 802.3
  - Year of standard: 1978
  - 10 Mbits/s link speed
  - Latency: 3000 microseconds
  - Shared media
  - Coaxial cable
- Ethernet 802.3ae
  - Year of standard: 2003
  - 10,000 Mbits/s link speed (1000X)
  - Latency: 190 microseconds (15X)
  - Switched media
  - Category 5 copper wire
- Coaxial cable: copper core, insulator, braided outer conductor, plastic covering
- Twisted pair ("Cat 5" is 4 twisted pairs in a bundle): copper, 1 mm thick, twisted to avoid antenna effects

Latency Lags Bandwidth (last ~20 years)

- Performance milestones
  - Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  - Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  - Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
- (Latency = simple operation without contention; BW = best case)

CPUs: Archaic (Nostalgic) vs. Modern (Newfangled)

- 1982: Intel 80286
  - 12.5 MHz
  - 2 MIPS (peak)
  - Latency: 320 ns
  - 134,000 transistors, 47 mm²
  - 16-bit data bus, 68 pins
  - Microcode interpreter, separate FPU chip
  - No caches
- 2001: Intel Pentium 4
  - 1500 MHz (120X)
  - 4500 MIPS (peak) (2250X)
  - Latency: 15 ns (20X)
  - 42,000,000 transistors, 217 mm²
  - 64-bit data bus, 423 pins
  - 3-way superscalar, dynamic translation to RISC operations, superpipelined (22 stages), out-of-order execution
  - On-chip 8 KB data cache, 96 KB instruction trace cache, 256 KB L2 cache

Latency Lags Bandwidth (last ~20 years)

- Performance milestones
  - Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
  - Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  - Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  - Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
- CPU improvement is high, memory improvement is low (the "memory wall")

Trends in Technology


- Integrated circuit technology
  - Transistor density: +35%/year
  - Die size: +10-20%/year
  - Integration overall: +40-55%/year
- DRAM capacity: +25-40%/year (slowing)
- Flash capacity: +50-60%/year
  - 15-20X cheaper per bit than DRAM
- Magnetic disk technology: +40%/year
  - 15-25X cheaper per bit than Flash
  - 300-500X cheaper per bit than DRAM
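As a quick arithmetic sketch of what these annual growth rates imply, the doubling times below follow from simple compounding of the rates quoted above (nothing else is assumed):

```python
# Doubling time implied by a steady annual improvement rate: 2 = (1 + r)^t.
import math

def years_to_double(annual_rate: float) -> float:
    return math.log(2) / math.log(1 + annual_rate)

for name, rate in [("Transistor density (+35%/year)", 0.35),
                   ("DRAM capacity (+25%/year)", 0.25),
                   ("Magnetic disk density (+40%/year)", 0.40)]:
    print(f"{name}: doubles roughly every {years_to_double(rate):.1f} years")
```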

Bandwidth and Latency


- Bandwidth or throughput
  - Total work done in a given time
  - 10,000-25,000X improvement for processors
  - 300-1200X improvement for memory and disks
- Latency or response time
  - Time between start and completion of an event
  - 30-80X improvement for processors
  - 6-8X improvement for memory and disks

Bandwidth and Latency

Log-log plot of bandwidth and latency milestones

Rule of Thumb for Latency Lagging BW


- In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
  (and capacity improves faster than bandwidth)
- Stated alternatively: bandwidth improves by more than the square of the improvement in latency
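A minimal sanity check of this rule against the improvement factors quoted on the Bandwidth and Latency slide (processors: 10,000-25,000X bandwidth vs. 30-80X latency; memory and disks: 300-1200X vs. 6-8X):

```python
# Check that the bandwidth gain exceeds the square of the latency gain,
# using the lower-bound improvement factors quoted earlier in these slides.
milestones = {
    "processors":       {"bw_gain": 10_000, "latency_gain": 30},
    "memory_and_disks": {"bw_gain": 300,    "latency_gain": 6},
}

for name, m in milestones.items():
    latency_squared = m["latency_gain"] ** 2
    print(f"{name}: BW {m['bw_gain']}X vs latency^2 {latency_squared}X "
          f"-> rule holds: {m['bw_gain'] > latency_squared}")
```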




Six Reasons Latency Lags Bandwidth

1. Moore's Law helps BW more than latency
   - Faster transistors, more transistors, and more pins help bandwidth:
     - MPU transistors: 0.130 vs. 42 M transistors (300X)
     - DRAM transistors: 0.064 vs. 256 M transistors (4000X)
     - MPU pins: 68 vs. 423 pins (6X)
     - DRAM pins: 16 vs. 66 pins (4X)
   - Smaller, faster transistors, but they communicate over (relatively) longer lines, which limits latency:
     - Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
     - MPU die size: 47 vs. 217 mm² (sqrt of ratio, about 2X)
     - DRAM die size: 35 vs. 204 mm² (sqrt of ratio, about 2X)

6 Reasons Latency Lags Bandwidth (cont'd)

2. Distance limits latency
   - Size of the DRAM block: long bit lines and word lines account for most of the DRAM access time
   - Speed of light limits computers communicating over a network
   - Reasons 1 and 2 may explain the roughly "linear latency vs. squared bandwidth" relationship

3. Bandwidth is easier to sell ("bigger = better")
   - E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10-microsecond-latency Ethernet
   - 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
   - Even if it is just marketing, customers are now trained to buy bandwidth
   - Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance

4. Latency helps BW, but not vice versa
   - Spinning a disk faster improves both bandwidth and rotational latency:
     - 3600 RPM -> 15,000 RPM = 4.2X
     - Average rotational latency: 8.3 ms -> 2.0 ms
     - Other things being equal, this also helps BW by 4.2X
   - Lower DRAM latency -> more accesses per second (higher bandwidth)
   - Higher linear density helps disk BW (and capacity), but not disk latency:
     - 9,550 BPI -> 533,000 BPI is 60X in BW
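The rotational-latency figures in reason 4 follow directly from the RPM values; a minimal sketch (average rotational latency is the time for half a revolution):

```python
# Average rotational latency = time for half a revolution.
def avg_rotational_latency_ms(rpm: float) -> float:
    ms_per_revolution = 60_000 / rpm   # 60,000 ms in a minute
    return ms_per_revolution / 2

for rpm in (3600, 15000):
    print(f"{rpm} RPM -> {avg_rotational_latency_ms(rpm):.1f} ms average rotational latency")
# 3600 RPM -> 8.3 ms; 15000 RPM -> 2.0 ms; about a 4.2X improvement
```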

6 Reasons Latency Lags Bandwidth (cont'd)

5. Bandwidth hurts latency
   - Queues help bandwidth but hurt latency (queuing theory)
   - Adding chips to widen a memory module increases bandwidth, but the higher fan-out on the address lines may increase latency

6. Operating system overhead hurts latency more than bandwidth
   - Long messages amortize overhead; overhead is a bigger part of short messages


Transistors and Wires


- Feature size
  - Minimum size of a transistor or wire in the x or y dimension
  - 10 microns in 1971 to 0.032 microns in 2011 (0.022-micron FinFETs in 2012)
- Transistor performance scales linearly
  - Wire delay does not improve with feature size!
- Integration density scales quadratically

Power and Energy


- Problem: get power in, get power out
- Thermal Design Power (TDP)
  - Characterizes sustained power consumption
  - Used as the target for the power supply and cooling system
  - Lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement

Dynamic Energy and Power


- Dynamic energy
  - Transistor switch from 0 -> 1 or 1 -> 0
  - Energy = 1/2 x capacitive load x voltage²
- Dynamic power
  - Power = 1/2 x capacitive load x voltage² x frequency switched
- Reducing the clock rate reduces power, not energy
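A minimal sketch of the two formulas above; the capacitive load, voltage, and switching frequency values are hypothetical placeholders, not figures from the slides:

```python
# Dynamic energy per transition and dynamic power, per the formulas above.
def dynamic_energy_joules(cap_load_farads: float, voltage: float) -> float:
    """Energy per 0->1 or 1->0 transition: 1/2 * C * V^2."""
    return 0.5 * cap_load_farads * voltage ** 2

def dynamic_power_watts(cap_load_farads: float, voltage: float, freq_hz: float) -> float:
    """Sustained switching power: 1/2 * C * V^2 * f."""
    return dynamic_energy_joules(cap_load_farads, voltage) * freq_hz

c, v, f = 1e-9, 1.0, 2e9             # hypothetical: 1 nF load, 1.0 V, 2 GHz switching
print(dynamic_power_watts(c, v, f))  # halving f halves power...
print(dynamic_energy_joules(c, v))   # ...but energy per transition is unchanged
```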

Power


- Intel 80386 consumed ~2 W
- A 3.3 GHz Intel Core i7 consumes 130 W
- Heat must be dissipated from a 1.5 x 1.5 cm chip
- This is the limit of what can be cooled by air

Reducing Power


- Techniques for reducing power:
  - Do nothing well (dark silicon)
  - Dynamic voltage-frequency scaling
  - Low-power states for DRAM and disks
  - Overclocking, turning off cores


Static Power


- Static power consumption
  - Power = static current x voltage
  - Scales with the number of transistors
  - To reduce: power gating


Trends in Cost


- Cost driven down by the learning curve
  - Yield
- DRAM: price closely tracks cost
- Microprocessors: price depends on volume
  - 10% less for each doubling of volume

Measuring Performance


- Typical performance metrics:
  - Response time
  - Throughput
- Speedup of X relative to Y
  - Speedup = Execution time_Y / Execution time_X
- Execution time
  - Wall clock time: includes all system overheads
  - CPU time: only computation time
- Benchmarks
  - Kernels (e.g. matrix multiply)
  - Toy programs (e.g. sorting)
  - Synthetic benchmarks (e.g. Dhrystone)
  - Benchmark suites (e.g. SPEC06fp, TPC-C)

Principles of Computer Design


- Take advantage of parallelism
  - e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
- Principle of locality
  - Reuse of data and instructions
- Focus on the common case
  - Amdahl's Law

Amdahl's Law

- Speedup due to enhancement E:

  Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E

- Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

  ExTime_new = ExTime_old x ((1 - F) + F / S)

  Speedup(E) = ExTime_old / ExTime_new = 1 / ((1 - F) + F / S)

Amdahl’s Law


- Floating-point instructions are improved to run 2X faster, but only 10% of the executed instructions are FP:

  ExTime_new = ExTime_old x (0.9 + 0.1 / 2) = 0.95 x ExTime_old

  Speedup_overall = ExTime_old / ExTime_new = 1 / 0.95 ≈ 1.053
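The same calculation expressed as a short sketch of Amdahl's Law in code (the function simply restates the formula above):

```python
# Amdahl's Law: overall speedup when a fraction F of the task is accelerated by S.
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# The floating-point example above: 10% of the work made 2X faster.
print(amdahl_speedup(0.10, 2.0))   # ~1.053
```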

Principles of Computer Design


- The Processor Performance Equation
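The equation itself did not survive the slide export. As a sketch, the standard processor performance equation this slide refers to is CPU time = instruction count x CPI x clock cycle time; the program parameters below are hypothetical:

```python
# Processor performance equation: CPU time = IC x CPI x clock cycle time.
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    return instruction_count * cpi / clock_rate_hz   # cycle time = 1 / clock rate

# Hypothetical program: 1e9 instructions, average CPI of 1.5, 2 GHz clock.
print(cpu_time_seconds(1e9, 1.5, 2e9))   # 0.75 seconds
```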


Principles of Computer Design


- Different instruction types having different CPIs
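The per-class version of the equation is also lost in the export; a minimal sketch of the usual weighted form, with a made-up instruction mix:

```python
# With per-class CPIs: CPU time = (sum of IC_i x CPI_i over classes) x clock cycle time.
def cpu_time_seconds(class_counts_and_cpis, clock_rate_hz: float) -> float:
    total_cycles = sum(ic * cpi for ic, cpi in class_counts_and_cpis)
    return total_cycles / clock_rate_hz

# Hypothetical mix of ALU ops, loads/stores, and branches (counts and CPIs are made up).
mix = [(5e8, 1.0), (3e8, 2.0), (2e8, 1.5)]
print(cpu_time_seconds(mix, 2e9))   # 0.7 seconds
```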