# Multicore: Read Chapter 7-7.7 (all of GPUs)

Software & software engineering

2 Dec 2013

86 views

Chapter 7: Multicores, Multiprocessors, and Clusters

Objectives

The student shall be able to:

- Define parallel processing, multicore, cluster, vector processing.
- Define SISD, SIMD, MISD, MIMD.
- Draw network configurations: bus, ring, mesh, cube.
- Define how vector processing works: how instructions may look, and how they may be processed.
- Define GPU and name 3 characteristics that differentiate a CPU and GPU.


§ Instruction-Level Parallelism: Pipelines and Multiple Issue

§ 5.8: Parallelism and Memory Hierarchies: Associative Memory, Interleaved Memory

§ 6.9: Parallelism and I/O: Redundant Arrays of Inexpensive Disks

Parallel Processing


Examples of tasks that can run in parallel:

- Microsoft Word: editor, spell check, grammar check, backup
- Matrix multiply

Parallel Programming

```cpp
// Main: spawn numChild children, each factoring its own subrange
cout << "Run Factor " << total << ":" << numChild << endl;
Factor factor;
// Spawn children
for (i = 0; i < numChild; i++) {
    if (fork() == 0)
        factor.child(begin, begin + range);   // child: work its range, then exit
    begin += range + 1;                       // parent: advance to the next range
}
// Wait for children to finish
for (i = 0; i < numChild; i++)
    wait(&stat);
cout << "All Children Done: " << numChild << endl;

Factor::child(int begin, int end)
{
    int val, i;
    for (val = begin; val < end; val++) {
        for (i = 2; i <= end/2; i++)
            if (val % i == 0) break;           // found a divisor
        if (i > val/2)
            cout << "Factor:" << val << endl;  // no divisor: val is prime
    }
    exit(0);
}
```


Introduction

§ 7.1 Introduction

- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Job-level (process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)


Hardware and Software

- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
- Challenge: making effective use of parallel hardware


Amdahl's Law

- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - T_new = T_parallelizable/100 + T_sequential
  - Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90
  - Solving: F_parallelizable = 0.999
- Need sequential part to be 0.1% of original time


Shared Memory

§ 7.3 Shared Memory Multiprocessors

- SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Synchronize shared variables using locks
- Memory access time
  - UMA (uniform) vs. NUMA (nonuniform)


Example: Sum Reduction

```
half = 100;
repeat
    synch();
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2;  /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
```
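The reduction above can be checked sequentially in standard C++; the inner loop below stands in for the adds that processors P0..P(half−1) would perform in parallel (the function and its name are ours):

```cpp
#include <vector>

// Tree-style sum reduction over the partial sums, mirroring the slide's
// pseudocode; assumes at least two elements. Each inner-loop iteration
// corresponds to one processor Pn working independently.
long reduce(std::vector<long> sum) {
    int half = static_cast<int>(sum.size());
    do {
        // a barrier synch() would go here on real hardware
        if (half % 2 != 0)                 // odd count: P0 absorbs the stray element
            sum[0] += sum[half - 1];
        half = half / 2;                   // dividing line on who sums
        for (int Pn = 0; Pn < half; Pn++)  // these adds are independent
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
    return sum[0];
}
```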


Message Passing

§ 7.4 Clusters and Other Message-Passing Multiprocessors

- Each processor (or computer) has a private physical address space
- Hardware sends/receives messages between processors


Loosely Coupled Clusters

- Network of independent computers
  - Each has private memory and OS
  - Connected using I/O system
    - E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
- High availability, scalable, affordable
- Problems
  - Low interconnect bandwidth
    - c.f. processor/memory bandwidth on an SMP
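To make the send/receive model concrete, here is a minimal sketch (ours, not from the slides) of two processes cooperating purely by message passing over a POSIX pipe; a real cluster would do the same over a network, e.g. with MPI:

```cpp
#include <sys/wait.h>
#include <unistd.h>

// Split a sum across two processes: the child computes 1..mid and sends
// its partial result through a pipe; the parent computes mid+1..n and
// combines. No shared memory is involved: only explicit messages.
long distributed_sum(int n) {
    int fd[2];
    if (pipe(fd) != 0) return -1;
    int mid = n / 2;
    if (fork() == 0) {                          // child: "worker node"
        close(fd[0]);
        long partial = 0;
        for (int i = 1; i <= mid; i++) partial += i;
        write(fd[1], &partial, sizeof partial); // send message
        _exit(0);
    }
    close(fd[1]);                               // parent: "master node"
    long local = 0, partial = 0;
    for (int i = mid + 1; i <= n; i++) local += i;
    read(fd[0], &partial, sizeof partial);      // receive message
    wait(nullptr);
    return local + partial;
}
```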


Grid Computing

- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
- Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid

§ 7.5 Hardware Multithreading

- Each thread has its own register file and PC
- Fine-grained = interleaved processing: switch threads on each instruction
- Coarse-grained: switch threads only when a long stall is required (memory access, wait)
- Simultaneous (SMT): uses dynamic scheduling to schedule multiple threads at once


Hardware Multithreading

- Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
- Fine-grain: interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain: only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)


Simultaneous Multithreading

- In a multiple-issue, dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies are handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches


Future of Multithreading

- Will it survive? In what form?
- Power considerations may favor simplified microarchitectures
- Tolerating cache-miss latency
  - Thread switch may be most effective
- Multiple simple cores might share resources more effectively

In /home/student/Classes/Cs355/PrimeLab are 2 files: Factor.cpp and runFactor.

Copy them over to one of your directories (below called mydirectory):

```shell
cp Factor.cpp ~/mydirectory
cp runFactor ~/mydirectory
cd ~/mydirectory
```

You want to observe the processor utilization (how busy the processor is).

Linux: Applications -> System Tools -> System Monitor -> Resources

Now compile Factor.cpp into executable Factor and run it using the command file runFactor (in Linux):

```shell
g++ Factor.cpp -o Factor
./runFactor
```

The file time.dat will contain the start and end time of the program, so you can calculate the duration.

Now change the number of threads in Factor.cpp: numChild. Recompile and run.

Create a matrix in Microsoft Excel or OpenOffice, with one column containing the number of children and the second column containing the seconds to complete. Label the second column: Delay.

Linux: Applications -> Office -> LibreOffice Calc

Chart the Delay column to show the efficiency of multiple threads. Find the times for a range of thread counts: 1, 2, 3, 4, 5, 6, 10, 20 (whatever you have time for).


[Sample chart: Delay (seconds to complete) vs. number of children: 1, 2, 3, 4, 6]


Instruction and Data Streams

§ 7.6 SISD, MIMD, SIMD, SPMD, and Vector

An alternate classification:

|  | Data Streams: Single | Data Streams: Multiple |
|---|---|---|
| Instruction Streams: Single | SISD: Intel Pentium 4 | SIMD: SSE instructions of x86 |
| Instruction Streams: Multiple | MISD: No examples today | MIMD: Intel Xeon e5345 |

- SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors


SIMD

- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
    - Multiple data elements in 128-bit wide registers
- All processors execute the same instruction at the same time
  - Each with different data address, etc.
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications


Vector Processors

- Highly pipelined function units
- Stream data from/to vector registers to units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Example: Vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of vector of double
- Significantly reduces instruction-fetch bandwidth


Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:

```asm
      l.d    $f0,a($sp)      ;load scalar a
      addiu  r4,$s0,#512     ;upper bound of what to load
loop: l.d    $f2,0($s0)      ;load x(i)
      mul.d  $f2,$f2,$f0     ;a × x(i)
      l.d    $f4,0($s1)      ;load y(i)
      add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d    $f4,0($s1)      ;store into y(i)
      addiu  $s0,$s0,#8      ;increment index to x
      addiu  $s1,$s1,#8      ;increment index to y
      subu   $t0,r4,$s0      ;compute bound
      bne    $t0,$zero,loop  ;check if done
```

Vector MIPS code:

```asm
l.d     $f0,a($sp)   ;load scalar a
lv      $v1,0($s0)   ;load vector x
mulvs.d $v2,$v1,$f0  ;vector-scalar multiply
lv      $v3,0($s1)   ;load vector y
addv.d  $v4,$v2,$v3  ;add y to product
sv      $v4,0($s1)   ;store the result
```
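For reference, the same DAXPY kernel as a plain C++ loop (our transcription); the scalar MIPS listing is essentially the compiled form of this loop, while the vector version replaces the whole loop with a handful of vector instructions:

```cpp
#include <cstddef>

// DAXPY: y = a*x + y over n doubles; one multiply-add per element.
void daxpy(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```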


Vector vs. Scalar

- Vector architectures and compilers
  - Simplify data-parallel programming
  - Speed up processing since there are no loops
  - No data hazard within a vector instruction
  - Benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops

Multimedia Improvements

- Intel x86 (e.g., 80386) architecture
  - MMX: MultiMedia Extensions
  - SSE: Streaming SIMD Extensions
- A register can be subdivided into smaller units … or extended and subdivided


[Diagram: one 32-bit register used whole, as 2 × 16-bit units, or as 4 × 8-bit units, all driven by one ALU]
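As an illustration (ours, and it assumes an x86 machine with SSE2), compiler intrinsics expose exactly this subdivision: one 128-bit register is treated as sixteen 8-bit lanes and added with a single instruction:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#include <cstdint>

// Add sixteen 8-bit values lane-by-lane with one SIMD instruction.
void add16_u8(const uint8_t* a, const uint8_t* b, uint8_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi8(va, vb);  // 16 additions at once (wrap on overflow)
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vc);
}
```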


Interconnection Networks

§ 7.8 Introduction to Multiprocessor Network Topologies

- Network topologies
  - Arrangements of processors, switches, and links
- Bus
- Ring
- 2D Mesh
- N-cube (N = 3)
- Fully connected


Multistage Networks


Network Characteristics

- Performance
  - Latency (delay) per message
  - Throughput: messages/second
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon


History of GPUs

§ 7.7 Introduction to Graphics Processing Units

- Graphics Processing Units
  - Processors oriented to 3D graphics tasks (e.g., rasterization)
- Architecture
  - GPU memory optimized for bandwidth, not latency
    - Wider DRAM chips
  - Smaller memories, no multilevel cache
  - Simultaneous execution
    - Parallel processing: SIMD + scalar
  - No double-precision floating point


Graphics in the System


GPU Architectures

- Processing is highly data-parallel
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general-purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)


Example: NVIDIA Tesla

- Streaming multiprocessor
  - 8 × Streaming processors


Example: NVIDIA Tesla

- Streaming Processors (SP)
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
- Warp: group of 32 threads
  - Executed in parallel, SIMD (or SPMD) style
    - 8 SPs × 4 clock cycles
- Hardware contexts for 24 warps
  - Registers, PCs, …


Classifying GPUs

- Don't fit nicely into SIMD/MIMD model
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
    - Need to write general-purpose code with care

|  | Static: Discovered at Compile Time | Dynamic: Discovered at Runtime |
|---|---|---|
| Instruction-Level Parallelism | VLIW | Superscalar |
| Data-Level Parallelism | SIMD or Vector | Tesla Multiprocessor |


Roofline Diagram

Attainable GFLOPs/sec = Min( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
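The roofline bound is a one-liner; the helper below (ours; the sample numbers in the note are illustrative, not from the text) makes the two "roofs" explicit:

```cpp
#include <algorithm>

// Roofline model: performance is capped either by the memory roof
// (bandwidth × arithmetic intensity) or by the compute roof (peak FP).
double attainable_gflops(double peak_bw_gb_s, double flops_per_byte,
                         double peak_fp_gflops) {
    return std::min(peak_bw_gb_s * flops_per_byte, peak_fp_gflops);
}
```

For a hypothetical machine with 10 GB/s of memory bandwidth and a 50 GFLOPs/sec FP peak, code at 0.5 FLOPs/byte is memory-bound at 5 GFLOPs/sec, while code at 100 FLOPs/byte hits the 50 GFLOPs/sec compute roof.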


Optimizing Performance

- Choice of optimization depends on arithmetic intensity of code
- Arithmetic intensity is not always fixed
  - May scale with problem size
  - Caching reduces memory accesses
    - Increases arithmetic intensity


Comparing Systems

- Example: Opteron X2 vs. Opteron X4
  - 2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
  - Same memory system
- To get higher performance on X4 than X2
  - Need high arithmetic intensity
  - Or working set must fit in the X4's 2 MB L3 cache


Optimizing Performance

- Optimize FP performance
  - Improve superscalar ILP and use of SIMD instructions
- Optimize memory usage
  - Software prefetch
  - Memory affinity
    - Avoid non-local data accesses


Four Example Systems

§ 7.11 Real Stuff: Benchmarking Four Multicores …

- 2 × quad-core Intel Xeon e5345 (Clovertown)
  - Chipset = Bus
  - Fully-Buffered DRAM DIMMs
- 2 × quad-core AMD Opteron X4 2356 (Barcelona)


Four Example Systems

- 2 × oct-core IBM Cell QS20
  - SPE = Synergistic Proc. Element
  - Have SIMD instr. set
- 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
  - Fine-grained multithreaded


Pitfalls

- Not developing the software to take account of a multiprocessor architecture
- Example: using a single lock for a shared composite resource
  - Serializes accesses, even if they could be done in parallel
  - Use finer-granularity locking
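A sketch (ours) of the pitfall and the fix: one lock over a composite resource serializes updates that touch independent parts; per-part locks let them proceed in parallel:

```cpp
#include <mutex>

// Coarse: one lock covers the whole composite resource, so an update
// to `a` blocks a concurrent update to the unrelated `b`.
struct CoarseCounters {
    std::mutex m;
    long a = 0, b = 0;
    void bump_a() { std::lock_guard<std::mutex> g(m); a++; }
    void bump_b() { std::lock_guard<std::mutex> g(m); b++; }
};

// Fine: independent fields get independent locks, so updates to a and b
// from different threads no longer serialize against each other.
struct FineCounters {
    std::mutex ma, mb;
    long a = 0, b = 0;
    void bump_a() { std::lock_guard<std::mutex> g(ma); a++; }
    void bump_b() { std::lock_guard<std::mutex> g(mb); b++; }
};
```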


Concluding Remarks

§ 7.13 Concluding Remarks

- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower latency, higher bandwidth interconnect
- An ongoing challenge for computer architects!