Multicore / Multiprocessor


CDA 3101 Fall 2013



Introduction to Computer Organization
Multicore / Multiprocessor Architectures

22 November 2013

Multicore Architectures

Introduction
What are Multicores?
Why Multicores?
Power and Performance Perspectives
Multiprocessor Architectures
Conclusion

CDA 3101 Fall 2011. Copyright © 2011 Prabhat Mishra

How to Reduce Power Consumption

Multicore
  One core with frequency 2 GHz
  Two cores with frequency 1 GHz (each)
  Same performance
  Two 1 GHz cores require half the power/energy
    Power ∝ freq²
    A 1 GHz core needs one-fourth the power of a 2 GHz core.
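To make the arithmetic behind these bullets explicit (using the slide's Power ∝ freq² model): running at 2 GHz costs 2² = 4 power units, while running at 1 GHz costs 1² = 1 unit. A single 1 GHz core therefore draws one-fourth the power of the 2 GHz core, and two 1 GHz cores together draw 2 units, i.e. half the power, while nominally delivering the same combined throughput.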



New challenges
  Performance
  How to utilize the cores
  It is difficult to find parallelism in programs to keep all these cores busy.


Reducing Energy Consumption [www.transmeta.com]

Pentium: Max Temp = 105.5 °C
Crusoe:  Max Temp = 48.2 °C

Both processors are running the same multimedia application.
Infrared (FLIR) cameras can be used to detect the thermal distribution.

Introduction

Never-ending story …
  Complex applications
  Faster computation
  How far did we go with uniprocessors?

Parallel processors now play a major role
  Logical way to improve performance
  Connect multiple microprocessors
  Not much left with ILP exploitation
  Server and embedded software have parallelism

Multiprocessor architectures will become increasingly attractive
  Due to the slowdown in advances of uniprocessors

Level of Parallelism

Bit-level parallelism: 1970 to ~1985
  4-bit, 8-bit, 16-bit, 32-bit microprocessors

Instruction-level parallelism: ~1985 to today
  Pipelining
  Superscalar
  VLIW
  Out-of-order execution / dynamic instruction scheduling

Process-level or thread-level parallelism
  Servers are parallel
  Desktop dual-processor PCs
  Multicore architectures (CPUs, GPUs)

Taxonomy of Parallel Architectures (Flynn Classification)

SISD (Single Instruction, Single Data)
  Uniprocessors

MISD (Multiple Instruction, Single Data)
  Multiple processors on a single data stream
  No commercial prototypes. Can be thought of as successive refinement of a given set of data by multiple processors (units).

SIMD (Single Instruction, Multiple Data)
  Examples: Illiac-IV, CM-2
  Simple programming model, low overhead, and flexibility
  All custom integrated circuits

MIMD (Multiple Instruction, Multiple Data)
  Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
  Flexible
  Difficult to program: no unifying model of parallelism
  Use off-the-shelf microprocessors
  MIMD in practice: designs with <= 128 processors

MIMD

Two types
  Centralized shared-memory multiprocessors
  Distributed-memory multiprocessors

Exploits thread-level parallelism
  The program should have at least n threads or processes for a MIMD machine with n processors (a minimal threading sketch follows this slide)
  Threads can be of different types
    Independent programs
    Parallel iterations of a loop (extracted by the compiler)
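As a rough illustration of thread-level parallelism on a shared-memory MIMD machine, the sketch below splits the iterations of a loop across several host threads, one chunk per thread. It is ordinary C++ host code (it also compiles as the host part of a CUDA program); the function name scale_chunk, the array size, and the thread count are illustrative choices, not taken from the slides.

#include <thread>
#include <vector>

// Each thread handles a contiguous chunk of the loop iterations,
// mimicking what a parallelizing compiler or runtime would produce.
void scale_chunk(float* a, int begin, int end, float k) {
    for (int i = begin; i < end; ++i)
        a[i] *= k;               // independent iterations: no sharing, no races
}

int main() {
    const int N = 1 << 20;       // problem size (illustrative)
    const int n_threads = 4;     // ideally at least the number of cores
    std::vector<float> a(N, 1.0f);

    std::vector<std::thread> pool;
    int chunk = N / n_threads;
    for (int t = 0; t < n_threads; ++t) {
        int begin = t * chunk;
        int end   = (t == n_threads - 1) ? N : begin + chunk;
        pool.emplace_back(scale_chunk, a.data(), begin, end, 2.0f);
    }
    for (auto& th : pool) th.join();   // a MIMD machine needs >= n threads to keep n cores busy
    return 0;
}

Each chunk touches disjoint array elements, so the threads need no synchronization beyond the final join.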

Centralized Shared-Memory Multiprocessor


Small number of processors share a centralized memory
  Use multiple buses or switches
  Multiple memory banks

Main memory has a symmetric relationship to all processors and uniform access time from any processor
  SMP: symmetric shared-memory multiprocessor
  UMA: uniform memory access architecture

Increase in processor performance and memory bandwidth requirements makes the centralized-memory paradigm less attractive

Distributed-Memory Multiprocessors


Distributing memory has two benefits
  Cost-effective way to scale memory bandwidth
  Reduces local memory access time

Communicating data between processors is complex and has higher latency

Two approaches for data communication
  Shared address space (not centralized memory)
    Same physical address refers to the same memory location
    DSM: Distributed Shared-Memory architectures
    NUMA: non-uniform memory access, since the access time depends on the location of the data
  Logically disjoint address spaces: Multicomputers

Small-Scale Shared Memory

Caches serve to:
  Increase bandwidth versus bus/memory
  Reduce latency of access
  Valuable for both private data and shared data

What about cache consistency?

Time | Event                  | Cache A | Cache B | X (memory)
  0  |                        |         |         |     1
  1  | CPU A reads X          |    1    |         |     1
  2  | CPU B reads X          |    1    |    1    |     1
  3  | CPU A stores 0 into X  |    0    |    1    |     0

Example: Cache Coherence Problem

Processors see different values for u after event 3
  With write-back caches, the value written back to memory depends on which cache flushes or writes back its value
  Processes accessing main memory may see a very stale value
Unacceptable for programming, and it happens frequently! (a minimal code sketch of this sharing pattern follows the figure below)

[Figure: three processors P1, P2, P3 with private caches connected over a bus to shared memory and I/O devices. The copies of u (initially 5) in the caches and in memory diverge after one processor writes u = 7, so subsequent reads of u may return either the old or the new value.]
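The sharing pattern in the figure can be sketched in a few lines of host code: one thread keeps reading a shared variable u (initially 5) while another writes 7 to it. std::atomic is used here only so the program is well defined; the point of the slide is that the hardware cache coherence protocol is what makes the writer's store visible to the reader's cache instead of letting it keep serving a stale copy of 5. The variable name and values mirror the figure; the rest is an illustrative sketch.

#include <atomic>
#include <thread>
#include <cstdio>

// Shared value, initially 5 as in the slide's figure.
std::atomic<int> u{5};

int main() {
    std::thread reader([] {
        while (u.load() == 5) { /* spin: still seeing the old value */ }
        std::printf("reader now sees u = %d\n", u.load());
    });
    std::thread writer([] {
        u.store(7);              // corresponds to the "u = 7" store in the figure
    });
    writer.join();
    reader.join();               // on a coherent machine the reader eventually sees 7
    return 0;
}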

4 C's: Sources of Cache Misses

Compulsory misses (aka cold-start misses)
  First access to a block

Capacity misses
  Due to finite cache size
  A replaced block is later accessed again

Conflict misses (aka collision misses)
  In a non-fully associative cache
  Due to competition for entries in a set
  Would not occur in a fully associative cache of the same total size
  (see the set-index sketch below)

Coherence misses
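A minimal sketch of where conflict misses come from: in a direct-mapped (1-way) cache every address maps to exactly one set, so two addresses that are a multiple of the cache size apart evict each other even when the rest of the cache is empty. The cache geometry below (32 KiB, 64-byte blocks, direct mapped) is an illustrative assumption, not taken from the slides.

#include <cstdio>

// Toy cache geometry (illustrative numbers):
// 32 KiB cache, 64-byte blocks, direct mapped -> 512 sets.
const unsigned BLOCK_BYTES = 64;
const unsigned NUM_SETS    = 512;

unsigned set_index(unsigned addr) {
    return (addr / BLOCK_BYTES) % NUM_SETS;   // which set/line the address maps to
}

int main() {
    unsigned a = 0x00010000;                  // two addresses exactly one cache size apart ...
    unsigned b = a + BLOCK_BYTES * NUM_SETS;  // ... map to the same set
    std::printf("set(a) = %u, set(b) = %u\n", set_index(a), set_index(b));
    // Alternating accesses to a and b keep evicting each other: conflict misses
    // that a fully associative cache of the same total size would not suffer.
    return 0;
}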

Graphics Processing Units (GPUs)

Moore's Law will come to an end
  Many complicated solutions
  Simple solution: SPATIAL PARALLELISM
    SIMD model (single instruction, multiple data streams)

GPUs have a SIMD grid with a local & shared memory model

Graphics Processing Units (GPUs)

Nvidia Fermi GPU
  3 GB DRAM, 512 cores

GPUs: Nvidia CUDA Hierarchy

CUDA architecture
  - Thread
  - Thread Block
  - Grid of Thread Blocks
  - Intelligent CUDA Compiler

Map Process to Thread
Group Threads in Block
Group Blocks in Grids for Efficient Memory Access

Also, memory coalescing operations for faster data transfer (a minimal kernel sketch follows below)
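A minimal CUDA sketch of the mapping described above: each thread computes one element, threads are grouped into a block, and blocks into a grid. Because consecutive threads access consecutive words, the global-memory accesses can be coalesced. The kernel name scale, the problem size, and the block size of 256 threads are illustrative choices, not taken from the slides.

#include <cuda_runtime.h>

// Each CUDA thread handles one element; the thread -> block -> grid
// hierarchy mirrors the CUDA hierarchy listed above.
__global__ void scale(float* a, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        a[i] *= k;   // consecutive threads touch consecutive words -> coalesced access
}

int main() {
    const int N = 1 << 20;
    float* d_a;
    cudaMalloc(&d_a, N * sizeof(float));

    int threadsPerBlock = 256;                                        // one thread block
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // grid of blocks
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_a, 2.0f, N);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    return 0;
}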

Nvidia Tesla 20xx GPU Board


GPU Problems and Solutions

GPUs are designed for graphics rendering
GPUs are not designed for general-purpose computing!! (no unifying model of parallelism)

Memory hierarchy:
  Local Memory: fast, small (MBs)
  Shared Memory: slower, larger
  Global Memory: slow, GBytes

How to circumvent the data movement cost? (a shared-memory tiling sketch follows this slide)
  Clever hand coding: costly, app-specific
  Automatic coding: sub-optimal, needs software support
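One common form of the "clever hand coding" mentioned above is to stage data that will be reused in fast on-chip __shared__ memory, so each element is fetched from slow global memory only once. The 3-point stencil, the tile size, and the names below are illustrative assumptions, not an example from the slides.

#include <cuda_runtime.h>

#define TILE 256   // threads per block; must match the launch configuration

// Stage a tile of the input in per-block __shared__ memory so each element is
// read from global memory once, even though the stencil uses it up to 3 times.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];                  // tile plus one halo cell per side
    int g = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int l = threadIdx.x + 1;                          // local index within the tile

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (g > 0 && g - 1 < n) ? in[g - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                                  // whole tile loaded before any use

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    stencil3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}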

Advantages and Disadvantages

GPUs provide fast parallel computing
GPUs work best for parallel solutions
Sequential programs can actually run slower
Amdahl's Law describes the speedup (worked example below):

  Speedup = 1 / (S + P/N)

  P is the fraction of the program that is parallel
  S is the fraction of the program that is sequential
  N is the number of processors
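As a quick worked example of the formula above (the numbers are chosen for illustration, not taken from the slides): with P = 0.9, S = 0.1, and N = 8 processors, Speedup = 1 / (0.1 + 0.9/8) = 1 / 0.2125 ≈ 4.7, and even with unlimited processors the speedup can never exceed 1/S = 10. This is why largely sequential programs gain little from a GPU's many cores.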

Multicore CPUs

Intel Nehalem:
  Servers, HPC arrays
  45 nm circuit technology

Intel Xeon:
  2001 - present
  2 to 8 cores
  Workstations

Multiple cores
  Laptops
  Heat dissipation?

Intel Multicore CPU Performance
[Chart: performance comparison of a dual-Nehalem system versus a single core]

Conclusions

Parallel machines, parallel solutions
  Inherently sequential programs don't benefit much from parallelism

2 main types of parallel architectures
  SIMD: single instruction, multiple data stream
  MIMD: multiple instruction, multiple data stream

Modern parallel architectures (multicores)
  GPUs: exploit SIMD parallelism for general-purpose parallel computing solutions
  CPUs: multicore CPUs are more amenable to MIMD parallel applications