Parallel Programming: Flynn's Classification

Parallel Programming
Shared memory & Message passing
Flynn’s Classification
• Based on the notions of instruction and data streams (1972)
  – SISD (Single Instruction stream over a Single Data stream)
  – SIMD (Single Instruction stream over Multiple Data streams)
  – MISD (Multiple Instruction streams over a Single Data stream)
  – MIMD (Multiple Instruction streams over Multiple Data streams)
• Popularity
  – MIMD > SIMD > MISD
SISD (Single Instruction Stream over a Single Data Stream)
[Figure: SISD organization. The control unit (CU) sends an instruction stream (IS) to the processing unit (PU), which exchanges a data stream (DS) with the memory unit (MU) and I/O. IS: Instruction Stream, DS: Data Stream, CU: Control Unit, PU: Processing Unit, MU: Memory Unit]
• SISD
– Conventional sequential machines
SIMD (Single Instruction Stream over Multiple Data Streams)
• SIMD
– Vector computers
– Special purpose computations
[Figure: SIMD architecture with distributed memory. A single CU broadcasts the instruction stream (IS) to processing elements PE 1 to PE n, each with its own local memory LM 1 to LM n carrying a data stream (DS); the program is loaded from the host and the data sets are loaded from the host into the local memories. PE: Processing Element, LM: Local Memory]
MISD (Multiple Instruction Streams over a Single Data Stream)
• MISD
  – Processor arrays, systolic arrays
  – Special purpose computations
[Figure: MISD architecture (the systolic array). A memory holding program and data feeds a single data stream (DS) through processing units PU 1 to PU n and out to I/O, while control units CU 1 to CU n each supply their own instruction stream (IS) to the corresponding PU.]
MIMD (Multiple Instruction Streams over Multiple Data Streams)
• MIMD
  – General purpose parallel computers
[Figure: MIMD architecture with shared memory. Each control unit CU 1 to CU n issues its own instruction stream (IS) to its processing unit PU 1 to PU n; every PU exchanges a data stream (DS) with the shared memory and I/O.]
Current Parallel Computer Architectures
• MIMD
  – Multiprocessor (Shared Address Space)
    • PVP
    • SMP
    • DSM
  – Multicomputer (Message Passing)
    • MPP
    • Constellation
    • Cluster
Programming Models
• What does the programmer use when coding applications?
• Specifies communication and synchronization
• Classification
  – Uniprocessor model: von Neumann model
  – Multiprogramming: no communication and synchronization at program level
    • (ex) CAD
  – Shared address space
    • Symmetric multiprocessor model
    • CC-NUMA model
  – Message passing
  – Data parallel
Fundamental Design Issues
• Design of the user-system and hardware-software interface
  – proposed by Culler et al.
• Functional Issues:
  – Naming: how are logically shared data referenced?
  – Operations: what operations are provided on shared data?
  – Ordering: how are accesses to data ordered and coordinated?
• Performance Issues:
  – Replication: how are data replicated to reduce communication?
Sequential Programming Model
• Functional
– Naming: Can name any variable in virtual
address space
• Hardware (and perhaps compilers) does
translation to physical addresses
– Operations: Loads and Stores
– Ordering: Sequential program order
• Performance
– Rely on dependences on single location (mostly):
dependence order
– Compilers and hardware violate other orders without
getting caught
– Compiler: reordering and register allocation
– Hardware: out of order, pipeline bypassing, write
buffers
– Transparent replication in caches
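A minimal C sketch (not from the slides) of the point above: only the dependence order on each single location is observable, so the compiler and hardware are free to rearrange the other accesses.

int a, b;

int f(void) {
    a = 1;          /* these two stores are independent: compiler and      */
    b = 2;          /* hardware may reorder them without getting caught    */
    a = a + 3;      /* depends on the store to a above: this order is kept */
    return a + b;   /* only the observable result, 6, must survive         */
}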
SAS Programming Model
[Figure: two threads (processes) in one system share a variable X held in memory; one performs read(X) and the other write(X) on the same shared variable.]
Shared Address Space Programming Model
• Naming
– Any process can name any variable in shared space
• Operations
– Loads and stores, plus those needed for ordering
• Simplest Ordering Model
– Within a process/thread: sequential program order
– Across threads: some interleaving (as in time-sharing)
– Additional orders through synchronization
– Again, compilers/hardware can violate orders without
getting caught
Synchronization
• Mutual exclusion (locks)
– Ensure certain operations on certain data can be
performed by only one process at a time
– Room that only one person can enter at a time
– No ordering guarantees
• Event synchronization
– Ordering of events to preserve dependences
– e.g. producer -> consumer of data
– 3 main types:
• point-to-point
• global
• group
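A minimal POSIX-threads sketch (an illustration, not part of the slides) of both kinds of synchronization: a mutex for mutual exclusion and a semaphore for point-to-point producer/consumer event synchronization.

#include <pthread.h>
#include <semaphore.h>

int counter = 0;                        /* protected by lock (mutual exclusion)           */
int data;                               /* produced by one thread, consumed by the other  */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
sem_t data_ready;                       /* event: "data has been produced"                */

void *producer(void *arg) {
    pthread_mutex_lock(&lock);          /* only one thread at a time in here              */
    counter++;
    pthread_mutex_unlock(&lock);
    data = 42;                          /* produce                                        */
    sem_post(&data_ready);              /* point-to-point event: signal the consumer      */
    return NULL;
}

void *consumer(void *arg) {
    sem_wait(&data_ready);              /* wait for the producer's event                  */
    int local = data;                   /* data is guaranteed to be valid here            */
    (void)local;
    return NULL;
}

int main(void) {
    pthread_t p, c;
    sem_init(&data_ready, 0, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}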
Shared Address Space
Architecture
• Any processor can directly reference any memory
location
– Communication occurs implicitly by “loads and stores”.
• Natural extension of uniprocessor model
– Location transparency
– Good throughput on multiprogrammed workloads
• OS uses shared memory to coordinate processes
• Shared memory multiprocessors
  – SMP: every processor has equal access to the shared memory, the I/O devices, and the OS services (UMA architecture)
  – NUMA: distributed shared memory
SMP (Symmetric Multi-Processor)
[Figure: several processor/cache modules (P/C) and several shared-memory modules (SM) connected by a bus or crossbar switch. P/C: microprocessor and cache]
DSM (Distributed Shared Memory)
[Figure: nodes, each with a processor/cache (P/C), local memory (LM), cache directory (DIR) and network interface (NIC) on a memory bus (MB), connected by a custom-designed network. DIR: Cache Directory]
MP Programming Model
[Figure: a process on Node A executes send(Y) and a process on Node B executes receive(Y'); the message carries Y from Node A's memory into Y' in Node B's memory.]
Message Passing Programming Model
• Naming
– Processes can name private data directly.
– No shared address space
• Operations
– Explicit communication: send and receive
– Send transfers data from private address space to
another process
– Receive copies data from process to private address
space
– Must be able to name processes
• Ordering
– Program order within a process
– Send and receive can provide pt-to-pt synch between
processes
– Mutual exclusion inherent
• Can construct global address space
– Process number + address within process address
space
– But no direct operations on these names
Message Passing Architectures
• Complete computer as building block, including I/O
– Communication via explicit I/O operations
• Programming model:
– direct access to the private address space (local memory),
– Interprocessor communication via explicit messages
(send/receive)
• High-level block diagram similar to NUMA
– But comm. integrated at IO level, needn’t be into memory system
– Easier to build than scalable SAS
• Programming model more removed from basic hardware
operations
– Library or OS intervention
Message Passing Abstraction
• Send specifies buffer to be transmitted and receiving
process
• Recv specifies sending process and application storage
to receive into
• Memory to memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in
process/tag space too
• In simplest form, the send/recv match achieves pairwise
synch event
– Other variants too
• Many overheads: copying, buffer management,
protection
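One possible way (a sketch, not the only one) to realize this abstraction is with the System V message queues used later in these slides: the queue identifies the communication endpoint, the mtype field plays the role of the optional tag, and msgrcv implements the matching rule. The key and tag values below are arbitrary.

#include <sys/ipc.h>
#include <sys/msg.h>
#include <string.h>

struct msg { long tag; char payload[64]; };

int main(void) {
    int qid = msgget(0x42, IPC_CREAT | 0666);      /* endpoint both sides agree on         */
    struct msg out = { .tag = 7 };
    strcpy(out.payload, "hello");

    /* "send": names the buffer to transmit and the receiving endpoint */
    msgsnd(qid, &out, sizeof(out.payload), 0);

    /* "recv": names the endpoint, the local storage and the tag to match */
    struct msg in;
    msgrcv(qid, &in, sizeof(in.payload), 7, 0);

    msgctl(qid, IPC_RMID, NULL);                   /* clean up the queue                   */
    return 0;
}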
MPP (Massively Parallel Processing)
[Figure: nodes, each with a processor/cache (P/C), local memory (LM) and network interface (NIC) on a memory bus (MB), connected by a custom-designed network. MB: Memory Bus, NIC: Network Interface Circuitry]
Cluster
[Figure: nodes, each with a processor/cache (P/C), memory (M), a bridge to an I/O bus (IOB) with a local disk (LD), and a NIC, connected by a commodity network (Ethernet, ATM, Myrinet, VIA). LD: Local Disk, IOB: I/O Bus]
Constellation
[Figure: each node is an SMP with 16 or more processors/caches (P/C) sharing memory modules (SM), plus an I/O controller (IOC), a local disk (LD) and a NIC attached to a hub; the nodes are connected by a custom or commodity network. IOC: I/O Controller]
Trend in Parallel Computer Architectures
[Chart: number of HPCs of each type (MPPs, Constellations, Clusters, SMPs) per year, 1997-2002.]
Converged Architecture of Current Supercomputers
[Figure: clusters of multiprocessors. Each node is a multiprocessor (several processors P sharing a memory); the nodes are connected by an interconnection network.]
IPC on MPSoC
System Architecture Overview
SystemV IPC - Overview
• Born for high-level communication and synchronization between processes on the same processor.
• The kernel acts as guarantor of atomicity and synchronization, and as resource manager.
• High semantic level and flexibility.
• 3 types of communication facilities:
  – Shared memory
  – Integer semaphores
  – Message queues
[Figure: processes interact through the IPC API of the SystemV (Linux) kernel, which manages the system resources (memory, CPU) via its scheduler and memory allocator.]
SystemV IPC - Features
• IPC uses the idea of a KEY to get the ID of an instantiated facility.
• All facilities have a "get" API to obtain a facility ID from a key (and, if needed, to create a new facility), as sketched below.
• KEY and ID spaces are different between facilities.
[Figure: Process A and Process B call getAPI(KEY) with the same KEY and obtain the same ID, which refers to the same facility instance inside the IPC kernel structures; the interaction is managed by the kernel.]
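A minimal sketch (not from the slides) of the KEY-to-ID mechanism: any two processes that agree on the same key obtain the same facility ID through the corresponding "get" call, and each facility type has its own ID space. The pathname and project id passed to ftok() are arbitrary choices.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>
#include <stdio.h>

int main(void) {
    key_t key = ftok("/tmp", 'A');                    /* same KEY in every process         */
    int shmid = shmget(key, 4096, IPC_CREAT | 0666);  /* shared-memory ID for that key     */
    int semid = semget(key, 1,    IPC_CREAT | 0666);  /* semaphore ID: a separate ID space */
    printf("key=%d shmid=%d semid=%d\n", (int)key, shmid, semid);
    return 0;
}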
SystemV IPC - Features
• Processes use the ID to work with the facility.
• All facilities have one or more "use" APIs to work with them.
• Facilities can be destroyed; resources are dynamically managed by the kernel.
IPC acts like a broker
[Figure: Process A and Process B call useAPI(ID) with the same ID and reach the same facility instance (ID=a) inside the IPC kernel structures; the interaction is managed by the kernel.]
SystemV IPC - Shared memory
• The shared memory facility provides shared memory buffers that processes can create, attach, use, detach and finally destroy (see the sketch below).
• No synchronization mechanism is provided.
• Once attached, processes use the buffer address directly.
[Figure: Process A and Process B call shmat(ID) on the same facility (ID=a) and each obtains a pointer to the same shared buffer.]
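A minimal sketch of the full life cycle just listed (create, attach, use, detach, destroy); the key 1234 and the buffer size are arbitrary.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

int main(void) {
    int shmid = shmget(1234, 4096, IPC_CREAT | 0666);   /* create                      */
    char *buf = shmat(shmid, NULL, 0);                  /* attach                      */
    strcpy(buf, "no synchronization is implied here");  /* use the raw buffer address  */
    shmdt(buf);                                         /* detach                      */
    shmctl(shmid, IPC_RMID, NULL);                      /* destroy                     */
    return 0;
}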
SystemV IPC - Semaphores
• The semaphore facility provides integer semaphore semantics (sketched below).
• Processes can suspend themselves waiting on a variety of conditions.
• The underlying kernel provides efficient scheduling!
• A semaphore facility contains an array of semaphores.
• Support for:
  – atomic multi-operations on a semaphore set
  – "undo" capability
  – automatic wake-up of processes on facility destruction
[Figure: Process A and Process B call semop(ID) on the same facility (ID=a); the kernel scheduler handles one or more operations: blocking wait, non-blocking wait, "demon" wait, signal.]
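A minimal sketch of using one semaphore of a set as a lock with semop(), including the "undo" capability mentioned above; the key 1234 is arbitrary.

#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void) {
    int semid = semget(1234, 1, IPC_CREAT | 0666);
    union semun arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);          /* initial value 1 = free                       */

    struct sembuf p = { 0, -1, SEM_UNDO };  /* wait: the kernel suspends us if the value is 0 */
    struct sembuf v = { 0, +1, SEM_UNDO };  /* signal                                       */

    semop(semid, &p, 1);                    /* enter the critical section                   */
    /* ... critical section ... */
    semop(semid, &v, 1);                    /* leave; SEM_UNDO reverses the op if we crash  */

    semctl(semid, 0, IPC_RMID);
    return 0;
}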
SystemV IPC - Message queues
• The message queue facility provides a FIFO channel between any number of processes.
• Each message can have a different priority and a different size (see the sketch below).
• A queue holds an unlimited number of messages but a limited amount of data.
• Processes can send/receive with either blocking or non-blocking semantics (kernel support).
• Suspended processes are automatically woken up on queue destruction.
[Figure: Process A and Process B call snd/rcv(ID) on the same message queue (ID=a); blocking is managed by the kernel scheduler.]
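A minimal sketch (not from the slides) of per-message priority using the mtype field: with a negative msgtyp argument, msgrcv returns the message with the lowest type (highest priority) first, and IPC_NOWAIT gives the non-blocking variant. The key is arbitrary.

#include <sys/ipc.h>
#include <sys/msg.h>

struct msg { long mtype; char text[32]; };

int main(void) {
    int qid = msgget(1234, IPC_CREAT | 0666);
    struct msg low  = { 2, "low priority"  };
    struct msg high = { 1, "high priority" };

    msgsnd(qid, &low,  sizeof(low.text),  0);
    msgsnd(qid, &high, sizeof(high.text), 0);

    struct msg in;
    msgrcv(qid, &in, sizeof(in.text), -2, 0);           /* delivers "high priority" first */
    msgrcv(qid, &in, sizeof(in.text), -2, IPC_NOWAIT);  /* non-blocking receive           */

    msgctl(qid, IPC_RMID, NULL);
    return 0;
}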
SystemV IPC
IPC is embedded in the UNIX programming style
• System signals can wake up processes
• Objects can be "tuned" at run time by working on the kernel structures (root or owner rights)
• UNIX-style read/write/exec protection for owner, group and other processes
• Operations are logged in a dedicated kernel structure
• API in the typical system-call format
IPC on MPSoC
Why develop an MPSoC communication library with the same interface and features as a single-processor, monolithic-kernel-based, UNIX-style communication library like SystemV IPC?
A standard interface allows applications to be developed and debugged on a Linux/UNIX host and then run on the embedded system simulator.
But what about performance and efficiency? Can the IPC semantics match the MPSoC architecture?
MPSoC typical features
• Real hardware multiprocessing
• No central kernel
  – It is possible to have per-processor local kernels for local scheduling
• Communication can physically take place only in uncacheable shared memory (no cache coherency, no snoop devices)
• Hardware support for synchronization with very limited semantics (mutexes)
• Very limited resources and no built-in dynamic management
[Figure: several processors, each with a cache and a private cacheable RAM, connected by the system interconnect to an uncacheable shared RAM and a hardware mutex block.]
Implementation guidelines
• No Unix/kernel-related items:
  – facility owner process pid
  – rwx bits and facility protection
  – last operation timestamp
  – etc.
• No scheduler-based process wait.
• IPC semantics and features are cut where the hardware's lack of expressiveness (with respect to an OS kernel) prevents implementing them.
  – The API returns an error on unsupported features
  – System defines are altered to match the new semantics
• Automatic wake-up of processes on facility destruction is feasible but costs memory and API complexity; we decided to postpone it.
Dynamic resource allocation
• Both hardware mutexes and shared memory must become dynamically allocated resources.
• The allocators must themselves be allocated in shared memory, so that they are reachable by all processes/processors.
• The allocators are themselves shared resources and must be used in mutual exclusion between processes/processors (sketched below)!
[Figure: the shared-memory allocator and the mutex allocator sit at the start of the shared memory, followed by the allocator-managed space; a few system-reserved mutexes are taken from the hardware mutexes, the rest are allocator-managed.]
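A purely hypothetical sketch of such an allocator: a bump allocator whose state lives in the uncacheable shared RAM and whose every operation is guarded by one system-reserved hardware mutex. hw_mutex_lock/hw_mutex_unlock and the other names are invented for illustration; this is not the actual MPSoC IPC API.

#include <stddef.h>
#include <stdint.h>

extern void hw_mutex_lock(int id);      /* assumed hardware-mutex primitives (hypothetical) */
extern void hw_mutex_unlock(int id);

#define SM_ALLOC_MUTEX 0                /* one of the system-reserved mutexes               */

struct shared_allocator {               /* this structure itself lives in shared memory     */
    uintptr_t next;                     /* next free address in the managed space           */
    uintptr_t limit;                    /* end of the managed space                         */
};

void *shared_alloc(struct shared_allocator *a, size_t size) {
    void *p = NULL;
    hw_mutex_lock(SM_ALLOC_MUTEX);      /* the allocator is itself a shared resource        */
    if (a->next + size <= a->limit) {
        p = (void *)a->next;
        a->next += size;
    }
    hw_mutex_unlock(SM_ALLOC_MUTEX);
    return p;
}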
Key & ID
• For each facility we build a hash table that contains the facility descriptors.
• The table size is a system-defined constant and the table is shared-allocated.
• The hash function is the "mod" of the table size (sketched below).
• Tables are shared resources and must be used in mutual exclusion!
[Figure: ID = Hash(Key) indexes a facility descriptor table whose slots are either free or used (holding the facility data).]
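A hypothetical sketch of the key-to-ID mapping for one facility: ID = key mod table size, with the descriptor table shared-allocated. All names are invented for illustration and collision handling is omitted.

#define SHM_TABLE_SIZE 16               /* system-defined constant (illustrative value)     */

struct shm_descriptor {
    int   used;                         /* slot state: free or used                         */
    int   key;                          /* key that created the facility                    */
    void *data;                         /* facility data                                    */
};

/* in the real system this table is allocated in shared memory */
static struct shm_descriptor shm_table[SHM_TABLE_SIZE];

int shm_get_id(int key)                 /* must be called under the table's mutex           */
{
    int id = key % SHM_TABLE_SIZE;      /* the hash function is "mod" of the table size     */
    if (!shm_table[id].used) {          /* free slot: create the facility                   */
        shm_table[id].used = 1;
        shm_table[id].key  = key;
    }
    return id;                          /* same key -> same ID in every process             */
}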
System memory image
• IPC flexibility costs in terms of shared-allocated data.
• Table sizes must be tuned for MPSoC systems.
• Nested critical sections (when IPC uses the allocators) must not generate deadlock!
[Figure: the shared memory now also holds the SHM, SEM and MSG descriptor tables next to the allocators and the allocator-managed space; the hardware mutexes are split into system-reserved and allocator-managed mutexes.]
Shared Memory
For this facility we can implement all the original IPC features.
• Shared buffers are reserved by asking the shared-memory dynamic allocator.
• The SHM table contains each buffer's address and the number of attached processes.
• Once a buffer is marked for destruction, it is released on the last detach (see the sketch below).
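A hypothetical sketch of the detach path described above: the buffer is returned to the shared-memory allocator only when it has been marked for destruction and the last process detaches. The names and shared_free() are invented for illustration.

struct shm_entry {                       /* one row of the SHM table                  */
    void *buffer;                        /* address returned by the shared allocator  */
    int   attached;                      /* number of attached processes              */
    int   marked_rmid;                   /* marked for destruction?                   */
};

extern void shared_free(void *p);        /* assumed allocator release call            */

void shm_detach(struct shm_entry *e)     /* called under the SHM table mutex          */
{
    e->attached--;
    if (e->marked_rmid && e->attached == 0) {
        shared_free(e->buffer);          /* last detach: release the buffer           */
        e->buffer = NULL;
    }
}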
Semaphores
The hardware imposes some cuts:
• Only binary semaphores (mutexes)
• Processes can suspend only on the "no resource" condition
• Processes wait by polling the semaphore value (sketched below)
• No "undo" capability
Moreover, the absence of a kernel imposes other cuts:
• No atomic multi-operations on a semaphore set
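A hypothetical sketch of what the "wait" operation reduces to under these cuts: a binary semaphore backed by a hardware mutex, with the process polling instead of being descheduled, since there is no central kernel to suspend it. hw_mutex_try_lock() is an invented name for the hardware primitive.

extern int hw_mutex_try_lock(int id);   /* hypothetical: returns 1 if acquired, 0 if busy */

void mpsoc_sem_wait(int mutex_id)
{
    /* busy-wait on the "no resource" condition: the process polls the */
    /* semaphore value because no scheduler can put it to sleep        */
    while (!hw_mutex_try_lock(mutex_id))
        ;
}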
Message Queues
• Technically all the facility features are feasible, but to avoid a really complex API we decided to remove message priority.
• For each queue we use three mutexes: one for consumers, one for producers and one for the actual send/receive operations.
[Figure: the queue data structure lives in shared memory and is associated with three hardware mutexes: the producers mutex, the consumers mutex and the list-operations mutex.]
Message Queues
• To manage the different message sizes and the theoretically "unlimited" number of messages in a queue, we use a list (sketched below).
• Each message is a node of the list, and node memory is reserved dynamically.
• The list structure implies dynamic-allocation overhead but simplifies data management.
[Figure: the queue data structure in shared memory points to a linked list of message nodes (first message, second message, last message); the producers, consumers and list-operations mutexes are taken from the hardware mutexes.]
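A hypothetical sketch of the list-based send: the node is reserved from the shared-memory allocator and linked to the queue under the list-operations mutex (the producers/consumers mutexes, used for blocking, are omitted here). All names are invented for illustration.

#include <stddef.h>
#include <string.h>

extern void *shared_alloc(size_t size); /* assumed shared-memory allocator          */
extern void hw_mutex_lock(int id);      /* assumed hardware-mutex primitives        */
extern void hw_mutex_unlock(int id);

struct msg_node {                       /* one dynamically allocated message        */
    struct msg_node *next;
    size_t size;
    char data[];                        /* payload of variable size follows         */
};

struct msg_queue {                      /* queue data structure in shared memory    */
    struct msg_node *head, *tail;
    int producers_mutex, consumers_mutex, list_mutex;
};

void mpsoc_msgsnd(struct msg_queue *q, const void *buf, size_t size)
{
    struct msg_node *n = shared_alloc(sizeof(*n) + size);  /* dynamic node          */
    n->next = NULL;
    n->size = size;
    memcpy(n->data, buf, size);

    hw_mutex_lock(q->list_mutex);       /* the actual list operation is protected   */
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    hw_mutex_unlock(q->list_mutex);
}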
Message Queues
Even with these reductions, the send/receive APIs are complex, because the flexibility of IPC queues requires complex management. Are there ways to reduce API complexity?
• Introduce fixed-size messages
• Introduce a limit on the number of messages
Such possibilities strongly alter the IPC semantics: before deciding, it is better to test the MPSoC IPC support's performance as it is.
Critical sections
To manage IPC critical sections we need some mutexes, but how many? The number of mutexes limits the parallelism of the IPC API; possible solutions are:
• One mutex for everything
  – no parallelism, allocators "dedicated"
• One mutex for IPC and two for the dynamic allocators
  – no parallelism, free allocators
• One for each IPC facility and two for the dynamic allocators
  – partial API parallelism, free allocators
Critical sections
Currently we have chosen the third solution, to allow the maximum parallelism in API execution.
However, no matter how smartly we manage critical sections, shared memory is a single bus target: only one master can use it at a time!
[Figure: the hardware mutexes comprise the shared-allocator mutex, the mutex-allocator mutex, the IPC shared memory, semaphore and message system mutexes, and free mutexes used by both IPC semaphores and IPC message queues; the rest is allocator-managed space.]
Conclusions
The MPSoC IPC support allows an application to be deployed quickly on an embedded system.
IPC generality and flexibility require a quite complex API.
What level of performance can be reached? Are there more efficient programming paradigms?
examples
MATRIX MULTIPLICATION: C = A x B
PARALLEL MULTIPLICATION with 2 computing tasks
[Figure: data partitioning. The result C = A x B is partitioned between Task 1 and Task 2, each computing its own portion of C.]
examples
[Figure: the same partitioned multiplication C = A x B with an Init task added; each portion of C is assigned to Task 1 or Task 2.]
An Init task creates and initializes 3 shared memory segments (A, B, C) and 2 semaphores used as barriers.
How to do that with SystemV IPC
A task creates and initializes 3 shared memory segments (A, B, C)
and 2 semaphores used as barriers

#define N 8
int *A, *B, *C;
int semid, key=1234, shmid1, shmid2, shmid3;
struct sembuf sem;

semid = semget(key, 2, IPC_CREAT | 0666);
if (semid == -1) perror(NULL);
shmid1 = shmget(key+1, N*N*sizeof(int), IPC_CREAT | 0666);
A = shmat(shmid1, NULL, 0);
initialize matrix…
….
//unlock input semaphores
sem.sem_num = 0;
sem.sem_op  = 2;
sem.sem_flg = 0;
semop(semid, &sem, 1);
//wait on output semaphore
sem.sem_num = 1;
sem.sem_op  = -2;
semop(semid, &sem, 1);
Task_init
How to do that with SystemV IPC
The computing tasks wait on the input barrier,
then compute and finally unlock the output barrier

#define N 8
int *A, *B, *C;
int semid, key=1234, shmid1, shmid2, shmid3;
struct sembuf sem;

semid = semget(key, 2, IPC_CREAT | 0666);
shmid1 = shmget(key+1, N*N*sizeof(int), IPC_CREAT | 0666);
//wait on input semaphore
sem.sem_num = 0;
sem.sem_op  = -1;
sem.sem_flg = 0;
semop(semid, &sem, 1);
computation…
//unlock output semaphore
sem.sem_num = 1;
sem.sem_op  = 1;
semop(semid, &sem, 1);
Task_x
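One possible shape of the elided "computation" step, assuming the two computing tasks are distinguished by an ID of 0 or 1 and each produces half of the rows of C; ID, and the attachment of B and C, are not shown in the slide and are assumed here.

int rows = N / 2;                       /* each task computes half of the rows of C */
int first = ID * rows;                  /* ID is assumed to be 0 or 1               */

for (int i = first; i < first + rows; i++)
    for (int j = 0; j < N; j++) {
        C[i*N + j] = 0;
        for (int k = 0; k < N; k++)
            C[i*N + j] += A[i*N + k] * B[k*N + j];
    }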
examples
ASSUME WE NEED TO DO: F = Σ(x=1..N) (A_x * B) * C
pipelining
[Figure: a three-task pipeline. For each x, Task 1 computes R'_x = A_x * B, Task 2 computes R''_x = R'_x * C, and Task 3 accumulates the partial results R''_x into F.]
How to do that with SystemV IPC
Each task of the pipeline allocates an input and an output queue

struct mybuf {
    long type;
    char text[N*N*sizeof(int)];
} buffer;

buffer.type = 1;

//let's create the output queue
if (ID != NUMPROC-1)
{
    printf("PR:%d Create output QUEUE key:%d\n", ID, key+ID);
    outqueueid = msgget(key+ID, IPC_CREAT | 0666);
    if (outqueueid == -1) { perror("***output Queue get FAILED***");
        return -1; }
}
//let's create the input queue
if (ID != 0)
{
    inqueueid = msgget(key+ID-1, IPC_CREAT | 0666);
    if (inqueueid == -1) {
        perror("***Input Queue get FAILED***");
        return -1; }
}
//execution
…
//queue deletion
if (ID != 0)
    if (msgctl(inqueueid, IPC_RMID, NULL) == -1) {
        perror("msgctl: msgctl failed");
        exit(1);
    }
for(int i=0; i<ITERATION;i++)
{
//read
if(ID!=0)
{
result=msgrcv(inqueueid, (struct msgbuf*)&buffer,
(N*N*sizeof(int)), 1, 0);
if (result==-1)
{
perror("***Queue receive FAILED***");
exit(1);
}
}
compute….
//write
if(ID!=NUMPROC-1)
{
result=msgsnd(outqueueid,
(struct msgbuf*)&buffer, (N*N*sizeof(int)), 0);
if (result==-1)
{
perror("***Queue send FAILED***");
exit(1);
}
printf("Wrote:%d\n",i);
}
}