Dec 1, 2013

Development of Parallel Simulator for
Wireless WCDMA Network



Hong Zhang

Communication lab of HUT

Outline

1. Overview
   1.1 The Requirement for Computational Speed of Simulation for Wireless WCDMA System
   1.2 Parallel Programming
2. Types of Parallel Computers
   2.1 Shared Memory Multiprocessor System
   2.2 Message Passing Multiprocessor with Local Memory
3. Parallel Programming Scenarios
   3.1 Ideal Parallel Computations
   3.2 Partitioning and Divide-and-Conquer Strategies
   3.3 Pipelined Computation
   3.4 Synchronous Computation
   3.5 Load balancing
   3.6 Multiprocessor with Shared Memory
4. Progress of the project

1. Overview

1.1 The Requirement for Computational Speed of Wireless WCDMA Network Simulation

• In mobile communication, advanced signal processing techniques such as smart antennas and multiuser detection (MUD) can improve system performance, but developing them requires signal-level or system-level simulation.

• Simulation is an important tool for gaining insight into the problem. However, simulating the signal processing algorithms is often very time-consuming.

• It is therefore necessary to speed up simulation. Parallel programming is one of the best techniques for solving this problem.

1.2 Parallel Programming

• Parallel programming can speed up the execution of a program by dividing the program into multiple fragments that can be executed simultaneously, each on its own processor.

• Parallel programming involves:
  - Decomposing an algorithm or data into parts
  - Distributing the sub-tasks so that multiple processors work on them simultaneously
  - Coordinating work and communication between those processors
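The three steps above (decompose, distribute, coordinate) can be sketched in a few lines. This is only an illustration, not code from the project: the function names `chunk`, `partial_energy` and `parallel_energy` are made up for the example, and Python threads stand in for separate processors. For CPU-bound work in standard CPython, `ProcessPoolExecutor` would normally replace `ThreadPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, n_parts):
    """Decompose: split data into n_parts nearly equal contiguous slices."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def partial_energy(samples):
    """Sub-task (hypothetical): energy, i.e. sum of squares, of one slice."""
    return sum(s * s for s in samples)


def parallel_energy(samples, n_workers=4):
    parts = chunk(samples, n_workers)
    # Distribute: each slice is handed to its own worker.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_energy, parts))
    # Coordinate: gather the partial results and combine them.
    return sum(partials)
```

For example, `parallel_energy(list(range(10)), 3)` splits ten samples into slices of 4, 3 and 3 values and adds up the three partial energies.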


1.2 Parallel Programming (cont.)

The Requirements for Parallel Programming

• Parallel architecture being used
• Multiple processors
• Network
• Environment to create and manage parallel processing
• A parallel algorithm and parallel program


2. Types of Parallel Computers

2.1 Shared Memory Multiprocessor System

[Figure: several CPUs connected to a single shared memory]

• Multiple processors operate independently but share the same memory resources.
• Only one processor can access a given shared memory location at a time.
• Synchronization is achieved by controlling reads from and writes to the shared memory.
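As a rough illustration of controlling access to a shared memory location, the sketch below lets several workers add to one shared error counter while a lock makes sure only one of them updates it at a time. Everything here is an assumption made for the example (the counter `shared`, the function names, and the use of Python threads in place of CPUs); it is not taken from the simulator.

```python
import threading

shared = {"bits_in_error": 0}   # the single shared memory location
lock = threading.Lock()         # guards every read-modify-write of it


def count_errors(sent, received):
    """Count bit errors in one block and add them to the shared total."""
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    with lock:                  # only one worker updates at a time
        shared["bits_in_error"] += errors


def run(blocks):
    """Process every (sent, received) block in its own worker."""
    shared["bits_in_error"] = 0
    workers = [threading.Thread(target=count_errors, args=b) for b in blocks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return shared["bits_in_error"]
```

Without the lock, two workers could read the same old value and one update would be lost, which is the kind of conflict the synchronization rule on the slide is meant to prevent.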

2.1 Shared Memory Multiprocessor System (cont.)

Advantages
• Easy for the user to use efficiently
• Data sharing among tasks is fast (fast memory access)

Disadvantages
• The size of the memory can be a limiting factor: increasing the number of processors without increasing the memory size can cause severe bottlenecks.
• The user is responsible for establishing synchronization.


2.2 Message Passing Multiprocessor with Local Memory

[Figure: CPUs, each with its own local memory, connected through a network]

• Multiple processors operate independently, but each has its own local memory.
• Data is shared across a communication network using message passing.
• The user is responsible for synchronization using message passing.
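The message-passing model can be imitated on a single machine as follows. In this sketch each worker touches only its own local slice of the data and shares its result solely by sending a message; a `queue.Queue` stands in for the network. The names `node` and `gather` and the message tag are hypothetical, and a real distributed simulator would use an actual message-passing library (MPI is a common choice) rather than threads.

```python
import queue
import threading


def node(local_samples, outbox):
    """A worker with its own local data; it shares results only by sending."""
    local_sum = sum(local_samples)           # touches local memory only
    outbox.put(("partial_sum", local_sum))   # explicit send


def gather(all_samples, n_nodes=3):
    inbox = queue.Queue()                    # stands in for the network
    # Each node gets a private slice of the data.
    slices = [all_samples[i::n_nodes] for i in range(n_nodes)]
    nodes = [threading.Thread(target=node, args=(s, inbox)) for s in slices]
    for n in nodes:
        n.start()
    total = 0
    for _ in nodes:
        tag, value = inbox.get()             # explicit receive; blocks until a message arrives
        if tag == "partial_sum":
            total += value
    for n in nodes:
        n.join()
    return total
```

The blocking receive is also what synchronizes the program: the collector cannot continue until every node has sent its message.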

2.2 Message Passing Multiprocessor with Local Memory (cont.)

Advantages
• Memory scales with the number of processors: each added processor brings its own memory, so the total memory grows, unlike in the shared memory multiprocessor system.
• Each processor can rapidly access its own memory without contention from other processors.

Disadvantages
• Existing data structures are difficult to map onto the distributed memory.
• The user is responsible for sending and receiving data among processors.
• To minimize overhead and latency, data should be collected into large blocks and sent before the receiving nodes need it.


3. Parallel Programming Scenarios

3.1 Ideal Parallel Computations

• The computation can be readily divided into completely independent parts that can be executed simultaneously.
• Example: in the simulation of uplink WCDMA (single user), the signal processing at the transmitter and the receiver is
  - divided into smaller parts,
  - executed by separate processors.
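A minimal sketch of an ideally parallel computation, under assumptions of its own: instead of splitting the chain by processing block as the slide does, it splits the data into blocks of symbols and spreads each block independently, since spreading one symbol needs nothing from any other. The spreading code `CODE` and the function names are hypothetical, and threads again stand in for processors.

```python
from concurrent.futures import ThreadPoolExecutor

CODE = [1, -1, 1, 1]   # hypothetical length-4 spreading code


def spread_block(symbols):
    """Spread one block of +/-1 symbols; needs nothing from other blocks."""
    return [s * c for s in symbols for c in CODE]


def spread_parallel(blocks, n_workers=4):
    """Spread all blocks with no communication between workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns the results in the same order as the input blocks
        return list(pool.map(spread_block, blocks))
```

Because the blocks never exchange data, no locks, messages or barriers are needed, which is what makes this case "ideal".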

3.1 Ideal Parallel Computations (cont.)

Example: simulation of wireless communication with Ideal Parallel Computation

[Figure: transmitter and receiver chain with one processing block per CPU. Diagram labels: Transmitter, Radio channel, AWGN, Receiver.]
• CPU 1: source data generation (traffic/packet)
• CPU 2: channel coding and data matching
• CPU 3: modulation
• CPU 4: spreading and scrambling
• CPU 5: pulse shaping filtering
• CPU 6: reconstruction of the composite signal (signal, channel, AWGN)
• CPU 7: matched filtering
• CPU 8: Rake combining
• CPU 9: demodulation
• CPU 10: channel decoding

3.2 Partitioning and Divide-and-Conquer Strategies

• Partitioning: the problem is simply divided into separate parts and each part is computed separately.
• Divide-and-Conquer: the task is divided continually into smaller and smaller subtasks, the smallest parts are solved, and the results are combined.
• Example: in the simulation of the Rake combining technique in WCDMA, the problem can be continually divided among different fingers. Within each finger, the problem can be further divided into correlating, delay equalizing, and MRC/EGC combining.
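The Rake example can be sketched as follows, assuming a single symbol, a known chip delay per finger and a known complex channel estimate per finger. The work is first divided among fingers (run in parallel), each finger is further divided into correlating and weighting (run one after the other inside the finger in this sketch), and the finger outputs are then combined by maximal ratio combining (MRC). All function names and the signal model are illustrative assumptions, not the project's code.

```python
from concurrent.futures import ThreadPoolExecutor


def correlate(chips, code, delay):
    """Despread one symbol: correlate the delayed chips with the code."""
    window = chips[delay:delay + len(code)]
    return sum(x * c for x, c in zip(window, code)) / len(code)


def finger(chips, code, delay, channel_estimate):
    """One finger: correlate, then weight by the conjugate channel estimate."""
    return correlate(chips, code, delay) * channel_estimate.conjugate()


def rake_combine(chips, code, delays, estimates):
    """Divide the work among fingers, then combine (sum) their outputs."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        outputs = pool.map(lambda d, h: finger(chips, code, d, h),
                           delays, estimates)
        return sum(outputs)
```

Usage, with a hypothetical two-path channel (gain 1 at delay 0, gain 0.5j at delay 2) and the symbol +1 spread by the code `[1, -1, 1, 1]`:

```python
code = [1, -1, 1, 1]
chips = [1, -1, 1 + 0.5j, 1 - 0.5j, 0.5j, 0.5j]
decision_variable = rake_combine(chips, code, [0, 2], [1 + 0j, 0.5j])
```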


3.2 Partitioning and Divide-and-Conquer Strategies (cont.)

Example: the simulation of wireless communication with the Divide-and-Conquer Strategy

[Figure: Rake combining divided into Finger 1, Finger 2, ..., Finger K; within a finger:]
• CPU 1: correlating
• CPU 2: modified with the channel estimate
• CPU 3: combining with MRC/EGC

3.3 Pipelined Computation

• The problem is divided into a series of tasks that have to be completed one after the other.
• Each task is executed by a separate processor.
• The computation is partially sequential in nature.
• Example: in the simulation of the WCDMA transmitter and receiver, each signal processing block needs the output of the previous block as its input. In this case, pipelining is adopted to parallelize the sequential source code.
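One way to picture the pipeline is a chain of workers linked by queues: each stage takes a frame from the stage before it, processes it, and passes it on, so that different frames are in different stages at the same time. The sketch below assumes two toy stages, a rate-1/3 repetition coder `repeat3` and a BPSK mapper `bpsk`, which are placeholders and not the blocks of the actual simulator; threads stand in for CPUs.

```python
import queue
import threading

DONE = object()   # sentinel that flushes the pipeline


def stage(func, inbox, outbox):
    """Run one pipeline stage: read an item, process it, pass it downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Connect one worker per function with queues and feed the items through."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


def repeat3(bits):
    """Toy channel coder (assumption): repeat every bit three times."""
    return [b for b in bits for _ in range(3)]


def bpsk(bits):
    """Toy modulator (assumption): map bit 0 to +1 and bit 1 to -1."""
    return [1 - 2 * b for b in bits]
```

Calling `run_pipeline([repeat3, bpsk], frames)` keeps the frames in order, because every stage is a single worker reading from a first-in, first-out queue. The speed-up comes only from overlapping different frames, so a single frame gains nothing.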



3.3 Pipelined Computation (cont.)

Example: the simulation of wireless communication with Pipelined Computation

[Figure: transmitter and receiver chain with one processing block per CPU. Diagram labels: Transmitter, Radio channel, AWGN, Receiver.]
• CPU 1: source data generation (traffic/packet)
• CPU 2: channel coding and data matching
• CPU 3: modulation
• CPU 4: spreading and scrambling
• CPU 5: pulse shaping filtering
• CPU 6: reconstruction of the composite signal (signal, channel, AWGN)
• CPU 7: matched filtering
• CPU 8: Rake combining
• CPU 9: demodulation
• CPU 10: channel decoding

3.4 Synchronous Computation

• Processors need to exchange data among themselves.
• All processes start at the same time and proceed in lock-step.
• Each process must wait until all processes have reached a particular reference point (barrier) in their computation.
• Example: WCDMA system
  - Smart Antenna (SA): the signal processing in each branch of antenna elements must be finished before the branches are combined.
  - Rake Combining: the signal processing in each finger must be finished before the fingers are combined.
  - Multiuser Detection (MUD): because MUD for each user's signal needs information about the other users' signals, the processing of all users' signals must be finished before MUD.
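The barrier idea can be sketched with `threading.Barrier`: every branch does its own processing, then waits at the barrier, and the combining step runs only once all branches have arrived. The per-branch processing here is reduced to a single multiplication by a weight, which is a placeholder assumption; the function name `combine_after_barrier` is also made up for the example.

```python
import threading


def combine_after_barrier(branch_inputs, weights):
    """Process every branch in its own worker and combine after a barrier."""
    n = len(branch_inputs)
    outputs = [None] * n
    combined = []

    def combine():
        # Runs exactly once, after every branch has reached the barrier.
        combined.append(sum(outputs))

    barrier = threading.Barrier(n, action=combine)

    def branch(i):
        outputs[i] = weights[i] * branch_inputs[i]   # per-branch processing
        barrier.wait()                               # wait for all other branches

    threads = [threading.Thread(target=branch, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return combined[0]
```

The same pattern applies to Rake fingers before combining and to per-user processing before MUD: the combining step must never read an output that a slower worker has not yet written.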

3.4 Synchronous Computation (cont.)

Example: the simulation of wireless communication with Synchronous Computation

[Figure: per-user processing chains, User 1 ... User N, feeding MUD. Figure labels: AWGN, received signal reconstruction, matched filtering, beamforming (weights w), Rake combining, CPU Finger 1 ... CPU Finger K (correlating; modified with the channel estimate), beamforming combining, MUD.]

3.4 Synchronous Computation (cont.)

Example: the simulation of wireless communication with Synchronous Computation

[Figure: Multiuser Detection. Inputs to the CPUs, for each user n = 1, 2, ..., N: the output of user n's beamforming/combining and the signature waveform of user n.]

3.5 Load balancing

• The computation load is distributed fairly across processors in order to obtain the highest possible execution speed.
• Example: WCDMA system
  - Smart Antenna (SA): the direction of arrival (DOA) can vary at different speeds for different users' signals, so the beamforming processors for different users may need different numbers of operations. The load can be balanced fairly across all processors by detecting whether the solution has been reached on each processor.
  - Rake Combining: the number of multipath signals can differ between users. The load can be balanced fairly across all processors by detecting whether the solution has been reached on each processor.
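The effect of balancing can be illustrated with a small scheduling model. This is one possible scheme, named plainly as greedy list scheduling, and not necessarily the one used in the project: each job goes to whichever CPU becomes free first, which mimics a processor picking up new work as soon as it has finished its own. The job costs are hypothetical units of computation time.

```python
import heapq


def list_schedule(job_costs, n_cpus):
    """Give each job to the CPU that becomes free first.

    Returns the time at which each CPU finishes its last job.
    """
    free_at = [(0, cpu) for cpu in range(n_cpus)]   # (time CPU becomes idle, CPU id)
    heapq.heapify(free_at)
    finish = [0] * n_cpus
    for cost in job_costs:
        t, cpu = heapq.heappop(free_at)
        finish[cpu] = t + cost
        heapq.heappush(free_at, (finish[cpu], cpu))
    return finish


def makespan(finish_times):
    """Total run time: the moment the slowest CPU finishes."""
    return max(finish_times)
```

Suppose four users have 2, 6, 2 and 2 multipath signals and each path costs one time unit. With one whole user per CPU, `makespan(list_schedule([2, 6, 2, 2], 4))` is 6, because everyone waits for user 2. If the work is instead issued as twelve per-path jobs, `makespan(list_schedule([1] * 12, 4))` is 3.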


3.5 Load balancing (cont.)

Example: the simulation of wireless communication with Load balancing

[Figure: computation time per CPU]
Rake Combining:
• CPU 1 (user 1)
• CPU 2 (user 2, which has more multipath signals than the other users)
• ...
• CPU N (user N)
Beamforming:
• CPU N+1 (user 1)
• CPU N+2 (user 2, whose channel parameters vary faster than those of the other users)
• ...
• CPU 2N (user N)

3.6 Multiprocessor with Shared Memory

• A multiprocessor with shared memory can speed up execution by storing the executable code and data for all processors in shared memory.
• Example: in the simulation of WCDMA with multiple users, each part of the signal processing model can have a number of alternative algorithms, for example:
  - Adaptive beamforming: RLS, LMS, CMA, Conjugate Gradient Method
  - Multiuser detection: decorrelating detector, MMSE detector, adaptive MMSE detection, etc.
• The code for all these algorithms is stored in the shared memory.
• The processing for each user shares this code.
• The processor for each user can access the executable code in the shared memory, which speeds up execution.
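A loose software analogy of the shared algorithm store is a single read-only table of functions that every user's worker looks up, instead of each worker carrying its own copy. The sketch below holds two entries: a real-valued LMS update and, as a second entry chosen only for illustration, a sign-error LMS variant, which is not one of the algorithms named on the slide. The table `ALGORITHMS` and the job format are assumptions, and threads stand in for the per-user CPUs.

```python
from concurrent.futures import ThreadPoolExecutor


def lms_step(w, x, d, mu=0.1):
    """One real-valued LMS update: w <- w + mu * e * x, with e = d - w.x."""
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + mu * e * xi for wi, xi in zip(w, x)]


def sign_lms_step(w, x, d, mu=0.1):
    """Sign-error LMS variant: uses only the sign of the error."""
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    s = (e > 0) - (e < 0)
    return [wi + mu * s * xi for wi, xi in zip(w, x)]


# One shared, read-only table of algorithms; every user's worker looks it up.
ALGORITHMS = {"lms": lms_step, "sign_lms": sign_lms_step}


def process_user(job):
    """job = (algorithm name, weights, input vector, desired output)."""
    name, w, x, d = job
    return ALGORITHMS[name](w, x, d)


def process_all(jobs):
    """Run one worker per user; all workers share the same algorithm table."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(process_user, jobs))
```

Because the table is only read, never written, the workers need no lock to use it.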

3.6 Multiprocessor with Shared Memory (cont.)

Example: the simulation of wireless communication by a Multiprocessor with Shared Memory

[Figure: Beamforming. CPU 1 (user 1) ... CPU N (user N), each with a cache, share memory modules holding RLS, CMA, ...]
[Figure: Multiuser Detection. CPU 1 (user 1) ... CPU N (user N), each with a cache, share memory modules holding the decorrelating detector, MMSE, ...]

4. Progress of the project

• The following models of the WCDMA system have been developed/integrated into the simulator:
  - Spreader/despreader
  - Spatial processing
  - RAKE receiver
  - Fading radio channel
• Some simulation results have been obtained for verification of the models.
• Interactions with SARG at Stanford on verification of the Rake receiver model.
• Work on translation from MATLAB into C, with further parallelization, is accomplished at UCLA.