
Preliminary Investigations into Distributed Computing Applications on a Beowulf Cluster


Guy A. Schiavone, Judd Tracy, and Ravishankar Palaniappan

Institute for Simulation and Training

University of Central Florida


Introduction:
Parallel computing has long held the promise of increased performance over traditional von Neumann architectures, but the high cost of specialized hardware and the complexity of programming have withheld this promise from all but the most critical, computationally intensive tasks. In recent years, however, the increasing power of commodity desktop platforms, combined with the increasing bandwidth of low-cost networking technologies, has opened the door for a new type of cost-efficient parallel computer based on dedicated computing clusters, sometimes referred to as networks of workstations (NOWs) or piles of PCs (POPs). Dedicated computing clusters are now a vital technology that has proven successful in a large variety of applications. Systems have been implemented on both Windows NT and Linux-based platforms. Linux-based clusters, known as Beowulf clusters, were first developed at NASA CESDIS in 1994. The idea of the Beowulf cluster is to maximize the performance-to-cost ratio of computing by using low-cost commodity components and free, open-source Linux and GNU software to assemble a distributed computing system. The performance of these systems can match that of shared-memory parallel processors costing 10 to 100 times as much.


In 1999, the Institute for Simulation and Training at the University of Central Florida constructed a Beowulf-class computing cluster, named Boreas. Boreas is made up of 17 nodes, each consisting of two 350 MHz Pentium II processors, 256 MB of main memory on a 100 MHz bus, and 8.6 GB of disk storage. Nodes are connected by Fast Ethernet with a maximum bandwidth of 100 Mbit/s through a Linksys EtherFast 24-port switch. Software support includes the standard Linux/GNU environment, including compilers, debuggers, editors, and standard numerical libraries. MPICH is supported for message passing between nodes, and shared-memory processing within each node is enabled using the pthreads library. The main advantages of establishing a message-passing standard are portability and ease of use. In a distributed-memory communication environment in which higher-level routines or abstractions are built upon lower-level message-passing routines, the benefits of standardization are particularly apparent.
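
As an illustration of that portability, the sketch below shows the shape of a typical MPI program in C. It is not taken from the Boreas codebase; the master/worker pattern, token values, and message tags are arbitrary choices for the example. The same source compiles unchanged against MPICH or any other conforming MPI implementation.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* Master node: send a token to every worker, then collect replies. */
        for (int i = 1; i < nprocs; i++) {
            int token = 100 + i;
            MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        }
        for (int i = 1; i < nprocs; i++) {
            int reply;
            MPI_Recv(&reply, 1, MPI_INT, i, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node %d replied with %d\n", i, reply);
        }
    } else {
        /* Worker node: receive the token, transform it, send it back. */
        int token;
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        token *= 2;
        MPI_Send(&token, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}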


Beowulf cluster in parallel computing:
Some of the simulations that have been performed on the Beowulf cluster include Jacobi iteration coding and performance, NetPipe performance tests, early distributed image generation, and computational electromagnetics.

Jacobi's Iteration Method:
There are two classes of methods for solving linear systems of equations: direct and iterative. Jacobi's method is an iterative method that can be run on the Beowulf cluster. The basic idea of an iterative method is to first guess a solution for the X values and then calculate X_new. The new value of X is used to improve the next guess, and the process is repeated, each step generating a better approximation of the final answer, until the error is within the acceptable range of values. For a system of n linear equations AX = B, written out as
\[
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}
=
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix},
\]

the Jacobi update for each unknown is

\[
X_i^{\mathrm{new}} = \frac{1}{a_{ii}} \Bigl( B_i - \sum_{\substack{j=1 \\ j \neq i}}^{n} a_{ij} X_j^{\mathrm{old}} \Bigr).
\]
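
A serial sketch of one Jacobi sweep in C follows, purely as an illustration of the update rule above; it is not the cluster implementation, and the row-major array layout and the max-change convergence measure are assumptions made for the example.

#include <math.h>

/* One Jacobi sweep over an n x n system stored row-major in a[].
   Returns the largest per-component change, used to test convergence. */
double jacobi_sweep(int n, const double *a, const double *b,
                    const double *x_old, double *x_new)
{
    double max_delta = 0.0;
    for (int i = 0; i < n; i++) {
        double sum = b[i];
        for (int j = 0; j < n; j++)
            if (j != i)
                sum -= a[i * n + j] * x_old[j];   /* B_i - sum a_ij X_j^old */
        x_new[i] = sum / a[i * n + i];            /* divide by a_ii */
        double d = fabs(x_new[i] - x_old[i]);
        if (d > max_delta)
            max_delta = d;
    }
    return max_delta;
}

The caller repeats the sweep, swapping x_old and x_new, until the returned change falls within the acceptable error. In the parallel version, each node owns a contiguous block of rows i and the updated X values are exchanged after every sweep.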

Figure 1: a) speedup versus number of processors for 2000x2000, 4000x4000, and 8000x8000 matrices; b) CPU time (seconds) versus number of processors for the same matrix sizes.


Plot 1a) shows the speedup obtained as the number of processors is increased, for different problem sizes. It can be observed from the graph that with the 2000x2000 matrix, the speedup does not increase much beyond 8 processors because of the time spent on data communication. When the ratio of computation to communication increases, the speedup increases linearly with the number of processors. When the dimension of the matrix is 8000x8000 and only one processor is used, 62024 swaps occurred in the system, meaning that data had to be read and written frequently between main memory and the hard disk due to insufficient main memory; as a result, for example, the run on 2 processors was 5 times as fast as the run on 1 processor.

Plot 1b) shows the CPU time taken versus the number of processors. As the number of processors increases, the CPU time taken decreases.



NetPipe Performance Tests:
The cluster operates with two networks for communication. The first network links every machine through a Fast Ethernet switch for normal communications such as NFS, telnet, and rlogin. The second network consists of a pair of Fast Ethernet cards in each machine, bonded together to form a single high-speed virtual network. This second network is used for the communications of the distributed applications.

To test the network performance of the cluster, a program called NetPipe was used. Figure 2a) shows the throughput of the network as the packet size is increased, for a single channel of Ethernet and a bonded pair of Ethernet channels through the switch, and for a single channel and 2- and 3-channel bonded Ethernet without a switch. Figure 2b) shows the latency of the network as the packet size is increased, for the same configurations as Figure 2a).
Figure 2: a) average throughput versus packet size (avg_throughput_both.eps); b) average latency versus packet size (avg_latency_both.eps).
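
For readers who want to reproduce a measurement of this kind, the sketch below is a minimal ping-pong timing loop in C using MPI, in the spirit of NetPipe rather than a reimplementation of it (NetPipe itself also tests raw TCP); the message sizes and repetition count are arbitrary choices for the example.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 100

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {            /* needs two communicating ranks */
        MPI_Finalize();
        return 1;
    }

    /* Double the message size each round, as NetPipe's sweep does. */
    for (int size = 1; size <= (1 << 20); size <<= 1) {
        char *buf = calloc(size, 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / (2.0 * REPS);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes: %8.1f us latency, %8.2f Mbit/s\n",
                   size, dt * 1e6, size * 8.0 / dt / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}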



Distributed Image Generation:
The idea behind distributed image generation is to use low-cost image generation hardware in parallel to obtain a greater performance-to-cost ratio. As a proof-of-concept example, an OpenGL application was modified to divide the scene evenly and distribute the sections across the nodes, which render them and send the rendered sections back to be pasted together on the screen in real time. Figure 3) shows the frame rates obtained, plotted against the number of processors used.




Figure 3: frame rate versus number of processors (gloss.ps).
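
A rough sketch of the divide-and-reassemble step is given below, assuming each node renders one horizontal strip of the frame. The OpenGL rendering itself is replaced by a stand-in gradient fill so the example is self-contained, and the frame dimensions are arbitrary.

#include <mpi.h>
#include <stdlib.h>

#define W 640
#define H 480

/* Stand-in renderer: fills `rows` scanlines of RGB pixels starting at
   scanline y0 of the full scene. A real node would draw its section of
   the scene with OpenGL here instead. */
static void render_strip(unsigned char *strip, int y0, int rows)
{
    for (int y = 0; y < rows; y++)
        for (int x = 0; x < W; x++) {
            unsigned char *p = strip + 3 * ((size_t)y * W + x);
            p[0] = (unsigned char)(y0 + y);   /* gradient placeholder */
            p[1] = (unsigned char)x;
            p[2] = 128;
        }
}

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = H / nprocs;        /* assumes nprocs divides H evenly */
    unsigned char *strip = malloc((size_t)rows * W * 3);
    unsigned char *frame = rank ? NULL : malloc((size_t)H * W * 3);

    /* Each node renders its own horizontal strip of the scene... */
    render_strip(strip, rank * rows, rows);

    /* ...and the strips are pasted back together on the display node. */
    MPI_Gather(strip, rows * W * 3, MPI_UNSIGNED_CHAR,
               frame, rows * W * 3, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    /* rank 0 would now blit `frame` to the screen. */
    free(strip);
    free(frame);
    MPI_Finalize();
    return 0;
}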

Computational Electromagnetics:
In the field of computational electromagnetics, parallel and distributed computing using Beowulf clusters has proven a viable alternative to applications developed for expensive special-purpose architectures. In our investigation, the Finite Difference Time Domain (FDTD) method was used to simulate the electric and magnetic field patterns of a printed dipole antenna on a dielectric substrate. The FDTD algorithm was implemented on the workstation cluster by splitting the computational grid into equal subdomains, with each subdomain assigned to a particular node in the cluster. The electric and magnetic field components were updated at each time step. The problem size was varied, and the normalized run time versus the number of processors was calculated. Another way of viewing the same data is the fixed speedup, computed as the ratio of the time it takes to run a problem on one processor to the time it takes to run the same problem on a given number of processors. As expected, the run time was cut by almost a factor of two when going from one processor to two. But as the number of processors increases, each processor performs less computation but the same amount of communication, and the curve starts to saturate.










Figure 4: a) normalized run time versus number of processors for varying problem sizes; b) geometry of the printed dipole antenna on the dielectric substrate.
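
The sketch below illustrates the subdomain splitting in one dimension with C and MPI. The real simulation uses a full grid of electric and magnetic field components; the update coefficients, the hard source, and the subdomain size here are placeholders chosen only to make the halo-exchange pattern concrete.

#include <mpi.h>
#include <stdlib.h>
#include <math.h>

#define NLOC  1000    /* grid cells per subdomain (placeholder size) */
#define STEPS 500

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Local E and H field arrays with one ghost cell at each end. */
    double *ez = calloc(NLOC + 2, sizeof *ez);
    double *hy = calloc(NLOC + 2, sizeof *hy);

    for (int t = 0; t < STEPS; t++) {
        /* hy[NLOC] needs the right neighbor's first E value. */
        MPI_Sendrecv(&ez[1], 1, MPI_DOUBLE, left, 0,
                     &ez[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 1; i <= NLOC; i++)          /* H update from E */
            hy[i] += 0.5 * (ez[i + 1] - ez[i]);  /* 0.5: placeholder coeff */

        /* ez[1] needs the left neighbor's last H value. */
        MPI_Sendrecv(&hy[NLOC], 1, MPI_DOUBLE, right, 1,
                     &hy[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 1; i <= NLOC; i++)          /* E update from H */
            ez[i] += 0.5 * (hy[i] - hy[i - 1]);

        if (rank == 0)
            ez[1] += sin(0.1 * t);               /* hard source at one end */
    }

    free(ez);
    free(hy);
    MPI_Finalize();
    return 0;
}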

The given problem was run on the same number of nodes, once using threads and once without threads. Figure 5a) shows the fixed speedup versus the number of processors. When threads are used, each node contributes two processors to the computation of the fields. The internode communication remains the same, but the computation time is reduced by about a factor of two. Figure 5b) shows the scaled speedup versus the number of processors.






Figure 5: a) fixed speedup versus number of processors, with and without threads; b) scaled speedup versus number of processors.
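
As a sketch of how a node's second processor can be put to work with pthreads, the fragment below splits a field-update loop between two threads. The per-cell update is a stand-in for the real FDTD kernel, and the array size is arbitrary.

#include <pthread.h>
#include <stdio.h>

#define N 100000
static double field[N];

struct span { int lo, hi; };

/* Each thread updates a disjoint half of the field array; the body is a
   stand-in for the real per-cell field update. */
static void *update(void *arg)
{
    struct span *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        field[i] *= 0.99;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct span first = { 0, N / 2 }, second = { N / 2, N };

    /* The second processor works on one half while the main thread does
       the other; join before the next internode communication step. */
    pthread_create(&tid, NULL, update, &first);
    update(&second);
    pthread_join(tid, NULL);

    printf("done\n");
    return 0;
}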


Conclusion:

The project investigated various applications that use distributed computing techniques. It was observed that the problem size must be sufficiently large relative to the number of processors to take advantage of the speedup offered by parallel computing.

