Abstract—Many problems in computer simulation of systems in science and engineering offer potential for parallel implementation through one of the three major paradigms: algorithmic parallelism, geometric parallelism, and processor farming. Static process scheduling techniques have been used successfully to exploit geometric and algorithmic parallelism, while dynamic process scheduling is better suited to the independent processes inherent in the processor farming paradigm. This paper considers the application of multicomputers to a class of problems exhibiting the spatial data dependency characteristic of the geometric paradigm. By using the processor farming paradigm in conjunction with geometric decomposition, a dynamic scheduling technique is developed to suit the MIMD structure of multicomputers. The specific problem chosen for the investigation of scheduling techniques is the computer simulation of cellular automaton models.
Keywords—Cellular automaton, multicomputers, parallel paradigms, scheduling.
I. INTRODUCTION

Static and dynamic scheduling of processes are techniques
that can be used to optimize performance in parallel
computing systems. When dealing with such systems an
acceptable balance between communication and computation
times is required to ensure efficient use of processing
resources. When the time taken to compute a sub-problem is less than the time taken to receive the data or transmit the results, the communication bandwidth
becomes a limit to performance. With dynamic scheduling, an
appropriate program can redirect the flow of data at run time
to keep the processors as busy as possible and help achieve
optimum performance [1].
The problem chosen here for the investigation of
scheduling techniques is the cellular automaton (C.A.). The
C.A. approach has been used in many applications, such as
image processing, self learning machines, fluid dynamics and
modeling parallel computers. Because of their small compute requirements, many C.A. algorithms implemented on a network of processors exhibit the imbalance discussed above.
Mohammad S. Laghari is with the Electrical Engineering Department,
Faculty of Engineering, United Arab Emirates University, P.O. Box: 17555,
Al Ain, U.A.E. (phone: 00971506625492; fax: 0097137623156; email:
mslaghari@uaeu.ac.ae).
Gulzar A. Khuwaja is with the Department of Computer Engineering,
College of Computer Sciences & Information Technology, King Faisal
University, Al Ahsa 31982, Kingdom of Saudi Arabia (email:
Khuwaja@kfu.edu.sa).
A cellular automaton simulation, with an artificially increased compute load per cell (in the form of a number of simulated multiplies), is considered for parallelization. Such a simulation
is representative of a class of recursive algorithms with local
spatial dependency and fine granularity that may be
encountered in biological applications, finite elements, certain
problems in image analysis and computational geometry [2]–[5]. These types of applications exhibit geometric parallelism
and may be considered best suited to static scheduling.
However, using dynamic scheduling, the MIMD structure of multicomputer networks is exploited, and a comparison of the two schemes is given in the form of total timings and speedup.
II. THE C.A. MODEL
Cellular automata were introduced in the late forties by
John von Neumann, following a suggestion of Stan Ulam, to
provide a more realistic model for the behavior of complex,
extended systems [6].
In its simplest form, a cellular automaton consists of a
lattice or line of sites known as cells, each with value 0 or 1.
These values are updated in a sequence of discrete time steps according to a definite, fixed rule. The overall properties of a cellular automaton are usually not readily evident from its basic rule, but given the rule, its behavior can always be determined by explicit simulation on a digital computer.
Cellular automata are mathematical idealizations of physical systems in which space and time are discrete and physical quantities take on a finite set of discrete values. The
C.A. model used in this investigation is a 1-dimensional cellular automaton where processing takes place in a near-homogeneous system having a fine level of granularity. It is
conceptually simple and has a high degree of parallelism. It
consists of a line of cells or sites x_i (where i = 1, ..., n) with periodic boundary conditions x_{n+1} = x_1, which means that the last cell in the line of sites is connected to the first cell. Each cell can store a single value or variable known as its state. At regular intervals in time the values of the cells are simultaneously (synchronously) updated according to a local transition rule whose result depends on the previous state of the cell and those of its neighbors. The neighborhood of a given site is simply the site itself and the sites immediately adjacent to it on the left and right. Each cell may exist in one of two states, x_i = 0 or 1.
The local rules of a C.A. can be described by an eight-digit binary number, as shown in the following example. Fig. 1 specifies one particular set of rules for an elementary C.A.
Scheduling Techniques of Processor Scheduling in Cellular Automaton
Mohammad S. Laghari and Gulzar A. Khuwaja

International Conference on Intelligent Computational Systems (ICICS'2012), Jan. 7-8, 2012, Dubai
111 110 101 100 011 010 001 000
 1   0   0   1   0   1   1   0

Fig. 1 The 8 possible states of 3 adjacent sites
The top row gives all the 2^3 = 8 possible values of the three sites in the neighborhood, and below each one is given the value taken by the middle site at the next time step according to a particular local rule. As any eight-digit binary number specifies a cellular automaton, there are 2^8 = 256 possible distinct C.A. rules in one dimension with a 3-site neighborhood. The rule in the lower line of Fig. 1 is rule number 150 (10010110), which has been used for the implementation of the C.A. algorithms in this paper.
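The mapping from a rule number to its update table can be sketched in a few lines (an illustrative snippet in the spirit of the rule numbering described above, not the paper's implementation; the function name is ours):

```python
def rule_table(rule_number):
    """Build the lookup table for an elementary C.A. rule.

    Bit k of the rule number gives the next state of a cell whose
    3-site neighborhood (left, centre, right) encodes the integer k.
    """
    return {
        (left, centre, right):
            (rule_number >> (left * 4 + centre * 2 + right)) & 1
        for left in (0, 1) for centre in (0, 1) for right in (0, 1)
    }

table = rule_table(150)
# Reading neighborhoods 111 down to 000 gives 1 0 0 1 0 1 1 0, i.e. rule 150.
```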
The rules may be considered as a Boolean function of the
sites within the neighborhood. Let x_i(t) be the value of site i at time step t. For the above example, the value of a particular site is simply the sum modulo two of the values of its own and its two neighboring sites on the previous time step. The Boolean equivalent of this rule is given by:

x_i(t+1) = REM((x_{i-1}(t) + x_i(t) + x_{i+1}(t)), 2)

where REM is the remainder function.
This can be written in the form:

x_i(t+1) = x_{i-1}(t) ⊕ x_i(t) ⊕ x_{i+1}(t)

or schematically

x' = x_- ⊕ x ⊕ x_+

where ⊕ denotes addition modulo two (exclusive disjunction), x' denotes the value of a particular site at the next time step, and x_-, x, x_+ denote the values of its left neighbor, the site itself, and its right neighbor at the previous time step, respectively.
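One synchronous update step under this rule, with the periodic boundary x_{n+1} = x_1, can be sketched as follows (a minimal illustration, not the paper's multicomputer code; the function name is ours):

```python
def step(cells):
    """One synchronous update of every cell under rule 150: the new value
    of a site is the XOR (sum modulo 2) of its left neighbor, itself and
    its right neighbor, with periodic boundary conditions."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ cells[i] ^ cells[(i + 1) % n]
            for i in range(n)]
```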
The following shows how the above equations relate to rule
number 150 of C.A.
Suppose x_- ≡ A, x ≡ B, x_+ ≡ C; then, using Boolean laws, the schematic equation becomes:

A ⊕ B ⊕ C = (A.B' + A'.B) ⊕ C
          = A.B'.C' + A'.B.C' + A'.B'.C + A.B.C

where ' denotes complement. Putting this equation in truth Table I shows the output giving rule number 150 in binary form when read from the most significant bit:

10010110 (binary) = 128 + 16 + 4 + 2 = 150
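The equivalence between the three-way exclusive-or and the sum-of-products expansion above can be checked exhaustively (a quick verification sketch; function names are ours):

```python
def xor3(a, b, c):
    # Left-hand side: addition modulo two of the three sites.
    return a ^ b ^ c

def sum_of_products(a, b, c):
    # Right-hand side: A.B'.C' + A'.B.C' + A'.B'.C + A.B.C,
    # with the complement of x written as 1 - x for 0/1 values.
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (a & nb & nc) | (na & b & nc) | (na & nb & c) | (a & b & c)

# Reading the outputs from neighborhood 111 down to 000 as a binary number
# recovers the rule number.
rule = sum((a ^ b ^ c) << (a * 4 + b * 2 + c)
           for a in (0, 1) for b in (0, 1) for c in (0, 1))
# rule == 150
```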
Fig. 2 shows the evolution of a particular state of the C.A.
through two time steps in the above example.
TABLE I
FINDING RULE NUMBER IN BINARY FORM

A B C | output
0 0 0 |   0
0 0 1 |   1
0 1 0 |   1
0 1 1 |   0
1 0 0 |   1
1 0 1 |   0
1 1 0 |   0
1 1 1 |   1
Fig. 2 Evolution of 1D C.A. through two time steps
Fig. 3 shows the evolution of a 1-dimensional elementary cellular automaton according to the rule described above, starting from a state containing a single site with value 1. Sites with values 1 and 0 are represented by '*' and ' ', respectively. The configuration of the cellular automaton at successive time steps is shown on successive lines. The time evolution is shown for at most 20 time steps or up to the point where the system is detected to cycle.
Fig. 3 Evolution of C.A. into a configuration up to 20 time steps
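The display described above can be reproduced with a short sketch; cycle detection simply remembers previously seen configurations (names and the lattice size are ours, chosen for illustration):

```python
def evolve(n=31, max_steps=20):
    """Evolve rule 150 from a single seeded site, returning one text line
    per time step ('*' for 1, ' ' for 0), stopping after max_steps or as
    soon as a configuration repeats (a cycle is detected)."""
    cells = [0] * n
    cells[n // 2] = 1                      # single site with value 1
    seen = set()
    lines = []
    for _ in range(max_steps):
        state = tuple(cells)
        if state in seen:                  # system has entered a cycle
            break
        seen.add(state)
        lines.append(''.join('*' if c else ' ' for c in cells))
        cells = [cells[i - 1] ^ cells[i] ^ cells[(i + 1) % n]
                 for i in range(n)]        # periodic boundaries
    return lines

for line in evolve():
    print(line)
```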
III. PARALLEL PARADIGMS
In order to efficiently utilize the computational potential of
a large number of processors in a parallel processing
environment, it is necessary to identify the important parallel
features of the application. There are several simple paradigms
for exploiting parallelism in scientific and engineering
applications, but the most commonly occurring types fall into
three classes. These three paradigms are described in more
detail in [7], [8].
A. Algorithmic Parallelism
This is present where the algorithm can be broken down into a pipeline of processors. In this decomposition the data flows through the processing elements.
B. Geometric Parallelism
This is present where the problem can be broken down into a number of similar processes in such a way as to preserve processor data locality, with each processor operating on a different subset of the total data to be processed.
C. Processor Farm
This is present where each processor executes the same program with different initial data, in isolation from all the other processors in the farm [9], [10].
IV. ALGORITHMS
In order to meet the high speed and performance requirements, a scalable and reconfigurable multicomputer system (NPLA) is used. This networked multicomputer system is similar to the NePA system used to implement Network-on-Chip [11].
The system used is a linear array of processors. It includes RISC processors and memory blocks. Each processor in the array has a compactOR, internal instruction memory, internal data memory, a data control unit, and registers. One of the processors is used as a master or main processor and the remaining as slaves. The system has a network interface, with the main processor having a four-port router and the others equipped with two-port routers. Routers can transfer both control and application data among processors. The two scheduling algorithms are described below:
A. Static Algorithm
In this implementation of the cellular automaton, the problem is statically implemented by using array processing. The algorithm is decomposed by using geometric parallelism. Ideally, the master processor should distribute a fixed number of cells uniformly across the ring of slave processors. At the start of an individual iteration, each cell process broadcasts the current state of its cell to its neighbors, in parallel with inputting the states of its neighbors from the neighboring cell processes. After this exchange of data, the cell updates its state using the rule described earlier. Instead of individual cell processes in each slave which communicate with the neighboring cells after every update, the master processor distributes fixed-size array segments of cells (for a total length of a maximum of 768 cells) uniformly across the worker array, with each processor being responsible for the defined spatial area.
Each iteration starts with the slave processors first exchanging boundary information with the neighboring processors, in such a way that the end elements of each array segment carry information about the end elements of the neighboring segments. After this exchange, the array segment is updated, for all its elements in parallel, with the help of the neighboring elements by using the cellular automaton rule described earlier. The updated results are assigned to another array. The results of all iterations are communicated back to the master processor.
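The static scheme can be mimicked in a serial sketch: the line of cells is split into equal segments, each segment is padded with the end elements of its neighbors (periodic boundaries), and then updated independently. This is an illustration of the geometric decomposition only, not the NPLA code; names are ours:

```python
def static_step(cells, n_slaves):
    """One iteration of the statically scheduled scheme, simulated serially.

    Each 'slave' holds a fixed, uniform segment; before updating, it needs
    only the end elements of its two neighboring segments."""
    n = len(cells)
    size = n // n_slaves                 # fixed, uniform segment size
    out = []
    for s in range(n_slaves):
        lo, hi = s * size, (s + 1) * size
        left = cells[lo - 1]             # boundary element from left neighbor
        right = cells[hi % n]            # boundary element from right neighbor
        seg = [left] + cells[lo:hi] + [right]
        out.extend(seg[i - 1] ^ seg[i] ^ seg[i + 1]
                   for i in range(1, size + 1))
    return out
```

The decomposed update agrees with a global rule-150 update of the whole line, which is the point of the boundary exchange.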
Simulation tests are carried out for 20 iterations or time steps using from 1 to 7 slave processors, supplied with fixed-size array segments for a total array length of 768 cells. Artificially increased compute loads in the form of multiplies per cell (in steps of 20 multiplies) are introduced. Five loads of 20, 40, 60, 80 and 100 multiplies are used, which reside in the worker process of each slave. Table II shows the total timings in seconds for a normal and a range of artificially increased compute loads.
TABLE II
TOTAL TIMINGS IN SECONDS FOR 20 ITERATIONS IN STATIC SCHEME
The result without the additional compute load shows no improvement in performance when the algorithm is implemented on multiple processors: the communications take more time than the computation in each slave. The results with the compute load of 20 additional multiplies show a reasonable improvement in timings. The comparison shows that with an increase in the compute load, the overall performance of the algorithm and the utilization of the processors improve proportionally.
B. Dynamic Algorithm
In the previous implementation the allocation of processes to processors is defined at compile time. It is possible instead to have the program perform the process allocation as it runs. In this implementation of the cellular automaton, the distribution of processing loads is performed dynamically. The topology used is the same as in the previous example: a master processor and up to 7 slaves, now operating as a farm of processors with the code replicated on each of them.
In this algorithm, the master processor distributes work packets to the farm of slave processors. This processor is also responsible for the geometric decomposition and for tracking the work packets through the iteration sequence. It consists of two main processes, send and receive, which execute in parallel and share two large arrays, data send and data receive. At the start of the first iteration, the send process farms out fixed-size data packets from the send array (which contains the line of sites to be computed) to the slave processors. Each data packet includes: an array segment of cells, the address of the segment's location in the send array, and information about the end elements of the neighboring segments.
The slave processors operate two main processes, both running in parallel. One is a worker process, where the actual computation takes place; it runs at low priority with respect to the other, which is a work_packet_schedular, as shown in Fig. 4.
Fig. 4 Work packet schedular on slave processors
The work_packet_schedular on each slave consists of:
• a schedular process, which inputs data packets from the master and schedules tasks through buffers, either to the worker process or to the next processor in the chain of slaves, on a first-come, first-served basis. The buffers operate as request buffers: as soon as a buffer has served its task, more work is requested from the scheduler process. If requests for work from the worker process and from the next processor arrive at the same time, priority is given to the worker process.
• a data_passer process, which inputs resultant data through buffers from either the worker process or the previous processor, on a first-come, first-served basis, and forwards it to the next processor leading towards the master processor.
In order to keep the slave processors busy, the task scheduler buffers an extra item of work, so that when the worker process completes the computation for an array segment it can start on the next one at once, rather than having to wait for the master processor to send the next item of work.
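The effect of buffering one extra item can be illustrated with a bounded queue of depth two between the scheduling side and the worker (a toy schematic with threads standing in for the parallel processes; names are ours, not the slave code):

```python
import queue
import threading

def run_farm(work_items, compute):
    """Toy version of the slave-side scheduling: a depth-2 buffer lets the
    worker start its next segment immediately instead of waiting for the
    next item of work to arrive."""
    buf = queue.Queue(maxsize=2)       # current item + one buffered ahead
    results = []

    def worker():
        while True:
            item = buf.get()
            if item is None:           # sentinel: no more work
                break
            results.append(compute(item))

    t = threading.Thread(target=worker)
    t.start()
    for item in work_items:
        buf.put(item)                  # blocks only when 2 items are pending
    buf.put(None)
    t.join()
    return results
```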
The worker process inputs the array segment together with the information about the end bits of the neighboring segments and the address bits. It then updates the segment according to the C.A. rule described earlier, stores the result in another array, adds the address bits, and communicates it to the data_passer process.
The processed array segments, together with the address bits, are received by the receive process in the master processor and are placed in the data receive array at the appropriate positions. This completes the first iteration.
For subsequent iterations, an array segment can only be sent for processing if its adjoining neighbors are present; this is because of the end element information required from the neighboring segments. Therefore, as soon as the master processor receives 3 contiguous segments in the data receive array, it copies the middle segment to the data send array. When 3 contiguous segments have been copied to the data send array, the middle segment of these is sent to the slaves for further processing.
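The dependency rule can be stated compactly: segment s of the next iteration may be dispatched once segments s-1, s and s+1 of the current iteration have been received (with periodic wrap-around). A sketch of that check, as our own simplification of the master's bookkeeping:

```python
def dispatchable(done, n_segments):
    """Return the segments whose next-iteration update can be farmed out:
    those whose own result and both neighbors' results (periodic
    boundaries) are already back in the data receive array."""
    return [s for s in range(n_segments)
            if done[s - 1] and done[s] and done[(s + 1) % n_segments]]

# With 6 segments and results back for segments 0, 1, 2 and 5, segments 0
# and 1 each sit in the middle of a contiguous run of three and are ready.
ready = dispatchable([True, True, True, False, False, True], 6)
# ready == [0, 1]
```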
Experiments are performed on the dynamically allocated scheme by varying the network sizes, the computational loads, and the size of the work packets in order to obtain optimum performance parameters. Timings for 1 to 7 slave processors are obtained for 20 iterations. Experiments are performed with varying packet sizes of 12, 24, and 48 cells for the total array length of 768 cells. Additional compute loads in the form of 20, 40, 60, 80 and 100 multiplies are used.
Table III shows the computation timings in seconds for the packet size of 24 cells in the dynamic scheme. The results of dynamic allocation show reasonable improvements in timings for the three packet sizes, the exception being the compute load of 20 multiplies, which shows only small improvements in performance for smaller networks.
TABLE III
TIMINGS FOR 20 ITERATIONS IN DYNAMIC SCHEME FOR 24 CELLS
The speedup for the packet size of 24 cells shows very good results for all the additional compute loads except the case of 20 multiplies, as shown in Fig. 5. A near-linear speedup is obtained when four slave processors are used. For the load of 60 multiplies, a speedup of 5.76 is achieved when all the slaves are used. The results for the three segment sizes of 12, 24 and 48 cells are compared with artificially increased compute loads in terms of speedup. For this comparison, compute loads of 20 and 100 multiplies are chosen.
Fig. 6 shows the speedup for the case of 20 multiplies. The array size of 12 cells shows no improvement in the results. The reason is that, for the case of 12 cells, the master processor distributes 64 array segments for each line of sites of 768 cells; the master therefore communicates a total of 1280 array segments to complete 20 iterations. With the compute load of 20 multiplies per cell, the system does not balance the computation and communication loads. The results show that the system takes much more time to communicate data packets of this size to and from the slave processors, and performance is therefore poor. Increasing the size of the data packets for the additional load of 20 multiplies has only a small effect on the performance. The array size of 48 cells shows slight improvements for up to 3 slave processors.
Fig. 5 Speedup for 24 cells in dynamic scheme
Fig. 6 Comparison of speedup results for the load of 20 multiplies
Fig. 7 shows the speedup for the case of 100 multiplies. Excellent results are obtained for all the array segments when from 1 to 4 slave processors are used. Again, the array size of 24 cells gives the best performance when all the available slave processors are used. Therefore, comparing the results for all the additional compute loads, the array segment of size 24 with the compute load of 100 multiplies gives the best performance parameters in the dynamic scheduling scheme.
Fig. 7 Comparison of speedup for the load of 100 multiplies
Fig. 8 shows the timing comparison between the two schemes for seven processors. Except for the compute load of 20 multiplies, the dynamic scheme performs better for all other loads.
Fig. 8 Comparison of timings between the two schemes
V. CONCLUSION
In this paper we have considered a modified C.A. model
with artificially increased load. The recursive structure and
spatial data dependency of this algorithm is representative of
an important class of algorithms in science and engineering.
The paper investigates the performance of scheduling techniques for the implementation of this type of algorithm on multicomputer networks. Experiments performed on implementations of the above techniques suggest that, over certain ranges of compute load, dynamic scheduling can outperform static scheduling in terms of speedup.
REFERENCES
[1] T. L. Casavant and J. G. Kuhl, "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems," IEEE Trans. on Software Engineering, vol. 14, no. 2, Feb. 1988.
[2] M. V. Avolio, A. Errara, V. Lupiano, P. Mazzanti, and S. D. Gregorio, "Development and Calibration of a Preliminary Cellular Automata Model for Snow Avalanches," in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp. 83–94.
[3] D. Cacciagrano, F. Corradini, and E. Merelli, "Bone Remodelling: A Complex Automata-Based Model Running in BioShape," in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp. 116–127.
[4] M. Ghaemi, O. Naderi, and Z. Zabihinpour, "A Novel Method for Simulating Cancer Growth," in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp. 142–148.
[5] Y. Zhao, S. A. Billings, and A. F. Routh, "Identification of Excitable Media Using Cellular Automata Models," Int. J. of Bifurcation and Chaos, vol. 17, pp. 153–168, 2007.
[6] A. Ilachinski, Cellular Automata: A Discrete Universe. Singapore: World Scientific Publishing, 2001.
[7] D. J. Pritchard, "Transputer Applications on Supernode," in Proc. Int. Conf. on Application of Transputers, Liverpool, U.K., Aug. 1989.
[8] M. S. Laghari and F. Deravi, "Scheduling Techniques for the Parallel Implementation of the Hough Transform," in Proc. Engineering System Design and Analysis, Istanbul, Turkey, 1992, pp. 285–290.
[9] A. S. Wagner, H. V. Sreekantaswamy, and S. T. Chanson, "Performance Models for the Processor Farm Paradigm," IEEE Trans. on Parallel and Distributed Systems, vol. 8, no. 5, pp. 475–489, May 1997.
[10] A. Walsch, "Architecture and Prototype of a Real-Time Processor Farm Running at 1 MHz," Ph.D. thesis, University of Mannheim, Mannheim, Germany, 2002.
[11] Y. S. Yang, J. H. Bahn, S. E. Lee, and N. Bagherzadeh, "Parallel and Pipeline Processing for Block Cipher Algorithms on a Network-on-Chip," in Proc. 6th Int. Conf. on Information Technology: New Generations, Las Vegas, Nevada, Apr. 2009, pp. 849–854.