Abstract—

Many problems in computer simulation of systems in

science and engineering present potential for parallel

implementations through one of the three major paradigms of

algorithmic parallelism, geometric parallelism and processor

farming. Static process scheduling techniques have been used

successfully to exploit geometric and algorithmic parallelism, while

dynamic process scheduling is better suited to dealing with the

independent processes inherent in the process farming paradigm.

This paper considers the application of parallel or multi-computers to

a class of problems exhibiting spatial data dependency characteristic

of the geometric paradigm. However, by using the processor farming

paradigm in conjunction with geometric decomposition, a dynamic

scheduling technique is developed to suit the MIMD structure of the

multi-computers. The specific problem chosen for the investigation

of scheduling techniques is the computer simulation of Cellular

Automaton models.

Keywords—

Cellular Automaton, multi-computers, parallel

paradigms, scheduling.

I. INTRODUCTION

Static and dynamic scheduling of processes are techniques

that can be used to optimize performance in parallel

computing systems. When dealing with such systems an

acceptable balance between communication and computation

times is required to ensure efficient use of processing

resources. When the time to perform the computation on a

sub-problem is less than the time taken to receive the data or

transmit the results, then the communication bandwidth

becomes a limit to performance. With dynamic scheduling, an

appropriate program can redirect the flow of data at run time

to keep the processors as busy as possible and help achieve

optimum performance [1].

The problem chosen here for the investigation of

scheduling techniques is the cellular automaton (C.A.). The

C.A. approach has been used in many applications, such as

image processing, self learning machines, fluid dynamics and

modeling parallel computers. Because of their small compute

requirements, many C.A. algorithms implemented on a

network of processors exhibit the imbalance discussed above.

Mohammad S. Laghari is with the Electrical Engineering Department,

Faculty of Engineering, United Arab Emirates University, P.O. Box: 17555,

Al Ain, U.A.E. (phone: 00971-50-6625492; fax: 00971-3-7623156; e-mail:

mslaghari@uaeu.ac.ae).

Gulzar A. Khuwaja is with the Department of Computer Engineering,

College of Computer Sciences & Information Technology, King Faisal

University, Al Ahsa 31982, Kingdom of Saudi Arabia (e-mail:

Khuwaja@kfu.edu.sa).

A cellular automaton simulation, with artificially increased

compute load per cell (in the form of number of simulated

multiplies) is considered for parallelization. Such a simulation

is representative of a class of recursive algorithms with local

spatial dependency and fine granularity that may be

encountered in biological applications, finite elements, certain

problems in image analysis and computational geometry [2]-[5].

These types of applications exhibit geometric parallelism

and may be considered best suited to static scheduling.

However, using dynamic scheduling, the MIMD structure of

multicomputer networks is exploited, and comparison of both

the schemes is given in the form of total timings and speedup.

II. THE C.A. MODEL

Cellular automata were introduced in the late forties by

John von Neumann, following a suggestion of Stan Ulam, to

provide a more realistic model for the behavior of complex,

extended systems [6].

In its simplest form, a cellular automaton consists of a

lattice or line of sites known as cells, each with value 0 or 1.

These values are updated in a sequence of discrete time steps

according to a definite, fixed, rule. The overall properties of a

cellular automaton are usually not readily evident from its

basic rule. But given these rules, its behavior can always be

determined by explicit simulation on a digital computer.

Cellular automata are mathematical idealizations of

physical systems in which space and time are discrete, and

physical quantities take on a finite set of discrete values. The

C.A. model used in this investigation is a 1-dimensional

cellular automaton where processing takes place by a near

homogeneous system having a fine level of granularity. It is

conceptually simple and has a high degree of parallelism. It

consists of a line of cells or sites $x_i$ (where $i = 1, \ldots, n$) with periodic boundary conditions $x_{n+1} = x_1$, which means that the last cell in the line of sites is connected to the first cell. Each cell can store a single value or variable known as its state. At regular intervals in time the values of the cells are simultaneously (synchronously) updated according to a local transition rule whose result depends on the previous state of the cell and those of its neighbors. The neighborhood of a given site is simply the site itself and the sites immediately adjacent to it on the left and right. Each cell may exist in one of two states, $x_i = 0$ or $1$.

The local rules of C.A. can be described by an eight-digit

binary number as shown in the following example. Fig. 1

specifies one particular set of rules for an elementary C.A.

Scheduling Techniques of Processor Scheduling

in Cellular Automaton

Mohammad S. Laghari

and

Gulzar A. Khuwaja

International Conference on Intelligent Computational Systems (ICICS'2012) Jan. 7-8, 2012 Dubai

111   110   101   100   011   010   001   000
 1     0     0     1     0     1     1     0

Fig. 1 The 8 possible states of 3 adjacent sites

The top row gives all the $2^3 = 8$ possible values of the three sites in the neighborhood, and below each one is given the value taken by the middle site on the next time step according to a particular local rule. Since any eight-digit binary number specifies a cellular automaton, there are $2^8 = 256$ possible distinct C.A. rules in one dimension with a 3-site neighborhood. The rule in the lower line of Fig. 1 is rule number 150 (10010110), which has been used for the implementation of the C.A. algorithms in this paper.
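The correspondence between a rule number and its lookup table can be illustrated with a short Python sketch. The function name `rule_table` and the tuple ordering (left, centre, right) are illustrative choices made here, not part of the original implementation.

```python
def rule_table(rule_number):
    """Map each 3-site neighborhood (left, centre, right) to the next centre value.

    Bit k of the rule number gives the output for the neighborhood whose
    three sites, read as a binary number, equal k.
    """
    if not 0 <= rule_number < 256:
        raise ValueError("an elementary C.A. rule number must lie in 0..255")
    table = {}
    for value in range(8):
        neighborhood = ((value >> 2) & 1, (value >> 1) & 1, value & 1)
        table[neighborhood] = (rule_number >> value) & 1
    return table


RULE_150 = rule_table(150)
```

Reading the outputs from neighborhood 111 down to 000 reproduces the eight binary digits of the rule number.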

The rules may be considered as a Boolean function of the

sites within the neighborhood. Let $x_i(t)$ be the value of site $i$ at

time step t. For the above example, the value of a particular

site is simply the sum modulo two of the values of its own and

its two neighboring sites on the previous time step. The

Boolean equivalent of this rule is given by:

$$x_i(t+1) = \big(x_{i-1}(t) + x_i(t) + x_{i+1}(t)\big)\ \mathrm{REM}\ 2$$

where REM is the remainder function.

This can be written in the form of:

$$x_i(t+1) = x_{i-1}(t) \oplus x_i(t) \oplus x_{i+1}(t)$$

or schematically

$$x^{+} = x_{-} \oplus x \oplus x_{+}$$

where $\oplus$ denotes addition modulo two or exclusive disjunction, $x^{+}$ denotes the value of a particular site on the next time step, and $x_{-}$, $x$, $x_{+}$ denote the values of its left neighbor, the site itself, and its right neighbor on the previous time step, respectively.
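As a minimal sketch, one synchronous update of the whole line of sites with periodic boundary conditions can be written in Python in both the remainder form and the exclusive-or form. The function names are illustrative assumptions; indices here run from 0 to n-1 rather than 1 to n.

```python
def step_rule150_rem(cells):
    """One time step using the remainder (sum modulo two) form of the rule."""
    n = len(cells)
    # cells[i - 1] wraps to the last site for i == 0; (i + 1) % n wraps to the first.
    return [(cells[i - 1] + cells[i] + cells[(i + 1) % n]) % 2 for i in range(n)]


def step_rule150_xor(cells):
    """The same time step using the exclusive-or form of the rule."""
    n = len(cells)
    return [cells[i - 1] ^ cells[i] ^ cells[(i + 1) % n] for i in range(n)]
```

Both functions build a new list, so every site is updated from the values of the previous time step only, which matches the synchronous update described earlier.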

The following shows how the above equations relate to rule

number 150 of C.A.

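Suppose $x_{-} \equiv A$, $x \equiv B$, $x_{+} \equiv C$;

then using Boolean laws the schematic equation becomes:

$$\begin{aligned}
A \oplus B \oplus C &= (A \cdot \bar{B} + \bar{A} \cdot B) \oplus C \\
&= \bar{A} \cdot \bar{B} \cdot C + \bar{A} \cdot B \cdot \bar{C} + A \cdot \bar{B} \cdot \bar{C} + A \cdot B \cdot C
\end{aligned}$$

Putting this equation in the truth Table I shows the output giving rule number 150 in the binary form when read from the most significant bit.

$$(10010110)_2 = 1 \cdot 128 + 0 \cdot 64 + 0 \cdot 32 + 1 \cdot 16 + 0 \cdot 8 + 1 \cdot 4 + 1 \cdot 2 + 0 \cdot 1 = 128 + 16 + 4 + 2 = 150$$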

Fig. 2 shows the evolution of a particular state of the C.A.

through two time steps in the above example.

TABLE I
FINDING RULE NUMBER IN BINARY FORM

A  B  C  output
0  0  0  0
0  0  1  1
0  1  0  1
0  1  1  0
1  0  0  1
1  0  1  0
1  1  0  0
1  1  1  1
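The truth table can be checked mechanically. The following Python sketch enumerates the eight input combinations, evaluates the sum-of-products form of the rule, confirms that it agrees with the exclusive-or form, and reads the output column from the row 111 down to the row 000 to recover the rule number. The helper name `sum_of_products` is an illustrative assumption.

```python
from itertools import product


def sum_of_products(a, b, c):
    """Sum-of-products form of A xor B xor C on 0/1 values."""
    na, nb, nc = 1 - a, 1 - b, 1 - c  # complements of A, B, C
    return (na & nb & c) | (na & b & nc) | (a & nb & nc) | (a & b & c)


outputs = []
for a, b, c in product((0, 1), repeat=3):  # rows 000, 001, ..., 111
    out = sum_of_products(a, b, c)
    assert out == a ^ b ^ c  # agrees with the exclusive-or form
    outputs.append(out)

# The row 111 holds the most significant bit, so reverse before converting.
rule_number = int("".join(str(bit) for bit in reversed(outputs)), 2)
```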

Fig. 2 Evolution of 1-D C.A. through two time steps

Fig. 3 shows the evolution of a 1-dimensional elementary cellular automaton according to the rule described above, starting from a state containing a single site with value 1. Sites with values 1 and 0 are represented by '*' and ' ', respectively. The configuration of the cellular automaton at successive time steps is shown on successive lines. The time evolution is shown for at most 20 time steps or up to the point where the system is detected to cycle.

Fig. 3 Evolution of C.A. into a configuration up to 20 time steps
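An evolution of this kind can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the line width of 41 sites and the placement of the initial 1 in the middle are choices made here for illustration, and cycle detection is done by remembering every configuration seen so far.

```python
def step(cells):
    """One synchronous rule-150 update with periodic boundaries."""
    n = len(cells)
    return tuple(cells[i - 1] ^ cells[i] ^ cells[(i + 1) % n] for i in range(n))


def evolve(width=41, max_steps=20):
    """Return the configurations from the initial state up to max_steps steps,
    stopping early if a configuration repeats (the system cycles)."""
    cells = tuple(1 if i == width // 2 else 0 for i in range(width))
    seen = {cells}
    history = [cells]
    for _ in range(max_steps):
        cells = step(cells)
        if cells in seen:  # cycle detected
            break
        seen.add(cells)
        history.append(cells)
    return history


def render(history):
    """Draw sites with value 1 as '*' and sites with value 0 as ' '."""
    return "\n".join("".join("*" if c else " " for c in row) for row in history)


if __name__ == "__main__":
    print(render(evolve()))
```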

III. PARALLEL PARADIGMS

In order to efficiently utilize the computational potential of

a large number of processors in a parallel processing

environment, it is necessary to identify the important parallel

features of the application. There are several simple paradigms

for exploiting parallelism in scientific and engineering

applications, but the most commonly occurring types fall into

three classes. These three paradigms are described in more

detail in [7], [8].

A. Algorithmic Parallelism

Algorithmic parallelism is present where the algorithm can be broken down into a pipeline of processors. In this decomposition the data flows through the processing elements.

B. Geometric Parallelism

Geometric parallelism is present where the problem can be broken down into a number of similar processes in such a way as to preserve processor data locality, with each processor operating on a different subset of the total data to be processed.

C. Processor Farm

A processor farm is present where each processor executes the same program with different initial data in isolation from all the other processors in the farm [9], [10].

IV. ALGORITHMS

To meet the requirements of high speed and performance, a scalable and reconfigurable multi-computer system (NPLA) is used. This networked multi-computer system is somewhat similar to the NePA system used to implement Network-on-Chip [11].

The system used is a linear array of processors. It includes

RISC processors and memory blocks. Each processor in the

array has a compactOR, internal instruction memory, internal

data memory, data control unit, and registers. One of the

processors is used as a master or main processor and the

remaining as slaves. The system has a network interface in which the main processor has a four-port router and the other processors have two-port routers. Routers can transfer both control and application data among processors. The two scheduling algorithms are described below.

A. Static Algorithm

In this implementation of the cellular automaton, the problem is implemented statically by using array processing. The algorithm is decomposed by using geometric parallelism. Ideally, the master processor would distribute a fixed number of cells uniformly across the ring of slave processors. At the start of each iteration, each cell process would broadcast the current state of its cell to its neighbors in parallel with inputting the states of its neighbors from the neighboring cell processes. After this exchange of data, the cell would update its state using the rule described earlier.

Instead of running individual cell processes in each slave, which would communicate with the neighboring cells after every update, the master processor distributes fixed-size array segments of cells (for a total length of at most 768 cells) uniformly across the worker array, with each processor being responsible for its defined spatial area.

Each iteration starts with the slave processors exchanging boundary information with the neighboring processors in such a way that the end elements of each array segment carry the information of the end elements of the neighboring segments.

After this exchange, each slave updates all the elements of its array segment in parallel, with the help of the neighboring elements, by using the cellular automaton rule described earlier. The updated results are stored in another array. Results of all iterations are communicated back to the master processor.
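The static scheme can be sketched in Python as a sequential emulation of the slaves. This is not the authors' code: the function names are hypothetical, the slaves are modeled as list entries rather than real processors, and the way a line that does not divide evenly is split (segment sizes differing by at most one cell) is an assumption made here.

```python
def split_segments(cells, n_slaves):
    """Split the line of cells into n_slaves contiguous segments.
    Assumption: any remainder is spread one cell at a time over the first segments."""
    base, extra = divmod(len(cells), n_slaves)
    segments, start = [], 0
    for k in range(n_slaves):
        size = base + (1 if k < extra else 0)
        segments.append(list(cells[start:start + size]))
        start += size
    return segments


def step_static(segments):
    """One iteration: exchange boundary elements, then update every segment."""
    p = len(segments)
    new_segments = []
    for k, seg in enumerate(segments):
        left_end = segments[k - 1][-1]        # last cell of the left neighbor
        right_end = segments[(k + 1) % p][0]  # first cell of the right neighbor
        padded = [left_end] + seg + [right_end]
        # Results go into a new list, i.e. "another array".
        new_segments.append(
            [padded[j - 1] ^ padded[j] ^ padded[j + 1] for j in range(1, len(seg) + 1)]
        )
    return new_segments
```

Because every segment reads only the old boundary values of its two neighbors, the concatenated result equals a sequential update of the whole line for any number of slaves.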

Simulation tests are carried out for 20 iterations or time

steps using from 1 to 7 slave processors, supplied with fixed

size array segments for the total array length of 768 cells.

Artificially increased compute loads in the form of multiplies

per cell (in steps of 20 multiplies) are introduced. Five loads

of 20, 40, 60, 80 and 100 multiplies, respectively are used,

which reside in the worker process of each slave. Table II

shows the total timings in seconds for a normal and a range of

artificially increased compute loads.

TABLE II

TOTAL TIMING IN SECONDS FOR 20 ITERATIONS IN STATIC SCHEME

The result without the additional compute load shows no improvement in performance when the algorithm is implemented on multiple processors. The communications take more time than the computation in each slave. Results with the compute load of 20 additional multiplies show a reasonable improvement in timings. The comparison shows that as the compute load increases, the overall performance of the algorithm and the utilization of the processors improve proportionally.

B. Dynamic Algorithm

In the previous implementation the allocation of processes

to processors is defined at compile time. It is possible to have

the program perform the process allocation as it runs. In this

implementation of cellular automaton, distribution of

processing loads is performed dynamically. The topology used is the same as in the static scheme, namely a master processor and up to 7 slaves, now operating as a farm of processors with the code replicated on each of them.

In this algorithm, the master processor distributes work

packets to the farm of slave processors. This processor is also

responsible for geometrical decomposition and the tracking of

the work packets through the iteration sequence. It consists of two main processes, send and receive, which execute in parallel and share two large arrays, the data send array and the data receive array. At the start of the first iteration, the send process farms out fixed-size data packets from the data send array (which contains the line of sites to be computed) to the slave processors. Each data packet includes an array segment of cells, the address of the segment location in the send array, and information about the end elements of the neighboring segments.
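One possible shape for such a data packet is sketched below in Python. The class name `WorkPacket`, its field names, and the helper functions are hypothetical; the paper does not specify the packet layout beyond the three items listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkPacket:
    address: int   # start index of the segment in the data send array
    cells: tuple   # the array segment of cells
    left_end: int  # last cell of the neighboring segment on the left
    right_end: int # first cell of the neighboring segment on the right


def make_packets(send_array, packet_size):
    """Cut the send array into fixed-size packets with periodic neighbors."""
    n = len(send_array)
    if n % packet_size:
        raise ValueError("array length must be a multiple of the packet size")
    return [
        WorkPacket(
            address=start,
            cells=tuple(send_array[start:start + packet_size]),
            left_end=send_array[start - 1],                  # wraps for start == 0
            right_end=send_array[(start + packet_size) % n], # wraps for the last packet
        )
        for start in range(0, n, packet_size)
    ]


def process_packet(packet):
    """Work done by a worker: update the segment and return it with its address."""
    padded = (packet.left_end,) + packet.cells + (packet.right_end,)
    new_cells = tuple(
        padded[j - 1] ^ padded[j] ^ padded[j + 1] for j in range(1, len(packet.cells) + 1)
    )
    return packet.address, new_cells
```

Carrying the address with each packet lets results arrive in any order and still be placed at the correct position in the receive array.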

The slave processors operate two main processes, both running in parallel. One is a worker process, where the actual computation takes place and which runs at low priority; the other is a work_packet_schedular, as shown in Fig. 4.

Fig. 4 Work packet scheduler on slave processors

The work_packet_schedular on each slave consists of:

• a scheduler process, which inputs data packets from the master and schedules tasks through buffers either to the worker process or to the next processor in the chain of slaves on a first-come, first-served basis. The buffers operate as request buffers; that is, as soon as the buffers have served their tasks, more work is requested from the scheduler process. If requests for work from the worker process and the next processor arrive at the same time, priority is given to the worker process.

• a data_passer process, which inputs resultant data through buffers from either the worker process or the previous processor on a first-come, first-served basis and forwards it to the next processor leading towards the master processor.
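The arbitration rule of the scheduler process (first come, first served, with ties resolved in favor of the local worker) can be expressed as a small Python sketch. The representation of a request as an `(arrival_time, requester)` pair and the names used are assumptions for illustration only.

```python
WORKER = "worker"
NEXT_PROCESSOR = "next_processor"


def pick_request(pending):
    """Choose which pending work request the scheduler serves next.

    pending: list of (arrival_time, requester) pairs.
    The earliest request wins; on equal arrival times the worker has priority.
    """
    if not pending:
        return None
    # False sorts before True, so a worker request wins a tie.
    return min(pending, key=lambda req: (req[0], req[1] != WORKER))
```

For example, `pick_request([(3, NEXT_PROCESSOR), (3, WORKER)])` selects the worker request, while an earlier request from the next processor would still be served first.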

In order to keep the slave processors busy, the task scheduler buffers an extra item of work so that when the worker process completes the computation for an array segment it can start on the next one at once rather than having to wait for the master processor to send the next item of work.

The worker process inputs the array segment together with the information of the end bits of the neighboring segments and the address bits. It then updates the segment according to the C.A. rule described earlier, stores the result in another array, adds the address bits and communicates the result to the data_passer process.

The processed array segments together with the address bits are received by the receive process in the master processor and are placed in the data receive array at the appropriate positions. This completes the first iteration.

For subsequent iterations, an array segment can only be sent for processing if its adjoining neighbors are present, because the end-element information of the neighboring segments is needed. Therefore, as soon as the master processor receives 3 contiguous segments in the data receive array, it copies the middle segment to the data send array. When 3 contiguous segments have been copied to the data send array, the middle segment from this array is sent to the slaves for further processing.
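The dependency rule for subsequent iterations can be emulated with a simplified Python sketch. It is not the authors' master process: it merges the send and receive arrays into one dictionary keyed by (segment, iteration), serves one segment at a time, and uses a seeded random choice to imitate results returning from the farm in an unpredictable order. All names are hypothetical.

```python
import random


def update(left_end, segment, right_end):
    """Rule-150 update of one segment given its neighbors' end elements."""
    padded = (left_end,) + segment + (right_end,)
    return tuple(padded[j - 1] ^ padded[j] ^ padded[j + 1] for j in range(1, len(segment) + 1))


def run_dynamic(cells, packet_size, iterations, seed=0):
    if len(cells) % packet_size:
        raise ValueError("array length must be a multiple of the packet size")
    n_seg = len(cells) // packet_size
    # state[(k, t)] holds segment k after t iterations.
    state = {(k, 0): tuple(cells[k * packet_size:(k + 1) * packet_size]) for k in range(n_seg)}
    done = [0] * n_seg  # iterations completed so far by each segment
    rng = random.Random(seed)
    while min(done) < iterations:
        # A segment is ready when both neighbors have reached its iteration.
        ready = [
            k for k in range(n_seg)
            if done[k] < iterations
            and done[k - 1] >= done[k]
            and done[(k + 1) % n_seg] >= done[k]
        ]
        k = rng.choice(ready)  # arbitrary service order, as in a farm
        t = done[k]
        left_end = state[(k - 1) % n_seg, t][-1]
        right_end = state[(k + 1) % n_seg, t][0]
        state[k, t + 1] = update(left_end, state[k, t], right_end)
        done[k] = t + 1
    return [c for k in range(n_seg) for c in state[k, iterations]]
```

Segments may run ahead of one another by at most one iteration relative to their neighbors, yet the final line matches a fully synchronous update.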

Experiments are performed on the dynamically allocated

scheme by varying the network sizes, the computational loads,

and the size of the work packets in order to obtain optimum

performance parameters. Timings from 1 to 7 slave processors

are obtained for 20 iterations. Experiments are performed with

varying packet sizes of 12, 24, and 48 cells for the total array

length of 768 cells. Additional compute loads in the form of

20, 40, 60, 80 and 100 multiplies, are used.

Table III shows computation timings in seconds for the packet size of 24 cells for the dynamic scheme. The results of dynamic allocation show reasonable improvements in timings for the three packet sizes, the exception being the compute load of 20 multiplies, which shows only small improvements in performance for smaller networks.

TABLE III

TIMING FOR 20 ITERATIONS IN DYNAMIC SCHEME FOR 24 CELLS

The speedup for the packet size of 24 cells shows very good results for all the additional compute loads except for the case of 20 multiplies, as shown in Fig. 5. A near-linear speedup is obtained when four slave processors are used. For the load of 60 multiplies, a speedup of 5.76 is achieved when all the slaves are used. The results for the three segment sizes of 12, 24 and 48 cells are compared in terms of speedup for artificially increased compute loads. For comparison, compute loads of 20 and 100 multiplies are chosen.
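To put the quoted speedup in context, the following Python sketch assumes the usual definition of speedup as the time on one slave divided by the time on p slaves, and of efficiency as speedup divided by p. The paper does not state its definition explicitly, so both formulas are assumptions here.

```python
def speedup(time_one_slave, time_p_slaves):
    """Assumed definition: ratio of single-slave time to p-slave time."""
    return time_one_slave / time_p_slaves


def efficiency(speedup_value, n_slaves):
    """Assumed definition: fraction of the ideal linear speedup achieved."""
    return speedup_value / n_slaves


# Reported figure: speedup of 5.76 on all 7 slaves for the load of 60 multiplies.
efficiency_60_multiplies = efficiency(5.76, 7)
```

Under these assumptions the reported speedup corresponds to an efficiency of roughly 0.82 on seven slaves.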

Fig. 6 shows the speedup for the case of 20 multiplies. The packet size of 12 cells shows no improvement. The reason is that, for the case of 12 cells, the master processor distributes 64 array segments for each line of 768 sites. Therefore, the master communicates a total of 1280 array segments to complete 20 iterations. With the compute load of 20 multiplies for each cell, the system does not balance the computation and communication loads. The results indicate that

the system takes much more time to communicate data packets of this size to and from the slave processors and thus shows poor performance. Increasing the size of the data packets for the additional load of 20 multiplies has only a small effect on the performance. The packet size of 48 cells shows slight improvements for up to 3 slave processors.
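The packet counts behind this argument follow from simple arithmetic, shown here as a short Python sketch for the three packet sizes used in the experiments. The constant and function names are illustrative.

```python
TOTAL_CELLS = 768
ITERATIONS = 20


def packet_counts(packet_size):
    """Return (segments per line of sites, segments sent over all iterations)."""
    per_line, remainder = divmod(TOTAL_CELLS, packet_size)
    if remainder:
        raise ValueError("packet size must divide the array length")
    return per_line, per_line * ITERATIONS


counts = {size: packet_counts(size) for size in (12, 24, 48)}
```

Doubling the packet size halves the number of packets the master has to communicate, while the total number of cells processed stays the same.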

Fig. 5 Speedup for 24 cells in dynamic scheme

Fig. 6 Comparison of speedup results for the load of 20 multiplies

Fig. 7 shows the speedup for the case of 100 multiplies. Excellent results are obtained for all the packet sizes when 1 to 4 slave processors are used. Again, the packet size of 24 cells gives the best performance when all the available slave processors are used. Therefore, when comparing the results for all the additional compute loads, the array segment of size 24 with the compute load of 100 multiplies gives the best performance in the dynamic scheduling scheme.

Fig. 7 Comparison of speedup for the load of 100 multiplies

Fig. 8 shows the timing comparison of the two schemes for seven processors. Except for the compute load of 20 multiplies, the dynamic scheme performs better for all loads.

Fig. 8 Comparison of timings between the two schemes

V. CONCLUSION

In this paper we have considered a modified C.A. model with artificially increased load. The recursive structure and spatial data dependency of this algorithm are representative of an important class of algorithms in science and engineering. The paper investigates the performance of scheduling techniques for the implementation of this type of algorithm on multicomputer networks. Experiments performed on implementations of the above techniques suggest that, over certain ranges of compute load, dynamic scheduling can outperform static scheduling in terms of speedup.

REFERENCES

[1] T. L. Casavant and J. G. Kuhl, “A Taxonomy of Scheduling in General-

Purpose Distributed Computing Systems,” IEEE Trans. on Software

Engineering, vol. 14, no. 2, Feb. 1988.

[2] M. V. Avolio, A. Errara, V. Lupiano, P. Mazzanti, and S. D. Gregorio,

“Development and Calibration of a Preliminary Cellular Automata

Model for Snow Avalanches,” in Proc. 9th Int. Conf. on Cellular

Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp. 83–

94.

[3] D. Cacciagrano, F. Corradini, and E. Merelli, “Bone Remodelling: A

Complex Automata-Based Model Running in BIO SHAPE,” in Proc. 9th

Int. Conf. on Cellular Automata for Research and Industry, Ascoli

Piceno, Italy, 2010, pp. 116–127.

[4] M. Ghaemi, O. Naderi, and Z. Zabihinpour, “A Novel Method for

Simulating Cancer Growth,” in Proc. 9th Int. Conf. on Cellular

Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp.

142-148.

[5] Y. Zhao, S. A. Billings, and A. F. Routh, “Identification of Excitable

Media Using Cellular Automata Models,” Int. J. of Bifurcation and

Chaos, vol. 17, pp. 153-168, 2007.

[6] A. Ilachinski, Cellular Automata – A Discrete Universe. Singapore:

World Scientific Publishing, 2001.

[7] D. J. Pritchard, “Transputer Applications on Supernode,” in Proc. Int.

Conf. on Application of Transputers, Liverpool, U.K., Aug. 1989.

[8] M. S. Laghari and F. Deravi, “Scheduling Techniques for the Parallel

Implementation of the Hough Transform,” in Proc. Engineering System

Design and Analysis, Istanbul, Turkey, 1992, pp. 285-290.

[9] A. S. Wagner, H. V. Sreekantaswamy, and S. T. Chanson, “Performance

Models for the Processor Farm Paradigm,” IEEE Trans. on Parallel and

Distributed Systems, vol. 8, no. 5, pp. 475-489, May 1997.

[10] A. Walsch, “Architecture and Prototype of a Real-Time Processor Farm

Running at 1 MHz,” Ph.D. Thesis, University of Mannheim, Mannheim,

Germany 2002.

[11] Y. S. Yang, J. H. Bahn, S. E. Lee, and N. Bagherzadeh, “Parallel and

Pipeline Processing for Block Cipher Algorithms on a Network-on-

Chip,” in Proc. 6th Int. Conf. on Information Technology: New

Generations, Las Vegas, Nevada, Apr. 2009, pp. 849-854.
