A Simple Interface for Polite Computing

Travis Finch, St. Edward’s University

Faculty Advisor:

Dr. Sharon Weber, St. Edward’s University


Abstract



As computing labs begin to rely more on shared commodity workstations to perform parallel computations, load balancing cannot be ignored. Parallel applications are by nature resource intensive, and load balancing techniques often do not take into consideration events external to the application. This can cause disruption among other users sharing the same computer. High-Performance Computing (HPC) also presents a steep learning curve for novice programmers, often causing load balancing to be ignored entirely. This paper presents the Simple Interface for Polite Computing (SIPC), a mechanism that allows external load balancing to be easily integrated into programs where polite resource sharing is necessary. While SIPC can be used with any program, the focus here is its integration with embarrassingly parallel applications that follow a dynamic scheduling paradigm.


Introduction



Polite computing is a scheduling policy that allows intensive applications to run on a shared workstation without excessively consuming resources in the presence of other users and their applications. As high-performance computing labs are built around clustered, shared workstations, this type of policy becomes a necessity so that other programs do not starve and users can remain productive.


In his paper "Polite Parallel Computing", Cameron Rivers introduced a simple approach to solving external load balancing [8]. The algorithm was integrated into mpiBLAST and allowed the application to become aware of its surroundings, scaling back if needed to distribute computing power to all users of the node. While effective, this process overlooked vital system heuristics for determining system load more accurately and introduced unnecessary overhead with the method used to determine the load.



The purpose of SIPC was to improve Rivers' algorithm by making load checks more accurate, reducing the overhead a load check causes, and introducing an algorithm for timing the load checks on a system. Most importantly, SIPC is easy for novice programmers to use. It can be integrated with any serial application that requires politeness, but our target applications here are embarrassingly parallel programs that follow a dynamic scheduling paradigm. Communication between processors is achieved with the Message Passing Interface (MPI), a portable, language-independent application programming interface used to allow a group of computers to communicate over a network [4].


The three applications chosen to demonstrate the use of SIPC were mpiBLAST, MPI-POVRay, and a Mandelbrot Set rendering engine. mpiBLAST is a parallelized version of BLAST that segments a BLAST database and distributes it to a cluster of workstations to perform queries simultaneously [3]. MPI-POVRay is a wrapper for the ray tracing application POVRay [6]. Ray tracing is a prime candidate for parallelization because once a scene has been modeled, any number of computers can work on the solution without communication or synchronization with each other [13]. Calculating the Mandelbrot Set is, like ray tracing, an excellent candidate because once a set of points has been chosen, n pixels can be calculated on n computers, and communication only takes place with the master scheduling process. All of these applications are parallelized in a straightforward manner and require minimal message passing communication.


Related Work



The University of California, Berkeley introduced its Network of Workstations (NOW) project during the 1990s. The idea was that inexpensive commodity machines connected via high-speed network switches could form large parallel computing systems that competed with the fastest supercomputers in the world [2]. They achieved this feat on April 30, 1997, when over 10 GFLOPS was reached on the LINPACK benchmark, ranking NOW as one of the top 200 supercomputers in the world at the time [10].



As the popularity of high-performance commodity clusters grew, research began to improve the performance of the programs running on them, in particular load balancing [7]. Load balancing is a critical factor in the performance of parallel applications, but the focus is often only on internal application imbalance rather than on external collisions taking place between competing applications on a shared workstation.



The best-known approach to polite computing is the nice command found in UNIX and other POSIX-like operating systems. This command assigns a priority level to a process for the kernel's job scheduler. A level of -20 is used for the most favorable scheduling, while 19 is used for the least. A process gets the default level of 0 if no other priority is specified. This approach allows the kernel to make decisions about process scheduling and takes control away from the application developer [12].
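
For illustration only (this is not part of SIPC), a process can request less favorable scheduling for itself through the standard nice() system call; a minimal sketch:

    #include <stdio.h>
    #include <errno.h>
    #include <unistd.h>   /* nice() */

    int main(void)
    {
        /* Raise this process's nice level by 10, requesting less favorable
           scheduling. nice() returns the new level, or -1 with errno set. */
        errno = 0;
        int level = nice(10);
        if (level == -1 && errno != 0) {
            perror("nice");
            return 1;
        }
        printf("running at nice level %d\n", level);

        /* Long-running, CPU-intensive work would follow here. */
        return 0;
    }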


The Charm++ system at the University of Illinois at Urbana-Champaign, a parallel object-oriented programming framework, proposes a solution to external load balancing using object migration [1]. If a processor has a high system load, an intelligent job scheduler performs load migration by moving objects from heavily loaded processors to less saturated processors. The drawback of this approach is that the Charm++ runtime system must be used for parallel application development, which may be difficult for novice parallel programmers.

Solution


In the High-Performance Computing community, the speed of program execution is often the only measure of success. Other relevant factors, such as the time to develop the solution, the additional lines of code compared to the serial implementation, and the cost per line of code, are not considered when determining the success of a parallel development effort. Hochstein et al. show in [5] that HPC development is significantly more expensive than serial development, especially for applications that use MPI. MPI implementations often contain twice as many lines of code as their serial counterparts.


Many HPC applications are complex and understood only by a small group of domain experts who are often novice parallel programmers. Tools are needed that allow advanced aspects such as external load balancing to be injected into a parallel application, creating more efficient solutions with minimal effort from the novice programmer. SIPC was designed with this in mind as a self-contained library that requires only two function calls: initialization and load checks. The application developer must only worry about finding a safe place to call the load checking procedure. A safe location is a spot in the code where no message-passing communications are taking place at that point in execution. For example, if the master process is waiting for a worker node to send results from a computation, the worker node should first send the results to the master and then check its load level before requesting more work. If the order is reversed, the master would have to wait unnecessarily for the worker to complete its load checking routine before receiving the results. Figure 1 shows a basic code template, written in C, integrating SIPC into a parallel application that uses MPI.
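
Figure 1 itself is not reproduced here, but a minimal sketch of the pattern it describes might look like the following, assuming hypothetical SIPC entry points named sipc_init() and sipc_check_load() and a hypothetical SIPC_MED sensitivity constant:

    #include <mpi.h>
    #include "sipc.h"   /* hypothetical SIPC header: sipc_init(), sipc_check_load() */

    #define TAG_RESULT 2
    #define TAG_STOP   3

    /* Placeholder for the application-specific computation. */
    static double do_work(double x) { return x * x; }

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        sipc_init(SIPC_MED);   /* hypothetical call: MED (85%) sensitivity */

        if (rank != 0) {       /* worker process in a dynamic scheduling scheme */
            MPI_Status status;
            double work, result;
            for (;;) {
                MPI_Recv(&work, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
                if (status.MPI_TAG == TAG_STOP)
                    break;
                result = do_work(work);
                /* Send the result first so the master is never kept waiting... */
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
                /* ...then check the local load; this may sleep if the node is saturated. */
                sipc_check_load();
            }
        }
        /* Rank 0 would run the master scheduling loop, omitted here for brevity. */

        MPI_Finalize();
        return 0;
    }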


The method Rivers used in his polite implementation was to fork another mpiBLAST process on a "worker" node, and the parent process would then wait for it to execute. This child process would spawn a shell script to determine if the system load was above a certain level, and if it was, it would create a file to indicate this. The child process would then terminate, and the original process would continue execution. It would then check to see if the shell script had created a file, and sleep for a predetermined amount of time if a file was present [8]. While this method was effective for the scope of Rivers' project, other methods exist that do the same thing less intrusively. Significant overhead can arise as multiple context switches and file system accesses occur throughout execution of the program. SIPC, by contrast, performs system load checks within the host program.






In Rivers' version of mpiBLAST, the frequency of load checks is hard-coded into the application. The number of load checks was determined by performing performance analysis on the target application and deciding how often a check was warranted. This method is often beyond the skill range of a novice programmer. SIPC uses a mechanism that allows load check timing to be adjusted dynamically at runtime. It is based on the assumption that if a system is under high load at time t, then it is likely to remain under high load at some future time t + x, where x is a relatively small amount of time. Figure 2 provides a visual for this heuristic. As x increases, the probability of the system remaining under high load decreases. This transient principle allows SIPC to adjust the frequency of checks during runtime, increasing the time between them if a system has remained in a state of low saturation for a long duration. A limit is placed on how large the interval x can become to keep the scheduling system responsive to changes in load.




Implementation




A goal of the project was to obtain system information such as CPU utilization and the number of users currently logged onto the system in a way that does not create a large processing footprint. The CPU load is obtained by opening the /proc/loadavg file and retrieving the first value in it with the fopen and fscanf functions. Although it seems this method would introduce overhead by accessing the file system, loadavg is a virtual file that does not reside on the hard disk [9].
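
A minimal sketch of such a routine, assuming the hypothetical helper name read_load_average() (not the actual SIPC source), is shown below:

    #include <stdio.h>

    /* Read the first (1-minute) value from the /proc/loadavg virtual file.
       Returns the value on success, or -1.0 if the file cannot be read. */
    static double read_load_average(void)
    {
        double load = -1.0;
        FILE *fp = fopen("/proc/loadavg", "r");
        if (fp == NULL)
            return -1.0;
        if (fscanf(fp, "%lf", &load) != 1)
            load = -1.0;
        fclose(fp);
        return load;
    }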


Counting the number of users is achieved by executing the users shell command and capturing the return stream via the popen and fgets functions. The entries in the returned output are then counted, each one indicating a user logged into the system. Figure 2 shows the two complete functions for obtaining CPU utilization and counting users currently logged into the system.
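
A sketch of the user-counting routine, assuming a hypothetical helper name count_users(); note that the users command prints the logged-in names on a single space-separated line, so this sketch counts the names rather than lines:

    #include <stdio.h>
    #include <string.h>

    /* Count logged-in users by capturing the output of the `users` shell
       command through popen()/fgets(). Returns 0 if the command fails. */
    static int count_users(void)
    {
        char line[1024];
        int count = 0;
        FILE *fp = popen("users", "r");
        if (fp == NULL)
            return 0;
        if (fgets(line, sizeof line, fp) != NULL) {
            /* Each whitespace-separated name is one login session. */
            for (char *tok = strtok(line, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n"))
                count++;
        }
        pclose(fp);
        return count;
    }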


To determine if a system is under high load, three simple conditions must be met. The first condition is that at least two users are logged into the system. If only one user is logged in, execution of the program continues as normal, since there is no point in sharing resources if there is no one to share them with.


The next condition checks whether the CPU load is greater than 100 - (10 * num_users). This calculation prevents SIPC from causing an application to sleep on a semi-high load with a low number of users. For example, it is not necessary to be polite when there are only two users logged into the system and CPU utilization is at 80%. There are enough CPU resources to ensure the other user is not starved and to allow more users to log on to the system without severe performance degradation.





The last condition checks whether the value of (CPU_load / num_users + CPU_load) is greater than the high-load threshold set at the initialization of SIPC. During initialization, sensitivity values of LOW (75%), MED (85%), or HIGH (95%) can be passed to determine how the application developer wants SIPC to react to system load. The default level is MED. If this condition is met, SIPC causes the application to sleep for one second, allowing the starved users to utilize the CPU.
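
Taken together, the three conditions might be expressed as in the following sketch; the grouping of the third expression follows the text above, and the exact arithmetic used inside SIPC is an assumption here:

    /* Returns 1 if the node should be treated as highly loaded, 0 otherwise.
       cpu_load is CPU utilization in percent, num_users the number of users
       logged in, and threshold the LOW (75), MED (85), or HIGH (95) setting. */
    static int system_is_loaded(double cpu_load, int num_users, double threshold)
    {
        /* Condition 1: no need to be polite if nobody else is logged in. */
        if (num_users < 2)
            return 0;

        /* Condition 2: ignore semi-high load when few users are present. */
        if (cpu_load <= 100.0 - 10.0 * num_users)
            return 0;

        /* Condition 3: compare the combined metric against the sensitivity
           threshold chosen at initialization (grouping as in the text). */
        return (cpu_load / num_users + cpu_load) > threshold;
    }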




If the target application sleeps due to high system load, the timing mechanism used to schedule load checks is reset so that another check will occur again soon. If it does not sleep, the duration between load checks is doubled. This behavior is based on the transient principle of system load described above.
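
This adjustment can be sketched as a simple doubling scheme with a reset and a cap; the base interval and cap values below are illustrative assumptions rather than SIPC's actual constants:

    #include <unistd.h>   /* sleep() */

    #define BASE_INTERVAL   1u   /* seconds between checks after a sleep (assumed value) */
    #define MAX_INTERVAL   64u   /* cap keeps the scheduler responsive to load changes (assumed) */

    static unsigned int check_interval = BASE_INTERVAL;

    /* Adjust the delay until the next load check. `loaded` is the result of
       the high-load test; if the node was loaded we sleep for one second and
       reset the interval, otherwise the interval doubles up to MAX_INTERVAL. */
    static void adjust_check_interval(int loaded)
    {
        if (loaded) {
            sleep(1);                        /* yield the CPU to other users */
            check_interval = BASE_INTERVAL;  /* high load is transient: check again soon */
        } else if (check_interval * 2 <= MAX_INTERVAL) {
            check_interval *= 2;             /* back off while the node stays lightly loaded */
        }
    }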


Analysis



The inclusion of SIPC in a host application proved to have very little, if any, overhead. SIPC increased execution time by 1% in mpiBLAST and MPI-POVRay, while MPI-Mandelbrot with SIPC actually executed faster than the original application. This small improvement might be the result of better instruction cache, data cache, or virtual memory performance due to the slightly different set of program instructions introduced by the SIPC load checks. While this result was unexpected, the improvement is minimal and further investigation is not warranted at the moment.


Figure 3 shows the execution times for each polite variation (Rivers' method, SIPC Inline, SIPC Library) in each target application when no politeness occurs. These values indicate the overhead that the load checks cause compared to the execution time of the original, unmodified application. The times are the average execution time of each application over several hundred runs.


A phenomenon was encountered during execution of an application with Rivers' implementation of politeness. The mechanism did not take into consideration the number of users logged on the system. This in turn could cause a worker node to "be polite" under high load when it was not really necessary. This event caused a chain reaction of politeness throughout the rest of the nodes, leading to severe performance degradation of the application. This pitfall was eliminated in the design of SIPC by ignoring system load if only one user is present on a system, as mentioned in the section above.


Conclusion


The development of SIPC successfully met three goals: a self-contained library easily utilized by novice programmers, lower overhead than Rivers' implementation, and a mechanism that allows dynamic scheduling of system load checks. This work represents a beginning in the development of tools designed to improve the efficiency of code written by beginning HPC programmers. More accessible tools will allow an easier understanding of the concepts behind advanced techniques used in parallel programming. Future work for SIPC could include developing a tool that would analyze code and locate a safe spot to place load checking procedure calls. This would aid novice developers in finding a place in the program where load balancing routines would not interfere with node synchronization. Other work could go into porting SIPC to operating systems beyond Linux/UNIX environments, allowing reliable cross-platform capabilities.




References


[1] Brunner, Robert K., and Laxmikant V. Kale. "Adapting to Load on Workstation Clusters." The 7th Symposium on the Frontiers of Massively Parallel Computation (1999): 106.


[2] Culler, D., et al. "Parallel Computing on the Berkeley NOW." 9th Joint Symposium on Parallel Processing, Kobe, Japan, 1997.


[3] Darling, A., L. Carey, and W. Feng. "The Design, Implementation, and Evaluation of mpiBLAST." 4th International Conference on Linux Clusters (2003). <http://www.mpiblast.org/downloads/pubs/cwce03.pdf>.


[4] Gropp, W., E. Lusk, N. Doss, and A. Skjellum. "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard." Parallel Computing, North-Holland, vol. 22, pp. 789-828, 1996.


[5] Hochstein, Lorin, Jeff Carver, Forrest Shull, Sima Asgari, Victor Basili, Jeffrey K. Hollingsworth, and Marvin V. Zelkowitz. "Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers." Proceedings of the 2005 ACM/IEEE Conference on Supercomputing (2005).


[6] "MPI-POVRay." 14 Nov. 2006 <http://www.verrall.demon.co.uk/mpipov/>.


[7] "PPL: Load Balancing." Parallel Programming Laboratory. University of Illinois at Urbana-Champaign. 30 Mar. 2007 <http://charm.cs.uiuc.edu/research/ldbal/>.


[8] Rivers, Cameron. "Polite Parallel Computing." Journal of Computing Sciences in Colleges 21 (2006): 190-195.


[9] Silberschatz, Abraham, and Peter B. Galvin. Operating System Concepts. Reading: Addison Wesley, 1998.


[10] "The Berkeley NOW Project." UC-Berkeley. 30 Mar. 2007 <http://now.cs.berkeley.edu/>.


[11] "The Mandelbrot Set." 30 Mar. 2007 <http://www.cs.mu.oz.au/~sarana/mandelbrot_webpage/mandelbrot/mandelbrot.html>.


[12] "UNIX Man Pages: nice()." 14 Mar. 2007 <http://unixhelp.ed.ac.uk/CGI/man-cgi?nice>.


[13] Wilt, Nicholas. Object-Oriented Ray Tracing in C++. New York: John Wiley & Sons, Inc., 1994.