Programming models, environments and languages

Seminar Cluster Computing, SS2005


Gordana Stojceska

Technische Universität München, Institut für Informatik
Boltzmannstraße 3, 85748 Garching bei München
stojcesk@in.tum.de


Abstract. While low-cost hardware has accelerated the use of clusters for industrial, scientific and commercial high performance computing, it is the software that has both enabled their utility and restrained their usability. The rapid advances in hardware performance have propelled clusters to the forefront of next-generation high performance computer systems, but equally important has been the evolving capability and maturity of the supporting software systems and tools. The result is a single global system environment that is converging on that of the previous generation of supercomputers and MPPs (Massively Parallel Processors). Like these predecessors, clusters offer opportunities for further research and development in programming tools and resource management software in order to enhance their applicability, availability, scalability and usability. This paper presents a software point of view on clusters. The main discussion concerns Software Distributed Shared Memory systems and one of their best-known problems: the atomic page update problem.

Keywords: Operating systems, programming models, distributed shared memory, atomic page update problem

1 Introduction


The development of software components is essential for the success of any type of parallel system, and that includes clusters as well. Improvements are appearing at an accelerated rate, and the software is quickly approaching the stable, sophisticated level of functionality that will establish clusters as the long-term solution to high-end computation.


The main goal in building a cluster of computers is to speed up, or in general to reduce, the computation time wherever that is essential, for example in large scientific computations. Therefore, from the very beginning one needs to take care of parallelization in every sense. Analyzing Amdahl's Law, which, briefly explained, says that the total execution time on a parallel machine is the sum of the parallelized part and the serial part, one can draw some conclusions. Although Amdahl's equation looks quite simple, it has a very interesting implication: the most that one can speed up a task using a cluster of processors, no matter how many of them are used, is a factor of a relatively small number. For example, if the number of processors gets extremely large (theoretically, infinitely large) and 5% of the task is irreducibly serial, the highest speedup one can achieve is a factor of 20.
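For reference, Amdahl's Law can be written as follows (a standard textbook formulation; the symbols are introduced here and are not taken from the paper's references). The 5% example above follows directly:

% s: irreducibly serial fraction of the work, N: number of processors,
% T(1): execution time on a single processor.
\[
  T(N) = T(1)\Bigl(s + \frac{1-s}{N}\Bigr),
  \qquad
  \mathrm{Speedup}(N) = \frac{T(1)}{T(N)} = \frac{1}{s + (1-s)/N}.
\]
% As N grows without bound, the speedup is bounded above by 1/s;
% with s = 0.05 (5% serial) the bound is 1/0.05 = 20.
\[
  \lim_{N\to\infty} \mathrm{Speedup}(N) = \frac{1}{s} = \frac{1}{0.05} = 20.
\]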
But to achieve the best possible speedup using clusters, one should also be aware of the general point of Amdahl's Law, sometimes referred to as the "game of elimination":



- The hardware can be adequately parallel, but that does not do any good if the operating system is not.
- Both the hardware and the operating system can be configured in the most parallelizable way, but that does not do any good if the middleware (databases, communication, etc.) is not.
- All of the above can be perfectly configured to act in parallel, but if the application is not, one loses again.
- If the hardware, the OS, the middleware and the application all have low serial content, then one has a good chance of getting the best out of clusters.

Thus, this paper discusses, more or less, all of these points. First, there is an overview of the most prominent state-of-the-art operating systems in use today, together with their capabilities, performance, advantages and/or disadvantages, and some possible improvements. The next important topic is the programming model, that is, how programs are to be parallelized and which mechanisms should be used, depending on the hardware architecture. The main discussion is about Distributed Shared Memory systems and one of their best-known problems: the atomic page update problem. Several solutions to this problem are presented, together with a comparison between them. At the end, a short introduction to OpenMP is given.


2 Operating Systems for Clusters

2.1 Basic characteristics

The ideal operating system, in general as well as for clusters, should always help as much as possible and never hinder the user. That means it should help the user (who in this case is almost always an application or middleware designer) to set up the system for optimal program execution. This can be done by supplying a consistent and well-targeted set of functions offering access to as many system resources as possible. After the environment has been configured, it would be nice if the system stayed out of the user's way, avoiding any time-consuming context switches. How different one chooses to make the operating system depends on one's view of clustering. On the one hand, there are those who argue that each node of a cluster must contain a full-featured operating system such as Unix, with all the positives and negatives that implies. At the other extreme, there are researchers asking the question, "Just how much can I remove from the OS and have it still be useful?"

Which attributes a cluster operating system should have is a common question. The ones mentioned most often in the research community are the following: manageability, stability, performance, extensibility, scalability, support and heterogeneity. It should be noted that experience shows these attributes may be mutually exclusive. For example, supplying an SSI (Single System Image) at the operating system level, while a definite boon in terms of manageability, drastically inhibits scalability. Another example is the availability of the source code in conjunction with the possibility to extend (and thus modify) the operating system on this basis. This property has a negative influence on the stability and manageability of the system: over time, many variants of the operating system will develop, and the different extensions may conflict when there is no single supplier.


2.2 Current State-of-the-Art

State-of-the-art can be interpreted to mean the most common solution as well as the best solution using today's technology. By far the most common solution for current clusters is running a conventional operating system with little or no special modification. This operating system is usually a Unix derivative, although NT clusters are becoming more common.

2.2.1 Linux

One of the most popular operating systems used for clusters in the public research community is Linux [1], because it is released as truly open source and it is cheap in terms of both the required software and hardware platforms. It offers all the functionality that is expected from a standard UNIX OS, and it is developing fast, as missing functionality can be implemented by anyone who needs it. However, these solutions are usually not as thoroughly tested as releases for commercial UNIX variants. This requires frequent updates, which do not ease the administrators' job of creating a stable system, as is required in commercial environments. For scientific and research applications, however, the Linux concept is a great success, as can be seen from the well-known Beowulf project [2], which is actually a collection of tools, middleware libraries, network drivers and kernel extensions that enable a cluster for specific HPC purposes.

This has led to a big and growing user community developing a great number of tools and environments to control and manage clusters, most of which are available for free. However, it is often hard to configure and integrate these solutions to suit one's own, special setup, as not all Linux OSs are created equal: the different distributions are diverging more and more in the way the complete system is designed. Even the common base, the kernel, is in danger, as solution providers like TurboLinux [1] start to integrate advanced features into the kernel.

In general, it has to be noted that the support for Linux by commercial hardware and software vendors is continuously improving, so it can no longer be ignored as a relevant computing platform.

2.2.2 Windows NT

Although NT contains consistent solutions for local system administration in a workstation or small-server configuration (consistent API, intuitive GUI-based management, registry database), it lacks several characteristics that are required for efficient use in large or clustered servers. The most prominent are standardized remote access and administration, SMP scaling in terms of resource limits and performance, dynamic reconfiguration, high availability, and clustering with more than a few nodes.

Through Microsoft's omnipresence and market power, the support for NT by the vendors of interconnect hardware and by tool developers is good, and therefore a number of research projects in clustering for scientific computing are based on Windows NT and Windows 2000 [4]. Microsoft itself is working hard to extend the required functionality, and the situation will surely improve.

2.2.3 AIX

IBM's AIX operating system, running on the SP series of clusters, is surely one of the most advanced solutions for commercial and scientific cluster computing. It has proven its scalability and stability for several years and across a broad number of applications in both areas. However, it is a closed system, and as such its development lies almost entirely in the hands of IBM, which, however, has enough resources to create solutions in hardware and software, such as HA (High Availability). Research from outside the IBM laboratories is very limited, though.

2.2.4 Solaris

Somehow, Solaris can be seen as a compromise or merge of the three systems described above: it is still not an open system, which ensures stability at a commercial level and a truly identical interface for administrators and programmers on every installation of the supported platforms. With Sun's recent decision to make Solaris an open source project [20], the interest in easy kernel extensions or modifications may grow with time. Solaris offers a lot of the functionality required for commercial, enterprise-scale cluster-based computing, such as excellent dynamic reconfiguration and fail-over, and it also offers leading inter- and intra-node scalability for both scientific and commercial clustering. Its support by commercial vendors is better for high-end equipment, and the available software solutions are also directed towards a commercial clientele. However, Solaris can be run on the same low-cost off-the-shelf hardware as Linux as well as on the original Sun Sparc-based equipment, and support for relevant clustering hardware such as interconnects (Myrinet [21], SCI) is given. Software solutions generated by the Linux community are generally portable to the Solaris platform with little or no effort.

2.2.5 Puma OS

The Puma operating system [19], from Sandia National Labs and the University of New Mexico, represents the ideological opposite of Solaris. Puma takes a true minimalist approach: there is no sharing between nodes, and there is not even a file system or demand-paged virtual memory. This is because Puma runs on the "compute partition" of the Intel Paragon and Tflops/s machines, while a full-featured OS (e.g. Intel's TflopsOS or Linux) runs on the service and I/O partitions. The compute partition is focused on high-speed computation, and Puma supplies low-latency, high-bandwidth communication through its Portals mechanism.

2.2.6 Mosix

MOSIX [22] is a set of kernel extensions for Linux that provides support for seamless process migration. Under MOSIX, a user can launch jobs on their home node, and the system will automatically load balance the cluster and migrate the jobs to lightly loaded nodes. MOSIX maintains a single process space, so the user can still track the status of their migrated jobs. MOSIX offers a number of different modes in which available nodes form a cluster and in which jobs are submitted and migrated, ranging from a closed batch-controlled system to an open network-of-workstations-like configuration. MOSIX is a mature system, growing out of the MOS project and having been implemented for seven different operating systems/architectures.

openMosix [23] is a Linux kernel extension for single-system image clustering. This kernel extension turns a network of ordinary computers into a supercomputer for Linux applications.


3 Programming Models

A programming model is the architecture of a computer system, both hardware and system software, above the level that is hidden by traditional high-level languages. It is an application's high-level view of the system on which it is running [3]. Because there exists only one serial (non-parallel) programming model, it is commonly understood that the term "parallel programming model" actually means "programming model". Various programming models are known today, but only very few of them are really used. Here, the following programming models should be mentioned:

1. the uniprocessor model, otherwise known as the von Neumann model,
2. the symmetric multiprocessor model, also known as the shared-memory model,
3. the message passing model (mainly used in clusters),
4. the cc-NUMA (cache-coherent Non-Uniform Memory Access) model.

There are numerous discussions in the scientific community about which of these programming models is the best one to use. Thus the question of the importance of programming models should be answered. They are in fact extremely important for the following reasons: first, since some are easier to use than others, they have a very big influence on how difficult it is to write a program; second, there is the issue of portability: a program written using one model can be extremely difficult (or extremely easy) to move to a computer system implementing another model. The most commonly used programming models nowadays are the shared-memory (or distributed shared memory) models as well as the message passing programming model.


4 Distributed Shared Memory on Clusters

4.1 Distributed Shared Memory (DSM) System

A DSM system logically implements the shared-memory model on a physically distributed-memory system. System designers can implement the specific mechanism for achieving the shared-memory abstraction either in hardware or in software, in a variety of ways. The DSM system hides the remote communication mechanism from the application writer, preserving the programming ease and portability typical of shared-memory systems. DSM systems allow relatively easy modification and efficient execution of existing shared-memory applications, which preserves software investments while optimizing performance. In addition, the scalability and cost-effectiveness of the underlying distributed-memory systems are also inherited. Consequently, DSM systems offer a viable choice for building efficient, large-scale multiprocessors. The DSM model's ability to provide a transparent interface and a convenient programming environment for distributed and parallel applications has made it the focus of numerous research efforts in recent years. Current DSM system research [24] focuses on the development of general approaches that minimize the average access time to shared data while maintaining data consistency. Some solutions implement a specific software layer on top of existing message-passing systems. Others extend strategies applied in shared-memory multiprocessors with private caches to multilevel memory systems [14].

Although the shared memory abstraction is gaining ground as a programming abstraction for parallel computing, the main platforms that support it, small-scale symmetric multiprocessors (SMPs) and hardware cache-coherent distributed shared memory (DSM) systems, seem to lie inherently at the extremes of the cost-performance spectrum for parallel systems. For this reason, there also exist shared virtual memory (SVM) clusters [16] that try to bridge this gap; one study examines how application performance scales on a state-of-the-art shared virtual memory cluster [15]. The results are:

- The level of application restructuring needed is quite high compared to applications that perform well on a DSM system of the same scale, and larger problem sizes are needed for good performance.
- However, surprisingly, SVM performs quite well for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end DSM system at the same scale, and often much more.

SMPs, however, do not scale to large numbers of processors, and DSM systems, although they can be scaled down to smaller numbers of processors, are still very costly compared to SMPs and clusters.

4.2 Software Distributed Shared Memory (SDSM) System

A Software Distributed Shared Memory (SDSM) system provides the shared memory abstraction over the physically distributed memory of a cluster. Programming with SDSM is considered to be easier than with a message-passing model. The goal of much research is to investigate how to build efficient SDSM on clusters. One such example is KAIST DSM [25]. The team designed and implemented several versions of SDSM, focusing on communication layers, consistency models and high availability. In addition, SDSM has been attractive for large clusters due to its high performance and scalability. However, the probability of failures increases as the system size grows. Thus, they also designed and implemented an SDSM with fault tolerance capabilities. It guarantees a bounded recovery time and ensures no domino effect.

4.3 Memory management for multi-threaded software DSM systems

Software distributed shared memory (SDSM) systems have been powerful platforms for providing a shared address space on distributed memory architectures. The first generation of SDSM systems like IVY [5], Midway [6], Munin [7], and TreadMarks [8] assume uniprocessor nodes and thus allow only one thread per process on a node. Currently, commodity off-the-shelf microprocessors and network components are widely used as building blocks for parallel computers. This trend has made cluster systems consisting of symmetric multiprocessors (SMPs) attractive platforms for high performance computing. However, the early single-threaded SDSM systems are too restricted to exploit the multiple processors in SMP clusters. The next-generation SDSM systems like Quarks [9], Brazos [10], DSM-Threads [11], and Murks [12] are aware of multiprocessors and exploit them by means of multiple processes or multiple threads. In general, naive process-based systems experience high context switching overhead and additional inter-process communication delay within a node.


5 The Atomic Page Update Problem

5.1 Definition of the problem

The conventional fault-handling process used in the implementation of single-node DSM systems is no longer useful as a concept in multithreaded environments, because other threads may try to access the same page during the update period. The SDSM system is in doubt when multiple threads try to access an invalid page within a short interval. On the first access to an invalid page, the system must set the page writable in order to replace it with a valid copy. Unfortunately, this change also allows other application threads to access the same page freely. This phenomenon is known as the atomic page update and change right problem [11] or the mmap() race condition [12]. In short, it is known as the atomic page update problem.
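To make the update window concrete, the following is a minimal, hedged sketch (the structure and names are illustrative assumptions, not code from any of the cited systems) of the conventional single-node fault handling: the SIGSEGV handler makes the faulting page accessible and then fills it with valid contents. In a multithreaded process, other threads can already touch the page between the mprotect() call and the end of the copy, which is exactly the window described above.

/* Illustrative sketch only: conventional single-node DSM fault handling.
 * A protected page is made writable inside the SIGSEGV handler and then
 * "fetched" (here: filled with a string). Between the mprotect() call and
 * the end of the copy, another application thread could already read the
 * half-updated page: the atomic page update problem. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static long  page_size;

static void fault_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    /* Make the faulting page accessible ... */
    mprotect(si->si_addr, page_size, PROT_READ | PROT_WRITE);
    /* ... and install the valid contents. Other threads may run here. */
    memcpy(page, "valid page contents", sizeof("valid page contents"));
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    printf("first byte: %c\n", page[0]);  /* first access triggers the fault path */
    return 0;
}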

5.2 Solutions

A well-known solution to this problem, adopted by major multithreaded SDSM systems (like TreadMarks [8] and Brazos [10]), is to map a file to two different virtual addresses. Even though the file mapping method achieves good performance on some systems, file mapping is not always the best solution. This observation gave researchers the idea that other solutions to the atomic page update problem have to be found. Moreover, file mapping has a high initialization cost and reduces the available address space, because the SDSM and the application share the same address space.
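As an illustration of the file mapping technique just described (a hedged sketch under my own assumptions, not code taken from TreadMarks or Brazos; the backing file name is arbitrary), the same file can be mapped at two virtual addresses with different protections, and an update made through the writable mapping becomes visible through the read-only one:

/* Sketch: map one file at two different virtual addresses. The SDSM
 * runtime would write through the read-write mapping, while the
 * application only ever uses the second, read-only mapping. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long psize = sysconf(_SC_PAGESIZE);
    int fd = open("/tmp/sdsm_backing_file", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, psize);                  /* one page of backing store */

    /* "System" view: writable. */
    char *sys_view = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    /* "Application" view: the same file page at another address, read-only. */
    char *app_view = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);

    strcpy(sys_view, "page updated through the system mapping");
    printf("application view sees: %s\n", app_view);

    munmap(sys_view, psize);
    munmap(app_view, psize);
    close(fd);
    return 0;
}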

A general solution to this problem is the following: separate the application address space from the system address space for the same physical memory, and assign different access permissions to each address space. Since the virtual memory protection mechanism is implemented in the per-process page table, different virtual addresses (pages) can have different access permissions even though they refer to the same physical page. The system can then guarantee an atomic page update by changing the access permission of a virtual page in the application address space only after it has completed the page update through the system address space.

Besides this general solution, other solutions have also been proposed. The following three are presented here [17]:

1. System V shared memory. A physical page is mapped to different virtual addresses using System V shared memory IPC. The shmget() system call enables a process to create a shared memory object in the kernel, and the shmat() system call enables the process to attach the object to its address space. A process can attach the shared memory object to its address space more than once, and a different virtual address is assigned to each attachment. This is very cheap compared to file mapping (see the first sketch after this list).

2. mdup() system call method. The basic mechanism of mdup() is to allocate new page table entries and to copy the page table entries of the anonymous memory to the new ones. The reasons for using anonymous memory are the following: (1) no initialization step is required, (2) there is no size limit, and (3) the memory region is released automatically at program termination.

3. fork() system call. When a process forks a child process, the child process inherits the execution image of the parent process. In particular, the content of the child process's page table is copied from that of the parent process. The parent process creates shared memory regions and forks a child process. They have independent access paths even though they use the same virtual address to access the same physical page. The parent process can execute the application and the child process can perform the memory consistency mechanisms. Hence, the SDSM system can successfully update the shared memory region in a thread-safe way through the child process's address space (see the second sketch after this list).
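The following two sketches illustrate methods 1 and 3 from the list above; both are simplified assumptions of mine rather than the implementations evaluated in [17]. First, the System V variant: the same segment is attached twice, once writable for the runtime and once read-only for the application.

/* Sketch of method 1: attach one System V shared memory segment at two
 * virtual addresses with different permissions. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    char *sys_addr = shmat(id, NULL, 0);           /* runtime: read-write */
    char *app_addr = shmat(id, NULL, SHM_RDONLY);  /* application: read-only */

    strcpy(sys_addr, "updated through the system attachment");
    printf("application attachment sees: %s\n", app_addr);

    shmdt(sys_addr);
    shmdt(app_addr);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}

Second, the fork() variant: parent and child share the same physical pages at the same virtual address, but each process has its own page table entries, so the parent can drop write access on its side while the child performs the update on its behalf.

/* Sketch of method 3: the child acts as the "consistency" process and
 * keeps write access; the parent (the application side) does not. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    long psize = sysconf(_SC_PAGESIZE);
    /* Shared anonymous mapping, inherited by the child across fork(). */
    char *region = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: keeps the read-write page table entry and updates the page. */
        strcpy(region, "page updated by the child process");
        _exit(0);
    }

    /* Parent: removing write access here changes only the parent's own
     * page table entry, not the child's. */
    mprotect(region, psize, PROT_READ);
    waitpid(pid, NULL, 0);
    printf("parent reads: %s\n", region);
    return 0;
}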

Experiments on a Linux-based cluster and on an IBM SP2 [13] machine showed that the three proposed methods overcome the drawbacks of the file mapping method, such as its high initialization cost and buffer cache flushing overhead. In particular, the method using the fork() system call is portable and preserves the whole address space for the application, whereas the others can use only half of the virtual address space. The System V shared memory method shows low initialization cost and runtime overhead, and the new mdup() system call method has the least coding overhead in the application code. Not all of the methods can be implemented on a given SMP cluster system, due to limitations of the operating system, as observed on the IBM SP system.


6 OpenMP

OpenMP [18] is an industry-standard API for programming shared memory computers. It is available on most, if not all, commercially available shared memory computers.

With OpenMP, one directs the compiler to create multi-threaded blocks of code by adding compiler directives to the program. It is easy to use and in many cases supports the incremental addition of parallelism to a program.
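A minimal example of this directive style (an illustrative sketch, not taken from the paper or the OpenMP specification): the pragma asks the compiler to distribute the loop iterations over the threads of a shared memory node, and the reduction clause combines the per-thread partial sums; compile with, e.g., gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* The loop iterations are divided among the available threads;
     * each thread accumulates a private partial sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        sum += 1.0 / i;
    }

    printf("threads: %d, harmonic sum: %f\n", omp_get_max_threads(), sum);
    return 0;
}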

So why should a shared memory API such as OpenMP be of interest to the cluster computing community? There are two reasons. First, many clusters are built from shared memory nodes. OpenMP can be used to exploit parallelism within a node while a distributed memory API is used between nodes. Second, OpenMP is evolving to address non-uniform memory architecture (NUMA) computers [19]. A cluster running some sort of distributed shared memory (DSM) is an extreme case of NUMA. Hence, one can program a cluster using OpenMP.


7 Future Work

This paper is just one drop in the ocean called software for cluster computing. There are many interesting research areas in this field that grow with each day. Open source projects are becoming more and more popular in the scientific community in general, which means that the people involved in such projects have the freedom to realize their research visions. There is a lot of work and progress still to be done on system software and operating systems, as well as on different ways of solving the problems of DSM and SDSM systems.


8 Bibliography

[1] Information about Linux, http://www.linux.org/, http://www.linux.org/info/index.html
[2] Beowulf Project, http://www.beowulf.org
[3] Gregory Pfister, "In Search of Clusters"
[4] http://www.microsoft.com/windows2000/techinfo/howitworks/cluster/introcluster.asp
[5] L. Kai, IVY: a shared virtual memory system for parallel computing, International Conference on Parallel Processing (1988) 94-101.
[6] B.N. Bershad, M.J. Zekauskas, W.A. Sawdon, The Midway distributed shared memory system, IEEE International Computer Conference, February 1993, pp. 528-537.
[7] J.K. Bennett, J.B. Carter, W. Zwaenepoel, Munin: distributed shared memory based on type-specific memory coherence, Principles and Practice of Parallel Programming (1990) 168-176.
[8] C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel, TreadMarks: shared memory computing on networks of workstations, IEEE Computer 29 (2) (1996) 18-28.
[9] D.R. Khandekar, Quarks: distributed shared memory as a basic building block for complex parallel and distributed systems, Master's Thesis, University of Utah, 1996.
[10] E. Speight, J.K. Bennett, Brazos: a third generation DSM system, USENIX Windows NT Workshop, August 1997, pp. 95-106.
[11] F. Mueller, Distributed shared-memory threads: DSM-Threads, Workshop on Run-Time Systems for Parallel Programming, April 1997, pp. 31-40.
[12] M. Pizka, C. Rehn, Murks - a POSIX threads based DSM system, in: Proceedings of the International Conference on Parallel and Distributed Computing Systems, 2001.
[13] http://www.tc.cornell.edu/~slantz/what_is_sp2.html
[14] J. Protic, M. Tomasevic, V. Milutinovic, "Distributed Shared Memory: Concepts and Systems," IEEE Parallel and Distributed Technology, vol. 4, pp. 63-79, Summer 1996.
[15] Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh, "Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems"
[16] Home-based SVM Protocols for SVM Clusters: Design, Implementation and Performance, http://www.cs.princeton.edu/research/techreps/TR-548-97
[17] Yang-Suk Kee, Jin-Soo Kim, Soonhoi Ha, "Memory management for multi-threaded software DSM systems," www.elsevier.com/locate/parco
[18] Online information about OpenMP, http://www.openmp.org/
[19] OpenMP on NUMA, http://www2.cs.uh.edu/~hpctools/openmpOngoing.shtml
[20] OpenSolaris, http://www.adtmag.com/article.asp?id=10519
[21] Myrinet, http://www.myricom.com/myrinet/overview/
[22] MOSIX, http://www.mosix.org/, http://www.ospueblo.com/mosix.shtml
[23] Information about openMosix, http://openmosix.sourceforge.net
[24] International Workshop on DSM, http://perso.ens-lyon.fr/laurent.lefevre/dsm2005/
[25] KAIST Project, http://camars.kaist.ac.kr/~nrl/team/dsm.html