Implementing the MPI Standard

CSE 6306 Advanced Operating Systems

Research Paper

The University of Texas at Arlington

Satya Sagar Salapaka

06 April 1999



Message passing is a programming paradigm used widely on parallel computers, especially
Scalable Parallel Computers (SPCs) with distributed memory, and on Networks of Workstations (NOWs).
In a remarkably short time, the Message Passing Interface (MPI) has become what it was intended to be:
a de facto standard Application Programming Interface (API) for message-passing parallel programming.
Beginning in 1992, a group of researchers, academicians, and vendors met to quickly create a portable,
efficient, and full-featured standard. All meetings and discussions of this Message Passing Interface
Forum were open and accessible over the Internet. They met their goals: the first version of the
standard was released in May 1994 and revised in June 1995, its second version was released in July
1997, and MPI has become the API of choice for most new parallel programming projects. The focus of
this paper is the problems, issues, and solutions that pertain to implementing the MPI standard on
several representative computer architectures and operating systems. This paper is based on the work
done by John M. Linebarger of The University of New Mexico.


Reasons for MPI were legion. Existing message-passing libraries were mostly vendor-specific and
restricted to a limited range of platforms. Those that weren’t were usually research-oriented, and thus
not designed to be a production standard or even necessarily backward compatible with earlier releases.
The feature sets of message-passing libraries were often incomplete; more annoying still, the gaps were
often in different areas between libraries.

MPI addressed this situation by providing a feature-rich, portable message-passing standard that
would allow for efficient implementation on a wide variety of platforms. Both point-to-point and
collective communication are supported by MPI, as are user-defined data types, virtual process
topologies, and ANSI C and Fortran 77 language bindings. Multiple architectures and programming models
are embraced by the MPI standard, including the SPMD and MPMD parallel programming models, distributed
and shared memory architectures, and networks of heterogeneous workstations. A rich variety of
communication modes is provided for sending messages, each of which exists in both blocking and
non-blocking versions. A particular contribution of MPI is the concept of a communicator, which is an
abstraction that combines communication context and process group for better security and separation
in message passing.



Numerous implementations of MPI exist, both public domain and commercial. The most important
of the public domain implementations is MPICH (MPI/CHameleon), developed by Argonne National
Laboratory. Other portable public domain implementations include Local Area Multicomputer (LAM) from
the Ohio Supercomputer Center and Common High-level Interface to Message Passing (CHIMP/MPI) from
the Edinburgh Parallel Computing Centre (EPCC). Commercial implementations are available from several
supercomputer and workstation vendors, including IBM, Cray Research, Intel, Silicon Graphics, and
Hewlett-Packard.

Figure 1: A Conceptual View of MPI Implementation

As depicted above, the conceptual goal of implementation is to map MPI to the machine. In
general, the MPI standard and the facilities provided by the target platform represent fixed, given
quantities; the job of the implementor is to make the best of what is provided (in terms of operating
system and hardware features) in order to fulfill the requirements of the MPI standard. The most
efficient hardware and software primitives should be used to directly implement corresponding MPI
calls, emulating in software those functions for which the vendor does not directly provide support.

In practical terms, however, most MPI implementations to date have been ports of the reference
implementation, MPICH. The implementations surveyed for this paper fall into three categories: ports
of MPICH; MPI interfaces added to existing message-passing systems; and “roll your own from scratch”
MPI implementations.


Figure 2: MPICH Implementation Architecture

Since implementations of MPICH are so pervasive, a look at its architecture is in order. As
indicated in the figure above, MPICH takes a two-layer approach to implementing MPI. All MPI functions
are contained in the upper, device-independent layer. This layer accesses the lower, device-dependent
layer through the Abstract Device Interface (ADI), which is implemented as compiler macros. The ADI
provides four main functions: sending and receiving, data transfer, queuing, and device-dependent
operations. The job of the implementor is thus to tailor the lower, device-dependent layer to the
target machine; the upper, device-independent layer remains virtually unchanged.
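
The two-layer split can be illustrated with a small sketch. It is not MPICH code: MPICH realizes the
ADI as compiler macros in C, whereas this sketch uses a Python abstract class, and every name in it
(AbstractDevice, LoopbackDevice, Endpoint, post_send, post_recv) is hypothetical. It only shows how an
upper layer written against a narrow interface stays unchanged when the device underneath is swapped.

```python
from abc import ABC, abstractmethod


class AbstractDevice(ABC):
    """Hypothetical stand-in for the ADI: the only surface the upper layer may call."""

    @abstractmethod
    def post_send(self, source, dest, tag, payload):
        """Hand a message to the device for delivery."""

    @abstractmethod
    def post_recv(self, dest, source, tag):
        """Return a matching payload for `dest`, or None if nothing matches yet."""


class LoopbackDevice(AbstractDevice):
    """Toy device-dependent layer: in-memory lists keyed by destination rank."""

    def __init__(self):
        self._pending = {}  # destination rank -> list of (source, tag, payload)

    def post_send(self, source, dest, tag, payload):
        self._pending.setdefault(dest, []).append((source, tag, bytes(payload)))

    def post_recv(self, dest, source, tag):
        queue = self._pending.get(dest, [])
        for index, (src, msg_tag, payload) in enumerate(queue):
            if src == source and msg_tag == tag:
                queue.pop(index)
                return payload
        return None


class Endpoint:
    """Device-independent layer: MPI-like calls written only against AbstractDevice."""

    def __init__(self, device, rank):
        self.device = device
        self.rank = rank

    def send(self, dest, tag, payload):
        self.device.post_send(self.rank, dest, tag, payload)

    def recv(self, source, tag):
        return self.device.post_recv(self.rank, source, tag)
```

Porting in this picture means writing another subclass of AbstractDevice for the target machine;
Endpoint would not need to change.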

Since one of the goals of MPICH is to demonstrate the efficiency of MPI, several optimizations are
included. One of them is optimization by message length. Four send protocols are supported. The short
send protocol piggybacks the message inside of the message envelope. The eager send protocol
delivers the message data without waiting for the receiver to request it, on the assumption that the
probability is high that the receiver will accept the message. The rendezvous protocol doesn’t deliver
the data until the receiver explicitly requests it, thus allowing the setup time necessary to send
large messages with high bandwidth. And the get protocol is used for shared memory implementations,
where the receiver utilizes shared memory operations to receive the message.
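
A minimal sketch of selection by message length follows. The two thresholds are invented for
illustration (the paper gives no cutoff values, and real implementations tune them per platform), and
treating every shared-memory transfer as a "get" regardless of size is likewise an assumption of this
sketch.

```python
SHORT_LIMIT = 1024          # hypothetical cutoff in bytes: payload fits in the envelope
EAGER_LIMIT = 128 * 1024    # hypothetical cutoff in bytes: still worth sending unasked


def choose_protocol(nbytes, shared_memory=False):
    """Pick one of the four send protocols described above for a message of nbytes."""
    if nbytes < 0:
        raise ValueError("message length cannot be negative")
    if shared_memory:
        return "get"          # receiver pulls the data with shared memory operations
    if nbytes <= SHORT_LIMIT:
        return "short"        # data piggybacks inside the message envelope
    if nbytes <= EAGER_LIMIT:
        return "eager"        # data is sent before the receiver asks for it
    return "rendezvous"       # data waits until the receiver explicitly requests it
```

For example, choose_protocol(16) selects the short protocol, while a 200,000-byte message between
distributed-memory nodes falls through to rendezvous.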


Every MPI implementor wrestles with a number of common issues. Chief among them is
message buffering, because of the enormous leverage it has on overall performance. The goal is to
reduce (or even eliminate) the number of intermediate buffers required to transmit a message, thus
avoiding the significant overhead of memory-to-memory copies. Related issues are optimization by
message length, the mechanism to use to handle unexpected messages, and the tradeoff that often
arises between latency and bandwidth.
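
The link between unexpected messages and extra copies can be sketched as follows. This is one common
arrangement (a queue of posted receives plus a queue of unexpected messages), not the mechanism of any
particular implementation surveyed here; the class and method names are hypothetical.

```python
class MatchingEngine:
    """Toy receive side: posted receives on one list, unexpected messages on another."""

    def __init__(self):
        self.posted = []      # (source, tag, user_buffer) waiting for a message
        self.unexpected = []  # (source, tag, data) that arrived before a receive

    @staticmethod
    def _copy_into(buffer, data):
        if len(data) > len(buffer):
            raise ValueError("message is larger than the receive buffer")
        buffer[:len(data)] = data

    def post_receive(self, source, tag, buffer):
        """Post a receive; returns the byte count if a buffered message matched, else None."""
        for index, (src, msg_tag, data) in enumerate(self.unexpected):
            if src == source and msg_tag == tag:
                self.unexpected.pop(index)
                self._copy_into(buffer, data)  # second copy: temp buffer -> user buffer
                return len(data)
        self.posted.append((source, tag, buffer))
        return None

    def deliver(self, source, tag, data):
        """Handle an arriving message; reports whether it needed intermediate buffering."""
        for index, (src, msg_tag, buffer) in enumerate(self.posted):
            if src == source and msg_tag == tag:
                self.posted.pop(index)
                self._copy_into(buffer, data)  # straight into the user buffer
                return "direct"
        self.unexpected.append((source, tag, bytes(data)))  # extra copy into a temp buffer
        return "buffered"
```

A message that finds its receive already posted lands directly in the user buffer; one that arrives
early pays for an intermediate copy, which is the overhead the paragraph above aims to avoid.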

Other common implementation issues include reducing the number of layers and context switches
to improve performance, handling non-contiguous data, making the best use of available communication
coprocessors, resolving semantic clashes between vendor-supplied message-passing facilities and the
MPI standard, overcoming asymmetric performance of vendor-supplied primitives, compensating for the
lack of native support for collective operations, adhering to the spirit of the MPI standard at times
instead of the letter, and even deciding how much of the MPI standard to implement.


Though common issues are faced, each MPI implementation is essentially unique. In order to
demonstrate the wide variety of implementation options, several case studies of MPI implementations on
representative parallel processing architectures are presented below. These architectures include
massively parallel processors using both distributed and distributed shared memory, networks of
workstations, a hybrid architecture involving networked multiprocessors which uses shared memory at
each node and distributed memory between nodes, and uniprocessors running a flavor of Microsoft
Windows.

The Meiko CS/2 is a 64-node distributed memory parallel computer equipped with an Elan
communication coprocessor. As part of a Master’s thesis project, a student at the University of
California at Santa Barbara developed a custom implementation of MPI with the goal of achieving low
latency. Optimization by message length was provided: an optimistic “transfer before match” strategy
was used to decrease latency for small messages, and large messages tapped DMA hardware to increase
the bandwidth of the transfer once message tag matching was complete.

But the key observation in reducing message-passing latency was that message tag matching
was one of its primary components. Two options were explored to reduce the impact. After analyzing the
performance of the MPICH implementation for the Meiko CS/2, which used the 10 MHz Elan coprocessor
to do the matching in the background, the thesis author decided to design his own implementation to
task the node processor (a 40 MHz SPARC Viking 3) with the matching. The reasoning was that although
assigning the matching chores to the Elan reduced the load on the SPARC processor, the much slower
clock speed of the Elan might actually be increasing latency overall. Performance comparisons revealed
that the best choice depended on the characteristics of the application, so a compile-time option was
added to specify which processor should be used for message tag matching.

The Meiko has rudimentary native support for collective communication, but its use is restricted to
contiguous data transmission between contiguously numbered processors. Implementing MPI collective
communication required the resolution of this semantic clash by packing and unpacking non-contiguous
data in order to use the native Meiko collective communication widget; however, the communicator was
required to consist only of contiguously numbered processors.
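
Packing and unpacking non-contiguous data can be sketched for the simple case of equally sized blocks
at a fixed stride. The function names and the strided layout are assumptions chosen for illustration;
they are not taken from the Meiko implementation.

```python
def pack_strided(buffer, offset, block_len, stride, count):
    """Gather `count` blocks of `block_len` bytes, `stride` bytes apart, into one run."""
    packed = bytearray()
    for i in range(count):
        start = offset + i * stride
        packed += buffer[start:start + block_len]
    return bytes(packed)


def unpack_strided(packed, buffer, offset, block_len, stride, count):
    """Scatter a contiguous run back into strided positions of `buffer`."""
    for i in range(count):
        start = offset + i * stride
        buffer[start:start + block_len] = packed[i * block_len:(i + 1) * block_len]
```

The packed result is what a contiguous-only native primitive would carry; the receiver applies
unpack_strided to restore the original layout.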

The Intel Paragon is another example of a massively parallel distributed memory supercomputer,
with an architecture that supports more than 1800 nodes. At Sandia National Laboratories in
Albuquerque, NM, implementing MPI on the Paragon tapped the unique features of Puma, an operating
system for parallel processors that is optimized to provide flexible, lightweight, high-performance
message passing. Portals are the structures used in Puma to realize message passing; a portal is an
abstraction for a wrinkle in space, an opening in user process address space into which messages can
be delivered. In essence, Puma uses user process address space as the buffer space for message
transmission, avoiding gratuitous message copies. Thus the very architecture of Puma itself is
designed to avoid two of the primary obstacles to message-passing performance: kernel context switches
and memory-to-memory copies.

Portals were used as the foundation for the MPI implementation on the Paragon under Puma.
Starting with MPICH as the base implementation, point-to-point message passing was built directly on
top of portals, with minimal assistance (evidently to reduce path length) from the Puma portals
library. Optimizations were performed by message length, with the goal of low latency for short
messages and high bandwidth for long ones; short, long, and synchronous message protocols were
provided. MPI collective communication operations were mapped to native Puma collective functions,
which are themselves built on top of portals. In addition, one-sided communication (a.k.a. “remote
memory addressing,” which enables writing into and reading out of the memory of a remote process
without its direct involvement) was implemented in terms of reply portals, in anticipation of the
MPI-2 standard. Note that this particular approach implemented MPI almost exclusively in terms of
operating system features, deferring hardware issues to the implementation of Puma itself.
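
The idea of delivering straight into user address space can be illustrated with a toy model. This is
not the Puma portals API; the Portal class and its deliver method are hypothetical, and a Python
memoryview stands in for an opening onto part of a process's own buffer.

```python
class Portal:
    """Hypothetical portal: a writable window onto a region of a user-owned buffer."""

    def __init__(self, user_buffer, offset, length):
        if offset < 0 or length < 0 or offset + length > len(user_buffer):
            raise ValueError("window lies outside the user buffer")
        # A memoryview slice shares storage with user_buffer; nothing is copied here.
        self._window = memoryview(user_buffer)[offset:offset + length]
        self.bytes_delivered = 0

    def deliver(self, data):
        """Write incoming data directly into the user's memory, after earlier deliveries."""
        end = self.bytes_delivered + len(data)
        if end > len(self._window):
            raise BufferError("message does not fit in the portal")
        self._window[self.bytes_delivered:end] = data
        self.bytes_delivered = end
        return len(data)
```

Because the window aliases the user's buffer, each delivery is the only copy made; there is no
separate system buffer to drain afterwards.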

The Cray T3D is a massively parallel supercomputer supporting up to 2048 processors, but with a
different architecture than the machines previously described. The T3D is a physically distributed
shared memory computer; memory exists on each local node, but is globally addressable. From a
programming perspective, the availability of shared memory simplifies implementation because standard
shared memory operations can be used to transfer messages. However, maintaining cache consistency at
each processor is the responsibility of the application.

One MPI implementation for the T3D has taken the following approach. Using MPICH as the base
distribution, a message header buffer was implemented as a collection of arrays on the shared heap. The
common optimization for long and short messages was provided. The message transfer mechanism
chosen was particularly novel. Performance profiling revealed an asymmetry in the performance of the
vendor-supplied shared memory operations; specifically, put outperformed get by nearly a factor of
two. As a result, an entirely put-based message transfer mechanism was designed; in other words,
sending and receiving were accomplished by a sender push, not a receiver pull.


Several technical problems created by the choice of a put-based message transfer mechanism
had to be resolved. The sender bore the burden of validating the delivery address in advance, since the
destination address was usually local to the receiving processor, not a global address. The receiver
cache was automatically invalidated upon message delivery, in order to maintain cache coherency. And
temporary buffers had to be used to transfer data that was either not aligned properly or whose size
was not a multiple of four, because the put operation is limited to transferring data that is
four-byte aligned.
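
The temporary-buffer workaround for sizes that are not a multiple of four can be sketched as below.
The names raw_put, padded_length, and push_message are hypothetical, raw_put merely imitates a
word-aligned put on a bytearray, and the sketch assumes both the user buffer and the landing buffer
start on a word boundary.

```python
WORD = 4  # the put primitive described above moves four-byte-aligned data only


def padded_length(nbytes):
    """Round a byte count up to the next multiple of WORD."""
    return (nbytes + WORD - 1) // WORD * WORD


def raw_put(remote, offset, data):
    """Imitation of the restricted put: whole words at word-aligned offsets only."""
    if offset % WORD or len(data) % WORD:
        raise ValueError("put requires four-byte alignment")
    if offset + len(data) > len(remote):
        raise ValueError("put would run past the end of remote memory")
    remote[offset:offset + len(data)] = data


def push_message(data, landing, user_buffer):
    """Sender-push transfer; returns True when a temporary buffer was needed."""
    n = len(data)
    if n % WORD == 0:
        raw_put(user_buffer, 0, data)      # aligned case: straight to the destination
        return False
    staging = bytearray(padded_length(n))  # sender-side temporary buffer, zero padded
    staging[:n] = data
    raw_put(landing, 0, staging)           # padded transfer into an aligned landing area
    user_buffer[:n] = landing[:n]          # receiver keeps only the real payload bytes
    return True
```

The five-byte case costs two extra copies compared with the aligned case, which is why such buffers
are a last resort in a design that otherwise tries to avoid intermediate copies.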

Using shared or distributed memory does not represent a mutually exclusive choice in
implementing MPI. The next case considered exhibits an interesting hybrid architecture. SGI’s Array 2.0
product is targeted at multiprocessor nodes running the 64-bit IRIX 6.2 operating system and connected
by HiPPI switches with a TCP/IP fallback path. Up to eight nodes are supported in the array, and each
node can contain up to 24 processors. Both PVM and MPI implementations are included in Array 2.0. The
goal is to get massively parallel processor (MPP)-like performance from a networked array of high-end
multiprocessor workstations.

A hybrid message-passing mechanism is employed in Array 2.0. Shared memory is used to
transmit messages between processors on the same node, and distributed memory is used for messages
sent between nodes over the HiPPI link. Optimizations by message length are provided, but in an
unusual way: the latency of short messages is reduced by bypassing the overhead of the HiPPI framing
protocol, and the bandwidth of large messages is increased by transmitting them in parallel using
multiple HiPPI adapters, if available. (This is known as “striped HiPPI.”) Other characteristics
include the use of an environment variable to explicitly specify the number of unexpected messages to
buffer, and the up-front admission that the Array 2.0 implementation of MPI is not thread-safe,
despite the intention of the MPI standard.
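
Striping a large message over several adapters amounts to cutting it into one piece per adapter and
reassembling by index on arrival. The sketch below shows only that bookkeeping; the function names are
hypothetical, and nothing here reflects how Array 2.0 actually frames or schedules its stripes.

```python
def stripe(message, adapters):
    """Split `message` into one (index, chunk) pair per adapter."""
    if adapters < 1:
        raise ValueError("at least one adapter is required")
    chunk = -(-len(message) // adapters)  # ceiling division
    return [(i, message[i * chunk:(i + 1) * chunk]) for i in range(adapters)]


def reassemble(stripes):
    """Rebuild the message from stripes that may arrive in any order."""
    ordered = sorted(stripes, key=lambda item: item[0])
    return b"".join(part for _, part in ordered)
```

With three adapters a ten-byte message is cut into pieces of 4, 4, and 2 bytes, and reassemble
restores it even if the stripes arrive reversed.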

Another approach to achieving MPP-like message-passing performance on a network of
workstations was taken by the University of Illinois at Urbana-Champaign (UIUC). MPICH was
implemented on top of UIUC’s Fast Messages (FM) low-level communication library, which runs entirely
in user space and avoids the overhead of kernel context switches. The target platform was restricted
to Sun workstations equipped with LANai interface processors and connected by a packet-switched
Myrinet network. Although the limitations would appear to be severe (specialized physical and
transport layers, homogeneous workstations), the results were outstanding: extremely low latency was
achieved, and the bandwidth for short and medium messages was comparable to MPI implementations on
MPPs.


Several technical hurdles were overcome to achieve this low latency. The LANai control program
was kept simple because the coprocessor was quite slow in comparison to the host processor. Two
semantic clashes between MPICH and FM were encountered. MPICH uses a split-phase message send,
while FM is stateless; this was resolved by using sender and receiver queues to track state
information. On the receive side, FM has no explicit receive operation but instead relies on
user-defined receive handlers in the style of Active Messages; the ADI receive primitives were
implemented in terms of such handlers, into which FM calls were carefully placed. Two optimizations
to FM itself were developed to eliminate message copies: a “gather” mechanism on the send side to
efficiently transmit arbitrarily large messages consisting of non-contiguous data, and an upcall
mechanism to retrieve the receiver buffer address so that message reassembly could be done directly
at the destination. The gather mechanism provided the requisite optimizations for short and long
messages. The performance of the two optimizations taken together was found to exceed the sum of
their parts, because it kept the message pipeline full.
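
A rough sketch of the two optimizations, under stated assumptions: gather_send streams a list of
separate fragments as fixed-size packets without first concatenating them into one contiguous
message, and Receiver obtains its destination buffer through an upcall so packets are reassembled in
place. All names are hypothetical and this is not the FM interface.

```python
def gather_send(fragments, packet_size, emit):
    """Send non-contiguous fragments as packets of at most `packet_size` bytes."""
    if packet_size < 1:
        raise ValueError("packet_size must be positive")
    packet = bytearray()
    for fragment in fragments:
        view = memoryview(fragment)  # slicing a view does not copy the fragment
        while len(view):
            take = min(packet_size - len(packet), len(view))
            packet += view[:take]
            view = view[take:]
            if len(packet) == packet_size:
                emit(bytes(packet))
                packet.clear()
    if packet:
        emit(bytes(packet))          # final, possibly short, packet


class Receiver:
    """Handler-style receive side: no explicit receive call, only a packet handler."""

    def __init__(self, lookup_buffer):
        self._lookup_buffer = lookup_buffer  # upcall that returns the destination buffer
        self._dest = None
        self._offset = 0

    def handler(self, packet):
        if self._dest is None:
            self._dest = self._lookup_buffer()  # ask the upper layer where data belongs
        if self._offset + len(packet) > len(self._dest):
            raise BufferError("destination buffer is too small")
        self._dest[self._offset:self._offset + len(packet)] = packet
        self._offset += len(packet)
```

Passing Receiver.handler as the emit callback connects the two halves: each packet is written into
the final buffer as soon as it is produced.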

MPI is also available on uniprocessor Intel machines running variants of Microsoft Windows. The
first such implementation was WinMPI, an early port of MPICH by a graduate student at the University of
Nebraska at Omaha. WinMPI runs on a standalone Windows 3.1 PC, and represents each parallel
process by a separate Windows task. The P4 channel device was used as the lower-level communication
layer and implemented as a Dynamic Link Library (DLL). Global (i.e., shared) memory allocated by the
DLL is used for message exchange. The purpose of WinMPI was ostensibly to serve as a testing and
training platform for parallel programming in MPI.

As can be imagined, numerous technical problems had to be solved in order to implement MPI in
such a restricted environment. Most involved the addition of extensions to the WinMPI API to
compensate for the limitations, which have the unfortunate side effect of requiring minor source code
changes to MPI programs. For example, a different program entry point (MPI_main) is needed; all
integers are declared long to bring them up to the 32-bit level; memory allocation functions were
added to the DLL to bypass the allocation limitations of the medium memory model required for the
pseudo-parallel Windows tasks; several I/O calls were added to the DLL to provide Graphical User
Interface (GUI) analogs to the standard UNIX console I/O functions; and a processor yield function was
created because of the non-preemptive nature of the Windows multitasking model.
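
Why a yield call is needed under non-preemptive multitasking can be shown with generators, which also
run only until they give up control voluntarily. This is an analogy, not WinMPI code; the scheduler
and the two tasks are hypothetical.

```python
from collections import deque


def run_cooperatively(tasks):
    """Round-robin scheduler: each task runs until it yields or finishes."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue          # task finished; do not reschedule it
        ready.append(task)


def receiver(mailbox, log):
    while not mailbox:
        yield                 # give up the processor so the sender gets a turn
    log.append(mailbox.pop(0))


def sender(mailbox):
    yield                     # let other tasks start first
    mailbox.append("hello")
```

If receiver polled the mailbox in a plain loop without yielding, sender would never be scheduled and
the wait would never end, which is the hazard the added yield function addresses.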

Two implementations of MPI currently exist for Windows NT, both ports of MPICH. The first,
W32MPI, comes out of the Universidade de Coimbra in Portugal. A DLL implementation of the P4 channel
device is again used as the lower-level messaging layer. Local communication takes place via shared
memory, and TCP/IP communication is possible over networks of heterogeneous workstations, thus
extending a popular feature of the UNIX implementations of MPICH. The second is from Mississippi State
University, and uses MPICH’s Channel Device Interface (CDI) to implement two custom communication
devices: a TCP/IP device to communicate between networked workstations, and a Local device to
communicate between processes running on a single machine using memory-mapped files (i.e., shared
memory). Threads are used to mediate between the send and receive queues of a user process and the
relevant shared memory buffer slots for each processor.
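
Message exchange through a memory-mapped file can be sketched with a single buffer slot. The slot
layout (a four-byte length header followed by the payload) and the function names are assumptions
made for illustration, and the sketch omits the locking or signalling a real device would need
between the writer and the reader.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<I")  # hypothetical slot layout: 4-byte length, then payload


def create_slot(size):
    """Create a zero-filled backing file for one buffer slot and return its path."""
    fd, path = tempfile.mkstemp(suffix=".slot")
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"\x00" * size)
    return path


def write_message(path, payload):
    """Map the slot and store payload, then its length, in the shared region."""
    with open(path, "r+b") as handle, mmap.mmap(handle.fileno(), 0) as region:
        if HEADER.size + len(payload) > len(region):
            raise ValueError("payload does not fit in the slot")
        region[HEADER.size:HEADER.size + len(payload)] = payload
        HEADER.pack_into(region, 0, len(payload))  # length is written after the data


def read_message(path):
    """Map the same slot and return the payload recorded there."""
    with open(path, "r+b") as handle, mmap.mmap(handle.fileno(), 0) as region:
        (length,) = HEADER.unpack_from(region, 0)
        return bytes(region[HEADER.size:HEADER.size + length])
```

Two processes that map the same file see the same bytes, so write_message in one and read_message in
the other behave like a one-slot shared memory channel.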


I gratefully acknowledge the work done by John M. Linebarger of The University of New Mexico in
understanding the implementation issues of MPI on various architectures and operating systems.




John M. Linebarger. “Mapping MPI to Machine.” University of New Mexico, 11 Dec 1996.

Marc Snir. “MPI: The Complete Reference.” MIT Press.

William Gropp and Ewing Lusk. “User’s Guide to MPICH, a portable implementation of MPI.”

William Gropp and Ewing Lusk. “Installation Guide to MPICH, a portable implementation of MPI.”

Ron Brightwell and Lance Shuler. “Design and Implementation of MPI on Puma Portals.” Sandia
National Laboratories, Albuquerque.

P.K. Jimack and N. Touheed. “An Introduction to MPI for Computational Mechanics.” School of
Computer Studies, University of Leeds.

Peter S. Pacheco. “Parallel Programming with MPI” (tutorial).

William Gropp. “Tutorial on MPI: The Message Passing Interface.”

Joerg Meyer. “MPI for MS Windows 3.1.” University of Nebraska, December 1994.