AMLAPI: Active Messages Over Low-level Application Program Interface

CS262B Semester Project, Spring 2001.

Simon Yau, smyau@cs.berkeley.edu

Introduction:

Modern large-scale parallel high-performance machines make use of many different
libraries for communication. Examples include Active Message (AM) from Berkeley,
Virtual Interface Architecture (VIA) from Compaq and Intel, Low-level Application
Program Interface (LAPI) from IBM, GM from Myrinet, and others. For application
writers, porting the communication layer to each different platform can be a
hassle. Therefore, communication-layer “glue-ware” has been developed that emulates
one communication layer using another. Using the emulated communication layers,
parallel applications can then run on more machines. Earlier work includes AM over
VIA [1], AM over UDP [2], and Myrinet’s VIA over GM project. This project attempts
to emulate AM using IBM’s LAPI on the SP3 at the San Diego Supercomputer Center.


Active Message and Low-level Application Program Interface

Active Message (AM) is a communication protocol developed at Berkeley [3][4]. It is a
“RISC-style” communication layer, which aims at providing the minimal functionality
that a parallel application would need. The AM model is based on the idea of lightweight
Remote Procedure Calls (RPC), in which processors communicate by sending network
messages that cause a remote processor to execute certain procedures. [3] argues that
this minimal set of communication functions is enough to support the variety of
communication needs of parallel applications. In AM2 [5], the model is extended to
include endpoints, which are virtual network interfaces, to support the illusion that
each application-level thread has its own network interface. Each message is now sent
from one endpoint to another, and the handler is executed in the context of an endpoint.
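
The request/reply style of the AM model can be illustrated with a minimal in-process sketch. This is not the AM API: the `Node` class, its `register`, `send`, and `poll` methods, and the use of the sending node as the reply token are all assumptions made for illustration, and the "network" is just a local queue.

```python
from collections import deque

class Node:
    """Toy stand-in for a processor with a handler table (hypothetical API)."""

    def __init__(self):
        self.handlers = {}    # handler index -> callable
        self.inbox = deque()  # messages that have "arrived" from the network

    def register(self, index, fn):
        self.handlers[index] = fn

    def send(self, dest, index, *args):
        # A message names a handler by index and carries its arguments.
        # The sender is passed along as a token so the handler can reply.
        dest.inbox.append((self, index, args))

    def poll(self):
        # Handlers only run when the receiving side polls.
        while self.inbox:
            token, index, args = self.inbox.popleft()
            self.handlers[index](token, *args)

a, b = Node(), Node()
results = []
# Handler 0 on b doubles its argument and replies to handler 1 on the sender.
b.register(0, lambda token, x: b.send(token, 1, x * 2))
a.register(1, lambda token, y: results.append(y))

a.send(b, 0, 21)
b.poll()  # runs the request handler on b
a.poll()  # runs the reply handler on a
```

The point of the sketch is only that a message carries a handler index plus arguments, and that nothing executes on the receiver until it polls.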

This messaging model has proven useful: many parallel systems, such as MPI, Split-C,
Titanium, and others, use AM as their communication layer, or have it as an option.

Low-level Application Program Interface (LAPI) is an AM-like messaging layer developed
at IBM [6]. Its design philosophy is very similar to that of AM: LAPI aims at providing
an extensible and flexible lightweight communication layer for parallel programs. As a
result, AM and LAPI have very similar semantics: in both architectures, communication
is done by one node sending a message to another node, which causes a handler to be
executed on the receiving node. However, unlike AM, LAPI does not virtualize the
network interface for threads; and unlike AM, LAPI handlers execute in the context of a
thread that is dedicated to executing handlers, whereas AM requires the application
thread to periodically poll for AM messages. These differences proved crucial to the
implementation of an AM emulator on LAPI.


Motivation:

This project is motivated by the desire to run Titanium [7] on the SP3 [8]. Titanium is a
high-performance parallel programming language with Java semantics, based on the SPMD
model. The current implementation of the Titanium compiler uses AM as its communication
layer. In order to run Titanium programs on platforms that use other communication
layers, either a new communication backend needs to be written [9][10], or communication
glue-ware needs to be written [2] that maps AM onto a communication layer available on
the machine. The Titanium group has been looking to run Titanium programs on IBM’s SP3
Blue Horizon, which uses LAPI as its communication layer. While previous attempts have
been made to write a communication library specifically for Titanium in LAPI [9], AM and
LAPI have such similar specifications that this project opts to emulate AM functionality
using LAPI.


Implementation

There are two main differences between AM and LAPI:

1) AM virtualizes the network interface for threads: each thread communicates with
   other threads using interfaces called endpoints, to create the illusion that each
   thread has its own network interface. Communication is done between endpoints,
   instead of between nodes.

2) LAPI handlers execute outside the context of the application thread. A LAPI handler
   is executed by a separate LAPI thread that is spawned when LAPI is initialized.
   In AM, each endpoint belongs to a bundle. Messages sent to the endpoints in a
   bundle are only handled when a thread polls that bundle for incoming active
   messages.

To bridge the gap, AMLAPI must maintain AM’s semantics using LAPI.

To implement endpoints in software, each node maintains a vector of endpoints. The
vector is protected by a latch, so only one thread in a node can access the vector at any
one time. When the application creates an endpoint, the endpoint is appended to the
end of the vector. When an endpoint is removed from the vector, the “hole” is not
filled, so each position in the endpoint vector is occupied by at most one endpoint during
the run of a program. The endpoints can thus be globally identified by a <node,
vector-position> tuple. When the application calls AM send, AMLAPI piggybacks the
endpoint identifier on the AM message and sends it to the LAPI destination node. The
LAPI handler on the destination node unpacks the information and de-multiplexes the
message to the appropriate endpoint.
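
The endpoint vector described above can be sketched as follows. This is an illustrative model, not the AMLAPI source: the names `EndpointId`, `Endpoint`, `EndpointTable`, and `demultiplex` are hypothetical, and a Python `threading.Lock` stands in for the latch.

```python
import threading
from typing import List, NamedTuple, Optional

class EndpointId(NamedTuple):
    """Global endpoint name: the <node, vector-position> tuple."""
    node: int
    position: int

class Endpoint:
    def __init__(self, ident: EndpointId):
        self.ident = ident
        self.inbox: List[bytes] = []  # messages delivered to this endpoint

class EndpointTable:
    """Per-node vector of endpoints, guarded by a latch."""

    def __init__(self, node: int):
        self.node = node
        self._latch = threading.Lock()
        self._slots: List[Optional[Endpoint]] = []

    def create(self) -> Endpoint:
        with self._latch:
            # Always append: a position is never handed out twice.
            ep = Endpoint(EndpointId(self.node, len(self._slots)))
            self._slots.append(ep)
            return ep

    def remove(self, ident: EndpointId) -> None:
        with self._latch:
            self._slots[ident.position] = None  # leave the hole unfilled

    def lookup(self, ident: EndpointId) -> Optional[Endpoint]:
        with self._latch:
            if ident.node != self.node:
                return None
            if not 0 <= ident.position < len(self._slots):
                return None
            return self._slots[ident.position]

def demultiplex(table: EndpointTable, dest: EndpointId, payload: bytes) -> bool:
    """What the receiving handler does with the piggybacked identifier."""
    ep = table.lookup(dest)
    if ep is None:
        return False  # endpoint was removed or never existed
    ep.inbox.append(payload)
    return True
```

Because holes are never reused, an identifier that was valid once can never silently refer to a different endpoint later in the run; a stale identifier simply fails the lookup.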

In order to ensure that AM handlers are executed in the context of application threads, a
task queue is associated with each endpoint bundle. When the LAPI handler has determined
the destination endpoint of an AM message, it appends the AM handler to the end of the
task queue of the endpoint bundle to which the endpoint belongs. When the application
polls the endpoint bundle for AM messages, it extracts a handler from the task queue
and executes it.
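
A minimal sketch of the per-bundle task queue follows, assuming a hypothetical `EndpointBundle` class with `enqueue` and `poll` methods (these names are not from the AM or LAPI interfaces). An ordinary Python thread plays the role of the LAPI thread. In this sketch `poll` drains every queued handler; running a single handler per poll would fit the description equally well.

```python
import threading
from collections import deque

class EndpointBundle:
    """Task queue shared by the endpoints of one bundle (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = deque()

    def enqueue(self, handler, token, *args):
        # Called from the messaging layer's thread: record the work only.
        with self._lock:
            self._tasks.append((handler, token, args))

    def poll(self) -> int:
        # Called from an application thread: run the queued handlers here,
        # so they execute in the application thread's context.
        ran = 0
        while True:
            with self._lock:
                if not self._tasks:
                    return ran
                handler, token, args = self._tasks.popleft()
            handler(token, *args)  # run outside the lock
            ran += 1

bundle = EndpointBundle()
seen = []

def am_handler(token, value):
    seen.append((token, value, threading.current_thread().name))

# A separate thread stands in for the LAPI thread delivering a message.
lapi = threading.Thread(target=lambda: bundle.enqueue(am_handler, "tok", 7),
                        name="lapi-thread")
lapi.start()
lapi.join()
handled = bundle.poll()  # the handler runs now, on the polling thread
```

The handler is invoked outside the lock so that a handler that itself sends messages or enqueues work cannot deadlock on the queue.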


Evaluation:

(a) Evaluation Platform:

The SP3 Blue Horizon is a cluster of symmetric multi-processors (SMP). The nodes are
connected to each other via a Colony network switch. The advertised latency of the
network is 17 microseconds, and the bandwidth is 350 MB/s. Each SMP node contains eight
Power3 processors and 4 GB of memory. Each processor is a superscalar, pipelined,
64-bit RISC processor, with 8 instructions per clock at 375 MHz, a 64 KB L1 cache, and an
8 MB L2 cache. Each processor communicates with other processors on the same node
through shared memory using the pthread interface, and with processors on remote nodes
using LAPI. The OS on each node runs the AIX parallel environment.

The AMLAPI layer has been implemented on the SP3 platform, and the round-trip latency
and bandwidth have been measured.


(b) Latency:

Not surprisingly, the software overhead significantly increased the round-trip latency of
sending a one-byte message. The LAPI round-trip latency is 32 microseconds, while the AM
round-trip latency is 470 microseconds, roughly 15 times higher. There are several
factors contributing to the delay:

1) Message size: The message sent using AMLAPI is actually bigger than the LAPI
   message. In addition to the 1 byte, AMLAPI has to piggyback the endpoint
   information, the AM handler index, and the AM token onto the LAPI message.

2) Message packing overhead: In order to piggyback the AM information onto the
   LAPI messages, additional CPU cycles need to be spent packing the
   information with the message.

3) Context switch and queuing overhead: The fact that AM handlers can only be
   executed in the context of an application thread means that each AM message on the
   destination node has to be queued, wait for the context switch to the application
   thread, and wait for the application thread to start polling before it can be
   executed.
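
The size and packing overheads in items 1 and 2 can be made concrete with a small sketch. The field layout below (destination node, vector position, handler index, token, with the chosen widths) is an assumption for illustration only; the report does not give the actual AMLAPI wire format.

```python
import struct

# Hypothetical piggybacked fields, network byte order, no padding:
#   destination node (4 bytes), endpoint vector position (4 bytes),
#   AM handler index (2 bytes), AM token (4 bytes).
AM_INFO = struct.Struct("!IIHI")

def pack_am(dest_node, dest_pos, handler_index, token, payload: bytes) -> bytes:
    """Sender side: prepend the AM information to the user payload."""
    return AM_INFO.pack(dest_node, dest_pos, handler_index, token) + payload

def unpack_am(message: bytes):
    """Receiver side: split the AM information from the user payload."""
    dest_node, dest_pos, handler_index, token = AM_INFO.unpack_from(message)
    return dest_node, dest_pos, handler_index, token, message[AM_INFO.size:]

wire = pack_am(2, 5, 9, 77, b"x")  # a one-byte AM message
```

With this assumed layout, a one-byte payload travels as a 15-byte message, and both ends spend cycles building and splitting the buffer.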

The next section attempts to quantify the extent of each of these factors when sending
long messages.

(c) Bandwidth:

Since AMLAPI adds a fixed amount of overhead to each AM message being sent,
regardless of its size, one would expect the bandwidth of the AMLAPI and LAPI
communication layers to converge at the point at which the network is saturated.
However, that is not the case.

Figure 1. LAPI & AMLAPI Bandwidth on SP3: bandwidth (KB/s) versus message size (bytes),
with one curve for LAPI and one for AM.

In order to find out where the extra time went, we profiled the amount of time spent on
emulating an AM call with LAPI while sending messages of various sizes:

Figure 2. Time spent on transmission of a message: time (ms) versus message size,
broken down into LAPI (communication), context switch and polling, packing AM info,
and copying to the endpoint VM segment.

As shown in Figure 2, in addition to the same overheads that increase latency (increased
message size, context switching and queuing overhead, and message packing overhead),
there is also an overhead associated with copying the message into the endpoint’s
associated memory segment. The AM specification states that AM transfer requests copy a
contiguous array of data from one node to a designated virtual memory segment
associated with an endpoint. However, the current implementation of AMLAPI piggybacks
the AM information with the large chunk of data in one contiguous array, so the LAPI
handler at the destination node must unpack the information and copy the data into the
endpoint’s virtual memory segment. (It is unknown why the amount of time spent context
switching goes up with the message size, since it should stay constant regardless of
message size.) Figure 3 is a percentage breakdown of the significant pieces of an AM
bulk memory transfer.

LAPI                     51%
Copying to endpoint VM   22%
Packing AM info          17%
Context switch           10%

Figure 3. Percentage breakdown of overhead at a 262144-byte message.

As Figure 3 shows, packing the AM information and copying the message to the endpoint’s
VM segment take up the bulk of the overhead. Since each SP3 node is an SMP, the LAPI
thread and the application thread run on different processors. After the LAPI thread
unpacks the data, it needs to flush its processor cache in order to move the data from
the LAPI thread’s processor to the application thread’s processor. This contributes a
significant amount of overhead.
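
The shares in Figure 3 allow a rough back-of-the-envelope estimate of the cost of emulation at 262144 bytes. The sketch below assumes that the "LAPI" share is about the time a plain LAPI transfer of the same data would take; that is an assumption, since the AMLAPI message is slightly larger than the plain one. The function names are hypothetical.

```python
# Shares of the total transfer time at 262144 bytes, as given in Figure 3 (percent).
breakdown = {
    "LAPI (communication)": 51,
    "Copying to endpoint VM segment": 22,
    "Packing AM info": 17,
    "Context switch": 10,
}

def emulation_share(shares):
    """Fraction of the total time that is not LAPI communication."""
    total = sum(shares.values())
    return (total - shares["LAPI (communication)"]) / total

def slowdown(shares):
    """Total time divided by the LAPI communication time."""
    return sum(shares.values()) / shares["LAPI (communication)"]

overhead = emulation_share(breakdown)
factor = slowdown(breakdown)
```

Under that assumption, emulation accounts for 49% of the transfer time, so a bulk transfer through AMLAPI takes a little under twice as long as the underlying LAPI communication alone.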



Future work:

a) Possible Optimizations:

To decrease the AMLAPI overhead we propose the following implementation strategies:

1) Run AM handlers in the LAPI thread context. This violates the AM specification, but
   if we block the progress of a polling AM thread while the LAPI thread executes the
   handler, this has the same effect as executing the AM handler in application thread
   context (the LAPI thread and user thread share the same address space). This will
   eliminate the context switching and queuing overhead, but is more difficult to
   code.

2) More efficient piggybacking. LAPI has a two-phase handling protocol. A LAPI
   message is divided into a header and a body. The header is delivered first and passed
   to a header handler. The header handler allocates the memory needed to hold
   the body. The body is then copied into the allocated memory and the
   completion handler is called. The current implementation piggybacks all AM
   information in the body. However, for small messages, it would be possible to
   package the whole message into the header and have the header handler schedule a
   task on the queue. This will remove the overhead of context switching between
   LAPI handlers.

3) Since all AM information is piggybacked in the body, it is necessary to copy the
   bulk data from the LAPI body into the endpoint’s virtual memory segment. If we
   switch to packaging AM information in the header, we can eliminate the extra
   level of copying by having the header handler specify the correct virtual memory
   segment to copy the bulk data into, and LAPI will copy the bulk data into that
   location.

4) AM semantics state that an AM send call cannot return until the network has
   accepted the message. But LAPI calls are asynchronous. To guarantee that the
   network has accepted the message, we must wait until LAPI notifies us that the
   header handler has been executed on the remote node before returning. Instead of
   waiting, we can return immediately after the LAPI call, and use a buffer pool to hold
   the LAPI messages (so the user thread cannot de-allocate them).

5) To eliminate the need to flush the LAPI processor’s cache to move unpacked AM
   data to the application thread’s cache, we can postpone the unpacking of AM data to
   the application thread.
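
Strategies 2 and 3 can be sketched with a toy model of a two-phase (header, then body) delivery. Nothing here is the LAPI API: `deliver`, `make_header_handler`, the `Endpoint` class, and the 6-byte header layout (segment offset plus handler index) are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Tuple

class Endpoint:
    def __init__(self, segment_size: int):
        self.segment = bytearray(segment_size)  # endpoint's memory segment
        self.tasks = []                         # task queue polled by the application

def make_header_handler(endpoint: Endpoint):
    """Build a header handler that reads the AM information from the header."""
    def header_handler(header: bytes, body_len: int) -> Tuple[memoryview, Callable[[], None]]:
        # Assumed header layout: 4-byte segment offset, 2-byte handler index.
        offset = int.from_bytes(header[0:4], "big")
        handler_index = int.from_bytes(header[4:6], "big")
        if offset + body_len > len(endpoint.segment):
            raise ValueError("transfer does not fit in the endpoint segment")
        # Point the messaging layer straight at the final destination.
        target = memoryview(endpoint.segment)[offset:offset + body_len]

        def completion() -> None:
            # Only scheduling is left to do; no second copy is needed.
            endpoint.tasks.append((handler_index, offset, body_len))

        return target, completion
    return header_handler

def deliver(header: bytes, body: bytes, header_handler) -> None:
    """Stand-in for the messaging layer's two-phase delivery."""
    target, completion = header_handler(header, len(body))  # phase 1: header
    target[:len(body)] = body                               # the layer's single copy
    completion()                                            # phase 2: completion

ep = Endpoint(segment_size=16)
header = (4).to_bytes(4, "big") + (9).to_bytes(2, "big")
deliver(header, b"abcd", make_header_handler(ep))
```

Because the header handler already knows the destination segment, the bulk data is written once, by the messaging layer, directly where the AM specification wants it.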

b) Higher-level language support:

Using AMLAPI, we have run several Titanium programs on the SP3. However, the
performance of these programs is still constrained by the performance of the LAPI layer.
One can use existing Titanium programs to indicate the typical communication workload
of a high-level scientific application. Table 1 shows the message size breakdown of two
Titanium programs, adaptive mesh refinement (AMR) and gas dynamics simulation
(GAS), on four processors.

Application   Small (0-32B)   Medium (32-544B)   Large (>544B)   Average Size
GAS           131400          122640             0               31.982 Bytes
AMR           31434           2574               0               93.760 Bytes

Table 1. Message size breakdown for two Titanium programs

As seen in Table 1, none of the messages is larger than 544 bytes. Therefore, the
communication layer should aim at optimizing for short messages. As suggested above,
since most messages are small, AMLAPI should package all AM information in the
header to avoid context switch overhead.
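
The proportions behind this observation can be computed directly from the counts in Table 1. The sketch below only restates the table; the `counts` dictionary and the `share` helper are illustrative names.

```python
# Message counts per size class, copied from Table 1.
counts = {
    "GAS": {"small": 131400, "medium": 122640, "large": 0},
    "AMR": {"small": 31434, "medium": 2574, "large": 0},
}

def share(app: str, size_class: str) -> float:
    """Fraction of an application's messages that fall in one size class."""
    total = sum(counts[app].values())
    return counts[app][size_class] / total

gas_small = share("GAS", "small")
amr_small = share("AMR", "small")
```

Small messages make up about 52% of the GAS traffic and about 92% of the AMR traffic, and neither program sends any message in the large class.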

Conclusion:

This project shows that communication “glue-ware” can be a viable option to increase the
portability of parallel programs, but it needs to be done with great care. Although AM and
LAPI have similar interfaces, the “fine print” (such as the endpoint APIs, and the
requirement that handlers can only be executed in application thread context) can pose a
performance problem.

However, besides portability, there are other advantages to using communication
“glue-ware” for parallel programs. For example, Titanium’s communication library is
written using AM; any change made to that library needs to be redone on other
platforms that do not use AM as their communication layer. By emulating AM on the
SP3, such changes will be reflected on the SP3 for free.

Communication glue-ware represents a trade-off between maintainability/portability and
performance. If the communication layer is emulated with relatively little performance
loss, glue-ware would be the correct choice to port a parallel application.


Acknowledgements:

The author would like to thank Dan Bonachea for the data on Titanium message sizes and
the micro-benchmark programs, Tyson Condie for his help on the implementation of
AMLAPI, and Chang-Sun Lin, Jr. for his help on setting up the SP3 environment.


References:

[1] Andrew Begel, Philip Buonadonna, David Culler, David Gay: An Analysis of VI
Architecture Primitives in Support of Parallel and Distributed Communication.

[2] Dan Bonachea and Daniel Hettena: AMUDP: Active Messages Over UDP. CS294-8
Semester Project, UC Berkeley, Fall 2000.

[3] Thorsten von Eicken, David Culler, Seth Copen Goldstein, Klaus Erik Schauser:
Active Messages: a Mechanism for Integrated Communication and Computation.
Proceedings of the 19th International Symposium on Computer Architecture, ACM Press,
May 1992.

[4] Thorsten von Eicken: Active Messages: An Efficient Communication Architecture for
Multiprocessors. PhD Thesis. Dipl. Ing. (Eidgenössische Technische Hochschule, Zurich)
1987.

[5] Alan Mainwaring, David Culler: Active Message Applications Programming Interface
and Communication Subsystem Organization. Draft Technical Report.

[6] Gautam Shah, Jarek Nieplocha, Jamshed Mirza, Chulho Kim, Robert Harrison, Rama
K. Govindaraju, Kevin Gildea, Paul DiNicola, Carl Bender: Performance and Experience
with LAPI - a New High-Performance Communication Library for the IBM RS/6000 SP.
IPPS’98.

[7] Yelick, Semenzato, Pike, Miyamoto, Liblit, Krishnamurthy, Hilfinger, Graham, Gay,
Colella, Aiken: Titanium, a High-Performance Java Dialect. Workshop on Java for
High-Performance Network Computing, ACM 1998.

[8] NPACI Blue Horizon User Guide: http://www.npaci.edu/BlueHorizon

[9] Chang-Sun Lin, Jr.: The Performance Limitations of SPMD Programs on Clusters of
Multiprocessors. UC Berkeley, Masters Project Report, 2000.

[10] Carleton Miyamoto and Chang-Sun Lin, Jr.: Evaluating Titanium SPMD Programs
on the Tera MTA. Supercomputing99.

[11] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Schauser, Eunice
Santos, Ramesh Subramonian, Thorsten von Eicken: LogP: Towards a Realistic Model of
Parallel Computation. ACM SIGPLAN Notices, 28(7):1-12, July 1993.