Asynchronous Global Heap: Stepping Stone to Global Memory Management

Yuichiro Ajima, Hideyuki Akimoto, Tomoya Adachi, Takayuki Okamoto, Kazushige Saga, Kenichi Miura and Shinji Sumimoto

Fujitsu Limited, Kawasaki, Kanagawa, Japan
JST CREST, Kawaguchi, Saitama, Japan
{aji, h.akimoto, adachi.tomoya, tokamoto, saga.kazushige, k.miura, sumimoto.shinji}@jp.fujitsu.com
Abstract
Memory utilization for communication will become a major problem in the exascale era because an exascale parallel computation job will include millions of processes and a conventional communication layer requires preprovisioned memory of a size proportional to the number of processes. One approach to addressing the memory utilization problem is to allocate a data sink on a remote node dynamically each time a node needs to send data to the remote node. In this paper, a memory management scheme that can provide memory to a process on a remote node is called "global memory management." A global memory management scheme that can be accessed via the interconnect is an ideal solution that does not waste local processing resources, but it is also different from today's local memory management schemes. As a stepping stone to global memory management, we propose an asynchronous global heap that virtually achieves global memory management while minimizing modifications to the operating system and the runtime. In addition, new hardware features for global memory management are also co-designed.
1 Introduction
Cluster-type parallel computers are the mainstream of today's high performance computing (HPC). A cluster is composed of a massive number of stand-alone computers that are interconnected. On a stand-alone computer, a network interface is an expansion device and not a first-class citizen like the processor, memory, and operating system. An ordinary network interface incurs ten or more microseconds of latency [6] because of a data delivery scheme that involves context switches. In contrast, today's HPC interconnect devices access remote memory in the sub-microsecond range [3, 4, 1], at the price of overheads for establishing a connection to the remote node and registering memory with the device. These overheads are accepted as a tradeoff to resolve performance bottlenecks. Therefore, today's interconnect designs target large or iterative data transfers.
Memory utilization for communication will become a major problem in the exascale era because an exascale parallel computation job will include millions of processes and conventional communication layers require preprovisioned memory of a size proportional to the number of processes in order to avoid performance bottlenecks. One approach to addressing the memory utilization problem is to allocate a data sink on a remote node dynamically each time a node needs to send data to the remote node. In this paper, a memory management scheme that can provide memory to a process on a remote node is called "global memory management," and memory provided by the global memory management scheme is called "global memory." A global memory management scheme that can be accessed via the interconnect is an ideal solution that does not waste local processing resources, but it is also different from today's local memory management schemes.
As a stepping stone to global memory management, we propose an asynchronous global heap that virtually achieves global memory management while minimizing modifications to the operating system and the runtime. In addition, new hardware features for global memory management are also co-designed, assuming a system-on-chip whose device I/O bus is customized to enhance the memory access capability of the integrated interconnect device.
In this paper, we propose an asynchronous global heap as a primitive global memory management scheme. In Section 2, the asynchronous global heap is introduced. In Section 3, the results of evaluating the asynchronous global heap functions are shown. In Section 4, this work is summarized, and possible future work is listed.
2 Asynchronous Global Heap
We propose an asynchronous global heap that includes a concurrent data structure allocated on each process and application programming interfaces (APIs) to provide free memory to other processes. The control variables and heap body of the data structure are designed to be accessed via the remote direct memory access (RDMA) features of an interconnect device. Therefore, the control variables and heap body are allocated in a memory region that can be accessed via RDMA. With today's operating systems and runtimes, RDMA-capable memory regions are required to be pinned and registered with an interconnect device prior to RDMA access. Therefore, the data structure is placed in a locally provided memory region, and the free memory managed by the operating system decreases even though the heap body is totally unused at the time. We choose the memory allocators of runtimes, including the malloc() function, to resolve this prior memory consumption problem while leaving the operating system and the memory registration scheme unchanged. The implementation of an asynchronous global heap uses only the memory allocation system calls of the operating system, so the memory allocator of a runtime can obtain free memory from an asynchronous global heap.
2.1 Data Structure
Figure 1 shows the data structure of an asynchronous global heap. To isolate fragmentation caused by local and remote memory allocation, local memory allocation consumes free memory from the top, and global memory allocation consumes free memory from the bottom. A control variable that indicates the bottom of the free memory is called a "global break" (gbrk), and another control variable that indicates the bottom of the memory allocated locally is called a "global limit" (glimit). The gbrk variable may be changed by global memory allocators, and the glimit variable may be changed by a local memory allocator. All memory allocators using the asynchronous global heap read both the gbrk and glimit variables to calculate the size of the free memory space, so accessing these variables requires exclusive control.
Each process holds the dynamic state of its own asynchronous global heap, which consists of three control variables: a lock, the gbrk, and the glimit. A control structure including these control variables is placed in conjunction with the heap body. Caching the dynamic state of another process is prohibited.
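The layout above can be sketched as a small single-process emulation. The field names (gbrk, glimit) follow the paper; the concrete address model (top of the heap at address 0, glimit growing toward higher addresses) and the allocation helpers are assumptions for illustration only.

```python
# Hypothetical emulation of the asynchronous global heap layout:
# local allocations consume free memory from the top, global
# allocations from the bottom, and the free space lies between
# glimit and gbrk.

class AsyncGlobalHeap:
    def __init__(self, size):
        self.size = size      # heap body size in bytes
        self.glimit = 0       # bottom of the locally allocated region
        self.gbrk = size      # bottom of the free memory

    def free_size(self):
        # Every allocator must read BOTH gbrk and glimit to size
        # the free space, hence the need for exclusive control.
        return self.gbrk - self.glimit

    def local_alloc(self, n):
        # Local allocation: consume free memory from the top;
        # only the local allocator moves glimit.
        if self.free_size() < n:
            return None
        addr = self.glimit
        self.glimit += n
        return addr

    def global_alloc(self, n):
        # Global allocation: consume free memory from the bottom;
        # only global allocators move gbrk.
        if self.free_size() < n:
            return None
        self.gbrk -= n
        return self.gbrk

heap = AsyncGlobalHeap(4096)
a = heap.local_alloc(1024)   # local region occupies [0, 1024)
b = heap.global_alloc(512)   # global region occupies [3584, 4096)
```

Because the two regions grow toward each other from opposite ends, fragmentation caused by local allocation stays isolated from fragmentation caused by remote allocation, as the text describes.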
2.2 Application Programming Interfaces
The API design of an asynchronous global heap is similar to the data segment system calls of Linux, which include the brk() system call. There are five API functions. The ginit() function initializes the asynchronous global heap. The gbrk() function changes the global break. The sgbrk() function moves the global break. The gglimit() function returns the free memory information. The sglimit() function changes the global limit. To obtain free memory from other nodes, the gbrk(), sgbrk(), and gglimit() functions have an input argument that specifies a process identification number. Initializing an asynchronous global heap involves a collective communication process that gathers the identifiers of the RDMA-capable memory regions of all processes. API functions other than ginit() do not register local memory or exchange identifiers of registered memory.

Figure 1: Data structure of asynchronous global heap
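The five functions above can be sketched as follows, mirroring the brk()/sbrk() analogy drawn in the text: gbrk() sets an absolute break, sgbrk() moves it by an increment, and gglimit() reports the free-memory information. The concrete signatures, return values, and the dictionary standing in for RDMA-reachable state are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the asynchronous global heap API.
# gbrk(), sgbrk(), and gglimit() take a process identification
# number so a remote node's heap can be targeted; ginit() and
# sglimit() act only on the local heap.

MY_PID = 0
heaps = {}   # pid -> control state; stands in for RDMA-reachable memory

def ginit(size):
    # In the real system this is a collective step that gathers
    # RDMA memory-region identifiers from all processes.
    heaps[MY_PID] = {"glimit": 0, "gbrk": size}

def gbrk(pid, pos):
    # Set the global break of pid's heap to an absolute position.
    heaps[pid]["gbrk"] = pos

def sgbrk(pid, incr):
    # Move the global break by incr bytes; returning the previous
    # break mirrors sbrk() semantics (an assumption).
    old = heaps[pid]["gbrk"]
    heaps[pid]["gbrk"] = old + incr
    return old

def gglimit(pid):
    # Return the free-memory information of pid's heap.
    h = heaps[pid]
    return h["glimit"], h["gbrk"]

def sglimit(pos):
    # Set this process's global limit; used by the local allocator.
    heaps[MY_PID]["glimit"] = pos

ginit(4096)
old = sgbrk(0, -512)   # a global allocator claims 512 bytes from the bottom
sglimit(1024)          # the local allocator has consumed 1024 bytes
lo, hi = gglimit(0)    # free space is hi - lo bytes
```

Note that, as in the text, nothing after ginit() registers memory or exchanges identifiers; the remaining calls only manipulate the already-published control variables.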
2.3 Co-designing New RDMA Features
Global memory obtained from an asynchronous global heap can be accessed with ordinary RDMA put and get features. To access the control variables efficiently, we introduce three additional RDMA features: RDMA atomic compare-and-swap (CAS), RDMA remote fence, and interoperable atomic operations. RDMA atomic CAS performs compare-and-swap on remote memory without interruption from any other RDMA accesses and efficiently implements mutual exclusion of the control structure. RDMA remote fence forces the memory accesses of RDMA requests sent prior to the fence to complete before the memory accesses of RDMA requests sent after the fence are started. Implementing an RDMA remote fence feature that handles memory access ordering on the remote side may reduce the latency of reading the control variables, because an RDMA get request can be sent speculatively, immediately after the RDMA remote fence that follows the RDMA atomic CAS which tries to acquire the lock. Interoperable atomic operations are atomic operations of the processor and of the RDMA engine implemented carefully together, so that processor atomic instructions are ensured not to be interrupted by any RDMA memory accesses, and RDMA atomic operations are ensured not to be interrupted by any processor memory accesses. Interoperable atomic operations allow any control variable of an asynchronous global heap to be accessed by processor memory access instructions as long as the control variables belong to the accessing process itself.
3 Evaluation
3.1 Evaluation Environment
In this section, we evaluate the execution time of the gbrk(), gglimit(), and sglimit() functions, which are assumed to be called frequently from memory allocators that support the asynchronous global heap. A prototype system of the K computer located at Fujitsu's Numazu Plant was used for the evaluations. The processor was a SPARC64(TM) VIIIfx [5], which has an operating frequency of 2 GHz and eight cores. The interconnect device was a Tofu interconnect [2]. The experiment programs used the Message Passing Interface (MPI) and were executed with two MPI processes. MPI process rank 0 repeatedly executed an asynchronous global heap API function 1001 times to access the same asynchronous global heap. The experiment programs measured the time between starting the second execution of the function and finishing the last one.
The experiment programs used the Tofu library (tlib) in conjunction with MPI for RDMA communication. The tlib is a low-level API for using the features of the Tofu interconnect directly. Each MPI process created a thread to emulate RDMA atomic CAS and interoperable atomic operations. The RDMA remote fence was implemented by using the strong order flag feature of the Tofu interconnect [1].
The average execution time of each function was measured for each of the four different combinations of access patterns and assumed RDMA features. For the remote access patterns, MPI process rank 0 accessed an asynchronous global heap on MPI process rank 1, and there were two options: assume or do not assume the RDMA remote fence feature. For the local access patterns, MPI process rank 0 accessed its own asynchronous global heap, and there were two options: assume or do not assume the interoperable atomic operations feature.
3.2 Evaluation Result
Figure 2 shows the evaluation results. The graph shows the measured average execution time of the asynchronous global heap API functions. The vertical axis has a logarithmic scale, and the unit of time is microseconds. The results of local access with RDMA atomics were shorter than those of remote access by 35 to 48%. The difference comes from the access method for the control variables other than the lock variable. For the local access patterns, only the lock variable was accessed by RDMA, and the others were accessed by processor instructions. For the remote access patterns, all control variables were accessed by RDMA. The results of remote access with fence were shorter than those of remote access without fence by 25 to 37%. The difference comes from the speculative accesses to the control variables with fence, which hide the latencies of mutual exclusion. As for hiding mutual exclusion latencies, the results of remote access with fence were closer to those of local access with RDMA atomics than to those of remote access without fence. For the local access patterns, interoperable atomic operations reduced the execution time by 99% because all control variables were accessed by processor instructions when interoperable atomic operations were assumed.
Figure 2: Evaluated average execution time of asynchronous global heap API functions

4 Summary and Future Work
In this paper, we proposed an asynchronous global heap as a stepping stone toward an ideal global memory management scheme that allows other nodes to obtain memory directly via an interconnect. An asynchronous global heap provides free memory to both local and global memory allocators. For efficient access to the control variables, three RDMA features were also introduced: RDMA atomic CAS, RDMA remote fence, and interoperable atomic operations. The results of evaluating the API functions showed that the RDMA remote fence reduced the access time to a remote asynchronous global heap by 25 to 37%, and interoperable atomic operations reduced the access time to a local asynchronous global heap by 99%.
Our possible future work includes investigating an extensible asynchronous global heap, another global memory management scheme that causes no fragmentation, low-latency algorithms for global memory allocators to merge fragments on deallocation, and intelligent strategies for user programs to reuse allocated data sinks.
References
[1] Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Toshiyuki Shimizu, and Yuzo Takagi. The Tofu Interconnect. IEEE Micro, 32(1):21-31, 2012.
[2] Yuichiro Ajima, Shinji Sumimoto, and Toshiyuki Shimizu. Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers. IEEE Computer, 42(11):36-40, 2009.
[3] Robert Alverson, Duncan Roweth, and Larry Kaplan. The Gemini System Interconnect. In Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, pages 83-87, 2010.
[4] Dong Chen et al. The IBM Blue Gene/Q Interconnection Network and Message Unit. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, number 26, 2011.
[5] Takumi Maruyama et al. SPARC64 VIIIfx: A New-Generation Octocore Processor for Petascale Computing. IEEE Micro, 30(2):30-40, 2010.
[6] Justin (Gus) Hurwitz and Wu-chun Feng. End-to-end performance of 10-gigabit ethernet on commodity systems. IEEE Micro, 24(1):10-22, 2004.