Performance of HPC Middleware over Infiniband WAN


Oct 24, 2013 (4 years and 6 months ago)


Presented by:
Ashish Kumar Singh
Designing Efficient FTP Mechanisms
for High Performance Data-Transfer
over Infiniband
High Performance Data Transfer in
Grid Environment Using GridFTP
over Infiniband
Performance of HPC Middleware
over Infiniband WAN

S. Narravula, H. Subramoni, P. Lai, R. Noronha and D.K. Panda
• Multi-Cluster needs of organizations
• Advent of long haul Infiniband (IB WAN)
– Infiniband range extenders like Intel Connects and
Obsidian Longbows
• IB applications and libraries such as MPI and NFS over
RDMA were developed for intra-cluster use
• Analyzes the general communication
performance of HPC middleware
• Proposes basic design optimizations for
enhancing communication performance over IB WAN
• Demonstrates the potential benefits obtained
by enhancing internal protocols of middleware
IB Range Extension
• Obsidian Longbows provide range extension
for Infiniband fabrics over 10 Gigabits/s WAN

Verbs-level Performance (UD)
•UD does not involve any acknowledgements from the remote side
•UD is scalable with higher delays
•Higher level protocols need to take care of reliability and flow control mechanisms
Verbs-level Performance (RC)
•RC guarantees in-order delivery using ACKs and NACKs, which limits the number of messages
that can be in flight to a maximum supported window size
•A few large messages are enough to fill the pipeline, so large messages are less affected
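The window limit described above can be illustrated with a rough model. This is only a sketch under stated assumptions: the link rate, the window of 16 in-flight messages, and the function name are illustrative and are not taken from the slides or from the measured hardware.

```python
def rc_throughput_bytes_per_s(msg_size, window_msgs, rtt_s, link_bytes_per_s):
    """Rough upper bound on throughput with at most `window_msgs` unacknowledged messages."""
    # Time to put one message on the wire.
    wire_time = msg_size / link_bytes_per_s
    # The sender can push one full window per (round trip + one message time).
    window_rate = window_msgs * msg_size / (rtt_s + wire_time)
    # The link itself is the other ceiling.
    return min(link_bytes_per_s, window_rate)

LINK = 1.0e9   # assumed usable link rate in bytes/s (illustrative)
WINDOW = 16    # assumed in-flight message window (illustrative)

for delay_us in (10, 100, 1000, 10000):
    rtt = 2 * delay_us * 1e-6  # round trip = twice the one-way WAN delay
    small = rc_throughput_bytes_per_s(4 * 1024, WINDOW, rtt, LINK)
    large = rc_throughput_bytes_per_s(4 * 1024 * 1024, WINDOW, rtt, LINK)
    print(f"{delay_us:>6} us  4 KiB: {small / LINK:6.1%}  4 MiB: {large / LINK:6.1%}")
```

In this model the small-message rate falls as the delay grows, while a window of large messages still covers the round trip and stays at the link rate.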
IPoIB Performance (UD)
•TCP needs larger window sizes to achieve good bandwidth
•More streams – more UD packets with independent flow control, so more outstanding
packets that can be pushed out from source at any given time frame
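Both points above follow from the bandwidth-delay product. The sketch below is a simplified model, not a measurement: the link rate, the 64 KiB per-stream window, and the function names are assumptions chosen for illustration.

```python
def required_window_bytes(link_bytes_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to keep the link busy."""
    return link_bytes_per_s * rtt_s

def aggregate_throughput(streams, window_bytes, rtt_s, link_bytes_per_s):
    """Each stream moves at most one window per round trip; the link caps the total."""
    return min(link_bytes_per_s, streams * window_bytes / rtt_s)

LINK = 1.0e9        # assumed usable link rate in bytes/s (illustrative)
WINDOW = 64 * 1024  # assumed per-stream TCP window in bytes (illustrative)
RTT = 2e-3          # round trip for a 1 ms one-way delay

print("window needed:", required_window_bytes(LINK, RTT), "bytes")
for n in (1, 2, 4, 8, 16, 32):
    print(n, "streams:", aggregate_throughput(n, WINDOW, RTT, LINK) / 1e6, "MB/s")
```

With a fixed per-stream window, adding streams raises the number of outstanding bytes, which is why parallel streams help until the link rate is reached.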
IPoIB Performance (RC)
•The advantage of the RC transport mode for IPoIB is that RC can handle larger packet sizes.
Larger packets achieve better bandwidth and reduce per-byte TCP processing
MPI-level Performance (Delay)
•Trends similar to basic verbs-level evaluation
MPI-level Performance (Tuning)
•Protocol choice changes for medium sized messages in high delay scenario
•Rendezvous protocol involves an additional message exchange
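Why the protocol choice shifts with delay can be seen in a simple cost model. This is a sketch under assumptions, not the MPI library's actual implementation: the eager path is modeled as one trip plus a receiver-side copy, the rendezvous path as a handshake (two extra trips) followed by a zero-copy transfer, and the bandwidth figures are illustrative.

```python
def eager_time(size, one_way_delay_s, bw, copy_bw):
    # One trip across the link, plus a copy out of the receiver's eager buffer.
    return one_way_delay_s + size / bw + size / copy_bw

def rendezvous_time(size, one_way_delay_s, bw):
    # Request, reply, then the data itself: three trips across the link, no copy.
    return 3 * one_way_delay_s + size / bw

def crossover_size(one_way_delay_s, copy_bw):
    # Message size at which both models cost the same:
    # size / copy_bw == 2 * one_way_delay_s
    return 2 * one_way_delay_s * copy_bw

BW = 1.0e9       # assumed link bandwidth in bytes/s (illustrative)
COPY_BW = 2.0e9  # assumed memory copy bandwidth in bytes/s (illustrative)

for delay_us in (10, 100, 1000):
    d = delay_us * 1e-6
    print(f"{delay_us:>5} us delay: eager wins below ~{crossover_size(d, COPY_BW):,.0f} bytes")
```

In this model the threshold grows linearly with the delay, so medium-sized messages that suit rendezvous inside a cluster are better sent eagerly over a high-delay link.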
MPI-level Performance (Streams)
•For small messages, messaging rate increases proportionally with number of
communicating streams
•For higher delay networks, additional parallel streams are better for overall network
bandwidth utilization
Panels: (a) 100 us delay, (b) 1 ms delay, (c) 10 ms delay
MPI-level Performance (Collective)
•Simple optimized broadcast that performs the bcast operation hierarchically over the two
connected clusters, minimizing the traffic on the WAN
•For small messages, as the WAN link is able to handle all the traffic, the congestion is very
Panels: (a) 10 us delay, (b) 100 us delay, (c) 1000 us delay
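The WAN saving of a hierarchical broadcast can be sketched by counting how many messages cross between the clusters. The comparison baseline (a flat binomial tree with ranks placed in contiguous blocks) and all function names are assumptions for illustration; the slides only state that the broadcast is performed hierarchically over the two clusters.

```python
def binomial_edges(ranks):
    """(sender, receiver) pairs of a binomial-tree broadcast rooted at ranks[0]."""
    edges = []
    have = 1  # number of ranks that already hold the data
    while have < len(ranks):
        # Every rank holding the data sends to one rank that does not.
        for i in range(min(have, len(ranks) - have)):
            edges.append((ranks[i], ranks[i + have]))
        have *= 2
    return edges

def hierarchical_edges(cluster_a, cluster_b):
    """One WAN message between cluster leaders, then a local broadcast in each cluster."""
    return ([(cluster_a[0], cluster_b[0])]
            + binomial_edges(cluster_a)
            + binomial_edges(cluster_b))

def wan_crossings(edges, cluster_of):
    return sum(1 for s, r in edges if cluster_of[s] != cluster_of[r])

cluster_a = list(range(0, 8))
cluster_b = list(range(8, 16))
cluster_of = {r: "A" for r in cluster_a}
cluster_of.update({r: "B" for r in cluster_b})

flat = binomial_edges(cluster_a + cluster_b)
hier = hierarchical_edges(cluster_a, cluster_b)
print("flat WAN messages:", wan_crossings(flat, cluster_of))
print("hierarchical WAN messages:", wan_crossings(hier, cluster_of))
```

Both schedules send the same total number of messages; only the number that travel over the WAN link differs.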
• Applications usually absorb smaller network
delay fairly well
• Many protocols get severely impacted in high
delay scenarios
• Protocols can be optimized for high delay
scenarios to improve the performance
• With long-haul IB WAN technology cluster-of-
clusters architecture for HPC systems is
Designing Efficient FTP Mechanisms
for High Performance Data-Transfer
over Infiniband

P. Lai, H. Subramoni, S. Narravula, A. Mamidala and D.K. Panda
• FTP is the most popular method to transfer bulk data
• Typically used in applications like data staging,
content replication and remote site backup
• Advent of long haul Infiniband (IB WAN) made
cluster-of-cluster architecture possible
• IPoIB and SDP lose significant native
Possible Approaches
•Existing sockets based FTP through intermediate drivers (#1, #2 and #3). IPoIB
and SDP are the popular schemes for this choice.
•#4, new FTP mechanism using the Native IB features.
Performance of Communication
•Native IB verbs achieve much higher bandwidth as compared to other protocols.
•Performance of FTP implementations, e.g., GridFTP, over IPoIB and SDP is even worse.
• Design an Advanced Data Transfer Service
(ADTS) that leverages zero-copy capabilities
• Leverage ADTS to design a high performance
zero-copy FTP library
• Provide a robust and inter-operable
mechanism to support zero-copy capable
clients and the traditional TCP/UDP clients
• Performance study
FTP-ADTS Architecture
•Clients may be capable of performing zero-copy data transfer or only support the
TCP/UDP based communication.
•Once the transport protocol is negotiated, the Data Connection Management
component initiates a connection.

Design of Zero-Copy Channel
• Memory Semantics using RDMA vs. Channel
semantics using Send-Recv
• Drawbacks of Memory Semantics:
– Pre-allocation, registration and communication of
target RDMA buffers
– Explicit flow control
– Notification of completion
– Latency benefits for small messages are marred by
high network delay
Design of Zero-Copy Channel
• Advantages of Send-Recv Semantics:
– Identical zero-copy benefits
– Simpler flow control, with use of SRQ
– Sender is not throttled down due to lack of buffers
on remote node
– Both RC and UD transports available
Design Enhancements
• Buffer/File Management component keeps a
small set of pre-allocated and registered buffers
• Memory Registration Cache and Persistent
• Pipelined Data Transfers
• Prefork Server to handle bursts of requests
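The idea behind a memory registration cache can be sketched as follows. This is a generic illustration, not the ADTS code: the `register` and `deregister` callables are hypothetical stand-ins for the real registration calls, and the least-recently-used eviction policy is an assumption.

```python
from collections import OrderedDict

class RegistrationCache:
    """Reuse registrations for buffers seen before; evict the least recently used."""

    def __init__(self, register, deregister, capacity):
        self._register = register      # hypothetical: (addr, length) -> handle
        self._deregister = deregister  # hypothetical: handle -> None
        self._capacity = capacity
        self._entries = OrderedDict()  # (addr, length) -> handle
        self.hits = 0
        self.misses = 0

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            # Cache hit: skip the expensive registration and mark as recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        handle = self._register(addr, length)
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            # Over capacity: release the least recently used registration.
            _, old_handle = self._entries.popitem(last=False)
            self._deregister(old_handle)
        return handle
```

A transfer loop that keeps reusing the same staging buffers would then pay the registration cost once per buffer rather than once per transfer.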
•Site Replication over IB WAN using FTP.
•FTP-ADTS speeds up data transfer by up to 65%.
•Much lower CPU utilization.

• Existing TCP or UDP or SCTP based FTP
implementations are not suitable for WAN
capable interconnects like IB WAN
• FTP-ADTS efficiently transfers data by
leveraging zero-copy operations of modern
High Performance Data Transfer in
Grid Environment Using GridFTP
over Infiniband

H. Subramoni, P. Lai, R. Kettimuthu and D.K. Panda
• GridFTP is a high-performance, secure, reliable
extension of the standard FTP optimized for
• Globus XIO framework, used to design
GridFTP, offers easy-to-use interface
• The framework hides the complications of
communication semantics of underlying
devices (network or disk)

• Combining the ease of use of Globus XIO
framework and the high performance achieved
through IB
• Enhancing the disk I/O performance of the
existing ADTS library
– By decoupling the network processing from disk I/O
• Evaluation of the design
– micro-benchmark level
– applications like Community Climate System Model
and ultra scale visualization

Design Issues
• Most HPC applications require movement of
huge amount of data
– Storing it requires comparatively slow hard disks and RAID arrays
– With low bandwidth provided by TCP/UDP based
FTP, this was not an issue
– Will be an issue for Globus ADTS XIO
• Solution
– decoupling of network from disk I/O

Design Changes in ADTS
•Introduction of:
•multiple threads (read, write and network thread)
•set of buffers to stage the data
•Read thread prefetches a set of locations from the disk and keeps it ready for the
network thread to send over the physical link
•How to avoid frequent context switches?
•Low and high water marks; the high water mark is set to the maximum size of the circular buffer

•The read thread reads from disk only when the number of available buffers falls below the low water mark
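A minimal sketch of the water-mark scheme, assuming a sender with one read thread and one network thread, is shown below. The class and function names, the default marks, and the in-memory source and sink are all illustrative; the write thread on the receiving side is not modeled. For brevity the reader holds the lock while it fetches blocks, which a real implementation would avoid.

```python
import threading
from collections import deque

class StagingBuffer:
    def __init__(self, high_water, low_water):
        self.high, self.low = high_water, low_water
        self.staged = deque()              # blocks read from disk, not yet sent
        self.cond = threading.Condition()
        self.done = False                  # set once the source is exhausted
        self.refills = 0                   # how many times the reader woke up

def read_thread(buf, source):
    blocks = iter(source)
    with buf.cond:
        while not buf.done:
            # Sleep until the buffer has drained below the low water mark.
            buf.cond.wait_for(lambda: len(buf.staged) < buf.low)
            buf.refills += 1
            # Then refill in one burst, up to the high water mark.
            while len(buf.staged) < buf.high:
                block = next(blocks, None)
                if block is None:
                    buf.done = True
                    break
                buf.staged.append(block)
            buf.cond.notify_all()

def network_thread(buf, sink):
    while True:
        with buf.cond:
            buf.cond.wait_for(lambda: buf.staged or buf.done)
            if not buf.staged:
                return                     # nothing staged and the reader is done
            block = buf.staged.popleft()
            buf.cond.notify_all()
        sink.append(block)                 # stand-in for sending over the link

def run_transfer(source, high_water=8, low_water=2):
    buf = StagingBuffer(high_water, low_water)
    sent = []
    threads = [threading.Thread(target=read_thread, args=(buf, source)),
               threading.Thread(target=network_thread, args=(buf, sent))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent, buf.refills

sent, refills = run_transfer(range(100))
print(len(sent), "blocks sent with", refills, "reader wake-ups")
```

Because the reader wakes only when the buffer is nearly empty and then refills in a burst, it wakes far fewer times than there are blocks, which is the point of the two marks.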

Application Level Improvements