Performance of RDMA-Capable Storage Protocols on Wide-Area Network

Managed by UT-Battelle
for the Department of Energy
Weikuan Yu
Nageswara S.V. Rao
Pete Wyckoff*
Jeffrey S. Vetter
Ohio Supercomputer Center*
PDSW'08, Austin, TX

InfiniBand Clusters around the World
  Ranger (US)
  SGI (US)
  CEA (France)
  Tsubame (Japan)
  Dawning (China)
  EKA (India)

The Problem of Computing Islands
  Islands of InfiniBand (IB) clusters
    More IB clusters are being deployed
    Some are already connected, e.g. through TeraGrid
    But only via TCP/IP protocols
  Data transfer across these islands
    Ever-greater data movement capabilities are needed
    GridFTP, BBCP, or other special storage configurations
  TCP performance over long distances can be low
    With 10GigE on UltraScience Net (no tuning)
      9.2 Gbps at 0.2 mile
      8.2 Gbps at 1400 miles
      2.3-2.5 Gbps at 6600+ miles
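
One common way to reason about why untuned TCP slows down with distance is the bandwidth-delay product: the number of bytes that must be in flight to keep a link busy. The sketch below is illustrative only. It assumes a propagation speed of about 200,000 km/s in fiber and treats the quoted distances as one-way path lengths; neither assumption comes from the slides, and the function names are hypothetical.

```python
# Assumption (not from the slides): light in optical fiber travels at
# roughly two thirds of c, about 200,000 km/s.
FIBER_KM_PER_S = 200_000.0
KM_PER_MILE = 1.609344


def round_trip_seconds(one_way_miles: float) -> float:
    """Propagation-only round-trip time for a path of the given one-way length."""
    return 2.0 * one_way_miles * KM_PER_MILE / FIBER_KM_PER_S


def window_bytes(link_gbps: float, one_way_miles: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the link busy."""
    bytes_per_second = link_gbps * 1e9 / 8.0
    return bytes_per_second * round_trip_seconds(one_way_miles)


if __name__ == "__main__":
    # Distances quoted on the slide, evaluated for a 10 Gbps link.
    for miles in (0.2, 1400, 6600):
        rtt_ms = round_trip_seconds(miles) * 1e3
        mib = window_bytes(10.0, miles) / 2**20
        print(f"{miles:>7} miles: RTT ~ {rtt_ms:8.3f} ms, window ~ {mib:8.2f} MiB")
```

Under these assumptions the window needed at 6600 miles is far larger than typical default TCP buffer sizes, which is consistent with the drop in untuned throughput reported above.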

RDMA (IB) in Clusters and Local Area Networks
  Sub-microsecond latency
  Superb bandwidth (32 Gbps with IB QDR)
  Heavily used for clustering
  Getting popular in storage environments
    NFS over RDMA (NFSoRDMA)
    SCSI RDMA Protocol (SRP)
    iSCSI over RDMA (iSER)
  [Figure: two protocol stacks (Applications; MPI and NFS/iSER/SRP; Verbs; InfiniBand HCA), labeled 1 µsec]

Sample Performance of RDMA-based Storage
  RDMA enables good iSCSI bandwidth within LAN
  Nearly doubled the performance for iSCSI

Feasibility of RDMA (IB) on WAN
  Long-range extensions for InfiniBand are available
    Network Equipment Technologies (NET): NX5010
    Obsidian Research: Longbow
  Long latency (10^4 to 10^5 µsec)
  High bandwidth is still feasible
    Good distance scalability and tolerance to interfering traffic
    Good network throughput and MPI-level performance
  Can RDMA provide a good transport protocol for storage on WAN?
  [Figure: two protocol stacks (Applications; MPI and NFS/iSER/SRP; Verbs; InfiniBand HCA), labeled 10^4 to 10^5 µsec]

Experimental Environment
  Hardware
    Long-range IB extension devices from NET (Network Equipment Technologies, Inc)
    Mellanox PCI-Express 4x DDR HCAs (InfiniHost-III and Connect-X)
  Software Packages
    OFED-1.3 from openfabrics.org
    Linux-2.6.25 with NFSoRDMA and iSER support
  Performance of RDMA-based Storage Protocols on WAN
    NFS over RDMA
    iSCSI over RDMA

UltraScience Net at ORNL
  Experimental WAN Network
    Oak Ridge, Atlanta, Chicago, Seattle, and Sunnyvale
    OC192 backbone connections
    4300 miles one way, 8600 miles loop-back

RDMA-based Transport
  Requests and replies become pure control messages and have to travel long distances on the WAN
  RDMA read (a round-trip operation) is used for clients to write data
  NFSoRDMA may need additional control messages for long arguments
  Further fragmentation results from the use of page-based operations
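
The cost of these round trips can be illustrated with a deliberately simplified latency model. The sketch below is not the NFSoRDMA or iSER implementation. It assumes one control message in each direction plus one full round trip per RDMA read, issued strictly one after another, and it ignores wire time and host overhead. The function name and the 0.1 s round-trip time are hypothetical choices for illustration.

```python
import math

PAGE_BYTES = 4096  # page-sized RDMA operations


def write_latency_seconds(write_bytes: int, rtt_s: float,
                          chunk_bytes: int = PAGE_BYTES) -> float:
    """Estimate the time for one client write under the simplified model.

    The request and the reply each cross the WAN once (half a round trip
    each), and the server pulls the data with one RDMA read per chunk,
    each costing a full round trip. Reads are assumed to be serialized.
    """
    reads = math.ceil(write_bytes / chunk_bytes)
    control_s = rtt_s  # request one way plus reply one way
    return control_s + reads * rtt_s


if __name__ == "__main__":
    rtt = 0.1  # hypothetical round-trip time in seconds
    one_mib = 1 << 20
    print("1 MiB in 4 KB reads:", write_latency_seconds(one_mib, rtt), "s")
    print("1 MiB in one read:  ", write_latency_seconds(one_mib, rtt, one_mib), "s")
```

In this model the same 1 MiB write takes 257 round trips when split into pages and only 2 when moved in a single read, which is why fragmentation matters far more on a WAN than within a cluster.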

RDMA on WAN
  RDMA has good network-level performance within short-distance WAN
  High bandwidth at long distance is only possible for large messages
  Low RDMA-read performance for page-based messages (4KB), even at 0.2 mile, when using InfiniHost-III HCAs
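
A rough way to see why only large messages reach high bandwidth at long distance is to compute the effective rate when a single message is outstanding at a time: each message costs one round trip plus its own wire time. This is a minimal sketch. The 8 Gbps link rate and the 0.1 s round-trip time are illustrative assumptions rather than measurements from the talk, and the function name is hypothetical.

```python
def effective_gbps(message_bytes: int, rtt_s: float, link_gbps: float) -> float:
    """Effective throughput with one message in flight at a time."""
    bits = message_bytes * 8
    wire_s = bits / (link_gbps * 1e9)  # time to serialize the message
    return bits / (rtt_s + wire_s) / 1e9


if __name__ == "__main__":
    rtt = 0.1    # assumed round-trip time in seconds
    link = 8.0   # assumed usable link rate in Gbps
    for size in (4096, 1 << 20, 1 << 26, 1 << 30):
        rate = effective_gbps(size, rtt, link)
        print(f"{size:>12} bytes -> {rate:8.4f} Gbps")
```

With these numbers a 4 KB message achieves well under 1 Mbps, while a 1 GiB message approaches the link rate.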

NFS over RDMA
  NFS over RDMA achieves good bandwidth within short distance
  But significant optimizations are needed for long distance

NFS - Large block size
  NFS over IPoIB-CM benefits from large block size
  NFS over RDMA needs to support large block size for better fit on long-distance WAN

NFS over RDMA - using Connect-X
  Better RDMA read in Connect-X improves the performance of file write for NFS over RDMA
  Performance at long distance is yet to be determined

iSCSI over RDMA (iSER)
  RDMA enables high-performance iSCSI within short distance
  RDMA has good promise over long distance as shown with large messages

Perspectives
  Long-range InfiniBand
    InfiniBand over SONET is promising
    Storage protocols are not yet exploiting the bandwidth potential of RDMA at long distance
  RDMA-based Storage on WAN
    Need to enable large block sizes
    Need to avoid page-based RDMA operations in NFS
      Utilize IB FRMR support to avoid small RDMA operations
    Need to allow more concurrent RDMA read operations
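
The last two recommendations can be related through a back-of-the-envelope estimate: to keep a long link busy, the bytes in flight must cover the bandwidth-delay product, so smaller reads require proportionally more of them outstanding at once. The sketch below is an illustration under assumed numbers (8 Gbps, 0.1 s round trip), not a statement about any specific HCA limit. The function name is hypothetical.

```python
import math


def reads_in_flight_needed(link_gbps: float, rtt_s: float, read_bytes: int) -> int:
    """Concurrent RDMA reads needed to cover the bandwidth-delay product."""
    bdp_bytes = link_gbps * 1e9 / 8 * rtt_s
    return max(1, math.ceil(bdp_bytes / read_bytes))


if __name__ == "__main__":
    link, rtt = 8.0, 0.1  # assumed link rate (Gbps) and round-trip time (s)
    for size in (4096, 1 << 20, 1 << 24):
        n = reads_in_flight_needed(link, rtt, size)
        print(f"{size:>10}-byte reads: {n} outstanding")
```

Under these assumptions, page-sized reads would need tens of thousands of operations in flight, whereas 1 MiB reads would need fewer than a hundred, so larger operations and more concurrency work together.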

Acknowledgment
  Network Equipment Technologies, Inc
    Andrew DiSilvestre
    Rich Erikson
    Brad Chalker
  ORNL
    Susan Hicks
    Philip Roth
  Mellanox