A GPGPU transparent virtualization component for high performance computing clouds

G. Giunta, Raffaele Montella, G. Agrillo, G. Coviello

University of Napoli Parthenope
Department of Applied Science

{giunta,montella,agrillo,coviello}@uniparthenope.it

http://lmncp.uniparthenope.it
http://dsa.uniparthenope.it

uniParthenope


One of the five universities in Napoli (Italy)
20K students
5 faculties: Science and Technologies, Engineering, Economics, Law, Sports & Health

http://www.uniparthenope.it

Summary

Introduction
System Architecture and Design
Performance Evaluation
Conclusions and Developments

GPGPU virtualization service: gVirtuS

Introduction & Contextualization


High Performance Computing:
A stack of technologies enabling high performance computing for resource-demanding software

Grid computing:
A stack of technologies enabling resource sharing and aggregation

Many core:
The "enforcement" of Moore's law

GPGPUs:
Computationally efficient and cost effective high performance computing using many-core graphics processing units

Virtualization:
Hardware and software resource abstraction
One of the killer applications of many-core CPUs

Cloud computing:
A stack of technologies enabling hosting on virtualized resources
On demand resource virtualization
Pay as you go

High Performance Cloud Computing

Hardware:
High performance computing cluster
Multicore / multiprocessor computing nodes
GPGPUs

Software:
Linux
Virtualization hypervisor
Private cloud management software

+ Special ingredients…

gVirtuS

GPU Virtualization Service
Bound to nVidia/CUDA APIs
Hypervisor independent
Uses a front-end (FE) / back-end (BE) approach
FE/BE communicator independent

The key properties of the proposed system are:
1. Enabling the execution of CUDA kernels in a virtualized environment
2. With an overall performance not too far from that of non-virtualized machines

System Architecture and Design

The CUDA device is under the control of the hypervisor
Interface between guest and host machine
Any GPU access is routed via the FE/BE
The management component controls invocation and data movement

The Communicator

Provides high performance communication between virtual machines and their hosts.
The choice of the hypervisor deeply affects the efficiency of the communication; the available options are compared in the table below.

Hypervisor | FE/BE comm | Notes
No hypervisor | Unix sockets | Used for testing purposes
Generic | TCP/IP | Used for communication testing purposes, but interesting…
Xen | XenLoop | Xen runs directly on top of the hardware through a custom Linux kernel; XenLoop provides a communication library between guest and host machines; it implements low latency and wide bandwidth TCP/IP and UDP connections, is application transparent, and offers an automatic discovery of the supported VMs
VMware | Virtual Machine Communication Interface (VMCI) | Commercial hypervisor running at the application level; VMCI provides a datagram API to exchange small messages, a shared memory API to share data, an access control API to control which resources a virtual machine can access, and a discovery service for publishing and retrieving resources
KVM/QEMU | vmChannel | Linux loadable kernel module now embedded as a standard component; supplies high performance guest/host communication based on a shared memory approach
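Because the FE/BE pair only exchanges byte streams, every transport in the table above can sit behind one small interface. The sketch below is a hypothetical C++ abstraction (the class and method names are invented for this example, not taken from the gVirtuS sources) showing the minimum a communicator has to provide.

```cpp
#include <cstddef>
#include <string>

// Hypothetical transport abstraction: one concrete implementation per
// communicator (AF_UNIX sockets, TCP/IP, XenLoop, VMCI, vmChannel, vmSocket).
// Names are illustrative only.
class Communicator {
public:
    virtual ~Communicator() = default;
    virtual void Connect(const std::string& endpoint) = 0;       // FE side
    virtual void Accept(const std::string& endpoint) = 0;        // BE side
    virtual size_t Read(void* buffer, size_t size) = 0;          // blocking read
    virtual size_t Write(const void* buffer, size_t size) = 0;   // blocking write
    virtual void Close() = 0;
};

// e.g. class TcpCommunicator : public Communicator { ... };
//      class VmSocketCommunicator : public Communicator { ... };
```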

How gVirtuS works

The CUDA library:
deals directly with the hardware accelerator
interacts with a GPU virtualization front end

The Front End:
packs the library function invocation
sends it to the back end

The Back End:
deals with the hardware using the CUDA driver
unpacks the library function invocation
maps memory pointers
executes the CUDA operation
retrieves the results
sends them to the front end using the communicator

The Front End:
interacts with the CUDA library by terminating the GPU operation
provides results to the calling program

This design is:
hypervisor independent
communicator independent
accelerator independent

The same approach could be followed to implement different kinds of virtualization.
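To make the pack/send/unpack flow concrete, here is a minimal sketch of what a front-end stub for one CUDA runtime call could look like. It is an illustration only, not the actual gVirtuS source: the Message type, the roundtrip_to_backend() transport helper, the ROUTINE_CUDA_MALLOC id, and the fe_cudaMalloc name are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical wire format for one marshalled CUDA call (illustrative only).
enum Routine : uint32_t { ROUTINE_CUDA_MALLOC = 1 };

struct Message {
    std::vector<uint8_t> bytes;
    void put_u32(uint32_t v) { append(&v, sizeof v); }
    void put_u64(uint64_t v) { append(&v, sizeof v); }
    void append(const void* p, size_t n) {
        const uint8_t* b = static_cast<const uint8_t*>(p);
        bytes.insert(bytes.end(), b, b + n);
    }
};

// Assumed transport primitive: ships the request to the BE over the active
// communicator (Unix socket, TCP, vmSocket, ...) and waits for the reply.
Message roundtrip_to_backend(const Message& request);

// FE stub in the spirit of the guest-side CUDA library replacement: pack the
// call id and arguments, let the BE execute the real cudaMalloc, and hand the
// resulting device pointer back to the caller as an opaque handle.
int fe_cudaMalloc(void** devPtr, size_t size) {
    Message req;
    req.put_u32(ROUTINE_CUDA_MALLOC);
    req.put_u64(static_cast<uint64_t>(size));

    Message rep = roundtrip_to_backend(req);

    uint32_t status = 0;
    uint64_t handle = 0;
    std::memcpy(&status, rep.bytes.data(), sizeof status);
    std::memcpy(&handle, rep.bytes.data() + sizeof status, sizeof handle);

    // The handle is never dereferenced in the VM; it is only passed back to
    // the BE in later calls (cudaMemcpy, kernel launches, cudaFree).
    *devPtr = reinterpret_cast<void*>(handle);
    return static_cast<int>(status);
}
```

The back end would perform the mirror-image steps: decode the routine id, call the real cudaMalloc through the CUDA driver, and marshal the status and pointer into the reply.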

Choices and Motivations

We focused on the VMware and KVM hypervisors.

vmSocket is the component we have designed to obtain a high performance communicator.

vmSocket exposes Unix sockets on virtual machine instances thanks to a QEMU device connected to the virtual PCI bus.

Hypervisor | FE/BE comm | Open source | Running as | Official CUDA drivers
Xen | XenLoop | Yes | Kernel | No
VMware | VMCI | No | Application | Shares the host OS ones
KVM/QEMU | vmChannel | Yes | Loadable kernel module | Shares the host OS ones

vmSocket

Programming interface: Unix sockets
Communication between guest and host: virtual PCI interface (QEMU has been modified)

GPU based high performance computing applications usually require massive data transfers between host (CPU) memory and device (GPU) memory…

FE/BE interaction efficiency:
there is no mapping between guest memory and device memory
the device memory pointers are never de-referenced on the host side
CUDA kernels are executed on the BE, where the pointers are fully consistent.
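To illustrate the last point, the following is ordinary CUDA runtime host code with nothing gVirtuS-specific in it; the assumption here is simply that, when it runs inside a guest, the CUDA library it links against is the gVirtuS FE, so each call is forwarded to the BE over vmSocket while the device pointer stays an opaque handle in the VM.

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float* d_data = nullptr;                        // opaque handle in the guest
    cudaMalloc((void**)&d_data, n * sizeof(float)); // forwarded to the BE
    cudaMemcpy(d_data, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);             // bulk host-to-device transfer
    // ... kernels launched with d_data execute on the BE side, where the
    // pointer is fully consistent with real device memory ...
    cudaMemcpy(host.data(), d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}
```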


vmSocket: virtual PCI device
Performance
Evaluation


CUDA Workstation


Genesis GE
-
i940 Tesla


i7
-
940 2,93 133 GHz
fsb
, Quad Core hyper
-
threaded 8 Mb cache CPU and 12Gb RAM.


1
nVIDIA

Quadro

FX5800 4Gb RAM video
card


2
nVIDIA

Tesla C1060 4
Gb

RAM




The testing system:


Fedora 12 Linux


nVIDIA

CUDA Driver, and the SDK/Toolkit
version 2.3.


VMware vs. KVM/QEMU (using different
communicators).

…from the CUDA SDK…

ScalarProd computes k scalar products of two real vectors of length m. Notice that each product is executed by a CUDA thread on the GPU, so no synchronization is required.

MatrixMul computes a matrix multiplication. The matrices are m x n and n x p, respectively. It partitions the input matrices in blocks and associates a CUDA thread to each block. As in the previous case, there is no need for synchronization.

Histogram returns the histogram of a set of m uniformly distributed real random numbers in 64 bins. The set is distributed among the CUDA threads, each computing a local histogram. The final result is obtained through synchronization and reduction techniques.
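As a point of reference for the first benchmark, a stripped-down kernel in the spirit of ScalarProd could look like the sketch below. It is a simplified re-implementation for illustration, not the SDK source: each thread computes one of the k products on its own, so no synchronization is needed.

```cpp
// One CUDA thread per scalar product: thread i accumulates the i-th pair of
// length-m vectors independently, so no inter-thread synchronization is needed.
__global__ void scalarProd(const float* a, const float* b, float* out,
                           int k, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= k) return;
    float sum = 0.0f;
    for (int j = 0; j < m; ++j)
        sum += a[i * m + j] * b[i * m + j];
    out[i] = sum;
}
```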

Test cases

Host/cpu: CPU without virtualization (no gVirtuS)
Host/gpu: GPU without virtualization (no gVirtuS)
Host/afunix: GPU without virtualization (with gVirtuS); measures the impact of the gVirtuS stack
Host/tcp: GPU without virtualization (with gVirtuS); measures the impact of the communication stack
*/cpu: CPU in a virtualized environment (no gVirtuS)
*/tcp: GPU in a virtualized environment (with gVirtuS)
VMware/vmci: GPU in a VMware virtual machine with gVirtuS using the VMCI based communicator
KVM/vmSocket: GPU in a KVM/QEMU virtual machine with gVirtuS using the vmSocket based communicator

[Performance plots: ScalarProd, MatrixMul, Histogram]

About the Results

Virtualization does not heavily affect computing performance.

gVirtuS-kvm/vmsocket gives the best efficiency, with the smallest impact with respect to the raw host/gpu setup.

The TCP based communicator could be used in a production scenario: the problem size and the computing speed-up justify the poor communication performance.

HPCC: High Performance Cloud Computing

A cluster of 12 Intel based computing nodes

Each node:
quad core 64 bit CPU / 4 GB of RAM
nVIDIA GeForce GT 9400 video card with 16 CUDA cores and 1 GB of memory

Software stack:
Fedora 12
Eucalyptus
KVM/QEMU
gVirtuS


HPCC Performance Evaluation

Ad hoc benchmark:
Matrix multiplication algorithms
Classic distributed memory parallel approach
The first matrix is distributed by rows, the second one by columns
Each process performs a local matrix multiplication

MPICH 2 as the message passing interface among processes
Each process uses the CUDA library to perform the local matrix multiplication.
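A minimal sketch of that benchmark structure is given below. It is illustrative only and simplified: rank 0 scatters row blocks of A, B is broadcast whole here rather than distributed by columns and rotated as in the actual benchmark, and each rank multiplies its block locally on the GPU (under gVirtuS each CUDA call made inside the VM is forwarded to the node's BE).

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Naive local multiply on the GPU: C (rows x cols) = A (rows x K) * B (K x cols).
__global__ void matmul(const float* A, const float* B, float* C,
                       int rows, int K, int cols) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows || c >= cols) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[r * K + k] * B[k * cols + c];
    C[r * cols + c] = acc;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024;                 // global matrix order; N % size == 0 assumed
    const int rowsPerRank = N / size;

    std::vector<float> A(rank == 0 ? N * N : 0, 1.0f);  // full A on rank 0 only
    std::vector<float> B(N * N, 1.0f);                   // B broadcast for brevity
    std::vector<float> Alocal(rowsPerRank * N), Clocal(rowsPerRank * N);

    MPI_Scatter(A.data(), rowsPerRank * N, MPI_FLOAT,
                Alocal.data(), rowsPerRank * N, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(B.data(), N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Local multiplication through the CUDA runtime; every call below crosses
    // the gVirtuS FE/BE boundary when executed in a virtual machine.
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, Alocal.size() * sizeof(float));
    cudaMalloc((void**)&dB, B.size() * sizeof(float));
    cudaMalloc((void**)&dC, Clocal.size() * sizeof(float));
    cudaMemcpy(dA, Alocal.data(), Alocal.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (rowsPerRank + 15) / 16);
    matmul<<<grid, block>>>(dA, dB, dC, rowsPerRank, N, N);
    cudaMemcpy(Clocal.data(), dC, Clocal.size() * sizeof(float), cudaMemcpyDeviceToHost);

    std::vector<float> C(rank == 0 ? N * N : 0);
    MPI_Gather(Clocal.data(), rowsPerRank * N, MPI_FLOAT,
               C.data(), rowsPerRank * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    MPI_Finalize();
    return 0;
}
```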

[Results plot: MatrixMul with MPI/CUDA/gVirtuS]
Future Directions

Enable shared memory communication between host and guest machines in order to improve virtual host-to-device (and vice versa) memory copying.

Implementation of OpenGL interoperability to integrate gVirtuS and VMGL for 3D graphics virtualization.

Integrate MPICH2 with vmSocket in order to implement a high performance message passing standard interface.

Conclusions

The gVirtuS GPU virtualization and sharing system enables thin Linux based virtual machines to be accelerated by the computing power provided by nVIDIA GPUs.

The gVirtuS stack makes it possible to accelerate virtual machines with a small impact on overall performance with respect to a pure host/gpu setup.

gVirtuS can be easily extended to other CUDA enabled devices.

This approach is based on highly proprietary and closed-source nVIDIA products.

http://osl.uniparthenope.it/projects/gvirtus/

Download, Try & Contribute!

gVirtuS implementation (1/2)

gVirtuS is implemented in C++
The BE and FE run as daemons
The BE runs on the host device
The FE runs on the virtual machine
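A rough, purely illustrative idea of how the BE daemon loop could be organized is sketched below; the Channel type and the dispatch_to_cuda_handler helper are invented names, not the actual gVirtuS classes. The daemon serves marshalled requests from a connected FE until the guest process closes the channel.

```cpp
#include <cstdint>
#include <vector>

// Minimal stand-ins for the transport and message types; illustrative only.
struct Message { std::vector<uint8_t> bytes; };

struct Channel {
    bool read(Message& out);          // blocking read of one marshalled request
    void write(const Message& msg);   // write one marshalled reply back to the FE
};

// Assumed dispatcher: decodes the routine id and arguments, calls the real
// CUDA runtime/driver on the host GPU, and marshals status and results.
Message dispatch_to_cuda_handler(const Message& request);

// BE daemon loop for one connected FE.
void backend_serve(Channel& channel) {
    Message request;
    while (channel.read(request)) {
        Message reply = dispatch_to_cuda_handler(request);
        channel.write(reply);
    }
}
```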

gVirtuS implementation (2/2)

The FE class diagram