
GPU-Based Cloud Computing

Dairsie Latimer, Petapath, UK


About Petapath


- Founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets

- Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme
  - These systems are testbeds for new ideas in the usability, scalability and efficiency of large computer installations

- Active in exploiting emerging standards for acceleration technologies; a member of the Khronos Group, sitting on the OpenCL working committee

- We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems

What is Heterogeneous or GPU Computing?

Computing with CPU + GPU

[Diagram: heterogeneous computing, an x86 CPU and a GPU connected by the PCIe bus]
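As a concrete illustration of the split, a minimal CUDA sketch (sizes and the kernel are illustrative): the x86 host owns the data, ships it across the PCIe bus, the GPU runs a trivial data-parallel kernel, and the result comes back the same way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial data-parallel kernel: each GPU thread scales one element.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // PCIe: host -> GPU

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);        // GPU does the work

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // PCIe: GPU -> host
    printf("x[0] = %f\n", h_x[0]);

    cudaFree(d_x);
    free(h_x);
    return 0;
}
```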


Low Latency or High Throughput?

CPU
- Optimised for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution

GPU
- Optimised for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation

NVIDIA GPU Computing Ecosystem

[Diagram: ecosystem linking customer requirements and customer applications to NVIDIA hardware solutions, the CUDA SDK & tools, the GPU architecture and deployment, via TPP/OEMs, VARs, ISVs, hardware architects, CUDA development specialists and CUDA training companies]

Science is Desperate for Throughput

[Chart: sustained Gigaflops (1 to 1,000,000,000) demanded by molecular simulations, 1982-2012: BPTI (3K atoms, 1982; ran for 8 months to simulate 2 nanoseconds), estrogen receptor (36K atoms, 1997), F1-ATPase (327K atoms, 2003), ribosome (2.7M atoms, 2006), chromatophore (50M atoms, 2010), bacteria with 100s of chromatophores (2012); the top of the range corresponds to 1 Petaflop and 1 Exaflop machines]

Power Crisis in Supercomputing

Performance            Year   Power          Household power equivalent
Gigaflop               1982   60,000 W       Block
Teraflop               1996   850,000 W      Neighborhood
Petaflop (Jaguar; Los Alamos)  2008   7,000,000 W    Town
Exaflop (projected)    2020   25,000,000 W   City

Enter the GPU: NVIDIA GPU Product Families

- Tesla™ - High-Performance Computing
- Quadro® - Design & Creation
- GeForce® - Entertainment

NEXT-GENERATION GPU ARCHITECTURE: ‘FERMI’

Introducing the ‘Fermi’ Tesla Architecture

The Soul of a Supercomputer in the body of a GPU

- 3 billion transistors
- Up to 2× the cores (the C2050 has 448)
- Up to 8× the peak double precision (DP) performance
- ECC on all memories
- L1 and L2 caches
- Improved memory bandwidth (GDDR5)
- Up to 1 Terabyte of GPU memory
- Concurrent kernels
- Hardware support for C++

[Die diagram: host interface, GigaThread scheduler and six DRAM interfaces surrounding the cores and a shared L2 cache]

Design Goal of Fermi

[Diagram: the GPU's performance sweet spot expanding along two axes, from data parallel toward instruction parallel workloads, and from large data sets toward many decisions]

- Expand the performance sweet spot of the GPU
- Bring more users, more applications to the GPU

Streaming Multiprocessor Architecture

[Diagram: one SM with instruction cache, dual scheduler/dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory and a uniform cache]

- 32 CUDA cores per SM (512 total)
- 8× peak double precision floating point performance
  - 50% of peak single precision
- Dual thread scheduler
- 64 KB of RAM for shared memory and L1 cache (configurable; see the sketch below)
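The shared memory/L1 split is chosen per kernel through the CUDA runtime. A minimal sketch (the kernel and sizes are illustrative; cudaFuncSetCacheConfig is the relevant call):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    // Choose how the SM's 64 KB of on-chip RAM is split for this kernel:
    //   cudaFuncCachePreferShared -> 48 KB shared memory / 16 KB L1 cache
    //   cudaFuncCachePreferL1     -> 16 KB shared memory / 48 KB L1 cache
    cudaFuncSetCacheConfig(saxpy, cudaFuncCachePreferL1);

    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(x); cudaFree(y);
    return 0;
}
```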





CUDA Core Architecture

[Diagram: the same SM block as on the previous slide, zoomed to a single CUDA core with dispatch port, operand collector, FP unit, INT unit and result queue]

- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Fused multiply-add (FMA) instruction for both single and double precision
- New integer ALU optimized for 64-bit and extended precision operations
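A small device-side sketch of the FMA guarantee (values are illustrative): fma() and the __fma_rn intrinsic request the fused operation, with a single rounding, explicitly, while a plain a*b+c leaves the choice to the compiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// FMA computes a*b+c with one rounding step; a separate multiply then add
// rounds twice and can lose accuracy in the last bit.
__global__ void fma_demo(double a, double b, double c, double *out)
{
    out[0] = a * b + c;          // compiler may or may not fuse this
    out[1] = fma(a, b, c);       // guaranteed fused multiply-add
    out[2] = __fma_rn(a, b, c);  // explicit round-to-nearest FMA intrinsic
}

int main()
{
    double *d;
    cudaMalloc(&d, 3 * sizeof(double));
    fma_demo<<<1, 1>>>(1.0 / 3.0, 3.0, -1.0, d);
    double h[3];
    cudaMemcpy(h, d, sizeof h, cudaMemcpyDeviceToHost);
    printf("mul+add=%.17g  fma=%.17g  __fma_rn=%.17g\n", h[0], h[1], h[2]);
    cudaFree(d);
    return 0;
}
```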




Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

- L1 cache per SM (32 cores)
  - Improves bandwidth and reduces latency
- Unified L2 cache (768 KB)
  - Fast, coherent data sharing across all cores in the GPU

[Die diagram: the parallel DataCache memory hierarchy, with the L2 shared across all DRAM interfaces]
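A minimal sketch of how a kernel exploits this hierarchy (sizes illustrative): each block stages its slice of the input in fast on-chip shared memory, reduces it there, and makes only one global write per block, which is then served through L1/L2.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Per-block partial sum: stage in shared memory, tree-reduce, write one value.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}

int main()
{
    const int n = 1024, blocks = n / 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));   // deterministic demo input
    block_sum<<<blocks, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```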


Larger, Faster, Resilient Memory Interface

- GDDR5 memory interface
  - 2× the signaling speed of GDDR3
- Up to 1 Terabyte of memory attached to the GPU
  - Operate on larger data sets (3 and 6 GB cards)
- ECC protection for GDDR5 DRAM
- All major internal memories are ECC protected
  - Register file, L1 cache, L2 cache
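Whether ECC is active can be checked from the CUDA runtime. A minimal sketch, assuming a runtime recent enough to expose the ECCEnabled device property:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // ECCEnabled is nonzero when ECC protection is active on the board.
    printf("%s: %.1f GB of memory, ECC %s\n",
           prop.name,
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
           prop.ECCEnabled ? "on" : "off");
    return 0;
}
```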


GigaThread Hardware Thread Scheduler

Concurrent Kernel Execution + Faster Context Switch

[Timeline diagram: serial kernel execution runs Kernels 1-5 one after another; parallel kernel execution packs independent kernels (e.g. Kernel 2 alongside Kernel 3, Kernel 4 alongside Kernel 5) into the same time slots]
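A minimal sketch of how an application requests concurrent execution (kernel body and sizes are illustrative): independent kernels are launched into separate CUDA streams, which the GigaThread scheduler is then free to overlap when resources allow.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *x)
{
    float v = x[threadIdx.x];
    for (int i = 0; i < 10000; ++i) v = v * 1.0000001f + 0.5f;
    x[threadIdx.x] = v;
}

int main()
{
    const int kernels = 4;
    cudaStream_t stream[kernels];
    float *buf[kernels];

    // Independent small kernels in separate streams; on Fermi these may
    // execute concurrently rather than back to back.
    for (int k = 0; k < kernels; ++k) {
        cudaStreamCreate(&stream[k]);
        cudaMalloc(&buf[k], 64 * sizeof(float));
        busy<<<1, 64, 0, stream[k]>>>(buf[k]);
    }
    cudaDeviceSynchronize();   // wait for all streams
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    for (int k = 0; k < kernels; ++k) {
        cudaStreamDestroy(stream[k]);
        cudaFree(buf[k]);
    }
    return 0;
}
```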


GigaThread Streaming Data Transfer Engine

- Dual DMA engines
- Simultaneous CPU-to-GPU and GPU-to-CPU data transfer
- Fully overlapped with CPU and GPU processing time

[Activity snapshot: while the GPU runs Kernels 0-3, both streaming data transfer engines (SDT0, SDT1) and the CPU all stay busy in parallel]
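A minimal sketch of the pattern that exploits the dual DMA engines (sizes illustrative): work is split into chunks staged through pinned host memory with cudaMemcpyAsync in alternating streams, so uploads, downloads and kernel execution overlap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20, chunk = n / 4;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned memory: required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Pipeline chunks across two streams: an upload in one stream can overlap
    // a download in the other, and both overlap kernel execution.
    for (int c = 0; c < 4; ++c) {
        float *hp = h + c * chunk, *dp = d + c * chunk;
        cudaStream_t st = s[c % 2];
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```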


Enhanced Software Support

Many new features in CUDA Toolkit 3.0 (to be released on Friday), including early support for the Fermi architecture:

- Native 64-bit GPU support
- Multiple copy engine support
- ECC reporting
- Concurrent kernel execution
- Fermi hardware debugging support in cuda-gdb


Enhanced Software Support

- OpenCL 1.0 support
  - A first-class language citizen in the CUDA architecture
  - Supports the ICD (so interoperability between vendors is a possibility)
  - Profiling support available
  - Debug support coming to Parallel Nsight (NEXUS) soon
- gDEBugger CL from graphicREMEDY
  - Third-party OpenCL profiler/debugger/memory checker
- The software tools ecosystem is starting to grow, given a boost by the existence of OpenCL


“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is ‘expected to be 10-times more powerful than today's fastest supercomputer.’

Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops…

…we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30 2009


NVIDIA TESLA PRODUCTS


Tesla GPU Computing Products: 10 Series

Product                          GPUs          Single Precision   Double Precision   Memory
Tesla S1070 (1U system)          4 Tesla GPUs  4.14 Teraflops     346 Gigaflops      16 GB (4 GB/GPU)
Tesla C1060 (computing board)    1 Tesla GPU   933 Gigaflops      78 Gigaflops       4 GB
Tesla Personal Supercomputer     4 Tesla GPUs  3.7 Teraflops      312 Gigaflops      16 GB (4 GB/GPU)
SuperMicro 1U GPU SuperServer    2 Tesla GPUs  1.87 Teraflops     156 Gigaflops      8 GB (4 GB/GPU)


Tesla GPU Computing Products: 20 Series

Product                          GPUs          Double Precision     Memory
Tesla S2050 (1U system)          4 Tesla GPUs  2.1 - 2.5 Teraflops  12 GB (3 GB/GPU)
Tesla S2070 (1U system)          4 Tesla GPUs  2.1 - 2.5 Teraflops  24 GB (6 GB/GPU)
Tesla C2050 (computing board)    1 Tesla GPU   500+ Gigaflops       3 GB
Tesla C2070 (computing board)    1 Tesla GPU   500+ Gigaflops       6 GB


HETEROGENEOUS CLUSTERS


Data Centers: Space and Energy Limited

Traditional data center cluster (quad-core CPUs)
- 8 cores per server, 1000's of servers, 1000's of cores
- 2× performance requires 2× the number of servers

Heterogeneous data center cluster
- Augment/replace host servers with GPU-equipped nodes
- 100's of servers, 10,000's of cores


Cluster Deployment

There are now a number of GPU-aware cluster management systems:

- ActiveEon ProActive Parallel Suite® Version 4.2
- Platform Cluster Manager and HPC Workgroup
- Streamline Computing GPU Environment (SCGE)

These are not just installation aids (i.e. putting the driver and toolkits in the right place); they are now starting to provide GPU node discovery and job steering, along the lines of the sketch below.
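A hypothetical node-discovery probe of the kind such a system might run in a job prologue; the output format and field names are invented for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Probe the node's GPUs and print a summary a scheduler could parse to
// steer GPU jobs toward suitable nodes.
int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("gpus=0\n");          // plain CPU node
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        printf("gpu%d name=\"%s\" sm=%d.%d mem_mb=%zu\n",
               dev, p.name, p.major, p.minor,
               p.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}
```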


NVIDIA and Mellanox
- Better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs
- Can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi-node heterogeneous application


Cluster Deployment

A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla:

- Allinea® DDT for NVIDIA CUDA
  - Extends the well-known Distributed Debugging Tool (DDT) with CUDA support
- TotalView® debugger (part of an Early Experience Program)
  - Extends TotalView with CUDA support; intentions to support OpenCL have also been announced

Both are based on the Parallel Nsight (NEXUS) debugging API.


NVIDIA RealityServer 3.0

Cloud computing platform for running 3D web applications

- Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images
- Deployed in a number of different sizes, from 2 to 100's of 1U servers
- iray® interactive photorealistic rendering technology
- Streams interactive 3D applications to any web-connected device
- Designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions


DISTRIBUTED COMPUTING PROJECTS


Distributed Computing Projects

- Traditional distributed computing projects have been making use of GPUs for some time (non-commercial)
  - Typically have 1,000's to 10,000's of contributors
  - Folding@Home has access to 6.5 PFLOPS of compute, of which ~95% comes from GPUs or PS3s
- Many are bio-informatics, molecular dynamics and quantum chemistry codes
  - These represent the current sweet-spot applications
- The ubiquity of GPUs in home systems helps


Distributed Computing Projects

Folding@Home
- Directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/)
- The most recent GPU3 core is based on OpenMM 1.0 (https://simtk.org/home/openmm)
  - The OpenMM library provides tools for molecular modeling simulation
  - It can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort
  - OpenMM has a strong emphasis on hardware acceleration, providing not just a consistent API but much greater performance
- The current NVIDIA target is via CUDA Toolkit 2.3
- OpenMM 1.0 also provides beta support for OpenCL
- OpenCL is the long-term convergence software platform
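A minimal sketch of what "hooking into" OpenMM looks like from host C++ code; the particle parameters here are illustrative, and the registered name of the GPU platform ("Cuda") has varied across OpenMM releases.

```cpp
#include <cstdio>
#include <vector>
#include "OpenMM.h"

int main()
{
    // Toy system of two argon-like particles; masses (amu) and nonbonded
    // parameters (charge, sigma in nm, epsilon in kJ/mol) are illustrative.
    OpenMM::System system;
    OpenMM::NonbondedForce *nb = new OpenMM::NonbondedForce();
    system.addForce(nb);                 // the System takes ownership
    for (int i = 0; i < 2; ++i) {
        system.addParticle(39.95);
        nb->addParticle(0.0, 0.34, 0.996);
    }

    OpenMM::VerletIntegrator integrator(0.002);   // 2 fs time step
    // Select the GPU-accelerated platform; name is an assumption ("Cuda"/"CUDA"
    // depending on release).
    OpenMM::Platform &gpu = OpenMM::Platform::getPlatformByName("Cuda");
    OpenMM::Context context(system, integrator, gpu);

    std::vector<OpenMM::Vec3> pos(2);
    pos[0] = OpenMM::Vec3(0.0, 0.0, 0.0);
    pos[1] = OpenMM::Vec3(0.5, 0.0, 0.0);
    context.setPositions(pos);

    integrator.step(1000);               // 1000 MD steps on the GPU
    OpenMM::State s = context.getState(OpenMM::State::Energy);
    printf("potential energy: %f kJ/mol\n", s.getPotentialEnergy());
    return 0;
}
```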



Distributed Computing Projects

Berkeley Open Infrastructure for Network Computing
- BOINC project (http://boinc.berkeley.edu/)
- Platform infrastructure originally evolved from SETI@home
- Many projects use BOINC and several of these have heterogeneous compute implementations (http://boinc.berkeley.edu/wiki/GPU_computing)
- Examples include:
  - GPUGRID.net
  - SETI@home
  - Milkyway@home (IEEE 754 double precision capable GPU required)
  - AQUA@home
  - Lattice
  - Collatz Conjecture


Distributed Computing Projects

GPUGRID.net
- Dr. Gianni De Fabritiis, Research Group of Biomedical Informatics, University Pompeu Fabra - IMIM, Barcelona
- Uses GPUs to deliver high-performance all-atom biomolecular simulation of proteins using ACEMD (http://multiscalelab.org/acemd)
  - ACEMD is a production bio-molecular dynamics code specially optimized to run on graphics processing units (GPUs) from NVIDIA
  - It reads CHARMM/NAMD and AMBER input files with a simple and powerful configuration interface
- A commercial implementation of ACEMD is available from Acellera Ltd (http://www.acellera.com/acemd/)
  - What makes this particularly interesting is that it is implemented using OpenCL


Distributed Computing Projects

- Projects have had to use brute-force methods to deal with robustness
  - Run the same work unit (WU) with multiple users and compare the results
- Running on purpose-designed heterogeneous grids with ECC means some of that paranoia can be relaxed
  - Soft errors or WU corruption can at least be detected
  - Results in better throughput on these systems
- But this does result in divergence between consumer and HPC devices
  - Should be compensated for by HPC-class devices being about 4× faster



Tesla Bio Workbench

Accelerating New Science

January 2010
http://www.nvidia.com/bio_workbench


Introducing Tesla Bio WorkBench

[Diagram: applications (MUMmerGPU, LAMMPS, GPU-AutoDock, TeraChem, ...) supported by community resources (technical papers, discussion forums, benchmarks & configurations, downloads, documentation) and platforms (Tesla Personal Supercomputer, Tesla GPU clusters)]


Tesla Bio Workbench Applications

Molecular dynamics and quantum chemistry
- AMBER (MD)
- ACEMD (MD)
- GROMACS (MD)
- GROMOS (MD)
- LAMMPS (MD)
- NAMD (MD)
- TeraChem (QC)
- VMD (visualization of MD & QC)

Docking
- GPU AutoDock

Sequence analysis
- CUDASW++ (Smith-Waterman)
- MUMmerGPU
- GPU-HMMER
- CUDA-MEME motif discovery


Recommended Hardware Configurations

Tesla Personal Supercomputer
- Up to 4 Tesla C1060s per workstation
- 4 GB main memory / GPU

Tesla GPU clusters
- Tesla S1070 1U: 4 GPUs per 1U
- Integrated CPU-GPU server: 2 GPUs per 1U + 2 CPUs

Specifics at http://www.nvidia.com/bio_workbench


Molecular Dynamics and Quantum Chemistry Applications

Molecular Dynamics and Quantum Chemistry Applications

AMBER (MD), ACEMD (MD), HOOMD (MD), GROMACS (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (viz. MD & QC)

- Typical speed-ups of 3-8× on a single Tesla C1060 vs a modern 1U server
- Some (compute-bound) applications show 20-100× speed-ups


Usage of the TeraGrid National Supercomputing Grid

[Chart: these application classes account for half of the cycles consumed]


Summary


Summary

- ‘Fermi’ debuts HPC/enterprise features
  - Particularly ECC and high performance double precision
- Software development environments are now more mature
  - A significant software ecosystem is starting to emerge
  - Broadening availability of development tools, libraries and applications
  - Heterogeneous (GPU) aware cluster management systems
- Economics, open standards and improving programming methodologies
  - Heterogeneous computing is gradually overturning the long-held perception that it is just an ‘exotic’ niche technology


Questions?


Supporting Slides


AMBER Molecular Dynamics

Implicit solvent GB results
- 1 Tesla GPU is 8× faster than 2 quad-core CPUs
- Data courtesy of San Diego Supercomputing Center

[Chart: Generalized Born simulations, showing 7× and 8.6× speed-ups]

Roadmap
- Alpha (now): PME (Particle Mesh Ewald)
- Beta release (Q1 2010): Generalized Born
- Beta 2 release (Q2 2010): Multi-GPU + MPI support

More info: http://www.nvidia.com/object/amber_on_tesla.html


GROMACS Molecular Dynamics

PME results
- 1 Tesla GPU is 3.5×-4.7× faster than CPU
- Data courtesy of Stockholm Center for Biomembrane Research

[Chart: GROMACS on Tesla GPU vs CPU for Particle-Mesh-Ewald (PME), Reaction-Field and Cutoff simulations, with speed-ups of 3.5×, 5.2× and 22×]

Roadmap
- Beta (now): Particle Mesh Ewald (PME), implicit solvent GB, arbitrary forms of non-bonded interactions
- Beta 2 release (Q2 2010): Multi-GPU + MPI support

More info: http://www.nvidia.com/object/gromacs_on_tesla.html



HOOMD Blue Molecular Dynamics

- Written bottom-up for CUDA GPUs
  - Modeled after LAMMPS
  - Supports multiple GPUs
- 1 Tesla GPU outperforms 32 CPUs running LAMMPS
- Data courtesy of University of Michigan

More info: http://www.nvidia.com/object/hoomd_on_tesla.html


LAMMPS: Molecular Dynamics on a GPU Cluster

2 GPUs = 24 CPUs
- Available as beta on CUDA
- Cut-off based non-bonded terms: 2 GPUs outperform 24 CPUs
- PME based electrostatics: preliminary results show a 5× speed-up
- Multiple GPU + MPI support enabled
- Data courtesy of Scott Hampton & Pratul K. Agarwal, Oak Ridge National Laboratory

More info: http://www.nvidia.com/object/lammps_on_tesla.html


NAMD: Scaling Molecular Dynamics on a GPU Cluster

4 GPUs = 16 CPUs
- Feature complete on CUDA; available in NAMD 2.7 Beta 2
  - Full electrostatics with PME
  - Multiple time-stepping
  - 1-4 exclusions
- A 4-GPU Tesla Personal Supercomputer outperforms 8 CPU servers
- Scales to a GPU cluster
- Data courtesy of Theoretical and Computational Biophysics Group, UIUC

More info: http://www.nvidia.com/object/namd_on_tesla.html



TeraChem: Quantum Chemistry Package for GPUs

- First quantum chemistry software written ground-up for GPUs
- 4 Tesla GPUs outperform 256 quad-core CPUs

Roadmap
- Beta (now): HF, Kohn-Sham, DFT; multiple GPUs supported
- Full release (Q1 2010): MPI support

More info: http://www.nvidia.com/object/terachem_on_tesla.html



VMD: Acceleration using CUDA GPUs

Several CUDA applications in VMD 1.8.7
- Molecular orbital display
- Coulomb-based ion placement
- Implicit ligand sampling
- Speed-ups: 20×-100×
- Multiple GPU support enabled

Images and data courtesy of Beckman Institute for Advanced Science and Technology, UIUC

More info: http://www.nvidia.com/object/vmd_on_tesla.html



GPU-HMMER: Protein Sequence Alignment

- Protein sequence alignment using profile HMMs
- Available now
- Supports multiple GPUs
- Speed-ups range from 60-100× faster than CPU

[Chart: GPU vs CPU alignment performance]

Download: http://www.mpihmmer.org/releases.htm


MUMmerGPU: Genome Sequence Alignment

- High-throughput pair-wise local sequence alignment
- Designed for large sequences
- Drop-in replacement for the "mummer" component in the MUMmer software
- Speed-ups of 3.5× to 3.75×

Download: http://mummergpu.sourceforge.net