Tesla: Fastest Processor Adoption in HPC History


http://www.nvidia.com/tesla


GPU Computing

CPU (4 cores) + GPU co-processing = heterogeneous computing

Computation Discontinuity

[Chart: peak Gflops over time (log scale), 2002-2008. NVIDIA GPUs (Tesla 8-series, then Tesla 10-series with its double-precision debut) pull away from Intel CPUs (Pentium 4 3.2 GHz, Pentium 4 dual-core 3.0 GHz, Core2 dual-core 3.0 GHz, Xeon quad-core 3 GHz).]

Reported application speedups (roughly 50x to 150x):

  146X   Medical Imaging        U of Utah
   36X   Molecular Dynamics     U of Illinois, Urbana
   18X   Video Transcoding      Elemental Tech
   50X   Matlab Computing       AccelerEyes
  100X   Astrophysics           RIKEN
  149X   Financial Simulation   Oxford
   47X   Linear Algebra         Universidad Jaime
   20X   3D Ultrasound          Techniscan
  130X   Quantum Chemistry      U of Illinois, Urbana
   30X   Gene Sequencing        U of Maryland

NVIDIA Tesla 10-Series GPU

Massively parallel, many-core architecture

[Architecture diagram omitted]

Tesla GPU Computing Products

                                  Tesla S1070 (1U System)    Tesla C1060 (Computing Board)
  GPUs                            4 Tesla GPUs               1 Tesla GPU
  Single Precision Performance    4.14 Teraflops             933 Gigaflops
  Double Precision Performance    346 Gigaflops              78 Gigaflops
  Memory                          16 GB (4 GB / GPU)         4 GB

New Class of Hybrid CPU-GPU Servers

  SuperMicro 1U GPU Server:    2 Tesla M1060 GPUs
  Bull Bullx Blade Enclosure:  up to 18 Tesla M1060 GPUs

[Chart: performance (1x / 100x / 10,000x scale) vs. price (K$ to M$) for a CPU workstation, a traditional CPU cluster, a Tesla Personal Supercomputer, and a Tesla co-processing cluster.]

UPenn: Finding a Better Shampoo

                 Tesla PSC    32 CPU Servers
  Power          1 kWatt      19.2 kWatts
  Cost           ~$7 K        $128 K

  Equal performance, 9.6x lower power, 13x lower cost, and no data center required.

Finance: Equity Pricing

                 2 Tesla S1070s    500 CPU Servers
  Power          2.8 kWatts        37.5 kWatts
  Cost           $24 K             $250 K

  Equal performance, 16x less space, 13x lower power, 10x lower cost.

Oil & Gas: Seismic Processing

                 32 Tesla S1070s    2000 CPU Servers
  Power          45 kWatts          1200 kWatts
  Cost           ~$400 K            ~$8 M

  Equal performance, 31x less space, 27x lower power, 20x lower cost.

Workstation Supercomputing

Tesla Personal Supercomputer: ~5000 customers, including HRL Labs, Carnegie Mellon University, the Korean Government, MIT Lincoln Lab, the US Army, UC San Diego, Northrop Grumman, University of Wisconsin, Halliburton Energy Services, Oxford University, North Star Imaging, University of Michigan, Pacific Biosciences, Johns Hopkins, Kodak, and Canada's Genome Sciences Centre.

Tesla Cluster Installations

[Chart: number of Tesla cluster installations, 2008 vs. 2009, on a 0-400 scale.]

Installations include: CSIRO (Australia), Argonne National Labs, Tokyo Tech, NCSA, BNP Paribas, Pacific Northwest Labs, Harvard, Oak Ridge National Laboratory, National Taiwan University, Ames Lab, Iowa State, federal agencies, Cambridge, Petrobras, British Aerospace, TOTAL, Fermi Research Labs, Hess, HLRS (Germany), Max Planck Institute, University of Michigan, Daresbury Labs (UK), and the Chinese Academy of Sciences.

Supercomputing for the Masses

  Millions of researchers     < $5K         Tesla Personal Supercomputer
  100,000s of researchers     $50K - $1M    Tesla Preconfigured Clusters
  100s of researchers         $10M+         Large Clusters

GPU Computing Applications

The CUDA parallel computing architecture on NVIDIA GPUs supports GPU computing applications written in C, C++, Fortran, Java, Python, OpenCL, and DirectX Compute.

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

CUDA: Widely Adopted Parallel Programming Model

  1000+ research papers
  200+ universities teaching CUDA
  120 million CUDA GPUs
  60,000+ active developers


CUDA Ecosystem

  Applications: oil & gas, finance, medical, biophysics, numerics, imaging, CFD, DSP, EDA
  Libraries: FFT, BLAS, LAPACK, image processing, video processing, signal processing, vision
  Languages: C, C++, DirectX, Fortran, Java, OpenCL, Python
  Compilers: PGI Fortran, CAPS HMPP, MCUDA, MPI, NOAA Fortran2C, OpenMP
  Consultants: ANEO, GPU Tech
  OEMs
  Universities (over 200 teaching CUDA): UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU

Released Applications

  Bio-Sciences
    - GROMACS (using OpenMM)
    - NAMD (alpha)
    - VMD 1.8.7 (beta)
    - HOOMD

  Bio-Informatics
    - GPU HMMER
    - MUMmerGPU: sequence alignment
    - AccelerEyes: MATLAB plugin
    - GPULib: IDL acceleration

  Medical Imaging
    - Acceleware: CT reconstruction
    - Digisens: CT reconstruction
    - AccelerEyes: MATLAB plugin

  Defense
    - GPU VSIPL: signal processing
    - GPULib: IDL acceleration
    - Ikena: imagery analysis, video forensics
    - Manifold: GIS

  Oil and Gas
    - AccelerEyes: MATLAB plugin
    - Acceleware: time migration
    - SeismicCity: prestack
    - Headwave: prestack
    - OpenGeoSolutions: spectral decomposition
    - Mercury: 3D visualization
    - ffA: 3D seismic processing
    - Manifold: GIS

  EDA
    - CST: 3D EM
    - Agilent: ADS SPICE
    - Synopsys: TCAD

  Weather & Ocean Modeling
    - WRF (beta release)
    - Particle simulation / Boltzmann solver
    - Tsunami simulation: Tokyo Tech
    - New NOAA model in development

  Finance
    - Numerix: counterparty risk
    - SciComp: derivative pricing
    - Hanweck: options pricing
    - Exegy: risk analysis
    - Aqumin: 3D visualization

  Electromagnetics
    - Acceleware: FDTD solver
    - Quantum electrodynamics library
    - CST Microwave Studio
    - GPMAD: particle beam dynamics simulator

More Information

  http://www.nvidia.com/tesla
    - Products
    - Vertical Solutions
    - CUDA GPU Programming Training

  GPU Developer Conference: Sept 30 - Oct 2, 2009, San Jose, CA
    http://www.nvidia.com/gtc



Programming the GPU

Compiling C for CUDA Applications

The build flow splits a C application in two: NVCC (based on Open64) compiles the key C for CUDA kernels into CUDA object files, the host CPU compiler compiles the rest of the C application into CPU object files, and the linker combines both into a single CPU-GPU executable.

Starting point - a standard serial C application whose key kernels are then modified into parallel CUDA code:

  void serial_function(...) {
      ...
  }

  void other_function(int ...) {
      ...
  }

  void saxpy_serial(float ...) {
      for (int i = 0; i < n; ++i)
          y[i] = a*x[i] + y[i];
  }

  void main() {
      float x;
      saxpy_serial(...);
      ...
  }
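As a rough illustration of that split (the file names below are hypothetical, not from the slide), the key kernels can live in a .cu file compiled by NVCC while the rest of the application goes through the ordinary host compiler:

  nvcc -c saxpy.cu -o saxpy.o            # NVCC compiles the C for CUDA kernels
  gcc  -c main.c   -o main.o             # host compiler handles the rest of the C application
  gcc  saxpy.o main.o -o app -lcudart    # linker produces one CPU-GPU executable
                                         # (the CUDA runtime library path may need an extra -L flag)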

C for CUDA: C with a Few Keywords

Standard C code:

  void saxpy_serial(int n, float a, float *x, float *y)
  {
      for (int i = 0; i < n; ++i)
          y[i] = a*x[i] + y[i];
  }

  // Invoke serial SAXPY kernel
  saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

  __global__ void saxpy_parallel(int n, float a, float *x, float *y)
  {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a*x[i] + y[i];
  }

  // Invoke parallel SAXPY kernel with 256 threads / block
  int nblocks = (n + 255) / 256;
  saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
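The launch above assumes x and y already live in GPU memory. A minimal self-contained host program, sketched here with illustrative sizes and values (not from the slide), shows the surrounding CUDA runtime calls that allocate device memory, copy data across, launch the kernel, and copy the result back:

  #include <cuda_runtime.h>
  #include <stdlib.h>

  __global__ void saxpy_parallel(int n, float a, float *x, float *y)
  {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a*x[i] + y[i];
  }

  int main(void)
  {
      int n = 1 << 20;                     /* 1M elements (illustrative) */
      size_t bytes = n * sizeof(float);

      /* Host arrays */
      float *x = (float *)malloc(bytes);
      float *y = (float *)malloc(bytes);
      for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      /* Device arrays */
      float *d_x, *d_y;
      cudaMalloc((void **)&d_x, bytes);
      cudaMalloc((void **)&d_y, bytes);
      cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

      /* Launch with 256 threads per block, as on the slide */
      int nblocks = (n + 255) / 256;
      saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

      /* Copy the result back; this also waits for the kernel to finish */
      cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

      cudaFree(d_x);  cudaFree(d_y);
      free(x);  free(y);
      return 0;
  }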

CUDA Programming Effort / Performance

[Chart omitted. Source: MIT CUDA Course]

Example results across domains:

  Science - Quantum Chemistry (Source: Ufimtsev, Martinez)
    Runtime, CPU (Xeon Quad 3.0 GHz) vs. GPU (Tesla C1060), log scale:
      Caffeine        4.4 secs   ->  0.2 secs
      Cholesterol     1.1 mins   ->  1.2 secs
      Taxol           4.7 mins   ->  4.5 secs
      Buckyball       5.5 mins   ->  5.7 secs
      Valinomycin    12.5 mins   ->  8.1 secs

  Medical - Computed Tomography (CT) (Source: Batenburg, Sijbers, et al.)

  Manufacturing (Source: Tolke, Krafczyk)

  Finance - 10x faster random number generators for Monte Carlo (Source: CUDA SDK, NAG)
    [Chart: speed in millions of samples per second (0-6,000) for LRAND48 and Mersenne Twister DC + Box-Muller (MKL), Tesla C1060 vs. Xeon Quad 3.0 GHz]

FFT Performance: CPU vs. GPU

  Single precision FFT (Gflops vs. matrix size): cuFFT 2.3, cuFFT 2.2, MKL (4 threads), FFTW (1 thread)
  Double precision FFT (Gflops vs. matrix size): cuFFT 2.3, MKL (4 threads), FFTW (1 thread)

  cuFFT 2.3: NVIDIA Tesla C1060 GPU
  MKL 10.1r1: Quad-Core Intel Core i7 (Nehalem), 3.2 GHz
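For orientation, a minimal cuFFT sketch (not from the deck; the function name, buffer name, and transform length are illustrative) showing the plan/execute pattern these benchmarks exercise:

  #include <cuda_runtime.h>
  #include <cufft.h>

  /* In-place forward FFT of N complex samples already resident in GPU memory. */
  int forward_fft(cufftComplex *d_signal, int N)
  {
      cufftHandle plan;
      if (cufftPlan1d(&plan, N, CUFFT_C2C, 1) != CUFFT_SUCCESS)
          return -1;                              /* plan creation failed */

      cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
      cudaDeviceSynchronize();                    /* wait for the transform to finish */

      cufftDestroy(plan);
      return 0;
  }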
BLAS Performance: CPU vs. GPU

  Single precision BLAS, SGEMM (Gflops vs. matrix size): Tesla C1060 vs. Intel MKL (4 threads)
  Double precision BLAS, DGEMM (Gflops vs. matrix size): Tesla C1060 vs. Intel MKL (4 threads)

  CUBLAS: CUDA 2.2, Tesla C1060
  MKL 10.0.3: Intel Core2 Extreme, 3.00 GHz
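Likewise, a minimal SGEMM sketch using the current cuBLAS interface (the CUBLAS 2.2 release benchmarked above used an older legacy API; the handle, matrix names, and square-matrix assumption are illustrative):

  #include <cublas_v2.h>

  /* C = alpha*A*B + beta*C for n x n column-major matrices resident in GPU memory. */
  void sgemm_nn(cublasHandle_t handle, int n,
                const float *d_A, const float *d_B, float *d_C)
  {
      const float alpha = 1.0f;
      const float beta  = 0.0f;
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n,
                  &alpha, d_A, n,
                          d_B, n,
                  &beta,  d_C, n);
  }

  /* Usage: cublasCreate(&handle); sgemm_nn(handle, n, d_A, d_B, d_C); cublasDestroy(handle); */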

Heterogeneous Computing Domains

The CPU handles sequential computing: control, communication, and productivity applications. The GPU handles parallel computing: graphics and highly parallel, data-intensive computation in domains such as oil & gas, finance, medical, biophysics, numerics, audio, video, and imaging.

5000+ Customers / ISVs

  Life Sciences & Medical Equipment: Max Planck, FDA, Robarts Research, Medtronic, AGC, Evolved Machines, Smith-Waterman DNA sequencing, AutoDock, NAMD/VMD, Folding@Home, Howard Hughes Medical, CRIBI Genomics, GE Healthcare, Siemens, Techniscan, Boston Scientific, Eli Lilly, Silicon Informatics, Stockholm Research, Harvard, Delaware, Pittsburg, ETH Zurich, Institute Atomic Physics

  Productivity / Misc: CEA, NCSA, WRF Weather Modeling, OptiTex, Tech-X, Elemental Technologies, Dimensional Imaging, Manifold, Digisens, General Mills, Rapidmind, Rhythm & Hues, xNormal, Elcomsoft, LINZIK

  Oil and Gas: Hess, TOTAL, CGG/Veritas, Chevron, Headwave, Acceleware, Seismic City, P-Wave Seismic Imaging, Mercury Computer, ffA, Geostar

  EDA: Synopsys, Nascentric, Gauda, CST, Agilent

  Finance: Symcor, Level 3, SciComp, Hanweck, Quant Catalyst, RogueWave, BNP Paribas

  CAE / Mathematical: AccelerEyes, MathWorks, Wolfram, National Instruments, Ansys, Access Analytics, Tech-X, RIKEN, SOFA, Renault, Boeing

  Communication: Nokia, RIM, Philips, Samsung, LG, Sony Ericsson, NTT DoCoMo, Mitsubishi, Hitachi, Radio Research Laboratory, US Air Force