
1

Product Availability Update

Product     Inventory                           Leadtime for big orders   Notes
C1060       200 units                           8 weeks                   Build to order
M1060       500 units                           8 weeks                   Build to order
S1070-400   50 units                            10 weeks                  Build to order
S1070-500   25 units + 75 being built           10 weeks                  Build to order
M2050       Shipping now; building 20K for Q2   8 weeks                   Sold out through mid-July
S2050       Shipping now; building 200 for Q2   8 weeks                   Sold out through mid-July
C2050       2000 units                          8 weeks                   Will maintain inventory
M2070       Sept 2010                           -                         Get PO in now to get priority
C2070       Sept-Oct 2010                       -                         Get PO in now to get priority
M2070-Q     Oct 2010                            -                         -

Parallel Processing on GPUs with the Fermi Architecture
(Processamento Paralelo em GPUs na Arquitetura Fermi)

Arnaldo Tavares
Tesla Sales Manager for Latin America

2


Quadro or Tesla?

Computer Aided Design: e.g. CATIA, SolidWorks, Siemens NX

3D Modeling / Animation: e.g. 3ds Max, Maya, Softimage

Video Editing / FX: e.g. Adobe CS5, Avid

Numerical Analytics: e.g. MATLAB, Mathematica

Computational Biology: e.g. AMBER, NAMD, VMD

Computer Aided Engineering: e.g. ANSYS, SIMULIA/ABAQUS
3

GPU Computing: CPU + GPU Co-Processing

CPU: 4 cores, 48 GigaFlops (DP)

GPU: 515 GigaFlops (DP)

(Average efficiency in Linpack: 50%)

4

Application Speedups: 50x to 150x

146X   Medical Imaging, U of Utah
36X    Molecular Dynamics, U of Illinois, Urbana
18X    Video Transcoding, Elemental Tech
50X    MATLAB Computing, AccelerEyes
100X   Astrophysics, RIKEN
149X   Financial Simulation, Oxford
47X    Linear Algebra, Universidad Jaime
20X    3D Ultrasound, Techniscan
130X   Quantum Chemistry, U of Illinois, Urbana
30X    Gene Sequencing, U of Maryland

5

Increasing Number of Professional CUDA Apps
(status legend on the original slide: Available Now, Announced, Future)

Tools: TotalView Debugger, NVIDIA Video Libraries, AccelerEyes Jacket (MATLAB), EMPhotonics CULAPACK, Bright Cluster Manager, CAPS HMPP, MATLAB, Thrust C++ Template Library, CUDA C/C++, PGI CUDA Fortran, PGI CUDA x86, PGI Accelerators, Parallel Nsight Visual Studio IDE, Allinea DDT Debugger, TauCUDA Perf Tools, NVIDIA NPP Performance Primitives, ParaTools VampirTrace, Platform LSF Cluster Manager

Libraries: Wolfram Mathematica, CUDA FFT, CUDA BLAS, MAGMA (LAPACK), RNG & SPARSE CUDA Libraries

Oil & Gas: VSG Open Inventor, StoneRidge RTM, Headwave Suite, Acceleware RTM Solver, GeoStar Seismic Suite, ffA SVI Pro, OpenGeoSolutions OpenSEIS, Paradigm RTM, Paradigm SKUA, Seismic City RTM, Tsunami RTM, Panorama Tech

Bio-Chemistry: TeraChem, BigDFT, ABINIT, VMD, Acellera ACEMD, AMBER, DL-POLY, GROMACS, HOOMD, LAMMPS, NAMD, GAMESS, CP2K, OpenEye ROCS, HEX Protein Docking, PIPER Docking

Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA SW++ (Smith-Waterman), GPU-HMMER, MUMmerGPU

CAE: ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometech Particleworks, Remcom XFdtd 7.0, MSC.Software Marc 2010.2, ANSYS Mechanical, LSTC LS-DYNA 971, FluiDyna OpenFOAM, Metacomp CFD++

6

Increasing Number of Professional CUDA Apps (continued)
(status legend on the original slide: Available Now, Announced, Future)

Medical: Siemens 4D Ultrasound, Digisens, Schrodinger Core Hopping, Useful Progress Med

Rendering: Lightworks Artisan, Autodesk 3ds Max, NVIDIA OptiX (SDK), mental images iray (OEM), Bunkspeed Shot (iray), Refractive SW Octane, Works Zebra Zeany, Chaos Group V-Ray GPU, Cebas finalRender, Random Control Arion, Caustic Graphics, Weta Digital PantaRay, ILM Plume

EDA: Synopsys TCAD, SPEAG SEMCAD X, Agilent EMPro 2010, CST Microwave, Agilent ADS SPICE, Acceleware FDTD Solver, Acceleware EM Solution, Gauda OPC, Rocketick Verilog Sim

Video: Digital Anarchy Photo, Elemental Video, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, The Foundry Kronos, TDVision TDVCodec, ARRI (various apps), Black Magic Da Vinci, MainConcept CUDA Encoder, GenArts Sapphire, Adobe Premiere Pro CS5, MotionDSP Ikena Video

Finance: Murex MACS, Numerix Risk, RMS Risk Mgt Solutions, NAG RNG, SciComp SciFinance, Hanweck Options Analytics, Aquimin AlphaVision

Other: Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision

7

3 of Top5 Supercomputers

[Chart: Linpack performance and power consumption (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100]

8

3 of Top5 Supercomputers

[Same chart repeated; of the six systems shown, Tianhe-1A, Nebulae, and Tsubame are the GPU-accelerated machines]

9

What if Every Supercomputer Had Fermi?

[Chart: Linpack Teraflops for the Top 500 Supercomputers (Nov 2009), with equivalent Fermi GPU configurations marked]

Top 150:  150 GPUs, 37 TeraFlops, $740K
Top 100:  225 GPUs, 55 TeraFlops, $1.1 M
Top 50:   450 GPUs, 110 TeraFlops, $2.2 M

Each configuration works out to roughly 0.25 Linpack TeraFlops per GPU, about half of the 515-GigaFlop double-precision peak, which is consistent with the ~50% Linpack efficiency quoted earlier.

10

Hybrid ExaScale Trajectory

2008:    1 TFLOP       7.5 KWatts
2010:    1.27 PFLOPS   2.55 MWatts
2017*:   2 EFLOPS      10 MWatts

* This is a projection based on Moore's law and does not represent a committed roadmap

11

Tesla Roadmap

12

The March of the GPUs

[Chart 1: Peak Memory Bandwidth (GBytes/sec), 2007-2012 - NVIDIA GPUs (T10, T20, T20A) vs. x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz)]

[Chart 2: Peak Double Precision FP (GFlops/sec), 2007-2012 - NVIDIA GPU with ECC off (T10, T20, T20A) vs. x86 CPU (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz)]

13

Project Denver

14

Expected Tesla Roadmap with Project Denver

15

Workstation / Data Center Solutions

Workstations: up to 4x Tesla C2050/70 GPUs

Integrated CPU-GPU Server: 2x Tesla M2050/70 GPUs in 1U

OEM CPU Server + Tesla S2050/70: 4 Tesla GPUs in 2U

16

Tesla C-Series Workstation GPUs: Tesla C2050 / Tesla C2070

Processor: Tesla 20-series GPU
Number of cores: 448
Caches: 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache
Floating point peak performance: 1030 Gigaflops (single precision), 515 Gigaflops (double precision)
GPU memory: C2050: 3 GB (2.625 GB with ECC on); C2070: 6 GB (5.25 GB with ECC on)
Memory bandwidth: 144 GB/s (GDDR5)
System I/O: PCIe x16 Gen2
Power: 238 W (max)
Availability: shipping now

17

How is the GPU Used?

Basic component: the "Streaming Multiprocessor" (SM)

SIMD: "Single Instruction, Multiple Data" - the same instruction is issued to all cores, but each core operates on different data (see the kernel sketch below)

"SIMD at the SM, MIMD at the GPU chip"

Source: Presentation from Felipe A. Cruz, Nagasaki University
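To make the SIMD point concrete, here is a minimal kernel sketch (the kernel name and parameters are illustrative, not taken from the slides): every thread executes the same instruction stream, but each one indexes its own data element.

__global__ void scale(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // a different element for every thread
    if (i < n)                                       // threads past the end simply do nothing
        out[i] = factor * in[i];                     // same instruction, different data
}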

18

The Use of GPUs and Bottleneck Analysis

Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology

19

The Fermi Architecture

3 billion transistors

16 Streaming Multiprocessors (SMs)

6 x 64-bit memory partitions = 384-bit memory interface

Host Interface: connects the GPU to the CPU via PCI-Express

GigaThread global scheduler: distributes thread blocks to the SM thread schedulers




20

SM Architecture

[Diagram: one SM - Instruction Cache, dual Warp Scheduler and Dispatch units, Register File, 32 CUDA cores, 16 Load/Store units, 4 Special Function Units, Interconnect Network, 64 KB configurable Cache / Shared Memory, Uniform Cache]

32 CUDA cores per SM (512 in total)

16 Load/Store units: source and destination addresses calculated for 16 threads per clock

4 Special Function Units (sine, cosine, square root, etc.)

64 KB of RAM for shared memory and L1 cache (configurable)

Dual Warp Scheduler




21

Dual Warp Scheduler

1 Warp = 32 parallel threads

2 Warps are issued and executed concurrently

Each Warp goes to 16 CUDA cores

Most instructions can be dual-issued (exception: double-precision instructions)

The dual-issue model allows near-peak hardware performance (see the warp-index sketch below)
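A small, hedged illustration of warp granularity in code (the kernel and output names are illustrative): the lane and warp indices below follow directly from the 32-thread warp size, which is also why block sizes are normally chosen as a multiple of 32.

__global__ void warp_ids(int *warp_out, int *lane_out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    lane_out[tid] = threadIdx.x % 32;                  // lane within the warp (0-31)
    warp_out[tid] = threadIdx.x / 32;                  // warp index within the block
}

// e.g. warp_ids<<<grid_size, 256>>>(d_warp, d_lane);  // 256 threads = 8 full warps per block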




22

CUDA Core Architecture

[Diagram: the SM block diagram with one CUDA core expanded - Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue]

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Newly designed integer ALU, optimized for 64-bit and extended-precision operations

Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision


23

Fused Multiply-Add Instruction (FMA)
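As a hedged sketch of how FMA shows up in CUDA C (the kernel is illustrative, not from the slides): the compiler will normally contract a*x[i] + y[i] into a single FMA on Fermi, and the standard math function fmaf() requests the fused operation, with its single rounding, explicitly.

__global__ void saxpy_fma(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // a * x[i] + y[i] is normally contracted into one FMA by the compiler on Fermi;
        // fmaf() asks for the fused multiply-add (one rounding step) explicitly.
        y[i] = fmaf(a, x[i], y[i]);
}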

24

GigaThread™ Hardware Thread Scheduler (HTS)

Hierarchically manages thousands of simultaneously active threads

10x faster application context switching (each program receives a time slice of processing resources)

Concurrent kernel execution

25

GigaThread Hardware Thread Scheduler

Concurrent Kernel Execution + Faster Context Switch

[Diagram: timeline comparing serial kernel execution (kernels 1-5 running one after another) with parallel kernel execution (independent kernels overlapping in time)]
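A minimal sketch of how an application exposes this concurrency to the scheduler, assuming two independent kernels (the kernel and stream names are illustrative): kernels launched into different CUDA streams may execute concurrently on Fermi when SM resources allow.

#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[threadIdx.x] *= 2.0f; }   // illustrative independent kernels
__global__ void kernelB(float *d) { d[threadIdx.x] += 1.0f; }

int main()
{
    float *dA, *dB;
    cudaMalloc((void **)&dA, 256 * sizeof(float));
    cudaMalloc((void **)&dB, 256 * sizeof(float));
    cudaMemset(dA, 0, 256 * sizeof(float));
    cudaMemset(dB, 0, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels placed in different streams have no ordering constraint between them,
    // so the GigaThread scheduler may run them concurrently when resources allow.
    kernelA<<<1, 256, 0, s1>>>(dA);
    kernelB<<<1, 256, 0, s2>>>(dB);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}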

26

GigaThread Streaming Data Transfer (SDT) Engine

Dual DMA engines

Simultaneous CPU-to-GPU and GPU-to-CPU data transfers

Fully overlapped with CPU and GPU processing time

[Activity snapshot: across kernels 0-3, CPU work, SDT0 transfers, GPU execution, and SDT1 transfers proceed as an overlapping pipeline]
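A hedged sketch of the copy/compute overlap this enables, assuming the work can be split into independent halves (buffer, stream, and kernel names are illustrative): asynchronous copies from pinned host memory in one stream can proceed on one DMA engine while the other stream computes or copies back on the second engine.

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;   // total elements, split into two independent halves
    const int H = N / 2;

    float *h_in, *h_out, *d_buf0, *d_buf1;
    cudaHostAlloc((void **)&h_in,  N * sizeof(float), cudaHostAllocDefault);  // pinned host memory,
    cudaHostAlloc((void **)&h_out, N * sizeof(float), cudaHostAllocDefault);  // required for async copies
    cudaMalloc((void **)&d_buf0, H * sizeof(float));
    cudaMalloc((void **)&d_buf1, H * sizeof(float));
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // While one half is being copied host-to-device on one DMA engine, the other
    // half can be computing or copying device-to-host on the second engine.
    for (int h = 0; h < 2; ++h) {
        float *d_buf = (h == 0) ? d_buf0 : d_buf1;
        int offset   = h * H;

        cudaMemcpyAsync(d_buf, h_in + offset, H * sizeof(float),
                        cudaMemcpyHostToDevice, s[h]);
        process<<<(H + 255) / 256, 256, 0, s[h]>>>(d_buf, H);
        cudaMemcpyAsync(h_out + offset, d_buf, H * sizeof(float),
                        cudaMemcpyDeviceToHost, s[h]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_buf0);
    cudaFree(d_buf1);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}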

27

Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency

Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU

Global memory (up to 6 GB)
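The 64 KB per-SM array can be split as 48 KB shared memory / 16 KB L1 or 16 KB shared / 48 KB L1. A hedged sketch of how a program states its preference through the CUDA runtime (my_kernel is illustrative):

#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { data[threadIdx.x] += 1.0f; }   // illustrative kernel

int main()
{
    // Prefer a 48 KB L1 / 16 KB shared split for kernels that mostly stream global memory...
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    // ...or a 48 KB shared / 16 KB L1 split for kernels that stage data in shared memory:
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

    float *d;
    cudaMalloc((void **)&d, 32 * sizeof(float));
    cudaMemset(d, 0, 32 * sizeof(float));
    my_kernel<<<1, 32>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}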

28

CUDA: Compute Unified Device Architecture

NVIDIA's parallel computing architecture

A software development platform aimed at the GPU architecture

29

Thread Hierarchy

Kernels (simple C programs) are executed by threads

Threads are grouped into Blocks

Threads in a Block can synchronize execution

Blocks are grouped into a Grid

Blocks are independent (they must be able to execute in any order); a minimal kernel sketch follows below

Source: Presentation from Felipe A. Cruz, Nagasaki University
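A minimal sketch of the hierarchy in code, as referenced in the list above (kernel and array names are illustrative): each thread handles one element, threads of a block cooperate through shared memory and __syncthreads(), and the blocks of the grid run independently.

__global__ void reverse_in_block(float *data)
{
    __shared__ float tile[256];                        // visible to all threads of this block
    int local  = threadIdx.x;                          // thread ID within the block
    int global = blockIdx.x * blockDim.x + local;      // position within the grid

    tile[local] = data[global];                        // each thread loads one element
    __syncthreads();                                   // every thread of the block waits here

    data[global] = tile[blockDim.x - 1 - local];       // write the block's slice reversed
}

// Launch as a grid of independent 256-thread blocks (assumes the array length is a multiple of 256):
// reverse_in_block<<<num_blocks, 256>>>(d_data);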

30

Memory and Hardware Hierarchy

Threads access Registers; CUDA cores execute Threads

Threads within a Block can share data/results via Shared Memory; Streaming Multiprocessors (SMs) execute Blocks

Grids use Global Memory for result sharing (after kernel-wide global synchronization); the GPU executes Grids (see the sketch below)

Source: Presentation from Felipe A. Cruz, Nagasaki University
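A hedged sketch mapping those levels onto source code (names are illustrative): local variables live in registers, __shared__ arrays live in the SM's shared memory, and kernel pointer arguments refer to global memory.

__global__ void hierarchy_demo(const float *g_in, float *g_out)   // g_in / g_out live in global memory
{
    __shared__ float s_buf[128];                     // shared memory: one copy per block (launch with 128-thread blocks)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float r = g_in[i];                               // 'r' lives in a register
    s_buf[threadIdx.x] = r * r;                      // block-local data exchange
    __syncthreads();

    g_out[i] = s_buf[threadIdx.x];                   // results other blocks must see go back to global memory
}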

31

Full View of the Hierarchy Model

CUDA      Hardware Level   Memory Access
Thread    CUDA Core        Registers
Block     SM               Shared Memory
Grid      GPU              Global Memory
Device    Node             Host Memory

32

IDs and Dimensions

[Diagram: a Device running Grid 1, made of Blocks (0,0) through (2,1); Block (1,1) is expanded into Threads (0,0) through (4,2)]

Threads: 3D IDs, unique within a block

Blocks: 2D IDs, unique within a grid

Dimensions are set at launch time and can be different for each grid

Built-in variables: threadIdx, blockIdx, blockDim, gridDim (see the indexing sketch below)
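A hedged sketch of how these built-in variables are typically combined into a global 2D index (the kernel and matrix layout are illustrative, not from the slides):

__global__ void add_matrices(const float *A, const float *B, float *C, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (col < width && row < height)
        C[row * width + col] = A[row * width + col] + B[row * width + col];
}

// Launch with 2D blocks and a 2D grid sized from the matrix dimensions:
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// add_matrices<<<grid, block>>>(dA, dB, dC, width, height);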


33

Compiling C for CUDA Applications

void serial_function(...) {
    ...
}

void other_function(int ...) {
    ...
}

void saxpy_serial(float ...) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

void main() {
    float x;
    saxpy_serial(..);
    ...
}

The key kernels are modified into parallel CUDA code; NVCC (Open64) compiles the C CUDA kernels into CUDA object files, the host CPU compiler compiles the rest of the C application into CPU object files, and the linker combines both into a single CPU-GPU executable.

34

C for CUDA: C with a few keywords

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
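The slide stops at the kernel launch; a hedged sketch of the host-side code that usually surrounds it (allocation, transfers, launch, copy-back - buffer names and sizes are illustrative) could look like this:

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);                        // allocate GPU global memory
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);     // host -> device
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);     // launch the kernel

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);     // device -> host

    cudaFree(d_x);
    cudaFree(d_y);
    free(h_x);
    free(h_y);
    return 0;
}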

35

Software Programming
(this title repeats across eight consecutive image-only slides)

Source: Presentation from Andreas Klöckner, NYU

43

CUDA C/C++ Leadership: 2007 - 2010
(milestone dates on the original timeline: July 07, Nov 07, April 08, Aug 08, July 09, Nov 09, Mar 10)

CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples

CUDA Toolkit 1.1: Win XP 64, atomics support, multi-GPU support

CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OSX, 3D textures, HW interpolation

CUDA Toolkit 2.3: DP FFT, 16-32 conversion intrinsics, performance enhancements

CUDA Visual Profiler 2.2; cuda-gdb HW debugger; Parallel Nsight beta

CUDA Toolkit 3.0: C++ inheritance, Fermi architecture support, tools updates, driver / runtime interop

44

Why should I choose Tesla over consumer cards?

Features
4x higher double precision (on 20-series) -> higher performance for scientific CUDA applications
ECC, only on Tesla & Quadro (on 20-series) -> data reliability inside the GPU and on DRAM memories
Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only 1 DMA engine) -> higher performance for CUDA applications by overlapping communication and computation
Larger memory for larger data sets (3 GB and 6 GB products) -> higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE)
Cluster management software tools available on Tesla only -> needed for GPU monitoring and job scheduling in data center deployments
TCC (Tesla Compute Cluster) driver for Windows supported only on Tesla -> higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and Services
Integrated OEM workstations and servers -> trusted, reliable systems built for Tesla products
Professional ISVs will certify CUDA applications only on Tesla -> bug reproduction, support, and feature requests for Tesla only

Quality & Warranty
2- to 4-day stress testing & memory burn-in for reliability; added margin in memory and core clocks -> built for 24/7 computing in data center and workstation environments
Manufactured & guaranteed by NVIDIA -> no changes in key components like GPU and memory without notice; always the same clocks for known, reliable performance
3-year warranty from HP -> reliable, long-life products

Support & Lifecycle
Enterprise support, higher priority for CUDA bugs and requests -> ability to influence the CUDA and GPU roadmap; early access to feature requests
18-24 months availability + 6-month EOL notice -> reliable product supply
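Several of the capabilities above (ECC, the dual DMA engines, the larger memory) can be checked at runtime; a hedged sketch using the CUDA runtime's device-property query (device 0 is assumed):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device:             %s\n", prop.name);
    printf("Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
    printf("ECC enabled:        %d\n", prop.ECCEnabled);        // 1 on Tesla/Quadro with ECC on
    printf("Copy (DMA) engines: %d\n", prop.asyncEngineCount);  // 2 on Tesla 20-series
    printf("Multiprocessors:    %d\n", prop.multiProcessorCount);
    return 0;
}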