High Performance Computing with Graphics Processing Units (1)

Manfred Liebmann
Department of Mathematics
University of Wyoming
mliebman@uwyo.edu
March 23, 2009
Graphics Processor Architecture
Today's graphics processors are multi-core processors with hundreds of
processor cores!
• GPU Architecture Building Block: Multiprocessor
– 8 thread processors + 16384 registers + 16 KB shared memory
– IEEE 754 single and double precision floating point support
– 3 x 8 single-precision flops (MAD + MUL) + 2 double-precision flops (MAD) per clock
– Up to 1024 hardware threads per multiprocessor
– Up to 30 multiprocessors per graphics chip
– Up to 2 GPUs per graphics board
– Up to 4 GB RAM per graphics board
• NVIDIA GTX 295
– 2 GPUs with 1.4 billion transistors each
– 480 thread processors @ 1.25 GHz + 1.8 GB RAM
– Memory bandwidth: ≈ 220 GB/s
– Peak performance: 2 TFLOPS (single) + ≈ 200 GFLOPS (double)
– PCI Express card: 500 USD
Figure 1: NVIDIA GTX 295
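The peak figures for the GTX 295 can be cross-checked from the per-multiprocessor numbers given earlier. The following is a rough sketch, assuming that every thread processor issues one MAD plus one MUL per clock and that every multiprocessor issues one double-precision MAD per clock; the constant and function names are illustrative only. The results land near, not exactly on, the rounded values quoted on the slide.

```python
# Back-of-the-envelope peak throughput for the GTX 295, using only the
# per-multiprocessor figures quoted on the slides.
THREAD_PROCESSORS = 480               # both GPUs on the board together
PROCESSORS_PER_MULTIPROCESSOR = 8
CLOCK_GHZ = 1.25
SINGLE_FLOPS_PER_PROCESSOR = 3        # MAD (2 flops) + MUL (1 flop) per clock
DOUBLE_FLOPS_PER_MULTIPROCESSOR = 2   # one double-precision MAD per clock


def peak_gflops():
    """Return (single, double) peak throughput in GFLOPS."""
    multiprocessors = THREAD_PROCESSORS // PROCESSORS_PER_MULTIPROCESSOR
    single = THREAD_PROCESSORS * SINGLE_FLOPS_PER_PROCESSOR * CLOCK_GHZ
    double = multiprocessors * DOUBLE_FLOPS_PER_MULTIPROCESSOR * CLOCK_GHZ
    return single, double


single, double = peak_gflops()
print(f"single: {single:.0f} GFLOPS, double: {double:.0f} GFLOPS")
```

Under these assumptions the single-precision peak is about 1.8 TFLOPS and the double-precision peak about 150 GFLOPS, which matches the order of magnitude of the slide's rounded numbers.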
Motivation: Floating-Point Performance
Figure 2: Floating-Point Operations per Second for the CPU and GPU
Motivation: Memory Bandwidth
Figure 3: Memory Bandwidth for the CPU and GPU
Graphics Processing Units
Driving force over the last decade: 3D Gaming Industry
• Hardware Companies
– Nvidia
– ATI/AMD
– Intel (Integrated 3D Graphics)
• GPU Technology
– Nvidia: Thread Processors
– ATI/AMD: Stream Processors
Only Nvidia provides fully programmable GPUs! Focus on Nvidia's technology!
Software/Literature
• CUDA ZONE: http://www.nvidia.com/object/cuda_home.html
– Nvidia CUDA (Compute Unified Device Architecture) resource!
– All the information you need!
• CUDA 2.1: http://www.nvidia.com/object/cuda_get.html
– 1. CUDA driver
– 2. CUDA toolkit
– 3. CUDA SDK
• CUDA U: http://www.nvidia.com/object/cuda_education.html
– CUDA Classes (Wen-mei W. Hwu, David Kirk)
– More reading material
Where to start?
• Download CUDA 2.1 (Driver, Toolkit, SDK)
• Try the SDK samples on Linux or Windows!
– Microsoft Visual C++ 2005 Express Edition
– Free download! http://www.microsoft.com/express/2005/
• Read the CUDA Programming Guide! (CUDA/doc Folder)
– NVIDIA_CUDA_Programming_Guide_2.1.pdf
• Start Coding!
Quad-GPU Compute Server
• Linux Server with 960 Thread Processors
– MSI K9A2 Platinum Motherboard
– AMD Phenom X4 9950 BE @ 2.6 GHz + 8 GB DDR2 RAM
– Quad Nvidia GTX 280 @ 1.4 GHz + 1 GB Graphics RAM
Figure 4: Quad-GPU Compute Server
16 GPU Compute Cluster
• Two Linux Servers with 3840 Thread Processors
– ASRock X58 Supercomputer Motherboard
– Intel Core i7-965 EE @ 3.2 GHz + 12 GB DDR3 RAM
– Quad Nvidia GTX 295 @ 1.25 GHz + 1.8 GB Graphics RAM
Figure 5: 16 GPU Compute Cluster Parts
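The thread-processor totals quoted for the two systems follow from the architecture figures given earlier (8 thread processors per multiprocessor, up to 30 multiprocessors per GPU). A small sketch of that arithmetic, assuming every GPU in both machines has the full 30 multiprocessors; the function name is illustrative:

```python
PROCESSORS_PER_MULTIPROCESSOR = 8
MULTIPROCESSORS_PER_GPU = 30   # assumption: fully enabled chips


def thread_processors(servers, boards_per_server, gpus_per_board):
    """Total thread processors in a set of identical GPU servers."""
    gpus = servers * boards_per_server * gpus_per_board
    return gpus * MULTIPROCESSORS_PER_GPU * PROCESSORS_PER_MULTIPROCESSOR


# Quad-GPU server: one machine, four single-GPU GTX 280 boards.
quad_gtx280 = thread_processors(servers=1, boards_per_server=4, gpus_per_board=1)
# 16 GPU cluster: two machines, four dual-GPU GTX 295 boards each.
cluster = thread_processors(servers=2, boards_per_server=4, gpus_per_board=2)
print(quad_gtx280, cluster)
```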
Nvidia Tesla Computing Solutions
• Tesla C1060 Computing Processor
– 240 Thread Processors + 4 GB GDDR3 RAM
– Price: 1500 USD
• Tesla S1070 Computing System
– Quad Tesla C1060 1U-Server
– Price: 10000 USD
Figure 6: Tesla Computing Solutions

Institute for Mathematics and Scientific Computing, University of Graz, Austria

Institute of Biophysics, Center for Physiological Medicine, Medical University of Graz, Austria
PCG-AMG Benchmark
The algebraic multigrid method (AMG) provides an efficient preconditioner for the preconditioned conjugate gradient (PCG) algorithm. Currently the complete multigrid iteration within the PCG solver is executed on the GPU, while the setup of the algebraic multigrid method is handled by the CPU due to the high complexity of the algorithm. Table 1 compares a single PCG-AMG iteration on different cluster computers: Kepler (32 Opteron processors, Infiniband interconnect), Boltzmann (16 Opteron processors, Gigabit Ethernet network) and Liebmann (4 quad-core Barcelona processors, shared memory architecture). The finite element matrix of the Virtual Heart simulation has dimension N = 862,515 with #A = 12,795,209 nonzero elements. All computations use double precision arithmetic.

N = 862,515, #A = 12,795,209
Platform               Time (s)
Kepler 32P (IB)        0.021094
Boltzmann 16P (GBE)    0.058125
Liebmann 16P (SHM)     0.053750
Single GTX 280         0.022178
Dual GTX 280           0.012828
Quad GTX 280           0.008277
Table 1: Timing of PCG-AMG iteration in seconds (double precision)

Considering the cost of a typical cluster computer compared to a single Nvidia GTX 280 board, we see a price/performance advantage of two to three orders of magnitude.
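The timings in Table 1 are easier to compare as ratios. The sketch below only divides the published times; the dictionary keys and the helper name are illustrative, and no new measurements are implied.

```python
# Time per PCG-AMG iteration in seconds, copied from Table 1.
TIMES = {
    "Kepler 32P (IB)": 0.021094,
    "Boltzmann 16P (GBE)": 0.058125,
    "Liebmann 16P (SHM)": 0.053750,
    "Single GTX 280": 0.022178,
    "Dual GTX 280": 0.012828,
    "Quad GTX 280": 0.008277,
}


def speedup(reference, target):
    """How many times faster `target` is than `reference`."""
    return TIMES[reference] / TIMES[target]


for cpu in ("Kepler 32P (IB)", "Boltzmann 16P (GBE)", "Liebmann 16P (SHM)"):
    print(f"Quad GTX 280 vs {cpu}: {speedup(cpu, 'Quad GTX 280'):.2f}x")
print(f"Dual vs single GTX 280: {speedup('Single GTX 280', 'Dual GTX 280'):.2f}x")
print(f"Quad vs single GTX 280: {speedup('Single GTX 280', 'Quad GTX 280'):.2f}x")
```

Read this way, a single GTX 280 is roughly on par with the 32-processor Kepler cluster for one iteration, and four boards are about 2.5 times faster than Kepler.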
Multi-GPU Accelerated Algebraic Multigrid Methods
CARP Simulator: The capabilities of the CARP simulator are extended to include a multi-GPU algebraic multigrid solver for the elliptic PDE, implemented using the Nvidia CUDA Toolkit and the Parallel Toolbox software package.
Quad-GPU Compute Server: A test platform for the multi-GPU PCG-AMG solver has been set up using an MSI K9A2 Platinum mainboard, a quad-core Phenom 9950 processor, and four Nvidia GTX 280 boards. Early tests show good scaling of the parallel multigrid solver on the test platform, giving a 50-fold performance advantage over a typical single-CPU configuration. With these results, a tremendous speedup of the Virtual Heart simulation will be possible in the near future.
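To show where the multigrid preconditioner plugs into the solver, here is a minimal preconditioned conjugate gradient sketch in plain Python. It is not the CARP or Parallel Toolbox implementation: a simple Jacobi (diagonal) preconditioner stands in for the AMG cycle, the matrix is a toy 1D Laplacian, and all names are illustrative.

```python
def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def pcg(a, b, apply_preconditioner, tol=1e-10, max_iter=1000):
    """Solve a x = b for symmetric positive definite a (b must be nonzero)."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - a x for the zero start vector
    z = apply_preconditioner(r)      # an AMG cycle would be applied here
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for iteration in range(1, max_iter + 1):
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, iteration
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Toy SPD system: tridiagonal 1D Laplacian with stencil (-1, 2, -1).
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n


def jacobi(r):
    """Stand-in preconditioner: divide by the matrix diagonal."""
    return [ri / A[i][i] for i, ri in enumerate(r)]


x, iterations = pcg(A, b, jacobi)
print(iterations, x)
```

Each iteration costs one matrix-vector product and one preconditioner application, which is why moving the whole multigrid cycle onto the GPU matters for the per-iteration timings.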
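The Bidomain Equations: The bidomain equations are a set of coupled partial differential equations which describe the current flow in the myocardium (region I). Optionally, the current flow in a surrounding medium (torso, fluid bath, blood-filled cavities of the heart) is included in the formulation (regions II). The bidomain equations are written as follows:
\[
\begin{aligned}
-\nabla \cdot (\bar{\sigma}_i \nabla \phi_i) &= -\beta I_m \\
-\nabla \cdot (\bar{\sigma}_e \nabla \phi_e) &= \beta I_m \\
-\nabla \cdot (\bar{\sigma}_b \nabla \phi_e) &= I_e
\end{aligned}
\]
where
\[
\begin{aligned}
I_m &= C_m \frac{\partial V_m}{\partial t} + I_{ion}(V_m, \vec{\eta}) - I_{tr} \\
\frac{d\vec{\eta}}{dt} &= g(V_m, \vec{\eta}) \\
V_m &= \phi_i - \phi_e
\end{aligned}
\]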
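The bidomain solver of the Cardiac Arrhythmia Research Package CARP, one of the most efficient solvers for this type of problem, implements several solution methods. Most frequently, however, the following scheme based on an operator splitting technique is employed [7,8,1]. The bidomain equations are decoupled into an elliptic PDE
\[
(A_i + A_e)\,\phi_e^{k+1} = A_i V^{k+1} + I_e ,
\]
a parabolic PDE
\[
\begin{aligned}
V^{k*} &= (1 - \Delta t A_i)\, V^{k} - \Delta t A_e \phi_e^{k}, & \Delta x &> 100\,\mu\mathrm{m} \\
\left(1 + \tfrac{1}{2} \Delta t A_i\right) V^{k*} &= \left(1 - \tfrac{1}{2} \Delta t A_i\right) V^{k} - \Delta t A_e \phi_e^{k}, & \Delta x &< 100\,\mu\mathrm{m}
\end{aligned}
\]
and a set of ODEs
\[
\begin{aligned}
V^{k+1} &= V^{k*} + \frac{\Delta t}{C_m}\, i_{ion}\!\left(V^{k*}, \vec{\eta}^{\,k}\right) \\
\vec{\eta}^{\,k+1} &= \vec{\eta}^{\,k} + \Delta t\, g\!\left(V^{k+1}, \vec{\eta}^{\,k}\right)
\end{aligned}
\]
where
\[
A_i = -\frac{\nabla \cdot (\bar{\sigma}_i \nabla)}{\beta C_m}; \quad
A_e = -\frac{\nabla \cdot (\bar{\sigma}_i \nabla)}{\beta C_m}; \quad
t = k \Delta t
\]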
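The parabolic step switches from an explicit update on coarse grids to a Crank-Nicolson update on fine grids. The poster does not state the reason; a common motivation is that the explicit step is only stable when the time step shrinks with the square of the grid spacing. The sketch below illustrates this on a toy problem and makes several assumptions that are not from the source: a 1D grid with zero Dirichlet boundaries, a scalar diffusion coefficient `d` in place of the conductivity tensor, dimensionless units, and the extracellular term dropped (phi_e = 0).

```python
def apply_a(v, d, dx):
    """y = A v with A = -d * d2/dx2 on a 1D grid (zero Dirichlet boundaries)."""
    n = len(v)
    y = []
    for j in range(n):
        left = v[j - 1] if j > 0 else 0.0
        right = v[j + 1] if j < n - 1 else 0.0
        y.append(-d * (left - 2.0 * v[j] + right) / dx**2)
    return y


def explicit_step(v, dt, d, dx):
    """V* = (1 - dt A) V, the forward Euler variant."""
    av = apply_a(v, d, dx)
    return [vj - dt * aj for vj, aj in zip(v, av)]


def crank_nicolson_step(v, dt, d, dx):
    """Solve (1 + dt/2 A) V* = (1 - dt/2 A) V with the Thomas algorithm."""
    n = len(v)
    r = d * dt / (2.0 * dx**2)
    av = apply_a(v, d, dx)
    rhs = [vj - 0.5 * dt * aj for vj, aj in zip(v, av)]
    diag = [1.0 + 2.0 * r] * n          # off-diagonal entries are all -r
    for j in range(1, n):               # forward elimination
        m = -r / diag[j - 1]
        diag[j] -= m * (-r)
        rhs[j] -= m * rhs[j - 1]
    out = [0.0] * n                     # back substitution
    out[-1] = rhs[-1] / diag[-1]
    for j in range(n - 2, -1, -1):
        out[j] = (rhs[j] + r * out[j + 1]) / diag[j]
    return out


def l2(v):
    return sum(x * x for x in v) ** 0.5


# d*dt/dx**2 = 1.0 exceeds the forward Euler stability limit of 0.5.
n, d, dx, dt = 20, 1.0, 1.0, 1.0
v0 = [(-1.0) ** j for j in range(n)]    # sawtooth: dominated by the stiffest mode
v_exp, v_cn = v0, v0
for _ in range(10):
    v_exp = explicit_step(v_exp, dt, d, dx)
    v_cn = crank_nicolson_step(v_cn, dt, d, dx)
print(f"initial {l2(v0):.3g}, explicit {l2(v_exp):.3g}, Crank-Nicolson {l2(v_cn):.3g}")
```

With this time step the explicit update amplifies the sawtooth by orders of magnitude, while the Crank-Nicolson update damps it, at the price of one tridiagonal solve per step in 1D (a sparse linear solve in 3D).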
Further, besides the bidomain solver, CARP consists of the ionic model library IMP, the visualization tool Meshalyzer, and codes for the computation of ECG ($\phi_e$ recovery) and MCG (cooperation with Dr. Weber dos Santos and the Physikalisch-Technische Bundesanstalt in Berlin).
Image Processing and Mesh Generation: A first attempt to segment high-resolution MRI image stacks and histological image stacks in a semi-automatic fashion has been undertaken [2]. In trichrome-stained histological sections, interesting tissue types and the extracellular space occupy areas of the color histogram with minimal overlap and thus can be segmented directly using color thresholding. Segmentation of the 3D MRI datasets constitutes a more complicated challenge. This process required several steps, including the estimation and removal of the bias field and a set of ad-hoc morphological operations.
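Color thresholding as described for the stained histological sections amounts to testing each pixel against a per-class color range. The following is only a schematic sketch: the class names follow the tissue types mentioned on the poster, but the RGB ranges are hypothetical placeholders, not values from the cited work.

```python
# Hypothetical RGB boxes (low corner, high corner) per tissue class.
CLASSES = {
    "myocyte": ((150, 0, 0), (255, 100, 100)),
    "connective tissue": ((0, 0, 150), (100, 100, 255)),
    "extracellular space": ((200, 200, 200), (255, 255, 255)),
}


def segment_pixel(rgb, classes=CLASSES):
    """Return the first class whose RGB box contains the pixel, else None."""
    for name, (low, high) in classes.items():
        if all(lo <= c <= hi for c, lo, hi in zip(rgb, low, high)):
            return name
    return None


def segment_image(pixels):
    """Label every pixel of a row-major image given as nested lists of RGB tuples."""
    return [[segment_pixel(p) for p in row] for row in pixels]


image = [[(200, 40, 40), (30, 30, 220)],
         [(240, 240, 240), (120, 120, 120)]]
print(segment_image(image))
```

This works only because the classes barely overlap in color space, as the poster notes; the MRI data need the additional bias-field and morphological steps.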
Registration: In the process of slicing the specimen, preparing the section and acquiring histological images, considerable rigid and non-rigid distortion is introduced. We correct for this distortion by using slice-to-slice registration in two steps. In the first step, an initial coarse rigid registration alignment is realized. In the second step, non-rigid registration between adjacent slices is performed following the technique developed by Keeling and Ring [3].
Mesh Generation: Within the framework of a national cooperation, the octree-based meshing software TARANTULA has been developed, which produces boundary-fitted, locally refined and conformal multi-element meshes on unstructured grids to 1) allow smooth representation of organ boundaries, and 2) reduce the degrees of freedom by using adaptive methods when discretizing the non-myocardial volume.
Arrhythmogenesis: The group of the applicant has gathered experience in carrying out computer simulations dealing with mechanisms underlying the formation of arrhythmias, using both simple slab-like geometries and more realistic setups with anatomically realistic representations of the ventricular geometry. Further, a novel method to incorporate the Purkinje system has been developed recently to allow the simulation of sinus beats and the interaction of the Purkinje system with arrhythmic activity.
Defibrillation: Electrical defibrillation is the application of a large electrical shock to the heart to terminate otherwise life-threatening arrhythmias. The applicant has been carrying out simulation studies [6] which investigated the relationship between shock energy requirements for successful defibrillation and 1) the degree of organization of an arrhythmia, and 2) the effect of microscopic heterogeneities.
Experimental Validation: A new flexible sensor for in-vitro experiments has been developed [11,10,9,5] to measure the surface potential Φ and its gradient E (electric near field) at given sites of the heart. During depolarisation, E describes a vector loop from which direction and magnitude of local conduction velocity ϑ can be computed. Four recording silver electrodes separated by 50 µm, conducting leads, and solderable pads were patterned on a 50 µm thick polyimide film. Combined with high-resolution data acquisition (sampling rate 100-800 kHz at 16-24 bit), we are able to discriminate signal latencies of just a few microseconds and to monitor the computed time course of E as well as the magnitude and direction of local conduction velocity ϑ on-line from beat to beat. Thus, the capability to accurately measure activation time as well as direction and velocity of propagation at the microscopic size scale makes the method ideally suited for the validation of computer simulations using micro-anatomical grids, as suggested in this SFB proposal. A map of E combined with the corresponding histographs forms the foundation for a micro-structure related computer model.
Figure panels: (a) Guinea-pig heart and signals Φ and E. (b) Histograph with myocytes (red) and barriers from connective tissue (blue). (c) Right atrial endocardium. (d) Sensor.
References
[1] G. Plank, M. Liebmann, R. Weber dos Santos, E. Vigmond and G. Haase. Algebraic Multigrid Preconditioner for the Cardiac Bidomain Model. IEEE Trans. Biomed. Eng., accepted pending minor revision.
[2] Burton* A.B.R., G. Plank*, J. Schneider*, V. Grau Colomer, H. Ahammer, S.L. Keeling, J.L. Lee, N. Smith, N.A. Trayanova and P. Kohl (*shared first authors). 3-Dimensional Models of Individual Cardiac Histo-Anatomy: Tools and Challenges. Ann. N.Y. Acad. Sci., in press.
[3] S. Keeling, W. Ring. Medical image registration and interpolation by optical flow with maximal rigidity. J. Math. Imag. Vision, 23:47-65, 2005.
[4] N. Trayanova, G. Plank, and B. Rodriguez. What have we learned from Mathematical Models of Defibrillation and Postshock Arrhythmogenesis? Application of Bidomain Simulations. Heart Rhythm, in press.
[5] E. Hofer, F. Kepplinger, T. Thurner, T. Wiener, and G. Plank. A novel floating sensor array to detect electric near fields of beating heart preparations. Biosensors and Bioelectronics, 21(12):2232-2239, 2006.
[6] G. Plank, L.J. Leon, S. Kimber, E.J. Vigmond. Defibrillation depends on conductivity fluctuations and the degree of disorganization in reentry patterns. J. Cardiovasc. Electrophysiol., 16(2):205-216, 2005.
[7] R. Weber dos Santos, G. Plank, S. Bauer, E.J. Vigmond. Parallel multigrid preconditioner for the cardiac bidomain model. IEEE Trans. Biomed. Eng., 51(11):1960-8, 2004.
[8] R. Weber dos Santos, G. Plank, S. Bauer, E.J. Vigmond. Preconditioning techniques for the bidomain equations. Lecture Notes in Computational Science and Engineering (LNCSE), ISBN/ISSN 1439-7358, 40:571-580, 2004.
[9] G. Plank, E.J. Vigmond, L.J. Leon, E. Hofer. Cardiac near-field morphology during conduction around a microscopic obstacle - a computer simulation study. Ann. Biomed. Eng., 31:1-7, 2003.
[10] G. Plank, E. Hofer. The use of cardiac near-field measurements to determine activation times. Ann. Biomed. Eng., 31:1066-1076, 2003.
[11] G. Plank and E. Hofer. Model study of vector-loop morphology during electrical mapping of microscopic conduction in cardiac tissue. Ann. Biomed. Eng., 28(10):1244-1252, 2000.
Overview of the Virtual Heart Model
GPU Accelerated Algebraic Multigrid Methods for the Virtual Heart Model
Manfred Liebmann†, Gernot Plank‡, Anton Prassl and Gundolf Haase
Virtual Heart Simulation: PCG-AMG Parallel Benchmark
Domain decomposition based parallelization for CPU/GPU cluster computers!

Kepler (IB)
NP   Solver   Setup
1    31.37    5.42
2    14.83    2.57
4    7.59     1.64
8    3.95     1.11
16   2.13     1.05
32   1.35     1.31

Liebmann (SHM)
NP   Solver   Setup
1    30.59    5.93
2    14.84    3.11
4    7.48     1.95
8    4.10     1.51
16   3.44     1.91

GTX (GPU)
NP   Solver   Setup
1    1.406    4.82
2    0.800    2.71
4    0.500    1.94

Table 1: Solver and setup times in seconds for Kepler (IB), Liebmann (SHM),
and GTX (GPU)
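Parallel speedup and efficiency can be read off the solver columns directly. The sketch below assumes the three sub-tables correspond, in caption order, to Kepler, Liebmann and GTX; the helper name is illustrative.

```python
# Solver times in seconds per number of processors/GPUs (NP), from the table.
SOLVER_TIMES = {
    "Kepler (IB)": {1: 31.37, 2: 14.83, 4: 7.59, 8: 3.95, 16: 2.13, 32: 1.35},
    "Liebmann (SHM)": {1: 30.59, 2: 14.84, 4: 7.48, 8: 4.10, 16: 3.44},
    "GTX (GPU)": {1: 1.406, 2: 0.800, 4: 0.500},
}


def scaling(times):
    """Map NP -> (speedup, efficiency) relative to the NP = 1 run."""
    base = times[1]
    return {procs: (base / t, base / (t * procs)) for procs, t in times.items()}


for system, times in SOLVER_TIMES.items():
    procs = max(times)
    speedup, efficiency = scaling(times)[procs]
    print(f"{system}: NP={procs}, speedup {speedup:.1f}x, efficiency {efficiency:.0%}")

# Single GPU against a single Kepler processor for the solver phase.
print(f"1 GPU vs 1 CPU: {SOLVER_TIMES['Kepler (IB)'][1] / SOLVER_TIMES['GTX (GPU)'][1]:.1f}x")
```

The setup times are left out on purpose: they are handled by the CPU in all three configurations and do not scale the same way as the solver phase.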