Important class of applications (one of the motifs)
|
|
- Bethany Lambert
- 5 years ago
- Views:
Transcription
1 DidemUnat,XingCai,ScottB.Baden LawrenceBerkeleyNationalLaboratory SimulaResearchLaboratory UniversityofCalifornia,SanDiego Oslo,Jun7,12 Importantclassofapplicationsoneofthemotifs) Basisforapproximatingderivativesnumerically Physicalsimulationse.g.turbulenceflow,seismicwavepropagation) Multimediaapplicationse.g.imagesmoothing) Nearestneighborupdateonastructuredgrid 3DHeatEqnusing fullyexplicitfinitedifferencing: U x,y,z) = c*ux,y,z) + c1*ux,y,z-1) + Ux,y,z+1)+ Ux,y-1,z) + Ux,y+1,z) + Ux-1,y,z) + Ux+1,y,z)) Highlydataparallel,memorybandwidthbound GPUspeedupsovermulti8s) " 5XforLatticeBoltzmann[Lee,ISCA 11], " 4XReverseTimeMigration[Kruger,SC 11] 4 PowerMW) Gigaflop/s 1Teraflop/s HPC$milestones$ 1Petaflop/s 1Exaflop/s 1985% 1996% 8% 18% Powerconsumptionisthemaindesignconstraint Drasticchangesinnodearchitecture[Shalf,VecPar ] Moreparallelismonthechip SoftwareSmanagedmemory/incoherentcaches Alreadystartedseeingconcreteinstances Heterogeneityincomputeresources Explicitmanagementofdatatransfer Separatedevicememoryfromthehostmemory Reengineeringofscientificapplications Algorithmicchangestomatchthehardwarecapabilities BestperformancerequiresnonStrivialknowledgeofthearchitecture host Main Memory bus accelerator On-chip Memory Vector Cores 2 5 GraphicsProcessingUnitsGPUs) Massivelyparallelsinglechipprocessor Lowpowers:tradeoffsinglethreadperformance LargeregisterfileandsoftwareSmanagedmemory Effectiveinacceleratingcertaindataparallelapplications CaseStudy:CardiacElectrophysiology[Unat,PARA ] Notidealforothers:sorting[Lee,ISCA ] host accelerator Explicitlymanagedmemory OnSchipmemoryresources Privateandincoherent e.g shared floata[n]; Hierarchicalthreadmanagement Thread,threadgroups,threadsubgroups Granularityofathread Shared Memory/L1 cache Main Memory On-chip Memory DomainSspecificoptimizations Limitstheadoptioninscientificcomputing Register File bus Vector Cores Weneedprogrammingmodelstomasterthenewtechnologyand makeitaccessibletocomputationalscientists. 3 6
2 Aimsprogrammer sproductivityandhighperformance Simplifiesapplicationdevelopment Basedonamodestnumberofcompilerdirectives #pragmamintfor Incrementalparallelization Abstractsawaytheprogrammer sviewofthehardware Seismic Modeling Cardiac Simulation Turbulent Flow Main Memory SourceStoSsourcetranslatorfortheNvidiaGPUs Parallelizesloopnests Relievestheprogrammerofavarietyoftedioustasks C + directives MotifSspecificautoSoptimizer Targetsstencilmethods Incorporatessemanticknowledgetocompileranalysis PerformsdatalocalityoptimizationsviaonSchipmemory Compilerflagsforperformancetuning 7 Serialcode AcceleratedRegion Dataparallelfor HostRegion Dataparallelfor Host Thread 8 11 Serialcode AcceleratedRegion Dataparallelfor HostRegion kernel Block Block #pragmamintparallel Accelerated Region Indicatestheacceleratedregion #pragmamintfor MarksenclosedloopSnestforacceleration 3additionalclausesforoptimizations #pragmamintcopy Data Transfer Expressesdatatransfersbetweenthehostanddevice Dataparallelfor Host Thread Block Block Block #pragmamintsingle Handlesserialsection #pragmamintbarrier Synchronizeshostanddevicethreads Synchronization 9 12
3 Performancetuningparameters HighSlevelinterfacetolowSlevel hardwarespecificoptimizations 1. ForSloopclauses handledatadecompositionandthread management nest),tile),chunksize) 2. Compilerflagsfordatalocality Register:Sregister SoftwareSmanagedmemory:Sshared Cache:SpreferL1 Shared Memory/L1 cache Register File 13 #pragma mint copyu,todevice,n+2),m+2),k+2)) n+2),m+2),k+2)) #pragma mint copyunew,todevice,n+2),m+2),k+2)) n+2),m+2),k+2)) Data while t++ < T ) Transfers #pragma mint for nestall) tile16,16,64) chunksize1,1,64) for int z=1; z<= k; z++) for int y=1; y<= m; y++) for int x=1; x<= n; x++) c1 * U[z][y][x-1] + U[z][y][x+1] + #pragma mint copyu,fromdevice,n+2),m+2),k+2)) n+2),m+2),k+2)) 16 #pragma mint copyu,todevice,n+2),m+2),k+2)) #pragma mint copyunew,todevice,n+2),m+2),k+2)) Program for the 3D Heat Eqn. while t++ < T ) #pragma mint for nestall) tile16,16,64) chunksize1,1,64) for int z=1; z<= k; z++) for int y=1; y<= m; y++) for int x=1; x<= n; x++) c1 * U[z][y][x-1] + U[z][y][x+1] + #pragma mint copyu,fromdevice,n+2),m+2),k+2)) #pragma mint copyu,todevice,n+2),m+2),k+2)) #pragma mint copyunew,todevice,n+2),m+2),k+2)) while t++ < T ) #pragmamintfor #pragma mint for nestall) tile16,16,64) chunksize1,1,64) for int z=1; z<= k; z++) for int y=1; y<= m; y++) for int x=1; x<= n; x++) c1 * U[z][y][x-1] + U[z][y][x+1] + Data parallel for loop #pragma mint copyu,fromdevice,n+2),m+2),k+2)) #pragma mint copyu,todevice,n+2),m+2),k+2)) #pragma mint copyunew,todevice,n+2),m+2),k+2)) #pragma #pragma mint mint parallel parallel while t++ < T ) #pragma mint for nestall) tile16,16,64) chunksize1,1,64) for int z=1; z<= k; z++) for int y=1; y<= m; y++) for int x=1; x<= n; x++) c1 * U[z][y][x-1] + U[z][y][x+1] + #pragma mint copyu,fromdevice,n+2),m+2),k+2)) Accelerated Region #pragma mint copyu,todevice,n+2),m+2),k+2)) #pragma depth mint of copyunew,todevice,n+2),m+2),k+2)) loop parallelism while t++ < T ) #pragmamintfornestall) #pragma mint for nestall) tile16,16,64) 
chunksize1,1,64) for int z=1; z<= k; z++) for int y=1; y<= m; y++) for int x=1; x<= n; x++) c1 * U[z][y][x-1] + U[z][y][x+1] + #pragma mint copyu,fromdevice,n+2),m+2),k+2)) 15 18
4 #pragma mint copyu,todevice,n+2),m+2),k+2)) #pragma depth mint of copyunew,todevice,n+2),m+2),k+2)) loop parallelism partitioning iteration space while t++ < T ) #pragmamintfornestall)tile16,16,64) #pragma mint for nestall) tile16,16,64) chunksize1,1,64) for int z=1; z<= k; z++) for int y=1; y<= m; y++) for int x=1; x<= n; x++) c1 * U[z][y][x-1] + U[z][y][x+1] + #pragma mint copyu,fromdevice,n+2),m+2),k+2)) #pragma mint copyu,todevice,n+2),m+2),k+2)) #pragma depth mint of copyunew,todevice,n+2),m+2),k+2)) loop parallelism partitioning iteration space while t++ < T ) #pragmamintfornestall)tile16,16,64)chunksize1,1,64) #pragma mint for nestall) tile16,16,64) chunksize1,1,64) for int z=1; z<= k; z++) for int y=1; y<= m; y++) for int x=1; x<= n; x++) c1 * U[z][y][x-1] + U[z][y][x+1] + #pragma mint copyu,fromdevice,n+2),m+2),k+2)) workload of a thread Fullyautomatedtranslationand optimizationsystem TransformationperformedontheAbstract SyntaxTree BuiltontopoftheROSEcompiler DevelopedatLLNL ROSEprovidesanAPIforgeneratingand manipulatingabstractsyntaxtrees isapartoftherosedistribution sincenov 11. Input code: C + ROSE Parser ROSE backend Output file Cuda src 23 ty 3D Grid Nx, Ny, Nz) threads tx/cx, ty/cy, tz/cz) Cuda thread Inputcode: C+ PragmaHandler BaselineTranslator MemoryManager ROSEParser LoopTransformer tz tile tx,ty,tz) chunksize cx,cy,cz) tx #pragma mint for nest#,all)tilet x,t y,t z ) chunksizec x,c y,c z ) ROSE backend KernelConfig Outliner ArgumentHandler WorkPractitioner ManagesdatadecompositionandthreadworkSassignment. Outputfile Cudasrc Optimizer 21 24
5 #pragmamintparallel whilet<t) t+=dt; #pragmamintfor fori=1;i<=n;i++) forj=1;j<=n;j++) A[i][j]=c*B[iJ1][j]+B[i+1][j] +B[i][jJ1]+B[i][j+1]); }//endofwhile }//endofparallelregion global void mint_1_1517cudapitchedptr ptr_du...) double* U = double *)ptr_du.ptr); int widthu = ptr_du.pitch / sizeofdouble ); int sliceu = ptr_du.ysize * widthu;... int _idx = threadidx.x + 1; int _gidx = _idx + blockdim.x * blockidx.x;... if _gidz >= 1 && _gidz <= k) if _gidy >= 1 && _gidy <= m) if _gidx >= 1 && _gidx <= n) Unew[indUnew] = c * U[indU] + c1 * U[indU - 1] + U[indU + 1]... ); Unpack pitched ptrs Compute local and global indices using thread and block IDs If-statements are derived from forstatements Each thread updates single data point 25 }//end of kernel 28 #pragmamintparallel whilet<t) t+=dt; #pragmamintfor fori=1;i<=n;i++) forj=1;j<=n;j++) A[i][j]=c*B[iJ1][j]+B[i+1][j] +B[i][jJ1]+B[i][j+1]); }//endofwhile }//endofparallelregion Host /* Outlined Kernel */ global void cuda_func ) }... Device AST #pragmamintparallel whilet<t) t+=dt;... /* Outlined Kernel */ global void cuda_func ) }... Device Thebaselinetranslator performsallthememoryreferences throughglobalmemory canstillparallelizeloops,launchkernel, performdatatransfers Baseline Translator Optimizer cuda_1_func_<<<threads, blocks>>> );... }//endofwhile Host }//endofparallelregion Optimizerfocusesonstencilmethods Reducesglobalmemoryaccesses DatalocalityusingonSchipmemory Stencil Analyzer OnSchipMemory Optimizer OnSchipmemoryoptimizationflags SpreferL1,Sregister,Sshared Bestperformancedependsonthedevice andapplication ModifiedAST 27
6 Analyzesarrayaccesspattern Findsstencilstructureanddependencybetweenthreads z y z y z y x Tradeoff:findthesweetspot Whichvariables?andHowmanyvariables? Maximizethetotalreductioninglobalmemoryreferences Minimizethesharedmemoryusage Planesmaycomefrom1ormorearrays 7-point 13-point 19-point Ghostcellregion SpreferL1 ConfiguresonSchipmemoryonFermi Favors48KBL1and16KBsharedmemory Sregister Takesadvantageoflargeregisterfile Placesfrequentlyaccessedarraysintoregisters Enhancesaccesstothecentralpointofastencil Shared Memory/L1 cache Register File Sshared Detectssharablereferencesamongthreads PlacestheminsharedmemorysoftwareSmanagedmemory) Anumberofplanesresideonsharedmemory tile 2D planes Tradeoff Reducesmemoryreferencestofrequentlyaccessedlocations Increasesresourcesneededbyathread,reducingconcurrency Gflops Heat Poisson Variable Poisson 19pt Tesla C6 Hand- Heat Poisson Variable Poisson 19pt Tesla C D.Unat,X.Cai,andS.Baden.":Realizingperformancein3DStencil MethodswithAnnotatedC",inInternationalConferenceonSupercomputing. ICS'11,Tucson,AZ,
7 45 Theoptimizerimprovestheperformance2.6xoverthebaseline. Hand Gflop/s Gflops Heat Poisson Variable Poisson 19pt Heat Poisson Variable Poisson 19pt Tesla C6 2.6x Tesla C 1 MPI OnTeslaC6,achieves79%ofthehandSoptimized. OnTeslaCFermi),achieves76%ofthehandSoptimized. EarthquakeSinducedseismicwavepropagation Tesla C GPU 7 32 MPI processes 6 Gflop/s YifengCuiandJunZhouatSDSC Refersto31threeSdimarrays asymmetric13spointstencil Timeconsumingloops:185lines Generatedcode:1185lines Hand 8 GordonBellPrizefinalistatSC baseline optimizer TheedcodeonsingleGPUisslightlyfasterthan32s. UsedbyresearchersSouthernCAEarthquakeCenter 32 MPI 4 nodes Petascaleanelasticwavepropagationcode 16 MPI 8 Nehalem s / node 37 8 MPI D.Unat,J.Zhou,Y.Cui,X.Cai,andS.Baden. Acceleratinga3DFinite DifferenceEarthquakeSimulationwithaCMtoMTranslator,inComputing inscienceandengineeringjournal,12. 1 MPI 8 MPI 16 MPI 32 MPI 8 Nehalem s / node baseline optimizer Hand Tesla C GPU 4 nodes Gflop/s Gflop/s Theedcodeachieves83%ofthehandSoptimized. 8 83% 1 MPI 8 MPI 16 MPI 32 MPI 8 Nehalem s / node 4 nodes baseline optimizer Hand Tesla C GPU 1 MPI 8 MPI 16 MPI 32 MPI 8 Nehalem s / node 39 4 nodes baseline optimizer Hand Tesla C GPU 42
8 Computervisionalgorithm Detectscornersandhighintensitypoints CollaboratedwithHanKim&JürgenSchulze Inserted5linesofcodeintothe originalcode~3lines) RealStimeperformancewith S22xperformanceof8Nehalems runningopenmpthreads TeslaCvsIntelXeonE54Nehalem D.Unat,H.S.Kim,J.Schulze,S.B.Baden, AutoMoptimizationofaFeature SelectionAlgorithm,inEmergingApplicationsandManySArchitecture, EAMAWorkshopcoSlocatedwithISCA,SanJose,CA, ProgrammingModel AddressestheprogrammabilityissueofGPUs " Today smassivelyparallelarchitectures " SoftwareSmanagedstorageandmassivelyparallelchip SourceStoSsourcetranslatorandoptimizer IncorporatesmotifSspecificknowledge Achievedaround8%ofthehandSoptimized " BothcommonlySusedkernelsandrealSworldapplications Availablefordownload Ourprojectwebsite " OnlineTranslator: " 46 OpenMPCextendsOpenMPtosupport Parallelizesonlyouterloopandnosharedmemoryoptimization CommercialcompilerfromPortlandGroupPGI) GeneralSpurposecompiler Gflops OpenMPC PGI HandD HeatEqn OpenACC CollaborativeeffortfromPGI,Cray,CAPs,Nvidia ShowsthatdirectiveSbasedmodelisapromisingapproach AllresultsareobtainedonTeslaC6.PGIv11.1)results:ontheLincolnsystem. 44 ScottBadenUCSD), XingCaiSimula), AllanSnavelySDSC), HanSukKimUCSD,Apple), JürgenSchulzeUCSD), YifengCuiSDSC), JunZhouSDSC), WenjieWeiSimula), RossWalkerSDSC), DanQuinlanLLNL), ROSETeamLLNL), PauliusMicikeviciusNvidia), EverettPhillipsNvidia) 47 TargetmultipleGPUs MPIcodegeneration ExtendforIntelManyIntegratedCoreMIC) Sameexecutionmodel Newclausesorcompileroptions " Registerblocking,SSEinstructions,softwareprefetching Soffload=[mic apu cuda] Maintainsinglecodebase? Main Memory Main Memory [Shalf,VecPar ]JohnShalf,SudipDosanjh,andJohnMorrison.Exascalecomputing technologychallenges.inproceedingsofthe9thinternationalconferenceonhigh PerformanceComputingforComputationalScience,VECPAR. 
[Lee,ISCA ]VictorW.Lee,ChangkyuKim,JatinChhugani,MichaelDeisher,Daehyun Kim,AnthonyD.Nguyen,NadathurSatish,MikhailSmelyanskiy,SrinivasChennuSpaty,Per Hammarlund,RonakSinghal,andPradeepDubey.DebunkingtheXGPUvs.CPUmyth: anevaluationofthroughputcomputingoncpuandgpu.sigarchcomput. [Unat,Para ]DidemUnat,XingCai,andScottBaden.OptimizingtheAlievSPanfilov modelofcardiacexcitationonheterogeneoussystems.para:stateoftheartin ScientificandParallelComputing [Lee]SeyongLee,SeungSJaiMin,andRudolfEigenmann.OpenMPtoGPGPU:acompiler frameworkforautomatictranslationandoptimization.inproceedingsofthe14thacm SIGPLANSymposiumonPrinciplesandPracticeofParallelProgramming,PPoPP 9 [William]SamuelWebbWilliams.AutoStuningPerformanceonMultiComputers.PhD thesis,eecsdepartment,universityofcalifornia,berkeley,dec8 [Carrington]LauraCarrington,MustafaM.Tikir,CatherineOlschanowsky,Michael Laurenzano,JoshuaPeraza,AllanSnavely,andStephenPoole.AnidiomSfindingtoolfor increasingproductivityofaccelerators.inproceedingsoftheinternationalconferenceon Supercomputing,ICS 11 [Datta]KaushikDatta,MarkMurphy,VasilyVolkov,SamuelWilliams,JonathanCarter, LeonidOliker,DavidPatterson,JohnShalf,andKatherineYelick.Stencilcomputation optimizationandautostuningonstatesofsthesartmultiarchitectures.insc
Didem Unat, Xing Cai, Scott Baden
Didem Unat, Xing Cai, Scott Baden
GPUs are an effective means of accelerating data-parallel applications. However, porting algorithms to a GPU still remains a challenge. A directive-based programming model
More information: Lecture 10. Stencil Methods — Limits to Performance
Lecture 10 Stencil Methods Limits to Performance Announcements Tuesday s lecture on 2/11 will be moved to room 4140 from 6.30 PM to 7.50 PM Office hours are on for today Scott B. Baden /CSE 260/ Winter
More informationCUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012
CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix
More informationModule Memory and Data Locality
GPU Teaching Kit Accelerated Computing Module 4.4 - Memory and Data Locality Tiled Matrix Multiplication Kernel Objective To learn to write a tiled matrix-multiplication kernel Loading and using tiles
More information: Physis: An Implicitly Parallel Framework for Stencil Computations
Physis: An Implicitly Parallel Framework for Stencil Computations Naoya Maruyama RIKEN AICS (Formerly at Tokyo Tech) GTC12, May 2012 1 — Good performance with low programmer productivity; Multi-GPU Application
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationGPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:
COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationDevice Memories and Matrix Multiplication
Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the
More informationMPI_Send(a,..., MPI_COMM_WORLD); MPI_Recv(a,..., MPI_COMM_WORLD, &status);
$ $ 2 global void kernel(int a[max], int llimit, int ulimit) {... } : int main(int argc, char *argv[]){ MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size);
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationAutomatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationCRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar
CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION
More informationReview. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API
Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot
More informationTiled Matrix Multiplication
Tiled Matrix Multiplication Basic Matrix Multiplication Kernel global void MatrixMulKernel(int m, m, int n, n, int k, k, float* A, A, float* B, B, float* C) C) { int Row = blockidx.y*blockdim.y+threadidx.y;
More informationIntroduction to GPU programming. Introduction to GPU programming p. 1/17
Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationDense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationOutline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun
Outline Memory Management CIS 565 Fall 2011 Qing Sun sunqing@seas.upenn.edu Kernels Matrix multiplication Managing Memory CPU and GPU have separate memory spaces Host (CPU) code manages device (GPU) memory
More informationCS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4
CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationAuto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters
Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental
More informationGPU Memory Memory issue for CUDA programming
Memory issue for CUDA programming Variable declaration Memory Scope Lifetime device local int LocalVar; local thread thread device shared int SharedVar; shared block block device int GlobalVar; global
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationGPU Performance Nuggets
GPU Performance Nuggets Simon Garcia de Gonzalo & Carl Pearson PhD Students, IMPACT Research Group Advised by Professor Wen-mei Hwu Jun. 15, 2016 grcdgnz2@illinois.edu pearson@illinois.edu GPU Performance
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationComputational Fluid Dynamics (CFD) using Graphics Processing Units
Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationParallel Computing. Lecture 19: CUDA - I
CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationGPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.
GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran G. Ruetsch, M. Fatica, E. Phillips, N. Juffa Outline WRF and RRTM Previous Work CUDA Fortran Features RRTM in CUDA
More informationGPU Architecture and Programming. Andrei Doncescu inspired by NVIDIA
GPU Architecture and Programming Andrei Doncescu inspired by NVIDIA Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions are executed
More informationOpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016
OpenACC. Part 2 Ned Nedialkov McMaster University Canada CS/SE 4F03 March 2016 Outline parallel construct Gang loop Worker loop Vector loop kernels construct kernels vs. parallel Data directives c 2013
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationCSE 591: GPU Programming. Memories. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Memories Klaus Mueller Computer Science Department Stony Brook University Importance of Memory Access Efficiency Every loop iteration has two global memory accesses two floating
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationGPU Memory. GPU Memory. Memory issue for CUDA programming. Copyright 2013 Yong Cao, Referencing UIUC ECE498AL Course Notes
Memory issue for CUDA programming CUDA Variable Type Qualifiers Variable declaration Memory Scope Lifetime device local int LocalVar; local thread thread device shared int SharedVar; shared block block
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationE6895 Advanced Big Data Analytics Lecture 8: GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 8: GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph
More informationLecture 7. Overlap Using Shared Memory Performance programming the memory hierarchy
Lecture 7 Overlap Using Shared Memory Performance programming the memory hierarchy Announcements Mac Mini lab (APM 2402) Starts Tuesday Every Tues and Fri for the next 2 weeks Project proposals due on
More informationGPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60
1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationGREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer
GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationA case study of performance portability with OpenMP 4.5
A case study of performance portability with OpenMP 4.5 Rahul Gayatri, Charlene Yang, Thorsten Kurth, Jack Deslippe NERSC pre-print copy 1 Outline General Plasmon Pole (GPP) application from BerkeleyGW
More informationLECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016
LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized
More information60x Computational Fluid Dynamics and Visualisation
60x Computational Fluid Dynamics and Visualisation Jamil Appa BAE Systems Advanced Technology Centre 1 Outline BAE Systems - Introduction Aerodynamic Design Challenges Why GPUs? CFD on GPUs Example Kernel
More informationImage Processing Optimization C# on GPU with Hybridizer
Image Processing Optimization C# on GPU with Hybridizer regis.portalez@altimesh.com Median Filter Denoising Noisy image (lena 1960x1960) Denoised image window = 3 2 Median Filter Denoising window Output[i,j]=
More informationHigh Performance Linear Algebra on Data Parallel Co-Processors I
926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018
More informationParalization on GPU using CUDA An Introduction
Paralization on GPU using CUDA An Introduction Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction to GPU 2 Introduction to CUDA Graphics Processing
More informationGraphics Processing Unit (GPU) Acceleration of Machine Vision Software for Space Flight Applications
Graphics Processing Unit (GPU) Acceleration of Machine Vision Software for Space Flight Applications Workshop on Space Flight Software November 6, 2009 Brent Tweddle Massachusetts Institute of Technology
More informationDeveloping PIC Codes for the Next Generation Supercomputer using GPUs. Viktor K. Decyk UCLA
Developing PIC Codes for the Next Generation Supercomputer using GPUs Viktor K. Decyk UCLA Abstract The current generation of supercomputer (petaflops scale) cannot be scaled up to exaflops (1000 petaflops),
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationDesigning a Domain-specific Language to Simulate Particles. dan bailey
Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver
More informationLecture 3: Introduction to CUDA
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu
More informationKernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets
P A R A L L E L C O M P U T A T I O N A L T E C H N O L O G I E S ' 2 0 1 2 KernelGen a toolchain for automatic GPU-centric applications porting Nicolas Lihogrud Dmitry Mikushin Andrew Adinets Contents
More informationGPGPU: Multi-GPU Programming
GPGPU: Multi-GPU Programming Fall 2012 HW4 __global__ void cuda_transpose(const float *ad, const int n, float *atd) { } int i = threadIdx.y + blockIdx.y*blockDim.y; int j = threadIdx.x + blockIdx.x*blockDim.x;
More informationTechnische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics
GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationS4289: Efficient solution of multiple scalar and block-tridiagonal equations
S4289: Efficient solution of multiple scalar and block-tridiagonal equations Endre László endre.laszlo [at] oerc.ox.ac.uk Oxford e-research Centre, University of Oxford, UK Pázmány Péter Catholic University,
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Pablo Brubeck Department of Physics Tecnologico de Monterrey October 14, 2016 Student Chapter Tecnológico de Monterrey Tecnológico de Monterrey Student Chapter Outline
More informationLLVM supported source-to-source translation. Translation from annotated C/C++ to CUDA C/C++
LLVM supported source-to-source translation Translation from annotated C/C++ to CUDA C/C++ Niklas Jacobsen Master s Thesis Autumn 2016 LLVM supported source-to-source translation Niklas Jacobsen 1st November
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationCSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA
CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA Sreepathi Pai October 18, 2017 URCS Outline Background Memory Code Execution Model Outline Background Memory Code Execution Model
More informationOpenACC introduction (part 2)
OpenACC introduction (part 2) Aleksei Ivakhnenko APC Contents Understanding PGI compiler output Compiler flags and environment variables Compiler limitations in dependencies tracking Organizing data persistence
More informationCOSC 462 Parallel Programming
November 22, 2017 1/12 COSC 462 Parallel Programming CUDA Beyond Basics Piotr Luszczek Mixing Blocks and Threads int N = 100, SN = N * sizeof(double); __global__ void sum(double *a, double *b, double *c) {
More informationALYA Multi-Physics System on GPUs: Offloading Large-Scale Computational Mechanics Problems
www.bsc.es ALYA Multi-Physics System on GPUs: Offloading Large-Scale Computational Mechanics Problems Vishal Mehta Engineer, Barcelona Supercomputing Center vishal.mehta@bsc.es Training BSC/UPC GPU Centre
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 WestGrid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 WestGrid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 WestGrid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 WestGrid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014
More informationHardware/Software Co-Design
1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled
More informationCSE 160 Lecture 24. Graphical Processing Units
CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC
More informationCUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University
GPU Computing K. Cooper 1 1 Department of Mathematics Washington State University 2014 Review of Parallel Paradigms MIMD Computing Multiple Instruction Multiple Data Several separate program streams, each
More informationPorting COSMO to Hybrid Architectures
Porting COSMO to Hybrid Architectures T. Gysi 1, O. Fuhrer 2, C. Osuna 3, X. Lapillonne 3, T. Diamanti 3, B. Cumming 4, T. Schroeder 5, P. Messmer 5, T. Schulthess 4,6,7 [1] Supercomputing Systems AG,
More informationCUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationLecture 5. Performance Programming with CUDA
Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy
More informationdcuda: Distributed GPU Computing with Hardware Overlap
dcuda: Distributed GPU Computing with Hardware Overlap Tobias Gysi, Jeremia Bär, Lukas Kuster, and Torsten Hoefler GPU computing gained a lot of popularity in various application domains weather & climate
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Review Secret behind GPU performance: simple cores but a large number of them; even more threads can exist live on the hardware (10k 20k threads live). Important performance
More informationIntroduction to GPU Programming
Introduction to GPU Programming Mubashir Adnan Qureshi http://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf http://developer.download.nvidia.com/cuda/training/nvidia_gpu_computing_webinars_cuda_memory_optimization.pdf
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationAn Extension of XcalableMP PGAS Language for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Language for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationOutline. L13: Application Case Studies. Reconstructing MR Images. Reconstructing MR Images 3/26/12
Outline L13: Application Case Studies Application Case Studies Advanced MRI Reconstruction Reading: Kirk and Hwu, Chapter 8 or From the sample book http://courses.ece.illinois.edu/ece498/al/textbook/ Chapter7-MRI-Case-Study.pdf
More informationScientific discovery, analysis and prediction made possible through high performance computing.
Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2018 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Aside (P3 related): linear algebra Many scientific phenomena can be modeled as matrix operations Differential
More information