CRAY User Group Meeting May 2010
1 Application Acceleration on Current and Future Cray Platforms
Alice Koniges, NERSC, Berkeley Lab; David Eder, Lawrence Livermore National Laboratory (speakers)
Robert Preissl, Jihan Kim (NERSC/LBL), Aaron Fisher, Nathan Masters, Velimir Mlaker (LLNL), Stephan Ethier, Weixing Wang (PPPL), Martin Head-Gordon (UC Berkeley), Nathan Wichmann (Cray Inc.)
CRAY User Group Meeting, May 2010
2 Various means of application speedup are described for 3 different codes
- GTS magnetic fusion particle-in-cell code: already optimized and hybrid (MPI + OpenMP); consider advanced hybrid techniques to overlap communication and computation
- Q-Chem computational chemistry: optimization for GPUs and accelerators
- ALE-AMR hydro/materials/radiation: multiphysics code with an MPI-everywhere model; library speedup; is the code appropriate for hybrid?; experiences with automatic parallelization tools
3 GTS is a massively parallel magnetic fusion application
- Gyrokinetic Tokamak Simulation (GTS) code
- Global 3D Particle-In-Cell (PIC) code to study microturbulence & transport in magnetically confined fusion plasmas of tokamaks
- Microturbulence: a very complex, nonlinear phenomenon; key in determining the instabilities of magnetic confinement of plasmas
- GTS: highly optimized Fortran90 (+C) code
- Massively parallel hybrid parallelization (MPI + OpenMP): tested on today's largest computers (Earth Simulator, IBM BG/L, Cray XT)
4 PIC: follow trajectories of charged particles in electromagnetic fields
- Scatter: compute the charge density at each grid point arising from neighboring particles
- Poisson's equation for computing the field potential (solved on a 2D poloidal plane)
- Gather: calculate forces on each particle from the electric potential
- Push: move particles in time according to the equations of motion
- Repeat
5 The parallel model of GTS has three independent levels
- One-dimensional (1D) domain decomposition in the toroidal direction. The fifth PIC step shifts particles between toroidal domains (MPI; limited to 128 planes); particles can shift to adjacent or even to more distant toroidal domains
- Particles divided among MPI processes within a toroidal domain: each process keeps a copy of the local grid, requiring processes within a domain to sum their contributions to the total grid charge density
- OpenMP compiler directives on heavily used loop regions exploit shared-memory capabilities (see the sketch below)
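As a concrete illustration of the third level, a minimal sketch of OpenMP worksharing on a hot particle loop (the arrays and physics are stand-ins, not actual GTS code, which is Fortran90):

    #include <omp.h>

    /* Push phase: each particle update is independent, so the loop
       iterations can be divided among the threads of an MPI process. */
    void push(double *x, double *v, const double *e, int np, double dt)
    {
        #pragma omp parallel for
        for (int i = 0; i < np; i++) {
            v[i] += e[i] * dt;   /* acceleration from the gathered field */
            x[i] += v[i] * dt;   /* advance position */
        }
    }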
6 Two different hybrid models in GTS: traditional OpenMP worksharing constructs and OpenMP tasks. OpenMP tasks enable us to overlap MPI communication with independent computation, so the overall runtime can be reduced by the cost of the MPI communication.
7 Overlapping communication with computation in the GTS shift routine is possible due to data-independent code sections. Work on the particle array (packing for sending, reordering, adding after sending) can be overlapped with data-independent MPI communication using OpenMP tasks.
8 Reducing the limitations of single-threaded execution (MPI communication) can be achieved with OpenMP tasks
- Overlapping MPI_Allreduce with particle work
- Overlap: the master thread encounters (!$omp master) tasking statements and creates work for the thread team for deferred execution; the MPI_Allreduce call is executed immediately
- The MPI implementation has to support at least MPI_THREAD_FUNNELED
- Subdividing tasks into smaller chunks allows better load balancing and scalability among threads (a sketch follows)
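A minimal C sketch of this pattern, with illustrative buffer names and stand-in particle work (GTS itself is Fortran90 and uses !$omp directives):

    #include <mpi.h>
    #include <omp.h>

    void overlapped_allreduce(double *local, double *global, int n,
                              double *particles, int np, MPI_Comm comm)
    {
        #pragma omp parallel
        {
            #pragma omp master
            {
                /* Queue the independent particle work as tasks in small
                   chunks so idle threads can load-balance among themselves. */
                int chunk = np / (4 * omp_get_num_threads()) + 1;
                for (int lo = 0; lo < np; lo += chunk) {
                    int hi = (lo + chunk < np) ? lo + chunk : np;
                    #pragma omp task firstprivate(lo, hi)
                    for (int j = lo; j < hi; j++)
                        particles[j] += 1.0;   /* stand-in for packing/reordering */
                }
                /* The master thread executes the collective immediately,
                   while the other threads work off the queued tasks.
                   Requires MPI_Init_thread with at least MPI_THREAD_FUNNELED. */
                MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm);
            }
        }   /* implicit barrier: all tasks complete before leaving */
    }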
9 Further communication overlaps can be achieved with OpenMP tasks exploiting data-independent code regions
- Overlapping particle reordering: reordering of the remaining particles and adding the sent particles into the array can be executed independently of the sending or receiving of shifted particles
- Overlapping the remaining MPI_Sendrecv (sketched below)
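A sketch of the same idea for the shift phase; the buffers, ranks, and reorder helper are hypothetical, and if a thread other than the master may execute the MPI task, the library must have been initialized with at least MPI_THREAD_SERIALIZED:

    #include <mpi.h>

    void reorder_remaining(double *particles, int np);   /* hypothetical helper */

    void overlap_shift(double *sendbuf, int nsend, double *recvbuf, int nrecv_max,
                       int left, int right, MPI_Comm toroidal_comm,
                       double *particles, int np)
    {
        #pragma omp parallel
        #pragma omp master
        {
            /* Exchange shifted particles while reordering the survivors. */
            #pragma omp task
            MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, right, 0,
                         recvbuf, nrecv_max, MPI_DOUBLE, left, 0,
                         toroidal_comm, MPI_STATUS_IGNORE);

            #pragma omp task
            reorder_remaining(particles, np);

            #pragma omp taskwait   /* both must finish before received particles are added */
        }
    }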
10 The OpenMP tasking version outperforms the original shifter, especially in larger poloidal domains
[Figure: performance breakdown (time in seconds) of the GTS shifter routine and its communication phases, original vs. tasking version]
- Breakdown of the GTS shifter routine using 4 OpenMP threads per MPI process, with varying domain decomposition and particles per cell, on Franklin (Cray XT4)
- MPI communication in the shift phase uses a toroidal MPI communicator (of constant size 128)
- However, note the performance difference between the 256-MPI-process run and the 2048-MPI-process run!
- Speedup is expected to be higher on larger GTS runs with hundreds of thousands of CPUs, since MPI communication is more expensive there
11 Early experiments that overlap communication with communication are promising for future HPC systems
[Figures: time (sec) vs. MPI processes and OpenMP threads per MPI process, original vs. overlapped]
- Overlapping MPI communication with other consecutive, data-independent MPI communication
- Here: iterative execution of two consecutive MPI_Allreduce calls with small and larger messages on Hopper (Cray XT5)
- The GTS shifter and pusher routines have such consecutive MPI communication
- Overlapping MPI_Allreduce with larger messages (~1K bytes) pays off when the ratio of threads/sockets per node is reasonable
- Future HPC systems are expected to have many communication channels per node (see the sketch below)
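A sketch of communication/communication overlap under stated assumptions: the two collectives are data-independent, each runs on its own (e.g., duplicated) communicator, and MPI was initialized with MPI_THREAD_MULTIPLE so two threads can be inside the library at once:

    #include <mpi.h>

    void two_reductions(double *a_in, double *a_out, int na, MPI_Comm comm_a,
                        double *b_in, double *b_out, int nb, MPI_Comm comm_b)
    {
        #pragma omp parallel
        #pragma omp master
        {
            /* Two independent reductions issued as concurrent tasks. */
            #pragma omp task
            MPI_Allreduce(a_in, a_out, na, MPI_DOUBLE, MPI_SUM, comm_a);

            #pragma omp task
            MPI_Allreduce(b_in, b_out, nb, MPI_DOUBLE, MPI_SUM, comm_b);

            #pragma omp taskwait
        }
    }

Concurrent collectives on the same communicator are erroneous in MPI, which is why each reduction gets its own communicator here.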
12 Reducing the overhead of single-threaded execution is essential for massively parallel (hybrid) codes
- The overhead of MPI communication increases when scaling applications to large numbers of MPI processes (collective MPI communication)
- Adding OpenMP compiler directives to heavily used loops can exploit shared-memory capabilities
- Overlapping MPI communication with independent computation via the new OpenMP tasking model makes use of idle cores
- Overlapping MPI communication with independent, consecutive MPI communication may be another way to reduce MPI overhead, especially on future HPC systems with many communication channels per node
13 Q-Chem: computational chemistry can accurately model molecular structures
- Q-Chem: used to model carbon capture (i.e., the reactivity of CO2 with other materials)
- Quantum calculations: accurately predict molecular equilibrium structure (used as an input to classical molecular dynamics/Monte Carlo simulations)
- RI-MP2: resolution-of-the-identity second-order Møller-Plesset perturbation theory
- Treats the correlation energy with 2nd-order Møller-Plesset theory
- Utilizes auxiliary basis sets to approximate atomic orbital densities
- Strengths: no self-interaction problem (unlike DFT); recovers 80-90% of the correlation energy
- Weakness: fifth-order computational dependency on system size (expensive)
- Goal: accelerate the RI-MP2 method in Q-Chem
- Q-Chem RI-MP2 requirements: quadratic memory, cubic storage, quartic I/O, quintic computation
14 Dominant computational steps are the fifth-order RI-MP2 routines
- The RI-MP2 routine is largely divided into seven major steps
- Test input molecules: glycine_n
- As system size increases, step 4 becomes the dominant wall time (e.g., for glycine_16, 83% of total wall time is spent in step 4)
- Reason: step 4 contains three quintic computation routines (BLAS3 matrix multiplications) and a quartic I/O read
- Goal: optimize step 4
[Figure: wall time in seconds on the Greta cluster (M. Head-Gordon): AMD quad-core Opterons]
15 The GPU and the CPU are significantly different
- GPU: graphics processing unit
- GPU: more transistors devoted to data computation (CPU: cache, loop control), hence the interest for high-performance computing
- Use CUDA (Compute Unified Device Architecture): parallel architecture developed by NVIDIA
- Step 4: CUDA matrix-matrix multiplications (~75 GFLOPS on Tesla, ~225 GFLOPS on Fermi, double precision)
- Concurrently execute CPU and GPU routines
16 The CPU and GPU can work together to produce a fast algorithm
- Step 4 CPU algorithm: T_tot = T_load + T_mm1 + T_mm2 + T_mm3 + T_rest
- Step 4 CPU+GPU algorithm: T_tot = max(T_load, T_mm1) + T_mm3 + max(T_mm2, T_copy) + T_rest (see the cuBLAS sketch below)
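The T_mm terms correspond to double-precision GEMM calls; a self-contained sketch of offloading one of them with cuBLAS (using today's cuBLAS v2 interface rather than the CUDA 2.3 library of the study; error checking omitted):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* C = A * B with column-major m x k and k x n operands. */
    void gpu_dgemm(const double *A, const double *B, double *C,
                   int m, int n, int k)
    {
        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, sizeof(double) * m * k);
        cudaMalloc((void **)&dB, sizeof(double) * k * n);
        cudaMalloc((void **)&dC, sizeof(double) * m * n);
        cudaMemcpy(dA, A, sizeof(double) * m * k, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, sizeof(double) * k * n, cudaMemcpyHostToDevice);

        cublasHandle_t h;
        cublasCreate(&h);
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, m, dB, k, &beta, dC, m);

        cudaMemcpy(C, dC, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
        cublasDestroy(h);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

In the overlapped algorithm the host-device copies would be issued with cudaMemcpyAsync on a separate stream, so a CPU GEMM (T_mm2) can proceed while the transfer (T_copy) is in flight.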
17 The I/O bottleneck is a concern for the accelerated RI-MP2 code
- Tesla/Turing (TnT): NERSC GPU testbed; Sun SunFire x4600 servers, AMD quad-core processors ("Shanghai"), 4 NVIDIA FX 5800 Quadro GPUs (4 GB memory each), CUDA 2.3, gcc, ACML
- Franklin: NERSC Cray XT4 system, 2.3 GHz AMD Opteron quad-core
[Table: RI-MP2 wall time (seconds) on Franklin, TnT (CPU), TnT (GPU), and TnT (GPU, better I/O)]
- 4.7x improvement, and more on systems with better I/O
18 ALE-AMR: hydro/materials/radiation code with MPI parallelization
- Multiphysics code using operator splitting
- ALE: Arbitrary Lagrangian-Eulerian; AMR: Adaptive Mesh Refinement
- Material interface reconstruction
- Anisotropic stress tensor
- Material strength/failure with history
- Thermal conduction, radiation diffusion
- Laser ray trace and ion deposition
- The code is used to model targets at various high-energy experimental facilities, including the National Ignition Facility (the world's largest laser)
19 Simulations can include hot radiating plasmas and cold fragmenting solids
20 ALE-AMR diffusion-based models use solvers for an implicit solve
- Energy transport in NIF ALE-AMR is based on diffusion approximations
- Diffusion equation: heat conduction and radiation diffusion (generic form below)
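Both transport models are instances of a generic diffusion form, written here in standard notation (the code's actual coefficients and couplings are more involved):

    \frac{\partial u}{\partial t} = \nabla \cdot \left( D \, \nabla u \right) + S

where u is the temperature (heat conduction) or the radiation energy density (radiation diffusion), D the diffusion coefficient, and S a source term. Implicit time discretization of this equation produces the large sparse linear systems that are handed to the solver libraries discussed below.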
21 The AMR code uses finite elements and a solver for diffusion
- Level representation vs. composite representation
- Special transition elements and basis functions
22 Some 3D simulations have unreasonable performance
- Simulated a point explosion on a 2-level AMR grid
[Table: runtime (s) vs. number of CPUs utilized, for 27x27x27 and 81x81x81 AMR meshes]
23 Open|SpeedShop used to understand the performance degradation
- Instruments an executable to collect sample data on code execution paths
- Provides insight into how the code is executing
- The hot call path feature is very useful for getting a sense of the code's bottlenecks (example invocation below)
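For example, a program-counter sampling experiment can be launched through one of Open|SpeedShop's convenience scripts (the executable and input names here are illustrative):

    osspcsamp "mpirun -np 64 ./ale_amr point_explosion.in"

The collected database can then be opened in the GUI, where the hot call path view points at the most expensive chain of calls.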
24 The hot call path feature shows that the bottleneck for degraded performance is in HYPRE
25 Whereas the bottleneck at normal performance is in the Jacobian computation
26 Sparsity plots of the HYPRE matrix and preconditioner shed light on the issue
27 The HYPRE/Euclid defaults are unsuitable for our AMR grids
- Too much nonzero fill is being allowed
- Fortunately, changing the fill behavior for Euclid is easy
- Added "level 0" to the Euclid parameters file; this parameter turns off fill entirely (C-interface sketch below)
- May not be optimal, but it should fix the degraded-performance issue
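Equivalently, the fill level can be set through HYPRE's C interface when Euclid is created as a preconditioner; a minimal sketch, assuming the surrounding solver setup is already in place:

    #include "HYPRE_parcsr_ls.h"

    /* ILU(k) with k = 0: no fill beyond the original sparsity pattern. */
    HYPRE_Solver precond;
    HYPRE_EuclidCreate(MPI_COMM_WORLD, &precond);
    HYPRE_EuclidSetLevel(precond, 0);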
28 The Euclid parameter change leads to much improved behavior
- Re-ran the point explosion simulation with Euclid fill turned off
[Table: runtime vs. number of CPUs utilized, with the default Euclid fill level (1) and with the updated fill level (0)]
29 It is useful to study the effect of the same number of MPI tasks on different numbers of cores
[Table: memory and runtime vs. number of cores]
- Speedup even with the additional inter-node communication shows promise for hybrid
- The 32- and 64-core cases have idle cores that could be used by adding OpenMP, giving additional speedup
30 ALE-AMR has good potential for hybrid, including in the SAMRAI library
- SAMRAI provides patch-level parallelism; the physics steps loop inside a patch
- OpenMP compilers can parallelize C-style loops but not the template iterators used by SAMRAI
- However, autopar in the ROSE compiler has the potential to deal with C++ constructs
- In some cases, modifying the index space can reduce the number of synchronizing barriers
31 We have investigated using autopar to parallelize code with OpenMP
- ROSE's automatic parallelization tool autopar translates source code to use OpenMP pragmas
- Input (incr.c):

    int i, j;
    int a[100][100];
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            a[i][j] = a[i][j] + 1;
        }
    }

- Used project-wide in place of the compiler, in compiler.mk:

    #CXX = g++
    CXX = autopar

- autopar translates, compiles, and links; output (rose_incr.c, then the .exe):

    #include <omp.h>
    int i, j;
    int a[100ul][100ul];
    #pragma omp parallel for private (i,j)
    for (i = 0; i <= 100 - 1; i++) {
        #pragma omp parallel for private (j)
        for (j = 0; j <= 100 - 1; j++) {
            a[i][j] = a[i][j] + 1;
        }
    }
32 The complexity of the code affects the ability of an automatic tool to be effective
- ALE-AMR uses SAMRAI for AMR, load balancing, and MPI
- Code size suggests improvement opportunities in either codebase
- Start with autopar on SAMRAI: general-use code, possibly more standardized
- Next try ALE-AMR: less code, and better known
33 Summary
- Application speedup for both current and future architectures is complicated
- Overlapping communication and computation in GTS shows promise (advanced hybrid)
- GPUs allow fast matrix-matrix multiplication in Q-Chem (next-generation architectures)
- Profiling libraries (Open|SpeedShop) was critical in ALE-AMR; one must use extreme care in library usage
- Analysis is necessary before adding hybrid parallelism to existing codes
- Auto-parallelization tools show promise