CRAY User Group Meeting May 2010
1 Application Acceleration on Current and Future Cray Platforms
Alice Koniges, NERSC, Berkeley Lab; David Eder, Lawrence Livermore National Laboratory (speakers)
Robert Preissl, Jihan Kim (NERSC/LBL), Aaron Fisher, Nathan Masters, Velimir Mlaker (LLNL), Stephan Ethier, Weixing Wang (PPPL), Martin Head-Gordon (UC Berkeley), Nathan Wichmann (Cray Inc.)
CRAY User Group Meeting, May 2010
2 Various means of application speedup are described for 3 different codes
- GTS magnetic fusion particle-in-cell code: already optimized and hybrid (MPI + OpenMP); consider advanced hybrid techniques to overlap communication and computation
- Q-Chem computational chemistry: optimization for GPUs and accelerators
- ALE-AMR hydro/materials/radiation: multiphysics code with an MPI-everywhere model; library speedup; is the code appropriate for hybrid?; experiences with automatic parallelization tools
3 GTS is a massively parallel magnetic fusion application
- Gyrokinetic Tokamak Simulation (GTS) code
- Global 3D Particle-In-Cell (PIC) code to study microturbulence & transport in magnetically confined fusion plasmas of tokamaks
- Microturbulence: a very complex, nonlinear phenomenon; key in determining the instabilities of magnetic confinement of plasmas
- GTS: highly optimized Fortran90 (+C) code
- Massively parallel hybrid parallelization (MPI + OpenMP): tested on today's largest computers (Earth Simulator, IBM BG/L, Cray XT)
4 PIC: follow trajectories of charged particles in electromagnetic fields
- Scatter: compute the charge density at each grid point arising from neighboring particles
- Poisson's equation for computing the field potential (solved on a 2D poloidal plane)
- Gather: calculate forces on each particle from the electric potential
- Push: move particles in time according to the equations of motion
- Repeat
5 The parallel model of GTS has three independent levels
- One-dimensional (1D) domain decomposition in the toroidal direction. The fifth PIC step shifts particles between toroidal domains (MPI; limited to 128 planes); particles can shift to adjacent or even to more distant toroidal domains
- Particles divided among MPI processes within a toroidal domain: each process keeps a copy of the local grid, requiring processes within a domain to sum their contributions to the total grid charge density
- OpenMP compiler directives on heavily used loop regions exploit shared-memory capabilities (see the sketch below)
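As a concrete illustration of the third level, a minimal sketch of OpenMP worksharing on a hot particle loop (the arrays and physics are stand-ins, not actual GTS code, which is Fortran90):

    #include <omp.h>

    /* Push phase: each particle update is independent, so the loop
       iterations can be divided among the threads of an MPI process. */
    void push(double *x, double *v, const double *e, int np, double dt)
    {
        #pragma omp parallel for
        for (int i = 0; i < np; i++) {
            v[i] += e[i] * dt;   /* acceleration from the gathered field */
            x[i] += v[i] * dt;   /* advance position */
        }
    }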
6 Two different hybrid models in GTS: traditional OpenMP worksharing constructs and OpenMP tasks. OpenMP tasks enable us to overlap MPI communication with independent computation, so the overall runtime can be reduced by the cost of the MPI communication.
7 Overlapping communication with computation in the GTS shift routine is possible due to data-independent code sections. Work on the particle array (packing for sending, reordering, adding after sending) can be overlapped with data-independent MPI communication using OpenMP tasks.
8 Reducing the limitations of single-threaded execution (MPI communication) can be achieved with OpenMP tasks
- Overlapping MPI_Allreduce with particle work
- Overlap: the master thread encounters (!$omp master) tasking statements and creates work for the thread team for deferred execution; the MPI_Allreduce call is executed immediately
- The MPI implementation has to support at least MPI_THREAD_FUNNELED
- Subdividing tasks into smaller chunks allows better load balancing and scalability among threads (a sketch follows)
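A minimal C sketch of this pattern, with illustrative buffer names and stand-in particle work (GTS itself is Fortran90 and uses !$omp directives):

    #include <mpi.h>
    #include <omp.h>

    void overlapped_allreduce(double *local, double *global, int n,
                              double *particles, int np, MPI_Comm comm)
    {
        #pragma omp parallel
        {
            #pragma omp master
            {
                /* Queue the independent particle work as tasks in small
                   chunks so idle threads can load-balance among themselves. */
                int chunk = np / (4 * omp_get_num_threads()) + 1;
                for (int lo = 0; lo < np; lo += chunk) {
                    int hi = (lo + chunk < np) ? lo + chunk : np;
                    #pragma omp task firstprivate(lo, hi)
                    for (int j = lo; j < hi; j++)
                        particles[j] += 1.0;   /* stand-in for packing/reordering */
                }
                /* The master thread executes the collective immediately,
                   while the other threads work off the queued tasks.
                   Requires MPI_Init_thread with at least MPI_THREAD_FUNNELED. */
                MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm);
            }
        }   /* implicit barrier: all tasks complete before leaving */
    }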
9 Further communication overlaps can be achieved with OpenMP tasks exploiting data-independent code regions
- Overlapping particle reordering: reordering of the remaining particles and adding the sent particles into the array can be executed independently of the sending or receiving of shifted particles
- Overlapping the remaining MPI_Sendrecv (sketched below)
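A sketch of the same idea for the shift phase; the buffers, ranks, and reorder helper are hypothetical, and if a thread other than the master may execute the MPI task, the library must have been initialized with at least MPI_THREAD_SERIALIZED:

    #include <mpi.h>

    void reorder_remaining(double *particles, int np);   /* hypothetical helper */

    void overlap_shift(double *sendbuf, int nsend, double *recvbuf, int nrecv_max,
                       int left, int right, MPI_Comm toroidal_comm,
                       double *particles, int np)
    {
        #pragma omp parallel
        #pragma omp master
        {
            /* Exchange shifted particles while reordering the survivors. */
            #pragma omp task
            MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, right, 0,
                         recvbuf, nrecv_max, MPI_DOUBLE, left, 0,
                         toroidal_comm, MPI_STATUS_IGNORE);

            #pragma omp task
            reorder_remaining(particles, np);

            #pragma omp taskwait   /* both must finish before received particles are added */
        }
    }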
10 The OpenMP tasking version outperforms the original shifter, especially in larger poloidal domains
[Figure: performance breakdown (time in seconds) of the GTS shifter routine and its communication phases, original vs. tasking version]
- Breakdown of the GTS shifter routine using 4 OpenMP threads per MPI process, with varying domain decomposition and particles per cell, on Franklin (Cray XT4)
- MPI communication in the shift phase uses a toroidal MPI communicator (of constant size 128)
- However, note the performance difference between the 256-MPI-process run and the 2048-MPI-process run!
- Speedup is expected to be higher on larger GTS runs with hundreds of thousands of CPUs, since MPI communication is more expensive there
11 Early experiments that overlap communication with communication are promising for future HPC systems
[Figures: time (sec) vs. MPI processes and OpenMP threads per MPI process, original vs. overlapped]
- Overlapping MPI communication with other consecutive, data-independent MPI communication
- Here: iterative execution of two consecutive MPI_Allreduce calls with small and larger messages on Hopper (Cray XT5)
- The GTS shifter and pusher routines have such consecutive MPI communication
- Overlapping MPI_Allreduce with larger messages (~1K bytes) pays off when the ratio of threads/sockets per node is reasonable
- Future HPC systems are expected to have many communication channels per node (see the sketch below)
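A sketch of communication/communication overlap under stated assumptions: the two collectives are data-independent, each runs on its own (e.g., duplicated) communicator, and MPI was initialized with MPI_THREAD_MULTIPLE so two threads can be inside the library at once:

    #include <mpi.h>

    void two_reductions(double *a_in, double *a_out, int na, MPI_Comm comm_a,
                        double *b_in, double *b_out, int nb, MPI_Comm comm_b)
    {
        #pragma omp parallel
        #pragma omp master
        {
            /* Two independent reductions issued as concurrent tasks. */
            #pragma omp task
            MPI_Allreduce(a_in, a_out, na, MPI_DOUBLE, MPI_SUM, comm_a);

            #pragma omp task
            MPI_Allreduce(b_in, b_out, nb, MPI_DOUBLE, MPI_SUM, comm_b);

            #pragma omp taskwait
        }
    }

Concurrent collectives on the same communicator are erroneous in MPI, which is why each reduction gets its own communicator here.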
12 Reducing the overhead of single-threaded execution is essential for massively parallel (hybrid) codes
- The overhead of MPI communication increases when scaling applications to large numbers of MPI processes (collective MPI communication)
- Adding OpenMP compiler directives to heavily used loops can exploit shared-memory capabilities
- Overlapping MPI communication with independent computation via the new OpenMP tasking model makes use of idle cores
- Overlapping MPI communication with independent, consecutive MPI communication may be another way to reduce MPI overhead, especially on future HPC systems with many communication channels per node
13 Q-Chem: computational chemistry can accurately model molecular structures
- Q-Chem: used to model carbon capture (i.e., the reactivity of CO2 with other materials)
- Quantum calculations: accurately predict molecular equilibrium structure (used as an input to classical molecular dynamics/Monte Carlo simulations)
- RI-MP2: resolution-of-the-identity second-order Møller-Plesset perturbation theory
- Treats the correlation energy with 2nd-order Møller-Plesset theory
- Utilizes auxiliary basis sets to approximate atomic orbital densities
- Strengths: no self-interaction problem (unlike DFT); recovers 80-90% of the correlation energy
- Weakness: fifth-order computational dependency on system size (expensive)
- Goal: accelerate the RI-MP2 method in Q-Chem
- Q-Chem RI-MP2 requirements: quadratic memory, cubic storage, quartic I/O, quintic computation
14 Dominant computational steps are the fifth-order RI-MP2 routines
- The RI-MP2 routine is largely divided into seven major steps
- Test input molecules: glycine_n
- As system size increases, step 4 becomes the dominant wall time (e.g., for glycine_16, 83% of total wall time is spent in step 4)
- Reason: step 4 contains three quintic computation routines (BLAS3 matrix multiplications) and a quartic I/O read
- Goal: optimize step 4
[Figure: wall time in seconds on the Greta cluster (M. Head-Gordon): AMD quad-core Opterons]
15 The GPU and the CPU are significantly different
- GPU: graphics processing unit
- GPU: more transistors devoted to data computation (CPU: cache, loop control), hence the interest for high-performance computing
- Use CUDA (Compute Unified Device Architecture): parallel architecture developed by NVIDIA
- Step 4: CUDA matrix-matrix multiplications (~75 GFLOPS on Tesla, ~225 GFLOPS on Fermi, double precision)
- Concurrently execute CPU and GPU routines
16 The CPU and GPU can work together to produce a fast algorithm
- Step 4 CPU algorithm: T_tot = T_load + T_mm1 + T_mm2 + T_mm3 + T_rest
- Step 4 CPU+GPU algorithm: T_tot = max(T_load, T_mm1) + T_mm3 + max(T_mm2, T_copy) + T_rest (see the cuBLAS sketch below)
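The T_mm terms correspond to double-precision GEMM calls; a self-contained sketch of offloading one of them with cuBLAS (using today's cuBLAS v2 interface rather than the CUDA 2.3 library of the study; error checking omitted):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* C = A * B with column-major m x k and k x n operands. */
    void gpu_dgemm(const double *A, const double *B, double *C,
                   int m, int n, int k)
    {
        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, sizeof(double) * m * k);
        cudaMalloc((void **)&dB, sizeof(double) * k * n);
        cudaMalloc((void **)&dC, sizeof(double) * m * n);
        cudaMemcpy(dA, A, sizeof(double) * m * k, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, sizeof(double) * k * n, cudaMemcpyHostToDevice);

        cublasHandle_t h;
        cublasCreate(&h);
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, m, dB, k, &beta, dC, m);

        cudaMemcpy(C, dC, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
        cublasDestroy(h);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

In the overlapped algorithm the host-device copies would be issued with cudaMemcpyAsync on a separate stream, so a CPU GEMM (T_mm2) can proceed while the transfer (T_copy) is in flight.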
17 The I/O bottleneck is a concern for the accelerated RI-MP2 code
- Tesla/Turing (TnT): NERSC GPU testbed; Sun SunFire x4600 servers, AMD quad-core processors ("Shanghai"), 4 NVIDIA FX 5800 Quadro GPUs (4 GB memory each), CUDA 2.3, gcc, ACML
- Franklin: NERSC Cray XT4 system, 2.3 GHz AMD Opteron quad-core
[Table: RI-MP2 wall time (seconds) on Franklin, TnT (CPU), TnT (GPU), and TnT (GPU, better I/O)]
- 4.7x improvement, and more on systems with better I/O
18 ALE-AMR: hydro/materials/radiation code with MPI parallelization
- Multiphysics code using operator splitting
- ALE: Arbitrary Lagrangian-Eulerian; AMR: Adaptive Mesh Refinement
- Material interface reconstruction
- Anisotropic stress tensor
- Material strength/failure with history
- Thermal conduction, radiation diffusion
- Laser ray trace and ion deposition
- The code is used to model targets at various high-energy experimental facilities, including the National Ignition Facility (the world's largest laser)
19 Simulations can include hot radiating plasmas and cold fragmenting solids
20 ALE-AMR diffusion-based models use solvers for an implicit solve
- Energy transport in NIF ALE-AMR is based on diffusion approximations
- Diffusion equation: heat conduction and radiation diffusion (generic form below)
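Both transport models are instances of a generic diffusion form, written here in standard notation (the code's actual coefficients and couplings are more involved):

    \frac{\partial u}{\partial t} = \nabla \cdot \left( D \, \nabla u \right) + S

where u is the temperature (heat conduction) or the radiation energy density (radiation diffusion), D the diffusion coefficient, and S a source term. Implicit time discretization of this equation produces the large sparse linear systems that are handed to the solver libraries discussed below.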
21 The AMR code uses finite elements and a solver for diffusion
- Level representation vs. composite representation
- Special transition elements and basis functions
22 Some 3D simulations have unreasonable performance
- Simulated a point explosion on a 2-level AMR grid
[Table: runtime (s) vs. number of CPUs utilized, for 27x27x27 and 81x81x81 AMR meshes]
23 Open|SpeedShop used to understand the performance degradation
- Instruments an executable to collect sample data on code execution paths
- Provides insight into how the code is executing
- The hot call path feature is very useful for getting a sense of the code's bottlenecks (example invocation below)
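For example, a program-counter sampling experiment can be launched through one of Open|SpeedShop's convenience scripts (the executable and input names here are illustrative):

    osspcsamp "mpirun -np 64 ./ale_amr point_explosion.in"

The collected database can then be opened in the GUI, where the hot call path view points at the most expensive chain of calls.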
24 The hot call path feature shows that the bottleneck for degraded performance is in HYPRE
25 Whereas the bottleneck at normal performance is in the Jacobian computation
26 Sparsity plots of the HYPRE matrix and preconditioner shed light on the issue
27 The HYPRE/Euclid defaults are unsuitable for our AMR grids
- Too much nonzero fill is being allowed
- Fortunately, changing the fill behavior for Euclid is easy
- Added "level 0" to the Euclid parameters file; this parameter turns off fill entirely (C-interface sketch below)
- May not be optimal, but it should fix the degraded-performance issue
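Equivalently, the fill level can be set through HYPRE's C interface when Euclid is created as a preconditioner; a minimal sketch, assuming the surrounding solver setup is already in place:

    #include "HYPRE_parcsr_ls.h"

    /* ILU(k) with k = 0: no fill beyond the original sparsity pattern. */
    HYPRE_Solver precond;
    HYPRE_EuclidCreate(MPI_COMM_WORLD, &precond);
    HYPRE_EuclidSetLevel(precond, 0);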
28 The Euclid parameter change leads to much improved behavior
- Re-ran the point explosion simulation with Euclid fill turned off
[Table: runtime vs. number of CPUs utilized, with the default Euclid fill level (1) and with the updated fill level (0)]
29 It is useful to study the effect of the same number of MPI tasks on different numbers of cores
[Table: memory and runtime vs. number of cores]
- Speedup even with the additional inter-node communication shows promise for hybrid
- The 32- and 64-core cases have idle cores that could be used by adding OpenMP, giving additional speedup
30 ALE-AMR has good potential for hybrid, including in the SAMRAI library
- SAMRAI provides patch-level parallelism; the physics steps loop inside a patch
- OpenMP compilers can parallelize C-style loops but not the template iterators used by SAMRAI
- However, autopar in the ROSE compiler has the potential to deal with C++ constructs
- In some cases, modifying the index space can reduce the number of synchronizing barriers
31 We have investigated using autopar to parallelize code with OpenMP
- ROSE's automatic parallelization tool autopar translates source code to use OpenMP pragmas
- Input (incr.c):

    int i, j;
    int a[100][100];
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            a[i][j] = a[i][j] + 1;
        }
    }

- Used project-wide in place of the compiler, in compiler.mk:

    #CXX = g++
    CXX = autopar

- autopar translates, compiles, and links; output (rose_incr.c, then the .exe):

    #include <omp.h>
    int i, j;
    int a[100ul][100ul];
    #pragma omp parallel for private (i,j)
    for (i = 0; i <= 100 - 1; i++) {
        #pragma omp parallel for private (j)
        for (j = 0; j <= 100 - 1; j++) {
            a[i][j] = a[i][j] + 1;
        }
    }
32 The complexity of the code affects the ability of an automatic tool to be effective
- ALE-AMR uses SAMRAI for AMR, load balancing, and MPI
- Code size suggests improvement opportunities in either codebase
- Start with autopar on SAMRAI: general-use code, possibly more standardized
- Next try ALE-AMR: less code, and better known
33 Summary
- Application speedup for both current and future architectures is complicated
- Overlapping communication and computation in GTS shows promise (advanced hybrid)
- GPUs allow fast matrix-matrix multiplication in Q-Chem (next-generation architectures)
- Profiling libraries (Open|SpeedShop) was critical in ALE-AMR; one must use extreme care in library usage
- Analysis is necessary before adding hybrid parallelism to existing codes
- Auto-parallelization tools show promise