Intermediate Parallel Programming & Cluster Computing


High Performance Computing Modernization Program (HPCMP) Summer 2011 Puerto Rico Workshop on Intermediate Parallel Programming & Cluster Computing, held in conjunction with the National Computational Science Institute (NCSI) / SC11 Conference. S.V. Providence. Jointly hosted at the Polytechnic University of Puerto Rico and the University of Oklahoma, and available live via videoconferencing (streaming video recordings coming soon). Sponsored by DoD HPCMP, SC11/ACM, NCSI, and OK EPSCoR.

HPCMP PUPR-IPPCC 2011, LN2: CUDA Overview
S.V. Providence
Department of Computer Science, Hampton University, Hampton, Virginia 23668
stephen.providence@hamptonu.edu, (757) 728-6406 (voice mail)
Polytechnic University of Puerto Rico
Intermediate Parallel Programming & Cluster Computing, Thursday, Aug. 4th, 2011

Outline

Introduction: the N-body problem
Core Methods
Algorithms and Implementations: O(n^2) version; fast O(n log n) version [Barnes-Hut]
Conclusion: Inputs, System, Compilers, Metric, Validation


Introduction: the N-body problem
P2P (point-to-point): taken pairwise, the interactions are of order n^2. Start with the first of the n points and form pairs with the other n-1 points; performing this step for all n points, asymptotically n^2 pairings (dipoles) are formed. This does not consider tri-poles, quad-poles, up to multi-poles. There are basically two categories:
macro: massive objects (> 10^10), galaxies and cosmological phenomena - Einstein (relativity; gravity)
micro: small objects (< 10^10), quarks and nano-scale phenomena - Dirac (quantum mechanics; strong, weak & EM forces)

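As a concrete illustration of the pairwise O(n^2) formulation above, the following is a minimal CUDA sketch in which each thread accumulates the gravitational acceleration on one body from all n others. It is not the workshop's code; the kernel name, array layout, and SOFTENING constant are illustrative assumptions.

// Minimal all-pairs (P2P) gravity sketch: one thread per body, O(n^2) total work.
#define SOFTENING 1e-9f   // avoids division by zero when two bodies coincide

__global__ void bodyForceKernel(const float *px, const float *py, const float *pz,
                                const float *m, float *ax, float *ay, float *az, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
    for (int j = 0; j < n; ++j) {               // pair body i with the other n-1 bodies
        float dx = px[j] - px[i];
        float dy = py[j] - py[i];
        float dz = pz[j] - pz[i];
        float r2 = dx*dx + dy*dy + dz*dz + SOFTENING;
        float invR  = rsqrtf(r2);
        float invR3 = invR * invR * invR;       // m_j * r_vec / r^3 contribution
        axi += m[j] * dx * invR3;
        ayi += m[j] * dy * invR3;
        azi += m[j] * dz * invR3;
    }
    ax[i] = axi; ay[i] = ayi; az[i] = azi;
}

A typical launch would use one thread per body, e.g. bodyForceKernel<<<(n + 255) / 256, 256>>>(...), which makes the simple-data-structures-to-arrays mapping mentioned on the next slide explicit.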

Introduction: approaches
O(n^2) requires only a straightforward translation of simple data structures to arrays for BLAS 1 and 2 computations. O(n log n) is challenging (an iterative-traversal sketch follows this list):
1. CON: repeatedly builds and traverses (in order) an irregular tree-based data structure
2. CON: performs a great deal of pointer-chasing memory operations
3. CON: the Barnes-Hut approach is typically expressed recursively
4. PRO: this approach makes interesting problem sizes computationally tractable
5. PRO: GPUs can be used to accelerate irregular problems

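Regarding CON 3: recursion maps poorly onto GPUs, so tree walks are commonly rewritten with an explicit stack. The following is a generic, hedged sketch of such an iterative walk; the flat node arrays, the opening-angle test, and all names are assumptions for illustration, not the implementation these slides describe.

// Iterative octree walk using an explicit per-thread stack instead of recursion.
// Assumed node layout: child[8*node+k] holds a child index or -1; cx/cy/cz and
// cmass hold each cell's center of mass and total mass; cellSize is the box side.
#define MAX_DEPTH 32

__device__ float3 walkTree(int root, float theta,
                           const int *child, const float *cx, const float *cy,
                           const float *cz, const float *cmass,
                           const float *cellSize, const bool *isLeaf,
                           float bx, float by, float bz)
{
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    int stack[8 * MAX_DEPTH];                 // explicit stack replaces the call stack
    int top = 0;
    stack[top++] = root;

    while (top > 0) {
        int node = stack[--top];
        float dx = cx[node] - bx, dy = cy[node] - by, dz = cz[node] - bz;
        float d  = sqrtf(dx*dx + dy*dy + dz*dz) + 1e-9f;

        if (isLeaf[node] || cellSize[node] / d < theta) {
            // leaf body, or cell far enough away: use its summarized mass/center
            float inv3 = 1.0f / (d * d * d);
            acc.x += cmass[node] * dx * inv3;
            acc.y += cmass[node] * dy * inv3;
            acc.z += cmass[node] * dz * inv3;
        } else {
            for (int k = 0; k < 8; ++k) {     // otherwise descend into the children
                int c = child[8 * node + k];
                if (c >= 0) stack[top++] = c;
            }
        }
    }
    return acc;
}

The pointer-chasing of CONs 1 and 2 is still visible here as the data-dependent loads of child[], cx[], and so on; rewriting the recursion only removes the call stack, not the irregular memory traffic.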

Core Methods: Barnes-Hut
Widely used; starts with all N bodies in a computational box and hierarchically decomposes the space around the bodies into successively smaller boxes called cells. The hierarchical decomposition is recorded in an octree (the 3D analogue of a binary tree, with up to eight children per node); each subdivision resembles a tic-tac-toe grid.
Figure: GPU Gems: Emerald Edition

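A minimal sketch of what "recording the decomposition in an octree" can look like in memory. Production GPU codes typically use flat arrays rather than a struct of this kind; both the record and the names below are illustrative assumptions, not the layout of the implementation discussed here.

// Illustrative octree cell record for a Barnes-Hut style decomposition.
// Each internal cell covers a cubic box and has up to 8 children, one per octant.
struct Cell {
    float cx, cy, cz;        // geometric center of this cell's box
    float half;              // half the box side length
    float comx, comy, comz;  // center of mass of the bodies inside (filled by summarization)
    float mass;              // total mass of the bodies inside
    int   child[8];          // indices of child cells, -1 for empty octants
    int   body;              // body index if this is a leaf holding one body, else -1
};

// Octant selection: which of the 8 children a body at (x, y, z) falls into.
__host__ __device__ inline int octantOf(const Cell &c, float x, float y, float z)
{
    int o = 0;
    if (x > c.cx) o |= 1;
    if (y > c.cy) o |= 2;
    if (z > c.cz) o |= 4;
    return o;
}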

Core Methods: pseudo code
1. read input data and transfer to GPU
   for each time step do {
     1. compute the computational box around all bodies
     2. build the hierarchical decomposition by inserting each body into the octree
     3. summarize body information in each internal octree node
     4. approximately sort the bodies by spatial distance
     5. compute the forces acting on each body with the help of the octree
     6. update body positions and velocities
   }
2. transfer results to CPU and output
(a host-side driver sketch follows below)

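The following is a hedged host-side sketch of the time-step loop in the pseudo code. The six kernels are empty placeholder stubs so the sketch compiles; their names and launch configuration are assumptions for illustration, not the actual kernels of the implementation the slides describe.

#include <cuda_runtime.h>

// Placeholder kernels standing in for the six steps of the pseudo code above.
__global__ void boundingBoxKernel() {}
__global__ void buildTreeKernel()   {}
__global__ void summarizeKernel()   {}
__global__ void sortKernel()        {}
__global__ void forceKernel()       {}
__global__ void integrateKernel()   {}

// Host driver mirroring the pseudo code: copy in, loop over time steps, copy out.
int main()
{
    const int timesteps = 10, blocks = 64, threads = 256;

    // 1. read input data and transfer to the GPU (cudaMemcpy of positions, velocities, masses)

    for (int step = 0; step < timesteps; ++step) {
        boundingBoxKernel<<<blocks, threads>>>();   // compute box around all bodies
        buildTreeKernel<<<blocks, threads>>>();     // insert each body into the octree
        summarizeKernel<<<blocks, threads>>>();     // mass and center of mass per cell
        sortKernel<<<blocks, threads>>>();          // approximate spatial sort of bodies
        forceKernel<<<blocks, threads>>>();         // force on each body via the octree
        integrateKernel<<<blocks, threads>>>();     // update positions and velocities
    }
    cudaDeviceSynchronize();

    // 2. transfer results back to the CPU (cudaMemcpy) and output
    return 0;
}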

Core Methods: hierarchical decomposition, 2D view.
Figure: GPU Gems: Emerald Edition


Core Methods: hierarchical decomposition, tree view.
Figure: GPU Gems: Emerald Edition


Core Methods: hierarchical decomposition, centers of gravity (or forces); force calculation depicted.
Figure: GPU Gems: Emerald Edition


Core Methods: runtime per simulated time step, in milliseconds, of GPU O(n^2) vs. GPU Barnes-Hut vs. CPU Barnes-Hut.
Figure: GPU Gems: Emerald Edition

Algorithms and Implementations: Fast N-body
1. The following are the expressions for the potential and force, respectively, on a body i, where r_{ij} = x_j - x_i is the vector from body i to body j:

\Phi_i = m_i \sum_{j=1}^{N} \frac{m_j}{r_{ij}}, \qquad \mathbf{F}_i = -\nabla \Phi_i

This results in O(n^2) computational complexity.
2. Here the sum for the potential is factored into a near-field and a far-field expansion, as follows:

\Phi_i = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} m_i \, r_i^{-n-1} \, Y_n^m(\theta_i, \phi_i) \underbrace{\sum_{j=1}^{N} m_j \, \rho_j^{\,n} \, Y_n^m(\alpha_j, \beta_j)}_{M_n^m}

M_n^m clusters particles into the far field, Y_n^m is the spherical harmonic function, and (r, \theta, \phi) and (\rho, \alpha, \beta) are the distance vectors from the center of the expansion to bodies i and j, respectively.


Algorithms and Implementations: Fast N-body (continued)
The sum for the potential is factored into a far-field and a near-field expansion. The far-field (multipole) expansion is

\Phi_i = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} m_i \, r_i^{-n-1} \, Y_n^m(\theta_i, \phi_i) \underbrace{\sum_{j=1}^{N} m_j \, \rho_j^{\,n} \, Y_n^m(\alpha_j, \beta_j)}_{M_n^m}

where M_n^m clusters particles in the far field, Y_n^m is the spherical harmonic function, and (r, \theta, \phi) and (\rho, \alpha, \beta) are the distance vectors from the center of the expansion to bodies i and j, respectively. The corresponding near-field (local) expansion is

\Phi_i = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} m_i \, r_i^{\,n} \, Y_n^m(\theta_i, \phi_i) \underbrace{\sum_{j=1}^{N} m_j \, \rho_j^{-n-1} \, Y_n^m(\alpha_j, \beta_j)}_{L_n^m}

where L_n^m clusters particles in the near field. Use of the fast multipole method (FMM) yields O(N); the tree structure, in which each of the N particles interacts with a list of log N cells, yields O(N log N) complexity.
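
To make the clustering idea concrete: when the expansion is centered at a cluster's center of mass, its lowest-order (n = 0) term is simply the cluster's total mass placed at that center, which is the summarized-cell information a Barnes-Hut treecode uses. The sketch below (plain host code compilable with nvcc, with illustrative names, computing the potential per unit target mass, i.e. without the m_i factor) compares the direct sum over a small cluster with its monopole approximation at a distant evaluation point.

#include <math.h>
#include <stdio.h>

// Potential at (x, y, z) from ncl cluster bodies, summed directly: O(ncl) work.
static double directPotential(const double *px, const double *py, const double *pz,
                              const double *m, int ncl, double x, double y, double z)
{
    double phi = 0.0;
    for (int j = 0; j < ncl; ++j) {
        double dx = px[j] - x, dy = py[j] - y, dz = pz[j] - z;
        phi += m[j] / sqrt(dx*dx + dy*dy + dz*dz);
    }
    return phi;
}

// Monopole (n = 0) approximation: total cluster mass at its center of mass, O(1) per evaluation.
static double monopolePotential(const double *px, const double *py, const double *pz,
                                const double *m, int ncl, double x, double y, double z)
{
    double M = 0.0, cx = 0.0, cy = 0.0, cz = 0.0;
    for (int j = 0; j < ncl; ++j) {           // one-time summarization of the cluster
        M += m[j]; cx += m[j]*px[j]; cy += m[j]*py[j]; cz += m[j]*pz[j];
    }
    cx /= M; cy /= M; cz /= M;
    double dx = cx - x, dy = cy - y, dz = cz - z;
    return M / sqrt(dx*dx + dy*dy + dz*dz);
}

int main(void)
{
    // Tiny example: a 4-body cluster near the origin, evaluated at a distant point.
    double px[] = {0.10, -0.20, 0.00,  0.15}, py[] = {0.00, 0.10, -0.10, 0.05},
           pz[] = {0.05,  0.00, 0.10, -0.10}, m[]  = {1.0,  2.0,  1.5,  0.5};
    double x = 10.0, y = 0.0, z = 0.0;
    printf("direct   = %.6f\n", directPotential(px, py, pz, m, 4, x, y, z));
    printf("monopole = %.6f\n", monopolePotential(px, py, pz, m, 4, x, y, z));
    return 0;
}

Higher-order terms of M_n^m and L_n^m refine this approximation; a treecode evaluates such expansions cell-by-body under an opening-angle test, while the FMM also translates multipole expansions into local expansions to reach O(N).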

Algorithms and Implementations: Fast N-body; flow of the treecode and FMM calculation.
Figure: GPU Gems: Emerald Edition


Conclusion: modest evaluation
No implementation of an entire N-body algorithm runs on the GPU.
Inputs: 5,000 to 50,000,000 galaxies
System: 2.53 GHz Xeon E5540 CPU with 12 GB RAM per node and a 1.66 GHz Tesla GPU with 240 cores
Compilers: CUDA codes built with nvcc v4.1 and "-O3 -arch=sm_13"; "-O2" is used with icc
Metric: runtimes appear close together
Validation: the O(n^2) version is more accurate than the Barnes-Hut algorithms because the CPU's floating-point arithmetic is more precise than the GPU's

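One contributing factor behind the Validation item is accumulation precision (for example, single precision on the GPU versus double precision on the CPU). The short host-only demonstration below is a hedged illustration of this effect, not the workshop's validation code.

#include <stdio.h>

// Summing many small contributions: the float accumulator drifts noticeably,
// while the double accumulator stays close to the exact value of 1.0.
int main(void)
{
    const int n = 10000000;
    float  sumf = 0.0f;
    double sumd = 0.0;
    for (int i = 0; i < n; ++i) {
        sumf += 1.0e-7f;     // low-order bits are lost once the sum approaches 1.0
        sumd += 1.0e-7;
    }
    printf("float  sum = %.7f\n", sumf);
    printf("double sum = %.7f\n", sumd);
    return 0;
}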

For Further Reading I
Appendix: For Further Reading
Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004.
J. Sanders, E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, NVIDIA, 2011.
Board of Trustees of the University of Illinois, 2011, NCSA News, http://www.ncsa.ilinois.edu/bluewaters/systems.html
B. Sinharoy et al., IBM POWER7 Multicore Server Processor, IBM J. Res. & Dev., Vol. 55, No. 3, Paper 1, May/June 2011.

For Further Reading II
Appendix: For Further Reading
Jeffrey Vetter, Dick Glassbrook, Jack Dongarra, Richard Fujimoto, Thomas Schulthess, Karsten Schwan, Keeneland: Enabling Heterogeneous Computing for the Open Science Community, Supercomputing Conference 2010, New Orleans, Louisiana.
C. Zeller, NVIDIA Corporation, CUDA C Basics, Supercomputing Conference 2010, New Orleans, Louisiana.