Intro to Parallel Computing

Remi Lehe, Lawrence Berkeley National Laboratory

Outline
- Modern parallel architectures
- Parallelization between nodes: MPI
- Parallelization within one node: OpenMP

Why use parallel architectures?
Example: a laser-wakefield simulation to interpret experiments at LBNL.
- 3D grid with 2500 x 200 x 200 grid points
- 0.2 billion macroparticles
- 140,000 timesteps (Courant limit!)
- ~60,000 hours on 1 core = 7 years!
=> We need either faster cores or more cores working in parallel.

History of CPU performance: nowadays, individual cores do not get faster. We need to use many cores in parallel.
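As a rough sanity check of the quoted cost, here is a short back-of-the-envelope calculation in Python. It uses only the numbers above; the per-push time it prints is the implied average over the whole PIC loop (field solve, deposition, etc. included), not a measured value.

    # Numbers quoted on the slide
    grid_points    = 2500 * 200 * 200     # 1.0e8 cells
    macroparticles = 0.2e9                # ~2 macroparticles per cell
    timesteps      = 140_000              # set by the Courant limit
    core_hours     = 60_000               # quoted cost on a single core

    pushes  = macroparticles * timesteps  # total particle pushes
    seconds = core_hours * 3600.0
    print(f"{grid_points:.1e} grid cells, {pushes:.1e} particle pushes")
    print(f"{core_hours / (24 * 365):.1f} years on one core")             # ~6.8
    print(f"{1e6 * seconds / pushes:.1f} us per macroparticle per step")  # ~7.7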

Parallel clusters
Leadership and production computing facilities contain 100,000+ cores, with fast network communication (~10 Gb/s) between individual nodes. For example:
- Titan: peak performance of 27.1 PF, 18,688 hybrid compute nodes, 8.9 MW peak power
- Edison XC30: peak performance of 2.4 PF, 124,608 processing cores, 2.1 MW peak power
- Mira: peak performance of 10 PF, 49,152 compute nodes, 4.8 MW peak power
(Courtesy Steve Binkley, BESAC 2016)

Individual nodes have several cores (i.e. computing units):
- Traditional CPUs: ~10 cores
- Xeon Phi: 68 cores
- GPUs: ~1000 (slow) cores
Cores within one node share memory; cores from different nodes do not.

Class discussion
- Is the code that you use parallelized? Do you know what technology it uses (MPI, OpenMP, etc.)?
- Do you ever use ~100 nodes, ~1000 nodes? Did you experience difficulties associated with it (scaling, etc.)?
- Do you use GPUs?
- How can this architecture be used to make a PIC code faster?
- How can the two levels of parallelism (within one node and between nodes) be used?

Parallelization between nodes: MPI

Domain decomposition
Each node deals with a fixed chunk of the simulation box (which includes the fields on the grid + the macroparticles). The nodes are not independent: they need to exchange information with other nodes via the network. Domain decomposition minimizes this: communication occurs only with a few neighbors.

Domain decomposition: particle exchange
Particle pusher: macroparticles may cross domain boundaries. After the particle pusher, this particle data needs to be communicated from one node to the other. But cores from different nodes do not share memory (i.e. the required data is not readily available).

Domain decomposition: field exchange
Field update: each node needs values from neighboring nodes to calculate the spatial derivatives at the boundary.
Field gathering/current deposition: with a wide shape factor, particles gather field values from other nodes and deposit some current/charge to other nodes.
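To make the neighbor structure concrete, here is a minimal sketch of a 2D domain decomposition using mpi4py (assumed to be installed; the file name and the 3 x 3 layout are only for illustration). Each rank owns one chunk of the box and only ever talks to its few neighbors.

    # Run with e.g.: mpirun -n 9 python decomposition.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # Arrange the ranks in a 2D Cartesian grid (e.g. 3 x 3 for 9 ranks)
    dims = MPI.Compute_dims(comm.Get_size(), 2)
    cart = comm.Create_cart(dims, periods=[True, True])
    ix, iy = cart.Get_coords(cart.Get_rank())

    # Neighbors with which this rank exchanges particles and guard cells
    left, right = cart.Shift(0, 1)
    down, up = cart.Shift(1, 1)
    print(f"rank {cart.Get_rank()} owns chunk ({ix},{iy}); "
          f"neighbors: {left}, {right}, {down}, {up}")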

Domain decomposition: guard cells
Each node has guard cells, i.e. cells that are a local copy of the physical cells of the neighboring nodes and make those values readily available. (In the figure, white cells are the physical cells simulated by the node; blue cells are the guard cells.) The guard cells need to be synced whenever their value is updated.

Sum up: the PIC loop
- Communicate particles at the boundaries between nodes
- Communicate guard cell values for E/B
- Communicate guard cell values for charge/current

Exchanging information: MPI
MPI (Message Passing Interface) is a library that allows information to be sent/received between processes:
- Each process has an id (or "rank"); the figure shows nine sub-domains handled by ranks 0 to 8.
- Each process executes the same source code.
- Sending/receiving commands are called based on the id.
Example in Python: see the mpi4py sketch below.

Problem: load balancing
Example: particle pusher, schematic view. (The figure compares the compute time of several nodes: some push particles 1 to 6 while others push only a few and then wait at the particle exchange.) The simulation will always progress at the pace of the slowest node (the one doing the most work). This is problematic when the particle distribution is very non-uniform.
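The Python example itself was not transcribed; the following is a minimal mpi4py sketch of the same idea (every process runs the same script and sends/receives based on its rank; the particle values and file name are made up).

    # Run with e.g.: mpirun -n 2 python exchange.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()      # every process executes this same code

    if rank == 0:
        # e.g. positions of particles that left rank 0's sub-domain
        leaving = [1.2, 3.4, 5.6]
        comm.send(leaving, dest=1, tag=0)
    elif rank == 1:
        arriving = comm.recv(source=0, tag=0)
        print("rank 1 received", arriving)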

Parallelization within one node: OpenMP

Parallelization within one node can also be done with MPI: divide each node's sub-domain into smaller sub-domains, with each core tied to one of them (cores 1-4 in the figure). These smaller sub-domains also have guard cells and exchange information via MPI send/receive within one node (but in this case the information does not go through the network). However, MPI within one node leads to even worse load balancing!

OpenMP within one node: load balancing
- Create more subdomains than cores (here 4 cores per node but 14 subdomains, or "tiles").
- With OpenMP, cores are not tied to one subdomain: a core can work on one subdomain and then switch to another, depending on the work that remains to be done (see the sketch below).
- This is only possible within one node, because memory is shared; there is no need for guard cells within one node.
(The figure compares the two cases: with one subdomain per core, some cores push many particles while the others wait at the particle exchange; with many tiles per core, the work is spread much more evenly.)
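Python has no OpenMP, but the dynamic assignment of tiles to cores can be illustrated with a thread pool. This is a sketch only: the function name push_tile and the random sleep are stand-ins for real tile work, and CPython's GIL means it demonstrates the scheduling idea rather than an actual speedup for pure-Python code.

    from concurrent.futures import ThreadPoolExecutor
    import time, random

    def push_tile(tile_id):
        # Tiles with more particles take longer: this is the load imbalance
        time.sleep(random.uniform(0.01, 0.1))
        return tile_id

    tiles = range(14)                                  # 14 tiles...
    with ThreadPoolExecutor(max_workers=4) as pool:    # ...shared by 4 "cores"
        for finished in pool.map(push_tile, tiles):
            pass  # an idle worker immediately starts the next unprocessed tile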

OpenMP's dangers: race conditions
(In the figure, core 2 and core 3 each perform current deposition.)
- The cores do not exchange information via MPI send/receive. Instead, they directly modify the value of the current in shared memory, without notifying the other cores.
- Potentially, two cores could simultaneously try to modify the value of the current in a given cell, which leads to inconsistencies. This can be avoided with proper care (e.g. "atomic" operations); a small illustration is sketched at the end of this section.

OpenMP: practical considerations
On the developer side: OpenMP is not available in Python, but is available in C and Fortran. It requires the use of pragmas in the code. Example in Fortran:

    ! Use OpenMP to execute the loop over tiles in parallel
    !$OMP PARALLEL DO
    DO it = 1, nt
       ! ... perform work on one tile ...
    ENDDO
    !$OMP END PARALLEL DO

- Warp does not use OpenMP for the moment.
- But Warp can use PICSAR, which does use OpenMP. PICSAR is a highly-optimized library for elementary PIC operations, such as the particle pusher, current deposition, field gathering, etc. PICSAR is soon to be released as open source.

GPU programming
- Conceptual similarities with OpenMP programming: load balancing by tiling, race conditions.
- But also differences: ~1000 (slow) cores instead of 10-60 cores; the GPU is only connected to the network through an associated CPU; GPU programming uses specific languages (CUDA, OpenCL, ...).
- The trend for the future is to bridge the gap between many-core CPUs and GPUs, both in hardware (more cores on CPUs, GPUs integrated with CPUs) and in languages (OpenMP is starting to target GPUs).

Summary
Parallel architectures are organized around (at least) two levels of parallelization:
- Inter-node (uses the network)
- Intra-node (uses shared memory)
The traditional paradigm (in the PIC community) is to use MPI at both levels. This is limited, especially due to load balancing. Novel paradigms are becoming more and more common: MPI+OpenMP (with tiles), MPI+GPU, etc.
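Race-condition illustration (referenced above). This is a minimal Python sketch, not OpenMP: threads stand in for cores, the function deposit is made up, and the lock plays the role of the atomic operation mentioned in the race-condition slide.

    # Several "cores" (threads) deposit current into the same shared array J.
    # Without the lock, the read-modify-write J[cell] += value can interleave
    # between threads and one of the deposits can be lost: a race condition.
    import threading

    J = [0.0] * 16                # shared current array (one value per cell)
    lock = threading.Lock()

    def deposit(cell, value):
        with lock:                # serialize the update, like an atomic add
            J[cell] += value

    threads = [threading.Thread(target=deposit, args=(3, 1.0)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(J[3])                   # always 4.0 with the lock; may be lower without it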