Parallel 3D Sweep Kernel with PaRSEC


Parallel 3D Sweep Kernel with PaRSEC. Salli Moustafa, Mathieu Faverge, Laurent Plagne, Pierre Ramet. 1st International Workshop on HPC-CFD in Energy/Transport Domains, August 22, 2014

Overview 1. Cartesian Transport Sweep 2. Sweep on top of PaRSEC 3. Performance Studies 4. Conclusion Parallel 3D Sweep Kernel with PaRSEC 2/19

1. Cartesian Transport Sweep 1. Cartesian Transport Sweep 2. Sweep on top of PaRSEC 3. Performance Studies 4. Conclusion Parallel 3D Sweep Kernel with PaRSEC 3/19

The Boltzmann Transport Equation (BTE)... describes the neutron flux inside a space region. We need to compute the neutron flux at position (x, y), with energy E and traveling in direction Ω: ψ(x, y, E, Ω). The BTE expresses a balance between arrival and migration of neutrons at (x, y); its resolution with the discrete ordinates (SN) method involves a so-called sweep operation that consumes the vast majority of the computation. About 10^12 Degrees of Freedom (DoFs). Parallel 3D Sweep Kernel with PaRSEC 4/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D Discretization of the spatial domain into several cells Parallel 3D Sweep Kernel with PaRSEC 5/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D Each spatial cell has 2 incoming dependencies (angular fluxes) per direction Parallel 3D Sweep Kernel with PaRSEC 5/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D At the beginning of the sweep, left and bottom fluxes are known; One cell ready to be processed. Parallel 3D Sweep Kernel with PaRSEC 5/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D The processing of the first cell sets incoming data for neighbouring cells; Two cells ready to be processed in parallel. This process continues until all cells are processed for this direction... Parallel 3D Sweep Kernel with PaRSEC 5/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D... for all directions belonging to a corner... Parallel 3D Sweep Kernel with PaRSEC 5/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D... and it is repeated for all 4 corners. Parallel 3D Sweep Kernel with PaRSEC 5/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D All directions belonging to the same quadrant can be processed in parallel. Parallel 3D Sweep Kernel with PaRSEC 5/19

Solving the Boltzmann Transport Equation The Spatial Sweep Operation in 2D 2 levels of parallelism: spatial: several cells processed in parallel; angular: for each cell, several directions processed in parallel (SIMD). Parallel 3D Sweep Kernel with PaRSEC 6/19
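To make the spatial level of parallelism concrete, here is a minimal 2D wavefront sketch for one corner, written in C with OpenMP. It is illustrative only: process_cell() is a hypothetical stand-in for the real per-cell kernel, and the actual solver groups cells into MacroCells and lets the runtime resolve the dependencies rather than using a static loop nest.

    #include <stdio.h>

    /* Hypothetical stand-in for the per-cell kernel: it would consume the
       outgoing fluxes of cells (i-1, j) and (i, j-1) and update cell (i, j). */
    static void process_cell(int i, int j) { printf("cell (%d, %d)\n", i, j); }

    /* Sweep an nx x ny grid for one corner (directions entering from the
       lower-left). Cells on the same anti-diagonal i + j = w depend only on
       cells of the previous wave, so they can be processed in parallel. */
    void sweep_corner_2d(int nx, int ny)
    {
        for (int w = 0; w <= nx + ny - 2; ++w) {   /* wavefront index */
            #pragma omp parallel for
            for (int i = 0; i <= w; ++i) {
                int j = w - i;
                if (i < nx && j < ny)
                    process_cell(i, j);
            }
        }
    }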

2. Sweep on top of PaRSEC 1. Cartesian Transport Sweep 2. Sweep on top of PaRSEC 3. Performance Studies 4. Conclusion Parallel 3D Sweep Kernel with PaRSEC 7/19

The PaRSEC Framework Parallel Runtime Scheduling and Execution Controller The PaRSEC [1] runtime system is a generic data-flow engine supporting a task-based implementation and targeting distributed hybrid systems. An algorithm is represented as a graph whose nodes are tasks and whose edges are data dependencies; the data distribution is specified through an API. [1] George Bosilca et al. DAGuE: A generic distributed DAG engine for High Performance Computing. In: Parallel Computing 38.1-2 (2012). Parallel 3D Sweep Kernel with PaRSEC 8/19

Implementation of the Sweep on top of PaRSEC Description of the DAG using the Job Data Flow (JDF) Representation A task is defined as the processing of a cell for all directions of the same corner. Parallel 3D Sweep Kernel with PaRSEC 9/19

Implementation of the Sweep on top of PaRSEC Description of the DAG using the Job Data Flow (JDF) Representation Grouping several cells into a MacroCell defines the granularity of the task. Parallel 3D Sweep Kernel with PaRSEC 9/19

Implementation of the Sweep on top of PaRSEC Description of the DAG using the Job Data Flow (JDF) Representation The whole DAG of the sweep is described using the JDF symbolic representation:

    T(a, b)
    // Execution space
    a = 0 .. 3
    b = 0 .. 3
    // Parallel partitioning
    : mcg(a, b)
    // Parameters
    RW X   <- (a != 0) ? X T(a-1, b)
           -> (a != 3) ? X T(a+1, b)
    RW Y   <- (b != 0) ? Y T(a, b-1)
           -> (b != 3) ? Y T(a, b+1)
    RW MCG <- mcg(a, b)
           -> mcg(a, b)
    BODY
    {
        computephi(MCG, X, Y, ...);
    }
    END

[Figure: DAG of the 16 T(a, b) tasks over the 4 x 4 MacroCell grid.]
computephi() is a call to a kernel, vectorized over directions, targeting CPUs. Parallel 3D Sweep Kernel with PaRSEC 9/19

Implementation of the Sweep on top of PaRSEC Data Distribution We use a blocked data distribution of the cells:

    int rank_of(int a, int b, ...) {
        // Rank of the node containing cell (a, b)
        int lp = a / (ncx / P);
        int lq = b / (ncy / Q);
        return lq * P + lp;
    }

    void *data_of(int a, int b, ...) {
        // Address of the cell object (a, b)
        int aa = a % (ncx / P);
        int bb = b % (ncy / Q);
        return &(mcg[bb][aa]);
    }

(P = 2, Q = 2) defines the process grid partitioning; cells of the same color belong to the same node. Parallel 3D Sweep Kernel with PaRSEC 10/19
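As a quick check of this blocked distribution, the following self-contained sketch prints the owner rank of each cell on a small grid. The 4 x 4 cell grid and the (P, Q) = (2, 2) process grid are assumed for the example; only rank_of() from the slide is reproduced.

    /* Prints the rank owning each cell (a, b); each 2 x 2 block of cells
       maps to one node, matching the colored blocks of the slide. */
    #include <stdio.h>

    enum { ncx = 4, ncy = 4, P = 2, Q = 2 };   /* assumed example sizes */

    static int rank_of(int a, int b) {
        int lp = a / (ncx / P);
        int lq = b / (ncy / Q);
        return lq * P + lp;
    }

    int main(void) {
        for (int b = ncy - 1; b >= 0; --b) {   /* print the top row first */
            for (int a = 0; a < ncx; ++a)
                printf("%d ", rank_of(a, b));
            printf("\n");
        }
        return 0;
    }

Expected output: the upper half of the grid belongs to ranks 2 and 3, the lower half to ranks 0 and 1.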

Implementation of the Sweep on top of PaRSEC Hybrid Implementation At runtime: individual tasks are executed by threads; data transfers between remote tasks are handled by asynchronous MPI send/receive calls. Parallel 3D Sweep Kernel with PaRSEC 11/19

3. Performance Studies 1. Cartesian Transport Sweep 2. Sweep on top of PaRSEC 3. Performance Studies 4. Conclusion Parallel 3D Sweep Kernel with PaRSEC 12/19

Shared Memory Results IVANOE/BIGMEM, 32 cores Intel X7560. 3D spatial mesh: 480 x 480 x 480 cells; N_dir = 288 directions. [Plot: Gflop/s vs. number of cores (up to 32) for PaRSEC, Intel TBB, and the practical peak.] The PaRSEC implementation achieves 291 Gflop/s (51% of the theoretical peak performance) at 32 cores, 8% faster than the Intel TBB implementation. This is a sign of reduced scheduling overhead in PaRSEC. Parallel 3D Sweep Kernel with PaRSEC 13/19

Distributed Memory Results: Hybrid. IVANOE, 768 cores (64 nodes of 12 cores), Intel X7560. 3D spatial mesh: 480 x 480 x 480 cells; N_dir = 288 directions. [Plot: Tflop/s vs. number of cores (12 to 768) for the hybrid version without priorities, against the practical peak.] Parallel efficiency: 52.7%; 4.8 Tflop/s (26.8% of the theoretical peak performance) at 768 cores. Parallel 3D Sweep Kernel with PaRSEC 14/19

Distributed Memory Results: Hybrid. IVANOE, 768 cores (64 nodes of 12 cores), Intel X7560. 3D spatial mesh: 480 x 480 x 480 cells; N_dir = 288 directions. [Plot: Tflop/s vs. number of cores (12 to 768) for the hybrid version with and without priorities, against the practical peak.] Parallel efficiency: 66.8%; 6.2 Tflop/s (34.4% of the theoretical peak performance) at 768 cores. The defined priorities make tasks belonging to the same z-plane execute first, bringing a 26% performance improvement. Parallel 3D Sweep Kernel with PaRSEC 14/19
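The slides do not show how these priorities are expressed in the JDF. As a rough illustration only (function name, signature, and mapping are hypothetical), a priority of the following form would make the scheduler drain one z-plane before starting the next:

    /* Hypothetical task priority: tasks in the z-plane closest to the sweep
       front get the highest value, so a plane is completed before work from
       the next plane is picked up. */
    static inline int task_priority(int c, int ncz) {
        return ncz - c;   /* c: z index of the MacroCell, ncz: number of planes */
    }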

Distributed Memory Results Hybrid vs Flat What is the best approach? Hybrid approach: two-level view of the cluster; threads within nodes and MPI between nodes. Classical Flat approach: one-level view of the cluster as a collection of computing cores; using only MPI. Parallel 3D Sweep Kernel with PaRSEC 15/19

Distributed Memory Results: Hybrid vs Flat. IVANOE, 384 cores Intel X7560. 3D spatial mesh: 120 x 120 x 120 cells; N_dir = 288 directions. Performance measurements for the Flat version are also obtained using PaRSEC. [Plot: Tflop/s vs. number of cores (12 to 384) for the Hybrid and Flat PaRSEC versions, against the practical peak.] The Flat version involves many more MPI communications than the Hybrid one, and is consequently less performant (by 30% at 384 cores). Would a hand-crafted MPI version be more efficient? Parallel 3D Sweep Kernel with PaRSEC 16/19

4. Conclusion 1. Cartesian Transport Sweep 2. Sweep on top of PaRSEC 3. Performance Studies 4. Conclusion Parallel 3D Sweep Kernel with PaRSEC 17/19

A Building Block for a Massively Parallel SN Solver We achieved: a parallel implementation of the Cartesian transport sweep on top of the PaRSEC framework, a task-based programming model for distributed architectures; no performance penalty compared to Intel TBB in shared memory; 6.2 Tflop/s (34.4% of the theoretical peak performance) at 768 cores of the IVANOE supercomputer. Perspectives: finishing the theoretical models; integrating this new implementation inside the DOMINO solver; acceleration step: coupling with a distributed SPN solver;... Parallel 3D Sweep Kernel with PaRSEC 18/19

Thank you for your attention! Questions Parallel 3D Sweep Kernel with PaRSEC 19/19

The Sweep Algorithm The Sweep Operation

    forall o in Octants do
        forall c in Cells do                 // c = (i, j, k)
            forall d in Directions[o] do     // d = (ν, η, ξ)
                ε_x = 2ν/Δx;  ε_y = 2η/Δy;  ε_z = 2ξ/Δz;
                ψ[o][c][d] = (ε_x ψ_L + ε_y ψ_B + ε_z ψ_F + S) / (ε_x + ε_y + ε_z + Σ_t);
                ψ_R[o][c][d]  = 2 ψ[o][c][d] - ψ_L[o][c][d];
                ψ_T[o][c][d]  = 2 ψ[o][c][d] - ψ_B[o][c][d];
                ψ_BF[o][c][d] = 2 ψ[o][c][d] - ψ_F[o][c][d];
                φ[k][j][i] = φ[k][j][i] + ψ[o][c][d] ω[d];
            end
        end
    end

9 additions or subtractions; 11 multiplications; 1 division (counted as 5 flops): 25 flops per cell and per direction. Parallel 3D Sweep Kernel with PaRSEC 1/4
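For reference, a direct C transcription of the per-cell, per-direction update above is given below. It is a sketch under assumed names (CellFluxes, dx, dy, dz, sigma_t, ...); the real computephi() kernel works on a MacroCell and is vectorized over all directions of a corner rather than being called once per direction.

    typedef struct {
        double psi_L, psi_B, psi_F;    /* incoming face fluxes (left, bottom, front) */
        double psi_R, psi_T, psi_BF;   /* outgoing face fluxes (right, top, back)    */
    } CellFluxes;

    /* Diamond-difference style update: 25 flops, counting the division as 5. */
    static double update_cell(CellFluxes *f, double nu, double eta, double xi,
                              double dx, double dy, double dz,
                              double S, double sigma_t, double omega, double *phi)
    {
        double ex = 2.0 * nu  / dx;
        double ey = 2.0 * eta / dy;
        double ez = 2.0 * xi  / dz;
        double psi = (ex * f->psi_L + ey * f->psi_B + ez * f->psi_F + S)
                   / (ex + ey + ez + sigma_t);
        f->psi_R  = 2.0 * psi - f->psi_L;   /* outgoing face fluxes */
        f->psi_T  = 2.0 * psi - f->psi_B;
        f->psi_BF = 2.0 * psi - f->psi_F;
        *phi += psi * omega;                /* accumulate the scalar flux */
        return psi;
    }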

Gflop/s Evaluation The Gflop/s value is estimated by dividing the number of floating-point operations by the completion time:

    Gflop/s = (25 x N_cells x N_dir) / Time,   with Time in nanoseconds.

Parallel 3D Sweep Kernel with PaRSEC 2/4
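As a sanity check of this formula on the mesh used in the performance studies (480^3 cells, 288 directions), the small program below evaluates the flop count of one sweep and the time it implies at the measured 291 Gflop/s on 32 cores; the resulting ~2.7 s figure is derived, not reported in the slides.

    /* Estimates the flop count of one sweep and the corresponding time at the
       measured shared-memory rate. All input numbers come from the slides. */
    #include <stdio.h>

    int main(void) {
        double n_cells = 480.0 * 480.0 * 480.0;   /* spatial cells        */
        double n_dir   = 288.0;                   /* angular directions   */
        double flops   = 25.0 * n_cells * n_dir;  /* ~7.96e11 flops       */
        double rate    = 291e9;                   /* flop/s at 32 cores   */
        printf("%.3g flops per sweep -> %.2f s at 291 Gflop/s\n",
               flops, flops / rate);
        return 0;
    }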

The Critical Arithmetic Intensity Issue

    i = (number of floating-point operations) / (number of RAM accesses, read + write)

Parallel 3D Sweep Kernel with PaRSEC 3/4
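As a rough estimate (mine, not from the slides): the per-cell, per-direction update of the appendix reads about 7 values (ψ_L, ψ_B, ψ_F, S, Σ_t, ω[d], φ) and writes about 5 (ψ, ψ_R, ψ_T, ψ_BF, φ). Assuming no cache reuse,

    i ≈ 25 flops / 12 accesses ≈ 2 flops per access ≈ 0.26 flops/byte in double precision,

which is low for current CPUs and is consistent with arithmetic intensity being a critical issue for this kernel.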

Flat Approach Data Distribution 4 nodes, each with 2 cores, i.e. 8 MPI processes; (P = 4, Q = 2) grid of processes; Az: number of planes to compute before communicating. Parallel 3D Sweep Kernel with PaRSEC 4/4