PORTING PARALLEL APPLICATIONS TO HETEROGENEOUS SUPERCOMPUTERS: LIBRARIES AND TOOLS CAN MAKE IT TRANSPARENT


PORTING PARALLEL APPLICATIONS TO HETEROGENEOUS SUPERCOMPUTERS: LIBRARIES AND TOOLS CAN MAKE IT TRANSPARENT
Jean-Yves VET, DDN Storage
Patrick CARRIBAULT, CEA
Albert COHEN, INRIA
CEA, DAM, DIF, F-91297 Arpajon, France
CATC 2016, September 15

CONTEXT (1/2): HPC AND LEGACY CODES AT THE CEA
Since the 1960s, roughly a new machine every 5 years: CDC 7600 (1976), CRAY 1S (1982), Bull Tera-10 (2006).
Some legacy codes exceed 100k lines; maintaining and porting them is a huge amount of work.
A strong need for:
- Increasing portability
- Reaching decent compute efficiency (HPC) when executing on the supercomputer
- Porting code in a cost-efficient way (libraries or transparent mechanisms, incremental changes)

CONTEXT (2/2): BACK TO 2010, TERA-100 AND EVALUATIONS OF NEW ARCHITECTURES
Tera-100 (Bull, 2010): fat nodes (Bull Coherence System), 1.254 PFLOP/s (ranked 6th in the November 2010 TOP500), mainly homogeneous (Intel Xeon), codes written in MPI + OpenMP.
Challenges raised by the architectures under evaluation:
- Increased Non-Uniform Memory Access (NUMA) effects
- Heterogeneous computing (load balancing + programming models)
- Heterogeneous nodes: non-hardware-coherent memories (CPU <-> GPU)
- Non-Uniform I/O Access (NUIOA (1))
(1) Stéphanie Moreaud, Brice Goglin, and Raymond Namyst. Adaptive MPI multirail tuning for non-uniform input/output access. EuroMPI 2010.

CONTRIBUTIONS: HETEROGENEOUS LOAD BALANCING AND IMPROVED DATA LOCALITY APPLIED TO LEGACY CODES
H3LMS (Harnessing Hierarchy and Heterogeneity with Locality Management and Scheduling)
- Bulk-synchronous (multi-phase with barriers) task decomposition to deal with heterogeneity.
- NUMA-aware scheduling: mixes data-centric work distribution and hierarchical work stealing.
- Transparent coupling with MPI and OpenMP through an implementation in a single framework.
COMPAS (Coordinate and Organize Memory Placement and Allocation for Scheduler)
- Keeps track of data residency (NUMA node) to guide scheduling. Common features with StarPU (1), XKaapi (2), or OmpSs (3).
- Distributed Shared Memory (DSM) to handle data transfers automatically between non-hardware-coherent memories in the compute node. Common feature with Minas (4).
- Allocates memory and distributes pages across NUMA nodes according to a provided pattern.
- Software caches to reduce memory transfers.
(1) Cédric Augonnet et al., StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Euro-Par 2009.
(2) Thierry Gautier et al., XKaapi: a runtime system for data-flow task programming on heterogeneous architectures. IPDPS 2013.
(3) Alejandro Duran et al., OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 2011.
(4) C. Pousa Ribeiro et al., Minas: Memory Affinity Management Framework. INRIA, 2009.
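To make the page-distribution feature concrete, here is a minimal C sketch, assuming a Linux node with libnuma (link with -lnuma): it spreads the pages of one buffer across NUMA nodes in a block-cyclic pattern. It illustrates the mechanism COMPAS automates; it is not the COMPAS interface itself.

#include <numa.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

/* Spread the pages of one buffer across NUMA nodes in a block-cyclic
 * pattern, in the spirit of the page-distribution feature described above.
 * Illustrative sketch using plain libnuma, not the COMPAS API. */
static void *alloc_block_cyclic(size_t bytes, size_t pages_per_block)
{
    void *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    if (numa_available() < 0)
        return buf;                      /* no NUMA support: plain pages */

    size_t block = pages_per_block * (size_t)sysconf(_SC_PAGESIZE);
    int nb_nodes = numa_max_node() + 1;

    for (size_t off = 0, b = 0; off < bytes; off += block, b++) {
        size_t len = (off + block <= bytes) ? block : bytes - off;
        /* Bind this block to NUMA node (b mod nb_nodes); the pages are
         * physically allocated there on first touch. */
        numa_tonode_memory((char *)buf + off, len, (int)(b % nb_nodes));
    }
    return buf;
}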

MULTI-GRANULARITY TASKS: HETEROGENEOUS LOAD BALANCING WITH SUPER-TASKS
Super-tasks are the cornerstone of the H3LMS platform.
- Multi-level load balancing: decomposable tasks to adapt the workload to the target architecture.
- Work sharing: increased spatial locality on targets, plus work stealing between units of the same type.
- Two scheduling modes:
  - For compute-intensive applications: dynamic heterogeneous load balancing using a shared queue of super-tasks.
  - Data-centric, to minimize transfers: work stealing directly from local queues.
A super-task groups several data blocks; it can be executed as a whole by a many-core device (GPU or Xeon Phi) or decomposed into smaller tasks for individual CPU cores.
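A minimal C sketch of this two-level decomposition, with hypothetical types (the actual H3LMS structures are not shown in the transcript): a super-task either runs whole on an accelerator, or is split into per-block tasks that cores of the same work team can steal.

#include <stddef.h>

/* Hypothetical super-task: one kernel applied to several data blocks. */
typedef struct {
    void  *blocks[8];              /* data blocks covered by the super-task */
    int    nb_blocks;
    void (*kernel)(void *block);   /* work to apply to one block            */
} super_task_t;

/* Coarse granularity: the whole super-task is handed to a many-core device
 * (GPU or Xeon Phi); in practice this would be one large device kernel. */
static void run_on_accelerator(super_task_t *st)
{
    for (int i = 0; i < st->nb_blocks; i++)
        st->kernel(st->blocks[i]);
}

/* Fine granularity: the super-task is decomposed into one task per block so
 * that idle cores of the same work team can steal the remaining work. */
static void run_on_cpu_team(super_task_t *st)
{
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < st->nb_blocks; i++) {
        #pragma omp task firstprivate(i)
        st->kernel(st->blocks[i]);
    }
}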

MULTI-GRANULARITY TASKS: BUILDING HIGH-PERFORMANCE LIBRARIES
Comparison with MAGMA (1), a linear-algebra library for a heterogeneous node built on the StarPU task scheduler, with Intel MKL and NVIDIA cuBLAS kernels.
Benchmark: matrix multiply (SGEMM), data transfers included, on square matrices (N = M = K) up to 40000.
Hardware: CPUs: 2 Intel Xeon X5660 (2x 4 cores); GPUs: 2 NVIDIA Tesla M2050. Blocks: 1024x1024, sub-blocks: 256x256.
Result (GFLOP/s, higher is better): H3LMS (multi-granularity, single list of super-tasks) achieves about a 20% gain over MAGMA 1.3.
(1) E. Agullo et al., Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. Journal of Physics, 2009.

HIERARCHICAL AFFINITIES IN H3LMS: ABSTRACT ORGANIZATION BASED ON THE HARDWARE TOPOLOGY
The execution units are extended with an abstract organization (based on the hwloc (1) library):
Worker Unit (constant granularity)
- Two kinds: CPU cores and accelerators
- Executes local tasks
Work Team (units sharing the same physical memory)
- Executes super-tasks
- First level of work stealing
Work Pole (memory affinity)
- NUMA / NUIOA affinities
- As many lists of super-tasks as NUIOA nodes
- COMPAS helps select the pole and super-task list
- Choice of the list at the beginning of a bulk of super-tasks (~owner-computes rule (2))
Pole hierarchy
- Poles organized as a tree to map NUMA distances
Example abstract organization in a node: 2 NUMA nodes, 2x 4-core CPUs and 2 accelerators.
(1) F. Broquedis et al., hwloc: a generic framework for managing hardware affinities in HPC applications. PDP 2010.
(2) D. Callahan et al., Compiling programs for distributed-memory multiprocessors. The Journal of Supercomputing, 1988.
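As an illustration of how such an abstract organization can be derived from the hardware topology, here is a short C sketch using the hwloc API (hwloc 2.x assumed, link with -lhwloc). It only extracts the counts a runtime like H3LMS could use to size its poles, teams and workers; the real H3LMS mapping is richer.

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* One pole per NUMA node, one team per package, one worker per core. */
    int numa_nodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    int packages   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE);
    int cores      = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);

    printf("poles (NUMA nodes): %d, teams (packages): %d, workers (cores): %d\n",
           numa_nodes, packages, cores);

    hwloc_topology_destroy(topo);
    return 0;
}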

HIERARCHICAL AFFINITIES AND COMPAS: BUILDING HIGH-PERFORMANCE LIBRARIES
Benchmark: matrix multiply (SGEMM), data transfers included, compared with MAGMA on square matrices (N = M = K) up to 40000.
Hardware: CPUs: 2 Intel Xeon X5660 (2x 4 cores); GPUs: 2 NVIDIA Tesla M2050. Blocks: 1024x1024, sub-blocks: 256x256.
Result (GFLOP/s, higher is better): H3LMS with multi-granularity, hierarchical NUMA affinities and COMPAS achieves about a 25% gain over MAGMA 1.3, ahead of the single-list-of-super-tasks variant.

REDUCE TRANSFERS WITH COMPAS: BENEFIT FROM THE SOFTWARE CACHES BETWEEN LIBRARY CALLS
Comparison with MAGMA on top of MORSE (1), an interface linking linear-algebra libraries to runtime systems (e.g. MAGMA on StarPU).
Benchmark: 5 SGEMMs accumulating into the same matrix (C = A * B + C), data transfers included, on square matrices (N = M = K) up to 40000.
With COMPAS:
- Keep the same granularity of data blocks across calls.
- Turn off the systematic flushing of software caches between library calls.
Result (GFLOP/s, higher is better): H3LMS (super-tasks + COMPAS) gains about 66% over the modified MAGMA 1.3 (MORSE/StarPU) and about 87% over the unmodified MAGMA (StarPU) 1.3.
Hardware: CPUs: 2 Intel Xeon X5660 (2x 4 cores); GPUs: 2 NVIDIA Tesla M2050. Blocks: 1024x1024, sub-blocks: 256x256.
(1) Matrices Over Runtime Systems @ Exascale (MORSE).
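A minimal sketch of the software-cache idea behind these gains, assuming CUDA device memory: a block already resident on the GPU is reused across library calls instead of being transferred again. This is a simplified illustration (no eviction shown; the glossary mentions LRU), not the COMPAS implementation.

#include <cuda_runtime.h>
#include <stddef.h>

#define CACHE_SLOTS 64

typedef struct { const void *host; void *dev; size_t bytes; } cache_entry_t;
static cache_entry_t cache[CACHE_SLOTS];
static int cache_used;

/* Return a device copy of a data block, transferring it only on a miss. */
static void *get_device_block(const void *host, size_t bytes)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].host == host && cache[i].bytes == bytes)
            return cache[i].dev;                 /* hit: no transfer needed */

    void *dev = NULL;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (cache_used < CACHE_SLOTS)
        cache[cache_used++] = (cache_entry_t){ host, dev, bytes };
    return dev;
}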

COUPLING WITH MPI AND OPENMP CODES: IMPLEMENTATION INTO THE MPC FRAMEWORK
MPC (1) (Multi-Processor Computing)
- Framework developed by the CEA and the ECR
- Thread-based MPI tightly coupled to an OpenMP implementation
- Dynamic load balancing
H3LMS-MPC
- Relies on MPC for inter-node communications
- Load balancing with H3LMS: super-tasks generated from different MPI tasks in the same compute node
(1) M. Pérache et al., MPC: a unified parallel runtime for clusters of NUMA machines. Euro-Par 2008.

EVALUATION WITH LEGACY CODES

LINPACK: HETEROGENEOUS EXECUTION OF THE HOT SPOT (1/2)
LINPACK (HPL 2.0): only 3 lines of code need to be modified (before/after shown on the slide):
1 - Allocate page-locked memory with COMPAS.
2-3 - Call the BLAS functions based on H3LMS.
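To give a feel for the scale of the change, here is a hedged sketch of the kind of modification described. compas_alloc_pinned and h3lms_dgemm are hypothetical names standing in for the actual COMPAS allocation call and the H3LMS-based BLAS entry points, which the slides only show as screenshots.

#include <cblas.h>
#include <stddef.h>

/* Hypothetical prototypes standing in for the COMPAS/H3LMS entry points. */
void *compas_alloc_pinned(size_t bytes);
void  h3lms_dgemm(const enum CBLAS_ORDER order,
                  const enum CBLAS_TRANSPOSE ta, const enum CBLAS_TRANSPOSE tb,
                  int m, int n, int k, double alpha,
                  const double *a, int lda, const double *b, int ldb,
                  double beta, double *c, int ldc);

/* Change 1: matrix allocation (was a plain malloc). */
void *allocate_matrix(size_t bytes)
{
    return compas_alloc_pinned(bytes);      /* page-locked, tracked by COMPAS */
}

/* Changes 2-3: the BLAS calls of the hot spot are rerouted to H3LMS, which
 * decomposes them into super-tasks spread over CPU cores and GPUs. */
void update_trailing_matrix(int m, int n, int k, const double *A, int lda,
                            const double *B, int ldb, double *C, int ldc)
{
    h3lms_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, -1.0, A, lda, B, ldb, 1.0, C, ldc);
}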

LINPACK: HETEROGENEOUS EXECUTION OF THE HOT SPOT (2/2)
LINPACK HPL 2.0 (N = 46080, NB = 512, P = 1, Q = 1, WC10L2L2)
- Based on the optimized libraries Intel MKL 10.1 and NVIDIA CUBLAS 4.2.
- Transparent for the user: synchronous function calls, internally decomposed into super-tasks and tasks.
Speedups over sequential MKL (higher is better):
- Sequential MKL (1 core): 1x
- Parallel MKL (OpenMP, 8 cores): 7.85x
- H3LMS (CPUs, 8 cores): 7.46x
- H3LMS (CPUs & GPUs, 6 cores + 2 GPUs): 55.03x
Homogeneous performance is close to parallel MKL (62.11 vs 68.94 GFLOP/s); heterogeneous performance reaches 482.4 GFLOP/s.
Hardware: 2x Intel Xeon Nehalem EP E5620 (2x 4 cores @ 2.4 GHz, peak: 2x 38.4 GFLOPS), 2x NVIDIA Tesla Fermi M2090 (peak: 2x 665 GFLOPS), 24 GB of DDR3 memory.

PN APPLICATION: INCREMENTAL CHANGES TO HARNESS HETEROGENEOUS COMPUTE RESOURCES (1/3)
PN solves the linear particle transport equation with a deterministic resolution (1) based on a spherical harmonics approximation. Hybrid MPI-OpenMP code. Focus on the numerical_flux function (~90% of the execution time).
CPU performance of numerical_flux (execution time, lower is better; double precision, Cartesian mesh 1536x1536, N = 15, 36 iterations, averaged over 20 runs):
- Sequential: 974 s
- 8 cores: 154.21 s (6.32x speedup)
Hardware: Tera-100 heterogeneous compute nodes; CPUs: 2x 4-core Intel Xeon E5620.
(1) Thomas A. Brunner and James Paul Holloway. Two-dimensional time-dependent Riemann solvers for neutron transport. J. Comput. Phys., 210(1):386-399, November 2005.

PN APPLICATION: INCREMENTAL CHANGES TO HARNESS HETEROGENEOUS COMPUTE RESOURCES (2/3)
The numerical_flux function operates on an x * z Cartesian mesh in four steps:
1 - "Large" matrix multiplies
2 - Small matrix multiplies
3 - Consecutive loops operating on each cell of the mesh
4 - "Large" matrix multiplies
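A hedged sketch of what the incremental port can look like: only the "large" multiplies of steps 1 and 4 are routed to a heterogeneous DGEMM, while the small multiplies of step 2 and the per-cell loops of step 3 stay on the CPU. h3lms_dgemm is the same hypothetical stand-in used in the HPL sketch above, assumed here to take a cblas-like argument list.

#include <cblas.h>

/* Hypothetical heterogeneous DGEMM standing in for the H3LMS-based BLAS. */
void h3lms_dgemm(const enum CBLAS_ORDER order,
                 const enum CBLAS_TRANSPOSE ta, const enum CBLAS_TRANSPOSE tb,
                 int m, int n, int k, double alpha,
                 const double *a, int lda, const double *b, int ldb,
                 double beta, double *c, int ldc);

/* Route a multiply of numerical_flux: large ones (steps 1 and 4) go to the
 * heterogeneous BLAS, small ones (step 2) stay on plain CPU BLAS. */
void flux_gemm(int large_step, int m, int n, int k,
               const double *a, const double *b, double *c)
{
    if (large_step)
        h3lms_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, a, m, b, k, 1.0, c, m);
    else
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, a, m, b, k, 1.0, c, m);
}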

PN APPLICATION: INCREMENTAL CHANGES TO HARNESS HETEROGENEOUS COMPUTE RESOURCES (3/3)
Heterogeneous performance of numerical_flux with H3LMS + COMPAS (multi-granularity, spatial and temporal locality). Reference: homogeneous run on 8 cores, 154.21 s. Execution times per porting step (lower is better; double precision, Cartesian mesh 1536x1536, N = 15, 36 iterations, averaged over 20 runs):
- Step 1: heterogeneous DGEMM: 84.29 s (data-transfer bound)
- Step 2: heterogeneous DGEMM, no data transfer: 32.53 s
- Step 3: heterogeneous DGEMM + hand-defined super-tasks: 68.28 s
- Step 4: heterogeneous DGEMM + local hand-defined super-tasks: 58.33 s
Final speedup, heterogeneous vs homogeneous: 2.65x.
Hardware: Tera-100 heterogeneous node; CPUs: 2x 4-core Intel Xeon E5620; accelerators: 2x NVIDIA Tesla M2090.

THANK YOU

Backup: SUPER-TASK AFFINITY, IMPROVING TEMPORAL LOCALITY (1/2)
Software cache and execution order of the super-tasks (1): the list of super-tasks is sorted at bulk instantiation according to its affinity with the corresponding memory. The affinity index depends on:
- The quantity of data
- Whether blocks of data are already held inside the cache
This new scheduling policy, based on cache statistics, reduces data transfers.
(1) Jean-Yves Vet, Patrick Carribault, and Albert Cohen. Multigrain affinity for heterogeneous work stealing. MULTIPROG-2012.
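A minimal C sketch of this affinity-driven ordering, with hypothetical structures: each super-task is scored by how much of the data it touches is already resident in the target's software cache, and the bulk is sorted so that the best-cached super-tasks run first. The real H3LMS affinity index may combine these terms differently.

#include <stdlib.h>
#include <stddef.h>

typedef struct {
    size_t total_bytes;    /* quantity of data referenced by the super-task */
    size_t cached_bytes;   /* part of it already held in the software cache */
} super_task_info_t;

static double affinity(const super_task_info_t *st)
{
    if (st->total_bytes == 0) return 0.0;
    return (double)st->cached_bytes / (double)st->total_bytes;
}

static int by_affinity_desc(const void *a, const void *b)
{
    double da = affinity(a), db = affinity(b);
    return (da < db) - (da > db);          /* highest affinity first */
}

/* Sort a freshly instantiated bulk of super-tasks before execution. */
static void sort_bulk(super_task_info_t *bulk, size_t n)
{
    qsort(bulk, n, sizeof *bulk, by_affinity_desc);
}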

Backup: SUPER-TASK AFFINITY, IMPROVING TEMPORAL LOCALITY (2/2)
Sparse LU, step 3, single precision, data transfers included (GFLOP/s, higher is better; 192x192 sub-blocks, matrix dimensions up to 25000).
Configurations compared: CPUs alone, GPU alone (with software cache), theoretical cumulated CPUs + GPU, and CPUs + GPU with scheduling based on cache statistics.
The combined CPU + GPU performance exceeds the theoretical cumulated value thanks to fewer data transfers.
Hardware: CPUs: 2x 12-core AMD Opteron 6164 HE; accelerator: 1 NVIDIA GeForce GTX 470 GPU.

Backup: PN APPLICATION, MULTI-NODE STRONG SCALING
Heterogeneous performance of numerical_flux on multiple compute nodes, plotted against the number of compute nodes and the mesh dimensions.
Hardware: Tera-100 heterogeneous nodes; CPUs: 2x 4-core Intel Xeon E5620; accelerators: 2x NVIDIA Tesla M2090.

Backup: HIERARCHICAL AFFINITIES IN H3LMS, CHOICE OF THE SUPER-TASK LIST
Abstract organization in a node: 2 NUMA nodes, 2x 12-core AMD Magny-Cours CPUs (dual package, 1 memory controller per CPU) and 2 accelerators attached to the same NUMA node.

Backup: COMPAS API
Proposal based on pragmas.
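The slide only shows the proposed pragmas as a screenshot. As a purely hypothetical illustration of what a pragma-based placement API can look like (the directive name and clauses below are invented for this sketch, not the actual COMPAS syntax):

#include <stdlib.h>

void init_matrix(size_t n)
{
    double *A = malloc(n * n * sizeof(double));

    /* Hypothetical directive: declare a block-cyclic page distribution for A;
     * the first-touch initialization below then follows that pattern. */
    #pragma compas distribute(A) pattern(block_cyclic)
    for (size_t i = 0; i < n * n; i++)
        A[i] = 0.0;
}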

Backup: H3LMS API
Proposal based on pragmas.

Glossary
BCS: Bull Coherence System
BLAS: Basic Linear Algebra Subprograms
COMPAS: Coordinate and Organize Memory Placement and Allocation for Scheduler
CPU: Central Processing Unit
DGEMM: Double-precision matrix-matrix multiplication
DSM: Distributed Shared Memory
DTRSM: Solves one of the matrix equations op(A)*X = alpha*B or X*op(A) = alpha*B
GFLOPS: Giga Floating-point Operations Per Second
GPU: Graphics Processing Unit
H3LMS: Harnessing Hierarchy and Heterogeneity with Locality Management and Scheduling
HPC: High Performance Computing
HPL: High Performance Linpack
LRU: Least Recently Used
NUMA: Non-Uniform Memory Access
NUIOA: Non-Uniform Input/Output Access
PFLOPS: Peta Floating-point Operations Per Second