Addressing Heterogeneity in Manycore Applications

Similar documents
Parallel Hybrid Computing Stéphane Bihan, CAPS

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

Heterogeneous Multicore Parallel Programming

How to Write Code that Will Survive the Many-Core Revolution

Dealing with Heterogeneous Multicores

High performance Computing and O&G Challenges

Incremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray

Parallel Hybrid Computing F. Bodin, CAPS Entreprise

CAPS Technology. ProHMPT, 2009 March12 th

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Turbostream: A CFD solver for manycore

Trends and Challenges in Multicore Programming

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO

High Performance Computing with Accelerators

Code Migration Methodology for Heterogeneous Systems

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3

A Uniform Programming Model for Petascale Computing

What does Heterogeneity bring?

An innovative compilation tool-chain for embedded multi-core architectures M. Torquati, Computer Science Departmente, Univ.

Technology for a better society. hetcomp.com

Pedraforca: a First ARM + GPU Cluster for HPC

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

Parallel Programming. Libraries and Implementations

Accelerating Financial Applications on the GPU

designing a GPU Computing Solution

Chapter 3 Parallel Software

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

Parallel Hybrid Computing F. Bodin, CAPS Entreprise

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters

GPU Architecture. Alan Gray EPCC The University of Edinburgh

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Programming Support for Heterogeneous Parallel Systems

OpenACC Course. Office Hour #2 Q&A

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Performance potential for simulating spin models on GPU

Auto-tuning a High-Level Language Targeted to GPU Codes. By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos

Directions in HPC Technology

The Era of Heterogeneous Computing

SPOC : GPGPU programming through Stream Processing with OCaml

Parallel Programming on Ranger and Stampede

HMPP port. G. Colin de Verdière (CEA)

Modern Processor Architectures. L25: Modern Compiler Design

Preparing for Highly Parallel, Heterogeneous Coprocessing

Experts in Application Acceleration Synective Labs AB

SIMD Exploitation in (JIT) Compilers

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

Parallel Programming Libraries and implementations

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

GPUfs: Integrating a file system with GPUs

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OMP2HMPP: HMPP Source Code Generation from Programs with Pragma Extensions

StarPU: a runtime system for multigpu multicore machines

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

The Heterogeneous Programming Jungle. Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest

CUDA GPGPU Workshop 2012

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators

Parallel Computing. November 20, W.Homberg

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

OpenCL: History & Future. November 20, 2017

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP

CPU-GPU Heterogeneous Computing

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

ET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc.

Hybrid Implementation of 3D Kirchhoff Migration

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

Mapping MPI+X Applications to Multi-GPU Architectures

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

Portable Parallel Programming for Multicore Computing

Martin Dubois, ing. Contents

Overview of research activities Toward portability of performance

Unified Tasks and Conduits for Programming on Heterogeneous Computing Platforms

Parallel Programming. Libraries and implementations

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

S Comparing OpenACC 2.5 and OpenMP 4.5

FPGA-based Supercomputing: New Opportunities and Challenges

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

Utilisation of the GPU architecture for HPC

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

A General Discussion on! Parallelism!

Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

Parallel Systems. Project topics

Transcription:

Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com

Introduction Hardware accelerators (HWAs) such as GPUs can bring very high application speedups but No consensus on the programming model Their usage brings complexity in the management of the application codes No consensus on the architecture Software moves slowly, hardware moves fastly There is a need for a glue between general purpose programming and HWA specific environments Addressing Heterogeneity in Manycore Applications - Houston, March 2008 2

Hybrid Application View Application 1 Application Hybrid Processor System 1 Execution time 2A 2B 2C 3A generic core generic core Control and data transfers stream core stream core 2A 2B 2C 3A 3B Execution time 3B 4 4 Parallel and distributed execution Sequential execution Addressing Heterogeneity in Manycore Applications - Houston, March 2008 3

Stream Computing A similar computation is performed on a collection of data (stream) There is no data dependence between the computation on different stream elements Addressing Heterogeneity in Manycore Applications - Houston, March 2008 4

Heterogeneous Platforms Homogeneous multi-core with already integrated specialized compute units (SSE, Cell) Use of external accelerators mainly interconnected with PCIe NVIDIA Tesla AMD-ATI GPUs Various FPGAs: Celoxica, Mitrionics ASICs: ClearSpeed Coming integrated hybrid systems Intel Larrabee (QuickPath, PCIe) AMD Fusion (HyperTransport ) Addressing Heterogeneity in Manycore Applications - Houston, March 2008 5

Existing Programming Tools Numerous HWA specific environments NVIDIA CUDA AMD CAL (plus Brook, IL) RapidMind No standard languages but extensions Address the programming of the stream cores Require new language or API training Addressing Heterogeneity in Manycore Applications - Houston, March 2008 6

HMPP Concepts 1. Expressing parallelism and communications in C and Fortran source Through the use of directives à la OpenMP The legacy code is preserved to ensure portability and default compilation and execution 2. Dealing with resource availability and scheduling Scale to various platform configurations Application should still run even if a core is not available or locked 3. Programming the HWAs Insulate HWA specific computations Use hardware vendor SDK Domain specific code generators Addressing Heterogeneity in Manycore Applications - Houston, March 2008 7

HMPP Directives Define hardware specific implementations of functions (codelets) Can be specialized to the execution context (data size, ) Codelet execution Synchronous, asynchronous properties Express data transfers Pre loading of data Insert synchronization barriers Host CPU will wait until remote computation is complete Addressing Heterogeneity in Manycore Applications - Houston, March 2008 8

HMPP Codelet Pure function No side effects Declare a codelet Pragma label Synchronize release HWA Which codelet call to perform remotely in asynchronous mode Addressing Heterogeneity in Manycore Applications - Houston, March 2008 9

HMPP Code Generation Flow Addressing Heterogeneity in Manycore Applications - Houston, March 2008 10

Codelet Execution Guard Codelet implementation may have restrictions Performance Data size, Guard to restrict HWA codelet usage January 2008 Addressing Heterogeneity in Manycore Applications - Houston, March 2008 11

Communication Directives Data transfers strongly impact performance Data prefetching Avoid reloading data Where to prefetch Argument is prefetched Addressing Heterogeneity in Manycore Applications - Houston, March 2008 12

Codelet Code Generation Stream computing specific approach C and Fortran inputs Code parallelisation machine independent directives CUDA, SSE, CAL (soon) target languages Intend to cover most frequent cases Generate all runtime support routines Communication, Addressing Heterogeneity in Manycore Applications - Houston, March 2008 13

HMPP Application View HMPP annotated application 1 2A 2B 2C 3A 3B HMPP runtime 2A 2B 2C codelets 2A 2B 2C pthread codelets HWA1 runtime HWA2 runtime Host OS 4 codelets generic core stream core generic core stream core Addressing Heterogeneity in Manycore Applications - Houston, March 2008 14

HMPP Key Points Not disruptive: keep the application code portable A glue between general purpose programming and hardware specific environments No new programming language or model: a standard way «àla»openmp Leverage the hardware vendor tools Application scaling to various system configurations Provide hardware interoperability Addressing Heterogeneity in Manycore Applications - Houston, March 2008 15

Use Example: RTM Parallelized with MPI Each sub-domain processed by a GPU Modeling / Migration Only border nodes are transfered Cluster configuration 5 nodes Bi-socket quadricore TeslaS870(4GPUs) PCIe gen2 Addressing Heterogeneity in Manycore Applications - Houston, March 2008 16

Current Status For a simple 2D test case using snapshoting on a workstation with Quadro FX5600 8x for the computational parts About 1/3 of the application time is disk I/O Key optimizations CUDA kernels carefully tuned Data alignment, coalescing Data transfers must be asynchronously overlapped Next work 3D case using 2D operations Addressing Heterogeneity in Manycore Applications - Houston, March 2008 17

Conclusion Hybrid computing offers cheap high performance per Watt Code portability and resource management are main issues HMPP approach is a first step toward a complete heterogeneous programming workbench Addressing Heterogeneity in Manycore Applications - Houston, March 2008 18