A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous Computing

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous Computing
Wen-mei W. Hwu, with Tom Jablin, Chris Rodrigues, Liwen Chang, Steven ShengZhou Wu, and Abdul Dakkak
CCoE, University of Illinois at Urbana-Champaign

4,224 Kepler GPUs in Blue Waters

- NAMD: 100-million-atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included. On 768 nodes, Kepler+Interlagos is 3.9x faster than Interlagos-only; 768 XK7 nodes are 1.8x faster than 768 XE6 nodes.
- Chroma: Lattice QCD with a grid size of 48^3 x 512, running at the physical values of the quark masses. On 768 nodes, Kepler+Interlagos is 4.9x faster than Interlagos-only; 768 XK7 nodes are 2.4x faster than 768 XE6 nodes.
- QMCPACK: Full run, Graphite 4x4x1 (256 electrons), QMC followed by VMC. On 700 nodes, Kepler+Interlagos is 4.9x faster than Interlagos-only; 700 XK7 nodes are 2.7x faster than 700 XE6 nodes.

Two Current Challenges

- At-scale use of GPUs: communication costs dominate beyond 2,048 nodes. NAMD, for example, is limited by PME; there is insufficient computation work per node.
- Programming effort: the subject of this talk.

[Chart: NAMD strong scaling, 100M atoms, on Blue Waters XK7 nodes from 512 to 4,096 nodes, comparing CPU-only against CPU+GPU.]

Writing efficient parallel code is complicated. Tools can provide focused help or broad help.

Planning how to execute an algorithm:
- Choose data structures
- Map work/data into tasks
- Schedule tasks to threads

Implementing the plan:
- Memory allocation
- Data movement
- Pointer operations
- Index arithmetic
- Kernel dimensions
- Thread ID arithmetic
- Synchronization
- Temporary data structures

Tools covering parts of this space include Tangram, GMAC, DL, Triolet, X10, Chapel, Nesl, DeLite, Par4All, and OpenACC / C++AMP / Thrust.

Levels of GPU Programming Languages

- Prototype & in development: X10, Chapel, Nesl, Delite, Par4all, Triolet, ... The implementation manages GPU threading and synchronization invisibly to the user.
- Next generation: OpenACC, C++AMP, Thrust, Bolt. These simplify data movement, kernel details, and kernel launch, with the same GPU execution model but less boilerplate.
- Current generation: CUDA, OpenCL, DirectCompute.

Where should the smarts be for parallelization and optimization?

- General-purpose language + parallelizing compiler: requires a very intelligent compiler; limited success outside of regular, static array algorithms.
- Domain-specific language + domain-specific compiler: simplifies the compiler's job with language restrictions and extensions, but requires customizing a compiler for each domain.
- Parallel meta-library + general-purpose compiler: the library embodies the parallelization decisions, uses a general-purpose compiler infrastructure, and is extensible: just add library functions.

Historically, libraries are the area with the most success in parallel computing.

Triolet: Composable Library-Driven Parallelization

- EDSL-style library: build, then interpret program packages. This allows the library to collect multiple parallel operations and create an optimized arrangement.
- Lazy evaluation and aggressive inlining; loop fusion to reduce communication and memory traffic; array partitioning to reduce communication overhead.
- Library-source-guided parallelism: optimization of sequential, shared-memory, and/or distributed algorithms. Loop-building decisions use information that is often known at compile time.
- Implemented by adding typing to Python.

Example: Correlation Code

    def correlation(xs, ys):
        scores = (f(x,y) for x in xs for y in ys)
        return histogram(100, par(scores))

Compute f(x,y) for every x in xs and every y in ys (a doubly nested loop), compute it in parallel, and put the scores into a 100-element histogram.
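For intuition, here is a minimal plain-Python sketch of what that code computes, written sequentially (the function f, and the convention that it returns a bin index in [0, 100), are assumptions for illustration). Triolet's par and histogram differ in that they build a deferred package the library can parallelize and fuse rather than executing eagerly:

    # Hedged sequential sketch of the correlation example's semantics.
    # Assumes f(x, y) yields an integer bin index in the range [0, 100).
    def correlation_sequential(xs, ys, f):
        counts = [0] * 100               # 100-element histogram
        for x in xs:                     # doubly nested loop over all (x, y) pairs
            for y in ys:
                counts[f(x, y)] += 1     # tally this pair's score
        return counts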

Triolet Compiler Intermediate Representation

The list comprehension and par build a package containing:
1. The desired parallelism
2. The input data structures
3. The loop body for each loop level

The loop structure and parallelism annotations are statically known:

    correlation xs ys =
      let i = IdxNest HintPar (arrayslice xs)          -- outer loop
                (λx. IdxFlat HintSeq (arrayslice ys)   -- inner loop
                  (λy. f x y))                         -- body
      in histogram 100 i

Triolet Meta-Library

The compiler inlines histogram. histogram has code paths for handling different loop structures; since the loop structure is known, the compiler can remove the unused code paths:

    correlation xs ys =
      case IdxNest HintPar (arrayslice xs)
             (λx. IdxFlat HintSeq (arrayslice ys)
               (λy. f x y))
      of
        IdxNest parhint input body.
          case parhint of
            HintSeq. <code for sequential nested histogram>
            HintPar. parreduce input (λchunk. seqhistogram 100 body chunk)
        IdxFlat parhint input body. <code for flat histogram>

Example: Correlation Code (continued)

The result is an outer loop specialized for this application; the process then continues for the inner loop:

    correlation xs ys =
      parreduce (arrayslice xs)                      -- parallel reduction; each task processes a chunk of xs
        (λchunk. seqhistogram 100                    -- each task computes a sequential histogram
          (λx. IdxFlat HintSeq (arrayslice ys)       -- inner loop
            (λy. f x y))                             -- body
          chunk)
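The specialized structure above can be approximated in ordinary Python to make it concrete. The sketch below is a hand-written illustration, not Triolet output; parreduce is modeled with multiprocessing.Pool. It splits xs into chunks, computes a sequential 100-bin histogram per chunk, and merges the partial results:

    # Hedged sketch of the specialized loop: a parallel reduction over chunks of xs,
    # where each task computes a sequential histogram and partial histograms are summed.
    from multiprocessing import Pool

    def _chunk_histogram(args):
        chunk, ys, f = args
        counts = [0] * 100
        for x in chunk:                  # sequential histogram for this chunk
            for y in ys:
                counts[f(x, y)] += 1
        return counts

    def correlation_specialized(xs, ys, f, ntasks=4):
        size = (len(xs) + ntasks - 1) // ntasks
        chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
        with Pool(ntasks) as pool:
            partials = pool.map(_chunk_histogram, [(c, ys, f) for c in chunks])
        return [sum(col) for col in zip(*partials)]   # merge partial histograms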

Cluster-Parallel Performance and Scalability (Chris Rodrigues et al., PPoPP 2014)

- Triolet delivers large speedups over sequential C.
- On par with manually parallelized C for computation-bound code (left chart).
- Beats similar high-level interfaces on communication-intensive code (right chart).

Tangram: a parallel algorithm framework for solving linear recurrence problems

- Scan, tridiagonal matrix solvers, bidiagonal matrix solvers, recursive filters, ...
- Many specialized algorithms in the literature
- Linear recurrences are very important for converting sequential algorithms into parallel algorithms
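To make the connection concrete, consider the simplest case: a first-order linear recurrence y[i] = a[i]*y[i-1] + b[i]. The sketch below (illustrative Python, not Tangram code) shows the sequential form and the associative combine operator that lets the same computation be expressed as a scan, which is what exposes the parallelism:

    # A first-order linear recurrence and the associative operator that turns it into a scan.
    # Composing (a1, b1) then (a2, b2) gives y = a2*(a1*y + b1) + b2 = (a1*a2)*y + (a2*b1 + b2).
    def recurrence_sequential(a, b, y0=0.0):
        y, out = y0, []
        for ai, bi in zip(a, b):
            y = ai * y + bi
            out.append(y)
        return out

    def combine(p, q):
        a1, b1 = p
        a2, b2 = q
        return (a1 * a2, a2 * b1 + b2)   # associative, so tree-structured scans apply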

Tangram's Linear Recurrence Optimizations

- Library operations to simplify application tiling and communication
- Auto-tuning for each target architecture
- Unified tiling space: a simple interface for register tiling, scratchpad tiling, and cache tiling, with automatic thread fusion as an enabler
- Communication optimization: a choice among (or hybrid of) three major types of algorithms, trading computation against communication
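As a rough illustration of the auto-tuning idea (not Tangram's actual interface; the kernel signature and tile-size ranges here are hypothetical), a tuner can enumerate a unified space of tile sizes for the target architecture, time each candidate, and keep the best:

    # Hedged sketch: exhaustively search a small, unified tiling space and keep the fastest variant.
    import itertools, time

    def autotune(kernel, problem, reg_tiles=(1, 2, 4), smem_tiles=(32, 64, 128)):
        best = None
        for rt, st in itertools.product(reg_tiles, smem_tiles):
            start = time.perf_counter()
            kernel(problem, reg_tile=rt, smem_tile=st)   # hypothetical tiled kernel
            elapsed = time.perf_counter() - start
            if best is None or elapsed < best[0]:
                best = (elapsed, rt, st)
        return best   # (time, register tile size, scratchpad tile size)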

Linear Recurrence Algorithms and Communication

- Brent-Kung circuit
- Kogge-Stone circuit
- Group-structured
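These circuits differ in how they trade extra work for fewer communication steps. For reference, a Kogge-Stone scan performs log2(n) rounds, each combining elements a power-of-two stride apart; the sketch below (plain Python, simulating the lock-step rounds) shows that structure for any associative operator, such as combine() above:

    # Hedged sketch of the Kogge-Stone inclusive-scan structure.
    def kogge_stone_scan(xs, op):
        out = list(xs)
        stride = 1
        while stride < len(out):
            prev = list(out)             # snapshot simulates one lock-step parallel round
            for i in range(stride, len(out)):
                out[i] = op(prev[i - stride], prev[i])
            stride *= 2
        return out

    # Example: kogge_stone_scan([1, 1, 1, 1], lambda a, b: a + b) == [1, 2, 3, 4]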

Tangram Initial Results

[Chart: Prefix scan on Fermi (C2050). Throughput (billions of samples per second) vs. problem size (millions of samples, by data type: 32-bit and 64-bit), comparing StreamScan-Reported, Proposed-Tuned, StreamScan-Tuned, SDK-5.0, and Thrust-1.5.]

[Chart: Prefix scan on Kepler (Titan). Throughput (billions of samples per second) vs. problem size (millions of samples), comparing Tuned-for-Kepler, Tuned-for-Kepler-no-Shuffle, StreamScan, Tuned-for-Fermi, and SDK-5.0, all run on Kepler.]

[Chart: IIR filter on both GPUs. Throughput (billions of samples per second) vs. order of the IIR filter, comparing the proposed code tuned for Fermi run on Fermi, tuned for Fermi run on Kepler, and tuned for Kepler run on Kepler.]

[Chart: Tridiagonal solver on both GPUs. Throughput (millions of equations per second) vs. problem size (millions of equations), comparing our solver, the SC12 solver, and NVIDIA's solver, each tuned for and run on Fermi and Kepler.]

Next Steps

- Triolet has been released as an open source project
- Develop additional Triolet library functions and their implementations for important application domains
- Develop Triolet library functions for GPU clusters
- Publish and release Tangram (the current tridiagonal solver in cuSPARSE is from UIUC, based on the Tangram work)
- Integration with Triolet

THANK YOU!