A Cross-Input Adaptive Framework for GPU Program Optimizations

A Cross-Input Adaptive Framework for GPU Program Optimizations
Yixun Liu, Eddy Z. Zhang, Xipeng Shen
Computer Science Department, The College of William & Mary

Outline
- GPU Overview
- G-ADAPT Framework
- Evaluation
- Related & Future Work
- Conclusion

GPU (Graphics Processing Unit)
- Architecture: SIMD-parallel, multithreaded, many-core
- Features: tremendous computational horsepower; high memory bandwidth
- Applications: traditional graphics rendering; emerging general data-parallel computing

Programming GPUs
- High-level model: an abstraction over the multithreaded platform; C-like programming; no explicit mapping to graphics rendering (e.g., CUDA, Brook+, OpenCL)
- NVIDIA CUDA: kernel functions run on the GPU; threads -> blocks -> grids
[Figure from the CUDA manual]

Optimization Challenges
- Goal: maximize throughput by increasing occupancy, reducing latency, and reducing dynamic instructions
- Difficulties:
  - Optimization effects are hard to predict: non-linearity, coupling, undisclosed CUDA details
  - GPU hardware complexities: limits (512 threads per block, 768 threads per SM, etc.); various memory types (constant, texture, etc.)
  - Input sensitivity

Matrix-Vector Multiplication
[Figure]

Outline
- GPU Overview
- G-ADAPT Framework
- Evaluation
- Related & Future Work
- Conclusion

G-ADAPT
- Empirical search-based optimization
- Three obstacles to address:
  - Construction of the optimization space
  - Space pruning
  - Cross-input adaptation

G-ADAPT: Overview
- A source-to-source compiler with cross-input adaptation
- Stage 1: empirical search & data collection (automatic search & transformations) over code with pragmas & inputs, producing <input, best optimizations> pairs
- Stage 2: pattern recognition & code generation, producing the optimized, input-adaptive GPU program
- Easy integration of user knowledge through pragmas

G-ADAPT Pragmas
- Support a programmer-compiler synergy
- Cover two levels of optimization:
  - Execution configurations (e.g., thread block dimensions)
  - Code transformations (e.g., loop tile sizes, unrolling levels)

Pragma Examples

#pragma erange 64, 512, 2
#define BLKSZ 256

#pragma lpur_lrange 0, min(BLKSZ, 16), 2
for (i = 1; i < BLKSZ; i++) { }

Stage I: Search & Collect
- Flow: code with pragmas & inputs -> G-ADAPT compiler -> optimized GPU code -> Performance Calibrator -> performance results -> Perf. DB
- The Optimization Agent feeds new optimization parameters back to the compiler, closing the search loop

G-ADAPT Compiler
- Two functionalities: recognizing the optimization space; performing program transformations
- Based on Cetus [Purdue Univ.], a source-to-source compiler, with GPU extensions to support the G-ADAPT pragmas

Performance Calibrator
- Invokes the CUDA compiler and runs the executable
- Collects running time and GPU occupancy

Optimization Agent
- Determines the optimization parameters to try next
- Uses hill climbing to overcome the space-explosion problem

G-ADAPT: Overview (recap)
- Stage 1: code with pragmas & input -> empirical search & data collection -> <input, best optimizations>
- Stage 2: pattern recognition & code generation -> optimized input-adaptive GPU program

Stage II: Pattern Recognition & Code Generation
- Pattern recognizer: learns the mapping from inputs to best parameters, using regression trees with least mean squares
- Options for the code generator: multiple versions, JIT compilation, or a linker
- Flow: Perf. DB -> Pattern Recognizer -> G-ADAPT Code Generator -> optimized input-adaptive GPU program

G-ADAPT: The Complete Flow
- Stage 1 loop: code with pragmas & inputs -> G-ADAPT compiler -> optimized GPU code -> Performance Calibrator -> Perf. DB, with the Optimization Agent choosing the next parameters
- Stage 2: Perf. DB -> Pattern Recognizer -> G-ADAPT Code Generator -> final input-adaptive GPU program

Outline
- GPU Overview
- G-ADAPT Framework
- Evaluation
- Related & Future Work
- Conclusion

Evaluation: Platform
- GPU: NVIDIA GeForce 8800 GT; 14 multiprocessors (MPs), 112 cores; 512 MB global memory, 16 KB shared memory/MP, 8192 registers/MP; CUDA 2.0
- Host: Intel Xeon 3.6 GHz, SUSE Linux 2.6.22

Benchmarks

Benchmark     Description                                        # of Inputs
convolution   Convolution filter of a 2D signal                  10
matrixmul     Dense matrix multiplication                         9
mvmul         Dense matrix-vector multiplication (by Fujimoto)   15
reduction     Sum of an array                                    15
scalarprod    Scalar products                                     7
transpose     Matrix transpose                                   18
transpose-co  Coalescing matrix transpose                        18

Training and Prediction

Benchmark     Training iterations   Training time (s)   Prediction accuracy
convolution   200                   2825                100%
matrixmul     196                   2539                100%
mvmul         124                    124                93.3%
reduction      75                     29                80%
scalarprod     93                    237                100%
transpose      54                   1639                100%
transpose-co   54                    631                100%

Matrix-Vector Multiplication
[Figures: best parameter vs. input; speedup vs. input]

Speedup over Default
[Figures: speedup of the G-ADAPT-optimized code over the default version]
http://www.cs.wm.edu/~xshen/publications/ipdps09.pdf

Outline
- GPU Overview
- G-ADAPT Framework
- Evaluation
- Related & Future Work
- Conclusion

Related Work
- Ryoo et al., CGO '08: efficiency and utilization model for search; manual transformations; assumptions about applications
- Baskaran et al., ICS '08: polyhedral model for optimizing memory accesses; limited to affine loop nests
- Features of G-ADAPT: the first generally applicable framework; cross-input adaptation

Future Work
- More optimization options: algorithm selection, memory optimization, divergence elimination
- More general support: Cetus is an ANSI C compiler; extend to non-ANSI C features, C++, and CUDA built-in types (e.g., float4, textures)

Outline
- GPU Overview
- G-ADAPT Framework
- Evaluation
- Related & Future Work
- Conclusion

Conclusion
- A general tool for GPU optimization
- Cross-input adaptation
- Synergy between compilers and programmers
- An alternative to manual tuning, enabling easy adaptation across architectures

Acknowledgements
- Cetus authors at Purdue (groups led by Eigenmann and Midkiff)
- John Owens
- NVIDIA, for the donation of the device
- NSF grants

Thank you!