A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening. Alberto Magni, Christophe Dubach, Michael O'Boyle



Introduction. Wide adoption of GPGPU for HPC. Many GPU devices from many vendors: AMD, Nvidia, Intel, Qualcomm, ARM.

Introduction. OpenCL spans all of these vendors: AMD, Nvidia, Intel, Qualcomm, ARM.

OpenCL is Functionally Portable. Each vendor (AMD, Nvidia, Intel, Qualcomm, ARM) ships its own proprietary compiler, yet the same OpenCL code runs on all of them.

Performance Evaluation: Regression Trees for Thread Coarsening.

  Loads < 0.92
    T: 1.70
    F: Branches < 2.80
      T: Cache Misses < 1.00
        T: 1.06
        F: 0.40
      F: 0.81

Performance is NOT Portable. [Plot: coarsening speedups for Nbody on AMD Cypress vs. Nvidia Fermi.]

What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work.

Thread Coarsening. The original thread space is mapped to a transformed thread space: reduce the number of threads, increase the amount of work per thread.

Advantages of Thread Coarsening: it reduces redundant computation, it is supported by the standard, and it is architecture-independent, which makes it ideal for a cross-architectural evaluation.

Our Portable Compiler.

Thread Coarsening Implementation. An LLVM function pass replicates the instructions in the kernel body. Two threads (id = 0 and id = 1) each execute:

  for index in (0 : width)
      tmp += A[id + index];
  B[id] = tmp;

Thread Coarsening Implementation. Identify the divergent instructions, i.e. those whose behavior depends on the thread id (here, the accesses A[id + index] and the store B[id] = tmp).

Thread Coarsening Implementation. Replicate the divergent instructions; the coarsened thread (shown for id = 0) now does the work of two:

  for index in (0 : width)
      tmp1 += A[id + index];
      tmp2 += A[2*id + 1 + index];
  B[id] = tmp1;
  B[2*id + 1] = tmp2;

The thread number is reduced at runtime.
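The transformation on this slide can be sketched as plain C run sequentially. This is an illustration, not the actual LLVM pass: the generalization of the slide's id = 0 case to indices 2*id and 2*id + 1, the coarsening factor of 2 with stride 1, and the sizes N and WIDTH are all assumptions.

```c
#define N      8   /* number of logical threads (output elements) -- assumed */
#define WIDTH  4   /* work per logical thread -- assumed                     */

/* Original kernel body: one logical thread per output element. */
static void kernel_original(const int *A, int *B, int id) {
    int tmp = 0;
    for (int index = 0; index < WIDTH; index++)
        tmp += A[id + index];
    B[id] = tmp;
}

/* Coarsened body (factor 2, stride 1): thread id now produces the
 * outputs of former threads 2*id and 2*id + 1, so only N/2 threads
 * need to be launched. */
static void kernel_coarsened(const int *A, int *B, int id) {
    int tmp1 = 0, tmp2 = 0;
    for (int index = 0; index < WIDTH; index++) {
        tmp1 += A[2*id + index];
        tmp2 += A[2*id + 1 + index];
    }
    B[2*id]     = tmp1;
    B[2*id + 1] = tmp2;
}
```

Running both versions over the whole thread space produces identical outputs, which is the correctness condition the pass must preserve.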

What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work.

Parameter Space. Static parameters: coarsening factor, stride, direction. Dynamic parameter: local work-group size. ~300 configurations for one-dimensional benchmarks, ~2,000 for two-dimensional ones.
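The exploration of this space amounts to a Cartesian-product loop over the parameters, which can be sketched as follows. The concrete parameter values below are illustrative assumptions; the talk reports only the approximate totals.

```c
/* One point in the coarsening parameter space. */
struct config { int factor, stride, direction, wg_size; };

/* Enumerate the Cartesian product of the static parameters (factor,
 * stride, direction) and the dynamic one (work-group size), invoking
 * `run` on each point; returns the number of configurations visited.
 * The specific value sets are assumptions for illustration. */
static int explore(int dims, void (*run)(struct config)) {
    static const int factors[]  = {1, 2, 4, 8, 16, 32};
    static const int strides[]  = {1, 2, 4, 8, 16, 32};
    static const int wg_sizes[] = {32, 64, 128, 256};
    int n = 0;
    for (unsigned f = 0; f < sizeof factors / sizeof *factors; f++)
        for (unsigned s = 0; s < sizeof strides / sizeof *strides; s++)
            for (int d = 0; d < dims; d++)   /* coarsening direction */
                for (unsigned w = 0; w < sizeof wg_sizes / sizeof *wg_sizes; w++) {
                    struct config c = {factors[f], strides[s], d, wg_sizes[w]};
                    if (run) run(c);
                    n++;
                }
    return n;
}
```

With these assumed values the loop visits 144 one-dimensional and 288 two-dimensional configurations; the real space is larger because work-group sizes vary per benchmark.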

Experimental Set-Up. 17 benchmarks from the Nvidia, AMD, and Parboil suites. 5 devices: Nvidia Fermi GTX480, Nvidia Kepler K20, AMD Cypress HD 5900, AMD Tahiti 7970, Intel Core i7. ~43,000 runs in total.


Performance Varies Significantly. [Plots: coarsening speedups across benchmarks on Nvidia Fermi and on AMD Cypress.]

What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work.

Data Collection. Three stages: full exploration of the coarsening parameter space, selection of representative configurations, and profiling of the selected configurations.

Profiler Counters. Nvidia: #instructions, #branches, #loads, #stores. AMD: ALU utilization, vector utilization, VLIW packing, cache utilization, memory-unit utilization, L1 hit rate, L2 hit rate.

Counters Analysis. GOAL: discriminate fast and slow configurations. Each counter is taken as a value relative to the uncoarsened baseline; a threshold x splits the configurations into those with counter > x and those with counter < x. [Plot: speedup vs. relative counter value.]

Explaining Performance per Device. For each device, hardware speedups and profiler counters feed a regression tree whose nodes test "counter < value". The tree discriminates fast from slow configurations, relates counters to performance, and is easy to read.

Tree Analysis, Nvidia Fermi.

  Loads < 0.92
    T: 1.70
    F: Branches < 2.80
      T: Cache Misses < 1.00
        T: 1.06
        F: 0.40
      F: 0.81
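Read as code, this tree is a nest of threshold tests on counters normalized to the uncoarsened baseline. The sketch below transcribes the slide's tree directly; the interpretation comments are our reading, not the authors'.

```c
/* Predicted speedup of a coarsening configuration on Nvidia Fermi,
 * given profiler counters relative to the uncoarsened baseline.
 * Direct transcription of the regression tree on the slide. */
static double fermi_tree(double loads, double branches, double cache_misses) {
    if (loads < 0.92)
        return 1.70;     /* fewer loads: coarsening pays off          */
    if (branches < 2.80) {
        if (cache_misses < 1.00)
            return 1.06; /* no extra cache misses: mild speedup       */
        return 0.40;     /* more cache misses: large slowdown         */
    }
    return 0.81;         /* branch count blows up: moderate slowdown  */
}
```

Each leaf value is the speedup associated with the configurations that fall into that region of counter space.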

Tree Analysis, Nvidia Fermi. Examples: floydwarshall, sgemm.

Dynamic Counter.

Tree Analysis, Nvidia Fermi. Examples: floydwarshall, sgemm, spmv, mvcoal.

Dynamic Counter.

Tree Analysis, AMD Cypress. Examples: spmv, stencil, nbody, BinarySearch, mt.

Dynamic Counter.

What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work.

Conclusion and Future Work. An automatic methodology for performance explanation; a first step toward the definition of compiler heuristics and automatic coarsening tuning. [Slide shows two regression trees side by side: the Nvidia Fermi tree (Loads < 0.92, Branches < 2.80, Cache Misses < 1.00; leaves 1.70, 1.06, 0.40, 0.81) and an AMD Cypress tree (ALUPacking < 1.28, ALUBusy < 0.59; leaves 0.8, 0.79, 2.10).]