Automatic Tuning Matrix Multiplication Performance on Graphics Hardware


Automatic Tuning Matrix Multiplication Performance on Graphics Hardware
Changhao Jiang (cjiang@cs.uiuc.edu), Marc Snir (snir@cs.uiuc.edu)
University of Illinois Urbana-Champaign

GPU becomes more powerful
- [Chart: floating-point performance of the NVIDIA GeForceFX 7800 vs. a 3.0 GHz dual-core Pentium 4]
- Courtesy Ian Buck, John Owens
9/18/2005, University of Illinois Urbana-Champaign

Use of GPUs for non-graphics applications
- GPGPU (general-purpose computation on GPUs)
- Goal: make the flexible and powerful GPU available to developers as a computational coprocessor
- Difficulties of GPGPU:
  - Unusual programming model
  - Programming idioms tied to computer graphics
  - Underlying architecture evolves fast
  - Architecture internals largely secret

Automatic library generation
- Automatic library generation can help: generate high-performance libraries by empirical search
- Successful example systems on CPUs:
  - ATLAS (Whaley, Petitet, Dongarra)
  - Sparsity (Im, Yelick, Vuduc)
  - FFTW (Frigo, Johnson)
  - Spiral (Puschel, Singer, Xiong, Moura, Johnson, Padua)
  - Adaptively tuned sorting library (Li, Garzaran, Padua)

Our work
- Implemented a high-performance matrix multiplication library generator for GPUs: an ATLAS for the GPU
- Main contributions:
  - The first automatic library generation system for GPUs
  - Identifies several tuning strategies unique to GPUs
  - Implements a customized search engine
  - The automatically generated code achieves performance comparable to an expert manually tuned version

GPU architecture: the graphics pipeline
- [Diagram: the CPU-side application issues 3D vertices and graphics state; the vertex processor transforms and lights them into 2D vertices; primitives are assembled into screen-space triangles; the rasterizer produces fragments (pre-pixels); the fragment processor shades them into final pixels (color, depth), reading textures from video memory]
- Programmability was introduced into two stages: the vertex processor and the fragment processor
- Render-to-texture feeds results back into video memory
- Courtesy David Luebke

GPU architecture: another view
- [Diagram: the rasterizer feeds an array of fragment processors, which write to the frame buffer; each pixel is a 4-component vector (r, g, b, a)]
- Most of the horsepower for GPGPU comes from the fragment processors
- The same shader runs synchronously on all fragment processors
- Every fragment processor can execute SIMD instructions on the 4 channels of a pixel

GPU programming model
- Stream processing model: the same kernel program (shader) operates on streams of data
- [Diagram: an input stream flows through the shader, which reads constants, temporary registers, and textures, and writes an output stream]

Unusual features/constraints of GPU programs
- SIMD instructions with smearing and swizzling, e.g. R2 = R1.abgr * R3.ggab
- Limit on instruction count
- Limit on output
- Limit on branch instructions
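The swizzle instruction mentioned above can be illustrated with a small CPU-side simulation (this is plain Python standing in for fragment-processor hardware, not actual shader code; the `swizzle` helper is hypothetical). Channel order in a pixel is (r, g, b, a), so `.abgr` reverses the channels and `.ggab` smears/permutes them before one 4-wide multiply:

```python
import numpy as np

def swizzle(reg, pattern):
    """Return a copy of a 4-channel register reordered by a swizzle pattern."""
    idx = {"r": 0, "g": 1, "b": 2, "a": 3}
    return np.array([reg[idx[c]] for c in pattern])

R1 = np.array([1.0, 2.0, 3.0, 4.0])    # channels (r, g, b, a)
R3 = np.array([10.0, 20.0, 30.0, 40.0])

# One SIMD multiply over all 4 channels at once: R2 = R1.abgr * R3.ggab
R2 = swizzle(R1, "abgr") * swizzle(R3, "ggab")
```

Here `swizzle(R1, "abgr")` yields (4, 3, 2, 1) and `swizzle(R3, "ggab")` yields (20, 20, 40, 30), so the single instruction performs four independent multiplies.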

GPU algorithms for matrix multiply
- Straightforward mapping of the triply nested loop onto the GPU:
  - Store the two input matrices A and B as two textures
  - Store the resulting matrix C in the frame buffer
  - Each execution of the shader program outputs one element of C: it fetches one row from matrix A, fetches one column from matrix B, computes the dot product, and saves the result to C
- Problems:
  - No data reuse in the shader => poor performance
  - The shader length might exceed the instruction limit if the loop is unrolled, which is forced by the lack of branch instructions
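The straightforward mapping can be sketched on the CPU as follows (a simulation of the access pattern, not the shader itself): each (i, j) pair models one shader invocation that fetches a whole row of A and a whole column of B, with no reuse across invocations.

```python
import numpy as np

def naive_gpu_matmul(A, B):
    """Model the one-output-per-shader-invocation GPU algorithm."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))          # models the frame buffer
    for i in range(n):            # each (i, j) models one fragment
        for j in range(m):
            # the "shader body": fetch row i of A, column j of B, dot product
            C[i, j] = A[i, :] @ B[:, j]
    return C
```

Every element of A is fetched m times and every element of B n times, which is the data-reuse problem the slide points out.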

Tuning for multi-render-targets
- Multi-render-targets allow a shader to write to multiple buffers simultaneously
- Tuning strategy: take advantage of multi-render-targets to improve data reuse
- Algorithm with multi-render-targets:
  - Divide matrix C into m x n sub-matrix blocks, each of which is a render target; A and B are logically divided too
  - Each fragment program fetches m rows from matrix A, fetches n columns from matrix B, and computes m x n dot products
- Downsides:
  - The shader requires more temporary registers
  - Using multi-render-targets has performance overhead
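The reuse gained from multi-render-targets can be sketched with a blocked CPU loop (again a simulation; the block sizes `m` and `n` stand in for the talk's mrt parameters): each "fragment program" now produces an m x n block, so the m fetched rows of A are shared across n columns of B and vice versa.

```python
import numpy as np

def mrt_matmul(A, B, m=2, n=2):
    """Model the multi-render-target scheme: one m x n output block per invocation."""
    N, K = A.shape
    _, M = B.shape
    C = np.zeros((N, M))
    for i in range(0, N, m):
        for j in range(0, M, n):
            # one fragment-program invocation: m rows of A, n columns of B,
            # m*n dot products sharing those fetches
            C[i:i+m, j:j+n] = A[i:i+m, :] @ B[:, j:j+n]
    return C
```

The fetch count drops by roughly a factor of the block size, at the cost of holding m rows and n columns in registers, which is the temporary-register downside noted above.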

Tuning for SIMD instructions with data packing
- The fragment processor supports SIMD instructions
- Tuning strategy:
  - Use SIMD instructions to improve performance
  - Use smearing and swizzling to do register tiling and improve data reuse
- Algorithm:
  - Pack four matrix elements into one pixel; two schemes: 1x4 vs. 2x2
  - Each fragment program (1x4 scheme) fetches one row from matrix A, fetches four columns from matrix B, and performs a series of vector-by-matrix products
- Question: which packing scheme performs best?
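The 1x4 scheme can be illustrated with a small packing sketch (illustrative only; `pack_1x4` is a hypothetical helper): four horizontally adjacent matrix elements share one RGBA pixel, so a single texture fetch returns four values and one 4-wide multiply-add advances four dot-product terms at once.

```python
import numpy as np

def pack_1x4(M):
    """Pack each group of 4 adjacent row elements into one RGBA 'pixel'."""
    rows, cols = M.shape
    assert cols % 4 == 0, "1x4 packing assumes the width is a multiple of 4"
    return M.reshape(rows, cols // 4, 4)   # last axis = (r, g, b, a)

A = np.arange(8.0).reshape(2, 4)
Ap = pack_1x4(A)
# Ap[0, 0] is the pixel holding A[0, 0:4]
```

A 2x2 scheme would instead pack a 2x2 sub-block per pixel; which layout wins depends on the access pattern, which is exactly the question the search engine answers empirically.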

Tuning the number of passes
- Problem: the GPU's limit on instruction count prevents the dot product from being completed in one pass
- Strategy: partition the computation into multiple passes
- Algorithm with multiple passes:
  - Each fragment program fetches a part of a row from matrix A, fetches a part of a column from matrix B, and performs a dot product to get a partial sum
  - Iterate multiple times to get the final result
- Downside: multi-pass execution incurs expensive overhead in copying intermediate results
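The multi-pass scheme above can be modeled by splitting the k dimension into chunks (a CPU sketch; `np_` stands in for the talk's np parameter, the number of iterations per pass): each chunk is one GPU pass, and the running sum models the intermediate buffer copied between passes.

```python
import numpy as np

def multipass_matmul(A, B, np_=4):
    """Model the multi-pass scheme: at most np_ dot-product terms per pass."""
    N, K = A.shape
    _, M = B.shape
    C = np.zeros((N, M))            # the accumulation buffer
    for k0 in range(0, K, np_):     # one GPU pass per chunk of the k dimension
        k1 = min(k0 + np_, K)
        C += A[:, k0:k1] @ B[k0:k1, :]   # partial sums for this pass
    return C
```

Fewer, longer passes reduce the copy overhead but risk the instruction limit; that trade-off is why np is one of the tuned parameters.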

An automatic matrix-multiply generation system
- The system includes:
  - A code generator: generates multiple versions in the high-level BrookGPU language, which are compiled into low-level code
  - A search engine: searches the implementation space for the best version
  - A performance evaluator: measures the performance of the generated code
- [Diagram: the search engine passes tuning parameters to the code generator; the generated program goes to the performance evaluator, which feeds performance metrics back to the search engine]

Tuning parameters
- The generated code combines the previous tuning strategies
- Parameters:
  - mrt_w, mrt_h: how to divide matrix C
  - mc_w, mc_h: how to pack data to use SIMD
  - np: how many iterations are executed in each pass
  - unroll: whether or not to use branch instructions
  - compiler: whether to use the cgc or fxc compiler
  - shader: whether to use the DirectX backend with ps20, ps2a, ps2b, ps30, or the OpenGL backend with arbfp, fp30, fp40

Search strategy
- Searching an exponential space is time-consuming; two techniques speed up the search:
  - Space pruning: limit the search range of parameters based on problem-specific heuristics
  - Search in phases, in the following order:
    1: For each compiler value
    2: For each profile value
    3: For each unroll value
    4: Search np over power-of-two values
    5: For each mc_* value
    6: For each mrt_* value
    7: Evaluate performance
    8: Recursively search np on both sides of the best np found in step 4
- The search time drops dramatically, from 53 days in theory to 4 hours in practice, with no significant performance loss
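The phased search can be sketched as follows. This is a toy model, not the authors' engine: `measure` is a made-up cost function standing in for "compile the shader, run it, read back GFLOPS", and only a few parameters are shown. Discrete parameters are enumerated in outer phases; np is first probed at powers of two, then refined around the best value found.

```python
import itertools

def measure(params):
    # Placeholder for a real GPU timing run; higher is better.
    # This toy peaks at np = 24 with the smallest mc/mrt values.
    mc, mrt, np_ = params
    return -(abs(np_ - 24) + mc + mrt)

def phased_search(max_np=128):
    best, best_score = None, float("-inf")
    for mc, mrt in itertools.product([1, 2, 4], [1, 2]):      # outer phases
        np_pow2 = [2**i for i in range(8) if 2**i <= max_np]  # np phase: powers of two
        for np_ in np_pow2:
            s = measure((mc, mrt, np_))
            if s > best_score:
                best, best_score = (mc, mrt, np_), s
    # final phase: refine np around the best power of two found above
    mc, mrt, np_ = best
    for cand in range(max(1, np_ // 2), min(max_np, np_ * 2) + 1):
        s = measure((mc, mrt, cand))
        if s > best_score:
            best, best_score = (mc, mrt, cand), s
    return best
```

Probing np coarsely first and refining only around the winner is what collapses the exponential space into the few-hours search the slide reports.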

Performance data
- Compared with two expert hand-tuned implementations, part of GPUBench developed at Stanford University and implemented with carefully crafted assembly code
- Comparable performance on four GPU platforms:
  - On two platforms the searched version beats the hand-tuned code by 8% and 15%
  - On the other two platforms it achieves 56% and 70% of the hand-tuned version
- [Chart: searched vs. hand-tuned performance across the four platforms]

Performance penalties of using a high-level language
- One reason for lower performance than manual tuning: overhead of the high-level BrookGPU language
- [Chart: performance of the same algorithm implemented in BrookGPU vs. assembly code]

Potential future research directions
- Improve high-level BrookGPU performance
- Generate more libraries for GPUs:
  - Signal processing (FFT)
  - Numerical libraries
  - Sorting library