AMD and OpenCL. Mike Houst on Senior Syst em Archit ect Advanced Micro Devices, I nc.

Similar documents
INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011

OpenCL *, Heterogeneous Computing, and the CPU

Poulson: An 8 Core 32 nm Next Generation I ntel I tanium. Processor. Steve Undy I ntel I NTEL CONFI DENTI AL

OPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER

Using RefWorks to Create an Annotated Bibliography

Heterogeneous Computing

AMD Custom Timing Tool

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018

Transitioning the I ntel Next. ( Nehalem and W estm ere) into the. Lily Looi, St ephan Jourdan I ntel Corporation Hot Chips 21 August 24, 2009

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

Quadrics QsNet II : A network for Supercomputing Applications

High Performance Graphics 2010

Automatic Intra-Application Load Balancing for Heterogeneous Systems

DB2 Optimization Service Center DB2 Optimization Expert for z/os

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Real-Time Rendering Architectures

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

Preparing seismic codes for GPUs and other

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD

Profiling & Tuning Applications. CUDA Course István Reguly

Anatomy of AMD s TeraScale Graphics Engine

Understanding GPGPU Vector Register File Usage

Release Notes Compute Abstraction Layer (CAL) Stream Computing SDK New Features. 2 Resolved Issues. 3 Known Issues. 3.

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

Generic System Calls for GPUs

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

viewdle! - machine vision experts

AMD 780G. Niles Burbank AMD. an x86 chipset with advanced integrated GPU. Hot Chips 2008

Link Layer & Network Layer

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

I ntel. QuickPath I nterconnect Overview. presented by Robert Safranek. Contributors: Gurbir Singh, Robert Maddox

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D

RegMutex: Inter-Warp GPU Register Time-Sharing

Advanced Com puter Architecture: s1/ 2005

Visual Software MDS 3.0. User Manual PRELIMINARY. Copyright Reliable Health Systems, LLC 2610 Nostrand Avenue Brooklyn, NY 11210

Martin Kruliš, v

Martin Kruliš, v

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

The first tim e you use Visual Studio it is im portant to get the initial settings right. Load Visual Studio 2010 from the Start menu:

Graphics Hardware 2008

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Fusion Enabled Image Processing

SIMULATOR AMD RESEARCH JUNE 14, 2015

Sequential Consistency for Heterogeneous-Race-Free

The All-in-One, Intelligent NXC Controller

Better Games Through Usability Evaluation and Testing By Sauli Laitinen Gamasutra June 23, 2005

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Introduction to GPGPU and GPU-architectures

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

MODEL LVDT-8U EIGHT CHANNEL AC LVDT UNIVERSAL SIGNAL CONDITIONER USER MANUAL

TrendReader for SmartButton Software Reference Guide

Game AI. Ming-Hwa Wang, Ph.D. COEN 396 Interactive Multimedia and Game Programming Department of Computer Engineering Santa Clara University

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

AMD Elite A-Series APU Desktop LAUNCHING JUNE 4 TH PLACE YOUR ORDERS TODAY!

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

An Introduction to Com puter Networks

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide


CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

From Shader Code to a Teraflop: How Shader Cores Work

Getting Started with Intel SDK for OpenCL Applications

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

BIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM. Dong Ping Zhang Heterogeneous System Architecture AMD

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data

ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

Solid State Graphics (SSG) SDK Setup and Raw Video Player Guide

Fan Control in AMD Radeon Pro Settings. User Guide. This document is a quick user guide on how to configure GPU fan speed in AMD Radeon Pro Settings.

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

HyperTransport Technology

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

Scientific Computing on GPUs: GPU Architecture Overview

Real-time Graphics 9. GPGPU

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

CS377P Programming for Performance GPU Programming - II

Code Optimizations for High Performance GPU Computing

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

Real-time Graphics 9. GPGPU

OpenCL API. OpenCL Tutorial, PPAM Dominik Behr September 13 th, 2009

AMD Radeon HD 2900 Highlights

OpenACC (Open Accelerators - Introduced in 2012)

GPU programming. Dr. Bernhard Kainz

Reliability & Flow Control

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

ROCm: An open platform for GPU computing exploration

Maximizing Face Detection Performance

NWA3000-N Series. Wireless N Business WLAN 3000 Series Access Point. Default Login Details.

Transcription:

AMD and OpenCL Mike Houst on Senior Syst em Archit ect Advanced Micro Devices, I nc.

Overview Brief overview of ATI Radeon HD 4870 architecture and operat ion Spreadsheet perform ance analysis Tips & tricks Dem o 2

Review : OpenCL View Processing Elem ent Host Com pute Unit Com pute Device 3

Review : OpenCL View Host ATI Radeon HD 4870 (com pute device) 4

Review : OpenCL View Host 10 SI MDs (com pute unit) ATI Radeon HD 4870 (com pute device) 5

Review : OpenCL View 64 Elem ent Wide wavefront (Processing elem ents) Host 10 SI MDs (com pute unit) ATI Radeon HD 4870 (com pute device) 6

ATI Radeon HD 4 8 7 0 : Reality 1 0 SI MDs 7

ATI Radeon HD 4 8 7 0 : Reality 1 0 SI MDs 1 6 Processing cores per SI MD (6 4 Elem ents over 4 cycles) 8

ATI Radeon HD 4 8 7 0 : Reality 1 0 SI MDs 1 6 Processing cores per SI MD (6 4 Elem ents over 4 cycles) 5 ALUs per core ( VLI W Processors) 9

ATI Radeon HD 4 8 7 0 : Reality Register File Fixed Funct ion Logic Texture Units 1 0

Latency Hiding GPUs eschew large caches for large regist er files I deally launch m ore work than available ALUs Register file partitioned am ongst active wavefronts Fast switching on long lat ency operat ions Wave 1 Wave 2 Wave 3 Wave 4 1 1

Spreadsheet Analysis Valuable to estim ate perform ance of kernels Helps identify when som ething is wrong First- order bot t lenecks: ALU, TEX, or MEM Tkernel = m ax( ALU, TEX, MEM ) Talu = # elem ents * # ALU / (10* 16) / 750 Mhz # SI MDs Engine Clock global_size VLI W instructions cores per SI MD 1 2

Practical I m plications on ATI Radeon HD 4 8 7 0 Workgroup size should be a m ultiple of 64 Rem em ber: Wavefront is 64 elem ents Sm aller workgroups SI MDs will be underutilized SI MDs operate on pairs of wavefronts 1 3

Minim um Global Size on ATI Radeon HD 4 8 7 0 1 0 SI MDs * 2 waves * 6 4 elem ents = 1 2 8 0 elem ents Minim um global size to utilize GPU with one kernel Does not allow for any latency hiding! For m inim um latency hiding: 2 5 6 0 elem ents 1 4

Register Usage Recall GPUs hide latency by switching between large num ber of wavefronts Register usage determ ines m axim um num ber of wavefront s in flight More wavefronts better latency hiding Fewer wavefronts worse latency hiding Long runs of ALU instructions can com pensate for low num ber of wavefronts 1 5

Kernel Guidelines on ATI Radeon HD 4 8 7 0 Prefer int4 / float4 when possible Processor cores are 5-wide VLI W m achines Mem ory system prefers 128-bit load/ stores Consider access patterns - e.g. access along rows AMD GPUs have large register files Perform m ore work per elem ent - exam ple later 1 6

Median Filtering Non-linear filter for rem oving one-shot noise in im ages Sim ple Algorithm : 1. Load neighborhood 2. Sort pixels - exam ple uses an efficient m ethod 3. Output m edian value 1 7

Sim ple Kernel Com pute m y index Handle edge cases Load neighborhood Sort along the rows Sort along the cols Sort along the diagonal kernel void medianfilter_rgba( global uint *id, global uint *od, int width, int h, int r ) { const int posx = get_global_id(0); const int posy = get_global_id(1); const int idx = posy*width + posx; if( posx == 0 posy == 0 posx == width-1 posy == h-1 ) { od[ idx ] = id[ idx ]; } else { uint row00, row01, row02, row10, row11, row12, row20, row21, row22; row00 = id[ idx - 1 - width ]; row01 = id[ idx - width ]; row02 = id[ idx + 1 - width ]; row10 = id[ idx - 1]; row11 = id[ idx ]; row12 = id[ idx + 1 ]; row20 = id[ idx - 1 + width ]; row21 = id[ idx + width ]; row22 = id[ idx + 1 + width ]; // sort the rows sort3( &(row00), &(row01), &(row02) ); sort3( &(row10), &(row11), &(row12) ); sort3( &(row20), &(row21), &(row22) ); // sort the cols sort3( &(row00), &(row10), &(row20) ); sort3( &(row01), &(row11), &(row21) ); sort3( &(row02), &(row12), &(row22) ); // sort the diagonal sort3( &(row00), &(row11), &(row22) ); Output m edian } } // median is the middle value of the diagonal od[ idx ] = row11; 1 8

Auxiliary Functions Convert to lum inance Swap A and B Com parison sort float rgbtolum( uint a ) { return (0.3f*(a & 0xff) + 0.59*((a>>8) & 0xff) + 0.11*((a>>16) & 0xff)); } void swap( uint *a, uint *b ) { uint tmp = *b; *b = *a; *a = tmp; } void sort3( uint *a, uint *b, uint *c ) { if( rgbtolum( *a ) > rgbtolum( *b ) ) swap( a, b ); if( rgbtolum( *b ) > rgbtolum( *c ) ) swap( b, c ); if( rgbtolum( *a ) > rgbtolum( *b ) ) swap( a, b ); } 1 9

More W ork per W ork- item Prefer read/ write 128-bit values Com pute m ore than one output per work-item Better Algorithm (further optim izations possible): 1. Load neighborhood 8x3 via six 128- bit loads 2. Sort pixels for each of four pixels 3. Output m edian values via 128-bit write 2 0

Better Kernel Com pute m y index Load row above Load m y row Load row below Find four m edians kernel void medianfilter_x4( global uint *id, global uint *od, int width, int h, int r ) { const int posx = get_global_id(0); // global width is 1/4 image width const int posy = get_global_id(1); // global height is image height const int width_d4 = width >> 2; // divide width by 4 const int idx_4 = posy*(width_d4) + posx; uint4 left0, right0, left1, right1, left2, right2, output; // ignoring edge cases for simplicity left0 = (( global uint4*)id)[ idx_4 - width_d4 ]; right0 = (( global uint4*)id)[ idx_4 - width_d4 + 1]; left1 = (( global uint4*)id)[ idx_4 ]; right1 = (( global uint4*)id)[ idx_4 + 1]; left2 = (( global uint4*)id)[ idx_4 + width_d4 ]; right2 = (( global uint4*)id)[ idx_4 + width_d4 + 1]; // now compute four median values output.x = find_median( left0.x, left0.y, left0.z, left1.x, left1.y, left1.z, left2.x, left2.y, left2.z ); output.y = find_median( left0.y, left0.z, left0.w, left1.y, left1.z, left1.w, left2.y, left2.z, left2.w ); output.z = find_median( left0.z, left0.w, right0.x, left1.z, left1.w, right1.x, left2.z, left2.w, right2.x ); output.w = find_median( left0.w, right0.x, right0.y, left1.w, right1.x, right1.y, left2.w, right2.x, right2.y ); Output m edians } (( global uint4*)od)[ idx_4 ] = output; 2 1

Dem o 2 2

Conclusions Optim ization is a balancing act Things t o consider: Register usage / num ber of wavefronts in flight ALU to m em ory access ratio Som etim es better re-com pute som ething Workgroup size a m ultiple of 64 Global size at least 2560 for a single kernel 2 3

Multi- Core x8 6 CPU im plem entation available now! http: / / developer.am d.com / stream beta Subm itted for conform ance during SI GGRAPH for Microsoft Windows XP 32-bit, Windows Vista 32-bit, Windows 7 32-bit, Linux 32-bit, Linux 64- bit Perform ance advice (sim ilar to GPU): Use 128-bit types (i.e. float4) Will m ap well to SSE Use reasonably large work groups L1/ L2 cache locality Do a lot of work per work item Start writing code and playing with OpenCL now on any x86 m achine 2 4

DI SCLAI MER The inform ation presented in this docum ent is for inform ational purposes only and m ay contain technical inaccuracies, om issions and t ypographical errors. AMD MAKES NO REPRESENTATI ONS OR WARRANTI ES WI TH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSI BI LI TY FOR ANY I NACCURACI ES, ERRORS OR OMI SSI ONS THAT MAY APPEAR I N THI S I NFORMATI ON. AMD SPECI FI CALLY DI SCLAI MS ANY I MPLI ED WARRANTI ES OF MERCHANTABI LI TY OR FI TNESS FOR ANY PARTI CULAR PURPOSE. I N NO EVENT WI LL AMD BE LI ABLE TO ANY PERSON FOR ANY DI RECT, I NDI RECT, SPECI AL OR OTHER CONSEQUENTI AL DAMAGES ARI SI NG FROM THE USE OF ANY I NFORMATI ON CONTAI NED HEREI N, EVEN I F AMD I S EXPRESSLY ADVI SED OF THE POSSI BI LI TY OF SUCH DAMAGES. ATTRI BUTI ON 2009 Advanced Micro Devices, I nc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon and com binations thereof are tradem arks of Advanced Micro Devices, I nc. Microsoft, Windows, and Windows Vista are registered tradem arks of Microsoft Corporation in the United States and/ or other jurisdictions. OpenCL is tradem ark of Apple I nc. used under license to the Khronos Group I nc. Other nam es are for inform ational purposes only and m ay be tradem arks of their respective owners. 2 5