GpuPy: Accelerating NumPy With a GPU


Washington State University, School of Electrical Engineering and Computer Science
Benjamin Eitzen - eitzenb@eecs.wsu.edu
Robert R. Lewis - bobl@tricity.wsu.edu

Presentation Outline What is a GPU? How can a GPU be used to do general purpose computations? How can NumPy take advantage of a GPU?

What Is a GPU? GPU stands for Graphics Processing Unit. It is the main component on a video card. It provides hardware acceleration for graphics APIs such as OpenGL and DirectX.

GeForce 7800 GTX: 8 programmable vertex shaders, 24 programmable fragment shaders, 16 raster units (non-programmable). Figure: GeForce 7800 GTX block diagram. Source: NVIDIA

Strengths of a GPU: Large capacity for floating-point calculations. Dedicated hardware for many graphics-related operations (linear algebra, trigonometry, etc.). Highly parallel. Improving more rapidly than traditional CPUs. Programmable.

Weaknesses of a GPU Only single-precision floating point values are supported. Data must be copied to the GPU's memory before calculations can be done. Programming is conceptually more difficult than on a traditional CPU.
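
To make the single-precision limitation concrete, here is a small NumPy sketch (not taken from the talk) showing a sum that a 32-bit float cannot represent but a 64-bit float can. The variable names are illustrative only.

```python
import numpy as np

# 2**24 is the point where float32 can no longer represent every integer.
big = np.float32(16777216.0)
one = np.float32(1.0)

# In single precision the added 1 is lost to rounding.
single = big + one

# The same sum carried out in double precision keeps it.
double = np.float64(big) + np.float64(one)

print(single == big)   # the float32 sum is unchanged
print(double - 16777216.0)
```

Any workload that depends on differences this small would need double precision, or an emulation of it, to give the same answer on such a GPU.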

GPU vs. CPU Source: SIGGRAPH 2005 GPGPU Course

GPU Programmability How can a GPU be used to do general purpose computations? Newer GPUs can directly execute programs written in high-level languages (Cg, GLSL, HLSL) or in assembler. Programs are called shaders and will seem familiar to programmers who have used C and/or assembler.

How it works GPUs execute a program once for each pixel that is drawn to the screen. The coordinates of the pixel being drawn index an element of an array. By rendering a rectangle that fills the entire screen, the GPU can be made to execute a program once for each element in an array.
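
The idea of running one program per pixel can be imitated on the CPU. The sketch below is only an analogy in plain Python and NumPy; `run_fragment_program` and `add_program` are hypothetical names, and a real GPU would run the per-pixel calls in parallel rather than in a loop.

```python
import numpy as np

def run_fragment_program(program, width, height, *textures):
    """Call `program` once per pixel of a width x height rectangle."""
    out = np.empty((height, width), dtype=np.float32)
    for y in range(height):
        for x in range(width):
            # The pixel coordinate doubles as the array index.
            out[y, x] = program((x, y), *textures)
    return out

def add_program(p, arg1, arg2):
    """Per-pixel program: fetch one element from each input and add them."""
    x, y = p
    return arg1[y, x] + arg2[y, x]

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.ones((2, 3), dtype=np.float32)
result = run_fragment_program(add_program, 3, 2, a, b)
```

Here `result` holds the element-wise sum of `a` and `b`, computed by one program invocation per output element.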

Data Types Most GPUs support 1-, 2-, 3-, or 4-vectors of 32-bit floating point values. Some operators act in parallel on all elements of the vector; others act on single elements. Data is passed to the GPU in the form of a texture, which is essentially an array.
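
One way to picture "a texture is essentially an array" is to pack a flat array into a grid of 4-vectors. The layout below is an assumption made for illustration, not necessarily the layout GpuPy uses; `pack_rgba` and `unpack_rgba` are hypothetical helpers.

```python
import numpy as np

def pack_rgba(values, width):
    """Pack a 1-D array into a (height, width, 4) float32 'texture'."""
    flat = np.asarray(values, dtype=np.float32).ravel()
    texels = -(-flat.size // 4)        # ceiling division: 4 values per texel
    height = -(-texels // width)       # rows needed for that many texels
    padded = np.zeros(height * width * 4, dtype=np.float32)
    padded[:flat.size] = flat          # unused slots stay zero
    return padded.reshape(height, width, 4)

def unpack_rgba(texture, count):
    """Recover the first `count` values from a packed texture."""
    return texture.reshape(-1)[:count]

data = np.arange(10, dtype=np.float32)
tex = pack_rgba(data, width=2)
restored = unpack_rgba(tex, data.size)
```

Packing four values per texel lets vector operators process four array elements at once.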

High-Level Code for Addition
Cg:
    float4 main(float2 p : TEXCOORD0,
                uniform samplerRECT arg1,
                uniform samplerRECT arg2) : COLOR
    {
        return texRECT(arg1, p) + texRECT(arg2, p);
    }
GLSL:
    uniform sampler2DRect arg1;
    uniform sampler2DRect arg2;
    void main(void)
    {
        vec2 p = gl_TexCoord[0].xy;
        gl_FragColor = texture2DRect(arg1, p) + texture2DRect(arg2, p);
    }

Assembler for Addition The two examples from the previous slide both produce the following assembler code.
arbfp1:
    PARAM c[1] = { program.local[0] };
    TEMP R0;
    TEMP R1;
    TEX R1, fragment.texcoord[0], texture[1], RECT;
    TEX R0, fragment.texcoord[0], texture[0], RECT;
    ADD result.color, R0, R1;
    END

GpuPy GpuPy extends NumPy so that it can take advantage of a GPU. Many of the operations that NumPy provides can be performed efficiently on a GPU.

How it works GpuPy overrides NumPy's default implementations of Float32 operations with GPU-based implementations. GpuPy can be transparent, so if a GPU is present on the system where the script is being run, it will simply run faster. This means that existing scripts can take advantage of a GPU without any changes.
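
A rough sketch of what transparent dispatch could look like is given below. It is not GpuPy's actual mechanism: `GPU_AVAILABLE`, `gpu_add`, and `add` are hypothetical names, and the "GPU" path is a stand-in that simply calls NumPy so the example runs anywhere.

```python
import numpy as np

GPU_AVAILABLE = False  # hypothetical flag; a real system would probe the hardware

def gpu_add(a, b):
    """Hypothetical stand-in for a shader-based addition."""
    return np.add(a, b)

def add(a, b):
    """Use the GPU path only for float32 inputs, otherwise fall back to NumPy."""
    a = np.asarray(a)
    b = np.asarray(b)
    if GPU_AVAILABLE and a.dtype == np.float32 and b.dtype == np.float32:
        return gpu_add(a, b)
    return np.add(a, b)

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
y = np.array([0.5, 0.5, 0.5], dtype=np.float32)
total = add(x, y)
```

The caller sees the same result on either path, which is what allows an existing script to run unchanged.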

Drawbacks Currently, the GPU-based implementations only support 32-bit floating point values, so only Float32 and Complex64 can be accelerated. Simple operations aren't any faster than software (yet).

Performance Results 1 Some operations are never faster on a GPU.

Performance Results 2 Some operations are only faster on a GPU when the array size is large enough.

Performance Results 3 Some operations are much faster on a GPU.

Cross Correlation Implementing NumPy's correlate function on a GPU.
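
For reference, the sketch below spells out what `numpy.correlate` computes in its default "valid" mode for real inputs. Each output element depends only on the inputs, which is what makes the operation a candidate for one-program-per-element execution. The helper name `correlate_valid` is hypothetical.

```python
import numpy as np

def correlate_valid(a, v):
    """Cross-correlation of real 1-D arrays, 'valid' mode (len(a) >= len(v))."""
    a = np.asarray(a, dtype=np.float32)
    v = np.asarray(v, dtype=np.float32)
    n_out = a.size - v.size + 1
    out = np.empty(n_out, dtype=np.float32)
    for k in range(n_out):
        # Each output element is an independent dot product over a window.
        out[k] = np.dot(a[k:k + v.size], v)
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.0, 1.0, 0.5]
ours = correlate_valid(signal, kernel)
reference = np.correlate(signal, kernel)
```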

Reductions Reductions can be performed in parallel on the GPU. A reduction requires O(lg N) steps. Example: finding the maximum value of a 4x4 array, taking the maximum of each 2x2 block at every step.
    Input:
        3 9 7 1
        2 4 0 5
        6 5 3 8
        1 7 2 5
    Step 1:
        9 7
        7 8
    Step 2:
        9
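
The reduction can be sketched in NumPy as repeated 2x2 block maxima. The code assumes a square input whose side is a power of two, and it arranges the sixteen values from the slide as a 4x4 grid (that arrangement is an assumption, chosen because it reproduces the slide's step results). `max_reduce_levels` is a hypothetical name.

```python
import numpy as np

def max_reduce_levels(grid):
    """Return every level of a 2x2 block-max reduction, input first."""
    a = np.asarray(grid, dtype=np.float32)
    levels = [a]
    while a.shape != (1, 1):
        h, w = a.shape
        # Group elements into 2x2 blocks and keep the maximum of each block.
        a = a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
        levels.append(a)
    return levels

grid = [[3, 9, 7, 1],
        [2, 4, 0, 5],
        [6, 5, 3, 8],
        [1, 7, 2, 5]]
levels = max_reduce_levels(grid)
steps = len(levels) - 1
```

Each pass shrinks the array by a factor of four, so the sixteen inputs are reduced in two passes.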

More Complex Operations Convolution can be done on a GPU, and is a good choice because of the large number of multiplies per element copied to the GPU. Sorting networks can also be implemented on the GPU. FFTs can also be performed on the GPU: http://gamma.cs.unc.edu/gpufftw
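
As an illustration of a sorting network, here is odd-even transposition sort in NumPy. The slide does not say which network was intended, so this choice is an assumption; it is one of the simplest networks, not necessarily the fastest. Every compare-exchange within a round is independent, which is the property that suits parallel hardware.

```python
import numpy as np

def odd_even_transposition_sort(values):
    """Sort with n rounds of independent neighbor compare-exchanges."""
    x = np.array(values, dtype=np.float32)  # work on a copy
    n = x.size
    for step in range(n):
        start = step % 2                     # alternate even and odd pairs
        left = x[start:n - 1:2]
        right = x[start + 1:n:2]
        lo = np.minimum(left, right)         # new arrays, safe to assign back
        hi = np.maximum(left, right)
        x[start:n - 1:2] = lo
        x[start + 1:n:2] = hi
    return x

unsorted = [3, 9, 7, 1, 2, 4, 0, 5]
sorted_values = odd_even_transposition_sort(unsorted)
```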

Future Work: High-Level Code for Edge Detection Filter
Cg:
    float main(float2 p : TEXCOORD0,
               uniform samplerRECT arg1) : COLOR
    {
        float v1 = texRECT(arg1, p + float2(-1, -1)).r * -1;
        float v2 = texRECT(arg1, p + float2(-1,  0)).r * -2;
        float v3 = texRECT(arg1, p + float2(-1,  1)).r * -1;
        float v4 = texRECT(arg1, p + float2( 0, -1)).r *  0;
        float v5 = texRECT(arg1, p + float2( 0,  0)).r *  0;
        float v6 = texRECT(arg1, p + float2( 0,  1)).r *  0;
        float v7 = texRECT(arg1, p + float2( 1, -1)).r *  1;
        float v8 = texRECT(arg1, p + float2( 1,  0)).r *  2;
        float v9 = texRECT(arg1, p + float2( 1,  1)).r *  1;
        return v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8 + v9;
    }

Future Work: Assembler for Edge Detection Filter
arbfp1:
    PARAM c[2] = { { 1, 0, 2, -1 }, { -2 } };
    TEMP R0;
    TEMP R1;
    ADD R0.xy, fragment.texcoord[0], c[0].w;
    ADD R0.zw, fragment.texcoord[0].xyxy, c[0].xywy;
    TEX R1.x, R0.zwzw, texture[0], RECT;
    TEX R0.x, R0, texture[0], RECT;
    MAD R0.y, R1.x, c[1].x, -R0.x;
    ADD R1.xy, fragment.texcoord[0], c[0].wxzw;
    TEX R0.x, R1, texture[0], RECT;
    ADD R0.zw, fragment.texcoord[0].xyxy, c[0].xyxw;
    TEX R1.x, R0.zwzw, texture[0], RECT;
    ADD R0.x, R0.y, -R0;
    ADD R0.y, R0.x, R1.x;
    ADD R1.xy, fragment.texcoord[0], c[0].x;
    TEX R0.x, R1, texture[0], RECT;
    ADD R0.zw, fragment.texcoord[0].xyxy, c[0].xyxy;
    TEX R1.x, R0.zwzw, texture[0], RECT;
    MAD R0.y, R1.x, c[0].z, R0;
    ADD result.color.x, R0.y, R0;
    END
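
For comparison, the same edge detection filter can be written as a NumPy sketch using the weights from the Cg version. Two things here are assumptions: reads outside the image are clamped to the nearest edge pixel, and the texture coordinate (x, y) maps to array index [y, x]. `edge_filter` is a hypothetical name.

```python
import numpy as np

# (dx, dy) offsets and weights from the Cg shader; zero weights are omitted.
WEIGHTS = {(-1, -1): -1, (-1, 0): -2, (-1, 1): -1,
           (1, -1): 1, (1, 0): 2, (1, 1): 1}

def edge_filter(image):
    """Apply the 3x3 horizontal-gradient filter to a 2-D array."""
    img = np.asarray(image, dtype=np.float32)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")   # assumed clamp-to-edge behavior
    out = np.zeros((h, w), dtype=np.float32)
    for (dx, dy), weight in WEIGHTS.items():
        # Shifted view of the image: element [y, x] is img[y + dy, x + dx].
        out += weight * padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return out

# A horizontal ramp has a constant gradient away from the borders.
ramp = np.tile(np.arange(5, dtype=np.float32), (4, 1))
edges = edge_filter(ramp)
```

Away from the borders, every output value for the ramp is 8, since the right-hand column contributes 4(x + 1) and the left-hand column subtracts 4(x - 1).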

Future Work: Minimize Copies Moving data between the CPU and the GPU is a bottleneck. Avoiding the transfer of intermediate values can improve performance considerably. Optimize texture usage. Combine complex operations into a single shader. Be aware of how the GPU and driver are managing memory.

Future Work: Lazy Evaluation Postpone calculations until the result is actually needed. Avoid copying data between GPU and CPU whenever possible. Save each expression and evaluate it only when its value is needed.
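
A minimal sketch of lazy evaluation follows. It only shows the deferral idea, using NumPy on the CPU; the class `LazyArray` and its methods are hypothetical and are not part of GpuPy or NumPy.

```python
import numpy as np

class LazyArray:
    """Hypothetical deferred array: stores an expression, evaluates on demand."""

    def __init__(self, thunk):
        self._thunk = thunk      # function that produces the value
        self._cache = None       # filled in on first evaluation

    @classmethod
    def from_array(cls, data):
        arr = np.asarray(data, dtype=np.float32)
        return cls(lambda: arr)

    def __add__(self, other):
        return LazyArray(lambda: self.evaluate() + other.evaluate())

    def __mul__(self, other):
        return LazyArray(lambda: self.evaluate() * other.evaluate())

    def evaluate(self):
        if self._cache is None:
            self._cache = self._thunk()
        return self._cache

a = LazyArray.from_array([1.0, 2.0])
b = LazyArray.from_array([3.0, 4.0])
expr = (a + b) * a               # nothing is computed yet
pending = expr._cache is None
value = expr.evaluate()          # the whole expression runs here
```

In a GPU setting, the saved expression could in principle be compiled into one shader so that the intermediate `a + b` never leaves the GPU.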

Future Work: Miscellaneous Auto-configuration: Dynamically decide whether it is faster to perform the operation in software or using the GPU (or a combination of both). Double-precision emulation: Use the GPU's single-precision hardware to provide double or greater precision floating point values. http://hal.ccsd.cnrs.fr/ccsd-00021443
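
One common building block for emulating extra precision is an error-free addition that returns the rounded sum together with the rounding error, so a value is carried as a pair of single-precision numbers. The sketch below uses Knuth's two-sum with NumPy float32 scalars. It is offered as an illustration of the general idea, not as the method of the linked report, and `two_sum` is a hypothetical name.

```python
import numpy as np

def two_sum(a, b):
    """Return (s, err) with s = float32(a + b) and s + err equal to a + b exactly."""
    a = np.float32(a)
    b = np.float32(b)
    s = a + b                        # rounded single-precision sum
    bb = s - a                       # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb)  # what rounding discarded
    return s, err

hi, lo = two_sum(1.0, 1e-8)
plain = np.float32(1.0) + np.float32(1e-8)
recovered = np.float64(hi) + np.float64(lo)
```

The plain float32 sum rounds back to 1.0, while the pair (`hi`, `lo`) still holds the small term, so `recovered` is greater than 1.0.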

email: eitzenb@eecs.wsu.edu bobl@tricity.wsu.edu on the web: http://eecs.wsu.edu/~eitzenb/gpupy