General Purpose GPU Programming. Advanced Operating Systems Tutorial 9


Tutorial Outline
- Review of lectured material
- Key points
- Discussion
- OpenCL
- Future directions

Review of Lectured Material
- Heterogeneous instruction set systems
  - Heterogeneous multi-kernel systems: Helios
  - Main core with heterogeneous offload
- Graphics offload hardware and GPGPU
  - Programming model: OpenCL
  - Integration with operating systems
- Heterogeneous virtual machines
  - Hera JVM
- Hybrid models
  - Accelerator: lazy encoding of SIMD-style operations and JIT compilation into the type system

Key Points
- Increasing heterogeneity of hardware
- Programming models are complex
  - Too limited to run a full operating system
  - Too different to effectively run standard programming languages
- OpenCL-style offload model performs well, but is complex to program
- Attempts to hide complexity in the VM have had mixed success
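
To make the "complex to program" point concrete, the sketch below shows roughly how much host-side code a minimal OpenCL program needs just to double a vector on a device. It is an illustrative example, not taken from the lectured material: error handling is omitted, and the kernel, buffer size, and variable names are arbitrary choices for the sketch.

```c
/* Minimal OpenCL host program: scale a vector by 2 on a default device.
 * Illustrative sketch only; error checking is omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void scale(__global float *data) {\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] = 2.0f * data[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float host[N];
    for (int i = 0; i < N; i++) host[i] = (float)i;

    /* 1. Discover a platform and device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* 2. Create a context and command queue for that device. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 3. Build the kernel from source at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    /* 4. Copy input to device memory, set arguments, launch, read back. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof host, host, NULL);
    clSetKernelArg(kernel, 0, sizeof buf, &buf);
    size_t global = N;
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof host, host, 0, NULL, NULL);

    printf("host[10] = %f\n", host[10]);

    /* 5. Release everything that was created. */
    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```

Almost all of this is setup and teardown; the actual computation is the three-line kernel string. That imbalance is part of what the discussion questions below probe.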

Discussion
- What is the complexity versus performance trade-off in OpenCL, and how does this limit its usefulness?
  - Is the OpenCL programming model straightforward and understandable for non-expert programmers?
  - How much complexity is inherent in the problem domain?
- How can SIMD-style processing be more cleanly incorporated into modern languages?
  - Is OpenCL C difficult because it is based on C, because it is low-level, or because the problem space is complex?
  - Is the embedded DSL approach of Accelerator a step in the right direction, or is the complexity of the VM excessive?
- How to use heterogeneous processing resources?

Readings:
- OpenCL Overview. Ofer Rosenberg, AMD, November 2011.
- Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses. David Tarditi, Sidd Puri, and Jose Oglesby, Microsoft Research. ASPLOS '06, October 21-25, 2006, San Jose, California, USA.

Excerpt from the Accelerator paper:

Abstract. GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations, or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses instead. Programmers use a conventional imperative programming language and a library that provides only high-level data-parallel operations. No aspects of GPUs are exposed to programmers. The library implementation compiles the data-parallel operations on the fly to optimized GPU pixel shader code and API calls. We describe the compilation techniques used to do this. We evaluate the effectiveness of using data parallelism to program GPUs by providing results for a set of compute-intensive benchmarks. We compare the performance of Accelerator versions of the benchmarks against hand-written pixel shaders. The speeds of the Accelerator versions are typically within 50% of the speeds of hand-written pixel shader code. Some benchmarks significantly outperform C versions on a CPU: they are up to 18 times faster than C code running on a CPU.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming - Parallel Programming; D.3.4 [Programming Languages]: Processors - Compilers. General Terms: Measurement, Performance, Experimentation, Languages. Keywords: graphics processing units, data parallelism, just-in-time compilation.

1. Introduction. Highly programmable graphics processing units (GPUs) became available in 2001 [10] and have evolved rapidly since then [15]. GPUs are now highly parallel processors that deliver much higher floating-point performance for some workloads than comparable CPUs. For example, the ATI Radeon x1900 processor has 48 pixel shader processors, each of which is capable of 4 floating-point operations per cycle, at a clock speed of 650 MHz. It has a peak floating-point performance of over 250 GFLOPS using single-precision floating-point numbers, counting multiply-adds as two FLOPs. GPUs have an explicitly parallel programming model and their performance continues to increase as transistor counts increase.
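
As a rough back-of-the-envelope check of the quoted peak figure, treating each of the 4 per-cycle operations as a multiply-add counted as two FLOPs:

    48 shader processors × 4 ops/cycle × 2 FLOPs/op × 650 × 10^6 cycles/s ≈ 2.5 × 10^11 FLOPS ≈ 250 GFLOPS
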
The performance available on GPUs has led to interest in using GPUs for general-purpose programming [16, 8]. It is difficult, however, for most programmers to program GPUs for general-purpose uses. In this paper, we show how to use data parallelism to program GPUs for general-purpose uses. We start with a conventional imperative language, C# (which is similar to Java). We provide a library that implements an abstract data type providing data-parallel arrays; no aspects of GPUs are exposed to programmers. The library evaluates the data-parallel operations using a GPU; all other operations are evaluated on the CPU. For efficiency, the library does not immediately perform data-parallel operations. Instead, it builds a graph of desired operations and compiles the operations on demand to GPU pixel shader code and API calls.

Data-parallel arrays only provide aggregate operations over entire input arrays. The operations are a subset of those found in languages like APL and include element-wise arithmetic and comparison operators, reduction operations (such as sum), and transformations on arrays. Data-parallel arrays are functional: each operation produces a new data-parallel array. Programmers must explicitly convert back and forth between conventional arrays and data-parallel arrays. The lazy compilation is typically done when a program converts a data-parallel array to a normal array.

Compiling data-parallel operations lazily to a GPU allows us to implement the operations efficiently: the system can avoid creating large numbers of temporary data-parallel arrays and optimize the creation of pixel shaders. It also allows us to avoid exposing GPU details to programmers: the system manages the use of GPU resources automatically and amortizes the cost of accessing graphics APIs. Compilation at run time also allows the system to handle properties and features that vary across GPU manufacturers and models.

We have implemented these ideas in a system called Accelerator. We evaluate the effectiveness of the approach using a set of benchmarks for compute-intensive tasks such as image processing and computer vision, run on several generations of GPUs from both ATI and NVIDIA. We implemented the benchmarks in hand-written pixel shader assembly for GPUs, C# using Accelerator, and C++ for the CPU. The C# programs, including compilation overhead, are typically within 2× of the speed of the hand-written pixel shader programs, and sometimes exceed their speeds. The C# programs, like the hand-written pixel shader programs, often outperform the C++ programs (by up to 18×).

Prior work on programming GPUs for general-purpose uses either targets the specialized GPU programming model directly or provides a stream programming abstraction of GPUs. It is difficult to target the GPU directly. First, programmers need to learn the graphics programming model, which is specialized to the set of [...]
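
The deferred-evaluation scheme described in the excerpt (record a graph of data-parallel operations, then compile and run it only when the result is needed) can be illustrated with a small sketch. The C code below is a toy stand-in, not the Accelerator API: it evaluates the graph element by element on the CPU instead of generating pixel shader code, and all names are invented for the example.

```c
/* Toy deferred-evaluation sketch: operations build a graph, and work is
 * done only when the result is converted back to an ordinary array.
 * (No error handling or memory reclamation in this sketch.) */
#include <stdio.h>
#include <stdlib.h>

typedef enum { OP_CONST, OP_ADD, OP_MUL } OpKind;

typedef struct Node Node;
struct Node {
    OpKind kind;
    const float *data;        /* OP_CONST: borrowed input array      */
    const Node *lhs, *rhs;    /* OP_ADD / OP_MUL: operand subgraphs  */
};

static Node *node(OpKind k, const float *d, const Node *l, const Node *r) {
    Node *n = malloc(sizeof *n);
    n->kind = k; n->data = d; n->lhs = l; n->rhs = r;
    return n;
}

/* Building the graph: no per-element work happens in these calls. */
static Node *da_const(const float *d)             { return node(OP_CONST, d, NULL, NULL); }
static Node *da_add(const Node *a, const Node *b) { return node(OP_ADD, NULL, a, b); }
static Node *da_mul(const Node *a, const Node *b) { return node(OP_MUL, NULL, a, b); }

static float eval_at(const Node *n, size_t i) {
    switch (n->kind) {
    case OP_CONST: return n->data[i];
    case OP_ADD:   return eval_at(n->lhs, i) + eval_at(n->rhs, i);
    default:       return eval_at(n->lhs, i) * eval_at(n->rhs, i);
    }
}

/* Conversion back to a normal array is where evaluation happens. */
static void da_to_array(const Node *n, float *out, size_t len) {
    for (size_t i = 0; i < len; i++) out[i] = eval_at(n, i);
}

int main(void) {
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, out[4];
    Node *expr = da_mul(da_add(da_const(x), da_const(y)), da_const(x)); /* (x+y)*x */
    da_to_array(expr, out, 4);   /* all deferred work happens only here */
    for (int i = 0; i < 4; i++) printf("%g ", out[i]);
    printf("\n");
    return 0;
}
```

The point of the structure is that da_add and da_mul do no per-element work; everything is deferred until da_to_array, which is the step at which Accelerator would instead generate and launch an optimized pixel shader covering the whole expression graph.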