General Purpose GPU Programming
Advanced Operating Systems Tutorial 7

Tutorial Outline
Review of lectured material
Key points
Discussion
OpenCL
Future directions

Review of Lectured Material
Heterogeneous instruction set systems
Heterogeneous multi-kernel systems (Helios)
Main core with heterogeneous offload: graphics offload hardware, GPGPU programming model (OpenCL), integration with operating systems
Heterogeneous virtual machines (Hera JVM)
Hybrid models (Accelerator): lazy encoding of SIMD-style operations and JIT compilation into the type system

Key Points
Increasing heterogeneity of hardware
Programming models are complex: too limited to run a full operating system, too different to effectively run standard programming languages
OpenCL-style offload model performs well, but is complex to program (see the sketch below)
Attempts to hide the complexity in a VM have had mixed success
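To make the last two points concrete, here is a rough sketch (not from the lectured material) of the OpenCL offload model, written in C++ against the standard OpenCL 1.x C API. The kernel name vec_add and the fixed problem size are illustrative choices, and error handling is omitted; even so, a trivial element-wise vector addition needs explicit device discovery, run-time kernel compilation, buffer management, and data transfer on the host side.

// Minimal OpenCL offload sketch: element-wise vector addition.
// Assumes an OpenCL 1.x platform with a GPU device; error handling omitted.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Kernel source, compiled at run time: one work-item per output element.
static const char *kSource =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Host boilerplate: platform, device, context, command queue.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    // The kernel is compiled for the device at run time.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "vec_add", nullptr);

    // Explicit buffer management and host-to-device copies.
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                               n * sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dc);

    // Launch n work-items, then copy the result back to the host.
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);

    printf("c[0] = %f\n", c[0]);   // expected: 3.0
    return 0;
}

Roughly forty lines of host code for a three-line kernel is the price OpenCL pays for portability and explicit control, and it is exactly the complexity that approaches such as Accelerator (discussed below) try to hide inside the virtual machine.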

Discussion
What is the complexity versus performance trade-off in OpenCL, and how does this limit its usefulness?
How can SIMD-style processing be more cleanly incorporated into modern languages?
Is the embedded DSL approach of Accelerator a step in the right direction, or is the complexity of the VM excessive?
How should heterogeneous processing resources be used?

OpenCL Overview. Ofer Rosenberg, AMD. November 2011.

Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses
David Tarditi, Sidd Puri, Jose Oglesby. Microsoft Research. {dtarditi,siddpuri,joseogl}@microsoft.com

Abstract
GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations, or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that instead uses data parallelism to program GPUs for general-purpose uses. Programmers use a conventional imperative programming language and a library that provides only high-level data-parallel operations. No aspects of GPUs are exposed to programmers. The library implementation compiles the data-parallel operations on the fly to optimized GPU pixel shader code and API calls. We describe the compilation techniques used to do this. We evaluate the effectiveness of using data parallelism to program GPUs by providing results for a set of compute-intensive benchmarks. We compare the performance of Accelerator versions of the benchmarks against hand-written pixel shaders. The speeds of the Accelerator versions are typically within 50% of the speeds of hand-written pixel shader code. Some benchmarks significantly outperform C versions on a CPU: they are up to 18 times faster than C code running on a CPU.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming - Parallel Programming; D.3.4 [Programming Languages]: Processors - Compilers
General Terms: Measurement, Performance, Experimentation, Languages
Keywords: Graphics processing units, data parallelism, just-in-time compilation

1. Introduction
Highly programmable graphics processing units (GPUs) became available in 2001 [10] and have evolved rapidly since then [15]. GPUs are now highly parallel processors that deliver much higher floating-point performance for some workloads than comparable CPUs. For example, the ATI Radeon x1900 processor has 48 pixel shader processors, each of which is capable of 4 floating-point operations per cycle, at a clock speed of 650 MHz. It has a peak floating-point performance of over 250 GFLOPS using single-precision floating-point numbers, counting multiply-adds as two FLOPs. GPUs have an explicitly parallel programming model and their performance continues to increase as transistor counts increase. The performance available on GPUs has led to interest in using GPUs for general-purpose programming [16, 8]. It is difficult, however, for most programmers to program GPUs for general-purpose uses.
In this paper, we show how to use data parallelism to program GPUs for general-purpose uses. We start with a conventional imperative language, C# (which is similar to Java). We provide a library that implements an abstract data type providing data-parallel arrays; no aspects of GPUs are exposed to programmers. The library evaluates the data-parallel operations using a GPU; all other operations are evaluated on the CPU. For efficiency, the library does not immediately perform data-parallel operations. Instead, it builds a graph of desired operations and compiles the operations on demand to GPU pixel shader code and API calls.

Data-parallel arrays only provide aggregate operations over entire input arrays. The operations are a subset of those found in languages like APL and include element-wise arithmetic and comparison operators, reduction operations (such as sum), and transformations on arrays. Data-parallel arrays are functional: each operation produces a new data-parallel array. Programmers must explicitly convert back and forth between conventional arrays and data-parallel arrays. The lazy compilation is typically done when a program converts a data-parallel array to a normal array.

Compiling data-parallel operations lazily to a GPU allows us to implement the operations efficiently: the system can avoid creating large numbers of temporary data-parallel arrays and optimize the creation of pixel shaders. It also allows us to avoid exposing GPU details to programmers: the system manages the use of GPU resources automatically and amortizes the cost of accessing graphics APIs. Compilation at run time also allows the system to handle properties and features that vary across GPU manufacturers and models.

We have implemented these ideas in a system called Accelerator. We evaluate the effectiveness of the approach using a set of benchmarks for compute-intensive tasks such as image processing and computer vision, run on several generations of GPUs from both ATI and NVidia. We implemented the benchmarks in hand-written pixel shader assembly for GPUs, C# using Accelerator, and C++ for the CPU. The C# programs, including compilation overhead, are typically within 2x of the speed of the hand-written pixel shader programs, and sometimes exceed their speeds. The C# programs, like the hand-written pixel shader programs, often outperform the C++ programs (by up to 18x).

Prior work on programming GPUs for general-purpose uses either targets the specialized GPU programming model directly or provides a stream programming abstraction of GPUs. It is difficult to target the GPU directly. First, programmers need to learn the graphics programming model, which is specialized to the set of
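To illustrate the programming model described in the excerpt above: Accelerator itself is a C# library, but the following C++ sketch captures the central idea under assumed, hypothetical names (ParallelArray, Node, and ToArray are not the real Accelerator API). Element-wise operations only build an expression graph; compilation and execution happen when the program converts the data-parallel array back to an ordinary array.

// Sketch of a lazily evaluated data-parallel array library in the style of
// Accelerator. Names and structure are hypothetical, not the real API.
#include <cstddef>
#include <memory>
#include <vector>

struct Node;                              // expression-graph node
using NodePtr = std::shared_ptr<Node>;

struct Node {
    enum class Op { Constant, Add, Mul } op;
    std::vector<float> data;              // used by Constant nodes
    NodePtr left, right;                  // used by Add/Mul nodes
};

class ParallelArray {
public:
    explicit ParallelArray(std::vector<float> v)
        : node_(std::make_shared<Node>(Node{Node::Op::Constant, std::move(v), nullptr, nullptr})) {}

    // Element-wise operations only record graph nodes; nothing executes yet.
    friend ParallelArray operator+(const ParallelArray &a, const ParallelArray &b) {
        return ParallelArray(std::make_shared<Node>(Node{Node::Op::Add, {}, a.node_, b.node_}));
    }
    friend ParallelArray operator*(const ParallelArray &a, const ParallelArray &b) {
        return ParallelArray(std::make_shared<Node>(Node{Node::Op::Mul, {}, a.node_, b.node_}));
    }

    // Converting back to an ordinary array triggers compilation and execution
    // of the whole graph. Accelerator would JIT the graph to GPU shader code
    // here; this sketch simply walks it on the CPU to show where laziness ends.
    std::vector<float> ToArray() const { return Evaluate(node_); }

private:
    explicit ParallelArray(NodePtr n) : node_(std::move(n)) {}

    static std::vector<float> Evaluate(const NodePtr &n) {
        if (n->op == Node::Op::Constant) return n->data;
        std::vector<float> l = Evaluate(n->left), r = Evaluate(n->right);
        std::vector<float> out(l.size());
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = (n->op == Node::Op::Add) ? l[i] + r[i] : l[i] * r[i];
        return out;
    }

    NodePtr node_;
};

int main() {
    // The whole expression (a + b) * a is evaluated as a single unit when
    // ToArray() is called, so no intermediate arrays need to be materialised.
    ParallelArray a(std::vector<float>{1, 2, 3});
    ParallelArray b(std::vector<float>{4, 5, 6});
    std::vector<float> result = ((a + b) * a).ToArray();   // {5, 14, 27}
    return result.size() == 3 ? 0 : 1;
}

In the real system, the Evaluate step would instead JIT-compile the whole graph to pixel shader code and graphics API calls, which is what allows it to avoid materialising temporary arrays and to amortise the cost of driver calls across an entire expression.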

Course Wrap-up
Real-time scheduling of periodic tasks
Real-time scheduling of aperiodic & sporadic tasks
Resource management
Real-time and embedded systems programming
Garbage collection
Message passing
Transactions
General purpose GPU programming

Key Points
Real-time systems: predictability and reliability are critical; desire to raise the level of abstraction to help achieve these goals
Garbage collection is effective, but at a high memory overhead cost; real-time garbage collection exists
Message passing is effective for multi-core systems; potential of the multi-kernel operating system model
Transactions seem to have limited applicability
No effective GPGPU programming model; OpenCL is too low-level and not a long-term solution

Discussion
Wide spectrum of research ideas and concepts; which are seeing widespread use?
Functional languages and message-passing concurrency
Garbage collection: potential for integration with kernels
Increased use of static code analysis tools, to debug the limitations of C
Opportunities for dependable kernels:
New implementation frameworks and safe programming languages
Approaches similar to Singularity have large potential

Examination
Final examination: worth 80% of the marks for the course
2 hours; answer 3 out of 4 questions
Sample exam and past papers available on Moodle and on the website
All material covered in the lectures, tutorials, and papers is examinable
The aim is to test your understanding of the material, not simply your memory of all the details; in particular, read papers to understand the concepts, not the details
Explain why, don't just recite what: we are looking for your reasoned and justified technical opinion about the material

The End!
http://csperkins.org/teaching/adv-os/