Ocelot: An Open Source Debugging and Compilation Framework for CUDA

Size: px
Start display at page:

Download "Ocelot: An Open Source Debugging and Compilation Framework for CUDA"

Transcription

1 Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory Diamos*, Andrew Kerr*, Sudhakar Yalamanchili Computer Architecture and Systems Laboratory School of Electrical and Computer Engineering Georgia Institute of Technology Special thanks to our sponsors: IBM, Intel, LogicBlox, NSF, NVIDIA

2 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

3 Productivity Challenges of Heterogeneity Execution Portability Systems evolve over time New systems esd.lbl.gov math.harvard.edu Sandia.gov Performance Debugging and optimization Application Migration Protect investments in existing code bases Emerging Software Stacks Productivity Tools Language Front-End Harmony Ocelot/LLVM NVIDIA JIT OS/VM

4 Ocelot Vision esd.lbl.gov Just-in-time code generation and optimization for CUDA applications Basis for a range of productivity and workload characterization tools

5 Project Goals Encourage proliferation of GPU computing Lower the barriers to entry for researchers and developers Understand performance behavior of CUDA applications on multiple platforms Translate, optimize, debug, and execute CUDA applications on multiple platforms

6 Ocelot: Dynamic Execution Infrastructure PTX Kernel Productivity Tools Multiplatform Support Multi-GPU Support Performance Analysis and Modeling PTX Emulator Dynamic Translation (LLVM)

7 Example Usage Scenarios Research Communities Ocelot Interface Driving Research Programming Languages Domain specific IR Optimization of Database Applications Compiler Optimizations PTX IR Interface Thread fusion, control flow optimization, etc. Operating Systems & Run-Times Switchable Compute Run- Time Kernel scheduling and data movement optimizations Computer Architecture Research Workload Characterization/Tuning Flexible Event Interface Flexible Event Interface Instruction and/or memory trace generation Control flow behavior, data sharing patterns

8 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

9 NVIDIA s Parallel Thread execution ISA PTX defines an execution model that abstracts cores as CTAs and hardware threads/simd units as threads

10 Ocelot PTX Intermediate Representation (IR) Backend compiler framework for PTX Full-featured PTX IR Class hierarchy for PTX instructions/directives PTX control flow graph Static single-assignment form Dataflow/dominance analysis Enables PTX optimization IR to IR translation From PTX to other IRs LLVM (x86/powerpc/arm) CAL (AMD GPUs) PTX Kernel

11 Example PTX PTX Optimization Ocelot includes a comprehensive optimization framework for PTX Implementation modeled after LLVM passes Optimizations can be applied dynamically between kernel executions

12 CUDA on CPUs Execution model translation Ocelot fuses CUDA threads into heavy-weight pthreads

13 PTX IR CPU Backend Memory Traversal 2 This pattern significantly affects throughput-bound applications

14 Ocelot CUDA Runtime A complete reimplementation of the CUDA Runtime API Clean device abstraction All back-ends implement same interface Compatible with existing applications Link against libocelot.so instead of libcudart Ocelot API Extensions Add/remove trace generators Compile/launch kernels directly in PTX Device memory sharing among host threads Device switching

15 Switchable Compute Switch devices at runtime Load balancing Instrumentation Fault-and-emulate Remote execution

16 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

17 Ocelot PTX Emulator Abstract machine model Performs functional simulation of the PTX execution model Enables detailed performance evaluation/debugging Program correctness checking Workload characterization and modeling Architecture simulation PTX 1.4 support Fermi support in progress

18 PTX Emulator CUDA Debugging- Erroneous Memory Accesses // file: memorycheck.cu global void badmemoryreference(int *A) { A[threadIdx.x] = 0; // line 3 - faulting store } int main() { int *invalidptr = 0x0234; // arbitrary pointer does not refer // to an existing memory allocation int *validptr = 0; cudamalloc((void **)&validptr, sizeof(int)*64); } badmemoryreference<<< dim3(1,1), dim3(64, 1) >>>( invalidptr ); return 0; ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range. ==Ocelot== ==Ocelot== Nearby Device Allocations ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes) ==Ocelot== ==Ocelot== Near memorycheck.cu:3:0

19 PTX Emulator CUDA Debugging Race Detection // file: racecondition.cu global void racecondition(int *A) { shared int SharedMem[64]; SharedMem[threadIdx.x] = A[threadIdx.x]; // no synchronization barrier! A[threadIdx.x] = SharedMem[64 - threadidx.x]; // line 9 - faulting load }... racecondition<<< dim3(1,1), dim3(64, 1) >>>( validptr );... ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception: ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r ] - Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between. ==Ocelot== Near racecondition.cu:9:0

20 PTX Emulator Interactive Debugger Interactive command-line debugger implemented using TraceGenerator interface $./TestCudaSequence A_gpu = 0x16dcbe0 (ocelot-dbg) Attaching debugger to kernel 'sequence' (ocelot-dbg) (ocelot-dbg) watch global address 0x16dcbe4 s32[3] set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes (ocelot-dbg) (ocelot-dbg) continue st.global.s32 [%r11 + 0], %r7 watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0x16dcbe4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0x16dcbe8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0x16dcbec 4 bytes old value = -1 new value = 6 break on watchpoint (ocelot-dbg) Enables Inspection of application state Single-stepping of instructions Breakpoints and watchpoints Faults in MemoryChecker and RaceDetector invoke ocelotdbg automatically

21 PTX Emulator Performance Tuning Identify critical regions Memory demand Floating-point intensity Shared memory bank conflicts cold... for (int offset = 1; offset < n; offset *= 2) { // line 61 pout = 1 - pout; pin = 1 - pout; syncthreads(); temp[pout*n+thid] = temp[pin*n+thid]; hot if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid - offset]; } }...

22 Trace Generator Interface Emulator broadcasts events to trace generators during execution Events describe instruction execution, memory addresses, etc. Used for error checking, instrumentation, and simulation

23 Trace Generator Interface //! Base class for generating traces class TraceGenerator { public: TraceGenerator(); virtual ~TraceGenerator(); // called when a traced kernel is launched to // retrieve some parameters from the kernel virtual void initialize( const executive::executablekernel& kernel); // Called whenever an event takes place. virtual void event(const TraceEvent & event); // called when an event is committed virtual void postevent(const TraceEvent & event); // Called when a kernel is finished. There will // be no more events for this kernel. virtual void finish(); };

24 TraceEvent Object Captures execution of a dynamic instruction PC and internal representation of PTX instruction Set of active threads executing instruction Kernel grid and CTA dimensions Memory addresses and size of transfer Branch target(s)

25 Example Thread Load Imbalance //! Computes number of dynamic instructions for each thread class ThreadLoadImbalance: public trace::tracegenerator { public: std::vector< size_t > dynamicinstructions; // For each dynamic instruction, increment counters of each // thread that executes it virtual void event(const TraceEvent & event) { if (!dynamicinstrucitons.size()) dynamicinstructions.resize(event.active.size(), 0); } }; for (int i = 0; i < event.active.size(); i++) { if (event.active[i]) dynamicinstructions[i]++; } Mandelbrot (CUDA SDK)

26 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

27 Workload Characterization Describe applications using architecture-independent metrics Control flow, memory intensity/efficiency, data sharing, parallelism Many functions constructed using the trace generator interface

28 Activity Factor Fraction of threads active for each dynamic instruction Utilization of CTA-wide SIMD processor Sensitive to reconvergence mechanism

29 Interthread Data Flow Which kernels exchange computed results through shared memory? Track id of producer thread Ensure threads are well synchronized Optionally ignore uses of shared memory to transfer working sets

30 Statistical Modeling Prior work has used Ocelot for statistical workload/architecture modeling Application/machine metrics collected via static analysis, instrumentation, and emulation Principal Component Analysis (PCA) combines related metrics Cluster analysis identifies related sets of applications Regression models use metrics to predict application behavior

31 PTX Emulator Statistical Modeling Data flow correlated to Kernels launched, live registers Kernels imply barriers Data sharing across CTAs Applications that share data do so at all levels of the thread hierarchy

32 Architecture Simulation Ocelot PTX emulator has been used as an architecture simulator front-end Can explore impact of architectural changes eg. Instruction fetch and memory scheduling policies MacSim Architecture Simulator (H. Kim-GT) J. Lee, N. B. Lakshminarayana, H. Kim, R.Vuduc, Many-Thread Aware Prefetching Mechanisms for GPGPU Applications, MICRO 43, December 2010.

33 Ongoing Work

34 Ocelot Front Ends Today Future Plans Enterprise SW frontend from LogicBlox Inc. (in progress) Datalog CUDA OpenCL Harmony Runtime Correctness & Performance Tools Ocelot: PTX Emulator Instrumentation database

35 AMD Backend Goal: Explore CUDA a cross-platform alternative to OpenCL CUDA as a generic programming model Understand the relationships between algorithms and architecture Explore execution and performance portability issues Architecture specific code generation issues, VLIW+SIMD vs. Superscalar+SIMD Unstructured control flow People: Rodrigo Dominguez (rdomingu@ece.neu.edu)

36 Software Warp-Formation People: Andrew Kerr

37 Harmony Runtime: Multiple Devices People: Gregory Diamos

38 Contributors Ocelot Team: Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Si Li, Tri Pho, Haicheng Wu, Sudhakar Yalamanchili Special Thanks: Nathan Bell, Sylvan Collange, Jared Hoberook, David Luebke, Ryuta Suzuki, Steve Worley, Bruno Coutinho

39 Questions? Contact us: Mailing list: Download Harmony and Ocelot:

40 PTX IR CPU Backend Memory Traversal PTX optimizations can change memory access patterns

The Era of Heterogeneous Compute: Challenges and Opportunities

The Era of Heterogeneous Compute: Challenges and Opportunities The Era of Heterogeneous Compute: Challenges and Opportunities Sudhakar Yalamanchili Computer Architecture and Systems Laboratory Center for Experimental Research in Computer Systems School of Electrical

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,

More information

Profiling of Data-Parallel Processors

Profiling of Data-Parallel Processors Profiling of Data-Parallel Processors Daniel Kruck 09/02/2014 09/02/2014 Profiling Daniel Kruck 1 / 41 Outline 1 Motivation 2 Background - GPUs 3 Profiler NVIDIA Tools Lynx 4 Optimizations 5 Conclusion

More information

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Programmer Interface Klaus Mueller Computer Science Department Stony Brook University Compute Levels Encodes the hardware capability of a GPU card newer cards have higher compute

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

A Framework for Dynamically Instrumenting GPU Compute Applications within GPU Ocelot

A Framework for Dynamically Instrumenting GPU Compute Applications within GPU Ocelot A Framework for Dynamically Instrumenting GPU Compute Applications within GPU Ocelot Naila Farooqui 1, Andrew Kerr 2, Gregory Diamos 3, S. Yalamanchili 4, and K. Schwan 5 College of Computing 1,5, School

More information

Technical Report: GIT-CERCS-09-06

Technical Report: GIT-CERCS-09-06 Technical Report: GIT-CERCS-09-06 A Characterization and Analysis of GPGPU Kernels Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili School of Electrical and Computer Engineering Georgia Institute

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures

Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan and Sudhakar Yalamanchili College of Computing,

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin CUDA Development Using NVIDIA Nsight, Eclipse Edition David Goodwin NVIDIA Nsight Eclipse Edition CUDA Integrated Development Environment Project Management Edit Build Debug Profile SC'12 2 Powered By

More information

Accelerating Data Warehousing Applications Using General Purpose GPUs

Accelerating Data Warehousing Applications Using General Purpose GPUs Accelerating Data Warehousing Applications Using General Purpose s Sponsors: Na%onal Science Founda%on, LogicBlox Inc., IBM, and NVIDIA The General Purpose is a many core co-processor 10s to 100s of cores

More information

Characterization and Transformation of Unstructured Control. Flow in Bulk Synchronous GPU Applications

Characterization and Transformation of Unstructured Control. Flow in Bulk Synchronous GPU Applications Characterization and Transformation of Unstructured Control Flow in Bulk Synchronous GPU Applications Haicheng Wu, Gregory Diamos, Jin Wang, Si Li, and Sudhakar Yalamanchili School of Electrical and Computer

More information

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline

More information

Advanced CUDA Optimizations. Umar Arshad ArrayFire

Advanced CUDA Optimizations. Umar Arshad ArrayFire Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

Overview of Performance Prediction Tools for Better Development and Tuning Support

Overview of Performance Prediction Tools for Better Development and Tuning Support Overview of Performance Prediction Tools for Better Development and Tuning Support Universidade Federal Fluminense Rommel Anatoli Quintanilla Cruz / Master's Student Esteban Clua / Associate Professor

More information

Multi2sim Kepler: A Detailed Architectural GPU Simulator

Multi2sim Kepler: A Detailed Architectural GPU Simulator Multi2sim Kepler: A Detailed Architectural GPU Simulator Xun Gong, Rafael Ubal, David Kaeli Northeastern University Computer Architecture Research Lab Department of Electrical and Computer Engineering

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software GPU Debugging Made Easy David Lecomber CTO, Allinea Software david@allinea.com Allinea Software HPC development tools company Leading in HPC software tools market Wide customer base Blue-chip engineering,

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

Heterogeneous Computing with a Fused CPU+GPU Device

Heterogeneous Computing with a Fused CPU+GPU Device with a Fused CPU+GPU Device MediaTek White Paper January 2015 2015 MediaTek Inc. 1 Introduction Heterogeneous computing technology, based on OpenCL (Open Computing Language), intelligently fuses GPU and

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Lecture 11: GPU programming

Lecture 11: GPU programming Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!

More information

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Module 2: Introduction to CUDA C

Module 2: Introduction to CUDA C ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

GPU Programming with Ateji PX June 8 th Ateji All rights reserved.

GPU Programming with Ateji PX June 8 th Ateji All rights reserved. GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get

More information

Domain Specific Languages for Financial Payoffs. Matthew Leslie Bank of America Merrill Lynch

Domain Specific Languages for Financial Payoffs. Matthew Leslie Bank of America Merrill Lynch Domain Specific Languages for Financial Payoffs Matthew Leslie Bank of America Merrill Lynch Outline Introduction What, How, and Why do we use DSLs in Finance? Implementation Interpreting, Compiling Performance

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Visualization of OpenCL Application Execution on CPU-GPU Systems

Visualization of OpenCL Application Execution on CPU-GPU Systems Visualization of OpenCL Application Execution on CPU-GPU Systems A. Ziabari*, R. Ubal*, D. Schaa**, D. Kaeli* *NUCAR Group, Northeastern Universiy **AMD Northeastern University Computer Architecture Research

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Characterization and Transformation of Unstructured Control Flow in GPU Applications

Characterization and Transformation of Unstructured Control Flow in GPU Applications Characterization and Transformation of Unstructured Control Flow in GPU Applications Haicheng Wu, Gregory Diamos, Si Li, and Sudhakar Yalamanchili School of Electrical and Computer Engineering Georgia

More information

Module 2: Introduction to CUDA C. Objective

Module 2: Introduction to CUDA C. Objective ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

NVIDIA GPU CODING & COMPUTING

NVIDIA GPU CODING & COMPUTING NVIDIA GPU CODING & COMPUTING WHY GPU S? ARCHITECTURE & PROGRAM MODEL CPU v. GPU Multiprocessor Model Memory Model Memory Model: Thread Level Programing Model: Logical Mapping of Threads Programing Model:

More information

An Execution Model and Runtime For Heterogeneous. Many-Core Systems

An Execution Model and Runtime For Heterogeneous. Many-Core Systems An Execution Model and Runtime For Heterogeneous Many-Core Systems Gregory Diamos School of Electrical and Computer Engineering Georgia Institute of Technology gregory.diamos@gatech.edu January 10, 2011

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Compiling CUDA and Other Languages for GPUs. Vinod Grover and Yuan Lin

Compiling CUDA and Other Languages for GPUs. Vinod Grover and Yuan Lin Compiling CUDA and Other Languages for GPUs Vinod Grover and Yuan Lin Agenda Vision Compiler Architecture Scenarios SDK Components Roadmap Deep Dive SDK Samples Demos Vision Build a platform for GPU computing

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today

More information

From Application to Technology OpenCL Application Processors Chung-Ho Chen

From Application to Technology OpenCL Application Processors Chung-Ho Chen From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing GPGPU general-purpose

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers

More information

A MODEL OF DYNAMIC COMPILATION FOR HETEROGENEOUS COMPUTE PLATFORMS

A MODEL OF DYNAMIC COMPILATION FOR HETEROGENEOUS COMPUTE PLATFORMS A MODEL OF DYNAMIC COMPILATION FOR HETEROGENEOUS COMPUTE PLATFORMS A Thesis Presented to The Academic Faculty by Andrew Kerr In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

HARMONY: AN EXECUTION MODEL FOR HETEROGENEOUS SYSTEMS

HARMONY: AN EXECUTION MODEL FOR HETEROGENEOUS SYSTEMS HARMONY: AN EXECUTION MODEL FOR HETEROGENEOUS SYSTEMS A Thesis Presented to The Academic Faculty by Gregory Frederick Diamos In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information

Evolving a CUDA Kernel from an nvidia Template

Evolving a CUDA Kernel from an nvidia Template Evolving a CUDA Kernel from an nvidia Template W. B. Langdon CREST lab, Department of Computer Science 16a.7.2010 Introduction Using genetic programming to create C source code How? Why? Proof of concept:

More information

Register File Organization

Register File Organization Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities

More information

Evolving a CUDA Kernel from an nvidia Template

Evolving a CUDA Kernel from an nvidia Template Evolving a CUDA Kernel from an nvidia Template W. B. Langdon CREST lab, Department of Computer Science 11.5.2011 Introduction Using genetic programming to create C source code How? Why? Proof of concept:

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Locality-Aware Mapping of Nested Parallel Patterns on GPUs Locality-Aware Mapping of Nested Parallel Patterns on GPUs HyoukJoong Lee *, Kevin Brown *, Arvind Sujeeth *, Tiark Rompf, Kunle Olukotun * * Pervasive Parallelism Laboratory, Stanford University Purdue

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

COSC 6385 Computer Architecture. - Data Level Parallelism (II)

COSC 6385 Computer Architecture. - Data Level Parallelism (II) COSC 6385 Computer Architecture - Data Level Parallelism (II) Fall 2013 SIMD Instructions Originally developed for Multimedia applications Same operation executed for multiple data items Uses a fixed length

More information

Re-architecting Virtualization in Heterogeneous Multicore Systems

Re-architecting Virtualization in Heterogeneous Multicore Systems Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College

More information

Scaling CUDA for Distributed Heterogeneous Processors

Scaling CUDA for Distributed Heterogeneous Processors San Jose State University SJSU ScholarWorks Master's Theses Master's Theses and Graduate Research Spring 2012 Scaling CUDA for Distributed Heterogeneous Processors Siu Kwan Lam San Jose State University

More information

OCELOT: SUPPORTED DEVICES SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING GEORGIA INSTITUTE OF TECHNOLOGY

OCELOT: SUPPORTED DEVICES SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: SUPPORTED DEVICES Overview Multicore-Backend NVIDIA GPU Backend AMD GPU Backend Ocelot PTX Emulator 2 Overview Multicore-Backend NVIDIA GPU Backend AMD GPU Backend Ocelot PTX Emulator 3 Multicore

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

PANOPTES: A BINARY TRANSLATION FRAMEWORK FOR CUDA. Chris Kennelly D. E. Shaw Research

PANOPTES: A BINARY TRANSLATION FRAMEWORK FOR CUDA. Chris Kennelly D. E. Shaw Research PANOPTES: A BINARY TRANSLATION FRAMEWORK FOR CUDA Chris Kennelly D. E. Shaw Research Outline The Motivating Problems Binary Translation as a Solution Results of Panoptes Future Work My Story: Buffer Ping-Ponging

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

GPU Computing Master Clss. Development Tools

GPU Computing Master Clss. Development Tools GPU Computing Master Clss Development Tools Generic CUDA debugger goals Support all standard debuggers across all OS Linux GDB, TotalView and DDD Windows Visual studio Mac - XCode Support CUDA runtime

More information

General-purpose computing on graphics processing units (GPGPU)

General-purpose computing on graphics processing units (GPGPU) General-purpose computing on graphics processing units (GPGPU) Thomas Ægidiussen Jensen Henrik Anker Rasmussen François Rosé November 1, 2010 Table of Contents Introduction CUDA CUDA Programming Kernels

More information

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center An Introduction to GPU Architecture and CUDA C/C++ Programming Bin Chen April 4, 2018 Research Computing Center Outline Introduction to GPU architecture Introduction to CUDA programming model Using the

More information

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU

More information

ClearSpeed Visual Profiler

ClearSpeed Visual Profiler ClearSpeed Visual Profiler Copyright 2007 ClearSpeed Technology plc. All rights reserved. 12 November 2007 www.clearspeed.com 1 Profiling Application Code Why use a profiler? Program analysis tools are

More information

COSC 6339 Accelerators in Big Data

COSC 6339 Accelerators in Big Data COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

NVIDIA CUDA. Fermi Compatibility Guide for CUDA Applications. Version 1.1

NVIDIA CUDA. Fermi Compatibility Guide for CUDA Applications. Version 1.1 NVIDIA CUDA Fermi Compatibility Guide for CUDA Applications Version 1.1 4/19/2010 Table of Contents Software Requirements... 1 What Is This Document?... 1 1.1 Application Compatibility on Fermi... 1 1.2

More information

simcuda: A C++ based CUDA Simulation Framework

simcuda: A C++ based CUDA Simulation Framework Technical Report simcuda: A C++ based CUDA Simulation Framework Abhishek Das and Andreas Gerstlauer UT-CERC-16-01 May 20, 2016 Computer Engineering Research Center Department of Electrical & Computer Engineering

More information

Caracal: Dynamic Translation of Runtime Environments for GPUs

Caracal: Dynamic Translation of Runtime Environments for GPUs Caracal: Dynamic Translation of Runtime Environments for GPUs Rodrigo Domínguez rdomingu@ece.neu.edu Dana Schaa dschaa@ece.neu.edu Department of Electrical and Computer Engineering Northeastern University

More information