Ocelot: An Open Source Debugging and Compilation Framework for CUDA
1 Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory Diamos*, Andrew Kerr*, Sudhakar Yalamanchili Computer Architecture and Systems Laboratory School of Electrical and Computer Engineering Georgia Institute of Technology Special thanks to our sponsors: IBM, Intel, LogicBlox, NSF, NVIDIA
2 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization
3 Productivity Challenges of Heterogeneity Execution Portability Systems evolve over time New systems Performance Debugging and optimization Application Migration Protect investments in existing code bases Emerging Software Stacks Productivity Tools Language Front-End Harmony Ocelot/LLVM NVIDIA JIT OS/VM
4 Ocelot Vision Just-in-time code generation and optimization for CUDA applications Basis for a range of productivity and workload characterization tools
5 Project Goals Encourage proliferation of GPU computing Lower the barriers to entry for researchers and developers Understand performance behavior of CUDA applications on multiple platforms Translate, optimize, debug, and execute CUDA applications on multiple platforms
6 Ocelot: Dynamic Execution Infrastructure PTX Kernel Productivity Tools Multiplatform Support Multi-GPU Support Performance Analysis and Modeling PTX Emulator Dynamic Translation (LLVM)
7 Example Usage Scenarios
Research Communities | Ocelot Interface | Driving Research
Programming Languages | Domain specific IR | Optimization of Database Applications
Compiler Optimizations | PTX IR Interface | Thread fusion, control flow optimization, etc.
Operating Systems & Run-Times | Switchable Compute Run-Time | Kernel scheduling and data movement optimizations
Computer Architecture Research | Flexible Event Interface | Instruction and/or memory trace generation
Workload Characterization/Tuning | Flexible Event Interface | Control flow behavior, data sharing patterns
9 NVIDIA's Parallel Thread Execution (PTX) ISA PTX defines an execution model that abstracts cores as CTAs and hardware threads/SIMD units as threads
10 Ocelot PTX Intermediate Representation (IR) Backend compiler framework for PTX Full-featured PTX IR Class hierarchy for PTX instructions/directives PTX control flow graph Static single-assignment form Dataflow/dominance analysis Enables PTX optimization IR to IR translation From PTX to other IRs LLVM (x86/powerpc/arm) CAL (AMD GPUs) PTX Kernel
11 Example PTX PTX Optimization Ocelot includes a comprehensive optimization framework for PTX Implementation modeled after LLVM passes Optimizations can be applied dynamically between kernel executions
12 CUDA on CPUs Execution model translation Ocelot fuses CUDA threads into heavy-weight pthreads
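The execution model translation on this slide can be pictured with a minimal C++ sketch. It assumes a toy kernel with one barrier and is not Ocelot's actual code generator; the function name runFusedCta and the buffer layout are hypothetical. One host thread runs every CUDA thread of a CTA in turn, and the barrier becomes the boundary between two loops.

```cpp
#include <cstddef>
#include <vector>

// Toy kernel being modeled, per CUDA thread `tid` of a single CTA:
//   s[tid] = A[tid];
//   __syncthreads();
//   A[tid] = s[ctaSize - 1 - tid];
//
// Fused form (illustrative only): one host thread iterates over all
// thread ids, and the code is split into regions at the barrier.
inline void runFusedCta(std::vector<int>& A) {
    const std::size_t ctaSize = A.size();
    std::vector<int> shared(ctaSize);  // stands in for __shared__ memory

    // Region before the barrier, executed for every thread id in turn.
    for (std::size_t tid = 0; tid < ctaSize; ++tid) {
        shared[tid] = A[tid];
    }

    // The barrier: every thread has finished the first region here.

    // Region after the barrier.
    for (std::size_t tid = 0; tid < ctaSize; ++tid) {
        A[tid] = shared[ctaSize - 1 - tid];
    }
}
```

Calling runFusedCta on a four-element buffer reverses it in place, which is what the CUDA kernel would do with one CTA of four threads. In a real translation several such fused loops would run on separate pthreads, one per CTA or group of CTAs.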
13 PTX IR CPU Backend Memory Traversal 2 This pattern significantly affects throughput-bound applications
14 Ocelot CUDA Runtime A complete reimplementation of the CUDA Runtime API Clean device abstraction All back-ends implement same interface Compatible with existing applications Link against libocelot.so instead of libcudart Ocelot API Extensions Add/remove trace generators Compile/launch kernels directly in PTX Device memory sharing among host threads Device switching
15 Switchable Compute Switch devices at runtime Load balancing Instrumentation Fault-and-emulate Remote execution
17 Ocelot PTX Emulator Abstract machine model Performs functional simulation of the PTX execution model Enables detailed performance evaluation/debugging Program correctness checking Workload characterization and modeling Architecture simulation PTX 1.4 support Fermi support in progress
18 PTX Emulator CUDA Debugging: Erroneous Memory Accesses
// file: memorycheck.cu
__global__ void badMemoryReference(int *A) {
    A[threadIdx.x] = 0; // line 3 - faulting store
}
int main() {
    int *invalidPtr = (int *)0x0234; // arbitrary pointer does not refer
                                     // to an existing memory allocation
    int *validPtr = 0;
    cudaMalloc((void **)&validPtr, sizeof(int)*64);
    badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr );
    return 0;
}
==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception:
==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range.
==Ocelot==
==Ocelot== Nearby Device Allocations
==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
==Ocelot==
==Ocelot== Near memorycheck.cu:3:0
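A bounds check of the kind reported in the log above could look roughly like the following C++ sketch. It is only an illustration of the idea, assuming the emulator keeps a list of device allocations; the Allocation struct and isValidAccess function are hypothetical names, not Ocelot's MemoryChecker API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one device allocation: [base, base + size).
struct Allocation {
    std::uintptr_t base;
    std::size_t size;
};

// Returns true if the access [address, address + bytes) lies entirely
// inside one known allocation. The comparison is written so that it
// cannot overflow for large addresses.
inline bool isValidAccess(const std::vector<Allocation>& allocations,
                          std::uintptr_t address, std::size_t bytes) {
    for (const Allocation& a : allocations) {
        if (address >= a.base && bytes <= a.size &&
            address - a.base <= a.size - bytes) {
            return true;
        }
    }
    return false;
}
```

With a single 256-byte allocation at 0x12fa2e0, as in the log, a 4-byte store to 0x234 is rejected while a store to the first word of the allocation is accepted.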
19 PTX Emulator CUDA Debugging: Race Detection
// file: racecondition.cu
__global__ void raceCondition(int *A) {
    __shared__ int SharedMem[64];
    SharedMem[threadIdx.x] = A[threadIdx.x];
    // no synchronization barrier!
    A[threadIdx.x] = SharedMem[64 - threadIdx.x]; // line 9 - faulting load
}
...
raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );
...
==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r ] - Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between.
==Ocelot== Near racecondition.cu:9:0
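One way to detect the race reported above is to remember, for each shared-memory word, which thread last wrote it since the most recent barrier. The C++ sketch below illustrates that idea under simplifying assumptions (word granularity, loads only, a single CTA); the class SharedRaceDetector and its methods are hypothetical and are not taken from Ocelot's RaceDetector.

```cpp
#include <cstddef>
#include <vector>

// Marker meaning "no thread has written this word since the last barrier".
constexpr int kNoWriter = -1;

// Hypothetical sketch of a shared-memory race detector for one CTA.
class SharedRaceDetector {
public:
    explicit SharedRaceDetector(std::size_t words)
        : lastWriter_(words, kNoWriter) {}

    // Record a store by `thread` to shared word `index`.
    void onStore(std::size_t index, int thread) {
        lastWriter_[index] = thread;
    }

    // Returns true if a load by `thread` from `index` races, i.e. a
    // different thread wrote the word and no barrier has happened since.
    bool onLoad(std::size_t index, int thread) const {
        const int writer = lastWriter_[index];
        return writer != kNoWriter && writer != thread;
    }

    // A barrier orders all earlier writes before later reads,
    // so the writer history can be cleared.
    void onBarrier() {
        lastWriter_.assign(lastWriter_.size(), kNoWriter);
    }

private:
    std::vector<int> lastWriter_;
};
```

If thread 63 stores to word 63 and thread 0 then loads it, onLoad reports a race; after onBarrier the same load is accepted. A fuller detector would also have to consider write-after-write and write-after-read conflicts.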
20 PTX Emulator Interactive Debugger Interactive command-line debugger implemented using TraceGenerator interface
$ ./TestCudaSequence
A_gpu = 0x16dcbe0
(ocelot-dbg) Attaching debugger to kernel 'sequence'
(ocelot-dbg)
(ocelot-dbg) watch global address 0x16dcbe4 s32[3]
set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes
(ocelot-dbg)
(ocelot-dbg) continue
st.global.s32 [%r11 + 0], %r7
watchpoint #1 - CTA (0, 0)
  thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
    old value = -1
    new value = 2
  thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
    old value = -1
    new value = 4
  thread (3, 0, 0) - store to 0x16dcbec 4 bytes
    old value = -1
    new value = 6
break on watchpoint
(ocelot-dbg)
Enables: Inspection of application state Single-stepping of instructions Breakpoints and watchpoints
Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically
21 PTX Emulator Performance Tuning Identify critical regions Memory demand Floating-point intensity Shared memory bank conflicts
[Figure: source lines shaded from cold to hot]
...
for (int offset = 1; offset < n; offset *= 2) { // line 61
    pout = 1 - pout;
    pin = 1 - pout;
    __syncthreads();
    temp[pout*n+thid] = temp[pin*n+thid];
    if (thid >= offset) {
        temp[pout*n+thid] += temp[pin*n+thid - offset];
    }
}
...
22 Trace Generator Interface Emulator broadcasts events to trace generators during execution Events describe instruction execution, memory addresses, etc. Used for error checking, instrumentation, and simulation
23 Trace Generator Interface
//! Base class for generating traces
class TraceGenerator {
public:
    TraceGenerator();
    virtual ~TraceGenerator();

    // Called when a traced kernel is launched to
    // retrieve some parameters from the kernel.
    virtual void initialize(const executive::ExecutableKernel& kernel);

    // Called whenever an event takes place.
    virtual void event(const TraceEvent & event);

    // Called when an event is committed.
    virtual void postEvent(const TraceEvent & event);

    // Called when a kernel is finished. There will
    // be no more events for this kernel.
    virtual void finish();
};
24 TraceEvent Object Captures execution of a dynamic instruction PC and internal representation of PTX instruction Set of active threads executing instruction Kernel grid and CTA dimensions Memory addresses and size of transfer Branch target(s)
25 Example: Thread Load Imbalance
//! Computes number of dynamic instructions for each thread
class ThreadLoadImbalance: public trace::TraceGenerator {
public:
    std::vector< size_t > dynamicInstructions;

    // For each dynamic instruction, increment counters of each
    // thread that executes it
    virtual void event(const TraceEvent & event) {
        if (!dynamicInstructions.size())
            dynamicInstructions.resize(event.active.size(), 0);
        for (size_t i = 0; i < event.active.size(); i++) {
            if (event.active[i]) dynamicInstructions[i]++;
        }
    }
};
Figure: Mandelbrot (CUDA SDK)
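To try the counting logic from this slide without building Ocelot, a stand-in harness along the following lines may help. Everything prefixed with Mock is a hypothetical substitute for the real trace classes, reduced to the one field and one callback the example uses; the broadcast function only imitates how the emulator hands each event to every attached generator.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for TraceEvent: only the active-thread mask is modeled.
struct MockTraceEvent {
    std::vector<bool> active;  // one flag per thread in the CTA
};

// Stand-in for the TraceGenerator base class: only event() is modeled.
class MockTraceGenerator {
public:
    virtual ~MockTraceGenerator() = default;
    virtual void event(const MockTraceEvent& event) = 0;
};

// Same counting logic as the ThreadLoadImbalance example.
class MockThreadLoadImbalance : public MockTraceGenerator {
public:
    std::vector<std::size_t> dynamicInstructions;

    void event(const MockTraceEvent& event) override {
        if (dynamicInstructions.empty()) {
            dynamicInstructions.resize(event.active.size(), 0);
        }
        for (std::size_t i = 0; i < event.active.size(); ++i) {
            if (event.active[i]) ++dynamicInstructions[i];
        }
    }
};

// Imitates the emulator side: each event goes to every generator.
inline void broadcast(const std::vector<MockTraceGenerator*>& generators,
                      const MockTraceEvent& event) {
    for (MockTraceGenerator* generator : generators) {
        generator->event(event);
    }
}
```

Feeding two events with masks {1,1,0,1} and {1,0,0,0} leaves per-thread counts of 2, 1, 0 and 1, which is the kind of imbalance the slide's figure visualizes for a real kernel.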
27 Workload Characterization Describe applications using architecture-independent metrics Control flow, memory intensity/efficiency, data sharing, parallelism Many functions constructed using the trace generator interface
28 Activity Factor Fraction of threads active for each dynamic instruction Utilization of CTA-wide SIMD processor Sensitive to reconvergence mechanism
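Read literally, the metric on this slide is the fraction of a CTA's threads that are active, averaged over all dynamic instructions. The C++ sketch below computes it from a list of active masks; the function name activityFactor and the mask representation are assumptions made for illustration, and the exact averaging used in the Ocelot tools may differ.

```cpp
#include <cstddef>
#include <vector>

// Each inner vector is the active mask of one dynamic instruction,
// with one flag per thread of the CTA.
// Returns the mean fraction of active threads, or 0.0 for an empty trace.
inline double activityFactor(const std::vector<std::vector<bool>>& masks) {
    if (masks.empty()) return 0.0;
    double sum = 0.0;
    for (const std::vector<bool>& mask : masks) {
        if (mask.empty()) continue;  // contributes a fraction of zero
        std::size_t active = 0;
        for (bool bit : mask) {
            if (bit) ++active;
        }
        sum += static_cast<double>(active) /
               static_cast<double>(mask.size());
    }
    return sum / static_cast<double>(masks.size());
}
```

For a trace of two instructions on a four-thread CTA, one fully active and one with half the threads masked off, the result is 0.75. A value near 1.0 suggests little control-flow divergence.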
29 Interthread Data Flow Which kernels exchange computed results through shared memory? Track id of producer thread Ensure threads are well synchronized Optionally ignore uses of shared memory to transfer working sets
30 Statistical Modeling Prior work has used Ocelot for statistical workload/architecture modeling Application/machine metrics collected via static analysis, instrumentation, and emulation Principal Component Analysis (PCA) combines related metrics Cluster analysis identifies related sets of applications Regression models use metrics to predict application behavior
31 PTX Emulator Statistical Modeling Data flow correlated to Kernels launched, live registers Kernels imply barriers Data sharing across CTAs Applications that share data do so at all levels of the thread hierarchy
32 Architecture Simulation Ocelot PTX emulator has been used as an architecture simulator front-end Can explore impact of architectural changes, e.g. instruction fetch and memory scheduling policies MacSim Architecture Simulator (H. Kim, GT) J. Lee, N. B. Lakshminarayana, H. Kim, R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications", MICRO 43, December 2010.
33 Ongoing Work
34 Ocelot Front Ends Today Future Plans Enterprise SW frontend from LogicBlox Inc. (in progress) Datalog CUDA OpenCL Harmony Runtime Correctness & Performance Tools Ocelot: PTX Emulator Instrumentation database
35 AMD Backend Goal: Explore CUDA as a cross-platform alternative to OpenCL CUDA as a generic programming model Understand the relationships between algorithms and architecture Explore execution and performance portability issues Architecture specific code generation issues, VLIW+SIMD vs. Superscalar+SIMD Unstructured control flow People: Rodrigo Dominguez (rdomingu@ece.neu.edu)
36 Software Warp-Formation People: Andrew Kerr
37 Harmony Runtime: Multiple Devices People: Gregory Diamos
38 Contributors Ocelot Team: Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Si Li, Tri Pho, Haicheng Wu, Sudhakar Yalamanchili Special Thanks: Nathan Bell, Sylvain Collange, Jared Hoberock, David Luebke, Ryuta Suzuki, Steve Worley, Bruno Coutinho
39 Questions? Contact us: Mailing list: Download Harmony and Ocelot:
40 PTX IR CPU Backend Memory Traversal PTX optimizations can change memory access patterns
More informationLocality-Aware Mapping of Nested Parallel Patterns on GPUs
Locality-Aware Mapping of Nested Parallel Patterns on GPUs HyoukJoong Lee *, Kevin Brown *, Arvind Sujeeth *, Tiark Rompf, Kunle Olukotun * * Pervasive Parallelism Laboratory, Stanford University Purdue
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationCOSC 6385 Computer Architecture. - Data Level Parallelism (II)
COSC 6385 Computer Architecture - Data Level Parallelism (II) Fall 2013 SIMD Instructions Originally developed for Multimedia applications Same operation executed for multiple data items Uses a fixed length
More informationRe-architecting Virtualization in Heterogeneous Multicore Systems
Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College
More informationScaling CUDA for Distributed Heterogeneous Processors
San Jose State University SJSU ScholarWorks Master's Theses Master's Theses and Graduate Research Spring 2012 Scaling CUDA for Distributed Heterogeneous Processors Siu Kwan Lam San Jose State University
More informationOCELOT: SUPPORTED DEVICES SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING GEORGIA INSTITUTE OF TECHNOLOGY
OCELOT: SUPPORTED DEVICES Overview Multicore-Backend NVIDIA GPU Backend AMD GPU Backend Ocelot PTX Emulator 2 Overview Multicore-Backend NVIDIA GPU Backend AMD GPU Backend Ocelot PTX Emulator 3 Multicore
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationPANOPTES: A BINARY TRANSLATION FRAMEWORK FOR CUDA. Chris Kennelly D. E. Shaw Research
PANOPTES: A BINARY TRANSLATION FRAMEWORK FOR CUDA Chris Kennelly D. E. Shaw Research Outline The Motivating Problems Binary Translation as a Solution Results of Panoptes Future Work My Story: Buffer Ping-Ponging
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationGPU Computing Master Clss. Development Tools
GPU Computing Master Clss Development Tools Generic CUDA debugger goals Support all standard debuggers across all OS Linux GDB, TotalView and DDD Windows Visual studio Mac - XCode Support CUDA runtime
More informationGeneral-purpose computing on graphics processing units (GPGPU)
General-purpose computing on graphics processing units (GPGPU) Thomas Ægidiussen Jensen Henrik Anker Rasmussen François Rosé November 1, 2010 Table of Contents Introduction CUDA CUDA Programming Kernels
More informationAn Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center
An Introduction to GPU Architecture and CUDA C/C++ Programming Bin Chen April 4, 2018 Research Computing Center Outline Introduction to GPU architecture Introduction to CUDA programming model Using the
More informationTHE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD
THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU
More informationClearSpeed Visual Profiler
ClearSpeed Visual Profiler Copyright 2007 ClearSpeed Technology plc. All rights reserved. 12 November 2007 www.clearspeed.com 1 Profiling Application Code Why use a profiler? Program analysis tools are
More informationCOSC 6339 Accelerators in Big Data
COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationNVIDIA CUDA. Fermi Compatibility Guide for CUDA Applications. Version 1.1
NVIDIA CUDA Fermi Compatibility Guide for CUDA Applications Version 1.1 4/19/2010 Table of Contents Software Requirements... 1 What Is This Document?... 1 1.1 Application Compatibility on Fermi... 1 1.2
More informationsimcuda: A C++ based CUDA Simulation Framework
Technical Report simcuda: A C++ based CUDA Simulation Framework Abhishek Das and Andreas Gerstlauer UT-CERC-16-01 May 20, 2016 Computer Engineering Research Center Department of Electrical & Computer Engineering
More informationCaracal: Dynamic Translation of Runtime Environments for GPUs
Caracal: Dynamic Translation of Runtime Environments for GPUs Rodrigo Domínguez rdomingu@ece.neu.edu Dana Schaa dschaa@ece.neu.edu Department of Electrical and Computer Engineering Northeastern University
More information