Ocelot: An Open Source Debugging and Compilation Framework for CUDA

1 Ocelot: An Open Source Debugging and Compilation Framework for CUDA
Gregory Diamos*, Andrew Kerr*, Sudhakar Yalamanchili
Computer Architecture and Systems Laboratory, School of Electrical and Computer Engineering, Georgia Institute of Technology
Special thanks to our sponsors: IBM, Intel, LogicBlox, NSF, NVIDIA

2 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

3 Productivity Challenges of Heterogeneity
- Execution Portability: systems evolve over time; new systems
- Performance: debugging and optimization
- Application Migration: protect investments in existing code bases
- Emerging Software Stacks: Productivity Tools, Language Front-End, Harmony, Ocelot/LLVM, NVIDIA JIT, OS/VM
(Image sources: esd.lbl.gov, math.harvard.edu, Sandia.gov)

4 Ocelot Vision
- Just-in-time code generation and optimization for CUDA applications
- Basis for a range of productivity and workload characterization tools
(Image source: esd.lbl.gov)

5 Project Goals Encourage proliferation of GPU computing Lower the barriers to entry for researchers and developers Understand performance behavior of CUDA applications on multiple platforms Translate, optimize, debug, and execute CUDA applications on multiple platforms

6 Ocelot: Dynamic Execution Infrastructure PTX Kernel Productivity Tools Multiplatform Support Multi-GPU Support Performance Analysis and Modeling PTX Emulator Dynamic Translation (LLVM)

7 Example Usage Scenarios
Research Communities | Ocelot Interface | Driving Research
Programming Languages | Domain specific IR | Optimization of Database Applications
Compiler Optimizations | PTX IR Interface | Thread fusion, control flow optimization, etc.
Operating Systems & Run-Times | Switchable Compute Run-Time | Kernel scheduling and data movement optimizations
Computer Architecture Research | Flexible Event Interface | Instruction and/or memory trace generation
Workload Characterization/Tuning | Flexible Event Interface | Control flow behavior, data sharing patterns

8 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

9 NVIDIA's Parallel Thread Execution (PTX) ISA
PTX defines an execution model that abstracts cores as CTAs and hardware threads/SIMD units as threads.

10 Ocelot PTX Intermediate Representation (IR)
- Backend compiler framework for PTX
- Full-featured PTX IR
  - Class hierarchy for PTX instructions/directives
  - PTX control flow graph
  - Static single-assignment form
  - Dataflow/dominance analysis
  - Enables PTX optimization
- IR to IR translation
  - From PTX to other IRs
  - LLVM (x86/PowerPC/ARM)
  - CAL (AMD GPUs)

11 PTX Optimization
- Ocelot includes a comprehensive optimization framework for PTX
- Implementation modeled after LLVM passes
- Optimizations can be applied dynamically between kernel executions
(Figure: example PTX)

12 CUDA on CPUs Execution model translation Ocelot fuses CUDA threads into heavy-weight pthreads
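One way to picture the execution model translation is to serialize the threads of a CTA on a single CPU thread and split the kernel at each barrier. The sketch below is illustrative only: the kernel, the function name `runFusedCta`, and the single-barrier structure are assumptions made for this example, not Ocelot's generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: each thread writes its id to shared memory, waits at a
// barrier, then reads its neighbour's slot. In the fused form, every
// barrier-delimited region becomes one loop over the thread ids of the CTA.
std::vector<int> runFusedCta(std::size_t threadsPerCta) {
    assert(threadsPerCta > 0);
    std::vector<int> shared(threadsPerCta);  // stands in for __shared__ memory
    std::vector<int> out(threadsPerCta);     // stands in for global memory

    // Region 1: code before the barrier, run for every thread in turn.
    for (std::size_t tid = 0; tid < threadsPerCta; ++tid) {
        shared[tid] = static_cast<int>(tid);
    }

    // The barrier is implicit: region 2 starts only after region 1 has
    // completed for all threads.

    // Region 2: code after the barrier.
    for (std::size_t tid = 0; tid < threadsPerCta; ++tid) {
        out[tid] = shared[(tid + 1) % threadsPerCta];
    }
    return out;
}
```

Calling `runFusedCta(64)` mimics a launch with one CTA of 64 threads. Because each region finishes for all threads before the next begins, the barrier semantics hold without any per-thread context switch.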

13 PTX IR CPU Backend Memory Traversal 2 This pattern significantly affects throughput-bound applications

14 Ocelot CUDA Runtime
- A complete reimplementation of the CUDA Runtime API
- Clean device abstraction
  - All back-ends implement same interface
- Compatible with existing applications
  - Link against libocelot.so instead of libcudart
- Ocelot API Extensions
  - Add/remove trace generators
  - Compile/launch kernels directly in PTX
  - Device memory sharing among host threads
  - Device switching

15 Switchable Compute Switch devices at runtime Load balancing Instrumentation Fault-and-emulate Remote execution

16 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

17 Ocelot PTX Emulator Abstract machine model Performs functional simulation of the PTX execution model Enables detailed performance evaluation/debugging Program correctness checking Workload characterization and modeling Architecture simulation PTX 1.4 support Fermi support in progress

18 PTX Emulator CUDA Debugging: Erroneous Memory Accesses

    // file: memorycheck.cu
    __global__ void badMemoryReference(int *A) {
        A[threadIdx.x] = 0; // line 3 - faulting store
    }

    int main() {
        int *invalidPtr = (int *)0x0234; // arbitrary pointer does not refer
                                         // to an existing memory allocation
        int *validPtr = 0;
        cudaMalloc((void **)&validPtr, sizeof(int)*64);

        badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr );
        return 0;
    }

Emulator output:

    ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception:
    ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range.
    ==Ocelot==
    ==Ocelot== Nearby Device Allocations
    ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
    ==Ocelot==
    ==Ocelot== Near memorycheck.cu:3:0
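The kind of check reported above can be approximated with a table of live allocations that is consulted on every global memory access. This is a minimal sketch under stated assumptions: `AllocationTable` and its methods are hypothetical names, and the source does not describe how Ocelot's own checker is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical allocation table: maps base address -> size in bytes.
class AllocationTable {
public:
    void add(std::uintptr_t base, std::size_t bytes) {
        assert(bytes > 0);
        ranges_[base] = bytes;
    }

    // True if [address, address + bytes) lies inside a single allocation.
    bool contains(std::uintptr_t address, std::size_t bytes) const {
        auto it = ranges_.upper_bound(address);   // first base > address
        if (it == ranges_.begin()) return false;  // no base <= address
        --it;                                     // greatest base <= address
        std::uintptr_t offset = address - it->first;
        return offset <= it->second && bytes <= it->second - offset;
    }

private:
    std::map<std::uintptr_t, std::size_t> ranges_;
};
```

With the 256-byte allocation at `0x12fa2e0` from the log registered through `add`, a 4-byte access at `0x234` fails `contains`, which is the condition the emulator reports as an access outside any allocated or mapped range.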

19 PTX Emulator CUDA Debugging: Race Detection

    // file: racecondition.cu
    __global__ void raceCondition(int *A) {
        __shared__ int SharedMem[64];
        SharedMem[threadIdx.x] = A[threadIdx.x];
        // no synchronization barrier!
        A[threadIdx.x] = SharedMem[63 - threadIdx.x]; // line 9 - faulting load
    }
    ...
    raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );
    ...

Emulator output:

    ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
    ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r ] - Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between.
    ==Ocelot== Near racecondition.cu:9:0
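A shared memory race of this kind can be detected by remembering which thread last wrote each word since the most recent barrier. The following is a minimal sketch, assuming word-granularity tracking and a single CTA. The class `RaceDetector` here is a hypothetical stand-in for illustration, not the Ocelot class of the same name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNoWriter = -1;  // marks a word not written since the last barrier

// Hypothetical shared-memory race detector for one CTA.
class RaceDetector {
public:
    explicit RaceDetector(std::size_t words) : lastWriter_(words, kNoWriter) {}

    // Record that `thread` wrote the given shared-memory word.
    void store(std::size_t word, int thread) {
        assert(word < lastWriter_.size());
        lastWriter_[word] = thread;
    }

    // Returns the id of the conflicting writer, or kNoWriter if the load is safe.
    int load(std::size_t word, int thread) const {
        assert(word < lastWriter_.size());
        int writer = lastWriter_[word];
        return (writer != kNoWriter && writer != thread) ? writer : kNoWriter;
    }

    // A barrier orders all earlier writes before later reads, so the
    // bookkeeping is reset.
    void barrier() { lastWriter_.assign(lastWriter_.size(), kNoWriter); }

private:
    std::vector<int> lastWriter_;
};
```

In the slide's scenario, thread 63 stores to word 63 (byte address 0xfc) and thread 0 then loads it with no barrier in between, so `load(63, 0)` returns 63. After a call to `barrier()`, the same load is reported as safe.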

20 PTX Emulator Interactive Debugger
Interactive command-line debugger implemented using TraceGenerator interface

    $ ./TestCudaSequence
    A_gpu = 0x16dcbe0
    (ocelot-dbg) Attaching debugger to kernel 'sequence'
    (ocelot-dbg)
    (ocelot-dbg) watch global address 0x16dcbe4 s32[3]
    set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes
    (ocelot-dbg)
    (ocelot-dbg) continue
    st.global.s32 [%r11 + 0], %r7
    watchpoint #1 - CTA (0, 0)
      thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
        old value = -1
        new value = 2
      thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
        old value = -1
        new value = 4
      thread (3, 0, 0) - store to 0x16dcbec 4 bytes
        old value = -1
        new value = 6
    break on watchpoint
    (ocelot-dbg)

Enables
- Inspection of application state
- Single-stepping of instructions
- Breakpoints and watchpoints
Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically

21 PTX Emulator Performance Tuning
Identify critical regions
- Memory demand
- Floating-point intensity
- Shared memory bank conflicts

    ...
    for (int offset = 1; offset < n; offset *= 2) { // line 61
        pout = 1 - pout;
        pin = 1 - pout;
        __syncthreads();
        temp[pout*n+thid] = temp[pin*n+thid];
        if (thid >= offset) {
            temp[pout*n+thid] += temp[pin*n+thid - offset];
        }
    }
    ...

(Figure legend: cold ... hot)

22 Trace Generator Interface Emulator broadcasts events to trace generators during execution Events describe instruction execution, memory addresses, etc. Used for error checking, instrumentation, and simulation

23 Trace Generator Interface

    //! Base class for generating traces
    class TraceGenerator {
    public:
        TraceGenerator();
        virtual ~TraceGenerator();

        // Called when a traced kernel is launched to
        // retrieve some parameters from the kernel
        virtual void initialize(
            const executive::ExecutableKernel& kernel);

        // Called whenever an event takes place.
        virtual void event(const TraceEvent & event);

        // Called when an event is committed
        virtual void postEvent(const TraceEvent & event);

        // Called when a kernel is finished. There will
        // be no more events for this kernel.
        virtual void finish();
    };

24 TraceEvent Object Captures execution of a dynamic instruction PC and internal representation of PTX instruction Set of active threads executing instruction Kernel grid and CTA dimensions Memory addresses and size of transfer Branch target(s)
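The broadcast relationship between the emulator and its trace generators follows the observer pattern. The sketch below uses simplified, hypothetical stand-ins (`SimpleEvent`, `SimpleGenerator`, `Broadcaster`, `EventCounter`) rather than Ocelot's real classes, and models only two of the event fields listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for a trace event (hypothetical fields).
struct SimpleEvent {
    std::size_t pc;            // program counter of the dynamic instruction
    std::vector<bool> active;  // one flag per thread in the CTA
};

// Simplified stand-in for the trace generator base class.
class SimpleGenerator {
public:
    virtual ~SimpleGenerator() = default;
    virtual void event(const SimpleEvent& e) = 0;
};

// Example generator: counts how many events it has observed.
class EventCounter : public SimpleGenerator {
public:
    std::size_t count = 0;
    void event(const SimpleEvent&) override { ++count; }
};

// Emulator side: forwards each event to every attached generator.
class Broadcaster {
public:
    void attach(SimpleGenerator* generator) {
        assert(generator != nullptr);
        generators_.push_back(generator);
    }
    void broadcast(const SimpleEvent& e) {
        for (SimpleGenerator* generator : generators_) generator->event(e);
    }

private:
    std::vector<SimpleGenerator*> generators_;
};
```

Several generators can be attached at once, so an error checker and a profiler could observe the same run without knowing about each other.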

25 Example: Thread Load Imbalance

    //! Computes number of dynamic instructions for each thread
    class ThreadLoadImbalance: public trace::TraceGenerator {
    public:
        std::vector< size_t > dynamicInstructions;

        // For each dynamic instruction, increment counters of each
        // thread that executes it
        virtual void event(const TraceEvent & event) {
            if (!dynamicInstructions.size())
                dynamicInstructions.resize(event.active.size(), 0);
            for (int i = 0; i < event.active.size(); i++) {
                if (event.active[i]) dynamicInstructions[i]++;
            }
        }
    };

(Figure: Mandelbrot (CUDA SDK))

26 Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

27 Workload Characterization Describe applications using architecture-independent metrics Control flow, memory intensity/efficiency, data sharing, parallelism Many functions constructed using the trace generator interface

28 Activity Factor Fraction of threads active for each dynamic instruction Utilization of CTA-wide SIMD processor Sensitive to reconvergence mechanism
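As a worked illustration of the metric, the activity factor can be computed from the per-instruction active masks as the number of active thread slots divided by the total number of thread slots. The function name `activityFactor` and the mask representation are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each inner vector is the active mask of one dynamic instruction:
// true where the corresponding thread of the CTA executed it.
double activityFactor(const std::vector<std::vector<bool>>& activeMasks) {
    std::size_t activeSlots = 0;
    std::size_t totalSlots = 0;
    for (const std::vector<bool>& mask : activeMasks) {
        // Every instruction of a kernel runs on the same CTA width.
        assert(mask.size() == activeMasks.front().size());
        for (bool isActive : mask) {
            if (isActive) ++activeSlots;
        }
        totalSlots += mask.size();
    }
    if (totalSlots == 0) return 0.0;  // no instructions observed
    return static_cast<double>(activeSlots) / static_cast<double>(totalSlots);
}
```

For a 4-thread CTA that executes one instruction with all threads active and a second with only two active, 6 of the 8 slots are used, so the result is 0.75.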

29 Interthread Data Flow Which kernels exchange computed results through shared memory? Track id of producer thread Ensure threads are well synchronized Optionally ignore uses of shared memory to transfer working sets
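Tracking the id of the producer thread can be sketched as follows: record the last writer of each shared memory word and, on each load by a different thread, count one producer-to-consumer flow. `DataFlowTracker` and its methods are hypothetical names for illustration, and the word-granularity bookkeeping is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical tracker of data flow between threads through shared memory.
class DataFlowTracker {
public:
    explicit DataFlowTracker(std::size_t words) : producer_(words, -1) {}

    // Remember which thread produced the value now held in `word`.
    void store(std::size_t word, int thread) {
        assert(word < producer_.size());
        producer_[word] = thread;
    }

    // Count a flow when a thread consumes a value another thread produced.
    void load(std::size_t word, int thread) {
        assert(word < producer_.size());
        int producer = producer_[word];
        if (producer >= 0 && producer != thread) {
            ++flows_[std::make_pair(producer, thread)];
        }
    }

    // Number of values that flowed from `producer` to `consumer`.
    std::size_t flowCount(int producer, int consumer) const {
        auto it = flows_.find(std::make_pair(producer, consumer));
        return it == flows_.end() ? 0 : it->second;
    }

private:
    std::vector<int> producer_;                         // -1: not yet written
    std::map<std::pair<int, int>, std::size_t> flows_;  // (producer, consumer)
};
```

Loads of a thread's own data are not counted, which roughly corresponds to ignoring uses of shared memory that only hold a thread's private working set.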

30 Statistical Modeling Prior work has used Ocelot for statistical workload/architecture modeling Application/machine metrics collected via static analysis, instrumentation, and emulation Principal Component Analysis (PCA) combines related metrics Cluster analysis identifies related sets of applications Regression models use metrics to predict application behavior

31 PTX Emulator Statistical Modeling Data flow correlated to Kernels launched, live registers Kernels imply barriers Data sharing across CTAs Applications that share data do so at all levels of the thread hierarchy

32 Architecture Simulation
- Ocelot PTX emulator has been used as an architecture simulator front-end
- Can explore impact of architectural changes, e.g., instruction fetch and memory scheduling policies
- MacSim Architecture Simulator (H. Kim, GT)
J. Lee, N. B. Lakshminarayana, H. Kim, R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications," MICRO 43, December 2010.

33 Ongoing Work

34 Ocelot Front Ends: Today and Future Plans
- Enterprise SW frontend from LogicBlox Inc. (in progress)
(Figure labels: Datalog, CUDA, OpenCL, Harmony Runtime, Correctness & Performance Tools, Ocelot: PTX Emulator, Instrumentation database)

35 AMD Backend
Goal: Explore CUDA as a cross-platform alternative to OpenCL
- CUDA as a generic programming model
- Understand the relationships between algorithms and architecture
- Explore execution and performance portability issues
- Architecture-specific code generation issues, VLIW+SIMD vs. Superscalar+SIMD
- Unstructured control flow
People: Rodrigo Dominguez (rdomingu@ece.neu.edu)

36 Software Warp-Formation People: Andrew Kerr

37 Harmony Runtime: Multiple Devices People: Gregory Diamos

38 Contributors
Ocelot Team: Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Si Li, Tri Pho, Haicheng Wu, Sudhakar Yalamanchili
Special Thanks: Nathan Bell, Sylvain Collange, Jared Hoberock, David Luebke, Ryuta Suzuki, Steve Worley, Bruno Coutinho

39 Questions? Contact us: Mailing list: Download Harmony and Ocelot:

40 PTX IR CPU Backend Memory Traversal PTX optimizations can change memory access patterns
