Yunsup Lee UC Berkeley 1
2 Why is Supporting Control Flow Challenging in Data-Parallel Architectures?

for (i=0; i<16; i++) {
  a[i] = op0;
  b[i] = op1;
  if (a[i] < b[i]) {
    c[i] = op2;
  } else {
    c[i] = op3;
  }
  d[i] = op4;
}

[Figure: a data-parallel issue unit driving 16 threads (Thread0-Thread15), which take different paths at the branch.]

The divergence management architecture must not only partially sequence all execution paths for correctness, but also reconverge threads from different execution paths for efficiency.
3 For GPUs, Supporting Complex Control Flow in a SPMD Program is not Optional!

Vectorized loop:

for (i=0; i<16; i++) {
  if (a[i] < b[i]) {
    if (f[i]) {
      c[i]->vfunc();
      goto SKIP;
    }
    d[i] = op;
  }
SKIP: ;
}

SPMD kernel:

Kernel() {
  if (a[tid] < b[tid]) {
    if (f[tid]) {
      c[tid]->vfunc();
      goto SKIP;
    }
    d[tid] = op;
  }
SKIP: ;
}
Kernel<<<16>>>();

Traditional vector compilers can always give up and run complex control flow on the control processor. For the SPMD compiler, however, supporting complex control flow is a functional requirement rather than an optional performance optimization.
4 Design Space of Divergence Management

Software (explicitly scheduled by the compiler):
- Predication (used in limited cases)
- Predication + fork/join
- Vector predication

Hardware (implicitly managed by the microarchitecture):
- Divergence stack (compiler figures out reconvergence points)

We can perform a design space exploration of divergence management on NVIDIA GPU silicon.
5 Executive Summary, Contributions of Paper

Setup: 28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens), compiled two ways and run on an NVIDIA Tesla K20c (Kepler, GK110) to collect performance and statistics:
- CUDA 6.5 production compiler: divergence stack + limited predication
- Modified CUDA 6.5 production compiler: full predication with new compiler algorithms

Contributions:
(1) Detailed explanation and categorization of hardware and software divergence management schemes
(2) SPMD predication compiler algorithms
(3) Apples-to-apples comparison using production silicon and compiler

Result: performance with predication is on par with performance with the divergence stack.
6 How do the hardware divergence stack and software predication handle control flow?
7 If-Then-Else Example: Divergence Stack

Compilation flow: CUDA Program -> LLVM Compiler -> PTX -> ptxas Backend Compiler -> SASS

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

PTX:

  a = op0
  b = op1
  p = slt a,b
  branch.eqz p, else
  c = op2
  j ipdom
else:
  c = op3
ipdom:
  d = op4

The LLVM compiler takes a CUDA program and generates PTX, which encodes all data/control dependences. Note that all the instructions above, despite being scalar instructions, are executed in a SIMD fashion.
8 If-Then-Else Example: Divergence Stack

SASS:

  a = op0
  b = op1
  p = slt a,b
  push ipdom
  branch.eqz p, else
  c = op2.pop
else:
  c = op3.pop
ipdom:
  d = op4

The ptxas backend compiler takes PTX and generates SASS instructions, which execute natively on the GPU. Reconvergence points are analyzed and inserted by the backend compiler.
9 If-Then-Else Example: Divergence Stack

Assume 4 threads are executing, and that threads 0 and 1 took the branch.

  Mask  Op
  1111  op0
  1111  op1
  1111  slt
  1111  push
  1111  branch
  1100  op2.pop
  0011  op3.pop
  1111  op4

Divergence stack after the branch (top of stack first):
  <PC = else,  MASK = 0011>
  <PC = ipdom, MASK = 1111>

Push: pushes <reconverge pc, current mask> onto the stack.
Pop: disregards the current <pc, mask>, pops the top of the stack, and executes the deferred <pc, mask>.
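The push/pop semantics above can be sketched as a small simulation. This is an illustrative model of the mechanism described on the slide, not actual GPU hardware; the labels and the `run_if_then_else` helper are invented for the example.

```python
# Minimal simulation of a SIMT divergence stack for the if-then-else
# example: 4 threads, threads 0 and 1 take the branch (a < b).
def run_if_then_else(take_branch):
    """take_branch[i] is True if thread i satisfies a < b."""
    n = len(take_branch)
    trace = []                          # (op, active mask) per executed block
    stack = []                          # entries: (pc_label, mask)
    mask = [True] * n                   # all threads start active

    # push ipdom: push <reconverge pc, current mask>
    stack.append(("ipdom", list(mask)))

    # branch.eqz p, else: split the warp, defer the else-path
    then_mask = [m and t for m, t in zip(mask, take_branch)]
    else_mask = [m and not t for m, t in zip(mask, take_branch)]
    stack.append(("else", else_mask))

    mask = then_mask
    trace.append(("op2", list(mask)))   # c = op2 (.pop)
    _label, mask = stack.pop()          # pop -> deferred <else, mask>
    trace.append(("op3", list(mask)))   # c = op3 (.pop)
    _label, mask = stack.pop()          # pop -> <ipdom, full mask>
    trace.append(("op4", list(mask)))   # d = op4: threads reconverged
    return trace

trace = run_if_then_else([True, True, False, False])
for op, mask in trace:
    print(op, "".join("1" if m else "0" for m in mask))
```

Running this reproduces the mask column of the slide: op2 under 1100, op3 under 0011, and op4 under the reconverged 1111.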
10 If-Then-Else Example: Predication

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Predicated code:

  a = op0
  b = op1
  f0 = slt a,b
  @f0  c = op2
  @!f0 c = op3
  d = op4

The compiler can schedule instructions. Predicates also encode reconvergence information.
11 Uniform Branch Conditions Across All Threads

Divergence stack:

  a = op0
  b = op1
  p = slt a,b
  push ipdom
  branch.eqz p, else
  c = op2.pop
else:
  c = op3.pop
ipdom:
  d = op4

Predication:

  a = op0
  b = op1
  f0 = slt a,b
  @f0  c = op2
  @!f0 c = op3
  d = op4

What if the branch condition is uniform across all threads? With predication, instructions still execute, just with a null predicate. With the divergence stack, branches don't push a token onto the stack when the branch condition is uniform.
12 Runtime Branch-Uniformity Optimization with Consensual Branches

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Thread-aware predication:

  a = op0
  b = op1
  f0 = slt a,b
  cbranch.ifnull f0, else
  @f0  c = op2
else:
  cbranch.ifnull !f0, ipdom
  @!f0 c = op3
ipdom:
  d = op4

We can optimize the predicated code with a consensual branch (cbranch), which is taken only when all threads consensually agree on the branch condition (ifnull). The code may jump around unnecessary work.
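The interaction between guard predicates and consensual branches can be sketched as a simulation. This is an illustrative model under assumed semantics (the `run_predicated` helper and the "op2"/"op3" result strings are invented for the example), not the compiler's actual output.

```python
# Sketch of thread-aware predicated execution with a runtime
# branch-uniformity check (the cbranch.ifnull consensual branch).
def run_predicated(a, b):
    n = len(a)
    c = [None] * n
    executed = []                        # blocks the warp actually issued

    f0 = [x < y for x, y in zip(a, b)]   # f0 = slt a, b

    # cbranch.ifnull f0, else: skip the then-block if no thread needs it
    if any(f0):
        executed.append("then")
        for i in range(n):
            if f0[i]:                    # @f0 c = op2
                c[i] = "op2"

    # cbranch.ifnull !f0, ipdom: skip the else-block if f0 is all-true
    if any(not p for p in f0):
        executed.append("else")
        for i in range(n):
            if not f0[i]:                # @!f0 c = op3
                c[i] = "op3"

    executed.append("ipdom")             # d = op4, unpredicated
    return c, executed

# Divergent warp: both sides issue.
print(run_predicated([1, 5], [3, 2]))
# Uniform warp: the else-block is consensually skipped.
print(run_predicated([1, 1], [3, 3]))
```

With a divergent warp both blocks issue; with a uniform branch condition the consensual branch jumps around the dead side entirely, which is exactly the work the optimization saves.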
13 Static Branch-Uniformity Optimization with Consensual Branches

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Thread-aware predication:

  a = op0
  b = op1
  f0 = slt a,b
  cbranch.ifnull f0, else
  c = op2
else:
  cbranch.ifnull !f0, ipdom
  c = op3
ipdom:
  d = op4

If the compiler can prove that the branch condition is uniform across all threads, the compiler can omit the guard predicates.
14 Loop Example: Consensual Branches are Key to Compiling Loops with Predication

Kernel() {
  done = false;
  while (!done) {
    a = op0;
    b = op1;
    done = a < b;
  }
  c = op2;
}
Kernel<<<n>>>();

Thread-aware predication:

  f0 = true
loop:
  cbranch.ifnull f0, exit
  @f0 a = op0
  @f0 b = op1
  @f0 f1 = slt a, b
  f0 = and f0, !f1
  j loop
exit:
  c = op2

Intuitively, the compiler needs to sequence the loop until all threads are done executing it. A consensual branch (cbranch.ifnull) checks whether the loop mask (f0) is null.
15 Thread-Aware (TA) Predication Compiler Algorithms
16 Predication Compiler Algorithms

Steps:
(1) Generate CFG, CDG
(2) Walk CDG to get guard predicates for all BBs
(3) Linearize control flow
(4) Predicate all instructions with guard predicates and rewire all BBs

Kernel() {
  N1; N2;
  if (!P1) {
    N3;
  } else {
    N4;
    if (!P2) {
      N5;
    } else {
      N6;
    }
    N7;
  }
  N8;
}
Kernel<<<n>>>();

[Figure: the control flow graph (CFG) and control dependence graph (CDG) for the kernel. Example: the guard predicate of N5 is P1 && !P2.]
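Step (2) can be sketched for this example. The control-dependence terms below are hard-coded from the slide's figure; a real compiler derives them from the CFG via post-dominance analysis, and the `guard` helper and string representation are invented for the illustration.

```python
# Sketch of "walk the CDG to get guard predicates": each block's
# guard is the conjunction of (predicate, polarity) terms on its
# control-dependence path. Terms are hard-coded from the figure.
CDG_TERMS = {
    "N1": [], "N2": [], "N8": [],        # control-independent blocks
    "N3": [("P1", False)],               # runs when !P1
    "N4": [("P1", True)],                # else-side of the !P1 test
    "N7": [("P1", True)],
    "N5": [("P1", True), ("P2", False)], # nested: P1 && !P2
    "N6": [("P1", True), ("P2", True)],  # nested: P1 && P2
}

def guard(block):
    terms = CDG_TERMS[block]
    if not terms:
        return "true"                    # executes under the full mask
    return " && ".join(p if pos else "!" + p for p, pos in terms)

print(guard("N5"))   # matches the slide: P1 && !P2
```

With the guards in hand, linearization emits the blocks in a single sequence and predicates each block's instructions on its guard.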
17 Runtime Branch-Uniformity Optimization

Same kernel and compilation steps as the previous slide. Assume the compiler cannot prove that P1 is uniform across all threads.

[Figure: the same CFG and CDG. During linearization, the compiler adds a consensual branch that skips one region when P1 is null, and another that skips the opposite region when !P1 is null.]
18 Static Branch-Uniformity Optimization

Same kernel and compilation steps. Assume the compiler can prove that P1 is uniform across all threads.

[Figure: the same CFG and CDG. With P1 proven uniform, it can be dropped from the guard conjunctions, leaving a guard predicate of just P2.]
19 Predication Compiler Algorithms: Loops

Kernel() {
  if (!P1) {
    while (!P2) {
      if (!P3) {
        break;
      }
    }
  }
}
Kernel<<<n>>>();

[Figure: the CFG (blocks N1-N12) and the thread-aware control dependence graph, with a loop node L1 and exit nodes E1 and E2. A loop mask governs the loop region, and two exit masks track threads that leave via the P2 loop test and via the P3 break.]
20 Supporting Complex Control Flow

Function calls: supported by a straightforward calling convention.

Virtual functions:

With divergence stack:
  jalr r3

With predication:
loop:
  p0, r4 = find_unique p2, r3
  @p0 jalr r4
  p2 = p2 and !p0
  cbranch.ifany p2, loop

Irreducible control flow: find the smallest region containing the irreducible control flow, and insert sequencing code at the entry block and exit block to sequence active threads through the region one by one.
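The find_unique loop above can be sketched as a simulation: with active mask p2, each iteration picks one call target held by some active thread, issues the call for every active thread sharing that target, and repeats until p2 is null. This is an illustrative model of the slide's sequence; the `call_virtual` helper and target names are invented.

```python
# Sketch of the predicated virtual-call sequence: serialize the warp
# over the set of unique call targets among active threads.
def call_virtual(active, targets):
    p2 = list(active)                    # active mask
    calls = []                           # (target, mask) per jalr issued
    while any(p2):                       # cbranch.ifany p2, loop
        # p0, r4 = find_unique p2, r3: pick one target held by an
        # active thread; p0 marks all active threads sharing it
        r4 = next(t for i, t in enumerate(targets) if p2[i])
        p0 = [p2[i] and targets[i] == r4 for i in range(len(p2))]
        calls.append((r4, p0))           # @p0 jalr r4
        # p2 = p2 and !p0: retire the threads that just called
        p2 = [p2[i] and not p0[i] for i in range(len(p2))]
    return calls

calls = call_virtual([True] * 4, ["fA", "fB", "fA", "fC"])
# Three calls issue: fA for threads {0,2}, fB for {1}, fC for {3}.
```

The key property is that the loop issues one call per unique target rather than one per thread, so threads sharing a target still execute the callee together.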
21 Evaluation
22 Predication CUDA Compiler

Pipeline: CUDA Program -> LLVM Compiler -> Annotated PTX -> ptxas Backend Compiler -> SASS

LLVM compiler: runs the predication passes and generates a throw-away pseudo PTX instruction carrying the predication information.
ptxas backend compiler: runs its existing optimizations, then predicates using the information retrieved from the pseudo PTX instructions.

Theoretically we could have implemented our thread-aware predication pass in ptxas; we implemented the bulk of it in LLVM for fast prototyping.
23 Evaluation: Quick Recap

28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens), run on an NVIDIA Tesla K20c (Kepler, GK110). Two compilers produce the CUDA binaries: the CUDA 6.5 production compiler (divergence stack + limited predication) and the modified CUDA 6.5 production compiler (full predication with TA compiler algorithms). We compare performance and statistics across five bars:
1) Baseline (divergence stack)
2) Limited predication
3) TA predication
4) TA+SBU (static branch-uniformity optimization)
5) TA+SBU+RBU (runtime branch-uniformity optimization)
24 Performance Results: Geomean

[Chart: speedup over the divergence-stack baseline for each benchmark (Parboil "p-", Rodinia "r-", fft, nqueens) and the geomean, with bars for Divergence Stack+Limited Predication, TA, TA+SBU, and TA+SBU+RBU.]

- The thread-aware predication compiler is competitive with the baseline compiler (divergence stack).
- Both the static and runtime branch-uniformity optimizations play an important role.
- Performance doesn't change with the limited if-conversion heuristic implemented in the production compiler.
25 Performance Results: Speedups

[Chart: same configurations as the previous slide, highlighting benchmarks that speed up.]

- Predication can expose better scheduling opportunities.
- Extra consensual branches added for +RBU may act as scheduling barriers.
26 Performance Results: Slowdowns

[Chart: same configurations, highlighting benchmarks that slow down.]

- Ten benchmarks are inconclusive (>90%, <100%).
27 Performance Results: Slowdowns

[Chart: same configurations as the previous slide.]

- Ten benchmarks are inconclusive (>90%, <100%).
- Five benchmarks fall in the <90% range.
- An increase in register pressure can reduce occupancy, which sometimes reduces performance.
- The compiler is not able to optimize for all branch uniformity exhibited at runtime.
28 Discussion on Area, Power, Energy

It is hard to quantify the impact of predication on area, power, and energy, since the experiment was done on GPU silicon. The primary motivation of software divergence management is to reduce hardware design complexity and the associated verification costs; the area and power overheads of the divergence stack itself are not significant. Since performance tracks power/energy consumption, the power/energy consumption of software divergence management should be comparable to that of hardware divergence management.
29 Fundamental Advantages of Software Divergence Management

[Figure: control flow graph (blocks N1-N7) of a short-circuit example.]

In the short-circuit example, the divergence stack cannot reconverge threads at N4, while predication can reconverge threads at N4.
30 Conclusions

Advantages of the divergence stack:
- Enables a fairly conventional thread compilation model
- Makes register allocation easier
- Simplifies the task of supporting irreducible control flow

Advantages of predication:
- Simplifies the hardware without sacrificing programmability
- Actual cases where predication can outperform the divergence stack: better scheduling opportunities and better reconvergence of threads

For divergence management, pushing complexity to the compiler is a better choice.

Acknowledgments: This work was funded by DARPA award HR , the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, an NVIDIA graduate fellowship, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. It was also funded by DOE contract B . Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.
Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures
Yunsup Lee, Vinod Grover, Ronny Krashinsky, Mark Stephenson, Stephen W. Keckler, Krste Asanović
University of California,
Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Josh Romero, Massimiliano Fatica - NVIDIA Vamsi Spandan, Roberto Verzicco -
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More informationApple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple
Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1 Agenda How Apple uses LLVM to build a GPU Compiler Factors that affect GPU performance The Apple GPU compiler
More informationProgrammable Graphics Hardware (GPU) A Primer
Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism
More informationCSE 401/M501 Compilers
CSE 401/M501 Compilers Intermediate Representations Hal Perkins Autumn 2018 UW CSE 401/M501 Autumn 2018 G-1 Agenda Survey of Intermediate Representations Graphical Concrete/Abstract Syntax Trees (ASTs)
More informationPosition Paper: OpenMP scheduling on ARM big.little architecture
Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM
More informationParallel Compact Roadmap Construction of 3D Virtual Environments on the GPU
Parallel Compact Roadmap Construction of 3D Virtual Environments on the GPU Avi Bleiweiss NVIDIA Corporation Programmability GPU Computing CUDA C++ Parallel Debug Heterogeneous Computing Productivity Efficiency
More informationImproving Performance of Machine Learning Workloads
Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationarxiv: v5 [cs.ar] 12 Feb 2017
A Scratchpad Sharing in GPUs Vishwesh Jatala, Indian Institute of Technology, Kanpur Jayvant Anantpur, Indian Institute of Science, Bangalore Amey Karkare, Indian Institute of Technology, Kanpur arxiv:1607.03238v5
More informationDesigning a Domain-specific Language to Simulate Particles. dan bailey
Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver
More informationTowards a Performance- Portable FFT Library for Heterogeneous Computing
Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon
More informationImplementing Control Flow Constructs Comp 412
COMP 412 FALL 2018 Implementing Control Flow Constructs Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students
More informationDesign of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1
Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later
More informationEC 413 Computer Organization
EC 413 Computer Organization Review I Prof. Michel A. Kinsy Computing: The Art of Abstraction Application Algorithm Programming Language Operating System/Virtual Machine Instruction Set Architecture (ISA)
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit Arches Jiong Wang Linux Plumbers Conference Vancouver, Nov, 2018 1 Background ISA specification and impact on JIT compiler Default code-gen use 64-bit register, ALU64, JMP64 test_l4lb_noinline.c
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationCS553 Lecture Dynamic Optimizations 2
Dynamic Optimizations Last time Predication and speculation Today Dynamic compilation CS553 Lecture Dynamic Optimizations 2 Motivation Limitations of static analysis Programs can have values and invariants
More informationGaaS Workload Characterization under NUMA Architecture for Virtualized GPU
GaaS Workload Characterization under NUMA Architecture for Virtualized GPU Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li Presented by Huixiang Chen ISPASS 2017 April 24, 2017, Santa Rosa, California
More informationAutoMatch: An Automated Framework for Relative Performance Estimation and Workload Distribution on Heterogeneous HPC Systems
AutoMatch: An Automated Framework for Relative Performance Estimation and Workload Distribution on Heterogeneous HPC Systems Ahmed E. Helal, Wu-chun Feng, Changhee Jung, and Yasser Y. Hanafy Electrical
More informationTrace Compilation. Christian Wimmer September 2009
Trace Compilation Christian Wimmer cwimmer@uci.edu www.christianwimmer.at September 2009 Department of Computer Science University of California, Irvine Background Institute for System Software Johannes
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationAn Automated Framework for Characterizing and Subsetting GPGPU Workloads
An Automated Framework for Characterizing and Subsetting GPGPU Workloads Vignesh Adhinarayanan and Wu-chun Feng Department of Computer Science, Virginia Tech Blacksburg, VA 24061, U.S.A. {avignesh, wfeng}@vt.edu
More informationTruffle A language implementation framework
Truffle A language implementation framework Boris Spasojević Senior Researcher VM Research Group, Oracle Labs Slides based on previous talks given by Christian Wimmer, Christian Humer and Matthias Grimmer.
More informationCS 432 Fall Mike Lam, Professor. Code Generation
CS 432 Fall 2015 Mike Lam, Professor Code Generation Compilers "Back end" Source code Tokens Syntax tree Machine code char data[20]; int main() { float x = 42.0; return 7; } 7f 45 4c 46 01 01 01 00 00
More information