Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures
Yunsup Lee, UC Berkeley


1 Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures. Yunsup Lee, UC Berkeley

2 Why is Supporting Control Flow Challenging in Data-Parallel Architectures?

for (i=0; i<16; i++) {
  a[i] = op0;
  b[i] = op1;
  if (a[i] < b[i]) {
    c[i] = op2;
  } else {
    c[i] = op3;
  }
  d[i] = op4;
}

[Figure: 16 threads (Thread0-Thread15) issued in groups by a data-parallel issue unit.]

The divergence management architecture must not only partially sequence all execution paths for correctness, but also reconverge threads from different execution paths for efficiency.

3 For GPUs, Supporting Complex Control Flow in a SPMD Program is not Optional!

for (i=0; i<16; i++) {
  if (a[i] < b[i]) {
    if (f[i]) {
      c[i]->vfunc();
      goto SKIP;
    }
    d[i] = op;
  }
SKIP: ;
}

Kernel() {
  if (a[tid] < b[tid]) {
    if (f[tid]) {
      c[tid]->vfunc();
      goto SKIP;
    }
    d[tid] = op;
  }
SKIP: ;
}
Kernel<<<16>>>();

Traditional vector compilers can always give up and run complex control flow on the control processor. For the SPMD compiler, however, supporting complex control flow is a functional requirement rather than an optional performance optimization.

4 Design Space of Divergence Management

Software (explicitly scheduled by compiler): vector predication; predication + fork/join.
Hardware (implicitly managed by microarchitecture): divergence stack (compiler figures out reconvergence points); predication (used in limited cases).

We can perform a design space exploration of divergence management on NVIDIA GPU silicon.

5 Executive Summary, Contributions of Paper

Setup: 28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens), compiled two ways and run on an NVIDIA Tesla K20c (Kepler, GK110) while measuring performance and statistics:
- CUDA 6.5 production compiler: divergence stack + limited predication
- Modified CUDA 6.5 production compiler: full predication with new compiler algorithms

Contributions:
(1) Detailed explanation and categorization of hardware and software divergence management schemes
(2) SPMD predication compiler algorithms
(3) Apples-to-apples comparison using production silicon and compiler

Result: performance with predication is on par with performance with the divergence stack.

6 How do the hardware divergence stack and software predication handle control flow?

7 If-Then-Else Example: Divergence Stack

Compilation pipeline: CUDA Program -> LLVM Compiler -> PTX -> ptxas Backend Compiler -> SASS

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

PTX:
       a = op0
       b = op1
       p = slt a, b
       branch.eqz p, else
       c = op2
       j ipdom
else:  c = op3
ipdom: d = op4

The LLVM compiler takes a CUDA program and generates PTX, which encodes all data/control dependences. Note that all instructions on the right, despite being scalar instructions, are executed in a SIMD fashion.

8 If-Then-Else Example: Divergence Stack

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

SASS:
       a = op0
       b = op1
       p = slt a, b
       push ipdom
       branch.eqz p, else
       c = op2 .pop
else:  c = op3 .pop
ipdom: d = op4

The ptxas backend compiler takes PTX and generates SASS instructions, which execute natively on the GPU. Reconvergence points are analyzed and inserted by the backend compiler.

9 If-Then-Else Example: Divergence Stack

Assume 4 threads are executing, and threads 0 and 1 take the branch to else.

Mask  OP
1111  op0
1111  op1
1111  slt
1111  push ipdom
1111  branch.eqz
1100  op2 .pop
0011  op3 .pop
1111  op4

SASS:
       a = op0
       b = op1
       p = slt a, b
       push ipdom
       branch.eqz p, else
       c = op2 .pop
else:  c = op3 .pop
ipdom: d = op4

Divergence stack contents after the branch: <else, 0011> on top of <ipdom, 1111>.
Push: pushes <reconvergence pc, current mask> onto the stack.
Pop: discards the current <pc, mask>, pops the top of the stack, and executes the deferred <pc, mask>.
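The push/pop mechanics above can be sketched in a small scalar model. This is a hedged illustration of the stack discipline, not NVIDIA's actual microarchitecture; the 4-thread setup and instruction names follow the slide, while `run` and its concrete inputs are invented for the example:

```python
# Minimal divergence-stack model: 4 threads running the slide's if-then-else.
def run(a, b):
    n = len(a)
    mask = [True] * n          # current active mask
    stack = []                 # entries: (pc_label, mask)
    trace = []

    p = [a[i] < b[i] for i in range(n)]          # p = slt a, b
    stack.append(("ipdom", mask[:]))             # push ipdom (reconvergence token)
    then_mask = [mask[i] and p[i] for i in range(n)]
    else_mask = [mask[i] and not p[i] for i in range(n)]
    stack.append(("else", else_mask))            # defer the else path
    mask = then_mask
    trace.append(("op2", mask[:]))               # c = op2 under the then mask
    _, mask = stack.pop()                        # .pop -> run deferred else path
    trace.append(("op3", mask[:]))               # c = op3 under the else mask
    _, mask = stack.pop()                        # .pop -> reconverge at ipdom
    trace.append(("op4", mask[:]))               # d = op4, all threads active again
    return trace

# Threads 0 and 1 have a >= b, so they take the branch to else.
trace = run(a=[5, 5, 0, 0], b=[2, 2, 1, 1])
```

With these inputs, op2 runs under mask 1100 (threads 2 and 3), op3 under 0011 (threads 0 and 1), and op4 reconverges with all lanes active, matching the trace on the slide.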

10 If-Then-Else Example: Predication

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Predication (@f0 marks an instruction executed only by threads whose f0 predicate is set):
     a = op0
     b = op1
     f0 = slt a, b
@f0  c = op2
@!f0 c = op3
     d = op4

The compiler can schedule instructions freely. Predicates also encode reconvergence information.
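The predicated version above is straight-line code over a mask: both sides of the if always execute, and the guard predicate selects which lanes commit their writes. A hedged sketch (the function name and inputs are invented; the @f0/@!f0 guards are modeled with explicit mask arithmetic):

```python
# Predicated execution of the slide's if-then-else: no branches, no stack.
def predicated_if(a, b, op2, op3):
    n = len(a)
    c = [None] * n
    f0 = [a[i] < b[i] for i in range(n)]   # f0 = slt a, b
    for i in range(n):                     # @f0  c = op2
        if f0[i]:
            c[i] = op2
    for i in range(n):                     # @!f0 c = op3
        if not f0[i]:
            c[i] = op3
    d = [4] * n                            # d = op4: implicitly all-active
    return c, d

c, d = predicated_if([5, 5, 0, 0], [2, 2, 1, 1], op2=2, op3=3)
```

Reconvergence is implicit: after the two guarded writes, the next unguarded instruction (d = op4) naturally runs for all lanes.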

11 Uniform Branch Conditions Across All Threads

Divergence Stack:
       a = op0
       b = op1
       p = slt a, b
       push ipdom
       branch.eqz p, else
       c = op2 .pop
else:  c = op3 .pop
ipdom: d = op4

Predication:
     a = op0
     b = op1
     f0 = slt a, b
@f0  c = op2
@!f0 c = op3
     d = op4

What if the branch condition is uniform across all threads? With predication, the untaken side still executes its instructions with a null predicate, doing no useful work. With the divergence stack, branches don't push a token when the branch condition is uniform.

12 Runtime Branch-Uniformity Optimization with Consensual Branches

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Thread-Aware Predication:
       a = op0
       b = op1
       f0 = slt a, b
       cbranch.ifnull f0, else
@f0    c = op2
else:  cbranch.ifnull !f0, ipdom
@!f0   c = op3
ipdom: d = op4

We can optimize the predicated code with a consensual branch (cbranch), which is taken only when all threads consensually agree on the branch condition (ifnull). The code may then jump around unnecessary work.
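A consensual branch just tests the whole predicate vector at runtime. A hedged sketch of the idea (the helper names `ifnull` and `run_if` are invented for illustration; the guarded work is reduced to returning labels):

```python
# cbranch.ifnull f0, target: the branch is taken only if NO thread has f0 set.
def ifnull(f0):
    return not any(f0)

def run_if(f0, work_then, work_else):
    done = []
    if not ifnull(f0):                       # cbranch.ifnull f0, else
        done.append(work_then)               # @f0  c = op2 (still predicated)
    if not ifnull([not p for p in f0]):      # cbranch.ifnull !f0, ipdom
        done.append(work_else)               # @!f0 c = op3
    return done

# Uniform condition: all threads agree, so the else side is skipped entirely.
uniform = run_if([True, True, True, True], "op2", "op3")
# Divergent condition: both sides still run, under their guard predicates.
divergent = run_if([True, False, True, False], "op2", "op3")
```

The guards stay in place, so correctness never depends on the branch: the cbranch is purely a fast path around work whose predicate is null.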

13 Static Branch-Uniformity Optimization with Consensual Branches

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Thread-Aware Predication:
       a = op0
       b = op1
       f0 = slt a, b
       cbranch.ifnull f0, else
       c = op2
else:  cbranch.ifnull !f0, ipdom
       c = op3
ipdom: d = op4

If the compiler can prove that the branch condition is uniform across all threads, the compiler can omit the guard predicates.

14 Loop Example: Consensual Branches are Key to Compile Loops with Predication

Kernel() {
  done = false;
  while (!done) {
    a = op0;
    b = op1;
    done = a < b;
  }
  c = op2;
}
Kernel<<<n>>>();

Thread-Aware Predication:
       f0 = true
loop:  cbranch.ifnull f0, exit
@f0    a = op0
@f0    b = op1
@f0    f1 = slt a, b
       f0 = and f0, !f1
       j loop
exit:  c = op2

Intuitively, the compiler needs to sequence the loop until all threads are done executing it. A consensual branch (cbranch.ifnull) is used to check whether the loop mask (f0) is null.
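The loop scheme above can be sketched as mask iteration: keep executing the body under the loop mask f0 until a consensual check finds the mask null. A hedged model (the function name is invented, and the loop body is replaced by a per-thread iteration counter with per-thread trip counts):

```python
# while (!done) with per-thread exit: the loop mask f0 tracks threads that
# are still iterating; "cbranch.ifnull f0, exit" ends the loop for everyone.
def predicated_loop(trip_counts):
    n = len(trip_counts)
    f0 = [True] * n                 # f0 = true
    iters = [0] * n
    while any(f0):                  # cbranch.ifnull f0, exit
        for i in range(n):
            if f0[i]:               # @f0: body runs only for active threads
                iters[i] += 1
        f1 = [iters[i] >= trip_counts[i] for i in range(n)]  # f1 = done?
        f0 = [f0[i] and not f1[i] for i in range(n)]         # f0 = and f0, !f1
    return iters

iters = predicated_loop([1, 3, 2, 5])
```

The hardware sequences the loop for max(trip_counts) iterations, but each thread's updates commit only while its f0 bit is set, so every thread observes exactly its own trip count.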

15 Thread-Aware (TA) Predication Compiler Algorithms

16 Predication Compiler Algorithms

Pipeline: Generate CFG and CDG -> Walk the CDG to get guard predicates for all basic blocks -> Linearize control flow -> Predicate all instructions with their guard predicates and rewire all basic blocks.

Kernel() {
  N1; N2;
  if (!P1) {
    N3;
  } else {
    N4;
    if (!P2) {
      N5;
    } else {
      N6;
    }
    N7;
  }
  N8;
}
Kernel<<<n>>>();

[Figure: control flow graph (CFG) over blocks N1-N8 with edges labeled P1/!P1 and P2/!P2, and the corresponding control dependence graph (CDG). Example: the guard predicate of N5 is P1 && !P2.]
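The guard-predicate step can be sketched as a walk up the control dependence graph, AND-ing the branch conditions along the path. This is a simplified model of the slide's example (a real CDG records a set of control dependences per block; the single-parent dictionary and symbolic condition strings here are assumptions for illustration):

```python
# Control dependences for the slide's kernel: block -> (controlling block, condition).
# "!P1" means the block runs on the else-edge of the if (!P1) branch, etc.
CDG = {
    "N3": ("root", "!P1"),
    "N4": ("root", "P1"),
    "N5": ("N4", "!P2"),
    "N6": ("N4", "P2"),
    "N7": ("root", "P1"),
}

def guard(block):
    """Conjoin the conditions on the CDG path from the root to this block."""
    terms = []
    while block in CDG:
        parent, cond = CDG[block]
        terms.append(cond)
        block = parent
    return " && ".join(reversed(terms)) or "true"

g = guard("N5")
```

Walking N5 up through N4 yields P1 && !P2, matching the guard predicate shown on the slide; blocks like N1, N2, and N8 have no control dependence and get the always-true guard.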

17 Runtime Branch-Uniformity Optimization

Same pipeline: Generate CFG and CDG -> walk the CDG to get guard predicates for all basic blocks -> linearize control flow -> predicate all instructions and rewire all basic blocks.

Assume the compiler cannot prove that P1 is uniform across all threads. The compiler then inserts a consensual branch that skips the blocks guarded by P1 when P1 is null, and another that skips the blocks guarded by !P1 when !P1 is null.

[Figure: the same CFG and CDG as the previous slide, annotated with the two inserted consensual branches.]

18 Static Branch-Uniformity Optimization

Same pipeline: Generate CFG and CDG -> walk the CDG to get guard predicates -> linearize control flow -> predicate and rewire.

Assume the compiler can prove that P1 is uniform across all threads. The compiler can then replace the P1 guard with a consensual branch and omit the P1 term from the guard predicates; in this example, N5's guard predicate reduces to the P2 term alone.

[Figure: the same kernel, CFG, and CDG as the previous slides.]

19 Predication Compiler Algorithms: Loops

Kernel() {
  if (!P1) {
    while (!P2) {
      if (!P3) {
        break;
      }
    }
  }
}
Kernel<<<n>>>();

[Figure: CFG over blocks N1-N12 and the thread-aware control dependence graph (CDG), with a loop-mask node (L1) for the loop and exit-mask nodes (E1, E2) for the two loop exits: the while condition (P2) and the break (P3).]

20 Supporting Complex Control Flow

Function calls: supported by a straightforward calling convention.

Virtual functions:
With divergence stack:
      jalr r3
With predication:
loop: p0, r4 = find_unique p2, r3
      jalr r4          (under mask p0)
      p2 = p2 and !p0
      cbranch.ifany p2, loop

Irreducible control flow: find the smallest region containing the irreducible control flow, and insert sequencing code at the entry block and exit block to sequence active threads through the region one by one.
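The find_unique loop above can be sketched as serializing over the distinct function-pointer values held by active lanes. A hedged model (the function name `call_unique` and the string targets are invented; `find_unique` is modeled as picking the first not-yet-handled target):

```python
# Divergent virtual call: repeatedly pick one unique target among the active
# lanes, "call" it for exactly those lanes, then retire them from p2.
def call_unique(targets):
    n = len(targets)
    p2 = [True] * n                  # lanes still awaiting their call
    calls = []
    while any(p2):                   # cbranch.ifany p2, loop
        # p0, r4 = find_unique p2, r3: one target value and its matching lanes
        r4 = next(targets[i] for i in range(n) if p2[i])
        p0 = [p2[i] and targets[i] == r4 for i in range(n)]
        calls.append((r4, p0))       # jalr r4, executed under mask p0
        p2 = [p2[i] and not p0[i] for i in range(n)]  # p2 = p2 and !p0
    return calls

calls = call_unique(["fA", "fB", "fA", "fC"])
```

Lanes sharing a target are called together, so the loop iterates once per distinct target rather than once per lane.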

21 Evaluation

22 Predication CUDA Compiler

CUDA Program -> LLVM Compiler (predication passes; generates a throw-away pseudo PTX instruction with predication information) -> Annotated PTX -> ptxas Backend Compiler (runs existing optimizations, then predicates with information retrieved from the pseudo PTX instructions) -> SASS

Theoretically we could have implemented our thread-aware predication pass in ptxas; we implemented the bulk of it in LLVM for fast prototyping.

23 Evaluation: Quick Recap

28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens), compiled with the CUDA 6.5 production compiler (divergence stack, limited predication) and with our modified CUDA 6.5 production compiler (full predication with TA compiler algorithms), run on an NVIDIA Tesla K20c (Kepler, GK110).

Five bars per benchmark:
1) Baseline (divergence stack)
2) Limited predication
3) TA predication
4) TA+SBU (static branch-uniformity optimization)
5) TA+SBU+RBU (runtime branch-uniformity optimization)

Compare performance and statistics.

24 Performance Results: Geomean

[Figure: speedup over the divergence-stack baseline for each of the 28 benchmarks (r-pathfinder, p-cutcp, p-sgemm, p-sad, p-mri-q, p-stencil, r-nn, p-lbm, fft, r-backprop, nqueens, r-b+tree, r-hotspot, r-srad-v2, p-gridding, p-tpacf, p-spmv, r-gaussian, p-histo, p-bfs, r-srad-v1, r-scluster, r-lud, r-bfs) and the geomean, with bars for Divergence Stack+Limited Predication, TA, TA+SBU, and TA+SBU+RBU.]

The thread-aware predication compiler is competitive with the baseline compiler (divergence stack). Both static and runtime branch-uniformity optimizations play an important role. Performance doesn't change with the limited if-conversion heuristic implemented in the production compiler.

25 Performance Results: Speedups

[Figure: the same per-benchmark speedup chart, highlighting the benchmarks that speed up under predication.]

Predication can expose better scheduling opportunities. Extra consensual branches added for +RBU may act as scheduling barriers.

26 Performance Results: Slowdowns

[Figure: the same per-benchmark speedup chart, highlighting the benchmarks that slow down under predication.]

Ten benchmarks are inconclusive (>90%, <100% of baseline performance).

27 Performance Results: Slowdowns

[Figure: the same per-benchmark speedup chart, highlighting the benchmarks that slow down under predication.]

Ten benchmarks are inconclusive (>90%, <100% of baseline performance). Five benchmarks fall in the <90% range: increased register pressure can reduce occupancy, which sometimes reduces performance, and the compiler is not able to optimize for all branch uniformity exhibited at runtime.

28 Discussion on Area, Power, Energy

It is hard to quantify the impact of predication on area, power, and energy, since the experiment was done on GPU silicon. The primary motivation of software divergence management is to reduce hardware design complexity and associated verification costs; the area and power overheads of the divergence stack itself are not significant. Because performance is comparable, the power/energy consumption of software divergence management should also be comparable to that of hardware divergence management.

29 Fundamental Advantages of Software Divergence Management

[Figure: short-circuit control flow graph over blocks N1-N7.]

In this short-circuit example, the divergence stack cannot reconverge threads at N4, but predication can reconverge threads at N4.

30 Conclusions

Advantages of the divergence stack:
- Enables a fairly conventional thread compilation model
- Makes register allocation easier
- Simplifies the task of supporting irreducible control flow

Advantages of predication:
- Simplifies the hardware without sacrificing programmability
- Actual cases where predication can outperform the divergence stack: better scheduling opportunities and better reconvergence of threads

For divergence management, pushing complexity to the compiler is a better choice.

This work was funded by DARPA award HR, the Center for Future Architecture Research (a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA), an NVIDIA graduate fellowship, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. It was also funded by DOE contract B. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.

Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures
Yunsup Lee, Vinod Grover, Ronny Krashinsky, Mark Stephenson, Stephen W. Keckler, Krste Asanović
University of California, Berkeley and NVIDIA


More information

Function Call Re-vectorization. Authors: Rubens E. A. Moreira, Sylvain Collange, Fernando M. Q. Pereira

Function Call Re-vectorization. Authors: Rubens E. A. Moreira, Sylvain Collange, Fernando M. Q. Pereira Function Call Re-vectorization Authors: Rubens E. A. Moreira, Sylvain Collange, Fernando M. Q. Pereira Overview Our goal is to increase the programmability of languages that target SIMD-like machines,

More information

D5.5.3 Design and implementation of the SIMD-MIMD GPU architecture

D5.5.3 Design and implementation of the SIMD-MIMD GPU architecture D5.5.3(v.1.0) D5.5.3 Design and implementation of the SIMD-MIMD GPU architecture Document Information Contract Number 288653 Project Website lpgpu.org Contractual Deadline 31-08-2013 Nature Report Author

More information

Locality-Centric Thread Scheduling for Bulk-synchronous Programming Models on CPU Architectures

Locality-Centric Thread Scheduling for Bulk-synchronous Programming Models on CPU Architectures Locality-Centric Thread Scheduling for Bulk-synchronous Programming Models on CPU Architectures Hee-Seok Kim 1, Izzat El Hajj 1, John Stratton 2, Steven Lumetta 1 and Wen-Mei Hwu 1 1 University of Illinois

More information

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline

More information

Occupancy-based compilation

Occupancy-based compilation Occupancy-based compilation Advanced Course on Compilers Spring 2015 (III-V): Lecture 10 Vesa Hirvisalo ESG/CSE/Aalto Today Threads and occupancy GPUs as the example SIMT execution warp (thread-group)

More information

By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,

More information

Control Flow Analysis

Control Flow Analysis Control Flow Analysis Last time Undergraduate compilers in a day Today Assignment 0 due Control-flow analysis Building basic blocks Building control-flow graphs Loops January 28, 2015 Control Flow Analysis

More information

CS377P Programming for Performance GPU Programming - I

CS377P Programming for Performance GPU Programming - I CS377P Programming for Performance GPU Programming - I Sreepathi Pai UTCS November 9, 2015 Outline 1 Introduction to CUDA 2 Basic Performance 3 Memory Performance Outline 1 Introduction to CUDA 2 Basic

More information

Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo

Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced

More information

Diego Caballero and Vectorizer Team, Intel Corporation. April 16 th, 2018 Euro LLVM Developers Meeting. Bristol, UK.

Diego Caballero and Vectorizer Team, Intel Corporation. April 16 th, 2018 Euro LLVM Developers Meeting. Bristol, UK. Diego Caballero and Vectorizer Team, Intel Corporation. April 16 th, 2018 Euro LLVM Developers Meeting. Bristol, UK. Legal Disclaimer & Software and workloads used in performance tests may have been optimized

More information

Fast and Efficient Automatic Memory Management for GPUs using Compiler- Assisted Runtime Coherence Scheme

Fast and Efficient Automatic Memory Management for GPUs using Compiler- Assisted Runtime Coherence Scheme Fast and Efficient Automatic Memory Management for GPUs using Compiler- Assisted Runtime Coherence Scheme Sreepathi Pai R. Govindarajan Matthew J. Thazhuthaveetil Supercomputer Education and Research Centre,

More information

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview

More information

Multi2sim Kepler: A Detailed Architectural GPU Simulator

Multi2sim Kepler: A Detailed Architectural GPU Simulator Multi2sim Kepler: A Detailed Architectural GPU Simulator Xun Gong, Rafael Ubal, David Kaeli Northeastern University Computer Architecture Research Lab Department of Electrical and Computer Engineering

More information

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Josh Romero, Massimiliano Fatica - NVIDIA Vamsi Spandan, Roberto Verzicco -

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

Apple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple

Apple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1 Agenda How Apple uses LLVM to build a GPU Compiler Factors that affect GPU performance The Apple GPU compiler

More information

Programmable Graphics Hardware (GPU) A Primer

Programmable Graphics Hardware (GPU) A Primer Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism

More information

CSE 401/M501 Compilers

CSE 401/M501 Compilers CSE 401/M501 Compilers Intermediate Representations Hal Perkins Autumn 2018 UW CSE 401/M501 Autumn 2018 G-1 Agenda Survey of Intermediate Representations Graphical Concrete/Abstract Syntax Trees (ASTs)

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Parallel Compact Roadmap Construction of 3D Virtual Environments on the GPU

Parallel Compact Roadmap Construction of 3D Virtual Environments on the GPU Parallel Compact Roadmap Construction of 3D Virtual Environments on the GPU Avi Bleiweiss NVIDIA Corporation Programmability GPU Computing CUDA C++ Parallel Debug Heterogeneous Computing Productivity Efficiency

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

arxiv: v5 [cs.ar] 12 Feb 2017

arxiv: v5 [cs.ar] 12 Feb 2017 A Scratchpad Sharing in GPUs Vishwesh Jatala, Indian Institute of Technology, Kanpur Jayvant Anantpur, Indian Institute of Science, Bangalore Amey Karkare, Indian Institute of Technology, Kanpur arxiv:1607.03238v5

More information

Designing a Domain-specific Language to Simulate Particles. dan bailey

Designing a Domain-specific Language to Simulate Particles. dan bailey Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver

More information

Towards a Performance- Portable FFT Library for Heterogeneous Computing

Towards a Performance- Portable FFT Library for Heterogeneous Computing Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon

More information

Implementing Control Flow Constructs Comp 412

Implementing Control Flow Constructs Comp 412 COMP 412 FALL 2018 Implementing Control Flow Constructs Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students

More information

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1 Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later

More information

EC 413 Computer Organization

EC 413 Computer Organization EC 413 Computer Organization Review I Prof. Michel A. Kinsy Computing: The Art of Abstraction Application Algorithm Programming Language Operating System/Virtual Machine Instruction Set Architecture (ISA)

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Efficient JIT to 32-bit Arches

Efficient JIT to 32-bit Arches Efficient JIT to 32-bit Arches Jiong Wang Linux Plumbers Conference Vancouver, Nov, 2018 1 Background ISA specification and impact on JIT compiler Default code-gen use 64-bit register, ALU64, JMP64 test_l4lb_noinline.c

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

CS553 Lecture Dynamic Optimizations 2

CS553 Lecture Dynamic Optimizations 2 Dynamic Optimizations Last time Predication and speculation Today Dynamic compilation CS553 Lecture Dynamic Optimizations 2 Motivation Limitations of static analysis Programs can have values and invariants

More information

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU GaaS Workload Characterization under NUMA Architecture for Virtualized GPU Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li Presented by Huixiang Chen ISPASS 2017 April 24, 2017, Santa Rosa, California

More information

AutoMatch: An Automated Framework for Relative Performance Estimation and Workload Distribution on Heterogeneous HPC Systems

AutoMatch: An Automated Framework for Relative Performance Estimation and Workload Distribution on Heterogeneous HPC Systems AutoMatch: An Automated Framework for Relative Performance Estimation and Workload Distribution on Heterogeneous HPC Systems Ahmed E. Helal, Wu-chun Feng, Changhee Jung, and Yasser Y. Hanafy Electrical

More information

Trace Compilation. Christian Wimmer September 2009

Trace Compilation. Christian Wimmer  September 2009 Trace Compilation Christian Wimmer cwimmer@uci.edu www.christianwimmer.at September 2009 Department of Computer Science University of California, Irvine Background Institute for System Software Johannes

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

An Automated Framework for Characterizing and Subsetting GPGPU Workloads

An Automated Framework for Characterizing and Subsetting GPGPU Workloads An Automated Framework for Characterizing and Subsetting GPGPU Workloads Vignesh Adhinarayanan and Wu-chun Feng Department of Computer Science, Virginia Tech Blacksburg, VA 24061, U.S.A. {avignesh, wfeng}@vt.edu

More information

Truffle A language implementation framework

Truffle A language implementation framework Truffle A language implementation framework Boris Spasojević Senior Researcher VM Research Group, Oracle Labs Slides based on previous talks given by Christian Wimmer, Christian Humer and Matthias Grimmer.

More information

CS 432 Fall Mike Lam, Professor. Code Generation

CS 432 Fall Mike Lam, Professor. Code Generation CS 432 Fall 2015 Mike Lam, Professor Code Generation Compilers "Back end" Source code Tokens Syntax tree Machine code char data[20]; int main() { float x = 42.0; return 7; } 7f 45 4c 46 01 01 01 00 00

More information