CUB: collective software primitives. Duane Merrill, NVIDIA Research
1 CUB collective software primitives Duane Merrill NVIDIA Research
2 What is CUB? 1. A design model for collective primitives: how to make reusable SIMT software constructs. 2. A library of collective primitives: block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc. 3. A library of global primitives: device-reduce, device-sort, device-scan, etc., constructed from the collective primitives; these demonstrate performance and performance-portability.
3 Software reuse
4 Software reuse. Abstraction & composability are fundamental. Reducing redundant programmer effort saves time, energy, and money, and reduces buggy software. Encapsulating complexity empowers ordinary programmers and insulates applications from the underlying hardware. Software reuse empowers a durable programming model.
6 Collective primitives
7 Parallel programming is hard
8 Cooperative parallel programming is hard: parallel decomposition and grain sizing; synchronization; deadlock, livelock, and data races; plurality of state; plurality of flow control (divergence, etc.); bookkeeping control structures; memory access conflicts, coalescing, etc.; occupancy constraints from SMEM, RF, etc.; algorithm selection and instruction scheduling; special hardware functionality, instructions, etc.
10 CUDA today. (Diagram: application thread → CUDA stub → threadblocks.)
11 CUDA today. Collective primitives are the missing layer in today's CUDA software stack. (Diagram: application thread → CUDA stub → threadblocks, each running a BlockSort collective.)
12 What do these have in common? Parallel sparse graph traversal, parallel radix sort, parallel SpMV, parallel BWT compression.
13 What do these have in common? Each is built on block-wide prefix-scan: queue management (parallel sparse graph traversal), partitioning (parallel radix sort), segmented reduction (parallel SpMV), and recurrence solving (parallel BWT compression).
14 Examples of parallel scan data flow, 16 threads contributing 4 items each. (Diagrams: a work-efficient Brent-Kung hybrid, ~30 binary ops, vs. a depth-efficient Kogge-Stone hybrid, ~70 binary ops.)
15 CUDA today. Kernel programming is complicating. (Diagram: application thread → CUDA stub → threadblocks.)
17 Collective design & usage
18 Collective design criteria. Components are easily nested & sequenced. (Diagram: each threadblock's BlockSort is composed from BlockRadixRank, BlockScan, WarpScan, and BlockExchange.)
19 Collective design criteria. Flexible interfaces that scale (& tune) to different block sizes, thread-granularities, etc. (Diagram: the same BlockSort composition instantiated at a different thread block size.)
20 Collective interface design. Parameter fields are separated by concerns:
1. Static specialization interface: template params dictate storage layout and unrolling of algorithmic steps, and allow data placement in fast registers.
2. Reflected shared resource types: reflection enables compile-time allocation and tuning.
3. Collective construction interface: optional params concerning inter-thread communication, orthogonal to function behavior.
4. Operational function interface: method-specific inputs/outputs.

__global__ void ExampleKernel()
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a segment of 512 items blocked across 128 threads
    int items[4];
    ...

    // Compute block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}
21 Collective primitive design. Simplified block-wide prefix sum:

template <typename T, int BLOCK_THREADS>
class BlockScan
{
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

    // Per-thread data (shared storage reference)
    TempStorage &temp_storage;

public:
    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Prefix sum operation (each thread contributes its own data item;
    // tid denotes the thread's rank, i.e., threadIdx.x)
    T Sum(T thread_data)
    {
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid - i >= 0)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
22 Sequencing CUB primitives, using cub::BlockLoad and cub::BlockScan:

__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each   (Specialize, Allocate)
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads   (Load, Scan)
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    ...
}
23 Nested composition of CUB primitives. (Diagram: cub::BlockScan is built from cub::WarpScan.)
24 Nested composition of CUB primitives. (Diagram: cub::BlockRadixSort is built from cub::BlockRadixRank, which uses cub::BlockScan and cub::WarpScan, plus cub::BlockExchange.)
25 Nested composition of CUB primitives. (Diagram: cub::BlockHistogram, specialized for the BLOCK_HISTO_SORT algorithm, is built from cub::BlockRadixSort and cub::BlockDiscontinuity, with cub::BlockRadixRank, cub::BlockScan, cub::WarpScan, and cub::BlockExchange underneath.)
26 Block-wide and warp-wide CUB primitives: cub::BlockDiscontinuity, cub::BlockExchange, cub::BlockLoad & cub::BlockStore, cub::BlockRadixSort, cub::WarpReduce & cub::BlockReduce, cub::WarpScan & cub::BlockScan, cub::BlockHistogram, and more at the CUB project on GitHub.
27 Tuning with flexible collectives
28 Example: radix sorting throughput (initial GT200 effort, ca. 2010). (Chart: millions of 32-bit keys/s on NVIDIA GTX 580 [1], NVIDIA GTX 480 [1], NVIDIA Tesla C2050 [1], NVIDIA GTX 280 [1], NVIDIA 9800 GTX+ [1], Intel MIC Knights Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3].) [1] Merrill. Back40 GPU Primitives. [2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010). [3] T. Harada and L. Howes. Introduction to GPU Radix Sort. [4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs.
29 Radix sorting throughput (current). (Chart: millions of 32-bit keys/s on the same platforms: NVIDIA GTX 580 [1], NVIDIA GTX 480 [1], NVIDIA Tesla C2050 [1], NVIDIA GTX 280 [1], NVIDIA 9800 GTX+ [1], Intel MIC Knights Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3].) [1] Merrill. Back40, CUB GPU Primitives (2013). [2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010). [3] T. Harada and L. Howes. Introduction to GPU Radix Sort. [4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs.
30 Fine-tuning primitives. Tiled prefix sum: data is striped across threads for memory accesses, then blocked across threads for scanning; the scan data flow is tiled from warpscans.

/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, LoadAlgorithm LOAD_ALGO, ScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    // Allocate on-chip temporary storage
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Load data per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    __syncthreads();

    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).Sum(thread_data);
}
31 CUB: device-wide performance-portability vs. Thrust and NPP across the last three major NVIDIA architecture families (Tesla, Fermi, Kepler). (Charts comparing CUB against Thrust v1.7.1 and NPP on Tesla C1060, Tesla C2050, and Tesla K20C: global radix sort, billions of 32-bit keys/s; global prefix scan, billions of 32-bit items/s; global histogram, billions of 8-bit items/s; global reduction, billions of 32-bit items/s; global partition-if, billions of 32-bit inputs/s.)
32 Summary
33 Summary: benefits of using CUB primitives. Simplicity of composition: kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey). High performance: CUB uses the best known algorithms, abstractions, strategies, and techniques. Performance portability: CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.). Simplicity of tuning: CUB adapts to various grain sizes (threads per block, items per thread, etc.) and provides alternative algorithms. Robustness and durability: CUB supports arbitrary data types and block sizes.
34 Questions? Please visit the CUB project on GitHub Duane Merrill
35 (Diagram: reduce-then-scan device strategy. Each of P tiles reduces its inputs x0, x1, ... to a partial sum; a scan of those partials produces each tile's starting prefix (prefix 0:0, 0:1, ..., 0:P-1); each tile then scans its inputs seeded with its prefix to produce outputs y0, y1, ...)
36 (Diagram: single-pass scan with adaptive look-back. Each tile reduces its inputs, publishes its aggregate, then determines its inclusive prefix by inspecting predecessors' status flags before scanning its outputs y. Status flags: X = invalid, A = aggregate available, P = inclusive prefix available.)
Stanford University NVIDIA Tesla M2090 NVIDIA GeForce GTX 690 Moore s Law 2 Clock Speed 10000 Pentium 4 Prescott Core 2 Nehalem Sandy Bridge 1000 Pentium 4 Williamette Clock Speed (MHz) 100 80486 Pentium
More informationINTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA
INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA SEQUENCING AND MOORE S LAW Slide courtesy Illumina DRAM I/F
More informationImplementing Radix Sort on Emu 1
1 Implementing Radix Sort on Emu 1 Marco Minutoli, Shannon K. Kuntz, Antonino Tumeo, Peter M. Kogge Pacific Northwest National Laboratory, Richland, WA 99354, USA E-mail: {marco.minutoli, antonino.tumeo}@pnnl.gov
More informationGPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60
1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationAutomatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced
More informationObject Support in an Array-based GPGPU Extension for Ruby
Object Support in an Array-based GPGPU Extension for Ruby ARRAY 16 Matthias Springer, Hidehiko Masuhara Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology June 14, 2016 Object
More informationMay 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017
May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationA low memory footprint OpenCL simulation of short-range particle interactions
A low memory footprint OpenCL simulation of short-range particle interactions Raymond Namyst STORM INRIA Group With Samuel Pitoiset, Inria and Emmanuel Cieren, Laurent Colombet, Laurent Soulard, CEA/DAM/DPTA
More informationPERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA. Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017
PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017 INTRODUCING TESLA V100 Volta Architecture Improved NVLink & HBM2 Volta MPS Improved SIMT Model
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationData Parallel Execution Model
CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationCOMP 605: Introduction to Parallel Computing Lecture : GPU Architecture
COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:
More informationScan Primitives for GPU Computing
Scan Primitives for GPU Computing Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 2 Prefix
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationAutomatic FFT Kernel Generation for CUDA GPUs. Akira Nukada Tokyo Institute of Technology
Automatic FFT Kernel Generation for CUDA GPUs. Akira Nukada Tokyo Institute of Technology FFT (Fast Fourier Transform) FFT is a fast algorithm to compute DFT (Discrete Fourier Transform). twiddle factors
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More information