Scan Primitives for GPU Computing

Similar documents
Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

Scan Primitives for GPU Computing

Fast Tridiagonal Solvers on GPU

Fast BVH Construction on GPUs

GPU Programming. Parallel Patterns. Miaoqing Huang University of Arkansas 1 / 102

Scan Algorithm Effects on Parallelism and Memory Conflicts

Parallel Patterns Ezio Bartocci

Efficient Stream Reduction on the GPU

Sorting and Selection

CS 179: GPU Programming. Lecture 7

Fast Scan Algorithms on Graphics Processors

Hiroki Yasuga, Elisabeth Kolp, Andreas Lang. 25th September 2014, Scientific Programming

CS377P Programming for Performance GPU Programming - II

First Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 13 Parallelism in Software I

Update on Angelic Programming synthesizing GPU friendly parallel scans

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road

CS671 Parallel Programming in the Many-Core Era

Algorithms and Applications

Warps and Reduction Algorithms

State of Art and Project Proposals Intensive Computation

CS 314 Principles of Programming Languages

DIVIDE & CONQUER. Problem of size n. Solution to sub problem 1

Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units

Data Parallel Programming with Patterns. Peng Wang, Developer Technology, NVIDIA

Parallel FFT Program Optimizations on Heterogeneous Computers

CSC Design and Analysis of Algorithms

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

CSC Design and Analysis of Algorithms. Lecture 6. Divide and Conquer Algorithm Design Technique. Divide-and-Conquer

A Sampling of CUDA Libraries Michael Garland

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

DIVIDE AND CONQUER ALGORITHMS ANALYSIS WITH RECURRENCE EQUATIONS

Fragment-Parallel Composite and Filter. Anjul Patney, Stanley Tzeng, and John D. Owens University of California, Davis

A Batched GPU Algorithm for Set Intersection

A MATLAB Interface to the GPU

School of Computer and Information Science

Introduction to CUDA

The GPGPU Programming Model

Sparse Multifrontal Performance Gains via NVIDIA GPU January 16, 2009

Sorting Algorithms. Slides used during lecture of 8/11/2013 (D. Roose) Adapted from slides by

Faster Sorting Methods

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16

Information Coding / Computer Graphics, ISY, LiTH

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Mio: Fast Multipass Partitioning via Priority-Based Instruction Scheduling

Divide-and-Conquer. The most-well known algorithm design strategy: smaller instances. combining these solutions

CS 677: Parallel Programming for Many-core Processors Lecture 6

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting

Principles of Parallel Algorithm Design: Concurrency and Mapping

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois

Efficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis

Quick-Sort fi fi fi 7 9. Quick-Sort Goodrich, Tamassia

Sorting Algorithms. + Analysis of the Sorting Algorithms

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

A Work-Efficient GPU Algorithm for Level Set Segmentation. Mike Roberts Jeff Packer Mario Costa Sousa Joseph Ross Mitchell

Chapter 4. Divide-and-Conquer. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Parallel Sorting Algorithms

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Lecture 13: March 25

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. November Parallel Sorting

Better sorting algorithms (Weiss chapter )

Selection, Bubble, Insertion, Merge, Heap, Quick Bucket, Radix

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

GPGPU: Parallel Reduction and Scan

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Mergesort again. 1. Split the list into two equal parts

COMP 352 FALL Tutorial 10

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

Sorting (Chapter 9) Alexandre David B2-206

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

A GPU based Real-Time Line Detector using a Cascaded 2D Line Space

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply

Divide and Conquer CISC4080, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

Divide and Conquer CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Outline

Quicksort (Weiss chapter 8.6)

SORTING, SETS, AND SELECTION

Lecture 8 Parallel Algorithms II

We can use a max-heap to sort data.

Divide-and-Conquer. Dr. Yingwu Zhu

Key question: how do we pick a good pivot (and what makes a good pivot in the first place)?

B. Tech. Project Second Stage Report on

Lecture 2: CUDA Programming

Principles of Parallel Algorithm Design: Concurrency and Decomposition

Sorting Algorithms. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Using GPUs to compute the multilevel summation of electrostatic forces

Chris Sewell Li-Ta Lo James Ahrens Los Alamos National Laboratory

Sorting (Chapter 9) Alexandre David B2-206

Lesson 1 4. Prefix Sum Definitions. Scans. Parallel Scans. A Naive Parallel Scans

SORTING AND SELECTION

Transcription:

Scan Primitives for GPU Computing Shubho Sengupta, Mark Harris *, Yao Zhang, John Owens University of California Davis, *NVIDIA Corporation

Motivation Raw compute power and bandwidth of GPUs increasing rapidly Programmable unified shader cores Ability to program outside the graphics framework However lack of efficient data-parallel primitives and algorithms

Motivation Current efficient algorithms either have streaming access 1:1 relationship between input and output element 3 1 7 0 1 6 3 2 8 1 5 2 7

Motivation Or have small neighborhood access N:1 relationship between input and output element where N is a small constant 3 1 7 0 1 6 3 8 7 5 7 9 3

Motivation However interesting problems require more general access patterns Changing one element affects everybody Stream Compaction 3 0 7 0 1 0 3 0 Null 3 7 1 3

Motivation Split T F F T F F T F F F F F F T T T Needed for Sort

Motivation Common scenarios in parallel computing Variable output per thread Threads want to perform a split radix sort, building trees What came before/after me? Where do I start writing my data? Scan answers this question

System Overview Algorithms Sort, Sparse matrix operations, Higher Level Primitives Enumerate, Distribute, Libraries and Abstractions for data parallel programming Low Level Primitives Scan and variants

Scan Each element is a sum of all the elements to the left of it (Exclusive) Each element is a sum of all the elements to the left of it and itself (Inclusive) 3 1 7 0 1 6 3 Input 0 3 11 11 15 16 22 Exclusive 3 11 11 15 16 22 25 Inclusive

Scan the past First proposed in APL (1962) Used as a data parallel primitive in the Connection Machine (1990) Guy Blelloch used scan as a primitive for various parallel algorithms (1990)

Scan the present First GPU implementation by Daniel Horn (200), O(n logn) Subsequent GPU implementations by Hensley (2005) O(n logn), Sengupta (2006) O(n), Greß (2006) O(n) 2D NVIDIA CUDA implementation by Mark Harris (2007), O(n), space efficient

Scan the implementation O(n) algorithm same work complexity as the serial version Space efficient needs O(n) storage Has two stages reduce and down-sweep

Scan - Reduce 3 1 7 0 1 6 3 log n steps Work halves each step 3 7 7 5 6 9 O(n) work In place, space efficient 3 7 11 5 6 1 3 7 11 5 6 25

Scan - Down Sweep 3 7 11 5 6 25 log n steps Work doubles each step 3 7 11 5 6 0 O(n) work 3 7 0 5 6 11 In place, space efficient 3 0 7 11 6 16 0 3 11 11 15 16 22

Segmented Scan Input 3 1 7 0 1 6 3 Scan within each segment in parallel Output 0 3 0 7 7 0 1 7

Segmented Scan Introduced by Schwartz (1980) Forms the basis for a wide variety of algorithms Quicksort, Sparse Matrix-Vector Multiply, Convex Hull

Segmented Scan - Challenges Representing segments Efficiently storing and propagating information about segments Scans over all segments should happen in parallel Overall work and space complexity should be O(n) regardless of the number of segments

Representing Segments Possible Representations are Vector of segment lengths Vector of indices which are segment heads Vector of flags: 1 for segment head, 0 if not First two approaches hard to parallelize as different size as input We use the third as it is the same size as input

Segmented Scan Flag Storage Space-Inefficient to store one flag in an integer Store one flag in a byte striped across 32 words Reduces bank conflicts

Segmented Scan implementation Similar to Scan O(n) space and work complexity Has two stages reduce and down-sweep

Segmented Scan implementation Unique to segmented scan Requires an additional flag per element for intermediate computation Additional flags get set in reduce step Additional book-keeping with flags in down-sweep These flags prevent data movement between segments

Platform NVIDIA CUDA and G80 Threads grouped into blocks Threads in a block can cooperate through fast onchip memory Hence programmer must partition problem into multiple blocks to use fast memory Adds complexity but usually much faster code

Segmented Scan Large Input

Segmented Scan Advantages Operations in parallel over all the segments Irregular workload since segments can be of any length Can simulate divide-and-conquer recursion since additional segments can be generated

Primitives - Enumerate Input: a true/false vector F F T F T T T F Output: count of true values to the left of each element 0 0 0 1 1 2 3 3 Useful in stream compact Output for each true element is the address for that element in the compacted array

Primitives - Distribute Input: a vector with segments 3 1 7 0 1 6 3 Output: the first element of a segment copied over all other elements 3 3 3 6 6

Primitives Distribute Set all elements except the head elements to zero 3 0 0 0 0 6 0 Do inclusive segmented scan 3 3 3 6 6 Used in quicksort to distribute pivot

Primitives Split and Segment Input: a vector with true/false elements. Possibly segmented 3 1 7 0 1 6 3 False Output: Stable split within each segment falses on the left, trues on the right 3 0 1 7 1 6 3

Primitives Split and Segment Can be implemented with Enumerate One enumerate for the falses going left to right One enumerate for the trues going right to left Used in quicksort

Applications Quicksort Traditional algorithm GPU unfriendly Recursive Segments vary in length, unequal workload Primitives built on segmented scan solves both problems Allows operations on all segments in parallel Simulates recursion by generating new segments in each iteration

Applications Quicksort 5 3 7 6 8 9 3 Input 5 5 5 5 5 5 5 5 Distribute pivot F F T F T T T F Input > pivot 5 3 3 7 6 8 9 Split and Segment

Applications Quicksort 5 3 3 7 6 8 9 5 5 5 5 7 7 7 7 Distribute pivot T F F F T F T T Input pivot 3 3 5 6 7 8 9 Split and segment

Applications Quicksort 3 3 5 6 7 8 9 3 3 3 5 6 7 7 7 Distribute pivot F T F F F F T T Input > pivot 3 3 5 6 7 8 9 Split and segment

Applications Sparse M-V multiply Dense matrix operations are much faster on GPU than CPU However Sparse matrix operations on GPU much slower Hard to implement on GPU Non-zero entries in row vary in number

Applications Sparse M-V multiply Three different approaches Rows sorted by number of non-zero entries [Bolz] Stored as diagonals and processed them in sequence [Krüger] Rows computed in parallel but runtime determined by longest row [Brook]

Applications Sparse M-V multiply y 0 y 1 y 2 = a 0 b c d e 0 0 f x 0 x 1 x 2 a b c d e f Non-zero elements 0 2 0 1 2 2 Column Index 0 2 5 Row begin Index

Applications Sparse M-V multiply Column Index 0 2 0 1 2 2 a b c d e f x 0 x 2 x 0 x 1 x 2 x 2 x = ax 0 bx 2 cx 0 dx 1 ex 2 fx 2 Backward inclusive segmented scan Pick first element in segment ax 0 + bx 2 cx 0 + dx 1 + ex 2 fx 2

Applications Tridiagonal Solver Implemented Kass and Miller s shallow water solver Water surface described as a 2D array of heights Global movement of data From one end to the other and back Suits the Reduce/Down-sweep structure of scan

Applications Tridiagonal Solver Tridiagonal system of n rows solved in parallel Then for each of the m columns in parallel Read pattern is similar to but more complex than scan

Results - Scan Packing and Unpacking Flags Non sequential I/O Saving State.8 x slower Time (Normalized) Extra computation for sequential memory access 1.1x slower 3x slower Forward Backward Forward Backward Scan Segmented Scan

Results Sparse M-V Multiply Input: raefsky matrix, 322 x 322, 29276 elements GPU (215 MFLOPS) half as fast as CPU oski (522 MFLOPS) Hard to do irregular computation Most time spent in backward segmented scan

Results - Sort 13x slower Slow Merge Packing/Unpacking Flags Complex Kernel Time (Normalized) Slow Merge 2x slower x slower Global Block GPU CPU Radix Sort Quick Sort

Results Tridiagonal solver 256 x 256 grid: 367 simulation steps per second Dominated by the overhead of mapping and unmapping vertex buffers 3x faster than a CPU cyclic reduction solver 12x faster when using shared memory

Improved Results Since Publication Twice as fast for all variants of scan and sparse matrix-vector multiply Scan More work per thread 8 elements vs 2 before Segmented Scan No packing of flags Sequential memory access More optimizations possible

Contribution and Future Work Algorithm and implementation of segmented scan on GPU First implementation of quicksort on GPU Primitives appropriate for complex algorithms Global data movement, unbalanced workload, recursive Scan never occurs in serial computation Tiered approach, standard library and interfaces

Acknowledgments Jim Ahrens, Guy Blelloch, Jeff Inman, Pat McCormick David Luebke, Ian Buck Jeff Bolz Eric Lengyel Department of Energy National Science Foundation

Shallow Water Simulation