Scan Primitives for GPU Computing

Similar documents
PARALLEL PROCESSING UNIT 3. Dr. Ahmed Sallam

Scan Algorithm Effects on Parallelism and Memory Conflicts

Scan Primitives for GPU Computing

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents

GPGPU: Parallel Reduction and Scan

Trees. (Trees) Data Structures and Programming Spring / 28

Section 4 SOLUTION: AVL Trees & B-Trees

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 13 Parallelism in Software I

Sorting and Selection

Lecture 7. Transform-and-Conquer

Binary Trees. BSTs. For example: Jargon: Data Structures & Algorithms. root node. level: internal node. edge.

CSCI2100B Data Structures Trees

Binary Trees

EE 368. Weeks 5 (Notes)

& ( D. " mnp ' ( ) n 3. n 2. ( ) C. " n

A6-R3: DATA STRUCTURE THROUGH C LANGUAGE

CS584/684 Algorithm Analysis and Design Spring 2017 Week 3: Multithreaded Algorithms

Multi-way Search Trees. (Multi-way Search Trees) Data Structures and Programming Spring / 25

CS102 Binary Search Trees

March 20/2003 Jayakanth Srinivasan,

ECE 242 Data Structures and Algorithms. Trees IV. Lecture 21. Prof.

Binary Trees, Binary Search Trees

INF2220: algorithms and data structures Series 1

Heaps Outline and Required Reading: Heaps ( 7.3) COSC 2011, Fall 2003, Section A Instructor: N. Vlajic

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

CS 171: Introduction to Computer Science II. Binary Search Trees

Data Structure and Algorithm Midterm Reference Solution TA

CS 179: GPU Programming. Lecture 7

Binary heaps (chapters ) Leftist heaps

CSC Design and Analysis of Algorithms. Lecture 8. Transform and Conquer II Algorithm Design Technique. Transform and Conquer

(Refer Slide Time 04:53)

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road

Range Queries. Kuba Karpierz, Bruno Vacherot. March 4, 2016

CSCI-401 Examlet #5. Name: Class: Date: True/False Indicate whether the sentence or statement is true or false.

( ) n 5. Test 1 - Closed Book

Computational Geometry

CE 221 Data Structures and Algorithms

Chapter 4 Trees. Theorem A graph G has a spanning tree if and only if G is connected.

Balanced Search Trees

CSC Design and Analysis of Algorithms. Lecture 8. Transform and Conquer II Algorithm Design Technique. Transform and Conquer

Balanced Binary Search Trees. Victor Gao

Prefix Scan and Minimum Spanning Tree with OpenCL

) $ f ( n) " %( g( n)

Lecture Notes for Advanced Algorithms

Programming II (CS300)

ARC 088/ABC 083. A: Libra. DEGwer 2017/12/23. #include <s t d i o. h> i n t main ( )

GPU Programming. Parallel Patterns. Miaoqing Huang University of Arkansas 1 / 102

Warped parallel nearest neighbor searches using kd-trees

Friday Four Square! 4:15PM, Outside Gates

Algorithms in Systems Engineering ISE 172. Lecture 16. Dr. Ted Ralphs

FINALTERM EXAMINATION Fall 2009 CS301- Data Structures Question No: 1 ( Marks: 1 ) - Please choose one The data of the problem is of 2GB and the hard

CSE 530A. B+ Trees. Washington University Fall 2013

CS301 - Data Structures Glossary By

Lower Bound on Comparison-based Sorting

CSE100. Advanced Data Structures. Lecture 13. (Based on Paul Kube course materials)

MID TERM MEGA FILE SOLVED BY VU HELPER Which one of the following statement is NOT correct.

CSE332: Data Abstractions Lecture 21: Parallel Prefix and Parallel Sorting. Tyler Robison Summer 2010

Lecture 6. Binary Search Trees and Red-Black Trees

16/10/2008. Today s menu. PRAM Algorithms. What do you program? What do you expect from a model? Basic ideas for the machine model

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs

Advanced Tree Data Structures

An AVL tree with N nodes is an excellent data. The Big-Oh analysis shows that most operations finish within O(log N) time

Day 10. COMP1006/1406 Summer M. Jason Hinek Carleton University

CSE506: Operating Systems CSE 506: Operating Systems

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.

Parallel Patterns Ezio Bartocci

Sorting. Bubble Sort. Pseudo Code for Bubble Sorting: Sorting is ordering a list of elements.

Binary Trees. Height 1

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

A set of nodes (or vertices) with a single starting point

1 Leaffix Scan, Rootfix Scan, Tree Size, and Depth

Exercise 1 : B-Trees [ =17pts]

Analysis of Algorithms

6. Algorithm Design Techniques

Multiple Choice. Write your answer to the LEFT of each problem. 3 points each

CPS 616 TRANSFORM-AND-CONQUER 7-1

CSE 4500 (228) Fall 2010 Selected Notes Set 2

1 Interlude: Is keeping the data sorted worth it? 2 Tree Heap and Priority queue

D. Θ nlogn ( ) D. Ο. ). Which of the following is not necessarily true? . Which of the following cannot be shown as an improvement? D.

Trees! Ellen Walker! CPSC 201 Data Structures! Hiram College!

Computational Geometry

Assignment 1. Stefano Guerra. October 11, The following observation follows directly from the definition of in order and pre order traversal:

Analysis of Algorithms. Unit 4 - Analysis of well known Algorithms

COMP 250 Fall recurrences 2 Oct. 13, 2017

Visualizing Data Structures. Dan Petrisko

Binary Search Trees. Carlos Moreno uwaterloo.ca EIT

CS 677: Parallel Programming for Many-core Processors Lecture 6

CS 331 DATA STRUCTURES & ALGORITHMS BINARY TREES, THE SEARCH TREE ADT BINARY SEARCH TREES, RED BLACK TREES, THE TREE TRAVERSALS, B TREES WEEK - 7

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree

Parallel Programming

In the previous presentation, Erik Sintorn presented methods for practically constructing a DAG structure from a voxel data set.

Introduction to Parallel Computing Errata

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

Information Coding / Computer Graphics, ISY, LiTH

Search Trees. Undirected graph Directed graph Tree Binary search tree

( ). Which of ( ) ( ) " #& ( ) " # g( n) ( ) " # f ( n) Test 1

Transcription:

Scan Primitives for GPU Computing

Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 2

Prefix sum Prefix sum Input: An ordered set: [x 0, x 1, x 2, x 3, ] Array or linked-list A binary associative i operator: + Associative: (a+b)+c = a+(b+c) Don t need to be commutative: it s okay that a+b b+a (String concatenation for example.) Identity element exists (additive identity : 0 ) 0+a=a+0=a Examples: add, multiply, max, min, AND, OR, Minkowski sum, etc. Output: [x 0, x 0 +x 1, x 0 +x 1 +x 2, x 0 +x 1 +x 2 +x 3, ] 3

Scan Scan is array-based prefix sum Elements are located continuously in the memory Easy for GPU implementation Inclusive scan [x 0, x 0 +x 1, x 0 +x 1 +x 2, x 0 +x 1 +x 2 +x 3, ] Exclusive scan (prescan) [0, x 0, x 0 +x 1, x 0 +x 1 +x 2, ] 4

Sequential Implementation of Inclusive Scan Step Complexity: O(n) Total number of steps Work Complexity: O(n) Total number of operations 5

Work Complexity Work complexity matters in parallel algorithm Look at a simple parallel problem: copy n 2 data source: destination: 1 2 n 2 1 2 n 2 Step complexity = 1; work complexity = n 2 If we have n 2 processors, we only need 1 parallel l step. If we have n processors, each process needs to do n copies. So we need n parallel steps. (Brent 1974) step complexity S, work complexity W, p processors 6 # parallel steps W + S p

Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 7

Horn s GPU Implementation 1 2 3 4 5 6 7 8 8

Horn s GPU Implementation 1 2 3 4 5 6 7 8 x i = x i-1 + x i 1 12 23 34 45 56 67 78 9

Horn s GPU Implementation 1 2 3 4 5 6 7 8 x i = x i-1 + x i 1 12 23 34 45 56 67 78 x i = x i-2 + x i 1 12 123 1234 2345 3456 4567 5678 10

Horn s GPU Implementation 1 2 3 4 5 6 7 8 x i = x i-1 + x i 1 12 23 34 45 56 67 78 x i = x i-2 + x i 1 12 123 1234 2345 3456 4567 5678 x i = x i-4 + x i 1 12 123 1234 12345 123456 1234567 12345678 11

If we have 16 numbers 1 2 3 4 5 6 7 8 x i = x i-1 + x i 1 12 23 34 45 56 67 78 x i = x i-2 + x i 1 12 123 1234 2345 3456 4567 5678 x i = x i-4 + x i 1 12 123 1234 12345 123456 1234567 12345678 x i = x i-8 + x i 12

Horn s GPU Implementation Horn, Daniel. Stream reduction operations for GPGPU applications. In GPU Germs 2. Step Complexity: O(logn) Step complexity of sequential algorithm is O(n) Efficient Work Complexity: = (n-1)+(n-2)+(n-4)+ = log n d= d ( n 2 ) 0 = nlogn-n+1 n+1 = O(nlogn) Work complexity of sequential algorithm is O(n) Not work-efficient 13

Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 14

A Work-Efficient Scan Algorithm Exclusive Scan Step Complexity: O(logn) Work Complexity: O(n) Use (Implicit) Balanced Binary Tree Just a concept. Don t need to build a real tree. Perform calculations on the array directly. Easy for implementation on GPU. Two passes over the array (tree) Reduce (up-sweep) Down-sweep 15

Balanced Binary Tree A binary tree where the depth of all the leaves differs by at most 1. The height of a balanced binary tree is minimized. Take reduction for an example 1234 1234 4 123 12 34 3 12 1 2 3 4 2 1 16

Up-sweep (Balanced Tree) 12345678 Parent = Left + Right 1234 5678 12 34 56 78 1 2 3 4 5 6 7 8 17

Down-sweep (Balanced Tree) 12345678 0 Root = 0 1234 5678 12 34 56 78 1 2 3 4 5 6 7 8 18

Down-sweep (Balanced Tree) 12345678 0 Right = Parent + Left Left = Parent 1234 0 5678 1234 12 34 56 78 1 2 3 4 5 6 7 8 19

Down-sweep (Balanced Tree) 12345678 0 Right = Parent + Left Left = Parent 1234 0 5678 1234 12 0 34 12 56 1234 78 123456 1 2 3 4 5 6 7 8 20

Down-sweep (Balanced Tree) 12345678 0 Right = Parent + Left Left = Parent 1234 0 5678 1234 12 0 34 12 56 1234 78 123456 1 2 3 4 5 6 7 8 0 1 12 123 1234 12345 123456 1234567 21

Why It Works? (up-sweep) 12345678 Parent = Left + Right 1234 5678 12 34 56 78 1 2 3 4 5 6 7 8 v subtree(v) After reduction (up-sweep), each node contains the partial sum of its subtree 22

Why It Works? (down-sweep) 12345678 0 Right = Parent + Left Left = Parent 1234 0 5678 1234 12 0 34 12 56 1234 78 123456 1 2 3 4 5 6 7 8 0 1 12 123 1234 12345 123456 1234567 L P R pre(p), pre(l) pre(r) After down-sweep, each node contains the sum of all the leaves that t precede it in the preorder traversal. 23

Pre-order traversal Traverse the tree recursively in the following order: Visit the root Visit the left subtree Visit the right subtree post-order and in-order also give a feasible algorithm? 1 2 9 3 6 10 13 Items preceding the green node (9) are marked in yellow. What are items preceding the left (10) and right (13) child of the green node (9)? Right = Parent + Left Left = Parent 4 5 7 8 11 12 14 15 Why we reset the root to zero at the beginning of down-sweep? 24

Why pre-order? Actually in-order and post-order also work! Try to derive their formulas But pre-order gives the simplest formula. Post-order gives a beautiful and concise formula for inclusive scan; however it requires to define a corresponding minus operator, which is not always trivial (like Minkowski sum) In-order also requires minus operator. In addition, its formula involves three continuous levels (parent, child, grandchild) Pre-order doesn t need minus operator, because the children always come after the parent. 25

In-place computation without a tree Balanced Binary Tree is just a concept. We don t really build such a tree. All the computations are performed on the array directly. Given an item in the array, we know the exact index of its parent or left/right child. So there is no need to record the parent-child relationship like a tree data structure does. 26

Up-sweep (in-place) 1 2 3 4 5 6 7 8 27

Up-sweep (in-place) 1 12 3 34 5 56 7 78 1 2 3 4 5 6 7 8 Parent = Left + Right 28

Up-sweep (in-place) 1 12 3 1234 5 56 7 5678 1 12 3 34 5 56 7 78 1 2 3 4 5 6 7 8 Parent = Left + Right 29

Up-sweep (in-place) 1 12 3 5 56 7 1234 12345678 1 12 3 1234 5 56 7 5678 1 12 3 34 5 56 7 78 1 2 3 4 5 6 7 8 Parent = Left + Right 30

Down-sweep (in-place) 1 12 3 5 56 7 1234 12345678 31

Down-sweep (in-place) 1 12 3 1234 5 56 7 0 Root = 0 32

Down-sweep (in-place) 1 12 3 1234 5 56 7 0 1 12 3 0 5 56 7 1234 Right = Parent + Left Left = Parent 33

Down-sweep (in-place) 1 12 3 1234 5 56 7 0 1 12 3 0 5 56 7 1234 1 0 3 12 5 1234 7 123456 Right = Parent + Left Left = Parent 34

Down-sweep (in-place) 1 12 3 1234 5 56 7 0 1 12 3 0 5 56 7 1234 1 0 3 12 5 1234 7 123456 0 1 12 123 1234 12345 123456 1234567 Right = Parent + Left Left = Parent 35

Pseudocode Up-sweep (reduce): Down-sweep: 36

Complexity Analysis Step complexity Up-sweep: log(n) Down-sweep: log(n) Work complexity Up-sweep n log n 1 d= 0 d+ 1 2 Down-sweep = n 1 log 1 1 n n + 3 = 1 3 n 2 + d= 0 2 d 37

Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 38

Segmented Scan Input array is broken into arbitrary segments; Scan is performed in each segment; Additional input: flag array (1: start of a new segment) Flag: [1 0 0 1 0 0 1 0] Data: [4 2 1 3 0 2 1 5] => Result: [0 4 6 0 3 3 0 1] Still O(n) work complexity and O(logn) step complexity; Only three times slower than scan; Algorithm is slightly different, but more tricky. 39

Segmented Scan Scan 40

Why segmented scan? Why not scan in each segment independently? Suppose we have m arrays, each of which has n numbers. Scanning them independently takes O(mlogn) step complexity. If we combine them into a single array with mn numbers, and do a segmented scan, it takes only O(logmn) step complexity. Segmented scan has lot of applications, such as quicksort, sparse matrix-vector multiplication, processor allocation, etc. 41

Convert segmented scan into scan Given flag array: f i (0 or 1) Given data array: x i Define the array of pairs: c i = [f i, x i ] Define a new binary operator on c i c1 c2 = [ f1, x1] [ f2, x2] { [ f1 f2, x1 + x2 ] if f2 = 0 = [ f f, x ] if f = 1 Prove that 1 2 2 2 is associative The identity element for is (0, 0 ) Now scan over array c i with operator 42

Reduce [1,4] [0,2] [0,1] [1,3] [0,0] [0,2] [1,1] [0,5] 43

Reduce [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [1,6] [1,4] [0,2] [0,1] [1,3] [0,0] [0,2] [1,1] [0,5] 44

Reduce [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [1,6] [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [1,6] [1,4] [0,2] [0,1] [1,3] [0,0] [0,2] [1,1] [0,5] 45

Reduce [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [1,6] [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [1,6] [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [1,6] [1,4] [0,2] [0,1] [1,3] [0,0] [0,2] [1,1] [0,5] 46

Down-sweep [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [1,6] 47

Down-sweep [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [0,0] 48

Down-sweep [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [0,0] [1,4] [1,6] [0,1] [0,0] [0,0] [0,2] [1,1] [1,3] 49

Down-sweep [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [0,0] [1,4] [1,6] [0,1] [0,0] [0,0] [0,2] [1,1] [1,3] [1,4] [0,0] [0,1] [1,6] [0,0] [1,3] [1,1] [1,5] 50

Down-sweep [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [0,0] [1,4] [1,6] [0,1] [0,0] [0,0] [0,2] [1,1] [1,3] [1,4] [0,0] [0,1] [1,6] [0,0] [1,3] [1,1] [1,5] [0,0] [1,4] [1,6] [1,7] [1,3] [1,3] [1,5] [1,1] 51

Now What We Get Input: [1,4] [0,2] [0,1] [1,3] [0,0] [0,2] [1,1] [0,5] Output: [0,0] [1,4] [1,6] [1,7] [1,3] [1,3] [1,5] [1,1] Each output value is the exclusive prefix sum ( ). Seems correct. Should be [0,0] But something is WRONG 52

How to correct this? Need to cut off the sum across different segments. Note that the root is propagated along the left side the tree. Reset the top node of a subtree to zero when its leftmost leave is the start of a new segment, just like we reset the top root to zero at the beginning of the downsweep. left subtree right subtree Adjustment to down-sweep: P if the ORIGINAL flag at the position to the right of the left child is 1, the right child is L R set to 0. start of a new segment 53

Adjustment to Down-sweep Input: [1,4] [0,2] [0,1] [1,3] [0,0] [0,2] [1,1] [0,5] [1,4] [1,6] [0,1] [1,3] [0,0] [0,2] [1,1] [0,0] [1,4] [1,6] [0,1] [0,0] [0,0] [0,2] [1,1] [1,3] [1,4] [0,0] [0,1] [1,6] [0,0] [1,3] [1,1] [1,5] [0,0] [0,0] [1,4] [1,6] [1,7] [1,3] [1,3] [1,5] [1,1] [0,0] [0,0] 54

Segmented Scan Algorithm: Revisit Reduce: Down-sweep: Adjustment Scan over (flag, data) pairs 55

Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 56

Application Other Primitives Enumerate Distribute Split Rdi Radix Sort Quick Sort Recurrence Equations Sparse Matrix-Vector Multiply Tridiagonal Matrix Solver Multi-precision addition 57

Split Data: [5 7 3 1 4 2 7 2] Flag: [110 1 1 001 0 0] Result: [5 7 1 7 3 4 2 2] Steps: [110 1 1 001 0 0] # e [0 1 2 2 3 3 3 4] # f = scan over e 4 # NF = add last elements of e and f [0 1 2 3 4 5 6 7] # id [4 4 4 5 5 6 7 7] # t = id f + NF [01425637] # addr = e? f : t [5 7 1 7 3 4 2 2] # out[addr] = in 58

Split (hands-on) Data: [1 5 6 2 3 7 8 4] Flag: [1 0 0 110 1 01] Result: [ ] Steps: [10011001] 0 1 1 0 0 1] # e [ ] # f = scan over e [ ] # id [ ] # t = id f + NF [ ] # addr = e? f : t [ ] # out[addr] = in # NF = add last elements of e and f 59

Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 60

Implementation on CUDA All the threads should be in the same thread block so that they can share memory and synchronize with each other. On G80 GPUs, this limits us to a maximum of 1024 elements. Extend to large arrays by dividing the array into blocks. Each block is scanned by a single thread block. 61

Multi-block Segmented Scan First Level Reduce Store last element of each block Set last element of each block Second Level Segmented Scan First Level Down-Sweep 62

CUDA Source codes for scan 63