GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16
Transcription
2-4 Motivating Simple Example: Compaction
- Compaction (i.e., binary split)
- Traditional approach: Flags, Scan, Shuffle
- e.g., splitter: 10
- Other option: split input into two buckets
- Can also be solved by sorting keys
  - Not always possible
  - Loses stability, i.e., initial order within buckets is not preserved
[Figure: input keys with their buckets; output keys with their buckets; sorted keys]
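The flags, scan, shuffle sequence can be sketched on the CPU as follows. This is a minimal sequential sketch, not the GPU implementation from the talk; the function name `compact` and the predicate "key < 10" for the splitter are assumptions made for illustration.

```python
from itertools import accumulate

def compact(keys, keep):
    """Stable compaction: keep the keys for which keep(key) is true."""
    # 1. Flags: 1 for every key that survives, 0 otherwise.
    flags = [1 if keep(k) else 0 for k in keys]
    # 2. Exclusive scan of the flags gives each surviving key its output slot.
    offsets = [0] + list(accumulate(flags))[:-1]
    # 3. Shuffle (scatter): move each flagged key to its slot.
    out = [None] * sum(flags)
    for key, flag, slot in zip(keys, flags, offsets):
        if flag:
            out[slot] = key
    return out

# Hypothetical input; the splitter 10 is read here as "keep keys below 10".
small = compact([12, 3, 25, 7, 10, 1], lambda k: k < 10)
```

Because the scan assigns slots in input order, the surviving keys keep their relative order, which is the stability property that sorting gives up.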
5-6 What is Multisplit?
- Multisplit (generalization of binary split)
- Let's try multiple buckets
- e.g., splitters: 10 and 20
- Can also be solved by sorting keys
[Figure: input keys with their buckets; output keys with their buckets; sorted keys with their buckets]
7 Multisplit primitive
- Input:
  - an unordered set of keys (or key-value pairs)
  - m, the number of buckets
  - a user-specified function that identifies the bucket of each key
- Output: keys (or key-value pairs) separated into m buckets
[Figure: output partitioned into buckets B0, B1, B2, B3]
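A sequential reference for this interface might look like the sketch below. The names `multisplit` and `bucket_by_splitters` are hypothetical, and treating a key equal to a splitter as belonging to the upper bucket is an assumption; the slides do not state which side the boundary falls on.

```python
from bisect import bisect_right

def bucket_by_splitters(splitters):
    """Return a bucket function for sorted splitters (hypothetical helper)."""
    return lambda key: bisect_right(splitters, key)

def multisplit(keys, m, bucket_of):
    """Stable multisplit: returns (reordered keys, per-bucket counts)."""
    buckets = [[] for _ in range(m)]
    for key in keys:
        buckets[bucket_of(key)].append(key)  # appending preserves input order
    out = [key for bucket in buckets for key in bucket]
    counts = [len(bucket) for bucket in buckets]
    return out, counts

# Splitters 10 and 20 give three buckets: < 10, 10..19, >= 20.
out, counts = multisplit([23, 5, 14, 9, 31, 10, 2], 3, bucket_by_splitters([10, 20]))
```

Unlike a full sort, the keys inside each bucket stay in their original order (5, 9, 2 rather than 2, 5, 9).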
8 A Fast and Flexible Data-Organization Primitive
- Characterizing key-value pairs into buckets
- General load balancing
- Priority queues
- Single Source Shortest Path (SSSP)
  - Serial (Dijkstra): processing the vertex with the lowest weight
  - Bellman-Ford-Moore: all vertices in parallel
  - Delta-stepping formulation of SSSP [Davidson et al., 2014]
    - classifying vertices into buckets by their weights
    - processing the lowest weights in parallel
    - but no multisplit primitive: used radix sort instead
    - by using our own multisplit: 2.1x faster
- Other applications
  - colored prefix-sum
  - reorganizing into 8 direction-based buckets in GPU-based ray tracers [Yang et al., 2013]
  - first step in building GPU hash tables [Alcantara et al., 2009]
  - in the shallow stages of k-d tree construction [Wu et al., 2011]
9-12 Common Approaches
1. Recursive scan-based split
   - log(m) rounds of binary splits
   [Figure: initial keys with buckets B_0 = {i ≤ 40} and B_1 = {i > 40}; exclusive scan for B_0, right-to-left exclusive scan for B_1]
2. Radix sort
   - sorting keys
   - overkill (sorted within buckets)
   - initial order is not preserved
   [Figure: initial keys and their binary representation; ≤ 7 splits]
3. Reduced-bit sort
   - sort (bucket ID, key) pairs
   - ⌈log m⌉-bit bucket IDs
   [Figure: initial keys, new values, new keys; key-value sort]
4. Randomized insertions
   - a PRAM algorithm
   - large buffers for buckets
   - random insertions
   - initial order is not preserved
   [Figure: initial keys, buffers for B_0 and B_1, compaction]
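As an illustration of approach 3, the sketch below labels each key with its bucket ID and then runs one stable binary split per ID bit, least significant bit first. It is a CPU sketch under stated assumptions: the name `reduced_bit_sort` is hypothetical, and a real GPU version would hand the (bucket ID, key) pairs to a library radix sort restricted to the ID bits.

```python
def reduced_bit_sort(keys, m, bucket_of):
    """Multisplit by sorting (bucket ID, key) pairs on the ID bits only."""
    bits = max(1, (m - 1).bit_length())           # ceil(log2(m)) bits, at least 1
    pairs = [(bucket_of(key), key) for key in keys]
    for b in range(bits):
        # One stable binary split on bit b of the bucket ID.
        zeros = [p for p in pairs if not (p[0] >> b) & 1]
        ones = [p for p in pairs if (p[0] >> b) & 1]
        pairs = zeros + ones
    return [key for _, key in pairs]

# Hypothetical bucket function: < 10, 10..19, >= 20.
result = reduced_bit_sort([23, 5, 14, 9, 31, 10, 2], 3, lambda k: min(k // 10, 2))
```

Each pass is itself a binary split, so the loop also shows why the cost of split-based approaches grows with log m.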
13-14 Designing an Efficient Approach
Stable multisplit: unique permutation + data movement
1. Deriving all permutations: global computations
   - histogram $(h_0, \ldots, h_{m-1})$
   - key order per bucket
   For a key $u_i \in B_j$:
   $$p(i) = \underbrace{\sum_{k=0}^{j-1} h_k}_{\text{number of keys in previous buckets}} + \underbrace{\left|\{u_r : u_r \in B_j,\ r < i\}\right|}_{\text{number of keys before me in my own bucket}}$$
2. Final data movements: global random scatters
[Figure: input keys and output keys partitioned into buckets B0, B1, B2, B3]
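The permutation can be computed directly from its two terms: an exclusive scan of the histogram gives the number of keys in previous buckets, and a running per-bucket counter gives the number of earlier keys in the same bucket. The sketch below is sequential and uses hypothetical names (`multisplit_permutation`, `apply_permutation`).

```python
from itertools import accumulate

def multisplit_permutation(bucket_ids, m):
    """Return p, where p[i] is the output position of key i."""
    hist = [0] * m
    rank = []                                    # keys before me in my own bucket
    for b in bucket_ids:
        rank.append(hist[b])
        hist[b] += 1
    starts = [0] + list(accumulate(hist))[:-1]   # keys in previous buckets
    return [starts[b] + r for b, r in zip(bucket_ids, rank)]

def apply_permutation(keys, p):
    """Final data movement: scatter key i to position p[i]."""
    out = [None] * len(keys)
    for key, dst in zip(keys, p):
        out[dst] = key
    return out

keys = [23, 5, 14, 9, 31, 10, 2]
p = multisplit_permutation([2, 0, 1, 0, 2, 1, 0], 3)
```

Here `p` is a bijection on the positions 0..n-1, which is why a stable multisplit corresponds to a unique permutation.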
15-16 Our high-level ideas
1. Global computations: localize computations
   - several large-enough local subproblems: local histograms (local pre-scan)
   - a single small-enough global computation: global histogram (global scan)
   - several large-enough local subproblems: permutations + scatters (local post-scan)
   - avoid shared memory and synchronization: utilize intrinsics
2. Global random scatters: reorder keys locally in the last stage
   - local multisplits
   - more computational cost, but a better memory access pattern (coalesced writes)
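The three stages can be sketched sequentially by cutting the input into fixed-size tiles that stand in for the local subproblems. This is only an illustration under assumptions: the name `multisplit_tiled` and the tile size are hypothetical, and the bucket-major layout of the scanned histograms is chosen here because it makes the scan produce final offsets directly.

```python
from itertools import accumulate

def multisplit_tiled(keys, m, bucket_of, tile=4):
    """Stable multisplit as local pre-scan, global scan, local post-scan."""
    tiles = [keys[i:i + tile] for i in range(0, len(keys), tile)]
    n_tiles = len(tiles)

    # Pre-scan (local): one histogram per tile.
    local_hist = []
    for tile_keys in tiles:
        h = [0] * m
        for key in tile_keys:
            h[bucket_of(key)] += 1
        local_hist.append(h)

    # Scan (global): exclusive scan over m * n_tiles counts, bucket by bucket.
    flat = [local_hist[t][b] for b in range(m) for t in range(n_tiles)]
    scanned = [0] + list(accumulate(flat))[:-1]

    # Post-scan (local): base offset of (bucket, tile) plus rank inside the tile.
    out = [None] * len(keys)
    for t, tile_keys in enumerate(tiles):
        seen = [0] * m
        for key in tile_keys:
            b = bucket_of(key)
            out[scanned[b * n_tiles + t] + seen[b]] = key
            seen[b] += 1
    return out

result = multisplit_tiled([23, 5, 14, 9, 31, 10, 2], 3, lambda k: min(k // 10, 2))
```

Only the middle stage touches data from all tiles, and its input has m entries per tile rather than one entry per key.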
17 Granularity Tradeoffs
We experimented with a couple of different subproblem granularities:
1. Warp
   - warp-synchronous model with minimal warp divergence
   - fast communication via warp-wide ballots/shuffles
2. Block
   - more expensive communication via shared memory
   - cheaper global computation (scan over m × N_blocks elements)
   - more locality to extract after reordering

Property                  Direct MS   Warp-level MS          Block-level MS
Subproblem                warp        warp                   block
Reordering                none        warp-wide reordering   block-wide reordering
Computational load        low         medium                 high
Coalesced memory access   low         medium                 high
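To make the "cheaper global computation" point concrete, the sketch below counts the elements the global scan has to process for a given subproblem size. The figures of 32 keys per warp-sized subproblem and 256 keys per block-sized subproblem are assumptions for illustration only; the actual implementation may assign more keys to each subproblem.

```python
def scan_length(n_keys, m, keys_per_subproblem):
    """Number of histogram entries fed to the global scan."""
    n_subproblems = -(-n_keys // keys_per_subproblem)   # ceiling division
    return m * n_subproblems

n, m = 1 << 20, 8
warp_scan = scan_length(n, m, 32)     # assumed warp-sized subproblems
block_scan = scan_length(n, m, 256)   # assumed block-sized subproblems
```

Under these assumptions the block-level scan input is 8 times smaller than the warp-level one, at the price of communicating through shared memory inside each block.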
18-20 Implementation details & Optimizations
Warp-level MS
1. Pre-scan (Local):
   - read keys
   - warp histogram
     1. bit-by-bit balloting
     2. log m rounds
   - store warp histogram
2. Scan (Global):
   - exclusive scan on histograms
   - m × N_warps elements
3. Post-scan (Local):
   - read keys (key-value)
   - recompute warp histograms
   - compute local offsets
   - warp-level reordering
   - compute final positions
   - final data movement
[Figure: local histograms h_{j,l} for buckets j = 0..m-1 and warps l = 0..L-1, written by the pre-scan, scanned bucket by bucket, and read by the post-scan]

    procedure warp_histogram(bucket_id[0:31])
        Input:  bucket_id[0:31]
        Output: histo[0:m-1]
        for each thread i = 0:31 parallel warp do
            // thread i counts bucket i; start with every lane as a candidate
            histo_bmp[i] = 0xFFFFFFFF;
            for (int k = 0; k < ceil(log2(m)); k++) do
                // lanes whose bucket ID has bit k set
                temp_buffer = ballot(bucket_id[i] & 0x01);
                if ((i >> k) & 0x01) then
                    histo_bmp[i] &= temp_buffer;
                else
                    histo_bmp[i] &= XOR(0xFFFFFFFF, temp_buffer);
                end if
                bucket_id[i] >>= 1;
            end for
            // remaining set bits are the lanes whose bucket ID equals i
            histo[i] = popc(histo_bmp[i]);
        end for
        return histo[0:m-1];
    end procedure
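The ballot-based histogram can be checked on the CPU by emulating a 32-lane warp with Python lists. In this sketch `ballot` is a stand-in for the warp vote intrinsic, not a real API, and `warp_local_offsets` shows one way the post-scan's local offsets could be derived from the same ballots; the slides only say "compute local offsets", so that function is an assumption.

```python
WARP = 32
FULL = 0xFFFFFFFF

def ballot(preds):
    """Emulated warp vote: bit `lane` is set if that lane's predicate is true."""
    mask = 0
    for lane, pred in enumerate(preds):
        if pred:
            mask |= 1 << lane
    return mask

def warp_histogram(bucket_id, m):
    """Lane i ends up with the number of keys in bucket i (needs m <= 32)."""
    assert len(bucket_id) == WARP and 1 <= m <= WARP
    ids = list(bucket_id)
    bmp = [FULL] * WARP
    for k in range((m - 1).bit_length()):        # ceil(log2(m)) rounds
        votes = ballot(b & 1 for b in ids)
        for lane in range(WARP):
            # Keep only lanes whose bit k matches bit k of this lane's bucket.
            bmp[lane] &= votes if (lane >> k) & 1 else (~votes & FULL)
        ids = [b >> 1 for b in ids]
    return [bin(bmp[i]).count("1") for i in range(m)]

def warp_local_offsets(bucket_id, m):
    """For each lane: how many lower lanes hold a key of the same bucket."""
    ids = list(bucket_id)
    same = [FULL] * WARP
    for k in range((m - 1).bit_length()):
        votes = ballot((b >> k) & 1 for b in ids)
        for lane in range(WARP):
            same[lane] &= votes if (ids[lane] >> k) & 1 else (~votes & FULL)
    return [bin(same[lane] & ((1 << lane) - 1)).count("1") for lane in range(WARP)]

ids = [lane % 5 for lane in range(WARP)]
hist = warp_histogram(ids, 5)
offsets = warp_local_offsets(ids, 5)
```

Both functions use only votes and population counts, which mirrors the idea of avoiding shared memory and explicit synchronization inside a warp.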
21 Performance Evaluation
In this presentation:
1. NVIDIA Tesla K40c GPU
2. Radix sort from CUB, including the one in the reduced-bit sort method
3. Device-wide exclusive scan from CUB
4. Uniform distribution of keys in buckets
More results in the paper:
1. Detailed timing of different stages of our algorithm
2. Other GPU architectures: Maxwell
3. Different distributions of keys
4. Using our multisplit algorithms in the SSSP method
22 Average running time vs. number of buckets
[Figure: average running time (msec) vs. number of buckets (m) for Block-level MS, Direct MS, Reduced-bit sort, and Warp-level MS; panels (a) key-only and (b) key-value]
- Memory access quality: Block-level MS > Warp-level MS > Direct MS
- Computational load: Block-level MS > Warp-level MS > Direct MS
23 Performance vs Radix-Sort
[Figure: speedup vs. number of buckets (m) for Block-level MS, Direct MS, Reduced-bit sort, Warp-level MS, and Binary Split; panels (c) key-only and (d) key-value]
- key-only: 3.0x-6.7x
- key-value: 4.4x-8.0x
24 More buckets
For more buckets than the warp width (m > 32):
- warp histograms: each thread in charge of multiple buckets
- shared memory capacity: the other bottleneck
[Figure: average running time (msec) vs. number of buckets (m), key-only and key-value, for Block-level MS and reduced-bit sort, with radix sort (key-only and key-value) for comparison]
25 Conclusions
- Introduce a new efficient data-organization primitive
- High performance, especially for a low or modest number of buckets
- Full paper: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016)
- Code will soon be available in CUDPP
26 Thank You
27 References
Alcantara, D. A., Sharf, A., Abbasinejad, F., Sengupta, S., Mitzenmacher, M., Owens, J. D., and Amenta, N. (2009). Real-time parallel hashing on the GPU. ACM Transactions on Graphics, 28(5):154:1-154:9.
Davidson, A., Baxter, S., Garland, M., and Owens, J. D. (2014). Work-efficient parallel GPU methods for single source shortest paths. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014.
Wu, Z., Zhao, F., and Liu, X. (2011). SAH KD-tree construction on GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG '11.
Yang, X., Xu, D., and Zhao, L. (2013). Efficient data management for incoherent ray tracing. Applied Soft Computing, 13(1):1-8.
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationAMS 148 Chapter 6: Histogram, Sort, and Sparse Matrices
AMS 148 Chapter 6: Histogram, Sort, and Sparse Matrices Steven Reeves Now that we have completed the more fundamental parallel primitives on GPU, we will dive into more advanced topics. Histogram is a
More informationA Study of Different Parallel Implementations of Single Source Shortest Path Algorithms
A Study of Different Parallel Implementations of Single Source Shortest Path s Dhirendra Pratap Singh Department of Computer Science and Engineering Maulana Azad National Institute of Technology, Bhopal
More informationA Forward-Backward Single-Source Shortest Paths Algorithm
A Forward-Backward Single-Source Shortest Paths Algorithm Single-Source Shortest Paths in O(n) time David Wilson Microsoft Research Uri Zwick Tel Aviv Univ. Deuxièmes journées du GT CoA Complexité et Algorithmes
More informationFind the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!
Professor: Pete Keleher! keleher@cs.umd.edu! } Keep sorted by some search key! } Insertion! Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationSocial graphs (Facebook, Twitter, Google+, LinkedIn, etc.) Endorsement graphs (web link graph, paper citation graph, etc.)
Large-Scale Graph Processing Algorithms on the GPU Yangzihao Wang, Computer Science, UC Davis John Owens, Electrical and Computer Engineering, UC Davis 1 Overview The past decade has seen a growing research
More informationLecture 8 Parallel Algorithms II
Lecture 8 Parallel Algorithms II Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Original slides from Introduction to Parallel
More informationCS61BL. Lecture 5: Graphs Sorting
CS61BL Lecture 5: Graphs Sorting Graphs Graphs Edge Vertex Graphs (Undirected) Graphs (Directed) Graphs (Multigraph) Graphs (Acyclic) Graphs (Cyclic) Graphs (Connected) Graphs (Disconnected) Graphs (Unweighted)
More informationOn-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing
2018 On-the-fly for Massively-Parallel Software Geometry Processing Bernhard Kerbl Wolfgang Tatzgern Elena Ivanchenko Dieter Schmalstieg Markus Steinberger 5 4 3 4 2 5 6 7 6 3 1 2 0 1 0, 0,1,7, 7,1,2,
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationJulienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing George Karypis Sorting Outline Background Sorting Networks Quicksort Bucket-Sort & Sample-Sort Background Input Specification Each processor has n/p elements A ordering
More informationFast Radix Sort for Sparse Linear Algebra on GPU
Fast Radix Sort for Sparse Linear Algebra on GPU Lukas Polok, Viorela Ila, Pavel Smrz {ipolok,ila,smrz}@fit.vutbr.cz Brno University of Technology, Faculty of Information Technology. Bozetechova 2, 612
More informationFast Tridiagonal Solvers on GPU
Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based
More informationParallel Prefix Sum (Scan) with CUDA. Mark Harris
Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com March 2009 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release March 25,
More informationData Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich
Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application
More informationCUDA Performance Optimization
Mitglied der Helmholtz-Gemeinschaft CUDA Performance Optimization GPU Programming with CUDA April 25-27, 2016 Jiri Kraus (NVIDIA) based on work by Andrew V. Adinetz What you will learn: What is memory
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More information