GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16


1 GPU Multisplit
    Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens

2-4 Motivating Simple Example: Compaction
    - Compaction (i.e., binary split)
    - Traditional approach: Flags, Scan, Shuffle
    - e.g., splitter: 10
    - Other option: split input into two buckets
    - Can also be solved by sorting keys
        - Not always possible
        - Loses stability, i.e., initial order within buckets is not preserved
    [Figure: input keys and their buckets; output keys and their buckets; sorted keys]
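The flags, scan, and data-movement steps of a stable binary split can be sketched sequentially as follows. This is a minimal illustration, not the presenters' code: the names `exclusive_scan` and `binary_split`, the sample keys, and the choice of `key < 10` as the test for the first bucket are all assumptions made for the example.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: element i is the sum of values[0:i]."""
    return ([0] + list(accumulate(values))[:-1]) if values else []


def binary_split(keys, in_first_bucket):
    """Stable two-bucket split using flags, scans, and a scatter."""
    # Flags: 1 if the key belongs to the first bucket, else 0.
    flags = [1 if in_first_bucket(k) else 0 for k in keys]
    # Scans: rank of each key among the keys of its own bucket.
    first_rank = exclusive_scan(flags)
    second_rank = exclusive_scan([1 - f for f in flags])
    first_count = sum(flags)
    # Scatter: the second bucket starts right after the first one.
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        if flags[i]:
            out[first_rank[i]] = k
        else:
            out[first_count + second_rank[i]] = k
    return out, first_count


# Hypothetical input; splitter 10 interpreted as "key < 10".
keys = [14, 3, 27, 8, 10, 21, 5]
split, boundary = binary_split(keys, lambda k: k < 10)
print(split, boundary)
```

Keeping only `split[:boundary]` gives the compacted output; the relative order of keys inside each bucket is the same as in the input.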

5-6 What is Multisplit?
    - Multisplit (generalization of binary split)
    - Let's try multiple buckets
    - e.g., splitters: 10 and 20
    - Can also be solved by sorting keys
    [Figure: input keys and their buckets; output keys and their buckets; sorted keys and their buckets]

7 Multisplit primitive
    - Input:
        - unordered set of keys (or key-value pairs)
        - m, the number of buckets
        - a user-specified function that identifies the bucket of each key
    - Output:
        - keys (or key-value pairs) separated into m buckets
    [Figure: output laid out as consecutive buckets B_0, B_1, B_2, B_3]
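As a point of reference for the interface described above, a sequential stable multisplit might look like the sketch below. The function names, the returned bucket start offsets, and the range-based bucket function (splitters 10 and 20, with `<` comparisons) are assumptions for illustration only.

```python
def multisplit(keys, num_buckets, bucket_of):
    """Sequential reference: group keys by bucket, keeping input order inside each bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[bucket_of(k)].append(k)
    # Concatenate the buckets and record where each one starts in the output.
    out, starts = [], []
    for b in buckets:
        starts.append(len(out))
        out.extend(b)
    return out, starts


def range_bucket(key):
    """Hypothetical user-specified bucket function for splitters 10 and 20."""
    if key < 10:
        return 0
    if key < 20:
        return 1
    return 2


keys = [23, 4, 15, 9, 31, 12, 2, 18]
out, starts = multisplit(keys, 3, range_bucket)
print(out, starts)
```

Unlike a full sort, the keys inside each bucket stay in their original order (for example 4, 9, 2 in the first bucket).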

8 A Fast and Flexible Data-Organization Primitive
    - Characterizing key-value pairs into buckets
    - General load balancing
    - Priority queues
    - Single Source Shortest Path (SSSP)
        - Serial (Dijkstra): processing the vertex with the lowest weight
        - Bellman-Ford-Moore: all vertices in parallel
        - Delta-stepping formulation of SSSP [Davidson et al., 2014]
            - classifying vertices into buckets by their weights
            - processing the lowest weights in parallel
            - but no multisplit primitive: used radix sort instead
            - by using our own multisplit: 2.1x faster
    - Other applications
        - colored prefix-sum
        - reorganizing into 8 direction-based buckets in GPU-based ray tracers [Yang et al., 2013]
        - first step in building GPU hash tables [Alcantara et al., 2009]
        - in the shallow stages of k-d tree construction [Wu et al., 2011]

9-12 Common Approaches
    1. Recursive scan-based split
        - log(m) rounds of binary splits
        - e.g., buckets B_0 = {i ≤ 40}, B_1 = {i > 40}
        [Figure: initial keys; exclusive scan for B_0; right-to-left exclusive scan for B_1]
    2. Radix sort
        - sorting keys
        - overkill (sorted within buckets)
        - initial order is not preserved
        [Figure: initial keys and their binary representation; ≤ 7 splits]
    3. Reduced-bit sort
        - sort (bucket ID, key) pairs
        - ⌈log m⌉-bit bucket IDs
        [Figure: initial keys; new values; new keys; key-value sort]
    4. Randomized insertions
        - a PRAM algorithm
        - large buffers for buckets
        - random insertions
        - initial order is not preserved
        [Figure: initial keys; per-bucket buffers; compaction]
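Two of the approaches listed above can be compared side by side in a small sequential sketch. It assumes that the recursive split works through the bits of the bucket ID from least to most significant, which is one possible reading of "log(m) rounds of binary splits"; a plain stable partition stands in for each scan-based binary split, and Python's stable `sorted` stands in for a GPU key-value radix sort. All names are hypothetical.

```python
def stable_partition_by_bit(keys, ids, bit):
    """One binary split: items whose bucket-ID bit is 0 go first, order preserved."""
    pairs = list(zip(keys, ids))
    zeros = [p for p in pairs if not (p[1] >> bit) & 1]
    ones = [p for p in pairs if (p[1] >> bit) & 1]
    merged = zeros + ones
    return [k for k, _ in merged], [i for _, i in merged]


def recursive_split(keys, num_buckets, bucket_of):
    """ceil(log2(m)) rounds of stable binary splits on the bucket ID bits."""
    ids = [bucket_of(k) for k in keys]
    rounds = (num_buckets - 1).bit_length()  # equals ceil(log2(m)) for m >= 1
    for bit in range(rounds):
        keys, ids = stable_partition_by_bit(keys, ids, bit)
    return keys


def reduced_bit_sort(keys, bucket_of):
    """Sort (bucket ID, key) pairs by bucket ID only; sorted() is stable."""
    pairs = sorted(((bucket_of(k), k) for k in keys), key=lambda p: p[0])
    return [k for _, k in pairs]


def range_bucket(key):
    """Hypothetical bucket function for splitters 10 and 20."""
    return 0 if key < 10 else (1 if key < 20 else 2)


keys = [23, 4, 15, 9, 31, 12, 2, 18]
by_split = recursive_split(keys, 3, range_bucket)
by_sort = reduced_bit_sort(keys, range_bucket)
print(by_split, by_sort)
```

Both routes produce the same stable result here; they differ in how many passes over the data they need, which is what matters on a GPU.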

13-14 Designing an Efficient Approach
    Stable multisplit: unique permutation + data movement
    1. Deriving all permutations: global computations
        - histogram (h_0, ..., h_{m-1})
        - key order per bucket
        - for a key u_i \in B_j:
          p(i) = \underbrace{\sum_{k=0}^{j-1} h_k}_{\text{number of keys in previous buckets}} + \underbrace{\left|\{u_r : u_r \in B_j,\ r < i\}\right|}_{\text{number of keys before me in my own bucket}}
    2. Final data movements: global random scatters
    [Figure: keys scattered into buckets B_0, B_1, B_2, B_3]
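The permutation p(i) can be checked on a small input with the sequential sketch below. It follows the two terms of the formula directly: an exclusive scan of the histogram gives the number of keys in previous buckets, and a running per-bucket counter gives the number of earlier keys in the same bucket. Function names and sample data are assumptions for illustration.

```python
def stable_multisplit_permutation(bucket_ids, num_buckets):
    """Return p, where p[i] is the output position of key i."""
    hist = [0] * num_buckets
    rank_in_bucket = []
    for b in bucket_ids:
        rank_in_bucket.append(hist[b])  # keys before me in my own bucket
        hist[b] += 1
    # Exclusive scan of the histogram: keys in all previous buckets.
    bucket_start, running = [], 0
    for count in hist:
        bucket_start.append(running)
        running += count
    return [bucket_start[b] + r for b, r in zip(bucket_ids, rank_in_bucket)]


def scatter(keys, perm):
    """Final data movement: write key i to position perm[i]."""
    out = [None] * len(keys)
    for k, p in zip(keys, perm):
        out[p] = k
    return out


keys = [23, 4, 15, 9, 31, 12, 2, 18]
ids = [2, 0, 1, 0, 2, 1, 0, 1]  # bucket of each key for splitters 10 and 20
perm = stable_multisplit_permutation(ids, 3)
result = scatter(keys, perm)
print(perm, result)
```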

15-16 Our high-level ideas
    1. Global computations: localize computations
        - several large-enough local subproblems: local histograms
        - a single small-enough global computation: global histogram
        - several large-enough local subproblems: permutations + scatters
        - avoid shared memory and synchronization: utilize intrinsics
        - stages: local pre-scan, global scan, local post-scan
    2. Global random scatters: reordering keys locally in the last stage
        - local multisplits
        - more computational cost but better memory access pattern (coalesced writes)
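The pre-scan, scan, post-scan decomposition can be emulated sequentially as below. This is only a sketch of the idea: each `tile` stands in for one local subproblem (a warp or a block), the tile size of 4 is arbitrary, and no local reordering is performed before the final writes. The global scan runs over the local histograms in bucket-major order so that every tile learns where its share of each bucket begins.

```python
def localized_multisplit(keys, num_buckets, bucket_of, tile=4):
    """Three-stage stable multisplit over fixed-size tiles of the input."""
    tiles = [keys[s:s + tile] for s in range(0, len(keys), tile)]

    # Pre-scan (local): one histogram per tile.
    local_hist = [[0] * num_buckets for _ in tiles]
    for t, chunk in enumerate(tiles):
        for k in chunk:
            local_hist[t][bucket_of(k)] += 1

    # Scan (global): exclusive scan over m * len(tiles) counters, bucket-major.
    base = [[0] * num_buckets for _ in tiles]
    running = 0
    for j in range(num_buckets):
        for t in range(len(tiles)):
            base[t][j] = running
            running += local_hist[t][j]

    # Post-scan (local): final position = scanned base + rank inside the tile.
    out = [None] * len(keys)
    for t, chunk in enumerate(tiles):
        seen = [0] * num_buckets
        for k in chunk:
            j = bucket_of(k)
            out[base[t][j] + seen[j]] = k
            seen[j] += 1
    return out


def range_bucket(key):
    """Hypothetical bucket function for splitters 10 and 20."""
    return 0 if key < 10 else (1 if key < 20 else 2)


keys = [23, 4, 15, 9, 31, 12, 2, 18]
result = localized_multisplit(keys, 3, range_bucket)
print(result)
```

The result does not depend on the tile size; only the number of counters in the global scan changes.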

17 Granularity Tradeoffs
    We experimented with a couple of different subproblem granularities:
    1. Warp
        - warp-synchronous model with minimal warp divergence
        - fast communication via warp-wide ballots/shuffles
    2. Block
        - more expensive communication via shared memory
        - cheaper global computation (scan over m × N_blocks elements)
        - more locality to extract after reordering

    Property                  Direct MS   Warp-level MS          Block-level MS
    Subproblem                warp        warp                   block
    Reordering                none        warp-wide reordering   block-wide reordering
    Computational load        low         medium                 high
    Coalesced memory access   low         medium                 high
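To make the "cheaper global computation" point concrete, the sketch below counts how many histogram entries the global scan has to process at each granularity. The one-key-per-thread mapping, the 256-thread block size, and the input size are assumptions chosen for the arithmetic, not figures from the talk.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def scan_sizes(num_keys, num_buckets, warp_size=32, block_size=256):
    """Number of counters in the global scan: m * N_warps versus m * N_blocks."""
    n_warps = ceil_div(num_keys, warp_size)
    n_blocks = ceil_div(num_keys, block_size)
    return num_buckets * n_warps, num_buckets * n_blocks


# Hypothetical workload: 2^20 keys, 8 buckets.
warp_level, block_level = scan_sizes(1 << 20, 8)
print(warp_level, block_level)
```

Under these assumptions the block-level scan is 8 times smaller, at the price of communicating through shared memory inside each block.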

18-20 Implementation details & Optimizations: Warp-level MS
    1. Pre-scan (local):
        - read keys
        - warp histogram
            1. bit-by-bit balloting
            2. log m rounds
        - store warp histogram
    2. Scan (global):
        - exclusive scan on histograms
        - m × N_warps elements
    3. Post-scan (local):
        - read keys (key-value)
        - recompute warp histograms
        - compute local offsets
        - warp-level reordering
        - compute final positions
        - final data movement
    [Figure: the pre-scan writes per-warp histograms h_{0,l}, h_{1,l}, ..., h_{m-1,l} for warps l = 0, ..., L-1; the scan runs over them grouped by bucket (h_{0,0}, h_{0,1}, ..., h_{0,L-1}, then h_{1,0}, ..., up to h_{m-1,L-1}); the post-scan reads the scanned values]

    Warp histogram pseudocode (thread i ends up holding the count of bucket i):

    procedure warp_histogram(bucket_id[0:31])
        Input: bucket_id[0:31]
        Output: histo[0:m-1]
        for each thread i = 0:31 parallel warp do
            histo_bmp[i] = 0xFFFFFFFF;
            for (int k = 0; k < ceil(log2(m)); k++) do
                // bit j of temp_buffer is the current lowest bucket-ID bit of thread j
                temp_buffer = ballot(bucket_id[i] & 0x01);
                if ((i >> k) & 0x01) then
                    histo_bmp[i] &= temp_buffer;
                else
                    histo_bmp[i] &= XOR(0xFFFFFFFF, temp_buffer);
                end if
                bucket_id[i] >>= 1;
            end for
            // histo_bmp[i] now marks the threads whose bucket ID equals i
            histo[i] = popc(histo_bmp[i]);
        end for
        return histo[0:m-1];
    end procedure
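The bit-by-bit balloting can be emulated on the CPU to see why it yields a histogram. In the sketch below, `ballot` and `popc` are plain Python stand-ins for the warp vote and population-count intrinsics, and the warp is a list of 32 lanes. The `local_offsets` helper is an assumption about how per-key offsets inside the warp could be derived from the same bitmaps; it is not taken from the slides.

```python
WARP_SIZE = 32
FULL_MASK = 0xFFFFFFFF


def ballot(predicates):
    """Emulated warp vote: bit `lane` is set when that lane's predicate is true."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask


def popc(x):
    """Emulated population count."""
    return bin(x).count("1")


def warp_histogram(bucket_ids, num_buckets):
    """Lane i ends with a bitmap of the lanes whose bucket ID equals i."""
    assert len(bucket_ids) == WARP_SIZE and num_buckets <= WARP_SIZE
    ids = list(bucket_ids)
    histo_bmp = [FULL_MASK] * WARP_SIZE
    rounds = (num_buckets - 1).bit_length()  # ceil(log2(m))
    for k in range(rounds):
        votes = ballot([b & 1 for b in ids])
        for lane in range(WARP_SIZE):
            if (lane >> k) & 1:
                histo_bmp[lane] &= votes              # keep lanes whose bit k is 1
            else:
                histo_bmp[lane] &= FULL_MASK ^ votes  # keep lanes whose bit k is 0
        ids = [b >> 1 for b in ids]
    histo = [popc(histo_bmp[j]) for j in range(num_buckets)]
    return histo, histo_bmp


def local_offsets(bucket_ids, histo_bmp):
    """Assumed helper: rank of each lane among lower lanes with the same bucket."""
    return [popc(histo_bmp[b] & ((1 << lane) - 1))
            for lane, b in enumerate(bucket_ids)]


# Hypothetical bucket IDs for one warp, with m = 5 buckets.
ids = [(lane * 7 + 3) % 5 for lane in range(WARP_SIZE)]
histo, bitmaps = warp_histogram(ids, 5)
offsets = local_offsets(ids, bitmaps)
print(histo, offsets)
```

Each round narrows every lane's bitmap by one bit of the bucket ID, so after ceil(log2(m)) rounds only exact matches remain and a single population count gives the bucket size.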

21 Performance Evaluation
    In this presentation:
    1. NVIDIA Tesla K40c GPU
    2. Radix sort from CUB, including the one in the reduced-bit sort method
    3. Device-wide exclusive scan from CUB
    4. Uniform distribution of keys in buckets
    More results in the paper:
    1. Detailed timing of different stages of our algorithm
    2. Other GPU architectures: Maxwell
    3. Different distributions of keys
    4. Using our multisplit algorithms in the SSSP method

22 Average running time vs. number of buckets
    [Figure: average running time (msec) vs. number of buckets (m) for Block-level MS, Direct MS, reduced-bit sort, and Warp-level MS; panels (a) key-only and (b) key-value]
    - Memory access quality: Block-level MS > Warp-level MS > Direct MS
    - Computational load: Block-level MS > Warp-level MS > Direct MS

23 Performance vs. Radix Sort
    [Figure: speedup vs. number of buckets (m) for Block-level MS, Direct MS, reduced-bit sort, Warp-level MS, and binary split; panels (c) key-only and (d) key-value]
    - key-only: 3.0x-6.7x
    - key-value: 4.4x-8.0x

24 More buckets
    For more buckets than the warp width (m > 32):
    - warp histograms: each thread in charge of multiple buckets
    - shared memory capacity: the other bottleneck
    [Figure: average running time (msec) vs. number of buckets (m) for larger m, comparing Block-level MS and reduced-bit sort against radix sort, key-only and key-value]

25 Conclusions
    - Introduce a new efficient data-organization primitive
    - High performance, especially for a low or modest number of buckets
    - Full paper: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016)
    - Code will soon be available in CUDPP:

26 Thank You

27 References
    Alcantara, D. A., Sharf, A., Abbasinejad, F., Sengupta, S., Mitzenmacher, M., Owens, J. D., and Amenta, N. (2009). Real-time parallel hashing on the GPU. ACM Transactions on Graphics, 28(5):154:1-154:9.
    Davidson, A., Baxter, S., Garland, M., and Owens, J. D. (2014). Work-efficient parallel GPU methods for single source shortest paths. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014.
    Wu, Z., Zhao, F., and Liu, X. (2011). SAH KD-tree construction on GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG '11.
    Yang, X., Xu, D., and Zhao, L. (2013). Efficient data management for incoherent ray tracing. Applied Soft Computing, 13(1):1-8.


Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Algorithms and Applications

Algorithms and Applications Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers

More information

High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs Xiaochun Ye 1, Dongrui Fan 1, Wei Lin 1, Nan Yuan 1, Paolo Ienne 1, 2 1 Key Laboratory of Computer System and Architecture Institute

More information

Memory Management Method for 3D Scanner Using GPGPU

Memory Management Method for 3D Scanner Using GPGPU GPGPU 3D 1 2 KinectFusion GPGPU 3D., GPU., GPGPU Octree. GPU,,. Memory Management Method for 3D Scanner Using GPGPU TATSUYA MATSUMOTO 1 SATORU FUJITA 2 This paper proposes an efficient memory management

More information

GPGPU: Parallel Reduction and Scan

GPGPU: Parallel Reduction and Scan Administrivia GPGPU: Parallel Reduction and Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 3 due Wednesday 11:59pm on Blackboard Assignment 4 handed out Monday, 02/14 Final Wednesday

More information

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016 Ray Tracing Computer Graphics CMU 15-462/15-662, Fall 2016 Primitive-partitioning vs. space-partitioning acceleration structures Primitive partitioning (bounding volume hierarchy): partitions node s primitives

More information

GPU Task-Parallelism: Primitives and Applications. Stanley Tzeng, Anjul Patney, John D. Owens University of California at Davis

GPU Task-Parallelism: Primitives and Applications. Stanley Tzeng, Anjul Patney, John D. Owens University of California at Davis GPU Task-Parallelism: Primitives and Applications Stanley Tzeng, Anjul Patney, John D. Owens University of California at Davis This talk Will introduce task-parallelism on GPUs What is it? Why is it important?

More information

Implementation of Parallel Path Finding in a Shared Memory Architecture

Implementation of Parallel Path Finding in a Shared Memory Architecture Implementation of Parallel Path Finding in a Shared Memory Architecture David Cohen and Matthew Dallas Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 Email: {cohend4, dallam}

More information

hsa-ds: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrow-Wheeler Transform

hsa-ds: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrow-Wheeler Transform 146 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'16 hsa-ds: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrow-Wheeler Transform Yu-Cheng Liao, Yarsun Hsu Department

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Introduction to Real-Time Operating Systems

Introduction to Real-Time Operating Systems Introduction to Real-Time Operating Systems GPOS vs RTOS General purpose operating systems Real-time operating systems GPOS vs RTOS: Similarities Multitasking Resource management OS services to applications

More information

Massively Parallel A* Search on a GPU

Massively Parallel A* Search on a GPU Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Massively Parallel A* Search on a GPU Yichao Zhou and Jianyang Zeng Institute for Interdisciplinary Information Sciences Tsinghua

More information

Prefix Scan and Minimum Spanning Tree with OpenCL

Prefix Scan and Minimum Spanning Tree with OpenCL Prefix Scan and Minimum Spanning Tree with OpenCL U. VIRGINIA DEPT. OF COMP. SCI TECH. REPORT CS-2013-02 Yixin Sun and Kevin Skadron Dept. of Computer Science, University of Virginia ys3kz@virginia.edu,

More information

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1 April 4-7, 2016 Silicon Valley HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1 Max Lv, NVIDIA Brant Zhao, NVIDIA April 7 mlv@nvidia.com https://github.com/madeye Histogram of Oriented Gradients on GPU

More information

Ray Casting Deformable Models on the GPU

Ray Casting Deformable Models on the GPU Ray Casting Deformable Models on the GPU Suryakant Patidar and P. J. Narayanan Center for Visual Information Technology, IIIT Hyderabad. {skp@research., pjn@}iiit.ac.in Abstract The GPUs pack high computation

More information

Global Memory Access Pattern and Control Flow

Global Memory Access Pattern and Control Flow Optimization Strategies Global Memory Access Pattern and Control Flow Objectives Optimization Strategies Global Memory Access Pattern (Coalescing) Control Flow (Divergent branch) Global l Memory Access

More information

Spanning Trees and Optimization Problems (Excerpt)

Spanning Trees and Optimization Problems (Excerpt) Bang Ye Wu and Kun-Mao Chao Spanning Trees and Optimization Problems (Excerpt) CRC PRESS Boca Raton London New York Washington, D.C. Chapter 3 Shortest-Paths Trees Consider a connected, undirected network

More information

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Efficient Lists Intersection by CPU- GPU Cooperative Computing Efficient Lists Intersection by CPU- GPU Cooperative Computing Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, Jing Liu Nankai-Baidu Joint Lab, Nankai University Outline Introduction Cooperative

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations

More information

AMS 148 Chapter 6: Histogram, Sort, and Sparse Matrices

AMS 148 Chapter 6: Histogram, Sort, and Sparse Matrices AMS 148 Chapter 6: Histogram, Sort, and Sparse Matrices Steven Reeves Now that we have completed the more fundamental parallel primitives on GPU, we will dive into more advanced topics. Histogram is a

More information

A Study of Different Parallel Implementations of Single Source Shortest Path Algorithms

A Study of Different Parallel Implementations of Single Source Shortest Path Algorithms A Study of Different Parallel Implementations of Single Source Shortest Path s Dhirendra Pratap Singh Department of Computer Science and Engineering Maulana Azad National Institute of Technology, Bhopal

More information

A Forward-Backward Single-Source Shortest Paths Algorithm

A Forward-Backward Single-Source Shortest Paths Algorithm A Forward-Backward Single-Source Shortest Paths Algorithm Single-Source Shortest Paths in O(n) time David Wilson Microsoft Research Uri Zwick Tel Aviv Univ. Deuxièmes journées du GT CoA Complexité et Algorithmes

More information

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages! Professor: Pete Keleher! keleher@cs.umd.edu! } Keep sorted by some search key! } Insertion! Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Social graphs (Facebook, Twitter, Google+, LinkedIn, etc.) Endorsement graphs (web link graph, paper citation graph, etc.)

Social graphs (Facebook, Twitter, Google+, LinkedIn, etc.) Endorsement graphs (web link graph, paper citation graph, etc.) Large-Scale Graph Processing Algorithms on the GPU Yangzihao Wang, Computer Science, UC Davis John Owens, Electrical and Computer Engineering, UC Davis 1 Overview The past decade has seen a growing research

More information

Lecture 8 Parallel Algorithms II

Lecture 8 Parallel Algorithms II Lecture 8 Parallel Algorithms II Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Original slides from Introduction to Parallel

More information

CS61BL. Lecture 5: Graphs Sorting

CS61BL. Lecture 5: Graphs Sorting CS61BL Lecture 5: Graphs Sorting Graphs Graphs Edge Vertex Graphs (Undirected) Graphs (Directed) Graphs (Multigraph) Graphs (Acyclic) Graphs (Cyclic) Graphs (Connected) Graphs (Disconnected) Graphs (Unweighted)

More information

On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing

On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing 2018 On-the-fly for Massively-Parallel Software Geometry Processing Bernhard Kerbl Wolfgang Tatzgern Elena Ivanchenko Dieter Schmalstieg Markus Steinberger 5 4 3 4 2 5 6 7 6 3 1 2 0 1 0, 0,1,7, 7,1,2,

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing George Karypis Sorting Outline Background Sorting Networks Quicksort Bucket-Sort & Sample-Sort Background Input Specification Each processor has n/p elements A ordering

More information

Fast Radix Sort for Sparse Linear Algebra on GPU

Fast Radix Sort for Sparse Linear Algebra on GPU Fast Radix Sort for Sparse Linear Algebra on GPU Lukas Polok, Viorela Ila, Pavel Smrz {ipolok,ila,smrz}@fit.vutbr.cz Brno University of Technology, Faculty of Information Technology. Bozetechova 2, 612

More information

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

Parallel Prefix Sum (Scan) with CUDA. Mark Harris Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com March 2009 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release March 25,

More information

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

CUDA Performance Optimization

CUDA Performance Optimization Mitglied der Helmholtz-Gemeinschaft CUDA Performance Optimization GPU Programming with CUDA April 25-27, 2016 Jiri Kraus (NVIDIA) based on work by Andrew V. Adinetz What you will learn: What is memory

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information