GPU Sparse Graph Traversal. Duane Merrill

1 GPU Sparse Graph Traversal Duane Merrill

2 Breadth-first search of graphs (BFS)
1. Pick a source node
2. Rank every vertex by the length of the shortest path from the source, or label every vertex by its predecessor in the traversal order

3 The punchline
Conventional wisdom: GPUs are poorly suited for dynamic problems like BFS
  Dynamic work distribution seems hard to implement
  Requires cooperation
We show GPUs provide exceptional performance
  Billions of traversed edges / second / socket vs. hundreds of millions for multi-core (SC'10, SPAA'10)
  For diverse synthetic and real-world datasets
    Small-diameter (e.g., RMAT and social networks)
    Large-diameter (e.g., road networks & spatial lattices)

4 BFS applications
A common algorithmic building block
  Tracking reachable heap during garbage collection
  Belief propagation in statistical inference
  Finding community structure in networks
  Path-finding
A simple performance analog for many applications
  Pointer-chasing
  Work queues
Core benchmark kernel
  Graph 500
  Rodinia
  Parboil
[Figure: example graphs: a 3D lattice and the Wikipedia '07 link graph]

5 Level-synchronous BFS parallelization strategies

6 Quadratic-work approach
1. Inspect every vertex at every timestep
2. If updated, update its neighbors
3. Repeat
Details
  Quadratic O(n^2 + m) work
  Trivial data-parallel stencils
  One thread per vertex
[Figure: example traversal with vertices labeled by the timestep t0..t17 at which they are discovered]
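To make the quadratic approach concrete, here is a minimal CUDA sketch of one level-synchronous step, assuming a CSR graph in row_offsets/column_indices and a labels array initialized to -1 everywhere except the source (which holds 0); the kernel and buffer names are illustrative, not the talk's code.

```cuda
// One pass of the quadratic-work approach: one thread per vertex, and every
// vertex is inspected every pass (hence O(n^2 + m) total work).
__global__ void bfs_quadratic_step(const int *row_offsets,
                                   const int *column_indices,
                                   int *labels,        // BFS depth, -1 if undiscovered
                                   int num_vertices,
                                   int level,          // depth being expanded this pass
                                   int *changed)       // flag set when any vertex is labeled
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices) return;

    // Only vertices discovered in the previous pass expand their neighbors.
    if (labels[v] != level) return;

    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int neighbor = column_indices[e];
        if (labels[neighbor] == -1) {
            labels[neighbor] = level + 1;   // benign race: all writers store the same value
            *changed = 1;
        }
    }
}
```

The host relaunches this kernel with level = 0, 1, 2, ... until changed stays zero; re-inspecting every vertex each pass is exactly the waste the linear-work approach removes.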

7 Linear-work approach
1. Expand neighbors of unexplored nodes (Expand: vertex frontier -> edge frontier)
2. Filter neighbors that are previously-visited or duplicates (Contract: edge frontier -> vertex frontier)
3. Repeat
Details
  Linear O(n + m) work
  Need cooperative buffers for tracking the frontiers
[Figure: logical edge- and vertex-frontier sizes (items explored) vs. distance from source]
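The expand/contract structure is easiest to see in a sequential reference version. The host-side sketch below (illustrative only, not the talk's GPU kernels) builds an explicit edge frontier from the vertex frontier each iteration and then contracts it by filtering already-visited vertices.

```cuda
#include <vector>

// Sequential reference of the linear-work, frontier-based BFS: O(n + m) total work.
std::vector<int> bfs_linear(const std::vector<int> &row_offsets,
                            const std::vector<int> &column_indices,
                            int source)
{
    int n = (int)row_offsets.size() - 1;
    std::vector<int> labels(n, -1);
    std::vector<int> vertex_frontier = {source};
    labels[source] = 0;

    for (int depth = 1; !vertex_frontier.empty(); ++depth) {
        // Expand: vertex frontier -> edge frontier.
        std::vector<int> edge_frontier;
        for (int v : vertex_frontier)
            for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
                edge_frontier.push_back(column_indices[e]);

        // Contract: filter previously-visited vertices (duplicates fall out here too).
        std::vector<int> next_frontier;
        for (int u : edge_frontier) {
            if (labels[u] == -1) {
                labels[u] = depth;
                next_frontier.push_back(u);
            }
        }
        vertex_frontier.swap(next_frontier);
    }
    return labels;
}
```

On the GPU the same two phases must be performed cooperatively by thousands of threads sharing the frontier buffers, which is the source of the challenges discussed next.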

8 The problem with the quadratic approach: too much work
[Figure: referenced nodes & edges (millions) per BFS iteration, quadratic vs. linear, on a 3D Poisson lattice (300^3 = 27M vertices); the quadratic approach performs many times the work of the linear approach]

9 Goal: demonstrate O(m+n) traversal on diverse datasets (not a one-trick pony)
[Figure: edge- and vertex-frontier sizes (millions of vertices) vs. search depth for six datasets: Wikipedia (social), 3D Poisson grid (cubic lattice), R-MAT (random, power-law, small-world), Europe road atlas (Euclidean space), PDE-constrained optimization (non-linear KKT), and an auto transmission manifold (tetrahedral mesh)]

10 Challenges

11 Linear-work challenges for parallelization
General challenges
  1. Load imbalance between processing elements (addressed with work-stealing queues)
  2. Bandwidth inefficiency (use a status bitmask when filtering already-visited nodes)
GPU-specific challenges
  1. Cooperative placement within global queues
  2. Poor load balancing within SIMD lanes
  3. Poor locality within SIMD lanes
  4. Simultaneous discovery (benign race condition)

12 (1) Cooperative placement within global queues
Problem: need shared work queues; threads must cooperatively determine enqueue offsets
GPU rate limits (C2050):
  Can copy ~16 billion vertex identifiers / s
  Only 67M global atomics / s (238x slowdown)
  Only 600M smem atomics / s (27x slowdown)
Solution:
  Compute local offsets using a CTA-wide prefix sum
  Compute the CTA offset using a single coarse-grained atomic-add
[Figure: dynamic expansion of adjacency lists A, C, D, G into the frontier queue]

13 Prefix sum for allocation
Input (e.g., adjacency list sizes); each output of the prefix sum is the sum of the previous inputs
  O(n) work
  Use the results as scatter offsets into the frontier output
Fits the GPU machine model well
  Proceeds at copy bandwidth
  Only ~8 thread-instructions per input
[Figure: threads holding adjacency lists A, C, D, G; the prefix sum of their sizes yields scatter offsets 0, 2, 3, 6 into the frontier output]
Merrill et al. Parallel Scan for Stream Architectures. Technical Report, University of Virginia.
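A minimal device-side sketch of this allocation pattern is shown below: each thread contributes one count (e.g., its adjacency-list length), the CTA scans the counts in shared memory, and one thread issues a single atomicAdd to reserve the CTA's block of the global queue. The function and variable names (reserve_queue_space, queue_counter) are illustrative, not the talk's code, and the scan here is a simple Hillis-Steele variant rather than the optimized scan from the cited report.

```cuda
// Each thread passes in `count` items to enqueue and receives the global queue
// offset at which to scatter them. Assumes blockDim.x <= 1024.
__device__ int reserve_queue_space(int count, int *queue_counter)
{
    __shared__ int scan[1024];
    __shared__ int cta_base;

    int tid = threadIdx.x;
    scan[tid] = count;
    __syncthreads();

    // Inclusive Hillis-Steele prefix sum across the CTA.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int addend = (tid >= stride) ? scan[tid - stride] : 0;
        __syncthreads();
        scan[tid] += addend;
        __syncthreads();
    }

    int inclusive = scan[tid];
    int exclusive = inclusive - count;            // local (within-CTA) enqueue offset

    // One coarse-grained atomic per CTA instead of one per enqueued item.
    if (tid == blockDim.x - 1)
        cta_base = atomicAdd(queue_counter, inclusive);   // last inclusive value = CTA total
    __syncthreads();

    return cta_base + exclusive;
}
```

Each thread then writes its neighbors starting at the returned offset, so global-atomic traffic drops from one operation per enqueued item to one per CTA.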

14 (2) Poor load balancing within SIMD lanes
Problem: large variance in adjacency list sizes
  E.g., power-law degree distributions
  Exacerbated by wide SIMD widths
Solution: cooperative neighbor expansion
  Enlist nearby threads to help process each adjacency list in parallel
(a) Bad: serial expansion & processing
(b) Slightly better: coarse warp-centric parallel expansion
(c) Best: fine-grained parallel expansion (packed by prefix sum)
[Figure: thread-to-adjacency-list assignment under strategies (a)-(c)]

15 (3) Poor locality within SIMD lanes
Problem: the referenced adjacency lists are arbitrarily located
  Exacerbated by wide SIMD widths and DRAM transaction sizes
  Can't afford to have SIMD threads streaming through unrelated data
Solution: cooperative neighbor expansion (same strategies (a)-(c) as above; a sketch of the warp-centric variant follows below)
(a) Bad: serial expansion & processing
(b) Slightly better: coarse warp-centric parallel expansion
(c) Best: fine-grained parallel expansion (packed by prefix sum)
[Figure: thread-to-adjacency-list assignment under strategies (a)-(c)]
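Below is a sketch of the coarse warp-centric variant (b): lanes take turns broadcasting their frontier vertex and the whole warp strides through that vertex's adjacency list, so the 32 SIMD lanes always read consecutive, related words. It assumes all 32 lanes of a warp call it together; for brevity it enqueues with a per-item atomicAdd, whereas the fine-grained variant (c) would instead reserve space with the prefix-sum scheme above. Names are illustrative, not the talk's code.

```cuda
__device__ void warp_expand(const int *row_offsets,
                            const int *column_indices,
                            int  my_vertex,      // this lane's frontier vertex (if any)
                            bool have_vertex,    // whether this lane holds a vertex
                            int *edge_frontier,
                            int *edge_counter)
{
    int lane = threadIdx.x & 31;
    for (int src_lane = 0; src_lane < 32; ++src_lane) {
        int v     = __shfl_sync(0xffffffff, my_vertex, src_lane);
        int valid = __shfl_sync(0xffffffff, have_vertex ? 1 : 0, src_lane);
        if (!valid) continue;

        // All 32 lanes cooperate on one adjacency list: coalesced, related reads.
        for (int e = row_offsets[v] + lane; e < row_offsets[v + 1]; e += 32) {
            int neighbor = column_indices[e];
            edge_frontier[atomicAdd(edge_counter, 1)] = neighbor;  // simplification, see above
        }
    }
}
```

The serialization over src_lane is why this is only "slightly better": one very long list still stalls the other lanes' work, which is what the prefix-sum-packed fine-grained variant avoids.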

16 (4) SIMD simultaneous discovery
Duplicates in the edge frontier yield redundant work
  Exacerbated by wide SIMD
  Compounded every iteration
[Figure: duplicate discoveries accumulating over iterations 1-4]

17 Simultaneous discovery
Normally not a problem for CPU implementations
  Low hardware parallelism
  Serial adjacency-list inspection
A big problem for cooperative SIMD expansion
  Spatially-descriptive datasets
  Power-law datasets
Solution: collision hashes in local scratch
  1. Hash per warp (instantaneous coverage)
  2. Hash per CTA (recent-history coverage)
[Figure: logical vs. actually-expanded and actually-contracted frontier sizes (millions of vertices) by search depth]
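A sketch of the per-warp collision hash (the instantaneous-coverage filter) follows. It relies on warp-synchronous execution, as on the Fermi-class C2050 discussed in this talk; newer architectures would need explicit __syncwarp() calls between the writes and reads. The hash size and slot function are illustrative, and warp_hash points at this warp's region of the shared-memory scratch space.

```cuda
#define WARP_HASH_SIZE 128   // illustrative size; one table per warp in smem scratch

// Returns true if `neighbor` was also just discovered by another lane of this
// warp and can therefore be culled from the edge frontier.
__device__ bool warp_cull_duplicate(volatile int *warp_hash, int neighbor)
{
    int slot = neighbor & (WARP_HASH_SIZE - 1);

    warp_hash[slot] = neighbor;     // racing writes: one vertex id wins the slot
    if (warp_hash[slot] != neighbor)
        return false;               // slot taken by a different id: can't tell, keep the item

    int lane = threadIdx.x & 31;
    warp_hash[slot] = lane;         // second race elects a single owner among duplicate lanes
    return warp_hash[slot] != lane; // another lane owns this id, so this copy is a duplicate
}
```

The per-CTA hash plays the complementary role: it remembers recently enqueued vertices across the whole CTA, so duplicates discovered slightly apart in time are also filtered.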

18 Absolute and relative performance

19 Single-socket performance comparison
[Table: traversal rate (billions of TE/s) and parallel speedup per dataset for sequential and parallel CPU implementations vs. a Tesla C2050; CPU platforms: a 3.4 GHz Core i7 2600K (Sandy Bridge), a 2.5 GHz 4-core Core i7 (Nehalem, Leiserson et al.), and a 2.7 GHz 8-core Xeon (Agarwal et al.); datasets: europe.osm, grid5pt, hugebubbles, grid7pt, nlpkkt, audikw, cage, kkt_power, copapersciteseer, wikipedia, kron_g500-logn, random.2mv.128me, rmat.2mv.128me; the per-dataset figures are not reliably recoverable from this transcription]
Harmonic mean speedup (across all / common datasets): 12x / 20x

20 Summary
BFS conclusions
  Quadratic-work approaches are uncompetitive; a linear-work algorithm is needed
  Dynamic workload management:
    Prefix sum instead of atomic read-modify-write ops
    Fine-grained expansion and contraction for high utilization
Overall conclusions
  GPUs can be very amenable to dynamic, cooperative problems
  BFS is actually well-suited to GPU strengths: it exposes lots of concurrency and is memory-bound
Merrill et al. High Performance and Scalable GPU Graph Traversal. Technical Report, Department of Computer Science, University of Virginia, Aug.

21 Questions?
Merrill, D., Garland, M., and Grimshaw, A. High Performance and Scalable GPU Graph Traversal. Technical Report, Department of Computer Science, University of Virginia, Aug.

22 Arbitrary locality
3D (7-point) Poisson lattice: adjacency lists reference nearby vertices
Wikipedia '07: adjacency lists do not reference nearby vertices
[Figure: sparsity patterns (adjacency matrices) of the two graphs]

23 Experimental corpus
[Table: graph, spy plot, description, average search depth, vertices (millions), edges (millions); the numeric columns are not recoverable from this transcription]
  europe.osm: road network
  grid5pt: 2D Poisson stencil
  hugebubbles: 2D mesh
  grid7pt.300: 3D Poisson stencil
  nlpkkt160: constrained optimization problem
  audikw1: finite element matrix
  cage15: transition probability matrix
  kkt_power: optimization (KKT)
  copapersciteseer: citation network
  wikipedia: Wikipedia page links
  kron_g500-logn20: Graph500 random graph
  random.2mv.128me: uniform random graph
  rmat.2mv.128me: RMAT random graph

24 Compressed sparse row (CSR) representation
Logical adjacency matrix over vertices v0..v3
Column indices: O(m) entries (the concatenated adjacency lists)
Row offsets: O(n) entries (one per vertex, plus an end sentinel)
[Figure: 4x4 example adjacency matrix and its CSR arrays]
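In code, the representation is just two flat arrays; a minimal sketch with illustrative field names:

```cuda
// Compressed sparse row (CSR): column_indices concatenates all adjacency lists
// (m entries); row_offsets brackets each vertex's list (n + 1 entries, the last
// being the end sentinel m).
struct CsrGraph {
    int  num_vertices;     // n
    int  num_edges;        // m
    int *row_offsets;      // length n + 1
    int *column_indices;   // length m
};

// The neighbors of vertex v are column_indices[row_offsets[v] .. row_offsets[v+1] - 1].
__host__ __device__ inline int degree(const CsrGraph &g, int v)
{
    return g.row_offsets[v + 1] - g.row_offsets[v];
}
```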

25 Coupling of expansion & contraction phases
Alternatives:
a) Expand-contract (one kernel)
   Maintain the vertex frontier (unexplored nodes) in DRAM between BFS iterations: O(2n) bytes
   Produce and consume the edge frontier online, in-core
b) Contract-expand (one kernel)
   Maintain the edge frontier (untraversed edges) in DRAM between BFS iterations: O(2m) bytes
   Produce and consume the vertex frontier online, in-core
c) Two-phase (two kernels)
   Wholly produce both the edge and vertex frontiers in DRAM: O(m+n) bytes

26 Coupling of expansion & contraction phases
Alternatives:
a) Expand-contract (vertex frontier in DRAM): suitable for all types of BFS iterations
b) Contract-expand (edge frontier in DRAM): even better for small, fleeting BFS iterations
c) Two-phase (both frontiers in DRAM): even better for large, saturating BFS iterations (surprising!)
The two fused strategies (a & b above) suffer from the interaction between:
  TLB misses during expansion (neighbor gathering)
  Uncoalesced reads during contraction (status lookup)

27 Comparison of expansion techniques

28 Multi-GPU traversal
1. Expand neighbors
2. Sort by GPU (with filter)
3. Read from peer GPUs (with filter)
4. Repeat
[Figure: normalized traversal rate (10^9 edges / sec) for 1, 2, 3, and 4 C2050 GPUs]

29 Comparison of coupling approaches
[Figure: normalized traversal rate (10^9 edges / sec) for the Expand-Contract, Contract-Expand, 2-Phase, and Hybrid strategies]

30 Local duplicate culling
Contraction uses local collision hashes in smem scratch space:
  1. Hash per warp (instantaneous coverage)
  2. Hash per CTA (recent-history coverage)
Redundant work reduced to < 10% in all datasets
No atomics needed
[Figure: redundant expansion factor (log scale) per dataset, simple expansion vs. local duplicate culling]

31 CTA local prefix sum
[Figure: hierarchical CTA-wide prefix sum over threads t0..t(T-1), built from smaller per-group scans whose partial results are combined]
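The CTA-wide scan can be written by hand (as in the earlier enqueue sketch) or taken from a collective-primitives library. A minimal sketch using CUB's BlockScan, with illustrative kernel and buffer names, is shown below.

```cuda
#include <cub/cub.cuh>

// Exclusive prefix sum across one CTA: each thread contributes one count
// (e.g., an adjacency-list length) and receives its local scatter offset;
// thread 0 also records the CTA total for the coarse-grained global atomic-add.
template <int BLOCK_THREADS>
__global__ void cta_scan_example(const int *counts, int *offsets, int *cta_totals)
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int idx = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int thread_count = counts[idx];

    int thread_offset, cta_total;
    BlockScan(temp_storage).ExclusiveSum(thread_count, thread_offset, cta_total);

    offsets[idx] = thread_offset;
    if (threadIdx.x == 0)
        cta_totals[blockIdx.x] = cta_total;
}
```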

32 Filter duplicates within the edge frontier
Implement local collision hashes in smem scratch space during contraction:
  1. Hash per warp (instantaneous coverage)
  2. Hash per CTA (recent-history coverage)
Redundant work reduced to < 10% in all datasets
No atomics needed
