GPU Sparse Graph Traversal. Duane Merrill
|
|
- Blake Austin
- 6 years ago
- Views:
Transcription
1 GPU Sparse Graph Traversal Duane Merrill
2 Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor in this traversal order
3 The punchline Conventional wisdom GPUs are poorly suited for dynamic problems like BFS Dynamic work distribution seems hard to implement Requires cooperation We show GPUs provide exceptional performance Billions of traversed-edges / second / socket vs. hundreds-of-millions for multi-core (SC 10, SPAA 10) For diverse synthetic and real-world datasets Small-diameter (e.g., RMAT and social networks) Large-diameter (e.g., road-network & spatial lattices) 3
4 BFS applications A common algorithmic building block Tracking reachable heap during garbage collection Belief propagation in statistical inference Simple performance-analog for many applications Pointer-chasing Work queues Finding community structure in networks Path-finding Core benchmark kernel Graph 500 Rodinia Parboil 3D lattice Wikipedia 07 4
5 Level-synchronous BFS parallelization strategies
6 Quadratic-work approach 1. Inspect every vertex at every timestep 2. If updated, update its neighbors 3. Repeat t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 Details Quadratic O(n 2 + m) work Trivial data-parallel stencils One thread per vertex t 9 t 8 t 11 t 16 t 13 t 17 t 14 t 10 t 12 t 15 6
7 Items explored Linear-work approach 1. Expand neighbors of unexplored nodes Expand: vertex frontier edge-frontier 2. Filter neighbors (previously-visited & duplicates) 3. Repeat Contract: edge-frontier vertex frontier Logical Edge Frontier Logical Vertex Frontier Details Linear O(n+m) work Need cooperative buffers for tracking frontiers Distance from source 7
8 Referenced nodes & edges (millions) The problem with quadratic approach Too much work!!! Quadratic Linear D Poisson Lattice (300 3 = 27M vertices) x work BFS iteration 8
9 Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Goal: Demonstrate O(m+n) traversal on diverse datasets (Not a one-trick pony ) Edge Frontier Vertex Frontier Edge Frontier Vertex Frontier Edge Frontier Vertex Frontier Search Depth Search Depth Search Depth Wikipedia (social) 3D Poisson grid (cubic lattice) R-MAT (random, power-law, small-world) Edge Frontier 5 Edge Frontier 5 Edge Frontier Vertex Frontier 4 Vertex Frontier 4 Vertex Frontier Search Depth Search Depth Search Depth Europe road atlas (Euclidian space) PDE-constrained optimization (non-linear KKT) Auto transmission manifold (tetrahedral mesh) 9
10 Challenges
11 Linear-work challenges for parallelization General challenges 1. Load imbalance between processing elements Workstealing queues 2. Bandwidth inefficiency Use status bitmask when filtering already-visited nodes GPU-specific challenges 1. Cooperative placement within global queues 2. Poor load balancing within SIMD lanes 3. Poor locality within SIMD lanes 4. Simultaneous discovery (benign race condition) 11
12 (1) Cooperative placement within global queues Problem: Need shared work queues Threads must cooperatively determine enqueue offsets GPU rate-limits (C2050): A C D G Can copy 16B vertex-identifiers / s Only 67M global-atomics / s (238x slowdown) Only 600M smem-atomics / s (27x slowdown) Solution: Compute local offsets using CTA-wide prefix-sum Compute CTA offset using a single coarse-grained atomic-add Dynamic expansion 12
13 Prefix sum for allocation Input (e.g., adjacency list sizes) Result of prefix sum A C D G A0 2C 3 D3 G6 Thread Thread Thread Thread Thread Each output is computed to be the sum of the previous inputs O(n) work Use results as a scatter offsets Fits the GPU machine model well 1 Proceed at copy-bandwidth Only ~8 thread-instructions per input Frontier output Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS , University of Virginia
14 (2) Poor load balancing within SIMD lanes (a) Bad: Serial expansion & processing Problem: Large variance in adjacency list sizes t 0 t 1 t 2 t 3 t 4 Solution: E.g., power-law degree distributions Exacerbated by wide SIMD widths Cooperative neighbor expansion Enlist nearby threads to help process each adjacency list in parallel (b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t 2 t 3 t 4 t 5 t 31 t 32 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 t 96 t 97 t 98 t 99 t 100 t 101 t 127 (c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 14
15 (3) Poor locality within SIMD lanes (a) Bad: Serial expansion & processing Problem: t 0 t 1 t 2 t 3 t 4 Solution: The referenced adjacency lists are arbitrarily located Exacerbated by wide SIMD widths and DRAM transaction sizes Can t afford to have SIMD threads streaming through unrelated data Cooperative neighbor expansion (b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t 2 t 3 t 4 t 5 t 31 t 32 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 t 96 t 97 t 98 t 99 t 100 t 101 t 127 (c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 15
16 (4) SIMD simultaneous discovery Duplicates in edge-frontier yield redundant work Exacerbated by wide SIMD Compounded every iteration Iteration 1 Iteration 2 Iteration 3 Iteration 4 16
17 Vertices (millions) Simultaneous discovery Normally not a problem for CPU implementations Logical edge-frontier Logical vertex-frontier Low hardware parallelism 1.2 Actually expanded Serial adjacency list inspection 1.0 Actually contracted A big problem for cooperative SIMD expansion Spatially-descriptive datasets Power-law datasets Solution: Collision hashes in local scratch Search Depth 1. Hash per warp (instantaneous coverage) 2. Hash per CTA (recent history coverage) 17
18 Absolute and relative performance 18
19 Single-socket performance comparison Sequential Parallel Graph Spy Plot Average Search Depth Sandybridge Nehalem Tesla C2050 Billion TE/s Billion TE/s Parallel speedup Billion TE/s Parallel speedup europe.osm x grid5pt x hugebubbles x grid7pt (4-core ) 3.0 x x nlpkkt (4-core ) 1.8 x x audikw x cage (4-core ) 1.8 x x kkt_power (4-core ) 2.2 x x copapersciteseer x wikipedia (4-core ) 2.7 x x kron_g500-logn x random.2mv.128me (8-core ) 5.0 x x rmat.2mv.128me (8-core ) 4.6 x x Harmonic mean (across all / common datasets) 12x / 20x 3.4GHz Core i7 2600K 2.5 GHz Core i7 4-core, Leiserson et al. 2.7 GHz EX Xeon X core, Agarwal et al.
20 Summary BFS conclusions Quadratic-work approaches are uncompetitive Need linear-work algorithm Dynamic workload management: Overall conclusions Prefix sum instead of atomic read-modify-write ops Fine-grained expansion and contraction for high utilization GPUs can be very amenable to dynamic, cooperative problems BFS is actually well-suited to GPU strengths Exposes lots of concurrency Is memory-bound Merrill et al. High performance and Scalable GPU Graph Traversal. Technical Report CS , Department of Computer Science, University of Virginia, Aug
21 Questions? Merrill, D., Garland, M., and Grimshaw, A. High Performance and Scalable GPU Graph Traversal. Technical Report CS , Department of Computer Science, University of Virginia. Aug
22 Arbitrary locality 3D (7-point )Poisson Lattice Adjacency lists reference nearby vertices Wikipedia 07 Adjacency lists do not reference nearby vertices Sparsity pattern (adjacency matrix) Sparsity pattern (adjacency matrix) 22
23 Experimental corpus Graph Spy Plot Description Average Search Depth Vertices (millions) Edges (millions) europe.osm Road network grid5pt D Poisson stencil hugebubbles D mesh grid7pt.300 3D Poisson stencil nlpkkt160 Constrained optimization problem audikw1 Finite element matrix cage15 Transition prob. matrix kkt_power Optimization (KKT) copapersciteseer Citation network wikipedia Wikipedia page links kron_g500-logn20 Graph500 random graph random.2mv.128me Uniform random graph rmat.2mv.128me RMAT random graph
24 Compressed sparse row (CSR) representation v 0 v 1 v 2 v 3 Logical adjacency matrix v 0 v 0 v 1 v 1 v 1 v 2 v 2 v 0 v 2 v 3 v 3 v 1 v 3 Column indices: O(m) [0] [1] [2] [3] [4] [5] [6] [7] [8] Row offsets: O(n) [0] [1] [2] [3] [4]
25 Coupling of expansion & contraction phases Alternatives: a) Expand-contract (one kernel) Maintain vertex-frontier (unexplored nodes) in DRAM between BFS iterations O(2n) bytes Produce and consume edge-frontier online, in-core b) Contract-expand (one kernel) Maintain edge-frontier (untraversed edges) in DRAM between BFS iterations O(2m) bytes Produce and consume vertex frontier online, in-core c) Two-phase (two kernels) Wholly produce both edge and vertex-frontiers in DRAM O(m+n) bytes 25
26 Coupling of expansion & contraction phases Alternatives: a) Expand-contract (vertex-frontier in DRAM) Suitable for all types of BFS iterations b) Contract-expand (edge-frontier in DRAM) Even better for small, fleeting BFS iterations c) Two-phase (both frontiers in DRAM) Even better for large, saturating BFS iterations (surprising!!) The two fused strategies (a & b above) suffer the interaction between: TLB misses during expansion (neighbor gathering) Uncoalesced reads during contraction (status lookup) 26
27 Comparison of expansion techniques 27
28 10 9 edges / sec normalized Multi-GPU traversal Expand neighbors sort by GPU (with filter) read from peer GPUs (with filter) repeat C2050 x1 C2050 x2 C2050 x3 C2050 x
29 10 9 edges / sec normalized Comparison of coupling approaches Expand-Contract Contract-Expand 2-Phase Hybrid
30 Redundant expansion factor (log) Local duplicate culling Contraction uses local collision hashes in smem scratch space: 1. Hash per warp (instantaneous coverage) Redundant work reduced < 10% in all datasets No atomics needed 2. Hash per CTA (recent history coverage) 10 Simple Local duplicate culling 4.2x 1 1.1x 2.4x 2.6x 1.7x 1.2x 1.1x 1.4x 1.1x n/a 1.3x 30
31 CTA local prefix sum t 1 t 0 t 2 t T/4-1 t 3T/4-1 t 3T/4+1 t T/4 t T/4 + 1 t T/4 + 2 t T/2-1 t T/2 t T/2 + 1 t T/2 + 2 t 3T/4 t 3T/4+2 t T - 1 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t 2 t 3 t 0 t 1 t T/4-2 t T/4-1 t T/4 t T/4 + 1 t T/2-2 t T/2-1 t T/2 t T/2 + 1 t 3T/4-2 t 3T/4-1 t 3T/4 t 3T/4+1 t T - 2 t T
32 Filter duplicates within edge-frontier Implement local collision hashes in smem scratch space during contraction 1. Hash per warp (instantaneous coverage) 2. Hash per CTA (recent history coverage) Redundant work reduced < 10% in all datasets No atomics needed 32
GPU Sparse Graph Traversal
GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex
More informationDuane Merrill (NVIDIA) Michael Garland (NVIDIA)
Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Agenda Sparse graphs Breadth-first search (BFS) Maximal independent set (MIS) 1 1 0 1 1 1 3 1 3 1 5 10 6 9 8 7 3 4 BFS MIS Sparse graphs Graph G(V, E) where
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationScalable GPU Graph Traversal
Scalable GPU Graph Traversal Duane Merrill University of Virginia Charlottesville Virginia USA dgmd@virginia.edu Michael Garland NVIDIA Corporation Santa Clara California USA mgarland@nvidia.com Andrew
More informationHigh Performance and Scalable GPU Graph Traversal
High Performance and Scalable GPU Graph Traversal Duane Merrill Department of Computer Science University of Virginia Technical Report CS-- Department of Computer Science, University of Virginia Aug, Michael
More informationA POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL
A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH
More informationHigh-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock
High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)
More informationAutomatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC
Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More informationCoordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin
Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationAn Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader
An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationGPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18
ABSTRACT GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) 15-740 Computer Architecture, Spring 18 With the death of Moore's Law, programmers can no
More informationGraph traversal and BFS
Graph traversal and BFS Fundamental building block Graph traversal is part of many important tasks Connected components Tree/Cycle detection Articulation vertex finding Real-world applications Peer-to-peer
More informationA Simple and Practical Linear-Work Parallel Algorithm for Connectivity
A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures
More informationLecture 6: Input Compaction and Further Studies
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationEnterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015
Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path
More informationAccelerated Load Balancing of Unstructured Meshes
Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require
More informationOptimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging
More informationCuSha: Vertex-Centric Graph Processing on GPUs
CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power
More informationParallel Combinatorial BLAS and Applications in Graph Computations
Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations
More informationNUMA-aware Graph-structured Analytics
NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11
More informationA Simple and Practical Linear-Work Parallel Algorithm for Connectivity
A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationBREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS LUIS CARLOS MARIA REMIS THESIS
BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS BY LUIS CARLOS MARIA REMIS THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer
More informationCSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search
CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric
More informationScalable Algorithmic Techniques Decompositions & Mapping. Alexandre David
Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability
More informationIntroduction to Parallel Programming Models
Introduction to Parallel Programming Models Tim Foley Stanford University Beyond Programmable Shading 1 Overview Introduce three kinds of parallelism Used in visual computing Targeting throughput architectures
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationLaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili* *Computer Architecture and System Lab, Georgia Institute of Technology NVIDIA
More information}w!"#$%&'()+,-./012345<ya
MASARYK UNIVERSITY FACULTY OF INFORMATICS }w!"#$%&'()+,-./012345
More informationMosaic: Processing a Trillion-Edge Graph on a Single Machine
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys
More informationContinuum Computer Architecture
Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005
More informationA New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader
A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem
More informationFast BVH Construction on GPUs
Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California
More informationTowards Efficient Graph Traversal using a Multi- GPU Cluster
Towards Efficient Graph Traversal using a Multi- GPU Cluster Hina Hameed Systems Research Lab, Department of Computer Science, FAST National University of Computer and Emerging Sciences Karachi, Pakistan
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationDynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA
More informationParallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware
More informationOn Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs
On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationEnterprise: Breadth-First Graph Traversal on GPUs
Enterprise: Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang Department of Electrical and Computer Engineering George Washington University ABSTRACT The Breadth-First Search (BFS) algorithm
More informationCS/EE 217 Midterm. Question Possible Points Points Scored Total 100
CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor
More informationLecture 4: Principles of Parallel Algorithm Design (part 3)
Lecture 4: Principles of Parallel Algorithm Design (part 3) 1 Exploratory Decomposition Decomposition according to a search of a state space of solutions Example: the 15-puzzle problem Determine any sequence
More informationMultipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationFalcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University
Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput
More informationAn Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos
An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationEFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM
EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL SCOPE OF THE WORK Reliance on CPU
More informationCustomized Architecture for Complex Routing Analysis: Case Study for the Convey Hybrid-Core Computer
Naval Research Laboratory Stennis Space Center, MS 39529-5004 NRL/MR/7440--14-9497 Customized Architecture for Complex Routing Analysis: Case Study for the Convey Hybrid-Core Computer Chris J. Michael
More informationAlgorithm Design (8) Graph Algorithms 1/2
Graph Algorithm Design (8) Graph Algorithms / Graph:, : A finite set of vertices (or nodes) : A finite set of edges (or arcs or branches) each of which connect two vertices Takashi Chikayama School of
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationDynamic Sparse Matrix Allocation on GPUs. James King
Dynamic Sparse Matrix Allocation on GPUs James King Graph Applications Dynamic updates to graphs Adding edges add entries to sparse matrix representation Motivation Graph operations (adding edges) (e.g.
More informationA Framework for Processing Large Graphs in Shared Memory
A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge
More informationBFS preconditioning for high locality, data parallel BFS algorithm N.Vasilache, B. Meister, M. Baskaran, R.Lethin. Reservoir Labs
BFS preconditioning for high locality, data parallel BFS algorithm N.Vasilache, B. Meister, M. Baskaran, R.Lethin Problem Streaming Graph Challenge Characteristics: Large scale Highly dynamic Scale-free
More informationTanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia
How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems
More informationA Comparative Study on Exact Triangle Counting Algorithms on the GPU
A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,
More informationImplementation of BFS on shared memory (CPU / GPU)
Kolganov A.S., MSU The BFS algorithm Graph500 && GGraph500 Implementation of BFS on shared memory (CPU / GPU) Predicted scalability 2 The BFS algorithm Graph500 && GGraph500 Implementation of BFS on shared
More informationLarge and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference
Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference 2012 Waters Corporation 1 Agenda Overview of LC/IMS/MS 3D Data Processing 4D Data Processing
More informationGPU-accelerated data expansion for the Marching Cubes algorithm
GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction
More informationAlgorithm Engineering with PRAM Algorithms
Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex
More informationScalable and High Performance Betweenness Centrality on the GPU
Scalable and High Performance Betweenness Centrality on the GPU Adam McLaughlin School of Electrical and Computer Engineering Georgia Institute of Technology Atlanta, Georgia 30332 0250 Adam27X@gatech.edu
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationManaging and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia
Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good
More informationUsing GPUs to compute the multilevel summation of electrostatic forces
Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of
More informationLecture 5: Input Binning
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 5: Input Binning 1 Objective To understand how data scalability problems in gather parallel execution motivate input binning To learn
More informationCorolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion
Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Minghua Shen and Guojie Luo Peking University FPGA-February 23, 2017 1 Contents Motivation Background Search Space Reduction for
More informationFast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs
Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University 2 Oracle
More information2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA
2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance
More informationOptimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation
Optimizing Cache Performance for Graph Analytics Yunming Zhang 6.886 Presentation Goals How to optimize in-memory graph applications How to go about performance engineering just about anything In-memory
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationCS301 - Data Structures Glossary By
CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm
More informationFine-Grained Task Migration for Graph Algorithms using Processing in Memory
Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Paula Aguilera 1, Dong Ping Zhang 2, Nam Sung Kim 3 and Nuwan Jayasena 2 1 Dept. of Electrical and Computer Engineering University
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationS8873 GBM INFERENCING ON GPU. Shankara Rao Thejaswi Nanditale, Vinay Deshpande
S8873 GBM INFERENCING ON GPU Shankara Rao Thejaswi Nanditale, Vinay Deshpande Introduction AGENDA Objective Experimental Results Implementation Details Conclusion 2 INTRODUCTION 3 BOOSTING What is it?
More informationImproving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm
Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm Department of Computer Science and Engineering Sogang University, Korea Improving Memory
More informationCS 5220: Parallel Graph Algorithms. David Bindel
CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :
More informationJulienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut
More informationExploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures
Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures MARCO CERIANI SIMONE SECCHI ANTONINO TUMEO ORESTE VILLA GIANLUCA PALERMO Politecnico di Milano - DEI,
More informationEfficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis
Efficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis Ashish Kumar Mentors - David Bader, Jason Riedy, Jiajia Li Indian Institute of Technology, Jodhpur 6 June 2014 Ashish Kumar
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationBreadth First Search on Cost efficient Multi GPU Systems
Breadth First Search on Cost efficient Multi Systems Takuji Mitsuishi Keio University 3 14 1 Hiyoshi, Yokohama, 223 8522, Japan mits@am.ics.keio.ac.jp Masaki Kan NEC Corporation 1753, Shimonumabe, Nakahara
More informationParallel Breadth First Search
CSE341T/CSE549T 11/03/2014 Lecture 18 Parallel Breadth First Search Today, we will look at a basic graph algorithm, breadth first search (BFS). BFS can be applied to solve a variety of problems including:
More informationPerformance analysis of -stepping algorithm on CPU and GPU
Performance analysis of -stepping algorithm on CPU and GPU Dmitry Lesnikov 1 dslesnikov@gmail.com Mikhail Chernoskutov 1,2 mach@imm.uran.ru 1 Ural Federal University (Yekaterinburg, Russia) 2 Krasovskii
More informationNVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D.
NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Agenda Accelerated Computing nvgraph New Features Coming Soon Dynamic Graphs GraphBLAS 2 ACCELERATED COMPUTING 10x Performance
More informationPerformance Characterization of High-Level Programming Models for GPU Graph Analytics
Performance Characterization of High-Level Programming Models for GPU Graph Analytics By YUDUO WU B.S. (Macau University of Science and Technology, Macau) 203 THESIS Submitted in partial satisfaction of
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics
More informationGPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC
GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT
More informationLocality-Aware Software Throttling for Sparse Matrix Operation on GPUs
Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1, Ari B. Hayes 1, Chi Zhang 2, Timothy Salmon 1, Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh Sparse
More information