Duane Merrill (NVIDIA) Michael Garland (NVIDIA)
|
|
- Jeffry Wilcox
- 6 years ago
- Views:
Transcription
1 Duane Merrill (NVIDIA) Michael Garland (NVIDIA)
2 Agenda Sparse graphs Breadth-first search (BFS) Maximal independent set (MIS) BFS MIS
3
4 Sparse graphs Graph G(V, E) where n = V and m = E Convey structure via relationships Wikipedia 07 3D lattice 4
5 Graph sparsity Adjacency matrix of graph G(n, m) v 0 v 0 v 1 v 1 v 1 v v v 0 v v 3 v 3 v 1 v 3 Spy plots 5
6 Compressed sparse row (CSR) representation v 0 v 1 v v 3 Adjacency matrix of graph G(n, m) v 0 v 0 v 1 v 1 v 1 v v v 0 v v 3 v 3 v 1 v 3 Column indices: O(m) [0] [1] [] [3] [4] [5] [6] [7] [8] Row offsets: O(n) [0] [1] [] [3] [4]
7
8 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source 0 8
9 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source
10 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source
11 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source
12 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source (or by predecessor vertex) 1
13 BFS applications Algorithmic primitive Reachability (e.g., garbage collection) Belief propagation (statistical inference) Simple performance-analog for many applications Pointer-chasing Work queues Finding community structure Path-finding Core benchmark kernel Graph 500 Rodinia Parboil 13
14 The sequential algorithm Cycles newly-discovered-but-unvisited vertices through a queue 1. Dequeue vertex v. For all neighbors n i of v: dist[n i ] = dist[v] + 1 Enqueue n i 14 a b c e f d u g h i j k l m n o p r s t v Q: {a} Q: {b, c, d, e, f} a b c e f d u g h i j k l m n o p r s t v
15 Common parallelization strategies Level-synchronous: parallel work done in epochs by level a) Quadratic work b) Linear work Randomized: parallel work done within independent sub-problems a) Ullman-Yannakakis
16 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 16
17 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 17
18 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 18
19 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 19
20 Quadratic-work approach Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s Output: Array dist[0..n-1] with dist[v] holding the distance from s to v kernel kernel 1. parallel for (i in V) :. dist[i] := 3. dist[s] := 0 4. iteration := 0 5. do : 6. done := true 7. parallel for (i in V) : 8. if (dist[i] == iteration) 9. done := false 10. for (offset in R[i].. R[i+1]-1) : 11. j := C[offset] 1. dist[j] = iteration iteration while (!done) Good: trivial data-parallel GPU stencils Bad: worst-case quadratic O(n +m) work Harish et al., Deng et al. Hong et al 0
21 Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Active edges per iteration Search Depth Search Depth Search Depth Wikipedia 3D cubic lattice R-MAT small-world Search Depth Search Depth Search Depth Europe road atlas PDE-constrained optimization Auto transmission mesh 1
22 Active vertex identifiers Linear-work approach 1. Expand neighbors in parallel Enqueue adjacency lists Vertex frontier Edge-frontier. Filter out previously-visited vertices in parallel Edge-frontier Vertex frontier Logical Edge Frontier Logical Vertex Frontier 3. Repeat 0 Search depth Good: Linear O(n+m) work Merrill et al. (PPoPP 1), Luo et al., etc. Challenging: Need shared queues for tracking frontiers
23 Referenced vertex identifiers (millions) Quadratic vs. linear Quadratic Linear 3D Poisson Lattice (300 3 = 7M vertices) BFS iteration 130x more work!! 3
24 Linear-work challenges for parallelization Generic challenges 1. Load imbalance between cores Coarse workstealing queues. Bandwidth inefficiency Use status bitmask when filtering already-visited nodes GPU-specific challenges 1. Cooperative allocation (global queues). Load imbalance within SIMD lanes 3. Poor locality within SIMD lanes 4. Simultaneous discovery (i.e., the benign race condition) 4
25 Linear-work challenges for parallelization Generic challenges 1. Load imbalance between cores Coarse workstealing queues. Bandwidth inefficiency Use status bitmask when filtering already-visited nodes GPU-specific challenges 1. Cooperative allocation (global queues). Load imbalance within SIMD lanes 3. Poor locality within SIMD lanes 4. Simultaneous discovery (i.e., the benign race condition) 5
26 Available bandwidth (GB/s) 1) Cooperative data movement (queuing) Problem: Need shared work queues GPU rate-limits (C050): 16 billion vertex-identifiers / sec (14 GB/s) Only 67M global-atomics / sec (38x slowdown) Only 600M smem-atomics / sec (7x slowdown) Solution: Compute enqueue offsets using prefix-sum 0.1 1:1 1:8 1:64 1:51 1:4096 1:3768 1:6144 Frequency of atomic ops per vertex dequeued 6
27 Prefix sum for allocation Allocation requirement Result of prefix sum Output A C 1 0 D 3 t 0 t 1 t t 3 A 0 C 3 D 3 t 0 t 1 t t Each output is computed to be the sum of the previous inputs O(n) work Use results as a scatter offsets Fits the GPU machine model well Proceed at copy-bandwidth Only ~8 thread-instructions per input 7
28 Prefix sum for edge expansion Adjacency list size A C 1 0 D 3 Result of prefix sum Expanded edge frontier t 0 t 1 t t 3 A 0 C 3 D 3 t 0 t 1 t t
29 Data-flow of efficient CTA scan (granularity coarsening) x16 barrier x4 x1 x4 barrier x16 (For good GPU scan details, see Baxter s 9
30 ) Poor load balancing within SIMD lanes Problem: Solution: Large variance in adjacency list sizes Exacerbated by wide SIMD widths Cooperative neighbor expansion Enlist nearby threads to help process each adjacency list in parallel a) Bad: Serial expansion & processing t 1 t 3 t 0 t b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t t 3 t 4 t 5 t 31 t 3 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t t 3 t 4 t 5 30
31 ) Poor load balancing within SIMD lanes Problem: Solution: Large variance in adjacency list sizes Exacerbated by wide SIMD widths Cooperative neighbor expansion Enlist nearby threads to help process each adjacency list in parallel a) Bad: Serial expansion & processing t 1 t 3 t 0 t b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t t 3 t 4 t 5 t 31 t 3 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t t 3 t 4 t 5 31
32 3) Poor locality within SIMD lanes Problem: The referenced adjacency lists are arbitrarily located a) Bad: Serial expansion & processing t 1 t 3 t 0 t Solution: Exacerbated by wide SIMD widths Can t afford to have SIMD threads streaming through unrelated data b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t t 3 t 4 t 5 t 31 t 3 t 33 t 34 t 35 t 36 t 37 t 63 Cooperative neighbor expansion t 64 t 65 t 66 t 67 t 68 t 69 t 95 c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t t 3 t 4 t 5 3
33 4) SIMD simultaneous discovery Iteration 1 Iteration Iteration 3 Iteration 4 Duplicates in the edge-frontier redundant work Exacerbated by wide SIMD Compounded every iteration 33
34 Vertices (millions) 4) SIMD simultaneous discovery Normally not a problem for CPU implementations Serial adjacency list inspection A big problem for cooperative SIMD expansion Logical edgefrontier Logical vertexfrontier Actually expanded Actually contracted Spatially-descriptive datasets 0.6 Power-law datasets 0.4 Solution: Localized duplicate removal using hashes in local scratch Search Depth 34
35 Single-socket comparison Graph Spy Plot Average Search Depth Billion TE/s Nehalem Parallel speedup Billion TE/s Tesla C050 Parallel speedup europe.osm x grid5pt x hugebubbles x grid7pt (4-core ).4 x x nlpkkt (4-core ) 3.0 x.5 10 x audikw x cage (4-core ).6 x. 18 x kkt_power (4-core ) 3.0 x x copapersciteseer x wikipedia (4-core ) 3. x x kron_g500-logn x random.mv.18me (8-core ) 7.0 x x rmat.mv.18me (8-core ) 6.0 x 3.3 x.5 GHz Core i7 4-core (Leiserson et al.).7 GHz EX Xeon X core (Agarwal et al) vs 3.4GHz Core i7 600K (Sandybridge)
36 Single-socket comparison Graph Spy Plot Average Search Depth Billion TE/s Nehalem Parallel speedup Billion TE/s Tesla C050 Parallel speedup europe.osm x grid5pt x hugebubbles x grid7pt (4-core ).4 x x nlpkkt (4-core ) 3.0 x.5 10 x audikw x cage (4-core ).6 x. 18 x kkt_power (4-core ) 3.0 x x copapersciteseer x wikipedia (4-core ) 3. x x kron_g500-logn x random.mv.18me (8-core ) 7.0 x x rmat.mv.18me (8-core ) 6.0 x 3.3 x.5 GHz Core i7 4-core (Leiserson et al.).7 GHz EX Xeon X core (Agarwal et al) vs 3.4GHz Core i7 600K (Sandybridge)
37
38 Maximal independent set A vertex subset U V of an undirected graph G = (V, E) U is independent if no two vertices in U are adjacent U is maximal if no node can be added without violating independence a Both {a} and {b, c} are maximal independent sets b c 38
39 Maximal independent set A vertex subset U V of an undirected graph G = (V, E) U is independent if no two vertices in U are adjacent U is maximal if no node can be added without violating independence a Both {a} and {b, c} are maximal independent sets b c (Finding the maximum independent set is generally NP-Complete.) 39
40 MIS applications Graph coloring (vertex coloring) Co-scheduling sets of independent resources Useful for GPUs: avoid atomic updates by touching items in different kernels Matching (edge coloring) Pair-wise assignments Scheduling of jobs onto hosts Matching people with desks, dance partners, kidneys, etc. Run MIS on G where: There is a node in G for every edge in G Nodes are connected in G if edges in G are adjacent 40
41 Example: MIS() for mesh refinement in algebraic multi-grid Bell et al., Cohen et al., etc. 41
42 The sequential algorithm Repeatedly add a vertex to the MIS, removing it and its neighbors from the graph: 1. Select first identifier v from V t s g r h u e n o b a v f c p i d m j l k. Add v to MIS MIS: {} 3. Remove v and neighbors(v) from V v i 4. Repeat until V = {} g u j a s h n l t r o p m k MIS: {a} 4
43 The parallel algorithm (Luby85) Repeatedly add sets to the MIS, removing them and their neighbors from the graph: i h r k d j s g o c e l m f p 1. Randomly permute identifier ordering of V. Identify all m i such that identifier m i < all neighbors(m i ) a q t n MIS: {} b u 3. Add all m i to MIS 4. Remove m i and neighbors(m i ) from V 5. Repeat until V = {} q h r k d u s g o c t l m f p a i n e b j Generally converges after O(log n) steps MIS: {a, b, c, d, e, f, k, l, p}
44 Data-parallel GPU approach Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s Output: Array MIS[0..n-1] with MIS[v] holding membership flag (1-1) in the maximal independent set constructed. kernel kernel 1. parallel for (i in V) :. MIS[i] := 0 3. do : 4. done := true 5. parallel for (i in V) : 6. if (MIS[i] == 0) 7. dependent i := false 8. done := false 9. local_min i := i 10. for (offset in R[i].. R[i+1]-1) : 11. j := C[offset] 1. if (MIS[j] == 1) 13. dependent i = true 14. else if ((MIS[j] == 0) && (permute(j) < permute(i)) 15. local_min i := j 16. if (dependent i ) 17. MIS[i] := else if (local_min i == i) 19. MIS[i] := 1 0. while (!done) Good: trivial data-parallel GPU stencils E.g., CUSP library, Hoberock, Cohen, Wang, etc. Bad: average case O(m log n) work because edges are never removed 44
45 Future work: actually remove vertices from G Don t visit all n vertices at every time step! Should significantly improve utilization in subsequent rounds Higher density of active vertices CSR isn t a particularly friendly format for vertex removal Easy: Use prefix sum to compact the row_offsets array Double the size to pair segment-begin offsets with explicit segment-end offsets Hard: No great way to find and remove vertex identifiers from adjacency lists 45
46 Summary Purely data-parallel approaches can be uncompetitive E.g., one thread per each time step Dynamic workload management: Prefix sum (vs. atomic-add) Utilization requires fine-grained expansion and contraction GPUs can be very amenable to dynamic, cooperative problems 46
47
48 GTC 013:The Premier Event in Accelerated Computing Registration is Open! Four days March 18-1, San Jose, CA Three keynotes 300+ sessions One day of pre-conference developer tutorials 100+ research posters Lots of networking events and opportunities
49 Experimental corpus Graph Spy Plot Description Average Search Depth Vertices (millions) Edges (millions) europe.osm Road network grid5pt.5000 D Poisson stencil hugebubbles-0000 D mesh grid7pt.300 3D Poisson stencil nlpkkt160 Constrained optimization problem audikw1 Finite element matrix cage15 Transition prob. matrix kkt_power Optimization (KKT) copapersciteseer Citation network wikipedia Wikipedia page links kron_g500-logn0 Graph500 random graph random.mv.18me Uniform random graph rmat.mv.18me RMAT random graph
50 Comparison of expansion techniques 50
51 Coupling of expansion & contraction phases Alternatives: Expand-contract Wholly realize vertex-frontier in DRAM Suitable for all types of BFS iterations Contract-expand Logical Edge Frontier Logical Vertex Frontier Two-phase Wholly realize edge-frontier in DRAM Even better for small, fleeting BFS iterations Wholly realize both frontiers in DRAM BFS iteration Even better for large, saturating BFS iterations (surprising!!) 51
52 10 9 edges / sec normalized Comparison of coupling approaches Expand-Contract Contract-Expand -Phase Hybrid
53 Redundant expansion factor (log) Local duplicate culling Contraction uses local collision hashes in smem scratch space: 10 Simple Local duplicate culling Hash per warp (instantaneous coverage) 4.x Hash per CTA (recent history coverage).4x.6x Redundant work < 10% in all datasets No atomics needed 1 1.1x 1.7x 1.x 1.1x 1.4x 1.1x 1.3x 53
54 10 9 edges / sec normalized Multi-GPU traversal Expand neighbors sort by GPU (with filter) read from peer GPUs (with filter) repeat C050 x1 C050 x C050 x3 C050 x
GPU Sparse Graph Traversal
GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex
More informationGPU Sparse Graph Traversal. Duane Merrill
GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationScalable GPU Graph Traversal
Scalable GPU Graph Traversal Duane Merrill University of Virginia Charlottesville Virginia USA dgmd@virginia.edu Michael Garland NVIDIA Corporation Santa Clara California USA mgarland@nvidia.com Andrew
More informationHigh Performance and Scalable GPU Graph Traversal
High Performance and Scalable GPU Graph Traversal Duane Merrill Department of Computer Science University of Virginia Technical Report CS-- Department of Computer Science, University of Virginia Aug, Michael
More informationHigh-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock
High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)
More informationA POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL
A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH
More informationAutomatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC
Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road
More informationCoordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin
Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification
More informationAn Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader
An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More informationCS 5220: Parallel Graph Algorithms. David Bindel
CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :
More informationParallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationEnterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015
Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path
More informationBREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS LUIS CARLOS MARIA REMIS THESIS
BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS BY LUIS CARLOS MARIA REMIS THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationGPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18
ABSTRACT GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) 15-740 Computer Architecture, Spring 18 With the death of Moore's Law, programmers can no
More informationCuSha: Vertex-Centric Graph Processing on GPUs
CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power
More informationGraph traversal and BFS
Graph traversal and BFS Fundamental building block Graph traversal is part of many important tasks Connected components Tree/Cycle detection Articulation vertex finding Real-world applications Peer-to-peer
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationOn Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs
On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University
More informationScalable Algorithmic Techniques Decompositions & Mapping. Alexandre David
Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationParallel Combinatorial BLAS and Applications in Graph Computations
Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations
More informationOptimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging
More informationFast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs
Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University 2 Oracle
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationBFS preconditioning for high locality, data parallel BFS algorithm N.Vasilache, B. Meister, M. Baskaran, R.Lethin. Reservoir Labs
BFS preconditioning for high locality, data parallel BFS algorithm N.Vasilache, B. Meister, M. Baskaran, R.Lethin Problem Streaming Graph Challenge Characteristics: Large scale Highly dynamic Scale-free
More informationCSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search
CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric
More information}w!"#$%&'()+,-./012345<ya
MASARYK UNIVERSITY FACULTY OF INFORMATICS }w!"#$%&'()+,-./012345
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationLaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili* *Computer Architecture and System Lab, Georgia Institute of Technology NVIDIA
More informationAlgorithm Design (8) Graph Algorithms 1/2
Graph Algorithm Design (8) Graph Algorithms / Graph:, : A finite set of vertices (or nodes) : A finite set of edges (or arcs or branches) each of which connect two vertices Takashi Chikayama School of
More informationCS 206 Introduction to Computer Science II
CS 206 Introduction to Computer Science II 04 / 06 / 2018 Instructor: Michael Eckmann Today s Topics Questions? Comments? Graphs Definition Terminology two ways to represent edges in implementation traversals
More informationDynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA
More informationCSE 100: GRAPH ALGORITHMS
CSE 100: GRAPH ALGORITHMS 2 Graphs: Example A directed graph V5 V = { V = E = { E Path: 3 Graphs: Definitions A directed graph V5 V6 A graph G = (V,E) consists of a set of vertices V and a set of edges
More informationDynamic Sparse Matrix Allocation on GPUs. James King
Dynamic Sparse Matrix Allocation on GPUs James King Graph Applications Dynamic updates to graphs Adding edges add entries to sparse matrix representation Motivation Graph operations (adding edges) (e.g.
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationGPU-accelerated data expansion for the Marching Cubes algorithm
GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction
More informationTowards Efficient Graph Traversal using a Multi- GPU Cluster
Towards Efficient Graph Traversal using a Multi- GPU Cluster Hina Hameed Systems Research Lab, Department of Computer Science, FAST National University of Computer and Emerging Sciences Karachi, Pakistan
More informationGPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16
GPU Multisplit Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens S. Ashkiani (UC Davis) GPU Multisplit GTC 2016 1 / 16 Motivating Simple Example: Compaction Compaction (i.e., binary split) Traditional
More informationFast BVH Construction on GPUs
Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California
More informationA Comparative Study on Exact Triangle Counting Algorithms on the GPU
A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,
More informationLesson 2 7 Graph Partitioning
Lesson 2 7 Graph Partitioning The Graph Partitioning Problem Look at the problem from a different angle: Let s multiply a sparse matrix A by a vector X. Recall the duality between matrices and graphs:
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationFine-Grained Task Migration for Graph Algorithms using Processing in Memory
Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Paula Aguilera 1, Dong Ping Zhang 2, Nam Sung Kim 3 and Nuwan Jayasena 2 1 Dept. of Electrical and Computer Engineering University
More informationIntroduction to Parallel & Distributed Computing Parallel Graph Algorithms
Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationA New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader
A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem
More informationCS 220: Discrete Structures and their Applications. graphs zybooks chapter 10
CS 220: Discrete Structures and their Applications graphs zybooks chapter 10 directed graphs A collection of vertices and directed edges What can this represent? undirected graphs A collection of vertices
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationA Simple and Practical Linear-Work Parallel Algorithm for Connectivity
A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures
More informationNUMA-aware Graph-structured Analytics
NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11
More informationMultipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationPrefix Scan and Minimum Spanning Tree with OpenCL
Prefix Scan and Minimum Spanning Tree with OpenCL U. VIRGINIA DEPT. OF COMP. SCI TECH. REPORT CS-2013-02 Yixin Sun and Kevin Skadron Dept. of Computer Science, University of Virginia ys3kz@virginia.edu,
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex
More informationCSE373: Data Structures & Algorithms Lecture 28: Final review and class wrap-up. Nicki Dell Spring 2014
CSE373: Data Structures & Algorithms Lecture 28: Final review and class wrap-up Nicki Dell Spring 2014 Final Exam As also indicated on the web page: Next Tuesday, 2:30-4:20 in this room Cumulative but
More informationCorolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion
Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Minghua Shen and Guojie Luo Peking University FPGA-February 23, 2017 1 Contents Motivation Background Search Space Reduction for
More informationBreadth First Search. Graph Traversal. CSE 2011 Winter Application examples. Two common graph traversal algorithms
Breadth irst Search CSE Winter Graph raversal Application examples Given a graph representation and a vertex s in the graph ind all paths from s to the other vertices wo common graph traversal algorithms
More informationFast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox
Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques James Fox Collaborators Oded Green, Research Scientist (GT) Euna Kim, PhD student (GT) Federico Busato, PhD student (Universita di
More informationRandomized Algorithms
Randomized Algorithms Last time Network topologies Intro to MPI Matrix-matrix multiplication Today MPI I/O Randomized Algorithms Parallel k-select Graph coloring Assignment 2 Parallel I/O Goal of Parallel
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationAlgorithm Engineering with PRAM Algorithms
Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and
More informationImproving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm
Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm Department of Computer Science and Engineering Sogang University, Korea Improving Memory
More informationParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser
ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationA Simple and Practical Linear-Work Parallel Algorithm for Connectivity
A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures
More informationCSE 100: GRAPH ALGORITHMS
CSE 100: GRAPH ALGORITHMS Dijkstra s Algorithm: Questions Initialize the graph: Give all vertices a dist of INFINITY, set all done flags to false Start at s; give s dist = 0 and set prev field to -1 Enqueue
More informationGraph Algorithms. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.
Graph Algorithms Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico May, 0 CPD (DEI / IST) Parallel and Distributed Computing 0-0-0 / Outline
More informationManaging and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia
Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good
More informationAn Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos
An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on
More informationMorph Algorithms on GPUs
Morph Algorithms on GPUs Rupesh Nasre 1 Martin Burtscher 2 Keshav Pingali 1,3 1 Inst. for Computational Engineering and Sciences, University of Texas at Austin, USA 2 Dept. of Computer Science, Texas State
More informationJulienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut
More informationLecture 6: Input Compaction and Further Studies
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationEFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM
EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL SCOPE OF THE WORK Reliance on CPU
More informationLarge and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference
Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference 2012 Waters Corporation 1 Agenda Overview of LC/IMS/MS 3D Data Processing 4D Data Processing
More informationData Structures and Algorithms for Counting Problems on Graphs using GPU
International Journal of Networking and Computing www.ijnc.org ISSN 85-839 (print) ISSN 85-847 (online) Volume 3, Number, pages 64 88, July 3 Data Structures and Algorithms for Counting Problems on Graphs
More informationAccelerated Load Balancing of Unstructured Meshes
Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require
More informationA GPU Algorithm for Greedy Graph Matching
A GPU Algorithm for Greedy Graph Matching B. O. Fagginger Auer R. H. Bisseling Utrecht University September 29, 2011 Fagginger Auer, Bisseling (UU) GPU Greedy Graph Matching September 29, 2011 1 / 27 Outline
More informationGraphs & Digraphs Tuesday, November 06, 2007
Graphs & Digraphs Tuesday, November 06, 2007 10:34 PM 16.1 Directed Graphs (digraphs) like a tree but w/ no root node & no guarantee of paths between nodes consists of: nodes/vertices - a set of elements
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics
More informationCS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3
CS 4700: Foundations of Artificial Intelligence Bart Selman Search Techniques R&N: Chapter 3 Outline Search: tree search and graph search Uninformed search: very briefly (covered before in other prerequisite
More informationAlgorithms and Theory of Computation. Lecture 3: Graph Algorithms
Algorithms and Theory of Computation Lecture 3: Graph Algorithms Xiaohui Bei MAS 714 August 20, 2018 Nanyang Technological University MAS 714 August 20, 2018 1 / 18 Connectivity In a undirected graph G
More informationRecitation 9. Prelim Review
Recitation 9 Prelim Review 1 Heaps 2 Review: Binary heap min heap 1 2 99 4 3 PriorityQueue Maintains max or min of collection (no duplicates) Follows heap order invariant at every level Always balanced!
More informationBreadth First Search on Cost efficient Multi GPU Systems
Breadth First Search on Cost efficient Multi Systems Takuji Mitsuishi Keio University 3 14 1 Hiyoshi, Yokohama, 223 8522, Japan mits@am.ics.keio.ac.jp Masaki Kan NEC Corporation 1753, Shimonumabe, Nakahara
More informationNVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D.
NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Agenda Accelerated Computing nvgraph New Features Coming Soon Dynamic Graphs GraphBLAS 2 ACCELERATED COMPUTING 10x Performance
More information(Re)Introduction to Graphs and Some Algorithms
(Re)Introduction to Graphs and Some Algorithms Graph Terminology (I) A graph is defined by a set of vertices V and a set of edges E. The edge set must work over the defined vertices in the vertex set.
More informationGraph Representations and Traversal
COMPSCI 330: Design and Analysis of Algorithms 02/08/2017-02/20/2017 Graph Representations and Traversal Lecturer: Debmalya Panigrahi Scribe: Tianqi Song, Fred Zhang, Tianyu Wang 1 Overview This lecture
More informationState of Art and Project Proposals Intensive Computation
State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More information