Duane Merrill (NVIDIA) Michael Garland (NVIDIA)

Size: px
Start display at page:

Download "Duane Merrill (NVIDIA) Michael Garland (NVIDIA)"

Transcription

1 Duane Merrill (NVIDIA) Michael Garland (NVIDIA)

2 Agenda Sparse graphs Breadth-first search (BFS) Maximal independent set (MIS) BFS MIS

3

4 Sparse graphs Graph G(V, E) where n = V and m = E Convey structure via relationships Wikipedia 07 3D lattice 4

5 Graph sparsity Adjacency matrix of graph G(n, m) v 0 v 0 v 1 v 1 v 1 v v v 0 v v 3 v 3 v 1 v 3 Spy plots 5

6 Compressed sparse row (CSR) representation v 0 v 1 v v 3 Adjacency matrix of graph G(n, m) v 0 v 0 v 1 v 1 v 1 v v v 0 v v 3 v 3 v 1 v 3 Column indices: O(m) [0] [1] [] [3] [4] [5] [6] [7] [8] Row offsets: O(n) [0] [1] [] [3] [4]

7

8 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source 0 8

9 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source

10 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source

11 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source

12 Breadth-first search (BFS) 1. Pick a source node. Rank every vertex by the length of shortest path from source (or by predecessor vertex) 1

13 BFS applications Algorithmic primitive Reachability (e.g., garbage collection) Belief propagation (statistical inference) Simple performance-analog for many applications Pointer-chasing Work queues Finding community structure Path-finding Core benchmark kernel Graph 500 Rodinia Parboil 13

14 The sequential algorithm Cycles newly-discovered-but-unvisited vertices through a queue 1. Dequeue vertex v. For all neighbors n i of v: dist[n i ] = dist[v] + 1 Enqueue n i 14 a b c e f d u g h i j k l m n o p r s t v Q: {a} Q: {b, c, d, e, f} a b c e f d u g h i j k l m n o p r s t v

15 Common parallelization strategies Level-synchronous: parallel work done in epochs by level a) Quadratic work b) Linear work Randomized: parallel work done within independent sub-problems a) Ullman-Yannakakis

16 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 16

17 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 17

18 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 18

19 Quadratic-work approach 1. Inspect every vertex at every time step E.g., one thread per vertex. If updated, update its neighbors 3. Repeat until no further updates 19

20 Quadratic-work approach Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s Output: Array dist[0..n-1] with dist[v] holding the distance from s to v kernel kernel 1. parallel for (i in V) :. dist[i] := 3. dist[s] := 0 4. iteration := 0 5. do : 6. done := true 7. parallel for (i in V) : 8. if (dist[i] == iteration) 9. done := false 10. for (offset in R[i].. R[i+1]-1) : 11. j := C[offset] 1. dist[j] = iteration iteration while (!done) Good: trivial data-parallel GPU stencils Bad: worst-case quadratic O(n +m) work Harish et al., Deng et al. Hong et al 0

21 Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Active edges per iteration Search Depth Search Depth Search Depth Wikipedia 3D cubic lattice R-MAT small-world Search Depth Search Depth Search Depth Europe road atlas PDE-constrained optimization Auto transmission mesh 1

22 Active vertex identifiers Linear-work approach 1. Expand neighbors in parallel Enqueue adjacency lists Vertex frontier Edge-frontier. Filter out previously-visited vertices in parallel Edge-frontier Vertex frontier Logical Edge Frontier Logical Vertex Frontier 3. Repeat 0 Search depth Good: Linear O(n+m) work Merrill et al. (PPoPP 1), Luo et al., etc. Challenging: Need shared queues for tracking frontiers

23 Referenced vertex identifiers (millions) Quadratic vs. linear Quadratic Linear 3D Poisson Lattice (300 3 = 7M vertices) BFS iteration 130x more work!! 3

24 Linear-work challenges for parallelization Generic challenges 1. Load imbalance between cores Coarse workstealing queues. Bandwidth inefficiency Use status bitmask when filtering already-visited nodes GPU-specific challenges 1. Cooperative allocation (global queues). Load imbalance within SIMD lanes 3. Poor locality within SIMD lanes 4. Simultaneous discovery (i.e., the benign race condition) 4

25 Linear-work challenges for parallelization Generic challenges 1. Load imbalance between cores Coarse workstealing queues. Bandwidth inefficiency Use status bitmask when filtering already-visited nodes GPU-specific challenges 1. Cooperative allocation (global queues). Load imbalance within SIMD lanes 3. Poor locality within SIMD lanes 4. Simultaneous discovery (i.e., the benign race condition) 5

26 Available bandwidth (GB/s) 1) Cooperative data movement (queuing) Problem: Need shared work queues GPU rate-limits (C050): 16 billion vertex-identifiers / sec (14 GB/s) Only 67M global-atomics / sec (38x slowdown) Only 600M smem-atomics / sec (7x slowdown) Solution: Compute enqueue offsets using prefix-sum 0.1 1:1 1:8 1:64 1:51 1:4096 1:3768 1:6144 Frequency of atomic ops per vertex dequeued 6

27 Prefix sum for allocation Allocation requirement Result of prefix sum Output A C 1 0 D 3 t 0 t 1 t t 3 A 0 C 3 D 3 t 0 t 1 t t Each output is computed to be the sum of the previous inputs O(n) work Use results as a scatter offsets Fits the GPU machine model well Proceed at copy-bandwidth Only ~8 thread-instructions per input 7

28 Prefix sum for edge expansion Adjacency list size A C 1 0 D 3 Result of prefix sum Expanded edge frontier t 0 t 1 t t 3 A 0 C 3 D 3 t 0 t 1 t t

29 Data-flow of efficient CTA scan (granularity coarsening) x16 barrier x4 x1 x4 barrier x16 (For good GPU scan details, see Baxter s 9

30 ) Poor load balancing within SIMD lanes Problem: Solution: Large variance in adjacency list sizes Exacerbated by wide SIMD widths Cooperative neighbor expansion Enlist nearby threads to help process each adjacency list in parallel a) Bad: Serial expansion & processing t 1 t 3 t 0 t b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t t 3 t 4 t 5 t 31 t 3 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t t 3 t 4 t 5 30

31 ) Poor load balancing within SIMD lanes Problem: Solution: Large variance in adjacency list sizes Exacerbated by wide SIMD widths Cooperative neighbor expansion Enlist nearby threads to help process each adjacency list in parallel a) Bad: Serial expansion & processing t 1 t 3 t 0 t b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t t 3 t 4 t 5 t 31 t 3 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t t 3 t 4 t 5 31

32 3) Poor locality within SIMD lanes Problem: The referenced adjacency lists are arbitrarily located a) Bad: Serial expansion & processing t 1 t 3 t 0 t Solution: Exacerbated by wide SIMD widths Can t afford to have SIMD threads streaming through unrelated data b) Slightly better: Coarse warp-centric parallel expansion t 0 t 1 t t 3 t 4 t 5 t 31 t 3 t 33 t 34 t 35 t 36 t 37 t 63 Cooperative neighbor expansion t 64 t 65 t 66 t 67 t 68 t 69 t 95 c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t t 3 t 4 t 5 3

33 4) SIMD simultaneous discovery Iteration 1 Iteration Iteration 3 Iteration 4 Duplicates in the edge-frontier redundant work Exacerbated by wide SIMD Compounded every iteration 33

34 Vertices (millions) 4) SIMD simultaneous discovery Normally not a problem for CPU implementations Serial adjacency list inspection A big problem for cooperative SIMD expansion Logical edgefrontier Logical vertexfrontier Actually expanded Actually contracted Spatially-descriptive datasets 0.6 Power-law datasets 0.4 Solution: Localized duplicate removal using hashes in local scratch Search Depth 34

35 Single-socket comparison Graph Spy Plot Average Search Depth Billion TE/s Nehalem Parallel speedup Billion TE/s Tesla C050 Parallel speedup europe.osm x grid5pt x hugebubbles x grid7pt (4-core ).4 x x nlpkkt (4-core ) 3.0 x.5 10 x audikw x cage (4-core ).6 x. 18 x kkt_power (4-core ) 3.0 x x copapersciteseer x wikipedia (4-core ) 3. x x kron_g500-logn x random.mv.18me (8-core ) 7.0 x x rmat.mv.18me (8-core ) 6.0 x 3.3 x.5 GHz Core i7 4-core (Leiserson et al.).7 GHz EX Xeon X core (Agarwal et al) vs 3.4GHz Core i7 600K (Sandybridge)

36 Single-socket comparison Graph Spy Plot Average Search Depth Billion TE/s Nehalem Parallel speedup Billion TE/s Tesla C050 Parallel speedup europe.osm x grid5pt x hugebubbles x grid7pt (4-core ).4 x x nlpkkt (4-core ) 3.0 x.5 10 x audikw x cage (4-core ).6 x. 18 x kkt_power (4-core ) 3.0 x x copapersciteseer x wikipedia (4-core ) 3. x x kron_g500-logn x random.mv.18me (8-core ) 7.0 x x rmat.mv.18me (8-core ) 6.0 x 3.3 x.5 GHz Core i7 4-core (Leiserson et al.).7 GHz EX Xeon X core (Agarwal et al) vs 3.4GHz Core i7 600K (Sandybridge)

37

38 Maximal independent set A vertex subset U V of an undirected graph G = (V, E) U is independent if no two vertices in U are adjacent U is maximal if no node can be added without violating independence a Both {a} and {b, c} are maximal independent sets b c 38

39 Maximal independent set A vertex subset U V of an undirected graph G = (V, E) U is independent if no two vertices in U are adjacent U is maximal if no node can be added without violating independence a Both {a} and {b, c} are maximal independent sets b c (Finding the maximum independent set is generally NP-Complete.) 39

40 MIS applications Graph coloring (vertex coloring) Co-scheduling sets of independent resources Useful for GPUs: avoid atomic updates by touching items in different kernels Matching (edge coloring) Pair-wise assignments Scheduling of jobs onto hosts Matching people with desks, dance partners, kidneys, etc. Run MIS on G where: There is a node in G for every edge in G Nodes are connected in G if edges in G are adjacent 40

41 Example: MIS() for mesh refinement in algebraic multi-grid Bell et al., Cohen et al., etc. 41

42 The sequential algorithm Repeatedly add a vertex to the MIS, removing it and its neighbors from the graph: 1. Select first identifier v from V t s g r h u e n o b a v f c p i d m j l k. Add v to MIS MIS: {} 3. Remove v and neighbors(v) from V v i 4. Repeat until V = {} g u j a s h n l t r o p m k MIS: {a} 4

43 The parallel algorithm (Luby85) Repeatedly add sets to the MIS, removing them and their neighbors from the graph: i h r k d j s g o c e l m f p 1. Randomly permute identifier ordering of V. Identify all m i such that identifier m i < all neighbors(m i ) a q t n MIS: {} b u 3. Add all m i to MIS 4. Remove m i and neighbors(m i ) from V 5. Repeat until V = {} q h r k d u s g o c t l m f p a i n e b j Generally converges after O(log n) steps MIS: {a, b, c, d, e, f, k, l, p}

44 Data-parallel GPU approach Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s Output: Array MIS[0..n-1] with MIS[v] holding membership flag (1-1) in the maximal independent set constructed. kernel kernel 1. parallel for (i in V) :. MIS[i] := 0 3. do : 4. done := true 5. parallel for (i in V) : 6. if (MIS[i] == 0) 7. dependent i := false 8. done := false 9. local_min i := i 10. for (offset in R[i].. R[i+1]-1) : 11. j := C[offset] 1. if (MIS[j] == 1) 13. dependent i = true 14. else if ((MIS[j] == 0) && (permute(j) < permute(i)) 15. local_min i := j 16. if (dependent i ) 17. MIS[i] := else if (local_min i == i) 19. MIS[i] := 1 0. while (!done) Good: trivial data-parallel GPU stencils E.g., CUSP library, Hoberock, Cohen, Wang, etc. Bad: average case O(m log n) work because edges are never removed 44

45 Future work: actually remove vertices from G Don t visit all n vertices at every time step! Should significantly improve utilization in subsequent rounds Higher density of active vertices CSR isn t a particularly friendly format for vertex removal Easy: Use prefix sum to compact the row_offsets array Double the size to pair segment-begin offsets with explicit segment-end offsets Hard: No great way to find and remove vertex identifiers from adjacency lists 45

46 Summary Purely data-parallel approaches can be uncompetitive E.g., one thread per each time step Dynamic workload management: Prefix sum (vs. atomic-add) Utilization requires fine-grained expansion and contraction GPUs can be very amenable to dynamic, cooperative problems 46

47

48 GTC 013:The Premier Event in Accelerated Computing Registration is Open! Four days March 18-1, San Jose, CA Three keynotes 300+ sessions One day of pre-conference developer tutorials 100+ research posters Lots of networking events and opportunities

49 Experimental corpus Graph Spy Plot Description Average Search Depth Vertices (millions) Edges (millions) europe.osm Road network grid5pt.5000 D Poisson stencil hugebubbles-0000 D mesh grid7pt.300 3D Poisson stencil nlpkkt160 Constrained optimization problem audikw1 Finite element matrix cage15 Transition prob. matrix kkt_power Optimization (KKT) copapersciteseer Citation network wikipedia Wikipedia page links kron_g500-logn0 Graph500 random graph random.mv.18me Uniform random graph rmat.mv.18me RMAT random graph

50 Comparison of expansion techniques 50

51 Coupling of expansion & contraction phases Alternatives: Expand-contract Wholly realize vertex-frontier in DRAM Suitable for all types of BFS iterations Contract-expand Logical Edge Frontier Logical Vertex Frontier Two-phase Wholly realize edge-frontier in DRAM Even better for small, fleeting BFS iterations Wholly realize both frontiers in DRAM BFS iteration Even better for large, saturating BFS iterations (surprising!!) 51

52 10 9 edges / sec normalized Comparison of coupling approaches Expand-Contract Contract-Expand -Phase Hybrid

53 Redundant expansion factor (log) Local duplicate culling Contraction uses local collision hashes in smem scratch space: 10 Simple Local duplicate culling Hash per warp (instantaneous coverage) 4.x Hash per CTA (recent history coverage).4x.6x Redundant work < 10% in all datasets No atomics needed 1 1.1x 1.7x 1.x 1.1x 1.4x 1.1x 1.3x 53

54 10 9 edges / sec normalized Multi-GPU traversal Expand neighbors sort by GPU (with filter) read from peer GPUs (with filter) repeat C050 x1 C050 x C050 x3 C050 x

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Scalable GPU Graph Traversal

Scalable GPU Graph Traversal Scalable GPU Graph Traversal Duane Merrill University of Virginia Charlottesville Virginia USA dgmd@virginia.edu Michael Garland NVIDIA Corporation Santa Clara California USA mgarland@nvidia.com Andrew

More information

High Performance and Scalable GPU Graph Traversal

High Performance and Scalable GPU Graph Traversal High Performance and Scalable GPU Graph Traversal Duane Merrill Department of Computer Science University of Virginia Technical Report CS-- Department of Computer Science, University of Virginia Aug, Michael

More information

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)

More information

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification

More information

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for

More information

Graph Partitioning for Scalable Distributed Graph Computations

Graph Partitioning for Scalable Distributed Graph Computations Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering

More information

CS 5220: Parallel Graph Algorithms. David Bindel

CS 5220: Parallel Graph Algorithms. David Bindel CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :

More information

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015 Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path

More information

BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS LUIS CARLOS MARIA REMIS THESIS

BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS LUIS CARLOS MARIA REMIS THESIS BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS BY LUIS CARLOS MARIA REMIS THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18

GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18 ABSTRACT GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) 15-740 Computer Architecture, Spring 18 With the death of Moore's Law, programmers can no

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information

Graph traversal and BFS

Graph traversal and BFS Graph traversal and BFS Fundamental building block Graph traversal is part of many important tasks Connected components Tree/Cycle detection Articulation vertex finding Real-world applications Peer-to-peer

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Parallel Combinatorial BLAS and Applications in Graph Computations

Parallel Combinatorial BLAS and Applications in Graph Computations Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations

More information

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging

More information

Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University 2 Oracle

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

BFS preconditioning for high locality, data parallel BFS algorithm N.Vasilache, B. Meister, M. Baskaran, R.Lethin. Reservoir Labs

BFS preconditioning for high locality, data parallel BFS algorithm N.Vasilache, B. Meister, M. Baskaran, R.Lethin. Reservoir Labs BFS preconditioning for high locality, data parallel BFS algorithm N.Vasilache, B. Meister, M. Baskaran, R.Lethin Problem Streaming Graph Challenge Characteristics: Large scale Highly dynamic Scale-free

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric

More information

}w!"#$%&'()+,-./012345<ya

}w!#$%&'()+,-./012345<ya MASARYK UNIVERSITY FACULTY OF INFORMATICS }w!"#$%&'()+,-./012345

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili* *Computer Architecture and System Lab, Georgia Institute of Technology NVIDIA

More information

Algorithm Design (8) Graph Algorithms 1/2

Algorithm Design (8) Graph Algorithms 1/2 Graph Algorithm Design (8) Graph Algorithms / Graph:, : A finite set of vertices (or nodes) : A finite set of edges (or arcs or branches) each of which connect two vertices Takashi Chikayama School of

More information

CS 206 Introduction to Computer Science II

CS 206 Introduction to Computer Science II CS 206 Introduction to Computer Science II 04 / 06 / 2018 Instructor: Michael Eckmann Today s Topics Questions? Comments? Graphs Definition Terminology two ways to represent edges in implementation traversals

More information

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA

More information

CSE 100: GRAPH ALGORITHMS

CSE 100: GRAPH ALGORITHMS CSE 100: GRAPH ALGORITHMS 2 Graphs: Example A directed graph V5 V = { V = E = { E Path: 3 Graphs: Definitions A directed graph V5 V6 A graph G = (V,E) consists of a set of vertices V and a set of edges

More information

Dynamic Sparse Matrix Allocation on GPUs. James King

Dynamic Sparse Matrix Allocation on GPUs. James King Dynamic Sparse Matrix Allocation on GPUs James King Graph Applications Dynamic updates to graphs Adding edges add entries to sparse matrix representation Motivation Graph operations (adding edges) (e.g.

More information

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid

More information

GPU-accelerated data expansion for the Marching Cubes algorithm

GPU-accelerated data expansion for the Marching Cubes algorithm GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction

More information

Towards Efficient Graph Traversal using a Multi- GPU Cluster

Towards Efficient Graph Traversal using a Multi- GPU Cluster Towards Efficient Graph Traversal using a Multi- GPU Cluster Hina Hameed Systems Research Lab, Department of Computer Science, FAST National University of Computer and Emerging Sciences Karachi, Pakistan

More information

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16 GPU Multisplit Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens S. Ashkiani (UC Davis) GPU Multisplit GTC 2016 1 / 16 Motivating Simple Example: Compaction Compaction (i.e., binary split) Traditional

More information

Fast BVH Construction on GPUs

Fast BVH Construction on GPUs Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California

More information

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

A Comparative Study on Exact Triangle Counting Algorithms on the GPU A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,

More information

Lesson 2 7 Graph Partitioning

Lesson 2 7 Graph Partitioning Lesson 2 7 Graph Partitioning The Graph Partitioning Problem Look at the problem from a different angle: Let s multiply a sparse matrix A by a vector X. Recall the duality between matrices and graphs:

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Fine-Grained Task Migration for Graph Algorithms using Processing in Memory

Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Paula Aguilera 1, Dong Ping Zhang 2, Nam Sung Kim 3 and Nuwan Jayasena 2 1 Dept. of Electrical and Computer Engineering University

More information

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

CS 220: Discrete Structures and their Applications. graphs zybooks chapter 10

CS 220: Discrete Structures and their Applications. graphs zybooks chapter 10 CS 220: Discrete Structures and their Applications graphs zybooks chapter 10 directed graphs A collection of vertices and directed edges What can this represent? undirected graphs A collection of vertices

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

A Simple and Practical Linear-Work Parallel Algorithm for Connectivity

A Simple and Practical Linear-Work Parallel Algorithm for Connectivity A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures

More information

NUMA-aware Graph-structured Analytics

NUMA-aware Graph-structured Analytics NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Tools and Primitives for High Performance Graph Computation

Tools and Primitives for High Performance Graph Computation Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Prefix Scan and Minimum Spanning Tree with OpenCL

Prefix Scan and Minimum Spanning Tree with OpenCL Prefix Scan and Minimum Spanning Tree with OpenCL U. VIRGINIA DEPT. OF COMP. SCI TECH. REPORT CS-2013-02 Yixin Sun and Kevin Skadron Dept. of Computer Science, University of Virginia ys3kz@virginia.edu,

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex

More information

CSE373: Data Structures & Algorithms Lecture 28: Final review and class wrap-up. Nicki Dell Spring 2014

CSE373: Data Structures & Algorithms Lecture 28: Final review and class wrap-up. Nicki Dell Spring 2014 CSE373: Data Structures & Algorithms Lecture 28: Final review and class wrap-up Nicki Dell Spring 2014 Final Exam As also indicated on the web page: Next Tuesday, 2:30-4:20 in this room Cumulative but

More information

Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion

Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Minghua Shen and Guojie Luo Peking University FPGA-February 23, 2017 1 Contents Motivation Background Search Space Reduction for

More information

Breadth First Search. Graph Traversal. CSE 2011 Winter Application examples. Two common graph traversal algorithms

Breadth First Search. Graph Traversal. CSE 2011 Winter Application examples. Two common graph traversal algorithms Breadth irst Search CSE Winter Graph raversal Application examples Given a graph representation and a vertex s in the graph ind all paths from s to the other vertices wo common graph traversal algorithms

More information

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques James Fox Collaborators Oded Green, Research Scientist (GT) Euna Kim, PhD student (GT) Federico Busato, PhD student (Universita di

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Last time Network topologies Intro to MPI Matrix-matrix multiplication Today MPI I/O Randomized Algorithms Parallel k-select Graph coloring Assignment 2 Parallel I/O Goal of Parallel

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm

Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm Department of Computer Science and Engineering Sogang University, Korea Improving Memory

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

A Simple and Practical Linear-Work Parallel Algorithm for Connectivity

A Simple and Practical Linear-Work Parallel Algorithm for Connectivity A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures

More information

CSE 100: GRAPH ALGORITHMS

CSE 100: GRAPH ALGORITHMS CSE 100: GRAPH ALGORITHMS Dijkstra s Algorithm: Questions Initialize the graph: Give all vertices a dist of INFINITY, set all done flags to false Start at s; give s dist = 0 and set prev field to -1 Enqueue

More information

Graph Algorithms. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

Graph Algorithms. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. Graph Algorithms Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico May, 0 CPD (DEI / IST) Parallel and Distributed Computing 0-0-0 / Outline

More information

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good

More information

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on

More information

Morph Algorithms on GPUs

Morph Algorithms on GPUs Morph Algorithms on GPUs Rupesh Nasre 1 Martin Burtscher 2 Keshav Pingali 1,3 1 Inst. for Computational Engineering and Sciences, University of Texas at Austin, USA 2 Dept. of Computer Science, Texas State

More information

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut

More information

Lecture 6: Input Compaction and Further Studies

Lecture 6: Input Compaction and Further Studies PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL SCOPE OF THE WORK Reliance on CPU

More information

Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference

Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference 2012 Waters Corporation 1 Agenda Overview of LC/IMS/MS 3D Data Processing 4D Data Processing

More information

Data Structures and Algorithms for Counting Problems on Graphs using GPU

Data Structures and Algorithms for Counting Problems on Graphs using GPU International Journal of Networking and Computing www.ijnc.org ISSN 85-839 (print) ISSN 85-847 (online) Volume 3, Number, pages 64 88, July 3 Data Structures and Algorithms for Counting Problems on Graphs

More information

Accelerated Load Balancing of Unstructured Meshes

Accelerated Load Balancing of Unstructured Meshes Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require

More information

A GPU Algorithm for Greedy Graph Matching

A GPU Algorithm for Greedy Graph Matching A GPU Algorithm for Greedy Graph Matching B. O. Fagginger Auer R. H. Bisseling Utrecht University September 29, 2011 Fagginger Auer, Bisseling (UU) GPU Greedy Graph Matching September 29, 2011 1 / 27 Outline

More information

Graphs & Digraphs Tuesday, November 06, 2007

Graphs & Digraphs Tuesday, November 06, 2007 Graphs & Digraphs Tuesday, November 06, 2007 10:34 PM 16.1 Directed Graphs (digraphs) like a tree but w/ no root node & no guarantee of paths between nodes consists of: nodes/vertices - a set of elements

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics

More information

CS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3

CS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3 CS 4700: Foundations of Artificial Intelligence Bart Selman Search Techniques R&N: Chapter 3 Outline Search: tree search and graph search Uninformed search: very briefly (covered before in other prerequisite

More information

Algorithms and Theory of Computation. Lecture 3: Graph Algorithms

Algorithms and Theory of Computation. Lecture 3: Graph Algorithms Algorithms and Theory of Computation Lecture 3: Graph Algorithms Xiaohui Bei MAS 714 August 20, 2018 Nanyang Technological University MAS 714 August 20, 2018 1 / 18 Connectivity In a undirected graph G

More information

Recitation 9. Prelim Review

Recitation 9. Prelim Review Recitation 9 Prelim Review 1 Heaps 2 Review: Binary heap min heap 1 2 99 4 3 PriorityQueue Maintains max or min of collection (no duplicates) Follows heap order invariant at every level Always balanced!

More information

Breadth First Search on Cost efficient Multi GPU Systems

Breadth First Search on Cost efficient Multi GPU Systems Breadth First Search on Cost efficient Multi Systems Takuji Mitsuishi Keio University 3 14 1 Hiyoshi, Yokohama, 223 8522, Japan mits@am.ics.keio.ac.jp Masaki Kan NEC Corporation 1753, Shimonumabe, Nakahara

More information

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D.

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D. NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Agenda Accelerated Computing nvgraph New Features Coming Soon Dynamic Graphs GraphBLAS 2 ACCELERATED COMPUTING 10x Performance

More information

(Re)Introduction to Graphs and Some Algorithms

(Re)Introduction to Graphs and Some Algorithms (Re)Introduction to Graphs and Some Algorithms Graph Terminology (I) A graph is defined by a set of vertices V and a set of edges E. The edge set must work over the defined vertices in the vertex set.

More information

Graph Representations and Traversal

Graph Representations and Traversal COMPSCI 330: Design and Analysis of Algorithms 02/08/2017-02/20/2017 Graph Representations and Traversal Lecturer: Debmalya Panigrahi Scribe: Tianqi Song, Fred Zhang, Tianyu Wang 1 Overview This lecture

More information

State of Art and Project Proposals Intensive Computation

State of Art and Project Proposals Intensive Computation State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

Parallel Architectures

Parallel Architectures Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36

More information