GPU Sparse Graph Traversal


1 GPU Sparse Graph Traversal
Duane Merrill (NVIDIA), Michael Garland (NVIDIA), Andrew Grimshaw (University of Virginia)

2 Breadth-first search (BFS)
1. Pick a source node
2. Rank every vertex by the length of its shortest path from the source (or label every vertex by its predecessor)
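
A sequential reference implementation makes the problem statement concrete: label every vertex with its BFS distance from the source over a graph stored in compressed sparse row (CSR) form. This sketch is illustrative only; the array names row_offsets and column_indices are assumptions, not taken from the slides.

#include <queue>
#include <vector>

// Reference BFS over a CSR graph: dist[v] = number of edges on the shortest
// path from source to v, or -1 if v is unreachable.
std::vector<int> bfs_reference(const std::vector<int>& row_offsets,
                               const std::vector<int>& column_indices,
                               int source)
{
    const int n = static_cast<int>(row_offsets.size()) - 1;
    std::vector<int> dist(n, -1);
    std::queue<int> frontier;

    dist[source] = 0;
    frontier.push(source);

    while (!frontier.empty()) {
        int u = frontier.front();
        frontier.pop();
        // Inspect u's adjacency list.
        for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
            int v = column_indices[e];
            if (dist[v] == -1) {            // first discovery = shortest path
                dist[v] = dist[u] + 1;
                frontier.push(v);
            }
        }
    }
    return dist;
}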

3 BFS applications
A common algorithmic building block:
- Tracking the reachable heap during garbage collection
- Belief propagation (statistical inference)
A simple performance analog for many applications:
- Pointer-chasing
- Work queues
- Finding community structure
- Path-finding
A core benchmark kernel: Graph 500, Rodinia, Parboil
[Example graphs pictured: a 3D lattice and the Wikipedia '07 link graph]

4 Conventional wisdom: GPUs would be poorly suited for BFS
- Not pleasingly data-parallel (if you want to solve it efficiently)
- Insufficient atomic throughput for cooperation

5 Conventional wisdom: GPUs would be poorly suited for BFS
- Not pleasingly data-parallel (if you want to solve it efficiently)
- Insufficient atomic throughput for cooperation
- Issues with scaling to 10,000s of processing elements: workload imbalance and SIMD underutilization
- Not enough parallelism in all graphs

6 The punchline
We show GPUs provide exceptional performance: billions of traversed edges per second per socket, for diverse synthetic and real-world datasets.
- Linear-work algorithm
- Uses efficient parallel prefix sum for cooperative data movement
[Chart: billions of traversed edges/sec/socket vs. average traversal depth, comparing the Tesla C2050 against multi-core (per socket) results from Agarwal et al. (SC '10) and Leiserson et al. (SPAA '10), and a sequential baseline]

7 Level-synchronous parallelization strategies

8 (a) Quadratic-work approach
Typical of GPU BFS implementations (e.g., Harish et al., Deng et al., Hong et al.):
1. Inspect every vertex at every time step
2. If a vertex was updated, update its neighbors
3. Repeat
Quadratic O(n² + m) work, but trivial data-parallel stencils: one thread per vertex.
[Figure: example graph with vertices t0 through t17]
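
A rough sketch of this one-thread-per-vertex formulation, assuming CSR arrays row_offsets/column_indices and a labels array of BFS depths (these names are assumptions, and this is not the code of the cited authors):

// One BFS time step of the quadratic-work approach: every thread inspects its
// vertex every iteration; vertices labeled with the current depth update all
// of their unvisited neighbors.
__global__ void bfs_quadratic_step(const int* row_offsets,
                                   const int* column_indices,
                                   int*       labels,        // BFS depth, -1 if unvisited
                                   int        depth,         // current iteration number
                                   int        num_vertices,
                                   int*       changed)       // set if any label was updated
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || labels[v] != depth) return;

    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int neighbor = column_indices[e];
        if (labels[neighbor] == -1) {         // benign race: every writer stores depth + 1
            labels[neighbor] = depth + 1;
            *changed = 1;
        }
    }
}
// Host loop: launch with depth = 0, 1, 2, ... until *changed remains 0.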

9 (b) Linear-work approach
1. Expand neighbors (vertex frontier → edge frontier)
2. Filter out previously visited vertices and duplicates (edge frontier → vertex frontier)
3. Repeat
Linear O(n + m) work, but needs cooperative buffers for tracking the frontiers.
[Chart: logical edge-frontier and vertex-frontier sizes (referenced vertex identifiers) vs. distance from source]
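
In the simplest realization of step 1, each frontier thread appends its neighbors to a shared edge-frontier queue by bumping an atomic counter. The sketch below (with assumed names) shows that naive version; the following slides explain why the global atomic becomes the bottleneck and how prefix sum replaces it.

// Naive linear-work expansion: one thread per vertex in the input frontier,
// appending neighbors to the output (edge) frontier with a global atomic.
__global__ void expand_naive(const int* row_offsets,
                             const int* column_indices,
                             const int* vertex_frontier,
                             int        frontier_size,
                             int*       edge_frontier,
                             int*       edge_frontier_size)   // shared queue counter
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int v = vertex_frontier[i];
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int slot = atomicAdd(edge_frontier_size, 1);   // one global atomic per edge
        edge_frontier[slot] = column_indices[e];
    }
}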

10 Quadratic vs. linear
Quadratic is simply too much work.
[Chart: referenced vertex identifiers (millions) per BFS iteration, quadratic vs. linear, for a 3D Poisson lattice (300³ = 27M vertices)]
2.5x to 2,300x speedup vs. quadratic GPU methods.

11 Goal: GPU competence on diverse datasets (not a one-trick pony)
- Wikipedia page links (social)
- 3D Poisson grid (cubic lattice)
- R-MAT (random, power-law, small-world)
- Europe road atlas (Euclidean space)
- PDE-constrained optimization (non-linear KKT)
- Auto transmission manifold (tetrahedral mesh)
[Plots: frontier size in vertices (millions) vs. search depth for each dataset]

12 Challenges

13 Linear-work challenges for parallelization
Generic challenges:
1. Load imbalance between cores (use coarse work-stealing queues)
2. Bandwidth inefficiency (use a status bitmask when filtering already-visited vertices)
GPU-specific challenges:
1. Cooperative allocation (global queues)
2. Load imbalance within SIMD lanes
3. Poor locality within SIMD lanes
4. Simultaneous discovery (i.e., the benign race condition)

15 (1) Cooperative data movement (queuing)
Problem: we need shared work queues.
GPU rate limits (C2050):
- 16 billion vertex identifiers / sec (124 GB/s) of memory transfer
- Only 67M global atomics / sec (a 238x slowdown)
- Only 600M shared-memory atomics / sec (a 27x slowdown)
Solution: compute enqueue offsets using prefix sum.
[Chart: transfer rate (GB/s, log scale) vs. vertex stride between atomic operations]

16 Prefix sum for allocation
Each output is the sum of the previous inputs (an exclusive prefix sum), computed in O(n) work.
Example: allocation requirements of [2, 1, 0, 3] for threads t0..t3 scan to [0, 2, 3, 3].
Use the results as scatter offsets.
Fits the GPU machine model well: proceeds at copy bandwidth, with only ~8 thread-instructions per input.
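
One way to reproduce the offsets in the example above is an exclusive scan over the per-thread allocation requirements. The sketch below uses Thrust, which ships with the CUDA toolkit; the slides do not prescribe any particular scan implementation.

#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // Per-thread allocation requirements, e.g. adjacency-list sizes.
    std::vector<int> need_h = {2, 1, 0, 3};
    thrust::device_vector<int> need(need_h.begin(), need_h.end());
    thrust::device_vector<int> offsets(need.size());

    // Exclusive prefix sum: offsets[i] = sum of need[0..i-1]  ->  [0, 2, 3, 3].
    thrust::exclusive_scan(need.begin(), need.end(), offsets.begin());

    // Thread i may scatter its need[i] items starting at offsets[i]; the total
    // number of enqueued items is offsets.back() + need.back() (6 here).
    int total = offsets.back() + need.back();
    std::printf("total enqueued items: %d\n", total);
    return 0;
}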

17 Prefix sum for edge expansion
Scanning the adjacency-list sizes of the current vertex frontier gives each thread its offset into the expanded edge frontier.
Example: adjacency-list sizes of [2, 1, 0, 3] for threads t0..t3 scan to offsets [0, 2, 3, 3] in the edge frontier.
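
Putting the scanned offsets to work, every frontier thread can copy its adjacency list into a densely packed edge frontier without any atomics. This is a simplified serial-per-thread sketch under assumed names, not the fine-grained cooperative variant shown on the later load-balancing slides.

// Expansion using prefix-sum offsets: out_offsets is the exclusive scan of the
// adjacency-list sizes of the vertex frontier, so each thread knows exactly
// where its neighbors belong in the packed edge frontier.
__global__ void expand_scanned(const int* row_offsets,
                               const int* column_indices,
                               const int* vertex_frontier,
                               const int* out_offsets,       // exclusive scan of degrees
                               int        frontier_size,
                               int*       edge_frontier)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int v     = vertex_frontier[i];
    int begin = row_offsets[v];
    int end   = row_offsets[v + 1];
    int out   = out_offsets[i];

    for (int e = begin; e < end; ++e)
        edge_frontier[out++] = column_indices[e];
}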

18 Efficient local scan (granularity coarsening)
[Diagram: scan dataflow with stages of width x16, barrier, x4, x1, x4, barrier, x16; threads serially process several items around a narrower cooperative scan]
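
The diagram sketches a raking, block-level scan. A warp-level analogue of the same granularity-coarsening idea (serial work per thread wrapped around a narrower cooperative scan) might look like the following sketch, which is not the authors' implementation:

// Granularity-coarsened exclusive scan within one warp: each thread first
// reduces its ITEMS values serially, the warp scans the 32 partial sums with
// shuffles, and each thread then re-expands its local prefixes serially.
// Assumes a full, converged warp is calling.
constexpr int ITEMS = 4;

__device__ void warp_scan_coarsened(int (&items)[ITEMS])  // in: values, out: exclusive prefixes
{
    int lane = threadIdx.x & 31;

    // 1. Serial upsweep: per-thread total.
    int total = 0;
    for (int i = 0; i < ITEMS; ++i) total += items[i];

    // 2. Warp-wide inclusive scan of the per-thread totals (Kogge-Stone).
    int prefix = total;
    for (int d = 1; d < 32; d <<= 1) {
        int up = __shfl_up_sync(0xffffffff, prefix, d);
        if (lane >= d) prefix += up;
    }
    prefix -= total;                          // convert inclusive -> exclusive

    // 3. Serial downsweep: per-item exclusive prefixes seeded by the warp scan.
    for (int i = 0; i < ITEMS; ++i) {
        int v = items[i];
        items[i] = prefix;
        prefix += v;
    }
}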

19 (2) Poor load balancing within SIMD lanes
Problem: large variance in adjacency-list sizes, exacerbated by wide SIMD widths.
Solution: cooperative neighbor expansion; enlist nearby threads to help process each adjacency list in parallel.
a) Bad: serial expansion & processing (one thread per adjacency list)
b) Slightly better: coarse warp-centric parallel expansion
c) Best: fine-grained parallel expansion (packed by prefix sum)
[Diagram: threads t0..t95 assigned to adjacency lists under each strategy]
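
A sketch of option (b), coarse warp-centric expansion: the lanes of a warp take turns broadcasting the adjacency range of their frontier vertex, and the whole warp strip-mines that range together, so one very long list no longer stalls a single lane. The names and the precomputed out_offsets array are assumptions; the fine-grained variant (c) packs work more tightly than this.

// Warp-centric expansion: each lane owns one frontier vertex (if in range),
// and the full warp cooperatively copies each lane's adjacency list into the
// edge frontier at prefix-sum-computed offsets.
__global__ void expand_warp_centric(const int* row_offsets,
                                    const int* column_indices,
                                    const int* vertex_frontier,
                                    const int* out_offsets,   // exclusive scan of degrees
                                    int        frontier_size,
                                    int*       edge_frontier)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    int begin = 0, end = 0, out = 0;
    if (tid < frontier_size) {
        int v = vertex_frontier[tid];
        begin = row_offsets[v];
        end   = row_offsets[v + 1];
        out   = out_offsets[tid];
    }

    // Process the 32 lanes' vertices one at a time, with all lanes helping.
    for (int src = 0; src < 32; ++src) {
        int b = __shfl_sync(0xffffffff, begin, src);
        int e = __shfl_sync(0xffffffff, end,   src);
        int o = __shfl_sync(0xffffffff, out,   src);
        for (int k = b + lane; k < e; k += 32)
            edge_frontier[o + (k - b)] = column_indices[k];
    }
}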

21 (3) Poor locality within SIMD lanes
Problem: the referenced adjacency lists are arbitrarily located, exacerbated by wide SIMD widths; we can't afford to have SIMD threads streaming through unrelated data.
Solution: cooperative neighbor expansion.
a) Bad: serial expansion & processing
b) Slightly better: coarse warp-centric parallel expansion
c) Best: fine-grained parallel expansion (packed by prefix sum)
[Diagram: same thread-to-adjacency-list assignment illustration as the previous slide]

22 (4) SIMD simultaneous discovery
Duplicates in the edge frontier cause redundant work, exacerbated by wide SIMD and compounded every iteration.
[Diagram: duplicate discovery across iterations 1 through 4]

23 Simultaneous discovery
Normally not a problem for CPU implementations (serial adjacency-list inspection), but a big problem for cooperative SIMD expansion.
[Charts: logical edge frontier, logical vertex frontier, actually expanded, and actually contracted vertices (millions) vs. search depth, for spatially-descriptive and power-law datasets]
Solution: localized duplicate removal using hashes in local scratch memory.
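
A best-effort sketch of such localized culling, using a small hash table in shared-memory scratch and no atomics. The function name, table size, and calling convention are assumptions; hash collisions simply let the occasional duplicate through, which is harmless because contraction re-checks visited status anyway.

// CTA-local duplicate culling: each thread hashes its candidate vertex into a
// small shared table and writes its id; after a barrier, threads holding the
// same vertex collapse to a single survivor. Every thread of the block must
// call this (it contains barriers); HASH_SLOTS must be a power of two, and
// lanes without a real candidate should pass any non-negative dummy vertex.
constexpr int HASH_SLOTS = 128;

__device__ bool is_duplicate(int vertex, int* hash /* __shared__ int[HASH_SLOTS] */)
{
    int slot = vertex & (HASH_SLOTS - 1);
    hash[slot] = vertex;                        // racing writes: one value wins
    __syncthreads();
    bool hashed = (hash[slot] == vertex);       // false on a collision: keep, can't tell
    __syncthreads();
    if (hashed) hash[slot] = threadIdx.x;       // tie-break among threads with this vertex
    __syncthreads();
    return hashed && (hash[slot] != threadIdx.x);   // true: another thread kept this vertex
}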

24 Absolute and relative performance

25 Single-socket comparison
[Table: for each graph (europe.osm, grid5pt, hugebubbles, grid7pt.300, nlpkkt160, audikw1, cage15, kkt_power, copapersciteseer, wikipedia, kron_g500-logn20, random.2mv.128me, rmat.2mv.128me): spy plot, average search depth, billions of traversed edges/sec (TE/s) and parallel speedup on a multi-core Nehalem socket, and billions of TE/s and parallel speedup on a Tesla C2050]
CPU configurations: 2.5 GHz Core i7 4-core (Leiserson et al.) and 2.7 GHz Nehalem-EX Xeon (Agarwal et al.), compared against a 3.4 GHz Core i7 2600K (Sandy Bridge).

27 Summary
- Quadratic-work approaches are uncompetitive
- Dynamic workload management: prefix sum (vs. atomic-add)
- Utilization requires fine-grained expansion and contraction
- GPUs can be very amenable to dynamic, cooperative problems

28 Questions?

29 Experimental corpus
[Table columns: graph, spy plot, description, average search depth, vertices (millions), edges (millions)]
- europe.osm: road network
- grid5pt: 2D Poisson stencil
- hugebubbles: 2D mesh
- grid7pt.300: 3D Poisson stencil
- nlpkkt160: constrained optimization problem
- audikw1: finite element matrix
- cage15: transition probability matrix
- kkt_power: optimization (KKT)
- copapersciteseer: citation network
- wikipedia: Wikipedia page links
- kron_g500-logn20: Graph500 random graph
- random.2mv.128me: uniform random graph
- rmat.2mv.128me: RMAT random graph

30 Comparison of expansion techniques
[Chart]

31 Coupling of expansion & contraction phases
Alternatives:
- Expand-contract: wholly realize the vertex frontier in DRAM; suitable for all types of BFS iterations
- Contract-expand: wholly realize the edge frontier in DRAM; even better for small, fleeting BFS iterations
- Two-phase: wholly realize both frontiers in DRAM; even better for large, saturating BFS iterations (surprising!)
[Chart: logical edge-frontier and vertex-frontier sizes per BFS iteration]

32 Comparison of coupling approaches
[Chart: normalized traversal rate (10^9 edges/sec) for the expand-contract, contract-expand, two-phase, and hybrid strategies]

33 Local duplicate culling
Contraction uses local collision hashes in shared-memory (smem) scratch space:
- Hash per warp (instantaneous coverage)
- Hash per CTA (recent-history coverage)
Redundant work is under 10% in all datasets, with no atomics needed.
[Chart: redundant expansion factor (log scale) for simple filtering vs. local duplicate culling across datasets]

34 Multi-GPU traversal
1. Expand neighbors
2. Sort by GPU (with filter)
3. Read from peer GPUs (with filter)
4. Repeat
[Chart: normalized traversal rate (10^9 edges/sec) for 1, 2, 3, and 4 Tesla C2050 GPUs]
