CuSha: Vertex-Centric Graph Processing on GPUs

Size: px
Start display at page:

Download "CuSha: Vertex-Centric Graph Processing on GPUs"

Transcription

1 CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June,

2 Motivation Graph processing Real world graphs are large & sparse Power law distribution HiggsTwitter M Edges,.M Vertices LiveJournal 9M Edges, M Vertices Pokec M Edges,.M Vertices

3 Graphics Processing Unit (GPU) Single Instruction Multiple Data (SIMD) Repetitive processing patterns on regular data GPU Global Memory Streaming Multiprocessor Streaming Multiprocessor Streaming Multiprocessor Shared Memory Warp N - Warp Warp Shared Memory Warp N - Warp Warp Shared Memory Warp N - Warp Warp

4 Graphics Processing Unit (GPU) Single Instruction Multiple Data (SIMD) Repetitive processing patterns on regular data GPU Streaming Multiprocessor Global Memory Irregular Graphs Streaming Multiprocessor Streaming Multiprocessor Warp Warp Warp N - Shared Memory Irregular Memory Accesses Warp Warp Warp N - Shared Memory GPU underutilization Warp Warp Warp N - Shared Memory

5 Prior Work Using CSR assign vertex to a thread Threads iterate through assigned vertex s neighbors Non-coalesced memory accesses Work imbalance among threads Using CSR assign vertex to a virtual warp Virtual warp lanes process vertex s neighbors in parallel Non-coalesced memory accesses Work imbalance reduced but still exist Using CSR assemble frontiers Exploration based graph algorithms Non-coalesced memory accesses [Harish and Narayanan HiPC ] [Merrill et al, PPoPP ] [Hong et al, PPoPP ]

6 Prior Work Using CSR assign vertex to a thread Threads iterate through assigned vertex s neighbors Non-coalesced memory accesses Work imbalance among threads Using CSR assign vertex to a virtual warp Virtual warp lanes process vertex s neighbors in parallel Non-coalesced memory accesses Work imbalance reduced but still exist Using CSR assemble frontiers Exploration based graph algorithms Non-coalesced memory accesses [Harish and Narayanan HiPC ] Compressed Sparse Row [Merrill et al, PPoPP ] [Hong et al, PPoPP ]

7 Compressed Sparse Row (CSR) InEdgeIdxs 9 9 VertexValues x x x x x x x x s 9

8 Compressed Sparse Row (CSR) InEdgeIdxs 9 9 VertexValues x x x x x x x x s 9

9 Virtual Warp Centric (VWC) [PPoPP ] Physical warp broken into smaller virtual warps,, 8, lanes Advantages: Neighbors processed in parallel Load imbalance is reduced Disadvantages: s Load imbalance still exists GPU underutilization Non-coalesced memory accesses InEdgeIdxs VertexValues 9 x x x x x x x x 9

10 VWC-CSR Benchmark Avg global memory access efficiency (%) Avg warp execution efficiency (%) Breadth-First Search (BFS) Single Source Shortest Path (SSSP) PageRank (PR) Connected Components (CC) Single Source Widest Path (SSWP) Neural Network (NN) Heat Simulation (HS) Circuit Simulation (CS) Caused by Non-Coalesced memory loads and stores Caused by GPU Underutilization and Load Imbalance

11 G-Shards 8 Employs Shards Data required by computation placed contiguously Improves locality [Kyrola et al, OSDI ] Shard Shard 9 x x x x x x x x x x x x x 9 x VertexValues x x x x x x x x

12 G-Shards: Construction Edges partitioned based on destination vertex s index Edges sorted based on source vertex s index 9 W W VertexValues Shard Shard x x x x x x x 9 x x x x x x x x x x x x x x x 9 W W

13 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

14 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

15 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

16 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

17 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

18 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x Coalesced memory accesses in all steps VertexValues x x x x x x x x Block Shared Memory

19 Frequency (Thousands) Frequency (Thousands) G-Shards Large and sparse graphs have small computation windows Graph Size Graph Sparsity 8 _8 _ Window Sizes Window Sizes

20 G-Shards Large and sparse graphs have small computation windows Threads in a warp Shard j- Shard j Shard j+ GPU underutilization Many warp threads remain idle during th step Solution Wi,j- Wi,j+ Group windows to utilize maximum threads Concatenated Windows Wi,j t -

21 Concatenated Windows (CW) Shard Shard x x x x VertexValues x x x x x x x x x x x x x x x x x 9 x

22 Concatenated Windows (CW) Detach array from shards Shard Shard x x x x VertexValues x x x x x x x x x x x x x x x x x 9 x

23 Concatenated Windows (CW) Detach array from shards Concatenate the ones belonging to windows of same shards Shard Shard x x x x VertexValues x x x x x x x x x x x x x x x x x 9 x

24 Concatenated Windows (CW) Detach array from shards Concatenate the ones belonging to windows of same shards Mapper array Fast access of in shards Shard Shard CW CW Mapper Mapper x x x 8 x VertexValues x 9 x x x x x x x x x 8 x x 9 x x x x x 9 x

25 CW: Processing First steps remain same Writeback is done by block threads using mapper array Thread IDs k k+ m m+ m+ n n+ n+ p CW i Mapper for W i,j- for W i,j for W i,j+ Shard j+ Shard j Shard j- k k+ m+ m+ n+ n+ Wi,j+ Wi,j Wi,j- m n p

26 CW: Processing First steps remain same Writeback is done by block threads using mapper array k k+ m m+ m+ n n+ n+ p Thread IDs CW i Mapper Improves GPU utilization for W i,j- in th step for W i,j for W i,j+ Shard j Shard j- Shard j+ k k+ m+ m+ n+ n+ Wi,j+ Wi,j Wi,j- m n p

27 CuSha Framework Uses G-Shards and CWs Vertex centric programming model init_compute, compute, update_condition compute must be atomic // SSSP in CuSha typedef struct Edge { unsigned int Weight; } Edge; // requires a variable for edge weight typedef struct Vertex { unsigned int Dist; } Vertex; // requires a variable for distance device void init_compute( Vertex* local_v, Vertex* V ) { local_v->dist = V->Dist; // fetches distances from global memory to shared memory } device void compute( Vertex* SrcV, StaticVertex* SrcV_static, Edge* E, Vertex* local_v ) { if (SrcV->Dist!= INF) // minimum distance of a vertex from source node is calculated atomicmin ( &(local_v->dist), SrcV->Dist + E->Weight ); } device bool update_condition( Vertex* local_v, Vertex* V ) { return ( local_v->dist < V->Dist ); // return true if newly calculated value is smaller }

28 Experimental Setup GeForce GTX8: SMs, GB GDDR RAM Intel Core i-9k Sandy bridge cores (HT enabled). GHz, DDR RAM, PCI-e. x CUDA. on Ubuntu. 8 Benchmarks BFS, SSSP, PR, CC, SSWP, NN, HS, CS Input graphs from the SNAP dataset collection SNAP:

29 Speedup over Multi-Threaded CPU BM G-Shards CW BFS.x-.x.x-.8x SSSP.x-.x.99x-.x PR.x-.x.x-8.98x CC.x-.x.x-.x Average speedups G-Shards:.x-.9x CW:.88x-.x SSWP.9x-.x.x-.8x NN.8x-9.x.9x-9.9x HS.x-.x.8x-.x CS.9x-.x.9x-.x Inputs G-Shards CW LiveJournal.x-.x.x-9.x Pokec.x-.9x.x-.89x HiggsTwitter.x-.x.x-.x RoadNetCA.9x-9.9x.9x-.9x WebGoogle.9x-9.9x.9x-.9x Amazon.x-.x.88x-.x

30 Speedup over VWC-CSR BM G-Shards CW BFS.9x-.9x.9x-.x SSSP.9x-.9x.x-.9x PR.x-.88x.8x-.x CC.8x-.x.x-.x Average speedups G-Shards:.x-.x CW:.89x-.89x SSWP.9x-.x.9x-.x NN.x-.x.x-.x HS.x-.x.x-.x CS.x-.x.x-.8x Inputs G-Shards CW LiveJournal.x-.x.9x-.x Pokec.x-.x.x-.8x HiggsTwitter.x-.9x.x-.x RoadNetCA.x-8.x.9x-.99x WebGoogle.x-.x.x-.x Amazon.x-.x.x-.x

31 Efficiency Efficiency Global Memory Efficiency 8 CSR G-Shards CW CSR G-Shards CW 8 BFS SSSP PR CC SSWP NN HS CS BFS SSSP PR CC SSWP NN HS CS Avg Memory Load Efficiency Avg Memory Store Efficiency Average CSR G-Shards CW Avg global memory load efficiency Avg global memory store efficiency 8.8% 8.%.9%.9%.%.%

32 Efficiency Warp Execution Efficiency 9 CSR G-Shards CW 8 BFS SSSP PR CC SSWP NN HS CS Avg Warp Execution Efficiency Average CSR G-Shards CW Avg warp execution efficiency.8% 88.9% 9.%

33 CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW Space Overhead Min Average Max Amazon WebGoogle RoadNetCA HiggsTwitter Pokec LiveJournal G-Shards over CSR: CW over G-Shards: ( E - V )xsizeof(index) + E xsizeof(vertex) bytes E xsizeof(index) bytes

34 CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW Execution Time Breakdown HD Copy GPU Computation DH Copy BFS SSSP PR CC SSWP NN HS CS Space overheads directly impact HD copy times G-Shards and CW are computation friendly

35 Other Experiments Traversed Edges Per Seconds (TEPS) G-Shards and CW up to times better than CSR Sensitivity study using synthetic benchmarks G-Shards is sensitive to: Graph Size Graph Density Number of vertices assigned to shards CW less affected by these parameters

36 Conclusion G-Shards Shard based representation mapped for GPUs Fully coalesced memory accesses Concatenated Windows Improves GPU utilization CuSha Framework Vertex centric programming model.x-.9x speedup over best VWC-CSR

37 Thanks CuSha on GitHub GRASP

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani Keval Vora Rajiv Gupta Laxmi N. Bhuyan Computer Science and Engineering Department University of California Riverside, CA, USA {fkhor, kvora,

More information

UNIVERSITY OF CALIFORNIA RIVERSIDE. High Performance Vertex-Centric Graph Analytics on GPUs

UNIVERSITY OF CALIFORNIA RIVERSIDE. High Performance Vertex-Centric Graph Analytics on GPUs UNIVERSITY OF CALIFORNIA RIVERSIDE High Performance Vertex-Centric Graph Analytics on GPUs A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in

More information

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1, Ari B. Hayes 1, Chi Zhang 2, Timothy Salmon 1, Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh Sparse

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Towards Efficient Graph Traversal using a Multi- GPU Cluster

Towards Efficient Graph Traversal using a Multi- GPU Cluster Towards Efficient Graph Traversal using a Multi- GPU Cluster Hina Hameed Systems Research Lab, Department of Computer Science, FAST National University of Computer and Emerging Sciences Karachi, Pakistan

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems

More information

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)

More information

Gunrock: A Fast and Programmable Multi- GPU Graph Processing Library

Gunrock: A Fast and Programmable Multi- GPU Graph Processing Library Gunrock: A Fast and Programmable Multi- GPU Graph Processing Library Yangzihao Wang and Yuechao Pan with Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Andy Riffel and John D. Owens University of California,

More information

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

A Relaxed Consistency based DSM for Asynchronous Parallelism. Keval Vora, Sai Charan Koduru, Rajiv Gupta

A Relaxed Consistency based DSM for Asynchronous Parallelism. Keval Vora, Sai Charan Koduru, Rajiv Gupta A Relaxed Consistency based DSM for Asynchronous Parallelism Keval Vora, Sai Charan Koduru, Rajiv Gupta SoCal PLS - Spring 2014 Motivation Graphs are popular Graph Mining: Community Detection, Coloring

More information

WITH the rapid development of the Internet, processing

WITH the rapid development of the Internet, processing IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 4, APRIL 2017 1149 Optimizing Graph Processing on GPUs Wenyong Zhong, Jianhua Sun, Hao Chen, Member, IEEE, Jun Xiao, Zhiwen Chen, Chang

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing.

Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing. Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing Amir Hossein Nodehi Sabet University of California, Riverside anode00@ucr.edu Abstract Graph analytics delivers deep knowledge by processing

More information

Medusa: A Parallel Graph Processing System on Graphics Processors

Medusa: A Parallel Graph Processing System on Graphics Processors Medusa: A Parallel Graph Processing System on Graphics Processors Jianlong Zhong Nanyang Technological University jzhong2@ntu.edu.sg Bingsheng He Nanyang Technological University bshe@ntu.edu.sg ABSTRACT

More information

Implementation of BFS on shared memory (CPU / GPU)

Implementation of BFS on shared memory (CPU / GPU) Kolganov A.S., MSU The BFS algorithm Graph500 && GGraph500 Implementation of BFS on shared memory (CPU / GPU) Predicted scalability 2 The BFS algorithm Graph500 && GGraph500 Implementation of BFS on shared

More information

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Deploying Graph Algorithms on GPUs: an Adaptive Solution Deploying Graph Algorithms on GPUs: an Adaptive Solution Da Li Dept. of Electrical and Computer Engineering University of Missouri - Columbia dlx7f@mail.missouri.edu Abstract Thanks to their massive computational

More information

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Alternative GPU friendly assignment algorithms Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Graphics Processing Units (GPUs) Context: GPU Performance Accelerated

More information

Puffin: Graph Processing System on Multi-GPUs

Puffin: Graph Processing System on Multi-GPUs 2017 IEEE 10th International Conference on Service-Oriented Computing and Applications Puffin: Graph Processing System on Multi-GPUs Peng Zhao, Xuan Luo, Jiang Xiao, Xuanhua Shi, Hai Jin Services Computing

More information

Prefix Scan and Minimum Spanning Tree with OpenCL

Prefix Scan and Minimum Spanning Tree with OpenCL Prefix Scan and Minimum Spanning Tree with OpenCL U. VIRGINIA DEPT. OF COMP. SCI TECH. REPORT CS-2013-02 Yixin Sun and Kevin Skadron Dept. of Computer Science, University of Virginia ys3kz@virginia.edu,

More information

arxiv: v1 [cs.dc] 10 Dec 2018

arxiv: v1 [cs.dc] 10 Dec 2018 SIMD-X: Programming and Processing of Graph Algorithms on GPUs arxiv:8.04070v [cs.dc] 0 Dec 08 Hang Liu University of Massachusetts Lowell Abstract With high computation power and memory bandwidth, graphics

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric

More information

GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18

GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18 ABSTRACT GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) 15-740 Computer Architecture, Spring 18 With the death of Moore's Law, programmers can no

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection

Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection Farzad Khorasani Rajiv Gupta Laxmi N. Bhuyan Computer Science and Engineering Department University of California

More information

SociaLite: A Datalog-based Language for

SociaLite: A Datalog-based Language for SociaLite: A Datalog-based Language for Large-Scale Graph Analysis Jiwon Seo M OBIS OCIAL RESEARCH GROUP Overview Overview! SociaLite: language for large-scale graph analysis! Extensions to Datalog! Compiler

More information

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs 1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong

More information

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D.

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D. NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Agenda Accelerated Computing nvgraph New Features Coming Soon Dynamic Graphs GraphBLAS 2 ACCELERATED COMPUTING 10x Performance

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

Conflict-Free Vectorization of Associative Irregular Applications with Recent SIMD Architectural Advances

Conflict-Free Vectorization of Associative Irregular Applications with Recent SIMD Architectural Advances Conflict-Free Vectorization of Associative Irregular Applications with Recent SIMD Architectural Advances Peng Jiang Gagan Agrawal Department of Computer Science and Engineering The Ohio State University,

More information

PFAC Library: GPU-Based String Matching Algorithm

PFAC Library: GPU-Based String Matching Algorithm PFAC Library: GPU-Based String Matching Algorithm Cheng-Hung Lin Lung-Sheng Chien Chen-Hsiung Liu Shih-Chieh Chang Wing-Kai Hon National Taiwan Normal University, Taipei, Taiwan National Tsing-Hua University,

More information

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques James Fox Collaborators Oded Green, Research Scientist (GT) Euna Kim, PhD student (GT) Federico Busato, PhD student (Universita di

More information

Duane Merrill (NVIDIA) Michael Garland (NVIDIA)

Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Agenda Sparse graphs Breadth-first search (BFS) Maximal independent set (MIS) 1 1 0 1 1 1 3 1 3 1 5 10 6 9 8 7 3 4 BFS MIS Sparse graphs Graph G(V, E) where

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

CS 179: GPU Programming. Lecture 10

CS 179: GPU Programming. Lecture 10 CS 179: GPU Programming Lecture 10 Topics Non-numerical algorithms Parallel breadth-first search (BFS) Texture memory GPUs good for many numerical calculations What about non-numerical problems? Graph

More information

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid

More information

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems Memory-Centric System Level Design of Heterogeneous Embedded DSP Systems The GraphGrind Framework: Scott Johan Fischaber, BSc (Hon) Fast Graph Analytics on Large Shared-Memory Systems Jiawen Sun, PhD candidate

More information

Wedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018

Wedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018 Wedge A New Frontier for Pull-based Graph Processing Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018 Graph Processing Problems modelled as objects (vertices) and connections between

More information

Optimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation

Optimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation Optimizing Cache Performance for Graph Analytics Yunming Zhang 6.886 Presentation Goals How to optimize in-memory graph applications How to go about performance engineering just about anything In-memory

More information

Breadth-first Search in CUDA

Breadth-first Search in CUDA Assignment 3, Problem 3, Practical Parallel Computing Course Breadth-first Search in CUDA Matthias Springer Tokyo Institute of Technology matthias.springer@acm.org Abstract This work presents a number

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

Efficient Processing of Large Graphs via Input Reduction

Efficient Processing of Large Graphs via Input Reduction Efficient Processing of Large Graphs via Input Reduction Amlan Kusum Keval Vora Rajiv Gupta Iulian Neamtiu Department of Computer Science, University of California, Riverside {akusu, kvora, gupta}@cs.ucr.edu

More information

BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS LUIS CARLOS MARIA REMIS THESIS

BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS LUIS CARLOS MARIA REMIS THESIS BREADTH-FIRST SEARCH FOR SOCIAL NETWORK GRAPHS ON HETEROGENOUS PLATFORMS BY LUIS CARLOS MARIA REMIS THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer

More information

A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms

A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms Gurbinder Gill, Roshan Dathathri, Loc Hoang, Keshav Pingali Department of Computer Science, University of Texas

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Breadth First Search on Cost efficient Multi GPU Systems

Breadth First Search on Cost efficient Multi GPU Systems Breadth First Search on Cost efficient Multi Systems Takuji Mitsuishi Keio University 3 14 1 Hiyoshi, Yokohama, 223 8522, Japan mits@am.ics.keio.ac.jp Masaki Kan NEC Corporation 1753, Shimonumabe, Nakahara

More information

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time

More information

Efficient and Simplified Parallel Graph Processing over CPU and MIC

Efficient and Simplified Parallel Graph Processing over CPU and MIC Efficient and Simplified Parallel Graph Processing over CPU and MIC Linchuan Chen Xin Huo Bin Ren Surabhi Jain Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus,

More information

GPU-Based Acceleration for CT Image Reconstruction

GPU-Based Acceleration for CT Image Reconstruction GPU-Based Acceleration for CT Image Reconstruction Xiaodong Yu Advisor: Wu-chun Feng Collaborators: Guohua Cao, Hao Gong Outline Introduction and Motivation Background Knowledge Challenges and Proposed

More information

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National

More information

Accelerated Load Balancing of Unstructured Meshes

Accelerated Load Balancing of Unstructured Meshes Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require

More information

Social graphs (Facebook, Twitter, Google+, LinkedIn, etc.) Endorsement graphs (web link graph, paper citation graph, etc.)

Social graphs (Facebook, Twitter, Google+, LinkedIn, etc.) Endorsement graphs (web link graph, paper citation graph, etc.) Large-Scale Graph Processing Algorithms on the GPU Yangzihao Wang, Computer Science, UC Davis John Owens, Electrical and Computer Engineering, UC Davis 1 Overview The past decade has seen a growing research

More information

NXgraph: An Efficient Graph Processing System on a Single Machine

NXgraph: An Efficient Graph Processing System on a Single Machine NXgraph: An Efficient Graph Processing System on a Single Machine arxiv:151.91v1 [cs.db] 23 Oct 215 Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li and Huazhong Yang Tsinghua National Laboratory

More information

Efficient Computation of Radial Distribution Function on GPUs

Efficient Computation of Radial Distribution Function on GPUs Efficient Computation of Radial Distribution Function on GPUs Yi-Cheng Tu * and Anand Kumar Department of Computer Science and Engineering University of South Florida, Tampa, Florida 2 Overview Introduction

More information

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL SCOPE OF THE WORK Reliance on CPU

More information

Simple Parallel Biconnectivity Algorithms for Multicore Platforms

Simple Parallel Biconnectivity Algorithms for Multicore Platforms Simple Parallel Biconnectivity Algorithms for Multicore Platforms George M. Slota Kamesh Madduri The Pennsylvania State University HiPC 2014 December 17-20, 2014 Code, presentation available at graphanalysis.info

More information

RESEARCH ARTICLE. Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the GPU

RESEARCH ARTICLE. Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the GPU The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 21 RESEARCH ARTICLE Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES

Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES Background and Motivation Graph applications are memory latency bound Caches & Prefetching are existing solution for memory

More information

Performance Characterization of High-Level Programming Models for GPU Graph Analytics

Performance Characterization of High-Level Programming Models for GPU Graph Analytics 15 IEEE International Symposium on Workload Characterization Performance Characterization of High-Level Programming Models for GPU Analytics Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, and John D.

More information

Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion

Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Minghua Shen and Guojie Luo Peking University FPGA-February 23, 2017 1 Contents Motivation Background Search Space Reduction for

More information

Analyzing CUDA Workloads Using a Detailed GPU Simulator

Analyzing CUDA Workloads Using a Detailed GPU Simulator CS 3580 - Advanced Topics in Parallel Computing Analyzing CUDA Workloads Using a Detailed GPU Simulator Mohammad Hasanzadeh Mofrad University of Pittsburgh November 14, 2017 1 Article information Title:

More information

Parallel Trie-based Frequent Itemset Mining on Graphics Processors

Parallel Trie-based Frequent Itemset Mining on Graphics Processors City University of New York (CUNY) CUNY Academic Works Master's Theses City College of New York 2012 Parallel Trie-based Frequent Itemset Mining on Graphics Processors Jay Junjie Yao CUNY City College

More information

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University

More information

arxiv: v2 [cs.dc] 27 Mar 2015

arxiv: v2 [cs.dc] 27 Mar 2015 Gunrock: A High-Performance Graph Processing Library on the GPU Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, John D. Owens University of California, Davis {yzhwang, aaldavidson,

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

GraFBoost: Using accelerated flash storage for external graph analytics

GraFBoost: Using accelerated flash storage for external graph analytics GraFBoost: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind MIT CSAIL Funded by: 1 Large Graphs are Found Everywhere in Nature

More information

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,

More information

New Approach of Bellman Ford Algorithm on GPU using Compute Unified Design Architecture (CUDA)

New Approach of Bellman Ford Algorithm on GPU using Compute Unified Design Architecture (CUDA) New Approach of Bellman Ford Algorithm on GPU using Compute Unified Design Architecture (CUDA) Pankhari Agarwal Department of Computer Science and Engineering National Institute of Technical Teachers Training

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

WaveView. System Requirement V6. Reference: WST Page 1. WaveView System Requirements V6 WST

WaveView. System Requirement V6. Reference: WST Page 1. WaveView System Requirements V6 WST WaveView System Requirement V6 Reference: WST-0125-01 www.wavestore.com Page 1 WaveView System Requirements V6 Copyright notice While every care has been taken to ensure the information contained within

More information

Work efficient parallel algorithms for large graph exploration on emerging heterogeneous architectures

Work efficient parallel algorithms for large graph exploration on emerging heterogeneous architectures Accepted Manuscript Work efficient parallel algorithms for large graph exploration on emerging heterogeneous architectures Dip Sankar Banerjee, Ashutosh Kumar, Meher Chaitanya, Shashank Sharma, Kishore

More information

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 SIMD Divergence Optimization through Intra-Warp Compaction Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 Problem GPU: wide SIMD lanes 16 lanes per warp in this work SIMD

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015 Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path

More information

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

arxiv: v1 [cs.dc] 24 Feb 2010

arxiv: v1 [cs.dc] 24 Feb 2010 Deterministic Sample Sort For GPUs arxiv:1002.4464v1 [cs.dc] 24 Feb 2010 Frank Dehne School of Computer Science Carleton University Ottawa, Canada K1S 5B6 frank@dehne.net http://www.dehne.net February

More information

NXgraph: An Efficient Graph Processing System on a Single Machine

NXgraph: An Efficient Graph Processing System on a Single Machine NXgraph: An Efficient Graph Processing System on a Single Machine Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li and Huazhong Yang Tsinghua National Laboratory for Information Science and Technology,

More information

NUMA-aware Graph-structured Analytics

NUMA-aware Graph-structured Analytics NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11

More information

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Facial Recognition Using Neural Networks over GPGPU

Facial Recognition Using Neural Networks over GPGPU Facial Recognition Using Neural Networks over GPGPU V Latin American Symposium on High Performance Computing Juan Pablo Balarini, Martín Rodríguez and Sergio Nesmachnow Centro de Cálculo, Facultad de Ingeniería

More information

Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University 2 Oracle

More information

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview

More information

New Approach for Graph Algorithms on GPU using CUDA

New Approach for Graph Algorithms on GPU using CUDA New Approach for Graph Algorithms on GPU using CUDA 1 Gunjan Singla, 2 Amrita Tiwari, 3 Dhirendra Pratap Singh Department of Computer Science and Engineering Maulana Azad National Institute of Technology

More information

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

State of Art and Project Proposals Intensive Computation

State of Art and Project Proposals Intensive Computation State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers

More information

A Framework for Modeling GPUs Power Consumption

A Framework for Modeling GPUs Power Consumption A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January

More information