CuSha: Vertex-Centric Graph Processing on GPUs
1 CuSha: Vertex-Centric Graph Processing on GPUs. Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan. HPDC, Vancouver, Canada, June
2 Motivation: Graph processing. Real-world graphs are large and sparse and follow a power-law degree distribution. Example graphs: HiggsTwitter, LiveJournal, and Pokec, each with millions of edges and vertices.
3 Graphics Processing Unit (GPU): Single Instruction Multiple Data (SIMD) execution suits repetitive processing patterns on regular data. [figure: GPU global memory serving several streaming multiprocessors, each with its own shared memory and multiple warps]
4 Graphics Processing Unit (GPU): Irregular graphs cause irregular memory accesses and GPU underutilization.
5 Prior Work (all using Compressed Sparse Row, CSR)
- Assign each vertex to a thread [Harish and Narayanan, HiPC]: threads iterate through the assigned vertex's neighbors. Non-coalesced memory accesses; work imbalance among threads.
- Assign each vertex to a virtual warp [Hong et al., PPoPP]: virtual warp lanes process the vertex's neighbors in parallel. Non-coalesced memory accesses; work imbalance is reduced but still exists.
- Assemble frontiers [Merrill et al., PPoPP]: for exploration-based graph algorithms. Non-coalesced memory accesses.
7 Compressed Sparse Row (CSR) [figure: an example graph and its CSR arrays, including InEdgeIdxs and VertexValues]
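To make the CSR layout concrete, here is a minimal host-side C++ sketch. It is not CuSha code: it assumes incoming edges are grouped by destination vertex, and the names `Csr`, `build_csr`, `src_index`, and `in_degree` are illustrative, with `in_edge_idxs` playing the role of the InEdgeIdxs array on the slide.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CSR container for incoming edges (names are illustrative).
struct Csr {
    std::vector<std::size_t> in_edge_idxs;  // size |V|+1: start of each vertex's in-edges
    std::vector<unsigned> src_index;        // size |E|: source vertex of each in-edge
};

// Build CSR from (src, dst) pairs with a counting sort on the destination.
Csr build_csr(std::size_t num_vertices,
              const std::vector<std::pair<unsigned, unsigned>>& edges) {
    Csr g;
    g.in_edge_idxs.assign(num_vertices + 1, 0);
    for (const auto& e : edges) ++g.in_edge_idxs[e.second + 1];
    for (std::size_t v = 0; v < num_vertices; ++v)
        g.in_edge_idxs[v + 1] += g.in_edge_idxs[v];  // prefix sum
    g.src_index.resize(edges.size());
    std::vector<std::size_t> next(g.in_edge_idxs.begin(), g.in_edge_idxs.end() - 1);
    for (const auto& e : edges) g.src_index[next[e.second]++] = e.first;
    return g;
}

// In-degree of v: the amount of work a thread assigned to v would do.
std::size_t in_degree(const Csr& g, unsigned v) {
    return g.in_edge_idxs[v + 1] - g.in_edge_idxs[v];
}
```

A thread assigned to vertex v reads `src_index[in_edge_idxs[v] .. in_edge_idxs[v+1])` and then fetches each source's value from an arbitrary position in VertexValues, which is where the non-coalesced accesses come from.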
9 Virtual Warp Centric (VWC) [PPoPP]: a physical warp is broken into smaller virtual warps of a fixed number of lanes (for example, 8).
Advantages: neighbors are processed in parallel; load imbalance is reduced.
Disadvantages: load imbalance still exists; GPU underutilization; non-coalesced memory accesses.
[figure: the InEdgeIdxs and VertexValues arrays accessed by virtual warp lanes]
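The remaining load imbalance can be illustrated with a small host-side C++ model. This is a sketch under stated assumptions, not a measurement: it assumes one vertex per virtual warp, a 32-lane physical warp, and lockstep execution, so a physical warp stays busy until its slowest virtual warp finishes. The function name `lane_utilization` is hypothetical, and memory coalescing is not modelled.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of issued lane-steps that do useful work when every vertex is
// handled by a virtual warp of `vw` lanes (vw must divide warp_size).
double lane_utilization(const std::vector<std::size_t>& degrees,
                        std::size_t vw, std::size_t warp_size = 32) {
    const std::size_t per_warp = warp_size / vw;  // vertices per physical warp
    std::size_t useful = 0, issued = 0;
    for (std::size_t base = 0; base < degrees.size(); base += per_warp) {
        const std::size_t end = std::min(base + per_warp, degrees.size());
        std::size_t max_steps = 0;
        for (std::size_t v = base; v < end; ++v) {
            useful += degrees[v];  // one lane-step per neighbor
            max_steps = std::max(max_steps, (degrees[v] + vw - 1) / vw);
        }
        issued += max_steps * warp_size;  // all lanes wait for the slowest vertex
    }
    return issued == 0 ? 1.0
                       : static_cast<double>(useful) / static_cast<double>(issued);
}
```

With in-degrees {16, 1, 1, 1}, this model keeps 19 of 64 lane-steps busy for 8-lane virtual warps and 19 of 128 for a full 32-lane warp per vertex: smaller virtual warps help, but a skewed degree distribution still leaves most lanes idle.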
10 VWC-CSR [table: average global memory access efficiency (%) and average warp execution efficiency (%) per benchmark: Breadth-First Search (BFS), Single Source Shortest Path (SSSP), PageRank (PR), Connected Components (CC), Single Source Widest Path (SSWP), Neural Network (NN), Heat Simulation (HS), Circuit Simulation (CS)]. Global memory access efficiency is limited by non-coalesced memory loads and stores; warp execution efficiency is limited by GPU underutilization and load imbalance.
11 G-Shards: employs shards [Kyrola et al., OSDI]. Data required by the computation is placed contiguously, which improves locality. [figure: an example graph, its VertexValues array, and the corresponding shards]
12 G-Shards: Construction. Edges are partitioned based on the destination vertex's index, and within each shard edges are sorted based on the source vertex's index. [figure: VertexValues and the shards built from the example graph, with two windows (W) highlighted]
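The two construction rules can be sketched in host-side C++ as follows. This is an illustration only: the struct `ShardEntry`, its field names, the function `build_shards`, and the parameter `vertices_per_shard` are assumptions, and each shard is assumed to own a contiguous range of destination indices.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One edge as stored in a shard; field names are illustrative.
struct ShardEntry { unsigned src, dst, weight; };

// Rule 1: partition edges by destination index (vertices_per_shard consecutive
//         destinations per shard).
// Rule 2: sort the edges of every shard by source index.
std::vector<std::vector<ShardEntry>> build_shards(
        std::size_t num_vertices, std::size_t vertices_per_shard,
        const std::vector<ShardEntry>& edges) {
    const std::size_t num_shards =
        (num_vertices + vertices_per_shard - 1) / vertices_per_shard;
    std::vector<std::vector<ShardEntry>> shards(num_shards);
    for (const ShardEntry& e : edges)
        shards[e.dst / vertices_per_shard].push_back(e);
    for (auto& shard : shards)
        std::stable_sort(shard.begin(), shard.end(),
                         [](const ShardEntry& a, const ShardEntry& b) {
                             return a.src < b.src;
                         });
    return shards;
}
```

Because each shard is sorted by source, the entries of shard j whose sources belong to shard i sit next to each other; that contiguous run is the window the later slides call W i,j.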
13 G-Shards: Processing. All shards are processed iteratively, and each shard is processed by a thread block. Steps: Read, Compute, Update, Write. Memory accesses are coalesced in all steps. [figure: the shards, the VertexValues array, and the block's shared memory during each step]
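One plausible reading of the four steps, written as a sequential host-side C++ model for SSSP, is sketched below. The slides do not spell out what each step touches, so the details are assumptions: every shard entry is assumed to carry a copy of its source's value (`src_value`), shard i is assumed to own vertices [i*n, (i+1)*n), and all names are hypothetical. On a GPU each outer-loop iteration would be one thread block and `local` would live in shared memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

constexpr unsigned INF = std::numeric_limits<unsigned>::max();

struct Entry { unsigned src, src_value, weight, dst; };  // illustrative fields
using Shard = std::vector<Entry>;                        // sorted by src

// One pass over all shards; returns true if any vertex value changed.
bool process_shards(std::vector<Shard>& shards,
                    std::vector<unsigned>& vertex_values, std::size_t n) {
    bool changed = false;
    for (std::size_t i = 0; i < shards.size(); ++i) {
        const std::size_t lo = i * n;
        const std::size_t hi = std::min(lo + n, vertex_values.size());
        // Read: copy this shard's vertex values into a local buffer.
        std::vector<unsigned> local(vertex_values.begin() + lo,
                                    vertex_values.begin() + hi);
        // Compute: relax every entry of the shard into the local buffer.
        for (const Entry& e : shards[i])
            if (e.src_value != INF)
                local[e.dst - lo] = std::min(local[e.dst - lo],
                                             e.src_value + e.weight);
        // Update: publish the values that improved.
        bool shard_changed = false;
        for (std::size_t v = lo; v < hi; ++v)
            if (local[v - lo] < vertex_values[v]) {
                vertex_values[v] = local[v - lo];
                shard_changed = true;
            }
        // Write: refresh the source-value copies for sources in [lo, hi).
        // A full scan keeps the sketch short; sorted shards would allow
        // visiting only the contiguous window in each shard.
        if (shard_changed)
            for (Shard& s : shards)
                for (Entry& e : s)
                    if (e.src >= lo && e.src < hi)
                        e.src_value = vertex_values[e.src];
        changed = changed || shard_changed;
    }
    return changed;
}
```

Calling `process_shards` repeatedly until it returns false runs the iteration to convergence.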
19 G-Shards: Large and sparse graphs have small computation windows. [figure: two histograms of window sizes, with frequency in thousands, for varying graph size and varying graph sparsity]
20 G-Shards: Large and sparse graphs have small computation windows, which leads to GPU underutilization: many warp threads remain idle during the write-back step. Solution: group windows to utilize the maximum number of threads (Concatenated Windows). [figure: the threads of a warp mapped onto windows W i,j in neighboring shards]
21 Concatenated Windows (CW): Detach one array from the shards and concatenate the entries that belong to windows of the same shard. A mapper array gives fast access to the corresponding entries in the shards. [figure: the shards, the VertexValues array, and the resulting CW and Mapper arrays for each shard]
25 CW: Processing. The first steps remain the same; write-back is done by block threads using the mapper array, which improves GPU utilization in that step. [figure: consecutive thread IDs of a block assigned to the entries of CW i and its mapper, which cover the windows W i,j in several shards]
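A minimal host-side C++ sketch of how the concatenated windows and the mapper could be built is shown below. It is an assumption-laden illustration, not the CuSha implementation: the detached array is assumed to hold source indices, shards are assumed to be laid out back to back so that a single flat position identifies an entry, shard i is assumed to own vertices [i*n, (i+1)*n), and the names `ConcatenatedWindows`, `cw_src`, `mapper`, and `build_cw` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct ConcatenatedWindows {
    std::vector<std::vector<unsigned>> cw_src;     // CW i: source indices owned by shard i
    std::vector<std::vector<std::size_t>> mapper;  // flat position of each CW entry in the shards
};

// shard_src[j] holds the source indices of shard j, sorted ascending.
// Assumes one shard per block of n vertices, so src / n is a valid shard id.
ConcatenatedWindows build_cw(const std::vector<std::vector<unsigned>>& shard_src,
                             std::size_t n) {
    ConcatenatedWindows out;
    out.cw_src.resize(shard_src.size());
    out.mapper.resize(shard_src.size());
    std::size_t flat = 0;  // running position across all shards
    for (const auto& shard : shard_src)
        for (unsigned src : shard) {
            const std::size_t i = src / n;  // this entry lies in a window of shard i
            out.cw_src[i].push_back(src);
            out.mapper[i].push_back(flat++);
        }
    return out;
}
```

With this layout, thread t of the block handling shard i could write the fresh value of vertex `cw_src[i][t]` to flat position `mapper[i][t]`, so consecutive threads stay busy even when individual windows are small.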
27 CuSha Framework: uses G-Shards and CWs. Vertex-centric programming model: the user provides init_compute, compute, and update_condition; compute must be atomic.

    // SSSP in CuSha
    typedef struct Edge { unsigned int Weight; } Edge;     // requires a variable for edge weight
    typedef struct Vertex { unsigned int Dist; } Vertex;   // requires a variable for distance

    __device__ void init_compute( Vertex* local_v, Vertex* V ) {
        local_v->Dist = V->Dist;   // fetches distances from global memory to shared memory
    }

    __device__ void compute( Vertex* SrcV, StaticVertex* SrcV_static, Edge* E, Vertex* local_v ) {
        if ( SrcV->Dist != INF )   // minimum distance of a vertex from source node is calculated
            atomicMin( &(local_v->Dist), SrcV->Dist + E->Weight );
    }

    __device__ bool update_condition( Vertex* local_v, Vertex* V ) {
        return ( local_v->Dist < V->Dist );   // return true if newly calculated value is smaller
    }
28 Experimental Setup. GPU: GeForce GTX card with GDDR RAM. CPU: Intel Core i-series Sandy Bridge (HT enabled) with DDR RAM, connected over PCI-e. Software: CUDA on Ubuntu. 8 benchmarks: BFS, SSSP, PR, CC, SSWP, NN, HS, CS. Input graphs from the SNAP dataset collection.
29 Speedup over Multi-Threaded CPU [table: speedup ranges of G-Shards and CW per benchmark (BFS, SSSP, PR, CC, SSWP, NN, HS, CS) and per input graph (LiveJournal, Pokec, HiggsTwitter, RoadNetCA, WebGoogle, Amazon), with average speedups for G-Shards and CW]
30 Speedup over VWC-CSR [table: speedup ranges of G-Shards and CW per benchmark (BFS, SSSP, PR, CC, SSWP, NN, HS, CS) and per input graph (LiveJournal, Pokec, HiggsTwitter, RoadNetCA, WebGoogle, Amazon), with average speedups for G-Shards and CW]
31 Global Memory Efficiency [figure: average memory load efficiency and average memory store efficiency of CSR, G-Shards, and CW for BFS, SSSP, PR, CC, SSWP, NN, HS, CS; table: average global memory load and store efficiency for CSR, G-Shards, and CW]
32 Warp Execution Efficiency [figure: average warp execution efficiency of CSR, G-Shards, and CW for BFS, SSSP, PR, CC, SSWP, NN, HS, CS; table: average warp execution efficiency for CSR, G-Shards, and CW]
33 Space Overhead [figure: minimum, average, and maximum space used by CSR, G-Shards, and CW for Amazon, WebGoogle, RoadNetCA, HiggsTwitter, Pokec, LiveJournal]
G-Shards over CSR: (|E| - |V|) x sizeof(index) + |E| x sizeof(vertex) bytes
CW over G-Shards: |E| x sizeof(index) bytes
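The two overhead formulas, (|E| - |V|) x sizeof(index) + |E| x sizeof(vertex) for G-Shards over CSR and |E| x sizeof(index) for CW over G-Shards, are easy to evaluate for a given graph. The helper names below are hypothetical, and the graph size used in the note afterwards is made up for illustration rather than taken from the talk.

```cpp
#include <cstddef>

// Extra bytes G-Shards needs compared with CSR (assumes num_edges >= num_vertices).
std::size_t gshards_over_csr(std::size_t num_edges, std::size_t num_vertices,
                             std::size_t index_bytes, std::size_t vertex_bytes) {
    return (num_edges - num_vertices) * index_bytes + num_edges * vertex_bytes;
}

// Extra bytes CW needs compared with G-Shards.
std::size_t cw_over_gshards(std::size_t num_edges, std::size_t index_bytes) {
    return num_edges * index_bytes;
}
```

For a hypothetical graph with 1,000,000 edges, 100,000 vertices, 4-byte indices, and a 4-byte vertex value, G-Shards adds 900,000 x 4 + 1,000,000 x 4 = 7,600,000 bytes over CSR, and CW adds a further 4,000,000 bytes.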
34 Execution Time Breakdown [figure: execution time of CSR, G-Shards, and CW for BFS, SSSP, PR, CC, SSWP, NN, HS, CS, split into host-to-device (HD) copy, GPU computation, and device-to-host (DH) copy]. Space overheads directly impact HD copy times. G-Shards and CW are computation friendly.
35 Other Experiments. Traversed Edges Per Second (TEPS): G-Shards and CW perform better than CSR. Sensitivity study using synthetic benchmarks: G-Shards is sensitive to graph size, graph density, and the number of vertices assigned to shards; CW is less affected by these parameters.
36 Conclusion. G-Shards: a shard-based representation mapped for GPUs, with fully coalesced memory accesses. Concatenated Windows: improves GPU utilization. CuSha framework: a vertex-centric programming model that delivers a speedup over the best VWC-CSR.
37 Thanks CuSha on GitHub GRASP
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationarxiv: v1 [cs.dc] 24 Feb 2010
Deterministic Sample Sort For GPUs arxiv:1002.4464v1 [cs.dc] 24 Feb 2010 Frank Dehne School of Computer Science Carleton University Ottawa, Canada K1S 5B6 frank@dehne.net http://www.dehne.net February
More informationNXgraph: An Efficient Graph Processing System on a Single Machine
NXgraph: An Efficient Graph Processing System on a Single Machine Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li and Huazhong Yang Tsinghua National Laboratory for Information Science and Technology,
More informationNUMA-aware Graph-structured Analytics
NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11
More informationNavigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets
Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationA Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,
More informationFacial Recognition Using Neural Networks over GPGPU
Facial Recognition Using Neural Networks over GPGPU V Latin American Symposium on High Performance Computing Juan Pablo Balarini, Martín Rodríguez and Sergio Nesmachnow Centro de Cálculo, Facultad de Ingeniería
More informationFast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs
Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University 2 Oracle
More informationSupport Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura
Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview
More informationNew Approach for Graph Algorithms on GPU using CUDA
New Approach for Graph Algorithms on GPU using CUDA 1 Gunjan Singla, 2 Amrita Tiwari, 3 Dhirendra Pratap Singh Department of Computer Science and Engineering Maulana Azad National Institute of Technology
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationState of Art and Project Proposals Intensive Computation
State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers
More informationA Framework for Modeling GPUs Power Consumption
A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January
More information