custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green
|
|
- Mervin Thompson
- 5 years ago
- Views:
Transcription
1 custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green
2 Going to talk about 2 things A scalable and dynamic data structure for graph algorithms and linear algebra based problems A framework for static and dynamic analytics NVIDIA GTC custinger Oded Green,
3 Upfront Summary of Results Can support up to 90 million updates per second Low overhead in comparison with CSR Initializing is also relatively in expensive 20% 200% Equal performance Great performance for static graph algorithms Simple to use NVIDIA GTC custinger Oded Green,
4 Big Data problems need Graph Analysis Financial networks: Transactions between players. Different transactions types (property graph) Communication networks: World wide connectivity High velocity changes Different types of extracted data: Physical communication network. Person to person communication network. Health Care networks: Various players. Pattern matching and epidemic monitoring. Problem sizes have doubled in last 5 years. Graphs are a unifying motif for data analytics. NVIDIA GTC custinger Oded Green,
5 STINGER STINGER: Spatio Temporal Interaction Networks and Graphs (STING) Extensible Representation Enable algorithm designers to implement dynamic & streaming graph algorithms with ease. Portable semantics for various platforms Linked list of edge blocks not ideal for the GPU Good performance for all types of graph problems and algorithms static and dynamic. Assumes globally addressable memory access NVIDIA GTC custinger Oded Green,
6 STINGER and custinger Properties A Simple programming model Millions of updates per second to graph Updates are not bottlenecks for analytics. Advanced memory manager Transfers data between host and device automatically Reduces initialization time Allows for simple update processes STINGER Papers: [Bader et al.; 2007; Tech Report], [Ediger et al.; HPEC; 2012], [McColl et al.; PPAA; 2014] custinger paper: [Green&Bader; HPEC, 2016]: custinger: Supporting dynamic graph algorithms for GPUs NVIDIA GTC custinger Oded Green,
7 Definitions Dynamic graphs (matrices) Graph can change over time. Changes can be to topology, edges, or vertices. For example new edges between two vertices. Changes to edge or vertex weights Streaming graphs: Graphs changing at high rates. 100s of thousands of updates per second. NVIDIA GTC custinger Oded Green,
8 Dynamic graph example Only a subset of the entire graph Dynamic: At time t: v and w become friends. insert_edge (v, w) At time t: Ƹ u and v no longer friends delete edge u,v Additional operations include vertex insertions & deletions u v w NVIDIA GTC custinger Oded Green,
9 Separation of powers Dynamic graph data structure and dynamic graph algorithms are in two different repositories Easy to integrate with external library Can also be used with matrices First part of today s talk will be on the dynamic data structure NVIDIA GTC custinger Oded Green,
10 Part 1 Data Structure custinger Version 2.0 Improved initialization time 100s of time faster than Version 1.0 New memory manager Reduces fragmentation Enables memory reclamation Offers good memory bounds Scalable data structure Can easily grow 1000X its initial size without needing to be re initialized Faster updates Coming soon (probably late May) NVIDIA GTC custinger Oded Green,
11 Restrictions of existing static graph representations Name Pros Cons Adjacency Matrix Flexible Limited utilization for sparse data Linked lists Flexible Poor locality Allocation time is costly COO (Edge list) unsorted Has some flexibility Updates are simple CSR Uses exact amount of memory Good locality Poor locality Stores both the source and destination Inflexible NVIDIA GTC custinger Oded Green,
12 Compressed Sparse Row (CSR) Pros: Uses precise storage requirements Great locality Good for GPUs Handful of arrays Cons: Simple to use and manage Inflexible. Network growth unsupported Topology changes unsupported Property graphs not supported Src/Row Offset Dest./Col. Value NVIDIA GTC custinger Oded Green,
13 Part 1: custinger A High Level View USER INTERFACE Vertex Id Used BSize Over allocated space Pointer Dest./Col Value Supports updates Supports edge insertion\deletion and deletion. Supports vertex insertion\deletion. NVIDIA GTC custinger Oded Green,
14 custinger Property Graph Support USER INTERFACE Vertex Id Used BSize Pointer Dest./Col Weigth Type Time 1 User 1 User 2. These are optional fields NVIDIA GTC custinger Oded Green,
15 Challenges Memory allocations are costly. Seems like we are suggesting that we need O(V) allocations Absolutely not. Our first implementation did this ouch NVIDIA GTC custinger Oded Green,
16 Memory Manager Made up three parts: 1. Vectorized Bit Trees 2. BlockArrays 3. B + Trees of BlockArrays THIS IS AN INTERNAL REPRESENTATION (HIDDEN FROM USERS) NVIDIA GTC custinger Oded Green,
17 BlockArrays Definition an array made up of equal size blocks. Each block can contain an equal number of edges BlockArray (with 4 blocks) Block (with 2 edges) NVIDIA GTC custinger Oded Green,
18 custinger BlockArray allocations USER INTERFACE Vertex Id Used BSize Over allocated space Pointer Dest./Col Value BA 0,1 BA 1,1 BA 1,2 BA 2,1 USER INTERFACE INTERNAL REPRESENTATION. HIDDEN FROM USERS available space Relatively small number of BlockArrays are needed Exact number not known at compile time (or even at runtime given updates) NVIDIA GTC custinger Oded Green,
19 Memory Manager Made up three parts: 1. Vectorized Bit Trees Helps determine which blocks are empty Key components for efficient memory reclamation 2. BlockArrays 3. B + Trees of BlockArrays NVIDIA GTC custinger Oded Green,
20 Vec Trees Each block is either full (0) or empty (1). Vect Tree representation 0 Next available position block Vect Tree implementation BA 1,1 Vect Tree implementation BA 1, Machine word (a) Full BlockArray Machine word (b) Partially Empty BlockArray NVIDIA GTC custinger Oded Green,
21 Vec Trees Complexity Given a BlockArray with BA Storage complexity O BA bits. In practice this is close to 2 BA bits Relatively small overhead. Vec Tree Updates require O log BA operations NVIDIA GTC custinger Oded Green,
22 Memory Manager Made up three parts: 1. Vectorized Bit Trees 2. BlockArrays 3. B + Trees of BlockArrays Responsible for deciding when more memory needs to be allocated NVIDIA GTC custinger Oded Green,
23 Log. of block size B + Trees of BlockArrays Each block sizes has a different tree. The KEY of the B + Trees is the number of empty blocks B + Tree Array B + Tree for BlockArray with 1 edge in a block B + Tree for BlockArray with 2 edges in a block B + Tree for BlockArray with 4 edges in a block B+ Node BA 2,2 1 available block 31 NVIDIA GTC custinger Oded Green,
24 Log. of block size B + Trees of BlockArrays Currently supports adjacency lists with up to 2 31 edges Can easily support up 2 63 edge blocks!!! B + Tree Array B + Tree for BlockArray with 1 edge in a block B + Tree for BlockArray with 2 edges in a block B + Tree for BlockArray with 4 edges in a block B+ Node B+ Node BA 2,2 1 available block BA 2,1 0 available blocks NVIDIA GTC custinger Oded Green,
25 B + Trees Properties A new BlockArray is allocated when all existing BlockArrays are full. Great for memory utilization. NVIDIA GTC custinger Oded Green,
26 custinger Update Process for Edge Insertions USER INTERFACE Vertex Id Used BSize Pointer Update Batch Source Destination Value Count number of insertions for each Source 2. Check edge availability for each Source. If not enough edges: 1. memory manger get larger block that can store all edges 2. Copy existing edges from old block to new block 3. memory manager old block is reclaimed 3. Insert new edges (while avoiding duplicates) NVIDIA GTC custinger Oded Green,
27 custinger Full View (after update) USER INTERFACE Vertex Id Used BSize Pointer Update Batch Source Destination Value Dest./Col. Value USER INTERFACE BA 0,1 BA 1,1 BA 1,2 INTERNAL REPRESENTATION. HIDDEN FROM USERS Vec Tree bit status BA 2, BA 2,2 NVIDIA GTC custinger Oded Green,
28 custinger Go Home with this View USER INTERFACE Vertex Id Update Batch Used Source BSize Destination Pointer Value Dest./Col. Value USER INTERFACE NVIDIA GTC custinger Oded Green,
29 custinger Data Structure Build instructions git clone recursive mkdir build && cd build cmake.. make j8 NVIDIA GTC custinger Oded Green,
30 Performance Analysis Initialization Overhead Memory Utilization Update rate Number of sustainable updates per second CSR Vs. custinger SpMV NVIDIA GTC custinger Oded Green,
31 Experimental Setup GPU μarch SMs SPs Memory (GB) For the K80 GPU, we use only one GPU Memory Type K40 Kepler GDDR5 K80 Kepler 2x13 2x2496 2x12 GDDR5 P100 Pascal HBM2 We report for the K80, unless noted NVIDIA GTC custinger Oded Green,
32 Inputs Graphs DIMACS 10 Graph Implementation Challenge SNAP Stanford Network Analysis Project Florida Matrix Collection The following is only a subset of these graphs: Name Type V E * Source coauthorsdblp Collaboration 299k 1.95M DIMACS as skitter Trace route 1.69M 11.1M SNAP kron_21 Random 2M 201M DIMACS cit patents Citation 3.77M 16.5M SNAP cage15 Matrix 5.15M 94M DIMACS uk 2002 Webcrawl 18.52M 523M DIMACS NVIDIA GTC custinger Oded Green,
33 Space Efficiency Memory Utilization Edges 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Used Edge Utilization Utilization = Allocated 70% average utilization NVIDIA GTC custinger Oded Green,
34 Space Efficiency Memory Utilization Blocks 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Used Block Utilization Utilization = Allocated 90% average utilization NVIDIA GTC custinger Oded Green,
35 Space Efficiency Memory Utilization Overall 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 70% average utilization Overall Utilization 30% overhead in comparison to CSR NVIDIA GTC custinger Oded Green,
36 Overhead vs. CSR Memcpy Initialization Time Copying from the CPU to GPU is costly Initializing custinger does not add much overhead We still need to optimize this process for P CSR Copy Host to Device Initilization on Device Over 100x than custinger Version 1.0 NVIDIA GTC custinger Oded Green,
37 Insertion Rates Supports up to 90M updates per second Version 2.0 is 4X 10X faster than Version 1.0 Does not have performance dip like Version 1.0 Scalable growth in update rate Version 1.0 Version 2.0 NVIDIA GTC custinger Oded Green,
38 MFLOPS SpMV: CSR Vs. custinger CSR custinger Simply replace CSR accesses with custinger A real apples to apples comparison NVIDIA GTC custinger Oded Green,
39 Part 2: Algorithms for custinger Goal support Static and Dynamic graph algorithms Already showed that the graph update process is efficient All algorithms are implemented using the same set of operations We show that these operators are efficient for static graph algorithms NVIDIA GTC custinger Oded Green,
40 custinger Programming Model Keep things simple Limit the amount of GPU programming users need to do. Uses vertex and edge frontiers Similar to Gunrock & LIGRA Necessary for good utilization Still requires good load balancing Edge frontiers are created implicitly from vertex frontiers. NVIDIA GTC custinger Oded Green,
41 Vertex and Edge Frontiers Ligra CPU HPC graph framework by Julian Shun Two backends: CILK and OpenMP Edge frontiers created implicitly from vertex frontiers Gunrock Typically one phase Highly tuned GPU graph library from Prof. John Owens Supports multi GPU analytics (same shared node) Each operation consists of two phases Edge frontiers created explicitly by programmer. NVIDIA GTC custinger Oded Green,
42 Case Study 1: Label Propagation Connected Components Connected component algorithms such as Shiloach Vishkin Initially every vertex is in its components Vertices move to with smallest ID Vertices can swap components multiple times NVIDIA GTC - custinger - Oded Green,
43 One Iteration of Connected Components Algorithm // Label propagation 1) For all v V 2) For all u adj v 3) if CC u < CC v 4) CC u CC v // Shortcutting 5) For all v V Traverse all edges O E Traverse all vertices O V 6) CC v CC CC v NVIDIA GTC - custinger - Oded Green,
44 Revisiting the algorithm in parallel* Notice, no triple brackets <<<>>> * We have more optimized versions in the library (require about 4 more lines of code.) NVIDIA GTC - custinger - Oded Green,
45 Case Study 2: Katz Centrality Given pseudo code for a variant of Katz Centrality It took 5 6 hours to port from the CPU to the GPU. NVIDIA P100 GPU, initial speedup: 70X CPU version had some addition optimizations. Within additional two hours speedup was: 100X This was feasible because of the pre existing load balanced primitives in the library NVIDIA GTC - custinger - Oded Green,
46 custinger Algorithms Build instructions git clone recursive mkdir build && cd build cmake.. make j8 By default, custinger will also be cloned Though you will need to build both repos NVIDIA GTC custinger Oded Green,
47 custingerv2 Algorithms Build instructions git clone recursive cd custingeralg/build cmake.. make j NVIDIA GTC custinger Oded Green,
48 Performance Analysis Connected Components Breadth First Search Triangle Counting Static Dynamic NVIDIA GTC custinger Oded Green,
49 Connected Components NVIDIA P100 Name Gunrock (msec.) custinger (msec.) Speedup coauthorsdblp X as Skitter X kron_ X cit patents X Cage X uk X Using label propagation Within 25% for most cases. NVIDIA GTC custinger Oded Green,
50 BFS Classic Top Down NVIDIA P100 Name Gunrock (msec.) custinger (msec.) Speedup coauthorsdblp X as Skitter X kron_ X cit patents X cage X uk X Using a similar algorithm in Gunrock Gunrock has additional optimizations that can make it faster than custinger NVIDIA GTC custinger Oded Green,
51 Triangle Counting: CSR Vs. custinger Name V E Time CSR (sec.) Time custinger (sec.) Execution Difference coauthorsdblp 299k 1.95M % as skitter 1.69M 11.1M % kron_21 2M 201M % cit patents 3.77M 16.5M % cage M 94M % uk M 523M % Triangle counting algorithm taken from [Green et al; IA 3 ;2014] Simply replace CSR accesses with custinger Executed on a K40 NVIDIA GTC custinger Oded Green,
52 Summary NVIDIA GTC custinger Oded Green,
53 Library Overview Completed algorithms and on going Algorithm Static Dynamic Reference Breadth first search Triangle Counting Static [Green et al; IA32014] Dynamic new algorithm [Makkar; 2017 submitted] Connect components on going [McColl; HiPC 2013] Betweenness Centrality on going [Green; SocialCom 2012] Page Rank on going New algorithm (non linear algebra based) Katz Centrality New algorithm (non linear algebra based) Of course many more algorithms to come NVIDIA GTC custinger Oded Green,
54 Upcoming projects using custinger Extend dynamic triangle counting to Jaccard Indices Scalable pattern and motif detection on the GPU NVIDIA GTC custinger Oded Green,
55 Take away Dynamic data structure for sparse data sets Supports high update rates Scalable in both data size and in performance NVIDIA GTC custinger Oded Green,
56 Collaborators Prof. David Bader (Georgia Tech) Prof. Jimeng Sun (Georgia Tech) Dr. Jason Riedy (Georgia Tech) Federico Busato, Visiting PhD student (Universita di Verona) James Fox, PhD student (Georgia Tech) Euna Kim, PhD student (Georgia Tech) Muhammad Osama Sakhi, BSc student (Georgia Tech) Alok Tripathy, BSc student (Georgia Tech) Manas George, BSc student (Georgia Tech) Graduates (GT) Devavret Makkar, MSc. (Tower Research) NVIDIA GTC custinger Oded Green,
57 Thank you Data structure: Algorithms: Versions 2.0, coming soon to a GPU near you NVIDIA GTC custinger Oded Green,
custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader
custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality
More informationFast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox
Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques James Fox Collaborators Oded Green, Research Scientist (GT) Euna Kim, PhD student (GT) Federico Busato, PhD student (Universita di
More informationAutonomous, Independent Management of Dynamic Graphs on GPUs
Autonomous, Independent Management of Dynamic Graphs on GPUs Martin Winter Graz University of Technology Graz, Austria martin.winter@icg.tugraz.at Rhaleb Zayer Max Planck Institute for Informatics Saarland
More informationA New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader
A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem
More informationHigh-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock
High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)
More informationA Comparative Study on Exact Triangle Counting Algorithms on the GPU
A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,
More informationAn Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader
An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More informationNVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D.
NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Agenda Accelerated Computing nvgraph New Features Coming Soon Dynamic Graphs GraphBLAS 2 ACCELERATED COMPUTING 10x Performance
More informationA Framework for Processing Large Graphs in Shared Memory
A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge
More informationEfficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis
Efficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis Ashish Kumar Mentors - David Bader, Jason Riedy, Jiajia Li Indian Institute of Technology, Jodhpur 6 June 2014 Ashish Kumar
More informationFlexible Batched Sparse Matrix-Vector Product on GPUs
ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,
More informationRuntime. The optimized program is ready to run What sorts of facilities are available at runtime
Runtime The optimized program is ready to run What sorts of facilities are available at runtime Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis token stream Syntactic
More informationComputational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?
Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? The CVDLab Team Francesco Furiani Tue, April 3, 2014 ROMA TRE UNIVERSITÀ DEGLI STUDI
More informationCS 5220: Parallel Graph Algorithms. David Bindel
CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :
More informationOptimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging
More informationParallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware
More informationEFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT
EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More informationNUMA-aware Graph-structured Analytics
NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11
More informationIntegrating Productivity-Oriented Programming Languages with High-Performance Data Structures
Integrating Productivity-Oriented Programming Languages with High-Performance Data Structures James Fairbanks Rohit Varkey Thankachan, Eric Hein, Brian Swenson Georgia Tech Research Institute September
More informationGPU Sparse Graph Traversal. Duane Merrill
GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationSTINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation
STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation David A. Bader Georgia Institute of Technolgy Adam Amos-Binks Carleton University, Canada Jonathan Berry Sandia
More informationPerformance Analysis and Culling Algorithms
Performance Analysis and Culling Algorithms Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1 Assignment 2 Sign up for Pluto labs on the web
More informationAlgorithm Design and Analysis
Algorithm Design and Analysis LECTURE 3 Data Structures Graphs Traversals Strongly connected components Sofya Raskhodnikova L3.1 Measuring Running Time Focus on scalability: parameterize the running time
More informationWedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018
Wedge A New Frontier for Pull-based Graph Processing Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018 Graph Processing Problems modelled as objects (vertices) and connections between
More informationGPU Sparse Graph Traversal
GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex
More informationMultipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox
More informationCSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search
CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric
More informationMosaic: Processing a Trillion-Edge Graph on a Single Machine
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys
More informationAutomatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC
Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road
More informationGraph Data Management
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationPorting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation
Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code
More informationR13. II B. Tech I Semester Supplementary Examinations, May/June DATA STRUCTURES (Com. to ECE, CSE, EIE, IT, ECC)
SET - 1 II B. Tech I Semester Supplementary Examinations, May/June - 2016 PART A 1. a) Write a procedure for the Tower of Hanoi problem? b) What you mean by enqueue and dequeue operations in a queue? c)
More informationEFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI
EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,
More informationSimple Parallel Biconnectivity Algorithms for Multicore Platforms
Simple Parallel Biconnectivity Algorithms for Multicore Platforms George M. Slota Kamesh Madduri The Pennsylvania State University HiPC 2014 December 17-20, 2014 Code, presentation available at graphanalysis.info
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationCASS-MT Task #7 - Georgia Tech. GraphCT: A Graph Characterization Toolkit. David A. Bader, David Ediger, Karl Jiang & Jason Riedy
CASS-MT Task #7 - Georgia Tech GraphCT: A Graph Characterization Toolkit David A. Bader, David Ediger, Karl Jiang & Jason Riedy October 26, 2009 Outline Motivation What is GraphCT? Package for Massive
More informationNavigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets
Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationTechnologies for High Performance Data Analytics
Technologies for High Performance Data Analytics Dr. Jens Krüger Fraunhofer ITWM 1 Fraunhofer ITWM n Institute for Industrial Mathematics n Located in Kaiserslautern, Germany n Staff: ~ 240 employees +
More informationFast Segmented Sort on GPUs
Fast Segmented Sort on GPUs Kaixi Hou, Weifeng Liu, Hao Wang, Wu-chun Feng {kaixihou, hwang121, wfeng}@vt.edu weifeng.liu@nbi.ku.dk Segmented Sort (SegSort) Perform a segment-by-segment sort on a given
More informationGPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS
GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge
More informationOptimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink
Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline
More informationExtreme-scale Graph Analysis on Blue Waters
Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State
More informationSTINGER: Spatio-Temporal Interaction Networks and Graphs Extensible Representation
Chapter 1 : Spatio-Temporal Interaction Networks and Graphs Extensible Representation 1.1 Background Contributed by Tim Shaffer [1] is a general-purpose graph data structure that supports streaming updates
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationGraph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES
Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES Background and Motivation Graph applications are memory latency bound Caches & Prefetching are existing solution for memory
More informationCS162 - Operating Systems and Systems Programming. Address Translation => Paging"
CS162 - Operating Systems and Systems Programming Address Translation => Paging" David E. Culler! http://cs162.eecs.berkeley.edu/! Lecture #15! Oct 3, 2014!! Reading: A&D 8.1-2, 8.3.1. 9.7 HW 3 out (due
More informationGraphs. CSE 2320 Algorithms and Data Structures Alexandra Stefan and Vassilis Athitsos University of Texas at Arlington
Graphs CSE 2320 Algorithms and Data Structures Alexandra Stefan and Vassilis Athitsos University of Texas at Arlington 1 Representation Adjacency matrix??adjacency lists?? Review Graphs (from CSE 2315)
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More informationCSE638: Advanced Algorithms, Spring 2013 Date: March 26. Homework #2. ( Due: April 16 )
CSE638: Advanced Algorithms, Spring 2013 Date: March 26 Homework #2 ( Due: April 16 ) Task 1. [ 40 Points ] Parallel BFS with Cilk s Work Stealing. (a) [ 20 Points ] In homework 1 we analyzed and implemented
More informationCS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018
CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler
More informationPortable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.
Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the
More informationCSE 373 NOVEMBER 20 TH TOPOLOGICAL SORT
CSE 373 NOVEMBER 20 TH TOPOLOGICAL SORT PROJECT 3 500 Internal Error problems Hopefully all resolved (or close to) P3P1 grades are up (but muted) Leave canvas comment Emails tomorrow End of quarter GRAPHS
More informationKathleen Durant PhD Northeastern University CS Indexes
Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationGPU-Based Acceleration for CT Image Reconstruction
GPU-Based Acceleration for CT Image Reconstruction Xiaodong Yu Advisor: Wu-chun Feng Collaborators: Guohua Cao, Hao Gong Outline Introduction and Motivation Background Knowledge Challenges and Proposed
More informationAcknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides
Garbage Collection Last time Compiling Object-Oriented Languages Today Motivation behind garbage collection Garbage collection basics Garbage collection performance Specific example of using GC in C++
More informationCoordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin
Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification
More informationAutomatic Tuning Matrix Multiplication Performance on Graphics Hardware
Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful
More informationCuSha: Vertex-Centric Graph Processing on GPUs
CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power
More informationMemory management. Knut Omang Ifi/Oracle 10 Oct, 2012
Memory management Knut Omang Ifi/Oracle 1 Oct, 212 (with slides from V. Goebel, C. Griwodz (Ifi/UiO), P. Halvorsen (Ifi/UiO), K. Li (Princeton), A. Tanenbaum (VU Amsterdam), and M. van Steen (VU Amsterdam))
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationGTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013
GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»
More informationA Simple and Practical Linear-Work Parallel Algorithm for Connectivity
A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures
More informationMultidimensional Arrays & Graphs. CMSC 420: Lecture 3
Multidimensional Arrays & Graphs CMSC 420: Lecture 3 Mini-Review Abstract Data Types: List Stack Queue Deque Dictionary Set Implementations: Linked Lists Circularly linked lists Doubly linked lists XOR
More informationAuto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors
Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors Kaixi Hou, Wu-chun Feng {kaixihou, wfeng}@vt.edu Shuai Che Shuai.Che@amd.com Sparse
More informationOutline. Introduction Network Analysis Static Parallel Algorithms Dynamic Parallel Algorithms. GraphCT + STINGER
Outline Introduction Network Analysis Static Parallel Algorithms Dynamic Parallel Algorithms Data Structures for Streaming Data Clustering Coefficients Connected Components Anomaly Detection GraphCT +
More informationWHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016
WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid
More informationTanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia
How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems
More informationA Framework for Processing Large Graphs in Shared Memory
A Framework for Processing Large Graphs in Shared Memory Julian Shun UC Berkeley (Miller Institute, AMPLab) Based on joint work with Guy Blelloch and Laxman Dhulipala, Kimon Fountoulakis, Michael Mahoney,
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationGpuWrapper: A Portable API for Heterogeneous Programming at CGG
GpuWrapper: A Portable API for Heterogeneous Programming at CGG Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon March 2 nd, 2016 GpuWrapper: Objectives &
More informationGraphs. Edges may be directed (from u to v) or undirected. Undirected edge eqvt to pair of directed edges
(p 186) Graphs G = (V,E) Graphs set V of vertices, each with a unique name Note: book calls vertices as nodes set E of edges between vertices, each encoded as tuple of 2 vertices as in (u,v) Edges may
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationAdaptable benchmarks for register blocked sparse matrix-vector multiplication
Adaptable benchmarks for register blocked sparse matrix-vector multiplication Berkeley Benchmarking and Optimization group (BeBOP) Hormozd Gahvari and Mark Hoemmen Based on research of: Eun-Jin Im Rich
More informationParallel SAH k-d Tree Construction. Byn Choi, Rakesh Komuravelli, Victor Lu, Hyojin Sung, Robert L. Bocchino, Sarita V. Adve, John C.
Parallel SAH k-d Tree Construction Byn Choi, Rakesh Komuravelli, Victor Lu, Hyojin Sung, Robert L. Bocchino, Sarita V. Adve, John C. Hart Motivation Real-time Dynamic Ray Tracing Efficient rendering with
More informationExtreme-scale Graph Analysis on Blue Waters
Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State
More informationCSE 100: GRAPH ALGORITHMS
CSE 100: GRAPH ALGORITHMS 2 Graphs: Example A directed graph V5 V = { V = E = { E Path: 3 Graphs: Definitions A directed graph V5 V6 A graph G = (V,E) consists of a set of vertices V and a set of edges
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationThe GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems
Memory-Centric System Level Design of Heterogeneous Embedded DSP Systems The GraphGrind Framework: Scott Johan Fischaber, BSc (Hon) Fast Graph Analytics on Large Shared-Memory Systems Jiawen Sun, PhD candidate
More informationArabesque. A system for distributed graph mining. Mohammed Zaki, RPI
rabesque system for distributed graph mining Mohammed Zaki, RPI Carlos Teixeira, lexandre Fonseca, Marco Serafini, Georgos Siganos, shraf boulnaga, Qatar Computing Research Institute (QCRI) 1 Big Data
More informationIterative Sparse Triangular Solves for Preconditioning
Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationModern GPUs (Graphics Processing Units)
Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationAlternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield
Alternative GPU friendly assignment algorithms Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Graphics Processing Units (GPUs) Context: GPU Performance Accelerated
More informationStreaming Graph Analytics: Complexity, Scalability, and Architectures
Streaming Graph Analytics: Complexity, Scalability, and Architectures Peter M. Kogge McCourtney Prof. of CSE Univ. of Notre Dame IBM Fellow (retired)please Sir, I want more 1 Outline Motivation and Current
More information