custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green

Size: px
Start display at page:

Download "custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green"

Transcription

1 custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green

2 Going to talk about 2 things A scalable and dynamic data structure for graph algorithms and linear algebra based problems A framework for static and dynamic analytics NVIDIA GTC custinger Oded Green,

3 Upfront Summary of Results Can support up to 90 million updates per second Low overhead in comparison with CSR Initializing is also relatively in expensive 20% 200% Equal performance Great performance for static graph algorithms Simple to use NVIDIA GTC custinger Oded Green,

4 Big Data problems need Graph Analysis Financial networks: Transactions between players. Different transactions types (property graph) Communication networks: World wide connectivity High velocity changes Different types of extracted data: Physical communication network. Person to person communication network. Health Care networks: Various players. Pattern matching and epidemic monitoring. Problem sizes have doubled in last 5 years. Graphs are a unifying motif for data analytics. NVIDIA GTC custinger Oded Green,

5 STINGER STINGER: Spatio Temporal Interaction Networks and Graphs (STING) Extensible Representation Enable algorithm designers to implement dynamic & streaming graph algorithms with ease. Portable semantics for various platforms Linked list of edge blocks not ideal for the GPU Good performance for all types of graph problems and algorithms static and dynamic. Assumes globally addressable memory access NVIDIA GTC custinger Oded Green,

6 STINGER and custinger Properties A Simple programming model Millions of updates per second to graph Updates are not bottlenecks for analytics. Advanced memory manager Transfers data between host and device automatically Reduces initialization time Allows for simple update processes STINGER Papers: [Bader et al.; 2007; Tech Report], [Ediger et al.; HPEC; 2012], [McColl et al.; PPAA; 2014] custinger paper: [Green&Bader; HPEC, 2016]: custinger: Supporting dynamic graph algorithms for GPUs NVIDIA GTC custinger Oded Green,

7 Definitions Dynamic graphs (matrices) Graph can change over time. Changes can be to topology, edges, or vertices. For example new edges between two vertices. Changes to edge or vertex weights Streaming graphs: Graphs changing at high rates. 100s of thousands of updates per second. NVIDIA GTC custinger Oded Green,

8 Dynamic graph example Only a subset of the entire graph Dynamic: At time t: v and w become friends. insert_edge (v, w) At time t: Ƹ u and v no longer friends delete edge u,v Additional operations include vertex insertions & deletions u v w NVIDIA GTC custinger Oded Green,

9 Separation of powers Dynamic graph data structure and dynamic graph algorithms are in two different repositories Easy to integrate with external library Can also be used with matrices First part of today s talk will be on the dynamic data structure NVIDIA GTC custinger Oded Green,

10 Part 1 Data Structure custinger Version 2.0 Improved initialization time 100s of time faster than Version 1.0 New memory manager Reduces fragmentation Enables memory reclamation Offers good memory bounds Scalable data structure Can easily grow 1000X its initial size without needing to be re initialized Faster updates Coming soon (probably late May) NVIDIA GTC custinger Oded Green,

11 Restrictions of existing static graph representations Name Pros Cons Adjacency Matrix Flexible Limited utilization for sparse data Linked lists Flexible Poor locality Allocation time is costly COO (Edge list) unsorted Has some flexibility Updates are simple CSR Uses exact amount of memory Good locality Poor locality Stores both the source and destination Inflexible NVIDIA GTC custinger Oded Green,

12 Compressed Sparse Row (CSR) Pros: Uses precise storage requirements Great locality Good for GPUs Handful of arrays Cons: Simple to use and manage Inflexible. Network growth unsupported Topology changes unsupported Property graphs not supported Src/Row Offset Dest./Col. Value NVIDIA GTC custinger Oded Green,

13 Part 1: custinger A High Level View USER INTERFACE Vertex Id Used BSize Over allocated space Pointer Dest./Col Value Supports updates Supports edge insertion\deletion and deletion. Supports vertex insertion\deletion. NVIDIA GTC custinger Oded Green,

14 custinger Property Graph Support USER INTERFACE Vertex Id Used BSize Pointer Dest./Col Weigth Type Time 1 User 1 User 2. These are optional fields NVIDIA GTC custinger Oded Green,

15 Challenges Memory allocations are costly. Seems like we are suggesting that we need O(V) allocations Absolutely not. Our first implementation did this ouch NVIDIA GTC custinger Oded Green,

16 Memory Manager Made up three parts: 1. Vectorized Bit Trees 2. BlockArrays 3. B + Trees of BlockArrays THIS IS AN INTERNAL REPRESENTATION (HIDDEN FROM USERS) NVIDIA GTC custinger Oded Green,

17 BlockArrays Definition an array made up of equal size blocks. Each block can contain an equal number of edges BlockArray (with 4 blocks) Block (with 2 edges) NVIDIA GTC custinger Oded Green,

18 custinger BlockArray allocations USER INTERFACE Vertex Id Used BSize Over allocated space Pointer Dest./Col Value BA 0,1 BA 1,1 BA 1,2 BA 2,1 USER INTERFACE INTERNAL REPRESENTATION. HIDDEN FROM USERS available space Relatively small number of BlockArrays are needed Exact number not known at compile time (or even at runtime given updates) NVIDIA GTC custinger Oded Green,

19 Memory Manager Made up three parts: 1. Vectorized Bit Trees Helps determine which blocks are empty Key components for efficient memory reclamation 2. BlockArrays 3. B + Trees of BlockArrays NVIDIA GTC custinger Oded Green,

20 Vec Trees Each block is either full (0) or empty (1). Vect Tree representation 0 Next available position block Vect Tree implementation BA 1,1 Vect Tree implementation BA 1, Machine word (a) Full BlockArray Machine word (b) Partially Empty BlockArray NVIDIA GTC custinger Oded Green,

21 Vec Trees Complexity Given a BlockArray with BA Storage complexity O BA bits. In practice this is close to 2 BA bits Relatively small overhead. Vec Tree Updates require O log BA operations NVIDIA GTC custinger Oded Green,

22 Memory Manager Made up three parts: 1. Vectorized Bit Trees 2. BlockArrays 3. B + Trees of BlockArrays Responsible for deciding when more memory needs to be allocated NVIDIA GTC custinger Oded Green,

23 Log. of block size B + Trees of BlockArrays Each block sizes has a different tree. The KEY of the B + Trees is the number of empty blocks B + Tree Array B + Tree for BlockArray with 1 edge in a block B + Tree for BlockArray with 2 edges in a block B + Tree for BlockArray with 4 edges in a block B+ Node BA 2,2 1 available block 31 NVIDIA GTC custinger Oded Green,

24 Log. of block size B + Trees of BlockArrays Currently supports adjacency lists with up to 2 31 edges Can easily support up 2 63 edge blocks!!! B + Tree Array B + Tree for BlockArray with 1 edge in a block B + Tree for BlockArray with 2 edges in a block B + Tree for BlockArray with 4 edges in a block B+ Node B+ Node BA 2,2 1 available block BA 2,1 0 available blocks NVIDIA GTC custinger Oded Green,

25 B + Trees Properties A new BlockArray is allocated when all existing BlockArrays are full. Great for memory utilization. NVIDIA GTC custinger Oded Green,

26 custinger Update Process for Edge Insertions USER INTERFACE Vertex Id Used BSize Pointer Update Batch Source Destination Value Count number of insertions for each Source 2. Check edge availability for each Source. If not enough edges: 1. memory manger get larger block that can store all edges 2. Copy existing edges from old block to new block 3. memory manager old block is reclaimed 3. Insert new edges (while avoiding duplicates) NVIDIA GTC custinger Oded Green,

27 custinger Full View (after update) USER INTERFACE Vertex Id Used BSize Pointer Update Batch Source Destination Value Dest./Col. Value USER INTERFACE BA 0,1 BA 1,1 BA 1,2 INTERNAL REPRESENTATION. HIDDEN FROM USERS Vec Tree bit status BA 2, BA 2,2 NVIDIA GTC custinger Oded Green,

28 custinger Go Home with this View USER INTERFACE Vertex Id Update Batch Used Source BSize Destination Pointer Value Dest./Col. Value USER INTERFACE NVIDIA GTC custinger Oded Green,

29 custinger Data Structure Build instructions git clone recursive mkdir build && cd build cmake.. make j8 NVIDIA GTC custinger Oded Green,

30 Performance Analysis Initialization Overhead Memory Utilization Update rate Number of sustainable updates per second CSR Vs. custinger SpMV NVIDIA GTC custinger Oded Green,

31 Experimental Setup GPU μarch SMs SPs Memory (GB) For the K80 GPU, we use only one GPU Memory Type K40 Kepler GDDR5 K80 Kepler 2x13 2x2496 2x12 GDDR5 P100 Pascal HBM2 We report for the K80, unless noted NVIDIA GTC custinger Oded Green,

32 Inputs Graphs DIMACS 10 Graph Implementation Challenge SNAP Stanford Network Analysis Project Florida Matrix Collection The following is only a subset of these graphs: Name Type V E * Source coauthorsdblp Collaboration 299k 1.95M DIMACS as skitter Trace route 1.69M 11.1M SNAP kron_21 Random 2M 201M DIMACS cit patents Citation 3.77M 16.5M SNAP cage15 Matrix 5.15M 94M DIMACS uk 2002 Webcrawl 18.52M 523M DIMACS NVIDIA GTC custinger Oded Green,

33 Space Efficiency Memory Utilization Edges 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Used Edge Utilization Utilization = Allocated 70% average utilization NVIDIA GTC custinger Oded Green,

34 Space Efficiency Memory Utilization Blocks 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Used Block Utilization Utilization = Allocated 90% average utilization NVIDIA GTC custinger Oded Green,

35 Space Efficiency Memory Utilization Overall 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 70% average utilization Overall Utilization 30% overhead in comparison to CSR NVIDIA GTC custinger Oded Green,

36 Overhead vs. CSR Memcpy Initialization Time Copying from the CPU to GPU is costly Initializing custinger does not add much overhead We still need to optimize this process for P CSR Copy Host to Device Initilization on Device Over 100x than custinger Version 1.0 NVIDIA GTC custinger Oded Green,

37 Insertion Rates Supports up to 90M updates per second Version 2.0 is 4X 10X faster than Version 1.0 Does not have performance dip like Version 1.0 Scalable growth in update rate Version 1.0 Version 2.0 NVIDIA GTC custinger Oded Green,

38 MFLOPS SpMV: CSR Vs. custinger CSR custinger Simply replace CSR accesses with custinger A real apples to apples comparison NVIDIA GTC custinger Oded Green,

39 Part 2: Algorithms for custinger Goal support Static and Dynamic graph algorithms Already showed that the graph update process is efficient All algorithms are implemented using the same set of operations We show that these operators are efficient for static graph algorithms NVIDIA GTC custinger Oded Green,

40 custinger Programming Model Keep things simple Limit the amount of GPU programming users need to do. Uses vertex and edge frontiers Similar to Gunrock & LIGRA Necessary for good utilization Still requires good load balancing Edge frontiers are created implicitly from vertex frontiers. NVIDIA GTC custinger Oded Green,

41 Vertex and Edge Frontiers Ligra CPU HPC graph framework by Julian Shun Two backends: CILK and OpenMP Edge frontiers created implicitly from vertex frontiers Gunrock Typically one phase Highly tuned GPU graph library from Prof. John Owens Supports multi GPU analytics (same shared node) Each operation consists of two phases Edge frontiers created explicitly by programmer. NVIDIA GTC custinger Oded Green,

42 Case Study 1: Label Propagation Connected Components Connected component algorithms such as Shiloach Vishkin Initially every vertex is in its components Vertices move to with smallest ID Vertices can swap components multiple times NVIDIA GTC - custinger - Oded Green,

43 One Iteration of Connected Components Algorithm // Label propagation 1) For all v V 2) For all u adj v 3) if CC u < CC v 4) CC u CC v // Shortcutting 5) For all v V Traverse all edges O E Traverse all vertices O V 6) CC v CC CC v NVIDIA GTC - custinger - Oded Green,

44 Revisiting the algorithm in parallel* Notice, no triple brackets <<<>>> * We have more optimized versions in the library (require about 4 more lines of code.) NVIDIA GTC - custinger - Oded Green,

45 Case Study 2: Katz Centrality Given pseudo code for a variant of Katz Centrality It took 5 6 hours to port from the CPU to the GPU. NVIDIA P100 GPU, initial speedup: 70X CPU version had some addition optimizations. Within additional two hours speedup was: 100X This was feasible because of the pre existing load balanced primitives in the library NVIDIA GTC - custinger - Oded Green,

46 custinger Algorithms Build instructions git clone recursive mkdir build && cd build cmake.. make j8 By default, custinger will also be cloned Though you will need to build both repos NVIDIA GTC custinger Oded Green,

47 custingerv2 Algorithms Build instructions git clone recursive cd custingeralg/build cmake.. make j NVIDIA GTC custinger Oded Green,

48 Performance Analysis Connected Components Breadth First Search Triangle Counting Static Dynamic NVIDIA GTC custinger Oded Green,

49 Connected Components NVIDIA P100 Name Gunrock (msec.) custinger (msec.) Speedup coauthorsdblp X as Skitter X kron_ X cit patents X Cage X uk X Using label propagation Within 25% for most cases. NVIDIA GTC custinger Oded Green,

50 BFS Classic Top Down NVIDIA P100 Name Gunrock (msec.) custinger (msec.) Speedup coauthorsdblp X as Skitter X kron_ X cit patents X cage X uk X Using a similar algorithm in Gunrock Gunrock has additional optimizations that can make it faster than custinger NVIDIA GTC custinger Oded Green,

51 Triangle Counting: CSR Vs. custinger Name V E Time CSR (sec.) Time custinger (sec.) Execution Difference coauthorsdblp 299k 1.95M % as skitter 1.69M 11.1M % kron_21 2M 201M % cit patents 3.77M 16.5M % cage M 94M % uk M 523M % Triangle counting algorithm taken from [Green et al; IA 3 ;2014] Simply replace CSR accesses with custinger Executed on a K40 NVIDIA GTC custinger Oded Green,

52 Summary NVIDIA GTC custinger Oded Green,

53 Library Overview Completed algorithms and on going Algorithm Static Dynamic Reference Breadth first search Triangle Counting Static [Green et al; IA32014] Dynamic new algorithm [Makkar; 2017 submitted] Connect components on going [McColl; HiPC 2013] Betweenness Centrality on going [Green; SocialCom 2012] Page Rank on going New algorithm (non linear algebra based) Katz Centrality New algorithm (non linear algebra based) Of course many more algorithms to come NVIDIA GTC custinger Oded Green,

54 Upcoming projects using custinger Extend dynamic triangle counting to Jaccard Indices Scalable pattern and motif detection on the GPU NVIDIA GTC custinger Oded Green,

55 Take away Dynamic data structure for sparse data sets Supports high update rates Scalable in both data size and in performance NVIDIA GTC custinger Oded Green,

56 Collaborators Prof. David Bader (Georgia Tech) Prof. Jimeng Sun (Georgia Tech) Dr. Jason Riedy (Georgia Tech) Federico Busato, Visiting PhD student (Universita di Verona) James Fox, PhD student (Georgia Tech) Euna Kim, PhD student (Georgia Tech) Muhammad Osama Sakhi, BSc student (Georgia Tech) Alok Tripathy, BSc student (Georgia Tech) Manas George, BSc student (Georgia Tech) Graduates (GT) Devavret Makkar, MSc. (Tower Research) NVIDIA GTC custinger Oded Green,

57 Thank you Data structure: Algorithms: Versions 2.0, coming soon to a GPU near you NVIDIA GTC custinger Oded Green,

custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader

custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality

More information

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques James Fox Collaborators Oded Green, Research Scientist (GT) Euna Kim, PhD student (GT) Federico Busato, PhD student (Universita di

More information

Autonomous, Independent Management of Dynamic Graphs on GPUs

Autonomous, Independent Management of Dynamic Graphs on GPUs Autonomous, Independent Management of Dynamic Graphs on GPUs Martin Winter Graz University of Technology Graz, Austria martin.winter@icg.tugraz.at Rhaleb Zayer Max Planck Institute for Informatics Saarland

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)

More information

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

A Comparative Study on Exact Triangle Counting Algorithms on the GPU A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,

More information

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for

More information

Graph Partitioning for Scalable Distributed Graph Computations

Graph Partitioning for Scalable Distributed Graph Computations Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering

More information

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D.

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D. NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Agenda Accelerated Computing nvgraph New Features Coming Soon Dynamic Graphs GraphBLAS 2 ACCELERATED COMPUTING 10x Performance

More information

A Framework for Processing Large Graphs in Shared Memory

A Framework for Processing Large Graphs in Shared Memory A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge

More information

Efficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis

Efficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis Efficiently Multiplying Sparse Matrix - Sparse Vector for Social Network Analysis Ashish Kumar Mentors - David Bader, Jason Riedy, Jiajia Li Indian Institute of Technology, Jodhpur 6 June 2014 Ashish Kumar

More information

Flexible Batched Sparse Matrix-Vector Product on GPUs

Flexible Batched Sparse Matrix-Vector Product on GPUs ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,

More information

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime Runtime The optimized program is ready to run What sorts of facilities are available at runtime Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis token stream Syntactic

More information

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? The CVDLab Team Francesco Furiani Tue, April 3, 2014 ROMA TRE UNIVERSITÀ DEGLI STUDI

More information

CS 5220: Parallel Graph Algorithms. David Bindel

CS 5220: Parallel Graph Algorithms. David Bindel CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :

More information

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging

More information

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware

More information

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

NUMA-aware Graph-structured Analytics

NUMA-aware Graph-structured Analytics NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11

More information

Integrating Productivity-Oriented Programming Languages with High-Performance Data Structures

Integrating Productivity-Oriented Programming Languages with High-Performance Data Structures Integrating Productivity-Oriented Programming Languages with High-Performance Data Structures James Fairbanks Rohit Varkey Thankachan, Eric Hein, Brian Swenson Georgia Tech Research Institute September

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu

More information

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation David A. Bader Georgia Institute of Technolgy Adam Amos-Binks Carleton University, Canada Jonathan Berry Sandia

More information

Performance Analysis and Culling Algorithms

Performance Analysis and Culling Algorithms Performance Analysis and Culling Algorithms Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1 Assignment 2 Sign up for Pluto labs on the web

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 3 Data Structures Graphs Traversals Strongly connected components Sofya Raskhodnikova L3.1 Measuring Running Time Focus on scalability: parameterize the running time

More information

Wedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018

Wedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018 Wedge A New Frontier for Pull-based Graph Processing Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018 Graph Processing Problems modelled as objects (vertices) and connections between

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

R13. II B. Tech I Semester Supplementary Examinations, May/June DATA STRUCTURES (Com. to ECE, CSE, EIE, IT, ECC)

R13. II B. Tech I Semester Supplementary Examinations, May/June DATA STRUCTURES (Com. to ECE, CSE, EIE, IT, ECC) SET - 1 II B. Tech I Semester Supplementary Examinations, May/June - 2016 PART A 1. a) Write a procedure for the Tower of Hanoi problem? b) What you mean by enqueue and dequeue operations in a queue? c)

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

Simple Parallel Biconnectivity Algorithms for Multicore Platforms

Simple Parallel Biconnectivity Algorithms for Multicore Platforms Simple Parallel Biconnectivity Algorithms for Multicore Platforms George M. Slota Kamesh Madduri The Pennsylvania State University HiPC 2014 December 17-20, 2014 Code, presentation available at graphanalysis.info

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

CASS-MT Task #7 - Georgia Tech. GraphCT: A Graph Characterization Toolkit. David A. Bader, David Ediger, Karl Jiang & Jason Riedy

CASS-MT Task #7 - Georgia Tech. GraphCT: A Graph Characterization Toolkit. David A. Bader, David Ediger, Karl Jiang & Jason Riedy CASS-MT Task #7 - Georgia Tech GraphCT: A Graph Characterization Toolkit David A. Bader, David Ediger, Karl Jiang & Jason Riedy October 26, 2009 Outline Motivation What is GraphCT? Package for Massive

More information

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Technologies for High Performance Data Analytics

Technologies for High Performance Data Analytics Technologies for High Performance Data Analytics Dr. Jens Krüger Fraunhofer ITWM 1 Fraunhofer ITWM n Institute for Industrial Mathematics n Located in Kaiserslautern, Germany n Staff: ~ 240 employees +

More information

Fast Segmented Sort on GPUs

Fast Segmented Sort on GPUs Fast Segmented Sort on GPUs Kaixi Hou, Weifeng Liu, Hao Wang, Wu-chun Feng {kaixihou, hwang121, wfeng}@vt.edu weifeng.liu@nbi.ku.dk Segmented Sort (SegSort) Perform a segment-by-segment sort on a given

More information

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge

More information

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

STINGER: Spatio-Temporal Interaction Networks and Graphs Extensible Representation

STINGER: Spatio-Temporal Interaction Networks and Graphs Extensible Representation Chapter 1 : Spatio-Temporal Interaction Networks and Graphs Extensible Representation 1.1 Background Contributed by Tim Shaffer [1] is a general-purpose graph data structure that supports streaming updates

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES

Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES Background and Motivation Graph applications are memory latency bound Caches & Prefetching are existing solution for memory

More information

CS162 - Operating Systems and Systems Programming. Address Translation => Paging"

CS162 - Operating Systems and Systems Programming. Address Translation => Paging CS162 - Operating Systems and Systems Programming Address Translation => Paging" David E. Culler! http://cs162.eecs.berkeley.edu/! Lecture #15! Oct 3, 2014!! Reading: A&D 8.1-2, 8.3.1. 9.7 HW 3 out (due

More information

Graphs. CSE 2320 Algorithms and Data Structures Alexandra Stefan and Vassilis Athitsos University of Texas at Arlington

Graphs. CSE 2320 Algorithms and Data Structures Alexandra Stefan and Vassilis Athitsos University of Texas at Arlington Graphs CSE 2320 Algorithms and Data Structures Alexandra Stefan and Vassilis Athitsos University of Texas at Arlington 1 Representation Adjacency matrix??adjacency lists?? Review Graphs (from CSE 2315)

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

CSE638: Advanced Algorithms, Spring 2013 Date: March 26. Homework #2. ( Due: April 16 )

CSE638: Advanced Algorithms, Spring 2013 Date: March 26. Homework #2. ( Due: April 16 ) CSE638: Advanced Algorithms, Spring 2013 Date: March 26 Homework #2 ( Due: April 16 ) Task 1. [ 40 Points ] Parallel BFS with Cilk s Work Stealing. (a) [ 20 Points ] In homework 1 we analyzed and implemented

More information

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018 CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

CSE 373 NOVEMBER 20 TH TOPOLOGICAL SORT

CSE 373 NOVEMBER 20 TH TOPOLOGICAL SORT CSE 373 NOVEMBER 20 TH TOPOLOGICAL SORT PROJECT 3 500 Internal Error problems Hopefully all resolved (or close to) P3P1 grades are up (but muted) Leave canvas comment Emails tomorrow End of quarter GRAPHS

More information

Kathleen Durant PhD Northeastern University CS Indexes

Kathleen Durant PhD Northeastern University CS Indexes Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

GPU-Based Acceleration for CT Image Reconstruction

GPU-Based Acceleration for CT Image Reconstruction GPU-Based Acceleration for CT Image Reconstruction Xiaodong Yu Advisor: Wu-chun Feng Collaborators: Guohua Cao, Hao Gong Outline Introduction and Motivation Background Knowledge Challenges and Proposed

More information

Acknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides

Acknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides Garbage Collection Last time Compiling Object-Oriented Languages Today Motivation behind garbage collection Garbage collection basics Garbage collection performance Specific example of using GC in C++

More information

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification

More information

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information

Memory management. Knut Omang Ifi/Oracle 10 Oct, 2012

Memory management. Knut Omang Ifi/Oracle 10 Oct, 2012 Memory management Knut Omang Ifi/Oracle 1 Oct, 212 (with slides from V. Goebel, C. Griwodz (Ifi/UiO), P. Halvorsen (Ifi/UiO), K. Li (Princeton), A. Tanenbaum (VU Amsterdam), and M. van Steen (VU Amsterdam))

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

A Simple and Practical Linear-Work Parallel Algorithm for Connectivity

A Simple and Practical Linear-Work Parallel Algorithm for Connectivity A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Julian Shun, Laxman Dhulipala, and Guy Blelloch Presentation based on publication in Symposium on Parallelism in Algorithms and Architectures

More information

Multidimensional Arrays & Graphs. CMSC 420: Lecture 3

Multidimensional Arrays & Graphs. CMSC 420: Lecture 3 Multidimensional Arrays & Graphs CMSC 420: Lecture 3 Mini-Review Abstract Data Types: List Stack Queue Deque Dictionary Set Implementations: Linked Lists Circularly linked lists Doubly linked lists XOR

More information

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors Kaixi Hou, Wu-chun Feng {kaixihou, wfeng}@vt.edu Shuai Che Shuai.Che@amd.com Sparse

More information

Outline. Introduction Network Analysis Static Parallel Algorithms Dynamic Parallel Algorithms. GraphCT + STINGER

Outline. Introduction Network Analysis Static Parallel Algorithms Dynamic Parallel Algorithms. GraphCT + STINGER Outline Introduction Network Analysis Static Parallel Algorithms Dynamic Parallel Algorithms Data Structures for Streaming Data Clustering Coefficients Connected Components Anomaly Detection GraphCT +

More information

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid

More information

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems

More information

A Framework for Processing Large Graphs in Shared Memory

A Framework for Processing Large Graphs in Shared Memory A Framework for Processing Large Graphs in Shared Memory Julian Shun UC Berkeley (Miller Institute, AMPLab) Based on joint work with Guy Blelloch and Laxman Dhulipala, Kimon Fountoulakis, Michael Mahoney,

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

GpuWrapper: A Portable API for Heterogeneous Programming at CGG GpuWrapper: A Portable API for Heterogeneous Programming at CGG Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon March 2 nd, 2016 GpuWrapper: Objectives &

More information

Graphs. Edges may be directed (from u to v) or undirected. Undirected edge eqvt to pair of directed edges

Graphs. Edges may be directed (from u to v) or undirected. Undirected edge eqvt to pair of directed edges (p 186) Graphs G = (V,E) Graphs set V of vertices, each with a unique name Note: book calls vertices as nodes set E of edges between vertices, each encoded as tuple of 2 vertices as in (u,v) Edges may

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Adaptable benchmarks for register blocked sparse matrix-vector multiplication

Adaptable benchmarks for register blocked sparse matrix-vector multiplication Adaptable benchmarks for register blocked sparse matrix-vector multiplication Berkeley Benchmarking and Optimization group (BeBOP) Hormozd Gahvari and Mark Hoemmen Based on research of: Eun-Jin Im Rich

More information

Parallel SAH k-d Tree Construction. Byn Choi, Rakesh Komuravelli, Victor Lu, Hyojin Sung, Robert L. Bocchino, Sarita V. Adve, John C.

Parallel SAH k-d Tree Construction. Byn Choi, Rakesh Komuravelli, Victor Lu, Hyojin Sung, Robert L. Bocchino, Sarita V. Adve, John C. Parallel SAH k-d Tree Construction Byn Choi, Rakesh Komuravelli, Victor Lu, Hyojin Sung, Robert L. Bocchino, Sarita V. Adve, John C. Hart Motivation Real-time Dynamic Ray Tracing Efficient rendering with

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

CSE 100: GRAPH ALGORITHMS

CSE 100: GRAPH ALGORITHMS CSE 100: GRAPH ALGORITHMS 2 Graphs: Example A directed graph V5 V = { V = E = { E Path: 3 Graphs: Definitions A directed graph V5 V6 A graph G = (V,E) consists of a set of vertices V and a set of edges

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

CafeGPI. Single-Sided Communication for Scalable Deep Learning

CafeGPI. Single-Sided Communication for Scalable Deep Learning CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks

More information

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems Memory-Centric System Level Design of Heterogeneous Embedded DSP Systems The GraphGrind Framework: Scott Johan Fischaber, BSc (Hon) Fast Graph Analytics on Large Shared-Memory Systems Jiawen Sun, PhD candidate

More information

Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI

Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI rabesque system for distributed graph mining Mohammed Zaki, RPI Carlos Teixeira, lexandre Fonseca, Marco Serafini, Georgos Siganos, shraf boulnaga, Qatar Computing Research Institute (QCRI) 1 Big Data

More information

Iterative Sparse Triangular Solves for Preconditioning

Iterative Sparse Triangular Solves for Preconditioning Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Modern GPUs (Graphics Processing Units)

Modern GPUs (Graphics Processing Units) Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Alternative GPU friendly assignment algorithms Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Graphics Processing Units (GPUs) Context: GPU Performance Accelerated

More information

Streaming Graph Analytics: Complexity, Scalability, and Architectures

Streaming Graph Analytics: Complexity, Scalability, and Architectures Streaming Graph Analytics: Complexity, Scalability, and Architectures Peter M. Kogge McCourtney Prof. of CSE Univ. of Notre Dame IBM Fellow (retired)please Sir, I want more 1 Outline Motivation and Current

More information