custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green

Size: px

Start display at page:

Download "custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green"

Mervin Thompson
5 years ago
Views:

1 custinger A Sparse Dynamic Graph and Matrix Data Structure Oded Green

2 Going to talk about 2 things A scalable and dynamic data structure for graph algorithms and linear algebra based problems A framework for static and dynamic analytics NVIDIA GTC custinger Oded Green,

3 Upfront Summary of Results Can support up to 90 million updates per second Low overhead in comparison with CSR Initializing is also relatively in expensive 20% 200% Equal performance Great performance for static graph algorithms Simple to use NVIDIA GTC custinger Oded Green,

Big Data problems need Graph Analysis Financial networks: Transactions between players.

connectivity High velocity changes Different types of extracted data: Physical

Health Care networks: Various players. Pattern matching and epidemic monitoring.

4 Big Data problems need Graph Analysis Financial networks: Transactions between players. Different transactions types (property graph) Communication networks: World wide connectivity High velocity changes Different types of extracted data: Physical communication network. Person to person communication network. Health Care networks: Various players. Pattern matching and epidemic monitoring. Problem sizes have doubled in last 5 years. Graphs are a unifying motif for data analytics. NVIDIA GTC custinger Oded Green,

5 STINGER STINGER: Spatio Temporal Interaction Networks and Graphs (STING) Extensible Representation Enable algorithm designers to implement dynamic & streaming graph algorithms with ease. Portable semantics for various platforms Linked list of edge blocks not ideal for the GPU Good performance for all types of graph problems and algorithms static and dynamic. Assumes globally addressable memory access NVIDIA GTC custinger Oded Green,

6 STINGER and custinger Properties A Simple programming model Millions of updates per second to graph Updates are not bottlenecks for analytics. Advanced memory manager Transfers data between host and device automatically Reduces initialization time Allows for simple update processes STINGER Papers: [Bader et al.; 2007; Tech Report], [Ediger et al.; HPEC; 2012], [McColl et al.; PPAA; 2014] custinger paper: [Green&Bader; HPEC, 2016]: custinger: Supporting dynamic graph algorithms for GPUs NVIDIA GTC custinger Oded Green,

7 Definitions Dynamic graphs (matrices) Graph can change over time. Changes can be to topology, edges, or vertices. For example new edges between two vertices. Changes to edge or vertex weights Streaming graphs: Graphs changing at high rates. 100s of thousands of updates per second. NVIDIA GTC custinger Oded Green,

8 Dynamic graph example Only a subset of the entire graph Dynamic: At time t: v and w become friends. insert_edge (v, w) At time t: Ƹ u and v no longer friends delete edge u,v Additional operations include vertex insertions & deletions u v w NVIDIA GTC custinger Oded Green,

9 Separation of powers Dynamic graph data structure and dynamic graph algorithms are in two different repositories Easy to integrate with external library Can also be used with matrices First part of today s talk will be on the dynamic data structure NVIDIA GTC custinger Oded Green,

10 Part 1 Data Structure custinger Version 2.0 Improved initialization time 100s of time faster than Version 1.0 New memory manager Reduces fragmentation Enables memory reclamation Offers good memory bounds Scalable data structure Can easily grow 1000X its initial size without needing to be re initialized Faster updates Coming soon (probably late May) NVIDIA GTC custinger Oded Green,

11 Restrictions of existing static graph representations Name Pros Cons Adjacency Matrix Flexible Limited utilization for sparse data Linked lists Flexible Poor locality Allocation time is costly COO (Edge list) unsorted Has some flexibility Updates are simple CSR Uses exact amount of memory Good locality Poor locality Stores both the source and destination Inflexible NVIDIA GTC custinger Oded Green,

Compressed Sparse Row (CSR) Pros: Uses precise storage requirements Great locality Good for GPUs Handful of arrays Cons: Simple to use and manage Inflexible.

12 Compressed Sparse Row (CSR) Pros: Uses precise storage requirements Great locality Good for GPUs Handful of arrays Cons: Simple to use and manage Inflexible. Network growth unsupported Topology changes unsupported Property graphs not supported Src/Row Offset Dest./Col. Value NVIDIA GTC custinger Oded Green,

13 Part 1: custinger A High Level View USER INTERFACE Vertex Id Used BSize Over allocated space Pointer Dest./Col Value Supports updates Supports edge insertion\deletion and deletion. Supports vertex insertion\deletion. NVIDIA GTC custinger Oded Green,

14 custinger Property Graph Support USER INTERFACE Vertex Id Used BSize Pointer Dest./Col Weigth Type Time 1 User 1 User 2. These are optional fields NVIDIA GTC custinger Oded Green,

15 Challenges Memory allocations are costly. Seems like we are suggesting that we need O(V) allocations Absolutely not. Our first implementation did this ouch NVIDIA GTC custinger Oded Green,

16 Memory Manager Made up three parts: 1. Vectorized Bit Trees 2. BlockArrays 3. B + Trees of BlockArrays THIS IS AN INTERNAL REPRESENTATION (HIDDEN FROM USERS) NVIDIA GTC custinger Oded Green,

17 BlockArrays Definition an array made up of equal size blocks. Each block can contain an equal number of edges BlockArray (with 4 blocks) Block (with 2 edges) NVIDIA GTC custinger Oded Green,

18 custinger BlockArray allocations USER INTERFACE Vertex Id Used BSize Over allocated space Pointer Dest./Col Value BA 0,1 BA 1,1 BA 1,2 BA 2,1 USER INTERFACE INTERNAL REPRESENTATION. HIDDEN FROM USERS available space Relatively small number of BlockArrays are needed Exact number not known at compile time (or even at runtime given updates) NVIDIA GTC custinger Oded Green,

19 Memory Manager Made up three parts: 1. Vectorized Bit Trees Helps determine which blocks are empty Key components for efficient memory reclamation 2. BlockArrays 3. B + Trees of BlockArrays NVIDIA GTC custinger Oded Green,

20 Vec Trees Each block is either full (0) or empty (1). Vect Tree representation 0 Next available position block Vect Tree implementation BA 1,1 Vect Tree implementation BA 1, Machine word (a) Full BlockArray Machine word (b) Partially Empty BlockArray NVIDIA GTC custinger Oded Green,

21 Vec Trees Complexity Given a BlockArray with BA Storage complexity O BA bits. In practice this is close to 2 BA bits Relatively small overhead. Vec Tree Updates require O log BA operations NVIDIA GTC custinger Oded Green,

22 Memory Manager Made up three parts: 1. Vectorized Bit Trees 2. BlockArrays 3. B + Trees of BlockArrays Responsible for deciding when more memory needs to be allocated NVIDIA GTC custinger Oded Green,

23 Log. of block size B + Trees of BlockArrays Each block sizes has a different tree. The KEY of the B + Trees is the number of empty blocks B + Tree Array B + Tree for BlockArray with 1 edge in a block B + Tree for BlockArray with 2 edges in a block B + Tree for BlockArray with 4 edges in a block B+ Node BA 2,2 1 available block 31 NVIDIA GTC custinger Oded Green,

24 Log. of block size B + Trees of BlockArrays Currently supports adjacency lists with up to 2 31 edges Can easily support up 2 63 edge blocks!!! B + Tree Array B + Tree for BlockArray with 1 edge in a block B + Tree for BlockArray with 2 edges in a block B + Tree for BlockArray with 4 edges in a block B+ Node B+ Node BA 2,2 1 available block BA 2,1 0 available blocks NVIDIA GTC custinger Oded Green,

25 B + Trees Properties A new BlockArray is allocated when all existing BlockArrays are full. Great for memory utilization. NVIDIA GTC custinger Oded Green,

26 custinger Update Process for Edge Insertions USER INTERFACE Vertex Id Used BSize Pointer Update Batch Source Destination Value Count number of insertions for each Source 2. Check edge availability for each Source. If not enough edges: 1. memory manger get larger block that can store all edges 2. Copy existing edges from old block to new block 3. memory manager old block is reclaimed 3. Insert new edges (while avoiding duplicates) NVIDIA GTC custinger Oded Green,

27 custinger Full View (after update) USER INTERFACE Vertex Id Used BSize Pointer Update Batch Source Destination Value Dest./Col. Value USER INTERFACE BA 0,1 BA 1,1 BA 1,2 INTERNAL REPRESENTATION. HIDDEN FROM USERS Vec Tree bit status BA 2, BA 2,2 NVIDIA GTC custinger Oded Green,

28 custinger Go Home with this View USER INTERFACE Vertex Id Update Batch Used Source BSize Destination Pointer Value Dest./Col. Value USER INTERFACE NVIDIA GTC custinger Oded Green,

29 custinger Data Structure Build instructions git clone recursive mkdir build && cd build cmake.. make j8 NVIDIA GTC custinger Oded Green,

30 Performance Analysis Initialization Overhead Memory Utilization Update rate Number of sustainable updates per second CSR Vs. custinger SpMV NVIDIA GTC custinger Oded Green,

31 Experimental Setup GPU μarch SMs SPs Memory (GB) For the K80 GPU, we use only one GPU Memory Type K40 Kepler GDDR5 K80 Kepler 2x13 2x2496 2x12 GDDR5 P100 Pascal HBM2 We report for the K80, unless noted NVIDIA GTC custinger Oded Green,

32 Inputs Graphs DIMACS 10 Graph Implementation Challenge SNAP Stanford Network Analysis Project Florida Matrix Collection The following is only a subset of these graphs: Name Type V E * Source coauthorsdblp Collaboration 299k 1.95M DIMACS as skitter Trace route 1.69M 11.1M SNAP kron_21 Random 2M 201M DIMACS cit patents Citation 3.77M 16.5M SNAP cage15 Matrix 5.15M 94M DIMACS uk 2002 Webcrawl 18.52M 523M DIMACS NVIDIA GTC custinger Oded Green,

33 Space Efficiency Memory Utilization Edges 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Used Edge Utilization Utilization = Allocated 70% average utilization NVIDIA GTC custinger Oded Green,

34 Space Efficiency Memory Utilization Blocks 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Used Block Utilization Utilization = Allocated 90% average utilization NVIDIA GTC custinger Oded Green,

35 Space Efficiency Memory Utilization Overall 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 70% average utilization Overall Utilization 30% overhead in comparison to CSR NVIDIA GTC custinger Oded Green,

36 Overhead vs. CSR Memcpy Initialization Time Copying from the CPU to GPU is costly Initializing custinger does not add much overhead We still need to optimize this process for P CSR Copy Host to Device Initilization on Device Over 100x than custinger Version 1.0 NVIDIA GTC custinger Oded Green,

37 Insertion Rates Supports up to 90M updates per second Version 2.0 is 4X 10X faster than Version 1.0 Does not have performance dip like Version 1.0 Scalable growth in update rate Version 1.0 Version 2.0 NVIDIA GTC custinger Oded Green,

38 MFLOPS SpMV: CSR Vs. custinger CSR custinger Simply replace CSR accesses with custinger A real apples to apples comparison NVIDIA GTC custinger Oded Green,

39 Part 2: Algorithms for custinger Goal support Static and Dynamic graph algorithms Already showed that the graph update process is efficient All algorithms are implemented using the same set of operations We show that these operators are efficient for static graph algorithms NVIDIA GTC custinger Oded Green,

40 custinger Programming Model Keep things simple Limit the amount of GPU programming users need to do. Uses vertex and edge frontiers Similar to Gunrock & LIGRA Necessary for good utilization Still requires good load balancing Edge frontiers are created implicitly from vertex frontiers. NVIDIA GTC custinger Oded Green,

41 Vertex and Edge Frontiers Ligra CPU HPC graph framework by Julian Shun Two backends: CILK and OpenMP Edge frontiers created implicitly from vertex frontiers Gunrock Typically one phase Highly tuned GPU graph library from Prof. John Owens Supports multi GPU analytics (same shared node) Each operation consists of two phases Edge frontiers created explicitly by programmer. NVIDIA GTC custinger Oded Green,

42 Case Study 1: Label Propagation Connected Components Connected component algorithms such as Shiloach Vishkin Initially every vertex is in its components Vertices move to with smallest ID Vertices can swap components multiple times NVIDIA GTC - custinger - Oded Green,

43 One Iteration of Connected Components Algorithm // Label propagation 1) For all v V 2) For all u adj v 3) if CC u < CC v 4) CC u CC v // Shortcutting 5) For all v V Traverse all edges O E Traverse all vertices O V 6) CC v CC CC v NVIDIA GTC - custinger - Oded Green,

44 Revisiting the algorithm in parallel* Notice, no triple brackets <<<>>> * We have more optimized versions in the library (require about 4 more lines of code.) NVIDIA GTC - custinger - Oded Green,

45 Case Study 2: Katz Centrality Given pseudo code for a variant of Katz Centrality It took 5 6 hours to port from the CPU to the GPU. NVIDIA P100 GPU, initial speedup: 70X CPU version had some addition optimizations. Within additional two hours speedup was: 100X This was feasible because of the pre existing load balanced primitives in the library NVIDIA GTC - custinger - Oded Green,

46 custinger Algorithms Build instructions git clone recursive mkdir build && cd build cmake.. make j8 By default, custinger will also be cloned Though you will need to build both repos NVIDIA GTC custinger Oded Green,

47 custingerv2 Algorithms Build instructions git clone recursive cd custingeralg/build cmake.. make j NVIDIA GTC custinger Oded Green,

48 Performance Analysis Connected Components Breadth First Search Triangle Counting Static Dynamic NVIDIA GTC custinger Oded Green,

49 Connected Components NVIDIA P100 Name Gunrock (msec.) custinger (msec.) Speedup coauthorsdblp X as Skitter X kron_ X cit patents X Cage X uk X Using label propagation Within 25% for most cases. NVIDIA GTC custinger Oded Green,

50 BFS Classic Top Down NVIDIA P100 Name Gunrock (msec.) custinger (msec.) Speedup coauthorsdblp X as Skitter X kron_ X cit patents X cage X uk X Using a similar algorithm in Gunrock Gunrock has additional optimizations that can make it faster than custinger NVIDIA GTC custinger Oded Green,

51 Triangle Counting: CSR Vs. custinger Name V E Time CSR (sec.) Time custinger (sec.) Execution Difference coauthorsdblp 299k 1.95M % as skitter 1.69M 11.1M % kron_21 2M 201M % cit patents 3.77M 16.5M % cage M 94M % uk M 523M % Triangle counting algorithm taken from [Green et al; IA 3 ;2014] Simply replace CSR accesses with custinger Executed on a K40 NVIDIA GTC custinger Oded Green,

52 Summary NVIDIA GTC custinger Oded Green,

53 Library Overview Completed algorithms and on going Algorithm Static Dynamic Reference Breadth first search Triangle Counting Static [Green et al; IA32014] Dynamic new algorithm [Makkar; 2017 submitted] Connect components on going [McColl; HiPC 2013] Betweenness Centrality on going [Green; SocialCom 2012] Page Rank on going New algorithm (non linear algebra based) Katz Centrality New algorithm (non linear algebra based) Of course many more algorithms to come NVIDIA GTC custinger Oded Green,

54 Upcoming projects using custinger Extend dynamic triangle counting to Jaccard Indices Scalable pattern and motif detection on the GPU NVIDIA GTC custinger Oded Green,

55 Take away Dynamic data structure for sparse data sets Supports high update rates Scalable in both data size and in performance NVIDIA GTC custinger Oded Green,

56 Collaborators Prof. David Bader (Georgia Tech) Prof. Jimeng Sun (Georgia Tech) Dr. Jason Riedy (Georgia Tech) Federico Busato, Visiting PhD student (Universita di Verona) James Fox, PhD student (Georgia Tech) Euna Kim, PhD student (Georgia Tech) Muhammad Osama Sakhi, BSc student (Georgia Tech) Alok Tripathy, BSc student (Georgia Tech) Manas George, BSc student (Georgia Tech) Graduates (GT) Devavret Makkar, MSc. (Tower Research) NVIDIA GTC custinger Oded Green,

57 Thank you Data structure: Algorithms: Versions 2.0, coming soon to a GPU near you NVIDIA GTC custinger Oded Green,

custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader

custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality