Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

1 Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU
Lifan Xu, Wei Wang, Marco A. Alvarez, John Cavazos, Dongping Zhang
Department of Computer and Information Science, University of Delaware

2 Outline
- Introduction
  - Graph
  - Graph kernel
  - Shortest Path Graph Kernel (SPGK)
- Fast Computation of Shortest Path graph kernel (FCSP)
- Parallelization of FCSP on CPU and GPU
  - Two OpenMP implementations
  - Four GPU implementations
  - Hybrid method
- Experimental results
  - Synthetic datasets
  - Scientific datasets
- Conclusion and Future Work

3 Graph
- A graph G consists of a set of vertices V and a set of edges E, where $E \subseteq V^2$
- A graph G may have labels on vertices and/or edges
- The adjacency matrix A of G is defined as
  $A_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}$

4 Graph Kernel
- Kernels (from machine learning) between pairs of graphs (roughly speaking, graph similarities)
- Examples:
  - Random Walk Kernel: comparing walks
  - Shortest Path Kernel: comparing shortest paths
  - Subtree Kernel: comparing subtree-like patterns
  - Cyclic Pattern Kernel: comparing simple cycles
  - Graphlet Kernel: counting subgraphs of limited size

5 Shortest-Path Graph Kernel (SPGK)
- Convert each graph to its all-pairs shortest path graph
- Can use the Floyd-Warshall algorithm
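
The conversion step can be sketched in Python as follows. This is a minimal illustration, assuming an unweighted graph given as a 0/1 adjacency matrix. The function name `floyd_warshall` and the use of `math.inf` for unreachable pairs are illustrative choices, not taken from the slides.

```python
import math

def floyd_warshall(adj):
    """Return the all-pairs shortest path matrix of a 0/1 adjacency matrix."""
    n = len(adj)
    # Start from direct edges: 1 where an edge exists, 0 on the diagonal,
    # infinity everywhere else.
    dist = [[0 if i == j else (1 if adj[i][j] else math.inf) for j in range(n)]
            for i in range(n)]
    # Allow each vertex k in turn as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Undirected path graph 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
sp = floyd_warshall(adj)
```

Every finite, non-zero entry of `sp` is one edge of the shortest path graph, weighted by the path length.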

6 Floyd-Warshall
[Figure: original graph and its shortest path graph]

10 Shortest Path Graph Kernel
- Apply the shortest path kernel:
  $K_{sp}(G, G') = \sum_{e \in E} \sum_{e' \in E'} K_{walk}(e, e')$
  $K_{walk}(e, e') = K_{node}(u, u') \cdot K_{edge}(e, e') \cdot K_{node}(v, v')$
  for edges $e = (u, v)$ and $e' = (u', v')$
- $K_{node}$ is a valid kernel function for comparing two vertices
- $K_{edge}$ is a valid kernel function for comparing two edges

15 Shortest Path Graph Kernel
[Algorithm listing not preserved; the notes refer to its line numbers]
- Lines 2-4 loop through all paths in G1
- Lines 5-7 loop through all paths in G2
- Line 8 calculates $K_{edge}(e, e')$
- The following lines calculate $K_{node}(v, v')$
- Line 12 calculates $K_{walk}(e, e')$
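
Since the algorithm listing itself is missing, here is a rough Python sketch of the loop structure the notes describe. It is not the authors' listing. The two kernels are assumptions chosen for illustration: `k_node` compares vertex labels for equality and `k_edge` compares shortest path lengths for equality. The helper `is_path` is also hypothetical.

```python
import math

def k_node(a, b):
    # Assumed vertex kernel: 1.0 when the labels match, else 0.0.
    return 1.0 if a == b else 0.0

def k_edge(d1, d2):
    # Assumed edge kernel: 1.0 when the shortest path lengths match.
    return 1.0 if d1 == d2 else 0.0

def is_path(d):
    # A matrix entry is a shortest path if the pair is distinct and connected.
    return d != 0 and d != math.inf

def spgk(sp1, labels1, sp2, labels2):
    total = 0.0
    n1, n2 = len(sp1), len(sp2)
    for u in range(n1):                      # all paths in G1
        for v in range(n1):
            if not is_path(sp1[u][v]):
                continue
            for u2 in range(n2):             # all paths in G2
                for v2 in range(n2):
                    if not is_path(sp2[u2][v2]):
                        continue
                    ke = k_edge(sp1[u][v], sp2[u2][v2])
                    # Vertex kernels are recomputed for every pair of paths.
                    kn_u = k_node(labels1[u], labels2[u2])
                    kn_v = k_node(labels1[v], labels2[v2])
                    total += kn_u * ke * kn_v
    return total

# Shortest path matrix of the path graph 0 - 1 - 2 with labels a, b, a.
sp = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
labels = ["a", "b", "a"]
value = spgk(sp, labels, sp, labels)
```

The sketch makes the cost visible: four nested loops, two path checks, and a vertex kernel evaluation repeated for every pair of paths.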

16 Drawbacks of SPGK
- Four for loops and two if statements
- Redundant computation of $K_{node}(v, v')$
- Random memory access

17 Fast Computation of Shortest Path Graph Kernel (FCSP)
- Compute all $K_{node}(v, v')$ before $K_{walk}(e, e')$
- Convert the shortest path adjacency matrix to coordinate lists (a sparse matrix representation)
  - One array for values
  - One array for rows
  - One array for columns
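
The coordinate-list conversion can be illustrated with a short Python sketch. The function name `to_coo` and the convention that `math.inf` marks unreachable pairs are assumptions made for this example.

```python
import math

def to_coo(sp):
    """Flatten a shortest path matrix into three parallel arrays."""
    values, rows, cols = [], [], []
    for i, row in enumerate(sp):
        for j, d in enumerate(row):
            # Keep only real paths: skip the diagonal and unreachable pairs.
            if d != 0 and d != math.inf:
                values.append(d)
                rows.append(i)
                cols.append(j)
    return values, rows, cols

# Vertices 0 and 1 are connected; vertex 2 is isolated.
sp = [[0, 1, math.inf],
      [1, 0, math.inf],
      [math.inf, math.inf, 0]]
values, rows, cols = to_coo(sp)
```

After the conversion, later loops can run over the arrays directly and no longer need to test each matrix entry.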

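23 [Algorithm listing not preserved; the notes refer to its line numbers]
- Lines 1-7 compute all $K_{node}(v, v')$
- The following lines loop over all paths in G1, then over all paths in G2
- Line 19 computes $K_{edge}(e, e')$
- The lines after it fetch the precomputed $K_{node}(v, v')$
- Line 23 computes $K_{walk}(e, e')$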
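
The two phases described in these notes might look like the following in Python. This is a sketch under assumptions, not the authors' listing: the label-equality `k_node`, the length-equality `k_edge`, and the `(values, rows, cols)` tuple layout are all illustrative choices.

```python
def k_node(a, b):
    # Assumed vertex kernel: 1.0 when the labels match, else 0.0.
    return 1.0 if a == b else 0.0

def k_edge(d1, d2):
    # Assumed edge kernel: 1.0 when the shortest path lengths match.
    return 1.0 if d1 == d2 else 0.0

def fcsp(coo1, labels1, coo2, labels2):
    vals1, rows1, cols1 = coo1
    vals2, rows2, cols2 = coo2
    # Phase 1: compute every vertex kernel value exactly once.
    vk = [[k_node(a, b) for b in labels2] for a in labels1]
    # Phase 2: walk kernel over all pairs of paths, fetching from vk.
    total = 0.0
    for p in range(len(vals1)):
        for q in range(len(vals2)):
            ke = k_edge(vals1[p], vals2[q])
            kn_u = vk[rows1[p]][rows2[q]]
            kn_v = vk[cols1[p]][cols2[q]]
            total += kn_u * ke * kn_v
    return total

# Coordinate lists for the path graph 0 - 1 - 2 with labels a, b, a.
labels = ["a", "b", "a"]
coo = ([1, 2, 1, 1, 2, 1], [0, 0, 1, 1, 2, 2], [1, 2, 0, 2, 0, 1])
result = fcsp(coo, labels, coo, labels)
```

Only two loops remain and there are no path checks, because the coordinate lists contain paths only.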

24 Calculating a Kernel Matrix
- Given a set of graphs $\{g_1, g_2, \ldots, g_n\}$
- Calculate the kernel matrix $K_{n \times n}$
- $K(i, j)$ is the similarity between $g_i$ and $g_j$
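
A minimal Python sketch of filling the kernel matrix follows. The `pair_kernel` argument is a placeholder for any graph kernel, and the toy kernel in the example (a product of vertex counts) exists only to make the sketch runnable. The sketch relies on kernels being symmetric, so each unordered pair is computed once.

```python
def kernel_matrix(graphs, pair_kernel):
    n = len(graphs)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Kernels are symmetric, so fill both K[i][j] and K[j][i].
            K[i][j] = K[j][i] = pair_kernel(graphs[i], graphs[j])
    return K

# Toy stand-in for a graph kernel, for illustration only.
graphs = [["a"], ["a", "b"], ["a", "b", "c"]]
K = kernel_matrix(graphs, lambda g, h: float(len(g) * len(h)))
```

For n graphs this evaluates n(n+1)/2 pairs instead of n squared.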

25 FCSP on Multi-Core CPU using OpenMP
- OpenMP_In
  - Parallelize the computation for a single pair of graphs
  - Dynamic parallel for pragma
    - Vertex kernel
    - Walk kernel
- OpenMP_Out
  - Parallelize the computation of the whole kernel matrix
  - Each OpenMP thread fetches a pair of graphs until all computations are finished
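
The OpenMP_Out idea, where workers repeatedly fetch the next pair of graphs, can be mimicked in Python with a thread pool. This is an analogy rather than OpenMP code. The function name `kernel_matrix_parallel`, the `workers` parameter, and the toy pair kernel are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel_matrix_parallel(graphs, pair_kernel, workers=4):
    n = len(graphs)
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    K = [[0.0] * n for _ in range(n)]

    def work(pair):
        i, j = pair
        return i, j, pair_kernel(graphs[i], graphs[j])

    # Idle workers pull the next pending pair, similar in spirit to a
    # dynamically scheduled OpenMP loop over graph pairs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, j, value in pool.map(work, pairs):
            K[i][j] = K[j][i] = value
    return K

# Toy stand-in for a graph kernel, for illustration only.
graphs = [["a"], ["a", "b"], ["a", "b", "c"]]
K = kernel_matrix_parallel(graphs, lambda g, h: float(len(g) * len(h)))
```

The sketch shows the scheduling pattern only. In CPython, threads do not speed up a pure Python pair kernel because of the global interpreter lock.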

26 FCSP on GPU using OpenCL
- Three OpenCL kernels
  - Vertex kernel
  - Walk kernel
  - Reduction kernel
- Four implementations
  - GPU_1D
  - GPU_2D
  - GPU_1D_overlap
  - GPU_2D_overlap

27 GPU_1D
- 2D domain decomposition for Vertex Kernel
- 1D domain decomposition for Walk Kernel

32 GPU_1D
[Diagram; labels: Input graphs, Adjacency matrix, Shortest Path Adjacency matrix, Vertex Kernel, Walk Kernel]

33 GPU_2D
- 2D domain decomposition for Vertex Kernel
- 2D domain decomposition for Walk Kernel
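
The difference between a 1D and a 2D decomposition of the Walk Kernel can be emulated on the CPU in Python. The slides do not state how work is mapped to work-items, so the mapping below is an assumption: one work-item per path of G1 in the 1D case, and one work-item per pair of paths in the 2D case. All function names, the length-equality `k_edge`, and the precomputed vertex kernel table `vk` are illustrative.

```python
def k_edge(d1, d2):
    # Assumed edge kernel: 1.0 when the two path lengths match.
    return 1.0 if d1 == d2 else 0.0

def walk_value(p, q, coo1, coo2, vk):
    """K_walk for path p of G1 and path q of G2, using precomputed vk."""
    vals1, rows1, cols1 = coo1
    vals2, rows2, cols2 = coo2
    return (vk[rows1[p]][rows2[q]]
            * k_edge(vals1[p], vals2[q])
            * vk[cols1[p]][cols2[q]])

def walk_kernel_1d(coo1, coo2, vk):
    # 1D: one emulated work-item per path of G1; it loops over G2's paths.
    partial = []
    for p in range(len(coo1[0])):
        acc = 0.0
        for q in range(len(coo2[0])):
            acc += walk_value(p, q, coo1, coo2, vk)
        partial.append(acc)
    return partial

def walk_kernel_2d(coo1, coo2, vk):
    # 2D: one emulated work-item per (path of G1, path of G2) pair.
    return [walk_value(p, q, coo1, coo2, vk)
            for p in range(len(coo1[0]))
            for q in range(len(coo2[0]))]

def reduction(partial):
    # Stands in for the Reduction kernel: sum the per-work-item results.
    return sum(partial)

# Path graph 0 - 1 - 2 with labels a, b, a, compared with itself.
coo = ([1, 2, 1, 1, 2, 1], [0, 0, 1, 1, 2, 2], [1, 2, 0, 2, 0, 1])
vk = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
out_1d = walk_kernel_1d(coo, coo, vk)
out_2d = walk_kernel_2d(coo, coo, vk)
```

Both variants reduce to the same kernel value. The 2D form exposes more parallelism but produces more partial results to reduce.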

34 GPU_2D
[Diagram; labels: Input graphs, Adjacency matrix, Shortest Path Adjacency matrix, Vertex Kernel, Edge Kernel]

35 GPU_1D_overlap
- 2D domain decomposition for Vertex Kernel
- 1D domain decomposition for Walk Kernel
- Computation and communication overlap
  - Issue a non-blocking memory transfer after the Reduction Kernel
  - Assign the next pair of graphs to non-blocking Vertex Kernel and Walk Kernel launches
  - Meanwhile, the CPU accumulates results from the Reduction Kernel

36 GPU_2D_overlap
- 2D domain decomposition for Vertex Kernel
- 2D domain decomposition for Walk Kernel
- Computation and communication overlap
  - Issue a non-blocking memory transfer after the Reduction Kernel
  - Assign the next pair of graphs to non-blocking Vertex Kernel and Walk Kernel launches
  - Meanwhile, the CPU accumulates results from the Reduction Kernel

37 CPU and GPU Hybrid Implementation
- Combine OpenMP_In and GPU_1D_overlap
- Set a threshold T for the number of shortest paths
  - If both input graphs have fewer than T shortest paths: OpenMP_In
  - Otherwise: GPU_1D_overlap
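
The dispatch rule can be written as a small Python function. The function name `choose_backend` and the threshold value 100 in the example are hypothetical. The slides do not give a value for T.

```python
def choose_backend(num_paths_1, num_paths_2, threshold):
    """Pick the implementation for one pair of graphs."""
    # Small pairs stay on the CPU; anything else goes to the GPU.
    if num_paths_1 < threshold and num_paths_2 < threshold:
        return "OpenMP_In"
    return "GPU_1D_overlap"

small_pair = choose_backend(10, 20, threshold=100)
mixed_pair = choose_backend(10, 200, threshold=100)
```

A pair goes to the GPU as soon as either graph reaches the threshold, since the rule requires both graphs to be small for the CPU path.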

38 Execution Environment
- CPU: two quad-core Intel 5530 processors, 2.4 GHz, with 8 MB cache (16 OpenMP threads)
- GPU: one NVIDIA C2050 ( GHz) with 3 GB GDDR5 1.5 GHz ECC RAM

39 Homogeneous Synthetic Datasets
- Nine homogeneous datasets
- 256 graphs per dataset
- Each dataset contains graphs of the same size

40 Homogeneous Datasets Statistics

41 Speedup of Sequential FCSP over Sequential SPGK on CPU

42 Speedup of Parallel FCSP over Sequential FCSP

43 Mixed Synthetic Dataset
[Dataset composition (node and graph counts) not preserved]
[Table: running time (seconds) of the different implementations on the mixed dataset]

44 Scientific Datasets

45 Speedup over OpenMP_In on Scientific Datasets

46 Conclusion and Future Work
- We introduce Fast Computation of Shortest Path graph kernel (FCSP)
- Sequential FCSP is able to achieve a 76x speedup over sequential SPGK
- Two CPU parallelizations
- Four GPU implementations
- One hybrid method
- We are going to accelerate other graph kernels in the future

47 Thanks! Questions?
