Graph Coarsening and Clustering on the GPU

Size: px

Start display at page:

Download "Graph Coarsening and Clustering on the GPU"

Suzanna Elliott
5 years ago
Views:

1 Graph Coarsening and Clustering on the GPU B. O. Fagginger Auer R. H. Bisseling Utrecht University February 14, 2012

2 Introduction We will discuss coarsening and greedy clustering of graphs.

3 Introduction We will discuss coarsening and greedy clustering of graphs. Clustering isolating related groups of vertices in a graph.

4 Introduction We will discuss coarsening and greedy clustering of graphs. Clustering isolating related groups of vertices in a graph. Relevant in: social networks, epidemiology, papers, metabolism, and ecosystems (Newman & Girvan, 2004).

5 Introduction We will discuss coarsening and greedy clustering of graphs. Clustering isolating related groups of vertices in a graph. Relevant in: social networks, epidemiology, papers, metabolism, and ecosystems (Newman & Girvan, 2004). Our primary interests are speed and parallelisation.

6 Modularity clustering Let G = (V, E) be a graph with edge weights ω : E R >0.

7 Modularity clustering Let G = (V, E) be a graph with edge weights ω : E R >0. A clustering of G is a partitioning C of V : V = C C C as a disjoint union.

8 Modularity clustering Let G = (V, E) be a graph with edge weights ω : E R >0. A clustering of G is a partitioning C of V : V = C C C as a disjoint union. Quality of a clustering is measured by its modularity mod(c), introduced in 2004 by Newman and Girvan.

9 Modularity clustering Clustering modularity is defined as ω({u, v}) mod(c) := C C {u,v} E u,v C ω(e) e E C C ( v C ) 2 ζ(v) ( ) 2. 4 ω(e) e E

10 Modularity clustering Clustering modularity is defined as ω({u, v}) mod(c) := C C {u,v} E u,v C ω(e) e E C C ( v C ) 2 ζ(v) ( ) 2. 4 ω(e) e E Here, vertex weights ζ : V R 0 are defined as ζ(v) := ω({u, v}). {u,v} E

11 Modularity clustering mod(c) can be rewritten to 1 4 Ω 2 ζ(c) (2 Ω ζ(c)) 2 Ω ω(cut(c, C )). C C C C C C

12 Modularity clustering mod(c) can be rewritten to 1 4 Ω 2 ζ(c) (2 Ω ζ(c)) 2 Ω ω(cut(c, C )). C C C C C C Here, Ω := e E ω(e) and cut(c, C ) := {{u, v} E u C and v C }.

13 Modularity clustering mod(c) can be rewritten to 1 4 Ω 2 ζ(c) (2 Ω ζ(c)) 2 Ω ω(cut(c, C )). C C C C C C Here, Ω := e E ω(e) and cut(c, C ) := {{u, v} E u C and v C }. To calculate modularity, we only need to keep track of summed vertex weights of clusters and summed edge weights between clusters.

14 Merging clusters Merge two clusters C, C C to a single larger cluster C C.

15 Merging clusters Merge two clusters C, C C to a single larger cluster C C. Then, ζ(c C ) = ζ(c) + ζ(c )

16 Merging clusters Merge two clusters C, C C to a single larger cluster C C. Then, ζ(c C ) = ζ(c) + ζ(c ) and the modularity is increased by (eq. (4) of Ovelgönne et al., 2010) 1 ( ) 2 Ω 2 2 Ω ω(cut(c, C )) ζ(c) ζ(c ).

17 Merging clusters Merge two clusters C, C C to a single larger cluster C C. Then, ζ(c C ) = ζ(c) + ζ(c ) and the modularity is increased by (eq. (4) of Ovelgönne et al., 2010) 1 ( ) 2 Ω 2 2 Ω ω(cut(c, C )) ζ(c) ζ(c ). This suggests a greedy agglomerative strategy (e.g. Zhu et al., 2008).

18 Agglomerative clustering 1 Start with all vertices being a separate cluster: C = {{v} v V }.

19 Agglomerative clustering 1 Start with all vertices being a separate cluster: C = {{v} v V }. 2 Find a heavy matching of clusters with edge weights 1 ( ) 2 Ω 2 2 Ω ω(cut(c, C )) ζ(c) ζ(c ).

20 Agglomerative clustering 1 Start with all vertices being a separate cluster: C = {{v} v V }. 2 Find a heavy matching of clusters with edge weights 1 ( ) 2 Ω 2 2 Ω ω(cut(c, C )) ζ(c) ζ(c ). 3 Merge all matched clusters, summing ζ and ω (graph coarsening).

21 Agglomerative clustering 1 Start with all vertices being a separate cluster: C = {{v} v V }. 2 Find a heavy matching of clusters with edge weights 1 ( ) 2 Ω 2 2 Ω ω(cut(c, C )) ζ(c) ζ(c ). 3 Merge all matched clusters, summing ζ and ω (graph coarsening). 4 Go to step 2 until only a single cluster remains.

22 Agglomerative clustering 1 Start with all vertices being a separate cluster: C = {{v} v V }. 2 Find a heavy matching of clusters with edge weights 1 ( ) 2 Ω 2 2 Ω ω(cut(c, C )) ζ(c) ζ(c ). 3 Merge all matched clusters, summing ζ and ω (graph coarsening). 4 Go to step 2 until only a single cluster remains. 5 Return encountered clustering with highest modularity.

23 Agglomerative clustering 1 Start with all vertices being a separate cluster: C = {{v} v V }. 2 Find a heavy matching of clusters with edge weights 1 ( ) 2 Ω 2 2 Ω ω(cut(c, C )) ζ(c) ζ(c ). 3 Merge all matched clusters, summing ζ and ω (graph coarsening). 4 Go to step 2 until only a single cluster remains. 5 Return encountered clustering with highest modularity. We make use of parallelism in steps 2 and 3.

24 Agglomerative clustering (netherlands) 0 iterations.

25 Agglomerative clustering (netherlands) 11 iterations.

26 Agglomerative clustering (netherlands) 21 iterations.

27 Agglomerative clustering (netherlands) 26 iterations.

28 Agglomerative clustering (netherlands) 33 iterations.

29 Agglomerative clustering (netherlands) Final clustering.

30 Star graphs Agglomerative clustering slows down on star graphs.

31 Star graphs Merging vertices with the same neighbours is bad for clustering.

32 Star graphs So we merge multiple satellites to the same centre.

33 Centre potential To identify star centres and satellites, we propose a centre potential.

34 Centre potential To identify star centres and satellites, we propose a centre potential. This potential is defined for vertices v as cp(v) := deg(v) 2 deg(u). {u,v} E

35 Centre potential To identify star centres and satellites, we propose a centre potential. This potential is defined for vertices v as cp(v) := deg(v) 2 deg(u). {u,v} E We use cp( ) to identify satellites and match these to centres.

36 Centre potential

37 Centre potential For a star graph where k satellites are connected to a clique of l vertices with 0 < l < k, we have that

38 Centre potential For a star graph where k satellites are connected to a clique of l vertices with 0 < l < k, we have that cp(satellite) 1 2 and cp(satellite) 0 as k,

39 Centre potential For a star graph where k satellites are connected to a clique of l vertices with 0 < l < k, we have that cp(satellite) 1 2 cp(centre) 4 3 and cp(satellite) 0 as k, and cp(centre) as k.

40 GPU coarsening Input: a graph G = (V, E, ω, ζ) and a map µ : V N.

41 GPU coarsening Input: a graph G = (V, E, ω, ζ) and a map µ : V N. Output: a graph G = (V, E, ω, ζ ) and surjective map π : V V such that:

42 GPU coarsening Input: a graph G = (V, E, ω, ζ) and a map µ : V N. Output: a graph G = (V, E, ω, ζ ) and surjective map π : V V such that: E = {{π(u), π(v)} {u, v} E} (collapse edges),

43 GPU coarsening Input: a graph G = (V, E, ω, ζ) and a map µ : V N. Output: a graph G = (V, E, ω, ζ ) and surjective map π : V V such that: E = {{π(u), π(v)} {u, v} E} (collapse edges), ω (e ) = π(e)=e ω(e) (sum edge weights),

44 GPU coarsening Input: a graph G = (V, E, ω, ζ) and a map µ : V N. Output: a graph G = (V, E, ω, ζ ) and surjective map π : V V such that: E = {{π(u), π(v)} {u, v} E} (collapse edges), ω (e ) = ζ (v ) = π(e)=e ω(e) π(v)=v ζ(v) (sum edge weights), (sum vertex weights),

45 GPU coarsening Input: a graph G = (V, E, ω, ζ) and a map µ : V N. Output: a graph G = (V, E, ω, ζ ) and surjective map π : V V such that: E = {{π(u), π(v)} {u, v} E} (collapse edges), ω (e ) = ζ (v ) = π(e)=e ω(e) π(v)=v ζ(v) (sum edge weights), (sum vertex weights), π(u) = π(v) if and only if µ(u) = µ(v) (compress µ to π).

46 GPU coarsening (implementation) Implemented using the CUDA Thrust library.

47 GPU coarsening (implementation) Implemented using the CUDA Thrust library. View G as a collection of adjacency lists for each vertex.

48 GPU coarsening (implementation) Implemented using the CUDA Thrust library. View G as a collection of adjacency lists for each vertex. First, we create π and π 1 from µ.

49 GPU coarsening (implementation) Implemented using the CUDA Thrust library. View G as a collection of adjacency lists for each vertex. First, we create π and π 1 from µ. Then, we create the new adjacency lists and weights for G.

50 GPU coarsening (implementation) Implemented using the CUDA Thrust library. View G as a collection of adjacency lists for each vertex. First, we create π and π 1 from µ. Then, we create the new adjacency lists and weights for G. Use µ, π, π 1, and a bookkeeping array ρ in global GPU memory.

51 GPU coarsening (algorithm) ρ µ π 1 π Initialise ρ sequentially and store µ.

52 GPU coarsening (algorithm) ρ µ π 1 π Sort by increasing µ-value (sort by key).

53 GPU coarsening (algorithm) ρ µ π 1 π Sort by increasing µ-value (sort by key).

54 GPU coarsening (algorithm) ρ µ π 1 π Determine different matched groups (adjacent not equal).

55 GPU coarsening (algorithm) ρ µ π 1 π Determine different matched groups (adjacent not equal).

56 GPU coarsening (algorithm) ρ µ π 1 π Extract boundaries for π 1 (copy index if nonzero).

57 GPU coarsening (algorithm) ρ µ π π Extract boundaries for π 1 (copy index if nonzero).

58 GPU coarsening (algorithm) ρ µ π π Perform scan to find π indices (inclusive scan).

59 GPU coarsening (algorithm) ρ µ π π Perform scan to find π indices (inclusive scan).

60 GPU coarsening (algorithm) ρ µ π π Extract π as π(ρ(i)) = µ(i) (scatter).

61 GPU coarsening (algorithm) ρ µ π π Extract π as π(ρ(i)) = µ(i) (scatter).

62 GPU coarsening (algorithm) Construct the adjacency lists of G for i V in parallel.

63 GPU coarsening (algorithm) Construct the adjacency lists of G for i V in parallel. Sum vertex weights: ζ (i ) = ζ(ρ(π 1 (i ))) ζ(ρ(π 1 (i + 1) 1)).

64 GPU coarsening (algorithm) Construct the adjacency lists of G for i V in parallel. Sum vertex weights: ζ (i ) = ζ(ρ(π 1 (i ))) ζ(ρ(π 1 (i + 1) 1)). Gather weighted neighbours of ρ(π 1 (i )) to ρ(π 1 (i + 1) 1): (i 1, ω 1, i 2, ω 2,..., i k, ω k ).

65 GPU coarsening (algorithm) Construct the adjacency lists of G for i V in parallel. Sum vertex weights: ζ (i ) = ζ(ρ(π 1 (i ))) ζ(ρ(π 1 (i + 1) 1)). Gather weighted neighbours of ρ(π 1 (i )) to ρ(π 1 (i + 1) 1): (π(i 1 ), ω 1, π(i 2 ), ω 2,..., π(i k ), ω k ).

66 GPU coarsening (algorithm) Construct the adjacency lists of G for i V in parallel. Sum vertex weights: ζ (i ) = ζ(ρ(π 1 (i ))) ζ(ρ(π 1 (i + 1) 1)). Gather weighted neighbours of ρ(π 1 (i )) to ρ(π 1 (i + 1) 1): (π(i 1 ), ω 1, π(i 2 ), ω 2,..., π(i k ), ω k ). Sort the neighbour list by index.

67 GPU coarsening (algorithm) Construct the adjacency lists of G for i V in parallel. Sum vertex weights: ζ (i ) = ζ(ρ(π 1 (i ))) ζ(ρ(π 1 (i + 1) 1)). Gather weighted neighbours of ρ(π 1 (i )) to ρ(π 1 (i + 1) 1): Sort the neighbour list by index. (π(i 1 ), ω 1, π(i 2 ), ω 2,..., π(i k ), ω k ). Compress the neighbour list by replacing subsequences (j, ω 1, j, ω 2,..., j, ω l ) with (j, ω 1 + ω ω l ).

68 GPU clustering (parallelism) We perform matching in parallel on the GPU to obtain µ.

69 GPU clustering (parallelism) We perform matching in parallel on the GPU to obtain µ. Calculate centre potentials and match satellites in parallel.

70 GPU clustering (parallelism) We perform matching in parallel on the GPU to obtain µ. Calculate centre potentials and match satellites in parallel. Coarsen in parallel as described previously.

71 GPU clustering (parallelism) We perform matching in parallel on the GPU to obtain µ. Calculate centre potentials and match satellites in parallel. Coarsen in parallel as described previously. This gives us a fine-grained shared-memory parallel clustering algorithm.

72 GPU clustering (parallelism) We perform matching in parallel on the GPU to obtain µ. Calculate centre potentials and match satellites in parallel. Coarsen in parallel as described previously. This gives us a fine-grained shared-memory parallel clustering algorithm. We do not perform local improvement (Kernighan Lin): changing cluster weights makes parallelising this very hard.

73 Results Created an implementation on the GPU using CUDA and on the CPU using TBB.

74 Results Created an implementation on the GPU using CUDA and on the CPU using TBB. Results are averaged over 16 runs.

75 Results Created an implementation on the GPU using CUDA and on the CPU using TBB. Results are averaged over 16 runs. Time: matching and CPU GPU data transfer, not disk I/O.

76 Results Created an implementation on the GPU using CUDA and on the CPU using TBB. Results are averaged over 16 runs. Time: matching and CPU GPU data transfer, not disk I/O. Test set: 10th DIMACS challenge.

77 Results Created an implementation on the GPU using CUDA and on the CPU using TBB. Results are averaged over 16 runs. Time: matching and CPU GPU data transfer, not disk I/O. Test set: 10th DIMACS challenge. Test hardware: dual quad-core Xeon E5620 and an NVIDIA Tesla C2050 (thanks: the Little Green Machine project).

78 Results (scaling) Relative clustering time (%) Clustering time scaling 40 linear Number of CPU threads

79 Results (time) Clustering time (s) *10-7 E CUDA TBB Clustering time Number of graph edges E

80 Results (quality) V E CUDA TBB Ovelgönne et al. (2010) karate jazz 198 2, ,133 5, PGP 10,680 24,

81 Results (quality) V E CUDA TBB Ovelgönne et al. (2010) karate jazz 198 2, ,133 5, PGP 10,680 24, Lower quality, because we do not use local refinement.

82 Results (time) Our algorithm is very fast for large graphs.

83 Results (time) Our algorithm is very fast for large graphs. CUDA: road central, V = 14,081,816, E = 16,933,413, modularity clustering in 4.6 seconds.

84 Results (time) Our algorithm is very fast for large graphs. CUDA: road central, V = 14,081,816, E = 16,933,413, modularity clustering in 4.6 seconds. TBB: uk-2002, V = 18,520,486, E = 261,787,258, modularity clustering in 31 seconds.

85 Results (time) Our algorithm is very fast for large graphs. CUDA: road central, V = 14,081,816, E = 16,933,413, modularity clustering in 4.6 seconds. TBB: uk-2002, V = 18,520,486, E = 261,787,258, modularity clustering in 31 seconds. Comparison to state of the art: DIMACS challenge.

86 Conclusion We presented a fine-grained shared-memory parallel clustering algorithm.

87 Conclusion We presented a fine-grained shared-memory parallel clustering algorithm. This algorithm is suitable for both multi-core CPUs and GPUs.

88 Conclusion We presented a fine-grained shared-memory parallel clustering algorithm. This algorithm is suitable for both multi-core CPUs and GPUs. We propose the centre potential to deal with star-like graphs.

89 Conclusion We presented a fine-grained shared-memory parallel clustering algorithm. This algorithm is suitable for both multi-core CPUs and GPUs. We propose the centre potential to deal with star-like graphs. The algorithm is very fast, but quality could be improved by parallel local refinement.

90 Questions any questions?

91 GPU matching problems Performing matching in parallel is problematic.

92 GPU matching problems Performing matching in parallel is problematic. Disjoint edges requirement leads to serialisation.

93 GPU matching problems Suppose we match vertices simultaneously.

94 GPU matching problems Vertices find an unmatched neighbour...

95 GPU matching problems but generate an invalid matching.

96 GPU matching To solve this we create two groups of vertices: blue and red.

97 GPU matching To solve this we create two groups of vertices: blue and red. Blue vertices propose.

98 GPU matching To solve this we create two groups of vertices: blue and red. Blue vertices propose. Red vertices respond.

99 GPU matching To solve this we create two groups of vertices: blue and red. Blue vertices propose. Red vertices respond. Proposals that were responded to are matched.

100 GPU matching (implementation) The graph (neighbour ranges, indices, and weights) is stored as a triplet of 1D textures on the GPU.

101 GPU matching (implementation) The graph (neighbour ranges, indices, and weights) is stored as a triplet of 1D textures on the GPU. We create one thread for each vertex in V.

102 GPU matching (implementation) The graph (neighbour ranges, indices, and weights) is stored as a triplet of 1D textures on the GPU. We create one thread for each vertex in V. Each vertex v V only updates its colour/matching value π(v); and its proposal/response value σ(v).

103 GPU matching (implementation) The graph (neighbour ranges, indices, and weights) is stored as a triplet of 1D textures on the GPU. We create one thread for each vertex in V. Each vertex v V only updates its colour/matching value π(v); and its proposal/response value σ(v). Both π and σ are stored in 1D arrays in global memory.

104 GPU matching (algorithm) Colour Propose Respond Match π σ

105 GPU matching (algorithm) Colour Propose Respond Match π b r r b b r b b r σ

106 GPU matching (algorithm) Colour Propose Respond Match π b r r b b r b b r σ

107 GPU matching (algorithm) Colour Propose Respond Match π b r r b b r b b r σ

108 GPU matching (algorithm) Colour Propose Respond Match π b 2 3 b r σ

109 GPU matching (algorithm) Colour Propose Respond Match π r 2 3 r b σ

110 GPU matching (algorithm) Colour Propose Respond Match π r 2 3 r b σ d

111 GPU matching (algorithm) Colour Propose Respond Match π r 2 3 r b σ d

112 GPU matching (algorithm) Colour Propose Respond Match π r 2 3 r d σ d

113 GPU matching (algorithm) Colour Propose Respond Match π b 2 3 r d σ d

114 GPU matching (algorithm) Colour Propose Respond Match π b 2 3 r d σ

115 GPU matching (algorithm) Colour Propose Respond Match π b 2 3 r d σ

116 GPU matching (algorithm) Colour Propose Respond Match π d σ

A GPU Algorithm for Greedy Graph Matching

A GPU Algorithm for Greedy Graph Matching B. O. Fagginger Auer R. H. Bisseling Utrecht University September 29, 2011 Fagginger Auer, Bisseling (UU) GPU Greedy Graph Matching September 29, 2011 1 / 27 Outline