hyperx: scalable hypergraph processing

1 hyperx: scalable hypergraph processing
Jin Huang, The University of Melbourne, November 15, 2015

2 overview
Research Outline; Scalable Hypergraph Processing; Problem and Challenge; Idea; Solution; Implementation; Empirical Results; Conclusion

3 research outline

4 scalable hypergraph processing

5 problem context
Hypergraphs capture any high-order relationship, that is, a relationship with more than two participants.
Figure 1: A few high-order relationships

6 representative existing hypergraph studies
Table 1: Various hypergraph learning studies in the literature
Application       Study                     Vertex           Hyperedge
Recommendation    [TMCCA 13]                Songs and users  Listening histories
Text retrieval    [SIGIR 08]                Documents        Semantic similarities
Image retrieval   [Pattern Recognition 13]  Images           Descriptor similarities
Multimedia        [Multimedia 08]           Videos           Hyperlinks
Bioinformatics    [ICDM 13]                 Proteins         Interactions
Social mining     [AAAI 14]                 Users            Communities
Machine learning  [Signal Processing 14]    Data records     Labels

7 existing solution
Convert the hypergraph to a graph:
  Option I: a bipartite graph
  Option II: a clique
Figure 2: Graph conversion inflates the problem size
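The inflation caused by the two conversions can be illustrated with a minimal Python sketch. The function names `to_bipartite` and `to_clique` and the toy hypergraph are illustrative assumptions, not part of HyperX; the bipartite conversion is assumed to add one vertex per hyperedge and one edge per incidence, and the clique conversion to connect every pair of vertices sharing a hyperedge.

```python
from itertools import combinations


def to_bipartite(hyperedges):
    """Bipartite (star) conversion: one edge per (vertex, hyperedge) incidence.

    Each hyperedge becomes an extra vertex, tagged ("h", index).
    """
    return [(v, ("h", hid)) for hid, members in enumerate(hyperedges) for v in members]


def to_clique(hyperedges):
    """Clique conversion: connect every pair of vertices that share a hyperedge."""
    edges = set()
    for members in hyperedges:
        edges.update(combinations(sorted(members), 2))
    return edges


# Toy hypergraph (hypothetical data): 8 vertices, 3 hyperedges.
hyperedges = [{1, 2, 3, 4}, {3, 4, 5}, {1, 5, 6, 7, 8}]
num_vertices = len(set().union(*hyperedges))

bipartite_edges = to_bipartite(hyperedges)
clique_edges = to_clique(hyperedges)

print("hypergraph:", num_vertices, "vertices,", len(hyperedges), "hyperedges")
print("bipartite: ", num_vertices + len(hyperedges), "vertices,", len(bipartite_edges), "edges")
print("clique:    ", num_vertices, "vertices,", len(clique_edges), "edges")
```

Even on this toy input the edge count grows from 3 hyperedges to 12 bipartite edges or 18 clique edges; the clique count grows quadratically with hyperedge arity.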

8 challenges i
Scalable graph frameworks: GraphLab, Giraph, GraphX, etc.
  synchronous BSP (Pregel)
  vertex-centric style
  vertex replication and aggregation
Figure 3: Vertex replicas to reduce network communication
Inflated size: 2M V and 15M H -> 17M V and 1B E
Excessive replication: replicating both V and H

11 challenges ii
Difficulty in load balance, with two causes:
  1. V and H are not active simultaneously
  2. double overhead in each iteration
Figure 3: Two issues in balancing the loads

12 idea
To support (API): random walks, label propagation, spectral
Inflated size (representation): a distributed hypergraph
Excessive replication (representation): replicate only V
Difficulty in load balance (partitioning): an optimization that
  minimizes the communication cost
  minimizes the replication cost
  balances both V and H loads

13 proposed solution: hyperx
Figure 4: An overview of HyperX implemented over Spark

14 details: apis
Algorithms are expressed as two programs:
  vprog updates vertex values given incident hyperedges
  hprog updates hyperedge values given incident vertices
Table 2: HyperX main APIs
Name         Usage
joinV        run vprog as distributed joins
mrTuples     run hprog on hyperedges and reduce the results onto vertices
mapV         update vertices independently (locally)
mapH         update hyperedges independently (locally)
subH         restrict computation to a sub-hypergraph
HyperPregel  iteratively execute mrTuples and joinV

15 details: hyperpregel implementation
Algorithm 1: HyperPregel
input:  G: Hypergraph[V, H], vprog: (Id, V, M) → V, hprog: Tuple → M, combine: (M, M) → M, initial: M
output: RDD[(Id, V)]
  G ← G.mapV((id, v) → vprog(id, v, initial))
  msg ← G.mrTuples(hprog, combine)
  while |msg| > 0 do
    G ← G.joinV(msg)(vprog).subH(v, t)
    msg ← G.mrTuples(hprog, combine)
  return G.vertices
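The loop in Algorithm 1 can be sketched on a single machine in Python. This is only an illustration of the control flow, not the Spark implementation: the dictionary-based hypergraph, the `mr_tuples` helper, the `max_iters` safety bound, and the reading of `subH` as "only hyperedges with a changed source vertex stay active" are all assumptions.

```python
def mr_tuples(vertices, hyperedges, hprog, combine, active=None):
    """Run hprog on each (active) hyperedge and combine messages per destination vertex."""
    msgs = {}
    for src, dst in hyperedges:
        # Assumed subH behaviour: skip hyperedges with no active source vertex.
        if active is not None and not (active & set(src)):
            continue
        for vid, m in hprog({u: vertices[u] for u in src}, dst):
            msgs[vid] = combine(msgs[vid], m) if vid in msgs else m
    return msgs


def hyper_pregel(vertices, hyperedges, vprog, hprog, combine, initial, max_iters=50):
    """Iterate mrTuples and joinV until no messages remain (or max_iters is hit)."""
    vertices = {vid: vprog(vid, v, initial) for vid, v in vertices.items()}
    msgs = mr_tuples(vertices, hyperedges, hprog, combine)
    for _ in range(max_iters):
        if not msgs:
            break
        active = set()
        for vid, m in msgs.items():  # joinV: apply vprog to vertices that got a message
            new_value = vprog(vid, vertices[vid], m)
            if new_value != vertices[vid]:
                vertices[vid] = new_value
                active.add(vid)
        msgs = mr_tuples(vertices, hyperedges, hprog, combine, active)
    return vertices


# Toy usage (hypothetical data): propagate the maximum vertex value inside each component.
hyperedges = [((1, 2, 3), (1, 2, 3)), ((3, 4), (3, 4)), ((5, 6), (5, 6))]
result = hyper_pregel(
    vertices={i: i for i in range(1, 7)},
    hyperedges=hyperedges,
    vprog=lambda vid, v, msg: max(v, msg),
    hprog=lambda src_values, dst: [(d, max(src_values.values())) for d in dst],
    combine=max,
    initial=0,
)
print(result)
```

The loop stops once a round changes no vertex, because no hyperedge remains active and the message set becomes empty.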

16 details: random walks with apis
Algorithm 2: Random Walks (RW) with restart
input:  G, label vertex set L, restart probability rp
output: RDD[(Id, Double)]
  vprog(id, (v, d), msg) = ((1 - rp) · msg + rp · v, d)
  hprog(S, D, Sd, Dd, h) = Σ_{i ∈ S} S_i / (Sd_i · |D|)
  combine(a, b) = a + b
  G ← G.joinV(G.outDeg, (id, v, d) → d)
  G ← G.mapV((id, v) → if id ∈ L then (1.0, v) else (0.0, v))
  G.HyperPregel(G, vprog, hprog, combine, 0)
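A single-machine Python sketch of this random walk follows. It rests on several assumptions: hyperedges are undirected (every member is both source and destination), the hprog message is read as the sum over source vertices of value / (degree × |D|), the vprog update is taken literally from the slide, and a fixed iteration count replaces the message-driven loop. The names `random_walk`, `labeled`, and `iters` are hypothetical.

```python
def random_walk(hyperedges, labeled, rp=0.15, iters=10):
    """Random walk over a hypergraph, with the update rule written on the slide."""
    vertices = set().union(*hyperedges)
    # Degree of a vertex: number of hyperedges it belongs to.
    degree = {v: sum(v in h for h in hyperedges) for v in vertices}
    # Labeled vertices start with score 1.0, all others with 0.0.
    score = {v: 1.0 if v in labeled else 0.0 for v in vertices}
    for _ in range(iters):
        msgs = {v: 0.0 for v in vertices}
        for h in hyperedges:
            # hprog (assumed form): each source splits its score over its hyperedges,
            # and the hyperedge splits the total evenly over its |D| destinations.
            share = sum(score[i] / degree[i] for i in h) / len(h)
            for d in h:
                msgs[d] += share  # combine(a, b) = a + b
        # vprog: (1 - rp) * msg + rp * v
        score = {v: (1 - rp) * msgs[v] + rp * score[v] for v in vertices}
    return score


# Toy usage (hypothetical data): start the walk at vertex 1.
hyperedges = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
scores = random_walk(hyperedges, labeled={1})
print(scores)
```

Under these assumptions the total score is conserved across iterations, which gives a quick sanity check on any implementation.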

17 details: representation
Built on Spark's RDD, how should a hypergraph be represented?
  Vertices: vrdd
  Hyperedges: hrdd
    each hyperedge holds multiple vertices, as a list or set, or
    flattened into (vid, hid, issrc) entries stored in columnar arrays
    the flattened form saves 41% to 88% of memory consumption
To run mrTuples locally, replicate vertices:
  one replica is adequate
  cost in distributed vprog
  cost in updating replicas
  cost in storing replicas
How should vrdd and hrdd be partitioned to minimize the cost?
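The flattened (vid, hid, issrc) layout can be sketched in Python with three parallel typed arrays. This is an illustration of the columnar idea only; the helper names `flatten` and `sources_of` and the choice of `array` type codes are assumptions, and the actual HyperX storage on Spark may differ.

```python
from array import array


def flatten(hyperedges):
    """Flatten hyperedges into three parallel columns: vid, hid, is_src.

    hyperedges: list of (source_ids, destination_ids) pairs.
    """
    vid, hid, is_src = array("q"), array("q"), array("b")
    for h, (src, dst) in enumerate(hyperedges):
        for v in src:
            vid.append(v)
            hid.append(h)
            is_src.append(1)
        for v in dst:
            vid.append(v)
            hid.append(h)
            is_src.append(0)
    return vid, hid, is_src


def sources_of(h, vid, hid, is_src):
    """Recover the source vertices of hyperedge h by scanning the columns."""
    return [v for v, e, s in zip(vid, hid, is_src) if e == h and s == 1]


# Toy usage (hypothetical data): two directed hyperedges.
hyperedges = [((1, 2), (3,)), ((3,), (4, 5, 6))]
vid, hid, is_src = flatten(hyperedges)
print(list(vid), list(hid), list(is_src))
```

Primitive columns avoid one collection object per hyperedge, which is the kind of overhead a flattened layout is meant to remove.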

19 details: partitioning introduction
Different from vertex-cut or edge-cut partitioning in the graph literature:
  cuts both vertices and hyperedges simultaneously
  minimizes the vertex replicas (with local aggregation)
  applies separate load constraints on vprog and hprog

20 details: partitioning objective formulation
n vertices, m hyperedges, k workers; $a_h$ is the arity of hyperedge $h$.
Number of replicas for vertex $u$:
$$R(x_u, y) = \sum_{i=1}^{k} \max\Big(1 - x_{u,i} - \prod_{h \in N(u)} (1 - y_{h,i}),\ 0\Big)$$
To optimize:
$$\text{minimize} \quad \sum_{u \in V} R(x_u, y)$$
$$\text{subject to} \quad \sum_{h \in H} y_{h,i}\, a_h \le (1 + \alpha)\, \frac{\sum_{h \in H} a_h}{k}, \quad \forall i \in \{1, 2, \ldots, k\}$$
$$\sum_{u \in V} x_{u,i}\, R(x_u, y) \le (1 + \beta)\, \frac{\sum_{u \in V} R(x_u, y)}{k}, \quad \forall i \in \{1, 2, \ldots, k\}$$
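The replica count and the hyperedge load constraint can be checked numerically with a short Python sketch. It assumes that x_{u,i} = 1 when vertex u is assigned to worker i and y_{h,i} = 1 when hyperedge h is assigned to worker i, which the slide does not state explicitly; the function names and the toy assignment are hypothetical.

```python
def replicas(u, vertex_part, hedge_part, incident, k):
    """R(x_u, y) = sum_i max(1 - x_{u,i} - prod_{h in N(u)} (1 - y_{h,i}), 0)."""
    total = 0
    for i in range(k):
        x_ui = 1 if vertex_part[u] == i else 0
        prod = 1
        for h in incident[u]:
            prod *= 0 if hedge_part[h] == i else 1  # factor (1 - y_{h,i})
        total += max(1 - x_ui - prod, 0)
    return total


def hyperedge_loads_ok(hedge_part, arity, k, alpha):
    """Check sum_h y_{h,i} * a_h <= (1 + alpha) * sum_h a_h / k for every worker i."""
    cap = (1 + alpha) * sum(arity.values()) / k
    loads = [0] * k
    for h, i in hedge_part.items():
        loads[i] += arity[h]
    return all(load <= cap for load in loads)


# Toy assignment (hypothetical data): 3 vertices, 3 hyperedges, k = 3 workers.
k = 3
vertex_part = {"a": 0, "b": 1, "c": 2}
hedge_part = {"h1": 0, "h2": 1, "h3": 1}
incident = {"a": ["h1", "h2"], "b": ["h2", "h3"], "c": ["h1", "h3"]}
arity = {"h1": 2, "h2": 2, "h3": 2}

per_vertex = {u: replicas(u, vertex_part, hedge_part, incident, k) for u in vertex_part}
print(per_vertex, "objective =", sum(per_vertex.values()))
print(hyperedge_loads_ok(hedge_part, arity, k, alpha=1.0))
```

A vertex needs a replica on worker i exactly when it is not stored there but at least one of its incident hyperedges is, which is what the max term encodes.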

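22 details: partitioning theoretic analysis
How hard is it? Consider the special case where $\alpha = 0$ and $\beta = +\infty$:
$$\text{minimize} \quad \sum_{u \in V} \sum_{i=1}^{k} \Big(1 - \prod_{h \in N(u)} (1 - y_{h,i})\Big)$$
$$\text{subject to} \quad \sum_{h \in H} y_{h,i}\, a_h \le \frac{\sum_{h \in H} a_h}{k}, \quad \forall i \in \{1, 2, \ldots, k\}$$
  reduction from the strongly NP-complete 3-Partition problem
  no polynomial-time solution with a finite approximation factor
  in plain words, it is extremely hard!
  how about $\alpha > 0$?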

23 details: partitioning practical solutions
Label propagation partitioning (LPP):
  labels are partitions
  label both vertices and hyperedges
  iteratively update labels
Specifically,
$$L(h) = \arg\max_{i \in K} \big|\{v \mid v \in N(h) \wedge L(v) = i\}\big|$$
$$L(v) = \arg\max_{i \in K} \Big(\big|\{h \mid h \in N(v) \wedge L(h) = i\}\big| \cdot e^{\frac{\bar{A}^2 - A_i^2}{\bar{A}^2}}\Big)$$
where $A_i = \sum_{L(h) = i} a_h$.
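The two update rules can be sketched in Python as follows. Several details are assumptions rather than facts from the slides: Ā is taken to be the mean of the A_i over all partitions, vertices are initialised round-robin, ties go to the lowest partition index, and the iteration count is fixed. The function name `lpp` is hypothetical.

```python
import math
from collections import Counter


def lpp(hyperedges, k, iters=5):
    """Label propagation partitioning sketch: labels are partition indices 0..k-1."""
    vertices = sorted(set().union(*hyperedges))
    incident = {v: [h for h, m in enumerate(hyperedges) if v in m] for v in vertices}
    v_label = {v: idx % k for idx, v in enumerate(vertices)}  # assumed initialisation
    h_label = {}
    for _ in range(iters):
        # L(h): the partition holding most of the hyperedge's vertices.
        for h, members in enumerate(hyperedges):
            counts = Counter(v_label[v] for v in members)
            h_label[h] = max(range(k), key=lambda i: counts[i])
        # A_i: total arity of hyperedges labeled i; mean_load plays the role of A-bar.
        load = [0] * k
        for h, members in enumerate(hyperedges):
            load[h_label[h]] += len(members)
        mean_load = sum(load) / k
        weight = [math.exp((mean_load**2 - a**2) / mean_load**2) for a in load]
        # L(v): the partition holding most incident hyperedges, damped by its load.
        for v in vertices:
            counts = Counter(h_label[h] for h in incident[v])
            v_label[v] = max(range(k), key=lambda i: counts[i] * weight[i])
    return v_label, h_label


# Toy usage (hypothetical data): 8 vertices, 6 hyperedges, 2 partitions.
hyperedges = [{0, 1, 2}, {1, 2, 3}, {0, 3}, {4, 5, 6}, {5, 6, 7}, {4, 7}]
v_label, h_label = lpp(hyperedges, k=2)
print(v_label)
print(h_label)
```

The exponential weight drops below 1 for partitions whose hyperedge load exceeds the mean, so heavily loaded partitions attract fewer vertices.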

25 experimental settings
Metrics: data RDD size, data shuffled, elapsed time
Comparisons:
  HyperX (hx), Bipartite (star), Clique (clique)
  random, greedy, aweto, hmetis, LPP
  random walk (RW), label propagation (LP), spectral (SP)
Environment:
  8 nodes, 28 workers, 600 Mbps network
  Hadoop 2.4.0, YARN enabled, Spark
  HyperX implemented in Scala

26 datasets
Table 3: Datasets presented in the empirical study
Columns: Dataset, n, m, d_min, d_max, d̄, σ_d, cv_d, a_min, a_max, ā, σ_a, cv_a
Dataset                       n     m
Medline Coauthor (Med)        3.2m  8m
Orkut Communities (Ork)       2.3m  15m
Friendster Communities (Fri)  7.9m  1.6m
Synthetic (Zipfian, s = 2)    2m    8m
[Values for the remaining columns and for the other synthetic configurations are not recoverable.]

27 evaluating hypergraph representation: space
Figure 5: Memory consumption of data RDDs. [Chart: data RDD size (GB), split into vertices and hyperedges, for hx, clique, and star on MedRW, MedLP, OrkRW, OrkLP, FriRW, and FriLP.]
HyperX consumes 44% to 77% less memory than Bipartite.

28 evaluating hypergraph representation: communication
Figure 6: Data shuffled on the network. [Chart: data shuffled (GB) at the 5th iteration, split into read and write, for star and hx on MedRW, MedLP, OrkRW, OrkLP, FriRW, and FriLP.]
HyperX shuffles 19% to 98% less data than Bipartite.

29 evaluating hypergraph representation: time
Figure 7: Elapsed time. [Chart: elapsed time (s) per 10 iterations for hx and star on MedRW, MedLP, OrkRW, OrkLP, FriRW, and FriLP.]
HyperX is up to 49.1 times faster than Bipartite.

30 evaluating partitioning effectiveness: replica factor
Figure 8: Different partitioning algorithms, replication factor. [Chart: replica factor on Med, Ork, and Fri for random, aweto, greedy, hmetis5, hmetis1, and lpp.]
HyperX produces 1.1 to 1.9 times more replicas than hmetis.

31 evaluating partitioning effectiveness: load balance
Figure 9: Different partitioning algorithms, load balance. [Chart: workload CoV for MedArity, MedReplica, OrkArity, OrkReplica, FriArity, and FriReplica under random, aweto, greedy, hmetis5, hmetis1, and lpp.]
LPP produces loads that are 1.1 to 37.7 times more balanced than those of hmetis.
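The load-balance metric can be sketched in Python, reading "CoV" as the coefficient of variation (standard deviation divided by mean) of the per-worker loads. Using the population standard deviation is an assumption, as are the function name `workload_cov` and the sample loads.

```python
from statistics import mean, pstdev


def workload_cov(loads):
    """Coefficient of variation of per-worker loads; 0.0 means perfectly balanced."""
    avg = mean(loads)
    return pstdev(loads) / avg if avg else 0.0


# Toy usage (hypothetical per-worker arity loads).
balanced = workload_cov([2, 2, 2])
skewed = workload_cov([2, 4, 0])
print(balanced, skewed)
```

A lower value means the workers carry more similar loads, so a partitioner with a smaller CoV is the better balanced one.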

32 evaluating partitioning effectiveness: space
Figure 10: Different partitioning algorithms on Orkut, space. [Chart: data RDD size (GB), split into hyperedges and vertices, for lpp, hmetis1, hmetis5, greedy, aweto, and random running RW, LP, and SP.]
LPP and hmetis both outperform simplistic methods.

33 evaluating partitioning effectiveness: communication
Figure 11: Different partitioning algorithms on Orkut, communication. [Chart: data shuffled (MB) at the 5th iteration, split into read and write, for greedy, aweto, random, lpp, hmetis1, and hmetis5 running RW, LP, and SP.]
LPP and hmetis both significantly outperform simplistic methods.

34 evaluating partitioning effectiveness: time
Figure 12: Different partitioning algorithms, time. [Chart: elapsed time (s) per 10 iterations on MedRW, MedLP, MedSP, OrkRW, OrkLP, and OrkSP for random, aweto, greedy, hmetis5, hmetis1, and lpp.]
LPP yields up to a 2.6 times speedup over hmetis.

35 evaluating partitioning efficiency
LPP is written in Scala and runs on the JVM; hmetis is written in C.
Table 4: Partitioning time of different algorithms
Dataset Algorithm Time t (s) w w.r.t. LPP
Med LPP hmetis5 14,
Ork LPP hmetis5 88,
Fri LPP hmetis5 6,

36 evaluating learning algorithms: dataset cardinality
Figure 13: Elapsed time running algorithms on varying dataset cardinality, synthetic. [Chart: elapsed time (s) per 5 iterations for RW, LP, and SP against the number of hyperedges.]

37 evaluating learning algorithms: number of workers
Figure 14: Elapsed time running algorithms on varying number of workers, Orkut. [Chart: elapsed time (s) per 10 iterations for RW, LP, and SP against the number of workers.]

38 optional evaluating lpp: time and replicas
Figure 15: Elapsed time and replication factor. [Chart: elapsed time (s) and replica factor per iteration for MedReplica, OrkReplica, MedTime, and OrkTime.]
It takes LPP only a few iterations to achieve a reasonable replication ratio.

39 optional evaluating lpp: load balance
Figure 16: Elapsed time and replication factor. [Chart: workload CoV per iteration for MedReplica, MedArity, OrkReplica, and OrkArity.]
It takes LPP only a few iterations to achieve a reasonable load balance.

40 conclusion
Problem: scalable hypergraph learning
Challenges:
  1. Inflated problem size
  2. Excessive replication
  3. Great difficulty in balancing the loads
Solutions:
  1. Operate on a distributed hypergraph
  2. Replicate only vertices
  3. Partitioning optimization
Contribution:
  Efficient and scalable hypergraph framework
  Effective and efficient partitioning algorithm

41 Thanks! Any questions or comments?
