Shortest paths on large graphs: Systems, Algorithms, Applications

1 Shortest paths on large graphs: Systems, Algorithms, Applications Andrey Gubichev TU München January 2012 Andrey Gubichev Shortest paths on large graphs 1 / 53

2 Outline
  Introduction
  Systems
  Algorithms
  Applications
    Semantic Web
    Social Search

3 Everything is a graph
  [Figures: Internet graph (Richardson); Web graph; social network; Wikipedia (Tulip); proteins (Bordalier Inst)]

4-6 RDF: format for graph data
  [Figure: entity-relationship graph around Marie Curie. Nodes: Maria Sklodowska, 1867, 1934, Warsaw, Poland, Pierre Curie, Henri Becquerel, U Paris, Nobel Prize Physics, Nobel Prize Chemistry. Edge labels: bornas, bornon, diedon, bornin, in, marriedto, adviser, alma mater, haswon]
  RDF:
    (id1, name, "Marie Curie")
    (id1, bornon, 1867)
    (id1, bornin, id2)
    (id2, name, "Warsaw")
    (id2, locatedin, id3)
    (id3, name, "Poland")
  pay-as-you-go: schema-agnostic, schema-later
  RDF triples form an ER graph
  (G. Weikum, WSDM 09)

7 RDF: a lot of data out there
  Linked Data Project, linkeddata.org
  Linked Data: extract explicit knowledge (ER-oriented facts) from the world's best information sources (Wikipedia, Web, Web 2.0)

8-10 SPARQL: a query language for RDF
  Select ?c Where {
    ?p isa scientist .
    ?p bornin ?t .
    ?p haswon ?a .
    ?t locatedin ?c .
    ?a Name NobelPrize .
    Filter (?t < 1900)
  }
  SQL-like syntax
  triple patterns
  common variables form joins
  filter predicates
  A second query, with variables in predicate position:
  Select Distinct ?c Where {
    ?p ?r1 ?t .
    ?t ?r2 ?c .
    ?c isa Country .
    ?p bornon ?b .
    Filter (?b > 1945)
  }
  wildcard joins

11-14 RDF & SPARQL Engines
  Giant triples table: Sesame/OpenRDF, YARS2 (DERI)
    S    P       O
    id1  Name    Marie Curie
    id1  bornon  1867
    id1  bornin  id2
    id2  Name    Warsaw
    ...
  Clustered property tables: Jena (HP Labs), Oracle RDF MATCH
    Person
    S    Name     bornon  bornin  ...
    id1  Marie C  1867    id3     ...
    id2  Henri B  1852    id9     ...
    Town
    S    Name    Country
    id3  Warsaw  id11
    ...
  Property table (one S/O table per property): C-Store (MIT), MonetDB (CWI)
    bornon: S, O
    Advisor: S, O
  Why a new engine? Three main things in database design:
    1. Performance
    2. Performance
    3. Performance

15-20 Scalable Semantic Web: RDF-3X Engine [T. Neumann et al: VLDB 08]
  tuning-free system architecture: giant triple table
  map literals into ids (dictionary) and precompute exhaustive indexing for SPO triples:
    SPO, SOP, OPS, OSP, PSO, POS, SP*, SO*, OS*, PO*, OP*, S*, P*, O*
  very high compression, index-only store
  store indexes directly in clustered B+ trees
  can choose any order for scan and join
  also store two mapping indexes: literal -> id, id -> literal
  efficient merge joins with order preservation
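
The indexing idea above can be illustrated with a small in-memory sketch. It is not RDF-3X code: sorted Python lists stand in for the clustered B+ trees, there is no compression, and the names `build_store`, `prefix_scan`, and `merge_join` are hypothetical. It only shows why keeping every permutation of (S, P, O) lets two triple patterns be joined by a merge join without re-sorting.

```python
from bisect import bisect_left
from itertools import permutations


def build_store(triples):
    """Dictionary-encode string triples and sort them in all six orders."""
    to_id, to_literal = {}, []

    def encode(term):
        if term not in to_id:
            to_id[term] = len(to_literal)
            to_literal.append(term)
        return to_id[term]

    encoded = [tuple(encode(term) for term in triple) for triple in triples]
    indexes = {}
    for order in permutations("SPO"):
        pos = ["SPO".index(c) for c in order]
        indexes["".join(order)] = sorted(tuple(t[i] for i in pos) for t in encoded)
    return to_id, to_literal, indexes


def prefix_scan(index, prefix):
    """Yield every entry of a sorted index that starts with `prefix`."""
    i = bisect_left(index, prefix)
    while i < len(index) and index[i][:len(prefix)] == prefix:
        yield index[i]
        i += 1


def merge_join(left, right):
    """Join two key-sorted lists of (key, value) pairs in one linear pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key, i2, j2 = left[i][0], i, j
            while i2 < len(left) and left[i2][0] == key:
                i2 += 1
            while j2 < len(right) and right[j2][0] == key:
                j2 += 1
            out.extend((key, l[1], r[1]) for l in left[i:i2] for r in right[j:j2])
            i, j = i2, j2
    return out


triples = [
    ("id1", "name", "Marie Curie"), ("id1", "bornin", "id2"),
    ("id2", "name", "Warsaw"), ("id2", "locatedin", "id3"),
    ("id3", "name", "Poland"),
]
to_id, to_literal, indexes = build_store(triples)
bornin, locatedin = to_id["bornin"], to_id["locatedin"]
# ?p bornin ?t  : the POS index returns matches already sorted by ?t
left = [(t, p) for _, t, p in prefix_scan(indexes["POS"], (bornin,))]
# ?t locatedin ?c : the PSO index returns matches already sorted by ?t
right = [(t, c) for _, t, c in prefix_scan(indexes["PSO"], (locatedin,))]
joined = [tuple(to_literal[i] for i in row) for row in merge_join(left, right)]
```

Here `joined` holds (town, person, country) rows. Both inputs arrive sorted on the join variable because a suitable permutation exists for each pattern, which is the order-preservation property the slide refers to.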

21 RDF-3X Query Optimization [T. Neumann et al: VLDB 08]
  bottom-up dynamic programming for plan enumeration
  exploit numerous indexes, order preservation
  cost model based on selectivity estimation

22-23 Evaluation [T. Neumann et al: SIGMOD 09]
  Queries like: find a Polish scientist with a French advisor, both of whom received some awards
  YAGO knowledge base: 40 million triples
  Billion Triple dataset, Uniprot (845 million): similar results
  Try it out! RDF-3X is freely available.

25 What is missing?
  What kinds of queries CAN we answer?
    Find the latitude and longitude of the Eiffel Tower
    Find politicians who are also scientists
  What kinds of queries can we NOT answer?
    Find common things between Angela Merkel and Arnold Schwarzenegger
    Find all European-born Nobel prize winners
  Why? They require path traversals over the RDF graph.

26-28 Why is SPARQL not enough?
  Sometimes we need to form join chains of unknown length (e.g., we need the transitive closure of a predicate).
  Example triples:
    Humboldt bornin Berlin .
    Berlin locatedin Germany .
    Einstein bornin Ulm .
    Ulm locatedin Baden-Württemberg .
    Baden-Württemberg locatedin Germany .
  Were they both born in Germany? Yes. How to figure that out?
    Einstein -bornin-> Ulm -locatedin-> Baden-Württemberg -locatedin-> Germany
    Humboldt -bornin-> Berlin -locatedin-> Germany
  How to find all scientists that were born in Germany?
  SPARQL:
    ?person bornin ?place . ?place locatedin Germany .
    UNION
    ?person bornin ?place . ?place locatedin ?place1 . ?place1 locatedin Germany .
    UNION ...
  SPARQL with paths:
    ?person bornin ?place . ?place ??path Germany .
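
The unbounded chain of UNIONs amounts to a reachability test along locatedin edges. The sketch below shows that with a breadth-first search over the example triples. It is illustrative only: the function names `reaches` and `born_in` are hypothetical, and it says nothing about how a SPARQL engine would evaluate a path pattern.

```python
from collections import deque


def reaches(triples, start, predicate, target):
    """Return True if `target` is reachable from `start` via `predicate` edges."""
    successors = {}
    for s, p, o in triples:
        if p == predicate:
            successors.setdefault(s, []).append(o)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def born_in(triples, region):
    """People whose birthplace lies in `region` after any number of locatedin hops."""
    return sorted(
        s for s, p, o in triples
        if p == "bornin" and reaches(triples, o, "locatedin", region)
    )


triples = [
    ("Humboldt", "bornin", "Berlin"),
    ("Berlin", "locatedin", "Germany"),
    ("Einstein", "bornin", "Ulm"),
    ("Ulm", "locatedin", "Baden-Württemberg"),
    ("Baden-Württemberg", "locatedin", "Germany"),
]
germans = born_in(triples, "Germany")
```

Berlin reaches Germany in one hop and Ulm in two, so both people are found without writing one UNION branch per chain length.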

29 SPARQL with path variables
  Introduced by K. Anyanwu et al. (WWW 07)
  Example: select ??p ?obj where { ?place ??path Germany } (path triple)
  ??p: there exists a path from place to Germany in the RDF graph
  we consider only shortest paths
  we can specify filter (conditions) on ??p
  we can join such path patterns with regular patterns
  Example:
    select ?name where {
      ?m type Mountain .
      ?m hasname ?name .
      ?m ??location Europe .
      filter(containsonly(??location, locatedin))
    }

30 How to execute SPARQL with path variables? [A. Gubichev et al: WebDB 11]
  We build upon RDF-3X. Two goals:
    Query optimization: how to estimate the cardinality of path triples?
    Physical level: how to perform a path scan efficiently?

32 Can we do better?
  Dijkstra's algorithm is fine, but let's consider approximate algorithms (trade quality for speed)
  Let's change the setting for now: shortest paths on a social network
  Social network:
    a set of people
    a social relationship linking them

33 Problem Statement
  Exact shortest path:
    V: users, E: friend-of relationships
    Graph G(V, E): directed, unweighted, static
    Given $u, v \in V$, find the shortest path from u to v
  Approximate shortest path:
    Graph is disk-resident
    Offline step: do some precomputation, store it on disk
    Online step: for $u, v \in V$, quickly find some path from u to v
    Approximation error: $\frac{\text{approximate} - \text{exact}}{\text{exact}}$

34 Different approaches
  Exact SP
    Dijkstra: very slow
    A*: works well for road networks, slow for OSN (online social networks)
    Hierarchy-based decomposition: works well for road networks, slow for OSN
  Approximate SP
    Different types of preprocessing: keep distances from all nodes to a small subset of nodes (random, with high degree or centrality)
    Poor results for OSN: average error is 10%
    Find just the distance, not the path itself

35 Precomputation
  Step 1: Set $r = \log |V|$
  Step 2: Sample $r + 1$ sets of nodes (uniformly, at random) of sizes $1, 2, 2^2, 2^3, \ldots, 2^r$
  Step 3: For every $u \in V$ and for every set $S$:
    1. Find the closest nodes to $u$ in $S$ (landmarks):
         landmark $h \in S$: $\mathrm{dist}(u, h) = \mathrm{dist}(u, S)$
         landmark $h \in S$: $\mathrm{dist}(h, u) = \mathrm{dist}(S, u)$
    2. Find the distance from $u$ to $h$ and from $h$ to $u$

36 Precomputation - WSDM 10 approach [A. Das Sarma et al: WSDM 10]
  [Figure: node u linked to its closest landmarks h_1 in S_1, h_2 in S_2, ..., h_r in S_r, with distances on the links]
  Sketch in RDF: u 2 h_1, u 3 h_2, ..., u 1 h_r

37 Precomputation - our approach [A. Gubichev et al: CIKM 10]
  [Figure: node u linked to landmarks h_1 in S_1, h_2 in S_2, ..., h_r in S_r through intermediate nodes x and y]
  Sketch in RDF: u x h_1, u x y h_2, ..., u h_r

38 Precomputation
  Step 1: Set $r = \log |V|$
  Step 2: Sample $r + 1$ sets of nodes (uniformly, at random) of sizes $1, 2, 2^2, 2^3, \ldots, 2^r$
  Step 3: For every $u \in V$ and for every set $S$:
    1. Find the closest nodes to $u$ in $S$ (landmarks):
         landmark $h \in S$: $\mathrm{dist}(u, h) = \mathrm{dist}(u, S)$
         landmark $h \in S$: $\mathrm{dist}(h, u) = \mathrm{dist}(S, u)$
    2. Find the path from $u$ to $h$ and from $h$ to $u$
    3. Store the paths (RDF): u path h, h path u
  Step 4: Repeat Steps 2-3 $k$ times (we use $k = 2$).
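
A minimal in-memory sketch of the four steps, under stated assumptions: the graph is an adjacency dict, $r$ is taken as the floor of $\log_2 |V|$, and one multi-source BFS per sampled set replaces a per-node search. The function names and the dict-of-lists result are hypothetical; the slides store the paths as RDF triples on disk instead.

```python
import math
import random
from collections import deque


def nearest_seed_paths(adj, seeds):
    """Multi-source BFS from `seeds`; maps each reached node to a shortest
    path [seed, ..., node] from its closest seed."""
    parent = {s: None for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    paths = {}
    for node in parent:
        path, cur = [], node
        while cur is not None:
            path.append(cur)
            cur = parent[cur]
        paths[node] = path[::-1]
    return paths


def build_sketches(adj, k=2, rng=random):
    """Return (to_landmark, from_landmark): for each node u, lists of paths
    u -> h and h -> u to the closest landmark of every sampled set."""
    nodes = sorted(set(adj) | {v for vs in adj.values() for v in vs})
    rev = {}
    for u, vs in adj.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    r = int(math.log2(len(nodes)))               # Step 1 (floor is an assumption)
    to_landmark = {u: [] for u in nodes}
    from_landmark = {u: [] for u in nodes}
    for _ in range(k):                           # Step 4
        for i in range(r + 1):                   # Step 2: sizes 1, 2, ..., 2^r
            seeds = rng.sample(nodes, min(2 ** i, len(nodes)))
            # Step 3: BFS over reversed edges finds, for each u, the closest
            # landmark h by dist(u, h); reversing the list gives u -> ... -> h.
            for u, p in nearest_seed_paths(rev, seeds).items():
                to_landmark[u].append(p[::-1])
            # BFS over forward edges gives the closest landmark by dist(h, u).
            for u, p in nearest_seed_paths(adj, seeds).items():
                from_landmark[u].append(p)
    return to_landmark, from_landmark
```

Nodes that cannot reach, or be reached from, any node of a sampled set simply get no entry for that set.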

39 Sketch
  Sketch for a node u consists of
    1. Landmarks $h_1, \ldots, h_{kr}$
    2. Paths from u to landmarks
    3. Paths from landmarks to u
  Sketch for u consists of two trees (u is the root)
  We keep sketches for every $u \in V$

40-45 SKETCH algorithm: online part [A. Das Sarma et al: WSDM 10]
  Input: nodes $s, d \in V$
    1. Load all the distances from s
    2. Load all the distances to d
    3. Find common landmarks
    4. Construct the paths
    5. Select the shortest distance
  Output: distance from s to d

46 SKETCH algorithm with paths [A. Gubichev et al: CIKM 10]
  Input: nodes $s, d \in V$
    1. Load all the paths from s
    2. Load all the paths to d
    3. Find common landmarks
    4. Construct the paths
    5. Select the shortest path
  Output: path from s to d: s x y h z d
  [Figure: the path s, x, y, h from s to a common landmark h joined with the path h, z, d from h to d]
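
The online step in both variants reduces to a lookup over common landmarks. The sketch below assumes the precomputed data is already in memory as plain dicts and lists; the function names and input shapes are hypothetical stand-ins for the RDF-stored sketches.

```python
def sketch_distance(dist_from_s, dist_to_d):
    """Distance variant: inputs map landmark -> dist(s, h) and landmark -> dist(h, d).
    Returns the smallest sum over common landmarks, or None if there is none."""
    common = dist_from_s.keys() & dist_to_d.keys()
    if not common:
        return None
    return min(dist_from_s[h] + dist_to_d[h] for h in common)


def sketch_path(paths_from_s, paths_to_d):
    """Path variant: inputs are lists of paths [s, ..., h] and [h, ..., d].
    Returns the shortest concatenation through a common landmark, or None."""
    shortest_to_d = {}
    for q in paths_to_d:
        h = q[0]
        if h not in shortest_to_d or len(q) < len(shortest_to_d[h]):
            shortest_to_d[h] = q
    best = None
    for p in paths_from_s:
        q = shortest_to_d.get(p[-1])
        if q is not None:
            candidate = p + q[1:]          # the landmark appears only once
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


estimate = sketch_distance({"h1": 2, "h2": 3, "h3": 1}, {"h2": 1, "h3": 4, "h4": 2})
route = sketch_path([["s", "x", "y", "h"], ["s", "a", "g"]],
                    [["h", "z", "d"], ["k", "d"]])
```

With these inputs the common landmarks are h2 and h3 for the distance variant and h for the path variant, which reproduces the s x y h z d output shown on the slide.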

47 Datasets
  Slashdot: 77K nodes, undirected
  YouTube: 1.1 million nodes
  Flickr: 1.7 million nodes
  WikiTalk: 2.2 million nodes
  Twitter: 2.4 million nodes
  Orkut: 3 million nodes, undirected
  Sources: Stanford, MPI, Telefonica Research

48 Approximation error of the Sketch algorithm
  $\text{Error} = \frac{\text{approximate} - \text{exact}}{\text{exact}}$
  Dataset (#nodes)   Sketch error
  Slashdot (77K)     46%
  YouTube (1.1M)     30%
  Flickr (1.7M)      28%
  WikiTalk (2.2M)    55%
  Twitter (2.4M)     51%
  Orkut (3M)         71%

50-53 First modification
  We find the path, not just the distance!
  Are there cycles?
  If so, construct a shorter path
  [Figure: a path from s to d that passes through node a twice; cutting out the loop at a leaves s ... a ... d]
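
Cycle elimination on a concatenated path can be done in a single pass. This is a minimal sketch with a hypothetical function name; it assumes the path is a plain list of node ids.

```python
def remove_cycles(path):
    """Cut out every loop so that each node appears at most once."""
    position = {}   # node -> index in `out`
    out = []
    for node in path:
        if node in position:
            # Seen before: drop everything visited since the first occurrence.
            cut = position[node]
            for dropped in out[cut + 1:]:
                del position[dropped]
            del out[cut + 1:]
        else:
            position[node] = len(out)
            out.append(node)
    return out


shortened = remove_cycles(["s", "x", "a", "y", "z", "a", "d"])
```

Each node is appended and removed at most once per occurrence, so the pass is linear in the path length, consistent with the "no time overhead" remark on the next slide.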

54 Approximation error of the first modification
  No time overhead!
  Dataset (#nodes)   Sketch error   Sketch I error
  Slashdot (77K)     46%            26%
  YouTube (1.1M)     30%            12%
  Flickr (1.7M)      28%            11%
  WikiTalk (2.2M)    55%            31%
  Twitter (2.4M)     51%            38%
  Orkut (3M)         71%            48%

55-57 Second modification
  Are there any hidden connections between nodes of the path?
  If yes, construct a shorter path
  [Figure: a path from s to d with an extra edge between two non-adjacent path nodes]

58 Second modification
  How to check it?
    1. For every node in the path, load the list of friends from the original dataset
    2. For every pair of nodes from the path, check whether they are friends
  The number of nodes in the path is usually small!
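
One way to realize the pairwise check is a breadth-first search restricted to the nodes of the path, which finds the best route that uses any known edge between them. This is a sketch under assumptions: `friends` is a hypothetical in-memory dict from a node to its outgoing neighbours, standing in for the friend lists loaded from the original dataset.

```python
from collections import deque


def shortcut(path, friends):
    """Shortest route from path[0] to path[-1] using only nodes of `path`
    and any friend edge between them."""
    on_path = set(path)
    source, target = path[0], path[-1]
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt in friends.get(node, ()):
            if nxt in on_path and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if target not in parent:
        return list(path)      # friend lists incomplete: keep the input path
    out, cur = [], target
    while cur is not None:
        out.append(cur)
        cur = parent[cur]
    return out[::-1]


friends = {"s": ["x"], "x": ["y", "z"], "y": ["h"], "h": ["z"], "z": ["d"]}
improved = shortcut(["s", "x", "y", "h", "z", "d"], friends)
```

In the example the hidden edge from x to z lets the search skip y and h. Only the friend lists of path nodes are touched, which is why a short path keeps the check cheap.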

59 Approximation error of the second modification
  Dataset (#nodes)   Sketch error   Sketch I error   Sketch II error
  Slashdot (77K)     46%            26%              0.6%
  YouTube (1.1M)     30%            12%              0.6%
  Flickr (1.7M)      28%            11%              0.3%
  WikiTalk (2.2M)    55%            31%              0.2%
  Twitter (2.4M)     51%            38%              0.8%
  Orkut (3M)         71%            48%              0.6%

60-65 Tree algorithm
  Paths from a node to landmarks form a tree
  Load paths from s and to d
  Start BFS from s and d
  For every visited node, load a list of friends
  For every pair of visited nodes, check:
    1. are they equal? (s3, d1)
    2. are they friends? (s1, d)
  Form a new path and put it into the queue Q
  Don't go too deep: terminate if level_s + level_d > Q.top.length (in the example, level_s + level_d = 4 > 2)
  [Figure: tree rooted at s with nodes s1, ..., s5 and tree rooted at d with nodes d1, ..., d4]
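
The following is a simplified sketch of the idea, not the authors' implementation: instead of an interleaved BFS with a priority queue Q, it enumerates pairs of tree nodes by increasing depth and stops once the depth sum can no longer beat the best path found. The inputs are hypothetical in-memory structures: lists of paths from s and to d, and a `friends` dict of outgoing neighbours.

```python
def tree_search(paths_from_s, paths_to_d, friends):
    """Best path s -> d that joins the two sketch trees at an equal node
    or across a single friend edge."""
    prefix = {}   # node in the s-tree -> shortest known path [s, ..., node]
    for p in paths_from_s:
        for i, node in enumerate(p):
            if node not in prefix or i + 1 < len(prefix[node]):
                prefix[node] = p[:i + 1]
    suffix = {}   # node in the d-tree -> shortest known path [node, ..., d]
    for q in paths_to_d:
        for i, node in enumerate(q):
            if node not in suffix or len(q) - i < len(suffix[node]):
                suffix[node] = q[i:]
    by_level_s = sorted(prefix.values(), key=len)
    by_level_d = sorted(suffix.values(), key=len)
    best = None
    for ps in by_level_s:
        level_s = len(ps) - 1
        if best is not None and level_s >= len(best) - 1:
            break                      # deeper levels cannot improve
        for sd in by_level_d:
            level_d = len(sd) - 1
            if best is not None and level_s + level_d >= len(best) - 1:
                break
            x, y = ps[-1], sd[0]
            if x == y:                 # 1. are they equal?
                candidate = ps + sd[1:]
            elif y in friends.get(x, ()):   # 2. are they friends?
                candidate = ps + sd
            else:
                continue
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


paths_from_s = [["s", "s1", "h1"], ["s", "s2", "m", "h2"]]
paths_to_d = [["h3", "m", "d1", "d"], ["h2", "d4", "d"]]
friends = {"s": ["s1", "s2"], "s1": ["h1", "d"], "s2": ["m"], "m": ["h2", "d1"],
           "h2": ["d4"], "h3": ["m"], "d1": ["d"], "d4": ["d"]}
found = tree_search(paths_from_s, paths_to_d, friends)
```

In the example the common landmark h2 gives a 5-hop path, the shared node m gives 4 hops, and the friend edge from s1 to d gives 2 hops, after which the depth bound ends the search.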

66 Approximation error of the Tree algorithm
  Dataset    Sketch error   Sketch I error   Sketch II error   Tree error
  Slashdot   46%            26%              0.6%              0
  YouTube    30%            12%              0.6%              0.06%
  Flickr     28%            11%              0.3%              0.04%
  WikiTalk   55%            31%              0.2%              0
  Twitter    51%            38%              0.8%              0.03%
  Orkut      71%            48%              0.6%              0.1%

67 Experimental setup
  Pick 100 nodes (uniformly at random) from the OSN.
  For each node, compute the shortest-path tree (Dijkstra).
  The result is $\{(x, y, dist) \mid x, y \in V,\ dist = \mathrm{dist}(x, y)\}$.
  Group the triples by distance and randomly choose 50 triples from every group.
  For every chosen triple (x, y, dist): find approximate shortest paths from x to y and compare their lengths with dist.

68 Implementation details
  Datasets in RDF: user1 friend-of user2
  Precomputed paths in RDF: u path h, h path u
  RDF-3X for datasets and precomputed data
  C++
  Laptop: 2.0 GHz Intel Core 2 Duo, 4 GB RAM, 3 MB L2 cache

69 Time
  [Table: query time in seconds for Sketch, Sketch II, Tree, and Dijkstra, plus the Dijkstra queue size (thousands for Flickr, millions for the rest), on Flickr (1.7M), WikiTalk (2.2M), Twitter (2.4M), and Orkut (3M); numeric values not preserved]

70 Disk space
  [Table: disk space for precomputed data in GB, with columns dataset size, Sketch with distances, and Sketch with paths, for Flickr, WikiTalk, Twitter, and Orkut; numeric values not preserved]

71 Number of shortest paths
  We find several shortest paths.
  [Table: number of shortest paths found by Sketch II and Tree on Flickr (1.7M), WikiTalk (2.2M), Twitter (2.4M), and Orkut (3M); numeric values not preserved]

73 Application #1: Semantic Web
  SPARQL v SPARQL + path traversal
  Querying the DB of entire human knowledge (everything that Wikipedia knows)

75 Small World
  Milgram 1967
  People are given letters and asked to forward them to one friend
  Source: random Omahans; target: a stockbroker in Sharon, MA
  Completed chains averaged 6 hops to reach the target

76-77 Shortest paths on Social Networks
  Shortest paths are interesting...
    per se: what is the distance between you and Angela Merkel?
    for geeks: Erdős number
    as an important primitive for social network analysis (diameter, centrality, etc.)
    social search
  Of course, we can run a one-to-many shortest-paths algorithm
  Example: John searches for Mary. Ranking: 1. Mary A, 2. Mary B, 3. Mary C
  M. Potamias et al., CIKM 2009

78 Acknowledgements
  Srikanta Bedathur
  Gerhard Weikum
  Josep M. Pujol
  Thomas Neumann
  Sihem Amer-Yahia

79 Thank you! Questions?
