Automated Generation of Object Summaries from Relational Databases: A Novel Keyword Searching Paradigm GEORGIOS FAKAS

1 Automated Generation of Object Summaries from Relational Databases: A Novel Keyword Searching Paradigm GEORGIOS FAKAS Department of Computing and Mathematics, Manchester Metropolitan University Manchester, UK. g.fakas@mmu.ac.uk

2 Related Work: Web Search Engines: Keyword Search Kw Search: Peacock Result: A ranked set of web pages

4 Related Work: Keyword Search in Relational DBs. Full-text Search (e.g. Oracle 9i Text). Kw Searching in Relational DBs (DISCOVER, BANKS). Kw Search: Leverling, Peacock. Result: e3-o2-c2, e4-o6-c2

5 A Novel Keyword Searching Paradigm: Object Summaries (OSs) Kw Search: Peacock Result: A Ranked set of OSs

6 A Novel Keyword Searching Paradigm: Object Summaries (OSs) Kw Search: Peacock Result: A Ranked set of OSs
Problems-Challenges: How can we automatically (1) generate OSs, (2) generate size-l OSs and (3) rank OSs, liberating users from knowledge of (1) the Schema and (2) the Query Language?

7 A Novel Keyword Searching Paradigm: Object Summaries (OSs)
1. Automated Generation of OSs: Affinity
2. Generation of size-l OSs: efficient greedy algorithms
ValueRank, a PageRank-inspired ranking system

8 OS Generation - Methodology
t^DS: a central tuple containing the Kw; tuples around t^DS contain additional information about the Data Subject.
R^DS: the corresponding central Relation; similarly, the Relations around R^DS contain additional information.

9 OS Generation - Methodology
KW-ID = Janet Leverling
[Figure: data graph of tuples from Territories (t1-t4), Region (r1-r2), Employees (e1-e4), EmployeeTerritories (et1-et4), Orders (o1-o7), Customers (c1-c3), Shippers (s1-s3), Order Details (od1-od6), Products (p1-p2), Suppliers (su1) and Categories (ca1).]
t^DS: a central tuple containing the Kw; tuples around t^DS contain additional information about the Data Subject.
R^DS: the corresponding central Relation; similarly, the Relations around R^DS contain additional information.

12 OS Generation - Methodology
KW-ID = Janet Leverling
[Figure: the same data graph grouped by Relation (Territories, Region, Employees, EmployeeTerritories, Orders, Customers, Shippers, Order Details, Products, Suppliers, Categories), labelled G^DS.]

13 OS Generation - Methodology: G^DS
Problem: not all Relations in G^DS are relevant. How do we decide (1) which Relations to select and (2) when to stop traversing?
Solution: investigate relational semantics (Schema Connectivity, Cardinality, Relative Cardinality, etc.) and quantify the Affinity of Relations.

16 Af(R_i): Affinity of Relations to R^DS in G^DS
Distance: Physical (fd) and Logical (ld), with ld = fd - M:N. E.g. Orders is closer to Employees than Customer and CustomerDemo are.
Hubs: spurious shortcuts that carry rather irrelevant or lateral information. RC(R1, R2): R^DS ... N:1 R_hub 1:M R_2

17 Af(R_i): Affinity of Relations to R^DS in G^DS
Connectivity:
- Schema Connectivity (Co_i).
- Data-graph Connectivity: Relative Cardinality (RC_{i→j}), i.e. the average number of tuples of R_i that are connected with each tuple of R_j. For 1:M, RC_{i→j} = |R_i| / |R_j|; for M:1, RC_{i→j} = 1.
- Reverse Relative Cardinality (RRC_{i→j}) is the reverse of RC_{i→j}, i.e. RRC_{i→j} = RC_{j→i}.

18 Af(R_i): Affinity of Relations to R^DS in G^DS
DAf(R_i) = {(m_1, w_1), (m_2, w_2), ..., (m_n, w_n)}
m_1 = f_1(ld_i), m_2 = f_1(log(10*rc_i)), m_3 = f_1(log(10*rrc_i)), m_4 = f_1(log(10*co_i)), where f_1(α) = (11 - α)/10.
For a hub-child: m_1 = f_1(ld_i * h_i) and m_2 = f_1(rc_i).
Formula 1 (Semantic Affinity): The affinity of R_i to R^DS, denoted as Af(R_i), with respect to a schema and a database conforming to the schema, can be calculated with the following formula:
$Af(R_i) = \left( \sum_j m_j w_j \right) \cdot Af(R_{Parent})$
where Af(R_Parent) is the affinity of R_i's Parent to R^DS, or is 1 if R_Parent is R^DS.
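Formula 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "log" is taken to be base 10 (which agrees with, for example, m_2 ≈ 0.9 for RC = 5.4 in the affinity table of slide 61), all weights are set to the common value w_j = 0.25 reported in the experimental evaluation, and the function names `f1`, `affinity_metrics` and `affinity` are hypothetical, not taken from the slides.

```python
import math


def f1(a):
    """Scaling function from the slide: f1(a) = (11 - a) / 10."""
    return (11 - a) / 10


def affinity_metrics(ld, rc, rrc, co):
    """Return [m1, m2, m3, m4] for a non-hub relation.

    Assumption: the slide's "log" is a base-10 logarithm.
    """
    return [
        f1(ld),
        f1(math.log10(10 * rc)),
        f1(math.log10(10 * rrc)),
        f1(math.log10(10 * co)),
    ]


def affinity(metrics, weights, parent_affinity=1.0):
    """Formula 1: weighted sum of the metrics times the parent's affinity.

    parent_affinity is 1 when the parent relation is R^DS itself.
    """
    return sum(m * w for m, w in zip(metrics, weights)) * parent_affinity


# Territories relative to Employees, using ld=1, RC=5.4, RRC=1, Co=2
# as listed in the affinity table (slide 61).
m = affinity_metrics(1, 5.4, 1, 2)
af_territories = affinity(m, [0.25] * 4)
print([round(x, 2) for x in m], round(af_territories, 2))
```

For a relation deeper in G^DS, the affinity of its parent would be passed as `parent_affinity`, so affinity can only stay equal or shrink along a path away from R^DS as long as the weighted sum does not exceed 1.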

19 Af(R_i): Affinity of Relations to R^DS in G^DS
[Figure: G^DS(θ)]

20 Experimental Evaluation
- Databases: MS Northwind and TPC-H.
- Measures: Precision, Recall, F-Score.
- Compare the G^DSs and OSs produced by G^DS(θ) v G^DS(h); G^DS(h) was proposed by 10 participants.
- G^DS: average F-score 86.77; OS: average F-score 83.
[Figure: G^DS Precision, Recall and F-score (Averages) <0.5, 0.4, 0.05, 0.05> and OSs Precision, Recall and F-score (Averages) <0.5, 0.4, 0.05, 0.05>, per Relation for Northwind (Customers, Employees, Suppliers, Shippers, Orders, Products) and TPC-H (Customer, Supplier, Parts, Orders, Nation, Region).]

21 Affinity Ranking Correctness (Averages)
[Figure: Affinity Ranking Correctness (Averages) per Relation for Northwind (Customers, Employees, Suppliers, Shippers, Orders, Products) and TPC-H (Customer, Supplier, Parts, Orders, Nation, Region).]
100 * 100 d ( r i Af h Ri, r Ri )

23 Generation of Size-l Object Summaries
Definition: A size-l OS Keyword Query is (1) a set of keywords and (2) a value for l. Example: Faloutsos with l=15.
Query Result: a partial OS composed of only l tuples that meets the following two criteria: (1) all l tuples are connected with the root of the OS tree and (2) the Importance of the size-l OS is the maximum.
Importance of a Size-l OS: Im(OS-size-l) = Σ Im(OS, t_i)
Local Importance of a Tuple: Im(OS, t_i) = Im(t_i) * Af(t_i)

24 Generation of Size-l Object Summaries
1. Brute-Force Algorithm: first generates the complete OS (i.e. |OS| tuple extractions, I/O), then considers all candidate size-l OSs in order to find the optimal size-l OS (exponential in-memory operations). A very expensive solution!
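The brute-force idea can be illustrated with a minimal Python sketch. It assumes the complete OS is already in memory as a tree given by a hypothetical `parent` dictionary (the root maps to `None`) and a `score` dictionary holding Im(OS, t_i); the function name and data layout are illustrative only and do not come from the slides.

```python
from itertools import combinations


def brute_force_size_l(parent, score, l):
    """Try every set of l tuples that contains the root and stays connected."""
    root = next(n for n, p in parent.items() if p is None)
    others = [n for n in parent if n != root]
    best, best_im = None, float("-inf")
    for combo in combinations(others, l - 1):
        chosen = set(combo) | {root}
        # Connected to the root iff every chosen tuple's parent is also chosen.
        if all(parent[n] in chosen for n in combo):
            im = sum(score[n] for n in chosen)
            if im > best_im:
                best, best_im = chosen, im
    return best, best_im


# Toy OS: r is the root; a and b are children of r; c is a child of b.
parent = {"r": None, "a": "r", "b": "r", "c": "b"}
score = {"r": 1.0, "a": 0.5, "b": 0.2, "c": 0.9}
best, best_im = brute_force_size_l(parent, score, 3)
print(sorted(best), round(best_im, 2))
```

In this toy tree the best size-3 OS is {r, b, c}: the low-scoring tuple b is kept because it connects the high-scoring tuple c to the root. The number of candidate sets grows combinatorially with |OS|, which is why the slide calls this approach very expensive.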

25 Generation of Size-l Object Summaries: Greedy Size-l OS Generation Algorithms
OS Property 1: Im(OS, t_i) usually decreases with depth from t^DS.
2.1 Bottom-Up Pruning Size-l Algorithm
First generates the complete OS (similarly to the brute-force algorithm) and then prunes from the bottom of the tree the k-l leaf nodes with the current smallest Im(OS, t_i), where k is the number of tuples in the complete OS.
Lemma 1: When the local Importance scores of the nodes of an OS decrease monotonically with their distance from the root (i.e. the score of an ancestor is always greater than its children's), the algorithm returns the optimal size-l OS.
Efficiency characteristics:
- |OS| I/O operations but only loglinear in-memory operations;
- very efficient when k is not significantly bigger than l, since fewer operations are required (i.e. k-l is smaller);
- considerably cheaper than the brute-force algorithm.
Correctness: very good approximations of the optimal size-l OS.
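A minimal sketch of bottom-up pruning, assuming the complete OS is held in memory as a hypothetical `parent` dictionary (root maps to `None`) with local Importance values in `score`. A min-heap of the current leaves stands in for "the leaf with the current smallest Im(OS, t_i)"; the names and data layout are assumptions for illustration, not the author's implementation.

```python
import heapq


def bottom_up_prune(parent, score, l):
    """Repeatedly drop the leaf with the smallest score until l tuples remain."""
    children_left = {n: 0 for n in parent}
    for n, p in parent.items():
        if p is not None:
            children_left[p] += 1
    kept = set(parent)
    # Current leaves; the root is never a pruning candidate.
    heap = [(score[n], n) for n in parent
            if children_left[n] == 0 and parent[n] is not None]
    heapq.heapify(heap)
    while len(kept) > l and heap:
        _, n = heapq.heappop(heap)
        kept.remove(n)
        p = parent[n]
        children_left[p] -= 1
        # A parent that lost its last child becomes a leaf itself.
        if children_left[p] == 0 and parent[p] is not None:
            heapq.heappush(heap, (score[p], p))
    return kept


# Toy OS: r is the root; a and b are children of r; c is a child of b.
parent = {"r": None, "a": "r", "b": "r", "c": "b"}
score = {"r": 1.0, "a": 0.5, "b": 0.2, "c": 0.9}
size3 = bottom_up_prune(parent, score, 3)
size2 = bottom_up_prune(parent, score, 2)
print(sorted(size3), sorted(size2))
```

The toy scores deliberately violate the monotonicity condition of Lemma 1 (c scores higher than its parent b). For l=3 the result {r, b, c} is still optimal, but for l=2 the algorithm keeps {r, b} although {r, a} has higher Importance, which shows why the result is only an approximation in general.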

26 Size-l OS Generation Bottom-Up Pruning Size-l Algorithm

37 Generation of Size-l Object Summaries: Greedy Size-l OS Generation Algorithms
2.2 Top-Down Size-l Algorithm
Uses a Priority Queue to build the OS by expanding on the current tuple with the biggest local Importance score.
Lemma 2: When the local Importance scores of the nodes of an OS decrease monotonically with their distance from the root, the Top-Down Algorithm returns the optimal size-l OS.
Efficiency characteristics:
- more efficient than both aforementioned algorithms when l is significantly smaller than k: fewer I/O operations (no need for the complete OS) and also fewer in-memory operations;
- on the other hand, when k is not very big in comparison to l, this algorithm becomes more expensive than Bottom-Up Pruning.
Correctness: less effective, because expanding on the current best local Importance value will not always lead to a good (near-optimal) solution.
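A minimal sketch of the top-down expansion, assuming a hypothetical `children` dictionary that plays the role of fetching a tuple's children (in a real system these would be retrieved from the database on demand) and a `score` dictionary with local Importance values. The names are illustrative assumptions, not taken from the slides.

```python
import heapq


def top_down_size_l(children, score, root, l):
    """Grow the OS from the root, always adding the best-scoring frontier tuple."""
    result = [root]
    # heapq is a min-heap, so scores are negated to pop the largest first.
    pq = [(-score[c], c) for c in children.get(root, [])]
    heapq.heapify(pq)
    while pq and len(result) < l:
        _, n = heapq.heappop(pq)
        result.append(n)
        # Only now are the children of n looked up (no complete OS needed).
        for c in children.get(n, []):
            heapq.heappush(pq, (-score[c], c))
    return result


# Toy OS: r is the root; a and b are children of r; c is a child of b.
children = {"r": ["a", "b"], "b": ["c"]}
score = {"r": 1.0, "a": 0.5, "b": 0.2, "c": 0.9}
size3 = top_down_size_l(children, score, "r", 3)
print(size3, round(sum(score[n] for n in size3), 2))
```

With these non-monotonic toy scores the algorithm returns {r, a, b} with total Importance 1.7, whereas {r, b, c} reaches 2.1: the high-scoring tuple c is hidden behind the low-scoring tuple b, which is the correctness weakness noted on the slide.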

38 Size-l OS Generation Top-Down Size-l Algorithm

49 Top-Down v Bottom-Up Pruning Size-l Algorithm
1. For small OSs, Bottom-Up Pruning is the best choice, since it always achieves better correctness and at the same time requires equal or even less time than the Top-Down Algorithm.
2. For larger OSs (e.g. |OS| > 300), there is a trade-off between (1) speed (Top-Down is at least twice as fast for any l < 50) and (2) correctness (Bottom-Up achieves at least 10% better correctness).

50 Experimental Evaluation
Database | Cardinalities | Size (MB)
DBLP | 2,959, |
TPC-H | 8,661,245 | 1,100
Northwind | 3, |
Parameter | Range
G^A | G^A1, G^A2, G^A3
d (d1, d2, d3) | 0.85, 0.10, 0.99
All G^DS(θ)s were generated with a common weight, i.e. w_i = 0.25, with θ = 0.70 and normalized Affinity.

51 Experimental Evaluation: Efficiency of the two Size-l Algorithms
[Figure: Time (s) versus l for Top-Down and Bottom-Up Pruning: (a) DBLP Author (Aver |OS| = 486), (b) DBLP Paper (Aver |OS| = 377), (c) TPC-H Customer (Aver |OS| = 179), (d) TPC-H Supplier (Aver |OS| = 1426).]

52 Experimental Evaluation: Correctness of the two greedy algorithms
[Figure: Correctness versus l for Top-Down and Bottom-Up Pruning: (a) DBLP Author (Aver |OS| = 364), (b) DBLP Paper (Aver |OS| = 279), (c) TPC-H Customer (Aver |OS| = 179), (d) TPC-H Supplier (Aver |OS| = 1425), (e) DBLP Author (|OS| = 67); (f) DBLP Author (Aver |OS| = 364) across the settings that produced global Importance: GA1-d1, GA2-d1, GA3-d1, GA1-d2, GA1-d3.]

53 Experimental Evaluation: Effectiveness of Size-l OS for Northwind
[Figure: Effectiveness versus l under settings GA1-d1, GA2-d1, GA3-d1, GA1-d2 and GA1-d3: (a) DBLP Author, (b) DBLP Paper, (c) Northwind Employee, (d) Northwind Order; (e) Size-15 OS and (f) Size-30 OS for Author, Paper, Employee and Order.]

54 Conclusions - Novel Contributions
- The formal definition of the novel Search Paradigm, which automatically produces OSs for a Data Subject: minimum contribution from the user (i.e. only a Kw); no prior knowledge of the DB schema or query language needed; excellent Precision, Recall and F-score results.
- The formal definition and quantification of Relations' Affinity in the context of G^DS, considering both Schema Design and Data distributions.
- Generation of Size-l OSs: efficient algorithms are proposed.

55 Preliminaries: ObjectRank
The ObjectRank of a node v_i can be calculated as:
$r = d A r + \frac{(1-d)}{|S|} s$
where A_ij = α(e) if there is an edge e = (v_i → v_j) in E^A_D and 0 otherwise, d controls the Base Set importance and s = [s_1, ..., s_n]^T is the Base Set vector for S.
[Figure: authority transfer schema graph over Conference, Year, Paper and Author, with edge labels including cites (0.7) and cited (0).]

56 Global Ranking of Tuples (Im(t_i)): ValueRank
The ValueRank of a node v_i can be calculated using the same formula:
$r = d A r + \frac{(1-d)}{|S|} s$
The s_i of a node v_i in S can be calculated with the formula: s_i = α + β·f(v_i)
The Authority Transfer Edges, either forward or backward, denoted as α(e), can be calculated with the formula: α(e) = γ + δ·f(v_i → v_j)
where α, β, γ and δ are tuning constants such that α+β ≤ 1 and γ+δ ≤ 1, and f(.) is a normalisation function of the values of v_i and v_j (in the range [0, 1], rather than just 1 as in the case of ObjectRank).
[Figure: authority transfer schema graph for Northwind over Territories (3), Region (4), Employees (9), Categories (8), Orders (830), OrderDetails (2155), Customers (91), Products (77), Shippers (3) and Suppliers (29), with Base Set and edge annotations based on f(UnitPrice*Quantity), f(Price*Quantity), f(Price) and f(Freights).]
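The ranking formula can be approximated by simple fixed-point iteration. The sketch below is an assumption-laden illustration rather than the author's implementation: edges are given as hypothetical `(source, target, alpha_e)` triples so that authority flows from source to target, |S| is taken as the number of entries in `base`, the tuning constants α = β = 0.5 are arbitrary, and the two-tuple graph is invented for the example.

```python
def base_score(alpha, beta, f_value):
    """ValueRank Base Set value s_i = alpha + beta * f(v_i), with f(v_i) in [0, 1]."""
    return alpha + beta * f_value


def value_rank(nodes, edges, base, d=0.85, iterations=100):
    """Iterate r = d*A*r + (1 - d)/|S| * s for a fixed number of rounds.

    edges: (source, target, alpha_e) triples; authority flows source -> target.
    base:  node -> s_i for the nodes in the Base Set S (assumed |S| = len(base)).
    """
    size_s = len(base)
    r = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - d) * base.get(n, 0.0) / size_s for n in nodes}
        for src, dst, alpha_e in edges:
            nxt[dst] += d * alpha_e * r[src]
        r = nxt
    return r


# Invented example: one order passes half of its authority to its customer.
nodes = ["customer", "order"]
edges = [("order", "customer", 0.5)]
base = {
    "customer": base_score(0.5, 0.5, 0.0),  # no value attribute: f = 0
    "order": base_score(0.5, 0.5, 1.0),     # highest normalised value: f = 1
}
ranks = value_rank(nodes, edges, base)
print({n: round(v, 4) for n, v in ranks.items()})
```

Setting β = 0 and δ = 0 makes every Base Set value and edge weight independent of the tuple values, which corresponds to the connectivity-only behaviour that the slides attribute to ObjectRank.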

57 Preliminary Evaluation: ValueRank v ObjectRank
[Table: Tuple ID, ObjectRank, ValueRank and Total Orders {UnitPrice*Quantity, Freight, Price} for two Employees, a Shipper, two Products, Customer SAVEA, Customer QUICK and two Suppliers.]
ObjectRank: connectivity. ValueRank: values + connectivity. Maximum values per relation are indicated in bold.

58 Local Ranking of Tuples (Im(OS, t_i))
The local Importance of each tuple t_i of an OS can be calculated with: Im(OS, t_i) = Im(t_i)^α * Af(t_i)^β
where Im(t_i) is the global Importance of t_i (e.g. its ValueRank or ObjectRank), Af(t_i) is the Affinity of t_i to t^DS, and α and β are tuning constants. The product of Im(t_i) with Af(t_i) reduces the Importance contribution of each tuple towards the overall Im(OS).

59 Inter-Relation Tuple Ranking: Summary of ValueRank of Northwind
[Table: Minimum, Median and Maximum ValueRank per Northwind Relation R_i: Employees, Territories, Region, Orders, Customer, Shipper, OrderDetails, Product, Supplier, Categories.]
The results are based on GA_northwind and d=0.85. The earlier work, ObjectRank, did not investigate inter-relation ranking of tuples in depth.

60 Inter-Relation Tuple Ranking: ValueRank v ObjectRank
[Table: Tuple ID, ValueRank, ObjectRank and Total Orders {UnitPrice*Quantity, Freight, Price} for Customer SAVEA (,673.4), Customer QUICK, two Shippers (,185.3), two Products (,984.2 and ,296.0), two Employees (,187.4) and two Suppliers.]
ObjectRank: connectivity. ValueRank: values + connectivity. Maximum values per relation are indicated in bold.

61 Af(R_i): Affinity of Relations to R^DS in G^DS
Columns: R_i | ld_i, RC_i, RRC_i, Co_i | m_1..m_4 | Af(R_i) with rank (r_Ri) for R^DS = Employees, Customer, Order, Shipper
Employees | R^DS | R^DS | (3) 0.97 (4) 0.82 (4)
Employees (ReportsTo) | 1, 1, 0.9, 4 | 1, 1, 1, | (5) 0.91 (5) 0.73 (5)
Employees (ReportedBy) | 1, 0.9, 1, 4 | 1, 1, 1, | (7) 0.85 (7) 0.66 (7)
Territories | 1, 5.4, 1, 2 | 1, 0.9, 1, | (10) 0.66 (10) 0.51 (10)
Region | 2, 1, 13.2, 1 | 0.9, 1, 0.88, | (11) 0.59 (11) 0.43 (11)
Order | 1, 92.2, 1, 4 | 1, 0.8, 1, | (1) 1 (R^DS) 0.89 (1)
Customer | 2, 1, 9.1, 2 | 0.9, 1, 0.9, | (R^DS) 0.99 (1) 0.83 (2)
Shipper | 2, 1, 276.6, 1 | 0.9, 1, 0.75, | (2) 0.98 (2) 1 (R^DS)
OrderDetails | 2, 2.5, 1, 2 | 0.9, 0.96, 1, | (4) 0.97 (3) 0.82 (3)
Product | 3, 1, 43.9, 4 | 0.8, 1, 0.83, | (6) 0.91 (6) 0.73 (6)
Supplier | 4, 1, 1.6, 1 | 0.7, 1, 0.9, | (8) 0.82 (8) 0.62 (8)
Categories | 4, 1, 6.1, 1 | 0.7, 1, 0.92, | (9) 0.81 (9) 0.61 (9)
CustDemographics | 3, null, null, 1 | 0.8, null, null, 1 | Null Null Null Null
