Database Learning: Toward a Database that Becomes Smarter Over Time

1 Database Learning: Toward a Database that Becomes Smarter Over Time. Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, Barzan Mozafari. University of Michigan, Ann Arbor

7 Today's databases: Users send a query to the database and receive the answer to the query. After answering queries, THE WORK is GONE. Our goal: reuse the work.

10 Our high-level approach (AQP engine alone): Users send a query Q to the AQP engine, which returns an approximate answer A (10% err, 1 sec).

18 Our high-level approach (with database learning): Users send Q to the database learning layer, which forwards Q to the AQP engine. The AQP engine returns A (10% err); database learning combines it with the query synopsis and returns an improved answer Â (2% err) to the users. [Plot: Error (%) vs. Time (sec) for the AQP engine and database learning]

23 Technical challenges: Queries use the data in different columns/rows. How can we leverage those queries for future queries?

31 Our idea [diagram build]: Q1 and its answer (Q1, A1), then Q2 and its answer (Q2, A2), then more queries and answers.

37 Concrete example [Figure: SUM(count), 20M to 40M, vs. Week Number. Legend: true data; ranges observed by past queries; model (with 95% confidence interval). The plot is repeated in three panels.]

40 Design goals
Example queries:
select X3, avg(y1) from t where 5 < X1 < 8;
select sum(y2) from t where X2 between Apr and May group by X3;
1. Support a wide class of SQL queries
2. No assumptions about data
3. Lightweight [chart: latency, BlinkDB vs. DBL]

41 Our Approach

44 Problem statement
Problem: Given past queries (q_1, ..., q_n), a new query (q_{n+1}), and their approximate answers, find the most likely answer to the new query (q_{n+1}) and its estimated error.
Our result: Under a certain model assumption, our answer's error bound ≤ the original answer's error bound (in practice, much more accurate), if the error bounds provide the same probabilistic guarantees.
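
One way to see why a bound of this kind is plausible is a scalar normal model. This is only an illustrative sketch, not the derivation from the talk: the symbols μ and σ² (what past queries suggest about the new answer), a (the raw approximate answer), and β² (its error variance) are assumptions introduced here.

```latex
\begin{align*}
\theta_{n+1} &\sim \mathcal{N}(\mu, \sigma^2), \qquad
a \mid \theta_{n+1} \sim \mathcal{N}(\theta_{n+1}, \beta^2) \\
\mathbb{E}[\theta_{n+1} \mid a] &= \frac{\beta^2 \mu + \sigma^2 a}{\sigma^2 + \beta^2} \\
\operatorname{Var}[\theta_{n+1} \mid a] &= \frac{\sigma^2 \beta^2}{\sigma^2 + \beta^2} \;\le\; \beta^2
\end{align*}
```

Under these assumptions the combined estimate is never less certain than the raw answer alone, and it becomes much tighter when past queries are informative (small σ²).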

50 Overview of our technique
Past queries:
select count(y2) from t where 1 < X1 < 2;
select avg(y2) from t where 6 < X1 < 8;
New query:
select sum(y2) from t where 5 < X1 < 8;
1. Random variables (our uncertainty on answers): θ_1, θ_2, θ_3
2. Probability distribution Pr(θ_1, θ_2, θ_3): two aggregations that involve common values imply a correlation between answers
3. Estimated answer: Pr(θ_3 | θ_1, θ_2)

56 How to define random variables
We define a random variable θ for every combination of aggregate function and selection predicates.
Example: select sum(y2) from t where 5 < X1 < 8;
What if your query is complex?
select X3, avg(y1), sum(y2) from t where 5 < X1 < 8 and X2 between Apr and May group by X3;
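
The "one random variable per (aggregate function, selection predicates) combination" rule can be sketched as a hashable key. This is a minimal illustration, not Verdict's actual data structure: the class names `RangePredicate` and `AnswerVariable` and the helper `make_variable` are hypothetical, and only simple range predicates are modeled.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RangePredicate:
    """Hypothetical range predicate meaning: low < column < high."""
    column: str
    low: float
    high: float


@dataclass(frozen=True)
class AnswerVariable:
    """Hypothetical key identifying one random variable (one uncertain answer)."""
    aggregate: str
    predicates: Tuple[RangePredicate, ...]


def make_variable(aggregate: str, predicates: Iterable[RangePredicate]) -> AnswerVariable:
    # Sort predicates so their order in the WHERE clause does not matter,
    # and normalize the aggregate's case so "SUM(y2)" and "sum(y2)" match.
    ordered = tuple(sorted(predicates, key=lambda p: (p.column, p.low, p.high)))
    return AnswerVariable(aggregate.strip().lower(), ordered)


# The avg and sum queries from the slides map to two distinct variables;
# repeating the sum query maps to the same variable again.
theta_avg = make_variable("avg(y2)", [RangePredicate("X1", 6, 8)])
theta_sum = make_variable("sum(y2)", [RangePredicate("X1", 5, 8)])
theta_sum_again = make_variable("SUM(y2)", [RangePredicate("X1", 5, 8)])

print(theta_sum == theta_sum_again)
print(theta_sum == theta_avg)
```

Because the key is frozen and hashable, it can index a dictionary of past answers, which is one plausible way to look up which earlier queries a new query can learn from.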

64 How to determine the probability distribution
The Principle of Maximum Entropy (ME): statistical info of (θ_1, θ_2, θ_3) → most-likely Pr(θ_1, θ_2, θ_3)
Low amount of info: simple Pr, fast inference, low fidelity
High amount of info: complex Pr, slow inference, high fidelity
Our choice: (co)variances between pairs of answers

68 Most-likely probability distribution
Statistical information on θ_1, θ_2, θ_3: means, variances, covariances
MaxEnt → multivariate normal distribution
Fast inference using a closed form
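
The closed form referred to here is, for a multivariate normal, the standard conditioning formula. The sketch below shows that formula with NumPy; the function name `condition_gaussian` and the example means and covariances are made up for illustration and are not taken from the talk.

```python
import numpy as np


def condition_gaussian(mean, cov, observed_idx, observed_values):
    """Condition a multivariate normal on some of its entries.

    Returns the mean vector and covariance matrix of the unobserved
    entries given the observed ones (standard closed-form result).
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    observed_values = np.asarray(observed_values, dtype=float)

    obs = list(observed_idx)
    unobs = [i for i in range(mean.shape[0]) if i not in set(obs)]

    cov_uu = cov[np.ix_(unobs, unobs)]
    cov_uo = cov[np.ix_(unobs, obs)]
    cov_oo = cov[np.ix_(obs, obs)]

    # gain = cov_uo @ inv(cov_oo), computed with a linear solve
    # (valid because cov_oo is symmetric).
    gain = np.linalg.solve(cov_oo, cov_uo.T).T

    cond_mean = mean[unobs] + gain @ (observed_values - mean[obs])
    cond_cov = cov_uu - gain @ cov_uo.T
    return cond_mean, cond_cov


# Made-up numbers: theta_3 is correlated with theta_1 and theta_2.
prior_mean = [0.0, 0.0, 0.0]
prior_cov = [[1.0, 0.0, 0.5],
             [0.0, 1.0, 0.5],
             [0.5, 0.5, 1.0]]

# Past answers theta_1 = 1 and theta_2 = 1 have been observed.
cond_mean, cond_cov = condition_gaussian(prior_mean, prior_cov, [0, 1], [1.0, 1.0])
print(cond_mean, cond_cov)
```

With these example numbers the variance of θ_3 drops from 1.0 to 0.5 once θ_1 and θ_2 are known, which is the sense in which correlated past answers sharpen a new one.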

73 Benefits of database learning
Database learning vs. indexing:
1. Little storage overhead [chart: storage vs. database size, Indexing vs. DBL]
Database learning vs. materialized view:
2. Without alignment [chart axis: date]
3. No upfront overhead [chart: overhead vs. system uptime, DBL vs. view selection]

74 Experiment

77 Experiment setup
1. Two systems:
   NoLearn: approximate query processing engine (the longer the runtime, the more accurate the answer)
   Verdict: our database learning system (on top of NoLearn)
2. Datasets:
   Customer1: 536GB data and query log from a customer
   TPC-H: 100GB TPC-H dataset
3. Environment:
   5 Amazon EC2 workers (m4.2xlarge) + 1 master
   SSD-backed HDFS for Spark's data loading

80 Our experimental claims
1. Verdict supports a large portion of real-world queries
2. Verdict achieves speedup compared to NoLearn
3. Verdict works with small memory and computational overhead

81 Generality of Verdict
Dataset     # Analyzed   # Supported   Percentage
Customer1   3,342        2,            %
TPC-H                                  %
Unsupported queries:
1. Nested queries (that cannot be flattened)
2. Textual filters: city like '%arbor%'

83 Runtime-error trade-off
Results on the TPC-H dataset (the paper has the Customer1 results). Number of past queries fixed to 50.
[Figure: error bound (%) vs. runtime for NoLearn and Verdict. (a) Data in memory, runtime in sec; (b) data on SSD, runtime in min]

85 Speedup
Results on the Customer1 dataset (the paper has the TPC-H results).
[Figure: speedup (x) vs. target error bound. (a) Data in memory; (b) data on SSD]

89 Memory and computational overhead
1. Memory overhead: queries and their answers, some matrices and their inverses
   232 KB per query for the Customer1 dataset
   158 KB per query for the TPC-H dataset
2. Computational overhead:
              Latency for memory    Latency for SSD
   NoLearn    2.083 sec             52.50 sec
   Verdict    2.093 sec             52.51 sec
   Overhead   0.010 sec (0.48%)     0.010 sec (0.02%)
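
The overhead percentages can be checked with simple arithmetic. The sketch below assumes the latencies read as 2.083 s and 2.093 s (memory) and 52.50 s and 52.51 s (SSD); that placement of the decimal points is an assumption, chosen because it is the reading consistent with the stated 0.010 s overhead and its percentages. The helper name `overhead` is hypothetical.

```python
def overhead(no_learn_sec: float, verdict_sec: float) -> tuple:
    """Return (extra seconds, extra time as a percentage of NoLearn's latency)."""
    extra = verdict_sec - no_learn_sec
    return extra, 100.0 * extra / no_learn_sec


# Assumed readings of the latencies on the slide (see the note above).
memory_extra, memory_pct = overhead(2.083, 2.093)
ssd_extra, ssd_pct = overhead(52.50, 52.51)

print(f"memory: +{memory_extra:.3f} sec ({memory_pct:.2f}%)")
print(f"ssd:    +{ssd_extra:.3f} sec ({ssd_pct:.2f}%)")
```

Under this reading, the added latency is about one hundredth of a second in both settings, so the relative overhead shrinks as the base query gets slower.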

90 Thank You!
