Overview of Data Exploration Techniques. Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri

Size: px

Start display at page:

Download "Overview of Data Exploration Techniques. Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri"

Edwin McDaniel
5 years ago
Views:

1 Overview of Data Exploration Techniques Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri

2 data exploration not always sure what we are looking for (until we find it) data has always been big volume velocity variety veracity

3 content structure user interaction middleware database kernel

4 user interaction visualization 45 min interfaces prefetching approximation sampling 45 min middleware adaptive[ indexing, loading, storage] kernel 45 min

5 Part 3* *for Part1 and 2 please look at the websites of the tutorial co-authors

6 applications sql algorithms/operators database kernel cpu caches memory gpu data data index disk

7 too many preparation options lead to complex installation schema storage load indexes query timeline expert users - idle time - workload knowledge

8 users/applications declarative interface ask what you want DBA db system

9 users/applications need to choose the proper system db administrator 1 db administrator 2 data system 1 data system 2

10 how can we prepare if we do not know what we are up against? (loading, indexing, storage, )

11 how can we prepare if we do not know what we are up against? (loading, indexing, storage, ) data systems kernels tailored for data exploration no preparation - easy to use - fast

12 adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

13 indexing load tune query tune= create proper indices offline performance X

14 indexing load tune query tune= create proper indices offline performance X but it depends on the workload! which indices to build? on which data parts? and when to build them?

15 big data V s volume velocity variety veracity not enough space to index all data what can go wrong? not enough idle time to finish proper tuning by the time we finish tuning, the workload changes not enough money - energy - resources

16 big data V s volume velocity variety veracity not enough space to index all data what can go wrong? not enough idle time to finish proper tuning by the time we finish tuning, the workload changes not enough money - energy - resources

17 database cracking

18 database cracking idle time workload knowledge external tools human control

19 database cracking auto-tuning database kernels incremental, adaptive, partial indexing idle time workload knowledge external tools human control

20 initialization querying indexing database cracking auto-tuning database kernels incremental, adaptive, partial indexing idle time workload knowledge external tools human control

21 initialization querying indexing database cracking auto-tuning database kernels incremental, adaptive, partial indexing idle time workload knowledge external tools human control

22 initialization querying indexing database cracking auto-tuning database kernels incremental, adaptive, partial indexing every query workload is treated external as an advice human idle time knowledge tools control on how data should be stored

23 column-store database a fixed-width and dense array per attribute relation1/table Database Cracking CIDR 2007

24 column-store database a fixed-width and dense array per attribute relation1/table Database Cracking CIDR 2007

25 column A Q1: select R.A from R where R.A>10 and R.A< Database Cracking CIDR 2007

26 column A Q1: select R.A from R where R.A>10 and R.A< Database Cracking CIDR 2007

27 column A Q1: select R.A from R where R.A>10 and R.A< sort Database Cracking CIDR 2007

28 column A Q1: select R.A from R where R.A>10 and R.A< sort binary search Database Cracking CIDR 2007

29 column A Q1: select R.A from R where R.A>10 and R.A< sort binary search result Database Cracking CIDR 2007

30 column A Q1: select R.A from R where R.A>10 and R.A< sort binary search result time + knowledge Database Cracking CIDR 2007

31 Q1: select R.A from R where R.A>10 and R.A<14 column A Database Cracking CIDR 2007

32 Q1: select R.A from R where R.A>10 and R.A<14 column A piece1: A<=10 Database Cracking CIDR 2007

33 Q1: select R.A from R where R.A>10 and R.A<14 column A piece1: A<=10 piece2: 10<A<14 Database Cracking CIDR 2007

34 Q1: select R.A from R where R.A>10 and R.A<14 column A piece1: A<=10 piece2: 10<A<14 piece3: A>=14 Database Cracking CIDR 2007

35 column A Q1: select R.A from R where R.A>10 and R.A< piece1: A<=10 piece2: 10<A<14 piece3: A>=14 result Database Cracking CIDR 2007

36 gain knowledge on how data is organized column A Q1: select R.A from R where R.A>10 and R.A< piece1: A<=10 piece2: 10<A<14 piece3: A>=14 result Database Cracking CIDR 2007

37 gain knowledge on how data is organized column A Q1: select R.A from R where R.A>10 and R.A< piece1: A<=10 piece2: 10<A<14 piece3: A>=14 result dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

38 Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 column A piece1: A<=10 piece2: 10<A<14 piece3: A>=14 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

39 Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 column A piece1: A<=10 piece2: 10<A<14 piece3: A>=14 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

40 column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<= piece1: A<=10 piece2: 10<A<14 piece3: A>= piece1: A<=7 piece2: 7<A<=10 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

41 column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<= piece1: A<=10 piece2: 10<A<14 piece3: A>= piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

42 column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<= piece1: A<=10 piece2: 10<A<14 piece3: A>= piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 piece4: 14<=A<=16 piece5: A>16 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

43 column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<= piece1: A<=10 piece2: 10<A<14 piece3: A>= piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 piece4: 14<=A<=16 piece5: A>16 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

44 the more we crack, the more we learn column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<= piece1: A<=10 piece2: 10<A<14 piece3: A>= piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 piece4: 14<=A<=16 piece5: A>16 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

45 Database Cracking CIDR 2007

46 select [15,55] Database Cracking CIDR 2007

47 select [15,55] Database Cracking CIDR 2007

48 select [15,55] Database Cracking CIDR 2007

49 select [15,55] select [15,55] Database Cracking CIDR 2007

50 select [15,55] select [15,55] Database Cracking CIDR 2007

51 touch at most two pieces at a time pieces become smaller and smaller select [15,55] select [15,55] Database Cracking CIDR 2007

52 continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column Response time (secs) Scan Crack 0.01 Full Index Query sequence (x1000) Database Cracking CIDR 2007

53 continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column almost no initialization overhead Response time (secs) Full Index Scan Crack Query sequence (x1000) Database Cracking CIDR 2007

54 continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column almost no initialization overhead Response time (secs) Full Index Scan Crack continuous improvement Query sequence (x1000) Database Cracking CIDR 2007

55 continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column almost no initialization overhead Response time (secs) Full Index Scan Crack continuous improvement Query sequence (x1000) Database Cracking CIDR 2007

56 continuous adaptation set-up 10K random selections selectivity 10% random value ranges in a 30 million integer column Cumulative average response time (secs) Full Index Scan Crack Query sequence Database Cracking CIDR 2007

57 continuous adaptation set-up 10K random selections selectivity 10% random value ranges in a 30 million integer column Cumulative average response time (secs) Full Index Scan Crack Query sequence Database Cracking CIDR 2007

58 continuous adaptation set-up 10K random selections selectivity 10% random value ranges in a 30 million integer column 10K queries later, Full Index still has not amortized the initialization costs Cumulative average response time (secs) Full Index Scan Crack Query sequence Database Cracking CIDR 2007

59 A table1

60 table

61 select R.A from R where R.A>10 and R.A<14 table

62 select R.A from R where R.A>10 and R.A<14 table1 select max(r.a),max(r.b),max(s.a),max(s.b) from R,S where v1 <R.C<v2 and v3 <R.D<v4 and v5 <R.E<v6 and k1 <S.C<k2 and k3 <S.D<k4 and k5 <S.E<k and R.F = S.F......

63 select R.A from R where R.A>10 and R.A<14 table1 select max(r.a),max(r.b),max(s.a),max(s.b) from R,S where v1 <R.C<v2 and v3 <R.D<v4 and v5 <R.E<v6 and k1 <S.C<k2 and k3 <S.D<k4 and k5 <S.E<k and R.F = S.F updates... joins concurrency control......

64 basics (CIDR07) cracking databases updates (SIGMOD07) >1 columns >1 columns (SIGMOD09) storagerestrictions (SIGMOD09) robustness robustne (PVLDB12) algorithms (PVLDB11) benchmarking (TPCTC10) concurrency control (PVLDB12) adaptive storage (SIGMOD14) time-series (SIGMOD14) hadoop (Yale/Saarland) multi-cores (SIGMOD15) b-trees (HP Labs)

65 cracking tangram base data table 1 as queries arrive... table 2

66 cracking tangram base data table 1 as queries arrive... table 1 table 2 table 2

67 cracking tangram base data table 1 as queries arrive... table 1 partial materialization table 2 table 2

68 cracking tangram base data table 1 as queries arrive... table 1 partial materialization partial indexing table 2 table 2

69 cracking tangram base data table 1 as queries arrive... table 1 partial materialization partial indexing continuous adaptation table 2 table 2

70 cracking tangram base data table 1 as queries arrive... table 1 x partial materialization partial indexing continuous adaptation storage adaptation table 2 table 2 x x

71 cracking tangram base data table 1 as queries arrive... table 1 partial materialization partial indexing continuous adaptation storage adaptation table 2 table 2

72 cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction

73 cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment

74 cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment

75 cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches

76 cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches crack joins

77 cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 q1 q2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches crack joins lightweight locking

78 cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 query random partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches crack joins lightweight locking stochastic cracking

79 adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

80 loading load tune query copy data inside the database database now has full control slow process not all data might be needed all the time Adaptive Loading, CIDR 11

81 database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk Adaptive Loading, CIDR 11

82 database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk break down db cost Adaptive Loading, CIDR 11

83 database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk break down db cost loading is a major bottleneck Adaptive Loading, CIDR 11

84 database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk break down db cost loading is a major bottleneck but writing/maintaining scripts does not scale Adaptive Loading, CIDR 11

85 adaptive loading load/touch only what is needed and only when it is needed Adaptive Loading, CIDR 11

86 but raw data access is expensive tokenizing - parsing - no indexing - no statistics challenge: fast raw data access Adaptive Loading, CIDR 11

87 query plan Adaptive Loading, CIDR 11

88 query plan scan Adaptive Loading, CIDR 11

89 query plan scan db Adaptive Loading, CIDR 11

90 query plan scan loading Adaptive Loading, CIDR 11

91 query plan scan files loading access raw data adaptively on-the-fly Adaptive Loading, CIDR 11

92 query plan scan files access raw data adaptively on-the-fly cache loading Adaptive Loading, CIDR 11

93 selective parsing file indexing file splitting online statistics query plan scan files access raw data adaptively on-the-fly cache loading Adaptive Loading, CIDR 11

94 three graphs from paper NoDB, SIGMOD 2012

95 three graphs from paper NoDB, SIGMOD 2012

96 three graphs from paper NoDB, SIGMOD 2012

97 three graphs from paper NoDB, SIGMOD 2012

98 three graphs from paper NoDB, SIGMOD 2012

99 reducing data-to-query time three graphs from paper NoDB, SIGMOD 2012

100 adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

101 rows & columns row-store column-store H20, SIGMOD 14

102 no fixed optimal solution Execution Time (sec) DBMS-C DBMS-R Attributes Accessed (%) H20, SIGMOD 14

103 rows & columns row-store column-store H20, SIGMOD 14

104 rows & columns row-store column-store hybrid-store H20, SIGMOD 14

105 which layout is best? H20, SIGMOD 14

106 which layout is best? H20, SIGMOD 14

107 which layout is best? H20, SIGMOD 14

108 which layout is best? too many combinations to maintain in parallel H20, SIGMOD 14

109 query cost q(l)= L Â i=1 max(costi IO,costi CPU ) for a given query we can know which layout is best the one that will cause the fewer cache misses H20, SIGMOD 14

110 if we know all queries up front we can choose the layouts adaptive storage: continuously adapt layouts based on incoming queries H20, SIGMOD 14

111 but computing all possible combinations is expensive query select A+B+C+D from R where A<10 and E>10 1. deal only with attributes referenced in queries 2. handle select clause separately from where clause 3. start from pure column-store and build up 4. stop when no improvement possible H20, SIGMOD 14

112 Query&Response&Time&(sec)& 8" 7" 6" 5" 4" 3" 2" 1" 0" Row.store" Column.store" Best.Hybrid" H2O" 0" 5" 10" 15" 20" 25" 30" 35" 40" 45" 50" Query&Sequence& H20, SIGMOD 14

113 adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

114 querying load tune query SQL interface correct and complete answers dbtouch, CIDR 2013

115 querying load tune query complex and slow - not fit for exploration SQL interface correct and complete answers dbtouch, CIDR 2013

116 just touch the data you need dbtouch, CIDR 2013

117 just touch the data you need this is not about query building it is about query processing dbtouch, CIDR 2013

118 cracking example dbtouch CIDR 2013

119 what does this mean for db kernels? dbtouch, CIDR 2013

120 db select R.a from R what does this mean for db kernels? dbtouch, CIDR 2013

121 db select R.a from R what does this mean for db kernels? dbtouch process only what you touch dbtouch, CIDR 2013

122 from touch to query processing touch location (x) v size(width) row ID=(tuples*x)/size storage dbtouch, CIDR 2013

123 select avg(r.c) where R.a=S.b and S.b<20 S.b R.c R.a avg(r.c) R.a=S.b σ(<20) S.b R.a dbtouch, CIDR 2013

124 select avg(r.c) where R.a=S.b and S.b<20 S.b R.c R.a avg(r.c) R.a=S.b σ(<20) S.b R.a dbtouch, CIDR 2013

125 select avg(r.c) where R.a=S.b and S.b<20 S.b R.c R.a avg(r.c) R.a=S.b σ(<20) S.b R.a dbtouch, CIDR 2013

126 visual objects samples hierarchy dbtouch, CIDR 2013

127 visual objects samples hierarchy base data initial sample dbtouch, CIDR 2013

128 visual objects samples hierarchy base data initial sample create incrementally and on demand dbtouch, CIDR 2013

129 dbtouch, CIDR 2013

130 dbtouch, CIDR 2013

131 adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

132 sampling outside the engine application sampling logic/query rewrite Kernel Aqua, VLDB 1999 BlinkDB, Eurosys 2013

133 sampling with SciBORG queries data Kernel maintain hierarchy of biased samples SciBORG, CIDR 2011

134 sampling with SciBORG impression 1 impression 2 impression 3 base column continuously reorganized based on the workload SciBORG, CIDR 2011

135 adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

136 building systems declaratively abstraction performance vision: being able to define system components in a higher level language without significant performance penalty RodentStore, CIDR 2009 Abstraction without regrets, IEEE Data Engin. Bulletin/PVLDB 2014

137 data systems today allow us to answer queries fast db data systems for exploration should allow us to find fast which queries to ask db explore + approximate processing techniques

138 data systems today allow us to answer queries fast db data systems for exploration should allow us to find fast which queries to ask db explore + approximate processing techniques thank you!

Data Systems that are Easy to Design, Tune and Use. Stratos Idreos

Data Systems that are Easy to Design, Tune and Use data systems that are easy to: (years) (months) design & build set-up & tune (hours/days) use e.g., adapt to new applications, new hardware, spin off