Outline. The History of Histograms. Yannis Ioannidis University of Athens, Hellas

1 The History of Histograms. Yannis Ioannidis, University of Athens, Hellas. Outline: Prehistory; Definitions and Framework; The Early Past; 10 Years Ago; The Recent Past; Industry; Competitors; The Future.

2 Prehistory. The word 'histogram' is of Greek origin: 'histo-s' = 'mast', 'gram-ma' = 'something written'. Not used originally in the Greek language! Introduced by Karl Pearson in 1892 for a common form of graphical representation.

3 Prehistory. 1662: Concept exists at least since then, in the mortality tables of J. Graunt. 1786: Bar charts introduced by W. Playfair to capture Scottish imports/exports. 1833: Histograms introduced by A. M. Guerry as discrete approximations to distribution functions. 1859: Florence Nightingale used them to compare mortality of soldiers and civilians.

4 Prehistory: Playfair's bar chart. Definitions: Data Distributions. One-dimensional data distribution = set of (attribute value, frequency) pairs. Large and non-uniform distributions need compression and approximation. Concentrate on numeric attributes.

5 Definitions: Data Distributions. [Figure: frequency, area, and spread.] Definitions: Data Distributions. Combinations of multiple attribute values have a joint frequency. Multidimensional data distribution = set of (value combination, joint frequency) pairs.

6 Definitions: Multidimensional Data Distributions. Motivation: selectivity estimation and approximate query answering within query optimization, query profiling for user feedback, load balancing for parallel join execution, and partition-based temporal join execution.

7 Definitions: Histograms. Partition the data distribution into β disjoint buckets. Approximate the values (value combinations) and frequencies within each bucket. [Figure: Histograms; axis: Freq.]

8 [Figure: Histograms; axis: Freq; bucket 1, bucket 2.] Framework: Histogram Parameters. Partition rule, with 4 orthogonal parameters: partition class, sort parameter, partition constraint, source parameter. Construction algorithm.

9 Framework: Histogram Parameters. Value approximation within bucket. Frequency approximation within bucket. Error guarantees. Framework: Partition Class. Indicates restrictions on partitioning. Serial: non-overlapping ranges of sort parameter values. End-biased: at most one non-singleton bucket.

10 Framework: Sort Parameter. Derivative of a data distribution element (its value and/or frequency): attribute values (V), frequencies (F), areas (A) = spread × frequency. Serial: buckets must contain contiguous sort parameter values. [Figure: Partition Class and Sort Parameter; columns VALUE, FREQUENCY.]
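The parameters named on these slides can be sketched in a few lines of Python. This is only an illustration: the function name `data_distribution` is hypothetical, and the spread of a value is assumed to be the distance to the next larger value, with the last spread set to 1 by convention (the slides do not spell this out).

```python
from collections import Counter

def data_distribution(column):
    """Return one record per distinct value: value, frequency, spread, area."""
    counts = Counter(column)
    values = sorted(counts)
    dist = []
    for i, v in enumerate(values):
        # Assumption: spread = gap to the next distinct value; 1 for the last value.
        spread = values[i + 1] - v if i + 1 < len(values) else 1
        freq = counts[v]
        dist.append({"value": v, "freq": freq, "spread": spread,
                     "area": spread * freq})  # area (A) = spread x frequency
    return dist

for row in data_distribution([1, 1, 3, 7, 7, 7]):
    print(row)
```

Any of the columns produced here (value, frequency, area) could then serve as the sort parameter of a serial histogram.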

11 [Figures: Partition Class and Sort Parameter; columns VALUE, FREQUENCY, SORT PAR; buckets B1 to B4.]

12 [Figure: Partition Class and Sort Parameter; columns VALUE, FREQUENCY, SORT PAR; buckets B1 to B4.] Framework: Source Parameter. Derivative of a data distribution element (its value and/or frequency): spreads (S), frequencies (F), cumulative frequencies (C), areas (A). The partition constraint is applied on the source parameter.

13 Framework: Partition Constraint. Mathematical constraint on the source parameter that the partitioning must satisfy. General direction: avoid grouping vastly different source parameter values. Framework: Partition Constraint. Equi-sum: equalize sums. V-optimal: minimize variance. Max-diff: minimize maximum difference of adjacent source values. Compressed: preserve high source values and equalize sums of the rest. Spline-based: minimize square root of error.
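As an illustration of the compressed constraint, the sketch below keeps the highest-frequency values in singleton buckets and splits the rest into buckets of roughly equal frequency sum. It is a simplification under stated assumptions: the function name and its parameters are hypothetical, the number of singletons is fixed by the caller rather than derived from a threshold, and the equalization is a simple greedy pass.

```python
def compressed(dist, n_singleton, n_regular):
    """dist: list of (value, freq) pairs sorted by value, values distinct."""
    # Preserve the highest source (frequency) values as singleton buckets.
    top = sorted(dist, key=lambda p: -p[1])[:n_singleton]
    singles = sorted(top)
    rest = [p for p in dist if p not in top]

    # Equalize frequency sums over the remaining values (greedy equi-sum).
    total = sum(f for _, f in rest)
    buckets, current, running = [], [], 0
    for v, f in rest:
        current.append((v, f))
        running += f
        reached_share = running * n_regular >= total * (len(buckets) + 1)
        if reached_share and len(buckets) < n_regular - 1:
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return singles, buckets

dist = [(1, 5), (2, 100), (3, 5), (4, 5), (5, 80), (6, 5)]
singles, buckets = compressed(dist, n_singleton=2, n_regular=2)
print(singles)
print(buckets)
```

Here the two dominant values end up stored exactly, and the four low-frequency values are split into two buckets of equal total frequency.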

14 [Figure: Partition Constraint and Source Parameter; columns VALUE, FREQ, SORT PAR, SOURCE PAR; buckets B1 to B4.] Framework: Histogram Parameters. Notation: class : constraint (sort, source). Special notation for the serial partition class: constraint (sort, source).

15 Framework: Histogram Parameters. Same parameters for multidimensional histograms. The partition rule is more intricate: not always analyzable into 4 orthogonal parameters. Often there is no sort parameter. The Early Past: Dark Ages. Essentially, use of 1-bucket histograms. Large errors.

16 The Early Past: First Appearance. Kooi's PhD thesis: equi-width histograms. equi-width = equi-sum (V, S). Adopted by INGRES. [Figure: First Appearance; axis: Freq.]
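A minimal sketch of an equi-width histogram follows, assuming a numeric column held in a Python list. The function name `equi_width` and the returned `(edges, counts)` layout are illustrative choices, not taken from the talk.

```python
def equi_width(column, beta):
    """Split [min, max] into beta ranges of equal width and count tuples in each."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / beta
    counts = [0] * beta
    for x in column:
        # The maximum value falls on the last edge, so clamp it into the last bucket.
        idx = min(int((x - lo) / width), beta - 1) if width > 0 else 0
        counts[idx] += 1
    edges = [lo + i * width for i in range(beta + 1)]
    return edges, counts

edges, counts = equi_width([0, 1, 2, 3, 4, 5, 6, 7, 8, 100], beta=4)
print(edges, counts)
```

With one outlier at 100, nine of the ten tuples land in the first bucket, which shows why equal value ranges can approximate skewed data poorly.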

17 [Figure: First Appearance; axis: Freq.] The Early Past: First Alternative. Don't equalize the ranges of values but the number of tuples in each bucket: equi-depth histograms. equi-depth = equi-sum (V, F). The source parameter is the only difference. Adopted by several commercial systems.
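For comparison, an equi-depth histogram can be sketched as below. The function name `equi_depth` and the `(low, high, count)` bucket layout are assumptions for illustration, and this simple version lets duplicates of one value straddle two buckets, which a production implementation would likely avoid.

```python
def equi_depth(column, beta):
    """Split the sorted column into beta buckets holding about the same number of tuples."""
    data = sorted(column)
    n = len(data)
    buckets = []
    for i in range(beta):
        chunk = data[i * n // beta:(i + 1) * n // beta]
        if chunk:
            # Store the value range covered and the tuple count.
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

print(equi_depth([0, 1, 2, 3, 4, 5, 6, 7, 8, 100], beta=2))
```

On the same skewed column used for equi-width, each bucket now holds five tuples; the outlier only stretches the value range of the last bucket.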

18 [Figure: First Alternative; axis: Freq.] The Early Past: Optimal Sort Parameter. Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal. Generalization of the practice of keeping high-frequency values accurately.

19 [Figure: Optimal Sort Parameters; axis: Freq.] 10 Years Ago. Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal.

20 The Recent Past. Optimal partition constraints and source parameters? Optimality when values are not known accurately? Optimal values of other histogram characteristics? The Recent Past: Optimal Constraint and Source. Theorem: For the average join query and accurate knowledge of values, v-optimal histograms with frequency as source parameter are optimal. v-optimal (F, F). v-optimal: minimize variance of source values.
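One way to compute a v-optimal partition is a dynamic program over contiguous runs of source values; the sketch below is such a program, not necessarily the construction algorithm the talk has in mind. The function name is hypothetical, the objective is assumed to be the sum over buckets of squared deviations from the bucket mean (variance weighted by bucket size), and `beta` must not exceed the number of source values. For v-optimal (F, F), the caller would pass the frequencies sorted by frequency.

```python
def v_optimal(source, beta):
    """Return (total squared error, [(start, end), ...]) for the best beta contiguous buckets."""
    n = len(source)
    # Prefix sums of values and squared values give each bucket's error in O(1).
    p, pp = [0.0], [0.0]
    for x in source:
        p.append(p[-1] + x)
        pp.append(pp[-1] + x * x)

    def sse(i, j):  # squared error of source[i:j] around its mean
        s = p[j] - p[i]
        return (pp[j] - pp[i]) - s * s / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(beta + 1)]
    cut = [[0] * (n + 1) for _ in range(beta + 1)]
    best[0][0] = 0.0
    for k in range(1, beta + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + sse(i, j)
                if cost < best[k][j]:
                    best[k][j], cut[k][j] = cost, i

    # Walk the recorded cut points backwards to recover bucket boundaries.
    bounds, j = [], n
    for k in range(beta, 0, -1):
        i = cut[k][j]
        bounds.append((i, j))
        j = i
    return best[beta][n], bounds[::-1]

print(v_optimal([1, 1, 1, 10, 10, 50], beta=3))
```

The triple loop costs O(n² β) time, which is acceptable for a sketch but is one reason cheaper constraints are attractive in practice.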

21 [Figure: Optimal Constraint and Source; axis: Freq.] The Recent Past. If values are not known accurately, there is no optimality result on any histogram characteristic. Several experimental results identify key choices.

22 The Recent Past: New Partition Constraints. All try to group similar source values. max-diff: bucket borders at the highest differences of adjacent source values. compressed: preserve high values of the source and equalize sums of the rest. [Figure: max-diff; axis: Freq.]
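The max-diff rule is simple enough to sketch directly. The version below assumes the source values are already arranged in sort-parameter order; the function name `max_diff` and the tie-breaking rule (earlier position wins) are illustrative choices.

```python
def max_diff(source, beta):
    """Place beta - 1 bucket borders at the largest gaps between adjacent source values."""
    # Each candidate border sits between positions i and i + 1.
    diffs = [(abs(source[i + 1] - source[i]), i + 1) for i in range(len(source) - 1)]
    largest = sorted(diffs, key=lambda d: (-d[0], d[1]))[:beta - 1]
    borders = sorted(pos for _, pos in largest)
    edges = [0] + borders + [len(source)]
    return [source[a:b] for a, b in zip(edges, edges[1:])]

print(max_diff([2, 3, 2, 9, 10, 30], beta=3))
```

Only adjacent differences are inspected, so construction is a sort over n - 1 gaps rather than a search over all partitions.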

23 [Figure: compressed; axis: Freq.] The Recent Past: Alternative Partition Constraints. Variations on the optimal knot placement problem. Linear splines only. Discontinuous across bucket boundaries.

24 The Recent Past: New Sort and Source Parameters. Choices: attribute values (V), spreads (S), frequencies (F), areas (A), cumulative frequencies (C). Value is the best sort parameter overall. Area and frequency are the best source parameters overall. The Recent Past: Multidimensional Partition Rules. The multidimensional value domain cannot be sorted to serve as sort parameter. Many alternatives exist to partition the space of values into buckets. Although possible, frequency has not been used as sort parameter.

25 The Recent Past: Multidimensional Partition Class. À la Grid File. À la K-D-B-Tree (MHIST). GENHIST. STHoles. [Figure: Multidimensional Data Distributions.]

26 [Figures: M-D Partition Class: Grid File; M-D Partition Class: MHIST.]

27 [Figures: M-D Partition Class: GENHIST.]

28 [Figures: M-D Partition Class: GENHIST; M-D Partition Class: STHoles.]

29 The Recent Past: Histogram Framework. Partition rule: partition class, sort parameter, partition constraint, source parameter. Construction algorithm. Value and frequency approximation. Error guarantees. The Recent Past: Value Approximation. Continuous value assumption: (min and) max value. Uniform spread assumption: above + number of unique values. Popularity-based spread: above with fake number of unique values. Kernel estimation.
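The uniform spread assumption can be illustrated as follows: from a bucket's min, max, and number of unique values, the values are reconstructed as equally spaced points. The function name `uniform_spread_values` is hypothetical, and placing the first and last points exactly on min and max is an assumption of this sketch.

```python
def uniform_spread_values(lo, hi, n_distinct):
    """Approximate a bucket's distinct values as n_distinct equally spaced points."""
    if n_distinct == 1:
        return [lo]
    step = (hi - lo) / (n_distinct - 1)  # the common spread between neighbours
    return [lo + i * step for i in range(n_distinct)]

print(uniform_spread_values(10, 30, 5))
```

Under the continuous value assumption, by contrast, only min and max would be stored and every value in between would be treated as present.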

30 [Figures: Value Approximation; axis: Freq; range: min to max.]

31 The Recent Past: Value Approximation. All generalized to the multidimensional case. Tradeoff between the number of buckets and the information kept within each bucket. The Recent Past: Frequency Approximation. Uniform distribution assumption: average frequency. Linear spline approximation: above + spline's angle.
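Combining the continuous value assumption with the uniform distribution assumption gives the familiar range estimate: each bucket contributes its tuple count scaled by the fraction of its value range that the predicate covers. The sketch below assumes buckets stored as `(low, high, count)` triples; that layout and the function name `estimate_range` are illustrative.

```python
def estimate_range(buckets, a, b):
    """Estimate the number of tuples with a <= value <= b."""
    est = 0.0
    for lo, hi, count in buckets:
        if hi == lo:
            # Singleton bucket: either fully inside the range or not at all.
            est += count if a <= lo <= b else 0
            continue
        overlap = min(b, hi) - max(a, lo)
        if overlap > 0:
            # Uniform assumption: tuples are spread evenly over [lo, hi].
            est += count * overlap / (hi - lo)
    return est

print(estimate_range([(0, 10, 100), (10, 20, 50)], a=5, b=15))
```

Dividing the estimate by the table's total tuple count yields the selectivity that a query optimizer would use.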

32 [Figure: Frequency Approximation; axis: Freq.] Industrial Presence. Only 1-dimensional histograms. 1970s: trivial histograms (1 bucket). 1980s: equi-width histograms. 1990s: equi-depth histograms. 2000s:

33 Industrial Presence: DB2. compressed (V, F). Default of 10 singleton and 20 non-singleton buckets. Store cumulative frequencies. Construction based on reservoir sample. Indices used to quantify dependencies. LEO learning is key. Industrial Presence: ORACLE. equi-depth = equi-sum (V, F). Indices used to quantify dependencies. On-the-fly dependence estimation. Past selectivities stored for future use.

34 Industrial Presence: SQL Server. max-diff (V, F). Up to 199 buckets. Store cumulative frequencies. Store frequency of max accurately. Construction based on sample. Indices used to quantify dependencies. Histogram Competitors. Wavelets. Sampling (usually complementary). Specialized techniques.

35 The Future. Histograms and clustering. Bucket recognition and representation. Histograms and tree indices approximation. Comprehensive technique comparison. Other data types. The Future: Histograms and Clustering. Clustering is an identical problem! Grouping of similar elements into buckets (bucket = cluster = pattern). Small approximation within bucket. Multidimensional elements are attribute value combinations above + frequency.

36 [Figures: Histograms and Clustering; axis: Freq.]

37 [Figure: Histograms and Clustering; axis: Freq.] The Future: Histograms and Clustering. Very different techniques. Apply on one problem the techniques developed for the other: partition rules, construction algorithms, approximate representations within bucket.

38 The Future: Bucket Recognition and Representation. Essence of histograms or clustering: identify groups of similar elements. Similarity on few characteristics (source). Store an approximation of these characteristics. Which are the similar characteristics? [Pattern Recognition] The Future: Bucket Recognition and Representation. Maybe not the original element dimensions. Maybe not the same for all groups.

39 [Figures: Bucket Recognition and Representation; axis: Freq.]

40 [Figures: Bucket Recognition and Representation; axis: Freq.]

41 The Future: Bucket Recognition and Representation. Not clustering in the value-frequency space, but in the spread-frequency space. Why the difference in treatment? Is this always better? How can we recognize the winner? [Figure: Bucket Recognition and Representation; axis: Freq.]

42 [Figures: Bucket Recognition and Representation; axis: Freq.]

43 [Figure: Bucket Recognition and Representation; axis: Freq.] The Future: Histograms and Tree Indices. The root of the B+ tree partitions the space of values into non-overlapping buckets. Each bucket is further subdivided into smaller buckets. Appropriate info next to each bucket turns each node into a histogram. The entire B+ tree becomes a hierarchical histogram.
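The hierarchical-histogram idea can be sketched with a toy tree in which every bucket carries a tuple count and, optionally, a child node that subdivides it. This is only an illustration under assumptions: the `Bucket` and `Node` classes, the half-open ranges, and the `depth` budget are all hypothetical, and a real B+ tree node would hold keys and pointers rather than these records.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Bucket:
    lo: float                 # bucket covers the half-open range [lo, hi)
    hi: float
    count: int                # "appropriate info" stored next to the bucket
    child: Optional["Node"] = None

@dataclass
class Node:
    buckets: List[Bucket] = field(default_factory=list)

def estimate(node, a, b, depth):
    """Estimate tuples with a <= value < b, descending at most `depth` levels."""
    total = 0.0
    for bk in node.buckets:
        lo, hi = max(a, bk.lo), min(b, bk.hi)
        if lo >= hi:
            continue                                   # no overlap
        if lo == bk.lo and hi == bk.hi:
            total += bk.count                          # bucket fully covered
        elif bk.child is not None and depth > 0:
            total += estimate(bk.child, a, b, depth - 1)   # refine one level down
        else:
            total += bk.count * (hi - lo) / (bk.hi - bk.lo)  # uniform assumption
    return total

leaf = Node([Bucket(0, 5, 90), Bucket(5, 10, 10)])
root = Node([Bucket(0, 10, 100, leaf), Bucket(10, 20, 40)])
print(estimate(root, 0, 5, depth=0), estimate(root, 0, 5, depth=1))
```

Stopping at the root gives a coarse answer quickly, while each extra level visited replaces a uniform-assumption guess with stored counts, so accuracy grows with the depth budget.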

44 [Figure: Histograms and Tree Indices.] The Future: Histograms and Tree Indices. - Index fanout decreases. + Indexing and estimation in one. + Incremental estimation with increasing estimate accuracy.

45 The Future: Histograms and Tree Indices. A B+ tree node is an equi-depth histogram. What kind of trees arise with other constraints (v-optimal, max-diff, compressed)? Unbalanced trees: exact search slower. Unbalanced trees: approximate answers more accurate. The Future: Histograms and Tree Indices. Take query frequency into account. Represent popular values more accurately, higher in the tree. New hierarchical histograms/indices may be faster than traditional ones.

46 Conclusions. Histograms are very successful in databases. Possibly the best tradeoff between simplicity, efficiency, effectiveness, and applicability. The Future. New approaches to some characteristics. Untouched foundational problems. The next 10 years will be even more exciting!
