All About Bitmap Indexes... And Sorting Them
- Theodora Dawson
Transcription
1 Joint work (presented at BDA '08 and DOLAP '08) with Owen Kaser (UNB) and Kamel Aouiche (post-doc). February 12, 2009
4 Database Indexes: Databases use precomputed indexes (auxiliary data structures) to speed processing. An index costs memory and can hurt update speed. Improving indexes is practically important.
7 What makes indexes fast? We are going to use these three ideas: (1) Expect specific queries? Avoid a full scan! (2) Data is not random? Compress it! (3) A specific computer architecture? Tailor your code for it!
11 Bitmap indexes: SELECT * FROM T WHERE x=a AND y=b; To answer the query above, compute $\{r \mid r \text{ is the row id of a row where } x = a\} \cap \{r \mid r \text{ is the row id of a row where } y = b\}$. Bitmap indexes have a long history (1972 at IBM), including a long history with data warehousing (DW) and OLAP (Sybase IQ since the mid 1990s). Main competition: B-trees.
14 Bitmaps and fast AND/OR operations: Computing the union of two sets of integers between 1 and 64 (e.g., row ids in a trivial table), such as $\{1, 5, 8\} \cup \{1, 3, 5\}$, can be done in one operation by a CPU: a BitwiseOR of the two 64-bit bitmaps. Extend to sets from 1..N using N/64 operations. To compute $[a_0, \dots, a_{N-1}] \lor [b_0, b_1, \dots, b_{N-1}]$: $a_0, \dots, a_{63}$ BitwiseOR $b_0, \dots, b_{63}$; $a_{64}, \dots, a_{127}$ BitwiseOR $b_{64}, \dots, b_{127}$; $a_{128}, \dots, a_{191}$ BitwiseOR $b_{128}, \dots, b_{191}$; ...
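As a minimal sketch of the word-by-word union described above, the following Python code packs row ids into 64-bit words and ORs them one word at a time. The helper names (`to_words`, `from_words`, `union`) are hypothetical and chosen for illustration; Python integers stand in for machine words.

```python
WORD_BITS = 64  # word size assumed in the slide

def to_words(rows, n):
    """Pack a set of row ids (1..n) into a list of 64-bit words."""
    words = [0] * ((n + WORD_BITS - 1) // WORD_BITS)
    for r in rows:
        words[(r - 1) // WORD_BITS] |= 1 << ((r - 1) % WORD_BITS)
    return words

def from_words(words):
    """Recover the set of row ids from a list of words."""
    return {w * WORD_BITS + b + 1
            for w, word in enumerate(words)
            for b in range(WORD_BITS)
            if (word >> b) & 1}

def union(a, b):
    """One bitwise OR per pair of words, i.e. about N/64 operations."""
    return [x | y for x, y in zip(a, b)]

small = from_words(union(to_words({1, 5, 8}, 64), to_words({1, 3, 5}, 64)))
large = from_words(union(to_words({1, 130}, 200), to_words({64, 200}, 200)))
print(sorted(small), sorted(large))
```

The first call uses a single word, as in the 1..64 example; the second spans four words.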
17 Common applications of bitmaps: The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun's implementation is based on 64-bit words.) Search engines use bitmaps to filter queries, e.g., Apache Lucene.
22 Bitmap compression: [Figure: a column x with n rows next to its index bitmaps x=1, x=2, ..., x=L.] A column with n rows and L distinct values needs nL bits; e.g., with $n = 10^6$ and a large L, the index reaches the Gbit range. Uncompressed bitmaps are often impractical. Moreover, bitmaps often contain long streams of zeroes, and logical operations over these zeroes are a waste of CPU cycles.
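To make the nL-bit cost concrete, here is a small sketch that builds an uncompressed bitmap index for one column, with one bitmap per distinct value. The function name `build_bitmap_index` and the sample column are illustrative assumptions, not part of the original slides.

```python
def build_bitmap_index(column):
    """Return {value: bitmap}, where bitmap[r] == 1 iff column[r] == value."""
    index = {}
    for row, value in enumerate(column):
        # Each distinct value gets its own bitmap of n bits.
        index.setdefault(value, [0] * len(column))[row] = 1
    return index

column = ["cat", "dog", "dish", "fish", "cow", "cat", "pony"]
index = build_bitmap_index(column)
total_bits = sum(len(bitmap) for bitmap in index.values())
print(index["cat"], total_bits)
```

With n = 7 rows and L = 6 distinct values the index holds 42 bits, most of them zero, which is the situation compression is meant to address.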
25 How to compress bitmaps? Must handle long streams of zeroes efficiently. Run-length encoding (RLE)? A bitmap is a run of 0s, a run of 1s, a run of 0s, a run of 1s, ... So just encode the run lengths, e.g., 5, 3, 1, 1, 3.
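A minimal sketch of run-length encoding follows. The input bitmap `0000011101000` is an assumption: it is one bitmap that yields the run lengths 5, 3, 1, 1, 3 quoted in the slide if the first run consists of zeroes.

```python
from itertools import groupby

def run_lengths(bits):
    """Return the (bit, run length) pairs of a bitmap given as a string."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]

runs = run_lengths("0000011101000")
lengths = [length for _, length in runs]
print(runs, lengths)
```

Since runs alternate between 0s and 1s, storing the first bit and the list of lengths is enough to rebuild the bitmap.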
35 Compressing better with delta codes: RLE can make things worse. E.g., with 8-bit counters, the two-bit run 11 may become a full 8-bit counter. How many bits to use for the counters? Universal codes such as delta codes use no more than $c \log x$ bits to represent a value $x$. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes in two steps. Write $x = 2^N + (x \bmod 2^N)$. Write $N - 1$ as a gamma code; write $x \bmod 2^N$ as an $(N-1)$-bit number. E.g., $17 = 2^4 + 1$.
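The sketch below implements the textbook Elias gamma and Elias delta codes for positive integers. This is an assumption about what the slide has in mind: the textbook definitions write $N + 1$ in gamma code followed by the low $N$ bits of $x$, which differs in its offsets from the slide's shorthand. The function names are hypothetical.

```python
def elias_gamma(x):
    """Textbook Elias gamma: N zeroes, then x in binary (N + 1 bits)."""
    if x < 1:
        raise ValueError("defined for positive integers only")
    n = x.bit_length() - 1  # N = floor(log2 x), so x = 2^N + (x mod 2^N)
    return "0" * n + format(x, "b")

def elias_delta(x):
    """Textbook Elias delta: gamma code of N + 1, then the low N bits of x."""
    if x < 1:
        raise ValueError("defined for positive integers only")
    n = x.bit_length() - 1
    return elias_gamma(n + 1) + format(x, "b")[1:]

print(elias_gamma(5), elias_delta(17))
```

For 17 = 2^4 + 1 this gives the gamma code of 5 (`00101`) followed by the four low bits `0001`. The code length grows like log x plus a lower-order term, which is the "no more than c log x bits" property the slide relies on.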
36 RLE with delta codes is pretty good: In some (weak) sense, RLE compression with delta codes is optimal! Theorem: A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses $O(n \log N)$ bits.
41 Is the compression rate what matters? There is endless debate about whether more compression is better. Solid-state drives (SSD) have 10× the bandwidth? All problems are CPU-bound! Multi-core CPUs? All problems are I/O-bound! Store your indexes in RAM? All problems are CPU-bound! ... There is no definitive answer on whether more compression is better. It depends!
46 Byte/Word-aligned RLE: RLE variants can focus on runs that align with machine-word boundaries, trading compression for speed. That is what Oracle is doing. Variants: BBC (byte-aligned), WAH. Our EWAH extends Wu et al.'s (was known to Wu as WBC) word-aligned hybrid. Example layout: dirty word, run of 2 clean 0 words, dirty word, ...
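The following sketch illustrates the word-aligned idea only; it is not the actual EWAH or WAH on-disk format. It splits a bitmap into fixed-size words, keeps dirty words verbatim, and merges consecutive clean words into runs. The 8-bit word size, the function name, and the sample bitmap are assumptions made for readability.

```python
def word_aligned_runs(bits, w=8):
    """Classify each w-bit word as clean (all 0s or all 1s) or dirty.

    Assumes len(bits) is a multiple of w.
    """
    out = []
    for i in range(0, len(bits), w):
        word = bits[i:i + w]
        if word == "0" * w:
            kind = "clean0"
        elif word == "1" * w:
            kind = "clean1"
        else:
            out.append(("dirty", word))  # dirty words are stored as-is
            continue
        if out and out[-1][0] == kind:
            out[-1] = (kind, out[-1][1] + 1)  # extend the current clean run
        else:
            out.append((kind, 1))
    return out

encoded = word_aligned_runs("00010000" + "0" * 16 + "11110000")
print(encoded)
```

The sample input yields a dirty word, a run of 2 clean 0 words, and another dirty word, matching the layout in the slide. Because runs are only detected at word boundaries, logical operations never need to split a word.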
52 Computational and storage bounds: Let n be the number of rows and c the number of 1s per row. Model the storage cost as #(dirty words) + #(clean words, 0x00). Storage is in $O(nc)$. Bounds do not depend on the number of bitmaps (assuming $O(n)$ bitmaps). Construction time is proportional to index size. (Data is written sequentially on disk.) The implementation scales to millions of bitmaps.
56 What about other compression types? Why not compress using other techniques (Huffman, LZ77, arithmetic coding, ...)? With RLE-like compression we can compute $B_1 \land B_2$ or $B_1 \lor B_2$ in time $O(|B_1| + |B_2|)$. We don't know how to do this using the other compression techniques! Hence, with RLE, compression saves both storage and CPU cycles!
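A minimal sketch of why RLE-like formats allow logical operations in time linear in the compressed sizes: the AND below walks both run lists once and never expands them into bits. Bitmaps are modelled as lists of (bit, length) runs with positive lengths and equal total length, which is a simplification chosen for illustration.

```python
def rle_and(a, b):
    """AND two run-length encoded bitmaps without decompressing them."""
    out = []
    i = j = 0
    bit_a = bit_b = 0
    left_a = left_b = 0  # bits remaining in the current run of each input
    while True:
        if left_a == 0:
            if i == len(a):
                break
            bit_a, left_a = a[i]
            i += 1
        if left_b == 0:
            if j == len(b):
                break
            bit_b, left_b = b[j]
            j += 1
        step = min(left_a, left_b)  # consume the overlap of the two runs
        bit = bit_a & bit_b
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)  # merge with the previous run
        else:
            out.append((bit, step))
        left_a -= step
        left_b -= step
    return out

# 0000011100 AND 1111110000
result = rle_and([(0, 5), (1, 3), (0, 2)], [(1, 6), (0, 4)])
print(result)
```

Each loop iteration finishes at least one input run, so the number of iterations is at most the total number of runs in both inputs.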
61 What happens when you have many bitmaps? Consider $B_1 \lor B_2 \lor \dots \lor B_N$. First compute the first two: $B_1 \lor B_2$ in time $O(|B_1| + |B_2|)$. $B_3 \lor B_4$ is in $O(|B_3| + |B_4|)$. Thus $(B_1 \lor B_2) \lor (B_3 \lor B_4)$ takes $O(2 \sum_i |B_i|)$... The total is in $O(\sum_{i=1}^{N} |B_i| \log N)$ [Lemire et al., 2009].
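The pairing strategy can be sketched as a balanced reduction: combine bitmaps two by two, then combine the results, for about log N rounds. The sketch uses Python integers as stand-ins for compressed bitmaps and OR as the operation, both of which are assumptions; it shows the order of combination, not the compressed-size accounting.

```python
def or_all(bitmaps):
    """OR a non-empty list of bitmaps pairwise, level by level."""
    level = list(bitmaps)
    rounds = 0
    while len(level) > 1:
        nxt = [level[i] | level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd bitmap is carried to the next round
        level = nxt
        rounds += 1
    return level[0], rounds

combined, rounds = or_all([0b0001, 0b0100, 0b1000, 0b0001, 0b0010])
print(bin(combined), rounds)
```

In each round the inputs have total size at most the sum of the original sizes, and there are about log N rounds, which is where the log N factor in the bound comes from.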
66 Improving compression by sorting the table: RLE, BBC, WAH and EWAH are order-sensitive: they compress sorted tables better. But finding the best row ordering is NP-hard [Lemire et al., 2009]. Lexicographic row sorting is fast, even for very large tables, and easy: sort is a Unix staple. It gives substantial index-size reductions (often 2.5 times).
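To illustrate why sorting helps, the sketch below counts the runs in the 1-of-N bitmaps of a small table before and after a lexicographic row sort. Fewer runs means better RLE-style compression. The table contents and function names are made up for illustration.

```python
def count_runs(bits):
    """Number of maximal runs of identical bits."""
    return sum(1 for i, b in enumerate(bits) if i == 0 or b != bits[i - 1])

def total_runs(rows):
    """Total number of runs over all 1-of-N bitmaps of all columns."""
    total = 0
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        for v in set(values):
            total += count_runs([int(x == v) for x in values])
    return total

rows = [("red", "ford"), ("blue", "honda"), ("green", "ford"),
        ("red", "honda"), ("blue", "ford"), ("green", "honda")]
before = total_runs(rows)
after = total_runs(sorted(rows))  # lexicographic sort, first column is the primary key
print(before, after)
```

In this toy table all of the gain comes from the first column, whose bitmaps drop from 13 runs to 7, while the second column stays at 12.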
72 Improving compression via k-of-N encoding: [Table: example column with values cat, dog, dish, fish, cow, cat, pony and their 1-of-N and 2-of-N codes.] With L bitmaps, you can represent L values by mapping each value to one bitmap (1-of-N). Alternatively, you can represent $\binom{L}{2} = L(L-1)/2$ values by mapping each value to a pair of bitmaps (2-of-N). More generally, you can represent $\binom{L}{k}$ values by mapping each value to a k-tuple of bitmaps. At query time, you need to load k bitmaps in a look-up for one value, so you trade query-time performance for fewer bitmaps. Often, fewer bitmaps translate into a smaller index, created faster.
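A minimal sketch of k-of-N encoding: each distinct value is mapped to a distinct set of k bitmap positions out of L. The assignment order used here (sorted values paired with `itertools.combinations` order) is an arbitrary choice for illustration, and `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb

def k_of_n_codes(values, L, k):
    """Map each distinct value to a k-subset of the L bitmap positions."""
    distinct = sorted(set(values))
    if len(distinct) > comb(L, k):
        raise ValueError("not enough bitmaps for this many values")
    return dict(zip(distinct, combinations(range(L), k)))

column = ["cat", "dog", "dish", "fish", "cow", "cat", "pony"]
codes = k_of_n_codes(column, L=4, k=2)
print(codes)
```

The six distinct values fit in only four bitmaps with k = 2, since C(4, 2) = 6, whereas 1-of-N would need six bitmaps. A look-up for one value must then AND the two bitmaps of its code.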
77 Encode then sort? Or vice versa? Two different conceptual approaches: (1) Encode the attributes in the table, obtaining an uncompressed index; sort the index rows; compress each column. (2) Sort the table rows; then encode the attributes in the table, building the compressed index on the fly. [Example table (paint, maker): (red, ford), (blue, honda), (green, ford); after sorting the rows: (blue, honda), (green, ford), (red, ford).]
82 Gray-code order: [Figure: the same bit arrays listed in lexicographic order and in Gray-code order.] Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays). It may improve compression more than a lexicographic sort (k > 1). [Pinar et al., 2005] process an uncompressed bitmap index, which is slow if the uncompressed index does not fit in RAM. GC order is not supported by DBMSes or Unix utilities.
88 Gray-code sorting, cheaply: The size improvement is small (usually < 4%), but it's essentially free.
1. What Pinar et al. do: an expensive GC sort after encoding, e.g., [Tax, Cat, Girl, Cat] → sort([1100, 0110, 1001, 0110]).
2. Instead, sort the table lexicographically, comparing values alphabetically or by frequency (easy), e.g., [Tax, Cat, Girl, Cat] → [Cat, Cat, Girl, Tax].
3. Map the ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100. Lex ascending sequence of values: Cat, Dog, Girl, Tax. GC ascending sequence of codes: 0011, 0110, 0101, 1100. E.g., [Cat, Cat, Girl, Tax] → [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting).
4. Easily extended to more than one column.
In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier.
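The mapping step can be sketched as follows: enumerate all k-of-L codes, order them by their position in the reflected binary Gray code, and hand them out to the values in sorted order. The function names are hypothetical, and using the reflected binary Gray code is an assumption, though it reproduces the codes quoted in the slide.

```python
from itertools import combinations

def gray_rank(code):
    """Position of a bit pattern in the reflected binary Gray code sequence."""
    rank = code
    shift = code >> 1
    while shift:
        rank ^= shift
        shift >>= 1
    return rank

def gray_ordered_codes(L, k):
    """All L-bit codes with exactly k ones, in Gray-code order."""
    codes = [sum(1 << p for p in pos) for pos in combinations(range(L), k)]
    return sorted(codes, key=gray_rank)

def assign_codes(values, L, k):
    """Give the i-th smallest value the i-th code in Gray-code order."""
    ordered = sorted(set(values))
    codes = gray_ordered_codes(L, k)
    if len(ordered) > len(codes):
        raise ValueError("not enough codes for this many values")
    return {v: format(c, "0{}b".format(L)) for v, c in zip(ordered, codes)}

mapping = assign_codes(["Tax", "Cat", "Girl", "Cat", "Dog"], L=4, k=2)
table = sorted(["Tax", "Cat", "Girl", "Cat"])  # ordinary lexicographic sort
encoded = [mapping[v] for v in table]
print(mapping, encoded)
```

Because the value-to-code mapping is monotone with respect to Gray-code order, a plain lexicographic sort of the values leaves the encoded rows in Gray-code order, so no dedicated Gray-code sort is needed.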
91 What about other Gray codes? Define a Gray code to be a way to list all bit vectors while minimizing Hamming distances [Knuth, 2005]. There are other alternatives [Goddyn and Gvozdjak, 2003, Savage and Winkler, 1995]. Our tests suggest traditional Gray codes are best.
94 Test data sets: Previous studies used data sets where the uncompressed index would fit in RAM. Do their results apply to more realistic data sets? Our tests: a mix of real and synthetic data, up to 877 M rows, 22 GB and 4 M attribute values, using 4 to 10 columns/dimensions.
97 When sorting, column order matters: The first column(s) gain more from the sort (column 1 is the primary sort key). Its bitmaps (the first 11 in the example) are compressed well compared to a random sort (red above green). The least important column's bitmaps (43 to 49) don't gain much (red vs. green). [Figure: compression on TWEED-4d by bitmap rank, Gray-code sort vs. random sort.]
100 When sorting, column order matters: Conceptually, we may wish to reorder columns, e.g., swap columns 1 and 3. Column order is crucial (to successful sorting). Finding the best ordering quickly remains open. [Figure: Netflix, 24 column orderings; index size versus column permutation for 1-of-N and 4-of-N encodings.]
103 Progress toward choosing column order: The paper models the gain of putting a given column first. Idea: order columns greedily (by max gain). Experimentally, this approach is not promising: the best orderings don't seem to depend on gain. Factors: skews of columns; number of distinct values; k; density of the column's bitmaps.
106 What usually works for dimension ordering? (k = 1) For 1-of-N bitmaps, a density-based approach was okay. Ordering rule, k = 1 (sparse but not too sparse): order columns by decreasing $\min\left(\frac{1}{n_i}, \frac{1 - 1/n_i}{4w - 1}\right)$, where $n_i$ is the number of distinct values in column $i$ and $w$ is the word size. We see a 30 to 40% size reduction, merely knowing dimension sizes ($n_i$).
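As a sketch, the k = 1 ordering rule can be evaluated directly from the column cardinalities. The formula is taken as $\min(1/n_i, (1 - 1/n_i)/(4w - 1))$; the word size w = 32, the column names, and the cardinalities are illustrative assumptions.

```python
def density_key(n_i, w=32):
    """Ordering key for a column with n_i distinct values and word size w."""
    return min(1 / n_i, (1 - 1 / n_i) / (4 * w - 1))

def order_columns(cardinalities, w=32):
    """Column names sorted by decreasing key (first name = primary sort key)."""
    return sorted(cardinalities,
                  key=lambda name: density_key(cardinalities[name], w),
                  reverse=True)

cardinalities = {"gender": 2, "city": 100, "user": 10000}
order = order_columns(cardinalities)
print(order)
```

The key peaks near $n_i = 4w$, so a moderately sparse column such as the 100-value one comes first, ahead of both the very dense 2-value column and the very sparse 10000-value column.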
110 What usually works for dimension ordering? (k > 1) The density formula (with $n_i$ replaced by $\sqrt[k]{n_i}$) recommends poorly when k > 1. Our experiments on synthetic data give some guidance: when k > 1, order columns by (1) descending skew, (2) descending size. (And do the reverse when k = 1.) Open issues, k > 1: (1) How do we balance skew and size factors? (2) What other properties of the histograms are needed?
115 Bitmap-by-bitmap reordering: One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006]. Our best implementation of this is 100 times slower and cannot handle larger data sets. We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%. Canahuate suggests ordering does not matter much, but we see factor-of-2 differences (??). It seems sufficient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps).
119 Index size versus block-wise sorting: [Figure: Netflix; index size (MB) versus number of blocks, for k = 1, 2, 3, ...] Instead of fully sorting the table, we sorted it block-wise. Fewer blocks means a more complete sort. Larger k means a smaller index (in this case). Index size diminishes drastically with sorting.
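Block-wise sorting can be sketched as follows: the table is cut into a given number of contiguous blocks and each block is sorted independently. The function name and the way blocks are sized (equal-size blocks, the last one possibly shorter) are assumptions for illustration.

```python
def blockwise_sort(rows, num_blocks):
    """Sort each of num_blocks contiguous blocks of rows independently."""
    if not rows:
        return []
    size = -(-len(rows) // num_blocks)  # ceiling division
    out = []
    for start in range(0, len(rows), size):
        out.extend(sorted(rows[start:start + size]))
    return out

rows = [3, 1, 2, 0]
two_blocks = blockwise_sort(rows, 2)
one_block = blockwise_sort(rows, 1)
print(two_blocks, one_block)
```

With a single block the result is a full sort; with more blocks, each sort needs less memory but the table is only partially ordered, which is the trade-off the plot explores.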
123 How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words. Only 32-bit and 64-bit are efficient. 64-bit indexes are nearly twice as large. 64-bit indexes are between 5% and 40% faster (despite higher I/O costs).
125 Open Source Software? Lemur Bitmap Index C++ Library. JavaEWAH: a compressed alternative to the Java BitSet class.
126 Future direction? Need better mathematical modelling of bitmap compressed size in sorted tables;
127 Questions??
128 Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http://hpcrd.lbl.gov/~apinar/papers/tkde06.pdf.
Goddyn, L. and Gvozdjak, P. (2003). Binary Gray codes with long bit runs. Electronic Journal of Combinatorics, 10(R27):1-10.
Knuth, D. E. (2005). The Art of Computer Programming, volume 4, fascicle 2. Addison Wesley.
Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes.
Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005). Compressing bitmap indices by data reorganization. In ICDE '05.
Savage, C. and Winkler, P. (1995). Monotone Gray codes and the middle levels problem. Journal of Combinatorial Theory, A, 70(2).
Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1-38.