All About Bitmap Indexes... And Sorting Them

Size: px
Start display at page:

Download "All About Bitmap Indexes... And Sorting Them"

Transcription

1 Joint work (presented at BDA 08 and DOLAP 08) with Owen Kaser (UNB) and Kamel Aouiche (post-doc). February 12, 2009

2 Database Indexes Databases use precomputed indexes (auxiliary data structures) to speed processing.

3 Database Indexes Databases use precomputed indexes (auxiliary data structures) to speed processing. An index costs memory, can hurt update speed.

4 Database Indexes Databases use precomputed indexes (auxiliary data structures) to speed processing. An index costs memory, can hurt update speed. Improving indexes is practically important.

5 What make indexes fast? We are going to use these three ideas: Expect specific queries? Avoid a full scan!

6 What make indexes fast? We are going to use these three ideas: Expect specific queries? Avoid a full scan! Data is not random? Compress it!

7 What make indexes fast? We are going to use these three ideas: Expect specific queries? Avoid a full scan! Data is not random? Compress it! A specific computer architecture? taylor your code for it!

8 Bitmap indexes SELECT * FROM T WHERE x=a AND y=b; Above, compute {r r is the row id of a row where x = a} {r r is the row id of a row where y = b}

9 Bitmap indexes SELECT * FROM T WHERE x=a AND y=b; Bitmap indexes have a long history. (1972 at IBM.) Above, compute {r r is the row id of a row where x = a} {r r is the row id of a row where y = b}

10 Bitmap indexes SELECT * FROM T WHERE x=a AND y=b; Bitmap indexes have a long history. (1972 at IBM.) Long history with DW & OLAP. (Sybase IQ since mid 1990s). Above, compute {r r is the row id of a row where x = a} {r r is the row id of a row where y = b}

11 Bitmap indexes SELECT * FROM T WHERE x=a AND y=b; Bitmap indexes have a long history. (1972 at IBM.) Long history with DW & OLAP. (Sybase IQ since mid 1990s). Main competition: B-trees. Above, compute {r r is the row id of a row where x = a} {r r is the row id of a row where y = b}

12 Bitmaps and fast AND/OR operations Computing the union of two sets of integers between 1 and 64 (eg row ids, trivial table)... E.g., {1, 5, 8} {1, 3, 5}?

13 Bitmaps and fast AND/OR operations Computing the union of two sets of integers between 1 and 64 (eg row ids, trivial table)... E.g., {1, 5, 8} {1, 3, 5}? Can be done in one operation by a CPU: BitwiseOR( , )

14 Bitmaps and fast AND/OR operations Computing the union of two sets of integers between 1 and 64 (eg row ids, trivial table)... E.g., {1, 5, 8} {1, 3, 5}? Can be done in one operation by a CPU: BitwiseOR( , ) Extend to sets from 1..N using N/64 operations. To compute [a 0,..., a N 1 ] [b 0, b 1,..., b N 1 ] : a 0,..., a 63 BitwiseOR b 0,..., b 63 ; a 64,..., a 127 BitwiseOR b 64,..., b 127 ; a 128,..., a 192 BitwiseOR b 128,..., b 192 ;...

15 Common applications of the bitmaps The Java language has had a bitmap class since the beginning: java.util.bitset.

16 Common applications of the bitmaps The Java language has had a bitmap class since the beginning: java.util.bitset. (Sun s implementation is based on 8-bit words.)

17 Common applications of the bitmaps The Java language has had a bitmap class since the beginning: java.util.bitset. (Sun s implementation is based on 8-bit words.) Search engines use bitmaps to filter queries, e.g. Apache Lucene

18 Bitmap compression column n x index bitmaps x=1... x=2... x= A column with n rows and L distinct values nl bits L

19 Bitmap compression column n x index bitmaps x=1... x=2... x= A column with n rows and L distinct values nl bits E.g., n = 10 6, L = Gbits L

20 Bitmap compression column n x index bitmaps x=1... x=2... x= A column with n rows and L distinct values nl bits E.g., n = 10 6, L = Gbits Uncompressed bitmaps are often impractical L

21 Bitmap compression column n x index bitmaps x=1... x=2... x= A column with n rows and L distinct values nl bits E.g., n = 10 6, L = Gbits Uncompressed bitmaps are often impractical Moreover, bitmaps often contain long streams of zeroes... L

22 Bitmap compression column n x index bitmaps x=1... x=2... x= L 0... A column with n rows and L distinct values nl bits E.g., n = 10 6, L = Gbits Uncompressed bitmaps are often impractical Moreover, bitmaps often contain long streams of zeroes... Logical operations over these zeroes is a waste of CPU cycles.

23 How to compress bitmaps? Must handle long streams of zeroes efficiently Run-length encoding? (RLE)

24 How to compress bitmaps? Must handle long streams of zeroes efficiently Run-length encoding? (RLE) Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s,...

25 How to compress bitmaps? Must handle long streams of zeroes efficiently Run-length encoding? (RLE) Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s,... So just encode the run lengths, e.g., , 5, 3, 1,1,3

26 Compressing better with delta codes RLE can make things worse.

27 Compressing better with delta codes RLE can make things worse. 11 may become E.g., Use 8-bit counters, then

28 Compressing better with delta codes RLE can make things worse. E.g., Use 8-bit counters, then 11 may become How many bits to use for the counters?

29 Compressing better with delta codes RLE can make things worse. 11 may become How many bits to use for the counters? E.g., Use 8-bit counters, then Universal coding like delta codes use no more than c log x bits to represent value x.

30 Compressing better with delta codes RLE can make things worse. 11 may become How many bits to use for the counters? E.g., Use 8-bit counters, then Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc.

31 Compressing better with delta codes RLE can make things worse. 11 may become How many bits to use for the counters? E.g., Use 8-bit counters, then Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes.

32 Compressing better with delta codes RLE can make things worse. 11 may become How many bits to use for the counters? E.g., Use 8-bit counters, then Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes. Has two steps: x = 2 N + (x mod 2 N ).

33 Compressing better with delta codes RLE can make things worse. 11 may become How many bits to use for the counters? E.g., Use 8-bit counters, then Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes. Has two steps: x = 2 N + (x mod 2 N ). Write N 1 as gamma code;

34 Compressing better with delta codes RLE can make things worse. 11 may become How many bits to use for the counters? E.g., Use 8-bit counters, then Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes. Has two steps: x = 2 N + (x mod 2 N ). Write N 1 as gamma code; write x mod 2 N as an N 1-bit number.

35 Compressing better with delta codes RLE can make things worse. 11 may become How many bits to use for the counters? E.g., Use 8-bit counters, then Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes. Has two steps: x = 2 N + (x mod 2 N ). Write N 1 as gamma code; write x mod 2 N as an N 1-bit number. E.g. 17 = ,

36 RLE with delta codes is pretty good In some (weak) sense, RLE compression with delta codes is optimal! Theorem A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits.

37 Is the compression rate what matters? There is endless debate about whether more compression is better: Solid-State Drives (SSD) have 10 the bandwidth? All problems are CPU-bound!

38 Is the compression rate what matters? There is endless debate about whether more compression is better: Solid-State Drives (SSD) have 10 the bandwidth? All problems are CPU-bound! Multi-core CPUs? All problems I/O-bound!

39 Is the compression rate what matters? There is endless debate about whether more compression is better: Solid-State Drives (SSD) have 10 the bandwidth? All problems are CPU-bound! Multi-core CPUs? All problems I/O-bound! Store your indexes in RAM? All problems are CPU-bound!

40 Is the compression rate what matters? There is endless debate about whether more compression is better: Solid-State Drives (SSD) have 10 the bandwidth? All problems are CPU-bound! Multi-core CPUs? All problems I/O-bound! Store your indexes in RAM? All problems are CPU-bound!...

41 Is the compression rate what matters? There is endless debate about whether more compression is better: Solid-State Drives (SSD) have 10 the bandwidth? All problems are CPU-bound! Multi-core CPUs? All problems I/O-bound! Store your indexes in RAM? All problems are CPU-bound!... No definitive answer on whether more compression is better. It depends!

42 Byte/Word-aligned RLE RLE variants can focus on runs that align with machine-word boundaries.

43 Byte/Word-aligned RLE RLE variants can focus on runs that align with machine-word boundaries. Trade compression for speed.

44 Byte/Word-aligned RLE RLE variants can focus on runs that align with machine-word boundaries. Trade compression for speed. That is what Oracle is doing.

45 Byte/Word-aligned RLE RLE variants can focus on runs that align with machine-word boundaries. Trade compression for speed. That is what Oracle is doing. Variants: BBC (byte aligned), WAH

46 Byte/Word-aligned RLE RLE variants can focus on runs that align with machine-word boundaries. Trade compression for speed. That is what Oracle is doing. Variants: BBC (byte aligned), WAH Our EWAH extends Wu et al. s (was known to Wu as WBC) word-aligned hybrid dirty word, run of 2 clean 0 words, dirty word...

47 Computational and storage bounds n number of rows, c number of 1s per row;

48 Computational and storage bounds n number of rows, c number of 1s per row; Model storage cost as #(dirty words) + #(clean words, 0x00)

49 Computational and storage bounds n number of rows, c number of 1s per row; Model storage cost as #(dirty words) + #(clean words, 0x00) Storage is in O(nc);

50 Computational and storage bounds n number of rows, c number of 1s per row; Model storage cost as #(dirty words) + #(clean words, 0x00) Storage is in O(nc); Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps).

51 Computational and storage bounds n number of rows, c number of 1s per row; Model storage cost as #(dirty words) + #(clean words, 0x00) Storage is in O(nc); Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps). Construction time is proportional to index size. (Data is written sequentially on disk.)

52 Computational and storage bounds n number of rows, c number of 1s per row; Model storage cost as #(dirty words) + #(clean words, 0x00) Storage is in O(nc); Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps). Construction time is proportional to index size. (Data is written sequentially on disk.) Implementation scales to millions of bitmaps.

53 What about other compression types? Why not compress using other techniques (Huffman, LZ77, Arithmetic Coding,... )?

54 What about other compression types? Why not compress using other techniques (Huffman, LZ77, Arithmetic Coding,... )? With RLE-like compression we have B 1 B 2 or B 1 B 2 in time O( B 1 + B 2 ).

55 What about other compression types? Why not compress using other techniques (Huffman, LZ77, Arithmetic Coding,... )? With RLE-like compression we have B 1 B 2 or B 1 B 2 in time O( B 1 + B 2 ). We don t know how to do this using the other compression techniques!

56 What about other compression types? Why not compress using other techniques (Huffman, LZ77, Arithmetic Coding,... )? With RLE-like compression we have B 1 B 2 or B 1 B 2 in time O( B 1 + B 2 ). We don t know how to do this using the other compression techniques! Hence, with RLE, compress saves both storage and CPU cycles!!!!

57 What happens when you have many bitmaps? Consider B 1 B 2... B N.

58 What happens when you have many bitmaps? Consider B 1 B 2... B N. First compute the first two : B 1 B 2 in time O( B 1 + B 2 ).

59 What happens when you have many bitmaps? Consider B 1 B 2... B N. First compute the first two : B 1 B 2 in time O( B 1 + B 2 ). B 3 B 4 is in O( B 3 + B 4 ).

60 What happens when you have many bitmaps? Consider B 1 B 2... B N. First compute the first two : B 1 B 2 in time O( B 1 + B 2 ). B 3 B 4 is in O( B 3 + B 4 ). Thus (B 1 B 2 ) (B 3 B 4 ) takes O(2 i B i )...

61 What happens when you have many bitmaps? Consider B 1 B 2... B N. First compute the first two : B 1 B 2 in time O( B 1 + B 2 ). B 3 B 4 is in O( B 3 + B 4 ). Thus (B 1 B 2 ) (B 3 B 4 ) takes O(2 i B i )... Total is in O( N i=1 B i log N) [Lemire et al., 2009].

62 Improving compression by sorting the table RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better;

63 Improving compression by sorting the table RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better; But finding the best row ordering is NP-hard [Lemire et al., 2009].

64 Improving compression by sorting the table RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better; But finding the best row ordering is NP-hard [Lemire et al., 2009]. Lexicographic row sorting is fast, even for very large tables.

65 Improving compression by sorting the table RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better; But finding the best row ordering is NP-hard [Lemire et al., 2009]. Lexicographic row sorting is fast, even for very large tables. easy: sort is a Unix staple.

66 Improving compression by sorting the table RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better; But finding the best row ordering is NP-hard [Lemire et al., 2009]. Lexicographic row sorting is fast, even for very large tables. easy: sort is a Unix staple. Substantial index-size reductions (often 2.5 times)

67 Improving compression via k-of-n encoding With L bitmaps, you can represent L values by mapping each value to one bitmap; 1-of-N value cat dog dish fish cow cat pony

68 Improving compression via k-of-n encoding 1-of-N of-N With L bitmaps, you can represent L values by mapping each value to one bitmap; Alternatively, ( you can represent L ) 2 = L(L 1)/2 values by mapping each value to a pair of bitmaps;

69 Improving compression via k-of-n encoding 1-of-N of-N With L bitmaps, you can represent L values by mapping each value to one bitmap; Alternatively, ( you can represent L ) 2 = L(L 1)/2 values by mapping each value to a pair of bitmaps; More generally, you can represent ( L k) values by mapping each value to a k-tuple of bitmaps;

70 Improving compression via k-of-n encoding 1-of-N of-N With L bitmaps, you can represent L values by mapping each value to one bitmap; Alternatively, ( you can represent L ) 2 = L(L 1)/2 values by mapping each value to a pair of bitmaps; More generally, you can represent ( L k) values by mapping each value to a k-tuple of bitmaps; At query time, you need to load k bitmaps in a look-up for one value;

71 Improving compression via k-of-n encoding 1-of-N of-N With L bitmaps, you can represent L values by mapping each value to one bitmap; Alternatively, ( you can represent L ) 2 = L(L 1)/2 values by mapping each value to a pair of bitmaps; More generally, you can represent ( L k) values by mapping each value to a k-tuple of bitmaps; At query time, you need to load k bitmaps in a look-up for one value; You trade query-time performance for fewer bitmaps;

72 Improving compression via k-of-n encoding 1-of-N of-N With L bitmaps, you can represent L values by mapping each value to one bitmap; Alternatively, ( you can represent L ) 2 = L(L 1)/2 values by mapping each value to a pair of bitmaps; More generally, you can represent ( L k) values by mapping each value to a k-tuple of bitmaps; At query time, you need to load k bitmaps in a look-up for one value; You trade query-time performance for fewer bitmaps; Often, fewer bitmaps translates into a smaller index, created faster.

73 Encode then sort? Or vice versa? Two different conceptual approaches: 1 Encode attributes in table, obtaining an uncompressed index paint maker red ford blue honda green ford paint maker red ford blue honda green ford......

74 Encode then sort? Or vice versa? Two different conceptual approaches: 1 Encode attributes in table, obtaining an uncompressed index Sort the index rows paint maker red ford blue honda green ford paint maker red ford blue honda green ford

75 Encode then sort? Or vice versa? Two different conceptual approaches: 1 Encode attributes in table, obtaining an uncompressed index Sort the index rows Compress each column paint maker red ford blue honda green ford paint maker red ford blue honda green ford

76 Encode then sort? Or vice versa? Two different conceptual approaches: 1 Encode attributes in table, obtaining an uncompressed index Sort the index rows Compress each column 2 Sort the table rows paint maker red ford blue honda green ford paint maker red ford blue honda green ford paint maker blue honda green ford red ford

77 Encode then sort? Or vice versa? Two different conceptual approaches: 1 Encode attributes in table, obtaining an uncompressed index Sort the index rows Compress each column 2 Sort the table rows Encode attributes in table, build compressed index on-the-fly. paint maker red ford blue honda green ford paint maker red ford blue honda green ford paint maker blue honda green ford red ford

78 Gray-code order Lex. order Gray-code Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays);

79 Gray-code order Lex. order Gray-code Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays); May improve compression more than lex. sort (k > 1);

80 Gray-code order Lex. order Gray-code Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays); May improve compression more than lex. sort (k > 1); [Pinar et al., 2005] process an uncompressed bitmap index.

81 Gray-code order Lex. order Gray-code Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays); May improve compression more than lex. sort (k > 1); [Pinar et al., 2005] process an uncompressed bitmap index. Slow, if uncompressed index does not fit in RAM.

82 Gray-code order Lex. order Gray-code Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays); May improve compression more than lex. sort (k > 1); [Pinar et al., 2005] process an uncompressed bitmap index. Slow, if uncompressed index does not fit in RAM. GC order is not supported by DBMSes or Unix utilities.

83 Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it s essentially free: 1 What Pinar et al. do: expensive GC sort after encoding eg: [Tax, Cat, Girl, Cat] sort([1100, 0110, 1001, 0110]);

84 Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it s essentially free: 1 What Pinar et al. do: expensive GC sort after encoding eg: [Tax, Cat, Girl, Cat] sort([1100, 0110, 1001, 0110]); 2 Instead, sort the table lexicographically comparing values alphabetically or by frequency (easy); eg: [Tax, Cat, Girl, Cat] [Cat, Cat, Girl, Tax]

85 Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it s essentially free: 1 What Pinar et al. do: expensive GC sort after encoding eg: [Tax, Cat, Girl, Cat] sort([1100, 0110, 1001, 0110]); 2 Instead, sort the table lexicographically comparing values alphabetically or by frequency (easy); eg: [Tax, Cat, Girl, Cat] [Cat, Cat, Girl, Tax] 3 Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100; Lex ascending sequence: Cat, Dog, Girl, Tax. GC ascending sequence: 0011, 0110, 0101, 1100 for codes

86 Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it s essentially free: 1 What Pinar et al. do: expensive GC sort after encoding eg: [Tax, Cat, Girl, Cat] sort([1100, 0110, 1001, 0110]); 2 Instead, sort the table lexicographically comparing values alphabetically or by frequency (easy); eg: [Tax, Cat, Girl, Cat] [Cat, Cat, Girl, Tax] 3 Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100; Lex ascending sequence: Cat, Dog, Girl, Tax. GC ascending sequence: 0011, 0110, 0101, 1100 for codes eg: [Cat, Cat, Girl, Tax] [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting).

87 Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it s essentially free: 1 What Pinar et al. do: expensive GC sort after encoding eg: [Tax, Cat, Girl, Cat] sort([1100, 0110, 1001, 0110]); 2 Instead, sort the table lexicographically comparing values alphabetically or by frequency (easy); eg: [Tax, Cat, Girl, Cat] [Cat, Cat, Girl, Tax] 3 Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100; Lex ascending sequence: Cat, Dog, Girl, Tax. GC ascending sequence: 0011, 0110, 0101, 1100 for codes eg: [Cat, Cat, Girl, Tax] [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting). 4 Easily extended for > 1 columns.

88 Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it s essentially free: 1 What Pinar et al. do: expensive GC sort after encoding eg: [Tax, Cat, Girl, Cat] sort([1100, 0110, 1001, 0110]); 2 Instead, sort the table lexicographically comparing values alphabetically or by frequency (easy); eg: [Tax, Cat, Girl, Cat] [Cat, Cat, Girl, Tax] 3 Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100; Lex ascending sequence: Cat, Dog, Girl, Tax. GC ascending sequence: 0011, 0110, 0101, 1100 for codes eg: [Cat, Cat, Girl, Tax] [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting). 4 Easily extended for > 1 columns. In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier.

89 What about other Gray-codes? Define Gray-code to be a way to list all bitvectors while minimizing Hamming distances [Knuth, 2005, ]

90 What about other Gray-codes? Define Gray-code to be a way to list all bitvectors while minimizing Hamming distances [Knuth, 2005, ] There are other alternatives [Goddyn and Gvozdjak, 2003, Savage and Winkler, 1995].

91 What about other Gray-codes? Define Gray-code to be a way to list all bitvectors while minimizing Hamming distances [Knuth, 2005, ] There are other alternatives [Goddyn and Gvozdjak, 2003, Savage and Winkler, 1995]. Our tests suggest traditional Gray codes are best.

92 Test data sets Previous studies used data sets where the uncompressed index would fit in RAM.

93 Test data sets Previous studies used data sets where the uncompressed index would fit in RAM. Do their results apply to more realistic data sets?

94 Test data sets Previous studies used data sets where the uncompressed index would fit in RAM. Do their results apply to more realistic data sets? Our tests: Mix of real and synthetic data, up to 877 M rows, 22 GB, 4 M attribute values. using 4 10 columns/dimensions

95 When sorting, column order matters The first column(s) gain more from the sort (column 1 is primary sort key);

96 When sorting, column order matters The first column(s) gain more from the sort (column 1 is primary sort key); Its bitmaps (first 11 in example) are compressed well, compared to a randomsort. (Red above green) Compression on TWEED-4d 1-C/N Gray Random-sort rang des bitmaps

97 When sorting, column order matters The first column(s) gain more from the sort (column 1 is primary sort key); Its bitmaps (first 11 in example) are compressed well, compared to a randomsort. (Red above green) Least important column s bitmaps (43 49) don t gain much (red vs green) Compression on TWEED-4d 1-C/N rang des bitmaps Gray Random-sort

98 When sorting, column order matters Conceptually, we may wish to reorder columns, eg swap columns 1 & 3.

99 When sorting, column order matters Conceptually, we may wish to reorder columns, eg swap columns 1 & 3. Column order is crucial (to successful sorting).

100 When sorting, column order matters Conceptually, we may wish to reorder columns, eg swap columns 1 & 3. Column order is crucial (to successful sorting). Finding the best ordering quickly remains open. Netflix: 24 column orderings index size 5.5e+08 5e e+08 4e e+08 3e e+08 2e e+08 1e of-N encoding 4-of-N encoding column permutation

101 Progress toward choosing column order Paper models gain of putting a given column first. Idea: order columns greedily (by max gain).

102 Progress toward choosing column order Paper models gain of putting a given column first. Idea: order columns greedily (by max gain). Experimentally, this approach is not promising: the best orderings don t seem to depend on gain.

103 Progress toward choosing column order Paper models gain of putting a given column first. Idea: order columns greedily (by max gain). Experimentally, this approach is not promising: the best orderings don t seem to depend on gain. Factors: skews of columns number of distinct values k density of column s bitmaps

104 What usually works for dimension ordering?: k=1 For 1-of-N bitmaps, a density-based approach was okay:

105 What usually works for dimension ordering?: k=1 For 1-of-N bitmaps, a density-based approach was okay: Ordering rule, k = 1 : sparse but not too sparse Order columns by decreasing ( 1 min, 1 1/n ) i, where n i 4w 1 n i the number of distinct values in column i, w the word size distinct values in column

106 What usually works for dimension ordering?: k=1 For 1-of-N bitmaps, a density-based approach was okay: Ordering rule, k = 1 : sparse but not too sparse Order columns by decreasing ( 1 min, 1 1/n ) i, where n i 4w 1 n i the number of distinct values in column i, w the word size distinct values in column See 30 40% size reduction, merely knowing dimension sizes (n i ).

107 What usually works for dimension ordering?: k > 1 Density formula (n i k n i ) recommends poorly when k > 1. Our experiments on synthetic data give some guidance:

108 What usually works for dimension ordering?: k > 1 Density formula (n i k n i ) recommends poorly when k > 1. Our experiments on synthetic data give some guidance: When k > 1, order columns by 1 descending skew 2 descending size (And do the reverse when k = 1.)

109 What usually works for dimension ordering?: k > 1 Density formula (n i k n i ) recommends poorly when k > 1. Our experiments on synthetic data give some guidance: When k > 1, order columns by 1 descending skew 2 descending size (And do the reverse when k = 1.) Open issues, k > 1 1 How do we balance skew & size factors?

110 What usually works for dimension ordering?: k > 1 Density formula (n i k n i ) recommends poorly when k > 1. Our experiments on synthetic data give some guidance: When k > 1, order columns by 1 descending skew 2 descending size (And do the reverse when k = 1.) Open issues, k > 1 1 How do we balance skew & size factors? 2 What other properties of the histograms are needed?

111 Bitmap-by-bitmap reordering One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006].

112 Bitmap-by-bitmap reordering One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006]. Our best implementation of this is 100 times slower, cannot handle larger data sets.

113 Bitmap-by-bitmap reordering One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006]. Our best implementation of this is 100 times slower, cannot handle larger data sets. We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%.

114 Bitmap-by-bitmap reordering One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006]. Our best implementation of this is 100 times slower, cannot handle larger data sets. We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%. Canahaute suggests ordering does not matter much, but we see factor-of-2 differences (??)

115 Bitmap-by-bitmap reordering One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006]. Our best implementation of this is 100 times slower, cannot handle larger data sets. We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%. Canahaute suggests ordering does not matter much, but we see factor-of-2 differences (??) Seems sufficient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps)

116 Index size versus block-wise sorting Netflix taille de l index (Mo) k=1 k=2 k=3 k= # de blocs Instead of fully sorting the table, we sorted it block-wise;

117 Index size versus block-wise sorting Netflix taille de l index (Mo) k=1 k=2 k=3 k= # de blocs Instead of fully sorting the table, we sorted it block-wise; Fewer blocks means a more complete sort;

118 Index size versus block-wise sorting Netflix taille de l index (Mo) k=1 k=2 k=3 k= # de blocs Instead of fully sorting the table, we sorted it block-wise; Fewer blocks means a more complete sort; Larger k means smaller index (in this case);

119 Index size versus block-wise sorting Netflix taille de l index (Mo) k=1 k=2 k=3 k= # de blocs Instead of fully sorting the table, we sorted it block-wise; Fewer blocks means a more complete sort; Larger k means smaller index (in this case); Index size diminishes drastically with sorting.

120 How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words;

121 How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words; Only 32-bit and 64-bit are efficient;

122 How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words; Only 32-bit and 64-bit are efficient; 64-bit indexes are nearly twice as large;

123 How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words; Only 32-bit and 64-bit are efficient; 64-bit indexes are nearly twice as large; 64-bit indexes are between 5%-40% faster (despite higher I/O costs).

124 Open Source Software? Lemur Bitmap Index C++ Library:

125 Open Source Software? Lemur Bitmap Index C++ Library: JavaEWAH: A compressed alternative to the Java BitSet class

126 Future direction? Need better mathematical modelling of bitmap compressed size in sorted tables;

127 Questions??

128 Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http: //hpcrd.lbl.gov/~apinar/papers/tkde06.pdf (checked ). Goddyn, L. and Gvozdjak, P. (2003). Binary gray codes with long bit runs. Electronic Journal of Combinatorics, 10(R27):1 10. Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley. Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. available from Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005).

129 Compressing bitmap indices by data reorganization. In ICDE 05, pages Savage, C. and Winkler, P. (1995). Monotone gray codes and the middle levels problem. Journal of Combinatorial Theory, A, 70(2): Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1 38.

Sorting Improves Bitmap Indexes

Sorting Improves Bitmap Indexes Joint work (presented at BDA 08 and DOLAP 08) with Daniel Lemire and Kamel Aouiche, UQAM. December 4, 2008 Database Indexes Databases use precomputed indexes (auxiliary data structures) to speed processing.

More information

Histogram-Aware Sorting for Enhanced Word-Aligned Compress

Histogram-Aware Sorting for Enhanced Word-Aligned Compress Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes 1- University of New Brunswick, Saint John 2- Université du Québec at Montréal (UQAM) October 23, 2008 Bitmap indexes SELECT

More information

Analysis of Basic Data Reordering Techniques

Analysis of Basic Data Reordering Techniques Analysis of Basic Data Reordering Techniques Tan Apaydin 1, Ali Şaman Tosun 2, and Hakan Ferhatosmanoglu 1 1 The Ohio State University, Computer Science and Engineering apaydin,hakan@cse.ohio-state.edu

More information

arxiv: v4 [cs.db] 6 Jun 2014 Abstract

arxiv: v4 [cs.db] 6 Jun 2014 Abstract Better bitmap performance with Roaring bitmaps Samy Chambi a, Daniel Lemire b,, Owen Kaser c, Robert Godin a a Département d informatique, UQAM, 201, av. Président-Kennedy, Montreal, QC, H2X 3Y7 Canada

More information

Enhancing Bitmap Indices

Enhancing Bitmap Indices Enhancing Bitmap Indices Guadalupe Canahuate The Ohio State University Advisor: Hakan Ferhatosmanoglu Introduction n Bitmap indices: Widely used in data warehouses and read-only domains Implemented in

More information

Data Compression for Bitmap Indexes. Y. Chen

Data Compression for Bitmap Indexes. Y. Chen Data Compression for Bitmap Indexes Y. Chen Abstract Compression Ratio (CR) and Logical Operation Time (LOT) are two major measures of the efficiency of bitmap indexing. Previous works by [5, 9, 10, 11]

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing

More information

Daniel Lemire, Waterloo University, May 10th 2018.

Daniel Lemire, Waterloo University, May 10th 2018. Daniel Lemire, Waterloo University, May 10th 2018. Next Generation Indexes For Big Data Engineering Daniel Lemire and collaborators blog: https://lemire.me twitter: @lemire Université du Québec (TÉLUQ)

More information

Better bitmap performance with Roaring bitmaps

Better bitmap performance with Roaring bitmaps Better bitmap performance with bitmaps S. Chambi 1, D. Lemire 2, O. Kaser 3, R. Godin 1 1 Département d informatique, UQAM, Montreal, Qc, Canada 2 LICEF Research Center, TELUQ, Montreal, QC, Canada 3 Computer

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

Processing of Very Large Data

Processing of Very Large Data Processing of Very Large Data Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first

More information

In-Memory Data Management Jens Krueger

In-Memory Data Management Jens Krueger In-Memory Data Management Jens Krueger Enterprise Platform and Integration Concepts Hasso Plattner Intitute OLTP vs. OLAP 2 Online Transaction Processing (OLTP) Organized in rows Online Analytical Processing

More information

Storage hierarchy. Textbook: chapters 11, 12, and 13

Storage hierarchy. Textbook: chapters 11, 12, and 13 Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow Very small Small Bigger Very big (KB) (MB) (GB) (TB) Built-in Expensive Cheap Dirt cheap Disks: data is stored on concentric circular

More information

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data

More information

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Title Breaking the Curse of Cardinality on Bitmap Indexes Permalink https://escholarship.org/uc/item/5v921692 Author Wu, Kesheng

More information

Hybrid Query Optimization for Hard-to-Compress Bit-vectors

Hybrid Query Optimization for Hard-to-Compress Bit-vectors Hybrid Query Optimization for Hard-to-Compress Bit-vectors Gheorghi Guzun Guadalupe Canahuate Abstract Bit-vectors are widely used for indexing and summarizing data due to their efficient processing in

More information

Concise: Compressed n Composable Integer Set

Concise: Compressed n Composable Integer Set : Compressed n Composable Integer Set Alessandro Colantonio,a, Roberto Di Pietro a a Università di Roma Tre, Dipartimento di Matematica, Roma, Italy Abstract Bit arrays, or bitmaps, are used to significantly

More information

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 530A. B+ Trees. Washington University Fall 2013 CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key

More information

CS 525: Advanced Database Organization

CS 525: Advanced Database Organization CS 525: Advanced Database Organization 06: Even more index structures Boris Glavic Slides: adapted from a course taught by Hector Garcia- Molina, Stanford InfoLab CS 525 Notes 6 - More Indices 1 Recap

More information

Heap-Filter Merge Join: A new algorithm for joining medium-size relations

Heap-Filter Merge Join: A new algorithm for joining medium-size relations Oregon Health & Science University OHSU Digital Commons CSETech January 1989 Heap-Filter Merge Join: A new algorithm for joining medium-size relations Goetz Graefe Follow this and additional works at:

More information

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Index Compression. David Kauchak cs160 Fall 2009 adapted from: Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?

More information

An Experimental Study of Bitmap Compression vs. Inverted List Compression

An Experimental Study of Bitmap Compression vs. Inverted List Compression An Experimental Study of Bitmap Compression vs. Inverted Compression Paper ID: 338 ABSTRACT Bitmap compression has been studied extensively in the database area and many efficient compression schemes were

More information

Exadata X3 in action: Measuring Smart Scan efficiency with AWR. Franck Pachot Senior Consultant

Exadata X3 in action: Measuring Smart Scan efficiency with AWR. Franck Pachot Senior Consultant Exadata X3 in action: Measuring Smart Scan efficiency with AWR Franck Pachot Senior Consultant 16 March 2013 1 Exadata X3 in action: Measuring Smart Scan efficiency with AWR Exadata comes with new statistics

More information

COMP 430 Intro. to Database Systems. Indexing

COMP 430 Intro. to Database Systems. Indexing COMP 430 Intro. to Database Systems Indexing How does DB find records quickly? Various forms of indexing An index is automatically created for primary key. SQL gives us some control, so we should understand

More information

Indexing and Hashing

Indexing and Hashing C H A P T E R 1 Indexing and Hashing This chapter covers indexing techniques ranging from the most basic one to highly specialized ones. Due to the extensive use of indices in database systems, this chapter

More information

Bitmap Index Partition Techniques for Continuous and High Cardinality Discrete Attributes

Bitmap Index Partition Techniques for Continuous and High Cardinality Discrete Attributes Bitmap Index Partition Techniques for Continuous and High Cardinality Discrete Attributes Songrit Maneewongvatana Department of Computer Engineering King s Mongkut s University of Technology, Thonburi,

More information

C has been and will always remain on top for performancecritical

C has been and will always remain on top for performancecritical Check out this link: http://spectrum.ieee.org/static/interactive-the-top-programminglanguages-2016 C has been and will always remain on top for performancecritical applications: Implementing: Databases

More information

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig.

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig. Topic 7: Data Structures for Databases Olaf Hartig olaf.hartig@liu.se Database System 2 Storage Hierarchy Traditional Storage Hierarchy CPU Cache memory Main memory Primary storage Disk Tape Secondary

More information

HICAMP Bitmap. A Space-Efficient Updatable Bitmap Index for In-Memory Databases! Bo Wang, Heiner Litz, David R. Cheriton Stanford University DAMON 14

HICAMP Bitmap. A Space-Efficient Updatable Bitmap Index for In-Memory Databases! Bo Wang, Heiner Litz, David R. Cheriton Stanford University DAMON 14 HICAMP Bitmap A Space-Efficient Updatable Bitmap Index for In-Memory Databases! Bo Wang, Heiner Litz, David R. Cheriton Stanford University DAMON 14 Database Indexing Databases use precomputed indexes

More information

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25 Indexing Jan Chomicki University at Buffalo Jan Chomicki () Indexing 1 / 25 Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow (nanosec) (10 nanosec) (millisec) (sec) Very small Small

More information

Beyond Counting. Owen Kaser. September 17, 2014

Beyond Counting. Owen Kaser. September 17, 2014 Beyond Counting Owen Kaser September 17, 2014 1 Introduction Combinatorial objects such as permutations and combinations are frequently studied from a counting perspective. For instance, How many distinct

More information

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information

More information

Variable Length Integers for Search

Variable Length Integers for Search 7:57:57 AM Variable Length Integers for Search Past, Present and Future Ryan Ernst A9.com 7:57:59 AM Overview About me Search and inverted indices Traditional encoding (Vbyte) Modern encodings Future work

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

CS1800 Discrete Structures Fall 2016 Profs. Aslam, Gold, Ossowski, Pavlu, & Sprague December 16, CS1800 Discrete Structures Final

CS1800 Discrete Structures Fall 2016 Profs. Aslam, Gold, Ossowski, Pavlu, & Sprague December 16, CS1800 Discrete Structures Final CS1800 Discrete Structures Fall 2016 Profs. Aslam, Gold, Ossowski, Pavlu, & Sprague December 16, 2016 Instructions: CS1800 Discrete Structures Final 1. The exam is closed book and closed notes. You may

More information

Efficient and Effective Practical Algorithms for the Set-Covering Problem

Efficient and Effective Practical Algorithms for the Set-Covering Problem Efficient and Effective Practical Algorithms for the Set-Covering Problem Qi Yang, Jamie McPeek, Adam Nofsinger Department of Computer Science and Software Engineering University of Wisconsin at Platteville

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Multimedia Networking ECE 599

Multimedia Networking ECE 599 Multimedia Networking ECE 599 Prof. Thinh Nguyen School of Electrical Engineering and Computer Science Based on B. Lee s lecture notes. 1 Outline Compression basics Entropy and information theory basics

More information

Column Stores vs. Row Stores How Different Are They Really?

Column Stores vs. Row Stores How Different Are They Really? Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background

More information

Module 9: Selectivity Estimation

Module 9: Selectivity Estimation Module 9: Selectivity Estimation Module Outline 9.1 Query Cost and Selectivity Estimation 9.2 Database profiles 9.3 Sampling 9.4 Statistics maintained by commercial DBMS Web Forms Transaction Manager Lock

More information

Block Device Scheduling. Don Porter CSE 506

Block Device Scheduling. Don Porter CSE 506 Block Device Scheduling Don Porter CSE 506 Logical Diagram Binary Formats Memory Allocators System Calls Threads User Kernel RCU File System Networking Sync Memory Management Device Drivers CPU Scheduler

More information

Block Device Scheduling

Block Device Scheduling Logical Diagram Block Device Scheduling Don Porter CSE 506 Binary Formats RCU Memory Management File System Memory Allocators System Calls Device Drivers Interrupts Net Networking Threads Sync User Kernel

More information

Design Issues 1 / 36. Local versus Global Allocation. Choosing

Design Issues 1 / 36. Local versus Global Allocation. Choosing Design Issues 1 / 36 Local versus Global Allocation When process A has a page fault, where does the new page frame come from? More precisely, is one of A s pages reclaimed, or can a page frame be taken

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Data Warehouse Physical Design: Part I

Data Warehouse Physical Design: Part I Data Warehouse Physical Design: Part I Robert Wrembel Poznan University of Technology Institute of Computing Science Robert.Wrembel@cs.put.poznan.pl www.cs.put.poznan.pl/rwrembel Lecture outline Basic

More information

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static

More information

Algorithms. Lecture Notes 5

Algorithms. Lecture Notes 5 Algorithms. Lecture Notes 5 Dynamic Programming for Sequence Comparison The linear structure of the Sequence Comparison problem immediately suggests a dynamic programming approach. Naturally, our sub-instances

More information

Coding for Random Projects

Coding for Random Projects Coding for Random Projects CS 584: Big Data Analytics Material adapted from Li s talk at ICML 2014 (http://techtalks.tv/talks/coding-for-random-projections/61085/) Random Projections for High-Dimensional

More information

Computer Science 210 Data Structures Siena College Fall Topic Notes: Complexity and Asymptotic Analysis

Computer Science 210 Data Structures Siena College Fall Topic Notes: Complexity and Asymptotic Analysis Computer Science 210 Data Structures Siena College Fall 2017 Topic Notes: Complexity and Asymptotic Analysis Consider the abstract data type, the Vector or ArrayList. This structure affords us the opportunity

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2017/18 Unit 13 J. Gamper 1/42 Advanced Data Management Technologies Unit 13 DW Pre-aggregation and View Maintenance J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements:

More information

Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL

Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL 32901 rhibbler@cs.fit.edu ABSTRACT Given an array of elements, we want to arrange those elements into

More information

Use of Tree-based Algorithms for Internal Sorting

Use of Tree-based Algorithms for Internal Sorting Use of Tree-based s for Internal Sorting P. Y. Padhye (puru@deakin.edu.au) School of Computing and Mathematics Deakin University Geelong, Victoria 3217 Abstract Most of the large number of internal sorting

More information

Compressing Integers for Fast File Access

Compressing Integers for Fast File Access Compressing Integers for Fast File Access Hugh E. Williams Justin Zobel Benjamin Tripp COSI 175a: Data Compression October 23, 2006 Introduction Many data processing applications depend on access to integer

More information

CS1800 Discrete Structures Fall 2016 Profs. Aslam, Gold, Ossowski, Pavlu, & Sprague December 16, CS1800 Discrete Structures Final

CS1800 Discrete Structures Fall 2016 Profs. Aslam, Gold, Ossowski, Pavlu, & Sprague December 16, CS1800 Discrete Structures Final CS1800 Discrete Structures Fall 2016 Profs. Aslam, Gold, Ossowski, Pavlu, & Sprague December 16, 2016 Instructions: CS1800 Discrete Structures Final 1. The exam is closed book and closed notes. You may

More information

Information Systems (Informationssysteme)

Information Systems (Informationssysteme) Information Systems (Informationssysteme) Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2018 c Jens Teubner Information Systems Summer 2018 1 Part IX B-Trees c Jens Teubner Information

More information

Data Blocks: Hybrid OLTP and OLAP on compressed storage

Data Blocks: Hybrid OLTP and OLAP on compressed storage Data Blocks: Hybrid OLTP and OLAP on compressed storage Ben Brümmer Technische Universität München Fürstenfeldbruck, 26. November 208 Ben Brümmer 26..8 Lehrstuhl für Datenbanksysteme Problem HDD/Archive/Tape-Storage

More information

Index Tuning. Index. An index is a data structure that supports efficient access to data. Matching records. Condition on attribute value

Index Tuning. Index. An index is a data structure that supports efficient access to data. Matching records. Condition on attribute value Index Tuning AOBD07/08 Index An index is a data structure that supports efficient access to data Condition on attribute value index Set of Records Matching records (search key) 1 Performance Issues Type

More information

CSC 261/461 Database Systems Lecture 17. Fall 2017

CSC 261/461 Database Systems Lecture 17. Fall 2017 CSC 261/461 Database Systems Lecture 17 Fall 2017 Announcement Quiz 6 Due: Tonight at 11:59 pm Project 1 Milepost 3 Due: Nov 10 Project 2 Part 2 (Optional) Due: Nov 15 The IO Model & External Sorting Today

More information

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Table Compression in Oracle9i Release2. An Oracle White Paper May 2002

Table Compression in Oracle9i Release2. An Oracle White Paper May 2002 Table Compression in Oracle9i Release2 An Oracle White Paper May 2002 Table Compression in Oracle9i Release2 Executive Overview...3 Introduction...3 How It works...3 What can be compressed...4 Cost and

More information

Overview. DW Performance Optimization. Aggregates. Aggregate Use Example

Overview. DW Performance Optimization. Aggregates. Aggregate Use Example Overview DW Performance Optimization Choosing aggregates Maintaining views Bitmapped indices Other optimization issues Original slides were written by Torben Bach Pedersen Aalborg University 07 - DWML

More information

CSE373: Data Structure & Algorithms Lecture 18: Comparison Sorting. Dan Grossman Fall 2013

CSE373: Data Structure & Algorithms Lecture 18: Comparison Sorting. Dan Grossman Fall 2013 CSE373: Data Structure & Algorithms Lecture 18: Comparison Sorting Dan Grossman Fall 2013 Introduction to Sorting Stacks, queues, priority queues, and dictionaries all focused on providing one element

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Parallel, In Situ Indexing for Data-intensive Computing. Introduction

Parallel, In Situ Indexing for Data-intensive Computing. Introduction FastQuery - LDAV /24/ Parallel, In Situ Indexing for Data-intensive Computing October 24, 2 Jinoh Kim, Hasan Abbasi, Luis Chacon, Ciprian Docan, Scott Klasky, Qing Liu, Norbert Podhorszki, Arie Shoshani,

More information

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018 irtual Memory Kevin Webb Swarthmore College March 8, 2018 Today s Goals Describe the mechanisms behind address translation. Analyze the performance of address translation alternatives. Explore page replacement

More information

8 Introduction to Distributed Computing

8 Introduction to Distributed Computing CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 8, 4/26/2017. Scribed by A. Santucci. 8 Introduction

More information

CSE 332: Data Structures & Parallelism Lecture 12: Comparison Sorting. Ruth Anderson Winter 2019

CSE 332: Data Structures & Parallelism Lecture 12: Comparison Sorting. Ruth Anderson Winter 2019 CSE 332: Data Structures & Parallelism Lecture 12: Comparison Sorting Ruth Anderson Winter 2019 Today Sorting Comparison sorting 2/08/2019 2 Introduction to sorting Stacks, queues, priority queues, and

More information

CS122 Lecture 15 Winter Term,

CS122 Lecture 15 Winter Term, CS122 Lecture 15 Winter Term, 2014-2015 2 Index Op)miza)ons So far, only discussed implementing relational algebra operations to directly access heap Biles Indexes present an alternate access path for

More information

In-Memory Data Structures and Databases Jens Krueger

In-Memory Data Structures and Databases Jens Krueger In-Memory Data Structures and Databases Jens Krueger Enterprise Platform and Integration Concepts Hasso Plattner Intitute What to take home from this talk? 2 Answer to the following questions: What makes

More information

Strategies for Processing ad hoc Queries on Large Data Warehouses

Strategies for Processing ad hoc Queries on Large Data Warehouses Strategies for Processing ad hoc Queries on Large Data Warehouses Kurt Stockinger CERN Geneva, Switzerland Kurt.Stockinger@cern.ch Kesheng Wu Lawrence Berkeley Nat l Lab Berkeley, CA, USA KWu@lbl.gov Arie

More information

Unit 3 Fill Series, Functions, Sorting

Unit 3 Fill Series, Functions, Sorting Unit 3 Fill Series, Functions, Sorting Fill enter repetitive values or formulas in an indicated direction Using the Fill command is much faster than using copy and paste you can do entire operation in

More information

Keywords: Binary Sort, Sorting, Efficient Algorithm, Sorting Algorithm, Sort Data.

Keywords: Binary Sort, Sorting, Efficient Algorithm, Sorting Algorithm, Sort Data. Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com An Efficient and

More information

Unit 3 Functions Review, Fill Series, Sorting, Merge & Center

Unit 3 Functions Review, Fill Series, Sorting, Merge & Center Unit 3 Functions Review, Fill Series, Sorting, Merge & Center Function built-in formula that performs simple or complex calculations automatically names a function instead of using operators (+, -, *,

More information

UpBit: Scalable In-Memory Updatable Bitmap Indexing. Guanting Chen, Yuan Zhang 2/21/2019

UpBit: Scalable In-Memory Updatable Bitmap Indexing. Guanting Chen, Yuan Zhang 2/21/2019 UpBit: Scalable In-Memory Updatable Bitmap Indexing Guanting Chen, Yuan Zhang 2/21/2019 BACKGROUND Some words to know Bitmap: a bitmap is a mapping from some domain to bits Bitmap Index: A bitmap index

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Optimizing Query Execution for Variable-Aligned Length Compression of Bitmap Indices

Optimizing Query Execution for Variable-Aligned Length Compression of Bitmap Indices Optimizing Query Execution for Variable-Aligned Length Compression of Bitmap Indices Ryan Slechta University of St. Thomas ryan@stthomas.edu David Chiu University of Puget Sound dchiu@pugetsound.edu Jason

More information

Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See   for conditions on re-use Chapter 12: Indexing and Hashing Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis File Organizations and Indexing Review: Memory, Disks, & Files Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears, Roebuck, and Co.,

More information

Albis: High-Performance File Format for Big Data Systems

Albis: High-Performance File Format for Big Data Systems Albis: High-Performance File Format for Big Data Systems Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, Bernard Metzler, IBM Research, Zurich 2018 USENIX Annual Technical Conference

More information

Brotli Compression Algorithm outline of a specification

Brotli Compression Algorithm outline of a specification Brotli Compression Algorithm outline of a specification Overview Structure of backward reference commands Encoding of commands Encoding of distances Encoding of Huffman codes Block splitting Context modeling

More information

Block Device Scheduling. Don Porter CSE 506

Block Device Scheduling. Don Porter CSE 506 Block Device Scheduling Don Porter CSE 506 Quick Recap CPU Scheduling Balance competing concerns with heuristics What were some goals? No perfect solution Today: Block device scheduling How different from

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

4.2 Sorting and Searching. Section 4.2

4.2 Sorting and Searching. Section 4.2 4.2 Sorting and Searching 1 Sequential Search Scan through array, looking for key. Search hit: return array index. Search miss: return -1. public static int search(string key, String[] a) { int N = a.length;

More information

CSE 562 Database Systems

CSE 562 Database Systems Goal of Indexing CSE 562 Database Systems Indexing Some slides are based or modified from originals by Database Systems: The Complete Book, Pearson Prentice Hall 2 nd Edition 08 Garcia-Molina, Ullman,

More information

Outline. Database Management and Tuning. Index Data Structures. Outline. Index Tuning. Johann Gamper. Unit 5

Outline. Database Management and Tuning. Index Data Structures. Outline. Index Tuning. Johann Gamper. Unit 5 Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 5 1 2 Conclusion Acknowledgements: The slides are provided by Nikolaus Augsten

More information

arxiv: v4 [cs.db] 2 Mar 2018

arxiv: v4 [cs.db] 2 Mar 2018 Consistently faster and smaller compressed bitmaps with Roaring D. Lemire 1, G. Ssi-Yan-Kai 2, O. Kaser 3 arxiv:1603.06549v4 [cs.db] 2 Mar 2018 1 LICEF Research Center, TELUQ, Montreal, QC, Canada 2 42

More information

A Case for Merge Joins in Mediator Systems

A Case for Merge Joins in Mediator Systems A Case for Merge Joins in Mediator Systems Ramon Lawrence Kirk Hackert IDEA Lab, Department of Computer Science, University of Iowa Iowa City, IA, USA {ramon-lawrence, kirk-hackert}@uiowa.edu Abstract

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Ch. 2: Compression Basics Multimedia Systems

Ch. 2: Compression Basics Multimedia Systems Ch. 2: Compression Basics Multimedia Systems Prof. Ben Lee School of Electrical Engineering and Computer Science Oregon State University Outline Why compression? Classification Entropy and Information

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval

Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval Nazli Goharian, Ankit Jain, Qian Sun Information Retrieval Laboratory Illinois Institute of Technology Chicago, Illinois {goharian,ajain,qian@ir.iit.edu}

More information

An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order

An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order The 9th Workshop on Combinatorial Mathematics and Computation Theory An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order Ro Yu Wu Jou Ming Chang, An Hang Chen Chun Liang Liu Department of

More information

Operating Systems. Overview Virtual memory part 2. Page replacement algorithms. Lecture 7 Memory management 3: Virtual memory

Operating Systems. Overview Virtual memory part 2. Page replacement algorithms. Lecture 7 Memory management 3: Virtual memory Operating Systems Lecture 7 Memory management : Virtual memory Overview Virtual memory part Page replacement algorithms Frame allocation Thrashing Other considerations Memory over-allocation Efficient

More information

In-memory processing of big data via succinct data structures

In-memory processing of big data via succinct data structures In-memory processing of big data via succinct data structures Rajeev Raman University of Leicester SDP Workshop, University of Cambridge Overview Introduction Succinct Data Structuring Succinct Tries Applications

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information