Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems
Ingo Müller, Cornelius Ratsch, Franz Faerber
KIT / SAP AG
March 26, 2014, EDBT, Athens, Greece

Motivation: Column-Store Architecture Recap

Logical representation (first names) and physical representation (codes in the column vector):

  First name   Code
  Helen        2
  Michael      3
  Michael      3
  Adam         1
  Michael      3
  Helen        2
  Adam         1

Dictionary:

  Code   Value
  1      Adam
  2      Helen
  3      Michael

Focus on static dictionaries of the read-optimized store.
© 2014 SAP AG. All rights reserved.

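The mapping on this slide can be reproduced with a minimal sketch. The function names (`dictionary_encode`, `extract`, `locate`) are illustrative assumptions, not names from the talk; the sketch assumes a sorted dictionary with 1-based codes, as in the example table.

```python
import bisect

def dictionary_encode(values):
    """Split a string column into a sorted dictionary and a vector of 1-based codes."""
    dictionary = sorted(set(values))
    code_of = {value: i + 1 for i, value in enumerate(dictionary)}
    column_vector = [code_of[value] for value in values]
    return dictionary, column_vector

def extract(dictionary, code):
    """Return the string stored for a 1-based code."""
    return dictionary[code - 1]

def locate(dictionary, value):
    """Return the 1-based code of a string, or None if it is not in the dictionary."""
    i = bisect.bisect_left(dictionary, value)
    if i < len(dictionary) and dictionary[i] == value:
        return i + 1
    return None

first_names = ["Helen", "Michael", "Michael", "Adam", "Michael", "Helen", "Adam"]
dictionary, column_vector = dictionary_encode(first_names)
```

Because the dictionary is sorted, `locate` can use binary search, while `extract` is a plain array access.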
Motivation: Observation of Real-World Data

[Figure: two panels over "Distinct values per column" (logarithmic x-axis, 10^0 to 10^8). Panel "Distribution of cardinalities": fraction of columns (0.1% to 100%). Panel "Memory consumption": fraction of memory consumption (0.1% to 100%).]

- String columns play an important role in real-world applications
- Potential for compression

Outline
1 Introduction
2 Survey of Dictionary Formats
3 Performance Models
4 Automatic Selection
5 Evaluation
6 Summary

Survey of Dictionary Formats

Two basic dictionary formats:
- Array
- Front coding

Combined with string compression schemes:
- Uncompressed
- N-gram compression
- Bit compression
- Huffman
- Re-Pair

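As an illustration of the front coding format named above, here is a minimal sketch of block-wise front coding over a sorted list of strings. The block size, the tuple layout, and the function names are assumptions made for this example; the formats evaluated in the talk may store prefixes and suffixes differently.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def front_encode(sorted_strings, block_size=4):
    """Encode sorted strings block by block.

    The first string of each block is stored in full; every other string is
    stored as (shared prefix length with its predecessor, remaining suffix).
    """
    blocks = []
    for start in range(0, len(sorted_strings), block_size):
        chunk = sorted_strings[start:start + block_size]
        block = [(0, chunk[0])]
        for prev, cur in zip(chunk, chunk[1:]):
            k = common_prefix_len(prev, cur)
            block.append((k, cur[k:]))
        blocks.append(block)
    return blocks

def extract(blocks, index, block_size=4):
    """Rebuild the string at a 0-based index by replaying its block from the start."""
    block = blocks[index // block_size]
    current = ""
    for k, suffix in block[: index % block_size + 1]:
        current = current[:k] + suffix
    return current

names = ["Adam", "Adams", "Adele", "Helen", "Helena", "Michael"]
encoded = front_encode(names)
```

Extracting a string requires decoding from the start of its block, which is one reason front-coded formats trade extra runtime for a smaller footprint.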
Performance Comparison of Dictionary Formats

[Figure: compression rate (0 to 3) plotted against extract runtime in µs (0 to 4), one point per dictionary format: array, array fixed, array bc, array hu, array ng2, array ng3, array rp 12, array rp 16, fc block, fc block bc, fc block df, fc block hu, fc block ng2, fc block ng3, fc block rp 12, fc block rp 16, fc inline, column bc.]

- Formats provide a trade-off between runtime and space consumption
- The trade-off depends on the dictionary content

Compression Models

Example: array of Huffman encoded strings

  $\text{size} = \text{data} + \#\text{strings} \cdot \text{pointer}$
  $\text{data} = \text{raw data} \cdot \text{entropy}_0$

General idea: break down size into values that are either known or can be sampled.

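The size model for the Huffman example can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: entropy_0 is read here as the order-0 (per-byte) entropy of a sample, expressed relative to 8 bits per raw byte, and the 4-byte pointer size and all function names are hypothetical.

```python
import math
from collections import Counter

def order0_entropy_bits(strings):
    """Order-0 entropy in bits per byte, measured on the given strings."""
    counts = Counter(byte for s in strings for byte in s.encode("utf-8"))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def predict_array_huffman_size(sample, num_strings, raw_data_bytes, pointer_bytes=4):
    """Predict size = data + #strings * pointer, with data = raw data * entropy_0.

    sample: strings drawn from the dictionary (assumed to be representative).
    num_strings, raw_data_bytes: known properties of the full dictionary.
    pointer_bytes: assumed size of one pointer into the data area.
    """
    relative_entropy = order0_entropy_bits(sample) / 8.0  # assumption: 8 bits per raw byte
    data_bytes = raw_data_bytes * relative_entropy
    return data_bytes + num_strings * pointer_bytes

predicted = predict_array_huffman_size(["ab", "ab"], num_strings=100, raw_data_bytes=800)
```

Only the entropy has to be sampled; the number of strings and the raw data size are known to the system, which is what keeps the prediction cheap.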
Compression Models: Evaluation

[Figure: "Size prediction error". Prediction error (0 to 1) for the sampling ratios 100%, 10%, 1%, and max(1%, 5000); annotation: "max error = 5".]

Compression models offer cheap but sufficiently accurate size predictions.

Automatic Dictionary Selection: Goals and Overview

[Diagram: inputs and outputs of the dictionary selection.]

Dictionary format:
- Compression model
- Extract runtime per string
- Locate runtime per string
- Construct runtime per string

Column:
- Number of extracts
- Number of locates
- Column vector size
- Merge frequency
- Dictionary content

Database system:
- Occupied memory
- Available memory

Derived per dictionary format:
- Runtime in dictionary
- Dictionary size

Selection strategy using c, resulting in a dictionary selection per column.

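One way the per-format and per-column inputs listed on this slide could be combined into a "runtime in dictionary" estimate is sketched below. The formula is an assumption made for illustration (access cost plus rebuild cost on every merge), not the model from the talk, and all class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FormatCosts:
    """Assumed per-string costs of one dictionary format, in microseconds."""
    extract_us: float
    locate_us: float
    construct_us: float

@dataclass
class ColumnStats:
    """Assumed usage counters of one column over an observation period."""
    num_extracts: int
    num_locates: int
    num_strings: int
    num_merges: int

def estimate_dictionary_runtime_us(costs, stats):
    """Estimate the time spent in the dictionary (hypothetical linear model)."""
    access = stats.num_extracts * costs.extract_us + stats.num_locates * costs.locate_us
    # Assumption: every merge reconstructs the whole dictionary.
    rebuild = stats.num_merges * stats.num_strings * costs.construct_us
    return access + rebuild

costs = FormatCosts(extract_us=0.5, locate_us=2.0, construct_us=1.0)
stats = ColumnStats(num_extracts=1000, num_locates=10, num_strings=100, num_merges=2)
runtime_us = estimate_dictionary_runtime_us(costs, stats)
```

Evaluating this estimate once per candidate format gives the per-format runtimes that a selection strategy can weigh against the predicted sizes.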
Trade-Off Selection Strategy

[Figure: dictionary formats between the extremes "small" and "fast", plotted over relative runtime (0 to 0.6). Legend: Excluded, Included, Smallest, Selected. Further labels: column size; usage from heavy to light; threshold (1 + c) size_min [Lem12].]

Our heuristic selects a dictionary format based on local information and a global trade-off parameter (c).

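The threshold on this slide can be read as the following selection rule, shown here as a hedged sketch: formats larger than (1 + c) times the smallest size are excluded, and the fastest of the remaining formats is selected. This reading, the function name, and the candidate numbers are assumptions for illustration and are not measurements from the talk.

```python
def select_format(candidates, c):
    """Pick the fastest format whose size is at most (1 + c) * size_min.

    candidates: maps a format name to a (size, runtime) pair.
    c: global trade-off parameter; 0 always yields the smallest format.
    """
    size_min = min(size for size, _ in candidates.values())
    limit = (1 + c) * size_min
    included = {name: sr for name, sr in candidates.items() if sr[0] <= limit}
    # Fastest included format; ties are broken in favor of the smaller size.
    return min(included, key=lambda name: (included[name][1], included[name][0]))

# Hypothetical sizes and runtimes, for illustration only.
candidates = {
    "array": (100, 1.0),
    "fc block": (60, 1.5),
    "fc block rp 12": (40, 4.0),
}
smallest = select_format(candidates, c=0)
balanced = select_format(candidates, c=0.5)
fastest = select_format(candidates, c=2)
```

With c = 0 only the smallest format qualifies; a larger c admits faster but larger formats, so a single global parameter moves every column along its own space / time curve.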
Trade-Off Selection Strategy (continued)

[Figure: second build of the same plot. Dictionary formats between the extremes "small" and "fast", plotted over relative runtime (0 to 0.6). Legend: Excluded, Included, Smallest, Selected. Further labels: column size; usage from heavy to light; c; "α rel_time(d_min) t + b".]

Evaluation

Setup:
- TPC-H with *key columns as strings
- 47 dictionaries
- Workload: all queries consecutively

[Figure: relative memory consumption (0.5 to 2.0) plotted against relative runtime (0.90 to 1.20). Series "Fixed format", one point per format: array fixed, array, array ng2, array bc, array ng3, array hu, fc block df, fc inline, array rp 12, fc block, fc block ng2, array rp 16, fc block bc, fc block ng3, fc block hu, fc block rp 12, fc block rp 16. Series "Workload driven", with points labeled c = fastest, c = trade-off, and c = smallest.]

- Workload-driven dictionary selection outperforms static selection
- c effectively controls the space / time trade-off

Summary

Survey of dictionary implementations
- Variety of space / time trade-offs depending on content

Compression models
- Feasible size prediction for assisted or automatic selection

Automatic selection strategy
- Heuristic translating a global trade-off parameter into local decisions

Result
- Workload-driven format selection improves overall system trade-off

Thank You!