Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems

Similar documents
HICAMP Bitmap. A Space-Efficient Updatable Bitmap Index for In-Memory Databases! Bo Wang, Heiner Litz, David R. Cheriton Stanford University DAMON 14

In-Memory Data Management Jens Krueger

Vectorizing Database Column Scans with Complex Predicates

ENSC Multimedia Communications Engineering Topic 4: Huffman Coding 2

A New Compression Method Strictly for English Textual Data

In-Memory Data Management

UpBit: Scalable In-Memory Updatable Bitmap Indexing. Guanting Chen, Yuan Zhang 2/21/2019

An Asymmetric, Semi-adaptive Text Compression Algorithm

Data Compression Fundamentals

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

DATABASE COMPRESSION. Pooja Nilangekar [ ] Rohit Agrawal [ ] : Advanced Database Systems

Deep Dive Into Storage Optimization When And How To Use Adaptive Compression. Thomas Fanghaenel IBM Bill Minor IBM

Ch. 2: Compression Basics Multimedia Systems

Lecture: Analysis of Algorithms (CS )

Composite Group-Keys

Main-Memory Databases 1 / 25

14.4 Description of Huffman Coding

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Plugging versus Logging: A New Approach to Write Buffer Management for Solid-State Disks

To Zip or not to Zip. Effective Resource Usage for Real-Time Compression

8. Secondary and Hierarchical Access Paths

Column Stores vs. Row Stores How Different Are They Really?

arxiv: v2 [cs.it] 15 Jan 2011

Team SimpleMind. Ismail Oukid (TU Dresden), Ingo Müller (KIT), Iraklis Psaroudakis (EPFL) ACM SIGMOD 2015 Programming SIGMOD 2015 (June 2)

Continuous Data Cleaning

Information Retrieval. Chap 7. Text Operations

Sedef Akinli Kocak, Ryerson University

Fundamentals of Multimedia. Lecture 5 Lossless Data Compression Variable Length Coding

Compiler. Runtime System

A Simple Lossless Compression Heuristic for Grey Scale Images

Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval

Dimensioning of Reassembly Buffers at the OLT

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Multimedia Networking ECE 599

Data Blocks: Hybrid OLTP and OLAP on compressed storage

Code Compression for RISC Processors with Variable Length Instruction Encoding

Encoding. A thesis submitted to the Graduate School of University of Cincinnati in

Brotli Compression Algorithm outline of a specification

GK-12 Lesson Plan. Discrete Cosine Transform, compression, jpg, transform. Five minutes researching the DCT.

Data Transformation and Migration in Polystores

Succinct Data Structures: Theory and Practice

15 Data Compression 2014/9/21. Objectives After studying this chapter, the student should be able to: 15-1 LOSSLESS COMPRESSION

Model-based Database Systems

Textual Data Compression Speedup by Parallelization

Statistical Simulation of Superscalar Architectures using Commercial Workloads

SCHISM: A WORKLOAD-DRIVEN APPROACH TO DATABASE REPLICATION AND PARTITIONING

7: Image Compression

1 o Semestre 2007/2008

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

Red-Black, Splay and Huffman Trees

Accelerating multi-column selection predicates in main-memory the Elf approach

Greedy Approach: Intro

splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

7. Query Processing and Optimization

University of Waterloo CS240 Spring 2018 Help Session Problems

IMAGE COMPRESSION TECHNIQUES

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip

Interactive Collector Engine. Luca Sani

Outline. Introduction. Introduction Definition -- Selection and Join Semantics The Cost Model Load Shedding Experiments Conclusion Discussion Points

A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU

IBM DB2 9.7 for Linux, UNIX, and Windows Storage Optimization

Dirk Tetzlaff Technical University of Berlin

Analysis in the Big Data Era

SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less

Efficient Iceberg Query Processing in Sensor Networks

PDF recompression using JBIG2

Dictionary compression for a scan-based, main-memory database system

L14 Quicksort and Performance Optimization

Multimedia Quality in a Conversational Video-conferencing Environment

Visualization-supported Analysis of System Data for Controlled VMI-based Intrusion Detection

Lecture 6 Review of Lossless Coding (II)

CPE702 Algorithm Analysis and Design Week 7 Algorithm Design Patterns

Inside the InfluxDB Storage Engine

MILC: Inverted List Compression in Memory

AUTOMATIC CLUSTERING PRASANNA RAJAPERUMAL I MARCH Snowflake Computing Inc. All Rights Reserved

Implementation of Robust Compression Technique using LZ77 Algorithm on Tensilica s Xtensa Processor

Applying language modeling to session identification from database trace logs

Vertical Partitioning for Query Processing over Raw Data

Statistical Measures as Predictors of Compression Savings

Analyze Big Data Faster and Store It Cheaper

Model-Driven Geo-Elasticity In Database Clouds

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

CS152 Computer Architecture and Engineering Lecture 17: Cache System

Fast Approximations for Analyzing Ten Trillion Cells. Filip Buruiana Reimar Hofmann

Subject Name: OPERATING SYSTEMS. Subject Code: 10EC65. Prepared By: Kala H S and Remya R. Department: ECE. Date:

Purity: building fast, highly-available enterprise flash storage from commodity components

Power-Aware Throughput Control for Database Management Systems

Introduction Architecture overview. Multi-cluster architecture Addressing modes. Single-cluster Pipeline. architecture Instruction folding

Lec 04 Variable Length Coding in JPEG

MEMORY PREFETCHING WITH NEURAL-NETWORKS: CHALLENGES AND INSIGHTS. Leeor Peled, Shie Mannor, Uri Weiser, Yoav Etsion

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

: Fundamental Data Structures and Algorithms. June 06, 2011 SOLUTIONS

FASTER UPPER BOUNDING OF INTERSECTION SIZES

Big Linked Data ETL Benchmark on Cloud Commodity Hardware

Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Image coding and compression

Oracle 1Z0-515 Exam Questions & Answers

Keyword Search Using General Form of Inverted Index

Data Warehousing & Data Mining

Transcription:

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems Ingo Müller, Cornelius Ratsch, Franz Faerber / KIT / SAP AG March 26, 2014 EDBT, Athens, Greece

Motivation: Column-Store Architecture Recap Logical representation Physical representation First name Code Helen Michael Michael Adam Michael Helen Adam = 2 3 3 1 3 2 1 Code Value 1 Adam 2 Helen 3 Michael Dictionary Column vector Focus on static dictionaries of the read-optimized store c 2014 SAP AG. All rights reserved. 2

Motivation: Observation of Real-World Data Distribution of cardinalities Memory consumption 100% 100% Fraction of columns 10% 1% 0.1% 10 4 10 5 Fraction of memory consumption 10% 1% 0.1% 10 4 10 5 10 6 10 6 0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 10 8 Distinct values per column 10 7 0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 10 8 Distinct values per column String columns play an important role in real-world applications potential for compression c 2014 SAP AG. All rights reserved. 3

Outline 1 Introduction 2 Survey of Dictionary Formats 3 Performance Models 4 Automatic Selection 5 Evaluation 6 Summary c 2014 SAP AG. All rights reserved. 4

Survey of Dictionary Formats Two basic dictionary formats Array Front coding Combined with string compression schemes Uncompressed N-gram compression Bit compression Huffman Re-Pair c 2014 SAP AG. All rights reserved. 5

Performance Comparison of Dictionary Formats Compression rate 3 2 1 0 0 1 2 3 4 Extract runtime (µs) fc block rp 16 fc block rp 12 fc block ng3 array rp 16 fc block hu array rp 12 fc block ng2 fc block bc fc inline fc block array ng3 fc block df array hu array ng2 array bc array array fixed column bc Formats provide trade-off between runtime and space consumption Trade-off depends on dictionary content c 2014 SAP AG. All rights reserved. 6

Compression Models Example: array of Huffman encoded strings size = data + # strings pointer data = raw data entropy 0 General idea: break down size into values that are either known or can be sampled c 2014 SAP AG. All rights reserved. 7

Compression Models: Evaluation Size prediction error 1 max error = 5 0.8 Prediction error 0.6 0.4 0.2 0 100% 10% 1% max(1%, 5000) Sampling ratio Compression models offer cheap, but accurate enough size predictions c 2014 SAP AG. All rights reserved. 8

Automatic Dictionary Selection: Goals and Overview Dictionary format Compression model Extract runtime per string Locate runtime per string Construct runtime per string Runtime in dictionary per dictionary format Column Number of extracts Number of locates Column vector size Merge frequency Dictionary content Database system Occupied memory Available memory Dictionary size Selection strategy using c per column Dictionary selection c 2014 SAP AG. All rights reserved. 9

Trade-Off Selection Strategy heavy light usage Column size Excluded Included Smallest Selected c small fast (1 + c) size min [Lem12] 0 0.2 0.4 0.6 Relative runtime Our heuristic selects a dictionary format based on local information and a global trade-off parameter ( c). c 2014 SAP AG. All rights reserved. 10

Trade-Off Selection Strategy heavy light usage Column size Excluded Included Smallest Selected small c fast α rel_time(d min) t + b 0 0.2 0.4 0.6 Relative runtime Our heuristic selects a dictionary format based on local information and a global trade-off parameter ( c). c 2014 SAP AG. All rights reserved. 10

Evaluation c = fastest Setup TPC-H with *key columns as strings 47 dictionaries Workload: all queries consecutively Relative memory consumption 0.5 1.0 1.5 2.0 array fixed array array ng2 array bc array ng3 Fixed format Workload driven array hu fc block df fc inline array rp 12 fc block fc block ng2 array rp 16 fc block bc fc block ng3 fc block hu fc block rp 12 fc block rp 16 0.90 0.95 1.00 1.05 1.10 1.15 1.20 c = trade-off Relative runtime c = smallest Workload-driven dictionary selection outperforms static selection c effectively controls space / time trade-off c 2014 SAP AG. All rights reserved. 11

Summary Survey of dictionary implementations Variety of space / time trade-offs depending on content Compression models Feasible size prediction for assisted or automatic selection Automatic selection strategy Heuristic translating a global trade-off parameter into local decisions Result Workload-driven format selection improves overall system trade-off Thank You! c 2014 SAP AG. All rights reserved. 12