Final presentation, 11 January 2016, by Christian Bisig
Topics
- Scope and goals
- Approaching Column-Stores
- Introducing MemSQL
- Benchmark setup & execution
- Benchmark result & interpretation
- Conclusion
- Questions and feedback
Scope and goals
Scope and goals
- An understandable preparation of the topic column stores: why and for what is columnar data storage used?
- Introduction to MemSQL and columnar table usage in MemSQL
- Benchmark: MemSQL vs. PostgreSQL (and in-memory tables vs. columnar tables)
- Deliverables: article, presentation, tutorial module
Approaching Column-Stores
Approaching Column-Stores
- "A Decomposition Storage Model" (1985, SIGMOD conference): vertically partitioned data
- "C-Store: A Column-oriented DBMS" (2005): one of the first column-store DBMSs
- "The Design and Implementation of Modern Column-Oriented Database Systems" (2012)
Approaching Column-Stores
- Business process automation: mostly transaction-based (OLTP), e.g. registering new client data, executing money transfers
- Additionally, business process improvement through gaining business intelligence: analytical processing (OLAP), e.g. evaluating client purchases, budget forecasts
Approaching Column-Stores
Row-based vs. column-based storage layout (illustrated on the slide)
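The difference between the two layouts can be sketched in a few lines of Python (illustrative only, not MemSQL code; the column names and values are made up):

```python
# The same three records, stored row-wise and column-wise.
rows = [
    (1, "Rapperswil", 409),   # (fid, name, elevation)
    (2, "Zurich", 408),
    (3, "Bern", 540),
]

columns = {
    "fid":       [1, 2, 3],
    "name":      ["Rapperswil", "Zurich", "Bern"],
    "elevation": [409, 408, 540],
}

# A row store keeps each record contiguous, so fetching one record is cheap:
record = rows[1]

# A column store keeps each attribute contiguous, so an aggregation scans
# only the one column it needs instead of every full record:
avg_elevation = sum(columns["elevation"]) / len(columns["elevation"])
```

This is exactly why the scans and aggregations on the next slide favour the columnar layout, while record-at-a-time access favours the row layout.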
Approaching Column-Stores
Dos:
- Large table scans and aggregations
- Range queries: BETWEEN, IN, <, >
- Data compression (sparse and repeated data)
- Large data loads
Don'ts:
- Random / specific searches
- Large transaction volume (inserts and updates)
- Small inserts and updates (single-record insert performance)
Introducing MemSQL
Intro MemSQL
- Developed as an in-memory database; columnar tables added with version 3.0
- Provides a solution for both OLTP (row tables, in-memory) and OLAP (columnar tables on the hard disk)
- Wire-compatible with MySQL
- Compiled queries
Intro MemSQL
- Two-tier architecture, distributed system (commodity hardware)
- Reference tables and shard tables
- Lock-free data structures: skip lists, hash tables, stacks, queues
- MVCC (multi-version concurrency control)
Intro MemSQL
- Sharding (shard tables): data partitions distributed across the leaf nodes
- Reference tables
Intro MemSQL

Row table:

CREATE TABLE gnis (
  x double precision not null,
  y double precision not null,
  fid integer primary key,
  name text,
  class text,
  state text,
  county text,
  elevation integer,
  map text
);

Columnar table:

CREATE COLUMNAR TABLE gnis_col (
  x double precision not null,
  y double precision not null,
  fid integer,
  name text,
  class text,
  state text,
  county text,
  elevation integer,
  map text,
  KEY (`fid`) USING CLUSTERED COLUMNSTORE,
  SHARD KEY()
);
Intro MemSQL
MemSQL column-store segmentation. To consider:
- Every insert or update creates a new row segment group
- The more row segment groups, the worse the performance
Intro MemSQL
Compression in MemSQL:
- Compression algorithms: dictionary (tokenization), run-length encoding
- Example with the osm_poi_tag_ch table (from the table statistics): compression rate of 3.6:1, which results in around 72% space savings
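The space saving follows directly from the compression ratio; a quick sanity check of the slide's numbers:

```python
# A 3.6:1 compression rate means the compressed data occupies 1/3.6 of
# its original size; the rest is the space saving.
ratio = 3.6
compressed_fraction = 1 / ratio          # ~0.278 of the original size
savings = 1 - compressed_fraction        # ~0.722
print(f"{savings:.0%} space savings")    # roughly 72%, matching the slide
```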
Benchmark setup & execution
Benchmark setup
- MemSQL (v4.1.10) running in Docker
  - Creating the GNIS tables as both columnar and row tables
  - Comparing the performance of columnar and row tables
- PostgreSQL (9.4) row tables
- Benchmark hardware: iMac (late 2009), 2.8 GHz Intel Core i7, OS X El Capitan, 16 GB 1067 MHz DDR3 RAM, 500 GB SSD (read ~260 MB/s, write ~270 MB/s)
Benchmark setup
SQL load script, major changes to the original scripts:
- PostgreSQL's \copy command for loading CSV replaced by LOAD DATA LOCAL INFILE ... INTO TABLE
- CREATE TABLE AS SELECT replaced by CREATE TABLE plus INSERT INTO ... SELECT for creating the 1-, 2- and 3-million-record tables
- Slightly different naming (e.g. column name keyz instead of key, since KEY is a reserved word in MySQL)
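The two substitutions can be sketched as below. This is a hypothetical reconstruction, not the original load script: the file path, table names and CSV options are placeholders, and using CREATE TABLE ... LIKE for the copy step is an assumption.

```python
# PostgreSQL used psql's client-side \copy; over the MySQL wire protocol
# the equivalent bulk load is LOAD DATA LOCAL INFILE.
def load_statement(csv_path: str, table: str) -> str:
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "IGNORE 1 LINES"
    )

# CREATE TABLE AS SELECT was split into a CREATE TABLE followed by an
# INSERT INTO ... SELECT (here for one of the n-million-record subsets):
def copy_subset(source: str, target: str, limit: int) -> list:
    return [
        f"CREATE TABLE {target} LIKE {source}",
        f"INSERT INTO {target} SELECT * FROM {source} LIMIT {limit}",
    ]
```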
Benchmark execution
- Python scripts for benchmark execution, both for PostgreSQL and MemSQL (no reasonable timing mechanism in MemSQL itself)
- Using the psycopg2 (PostgreSQL) and MySQLdb (MemSQL) Python drivers
- Every query ran 3 times on the row tables (PostgreSQL) and on the columnar / row tables (MemSQL); the best run of each was taken for the comparison
- Second benchmark part: a script for bulk inserts/updates/deletes
Benchmark execution
Python script excerpts (shown on the slides; not reproduced in this export)
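As the actual excerpts are not available in this export, here is a minimal sketch of the best-of-three timing logic described above. The real scripts ran queries through psycopg2 / MySQLdb cursors; `run_query` is a stand-in for that.

```python
import time

def best_of_three(run_query, sql):
    """Time a query three times and keep the fastest run, as in the benchmark."""
    timings = []
    for _ in range(3):
        start = time.perf_counter()
        run_query(sql)          # e.g. cursor.execute(sql); cursor.fetchall()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Usage with a stand-in query function:
# best = best_of_three(lambda sql: None, "SELECT COUNT(*) FROM gnis")
```

Taking the minimum of several runs is a common way to reduce noise from caches and background load when no server-side timing mechanism is available.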
Benchmark result & interpretation
Benchmark result (charts shown on the slide)
Benchmark result
Single-tuple data manipulation:
- 10,000 inserts
- 10,000 updates
- 10,000 deletes
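The single-tuple workload above can be sketched as follows. This is a hypothetical reconstruction, not the benchmark script itself: `execute` stands in for a driver cursor's execute(), and the table and column names are placeholders.

```python
# 10,000 inserts, then 10,000 updates, then 10,000 deletes, each issued
# as its own single-row statement (the workload that stresses MemSQL's
# row-segment-group creation on columnar tables).
def single_tuple_workload(execute, n=10_000):
    for fid in range(n):
        execute("INSERT INTO gnis_col (fid, name) VALUES (%s, %s)", (fid, "x"))
    for fid in range(n):
        execute("UPDATE gnis_col SET name = %s WHERE fid = %s", ("y", fid))
    for fid in range(n):
        execute("DELETE FROM gnis_col WHERE fid = %s", (fid,))

# Counting calls with a stub shows 30,000 statements in total:
calls = []
single_tuple_workload(lambda sql, params: calls.append(sql))
```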
Benchmark interpretation
Four points to mention:
1. Specific searches on a non-indexed column with multiple tuples in the result set perform well
2. Both table types have their own field of use (e.g. specific search vs. range search)
3. Poor join performance on the column store
4. Column stores are well suited for a large volume of single data-manipulation operations
Conclusion
Conclusion
- It is not an option to favour one store type over the other: each has its own field of use
- An SSD cannot compensate for the column store's I/O disadvantages
- Impressed by the compression and the performance
- Struggled with measuring execution times in MemSQL
- Interested in testing MemSQL in a larger setup
Any questions?
References:
- Image slide 1: https://victoriafrederick.files.wordpress.com/2014/09/060922-120543-doric-columns-frieze-with-triglyphsand-metopes-and-pediment-at-the-back-of-the-temple-of-hera-ii1.jpg
- Image slide 11: http://www.storagenewsletter.com/wp-content/uploads/2014/02/memsqlv3.0.jpg
- Slide 18, Docker: https://www.docker.com/

Author: Christian Bisig, cbisig@hsr.ch, cbisig@gmail.com
Student for the Master of Science in Engineering at Hochschule für Technik Rapperswil, Master Research Unit Software and Systems

Hardware/software used for the tests: iMac (late 2009), CPU: 2.8 GHz Intel Core i7, RAM: 16 GB 1067 MHz DDR3, SSD: 500 GB (read ~260 MB/s, write ~270 MB/s), OS: OS X El Capitan