cstore_fdw Columnar store for analytic workloads Hadi Moshayedi & Ben Redman

1 cstore_fdw Columnar store for analytic workloads Hadi Moshayedi & Ben Redman

3 What is CitusDB? CitusDB is a scalable analytics database that extends PostgreSQL. Citus shards your data and automatically parallelizes your queries. Citus isn't a fork of PostgreSQL. Rather, it hooks onto the planner and executor for distributed query execution. Always rebased to newest PostgreSQL version. Natively supports new data types and extensions.

4 [Diagram: a master node (extended PostgreSQL) holds shard and shard placement metadata; worker nodes #1, #2, and #3 (extended PostgreSQL) hold the shards; 1 shard = 1 PostgreSQL table]

5 Talk Overview 1. Why customers want columnar stores 2. cstore_fdw live demo 3. cstore_fdw file layout 4. Benchmarks 5. Further Improvements

6 [Figure: example table with 700 columns and 30M rows; column headers shown: Id, Sz, Ln, Ht]

7 Example SQL query SELECT weight, AVG(price), MAX(price) FROM items WHERE quantity > 100 AND last_stock_date < GROUP BY weight;

8 Row-oriented store [Figure: table with columns Id, price, quant, last_stm, weight]

12 Cost of row storage
Read 700 columns instead of 4
>39 GB of unnecessary I/O

Input Type   Estimated Input Rate   Cost to query performance
Memory       10 GB/s                3.9 seconds
SSD          600 MB/s               >60 seconds
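The cost figures on this slide follow from dividing the unnecessary I/O by the input rate. A minimal sketch of that arithmetic, taking the slide's 39 GB and its two estimated rates as given (the function name `scan_seconds` and the decimal unit constants are assumptions for illustration):

```python
# Time to read a given number of bytes at a given sustained input rate.
def scan_seconds(bytes_to_read, bytes_per_second):
    return bytes_to_read / bytes_per_second

GB = 10**9  # decimal units, assumed
MB = 10**6

unnecessary_io = 39 * GB  # lower bound quoted on the slide

memory_seconds = scan_seconds(unnecessary_io, 10 * GB)  # 3.9
ssd_seconds = scan_seconds(unnecessary_io, 600 * MB)    # 65.0, i.e. ">60 seconds"

print(f"memory: {memory_seconds:.1f} s, SSD: {ssd_seconds:.1f} s")
```

Because 39 GB is itself a lower bound, both results are lower bounds on the time spent reading columns the query never uses.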

13 Example SQL query SELECT weight, AVG(price), MAX(price) FROM items WHERE quantity > 100 AND last_stock_date < GROUP BY weight;

14 Column-oriented store [Figure: table with columns Id, sz, price, quant, last_stm, weight]

17 Columnar Store Motivation Read subset of columns to reduce I/O. Better compression: less disk usage, less disk I/O.
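To make the "read subset of columns" point concrete, here is a toy sketch of the example query evaluated against a column-oriented layout. It is an illustration in plain Python, not how cstore_fdw is implemented; the data, the extra `description` column, and the `CUTOFF` date (the slide's date literal is missing) are all invented for the example.

```python
from collections import defaultdict
from datetime import date

# Column-oriented layout: one list per column (toy data, invented).
columns = {
    "id": [1, 2, 3, 4],
    "price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [150, 50, 200, 300],
    "last_stock_date": [date(2013, 1, 5), date(2013, 2, 1),
                        date(2013, 3, 1), date(2013, 6, 1)],
    "weight": [5, 5, 7, 7],
    "description": ["a", "b", "c", "d"],  # stands in for the unused columns
}

CUTOFF = date(2014, 1, 1)  # hypothetical; the slide's date literal is missing

def run_query(cols, cutoff):
    """SELECT weight, AVG(price), MAX(price) ... GROUP BY weight."""
    # Only these four columns are ever touched; the rest stay unread.
    weight = cols["weight"]
    price = cols["price"]
    quantity = cols["quantity"]
    stocked = cols["last_stock_date"]

    prices_by_weight = defaultdict(list)
    for i in range(len(weight)):
        if quantity[i] > 100 and stocked[i] < cutoff:
            prices_by_weight[weight[i]].append(price[i])
    return {w: (sum(p) / len(p), max(p)) for w, p in prices_by_weight.items()}

result = run_query(columns, CUTOFF)
print(result)
```

In a row-oriented layout the same loop would have to pull every row, including `id`, `description`, and every other column, through I/O before discarding them.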

18 Talk Overview 1. Why customers want columnar stores 2. cstore_fdw live demo 3. cstore_fdw file layout 4. Benchmarks 5. Further Improvements

20 Current Approaches to Columnar Stores 1. Fork a popular database, swap in your storage engine, and never look back 2. Develop an open columnar store format for the Hadoop Distributed Filesystem (HDFS) 3. Use PostgreSQL extension machinery for in-memory stores / external databases

21 ORC File Layout benefits 1. Columnar layout reads columns only related to the query 2. Compression groups column values (10K) together and compresses them 3. Skip indexes applies predicate filtering to skip over unrelated values

22 [Figure: a stripe holds 150K rows (configurable); within the stripe, column values are split into blocks (Block 1 through Block 7 shown) of 10K column values (configurable) per block]
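The skip-index idea can be sketched on top of this block layout: keep a minimum and maximum per block, and skip any block whose range cannot satisfy the predicate. The sketch below is a simplified illustration, not cstore_fdw's on-disk format; the `Block` class, the function names, and the tiny block size are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Block:
    values: list
    minimum: int  # skip-index entry: smallest value in the block
    maximum: int  # skip-index entry: largest value in the block

def build_blocks(values, block_size):
    """Split one column's values into fixed-size blocks with min/max stats."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks

def scan_greater_than(blocks, threshold):
    """Evaluate `value > threshold`, skipping blocks that cannot match."""
    matches = []
    blocks_read = 0
    for block in blocks:
        if block.maximum <= threshold:
            continue  # no value in this block can pass; skip without reading
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read

quantity = list(range(100))                      # toy column: 0..99
blocks = build_blocks(quantity, block_size=10)   # 10 blocks of 10 values
matches, blocks_read = scan_greater_than(blocks, 84)
print(len(matches), "matches from", blocks_read, "of", len(blocks), "blocks")
```

Skipping works best when values are clustered, as in this sorted toy column; with randomly ordered data most blocks span the whole value range and few can be skipped.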

23 Compression Current compression method is PG_LZ from PostgreSQL core. Easy to add new compression methods depending on the CPU / disk trade-off. cstore_fdw enables using different compression methods at the column block level.
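One way to picture per-block compression is a codec tag stored alongside each block's payload, so the reader can pick the matching decompressor. The sketch below is an assumption-laden illustration: `zlib` from the Python standard library stands in for PG_LZ, and the `(codec, payload)` block format and function names are hypothetical, not cstore_fdw's actual layout.

```python
import zlib

# Codec registry: name -> (compress, decompress).
# zlib is a stand-in here; cstore_fdw itself uses PG_LZ.
CODECS = {
    "none": (lambda raw: raw, lambda stored: stored),
    "zlib": (zlib.compress, zlib.decompress),
}

def write_block(raw, codec):
    """Return a (codec tag, payload) pair for one column block."""
    compress, _ = CODECS[codec]
    return (codec, compress(raw))

def read_block(block):
    """Decompress a block using the codec named in its tag."""
    codec, payload = block
    _, decompress = CODECS[codec]
    return decompress(payload)

repetitive = b"2014-01-01," * 10_000   # highly repetitive column values
short = b"42"                          # too small to be worth compressing

blocks = [write_block(repetitive, "zlib"), write_block(short, "none")]

for original, block in zip((repetitive, short), blocks):
    print(block[0], len(original), "->", len(block[1]), "bytes")
```

Adding a codec in this scheme means adding one registry entry, which is the kind of CPU / disk trade-off the slide refers to.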

24 Table sizes normalized to 1.0

25 Drawbacks to ORC Support for limited data types. Each data type further needs to have a separate code path for min/max value collection and constraint exclusion. Gathering statistics from the data and table JOINs are an afterthought.

26 Talk Overview 1. Why customers want columnar stores 2. cstore_fdw live demo 3. cstore_fdw file layout 4. Benchmarks 5. Further Improvements

27 Recent Benchmark Results TPC-H is a standard benchmark. Performed in-memory, SSD, and HDD tests on 10 GB of data. Used m2.2xlarge and m3.2xlarge on EC2. Compared vanilla PostgreSQL, cstore_fdw, and cstore_fdw with compression.

28 10GB of uncached data on m2.2xlarge

29 10GB of uncached data on m3.2xlarge

30 Total issued disk I/O measured with iotop

31 10GB of cached data on m2/m3.2xlarge

32 Talk Overview 1. Why customers want columnar stores 2. cstore_fdw live demo 3. cstore_fdw file layout 4. Benchmarks 5. Further Improvements

33 Vectorization What if data fits in memory? PostgreSQL's execution model: one tuple at a time, high overhead.

34 Improvement: Vectorization Batch of values at a time. Decreases the overhead. Better utilization of CPU. Internship Project: Can Güler
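The difference between the two execution models can be sketched by counting how many times the aggregate step is invoked. This is a toy model in Python, not PostgreSQL's executor; the function names and the batch size are assumptions chosen for illustration.

```python
def sum_tuple_at_a_time(values):
    """Invoke the aggregate step once per value."""
    def add_one(state, value):
        return state + value

    total, calls = 0, 0
    for value in values:
        total = add_one(total, value)
        calls += 1
    return total, calls

def sum_vectorized(values, batch_size):
    """Invoke the aggregate step once per batch of values."""
    def add_batch(state, batch):
        return state + sum(batch)

    total, calls = 0, 0
    for start in range(0, len(values), batch_size):
        total = add_batch(total, values[start:start + batch_size])
        calls += 1
    return total, calls

values = list(range(1, 10_001))
tuple_total, tuple_calls = sum_tuple_at_a_time(values)
batch_total, batch_calls = sum_vectorized(values, batch_size=1_000)
print(tuple_total, tuple_calls, batch_calls)
```

Both versions produce the same sum; the batched version pays the per-call overhead far fewer times, which is the effect the slide describes.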

35 Vectorization, Simple Aggregates

36 Vectorization, GROUP BY

37 More vectorization info postgres_vectorization_test

38 1.1 Release cstore_fdw is an open source project actively in development: github.com/citusdata/cstore_fdw. Improved statistics gathering. Automatic management of table filenames. Management of table file data.

39 Future Work Improve memory usage. Native Delete / Insert / Update support. Improve read query performance (vectorized execution!). Different compression codecs. Many more; contribute to the discussion on GitHub!

40 cstore_fdw: Open source columnar store fdw for PostgreSQL. Improves query times (1.1x-2x), reduces disk I/O, and reduces disk utilization (3x-4x). Data layout is based on ORC (indexes, compression). Uses foreign data wrapper APIs: full type support, optimization, and easy installation. Future perf improvements: vectorization.

41 cstore_fdw Columnar Store for Analytic Workloads Hadi Moshayedi Ben Redman
