Distributed Data Management Storage and Retrieval

Size: px

Start display at page:

Download "Distributed Data Management Storage and Retrieval"

Benjamin Young
5 years ago
Views:

1 Felix Naumann F-2.03/F-2.04, Campus II Hasso Plattner Institut

2 Introduction Layering Data Models 1. Conceptual layer Data structures, objects, modules, Application code 2. Logical layer Relational tables, JSON, XML, graphs, Database management system (DBMS) or storage engine 3. Representation layer our focus now Bytes in memory, on disk, on network, Database management system (DBMS) or storage engine 4. Physical layer Electrical currents, pulses of light, magnetic fields, Operating system and hardware drivers Slide 2

3 Overview Objective Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Slide 3

4 Overview Objective Most data models resemble key-value data Let s pretend that everything is key-value data Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Slide 4

$/bin/bash db_set () { } echo "$1,$2" >> database db_get () { } grep "^$1," database sed e "s/^$1,//" tail n 1 It works: $ db_set 1234$

5 Fast Storage DBMS A Tiny Database Basic database tasks: (a) write given data, (b) read specific data A tiny key-value store in two Bash functions: #!/bin/bash db_set () { } echo "$1,$2" >> database db_get () { } grep "^$1," database sed e "s/^$1,//" tail n 1 It works: $ db_set 1234 '{"name":"berlin","type":"city"}' $ db_get 1234 '{"name":"berlin","type":"city"}' Concatenate the first two parameters by, and write/append them to the file named database Find all lines starting with first parameter, remove first parameter from lines, and select the last line Why? Slide 5

$Fast Storage DBMS A Tiny Database Assume the following input-sequence: $ db_set 1234 '{"name":"berlin","type":"city"}' $ db_set 42$ database 1234,{"name":"Berlin","type":"city"} 42,{"name":"Germany","type":"country"} 42,{"name":"Germany","type":"country","capital":"Berlin"} database is a

database 1234,{"name":"Berlin","type":"city"} 42,{"name":"Germany","type":"country"} 42,{"name":"Germany","type":"country","capital":"Berlin"} database is a

6 Fast Storage DBMS A Tiny Database Assume the following input-sequence: $ db_set 1234 '{"name":"berlin","type":"city"}' $ db_set 42 '{"name":"germany","type":"country"}' $ db_set 42 '{"name":"germany","type":"country","capital":"berlin"}' The according database -file (= CSV-file): $ cat database 1234,{"name":"Berlin","type":"city"} 42,{"name":"Germany","type":"country"} 42,{"name":"Germany","type":"country","capital":"Berlin"} database is a Log file: Append only, no removal of old values only last entry for each key is valid Fast writes ( O(1) ) but slow reads ( O(n) with n records in the log ) To speed-up reads: Indexes! Slide 6

7 Overview Objective Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Slide 7

value identifier) Basically a key-value store, where values can be actual data or pointers to relational records, documents, graph

8 Fast DBMS Index Definition An additional data structure that locates data by some search criterion, i.e., the key (usually one or more attributes, or value identifier) Basically a key-value store, where values can be actual data or pointers to relational records, documents, graph nodes/edges, Improves data retrieval operations Costs additional writes to index structure and storage space Use indexes carefully (not too many)! Different index implementations (data structures) have different strengths Choose the right index for your queries! Slide 8

9 Fast DBMS Hash Index Definition A hash index is a hash map (dictionary) that maps keys to the positions (in memory, on disk, on the network, ) of their values/records The hash map uses a hash function to calculate mapping of keys and positions and is usually kept in memory Uses key-value stores, other indexes, data distribution (load balancing, sharding, ) Strength Point queries: one index look up directly delivers a value s position Weaknesses Range queries require to look up each key individually Hash map must fit into main memory; hash maps on disk perform poorly Slide 9

$i t y " } \ n 4 3, { " n a m e " : " G e r m a n y ", " t 20 50 Log-structured file on disk (each box is one byte) 60 70 y p e " : " c o u n t r y " } \ n 80 Slide$

10 Fast DBMS Hash Index Example hash function key byte offset In-memory hash map , { " n a m e " : " B e r l i n ", " t y p e " : " c i t y " } \ n 4 3, { " n a m e " : " G e r m a n y ", " t Log-structured file on disk (each box is one byte) y p e " : " c o u n t r y " } \ n 80 Slide 10

11 Overview Objective Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Slide 11

12 Distributed DBMS Remote Pointers hash function Three files; one per node key IP byte offset Slide 12

13 Overview Objective Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Slide 13

Huge and Evolving Datasets Segmentation Controlling file growth Indexed data, i.e., log is insert only (for good write performance) Frequent updates make files unnecessarily large Example: a store

14 Huge and Evolving Datasets Segmentation Controlling file growth Indexed data, i.e., log is insert only (for good write performance) Frequent updates make files unnecessarily large Example: a store that maps products to stock-counts Each purchase increments a stock-count new record! Each sale decrements a stock-count new record! But: collection of products is almost constant Solution: consolidate/compact the log frequently freeing up disk space How to do this on a running system? Segmentation! Slide 14

15 Huge and Evolving Datasets Segmentation Writes Reads Segmentation Break log into segments of fixed size Each segment stores a range of keys Compacted SATSFG SRSSSRZNRZR wrthsrns Current Wrthwrthwrth Wrthwrh wrth SRT Etzjejzezjejzetzj SRTHS SRSRZJSWJZ ZTJEDJZ DTZJETZJETZJ Rujrtj Wethwrth Reerzj Wrtwrjerjzezj Erzje Erzjezjetzj Etzetjz etzjejzezjejzetzj WrETZJ SRTHejz EjzeETZJETZJz Etzje etjzewre RTHERWRTHE ETZJETZJE ETZJ Wrjezjejz Ejzez Etzje etjzetzjejzewre RTHWERTHE WRTHE ETZJETZJE ETZJ EZJEZJ EJZETZJEJZ ZWEZ WETJHEJZU Q4TEZJWETZJZ6 ZRTZJ SR SRETZJZR EZJEZJ EJZETZJEJZ ZWEZ WETJHEJZU Q4WZ6ZR SR SRZR ERZJWRZJE SFDGRTHS SERZJ EJZEZTJE ETZJETZJ ETZJETJZEJZETZ ETZJET ETZJRTJHSW ERZJWRZJE SFDGS SRTJHSW RZJETZJE FGJRU RZIRZIETZJK,R ARETZGA Q5JW6 SD ETZJETJ RZJETZJE FGJRU RZIRZIK,R ARGA Q5ZW4E57JW6 SD ATEHSWRZJ SRTETZJETZJ TJHSRTJHSRT SSRZJETZJ RZJZ WRJREJ RJSER REJZWRJZWR REZJEJERZJ ATEHSWRZJ SRT SRTJHSRTJHSRT SSRZJ RZJERJZ WRJREJZREJ RJSER REJZWRJZWR REZJEJERZJ RWZJWRZ RZJ RZJJREZ REZJWRZJW RZJTEZJETZJ WZJETZJ ERZJDGHETZJ ETZJEJ ETZJETZJ RWZJWRZ RZJ RZJJREZ REZJWRZJW RZJ WZJ ERZJDGH Q5THW SRTW WTRHETZJWR 57J472ETJZ ETZJEETZ 245Z36J35 ETZJET7 Q5THW SRTW WTRHWR 57J472 ETZJE 245Z36J357 E5J7E7JETZJE EJETZJ EZJEJE57JE WRJWRJ WJZWE ETZJETZJE5JE ETZJ WZJWWE6ZJEJ E5J7E7JE EJ EZJEJE57JE WRJWRJ WJZWEJE5JE ETZJ WZJWWE6ZJEJ Can be subject for distribution! Each segment has two representations: Compacted Static (= does not allow writes) Purged (= only most recent value for each key) Current Dynamic (= allows appending writes) Unchecked (= same key might appear multiple times) Slide 15

Huge and Evolving Datasets Segmentation Segmentation Whenever a segment reaches max size: 1. Close the segment and redirect writes to a new segment 2. Compact the old segment: 1.

16 Huge and Evolving Datasets Segmentation Segmentation Whenever a segment reaches max size: 1. Close the segment and redirect writes to a new segment 2. Compact the old segment: 1. Read the old segment backwards 2. If a key is read for the first time: Write the entry (key + value) into a compacted segment 3. Merge the compacted segment with old compacted segment(s): 1. Read old compacted segment Writes Compacted 2. If a key is not present in the new compacted segment: Write the entry (key + value) into the compacted segment 4. Delete the old segment and old compacted segment SATSFG SRSSSRZNRZR wrthsrns Current Wrthwrthwrth Wrthwrh wrth SRT Etzjejzezjejzetzj SRTHS SRSRZJSWJZ ZTJEDJZ DTZJETZJETZJ Rujrtj Wethwrth Reerzj Wrtwrjerjzezj Erzje Erzjezjetzj Etzetjz etzjejzezjejzetzj WrETZJ SRTHejz EjzeETZJETZJz Etzje etjzewre RTHERWRTHE ETZJETZJE ETZJ Wrjezjejz Ejzez Etzje etjzetzjejzewre RTHWERTHE WRTHE ETZJETZJE ETZJ Reads EZJEZJ EJZETZJEJZ ZWEZ WETJHEJZU Q4TEZJWETZJZ6 ZRTZJ SR SRETZJZR EZJEZJ EJZETZJEJZ ZWEZ WETJHEJZU Q4WZ6ZR SR SRZR ERZJWRZJE SFDGRTHS SERZJ EJZEZTJE ETZJETZJ ETZJETJZEJZETZ ETZJET ETZJRTJHSW ERZJWRZJE SFDGS SRTJHSW RZJETZJE FGJRU RZIRZIETZJK,R ARETZGA Q5JW6 SD ETZJETJ RZJETZJE FGJRU RZIRZIK,R ARGA Q5ZW4E57JW6 SD ATEHSWRZJ SRTETZJETZJ TJHSRTJHSRT SSRZJETZJ RZJZ WRJREJ RJSER REJZWRJZWR REZJEJERZJ ATEHSWRZJ SRT SRTJHSRTJHSRT SSRZJ RZJERJZ WRJREJZREJ ATEHS RJSER SRT REJZWRJZWR SR REZJEJERZJ RWZJWRZ RZJ RZJJREZ REZJWRZJW RZJTEZJETZJ WZJETZJ ERZJDGHETZJ ETZJEJ ETZJETZJ RWZJWRZ RZJ RZJJREZ REZJWRZJW RZJ WZJ ERZJDGH Slide 16 Q5THW SRTW WTRHETZJWR 57J472ETJZ ETZJEETZ 245Z36J35 ETZJET7 Q5THW SRTW WTRHWR 57J472 ETZJE 245Z36J357 E5J7E7JETZJE EJETZJ EZJEJE57JE WRJWRJ WJZWE ETZJETZJE5JE ETZJ WZJWWE6ZJEJ E5J7E7JE EJ EZJEJE57JE WRJWRJ WJZWEJE5JE ETZJ WZJWWE6ZJEJ

17 Huge and Evolving Datasets Segmentation Segmentation Compact: Merge: (+ Compact) Slide 17

Huge and Evolving Datasets Segmentation If the data is really large, then this index will also be too large and updating it too expensive! hash function key IP byte offset 1234 172.168.0.1 0 42 172.

18 Huge and Evolving Datasets Segmentation If the data is really large, then this index will also be too large and updating it too expensive! hash function key IP byte offset SATSFG SRSSSRZNRZR wrthsrns SATSFG SRSSSRZNRZR wrthsrns Wrthwrthwrth Wrthwrh wrth SATSFG SRSSSRZNRZR wrthsrns Wrthwrthwrth Wrthwrh wrth SATSFG SRSSSRZNRZR wrthsrns SATSFG SRSSSRZNRZR wrthsrns Wrthwrthwrth Wrthwrh wrth Wrthwrthwrth Wrthwrh wrth Wrthwrthwrth Wrthwrh wrth Slide 18

Huge and Evolving key Datasets IP byte offset Segmentation 1234 172.168.0.1 243 362 172.168.0.1 134 Index = key-value store Segmentation! 39 172.168.0.3 34 key IP byte offset 1234 172.168.0.1 0 42 172.

19 Huge and Evolving key Datasets IP byte offset Segmentation Index = key-value store Segmentation! key IP byte offset SATSFG SRSSSRZNRZR wrthsrns SATSFG SRSSSRZNRZR wrthsrns Wrthwrthwrth Wrthwrh wrth SATSFG SRSSSRZNRZR wrthsrns Wrthwrthwrth Wrthwrh wrth SATSFG SRSSSRZNRZR wrthsrns SATSFG SRSSSRZNRZR wrthsrns Wrthwrthwrth Wrthwrh wrth Wrthwrthwrth Wrthwrh wrth Wrthwrthwrth Wrthwrh wrth Slide 19

20 Overview Objective Since our index became a segmentation file, we are back to O(n) reads Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Slide 20

Fast DBMS SSTables 2 4 5 7 1 Sorted String Tables (SSTables) Showing only the keys Are segment files with two special properties: 1.

21 Fast DBMS SSTables Sorted String Tables (SSTables) Showing only the keys Are segment files with two special properties: 1. All key-value pairs are sorted by their key Only pairs with larger keys can be appended: If a new key cannot be appended, start new SSTable! 2. Each key appears only once No pair with an existing key can be appended: If a existing key cannot be appended, trigger compaction! O(log(n)) read performance now, but we lost our O(1) write performance! Slide 21

22 Fast DBMS SSTables Merge ? Given two (or more) SSTables, their merge is calculated similarly to the merge-sort algorithm: Read all SSTables simultaneously: 1. Copy the smallest key with its value to the merged SSTable and read next key-value pair 2. If keys are similar: Copy only the key-value pair from the most recent SSTable to the merged SSTable More efficient than merging general segment files, but still too slow for random inserts of key-value pairs! Slide 22

Fast DBMS SSTables Merge 2 4 5 7 1 3 6 7 1 2 3 4?

23 Fast DBMS SSTables Merge ? Given two (or more) SSTables, their merge is calculated similarly to the merge-sort algorithm: Slide 23

Fast DBMS SSTables An index on top of the SSTable-index Two options: a) A dense index, i.e., each key-value pair in the SSTable is indexed Allows direct look ups but index size equal to SSTable size b) A sparse index, i.

search) find the block of the next larger key (look up) search for the query key in the block (binary search or linear scan, if log entries have

24 Fast DBMS SSTables An index on top of the SSTable-index Two options: a) A dense index, i.e., each key-value pair in the SSTable is indexed Allows direct look ups but index size equal to SSTable size b) A sparse index, i.e., first key-value pair in each block of an SSTable is indexed If a query key is not directly indexed: find the next larger key in the index (binary search) find the block of the next larger key (look up) search for the query key in the block (binary search or linear scan, if log entries have variable size) Blocks can be compressed to reduce memory consumption and I/O bandwidth Slide 24

25 Fast DBMS SSTables An index on top of the SSTable-index Two options: b) A sparse index, i.e., first key-value pair in each block of an SSTable is indexed Because they get large! So what is this? Slide 25

26 Fast DBMS B-Tree Definition R. Bayer and E. McCreight. Organization and maintenance of large ordered indices, In Proceedings of SIGFIDET (now SIGMOD) Workshop, pages , 1970 A self-balancing, tree-based data structure, that stores values sorted by key and allows read/write access in logarithmic time A generalization of a binary search tree as nodes can have more than two children Structure Blocks: Nodes in the tree that contain key-value pairs and pointers to other blocks Correspond to physical, fixed sized disk blocks/pages that are addressed and read as single units Pointers: Edges in the tree that connect blocks in tree structure Correspond to physical block/page addresses Slide 26

Fast DBMS B-Tree Constraints Balanced: Same distance from root-node to all leaf-nodes Depth of

required Block-Content: A block contains n keys and n+1 pointers in alternating order Pointers

containing larger/equal keys All values in the tree are sorted by their keys!

27 Fast DBMS B-Tree Constraints Balanced: Same distance from root-node to all leaf-nodes Depth of the tree is in O(log(n)) (= key-look-up complexity) If tree grows unbalanced, re-balancing required Block-Content: A block contains n keys and n+1 pointers in alternating order Pointers left to a key point to blocks containing smaller keys Pointers right to a key point to blocks containing larger/equal keys All values in the tree are sorted by their keys! Block-Size: Typically 4096 Byte per block; 4 Byte per key; 8 Byte per pointer 4n + 8(n+1) 4096 => n = 340 Slide 27

B + -Tree B-Trees store keys and values in both internal and leaf nodes; B + -Trees store values only in leaf nodes.

28 Fast DBMS B-Tree Constraints Root Node: Points to underlying nodes (and values) At least 2 pointers used Inner Node: Points to underlying nodes (and values) At least (n + 1)/2 pointers used B-Tree vs. B + -Tree B-Trees store keys and values in both internal and leaf nodes; B + -Trees store values only in leaf nodes. The following examples show B + -Trees Leaf Node: Points to right neighbor leaf and values At least 1 neighbor pointer (if present) and (n + 1)/2 value pointers used Uses Any data store: most widely used index structure for DBMSs Sorted, dense, and sparse indexes Slide 28

29 Fast DBMS B-Tree Example Key Block Pointer free space Root Inners Leafs B-Tree vs. SSTable. Always rewrite entire blocks vs. always append and merge frequently Values Slide 29

30 Fast DBMS B-Trees Neighbor pointer Grow scenario Look up scenario With data Slide 30

31 Fast DBMS B-Tree: Insert & Delete Split and Merge operations guarantee that the B-Tree is always balanced and the blocks are filled correctly.

32 Overview Objective Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Slide 32

Putting Everything Together LSM-Trees Definition Log-Structured Merge-Trees (LSM Trees) are multilayered search trees for

Example First layer (C 0 Tree): index structure that 1) efficiently takes new key-value pairs in any order 2) outputs all

1) is able to merge with sorted key-value pair lists 2) effectively compacts/compresses contained key-value pairs SSTables

33 Putting Everything Together LSM-Trees Definition Log-Structured Merge-Trees (LSM Trees) are multilayered search trees for key-value log-data that use different data structures, each of which optimized for its underlying storage medium memtable Example First layer (C 0 Tree): index structure that 1) efficiently takes new key-value pairs in any order 2) outputs all contained key-value pairs in sorted order B-trees, red-black trees, AVL trees, Second layer (C 1 Tree): index structure that 1) is able to merge with sorted key-value pair lists 2) effectively compacts/compresses contained key-value pairs SSTables Patrick O'Neil et. al. The log-structured merge-tree (LSM-tree), Acta Information, volume 33, number 4, pages , 1996 Analytics Slide 33

Putting Everything Together LSM-Trees Intuition Sorted trees are fast in-memory indexes but they outgrow main memory SSTables are indexable and

free memory Merge is required if a block is full! Read: search the key chronologically in every layer (C 0 C 1 C 2 ) until the first, i.e., most

34 Putting Everything Together LSM-Trees Intuition Sorted trees are fast in-memory indexes but they outgrow main memory SSTables are indexable and compact but don t support random inserts Insert: add new key-value pairs to C 0 Tree; frequently merge trees down the hierarchy (C 0 C 1 C 2 ) to free memory Merge is required if a block is full! Read: search the key chronologically in every layer (C 0 C 1 C 2 ) until the first, i.e., most recent value is found Data stores with LSM-Tree implementations LevelDB, RocksDB, Cassandra, Hbase, Lucene (in Elasticsearch and Solr), Analytics Slide 34

these queries with a Bloomfilter Size-tiered compaction Merge

35 Putting Everything Together LSM-Trees Optimizations Querying non-existend values is expensive (check all layers) Catch most of these queries with a Bloomfilter Size-tiered compaction Merge newer, smaller SSTables successively into older, larger SSTables Slide 35

36 Overview Objective Design a distributed DBMS for fast storage and retrieval of huge and evolving datasets Some further indexing-techniques Slide 36

Excursus Alternative Index Types Clustered Index Stores indexed data or parts of it within the index (instead of pointers to data) Example: An index on attribute delivery_status

Improves the performance of certain queries Reduces write performance and requires additional storage Redundant values (in data and index) complicate data consistency Multi-Column

37 Excursus Alternative Index Types Clustered Index Stores indexed data or parts of it within the index (instead of pointers to data) Example: An index on attribute delivery_status allows to count pending deliveries without data access. Improves the performance of certain queries Reduces write performance and requires additional storage Redundant values (in data and index) complicate data consistency Multi-Column Index a) Concatenated index: merge keys into one key b) Multi-dimensional index: split multi-dim. key domain into multi-dim. shapes Example: An index on two-dim. geo locations (longitude, latitude) to answer intersection, containment, and nearest neighbor queries. Most common implementation: R-Trees Slide 37

Excursus R-Tree R-Tree A variation of a B-Tree that uses a

block-sized nodes Indexed points are clustered into leaf

appropriate clusters split cluster if too large find

38 Excursus R-Tree R-Tree A variation of a B-Tree that uses a hierarchy of rectangles as keys Also: balanced and block-sized nodes Indexed points are clustered into leaf nodes might occur in multiple clusters Insertion: into appropriate clusters split cluster if too large find smallest cluster extension via heuristic if no cluster fits directly Slide 38

39 Excursus Alternative Index Types Fuzzy Index Index on terms/keys that allows for value misspellings, synonyms, variations, Idea: sparse, sorted index (e.g. SSTable or B-Tree) with similarity look up Example: An index on attribute firstname where names might be misspelled. Look up most similar key Scan the (sorted!) neighborhood of that key s value for similar values John Sno John Snow Jon Snow john snow jonny snow Slide 39

40 Check yourself Given these two SSTable segments from 16/11/2018 and 17/11/2018, calculate their compacted merge. 16/11/ /11/2018 ambition 62 area 71 argument 59 assumption 87 atmosphere 40 attitude 53 accident 63 ambition 14 ambition 27 anxiety 78 area 56 argument 85 argument 79 assistance 50 Specify the order in which the elements are accessed. Tobias Bleifuß Slide 40

41 Analytics Introduction G , Campus III Hasso Plattner Institut

Distributed Data Analytics Partitioning

Distributed Data Analytics Partitioning G-3.1.09, Campus III Hasso Plattner Institut Different mechanisms but usually used together Distributing Data Replication vs. Replication Store copies of the same data on several nodes Introduces redundancy