Ghislain Fourny Big Data 5. Wide column stores
Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2
Where we are User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Last weeks Storage 3
Where we are User interfaces Querying Data stores Indexing Processing Validation Today Data models Syntax Encoding Last weeks Storage 4
Relational model 5
Relational model Schema 6
Issues with relational databases (RDBMS) Small scale 7
Issues with relational databases (RDBMS) Small scale Single machine 8
Can we fix an RDBMS? 9
Can we fix an RDBMS? Scale up (remember?) 10
Can we fix an RDBMS? Scale out 11
Can we fix an RDBMS? Cluster Scale out 12
Can we fix an RDBMS? Cluster Replicate Scale out 13
Can we fix an RDBMS? Hard to set up Scale out 14
Can we fix an RDBMS? Hard to set up Very high maintenance costs Scale out 15
HBase By design running on a scalable cluster of commodity hardware 16
HBase By design running on a scalable cluster of commodity hardware HDFS 17
Wide column stores: data model 18
Founding paper: Google's BigTable 19
The tabular model 20
The tabular model: expensive joins 21
Design paradigm of BigTable: store together what is accessed together 22
The tabular model: expensive joins 23
3rd Normal Form: Example 24

Legi        Name          City            State
32-000-000  Alan Turing   Bletchley Park  UK
62-000-000  Georg Cantor  Pfäffikon       SZ
25-000-000  Felix Bloch   Pfäffikon       ZH

City            State  PLZ
Bletchley Park  UK     MK3 6EB
Pfäffikon       SZ     8808
Pfäffikon       ZH     8330
3rd Normal Form: Counter-Example 25

Legi        Name          City            State  PLZ
32-000-000  Alan Turing   Bletchley Park  UK     MK3 6EB
62-000-000  Georg Cantor  Pfäffikon       SZ     8808
25-000-000  Felix Bloch   Pfäffikon       ZH     8330
The tabular model: expensive joins 26
The columnar model: denormalized 27
Rows Row ID 000 002 0A1 1E0 22A 4A2 28
Rows Yes, for now this actually looks pretty much like key-value storage. Row ID 000 002 0A1 1E0 22A 4A2 29
Columns Row ID 000 002 0A1 1E0 22A 4A2 30
Columns Column family Row ID 000 002 0A1 1E0 22A 4A2 31
Column families must be known in advance... Row ID 32
Column families must be known in advance... Row ID 000 A B 1 2 I 002 0A1 1E0 22A 4A2 33
... but columns can be added on the fly Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 22A 4A2 34
Primary queries 35
Primary queries Get 36
Get Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 22A 4A2 37
Primary queries Get 38
Primary queries Get Put 39
Put Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 204 22A 4A2 40
Primary queries Get Put 41
Primary queries Get Put Scan (This is new) 42
Scan Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 204 22A 4A2 43
Primary queries Get Put Scan (This is new) 44
Primary queries Get Put Scan (This is new) Delete 45
Delete Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 204 22A 4A2 46
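The four primary queries can be illustrated with a small in-memory sketch. This is not the HBase client API: the class `ToyTable`, its method signatures, and the family names `cf1`, `cf2`, `cf3` are hypothetical names chosen for illustration. It only shows that column families are fixed at creation time, that columns can appear on the fly, and that a scan walks a range of row IDs in sorted order.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one wide column table (not the HBase API)."""

    def __init__(self, column_families):
        # Column families must be known in advance.
        self.column_families = set(column_families)
        # row ID -> {(family, column): value}
        self.rows = defaultdict(dict)

    def put(self, row_id, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a known family can be added on the fly.
        self.rows[row_id][(family, column)] = value

    def get(self, row_id):
        return dict(self.rows.get(row_id, {}))

    def scan(self, start, stop):
        """Yield rows with start <= row ID < stop, in row ID order."""
        for row_id in sorted(self.rows):
            if start <= row_id < stop:
                yield row_id, dict(self.rows[row_id])

    def delete(self, row_id):
        self.rows.pop(row_id, None)


table = ToyTable(["cf1", "cf2", "cf3"])
for row_id in ["000", "002", "0A1", "1E0", "22A", "4A2"]:
    table.put(row_id, "cf1", "A", f"value-{row_id}")
table.put("204", "cf3", "IV", "new row, new column")
scanned = [row_id for row_id, _ in table.scan("0A1", "22A")]
```

A real store keeps rows sorted on disk, so a scan does not need to sort on every call as this sketch does.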
Some terminology: Key-value model Key Value 47
Some terminology: Column-oriented stores Column1 Column2 48
Some terminology: Column-oriented key-value stores Also: wide column stores, column-family-oriented Row ID A B C 1 2 I II III IV 49
Examples of Column-oriented key-value stores: Google's BigTable 50
Warning on terminology NoSQL is very recent! 51
Warning on terminology Key-value storage Relational table Words have a "life" File Block NoSQL Object storage 52
HBase: physical level 53
Physical layer: regions Row ID A B C 1 2 I II III IV 54
Physical layer: regions Row ID A B C 1 2 I II III IV Min-incl. Max-excl. 56
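Since each region covers a range of row IDs with an inclusive minimum and an exclusive maximum, finding the region for a row ID amounts to a binary search over the sorted region start keys. The following is a minimal sketch under that assumption; the start keys and the function name `find_region` are hypothetical.

```python
import bisect

# Hypothetical start keys of four regions, sorted.
# The empty string stands for "from the very first row ID".
region_start_keys = ["", "0A1", "204", "4A2"]


def find_region(row_id):
    """Return the index of the region whose range [start, next start) holds row_id."""
    return bisect.bisect_right(region_start_keys, row_id) - 1
```

With these start keys, `find_region("0A1")` returns 1 because the minimum is inclusive, while `find_region("203")` also returns 1 because the maximum `"204"` is exclusive.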
Physical layer: column families Row ID A B C 1 2 I II III IV Min-incl. Max-excl. Stored together 57
Architecture "The same procedure as every year, James." 58
HDFS... Namenode /dir/file1 /dir/file2 /file3 Datanode Datanode Datanode Datanode Datanode Datanode 59
HBase HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 60
HMaster HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 61
HMaster DDL operations 62
HMaster DDL operations Create table 63
HMaster DDL operations Create table Delete table 64
HMaster assigns regions to RegionServers Row ID 65
HMaster splits regions Row ID 69
HMaster handles Regionserver failovers 70
Architecture HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 71
Regionserver HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 72
Physical storage Row ID Min-incl. A B C 1 2 Stored together I II III IV 73
Physical storage Row ID A B C 1 2 I II III IV Store Store Store Store Store Store 74
Store = column family Row ID 1 2 75
Store = column family Row ID 1 2 Cell 76
Store = column family Row ID 1 2 HFile HFile HFile HFile (On HDFS) 77
HFile HFile 78
HFile HFile That's actually an SSTable (flat sorted list of key-value pairs) 79
HFile HFile That's actually an SSTable (flat sorted list of key-value pairs) KeyValue (Stores a cell) 80
HFile 1 2 81
Versioning Different versions of same cell Latest 82
Versioning: timeline V1 V2 V3 V4 83
Versioning: timeline V1 V2 V3 V4 Total order: not like DynamoDB 84
Versioning: timeline V1 V2 V3 V4 Total order: not like DynamoDB A B C HBase guarantees ACID at the row level (concurrent writes and reads are synchronized with per-row locks) 85
HFile: KeyValue key value 86
HFile: KeyValue (prefix code) keylength valuelength key value 87
Prefix code example: Gamma code 10011111111001101011110011110101 88
Prefix code example: Gamma code 10011111111001101011110011110101 10 90
Prefix code example: Gamma code 10011111111001101011110011110101 10 101101011 93
Prefix code example: Gamma code 10011111111001101011110011110101 10 101101011 101 96
Prefix code example: Gamma code 10011111111001101011110011110101 10 101101011 101 1101 99
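The decoding steps above can be reproduced with a short sketch. It assumes the convention that matches the slide's output: each codeword is n ones, then a zero, then n payload bits, and the decoded value is a leading 1 followed by the payload. The function name `decode_gamma` is hypothetical, and malformed input is not handled.

```python
def decode_gamma(bits):
    """Decode a string of concatenated codewords into binary strings.

    Each codeword is n ones, a zero, then n payload bits; the decoded
    value is a leading 1 followed by the payload.
    """
    values = []
    i = 0
    while i < len(bits):
        n = 0
        while bits[i] == "1":  # unary length prefix
            n += 1
            i += 1
        i += 1  # skip the zero that ends the prefix
        values.append("1" + bits[i:i + n])
        i += n
    return values


decoded = decode_gamma("10011111111001101011110011110101")
# decoded == ["10", "101101011", "101", "1101"]
```

Because no codeword is a prefix of another, the stream needs no separators: the length prefix says exactly how many bits to read next, which is the same idea as storing keylength and valuelength in front of key and value.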
HFile: Key row length row (key) column family length column family column qualifier timestamp key type 100
HFile: Key row length row (key) column family length column family column qualifier timestamp key type This one is for the versioning 101
HFile: Key row length row (key) column family length column family column qualifier timestamp key type This one is for marking as deleted 102
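To make the layout concrete, here is a sketch that packs and unpacks a KeyValue with the fields in the order listed above. The field widths (2-byte row length, 1-byte column family length, 8-byte timestamp, 1-byte key type, 4-byte key and value lengths) are assumptions of this sketch, as are the function names and the key type value 4 used in the usage line; check them against the HBase version at hand before relying on them.

```python
import struct


def encode_key(row, family, qualifier, timestamp, key_type):
    """Pack the key fields in the order listed on the slide (widths are assumed)."""
    return (
        struct.pack(">H", len(row)) + row            # row length, row
        + struct.pack(">B", len(family)) + family    # column family length, column family
        + qualifier                                  # column qualifier (no length field)
        + struct.pack(">qB", timestamp, key_type)    # timestamp, key type
    )


def encode_keyvalue(key, value):
    """Prefix key and value with their lengths so a reader knows where each ends."""
    return struct.pack(">II", len(key), len(value)) + key + value


def decode_keyvalue(buf):
    key_len, value_len = struct.unpack_from(">II", buf, 0)
    key = buf[8:8 + key_len]
    value = buf[8 + key_len:8 + key_len + value_len]
    (row_len,) = struct.unpack_from(">H", key, 0)
    row = key[2:2 + row_len]
    fam_len = key[2 + row_len]
    fam_start = 3 + row_len
    family = key[fam_start:fam_start + fam_len]
    # The qualifier is whatever remains before the 9 trailing bytes.
    qualifier = key[fam_start + fam_len:-9]
    timestamp, key_type = struct.unpack(">qB", key[-9:])
    return row, family, qualifier, timestamp, key_type, value


key = encode_key(b"0A1", b"cf1", b"B", 1476180900000, 4)
kv = encode_keyvalue(key, b"hello")
```

The qualifier needs no length field of its own because the total key length is known and every other field has a known size.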
Blocks HFile 103
Blocks HFile "Quantity" of KeyValues that get read at a time 104
Blocks HFile Default block size: 64 KB 105
Blocks: long keys or values. If size(KeyValue) > block size: no split (longer block) 106
Inside an HFile key1 key5 key11 key17 /index /data 107
Looking up a key key1 key5 key11 key17 108
Looking up a key key1 key5 key14 key11 key17 110
Looking up a key key1 key5 key14 key11 key17 key14 113
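The lookup of key14 can be sketched as two binary searches: one over the index (the first key of each block) to pick the block, and one inside that block. Keys are modelled as integers here so that they sort numerically; the variable and function names are hypothetical.

```python
import bisect

# First key of each block, as in the index on the slide (keys as integers).
index = [1, 5, 11, 17]
blocks = {
    1: [1, 2, 3, 4],
    5: [5, 6, 7, 8, 9, 10],
    11: [11, 12, 13, 14, 15, 16],
    17: [17, 18],
}


def lookup(key):
    """Return (first key of the block searched, whether the key was found)."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None, False  # smaller than every key in the file
    first_key = index[pos]
    block = blocks[first_key]  # only this one block has to be read
    i = bisect.bisect_left(block, key)
    return first_key, i < len(block) and block[i] == key
```

For key 14, the index search lands between 11 and 17, so only the block starting at key 11 is read.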
Writing to an HFile key1 114
Writing to an HFile key1 key1 115
Writing to an HFile key1 key1 key2 key3 key4 116
Writing to an HFile key1 key1 key2 key3 key4 key5 key5 117
Writing to an HFile key1 key5 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 118
Writing to an HFile key1 key5 key11 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 119
Writing to an HFile key1 key5 key11 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 120
Writing to an HFile key1 key5 key11 key17 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 121
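The write sequence can be sketched as a single pass over KeyValues that arrive already sorted: fill the current block, close it once it has reached the block size, and record the first key of each new block in the index. The function name `write_hfile` and the exact closing rule are assumptions of this sketch; a KeyValue is never split, so an oversized one simply yields a longer block.

```python
def write_hfile(sorted_keyvalues, block_size):
    """Group already-sorted (key, value) byte pairs into blocks and build the index."""
    blocks, index = [], []
    current, current_size = [], 0
    for key, value in sorted_keyvalues:
        if current and current_size >= block_size:
            blocks.append(current)       # close the full block
            current, current_size = [], 0
        if not current:
            index.append(key)            # first key of a new block
        current.append((key, value))
        current_size += len(key) + len(value)
    if current:
        blocks.append(current)
    return index, blocks


keyvalues = [(b"key%02d" % i, b"v" * 10) for i in range(1, 11)]
index, blocks = write_hfile(keyvalues, block_size=40)
# index == [b"key01", b"key04", b"key07", b"key10"]
```

Nothing in this pass ever goes back to an earlier block, which is why the input has to be sorted before writing starts.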
Levels of physical storage Table 122
Levels of physical storage Table Region 123
Levels of physical storage Table Region Store 124
Levels of physical storage Table Region Store StoreFile 125
Levels of physical storage Table Region Store StoreFile Block 126
Levels of physical storage Table Region Store StoreFile Block KeyValue 127
Problem key1 key5 key11 key17 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 128
Problem: we can only write key-values in sorted order. Sorted: key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 129
HBase: Writing new cells 130
On Disk Table Region Store StoreFile Block KeyValue 131
Store StoreFile Block Block StoreFile Block Block 132
Store MemStore StoreFile Block Block StoreFile Block Block 133
In Memory Table Region Store MemStore Cell 134
Writing new cells MemStore StoreFile Block Block 135
Flush MemStore StoreFile StoreFile Block Block Block Block Sort! 140
Flush When: 141
Flush When: Reaching max MemStore size in a store 142
Flush When: Reaching max MemStore size in a store Reaching overall max MemStore size 143
Flush When: Reaching max MemStore size in a store Reaching overall max MemStore size Write-Ahead Log is full 144
Write-Ahead Log MemStore 145
Write-Ahead Log MemStore HLog 146
Write-Ahead Log MemStore HLog On HDFS One per RegionServer 147
Write-Ahead Log MemStore HLog 148
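The role of the Write-Ahead Log can be sketched as follows: each write is appended to the log before it is applied to the MemStore, so that the in-memory state can be rebuilt by replaying the log after a crash. The class `ToyRegionServer` and its methods are hypothetical, and a Python list stands in for the HLog file on HDFS.

```python
class ToyRegionServer:
    """Sketch: every write goes to the log first, then to the MemStore."""

    def __init__(self, log=None):
        self.log = log if log is not None else []  # stands in for the HLog on HDFS
        self.memstore = {}

    def put(self, key, value):
        self.log.append((key, value))  # 1. durable record of the write
        self.memstore[key] = value     # 2. in-memory update

    @classmethod
    def recover(cls, log):
        """Rebuild the MemStore of a crashed server by replaying its log."""
        server = cls(log=list(log))
        for key, value in log:
            server.memstore[key] = value
        return server


server = ToyRegionServer()
server.put("k1", "a")
server.put("k2", "b")
server.put("k1", "c")
surviving_log = list(server.log)  # the MemStore is lost in a crash, the log is not
recovered = ToyRegionServer.recover(surviving_log)
```

Once a MemStore has been flushed to a StoreFile, the corresponding log entries are no longer needed, which is why a full log is one of the flush triggers.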
Reading from a Store MemStore StoreFile Block Block StoreFile Block Block 149
Compaction StoreFile StoreFile StoreFile Block Block Block Block Block Block 151
Compaction StoreFile (Sort again) Block Block Block Block Block Block 153
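Writing, flushing, reading, and compacting fit together as in the following sketch of a single store. The class `ToyStore`, the entry-count flush threshold, and the linear search inside a file are simplifications chosen for illustration; a real store flushes on size in bytes and uses the HFile index to search.

```python
class ToyStore:
    """Sketch of one store: a MemStore plus a list of sorted, immutable files."""

    def __init__(self, max_memstore_entries=3):
        self.max_memstore_entries = max_memstore_entries  # hypothetical flush threshold
        self.memstore = {}
        self.store_files = []  # each is a sorted list of (key, value); newest last

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.max_memstore_entries:
            self.flush()

    def flush(self):
        if self.memstore:
            # Sort, then write out in one sequential pass.
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):  # newest file wins
            for k, v in store_file:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for store_file in self.store_files:  # oldest first, so newer values overwrite
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []
```

Reads have to consult the MemStore and every StoreFile, so compaction, which merges the files into one sorted file, keeps reads from slowing down as flushes accumulate.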
Seek vs. Transfer B+-trees LSM-Trees 154
Seek vs. Transfer Classical RDBMS Wide column stores 155
Log-Structured Merge-Trees 156
Log-Structured Merge-Trees C0 C1 C2 ... 157
Log-Structured Merge-Trees merge merge C0 C1 C2 ... 158
Seek vs. Transfer Seek-time-bound Transfer-time-bound 159
The META table: a table like any other 160
The META table: stores region locations. Row ID: table + region start key + region id + replica id. Columns: info:regioninfo, info:server (www.example.com:0), info:serverstartcode (2016-10-11T10:15:00) 161
RegionInfo: Table name, Start key, Region ID, Replica ID, encodedname, End key, Split, Offline 162
HBase Bootstrap Root 163
HBase Bootstrap Root Meta 164
HBase Bootstrap Root Meta Regular tables 165
Architecture HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 166
Architecture HMaster Create/delete/update table Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 167
Architecture HMaster Region? Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver (hosting meta) 168
Architecture HMaster Region? Regionserver location(s) Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 169
Architecture HMaster Query Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 170
HBase: Underlying APIs 171
HBase implementation (Packaged code) 172
HBase APIs REST 173
HBase: caching 174
HBase Caches: reading faster LRU block cache Level 1 175
HBase Caches: reading faster LRU block cache bucket cache Level 1 Level 2 176
HBase Caches: reading faster LRU block cache bucket cache HDFS Level 1 Level 2 177
LRU Block Cache: Least Recently Used, on the heap 178
LRU Block Cache: levels of priority Single access priority Multi access priority In-memory access priority 179
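The least-recently-used policy itself can be sketched with an ordered dictionary: every access moves a block to the "recent" end, and the block at the other end is evicted when the cache is full. The class `LRUBlockCache` below is a hypothetical illustration that counts capacity in blocks and does not model the three priority levels.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Sketch of LRU eviction; capacity is counted in blocks, not bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, block_id):
        if block_id not in self.blocks:
            return None  # cache miss: the caller reads the block from disk
        self.blocks.move_to_end(block_id)  # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, block):
        self.blocks[block_id] = block
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
```

A long scan would push every block it touches through such a cache and evict the blocks that other queries keep reusing, which is one way to see why batch processing is a poor fit for it.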
When NOT to use the cache 180
When NOT to use the cache Batch processing 181
When NOT to use the cache Random access 182
Summary of what we have in memory 183
Summary of what we have in memory MemStore 184
Summary of what we have in memory MemStore LRU BlockCache 185
Summary of what we have in memory MemStore LRU BlockCache Indices of HFiles 186
Summary of what we have in memory MemStore LRU BlockCache Indices of HFiles Bloom filters (avoid disk reads when we can guarantee that a key is not in an HFile) 187
Hash function Source: Jorge Stolfi (Wikipedia) 188
Bloom filter Tells very quickly whether an element belongs to a set (potentially with false positives) 189
Bloom filter 0 0 0 0 0 0 0 0 0 0 0 0 190
Bloom filter John Smith hash function 1 hash function 2 hash function k 0 1 1 0 0 0 0 1 0 0 0 0 191
Bloom filter Mary Smith hash function 1 hash function 2 hash function k 0 1 1 0 0 1 1 1 0 0 0 0 192
Bloom filter: not in set 0 1 1 0 0 1 1 1 0 0 0 0 hash function 1 hash function 2 hash function k Albert Einstein? 193
Bloom filter: in set (and correct) 0 1 1 0 0 1 1 1 0 0 0 0 hash function 1 hash function 2 hash function k Mary Smith? 194
Bloom filter: in set (false positive) 0 1 1 0 0 1 1 1 0 0 0 0 hash function 1 hash function 2 hash function k Louis de Broglie? 195
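The three cases above follow from a very small data structure, sketched below. The class `BloomFilter`, the default of 12 bits and 3 hash functions, and the way the k hash functions are derived from SHA-256 with a different prefix are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Sketch of a Bloom filter over a fixed-size bit array."""

    def __init__(self, num_bits=12, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive k hash functions by prefixing the item with the function's number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "possibly in the set".
        return all(self.bits[pos] for pos in self._positions(item))


names = BloomFilter()
names.add("John Smith")
names.add("Mary Smith")
```

An added element always finds all of its bits set, so there are no false negatives. An element that was never added can still find all of its bits set by other elements, which is the false positive case.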
Data Locality 196
HBase vs. HDFS 197
With HDFS load balancer... 198
HFile compaction brings back locality 199
Best practices 200
Number of rows. Millions: RDBMS. Billions: HBase. 201
Number of nodes > 5 202
Row IDs and column names: keep them short. Why? (The row ID, column family, and column qualifier are stored again in every KeyValue.) 203
10 Design Principles of Big Data 204
1. Learn from the past 205
2. Keep the design simple 206
3. Modularize the architecture 207
4. Homogeneity in the large 208
5. Heterogeneity in the small 209
6. Separate metadata from data 210
7. Abstract logical model from its physical implementation 211
8. Shard the data 212
9. Replicate the data 213
10. Buy lots of cheap hardware 214
Spanner 215
Spanner new: externally-consistent distributed transactions 216
Spanner Tabular Data Model Language ACID properties Distribution Scalability Sharding Replicas SQL NoSQL 217
Spanner: Data Model Multi-column primary key 219
Spanner: Data Model Multi-column primary key Timestamp 220
Spanner: Data Model Multi-column primary key Timestamp Directory 221
Spanner: Data Model Multi-column primary key Timestamp Tablet 222
Spanner: Data Scale 1,000,000,000,000s of rows 223
Spanner: Architecture of a Zone zonemaster Spanserver Spanserver Spanserver Spanserver Spanserver Spanserver 224
Spanner: Architecture universemaster 225
Spanner: Architecture Data Center Data Center Data Center 226
Spanner: Architecture Replica Replica Replica Paxos Paxos Paxos Tablet Tablet Tablet Colossus Colossus Colossus 227
Spanner: Architecture 100s of data centers 228
Spanner: Architecture 1,000,000s of machines 229
Spanner: Architecture Higher availability Lower latency 230