Ghislain Fourny. Big Data 5. Wide column stores


Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2


Where we are User interfaces Querying Data stores Indexing Processing Validation Today Data models Syntax Encoding Last weeks Storage 4

Relational model: schema

Issues with relational databases (RDBMS): small scale, single machine

Can we fix an RDBMS?

Scale up (remember?)

Scale out: cluster, replicate

Scale out: hard to set up, very high maintenance costs

HBase: by design running on a scalable cluster of commodity hardware, on top of HDFS

Wide column stores: data model 18

Founding paper: Google's BigTable

The tabular model 20

The tabular model: expensive joins 21

Design paradigm of BigTable: store together what is accessed together


3rd Normal Form: Example

Legi        Name          City            State
32-000-000  Alan Turing   Bletchley Park  UK
62-000-000  Georg Cantor  Pfäffikon       SZ
25-000-000  Felix Bloch   Pfäffikon       ZH

City            State  PLZ
Bletchley Park  UK     MK3 6EB
Pfäffikon       SZ     8808
Pfäffikon       ZH     8330

3rd Normal Form: Counter-Example

Legi        Name          City            State  PLZ
32-000-000  Alan Turing   Bletchley Park  UK     MK3 6EB
62-000-000  Georg Cantor  Pfäffikon       SZ     8808
25-000-000  Felix Bloch   Pfäffikon       ZH     8330


The columnar model: denormalized

Rows: each row is identified by a Row ID (000, 002, 0A1, 1E0, 22A, 4A2). Yes, for now this actually looks pretty much like key-value storage.

Columns: grouped into column families

Column families must be known in advance (in the figure: three families, initially holding the columns A, B / 1, 2 / I) ...

... but columns can be added on the fly (in the figure: C, II, III, IV)

Primary queries: Get, Put, Scan (this is new), Delete

Get (figure: table with Row IDs 000, 002, 0A1, 1E0, 22A, 4A2 and columns A, B, C, 1, 2, I, II, III, IV)

Put (figure: a row 204 is added between 1E0 and 22A)

Scan (figure: same table)

Delete (figure: same table)
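The four primary queries can be sketched with a toy in-memory table. This is only an illustration of the data model, not the HBase API: the class `ToyTable`, its method names and the family names `f1`, `f2` are made up for this sketch, and rows are simply kept in a sorted Python list.

```python
import bisect


class ToyTable:
    """Toy wide column table: row ID -> {(family, qualifier): value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.row_ids = []              # row IDs, kept sorted
        self.rows = {}                 # row ID -> dict of cells

    def put(self, row_id, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_id not in self.rows:
            bisect.insort(self.row_ids, row_id)
            self.rows[row_id] = {}
        # Column qualifiers need no declaration: they are added on the fly.
        self.rows[row_id][(family, qualifier)] = value

    def get(self, row_id):
        return dict(self.rows.get(row_id, {}))

    def scan(self, start, stop):
        """Yield (row_id, cells) for start <= row_id < stop, in sorted order."""
        lo = bisect.bisect_left(self.row_ids, start)
        hi = bisect.bisect_left(self.row_ids, stop)
        for row_id in self.row_ids[lo:hi]:
            yield row_id, dict(self.rows[row_id])

    def delete(self, row_id):
        if row_id in self.rows:
            del self.rows[row_id]
            self.row_ids.remove(row_id)


table = ToyTable(families=["f1", "f2"])
for rid in ["000", "002", "0A1", "1E0", "22A", "4A2"]:
    table.put(rid, "f1", "A", f"value-{rid}")
table.put("204", "f2", "I", "new row")  # lands between 1E0 and 22A
scanned = [rid for rid, _ in table.scan("1E0", "4A2")]
print(scanned)  # ['1E0', '204', '22A']
```

Keeping the row IDs sorted is what makes Scan cheap: a range of rows is a contiguous slice.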

Some terminology: Key-value model: key, value

Some terminology: Column-oriented stores: Column1, Column2

Some terminology: Column-oriented key-value stores (also: wide column stores, column-family-oriented stores): a Row ID plus columns A, B, C, 1, 2, I, II, III, IV

Examples of column-oriented key-value stores: Google's BigTable

Warning on terminology: NoSQL is very recent!

Warning on terminology: words have a "life" (key-value storage, relational table, file, block, NoSQL, object storage)

HBase: physical level 53

Physical layer: regions. The table is partitioned by Row ID into regions; each region covers a range of Row IDs (min inclusive, max exclusive)

Physical layer: column families. Within a region, the columns of one column family are stored together
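Finding the region in charge of a Row ID amounts to a binary search over the region start keys. The sketch below assumes made-up boundaries and server names (`region_starts`, `region_servers`); it illustrates the min-inclusive, max-exclusive convention and is not how HBase is implemented.

```python
import bisect

# Hypothetical region boundaries: region i covers [starts[i], starts[i+1]).
# The empty string sorts before every row ID, so the first region is open at
# the bottom, and the last region is open at the top.
region_starts = ["", "0A1", "22A"]
region_servers = ["rs1", "rs2", "rs3"]  # made-up server names


def find_region(row_id):
    """Return the index of the region whose range contains row_id."""
    # bisect_right counts the start keys <= row_id; the last of them wins,
    # which makes the lower bound inclusive and the upper bound exclusive.
    return bisect.bisect_right(region_starts, row_id) - 1


def split_region(index, split_key):
    """Split region `index` in two at split_key (both halves stay on the same server)."""
    upper = region_starts[index + 1] if index + 1 < len(region_starts) else None
    if split_key <= region_starts[index] or (upper is not None and split_key >= upper):
        raise ValueError("split key must lie strictly inside the region")
    region_starts.insert(index + 1, split_key)
    region_servers.insert(index + 1, region_servers[index])


print(region_servers[find_region("1E0")])  # rs2
```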

Architecture "The same procedure as every year, James." 58

HDFS: one NameNode (holding the namespace: /dir/file1, /dir/file2, /file3) and many DataNodes

HBase: one HMaster and many RegionServers

HMaster: DDL operations (create table, delete table)

HMaster assigns regions to RegionServers

HMaster splits regions

HMaster handles RegionServer failovers

RegionServer

Physical storage: within a region, the columns of one column family are stored together in a Store, so there is one Store per region and column family

Store = column family (within one region); a Store holds cells

A Store is persisted as HFiles (on HDFS)

HFile: that's actually an SSTable (a flat sorted list of key-value pairs); each KeyValue stores a cell

Versioning: different versions of the same cell can coexist, one of them being the latest

Versioning: timeline V1, V2, V3, V4. Total order: not like DynamoDB

HBase guarantees ACID on the row level (concurrent writes and reads are synchronized with per-row locks)
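A totally ordered version timeline can be sketched as a list of (timestamp, value) pairs kept newest first. The class `VersionedCell` and its `max_versions` parameter are assumptions of this sketch, not names taken from the slides.

```python
class VersionedCell:
    """Toy cell keeping several timestamped versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # hypothetical retention limit
        self.versions = []                # list of (timestamp, value)

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        # Timestamps give a total order: sort newest first, drop the oldest.
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]

    def get(self, as_of=None):
        """Return the latest value, or the latest one not newer than as_of."""
        for timestamp, value in self.versions:
            if as_of is None or timestamp <= as_of:
                return value
        return None


cell = VersionedCell(max_versions=3)
for ts, value in [(1, "V1"), (2, "V2"), (3, "V3"), (4, "V4")]:
    cell.put(ts, value)
print(cell.get())         # V4
print(cell.get(as_of=3))  # V3
```

Because every version carries a single timestamp, any two versions are comparable, which is the "total order" of the timeline.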

HFile: KeyValue (prefix code): keylength, valuelength, key, value

Prefix code example: Gamma code. The bit string 10011111111001101011110011110101 is read from left to right and decodes, one code word after the other, to 10, 101101011, 101 and 1101
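The decoding can be reproduced with a few lines of code. The reading assumed here is the one consistent with the groups shown on the slide: a run of n ones, a terminating zero, then n bits that are prefixed with an implicit leading 1. The function names are made up for this sketch, and malformed input is not handled.

```python
def decode_gamma(bits):
    """Decode a string of '0'/'1' characters into a list of binary numerals."""
    values = []
    i = 0
    while i < len(bits):
        # Unary part: count the 1s up to the terminating 0.
        n = 0
        while bits[i] == "1":
            n += 1
            i += 1
        i += 1  # skip the terminating 0
        # Binary part: an implicit leading 1 followed by n explicit bits.
        values.append("1" + bits[i:i + n])
        i += n
    return values


def encode_gamma(numeral):
    """Encode a binary numeral that starts with '1' (inverse of decode_gamma)."""
    n = len(numeral) - 1
    return "1" * n + "0" + numeral[1:]


decoded = decode_gamma("10011111111001101011110011110101")
print(decoded)  # ['10', '101101011', '101', '1101']
```

No separator is needed between code words: the unary part announces how many bits follow, so no code word is a prefix of another.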

HFile: Key: row length, row (key), column family length, column family, column qualifier, timestamp (this one is for the versioning), key type (this one is for marking as deleted)
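The layout above can be sketched with Python's `struct` module. The field order follows the slides; the field widths (4-byte lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the key type codes are assumptions of this sketch and should be checked against the HBase documentation before relying on them.

```python
import struct

PUT, DELETE = 4, 8  # illustrative key type codes (assumption)


def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    """Serialize one cell as keylength, valuelength, key, value."""
    key = (
        struct.pack(">H", len(row)) + row
        + struct.pack(">B", len(family)) + family
        + qualifier
        + struct.pack(">qB", timestamp, key_type)
    )
    return struct.pack(">ii", len(key), len(value)) + key + value


def decode_keyvalue(buf, offset=0):
    """Parse one KeyValue at offset; return (fields, offset of the next one)."""
    key_len, value_len = struct.unpack_from(">ii", buf, offset)
    pos = offset + 8
    key_end = pos + key_len
    (row_len,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    row = buf[pos:pos + row_len]
    pos += row_len
    (family_len,) = struct.unpack_from(">B", buf, pos)
    pos += 1
    family = buf[pos:pos + family_len]
    pos += family_len
    # The qualifier has no length field of its own: it fills the gap before
    # the fixed-size timestamp (8 bytes) and key type (1 byte).
    qualifier = buf[pos:key_end - 9]
    timestamp, key_type = struct.unpack_from(">qB", buf, key_end - 9)
    value = buf[key_end:key_end + value_len]
    return (row, family, qualifier, timestamp, key_type, value), key_end + value_len


data = (
    encode_keyvalue(b"0A1", b"f1", b"A", 1476180900, PUT, b"hello")
    + encode_keyvalue(b"0A1", b"f1", b"B", 1476180901, DELETE, b"")
)
first, next_offset = decode_keyvalue(data)
second, end_offset = decode_keyvalue(data, next_offset)
```

Since every record starts with its own lengths, a reader can walk a block record by record without any delimiter.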

Blocks: an HFile is divided into blocks; a block is the "quantity" of KeyValues that get read at a time

Blocks: the default block size of an HFile is 64 KB

Blocks: long keys or values: if size(keyvalue) > block size, there is no split (the block is simply longer)

Inside an HFile: an index (/index), here with the entries key1, key5, key11, key17, and the data blocks (/data)

Looking up a key (here key14): find the last index entry that is not greater than key14, which is key11; read the block that starts at key11; scan it until key14 is found
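The lookup can be sketched as a binary search on the index followed by a scan of a single block. The keys are modeled as integers (1 to 18) so that they sort numerically; the block boundaries are those of the slides, everything else (names, values) is made up for the sketch.

```python
import bisect

# Toy HFile: 18 sorted KeyValues in blocks that start at keys 1, 5, 11 and 17.
boundaries = [1, 5, 11, 17]
blocks = [
    [(k, f"value{k}") for k in range(lo, hi)]
    for lo, hi in zip(boundaries, boundaries[1:] + [19])
]
index = [block[0][0] for block in blocks]  # first key of every block


def find_block(key):
    """Index of the only block that can contain key, or -1 if there is none."""
    return bisect.bisect_right(index, key) - 1


def lookup(key):
    i = find_block(key)
    if i < 0:
        return None
    for k, value in blocks[i]:  # only this one block is read
        if k == key:
            return value
    return None


print(find_block(14), lookup(14))  # 2 value14
```

Only the small index has to be in memory; each lookup then reads one block instead of the whole file.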

Writing to an HFile: KeyValues are appended in sorted order (key1, key2, ..., key18); whenever a new block is started, its first key is added to the index (key1, key5, key11, key17)

Levels of physical storage: Table, Region, Store, StoreFile, Block, KeyValue

Problem: we can only write key-values in sorted order

HBase: Writing new cells

On disk: Table, Region, Store, StoreFile, Block, KeyValue

In memory: Table, Region, Store, MemStore, Cell

Store: one MemStore (in memory) plus StoreFiles (on disk, each made of blocks)

Writing new cells: new cells are first added to the MemStore

Flush: the content of the MemStore is sorted (Sort!) and written out as a new StoreFile

Flush when: reaching the max MemStore size in a store; reaching the overall max MemStore size; reaching a full Write-Ahead Log

Write-Ahead Log: what is written to the MemStore is also logged in the HLog (on HDFS, one per RegionServer)
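The write path can be sketched as follows. This is a simplification under stated assumptions: the class `ToyStore` and its `flush_threshold` are made up, the log is a plain list standing in for the HLog, and one log per Store is used here although the slides say there is one per RegionServer.

```python
class ToyStore:
    """Toy write path: log first, buffer in memory, flush sorted to 'disk'."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for the HLog on HDFS
        self.memstore = {}     # in-memory buffer: key -> value
        self.store_files = []  # each one an immutable, sorted list of pairs
        self.flush_threshold = flush_threshold  # hypothetical max MemStore size

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the write, for crash recovery
        self.memstore[key] = value     # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        # Sort once, then write the whole StoreFile sequentially.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()  # flushed entries no longer need the log

    def recover(self):
        """Rebuild the MemStore from the log after a simulated crash."""
        self.memstore = dict(self.wal)


store = ToyStore(flush_threshold=3)
for key in [5, 1, 3]:  # arrives in arbitrary order
    store.put(key, f"value{key}")
store.put(2, "value2")
print(store.store_files)  # [[(1, 'value1'), (3, 'value3'), (5, 'value5')]]
```

The MemStore is what lets cells arrive in any order even though a StoreFile can only be written in sorted order.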

Reading from a Store: both the MemStore and the StoreFiles have to be considered

Compaction: several StoreFiles are merged into a single StoreFile (sort again)
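Reading and compaction can be sketched on plain Python lists. The assumptions are that StoreFiles are sorted lists of (key, value) pairs ordered from oldest to newest file, that the newest value of a key wins, and that deletion markers are ignored; the function names are made up.

```python
import heapq


def read(key, memstore, store_files):
    """Newest data wins: MemStore first, then StoreFiles from newest to oldest."""
    if key in memstore:
        return memstore[key]
    for store_file in reversed(store_files):
        for k, value in store_file:  # a real HFile would use its index here
            if k == key:
                return value
    return None


def compact(store_files):
    """Merge sorted StoreFiles (oldest first) into one sorted StoreFile."""
    # Tag every pair with the negated age of its file so that, among equal
    # keys, the version from the newest file comes out of the merge first.
    tagged = [
        [(k, -age, value) for k, value in store_file]
        for age, store_file in enumerate(store_files)
    ]
    merged = []
    last_key = object()  # sentinel that equals no real key
    for k, _, value in heapq.merge(*tagged):
        if k != last_key:
            merged.append((k, value))
            last_key = k
    return merged


files = [[(1, "a"), (5, "e")], [(1, "A"), (3, "c")]]  # oldest first
compacted = compact(files)
print(compacted)  # [(1, 'A'), (3, 'c'), (5, 'e')]
```

The merge reads every input file sequentially, once, which is why compaction is bound by transfer time rather than seek time.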

Seek vs. transfer: B+-trees (classical RDBMS) vs. LSM-trees (wide column stores)

Log-Structured Merge-Trees: levels C0, C1, C2, ...; each level is merged into the next one

Seek vs. transfer: B+-trees are seek-time-bound, LSM-trees are transfer-time-bound

The META table: a table like any other 160

The META table: stores region locations. Row key: table + region start key + region id + replica id. Columns: info:regioninfo, info:server (e.g. www.example.com:0), info:serverstartcode (e.g. 2016-10-11T10:15:00)

RegionInfo: table name, start key, region ID, replica ID, encodedname, end key, split, offline

HBase bootstrap: Root, then Meta, then regular tables

Architecture: the HMaster handles create/delete/update table requests

Architecture: to find a region (Region?), the client asks the RegionServer hosting meta, which answers with the RegionServer location(s)

Architecture: the query itself is then sent to the RegionServer in charge

HBase: Underlying APIs

HBase implementation (Packaged code) 172

HBase APIs REST 173

HBase: caching 174

HBase caches (reading faster): level 1 is the LRU block cache, level 2 is the bucket cache, and behind them is HDFS

LRU block cache: Least Recently Used, on the heap

LRU block cache: levels of priority: single access priority, multi access priority, in-memory access priority
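The eviction policy of an LRU cache can be sketched with an `OrderedDict`. The class name and the `load_block` callback are assumptions of this sketch, and the three priority levels are not modeled.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy block cache that evicts the least recently used block."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity
        self.load_block = load_block  # called on a miss, e.g. a read from HDFS
        self.blocks = OrderedDict()   # least recently used block comes first
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # now the most recently used
            return self.blocks[block_id]
        self.misses += 1
        block = self.load_block(block_id)
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used
        return block


cache = LRUBlockCache(capacity=2, load_block=lambda block_id: f"block-{block_id}")
for block_id in [1, 2, 1, 3, 2]:
    cache.get(block_id)
print(cache.hits, cache.misses, list(cache.blocks))  # 1 4 [3, 2]
```

In the run above, block 2 is evicted when block 3 arrives because block 1 was touched more recently, so the final read of block 2 is a miss again.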

When NOT to use the cache: batch processing, random access

Summary of what we have in memory: MemStore, LRU BlockCache, indices of HFiles, Bloom filters (they avoid disk reads if we can guarantee that a key is not in an HFile)

Hash function Source: Jorge Stolfi (Wikipedia) 188

Bloom filter: tells very quickly whether an element belongs to a set (with potential false positives)

Bloom filter: starts as an array of bits that are all 0: 0 0 0 0 0 0 0 0 0 0 0 0

Bloom filter: adding John Smith: hash functions 1, 2, ..., k each set one bit: 0 1 1 0 0 0 0 1 0 0 0 0

Bloom filter: adding Mary Smith: 0 1 1 0 0 1 1 1 0 0 0 0

Bloom filter: Albert Einstein? Not in set (at least one of his bits is 0)

Bloom filter: Mary Smith? In set (and correct)

Bloom filter: Louis de Broglie? In set (false positive: all his bits were set by other elements)
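A Bloom filter fits in a few lines. The sketch below assumes 12 bits as on the slides and derives its k hash functions from SHA-256 with different prefixes; this choice of hash functions, like the class and method names, is an assumption made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, possibly false positives."""

    def __init__(self, num_bits=12, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Simulate k independent hash functions by prefixing the item with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        # False: definitely not in the set. True: in the set, or a false positive.
        return all(self.bits[position] for position in self._positions(item))


names = BloomFilter()
names.add("John Smith")
names.add("Mary Smith")
print(names.might_contain("Mary Smith"))  # True
```

A `False` answer is what allows HBase to skip an HFile without touching the disk; a `True` answer still has to be confirmed by an actual read.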

Data Locality 196

HBase vs. HDFS 197

With HDFS load balancer... 198

HFile compaction brings back locality 199

Best practices 200

Number of rows: millions: RDBMS; billions: HBase

Number of nodes: > 5

Row IDs and column names: keep them short (why?)

10 Design Principles of Big Data

1. Learn from the past
2. Keep the design simple
3. Modularize the architecture
4. Homogeneity in the large
5. Heterogeneity in the small
6. Separate metadata from data
7. Abstract logical model from its physical implementation
8. Shard the data
9. Replicate the data
10. Buy lots of cheap hardware

Spanner 215

Spanner new: externally-consistent distributed transactions 216

Spanner: from SQL, the tabular data model, the language and the ACID properties; from NoSQL, distribution, scalability, sharding and replicas

Spanner: Data Model: multi-column primary key, timestamp, directory, tablet

Spanner: Data Scale 1,000,000,000,000s of rows 223

Spanner: Architecture of a zone: one zonemaster and many spanservers

Spanner: Architecture: one universemaster, several data centers

Spanner: Architecture: tablets are replicated; each replica is stored on Colossus and the replicas are coordinated with Paxos

Spanner: Architecture: 100s of data centers, 1,000,000s of machines

Spanner: Architecture: higher availability, lower latency