Ghislain Fourny. Big Data 5. Wide column stores

Size: px

Start display at page:

Download "Ghislain Fourny. Big Data 5. Wide column stores"

Melanie Francis
5 years ago
Views:

1 Ghislain Fourny Big Data 5. Wide column stores

2 Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2

3 Where we are User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Last weeks Storage 3

4 Where we are User interfaces Querying Data stores Indexing Processing Validation Today Data models Syntax Encoding Last weeks Storage 4

5 Relational model 5

6 Relational model Schema 6

7 Issues with relational databases (RDBMS) Small scale 7

8 Issues with relational databases (RDBMS) Small scale Single machine 8

9 Can we fix a RDBMS? 9

10 Can we fix a RDBMS? Scale up (remember?) 10

11 Can we fix a RDBMS? Scale out 11

12 Can we fix a RDBMS? Cluster Scale out 12

13 Can we fix a RDBMS? Cluster Replicate Scale out 13

14 Can we fix a RDBMS? Hard to set up Scale out 14

15 Can we fix a RDBMS? Hard to set up Very high maintenance costs Scale out 15

16 HBase By design running on a scalable cluster of commodity hardware 16

17 HBase By design running on a scalable cluster of commodity hardware HDFS 17

18 Wide column stores: data model 18

19 Founding paper 's BigTable 19

20 The tabular model 20

21 The tabular model: expensive joins 21

22 Design paradigm of BigTable store together what is accessed together 22

23 The tabular model: expensive joins

24 3rd Normal Form: Example Legi Name City State Alan Turing Bletchley Park UK City State PLZ Bletchley Park UK MK3 6EB Alan Turing Bletchley Park UK Bletchley Park UK MK3 6EB Georg Cantor Pfäffikon SZ Pfäffikon SZ Georg Cantor Pfäffikon SZ Pfäffikon SZ Felix Bloch Pfäffikon ZH Pfäffikon ZH

25 3rd Normal Form: Counter-Example Legi Name City State PLZ Alan Turing Bletchley Park UK MK3 6EB Alan Turing Bletchley Park UK MK3 6EB Georg Cantor Pfäffikon SZ Georg Cantor Pfäffikon SZ Felix Bloch Pfäffikon ZH

26 The tabular model: expensive joins

27 The columnar model: denormalized

28 Rows Row ID A1 1E0 22A 4A2 28

29 Rows Yes, for now this actually looks pretty much like key-value storage. Row ID A1 1E0 22A 4A2 29

30 Columns Row ID A1 1E0 22A 4A2 30

31 Columns Column family Row ID A1 1E0 22A 4A2 31

32 Column families must be known in advance... Row ID 32

33 Column families must be known in advance... Row ID 000 A B 1 2 I 002 0A1 1E0 22A 4A2 33

34 ... but columns can be added on the fly Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 22A 4A2 34

35 Primary queries 35

36 Primary queries Get 36

37 Get Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 22A 4A2 37

38 Primary queries Get 38

39 Primary queries Get Put 39

40 Put Row ID 000 A B C 1 2 I II III IV 002 0A1 1E A 4A2 40

41 Primary queries Get Put 41

42 Primary queries Get Put Scan (This is new) 42

43 Scan Row ID 000 A B C 1 2 I II III IV 002 0A1 1E A 4A2 43

44 Primary queries Get Put Scan (This is new) 44

45 Primary queries Get Put Scan Delete (This is new) 45

46 Delete Row ID 000 A B C 1 2 I II III IV 002 0A1 1E A 4A2 46

47 Some terminology: Key-value model Key Value 47

48 Some terminology: Column-oriented stores Column1 Column2 48

49 Some terminology: Column-oriented key-value stores Also: wide column stores, column-family-oriented Row ID A B C 1 2 I II III IV 49

50 Examples of Column-oriented key-value stores 's BigTable 50

51 Warning on terminology NoSQL is very recent! 51

52 Warning on terminology Key-value storage Relational table Words have a "life" File Block NoSQL Object storage 52

53 HBase: physical level 53

54 Physical layer: regions Row ID A B C 1 2 I II III IV 54

55 Physical layer: regions Row ID A B C 1 2 I II III IV 55

56 Physical layer: regions Row ID A B C 1 2 I II III IV Min-incl. Max-excl. 56

57 Physical layer: column families Row ID A B C 1 2 I II III IV Min-incl. Max-excl. Stored together 57

58 Architecture "The same procedure as every year, James." 58

59 HDFS... Namenode /dir/file1 /dir/file2 /file3 Datanode Datanode Datanode Datanode Datanode Datanode 59

60 HBase HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 60

61 HMaster HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 61

62 HMaster DDL operations 62

63 HMaster DDL operations Create table 63

64 HMaster DDL operations Create table Delete table 64

65 HMaster assigns regions to RegionServers Row ID 65

66 HMaster assigns regions to RegionServers Row ID 66

67 HMaster assigns regions to RegionServers Row ID 67

68 HMaster assigns regions to RegionServers Row ID 68

69 HMaster splits regions Row ID 69

70 HMaster handles Regionserver failovers 70

71 Architecture HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 71

72 Regionserver HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 72

73 Physical storage Row ID Min-incl. A B C 1 2 Stored together I II III IV 73

74 Physical storage Row ID A B C 1 2 I II III IV Store Store Store Store Store Store 74

75 Store = column family Row ID

76 Store = column family Row ID 1 2 Cell 76

77 Store = column family Row ID 1 2 HFile HFile HFile HFile (On HDFS) 77

78 HFile HFile 78

79 HFile HFile That's actually an SSTable (flat sorted list of key-value pairs) 79

80 HFile HFile KeyValue That's actually an SSTable (flat sorted list of key-value pairs) (Stores a cell) 80

81 HFile

82 Versioning Different versions of same cell Latest 82

83 Versioning: timeline V 1 V 2 V 3 V 4 83

84 Versioning: timeline V 1 V 2 V 3 V 4 Total order: not like DynamoDB 84

85 Versioning: timeline V 1 V 2 V 3 V 4 Total order: not like DynamoDB A B C HBase guarantees ACID on the row level (concurrent writes and reads are synchronizing with per-row locks) 85

86 HFile: KeyValue key value 86

87 HFile: KeyValue (prefix code) keylength valuelength key value 87

88 Prefix code example: Gamma code

89 Prefix code example: Gamma code

90 Prefix code example: Gamma code

91 Prefix code example: Gamma code

92 Prefix code example: Gamma code

93 Prefix code example: Gamma code

94 Prefix code example: Gamma code

95 Prefix code example: Gamma code

96 Prefix code example: Gamma code

97 Prefix code example: Gamma code

98 Prefix code example: Gamma code

99 Prefix code example: Gamma code

100 HFile: Key row length row (key) column family length column family column qualifier timestamp key type 100

101 HFile: Key row length row (key) column family length column family column qualifier timestamp key type This one is for the versioning 101

102 HFile: Key row length row (key) column family length column family column qualifier timestamp key type This one is for marking as deleted 102

103 Blocks HFile 103

104 Blocks HFile "Quantity" of KeyValues that get read at a time 104

105 Blocks Default HFile 64kb 105

106 Blocks: long keys or values size(keyvalue) > block size No split (longer block) 106

107 Inside an HFile key1 key5 key11 key17 /index /data 107

108 Looking up a key key1 key5 key11 key17 108

109 Looking up a key key1 key5 key11 key17 109

110 Looking up a key key1 key5 key14 key11 key17 110

111 Looking up a key key1 key5 key14 key11 key17 111

112 Looking up a key key1 key5 key14 key11 key17 112

113 Looking up a key key1 key5 key14 key11 key17 key14 113

114 Writing to an HFile key1 114

115 Writing to an HFile key1 key1 115

116 Writing to an HFile key1 key1 key2 key3 key4 116

117 Writing to an HFile key1 key1 key2 key3 key4 key5 key5 117

118 Writing to an HFile key1 key5 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 118

119 Writing to an HFile key1 key5 key11 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 119

120 Writing to an HFile key1 key5 key11 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 120

121 Writing to an HFile key1 key5 key11 key17 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 121

122 Levels of physical storage Table 122

123 Levels of physical storage Table Region 123

124 Levels of physical storage Table Region Store 124

125 Levels of physical storage Table Region Store StoreFile 125

126 Levels of physical storage Table Region Store StoreFile Block 126

127 Levels of physical storage Table Region Store StoreFile Block KeyValue 127

128 Problem key1 key5 key11 key17 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 128

129 Problem key1 We can only write key-values in sorted order Sorted key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 129

130 HBase: Writing new cells 130

131 On Disk Table Region Store StoreFile Block KeyValue 131

132 Store StoreFile Block Block StoreFile Block Block 132

133 Store MemStore radub85 / 123RF Stock Photo StoreFile Block Block StoreFile Block Block 133

134 In Memory Table Region Store MemStore Cell 134

135 Writing new cells MemStore StoreFile Block Block 135

136 Writing new cells MemStore StoreFile Block Block 136

137 Writing new cells MemStore StoreFile Block Block 137

138 Writing new cells MemStore StoreFile Block Block 138

139 Writing new cells MemStore StoreFile Block Block 139

140 Flush MemStore StoreFile StoreFile Block Block Block Block Sort! 140

141 Flush When: 141

142 Flush When: Reaching max Memstore size in a store 142

143 Flush When: Reaching max Memstore size in a store Reaching overall max Memstore size 143

144 Flush When: Reaching max Memstore size in a store Reaching overall max Memstore size Reaching full Write-Ahead Log 144

145 Write-Ahead Log MemStore 145

146 Write-Ahead Log MemStore HLog 146

147 Write-Ahead Log MemStore HLog On HDFS One per RegionServer 147

148 Write-Ahead Log MemStore HLog 148

149 Reading from a Store MemStore StoreFile Block Block StoreFile Block Block 149

150 Reading from a Store MemStore StoreFile Block Block StoreFile Block Block 150

151 Compaction StoreFile StoreFile StoreFile Block Block Block Block Block Block 151

152 Compaction StoreFile StoreFile StoreFile Block Block Block Block Block Block 152

153 Compaction StoreFile (Sort again) Block Block Block Block Block Block 153

154 Seek vs. Transfer B+-trees LSM-Trees 154

155 Seek vs. Transfer Classical RDBMS Wide column stores 155

156 Log-Structured Merge-Trees 156

157 Log-Structured Merge-Trees C 0 C 1 C

158 Log-Structured Merge-Trees merge merge C 0 C 1 C

159 Seek vs. Transfer Seek-time-bound Transfer-time-bound 159

160 The META table: a table like any other 160

161 The META table: stores region locations table + region start key + region id + replica id info: regioninfo info: server info: serverstartcode T10:15:00 161

162 RegionInfo RegionInfo Table name Start key Region ID Replica ID encodedname End key Split Offline 162

163 HBase Bootstrap Root 163

164 HBase Bootstrap Root Meta 164

165 HBase Bootstrap Root Meta Regular tables 165

166 Architecture HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 166

167 Architecture HMaster Create/delete/update table Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 167

168 Architecture HMaster Region? Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver (hosting meta) 168

169 Architecture HMaster Region? Regionserver location(s) Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 169

170 Architecture HMaster Query Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 170

171 HBase: Underlying APIs grazvydas / 123RF Stock Photo 171

172 HBase implementation (Packaged code) 172

173 HBase APIs REST 173

174 HBase: caching 174

175 HBase Caches: reading faster LRU block cache Level 1 175

176 HBase Caches: reading faster LRU block cache bucket cache Level 1 Level 2 176

177 HBase Caches: reading faster LRU block cache bucket cache HDFS Level 1 Level 2 177

178 LRU Block Cache On the Least Recently Heap Used 178

179 LRU Block Cache: levels of priority Single access priority Multi access priority In-memory access priority 179

180 When NOT to use the cache 180

181 When NOT to use the cache Batch processing 181

182 When NOT to use the cache Random access 182

183 Summary of what we have in memory 183

184 Summary of what we have in memory MemStore 184

185 Summary of what we have in memory MemStore LRU BlockCache 185

186 Summary of what we have in memory MemStore LRU BlockCache Indices of HFiles 186

187 Summary of what we have in memory MemStore LRU BlockCache Indices of HFiles Bloom Filters (avoids disk reads if we can guarantee that a key is not in an HFile) 187

188 Hash function Source: Jorge Stolfi (Wikipedia) 188

189 Bloom filter Very quickly whether an element belongs to a set (potentially false positives) 189

190 Bloom filter

191 Bloom filter John Smith hash function 1 hash function 2 hash function k

192 Bloom filter Mary Smith hash function 1 hash function 2 hash function k

193 Bloom filter: not in set hash function 1 hash function 2 hash function k Albert Einstein? 193

194 Bloom filter: in set (and correct) hash function 1 hash function 2 hash function k Mary Smith? 194

195 Bloom filter: in set (false positive) hash function 1 hash function 2 hash function k Louis de Broglie? 195

196 Data Locality 196

197 HBase vs. HDFS 197

198 With HDFS load balancer

199 HFile compaction brings back locality 199

200 Best practices 200

201 Number of rows Millions RDBMS Billions HBase 201

202 Number of nodes > 5 202

203 Row IDs and column names keep them >short< why? 203

204 10 Design Principles of Big Data 204

205 1. Learn from the past 205

206 2. Keep the design simple 206

207 3. Modularize the architecture 207

208 4. Homogeneity in the large 208

209 5. Heterogeneity in the small 209

210 6. Separate metadata from data 210

211 7. Abstract logical model from its physical implementation 211

212 8. Shard the data 212

213 9. Replicate the data 213

214 10. Buy lots of cheap hardware 214

215 Spanner 215

216 Spanner new: externally-consistent distributed transactions 216

217 Spanner Tabular Data Model Language ACID properties Distribution Scalability Sharding Replicas SQL NoSQL 217

218 Spanner Tabular Data Model Language ACID properties Distribution Scalability Sharding Replicas SQL NoSQL 218

219 Spanner: Data Model Multi-column primary key 219

220 Spanner: Data Model Multi-column primary key Timestamp 220

221 Spanner: Data Model Multi-column primary key Timestamp Directory 221

222 Spanner: Data Model Multi-column primary key Timestamp Tablet 222

223 Spanner: Data Scale 1,000,000,000,000s of rows 223

224 Spanner: Architecture of a Zone zonemaster Spanserver Spanserver Spanserver Spanserver Spanserver Spanserver 224

225 Spanner: Architecture universemaster 225

226 Spanner: Architecture Data Center Data Center Data Center 226

227 Spanner: Architecture Replica Replica Replica Paxos Paxos Paxos Tablet Tablet Tablet Colossus Colossus Colossus 227

228 Spanner: Architecture 100s of data centers Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center 228

229 Spanner: Architecture 1,000,000s of machines 229

230 Spanner: Architecture Higher availability Lower latency 230

Ghislain Fourny. Big Data 5. Column stores

Ghislain Fourny. Big Data 5. Column stores Ghislain Fourny Big Data 5. Column stores 1 Introduction 2 Relational model 3 Relational model Schema 4 Issues with relational databases (RDBMS) Small scale Single machine 5 Can we fix a RDBMS? Scale up