Ghislain Fourny. Big Data 5. Wide column stores
|
|
- Melanie Francis
- 5 years ago
- Views:
Transcription
1 Ghislain Fourny Big Data 5. Wide column stores
2 Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2
3 Where we are User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Last weeks Storage 3
4 Where we are User interfaces Querying Data stores Indexing Processing Validation Today Data models Syntax Encoding Last weeks Storage 4
5 Relational model 5
6 Relational model Schema 6
7 Issues with relational databases (RDBMS) Small scale 7
8 Issues with relational databases (RDBMS) Small scale Single machine 8
9 Can we fix a RDBMS? 9
10 Can we fix a RDBMS? Scale up (remember?) 10
11 Can we fix a RDBMS? Scale out 11
12 Can we fix a RDBMS? Cluster Scale out 12
13 Can we fix a RDBMS? Cluster Replicate Scale out 13
14 Can we fix a RDBMS? Hard to set up Scale out 14
15 Can we fix a RDBMS? Hard to set up Very high maintenance costs Scale out 15
16 HBase By design running on a scalable cluster of commodity hardware 16
17 HBase By design running on a scalable cluster of commodity hardware HDFS 17
18 Wide column stores: data model 18
19 Founding paper 's BigTable 19
20 The tabular model 20
21 The tabular model: expensive joins 21
22 Design paradigm of BigTable store together what is accessed together 22
23 The tabular model: expensive joins
24 3rd Normal Form: Example Legi Name City State Alan Turing Bletchley Park UK City State PLZ Bletchley Park UK MK3 6EB Alan Turing Bletchley Park UK Bletchley Park UK MK3 6EB Georg Cantor Pfäffikon SZ Pfäffikon SZ Georg Cantor Pfäffikon SZ Pfäffikon SZ Felix Bloch Pfäffikon ZH Pfäffikon ZH
25 3rd Normal Form: Counter-Example Legi Name City State PLZ Alan Turing Bletchley Park UK MK3 6EB Alan Turing Bletchley Park UK MK3 6EB Georg Cantor Pfäffikon SZ Georg Cantor Pfäffikon SZ Felix Bloch Pfäffikon ZH
26 The tabular model: expensive joins
27 The columnar model: denormalized
28 Rows Row ID A1 1E0 22A 4A2 28
29 Rows Yes, for now this actually looks pretty much like key-value storage. Row ID A1 1E0 22A 4A2 29
30 Columns Row ID A1 1E0 22A 4A2 30
31 Columns Column family Row ID A1 1E0 22A 4A2 31
32 Column families must be known in advance... Row ID 32
33 Column families must be known in advance... Row ID 000 A B 1 2 I 002 0A1 1E0 22A 4A2 33
34 ... but columns can be added on the fly Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 22A 4A2 34
35 Primary queries 35
36 Primary queries Get 36
37 Get Row ID 000 A B C 1 2 I II III IV 002 0A1 1E0 22A 4A2 37
38 Primary queries Get 38
39 Primary queries Get Put 39
40 Put Row ID 000 A B C 1 2 I II III IV 002 0A1 1E A 4A2 40
41 Primary queries Get Put 41
42 Primary queries Get Put Scan (This is new) 42
43 Scan Row ID 000 A B C 1 2 I II III IV 002 0A1 1E A 4A2 43
44 Primary queries Get Put Scan (This is new) 44
45 Primary queries Get Put Scan Delete (This is new) 45
46 Delete Row ID 000 A B C 1 2 I II III IV 002 0A1 1E A 4A2 46
47 Some terminology: Key-value model Key Value 47
48 Some terminology: Column-oriented stores Column1 Column2 48
49 Some terminology: Column-oriented key-value stores Also: wide column stores, column-family-oriented Row ID A B C 1 2 I II III IV 49
50 Examples of Column-oriented key-value stores 's BigTable 50
51 Warning on terminology NoSQL is very recent! 51
52 Warning on terminology Key-value storage Relational table Words have a "life" File Block NoSQL Object storage 52
53 HBase: physical level 53
54 Physical layer: regions Row ID A B C 1 2 I II III IV 54
55 Physical layer: regions Row ID A B C 1 2 I II III IV 55
56 Physical layer: regions Row ID A B C 1 2 I II III IV Min-incl. Max-excl. 56
57 Physical layer: column families Row ID A B C 1 2 I II III IV Min-incl. Max-excl. Stored together 57
58 Architecture "The same procedure as every year, James." 58
59 HDFS... Namenode /dir/file1 /dir/file2 /file3 Datanode Datanode Datanode Datanode Datanode Datanode 59
60 HBase HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 60
61 HMaster HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 61
62 HMaster DDL operations 62
63 HMaster DDL operations Create table 63
64 HMaster DDL operations Create table Delete table 64
65 HMaster assigns regions to RegionServers Row ID 65
66 HMaster assigns regions to RegionServers Row ID 66
67 HMaster assigns regions to RegionServers Row ID 67
68 HMaster assigns regions to RegionServers Row ID 68
69 HMaster splits regions Row ID 69
70 HMaster handles Regionserver failovers 70
71 Architecture HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 71
72 Regionserver HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 72
73 Physical storage Row ID Min-incl. A B C 1 2 Stored together I II III IV 73
74 Physical storage Row ID A B C 1 2 I II III IV Store Store Store Store Store Store 74
75 Store = column family Row ID
76 Store = column family Row ID 1 2 Cell 76
77 Store = column family Row ID 1 2 HFile HFile HFile HFile (On HDFS) 77
78 HFile HFile 78
79 HFile HFile That's actually an SSTable (flat sorted list of key-value pairs) 79
80 HFile HFile KeyValue That's actually an SSTable (flat sorted list of key-value pairs) (Stores a cell) 80
81 HFile
82 Versioning Different versions of same cell Latest 82
83 Versioning: timeline V 1 V 2 V 3 V 4 83
84 Versioning: timeline V 1 V 2 V 3 V 4 Total order: not like DynamoDB 84
85 Versioning: timeline V 1 V 2 V 3 V 4 Total order: not like DynamoDB A B C HBase guarantees ACID on the row level (concurrent writes and reads are synchronizing with per-row locks) 85
86 HFile: KeyValue key value 86
87 HFile: KeyValue (prefix code) keylength valuelength key value 87
88 Prefix code example: Gamma code
89 Prefix code example: Gamma code
90 Prefix code example: Gamma code
91 Prefix code example: Gamma code
92 Prefix code example: Gamma code
93 Prefix code example: Gamma code
94 Prefix code example: Gamma code
95 Prefix code example: Gamma code
96 Prefix code example: Gamma code
97 Prefix code example: Gamma code
98 Prefix code example: Gamma code
99 Prefix code example: Gamma code
100 HFile: Key row length row (key) column family length column family column qualifier timestamp key type 100
101 HFile: Key row length row (key) column family length column family column qualifier timestamp key type This one is for the versioning 101
102 HFile: Key row length row (key) column family length column family column qualifier timestamp key type This one is for marking as deleted 102
103 Blocks HFile 103
104 Blocks HFile "Quantity" of KeyValues that get read at a time 104
105 Blocks Default HFile 64kb 105
106 Blocks: long keys or values size(keyvalue) > block size No split (longer block) 106
107 Inside an HFile key1 key5 key11 key17 /index /data 107
108 Looking up a key key1 key5 key11 key17 108
109 Looking up a key key1 key5 key11 key17 109
110 Looking up a key key1 key5 key14 key11 key17 110
111 Looking up a key key1 key5 key14 key11 key17 111
112 Looking up a key key1 key5 key14 key11 key17 112
113 Looking up a key key1 key5 key14 key11 key17 key14 113
114 Writing to an HFile key1 114
115 Writing to an HFile key1 key1 115
116 Writing to an HFile key1 key1 key2 key3 key4 116
117 Writing to an HFile key1 key1 key2 key3 key4 key5 key5 117
118 Writing to an HFile key1 key5 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 118
119 Writing to an HFile key1 key5 key11 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 119
120 Writing to an HFile key1 key5 key11 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 120
121 Writing to an HFile key1 key5 key11 key17 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 121
122 Levels of physical storage Table 122
123 Levels of physical storage Table Region 123
124 Levels of physical storage Table Region Store 124
125 Levels of physical storage Table Region Store StoreFile 125
126 Levels of physical storage Table Region Store StoreFile Block 126
127 Levels of physical storage Table Region Store StoreFile Block KeyValue 127
128 Problem key1 key5 key11 key17 key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 128
129 Problem key1 We can only write key-values in sorted order Sorted key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 129
130 HBase: Writing new cells 130
131 On Disk Table Region Store StoreFile Block KeyValue 131
132 Store StoreFile Block Block StoreFile Block Block 132
133 Store MemStore radub85 / 123RF Stock Photo StoreFile Block Block StoreFile Block Block 133
134 In Memory Table Region Store MemStore Cell 134
135 Writing new cells MemStore StoreFile Block Block 135
136 Writing new cells MemStore StoreFile Block Block 136
137 Writing new cells MemStore StoreFile Block Block 137
138 Writing new cells MemStore StoreFile Block Block 138
139 Writing new cells MemStore StoreFile Block Block 139
140 Flush MemStore StoreFile StoreFile Block Block Block Block Sort! 140
141 Flush When: 141
142 Flush When: Reaching max Memstore size in a store 142
143 Flush When: Reaching max Memstore size in a store Reaching overall max Memstore size 143
144 Flush When: Reaching max Memstore size in a store Reaching overall max Memstore size Reaching full Write-Ahead Log 144
145 Write-Ahead Log MemStore 145
146 Write-Ahead Log MemStore HLog 146
147 Write-Ahead Log MemStore HLog On HDFS One per RegionServer 147
148 Write-Ahead Log MemStore HLog 148
149 Reading from a Store MemStore StoreFile Block Block StoreFile Block Block 149
150 Reading from a Store MemStore StoreFile Block Block StoreFile Block Block 150
151 Compaction StoreFile StoreFile StoreFile Block Block Block Block Block Block 151
152 Compaction StoreFile StoreFile StoreFile Block Block Block Block Block Block 152
153 Compaction StoreFile (Sort again) Block Block Block Block Block Block 153
154 Seek vs. Transfer B+-trees LSM-Trees 154
155 Seek vs. Transfer Classical RDBMS Wide column stores 155
156 Log-Structured Merge-Trees 156
157 Log-Structured Merge-Trees C 0 C 1 C
158 Log-Structured Merge-Trees merge merge C 0 C 1 C
159 Seek vs. Transfer Seek-time-bound Transfer-time-bound 159
160 The META table: a table like any other 160
161 The META table: stores region locations table + region start key + region id + replica id info: regioninfo info: server info: serverstartcode T10:15:00 161
162 RegionInfo RegionInfo Table name Start key Region ID Replica ID encodedname End key Split Offline 162
163 HBase Bootstrap Root 163
164 HBase Bootstrap Root Meta 164
165 HBase Bootstrap Root Meta Regular tables 165
166 Architecture HMaster Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 166
167 Architecture HMaster Create/delete/update table Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 167
168 Architecture HMaster Region? Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver (hosting meta) 168
169 Architecture HMaster Region? Regionserver location(s) Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 169
170 Architecture HMaster Query Regionserver Regionserver Regionserver Regionserver Regionserver Regionserver 170
171 HBase: Underlying APIs grazvydas / 123RF Stock Photo 171
172 HBase implementation (Packaged code) 172
173 HBase APIs REST 173
174 HBase: caching 174
175 HBase Caches: reading faster LRU block cache Level 1 175
176 HBase Caches: reading faster LRU block cache bucket cache Level 1 Level 2 176
177 HBase Caches: reading faster LRU block cache bucket cache HDFS Level 1 Level 2 177
178 LRU Block Cache On the Least Recently Heap Used 178
179 LRU Block Cache: levels of priority Single access priority Multi access priority In-memory access priority 179
180 When NOT to use the cache 180
181 When NOT to use the cache Batch processing 181
182 When NOT to use the cache Random access 182
183 Summary of what we have in memory 183
184 Summary of what we have in memory MemStore 184
185 Summary of what we have in memory MemStore LRU BlockCache 185
186 Summary of what we have in memory MemStore LRU BlockCache Indices of HFiles 186
187 Summary of what we have in memory MemStore LRU BlockCache Indices of HFiles Bloom Filters (avoids disk reads if we can guarantee that a key is not in an HFile) 187
188 Hash function Source: Jorge Stolfi (Wikipedia) 188
189 Bloom filter Very quickly whether an element belongs to a set (potentially false positives) 189
190 Bloom filter
191 Bloom filter John Smith hash function 1 hash function 2 hash function k
192 Bloom filter Mary Smith hash function 1 hash function 2 hash function k
193 Bloom filter: not in set hash function 1 hash function 2 hash function k Albert Einstein? 193
194 Bloom filter: in set (and correct) hash function 1 hash function 2 hash function k Mary Smith? 194
195 Bloom filter: in set (false positive) hash function 1 hash function 2 hash function k Louis de Broglie? 195
196 Data Locality 196
197 HBase vs. HDFS 197
198 With HDFS load balancer
199 HFile compaction brings back locality 199
200 Best practices 200
201 Number of rows Millions RDBMS Billions HBase 201
202 Number of nodes > 5 202
203 Row IDs and column names keep them >short< why? 203
204 10 Design Principles of Big Data 204
205 1. Learn from the past 205
206 2. Keep the design simple 206
207 3. Modularize the architecture 207
208 4. Homogeneity in the large 208
209 5. Heterogeneity in the small 209
210 6. Separate metadata from data 210
211 7. Abstract logical model from its physical implementation 211
212 8. Shard the data 212
213 9. Replicate the data 213
214 10. Buy lots of cheap hardware 214
215 Spanner 215
216 Spanner new: externally-consistent distributed transactions 216
217 Spanner Tabular Data Model Language ACID properties Distribution Scalability Sharding Replicas SQL NoSQL 217
218 Spanner Tabular Data Model Language ACID properties Distribution Scalability Sharding Replicas SQL NoSQL 218
219 Spanner: Data Model Multi-column primary key 219
220 Spanner: Data Model Multi-column primary key Timestamp 220
221 Spanner: Data Model Multi-column primary key Timestamp Directory 221
222 Spanner: Data Model Multi-column primary key Timestamp Tablet 222
223 Spanner: Data Scale 1,000,000,000,000s of rows 223
224 Spanner: Architecture of a Zone zonemaster Spanserver Spanserver Spanserver Spanserver Spanserver Spanserver 224
225 Spanner: Architecture universemaster 225
226 Spanner: Architecture Data Center Data Center Data Center 226
227 Spanner: Architecture Replica Replica Replica Paxos Paxos Paxos Tablet Tablet Tablet Colossus Colossus Colossus 227
228 Spanner: Architecture 100s of data centers Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center Data Center 228
229 Spanner: Architecture 1,000,000s of machines 229
230 Spanner: Architecture Higher availability Lower latency 230
Ghislain Fourny. Big Data 5. Column stores
Ghislain Fourny Big Data 5. Column stores 1 Introduction 2 Relational model 3 Relational model Schema 4 Issues with relational databases (RDBMS) Small scale Single machine 5 Can we fix a RDBMS? Scale up
More informationHBASE INTERVIEW QUESTIONS
HBASE INTERVIEW QUESTIONS http://www.tutorialspoint.com/hbase/hbase_interview_questions.htm Copyright tutorialspoint.com Dear readers, these HBase Interview Questions have been designed specially to get
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationCOSC 6339 Big Data Analytics. NoSQL (II) HBase. Edgar Gabriel Fall HBase. Column-Oriented data store Distributed designed to serve large tables
COSC 6339 Big Data Analytics NoSQL (II) HBase Edgar Gabriel Fall 2018 HBase Column-Oriented data store Distributed designed to serve large tables Billions of rows and millions of columns Runs on a cluster
More informationBigTable: A Distributed Storage System for Structured Data
BigTable: A Distributed Storage System for Structured Data Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) BigTable 1393/7/26
More information10 Million Smart Meter Data with Apache HBase
10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on
More informationADVANCED HBASE. Architecture and Schema Design GeeCON, May Lars George Director EMEA Services
ADVANCED HBASE Architecture and Schema Design GeeCON, May 2013 Lars George Director EMEA Services About Me Director EMEA Services @ Cloudera Consulting on Hadoop projects (everywhere) Apache Committer
More informationHBase. Леонид Налчаджи
HBase Леонид Налчаджи leonid.nalchadzhi@gmail.com HBase Overview Table layout Architecture Client API Key design 2 Overview 3 Overview NoSQL Column oriented Versioned 4 Overview All rows ordered by row
More informationDistributed Systems. 19. Spanner. Paul Krzyzanowski. Rutgers University. Fall 2017
Distributed Systems 19. Spanner Paul Krzyzanowski Rutgers University Fall 2017 November 20, 2017 2014-2017 Paul Krzyzanowski 1 Spanner (Google s successor to Bigtable sort of) 2 Spanner Take Bigtable and
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationBigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao
Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement
More informationBig Data Analytics. Rasoul Karimi
Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline
More informationComparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2014 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
More informationGoal of the presentation is to give an introduction of NoSQL databases, why they are there.
1 Goal of the presentation is to give an introduction of NoSQL databases, why they are there. We want to present "Why?" first to explain the need of something like "NoSQL" and then in "What?" we go in
More informationTypical size of data you deal with on a daily basis
Typical size of data you deal with on a daily basis Processes More than 161 Petabytes of raw data a day https://aci.info/2014/07/12/the-dataexplosion-in-2014-minute-by-minuteinfographic/ On average, 1MB-2MB
More informationBig Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering
Big Data Processing Technologies Chentao Wu Associate Professor Dept. of Computer Science and Engineering wuct@cs.sjtu.edu.cn Schedule (1) Storage system part (first eight weeks) lec1: Introduction on
More informationCS November 2017
Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationBigTable. Chubby. BigTable. Chubby. Why Chubby? How to do consensus as a service
BigTable BigTable Doug Woos and Tom Anderson In the early 2000s, Google had way more than anybody else did Traditional bases couldn t scale Want something better than a filesystem () BigTable optimized
More informationGhislain Fourny. Big Data 2. Lessons learnt from the past
Ghislain Fourny Big Data 2. Lessons learnt from the past Mr. Databases: Edgar Codd Wikipedia Data Independence (Edgar Codd) Logical data model Lorem Ipsum Dolor sit amet Physical storage Consectetur Adipiscing
More informationFattane Zarrinkalam کارگاه ساالنه آزمایشگاه فناوری وب
Fattane Zarrinkalam کارگاه ساالنه آزمایشگاه فناوری وب 1391 زمستان Outlines Introduction DataModel Architecture HBase vs. RDBMS HBase users 2 Why Hadoop? Datasets are growing to Petabytes Traditional datasets
More informationBigTable. CSE-291 (Cloud Computing) Fall 2016
BigTable CSE-291 (Cloud Computing) Fall 2016 Data Model Sparse, distributed persistent, multi-dimensional sorted map Indexed by a row key, column key, and timestamp Values are uninterpreted arrays of bytes
More informationExtreme Computing. NoSQL.
Extreme Computing NoSQL PREVIOUSLY: BATCH Query most/all data Results Eventually NOW: ON DEMAND Single Data Points Latency Matters One problem, three ideas We want to keep track of mutable state in a scalable
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate
More informationSpanner: Google's Globally-Distributed Database. Presented by Maciej Swiech
Spanner: Google's Globally-Distributed Database Presented by Maciej Swiech What is Spanner? "...Google's scalable, multi-version, globallydistributed, and synchronously replicated database." What is Spanner?
More informationAccelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads
WHITE PAPER Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads December 2014 Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents
More informationNoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014
NoSQL Databases Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 10, 2014 Amir H. Payberah (SICS) NoSQL Databases April 10, 2014 1 / 67 Database and Database Management System
More informationCS November 2018
Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationGoogle File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information
Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute
More information7680: Distributed Systems
Cristina Nita-Rotaru 7680: Distributed Systems BigTable. Hbase.Spanner. 1: BigTable Acknowledgement } Slides based on material from course at UMichigan, U Washington, and the authors of BigTable and Spanner.
More informationHBase... And Lewis Carroll! Twi:er,
HBase... And Lewis Carroll! jw4ean@cloudera.com Twi:er, LinkedIn: @jw4ean 1 Introduc@on 2010: Cloudera Solu@ons Architect 2011: Cloudera TAM/DSE 2012-2013: Cloudera Training focusing on Partners and Newbies
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationBigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng
Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng What is Bigtable? - A storage system for managing structured data - Used in 60+ Google services - Motivation:
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationBigtable. Presenter: Yijun Hou, Yixiao Peng
Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng
More informationBigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612
Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612 Google Bigtable 2 A distributed storage system for managing structured data that is designed to scale to a very
More informationHow do we build TiDB. a Distributed, Consistent, Scalable, SQL Database
How do we build TiDB a Distributed, Consistent, Scalable, SQL Database About me LiuQi ( 刘奇 ) JD / WandouLabs / PingCAP Co-founder / CEO of PingCAP Open-source hacker / Infrastructure software engineer
More informationCSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores
CSE 444: Database Internals Lectures 26 NoSQL: Extensible Record Stores CSE 444 - Spring 2014 1 References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No. 4)
More informationIntroduction Data Model API Building Blocks SSTable Implementation Tablet Location Tablet Assingment Tablet Serving Compactions Refinements
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. M. Burak ÖZTÜRK 1 Introduction Data Model API Building
More informationReplica Parallelism to Utilize the Granularity of Data
Replica Parallelism to Utilize the Granularity of Data 1st Author 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl. country code 1st author's E-mail address 2nd Author
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationYCSB++ benchmarking tool Performance debugging advanced features of scalable table stores
YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores Swapnil Patil M. Polte, W. Tantisiriroj, K. Ren, L.Xiao, J. Lopez, G.Gibson, A. Fuchs *, B. Rinaldi * Carnegie
More informationCSE-E5430 Scalable Cloud Computing Lecture 9
CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay
More informationW b b 2.0. = = Data Ex E pl p o l s o io i n
Hypertable Doug Judd Zvents, Inc. Background Web 2.0 = Data Explosion Web 2.0 Mt. Web 2.0 Traditional Tools Don t Scale Well Designed for a single machine Typical scaling solutions ad-hoc manual/static
More informationMegastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.
Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Presented by Kewei Li The Problem db nosql complex legacy tuning expensive
More informationHow we build TiDB. Max Liu PingCAP Amsterdam, Netherlands October 5, 2016
How we build TiDB Max Liu PingCAP Amsterdam, Netherlands October 5, 2016 About me Infrastructure engineer / CEO of PingCAP Working on open source projects: TiDB: https://github.com/pingcap/tidb TiKV: https://github.com/pingcap/tikv
More informationOutline. Spanner Mo/va/on. Tom Anderson
Spanner Mo/va/on Tom Anderson Outline Last week: Chubby: coordina/on service BigTable: scalable storage of structured data GFS: large- scale storage for bulk data Today/Friday: Lessons from GFS/BigTable
More informationFacebook. The Technology Behind Messages (and more ) Kannan Muthukkaruppan Software Engineer, Facebook. March 11, 2011
HBase @ Facebook The Technology Behind Messages (and more ) Kannan Muthukkaruppan Software Engineer, Facebook March 11, 2011 Talk Outline the new Facebook Messages, and how we got started with HBase quick
More informationΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing
ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent
More information5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414
Announcements Database Systems CSE 414 Lecture 16: NoSQL and JSon Current assignments: Homework 4 due tonight Web Quiz 6 due next Wednesday [There is no Web Quiz 5 Today s lecture: JSon The book covers
More informationBig Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla
Big Table Google s Storage Choice for Structured Data Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla Bigtable: Introduction Resembles a database. Does not support
More informationDatabase Systems CSE 414
Database Systems CSE 414 Lecture 16: NoSQL and JSon CSE 414 - Spring 2016 1 Announcements Current assignments: Homework 4 due tonight Web Quiz 6 due next Wednesday [There is no Web Quiz 5] Today s lecture:
More informationDistributed PostgreSQL with YugaByte DB
Distributed PostgreSQL with YugaByte DB Karthik Ranganathan PostgresConf Silicon Valley Oct 16, 2018 1 CHECKOUT THIS REPO: github.com/yugabyte/yb-sql-workshop 2 About Us Founders Kannan Muthukkaruppan,
More informationbig picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures
Lecture 20 -- 11/20/2017 BigTable big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures what does paper say Google
More informationReferences. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals
References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed
More informationBigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13
Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University
More informationHBase Solutions at Facebook
HBase Solutions at Facebook Nicolas Spiegelberg Software Engineer, Facebook QCon Hangzhou, October 28 th, 2012 Outline HBase Overview Single Tenant: Messages Selection Criteria Multi-tenant Solutions
More informationCloudera Kudu Introduction
Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)
More informationNewSQL Databases. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica NewSQL Databases Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference
More informationCA485 Ray Walshe NoSQL
NoSQL BASE vs ACID Summary Traditional relational database management systems (RDBMS) do not scale because they adhere to ACID. A strong movement within cloud computing is to utilize non-traditional data
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationShen PingCAP 2017
Shen Li @ PingCAP About me Shen Li ( 申砾 ) Tech Lead of TiDB, VP of Engineering Netease / 360 / PingCAP Infrastructure software engineer WHY DO WE NEED A NEW DATABASE? Brief History Standalone RDBMS NoSQL
More information10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414
Announcements Database Systems CSE 414 Lecture 11: NoSQL & JSON (mostly not in textbook only Ch 11.1) HW5 will be posted on Friday and due on Nov. 14, 11pm [No Web Quiz 5] Today s lecture: NoSQL & JSON
More informationSpanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013
Spanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013 *OSDI '12, James C. Corbett et al. (26 authors), Jay Lepreau Best Paper Award Outline What is Spanner? Features & Example Structure
More informationIntroduction to NoSQL Databases
Introduction to NoSQL Databases Roman Kern KTI, TU Graz 2017-10-16 Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 1 / 31 Introduction Intro Why NoSQL? Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 2 / 31 Introduction
More informationIntroduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data
Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction
More informationBigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis
BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis Motivation Lots of (semi-)structured data at Google URLs: Contents, crawl metadata, links, anchors, pagerank,
More informationGoogle Spanner - A Globally Distributed,
Google Spanner - A Globally Distributed, Synchronously-Replicated Database System James C. Corbett, et. al. Feb 14, 2013. Presented By Alexander Chow For CS 742 Motivation Eventually-consistent sometimes
More informationYCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores
YCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores Swapnil Patil Milo Polte, Wittawat Tantisiriroj, Kai Ren, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs *, Billie
More informationCS 655 Advanced Topics in Distributed Systems
Presented by : Walid Budgaga CS 655 Advanced Topics in Distributed Systems Computer Science Department Colorado State University 1 Outline Problem Solution Approaches Comparison Conclusion 2 Problem 3
More informationGoogle Cloud Bigtable. And what it's awesome at
Google Cloud Bigtable And what it's awesome at Jen Tong Developer Advocate Google Cloud Platform @MimmingCodes Agenda 1 Research 2 A story about bigness 3 How it works 4 When it's awesome Google Research
More informationJargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems
Jargons, Concepts, Scope and Systems Key Value Stores, Document Stores, Extensible Record Stores Overview of different scalable relational systems Examples of different Data stores Predictions, Comparisons
More informationPyro: A Spatial-Temporal Big-Data Storage System. Shen Li Shaohan Hu Raghu Ganti Mudhakar Srivatsa Tarek Abdelzaher
Pyro: A Spatial-Temporal Big-Data Storage System Shen Li Shaohan Hu Raghu Ganti Mudhakar Srivatsa Tarek Abdelzaher 1 Applications A huge amount of geo- tagged events are generated and stored in real- 5me.
More informationDistributed Systems. GFS / HDFS / Spanner
15-440 Distributed Systems GFS / HDFS / Spanner Agenda Google File System (GFS) Hadoop Distributed File System (HDFS) Distributed File Systems Replication Spanner Distributed Database System Paxos Replication
More informationWhat is database? Types and Examples
What is database? Types and Examples Visit our site for more information: www.examplanning.com Facebook Page: https://www.facebook.com/examplanning10/ Twitter: https://twitter.com/examplanning10 TABLE
More informationGridGain and Apache Ignite In-Memory Performance with Durability of Disk
GridGain and Apache Ignite In-Memory Performance with Durability of Disk Dmitriy Setrakyan Apache Ignite PMC GridGain Founder & CPO http://ignite.apache.org #apacheignite Agenda What is GridGain and Ignite
More informationIntegrity in Distributed Databases
Integrity in Distributed Databases Andreas Farella Free University of Bozen-Bolzano Table of Contents 1 Introduction................................................... 3 2 Different aspects of integrity.....................................
More informationDatabase Evolution. DB NoSQL Linked Open Data. L. Vigliano
Database Evolution DB NoSQL Linked Open Data Requirements and features Large volumes of data..increasing No regular data structure to manage Relatively homogeneous elements among them (no correlation between
More informationStructured Big Data 1: Google Bigtable & HBase Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC
Structured Big Data 1: Google Bigtable & HBase Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC Lecture material is mostly home-grown, partly taken with permission and courtesy from Professor Shih-Wei Liao
More informationApache HBase Andrew Purtell Committer, Apache HBase, Apache Software Foundation Big Data US Research And Development, Intel
Apache HBase 0.98 Andrew Purtell Committer, Apache HBase, Apache Software Foundation Big Data US Research And Development, Intel Who am I? Committer on the Apache HBase project Member of the Big Data Research
More informationTime Series Storage with Apache Kudu (incubating)
Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry
More informationBIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,
BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 1 OBJECTIVES ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 2 WHAT
More informationCorbett et al., Spanner: Google s Globally-Distributed Database
Corbett et al., : Google s Globally-Distributed Database MIMUW 2017-01-11 ACID transactions ACID transactions SQL queries ACID transactions SQL queries Semi-relational data model ACID transactions SQL
More informationApril Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationCISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL
CISC 7610 Lecture 5 Distributed multimedia databases Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL Motivation YouTube receives 400 hours of video per minute That is 200M hours
More informationScaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Scaling Up HBase Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials
More informationAdvanced HBase Schema Design. Berlin Buzzwords, June 2012 Lars George
Advanced HBase Schema Design Berlin Buzzwords, June 2012 Lars George lars@cloudera.com About Me SoluDons Architect @ Cloudera Apache HBase & Whirr CommiIer Author of HBase The Defini.ve Guide Working with
More informationDistributed Data Store
Distributed Data Store Large-Scale Distributed le system Q: What if we have too much data to store in a single machine? Q: How can we create one big filesystem over a cluster of machines, whose data is
More information18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.
18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File
More informationHBase: Overview. HBase is a distributed column-oriented data store built on top of HDFS
HBase 1 HBase: Overview HBase is a distributed column-oriented data store built on top of HDFS HBase is an Apache open source project whose goal is to provide storage for the Hadoop Distributed Computing
More informationApache Hadoop Goes Realtime at Facebook. Himanshu Sharma
Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at
More informationBig Data for Engineers Spring Resource Management
Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models
More informationMySQL Cluster Web Scalability, % Availability. Andrew
MySQL Cluster Web Scalability, 99.999% Availability Andrew Morgan @andrewmorgan www.clusterdb.com Safe Harbour Statement The following is intended to outline our general product direction. It is intended
More information18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.
18-hdfs-gfs.txt Thu Nov 01 09:53:32 2012 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2012 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File
More information<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store
Oracle NoSQL Database A Distributed Key-Value Store Charles Lamb The following is intended to outline our general product direction. It is intended for information purposes only,
More informationMapReduce & BigTable
CPSC 426/526 MapReduce & BigTable Ennan Zhai Computer Science Department Yale University Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Process & Analysis:
More informationHuge market -- essentially all high performance databases work this way
11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch
More informationUsing space-filling curves for multidimensional
Using space-filling curves for multidimensional indexing Dr. Bisztray Dénes Senior Research Engineer 1 Nokia Solutions and Networks 2014 In medias res Performance problems with RDBMS Switch to NoSQL store
More informationBig Data 7. Resource Management
Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage
More information/ Cloud Computing. Recitation 10 March 22nd, 2016
15-319 / 15-619 Cloud Computing Recitation 10 March 22nd, 2016 Overview Administrative issues Office Hours, Piazza guidelines Last week s reflection Project 3.3, OLI Unit 4, Module 15, Quiz 8 This week
More information