Ghislain Fourny. Big Data 5. Wide column stores



Data technology stack (top to bottom): user interfaces, querying, data stores, indexing, processing, validation, data models, syntax, encoding, storage.

Where we are (same stack, with the slide's markers): user interfaces, querying, data stores, indexing, processing, validation, [today] data models, syntax, encoding, [last weeks] storage.

Relational model: tables with a schema.

Issues with relational databases (RDBMS): built for small scale, on a single machine.

Can we fix an RDBMS?
Scale up (remember?).
Scale out: set up a cluster and replicate. This is hard to set up and has very high maintenance costs.

HBase: by design, runs on a scalable cluster of commodity hardware, on top of HDFS.

Wide column stores: data model

Founding paper: Google's BigTable

The tabular model: expensive joins (figure).

Design paradigm of BigTable: store together what is accessed together.

3rd Normal Form: Example

  Legi        Name          City            State
  32-000-000  Alan Turing   Bletchley Park  UK
  62-000-000  Georg Cantor  Pfäffikon       SZ
  25-000-000  Felix Bloch   Pfäffikon       ZH

  City            State  PLZ
  Bletchley Park  UK     MK3 6EB
  Pfäffikon       SZ     8808
  Pfäffikon       ZH     8330

3rd Normal Form: Counter-Example

  Legi        Name          City            State  PLZ
  32-000-000  Alan Turing   Bletchley Park  UK     MK3 6EB
  62-000-000  Georg Cantor  Pfäffikon       SZ     8808
  25-000-000  Felix Bloch   Pfäffikon       ZH     8330

The columnar model: denormalized (figure).

Rows: each row has a row ID (in the figure: 000, 002, 0A1, 1E0, 22A, 4A2). Yes, for now this actually looks pretty much like key-value storage.

Columns: columns are grouped into column families.

Column families must be known in advance, but columns can be added on the fly. In the figure, row 000 starts with columns A, B / 1, 2 / I and ends up with A, B, C / 1, 2 / I, II, III, IV.

Primary queries: Get, Put, Scan (this is new), Delete.

Get: figure with rows 000, 002, 0A1, 1E0, 22A, 4A2 and columns A, B, C / 1, 2 / I, II, III, IV.

Put: in the figure, row 204 is added between 1E0 and 22A.

Scan: figure with rows 000, 002, 0A1, 1E0, 204, 22A, 4A2.

Delete: figure with the same rows.
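The four primary queries can be tried out on a toy in-memory table. This is only a minimal sketch of the data model, not HBase's API: the class name and the family names f1 and f2 are made up for illustration, and Scan is assumed to take a start row (inclusive) and a stop row (exclusive).

```python
from bisect import bisect_left, insort


class ToyWideColumnTable:
    """Toy in-memory table: row ID -> {family: {column: value}}."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.row_ids = []   # kept sorted so that a scan is a range read
        self.rows = {}

    def put(self, row_id, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_id not in self.rows:
            insort(self.row_ids, row_id)
            self.rows[row_id] = {}
        # Columns inside a family can be added on the fly.
        self.rows[row_id].setdefault(family, {})[column] = value

    def get(self, row_id):
        return self.rows.get(row_id)

    def scan(self, start, stop):
        """Yield (row_id, row) for start <= row_id < stop, in sorted order."""
        i = bisect_left(self.row_ids, start)
        while i < len(self.row_ids) and self.row_ids[i] < stop:
            yield self.row_ids[i], self.rows[self.row_ids[i]]
            i += 1

    def delete(self, row_id):
        if row_id in self.rows:
            del self.rows[row_id]
            self.row_ids.remove(row_id)


table = ToyWideColumnTable(families=["f1", "f2"])
for rid in ["000", "002", "0A1", "1E0", "22A", "4A2"]:
    table.put(rid, "f1", "A", "value-" + rid)
table.put("204", "f2", "I", "new row")   # lands between 1E0 and 22A
scanned = [rid for rid, _ in table.scan("1E0", "4A2")]
table.delete("204")
```

Keeping the row IDs sorted is what makes Scan cheap here: it is a range read over neighbours, which a plain hash-based key-value store cannot offer.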

Some terminology
Key-value model: a key maps to a value.
Column-oriented stores: data is laid out by column (Column1, Column2).
Column-oriented key-value stores, also called wide column stores or column-family-oriented stores: a row ID maps to columns grouped in families (A, B, C / 1, 2 / I, II, III, IV).

Examples of column-oriented key-value stores: Google's BigTable.

Warning on terminology: NoSQL is very recent! Words have a "life": key-value storage, relational table, file, block, NoSQL, object storage.

HBase: physical level

Physical layer: regions. The table is cut by row ID into regions; each region covers a range of row IDs from a minimum (inclusive) to a maximum (exclusive).

Physical layer: column families. Within a region, the columns of one column family are stored together.

Architecture: "The same procedure as every year, James."

HDFS: one Namenode (holding /dir/file1, /dir/file2, /file3) and many Datanodes.

HBase: one HMaster and many RegionServers.

HMaster
DDL operations: create table, delete table.
Assigns regions to RegionServers.
Splits regions.
Handles RegionServer failovers.

RegionServer

Physical storage: within a region, each column family is stored together, as a store.

Store = column family (within one region). A store holds cells.

A store is made of HFiles (on HDFS).

HFile: that's actually an SSTable (a flat sorted list of key-value pairs). Each KeyValue stores a cell.

Versioning: different versions of the same cell are kept; one of them is the latest.

Versioning: timeline V1, V2, V3, V4. Total order: not like DynamoDB.

HBase guarantees ACID on the row level (concurrent writes and reads are synchronized with per-row locks).

HFile: KeyValue (prefix code): keylength, valuelength, key, value.

Prefix code example: Gamma code
Bit stream: 10011111111001101011110011110101
Decoded values: 10, 101101011, 101, 1101
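The gamma code example can be replayed in a few lines. The convention below is inferred from the slide's bit stream rather than stated on it: each value is preceded by a unary length (n ones closed by a zero), and the value itself is an implicit leading 1 followed by the next n bits. The function names are hypothetical.

```python
def decode_gamma(bits):
    """Decode a string of '0'/'1' characters into a list of binary strings."""
    values = []
    i = 0
    while i < len(bits):
        # Unary length prefix: count the 1s up to the terminating 0.
        n = 0
        while bits[i] == "1":
            n += 1
            i += 1
        i += 1  # skip the terminating 0
        # The value is an implicit leading 1 followed by the next n bits.
        values.append("1" + bits[i:i + n])
        i += n
    return values


def encode_gamma(value):
    """Encode a positive integer with the same convention."""
    body = bin(value)[3:]   # binary digits without the leading 1
    return "1" * len(body) + "0" + body


stream = "10011111111001101011110011110101"
decoded = decode_gamma(stream)   # ['10', '101101011', '101', '1101']
reencoded = "".join(encode_gamma(int(v, 2)) for v in decoded)
```

Because the length always comes first, a reader never needs a separator between values: it knows exactly how many bits to consume. The keylength and valuelength fields of a KeyValue play the same role.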

HFile: Key: row length, row (key), column family length, column family, column qualifier, timestamp, key type.
The timestamp is for the versioning.
The key type is for marking as deleted.
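The layout of a KeyValue and of its key can be sketched with Python's struct module. The field widths (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the PUT and DELETE codes are assumptions made for this illustration; the slides only give the order of the fields.

```python
import struct

PUT, DELETE = 4, 8   # assumed key-type codes, for illustration only


def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    key = (
        struct.pack(">H", len(row)) + row            # row length, row
        + struct.pack(">B", len(family)) + family    # family length, family
        + qualifier                                  # qualifier: the rest of the key
        + struct.pack(">qB", timestamp, key_type)    # timestamp, key type
    )
    # keylength and valuelength come first, so a reader knows where to stop.
    return struct.pack(">II", len(key), len(value)) + key + value


def decode_keyvalue(buf, offset=0):
    key_len, value_len = struct.unpack_from(">II", buf, offset)
    pos = offset + 8
    key_end = pos + key_len
    (row_len,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    row = buf[pos:pos + row_len]
    pos += row_len
    (fam_len,) = struct.unpack_from(">B", buf, pos)
    pos += 1
    family = buf[pos:pos + fam_len]
    pos += fam_len
    # The last 9 bytes of the key are the timestamp (8) and the key type (1).
    qualifier = buf[pos:key_end - 9]
    timestamp, key_type = struct.unpack_from(">qB", buf, key_end - 9)
    value = buf[key_end:key_end + value_len]
    return (row, family, qualifier, timestamp, key_type, value), key_end + value_len


record = encode_keyvalue(b"0A1", b"f1", b"A", 1, PUT, b"hello")
fields, next_offset = decode_keyvalue(record)
```

The qualifier needs no length field of its own in this sketch: the key length is known, so whatever sits between the family and the trailing timestamp and key type is the qualifier.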

Blocks: an HFile is divided into blocks. A block is the "quantity" of KeyValues that gets read at a time.

Blocks: the default size is 64 KB.

Blocks: long keys or values. If size(KeyValue) > block size, the KeyValue is not split; the block is simply longer.

Inside an HFile: an index (key1, key5, key11, key17) and the data.

Looking up a key: to find key14, search the index (key1, key5, key11, key17). key14 falls between key11 and key17, so the block starting at key11 is read and key14 is found in it.
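The lookup can be sketched with a binary search over the index. The block contents below mirror the slides (blocks starting at key1, key5, key11, key17); the keys are zero-padded so that string order matches numeric order, and the values are placeholders.

```python
from bisect import bisect_right

# Data blocks, each a sorted list of (key, value) pairs.
blocks = [
    [(f"key{k:02d}", f"v{k}") for k in range(1, 5)],
    [(f"key{k:02d}", f"v{k}") for k in range(5, 11)],
    [(f"key{k:02d}", f"v{k}") for k in range(11, 17)],
    [(f"key{k:02d}", f"v{k}") for k in range(17, 19)],
]
# The index holds the first key of every block.
index = [block[0][0] for block in blocks]


def lookup(key):
    # Last block whose first key is <= the key we are looking for.
    i = bisect_right(index, key) - 1
    if i < 0:
        return None          # smaller than every key in the file
    for k, v in blocks[i]:   # only this one block has to be read
        if k == key:
            return v
    return None


found = lookup("key14")
```

Only the index has to stay in memory; a lookup then costs one block read.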

Writing to an HFile: KeyValues are appended in order (key1, key2, ..., key18). Each time a new block is started, its first key is added to the index, which grows to key1, key5, key11, key17.
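A writer along these lines can be sketched as follows. It is a simplification under stated assumptions: sizes are counted as len(key) + len(value), the function name is hypothetical, and blocks are plain Python lists rather than bytes on HDFS.

```python
def write_hfile(sorted_items, block_size):
    """Group sorted (key, value) pairs into blocks and build the index."""
    blocks, index, current, used = [], [], [], 0
    previous_key = None
    for key, value in sorted_items:
        if previous_key is not None and key <= previous_key:
            raise ValueError("keys must arrive in strictly increasing order")
        size = len(key) + len(value)
        # Close the current block if this pair would not fit any more.
        if current and used + size > block_size:
            blocks.append(current)
            current, used = [], 0
        if not current:
            index.append(key)   # first key of a new block goes to the index
        # An oversized pair is never split: it gets a longer block.
        current.append((key, value))
        used += size
        previous_key = key
    if current:
        blocks.append(current)
    return index, blocks


items = [(b"key%02d" % i, b"x" * 10) for i in range(1, 19)]
index, blocks = write_hfile(items, block_size=64)
```

The writer only ever appends, which is why the input has to arrive already sorted.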

Levels of physical storage: Table, Region, Store, StoreFile, Block, KeyValue.

Problem: we can only write key-values in sorted order (key1, key2, ..., key18).

HBase: Writing new cells

On disk: Table, Region, Store, StoreFile, Block, KeyValue.

In memory: Table, Region, Store, MemStore, Cell.

Store: one MemStore (in memory) plus StoreFiles (on disk), each made of blocks.

Writing new cells: new cells go to the MemStore.

Flush: the MemStore is sorted and written out as a new StoreFile.

Flush: when?
Reaching the max MemStore size in a store.
Reaching the overall max MemStore size.
Reaching a full Write-Ahead Log.

Write-Ahead Log (HLog): backs the MemStore. It is on HDFS, one per RegionServer.

Reading from a Store: both the MemStore and the StoreFiles have to be consulted.

Compaction: several StoreFiles are merged into a single StoreFile (sort again).
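Writes, flushes, reads and compaction fit together as in the toy store below. It is a sketch under simplifying assumptions: the flush trigger is a count of cells rather than a size in bytes, the write-ahead log is a Python list, deletes are left out, and all names are hypothetical.

```python
import heapq


class ToyStore:
    def __init__(self, memstore_limit=3):
        self.memstore = {}       # in memory, not sorted
        self.store_files = []    # each one: sorted list of (key, seq, value)
        self.memstore_limit = memstore_limit
        self.wal = []            # stand-in for the write-ahead log
        self.seq = 0

    def put(self, key, value):
        self.seq += 1
        self.wal.append((key, self.seq, value))   # log first, then memory
        self.memstore[key] = (self.seq, value)
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        # Sort once in memory, then write the file out in one go.
        run = sorted((k, s, v) for k, (s, v) in self.memstore.items())
        self.store_files.append(run)
        self.memstore.clear()
        self.wal.clear()   # flushed cells no longer need the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key][1]
        for run in reversed(self.store_files):   # newest file first
            for k, _, v in run:                  # linear scan, for brevity
                if k == key:
                    return v
        return None

    def compact(self):
        if not self.store_files:
            return
        merged = {}
        # Entries arrive sorted by (key, seq), so later writes overwrite.
        for k, s, v in heapq.merge(*self.store_files):
            merged[k] = (s, v)
        self.store_files = [[(k, s, v) for k, (s, v) in merged.items()]]


store = ToyStore(memstore_limit=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10), ("d", 4), ("e", 5)]:
    store.put(key, value)
files_before = len(store.store_files)
store.compact()
```

Every file is written once, sequentially, and never modified afterwards; compaction produces a new sorted file instead of updating the old ones in place.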

Seek vs. transfer
B+-trees: classical RDBMS, seek-time-bound.
LSM-trees: wide column stores, transfer-time-bound.

Log-Structured Merge-Trees: levels C0, C1, C2, ..., with a merge from each level into the next.

The META table: a table like any other. It stores region locations.
Row key: table + region start key + region id + replica id
info:regioninfo
info:server (for example www.example.com:0)
info:serverstartcode (for example 2016-10-11T10:15:00)

RegionInfo: table name, start key, region ID, replica ID, encoded name, end key, split, offline.

HBase bootstrap: Root, then Meta, then regular tables.

Architecture: one HMaster and many RegionServers.
Create/delete/update table: the request goes to the HMaster.
Region?: the client asks the RegionServer hosting meta and gets the RegionServer location(s) back.
Query: the client then talks to that RegionServer directly.
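Finding the RegionServer for a row ID amounts to a search over region start keys. The sketch below assumes a simplified META content, a sorted list of (region start key, server) pairs with made-up host names; each region runs from its start key (inclusive) to the next region's start key (exclusive).

```python
from bisect import bisect_right

# Hypothetical META content, sorted by region start key.
meta = [
    ("", "rs1.example.com"),    # first region: from the empty key
    ("1", "rs2.example.com"),
    ("3", "rs3.example.com"),
]
starts = [start for start, _ in meta]


def locate(row_id):
    """Return the server of the region whose range contains row_id."""
    i = bisect_right(starts, row_id) - 1
    return meta[i][1]


servers = {rid: locate(rid) for rid in ["000", "1E0", "22A", "4A2"]}
```

A client can cache the answer and go straight to the RegionServer on later queries; that caching is not shown here.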

HBase: Underlying APIs

HBase implementation (packaged code)

HBase APIs: REST

HBase: caching

HBase caches: reading faster
Level 1: LRU block cache.
Level 2: bucket cache.
Below the caches: HDFS.

LRU Block Cache: on the heap; LRU stands for Least Recently Used.

LRU Block Cache: levels of priority. Single access priority, multi access priority, in-memory access priority.
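The least-recently-used policy itself is short to write down. This sketch leaves out the three priority levels and counts capacity in blocks rather than bytes; the class name and the load callback are hypothetical.

```python
from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # oldest access first, newest last

    def get(self, block_id, load):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as recently used
            return self.blocks[block_id]
        block = load(block_id)                  # cache miss: read from below
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used
        return block


loads = []


def load_from_disk(block_id):
    loads.append(block_id)
    return f"data-{block_id}"


cache = LRUBlockCache(capacity=2)
for block_id in [1, 2, 1, 3, 2]:
    cache.get(block_id, load_from_disk)
```

In this run block 1 is served from the cache once; block 2 is evicted when block 3 comes in and has to be loaded a second time.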

When NOT to use the cache: batch processing, random access.

Summary of what we have in memory: MemStore, LRU BlockCache, indices of HFiles, Bloom filters (they avoid disk reads if we can guarantee that a key is not in an HFile).

Hash function (figure source: Jorge Stolfi, Wikipedia).

Bloom filter: tells very quickly whether an element belongs to a set (false positives are possible).

Bloom filter, empty:                              0 0 0 0 0 0 0 0 0 0 0 0
After John Smith (hash functions 1, 2, ..., k):   0 1 1 0 0 0 0 1 0 0 0 0
After Mary Smith (hash functions 1, 2, ..., k):   0 1 1 0 0 1 1 1 0 0 0 0

Queries against 0 1 1 0 0 1 1 1 0 0 0 0:
Albert Einstein? Not in set.
Mary Smith? In set (and correct).
Louis de Broglie? In set (false positive).
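A Bloom filter with the slide's 12 bits can be sketched as below. The choice of k = 3 and the way the k hash functions are derived from SHA-256 are assumptions for the illustration, so the bit patterns will not match the ones on the slides.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=12, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # k hash functions, obtained by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "maybe".
        return all(self.bits[pos] for pos in self._positions(item))


names = BloomFilter()
names.add("John Smith")
names.add("Mary Smith")
maybe_mary = names.might_contain("Mary Smith")
```

A negative answer is always right, which is what lets HBase skip an HFile without reading it. A positive answer may be wrong, in which case the HFile is read for nothing.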

Data locality

HBase vs. HDFS

With the HDFS load balancer...

HFile compaction brings back locality.

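Best practices

Number of rows: millions, use an RDBMS; billions, use HBase.

Number of nodes: more than 5.

Row IDs and column names: keep them short. Why? (Hint: the row, the column family and the column qualifier are part of the key of every KeyValue.)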

10 Design Principles of Big Data
1. Learn from the past
2. Keep the design simple
3. Modularize the architecture
4. Homogeneity in the large
5. Heterogeneity in the small
6. Separate metadata from data
7. Abstract logical model from its physical implementation
8. Shard the data
9. Replicate the data
10. Buy lots of cheap hardware

Spanner

New: externally-consistent distributed transactions.

Spanner between SQL and NoSQL: tabular data model, language, ACID properties, distribution, scalability, sharding, replicas.

Spanner: Data Model: multi-column primary key, timestamp, directory, tablet.

Spanner: Data Scale: 1,000,000,000,000s of rows.

Spanner: Architecture of a Zone: one zonemaster and many Spanservers.

Spanner: Architecture
One universemaster.
Several data centers, each holding a replica: Paxos on top of a tablet, stored on Colossus.
100s of data centers.
1,000,000s of machines.
Higher availability, lower latency.