
Spanner Motivation
Tom Anderson

Outline
Last week:
  Chubby: coordination service
  BigTable: scalable storage of structured data
  GFS: large-scale storage for bulk data
Today/Friday:
  Lessons from GFS/BigTable
  Megastore: multi-key, multi-data center NoSQL
  Spanner: multi-key, multi-data center NoSQL using real-time clocks

GFS Architecture
Each file stored as 64MB chunks
Each chunk on 3+ chunkservers
Single master stores metadata

At Least Once Append
If failure at primary or any replica, retry the append (at a new offset)
Append will eventually succeed! May succeed multiple times!
App client library responsible for:
  Detecting corrupted copies of appended records
  Ignoring extra copies (during streaming reads)
Why not append exactly once?
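A minimal sketch of what at-least-once append asks of the application: the client library retries until an append lands, and readers tolerate corrupted and duplicate copies. The record framing, the record IDs, and the chunkservers.append() call are illustrative assumptions, not the real GFS client interface.

```python
import struct, uuid, zlib

def append_record(chunkservers, payload: bytes, max_retries=5):
    """Retry the append until it succeeds; duplicates are possible."""
    record_id = uuid.uuid4().bytes                 # lets readers drop duplicate copies
    framed = struct.pack(">16sI", record_id, zlib.crc32(payload)) + payload
    for _ in range(max_retries):
        try:
            return chunkservers.append(framed)     # primary picks the offset (hypothetical call)
        except IOError:
            continue                               # retry; the record may land more than once
    raise IOError("append failed after retries")

def read_records(stream):
    """Streaming read: skip corrupted copies, ignore extra copies."""
    seen = set()
    for record_id, crc, payload in stream:         # stream yields parsed (id, crc, payload) records
        if zlib.crc32(payload) != crc:             # corrupted copy: skip it
            continue
        if record_id in seen:                      # extra copy from a retried append
            continue
        seen.add(record_id)
        yield payload
```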

Question
Does the BigTable tablet server use at-least-once append for its operation log?

Data Corruption
Files stored on Linux, and Linux has bugs
  Sometimes silent corruptions
Files stored on disk, and disks are not fail-stop
  Stored blocks can become corrupted over time
  Ex: writes to sectors on nearby tracks
Rare events become common at scale
Chunkservers maintain per-block CRCs (one per 64KB block of each chunk)
  Local log of CRC updates
  Verify CRCs before returning read data
  Periodic revalidation to detect background failures
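A rough sketch of the per-block CRC idea, assuming an in-memory stand-in for the chunkserver's local storage: verify before returning reads, and scrub periodically to catch background corruption.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # CRC granularity within a chunk

class ChunkStore:
    def __init__(self):
        self.blocks = []        # 64KB byte blocks (stand-in for on-disk data)
        self.crcs = []          # one CRC32 per block, persisted locally in GFS

    def write_block(self, data: bytes):
        self.blocks.append(data)
        self.crcs.append(zlib.crc32(data))

    def read_block(self, i: int) -> bytes:
        data = self.blocks[i]
        if zlib.crc32(data) != self.crcs[i]:
            # Report to the master so a good replica can be re-copied.
            raise IOError(f"checksum mismatch in block {i}")
        return data

    def background_scrub(self):
        """Periodic revalidation: catches corruption in rarely-read blocks."""
        return [i for i, d in enumerate(self.blocks)
                if zlib.crc32(d) != self.crcs[i]]
```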

~15 Years Later
Scale is much bigger:
  Now 10K servers instead of 1K
  Now 100 PB instead of 100 TB
Bigger workload change: updates to small files!
  Around 2010: incremental updates of the Google search index

GFS -> Colossus
GFS scaled to ~50 million files, ~10 PB
Developers had to organize their apps around large append-only files (see BigTable)
Latency-sensitive applications suffered
GFS eventually replaced with a new design, Colossus

Metadata Scalability
Main scalability limit: single master stores all metadata
  HDFS has the same problem (single NameNode)
Approach: partition the metadata among multiple masters
  New system supports ~100M files per master
  And smaller chunk sizes: 1MB instead of 64MB

Reducing Storage Overhead
Replication: 3x storage to tolerate two failures
Erasure coding is more flexible: m data pieces, n check pieces
  e.g., RAID-5: 2 data disks, 1 parity disk (XOR of the other two)
  => tolerates 1 failure with only 1.5x storage
Sub-chunk writes more expensive (read-modify-write)
Recovery is harder: usually need to read all the other pieces, then generate another one after the failure

Erasure Coding
3-way replication: 3x overhead, 2 failures tolerated, easy recovery
Google Colossus: (6,3) Reed-Solomon code
  1.5x overhead, 3 failures tolerated
Facebook HDFS: (10,4) Reed-Solomon
  1.4x overhead, 4 failures tolerated, expensive recovery
Azure: more advanced (12,4) code
  1.33x overhead, 4 failures tolerated, same recovery cost as Colossus
(Overhead arithmetic is checked in the sketch after the next slide.)

BigTable System Structure (diagram)
Bigtable cell:
  Bigtable master: performs metadata ops, load balancing
  Bigtable tablet servers (many): serve data
  Bigtable client + Bigtable client library: Open(), then talks to tablet servers
Underlying services:
  Cluster scheduling master: handles failover, monitoring
  GFS: holds tablet data, logs
  Lock service (Chubby): holds metadata, handles master election
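A quick check of the overhead numbers on the Erasure Coding slide, assuming an (m, n) code stores m data pieces plus n check pieces, so overhead is (m+n)/m and up to n simultaneous piece failures can be tolerated. (Azure's (12,4) code is usually described as a local reconstruction code, LRC.)

```python
# Storage overhead = (data pieces + check pieces) / data pieces.
codes = {
    "3-way replication": (1, 2),    # 1 data copy + 2 extra copies
    "Colossus RS(6,3)": (6, 3),
    "HDFS RS(10,4)": (10, 4),
    "Azure LRC(12,4)": (12, 4),
}
for name, (m, n) in codes.items():
    print(f"{name}: overhead {(m + n) / m:.2f}x, tolerates {n} failures")
# Prints 3.00x, 1.50x, 1.40x, 1.33x, matching the slide.
```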

Tablet Representation (diagram)
Write path: append to the log on GFS, then into a write buffer in memory (random-access)
Read path: merge the in-memory buffer with the SSTables on GFS (mmap)
Tablet = write buffer + append-only log on GFS + set of SSTables on GFS
SSTable: immutable on-disk ordered map from string -> string
  String keys: <row, column, timestamp> triples

Compactions
Tablet state represented as a set of immutable compacted SSTable files, plus the tail of the log (buffered in memory)
Minor compaction:
  When in-memory state fills up, pick the tablet with the most data and write its contents to SSTables stored in GFS
  Separate file for each locality group for each tablet
Major compaction:
  Periodically compact all SSTables for a tablet into a new base SSTable on GFS
  Storage reclaimed from deletions at this point
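A minimal LSM-style sketch of this tablet layout, assuming a single-process stand-in for GFS: writes hit the log and the in-memory buffer, reads merge the buffer with immutable SSTables newest-first, and a minor compaction freezes the buffer into a new SSTable. Not BigTable's actual code.

```python
class Tablet:
    def __init__(self, log):
        self.log = log            # stand-in for the append-only log on GFS
        self.memtable = {}        # in-memory write buffer, random access
        self.sstables = []        # newest-first list of immutable sorted maps

    def write(self, key, value):
        self.log.append((key, value))   # durable first
        self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:        # newest data wins
            return self.memtable[key]
        for sstable in self.sstables:   # then newest SSTable to oldest
            if key in sstable:
                return sstable[key]
        return None

    def minor_compaction(self):
        """Flush the buffer to a new immutable, sorted SSTable."""
        frozen = dict(sorted(self.memtable.items()))
        self.sstables.insert(0, frozen)
        self.memtable = {}
```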

Timestamps
Used to store different versions of data in a cell
New writes default to the current time, but timestamps for writes can also be set explicitly by clients
Lookup options (see the sketch after the API slide):
  Return most recent K values
  Return all values in a timestamp range (or all values)
Column families can be marked with attributes:
  Only retain most recent K values in a cell
  Keep values until they are older than K seconds

API
Metadata operations
  Create/delete tables, column families, change metadata
Writes (atomic)
  Set(): write cells in a row
  DeleteCells(): delete cells in a row
  DeleteRow(): delete all cells in a row
Reads
  Scanner: read arbitrary cells in a bigtable
  Each row read is atomic
  Can restrict returned rows to a particular range
  Can ask for just data from 1 row, all rows, etc.
  Can ask for all columns, just certain column families, or specific columns
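A runnable sketch of the versioned-cell data model behind the Timestamps slide and the API's lookup options: each (row, column) keeps timestamped values, lookups return the most recent K versions or a range, and GC trims old versions. The class and method names here are invented for illustration and are not the real BigTable API.

```python
import time
from collections import defaultdict

class VersionedTable:
    def __init__(self):
        # (row, column) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def set(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)   # keep newest first

    def latest(self, row, column, k=1):
        """Return the most recent k versions."""
        return self.cells[(row, column)][:k]

    def range(self, row, column, t_min, t_max):
        """Return all versions with t_min <= timestamp <= t_max."""
        return [(t, v) for t, v in self.cells[(row, column)] if t_min <= t <= t_max]

    def gc(self, row, column, keep_k):
        """Per-column-family GC policy: retain only the most recent k versions."""
        self.cells[(row, column)] = self.cells[(row, column)][:keep_k]
```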

Shared Logs
Designed for 1M tablets, 1000s of tablet servers
  1M logs being simultaneously written performs badly
Solution: shared logs
  Write one log file per tablet server instead of per tablet
  Updates for many tablets co-mingled in the same file
  Start new log chunks every so often (64MB)
Problem: during recovery, a server needs to read log data to apply mutations for a tablet
  Lots of wasted I/O if lots of machines need to read data for many tablets from the same log chunk

Shared Log Recovery
Recovery:
  Servers inform the master of the log chunks they need to read
  Master aggregates and orchestrates sorting of the needed chunks
    Assigns log chunks to be sorted to different tablet servers
    Servers sort chunks by tablet, write sorted data to local disk
  Other tablet servers ask the master which servers have the sorted chunks they need
  Tablet servers issue direct RPCs to peer tablet servers to read sorted data for their tablets
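A small sketch of the recovery-time sort, assuming log entries are (tablet, sequence number, mutation) triples: a helper server sorts one co-mingled chunk by tablet, so each recovering server reads only its own run of entries instead of scanning the whole chunk.

```python
from collections import OrderedDict
from itertools import groupby
from operator import itemgetter

def sort_log_chunk(entries):
    """entries: list of (tablet_id, seq_no, mutation) in arrival order."""
    ordered = sorted(entries, key=itemgetter(0, 1))        # by tablet, then sequence number
    by_tablet = OrderedDict()
    for tablet_id, group in groupby(ordered, key=itemgetter(0)):
        by_tablet[tablet_id] = [(seq, mut) for _, seq, mut in group]
    return by_tablet   # a recovering server fetches only by_tablet[its_tablet]

# Example: three tablets interleaved in one 64MB log chunk.
chunk = [("t2", 1, "put a"), ("t1", 7, "del b"), ("t2", 2, "put c"), ("t3", 1, "put d")]
sorted_chunk = sort_log_chunk(chunk)
assert sorted_chunk["t2"] == [(1, "put a"), (2, "put c")]
```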

Compression
Many opportunities for compression:
  Similar values in the same row/column at different timestamps
  Similar values in different columns
  Similar values across adjacent rows
Within each SSTable for a locality group, encode compressed blocks
  Keep blocks small for random access (~64KB compressed data)
  Exploit the fact that many values are very similar
  Needs to be low CPU cost for encoding/decoding

Compression Effectiveness
Experiment: store contents for a 2.1B-page crawl in a BigTable instance
Key: URL of page, with the host-name portion reversed
  com.cnn.www/index.html:http
  Groups pages from the same site together
    Good for compression (neighboring rows tend to have similar contents)
    Good for clients: efficient to scan over all pages on a web site
One compression strategy: gzip each page: ~28% bytes remaining
BigTable: BMDiff + Zippy

  Type          Count (billions)   Space (TB)   Compressed (TB)   % remaining
  Web contents  2.1                45.1         4.2               9.2
  Links         1.8                11.2         1.6               13.9
  Anchors       126.3              22.8         2.9               12.7
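A small sketch of the reversed-hostname row key from the example above; the ":http" scheme suffix follows the slide's sample key, and the exact formatting is an illustration rather than the production key scheme.

```python
from urllib.parse import urlsplit

def row_key(url: str) -> str:
    parts = urlsplit(url)                       # e.g. http://www.cnn.com/index.html
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    path = parts.path or "/"
    return f"{reversed_host}{path}:{parts.scheme}"

assert row_key("http://www.cnn.com/index.html") == "com.cnn.www/index.html:http"
# Keys like com.cnn.www/... and com.cnn.money/... now share prefixes and sort
# together, which helps both compression and whole-site scans.
```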

Summary of BigTable Key Ideas
Unstructured key-value table data
  No need for a schema in advance; instead create columns when needed
Versioned data, with key-specific garbage collection
Maintain data locality on the same tablet
  Instead of consistent hashing, reconfigure tablet boundaries for load balancing
  Tablets for lookup: key -> tablet
Efficient updates using log structure (store deltas)

BigTable in Retrospect
Definitely a useful, scalable system!
Still in use at Google; motivated lots of NoSQL DBs
Lack of distributed transactions: biggest mistake in the design, per Jeff Dean
Lack of multi-data center support

Question
How would you add multi-key transactions to BigTable?
Easy if all keys are on the same tablet, or on different tablets on the same tablet server
What if keys are on different tablet servers?

Multi-Key NoSQL Transactions
Straw proposal: two-phase commit (see the sketch below)
  Select one tablet server as coordinator
  Add log entries for coordinator/participant actions
  Check the log if the coordinator or a participant fails
What if a coordinator/participant crashes?
  BigTable master waits for lease timeout
  Selects a new tablet server
  New tablet server recovers in-progress transactions using the log
  Abort/commit as appropriate
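A toy, single-process sketch of the straw proposal: the coordinator logs its decision before notifying participants, participants log their votes and hold their keys until the outcome arrives, and the shared log is what a replacement server would replay after a lease timeout. Real 2PC needs RPCs and durable logs; the names here are invented for illustration.

```python
class Participant:
    def __init__(self, name, log):
        self.name, self.log, self.locked = name, log, set()

    def prepare(self, txid, keys):
        self.log.append((self.name, "PREPARED", txid, keys))
        self.locked |= set(keys)          # block other ops on these keys until decided
        return True                       # vote yes (a real server could vote no on conflict)

    def finish(self, txid, decision):
        self.log.append((self.name, decision, txid))
        self.locked.clear()               # release keys once the outcome is known

def two_phase_commit(coordinator_log, participants, txid, keys_by_server):
    # Phase 1: coordinator logs intent, collects votes from each tablet server.
    coordinator_log.append(("coord", "BEGIN", txid))
    votes = [p.prepare(txid, keys_by_server[p.name]) for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2: log the decision before telling anyone, then notify participants.
    coordinator_log.append(("coord", decision, txid))
    for p in participants:
        p.finish(txid, decision)
    return decision

log = []
servers = [Participant("ts1", log), Participant("ts2", log)]
assert two_phase_commit(log, servers, "tx7", {"ts1": ["a"], "ts2": ["b"]}) == "COMMIT"
```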

Performance of NoSQL 2PC
What is the performance of multi-key transactions using two-phase commit?
Each tablet server orders operations to its own keys
  If coordinator: must lock or delay subsequent operations to that key until participants reply
  If participant: must lock or delay subsequent ops to that key until the coordinator commits/aborts
All ops to the key are delayed, not just multi-key ones
Stale reads to the rescue?

Question
How would you add support for multiple data centers to BigTable?

Multiple Data Center NoSQL
Straw proposal: Paxos state machine replicas
  Every data center has a complete copy of the data
  One serves as Paxos leader (per tablet or per key)
  Clients contact the leader
  Leader proposes ordering of client ops to the tablet
Paxos implies correctness despite data center failures, network partitions, etc.
Progress if a majority of data centers remain up and connected

Multi-Data Center Paxos Performance
Assume Paxos is optimized: one round from leader to participants per operation (batched)
Latency (see the sketch below):
  One RTT from client to (remote) leader
  One RTT from leader to farthest data center
Throughput:
  Every operation sent to every data center
  N messages to coordinate Paxos ordering (batched)
  Per transaction: two log operations at the coordinator, two at each participant
Stale reads to the rescue?
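A back-of-the-envelope latency comparison for this design; the round-trip times are made-up numbers purely for illustration.

```python
# Commit latency = one RTT from the client to the leader
#                + one RTT from the leader to the farthest data center it waits on.
rtt_client_to_leader_ms = 40      # assumed: client on one coast, leader on the other
rtt_leader_to_farthest_ms = 150   # assumed: leader waiting on a replica overseas

write_latency_ms = rtt_client_to_leader_ms + rtt_leader_to_farthest_ms
print(f"cross-data-center write: ~{write_latency_ms} ms")   # ~190 ms

# A stale read served from the local data center skips both wide-area round trips,
# which is why "stale reads to the rescue?" keeps coming up.
local_stale_read_ms = 1
print(f"local stale read: ~{local_stale_read_ms} ms")
```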

Megastore Motivation
Many apps need atomicity across rows
  Examples: Gmail, Google+, Picasa, ...
Many apps need to span multiple data centers
  Hide outages from end users
  Low latency for every user on the planet
Goals:
  Fast local reads
  At the cost of slower writes

Megastore Key Ideas
BigTable as a service
  No need to reimplement NoSQL
Two-phase commit across keys
  Operation log stored in a BigTable column
Use a data center for testing
  Extensive randomized testing of corner cases

Megastore Key Ideas
Paxos with a replica in each data center
Most operations are reads
For writes, rotating leader: wait your turn to propose
Special quorum rules:
  Reads require one replica, so they can always be local
  Writes require a majority (of data centers) to commit
  Writes require all replicas before returning to the client
    If a data center fails, wait for its lease to expire, then return to the client