BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis


Motivation

Lots of (semi-)structured data at Google:
- URLs: contents, crawl metadata, links, anchors, pagerank, ...
- Per-user data: user preference settings, recent queries/search results, ...
- Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...

Scale is large:
- Billions of URLs, many versions per page (~20K/version)
- Hundreds of millions of users, thousands of queries/sec
- 100TB+ of satellite image data

Why not just use a commercial DB?

- Scale is too large for most commercial databases!
- Even if it weren't, the cost would be very high
- Building internally means the system can be applied across many projects for low incremental cost
- Low-level storage optimizations help performance significantly, and are much harder to do when running on top of a database layer
- Also fun and challenging to build large-scale systems :)

Goals

- Want asynchronous processes to be continuously updating different pieces of data
- Want access to the most current data at any time
- Need to support:
  - Very high read/write rates (millions of ops per second)
  - Efficient scans over all or interesting subsets of data
  - Efficient joins of large one-to-one and one-to-many datasets
- Often want to examine data changes over time (e.g. contents of a web page over multiple crawls)
- Fault tolerance
- Fast recovery

BigTable

- Distributed multi-level map with an interesting data model
- Fault-tolerant, persistent
- Scalable:
  - Thousands of servers
  - Terabytes of in-memory data
  - Petabyte of disk-based data
  - Millions of reads/writes per second, efficient scans
- Self-managing:
  - Servers can be added/removed dynamically
  - Servers adjust to load imbalance

Background: Building Blocks

Building blocks:
- Google File System (GFS): raw storage
- Scheduler: schedules jobs onto machines
- Lock service: distributed lock manager; can also reliably hold tiny files (100s of bytes) with high availability
- MapReduce: simplified large-scale data processing

BigTable's use of the building blocks:
- GFS: stores persistent state
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to read/write BigTable data

Typical Cluster

[Diagram: each machine runs Linux with a scheduler slave and a GFS chunkserver, plus user applications and BigTable servers (one machine also hosts the BigTable master); cluster-wide services include the cluster scheduling master, the lock service, and the GFS master.]

BigTable Overview

- Data model
- Implementation structure: tablets, compactions, locality groups, ...
- API
- Details: shared logs, compression, replication, ...
- Evaluation
- Current/future work

Basic Data Model

Distributed multi-dimensional sparse map: (row, column, timestamp) -> cell contents

[Diagram: row www.cnn.com, column contents:, holding versions of <html> at timestamps t3, t11, t17.]

Good match for most of our applications.
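To make the model concrete, here is a minimal in-memory sketch of that map in Python; the class name and method signatures are illustrative assumptions, not Bigtable's actual API.

from collections import defaultdict

class SparseTable:
    def __init__(self):
        # row key -> column name -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        versions = self._rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # keep newest first

    def get(self, row, column, timestamp=None):
        # Return (ts, value) for the newest version at or before `timestamp`.
        for ts, value in self._rows[row][column]:
            if timestamp is None or ts <= timestamp:
                return ts, value
        return None

table = SparseTable()
table.put("com.cnn.www", "contents:", 3, "<html>...v1")
table.put("com.cnn.www", "contents:", 11, "<html>...v2")
print(table.get("com.cnn.www", "contents:"))     # (11, '<html>...v2')
print(table.get("com.cnn.www", "contents:", 5))  # (3, '<html>...v1')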

Rows

- Name is an arbitrary string
- Access to data in a row is atomic
- Row creation is implicit upon storing data
- Rows are ordered lexicographically
- Rows close together lexicographically are usually on one or a small number of machines

Tablets

- Large tables are broken into tablets at row boundaries
  - A tablet holds a contiguous range of rows
  - Clients can often choose row keys to achieve locality
  - Aim for ~100MB to 200MB of data per tablet
- Serving machine responsible for ~100 tablets
  - Fast recovery: 100 machines each pick up 1 tablet from a failed machine
  - Fine-grained load balancing: migrate tablets away from an overloaded machine; the master makes load-balancing decisions
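As a toy illustration of the tablet idea, the sketch below partitions a sorted row space at row boundaries and routes a row key to the tablet whose range covers it; the boundary keys and tablet names are made up for the example.

import bisect

# Each tablet covers a half-open row range starting at its boundary key;
# tablets are kept sorted by start key.
tablet_starts = ["", "cnn.com", "website.com", "yahoo.com/kids.html\0"]
tablet_names = ["tablet-0", "tablet-1", "tablet-2", "tablet-3"]

def tablet_for_row(row_key):
    # Find the right-most tablet whose start key is <= row_key.
    i = bisect.bisect_right(tablet_starts, row_key) - 1
    return tablet_names[i]

print(tablet_for_row("aaa.com"))              # tablet-0
print(tablet_for_row("cnn.com/sports.html"))  # tablet-1
print(tablet_for_row("zuppa.com/menu.html"))  # tablet-3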

Tablets & Splitting

[Diagram: a table with language: and contents: columns, with rows running from aaa.com and cnn.com through cnn.com/sports.html, website.com, yahoo.com/kids.html, and zuppa.com/menu.html; the table is split into tablets at row boundaries such as between yahoo.com/kids.html and yahoo.com/kids.html\0.]

Bigtable Cell: System Structure

- Bigtable clients use the Bigtable client library; metadata ops and Open() go to the master, while reads/writes go directly to tablet servers
- The Bigtable master performs metadata ops and load balancing
- Bigtable tablet servers serve data
- The cluster scheduling system handles failover and monitoring
- GFS holds tablet data and logs
- The lock service holds metadata and handles master election

Notes on System Structure

- Each tablet is served by exactly one tablet server, which eliminates the need to coordinate tablet state between servers
- A single master means there is no master state to synchronize

Locating Tablets

Since tablets move around from server to server, given a row, how do clients find the right machine? We need to find the tablet whose row range covers the target row.

- One approach: use the BigTable master. A central server would almost certainly be a bottleneck in a large system.
- Instead: store special tables containing tablet location info in the BigTable cell itself

Locating Tablets (cont.)

Our approach: a 3-level hierarchical lookup scheme for tablets. A location is the ip:port of the relevant server.

- 1st level: bootstrapped from the lock service, points to the owner of META0
- 2nd level: uses META0 data to find the owner of the appropriate META1 tablet
- 3rd level: the META1 table holds the locations of tablets of all other tables; the META1 table itself can be split into multiple tablets

Aggressive prefetching and caching mean most ops go right to the proper machine.

[Diagram: a pointer to the META0 location is stored in the lock service; the META0 table has one row per META1 tablet; the META1 table has one row per non-META tablet (all tables), leading to the actual tablet in table T.]
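A rough sketch of the 3-level lookup follows, with in-process dictionaries standing in for the lock service, META0, and META1; the key-to-tablet rule and all server names are deliberately simplified assumptions (real clients also cache and prefetch locations).

# Level 1: the lock service stores the location of the META0 tablet.
lock_service = {"meta0_location": "server-A:9000"}

# Level 2: META0 maps META1 tablet end-keys to their servers.
meta0 = {"meta1-end-m": "server-B:9000", "meta1-end-z": "server-C:9000"}

# Level 3: each META1 tablet maps (table, tablet end-key) to a server.
meta1 = {
    "server-B:9000": {("T", "g"): "server-D:9000"},
    "server-C:9000": {("T", "z"): "server-E:9000"},
}

def locate(table, row_key):
    _ = lock_service["meta0_location"]  # 1st lookup: find META0's owner
    # 2nd lookup: pick the META1 tablet covering this key (toy rule).
    meta1_server = meta0["meta1-end-m"] if row_key <= "m" else meta0["meta1-end-z"]
    # 3rd lookup: pick the user tablet covering (table, row_key).
    for (tbl, end_key), server in sorted(meta1[meta1_server].items()):
        if tbl == table and row_key <= end_key:
            return server
    raise KeyError("no tablet covers this row")

print(locate("T", "cnn.com"))  # server-D:9000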

Additional Uses of the Lock Service

- Help keep track of live servers: each tablet server holds a corresponding lock in Chubby; if the master can grab that lock, the tablet server is considered dead
- Track liveness of the master: same general approach
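A toy sketch of that liveness check, with a local threading.Lock standing in for a Chubby lock (an assumption for illustration only; Chubby locks are leases tied to sessions, not in-process mutexes):

import threading

server_locks = {"tabletserver-17": threading.Lock()}

def server_registers(name):
    server_locks[name].acquire()  # held for the server's whole session

def master_considers_dead(name):
    # Non-blocking attempt: success means the server no longer holds its lock.
    acquired = server_locks[name].acquire(blocking=False)
    if acquired:
        server_locks[name].release()
    return acquired

server_registers("tabletserver-17")
print(master_considers_dead("tabletserver-17"))  # False: server is alive
server_locks["tabletserver-17"].release()        # simulate session expiry
print(master_considers_dead("tabletserver-17"))  # True: presumed dead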

Role of the Master

- Ensure each tablet is served by exactly one tablet server, both at cluster startup and in case of failure
- Load balance by moving tablets
- Metadata operations: schema changes, column family creation
- Ordering tablet merges
- Garbage collection

Tablet Representation

[Diagram: writes go to an append-only commit log on GFS and into an in-memory write buffer (random access); reads are served from the write buffer merged with a set of SSTables on GFS (possibly mmapped).]

SSTable: an immutable on-disk ordered map from string to string; the string keys are <row, column, timestamp> triples.
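The read path implied by this picture can be sketched as a merge of the in-memory buffer (memtable) over the immutable SSTables, with newer layers shadowing older ones; the plain dicts below stand in for the real log and SSTable formats.

# Oldest to newest on-disk layers, each an immutable sorted map.
sstables = [
    {("com.cnn.www", "contents:"): "<html>...old"},
    {("com.cnn.www", "anchor:cnnsi.com"): "CNN"},
]
# In-memory buffer of recent writes (deletions would be tombstone markers).
memtable = {("com.cnn.www", "contents:"): "<html>...new"}
TOMBSTONE = object()

def read(row, column):
    key = (row, column)
    # Check the newest data first: memtable, then SSTables newest-to-oldest.
    for layer in [memtable] + list(reversed(sstables)):
        if key in layer:
            value = layer[key]
            return None if value is TOMBSTONE else value
    return None

print(read("com.cnn.www", "contents:"))         # "<html>...new"
print(read("com.cnn.www", "anchor:cnnsi.com"))  # "CNN"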

SSTables and Immutability

- No need to synchronize access to SSTables on GFS
- Garbage collection only needs to deal with full SSTables, not individual rows

SSTables (cont.)

SSTables can be shared between tablets, e.g. in the event of splits.

[Diagram: two tablets, one covering rows aardvark..apple and one covering apple_two_e..boat, sharing some of the same underlying SSTables.]

Compactions

Tablet state is represented as a set of immutable compacted SSTable files, plus a tail of the log (buffered in memory).

- Minor compaction: when the in-memory state fills up, pick the tablet with the most data and write its contents to SSTables stored in GFS (a separate file for each locality group of each tablet)
- Major compaction: periodically compact all SSTables for a tablet into a new base SSTable on GFS; storage reclaimed from deletions at this point

Compactions (cont.)

Why a merging compaction?
- Every minor compaction creates a new SSTable, and read operations must look at these SSTables
- A merging compaction combines SSTables plus the memtable into a single new SSTable
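A sketch of minor vs. merging compaction, reusing the same toy layering as the read-path sketch above (plain dicts, newest values winning); this is an illustration of the idea, not the real on-disk machinery.

def minor_compaction(memtable, sstables):
    # Freeze the memtable into a new immutable SSTable and clear it.
    sstables.append(dict(memtable))
    memtable.clear()

def merging_compaction(sstables, memtable=None):
    # Merge layers oldest-to-newest into one SSTable; newer values win.
    merged = {}
    for layer in sstables + ([memtable] if memtable else []):
        merged.update(layer)
    return [merged]

sstables = [{("r1", "c"): "v1"}]
memtable = {("r1", "c"): "v2", ("r2", "c"): "w1"}
minor_compaction(memtable, sstables)
sstables = merging_compaction(sstables)
print(sstables)  # [{('r1', 'c'): 'v2', ('r2', 'c'): 'w1'}]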

Columns

[Diagram: row www.cnn.com with column contents: holding <html> and anchor columns anchor:cnnsi.com and anchor:stanford.edu holding anchor text ("CNN home page", "CNN").]

Columns have a two-level name structure: family:optional_qualifier
- Column family: the unit of access control; has associated type information
- The qualifier gives unbounded columns: an additional level of indexing, if desired
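A small sketch of the family:optional_qualifier naming, e.g. "anchor:cnnsi.com" (the helper name is an assumption for illustration):

def split_column(column_name):
    # The family is everything before the first ":"; the qualifier may be empty.
    family, _, qualifier = column_name.partition(":")
    return family, qualifier

print(split_column("anchor:cnnsi.com"))  # ('anchor', 'cnnsi.com')
print(split_column("contents:"))         # ('contents', '')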

Timestamps

- Used to store different versions of data in a cell
- New writes default to the current time, but timestamps for writes can also be set explicitly by clients
- Lookup options:
  - Return the most recent K values
  - Return all values in a timestamp range (or all values)
- Column families can be marked with attributes:
  - Only retain the most recent K values in a cell
  - Keep values until they are older than K seconds
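A sketch of the two per-column-family garbage-collection settings listed above; versions are (timestamp, value) pairs kept newest first, and the function names are illustrative assumptions.

import time

def gc_keep_last_k(versions, k):
    # Retain only the most recent K versions of a cell.
    return versions[:k]

def gc_keep_newer_than(versions, max_age_seconds, now=None):
    # Retain only versions younger than max_age_seconds.
    now = time.time() if now is None else now
    return [(ts, v) for ts, v in versions if now - ts <= max_age_seconds]

versions = [(1000.0, "v3"), (900.0, "v2"), (100.0, "v1")]  # newest first
print(gc_keep_last_k(versions, 2))                  # keeps v3, v2
print(gc_keep_newer_than(versions, 200, now=1000))  # keeps v3, v2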

Locality Groups

- Column families can be assigned to a locality group
- Used to organize the underlying storage representation for performance: scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
- Data in a locality group can be explicitly memory-mapped
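A sketch of the idea: column families are mapped to separate stores, so a scan that touches only one group reads only that group's bytes. The particular grouping and store layout here are assumptions for illustration.

locality_groups = {
    "content_group": {"contents"},               # large, colder data
    "metadata_group": {"language", "pagerank"},  # small, hot data
}

# One store (here just a dict) per locality group.
stores = {
    "content_group": {("com.cnn.www", "contents:"): "<html>..."},
    "metadata_group": {
        ("com.cnn.www", "language:"): "EN",
        ("com.cnn.www", "pagerank:"): "0.65",
    },
}

def scan_group(group):
    # Touches only the bytes belonging to one locality group.
    return sorted(stores[group].items())

print(scan_group("metadata_group"))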

Locality Groups (example)

[Diagram: row www.cnn.com with contents: (<html>) in one locality group and language: (EN) and pagerank: (0.65) in another.]

API

- Metadata operations: create/delete tables and column families, change metadata
- Writes (atomic):
  - Set(): write cells in a row
  - DeleteCells(): delete cells in a row
  - DeleteRow(): delete all cells in a row
- Reads:
  - Scanner: read arbitrary cells in a bigtable; each row read is atomic
  - Can restrict returned rows to a particular range
  - Can ask for just data from 1 row, all rows, etc.
  - Can ask for all columns, just certain column families, or specific columns
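The real client API is a C++ library; the Python-flavored sketch below only mirrors the operations named on this slide (per-row atomic mutations via Set/DeleteCells, scanner-style reads), and its class and method names are assumptions for illustration.

class Table:
    def __init__(self):
        self.cells = {}

    def apply_row(self, row_key, ops):
        # All mutations to one row are applied together (atomic per row).
        for op, column, *rest in ops:
            if op == "set":
                self.cells[(row_key, column)] = rest[0]
            elif op == "delete_cells":
                self.cells.pop((row_key, column), None)

    def scan(self, row_prefix="", columns=None):
        # Yield cells in row order, optionally restricted to certain columns.
        for (row, col), val in sorted(self.cells.items()):
            if row.startswith(row_prefix) and (columns is None or col in columns):
                yield row, col, val

class RowMutation:
    def __init__(self, table, row_key):
        self.table, self.row_key, self.ops = table, row_key, []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete_cells(self, column):
        self.ops.append(("delete_cells", column))

    def apply(self):
        self.table.apply_row(self.row_key, self.ops)

t = Table()
m = RowMutation(t, "com.cnn.www")
m.set("anchor:cnnsi.com", "CNN")
m.set("contents:", "<html>...")
m.apply()
print(list(t.scan(row_prefix="com.cnn", columns={"anchor:cnnsi.com"})))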

Shared Logs

- Designed for 1M tablets and 1000s of tablet servers; 1M logs being simultaneously written performs badly
- Solution: shared logs
  - Write one log file per tablet server instead of per tablet
  - Updates for many tablets are co-mingled in the same file
  - Start new log chunks every so often (64 MB)
- Problem: during recovery, a server needs to read log data to apply mutations for a tablet, which wastes lots of I/O if many machines need to read data for many tablets from the same log chunk

Shared Log Recovery

- Servers inform the master of the log chunks they need to read
- The master aggregates and orchestrates sorting of the needed chunks, assigning log chunks to different tablet servers to be sorted
- Servers sort chunks by tablet and write the sorted data to local disk
- Other tablet servers ask the master which servers have the sorted chunks they need
- Tablet servers issue direct RPCs to peer tablet servers to read the sorted data for their tablets
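A sketch of the sorting step: a commit-log chunk interleaves mutations from many tablets, so it is partitioned by tablet (preserving per-tablet order) so each recovering server reads only its own tablets' entries contiguously. The entry format below is an assumption for illustration.

from collections import defaultdict

# One shared log chunk: (tablet, sequence_number, mutation) entries, co-mingled.
log_chunk = [
    ("tablet-7", 1, "set r1 c v1"),
    ("tablet-3", 2, "set r9 c v2"),
    ("tablet-7", 3, "delete r1 c"),
    ("tablet-3", 4, "set r8 c v3"),
]

def sort_chunk_by_tablet(chunk):
    # Group entries by tablet, preserving per-tablet mutation order.
    by_tablet = defaultdict(list)
    for tablet, seq, mutation in sorted(chunk, key=lambda e: e[1]):
        by_tablet[tablet].append((seq, mutation))
    return dict(by_tablet)

sorted_chunk = sort_chunk_by_tablet(log_chunk)
print(sorted_chunk["tablet-7"])  # [(1, 'set r1 c v1'), (3, 'delete r1 c')]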

Performance Evaluation

The table shows per-server throughput; the chart shows aggregate throughput.

Performance Evaluation (cont.)

Scaling is not linear. The random read benchmark saturates the network links: 64 KB are transferred for each 1 KB read, which is not efficient.

Characteristics of Production Clusters

Evidence the Problems Are Solved

- Wide use across Google (2006): thousands of servers, used by over sixty projects, over a petabyte of storage used
- We trust the designers when they say that it is reliable

Overarching Themes

- Don't reinvent the wheel
- Keep designs simple
- Rely on immutable data structures when possible
- Split responsibilities when possible

Questions / Possible Improvements

- Add atomicity when operating on multiple rows?
- What do typical access patterns look like?
- Will bottlenecking be encountered at a certain system size? How much load is on the master? What are the effects of the master's persistent logging?
- How does the load balancer work? How reliable is it in practice?
- How fast is recovery from various failures?

References

Slides:
- Jeff Dean, BigTable presentation: pages.cs.wisc.edu/~remzi/classes/739/fall2017/papers/bigtable-slides-05.pdf
- S. Sudarshan, BigTable presentation: www.cse.iitb.ac.in/infolab/data/courses/cs632/talks/bigtable-uw-presentation.ppt

Additional content:
- Original BigTable paper: https://research.google.com/archive/bigtable-osdi06.pdf