Distributed Systems [Fall 2012]


Distributed Systems [Fall 2012] Lec 20: Bigtable (cont'd) Slide acks: Mohsen Taheriyan (http://www-scf.usc.edu/~csci572/2011spring/presentations/taheriyan.pptx)

Chubby (Reminder) A lock service with a file system interface. Intuitively, Chubby provides locks with the ability to store a small amount of data in them, which anyone can read but no one can write without holding a writer's lock. It also provides notifications for file updates and other events. Uses Paxos internally to provide reliability and availability.
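To make the "lock file that also stores a bit of data" idea concrete, here is a minimal in-memory sketch. The class and method names are invented for illustration; the real Chubby API (sessions, handles, events) is richer.

```python
# Toy sketch of Chubby's core abstraction: a file that is both a lock
# and a small data cell. Reads need no lock; writes require the
# writer's lock. (Illustrative only; not the real Chubby API.)

class ChubbyFile:
    def __init__(self, data=b""):
        self.data = data          # small payload stored in the lock file
        self.writer = None        # current holder of the writer lock

    def acquire_writer_lock(self, client):
        if self.writer is None:
            self.writer = client
            return True
        return False              # someone else holds the lock

    def read(self):
        return self.data          # anyone can read, no lock required

    def write(self, client, data):
        if self.writer != client:
            raise PermissionError("writer lock required")
        self.data = data

f = ChubbyFile(b"master: tablet-server-7")
print(f.read())                   # readable by all clients
assert f.acquire_writer_lock("A")
f.write("A", b"master: tablet-server-9")
```

This is exactly the shape Bigtable exploits: the lock doubles as a tiny, consistent registry (e.g., "who is the master?", "where is the root tablet?").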

Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. OSDI 2006. Slide acks to: Mohsen Taheriyan (http://www-scf.usc.edu/~csci572/2011spring/presentations/taheriyan.pptx)

Bigtable Description Outline Motivation and goals (last time); schemas, interfaces, and semantics (with code) (today); architecture (today); implementation details (today, or you'll read on your own)

Bigtable Goals (Reminder) A distributed storage system for (semi-)structured data. Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans. Self-managing: servers can be added/removed dynamically, and servers adjust to load imbalance. Extremely popular at Google (as of 2008): web indexing, personalized search, Google Earth, Google Analytics, Google Finance, ...

Background Building blocks: Google File System (GFS): raw storage. Scheduler: schedules jobs onto machines. Chubby: lock service. Bigtable's uses of the building blocks: GFS stores all persistent state; the Scheduler schedules the jobs involved in Bigtable serving; Chubby handles master election and location bootstrapping.

GFS (Reminder) [diagram: clients talk to the GFS master for metadata and directly to chunk servers for data; chunks C0-C5 are spread and replicated across the chunk servers] The master manages metadata only. Data transfers happen directly between clients and chunkservers. Files are broken into chunks (typically 64 MB). Chunks are replicated across three machines for reliability.

Typical Cluster [diagram: each of Machines 1-3 runs Linux with a scheduler slave, a GFS chunk server, and user tasks such as MapReduce workers and Bigtable servers (one machine also hosts the Bigtable master); cluster-wide services are the cluster scheduling master, the lock service (Chubby), and the GFS master. Annotations mark the serving processes as by-and-large stateless, and the GFS chunk servers, which hold data on local disk, as stateful.]

Specific Example: Web Search Webtable. Note: ignores page-rank functionality for simplicity. Source: HBase in Action, Dimiduk et al., http://www.manning.com/dimidukkhurana/hbiasample_ch1.pdf

Specific Example: Web Search The Webtable workload: inserts or updates of page contents and of anchors to pages; full or partial table scans; targeted queries (which should not require scans!). Note: ignores page-rank functionality for simplicity. Source: HBase in Action, Dimiduk et al., http://www.manning.com/dimidukkhurana/hbiasample_ch1.pdf

Bigtable Description Outline Motivation and goals (last time); schemas, interfaces, and semantics (with code) (today); architecture (today); implementation details (today, or you'll read on your own)

Basic Data Model A Bigtable is a sparse, distributed, persistent, multidimensional sorted map: (row:string, column:string, timestamp:int64) -> string. Example: the (simplified) schema of the Webtable. Row name/key: up to 64KB (10-100B typically), sorted; in this case, reversed URLs. Each cell holds timestamped versions, subject to garbage collection; columns are grouped into column families.
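The map signature above can be sketched directly. This is a minimal in-memory stand-in for the data model (a plain dict keyed by (row, column, timestamp)); real Bigtable distributes and persists this map, and the helper names here are invented:

```python
# Minimal sketch of the Bigtable data model: a map from
# (row, column, timestamp) to a string value. Illustrative only.

table = {}

def put(row, column, value, ts):
    table[(row, column, ts)] = value

def get_latest(row, column):
    # return the value with the highest timestamp for this cell
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", "<html>...", ts=1)
put("com.cnn.www", "contents:", "<html>v2...", ts=2)
put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=1)

print(get_latest("com.cnn.www", "contents:"))   # → <html>v2...
```

Note how sparseness falls out for free: a cell that was never written simply has no entry in the map.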

Rows Row names/keys are arbitrary strings and are ordered lexicographically. Rows close together lexicographically are stored on one or a small number of machines; hence, programmers can manipulate row names to achieve good locality in their programs. Example: com.cnn.www vs. www.cnn.com: which row key provides more locality for site-local queries? Access to data in a row is atomic: the row is the only unit of atomicity in Bigtable. Bigtable does not support the relational model: no table integrity constraints, no multi-row transactions.

Row-based Locality Matches from the same site (e.g., www.cnn.com, edition.cnn.com, money.cnn.com) should be retrieved together by accessing the minimal number of machines. Using reversed-DNS URLs as row keys clusters URLs from the same site together (com.cnn.edition, com.cnn.money, com.cnn.www), which speeds up site-local queries.
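The reversed-DNS trick is easy to demonstrate: reverse the dot-separated hostname components and lexicographic sorting clusters same-site keys. A minimal sketch (the helper name is invented):

```python
# Reversed-DNS row keys: hostnames from the same site become
# lexicographic neighbors, so they land on the same tablet(s).

def row_key(hostname):
    return ".".join(reversed(hostname.split(".")))

urls = ["www.cnn.com", "edition.cnn.com",
        "money.cnn.com", "www.example.org"]
print(sorted(row_key(u) for u in urls))
# cnn.com pages now sort next to each other:
# ['com.cnn.edition', 'com.cnn.money', 'com.cnn.www', 'org.example.www']
```

With the unreversed keys, edition.cnn.com, money.cnn.com, and www.cnn.com would sort far apart, scattered across tablets.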

Columns Columns have a two-level name structure: family:optional_qualifier. The column family is the unit of access control and has associated type information; there are few column families. The qualifier gives an unbounded number of columns in each row, providing additional levels of indexing if desired; these columns are extremely sparsely populated across rows. Example row com.cnn.www: lang: = "EN", contents: = "<html>..", anchor:cnnsi.com = "CNN", anchor:stanford.edu = "CNN homepage".

Timestamps Used to store different versions of data in a cell. New writes default to the current time, but timestamps for writes can also be set explicitly by clients. Lookup options: return the most recent N values, or return all values in a timestamp range (or all values). Column families can be marked with garbage-collection attributes: only retain the most recent N versions in a cell, or keep values only until they are older than T seconds. Example use: keep multiple versions of the data (e.g., Web pages).

The Bigtable API Metadata operations: create/delete tables and column families, change metadata. Writes (single-row, atomic): Set() writes cells in a row; DeleteCells() deletes cells in a row; DeleteRow() deletes all cells in a row. Reads: the Scanner abstraction allows reading arbitrary cells in a Bigtable table. Each row read is atomic. Clients can restrict returned rows to a particular range; ask for just the data from one row (getter) or from all rows (scanner); ask for all columns, just certain column families, or specific columns; and ask for certain timestamps only.

API Examples: Write An atomic row modification. No support for (RDBMS-style) multi-row transactions.
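The slide's write example follows the shape of the paper's C++ client code (open a table, build a RowMutation, apply it atomically). Here is a Python sketch of that shape; the class and function names are stand-ins, not the real client library:

```python
# Sketch of an atomic single-row write, loosely modeled on the paper's
# RowMutation/Apply pattern. Illustrative names, not the real API.

class RowMutation:
    def __init__(self, table, row):
        self.table, self.row = table, row
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column))

def apply_mutation(mutation):
    # all buffered ops on this single row commit together (atomically)
    row = mutation.table.setdefault(mutation.row, {})
    for op, column, *value in mutation.ops:
        if op == "set":
            row[column] = value[0]
        else:
            row.pop(column, None)

webtable = {}
m = RowMutation(webtable, "com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
apply_mutation(m)                 # both changes land in one atomic step
print(webtable["com.cnn.www"])
```

Note that the mutation buffers its operations and applies them in one step: that is the single-row atomicity guarantee. Nothing lets you group mutations to two different rows into one transaction.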

Example Exercise: Define a Bigtable Schema for (Simplified) Twitter On the exam, you'll get a Bigtable schema design question. To prep, do this example at home and ask specific questions on Piazza. Exercise: based on Webtable's Bigtable schema, define a schema for an efficient, simplified version of Twitter. Recommended design steps: Restrict Twitter to some basic functionality and formulate the kinds of queries you might need to run to achieve that functionality (example functionality: display tweets from the persons the user follows). Identify the locality requirements for your queries to be efficient. Design a Bigtable schema (row names, column families, column names within each family, and cell contents) that would support the identified queries efficiently. Hint: don't worry about replicating some data, such as tweet IDs, for fast access.

Bigtable Description Outline Motivation and goals (last time); schemas, interfaces, and semantics (with code) (today); architecture (today); implementation details (today, or you'll read on your own)

Tablets A Bigtable table is partitioned into many tablets based on row keys. Tablets (100-200MB each) are stored in a particular structure in GFS. Each tablet is served by one tablet server. Tablet servers are stateless (all tablet state is in GFS), hence they can restart at any time. [diagram: rows com.aaa through com.zuppa/menu.html, with columns language: and contents:, partitioned into tablets, e.g. one tablet spanning com.aaa to com.cnn.www and the next spanning com.cnn.www to com.dodo.www]

Tablet Structure Uses Google SSTables, a key building block. Without going into much detail, an SSTable: is an immutable, sorted file of key-value pairs; is stored as a file in GFS; has keys of the form <row, column, timestamp>. SSTables allow only appends, no in-place updates (deletes are possible). Why do you think they don't use something that supports updates? [diagram: an SSTable consists of an index over block ranges plus a sequence of 64KB blocks]
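The SSTable structure (sorted immutable entries plus a sparse block index) can be sketched in a few lines. This toy version keeps everything in memory and uses 2-entry "blocks" instead of 64KB ones; all names are illustrative:

```python
# Toy SSTable: an immutable, sorted list of key-value pairs plus a
# sparse index mapping each block's first key to its offset. Lookups
# binary-search the index, then scan one block.

import bisect

class SSTable:
    BLOCK = 2                               # entries per block (64KB in reality)

    def __init__(self, items):
        self.entries = sorted(items)        # immutable once built
        self.index = [(self.entries[i][0], i)
                      for i in range(0, len(self.entries), self.BLOCK)]

    def get(self, key):
        keys = [k for k, _ in self.index]
        block = self.index[bisect.bisect_right(keys, key) - 1][1]
        for k, v in self.entries[block:block + self.BLOCK]:
            if k == key:
                return v
        return None

t = SSTable([(("r1", "c", 1), "a"),
             (("r2", "c", 1), "b"),
             (("r3", "c", 1), "c")])
print(t.get(("r2", "c", 1)))                # → b
```

Immutability is the answer to the slide's question: GFS is optimized for appends, not random overwrites, and immutable files need no locking for concurrent reads; updates are instead absorbed in memory and merged in later via compactions.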

Tablet Structure A tablet stores a range of rows from a table (e.g., start: aardvark, end: apple) using SSTable files, which are stored in GFS. [diagram: a tablet pointing at two SSTables, each with an index over block ranges and a sequence of 64KB blocks, all stored as files in GFS]

Tablet Splitting When tablets grow too big, they are split. There's merging, too. [diagram: a tablet covering rows com.aaa through com.zuppa/menu.html being split into two tablets]

Servers A library is linked into every client. One master server: assigns and load-balances tablets across tablet servers; detects up/down tablet servers; garbage-collects deleted tablets; coordinates metadata updates (e.g., create table); does NOT provide tablet location (we'll see how that is obtained). The master is stateless: its state is in Chubby and Bigtable (recursively)! Many tablet servers: handle read/write requests to their tablets; split tablets that have grown too large. Tablet servers are also stateless: their state is in GFS!

System Architecture [diagram: a Bigtable cell. The Bigtable client links the Bigtable client library and calls Open(); metadata ops go to the Bigtable master, which performs metadata operations and load balancing; reads and writes go directly to the Bigtable tablet servers, which serve the data; GFS holds tablet data; Chubby holds metadata and handles master election]

Locating Tablets Since tablets move around from server to server, given a row, how do clients find the right machine? Each tablet has a start row index and an end row index; we need to find the tablet whose row range covers the target row. One approach: ask the Bigtable master, but a central server would almost certainly become a bottleneck in a large system. Instead: store special tables containing tablet location info in the Bigtable cell itself (a recursive design).

Tablets are located using a hierarchical structure (B+ tree-like): a Chubby lock file points to the 1st (root) METADATA tablet (stored in one single tablet, unsplittable), which points to the other METADATA tablets, whose records map <table_id, end_row> to the location of each user-table tablet (UserTable_1 ... UserTable_N). Each METADATA record is ~1KB, and the max METADATA table size is 128MB, making the addressable table data in a Bigtable cell 2^21 TB.
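The hierarchy above amounts to two range lookups: one in the root tablet to find the right METADATA tablet, and one in that METADATA tablet to find the user tablet. A minimal sketch with invented data structures (real clients also cache locations in the library, so most lookups take zero network hops):

```python
# Sketch of the hierarchical tablet lookup. Each level is a sorted
# list of (end_row, location) pairs; binary search finds the first
# range whose end_row covers the target key.

import bisect

# root METADATA tablet: maps METADATA row ranges to METADATA servers
root_metadata = [("usertable:m", "metaserver-1"),
                 ("usertable:z", "metaserver-2")]

# METADATA tablets: map <table_id, end_row> to user-tablet servers
metadata = {
    "metaserver-1": [("usertable:g", "ts-7"), ("usertable:m", "ts-3")],
    "metaserver-2": [("usertable:z", "ts-9")],
}

def find(level, key):
    ends = [end for end, _ in level]
    return level[bisect.bisect_left(ends, key)][1]

def locate_tablet(table, row):
    key = f"{table}:{row}"
    meta_server = find(root_metadata, key)      # hop 1: root tablet
    return find(metadata[meta_server], key)     # hop 2: METADATA tablet

print(locate_tablet("usertable", "horse"))      # → ts-3
```

The Chubby file supplying the root tablet's location is omitted here; it would be one more (heavily cached) lookup in front of hop 1.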

Bigtable Description Outline Motivation and goals (last time); schemas, interfaces, and semantics (with code) (today); architecture (today); implementation details (today, or you'll read on your own)

Tablet Assignment 1 tablet => 1 tablet server. The master keeps track of the set of live tablet servers and of the unassigned tablets, and sends a tablet load request for each unassigned tablet to a tablet server. Bigtable uses Chubby to keep track of tablet servers. On startup, a tablet server creates, and acquires an exclusive lock on, a uniquely named file in a designated Chubby directory. The master monitors that directory to discover tablet servers. A tablet server stops serving its tablets if it loses its exclusive lock, and it tries to reacquire the lock on its file as long as the file still exists.

Tablet Assignment If the file no longer exists, the tablet server will never be able to serve again, so it kills itself. The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible. The master detects this by periodically checking the status of each tablet server's lock: either the tablet server reports the loss of its lock, or the master cannot reach the tablet server after several attempts.

Tablet Assignment In that case, the master tries to acquire an exclusive lock on the server's file. If the master is able to acquire the lock, then Chubby is alive and the tablet server is either dead or having trouble reaching Chubby. The master then makes sure the tablet server can never serve again by deleting its server file, and it moves all tablets assigned to that server into the set of unassigned tablets. If the master's own Chubby session expires, the master kills itself. When a new master starts, it needs to discover the current tablet assignment.
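The lock-probing protocol above can be sketched end to end: if the master can grab a tablet server's lock, the server is presumed dead, its file is deleted so it can never serve again, and its tablets return to the unassigned pool. All classes and paths here are invented for illustration:

```python
# Sketch of the master's failure-detection protocol over a fake
# Chubby: probe a tablet server's lock, and if the probe succeeds,
# delete the server file and reassign its tablets.

class FakeChubby:
    def __init__(self):
        self.locks = {}                     # file path -> holder (None = free)

    def acquire(self, path, client):
        if self.locks.get(path) is None:
            self.locks[path] = client
            return True
        return False

    def release(self, path):
        self.locks[path] = None

    def delete(self, path):
        self.locks.pop(path, None)

def check_tablet_server(chubby, server_file, assigned, unassigned):
    # master could not reach the server, so it probes the server's lock
    if chubby.acquire(server_file, "master"):
        chubby.delete(server_file)          # server can never serve again
        unassigned.extend(assigned.pop(server_file, []))

chubby = FakeChubby()
chubby.acquire("/servers/ts-7", "ts-7")     # tablet server holds its lock
chubby.release("/servers/ts-7")             # ...then loses it (e.g., crash)

assigned = {"/servers/ts-7": ["tablet-a", "tablet-b"]}
unassigned = []
check_tablet_server(chubby, "/servers/ts-7", assigned, unassigned)
print(unassigned)                           # → ['tablet-a', 'tablet-b']
```

The key property: because the lock lives in Chubby rather than on the tablet server, the master and the server can never both believe the server owns its tablets, even when the network between them is partitioned.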

Master Startup Operation Grabs the unique master lock in Chubby (prevents concurrent master instantiations). Scans the servers directory in Chubby for live tablet servers. Communicates with every live tablet server to discover all assigned tablets. Scans the METADATA table to learn the full set of tablets; unassigned tablets are marked for assignment.

Bigtable Summary A scalable distributed storage system for semi-structured data. Offers a multi-dimensional map interface: <row, column, timestamp> -> value. Offers atomic reads/writes within a row. Key design philosophy: statelessness, which is key for scalability. All Bigtable servers (including the master) are stateless; all state is stored in the reliable GFS and Chubby systems. Bigtable leverages strong-semantic operations in these systems (appends in GFS, file locks in Chubby).