Lessons Learned While Building Infrastructure Software at Google


Lessons Learned While Building Infrastructure Software at Google Jeff Dean jeff@google.com

Google Circa 1997 (google.stanford.edu)

Corkboards (1999)

Google Data Center (2000)

Google (new data center 2001)

Google Data Center (3 days later)

Google's Computational Environment Today
Many datacenters around the world

Zooming In...

Lots of machines...

Cool...

Low-Level Systems Software Desires
If you have lots of machines, you want to:
- Store data persistently, with high availability and high read and write bandwidth
- Run large-scale computations reliably, without having to deal with machine failures
GFS, MapReduce, BigTable, Spanner, ...

Google File System (GFS) Design
[Diagram: a GFS master (with replica masters and misc. servers) plus Chunkservers 1..N; clients contact the master for metadata and the chunkservers for data; chunks C0, C1, C2, C3, C5 are each stored on several chunkservers.]
- Master manages metadata
- Data transfers are directly between clients and chunkservers
- Files broken into chunks (typically 64 MB)
- Chunks replicated across multiple machines (usually 3)
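
As a rough illustration of that design (hypothetical names throughout; this is a sketch under the slide's assumptions, not the actual GFS API), a client asks the master only for chunk locations and then reads the bytes directly from one of the chunkservers holding a replica:

```python
# Minimal sketch of a GFS-style read path (hypothetical API, not real GFS code):
# the master serves only metadata; file bytes come straight from a chunkserver.

CHUNK_SIZE = 64 * 1024 * 1024  # files are broken into 64 MB chunks

class Chunkserver:
    def __init__(self):
        self.chunks = {}                        # chunk handle -> bytes held by this server

    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Master:
    """Holds only metadata: which chunks make up a file, and where the replicas live."""
    def __init__(self, file_chunks, chunk_locations):
        self.file_chunks = file_chunks          # filename -> [chunk handles, in order]
        self.chunk_locations = chunk_locations  # chunk handle -> [chunkserver ids] (usually 3)

    def lookup(self, filename, offset):
        handle = self.file_chunks[filename][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

def client_read(master, chunkservers, filename, offset, length):
    """Ask the master where the chunk lives, then fetch the bytes from a replica."""
    handle, replicas = master.lookup(filename, offset)
    return chunkservers[replicas[0]].read_chunk(handle, offset % CHUNK_SIZE, length)

# Tiny usage example: one chunk replicated on two chunkservers.
servers = {"cs1": Chunkserver(), "cs2": Chunkserver()}
servers["cs1"].chunks["c0"] = servers["cs2"].chunks["c0"] = b"hello, gfs"
master = Master({"/logs/part-0": ["c0"]}, {"c0": ["cs1", "cs2"]})
print(client_read(master, servers, "/logs/part-0", 0, 5))  # b'hello'
```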

GFS Motivation and Lessons
- The indexing system clearly needed a large-scale distributed file system: we wanted to treat the whole cluster as a single file system
- Developed by a subset of the same people working on the indexing system
- Identified the minimal set of features needed, e.g. not POSIX compliant; the actual data was distributed, but metadata was kept centralized
- Colossus: a follow-on system developed many years later that distributed the metadata
Lesson: Don't solve everything all at once

MapReduce History
2003: Sanjay Ghemawat and I were working on rewriting the indexing system:
- starts with raw page contents on disk
- many phases: (near-)duplicate elimination, anchor text extraction, language identification, index shard generation, etc.
- end result is data structures for index and doc serving
Each phase was a hand-written parallel computation: hand parallelized, with hand-written checkpointing code for fault tolerance

MapReduce
A simple programming model that applies to many large-scale computing problems:
- allowed us to express all phases of our indexing system
- since used across a broad range of computer science areas, plus other scientific fields
- the Hadoop open-source implementation is seeing significant usage
Hide messy details in the MapReduce runtime library:
- automatic parallelization
- load balancing
- network and disk transfer optimizations
- handling of machine failures
- robustness
Improvements to the core library benefit all users of the library!

Typical problem solved by MapReduce
- Read a lot of data
- Map: extract something you care about from each record
- Shuffle and Sort
- Reduce: aggregate, summarize, filter, or transform
- Write the results
The outline stays the same; the user writes Map and Reduce functions to fit the problem.
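
The classic example is counting word occurrences. Below is a toy, single-process sketch of the model (illustrative only; the real runtime library partitions the map and reduce work across thousands of machines and handles the shuffle, sort, and machine failures):

```python
from collections import defaultdict

# Toy, single-process sketch of the MapReduce model: the user writes map and reduce
# functions; the framework reads inputs, shuffles/sorts by key, and writes results.

def word_count_map(doc_name, contents):
    for word in contents.split():
        yield (word, 1)                  # extract something you care about from each record

def word_count_reduce(word, counts):
    yield (word, sum(counts))            # aggregate / summarize

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for name, contents in inputs:        # Read a lot of data + Map
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    results = []
    for key in sorted(groups):           # Shuffle and Sort
        results.extend(reduce_fn(key, groups[key]))   # Reduce
    return results                       # Write the results

print(run_mapreduce([("doc1", "the cat sat on the mat")], word_count_map, word_count_reduce))
# [('cat', 1), ('mat', 1), ('on', 1), ('sat', 1), ('the', 2)]
```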

MapReduce Motivation and Lessons
- Developed by two people who were also doing the indexing system rewrite: squinted at the various phases with an eye towards coming up with a common abstraction
- Initial version developed quickly: proved the initial API's utility with a very simple implementation, then rewrote much of the implementation 6 months later to add lots of the performance wrinkles/tricks that appeared in the original paper
Lesson: Very close ties with the initial users of a system make things happen faster. In this case, we were both building MapReduce and using it simultaneously.

BigTable: Motivation
Lots of (semi-)structured data at Google:
- URLs: contents, crawl metadata, links, anchors, pagerank, ...
- Per-user data: user preferences, recent queries, ...
- Geographic locations: physical entities, roads, satellite image data, user annotations, ...
Scale is large
Want to be able to grow and shrink the resources devoted to the system as needed

BigTable: Basic Data Model
Distributed multi-dimensional sparse map: (row, column, timestamp) -> cell contents
[Diagram: row "www.cnn.com", column "contents:", holding timestamped versions t3, t11, t17 of the cell value "<html>".]
- Rows are ordered lexicographically
- Good match for most of our applications
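
A toy in-memory sketch of this data model (illustrative only, not Bigtable's implementation): a map from (row, column, timestamp) to cell contents, with row keys kept in lexicographic order and multiple timestamped versions per cell.

```python
from bisect import insort

# Illustrative sketch of the (row, column, timestamp) -> contents model.
class SparseTable:
    def __init__(self):
        self.row_keys = []   # row keys kept in lexicographic order
        self.cells = {}      # (row, column) -> {timestamp: value}

    def write(self, row, column, timestamp, value):
        if row not in self.row_keys:
            insort(self.row_keys, row)           # maintain lexicographic row order
        self.cells.setdefault((row, column), {})[timestamp] = value

    def read(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        ts = timestamp if timestamp is not None else max(versions)   # latest by default
        return versions[ts]

t = SparseTable()
t.write("www.cnn.com", "contents:", 3, "<html>... (t3)")
t.write("www.cnn.com", "contents:", 11, "<html>... (t11)")
t.write("www.cnn.com", "contents:", 17, "<html>... (t17)")
print(t.read("www.cnn.com", "contents:"))        # newest version: '<html>... (t17)'
print(t.read("www.cnn.com", "contents:", 11))    # a specific older version
```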

Tablets & Splitting
[Diagram: a table with "language:" and "contents:" columns and rows aaa.com, ..., cnn.com (EN, <html>), cnn.com/sports.html, ..., website.com, ..., yahoo.com/kids.html, yahoo.com/kids.html\0, ..., zuppa.com/menu.html; the sorted row space is divided into tablets, with a split introduced between yahoo.com/kids.html and yahoo.com/kids.html\0.]
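
A toy sketch of the lookup and split operations implied by the figure (hypothetical helpers; tablet metadata is shown here as nothing more than a sorted list of tablet end keys):

```python
from bisect import bisect_left, insort

# Tablets partition the row space into contiguous, lexicographically ordered ranges.
# Tablet i holds rows r with end_keys[i-1] < r <= end_keys[i]; in a real system the
# last tablet is open-ended.
tablet_end_keys = ["cnn.com/sports.html", "website.com", "zuppa.com/menu.html"]

def tablet_for_row(row_key):
    """Index of the tablet whose row range contains row_key."""
    return bisect_left(tablet_end_keys, row_key)

def split_tablet(split_key):
    """Split the tablet containing split_key by adding a new range boundary."""
    insort(tablet_end_keys, split_key)

print(tablet_for_row("aaa.com"))                  # 0: before the first end key
split_tablet("yahoo.com/kids.html")               # the split point shown on the slide
print(tablet_for_row("yahoo.com/kids.html"))      # 2: last row of the newly split tablet
print(tablet_for_row("yahoo.com/kids.html\0"))    # 3: first row of the following tablet
```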

BigTable System Structure
[Diagram: a Bigtable cell. A Bigtable client uses the Bigtable client library, which issues Open(), metadata ops, and read/write operations against the cell's servers.]
- Bigtable master: performs metadata ops + load balancing
- Cluster scheduling system: schedules tasks onto machines
- Cluster file system: holds tablet data, logs
- Lock service: holds metadata, handles master election
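
To make the division of labor concrete, here is a heavily simplified, hypothetical sketch of the client flow the diagram implies: Open() is a metadata operation, after which reads and writes go straight to the tablet servers in the cell, keeping the master off the data path.

```python
# Heavily simplified, hypothetical sketch of the client flow (not the real client library).
# In real Bigtable only the root tablet location lives in the lock service and tablet
# locations live in a METADATA table; here the whole map sits under one path for brevity.

lock_service = {"/bigtable/tables/webtable/tablets": [
    ("m", "tabletserver-1"),    # (end row key of tablet, server currently holding it)
    ("~", "tabletserver-2"),    # '~' sorts after normal row keys: the final, open-ended tablet
]}

tablet_servers = {"tabletserver-1": {}, "tabletserver-2": {}}   # server -> stored cells

def open_table(table):
    """Metadata op, done once at Open(): fetch the tablet -> server map."""
    return lock_service[f"/bigtable/tables/{table}/tablets"]

def write(tablet_map, row, column, value):
    """Data op: find the tablet covering `row` and write directly to its server."""
    for end_key, server in tablet_map:
        if row <= end_key:
            tablet_servers[server][(row, column)] = value
            return

webtable = open_table("webtable")
write(webtable, "cnn.com", "contents:", "<html>")
print(tablet_servers["tabletserver-1"])   # {('cnn.com', 'contents:'): '<html>'}
```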

BigTable Status
- In production use for 100s of projects: crawling/indexing pipeline, Google Maps/Google Earth/Streetview, Search History, Google Print, Google+, Blogger, ...
- Currently 500+ BigTable clusters
- Largest cluster: 100s of PB of data; sustained 30M ops/sec; 100+ GB/s I/O
- Many asynchronous processes updating different pieces of information: no distributed transactions, no cross-row joins
- Initial design was just for a single cluster; follow-on work added eventual consistency across many geographically distributed BigTable instances

Spanner
Storage & computation system that runs across many datacenters:
- Single global namespace: names are independent of the location(s) of data; fine-grained replication configurations
- Support for a mix of strong and weak consistency across datacenters: strong consistency implemented with Paxos across tablet replicas; full support for distributed transactions across directories/machines
- Much more automated operation: automatically changes replication based on constraints and usage patterns; automated allocation of resources across the entire fleet of machines

Design Goals for Spanner
- Future scale: ~10^5 to 10^7 machines, ~10^13 directories, ~10^18 bytes of storage, spread across 100s to 1000s of locations around the world
- Zones of semi-autonomous control
- Consistency after disconnected operation
- Users specify high-level desires: "99%ile latency for accessing this data should be < 50ms"; "Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia"
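
The slide doesn't give a concrete syntax for those desires, but a user-facing placement policy of that kind might look something like the following sketch (entirely hypothetical field names, just to illustrate what "users specify high-level desires" could mean):

```python
# Hypothetical illustration of a high-level, user-specified policy; Spanner's actual
# configuration language is not shown on the slide.
placement_policy = {
    "directory": "example/user-data",                      # hypothetical directory name
    "latency_target": {"percentile": 99, "max_ms": 50},    # "99%ile latency ... < 50ms"
    "replicas": [                                           # "2 disks in EU, 2 in U.S. & 1 in Asia"
        {"region": "EU",   "min_copies": 2},
        {"region": "US",   "min_copies": 2},
        {"region": "ASIA", "min_copies": 1},
    ],
}

def placement_satisfied(replica_regions, policy):
    """Check a proposed replica placement against the per-region minimums in the policy."""
    counts = {}
    for region in replica_regions:
        counts[region] = counts.get(region, 0) + 1
    return all(counts.get(r["region"], 0) >= r["min_copies"] for r in policy["replicas"])

print(placement_satisfied(["EU", "EU", "US", "US", "ASIA"], placement_policy))  # True
print(placement_satisfied(["EU", "US", "ASIA"], placement_policy))              # False
```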

Spanner Lessons
- Several variations of the eventual client API
- Started development with many possible customers in mind, but no particular customer we were working closely with
- Eventually worked closely with the Google ads system as the initial customer; the first real customer was very demanding (real $$): good and bad
- Different API than BigTable: harder to move users with existing heavy BigTable usage

Designing & Building Infrastructure
Identify common problems, and build software systems to address them in a general way.
It is important to not try to be all things to all people:
- Clients might be demanding 8 different things
- Doing 6 of them is easy
- Handling 7 of them requires real thought
- Dealing with all 8 usually results in a worse system: more complex, and it compromises other clients in trying to satisfy everyone

Designing & Building Infrastructure (cont.)
Don't build infrastructure just for its own sake:
- Identify common needs and address them
- Don't imagine unlikely potential needs that aren't really there
Best approach: use your own infrastructure (especially at first!)
- Gives much more rapid feedback about what works and what doesn't
If that is not possible, at least work very closely with the initial client team:
- Ideally sit within 50 feet of each other
- Keep other potential clients' needs in mind, but get the system working via close collaboration with the first client first

Thanks! Further reading:
- Ghemawat, Gobioff, & Leung. The Google File System. SOSP 2003.
- Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
- Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
- Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
- Corbett et al. Spanner: Google's Globally Distributed Database. OSDI 2012.
- Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006.
- Pinheiro, Weber, & Barroso. Failure Trends in a Large Disk Drive Population. FAST 2007.
- Barroso & Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool Synthesis Lectures on Computer Architecture, 2009.
- Malewicz et al. Pregel: A System for Large-Scale Graph Processing. PODC 2009.
- Schroeder, Pinheiro, & Weber. DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS 2009.
- Protocol Buffers. http://code.google.com/p/protobuf/

See: http://research.google.com/papers.html and http://research.google.com/people/jeff