Plumbing the Web. Narayanan Shivakumar. Google Distinguished Entrepreneur & Director

Google Developer Day 2007. Copyright 2007, Google Inc.

Developers and Google
Timeline, 2002 to 2007 (figure): AJAX Search API, Google Code Project Hosting, Google Web Toolkit, Calendar Data API, Base Data API, Blogger Data API, Notebook Data API, Picasa Data API, Spreadsheets Data API, Google SOAP Search API, Desktop API, Sitemaps API, Gadgets API, AJAX Feed API, Mashup Editor, Mapplets, Google Gears

How much information is out there?
- How large is the Web? Hundreds of billions of documents? Trillions?
- ~10 KB/doc => 100s of terabytes
- Then there's everything else: email, personal files, closed databases, broadcast media, print, etc.
- Estimated 5 exabytes/year (growing at 30%)*
- 800 MB/year/person
- ~90% in magnetic media
- Web is just a tiny starting point
* Source: How Much Information? 2003
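The size estimates above can be checked with a few lines of arithmetic. This is only a sketch: the decimal unit definitions and the three document counts are assumptions chosen for illustration, not figures from the talk. At 10 KB per document, "100s of terabytes" corresponds to tens of billions of documents, while hundreds of billions of documents would land in the petabyte range.

```python
# Decimal (SI) units are an assumption; the slide does not say which it uses.
KB, MB, TB, EB = 10**3, 10**6, 10**12, 10**18

def corpus_bytes(num_docs, bytes_per_doc=10 * KB):
    """Raw size of a corpus at a fixed average document size."""
    return num_docs * bytes_per_doc

# Illustrative document counts (assumed, not from the slide).
for docs in (10**10, 10**11, 10**12):
    print(f"{docs:.0e} docs -> {corpus_bytes(docs) / TB:,.0f} TB")

# Cross-check of the "everything else" figures:
# 5 exabytes/year divided by 800 MB/year/person gives the implied population.
people_implied = 5 * EB / (800 * MB)
print(f"implied population: {people_implied / 1e9:.2f} billion")
```

The second calculation shows that the two "everything else" numbers are consistent with each other: 5 exabytes per year at 800 MB per person implies roughly the world population of the time.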

Early Search (diagram): Search, Link Extraction, Web pages

Webserver-search ecosystem
- Part A: Sitemaps, tell us what you have
- Part B: Feedback to webservers about problems

Sitemaps ("ls" for web)
- XML sitemaps auto-produced and maintained on webservers
- http://www.example.com/sitemap.xml
- {<url>, <changerate>, <lastmod>, <priority>}
- Autodiscovery through robots.txt
- Log-structured protocol
- Scalable from 50 to 50M+ URLs
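A webserver-side tool might produce such a file along the following lines. This is a minimal sketch, not Google's or any particular plugin's implementation: the function name `build_sitemap` and the sample entry are hypothetical. The element names follow the published sitemaps.org schema, which calls the change-rate field `changefreq` and stores the URL itself in `loc`.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Serialize URL entries into sitemap XML (hypothetical helper).

    Each entry is a dict with a required 'loc' and optional
    'lastmod', 'changefreq', and 'priority' keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

# Sample data is made up for illustration.
xml_text = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2007-05-31",
     "changefreq": "daily", "priority": 0.8},
])
print(xml_text)
```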

Sitemaps adoption
- Open protocol launched in Jun '05 under Creative Commons
- Joint support announced by MSN, Yahoo in Nov '06 (http://sitemaps.org)
- Auto-discovery through robots.txt in Apr '07 (+IBM, Ask.com)
- Billions of URLs auto-produced by servers, tools, plugins
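Auto-discovery works by adding a `Sitemap:` line to robots.txt that points at the sitemap file. A crawler could extract those lines roughly as follows. The function name and the sample robots.txt are hypothetical, and the parsing is deliberately simplified (it treats everything after `#` as a comment), so this is a sketch rather than a conforming robots.txt parser.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named by 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        field, sep, value = line.partition(":")   # split on the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

# Made-up robots.txt content for illustration.
sample = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""
found = sitemap_urls(sample)
print(found)
```

Splitting on the first colon only matters here because the URL itself contains a colon after the scheme.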

Comprehensive Search (diagram): Search, Sitemaps, WmTools, Link Extraction, Web pages

Google's Developer Products
- Integrate: integrate Google services
- Reach: reach Google users
- Build: build next-gen web apps

Behind the plumbing (layers): Apps, Standards, Systems Infra, Hardware

Google's Explosive Computational Requirements
- Every Google service sees continuing growth in computational needs
- More queries: more users, happier users
- More data: bigger web, mailbox, blog, etc.
- Better results: find the right information, and find it faster
Diagram labels: better results, more data, more queries

Hardware Design Philosophy
- Prefer low-end server/PC-class designs
- Build lots of them!
Why?
- Single-machine performance is not interesting
- Our smaller problems are too large for any single system
- Large problems are easily partitioned into multiple threads
- Ultra-reliable hardware makes programmers lazy
- Most reliable platform will still fail => fault-tolerant software needed
- Fault-tolerant software enables use of commodity components
- Interesting systems can be designed with commodity components
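The claim that even the most reliable platform will still fail can be illustrated with a back-of-the-envelope calculation. The sketch below assumes machines fail independently and uses a made-up per-machine failure probability of 0.1% per day; neither assumption comes from the talk.

```python
def p_any_failure(num_machines, p_single):
    """Probability that at least one of num_machines fails,
    assuming independent failures with probability p_single each."""
    return 1.0 - (1.0 - p_single) ** num_machines

# 0.001 per machine per day is an assumed figure for illustration only.
for n in (1, 100, 10_000):
    print(f"{n:>6} machines: P(at least one failure) = {p_any_failure(n, 0.001):.4f}")
```

Under these assumptions a single machine rarely fails on a given day, but a cluster of ten thousand almost certainly sees at least one failure, which is why the software layer has to tolerate faults no matter how reliable each box is.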

google.stanford.edu (circa 1997)

google.com (1999)

Google Data Center (circa 2000)

google.com (new data center 2001)

google.com (3 days later)

Current Design
- In-house rack design
- PC-class motherboards
- Low-end storage and networking hardware
- Linux + in-house software

Behind the plumbing (layers): Apps, Standards, Systems Infra, Hardware

Systems Infrastructure
- Goal: create very large scale, high performance computing infrastructure
- Hardware + software systems to make it easy to build products
- Focus on price/performance, and ease of use
- Enables better products: indices containing more documents, updated more often; faster queries; faster product development cycles

GFS: Google File System
Why YADFS?
- Google has unique FS requirements
- Huge read/write bandwidth
- Reliability over thousands of nodes
- Mostly operating on large data blocks
- Need efficient distributed operations
- Unfair advantage: we have control over applications, libraries and operating system

GFS Setup
Diagram: masters and replicas (GFS Master, GFS Master), misc. servers, clients, and Chunkserver 1, Chunkserver 2, ..., Chunkserver N holding replicated chunks (C0, C1, C2, C3, C5)
- Master manages metadata
- Data transfers happen directly between clients/chunkservers
- Files broken into chunks (typically 64 MB)
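The division of labor between master and chunkservers can be sketched as a toy model. Everything here is hypothetical (the class `ToyMaster`, the function `locate`, the file path, and the server names); it only illustrates the idea that the master answers metadata lookups while the client computes which chunk it needs and then talks to a chunkserver directly.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the "typical" chunk size on the slide

class ToyMaster:
    """Metadata only: which chunkservers hold each (file, chunk index)."""

    def __init__(self):
        self.chunk_locations = {}

    def add_chunk(self, path, chunk_index, servers):
        self.chunk_locations[(path, chunk_index)] = list(servers)

    def lookup(self, path, chunk_index):
        return self.chunk_locations[(path, chunk_index)]

def locate(master, path, byte_offset):
    """Client side: map a byte offset to a chunk, then ask the master
    where replicas of that chunk live. No file data flows through the master."""
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return chunk_index, offset_in_chunk, master.lookup(path, chunk_index)

# Made-up file and replica placement for illustration.
master = ToyMaster()
master.add_chunk("/logs/crawl", 0, ["chunkserver1", "chunkserver2"])
master.add_chunk("/logs/crawl", 1, ["chunkserver2", "chunkserverN"])

result = locate(master, "/logs/crawl", 100 * 1024 * 1024)
print(result)
```

After the lookup, the client would read from one of the returned chunkservers itself, which keeps the master off the data path.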

MapReduce + BigTable
- Okay, GFS lets us store lots of data... now what?
- We want to process that data in new and interesting ways!
- MapReduce: a programming model and library to simplify large-scale computations on large clusters
- BigTable: a large-scale storage system for semi-structured data
- Database-like model, but data stored on thousands of machines
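The MapReduce programming model can be illustrated with the usual word-count example. The sketch below runs in a single process and is not Google's library: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping step stands in for the shuffle that a real system performs across many machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

# Made-up input documents for illustration.
docs = [("d1", "the web is large"), ("d2", "the web grows")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The user writes only the two small functions; partitioning the input, grouping by key, and rerunning failed tasks are what the library takes care of on a large cluster.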
