Plumbing the Web
Narayanan Shivakumar
Google Distinguished Entrepreneur & Director
Google Developer Day 2007
Copyright 2007, Google Inc

Developers and Google
Products shown on a 2002-2007 timeline:
- AJAX Search API
- Google Code Project Hosting
- Google Web Toolkit
- Calendar Data API
- Base Data API
- Blogger Data API
- Notebook Data API
- Picasa Data API
- Spreadsheets Data API
- Google SOAP Search API
- Desktop API
- Sitemaps API
- Gadgets API
- AJAX Feed API
- Mashup Editor
- Mapplets
- Google Gears

How much information is out there?
- How large is the Web?
  - Hundreds of billions of documents? Trillions?
  - ~10 KB/doc => 100s of terabytes
- Then there's everything else
  - Email, personal files, closed databases, broadcast media, print, etc.
  - Estimated 5 exabytes/year (growing at 30%)*
  - 800 MB/year/person
  - ~90% in magnetic media
- Web is just a tiny starting point
* Source: How Much Information? 2003
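
A quick sanity check of the per-person figure above, as a rough sketch. The world population value is an assumption (about 6.3 billion around 2003) and is not on the slide; decimal units are also assumed.

```python
# Back-of-envelope check of the slide's figures (decimal units: 1 EB = 10**18 bytes).
EXABYTE = 10**18
MEGABYTE = 10**6

new_info_per_year = 5 * EXABYTE   # from the slide
world_population = 6.3e9          # ASSUMPTION: rough 2003 population, not on the slide

# New information per person per year, in megabytes.
per_person_mb = new_info_per_year / world_population / MEGABYTE
print(f"{per_person_mb:.0f} MB per person per year")


def projected_exabytes(years, start=5.0, growth=0.30):
    """Yearly volume (in EB) after `years` of compound growth at the slide's 30% rate."""
    return start * (1 + growth) ** years


print(f"{projected_exabytes(3):.1f} EB/year after three years")
```

With that population the result lands close to the slide's 800 MB/year/person.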

Early Search
[Diagram labels: Search, Link Extraction, Web pages]

Webserver-search ecosystem
- Part A: Sitemaps, tell us what you have
- Part B: Feedback to webservers about problems

Sitemaps ("ls" for the web)
- XML sitemaps auto-produced and maintained on webservers
  - http://www.example.com/sitemap.xml
  - {<url>, <changefreq>, <lastmod>, <priority>}
- Autodiscovery through robots.txt
- Log-structured protocol
- Scalable from 50 to 50M+ URLs
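
A minimal sketch of what a server-side tool might emit, assuming the tag names of the public sitemaps.org schema (loc, lastmod, changefreq, priority inside each url entry). The function name and the sample pages are hypothetical, and this is not the generator Google shipped.

```python
import xml.etree.ElementTree as ET

# Namespace published for the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as a string.

    `entries` is a list of dicts; each needs "loc" and may carry
    "lastmod", "changefreq", and "priority".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, for illustration only.
pages = [
    {"loc": "http://www.example.com/", "lastmod": "2007-05-31",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.example.com/about.html", "changefreq": "monthly"},
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Optional fields are simply left out when a page has no value for them, which keeps small sitemaps small.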

Sitemaps adoption
- Open protocol launched in Jun 2005 under Creative Commons
- Joint support announced by MSN, Yahoo in Nov 2006 (http://sitemaps.org)
- Auto-discovery through robots.txt in Apr 2007 (+IBM, Ask.com)
- Billions of URLs auto-produced by servers, tools, plugins
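
Auto-discovery works by listing the sitemap location on a `Sitemap:` line in robots.txt. A rough sketch of how a crawler could pick those lines out follows; the function name and the sample robots.txt are illustrative assumptions, and the parsing is deliberately simplified.

```python
def sitemap_urls_from_robots(robots_txt):
    """Collect the URLs named on `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt, for illustration only.
sample_robots = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""
found = sitemap_urls_from_robots(sample_robots)
print(found)
```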

Early Search
[Diagram labels: Search, Link Extraction, Web pages]

Comprehensive Search
[Diagram labels: Search, Sitemaps, WmTools, Link Extraction, Web pages]

Google's Developer Products
- Integrate: Integrate Google services
- Reach: Reach Google users
- Build: Build next-gen web apps

Behind the plumbing
[Layer diagram labels: Apps, Standards, Systems Infra, Hardware]

Google's Explosive Computational Requirements
- Every Google service sees continuing growth in computational needs
  - More queries: more users, happier users
  - More data: bigger web, mailbox, blog, etc.
  - Better results: find the right information, and find it faster
[Cycle diagram labels: better results, more data, more queries]

Hardware Design Philosophy
- Prefer low-end server/PC-class designs
  - Build lots of them!
- Why?
  - Single machine performance is not interesting
    - Our smaller problems are too large for any single system
    - Large problems are easily partitioned into multiple threads
  - Ultra-reliable hardware makes programmers lazy
    - Most reliable platform will still fail: fault-tolerant software needed
    - Fault-tolerant software enables use of commodity components
  - Interesting systems can be designed with commodity components
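
The two ideas above (partition the work, tolerate failures in software) can be sketched in a few lines. This is only an illustration under assumed names (`shard_for`, `run_with_retry`, machines "m1" and "m2"); it does not describe Google's actual software.

```python
import zlib


def shard_for(key, num_shards):
    """Deterministically map a key to one of `num_shards` partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def run_with_retry(task, replicas):
    """Run `task(replica)` on each replica in turn; return the first success."""
    last_error = None
    for replica in replicas:
        try:
            return task(replica)
        except Exception as exc:  # a commodity machine is expected to fail sometimes
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error


def flaky_lookup(machine):
    """Hypothetical task: machine 'm1' is down, any other machine answers."""
    if machine == "m1":
        raise ConnectionError("m1 is unreachable")
    return f"answer from {machine}"


print(shard_for("www.example.com", 8))
print(run_with_retry(flaky_lookup, ["m1", "m2"]))
```

The caller never sees the failure of "m1"; the retry loop absorbs it, which is what lets cheap hardware stand in for ultra-reliable hardware.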

google.stanford.edu (circa 1997)

google.com (1999)

Google Data Center (circa 2000)

google.com (new data center 2001)

google.com (3 days later)

Current Design
- In-house rack design
- PC-class motherboards
- Low-end storage and networking hardware
- Linux + in-house software

Behind the plumbing
[Layer diagram labels: Apps, Standards, Systems Infra, Hardware]

Systems Infrastructure
- Goal: Create very large scale, high performance computing infrastructure
  - Hardware + software systems to make it easy to build products
  - Focus on price/performance, and ease of use
- Enables better products:
  - indices containing more documents
  - updated more often
  - faster queries
  - faster product development cycles

GFS: Google File System
- Why YADFS?
- Google has unique FS requirements
  - Huge read/write bandwidth
  - Reliability over thousands of nodes
  - Mostly operating on large data blocks
  - Need efficient distributed operations
- Unfair advantage
  - We have control over applications, libraries and operating system

GFS Setup
[Diagram: Masters (GFS Master plus replica GFS Master), Misc. servers, Clients, and Chunkservers 1 to N holding replicated chunks C0, C1, C2, C3, C5]
- Master manages metadata
- Data transfers happen directly between clients/chunkservers
- Files broken into chunks (typically 64 MB)
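
A toy sketch of the split described above: the master answers only "which chunk, on which servers", and the client then talks to a chunkserver directly. The class and function names here are hypothetical and are not the GFS API; only the 64 MB chunk size comes from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk size from the slide


def chunk_index(byte_offset):
    """Index of the chunk that contains `byte_offset`."""
    return byte_offset // CHUNK_SIZE


def chunks_for_range(offset, length):
    """Chunk indices touched by reading `length` bytes starting at `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))


class ToyMaster:
    """Holds metadata only: (filename, chunk index) -> chunkserver names."""

    def __init__(self):
        self.locations = {}

    def register(self, filename, index, servers):
        self.locations[(filename, index)] = list(servers)

    def lookup(self, filename, offset):
        """Tell a client which chunk holds `offset` and where its replicas live."""
        index = chunk_index(offset)
        return index, self.locations[(filename, index)]


master = ToyMaster()
master.register("/logs/crawl", 0, ["chunkserver1", "chunkserver2"])
master.register("/logs/crawl", 1, ["chunkserver2", "chunkserverN"])

# The client asks the master once, then would fetch bytes from a chunkserver directly.
print(master.lookup("/logs/crawl", 70 * 1024 * 1024))
print(chunks_for_range(CHUNK_SIZE - 1, 2))
```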

MapReduce + BigTable
- Okay, GFS lets us store lots of data. Now what?
- We want to process that data in new and interesting ways!
  - MapReduce: a programming model and library to simplify large-scale computations on large clusters
  - BigTable: a large-scale storage system for semi-structured data
    - Database-like model, but data stored on thousands of machines
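
To make the MapReduce programming model concrete, here is a single-process sketch of a word count. It only illustrates the map, group-by-key, and reduce steps; the function names are assumptions, and the real library distributes these steps across a cluster.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run mapper over every input, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up all counts emitted for one word."""
    return sum(values)


counts = map_reduce(["the web", "the plumbing"], word_count_mapper, sum_reducer)
print(counts)
```

The user writes only the mapper and the reducer; grouping by key is the part the library takes over, and the part that a cluster implementation spreads across machines.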