MapReduce & BigTable
1 CPSC 426/526 MapReduce & BigTable Ennan Zhai Computer Science Department Yale University
2 Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Processing & Analysis: MapReduce Database: BigTable
3 Last Lecture Google Applications, e.g., Gmail and Google Maps MapReduce - Lec 9 BigTable - Lec 9 Google File System (GFS) - Lec 8
4 Today's Lecture Google Applications, e.g., Gmail and Google Maps MapReduce - Lec 9 BigTable - Lec 9 Google File System (GFS) - Lec 8
5 Recall: How does GFS work?
6 Google File System [SOSP 03] Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
7 Google File System [SOSP 03] GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
8 Google File System [SOSP 03] Data GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
9 Google File System [SOSP 03] Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
10 Google File System [SOSP 03] The design insights: - Metadata is used for indexing chunks - Huge files -> 64 MB for each chunk -> fewer chunks - Reduce client-master interaction and metadata size Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
11 Google File System [SOSP 03] Data Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
13 Google File System [SOSP 03] The design insights: - Replicas are used to ensure availability - Master can choose the nearest replicas for the client - Read and append-only makes it easy to manage replicas Data Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
14 Google File System [SOSP 03] read Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
15 Google File System [SOSP 03] read - IP address for each chunk - the ID for each chunk Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
16 Google File System [SOSP 03] read <Chunkserver1's IP, Chunk1> <Chunkserver1's IP, Chunk2> <Chunkserver2's IP, Chunk3> Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
17 Google File System [SOSP 03] read Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
18 Google File System [SOSP 03] read Why does GFS try to avoid random writes? Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
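The read flow sketched on these slides (client asks the master for metadata, then fetches chunk data directly from a chunkserver) can be modeled in a few lines. This is a toy sketch, not the real GFS protocol; all class and method names here are illustrative:

```python
# Toy model of the GFS read path: the master serves only metadata
# (file -> list of (chunkserver, chunk ID)); bulk data comes straight
# from the chunkservers, keeping the master off the data path.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks keep the metadata small

class Master:
    def __init__(self):
        # metadata only: filename -> ordered list of (chunkserver, chunk_id)
        self.metadata = {}

    def locate(self, filename, offset):
        """Return the (chunkserver, chunk_id) holding this byte offset."""
        index = offset // CHUNK_SIZE
        return self.metadata[filename][index]

class Chunkserver:
    def __init__(self):
        self.chunks = {}  # chunk_id -> bytes

    def read(self, chunk_id):
        return self.chunks[chunk_id]

def client_read(master, filename, offset):
    # One small metadata RPC to the master, then a direct data read.
    server, chunk_id = master.locate(filename, offset)
    return server.read(chunk_id)
```

The design insight from the slides shows up directly: the master call returns a few bytes of metadata, while the heavy data transfer never touches it.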
19 Put GFS in a Datacenter
20 Put GFS in a Datacenter: Each rack has a master
22 After GFS, what do we need to do next? Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
23 Processing and Analyzing Data Very Important! Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
28 Processing and Analyzing Data A toy problem: word count - We have 10 billion documents - Average document's size is 20KB => 10 billion docs = 200TB Our solution:
29 Processing and Analyzing Data A toy problem: word count - We have 10 billion documents - Average document's size is 20KB => 10 billion docs = 200TB Our solution: for each document d { for each word w in d { word_count[w]++; } }
30 Processing and Analyzing Data A toy problem: word count - We have 10 billion documents - Average document's size is 20KB => 10 billion docs = 200TB Our solution: for each document d { for each word w in d { word_count[w]++; } } Approximately one month.
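The "approximately one month" figure is easy to sanity-check: just scanning 200 TB from a single disk is already a month of I/O. The 80 MB/s sequential read rate below is an assumption (not from the slides), picked as a typical commodity-disk number:

```python
# Back-of-envelope check of the single-machine word-count estimate.
# Assumption: one disk streaming at ~80 MB/s sequential read.
total_bytes = 10_000_000_000 * 20 * 1024   # 10 billion docs * 20 KB each
disk_bandwidth = 80 * 1024 * 1024          # 80 MB/s (assumed)

seconds = total_bytes / disk_bandwidth
days = seconds / 86_400
print(round(days))  # 28 -- roughly a month, before any CPU work at all
```

Spreading the same scan over a few thousand machines in parallel brings this down to minutes, which is exactly the gap MapReduce is designed to close.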
31 MapReduce Programming Model Inspired by the map and reduce operations commonly used in functional programming languages like LISP Users implement an interface of two primary methods: - 1. Map: <key1, value1> -> <key2, value2> - 2. Reduce: <key2, value2[]> -> <value3>
32 MapReduce Programming Model Input -> Map(k,v) -> (k',v') -> Group (k',v') pairs by k' -> Reduce(k', v'[]) -> v'' -> Output
33 MapReduce Programming Model Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g., <doc_id, doc_content>
34 MapReduce Programming Model After the map phase, all the intermediate values for a given intermediate key are combined together into a list and given to a reducer for aggregating/merging the result
35 MapReduce [OSDI 04] GFS is responsible for storing data for MapReduce - Data is split into chunks and distributed across nodes - Each chunk is replicated - Offers redundant storage for massive amounts of data GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
36 MapReduce [OSDI 04] MapReduce GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
37 MapReduce [OSDI 04] Heard of Hadoop? Hadoop = HDFS + Hadoop MapReduce MapReduce GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
38 MapReduce [OSDI 04] JobTracker MapReduce TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
40 MapReduce [OSDI 04] Two core components - JobTracker: assigns tasks to different workers - TaskTracker: executes map and reduce programs JobTracker MapReduce TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
41 MapReduce [OSDI 04] Document A1 Document A2 Document A3 MapReduce JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
43 Document A1 Document A2 Document A3 MapReduce JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
44 Document A1 Document A2 Document A3 TaskTracker MapReduce JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
45 Word Count Example
46 Why we care about word count Word count is challenging over massive amounts of data Many fundamental statistics are aggregate functions Most aggregation functions have a distributive nature MapReduce breaks complex tasks into smaller pieces that can be executed in parallel
47 Map Phase (On a Worker) Count the # of occurrences of each word in a large amount of input data Map(input_key, input_value) { foreach word w in input_value: emit(w, 1); }
48 Map Phase (On a Worker) Input to the Mapper (3414, "the cat sat on the mat") (3437, "the aardvark sat on the sofa") Count the # of occurrences of each word in a large amount of input data Map(input_key, input_value) { foreach word w in input_value: emit(w, 1); }
49 Map Phase (On a Worker) Input to the Mapper (3414, "the cat sat on the mat") (3437, "the aardvark sat on the sofa") Count the # of occurrences of each word in a large amount of input data Map(input_key, input_value) { foreach word w in input_value: emit(w, 1); } Output from the Mapper ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)
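The Map pseudocode on this slide translates directly into Python. This is a minimal sketch in which emit is modeled as building a list; the document ID and sentence come from the slide's own example:

```python
def map_word_count(input_key, input_value):
    """Emit (word, 1) for every word in the document body.

    input_key (the document id, e.g. 3414) is ignored, exactly as in
    the slide's pseudocode; only the value (the text) matters here.
    """
    return [(w, 1) for w in input_value.split()]

pairs = map_word_count(3414, "the cat sat on the mat")
print(pairs)  # [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1)]
```

Note the mapper does no counting at all: duplicate words simply produce duplicate ("the", 1) pairs, and the grouping and reduce phases do the rest.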
51 Reducer (On a Worker) After the Map, all the intermediate values for a given intermediate key are combined together into a list
52 Reducer (On a Worker) After the Map, all the intermediate values for a given intermediate key are combined together into a list Add up all the values associated with each intermediate key: Reduce(output_key, intermediate_vals) { set count = 0; foreach v in intermediate_vals: count += v; emit(output_key, count); }
53 Reducer (On a Worker) Input of the Reducer ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)
54 Reducer (On a Worker) Input of the Reducer ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) Add up all the values associated with each intermediate key: Reduce(output_key, intermediate_vals) { set count = 0; foreach v in intermediate_vals: count += v; emit(output_key, count); }
55 Reducer (On a Worker) Input of the Reducer ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) Add up all the values associated with each intermediate key: Reduce(output_key, intermediate_vals) { set count = 0; foreach v in intermediate_vals: count += v; emit(output_key, count); } Output from the Reducer ("the", 4), ("sat", 2), ("on", 2), ("sofa", 1), ("mat", 1), ("cat", 1), ("aardvark", 1)
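The Reduce pseudocode has the same one-to-one translation: the framework hands the reducer one key plus the full list of values grouped under it. A minimal sketch, with emit modeled as a return value:

```python
def reduce_word_count(output_key, intermediate_vals):
    """Sum all the 1s collected for one word, as in the slide."""
    count = 0
    for v in intermediate_vals:
        count += v
    return (output_key, count)

print(reduce_word_count("the", [1, 1, 1, 1]))  # ('the', 4)
```

Because addition is associative and commutative, this reducer can also run as a combiner on partial lists, which is what the slide means by aggregation functions having a distributive nature.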
56 Grouping + Reducer Input of the grouping ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) Add up all the values associated with each intermediate key: Reduce(output_key, intermediate_vals) { set count = 0; foreach v in intermediate_vals: count += v; emit(output_key, count); } Output from the Reducer ("the", 4), ("sat", 2), ("on", 2), ("sofa", 1), ("mat", 1), ("cat", 1), ("aardvark", 1)
57 After the Map, all the intermediate values for a given intermediate key are combined together into a list Mapper Output ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) Grouping/Shuffling -> Reducer Input ("aardvark", [1]), ("cat", [1]), ("mat", [1]), ("on", [1, 1]), ("sat", [1, 1]), ("sofa", [1]), ("the", [1, 1, 1, 1])
58 Map + Reduce Mapper Input "the cat sat on the mat", "the aardvark sat on the sofa" Mapping ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) Grouping ("aardvark", [1]), ("cat", [1]), ("mat", [1]), ("on", [1, 1]), ("sat", [1, 1]), ("sofa", [1]), ("the", [1, 1, 1, 1]) Reducing ("aardvark", 1), ("cat", 1), ("mat", 1), ("on", 2), ("sat", 2), ("sofa", 1), ("the", 4)
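The three phases above (mapping, grouping/shuffling, reducing) can be run end-to-end in a single process to check the numbers. In the real system the framework performs the grouping across machines; here it is a defaultdict, which is only a modeling convenience:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: every (key, value) input record yields intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Grouping/shuffling phase: collect all values for each intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: one call per distinct key, in sorted key order.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [(3414, "the cat sat on the mat"),
        (3437, "the aardvark sat on the sofa")]
map_fn = lambda k, v: [(w, 1) for w in v.split()]
reduce_fn = lambda k, vs: (k, sum(vs))
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
# {'aardvark': 1, 'cat': 1, 'mat': 1, 'on': 2, 'sat': 2, 'sofa': 1, 'the': 4}
```

The output matches the slide's Reducing column exactly; the point of the framework is that each phase parallelizes across many TaskTrackers without changing this logic.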
59 High-Level Picture for MR
60 Let's use MapReduce to help Google Maps India: we want to compute the average temperature for each state
61 Let's use MapReduce to help Google Maps We want to compute the average temperature for each state
62 Let's use MapReduce to help Google Maps MP: 75 CG: 72 OR: 72
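The temperature job fits the same pattern as word count: map emits (state, reading) pairs and reduce averages them. A sketch; the line format and the individual readings below are invented for illustration, chosen so the averages match the slide's MP: 75, CG: 72, OR: 72:

```python
from collections import defaultdict

def map_temperature(record_id, line):
    """Parse a 'STATE,temperature' line into a (state, reading) pair."""
    state, temp = line.split(",")
    return [(state, float(temp))]

def reduce_average(state, temps):
    """Average all the readings collected for one state."""
    return (state, sum(temps) / len(temps))

readings = [(1, "MP,78"), (2, "MP,72"), (3, "CG,70"), (4, "CG,74"), (5, "OR,72")]

# Map + group + reduce, in one process (the framework would distribute this).
groups = defaultdict(list)
for rid, line in readings:
    for state, temp in map_temperature(rid, line):
        groups[state].append(temp)

averages = dict(reduce_average(s, ts) for s, ts in groups.items())
print(averages)  # {'MP': 75.0, 'CG': 72.0, 'OR': 72.0}
```

One subtlety worth noting: unlike sum, a plain average is not distributive, so a combiner would have to emit (sum, count) pairs rather than partial averages.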
68 Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Processing & Analysis: MapReduce Database: BigTable
69 Today's Lecture Google Applications, e.g., Gmail and Google Maps MapReduce - Lec 9 BigTable - Lec 9 Google File System (GFS) - Lec 8
70 Motivation for BigTable Lots of (semi-)structured data at Google - URLs: Content, crawl metadata, links, anchors [Search Engine] - Per-user data: User preference settings, queries [Hangout] - Geographic locations: Physical entities and satellite image data [Google maps and Google earth] Scale is large: - Billions of URLs, many versions/page (~20K/version) - Hundreds of millions of users, thousands of queries/sec - 100TB+ of satellite image data
71 Why not just use a commercial DB? Scale is too large for most commercial databases Even if it weren't, the cost would be very high - Building internally means the system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly Fun and challenging to build large-scale DB systems :)
72 BigTable [OSDI 06] Distributed multi-level map: - With an interesting data model Fault-tolerant, persistent Scalable: - Thousands of servers - Terabytes of in-memory data - Petabytes of disk-based data - Millions of reads/writes per second, efficient scans
73 BigTable Status in 2006 Design/initial implementation started beginning of 2004 Currently ~100 BigTable cells Production use or active development for many projects - Google print - My search history - Crawling/indexing pipeline - Google Maps/Google Earth Largest BigTable cell manages ~200TB of data spread over several thousand machines (larger cells planned)
74 Building Blocks for BigTable BigTable is built on several building blocks: - Google File System (GFS): stores persistent state and data - Scheduler: schedules jobs involved in BigTable serving - Lock service: master election, location bootstrapping - MapReduce: often used to process BigTable data Recall the difference between a database and a file system
75 Building Blocks for BigTable BigTable is built on several building blocks: - Google File System (GFS): stores persistent state and data - Scheduler: schedules jobs involved in BigTable serving - Lock service: master election, location bootstrapping - MapReduce: often used to process BigTable data Two questions: 1. What is the data model? 2. How to implement it?
76 BigTable's Data Model Design BigTable is NOT a relational database BigTable appears as a large table - A BigTable is a sparse, distributed, persistent multidimensional sorted map
77 BigTable's Data Model Design BigTable is NOT a relational database BigTable appears as a large table - A BigTable is a sparse, distributed, persistent multidimensional sorted map Webtable example: rows (sorted) com.aaa, com.cnn.www, com.weather; columns language (EN) and content (<!DOCTYPE html PUBLIC ...)
78 BigTable's Data Model Design (row, column, timestamp) -> cell content Webtable example: each cell can hold multiple timestamped versions (e.g., t2, t3, t4, t6, t11)
80 Rows Row name is an arbitrary string and is used as the key - Access to data in a row is atomic - Row creation is implicit upon storing data Rows are ordered lexicographically
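Lexicographic row ordering is what makes range scans cheap: related rows (e.g., all pages of one domain, stored under the reversed hostname) sit next to each other on disk. A minimal sketch using Python's bisect over a sorted list of row keys (the real system uses SSTables, and the row keys below are invented):

```python
import bisect

# Row keys are reversed hostnames, so all pages of one domain are adjacent.
row_keys = sorted(["com.aaa", "com.cnn.www", "com.cnn.money", "com.weather"])

def scan_prefix(keys, prefix):
    """Return the contiguous run of row keys starting with prefix."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")  # just past the prefix range
    return keys[lo:hi]

print(scan_prefix(row_keys, "com.cnn"))  # ['com.cnn.money', 'com.cnn.www']
```

Because the keys are sorted, a whole-domain scan is two binary searches plus one contiguous read, rather than a scan of the entire table.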
81 Column Columns have a two-level name structure - family:optional_qualifier Webtable example: com.cnn.www has columns anchor:cnnsi.com ("CNN") and anchor:mylook.ca ("CNN.com")
83 Column anchor is the column family
84 Column cnnsi.com and mylook.ca are the qualifiers
85 Column Column family - Unit of access control - Has associated type information Qualifier gives unbounded columns - Additional level of indexing, if desired
86 Column com.cnn.www is referenced by Sports Illustrated (cnnsi.com) and MyLook (mylook.ca) - The value of (com.cnn.www, anchor:cnnsi.com) is "CNN", the reference text from cnnsi.com
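The (row, column, timestamp) -> value map on these slides can be modeled as a nested dictionary. This is only a toy model of the data model (real BigTable stores the map in sorted, immutable SSTables); the timestamps and values echo the Webtable example:

```python
# Toy model of the BigTable data model:
# table[row][family:qualifier][timestamp] = value
table = {
    "com.cnn.www": {
        "language:":        {6: "EN"},
        "content:":         {6: "<!DOCTYPE html PUBLIC ...", 5: "<html>..."},
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:mylook.ca": {8: "CNN.com"},
    }
}

def read_cell(table, row, column, timestamp=None):
    """Return the value at (row, column); latest version if no timestamp."""
    versions = table[row][column]
    ts = timestamp if timestamp is not None else max(versions)
    return versions[ts]

print(read_cell(table, "com.cnn.www", "anchor:cnnsi.com"))       # CNN
print(read_cell(table, "com.cnn.www", "content:", timestamp=5))  # <html>...
```

The sparseness of the model falls out naturally: a row only stores the columns it actually has, so billions of rows with different anchor qualifiers cost nothing for the columns they lack.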
87 Tablet and Table A table starts as one tablet As it grows, it is split into multiple tablets - Approximate size: 100-200 MB per tablet by default Tablet: com.aaa, com.cnn.www, com.weather (language, content columns)
89 Tablet and Table A table starts as one tablet As it grows, it is split into multiple tablets - Approximate size: 100-200 MB per tablet by default Tablet 1: com.aaa, com.cnn.www, com.weather Tablet 2: com.tech, com.wikipedia, com.zoom (language, content columns)
90 Building Blocks for BigTable BigTable is built on several building blocks: - Google File System (GFS): stores persistent state and data - Scheduler: schedules jobs involved in BigTable serving - Lock service: master election, location bootstrapping - MapReduce: often used to process BigTable data Two questions: 1. What is the data model? 2. How to implement it?
91 BigTable Architecture BigTable Master BigTable client library Tablet Server Tablet Tablet Tablet Server Tablet Tablet Tablet Server Tablet Tablet Cluster Scheduling Google File System Chubby Lock service
92 The First Thing: Locating Tablets Since tablets move around from server to server, given a row, how do clients find the right machine? - We need to find the tablet whose row range covers the target row One solution: use the BigTable master - The central server would almost certainly be a bottleneck in a large system Instead: store special tables containing tablet location information in the BigTable cell itself
93 The First Thing: Locating Tablets 3-level hierarchical lookup scheme for tablets - Location is the IP:port of the relevant server - 1st level: bootstrapped from the lock service (Chubby), points to the owner of the META0 tablet - 2nd level: uses META0 data to find the owner of the appropriate META1 tablet - 3rd level: the META1 table holds the locations of the tablets of all other tables Chubby (pointer to META0 location) -> META0 -> META1 table -> actual tablet in table T
94 The First Thing: Locating Tablets Example: to find row key "Hi", the client follows Chubby's pointer to META0; the META0 entries (Key: A, F, I, S) select the right META1 tablet; and the META1 entries (Key: G, H, Ha, Hc, Hi) give the location of the actual tablet in table T
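Each level of the hierarchy maps a row-key range to the location of the next level, so a lookup is just repeated "find the first entry covering this key". A minimal sketch, assuming each index entry's key is the last row its tablet covers; the key ranges and server addresses are invented:

```python
import bisect

# Each level is a sorted list of (last_row_key_covered, location).
# A lookup finds the first entry whose covering key is >= the target row
# (this sketch assumes the target always falls inside some entry's range).
def lookup(index, row_key):
    keys = [k for k, _ in index]
    return index[bisect.bisect_left(keys, row_key)][1]

meta1_tablet = [("Ha", "tserver7:600"), ("Hc", "tserver2:600"), ("Hi", "tserver9:600")]
other_meta1  = [("Z", "tserver1:600")]
meta0 = [("I", meta1_tablet), ("Z", other_meta1)]
chubby = meta0  # Chubby stores the pointer to the META0 location

def locate(row_key):
    meta1 = lookup(chubby, row_key)  # Chubby -> META0 -> the right META1 tablet
    return lookup(meta1, row_key)    # META1 -> location of the data tablet

print(locate("Hi"))  # tserver9:600
```

Clients also cache these locations, so the three-hop walk is paid only on a cache miss rather than on every read.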
95 BigTable Master BigTable client library Tablet Server Tablet Tablet Tablet Server Tablet Tablet Tablet Server Tablet Tablet Cluster Scheduling Google File System Chubby Lock service
100 Metadata Operations Create/delete tables Create/delete column families Change metadata BigTable Master BigTable client library Tablet Server Tablet Tablet Tablet Server Tablet Tablet Tablet Server Tablet Tablet Cluster Scheduling Google File System Chubby Lock service
101 BigTable's APIs Metadata operations - Create/delete tables, column families, change metadata Writes: Single-row, atomic - Set(): write cells in a row - DeleteCells(): delete cells in a row - DeleteRow(): delete all cells in a row Reads: Scanner abstraction - Read arbitrary cells in a BigTable table
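The single-row write calls and the scanner read abstraction can be sketched as one small class. The method names Set, DeleteCells, and DeleteRow follow the slide; everything else (the dict storage, the Scan signature) is an assumption of this toy model:

```python
class Table:
    """Toy table with the single-row-atomic API shape from the slide."""
    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def Set(self, row, column, value):
        self.rows.setdefault(row, {})[column] = value

    def DeleteCells(self, row, *columns):
        for c in columns:
            self.rows.get(row, {}).pop(c, None)

    def DeleteRow(self, row):
        self.rows.pop(row, None)

    def Scan(self, start_row="", end_row="\xff"):
        """Scanner abstraction: yield (row, cells) in sorted key order."""
        for r in sorted(self.rows):
            if start_row <= r < end_row:
                yield r, self.rows[r]

t = Table()
t.Set("com.cnn.www", "language:", "EN")
t.Set("com.aaa", "language:", "EN")
print([r for r, _ in t.Scan()])  # ['com.aaa', 'com.cnn.www']
```

The important constraint this shape encodes is that every write touches exactly one row, which is what lets BigTable offer atomicity per row without cross-row transactions.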
103 BigTable's Write Path Client -> Tablet Server: a Put/Delete is first appended to the commit log (in the file system), then written to the memstore
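Log-first, memstore-second is the classic write-ahead-logging discipline: once the log append is durable, a crashed tablet server can rebuild its in-memory state by replay. A minimal sketch under that assumption (a Python list stands in for the commit log file in GFS):

```python
class TabletServer:
    """Toy write path: append to the commit log, then update the memstore."""
    def __init__(self):
        self.log = []        # stands in for the commit log file in GFS
        self.memstore = {}   # a sorted in-memory buffer in the real system

    def put(self, row, column, value):
        # 1. Append the mutation to the log (made durable before acking).
        self.log.append(("put", row, column, value))
        # 2. Apply it to the memstore, where reads will find it.
        self.memstore[(row, column)] = value

    def recover(self):
        """After a crash, rebuild the memstore by replaying the log."""
        self.memstore = {}
        for op, row, column, value in self.log:
            if op == "put":
                self.memstore[(row, column)] = value

ts = TabletServer()
ts.put("com.aaa", "language:", "EN")
ts.memstore.clear()   # simulate losing the in-memory state in a crash
ts.recover()
print(ts.memstore)    # {('com.aaa', 'language:'): 'EN'}
```

Note the ordering is the whole point: writing the memstore before the log would let an acknowledged write vanish in a crash.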
104 Next Lecture In lec-10, I will cover: - Transactions in distributed systems - Consistency models - Two-phase commit - Consensus protocol: Paxos
More informationCISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL
CISC 7610 Lecture 5 Distributed multimedia databases Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL Motivation YouTube receives 400 hours of video per minute That is 200M hours
More informationCSE-E5430 Scalable Cloud Computing Lecture 9
CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay
More informationBig Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla
Big Table Google s Storage Choice for Structured Data Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla Bigtable: Introduction Resembles a database. Does not support
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More information18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.
18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File
More informationDistributed Filesystem
Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationGoogle File System (GFS) and Hadoop Distributed File System (HDFS)
Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate
More informationBigtable: A Distributed Storage System for Structured Data
Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber {fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationCmpE 138 Spring 2011 Special Topics L2
CmpE 138 Spring 2011 Special Topics L2 Shivanshu Singh shivanshu.sjsu@gmail.com Map Reduce ElecBon process Map Reduce Typical single node architecture Applica'on CPU Memory Storage Map Reduce Applica'on
More informationCS5412: DIVING IN: INSIDE THE DATA CENTER
1 CS5412: DIVING IN: INSIDE THE DATA CENTER Lecture V Ken Birman Data centers 2 Once traffic reaches a data center it tunnels in First passes through a filter that blocks attacks Next, a router that directs
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationCSE 124: Networked Services Lecture-16
Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments
More informationThe MapReduce Abstraction
The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationCS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab
CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material
More informationBig Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering
Big Data Processing Technologies Chentao Wu Associate Professor Dept. of Computer Science and Engineering wuct@cs.sjtu.edu.cn Schedule (1) Storage system part (first eight weeks) lec1: Introduction on
More informationBigTable. CSE-291 (Cloud Computing) Fall 2016
BigTable CSE-291 (Cloud Computing) Fall 2016 Data Model Sparse, distributed persistent, multi-dimensional sorted map Indexed by a row key, column key, and timestamp Values are uninterpreted arrays of bytes
More informationInfrastructure system services
Infrastructure system services Badri Nath Rutgers University badri@cs.rutgers.edu Processing lots of data O(B) web pages; each O(K) bytes to O(M) bytes gives you O(T) to O(P) bytes of data Disk Bandwidth
More informationDistributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017
Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system
More informationExtreme Computing. NoSQL.
Extreme Computing NoSQL PREVIOUSLY: BATCH Query most/all data Results Eventually NOW: ON DEMAND Single Data Points Latency Matters One problem, three ideas We want to keep track of mutable state in a scalable
More informationCSE 124: Networked Services Fall 2009 Lecture-19
CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but
More informationCLOUD-SCALE FILE SYSTEMS
Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients
More informationCSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA
CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers
More informationBigTable. Chubby. BigTable. Chubby. Why Chubby? How to do consensus as a service
BigTable BigTable Doug Woos and Tom Anderson In the early 2000s, Google had way more than anybody else did Traditional bases couldn t scale Want something better than a filesystem () BigTable optimized
More informationCS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.
CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface
More informationThe Google File System. Alexandru Costan
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationCA485 Ray Walshe Google File System
Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale
More informationBig Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela
More informationDistributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access
More informationCS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.
Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client
More informationGeorgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong
Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Relatively recent; still applicable today GFS: Google s storage platform for the generation and processing of data used by services
More informationCS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.
CS 138: Google CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationThe MapReduce Framework
The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce
More informationPercolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek
Percolator Large-Scale Incremental Processing using Distributed Transactions and Notifications D. Peng & F. Dabek Motivation Built to maintain the Google web search index Need to maintain a large repository,
More informationMap-Reduce. John Hughes
Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc
More informationRecap. CSE 486/586 Distributed Systems Google Chubby Lock Service. Paxos Phase 2. Paxos Phase 1. Google Chubby. Paxos Phase 3 C 1
Recap CSE 486/586 Distributed Systems Google Chubby Lock Service Steve Ko Computer Sciences and Engineering University at Buffalo Paxos is a consensus algorithm. Proposers? Acceptors? Learners? A proposer
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationA brief history on Hadoop
Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)
More informationThe Google File System (GFS)
1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints
More informationMotivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters
Motivation MapReduce: Simplified Data Processing on Large Clusters These are slides from Dan Weld s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google,
More informationDistributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 05r. Case study: Google Cluster Architecture Paul Krzyzanowski Rutgers University Fall 2016 1 A note about relevancy This describes the Google search cluster architecture in the mid
More informationGoogle: A Computer Scientist s Playground
Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Outline Mission, data, and scaling Systems infrastructure Parallel programming model: MapReduce Googles
More informationGoogle: A Computer Scientist s Playground
Outline Mission, data, and scaling Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Systems infrastructure Parallel programming model: MapReduce Googles
More informationCluster-Level Google How we use Colossus to improve storage efficiency
Cluster-Level Storage @ Google How we use Colossus to improve storage efficiency Denis Serenyi Senior Staff Software Engineer dserenyi@google.com November 13, 2017 Keynote at the 2nd Joint International
More informationRecap. CSE 486/586 Distributed Systems Google Chubby Lock Service. Recap: First Requirement. Recap: Second Requirement. Recap: Strengthening P2
Recap CSE 486/586 Distributed Systems Google Chubby Lock Service Steve Ko Computer Sciences and Engineering University at Buffalo Paxos is a consensus algorithm. Proposers? Acceptors? Learners? A proposer
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationDistributed Computations MapReduce. adapted from Jeff Dean s slides
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
More informationDistributed Programming the Google Way
Distributed Programming the Google Way Gregor Hohpe Software Engineer www.enterpriseintegrationpatterns.com Scalable & Distributed Fault tolerant distributed disk storage: Google File System Distributed
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File
More informationMap-Reduce (PFP Lecture 12) John Hughes
Map-Reduce (PFP Lecture 12) John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More informationBig Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.
Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop
More information