MapReduce & BigTable


1 CPSC 426/526 MapReduce & BigTable Ennan Zhai Computer Science Department Yale University

2 Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Processing & Analysis: MapReduce Database: BigTable

3 Last Lecture Google Applications, e.g., Gmail and Google Maps MapReduce - Lec 9 BigTable - Lec 9 Google File System (GFS) - Lec 8

4 Today's Lecture Google Applications, e.g., Gmail and Google Maps MapReduce - Lec 9 BigTable - Lec 9 Google File System (GFS) - Lec 8

5 Recall: How does GFS work?

6 Google File System [SOSP 03] Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

7 Google File System [SOSP 03] GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

8 Google File System [SOSP 03] Data GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

9 Google File System [SOSP 03] Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

10 Google File System [SOSP 03] The design insights: - Metadata is used for indexing chunks - Huge files -> 64 MB for each chunk -> fewer chunks - Reduce client-master interaction and metadata size Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

11 Google File System [SOSP 03] Data Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

13 Google File System [SOSP 03] The design insights: - Replicas are used to ensure availability - Master can choose the nearest replicas for the client - Read and append-only makes it easy to manage replicas Data Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

14 Google File System [SOSP 03] read Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

15 Google File System [SOSP 03] read - IP address for each chunk - the ID for each chunk Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

16 Google File System [SOSP 03] read <Chunkserver1's IP, Chunk1> <Chunkserver1's IP, Chunk2> <Chunkserver2's IP, Chunk3> Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

18 Google File System [SOSP 03] read Why does GFS try to avoid random writes? Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

19 Put GFS in a Datacenter

20 Each rack has a master Put GFS in a Datacenter

22 After GFS, what do we need to do next? Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

23 Processing and Analyzing Data Very Important! Metadata GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

30 Processing and Analyzing Data A toy problem: word count - We have 10 billion documents - Average document's size is 20 KB => 10 billion docs = 200 TB Our solution: for each document d { for each word w in d { word_count[w]++; } } Running time on a single machine: approximately one month.
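A quick back-of-envelope check of that "one month" estimate; the ~100 MB/s single-machine sequential scan rate is an assumption, not from the slides:

```python
docs = 10_000_000_000            # 10 billion documents
doc_size = 20 * 1024             # 20 KB average, in bytes
total_bytes = docs * doc_size    # ~200 TB of input

scan_rate = 100 * 1024**2        # assumed sequential read rate: ~100 MB/s
seconds = total_bytes / scan_rate
print(f"~{seconds / 86400:.0f} days to scan on a single machine")
```

At that rate a single machine needs roughly three weeks just to read the input once, which is indeed on the order of a month.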

31 MapReduce Programming Model Inspired by the map and reduce operations commonly used in functional programming languages like LISP Users implement an interface of two primary methods: - 1. Map: <key1, value1> -> <key2, value2> - 2. Reduce: <key2, value2[ ]> -> <value3>
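The two-method interface above can be sketched as a toy, single-process framework (function names here are illustrative, not Google's API):

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy sketch of the MapReduce model: map -> group by key -> reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):   # Map: (k1, v1) -> [(k2, v2)]
            groups[k2].append(v2)           # Shuffle: group v2s by k2
    return {k2: reducer(k2, vs) for k2, vs in groups.items()}  # Reduce

# Word count expressed in this model:
def wc_map(doc_id, text):
    return [(w, 1) for w in text.split()]

def wc_reduce(word, counts):
    return sum(counts)

docs = [(1, "the cat sat on the mat"), (2, "the aardvark sat on the sofa")]
print(map_reduce(docs, wc_map, wc_reduce))
```

The user only writes `wc_map` and `wc_reduce`; the framework (here eight lines, at Google thousands of machines) handles grouping and distribution.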

32 MapReduce Programming Model Input -> Map(k,v) --> (k',v') -> Group (k',v')s by k' -> Reduce(k',v'[]) --> v'' -> Output

33 MapReduce Programming Model - 1. Map: <key1, value1> -> <key2, value2> - 2. Reduce: <key2, value2[ ]> -> <value3> Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g., <doc_id, doc_content>

34 MapReduce Programming Model - 1. Map: <key1, value1> -> <key2, value2> - 2. Reduce: <key2, value2[ ]> -> <value3> After the map phase, all the intermediate values for a given intermediate key are combined together into a list and given to a reducer for aggregating/merging the result.

35 MapReduce [OSDI 04] GFS is responsible for storing data for MapReduce - Data is split into chunks and distributed across nodes - Each chunk is replicated - Offers redundant storage for massive amounts of data GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

36 MapReduce [OSDI 04] MapReduce GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

37 MapReduce [OSDI 04] Heard of Hadoop? Hadoop = HDFS + Hadoop MapReduce MapReduce GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

38 MapReduce [OSDI 04] MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

39 MapReduce [OSDI 04] Two core components - JobTracker: assigning tasks to different workers - TaskTracker: executing map and reduce programs MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

41 MapReduce [OSDI 04] Document A1 Document A2 Document A3 MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker GFS Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

45 Word Count Example

46 Why do we care about word count? Word count is challenging over massive amounts of data Fundamental statistics are often aggregate functions Most aggregation functions are distributive in nature MapReduce breaks complex tasks into smaller pieces that can be executed in parallel

47 Map Phase (On a Worker) Count the # of occurrences of each word in a large amount of input data Map(input_key, input_value) { foreach word w in input_value: emit(w, 1); }

48 Map Phase (On a Worker) Input to the Mapper: (3414, "the cat sat on the mat") (3437, "the aardvark sat on the sofa") Output from the Mapper: ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)

52 Reducer (On a Worker) After the Map, all the intermediate values for a given intermediate key are combined together into a list Add up all the values associated with each intermediate key: Reduce(output_key, intermediate_vals) { count = 0; foreach v in intermediate_vals: count += v; emit(output_key, count); }

55 Reducer (On a Worker) Input to the Reducer: ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) Output from the Reducer: ("the", 4), ("sat", 2), ("on", 2), ("sofa", 1), ("mat", 1), ("cat", 1), ("aardvark", 1)

56 Grouping + Reducer Input of the grouping: ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) Add up all the values associated with each intermediate key: Reduce(output_key, intermediate_vals) { count = 0; foreach v in intermediate_vals: count += v; emit(output_key, count); } Output from the Reducer: ("the", 4), ("sat", 2), ("on", 2), ("sofa", 1), ("mat", 1), ("cat", 1), ("aardvark", 1)

57 After the Map, all the intermediate values for a given intermediate key are combined together into a list Mapper Output: ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1), ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1) -> Grouping/Shuffling -> Reducer Input: ("aardvark", [1]), ("cat", [1]), ("mat", [1]), ("on", [1, 1]), ("sat", [1, 1]), ("sofa", [1]), ("the", [1, 1, 1, 1])
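The grouping/shuffling step can be sketched as a sort followed by a group-by, which mirrors the sort-based shuffle real MapReduce implementations perform between the map and reduce phases:

```python
from itertools import groupby

pairs = [("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1),
         ("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)]

# Sort by key, then group adjacent pairs: all values for the same
# intermediate key end up in one list handed to a single reducer.
grouped = {k: [v for _, v in kvs]
           for k, kvs in groupby(sorted(pairs), key=lambda kv: kv[0])}
print(grouped["the"])  # [1, 1, 1, 1]
```

Note `groupby` only groups adjacent items, which is why the sort must come first.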

58 Map + Reduce Mapping Grouping Reducing ( the, 1), Mapper Input the cat sat on the mat the aardvark sat on the sofa ( cat, 1), ( sat, 1), ( on, 1), ( the, 1), ( mat, 1), ( the, 1), ( aardvark, 1), ( sat, 1), ( on, 1), aardvark, 1 cat, 1 mat, 1 on [1, 1] sat [1, 1] sofa, 1 the [1, 1, 1, 1] aardvark, 1 cat, 1 mat, 1 on, 2 sat, 2 sofa, 1 the, 4 ( the, 1), ( sofa, 1)

59 High-Level Picture for MR

60 Let's use MapReduce to help Google Maps (India) We want to compute the average temperature for each state

62 Let's use MapReduce to help Google Maps MP: 75 CG: 72 OR: 72
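The per-state averages can be computed in the same model: the mapper keys each reading by state, and the reducer averages the grouped list. The readings below are hypothetical values chosen to reproduce the slide's results:

```python
from collections import defaultdict

# Hypothetical per-city readings: (state, temperature)
readings = [("MP", 74), ("MP", 76), ("CG", 72), ("OR", 70), ("OR", 74)]

def avg_map(_, record):
    state, temp = record
    yield (state, temp)             # intermediate key = state

def avg_reduce(state, temps):
    return sum(temps) / len(temps)  # average over all readings for the state

groups = defaultdict(list)
for i, rec in enumerate(readings):  # map + shuffle
    for k, v in avg_map(i, rec):
        groups[k].append(v)
averages = {s: avg_reduce(s, ts) for s, ts in groups.items()}  # reduce
print(averages)  # {'MP': 75.0, 'CG': 72.0, 'OR': 72.0}
```

Averaging works here because the reducer sees every reading for a state in one list; a production job would typically emit (sum, count) pairs instead so partial results can be combined.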


68 Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Processing & Analysis: MapReduce Database: BigTable

69 Today's Lecture Google Applications, e.g., Gmail and Google Maps MapReduce - Lec 9 BigTable - Lec 9 Google File System (GFS) - Lec 8

70 Motivation for BigTable Lots of (semi-)structured data at Google - URLs: Content, crawl metadata, links, anchors [Search Engine] - Per-user data: User preference settings, queries [Hangouts] - Geographic locations: Physical entities and satellite image data [Google Maps and Google Earth] Scale is large: - Billions of URLs, many versions/page (~20 KB/version) - Hundreds of millions of users, thousands of queries/sec - 100 TB+ of satellite image data

71 Why not just use a commercial DB? Scale is too large for most commercial databases Even if it weren't, the cost would be very high - Building internally means the system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly Fun and challenging to build large-scale DB systems :)

72 BigTable [OSDI 06] Distributed multi-level map: - With an interesting data model Fault-tolerant, persistent Scalable: - Thousands of servers - Terabytes of in-memory data - Petabyte of disk-based data - Millions of reads/writes per second, efficient scans

73 BigTable Status in 2006 Design/initial implementation started at the beginning of 2004 Currently ~100 BigTable cells Production use or active development for many projects - Google Print - My Search History - Crawling/indexing pipeline - Google Maps/Google Earth Largest BigTable cell manages ~200 TB of data spread over several thousand machines (larger cells planned)

74 Building Blocks for BigTable BigTable uses several building blocks: - Google File System (GFS): stores persistent state and data - Scheduler: schedules jobs involved in BigTable serving - Lock service: master election, location bootstrapping - MapReduce: often used to process BigTable data Recall the difference between a database and a file system

75 Building Blocks for BigTable BigTable uses several building blocks: - Google File System (GFS): stores persistent state and data - Scheduler: schedules jobs involved in BigTable serving - Lock service: master election, location bootstrapping - MapReduce: often used to process BigTable data Two questions: 1. What is the data model? 2. How to implement it?

76 BigTable's Data Model Design BigTable is NOT a relational database BigTable appears as a large table - A BigTable is a sparse, distributed, persistent multidimensional sorted map

77 BigTable's Data Model Design BigTable is NOT a relational database BigTable appears as a large table - A BigTable is a sparse, distributed, persistent multidimensional sorted map (Webtable example: rows, sorted: com.aaa, com.cnn.www, com.weather; columns: "language" = EN, EN, EN, ... and "content" = <!DOCTYPE html PUBLIC..., ...)

78 BigTable's Data Model Design A cell's content is indexed by (row, column, timestamp) (Webtable example: rows, sorted: com.aaa, com.cnn.www, com.weather; columns "language" and "content"; versions t2, t3, t4, t6, t11)

80 Rows Row name is an arbitrary string and is used as the key - Access to data in a row is atomic - Row creation is implicit upon storing data - Rows are ordered lexicographically (Webtable example)

81 Column Columns have a two-level name structure - family:optional_qualifier, e.g., anchor:cnnsi.com and anchor:mylook.ca - "anchor" is the family; the anchoring site is the qualifier (Webtable example: row com.cnn.www has language = EN, content, anchor:cnnsi.com = "CNN", anchor:mylook.ca = "CNN.com")

85 Column Column family - Unit of access control - Has associated type information Qualifier gives unbounded columns - Additional level of indexing, if desired (Webtable example)

86 Column com.cnn.www is referenced by Sports Illustrated (cnnsi.com) and MyLook (mylook.ca) - The value of ("com.cnn.www", "anchor:cnnsi.com") is "CNN", the reference text from cnnsi.com (Webtable example)
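The data model described above, a sorted map from (row, column, timestamp) to a cell value, can be sketched in a few lines; the class, its methods, and the cell values here are all illustrative, not BigTable's actual API:

```python
import bisect

class ToyTable:
    """Toy sketch of BigTable's model: a sorted multidimensional map."""
    def __init__(self):
        self.keys = []    # kept sorted: (row, column, timestamp)
        self.cells = {}

    def set(self, row, column, ts, value):
        key = (row, column, ts)
        if key not in self.cells:
            bisect.insort(self.keys, key)   # rows stay lexicographically ordered
        self.cells[key] = value

    def get(self, row, column):
        # Return the latest version of the cell (highest timestamp).
        i = bisect.bisect_right(self.keys, (row, column, float("inf")))
        if i and self.keys[i - 1][:2] == (row, column):
            return self.cells[self.keys[i - 1]]
        return None                          # sparse: missing cells cost nothing

t = ToyTable()
t.set("com.cnn.www", "anchor:cnnsi.com", 2, "CNN")
t.set("com.cnn.www", "anchor:cnnsi.com", 6, "CNN Sports")  # hypothetical newer version
print(t.get("com.cnn.www", "anchor:cnnsi.com"))  # CNN Sports
```

Keeping the keys sorted is what later makes range scans and tablet splits cheap.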

87 Tablet and Table A table starts as one tablet As it grows, it is split into multiple tablets - Approximate size: 100-200 MB per tablet by default Tablet: rows com.aaa, com.cnn.www, com.weather with columns language (EN) and content (<!DOCTYPE html PUBLIC...)

89 Tablet and Table A table starts as one tablet As it grows, it is split into multiple tablets - Approximate size: 100-200 MB per tablet by default Tablet 1: com.aaa, com.cnn.www, com.weather Tablet 2: com.tech, com.wikipedia, com.zoom (each row with columns language = EN and content = <!DOCTYPE html PUBLIC...)
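Because rows are kept sorted, splitting a table into tablets is just cutting the sorted row list into contiguous ranges. A sketch, using a tiny row-count threshold in place of the real size threshold:

```python
def split_into_tablets(sorted_rows, max_rows_per_tablet=3):
    """Cut a sorted row list into contiguous (start_row, end_row, rows) ranges."""
    tablets = []
    for i in range(0, len(sorted_rows), max_rows_per_tablet):
        chunk = sorted_rows[i:i + max_rows_per_tablet]
        tablets.append((chunk[0], chunk[-1], chunk))
    return tablets

rows = ["com.aaa", "com.cnn.www", "com.weather",
        "com.tech", "com.wikipedia", "com.zoom"]
tablets = split_into_tablets(sorted(rows))
for start, end, _ in tablets:
    print(start, "-", end)
```

Each tablet covers a contiguous row range, so any single row lives in exactly one tablet and single-row operations stay atomic on one server.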

90 Building Blocks for BigTable BigTable uses several building blocks: - Google File System (GFS): stores persistent state and data - Scheduler: schedules jobs involved in BigTable serving - Lock service: master election, location bootstrapping - MapReduce: often used to process BigTable data Two questions: 1. What is the data model? 2. How to implement it?

91 BigTable Architecture BigTable Master BigTable client library Tablet Server Tablet Tablet Tablet Server Tablet Tablet Tablet Server Tablet Tablet Cluster Scheduling Google File System Chubby Lock service

92 The First Thing: Locating Tablets Since tablets move around from server to server, given a row, how do clients find the right machine? - We need to find the tablet whose row range covers the target row One solution: could use the BigTable master - A central server would almost certainly be a bottleneck in a large system Instead: store special tables containing tablet location information in the BigTable cell itself

93 The First Thing: Locating Tablets 3-level hierarchical lookup scheme for tablets - Location is IP:port of the relevant server - 1st level: bootstrapped from the lock service (Chubby), points to the owner of META0 - 2nd level: uses META0 data to find the owner of the appropriate META1 tablet - 3rd level: the META1 table holds locations of tablets of all other tables (Diagram: Chubby -> pointer to META0 location -> META0 -> META1 table -> actual tablet in table T)

94 The First Thing: Locating Tablets 3-level hierarchical lookup scheme for tablets - Location is IP:port of the relevant server - 1st level: bootstrapped from the lock service (Chubby), points to the owner of META0 - 2nd level: uses META0 data to find the owner of the appropriate META1 tablet - 3rd level: the META1 table holds locations of tablets of all other tables (Diagram: searching for row key "Hi": Chubby -> META0 with entries Key: A, Key: F, Key: I, Key: S -> META1 tablet with entries Key: F, Key: G, Key: H -> actual tablet in table T holding keys Ha, Hc, Hi)
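The 3-level lookup is a range search at each level: find the entry whose key range covers the target row, then descend. A sketch with illustrative server names and key ranges (each "tablet" is a sorted list of (end_row_key, payload) entries):

```python
import bisect

def lookup(tablet, row_key):
    """Return the payload of the entry whose range covers row_key
    (entries are keyed by their inclusive end row)."""
    ends = [end for end, _ in tablet]
    return tablet[bisect.bisect_left(ends, row_key)][1]

# Level 1: Chubby points at META0; META0 entries point at META1 tablets;
# META1 entries point at the tablet servers holding user tablets.
meta1_FI = [("G", "server-g:9000"), ("H", "server-h:9000"), ("Hi", "server-hi:9000")]
meta0    = [("F", [("A", None), ("F", None)]),  # META1 tablet for keys up to F
            ("I", meta1_FI),                    # META1 tablet for keys F..I
            ("S", [])]                          # META1 tablet for keys I..S

meta1_tablet = lookup(meta0, "Hi")     # 2nd level: which META1 tablet?
location = lookup(meta1_tablet, "Hi")  # 3rd level: which tablet server?
print(location)  # server-hi:9000
```

In practice clients cache these locations aggressively, so most reads skip the hierarchy entirely and only re-walk it on a cache miss.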

95 BigTable Master BigTable client library Tablet Server Tablet Tablet Tablet Server Tablet Tablet Tablet Server Tablet Tablet Cluster Scheduling Google File System Chubby Lock service


100 Metadata Operations Create/delete tables Create/delete column families Change metadata BigTable Master BigTable client library Tablet Server Tablet Tablet Tablet Server Tablet Tablet Tablet Server Tablet Tablet Cluster Scheduling Google File System Chubby Lock service

101 BigTable's APIs Metadata operations - Create/delete tables, column families, change metadata Writes: single-row, atomic - Set(): write cells in a row - DeleteCells(): delete cells in a row - DeleteRow(): delete all cells in a row Reads: Scanner abstraction - Read arbitrary cells in a BigTable table
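A minimal sketch of that API surface over a plain in-memory dict; the method names follow the slide, but the class, signatures, and storage are illustrative, not BigTable's real client library:

```python
class ToyBigtableClient:
    """Toy sketch: single-row writes plus a scanner over a row range."""
    def __init__(self):
        self.rows = {}  # {row: {column: value}}

    def Set(self, row, column, value):      # atomic single-row write
        self.rows.setdefault(row, {})[column] = value

    def DeleteCells(self, row, column):     # delete cells in a row
        self.rows.get(row, {}).pop(column, None)

    def DeleteRow(self, row):               # delete all cells in a row
        self.rows.pop(row, None)

    def Scan(self, start_row, end_row):     # scanner over a sorted row range
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                yield row, self.rows[row]

c = ToyBigtableClient()
c.Set("com.cnn.www", "language", "EN")
c.Set("com.aaa", "language", "EN")
print([r for r, _ in c.Scan("com.a", "com.d")])  # ['com.aaa', 'com.cnn.www']
```

The scanner works on a row range rather than arbitrary predicates, which is exactly what the sorted, tablet-partitioned layout makes efficient.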

103 BigTable's Write Path Client sends Put/Delete to the Tablet Server; the Tablet Server first appends the mutation to the log (on the file system), then writes it to the in-memory memstore
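The log-first ordering is the classic write-ahead-log pattern: the append makes the mutation durable before the fast in-memory write. A sketch under those assumptions; the file name and JSON record format are illustrative:

```python
import json, os, tempfile

class ToyTabletServer:
    """Toy write path: append to a commit log, then apply to the memstore."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.memstore = {}

    def put(self, row, column, value):
        record = {"op": "put", "row": row, "col": column, "val": value}
        with open(self.log_path, "a") as log:     # 1. append to commit log
            log.write(json.dumps(record) + "\n")
        self.memstore[(row, column)] = value      # 2. apply to memstore

    def recover(self):
        # After a crash, replay the log to rebuild the memstore.
        self.memstore = {}
        with open(self.log_path) as log:
            for line in log:
                r = json.loads(line)
                self.memstore[(r["row"], r["col"])] = r["val"]

path = os.path.join(tempfile.mkdtemp(), "commit.log")
ts = ToyTabletServer(path)
ts.put("com.aaa", "language", "EN")

ts2 = ToyTabletServer(path)   # simulate a restart on the same log
ts2.recover()
print(ts2.memstore[("com.aaa", "language")])  # EN
```

If the server dies between the two steps, the log still has the record, so recovery replays it; dying before the append simply loses an unacknowledged write.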

104 Next Lecture In lec-10, I will cover: - Transactions in distributed systems - Consistency models - Two-phase commit - Consensus protocol: Paxos


More information

CS5412: DIVING IN: INSIDE THE DATA CENTER

CS5412: DIVING IN: INSIDE THE DATA CENTER 1 CS5412: DIVING IN: INSIDE THE DATA CENTER Lecture V Ken Birman We ve seen one cloud service 2 Inside a cloud, Dynamo is an example of a service used to make sure that cloud-hosted applications can scale

More information

CA485 Ray Walshe NoSQL

CA485 Ray Walshe NoSQL NoSQL BASE vs ACID Summary Traditional relational database management systems (RDBMS) do not scale because they adhere to ACID. A strong movement within cloud computing is to utilize non-traditional data

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc. MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What

More information

Distributed Database Case Study on Google s Big Tables

Distributed Database Case Study on Google s Big Tables Distributed Database Case Study on Google s Big Tables Anjali diwakar dwivedi 1, Usha sadanand patil 2 and Vinayak D.Shinde 3 1,2,3 Computer Engineering, Shree l.r.tiwari college of engineering Abstract-

More information

CS5412: OTHER DATA CENTER SERVICES

CS5412: OTHER DATA CENTER SERVICES 1 CS5412: OTHER DATA CENTER SERVICES Lecture V Ken Birman Tier two and Inner Tiers 2 If tier one faces the user and constructs responses, what lives in tier two? Caching services are very common (many

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL CISC 7610 Lecture 5 Distributed multimedia databases Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL Motivation YouTube receives 400 hours of video per minute That is 200M hours

More information

CSE-E5430 Scalable Cloud Computing Lecture 9

CSE-E5430 Scalable Cloud Computing Lecture 9 CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay

More information

Big Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla

Big Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla Big Table Google s Storage Choice for Structured Data Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla Bigtable: Introduction Resembles a database. Does not support

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

Bigtable: A Distributed Storage System for Structured Data

Bigtable: A Distributed Storage System for Structured Data Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber {fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

CmpE 138 Spring 2011 Special Topics L2

CmpE 138 Spring 2011 Special Topics L2 CmpE 138 Spring 2011 Special Topics L2 Shivanshu Singh shivanshu.sjsu@gmail.com Map Reduce ElecBon process Map Reduce Typical single node architecture Applica'on CPU Memory Storage Map Reduce Applica'on

More information

CS5412: DIVING IN: INSIDE THE DATA CENTER

CS5412: DIVING IN: INSIDE THE DATA CENTER 1 CS5412: DIVING IN: INSIDE THE DATA CENTER Lecture V Ken Birman Data centers 2 Once traffic reaches a data center it tunnels in First passes through a filter that blocks attacks Next, a router that directs

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Big Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering

Big Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering Big Data Processing Technologies Chentao Wu Associate Professor Dept. of Computer Science and Engineering wuct@cs.sjtu.edu.cn Schedule (1) Storage system part (first eight weeks) lec1: Introduction on

More information

BigTable. CSE-291 (Cloud Computing) Fall 2016

BigTable. CSE-291 (Cloud Computing) Fall 2016 BigTable CSE-291 (Cloud Computing) Fall 2016 Data Model Sparse, distributed persistent, multi-dimensional sorted map Indexed by a row key, column key, and timestamp Values are uninterpreted arrays of bytes

More information

Infrastructure system services

Infrastructure system services Infrastructure system services Badri Nath Rutgers University badri@cs.rutgers.edu Processing lots of data O(B) web pages; each O(K) bytes to O(M) bytes gives you O(T) to O(P) bytes of data Disk Bandwidth

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system

More information

Extreme Computing. NoSQL.

Extreme Computing. NoSQL. Extreme Computing NoSQL PREVIOUSLY: BATCH Query most/all data Results Eventually NOW: ON DEMAND Single Data Points Latency Matters One problem, three ideas We want to keep track of mutable state in a scalable

More information

CSE 124: Networked Services Fall 2009 Lecture-19

CSE 124: Networked Services Fall 2009 Lecture-19 CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

BigTable. Chubby. BigTable. Chubby. Why Chubby? How to do consensus as a service

BigTable. Chubby. BigTable. Chubby. Why Chubby? How to do consensus as a service BigTable BigTable Doug Woos and Tom Anderson In the early 2000s, Google had way more than anybody else did Traditional bases couldn t scale Want something better than a filesystem () BigTable optimized

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

The Google File System. Alexandru Costan

The Google File System. Alexandru Costan 1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access

More information

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment. Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client

More information

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Relatively recent; still applicable today GFS: Google s storage platform for the generation and processing of data used by services

More information

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Percolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek

Percolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek Percolator Large-Scale Incremental Processing using Distributed Transactions and Notifications D. Peng & F. Dabek Motivation Built to maintain the Google web search index Need to maintain a large repository,

More information

Map-Reduce. John Hughes

Map-Reduce. John Hughes Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc

More information

Recap. CSE 486/586 Distributed Systems Google Chubby Lock Service. Paxos Phase 2. Paxos Phase 1. Google Chubby. Paxos Phase 3 C 1

Recap. CSE 486/586 Distributed Systems Google Chubby Lock Service. Paxos Phase 2. Paxos Phase 1. Google Chubby. Paxos Phase 3 C 1 Recap CSE 486/586 Distributed Systems Google Chubby Lock Service Steve Ko Computer Sciences and Engineering University at Buffalo Paxos is a consensus algorithm. Proposers? Acceptors? Learners? A proposer

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters Motivation MapReduce: Simplified Data Processing on Large Clusters These are slides from Dan Weld s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google,

More information

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 05r. Case study: Google Cluster Architecture Paul Krzyzanowski Rutgers University Fall 2016 1 A note about relevancy This describes the Google search cluster architecture in the mid

More information

Google: A Computer Scientist s Playground

Google: A Computer Scientist s Playground Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Outline Mission, data, and scaling Systems infrastructure Parallel programming model: MapReduce Googles

More information

Google: A Computer Scientist s Playground

Google: A Computer Scientist s Playground Outline Mission, data, and scaling Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Systems infrastructure Parallel programming model: MapReduce Googles

More information

Cluster-Level Google How we use Colossus to improve storage efficiency

Cluster-Level Google How we use Colossus to improve storage efficiency Cluster-Level Storage @ Google How we use Colossus to improve storage efficiency Denis Serenyi Senior Staff Software Engineer dserenyi@google.com November 13, 2017 Keynote at the 2nd Joint International

More information

Recap. CSE 486/586 Distributed Systems Google Chubby Lock Service. Recap: First Requirement. Recap: Second Requirement. Recap: Strengthening P2

Recap. CSE 486/586 Distributed Systems Google Chubby Lock Service. Recap: First Requirement. Recap: Second Requirement. Recap: Strengthening P2 Recap CSE 486/586 Distributed Systems Google Chubby Lock Service Steve Ko Computer Sciences and Engineering University at Buffalo Paxos is a consensus algorithm. Proposers? Acceptors? Learners? A proposer

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

Distributed Programming the Google Way

Distributed Programming the Google Way Distributed Programming the Google Way Gregor Hohpe Software Engineer www.enterpriseintegrationpatterns.com Scalable & Distributed Fault tolerant distributed disk storage: Google File System Distributed

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Map-Reduce (PFP Lecture 12) John Hughes

Map-Reduce (PFP Lecture 12) John Hughes Map-Reduce (PFP Lecture 12) John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information