Map Reduce dacosta@irit.fr
Divide and conquer at PaaS
- 100% parallel

Typical problem
- Iterate over a large number of records
- Extract something of interest from each (Map)
- Shuffle and sort intermediate results
- Aggregate intermediate results (Reduce)
- Generate final output
Key idea: functional abstraction for these two operations

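The five steps above can be sketched on a single machine. This is only an illustrative simulation, not how a real runtime is built: the driver name `run_mapreduce` and the sensor-reading data are assumptions made for the example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process sketch of the map / shuffle-sort / reduce pipeline."""
    # Map: extract (key, value) pairs of interest from each record.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle and sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: aggregate the values of each key into the final output.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

# Hypothetical data: (sensor, reading) records, keep the maximum per sensor.
readings = [("s1", 3), ("s2", 7), ("s1", 9)]
result = run_mapreduce(readings, lambda r: [r], lambda k, vs: max(vs))
```

The programmer only supplies `map_fn` and `reduce_fn`; everything else is the job of the framework.
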
Folding

Difficulties?
- Huge amount of data
  - Does not fit into memory
- Access patterns are broad
  - Most data not accessed frequently
- Complex data
  - Links between data or treatment
  - Same data can be treated in different ways
- No pre-processing
  - Example: crawling through internet data

Principle
- "Map" step: The master node takes the input, divides it into smaller subproblems, and distributes them to worker nodes
- "Reduce" step: The master node then collects the answers to all the subproblems and combines them in some way to form the output

MapReduce
- Programmers specify two functions:
  - map (k, v) -> <k', v'>*
  - reduce (k', v') -> <k', v'>*
  - All values with the same key are reduced together
- Usually, programmers also specify:
  - partition (k', number of partitions) -> partition for k'
    - Often a simple hash of the key, e.g. hash(k') mod n
    - Allows reduce operations for different keys in parallel
  - combine (k', v') -> <k', v'>
    - Mini-reducers that run in memory after the map phase
    - Optimizes to reduce network traffic & disk writes
- Implementations:
  - Google has a proprietary implementation in C++
  - Hadoop is an open source implementation in Java

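A minimal sketch of the two optional functions, assuming string keys and integer counts as values. The function names are illustrative, and `zlib.crc32` is used as the hash only because Python's built-in `hash` for strings changes between processes; neither choice reflects the Google or Hadoop implementations.

```python
import zlib
from collections import Counter

def partition(key, num_partitions):
    """Assign a key to a reducer: a stable hash of the key, mod n."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def combine(pairs):
    """Mini-reducer: pre-aggregate one mapper's output in memory."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1)]
combined = combine(mapper_output)  # fewer pairs cross the network
targets = {key: partition(key, 4) for key, _ in combined}
```

Because the same key always hashes to the same partition, all of its values reach the same reducer, while different keys can be reduced in parallel.
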
Word count

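Word count is the usual first example. The sketch below simulates it in one process; the function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    """Map: emit (word, 1) for every word of the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_word_count(word, counts):
    """Reduce: sum all the counts received for one word."""
    return word, sum(counts)

def word_count(documents):
    # Shuffle: group the emitted values by word.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_word_count(doc_id, text):
            groups[word].append(one)
    return dict(reduce_word_count(w, c) for w, c in sorted(groups.items()))

counts = word_count({"d1": "to be or not to be"})
```
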
Example: Average number of contracts by age
- For 1 million entries
- Map: 1100 of them
  - Batch of 1 Y
  - Output of Map: range 8-110
- Reduce: 102 of them
  - Treat 1000's
  - Batch of 1000 values
  - Output: 102

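One possible way to write this job, assuming each entry is an `(age, number_of_contracts)` tuple (the record layout is not given on the slide). The mapper emits partial `(sum, count)` pairs rather than averages, since averages of averages would be wrong if a combiner were added.

```python
from collections import defaultdict

def map_contracts(entry):
    """Map: key by age, emit a partial (sum, count) pair."""
    age, contracts = entry
    yield age, (contracts, 1)

def reduce_average(age, partials):
    """Reduce: add up sums and counts, then divide once."""
    total = sum(c for c, _ in partials)
    count = sum(n for _, n in partials)
    return age, total / count

def average_contracts_by_age(entries):
    groups = defaultdict(list)
    for entry in entries:
        for age, partial in map_contracts(entry):
            groups[age].append(partial)
    return dict(reduce_average(a, p) for a, p in groups.items())

averages = average_contracts_by_age([(30, 2), (30, 4), (45, 1)])
```
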
MapReduce Runtime
- Handles scheduling
  - Assigns workers to map and reduce tasks
- Handles data distribution
  - Moves the process to the data
- Handles synchronization
  - Gathers, sorts, and shuffles intermediate data
- Handles faults
  - Detects worker failures and restarts
- Everything happens on top of a distributed FS (later)

How do we get data to the workers?
- Classical cluster vision
- What's the problem here?

Distributed File System
- Don't move data to workers... Move workers to the data!
  - Store data on the local disks for nodes in the cluster
  - Start up the workers on the node that has the data local
- Why?
  - Not enough RAM to hold all the data in memory
  - Disk access is slow, disk throughput is good
- A distributed file system is the answer
  - GFS (Google File System)
  - HDFS for Hadoop (= GFS clone)

GFS: Assumptions
- Commodity hardware over exotic hardware
- High component failure rates
  - Inexpensive commodity components fail all the time
- Modest number of HUGE files
- Files are write-once, mostly appended to
  - Perhaps concurrently
- Large streaming reads over random access
- High sustained throughput over low latency

GFS: Design Decisions
- Files stored as chunks
  - Fixed size (64MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large data sets, streaming reads
- Simplify the API
  - Push some of the issues onto the client

Grid Computing by the fathers of the Grid

Master's Responsibilities
- Metadata storage
- Namespace management/locking
- Periodic communication with chunkservers
- Chunk creation, replication, rebalancing
- Garbage collection

Example: Inverted Indexing

Architecture of IR Systems

How do we represent text?
- Bag of words
  - Treat all the words in a document as index terms for that document
  - Assign a weight to each term based on importance
  - Disregard order, structure, meaning, etc. of the words
  - Simple, yet effective!
- Assumptions
  - Term occurrence is independent
  - Document relevance is independent
  - Words are well-defined

Sample Document

McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste....

Bag of Words
- 16: said
- 14: McDonalds
- 12: fat
- 11: fries
- 8: new
- 6: company, french, nutrition
- 5: food, oil, percent, reduce, taste, Tuesday
- ...

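A bag of words like the one above can be computed in a few lines. This is a rough sketch: the tokenizer (lowercase runs of letters) is an assumption and is cruder than what a real IR system would use, for instance it splits "McDonald's" into "mcdonald" and "s".

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, ignoring order and structure."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("Fat fries, fat oil.")
top = bag.most_common(1)
```
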
Representing Documents

Inverted Index

Boolean Retrieval
- To execute a Boolean query:
  - Build query syntax tree
  - For each clause, look up postings
  - Traverse postings and apply Boolean operator
- Efficiency analysis
  - Postings traversal is linear (assuming sorted postings)
  - Start with shortest posting first

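The linear traversal can be illustrated for the AND operator. The sketch assumes each postings list is a sorted list of integer document ids; the function names are illustrative.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2))."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def boolean_and(postings_lists):
    """AND several terms, starting with the shortest postings list."""
    ordered = sorted(postings_lists, key=len)
    if not ordered:
        return []
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result

docs = boolean_and([[1, 2, 3, 5, 8], [2, 5], [2, 3, 5, 9]])
```

Starting with the shortest list keeps the intermediate result small, so later merges have less to scan.
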
Term Weighting

MapReduce it?
- The indexing problem
  - Must be relatively fast, but need not be real time
  - For Web, incremental updates are important
  - Crawling is a challenge itself!
- The retrieval problem
  - Must have sub-second response
  - For Web, only need relatively few results

Indexing: Performance Analysis
- Fundamentally, a large sorting problem
  - Terms usually fit in memory
  - Postings usually don't
- How is it done on a single machine?
- How large is the inverted index?
  - Size of vocabulary
  - Size of postings

MapReduce: Index Construction
- Map over all documents
  - Emit term as key, (docid, tf) as value
  - Emit other information as necessary (e.g., term position)
- Reduce
  - Trivial: each value represents a posting!
  - Might want to sort the postings (e.g., by docid or tf)
- MapReduce does all the heavy lifting!

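A single-process sketch of this job, assuming documents are given as a dict from docid to text and that terms are whitespace-separated; the names are illustrative.

```python
from collections import Counter, defaultdict

def map_index(docid, text):
    """Map: emit term as key, (docid, tf) as value."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (docid, tf)

def reduce_index(term, postings):
    """Reduce: each value is a posting; sort them by docid."""
    return term, sorted(postings)

def build_index(documents):
    # Shuffle: group postings by term.
    groups = defaultdict(list)
    for docid, text in documents.items():
        for term, posting in map_index(docid, text):
            groups[term].append(posting)
    return dict(reduce_index(t, p) for t, p in groups.items())

index = build_index({1: "fat fries fat", 2: "new fries"})
```
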
Query Execution
- MapReduce is meant for large-data batch processing
  - Not suitable for lots of real time operations requiring low latency
- The solution: the secret sauce
  - Most likely involves document partitioning
  - Lots of system engineering: e.g., caching, load balancing, etc.

MapReduce Algorithm Design

Managing Dependencies
- Remember: Mappers run in isolation
  - You have no idea in what order the mappers run
  - You have no idea on what node the mappers run
  - You have no idea when each mapper finishes
- Tools for synchronization:
  - Ability to hold state in reducer across multiple key-value pairs
  - Sorting function for keys
  - Partitioner
  - Cleverly-constructed data structures

For the programmer
- Input reader
  - Reads data from stable storage and generates key/value pairs
- Map function
  - Takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs
- Partition function
  - Each Map function output is allocated to a particular reducer by the application's partition function
- Compare function
- Reduce function
  - The framework calls the application's Reduce function once for each unique key in the sorted order
- Output writer
  - Writes the output of the Reduce to the stable storage

Input -> Map -> Copy/Sort -> Reduce -> Output

Use cases
- Word count, a little less naive!
  - Before: 1 message per word in the text
  - Here: 1 message per different word in the text

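The difference between the two mappers can be sketched as follows, assuming whitespace tokenization; both function names are made up for the comparison.

```python
from collections import Counter

def map_naive(text):
    """One message per word in the text."""
    return [(word, 1) for word in text.split()]

def map_aggregated(text):
    """One message per different word: counts are summed in the mapper."""
    return sorted(Counter(text.split()).items())

naive = map_naive("a b a a")
aggregated = map_aggregated("a b a a")
```

The reducer is unchanged in both cases (it sums the values of each key); only the number of intermediate messages shrinks.
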
Co-occurrence
- Count the number of co-occurrences of n elements in sets
- Examples
  - Customers who buy this also buy that
  - Words that appear in the same sentence
- If there are N elements
  - Report occurrence of NxN couples
- On a single node, quite simple
    Foreach set
      Foreach i in set
        Foreach j in set
          Res[i][j]++
- MapReduce version?

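The single-node loop translates almost literally into Python. As in the pseudocode, the diagonal (i == j) is counted too; nested dictionaries stand in for the NxN matrix, which is an implementation choice made here.

```python
from collections import defaultdict

def cooccurrences(sets):
    """Res[i][j] = number of sets that contain both i and j."""
    res = defaultdict(lambda: defaultdict(int))
    for s in sets:
        for i in s:
            for j in s:
                res[i][j] += 1
    return res

res = cooccurrences([{"a", "b"}, {"a", "c"}])
```
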
Pairs approach
- Easy and straightforward implementation
- Too many intermediary keys
- Optimize using local accumulation of counts of [i,j]
  - Easy optimization
  - Only a small improvement (large key space)

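A sketch of the pairs approach, simulated in one process: the mapper emits one `((i, j), 1)` message per couple, and the reducer sums per key. Function names are illustrative, and the shuffle and reduce steps are merged into one helper for brevity.

```python
from collections import Counter

def map_pairs(s):
    """Map: emit ((i, j), 1) for every couple found in one set."""
    for i in s:
        for j in s:
            yield (i, j), 1

def reduce_pairs(emitted):
    """Shuffle + reduce: sum the counts of each (i, j) key."""
    counts = Counter()
    for pair, n in emitted:
        counts[pair] += n
    return counts

def pairs_cooccurrences(sets):
    emitted = [kv for s in sets for kv in map_pairs(s)]
    return reduce_pairs(emitted)

pair_counts = pairs_cooccurrences([["a", "b"], ["a", "c"]])
```

Each set of size n produces n x n messages, which is where the large number of intermediate keys comes from.
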
Stripes approach
- Faster, lower number of intermediate keys
- Can lead to memory problems
- More complex implementation

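A sketch of the stripes approach under the same single-process simulation: the mapper emits one key per element, with a whole row of counts (a "stripe") as value, and the reducer adds stripes element-wise. Names are illustrative.

```python
from collections import Counter, defaultdict

def map_stripes(s):
    """Map: for each element i, emit (i, stripe) where stripe maps j -> count."""
    for i in s:
        yield i, Counter(s)

def reduce_stripes(key, stripes):
    """Reduce: element-wise sum of all stripes received for one key."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts
    return key, total

def stripes_cooccurrences(sets):
    groups = defaultdict(list)
    for s in sets:
        for i, stripe in map_stripes(s):
            groups[i].append(stripe)
    return dict(reduce_stripes(i, st) for i, st in groups.items())

stripe_counts = stripes_cooccurrences([["a", "b"], ["a", "c"]])
```

There are only N possible keys instead of NxN, but each stripe must fit in memory, which is the memory risk noted above.
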
Other examples
- Grep
  - 10^10 100-byte records
  - Seek a rare 3-letter word
  - 1800 machines
  - Peak performance: 30 GB/s with 1764 workers
  - 150s
  - 1 minute startup
- Sort
  - Same environment and dataset
  - 50 lines of code
  - 891 seconds

Characteristics
- Manages failure well
  - Just send the keys again
- Heavy on the file system
  - Needs a dedicated and adapted filesystem
- Scales well
  - In terms of data, workflow
- Easy to use
  - Some translation tools from SQL are available
- Middleware manages data and computing locality

Some users
- Google
  - They normalized it
  - They use it internally for:
    - large-scale machine learning problems,
    - clustering problems for the Google News and Froogle products,
    - extracting data to produce reports of popular queries (e.g. Google Zeitgeist and Google Trends),
    - extracting properties of Web pages for new experiments and products (e.g. extraction of geographical locations from a large corpus of Web pages for localized search),
    - processing of satellite imagery data,
    - language model processing for statistical machine translation,
    - and large-scale graph computations.

Other users
- Facebook
  - Hadoop
  - Now uses Corona (own implementation)
- Yahoo
  - Hadoop
  - More than 100,000 CPUs in more than 40,000 computers
- LinkedIn
  - 5000 servers on Hadoop
- eBay
  - 532-node cluster (8 * 532 cores, 5.3PB)

Some links
- Google
  - MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat
  - Technical report
- Apache
  - Hadoop: The Definitive Guide
  - Book
- Microsoft
  - Google's MapReduce Programming Model Revisited
  - Technical report