Map Reduce.

Map Reduce dacosta@irit.fr

Divide and conquer at PaaS
- 100 % //

Typical problem
- Iterate over a large number of records
- Extract something of interest from each (MAP)
- Shuffle and sort intermediate results
- Aggregate intermediate results (REDUCE)
- Generate final output
Key idea: functional abstraction for these two operations

Folding
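The two operations behind the key idea, mapping over records and folding the results, can be sketched with Python's built-in `map` and `functools.reduce`. This is only an illustration: the `records` list and the "leading number" extraction rule are assumptions made up for the example, not part of the slides.

```python
from functools import reduce

# Hypothetical input records (illustrative data only).
records = ["3 apples", "5 pears", "2 plums"]

# Map: extract something of interest from each record (here, the leading number).
quantities = list(map(lambda record: int(record.split()[0]), records))

# Fold: aggregate the intermediate results into one final output.
total = reduce(lambda acc, quantity: acc + quantity, quantities, 0)

print(quantities, total)
```

The map step touches each record independently, which is what makes it easy to distribute; the fold step is where results have to be brought together.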

Difficulties?
- Huge amount of data
  - Does not fit into memory
  - Access patterns are broad
  - Most data not accessed frequently
- Complex data
  - Links between data or treatment
  - Same data can be treated in different ways
  - No pre-processing
- Example: crawling through internet data

Principle
- "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes
- "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output

MapReduce
- Programmers specify two functions:
  - map (k, v) -> <k', v'>*
  - reduce (k', v') -> <k', v'>*
  - All values with the same key are reduced together
- Usually, programmers also specify:
  - partition (k', number of partitions) -> partition for k'
    - Often a simple hash of the key, e.g. hash(k') mod n
    - Allows reduce operations for different keys in parallel
  - combine (k', v') -> <k', v'>
    - Mini-reducers that run in memory after the map phase
    - Optimizes to reduce network traffic & disk writes
- Implementations:
  - Google has a proprietary implementation in C++
  - Hadoop is an open source implementation in Java
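As a rough sketch of the two optional functions, the following shows a hash-based partitioner and a summing combiner. The names `partition` and `combine` mirror the slide; the use of `zlib.crc32` as the hash is an assumption chosen because Python's built-in `hash` for strings changes between processes, and the sample `map_output` is made up.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_partitions: int) -> int:
    """Pick a partition as a stable hash of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    """Mini-reducer: sum the values per key in memory before sending anything."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical output of one map task.
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(map_output)
print(combined)
```

A summing combiner is safe here because addition is associative and commutative; a combiner for something like an average would need to carry partial sums and counts instead.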

Word count
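A minimal single-process sketch of word count, written in the map/reduce shape described earlier. The helper names (`map_fn`, `reduce_fn`, `word_count`) and the sample documents are assumptions for illustration; a real framework would run the map calls on many workers and perform the grouping itself.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (word, 1) for every word occurrence in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum all the counts received for one word."""
    return word, sum(counts)


def word_count(documents):
    # Stand-in for the shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_fn(doc_id, text):
            groups[word].append(one)
    return dict(reduce_fn(word, ones) for word, ones in sorted(groups.items()))


# Hypothetical documents.
docs = {"d1": "the quick brown fox", "d2": "the lazy dog", "d3": "the fox"}
counts = word_count(docs)
print(counts)
```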

Example: Average number of contracts by age
- For 1 million entries
- Map: treats batches of 1000 values
  - 1100 of them
  - Output of Map: range 8-110
- Reduce: batches of 1 Y
  - 102 of them
  - Output: 102
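One way to compute a per-age average in this style is to have each map batch emit partial (sum, count) pairs per age, so that the reducer can combine them correctly. The sketch below assumes each entry is an `(age, number_of_contracts)` tuple; the function names, the tiny data set and the batch size are all illustrative and do not reproduce the figures on the slide.

```python
from collections import defaultdict


def map_batch(batch):
    """Emit one (age, (sum, count)) pair per age seen in this batch."""
    partial = defaultdict(lambda: [0, 0])
    for age, contracts in batch:
        partial[age][0] += contracts
        partial[age][1] += 1
    return [(age, (total, count)) for age, (total, count) in partial.items()]


def reduce_age(age, partials):
    """Merge the partial sums and counts for one age, then divide."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return age, total / count


def average_by_age(entries, batch_size):
    grouped = defaultdict(list)
    for start in range(0, len(entries), batch_size):
        for age, partial in map_batch(entries[start:start + batch_size]):
            grouped[age].append(partial)
    return dict(reduce_age(age, partials) for age, partials in grouped.items())


# Hypothetical entries: (age, number of contracts).
entries = [(30, 2), (30, 4), (45, 1), (30, 3), (45, 3)]
averages = average_by_age(entries, batch_size=2)
print(averages)
```

Emitting averages directly from the map side would give wrong results, since an average of averages ignores how many entries each batch contained.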

MapReduce Runtime
- Handles scheduling
  - Assigns workers to map and reduce tasks
- Handles data distribution
  - Moves the process to the data
- Handles synchronization
  - Gathers, sorts, and shuffles intermediate data
- Handles faults
  - Detects worker failures and restarts
- Everything happens on top of a distributed FS (later)

How do we get data to the workers?
- Classical cluster vision
- What's the problem here?

Distributed File System
- Don't move data to workers... Move workers to the data!
  - Store data on the local disks for nodes in the cluster
  - Start up the workers on the node that has the data local
- Why?
  - Not enough RAM to hold all the data in memory
  - Disk access is slow, disk throughput is good
- A distributed file system is the answer
  - GFS (Google File System)
  - HDFS for Hadoop (= GFS clone)

GFS: Assumptions
- Commodity hardware over exotic hardware
- High component failure rates
  - Inexpensive commodity components fail all the time
- Modest number of HUGE files
- Files are write-once, mostly appended to
  - Perhaps concurrently
- Large streaming reads over random access
- High sustained throughput over low latency

GFS: Design Decisions
- Files stored as chunks
  - Fixed size (64MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large data sets, streaming reads
- Simplify the API
  - Push some of the issues onto the client
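With fixed-size chunks, a client can work out which chunks a read touches using plain integer arithmetic before asking the master where those chunks live. The sketch below only illustrates that arithmetic with the 64 MB size from the slide; the function names are hypothetical and this is not the GFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size from the slide (64 MB)


def chunk_index(byte_offset: int) -> int:
    """Index of the chunk that contains the given byte offset."""
    return byte_offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> range:
    """Chunk indices touched by a read of `length` bytes (length >= 1) at `offset`."""
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return range(first, last + 1)


# A 2-byte read that straddles the first chunk boundary touches chunks 0 and 1.
print(list(chunks_for_range(CHUNK_SIZE - 1, 2)))
```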

Grid Computing by the fathers of the Grid

Master's Responsibilities
- Metadata storage
- Namespace management/locking
- Periodic communication with chunkservers
- Chunk creation, replication, rebalancing
- Garbage collection

Example: Inverted Indexing

Architecture of IR Systems

How do we represent text?
- Bag of words
  - Treat all the words in a document as index terms for that document
  - Assign a weight to each term based on importance
  - Disregard order, structure, meaning, etc. of the words
  - Simple, yet effective!
- Assumptions
  - Term occurrence is independent
  - Document relevance is independent
  - Words are well-defined

Sample Document
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste....

Bag of Words
- 16 said
- 14 McDonalds
- 12 fat
- 11 fries
- 8 new
- 6 company, french, nutrition
- 5 food, oil, percent, reduce, taste, Tuesday
- ...
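A bag of words can be built by tokenizing the text and counting terms while ignoring order. The sketch below uses `collections.Counter` on the headline and the first clause of the article; the tokenization rule (lowercase runs of letters and apostrophes) is an assumption, and since the slide's counts come from the full article they will not match this excerpt.

```python
import re
from collections import Counter

# Short excerpt of the sample document (headline plus first clause).
excerpt = (
    "Fast-food chain to reduce certain types of fat in its french fries "
    "with new cooking oil. McDonald's Corp. is cutting the amount of bad fat "
    "in its french fries nearly in half"
)

# Assumed tokenization: lowercase, keep runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", excerpt.lower())

# The bag of words: term -> number of occurrences, order discarded.
bag = Counter(tokens)
print(bag.most_common(5))
```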

Representing Documents

Inverted Index

Boolean Retrieval
- To execute a Boolean query:
  - Build query syntax tree
  - For each clause, look up postings
  - Traverse postings and apply Boolean operator
- Efficiency analysis
  - Postings traversal is linear (assuming sorted postings)
  - Start with shortest posting first
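The linear traversal for an AND query can be sketched as a merge of two sorted postings lists, processing the shortest list first. The function names and the sample postings below are illustrative assumptions; the sketch also assumes at least one postings list is supplied and that document ids are sorted in ascending order.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in time linear in their lengths."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def boolean_and(postings_lists):
    """AND several postings lists, starting with the shortest one."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result


# Hypothetical postings (sorted document ids) for three query terms.
matches = boolean_and([[1, 2, 4, 8, 16], [2, 4, 6, 8], [4, 8, 12]])
print(matches)
```

Starting with the shortest list keeps the intermediate result small, so later intersections have less to scan.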

Term Weighting

MapReduce it?
- The indexing problem
  - Must be relatively fast, but need not be real time
  - For Web, incremental updates are important
  - Crawling is a challenge itself!
- The retrieval problem
  - Must have sub-second response
  - For Web, only need relatively few results

Indexing: Performance Analysis
- Fundamentally, a large sorting problem
  - Terms usually fit in memory
  - Postings usually don't
- How is it done on a single machine?
- How large is the inverted index?
  - Size of vocabulary
  - Size of postings

MapReduce: Index Construction
- Map over all documents
  - Emit term as key, (docid, tf) as value
  - Emit other information as necessary (e.g., term position)
- Reduce
  - Trivial: each value represents a posting!
  - Might want to sort the postings (e.g., by docid or tf)
- MapReduce does all the heavy lifting!
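The index construction above can be sketched in a single process as follows. The mapper emits `(term, (docid, tf))` and the reducer sorts the postings by docid; the grouping loop stands in for the framework's shuffle, and the function names and sample documents are assumptions for illustration.

```python
from collections import Counter, defaultdict


def map_fn(doc_id, text):
    """Emit the term as key and (docid, tf) as value."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def reduce_fn(term, postings):
    """Each value is already a posting; just sort them by docid."""
    return term, sorted(postings)


def build_index(documents):
    # Stand-in for the shuffle: group postings by term.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, posting in map_fn(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_fn(term, postings) for term, postings in grouped.items())


# Hypothetical documents keyed by docid.
docs = {1: "fat fries fat", 2: "new fries", 3: "new oil"}
index = build_index(docs)
print(index)
```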

Query Execution
- MapReduce is meant for large-data batch processing
  - Not suitable for lots of real time operations requiring low latency
- The solution: the secret sauce
  - Most likely involves document partitioning
  - Lots of system engineering: e.g., caching, load balancing, etc.

MapReduce Algorithm Design

Managing Dependencies
- Remember: Mappers run in isolation
  - You have no idea in what order the mappers run
  - You have no idea on what node the mappers run
  - You have no idea when each mapper finishes
- Tools for synchronization:
  - Ability to hold state in reducer across multiple key-value pairs
  - Sorting function for keys
  - Partitioner
  - Cleverly-constructed data structures

For the programmer
- Input reader
  - Reads data from stable storage and generates key/value pairs
- Map function
  - Takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs
- Partition function
  - Each Map function output is allocated to a particular reducer by the application's partition function
- Compare function
- Reduce function
  - The framework calls the application's Reduce function once for each unique key in the sorted order
- Output writer
  - Writes the output of the Reduce to the stable storage
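To show how these six pieces fit together, here is a toy single-process driver that takes each of them as a parameter. `run_job` and every argument name are hypothetical, and the compare function is modelled as a Python sort key; this does not mirror Hadoop's actual API.

```python
from collections import defaultdict


def run_job(read_input, map_fn, partition_fn, sort_key, reduce_fn, write_output,
            num_reducers=2):
    # Map phase: route every intermediate pair to one reducer partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in read_input():
        for out_key, out_value in map_fn(key, value):
            target = partition_fn(out_key, num_reducers)
            partitions[target][out_key].append(out_value)
    # Reduce phase: call reduce once per unique key, in sorted order.
    for groups in partitions:
        for out_key in sorted(groups, key=sort_key):
            write_output(reduce_fn(out_key, groups[out_key]))


# Hypothetical word-count job wired through the driver.
lines = ["b a", "a c a"]
results = []
run_job(
    read_input=lambda: enumerate(lines),
    map_fn=lambda _, line: ((word, 1) for word in line.split()),
    partition_fn=lambda key, n: ord(key[0]) % n,
    sort_key=lambda key: key,
    reduce_fn=lambda key, values: (key, sum(values)),
    write_output=results.append,
)
print(results)
```

Note that keys are sorted only within a partition, so the overall output order depends on the partition function.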

Input -> Map -> Copy/Sort -> Reduce -> Output

Use cases
- Word count, a little less naive!
- Before: 1 message per word in the text
- Here: 1 message per different word in the text
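The difference between the two variants can be sketched by comparing what a mapper emits for the same text. `naive_map` and `combining_map` are hypothetical names; the second accumulates counts locally inside the mapper before emitting anything.

```python
from collections import Counter


def naive_map(text):
    """One message per word occurrence in the text."""
    return [(word, 1) for word in text.split()]


def combining_map(text):
    """One message per different word, with the count accumulated locally."""
    return sorted(Counter(text.split()).items())


text = "to be or not to be"
naive = naive_map(text)
combined = combining_map(text)
print(len(naive), len(combined))
```

The saving grows with the amount of repetition inside each mapper's input, at the cost of holding a local table in memory.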

Co-occurrence
- Count the number of co-occurrences of n elements in sets
- Examples
  - Customers who buy this also buy that
  - Words appear in the same sentence
- If there are N elements
  - Report occurrence of NxN couples
- On a single node, quite simple:
    Foreach set
      Foreach i in set
        Foreach j in set
          Res[i][j]++
- Map Reduce version?

Pairs approach
- Easy and straightforward implementation
- Too many intermediary keys
- Optimize using local accumulation of counts of [i,j]
  - Easy optimization
  - Only a small improvement (large space)
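A minimal sketch of the pairs approach: the mapper emits one `((i, j), 1)` message per ordered pair in a set, and the reducer sums per pair. The function names and sample sets are assumptions, and this version skips the `i == j` case, which the single-node loop would also count.

```python
from collections import Counter
from itertools import permutations


def map_pairs(item_set):
    """Emit ((i, j), 1) for every ordered pair of distinct items in the set."""
    for i, j in permutations(sorted(item_set), 2):
        yield (i, j), 1


def count_pairs(sets):
    # Stand-in for shuffle plus a summing reducer keyed by the pair.
    counts = Counter()
    for item_set in sets:
        for pair, one in map_pairs(item_set):
            counts[pair] += one
    return counts


# Hypothetical baskets.
baskets = [{"a", "b"}, {"a", "b", "c"}]
pair_counts = count_pairs(baskets)
print(pair_counts)
```

Each pair becomes its own intermediate key, which is why the key space grows roughly with the square of the number of elements.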

Stripes Approach
- Faster, lower number of intermediate keys
- Can lead to memory problems
- More complex implementation
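A minimal sketch of the stripes approach: the mapper emits one message per element, whose value is a small table (the "stripe") of its co-occurring neighbours, and the reducer adds the stripes element-wise. The names and sample sets are assumptions for illustration, and `i == j` is skipped here.

```python
from collections import Counter, defaultdict


def map_stripes(item_set):
    """Emit (i, stripe) where the stripe counts every other item seen with i."""
    for i in item_set:
        yield i, Counter(j for j in item_set if j != i)


def reduce_stripes(stripes):
    """Merge all the stripes for one key by element-wise addition."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return total


def count_stripes(sets):
    # Stand-in for the shuffle: group stripes by their key.
    grouped = defaultdict(list)
    for item_set in sets:
        for i, stripe in map_stripes(item_set):
            grouped[i].append(stripe)
    return {i: reduce_stripes(stripes) for i, stripes in grouped.items()}


# Hypothetical baskets.
baskets = [{"a", "b"}, {"a", "b", "c"}]
stripe_counts = count_stripes(baskets)
print(stripe_counts)
```

There is only one intermediate key per element, but each stripe must fit in memory, which is where the memory problems mentioned on the slide can come from.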

Other examples
- Grep
  - 10^10 100-byte records
  - Seek a rare 3-letter word
  - 1800 machines
  - Peak performance: 30 GB/s with 1764 workers
  - 150 s
    - 1 minute startup
- Sort
  - Same environment and dataset
  - 50 lines of code
  - 891 seconds

Characteristics
- Manages failure well
  - Just send the keys again
- Heavy on the file system
  - Needs a dedicated and adapted filesystem
- Scales well
  - In terms of data, workflow
- Easy to use
  - Some translation tools from SQL are available
  - Middleware manages data and computing locality

Some users
- Google
  - They normalized it
  - They use it internally for:
    - large-scale machine learning problems,
    - clustering problems for the Google News and Froogle products,
    - extracting data to produce reports of popular queries (e.g. Google Zeitgeist and Google Trends),
    - extracting properties of Web pages for new experiments and products (e.g. extraction of geographical locations from a large corpus of Web pages for localized search),
    - processing of satellite imagery data,
    - language model processing for statistical machine translation,
    - and large-scale graph computations.

Other users
- Facebook
  - Hadoop
  - Now use Corona (own implementation)
- Yahoo
  - Hadoop
  - More than 100,000 CPUs in more than 40,000 computers
- LinkedIn
  - 5000 servers on Hadoop
- eBay
  - 532-node cluster (8 * 532 cores, 5.3PB)

Some links
- Google
  - MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat
  - Technical report
- Apache
  - Hadoop: The Definitive Guide
  - Book
- Microsoft
  - Google's MapReduce Programming Model Revisited
  - Technical report