STATS 700-002 Data Analysis using Python
Lecture 7: the MapReduce framework
Some slides adapted from C. Budak and R. Burns

Unit 3: parallel processing and big data
- The next few lectures will focus on big data and the MapReduce framework
- Today: overview of the MapReduce framework
- Next lectures: the Python package mrjob, which implements MapReduce; Apache Spark; and the Hadoop file system

The big data revolution
- Sloan Digital Sky Survey: https://www.sdss.org/ Generating so many images that most will never be looked at...
- Genomics data: https://en.wikipedia.org/wiki/Genome_project
- Web crawls: >20e9 webpages; ~400TB just to store the pages (without images, etc.)
- Social media data: Twitter: ~500e6 tweets per day; YouTube: >300 hours of content uploaded per minute (and that number is several years old, now)

Three aspects to big data
- Volume: data at the TB or PB scale. Requires new processing paradigms, e.g., distributed computing, the streaming model
- Velocity: data is generated at an unprecedented rate, e.g., web traffic data, Twitter, climate/weather data
- Variety: data comes in many different formats: databases, but also unstructured text, audio, video. Messy data requires different tools
This requires a very different approach to computing from what we were accustomed to prior to about 2005.

How to count all the books in the library?
(Photo: Peabody Library, Baltimore, MD, USA)

How to count all the books in the library?
- I'll count this side...
- ...you count this side...
- ...and then we add our counts together.
(Photo: Peabody Library, Baltimore, MD, USA)

Congratulations! You now understand the MapReduce framework!
Basic idea:
- Split up a task into independent subtasks
- Specify how to combine the results of the subtasks into your answer
"Independent subtasks" is a crucial point here: if you and I constantly have to share information, it is inefficient to split the task, because we'll spend more time communicating than actually counting.

MapReduce: the workhorse of big data
Hadoop, Google MapReduce, Spark, etc. are all based on this framework:
1) Specify a map operation to be applied to every element in a data set
2) Specify a reduce operation for combining the resulting list into an output
Then we split the data among a bunch of machines, and combine their results.

MapReduce isn't really new to you
You already know the Map pattern in Python: [f(x) for x in mylist]
...and the Reduce pattern: sum( [f(x) for x in mylist] )
Python also provides these as built-in functions (map and reduce), and SQL aggregation functions are like reduce operations.
The only thing that's new is the computing model.
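Concretely (a minimal sketch, with f(x) = 2x as in the schematic on the next slide):

```python
# The Map and Reduce patterns in ordinary Python.
from functools import reduce

def f(x):
    return 2 * x

mylist = [1, 2, 3, 4]

# Map pattern: apply f to every element independently.
mapped = [f(x) for x in mylist]   # [2, 4, 6, 8]
# equivalently: list(map(f, mylist))

# Reduce pattern: combine the mapped values into a single output.
total = sum(mapped)               # 20
# equivalently: reduce(lambda a, b: a + b, mapped)
```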

MapReduce, schematically, cartoonishly
Map: f(x) = 2x; Reduce: sum
[Diagram: a stream of input numbers is doubled element-by-element by the Map step, and the Reduce step sums the mapped values into a single output]
...but this hides the distributed computation.

Assumptions of MapReduce
- The task can be split into pieces
- Pieces can be processed in parallel...
- ...with minimal communication between processes
- The results of each piece can be combined to obtain the answer
Problems that have these properties are often described as being embarrassingly parallel: https://en.wikipedia.org/wiki/Embarrassingly_parallel
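To see why "minimal communication" matters, here is a hedged sketch of an embarrassingly parallel map-then-reduce using only Python's standard-library multiprocessing module (not part of MapReduce itself): each worker processes its own chunk with no coordination, and only the final sum brings the results together.

```python
# Minimal embarrassingly-parallel sketch with the standard library.
from multiprocessing import Pool

def f(x):
    # The map operation: independent on every element.
    return 2 * x

if __name__ == "__main__":
    data = range(1_000_000)
    with Pool(processes=4) as pool:
        # Workers never talk to each other; each doubles its own chunk.
        mapped = pool.map(f, data, chunksize=10_000)
    # The reduce: a single combination step over all partial results.
    print(sum(mapped))
```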

MapReduce, schematically (slightly more accurately)
Map: f(x) = 2x; Reduce: sum
[Diagram: the input numbers are split across Machine 1, Machine 2, ..., Machine M; each machine applies Map to its own share and runs Reduce on its own mapped values; a final Reduce (again) combines the per-machine results into the answer]

Less boring example: word counts
Suppose we have a giant collection of books, e.g., Google ngrams: https://books.google.com/ngrams/info ...and we want to count how many times each word appears in the collection.
Divide and Conquer!
1. Everyone takes a book, and makes a list of (word, count) pairs.
2. Combine the lists, adding the counts with the same word keys.
This still fits our framework, but it's a little more complicated... and it's just the kind of problem that MapReduce is designed to solve!
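Here is a minimal plain-Python sketch of this divide-and-conquer recipe, using the three toy documents that appear in the word-count diagrams later in this lecture:

```python
# Divide-and-conquer word count with collections.Counter.
from collections import Counter

books = [
    "cat dog bird cat rat dog cat",    # Document 1
    "dog dog dog cat rat bird",        # Document 2
    "rat bird rat bird rat bird goat", # Document 3
]

# 1. Everyone takes a book and makes (word, count) pairs.
per_book = [Counter(book.split()) for book in books]

# 2. Combine the lists, adding counts that share the same word key.
totals = sum(per_book, Counter())
print(totals)  # Counter({'dog': 5, 'bird': 5, 'rat': 5, 'cat': 4, 'goat': 1})
```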

Fundamental unit of MapReduce: (key, value) pairs
Examples:
- Linguistic data: <word, count>
- Enrollment data: <student, major>
- Climate data: <location, wind speed>
- Social media data: <person, list_of_friends>
Values can be more complicated objects in some environments, e.g., lists, dictionaries, other data structures. Apache Hadoop doesn't support this directly, but it can be made to work via some hacking; mrjob and Spark are a little more flexible.

A prototypical MapReduce program
1. Read records (i.e., pieces of data) from file(s)
2. Map: for each record, extract the information you care about, and output it in <key, value> pairs
3. Combine: sort and group the extracted <key, value> pairs based on their keys
4. Reduce: for each group, summarize, filter, group, aggregate, etc. to obtain some new value, v2, and output the <key, v2> pair as a row in the results file
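A minimal single-machine simulation of these four stages (word count again; mapper and reducer are illustrative names here, not any framework's API):

```python
# Simulating read -> map -> combine (sort/group) -> reduce on one machine.
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # 2. Map: emit a <word, 1> pair for every word in the record.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # 4. Reduce: summarize each group into a new value v2.
    yield (key, sum(values))

# 1. Read records (here, two lines of text).
records = ["cat dog bird cat rat dog cat", "dog dog dog cat rat bird"]
pairs = [kv for record in records for kv in mapper(record)]

# 3. Combine: sort and group the <key, value> pairs by key.
pairs.sort(key=itemgetter(0))
for key, group in groupby(pairs, key=itemgetter(0)):
    # Emit each <key, v2> row: ('bird', 2), ('cat', 4), ('dog', 5), ('rat', 2)
    for row in reducer(key, (value for _, value in group)):
        print(row)
```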

A prototypical MapReduce program
Input -> <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2'> -> reduce -> <k3,v3> -> Output
Note: this output could be made the input to another MR program. We call one of these input->map->combine->reduce->output chains a step. Hadoop/mrjob differs from Spark in how these steps are executed, a topic we'll discuss in our next two lectures.
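For example, the <word, count> rows produced by the sketch above could be fed into a second step that finds the most frequent word, a common MapReduce idiom (a hedged sketch, with that same toy output):

```python
# Step 2 of a two-step job: find the most frequent word.
step1_output = [("bird", 2), ("cat", 4), ("dog", 5), ("rat", 2)]

# Step 2 map: re-key every pair under one dummy key so that a
# single reducer sees all of them.
step2_pairs = [(None, (count, word)) for word, count in step1_output]

# Step 2 reduce: one group, one summary value.
values = [value for _key, value in step2_pairs]
print(max(values))  # (5, 'dog')
```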

MapReduce: vocabulary
- Cluster: a collection of devices (i.e., computers), networked to enable fast communication, typically for the purpose of distributed computing. Jobs are scheduled by a program like Sun/Oracle Grid Engine, Slurm, TORQUE or YARN: https://en.wikipedia.org/wiki/Job_scheduler
- Node: a single computing unit on a cluster. Roughly, computer == node, but one machine can host multiple nodes. Usually a piece of commodity (i.e., not specialized, inexpensive) hardware
- Step: a single map->combine->reduce chain. A step need not contain all three of map, combine and reduce. Note: some documentation refers to each of map, combine and reduce as steps
- Job: a sequence of one or more MapReduce steps

More terminology (useful for reading documentation)
- NUMA: non-uniform memory access. Local memory is much faster to access than memory elsewhere on the network: https://en.wikipedia.org/wiki/Non-uniform_memory_access
- Commodity hardware: inexpensive, mass-produced computing hardware, as opposed to expensive specialized machines. E.g., servers in a data center
- Hash function: a function that maps (arbitrary) objects to integers. Used in MapReduce to assign keys to nodes in the reduce step
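As a sketch of that last point, the snippet below assigns each key to one of NUM_REDUCERS reducer indices (NUM_REDUCERS is a made-up parameter here; a real framework chooses it and applies the hash for you). Python's built-in hash() is salted per process, so a deterministic digest is used instead, ensuring every machine sends a given key to the same reducer.

```python
# Hedged sketch: assigning keys to reducers with a hash function.
import hashlib

NUM_REDUCERS = 4  # hypothetical; the framework normally chooses this

def partition(key):
    # A deterministic digest, so all machines agree on the assignment
    # (unlike Python's built-in hash(), which is salted per process).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_REDUCERS

for word in ["cat", "dog", "bird", "rat", "goat"]:
    print(word, "-> reducer", partition(word))
```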

So MapReduce makes things much easier
Instead of having to worry about splitting the data, organizing communication between machines, etc., we only need to specify:
- Map
- Combine (optional)
- Reduce
and the Hadoop backend will handle everything else.
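As a preview of next lecture's package, here is the canonical word count written with mrjob: the mapper, combiner and reducer methods really are the only pieces we write, and mrjob/Hadoop handle the rest. (wordcount.py in the usage comment is just whatever filename you save this as.)

```python
# Canonical word-count job in mrjob (previewing next lecture).
# Run locally with:  python wordcount.py input.txt
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit one <word, 1> pair per word occurrence.
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Optional: pre-sum this mapper's counts before the shuffle.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Sum the (possibly pre-combined) counts for each word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```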

Counting words in MapReduce: version 1
[Diagram] Map: each document is turned into one <word, 1> pair per word occurrence:
- Document 1: cat dog bird cat rat dog cat
- Document 2: dog dog dog cat rat bird
- Document 3: rat bird rat bird rat bird goat
Reduce: all of the <word, 1> pairs are collected and the counts for each word are summed.
Output: cat: 4, dog: 5, bird: 5, rat: 5, goat: 1

Counting words in MapReduce: version 1
[Same diagram as the previous slide, with the map-to-reduce communication highlighted: lots of data moving around!]
Problem: this communication step is expensive!
Solution: use a combiner

Counting words in MapReduce: version 2
[Diagram] Map: as before, each document yields one <word, 1> pair per word occurrence. Combine: each machine sums its own pairs before sending anything:
- Document 1 combines to cat: 3, dog: 2, bird: 1, rat: 1
- Document 2 combines to dog: 3, cat: 1, rat: 1, bird: 1
- Document 3 combines to rat: 3, bird: 3, goat: 1
Reduce: the combined pairs are summed by key.
Output: cat: 4, dog: 5, bird: 5, rat: 5, goat: 1

Counting words in MapReduce: version 2
[Same diagram as the previous slide]
Problem: if there are lots of keys, the reduce step is going to be very slow.
Solution: parallelize the reduce step! Assign each machine its own set of keys.

Counting words in MapReduce: version 3
[Diagram] Map and Combine as in version 2, followed by a Shuffle step: the combined <word, count> pairs are routed so that all pairs with the same key arrive at the same reducer, and each reducer sums the counts for its own set of keys.
Output: cat: 4, dog: 5, bird: 5, rat: 5, goat: 1

Counting words in MapReduce: version 3
[Same diagram as the previous slide: the shuffle moves the same amount of information as before]
Note: this communication step is no more expensive than before, but we do now require multiple machines for the reduce step.

MapReduce: under the hood
An MR job consists of:
- A master job tracker or resource manager node
- A number of worker nodes
Resource manager: schedules and assigns tasks to workers; monitors workers and reschedules tasks if a worker node fails: https://en.wikipedia.org/wiki/Fault-tolerant_computer_system
Worker nodes: perform computations as directed by the resource manager, and communicate results to downstream nodes (e.g., Mapper -> Reducer)

Hadoop v2 YARN schematic
- The resource manager functions only as a scheduler.
- Note: a "manager" is a process (i.e., program) that runs on a node and controls the processing of data on that node.
- So everything except the allocation of tasks is performed at the worker nodes. Even much of the resource allocation is done by the worker nodes, via the ApplicationMaster.
Image credit: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/

Hadoop v2 YARN schematic
[Same schematic as the previous slide]
You do not have to commit any of this to memory, or even understand it all! The important point here is that Hadoop/YARN hides a whole bunch of complexity from you so that you don't have to worry about it!
Image credit: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/

Clarifying terminology
- MapReduce: a large-scale computing framework initially developed at Google, later open-sourced via the Apache Foundation as Hadoop MapReduce
- Apache Hadoop: a set of open-source tools from the Apache Foundation. Includes Hadoop MapReduce, Hadoop HDFS and Hadoop YARN
- Hadoop MapReduce: implements the MapReduce framework
- Hadoop YARN: resource manager that schedules Hadoop MapReduce jobs
- Hadoop Distributed File System (HDFS): distributed file system, designed for use with Hadoop MapReduce, running on the same commodity hardware that MapReduce runs on
Note that there are a host of other loosely related programs, such as Apache Hive, Pig, Mahout and HBase, most of which are designed to work atop HDFS.

Hadoop Distributed File System (HDFS)
Storage system for Hadoop:
- Distributed: the file system is spread across multiple nodes on the network, in contrast to, say, all of your files being on one computer
- Fault tolerant: multiple copies of each file are stored on different nodes, so if nodes fail, recovery is still possible
- High-throughput: many large files, accessible by multiple readers and writers simultaneously
Details: https://www.ibm.com/developerworks/library/wa-introhdfs/index.html

HDFS Schematic
[Diagram: File1 is split into chunks 1, 2, 3 and File2 into chunks 4, 5; each chunk is stored, with multiple copies, across several DataNodes]
The NameNode keeps track of where the file chunks are stored. The NameNode also ensures that changes to files are propagated correctly and helps recover from DataNode failures.

HDFS Schematic
[Same schematic as the previous slide]
Again, the important point is that HDFS does all the hard work so you don't have to!

Readings (this lecture)
Required:
- J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in Proceedings of the Sixth Symposium on Operating System Design and Implementation, 2004. https://research.google.com/archive/mapreduce.html
  This is the paper that originally introduced the MapReduce framework, and it's still, in my opinion, an excellent place to start. Don't worry too much about understanding every bit of the paper; it's written for computer systems engineers!
Recommended:
- "Introduction to HDFS" by J. Hanson: https://www.ibm.com/developerworks/library/wa-introhdfs/index.html

Readings (next lecture)
Required:
- mrjob "Fundamentals" and "Concepts": https://pythonhosted.org/mrjob/guides/quickstart.html and https://pythonhosted.org/mrjob/guides/concepts.html
- Hadoop wiki, how MapReduce operations are actually carried out: https://wiki.apache.org/hadoop/HadoopMapReduce
Recommended:
- Allen Downey's Think Python, Chapter 15 on objects (pages 143-149): http://www.greenteapress.com/thinkpython/thinkpython.pdf
- Classes and objects in Python: https://docs.python.org/2/tutorial/classes.html#a-first-look-at-classes