MapReduce, Apache Hadoop

Similar documents
MapReduce, Apache Hadoop

MI-PDB, MIE-PDB: Advanced Database Systems

MapReduce, Apache Hadoop

Big Data Management and NoSQL Databases

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Distributed Systems 16. Distributed File Systems II

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Hadoop. copyright 2011 Trainologic LTD

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Lecture 11 Hadoop & Spark

Document Databases: MongoDB

MapReduce. Cloud Computing COMP / ECPE 293A

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Hadoop/MapReduce Computing Paradigm

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

Hadoop An Overview. - Socrates CCDH

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Today s Lecture. CS 61C: Great Ideas in Computer Architecture (Machine Structures) Map Reduce

HADOOP FRAMEWORK FOR BIG DATA

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

A BigData Tour HDFS, Ceph and MapReduce

Distributed Filesystem

HDFS Architecture Guide

Cloudera Exam CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Version: 7.5 [ Total Questions: 97 ]

CS 378 Big Data Programming

Clustering Lecture 8: MapReduce

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Introduction to MapReduce (cont.)

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis

Chapter 5. The MapReduce Programming Model and Implementation

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

CS 345A Data Mining. MapReduce

Introduction to Hadoop and MapReduce

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

MapReduce, Hadoop and Spark. Bompotas Agorakis

Introduction to MapReduce

HBase... And Lewis Carroll! Twi:er,

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

A brief history on Hadoop

Introduction to Data Management CSE 344

Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391

The amount of data increases every day Some numbers ( 2012):

Introduction to MapReduce

Database Applications (15-415)

2/26/2017. The amount of data increases every day Some numbers ( 2012):

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Map- reduce programming paradigm

MapReduce. U of Toronto, 2014

CISC 7610 Lecture 2b The beginnings of NoSQL

BigData and Map Reduce VITMAC03

50 Must Read Hadoop Interview Questions & Answers

MapReduce. Tom Anderson

Distributed File Systems II

Big Data Hadoop Course Content

Distributed Systems. CS422/522 Lecture17 17 November 2014

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System

Agenda. Request- Level Parallelism. Agenda. Anatomy of a Web Search. Google Query- Serving Architecture 9/20/10

CS370 Operating Systems

Hadoop and HDFS Overview. Madhu Ankam

Chase Wu New Jersey Institute of Technology

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

Op#mizing MapReduce for Highly- Distributed Environments

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Expert Lecture plan proposal Hadoop& itsapplication

Big Data for Engineers Spring Resource Management

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Distributed Systems CS6421

The MapReduce Abstraction

Map Reduce Group Meeting

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

Architecture of So-ware Systems Massively Distributed Architectures Reliability, Failover and failures. Mar>n Rehák

Bioinforma)cs Resources - NoSQL -

CA485 Ray Walshe Google File System

PaaS and Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

CS427 Multicore Architecture and Parallel Computing

Introduction to Map Reduce

Parallel Computing: MapReduce Jin, Hai

Big Data 7. Resource Management

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler

Programming Systems for Big Data

Top 25 Hadoop Admin Interview Questions and Answers

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Introduction to the Hadoop Ecosystem - 1

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

Transcription:

NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/ svoboda/courses/2016-1-ndbi040/ Lecture 2 MapReduce, Apache Hadoop Marn Svoboda svoboda@ksi.mff.cuni.cz 11. 10. 2016 Charles University in Prague, Faculty of Mathemacs and Physics Czech Technical University in Prague, Faculty of Electrical Engineering

Lecture Outline MapReduce Programming model and implementaon Movaon, principles, details, Apache Hadoop HDFS Hadoop Distributed File System MapReduce NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 2

Programming Models What is a programming model? Abstracon of an underlying computer system Describes a logical view of the provided funconality Offers a public interface, resources or other constructs Allows for the expression of algorithms and data structures Conceals physical reality of the internal implementaon Allows us to work at a (much) higher level of abstracon The point is how the intended user thinks in order to solve their tasks and not necessarily how the system actually works NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 3

Programming Models Examples Tradional von Neumann model Architecture of a physical computer with several components such as a central processing unit (CPU), arithmec-logic unit (ALU), processor registers, program counter, memory unit, etc. Execuon of a stream of instrucons Java Virtual Machine (JVM) Do not confuse programming models with Programming paradigms (procedural, funconal, logic, modular, object-oriented, recursive, generic, data-driven, parallel, ) Programming languages (Java, C++, ) NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 4

Programming Models Parallel Programming Models Process interacon Mechanisms of mutual communicaon of parallel processes Shared memory shared global address space, asynchronous read and write access, synchronizaon primives Message passing Implicit interacon Problem decomposion Ways of problem decomposion into tasks executed in parallel Task parallelism Data parallelism independent tasks on disjoint parons of data Implicit parallelism NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 5

MapReduce Framework What is MapReduce? Programming model + implementaon Developed by Google in 2008 Google: A simple and powerful interface that enables automac parallelizaon and distribuon of large-scale computaons, combined with an implementaon of this interface that achieves high performance on large clusters of commodity PCs. NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 6

MapReduce Framework MapReduce programming model Cluster of commodity personal computers (nodes) Each running a host operang system, mutually interconnected within a network, communicaon based on IP addresses, Data is distributed among the nodes Computaon tasks executed in parallel across the nodes Classificaon Process interacon: message passing Problem decomposion: data parallelism NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 7

MapReduce Framework A bit of history and movaon Google PageRank problem (2003) How to rank tens of billions of web pages by their importance efficiently in a reasonable amount of me when data is scattered across thousands of computers data files can be enormous (terabytes or more) data files are updated only occasionally (just appended) sending the data between compute nodes is expensive hardware failures are rule rather than excepon Centralized index structure was no longer sufficient Soluon Google File System a distributed file system MapReduce a programming model NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 8

MapReduce Model Basic Idea Divide-and-conquer paradigm Map funcon Breaks down a problem into sub-problems Processes input data in order to generate a set of intermediate key-value pairs Reduce funcon Receives and combines sub-soluons to solve the problem Processes and possibly reduces intermediate values associated with the same intermediate key And that s all! NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 9

MapReduce Model Basic Idea And that s all! It means... We only need to implement Map and Reduce funcons Everything else such as input data distribuon, scheduling of execuon tasks, monitoring of computaon progress, inter-machine communicaon, handling of machine failures, is managed automatically by the framework! NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 10

MapReduce Model A bit more formally Map funcon Input: a key-value pair Output: a set of intermediate key-value pairs Usually from a different domain Keys do not have to be unique (, ) (, ) Reduce funcon Input: an intermediate key + a set of values for this key Output: a possibly smaller set of values for this key From the same domain (, ( )) (, ( )) NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 11

Example: Word Frequency Implementaon NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 12

Example: Word Frequency Execuon Phases NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 13

Execuon: Phases Spling Input key-value pairs (documents) are parsed and prepared Mapping Map funcon is executed for each input document Intermediate key-value pairs are emied Shuffling Intermediate key-value pairs are grouped and sorted according to the keys Reducing Reduce funcon is executed for each intermediate key Final output is generated NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 14

Execuon: Schema NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 15

Execuon: Components Input reader Reads data from a stable storage (e.g. a distributed file system) Splits the data into appropriate size blocks (splits) Parses these blocks and prepares input key-value pairs Map funcon Paron funcon Determines Reduce task for an intermediate key-value pair E.g. hash of the key modulo the overall number of reducers Compare funcon Compares two intermediate keys, used during the shuffling Reduce funcon Output writer Writes the output of the Reduce funcon to stable storage NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 16

Addional Aspects Combine funcon Analogous purpose and implementaon to the Reduce funcon Objecve Decrease the amount of intermediate data i.e. decrease the amount of data transferred to the reducer Executed locally by the mapper before the shuffling phase Only works for commutave and associave funcons! NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 17

Addional Aspects Counters Allow to track the progress of a MapReduce job in real me Predefined counters E.g. numbers of launched Map / Reduce tasks, parsed input key-value pairs Custom counters (user-defined) Can be associated with any acon that a Map or Reduce funcon does NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 18

Addional Aspects Fault tolerance When a large number of nodes process a large number of data fault tolerance is necessary Worker failure Master periodically pings every worker; if no response is received in a certain amount of me, master marks the worker as failed All its tasks are reset back to their inial idle state and become eligible for rescheduling on other workers Master failure Strategy A periodic checkpoints are created; if master fails, a new copy can then be started Strategy B master failure is considered to be highly unlikely; users simply resubmit unsuccessful jobs NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 19

Addional Aspects Stragglers Straggler = node that takes unusually long me to complete a task it was assigned Soluon When a MapReduce job is close to compleon, the master schedules backup execuons of the remaining in-progress tasks A given task is considered to be completed whenever either the primary or the backup execuon completes NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 20

Addional Aspects Task granularity Intended numbers of Map and Reduce tasks Praccal recommendaon (Google) Map tasks Choose the number so that each individual Map task has roughly 16 64 MB of input data Reduce tasks Small mulple of the number of worker nodes we expect to use Note also that the output of each Reduce task ends up in a separate output file NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 21

Further Examples URL access frequency Input: HTTP server access logs Map: parses a log, emits (accessed URL, 1) pairs Reduce: computes and emits the sum of the associated values Output: overall number of accesses to a given URL Inverted index Input: text documents containing words Map: parses a document, emits (word, document ID) pairs Reduce: emits all the associated document IDs sorted Output: list of documents containing a given word NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 22

Further Examples Distributed sort Input: records to be sorted according to a specific key Map: extracts the sorng key, emits (key, record) pairs Reduce: emits the associated records unchanged Reverse web-link graph Input: web pages with tags Map: emits (target URL, this URL) pairs Reduce: emits the associated source URLs unchanged Output: list of URLs of web pages targeng a given one NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 23

Further Examples Sources of links between web pages NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 24

Use Cases: General Paerns Counng, summing, aggregaon When the overall number of occurrences of certain items or a different aggregate funcon should be calculated Collang, grouping When all items belonging to a certain group should be found, collected together or processed in another way Filtering, querying, parsing, validaon When all items sasfying a certain condion should be found, transformed or processed in another way Sorng When items should be processed in a parcular order with respect to a certain ordering criterion NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 25

Use Cases: Real-World Problems Just a few real-world examples Risk modeling, customer churn Recommendaon engine, customer preferences Adversement targeng, trade surveillance Fraudulent acvity threats, security breaches detecon Hardware or sensor network failure predicon Search quality analysis Source: hp://www.cloudera.com/ NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 26

Apache Hadoop

Apache Hadoop Open-source soware framework hp://hadoop.apache.org/ Distributed storage and distributed processing of very large data sets on clusters built from commodity hardware Implements a distributed file system Implements MapReduce Derived from the original Google MapReduce and GFS Developed by Apache Soware Foundaon Implemented in Java Operang system: cross-plaorm Inial release in 2011 NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 28

Apache Hadoop Modules Hadoop Common Common ulies and support for other modules Hadoop Distributed File System (HDFS) High-throughput distributed file system Hadoop Yet Another Resource Negoator (YARN) Cluster resource management Job scheduling framework Hadoop MapReduce YARN-based implementaon of the MapReduce model NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 29

Apache Hadoop Hadoop-related projects Apache Cassandra wide column store Apache HBase wide column store Apache Hive data warehouse infrastructure Apache Avro data serializaon system Apache Chukwa data collecon system Apache Mahout machine learning and data mining library Apache Pig framework for parallel computaon and analysis Apache ZooKeeper coordinaon of distributed applicaons NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 30

Apache Hadoop Real-world Hadoop users Facebook internal logs, analycs, machine learning, 2 clusters: 1100 nodes (8 cores, 12 TB storage), 12 PB 300 nodes (8 cores, 12 TB storage), 3 PB LinkedIn 3 clusters: 800 nodes (2 4 cores, 24 GB RAM, 6 2 TB SATA), 9 PB 1900 nodes (2 6 cores, 24 GB RAM, 6 2 TB SATA), 22 PB 1400 nodes (2 6 cores, 32 GB RAM, 6 2 TB SATA), 16 PB Spofy content generaon, data aggregaon, reporng, analysis: 1650 nodes, 43000 cores, 70 TB RAM, 65 PB, 20000 daily jobs Yahoo! 40000 nodes with Hadoop, biggest cluster: 4500 nodes (2 4 cores, 16 GB RAM, 4 1 TB storage), 17 PB Source: hp://wiki.apache.org/hadoop/poweredby NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 31

HDFS Hadoop Distributed File System Open-source, high quality, cross-plaorm, pure Java Highly scalable, high-throughput, fault-tolerant Master-slave architecture Opmal applicaons MapReduce, web crawlers, data warehouses, NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 32

HDFS: Assumpons Data characteriscs Large data sets and files Streaming data access Batch processing rather than interacve users Write-once, read-many Fault tolerance HDFS cluster may consist of thousands of nodes Each component has a non-trivial probability of failure there is always some component that is non-funconal I.e. failure is the norm rather than excepon, and so automac failure detecon and recovery is essenal NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 33

HDFS: File System Logical view: Linux-based hierarchical file system Directories and files Contents of files is divided into blocks Usually 64 MB, configurable per file level User and group permissions Standard operaons are provided Create, remove, move, rename, copy, Namespace Contains names of all directories, files, and other metadata I.e. all data to capture the whole logical view of the file system Just a single namespace for the enre cluster NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 34

HDFS: Cluster Architecture Master-slave architecture Master: NameNode Manages the file system namespace Provides the user interface for all the operaons Create, remove, move, rename, copy, file or directory Open and close file Regulates access to files by users Manages file blocks (mapping of logical to physical blocks) Slave: DataNode Physically stores file blocks within the underlying file system Serves read/write requests from users I.e. user data never flows through the NameNode Has no knowledge about the file system NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 35

HDFS: Replicaon Replicaon = maintaining of mulple copies of each file block Increases read throughput, increases fault tolerance Replicaon factor (number of copies) Configurable per file level, usually 3 Replica placement Crical to reliability and performance Rack-aware strategy Takes the physical locaon of nodes into account Network bandwidth between the nodes on the same rack is greater than between those in different racks Common case (replicaon factor 3): Two replicas on two different nodes in a local rack Third replica on a node in a different rack NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 36

HDFS: NameNode How the NameNode Works? FsImage data structure describing the whole file system Contains: namespace + mapping of blocks + system properes Loaded into the system memory (4 GB RAM is sufficient) Stored in the local file system, periodical checkpoints created EditLog transacon log for all the metadata changes E.g. when a new file is created, replicaon factor is changed, Stored in the local file system Failures When the NameNode starts up FsImage and EditLog are read from the disk, transacons from EditLog are applied, new version of FsImage is flushed on the disk, EditLog is truncated NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 37

HDFS: DataNode How each DataNode Works? Stores physical file blocks Each block (replica) is stored as a separate local file Heuriscs are used to place these files in local directories Periodically sends HeartBeat messages to the NameNode Failures When a DataNode fails or in case of network paron, i.e. when the NameNode does not receive a HeartBeat message within a given me limit The NameNode no longer sends read/write requests to this node, re-replicaon might be iniated When a DataNode starts up Generates a list of all its blocks and sends a BlockReport message to the NameNode NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 38

HDFS: API Available applicaon interfaces Java API Python access or C wrapper also available HTTP interface Browsing the namespace and downloading the contents of files FS Shell command line interface Intended for the user interacon Bash-inspired commands E.g.: NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 39

Hadoop MapReduce Hadoop MapReduce MapReduce programming model implementaon Requirements HDFS Input and output files for MapReduce jobs YARN Underlying distribuon, coordinaon, monitoring and gathering of the results NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 40

Cluster Architecture Master-slave architecture Master: JobTracker Provides the user interface for MapReduce jobs Fetches input file data locaons from the NameNode Manages the enre execuon of jobs Provides the progress informaon Schedules individual tasks to idle TaskTrackers Map, Reduce, tasks Nodes close to the data are preferred Failed tasks or stragglers can be rescheduled Slave: TaskTracker Accepts tasks from the JobTracker Spawns a separate JVM for each task execuon Indicates the available task slots via HearBeat messages NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 41

Execuon Schema NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 42

Java Interface Mapper class Implementaon of the map funcon Template parameters, types of input key-value pairs, types of intermediate key-value pairs Intermediate pairs are emied via NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 43

Java Interface Reducer class Implementaon of the reduce funcon Template parameters, types of intermediate key-value pairs, types of output key-value pairs Output pairs are emied via NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 44

Example Word Frequency Input: Documents with words Files located at HDFS directory Map: parses a document, emits (word, 1) pairs Reduce: computes and emits the sum of the associated values Output: overall number of occurrences of a given word Output will be wrien to MapReduce job execuon NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 45

Example: Mapper Class NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 46

Example: Reducer Class NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 47

Conslusion MapReduce cricism MapReduce is a step backwards Does not use database schema Does not use index structures Does not support advanced query languages Does not support transacons, integrity constraints, views, Does not support data mining, business intelligence, MapReduce is not novel Ideas more than 20 years old and overcome Message Passing Interface (MPI), Reduce-Scaer The end of MapReduce? NDBI040: Big Data Management and NoSQL Databases Lecture 2: MapReduce, Apache Hadoop 11. 10. 2016 48