Overview: Why MapReduce? What is MapReduce? The Hadoop Distributed File System. Cloudera, Inc.


MapReduce and HDFS. This presentation includes course content from the University of Washington, redistributed under the Creative Commons Attribution 3.0 license. All other contents: Cloudera, Inc.

Overview: Why MapReduce? What is MapReduce? The Hadoop Distributed File System.

How MapReduce is Structured: functional programming meets distributed computing. It is a batch data processing system that factors many reliability concerns out of application logic.

MapReduce provides: automatic parallelization and distribution; fault tolerance; status and monitoring tools; and a clean abstraction for programmers.

Programming Model: borrows from functional programming. Users implement an interface of two functions: map(in_key, in_value) -> (out_key, intermediate_value) list, and reduce(out_key, intermediate_value list) -> out_value list.
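As a concrete illustration, here is a minimal single-process Python sketch of that two-function contract (the run_mapreduce harness is ours for illustration, not part of Hadoop); a real MapReduce implementation distributes the same steps across a cluster:

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (in_key, in_value) record may yield
    # zero or more (out_key, intermediate_value) pairs.
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, inter_value in map_fn(in_key, in_value):
            groups[out_key].append(inter_value)
    # The shuffle is implicit in the grouping above.
    # Reduce phase: each out_key sees its full value list and
    # may yield one or more final values.
    results = []
    for out_key, inter_values in groups.items():
        for out_value in reduce_fn(out_key, inter_values):
            results.append((out_key, out_value))
    return results

The later examples in this deck can be run against this harness unchanged.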

map: Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g., (filename, line). map() produces zero or more (output key, intermediate value) pairs from each input.

map(in_key, in_value) -> (out_key, intermediate_value) list

Example: Upper-case Mapper. let map(k, v) = emit(k.toUpper(), v.toUpper()). ("foo", "bar") -> ("FOO", "BAR"); ("Foo", "other") -> ("FOO", "OTHER"); ("key2", "data") -> ("KEY2", "DATA").

Example: Explode Mapper. let map(k, v) = foreach char c in v: emit(k, c). ("A", "cats") -> ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s"); ("B", "hi") -> ("B", "h"), ("B", "i").

Example: Filter Mapper. let map(k, v) = if (isPrime(v)) then emit(k, v). ("foo", 7) -> ("foo", 7); ("test", 10) -> (nothing).

Example: Changing Keyspaces. let map(k, v) = emit(v.length(), v). ("hi", "test") -> (4, "test"); ("x", "quux") -> (4, "quux"); ("y", "abracadabra") -> (11, "abracadabra").
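For concreteness, the four example mappers above can be written as Python generators against the run_mapreduce sketch (again, illustrative code, not the Hadoop API):

def upper_case_map(k, v):      # Upper-case Mapper
    yield (k.upper(), v.upper())

def explode_map(k, v):         # Explode Mapper
    for c in v:
        yield (k, c)

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def filter_map(k, v):          # Filter Mapper
    if is_prime(v):
        yield (k, v)

def key_length_map(k, v):      # Changing Keyspaces
    yield (len(v), v)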

reduce: After the map phase is over, all the intermediate values for a given output key are combined into a list. reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

reduce(out_key, intermediate_value list) -> out_value list

Example: Sum Reducer. let reduce(k, vals) = { sum = 0; foreach int v in vals: sum += v; emit(k, sum) }. ("A", [42, 100, 312]) -> ("A", 454); ("B", [12, 6, -2]) -> ("B", 16).

Example: Identity Reducer. let reduce(k, vals) = foreach v in vals: emit(k, v). ("A", [42, 100, 312]) -> ("A", 42), ("A", 100), ("A", 312); ("B", [12, 6, -2]) -> ("B", 12), ("B", 6), ("B", -2).
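The two reducers above, in the same illustrative Python style:

def sum_reduce(k, vals):       # Sum Reducer
    yield sum(vals)

def identity_reduce(k, vals):  # Identity Reducer
    for v in vals:
        yield v

# e.g. run_mapreduce([("A", 42), ("A", 100), ("A", 312)],
#                    lambda k, v: [(k, v)], sum_reduce)
# returns [("A", 454)]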

Parallelism: map() functions run in parallel, creating different intermediate values from different input data sets. reduce() functions also run in parallel, each working on a different output key. All values are processed independently. Bottleneck: the reduce phase can't start until the map phase is completely finished.
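That independence is easy to see in code; a hedged sketch using a Python thread pool (an illustration of the barrier only, not how Hadoop actually schedules tasks):

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def parallel_map_phase(records, map_fn, workers=4):
    # map() calls are independent, so they may run concurrently
    # and finish in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(lambda rec: list(map_fn(*rec)), records)
        # Implicit barrier: grouping (the shuffle) cannot complete,
        # and hence reduce cannot start, until every map task is done.
        groups = defaultdict(list)
        for pairs in mapped:
            for out_key, inter_value in pairs:
                groups[out_key].append(inter_value)
    return groups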

Example: Count word occurrences.

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    emit(w, 1);

reduce(String output_key, Iterator<int> intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += v;
  emit(output_key, result);
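The same word count runs end to end against the earlier run_mapreduce sketch (illustrative Python, with a naive whitespace tokenizer):

def wc_map(doc_name, contents):
    for word in contents.split():
        yield (word, 1)

def wc_reduce(word, counts):
    yield sum(counts)

docs = [("doc1", "to be or not to be"), ("doc2", "to do")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('to', 3), ('be', 2), ('or', 1), ('not', 1), ('do', 1)]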

Combining Phase: runs on mapper nodes after the map phase. It is a mini-reduce, applied only to local map output, used to save bandwidth before sending data to the full reducer. The reducer can serve as the combiner if its operation is commutative and associative, e.g., SumReducer.
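A hedged sketch of where the combiner slots in (extending the illustrative harness above; Hadoop wires this up for you when you set a combiner class):

def map_with_combiner(record, map_fn, combine_fn):
    # Run map over one input split, then mini-reduce the local
    # output before anything crosses the network.
    local = defaultdict(list)
    for out_key, inter_value in map_fn(*record):
        local[out_key].append(inter_value)
    combined = []
    for out_key, inter_values in local.items():
        for v in combine_fn(out_key, inter_values):
            combined.append((out_key, v))
    return combined   # e.g. one ("to", 3) instead of three ("to", 1)s

Because the combiner may run zero, one, or many times on a mapper's output, it must not change the answer: hence the commutative/associative requirement.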

Combiner, graphically: (diagram of the combiner running over local map output before the shuffle; not reproduced in this transcription)

WordCount Redux: the same map and reduce as above, revisited in light of the combining phase. Because addition is commutative and associative, the sum reducer can be reused unchanged as the combiner.

MapReduce Conclusions: MapReduce has proven to be a useful abstraction in many areas. It greatly simplifies large-scale computations and shows that the functional programming paradigm can be applied to large-scale applications. You focus on the real problem; the library deals with the messy details.

HDFS. Some slides designed by Alex Moschuk, University of Washington; redistributed under the Creative Commons Attribution 3.0 license.

HDFS: Motivation. Based on Google's GFS: redundant storage of massive amounts of data on cheap and unreliable computers. Why not use an existing file system? Different workload and design priorities; HDFS handles much bigger dataset sizes than other filesystems.

Assumptions: high component failure rates (inexpensive commodity components fail all the time); a modest number of HUGE files (just a few million, each 100MB or larger; multi-GB files are typical); files are write-once and mostly appended to, perhaps concurrently; large streaming reads; high sustained throughput favored over low latency.

HDFS Design Decisions: files are stored as blocks, much larger than in most filesystems (the default is 64MB). Reliability through replication: each block is replicated across 3+ DataNodes. A single master (the NameNode) coordinates access and metadata, giving simple centralized management. No data caching: there is little benefit given large data sets and streaming reads. A familiar interface, but a customized API: simplify the problem and focus on distributed apps.
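In Hadoop these decisions surface as configuration; a hedged hdfs-site.xml sketch (property names from the Hadoop 1.x era this deck describes; later releases spell the block size dfs.blocksize and default to 128MB):

<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>  <!-- 64MB block size, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- replicas of each block across DataNodes -->
  </property>
</configuration>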

HDFS Client Block Diagram: (diagram not reproduced in this transcription)

Based on the GFS architecture; figure from "The Google File System", Ghemawat et al., SOSP 2003.

Metadata: a single NameNode stores all metadata (filenames and the DataNode locations of each file's blocks), maintained entirely in RAM for fast lookup. DataNodes store opaque file contents in block objects on the underlying local filesystem.
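You can observe this split in practice: the standard fsck tool asks the NameNode for a file's metadata and prints its block list and the DataNodes holding each replica (the path below is hypothetical):

$ hadoop fsck /user/alice/dataset.txt -files -blocks -locations

No DataNode has to be read to answer this; only the NameNode's in-RAM metadata is consulted.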

HDFS Conclusions: HDFS supports large-scale processing workloads on commodity hardware. It is designed to tolerate frequent component failures and optimized for huge files that are mostly appended to and read. The filesystem interface is customized for the job but retains familiarity for developers, and simple solutions can work (e.g., a single master). It reliably stores several TB in individual clusters.