Distributed Systems. CS422/522 Lecture 17, 17 November 2014


Lecture Outline Introduction Hadoop Chord

What's a distributed system?
- A distributed system is a collection of loosely coupled nodes interconnected by a communication network.
- In a distributed system, users access remote resources in the same way they access local resources.

Why distributed systems?

Scalability: PC, Server, Cluster
- What if one computer is not enough?
  - Buy a bigger (server-class) computer
- What if the biggest computer is not enough?
  - Buy many computers

Scalability: Rack
- Network switches (connect nodes with each other and with other racks)
- Many nodes/blades (often identical)
- Storage device(s)

Scalability: PC, Server, Cluster, Data center
- What if the cluster is too big to fit into a machine room?
  - Build a separate building for the cluster
  - The building can have lots of cooling and power
  - Result: Data center

Google Data Center in Oregon
- Data centers (size of a football field)
- A warehouse-sized computer
  - A single data center can easily contain 10,000 racks with 100 cores in each rack (1,000,000 cores total)

Google Data Centers Worldwide

Why distributed systems?
- Resource sharing
  - E.g., Napster
- Computation speedup
  - E.g., Hadoop
- Reliability
  - E.g., Amazon S3
Issues?

Lecture Outline Introduction Hadoop Chord


What's Hadoop?
- A project based on GFS (the Google File System) and MapReduce
  - Distributes data across machines
  - Tries to achieve reliability and scalability
- An open-source software framework for big data
  - Distributed storage
  - Distributed processing

Why Hadoop?

Hadoop Components
Two core components:
- The Hadoop distributed file system (HDFS)
  - NameNode: holds the metadata
  - DataNode1 to DataNode5: hold the data, split into blocks 1, 2, and 3, with each block stored on three DataNodes
- MapReduce software framework
  - JobTracker
  - TaskTrackers (five, one drawn above each DataNode)

HDFS
- HDFS is responsible for storing data
  - Data is split into blocks and distributed across nodes
  - Each block is replicated
- Implementation of HDFS:
  - Based on Google's GFS
  - Offers redundant storage for massive amounts of data

Getting Data in/out of HDFS
- Hadoop API:
  - Use hadoop fs to work with data in HDFS
  - hadoop fs -copyFromLocal local_dir /hdfs_dir
  - hadoop fs -copyToLocal /hdfs_dir local_dir

HDFS Example
- NameNode: stores metadata only
  - The NameNode holds metadata for the data files:
    /usr/data/foo -> 1, 2, 4
    /usr/data/bar -> 3, 5
- DataNodes: store blocks
  - DataNodes hold the actual blocks
  - Each block is replicated three times
- When a client wants to read a file:
  - The client asks the NameNode for the file (bar?)
  - The NameNode replies with < DataNode IPs, 3, 5 >
  - The client reads blocks 3 and 5 from the DataNodes
[Figure: blocks 1 to 5 spread over the DataNodes, three replicas of each.]
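The read path in this example can be sketched in a few lines of Python. This is a toy in-memory model, not the Hadoop client API: the class names `NameNode` and `DataNode`, the node names `dn1` to `dn5`, the block placement, and the block contents are all assumptions made for illustration. Only the file-to-block metadata comes from the slide.

```python
# Toy model of an HDFS read: metadata lookup at the NameNode, then block
# reads from DataNodes. All names and the placement are illustrative.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}              # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    def __init__(self):
        self.files = {}               # path -> ordered list of block ids
        self.locations = {}           # block id -> DataNodes holding a replica

    def lookup(self, path):
        """Return [(block_id, replicas)] for a file. No file data is returned."""
        return [(b, self.locations[b]) for b in self.files[path]]


def read_file(namenode, path):
    data = b""
    for block_id, replicas in namenode.lookup(path):
        # Any replica will do here; a real client would prefer a nearby one.
        data += replicas[0].read_block(block_id)
    return data


# Metadata from the slide: foo -> 1, 2, 4 and bar -> 3, 5.
nn = NameNode()
nn.files = {"/usr/data/foo": [1, 2, 4], "/usr/data/bar": [3, 5]}

nodes = [DataNode(f"dn{i}") for i in range(1, 6)]
for block_id in range(1, 6):
    # Assumed placement: three consecutive nodes per block.
    replicas = [nodes[(block_id + k) % 5] for k in range(3)]
    for dn in replicas:
        dn.blocks[block_id] = f"<block {block_id}>".encode()
    nn.locations[block_id] = replicas

bar = read_file(nn, "/usr/data/bar")
print(bar)
```

The point of the split is that the NameNode answers only "which blocks, on which nodes"; the bytes themselves never pass through it.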

Case Study
Policy: Assigning blocks based on disk space
[Figure: Hadoop places blocks 1, 2, and 3, three replicas each, on the nodes tiger, frog, lion, monkey, and python, and records the placement in its metadata. The last step of the figure shows an additional node, gator.]
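One way to read "assigning blocks based on disk space" is a greedy rule: put each replica on the nodes that currently have the most free space. The sketch below assumes exactly that rule. The slide does not spell the policy out, the free-space figures and block size are made up, and HDFS's default placement also takes rack topology into account, so treat this as an illustration of the idea only.

```python
# Greedy replica placement by free disk space (illustrative policy sketch).
# Node names come from the slide; the free-space figures are hypothetical.

def place_block(free_space, block_size, replication=3):
    """Pick `replication` distinct nodes with the most free space."""
    candidates = [n for n, free in free_space.items() if free >= block_size]
    if len(candidates) < replication:
        raise RuntimeError("not enough nodes with free space")
    # Most free space first; ties broken by name so the result is deterministic.
    chosen = sorted(candidates, key=lambda n: (-free_space[n], n))[:replication]
    for n in chosen:
        free_space[n] -= block_size
    return chosen


free_space = {"tiger": 500, "frog": 300, "lion": 400, "monkey": 200, "python": 100}
metadata = {}                         # block id -> nodes holding a replica
for block_id in (1, 2, 3):
    metadata[block_id] = place_block(free_space, block_size=128)

for block_id, holders in metadata.items():
    print(block_id, holders)
```

Under this rule a newly added, empty node would attract the next replicas, since it has the most free space.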

Recall: Hadoop Components
Two core components:
- The Hadoop distributed file system (HDFS): NameNode (metadata), DataNode1 to DataNode5 (replicated blocks)
- MapReduce software framework: JobTracker, TaskTrackers

MapReduce
- A method for distributing a task across nodes:
  - Each node processes data stored on that node
- Consists of two phases:
  - Map
  - Reduce
Data (Mapper Input) -> Map -> Reduce -> Result

Features of MapReduce
- Automatic parallelization and distribution
- A clean abstraction for programmers
  - MapReduce programs are usually written in Java
- MapReduce abstracts all the housekeeping away from the developer:
  - Developers concentrate on Map and Reduce functions

MapReduce Example: Word Count
Count the number of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
  foreach word w in input_value:
    emit(w, 1);
}

Input to the Mapper:
(3414, "the cat sat on the mat")
(3437, "the aardvark sat on the sofa")

Output from the Mapper:
("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1),
("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)
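The Map pseudocode translates almost directly into Python. The sketch below is plain Python rather than Hadoop's Java API; `map_fn` and `records` are names chosen here, and splitting on whitespace is an assumed definition of "word".

```python
# Word-count mapper: emit (word, 1) for every word in the input value.

def map_fn(input_key, input_value):
    # input_key (for example a byte offset) is not used by word count.
    for word in input_value.split():
        yield (word, 1)


records = [
    (3414, "the cat sat on the mat"),
    (3437, "the aardvark sat on the sofa"),
]
mapper_output = [pair for key, value in records for pair in map_fn(key, value)]
print(mapper_output)
```

Each input record is handled independently, which is what lets the framework run many mappers at once.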

Map Phase
Mapper Input:
the cat sat on the mat
the aardvark sat on the sofa
Mapper Output:
("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1),
("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)

Reducer
After the Map, all the intermediate values for a given intermediate key are combined together into a list

Mapper Output:
("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1),
("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)

Reducer Input:
aardvark, 1
cat, 1
mat, 1
on, [1, 1]
sat, [1, 1]
sofa, 1
the, [1, 1, 1, 1]
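The grouping step between Map and Reduce can be sketched as a dictionary of lists. In Hadoop the framework does this across machines; the single-process `shuffle` function below is a hypothetical stand-in that only shows the data transformation.

```python
from collections import defaultdict

# Shuffle: collect all intermediate values that share a key into one list.

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sort by key so the reducer sees keys in a stable order.
    return dict(sorted(groups.items()))


words = "the cat sat on the mat the aardvark sat on the sofa".split()
mapper_output = [(w, 1) for w in words]
reducer_input = shuffle(mapper_output)

for key, values in reducer_input.items():
    print(key, values)
```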

Reducer
Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
  set count = 0;
  foreach v in intermediate_vals:
    count += v;
  emit(output_key, count);
}

Output from the Reducer:
("the", 4), ("sat", 2), ("on", 2), ("sofa", 1), ("mat", 1), ("cat", 1), ("aardvark", 1)
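A Python rendering of the Reduce pseudocode, again as a sketch in plain Python rather than the Hadoop API. The `reducer_input` dictionary is typed in by hand from the slide's grouped values.

```python
# Word-count reducer: sum the list of values for one key.

def reduce_fn(output_key, intermediate_vals):
    count = 0
    for v in intermediate_vals:
        count += v
    yield (output_key, count)


reducer_input = {
    "aardvark": [1], "cat": [1], "mat": [1], "on": [1, 1],
    "sat": [1, 1], "sofa": [1], "the": [1, 1, 1, 1],
}
reducer_output = [
    pair for key, vals in reducer_input.items() for pair in reduce_fn(key, vals)
]
print(reducer_output)
```

Because each key is reduced on its own, different keys can be handed to different reducer tasks.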

Map + Reduce
Mapping -> Shuffling -> Reducing

Mapper Input:
the cat sat on the mat
the aardvark sat on the sofa

Mapping:
("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1),
("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1), ("sofa", 1)

Shuffling:
aardvark, 1; cat, 1; mat, 1; on, [1, 1]; sat, [1, 1]; sofa, 1; the, [1, 1, 1, 1]

Reducing:
aardvark, 1; cat, 1; mat, 1; on, 2; sat, 2; sofa, 1; the, 4

Why we care about counting words
- Word count is challenging over massive amounts of data
- The fundamentals of statistics are often aggregate functions
- Most aggregation functions are distributive in nature
- MapReduce breaks complex tasks into smaller pieces that can be executed in parallel
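"Distributive" here means the aggregate over the whole input can be rebuilt from aggregates over its parts. A small sketch of that property for counting, using a thread pool as a stand-in for separate machines (the partitioning into one line per partition is an arbitrary choice for the demo):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Count words per partition in parallel, then merge the partial counts.

def count_words(lines):
    return Counter(word for line in lines for word in line.split())


lines = [
    "the cat sat on the mat",
    "the aardvark sat on the sofa",
]
partitions = [[line] for line in lines]       # one partition per line

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_words, partitions))

merged = sum(partials, Counter())             # add the partial counts together

# Distributive property: merging partial counts equals counting everything.
assert merged == count_words(lines)
print(merged.most_common(3))
```

The same pattern works for sums, minima, and maxima, since each can be merged from per-partition results.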

Deployment
- Client
- NameNode
- JobTracker [Determines execution plan]
- Three worker nodes, each running a TaskTracker (Map and Reduce tasks) and a DataNode

Discussions

Lecture Outline Introduction Hadoop Chord


Why Chord?

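Chord
[Figure: identifier circle with positions 0 to 7. Nodes sit at positions 0, 1, and 3. Keys 1, 2, and 6 are each stored at the first node at or after the key on the circle: key 1 at node 1, key 2 at node 3, key 6 at node 0.]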
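The key-to-node rule in the figure (each key lives at its successor on the circle) fits in one function. The sketch assumes 3-bit identifiers, nodes 0, 1, and 3, and keys 1, 2, and 6, matching the ring drawn on the slides. `successor` is a name chosen here, and it cheats by looking at the full node list, which a real Chord node does not have.

```python
# Assign keys to nodes on a Chord-style identifier circle.

M = 3                                 # identifier bits: 2**3 = 8 positions


def successor(ident, nodes, m=M):
    """First node at or clockwise after `ident` on the ring."""
    ring = 2 ** m
    return min(nodes, key=lambda n: (n - ident) % ring)


nodes = [0, 1, 3]
keys = [1, 2, 6]
placement = {k: successor(k, nodes) for k in keys}
print(placement)
```

Key 6 wraps around past position 7 and lands on node 0, which is why the structure is drawn as a circle.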

Chord
Finger tables (start, interval, successor) for the three nodes on the ring:

Node 0 (stores key 6)
start  int.   succ.
1      [1,2)  1
2      [2,4)  3
4      [4,0)  0

Node 1 (stores key 1)
start  int.   succ.
2      [2,3)  3
3      [3,5)  3
5      [5,1)  0

Node 3 (stores key 2)
start  int.   succ.
4      [4,5)  0
5      [5,7)  0
7      [7,3)  0
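The finger tables follow a fixed rule: entry i of node n starts at n + 2^(i-1) modulo 2^m, covers the interval up to the next entry's start, and points to the successor of its start. A sketch that regenerates the tables for nodes 0, 1, and 3 with m = 3 (function names are chosen here, and the tables are computed from a global view of the ring purely for illustration):

```python
# Build Chord finger tables for a small ring.

M = 3
RING = 2 ** M


def successor(ident, nodes):
    """First node at or clockwise after `ident`."""
    return min(nodes, key=lambda n: (n - ident) % RING)


def finger_table(n, nodes):
    """Rows of (start, (interval_start, interval_end), successor) for node n."""
    rows = []
    for i in range(1, M + 1):
        start = (n + 2 ** (i - 1)) % RING
        end = (n + 2 ** i) % RING     # where the next finger would start
        rows.append((start, (start, end), successor(start, nodes)))
    return rows


nodes = [0, 1, 3]
tables = {n: finger_table(n, nodes) for n in nodes}
for n, rows in tables.items():
    print("node", n)
    for start, (lo, hi), succ in rows:
        print(f"  {start}  [{lo},{hi})  {succ}")
```

Each node keeps only m entries, so table size grows with the number of identifier bits rather than with the number of nodes.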

Importance: Complexity!
[Figure: the ring with nodes 0, 1, and 3 and their finger tables, as on the previous slide.]
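The complexity point is that finger tables let a lookup skip across the ring instead of walking it node by node; Chord's design goal is a number of hops that grows logarithmically with the number of nodes. The sketch below routes lookups through finger tables and counts hops. The ring size, the evenly spaced node IDs, and all function names are assumptions for the demo, and the tables are built from a global view that a real node would not have.

```python
# Chord-style lookup through finger tables, with hop counting (sketch).

def in_open(x, a, b, ring):
    """True if x lies strictly between a and b going clockwise."""
    if a == b:                        # interval wraps the whole ring
        return x != a
    return 0 < (x - a) % ring < (b - a) % ring


def in_half_open(x, a, b, ring):
    """True if x lies in (a, b] going clockwise."""
    if a == b:
        return True
    return 0 < (x - a) % ring <= (b - a) % ring


def build_fingers(nodes, m):
    ring = 2 ** m

    def succ(ident):
        return min(nodes, key=lambda n: (n - ident) % ring)

    # fingers[n][i] is the successor of n + 2**i; fingers[n][0] is n's successor.
    return {n: [succ((n + 2 ** i) % ring) for i in range(m)] for n in nodes}


def lookup(start_node, key, fingers, m):
    """Return (node responsible for key, hops taken)."""
    ring = 2 ** m
    node, hops = start_node, 0
    while not in_half_open(key, node, fingers[node][0], ring):
        # Jump to the farthest finger that still precedes the key.
        nxt = next((f for f in reversed(fingers[node])
                    if in_open(f, node, key, ring)), node)
        if nxt == node:
            break
        node, hops = nxt, hops + 1
    return fingers[node][0], hops


M = 10
nodes = list(range(0, 2 ** M, 16))    # 64 evenly spaced nodes (assumed)
fingers = build_fingers(nodes, M)
results = [lookup(0, key, fingers, M) for key in range(2 ** M)]
max_hops = max(hops for _, hops in results)
print(len(nodes), "nodes, worst-case hops from node 0:", max_hops)
```

Walking successor pointers alone would take up to 63 hops on this ring; the printed worst case shows how much the fingers shorten that.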

Node Joins
Node 6 joins the ring. Finger entries whose successor is now node 6 are updated, node 6 builds its own finger table, and key 6 moves from node 0 to node 6.

Node 0
start  int.   succ.
1      [1,2)  1
2      [2,4)  3
4      [4,0)  6   (was 0)

Node 1 (stores key 1)
start  int.   succ.
2      [2,3)  3
3      [3,5)  3
5      [5,1)  6   (was 0)

Node 3 (stores key 2)
start  int.   succ.
4      [4,5)  6   (was 0)
5      [5,7)  6   (was 0)
7      [7,3)  0

Node 6 (stores key 6)
start  int.   succ.
7      [7,0)  0
0      [0,2)  0
2      [2,6)  3

Case Study: Node Leaves
Node 1 leaves the ring. Node 0's first finger changes from node 1 to node 3, and key 1 moves to node 3.

Node 0
start  int.   succ.
1      [1,2)  3   (was 1)
2      [2,4)  3
4      [4,0)  6

Node 3 (stores keys 1 and 2)
start  int.   succ.
4      [4,5)  6
5      [5,7)  6
7      [7,3)  0

Node 6 (stores key 6)
start  int.   succ.
7      [7,0)  0
0      [0,2)  0
2      [2,6)  3
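The effect of a join or a leave can be checked by recomputing every finger table and key placement for the new membership. The sketch below does that for the three states on the slides: nodes {0, 1, 3}, then node 6 joins, then node 1 leaves. Recomputing from scratch with a global view is an assumption made for brevity; the real protocol updates only the affected entries by exchanging messages between nodes.

```python
# Finger tables and key placement before and after membership changes.

M = 3
RING = 2 ** M


def successor(ident, nodes):
    """First node at or clockwise after `ident`."""
    return min(nodes, key=lambda n: (n - ident) % RING)


def finger_succs(n, nodes):
    """Successor column of node n's finger table (starts n+1, n+2, n+4)."""
    return [successor((n + 2 ** i) % RING, nodes) for i in range(M)]


def snapshot(nodes, keys):
    tables = {n: finger_succs(n, nodes) for n in sorted(nodes)}
    placement = {k: successor(k, nodes) for k in keys}
    return tables, placement


keys = [1, 2, 6]
before = snapshot({0, 1, 3}, keys)
after_join = snapshot({0, 1, 3, 6}, keys)     # node 6 joins
after_leave = snapshot({0, 3, 6}, keys)       # node 1 leaves

for label, (tables, placement) in [
    ("before", before), ("after join", after_join), ("after leave", after_leave)
]:
    print(label, tables, placement)
```

Only one key changes owner at each step (key 6 on the join, key 1 on the leave), which is the property that makes this scheme attractive when nodes come and go.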

Chord Issues?