Extreme Computing: Introduction to MapReduce

Cluster

We have 12 servers: scutter01, scutter02, ..., scutter12.

If working outside Informatics, first run:
    ssh student.ssh.inf.ed.ac.uk
Then log into a random server:
    ssh scutter$(printf "%02i" $((RANDOM%12+1)))

Please load balance! Two years ago the cluster crashed.

Cluster Software

The cluster runs Hadoop on DICE (the Informatics Linux Environment).
=> No need to install software yourself.

You can run your own cluster, but:
- We won't help you install it
- Copy your output to the cluster
- Code should run on the cluster
=> Make sure your DICE account works! We don't have root, so only computing support can help. Do this before the labs starting 2 October.

Companies I Take Money From

- Likely guest lecture
- Currently no guest lecture

MapReduce: Incremental Approach

Build MapReduce from problems. Assemble the picture at the end.
Assignment 1 is pure MapReduce problems.

Distributed grep

    grep extreme

Find every line containing "extreme" in a text file.

Input:
    extreme students
    pay extremely high
    this is slow
    up to there
    method extremely useful
    take TTDS

Output:
    extreme students
    pay extremely high
    method extremely useful

Split input into pieces, run grep on each.
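
The split-then-grep idea can be tried locally before touching Hadoop. The following is a minimal sketch in plain Python, not how Hadoop does it; the helper names (`grep`, `split_into_pieces`, `distributed_grep`) are made up for illustration, and the line breaks in the sample input are an assumption.

```python
def grep(pattern, lines):
    """Return the lines that contain pattern as a substring, like `grep pattern`."""
    return [line for line in lines if pattern in line]


def split_into_pieces(lines, num_pieces):
    """Cut lines into at most num_pieces contiguous, non-overlapping pieces."""
    size = max(1, -(-len(lines) // num_pieces))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def distributed_grep(pattern, lines, num_pieces):
    """Run grep on each piece separately, then concatenate the results in order."""
    results = []
    for piece in split_into_pieces(lines, num_pieces):
        # Each piece is independent, so it could run on its own machine.
        results.extend(grep(pattern, piece))
    return results


# Sample input from the slide (line breaks assumed).
text = [
    "extreme students",
    "pay extremely high",
    "this is slow",
    "up to there",
    "method extremely useful",
    "take TTDS",
]

print(distributed_grep("extreme", text, 3))
# ['extreme students', 'pay extremely high', 'method extremely useful']
```

Because every line is examined on its own, the result is the same as running grep once over the whole file, however the pieces are cut.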

Interlude: Pieces of a Text File

Goal: assign a piece of the text file to each machine.
- Non-overlapping
- Break at line boundaries
- Fast (don't read more than you have to)
- Balanced (roughly equal sizes)

Seeking

seek allows one to skip to a particular byte in a file.
There is no seek for line offsets: you'd have to read the file from the beginning and count newlines.
But we can seek to a byte offset, then round up to the next line.
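
Seeking to a byte offset and rounding up to the next line takes only a few lines of Python. This is a sketch under assumptions: the function name `round_up_to_line` is hypothetical, and an in-memory `io.BytesIO` stands in for a real file opened with `open(path, "rb")`.

```python
import io


def round_up_to_line(f, offset):
    """Return the byte offset of the first line starting at or after offset.

    f must be a binary file object so that positions are byte offsets.
    """
    if offset == 0:
        return 0            # the start of the file is always a line start
    f.seek(offset - 1)      # step back one byte: if that byte is a newline,
    f.readline()            # readline consumes only it and we stay at offset
    return f.tell()


# "first line\n" occupies bytes 0-10, "second\n" 11-17, "third\n" 18-23.
f = io.BytesIO(b"first line\nsecond\nthird\n")

print(round_up_to_line(f, 5))   # 11: byte 5 is mid-line, so skip to "second"
print(round_up_to_line(f, 11))  # 11: already at the start of a line
print(round_up_to_line(f, 12))  # 18: mid-line again, so skip to "third"
```

Only the tail of one line is read, regardless of how large the file is.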

Rounding bytes to lines

Split a 300-byte text file:

    Task   Byte Assignment   Line Rounding
    0      0-99              0-102
    1      100-199           103-207
    2      200-299           208-299

Each task can read until it sees a newline, then round up to that. Work is divided at line boundaries.
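
The table can be reproduced with a short sketch. It assumes a rule consistent with the table (a task owns every line whose first byte falls in its byte assignment) and a made-up 300-byte file whose newlines sit at bytes 102, 207 and 299; `read_split` is a hypothetical helper, not a Hadoop API.

```python
import io


def read_split(f, start, end):
    """Read the lines whose first byte lies in [start, end).

    Returns (first_byte, last_byte, lines) after rounding to line boundaries.
    """
    if start == 0:
        f.seek(0)
    else:
        f.seek(start - 1)
        f.readline()        # skip the rest of a line owned by the previous task
    first = f.tell()
    lines = []
    while f.tell() < end:   # a line that starts before end is read to its newline
        line = f.readline()
        if not line:        # end of file
            break
        lines.append(line)
    return first, f.tell() - 1, lines


# Assumed layout: three lines of 103, 105 and 92 bytes (300 bytes in total).
data = b"a" * 102 + b"\n" + b"b" * 104 + b"\n" + b"c" * 91 + b"\n"

for task in range(3):
    start, end = task * 100, task * 100 + 100
    first, last, lines = read_split(io.BytesIO(data), start, end)
    print(task, "bytes %d-%d" % (start, end - 1), "lines %d-%d" % (first, last))
```

Each task decides its own boundaries from its byte range alone, so the tasks never need to coordinate, and every line is read by exactly one of them.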

Hadoop is an implementation of MapReduce. This just shows how Hadoop splits input:

    hadoop jar hadoop-streaming-2.7.3.jar            Run Hadoop
      -input /data/assignments/ex1/websmall.txt      Read big text file
      -output /user/$user/catted                     Write here
      -mapper "cat"                                  Just copy the input
      -reducer NONE                                  Ignore this for now

Don't worry, you'll get too much practice in the labs.

Distributed grep

    hadoop jar hadoop-streaming-2.7.3.jar            Run Hadoop
      -input /data/assignments/ex1/websmall.txt      Read big text file
      -output /user/$user/grepped                    Write here
      -mapper "grep extreme"                         Scan for extreme
      -reducer NONE                                  Ignore this for now

Summarizing

    File: websmall.txt
      -> Machine 0, mapper: grep -> File: part-00000
      -> Machine 1, mapper: grep -> File: part-00001
      -> Machine 2, mapper: grep -> File: part-00002

Hadoop takes care of:
- Shared file system
- Splitting input at line boundaries
- Launching tasks on multiple machines

We can specify any command (a "mapper") to run.

Word Count

How many times do words appear?

Input:
    want to use a
    a to
    a decimal

Output:
    a 3
    want 1
    to 2
    use 1
    decimal 1

Each mapper counts independently:

    Mapper 0: a 1, want 1, to 1, use 1
    Mapper 1: a 2, to 1, decimal 1

Problem: need to collate/sum counts.

Reducers sum counts:

    Reducer 0 and Reducer 1 (combined output): a 3, decimal 1, to 2, use 1, want 1

Mappers hash the word mod 2 to decide which reducer to send to.
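
Hashing the word mod 2 can be sketched as follows. This is an illustration, not Hadoop's partitioner: CRC32 is used as a stand-in hash because Python's built-in `hash()` for strings changes between runs, and the names `reducer_for` and `shuffle_and_sum` are hypothetical.

```python
import zlib
from collections import Counter


def reducer_for(word, num_reducers=2):
    """Pick a reducer by hashing the word and taking the remainder."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle_and_sum(mapper_outputs, num_reducers=2):
    """Route every (word, count) pair to its reducer and sum the counts there."""
    reducers = [Counter() for _ in range(num_reducers)]
    for output in mapper_outputs:
        for word, count in output:
            reducers[reducer_for(word, num_reducers)][word] += count
    return reducers


# Mapper outputs from the word-count example.
mapper_outputs = [
    [("a", 1), ("want", 1), ("to", 1), ("use", 1)],  # Mapper 0
    [("a", 2), ("to", 1), ("decimal", 1)],           # Mapper 1
]

for i, totals in enumerate(shuffle_and_sum(mapper_outputs)):
    print("Reducer", i, sorted(totals.items()))
```

Since the reducer is a function of the word alone, every mapper sends a given word to the same reducer, so that reducer sees all of the word's counts and can produce the full total.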

Examine Reducer Input

    hadoop jar hadoop-streaming-2.7.3.jar            Run Hadoop
      -files count_map.py                            Copy code to workers
      -input /data/assignments/ex1/websmall.txt      Read big text file
      -output /user/$user/reducespy                  Write here
      -mapper count_map.py                           Count words locally
      -reducer cat                                   Leave as is

cat will copy input to output, so we can see what the input is.
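
The slides do not show count_map.py itself. The following is a guess at what a mapper that counts words locally might look like, not the course's actual script. It assumes whitespace-separated words and tab-separated `word<TAB>count` output, which is Hadoop Streaming's default key/value separator; the function names are hypothetical.

```python
import sys
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words across all lines this mapper sees."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_output(counts):
    """Render counts as 'word<TAB>count' records, one per distinct word."""
    return ["%s\t%d" % (word, count) for word, count in counts.items()]


def main():
    """Streaming entry point: read lines from stdin, write records to stdout."""
    for record in format_output(count_words(sys.stdin)):
        print(record)


# As a streaming mapper, the script would end by calling main().
# Here the logic is tried on the two lines Mapper 1 saw in the example.
print(format_output(count_words(["a to", "a decimal"])))
# ['a\t2', 'to\t1', 'decimal\t1']
```

Counting inside the mapper means each distinct word is sent to a reducer once per mapper rather than once per occurrence, at the cost of holding that mapper's distinct words in memory.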

Sorting

Hadoop sorts reducer input for you:

    Unsorted (annoying):   Sorted (easy):
    to 1                   to 1
    want 1                 to 1
    use 1                  use 1
    to 1                   want 1

Sorting makes it easy to stream in constant memory. Unsorted would require remembering words in memory.
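
Why sorted input allows constant-memory streaming can be seen in a small reducer sketch. It is an illustration rather than the course's count_reduce.py, and it assumes tab-separated `word<TAB>count` lines; `parse` and `sum_sorted` are hypothetical names.

```python
from itertools import groupby


def parse(line):
    """Split a 'word<TAB>count' line into (word, int(count))."""
    word, count = line.rstrip("\n").rsplit("\t", 1)
    return word, int(count)


def sum_sorted(lines):
    """Yield (word, total) for sorted input, holding one word's sum at a time."""
    records = (parse(line) for line in lines)  # lazy: nothing is buffered
    for word, group in groupby(records, key=lambda record: record[0]):
        yield word, sum(count for _, count in group)


sorted_input = ["to\t1\n", "to\t1\n", "use\t1\n", "want\t1\n"]
print(list(sum_sorted(sorted_input)))
# [('to', 2), ('use', 1), ('want', 1)]

unsorted_input = ["to\t1\n", "want\t1\n", "use\t1\n", "to\t1\n"]
print(list(sum_sorted(unsorted_input)))
# [('to', 1), ('want', 1), ('use', 1), ('to', 1)]
```

`groupby` only merges adjacent records with the same key, so on unsorted input "to" comes out twice; getting the right answer there would need a dictionary of every word seen so far.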

Examine Reducer Input

    hadoop jar hadoop-streaming-2.7.3.jar            Run Hadoop
      -files count_map.py,count_reduce.py            Copy code to workers
      -input /data/assignments/ex1/websmall.txt      Read big text file
      -output /user/$user/count                      Write here
      -mapper count_map.py                           Count words locally
      -reducer count_reduce.py                       Sum counts

And we get word count... hopefully.