Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU

MapReduce Overview
- Sometimes a single computer cannot process the data, or takes too long; traditional serial programming is not always enough.
  - Processor constraints
  - Storage constraints
- When resources are pooled, it may be possible to break a task into parts and execute them concurrently on multiple processors.
  - Challenge: What can run concurrently? How to parallelize?
- MapReduce: a programming paradigm for processing large data sets

Map-Reduce Overview
- Introduced by Dean and Ghemawat in 2004
- Map-Reduce is a scalable parallel programming paradigm for big data processing (inspired by Lisp).
  - Map(f_m, SetOfValues): Map(length, [(9) (7 3) () (4 6 8)]) gives (1 2 0 3)
  - Reduce(f_r, SetOfValues): Reduce(sum, [2 7 1 5 0 3]) gives 18
- The programmer provides Map and Reduce; the system handles the rest.
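
The two Lisp-style calls on this slide can be reproduced on a single machine with Python's built-in `map` and `functools.reduce`. This is only a sketch of the functional idea behind the names; nothing here is distributed.

```python
from functools import reduce
from operator import add

# Map: apply one function to every element independently.
lists = [[9], [7, 3], [], [4, 6, 8]]
lengths = list(map(len, lists))

# Reduce: fold a sequence into a single value with a binary function.
values = [2, 7, 1, 5, 0, 3]
total = reduce(add, values, 0)

print(lengths)  # [1, 2, 0, 3]
print(total)    # 18
```

Because each `len` call is independent of the others, the Map part is the piece that parallelizes trivially.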

History
- Original research done at Google (Dean & Ghemawat, 2004)
- The Apache Software Foundation now provides the Hadoop MapReduce implementation.
- An Amazon version and others also exist.

Map-Reduce
- Ranking (e.g., PageRank) requires iterated matrix-vector multiplication with a matrix containing millions of rows and columns.
- Computing with social networks involves graphs with hundreds of millions of nodes and billions of edges.
- Map-Reduce is a parallel programming paradigm, part of a software stack that helps to address big data processing:
  - Distributed file system with redundancy (e.g., Google FS, Hadoop DFS, CloudStore)
  - Network of racks of processors forming a cluster

MapReduce
- The framework is used by writing 2 procedures: Map and Reduce.
- Map
  - The input is broken into chunks, and each Map task is given one or more chunks.
  - Output of a Map task: (key, value) pairs
- The Master controller sorts the pairs by key.
- Each Reduce task works on all pairs with the same key and combines their values as defined.
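
The Map, sort-by-key, and Reduce phases described on this slide can be imitated in one process. The sketch below is a stand-in for the framework, not Hadoop's API; `run_mapreduce`, the sensor job, and its data are hypothetical names invented for illustration.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    # Map phase: each input chunk yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for chunk in inputs:
        for key, value in map_fn(chunk):
            groups[key].append(value)
    # Sort phase: keys are visited in sorted order.
    # Reduce phase: each key is reduced together with all of its values.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

# Hypothetical job: largest reading per sensor.
def map_reading(chunk):
    for sensor, reading in chunk:
        yield sensor, reading

def reduce_max(sensor, readings):
    return max(readings)

chunks = [[("a", 3), ("b", 7)], [("a", 5)], [("b", 2), ("c", 1)]]
result = run_mapreduce(chunks, map_reading, reduce_max)
print(result)  # {'a': 5, 'b': 7, 'c': 1}
```

Only `map_reading` and `reduce_max` are job-specific, which mirrors the division of labor on the slide: the programmer writes two procedures and the driver does the rest.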

MapReduce Execution Overview
[Figure: MapReduce execution overview]

MapReduce Example
- Input: a repository of documents
- Output: word frequencies (the frequency of each word across the collection of documents)
- Input element: one document
- Map task: for each document, for each of its words w, output the pair (w, 1).
- Master Controller: groups the pairs by key into a list, then merges them into a file.
- Reduce task: combines the items related to one word to get the frequency of that word.
  - If the combine operation is associative and commutative, work can be moved between Map and Reduce.
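
A minimal single-process sketch of this word-count job follows. The function names and the two sample documents are made up for illustration. The optional `combine` step pre-sums counts on the Map side, which is safe only because addition is associative and commutative.

```python
from collections import Counter, defaultdict

def map_doc(doc):
    """Map: emit (word, 1) for every word in one document."""
    for word in doc.lower().split():
        yield word, 1

def combine(pairs):
    """Optional combiner: pre-sum the counts produced by one Map task."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return local.items()

def word_count(docs, use_combiner=True):
    groups = defaultdict(list)
    for doc in docs:
        pairs = map_doc(doc)
        if use_combiner:
            pairs = combine(pairs)
        for word, n in pairs:            # grouping by key
            groups[word].append(n)
    # Reduce: sum the values collected for each word.
    return {word: sum(ns) for word, ns in groups.items()}

docs = ["the cat sat", "the dog sat on the mat"]
counts = word_count(docs)
print(counts["the"], counts["sat"], counts["mat"])  # 3 2 1
```

Calling `word_count(docs, use_combiner=False)` gives the same totals; the combiner changes only how many pairs travel from Map to Reduce.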

MapReduce Subtleties
- One document is assigned to one Map task (many documents may go to the same Map task).
- Tradeoff between Map and Reduce: Map could do part of the combining and decrease the work for Reduce, i.e., it could return (w, m), where m is the number of occurrences of word w in one document.
- The Master Controller uses a hash function to distribute work into r tasks, since it knows the number of Reduce nodes. Each bucket becomes one file for one Reduce task. This helps to distribute work randomly among Reduce tasks/nodes.
- One word is assigned to one Reduce task (many words may go to the same Reduce task).
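
The hash-based distribution of keys into r buckets can be sketched as follows. The choice of `zlib.crc32` is an assumption made here for determinism, not something the slides specify, and `bucket_for` and `partition` are hypothetical helper names.

```python
import zlib
from collections import defaultdict

def bucket_for(key, r):
    """Deterministic hash of the key, taken modulo the number of Reduce tasks."""
    return zlib.crc32(key.encode("utf-8")) % r

def partition(pairs, r):
    """Place every (key, value) pair in the bucket chosen by its key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[bucket_for(key, r)].append((key, value))
    return buckets

pairs = [("the", 2), ("cat", 1), ("the", 1), ("mat", 1), ("cat", 3)]
buckets = partition(pairs, r=3)
for b in sorted(buckets):
    print(b, buckets[b])
```

Every pair with the same word lands in the same bucket, so one Reduce task sees all counts for that word. Python's built-in `hash()` is avoided here because string hashes are salted per process, which would send the same word to different buckets on different workers.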

Skew
- Imbalance in the workload assigned to different tasks and their compute nodes
  - More tasks means more overhead for creating tasks.
  - More tasks means greater ability to balance out the load.
  - There are more documents and words than nodes.
  - The number of documents and their sizes may be known beforehand.

Node Failures
- Compute node failure: restart.
- Map node failure: the Master node monitors, reassigns, and restarts the task; all Reduce tasks are informed of the new task/location and told to discard the old task/location.
- Reduce node failure: the Master node monitors, reassigns, and restarts the task.

Matrix-Vector Multiplication
- The same vector is held in the main memory (MM) of every node.
- Matrix M: n × n
- Vector v: length n
- Map step: focus on one element m_ij of M.
- Output contribution of one element: the pair (i, m_ij v_j)
- Reduce step: sum all entries for key i to get the result x_i = Σ_j m_ij v_j.
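
A small sketch of this Map and Reduce pair, assuming the matrix is stored as (i, j, m_ij) triples and the whole vector fits in memory. The function names and the 2 × 2 example are illustrative only.

```python
from collections import defaultdict

def map_element(i, j, m_ij, v):
    """Map: one matrix element contributes the pair (i, m_ij * v_j)."""
    return i, m_ij * v[j]

def reduce_row(i, terms):
    """Reduce: sum every contribution that shares row key i."""
    return sum(terms)

def mat_vec(entries, v):
    groups = defaultdict(list)
    for i, j, m_ij in entries:
        key, value = map_element(i, j, m_ij, v)
        groups[key].append(value)
    return {i: reduce_row(i, terms) for i, terms in sorted(groups.items())}

# M = [[1, 2], [3, 4]] stored as (i, j, m_ij) triples; v = [10, 20]
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [10, 20]
x = mat_vec(entries, v)
print(x)  # {0: 50, 1: 110}
```

Storing the matrix as triples means zero entries can simply be left out, which suits the sparse matrices that arise in ranking.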

What if the vector is too large for main memory (MM)?

Matrix Multiplication: 2 MapReduce steps
- Matrix M can be thought of as a relation with tuples (i, j, m_ij).
- Matrix N can be thought of as a relation with tuples (j, k, n_jk).
- A Map operation creates these tuples.
- Step 1 (join): the join of M and N on j brings us closer to M × N by creating the relation (i, j, k, m_ij, n_jk) or the relation (i, j, k, m_ij × n_jk).
- Step 2 (grouping and aggregation): produces M × N.
  - Map operation: identity, producing tuples (i, k, m_ij × n_jk)
  - Reduce operation: aggregates (sums) the values Z of all tuples (i, k, Z) and stores the result in cell (i, k)
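
The two steps can be sketched in memory as a join on j followed by a sum per (i, k). The names `step1_join` and `step2_aggregate` and the 2 × 2 matrices are assumptions for illustration; a real job would run each step as its own MapReduce pass.

```python
from collections import defaultdict

def step1_join(M, N):
    """Step 1: join M(i, j, m_ij) with N(j, k, n_jk) on the shared index j."""
    by_j = defaultdict(lambda: ([], []))
    for i, j, m in M:                  # Map: key every M tuple by j
        by_j[j][0].append((i, m))
    for j, k, n in N:                  # Map: key every N tuple by j
        by_j[j][1].append((k, n))
    for left, right in by_j.values():  # Reduce: pair up tuples sharing j
        for i, m in left:
            for k, n in right:
                yield (i, k), m * n

def step2_aggregate(products):
    """Step 2: identity Map, then Reduce sums the products for each (i, k)."""
    cells = defaultdict(int)
    for key, value in products:
        cells[key] += value
    return dict(cells)

M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # [[1, 2], [3, 4]]
N = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # [[5, 6], [7, 8]]
P = step2_aggregate(step1_join(M, N))
print(P[(0, 0)], P[(0, 1)], P[(1, 0)], P[(1, 1)])  # 19 22 43 50
```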

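Matrix Multiplication: 1 MapReduce step
- Map:
  - Produce tuples ((i, k), (M, j, m_ij)) from M, one for every column k of N.
  - Produce tuples ((i, k), (N, j, n_jk)) from N, one for every row i of M.
- Reduce: for each key (i, k), match the M and N values on j, multiply, and sum to produce one entry of M × N.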
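
A sketch of the one-step variant, assuming the Map side knows the output dimensions (`n_rows` of M and `n_cols` of N are parameters introduced here for illustration). Each matrix element is replicated to every output cell that needs it.

```python
from collections import defaultdict

def map_M(i, j, m_ij, n_cols):
    """An element of M is needed by every output cell in row i."""
    for k in range(n_cols):
        yield (i, k), ("M", j, m_ij)

def map_N(j, k, n_jk, n_rows):
    """An element of N is needed by every output cell in column k."""
    for i in range(n_rows):
        yield (i, k), ("N", j, n_jk)

def reduce_cell(values):
    """Match M and N values on j, multiply, and sum: one entry of M x N."""
    m = {j: x for tag, j, x in values if tag == "M"}
    n = {j: x for tag, j, x in values if tag == "N"}
    return sum(m[j] * n[j] for j in m.keys() & n.keys())

def multiply(M, N, n_rows, n_cols):
    groups = defaultdict(list)
    for i, j, x in M:
        for key, value in map_M(i, j, x, n_cols):
            groups[key].append(value)
    for j, k, x in N:
        for key, value in map_N(j, k, x, n_rows):
            groups[key].append(value)
    return {key: reduce_cell(vals) for key, vals in groups.items()}

M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # [[1, 2], [3, 4]]
N = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # [[5, 6], [7, 8]]
P = multiply(M, N, n_rows=2, n_cols=2)
print(P[(0, 0)], P[(1, 1)])  # 19 50
```

The single pass trades communication for simplicity: every element is emitted once per output row or column, so far more pairs are produced than in the two-step version.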

Relational DB operations using MapReduce
- Selection
- Projection
- Union, Intersection & Difference
- Natural Join
- Grouping and aggregation

Example: Paths of length 2 in network
- To know whether there is a path of length 2 in a directed network from vertex A to vertex B, we need to find a vertex C such that (A, C) and (C, B) are directed edges in the network.
- This can be written as a join of 2 relations. How?
- This can also be written as a matrix multiplication of 2 adjacency matrices of the network/graph. How?
- Either way, we can now implement it using a MapReduce framework.
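
One possible answer to the join question: join the edge relation with itself on the middle vertex. In the sketch below, Map keys each edge by the vertex that could sit in the middle, and Reduce pairs incoming with outgoing edges. The function names and the sample edge list are invented for illustration.

```python
from collections import defaultdict

def map_edge(u, v):
    """Key each edge by the vertex that would sit in the middle of a path."""
    yield v, ("in", u)    # u -> v: v is the middle, u is a possible start
    yield u, ("out", v)   # u -> v: u is the middle, v is a possible end

def reduce_middle(c, values):
    """Pair every edge into c with every edge out of c."""
    starts = [x for tag, x in values if tag == "in"]
    ends = [x for tag, x in values if tag == "out"]
    for a in starts:
        for b in ends:
            yield a, c, b

def paths_of_length_2(edges):
    groups = defaultdict(list)
    for u, v in edges:
        for key, value in map_edge(u, v):
            groups[key].append(value)
    paths = []
    for c, values in groups.items():
        paths.extend(reduce_middle(c, values))
    return sorted(paths)

edges = [("A", "C"), ("C", "B"), ("A", "D"), ("D", "B"), ("B", "A")]
paths = paths_of_length_2(edges)
for a, c, b in paths:
    print(f"{a} -> {c} -> {b}")
```

In the matrix view, the same information is entry (A, B) of the squared adjacency matrix, which counts the middle vertices C; for this edge list there are two paths from A to B.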

More complex example: Arbitrage
- Assume currency exchange rates as follows:
  - EUR/CAD: 0.664 (1 CAD buys you 0.664 EUR)
  - USD/EUR: 1.234
  - CAD/USD: 1.398
- If you start with 10,000 CAD, then use it to buy
  - 6,640 EUR
  - 6,640 × 1.234 = 8,193.76 USD
  - 6,640 × 1.234 × 1.398 ≈ 11,454.88 CAD
- Profit of about 1,454.88 CAD, or 14.5%. Not bad!
- Triangular Arbitrage
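
The arithmetic on this slide can be checked in a few lines. The variable names are made up; the three rates and the starting amount come from the example.

```python
# Rates from the example: each value is how much of the first currency
# one unit of the second currency buys.
eur_per_cad = 0.664
usd_per_eur = 1.234
cad_per_usd = 1.398

start_cad = 10_000
eur = start_cad * eur_per_cad   # CAD -> EUR
usd = eur * usd_per_eur         # EUR -> USD
end_cad = usd * cad_per_usd     # USD -> CAD

profit = end_cad - start_cad
print(f"{eur:,.2f} EUR -> {usd:,.2f} USD -> {end_cad:,.2f} CAD")
print(f"profit: {profit:,.2f} CAD ({profit / start_cad:.1%})")
```

The loop is profitable because the product of the three rates, 0.664 × 1.234 × 1.398, is greater than 1.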

Arbitrage using MapReduce
- Process currency market quotes.
- Look for uncompleted offers to make the 3 currency exchanges:
  - Find all offers to BUY EUR with CAD, BUY USD with EUR, and BUY CAD with USD.
- Find a triple that makes you a profit.
- Now read this blog article that explains how to do it in Python/Hadoop: https://medium.com/@rrfd/your-first-map-reduce-using-hadoop-with-python-andosx-ca3b6f3dfe78
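
The "find a triple" step can be sketched as a brute-force search over currency triples. This is an in-memory illustration, not the blog's Hadoop code; the `rates` table is a hypothetical quote set built from the three rates in the arbitrage example, and `profitable_cycles` is an invented helper name.

```python
from itertools import permutations

# Hypothetical quote table: rates[(src, dst)] is how many units of dst
# one unit of src buys.
rates = {
    ("CAD", "EUR"): 0.664,
    ("EUR", "USD"): 1.234,
    ("USD", "CAD"): 1.398,
}

def profitable_cycles(rates):
    """Return (cycle, multiplier) for each 3-currency loop that ends ahead."""
    currencies = sorted({c for pair in rates for c in pair})
    found = []
    for a, b, c in permutations(currencies, 3):
        if a != min(a, b, c):
            continue  # skip rotations of a cycle that is already covered
        legs = [(a, b), (b, c), (c, a)]
        if all(leg in rates for leg in legs):
            multiplier = rates[legs[0]] * rates[legs[1]] * rates[legs[2]]
            if multiplier > 1.0:
                found.append(((a, b, c), multiplier))
    return found

cycles = profitable_cycles(rates)
for cycle, multiplier in cycles:
    print(" -> ".join(cycle), f"x{multiplier:.4f}")
```

With many currencies and quotes, chaining the three legs is a pair of joins on the shared currency, which is the part a MapReduce job would distribute.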

Running MapReduce
- Need Map code
- Need Reduce code
- Need Hadoop set up
  - Hadoop Distributed File System (HDFS)
  - Parallel processing environment
- The blog tells you in detail how to set it up and run the MapReduce code: https://medium.com/@rrfd/your-first-map-reduce-using-hadoop-with-python-andosx-ca3b6f3dfe78
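
As a rough idea of what the Map code and Reduce code might look like in Python, here is a word-count pair written as plain functions over lines of text. It assumes the Hadoop Streaming convention of tab-separated key/value lines; it is a local dry run, not the blog's code, and `sorted()` stands in for the framework's sort between Map and Reduce.

```python
from itertools import groupby

def mapper(lines):
    """Map code: turn each input line into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Reduce code: records arrive sorted by key, so equal keys are adjacent."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run on two sample lines.
lines = ["to be or not", "to be"]
output = list(reducer(sorted(mapper(lines))))
print(output)  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

In an actual streaming job the two functions would live in separate scripts that read standard input and write standard output, with Hadoop doing the sorting in between.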

Many more examples
- https://datascienceguide.github.io/map-reduce