Lecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks

COMP 322: Fundamentals of Parallel Programming
Lecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks
Mack Joyner and Zoran Budimlić
{mjoyner, zoran}@rice.edu
http://comp322.rice.edu
COMP 322, Lecture 30, March 2018

Worksheet #29 Solution: MPI Gather

Indicate what value should be provided instead of ??? in line 6 to minimize space, and how it should depend on myrank.

    1.  MPI.Init(args);
    2.  int myrank = MPI.COMM_WORLD.Rank();
    3.  int numprocs = MPI.COMM_WORLD.Size();
    4.  int size = ...;
    5.  int[] sendbuf = new int[size];
    6.  int[] recvbuf = new int[???];
    7.  ... // Each process initializes sendbuf
    8.  MPI.COMM_WORLD.Gather(sendbuf, 0, size, MPI.INT,
    9.                        recvbuf, 0, size, MPI.INT,
    10.                       0 /*root*/);
    11. ...
    12. MPI.Finalize();

Solution: myrank == 0 ? (size * numprocs) : 0

Only the root process (rank 0) receives the gathered data, so it needs room for size elements from each of the numprocs processes; every other process can allocate a zero-length receive buffer.

Organization of a Distributed-Memory Multiprocessor (Recap)

Figure (a): A host node (Pc) is connected to a cluster of processor nodes (P0 ... Pm). Processors P0 ... Pm communicate via an interconnection network, which could be standard TCP/IP (e.g., for Map-Reduce) or specialized for high-performance communication (e.g., for scientific computing).

Figure (b): Each processor node consists of a processor, memory, and a Network Interface Card (NIC) connected to a router node (R) in the interconnect.

In MPI, processes communicate by sending messages to each other. Distributed Map-Reduce offers an alternative approach to programming distributed-memory multiprocessors.

MapReduce Pattern (Recap from Lecture 8)

Apply a Map function f to each user-supplied record of key-value pairs, computing a set of intermediate key/value pairs.
Apply a Reduce operation g to all values that share the same key, to combine the derived data; this often produces a smaller set of values.
The user supplies the Map and Reduce operations in a functional model, so that the system can parallelize them, and also re-execute them for fault tolerance.
Distributed Map-Reduce frameworks (Hadoop, Spark) support the Map-Reduce pattern (with extensions) on a distributed-memory multiprocessor; see the single-machine sketch below.
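
As a minimal, single-JVM sketch of this pattern (not the Hadoop or Spark API; the class name MapReduceSketch and the sample input are our own), here is word count expressed with Java streams, where the map phase emits (word, 1) pairs and the reduce phase sums the values that share a key:

    import java.util.Arrays;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Illustrative sketch of the Map-Reduce pattern on one machine.
    // Map: each word becomes an intermediate (word, 1) pair.
    // Reduce: values sharing the same key are combined with Integer::sum.
    public class MapReduceSketch {
        public static Map<String, Integer> countWords(String[] lines) {
            return Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.split(" ")))  // Map phase: emit one token per word
                .collect(Collectors.toMap(
                    w -> w,           // intermediate key
                    w -> 1,           // intermediate value
                    Integer::sum));   // Reduce phase: combine values per key
        }

        public static void main(String[] args) {
            System.out.println(countWords(
                new String[]{"this is a line", "this is another line"}));
            // e.g. {a=1, this=2, is=2, line=2, another=1} (order may vary)
        }
    }

Because the map and reduce operations are pure functions, a framework is free to run them on many machines in parallel and to rerun them on failure, which is exactly what Hadoop and Spark do.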

Distributed MapReduce Execution

Fine-granularity tasks: many more map tasks than machines.
A bucket sort brings records with the same key together.
E.g., 2,000 servers => 200,000 Map tasks, 5,000 Reduce tasks.

Apache Hadoop Project

Hadoop: an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

Goals / Requirements:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- Simple programming models
- High scalability and availability
- Fault tolerance
- Move computation rather than data

Distributed design, with some centralization:
- The main nodes of the cluster are where most of the computational power and storage of the system lies.
- Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and a DataNode to store the needed blocks as closely as possible.
- A central control node runs the NameNode to keep track of HDFS directories and files, and the JobTracker to dispatch compute tasks to the TaskTrackers.

Written in Java; also supports Python and Ruby. A sketch of the user-supplied pieces follows.

Acknowledgment: slides on Hadoop from UCI CS 237 course by Nalini Venkatasubramanian
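
To make the user-supplied Map and Reduce roles concrete, here is a minimal sketch of the classic word-count job against the standard Hadoop Java API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer and the space-only tokenization are our own choices, and the driver/job configuration is omitted:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for each input line, emit (word, 1) for every word.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split(" ")) {
                word.set(token);
                context.write(word, ONE);  // intermediate key-value pair
            }
        }
    }

    // Reduce phase: sum all of the counts that share the same word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }

The framework handles everything between the two phases: splitting the input across DataNodes, dispatching map tasks via the JobTracker/TaskTrackers, and shuffling intermediate pairs so that all values for a key reach one reducer.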

Hadoop's Architecture

(Architecture figure: NN = NameNode, DN = DataNode, TT = TaskTracker)

Spark and Iterative Map/Reduce

Apache Spark: general-purpose functional programming over a cluster.
Spark caches the results of map/reduce operations in memory, so they can be reused in subsequent iterations.
It tends to be 10-100 times faster than Hadoop for many applications.
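
A minimal sketch of why in-memory caching matters for iteration, using the Java RDD API (the data set, loop body, and class name IterativeSketch are invented for illustration): cache() keeps the RDD in memory after it is first computed, so the repeated jobs in the loop do not recompute or re-read the input.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterativeSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("IterativeSketch").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // cache() pins the computed partitions in memory for reuse.
                JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();
                double estimate = 0.0;
                for (int iter = 0; iter < 10; iter++) {
                    final double current = estimate;
                    // Each iteration launches a job over the cached partitions.
                    estimate = data.map(x -> x - current).reduce((a, b) -> a + b) / data.count();
                }
                System.out.println("estimate = " + estimate);
            }
        }
    }

In Hadoop, each such iteration would be a separate MapReduce job whose input and output go through HDFS; avoiding that disk round trip is the source of Spark's speedup on iterative workloads.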

Apache Spark Project

Spark is a data-parallel processing framework, which means it will execute tasks as close to where the data lives as possible (i.e., minimize data transfer).
Spark follows a paradigm of keeping as much data as possible in memory and spilling the excess to disk, rather than pulling data from disk whenever it is needed.
Spark decomposes your program into tasks and handles the dispatching and scheduling of these tasks on the worker nodes in your cluster.
Spark revolves around the concept of a resilient distributed dataset (RDD).
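
A minimal sketch of the driver-side entry point in the Java API (the application name, the local[*] master, and the file name input.txt are placeholders): the driver builds a context, and Spark then decomposes subsequent RDD operations into tasks that it schedules on the workers.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ContextSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ContextSketch").setMaster("local[*]");
            try (JavaSparkContext context = new JavaSparkContext(conf)) {
                // Two common ways to obtain an initial RDD:
                JavaRDD<String> lines = context.textFile("input.txt");                   // one element per line
                JavaRDD<Integer> nums = context.parallelize(Arrays.asList(1, 2, 3, 4));  // from a driver-side list
                System.out.println(nums.count());  // 4
            }
        }
    }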

Resilient Distributed Datasets

The key construct in Spark is the Resilient Distributed Dataset (RDD).
An RDD is a giant immutable collection, distributed in a redundant way over all the machines in a cluster.
RDDs can be thought of as collections of key-value pairs, though the elements of an RDD can in fact be of arbitrary type; if the elements are pairs, the RDD acts like a table.
Computations on an RDD (including Map/Reduce) can be expressed as functional programming operations.
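
A small sketch of the "pairs act like a table" point, using JavaPairRDD (the sample data and class name PairRddSketch are invented): two pair RDDs keyed the same way support relational-style operations such as join.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class PairRddSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("PairRddSketch").setMaster("local[*]");
            try (JavaSparkContext context = new JavaSparkContext(conf)) {
                // Two "tables", both keyed by user name.
                JavaPairRDD<String, Integer> ages = context.parallelizePairs(
                    Arrays.asList(new Tuple2<>("ann", 30), new Tuple2<>("bob", 25)));
                JavaPairRDD<String, String> cities = context.parallelizePairs(
                    Arrays.asList(new Tuple2<>("ann", "Houston"), new Tuple2<>("bob", "Austin")));
                // Relational-style join on the key, as with two database tables.
                JavaPairRDD<String, Tuple2<Integer, String>> joined = ages.join(cities);
                joined.collect().forEach(System.out::println);  // (ann,(30,Houston)) (bob,(25,Austin))
            }
        }
    }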

Working with RDDs

There are two kinds of operations one can perform on an RDD:
Transformations: operations like map, filter, join, etc. that just return another RDD. They are lazy operations.
Actions: operations that actually produce results, like count, collect, save, etc. These operations do not return an RDD.
This is similar to intermediate and terminal operations in Java 8 streams.
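
A short sketch of the laziness distinction (sample data and class name LazinessSketch invented): the transformations only build up a plan; nothing runs until an action is called.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LazinessSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("LazinessSketch").setMaster("local[*]");
            try (JavaSparkContext context = new JavaSparkContext(conf)) {
                JavaRDD<Integer> nums = context.parallelize(Arrays.asList(1, 2, 3, 4, 5));
                // Transformations: each returns a new RDD; nothing is computed yet.
                JavaRDD<Integer> evens = nums.filter(x -> x % 2 == 0);
                JavaRDD<Integer> squares = evens.map(x -> x * x);
                // Actions: trigger the pipeline and return plain values to the driver.
                long n = squares.count();                  // 2
                List<Integer> result = squares.collect();  // [4, 16]
                System.out.println(n + " " + result);
            }
        }
    }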

Advantages of Immutability

The distributed nature of RDDs is not evident in the programming model.
RDD elements can be replicated for fault tolerance.
Purely functional operations can be easily defined on RDDs: because RDDs are immutable, all the operations from purely functional programming can be applied and parallelized in a straightforward way.
The runtime has great flexibility in scheduling operations on RDDs and executing them in parallel on partitions.
Partitions of an RDD can be recomputed from their lineage.
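
One concrete way to see lineage is Spark's toDebugString() method, which prints the chain of transformations from which any lost partition can be recomputed; the pipeline below and the class name LineageSketch are invented for illustration.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LineageSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("LineageSketch").setMaster("local[*]");
            try (JavaSparkContext context = new JavaSparkContext(conf)) {
                JavaRDD<Integer> squares = context
                    .parallelize(Arrays.asList(1, 2, 3, 4))
                    .map(x -> x * x)
                    .filter(x -> x > 4);
                // The lineage (parallelize -> map -> filter) is all Spark needs
                // to recompute any lost partition of `squares` after a failure.
                System.out.println(squares.toDebugString());
            }
        }
    }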

Word Count in Apache Spark

JavaRDD<String> file = context.textFile(inputFile);
JavaPairRDD<String, Integer> counter = file
    .flatMap(s -> Arrays.asList(s.split(" ")))
    .mapToPair(s -> new Tuple2<>(s, 1))
    .reduceByKey((a, b) -> a + b);
counter.collect().forEach(System.out::println);

// Definition of flatMap
//   x.flatMap(f) = x.map(f).flatten()
// Definition of reduceByKey
//   x.reduceByKey(f) = x.groupByKey().map(xs -> xs.reduce(f))
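
Note: this code follows the Spark 1.x Java API, where flatMap takes a function returning an Iterable. In Spark 2.x and later, the function must return an Iterator, so the first transformation becomes:

    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())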

Word Count in Apache Spark ["this is a line", "this is another line", "this is yet another line"].map(s -> Arrays.asList(s.split(" "))).flatten() -> [["this", "is", "a", "line"], ["this", "is", "another", "line"], ["this", "is", "yet", "another", "line"]].flatten() -> ["this", "is", "a", "line", "this", "is", "another", "line", "this", "is", "yet", "another", "line"] 14 COMP 322, Spring 2018 (M.Joyner, Z. Budimlić)

Word Count in Apache Spark ["this", "is", "a", "line", "this", "is", "another", "line", "this", "is", "yet", "another", "line"].map(s -> new Tuple2<>(s, 1)) > [["this",1], ["is",1], ["a",1], ["line",1], ["this",1], ["is",1], ["another",1], ["line",1], ["this",1], ["is",1], ["yet",1], ["another",1], ["line",1]] 15 COMP 322, Spring 2018 (M.Joyner, Z. Budimlić)

Word Count in Apache Spark [["this",1], ["is",1], ["a",1], ["line",1], ["this",1], ["is",1], ["another",1], ["line",1], ["this",1], ["is",1], ["yet",1],["another",1], ["line",1]].groupbykey().map(xs -> xs.reduce( (a,b) -> a + b)) > [["this", [1,1,1]], ["is", [1,1,1]], ["a", [1]], ["line", [1,1,1]], ["another", [1,1]], ["yet", [1]]].map(xs -> xs.reduce((a,b) -> a + b)) 16 COMP 322, Spring 2018 (M.Joyner, Z. Budimlić)

Word Count in Apache Spark [["this", [1,1,1]], ["is", [1,1,1]], ["a", [1]], ["line", [1,1,1]], ["another", [1,1]], ["yet", [1]]].map(xs -> xs.reduce( (a,b) -> a + b) > [["this", [1,1,1]].reduce((a,b) -> a + b), ["is", [1,1,1]].reduce((a,b) -> a + b), ["a", [1]].reduce((a,b) -> a + b), ["line", [1,1,1]].reduce((a,b) -> a + b), ["another", [1,1]].reduce((a,b) -> a + b), ["yet", [1]].reduce((a,b) -> a + b))] 17 COMP 322, Spring 2018 (M.Joyner, Z. Budimlić)

Word Count in Apache Spark [["this", [1,1,1]].reduce((a,b) -> a + b), ["is", [1,1,1]].reduce((a,b) -> a + b), ["a", [1]].reduce((a,b) -> a + b), ["line", [1,1,1]].reduce((a,b) -> a + b), ["another", [1,1]].reduce((a,b) -> a + b), ["yet", [1]].reduce((a,b) -> a + b))] > [["this", 3], ["is", 3], ["a", 1], ["line", 3], ["another", 2], ["yet", 1]] 18 COMP 322, Spring 2018 (M.Joyner, Z. Budimlić)