CompSci 516: Database Systems, Lecture 12: MapReduce and Spark (10/3/17)

Announcements

CompSci 516: Database Systems, Lecture 12 - MapReduce and Spark
Instructor: Sudeepa Roy, Duke CS, Fall 2017

- Practice midterm posted on Sakai: first prepare, then attempt!
- Midterm next Wednesday 10/11 in class
- Closed book/notes, no electronic devices
- Everything up to and including Lecture 12 is included

Reading Material

Recommended (optional) readings:
- Chapter 2 (Sections 1, 2, 3) of Mining of Massive Datasets, by Rajaraman and Ullman: http://i.stanford.edu/~ullman/mmds.html
- The original Google MapReduce paper by Jeff Dean and Sanjay Ghemawat, OSDI 2004: http://research.google.com/archive/mapreduce.html
- "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Matei Zaharia et al., 2012 (see course website)

Acknowledgement: Some of the following slides have been borrowed from Prof. Shivnath Babu, Prof. Dan Suciu, Prajakta Kalmegh, and Junghoon Kang.

Big Data

- If the data cannot be stored on one machine: store the data sets on multiple machines (Google File System)
- If the data cannot be processed on one machine: parallelize computation across multiple machines (MapReduce)

The MapReduce Framework

- Google published the MapReduce paper at OSDI 2004, a year after the Google File System paper
- A high-level programming paradigm that allows many important data-oriented processes to be written simply
- Processes large data by:
  - applying a function to each logical record in the input (map)
  - categorizing and combining the intermediate results into summary values (reduce)
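To make the paradigm concrete, here is a minimal, single-machine Scala sketch (not from the slides; all names are illustrative) that simulates the map, shuffle, and reduce steps for the canonical word-count example used later in this lecture:

```scala
object MapReduceSketch {
  // map: applied independently to each logical record; emits (key, value) pairs
  def map(docId: String, text: String): Seq[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  // reduce: combines all values that share a key into one summary value
  def reduce(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val docs = Seq("d1" -> "big data big compute", "d2" -> "big ideas")
    val intermediate = docs.flatMap { case (id, text) => map(id, text) }
    // the "shuffle and sort" step: group intermediate pairs by key
    val result = intermediate
      .groupBy(_._1)
      .map { case (w, pairs) => reduce(w, pairs.map(_._2)) }
    result.toSeq.sortBy(_._1).foreach(println) // (big,3), (compute,1), (data,1), (ideas,1)
  }
}
```

In a real MapReduce engine the grouping step is the distributed shuffle; the programmer only supplies the two functions.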

Where does Google use MapReduce?

Input:
- crawled documents
- web request logs

Output:
- inverted indices
- graph structure of web documents
- summaries of the number of pages crawled per host
- the set of most frequent queries in a day

Storage Model

- Data is stored in large files (TB, PB)
  - e.g. market-basket data (more when we do data mining) or web data
- Files are divided into chunks, typically many MB (64 MB)
- Sometimes each chunk is replicated for fault tolerance (later in distributed DBMS)

MapReduce Steps

Input is typically (key, value) pairs, but could be objects of any type. Map and reduce are performed by a number of processes, physically located on some processors.

1. Read data
2. Map: extract some information of interest in (key, value) form
3. Shuffle and sort: send pairs with the same key to the same reduce process
4. Reduce: operate on the values of the same key, e.g. transform, aggregate, summarize, filter
5. Output the results: (key, final-result)

Simple Examples of MapReduce

- Word counting
- Inverted indexes

(Ack: slide by Prof. Shivnath Babu)

Map Function

- Each map process works on a chunk of data
- Input: (input-key, value)
- Output: (intermediate-key, value) -- may not be the same as the input key and value
- Example: list all doc ids containing a word
  - the output of map emits each (word, docid) pair
  - word is the key, docid is the value
  - duplicate elimination can be done in the reduce phase
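A sketch of just the map side of this inverted-index example (a hypothetical Scala function, not the slides' code); note that duplicate pairs are deliberately left for the reduce phase:

```scala
object InvertedIndexMap {
  // map: (docId, text) -> one (word, docId) pair per word occurrence;
  // duplicates survive here and can be eliminated in the reduce phase
  def map(docId: String, text: String): Seq[(String, String)] =
    text.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, docId)).toSeq

  def main(args: Array[String]): Unit =
    map("doc1", "error error warning").foreach(println)
  // prints: (error,doc1), (error,doc1), (warning,doc1)
}
```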

Reduce Function

- Input: (intermediate-key, list-of-values-for-this-key)
  - the list can include duplicates
  - each map process can leave its output on its local disk; each reduce process can then retrieve its portion
- Output: (output-key, final-value)
- Example: list all doc ids containing a word
  - the output will be a list of (word, [doc-id1, doc-id5, ...])
  - if the count is needed instead, reduce counts the number of docs, and the output will be a list of (word, count)

Example Problem

Explain how the following query will be executed in MapReduce; specify the computation performed in the map and the reduce functions:

SELECT a, max(b) as topb
FROM R
WHERE a > 0
GROUP BY a

Map phase: each map task
- scans a block of R
- calls the map function for each tuple
- the map function applies the selection predicate a > 0 to the tuple
- for each tuple satisfying the selection, it outputs a record with key = a and value = b
- when a map task scans multiple relations, it needs to output something like key = a and value = ('R', b), which carries the relation name R

Shuffle phase: the MapReduce engine reshuffles the output of the map phase and groups it on the intermediate key, i.e. the attribute a
- Note that the programmer has to write only the map and reduce functions; the shuffle phase is done by the engine (although the programmer can rewrite the partition function), but you should still mention it in your answers

Reduce phase: each reduce task
- computes the aggregate value max(b) = topb for each group (i.e. each value of a) assigned to it, by calling the reduce function
- outputs the final results: (a, topb)
- a local combiner can be used in the map tasks to compute a local max before the data gets reshuffled
- multiple aggregates can be output by the reduce phase, e.g. key = a and value = (sum(b), min(b))
- sometimes a second (third, etc.) level of map-reduce phases might be needed
- (a compact sketch of these map and reduce functions appears at the end of this section)

More Terminology

(There is, however, no uniform terminology across systems.)

- A MapReduce job: e.g. count the words in all docs; complex queries can consist of multiple MR jobs
- Map or reduce tasks: a group of map or reduce functions scheduled on a single worker
- Worker: a process that executes one task at a time; one per processor, so 4-8 per machine
- A master controller:
  - divides the data into chunks
  - assigns different processors to execute the map function on each chunk
  - other/same processors execute the reduce functions on the outputs of the map functions

(Ack: slide by Prof. Dan Suciu)
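The sketch promised above: a minimal, single-machine Scala simulation of the map and reduce functions for this query (the relation R and all names are hypothetical; a real job would run these functions inside a MapReduce engine, which also performs the shuffle):

```scala
object MaxByGroup {
  // map: apply the selection predicate a > 0; emit key = a, value = b
  def map(tuple: (Int, Int)): Seq[(Int, Int)] = tuple match {
    case (a, b) if a > 0 => Seq((a, b))
    case _               => Seq.empty
  }

  // reduce: for each group key a, compute max(b) = topb and emit (a, topb)
  def reduce(a: Int, bs: Seq[Int]): (Int, Int) = (a, bs.max)

  def main(args: Array[String]): Unit = {
    val r = Seq((1, 5), (1, 9), (-2, 7), (3, 4)) // hypothetical relation R(a, b)
    val shuffled = r.flatMap(map).groupBy(_._1)  // the engine's shuffle phase
    val result   = shuffled.map { case (a, g) => reduce(a, g.map(_._2)) }
    result.toSeq.sorted.foreach(println)         // prints (1,9) and (3,4)
  }
}
```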

Why is MapReduce Popular?

Distributed computation before MapReduce:
- how to divide the workload among multiple machines?
- how to distribute data and programs to other machines?
- how to schedule tasks?
- what happens if a task fails while running?
- and so on...

Distributed computation after MapReduce:
- how to write the map function?
- how to write the reduce function?

Developers' tasks are made easy.

Handling Fault Tolerance in MR

Although the probability of any single machine failing is low, some machine failing among thousands of machines is common.

- Worker failure
  - the master sends a heartbeat to each worker node
  - if a worker node fails, the master reschedules the tasks handled by that worker
- Master failure
  - the whole job gets restarted through a different master

Other Aspects of MapReduce

- Locality
  - the input data is managed by GFS
  - choose the cluster of machines such that those machines contain the input data on their local disks
  - this conserves network bandwidth
- Task granularity
  - it is preferable for the number of tasks to be a multiple of the number of worker nodes
  - the smaller the partition size, the faster the failover and the better the granularity of load balancing, but this incurs more overhead; a balance is needed
- Backup tasks
  - to cope with stragglers, the master schedules backup executions of the remaining in-progress tasks

Apache Hadoop

- Apache Hadoop provides open-source versions of GFS and Google MapReduce:
  - GFS -> HDFS (Hadoop Distributed File System)
  - Google MapReduce -> Hadoop MapReduce
- You can download the software and implement your own applications

Pros and Cons of MapReduce

- MapReduce is good for off-line batch jobs on large data sets
- MapReduce is not good for iterative jobs, due to high I/O overhead: each iteration needs to read/write data from/to GFS
- MapReduce is bad for jobs on small datasets and jobs that require a low-latency response

Spark

What is Spark?

A distributed, in-memory, large-scale data processing engine!

- Not a modified version of Hadoop
- A separate, fast, MapReduce-like engine
  - in-memory data storage for very fast iterative queries
  - general execution graphs and powerful optimizations
  - up to 40x faster than Hadoop; up to 100x faster in memory (2-10x on disk)
- Compatible with Hadoop's storage APIs
  - can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

Applications (Big Data Analysis)

- In-memory analytics and anomaly detection (Conviva)
- Interactive queries on data streams (Quantifind)
- Exploratory log analysis (Foursquare)
- Traffic estimation with GPS data (Mobile Millennium)
- Twitter spam classification (Monarch)
- ...

Why a New Programming Model?

MapReduce greatly simplified big data analysis, but as soon as it got popular, users wanted more:
- more complex, multi-stage iterative applications (graph algorithms, machine learning)
- more interactive ad-hoc queries
- more real-time online processing

All three of these applications require fast data sharing across parallel jobs.

NOTE: What were the workarounds in the MR world? YSmart [1], Stubby [2], PTF [3], HaLoop [4], Twister [5]

Data Sharing in MapReduce

Each iteration or query reads its input from stable storage and writes its result back (read, iter. 1, write; read, iter. 2, write; ...). This is slow due to replication, serialization, and disk IO.

Data Sharing in Spark

After one-time processing, iterations and queries share data through distributed memory, which is 10-100x faster than network and disk.

RDD: Spark Programming Model

Key idea: Resilient Distributed Datasets (RDDs)
- distributed collections of objects that can be cached in memory or stored on disk across cluster nodes
- manipulated through various parallel operators
- automatically rebuilt on failure (How? Using lineage)
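A minimal sketch of this model using the Spark Scala API (the log path is hypothetical; this assumes a local Spark installation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Runs locally; a real deployment would point at a cluster master instead
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations are lazy: they only record the RDD's lineage
    val lines  = sc.textFile("hdfs://namenode/logs/*.log") // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache() // keep the filtered data in memory for repeated queries

    // Actions trigger the actual distributed computation
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count()) // reuses cached data

    sc.stop()
  }
}
```

If a node holding cached partitions fails, Spark rebuilds just those partitions by replaying the recorded lineage (textFile, then filter).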

Additional Slides on Spark (Optional Reading)

Ack: The following slides are by Prajakta Kalmegh.

More on RDDs

- Transformations: RDDs are created through deterministic operations on either data in stable storage or other RDDs
- Lineage: an RDD carries enough information about how it was derived from other datasets
- Immutability: an RDD is a read-only, partitioned collection of records
- Checkpointing: RDDs with long lineage chains can be checkpointed in the background
- Mitigating stragglers: backup tasks can be used to recompute transformations on RDDs
- Persistence level: users can choose a reuse storage strategy (caching in memory, storing the RDD only on disk, or replicating it across machines), and also a persistence priority for data spills
- Partitioning: users can ask that an RDD's elements be partitioned across machines based on a key in each record (a short sketch of these last two knobs appears at the end of this subsection)

RDD Transformations, Actions, and the DAG of RDDs

Note: lazy evaluation is a very important concept -- transformations only build up the DAG of RDDs; actions trigger execution.

See:
- https://spark.apache.org/docs/1.0.1/cluster-overview.html
- http://www.tothenew.com/blog/spark-1o3-spark-internals/
- https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data. For example:

messages = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))

Representing RDDs

Graph-based representation, with five components per RDD (a set of partitions, a set of dependencies on parent RDDs, a function for computing the dataset from its parents, and metadata about its partitioning scheme and preferred data placement). For the lineage above:
- HadoopRDD: path = hdfs://...
- FilteredRDD: func = _.contains(...)
- MappedRDD: func = _.split(...)

Tradeoff: low computation cost (cache more RDDs) vs. high memory cost (not much work for the GC).
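A short Scala sketch of the user-facing knobs listed above -- persistence level and key-based partitioning -- and of lazy evaluation (file name and partition count are illustrative; same local-Spark assumption as before):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddKnobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-knobs").setMaster("local[*]"))

    // Key each record by its first tab-separated field
    val pairs = sc.textFile("data.tsv").map(line => (line.split("\t")(0), line))

    // Partitioning: co-locate records with the same key on the same machine
    val partitioned = pairs.partitionBy(new HashPartitioner(8))

    // Persistence level: deserialized in memory, spilling to disk when needed
    partitioned.persist(StorageLevel.MEMORY_AND_DISK)

    // Lazy evaluation: nothing has run yet; this action executes the whole DAG
    println(partitioned.count())
    sc.stop()
  }
}
```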

Representing RDDs (Dependencies)

- Narrow dependencies (one-to-one, many-to-one): each partition of the child RDD depends on a bounded number of parent partitions, e.g. map, filter
- Wide (shuffle) dependencies (many-to-many): each partition of the child RDD may depend on all parent partitions, requiring a shuffle, e.g. groupByKey

Representing RDDs (An Example)

(Dependency-graph diagram in the original slides.)

Advantages of the RDD Model

(Comparison table in the original slides.)

Checkpoint!

- Data sharing in Spark and some applications
- RDDs: definition, model, representation, advantages
- Other engine features (implementation; not covered in detail):
  - Deployment: Spark local vs. Spark standalone vs. Spark cluster (resource sharing handled by YARN/Mesos)
  - Job scheduling: DAGScheduler vs. TaskScheduler (fair vs. FIFO at task granularity)
  - Memory management: serialized in-memory (fastest) vs. deserialized in-memory vs. on-disk persistence
  - Support for checkpointing: a tradeoff between using lineage to recompute partitions and checkpointing partitions on stable storage
  - Interpreter integration: ship instances of external variables referenced in a closure, along with the closure class, to the worker nodes, in order to give them access to these variables
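As a sketch of the checkpointing tradeoff mentioned above (directory, sizes, and chain length are hypothetical): checkpointing materializes an RDD to stable storage so that recovery does not have to replay a long lineage chain.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ckpt").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical stable-storage dir

    // Build a deliberately long lineage chain
    var rdd = sc.parallelize(1 to 1000000)
    for (_ <- 1 to 50) rdd = rdd.map(_ + 1)

    rdd.checkpoint()     // mark for checkpointing; written when an action runs
    println(rdd.count()) // afterwards, recovery reads the checkpoint
                         // instead of recomputing the 50 map steps
    sc.stop()
  }
}
```

Without the checkpoint, losing a partition late in the chain forces Spark to recompute all fifty map steps from the original data; with it, recovery starts from the materialized copy.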