Lecture 11 Hadoop & Spark


Lecture 11 Hadoop & Spark
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline
- Distributed File Systems
- Hadoop Ecosystem
- Hadoop Architecture and Features
- Apache Spark

How do we get data to the workers? [Diagram: compute nodes reading data over the network from shared NAS/SAN storage]

Distributed File System
Don't move data to the workers; move the workers to the data:
- Store data on the local disks of nodes in the cluster
- Start the workers on the nodes that hold the data locally
Why? There is not enough RAM to hold all the data in memory, and while disk access is slow, disk throughput is reasonable. A distributed file system is the answer:
- GFS (Google File System)
- HDFS (Hadoop Distributed File System)

Apache Hadoop Ecosystem [Diagram: Pig, R, Hive, HBase, and Cassandra layered over MapReduce, all running on the Hadoop Distributed File System (HDFS)]

Hadoop
- Designed to store data reliably on commodity hardware
- Redundant storage: hardware failures are expected, and a fault-tolerance mechanism handles them
- Intended for large files: not suitable for small data sets or for low-latency data access
COP6727

Hadoop
- Files are stored as a collection of blocks
- Blocks are 64 MB chunks of a file (configurable)
- Blocks are replicated on 3 nodes (configurable)
- The NameNode (NN) manages metadata about files and blocks
- DataNodes (DN) store and serve blocks
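The block bookkeeping above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `split_into_blocks` and `place_replicas` are illustrative names, not HDFS APIs, and real HDFS replica placement is rack-aware rather than round-robin.

```python
def split_into_blocks(file_size, block_size=64 * 2**20):
    """Number of blocks needed for a file, with a 64 MB default block size."""
    return (file_size + block_size - 1) // block_size  # ceiling division

def place_replicas(block_id, nodes, replication=3):
    """Toy replica placement: pick `replication` distinct nodes round-robin.
    Only illustrates the bookkeeping, not HDFS's rack-aware policy."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

# A 200 MB file needs 4 blocks (3 full 64 MB blocks plus a final partial one),
# and each block lands on 3 of the cluster's nodes.
blocks = split_into_blocks(200 * 2**20)
replicas = place_replicas(0, ["n1", "n2", "n3", "n4"])
```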

Jobs and Tasks in Hadoop
- Job: a user-submitted map and reduce implementation to apply to a data set
- Task: a single mapper or reducer instance; failed tasks are retried automatically, and tasks run local to their data when possible
- The JobTracker (JT) manages job submission and task delegation
- TaskTrackers (TT) ask for work and execute tasks
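A job's map and reduce implementations can be sketched in plain Python, using word count as the example. This is a single-machine simulation of the programming model, not the Hadoop API; the shuffle phase is emulated with a sort-and-group.

```python
from itertools import groupby

def mapper(line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce: sum all counts emitted for one key."""
    yield (word, sum(counts))

def run_job(lines):
    # Map phase: apply the mapper to every input record.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle: bring all pairs with the same key together (here, sort + group).
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: one reducer call per distinct key.
    result = {}
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        for k, v in reducer(word, (count for _, count in group)):
            result[k] = v
    return result

counts = run_job(["to be or not", "to be"])
```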

Hadoop Architecture [Diagram: a client submits work to the JobTracker and NameNode (backed by a Secondary NameNode); each worker node runs a DataNode paired with a TaskTracker]

Typical Hadoop Cluster [Diagram: a core switch connects per-rack switches; the NameNode, JobTracker, and Secondary NameNode run on dedicated nodes, while racks 1 through N hold worker nodes, each running a DataNode + TaskTracker]

Hadoop Fault Tolerance
- If a task crashes, retry it on another node. This is safe for a map because it has no dependencies, and safe for a reduce because map outputs are on disk.
- If a node crashes, re-launch its current tasks on other nodes, and re-run any maps the node previously ran to regenerate their output data.
- If a task is running slowly (a straggler), launch a second copy of the task on another node ("speculative execution").
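Speculative execution can be sketched as running two copies of the same task concurrently and keeping whichever finishes first. This is a toy single-machine model using threads; the sleep delays stand in for a straggler node and a healthy node, and `run_with_speculation` is an illustrative name, not a Hadoop API.

```python
import concurrent.futures
import time

def run_with_speculation(task, delays=(0.5, 0.01)):
    """Launch one copy of `task` per delay and return the first result.
    Each delay simulates how loaded the node running that copy is."""
    def attempt(delay):
        time.sleep(delay)               # straggler (0.5 s) vs. healthy node (0.01 s)
        return task()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(attempt, d) for d in delays]
        done, not_done = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in not_done:              # the slower copy's result is discarded
            f.cancel()
        return done.pop().result()

result = run_with_speculation(lambda: sum(range(10)))
```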

Hadoop Fault Tolerance (reactive)
- Worker failure: workers are periodically pinged by the master (heartbeat); no response means the worker has failed, and its tasks are reassigned to another worker.
- Master failure: the master writes periodic checkpoints, so another master can be started from the last checkpointed state. If the master dies and cannot be recovered, the job is aborted.

Hadoop Data Locality
- Move computation to the data: Hadoop tries to schedule tasks on nodes that hold the data; when this is not possible, the TaskTracker has to fetch the data from a DataNode.
- With locality, thousands of machines read their input at local-disk speed. Without it, rack switches limit the read rate and network bandwidth becomes the bottleneck.
COP6727

Hadoop Scheduling
- Fair: conducts fair scheduling, using a greedy method to maintain data locality
- Delay: uses the delay-scheduling algorithm to achieve good data locality by slightly relaxing the fairness constraint
- LATE (Longest Approximate Time to End): improves MapReduce performance in heterogeneous environments, such as virtualized clusters, through more accurate speculative execution
- Capacity: supports multiple queues for shared users and guarantees each queue a fraction of the cluster's capacity
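The delay-scheduling idea can be sketched as follows: when a node slot frees up, a job is given a data-local task if it has one, and is only allowed a non-local task after being skipped a bounded number of times. This is an illustrative sketch; the function and field names are made up, not the real scheduler API.

```python
def delay_schedule(pending, free_node, skipped, max_skips=3):
    """Pick a task for `free_node` from the job's `pending` task list.
    Returns (task_or_None, new_skip_count)."""
    for i, task in enumerate(pending):
        if free_node in task["data_nodes"]:     # a data-local task exists: run it
            return pending.pop(i), 0
    if skipped >= max_skips:                    # waited long enough: fairness wins
        return pending.pop(0), 0                # launch a non-local task
    return None, skipped + 1                    # skip this slot, hope for locality

# One job whose only task's input data lives on node n2, but only n1 ever frees up:
pending = [{"id": "t1", "data_nodes": {"n2"}}]
history, skips = [], 0
for node in ["n1", "n1", "n1", "n1"]:
    task, skips = delay_schedule(pending, node, skips)
    history.append(task["id"] if task else None)
# The job is skipped three times, then runs t1 non-locally on the fourth slot.
```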

MPI vs. MapReduce

                      MPI        MapReduce
Communication         Explicit   Implicit
Fault tolerance       No         Yes
Main resource         Memory     I/O
Latency               Low        High

Other Frameworks
- Batch processing: Hadoop (independent tasks), GraphLab (dependent tasks)
- Interactive processing: Drill; Spark (low latency for interactive data analysis, resilient distributed datasets, primitives for in-memory computing, up to 100x faster than Hadoop)
- Stream processing: Storm, Apache S4

Apache Spark [Diagram: Spark running on top of HDFS]

Use Memory Instead of Disk [Diagram: in MapReduce, each iteration reads its input from HDFS and writes its result back to HDFS, and each query re-reads the input from HDFS before producing its result]

In-Memory Data Sharing [Diagram: the input is read from HDFS once, then successive iterations and queries share the data through distributed memory, which is 10-100x faster than network and disk]

Resilient Distributed Datasets (RDDs)
- Write programs in terms of operations on distributed datasets
- Partitioned collections of objects spread across a cluster, stored in memory or on disk
- RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save)
- RDDs are automatically rebuilt on machine failure
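The RDD idea (lazy transformations recorded as a lineage, actions forcing computation, and lost data rebuilt by re-running the lineage) can be sketched in plain Python. This is a single-machine toy under stated assumptions, not the Spark API; `ToyRDD` and its methods are illustrative names.

```python
class ToyRDD:
    """Single-machine sketch of an RDD: transformations are lazy and only
    record how to compute the data (the lineage); actions force computation;
    cache() keeps the result in memory, and losing the cached copy just
    means recomputing from the lineage."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn   # the lineage for this dataset
        self._cache = None

    # Transformations: return a new ToyRDD, compute nothing yet.
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        self._cache = list(self._compute_fn())
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache          # served from memory
        return self._compute_fn()       # rebuilt from lineage (fault recovery)

    # Actions: force the computation.
    def collect(self):
        return list(self._materialize())

    def count(self):
        return len(self._materialize())

squares = ToyRDD(lambda: range(10)).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0).cache()
```

Dropping the cached copy (as a machine failure would) and calling an action again yields the same data, because the chain of transformations, the lineage, is still known and can be replayed.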

The Spark Computing Framework
Provides a programming abstraction and parallel runtime that hide the complexities of fault tolerance and slow machines:
- "Here's an operation; run it on all of the data."
- "I don't care where it runs (you schedule that)."
- "In fact, feel free to run it twice on different nodes."

Tradeoff Space [Chart: granularity of updates (fine to coarse) vs. write throughput (low to high). Fine-grained K-V stores, databases, and RAMCloud are bounded by network bandwidth and are best for transactional workloads; coarse-grained HDFS and RDDs reach memory bandwidth and are best for batch workloads]

Scalability [Charts: iteration time in seconds vs. number of machines (25, 50, 100) for Logistic Regression and K-Means, comparing Hadoop, HadoopBinMem, and Spark; Spark has the shortest iteration times at every cluster size, and the gap persists as machines are added]

Spark and MapReduce Differences

                          Hadoop MapReduce    Spark
Storage                   Disk only           In-memory or on disk
Operations                Map and Reduce      Map, Reduce, Join, Sample, etc.
Execution model           Batch               Batch, interactive, streaming
Programming environments  Java                Scala, Java, R, and Python

Summary
- Distributed File Systems
- Hadoop Ecosystem
- Hadoop Architecture and Features
- Apache Spark