Machine learning library for Apache Flink


Machine learning library for Apache Flink

MTP Mid Term Report submitted to Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech.

by Devang Bacharwar (B2059)
under the guidance of Dr. Arti Kashyap

SCHOOL OF COMPUTING AND ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MANDI
30th March 2016

CERTIFICATE OF APPROVAL

Certified that the Mid Term Report entitled "Machine learning library for Apache Flink", submitted by Devang Bacharwar (B2059) to the Indian Institute of Technology Mandi for the partial fulfilment of the degree of B. Tech., has been accepted after examination held today.

Date: 30 March 2016
Place: KAMAND, H. P., INDIA

Dr. Varun Dutt
Faculty Advisor


CERTIFICATE

This is to certify that the Mid Term Report titled "Machine learning library for Apache Flink", submitted by Devang Bacharwar (B2059) to the Indian Institute of Technology Mandi, is a record of bona fide work under my (our) supervision and is worthy of consideration for partial fulfilment of the degree of B. Tech. of the Institute.

Date: 30 March 2016
Place: KAMAND, H. P., INDIA

Dr. Arti Kashyap
Faculty Supervisors

DECLARATION BY THE STUDENT

This is to certify that the Mid Term Report titled "Machine Learning library for Apache Flink", submitted by me to the Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech., is a bona fide record of work carried out by me under the supervision of Dr. Arti Kashyap. The contents of this MTP, in full or in parts, have not been submitted to any other Institute or University for partial fulfillment of any degree or diploma.

Date: 30 March 2016
Place: KAMAND, H. P., INDIA

Devang Bacharwar
B2059

Acknowledgments

I would like to thank Dr. Arti Kashyap for giving me a chance to work on an exciting open source computing engine project.

Devang Bacharwar

Abstract

The volume of digital data that people generate every day is now huge. Since Google's publication of the paradigm, MapReduce has been the most widely used approach to big data processing. MapReduce provides a simple and elegant framework for writing parallel programs that can run on thousands of nodes built from cheap hardware. Apache Flink is a computation engine for batch and stream processing of data. Although the MapReduce paradigm makes big data processing easier, algorithms need to be modified to fit it. Most machine learning algorithms implemented in this paradigm need to be tuned to run at scale, and the implementation also depends on the way the computation engine operates. Decision tree learning uses a decision tree to model a sequence of observations as a set of if-then statements.

Keywords: Apache, Flink, Machine learning, Decision tree learning, Apache Hadoop, MapReduce

Table of Contents

Acknowledgement
Abstract
1. Introduction
2. Objectives
3. Background and Related Work
4. Discussion and conclusions of results
5. Deliverables
6. Timeline
7. Summary of work
8. References

1. Introduction

Big Data and its platforms

Big data refers to processing huge amounts of data on distributed hardware. In 2004 Google published the MapReduce paper [1], authored by Jeffrey Dean and Sanjay Ghemawat, which showed how large data sets, once distributed, can be processed on cheap hardware instead of very high-end machines. The functional approach in that paper, built on map and reduce functions, motivated the MapReduce paradigm: the programmer describes the computation and does not have to worry about resource allocation and similar concerns.

The paper led to various projects based on the MapReduce paradigm. Hadoop is an open source implementation of Google's MapReduce paradigm, accompanied by projects such as YARN and the Hadoop Distributed File System (HDFS). Apache Spark is another computation engine; it generalizes the MapReduce model and improves performance using Resilient Distributed Datasets (RDDs) [2]. Apache Flink is a comparable engine, although it executes programs as streaming dataflows rather than building on RDDs. Flink provides a richer set of operations than plain MapReduce, such as join, count, group and cogroup. It also provides native support for iterations, which makes it well suited to running machine learning algorithms on big data.

MapReduce eases the distribution of data and makes writing parallel programs simple. Most machine learning algorithms, however, need to be reworked, both because the paradigm is functional and because the algorithm must be parallelized in a way that minimizes the communication cost between nodes.

2. Objectives

The objective of this project is to implement a parallel decision tree algorithm for Apache Flink and to observe statistics for the implementation. This can be subdivided into the following tasks:

1) Study and implement the decision tree algorithm for a single node.
2) Study the parallel decision tree algorithm.
3) Try different parallelizations of the algorithm (MapReduce).
4) Implement the online histograms needed for parallelizing the decision tree algorithm.
5) Implement the parallel decision tree algorithm.
6) Package the implementation as a library.
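To make the map and reduce functions described in the introduction concrete, here is a minimal single-process sketch in Python of the word count example. It is not Hadoop or Flink code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the `shuffle` step only imitates the grouping by key that a real framework performs between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, imitating the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, both phases can in principle be spread over many machines, which is the property the paradigm relies on.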

3. Background and Related Work

Stratosphere Project

Apache Flink started as a project named Stratosphere at TU Berlin. Stratosphere was accepted as an incubator project under the Apache Software Foundation in April 2014 under the name Flink. Flink was later approved as a top-level Apache project.

FlinkML

The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to Flink. The goal is to design and implement a system that is scalable and can deal with problems of various sizes, whether the data size is measured in megabytes or in terabytes and beyond. FlinkML has a roadmap developed by the Flink community.

Roadmap:
Pipelines of transformers and learners
Data pre-processing: feature scaling; polynomial feature base mapper; feature hashing; feature extraction for text; dimensionality reduction
Model selection and performance evaluation: model evaluation using a variety of scoring functions; cross-validation for model selection and evaluation; hyper-parameter optimization
Supervised learning: optimization framework (Stochastic Gradient Descent, L-BFGS); Generalized Linear Models (multiple linear regression; LASSO, Ridge regression; multi-class logistic regression); random forests; Support Vector Machines; decision trees
Unsupervised learning: clustering (K-means clustering)

Principal Components Analysis
Recommendation: ALS
Text analytics: LDA
Statistical estimation tools
Distributed linear algebra
Streaming ML

Ones in bold have already been implemented.

Explore Apache Flink

To get the feel of Apache Flink as an end user, I studied the different APIs of Apache Flink and ran a few examples such as wordcount and single-node k-means for two-dimensional data points. I set up the Flink GUI, which is used for monitoring activity on the Flink engine and keeping a log. It also shows a directed acyclic graph for each job.

Figure 1: Current Architecture of Apache Flink

Understanding Apache Flink ML Code

Apache Flink is written in Java and Scala. I was new to both of these programming languages, so I worked on picking them up. I read FlinkML's code base and the two-dimensional k-means implementation that the community provides as an example.

MapReduce and Apache Hadoop

MapReduce is a paradigm that uses map and reduce functions, on the same or on different processing units, to process large amounts of data efficiently on cheap hardware. This paradigm was introduced by Jeff Dean and Sanjay Ghemawat. Apache Hadoop was the first open source implementation to exploit this paradigm for big data processing.

Work done

Decision tree learning

Decision tree learning is a simple machine learning method based on decision trees, which give a transparent view of the way decisions are made. A decision tree models the observations as simple if-then statements at every node, based on the attributes. The leaves of the tree represent the class labels. The split at each node is decided by a criterion such as Gini impurity or entropy. For the single-node decision tree, both Gini impurity and entropy have been implemented.

$$p(i) = \mathrm{frequency}(i) = \frac{\mathrm{count}(\text{outcome } i)}{\mathrm{count}(\text{total rows})}$$

$$\mathrm{Entropy} = -\sum_{i} p(i)\,\log p(i) \quad \text{(summed over all outcomes } i\text{)}$$

A single-node implementation was first coded in Python and then in Scala. It was test run on the tic-tac-toe endgame dataset (multivariate, 9 categorical attributes).

Parallel decision tree algorithm in MapReduce paradigm

Parallelizing a decision tree in the MapReduce paradigm is a difficult task. A design for the MapReduce version was worked out, in which the data is distributed vertically. The proposed MapReduce version uses 4 sequences of Map and Reduce. The 3 layers of tasks that are performed are:

1) Initialize fields and tree
a) This job scans the dataset to determine whether each field is numerical or categorical. The objective field needs to be categorical. It also collects attributes such as sum, min and max.
b) The map function iterates over the instances and returns key-value pairs of field and value.
c) The reduce function returns the type of each field, together with key attributes: min, max and sum for numerical fields, and the number of instances per category for categorical fields.
d) Map input: (holder_key, exampleinstance)
   Map output: (field, field value)
   Reduce: (field and field features)

2) Find best split for each field-node pair
a) The map method finds the leaf node of the current tree into which each key-value pair fits and returns the leaf id with the key id.
b) After the counts for all instances have been gathered for every split candidate, the information gain is computed.
c) This finds the best split for each field.
d) Map input: (holder_key, exampleinstance)
   Map output: ([leaf, field], [field_value, classvalue])
   Reduce: (holder_key, [leaf, field, field_split, [informationgain, classcounts], [informationgain, classcounts], ...])

3) Find best split for each node
a) The mapper takes the leaf id from the output of the second MapReduce job and makes it the key.
b) The reducer then finds the key with the maximum information gain and uses it to split the node.
c) This finds the best split for each node.
d) Map input: (holder_key, [leaf, field, field_split, [informationgain, classcounts], [informationgain, classcounts], ...])
   Map output: (leaf, [field, field_split, [informationgain, classcounts], [informationgain, classcounts], ...])
   Reduce: (holder_key, [leaf, field, field_split, [informationgain, classcounts], [informationgain, classcounts], ...])

Four MapReduce jobs were set up:
1) NodeSplit
2) NodeFieldSplit
3) Grow Subtrees
4) Define fields

Since a big data set was needed, the runs were done on the poker hand data set [4]. The data set consists of categorical and integer data for classification.

No. of predictive attributes = 10
No. of goal attributes = 1
No. of instances = 25,010 training, 1,000,000 testing
Sample space = 311,875,200
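The second and third steps of the design above (best split per field-node pair, then best split per node) can be sketched in plain Python as follows. This is an illustrative single-process sketch, not the Java/Hadoop code written for the project. It rests on several assumptions that the report does not state: every field is categorical and is split multiway on its values, entropy uses a base-2 logarithm, an instance is a `(features, label)` tuple, and `leaf_of` is a hypothetical function that maps a feature tuple to the id of the leaf it currently falls into.

```python
import math
from collections import Counter, defaultdict


def entropy(class_counts):
    """Shannon entropy (base 2) of a Counter mapping class label -> count."""
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)


def map_field_values(instance, leaf_of):
    """Step 2 map: emit ((leaf, field), (field_value, class_value)) pairs."""
    features, label = instance
    leaf = leaf_of(features)
    return [((leaf, field), (value, label))
            for field, value in enumerate(features)]


def reduce_information_gain(key, values):
    """Step 2 reduce: information gain of a multiway split on one field."""
    leaf, field = key
    parent = Counter(label for _, label in values)
    children = defaultdict(Counter)
    for value, label in values:
        children[value][label] += 1
    n = len(values)
    remainder = sum(sum(c.values()) / n * entropy(c)
                    for c in children.values())
    return leaf, (field, entropy(parent) - remainder)


def best_split_per_leaf(instances, leaf_of):
    """Chain steps 2 and 3 and return {leaf: (field, information_gain)}."""
    grouped = defaultdict(list)      # shuffle for step 2, keyed by (leaf, field)
    for instance in instances:
        for key, value in map_field_values(instance, leaf_of):
            grouped[key].append(value)
    per_leaf = defaultdict(list)     # shuffle for step 3, keyed by leaf
    for key, values in grouped.items():
        leaf, candidate = reduce_information_gain(key, values)
        per_leaf[leaf].append(candidate)
    # Step 3 reduce: keep the candidate with the largest information gain.
    return {leaf: max(candidates, key=lambda c: c[1])
            for leaf, candidates in per_leaf.items()}
```

In a real MapReduce job the two dictionaries that play the role of the shuffle would be replaced by the framework's grouping by key, so only the map and reduce functions would carry over.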

Accuracy =
Result: /

Data stream

A data stream is a possibly infinite series of training examples {(x_1, y_1), ..., (x_n, y_n)}, where the x_i are the observation vectors and each y_i is the class that x_i belongs to. The data streams are assumed to arrive at multiple processing nodes. With W processing nodes, each node observes around 1/W of the training examples. The partitioning is needed for several reasons: a single node cannot store all the incoming data because of its volume, nor can it process that volume of data in a timely manner.

Streaming parallel decision tree algorithm

The streaming parallel decision tree algorithm [3] uses online histograms to reduce the communication overhead between the processing nodes. An online histogram sends an overview of the data observed at a processing node using B bins. The processing nodes construct histograms for the data they observe and send them to the master node, which builds and updates the tree from these histograms. An online histogram provides update, merge, sum and uniform functions.

a. Compressing data for communication between master and slave nodes
i. The data is compressed by building online histograms at the slave nodes as they observe the data. Histograms are good statistical tools for approximating large amounts of data, and they reduce the communication overhead between the nodes.

b. Histogram procedures
The volume of data that Flink is expected to deal with is generally on the scale of 100 GB to terabytes. This amount of data can neither be kept in primary memory nor be transferred efficiently between machines without preprocessing. The data needs to be compressed before it can be transferred between machines when the work is distributed between different processors. Approximating the data is one way to get an overview of it that can be transferred between nodes. Online histograms are representative of the data: they approximate the data received at the slave nodes so that it can be transferred to the master node to build the tree. This is a better approach than sampling, which might remove relevant information from the training examples. Online histograms can be used for streaming data as well as batch data. They are built and updated efficiently and quickly, and they use constant memory.

c. Decision tree building algorithm

4. Discussion and Conclusions

I started with random forest and explored a few research papers [4]. Random forest requires a decision tree as well, which is not yet available as a library for Flink. So the final algorithm that I will be implementing for Apache Flink is the decision tree.

5. Deliverables

The deliverable for this project will be an implementation of parallel decision tree learning on Apache Flink. Intermediate deliverables are a single-node decision tree implementation, a distributed decision tree implementation, and a decision tree implementation on Hadoop.
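As a pointer toward the streaming deliverable, the online histogram described in Section 3 can be sketched in Python along the lines of the procedures in [3]. This is a minimal sketch under stated assumptions, not the project's implementation: the class and method names (`OnlineHistogram`, `sum_up_to`, `_shrink`) are hypothetical, the `uniform` procedure is left out for brevity, and queries below the smallest centroid simply return 0, which is a simplification.

```python
import bisect


class OnlineHistogram:
    """Fixed-size histogram of [centroid, count] bins, sorted by centroid."""

    def __init__(self, max_bins):
        self.max_bins = max_bins      # the parameter B in the text
        self.bins = []

    def _shrink(self):
        """Merge the two closest adjacent bins until at most max_bins remain."""
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]]

    def update(self, point):
        """Add one observation to the histogram."""
        centroids = [p for p, _ in self.bins]
        i = bisect.bisect_left(centroids, point)
        if i < len(self.bins) and self.bins[i][0] == point:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [point, 1])
        self._shrink()

    def merge(self, other):
        """Return a new histogram that summarises both inputs."""
        combined = {}
        for p, m in self.bins + other.bins:
            combined[p] = combined.get(p, 0) + m
        merged = OnlineHistogram(self.max_bins)
        merged.bins = [[p, m] for p, m in sorted(combined.items())]
        merged._shrink()
        return merged

    def sum_up_to(self, b):
        """Estimate the number of observations <= b by trapezoid interpolation."""
        if not self.bins or b < self.bins[0][0]:
            return 0.0
        if b >= self.bins[-1][0]:
            return float(sum(m for _, m in self.bins))
        centroids = [p for p, _ in self.bins]
        i = bisect.bisect_right(centroids, b) - 1   # p_i <= b < p_(i+1)
        (p_i, m_i), (p_next, m_next) = self.bins[i], self.bins[i + 1]
        frac = (b - p_i) / (p_next - p_i)
        m_b = m_i + (m_next - m_i) * frac
        below = sum(m for _, m in self.bins[:i])
        return below + m_i / 2 + (m_i + m_b) / 2 * frac
```

In the setting described earlier, each slave node would call `update` on its own histogram and the master would combine the results with `merge`, so only at most B bins per histogram cross the network rather than the raw training examples.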

6. Timeline

Tasks \ Start date
Explore Flink
Algorithm options
Decision tree, single node
Decision tree, MapReduce
Decision tree, distributed
Convert implementation to library
Code scrubbing
Documentation

7. Summary

a. Work done in 7th semester
In the 7th semester random forest was explored first, before switching to decision tree. The plain decision tree was implemented first in Python and executed on the tic-tac-toe dataset. It was then converted to Scala to run on a single Flink node.

b. Work done after 7th semester
After the 7th semester, the MapReduce version of the algorithm was proposed, developed and implemented with 3 MapReduce sequences. The code is written in Java and is executed on Hadoop. The distribution of data streams has been explored.

c. Work intended to be done by end of semester
The next step is the streaming implementation of the decision tree with online histograms, as put forward in "A streaming parallel decision tree algorithm" [3] by Yael Ben-Haim and Elad Tom-Tov. The online histogram data structure has to be implemented, and the streaming version has to be implemented on Flink.

8. References

1) MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat ( reduce osdi04.pdf )
2) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica ( )
3) A streaming parallel decision tree algorithm, by Yael Ben-Haim and Elad Tom-Tov ( haim0a/ben haim0a.pdf )
4) ( )

Integration of Machine Learning Library in Apache Apex

Integration of Machine Learning Library in Apache Apex Integration of Machine Learning Library in Apache Apex Anurag Wagh, Krushika Tapedia, Harsh Pathak Vishwakarma Institute of Information Technology, Pune, India Abstract- Machine Learning is a type of artificial

More information

2/4/2019 Week 3- A Sangmi Lee Pallickara

2/4/2019 Week 3- A Sangmi Lee Pallickara Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Fast, Interactive, Language-Integrated Cluster Computing

Fast, Interactive, Language-Integrated Cluster Computing Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

Twitter data Analytics using Distributed Computing

Twitter data Analytics using Distributed Computing Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE

More information

Stream Processing on IoT Devices using Calvin Framework

Stream Processing on IoT Devices using Calvin Framework Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf A Brief History: 2004 MapReduce paper 2010 Spark paper 2002 2004 2006 2008 2010 2012 2014 2002 MapReduce @ Google

More information

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Docker @ CDS André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 1Centre de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Paul Trehiou Université de technologie de Belfort-Montbéliard

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Shark: Hive on Spark

Shark: Hive on Spark Optional Reading (additional material) Shark: Hive on Spark Prajakta Kalmegh Duke University 1 What is Shark? Port of Apache Hive to run on Spark Compatible with existing Hive data, metastores, and queries

More information

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica. Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of abstraction in cluster

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

Using Existing Numerical Libraries on Spark

Using Existing Numerical Libraries on Spark Using Existing Numerical Libraries on Spark Brian Spector Chicago Spark Users Meetup June 24 th, 2015 Experts in numerical algorithms and HPC services How to use existing libraries on Spark Call algorithm

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform

SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform MTP End-Sem Report submitted to Indian Institute of Technology, Mandi for partial fulfillment of the degree of B. Tech.

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

CS294 Big Data System Course Project Report Gemini: Boosting Spark Performance with GPU Accelerators

CS294 Big Data System Course Project Report Gemini: Boosting Spark Performance with GPU Accelerators Gemini: Boosting Spark Performance with GPU Accelerators Guanhua Wang Zhiyuan Lin Ion Stoica AMPLab EECS AMPLab UC Berkeley UC Berkeley UC Berkeley Abstract Compared with MapReduce, Apache Spark is more

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

Scaled Machine Learning at Matroid

Scaled Machine Learning at Matroid Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling

More information

Using Numerical Libraries on Spark

Using Numerical Libraries on Spark Using Numerical Libraries on Spark Brian Spector London Spark Users Meetup August 18 th, 2015 Experts in numerical algorithms and HPC services How to use existing libraries on Spark Call algorithm with

More information

Apache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides

Apache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides Apache Flink Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides What is Apache Flink Massive parallel data flow engine with unified batch-and streamprocessing CEP

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms

Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms Candidate Andrea Spina Advisor Prof. Sonia Bergamaschi Co-Advisor Dr. Tilmann Rabl Co-Advisor

More information

Improving Ensemble of Trees in MLlib

Improving Ensemble of Trees in MLlib Improving Ensemble of Trees in MLlib Jianneng Li, Ashkon Soroudi, Zhiyuan Lin Abstract We analyze the implementation of decision tree and random forest in MLlib, a machine learning library built on top

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must

More information

a Spark in the cloud iterative and interactive cluster computing

a Spark in the cloud iterative and interactive cluster computing a Spark in the cloud iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

More information
