Machine learning library for Apache Flink

MTP Mid Term Report submitted to Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech.

by
Devang Bacharwar (B12059)

under the guidance of
Dr. Arti Kashyap

SCHOOL OF COMPUTING AND ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MANDI
30th March 2016

CERTIFICATE OF APPROVAL

Certified that the Mid Term Report entitled "Machine learning library for Apache Flink", submitted by Devang Bacharwar (B12059) to the Indian Institute of Technology Mandi for the partial fulfilment of the degree of B. Tech., has been accepted after the examination held today.

Date: 30th March 2016
Place: Kamand, H.P., India 175005

Dr. Varun Dutt
Faculty Advisor

CERTIFICATE

This is to certify that the Mid Term Report titled "Machine learning library for Apache Flink", submitted by Devang Bacharwar (B12059) to the Indian Institute of Technology Mandi, is a record of bonafide work under my supervision and is worthy of consideration for partial fulfilment of the degree of B. Tech. of the Institute.

Date: 30th March 2016
Place: Kamand, H.P., India 175005

Dr. Arti Kashyap
Faculty Supervisor

DECLARATION BY THE STUDENT

This is to certify that the Mid Term Report titled "Machine learning library for Apache Flink", submitted by me to the Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech., is a bonafide record of work carried out by me under the supervision of Dr. Arti Kashyap. The contents of this MTP, in full or in part, have not been submitted to any other Institute or University for partial fulfillment of any degree or diploma.

Date: 30th March 2016
Place: Kamand, H.P., India 175005

Devang Bacharwar
B12059

Acknowledgments

I would like to thank Dr. Arti Kashyap for giving me a chance to work on an exciting open-source computing engine project.

Devang Bacharwar

Abstract

The volume of digital data that human beings generate every day is now huge. Since Google's publication, MapReduce has been the most widely used paradigm for big data processing. Apache Flink is a computation engine for both batch and stream processing of data. MapReduce provides a simple and elegant framework for writing parallel programs that can run on thousands of nodes of cheap hardware. Although this paradigm makes big data processing easier, algorithms need to be modified to fit it, and most machine learning algorithms implemented on it need to be tuned to run at scale. The implementation of an algorithm also depends on the way the computation engine operates. Decision tree learning uses a decision tree to model observation sequences as a set of if-then statements.

Keywords: Apache Flink, Machine learning, Decision tree learning, Apache Hadoop, MapReduce

Table of Contents

Acknowledgements
Abstract
1 Introduction
2 Objectives
3 Background and Related Work
4 Discussion and conclusions of results
5 Deliverables
6 Timeline
7 Summary of work
8 References

1. Introduction

Big Data and its platforms
Big data refers to processing huge amounts of data on distributed hardware. Google published a paper on MapReduce [1], authored by Jeffrey Dean and Sanjay Ghemawat, in 2004, which showed how big data, when distributed, can be processed on cheap hardware instead of very high-end machines. The paper's functional approach, built on map and reduce functions, motivated the MapReduce paradigm for big data processing without worrying about resource allocation and similar concerns, and it led to various projects based on the MapReduce paradigm. Hadoop is an open-source implementation of Google's MapReduce paradigm, accompanied by projects such as YARN and the Hadoop Distributed File System. Apache Spark is another computation engine using MapReduce, which improved performance through Resilient Distributed Datasets (RDDs) [2]. Apache Flink is another implementation built on the RDD concept; however, Flink provides a much richer set of operations, such as join, count, group, and coGroup (a minimal sketch using these operators appears just below), and it offers native support for iterations, making it an ideal framework for running machine learning algorithms on big data. MapReduce eases the distribution of data and makes writing parallel programs simple, but most machine learning algorithms need to be reworked for the functional paradigm and parallelized so as to minimize the communication cost between different nodes.
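To make this operator vocabulary concrete, the following is a minimal, hypothetical sketch in Flink's Scala DataSet API; the two toy datasets and their fields are invented for illustration:

```scala
import org.apache.flink.api.scala._

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Toy data: (userId, page) click events and (userId, country) profiles.
    val clicks = env.fromElements((1, "home"), (1, "docs"), (2, "home"))
    val users  = env.fromElements((1, "DE"), (2, "IN"))
    val clicksPerCountry = clicks
      .join(users).where(0).equalTo(0)  // join the two datasets on userId
      .map(pair => (pair._2._2, 1))     // keep the user's country, emit a 1
      .groupBy(0)                       // group by country
      .sum(1)                           // count clicks per country
    clicksPerCountry.print()
  }
}
```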

2. Objectives

The objective of this project is to implement a parallel decision tree algorithm for Apache Flink and observe statistics for the implementation. This can be subdivided into the following tasks:
1) Study and implement the decision tree algorithm for a single node.
2) Study the parallel decision tree algorithm.
3) Try different parallelizations of the algorithm (MapReduce).
4) Implement the online histograms needed for parallelizing the decision tree algorithm.
5) Implement the parallel decision tree algorithm.
6) Modify the implementation into a library.

3. Background and Related Work

Stratosphere Project
Apache Flink started as a project named Stratosphere at TU Berlin. Stratosphere was accepted as an incubator project under the Apache Software Foundation in April 2014 under the name Flink, and Flink was later approved as a top-level Apache project.

FlinkML
The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to Flink. The goal is to design and implement a system that is scalable and can deal with problems of various sizes, whether the data is measured in megabytes or in terabytes and beyond. FlinkML has a roadmap developed by the Flink community:

- Pipelines of transformers and learners
- Data pre-processing
  - Feature scaling
  - Polynomial feature base mapper
  - Feature hashing
  - Feature extraction for text
  - Dimensionality reduction
- Model selection and performance evaluation
  - Model evaluation using a variety of scoring functions
  - Cross-validation for model selection and evaluation
  - Hyper-parameter optimization
- Supervised learning
  - Optimization framework
    - Stochastic Gradient Descent
    - L-BFGS
  - Generalized Linear Models
    - Multiple linear regression
    - LASSO, Ridge regression
    - Multi-class logistic regression
  - Random forests
  - Support Vector Machines
  - Decision trees
- Unsupervised learning
  - Clustering
    - K-means clustering
  - Principal Components Analysis
- Recommendation
  - ALS
- Text analytics
  - LDA
- Statistical estimation tools
- Distributed linear algebra
- Streaming ML

The items in bold have already been implemented.

Explore Apache Flink
To get a feel for Apache Flink as an end user, I worked through its different APIs and ran a few examples, such as WordCount and single-node k-means on two-dimensional data points. I set up the Flink GUI, which is used for monitoring activity on the Flink engine and keeping a log; it also shows a directed acyclic graph for each job.

Figure 1: Current architecture of Apache Flink

Understanding the Apache Flink ML code
Apache Flink is written in Java and Scala. I was new to both programming languages, so I read FlinkML's code base and the two-dimensional k-means implementation provided as an example by the community while picking the languages up.

MapReduce and Apache Hadoop
MapReduce is a paradigm which uses map and reduce functions, on different or the same processing units, to process large amounts of data efficiently on cheap hardware. The paradigm was introduced by Jeff Dean and Sanjay Ghemawat; Apache Hadoop was the first open-source implementation to exploit it for big data processing. A minimal WordCount sketch in the MapReduce style is shown below.
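This is the canonical WordCount expressed in Flink's Scala DataSet API; the input lines are invented for illustration:

```scala
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Toy input; a real job would read from a file or an HDFS path.
    val text = env.fromElements("to be or not to be", "that is the question")
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty)) // map phase: emit words
      .map((_, 1))                                             // pair each word with a 1
      .groupBy(0)                                              // shuffle by word
      .sum(1)                                                  // reduce phase: add the 1s
    counts.print()
  }
}
```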

Work done

Decision tree learning
Decision tree learning is a simple machine learning method based on decision trees, which gives a transparent view of the way decisions are made. Decision trees model the observations as simple if-then statements at every node, based on the attributes; the leaves of the tree represent the class labels. Each split is decided using a criterion such as Gini impurity or entropy. For the single-node decision tree, both Gini impurity and entropy have been implemented:

p(i) = frequency(i) = count(outcome i) / count(total rows)
Entropy = - sum over all outcomes of p(i) * log(p(i))

A single-node implementation was first coded in Python and then in Scala, and was test-run on the tic-tac-toe endgame dataset (multivariate, 9 categorical attributes). A small sketch of both criteria follows.
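A minimal Scala sketch of the two split criteria, using the definitions above (not the project's actual implementation; the toy label sequence is invented):

```scala
object Impurity {
  // p(i) = count(outcome i) / count(total rows)
  private def probabilities(labels: Seq[String]): Iterable[Double] =
    labels.groupBy(identity).values.map(_.size.toDouble / labels.size)

  // Entropy = -sum over all outcomes of p(i) * log2(p(i))
  def entropy(labels: Seq[String]): Double =
    -probabilities(labels).map(p => p * math.log(p) / math.log(2)).sum

  // Gini impurity = 1 - sum over all outcomes of p(i)^2
  def gini(labels: Seq[String]): Double =
    1.0 - probabilities(labels).map(p => p * p).sum

  def main(args: Array[String]): Unit = {
    val labels = Seq("positive", "positive", "negative", "positive")
    println(f"entropy = ${entropy(labels)}%.4f, gini = ${gini(labels)}%.4f")
  }
}
```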

Parallel decision tree algorithm in the MapReduce paradigm
Parallelizing a decision tree in the MapReduce paradigm is a difficult task. The design for this MapReduce version was worked out; the distribution of the data is vertical (by field). The proposed MapReduce version uses 4 sequences of Map and Reduce. The 3 layers of tasks that are performed are:

1) Initialize fields and tree
  a) Scan the dataset to determine whether each field is numerical or categorical (the objective field needs to be categorical), and gather attributes such as sum, min, and max.
  b) The map function iterates over the instances and returns key-value pairs of field and value.
  c) The reduce function returns the type of each field, with key attributes such as min, max, and sum for numerical fields and the number of category instances for categorical fields.
  d) Map input: (holder_key, example instance); Map output: (field, field value); Reduce output: (field, field features)

2) Find the best split for each (field, node) pair
  a) The map method finds the leaf node that the current instance falls into and returns the leaf id with the key id.
  b) After gathering the counts for all instances over all split candidates, the information gain of each candidate is computed.
  c) This finds the best split for each field.
  d) Map input: (holder_key, example instance); Map output: ([leaf, field], [field_value, class value]); Reduce output: (holder_key, [leaf, field, field_split, [information gain, class counts], [information gain, class counts], ...])

3) Find the best split for each node
  a) The mapper takes the output of the second MapReduce and makes the leaf id the key.
  b) The reducer then finds, for each leaf, the candidate with the maximum information gain and uses it to split the node (a plain-Scala sketch of this step appears at the end of this subsection).
  c) This finds the best split for each node.
  d) Map input: (holder_key, [leaf, field, field_split, [information gain, class counts], ...]); Map output: (leaf, [field, field_split, [information gain, class counts], ...]); Reduce output: (holder_key, [leaf, field, field_split, [information gain, class counts], ...])

Four MapReduce jobs were set up:
1) NodeSplit
2) NodeFieldSplit
3) Grow subtrees
4) Define fields

Since a big dataset was needed, the runs were done on the Poker Hand dataset [4], which contains categorical and integer data for classification:
No. of predictive attributes = 10
No. of goal attributes = 1
No. of instances = 25010 training, 1,000,000 testing
Sample space = 311,875,200

Accuracy: 559099/1000000 = 55.9099%
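The actual jobs are written in Java for Hadoop and are not reproduced here; as an illustration of phase 3's reduce step only, here is a small, hypothetical sketch in plain Scala collections. The SplitCandidate record and all values are invented:

```scala
// Hypothetical record emitted by the second MapReduce phase: one candidate
// split per (leaf, field) pair together with its information gain.
case class SplitCandidate(leaf: Int, field: Int, splitValue: Double, infoGain: Double)

object BestSplitPerNode {
  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      SplitCandidate(leaf = 0, field = 1, splitValue = 3.5, infoGain = 0.12),
      SplitCandidate(leaf = 0, field = 4, splitValue = 7.0, infoGain = 0.31),
      SplitCandidate(leaf = 1, field = 2, splitValue = 1.0, infoGain = 0.05)
    )
    // "Map": key every candidate by its leaf id.
    // "Reduce": for each leaf, keep the candidate with maximum information gain.
    val bestPerLeaf = candidates
      .groupBy(_.leaf)
      .map { case (leaf, cs) => leaf -> cs.maxBy(_.infoGain) }
    bestPerLeaf.toSeq.sortBy(_._1).foreach { case (leaf, c) =>
      println(s"leaf $leaf: split field ${c.field} at ${c.splitValue} (gain ${c.infoGain})")
    }
  }
}
```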

Data streams
A data stream is a possibly infinite series of training examples {(x1, y1), ..., (xn, yn)}, where the xi are observation vectors and yi is the class each belongs to. The data streams are assumed to arrive at multiple processing nodes; with W processing nodes, each node observes around 1/W of the training examples. The partitioning happens for several reasons: a single node can neither store all the incoming data, because of its volume, nor process that volume of data in a timely manner.

Streaming parallel decision tree algorithm
The streaming parallel decision tree algorithm [3] uses online histograms to reduce the communication load between the processing nodes. An online histogram summarizes the data observed at a processing node using B bins. The processing nodes construct histograms for the observed data and send them to the master node for tree construction and updating; the online histogram provides update, merge, sum, and uniform procedures, and the tree is built from these histograms on the master node.

a. Compressing data for communication between master and slave nodes
The data is compressed by building online histograms at the slave nodes as they observe the data. Histograms are good statistical tools for approximating large amounts of data and reduce the communication overhead between the nodes.

b. Histogram procedures
The volume of data that Flink is expected to deal with is generally on the scale of 100s of GBs to TBs. This amount of data can neither be kept in primary memory nor transferred efficiently between machines without preprocessing, so it needs to be compressed when work is distributed between different processors. Approximating the data is one way to obtain an overview of it that can be transferred between nodes. Online histograms are representative of the data: they approximate the data received at the slave nodes so that it can be sent to the master node to build the tree. This is a better approach than sampling, which might drop relevant information from the training examples. Online histograms can be used for streaming as well as batch data; they are built and updated efficiently and quickly, and they use constant memory. A minimal sketch of the update and merge procedures is given below.
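The following is a minimal, hypothetical Scala sketch of the online-histogram update and merge procedures described in [3]; the class name and the maxBins parameter (the B above) are invented for illustration, and the sum and uniform procedures are omitted:

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of the fixed-size online histogram of Ben-Haim and Tom-Tov [3].
// A histogram is a list of (centroid, count) bins; whenever there are more
// than maxBins bins, the two closest centroids are merged into their
// weighted average, so memory stays constant.
class OnlineHistogram(maxBins: Int) {
  private val bins = ArrayBuffer.empty[(Double, Long)] // kept sorted by centroid

  // update: absorb one observation into the histogram.
  def update(x: Double): Unit = {
    val i = bins.indexWhere(_._1 >= x)
    if (i >= 0 && bins(i)._1 == x) bins(i) = (x, bins(i)._2 + 1)
    else bins.insert(if (i < 0) bins.length else i, (x, 1L))
    shrink()
  }

  // merge: combine another histogram into this one (used at the master node).
  def merge(other: OnlineHistogram): Unit = {
    for ((c, m) <- other.bins) {
      val i = bins.indexWhere(_._1 >= c)
      bins.insert(if (i < 0) bins.length else i, (c, m))
    }
    shrink()
  }

  // Merge the two adjacent bins with the closest centroids until <= maxBins remain.
  private def shrink(): Unit =
    while (bins.length > maxBins) {
      val i = (0 until bins.length - 1).minBy(j => bins(j + 1)._1 - bins(j)._1)
      val (c1, m1) = bins(i)
      val (c2, m2) = bins(i + 1)
      bins(i) = ((c1 * m1 + c2 * m2) / (m1 + m2), m1 + m2)
      bins.remove(i + 1)
    }

  def snapshot: Seq[(Double, Long)] = bins.toSeq
}
```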

c. Decision tree building algorithm

4. Discussion and Conclusions
I started with random forests and explored a few research papers [4]. Random forests require a decision tree learner as well, which is not yet available as a library for Flink, so the final algorithm is the decision tree, which I will be implementing for Apache Flink.

5. Deliverables
The final deliverable for this project is an implementation of parallel decision tree learning on Apache Flink. Intermediate deliverables are the single-node decision tree implementation, the distributed decision tree implementation, and the decision tree implementation on Hadoop.

6. Timeline

Tasks \ Start dates: 15-08-15, 06-09-15, 07-10-15, 01-12-15, 25-03-16, 25-04-16, 01-05-16
- Explore Flink
- Algorithm options
- Decision tree, single node
- Decision tree, MapReduce
- Decision tree, distributed
- Convert implementation to library
- Code scrubbing
- Documentation

7. Summary
a. Work done in the 7th semester
In the 7th semester, random forests were explored first, before switching to decision trees. The basic decision tree was implemented first in Python and executed on the tic-tac-toe dataset; it was then converted to Scala to run on a single Flink node.

b. Work done after the 7th semester
After the 7th semester, the MapReduce version of the algorithm was proposed, developed, and implemented with 3 MapReduce sequences. The code is written in Java and executed on Hadoop. The distribution of data streams has also been explored.

c. Work intended to be done by the end of the semester
Next is the streaming implementation of the decision tree with online histograms, as put forward in the streaming parallel decision tree paper [3] by Yael Ben-Haim and Elad Tom-Tov. The online histogram data structure has to be implemented, and the streaming version has to be implemented on Flink.

8. References
1) Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf)
2) Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
3) Yael Ben-Haim and Elad Tom-Tov, "A Streaming Parallel Decision Tree Algorithm" (http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf)
4) UCI Machine Learning Repository: Poker Hand Data Set (https://archive.ics.uci.edu/ml/datasets/Poker+Hand)