Machine Learning Library for Apache Flink
MTP Mid-Term Report
submitted to the Indian Institute of Technology Mandi
for partial fulfillment of the degree of B. Tech.
by Devang Bacharwar (B2059)
under the guidance of Dr. Arti Kashyap
SCHOOL OF COMPUTING AND ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MANDI
30th March 2016
CERTIFICATE OF APPROVAL

Certified that the Mid-Term Report entitled Machine Learning Library for Apache Flink, submitted by Devang Bacharwar (B2059) to the Indian Institute of Technology Mandi for the partial fulfilment of the degree of B. Tech., has been accepted after the examination held today.

Date: 30 March 2016
Place: Kamand, H.P., India 175005

Dr. Varun Dutt
Faculty Advisor
CERTIFICATE

This is to certify that the Mid-Term Report titled Machine Learning Library for Apache Flink, submitted by Devang Bacharwar (B2059) to the Indian Institute of Technology Mandi, is a record of bonafide work under my supervision and is worthy of consideration for partial fulfilment of the degree of B. Tech. of the Institute.

Date: 30 March 2016
Place: Kamand, H.P., India 175005

Dr. Arti Kashyap
Faculty Supervisor
DECLARATION BY THE STUDENT

This is to certify that the Mid-Term Report titled Machine Learning Library for Apache Flink, submitted by me to the Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech., is a bonafide record of work carried out by me under the supervision of Dr. Arti Kashyap. The contents of this MTP, in full or in part, have not been submitted to any other institute or university for the award of any degree or diploma.

Date: 30 March 2016
Place: Kamand, H.P., India 175005

Devang Bacharwar
B2059
Acknowledgments

I would like to thank Dr. Arti Kashyap for giving me a chance to work on an exciting open source computing engine project.

Devang Bacharwar
Abstract

The volume of digital data that human beings generate every day is now huge. Since Google's publication, MapReduce has been the most widely used paradigm for big data processing. Apache Flink is a computation engine used for both batch and stream processing of data. MapReduce provides a simple and elegant framework for writing parallel programs that can run on thousands of nodes of cheap hardware. Although this paradigm makes big data processing easier, algorithms need to be modified to fit it: most machine learning algorithms implemented on this paradigm need to be tuned to run at scale, and an implementation also depends on the way the computation engine operates. Decision tree learning uses a decision tree to model observation sequences as a set of if-then statements.

Keywords: Apache Flink, Machine learning, Decision tree learning, Apache Hadoop, MapReduce
Table of Contents

Acknowledgments i
Abstract ii
1 Introduction 2
2 Objectives 2
3 Background and Related Work 3
4 Discussion and conclusions of results 7
5 Deliverables 7
6 Timeline 8
7 Summary of work 8
8 References 9
1. Introduction

Big Data and its platforms

Big data refers to processing huge amounts of data on distributed hardware. Google published a paper on MapReduce [1], authored by Jeffrey Dean and Sanjay Ghemawat, in 2004, which showed how big data, when distributed, can be processed on cheap hardware instead of very high-end machines. In that paper, the functional approach using map and reduce functions motivated the MapReduce paradigm for big data processing, freeing the programmer from concerns such as resource allocation. The paper led to various projects based on the MapReduce paradigm, including Hadoop, an open source implementation of Google's MapReduce, along with companion projects like YARN and the Hadoop Distributed File System. Apache Spark is another MapReduce-based computation engine, which improved performance using Resilient Distributed Datasets (RDDs) [2]. Apache Flink is a related distributed computation engine; it provides a richer set of operations such as join, count, group, and coGroup. Flink also provides native support for iterations, making it an ideal framework for running machine learning algorithms on big data. MapReduce eases the distribution of data and makes writing parallel programs simple. Most machine learning algorithms, however, need to be reworked for the functional paradigm and parallelized so as to minimize the communication cost between nodes.

2. Objectives

The objective of this project is to implement a parallel decision tree algorithm for Apache Flink and observe statistics for the implementation. This can be subdivided into the following tasks:
1) Study and implement the decision tree algorithm for a single node.
2) Study the parallel decision tree algorithm.
3) Try different parallelizations of the algorithm (MapReduce).
4) Implement the online histograms needed for parallelizing the decision tree algorithm.
5) Implement the parallel decision tree algorithm.
6) Modify the implementation into a library.
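To make the paradigm concrete, a single-node word count sketched in plain Python illustrates the map and reduce functions described above. The function names are mine; this is an illustration of the paradigm only, not the Flink API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data big compute", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'compute': 1}
```

In a real MapReduce engine the map calls run in parallel across nodes and the framework performs the grouping (shuffle) before the reduce step; the sequential sketch only shows the dataflow.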
3. Background and Related Work

Stratosphere Project

Apache Flink started as a project named Stratosphere at TU Berlin. Stratosphere was accepted as an incubator project under the Apache Software Foundation in April 2014 under the name Flink, and was later approved as a top-level Apache project.

FlinkML

The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to Flink. The goal is to design and implement a system that is scalable and can deal with problems of various sizes, whether the data is measured in megabytes or in terabytes and beyond. FlinkML has a roadmap developed by the Flink community:

- Pipelines of transformers and learners
- Data pre-processing
  - Feature scaling
  - Polynomial feature base mapper
  - Feature hashing
  - Feature extraction for text
  - Dimensionality reduction
- Model selection and performance evaluation
  - Model evaluation using a variety of scoring functions
  - Cross-validation for model selection and evaluation
  - Hyper-parameter optimization
- Supervised learning
  - Optimization framework
    - Stochastic Gradient Descent
    - L-BFGS
  - Generalized Linear Models
    - Multiple linear regression
    - LASSO, Ridge regression
    - Multi-class Logistic regression
  - Random forests
  - Support Vector Machines
  - Decision trees
- Unsupervised learning
  - Clustering
    - K-means clustering
  - Principal Components Analysis
- Recommendation
  - ALS
- Text analytics
  - LDA
- Statistical estimation tools
- Distributed linear algebra
- Streaming ML

The items marked in bold in the roadmap have already been implemented.

Explore Apache Flink

To get a feel for Apache Flink as an end user, I tried understanding its different APIs and ran a few examples, such as WordCount and a single-node k-means for two-dimensional data points. I also set up the Flink GUI used for monitoring activity on the Flink engine and keeping a log; it additionally shows a directed acyclic graph for each job.

Figure 1: Current architecture of Apache Flink

Understanding the Apache Flink ML code

Apache Flink is written in Java and Scala. I was new to both of these programming languages, so I worked on picking them up while reading FlinkML's code base and the two-dimensional k-means implementation provided as an example by the community.

MapReduce and Apache Hadoop

MapReduce is a paradigm that uses map and reduce steps, on the same or different processing units, to process large amounts of data efficiently on cheap hardware. This
paradigm was introduced by Jeff Dean and Sanjay Ghemawat. Apache Hadoop was the first open source implementation to exploit this paradigm for big data processing.

Work done

Decision tree learning

Decision tree learning is a simple machine learning method based on decision trees, which gives a transparent view of the way decisions are made. Decision trees model the observations into simple if-then statements at every node based on the attributes; the leaves represent the class labels. The split at each node is decided by criteria such as Gini impurity and entropy. For the single-node decision tree, both Gini impurity and entropy have been implemented:

p(i) = frequency(i) = count(outcome i) / count(total rows)
Entropy = - sum over all outcomes i of p(i) * log2(p(i))

A single-node implementation was first coded in Python and then in Scala, and was test-run on the tic-tac-toe endgame dataset (multivariate, 9 categorical attributes).

Parallel decision tree algorithm in the MapReduce paradigm

Parallelizing a decision tree in the MapReduce paradigm is a difficult task. The design for this MapReduce version was worked out with a vertical distribution of the data. The proposed MapReduce version uses four sequences of map and reduce. The three layers of tasks that are performed are:
1) Initialize fields and tree
a) Scan the dataset to determine whether the fields are numerical or categorical (the objective field needs to be categorical), and gather attributes like sum, min and max.
b) The map function iterates over the data and returns key-value pairs of field and value.
c) The reduce function returns the type of each field, with key attributes like min, max and sum for numerical fields and the number of category instances for categorical fields.
d) Map input: (holder_key, exampleinstance)
   Map output: (field, field_value)
   Reduce output: (field, field features)

2) Find the best split for each (field, node) pair
a) The map method finds the leaf node that the current instance falls into and returns the leaf id together with the key id.
b) After gathering the counts for all instances over all split candidates of the nodes, the information gain is computed.
c) This finds the best split for each field.
d) Map input: (holder_key, exampleinstance)
   Map output: ([leaf, field], [field_value, class_value])
   Reduce output: (holder_key, [leaf, field, field_split, [information_gain, class_counts], [information_gain, class_counts], ...])

3) Find the best split for each node
a) The mapper takes the output of the second MapReduce and makes the leaf id the key.
b) The reducer then finds the key with the maximum information gain and uses it to split the node.
c) This finds the best split for each node.
d) Map input: (holder_key, [leaf, field, field_split, [information_gain, class_counts], [information_gain, class_counts], ...])
   Map output: (leaf, [field, field_split, [information_gain, class_counts], [information_gain, class_counts], ...])
   Reduce output: (holder_key, [leaf, field, field_split, [information_gain, class_counts], [information_gain, class_counts], ...])

Four MapReduce jobs were set up:
1) NodeSplit
2) NodeFieldSplit
3) Grow Subtrees
4) Define fields

Since a big dataset was needed, the runs were done on the poker hand dataset [4], a categorical and integer dataset for classification:
No. of predictive attributes = 10
No. of goal attributes = 1
No. of instances = 25,010 training, 1,000,000 testing
Sample space = 311,875,200
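The split criteria used throughout these MapReduce steps can be sketched in Python. The helper names are mine and the example labels are illustrative only:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy = -sum(p(i) * log2(p(i))) over the class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, left, right):
    # Gain of a binary split: parent entropy minus the
    # size-weighted entropy of the two children.
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

labels = ["win"] * 4 + ["lose"] * 4
left, right = ["win"] * 4, ["lose"] * 4  # a perfect split
print(entropy(labels))                        # 1.0
print(information_gain(labels, left, right))  # 1.0
```

The MapReduce jobs above compute this same quantity per split candidate, but from distributed class counts rather than from in-memory label lists.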
Accuracy: 559099/1000000 = 55.9099%

Data streams

A data stream is a possibly infinite series of training examples {(x1, y1), ..., (xn, yn)}, where the xi are observation vectors and yi is the class each belongs to. The data streams are assumed to be arriving at multiple processing nodes; with W nodes, each processing node observes around 1/W of the training examples. The partitioning happens for several reasons: a single node can neither store all the incoming data, due to its volume, nor process that volume of data in a timely manner.

Streaming parallel decision tree algorithm

The streaming parallel decision tree algorithm [3] uses online histograms to reduce the communication load between the processing nodes. Each processing node constructs histograms summarizing the data it observes, using B bins, and sends them to the master node for tree construction and updates. The online histogram provides update, merge, sum and uniform procedures, and the tree is built from these histograms on the master node.

a. Compressing data for communication between master and slave nodes
The data is compressed by building online histograms at the slave nodes as they observe the data. Histograms are good statistical tools for approximating large amounts of data, and they reduce the communication overhead between the nodes.

b. Histogram procedures
The volume of data that Flink is expected to deal with is generally on the scale of 100s of GBs to TBs. This amount of data can neither be kept in primary memory nor transferred efficiently between machines without preprocessing. The data needs to be compressed to be able to transfer it between machines when we distribute
work between different processors. Approximating the data is one approach to get an overview of it that can be transferred between nodes. Online histograms are representative of the data: they approximate the data received at the slave nodes so that it can be sent to the master node to build the tree. This is a better approach than sampling, which might remove relevant information from the training examples. Online histograms can be used for both streaming and batch data; they are built and updated efficiently and quickly, and they use constant memory.

c. Decision tree building algorithm

4. Discussion and Conclusions

I started with random forests and explored a few research papers [4]. Random forests require a decision tree implementation, which is not yet available as a library for Flink, so the final algorithm I will be implementing for Apache Flink is the decision tree.

5. Deliverables

The main deliverable for this project is an implementation of parallel decision tree learning on Apache Flink. Intermediate deliverables are the single-node decision tree implementation, the distributed decision tree implementation, and the decision tree implementation on Hadoop.
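As a bridge to the streaming work, the online histogram's update and merge procedures of Section 3 can be sketched as follows. This is a simplified illustration in the spirit of [3], fusing the two closest centroids by a count-weighted average; the class and method names are mine, and the sum and uniform procedures are omitted:

```python
class OnlineHistogram:
    # A fixed-size histogram of [centroid, count] bins, kept sorted by
    # centroid; a simplified sketch of the streaming decision tree paper's
    # data structure, not the exact algorithm.

    def __init__(self, max_bins=8):
        self.max_bins = max_bins  # the B of the paper
        self.bins = []            # sorted list of [centroid, count]

    def update(self, value):
        # Insert the point as its own bin, then shrink back to max_bins.
        self.bins.append([value, 1])
        self.bins.sort()
        self._compress()

    def merge(self, other):
        # Combine two histograms, e.g. from two slave nodes.
        self.bins = sorted(self.bins + other.bins)
        self._compress()

    def _compress(self):
        # Repeatedly fuse the two closest centroids into one bin whose
        # centroid is the count-weighted average, until <= max_bins remain.
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

h = OnlineHistogram(max_bins=4)
for v in [1, 2, 2, 9, 10, 25, 26, 40]:
    h.update(v)
print(h.bins)  # four [centroid, count] bins summarizing the eight points
```

In the paper's scheme, each slave node maintains roughly one such histogram per (leaf, attribute, class) and ships only the B bins to the master, which merges them and evaluates candidate splits.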
6. Timeline

Start dates: 15-08-15, 06-09-15, 07-10-15, 10-12-15, 25-03-16, 25-04-16, 01-05-16

Tasks:
- Explore Flink
- Algorithm options
- Decision tree, single node
- Decision tree, MapReduce
- Decision tree, distributed
- Convert implementation to library
- Code scrubbing
- Documentation

7. Summary

a. Work done in the 7th semester
In the 7th semester, random forests were explored first, before switching to decision trees. The normal decision tree was implemented first in Python
and executed on the tic-tac-toe dataset; it was then converted to Scala to run on a Flink single node.

b. Work done after the 7th semester
After the 7th semester, the MapReduce version of the algorithm was proposed, developed and implemented with 3 MapReduce sequences. The code is written in Java and executed on Hadoop. The distribution of data streams has been explored.

c. Work intended to be done by the end of the semester
Next is the streaming implementation of the decision tree with online histograms, as put forward in the streaming parallel decision tree paper [3] by Yael Ben-Haim and Elad Tom-Tov. The online histogram data structure has to be implemented, and the streaming version has to be implemented on Flink.

8. References

1) MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat (http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf)
2) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
3) A Streaming Parallel Decision Tree Algorithm, by Yael Ben-Haim and Elad Tom-Tov (http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf)
4) Poker Hand Data Set, UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Poker+Hand)