Integration of Machine Learning Library in Apache Apex

Anurag Wagh, Krushika Tapedia, Harsh Pathak
Vishwakarma Institute of Information Technology, Pune, India

Abstract- Machine learning is a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Apache Apex is a Hadoop YARN native platform that unifies stream and batch processing. It processes big data in motion in a way that is highly scalable, highly performant, fault tolerant, secure, stateful, distributed and easily operable. Hence the need for a machine learning library in a platform like Apache Apex, where it would help draw useful insights from the huge volumes of data collected and make the system faster and more efficient over time.

Keywords- Machine Learning; Data Analytics; Apache Apex; Apache Spark; Big Data; Real Time; Stream Processing; Batch Processing

I. INTRODUCTION

Traditional systems are not equipped to process Big Data, so various Big Data platforms have emerged to handle it efficiently, and Hadoop is a ubiquitous name in this area. Various tools built on top of Hadoop improve its efficiency, such as Spark, Flink, Apex and Storm. Hadoop processes data in batch mode, i.e. data is stored in batches before processing, but as the demand for real-time data processing engines has grown we have had various quasi-streaming platforms such as Spark, which have outperformed Hadoop but still fail to process data truly in real time. Here, Apache Apex stands out, as it can process data in motion. Apache Apex is well known for its stream processing capabilities such as scalability, fault tolerance and stateful guarantees. Additionally, it stands out due to its highly operable nature and ease of use. Apache Spark has a rich set of libraries including Spark MLlib, GraphX etc.
Although Spark offers a lot of useful libraries, it is often the second choice to Apex where high-speed, low-latency, fault-tolerant processing is the requirement. It is for this reason that we started working on getting Spark's libraries to run on Apex's platform. At present we focus on integrating MLlib, Spark's library of machine learning algorithms, into Apex; this will ease the work of users who wish to develop machine learning models and choose Apex as their streaming platform. The main objective is to develop a high-level API with which Apache Apex users can effortlessly train machine learning models.

A. Processing Model in Apache Spark

Apache Spark is a well-known and popular big data engine that is built on Hadoop and supports high-level APIs in Java, Scala and Python. Data in Spark is represented by a data structure called an RDD (Resilient Distributed Dataset). RDDs are formally read-only, partitioned collections of data; we perform various operations on them. RDDs support two types of operations: transformations and actions. A transformation turns an existing RDD into a new RDD; the best examples are the map and filter functions. Actions are operations that return a single value to the driver program after performing various transformations, such as count, collect and reduce. Another salient principle is that the above-mentioned operations
evaluate lazily in Spark. Lazy evaluation means that even though an RDD is defined and transformation functions are called, no computation is performed until an action function is encountered.

B. Processing Model in Apex

Apache Apex is a unified batch and streaming platform, mostly used to process big data in real time. An application in Apache Apex consists of operators; these operators are units containing the operations that make up the business logic. The operators are connected via streams, which carry data from one operator to the next. We may call them the basic building blocks of the application. Multiple operators connected via streams form a DAG (Directed Acyclic Graph). An Apache Apex application runs as a YARN application, with each operator of the DAG running in a container provided by YARN. Even with this provision of true stream processing, Apache Apex lacks a machine learning library, which is immensely in demand. Below we explain why we chose Spark's MLlib and our strategy for integrating it into Apache Apex, making the transition of Spark users to Apex facile; moreover, if this is deployed successfully, we can run any Spark application that follows the RDD model.

II. LITERATURE SURVEY

C. Selection of Library
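The lazy-evaluation behaviour described in Section A can be made concrete with a small sketch. The following is a minimal, purely illustrative Python model (not the real RDD API): transformations only record work, and nothing is computed until an action is called.

```python
# Minimal sketch of Spark-style lazy evaluation (illustrative only, not the
# real RDD API): transformations are recorded, nothing runs until an action.
class LazyRDD:
    def __init__(self, data, ops=None):
        self.data = data          # backing collection ("stable storage")
        self.ops = ops or []      # recorded transformations (the lineage)

    # Transformations: return a new RDD, perform no computation.
    def map(self, fn):
        return LazyRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return LazyRDD(self.data, self.ops + [("filter", pred)])

    # Action: only here is the recorded pipeline actually evaluated.
    def collect(self):
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = LazyRDD(["FAILURE disk", "OK", "FAILURE net"])
failures = rdd.filter(lambda line: line.startswith("FAILURE"))  # no work yet
print(failures.count())  # the action triggers evaluation -> 2
```

Calling `filter` above builds a new `LazyRDD` instantly regardless of data size; the cost is paid only when `count` forces evaluation, which is exactly the behaviour Spark exploits to optimize whole pipelines before running them.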
Traditionally, all processing required for machine learning was done on non-distributed platforms, which restricted its use to applications with small datasets. In the past decade we have seen unprecedented growth of data, which has left non-distributed machine learning systems handicapped. Data is the pith and heart of machine learning: data powers the machine learning model, and the new era of Big Data has made machine learning an important aspect of research and industry applications. Various machine learning libraries have surfaced for Big Data platforms, such as SAMOA, H2O, Mahout and MLlib. We will discuss the two libraries that we inspected closely for integration with Apache Apex.

Mahout: Mahout is a machine learning tool with a wide selection of algorithms, but it is built on top of Hadoop and suffers from inefficient speeds. With the release of Mahout 0.9, the focus is now on Mahout-Samsara, which provides a math environment including linear algebra, statistical operations and data structures. The goal of Mahout-Samsara is to enable Mahout users to write their own machine learning algorithms. Yet there are some issues regarding the use of Mahout: it is difficult to set up on an existing Hadoop cluster, and most of the documentation available regarding the use of its algorithms is outdated. Some of the algorithms offered by Mahout are listed below:

1. Classification
   a. Logistic Regression
   b. Naive Bayes
   c. Hidden Markov Models
2. Clustering
   a. K-means Clustering
   b. Canopy Clustering
   c. Fuzzy k-means
   d. Spectral Clustering
   e. Streaming k-means
3. Dimensionality Reduction
   a. Singular Value Decomposition
   b. Stochastic SVD
   c. PCA (via Stochastic SVD)
   d. QR Decomposition

MLlib: In general, MLlib works with Spark and provides interactive batch as well as streaming approaches, which Mahout currently lacks. Moreover, Spark's use of in-memory computation enables tasks to run faster than those using Mahout.
Even though Spark's MLlib is relatively young compared to Mahout, it is easy to set up and run on Spark, and it provides thorough documentation of its machine learning APIs. It also has support for basic statistics operations.
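The basic statistics support mentioned above is typically computed in one distributed pass: each partition reduces to partial aggregates, and the partials are merged. The following plain-Python sketch illustrates that pattern (the function names are ours, not MLlib's API):

```python
# Sketch of distributed-style summary statistics: each partition reduces to
# (count, sum, sum of squares) and the partials are merged, the same one-pass
# pattern distributed summarizers use. Plain Python; names are illustrative.
def partial_stats(partition):
    # Per-partition partial aggregates.
    return (len(partition), sum(partition), sum(x * x for x in partition))

def merge(a, b):
    # Combining partials is just component-wise addition.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def summarize(partitions):
    n, s, sq = 0, 0.0, 0.0
    for p in partitions:
        n, s, sq = merge((n, s, sq), partial_stats(p))
    mean = s / n
    variance = sq / n - mean * mean  # population variance
    return mean, variance

mean, var = summarize([[1.0, 2.0], [3.0, 4.0]])  # two "partitions"
print(mean, var)  # 2.5 1.25
```

Because `merge` is associative and commutative, the partials can be combined in any order across the cluster, which is what makes this kind of statistic cheap to compute over partitioned data.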
Since more users are migrating from MapReduce to Spark, the community has grown bigger. MLlib includes APIs for development in Scala, Python and Java. MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.

Name of the Library | Language               | Parallel | Distributed | Comments
Scikit-Learn        | Python                 | Yes      | No          | Limited for multicore programming
H2O                 | Java, Scala, R, Python | Yes      | Yes         | Only one research so far
Mahout              | Java, Scala            | Yes      | Yes         | Suitable for batch processing only
TensorFlow          | Python                 | Yes: GPU | No          | No distributed support yet
Oryx                | Java                   | Yes      | Yes         | Contains support for only a few algorithms
WEKA                | Java                   | Yes      | No          | No distributed support yet
SAMOA               | Java                   | Yes      | Yes         | Already under incubation
MLlib               | Java, Python, Scala    | Yes      | Yes         | Has support for quasi-streaming

Table 1: Comparison of distributed and non-distributed machine learning libraries by language compatibility and limitations

III. SPARK'S & APEX'S RESILIENT DISTRIBUTED DATASETS

D. Spark RDD

Resilient Distributed Datasets (RDDs) are read-only collections of objects partitioned across a set of
machines that can be rebuilt if a partition is lost. An RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large clusters without compromising fault tolerance. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. We call these operations transformations to differentiate them from other operations on RDDs; examples of transformations include map, filter and join. RDDs do not need to be materialized at all times. Instead, an RDD carries enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage. Users can indicate which RDDs they will reuse and have control over their storage strategies.

e.g. Industrial Machine Log Mining

Industrial machines generate millions of log messages. These messages are stored in log files amounting to terabytes of storage in HDFS (Hadoop Distributed File System). To determine the cause of a failure in a machine, its maintainer needs to see just the error messages generated by the machine. This can be achieved using Spark's RDD: the RDD loads just the error messages into RAM across several nodes, after which the maintainer can query the logs interactively to gain insights about the failure.

Fig 1: Lineage graph for the third query in the industrial log mining example. RDDs are represented by boxes and transformations by arrows.

1 lines = spark_context.textFile("hdfs://remotehost//..")
2 failures = lines.filter(_.startsWith("FAILURE"))
3 failures.persist()

Line 1 represents an RDD backed by an HDFS file as a collection of records, i.e. logs, whereas line 2 represents a filtered RDD derived from the existing RDD. Line 3 indicates that failures is to be persisted in memory.

E. Apex RDD

The RDD is the fundamental data structure in Spark, and it supports transformation and action operations.
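Because RDD operations split cleanly into transformations and actions, a wrapper can record each transformation as an operator node in a DAG and run the DAG only when an action is invoked. The following Python sketch illustrates that idea; the class and method names are ours, not the actual Apex API, and a dict stands in for HDFS result storage.

```python
# Illustrative sketch (not the real Apex API): transformations append operator
# nodes to a DAG; an action executes the DAG and stores the result so that
# later DAGs could reuse it (a dict stands in for HDFS here).
hdfs = {}  # stand-in for HDFS result storage

class ApexRDDSketch:
    def __init__(self, source, dag=None):
        self.source = source
        self.dag = dag or []       # list of (operator-name, function) nodes

    def transform(self, name, fn):
        # A transformation only grows the DAG; nothing executes.
        return ApexRDDSketch(self.source, self.dag + [(name, fn)])

    def action(self, output_path):
        # An action executes the whole DAG and persists the result.
        data = list(self.source)
        for _, fn in self.dag:
            data = fn(data)
        hdfs[output_path] = data
        return data

logs = ApexRDDSketch(["FAILURE disk", "OK boot", "FAILURE net"])
failures = logs.transform(
    "FilterOperator", lambda d: [x for x in d if x.startswith("FAILURE")]
)                                          # DAG grows, nothing runs yet
result = failures.action("/results/failures")  # action executes the DAG
print(result)  # ['FAILURE disk', 'FAILURE net']
```

The key property the sketch preserves is that the driver code looks identical whether the deferred operators are later scheduled on one machine or across YARN containers.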
The first step in integrating MLlib will be to create a wrapper around the existing RDD by extending it to ApexRDD. ApexRDD will be Apex's version of the RDD, providing support for transformations and actions via operators: transformation and action functions will correspond to transformation operators and action operators in Apex, respectively. In addition, an ApexRDD will hold a DAG, each node of which consists of an operator. During the execution of the program, whenever the ApexRDD encounters a transformation function it adds the corresponding transformation operator to the DAG; as soon as it encounters an action function, it adds the corresponding action operator to the DAG and executes it. The job of any action operator is to execute the DAG, generate the required result and store it in HDFS, so that other DAGs can use the result. This way we take advantage of Spark's lazy execution.

F. Related Work

Apache SAMOA was one of the tools developed to address the need for online stream mining over big data. SAMOA features a Write-Once-Run-Anywhere architecture which allows multiple distributed stream processing engines to be integrated into the framework. The task of integrating Apache SAMOA into Apache Apex was done by Bhupesh Chawda. Apache SAMOA allows for multiple types of integration. First is the ML-adapter layer, which allows other machine learning libraries to integrate with and be part of the SAMOA framework. His work focuses on the second
type of API, called the SPE-adapter layer. This layer is provided to allow other stream processing engines (SPEs) to integrate with Apache SAMOA. This integration requires the implementation of a set of functions which essentially map the topology in SAMOA to the topology in the target SPE. In the case of Apache Apex, he implemented the mapping from a SAMOA topology to an Apex DAG. Doing this gives us the capability to run all SAMOA algorithms on the target SPE, in this case Apex.

G. References

[1] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012.
[2] https://www.datatorrent.com/blog/machine-learning-apache-apex-apache-samoa/
[3] MLlib: Machine Learning in Apache Spark. Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar. Journal of Machine Learning Research (JMLR). 2016.
[4] Case Study Evaluation of Mahout as a Recommender Platform. Carlos E. Seminario, David C. Wilson.