Integration of Machine Learning Library in Apache Apex

Anurag Wagh, Krushika Tapedia, Harsh Pathak
Vishwakarma Institute of Information Technology, Pune, India

Abstract- Machine Learning is a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Apache Apex is a Hadoop YARN native platform that unifies stream and batch processing. It processes big data in motion in a way that is highly scalable, highly performant, fault tolerant, secure, stateful, distributed and easily operable. Hence the need for a machine learning library in a platform like Apache Apex, where it would help draw useful insights from the huge volumes of data collected and make the system faster and more efficient over time.

Keywords- Machine Learning; Data Analytics; Apache Apex; Apache Spark; Big Data; Real Time; Stream Processing; Batch Processing

I. INTRODUCTION

Traditional systems are not equipped to process Big Data, so various Big Data platforms have emerged to handle it efficiently, and Hadoop is a ubiquitous name in this area. Various tools built on top of Hadoop, such as Spark, Flink, Apex and Storm, improve its efficiency. Hadoop processes data in batch mode, i.e. data is stored in batches before processing; as the demand for real-time data processing engines has grown, quasi-streaming platforms such as Spark have outperformed Hadoop, but they still fail to process data truly in real time. Here Apache Apex stands out, as it can process data in motion. Apache Apex is well known for its stream processing capabilities, such as scalability, fault tolerance and stateful guarantees; additionally, it stands out for its highly operable nature and ease of use. Apache Spark, meanwhile, has a rich set of libraries including Spark MLlib and GraphX.
Although Spark offers many useful libraries, it is often the second choice to Apex where high-speed, low-latency, fault-tolerant processing is required. That is why we started working on getting Spark's libraries to run on Apex's platform. At present we focus on integrating MLlib, Spark's machine learning library, into Apex, which will ease the work of users who wish to develop machine learning models and choose Apex as their streaming platform. The main objective is to develop a high-level API with which Apache Apex users can effortlessly train machine learning models.

A. Processing Model in Apache Spark

Apache Spark is a well-known and popular big data engine that is built on Hadoop and offers high-level APIs in Java, Scala and Python. Data in Spark is represented by a data structure called an RDD (Resilient Distributed Dataset). RDDs are read-only, partitioned collections of data on which we perform various operations. RDDs support two types of operations: transformations and actions. A transformation turns an existing RDD into a new RDD; map and filter are the best examples. Actions, such as count, collect and reduce, return a single value to the driver program after performing the accumulated transformations. Another salient principle is that these operations are evaluated lazily in Spark. Lazy evaluation means that even though an RDD is defined and transformation functions are called, no computation is performed until an action function is encountered.

B. Processing Model in Apex

Apache Apex is a unified batch and streaming platform, mostly used to process big data in real time. An application in Apache Apex consists of operators: units containing the operations that make up our business logic. These operators are connected via streams, which carry data from one operator to the other; they can be seen as the basic building blocks of the application. Multiple operators connected via streams form a DAG (Directed Acyclic Graph). An Apache Apex application runs as a YARN application, with each operator of the DAG running in a container provided by YARN. Despite providing true stream processing, Apache Apex lacks a machine learning library, which is immensely in demand. Below we explain why we chose Spark's MLlib and our strategy to integrate it into Apache Apex, making the transition of Spark users to Apex easy; moreover, if this is deployed successfully, we can run any Spark application that follows the RDD model.

II. LITERATURE SURVEY

C. Selection of Library

Traditionally, all processing required for machine learning was done on non-distributed platforms, which restricted its use to applications with small datasets. The past decade has seen unprecedented growth of data, which has left non-distributed machine learning systems handicapped. Data is the pith and heart of machine learning: it powers the model, and the new era of Big Data has made machine learning an important aspect of research and industry applications. Various machine learning libraries have surfaced for Big Data platforms, such as SAMOA, H2O, Mahout and MLlib. We discuss the two libraries that we inspected closely for integration with Apache Apex.

Mahout: Mahout is a machine learning tool with a wide selection of algorithms, but it is built on top of Hadoop and suffers from inefficient speeds. Since the release of Mahout 0.9, the focus has been on Mahout-Samsara, a math environment that includes linear algebra, statistical operations and data structures; its goal is to enable Mahout users to write their own machine learning algorithms. Yet there are issues with Mahout: it is difficult to set up on an existing Hadoop cluster, and most of the available documentation on its algorithms is outdated. Some of the algorithms offered by Mahout are listed below:

1. Classification
   a. Logistic Regression
   b. Naive Bayes
   c. Hidden Markov Models
2. Clustering
   a. K-means Clustering
   b. Canopy Clustering
   c. Fuzzy k-means
   d. Spectral Clustering
   e. Streaming k-means
3. Dimensionality Reduction
   a. Singular Value Decomposition
   b. Stochastic SVD
   c. PCA (via Stochastic SVD)
   d. QR Decomposition

MLlib: In general, MLlib works with Spark and provides interactive batch as well as streaming approaches, which Mahout currently lacks. Furthermore, Spark's use of in-memory computation enables tasks to run faster than they would with Mahout.
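The advantage of in-memory computation for iterative algorithms can be illustrated with a small sketch (plain Python, not Spark or Mahout code; DataSource and iterate are invented names for illustration): an iterative job that does not cache its input re-reads storage on every pass, while a cached one reads it once.

```python
# Toy illustration (not Spark code) of why in-memory caching helps
# iterative machine learning: without a cache, every pass over the
# data re-reads it from (slow) storage; with a cache it is read once.

class DataSource:
    """Stands in for a file on disk; counts how often it is scanned."""
    def __init__(self, records):
        self.records = records
        self.reads = 0

    def scan(self):
        self.reads += 1
        return list(self.records)

def iterate(source, passes, cache=False):
    """Run `passes` passes over the data (e.g. gradient steps)."""
    cached = source.scan() if cache else None
    total = 0
    for _ in range(passes):
        data = cached if cache else source.scan()
        total += sum(data)          # placeholder for one training pass
    return total

uncached = DataSource(range(100))
cached = DataSource(range(100))
iterate(uncached, passes=5)          # re-reads storage on every pass
iterate(cached, passes=5, cache=True)
print(uncached.reads, cached.reads)  # prints: 5 1
```

The uncached path behaves like a chain of MapReduce jobs, which materialize and re-read data between iterations; Spark keeps the working set in memory, like the cached path.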
Even though Spark's MLlib is relatively young compared to Mahout, it is easy to set up and run on Spark, and it provides thorough documentation of its machine learning APIs. It also supports basic statistics operations.
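Such basic statistics are computed in a distributed fashion. As a rough sketch of the underlying idea (plain Python, not MLlib's actual code; the function names are invented), each partition produces a small mergeable summary, and merging the summaries yields global statistics without any single node seeing the full dataset:

```python
# Toy sketch (not MLlib code) of distributed basic statistics:
# each partition computes a small summary, and the summaries are
# merged, so no node ever needs to hold the full dataset.
from functools import reduce

def partial_stats(partition):
    """Summary of one partition: count, sum, min, max."""
    return (len(partition), sum(partition), min(partition), max(partition))

def merge(a, b):
    """Combine two partial summaries; associative, so merge order is free."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [0.5, 6.5]]  # data on 3 "nodes"
count, total, lo, hi = reduce(merge, map(partial_stats, partitions))
print(count, total / count, lo, hi)
```

Because merge is associative, the summaries can be combined in any order, which is what lets the computation run in parallel across a cluster.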

As more users migrate from MapReduce to Spark, the community has grown bigger. MLlib includes APIs for development in Scala, Python and Java. MLlib is Spark's machine learning (ML) library; its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.

Name of the Library | Language               | Parallel | Distributed | Comments
Scikit-Learn        | Python                 | Yes      | No          | Limited for multicore programming
H2O                 | Java, Scala, R, Python | Yes      | Yes         | Only one research so far
Mahout              | Java, Scala            | Yes      | Yes         | Suitable for batch processing only
Tensorflow          | Python                 | Yes: GPU | No          | No distributed support yet
Oryx                | Java                   | Yes      | Yes         | Contains support for only few algorithms
WEKA                | Java                   | Yes      | No          | No distributed support yet
SAMOA               | Java                   | Yes      | Yes         | Already under incubation
MLlib               | Java, Python, Scala    | Yes      | Yes         | Has support for quasi-streaming

Table 1: Comparing distributed and non-distributed machine learning libraries on language compatibility and flaws

III. SPARK'S & APEX'S RESILIENT DISTRIBUTED DATASETS

D. Spark RDD

Resilient Distributed Datasets (RDDs) are read-only collections of objects partitioned across a set of

machines that can be rebuilt if a partition is lost. An RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large clusters without compromising fault tolerance. RDDs can be created through deterministic operations on either data in stable storage or other RDDs; we call these operations transformations to differentiate them from other operations on RDDs. Examples of transformations include map, filter and join. RDDs do not need to be materialized at all times. Instead, an RDD carries enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage. Users can indicate which RDDs they will reuse and control the storage strategy for them.

Example: Industrial Machine Log Mining. Industrial machines generate millions of log messages, stored in log files that amount to terabytes of data in HDFS (Hadoop Distributed File System). To determine the cause of a failure in a machine, its maintainer only needs to see the error messages the machine generated. This can be achieved with Spark's RDDs: an RDD loads just the error messages into RAM across several nodes, after which the maintainer can query the logs interactively to gain insights about the failure.

Fig 1: Lineage graph of the third query in the industrial log mining example. RDDs are represented by boxes and transformations by arrows.

1 lines = spark_context.textFile("hdfs://remotehost//..")
2 failures = lines.filter(_.startsWith("FAILURE"))
3 failures.persist()

Line 1 defines an RDD backed by an HDFS file as a collection of records (the logs), while line 2 defines a filtered RDD derived from the existing one. Line 3 indicates that failures should be persisted in memory.

E. Apex RDD

RDD is the fundamental data structure in Spark, supporting transformation and action operations.
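The transformation/action split can be made concrete with a small sketch (plain Python; LazyDataset and every name in it are invented for illustration, not Spark's or Apex's implementation): transformations only record lineage, and an action replays the recorded operations.

```python
# Illustrative sketch only: a toy dataset wrapper that, like an RDD,
# records transformations lazily and computes only on an action.
# Names (LazyDataset, etc.) are hypothetical, not Spark/Apex APIs.

class LazyDataset:
    def __init__(self, source, ops=None):
        self.source = source            # base data in "stable storage"
        self.ops = ops or []            # lineage: recorded transformations

    # -- transformations: only extend the lineage, no computation --
    def map(self, fn):
        return LazyDataset(self.source, self.ops + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self.source, self.ops + [("filter", pred)])

    # -- actions: walk the lineage and actually compute --
    def collect(self):
        data = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:  # "filter"
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        return len(self.collect())

logs = LazyDataset(["FAILURE: pump", "OK", "FAILURE: valve"])
failures = logs.filter(lambda l: l.startswith("FAILURE"))  # nothing runs yet
print(failures.count())   # the action triggers the computation, printing 2
```

This mirrors the log-mining example above: filter merely extends the lineage, only count triggers computation, and a persist step would cache the intermediate result rather than recomputing it from lineage on each action.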
The first step in integrating MLlib will be to create a wrapper around the existing RDD by extending it to ApexRDD. ApexRDD will be Apex's version of the RDD, providing support for transformations and actions via operators: transformation and action functions will correspond to transformation operators and action operators in Apex, respectively. In addition, an ApexRDD will hold a DAG in which each node is an operator. During execution, whenever an ApexRDD encounters a transformation function it adds the corresponding transformation operator to the DAG; as soon as it encounters an action function it adds the corresponding action operator to the DAG and executes it. The job of any action operator is to execute the DAG, generate the required result and store it in HDFS so that other DAGs can use it. This way we take advantage of Spark's lazy execution.

F. Related Work

Apache SAMOA is one of the tools developed to address the need for online stream mining of big data. SAMOA features a Write-Once-Run-Anywhere architecture which allows multiple distributed stream processing engines to be integrated into the framework. The task of integrating Apache SAMOA into Apache Apex was done by Bhupesh Chawda. Apache SAMOA allows multiple types of integration. The first is the ML-adapter layer, which allows other machine learning libraries to integrate with and become part of the SAMOA framework. His work focuses on the second

type of API, called the SPE-adapter layer. This layer allows other stream processing engines (SPEs) to integrate with Apache SAMOA. The integration requires implementing a set of functions that essentially map the topology in SAMOA to the topology in the target SPE. In the case of Apache Apex, he implemented the mapping from a SAMOA topology to an Apex DAG. This gives us the capability to run all SAMOA algorithms on the target SPE, in this case Apex.