MLeap: Release Spark ML Pipelines

Similar documents
Deploying Machine Learning Models in Practice

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Latest from the Lab: What's New Machine Learning Sam Buhler - Machine Learning Product/Offering Manager

Unifying Big Data Workloads in Apache Spark

The Evolution of Big Data Platforms and Data Science

Data Analytics and Machine Learning: From Node to Cluster

Specialist ICT Learning

Introducing Oracle Machine Learning

Integrate MATLAB Analytics into Enterprise Applications

Apache Spark 2.0. Matei

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

New Developments in Spark

Deep Learning Frameworks with Spark and GPUs

A Tutorial on Apache Spark

Bringing Data to Life

DATA SCIENCE USING SPARK: AN INTRODUCTION

Sparkling Water. August 2015: First Edition

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Deploying, Managing and Reusing R Models in an Enterprise Environment

Data Science Bootcamp Curriculum. NYC Data Science Academy

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Integration with popular Big Data Frameworks in Statistica and Statistica Enterprise Server Solutions Statistica White Paper

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Databricks, an Introduction

Machine Learning with Python

Franck Mercier. Technical Solution Professional Data + AI Azure Databricks

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Spark Overview. Professor Sasu Tarkoma.

Boost your Analytics with ML for SQL Nerds

An Introduction to Big Data Formats

KNIME for the life sciences Cambridge Meetup

Putting it all together: Creating a Big Data Analytic Workflow with Spotfire

Oracle Big Data Discovery

An Introduction to Apache Spark

Going Big Data on Apache Spark. KNIME Italy Meetup

SAS IS OPEN (FOR BUSINESS) MATT MALCZEWSKI, SAS CANADA

Machine Learning and SystemML. Nikolay Manchev Data Scientist Europe E-

Prototyping Data Intensive Apps: TrendingTopics.org

Scalable Machine Learning in R. with H2O

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Introducing Microsoft SQL Server 2016 R Services. Julian Lee Advanced Analytics Lead Global Black Belt Asia Timezone

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

MATLAB. Senior Application Engineer The MathWorks Korea The MathWorks, Inc. 2

Understanding the latent value in all content

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Higher level data processing in Apache Spark

Chapter 1 - The Spark Machine Learning Library

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

zspotlight: Spark on z/os

Kotlin for Data Science!

IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

About Intellipaat. About the Course. Why Take This Course?

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

SAS IS OPEN (FOR BUSINESS) MATT MALCZEWSKI, SAS CANADA

Big data systems 12/8/17

Spark, Shark and Spark Streaming Introduction

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear

Big Data Applications with Spring XD

Chapter 4: Apache Spark

Industrial IoT: Architecture Framework Use Cases. Artur Borycki Teradata Labs

H2O: An open-source, in-memory prediction engine for data science

Blurring the Line Between Developer and Data Scientist

WITH INTEL TECHNOLOGIES

Machine Learning With Spark

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

SQL Server Machine Learning Marek Chmel & Vladimir Muzny

Exam Questions

Analyzing Flight Data

COPYRIGHT DATASHEET

Oracle Data Integrator 12c: Integration and Administration

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

The Reality of Qlik and Big Data. Chris Larsen Q3 2016

Introduction to MATLAB application deployment

IBM Power Systems: Open innovation to put data to work Dexter Henderson Vice President IBM Power Systems

Integrate MATLAB Analytics into Enterprise Applications

Boost your Analytics with Machine Learning for SQL Nerds. Julie mssqlgirl.com

Integrating MATLAB Analytics into Business-Critical Applications Marta Wilczkowiak Senior Applications Engineer MathWorks

Python Certification Training

Oracle Data Integrator 12c: Integration and Administration

MapR Enterprise Hadoop

Big Data Infrastructures & Technologies

MLI - An API for Distributed Machine Learning. Sarang Dev

A CD Framework For Data Pipelines. Yaniv

Cascading Pattern - How to quickly migrate Predictive Models (PMML) from SAS, R, Micro Strategies etc., onto Hadoop and deploy them at scale

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Hitachi Vantara Overview Pentaho 8.0 and 8.1 Roadmap. Pedro Alves

Big Data processing: a framework suitable for Economists and Statisticians

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Rapid growth of massive datasets

About the Tutorial. Audience. Prerequisites. Copyright and Disclaimer. PySpark

@h2oai presents. Sparkling Water Meetup

E6895 Advanced Big Data Analytics Lecture 4:

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Transcription:

MLeap: Release Spark ML Pipelines Hollin Wilkins & Mikhail Semeniuk SATURDAY

Web Dev @ Cornell Studied some General Biology Rails Consulting for TrueCar and other companies Implement ML model for ClearBook in Ruby, Python and Java Erlang for highlyconcurrent game servers C/C# Game Engines Introduction Join TrueCar for Rails/mobile development Begin work on machine learning platform based on Spark/Hadoop How do we build real-time APIs based on Spark Models? SparkContext prohibitive Try several ideas, fail, learn a ton, begin MLeap project MLeap Studied some Math and Economics @ University of Minnesota - Predictive modeling of at risk patients @ UHG - R/SAS batch pipelines - Joins TrueCar to do new car price modeling - SQL-based batch pipelines + python for Linear Regressions in API layer - Continues to train models in SAS/R - SQL-based batch pipelines + python for Linear Regressions in API layer - Massive precomputation in Hadoop/ElasticSearch - Portions of models still translated, now to Java - Spark enables data processing and training of models in same environment

Problem Statement: Deploying machine learning algorithms to a production environment is a lot more difficult than it has to be and is a common source of friction at data-driven organizations Action Reaction - Data scientists write data pipelines to construct research datasets - Engineers write scalable libraries for computing features and algorithms - Data scientists largely focus on linear/logistic regressions due to engineering constraints - Engineers re-write the data pipelines for a production-ready system - Data scientists largely don t use those libraries and maintain/re-write their own copy of the code - Talented engineers get largely tired of coding up linear regressions and updating coefficients Everyone wants to do better! The winning technology will be the one that enables Engineers and Data Scientists to collaborate and work across a single platform.

Existing Solutions: You won t believe how many companies are still deploying algorithms in a SQL environment! And these are billion dollar operations. Hard-Coded Models (SQL, Java, Ruby) PMML Emerging Solutions (yhat, DataRobot) Enterprise Solutions (Microsoft, IBM, SAS) MLeap Quick to Implement Open Sourced Committed to Spark/Hadoop API Server Infrastructure Lesson Learned: Push code down to where the data is, not the other way around!

MLeap Solution Born out of need to deploy models quickly to a real time API server Leverage Hadoop/Spark ecosystem for training, get rid of Spark dependency for execution Easily reuse models with serialization and executing without Spark

MLeap Components core - provides linear algebra system, regression models, and feature builders runtime - provides DataFrame-like LeapFrame and transformers for it spark - provides easy conversion from Spark transformers to MLeap transformers mleap-spark mleap-serialization mleap-runtime Bundle.ML mleap-core serialization - common serialization format for Spark and MLeap (Bundle.ML) New features: expanded serialization formats to include both json and protobuf for large models (i.e. random forests with thousands of features)

mleap-core MLeap Core Components Linear Algebra Feature Builders Regressions Classifiers Dense/Sparse Vectors VectorAssembler LinearRegression RandomForest BLAS from Spark StringIndexer RandomForest LogisticRegression StandardScaler Gradient Boosted Reg. Trees

mleap-runtime MLeap Runtime Provides LeapFrame, which stores data for transformations by MLeap transformers MLeap transformers use mleap-core building blocks to transform LeapFrame MLeap transformers correspond one-to-one with Spark transformers No dependencies on Spark

Legend Feature Pipeline Categorical Feature Categorical Feature Index Categorical Feature One Hot Vector VectorAssembler Continuous Feature Vector StandardScaler Scaled Continuous Feature Vector Continuous Feature VectorAssembler Final Feature Vector StringIndexer OneHotEncoder StringIndexer OneHotEncoder VectorAssembler Categorical Feature Vector StringIndexer OneHotEncoder Regression Pipeline Final Feature Vector LinearRegression Prediction

Categorical Pipeline Categorical Feature StringIndexer Categorical Feature Index OneHotEncoder Categorical Feature One Hot Vector LeapFrame LeapFrame LeapFrame StringIndexer OneHotEncoder

MLeap Serialization (Bundle.ML) Provides common serialization for both Spark and MLeap 100% protobuf/json based for easy reading, compact data, and portability No dependencies on Parquet * Can be written to zip files, file system, HDFS, anywhere with an FS-like structure mleap-serialization

String Indexer Model Linear Regression Model Linear Regression Model (Code)

MLeap Spark Train an ML pipeline with Spark then export it to MLeap MLeap Spark Spark Estimator Spark Model MLeap Model Execute an MLeap pipeline against a Spark DataFrame MLeap Spark MLeap Transformer MLeap Spark Spark DataFrame Spark LeapFrame Spark LeapFrame Spark DataFrame

Benchmarks MLeap: 0.011ms/transform Spark: 23.4ms/transform

Web Services Algo 1 Algo 2 Algo n REST API Mobile Apps MLeap API + Server MLeap Transformers + Serialization Java API Map/ Reduce MLlib Scikit + other Spark Jobs Spark + Hadoop + HDFS Pipeline Code Notebooks

Demo Usage of MLeap Train a sample listing price model using linear regression and random forest against some AirBnb training data Deploy both models to a local API server Get real-time results IN UNDER 5 MINUTES!

Future of MLeap Unify linear algebra and core libraries with Spark Python/R interface Deploy easily to embedded systems and outside of JVM Full support for all Spark transformers

MLeap Development Currently 5 people working on projects across 4 different companies Talk to us if you are interested in deploying this technology at your company MLeap Demo Project: https://github. com/truecar/mleap-demo

Thank You! Hollin Wilkins email: hollinrwilkins@gmail.com github: https://github.com/hollinwilkins twitter: https://twitter.com/hollinwilkins Mikhail Semeniuk email: seme0021@gmail.com github: https://github.com/seme0021 twitter: https://twitter.com/mikhailsemeniuk SATURDAY