MLI - An API for Distributed Machine Learning. Sarang Dev


MLI - API. MLI aims to simplify the development of high-performance, scalable, distributed ML algorithms. It targets common ML problems related to data loading, feature extraction, and model training. Usability: comparable to MATLAB and R. Scalability: matches low-level systems like GraphLab and Vowpal Wabbit.

Big Picture - MLBase. ML Optimizer: this layer aims to automate the task of ML pipeline construction. MLI: an experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions. MLlib: Apache Spark's distributed ML library; many of its features were borrowed from ML Optimizer and MLI, and it is maintained by the Spark community.

Installing MLI (https://github.com/amplab/mli). Requires Java 7 (not compatible with Java 8), Scala 2.9.3, and Spark 0.8.0. The build files need some changes to compile (see https://drive.google.com/open?id=0b64ip8kxpidptve0nmfaanfwouu). MLI uses sbt (an interactive build tool); run the following commands at the sbt prompt:

> compile
> assembly (produces a jar in the target directory)

MLI Interfaces - MLTable. MLTable is an object that provides a familiar table-like interface to the developer and is designed to mimic a SQL table. It is an interface for processing semi-structured, mixed-type data. Once data is featurized, it can be cast into an MLNumericTable, a convenience type that most ML algorithms expect as input.

MLI Interfaces - LocalMatrix. LocalMatrix provides linear algebra primitives on partitions of data; the partitions are determined automatically by the system. Optimizer, Algorithm, and Model: algorithms are implemented against the Algorithm interface and should return a model as specified by the Model interface. Optimization techniques are used to converge to an approximate solution while iterating over the data.
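
To illustrate how these pieces fit together, here is a minimal Scala sketch of the Algorithm/Model contract just described. The trait and method names are assumptions made for illustration only, not MLI's actual signatures.

// Hypothetical sketch of the Algorithm/Model contract; names and
// signatures are illustrative assumptions, not MLI's real API.
trait Model {
  def predict(features: Seq[Double]): Double
}

trait Algorithm {
  // Train on rows of numeric data and return a Model, typically by
  // iterating an optimizer to an approximate solution.
  def train(data: Seq[Seq[Double]]): Model
}

// A trivial mean-predictor, just to show the shape of an implementation.
object MeanAlgorithm extends Algorithm {
  def train(data: Seq[Seq[Double]]): Model = {
    val labels = data.map(_.head) // assume the label is in column 0
    val mean = labels.sum / labels.length
    new Model { def predict(features: Seq[Double]): Double = mean }
  }
}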

Using MLI

ADD_JARS=<path to mli jar> spark-shell

We can perform all the tasks in a spark-shell, which always has an initialized SparkContext (sc).

import mli.feat._
import mli.interface._

val mc = new MLContext(sc)
val inputTable = mc.loadFile("/home/sarang/downloads/sample.txt").cache() // MLTable

// c is the column on which we want to perform N-gram extraction
// n is the N-gram length, e.g., n=2 corresponds to bigrams
// k is the number of top N-grams we want to use (sorted by N-gram frequency)
val (featurizedData, ngFeaturizer) = NGrams.extractNGrams(inputTable, c = 0, n = 2, k = 10, stopWords = NGrams.stopWords)
val (scaledData, featurizer) = Scale.scale(featurizedData.filter(_.nonZeros.length > 5).cache(), 0, ngFeaturizer)

Spark. An engine for large-scale data processing. Speed: runs programs up to 100x faster than Hadoop MapReduce when data fits in memory. Ease of use: write applications in Java, Scala, Python, or R. Libraries: Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.


MLlib. MLlib is a standard component of Spark providing machine learning primitives on top of Spark. Its goals: scalability, performance, user-friendly APIs, integration with Spark and its other components, and support for Java, Scala, and Python.

MLlib algorithms and utilities (a concrete usage sketch follows this list):
- Classification: logistic regression, naive Bayes, ...
- Regression: generalized linear regression, isotonic regression, ...
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixture models (GMMs), ...
- Topic modeling: latent Dirichlet allocation (LDA)
- Feature transformations: standardization, normalization, hashing, ...
- Model evaluation and hyper-parameter tuning
- ML Pipeline construction
- ML persistence: saving and loading models and Pipelines
- Survival analysis: accelerated failure time model
- Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan
- Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA), ...
- Statistics: summary statistics, hypothesis testing, ...
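
To make one entry from the list above concrete, here is a minimal Scala sketch of training a logistic regression classifier with spark.mllib. It assumes a spark-shell with an initialized SparkContext (sc) and a LIBSVM-format file at the placeholder path shown.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format (placeholder path).
val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// Train a binary logistic regression model with L-BFGS.
val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

// Predict the label of one point from the training set.
val first = training.first()
println(lrModel.predict(first.features))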

Data Types in MLlib
- Local vector: integer-typed, 0-based indices and double-typed values, stored on a single machine.
- Labeled point: a local vector, either dense or sparse, associated with a label/response, e.g. val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
- Local matrix
- Distributed matrix: RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix
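
A short Scala sketch of the local types listed above (the values are illustrative):

import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.regression.LabeledPoint

// The same vector (1.0, 0.0, 3.0) in dense and sparse form.
val dv = Vectors.dense(1.0, 0.0, 3.0)
val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A labeled point pairing a label with a (sparse) feature vector.
val neg = LabeledPoint(0.0, sv)

// A 3x2 dense local matrix; values are stored in column-major order.
val dm = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))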

DataFrames in Spark SQL. A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6; the Dataset API is available in Scala and Java, while Python does not support it. A DataFrame can play the same role as MLI's MLTable interface.

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")
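
Such a DataFrame feeds directly into spark.ml feature transformers. A small sketch continuing the example above (Tokenizer comes from spark.ml; the column names are the ones defined above):

import org.apache.spark.ml.feature.Tokenizer

// Split each sentence into a "words" column of tokens.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
wordsData.show()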

Using MLlib. Example: K-means clustering (partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean). Implemented below first in PySpark, then in Scala.

K-means in PySpark

from numpy import array
from math import sqrt
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans, KMeansModel

conf = SparkConf().setAppName("KMeans").setMaster("local")
sc = SparkContext(conf=conf)

# Load and parse the data
data = sc.textFile("/home/sarang/downloads/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
clusters.save(sc, "/home/sarang/kmeansmodel")
sameModel = KMeansModel.load(sc, "/home/sarang/kmeansmodel")

We can also use the pyspark shell instead.

K-means in Scala

import org.apache.spark.ml.clustering.KMeans

val dataset = spark.read.format("libsvm").load("/home/sarang/downloads/kmeans_data1.txt")

// Train a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")

model.clusterCenters.foreach(println)

Conclusion. MLI is outdated, and most of its features have been absorbed into MLlib. MLlib is a powerful tool for distributed machine learning.

References
MLI Tutorial: http://ampcamp.berkeley.edu/3/exercises/mli-document-categorization.html
MLlib: http://spark.apache.org/docs/latest/mllib-guide.html