Chapter 1 - The Spark Machine Learning Library

Objectives

Key objectives of this chapter:

- The Spark Machine Learning Library (MLlib)
- MLlib dense and sparse vectors and matrices
- Types of distributed matrices
- The LIBSVM format
- Supported classification, regression, and clustering algorithms

1.1 What is MLlib?

The Spark Machine Learning Library (MLlib) provides an array of high-quality distributed machine learning (ML) algorithms. The library implements a whole suite of statistical and machine learning algorithms (see the Notes below for details).

MLlib provides tools for:

- Building processing workflows (e.g. feature extraction and data transformation)
- Parameter optimization
- ML model management (model saving and loading)

MLlib applications run on top of Spark and take full advantage of Spark's distributed in-memory design. MLlib claims 10x+ faster performance than comparable applications built with Apache Mahout, whose algorithms run on Hadoop's MapReduce engine.

Notes: MLlib 1.3 contains the following algorithms (source: https://spark.apache.org/mllib/):

- linear SVM and logistic regression
- classification and regression tree
- random forest and gradient-boosted trees
- recommendation via alternating least squares
- clustering via k-means, Gaussian mixtures, and power iteration clustering
- topic modeling via latent Dirichlet allocation
- singular value decomposition
- linear regression with L1- and L2-regularization
- isotonic regression
- multinomial naive Bayes
- frequent itemset mining via FP-growth
- basic statistics

1.2 Supported Languages

MLlib supports the following languages:

- Java
- Python (you will need to install the NumPy package as a dependency)
- R
- Scala

In our further discussion, we will use Python to illustrate the main concepts and programming structures.

Note: Spark version 1.6 (the latest in the 1.* version series) requires Java 7+, Python 2.6+, R 3.1+, and Scala 2.10.

1.3 MLlib Packages

MLlib is divided into two packages:

- spark.mllib, which contains the original Spark API built on top of RDDs
- spark.ml, which contains a higher-level API built on top of DataFrames

spark.ml is recommended if you use the DataFrames API: it is more versatile and flexible, and it facilitates the construction of ML processing pipelines.

Notes: Recently, the Spark MLlib team has started encouraging ML developers to contribute new algorithms to the spark.ml package, while stating: "Users should be comfortable using spark.mllib features and expect more features coming." [http://spark.apache.org/docs/1.6.0/mllib-guide.html]
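As a minimal illustration of the split (the class choices here are just examples, not a recommendation), the two packages are reached through different import roots:

# spark.mllib: the original RDD-based API
from pyspark.mllib.classification import LogisticRegressionWithSGD

# spark.ml: the DataFrame-based pipeline API
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline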

1.4 Dense and Sparse Vectors

MLlib supports local and distributed vectors and matrices. Local vectors can be dense or sparse:

- A dense vector is a regular array of doubles
- A sparse vector is backed by two parallel arrays: one holding the indices of the elements that are present, the other holding the double values of those elements

The values 1.0, 2.0, 0.0, 4.0 (a four-element list) are represented in dense format as

[1.0, 2.0, 0.0, 4.0]

The same values in sparse format are represented as

(4, [0, 1, 3], [1.0, 2.0, 4.0])

The first element is the size of the list; the array [0, 1, 3] holds the indices of the present elements; the third array holds their values. The element at index 2 is absent and implicitly 0.0.
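A minimal sketch of both constructions, using the Vectors factory class from pyspark.mllib.linalg (the variable names are ours):

from pyspark.mllib.linalg import Vectors

# Dense: a plain array of doubles
denseVec = Vectors.dense([1.0, 2.0, 0.0, 4.0])

# Sparse: size 4, with values present only at indices 0, 1, and 3
sparseVec = Vectors.sparse(4, [0, 1, 3], [1.0, 2.0, 4.0])

print(denseVec)   # [1.0,2.0,0.0,4.0]
print(sparseVec)  # (4,[0,1,3],[1.0,2.0,4.0])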

1.5 Labeled Point

A labeled point is an object that pairs a label (category) with a local vector (dense or sparse) of its properties. It is used in supervised learning algorithms in MLlib.

A label is represented by a single double, starting from 0.0 for the first category, 1.0 for the second, and so on, which makes it possible to use labels in both regression and classification algorithms. For binary classification, a label should be either 0.0 (for the negative class) or 1.0 (for the positive class).

A labeled point is brought into code as an instance of the LabeledPoint class, which carries the features (properties) and label (category) of a data asset. The features of a point are packaged as a vector.

1.6 Python Example of Using the LabeledPoint Class

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Features are represented by a dense feature vector; the point is
# labeled as belonging to category 0.0. Features are some measured
# properties of the data point (e.g. size, speed, duration, breath rate)
dataPointA = LabeledPoint(0.0, [11.0, 2.2, 333.3, 4.444])

# Features are represented by a sparse feature vector (the elements
# at positions 1 and 2 of the feature vector are zeroed out)
dataPointB = LabeledPoint(1.0, SparseVector(4, [0, 3], [11.0, 4.444]))
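As a brief usage note, the label and the feature vector of a LabeledPoint can be read back through its label and features attributes:

# Read back the parts of the labeled points created above
print(dataPointA.label)     # 0.0
print(dataPointA.features)  # [11.0,2.2,333.3,4.444]
print(dataPointB.features)  # (4,[0,3],[11.0,4.444])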

1.7 LIBSVM Format

LIBSVM is a compact text format for encoding data (usually training data sets). It is widely used in MLlib to represent sparse feature vectors.

A file in LIBSVM format is shaped as a matrix in which each line is a space-delimited record representing a labeled sparse feature (attribute/property) vector. The layout is as follows:

class_label index1:value1 index2:value2 ...

where the numeric indices represent features and values are separated from indices by a colon (':').

MLlib expects you to start class labeling from 0. Feature indices are one-based, in ascending order (1, 2, 3, etc.); where a feature is not present in a data record, it is simply omitted from that record.

Note: After loading into your MLlib application, the feature indices are converted from one-based to zero-based.

Notes: LIBSVM is a library (and its data format) for support vector machines [http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html]. This resource [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/] contains a great number of classification, regression, multi-label, and string data sets stored in LIBSVM format.

1.8 An Example of a LIBSVM File

0 2:1 8:1 14:1 21:1 23:1 25:1
0 1:1 9:1 11:1 13:1 18:1 20:1
1 1:1 5:1 15:1 18:1 21:1 23:1
2 6:1 9:1 12:1 14:1 16:1 24:1
2 8:1 13:1 21:1 22:1 27:1 30:1
3 6:1 8:1 10:1 15:1 17:1 21:1

1.9 Loading LIBSVM Files

MLlib provides the MLUtils class, which, among many other things, offers a method for loading a file in LIBSVM format.

Example:

from pyspark.mllib.util import MLUtils

dataset = MLUtils.loadLibSVMFile(sc, "data/mllib/libsvm_data.dat")

Note: The resulting dataset object is an RDD with the records stored as LabeledPoint objects; sc is the SparkContext reference.
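Assuming the example file from section 1.8 is what gets loaded, the first record would come back with its indices shifted down by one (a sketch; the exact printed form may vary by Spark version):

# Inspect the first loaded record: label 0.0, and one-based file indices
# 2, 8, 14, 21, 23, 25 become zero-based indices 1, 7, 13, 20, 22, 24
firstRecord = dataset.first()
print(firstRecord.label)     # 0.0
print(firstRecord.features)  # (30,[1,7,13,20,22,24],[1.0,1.0,1.0,1.0,1.0,1.0])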

1.10 Local Matrices

Matrices in MLlib are stored as vectors, with their dimensions passed as parameters to define the matrix layout (see the next section for an example). Like vectors, matrices can be dense or sparse. Matrix data is serialized into the backing vector column-wise; sparse matrices use the Compressed Sparse Column (CSC) format.

Example of column-wise packing: if the original matrix layout is

1.0 2.0 3.0
4.0 5.0 6.0

then the matrix data is packed into the resulting vector as follows:

[1.0, 4.0, 2.0, 5.0, 3.0, 6.0]

1.11 Example of Creating Matrices in MLlib

from pyspark.mllib.linalg import Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) shaped in
# 3 rows and 2 columns; values are passed in column-major order, so the
# first column holds 1.0, 3.0, 5.0 and the second holds 2.0, 4.0, 6.0
dm2 = Matrices.dense(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])

Notes: The Python syntax of the dense() and sparse() methods of the Matrices class is as follows:

static dense(numRows, numCols, values)
Description: Create a DenseMatrix

static sparse(numRows, numCols, colPtrs, rowIndices, values)
Description: Create a SparseMatrix
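The sparse() method takes the CSC arrays directly: colPtrs marks where each column's entries start in the values array, and rowIndices gives the row of each stored value. A small sketch (the matrix contents are our own illustration):

from pyspark.mllib.linalg import Matrices

# A 3 x 3 sparse matrix holding 9.0 at (0,0), 6.0 at (2,1), 8.0 at (1,2):
#   9.0  0.0  0.0
#   0.0  0.0  8.0
#   0.0  6.0  0.0
# colPtrs = [0, 1, 2, 3]: column c owns entries colPtrs[c] .. colPtrs[c+1]-1
sm = Matrices.sparse(3, 3, [0, 1, 2, 3], [0, 2, 1], [9.0, 6.0, 8.0])

print(sm.toArray())  # expand to dense form to verify the layout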

1.12 Distributed Matrices

In MLlib, a distributed matrix is stored across a cluster of machines in one or more RDDs. It uses long-typed (8-byte) row and column indices and double-typed values.

MLlib has implemented the following types of distributed matrices so far: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix.

Note: You need to match your processing use case with the right distributed matrix type. This is an advanced topic not reviewed here; please refer to the original Spark documentation: http://spark.apache.org/docs/1.6.0/mllib-data-types.html#distributedmatrix

The main idea behind distributed matrices is to split the underlying matrix into a set of vectors and to distribute the processing across a cluster of machines, for example by using the parallelize() method of the SparkContext object.

1.13 Example of Using a Distributed Matrix

The following example shows how to use the RowMatrix type. In the RowMatrix distributed matrix type, each row of the backing RDD is a local vector.

from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of local vectors
v1 = [1.0, 2.0, 3.0]
v2 = [4.0, 5.0, 6.0]
rdd = sc.parallelize([v1, v2])

# Create a RowMatrix object from the RDD of local vectors
distributedMatrix = RowMatrix(rdd)
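As a quick follow-up (the output values reflect the two-row RDD above), the distributed matrix can report its dimensions, which are computed from the underlying RDD:

# Query the dimensions of the distributed matrix
print(distributedMatrix.numRows())  # 2
print(distributedMatrix.numCols())  # 3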
1.14 Classification and Regression Algorithms

According to the Spark documentation (http://spark.apache.org/docs/1.6.0/mllib-classification-regression.html), MLlib supports the following algorithms for classification and regression in its spark.mllib package: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (random forests and gradient-boosted trees), and isotonic regression.

1.15 Clustering

According to the Spark documentation (http://spark.apache.org/docs/1.6.0/mllib-clustering.html), MLlib supports the following algorithms for clustering in its spark.mllib package (a minimal k-means example follows this list):

- K-means
- Gaussian mixture
- Power iteration clustering (PIC)
- Latent Dirichlet allocation (LDA)
- Bisecting k-means
- Streaming k-means
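A minimal k-means sketch against the spark.mllib API (the toy data and parameter choices are our own):

from pyspark.mllib.clustering import KMeans

# Two obvious groups of 2-D points
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [8.0, 9.0], [9.0, 8.0]])

# Train a k-means model with k=2
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)       # two centers, roughly [0.5,0.5] and [8.5,8.5]
print(model.predict([0.5, 0.5]))  # cluster id assigned to a new point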
1.16 Summary

In this chapter we have reviewed the following topics:

- MLlib packages and supported languages
- Ways to create dense and sparse vectors and matrices
- Types of distributed matrices
- The LIBSVM format
- Supported classification, regression, and clustering algorithms