MLI - An API for Distributed Machine Learning
Sarang Dev
MLI - API
Simplifies the development of high-performance, scalable, distributed ML algorithms. Targets common ML problems related to data loading, feature extraction, and model training.
Usability: comparable to MATLAB and R.
Scalability: matches low-level systems like GraphLab and Vowpal Wabbit.
Big Picture - MLBase
ML Optimizer: this layer aims to automate the task of ML pipeline construction.
MLI: an experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions.
MLlib: Apache Spark's distributed ML library. Many features in MLlib have been borrowed from ML Optimizer and MLI. Maintained by the Spark community.
Installing MLI
https://github.com/amplab/mli
Requires Java 7 (not compatible with Java 8), Scala 2.9.3, and Spark 0.8.0.
Needs some changes in the build files to compile: https://drive.google.com/open?id=0b64ip8kxpidptve0nmfaanfwouu
Uses sbt (an interactive build tool) for building. From the sbt prompt, run:
> compile
> assembly (produces a jar in the target directory)
MLI Interfaces - MLTable
MLTable is an object that provides a familiar table-like interface to a developer and is designed to mimic a SQL table. It is an interface for processing semi-structured, mixed-type data. Once the data is featurized, it can be cast into an MLNumericTable, a convenience type that most ML algorithms expect as input.
MLI Interfaces - LocalMatrix
LocalMatrix provides linear algebra primitives on partitions of the data; the partitions are determined automatically by the system.
Optimizer, Algorithm, and Model
Algorithms are implemented against the Algorithm interface and return a model conforming to the Model interface. Optimization techniques are used to converge to an approximate solution while iterating over the data, as sketched below.
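As a rough illustration of this pattern only (the trait and method names here are hypothetical, not MLI's actual signatures):

// Illustrative sketch of the Algorithm/Model contract described above.
// These traits are hypothetical; MLI's real interfaces differ in detail.
trait Model[F] {
  def predict(features: F): Double
}

trait Algorithm[F] {
  // Typically iterates over the data, delegating each update step to an
  // optimizer (e.g., gradient descent) until it converges to an
  // approximate solution, then returns the fitted Model.
  def train(data: Seq[(F, Double)]): Model[F]
}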
Using MLI
ADD_JARS=<path to mli jar> spark-shell
We can perform all the tasks in a spark-shell, which always has an initialized SparkContext.

import mli.feat._
import mli.interface._

val mc = new MLContext(sc)
val inputTable = mc.loadFile("/home/sarang/downloads/sample.txt").cache() // MLTable

// c is the column on which we want to perform N-gram extraction
// n is the N-gram length, e.g., n=2 corresponds to bigrams
// k is the number of top N-grams we want to use (sorted by N-gram frequency)
val (featurizedData, ngFeaturizer) = NGrams.extractNGrams(inputTable, c=0, n=2, k=10, stopWords = NGrams.stopWords)
val (scaledData, featurizer) = Scale.scale(featurizedData.filter(_.nonZeros.length > 5).cache(), 0, ngFeaturizer)
Spark
Engine for large-scale data processing.
Speed: runs programs up to 100x faster than Hadoop MapReduce in memory.
Ease of use: write applications in Java, Scala, Python, or R.
Libraries: Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
MLlib
MLlib is a standard component of Spark providing machine learning primitives on top of Spark.
Scalability
Performance
User-friendly APIs
Integration with Spark and its other components
Support for Java, Scala, and Python
MLlib
Classification: logistic regression, naive Bayes, ...
Regression: generalized linear regression, isotonic regression, ...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixture models (GMMs), ...
Topic modeling: latent Dirichlet allocation (LDA)
Feature transformations: standardization, normalization, hashing, ...
Model evaluation and hyper-parameter tuning
ML Pipeline construction
ML persistence: saving and loading models and Pipelines
Survival analysis: accelerated failure time model
Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan
Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA), ...
Statistics: summary statistics, hypothesis testing, ...
A minimal example with one of these algorithms (logistic regression) is sketched below.
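A minimal sketch of training one algorithm from the list above with the RDD-based MLlib API (the data path is a placeholder; LogisticRegressionWithLBFGS and MLUtils.loadLibSVMFile are standard MLlib calls):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format (path is a placeholder).
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)

// Train a binary logistic regression model with L-BFGS.
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

// Score the held-out set: (prediction, true label) pairs.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  (model.predict(features), label)
}
val accuracy = predictionAndLabels.filter { case (p, l) => p == l }.count.toDouble / test.count
println(s"Test accuracy = $accuracy")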
Data Types in MLlib
Local vector: a local vector has integer-typed, 0-based indices and double-typed values, stored on a single machine.
Labeled point: a labeled point is a local vector, either dense or sparse, associated with a label/response.
E.g. val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
Local matrix
Distributed matrix: RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix
The local types are illustrated below.
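A short sketch constructing each local type (all calls are standard MLlib API from org.apache.spark.mllib.linalg and org.apache.spark.mllib.regression):

import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.regression.LabeledPoint

// Dense vector: all values stored explicitly.
val dense = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector: size 3, with values 1.0 and 3.0 at indices 0 and 2.
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Labeled points: either representation can carry a label.
val pos = LabeledPoint(1.0, dense)
val neg = LabeledPoint(0.0, sparse)

// Local dense matrix: 3 rows, 2 columns, values in column-major order.
val dm = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))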
DataFrames in Spark SQL
A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6. The Dataset API is available in Scala and Java; Python does not support the Dataset API. DataFrames can play the role of MLI's MLTable interface.

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")
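To echo the MLTable N-gram featurization shown earlier, a minimal sketch with Spark ML's DataFrame-based feature transformers, continuing from sentenceData above (Tokenizer and NGram are standard spark.ml classes):

import org.apache.spark.ml.feature.{Tokenizer, NGram}

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Extract bigrams (n=2), analogous to NGrams.extractNGrams(..., n=2) in MLI.
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")
ngram.transform(wordsData).select("ngrams").show(false)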
Using MLlib
Example: K-means clustering (partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean).
Implement using PySpark
Implement using Scala
K-means - PySpark

from numpy import array
from math import sqrt
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans, KMeansModel

conf = SparkConf().setAppName("KMeans").setMaster("local")
sc = SparkContext(conf=conf)

# Load and parse the data
data = sc.textFile("/home/sarang/downloads/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
clusters.save(sc, "/home/sarang/kmeansmodel")
sameModel = KMeansModel.load(sc, "/home/sarang/kmeansmodel")

We can also use the pyspark shell instead.
K-means - Scala

import org.apache.spark.ml.clustering.KMeans

val dataset = spark.read.format("libsvm").load("/home/sarang/downloads/kmeans_data1.txt")

// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)

// Show the cluster centers.
model.clusterCenters.foreach(println)
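As a small follow-up to the example above, the fitted model can also assign each row to a cluster (transform and the default "prediction" output column are standard spark.ml behavior):

// Assign each input row to its nearest cluster center; KMeansModel.transform
// appends a "prediction" column holding the cluster index.
val predictions = model.transform(dataset)
predictions.select("features", "prediction").show(false)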
Conclusion MLI is outdated and most of its features have been included in MLlib. MLlib can act as a powerful tool for machine learning.
References
MLI Tutorial: http://ampcamp.berkeley.edu/3/exercises/mli-document-categorization.html
MLlib: http://spark.apache.org/docs/latest/mllib-guide.html