ID2223 Lecture 3: Gradient Descent and SparkML


1 ID2223 Lecture 3: Gradient Descent and SparkML

2 Optimization Theory Review ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 2/91

3 Unconstrained Optimization Unconstrained optimization involves finding minima of functions that may have multiple inputs: min_p f(p) = f(p*), where p ∈ R^n, f: R^n → R. We deal only with minima because any maximum of f is a minimum of -f We do not always aim to find the global minimum in R^n but instead points that take smaller function values than all of their neighbors - A local minimum that is good enough ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 3

4 Local and Global Minima [Figure 4.3 from Deep Learning Book] 4/91

5 A general minimization algorithm Iterative optimization algorithms consist of setting initialization conditions, and then three iteration steps: 1. Initialize: Choose a starting point an initial guess that can either be determined by your situation or that can be actively chosen. 2. While a stopping criterion is not true (the solution is not close enough to the minimum), continue, else break and return the current solution. 3. Find a descent direction a direction in which the function value decreases near the current point. 4. Determine the step size the length of a step in the given direction that leads to a good decrease ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 5
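These steps can be written as a short loop. Below is a minimal sketch in Python (not from the lecture); it assumes a differentiable function f with gradient grad_f, a fixed step size, and a gradient-norm stopping criterion:

    import numpy as np

    def minimize(f, grad_f, p0, step_size=0.1, tol=1e-6, max_iters=1000):
        """Generic iterative minimization: initialize, test, descend, step."""
        p = np.asarray(p0, dtype=float)      # 1. initialize with a starting guess
        for _ in range(max_iters):
            g = grad_f(p)
            if np.linalg.norm(g) < tol:      # 2. stopping criterion: gradient close to zero
                break
            direction = -g                   # 3. descent direction: negative gradient
            p = p + step_size * direction    # 4. take a step of the chosen length
        return p

    # Example: f(p) = 0.5 * ||p||^2 has its minimum at the origin.
    p_min = minimize(lambda p: 0.5 * p @ p, lambda p: p, p0=[3.0, -1.0])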

6 Local and Global Minima [ 6/91

7 Stopping Criterion The minimum always has a certain property: the first derivative, i.e., the gradient, has to be zero: ∇f(p*) = 0. When the gradient is a vector with two components, the above equation translates into the vector equation ∇f(p*) = (∂f/∂x, ∂f/∂y) = (0, 0). Each component of the gradient has to be 0: ∂f/∂x = ∂f/∂y = 0, which here gives x = y = 0, or p*=(0,0) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 7

8 Critical Point A point where the gradient is zero is called a critical point. In general, a critical point does not have to be a minimum, but any minimum is a critical point [Figure 4.2 from Deep Learning Book] 8

9 Gradient Descent for Least Squares Regression ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 9/91

10 Gradient Descent for Least Squares Regression Goal: Find min_w ‖Xw - y‖₂². That is, find w that minimizes f(w) = ‖Xw - y‖₂². Scalar objective: f(w) = ‖wx - y‖₂² = Σ_{j=1}^n (wx_j - y_j)². w* = global optimum 10/91

11 Gradient Descent Start at a random point Repeat - Determine a descent direction - Choose a step size - Update Until stopping criterion is satisfied [Figure: error plotted against weight] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 11/91

12 Small Steps Down the Gradient ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 12/91

13 Small Steps Down the Gradient ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 13/91

14 Non-Convex Optimization Gradient descent is an iterative algorithm that takes small steps down the gradient until it reaches a minimum Can potentially end up in a local minimum, w, instead of the global optimum, w* ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 14/91

15 Convex Optimization In a convex optimization problem, every local minimum is also a global minimum Least Squares and Ridge Regression are linear methods that use convex optimization methods to optimize objective functions ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 15/91

16 Direction of Descent - Slope Know the error function: f(w) = weight², so its slope is df(w)/dw = 2·weight [Figure: error f(w) plotted against weight (w), showing the slope at the original weight] 16/91

17 Direction of Descent - Weight Update w_{i+1} = w_i - α_i · (df/dw)(w_i), where α_i is the step size and -(df/dw)(w_i) is the negative-slope direction 17/91

18 Descent Direction and Magnitude (1D) The opposite direction of the slope points in the direction of steepest error descent in weight space: w_{i+1} = w_i - α_i · (df/dw)(w_i), or, written with the gradient, w_{i+1} = w_i - α_i · ∇_w f(w_i). The step size α is a free parameter that has to be chosen carefully for each problem ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 18/91
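A minimal 1D sketch of this update rule (my own illustration, reusing the error function f(w) = weight² from the Direction of Descent slide and a constant step size α = 0.1):

    def f(w):                   # error as a function of the weight
        return w ** 2

    def df_dw(w):               # slope of the error function
        return 2 * w

    w, alpha = 5.0, 0.1         # initial weight and step size
    for i in range(50):
        w = w - alpha * df_dw(w)    # w_{i+1} = w_i - alpha * df/dw(w_i)
    print(w, f(w))              # w approaches 0, the minimum of f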

19 Gradient For functions with multiple inputs, we use partial derivatives to measure how much f changes as only the variable w_i increases at point w: ∂f(w)/∂w_i. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector. The gradient of f at w is the vector containing all the partial derivatives and is denoted ∇_w f(w). Use the chain rule to compute the derivatives 19/91

20 What is the gradient of Gradient Example Answer: ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 20/91

21 Chain Rule With x the input, y the intermediate value, and e the output: y = x·w₁, so ∂y/∂w₁ = x; e = y·w₂, so ∂e/∂y = w₂. Since e = x·w₁·w₂, ∂e/∂w₁ = x·w₂ = (∂e/∂y)·(∂y/∂w₁) [How Deep Neural Networks Work, Brandon Rohrer] 21/91

22 Chaining ∂err/∂weight = (∂a/∂weight)·(∂b/∂a)·(∂c/∂b)· … ·(∂err/∂n), chaining the partial derivatives through the intermediate quantities a, b, c, …, m, n [How Deep Neural Networks Work, Brandon Rohrer] 22/91

23 Step Size An example step size is α_i = α / (n·√i) - where n is the number of training points, i is the iteration step, and α is a constant ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 23/91

24 Update Rule for Least Squares Regression ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 24/91

25 Parallel Gradient Descent for Least Squares Vector Update: w_{i+1} = w_i - α_i Σ_{j=1}^n (w_iᵀ x_j - y_j) x_j. Compute the summands in parallel on the workers; the workers receive w_i on every iteration (example: n = 6, number of workers = 3) 25/91

26 The Gradient ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 26/91

27 Example: Minimizing an Objective Function Consider points p(x,y) in Euclidean space R² and the function that determines half of the squared length of vector p: f(p) = ½(x² + y²). What does it compute for p(3,-1)? f(3,-1) = ½(3² + (-1)²) = 5 ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 27/91

28 Example: Minimizing an Objective Function If we wiggle the value of x and keep everything else the same (keep the value of y the same), how do we know if we are getting closer to the minimum or not? If we move from p(3,-1) to p(2,-1), does the error get better or worse? ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 28/91

29 Quick Review: Partial Derivatives f(x, y) = 2x²y³. Assume we want to measure how much the function f(x,y) is changing at the point (a,b) if we keep y fixed (that is, set y=b): g(x) = f(x, b) = 2x²b³. The partial derivative of f(x, y) with respect to x at point (a,b) is: g′(a) = 4ab³, i.e., ∂f/∂x (a,b) = 4ab³. Notation: (∂/∂x) f(x, y) = f_x(x, y) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 29/91
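A quick check of this partial derivative with SymPy (a sketch, assuming SymPy is available; not part of the lecture):

    import sympy as sp

    x, y, a, b = sp.symbols('x y a b')
    f = 2 * x**2 * y**3
    fx = sp.diff(f, x)                  # partial derivative with respect to x
    print(fx)                           # 4*x*y**3
    print(fx.subs({x: a, y: b}))        # 4*a*b**3, matching the slide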

30 Example: Least Squares Gradient Descent If we fit a straight line y = w₁ + w₂x to a training set of two-dimensional points (x₁, y₁), …, (xₙ, yₙ) using least squares, the objective function to be minimized is: Σ_{j=1}^n (w₁ + w₂x_j - y_j)². We update both weights in a simultaneous step: (w₁, w₂) ← (w₁, w₂) - α · Σ_j ( 2(w₁ + w₂x_j - y_j), 2x_j(w₁ + w₂x_j - y_j) ) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 30/91
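A NumPy sketch of this simultaneous update (my illustration; the synthetic data and the constant step size α are assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=100)
    y = 1.5 + 3.0 * x + 0.1 * rng.standard_normal(100)   # noisy points around the line y = 1.5 + 3x

    w1, w2, alpha = 0.0, 0.0, 0.005
    for _ in range(2000):
        r = w1 + w2 * x - y                       # residuals (w1 + w2*x_j - y_j)
        grad_w1 = np.sum(2.0 * r)                 # partial derivative w.r.t. w1
        grad_w2 = np.sum(2.0 * x * r)             # partial derivative w.r.t. w2
        w1, w2 = w1 - alpha * grad_w1, w2 - alpha * grad_w2   # simultaneous update of both weights
    print(w1, w2)                                 # close to (1.5, 3.0)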

31 Line Search Try a few different lengths for the step size, α, and pick the best one. - Inexact method that incurs overhead. For several values of α, evaluate f(x - α∇f(x)) and choose the α that results in the smallest objective function value ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 31/91
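A minimal sketch of such a line search (my illustration; the candidate step sizes and the example function are arbitrary choices):

    import numpy as np

    def line_search(f, x, grad, candidates=(1.0, 0.3, 0.1, 0.03, 0.01)):
        """Evaluate f(x - alpha * grad) for a few alphas and keep the best one."""
        return min(candidates, key=lambda a: f(x - a * grad))

    # Example with f(x) = 0.5*||x||^2, whose gradient at x is x itself.
    x = np.array([3.0, -1.0])
    alpha = line_search(lambda v: 0.5 * v @ v, x, grad=x)
    x = x - alpha * x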

32 Momentum If the error surface is a long and narrow valley, gradient descent goes quickly down the valley walls, but very slowly along the valley floor. Reduce this problem by updating parameters using a combination of the previous update and the gradient update: Δw_i^{t+1} = β·Δw_i^t + (1 - β)·α_i·∇_w f(w^t). Usually β is set quite high ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 32/91
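A sketch of this update in Python (my own illustration: grad_f, the quadratic valley example, and β = 0.9 are assumptions, not from the slides):

    import numpy as np

    def momentum_gd(grad_f, w0, alpha=0.01, beta=0.9, iters=500):
        w = np.asarray(w0, dtype=float)
        delta = np.zeros_like(w)                      # previous update, initially zero
        for _ in range(iters):
            delta = beta * delta + (1 - beta) * alpha * grad_f(w)   # mix old update and new gradient step
            w = w - delta
        return w

    # Long narrow valley: f(w) = 0.5*(100*w[0]**2 + w[1]**2); its gradient is (100*w[0], w[1])
    w = momentum_gd(lambda w: np.array([100.0 * w[0], w[1]]), w0=[1.0, 1.0])
    print(w)                                          # both coordinates head towards 0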

33 Jacobian and the Second Derivative The matrix containing all partial derivatives of a function whose input and output are both vectors is called the Jacobian matrix. For f: R^n → R^m, the Jacobian J ∈ R^{m×n} of f is defined such that J_{i,j} = ∂f(x)_i / ∂x_j. The 2nd derivative, ∂²f / ∂x_i∂x_j, tells us about curvature. In the single dimension, we can denote this by f″(x) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 33/91

34 Curvature helping determine step size ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 34/91

35 Hessian Matrix For multiple input dimensions, the 2nd order derivatives define a matrix called the Hessian Matrix, H(f)(x), with H(f)(x)_{i,j} = ∂²f(x) / ∂x_i∂x_j. The Hessian is the Jacobian of the gradient. Eigenvectors/eigenvalues of the Hessian describe the directions of principal curvature and the amount of curvature in each direction. - Maximum sensible step size is 2/λ_max - Rate of convergence depends on 1 - 2·λ_min/λ_max ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 35/91
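A small numerical illustration (my own; for a quadratic f(w) = ½·wᵀAw the Hessian is the constant matrix A):

    import numpy as np

    A = np.array([[100.0, 0.0],        # Hessian of the quadratic f(w) = 0.5 * w.T @ A @ w
                  [0.0,   1.0]])
    eigvals = np.linalg.eigvalsh(A)    # principal curvatures
    lam_min, lam_max = eigvals.min(), eigvals.max()
    max_step = 2.0 / lam_max           # largest step size that still converges
    print(lam_min, lam_max, max_step)  # 1.0 100.0 0.02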

36 Approximate Hessian It can be very expensive to calculate and store the Hessian matrix. You can approximate the second order curvature using recent function and gradient evaluations (a sequence of gradients). - Newton s Method and Quasi-Newton Methods See Chapter 4 in the Book for more details ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 36/91

37 Second Derivative The directional 2nd derivative tells us how well we can expect a gradient descent step to perform. The 2nd derivative can be used to determine whether a critical point (f′(x)=0) is a local minimum, local maximum, or a saddle point. - f″(x)>0, f′(x)=0 implies local minimum - f″(x)<0, f′(x)=0 implies local maximum - f″(x)=0, f′(x)=0 implies saddle point or flat region We can approximate the 2nd derivative using a 2nd order Taylor series approximation - See Deep Learning book for details ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 37/91

38 Poor Conditioning Functions that are not smooth are problematic for scientific computing: for a non-smooth function, small rounding errors in the inputs can lead to large changes in the output Conditioning refers to how rapidly a function changes with respect to small changes in its input Matrices with a high condition number make operations such as matrix inversion sensitive to error in the input - See 4.2 in the book for more details ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 38/91

39 Poor Conditioning The Condition number of the Hessian measures how much the 2nd derivatives vary. - Poor condition number => poor gradient descent. - Gradient descent is unaware of the change in the derivative Difficult to choose a step size We can use the Hessian to guide search ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 39/91

40 But, why does Deep Learning work? 1990s: basic result that there are infinitely many local optima in sufficiently complex functions In low dimensionality, local optima dominate. In high dimensionality, local optima appear to be clustered close to the global optimum. Convexity is not needed [Loss surface of multi-layer Nets, LeCun et al., 2015] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 40/91

41 Large-Scale ML Pipelines ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 41/91

42 Parallel Data Processing Trade-Off 1. Scale-Up by buying a more powerful machine - More cores, more expensive - No network cost - Disk slows us down 2. Scale-Out using Data Parallel and in-memory computation - Persist in-memory (especially for iterative computations) - Parallelism makes computations faster - Network makes communication slow ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 42/91

43 Scale-Out Commodity Hardware Deal with Network Deal with partial host/network failures [Figure: a cluster of commodity servers, server1 … serverN] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 43/91

44 Minimize Network and Disk I/O We need to store and communicate raw data, features, and model objects. - Keep large objects local ML algorithms are typically iterative - Reduce the number of iterations Do not read/write to disk on every iteration - Keep state in memory between iterations Logistic Regression for 100GB on 50 AWS nodes /91

45 Batching to Improve Performance Throughput: # bytes read per second Latency: cost to send messages (size independent) Over-simplification to say that message complexity is only proportional to the amount of data sent on the network - Latency introduces a fixed cost overhead independent of the size of the message Amortize latency costs - Sending larger messages (batching) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 45/91

46 Model Parallel Training Consider hyperparameter tuning for ridge regression with small n and small d - Evaluate for different regularization parameter values of λ - Each collection of different hyperparameters is a model Train a copy of the model locally on different worker nodes (model parallel) - Map phase Data is small, so can communicate it ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 47/91

47 Linear Regression: Big n, Huge d Both Data parallelism and Model parallelism needed O(d) communication slow with hundreds of millions of parameters - As in Deep Learning Possible solutions include - Rely on sparsity to reduce communication - Asynchronous, concurrent updates to the model Asynchronous stochastic gradient descent (more later) Need algorithms that compute more, and communicate less ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 48/91

48 Gradient Descent Linear Regression: Big n, Big d - On each iteration, communicate parameter vector w_i - O(d) communication OK for fairly large d.
    train.cache()  # persist the training data across iterations
    for i in range(numIters):
        alpha_i = alpha / (n * np.sqrt(i + 1))
        gradient = train.map(lambda lp: gradientSummand(w, lp)).sum()
        w -= alpha_i * gradient
Gradient Descent 49/91

49 Divide and Conquer Approach Fully process each partition locally - Only communicate the final result Single iteration; minimal communication Approximate results
    w = train.mapPartitions(localLinearRegression).reduce(combineLocalRegressionResults)
Divide-and-Conquer ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 50/91

50 Stochastic Gradient Descent (SGD) Recall gradient descent for least squares, every iteration processes O(n) samples: w_{i+1} = w_i - α_i Σ_{j=1}^n (w_iᵀ x_j - y_j) x_j. The gradient is an expectation that can be approximated using a small set of samples drawn uniformly at random from the dataset. - In particular, if the dataset is highly redundant, the gradient in the first half will be very similar to the gradient in the 2nd half. In SGD, we update the model with only one sample instead of n ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 51/91
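A self-contained sketch of the single-sample update (my illustration; X, y, and the constant step size are synthetic assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.01 * rng.standard_normal(n)

    w, alpha = np.zeros(d), 0.01
    for i in range(20000):
        j = rng.integers(n)                        # one sample drawn uniformly at random
        w = w - alpha * (w @ X[j] - y[j]) * X[j]   # update with a single summand
    print(np.linalg.norm(w - w_true))              # small: w is close to the true weights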

51 Minibatch Gradient Descent Increase the batch size from 1 (from SGD) to m. - Usually better than SGD. Divide the dataset into small batches of m examples, compute the gradient using a single batch, make an update, then move to the next batch of examples. - Computing the gradient simultaneously uses matrix-matrix multiplies which are efficient, especially on GPUs - Mini-batches need to be balanced for classes E.g., m=10: w_{i+1} = w_i - α_i Σ_{j=1}^{10} (w_iᵀ x_j - y_j) x_j ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 52/91
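A minibatch variant of the same kind of sketch (again my own illustration; m = 10 examples per batch, with the per-batch gradient computed by one matrix multiply):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, m = 1000, 5, 10                          # m is the minibatch size
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)

    w, alpha = np.zeros(d), 0.01
    for epoch in range(50):
        for start in range(0, n, m):
            Xb, yb = X[start:start + m], y[start:start + m]    # next batch of m examples
            w = w - alpha * (Xb.T @ (Xb @ w - yb))             # one matrix multiply per batch
    print(np.linalg.norm(X @ w - y))               # small residual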

52 Minibatch Gradient Descent Each of the k workers receives w from the driver, then performs a few local gradient steps. Then send w back to the Driver. - Reduces total number of iterations required.
    for i in range(fewerIters):
        update = train.mapPartitions(doSomeLocalGradientUpdates).reduce(combineLocalUpdates)
        w += update
Mini-batch Gradient Descent ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 53/91

53 Asynchronous / Synchronous SGD In synchronous SGD (or minibatch GD), the Spark Driver (or a parameter server) will wait until all parallel workers have returned their updated model before continuing to the next iteration. In (parallel) asynchronous SGD (or minibatch GD), a parameter server will apply model updates from parallel workers immediately, whereupon the worker can immediately get a new copy of the model to work on a new mini-batch - workers train concurrently on mini-batches without blocking. - Needs to be tolerant to stale gradients [ 54/91

54 AdaGrad For good performance with Mini-batch Gradient Descent, you typically decrease its learning rate over time. Adaptive Gradient Descent or AdaGrad is an automated per-weight method to adapt the learning rate. Each feature attribute has its own learning rate. - Increase the learning rate for more sparse parameters and decrease the learning rate for less sparse ones. - Can improve convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 55/91
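A rough per-weight AdaGrad update looks like this (a sketch of the general idea, not Spark's or any library's implementation; the gradients g and the constants are illustrative):

    import numpy as np

    def adagrad_step(w, g, G, alpha=0.1, eps=1e-8):
        """One AdaGrad update: each weight gets its own effective learning rate."""
        G += g * g                                  # accumulate squared gradients per weight
        w -= alpha * g / (np.sqrt(G) + eps)         # rarely-updated (sparse) weights keep a larger rate
        return w, G

    w = np.zeros(3)
    G = np.zeros(3)                                 # running sum of squared gradients
    for g in [np.array([1.0, 0.0, 0.1]), np.array([1.0, 2.0, 0.0])]:
        w, G = adagrad_step(w, g, G)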

55 Feature Extraction [Adapted from Distributed Machine Learning with Apache Spark, UCLA/Berkeley Course] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 56/91

56 Raw Data may or may not be Numeric Numeric Data Non-Numeric Data /91

57 Dealing with Non-Numerical Features 1. Use methods that natively support non-numeric features - Decision Trees and Naive Bayes naturally support nonnumerical features 2. Convert non-numerical features to numerical features - Allows us to use a wider range of learning methods - How do we do this? ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 58/91

58 Classifying Non-Numeric Features Categorical features - Features that can be grouped into two or more categories - No intrinsic ordering For example: Gender, Country, Occupation, Language Ordinal Features - Have two or more categories - Ordinal features can be ordered by their number (ordinal), but there is no consistent spacing between categories, i.e., all we have is a relative ordering - User feedback in survey questions, e.g., Did ID2223 meet its learning objectives? No, somewhat, yes ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 59/91

59 Ordinal Features Ordinal Features: - Survey categories = { no, somewhat, yes } Create single numerical feature: - Survey categories ={ no = 1, somewhat = 2, yes = 3} We can use a single numerical feature that preserves this ordering. We can introduce a degree of closeness that didn't previously exist ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 60/91

60 Categorical Features Create single numerical feature to represent nonnumeric categories. Country categories = { ARG, FRA, USA } become: ARG = 1, FRA = 2, USA = 3 - Implication that FRA lies between ARG and USA Creating single numerical feature introduces relationships between categories that don t otherwise exist ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 61/91

61 One-Hot-Encoding (OHE) Instead of ordinals, we can create a vector containing each category entry and set the active category value to 1, with all other category values set to 0. Country categories = { ARG, FRA, USA } - One new dummy feature for each category - ARG [1 0 0], FRA [0 1 0], USA [0 0 1] Creating dummy features doesn t introduce spurious relationships ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 62/91

62 Step 1: Create OHE Dictionary Features: - Animal = { bear, cat, mouse } - Color = { black, white } - Diet = { mouse, salmon } 7 dummy features in total - mouse category distinct for Animal and Diet features OHE Dictionary: Maps each category to a dummy feature - (Animal, bear ) → 0 - (Animal, cat ) → 1 - (Animal, mouse ) → 2 - (Color, black ) → 3, … ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 63/91

63 Step 2: Create Features with Dictionary Datapoints: - A1 = [ mouse, black, - ] - A2 = [ cat, white, mouse ] - A3 = [ bear, black, salmon ] OHE Features: - Map each non-numeric feature to its binary dummy feature E.g., A1 = [0, 0, 1, 1, 0, 0, 0] OHE Dictionary: Maps each category to a dummy feature - (Animal, bear ) → 0 - (Animal, cat ) → 1 - (Animal, mouse ) → 2 - (Color, black ) → 3, … ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 64/91
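A plain-Python sketch of both steps, using the slide's dictionary and datapoint A1 (an illustration of the idea, not the SparkML OneHotEncoder API):

    # Step 1: build the OHE dictionary from the (feature, category) pairs
    categories = [('Animal', 'bear'), ('Animal', 'cat'), ('Animal', 'mouse'),
                  ('Color', 'black'), ('Color', 'white'),
                  ('Diet', 'mouse'), ('Diet', 'salmon')]
    ohe_dict = {cat: idx for idx, cat in enumerate(categories)}   # e.g. ('Animal', 'bear') -> 0

    # Step 2: encode a datapoint as a 0/1 vector using the dictionary
    def encode(datapoint):
        vec = [0] * len(ohe_dict)
        for feature, value in datapoint:
            if value is not None:
                vec[ohe_dict[(feature, value)]] = 1
        return vec

    a1 = [('Animal', 'mouse'), ('Color', 'black'), ('Diet', None)]
    print(encode(a1))   # [0, 0, 1, 1, 0, 0, 0]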

64 OHE Features are Sparse For a given categorical feature only a single OHE feature is non-zero can we take advantage of this fact? Dense representation: Store all numbers - E.g., A1 = [0, 0, 1, 1, 0, 0, 0] Sparse representation: Store indices / values for non-zeros - Assume all other entries are zero - E.g., A1 = [ (2,1), (3,1) ] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 65/91
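In SparkML, this sparse representation corresponds to a SparseVector; a short sketch (assumes pyspark is installed):

    from pyspark.ml.linalg import Vectors

    dense = Vectors.dense([0, 0, 1, 1, 0, 0, 0])          # stores all 7 entries
    sparse = Vectors.sparse(7, [2, 3], [1.0, 1.0])        # size, indices of non-zeros, their values
    print(sparse.toArray())                               # [0. 0. 1. 1. 0. 0. 0.]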

65 Sparse Representation Example: Assume a Matrix with 10M observations and 1K features. Assume 1% non-zeros. Storage costs for the Dense representation: - (Stores all the numbers) - Store 10M × 1K entries as doubles ⇒ 80GB storage Storage costs for the Sparse representation: - (Only store indices / values for non-zeros) - Store value and location for non-zeros (2 doubles per entry) ⇒ about 1.6GB - How much savings in storage? 50× savings in storage - We will also see computational savings for matrix operations ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 66/91

66 Feature Hashing ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 67/91

67 High Dimensionality of OHE Statistically: Inefficient learning - We generally need bigger n when we have bigger d (though in distributed setting we often have very large n) - We will have many non-predictive features Computationally: Increased communication - Linear models have parameter vectors of dimension d - Gradient descent communicates the parameter vector to all workers at each iteration ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 68/91

68 Feature Hashing Feature hashing, a.k.a. the hashing trick. Dummy features can drastically increase dimensionality. Feature hashing reduces dimensionality by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array. - No need to compute an expensive OHE dictionary - Preserves sparsity - Theoretical underpinnings ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 69/91

69 Features to Hash Tables Design a Hash Function that maps an object to one of m buckets - Lookup time for the bucket will be O(1) and we should distribute objects across buckets Values are stored in the Hash Table buckets and correspond to feature categories - We have fewer buckets than feature categories - Different categories will map to same bucket (collisions) - Bucket indices are hashed features ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 70/91

70 Feature Hashing Example Datapoints: 7 feature categories - A1 = [ mouse, black, - ] - A2 = [ cat, tabby, mouse ] - A3 = [ bear, black, salmon ] Hashed Features (counts per bucket): - A1 = [0, 0, 1, 1] - A2 = [2, 0, 1, 0] - A3 = [ ] Hash Function: m = 4 H(Animal, mouse ) = 3 H(Color, black ) = 2 H(Animal, cat ) = 0 H(Color, tabby ) = 0 H(Diet, mouse ) = 2 H(Animal, bear ) = 0 H(Color, black ) = 2 H(Diet, salmon ) = ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 71/91
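A plain-Python sketch of the hashing trick (it uses Python's built-in hash for illustration, so the bucket assignments will differ from the slide's example hash function and from Spark's HashingTF):

    def hash_features(datapoint, m=4):
        """Map (feature, category) pairs into m buckets and count how many land in each."""
        vec = [0] * m
        for feature, value in datapoint:
            if value is not None:
                bucket = hash((feature, value)) % m   # hashed index, no dictionary needed
                vec[bucket] += 1
        return vec

    a1 = [('Animal', 'mouse'), ('Color', 'black'), ('Diet', None)]
    print(hash_features(a1))   # e.g. [0, 1, 0, 1]; exact buckets depend on the hash function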

71 Feature Hashing Evaluation Hash features have nice theoretical properties - Good approximations of inner products of OHE features under certain conditions - Many learning methods (including linear / logistic regression) can be viewed solely in terms of inner products Good empirical performance - Spam filtering and various other text classification tasks Hashed features are a reasonable alternative for OHE features ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 72/91

72 Apache Spark and Spark ML ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 73/91

73 Apache Spark In-Memory data-parallel processing engine - Unified framework for processing data in large-scale DBs (SparkSQL), machine learning, and graph processing. Support large-scale machine learning - Fast iterative procedures - Efficient communication primitives Integrated with Apache Hadoop (HDFS, YARN) APIs for Scala, Java, Python, R ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 74/91

74 Apache Spark ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 75/91

75 DataFrames In Spark, a DataFrame is a distributed collection of data organized into named columns - Conceptually equivalent to a table in a relational database or a data frame in R/Python With a DataFrame, we can build a ML Pipeline in Spark ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 76/91

76 Spark ML (Machine Learning Library) Consists of common learning algorithms and utilities - Classification - Regression - Clustering - Collaborative Filtering - Dimensionality Reduction New Version - spark.ml Deprecated version - spark.mllib ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 77/91

77 Machine Learning in Apache Spark SparkML integrates well with other Spark features, such as DataFrames, RDDs, and Datasets - You can load data from sources (like HDFS) into DataFrames, perform machine learning on the data to either build models or make predictions, and then save the data back to some sink (like HDFS) Its machine learning support is not as extensive as scikit-learn's in Python ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 78/91

78 SparkML Architecture [ ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 79/91

79 Spark ML Dataframe - Represents a table in SparkSQL Transformer - Transform one Dataframe to another Dataframe. Estimator - Fit one Dataframe and produce a model, which is a transformer Pipeline - Chains Transformers and Estimators together in a ML workflow Parameter - A common API for params for transformers/estimators ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 80/91

80 Transformer A Transformer is an algorithm which transforms one DataFrame into another DataFrame. - Some transformers turn a DataFrame with features into a DataFrame with predictions - e.g., LogisticRegressionModel. - There are also feature transformers e.g., HashingTF. - Implements the transform() method. Examples - HashingTF, Binarizer - StringIndexer converts String values (part of a look-up) into categorical indices - VectorAssembler constructs a Vector from raw feature columns ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 81/91

81 Estimator An Estimator is an algorithm which can have fit called on a DataFrame to produce a Transformer. - For example, training/tuning on a DataFrame and producing a model. - Implements the fit() method. Examples - LogisticRegression (produces a LogisticRegressionModel) - StandardScaler - Pipeline ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 82/91
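For example, fitting a StandardScaler (an Estimator) returns a StandardScalerModel (a Transformer); a minimal sketch, assuming an existing SparkSession named spark and the pyspark.ml API:

    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame([(Vectors.dense([1.0, 10.0]),),
                                (Vectors.dense([3.0, 30.0]),)], ["features"])
    scaler = StandardScaler(inputCol="features", outputCol="scaled")   # Estimator
    model = scaler.fit(df)                                             # fit() returns a Transformer
    model.transform(df).show()                                         # Transformer adds the "scaled" column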

82 Pipeline A Pipeline is an estimator that chains multiple Transformers and Estimators together to specify a ML workflow. [Large-Scale Machine Learning, Berkeley/Databricks] /91

83 ParamMaps, Evaluator, CrossValidator ParamMaps: Parameters to choose from, sometimes called a parameter grid to search over. - Can be passed to fit() or transform() Evaluator: Metric to measure how well a fitted Model does on held-out test data. CrossValidator: Identifies the best ParamMap and re-fits the Estimator using the best ParamMap and the entire dataset ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 84/91

84 Feature Hashing
    trainHash = train.map(applyHashFunction).map(createSparseVector)
Step 1: Apply a hash function on the raw data - Single map operation (local computation) - No need to compute OHE features or communication Step 2: Store the hashed features in a sparse representation - Single map operation (local computation) - Reduce storage and lower computation costs SparkML supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 85/91

85 StandardScaler Pipeline [Large-Scale Machine Learning, Berkeley/Databricks] 86/91

86 Example: 20 Newsgroups in SparkML [ ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 87/91

87 Pipeline Transformer 20 Newsgroups [ ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 88/91

88 Python: Read Directly into a DataFrame
    training = sqlContext.read.parquet("hdfs:///projects/datasets/20newsgroups/data-001/training").cache()
    test = sqlContext.read.parquet("hdfs:///projects/datasets/20newsgroups/data-001/test").cache()
training and test are DataFrames ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 89/91

89 Scala: Load/Convert to a DataFrame
    case class NewsgroupsCaseClass(id: String, text: String, topic: String)
    val newsgroups = newsgroupsRawData.map { case (filePath, text) =>
      val id = filePath.split("/").takeRight(1)(0)
      val topic = filePath.split("/").takeRight(2)(0)
      NewsgroupsCaseClass(id, text, topic) }.toDF()
    newsgroups.cache()
    val labeledNewsgroups = newsgroups.withColumn("label", newsgroups("topic").like("comp%").cast("double"))
    labeledNewsgroups.registerTempTable("labeledNewsgroups")
    val Array(training, test) = labeledNewsgroups.randomSplit(Array(0.9, 0.1), seed = 12345)
newsgroupsRawData is an RDD; newsgroups and labeledNewsgroups are DataFrames ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 90/91

90 Building and Training a Model
    tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\s+")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=5000)
    lr = LogisticRegression(maxIter=20, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    model = pipeline.fit(training)
RegexTokenizer tokenizes each article into a sequence of words with a regex pattern HashingTF maps the word sequences produced by RegexTokenizer to sparse feature vectors using feature hashing LogisticRegression fits the feature vectors and the labels from the training data to a logistic regression model. The pipeline is then built with 3 stages and trained to produce a model ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 91/91

91 Prediction
    # Make predictions with the model.
    prediction = model.transform(training)
    # Check some results
    prediction.select("prediction", "label", "text").limit(10).show()
With the trained model you can make predictions and check the results of the predictions ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 92/91

92 Evaluation
    # Create an evaluator for binary classification and use area under the ROC curve as the evaluation metric.
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    evaluator.evaluate(prediction)
    # Call "model.transform" on test data and then evaluate the result.
    evaluator.evaluate(model.transform(test))
    # Inspect a pipeline
    prediction.printSchema()
You can then evaluate the accuracy of your trained model using the test dataset and a BinaryClassificationEvaluator. An area under the ROC curve near 1 is very good; 0.5 is like flipping a coin ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 93/91

93 Cross-Validation for Hyperparameter Tuning
    # We generate hyperparameter combinations by taking the cross product of some parameter values we want to try.
    paramGrid = ParamGridBuilder() \
        .addGrid(hashingTF.numFeatures, [1000, 10000]) \
        .addGrid(lr.regParam, [0.05, 0.2]) \
        .build()
    cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=2)
    cvModel = cv.fit(training)
    evaluator.evaluate(cvModel.transform(training))
    evaluator.evaluate(cvModel.transform(test))
ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 94/91

94 Reading Notes Chapter 4, 5 Deep Learning Book - (KKT and Lagrangian not examined) Chapter 5 Deep Learning Book - ( , 5.11 not examined) Chapter Chain Rule of Calculus with Partial Derivatives: - Spark ML - 20 Newsgroups in Scala 20 Newsgroups in Python - Cancer Prediction Example in Scala - Distributed ML with Apache Spark, UCLA/Berkeley Course ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 95/91


More information

CS294-1 Assignment 2 Report

CS294-1 Assignment 2 Report CS294-1 Assignment 2 Report Keling Chen and Huasha Zhao February 24, 2012 1 Introduction The goal of this homework is to predict a users numeric rating for a book from the text of the user s review. The

More information

Machine Learning / Jan 27, 2010

Machine Learning / Jan 27, 2010 Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,

More information

Chapter Multidimensional Gradient Method

Chapter Multidimensional Gradient Method Chapter 09.04 Multidimensional Gradient Method After reading this chapter, you should be able to: 1. Understand how multi-dimensional gradient methods are different from direct search methods. Understand

More information

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components

More information

M. Sc. (Artificial Intelligence and Machine Learning)

M. Sc. (Artificial Intelligence and Machine Learning) Course Name: Advanced Python Course Code: MSCAI 122 This course will introduce students to advanced python implementations and the latest Machine Learning and Deep learning libraries, Scikit-Learn and

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January 24 2019 Logistics HW 1 is due on Friday 01/25 Project proposal: due Feb 21 1 page description

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety

More information

Neural Networks (pp )

Neural Networks (pp ) Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Lecture 22 : Distributed Systems for ML

Lecture 22 : Distributed Systems for ML 10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.

More information

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry Jincheng Cao, SCPD Jincheng@stanford.edu 1. INTRODUCTION When running a direct mail campaign, it s common practice

More information

CSE 546 Machine Learning, Autumn 2013 Homework 2

CSE 546 Machine Learning, Autumn 2013 Homework 2 CSE 546 Machine Learning, Autumn 2013 Homework 2 Due: Monday, October 28, beginning of class 1 Boosting [30 Points] We learned about boosting in lecture and the topic is covered in Murphy 16.4. On page

More information

Introduction to Machine Learning. Xiaojin Zhu

Introduction to Machine Learning. Xiaojin Zhu Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006

More information

Simple Model Selection Cross Validation Regularization Neural Networks

Simple Model Selection Cross Validation Regularization Neural Networks Neural Nets: Many possible refs e.g., Mitchell Chapter 4 Simple Model Selection Cross Validation Regularization Neural Networks Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information