ID2223 Lecture 3: Gradient Descent and SparkML


1 ID2223 Lecture 3: Gradient Descent and SparkML

2 Optimization Theory Review ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 2/91

3 Unconstrained Optimization Unconstrained optimization involves finding minima of functions that may have multiple inputs: min_p f(p) = f(p*), where p ∈ R^n, f: R^n → R. We deal only with minima because any maximum of f is a minimum of -f We do not always aim to find the global minimum in R^n but instead points that take smaller function values than all of their neighbors - A local minimum that is good enough ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 3

4 Local and Global Minima [Figure 4.3 from Deep Learning Book] 4/91

5 A general minimization algorithm Iterative optimization algorithms consist of setting initialization conditions, and then three iteration steps: 1. Initialize: Choose a starting point an initial guess that can either be determined by your situation or that can be actively chosen. 2. While a stopping criterion is not true (the solution is not close enough to the minimum), continue, else break and return the current solution. 3. Find a descent direction a direction in which the function value decreases near the current point. 4. Determine the step size the length of a step in the given direction that leads to a good decrease ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 5
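These steps can be written as a short loop. Below is a minimal sketch in Python (not from the lecture); it assumes a differentiable function f with gradient grad_f, a fixed step size, and a gradient-norm stopping criterion:

    import numpy as np

    def minimize(f, grad_f, p0, step_size=0.1, tol=1e-6, max_iters=1000):
        """Generic iterative minimization: initialize, test, descend, step."""
        p = np.asarray(p0, dtype=float)      # 1. initialize with a starting guess
        for _ in range(max_iters):
            g = grad_f(p)
            if np.linalg.norm(g) < tol:      # 2. stopping criterion: gradient close to zero
                break
            direction = -g                   # 3. descent direction: negative gradient
            p = p + step_size * direction    # 4. take a step of the chosen length
        return p

    # Example: f(p) = 0.5 * ||p||^2 has its minimum at the origin.
    p_min = minimize(lambda p: 0.5 * p @ p, lambda p: p, p0=[3.0, -1.0])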

6 Local and Global Minima [ 6/91

7 Stopping Criterion The minimum always has a certain property: the first derivative, i.e., the gradient, has to be zero: ∇f(p*) = 0. When the gradient is a vector with two components, the above equation translates into the vector equation ∇f(p*) = (∂f/∂x, ∂f/∂y) = (0, 0). Each component of the gradient has to be 0: ∂f/∂x = ∂f/∂y = 0, which here gives x = y = 0, or p*=(0,0) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 7

8 Critical Point A point where the gradient is zero is called a critical point. In general, a critical point does not have to be a minimum, but any minimum is a critical point [Figure 4.2 from Deep Learning Book] 8

9 Gradient Descent for Least Squares Regression ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 9/91

10 Gradient Descent for Least Squares Regression Goal: Find min_w ‖Xw - y‖₂². That is, find w that minimizes f(w) = ‖Xw - y‖₂². Scalar objective: f(w) = ‖wx - y‖₂² = Σ_{j=1}^n (wx_j - y_j)². w* = global optimum 10/91

11 Gradient Descent Start at a random point Repeat - Determine a descent direction - Choose a step size - Update Until stopping criterion is satisfied [Figure: error plotted against weight] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 11/91

12 Small Steps Down the Gradient ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 12/91

13 Small Steps Down the Gradient ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 13/91

14 Non-Convex Optimization Gradient descent is an iterative algorithm that takes small steps down the gradient until it reaches a minimum Can potentially end up in a local minimum, w, instead of the global optimum, w* ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 14/91

15 Convex Optimization In a convex optimization problem, every local minimum is also a global minimum Least Squares and Ridge Regression are linear methods that use convex optimization methods to optimize objective functions ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 15/91

16 Direction of Descent - Slope Know the error function: f(w) = weight², so its slope is df(w)/dw = 2·weight [Figure: error f(w) plotted against weight (w), showing the slope at the original weight] 16/91

17 Direction of Descent - Weight Update w_{i+1} = w_i - α_i · (df/dw)(w_i), where α_i is the step size and -(df/dw)(w_i) is the negative-slope direction 17/91

18 Descent Direction and Magnitude (1D) The opposite direction of the slope points in the direction of steepest error descent in weight space: w_{i+1} = w_i - α_i · (df/dw)(w_i), or, written with the gradient, w_{i+1} = w_i - α_i · ∇_w f(w_i). The step size α is a free parameter that has to be chosen carefully for each problem ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 18/91
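A minimal 1D sketch of this update rule (my own illustration, reusing the error function f(w) = weight² from the Direction of Descent slide and a constant step size α = 0.1):

    def f(w):                   # error as a function of the weight
        return w ** 2

    def df_dw(w):               # slope of the error function
        return 2 * w

    w, alpha = 5.0, 0.1         # initial weight and step size
    for i in range(50):
        w = w - alpha * df_dw(w)    # w_{i+1} = w_i - alpha * df/dw(w_i)
    print(w, f(w))              # w approaches 0, the minimum of f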

19 Gradient For functions with multiple inputs, we use partial derivatives to measure how much f changes as only the variable w_i increases at point w: ∂f(w)/∂w_i. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector. The gradient of f at w is the vector containing all the partial derivatives and is denoted ∇_w f(w). Use the chain rule to compute the derivatives 19/91

20 What is the gradient of Gradient Example Answer: ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 20/91

21 Chain Rule With x the input, y the intermediate value, and e the output: y = x·w₁, so ∂y/∂w₁ = x; e = y·w₂, so ∂e/∂y = w₂. Since e = x·w₁·w₂, ∂e/∂w₁ = x·w₂ = (∂e/∂y)·(∂y/∂w₁) [How Deep Neural Networks Work, Brandon Rohrer] 21/91

22 Chaining ∂err/∂weight = (∂a/∂weight)·(∂b/∂a)·(∂c/∂b)· … ·(∂err/∂n), chaining the partial derivatives through the intermediate quantities a, b, c, …, m, n [How Deep Neural Networks Work, Brandon Rohrer] 22/91

23 Step Size An example step size is α_i = α / (n·√i) - where n is the number of training points, i is the iteration step, and α is a constant ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 23/91

24 Update Rule for Least Squares Regression ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 24/91

25 Parallel Gradient Descent for Least Squares Vector Update: w_{i+1} = w_i - α_i Σ_{j=1}^n (w_iᵀ x_j - y_j) x_j. Compute the summands in parallel on the workers; the workers receive w_i on every iteration (example: n = 6, number of workers = 3) 25/91

26 The Gradient ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 26/91

27 Example: Minimizing an Objective Function Consider points p(x,y) in Euclidean space R² and the function that determines half of the squared length of vector p: f(p) = ½(x² + y²). What does it compute for p(3,-1)? f(3,-1) = ½(3² + (-1)²) = 5 ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 27/91

28 Example: Minimizing an Objective Function If we wiggle the value of x and keep everything else the same (keep the value of y the same), how do we know if we are getting closer to the minimum or not? If we move from p(3,-1) to p(2,-1), does the error get better or worse? ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 28/91

29 Quick Review: Partial Derivatives f(x, y) = 2x²y³. Assume we want to measure how much the function f(x,y) is changing at the point (a,b) if we keep y fixed (that is, set y=b): g(x) = f(x, b) = 2x²b³. The partial derivative of f(x, y) with respect to x at point (a,b) is: g′(a) = 4ab³, i.e., ∂f/∂x (a,b) = 4ab³. Notation: (∂/∂x) f(x, y) = f_x(x, y) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 29/91
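A quick check of this partial derivative with SymPy (a sketch, assuming SymPy is available; not part of the lecture):

    import sympy as sp

    x, y, a, b = sp.symbols('x y a b')
    f = 2 * x**2 * y**3
    fx = sp.diff(f, x)                  # partial derivative with respect to x
    print(fx)                           # 4*x*y**3
    print(fx.subs({x: a, y: b}))        # 4*a*b**3, matching the slide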

30 Example: Least Squares Gradient Descent If we fit a straight line y = w₁ + w₂x to a training set of two-dimensional points (x₁, y₁), …, (xₙ, yₙ) using least squares, the objective function to be minimized is: Σ_{j=1}^n (w₁ + w₂x_j - y_j)². We update both weights in a simultaneous step: (w₁, w₂) ← (w₁, w₂) - α · Σ_j ( 2(w₁ + w₂x_j - y_j), 2x_j(w₁ + w₂x_j - y_j) ) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 30/91
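A NumPy sketch of this simultaneous update (my illustration; the synthetic data and the constant step size α are assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=100)
    y = 1.5 + 3.0 * x + 0.1 * rng.standard_normal(100)   # noisy points around the line y = 1.5 + 3x

    w1, w2, alpha = 0.0, 0.0, 0.005
    for _ in range(2000):
        r = w1 + w2 * x - y                       # residuals (w1 + w2*x_j - y_j)
        grad_w1 = np.sum(2.0 * r)                 # partial derivative w.r.t. w1
        grad_w2 = np.sum(2.0 * x * r)             # partial derivative w.r.t. w2
        w1, w2 = w1 - alpha * grad_w1, w2 - alpha * grad_w2   # simultaneous update of both weights
    print(w1, w2)                                 # close to (1.5, 3.0)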

31 Line Search Try a few different lengths for the step size, α, and pick the best one. - Inexact method that incurs overhead. For several values of α, evaluate f(x - α∇f(x)) and choose the α that results in the smallest objective function value ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 31/91
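A minimal sketch of such a line search (my illustration; the candidate step sizes and the example function are arbitrary choices):

    import numpy as np

    def line_search(f, x, grad, candidates=(1.0, 0.3, 0.1, 0.03, 0.01)):
        """Evaluate f(x - alpha * grad) for a few alphas and keep the best one."""
        return min(candidates, key=lambda a: f(x - a * grad))

    # Example with f(x) = 0.5*||x||^2, whose gradient at x is x itself.
    x = np.array([3.0, -1.0])
    alpha = line_search(lambda v: 0.5 * v @ v, x, grad=x)
    x = x - alpha * x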

32 Momentum If the error surface is a long and narrow valley, gradient descent goes quickly down the valley walls, but very slowly along the valley floor. Reduce this problem by updating parameters using a combination of the previous update and the gradient update: Δw_i^{t+1} = β·Δw_i^t + (1 - β)·α_i·∇_w f(w^t). Usually β is set quite high ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 32/91
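A sketch of this update in Python (my own illustration: grad_f, the quadratic valley example, and β = 0.9 are assumptions, not from the slides):

    import numpy as np

    def momentum_gd(grad_f, w0, alpha=0.01, beta=0.9, iters=500):
        w = np.asarray(w0, dtype=float)
        delta = np.zeros_like(w)                      # previous update, initially zero
        for _ in range(iters):
            delta = beta * delta + (1 - beta) * alpha * grad_f(w)   # mix old update and new gradient step
            w = w - delta
        return w

    # Long narrow valley: f(w) = 0.5*(100*w[0]**2 + w[1]**2); its gradient is (100*w[0], w[1])
    w = momentum_gd(lambda w: np.array([100.0 * w[0], w[1]]), w0=[1.0, 1.0])
    print(w)                                          # both coordinates head towards 0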

33 Jacobian and the Second Derivative The matrix containing all partial derivatives of a function whose input and output are both vectors is called the Jacobian matrix. For f: R^n → R^m, the Jacobian J ∈ R^{m×n} of f is defined such that J_{i,j} = ∂f(x)_i / ∂x_j. The 2nd derivative, ∂²f / ∂x_i∂x_j, tells us about curvature. In the single dimension, we can denote this by f″(x) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 33/91

34 Curvature helping determine step size ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 34/91

35 Hessian Matrix For multiple input dimensions, the 2nd order derivatives define a matrix called the Hessian Matrix, H(f)(x), with H(f)(x)_{i,j} = ∂²f(x) / ∂x_i∂x_j. The Hessian is the Jacobian of the gradient. Eigenvectors/eigenvalues of the Hessian describe the directions of principal curvature and the amount of curvature in each direction. - Maximum sensible step size is 2/λ_max - Rate of convergence depends on 1 - 2·λ_min/λ_max ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 35/91
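A small numerical illustration (my own; for a quadratic f(w) = ½·wᵀAw the Hessian is the constant matrix A):

    import numpy as np

    A = np.array([[100.0, 0.0],        # Hessian of the quadratic f(w) = 0.5 * w.T @ A @ w
                  [0.0,   1.0]])
    eigvals = np.linalg.eigvalsh(A)    # principal curvatures
    lam_min, lam_max = eigvals.min(), eigvals.max()
    max_step = 2.0 / lam_max           # largest step size that still converges
    print(lam_min, lam_max, max_step)  # 1.0 100.0 0.02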

36 Approximate Hessian It can be very expensive to calculate and store the Hessian matrix. You can approximate the second order curvature using recent function and gradient evaluations (a sequence of gradients). - Newton s Method and Quasi-Newton Methods See Chapter 4 in the Book for more details ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 36/91

37 Second Derivative The directional 2nd derivative tells us how well we can expect a gradient descent step to perform. The 2nd derivative can be used to determine whether a critical point (f′(x)=0) is a local minimum, local maximum, or a saddle point. - f″(x)>0, f′(x)=0 implies local minimum - f″(x)<0, f′(x)=0 implies local maximum - f″(x)=0, f′(x)=0 implies saddle point or flat region We can approximate the 2nd derivative using a 2nd order Taylor series approximation - See Deep Learning book for details ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 37/91

38 Poor Conditioning Functions that are not smooth are problematic for scientific computing: for a non-smooth function, small rounding errors in the inputs can lead to large changes in the output Conditioning refers to how rapidly a function changes with respect to small changes in its input Matrices with a high condition number make operations such as matrix inversion sensitive to error in the input - See 4.2 in the book for more details ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 38/91

39 Poor Conditioning The Condition number of the Hessian measures how much the 2nd derivatives vary. - Poor condition number => poor gradient descent. - Gradient descent is unaware of the change in the derivative Difficult to choose a step size We can use the Hessian to guide search ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 39/91

40 But, why does Deep Learning work? 1990s: basic result that there are infinitely many local optima in sufficiently complex functions In low dimensionality, local optima dominate. In high dimensionality, local optima appear to be clustered close to the global optimum. Convexity is not needed [Loss surface of multi-layer Nets, LeCun et al., 2015] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 40/91

41 Large-Scale ML Pipelines ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 41/91

42 Parallel Data Processing Trade-Off 1. Scale-Up by buying a more powerful machine - More cores, more expensive - No network cost - Disk slows us down 2. Scale-Out using Data Parallel and in-memory computation - Persist in-memory (especially for iterative computations) - Parallelism makes computations faster - Network makes communication slow ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 42/91

43 Scale-Out Commodity Hardware Deal with Network Deal with partial host/network failures [Figure: a cluster of commodity servers, server1 … serverN] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 43/91

44 Minimize Network and Disk I/O We need to store and communicate raw data, features, and model objects. - Keep large objects local ML algorithms are typically iterative - Reduce the number of iterations Do not read/write to disk on every iteration - Keep state in memory between iterations Logistic Regression for 100GB on 50 AWS nodes /91

45 Batching to Improve Performance Throughput: # bytes read per second Latency: cost to send messages (size independent) Over-simplification to say that message complexity is only proportional to the amount of data sent on the network - Latency introduces a fixed cost overhead independent of the size of the message Amortize latency costs - Sending larger messages (batching) ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 45/91

46 Model Parallel Training Consider hyperparameter tuning for ridge regression with small n and small d - Evaluate for different regularization parameter values of λ - Each collection of different hyperparameters is a model Train a copy of the model locally on different worker nodes (model parallel) - Map phase Data is small, so can communicate it ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 47/91

47 Linear Regression: Big n, Huge d Both Data parallelism and Model parallelism needed O(d) communication slow with hundreds of millions of parameters - As in Deep Learning Possible solutions include - Rely on sparsity to reduce communication - Asynchronous, concurrent updates to the model Asynchronous stochastic gradient descent (more later) Need algorithms that compute more, and communicate less ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 48/91

48 Gradient Descent Linear Regression: Big n, Big d - On each iteration, communicate parameter vector w_i - O(d) communication OK for fairly large d.
    train.cache()  # persist the training data across iterations
    for i in range(numIters):
        alpha_i = alpha / (n * np.sqrt(i + 1))
        gradient = train.map(lambda lp: gradientSummand(w, lp)).sum()
        w -= alpha_i * gradient
Gradient Descent 49/91

49 Divide and Conquer Approach Fully process each partition locally - Only communicate the final result Single iteration; minimal communication Approximate results
    w = train.mapPartitions(localLinearRegression).reduce(combineLocalRegressionResults)
Divide-and-Conquer ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 50/91

50 Stochastic Gradient Descent (SGD) Recall gradient descent for least squares, every iteration processes O(n) samples: w_{i+1} = w_i - α_i Σ_{j=1}^n (w_iᵀ x_j - y_j) x_j. The gradient is an expectation that can be approximated using a small set of samples drawn uniformly at random from the dataset. - In particular, if the dataset is highly redundant, the gradient in the first half will be very similar to the gradient in the 2nd half. In SGD, we update the model with only one sample instead of n ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 51/91
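A self-contained sketch of the single-sample update (my illustration; X, y, and the constant step size are synthetic assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.01 * rng.standard_normal(n)

    w, alpha = np.zeros(d), 0.01
    for i in range(20000):
        j = rng.integers(n)                        # one sample drawn uniformly at random
        w = w - alpha * (w @ X[j] - y[j]) * X[j]   # update with a single summand
    print(np.linalg.norm(w - w_true))              # small: w is close to the true weights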

51 Minibatch Gradient Descent Increase the batch size from 1 (from SGD) to m. - Usually better than SGD. Divide the dataset into small batches of m examples, compute the gradient using a single batch, make an update, then move to the next batch of examples. - Computing the gradient simultaneously uses matrix-matrix multiplies which are efficient, especially on GPUs - Mini-batches need to be balanced for classes E.g., m=10: w_{i+1} = w_i - α_i Σ_{j=1}^{10} (w_iᵀ x_j - y_j) x_j ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 52/91
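A minibatch variant of the same kind of sketch (again my own illustration; m = 10 examples per batch, with the per-batch gradient computed by one matrix multiply):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, m = 1000, 5, 10                          # m is the minibatch size
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)

    w, alpha = np.zeros(d), 0.01
    for epoch in range(50):
        for start in range(0, n, m):
            Xb, yb = X[start:start + m], y[start:start + m]    # next batch of m examples
            w = w - alpha * (Xb.T @ (Xb @ w - yb))             # one matrix multiply per batch
    print(np.linalg.norm(X @ w - y))               # small residual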

52 Minibatch Gradient Descent Each of the k workers receives w from the driver, then performs a few local gradient steps. Then send w back to the Driver. - Reduces total number of iterations required.
    for i in range(fewerIters):
        update = train.mapPartitions(doSomeLocalGradientUpdates).reduce(combineLocalUpdates)
        w += update
Mini-batch Gradient Descent ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 53/91

53 Asynchronous / Synchronous SGD In synchronous SGD (or minibatch GD), the Spark Driver (or a parameter server) will wait until all parallel workers have returned their updated model before continuing to the next iteration. In (parallel) asynchronous SGD (or minibatch GD), a parameter server will apply model updates from parallel workers immediately, whereupon the worker can immediately get a new copy of the model to work on a new mini-batch - workers train concurrently on mini-batches without blocking. - Needs to be tolerant to stale gradients [ 54/91

54 AdaGrad For good performance with Mini-batch Gradient Descent, you typically decrease its learning rate over time. Adaptive Gradient Descent or AdaGrad is an automated per-weight method to adapt the learning rate. Each feature attribute has its own learning rate. - Increase the learning rate for more sparse parameters and decrease the learning rate for less sparse ones. - Can improve convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 55/91
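A rough per-weight AdaGrad update looks like this (a sketch of the general idea, not Spark's or any library's implementation; the gradients g and the constants are illustrative):

    import numpy as np

    def adagrad_step(w, g, G, alpha=0.1, eps=1e-8):
        """One AdaGrad update: each weight gets its own effective learning rate."""
        G += g * g                                  # accumulate squared gradients per weight
        w -= alpha * g / (np.sqrt(G) + eps)         # rarely-updated (sparse) weights keep a larger rate
        return w, G

    w = np.zeros(3)
    G = np.zeros(3)                                 # running sum of squared gradients
    for g in [np.array([1.0, 0.0, 0.1]), np.array([1.0, 2.0, 0.0])]:
        w, G = adagrad_step(w, g, G)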

55 Feature Extraction [Adapted from Distributed Machine Learning with Apache Spark, UCLA/Berkeley Course] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 56/91

56 Raw Data may or may not be Numeric Numeric Data Non-Numeric Data /91

57 Dealing with Non-Numerical Features 1. Use methods that natively support non-numeric features - Decision Trees and Naive Bayes naturally support nonnumerical features 2. Convert non-numerical features to numerical features - Allows us to use a wider range of learning methods - How do we do this? ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 58/91

58 Classifying Non-Numeric Features Categorical features - Features that can be grouped into two or more categories - No intrinsic ordering For example: Gender, Country, Occupation, Language Ordinal Features - Have two or more categories - Ordinal features can be ordered by their number (ordinal), but there is no consistent spacing between categories, i.e., all we have is a relative ordering - User feedback in survey questions, e.g., Did ID2223 meet its learning objectives? No, somewhat, yes ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 59/91

59 Ordinal Features Ordinal Features: - Survey categories = { no, somewhat, yes } Create single numerical feature: - Survey categories ={ no = 1, somewhat = 2, yes = 3} We can use a single numerical feature that preserves this ordering. We can introduce a degree of closeness that didn't previously exist ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 60/91

60 Categorical Features Create single numerical feature to represent nonnumeric categories. Country categories = { ARG, FRA, USA } become: ARG = 1, FRA = 2, USA = 3 - Implication that FRA lies between ARG and USA Creating single numerical feature introduces relationships between categories that don t otherwise exist ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 61/91

61 One-Hot-Encoding (OHE) Instead of ordinals, we can create a vector containing each category entry and set the active category value to 1, with all other category values set to 0. Country categories = { ARG, FRA, USA } - One new dummy feature for each category - ARG [1 0 0], FRA [0 1 0], USA [0 0 1] Creating dummy features doesn t introduce spurious relationships ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 62/91

62 Step 1: Create OHE Dictionary Features: - Animal = { bear, cat, mouse } - Color = { black, white } - Diet = { mouse, salmon } 7 dummy features in total - mouse category distinct for Animal and Diet features OHE Dictionary: Maps each category to a dummy feature - (Animal, bear ) → 0 - (Animal, cat ) → 1 - (Animal, mouse ) → 2 - (Color, black ) → 3, … ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 63/91

63 Step 2: Create Features with Dictionary Datapoints: - A1 = [ mouse, black, - ] - A2 = [ cat, white, mouse ] - A3 = [ bear, black, salmon ] OHE Features: - Map each non-numeric feature to its binary dummy feature E.g., A1 = [0, 0, 1, 1, 0, 0, 0] OHE Dictionary: Maps each category to a dummy feature - (Animal, bear ) → 0 - (Animal, cat ) → 1 - (Animal, mouse ) → 2 - (Color, black ) → 3, … ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 64/91
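A plain-Python sketch of both steps, using the slide's dictionary and datapoint A1 (an illustration of the idea, not the SparkML OneHotEncoder API):

    # Step 1: build the OHE dictionary from the (feature, category) pairs
    categories = [('Animal', 'bear'), ('Animal', 'cat'), ('Animal', 'mouse'),
                  ('Color', 'black'), ('Color', 'white'),
                  ('Diet', 'mouse'), ('Diet', 'salmon')]
    ohe_dict = {cat: idx for idx, cat in enumerate(categories)}   # e.g. ('Animal', 'bear') -> 0

    # Step 2: encode a datapoint as a 0/1 vector using the dictionary
    def encode(datapoint):
        vec = [0] * len(ohe_dict)
        for feature, value in datapoint:
            if value is not None:
                vec[ohe_dict[(feature, value)]] = 1
        return vec

    a1 = [('Animal', 'mouse'), ('Color', 'black'), ('Diet', None)]
    print(encode(a1))   # [0, 0, 1, 1, 0, 0, 0]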

64 OHE Features are Sparse For a given categorical feature only a single OHE feature is non-zero can we take advantage of this fact? Dense representation: Store all numbers - E.g., A1 = [0, 0, 1, 1, 0, 0, 0] Sparse representation: Store indices / values for non-zeros - Assume all other entries are zero - E.g., A1 = [ (2,1), (3,1) ] ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 65/91
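In SparkML, this sparse representation corresponds to a SparseVector; a short sketch (assumes pyspark is installed):

    from pyspark.ml.linalg import Vectors

    dense = Vectors.dense([0, 0, 1, 1, 0, 0, 0])          # stores all 7 entries
    sparse = Vectors.sparse(7, [2, 3], [1.0, 1.0])        # size, indices of non-zeros, their values
    print(sparse.toArray())                               # [0. 0. 1. 1. 0. 0. 0.]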

65 Sparse Representation Example: Assume a Matrix with 10M observations and 1K features. Assume 1% non-zeros. Storage costs for the Dense representation: - (Stores all the numbers) - Store 10M × 1K entries as doubles ⇒ 80GB storage Storage costs for the Sparse representation: - (Only store indices / values for non-zeros) - Store value and location for non-zeros (2 doubles per entry) ⇒ about 1.6GB - How much savings in storage? 50× savings in storage - We will also see computational savings for matrix operations ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 66/91

66 Feature Hashing ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 67/91

67 High Dimensionality of OHE Statistically: Inefficient learning - We generally need bigger n when we have bigger d (though in distributed setting we often have very large n) - We will have many non-predictive features Computationally: Increased communication - Linear models have parameter vectors of dimension d - Gradient descent communicates the parameter vector to all workers at each iteration ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 68/91

68 Feature Hashing Feature hashing, a.k.a. the hashing trick. Dummy features can drastically increase dimensionality. Feature hashing reduces dimensionality by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array. - No need to compute an expensive OHE dictionary - Preserves sparsity - Theoretical underpinnings ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 69/91

69 Features to Hash Tables Design a Hash Function that maps an object to one of m buckets - Lookup time for the bucket will be O(1) and we should distribute objects across buckets Values are stored in the Hash Table buckets and correspond to feature categories - We have fewer buckets than feature categories - Different categories will map to same bucket (collisions) - Bucket indices are hashed features ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 70/91

70 Feature Hashing Example Datapoints: 7 feature categories - A1 = [ mouse, black, - ] - A2 = [ cat, tabby, mouse ] - A3 = [ bear, black, salmon ] Hashed Features (counts per bucket): - A1 = [0, 0, 1, 1] - A2 = [2, 0, 1, 0] - A3 = [ ] Hash Function: m = 4 H(Animal, mouse ) = 3 H(Color, black ) = 2 H(Animal, cat ) = 0 H(Color, tabby ) = 0 H(Diet, mouse ) = 2 H(Animal, bear ) = 0 H(Color, black ) = 2 H(Diet, salmon ) = ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 71/91
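A plain-Python sketch of the hashing trick (it uses Python's built-in hash for illustration, so the bucket assignments will differ from the slide's example hash function and from Spark's HashingTF):

    def hash_features(datapoint, m=4):
        """Map (feature, category) pairs into m buckets and count how many land in each."""
        vec = [0] * m
        for feature, value in datapoint:
            if value is not None:
                bucket = hash((feature, value)) % m   # hashed index, no dictionary needed
                vec[bucket] += 1
        return vec

    a1 = [('Animal', 'mouse'), ('Color', 'black'), ('Diet', None)]
    print(hash_features(a1))   # e.g. [0, 1, 0, 1]; exact buckets depend on the hash function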

71 Feature Hashing Evaluation Hash features have nice theoretical properties - Good approximations of inner products of OHE features under certain conditions - Many learning methods (including linear / logistic regression) can be viewed solely in terms of inner products Good empirical performance - Spam filtering and various other text classification tasks Hashed features are a reasonable alternative for OHE features ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 72/91

72 Apache Spark and Spark ML ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 73/91

73 Apache Spark In-Memory data-parallel processing engine - Unified framework for processing data in large-scale DBs (SparkSQL), machine learning, and graph processing. Support large-scale machine learning - Fast iterative procedures - Efficient communication primitives Integrated with Apache Hadoop (HDFS, YARN) APIs for Scala, Java, Python, R ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 74/91

74 Apache Spark ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 75/91

75 DataFrames In Spark, a DataFrame is a distributed collection of data organized into named columns - Conceptually equivalent to a table in a relational database or a data frame in R/Python With a DataFrame, we can build a ML Pipeline in Spark ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 76/91

76 Spark ML (Machine Learning Library) Consists of common learning algorithms and utilities - Classification - Regression - Clustering - Collaborative Filtering - Dimensionality Reduction New Version - spark.ml Deprecated version - spark.mllib ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 77/91

77 Machine Learning in Apache Spark SparkML integrates well with other Spark features, such as DataFrames, RDDs, and Datasets - You can load data from sources (like HDFS) into DataFrames, perform machine learning on the data to either build models or make predictions, and then save the data back to some sink (like HDFS) Its machine learning support is not as extensive as scikit-learn's in Python ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 78/91

78 SparkML Architecture [ ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 79/91

79 Spark ML Dataframe - Represents a table in SparkSQL Transformer - Transform one Dataframe to another Dataframe. Estimator - Fit one Dataframe and produce a model, which is a transformer Pipeline - Chains Transformers and Estimators together in a ML workflow Parameter - A common API for params for transformers/estimators ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 80/91

80 Transformer A Transformer is an algorithm which transforms one DataFrame into another DataFrame. - Some transformers turn a DataFrame with features into a DataFrame with predictions - e.g., LogisticRegressionModel. - There are also feature transformers e.g., HashingTF. - Implements the transform() method. Examples - HashingTF, Binarizer - StringIndexer converts String values (part of a look-up) into categorical indices - VectorAssembler constructs a Vector from raw feature columns ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 81/91

81 Estimator An Estimator is an algorithm which can have fit called on a DataFrame to produce a Transformer. - For example, training/tuning on a DataFrame and producing a model. - Implements the fit() method. Examples - LogisticRegression (produces a LogisticRegressionModel) - StandardScaler - Pipeline ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 82/91
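For example, fitting a StandardScaler (an Estimator) returns a StandardScalerModel (a Transformer); a minimal sketch, assuming an existing SparkSession named spark and the pyspark.ml API:

    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame([(Vectors.dense([1.0, 10.0]),),
                                (Vectors.dense([3.0, 30.0]),)], ["features"])
    scaler = StandardScaler(inputCol="features", outputCol="scaled")   # Estimator
    model = scaler.fit(df)                                             # fit() returns a Transformer
    model.transform(df).show()                                         # Transformer adds the "scaled" column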

82 Pipeline A Pipeline is an estimator that chains multiple Transformers and Estimators together to specify a ML workflow. [Large-Scale Machine Learning, Berkeley/Databricks] /91

83 ParamMaps, Evaluator, CrossValidator ParamMaps: Parameters to choose from, sometimes called a parameter grid to search over. - Can be passed to fit() or transform() Evaluator: Metric to measure how well a fitted Model does on held-out test data. CrossValidator: Identifies the best ParamMap and re-fits the Estimator using the best ParamMap and the entire dataset ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 84/91

84 Feature Hashing
    trainHash = train.map(applyHashFunction).map(createSparseVector)
Step 1: Apply a hash function on the raw data - Single map operation (local computation) - No need to compute OHE features or communication Step 2: Store the hashed features in a sparse representation - Single map operation (local computation) - Reduce storage and lower computation costs SparkML supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 85/91

85 StandardScaler Pipeline [Large-Scale Machine Learning, Berkeley/Databricks] 86/91

86 Example: 20 Newsgroups in SparkML [ ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 87/91

87 Pipeline Transformer 20 Newsgroups [ ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 88/91

88 Python: Read Directly into a DataFrame
    training = sqlContext.read.parquet("hdfs:///projects/datasets/20newsgroups/data-001/training").cache()
    test = sqlContext.read.parquet("hdfs:///projects/datasets/20newsgroups/data-001/test").cache()
training and test are DataFrames ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 89/91

89 Scala: Load/Convert to a DataFrame
    case class NewsgroupsCaseClass(id: String, text: String, topic: String)
    val newsgroups = newsgroupsRawData.map { case (filePath, text) =>
      val id = filePath.split("/").takeRight(1)(0)
      val topic = filePath.split("/").takeRight(2)(0)
      NewsgroupsCaseClass(id, text, topic) }.toDF()
    newsgroups.cache()
    val labeledNewsgroups = newsgroups.withColumn("label", newsgroups("topic").like("comp%").cast("double"))
    labeledNewsgroups.registerTempTable("labeledNewsgroups")
    val Array(training, test) = labeledNewsgroups.randomSplit(Array(0.9, 0.1), seed = 12345)
newsgroupsRawData is an RDD; newsgroups and labeledNewsgroups are DataFrames ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 90/91

90 Building and Training a Model
    tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\s+")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=5000)
    lr = LogisticRegression(maxIter=20, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    model = pipeline.fit(training)
RegexTokenizer tokenizes each article into a sequence of words with a regex pattern HashingTF maps the word sequences produced by RegexTokenizer to sparse feature vectors using feature hashing LogisticRegression fits the feature vectors and the labels from the training data to a logistic regression model. The pipeline is then built with 3 stages and trained to produce a model ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 91/91

91 Prediction
    # Make predictions with the model.
    prediction = model.transform(training)
    # Check some results
    prediction.select("prediction", "label", "text").limit(10).show()
With the trained model you can make predictions and check the results of the predictions ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 92/91

92 Evaluation
    # Create an evaluator for binary classification and use area under the ROC curve as the evaluation metric.
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    evaluator.evaluate(prediction)
    # Call "model.transform" on test data and then evaluate the result.
    evaluator.evaluate(model.transform(test))
    # Inspect a pipeline
    prediction.printSchema()
You can then evaluate the accuracy of your trained model using the test dataset and a BinaryClassificationEvaluator. An area under the ROC curve near 1 is very good; 0.5 is like flipping a coin ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 93/91

93 Cross-Validation for Hyperparameter Tuning
    # We generate hyperparameter combinations by taking the cross product of some parameter values we want to try.
    paramGrid = ParamGridBuilder() \
        .addGrid(hashingTF.numFeatures, [1000, 10000]) \
        .addGrid(lr.regParam, [0.05, 0.2]) \
        .build()
    cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=2)
    cvModel = cv.fit(training)
    evaluator.evaluate(cvModel.transform(training))
    evaluator.evaluate(cvModel.transform(test))
ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 94/91

94 Reading Notes Chapter 4, 5 Deep Learning Book - (KKT and Lagrangian not examined) Chapter 5 Deep Learning Book - ( , 5.11 not examined) Chapter Chain Rule of Calculus with Partial Derivatives: - Spark ML - 20 Newsgroups in Scala 20 Newsgroups in Python - Cancer Prediction Example in Scala - Distributed ML with Apache Spark, UCLA/Berkeley Course ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 95/91


More information

CS294-1 Assignment 2 Report

CS294-1 Assignment 2 Report CS294-1 Assignment 2 Report Keling Chen and Huasha Zhao February 24, 2012 1 Introduction The goal of this homework is to predict a users numeric rating for a book from the text of the user s review. The

More information

Machine Learning / Jan 27, 2010

Machine Learning / Jan 27, 2010 Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,

More information

Chapter Multidimensional Gradient Method

Chapter Multidimensional Gradient Method Chapter 09.04 Multidimensional Gradient Method After reading this chapter, you should be able to: 1. Understand how multi-dimensional gradient methods are different from direct search methods. Understand

More information

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components

More information

M. Sc. (Artificial Intelligence and Machine Learning)

M. Sc. (Artificial Intelligence and Machine Learning) Course Name: Advanced Python Course Code: MSCAI 122 This course will introduce students to advanced python implementations and the latest Machine Learning and Deep learning libraries, Scikit-Learn and

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January 24 2019 Logistics HW 1 is due on Friday 01/25 Project proposal: due Feb 21 1 page description

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety

More information

Neural Networks (pp )

Neural Networks (pp ) Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Lecture 22 : Distributed Systems for ML

Lecture 22 : Distributed Systems for ML 10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.

More information

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry Jincheng Cao, SCPD Jincheng@stanford.edu 1. INTRODUCTION When running a direct mail campaign, it s common practice

More information

CSE 546 Machine Learning, Autumn 2013 Homework 2

CSE 546 Machine Learning, Autumn 2013 Homework 2 CSE 546 Machine Learning, Autumn 2013 Homework 2 Due: Monday, October 28, beginning of class 1 Boosting [30 Points] We learned about boosting in lecture and the topic is covered in Murphy 16.4. On page

More information

Introduction to Machine Learning. Xiaojin Zhu

Introduction to Machine Learning. Xiaojin Zhu Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006

More information

Simple Model Selection Cross Validation Regularization Neural Networks

Simple Model Selection Cross Validation Regularization Neural Networks Neural Nets: Many possible refs e.g., Mitchell Chapter 4 Simple Model Selection Cross Validation Regularization Neural Networks Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information