ID2223 Lecture 3: Gradient Descent and SparkML
- Bryce Bryan
Transcription
1 ID2223 Lecture 3: Gradient Descent and SparkML
2 Optimization Theory Review ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 2/91
3 Unconstrained Optimization
Unconstrained optimization involves finding minima of functions that may have multiple inputs: $\min_{p} f(p) = f(p^*)$, where $p \in \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}$.
We deal only with minima because any maximum of $f$ is a minimum of $-f$.
We do not always aim to find the global minimum in $\mathbb{R}^n$, but instead points that take smaller function values than all of their neighbors.
- A local minimum that is good enough
4 Local and Global Minima [Figure 4.3 from Deep Learning Book]
5 A general minimization algorithm
Iterative optimization algorithms consist of setting initialization conditions, followed by three iteration steps:
1. Initialize: choose a starting point, an initial guess that can either be determined by your situation or actively chosen.
2. While a stopping criterion is not true (the solution is not close enough to the minimum), continue; otherwise break and return the current solution.
3. Find a descent direction: a direction in which the function value decreases near the current point.
4. Determine the step size: the length of a step in the given direction that leads to a good decrease.
6 Local and Global Minima
7 Stopping Criterion
A minimum always has a certain property: the first derivative, i.e., the gradient, has to be zero there: $\nabla f(p^*) = 0$.
The gradient can be a vector with two components, in which case the above equation translates into the vector equation $\left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (0, 0)$.
Each component of the gradient has to be 0. In the example this gives $x = y = 0$, or $p^* = (0, 0)$.
8 Critical Point
A point $p^*$ where $\nabla f(p^*) = 0$ is called a critical point.
In general, a critical point does not have to be a minimum, but any minimum is a critical point.
[Figure 4.2 from Deep Learning Book]
9 Gradient Descent for Least Squares Regression
10 Gradient Descent for Least Squares Regression
Goal: find $\min_w \|Xw - y\|_2^2$
That is, find the $w$ that minimizes $f(w) = \|Xw - y\|_2^2$
Scalar objective: $f(w) = \|wx - y\|_2^2 = \sum_{j=1}^{n} (w x_j - y_j)^2$
$w^*$ = global optimum
11 Gradient Descent
Start at a random point
Repeat
- Determine a descent direction
- Choose a step size
- Update
Until stopping criterion is satisfied
[Figure: error plotted against weight]
12 Small Steps Down the Gradient
13 Small Steps Down the Gradient
14 Non-Convex Optimization
Gradient descent is an iterative algorithm that takes small steps down the gradient until it reaches a minimum.
It can end up in a local minimum, $w$, instead of the global optimum, $w^*$.
15 Convex Optimization
In a convex optimization problem, every local minimum is also a global minimum.
Least Squares and Ridge Regression are linear methods that use convex optimization to minimize their objective functions.
16 Direction of Descent - Slope
Know the error function: $f(w) = w^2$ (the error is the weight squared)
Slope: $\frac{\partial f(w)}{\partial w} = 2w$, i.e., 2 × the original weight
[Figure: error $f(w)$ plotted against weight $w$]
17 Direction of Descent
Weight update: $w_{i+1} = w_i - \alpha_i \frac{df}{dw}(w_i)$
Here $\alpha_i$ is the step size and $-\frac{df}{dw}(w_i)$ is the negative slope.
18 Descent Direction and Magnitude (1D)
The opposite direction of the slope points in the direction of steepest error descent in weight space:
$w_{i+1} = w_i - \alpha_i \frac{df}{dw}(w_i)$ (step size $\alpha_i$ times the negative slope)
or as:
$w_{i+1} = w_i - \alpha_i \nabla_w f(w_i)$
The step size is a free parameter that has to be chosen carefully for each problem.
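The 1D update rule can be tried out on the error function $f(w) = w^2$ from the slope slide. This is a minimal sketch: the function name `gradient_descent_1d`, the starting weight, the constant step size, and the iteration count are illustrative assumptions, not values from the lecture.

```python
def gradient_descent_1d(grad, w0, alpha, num_iters):
    """Repeatedly apply the update w <- w - alpha * grad(w)."""
    w = w0
    for _ in range(num_iters):
        w = w - alpha * grad(w)  # step against the slope
    return w


# f(w) = w**2 has slope df/dw = 2*w and its minimum at w = 0.
w_final = gradient_descent_1d(grad=lambda w: 2.0 * w, w0=5.0, alpha=0.1, num_iters=100)
print(w_final)  # a value very close to 0
```

With this step size each update multiplies the weight by 0.8, so the iterate shrinks steadily toward the minimum. A step size above 1.0 would make the same loop diverge on this function, which is one way to see why the step size has to be chosen per problem.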
19 Gradient
For functions with multiple inputs, we use partial derivatives to measure how much $f$ changes as only the variable $w_i$ increases at point $w$: $\frac{\partial}{\partial w_i} f(w)$
The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector.
The gradient of $f$ is the vector containing all the partial derivatives, denoted $\nabla_w f(w)$.
Use the chain rule to compute the derivatives.
20 Gradient Example
What is the gradient of ...
Answer: ...
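21 Chain Rule
x (input) → y (intermediate) → e (output)
- $y = x w_1$, so $\frac{\partial y}{\partial w_1} = x$
- $e = y w_2$, so $\frac{\partial e}{\partial y} = w_2$
- $e = x w_1 w_2$, so $\frac{\partial e}{\partial w_1} = x w_2$
- $\frac{\partial e}{\partial w_1} = \frac{\partial y}{\partial w_1} \cdot \frac{\partial e}{\partial y}$
[How Deep Neural Networks Work, Brandon Rohrer]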
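The chain-rule identity for the two-weight network can be checked numerically. The sketch below compares the product of the two local derivatives with a central finite-difference estimate; the input and weight values are arbitrary assumptions chosen for illustration.

```python
def forward(x, w1, w2):
    """Two-stage network: y = x * w1, then e = y * w2."""
    y = x * w1
    e = y * w2
    return y, e


x, w1, w2 = 3.0, 2.0, 5.0  # arbitrary illustrative values

# Chain rule: de/dw1 = (dy/dw1) * (de/dy) = x * w2
dy_dw1 = x
de_dy = w2
de_dw1_chain = dy_dw1 * de_dy

# Central finite difference as an independent check
h = 1e-6
_, e_plus = forward(x, w1 + h, w2)
_, e_minus = forward(x, w1 - h, w2)
de_dw1_numeric = (e_plus - e_minus) / (2 * h)

print(de_dw1_chain, de_dw1_numeric)
```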
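22 Chaining
weight → a → b → c → ... → n → err
$\frac{\partial err}{\partial weight} = \frac{\partial a}{\partial weight} \cdot \frac{\partial b}{\partial a} \cdot \frac{\partial c}{\partial b} \cdots \frac{\partial n}{\partial m} \cdot \frac{\partial err}{\partial n}$
[How Deep Neural Networks Work, Brandon Rohrer]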
23 Step Size
An example step size is $\alpha_i = \frac{\alpha}{n \sqrt{i}}$, where $n$ is the number of training points, $i$ is the iteration step, and $\alpha$ is a constant.
24 Update Rule for Least Squares Regression
25 Parallel Gradient Descent for Least Squares
Vector update: $w_{i+1} = w_i - \alpha_i \sum_{j=1}^{n} (w_i^T x_j - y_j) x_j$
Compute the summands in parallel on the workers; the workers receive all of $w_i$ on every iteration.
Example: n = 6; number of workers = 3
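The split of the gradient sum across workers can be sketched without a cluster by treating each partition as a simulated worker. The code follows the n = 6, 3-worker example; the helper names `gradient_summand` and `partial_gradient` and the toy data (points on the line y = 1 + 2x) are assumptions made for illustration.

```python
def gradient_summand(w, point):
    """One term of the sum: (w^T x_j - y_j) * x_j."""
    x, y = point
    residual = sum(wk * xk for wk, xk in zip(w, x)) - y
    return [residual * xk for xk in x]


def partial_gradient(w, partition):
    """Work done by one (simulated) worker: sum the summands of its partition."""
    total = [0.0] * len(w)
    for point in partition:
        total = [t + g for t, g in zip(total, gradient_summand(w, point))]
    return total


# n = 6 points on the line y = 1 + 2x, with features [1, x]
data = [([1.0, float(j)], 2.0 * j + 1.0) for j in range(6)]
partitions = [data[0:2], data[2:4], data[4:6]]  # 3 workers, 2 points each

w = [0.0, 0.0]  # every worker receives the full current w
partials = [partial_gradient(w, part) for part in partitions]
gradient = [sum(vals) for vals in zip(*partials)]  # the driver adds the partial sums
print(gradient)
```

Because the gradient is a plain sum over data points, adding the per-partition sums gives the same vector as a single pass over all the data.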
26 The Gradient
27 Example: Minimizing an Objective Function
Consider points $p(x, y)$ in Euclidean space $\mathbb{R}^2$ and the function that determines half of the squared length of the vector $p$: $f(p) = \frac{1}{2}(x^2 + y^2)$
What does it compute for $p(3, -1)$? $f(3, -1) = \frac{1}{2}(3^2 + (-1)^2) = 5$
28 Example: Minimizing an Objective Function
If we wiggle the value of x and keep everything else the same (keep the value of y the same), how do we know if we are getting closer to the minimum or not?
If we move from p(3,-1) to p(2,-1), does the error get better or worse?
29 Quick Review: Partial Derivatives
$f(x, y) = 2x^2 y^3$
Assume we want to measure how much the function $f(x, y)$ is changing at the point $(a, b)$ if we keep $y$ fixed (that is, set $y = b$): $g(x) = f(x, b) = 2x^2 b^3$
The partial derivative of $f(x, y)$ with respect to $x$ at the point $(a, b)$ is: $g'(a) = 4ab^3$, i.e., $f_x(a, b) = 4ab^3$
Notation: $\frac{\partial}{\partial x} f(x, y) = f_x(x, y)$
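The result $f_x(a, b) = 4ab^3$ can be sanity-checked by wiggling $x$ while holding $y$ fixed. This is a small sketch; the evaluation point (1, 2) and the step `h` are arbitrary assumptions.

```python
def f(x, y):
    return 2 * x**2 * y**3


def partial_x_analytic(a, b):
    """Partial derivative of f with respect to x at (a, b): 4*a*b^3."""
    return 4 * a * b**3


def partial_x_numeric(a, b, h=1e-5):
    """Central difference in x, keeping y fixed at b."""
    return (f(a + h, b) - f(a - h, b)) / (2 * h)


a, b = 1.0, 2.0  # arbitrary evaluation point
analytic = partial_x_analytic(a, b)
numeric = partial_x_numeric(a, b)
print(analytic, numeric)
```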
30 Example: Least Squares Gradient Descent
Suppose we fit a straight line $y = w_1 + w_2 x$ to a training set of two-dimensional points $(x_1, y_1), \ldots, (x_n, y_n)$ using least squares. The objective function to be minimized is:
$\sum_{j=1}^{n} (w_1 + w_2 x_j - y_j)^2$
We update both weights in a simultaneous step:
$\begin{pmatrix} w_1 \\ w_2 \end{pmatrix} \leftarrow \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} - \alpha \begin{pmatrix} \sum_{j=1}^{n} 2 (w_1 + w_2 x_j - y_j) \\ \sum_{j=1}^{n} 2 x_j (w_1 + w_2 x_j - y_j) \end{pmatrix}$
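The simultaneous update for the straight-line fit can be written out directly. The following is a sketch under assumed settings: the function name `fit_line`, the noiseless toy data generated from y = 1 + 2x, the constant step size, and the iteration count are all illustrative choices.

```python
def fit_line(points, alpha, num_iters):
    """Fit y = w1 + w2*x by gradient descent on the sum of squared errors."""
    w1, w2 = 0.0, 0.0
    for _ in range(num_iters):
        # Both gradient components are computed from the *current* weights ...
        g1 = sum(2.0 * (w1 + w2 * x - y) for x, y in points)
        g2 = sum(2.0 * x * (w1 + w2 * x - y) for x, y in points)
        # ... and then both weights are updated in one simultaneous step.
        w1, w2 = w1 - alpha * g1, w2 - alpha * g2
    return w1, w2


points = [(float(x), 1.0 + 2.0 * x) for x in range(5)]  # points on y = 1 + 2x
w1, w2 = fit_line(points, alpha=0.01, num_iters=5000)
print(w1, w2)  # close to 1 and 2
```

Computing `g1` and `g2` before touching either weight is what makes the step simultaneous; updating `w1` first and then using the new value in `g2` would be a different algorithm.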
31 Line Search
Try a few different lengths for the step size, $\alpha$, and pick the best one.
- Inexact method that incurs overhead.
For several values of $\alpha$, evaluate: $f(x - \alpha \nabla f(x))$
Choose the one that results in the smallest objective function value.
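A single line-search step can be sketched as follows: evaluate the objective at the point reached by each candidate step size and keep the best. The function name `line_search_step`, the 1D objective $(x - 2)^2$, and the candidate list are assumptions for illustration.

```python
def line_search_step(f, grad, x, candidate_alphas):
    """Try each step size along the negative gradient and keep the best one."""
    g = grad(x)
    best_alpha = min(candidate_alphas, key=lambda a: f(x - a * g))
    return best_alpha, x - best_alpha * g


f = lambda x: (x - 2.0) ** 2        # minimum at x = 2
grad = lambda x: 2.0 * (x - 2.0)

best_alpha, x_next = line_search_step(f, grad, x=0.0, candidate_alphas=[0.01, 0.1, 0.5, 1.0])
print(best_alpha, x_next)
```

Each candidate costs one extra evaluation of the objective, which is the overhead the slide refers to.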
32 Momentum
If the error surface is a long and narrow valley, gradient descent goes quickly down the valley walls, but very slowly along the valley floor.
Reduce this problem by updating parameters using a combination of the previous update and the gradient update:
$\Delta w_i^{t+1} = \beta \Delta w_i^{t} + (1 - \beta)\left(-\alpha \nabla_{w_i} f(w^t)\right)$
Usually $\beta$ is set quite high.
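The momentum update can be sketched on a long, narrow quadratic valley. Everything specific here is an assumption for illustration: the objective $f(w) = \frac{1}{2}(w_0^2 + 25 w_1^2)$, the value `beta=0.9`, the step size, and the name `momentum_descent`.

```python
def momentum_descent(grad, w0, alpha, beta, num_iters):
    """Blend the previous update with the new gradient step on every iteration."""
    w = list(w0)
    delta = [0.0] * len(w)  # previous update, initially zero
    for _ in range(num_iters):
        g = grad(w)
        delta = [beta * d + (1.0 - beta) * (-alpha * gk) for d, gk in zip(delta, g)]
        w = [wk + d for wk, d in zip(w, delta)]
    return w


# Narrow valley: steep in w[1] (curvature 25), shallow in w[0] (curvature 1)
grad = lambda w: [w[0], 25.0 * w[1]]

w = momentum_descent(grad, w0=[10.0, 1.0], alpha=0.5, beta=0.9, num_iters=1000)
print(w)  # both coordinates close to 0
```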
33 Jacobian and the Second Derivative
The matrix containing all partial derivatives of a function whose input and output are both vectors is called the Jacobian matrix.
For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J \in \mathbb{R}^{m \times n}$ of $f$ is defined such that $J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i$
The 2nd derivative tells us about curvature: $\frac{\partial^2}{\partial x_i \partial x_j} f$
In a single dimension, we can denote this by $f''(x)$
34 Curvature helping determine step size
35 Hessian Matrix
For multi-input dimensions, the 2nd order derivatives define a matrix called the Hessian matrix, $H(f)(x)$:
$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x)$
The Hessian is the Jacobian of the gradient.
Eigenvectors/eigenvalues of the Hessian describe the directions of principal curvature and the amount of curvature in each direction.
- Maximum sensible step size is $\frac{2}{\lambda_{max}}$
- Rate of convergence depends on $1 - 2\frac{\lambda_{min}}{\lambda_{max}}$
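For a two-input function the Hessian eigenvalues have a closed form, which makes the step-size bound easy to compute by hand. The sketch below uses a hypothetical quadratic, $f(x) = 1.5 x_1^2 + x_1 x_2 + 1.5 x_2^2$, whose Hessian is constant; the function name and the example are assumptions, not taken from the lecture.

```python
import math


def symmetric_2x2_eigenvalues(a, b, c):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b**2)
    return mean - radius, mean + radius


# Hessian of f(x) = 1.5*x1^2 + x1*x2 + 1.5*x2^2 is [[3, 1], [1, 3]]
lam_min, lam_max = symmetric_2x2_eigenvalues(3.0, 1.0, 3.0)

max_step = 2.0 / lam_max              # maximum sensible step size
condition_number = lam_max / lam_min  # ratio of largest to smallest curvature
print(lam_min, lam_max, max_step, condition_number)
```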
36 Approximate Hessian
It can be very expensive to calculate and store the Hessian matrix.
You can approximate the second-order curvature using recent function and gradient evaluations (a sequence of gradients).
- Newton's Method and Quasi-Newton Methods
See Chapter 4 in the Book for more details
37 Second Derivative
The directional 2nd derivative tells us how well we can expect a gradient descent step to perform.
The 2nd derivative can be used to determine if a critical point ($f'(x) = 0$) is a local minimum, a local maximum, or a saddle point.
- $f''(x) > 0$, $f'(x) = 0$ implies local minimum
- $f''(x) < 0$, $f'(x) = 0$ implies local maximum
- $f''(x) = 0$, $f'(x) = 0$ is inconclusive: the point may be a saddle point or part of a flat region
We can approximate the function using a 2nd order Taylor series approximation
- See Deep Learning book for details
38 Poor Conditioning
Functions that are not smooth are problematic for scientific computing: with a non-smooth function, rounding errors in the inputs can lead to large changes in the output.
Conditioning refers to how rapidly a function changes with respect to small changes in its input.
Functions with a high condition number, such as the inversion of a poorly conditioned matrix, are sensitive to error in the input.
- See 4.2 in the book for more details
39 Poor Conditioning
The condition number of the Hessian measures how much the 2nd derivatives vary.
- Poor condition number => poor gradient descent.
- Gradient descent is unaware of the change in the derivative
Difficult to choose a step size
We can use the Hessian to guide search
40 But, why does Deep Learning work?
- 1990s: basic result that there are infinitely many local optima in sufficiently complex functions
- In low dimensionality, local optima dominate.
- In high dimensionality, local optima appear to be clustered close to the global optima.
- Convexity is not needed
[Loss surface of multi-layer Nets, LeCun et al., 2015]
41 Large-Scale ML Pipelines
42 Parallel Data Processing Trade-Off
1. Scale-Up by buying a more powerful machine
- More cores, more expensive
- No network cost
- Disk slows us down
2. Scale-Out using data-parallel and in-memory computation
- Persist in-memory (especially for iterative computations)
- Parallelism makes computations faster
- Network makes communication slow
43 Scale-Out
- Commodity hardware
- Deal with the network
- Deal with partial host/network failures
[Figure: server1 ... servern ... serverz]
44 Minimize Network and Disk I/O
We need to store and communicate raw data, features, and model objects.
- Keep large objects local
ML algorithms are typically iterative
- Reduce the number of iterations
Do not read/write to disk on every iteration
- Keep state in memory between iterations
[Figure: Logistic Regression for 100GB on 50 AWS nodes]
45 Batching to Improve Performance
Throughput: # bytes read per second
Latency: cost to send messages (size independent)
It is an over-simplification to say that message complexity is only proportional to the amount of data sent on the network
- Latency introduces a fixed cost overhead independent of the size of the message
Amortize latency costs
- Sending larger messages (batching)
46 Model Parallel Training
Consider hyperparameter tuning for ridge regression with small n and small d
- Evaluate for different values of the regularization parameter λ
- Each collection of different hyperparameters is a model
Train a copy of the model locally on different worker nodes (model parallel)
- Map phase
Data is small, so we can communicate it
47 Linear Regression: Big n, Huge d
Both data parallelism and model parallelism are needed
O(d) communication is slow with hundreds of millions of parameters
- As in Deep Learning
Possible solutions include
- Rely on sparsity to reduce communication
- Asynchronous, concurrent updates to the model: asynchronous stochastic gradient descent (more later)
Need algorithms that compute more, and communicate less
48 Gradient Descent
Linear Regression: Big n, Big d
- On each iteration, communicate the parameter vector $w_i$
- O(d) communication is OK for fairly large d

    import numpy as np

    train.cache()  # persist the training data across iterations
    for i in range(numIters):
        alpha_i = alpha / (n * np.sqrt(i + 1))
        gradient = train.map(lambda lp: gradientSummand(w, lp)).sum()
        w -= alpha_i * gradient
49 Divide and Conquer Approach
- Fully process each partition locally
- Only communicate the final result
- Single iteration; minimal communication
- Approximate results

    w = train.mapPartitions(localLinearRegression).reduce(combineLocalRegressionResults)
50 Stochastic Gradient Descent (SGD)
Recall gradient descent for least squares, where every iteration processes O(n) samples:
$w_{i+1} = w_i - \alpha_i \sum_{j=1}^{n} (w_i^T x_j - y_j) x_j$
The gradient is an expectation that can be approximated using a small set of samples drawn uniformly at random from the dataset.
- In particular, if the dataset is highly redundant, the gradient in the first half will be very similar to the gradient in the 2nd half.
In SGD, we update the model using only one sample instead of n.
51 Minibatch Gradient Descent
Increase the batch size from 1 (as in SGD) to m.
- Usually better than SGD.
Divide the dataset into small batches of m examples, compute the gradient using a single batch, make an update, then move to the next batch of examples.
- Computing the gradient for a whole batch at once uses matrix-matrix multiplies, which are efficient, especially on GPUs
- Mini-batches need to be balanced for classes
E.g., m = 10: $w_{i+1} = w_i - \alpha_i \sum_{j=1}^{10} (w_i^T x_j - y_j) x_j$
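The minibatch update for least squares can be sketched in plain Python; setting `m=1` reduces it to SGD. The function name `minibatch_gd`, the fixed seed, the toy data (noiseless points on y = 1 + 2x), the constant step size, and the epoch count are assumptions for illustration only.

```python
import random


def minibatch_gd(data, m, alpha, num_epochs, seed=0):
    """Least-squares minibatch gradient descent; m = 1 gives plain SGD."""
    rng = random.Random(seed)
    d = len(data[0][0])
    w = [0.0] * d
    indices = list(range(len(data)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), m):
            batch = [data[k] for k in indices[start:start + m]]
            grad = [0.0] * d
            for x, y in batch:
                residual = sum(wk * xk for wk, xk in zip(w, x)) - y
                grad = [g + residual * xk for g, xk in zip(grad, x)]
            # One update per batch, using only that batch's gradient
            w = [wk - alpha * g for wk, g in zip(w, grad)]
    return w


# 20 noiseless points on y = 1 + 2x with features [1, x], x in [0, 1]
data = [([1.0, j / 19.0], 1.0 + 2.0 * (j / 19.0)) for j in range(20)]
w = minibatch_gd(data, m=5, alpha=0.05, num_epochs=2000)
print(w)  # close to [1, 2]
```

On noiseless data like this a constant step size is enough; with noisy data a decaying step size is the usual choice.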
52 Minibatch Gradient Descent
Each of the k workers receives w from the driver, performs a few gradient steps locally, and then sends its w back to the driver.
- Reduces the total number of iterations required.

    for i in range(fewerIters):
        update = train.mapPartitions(doSomeLocalGradientUpdates).reduce(combineLocalUpdates)
        w += update
53 Asynchronous / Synchronous SGD
In synchronous SGD (or minibatch GD), the Spark Driver (or a parameter server) waits until all parallel workers have returned their updated model before continuing to the next iteration.
In (parallel) asynchronous SGD (or minibatch GD), a parameter server applies model updates from parallel workers immediately, whereupon the worker can immediately get a new copy of the model to work on a new mini-batch
- Workers train concurrently on mini-batches without blocking.
- Needs to be tolerant to stale gradients
54 AdaGrad
For good performance with mini-batch gradient descent, you typically decrease the learning rate over time.
Adaptive Gradient Descent, or AdaGrad, is an automated per-weight method to adapt the learning rate. Each feature attribute has its own learning rate.
- Increase the learning rate for more sparse parameters and decrease the learning rate for less sparse ones.
- Can improve convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative
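The per-weight learning rate can be sketched with the commonly used AdaGrad rule, in which each weight's step is divided by the square root of its own accumulated squared gradients. The lecture does not give the formula, so this form, the `eps` constant, and the toy quadratic objective are assumptions for illustration.

```python
import math


def adagrad(grad, w0, alpha, num_iters, eps=1e-8):
    """AdaGrad: each weight's step is scaled by its own gradient history."""
    w = list(w0)
    accum = [0.0] * len(w)  # running sum of squared gradients, one per weight
    for _ in range(num_iters):
        g = grad(w)
        accum = [a + gk * gk for a, gk in zip(accum, g)]
        w = [wk - alpha * gk / (math.sqrt(a) + eps) for wk, gk, a in zip(w, g, accum)]
    return w


# Same narrow-valley shape as a badly scaled problem: curvature 1 vs 25
grad = lambda w: [w[0], 25.0 * w[1]]

w = adagrad(grad, w0=[10.0, 1.0], alpha=1.0, num_iters=1000)
print(w)  # both coordinates close to 0
```

Weights that rarely receive a gradient accumulate little history and so keep a larger effective learning rate, which matches the slide's point about sparse parameters.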
55 Feature Extraction [Adapted from Distributed Machine Learning with Apache Spark, UCLA/Berkeley Course]
56 Raw Data may or may not be Numeric
[Figure panels: Numeric Data, Non-Numeric Data]
57 Dealing with Non-Numerical Features
1. Use methods that natively support non-numeric features
- Decision Trees and Naive Bayes naturally support non-numerical features
2. Convert non-numerical features to numerical features
- Allows us to use a wider range of learning methods
- How do we do this?
58 Classifying Non-Numeric Features
Categorical features
- Features that can be grouped into two or more categories
- No intrinsic ordering
- For example: Gender, Country, Occupation, Language
Ordinal features
- Have two or more categories
- Ordinal features can be ordered by their number (ordinal), but there is no consistent spacing between categories, i.e., all we have is a relative ordering
- User feedback in survey questions, e.g., Did ID2223 meet its learning objectives? No, somewhat, yes
59 Ordinal Features
Ordinal features:
- Survey categories = { no, somewhat, yes }
Create a single numerical feature:
- Survey categories = { no = 1, somewhat = 2, yes = 3 }
We can use a single numerical feature that preserves this ordering.
However, this can introduce a degree of closeness that didn't previously exist.
60 Categorical Features
Create a single numerical feature to represent non-numeric categories.
Country categories = { ARG, FRA, USA } become: ARG = 1, FRA = 2, USA = 3
- Implication that FRA lies between ARG and USA
Creating a single numerical feature introduces relationships between categories that don't otherwise exist.
61 One-Hot-Encoding (OHE)
Instead of ordinals, we can create a vector with one entry per category and set the active category's entry to 1, with all other entries set to 0.
Country categories = { ARG, FRA, USA }
- One new dummy feature for each category
- ARG → [1 0 0], FRA → [0 1 0], USA → [0 0 1]
Creating dummy features doesn't introduce spurious relationships.
62 Step 1: Create OHE Dictionary
Features:
- Animal = { bear, cat, mouse }
- Color = { black, white }
- Diet = { mouse, salmon }
7 dummy features in total
- The mouse category is distinct for the Animal and Diet features
OHE Dictionary: maps each category to a dummy feature
- (Animal, bear) → 0
- (Animal, cat) → 1
- (Animal, mouse) → 2
- (Color, black) → 3
- ...
63 Step 2: Create Features with Dictionary
Datapoints:
- A1 = [ mouse, black, - ]
- A2 = [ cat, white, mouse ]
- A3 = [ bear, black, salmon ]
OHE Features:
- Map each non-numeric feature to its binary dummy feature
- E.g., A1 = [0, 0, 1, 1, 0, 0, 0]
OHE Dictionary: maps each category to a dummy feature
- (Animal, bear) → 0
- (Animal, cat) → 1
- (Animal, mouse) → 2
- (Color, black) → 3
- ...
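The two OHE steps can be sketched in a few lines of Python. Features are identified here by their position (0 = Animal, 1 = Color, 2 = Diet) and a missing value is written as `None`; the function names are illustrative, and numbering the dictionary by sorted order is an assumption that happens to reproduce the indices 0 to 3 listed above.

```python
def build_ohe_dictionary(datapoints):
    """Step 1: map every distinct (feature position, category) pair to an index."""
    pairs = sorted({(feature_id, category)
                    for point in datapoints
                    for feature_id, category in enumerate(point)
                    if category is not None})
    return {pair: index for index, pair in enumerate(pairs)}


def one_hot_encode(point, ohe_dict):
    """Step 2: set the dummy feature of each present category to 1."""
    features = [0] * len(ohe_dict)
    for feature_id, category in enumerate(point):
        if category is not None:
            features[ohe_dict[(feature_id, category)]] = 1
    return features


# (Animal, Color, Diet); None marks the missing Diet value of A1
A1 = ("mouse", "black", None)
A2 = ("cat", "white", "mouse")
A3 = ("bear", "black", "salmon")

ohe_dict = build_ohe_dictionary([A1, A2, A3])
a1_features = one_hot_encode(A1, ohe_dict)
print(len(ohe_dict), a1_features)
```

Keying the dictionary on the (feature, category) pair rather than on the category alone is what keeps the Animal "mouse" and the Diet "mouse" apart.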
64 OHE Features are Sparse
For a given categorical feature, only a single OHE feature is non-zero. Can we take advantage of this fact?
Dense representation: store all numbers
- E.g., A1 = [0, 0, 1, 1, 0, 0, 0]
Sparse representation: store indices / values for non-zeros
- Assume all other entries are zero
- E.g., A1 = [ (2,1), (3,1) ]
65 Sparse Representation
Example: assume a matrix with 10M observations and 1K features, with 1% non-zeros.
Storage costs for the dense representation:
- (Stores all the numbers)
- Store 10M × 1K entries as doubles ⇒ 80 GB storage
Storage costs for the sparse representation:
- (Only store indices / values for non-zeros)
- Store value and location for non-zeros (2 doubles per entry)
- How much savings in storage? 50× savings in storage
- We will also see computational savings for matrix operations
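The storage figures can be reproduced with a short calculation, alongside a tiny dense-to-sparse conversion. This sketch assumes 8 bytes per double and, as on the slide, two doubles (index and value) per stored non-zero; the function name `to_sparse` is illustrative.

```python
def to_sparse(dense):
    """Keep only (index, value) pairs for the non-zero entries."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


a1_sparse = to_sparse([0, 0, 1, 1, 0, 0, 0])

n, d = 10_000_000, 1_000  # 10M observations, 1K features
bytes_per_double = 8      # assumption: 64-bit doubles

dense_bytes = n * d * bytes_per_double
nonzeros = n * d // 100   # 1% non-zeros
sparse_bytes = nonzeros * 2 * bytes_per_double  # index + value per non-zero
savings = dense_bytes / sparse_bytes

print(a1_sparse, dense_bytes, sparse_bytes, savings)
```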
66 Feature Hashing
67 High Dimensionality of OHE
Statistically: inefficient learning
- We generally need bigger n when we have bigger d (though in a distributed setting we often have very large n)
- We will have many non-predictive features
Computationally: increased communication
- Linear models have parameter vectors of dimension d
- Gradient descent communicates the parameter vector to all workers at each iteration
68 Feature Hashing
Feature hashing is also known as the hashing trick.
Dummy features can drastically increase dimensionality.
Feature hashing reduces dimensionality by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.
- No need to compute an expensive OHE dictionary
- Preserves sparsity
- Theoretical underpinnings
69 Features to Hash Tables
Design a hash function that maps an object to one of m buckets
- Lookup time for the bucket will be O(1), and we should distribute objects across buckets
Values are stored in the hash table buckets and correspond to feature categories
- We have fewer buckets than feature categories
- Different categories will map to the same bucket (collisions)
- Bucket indices are hashed features
70 Feature Hashing Example
Datapoints: 7 feature categories
- A1 = [ mouse, black, - ]
- A2 = [ cat, tabby, mouse ]
- A3 = [ bear, black, salmon ]
Hashed features:
- A1 = [0 0 1 1]
- A2 = [2 0 1 0]
- A3 = [ ]
Hash function: m = 4
- H(Animal, mouse) = 3
- H(Color, black) = 2
- H(Animal, cat) = 0
- H(Color, tabby) = 0
- H(Diet, mouse) = 2
- H(Animal, bear) = 0
- H(Color, black) = 2
- H(Diet, salmon) =
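The hashed feature vectors follow mechanically from the hash values: each category increments the bucket it hashes to. In the sketch below a lookup table holding the hash values listed in the example stands in for a real hash function, which is an assumption made so the output is reproducible; only A1 and A2 are encoded because they use only hash values that are given.

```python
# Stand-in for a real hash function: the bucket indices listed in the example (m = 4).
SLIDE_HASHES = {
    ("Animal", "mouse"): 3,
    ("Color", "black"): 2,
    ("Animal", "cat"): 0,
    ("Color", "tabby"): 0,
    ("Diet", "mouse"): 2,
    ("Animal", "bear"): 0,
}


def slide_hash(feature, category):
    return SLIDE_HASHES[(feature, category)]


def hash_features(point, m, hash_fn):
    """Count how many of the point's categories land in each of the m buckets."""
    buckets = [0] * m
    for feature, category in point:
        buckets[hash_fn(feature, category)] += 1  # collisions simply add up
    return buckets


A1 = [("Animal", "mouse"), ("Color", "black")]
A2 = [("Animal", "cat"), ("Color", "tabby"), ("Diet", "mouse")]

a1_hashed = hash_features(A1, 4, slide_hash)
a2_hashed = hash_features(A2, 4, slide_hash)
print(a1_hashed, a2_hashed)
```

In A2 the categories cat and tabby collide in bucket 0, which is why that entry is 2 rather than 1.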
71 Feature Hashing Evaluation
Hashed features have nice theoretical properties
- Good approximations of inner products of OHE features under certain conditions
- Many learning methods (including linear / logistic regression) can be viewed solely in terms of inner products
Good empirical performance
- Spam filtering and various other text classification tasks
Hashed features are a reasonable alternative to OHE features
72 Apache Spark and Spark ML
73 Apache Spark
In-memory data-parallel processing engine
- Unified framework for processing data in large-scale DBs (SparkSQL), machine learning, and graph processing.
Supports large-scale machine learning
- Fast iterative procedures
- Efficient communication primitives
Integrated with Apache Hadoop (HDFS, YARN)
APIs for Scala, Java, Python, R
74 Apache Spark
75 DataFrames
In Spark, a DataFrame is a distributed collection of data organized into named columns
- Conceptually equivalent to a table in a relational database or a data frame in R/Python
With a DataFrame, we can build an ML Pipeline in Spark
76 Spark ML (Machine Learning Library)
Consists of common learning algorithms and utilities
- Classification
- Regression
- Clustering
- Collaborative Filtering
- Dimensionality Reduction
New version
- spark.ml
Deprecated version
- spark.mllib
77 Machine Learning in Apache Spark
SparkML integrates well with other Spark features, such as DataFrames, RDDs, and Datasets
- You can load data from sources (like HDFS) into DataFrames, perform machine learning on the data to either build models or make predictions, and then save the data back to some sink (like HDFS)
Its machine learning support is not as extensive as that of scikit-learn in Python
78 SparkML Architecture
79 Spark ML
DataFrame
- Represents a table in SparkSQL
Transformer
- Transforms one DataFrame into another DataFrame
Estimator
- Fits on a DataFrame and produces a model, which is a Transformer
Pipeline
- Chains Transformers and Estimators together in an ML workflow
Parameter
- A common API for params for Transformers/Estimators
80 Transformer
A Transformer is an algorithm which transforms one DataFrame into another DataFrame.
- Some transformers turn a DataFrame with features into a DataFrame with predictions, e.g., LogisticRegressionModel.
- There are also feature transformers, e.g., HashingTF.
- Implements the transform() method.
Examples
- HashingTF, Binarizer
- StringIndexer converts String values (part of a look-up) into categorical indices
- VectorAssembler constructs a Vector from raw feature columns
81 Estimator
An Estimator is an algorithm on which fit() can be called with a DataFrame to produce a Transformer.
- For example, training/tuning on a DataFrame and producing a model.
- Implements the fit() method.
Examples
- LogisticRegression (produces a LogisticRegressionModel)
- StandardScaler
- Pipeline
82 Pipeline
A Pipeline is an estimator that chains multiple Transformers and Estimators together to specify an ML workflow.
[Large-Scale Machine Learning, Berkeley/Databricks]
83 ParamMaps, Evaluator, CrossValidator
ParamMaps: parameters to choose from, sometimes called a parameter grid to search over.
- Can be passed to fit() or transform()
Evaluator: metric to measure how well a fitted Model does on held-out test data.
CrossValidator: identifies the best ParamMap and re-fits the Estimator using the best ParamMap and the entire dataset
84 Feature Hashing

    trainHash = train.map(applyHashFunction).map(createSparseVector)

Step 1: Apply a hash function on the raw data
- Single map operation (local computation)
- No need to compute OHE features, and no communication
Step 2: Store the hashed features in a sparse representation
- Single map operation (local computation)
- Reduces storage and lowers computation costs
SparkML supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values
85 StandardScaler Pipeline [Large-Scale Machine Learning, Berkeley/Databricks]
86 Example: 20 Newsgroups in SparkML
87 Pipeline Transformer 20 Newsgroups
88 Python: Read Directly into a DataFrame

    training = sqlContext.read.parquet("hdfs:///projects/datasets/20newsgroups/data-001/training").cache()
    test = sqlContext.read.parquet("hdfs:///projects/datasets/20newsgroups/data-001/test").cache()

training and test are DataFrames
89 Scala: Load/Convert to a DataFrame

    case class newsgroupsCaseClass(id: String, text: String, topic: String)

    // Derive the id and topic of each article from its file path
    val newsgroups = newsgroupsRawData.map { case (filePath, text) =>
      val id = filePath.split("/").takeRight(1)(0)
      val topic = filePath.split("/").takeRight(2)(0)
      newsgroupsCaseClass(id, text, topic)
    }.toDF()
    newsgroups.cache()

    // Label is 1.0 for topics starting with "comp", else 0.0
    val labeledNewsgroups = newsgroups.withColumn("label", newsgroups("topic").like("comp%").cast("double"))
    labeledNewsgroups.registerTempTable("labelednewsgroups")
    val Array(training, test) = labeledNewsgroups.randomSplit(Array(0.9, 0.1), seed = 12345)

newsgroupsRawData is an RDD; newsgroups (after toDF()) and labeledNewsgroups are DataFrames
90 Building and Training a Model

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, RegexTokenizer

    tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\s+")
    hashingtf = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=5000)
    lr = LogisticRegression(maxIter=20, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingtf, lr])
    model = pipeline.fit(training)

- RegexTokenizer tokenizes each article into a sequence of words with a regex pattern
- HashingTF maps the word sequences produced by RegexTokenizer to sparse feature vectors using feature hashing
- LogisticRegression fits the feature vectors and the labels from the training data to a logistic regression model
The pipeline is then built with 3 stages and trained to produce a model
91 Prediction

    # Make predictions with the trained model.
    prediction = model.transform(training)
    # Check some results
    prediction.select("prediction", "label", "text").limit(10).show()

With the trained model you can make predictions and check the results of the predictions
92 Evaluation

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Create an evaluator for binary classification and use area under the ROC curve as the evaluation metric.
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    evaluator.evaluate(prediction)
    # Call "model.transform" on test data and then evaluate the result.
    evaluator.evaluate(model.transform(test))
    # Inspect a pipeline
    prediction.printSchema()

You can then evaluate your trained model on the test dataset using a BinaryClassificationEvaluator. An area under the ROC curve near 1 is very good; 0.5 is like flipping a coin
93 Cross-Validation for Hyperparameter Tuning

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # We generate hyperparameter combinations by taking the cross product of some parameter values we want to try.
    paramgrid = ParamGridBuilder() \
        .addGrid(hashingtf.numFeatures, [1000, 10000]) \
        .addGrid(lr.regParam, [0.05, 0.2]) \
        .build()
    cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramgrid, numFolds=2)
    cvmodel = cv.fit(training)
    evaluator.evaluate(cvmodel.transform(training))
    evaluator.evaluate(cvmodel.transform(test))
94 Reading Notes
Chapter 4, 5 Deep Learning Book
- (KKT and Lagrangian not examined)
Chapter 5 Deep Learning Book
- ( , 5.11 not examined)
Chapter Chain Rule of Calculus with Partial Derivatives
Spark ML
- 20 Newsgroups in Scala
- 20 Newsgroups in Python
- Cancer Prediction Example in Scala
- Distributed ML with Apache Spark, UCLA/Berkeley Course