

May 1, CODY: Error Backpropagation, Bishop 5.3, and Support Vector Machines (SVM), Bishop Ch 7.
May 3, Class HW: SVM, PCA, and K-means, Bishop Ch 12.1, 9.1.
May 8, CODY: Machine Learning for finding oil, focusing on 1) robust seismic denoising/interpolation using structured matrix approximation and 2) seismic image clustering and classification using t-SNE (t-distributed stochastic neighbor embedding) and CNN. Weichang Li, Group Leader, Aramco, Houston.
May 10, Class HW: First distribution of final projects. Bishop Ch 9.
May 15, CODY: Seismology and Machine Learning, Daniel Trugman (half class). Bishop Ch 9.
May 17, Class HW: Ocean acoustic source tracking. Final projects. The main goal in the last 3 weeks is the final project. Bishop Ch 9.
May 22: Dictionary learning, Mike Bianco (half class). Graphical models, Bishop Ch 8.
May 24: Graphical models, Bishop Ch 8.
May 31: No class. Workshop: Big Data and The Earth Sciences: Grand Challenges Workshop.
June 5: Discuss workshop. Spiess Hall open for project discussion 11am-.
June 7: Workshop report. No class.
June 12: Spiess Hall open for project discussion 9-11:30am and 2-7pm.
June 16: Final report delivered. Beer time.

SVMs are Perceptrons! SVMs use each training case, x, to define a feature K(x, .), where K is user-chosen. So the user designs the features. SVMs do feature selection by picking support vectors, and they learn the feature weighting from a big optimization problem. So an SVM is a clever way to train a standard perceptron. Whatever a perceptron cannot do, an SVM cannot do either (but it has been a long time since 1969, so people have forgotten this). What an SVM does provide: margin maximization, the kernel trick, and sparseness.

SVM summarized --- only kernels. Minimize with respect to w, w_0:

C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2    (Bishop 7.21)

The solution is found in the dual domain with Lagrange multipliers a_n, n = 1, ..., N. This gives the set of support vectors S, through

w = \sum_{n \in S} a_n t_n \phi(x_n)    (Bishop 7.8)

which is used for predictions:

y(x) = w_0 + w^T \phi(x) = w_0 + \sum_{n \in S} a_n t_n \phi(x_n)^T \phi(x) = w_0 + \sum_{n \in S} a_n t_n k(x_n, x)    (Bishop 7.13)

case 'Classify'
    % train SVM (libsvm MATLAB interface)
    model = svmtrain(y, X, ['-c 7.46 -g ' gamma ' -q ' kernel]);
    % predict; the first argument is only a placeholder label vector
    [predict_label, ~, ~] = svmpredict(rand([length(y), 1]), X, model, '-q');

>> model
model = struct with fields:
     Parameters: [5x1 double]
       nr_class: 2
        totalSV: 36
            rho: 8.3220
          Label: [2x1 double]
     sv_indices: [36x1 double]
          ProbA: []
          ProbB: []
            nSV: [2x1 double]
        sv_coef: [36x1 double]
            SVs: [36x2 double]

The RBF kernel can be an inner product in an infinite-dimensional space. Assume x \in R^1 and \gamma > 0. Then

e^{-\gamma \|x_i - x_j\|^2} = e^{-\gamma (x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2}
= e^{-\gamma x_i^2 - \gamma x_j^2} \left( 1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots \right)
= e^{-\gamma x_i^2 - \gamma x_j^2} \left( 1 \cdot 1 + \sqrt{\frac{2\gamma}{1!}} x_i \cdot \sqrt{\frac{2\gamma}{1!}} x_j + \sqrt{\frac{(2\gamma)^2}{2!}} x_i^2 \cdot \sqrt{\frac{(2\gamma)^2}{2!}} x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}} x_i^3 \cdot \sqrt{\frac{(2\gamma)^3}{3!}} x_j^3 + \cdots \right)
= \phi(x_i)^T \phi(x_j),

where

\phi(x) = e^{-\gamma x^2} \left[ 1, \; \sqrt{\frac{2\gamma}{1!}} x, \; \sqrt{\frac{(2\gamma)^2}{2!}} x^2, \; \sqrt{\frac{(2\gamma)^3}{3!}} x^3, \; \ldots \right]^T.
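A quick numerical sanity check of this expansion in MATLAB: truncating \phi after a handful of terms already reproduces the exact kernel value. The values of gamma, x_i, and x_j below are arbitrary choices for illustration, not values from the lecture.

% Compare the exact RBF kernel with a truncated explicit feature map.
gamma = 0.5; xi = 0.7; xj = -0.3;            % arbitrary example values
k_exact = exp(-gamma*(xi - xj)^2);           % exact kernel value
d = 0:10;                                    % keep the first 11 terms of the series
phi = @(x) exp(-gamma*x^2) * (sqrt((2*gamma).^d ./ factorial(d)) .* x.^d);
k_trunc = phi(xi) * phi(xj)';                % inner product of the truncated features
disp([k_exact k_trunc])                      % the two values agree to high precision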

Finding the Decision Function. w: possibly infinitely many variables. The dual problem:

\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
subject to 0 \le \alpha_i \le C, \; i = 1, \ldots, l, and y^T \alpha = 0,

where Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j) and e = [1, \ldots, 1]^T. At the optimum,

w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i).

This is a finite problem: #variables = #training data. It corresponds to (Bishop 7.32), with y = t. Using these results to eliminate w, b, and {\xi_n} from the Lagrangian, we obtain the dual Lagrangian in the form

L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)    (7.32)

which is identical to the separable case, except that the constraints are somewhat different.
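For a small data set, the dual QP above can be handed to a generic solver directly. The following MATLAB sketch uses quadprog from the Optimization Toolbox; the toy data, C, and gamma are illustrative assumptions rather than the course example.

% Solve the dual QP for a soft-margin RBF SVM on toy two-class data.
X = [randn(20,2) + 1.5; randn(20,2) - 1.5];      % two classes of 20 points each
y = [ones(20,1); -ones(20,1)];                   % labels in {-1, +1}
C = 10; gamma = 0.5; l = length(y);
sq = sum(X.^2, 2);
K = exp(-gamma * (sq + sq' - 2*(X*X')));         % RBF Gram matrix, K_ij = k(x_i, x_j)
Q = (y*y') .* K;                                 % Q_ij = y_i y_j k(x_i, x_j)
alpha = quadprog(Q, -ones(l,1), [], [], y', 0, zeros(l,1), C*ones(l,1));
sv = alpha > 1e-6;                               % support vectors
margin = sv & (alpha < C - 1e-6);                % margin SVs (0 < alpha_i < C) fix the bias;
b = mean(y(margin) - K(margin, sv) * (alpha(sv) .* y(sv)));  % assumes at least one margin SV
% Predict a new point via y(x) = sum_i alpha_i y_i k(x_i, x) + b
xnew = [0.5 0.5];
knew = exp(-gamma * sum((X(sv,:) - xnew).^2, 2));
ypred = sign((alpha(sv) .* y(sv))' * knew + b);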

[Figure: SVM decision regions in the (x1, x2) plane for four kernels: linear, sigmoid function, polynomial, and radial basis function.]

Unsupervised Learning. Unsupervised vs supervised learning: most of this course focuses on supervised learning methods such as regression and classification. In that setting we observe both a set of features X_1, X_2, ..., X_p for each object and a response or outcome variable Y. The goal is then to predict Y using X_1, X_2, ..., X_p. Here we instead focus on unsupervised learning, where we observe only the features X_1, X_2, ..., X_p. We are not interested in prediction, because we do not have an associated response variable Y.

The Goals of Unsupervised Learning. The goal is to discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations? We discuss two methods: principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and clustering, a broad class of methods for discovering unknown subgroups in data.

Challenge of unsupervised learning. Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response. But techniques for unsupervised learning are of growing importance in a number of fields: subgroups of breast cancer patients grouped by their gene expression measurements, groups of shoppers characterized by their browsing and purchase histories, movies grouped by the ratings assigned by movie viewers. On the other hand, it is often easier to obtain unlabeled data than labeled data; think of the movie-ratings example, or of past science history.

Principal Components Analysis. Uses: dimensionality reduction, data compression (less storage and easier learning), and data visualization. PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.

PCA does not necessarily work well for classification, since it ignores the class labels.

Principal Components Analysis: details. The first principal component of a set of features X_1, X_2, \ldots, X_p is the normalized linear combination of the features

Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p

that has the largest variance. By normalized, we mean that \sum_{j=1}^{p} \phi_{j1}^2 = 1. We refer to the elements \phi_{11}, \ldots, \phi_{p1} as the loadings of the first principal component; together, the loadings make up the principal component loading vector, \phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T. We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

Computation of Principal Components. Suppose we have an n x p data set X. Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero). We then look for the linear combination of the sample feature values of the form

z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}    (1)

for i = 1, \ldots, n that has largest sample variance, subject to the constraint that \sum_{j=1}^{p} \phi_{j1}^2 = 1. Since each of the x_{ij} has mean zero, then so does z_{i1} (for any values of \phi_{j1}). Hence the sample variance of the z_{i1} can be written as \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2.

Computation: continued. Plugging in (1), the first principal component loading vector solves the optimization problem

\underset{\phi_{11}, \ldots, \phi_{p1}}{\text{maximize}} \; \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.

This problem can be solved via a singular value decomposition of the matrix X, a standard technique in linear algebra. We refer to Z_1 as the first principal component, with realized values z_{11}, \ldots, z_{n1}.
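A minimal MATLAB sketch of this computation via the SVD; the toy data matrix is an arbitrary stand-in:

% PCA of an n-by-p data matrix via the singular value decomposition.
n = 100; p = 4;
X = randn(n, p) .* [2 1 0.5 0.1];      % toy data: columns with decreasing variance
Xc = X - mean(X, 1);                   % center each variable (column means become zero)
[U, S, V] = svd(Xc, 'econ');           % columns of V are the loading vectors phi_m
Z = Xc * V;                            % principal component scores z_im
var_z1 = sum(Z(:,1).^2) / n;           % sample variance (1/n) sum_i z_i1^2 of the first PC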

Further principal components. The second principal component is the linear combination of X_1, \ldots, X_p that has maximal variance among all linear combinations that are uncorrelated with Z_1. The second principal component scores z_{12}, z_{22}, \ldots, z_{n2} take the form

z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip},

where \phi_2 is the second principal component loading vector, with elements \phi_{12}, \phi_{22}, \ldots, \phi_{p2}.

Test Data

PCA vs Regression; PCA vs Fisher. PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance, while clustering looks for homogeneous subgroups among the observations.

Proportion Variance Explained. To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one. The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

\sum_{j=1}^{p} \text{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,

and the variance explained by the mth principal component is

\text{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^2.

It can be shown that \sum_{j=1}^{p} \text{Var}(X_j) = \sum_{m=1}^{M} \text{Var}(Z_m), with M = \min(n-1, p).

Proportion Variance Explained: continued. Therefore, the PVE of the mth principal component is given by the positive quantity between 0 and 1

\frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.

The PVEs sum to one. We sometimes display the cumulative PVEs.

[Figure: scree plots of the proportion of variance explained and the cumulative proportion of variance explained versus principal component number.]
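Continuing the PCA sketch above (and assuming Xc, Z, and p from that snippet are in scope), the PVE and its cumulative sum can be computed and plotted directly:

% Proportion of variance explained by each principal component.
pve = sum(Z.^2, 1) / sum(Xc(:).^2);    % one entry per component; the entries sum to 1
cum_pve = cumsum(pve);                 % cumulative PVE, as shown in a scree plot
plot(1:p, pve, '-o'); hold on
plot(1:p, cum_pve, '-s'); hold off
xlabel('Principal Component'); ylabel('Prop. Variance Explained')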

How many principal components should we use? If we use principal components as a summary of our data, how many components are sufficient? There is no simple answer to this question, as cross-validation is not available for this purpose. Why not? When could we use cross-validation to select the number of components? The scree plot on the previous slide can be used as a guide: we look for an elbow.

Clustering. Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other. To make this concrete, we must define what it means for two or more observations to be similar or different. Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.

Two clustering methods In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.

K-means. The objective

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2    (9.1)

is the sum of the squares of the distances of each data point to its assigned cluster centre \mu_k. Solving for r_{nk} with the \mu_k held fixed gives

r_{nk} = 1 if k = \arg\min_j \|x_n - \mu_j\|^2, and 0 otherwise.    (9.2)

Differentiating J with respect to \mu_k with the r_{nk} held fixed gives

2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0    (9.3)

which we can easily solve for \mu_k to give

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}.    (9.4)

The denominator in this expression is equal to the number of points assigned to cluster k.
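A minimal MATLAB sketch of K-means that applies the two updates (9.2) and (9.4) in alternation; the toy data and K are assumptions for illustration:

% K-means on toy 2-D data, alternating the assignment and centroid updates.
X = [randn(50,2); randn(50,2) + 4];              % N-by-2 data with two blobs
K = 2;  N = size(X,1);
mu = X(randperm(N, K), :);                       % initialize centroids at K random points
for iter = 1:100
    % Assignment step (9.2): r_nk = 1 for the nearest centroid
    D = sum(X.^2, 2) + sum(mu.^2, 2)' - 2*X*mu'; % N-by-K squared distances ||x_n - mu_k||^2
    [~, idx] = min(D, [], 2);                    % idx(n) = arg min_j ||x_n - mu_j||^2
    % Update step (9.4): each centroid is the mean of its assigned points
    for k = 1:K
        mu(k, :) = mean(X(idx == k, :), 1);      % empty clusters are not handled in this sketch
    end
end
J = sum(sum((X - mu(idx, :)).^2));               % objective (9.1) at the final assignment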

K-means

K-means clustering with K=2, K=3, and K=4. A simulated data set with 150 observations in 2-dimensional space. The panels show the results of applying K-means clustering with different values of K, the number of clusters. The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm. Note that there is no ordering of the clusters, so the cluster coloring is arbitrary. These cluster labels were not used in clustering; instead, they are the outputs of the clustering procedure.

K-Means Clustering Algorithm 1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations. 2. Iterate until the cluster assignments stop changing: 2.1 For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster. 2.2 Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).

Properties of the Algorithm. This algorithm is guaranteed to decrease the value of the objective (4) at each step. Why? Note that

\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,

where \bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij} is the mean for feature j in cluster C_k. However, the algorithm is not guaranteed to give the global minimum. Why not?
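The identity can also be checked numerically; a small MATLAB sketch with an arbitrary toy cluster:

% Numerical check of the identity above for a single toy cluster (7 points, p = 3).
Xk = randn(7, 3);
D2 = sum(Xk.^2, 2) + sum(Xk.^2, 2)' - 2*(Xk*Xk');    % all ordered pairwise squared distances
lhs = sum(D2(:)) / size(Xk, 1);                      % (1/|C_k|) * sum over pairs (i, i')
rhs = 2 * sum(sum((Xk - mean(Xk, 1)).^2));           % 2 * within-cluster sum of squares
disp([lhs rhs])                                      % the two numbers agree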

Details of Previous Figure. [Figure panels: Data; Step 1; Iteration 1, Step 2a; Iteration 1, Step 2b; Iteration 2, Step 2a; Final Results.] The progress of the K-means algorithm with K=3. Top left: The observations are shown. Top center: In Step 1 of the algorithm, each observation is randomly assigned to a cluster. Top right: In Step 2(a), the cluster centroids are computed. These are shown as large colored disks. Initially the centroids are almost completely overlapping because the initial cluster assignments were chosen at random. Bottom left: In Step 2(b), each observation is assigned to the nearest centroid. Bottom center: Step 2(a) is once again performed, leading to new cluster centroids. Bottom right: The results obtained after 10 iterations.

Different starting values. [Figure: six K-means runs; the objective values above the panels are 320.9, 235.8, 235.8, 235.8, 235.8, and 310.9.] K-means clustering performed six times on the data from the previous figure with K = 3, each time with a different random assignment of the observations in Step 1 of the K-means algorithm. Above each plot is the value of the objective (4). Three different local optima were obtained, one of which resulted in a smaller value of the objective and provides better separation between the clusters. Those labeled in red all achieved the same best solution, with an objective value of 235.8.

Hierarchical Clustering K-means clustering requires us to pre-specify the number of clusters K. This can be a disadvantage (later we discuss strategies for choosing K) Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K. In this section, we describe bottom-up or agglomerative clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk.

Hierarchical Clustering Algorithm. The approach in words: Start with each point in its own cluster. Identify the closest two clusters and merge them. Repeat. The procedure ends when all points are in a single cluster. [Figure: example dendrogram for five points A, B, C, D, E, with merge heights from 0 to 4.]
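A minimal MATLAB sketch of this bottom-up procedure, using pdist, linkage, cluster, and dendrogram from the Statistics and Machine Learning Toolbox; the toy data loosely mimic the 45-observation example on the next slides:

% Agglomerative clustering of 45 toy points drawn from three groups.
X = [randn(15,2); randn(15,2) + [4 0]; randn(15,2) + [2 4]];
D = pdist(X);                        % pairwise Euclidean distances
T = linkage(D, 'complete');          % agglomerative merge tree with complete linkage
dendrogram(T)                        % full hierarchy, from the leaves up to the trunk
labels2 = cluster(T, 'maxclust', 2); % cut the dendrogram into 2 clusters
labels3 = cluster(T, 'maxclust', 3); % ... or into 3 clusters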

An Example. [Figure: 45 observations plotted in the (X1, X2) plane.] 45 observations generated in 2-dimensional space. In reality there are three distinct classes, shown in separate colors. However, we will treat these class labels as unknown and will seek to cluster the observations in order to discover the classes from the data.

Example. [Figure: three dendrograms of the same 45 observations, with heights from 0 to 10.] Left: Dendrogram obtained from hierarchically clustering the data from the previous slide, with complete linkage and Euclidean distance. Center: The dendrogram from the left-hand panel, cut at a height of 9 (indicated by the dashed line). This cut results in two distinct clusters, shown in different colors. Right: The dendrogram from the left-hand panel, now cut at a height of 5. This cut results in three distinct clusters, shown in different colors. Note that the colors were not used in clustering, but are simply used for display purposes in this figure.
