Mathematics of Data. INFO-4604, Applied Machine Learning, University of Colorado Boulder. September 5, 2017. Prof. Michael Paul

Goals: In the intro lecture, every visualization was in 2D. What happens when we have more dimensions? Vectors and data points: what does a feature vector look like geometrically? How do we calculate the distance between points? Definitions: vector products and linear functions.

Two new algorithms today. K-nearest neighbors classification: label an instance with the most common label among the most similar training instances. K-means clustering: put instances into the cluster to which they are closest (in a geometric space). Both require a way to measure the similarity/distance between instances.

Linear Regression

Linear Functions. General form of a line: f(x) = mx + b, with slope m and intercept b. Example: y = ½x + 1.

Linear Functions. The intercept b is the value of the output when the input (x) is 0.

Linear Functions. The slope m is rise over run.

Linear Functions. General form of a line: f(x) = mx + b, with slope m and intercept b. m and b are called parameters: they are constant (once specified), and are also called coefficients. x is the argument of the function: it is the input to the function.

Linear Functions. Machine learning involves learning the parameters of the predictor function. In linear regression, the predictor function is a linear function, but the parameters are unknown ahead of time. The goal is to learn what the slope and intercept should be. (How to do that is a question we'll answer next week.)

Linear Functions. Linear functions can have more than one argument: f(x_1, x_2) = m_1 x_1 + m_2 x_2 + b. Example: y = 2x_1 + 2x_2 + 5. One variable: line. Two variables: plane. (3D plot from: https://www.math.uri.edu/~bkaskosz/flashmo/graph3d/)

Linear Regression. With two input variables (and a third we want to predict), linear regression fits a plane to the points.

Linear Functions. General form of a linear function: f(x_1, …, x_k) = Σ_{i=1}^k m_i x_i + b. One variable: line. Two variables: plane. In general: hyperplane.

Linear Regression. How much will Mario Kart (Wii) sell for on eBay? (example from OpenIntro Stats, Ch. 8)

Linear Regression. How much will Mario Kart (Wii) sell for on eBay? Four features: cond_new, stock_photo, duration, and wheels.

Linear Regression. f(x) = 5.13 cond_new + 1.08 stock_photo - 0.03 duration + 7.29 wheels + 36.21. If you know the values of the four features, you can get a guess of the output (price) by plugging them into this function.

Linear Functions. f(x_1, …, x_k) = Σ_{i=1}^k m_i x_i + b. f(x) = 5.13 cond_new + 1.08 stock_photo - 0.03 duration + 7.29 wheels + 36.21. Mapping this to the general form: k = 4; x_1 = cond_new, m_1 = 5.13; x_2 = stock_photo, m_2 = 1.08; x_3 = duration, m_3 = -0.03; x_4 = wheels, m_4 = 7.29; b = 36.21.

Vector Notation. A list of values is called a vector. We can use variables to denote entire vectors as shorthand: m = <m_1, m_2, m_3, m_4>, x = <x_1, x_2, x_3, x_4>.

Vector Notation. The dot product of two vectors is written as m^T x or m · x, and is defined as: m^T x = Σ_{i=1}^k m_i x_i. Example: with m = <5.13, 1.08, -0.03, 7.29> and x = <x_1, x_2, x_3, x_4>, m^T x = 5.13 x_1 + 1.08 x_2 - 0.03 x_3 + 7.29 x_4.

Vector Notation. Equivalent notation for a linear function: f(x_1, …, x_k) = Σ_{i=1}^k m_i x_i + b, or f(x) = m^T x + b.
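To make the notation concrete, here is a minimal Python sketch (not from the slides) that evaluates the Mario Kart regression function both as an explicit sum and as a dot product; the feature values in x are hypothetical, chosen only to show the computation.

```python
import numpy as np

# Minimal sketch: evaluate f(x) = m^T x + b using the coefficients from the
# Mario Kart example. The feature values in x are hypothetical.
m = np.array([5.13, 1.08, -0.03, 7.29])   # cond_new, stock_photo, duration, wheels
b = 36.21

x = np.array([1.0, 1.0, 3.0, 2.0])        # hypothetical listing

# Explicit sum form: f(x) = sum_i m_i x_i + b
f_sum = sum(m[i] * x[i] for i in range(len(m))) + b

# Dot-product form: f(x) = m^T x + b (the same number)
f_dot = m.dot(x) + b

print(f_sum, f_dot)                       # predicted price, about 56.91 in both cases
```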

Vector Notation Terminology: A point is the same as a vector (at least as used in this class)

Vector Notation Remember: In machine learning, the number of dimensions in your points/vectors is the number of features

Pause

Distance How far apart are two points?

Distance. Euclidean distance between two points in two dimensions: sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2). In three dimensions (x, y, z): sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2).

Distance. General formulation of Euclidean distance between two points with k dimensions: d(p, q) = sqrt(Σ_{i=1}^k (p_i - q_i)^2), where p and q are the two points (each represents a k-dimensional vector).

Distance. Example: p = <1.3, 5.0, -0.5, -1.8>, q = <1.8, 5.0, 0.1, -2.3>. d(p, q) = sqrt((1.3 - 1.8)^2 + (5.0 - 5.0)^2 + (-0.5 - 0.1)^2 + (-1.8 - (-2.3))^2) = sqrt(0.86) ≈ 0.927.
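As a sanity check, here is a small Python sketch of the formula above (not part of the original slides; euclidean_distance is just an illustrative name). It reproduces the worked example.

```python
import math

def euclidean_distance(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2) for two k-dimensional points."""
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

p = [1.3, 5.0, -0.5, -1.8]
q = [1.8, 5.0, 0.1, -2.3]
print(euclidean_distance(p, q))   # ~0.927, matching sqrt(0.86) from the slide
```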

Distance. A special case is the distance between a point and zero (the origin): d(p, 0) = sqrt(Σ_{i=1}^k p_i^2). This is called the Euclidean norm of p. A norm is a measure of a vector's length. The Euclidean norm is also called the L2 norm; we'll learn about other norms later.

Distance-based Prediction Suppose you have these 20 instances, labeled with one of two classes (blue or green)

Distance-based Prediction. You have a new instance but don't know the class label.

Distance-based Prediction. You have a new instance but don't know the class label. One heuristic: label it with the label of the nearest point.

Distance-based Prediction. Sometimes the nearest point doesn't provide a great estimate.

Distance-based Prediction. Sometimes the nearest point doesn't provide a great estimate. Another heuristic: compare it to the nearest five points.

Distance-based Prediction. Among the nearest five points: 4 votes for green, 1 vote for blue. Sometimes the nearest point doesn't provide a great estimate. Another heuristic: compare it to the nearest five points.

Distance-based Prediction. The k-nearest neighbors (kNN) algorithm classifies an instance as follows: 1. Find the k labeled instances that have the lowest distance to the unlabeled instance. 2. Return the majority class (most common label) in the set of k nearest instances. kNN can also be used for regression instead of classification (though this is less common): replace the majority class in step 2 above with the average value.
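A minimal Python sketch of these two steps (the names knn_predict and train are illustrative assumptions, not anything defined in the slides):

```python
import math
from collections import Counter

def knn_predict(train, x, k=5):
    """Classify x from a list of (point, label) training pairs using kNN."""
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    # Step 1: find the k labeled instances with the lowest distance to x
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    # Step 2: return the majority class among those k neighbors
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Example with a two-class setup like the illustrations (made-up points):
train = [([1.0, 2.0], "blue"), ([1.2, 1.9], "blue"), ([4.0, 4.2], "green"),
         ([3.9, 4.5], "green"), ([4.1, 3.8], "green")]
print(knn_predict(train, [4.0, 4.0], k=3))   # "green"
```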

Distance-based Prediction. When you run the kNN algorithm, you have to decide what k should be. This is mostly an empirical question, settled by trial and error experimentally. If k is too small, the prediction will be sensitive to noise. If k is too large, the algorithm loses the local context that makes it work.

Distance-based Prediction. A common variant of kNN: weight the nearest neighbors by their distance (e.g., when calculating the majority class, give more votes to the instances that are closest). One possible implementation is sketched below.
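A sketch of this variant, under the assumption that each neighbor's vote is weighted by the inverse of its distance (other weighting schemes are also used); weighted_knn_predict and train are illustrative names.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, x, k=5):
    """Like kNN, but each of the k nearest neighbors votes with weight 1/distance."""
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        votes[label] += 1.0 / (dist(point, x) + 1e-9)  # small constant avoids division by zero
    return max(votes, key=votes.get)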

k-means Clustering

k-means Clustering Suppose we want to cluster these 20 instances into 2 groups

k-means Clustering Suppose we want to cluster these 20 instances into 2 groups One way to start: Randomly assign two of the points to clusters

k-means Clustering Suppose we want to cluster these 20 instances into 2 groups Then assign every point to the cluster corresponding to whichever of the two points it is closer to

k-means Clustering Define the center of each cluster as the mean of all the points in the cluster.

k-means Clustering Define the center of each cluster as the mean of all the points in the cluster. Now assign every point to the cluster corresponding to whichever of the two centers it is closer to

k-means Clustering Repeat.

k-means Clustering Repeat. Recalculate the means.

k-means Clustering Repeat. Recalculate the means. Reassign the points.

k-means Clustering. Repeat again: recalculate the means and reassign the points.

k-means Clustering. Stop once the cluster assignments don't change.

k-means Clustering. 1. Initialize the cluster means. 2. Repeat until the assignments stop changing: a) Assign each instance to the cluster whose mean is nearest to the instance. b) Update the cluster means based on the new cluster assignments: the mean of cluster i becomes μ_i = (1/|S_i|) Σ_{x ∈ S_i} x, where S_i is the set of instances in cluster i and |S_i| is the number of instances in the cluster.
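A minimal Python sketch of this loop, assuming the "pick k points at random" initialization described on the next slide; the names kmeans and points are illustrative.

```python
import math
import random

def _dist(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def kmeans(points, k=2):
    """Cluster a list of points (lists of floats) into k groups."""
    means = random.sample(points, k)      # 1. initialize the cluster means
    assignments = None
    while True:                           # 2. repeat until assignments stop changing
        # a) assign each instance to the cluster whose mean is nearest
        new_assignments = [min(range(k), key=lambda i: _dist(p, means[i])) for p in points]
        if new_assignments == assignments:
            return means, assignments
        assignments = new_assignments
        # b) update each cluster mean as the average of its assigned instances
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:                   # keep the old mean if a cluster ends up empty
                means[i] = [sum(dim) / len(members) for dim in zip(*members)]
```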

k-means Clustering. How to initialize? Two common approaches: (1) Randomly assign each instance to a cluster and calculate the means. (2) Pick k points at random and treat them as the cluster means. The second approach is the one used in the illustration in the previous slides, and it generally works better than the first (it leads to initial cluster means that are more spread out). Note that both of these approaches involve randomness and will not always lead to the same solution each time!

k-means Clustering. How to choose k? This is a similar challenge as in kNN: usually trial and error plus some intuition about what the dataset looks like. Some clustering algorithms can automatically figure out the number of clusters, also based on distance. The general idea: if points within a cluster are still far apart, the cluster should probably be split into more clusters.

Recap. Both kNN and k-means require some definition of distance between points. Euclidean distance is the most common, but there are lots of others (many implemented in sklearn). While the illustrations had only two dimensions, the algorithms apply to any number of dimensions, using the definition of Euclidean distance that we learned today.
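For reference, a sketch of what this looks like with scikit-learn's built-in implementations; the data here is random, and the parameter choices simply mirror the examples in these slides.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.random.rand(20, 2)                   # 20 instances, 2 features (any dimensionality works)
y = np.random.randint(0, 2, size=20)        # two class labels

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X, y)
print(knn.predict(np.random.rand(3, 2)))    # predicted labels for 3 new instances

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))                    # cluster assignment for each of the 20 instances
```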