Unsupervised Learning


Unsupervised Learning Fabio G. Cozman - fgcozman@usp.br November 16, 2018

What can we do? We just have a dataset with features (no labels, no response). We want to understand the data, though it is not easy to define exactly what that means. Possible ideas: find association rules that indicate relationships amongst variables (market basket analysis); find compact ways to summarize observations (representation learning); cluster the data.

Representation learning Often it is possible to find compact representations that capture most of the content of the observations. Dimension is reduced: easier to visualize, easier to understand. Easier to capture complex patterns (manifolds). In essence, the idea is to learn (the real) features.

A data manifold

Principal Component Analysis (PCA) We have features $X_1, \dots, X_n$, suitably normalized (zero mean and unit variance). The first principal component $Z_1$ is a linear combination $Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{n1} X_n$, where $\sum_{i=1}^{n} \phi_{i1}^2 = 1$, and such that $Z_1$ has the largest possible variance. The next principal components similarly maximize variance while being uncorrelated with the previous ones (orthogonal to the previous ones).

A two-dimensional example (figure: Ad Spending versus Population)

Loadings The coefficients $\phi_{ij}$ are called loadings. As the features are centered, the mean of $Z_i$ is zero as well. Hence the sample variance of $Z_1$ is $\frac{1}{N} \sum_{j=1}^{N} \bigl( \sum_{i=1}^{n} \phi_{i1} x_{ij} \bigr)^2$.

Building the representation To find the loadings for the first principal component, maximize the sample variance: $\max_{\phi_{11},\dots,\phi_{n1}} \frac{1}{N} \sum_{j=1}^{N} \bigl( \sum_{i=1}^{n} \phi_{i1} x_{ij} \bigr)^2$, subject to $\sum_{i=1}^{n} \phi_{i1}^2 = 1$. This is a quadratic problem (it can be solved).

Building the representation, continued After finding $Z_1$, find $Z_2$: the same problem, but restricted to the orthogonal space. Again a quadratic problem (it can be solved). Then repeat for the other principal components. In the end we obtain an orthogonal basis that spans the feature space. In practice, there are fast algorithms that work directly with matrices.
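
For illustration, here is a minimal sketch of PCA with NumPy and scikit-learn; the synthetic dataset and all variable names below are assumptions of this example, not part of the slides.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Synthetic data: 200 observations, 4 correlated features (illustration only).
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 2))
    X = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

    # Normalize to zero mean and unit variance, as assumed in the slides.
    Xs = StandardScaler().fit_transform(X)

    pca = PCA()
    Z = pca.fit_transform(Xs)             # principal component scores
    print(pca.components_)                # rows are the loading vectors (the phi's)
    print(pca.explained_variance_ratio_)  # proportion of variance explained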

Example: the USArrests dataset (figure: biplot of the states on the first two principal components, with loading vectors for Murder, Assault, Rape, and UrbanPop)

Other manifold learning techniques From the AstroML documentation, by Jake VanderPlas (http://www.astroml.org/book_figures/chapter7/fig_S_manifold_PCA.html).

Another interpretation for PCA 1. The first loading vector yields the line that is closest to the N observations. 2. The first and second loading vectors yield the plane that is closest to the N observations. 3. And so on.

Example in three dimensions (figure: data projected onto the plane spanned by the first and second principal components)

The important idea If we retain just a few principal components, we explain most of the variance in the data. Hopefully by keeping a few principal components we have better features.

Explained variance The total variance is $\sum_{i=1}^{n} \frac{1}{N} \sum_{j=1}^{N} x_{ij}^2$. The variance explained by principal component $Z_k$ is $\frac{1}{N} \sum_{j=1}^{N} z_{jk}^2$. The proportion of variance explained by $Z_k$ is $\bigl(\frac{1}{N} \sum_{j=1}^{N} z_{jk}^2\bigr) / \bigl(\sum_{i=1}^{n} \frac{1}{N} \sum_{j=1}^{N} x_{ij}^2\bigr)$.
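
A small NumPy sketch of these formulas, reusing the standardized matrix Xs and the scores Z from the earlier sketch (those names are assumptions of that example):

    import numpy as np

    # Xs and Z are the standardized data and PCA scores from the earlier sketch.
    N = Xs.shape[0]
    total_var = (Xs ** 2).sum() / N        # sum_i (1/N) sum_j x_ij^2
    explained = (Z ** 2).sum(axis=0) / N   # (1/N) sum_j z_jk^2, one value per component
    print(explained / total_var)           # matches pca.explained_variance_ratio_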

Main practical question How many principal components to keep? No easy answer. Check proportion of explained variance; stop when it is small enough. Try a few possibilities.

PCA in supervised learning PCA is a tool to explore data. But it can also be used to produce a compact set of features. Any classifier can be preceded by some feature summarization through PCA.
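
As a hedged illustration, PCA can be used as a preprocessing step in a scikit-learn Pipeline; the dataset and the choice of classifier below are placeholders.

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Standardize, keep the first 5 principal components, then classify.
    model = make_pipeline(StandardScaler(), PCA(n_components=5),
                          LogisticRegression(max_iter=1000))
    print(cross_val_score(model, X, y, cv=5).mean())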

Really doing PCA Principal components are produced by eigenvector computation. Take a matrix $A$ and suppose $Av = \lambda v$. Then $\lambda$ is an eigenvalue, and $v$ is an eigenvector.

Suppose matrix A is symmetric Then the eigenvalues are real, and the eigenvectors are orthogonal and real. Build the matrix of eigenvectors $U = [u_1 \; u_2 \; \dots \; u_n]$ and the diagonal matrix of eigenvalues $D = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$. Then $A = U D U^T$.
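
A quick NumPy check of this decomposition for an arbitrary symmetric matrix (np.linalg.eigh is the routine for symmetric matrices):

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.normal(size=(4, 4))
    A = B + B.T                          # make a symmetric matrix

    eigenvalues, U = np.linalg.eigh(A)   # eigh: eigen-decomposition for symmetric matrices
    D = np.diag(eigenvalues)
    print(np.allclose(A, U @ D @ U.T))   # True: A = U D U^T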

Covariance matrix Consider the data matrix $X$ whose rows are the observations $x_1, x_2, \dots, x_N$, and compute the symmetric matrix $\frac{1}{N-1} X^T X$. Each element of this matrix is a covariance, $\frac{1}{N-1} \sum_{j=1}^{N} x_{ij} x_{kj}$ (recall that the features have zero mean).

Decomposition Compute the matrix $U$ of eigenvectors of $X^T X$, sorted from largest to smallest eigenvalue. Then the principal components are $(Z_1, \dots, Z_n)^T = U^T (X_1, \dots, X_n)^T$. The most numerically stable way to compute this transformation is through the Singular Value Decomposition (SVD) of $X$.
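
A sketch of the SVD route in NumPy, reusing the standardized matrix Xs from the earlier sketch (an assumption of this example):

    import numpy as np

    Xc = Xs  # centered (standardized) data matrix; rows are observations

    # Economy-size SVD: Xc = U_svd @ diag(s) @ Vt; the rows of Vt are the loading vectors.
    U_svd, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    loadings = Vt                             # same (up to sign) as eigenvectors of Xc.T @ Xc
    scores = Xc @ Vt.T                        # principal component scores Z
    explained_var = s ** 2 / (Xc.shape[0] - 1)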

A summary: 1. Compute $m = \frac{1}{N} \sum_{j=1}^{N} x_j$ and $S = (X - m)^T (X - m)$. 2. Compute the eigenvalues and eigenvectors of $S$. 3. Order the eigenvectors by the value of their eigenvalues (from largest to smallest). 4. Select the $K$ eigenvectors with the largest eigenvalues, and build the matrix $U = [u_1 \dots u_K]$. 5. Then define $Z = U^T (X - m)$.
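
A from-scratch sketch of these five steps in NumPy (illustrative only; pca_summary is a hypothetical helper, and a production implementation would rather use the SVD as noted above):

    import numpy as np

    def pca_summary(X, K):
        """PCA following the five steps above; X has one observation per row."""
        m = X.mean(axis=0)                    # step 1: mean
        S = (X - m).T @ (X - m)               # step 1: scatter matrix
        eigvals, eigvecs = np.linalg.eigh(S)  # step 2: eigen-decomposition (S is symmetric)
        order = np.argsort(eigvals)[::-1]     # step 3: largest eigenvalues first
        U = eigvecs[:, order[:K]]             # step 4: top-K eigenvectors
        Z = (X - m) @ U                       # step 5: scores, one row per observation
        return m, U, Z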

Reconstruction Suppose we have $m$ and $U$. Given a point $x$, we compute $z = U^T (x - m)$. We can then form the approximate point $\hat{x} = m + U z$. (See Barber's book, Figures 15.1 and 15.3.)
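
Continuing that sketch, reconstruction with the hypothetical pca_summary helper and the standardized matrix Xs defined earlier:

    import numpy as np

    m, U, Z = pca_summary(Xs, K=2)    # keep two components
    x = Xs[0]                          # some observation
    z = U.T @ (x - m)                  # compressed representation
    x_hat = m + U @ z                  # approximate reconstruction
    print(np.linalg.norm(x - x_hat))   # reconstruction error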

Clustering The goal is to find subgroups of the dataset; each subgroup is a cluster. Clustering often appears as a synonym for unsupervised learning. Examples: clustering species in a biology experiment; clustering clients of a supermarket based on their purchases.

A few points Clustering is not easy to evaluate; it is a bit subjective. It depends on the measure of similarity. It must somehow maximize intra-cluster similarity and minimize inter-cluster similarity. Two important methods: K-means and hierarchical clustering.

Example (figure: clusterings with K=2, K=3, and K=4, obtained with K-means, discussed later)

K-means: Setup Features $X_1, \dots, X_n$; clusters $C_1, \dots, C_K$ to be found (all disjoint). We must find clusters that solve $\max_{C_1,\dots,C_K} \sum_{k=1}^{K} s(C_k)$ for some measure of intra-cluster similarity $s$. One reasonable measure is the negative of the within-cluster scatter: $s(C_k) = -\frac{1}{|C_k|} \sum_{x, x' \in C_k} \sum_{i=1}^{n} (x_i - x'_i)^2$. K-means offers an approximate solution to the resulting maximization problem.

K-means Input: dataset; distance measure; number K. 1. Select an initial assignment to K clusters. One possibility: assign each point randomly. Another: randomly select K points as centroids and assign each point to the closest centroid. 2. Iterate until a stopping criterion is met: (a) compute the centroid of each cluster; (b) assign each point to the closest centroid (using the distance measure). Note: the centroid of a set of points is their coordinate-wise average.
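
A compact NumPy sketch of this loop (kmeans is a hypothetical helper; the initialization and stopping rule follow the slide, and the sketch assumes no cluster becomes empty):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Plain K-means: X has one observation per row; Euclidean distance."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random points as centroids
        for _ in range(n_iter):
            # Assign each point to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute centroids as coordinate-wise averages (assumes no cluster is empty).
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):  # stop when centroids stop changing
                break
            centroids = new_centroids
        return labels, centroids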

Example (figure panels: Data; Step 1; Iteration 1, Step 2a; Iteration 1, Step 2b; Iteration 2, Step 2a; Final Results)

Another example Red points: randomly selected centroids at initial step.

When to stop? When a maximum number of iterations has been run. When the centroids stop changing (or change very little). When the measure $\sum_{k=1}^{K} s(C_k)$ changes very little.

K-means: strengths Easy to understand, easy to implement. Cost is proportional to number of points, number of clusters, and number of iterations.

K-means: weaknesses Requires K. Results depend on initial guess. Sensitive to outliers. No real theoretical guarantees; stopping is somewhat ad hoc. Centroid computation may be meaningless for categorical data.

Sensitivity to initial guess Red points: randomly selected centroids at initial step.

Sensitivity to outliers Red points: randomly selected centroids at initial step.

Some techniques Run K-means several times, then select the best clustering according to $\sum_{k=1}^{K} s(C_k)$. Remove points that are too far from the cloud of points in the dataset. Instead of the centroid, use a more robust representation for clusters: select a few points randomly to compute the centroid, or use K-medoids (a medoid is a point in the cluster whose sum of distances to the other points is minimal).
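
For instance, scikit-learn's KMeans restarts from several random initializations when n_init is set, keeping the run with the smallest within-cluster sum of squares; a small sketch on synthetic data:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

    # n_init=10 runs K-means from 10 random initializations and keeps the best
    # clustering according to the within-cluster sum of squares (inertia_).
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.inertia_, km.labels_[:10])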

Example (figure: several K-means runs from different initializations, with objective values 320.9, 235.8, 235.8, 235.8, 235.8, and 310.9)

Hierarchical clustering Main idea: build a representation for all possible clusterings, instead of working with a fixed K. The usual representation is a dendrogram. Dendrograms can be built bottom-up (agglomerative) or top-down (divisive). Most popular (by far): agglomerative clustering.

A dendrogram (figure: dendrogram of nine observations, with the corresponding points plotted in the X1-X2 plane)

Interpreting a dendrogram (figure)

Agglomerative clustering 1. Compute the distances between all N points, using some given distance measure; each point starts in its own cluster. 2. Repeat while more than one cluster remains: find the two closest clusters and fuse them in the dendrogram (the height of the fusion is the distance between them).

What is the distance between clusters? Complete linkage: distance between the two points (one in each cluster) that are most distant. Single linkage: distance between the two points (one in each cluster) that are closest. Average linkage: average of the distances between all pairs of points (one in each cluster). Centroid linkage: distance between the centroids of the two clusters. Centroid linkage may be a bit difficult to represent in a dendrogram.
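
A hedged sketch with SciPy's hierarchical clustering routines (the synthetic data and the choice of complete linkage are assumptions of this example):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in ([0, 0], [4, 4])])

    # Agglomerative clustering with complete linkage and Euclidean distance.
    link = linkage(X, method="complete", metric="euclidean")

    # Cut the dendrogram to obtain, for example, 2 clusters.
    labels = fcluster(link, t=2, criterion="maxclust")
    print(labels)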

Example (figure panels: Average Linkage, Complete Linkage, Single Linkage) Note: single linkage may lead to long clusters.

Distance measures Common: the (weighted/squared) Euclidean distance $\sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$; the (weighted) Manhattan distance $|x_1 - y_1| + \cdots + |x_n - y_n|$; the (weighted) Chebyshev distance $\max(|x_1 - y_1|, \dots, |x_n - y_n|)$. (Note: it is usually a good idea to standardize the features.)
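
These distances are available in SciPy; a quick illustration (the vectors are arbitrary):

    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock, chebyshev

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 0.0, 3.5])

    print(euclidean(x, y))   # sqrt((x1-y1)^2 + ... + (xn-yn)^2)
    print(cityblock(x, y))   # Manhattan: |x1-y1| + ... + |xn-yn|
    print(chebyshev(x, y))   # max(|x1-y1|, ..., |xn-yn|)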

Distance measures: categorical data Observations of binary variables (0s and 1s): just the proportion of disagreements. Jaccard coefficient: useful when only one of the labels is relevant (the other label is the "normal" one): number of disagreements / (number of disagreements + number of agreements on the relevant label). For general categorical variables: the proportion of disagreements, or one-hot encoding followed by the techniques for binary variables.

Latent models A different scheme is to assume a statistical model. For instance, suppose there is a class variable $Y$ with K labels, and a distribution $P(X = x \mid Y = y)$ for each label. Estimate the probabilities, then assign labels. This is a mixture model; the class variable is a latent variable. There are very complex latent models around (often called topic models). We are not discussing latent models in this course.

Evaluating a clustering Not easy; quite subjective and application-dependent. Try a few things and compare the results! Visualize the clusters (projections). Check the sum of inter-/intra-cluster distances. Use a global metric (for instance, the silhouette). Many algorithms have been proposed; it is hard to compare them. K-means and dendrograms are still the champions.

Silhouette For each point $x$: $a(x)$ is the average distance between $x$ and all other points in the same cluster. Find the average distance between $x$ and the points in each other cluster; denote by $b(x)$ the smallest of these averages. Define $s(x) = 1 - a(x)/b(x)$ if $a(x) < b(x)$; $s(x) = 0$ if $a(x) = b(x)$; $s(x) = b(x)/a(x) - 1$ if $a(x) > b(x)$. Note: $s(x) \in [-1, 1]$ ($1$ is "good"; $-1$ is "bad").

Silhouette, continued Take the average of $s(x)$ over all points as the measure (the silhouette) of the clustering. A higher silhouette is better than a lower one.
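
A hedged sketch that compares clusterings by silhouette with scikit-learn (synthetic data; the values of K are arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

    for K in (2, 3, 4):
        labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
        # silhouette_score averages s(x) over all points; higher is better.
        print(K, silhouette_score(X, labels))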

A final note Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.