Unsupervised Learning

Size: px

Start display at page:

Download "Unsupervised Learning"

Esmond Miller
6 years ago
Views:

1 Unsupervised Learning Chapter 14: The Elements of Statistical Learning Presented for 540 by Len Tanaka

2 Objectives Introduction Techniques: Association Rules Cluster Analysis Self-Organizing Maps Projective Methods Multidimensional Scaling

3 New Setup Supervised: D = { (x(i), y (i) ) 1 i N, x R p, y R or D} Pr(X, Y) = Pr(Y X) Pr(X) Unsupervised: D = { (x (i) ) 1 i N, x R p } Y is from X

4 Methods Find simple descriptions Association rules Find distinct classes or types Cluster analysis Find associations among p variables Principal components, multidimensional scaling, self-organizing maps, principal curves

5 Association Rules Find joint values of X = {X1, X2,..., Xp} Example: Market basket analysis Xij {0, 1} if product i is purchased with j Rather than finding bumps...find regions

6 Association Rules Let Sj be set of all values for jth variable sj Sj Pr[ j=1...p(xj sj)] (14.2: conjunctive rule) K = j=1...p Sj (K dummy variables: Z1...Zk)

7 Associative Rules T(K) = T(K) is the prevalence of K in the data Set some bound t where {Kl T(K)>t}

8 Example X i Age Sex Employed 31 M yes K {<30, 30+} {M, F} {yes, no} Z <30 0 M 1 yes F 0 no 0

9 Apriori Algorithm Agrawal et al {Kl T(K)>t} is small Any item set of L subset of K, T(L) T(K) Calculate K = m, consider m-1 items Throw away sets < t Each high support analyzed

10 Apriori Algorithm A B Confidence: C(A B) = T(A B) / T(A) Lift: L(A B) = C(A B) / T(B)

11 Example: K = {peanut butter, jelly, bread} T(peanut butter, jelly bread) = 0.03 C(peanut butter, jelly bread) = T(pb, jelly, bread) / T(pb, jelly) = 0.82 L(pb, jelly bread) = 0.82 / T(bread) = 1.95

13 Problems As threshold t decreases, solution grows exponentially Restrictive form of data Rules with high confidence or lift but low support will be lost

14 Unsupervised as Supervised Find g(x) in terms of g0(x) Uniform density over x Gaussian with same mean and covariance Assign Y = 1 for training sample Randomly generate g0(x) assign Y = 0

15 Convert to Supervised

16 Figure 14.3 Training classified red Reference uniform green

17 Generalized Association Rules g(x) can be used to find data density regions Eliminate Apriori problem of locating low support but highly associated items

18 We have methods Convert unsupervised space to regions of high density CART Decision tree terminal nodes are regions PRIM Find the bump maximizing average value

19 Example Married, own home, not apartment = 24% <24yo, single, not homemaker or retired, rent or live with family = 24% Own home, not apartment married C = 95.9%, L = 2.61 Apriori can t do X value

20 Cluster Analysis Segment data Subsets are closely related Find natural hierarchy Form descriptive statistics

21 Measuring Similarity Proximity matrices N N matrix D where d ii' = proximity Diagonal is 0, values positive, usually symmetric Dissimilarities based on attributes j = 1...p

22 Measuring Dissimilarity Object dissimilarity Weights can be adjusted to highlight variables with greater dissimilarity w = 1/[2(var(Xj)]

23 Clustering Algorithms Combinatorial algorithms Mixture modeling Kernel density estimation, ex: section 6.8 Mode seekers PRIM

24 Combinatorial Algorithms T = W(C) + B(C) Minimize Maximize

25 Clustering Algorithms K-means Vector Quantization K-medoids Hierarchical Clustering Agglomerative Divisive

26 K-means Clustering

27 Vector Quantization

28 K-medoids Clustering

30 Self-Organizing Maps Fit K vertices of grid to data Grid: rectangular, hexagonal,... Constrained K-means versus principal curves Updated by minimizing m k Euclidean distance Parameters r and α: Decline from 1 to 0 over 1000 iterations

33 Projective Methods Principal Component Analysis Principal Curve/Surface Analysis Independent Component Analysis

34 Principal Components

35 Principal Components Singular value decomposition: X = U D V T U: left singular vectors, N p orthogonal V: right singular vectors, p p orthogonal D: singular values, p p diagonal

36 Principal Components

37 Principal Curve

38 Principal Curves

39 Versus SOM Principal curves and surfaces share similarities to self-organizing maps As SOM prototypes increase, closer match to principal curves Principal curves provide smooth parameterization versus discrete

40 Independent Components Goal is source separation Example in audio removing noise Find statistically independent signals where distribution not normal with constant variance

41 ICA

42 ICA Example

43 Multidimensional Scaling Given d as distance or dissimilarity measure Minimize stress function: Least squares: Sammon mapping: Classical scaling:

44 U.S. Cities Example Atl Chi Den Hou LA Mia NYC SF Sea WDC Atl Chi Den Hou LA Mia NYC SF Sea WDC

45 Least Squares MDS

46 Sammon MDS

47 Classic MDS

48 Conclusions Reframe our set of X Techniques: Association Rules Cluster Analysis Self-Organizing Maps Projective Methods Manifold Modeling

49 References Burges CJC. Geometric Methods for Feature Extraction and Dimensional Reduction: A Guided Tour. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Eds Rokach L, Maimon O. Kluwer Academic Publishers, Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.

50 Thank you

Market basket analysis

Market basket analysis Find joint values of the variables X = (X 1,..., X p ) that appear most frequently in the data base. It is most often applied to binary-valued data X j. In this context the observations