Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC
|
|
- Marlene Thomas
- 5 years ago
- Views:
Transcription
1 Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC
2 Clustering Idea Given a set of data can we find a natural grouping? Essential R commands: D =rnorm(12,0,1) #generate 12 #random normal X1 =matrix(d,6,2) #put into 6x2 matrix X1[,1]=X1[,1]+4; #shift center X1[,2]=X1[,2]+2; X 2 #repeat for another set of points #bind data points and plot plot(rbind(x1,x2), xlim=c(-10,10),ylim=c(-10,10)); X 1
3 Why Clustering A good grouping implies some structure In other words, given a good grouping, we can then: Interpret and label clusters Identify important features Characterize new points by the closest cluster (or nearest neighbors) Use the cluster assignments as a compression or summary of the data
4 Clustering Objective Objective: find subsets that are similar within cluster and dissimilar between clusters Similarity defined by distance measures Euclidean distance Manhattan distance Mahalanobis (Euclidean w/dimensions rescaled by variance)
5 Kmeans Clustering A simple, effective, and standard method Start with K initial cluster centers Loop: Assign each data point to nearest cluster center Calculate mean of cluster for new center Stop when assignments don t change Issues: How to choose K? How to choose initial centers? Will it always stop?
6 Kmeans Example For K=1, using Euclidean distance, where will the cluster center be? X 2 X 1
7 Kmeans Example For K=1, the overall mean minimizes Sum Squared Error (SSE), aka Euclidean distance Essential R commands: Kresult = kmeans(x,1,10,1) #choose 1 data point as initial K centers #10 is max loop iterations #1 is number of initial sets to try #Kresult is an R object with subfields Kresult$cluster #cluster assignments Kresults$tot.withinss # tot within SSE
8 Kmeans Example K=1 K=2 Essential R commands: inds=which(kresult$cluster==k) plot(x[inds,],col2use= red ); K=3 K=4
9 Kmeans Example K=1 K=2 Essential R commands: inds=which(kresult$cluster==k) plot(x[inds,],col2use= red ); K=3 K=4 As K increases individual points get a cluster
10 Choosing K for Kmeans Total Within Cluster SSE Essential R commands: for (num_k in 1:10) { Kres=kmeans(X,num_k,10,1); Save and then plot Kres$tot.withinss K=1 to 10 - Not much improvement after K=2 ( elbow )
11 Kmeans Example more points How many clusters should there be?
12 Choosing K for Kmeans Total Within Cluster SSE K=1 to 10 - Smooth decrease at K 2, harder to choose - In general, smoother decrease => less structure
13 Kmeans Guidelines Choosing K: Elbow in total-within-cluster SSE as K=1 N Cross-validation: hold out points, compare fit as K=1 N Choosing initial starting points: take K random data points, do several Kmeans, take best fit Stopping: may converge to sub-optimal clusters may get stuck or have slow convergence (point assignments bounce around), 10 iterations is often good
14 Kmeans Example uniform K=1 K=2 K=3 K=4
15 Choosing K - uniform Total Within Cluster SSE K=1 to 10 - Smooth decrease across K => less structure
16 Kmeans Clustering Issues Scale: Dimensions with large numbers may dominate distance metrics Outliers: Outliers can pull cluster mean, K-mediods uses median instead of mean
17 Soft Clustering Methods Fuzzy Clustering Use weighted assignments to all clusters Weights depend on relative distance Find min weighted SSE Expectation-Maximization: Mixture of multivariate Gaussian distributions Mixture weights are missing Find most likely means & variances, for the expectations of the data given the weights
18 Kmeans unequal cluster variance Can you guess K?
19 Kmeans unequal cluster variance
20 Choosing K unequal distributions Total Within Cluster SSE K=1 to 10 - Smooth decrease across K => less structure
21 EM clustering Selects K=2 (either by Information Criterion= min of SSE+ K*logN, Or by cross-validation) Handles unequal variance R: library( mclust ) em_fit=mclust(x); plot(em_fit);
22 Kmeans computations Distance of each point to each cluster center For N points, D dimensions: each loop requires N*D*K operations Update Cluster centers only track points that change, get change in cluster center But for EM errors to each cluster center update a probability function
23 Kmeans vs EM performance 1 Gordon compute node, normal random matrices R: system.time(mclust()) Wall Time (secs) 15 min 10 min EM: 1000 pts Number of Points (i.e. rows) Kmeans:1000 pts 1K 4K 8K Number of Dimensions (i.e. columns in data matrix)
24 Other distance measures Cosine of the angle between points (equals Euc. Distance normalized by vector lengths) A D C C is closer to A because ABCº<ABDº B Jaccard (over sets A,B): 1- ( A B / AUB ) means number of elments
25 Other distance measures Hamming distance: count 1 if values different e.g. appropriate for binary strings Total difference is 2
26 Exercise: Clustering Athlete s Data in Weka Is there a relationship between athlete s physical attributes and their sport? Let s filter data to try a few, good, candidate sport categories.
27 Weka not nice for picking out instances with multi-criteria, so I used excel: (Excel, data -> advanced -> click on column and then drop down selection menu. Include only a few sports) Download AHW_1_filteredsports.csv from pace.sdsc.edu, open in weka
28 Notice the number of sports and their nominal values
29 Visualize correlation with sport as class any problems, any promising relationships?
30 Are there one or more variables we should ignore?
31 Select Ignore attributes -> ctrl-space to select Total and Sport. Should we ignore sex?
32 select cluster tab; choose -> simple Kmeans ; choose 5 classes, accept other defaults
33 Results end up in output panel. Take note of sum squared error. Any other message to note?
34 Rt click on model result, visualize cluster assignment Compare sport and cluster number how do they correspond? Are the clusters useful to distinguish (or predict) an athlete s sport?
35 Try rerunning with different random seed, you get different clustering, but same SSE
36 What is the relationship of sex variable to cluster and sport assignments?
37 Try EM algorithm, accept defaults and start
38 Rt click on model result, visualize cluster assignment, Compare sport and cluster number
39 Rerun with 5 clusters, compare sport and cluster number
40 Exercise: For simple Kmeans Get a full sweep of K ie change N to 1,3,4,5, what K would you choose?
41 Earth System Sciences Example An Investigation Using SiroSOM for the Analysis of QUEST Stream-Sediment and Lake-Sediment Geochemical Data (Fraser & Hodgkinson 2009) at (e.g. Quest_Out_Data.csv) Summary: identify patterns and relationships anomalous data promote resource assessment and exploration Data: 15,000 rows x 42 measures (and metadata)
42 Data location and spreadsheet
43 N=15,000 x P=42 correlation plot In R: corrplot(cor(d2use))
44 Standard Deviation of Columns Standard Deviation In R: corrplot(cor(d2use)) sdl=lapply(d2use,sd);
45 Kmeans SSE as K=1..30 D=read.table('QUEST_out_data.csv',header=T,sep=",") D2use=D[,11:dim(D)[2]-1] res_win_bet=matrix(0,30,2); for (num_k in 1:30){ kres=kmeans(d2use,num_k,iter.max=10,nstart=1); res_win_bet[num_k,1:2]=c(kres$tot.withinss,kres$betweenss);} plot(1:num_k,res_win_bet[,1],pch=8,main="sse") K=1 30; Authors used 20
46 K=20, heat map of centroids Minerals R: install.packages('fields') library('fields') dev.new() plot.new() image.plot(kres$clusters) dev.print(file= paste( centers.jpeg"), device=jpeg,width=600) K=1 20
47 Plotting in Physical Space Kmeans cluster assignments #get distinct color map values cols2use=colors()[seq(21,631,30)]; for (i in 1:num_k) { class_inds=which(kres$cluster==i); plot(d[class_inds,1:2],col=cols2use[i],xlim=xl, ylim=yl);}
48
49 Pause
50 Incremental & Hierarchical Clustering Start with 1 cluster (all instances) and do splits OR Start with N clusters (1 per instance) and do merges Can be greedy & expensive in its search some algorithms might merge & split algorithms need to store and recalculate distances Need distance between groups in constrast to K-means
51 Incremental & Hierarchical Clustering Result is a hierarchy of clusters displayed as a dendrogram tree Useful for tree-like interpretations syntax (e.g. word co-occurences) concepts (e.g. classification of animals) topics (e.g. sorting Enron s) spatial data (e.g. city distances) genetic expression (e.g. possible biological networks) exploratory analysis
52 Incremental & Hierarchical Clustering Clusters are merged/split according to distance or utility measure Euclidean distance (squared differences) conditional probabilities (for nominal features) Options to choose which clusters to Link single linkage, mean, average (w.r.t. points in clusters) (may lead to different trees, depending on spreads) Ward method (smallest increase within cluster variance) change in probability of features for given clusters
53 Linkage options e.g. single linkage (closest to any cluster instance) Cluster1 Cluster2 e.g. mean (closest to mean of all cluster instances) Cluster1 Cluster2
54 Linkage options (cont ) e.g. average (mean of pairwise distances) Cluster1 Cluster2 e.g. Ward s method (find new cluster with min. variance) Cluster1 Cluster2
55 Hierarchical Clustering Demo 3888 Interactions among 685 proteins From Hu et.al. TAP dataset b0009 b b0009 b b0014 b b0014 b b0014 b b0014 b b0014 b b0015 b R: d=read.table("core_interactions.txt"); #Create adjacency matrix, for each interaction create an edge C =matrix(0,p,p); #Connection matrix (aka Adjacency matrix) IJ =cbind(d[,1],d[,2]) #factor level not text is saved as Nx2 for (i in 1:N) { C[IJ[i,1],IJ[i,2]]=1; }
56 Interactions as connections structure is hard to see install.packages('igraph') library('igraph') gc=graph.adjacency(c,mode="directed") plot(gc,vertex.size=3,edge.arrow.size=0,vertex.label=na)
57 Hierarchical Clustering Demo hclust with single distance: chaining d2use=dist(c,method="binary") fit <- hclust(d2use, method= single") plot(fit) the cluster distance when 2 are combined Items that cluster first
58 Hierarchical Clustering Demo hclust with Ward distance: spherical clusters the cluster distance when 2 are combined
59 Hierarchical Clustering Demo Where height change looks big, cut off tree groups <- cutree(fit, k=7) rect.hclust(fit, k=7, border="red")
60 Hierarchical Clustering Demo Kmeans vs Hierarchical: Total Within Cluster Sum Sqrd Error Number of K R: emfit=mclust(c_mncntr); # 9 clusters were selected
61 Hierarchical Clustering Demo Kmeans (k=7) vs Hierarchical: Lots of overlap despite that Kmeans not have binary distance option Kmeans cluster assigment Hierarchical Group Assigment R: groups <- cutree(fit, k=7) ; Kresult=kmeans(C,7,10,1); table(kresult$cluster,groups)
62 Principle Components vs Clustering PCA reduces dimensions, Clustering reduces to categorical groups In some cases, k PCs k clusters
63 Principle Components vs Kmeans & Hierarchical Total Variance of Data On Principle Components Number of Principle Components R: C_mncntr=scale(C,center=TRUE,scale=F) P=princomp(C_mncntr); #columns of P$loadings are eigenvectors plot(p$sdev[1:50])
64 pause
65 Microbiome exploration metagenomic data on microbes in gut for diseased and normals many small values so use cut off ( std>.004)
66 Biome subjects correlation Crohn s Ulcer Colitis normals
67 Biome Subject Hierarchical Clustering
68 Clustering Test EM clustering selected 4 classes Kmeans 2-4 classes was reasonble Actual vs EM class assignments (EM class 2,3,4 isolate some subjects Actual\EM emres=mclust(xbt2) table(t,emres$classification)
69 Actual Classes on first 2 PCs
70 EM classes on first 2 PCs
71 Summary Having no label doesn t stop you from finding structure in data Unsupervised methods are somewhat related
72 End
Hierarchical Clustering 4/5/17
Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationCS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample
CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:
More informationHierarchical Clustering
What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationChapter 6: Cluster Analysis
Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each
More informationMachine Learning and Data Mining. Clustering (1): Basics. Kalev Kask
Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,
More informationData Science and Statistics in Research: unlocking the power of your data Session 3.4: Clustering
Data Science and Statistics in Research: unlocking the power of your data Session 3.4: Clustering 1/ 1 OUTLINE 2/ 1 Overview 3/ 1 CLUSTERING Clustering is a statistical technique which creates groupings
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationMultivariate Analysis
Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationMSA220 - Statistical Learning for Big Data
MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationWhat to come. There will be a few more topics we will cover on supervised learning
Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More informationSTATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010
STATS306B Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Outline K-means, K-medoids, EM algorithm choosing number of clusters: Gap test hierarchical clustering spectral
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationStatistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1
Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More information5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction
Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationChapter 6 Continued: Partitioning Methods
Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal
More informationCluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010
Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,
More informationClustering. Chapter 10 in Introduction to statistical learning
Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationMidterm Examination CS540-2: Introduction to Artificial Intelligence
Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search
More informationClustering. Partition unlabeled examples into disjoint subsets of clusters, such that:
Text Clustering 1 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover
More informationOlmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.
Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationAdministrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES
Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine
More informationLecture-17: Clustering with K-Means (Contd: DT + Random Forest)
Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationExploratory data analysis for microarrays
Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA
More informationUnsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis
7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then
More informationCase-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.
CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance
More informationCS 188: Artificial Intelligence Fall 2008
CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley 1 1 Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationMachine Learning. Unsupervised Learning. Manfred Huber
Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationUnsupervised Learning and Data Mining
Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Supervised Learning ó Decision trees ó Artificial neural nets ó K-nearest neighbor ó Support vectors ó Linear regression
More informationUnsupervised learning, Clustering CS434
Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationClustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures
Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis
More information2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.
Cluster Analysis The main aim of cluster analysis is to find a group structure for all the cases in a sample of data such that all those which are in a particular group (cluster) are relatively similar
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationFinding Clusters 1 / 60
Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60
More informationUnsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationCSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning
CSE 40171: Artificial Intelligence Learning from Data: Unsupervised Learning 32 Homework #6 has been released. It is due at 11:59PM on 11/7. 33 CSE Seminar: 11/1 Amy Reibman Purdue University 3:30pm DBART
More informationECS 234: Data Analysis: Clustering ECS 234
: Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationCluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole
Cluster Analysis Summer School on Geocomputation 27 June 2011 2 July 2011 Vysoké Pole Lecture delivered by: doc. Mgr. Radoslav Harman, PhD. Faculty of Mathematics, Physics and Informatics Comenius University,
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationHierarchical Clustering Lecture 9
Hierarchical Clustering Lecture 9 Marina Santini Acknowledgements Slides borrowed and adapted from: Data Mining by I. H. Witten, E. Frank and M. A. Hall 1 Lecture 9: Required Reading Witten et al. (2011:
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationCS7267 MACHINE LEARNING
S7267 MAHINE LEARNING HIERARHIAL LUSTERING Ref: hengkai Li, Department of omputer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) Mingon Kang, Ph.D. omputer Science,
More informationClustering algorithms
Clustering algorithms Machine Learning Hamid Beigy Sharif University of Technology Fall 1393 Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22 Table of contents 1 Supervised
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationAssociation Rule Mining and Clustering
Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:
More informationMultivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)
Multivariate analyses in ecology Cluster (part 2) Ordination (part 1 & 2) 1 Exercise 9B - solut 2 Exercise 9B - solut 3 Exercise 9B - solut 4 Exercise 9B - solut 5 Multivariate analyses in ecology Cluster
More informationUnsupervised: no target value to predict
Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationComputational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions
Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................
More informationCSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo
CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..
More informationAn Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures
An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures José Ramón Pasillas-Díaz, Sylvie Ratté Presenter: Christoforos Leventis 1 Basic concepts Outlier
More informationStatistics 202: Statistical Aspects of Data Mining
Statistics 202: Statistical Aspects of Data Mining Professor Rajan Patel Lecture 11 = Chapter 8 Agenda: 1)Reminder about final exam 2)Finish Chapter 5 3)Chapter 8 1 Class Project The class project is due
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 5
Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean
More informationDS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand
More informationCOMP 551 Applied Machine Learning Lecture 13: Unsupervised learning
COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551
More informationAn Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs
An Introduction to Cluster Analysis Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs zhaoxia@ics.uci.edu 1 What can you say about the figure? signal C 0.0 0.5 1.0 1500 subjects Two
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationChapter DM:II. II. Cluster Analysis
Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1
More informationClustering. Stat 430 Fall 2011
Clustering Stat 430 Fall 2011 Outline Distance Measures Linkage Hierachical Clustering KMeans Data set: Letters from the UCI repository: Letters Data 20,000 instances of letters Variables: 1. lettr capital
More informationMachine learning - HT Clustering
Machine learning - HT 2016 10. Clustering Varun Kanade University of Oxford March 4, 2016 Announcements Practical Next Week - No submission Final Exam: Pick up on Monday Material covered next week is not
More informationUnsupervised Learning: Clustering
Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning
More informationLecture 25: Review I
Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,
More informationIncorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data
Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying
More informationStat 321: Transposable Data Clustering
Stat 321: Transposable Data Clustering Art B. Owen Stanford Statistics Art B. Owen (Stanford Statistics) Clustering 1 / 27 Clustering Given n objects with d attributes, place them (the objects) into groups.
More informationMATH5745 Multivariate Methods Lecture 13
MATH5745 Multivariate Methods Lecture 13 April 24, 2018 MATH5745 Multivariate Methods Lecture 13 April 24, 2018 1 / 33 Cluster analysis. Example: Fisher iris data Fisher (1936) 1 iris data consists of
More informationHierarchical Clustering
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More information