CBioVikings. Richard Röttger. Copenhagen, February 2nd. Clustering of Biomedical Data
1 CBioVikings, Copenhagen, February 2nd. Richard Röttger
2 Who is talking?
3 Resources
- Go to [link missing], where you will find:
  - The dataset
  - These slides
  - An overview paper
  - A small R script for a cluster analysis
  - An R tutorial
- R. Röttger. Clustering of Biological Datasets in the Era of Big Data. Journal of Integrative Bioinformatics 13 (1), 300
6 Clustering in Life Sciences
- Long-standing problem in computer science: grouping or segmenting a collection of objects into subsets or clusters such that those within each cluster are more closely related to one another than objects assigned to different clusters.
- Applied in almost every scientific field, e.g.:
  - Information retrieval
  - Economics and marketing
  - Astronomy
  - Bioinformatics
- In bioinformatics:
  - Homology detection
  - Gene expression studies
  - Protein complex prediction
7 Complexity of Clustering
8 Complexity of Clustering
- Most pressing issues:
  - What tool to use?
  - How to find the best clustering?
  - How to tune the parameters of a tool?
  - How to measure this in a reliable and reproducible manner?
9 Graphical Analysis
10 First, let's have a look!
- Good way to gain an overview
- Histograms and scatterplots
- Can be misleading
- Hard to automate
11 Scatterplots
- How many clusters do you see?
- This is so-called overplotting.
- Only meaningful for bivariate data
12 Density Estimation
- We have seen that we have a couple of problems:
  - Overplotting
  - A wrong bin size can easily hide interesting features
- Now, let's consider a different approach:
  - Assume that our dataset originates from some probability density function
  - If we knew the type and specifics of this density function, we would have all the information we need for a clustering
  - BUT: We do not have this information! And we do not want to make any assumptions (this is the so-called nonparametric density estimation)
13 Histogram as a Density Estimate
- Divide the sample space into a number of bins
- Approximate the density at the center of each bin by counting the samples that fall into it
14 Drawbacks of a Histogram
- The density estimate depends on the starting position of the bins
- Discontinuities are not due to the underlying density; they are an artifact of the chosen bin locations
- Curse of dimensionality: the number of bins grows exponentially with the number of dimensions
- In high dimensions many examples are needed in order to have non-empty bins
- Therefore:
  - Unsuitable for high dimensions
  - More sophisticated density estimators required
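The dependence on the starting position of the bins can be made concrete with a minimal Python sketch. The function name `histogram_density` and the toy data are illustrative assumptions and do not come from the slides; the density of a bin is taken as count / (n · width), so that the bars integrate to 1.

```python
def histogram_density(samples, origin, width):
    """Histogram density estimate for bins [origin + i*width, origin + (i+1)*width)."""
    n = len(samples)
    counts = {}
    for x in samples:
        i = int((x - origin) // width)  # index of the bin that contains x
        counts[i] = counts.get(i, 0) + 1
    # Dividing by n * width makes the total area of all bars equal to 1.
    return {i: c / (n * width) for i, c in sorted(counts.items())}


data = [0.9, 1.1, 1.2, 1.9, 2.1, 2.2]  # hypothetical toy sample
print(histogram_density(data, origin=0.0, width=1.0))
print(histogram_density(data, origin=0.5, width=1.0))
```

With the same bin width, shifting the origin by half a bin changes which bins are occupied, so the apparent structure in the data changes although the data did not.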
15 Kernel Density Estimators; Parzen Windows
- We can estimate a density function by employing a kernel function K
- Notice how the Parzen window estimate resembles the histogram, with the exception that the bin locations are determined by the data
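The formula from the slide did not survive in this transcript. A minimal sketch, assuming the textbook one-dimensional Parzen window estimate p(x) = 1/(n·h) · Σ_i K((x − x_i)/h) with bandwidth h and a Gaussian kernel, could look as follows. The names `gaussian_kernel` and `parzen_density` and the sample values are illustrative assumptions.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, one common choice for the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_density(x, samples, h, kernel=gaussian_kernel):
    """Parzen window estimate at x: every sample contributes a kernel bump of width h."""
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)


samples = [1.0, 1.5, 4.0]  # hypothetical toy sample
for x in (1.2, 2.5, 4.0, 8.0):
    print(x, parzen_density(x, samples, h=0.5))
```

Unlike the histogram, the "bins" are centered on the data points themselves, so there is no starting position to choose; the bandwidth h plays the role of the bin width.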
16 Different Kernel Functions
17 Revisiting our Example
18 Pre-Processing
19 Preprocessing: Feature Extraction / Selection
- Observation
  - Features might be correlated
  - Features might be useless for a clustering
  - Features might even blur the cluster structure
- Feature Selection
  - Utilizes only a subset of the available features
  - Most methods are coupled with a mining tool to determine optimality
- Feature Extraction
  - Creates new features out of the existing features
  - Seeks to create uncorrelated, better features
  - Examples: PCA, PCoA
23 Preprocessing: Normalization
- Feature1: [0,1]; Feature2: [1000,80000]
- Normalization: bring both features to [0,1] => bad with outliers!
- Standardization: the values are centered at the mean and scaled by the standard deviation
- Generally: loss of scale and location!
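The two rescalings can be sketched in a few lines of Python. The formulas on the slide were not preserved, so this assumes the usual definitions: min-max normalization (x − min)/(max − min) and the z-score (x − mean)/sd with the sample standard deviation. The function names and the feature values are illustrative.

```python
import statistics


def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Center at the mean and scale by the sample standard deviation (z-score)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


feature2 = [1000, 1200, 1500, 1800, 80000]  # hypothetical values, 80000 is an outlier
print(min_max(feature2))
print(standardize(feature2))
```

The single outlier squeezes all other min-max values into a tiny range close to 0, which illustrates why the normalization is bad with outliers.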
25 PCA
- PCA is a very complex and large topic which can basically fill entire lecture series
- Furthermore, there are many interpretations and different applications of PCA
- Here, we limit ourselves to the usage of PCA in clustering:
  - Project data to a lower-dimensional space
  - With the intention of simplifying the clustering
  - Hopefully provides a better means for visual inspection
- The task of a PCA is to perform a dimensionality reduction in such a way that most of the variance in the original data is preserved
26 An Example
27 An Example
28 How does a PCA work?
- The PCA performs a basis transformation in which the first basis vector is the vector accounting for most of the variance in the dataset, the second for most of the remaining variance, and so on
- These basis vectors can be found by the eigenvalue decomposition of the covariance matrix Q or the sample correlation matrix R
- The eigenvalues λ_1, ..., λ_d indicate the variance along the eigenvectors y_1, ..., y_d
29 The Co-Variance Matrix
- The covariance of two features is defined over the n observations x_i, y_i (the observed covariance)
- The covariance matrix then collects the covariances of all pairs of features
- The covariance matrix generalizes the notion of variance to multiple dimensions
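The covariance formulas were lost in this transcript. The sketch below assumes the sample covariance cov(x, y) = 1/(n − 1) · Σ_i (x_i − mean(x)) · (y_i − mean(y)); a 1/n normalization is also common, and which one the slide used is not known. The first principal component is approximated by power iteration on the covariance matrix, a simple stand-in for a full eigenvalue decomposition. All names and the toy data are illustrative assumptions.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of an n x d data set given as a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
            for j in range(d)]


def first_principal_component(cov, iterations=200):
    """Leading eigenvalue and eigenvector of cov via power iteration.

    Assumes cov is non-zero and the start vector (1, ..., 1) is not
    orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the variance along v.
    eigenvalue = sum(v[j] * sum(cov[j][k] * v[k] for k in range(d)) for j in range(d))
    return eigenvalue, v


data = [[2.0, 1.9], [0.0, 0.1], [1.0, 1.1], [3.0, 3.0], [-1.0, -1.1]]  # toy data
cov = covariance_matrix(data)
variance, direction = first_principal_component(cov)
total_variance = sum(cov[j][j] for j in range(len(cov)))
print(direction, variance / total_variance)
```

For these strongly correlated toy features, the first basis vector points roughly along the diagonal and accounts for almost all of the variance, so projecting onto it loses very little information.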
30 Example: PCA
31 Example: PCA
32 Image taken from Ricardo Gutierrez-Osuna's class on Pattern Analysis
33 Proximity Calculation
34 Different Proximity Measures
- Similarity
  - Numerical measure of how alike two data objects are
  - Higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
  - Often called distance if it fulfills the metric properties
35 One-mode / Two-mode
- One-mode
  - In a one-mode dataset the data is given as an n × n matrix P = (p_ij)
  - p_ij relates each pair of objects x_i and x_j to each other
  - Also often called a similarity/dissimilarity matrix
  - Normally, a one-mode matrix is symmetric
  - Called one-mode as columns and rows describe the same thing
- Two-mode
  - A two-mode dataset normally comes as an n × d matrix
  - Each object is in a row, with each property stored in a different column
  - Sometimes, this mode is called the raw data
  - A row is also sometimes called a feature vector
36 Proximity Calculation: Continuous Data
- Euclidean type of measures (Minkowski distance)
- Image taken from wikipedia.com
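As a small illustration, the Minkowski distance d_p(x, y) = (Σ_i |x_i − y_i|^p)^(1/p) covers the Manhattan distance (p = 1) and the Euclidean distance (p = 2) as special cases. The function name `minkowski` is an illustrative choice, not taken from the slides.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equally long feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # Manhattan distance
print(minkowski(x, y, 2))  # Euclidean distance
```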
37 Proximity Calculation: Continuous Data
- Correlation coefficient
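The slide names the correlation coefficient without a surviving formula. A minimal sketch, assuming the Pearson correlation coefficient is meant, is given below; turning it into a dissimilarity as 1 − r is one common convention and an assumption here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """Dissimilarity in [0, 2]: 0 for perfectly correlated profiles."""
    return 1.0 - pearson(x, y)


print(pearson([1, 2, 3], [10, 20, 30]))
print(correlation_distance([1, 2, 3], [3, 2, 1]))
```

Two profiles with the same shape but very different magnitudes are far apart in Euclidean distance yet perfectly correlated, which is why correlation is popular for gene expression profiles.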
38 Similarity Measures for Binary Variables
- Most of the measures define a similarity based on the counts of matches and mismatches of two objects in the d variables
- Generally speaking, the counts a and d can be seen as matches, the counts b and c as mismatches
- While b and c can be seen as equivalent, this is certainly not true for the matching states a and d
39 Similarity Measures for Binary Variables
- Matching coefficient
- Jaccard coefficient
- When the presence of a feature has the same explanatory power as its absence, the matching coefficient is applied, otherwise the Jaccard coefficient
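The contingency table behind the counts a, b, c and d is missing from the transcript. The sketch below assumes the usual convention: a counts variables where both objects are 1, d where both are 0, and b and c the two kinds of mismatch. Under that assumption the matching coefficient is (a + d)/(a + b + c + d) and the Jaccard coefficient is a/(a + b + c). Function names and example vectors are illustrative.

```python
def binary_counts(x, y):
    """Return (a, b, c, d) for two binary vectors of equal length."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # positive matches
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # mismatch: only x has it
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # mismatch: only y has it
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # negative matches
    return a, b, c, d


def matching_coefficient(x, y):
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(x, y):
    """Ignores negative matches; assumes at least one of the vectors contains a 1."""
    a, b, c, _ = binary_counts(x, y)
    return a / (a + b + c)


x = [1, 0, 0, 0, 0, 1]
y = [1, 1, 0, 0, 0, 0]
print(matching_coefficient(x, y), jaccard_coefficient(x, y))
```

The three shared absences make the two objects look fairly similar under the matching coefficient, while the Jaccard coefficient only rewards the single shared presence.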
40 Similarity Measures for Categorical Data
- A straightforward way would be to treat each level of the categorical variable as its own binary variable and apply the known measures
  - Say the variable eye-color takes values in {blue, brown, green, gray}
  - It can be converted into the binary variables "has blue eyes", "has brown eyes", ...
- Problem: by default, many negative matches
- Therefore: it is often counted how often two objects agree on the different variables
41 Proximity Calculation: Specialized Functions
- These standard methods are often not sufficient for biological data, as we have neither categorical data nor an embedding in an n-dimensional space
  - How to embed a sequence? A network? A protein structure?
- Specialized measures:
  - BLAST
  - Network edit distance
  - Protein structure alignments
42 Clustering
44 From a Criterion to an Algorithm
- Each clustering tool optimizes some inherent idea of a perfect clustering
- They are all only approximations!
- Possibilities to separate n objects into k clusters, N(n,k):
  - N(5,2) = 15
  - N(10,3) = 9330
  - N(50,4) ≈ 5.3 × 10^28
  - N(100,5) ≈ 6.6 × 10^67
  - For comparison: there are an estimated 10^80 (±1 in the exponent) atoms in the observable universe
- It is important to know what exactly the clustering algorithm optimizes!
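The counts on this slide match the Stirling numbers of the second kind, i.e. the number of ways to partition n labeled objects into k non-empty clusters. A minimal sketch using the recurrence S(n, k) = k · S(n − 1, k) + S(n − 1, k − 1) follows; the function name `stirling2` is an illustrative choice.

```python
def stirling2(n, k):
    """Number of ways to partition n labeled objects into k non-empty clusters."""
    # table[i][j] holds S(i, j); S(0, 0) = 1 and S(i, 0) = 0 for i > 0.
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # Object i either joins one of the j existing clusters or opens a new one.
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


for n, k in [(5, 2), (10, 3), (50, 4), (100, 5)]:
    print(n, k, stirling2(n, k))
```

Already for 100 objects and 5 clusters the result has 68 digits, so enumerating all clusterings to find the best one is out of the question.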
45 Tool Selection: Overview
- k-means based
- Hierarchical
- Graph-based
- Density-based
- Others
46 Tool Selection: k-means based
- Most popular clustering tool
- Two-step iterative process:
  - Assign objects to the closest centers
  - Update these centers
- Good time complexity (almost linear)
- Minimizes the mean squared error of the objects to the cluster centers
- Works quite well in practice
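The two-step iteration can be sketched in plain Python. This is a minimal illustration of the standard Lloyd-style procedure under simple assumptions (random initial centers drawn from the data, squared Euclidean distance, a fixed iteration cap), not the implementation of any particular tool; the names `kmeans` and `squared_distance` are illustrative.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of tuples into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    clusters = []
    for _ in range(iterations):
        # Step 1: assign every object to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Step 2: move each center to the mean of its cluster (keep it if empty).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # toy data
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```

Each iteration touches every object once per center, which is where the almost linear running time comes from; the result still depends on the initial centers.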
48 Tool Selection: Problems with k-means
- Sensitive to initialization: how do we choose the initial partitions?
  - Run several times with different initializations
  - (Subset) furthest-first initialization
49 Tool Selection: Problems with k-means
- k-means prefers hyperspherical clusters of approximately the same size
- Image taken from wikipedia.com
50 How to find the best k?
- No easy answer to that
- Employ domain knowledge
- Use internal cluster validity indices
- Use the gap statistic
51 Tool Selection: Hierarchical
- Creates a hierarchical embedding of the clustering
- Two main branches:
  - Agglomerative
  - Divisive
- Image: Brazma, Alvis, and Jaak Vilo. "Gene expression data analysis." FEBS letters (2000):
52 Tool Selection: Single Linkage
- The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters
53 Tool Selection: Complete Linkage
- The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters
54 Tool Selection: Average Linkage
- The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters
- Compromise between single and complete linkage
- Strengths: less susceptible to noise and outliers
- Limitations: biased towards spherical clusters
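The three linkage definitions differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, assuming Euclidean distance between points given as tuples, is shown below; the function name `linkage_distance` and its `method` argument are illustrative.

```python
import math
from itertools import product


def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters under single, complete or average linkage."""
    pairwise = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)                   # closest pair
    if method == "complete":
        return max(pairwise)                   # farthest pair
    if method == "average":
        return sum(pairwise) / len(pairwise)   # mean over all pairs
    raise ValueError(f"unknown linkage method: {method}")


A = [(0, 0), (1, 0)]
B = [(3, 0), (5, 0)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(A, B, method))
```

An agglomerative procedure would repeatedly merge the two clusters with the smallest such distance until only one cluster remains.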
55 Tool Selection: Overview
- k-means based
- Hierarchical
- Graph-based
  - Represent the data as a graph
  - Identify densely connected areas in the graph
  - Examples: MCL, Transitivity Clustering, Affinity Propagation
  - Used for: network and complex analysis
- Density-based
- Images: Vlasblom, James, and Shoshana J. Wodak. "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs." BMC bioinformatics 10.1 (2009): 1.
56 Tool Selection: Overview
- k-means based
- Hierarchical
- Graph-based
- Density-based
  - Separate high-density areas from low-density areas
  - Very efficient
  - Arbitrary cluster shapes
  - Require an embedding of the objects
57 Cluster Evaluation
58 Evaluate a Clustering
- "The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." (Algorithms for Clustering Data, Jain and Dubes)
- No puppets were harmed in the production of this lecture; generally, the usage of black magic is limited to a minimum at SDU.
59 Overview of Cluster Validation
- Two different kinds of measures can be distinguished:
- External measures
  - Compare two clusterings
  - Use a gold standard to evaluate the quality of a clustering
- Internal measures
  - Only use the clustering as basis for evaluation
  - Comparable to cluster criteria
60 External Measures
- We can look at each pair of points and define the counts pair-wise
- Or map each cluster c_j to the gold standard cluster k_i with the highest overlap and define, for each object a:
  - TP if a ∈ k_i and a ∈ c_j
  - FP if a ∉ k_i and a ∈ c_j
  - FN if a ∈ k_i and a ∉ c_j
61 Now we can define measures
- Rand Index (pair-wise)
- Jaccard Index (pair-wise)
- F-measure (mapping)
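The pair-wise definitions and the formulas did not survive in the transcript. The sketch below assumes the standard pair-counting scheme: a pair of objects is a TP if it shares a cluster in both the clustering and the gold standard, an FP if only in the clustering, an FN if only in the gold standard, and a TN otherwise. Under that assumption the Rand Index is (TP + TN)/(TP + FP + FN + TN) and the Jaccard Index is TP/(TP + FP + FN). Function names and labels are illustrative.

```python
from itertools import combinations


def pair_counts(predicted, gold):
    """Count TP, FP, FN, TN over all pairs of objects, given two label lists."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_predicted and same_gold:
            tp += 1
        elif same_predicted:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(predicted, gold):
    tp, fp, fn, tn = pair_counts(predicted, gold)
    return (tp + tn) / (tp + fp + fn + tn)


def jaccard_index(predicted, gold):
    """Assumes at least one pair shares a cluster in one of the two labelings."""
    tp, fp, fn, _ = pair_counts(predicted, gold)
    return tp / (tp + fp + fn)


predicted = [0, 0, 1, 1]  # hypothetical clustering result
gold = [0, 0, 0, 1]       # hypothetical gold standard
print(rand_index(predicted, gold), jaccard_index(predicted, gold))
```

Because only co-membership of pairs is compared, the cluster labels themselves do not need to match between the clustering and the gold standard.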
62 Internal Measures
- Do not have additional information about the ground truth at their disposal
- Similar to cluster criteria
- Normally based on:
  - Compactness: this measures how closely related the objects in a cluster are
  - Separation: this measures how distinct or well-separated a cluster is from other clusters
63 Internal Measures
- Dunn Index
  - The Dunn Index assesses the clustering performance by relating the maximal cluster diameter to the minimal distance between clusters
  - This measure is sensitive to outliers, for it is based on minimal and maximal distances
- Davies-Bouldin Index
  - The Davies-Bouldin Index DB is defined based on the average distances between objects and their cluster centroids
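Several variants of the Dunn Index exist. The sketch below assumes one common form: the smallest distance between two objects of different clusters divided by the largest cluster diameter (the largest distance between two objects of the same cluster), with Euclidean distance. The function name `dunn_index` and the toy clusters are illustrative.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Minimal inter-cluster distance divided by maximal cluster diameter.

    Assumes at least two clusters and at least one cluster with two
    distinct points (otherwise the diameter is 0).
    """
    max_diameter = max(
        (math.dist(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter


compact = [[(0, 0), (1, 0)], [(5, 0), (6, 0)]]
with_outlier = [[(0, 0), (1, 0), (3, 0)], [(5, 0), (6, 0)]]
print(dunn_index(compact), dunn_index(with_outlier))
```

A single outlying object both enlarges a diameter and shrinks the separation, so the index drops sharply, which illustrates the sensitivity to outliers.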
64 Silhouette Coefficient
- Based on:
  - Cohesion a(x): average within-cluster distance of x
  - Separation b(x): average distance of x to the closest other cluster
- Takes values between -1 and 1
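The formula itself is missing from the transcript. A minimal sketch, assuming the usual definition s(x) = (b(x) − a(x)) / max(a(x), b(x)) with Euclidean distance, is given below. Setting s(x) = 0 for singleton clusters is a common convention and an assumption here; the function name `silhouette_values` is illustrative.

```python
import math


def silhouette_values(clusters):
    """Silhouette value of every object; requires at least two clusters of distinct points."""
    values = []
    for ci, cluster in enumerate(clusters):
        for xi, x in enumerate(cluster):
            if len(cluster) == 1:
                values.append(0.0)  # convention for singleton clusters
                continue
            # Cohesion a(x): mean distance to the other objects of the own cluster.
            a = sum(math.dist(x, y) for yi, y in enumerate(cluster) if yi != xi)
            a /= len(cluster) - 1
            # Separation b(x): mean distance to the closest other cluster.
            b = min(
                sum(math.dist(x, y) for y in other) / len(other)
                for oi, other in enumerate(clusters)
                if oi != ci
            )
            values.append((b - a) / max(a, b))
    return values


clusters = [[(0, 0), (1, 0)], [(5, 0), (6, 0)]]
values = silhouette_values(clusters)
print(values, sum(values) / len(values))
```

Values close to 1 indicate objects that sit well inside their cluster, values around 0 objects on a border, and negative values objects that are on average closer to another cluster.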
65 What to do?
- A method that is best for all use cases does not exist
- How to proceed then? Is there a general rule we could follow?
- ClustEval: fully automates the clustering
  - We tested 13 clustering methods
  - On 24 datasets (12 real-world, 12 artificial)
  - 13 common validity measures
  - 1000 parameter sets per tool per dataset
66 Results
67 Results of ClustEval
- There is no general best performer among the tools
- Quite often internal and external measures do not agree on the performance assessment
- When using only biomedical datasets, the Silhouette Value has the best agreement with external measures
68 Workshop Introduction
69 BreathOMICS data
70 What is it good for?
[Figure: averaged graph with traces labeled SHAM and CLI; annotated peak: Pentanone Monomer & Dimer; time scale individually normalized / a.u.; source: J.I. Baumbach, B&S Analytik, Dortmund, Germany]
71 Data Preprocessing
- RAW
- Smoothed
- De-noised
72 Peak Detection
- Local maxima search (LMS)
- Merged peak cluster localization (MPCL), Bader et al
- Wavelet-based multi-scale peak detection, Bader et al
- Watershed transformation (WST), Bunkowski et al
- Peak model estimation (PME), Kopczynski et al
73 BreathOMICS
- Patients
- Substances
74 Peak Alignment
76 Thank you for your attention
- Q & A
- Contact: roettger@imada.sdu.dk
CSE 255 Lecture 6 Data Mining and Predictive Analytics Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationNotes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)
1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No 08 Cluster Analysis Naeem Ahmed Email: naeemmahoto@gmailcom Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Outline
More informationUsing the Kolmogorov-Smirnov Test for Image Segmentation
Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer
More informationINF4820, Algorithms for AI and NLP: Hierarchical Clustering
INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score
More informationClustering: Classic Methods and Modern Views
Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationData Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394
Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and
More informationClustering. Content. Typical Applications. Clustering: Unsupervised data mining technique
Content Clustering Examples Cluster analysis Partitional: K-Means clustering method Hierarchical clustering methods Data preparation in clustering Interpreting clusters Cluster validation Clustering: Unsupervised
More informationIntroduction to Clustering
Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of
More informationCluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010
Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,
More informationSGN (4 cr) Chapter 10
SGN-41006 (4 cr) Chapter 10 Feature Selection and Extraction Jussi Tohka & Jari Niemi Department of Signal Processing Tampere University of Technology February 18, 2014 J. Tohka & J. Niemi (TUT-SGN) SGN-41006
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationClustering: Overview and K-means algorithm
Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin
More informationStatistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1
Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group
More informationChapter 4: Text Clustering
4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can
More informationUnsupervised Learning
Unsupervised Learning Fabio G. Cozman - fgcozman@usp.br November 16, 2018 What can we do? We just have a dataset with features (no labels, no response). We want to understand the data... no easy to define
More informationHierarchical clustering
Hierarchical clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Description Produces a set of nested clusters organized as a hierarchical tree. Can be visualized
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #14: Clustering Seoul National University 1 In This Lecture Learn the motivation, applications, and goal of clustering Understand the basic methods of clustering (bottom-up
More information5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction
Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering
More informationForestry Applied Multivariate Statistics. Cluster Analysis
1 Forestry 531 -- Applied Multivariate Statistics Cluster Analysis Purpose: To group similar entities together based on their attributes. Entities can be variables or observations. [illustration in Class]
More informationSOCIAL MEDIA MINING. Data Mining Essentials
SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate
More informationWeek 7 Picturing Network. Vahe and Bethany
Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups
More informationCSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo
CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..
More informationData Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Outline Prototype-based Fuzzy c-means
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More informationFoundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot
Foundations of Machine Learning CentraleSupélec Fall 2017 12. Clustering Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning objectives
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/004 What
More informationECS 234: Data Analysis: Clustering ECS 234
: Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationRoad map. Basic concepts
Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?
More informationCSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection
CSE 158 Lecture 6 Web Mining and Recommender Systems Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:
More informationMethods for Intelligent Systems
Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering
More information