Tight Clustering: a method for extracting stable and tight patterns in expression profiles


George C. Tseng, Dept. of Biostatistics & Human Genetics, University of Pittsburgh

Statistical issues in microarray analysis
Experimental design; image analysis; normalization; identifying differentially expressed genes; data visualization; clustering; regulatory network; classification.

Data matrix and heatmap (data visualization)
Data: $X = \{x_{ij}\}_{n \times d}$, an n (genes) × d (samples) matrix.
[Table: example data matrix. Columns: row.names, chromosome, sample1 to sample5, followed by a series of time-point columns. Rows are probe sets such as 96669_at, 16378_at, 98569_at, 93794_at, 95124_i_at and 99674_at; some entries are missing (NA).]

Why clustering
Cluster genes: a similar expression pattern implies co-regulation. Although there are many sophisticated methods for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in array analysis.
Subsequent analysis:
- Identify novel genes participating in a known cellular process
- Enrichment of particular Gene Ontology (GO) terms in clusters
- Motif finding in clusters
Cluster samples: identify potential sub-classes of disease.

Clustering in microarray: an example
Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297.
- Each monitored gene is measured against a reference sample pooled from all samples.
- 66 sequential time points spanning embryonic (E), larval (L), pupal (P) and adult (A) periods.
- Filter genes without significant pattern (11 genes) and standardize each gene to have mean 0 and stdev 1.
[Figure: K-means clustering of the Drosophila life cycle data for several values of k, including k = 15.]

Main challenges for clustering in microarray
Challenge 1: Lots of scattered genes, i.e. genes not belonging to any tight cluster of biological function. K-means clustering looks informative. A closer look, however, finds a lot of noise in each cluster.
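The per-gene standardization mentioned above (mean 0, stdev 1) can be sketched as follows. This is only an illustration: the function name `standardize_gene` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the slides do not say which denominator was used.

```python
import math

def standardize_gene(profile):
    """Center one gene's expression profile to mean 0 and scale it to stdev 1."""
    n = len(profile)
    mean = sum(profile) / n
    # Assumption: sample standard deviation with an (n - 1) denominator.
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / (n - 1))
    if sd == 0:
        # A flat profile carries no pattern; such genes should be filtered first.
        raise ValueError("flat profile cannot be standardized")
    return [(v - mean) / sd for v in profile]

z = standardize_gene([2.0, 4.0, 6.0])
print(z)
```

After this step, Euclidean distance between two genes depends only on the shape of their profiles, not on their overall expression level or scale.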

Main challenges for clustering in microarray
Challenge 2: Microarray is an exploratory tool to guide further biological experiments.
- Hypothesis driven: hypothesis => experimental data.
- Data driven: high-throughput experiment => data mining => hypothesis => further validation experiment.
It is important to provide the most informative clusters instead of lots of loose clusters (reduce false positives).

Current methods
Dimension reduction and data visualization:
- Principal Component Analysis (PCA) (Alter 2000)
- Multi-Dimensional Scaling (MDS)
Clustering methods:
- Hierarchical clustering (Eisen 1998)
- K-means (Hartigan 1975)
- K-medoids
- Self-Organizing Map (SOM) (Tamayo 1999)
- CLICK (Ron Shamir 2001)
- Model-based approach (Fraley and Raftery 1998)

Model-based approach
Fraley and Raftery (1998) applied a Gaussian mixture model.
(1) EM algorithm to maximize the classification likelihood.
(2) Bayesian Information Criterion (BIC) for determining k and the complexity of the covariance matrix.
Advantages:
- A sound probabilistic model for inference: model selection and estimation
- Can easily be extended to model scattered genes
Problems:
- Local minimum
- Model selection is usually inapplicable in array data; BIC is approximate
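For reference, the quantities behind the model-based approach can be written out as below. This is the standard finite Gaussian mixture form and the BIC sign convention in which larger values are preferred; the notation ($\pi_j$, $\mu_j$, $\Sigma_j$, $m_M$) is assumed here and does not come from the slides.

```latex
% Gaussian mixture density with k components (notation assumed)
f(x_i \mid \theta) = \sum_{j=1}^{k} \pi_j \, \phi(x_i \mid \mu_j, \Sigma_j),
\qquad \sum_{j=1}^{k} \pi_j = 1

% Log-likelihood over the n genes
\ell(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)

% BIC for a candidate model M (a choice of k and covariance structure)
% with m_M free parameters; the model with the largest value is selected
\mathrm{BIC}_M = 2\,\ell(\hat{\theta}_M) - m_M \log n
```

The penalty term $m_M \log n$ is what lets BIC trade off the number of clusters against the complexity of the covariance matrices.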

K-means clustering
Procedure:
Step 1: estimate the number of clusters, k.
Step 2: minimize the within-cluster dispersion to the cluster centers,
$W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_{C_j} \rVert^2$,
where $C_j$ is the j-th cluster and $\bar{x}_{C_j}$ its center.
Note:
1. Points should be in Euclidean space.
2. Optimization is performed by iterative relocation algorithms. Local minimum inevitable.
3. k has to be correctly estimated.

K-means is a special case of the model-based approach.
Problems:
- Local minimum
- Does not allow scattered genes
- Estimation of the number of clusters k

Estimating the number of clusters k
Milligan & Cooper (1985) compared 30 published rules.
1. Calinski & Harabasz (1974): choose k to maximize $CH(k) = \dfrac{B(k)/(k-1)}{W(k)/(n-k)}$, where B(k) is the between-cluster sum of squares.
2. Hartigan (1975): stop when H(k) < 10.
3. Tibshirani, Walther & Hastie (2000): choose k to maximize $\mathrm{Gap}_n(k) = E_n^{*}(\log W(k)) - \log W(k)$.
4. Tibshirani et al. (2001), Dudoit & Fridlyand (2002): prediction-based resampling approach.

Hierarchical clustering
Iteratively agglomerate nearest nodes to form a bottom-up tree.
- Single linkage: shortest distance between points in the two nodes.
- Complete linkage: largest distance between points in the two nodes.
Note: clusters can be obtained by cutting the hierarchical tree.
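As a small illustration of the within-cluster dispersion W(k) and the Calinski & Harabasz index CH(k), the sketch below evaluates both for a given partition. The function names are hypothetical, and the partition is supplied by hand rather than produced by K-means.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def within_dispersion(clusters):
    """W(k): sum of squared distances from each point to its cluster center."""
    return sum(dist2(p, mean_point(c)) for c in clusters for p in c)

def calinski_harabasz(clusters):
    """CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)]; requires k >= 2 and W(k) > 0."""
    k = len(clusters)
    all_points = [p for c in clusters for p in c]
    n = len(all_points)
    grand_mean = mean_point(all_points)
    # B(k): between-cluster sum of squares, weighted by cluster size.
    between = sum(len(c) * dist2(mean_point(c), grand_mean) for c in clusters)
    return (between / (k - 1)) / (within_dispersion(clusters) / (n - k))

clusters = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
print(within_dispersion(clusters), calinski_harabasz(clusters))
```

In practice one would run K-means for a range of k and keep the k with the largest CH(k); the toy partition here only shows how the two quantities are computed.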

Example of hierarchical clustering
[Figure: hierarchical clustering example, Eisen et al. 1998.]

Other methods
Current methods that aim to find tight clusters:
1. CLICK: graph-theoretical techniques to find tight kernels. Several heuristic procedures are then used to expand the kernels into a full clustering.
2. Committee algorithm: a similar idea, finding tight committees and then expanding to a full clustering.

Traditional:
- Estimate the number of clusters, k (except for hierarchical clustering).
- Perform clustering by assigning all genes into clusters.

Tight Clustering:
- Directly identify informative, tight and stable clusters with reasonable size, say, 2~6 genes.
- Need not estimate k!
- Need not assign all genes into clusters.

Tight Clustering
[Figure: the whole data set is sub-sampled repeatedly (sub-sample 1, sub-sample 2, ...); each sub-sample gives a co-membership judgement on the original data X.]

- $X = \{x_{ij}\}_{n \times d}$: data to be clustered.
- $X' = \{x'_{ij}\}_{(n/2) \times d}$: random sub-sample.
- $C(X', k) = (C_1, C_2, \ldots, C_k)$: the cluster centers obtained from clustering X' into k clusters (K-means applied to the sub-sample X').
- $D[C(X', k), X]$: an n × n matrix denoting the co-membership relations of X classified by C(X', k) (Tibshirani 2001):
  $D[C(X', k), X]_{ij} = 1$ if i and j are in the same cluster, and $= 0$ otherwise.
- $s(V_i, V_j) = \dfrac{|V_i \cap V_j|}{|V_i \cup V_j|}$: a measure of similarity of two sets of genes.
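The two definitions above, the co-membership matrix D[C(X', k), X] and the set similarity s, can be sketched directly. The function names are hypothetical; the cluster centers are passed in as given, so any clustering method could have produced them.

```python
def nearest_center(point, centers):
    """Index of the center closest to `point` in squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )

def comembership(centers, X):
    """D[C(X', k), X]: entry (i, j) is 1 if points i and j of X are assigned
    to the same center, and 0 otherwise."""
    labels = [nearest_center(x, centers) for x in X]
    n = len(X)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]

def set_similarity(V_i, V_j):
    """s(V_i, V_j) = |V_i intersect V_j| / |V_i union V_j|."""
    V_i, V_j = set(V_i), set(V_j)
    union = V_i | V_j
    return len(V_i & V_j) / len(union) if union else 1.0

X = [(0.0,), (1.0,), (10.0,)]
D = comembership([(0.5,), (10.0,)], X)
print(D)
print(set_similarity({1, 2, 3}, {2, 3, 4}))
```

Note that the centers come from the sub-sample X' but the matrix is always n × n, because every point of the full data X is classified by those centers.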

Algorithm 1 (when fixing k):
1. Fix k. Draw random sub-samples $X^{(1)}, \ldots, X^{(B)}$. Define the average co-membership matrix to be
   $\bar{D} = \mathrm{mean}\big(D[C(X^{(1)}, k), X], \ldots, D[C(X^{(B)}, k), X]\big)$.
   Note:
   a. $\bar{D}_{ij} = 1$ if and only if i and j are always clustered together in each sub-sampling judgement.
   b. $\bar{D}_{ij} = 0$ if and only if i and j are never clustered together in each sub-sampling judgement.
   c. $\bar{D}_{ii} = 1$ for all i.
2. Search for a large set of points $V = \{v_1, \ldots, v_m\} \subset \{1, \ldots, n\}$ such that $\bar{D}_{v_i v_j} \ge 1 - \alpha$ for all i, j, with α close to 0. Sets with this property are candidates for tight clusters. Order the sets with this property by their size to obtain $V_{k1}, V_{k2}, \ldots$

Tight Clustering Algorithm:
1. Start with a suitable k0. Search over consecutive k's and choose the top 3 clusters for each k:
   $\{V_{k_0,1}, V_{k_0,2}, V_{k_0,3}\}, \{V_{(k_0+1),1}, V_{(k_0+1),2}, V_{(k_0+1),3}\}, \ldots$
   [Figure: top-3 candidate clusters for k0, k0+1, k0+2 and k0+3, with the similarities s between candidates at consecutive k.]
2. Stop when $s(V_{k',l}, V_{(k'+1),m}) \ge \beta$ and $s(V_{(k'+1),m}, V_{(k'+2),n}) \ge \beta$ for some $k' \ge k_0$ and $l, m, n \in \{1, 2, 3\}$, with β close to 1. Select the candidate that is stable across these consecutive k's to be the tightest cluster.

Tight Clustering Algorithm (cont'd):
3. Identify the tightest cluster and remove it from the whole data.
4. Decrease k0 by 1. Repeat 1.~3. to identify the next tight cluster.
Remark: α, β and k0 determine the tightness and size of the resulting clusters.

Simulation
A simple simulation in 2-D: 14 normally distributed clusters (5 points each) plus 175 sporadic points. Stdev = 0.1, 0.2, ...
Tight clustering on the simulated data: β = 0.7, with k0 up to 25.
[Table and figure: clusters recovered and points remaining for each setting of α, β and k0, compared with the truth.]
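Algorithm 1 (fixed k) can be sketched as below. This is a minimal illustration under several assumptions, not the tightclust implementation: the K-means routine is a plain Lloyd iteration with random starting centers, the sub-sample size is an explicit parameter (the slides use n/2), and `candidate_sets` is a simple greedy search for sets whose pairwise average co-membership is at least 1 - α. All function names are hypothetical.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))

def kmeans_centers(points, k, rng, iters=20):
    """Plain Lloyd iterations from k random data points; returns the centers."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group becomes empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def average_comembership(X, k, B, subsample_size, seed=0):
    """Average of D[C(X^(b), k), X] over B random sub-samples of X."""
    rng = random.Random(seed)
    n = len(X)
    total = [[0] * n for _ in range(n)]
    for _ in range(B):
        centers = kmeans_centers(rng.sample(X, subsample_size), k, rng)
        labels = [nearest(x, centers) for x in X]  # classify the whole data
        for i in range(n):
            for j in range(n):
                total[i][j] += labels[i] == labels[j]
    return [[t / B for t in row] for row in total]

def candidate_sets(D_bar, alpha):
    """Greedy search (an assumption, not the published procedure): grow a set
    from each seed point, adding points whose average co-membership with every
    current member is at least 1 - alpha. Returns sets ordered by size."""
    n = len(D_bar)
    found = set()
    for seed_point in range(n):
        members = [seed_point]
        for j in range(n):
            if j != seed_point and all(D_bar[j][m] >= 1 - alpha for m in members):
                members.append(j)
        found.add(frozenset(members))
    return sorted((sorted(s) for s in found), key=lambda s: (-len(s), s))

X = [(0.0,), (1.0,), (2.5,), (100.0,), (101.0,), (103.5,)]
D_bar = average_comembership(X, k=2, B=10, subsample_size=4, seed=1)
print(candidate_sets(D_bar, alpha=0.0))
```

In the toy data the two groups are far apart, so every sub-sampling judgement agrees and the two groups come out as the only candidates. With real expression data, scattered genes change partners from one sub-sample to the next, so their entries in the averaged matrix fall below 1 - α and they drop out of the candidate sets.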

Example 1: Data from the life cycle of Drosophila melanogaster. (2002) Science 297.
Tight Clustering with α = 0.1, β = 0.6, k0 = 15: 11 clusters and 661 remaining scattered genes.
[Figure: K-means clustering of the same data for several values of k, including k = 15.]
K-means clustering looks informative. A closer look, however, finds a lot of noise in each cluster.
Comparison of a corresponding cluster from K-means and Tight Clustering: 22 common genes; Tight Clustering, total of 28 genes; K-means clustering, total of 18 genes.

Example 2: Mouse embryonic experiment
Oligonucleotide array (U74Av2 mouse array from Affymetrix) containing probe sets for about 1,... mouse genes. 126 samples in total. Half of them are from different stages of mouse embryonic development. The remaining half is a diverse collection of samples from various tissues, including several types of adult stem cells.
[Table: mean squared distance.]

Example 2: Mouse embryonic experiment
[Figure: comparison of various K-means and tight clustering results.]

Example 3: Simulated data
A. Simulated gene expression of 15 clusters and 5 scattered genes.
B. Randomly permuted from A.
Methods compared:
a. K-means
b. K-medoids
c. SOM
d. CLICK
e. Model-based clustering
f. Tight clustering
The adjusted Rand index is a measure of the similarity of two clustering results. We compare the clustering result from each method to the underlying truth.

Ongoing developments
- Theoretical foundation for the re-sampling approach.
- Multi-resolution tight clustering.
- Extend the idea to bi-clustering.
- Incorporating multiple tight clustering results.
- Other general and fundamental problems in clustering.
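The adjusted Rand index used in Example 3 can be computed from the contingency table of two labelings. The sketch below follows the usual pair-counting form of the index; the function name is hypothetical, and how scattered genes were labeled in the comparison is not stated in the slides.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same n >= 2 items."""
    n = len(labels_a)
    # Pairs of items placed together by both labelings (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together by each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # value expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have a single cluster
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, relabeled
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that disagrees
```

The index equals 1 for identical partitions regardless of how clusters are named, and is close to 0 (possibly negative) for partitions that agree no more than expected by chance.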

tightclust: software for Tight Clustering

Acknowledgement
Harvard: Wing H. Wong (Department of Statistics)
Inputs from: Chen Li (Department of Biostatistics), Rung Kim, Richard Zhong

Discussion: Clustering Random Curves Under Spatial Dependence

Discussion: Clustering Random Curves Under Spatial Dependence Discussion: Clustering Random Curves Under Spatial Dependence Gareth M. James, Wenguang Sun and Xinghao Qiao Abstract We discuss the advantages and disadvantages of a functional approach to clustering

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machine learning Final eam December 3, 24 Your name and MIT ID: J. D. (Optional) The grade ou would give to ourself + a brief justification. A... wh not? Problem 5 4.5 4 3.5 3 2.5 2.5 + () + (2)

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machine learning Final eam December 3, 24 Your name and MIT ID: J. D. (Optional) The grade ou would give to ourself + a brief justification. A... wh not? Cite as: Tommi Jaakkola, course materials

More information

Clustering. Supervised vs. Unsupervised Learning

Clustering. Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis 7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then

More information

k-means Gaussian mixture model Maximize the likelihood exp(

k-means Gaussian mixture model Maximize the likelihood exp( k-means Gaussian miture model Maimize the likelihood Centers : c P( {, i c j,...,, c n },...c k, ) ep( i c j ) k-means P( i c j, ) ep( c i j ) Minimize i c j Sum of squared errors (SSE) criterion (k clusters

More information

Clustering Part 2. A Partitional Clustering

Clustering Part 2. A Partitional Clustering Universit of Florida CISE department Gator Engineering Clustering Part Dr. Sanja Ranka Professor Computer and Information Science and Engineering Universit of Florida, Gainesville Universit of Florida

More information

Outline. Advanced Digital Image Processing and Others. Importance of Segmentation (Cont.) Importance of Segmentation

Outline. Advanced Digital Image Processing and Others. Importance of Segmentation (Cont.) Importance of Segmentation Advanced Digital Image Processing and Others Xiaojun Qi -- REU Site Program in CVIP (7 Summer) Outline Segmentation Strategies and Data Structures Algorithms Overview K-Means Algorithm Hidden Markov Model

More information

High throughput Data Analysis 2. Cluster Analysis

High throughput Data Analysis 2. Cluster Analysis High throughput Data Analysis 2 Cluster Analysis Overview Why clustering? Hierarchical clustering K means clustering Issues with above two Other methods Quality of clustering results Introduction WHY DO

More information

Evaluation and comparison of gene clustering methods in microarray analysis

Evaluation and comparison of gene clustering methods in microarray analysis Evaluation and comparison of gene clustering methods in microarray analysis Anbupalam Thalamuthu 1 Indranil Mukhopadhyay 1 Xiaojing Zheng 1 George C. Tseng 1,2 1 Department of Human Genetics 2 Department

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Pierre Gaillard ENS Paris September 28, 2018 1 Supervised vs unsupervised learning Two main categories of machine learning algorithms: - Supervised learning: predict output Y from

More information

A Quick Guide for the EMCluster Package

A Quick Guide for the EMCluster Package A Quick Guide for the EMCluster Package Wei-Chen Chen 1, Ranjan Maitra 2 1 pbdr Core Team 2 Department of Statistics, Iowa State Universit, Ames, IA, USA Contents Acknowledgement ii 1. Introduction 1 2.

More information

Module 3 Graph Theoretic Segmentation

Module 3 Graph Theoretic Segmentation Module 3 Graph Theoretic Segmentation Scott T. Acton Virginia Image and Video Analsis VIVA Charles L. Brown Department of Electrical and Computer Engineering Department of Biomedical Engineering Universit

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar What is Cluster Analsis? Finding groups of objects such that the

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Chapters 11 and 13, Graph Data Mining

Chapters 11 and 13, Graph Data Mining CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining Young-Rae Cho Associate Professor Department of Computer Science Balor Universit Graph Representation Graph An ordered pair GV,E

More information

Math 1050 Lab Activity: Graphing Transformations

Math 1050 Lab Activity: Graphing Transformations Math 00 Lab Activit: Graphing Transformations Name: We'll focus on quadratic functions to eplore graphing transformations. A quadratic function is a second degree polnomial function. There are two common

More information

Hierarchical clustering. Copyright 2000, Kevin Wayne 1

Hierarchical clustering. Copyright 2000, Kevin Wayne 1 Hierarchical Clustering continued & more about trees Clustering genes in microarra eperiments Function prediction Genetic networks Pathwa discover Gene regulation studies Comparative genomics How does

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Microarray data analysis

Microarray data analysis Microarray data analysis Computational Biology IST Technical University of Lisbon Ana Teresa Freitas 016/017 Microarrays Rows represent genes Columns represent samples Many problems may be solved using

More information

Non-linear models. Basis expansion. Overfitting. Regularization.

Non-linear models. Basis expansion. Overfitting. Regularization. Non-linear models. Basis epansion. Overfitting. Regularization. Petr Pošík Czech Technical Universit in Prague Facult of Electrical Engineering Dept. of Cbernetics Non-linear models Basis epansion.....................................................................................................

More information

streammoa: Interface to Algorithms from MOA for stream

streammoa: Interface to Algorithms from MOA for stream streammoa: Interface to Algorithms from MOA for stream Matthew Bolaños Southern Methodist Universit John Forrest Microsoft Michael Hahsler Southern Methodist Universit Abstract This packages provides an

More information

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter Introduction to Data Mining b Tan, Steinbach, Kumar What is Cluster Analsis? Finding groups of objects such that the

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

More information

DECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES. Fumitake Takahashi, Shigeo Abe

DECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES. Fumitake Takahashi, Shigeo Abe DECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES Fumitake Takahashi, Shigeo Abe Graduate School of Science and Technology, Kobe University, Kobe, Japan (E-mail: abe@eedept.kobe-u.ac.jp) ABSTRACT

More information

Linear Programming. Revised Simplex Method, Duality of LP problems and Sensitivity analysis

Linear Programming. Revised Simplex Method, Duality of LP problems and Sensitivity analysis Linear Programming Revised Simple Method, Dualit of LP problems and Sensitivit analsis Introduction Revised simple method is an improvement over simple method. It is computationall more efficient and accurate.

More information

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability Preview Lecture Clustering! Introduction! Partitioning methods! Hierarchical methods! Model-based methods! Densit-based methods What is Clustering?! Cluster: a collection of data objects! Similar to one

More information

Scale Invariant Feature Transform (SIFT) CS 763 Ajit Rajwade

Scale Invariant Feature Transform (SIFT) CS 763 Ajit Rajwade Scale Invariant Feature Transform (SIFT) CS 763 Ajit Rajwade What is SIFT? It is a technique for detecting salient stable feature points in an image. For ever such point it also provides a set of features

More information

Predictor Selection Algorithm for Bayesian Lasso

Predictor Selection Algorithm for Bayesian Lasso Predictor Selection Algorithm for Baesian Lasso Quan Zhang Ma 16, 2014 1 Introduction The Lasso [1] is a method in regression model for coefficients shrinkage and model selection. It is often used in the

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis Introduction ti to K-means Algorithm Wenan Li (Emil Li) Sep. 5, 9 Outline Introduction to Clustering Analsis K-means Algorithm Description Eample of K-means Algorithm Other Issues of K-means Algorithm

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Biclustering for Microarray Data: A Short and Comprehensive Tutorial

Biclustering for Microarray Data: A Short and Comprehensive Tutorial Biclustering for Microarray Data: A Short and Comprehensive Tutorial 1 Arabinda Panda, 2 Satchidananda Dehuri 1 Department of Computer Science, Modern Engineering & Management Studies, Balasore 2 Department

More information

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter Introduction to Data Mining b Tan, Steinbach, Kumar What is Cluster Analsis? Finding groups of objects such that the

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Feature-Based Dissimilarity Space Classification

Feature-Based Dissimilarity Space Classification Feature-Based Dissimilarit Space Classification Robert P.W. Duin 1, Marco Loog 1,Elżbieta Pȩkalska 2, and David M.J. Ta 1 1 Facult of Electrical Engineering, Mathematics and Computer Sciences, Delft Universit

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Global Ordering For Multi-dimensional Data: Comparison with K-means Clustering

Global Ordering For Multi-dimensional Data: Comparison with K-means Clustering DIMACS Technical Report 9- April 9 Global Ordering For Multi-dimensional Data: Comparison with K-means Clustering b Baiang Liu Dept. of Computer Science Rutgers Universit New Brunswick, New Jerse 89 Casimir

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Biclustering Bioinformatics Data Sets. A Possibilistic Approach

Biclustering Bioinformatics Data Sets. A Possibilistic Approach Possibilistic algorithm Bioinformatics Data Sets: A Possibilistic Approach Dept Computer and Information Sciences, University of Genova ITALY EMFCSC Erice 20/4/2007 Bioinformatics Data Sets Outline Introduction

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

More information

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis

More information

Clustering fundamentals

Clustering fundamentals Elena Baralis, Tania Cerquitelli Politecnico di Torino What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

More information

Where we are. Exploratory Graph Analysis (40 min) Focused Graph Mining (40 min) Refinement of Query Results (40 min)

Where we are. Exploratory Graph Analysis (40 min) Focused Graph Mining (40 min) Refinement of Query Results (40 min) Where we are Background (15 min) Graph models, subgraph isomorphism, subgraph mining, graph clustering Eploratory Graph Analysis (40 min) Focused Graph Mining (40 min) Refinement of Query Results (40 min)

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Clustering gene expression data

Clustering gene expression data Clustering gene expression data 1 How Gene Expression Data Looks Entries of the Raw Data matrix: Ratio values Absolute values Row = gene s expression pattern Column = experiment/condition s profile genes

More information

Solution Guide II-D. Classification. HALCON Progress

Solution Guide II-D. Classification. HALCON Progress Solution Guide II-D Classification HALCON 17.12 Progress How to use classification, Version 17.12 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Double Self-Organizing Maps to Cluster Gene Expression Data

Double Self-Organizing Maps to Cluster Gene Expression Data Double Self-Organizing Maps to Cluster Gene Expression Data Dali Wang, Habtom Ressom, Mohamad Musavi, Cristian Domnisoru University of Maine, Department of Electrical & Computer Engineering, Intelligent

More information

STAD Research Report 2015/02. Parsimonious Time Series Clustering.

STAD Research Report 2015/02. Parsimonious Time Series Clustering. STAD Research Report 2015/02 Parsimonious Time Series Clustering. arxiv:1509.00729v1 [stat.me] 2 Sep 2015 Carmela Iorio*, Gianluca Frasso***, Antonio D Ambrosio*,Roberta Siciliano** *Department of Economics

More information

A Hybrid Intelligent System for Fault Detection in Power Systems

A Hybrid Intelligent System for Fault Detection in Power Systems A Hybrid Intelligent System for Fault Detection in Power Systems Hiroyuki Mori Hikaru Aoyama Dept. of Electrical and Electronics Eng. Meii University Tama-ku, Kawasaki 14-8571 Japan Toshiyuki Yamanaka

More information

Measure of Distance. We wish to define the distance between two objects Distance metric between points:

Measure of Distance. We wish to define the distance between two objects Distance metric between points: Measure of Distance We wish to define the distance between two objects Distance metric between points: Euclidean distance (EUC) Manhattan distance (MAN) Pearson sample correlation (COR) Angle distance

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Solution Guide II-D. Classification. Building Vision for Business. MVTec Software GmbH
