Microarray data analysis
1 Microarray data analysis
Computational Biology, IST, Technical University of Lisbon
Ana Teresa Freitas, 2016/2017

Microarrays
- Rows represent genes
- Columns represent samples
- Many problems may be solved using clustering
[Figure: example of a microarray dataset]
2 Microarray data
- G_i: expression levels of gene i across samples (a row of the matrix)
- S_j: expression levels of all genes for one sample (a column of the matrix)
- Typical examples of samples: heat shock, phases in the cell cycle, cancer, normal, ...

Microarray data
[Table: genes (rows) by mRNA samples (columns: sample1, sample2, sample3, sample4, ...)]
Each entry is the gene expression level of gene i in mRNA sample j:
$\log_2(\text{treated expression value} / \text{control expression value})$
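The log-ratio entry described above can be illustrated with a minimal sketch. The function name `log_ratio` and the intensity values are hypothetical, and base 2 is assumed for the logarithm.

```python
import math


def log_ratio(treated, control):
    """Return log2(treated / control) for two positive intensities."""
    if treated <= 0 or control <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(treated / control)


# Hypothetical (treated, control) intensity pairs for three genes.
pairs = [(800.0, 200.0), (150.0, 150.0), (50.0, 200.0)]
ratios = [log_ratio(t, c) for t, c in pairs]
print(ratios)  # [2.0, 0.0, -2.0]
```

With this convention, a four-fold induction and a four-fold repression give values of equal magnitude and opposite sign, which is the usual reason for working with log ratios rather than raw ratios.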
3 What do we actually measure?
We measure the signal of the cDNA target(s) that hybridize to the probe (plus backgrounds, ratios, standard deviations, dust, etc.).
What do we wish to know (an abstraction)?
$[\text{mRNA}]_{1a}, [\text{mRNA}]_{1b}, \ldots, [\text{mRNA}]_{Na}, [\text{mRNA}]_{Nb}$
where N = number of genes, and a and b = different colors.

Factors with impact on the signal level
- Amount of mRNA
- Labeling efficiencies
- Quality of the RNA
- Laser/dye combination
- Detection efficiency of the photomultiplier
4 Typical Assumption
$[\text{mRNA}]_{n,a} \propto \text{signal}_{n,a}$
$[\text{mRNA}]_{n,a} = k \cdot \text{signal}_{n,a}$
where k = normalization constant, n = gene index, a = color.

Low level analysis
- Image analysis: computation of probe intensities/signals.
- Normalization: the attempt to compensate for systematic technical differences between chips, in order to see more clearly the systematic biological differences between samples.
Statisticians use the term 'bias' to describe systematic errors, which affect a large number of genes.
5 Normalization
Sources of systematic errors:
- Different incorporation efficiency of dyes
- Different amounts of mRNA
- Experimenter/protocol issues (comparing chips processed by different labs)
- Different scanning parameters
- Batch bias

Normalization
Two problems:
- How to detect biases? Which genes to use for estimating biases among chips?
- How to remove the biases?
6 Which genes to use for bias detection?
All genes on the chip
- Assumption: most of the genes are equally expressed in the compared samples; the proportion of differentially expressed genes is low (<20%).
- Limits: not appropriate when comparing highly heterogeneous samples (different tissues); not appropriate for analysis of dedicated chips (apoptosis chips, inflammation chips, etc.).

Which genes to use for bias detection?
Housekeeping genes
- Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples.
- Affy novel chips: normalization set of 100 genes.
- NHGRI's cDNA microarrays: set of 70 "house-keeping" genes.
- Limits: the validity of the assumption is questionable; housekeeping genes are usually expressed at high levels, so they are not informative for the low-intensity range.
7 Normalization methods
- Global normalization (scaling): forces the chips to have equal mean (median) intensity
- Intensity-dependent normalization (Lowess): forces equal means at all intensities
- Quantile normalization: forces the chips to have identical intensity distributions

Quantile Normalization
1. Sort each column of the data matrix according to the gene (probe) intensities in each chip
2. Compute the mean intensity at each rank across the chips
3. Replace each intensity by the mean intensity at its rank
4. Re-order the columns to their original state, so that each row again corresponds to a gene
[Figure: Chip #1, Chip #2, Chip #3, average chip]
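The four quantile normalization steps can be sketched in plain Python. This is an illustrative sketch, not the course's implementation: the function name `quantile_normalize`, the row-per-gene list layout, and the toy values are assumptions, and ties are broken by original row order rather than by averaging tied ranks.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x chips matrix given as a list of rows."""
    n_genes = len(matrix)
    n_chips = len(matrix[0])

    # Step 1: for each chip (column), list the row indices in order of
    # increasing intensity. This sorts the column while remembering where
    # each value came from.
    order = []
    for j in range(n_chips):
        order.append(sorted(range(n_genes), key=lambda i: matrix[i][j]))

    # Step 2: mean intensity at each rank across the chips.
    rank_means = [
        sum(matrix[order[j][r]][j] for j in range(n_chips)) / n_chips
        for r in range(n_genes)
    ]

    # Steps 3 and 4: replace each intensity by the mean at its rank and
    # write it back to the original row, so rows still correspond to genes.
    result = [[0.0] * n_chips for _ in range(n_genes)]
    for j in range(n_chips):
        for r, i in enumerate(order[j]):
            result[i][j] = rank_means[r]
    return result


# Hypothetical data: 3 genes (rows) measured on 2 chips (columns).
raw = [[1.0, 30.0],
       [3.0, 10.0],
       [2.0, 20.0]]
normalized = quantile_normalize(raw)
print(normalized)  # [[5.5, 16.5], [16.5, 5.5], [11.0, 11.0]]
```

After normalization every chip contains the same set of values (5.5, 11.0, 16.5), only in a different gene order, which is what "identical intensity distribution" means here.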
8 Quantile Normalization
[Figure: intensity distributions before and after quantile normalization]

What is Cluster Analysis?
- Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
- Cluster analysis: grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications: as a stand-alone tool to get insight into the data distribution; as a preprocessing step for other algorithms
9 Things to study (1)
Clustering (grouping) genes, i.e., finding groups of co-regulated genes.
[Figure: example of expression levels across time (samples) for two clusters of co-regulated genes]

Things to study (2)
Clustering (grouping) samples, i.e., finding groups of samples with similar genetic profiles (e.g., cancer types).
Groups of similar behaviour?
10 Things to study (3)
Classifying genes, i.e., deciding if a gene is co-regulated with some known gene(s), based on their expression profiles across samples.
[Figure: expression profiles across samples of annotated gene 1, annotated gene 2 and an unknown gene]
Co-regulation? Similar biological function? Same transcription factor?

Things to study (4)
Classifying samples, i.e., classifying new samples based on a set of classified samples (example: cancer versus normal; different types of cancer).
[Figure: classified samples A and B, and samples to be classified]
11 Things to study (5)
Selecting genes:
a) Deciding if a given gene, in isolation, behaves differently in a control versus experimental situation (e.g., cancer vs. normal, two types of cancer, treatment vs. non-treatment).
b) Selecting which group of genes is significantly different in a control versus experimental situation (same examples).
c) Selecting which group of genes is relevant for a given classification problem.

Clustering methods
Similarity-based (need a similarity function)
- Construct a partition
- Agglomerative, bottom up
- Searching for an optimal partition
- Typically hard clustering
Model-based (latent models, probabilistic or algebraic)
- First compute the model
- Clusters are obtained easily after having a model
- Typically soft clustering
12 Similarity-based clustering
Define a similarity function to measure the similarity between two objects.
Common criterion: find a partition that
- Maximizes intra-cluster similarity
- Minimizes inter-cluster similarity
Two ways to construct the partition:
- Hierarchical (e.g., agglomerative hierarchical clustering)
- Search, starting from a random partition (e.g., K-means)

Agglomerative Hierarchical Clustering
- Given a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarity
13 Distance Metrics
For clustering algorithms, computing a distance between gene vectors or experiment vectors is a necessary step.
Distance measures can be classified as metric distances or semi-metric distances.
Metric distances satisfy:
1. $d_{ab} \ge 0$
2. $d_{ab} = d_{ba}$
3. $d_{aa} = 0$
4. $d_{ab} \le d_{ac} + d_{cb}$
Semi-metric distances obey 1) to 3) but fail 4).

Distance Metrics
Minkowski distance between objects i and j, each described by p values:
$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
If q = 1, d is the Manhattan distance (metric distance):
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
If q = 2, d is the Euclidean distance (metric distance):
$d(i,j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }$
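The Minkowski family can be written as one short function. This is a minimal sketch; the name `minkowski` and the example vectors are hypothetical.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical expression profiles of two genes over two samples.
g1 = [0.0, 0.0]
g2 = [3.0, 4.0]
manhattan = minkowski(g1, g2, 1)  # q = 1: sum of absolute differences
euclidean = minkowski(g1, g2, 2)  # q = 2: straight-line distance
print(manhattan, euclidean)
```

For these two vectors the Manhattan distance is 3 + 4 = 7 and the Euclidean distance is the square root of 9 + 16, i.e. 5.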
14 Distance Metrics
Pearson correlation coefficient (semi-metric distance), computed over n samples:
$d(i,j) = \frac{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)^2 \; \sum_{k=1}^{n} (x_{jk} - \bar{x}_j)^2}}$
with $-1 \le d(i,j) \le +1$.

Distance Metrics
Entropy-based distances: Mutual Information (semi-metric distance)
- Mutual Information (MI) is a statistical representation of the correlation of two signals A and B.
- MI is a measure of the additional information known about one expression pattern when given another.
- MI is not based on linear models and can therefore also detect non-linear dependencies.
[Figure: example of a non-linear dependency]
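The Pearson coefficient defined on this page can be computed directly from its definition. The sketch below uses hypothetical names and data. Turning the coefficient into a dissimilarity as 1 - r is a common convention that is assumed here; it is not something the slides state.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def correlation_dissimilarity(x, y):
    """Assumed convention: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


# Hypothetical profiles: same shape at different scales, and a mirrored one.
up = [1.0, 2.0, 3.0]
up_scaled = [2.0, 4.0, 6.0]
down = [6.0, 4.0, 2.0]
print(pearson(up, up_scaled), pearson(up, down))
```

Because the coefficient ignores scale and offset, `up` and `up_scaled` are treated as identical in shape, which is why correlation is popular for grouping co-regulated genes.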
15 Similarity-induced Structure
[Figure: structure induced by the similarity function]

How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
- Single-link algorithm: s(g1,g2) = similarity of the closest pair
- Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
- Average-link algorithm: s(g1,g2) = average similarity of all pairs
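The three group-similarity rules differ only in how they aggregate the pairwise similarities, as the following sketch shows. The function name, the `method` strings, and the toy similarity (negative absolute difference between 1-D points) are all hypothetical choices for illustration.

```python
def group_similarity(g1, g2, sim, method="average"):
    """Similarity between two groups from pairwise object similarities."""
    scores = [sim(a, b) for a in g1 for b in g2]
    if method == "single":
        # Closest pair = the most similar pair.
        return max(scores)
    if method == "complete":
        # Farthest pair = the least similar pair.
        return min(scores)
    if method == "average":
        return sum(scores) / len(scores)
    raise ValueError("method must be 'single', 'complete' or 'average'")


def neg_abs_diff(a, b):
    """Toy similarity for 1-D points: closer points score higher."""
    return -abs(a - b)


g1 = [1.0, 2.0]
g2 = [4.0, 7.0]
single = group_similarity(g1, g2, neg_abs_diff, "single")
complete = group_similarity(g1, g2, neg_abs_diff, "complete")
average = group_similarity(g1, g2, neg_abs_diff, "average")
print(single, complete, average)  # -2.0 -6.0 -4.0
```

Single-link and complete-link each depend on one pair only, while average-link uses all pairs.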
16 Comparison of the Three Methods
- Single-link: loose clusters; individual decision, sensitive to outliers
- Complete-link: tight clusters; individual decision, sensitive to outliers
- Average-link: in between; group decision, insensitive to outliers
Which one is the best? Depends on what you need!

Hierarchical (agglomerative) clustering
Strictly speaking, agglomerative clustering does not produce clusters, but a dendrogram.
[Figure: dendrogram; vertical axis: dissimilarity]
Cutting the dendrogram at a certain level yields clusters.
Choosing where to cut the dendrogram is a problem analogous to the selection of K in K-means clustering.
17 Example of agglomerative gene clustering (Eisen et al., 98)
Microarray data from a time course of serum stimulation of primary human fibroblasts.
Experiment: foreskin fibroblasts were grown in culture and were deprived of serum for 48 hr. Serum was added back and samples were taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr.
Clustering: agglomerative clustering, correlation coefficient + average-link.
Clusters with biological interpretation: (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signalling and angiogenesis, (E) wound healing and tissue remodelling.

Data Structures
Data matrix:
$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
Dissimilarity matrix:
$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
18 Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given k, find a partition into k clusters that optimizes the chosen partitioning criterion.
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods: k-means and k-medoids algorithms
- k-means (MacQueen 67): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw 87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
- Step 1: Partition the objects into k nonempty subsets
- Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
- Step 3: Assign each object to the cluster with the nearest seed point
- Step 4: Go back to Step 2; stop when there are no more new assignments
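The loop over Steps 2 and 3 can be sketched as follows. This variant starts from given initial centroids instead of an initial partition, which is an assumption of the sketch. The function names, the `max_iter` safeguard, and the toy points are also hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Return (assignment, centroids) after iterating to a stable assignment."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop when no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean point of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return assignment, centroids


# Hypothetical 2-D data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(assignment, centroids)
```

On this toy input the first two points end up in cluster 0 and the last two in cluster 1, with centroids (0.0, 0.5) and (10.0, 10.5). With other initial centroids the procedure may stop at a different, locally optimal partition.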
19 The K-Means Clustering Method
Example with K = 2:
[Figure: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign]

Comments on the K-Means Method
Strength:
- Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
- For comparison: PAM: $O(k(n-k)^2)$, CLARA: $O(ks^2 + k(n-k))$
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing.
Weakness:
- Applicable only when a mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
20 Variations of the K-Means Method
A few variants of k-means differ in:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
Handling categorical data: k-modes (Huang 98)
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method

What is the problem of the k-means method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
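The difference between a mean and a medoid under an outlier can be seen in a small sketch. "Most centrally located" is interpreted here as the object with the smallest total distance to the other objects in the cluster, which is the usual definition but an assumption with respect to the slides. The names and values are hypothetical.

```python
def medoid(cluster, dist):
    """Return the object with the smallest total distance to all others."""
    return min(cluster, key=lambda c: sum(dist(c, o) for o in cluster))


def abs_diff(a, b):
    """Distance between two 1-D values."""
    return abs(a - b)


# Hypothetical 1-D cluster with one extreme value.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
mean_value = sum(cluster) / len(cluster)
medoid_value = medoid(cluster, abs_diff)
print(mean_value, medoid_value)  # 22.0 3.0
```

The mean (22.0) is pulled far away from the bulk of the data by the single outlier, while the medoid (3.0) remains an actual, central object of the cluster.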
21 Problem 1
Consider the following expression matrix, where the expression levels of two genes (G1 and G2) were analyzed in 7 healthy/infected tissues (conditions C1 to C7).
[Table: expression matrix not preserved in the transcript]
Consider also the problem of grouping the tissues, given the expression profiles of the genes, using clustering algorithms.
- Determine the dendrogram found by a hierarchical clustering algorithm (HCA) using a bottom-up approach, the Euclidean distance to compute the distance between conditions, and the single-link distance to compute the distance between groups (intercluster distance).
- How would you use the dendrogram to group the tissues into groups (clusters), and which would those clusters be?
- Determine the groups found by the K-means (K=2) algorithm when the centroids are initialized with C5 = (4,3) and C6 = (1,1).

Biclustering: Motivation
Gene expression matrices have been extensively analyzed using clustering in one of two dimensions:
- The gene dimension
- The condition dimension
This corresponds to the:
- Analysis of expression patterns of genes by comparing rows in the matrix.
- Analysis of expression patterns of samples by comparing columns in the matrix.
22 Biclustering: Motivation
Common objectives pursued when analyzing gene expression data include:
1. Grouping of genes according to their expression under multiple conditions.
2. Classification of a new gene, given its expression and the expression of other genes with known classification.
3. Grouping of conditions based on the expression of a number of genes.
4. Classification of a new sample, given the expression of the genes under that experimental condition.

What is Biclustering?
- Biclustering = simultaneous clustering of both rows and columns of a data matrix.
- The concept can be traced back to the 70's (Hartigan, 1972), although it has rarely been used or studied.
- The term was introduced by Cheng and Church (2000), who were the first to use it in gene expression data analysis.
- The technique is used in other fields, such as collaborative filtering, information retrieval and data mining.
23 What is Biclustering?
We consider an n by m data matrix, A = (X,Y), where
- X = {x_1, ..., x_n} = set of n rows
- Y = {y_1, ..., y_m} = set of m columns
- a_ij = numeric value (discrete or real) representing the relation between row i and column j.
In the case of gene expression matrices:
- X = set of genes
- Y = set of conditions
- a_ij = expression level of gene i under condition j (real value).

What is Biclustering?
Gene expression matrix A = (X,Y):
       | Condition 1 | ... | Condition j | ... | Condition m
Gene 1 | a_11        | ... | a_1j        | ... | a_1m
...    | ...         | ... | ...         | ... | ...
Gene i | a_i1        | ... | a_ij        | ... | a_im
...    | ...         | ... | ...         | ... | ...
Gene n | a_n1        | ... | a_nj        | ... | a_nm
24 What is Biclustering?
Given the matrix A = (X,Y), with I = subset of rows and J = subset of columns:
- (I,Y) = a subset of rows that exhibit similar behavior across the set of all columns = cluster of rows
- (X,J) = a subset of columns that exhibit similar behavior across the set of all rows = cluster of columns

What is Biclustering?
- (I,J) = a subset of rows and a subset of columns, where the rows exhibit similar behavior across the columns and vice-versa
  = sub-matrix of A that contains only the elements a_ij with row i in I and column j in J
  = bicluster
- We want to identify a set of biclusters B_k = (I_k, J_k).
- Each bicluster B_k must satisfy some specific characteristics of homogeneity.
25 What is Biclustering?
[Figure: 6 x 10 gene expression matrix with entries a_ij, rows G1 to G6 and columns C1 to C10, highlighting a cluster of rows, a cluster of columns and a bicluster]
X = {G1, G2, G3, G4, G5, G6}
Y = {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10}
Cluster of rows (I,Y): I = {G2, G3, G4}
Cluster of columns (X,J): J = {C4, C5, C6}
Bicluster (I,J) = ({G2, G3, G4}, {C4, C5, C6})

What is Biclustering?
Biclustering goals:
- Perform simultaneous clustering on the row and column dimensions of the gene expression matrix, instead of clustering the rows and columns separately.
- Identify sub-matrices (subsets of rows and subsets of columns) with interesting properties.
Gene expression data analysis:
- Identify subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition.
Madeira, Sara C. and Oliveira, Arlindo L. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics, January 2004.
26 Bicluster Types
An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find.
There are four major classes of biclusters:
1. Biclusters with constant values.
2. Biclusters with constant values on rows or columns.
3. Biclusters with coherent values.
4. Biclusters with coherent evolutions.

Constant Values
[Figure: example of a bicluster with constant values]
27 Constant Values on Rows or Columns
[Figure: example biclusters with constant rows and with constant columns]

Coherent Values
[Figure: example biclusters following an additive model and a multiplicative model]
28 Coherent Evolutions
[Figure: example biclusters with an overall coherent evolution and with a coherent evolution on the rows]

Coherent Evolutions
[Figure: example biclusters with a coherent evolution on the columns and an Order Preserving Sub-Matrix (OPSM)]
29 Algorithms
When this is the case, a bicluster corresponds to a biclique in the corresponding bipartite graph.
Finding a maximum size bicluster
- Is equivalent to finding the maximum edge biclique in a bipartite graph.
- This problem is known to be NP-complete (Peeters, 2003).
More complex cases
- Where the actual numeric values in the matrix A are taken into account to compute the quality of a bicluster
- Have a complexity that is necessarily no lower than that of this simpler case.

Algorithms
- Given this, the large majority of the algorithms use heuristic approaches to identify biclusters.
- In many cases the algorithm is preceded by a normalization step that is applied to the data matrix. The goal is to make the patterns of interest more evident.
- Some algorithms avoid heuristics but exhibit an exponential worst-case runtime.
30 Algorithms
Different objectives:
- Identify one bicluster.
- Identify a given number of biclusters.
Different approaches:
- Discover one bicluster at a time.
- Discover one set of biclusters at a time.
- Discover all biclusters at the same time (simultaneous bicluster identification).

Algorithms: Heuristic Approaches
Iterative Row and Column Clustering Combination
- Apply clustering algorithms to the rows and columns of the data matrix, separately.
- Combine the two cluster arrangements using some sort of iterative procedure.
Divide and Conquer
- Break the problem into several subproblems that are similar to the original problem but smaller in size.
- Solve the subproblems recursively.
31 Algorithms: Heuristic Approaches
Divide and Conquer (continued)
- Combine the intermediate solutions to create a solution to the original problem.
- Usually break the matrix into submatrices (biclusters) based on a certain criterion and then continue the biclustering process on the new submatrices.
Greedy Iterative Search
- Always make a locally optimal choice in the hope that this choice will lead to a globally good solution.
- Usually perform greedy row/column addition/removal.

Algorithms
Exhaustive Bicluster Enumeration
- A number of methods have been used to speed up exhaustive search.
- In some cases the algorithms assume restrictions on the size of the biclusters that should be listed.
32 Measure cluster homogeneity
33 Missing values: random numbers
- Find one bicluster at a time
- Hide biclusters using random numbers
35 Example
Consider the following expression matrix A(X,Y), with row set I and column set J:
[Table: expression matrix not preserved in the transcript]
Run the Brute-Force Deletion and Addition algorithm to find a biclustering.
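36 Example
Run the algorithm with δ = 0 (maximum acceptable mean squared residue score) and α = 1,5 (a threshold for the multiple node deletion).
Overall mean: $a_{IJ} = 29/(4 \times 4)$
Row means $a_{iJ}$ and column means $a_{Ij}$:
index | $a_{iJ}$ (mean of row i) | $a_{Ij}$ (mean of column j)
1     | 5/4                      | 7/4
2     | 8/4                      | 4/4
3     | 6/4                      | 9/4
4     | 10/4                     | 9/4
Mean squared residue:
$H(I,J) = \frac{1}{4 \times 4} \left[ (a_{11} - a_{1J} - a_{I1} + a_{IJ})^2 + (a_{12} - a_{1J} - a_{I2} + a_{IJ})^2 + (a_{13} - a_{1J} - a_{I3} + a_{IJ})^2 + (a_{14} - a_{1J} - a_{I4} + a_{IJ})^2 + (a_{21} - a_{2J} - a_{I1} + a_{IJ})^2 + \cdots \right] = 1,8$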
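The mean squared residue H(I,J) used in this example can be computed with a short function. This is a sketch under stated assumptions: the name `mean_squared_residue` and the two small matrices are hypothetical, because the matrix from the slide is not preserved in the transcript.

```python
def mean_squared_residue(sub):
    """Mean squared residue H(I,J) of a bicluster given as a list of rows."""
    n_rows = len(sub)
    n_cols = len(sub[0])
    # Row means a_iJ, column means a_Ij and overall mean a_IJ.
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical bicluster following an additive model: every row is the
# first row plus a constant, so all residues vanish.
additive = [[1.0, 2.0, 4.0],
            [3.0, 4.0, 6.0]]
# Hypothetical bicluster with no coherence between rows and columns.
checker = [[1.0, 0.0],
           [0.0, 1.0]]
h_additive = mean_squared_residue(additive)
h_checker = mean_squared_residue(checker)
print(h_additive, h_checker)  # 0.0 0.25
```

A score of 0 means the sub-matrix is perfectly coherent. A deletion-based search removes the rows or columns that contribute most to the score until it falls below the threshold δ.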
Cluster Analysis. CSE634 Data Mining
Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationData Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1
Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A
More informationCommunity Detection. Jian Pei: CMPT 741/459 Clustering (1) 2
Clustering Community Detection http://image.slidesharecdn.com/communitydetectionitilecturejune0-0609559-phpapp0/95/community-detection-in-social-media--78.jpg?cb=3087368 Jian Pei: CMPT 74/459 Clustering
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More information2. Background. 2.1 Clustering
2. Background 2.1 Clustering Clustering involves the unsupervised classification of data items into different groups or clusters. Unsupervised classificaiton is basically a learning task in which learning
More informationUnsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team
Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationDNA chips and other techniques measure the expression level of a large number of genes, perhaps all
INESC-ID TECHNICAL REPORT 1/2004, JANUARY 2004 1 Biclustering Algorithms for Biological Data Analysis: A Survey* Sara C. Madeira and Arlindo L. Oliveira Abstract A large number of clustering approaches
More informationKapitel 4: Clustering
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationEECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science
EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ GeneChip 2011/11/29 EECS 730 2 Hybridization to the Chip 2011/11/29
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationExploratory data analysis for microarrays
Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA
More informationBiclustering for Microarray Data: A Short and Comprehensive Tutorial
Biclustering for Microarray Data: A Short and Comprehensive Tutorial 1 Arabinda Panda, 2 Satchidananda Dehuri 1 Department of Computer Science, Modern Engineering & Management Studies, Balasore 2 Department
More informationUnsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationCLUSTERING IN BIOINFORMATICS
CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of
More informationUnsupervised Learning Partitioning Methods
Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form
More information9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology
9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example
More informationGene expression & Clustering (Chapter 10)
Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching
More informationUnsupervised Learning
Unsupervised Learning Pierre Gaillard ENS Paris September 28, 2018 1 Supervised vs unsupervised learning Two main categories of machine learning algorithms: - Supervised learning: predict output Y from
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationData Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science
Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray
More informationIntroduction to GE Microarray data analysis Practical Course MolBio 2012
Introduction to GE Microarray data analysis Practical Course MolBio 2012 Claudia Pommerenke Nov-2012 Transkriptomanalyselabor TAL Microarray and Deep Sequencing Core Facility Göttingen University Medical
More informationCluster Analysis for Microarray Data
Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that
More informationK-Means. Oct Youn-Hee Han
K-Means Oct. 2015 Youn-Hee Han http://link.koreatech.ac.kr ²K-Means algorithm An unsupervised clustering algorithm K stands for number of clusters. It is typically a user input to the algorithm Some criteria
More informatione-ccc-biclustering: Related work on biclustering algorithms for time series gene expression data
: Related work on biclustering algorithms for time series gene expression data Sara C. Madeira 1,2,3, Arlindo L. Oliveira 1,2 1 Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationCLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16
CLUSTERING CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 1. K-medoids: REFERENCES https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf
More informationClustering. Lecture 6, 1/24/03 ECS289A
Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More information[7.3, EA], [9.1, CMB]
K-means Clustering Ke Chen Reading: [7.3, EA], [9.1, CMB] Outline Introduction K-means Algorithm Example How K-means partitions? K-means Demo Relevant Issues Application: Cell Neulei Detection Summary
More informationhttp://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review