Microarray data analysis


1 Microarray data analysis. Computational Biology, IST, Technical University of Lisbon. Ana Teresa Freitas, 2016/2017.
Microarrays. Rows represent genes; columns represent samples. Many problems may be solved using clustering. [Figure: example of a microarray dataset]

2 Microarray data. G_i: expression levels of gene i across samples. S_j: expression levels of all genes for one sample j. Typical examples of samples: heat shock, phases in the cell cycle, cancer, normal, ...
Microarray data. Genes (rows) by mRNA samples (columns: sample1, sample2, sample3, sample4, ...). Each entry is the gene expression level of gene i in mRNA sample j, given as Log(treated-exp-value / controlled-exp-value).
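The log-ratio entry described above can be sketched in a few lines. This is a minimal illustration, not the course's code: the slide does not state the base of the logarithm, so base 2 is an assumption here, and the function name `log_ratio` is hypothetical.

```python
import math


def log_ratio(treated, control):
    """Log ratio of a treated vs. control expression value.

    Base 2 is an assumption (the slide only says "Log"); with base 2,
    +1 means a two-fold increase and -1 a two-fold decrease.
    """
    if treated <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(treated / control)


# A gene four times more abundant in the treated sample, and the reverse case.
print(log_ratio(8.0, 2.0))
print(log_ratio(2.0, 8.0))
```

Taking the logarithm makes up- and down-regulation symmetric around zero, which is convenient for the distance measures used later in clustering.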

3 What do we actually measure? We measure the signal of the cDNA target(s) that hybridize to the probe (plus backgrounds, ratios, standard deviations, dust, etc.). What do we wish to know (an abstraction)? $[\mathrm{mRNA}]_{1a}, [\mathrm{mRNA}]_{1b}, \dots, [\mathrm{mRNA}]_{Na}, [\mathrm{mRNA}]_{Nb}$, where N = number of genes and a and b = different colors.
Factors with impact on the signal level: amount of mRNA; labeling efficiencies; quality of the RNA; laser/dye combination; detection efficiency of the photomultiplier.

4 Typical assumption. $[\mathrm{mRNA}]_{n,a} \propto \mathrm{signal}_{n,a}$, that is, $[\mathrm{mRNA}]_{n,a} = k \cdot \mathrm{signal}_{n,a}$, where k is the normalization constant, n is the gene index, and a is the color.
Low-level analysis. Image analysis: computation of probe intensities/signals. Normalization: the attempt to compensate for systematic technical differences between chips, so that the systematic biological differences between samples can be seen more clearly. Statisticians use the term 'bias' to describe systematic errors that affect a large number of genes.

5 Normalization: sources of systematic errors. Different incorporation efficiency of dyes; different amounts of mRNA; experimenter/protocol issues (comparing chips processed by different labs); different scanning parameters; batch bias.
Normalization: two problems. How to detect biases (which genes to use for estimating biases among chips)? How to remove the biases?

6 Which genes to use for bias detection? All genes on the chip. Assumption: most of the genes are equally expressed in the compared samples; the proportion of differentially expressed genes is low (<20%). Limits: not appropriate when comparing highly heterogeneous samples (different tissues); not appropriate for the analysis of dedicated chips (apoptosis chips, inflammation chips, etc.).
Which genes to use for bias detection? Housekeeping genes. Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples. Affy novel chips: normalization set of 100 genes. NHGRI's cDNA microarrays: set of 70 "house-keeping" genes. Limits: the validity of the assumption is questionable; housekeeping genes are usually expressed at high levels, so they are not informative for the low-intensity range.

7 Normalization methods. Global normalization (scaling) forces the chips to have equal mean (median) intensity. Intensity-dependent normalization (Lowess) enforces equal means at all intensities. Quantile normalization forces the chips to have identical intensity distributions.
Quantile normalization. (1) Sort each column of the data matrix according to the gene (probe) intensities in each chip. (2) Compute the mean intensity at each rank across the chips. (3) Replace each intensity by the mean intensity at its rank. (4) Re-order the columns to their original state, so that each row again corresponds to a gene. [Figure: Chip #1, Chip #2, Chip #3, average chip]
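The four quantile-normalization steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the matrix is a list of rows (genes) with one column per chip, the function name `quantile_normalize` is hypothetical, and ties within a chip are broken by original row order rather than averaged.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes-by-chips matrix given as a list of rows."""
    n_genes = len(matrix)
    n_chips = len(matrix[0])

    # Step 1: for each chip (column), the row indices sorted by intensity.
    order = [
        sorted(range(n_genes), key=lambda i, j=j: matrix[i][j])
        for j in range(n_chips)
    ]

    # Step 2: mean intensity at each rank across the chips.
    rank_means = [
        sum(matrix[order[j][r]][j] for j in range(n_chips)) / n_chips
        for r in range(n_genes)
    ]

    # Steps 3 and 4: write the rank mean back at each value's original row,
    # so every row still corresponds to the same gene.
    result = [[0.0] * n_chips for _ in range(n_genes)]
    for j in range(n_chips):
        for r, i in enumerate(order[j]):
            result[i][j] = rank_means[r]
    return result


chips = [
    [2.0, 4.0, 9.0],
    [5.0, 1.0, 3.0],
    [8.0, 7.0, 6.0],
]
for row in quantile_normalize(chips):
    print(row)
```

After normalization every column contains exactly the same set of values (here 2, 5 and 8), only in a different order, which is what "identical intensity distribution" means.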

8 Quantile normalization. [Figure: intensity distributions before and after normalization]
What is cluster analysis? Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis: grouping a set of data objects into clusters. Clustering is unsupervised classification: there are no predefined classes. Typical applications: as a stand-alone tool to get insight into the data distribution; as a preprocessing step for other algorithms.

9 Things to study (1). Clustering (grouping) genes, i.e., finding groups of co-regulated genes. Example: expression levels across time of two clusters of co-regulated genes. [Figure: two panels, expression level versus samples]
Things to study (2). Clustering (grouping) samples, i.e., finding groups of samples with similar genetic profiles (e.g., cancer types). Groups of similar behaviour?

10 Things to study (3). Classifying genes, i.e., deciding if a gene is co-regulated with some known gene(s), based on their expression profiles across samples. [Figure: expression profiles across samples for annotated gene 1, annotated gene 2, and an unknown gene] Co-regulation? Similar biological function? Same transcription factor?
Things to study (4). Classifying samples, i.e., classifying new samples based on a set of classified samples (examples: cancer versus normal; different types of cancer). [Figure: classified samples A and B, and samples to be classified]

11 Things to study (5). Selecting genes: (a) deciding if a given gene, in isolation, behaves differently in a control versus experimental situation (e.g., cancer vs. normal, two types of cancer, treatment vs. non-treatment); (b) selecting which group of genes is significantly different in a control versus experimental situation (same examples); (c) selecting which group of genes is relevant for a given classification problem.
Clustering methods. Similarity-based (need a similarity function): construct a partition (agglomerative, bottom-up) or search for an optimal partition; typically hard clustering. Model-based (latent models, probabilistic or algebraic): first compute the model; clusters are obtained easily once the model is available; typically soft clustering.

12 Similarity-based clustering. Define a similarity function to measure the similarity between two objects. Common criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity. Two ways to construct the partition: hierarchical (e.g., agglomerative hierarchical clustering), or search starting from a random partition (e.g., K-means).
Agglomerative hierarchical clustering. Given a similarity function to measure the similarity between two objects, gradually group similar objects together in a bottom-up fashion and stop when some stopping criterion is met. Variations: different ways to compute group similarity from individual object similarity.

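13 Distance metrics. For clustering algorithms, calculating a distance between gene vectors or experiment vectors is a necessary step. Distance measures can be classified as metric distances or semi-metric distances. Metric distances satisfy: (1) $d_{ab} \ge 0$; (2) $d_{ab} = d_{ba}$; (3) $d_{aa} = 0$; (4) $d_{ab} \le d_{ac} + d_{cb}$. Semi-metric distances obey (1) to (3) but fail (4).
Distance metrics: Minkowski distance. For two p-dimensional objects i and j, $d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q}$. If q = 1, d is the Manhattan distance, $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$. If q = 2, d is the Euclidean distance, $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$. Both are metric distances: the Minkowski distance satisfies the triangle inequality (4) for every q >= 1.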
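As a small illustration of the Minkowski family, the sketch below computes the distance for an arbitrary q and shows the two special cases named on the slide. The function name `minkowski` and the example vectors are hypothetical.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if q < 1:
        raise ValueError("q must be >= 1 for the result to be a metric")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


gene_a = [0.0, 0.0]
gene_b = [3.0, 4.0]
print(minkowski(gene_a, gene_b, 1))  # Manhattan distance
print(minkowski(gene_a, gene_b, 2))  # Euclidean distance
```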

14 Distance metrics: Pearson correlation coefficient (semi-metric distance). $d(i,j) = \frac{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)^2 \, \sum_{k=1}^{n} (x_{jk} - \bar{x}_j)^2}}$, with $-1 \le d(i,j) \le +1$, where $\bar{x}_i$ is the mean of the n components of object i.
Distance metrics: entropy-based distances. Mutual information (semi-metric distance). Mutual information (MI) is a statistical measure of the dependence between two signals A and B: it quantifies the additional information known about one expression pattern when given another. MI is not based on linear models and can therefore also detect non-linear dependencies. [Figure: example of a non-linear dependency]
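The Pearson correlation coefficient between two expression profiles can be computed directly from its definition. The following is a minimal sketch; the function name `pearson` and the example profiles are hypothetical, and constant profiles (zero variance) are rejected because the coefficient is undefined for them.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


profile_1 = [1.0, 2.0, 3.0, 4.0]
profile_2 = [2.0, 4.0, 6.0, 8.0]   # same shape, different scale
profile_3 = [4.0, 3.0, 2.0, 1.0]   # opposite trend
print(pearson(profile_1, profile_2))
print(pearson(profile_1, profile_3))
```

Because the coefficient ignores scale and offset, two genes with the same expression shape but different magnitudes are treated as highly similar, which is often what is wanted when looking for co-regulation.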

15 Similarity-induced structure. [Figure]
How to compute group similarity? Three popular methods. Given two groups g1 and g2: single-link algorithm, s(g1,g2) = similarity of the closest pair; complete-link algorithm, s(g1,g2) = similarity of the farthest pair; average-link algorithm, s(g1,g2) = average similarity of all pairs.
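The three group-similarity rules can be sketched as follows. Note the assumption: the code works with a distance function rather than a similarity, so the "closest pair" is the pair with the smallest distance and the "farthest pair" the one with the largest. All function names are hypothetical.

```python
def single_link(g1, g2, dist):
    """Distance between the closest pair of objects, one from each group."""
    return min(dist(a, b) for a in g1 for b in g2)


def complete_link(g1, g2, dist):
    """Distance between the farthest pair of objects, one from each group."""
    return max(dist(a, b) for a in g1 for b in g2)


def average_link(g1, g2, dist):
    """Average distance over all cross-group pairs."""
    return sum(dist(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))


def absolute_difference(a, b):
    return abs(a - b)


group_1 = [1.0, 2.0]
group_2 = [4.0, 7.0]
print(single_link(group_1, group_2, absolute_difference))
print(complete_link(group_1, group_2, absolute_difference))
print(average_link(group_1, group_2, absolute_difference))
```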

16 Comparison of the three methods. Single-link: loose clusters; individual decision, sensitive to outliers. Complete-link: tight clusters; individual decision, sensitive to outliers. Average-link: in between; group decision, insensitive to outliers. Which one is the best? It depends on what you need!
Hierarchical (agglomerative) clustering. Strictly speaking, agglomerative clustering does not produce clusters but a dendrogram. [Figure: dendrogram, with dissimilarity on the vertical axis] Cutting the dendrogram at a certain level yields clusters. Dendrogram cutting is a problem analogous to the selection of K in K-means clustering.
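To make the dendrogram idea concrete, here is a naive sketch of bottom-up clustering that records every merge with its distance, plus a helper that "cuts" the result at a threshold. It assumes Euclidean distance and single-link by default (pass `linkage=max` for complete-link); the function names and the example points are hypothetical, and the quadratic search is written for clarity, not speed.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, linkage=min):
    """Return the merge history [(cluster_a, cluster_b, distance), ...].

    Clusters are tuples of point indices; linkage=min gives single-link.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


def cut(n_points, merges, threshold):
    """Apply only the merges whose distance is at most the threshold."""
    clusters = {(i,) for i in range(n_points)}
    for a, b, d in merges:
        if d > threshold:
            break
        clusters -= {a, b}
        clusters.add(a + b)
    return sorted(clusters)


conditions = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
history = agglomerate(conditions)
for a, b, d in history:
    print(a, b, round(d, 3))
print(cut(len(conditions), history, threshold=2.0))
```

The merge history is exactly the information a dendrogram draws: each merge is a junction placed at the height of its distance, and choosing the threshold plays the same role as choosing K in K-means.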

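17 Example of agglomerative gene clustering (Eisen et al., 1998). Microarray data from a time course of serum stimulation of primary human fibroblasts. Experiment: foreskin fibroblasts were grown in culture and deprived of serum for 48 hr. Serum was added back and samples were taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, and 24 hr. Clustering: agglomerative clustering with the correlation coefficient and average-link. Clusters with a biological interpretation: (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signalling and angiogenesis, (E) wound healing and tissue remodelling.
Data structures. Data matrix (n objects by p variables): $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$. Dissimilarity matrix (n by n, lower triangular): $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$.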

18 Partitioning algorithms: basic concept. Partitioning method: construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition into k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: the k-means and k-medoids algorithms. k-means (MacQueen 67): each cluster is represented by the center of the cluster. k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw 87): each cluster is represented by one of the objects in the cluster.
The K-means clustering method. Given k, the k-means algorithm is implemented in four steps. Step 1: partition the objects into k nonempty subsets. Step 2: compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster). Step 3: assign each object to the cluster with the nearest seed point. Step 4: go back to Step 2; stop when there are no new assignments.
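The k-means loop can be sketched as below. One deliberate difference from the four steps on the slide, stated as an assumption: instead of starting from an initial partition (Step 1), this sketch starts from given seed points, and then alternates assignment (Step 3) and centroid update (Step 2) until no assignment changes. The names `kmeans` and `squared_distance` and the example data are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Return (assignment, centroids) for the given seed points."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop when there are no new assignments.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each seed point as the mean of its cluster members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assignment, centroids


data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(data, initial_centroids=[(1, 1), (9, 8)])
print(labels)
print(centers)
```

Because the result depends on the seed points, running the sketch with different `initial_centroids` can give different partitions, which is the local-optimum behaviour discussed in the comments on the method.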

19 The K-means clustering method: example with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign. [Figure]
Comments on the K-means method. Strength: relatively efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. For comparison: PAM is O(k(n-k)^2) and CLARA is O(ks^2 + k(n-k)). Comment: it often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing. Weaknesses: applicable only when the mean is defined (what about categorical data?); k, the number of clusters, must be specified in advance; unable to handle noisy data and outliers; not suitable for discovering clusters with non-convex shapes.

20 Variations of the K-means method. A few variants of k-means differ in: the selection of the initial k means; the dissimilarity calculations; the strategies to calculate cluster means. Handling categorical data: k-modes (Huang 98), which replaces the means of clusters with modes, uses new dissimilarity measures to deal with categorical objects, and uses a frequency-based method to update the modes of clusters. For a mixture of categorical and numerical data: the k-prototype method.
What is the problem with the k-means method? The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
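The outlier sensitivity of the mean, and the robustness of the medoid, can be seen on a tiny one-dimensional cluster. This sketch assumes that "most centrally located" means the object with the smallest total distance to all other objects in the cluster; the function names and values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(objects, dist):
    """The object with the smallest total distance to all objects."""
    return min(objects, key=lambda c: sum(dist(c, o) for o in objects))


def absolute_difference(a, b):
    return abs(a - b)


cluster = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 is an outlier
print(mean(cluster))                      # pulled far towards the outlier
print(medoid(cluster, absolute_difference))  # stays inside the dense region
```

The medoid is always one of the actual objects, so it only requires a distance function, not a mean; this is also why k-medoids can be applied where a mean is not defined.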

21 Problem 1. Consider the following expression matrix, where the expression levels of 2 genes (G1 and G2) were analyzed in 7 healthy/infected tissues (conditions C1 to C7). [Table: expression matrix, genes G1 and G2 by conditions C1 to C7] Consider also the problem of grouping tissues given the expression profiles of the genes using clustering algorithms. Determine the dendrogram found by a hierarchical clustering algorithm (HCA) using a bottom-up approach, the Euclidean distance to compute the distance between conditions, and the single-link distance to compute the distance between groups (inter-cluster distance). How would you use the dendrogram to group the tissues into groups (clusters), and which would those clusters be? Determine the groups found by the K-means (K=2) algorithm when the centroids are initialized with C5 = (4,3) and C6 = (1,1).
Biclustering: motivation. Gene expression matrices have been extensively analyzed using clustering in one of two dimensions: the gene dimension or the condition dimension. This corresponds to the analysis of expression patterns of genes by comparing rows in the matrix, and to the analysis of expression patterns of samples by comparing columns in the matrix.

22 Biclustering: motivation. Common objectives pursued when analyzing gene expression data include: 1. Grouping of genes according to their expression under multiple conditions. 2. Classification of a new gene, given its expression and the expression of other genes with known classification. 3. Grouping of conditions based on the expression of a number of genes. 4. Classification of a new sample, given the expression of the genes under that experimental condition.
What is biclustering? Biclustering = simultaneous clustering of both rows and columns of a data matrix. The concept can be traced back to the 1970s (Hartigan, 1972), although it has been rarely used or studied. The term was introduced by Cheng and Church (2000), who were the first to use it in gene expression data analysis. The technique is used in other fields, such as collaborative filtering, information retrieval and data mining.

23 What is biclustering? We consider an n by m data matrix, A = (X,Y), where X = {x_1, ..., x_n} is the set of n rows, Y = {y_1, ..., y_m} is the set of m columns, and a_ij is a numeric value (discrete or real) representing the relation between row i and column j. In the case of gene expression matrices, X = set of genes, Y = set of conditions, and a_ij = expression level of gene i under condition j (real value).
What is biclustering? Gene expression matrix A = (X,Y):
          Condition 1   ...   Condition j   ...   Condition m
Gene 1    a_11          ...   a_1j          ...   a_1m
...
Gene i    a_i1          ...   a_ij          ...   a_im
...
Gene n    a_n1          ...   a_nj          ...   a_nm

24 What is biclustering? Given the matrix A = (X,Y), let I be a subset of rows and J a subset of columns. (I,Y) = a subset of rows that exhibit similar behavior across the set of all columns = cluster of rows. (X,J) = a subset of columns that exhibit similar behavior across the set of all rows = cluster of columns.
What is biclustering? (I,J) = a subset of rows and a subset of columns, where the rows exhibit similar behavior across the columns and vice versa = sub-matrix of A that contains only the elements a_ij with row i in I and column j in J = bicluster. We want to identify a set of biclusters B_k = (I_k, J_k). Each bicluster B_k must satisfy some specific characteristics of homogeneity.
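In code, a cluster of rows, a cluster of columns and a bicluster are all just index selections on the matrix. The sketch below uses a small hypothetical matrix and 0-based indices; the function name `submatrix` is an assumption for illustration.

```python
def submatrix(A, rows, cols):
    """Elements a_ij of A with i in rows and j in cols, in the given order."""
    return [[A[i][j] for j in cols] for i in rows]


A = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
all_rows = range(len(A))
all_cols = range(len(A[0]))

I = [1, 2]   # subset of rows (0-based)
J = [0, 3]   # subset of columns (0-based)

print(submatrix(A, I, all_cols))   # cluster of rows (I, Y)
print(submatrix(A, all_rows, J))   # cluster of columns (X, J)
print(submatrix(A, I, J))          # bicluster (I, J)
```

Note that the rows and columns of a bicluster do not have to be contiguous in the original matrix, which is what distinguishes it from a simple rectangular block.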

25 What is biclustering? Example with X = {G_1, G_2, G_3, G_4, G_5, G_6} and Y = {C_1, C_2, C_3, C_4, C_5, C_6, C_7, C_8, C_9, C_10}. [Figure: 6 by 10 expression matrix with rows G_1 to G_6, columns C_1 to C_10, and entries a_ij] Cluster of columns (X,J), with J = {C_4, C_5, C_6}. Cluster of rows (I,Y), with I = {G_2, G_3, G_4}. Bicluster (I,J) = ({G_2, G_3, G_4}, {C_4, C_5, C_6}).
What is biclustering? Biclustering goals: perform simultaneous clustering on the row and column dimensions of the gene expression matrix instead of clustering the rows and columns separately; identify sub-matrices (subsets of rows and subsets of columns) with interesting properties. Gene expression data analysis: identify subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. Reference: Madeira, Sara C. and Oliveira, Arlindo L. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics, January 2004.

26 Bicluster types. An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find. There are four major classes of biclusters: 1. Biclusters with constant values. 2. Biclusters with constant values on rows or columns. 3. Biclusters with coherent values. 4. Biclusters with coherent evolutions.
Constant values. [Figure: bicluster with constant values]

27 Constant values on rows or columns. [Figure: constant rows; constant columns]
Coherent values. [Figure: additive model; multiplicative model]
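The two coherent-value models can be checked mechanically. The definitions used here are an assumption taken from the usual formulation in the biclustering literature, since the slide only names the models: in the additive model each entry is a base value plus a row offset plus a column offset, and in the multiplicative model the offsets are factors. The function names and example matrices are hypothetical, and the multiplicative check assumes a non-zero top-left entry.

```python
def is_additive(B, tol=1e-9):
    """True if b_ij = b_i0 + b_0j - b_00 for every entry (additive model)."""
    return all(
        abs(B[i][j] - B[i][0] - B[0][j] + B[0][0]) <= tol
        for i in range(len(B))
        for j in range(len(B[0]))
    )


def is_multiplicative(B, tol=1e-9):
    """True if b_ij * b_00 = b_i0 * b_0j for every entry (multiplicative model)."""
    return all(
        abs(B[i][j] * B[0][0] - B[i][0] * B[0][j]) <= tol
        for i in range(len(B))
        for j in range(len(B[0]))
    )


additive_example = [
    [1, 2, 5],
    [2, 3, 6],   # first row + 1
    [4, 5, 8],   # first row + 3
]
multiplicative_example = [
    [1, 2, 4],
    [2, 4, 8],    # first row * 2
    [3, 6, 12],   # first row * 3
]
print(is_additive(additive_example), is_multiplicative(additive_example))
print(is_additive(multiplicative_example), is_multiplicative(multiplicative_example))
```

Constant-row and constant-column biclusters are special cases of the additive model in which all column offsets, or all row offsets, are zero.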

28 Coherent evolutions. [Figure: overall coherent evolution; coherent evolution on the rows]
Coherent evolutions. [Figure: coherent evolution on the columns; order-preserving sub-matrix (OPSM)]
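For the order-preserving sub-matrix (OPSM), a simple check is possible. The sketch assumes the usual definition: a sub-matrix is order-preserving if some single permutation of its columns makes the values in every row increasing, which is equivalent to all rows ranking the columns in the same order. The function name is hypothetical, and ties within a row are not treated specially.

```python
def is_order_preserving(B):
    """True if every row of B induces the same ordering of the columns."""
    def column_order(row):
        return sorted(range(len(row)), key=lambda j: row[j])

    reference = column_order(B[0])
    return all(column_order(row) == reference for row in B)


coherent = [
    [1, 5, 3],
    [10, 50, 20],
    [0, 9, 4],
]
not_coherent = [
    [1, 5, 3],
    [3, 2, 1],
]
print(is_order_preserving(coherent))
print(is_order_preserving(not_coherent))
```

Only the relative order of the values matters here, not their magnitudes, which is what separates coherent evolutions from coherent values.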

29 Algorithms. When this is the case, a bicluster corresponds to a biclique in the corresponding bipartite graph. Finding a maximum-size bicluster is equivalent to finding the maximum edge biclique in a bipartite graph. This problem is known to be NP-complete (Peeters, 2003). More complex cases, where the actual numeric values in the matrix A are taken into account to compute the quality of a bicluster, have a complexity that is necessarily no lower than that of this simpler case.
Algorithms. Given this, the large majority of the algorithms use heuristic approaches to identify biclusters. In many cases the algorithm is preceded by a normalization step that is applied to the data matrix; the goal is to make the patterns of interest more evident. Some algorithms avoid heuristics but exhibit an exponential worst-case runtime.

30 Algorithms. Different objectives: identify one bicluster; identify a given number of biclusters. Different approaches: discover one bicluster at a time; discover one set of biclusters at a time; discover all biclusters at the same time (simultaneous bicluster identification).
Algorithms: heuristic approaches. Iterative row and column clustering combination: apply clustering algorithms to the rows and columns of the data matrix separately, then combine the two cluster arrangements using some sort of iterative procedure. Divide and conquer: break the problem into several subproblems that are similar to the original problem but smaller in size, and solve the subproblems recursively.

31 Algorithms: heuristic approaches. Divide and conquer (continued): combine the intermediate solutions to create a solution to the original problem. These methods usually break the matrix into submatrices (biclusters) based on a certain criterion and then continue the biclustering process on the new submatrices. Greedy iterative search: always make a locally optimal choice in the hope that this choice will lead to a globally good solution; usually performs greedy row/column addition/removal.
Algorithms. Exhaustive bicluster enumeration: a number of methods have been used to speed up exhaustive search. In some cases the algorithms assume restrictions on the size of the biclusters that should be listed.

32 Measure cluster homogeneity. [Figure]

33 Missing values: replaced by random numbers. Find one bicluster at a time. Hide biclusters using random numbers.


35 Example. Consider the following expression matrix A(X,Y), with rows I and columns J. [Table: expression matrix] Run the Brute-Force Deletion and Addition algorithm to find a bicluster.

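36 Example. Run the algorithm with δ = 0 (maximum acceptable mean squared residue score) and α = 1,5 (a threshold for the multiple node deletion). Overall mean: $a_{IJ} = 29/(4 \times 4)$. Row means: $a_{1J} = 5/4$, $a_{2J} = 8/4$, $a_{3J} = 6/4$, $a_{4J} = 10/4$. Column means: $a_{I1} = 7/4$, $a_{I2} = 4/4$, $a_{I3} = 9/4$, $a_{I4} = 9/4$. $H(I,J) = \frac{1}{4 \times 4} \left( (a_{11} - a_{1J} - a_{I1} + a_{IJ})^2 + (a_{12} - a_{1J} - a_{I2} + a_{IJ})^2 + (a_{13} - a_{1J} - a_{I3} + a_{IJ})^2 + (a_{14} - a_{1J} - a_{I4} + a_{IJ})^2 + (a_{21} - a_{2J} - a_{I1} + a_{IJ})^2 + \dots \right) = 1,8$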
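The mean squared residue H(I,J) used in this example can be computed as sketched below. The matrix of the worked example is not preserved in the transcript, so the two small matrices here are hypothetical and serve only to show the calculation; the function name is also an assumption.

```python
def mean_squared_residue(A, rows, cols):
    """H(I,J): mean of (a_ij - a_iJ - a_Ij + a_IJ)^2 over the sub-matrix."""
    # a_iJ: mean of row i over the selected columns.
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    # a_Ij: mean of column j over the selected rows.
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    # a_IJ: mean of the whole sub-matrix.
    total_mean = (sum(A[i][j] for i in rows for j in cols)
                  / (len(rows) * len(cols)))
    return sum(
        (A[i][j] - row_mean[i] - col_mean[j] + total_mean) ** 2
        for i in rows
        for j in cols
    ) / (len(rows) * len(cols))


# A perfectly additive sub-matrix has a residue of (numerically) zero.
coherent = [[1, 2, 5], [2, 3, 6], [4, 5, 8]]
print(mean_squared_residue(coherent, rows=[0, 1, 2], cols=[0, 1, 2]))

# A checkerboard pattern is not coherent and gets a positive score.
checkerboard = [[1, 0], [0, 1]]
print(mean_squared_residue(checkerboard, rows=[0, 1], cols=[0, 1]))
```

A sub-matrix is accepted as a bicluster when this score does not exceed δ; node deletion removes the rows or columns that contribute most to the score until that condition holds.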

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2 Clustering Community Detection http://image.slidesharecdn.com/communitydetectionitilecturejune0-0609559-phpapp0/95/community-detection-in-social-media--78.jpg?cb=3087368 Jian Pei: CMPT 74/459 Clustering

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

2. Background. 2.1 Clustering

2. Background. 2.1 Clustering 2. Background 2.1 Clustering Clustering involves the unsupervised classification of data items into different groups or clusters. Unsupervised classificaiton is basically a learning task in which learning

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

DNA chips and other techniques measure the expression level of a large number of genes, perhaps all

DNA chips and other techniques measure the expression level of a large number of genes, perhaps all INESC-ID TECHNICAL REPORT 1/2004, JANUARY 2004 1 Biclustering Algorithms for Biological Data Analysis: A Survey* Sara C. Madeira and Arlindo L. Oliveira Abstract A large number of clustering approaches

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ GeneChip 2011/11/29 EECS 730 2 Hybridization to the Chip 2011/11/29

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Biclustering for Microarray Data: A Short and Comprehensive Tutorial

Biclustering for Microarray Data: A Short and Comprehensive Tutorial Biclustering for Microarray Data: A Short and Comprehensive Tutorial 1 Arabinda Panda, 2 Satchidananda Dehuri 1 Department of Computer Science, Modern Engineering & Management Studies, Balasore 2 Department

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Pierre Gaillard ENS Paris September 28, 2018 1 Supervised vs unsupervised learning Two main categories of machine learning algorithms: - Supervised learning: predict output Y from

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Introduction to GE Microarray data analysis Practical Course MolBio 2012 Introduction to GE Microarray data analysis Practical Course MolBio 2012 Claudia Pommerenke Nov-2012 Transkriptomanalyselabor TAL Microarray and Deep Sequencing Core Facility Göttingen University Medical

More information

Cluster Analysis for Microarray Data
