Active Clustering and Ranking


1 Active Clustering and Ranking

Rob Nowak, University of Wisconsin-Madison
IMA Workshop on "High-Dimensional Phenomena" (9/26-30, 2011)
With Gautam Dasarathy, Brian Eriksson (Madison/Boston), Kevin Jamieson, Aarti Singh (CMU)

2-3 Clustering Using Pairwise Similarities

Clustering problem: Consider a set of n objects x_1, ..., x_n. To cluster the objects into meaningful groups, we can ask questions of the form "how similar is x_i to x_j?" for i, j in {1, ..., n}. The goal is to cluster the objects by asking as few questions as possible.

Metric data: s_ij ∝ 1/dist(i, j).

Standard approaches (k-means, spectral clustering, etc.) assume the full set of n(n-1)/2 similarities is known, but in high-dimensional settings (n large) similarities may be missing and/or costly to collect or compute. However, the existence of clusters can induce redundancy among the similarities, and therefore it may be possible to robustly cluster based on a small subset.

4-5 Ranking Based on Pairwise Comparisons

Ranking problem: Consider a set of n objects x_1, ..., x_n in R^d. The locations of x_1, ..., x_{n-1} are known, but the location of x_n is unknown. To gather information about x_n, we can only ask questions of the form "is object x_n closer to x_i than to x_j?" The goal is to rank x_1, ..., x_{n-1} by their distances to x_n, asking as few questions as possible.

Ordinal data: 1(δ_i < δ_j), where δ_i denotes the distance from x_i to x_n.

Standard ranking methods assume the full set of n(n-1)/2 comparisons is known, but in high-dimensional settings (n large) comparisons may be missing and/or costly to collect. However, many comparisons are redundant because the objects embed in R^d, and therefore it may be possible to correctly rank based on a small subset.

6 Cost of Obtaining Similarities

- Internet topology identification: similarity = covariance of delay measurements. Need to send O(N^2) packets, a significant burden on network resources.
- Phylogenetics: similarity = genome sequence alignments. Need to align sequences of length ~10000; computationally expensive to compare sequences.
- Identifying co-regulated genes: similarity = Pearson correlation of expression levels across conditions/drugs. Need to test many conditions/drugs (~10000); high cost of experimentation.
- Sonar: human-judged similarities place an excessive burden on human experts.

7 Hierarchical Clustering

Given N objects and pairwise similarities between them, arrange the objects into a hierarchy of clusters.

Sample complexity: What is the minimum number of similarities needed to cluster?

8 Motivation: Internet Topology Inference

Correlation between traffic patterns at two points can indicate the similarity between nodes (e.g., the number of shared links in their paths). [Figure: probing hosts with SYN / SYN-ACK packets.]

9-13 Passive vs. Active Clustering

Generate a hierarchical clustering of the objects:
- Passive clustering: given all, or a random subset of, pre-selected pairwise similarities.
- Active clustering: by sequentially selecting which pairwise similarities to measure, in an adaptive fashion.

Example: [Figures: a pairwise similarity matrix with entries ranging from very dissimilar to very similar, only partially observed; which entries do we still need to ask for?]

14-16 Cost of Passive Clustering

To cluster N objects, most clustering methods require all N(N-1)/2 pairwise similarities. Can random sub-sampling reduce this burden?

Theorem 1: To recover clusters of size m, any clustering procedure based on random sub-sampling requires at least N(N-1)/m similarities. Small clusters are hard to find using randomly selected similarities.

Related facts: Clusters of size O(N) can be recovered from random samples of O(N log N) similarities (Balcan-Gupta 10, Shamir-Tishby 11). Any method needs at least N similarities (at least one per object).
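To get a feel for Theorem 1, the following quick simulation (a sketch; the uniform pair-sampling model and all names here are illustrative, not from the talk) estimates how often a random budget of similarities contains no pair at all from a planted cluster of size m:

```python
import random

def miss_prob(N, m, budget, trials=2000):
    """Estimate the probability that `budget` uniformly sampled pairs
    contain no intra-cluster pair from a planted cluster of size m
    among N objects."""
    cluster = set(range(m))       # the planted small cluster
    misses = 0
    for _ in range(trials):
        hit = any(i in cluster and j in cluster
                  for i, j in (random.sample(range(N), 2) for _ in range(budget)))
        misses += not hit
    return misses / trials

# e.g., miss_prob(500, 5, 5000) is roughly 2/3: a cluster of size 5 among
# 500 objects is usually invisible to thousands of random similarities.
```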

17 Hierarchical Clustering

Given N objects and pairwise similarities between them, arrange the objects into a hierarchy of clusters (a.k.a. a cluster tree).

Identifiability: Is there an unambiguous way to hierarchically cluster?

18-20 Hierarchically Consistent Similarities

Do the similarities conform to the hierarchical clustering structure?

Consistency assumption: Similarities are consistent with a hierarchical clustering if, for every cluster C, every outlier i not in C, and all j, k in C, the intra-cluster similarity exceeds the inter-cluster similarities: s_jk > max(s_ij, s_ik).

outlier(i, j, k) =
  i if max(s_ij, s_ik) < s_jk
  j if max(s_ij, s_jk) < s_ik
  k if max(s_ik, s_jk) < s_ij
i.e., determine which two of the three objects are most similar.
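For concreteness, here is a minimal sketch of the outlier test in Python (the similarity lookup s[a][b] is an assumed interface, and ties are ignored for simplicity):

```python
def outlier(i, j, k, s):
    """Return whichever of i, j, k is least similar to the other two,
    using exactly 3 pairwise similarities; s[a][b] is the similarity
    between objects a and b."""
    if max(s[i][j], s[i][k]) < s[j][k]:
        return i
    if max(s[i][j], s[j][k]) < s[i][k]:
        return j
    return k
```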

21-30 Active Clustering by Outlier Testing

Methodology example:
1. Start with the tree structure over the current objects; a new object arrives, whose true placement in the tree is unknown to the algorithm.
2. Find an interior node v such that the tree is roughly divided in two.
3. Choose two objects whose nearest common ancestor is v.
4. Find the outlier among these two objects and the new object; this uses 3 pairwise similarities.
5. The outlier test resolves which subtree the new object is in.
6. Repeat until the true location for the new object is found.

31-33 Active Clustering: Efficient Hierarchical Clustering

A binary search procedure based on the outlier test:
outlier(i, j, k) = i if max(s_ij, s_ik) < s_jk; j if max(s_ij, s_jk) < s_ik; k if max(s_ik, s_jk) < s_ij.

Each object can be placed using log N outlier tests, at 3 similarities per test.

Theorem 2: Under the consistency assumption, a hierarchical clustering of N objects can be recovered using at most 3N log N sequentially selected similarities. This is within a constant factor of the information-theoretic lower bound.

Remark: The binary search strategy is closely related to computationally efficient schemes for learning tree-structured graphical models proposed in Pearl and Tarsi, "Structuring Causal Trees," J. Complexity, 1986.
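A minimal sketch of the placement step, assuming consistent similarities and the outlier function above. The names Node, any_leaf, and insert are illustrative; note that for the log N guarantee the test node should be chosen to split the tree roughly in half, whereas this simplified version just descends from the root:

```python
class Node:
    """Binary cluster tree: a leaf carries an object id, an internal node has two children."""
    def __init__(self, left=None, right=None, obj=None):
        self.left, self.right, self.obj = left, right, obj

def any_leaf(node):
    # any representative object from this subtree
    while node.obj is None:
        node = node.left
    return node.obj

def insert(node, x, s):
    """Splice object x into the cluster tree using one outlier test per level."""
    if node.obj is not None:                    # single object: pair it with x
        return Node(Node(obj=node.obj), Node(obj=x))
    j, k = any_leaf(node.left), any_leaf(node.right)
    o = outlier(x, j, k, s)
    if o == x:                                  # x is dissimilar to both sides:
        return Node(node, Node(obj=x))          # it forms a new sibling cluster here
    if o == k:                                  # x clusters with j: descend left
        return Node(insert(node.left, x, s), node.right)
    return Node(node.left, insert(node.right, x, s))   # x clusters with k
```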

34 Experimental Results with Synthetic Internet Topology (Network Core)

[Table: number of leaf nodes N; agglomerative similarities required; outlier similarities required; savings over exhaustive approaches. Numeric entries illegible.]

35 Sensitivity to Errors and Inconsistencies

The previous technique is very sensitive to noise and inconsistencies. A single inconsistent similarity, e.g., one violating s_jk > max(s_ij, s_ik), flips an outlier test, and an early wrong decision can lead to cascading failures: the estimated location of an object in the tree ends up far from its true location.

36-37 Robust Active Clustering

To overcome this fragility, we design a top-down recursive splitting approach and use voting to boost our confidence about each decision we make.

Goal: In each step, split a single cluster into 2 sub-clusters, efficiently and accurately.

38-41 Cluster Splitting Procedure

Pick a seed object. For each remaining object, choose m test pairs uniformly at random and run outlier tests on them. If the outcomes of the outlier tests are consistent with the object and the seed belonging together, then the two are probably clustered together.

42-49 Why? Consider outlier tests without errors. [Figures: sequences of outlier-test outcomes on the test pairs.] Certain patterns of outcomes are impossible if the two objects in question are not clustered together.

50-52 Clustered vs. Non-Clustered

Let A denote the event that the outlier tests on a random test pair come out consistent with the two objects being clustered together.

If outlier tests are perfectly correct, then
P(A | clustered) = 2η(1-η), where η is the fractional size of the seed cluster, and
P(A | not clustered) = 0.

If outlier tests are incorrect with probability p < 1/2, then
P(A | clustered) ≥ 2(1-p)η(1-η) and
P(A | not clustered) ≤ p.

53 Sample Complexity of Robust Active Clustering

Theorem 3: If outlier tests are inconsistent with probability p < 2η(1-η)/(1 + 2η(1-η)), then we correctly hierarchically cluster N objects (down to clusters of size log N) using O(N log^2 N) sequentially and adaptively selected pairwise similarities.

The extra log N factor, relative to the noiseless case, is due to voting over the m test pairs.
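The voting step might look like the following sketch. Here event_A is a placeholder for the exact outlier-test pattern in the preceding slides, and the threshold should be placed between p and 2(1-p)η(1-η); all names are illustrative assumptions:

```python
import random

def vote_same_cluster(x, seed, pool, m, event_A, thresh):
    """Decide whether x and seed belong to the same cluster by voting over
    m test pairs drawn at random from `pool`; event_A(x, seed, y, z) should
    return True when the outlier tests on the pair (y, z) come out consistent
    with x and seed being co-clustered."""
    hits = 0
    for _ in range(m):
        y, z = random.sample(pool, 2)           # a random test pair
        hits += bool(event_A(x, seed, y, z))
    # with m on the order of log N test pairs, the empirical rate concentrates
    # on one side of the threshold; this voting is the source of the extra log factor
    return hits / m >= thresh
```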

54 Robust Active Clustering Experiments

Synthetic experiments: balanced binary tree with N = 512 objects.

[Table: resolvability (smallest correctly recovered cluster size; smaller is better) versus probability of inconsistency (5%, ...) for agglomerative clustering, robust active (38% of similarities measured), and robust active (65% measured); numeric entries illegible.]

55 Robust Active Clustering Experiments

Gene microarray experiments, N = 512 genes: bottom-up agglomerative clustering versus top-down active clustering (using 43% of the pairwise similarities).

56 Summary

- Randomly selected similarities: at least N(N-1)/m are required, where m is the smallest cluster size.
- Adaptively selected, consistent similarities: all clusters of any size, using 3N log N similarities.
- Adaptively selected, inconsistent similarities: all clusters of size > log N, using O(N log^2 N) similarities.

"Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities," Artificial Intelligence and Statistics (AISTATS), 2011.

57 Ranking Based on Pairwise Comparisons

Ranking problem (recap): Consider a set of n objects x_1, ..., x_n in R^d. The locations of x_1, ..., x_{n-1} are known, but the location of x_n is unknown. To gather information about x_n, we can only ask questions of the form "is object x_n closer to x_i than to x_j?" The goal is to rank x_1, ..., x_{n-1} by their distances to x_n, asking as few questions as possible.

Ordinal data: 1(δ_i < δ_j).

Standard ranking methods assume the full set of n(n-1)/2 comparisons is known, but in high-dimensional settings (n large) comparisons may be missing and/or costly to collect. However, many comparisons are redundant because the objects embed in R^d, and therefore it may be possible to correctly rank based on a small subset.

58-63
Bartender: What's your favorite beer, sir?
Al: Hmm... actually I'm more of a wine man.
Bartender: Try these two samples. Do you prefer A or B?
Al: B.
Bartender: OK, try these two: C or D?
...

64 Ranking Relative to Distance

[Figure: items A-G and a reference point r.] The point r represents Al's latent preferences in "beer space" (e.g., hoppiness, lightness, sweetness, ...).

65-67 Ranking Relative to Distance

Different positions of the reference point r induce different rankings of A-G by distance, e.g.:
C < A < B < E < G < D < F
E < B < F < G < C < A < D
D < G < C < E < A < B < F

68 Ranking Relative to Distance

Goal: Determine the ranking (here D < G < C < E < A < B < F) by asking comparisons like "Is r closer to A or B?"

Weakness of randomized schemes: if comparisons are selected at random, then almost all n(n-1)/2 comparisons are needed to rank.

69-72 Ranking with Adaptively Selected Queries

Insert H into: D < G < C < E < A < B < F

Answers collected by binary search: { }, then {H < E}, then {H < E}, {G < H}, then {H < E}, {G < H}, {H < C}.

Result: D < G < H < C < E < A < B < F

73-74 Ranking with Adaptively Selected Queries

Binary insertion takes log_2 k comparisons to insert an item into a ranked list of k objects, so about n log_2 n comparisons suffice to rank n objects...

...but does the embedding dimension d affect the sample complexity?
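The insertion step above is ordinary binary insertion against a comparison oracle. A minimal sketch, where closer(a, b) answers "is a closer to the reference than b?" (the function names are illustrative):

```python
def insert_ranked(ranked, item, closer):
    """Insert `item` into `ranked` (nearest first) using ~log2(len(ranked)) oracle calls."""
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi) // 2
        if closer(item, ranked[mid]):
            hi = mid
        else:
            lo = mid + 1
    ranked.insert(lo, item)

def rank_all(items, closer):
    """Rank all n items: n insertions of ~log2 n comparisons each."""
    ranked = []
    for it in items:
        insert_ranked(ranked, it, closer)
    return ranked
```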

75-76 Ranking by Exploiting Low-Dimensional Geometry

Many comparisons are redundant because the objects embed in R^d, and therefore it may be possible to correctly rank based on a small subset.

Binary information we can gather: q_{i,j} = 1(x_n is closer to x_i than x_j).

Sequential data selection:
  input: x_1, ..., x_{n-1} in R^d and x_n at an unknown position in R^d
  initialize: put x_1, ..., x_{n-1} in uniformly random order
  for k = 2, ..., n-1:
    for i = 1, ..., k-1:
      if q_{i,k} is ambiguous given the answers so far, ask for the pairwise comparison
      else impute q_{i,k} from the answers so far
  output: a ranking of x_1, ..., x_{n-1} consistent with all pairwise comparisons
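The ambiguity test has a clean geometric form: the answer to "is x_n closer to x_i than to x_j?" constrains the unknown point r = x_n to a halfspace, since ||r - x_i||^2 < ||r - x_j||^2 reduces to 2(x_j - x_i)·r < ||x_j||^2 - ||x_i||^2. So a query is ambiguous exactly when both answers are feasible given the halfspaces collected so far, which can be checked with two linear programs. A sketch using SciPy (the bounding box and all function names are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def halfspace(xi, xj):
    # a . r <= b  encodes  "r is (weakly) closer to xi than to xj"
    return 2 * (xj - xi), float(xj @ xj - xi @ xi)

def feasible(A, b, d, box=1e3):
    """Is there an r in [-box, box]^d satisfying A r <= b?"""
    if not A:
        return True
    res = linprog(c=np.zeros(d), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(-box, box)] * d, method="highs")
    return res.status == 0        # 0 means a feasible (optimal) point was found

def ambiguous(A, b, xi, xk, d):
    """Query q_{i,k} is ambiguous iff both possible answers are consistent
    with the halfspaces (A, b) induced by the answers collected so far."""
    a_yes, b_yes = halfspace(xi, xk)    # answer: closer to x_i
    a_no, b_no = halfspace(xk, xi)      # answer: closer to x_k
    return (feasible(A + [a_yes], b + [b_yes], d)
            and feasible(A + [a_no], b + [b_no], d))
```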

77-79 Ranking and Geometry

Definition: The answers to previous queries induce a region of ambiguity: the set of positions for x_n consistent with all answers so far. Any query that intersects this region is ambiguous; otherwise it is unambiguous.

80-85 Ranking and Geometry

The bisecting hyperplanes of the k(k-1)/2 pairs among the first k objects partition R^d into cells:
number of d-cells ≈ k^{2d}/d!
number of cells intersected by a new query ≈ k^{2(d-1)}/(d-1)!
⇒ P(ambiguous) ≈ d/k^2
⇒ E[# ambiguous] ≈ d/k per stage
⇒ E[# requested] = Σ_{k=2}^{n} d/k ≈ d log n

(Buck 1943; Coombs 1960; Cover 1965; Jamieson & Nowak 2011)

86 Summary

Number of comparisons needed to rank:
- random selection: O(n^2)
- sequential, without geometry: O(n log n)
- exploiting geometry: O(d log n)
- noise-tolerant: O(d log^2 n)

"Active Ranking using Pairwise Comparisons," NIPS 2011. code:

87 Sonar Example

Sonar echo audio signals bounced off 50 targets and 50 rocks. S_{i,j} = human-judged similarity between signals i and j.

Learning task: leave one signal out of the set and rank the other 99 using comparisons q_{i,j} = 1(S_{i,·} < S_{j,·}).

Compute a d-dimensional embedding using MDS on the similarity matrix. Note that S_{i,·} < S_{j,·} need not imply that x_i is closer to r than x_j, because the embedding is approximate.

[Table: for embedding dimensions 2 and 3, the % of queries requested, the best achievable error d(y, ỹ), and our algorithm's error d(y, ŷ), in Kendall tau distance; numeric entries illegible.]
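The embedding step could be done as in the following sketch, assuming scikit-learn and a simple similarity-to-dissimilarity conversion (both choices are assumptions; the talk does not specify its MDS setup):

```python
import numpy as np
from sklearn.manifold import MDS

def embed_similarities(S, d=2):
    """Embed objects into R^d from a symmetric similarity matrix S
    (larger values = more similar) via metric MDS."""
    D = S.max() - S                   # assumed conversion to dissimilarities
    np.fill_diagonal(D, 0.0)
    mds = MDS(n_components=d, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(D)
```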
