Data Mining: Concepts and Techniques. Chapter 7. Cluster Analysis. Examples of Clustering Applications. What is Cluster Analysis?

Size: px
Start display at page:

Download "Data Mining: Concepts and Techniques. Chapter 7. Cluster Analysis. Examples of Clustering Applications. What is Cluster Analysis?"

Transcription

1 Data Mining: Concepts an Techniques Chapter Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Jiawei Han an Micheline Kamber, All rights reserve Chapter. Cluster Analysis. What is Cluster Analysis?. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methos. Partitioning Methos. Hierarchical Methos. Density-Base Methos What is Cluster Analysis? Examples of Clustering Applications Cluster: a collection of ata objects Similar to one another within the same cluster Dissimilar to the objects in other clusters istance (or similarity) measures Cluster analysis Fining similarities between ata accoring to the characteristics foun in the ata an grouping similar ata objects into clusters Unsupervise learning: no preefine classes Typical applications As a stan-alone tool to get insight into ata istribution As a preprocessing step for other algorithms Marketing: Help marketers iscover istinct groups in their customer bases, an then use this knowlege to evelop targete marketing programs Lan use: Ientification of areas of similar lan use in an earth observation atabase Insurance: Ientifying groups of motor insurance policy holers with a high average claim cost City-planning: Ientifying groups of houses accoring to their house type, value, an geographical location Earth-quake stuies: Observe earth quake epicenters shoul be clustere along continent faults Requirements of Clustering in Data Mining Example clustering for ontology aligment Scalability Ability to eal with ifferent types of attributes Discovery of clusters with arbitrary shape Minimal requirements for omain knowlege to etermine input parameters Able to eal with noise an outliers Insensitive to orer of input recors High imensionality Incorporation of user-specifie constraints Interpretability an usability

2 Chapter. Cluster Analysis Data Structures. What is Cluster Analysis?. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methos. Partitioning Methos. Hierarchical Methos. Density-Base Methos Data matrix n objects, p attributes (two moes) One row represents one object Dissimilarity matrix Distance table (one moe) x x i x n (,) (, ) : ( n,) x f x if x nf (,) : ( n,) : x p x ip x np Type of ata in clustering analysis Interval-value variables Interval-scale variables Continuous measurements (weight, temperature, ) Binary variables Variables with states (on/off, yes/no) Nominal variables A generalization of the binary variable in that it can take more than states (color/re,yellow,blue,green) Orinal ranking is important (e.g. meals(gol,silver,bronze)) Ratio variables Sometimes we nee to stanarize the ata Calculate the mean absolute eviation: where s = ( x f n m + x n m = (x + x + + f f f m + + x Calculate the stanarize measurement (z-score) x m if f z = if s x nf ). m ) f f f f nf f f a positive measurement on a nonlinear scale (growth) Variables of mixe types Distances between objects Distances are normally use to measure the similarity or issimilarity between two ata objects Properties (i,j) (i,i) = (i,j) = (j,i) (i,j) (i,k) + (k,j) Eucliean istance: ( i, j) = ( x x + x x + + x x ) i j i j ip jp Manhattan istance: Distances between objects ( i, j) = x i x x x x j + x i j + + i p jp where i = (x i, x i,, x ip ) an j = (x j, x j,, x jp ) are two p-imensional ata objects,

3 Distances between objects Minkowski istance: = i q q q ( i, j) q ( x x x x x x ) p p q is a positive integer If q =, is Manhattan istance If q =, is Eucliean istance j + i j + + i j Binary Variables symmetric binary variables: both states are equally important; / asymmetric binary variables: one state is more important than the other (e.g. outcome of isease test); is the important state, the other Contingency tables for Binary Variables Object j sum a b a+ b Object i c c+ sum a + c b+ p Distance measure for symmetric binary variables Object i Object j a c b sum a + c b+ sum a+ b c+ p a: number of attributes having for object i an for object j b: number of attributes having for object i an for object j c: number of attributes having for object i an for object j : number of attributes having for object i an for object j p = a+b+c+ ( i, j) = b + c a + b + c + Distance measure for asymmetric binary variables Object i Object j a c b sum a + c b+ Jaccar coefficient = - (i,j) = sum a+ b c+ p ( i, j ) = b c a + + b + c sim a a Jaccar ( i, j) = + b + c Dissimilarity between Binary Variables Example Name Gener Fever Cough Test- Test- Test- Test- Jack M Y N P N N N Mary F Y N P N P N Jim M Y P N N N N gener is a symmetric attribute the remaining attributes are asymmetric binary istance base on these let the values Y an P be set to, an the value N be set to ( jack, mary ) = + =. + + ( jack +, jim ) = =. + + ( jim, mary + ) = =. + +

4 Nominal or Categorical Variables Orinal Variables Metho : Simple matching m: # of matches, p: total # of variables ( i, j) = p p m Metho : use a large number of binary variables creating a new binary variable for each of the M nominal states An orinal variable can be iscrete or continuous Orer is important, e.g., rank Can be treate like interval-scale replace x if by their rank map the range of each variable onto [, ] by replacing i-th object in the f-th variable by r if z = if M compute the issimilarity using methos for intervalscale variables f r {,, M } if f Ratio-Scale Variables Variables of Mixe Types Ratio-scale variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae Bt or Ae -Bt Methos: treat them like interval-scale variables not a goo choice! (why? the scale can be istorte) apply logarithmic transformation y if = log(x if ) treat them as continuous orinal ata, treat their rank as interval-scale A atabase may contain all the six types of variables symmetric binary, asymmetric binary, nominal, orinal, interval an ratio One may use a weighte formula to combine their effects: p ( f ) ( f ) ( Σ δ f = ij ij i, j ) = p ( ) Σ δ f f is binary or nominal: f = ij (f) = if x = x, or ij if jf ij(f) = otherwise f is interval-base: use the normalize istance f is orinal or ratio-scale compute ranks r if an an treat z if as interval-scale z = if rif M f elta(i,j) = iff (i) x-value is missing or (ii) x-values are an f asymmetric binary attribute Vector Objects Vector moel for information retrieval (simplifie) Vector objects: keywors in ocuments, gene features in micro-arrays, etc. Broa applications: information retrieval, biologic taxonomy, etc. Cosine measure cloning arenergic Doc (,,) Doc (,,) Q (,,) receptor sim(,q) =. q x q

5 Chapter. Cluster Analysis. What is Cluster Analysis?. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methos. Partitioning Methos. Hierarchical Methos. Density-Base Methos Major Clustering Approaches (I) Partitioning approach: Construct various partitions an then evaluate them by some criterion, e.g., minimizing the sum of square errors Typical methos: k-means, k-meois, CLARANS Hierarchical approach: Create a hierarchical ecomposition of the set of ata (or objects) using some criterion Typical methos: Diana, Agnes, BIRCH, ROCK, CAMELEON Density-base approach: Base on connectivity an ensity functions Typical methos: DBSCAN, OPTICS, DenClue Major Clustering Approaches (II) Gri-base approach: base on a multiple-level granularity structure Typical methos: STING, WaveCluster, CLIQUE Moel-base: A moel is hypothesize for each of the clusters an tries to fin the best fit of that moel to each other Typical methos: EM, SOM, COBWEB Frequent pattern-base: Base on the analysis of frequent patterns Typical methos: pcluster User-guie or constraint-base: Clustering by consiering user-specifie or application-specific constraints Typical methos: COD (obstacles), constraine clustering Typical Alternatives to Calculate the Distance between Clusters Single link: smallest istance between an element in one cluster an an element in the other, i.e., is(k i, K j ) = min(t ip, t jq ) Complete link: largest istance between an element in one cluster an an element in the other, i.e., is(k i, K j ) = max(t ip, t jq ) Average: avg istance between an element in one cluster an an element in the other, i.e., is(k i, K j ) = avg(t ip, t jq ) Centroi: istance between the centrois of two clusters, i.e., is(k i, K j ) = is(c i, C j ) Meoi: istance between the meois of two clusters, i.e., is(k i, K j ) = is(m i, M j ) Meoi: one chosen, centrally locate object in the cluster Centroi, Raius an Diameter of a Cluster (for numerical ata sets) Centroi: the mile of a cluster N ( tip ) C i m = = N Raius: square root of average istance from any point of the cluster to its centroi R m = Diameter: square root of average mean square istance between all pairs of points in the cluster Σ Σ N ( t c i ip m ) = N Σ N Σ N ( t t ) D = i = i = ip iq m N( N ) Chapter. Cluster Analysis. What is Cluster Analysis?. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methos. Partitioning Methos. Hierarchical Methos. Density-Base Methos

6 Partitioning Algorithms: Basic Concept The K-Means Clustering Metho Partitioning metho: Construct a partition of a atabase D of n objects into a set of k clusters, s.t., min sum of square istance k Σm= Σt mi Km( Cm tmi) Given a k, fin a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methos: k-means an k-meois algorithms k-means (MacQueen ): Each cluster is represente by the center of the cluster k-meois or PAM (Partition aroun meois) (Kaufman & Rousseeuw ): Each cluster is represente by one of the objects in the cluster Given k, ata D. arbitrarily choose k objects as initial cluster centers. Repeat Assign each object to the cluster to which the object is most similar base on mean values of the objects in the cluster Upate cluster means (calculate mean value of the objects for each cluster) Until no change The K-Means Clustering Metho Comments on the K-Means Metho Example K= Arbitrarily choose K objects as initial cluster centers Assign each object to most similar center reassign Upate the cluster means Upate the cluster means reassign Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, an t is # iterations. Normally, k, t << n. Comparing: PAM: O(k(n-k) ), CLARA: O(ks + k(n-k)) Comment: Often terminates at a local optimum. The global optimum may be foun using techniques such as: eterministic annealing an genetic algorithms Weakness Applicable only when mean is efine, then what about categorical ata? Nee to specify k, the number of clusters, in avance Unable to hanle noisy ata an outliers Not suitable to iscover clusters with non-convex shapes What Is the Problem of the K-Means Metho? The K-Meois Clustering Metho The k-means algorithm is sensitive to outliers! Since an object with an extremely large value may substantially istort the istribution of the ata. K-Meois: Instea of taking the mean value of the object in a cluster as a reference point, meois can be use, which is the most centrally locate object in a cluster. Fin representative objects, calle meois, in clusters PAM (Partitioning Aroun Meois, ) starts from an initial set of meois an iteratively replaces one of the meois by one of the non-meois if it improves the total istance of the resulting clustering PAM works effectively for small ata sets, but oes not scale well for large ata sets CLARA (Kaufmann & Rousseeuw, ) CLARANS (Ng & Han, ): Ranomize sampling

7 A Typical K-Meois Algorithm (PAM) - iea PAM (Partitioning Aroun Meois) () K= Do loop Until no change Arbitrary choose k objects as initial meois Swapping O an O ramom If quality is improve. Total Cost = Assign each remaining object to nearest meois Compute total cost of swapping Total Cost = Ranomly select a nonmeoi object,o ramom PAM (Kaufman an Rousseeuw, ) Algorithm Select k representative objects arbitrarily For each pair of non-selecte object h an selecte object i, calculate the total swapping cost TC ih Select a pair i an h, which correspons to the minimum swapping cost If TC ih <, i is replace by h Then assign each non-selecte object to the most similar representative object repeat steps - until there is no change PAM Clustering: Total swapping cost TC ih = j C jih What Is the Problem with PAM? t j i h C jih = (j, h) - (j, i) j h i t t C jih = i j h i h t j Pam is more robust than k-means in the presence of noise an outliers because a meoi is less influence by outliers or other extreme values than a mean Pam works efficiently for small ata sets but oes not scale well for large ata sets. O(k(n-k) ) for each iteration where n is # of ata,k is # of clusters Sampling base metho, CLARA(Clustering LARge Applications) Cjih = (j, t) - (j, i) C jih = (j, h) - (j, t) CLARA (Clustering Large Applications) () CLARA (Clustering Large Applications) () CLARA (Kaufmann an Rousseeuw in ) Built in statistical analysis packages, such as S+ It raws multiple samples of the ata set, applies PAM on each sample, an gives the best clustering as the output Strength: eals with larger ata sets than PAM Weakness: Efficiency epens on the sample size A goo clustering base on samples will not necessarily represent a goo clustering of the whole ata set if the sample is biase Algorithm (n=, s = +k) Repeat n times: Draw sample of s objects from the entire ata set an perform PAM to fin k meios of the sample assign each non-selecte object in the entire ata set to the most similar meio Calculate average issimilarity of the clustering. If the value is smaller than the current minimum, use this value as current minimum an retain the k meois as best so far.

8 Graph abstraction Graph Abstraction Noe represents k objects (meois), a potential solution for the clustering. Noes are neighbors if the sets of objects iffer by one object. Each noe has k(n-k) neighbors. Cost ifferential between two neighbors is TCih (with Oi an Oh are the iffering noes in the meio sets) PAM searches for noe in the graph with minimum cost CLARA searches in smaller graphs (as it uses PAM on samples of the entire ata set) CLARANS Searches in the original graph Searches part of the graph Uses the neighbors to guie the search CLARANS ( Ranomize CLARA) () CLARANS ( Ranomize CLARA) () CLARANS (A Clustering Algorithm base on Ranomize Search) (Ng an Han ) It is more efficient an scalable than both PAM an CLARA Algorithm Numlocal: number of local minima to be foun Maxneighbor: maximum number of neighbors to compare Repeat numlocal times: (fin local minimum) Take arbitrary noe in the graph Consier ranom neighbor S of the current noe an calculate the cost ifferential. If S has lower cost, then set S to current an repeat this step. If S oes not have lower cost, repeat this step (check at most Maxneighbor neighbors) Compare the cost of current noe with minimum cost so far. If the cost is lower, set minumum cost to cost of the current noe, an bestnoe to the current noe. Chapter. Cluster Analysis Hierarchical Clustering. What is Cluster Analysis?. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methos. Partitioning Methos. Hierarchical Methos. Density-Base Methos Use istance matrix as clustering criteria. This metho oes not require the number of clusters k as an input, but nees a termination conition Step Step Step Step Step a b c e a b e c e a b c e Step Step Step Step Step agglomerative (AGNES) ivisive (DIANA)

9 Complete-link Clustering Example Complete-link Clustering Example (,) (,) (,) (,) (,) (,) (,) (,) (,), (,), (,), = max{, = max{ = max{,,,,,,, } = max{,} =, } = max{,} = } = max{,} = (,),(,),(,) = max{ = max{, (,),, } = max{,} =, } = max{,} =, (,), Complete-link Clustering Example AGNES (Agglomerative Nesting) (,,),(,) = max{ (,),(,),,(, )} = (,) (,) th= th= (,) (,) (,) (,) Introuce in Kaufmann an Rousseeuw () Implemente in statistical analysis packages, e.g., Splus Use the Single-Link metho an the issimilarity matrix. Merge noes that have the least issimilarity Go on in a non-escening fashion Eventually all noes belong to the same cluster AGNES (Agglomerative Nesting) Denrogram: Shows How the Clusters are Merge Decompose ata objects into a several levels of neste partitioning (tree of clusters), calle a enrogram. A clustering of the ata objects is obtaine by cutting the enrogram at the esire level, then each connecte component forms a cluster.

10 DIANA (Divisive Analysis) Recent Hierarchical Clustering Methos Introuce in Kaufmann an Rousseeuw () Implemente in statistical analysis packages, e.g., Splus Inverse orer of AGNES Eventually each noe forms a cluster on its own Major weakness of agglomerative clustering methos o not scale well: time complexity of at least O(n ), where n is the number of total objects can never uno what was one previously Integration of hierarchical with istance-base clustering BIRCH (): uses CF-tree an incrementally ajusts the quality of sub-clusters ROCK (): clustering categorical ata by neighbor an link analysis CHAMELEON (): hierarchical clustering using ynamic moeling BIRCH () Clustering Feature Vector in BIRCH Birch: Balance Iterative Reucing an Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD ) Incrementally construct a CF (Clustering Feature) tree, a hierarchical ata structure for multiphase clustering Phase : scan DB to buil an initial in-memory CF tree (a multi-level compression of the ata that tries to preserve the inherent clustering structure of the ata) Phase : use an arbitrary clustering algorithm to cluster the leaf noes of the CF-tree Scales linearly: fins a goo clustering with a single scan an improves the quality with a few aitional scans Weakness: hanles only numeric ata, an sensitive to the orer of the ata recor, not always natural clusters. Clustering Feature: CF = (N, LS, SS) N: Number of ata points LS: N i==x i SS: N i==x i CF = (, (,),(,)) (,) (,) (,) (,) (,) CF-Tree in BIRCH Clustering feature: summary of the statistics for a given subcluster: the -th, st an n moments of the subcluster from the statistical point of view. registers crucial measurements for computing cluster an utilizes storage efficiently A CF tree is a height-balance tree that stores the clustering features for a hierarchical clustering A nonleaf noe in a tree has chilren an stores the sums of the CFs of their chilren A nonleaf noe represents a cluster mae of the subclusters represente by its chilren A leaf noe represents a cluster mae of the subclusters represente by its entries A CF tree has two parameters Branching factor: specify the maximum number of chilren. threshol: max iameter of sub-clusters store at the leaf noes The CF Tree Structure Root B = CF CF CF CF T = chil chil chil chil Non-leaf noe CF chil CF CF chil chil CF chil Leaf noe Leaf noe prev CF a CF b CF k next prev CF l CF m CF q next

11 ROCK: Clustering Categorical Data ROCK: RObust Clustering using links S. Guha, R. Rastogi & K. Shim, ICDE Major ieas Use links to measure similarity/proximity maximize the sum of the number of links between points within a cluster, minimize the sum of the number of links for points in ifferent clusters Computational complexity: On ( + nmm + nlog n) m a Similarity Measure in ROCK Traitional measures for categorical ata may not work well, e.g., Jaccar coefficient Example: Two groups (clusters) of transactions C. <a, b, c,, e>: {a, b, c}, {a, b, }, {a, b, e}, {a, c, }, {a, c, e}, {a,, e}, {b, c, }, {b, c, e}, {b,, e}, {c,, e} C. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g} Jaccar coefficient may lea to wrong clustering result C :. ({a, b, c}, {b,, e}} to. ({a, b, c}, {a, b, }) C & C : coul be as high as. ({a, b, c}, {a, b, f}) Jaccar coefficient-base similarity function: T T Sim( T, T) = T T Ex. Let T = {a, b, c}, T = {c,, e} { c } Sim ( T, T ) = = { a, b, c,, e } =. Link Measure in ROCK Link Measure in ROCK Neighbor: p an p are neighbors iff sim(p,p) >= t (sim an t between an ) Link(pi,pj) is the number of common neighbors between pi an pj Links: # of common neighbors C <a, b, c,, e>: {a, b, c}, {a, b, }, {a, b, e}, {a, c, }, {a, c, e}, {a,, e}, {b, c, }, {b, c, e}, {b,, e}, {c,, e} C <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g} Let T = {a, b, c}, T = {c,, e}, T = {a, b, f} an sim the Jaccar coefficient similarity an t=. link(t, T ) =, since they have common neighbors {a, c, }, {a, c, e}, {b, c, }, {b, c, e} link(t, T ) =, since they have common neighbors {a, b, }, {a, b, e}, {a, b, g}, {a, b, c}, {a, b, f} Link Measure in ROCK Link(Ci,Cj) = the number of cross links between clusters Ci an Cj G(Ci,Cj) = gooness measure for merging Ci an Cj = Link(Ci,Cj) ivie by the expecte number of cross links The ROCK Algorithm Algorithm: sampling-base clustering Draw ranom sample Hierarchical clustering with links using gooness measure of merging Label ata in isk: a point is assigne to the cluster for which it has the most neighbors after normalization

12 CHAMELEON: Hierarchical Clustering Using Dynamic Moeling () Overall Framework of CHAMELEON CHAMELEON: by G. Karypis, E.H. Han, an V. Kumar Measures the similarity base on a ynamic moel Two clusters are merge only if the interconnectivity an closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters an closeness of items within the clusters A two-phase algorithm. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters. Use an agglomerative hierarchical clustering algorithm: fin the genuine clusters by repeately combining these sub-clusters Data Set Construct Sparse Graph Final Clusters Partition the Graph Merge Partition CHAMELEON: Hierarchical Clustering Using Dynamic Moeling () CHAMELEON: Hierarchical Clustering Using Dynamic Moeling () A two-phase algorithm. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters Base on k-nearest neighbor graph Ege between two noes if points corresponing to either of the noes are among the k-most similar points of the point corresponing to the other noe Ege weight is ensity of the region Dynamic notion of neighborhoo: in regions with high ensity, a neighborhoo raius is small, while in sparse regions the neighborhoo raius is large. A two-phase algorithm.. Use an agglomerative hierarchical clustering algorithm: fin the genuine clusters by repeately combining these sub-clusters Interconnectivity between clusters Ci an Cj: normalize sum of the weights of the eges that connect noes in Ci an Cj Closeness of clusters Ci an Cj: average similarity between points in Ci that are connecte to points in Cj Merge if both measures are above user-efine threshols CHAMELEON (Clustering Complex Objects) Chapter. Cluster Analysis. What is Cluster Analysis?. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methos. Partitioning Methos. Hierarchical Methos. Density-Base Methos

13 Density-Base Clustering Methos Density-Base Clustering: Basic Concepts Clustering base on ensity (local cluster criterion), such as ensity-connecte points Major features: Discover clusters of arbitrary shape Hanle noise One scan Nee ensity parameters as termination conition Several interesting stuies: DBSCAN: Ester, et al. (KDD ) OPTICS: Ankerst, et al (SIGMOD ). DENCLUE: Hinneburg & D. Keim (KDD ) CLIQUE: Agrawal, et al. (SIGMOD ) (more gri-base) Two parameters: Eps: Maximum raius of the neighborhoo MinPts: Minimum number of points in an Epsneighborhoo of that point N Eps (p): {q belongs to D ist(p,q) <= Eps} Directly ensity-reachable: A point p is irectly ensityreachable from a point q w.r.t. Eps, MinPts if p belongs to N Eps (q) core point conition: N Eps (q) >= MinPts p q MinPts = Eps = cm Density-Reachable an Density-Connecte DBSCAN: Density Base Spatial Clustering of Applications with Noise Density-reachable: A point p is ensity-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p,, p n, p = q, p n = p such that p i+ is irectly ensity-reachable from p i q p p Relies on a ensity-base notion of cluster: A cluster is efine as a maximal set of ensity-connecte points Discovers clusters of arbitrary shape in spatial atabases with noise Outlier Density-connecte A point p is ensity-connecte to a point q w.r.t. Eps, MinPts if there is a point o such that both, p an q are ensity-reachable from o w.r.t. Eps an MinPts p o q Borer Core Eps = cm MinPts = DBSCAN: The Algorithm DBSCAN: Sensitive to Parameters Arbitrary select a point p Retrieve all points ensity-reachable from p w.r.t. Eps an MinPts. If p is a core point, a cluster is forme. If p is a borer point, no points are ensity-reachable from p an DBSCAN visits the next point of the atabase. Continue the process until all of the points have been processe.

14 CHAMELEON (Clustering Complex Objects) OPTICS: A Cluster-Orering Metho () OPTICS: Orering Points To Ientify the Clustering Structure Ankerst, Breunig, Kriegel, an Saner (SIGMOD ) Prouces a special orer of the atabase wrt its ensity-base clustering structure This cluster-orering contains info equivalent to the ensity-base clusterings corresponing to a broa range of parameter settings Goo for both automatic an interactive cluster analysis, incluing fining intrinsic clustering structure Can be represente graphically or using visualization techniques OPTICS: Some Extension from DBSCAN OPTICS Core Distance of p wrt MinPts: smallest istance eps between p an an object in its eps-neighborhoo such that p woul be a core object for eps an MinPts. Otherwise, unefine. Reachability Distance of p wrt o: Max (core-istance (o), (o, p)) if o is core object. Unefine otherwise p Max (core-istance (o), (o, p)) r(p, o) =.cm. r(p,o) = cm p o MinPts = ε = cm () Select non-processe object o () Fin neighbors (eps-neighborhoo) Compute core istance for o Write object o to orere file an mark o as processe If o is not a core object, restart at () (o is a core object ) Put neighbors of o in Seelist an orer If neighbor n is not yet in SeeList then a (n, reachability from o) else if reachability from o < current reachability, then upate reachability + orer SeeList wrt reachability Take new object from Seelist with smallest reachability an restart at () Reachability -istance unefine ε ε ε DENCLUE: Using Statistical Density Functions DENsity-base CLUstEring by Hinneburg & Keim (KDD ) Using statistical ensity functions Major features Soli mathematical founation Goo for ata sets with large amounts of noise Allows a compact mathematical escription of arbitrarily shape clusters in high-imensional ata sets Significant faster than existing algorithm (e.g., DBSCAN) But nees a large number of parameters Cluster-orer of the objects

15 Denclue: Technical Essence Density Attractor Uses gri cells but only keeps information about gri cells that o actually contain ata points an manages these cells in a tree-base access structure Influence function: escribes the impact of a ata point within its neighborhoo xy (, ) f (,) x y = e σ Gaussian Overall ensity of the ata space can be calculate as the sum of the influence function of all ata points ( x, x i ) N D σ f Gaussian ( x ) = i = e Clusters can be etermine mathematically by ientifying ensity attractors. Density attractors are local maxima of the overall ensity function ( x, xi ) N D σ fgaussian( x, xi ) = = ( x i i x) e Denclue: Technical Essence Center-Define an Arbitrary Significant ensity attractor for threshol k: ensity attractor with ensity larger than or equal to k Set of significant ensity attractors X for threshol k: for each pair of ensity attractors x, x in X there is a path from x to x such that each point on the path has ensity larger than or equal to k Center-efine cluster for a significant ensity attractor x for threshol k: points that are ensity attracte by x Points that are attracte to a ensity attractor with ensity less than k are calle outliers Arbitrary-shape cluster for a set of significant ensity attractors X for threshol k: points that are ensity attracte to some ensity attractor in X

Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science

Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science Data Mining: Concepts and Techniques Chapter 7 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 6 Jiawei Han and Micheline Kamber, All rights reserved

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015 // What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

More information

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other

More information

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Cluster Analysis. Outline. Motivation. Examples Applications. Han and Kamber, ch 8

Cluster Analysis. Outline. Motivation. Examples Applications. Han and Kamber, ch 8 Outline Cluster Analysis Han and Kamber, ch Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Methods CS by Rattikorn Hewett Texas Tech University Motivation

More information

Lezione 21 CLUSTER ANALYSIS

Lezione 21 CLUSTER ANALYSIS Lezione 21 CLUSTER ANALYSIS 1 Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Density-Based

More information

CS Data Mining Techniques Instructor: Abdullah Mueen

CS Data Mining Techniques Instructor: Abdullah Mueen CS 591.03 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 6: BASIC CLUSTERING Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical

More information

Lecture 3 Clustering. January 21, 2003 Data Mining: Concepts and Techniques 1

Lecture 3 Clustering. January 21, 2003 Data Mining: Concepts and Techniques 1 Lecture 3 Clustering January 21, 2003 Data Mining: Concepts and Techniques 1 What is Cluster Analysis? Cluster: a collection of data objects High intra-class similarity Low inter-class similarity How to

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

CS490D: Introduction to Data Mining Prof. Chris Clifton. Cluster Analysis

CS490D: Introduction to Data Mining Prof. Chris Clifton. Cluster Analysis CS490D: Introduction to Data Mining Prof. Chris Clifton February, 004 Clustering Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Vector Data: Clustering: Part I Instructor: Yizhou Sun yzsun@cs.ucla.edu April 26, 2017 Methods to Learn Classification Clustering Vector Data Text Data Recommender System Decision

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 30, 2013 Announcement Homework 1 due next Monday (10/14) Course project proposal due next

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei 2012 Han, Kamber & Pei. All rights

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)......

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)...... Data Mining i Topic: Clustering CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Cluster Analysis What is Cluster Analysis? Types

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Chapter 10: Cluster Analysis: Basic Concepts and Methods Instructor: Yizhou Sun yzsun@ccs.neu.edu April 2, 2013 Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Cluster Analysis Reading: Chapter 10.4, 10.6, 11.1.3 Han, Chapter 8.4,8.5,9.2.2, 9.3 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber &

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability Preview Lecture Clustering! Introduction! Partitioning methods! Hierarchical methods! Model-based methods! Densit-based methods What is Clustering?! Cluster: a collection of data objects! Similar to one

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu February 10, 2016 Announcements Homework 1 grades out Re-grading policy: If you have doubts in your

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 10

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 10 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 10 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2 Clustering Community Detection http://image.slidesharecdn.com/communitydetectionitilecturejune0-0609559-phpapp0/95/community-detection-in-social-media--78.jpg?cb=3087368 Jian Pei: CMPT 74/459 Clustering

More information

Hierarchy. No or very little supervision Some heuristic quality guidances on the quality of the hierarchy. Jian Pei: CMPT 459/741 Clustering (2) 1

Hierarchy. No or very little supervision Some heuristic quality guidances on the quality of the hierarchy. Jian Pei: CMPT 459/741 Clustering (2) 1 Hierarchy An arrangement or classification of things according to inclusiveness A natural way of abstraction, summarization, compression, and simplification for understanding Typical setting: organize

More information

Unsupervised Learning Hierarchical Methods

Unsupervised Learning Hierarchical Methods Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Clustering Expression Data. Clustering Expression Data

Clustering Expression Data. Clustering Expression Data www.cs.washington.eu/ Subscribe if you Din t get msg last night Clustering Exression Data Why cluster gene exression ata? Tissue classification Fin biologically relate genes First ste in inferring regulatory

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

DBSCAN. Presented by: Garrett Poppe

DBSCAN. Presented by: Garrett Poppe DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Clustering Analysis Basics

Clustering Analysis Basics Clustering Analysis Basics Ke Chen Reading: [Ch. 7, EA], [5., KPM] Outline Introduction Data Types and Representations Distance Measures Major Clustering Methodologies Summary Introduction Cluster: A collection/group

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

A Comparative Study of Various Clustering Algorithms in Data Mining

A Comparative Study of Various Clustering Algorithms in Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Metodologie per Sistemi Intelligenti. Clustering. Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano Leonardo

Metodologie per Sistemi Intelligenti. Clustering. Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano Leonardo Metodologie per Sistemi Intelligenti Clustering Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano Leonardo Outline What is clustering? What are the major issues?

More information

Image Segmentation using K-means clustering and Thresholding

Image Segmentation using K-means clustering and Thresholding Image Segmentation using Kmeans clustering an Thresholing Preeti Panwar 1, Girhar Gopal 2, Rakesh Kumar 3 1M.Tech Stuent, Department of Computer Science & Applications, Kurukshetra University, Kurukshetra,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Course Content. What is Classification? Chapter 6 Objectives

Course Content. What is Classification? Chapter 6 Objectives Principles of Knowledge Discovery in Data Fall 007 Chapter 6: Data Clustering Dr. Osmar R. Zaïane University of Alberta Course Content Introduction to Data Mining Association Analysis Sequential Pattern

More information

Analysis and Extensions of Popular Clustering Algorithms

Analysis and Extensions of Popular Clustering Algorithms Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

Efficient and Effective Clustering Methods for Spatial Data Mining. Raymond T. Ng, Jiawei Han

Efficient and Effective Clustering Methods for Spatial Data Mining. Raymond T. Ng, Jiawei Han Efficient and Effective Clustering Methods for Spatial Data Mining Raymond T. Ng, Jiawei Han 1 Overview Spatial Data Mining Clustering techniques CLARANS Spatial and Non-Spatial dominant CLARANS Observations

More information

Course Content. Classification = Learning a Model. What is Classification?

Course Content. Classification = Learning a Model. What is Classification? Lecture 6 Week 0 (May ) and Week (May 9) 459-0 Principles of Knowledge Discovery in Data Clustering Analysis: Agglomerative,, and other approaches Lecture by: Dr. Osmar R. Zaïane Course Content Introduction

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

CS 412 Intro. to Data Mining Chapter 10. Cluster Analysis: Basic Concepts and Methods

CS 412 Intro. to Data Mining Chapter 10. Cluster Analysis: Basic Concepts and Methods CS 412 Intro. to Data Mining Chapter 10. Cluster Analysis: Basic Concepts and Methods Jiawei Han, Computer Science, Univ. Illinois at Urbana -Champaign, 2017 1 2 Chapter 10. Cluster Analysis: Basic Concepts

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.5 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction DBSCAN Algorithm OPTICS Algorithm DENCLUE Algorithm References Outline Introduction Introduction Density-based

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

CS412 Homework #3 Answer Set

CS412 Homework #3 Answer Set CS41 Homework #3 Answer Set December 1, 006 Q1. (6 points) (1) (3 points) Suppose that a transaction datase DB is partitioned into DB 1,..., DB p. The outline of a distributed algorithm is as follows.

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

New Version of Davies-Bouldin Index for Clustering Validation Based on Cylindrical Distance

New Version of Davies-Bouldin Index for Clustering Validation Based on Cylindrical Distance New Version of Davies-Boulin Inex for lustering Valiation Base on ylinrical Distance Juan arlos Roas Thomas Faculta e Informática Universia omplutense e Mari Mari, España correoroas@gmail.com Abstract

More information

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES 7.1. Abstract Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of

More information

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control Almost Disjunct Coes in Large Scale Multihop Wireless Network Meia Access Control D. Charles Engelhart Anan Sivasubramaniam Penn. State University University Park PA 682 engelhar,anan @cse.psu.eu Abstract

More information

A PSO Optimized Layered Approach for Parametric Clustering on Weather Dataset

A PSO Optimized Layered Approach for Parametric Clustering on Weather Dataset Vol.3, Issue.1, Jan-Feb. 013 pp-504-508 ISSN: 49-6645 A PSO Optimize Layere Approach for Parametric Clustering on Weather Dataset Shikha Verma, 1 Kiran Jyoti 1 Stuent, Guru Nanak Dev Engineering College

More information

A Novel Density Based Clustering Algorithm by Incorporating Mahalanobis Distance

A Novel Density Based Clustering Algorithm by Incorporating Mahalanobis Distance Receive: November 20, 2017 121 A Novel Density Base Clustering Algorithm by Incorporating Mahalanobis Distance Margaret Sangeetha 1 * Velumani Paikkaramu 2 Rajakumar Thankappan Chellan 3 1 Department of

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Clustering: Model, Grid, and Constraintbased Methods Reading: Chapters 10.5, 11.1 Han, Chapter 9.2 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han,

More information

7.1 Euclidean and Manhattan distances between two objects The k-means partitioning algorithm... 21

7.1 Euclidean and Manhattan distances between two objects The k-means partitioning algorithm... 21 Contents 7 Cluster Analysis 7 7.1 What Is Cluster Analysis?......................................... 7 7.2 Types of Data in Cluster Analysis.................................... 9 7.2.1 Interval-Scaled

More information

Non-homogeneous Generalization in Privacy Preserving Data Publishing

Non-homogeneous Generalization in Privacy Preserving Data Publishing Non-homogeneous Generalization in Privacy Preserving Data Publishing W. K. Wong, Nios Mamoulis an Davi W. Cheung Department of Computer Science, The University of Hong Kong Pofulam Roa, Hong Kong {wwong2,nios,cheung}@cs.hu.h

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

On Clustering Validation Techniques

On Clustering Validation Techniques On Clustering Validation Techniques Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis Department of Informatics, Athens University of Economics & Business, Patision 76, 0434, Athens, Greece (Hellas)

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Descriptive Modeling. Based in part on Chapter 9 of Hand, Manilla, & Smyth And Section 14.3 of HTF David Madigan

Descriptive Modeling. Based in part on Chapter 9 of Hand, Manilla, & Smyth And Section 14.3 of HTF David Madigan Descriptive Modeling Based in part on Chapter 9 of Hand, Manilla, & Smyth And Section 14.3 of HTF David Madigan Data Mining Algorithms A data mining algorithm is a well-defined procedure that takes data

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

6 Gradient Descent. 6.1 Functions

6 Gradient Descent. 6.1 Functions 6 Graient Descent In this topic we will iscuss optimizing over general functions f. Typically the function is efine f : R! R; that is its omain is multi-imensional (in this case -imensional) an output

More information

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means Classifying Facial Expression with Raial Basis Function Networks, using Graient Descent an K-means Neil Allrin Department of Computer Science University of California, San Diego La Jolla, CA 9237 nallrin@cs.ucs.eu

More information

Clustering Algorithms for Spatial Databases: A Survey

Clustering Algorithms for Spatial Databases: A Survey Clustering Algorithms for Spatial Databases: A Survey Erica Kolatch Department of Computer Science University of Maryland, College Park CMSC 725 3/25/01 kolatch@cs.umd.edu 1. Introduction Spatial Database

More information

Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis

Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis Multilevel Linear Dimensionality Reuction using Hypergraphs for Data Analysis Haw-ren Fang Department of Computer Science an Engineering University of Minnesota; Minneapolis, MN 55455 hrfang@csumneu ABSTRACT

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering

More information

Data Mining: Clustering

Data Mining: Clustering Bi-Clustering COMP 790-90 Seminar Spring 011 Data Mining: Clustering k t 1 K-means clustering minimizes Where ist ( x, c i t i c t ) ist ( x m j 1 ( x ij i, c c t ) tj ) Clustering by Pattern Similarity

More information

d 3 d 4 d d d d d d d d d d d 1 d d d d d d

d 3 d 4 d d d d d d d d d d d 1 d d d d d d Proceeings of the IASTED International Conference Software Engineering an Applications (SEA') October 6-, 1, Scottsale, Arizona, USA AN OBJECT-ORIENTED APPROACH FOR MANAGING A NETWORK OF DATABASES Shu-Ching

More information

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 CLUSTERING CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 1. K-medoids: REFERENCES https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf

More information

A Technical Insight into Clustering Algorithms & Applications

A Technical Insight into Clustering Algorithms & Applications A Technical Insight into Clustering Algorithms & Applications Nandita Yambem 1, and Dr A.N.Nandakumar 2 1 Research Scholar,Department of CSE, Jain University,Bangalore, India 2 Professor,Department of

More information

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ GeneChip 2011/11/29 EECS 730 2 Hybridization to the Chip 2011/11/29

More information

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9 Big Data Analytics! Special Topics for Computer Science CSE 4095-001 CSE 5095-005! Feb 9 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Clustering I What

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

A Neural Network Model Based on Graph Matching and Annealing :Application to Hand-Written Digits Recognition

A Neural Network Model Based on Graph Matching and Annealing :Application to Hand-Written Digits Recognition ITERATIOAL JOURAL OF MATHEMATICS AD COMPUTERS I SIMULATIO A eural etwork Moel Base on Graph Matching an Annealing :Application to Han-Written Digits Recognition Kyunghee Lee Abstract We present a neural

More information

Lecture 1 September 4, 2013

Lecture 1 September 4, 2013 CS 84r: Incentives an Information in Networks Fall 013 Prof. Yaron Singer Lecture 1 September 4, 013 Scribe: Bo Waggoner 1 Overview In this course we will try to evelop a mathematical unerstaning for the

More information

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources An Algorithm for Builing an Enterprise Network Topology Using Wiesprea Data Sources Anton Anreev, Iurii Bogoiavlenskii Petrozavosk State University Petrozavosk, Russia {anreev, ybgv}@cs.petrsu.ru Abstract

More information