
1 DATA CLUSTERING SATU VIRTANEN T Seminar on String Algorithms

2 OUTLINE Introduction General clustering methods Clustering in metric spaces Clustering string data Clustering in graphs Concluding remarks

3 CLUSTERING Practical applications of information processing involve massive data sets. Only a small fraction of the data contains semantically interesting information. Clustering = process of organizing properly represented data into meaningful groups, possibly ignoring noise, by interpreting which data points are in some sense connected.

4 A COMMON ABSTRACTION P = a set of n data points represented as d-dimensional coordinate vectors p = (p_1, p_2, ..., p_d) ∈ P. A cluster C ⊆ P is a set of points that are sufficiently close to each other w.r.t. some proximity measure.

5 SOME QUESTIONS TO BE SETTLED How close is close enough? Does such a threshold need to be fixed beforehand? How many clusters will emerge? Is the number of clusters fixed? How many points are needed to form a cluster? Will the clustering relation be symmetric?

6 THERE ARE NO CORRECT ANSWERS! One can define strict rules for what is a proper cluster and what is not, but these are bound to be application-specific. Issues worth considering [BYCHN03]: natural justification of the definition of a cluster; computational complexity of determining the clusters.

7 WHAT ALL GETS CLUSTERED? Most clustering algorithms will produce clusters regardless of the data, even for uniformly random data [JMF99]. Outliers = noise or erroneous data to be left outside of all clusters; recognizing these is often quite error-prone. For example, large holes in the ozone layer went unnoticed for a while as they were classified as outliers [BYCHN03].

8 METHODS FOR FINDING k CLUSTERS Some algorithms require the number of clusters k as a parameter. The user must either have some a priori information on the suitable number of clusters or iterate the algorithm several times with different numbers of clusters to find the most convincing clustering.

9 THE k-means CLUSTERING ALGORITHM Universe X in which the n data points p ∈ P are located. A set K ⊆ X of k points is chosen as cluster centers. Each p ∈ P is assigned to some c ∈ K, typically the nearest one, and the centers are then recomputed. This is iterated l times to minimize the total distance of the points to their assigned centers. Complexity: O(nkl) time and O(n + k) space.
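To make the iteration concrete, below is a minimal Python sketch of such a k-means loop; the function and variable names are illustrative, and the initial centers are simply sampled from the data.

```python
import random

def kmeans(points, k, iterations=10):
    """Minimal k-means for points given as equal-length coordinate tuples."""
    centers = random.sample(points, k)          # initial centers chosen from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                d = len(cluster[0])
                centers[i] = tuple(sum(p[j] for p in cluster) / len(cluster) for j in range(d))
    return centers, clusters

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
print(kmeans(pts, 2))
```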

10 THRESHOLD ALGORITHMS Many algorithms require as a parameter a value that determines the boundaries of the clusters. The threshold determines how close (in the sense of the defined proximity measure) two data points need to be in order to be classified into the same cluster. It may be absolute or relative. Different values of the threshold yield different clusterings.


13 WHY THRESHOLDS? In practice, an ideal algorithm would not require such a threshold but would rather determine it dynamically from the input data. In some application areas, naturally defined threshold values exist, and hence clustering algorithms that need them are also justified.

14 APPLICATIONS image processing: recognition of e.g. hand-written characters; image segmentation: reducing noise in images; genome comparison: recognition of e.g. proteins; data mining: extracting useful information from massive databases.

15 THE GENERAL APPROACH Usually a clustering algorithm produces clusters for all of the data points simultaneously, iteratively moving the clusters around until some fitness function is satisfied. If the data points are clustered sequentially such that the cluster of one will be completely decided before another is considered, the algorithm is called incremental.

16 CLASSIFICATION OF THE METHODS Based on the output of the algorithm: if it is a single absolute clustering, the method is partitional; if it is a hierarchy of possible clusterings, the method is hierarchical.

17 PARTITIONAL METHODS Produces a single collection C of clusters C_i ⊆ P such that ⋃_{C_i ∈ C} C_i = P. If C_i ∩ C_j = ∅ for all C_i, C_j ∈ C, i ≠ j, the clustering is crisp; if there exist C_i, C_j ∈ C, i ≠ j, such that C_i ∩ C_j ≠ ∅, the clustering is fuzzy. [BYCHN03]
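As a small illustration of the crisp/fuzzy distinction, the following hypothetical helper checks whether a given collection of clusters overlaps:

```python
def is_crisp(clusters):
    """True if no data point appears in more than one cluster (crisp partition),
    False if some clusters overlap (fuzzy cover)."""
    seen = set()
    for cluster in clusters:
        for p in cluster:
            if p in seen:
                return False
            seen.add(p)
    return True

print(is_crisp([{"a", "b"}, {"b", "c"}]))  # False: the clusters share b
print(is_crisp([{"a", "b"}, {"c"}]))       # True
```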

18 AN EXAMPLE: SQUARED ERROR APPROACH Calculate the centroid c_i of each C_i ∈ C, and then measure the total squared error of the data points p = (p_1, p_2, ..., p_d) ∈ P w.r.t. their centroids: E = ∑_{C_i ∈ C} ∑_{p_j ∈ C_i} ∑_{k=1}^{d} (c_i(k) − p_j(k))². Choose the partitioning C that minimizes E. Some limits on the set of feasible partitions need to be imposed, as assigning each data point to its own cluster yields E = 0.
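A direct translation of the squared-error criterion into Python might look as follows (illustrative names, points given as coordinate tuples):

```python
def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster of equal-length tuples."""
    d = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(d))

def squared_error(clusters):
    """Total squared error E of a partitioning, as defined above."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        for p in cluster:
            total += sum((c[k] - p[k]) ** 2 for k in range(len(p)))
    return total

# Two well-separated clusters on a line.
print(squared_error([[(0.0,), (1.0,)], [(10.0,), (11.0,)]]))  # 1.0
```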

19 HIERARCHICAL CLUSTERING Methods that produce several different levels of clustering, going from a rather coarse clustering to an excessively fine one. It depends on the application whether an absolute definition of a good cluster is at all possible. If not, the user must choose between the possible clusterings provided by a hierarchical method.

20 DENDROGRAMS

21 CLUSTERING IN METRIC SPACES There exist algorithms that produce a clustering based on a distance matrix only. This usually means that the data points lie in a metric space, which consists of a universe X together with a distance function d(u, v) ≥ 0 that is a metric. For a metric d defined on X, the following holds for all u, v, w ∈ X: it is symmetric, d(u, v) = d(v, u); it has zero reflexive distance, d(v, v) = 0; and the triangle inequality d(u, v) + d(v, w) ≥ d(u, w) is satisfied.
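The metric axioms can be checked mechanically on a finite sample; the sketch below is an illustrative helper, not part of any particular clustering library:

```python
from itertools import product

def is_metric_on(points, d, tol=1e-9):
    """Check the metric axioms (plus non-negativity) for a distance function d,
    restricted to the given finite sample of points."""
    for u, v, w in product(points, repeat=3):
        if d(u, v) < -tol or d(u, u) > tol:    # non-negativity, zero reflexive distance
            return False
        if abs(d(u, v) - d(v, u)) > tol:       # symmetry
            return False
        if d(u, v) + d(v, w) + tol < d(u, w):  # triangle inequality
            return False
    return True

print(is_metric_on([0, 1, 5], lambda a, b: abs(a - b)))      # True
print(is_metric_on([0, 1, 5], lambda a, b: (a - b) ** 2))    # False: squared distance violates the triangle inequality
```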

22 CLUSTERING IN METRIC SPACES Usually only a finite subset U ⊆ X is considered. A clustering is now a partition (or a cover) of U, and P is partitioned (or covered) accordingly. The partition should be such that the distance between points in the same cluster is as small as feasible and the distance between points in different clusters is as large as possible. For practical applications, the distance function d should be cheap to compute, especially for large U.

23 EXAMPLES OF METRIC CLUSTERINGS When a feasible metric d has been defined on X, the clusters of P may be formed by selecting the k nearest points of each data point to be its cluster. This is very likely to produce a fuzzy clustering, and the value of k should be reasoned out by the user. Another clustering results from iteratively connecting the points nearest to each other into the same initial cluster until every point has been connected to at least one other point. This produces a crisp clustering, possibly with just one cluster.

24 EUCLIDEAN DISTANCE A common choice for a metric in continuous spaces. For two d-dimensional vectors a = (a_1, a_2, ..., a_d) and b = (b_1, b_2, ..., b_d), it is d(a, b) = √(∑_{i=1}^{d} (a_i − b_i)²).
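For instance, the Euclidean distance and the distance matrix used by the distance-based methods above could be computed as follows (illustrative sketch):

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate vectors."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_matrix(points):
    """Full n-by-n matrix of pairwise Euclidean distances, the usual input
    for distance-based clustering methods."""
    return [[euclidean(p, q) for q in points] for p in points]

print(euclidean((0, 0), (3, 4)))  # 5.0
```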

25 A NON-METRICAL DISTANCE-BASED METHOD The mutual neighbor distance method by Gowda and Krishna (see [JMF99] and the references therein): For each p_i ∈ P, n = |P|, label the other n − 1 points, assigning label 1 to the nearest point, 2 to the second nearest, and finally n − 1 to the farthest. Denote the matrix of these labels by N, such that N_ij is the label assigned to p_j by p_i; then M(p_i, p_j) = N_ij + N_ji. The triangle inequality is not satisfied and hence M is not a metric. Note that a threshold value needs to be defined.
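A possible Python rendering of the mutual neighbor distance, assuming the point set fits in memory and any distance function may be plugged in:

```python
def mutual_neighbor_distance(points, dist):
    """Mutual neighbor distance M for a finite point set.
    Returns a dict mapping (i, j) index pairs to M(p_i, p_j) = N_ij + N_ji."""
    n = len(points)
    # N[i][j] = rank of p_j in the neighbor ordering of p_i (1 = nearest)
    N = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        for rank, j in enumerate(order, start=1):
            N[i][j] = rank
    return {(i, j): N[i][j] + N[j][i]
            for i in range(n) for j in range(n) if i != j}

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0)]
M = mutual_neighbor_distance(pts, lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)))
print(M[(0, 1)], M[(0, 2)])  # 2 3
```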

26 CLUSTERING STRING DATA No intuitive definition of distance is readily available; no dimensional information or obvious coordinate vector is attached to the data, so transformations that introduce a proximity measure must be defined. Examples: Hamming distance; Levenshtein or edit distance.

27 EDIT DISTANCE Defined for two sequences A = (a_1, a_2, ..., a_s) and B = (b_1, b_2, ..., b_t) as the smallest number of substitutions, insertions, and deletions required to transform A into B. This is a metric and can be calculated by dynamic programming in O(st) time. Works well in associating a mistyped word with the intended word.
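The standard dynamic-programming computation of the edit distance can be sketched as follows:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming,
    O(len(a) * len(b)) time and space."""
    s, t = len(a), len(b)
    D = [[0] * (t + 1) for _ in range(s + 1)]
    for i in range(s + 1):
        D[i][0] = i                      # delete all of a[:i]
    for j in range(t + 1):
        D[0][j] = j                      # insert all of b[:j]
    for i in range(1, s + 1):
        for j in range(1, t + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[s][t]

print(edit_distance("apetitosa", "apetitoso"))  # 1
```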

28 WEAKNESSES OF THE EDIT DISTANCE Pronunciation may be significant in determining whether two words are similar. Other measures are needed for languages with complicated stemming and conflation rules. All of the following words have the same stem talo: talot, talossa, talosta, taloon. A stemmer that aims to take into account the structural properties of words is called morphological and can be based on the holomorphic distance, which is in turn based on feature extraction on subsequences of the original string.

29 EXAMPLE: HOLOMORPHIC VS. EDIT DISTANCE The 10 nearest neighbors in a dictionary for the Spanish word apetitosa (Engl. tasty, appetizing, or tempting).

Holom. dist.   Translation     Edit distance   Translation
apetitosa      tasty           apetitosa       tasty
apetito        appetite        apetitoso       tasty
apetitoso      tasty           aceitosa        oily
apetite        appetite        apestosa        sickening
apetitiva      appetizing      apetitiva       appetizing
apetitivo      appetizing      apetito         appetite
apetible       tempting        aceitoso        oily
apetecer       crave           acetoso         sour
apetecedor     tempting        alentosa        reassuring
apetencia      hunger (for)    aparatosa       pretentious

30 APPLICATIONS OF EDIT DISTANCE Data that originally is not in textual format can be transformed into a string; DNA sequences are one example. Any binary vector or matrix can also be handled by string methods.

31 CLUSTERING BOOKS BY ACM CLASSIFICATION LABELS Jain et al. [JMF99] consider clustering books by defining the similarity w.r.t. the ACM CR classification labels. They use the ratio of the length of the longest common prefix to the length of the first string. Examples of such labels: H242, I233, and I522.
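The prefix-ratio similarity just described might be implemented like this (the function name is illustrative):

```python
def prefix_similarity(x, y):
    """Similarity of two classification labels: length of their longest common
    prefix divided by the length of the first label."""
    common = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            break
        common += 1
    return common / len(x)

print(prefix_similarity("I233", "I522"))  # 0.25, only the leading 'I' is shared
print(prefix_similarity("I233", "H242"))  # 0.0
```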

32 ACM COMPUTING CLASSIFICATION SYSTEM Top level: A General Literature, B Computer Hardware, C Systems, D Software, E Data Organization, F Theory of Computation, G Mathematics of Computing, H Information Systems, I Computing Methodologies, J Computer Applications, K Computing Milieux. Under D Software: General, Programming Techniques, Software Engineering, Programming Languages, Operating Systems, Miscellaneous. Under Programming Languages: General, Formal Definitions and Theory, Language Classifications, Language Constructs and Features, Processors, Miscellaneous. Under Language Constructs and Features: Abstract data types, Classes and objects, Concurrent programming structures, Constraints, Control structures, ...

33 CLUSTERING IN GRAPHS Graph G = (V, E), |V| = n, |E| = m. Data points are represented by the vertex set V and the connections between them by the edge set E. For different types of data, simple transformations into graphs exist.

34 DELAUNAY GRAPHS A transformation into a graph for a set of points on a plane (generalizable to higher dimensions). Represent each point with a vertex and place an edge between each pair of points that are Voronoi neighbors.

35 THE VORONOI NEIGHBORHOOD RELATION Define a partitioning of the plane containing the n data points into n convex polygons such that there is exactly one data point inside each polygon, and all other points inside the polygon are closer to the data point of that polygon than to any other data point. The resulting diagram of polygons is called a Voronoi diagram or a Dirichlet tessellation. Two points are Voronoi neighbors if they can be connected by a straight line that only passes through their own two polygons.
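One way to obtain the Delaunay graph in practice is to take the edges of a Delaunay triangulation; the sketch below assumes SciPy is available and that the points are in general position:

```python
import numpy as np
from scipy.spatial import Delaunay   # third-party dependency, assumed available

def delaunay_graph(points):
    """Edge set of the Delaunay graph of a 2-D point set: every pair of points
    sharing an edge of the Delaunay triangulation, i.e. Voronoi neighbors
    (for points in general position)."""
    tri = Delaunay(np.asarray(points))
    edges = set()
    for simplex in tri.simplices:          # each simplex is a triangle of 3 vertex indices
        for i in range(3):
            a, b = simplex[i], simplex[(i + 1) % 3]
            edges.add((min(a, b), max(a, b)))
    return edges

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
print(sorted(delaunay_graph(pts)))
```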

36 AN EXAMPLE: A SET OF POINTS

37 THE VORONOI DIAGRAM

38 ADDING THE EDGES

39 THE RESULTING GRAPH

40 USING MINIMUM SPANNING TREES A common approach to clustering a graph is to use a minimum spanning tree of the graph [JMF99]. Clusters are obtained by deleting the edges of the tree in decreasing order of their weights, as in the sketch and example below.
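A compact sketch of this MST-based clustering, assuming the networkx library; removing the k − 1 heaviest tree edges yields k clusters:

```python
import networkx as nx   # third-party dependency, assumed available

def mst_clusters(G, num_clusters):
    """Cluster a weighted graph by building a minimum spanning tree and removing
    its heaviest edges until the desired number of components remains."""
    T = nx.minimum_spanning_tree(G, weight="weight")
    heaviest_first = sorted(T.edges(data="weight"), key=lambda e: e[2], reverse=True)
    for u, v, _ in heaviest_first[: num_clusters - 1]:
        T.remove_edge(u, v)
    return list(nx.connected_components(T))

# Two tight pairs joined by one expensive edge.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("c", "d", 1), ("b", "c", 10)])
print(mst_clusters(G, 2))   # the pairs {a, b} and {c, d}
```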

41–48 MST EXAMPLE (sequence of figure slides)

49 DENSITY-BASED METHODS For a graph G = (V, E) with n = |V| and m = |E|, density is the ratio of m to the maximum possible number of edges n(n − 1)/2: δ = m / (n(n − 1)/2) = 2m / (n(n − 1)). In simple terms, a cluster in a graph can be, for example, a surprisingly dense induced subgraph. Some authors have only considered the maximal cliques of graphs as proper clusters.
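For example (illustrative helper):

```python
def density(n, m):
    """Density of a graph with n vertices and m edges: m divided by n(n - 1)/2."""
    return 2 * m / (n * (n - 1))

print(density(4, 6))  # 1.0, a complete graph on four vertices
print(density(4, 3))  # 0.5
```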

50 RELATIVE DENSITY One may also consider the relative density δ_r of a subgraph with vertex set S ⊆ V: δ_r(S) = |{(u, v) ∈ E : u, v ∈ S}| / |{(u, v) ∈ E : {u, v} ∩ S ≠ ∅}|, i.e. the number of edges internal to S divided by the number of edges with at least one endpoint in S. Mihail et al. [MGSZ02] use spectral methods to identify clusters that have a high relative density.
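And relative density, with edges given as vertex pairs (illustrative sketch):

```python
def relative_density(edges, S):
    """Relative density of a vertex set S: edges with both endpoints in S divided
    by edges with at least one endpoint in S."""
    S = set(S)
    internal = sum(1 for u, v in edges if u in S and v in S)
    incident = sum(1 for u, v in edges if u in S or v in S)
    return internal / incident if incident else 0.0

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(relative_density(edges, {"a", "b", "c"}))  # 3/4 = 0.75
```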

51 CONNECTIVITY-BASED CLUSTERING The connectivity of a graph may be characterized by the number of (disjoint) paths between pairs of vertices. For a pair of vertices to belong to the same cluster, they should be highly connected, whereas there should not be many paths connecting them to vertices outside the cluster. One may for example split the graph into two parts such that there are as few edges as possible between the parts, considering a remaining part of h vertices a proper cluster if more than h/2 edges would be needed to split it further [HS00].

52 ANOTHER CONNECTIVITY APPROACH Many clustering approaches assign weights to the edges of the graph and prune the edges with respect to the weights until the graph has decomposed into satisfactory clusters (as in the MST example). One such weight function is edge-betweenness, which for an edge e is the number of shortest paths that contain e [NG03].
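Edge-betweenness and the resulting pruning are available in networkx; a small illustration on two triangles joined by a bridge (assuming networkx is installed):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman   # third-party dependency

# Two triangles joined by a single bridge; the bridge carries all shortest paths
# between the two sides, so it has the highest edge-betweenness and is pruned first.
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"),
              ("d", "e"), ("e", "f"), ("d", "f"),
              ("c", "d")])

betweenness = nx.edge_betweenness_centrality(G)
print(max(betweenness, key=betweenness.get))       # the bridge between c and d

first_split = next(girvan_newman(G))
print([sorted(part) for part in first_split])      # the two triangles as clusters
```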

53 LOCAL ONLINE CLUSTERING Finding the cluster for only one vertex instead of all vertices, using only local information. We formulate a fitness function similar to relative density and optimize it by simulated annealing [KCDGV83]. Useful for massive and partially unknown graphs such as the World Wide Web.
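A toy sketch of such a local search by simulated annealing; the fitness used here, the product of the density and the relative density of the candidate set, is an assumption made for illustration and only stands in for the fitness function referred to on the slide:

```python
import math
import random

def local_cluster(edges, seed, steps=2000, t0=1.0, cooling=0.995):
    """Toy local clustering around a seed vertex by simulated annealing."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def fitness(S):
        if len(S) < 2:
            return 0.0
        internal = sum(1 for u, v in edges if u in S and v in S)
        incident = sum(1 for u, v in edges if u in S or v in S)
        delta = 2 * internal / (len(S) * (len(S) - 1))       # density of the induced subgraph
        delta_r = internal / incident if incident else 0.0   # relative density
        return delta * delta_r

    cluster, best, t = {seed}, 0.0, t0
    for _ in range(steps):
        boundary = {w for v in cluster for w in neighbors[v]} - cluster
        candidates = list(boundary) + list(cluster - {seed})  # vertices to add or drop
        if not candidates:
            break
        v = random.choice(candidates)
        proposal = cluster ^ {v}                              # toggle membership of v
        f = fitness(proposal)
        # accept improvements always, worse moves with a temperature-dependent probability
        if f >= best or random.random() < math.exp((f - best) / t):
            cluster, best = proposal, f
        t *= cooling
    return cluster

# Two triangles joined by a bridge; starting from "a" the search typically
# settles on the triangle {'a', 'b', 'c'}.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
print(local_cluster(edges, "a"))
```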

54 OTHER APPROACHES artificial neural networks; evolutionary algorithms [JMF99]

55 In summary, clustering is an interesting, useful, and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval. However, it is possible to exploit this potential only after making several design choices carefully. [JMF99]

56 REFERENCES
[BYCHN03] Ricardo Baeza-Yates, E. Chávez, N. Herrera, and Gonzalo Navarro. Clustering in metric spaces with applications in information retrieval. In Wu and Xiong, editors, Information Retrieval and Clustering. Kluwer Academic Publishers. To appear.
[HS00] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175-181, 2000.
[JMF99] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264-323, September 1999.
[KCDGV83] Scott Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.
[MGSZ02] Milena Mihail, Christos Gkantsidis, Amin Saberi, and Ellen Zegura. On the semantics of Internet topologies. Technical Report GIT-CC-02-07, College of Computing, Georgia Institute of Technology, January 2002.
[NG03] Mark E. J. Newman and Michelle Girvan. Mixing patterns and community structure in networks. In Proceedings of the XVIII Sitges Conference on Statistical Mechanics, Berlin, Germany, 2003. Springer-Verlag.
