Departamento de Engenharia Informática, Instituto Superior Técnico, 1st Semester 2012/2013. Slides based on the official slides of the book Mining the Web © Soumen Chakrabarti.
Outline
1. Motivation; Basic Concepts
2. Bottom-Up; Top-Down
3. Self-Organizing Maps; Multidimensional Scaling; Latent Semantic Indexing
Motivation
Problem: a query word can be ambiguous. E.g., the query "star" retrieves documents about astronomy, animals, etc. Solution: cluster the documents returned for a query along the lines of their different topics.
Problem: manual construction of topic hierarchies and taxonomies is costly. Solution: preliminary clustering of large samples of web documents.
Problem: speeding up similarity search. Solution: restrict the search for documents similar to a query to the most representative cluster(s).
An Example (figure)
Cluster Hypothesis: given a suitable clustering of a collection, if the user is interested in document d (or term t), they are likely to be interested in other members of the cluster to which d (or t) belongs.
Task: devise measures of similarity so as to cluster a collection of documents/terms into groups, such that similarity within a cluster is larger than across clusters.
Concepts
Two important paradigms: bottom-up agglomerative clustering and top-down partitioning.
Visualization techniques: embed the corpus in a low-dimensional space.
Characterizing the entities:
- Internally: vector space model, probabilistic models
- Externally: a measure of similarity/dissimilarity between pairs
Parameters
General parameters:
- Similarity measure $\rho(d_1, d_2)$, e.g., cosine similarity
- Distance measure $\delta(d_1, d_2)$, e.g., Euclidean distance
- Number of clusters $k$
Problems: a large number of noisy dimensions; the notion of noise is application dependent.
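A minimal sketch of the two measures on toy term vectors, using plain numpy (the vectors and function names are illustrative, not from the slides):

```python
import numpy as np

def cosine_similarity(d1, d2):
    # rho(d1, d2): cosine of the angle between the two term vectors
    return d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

def euclidean_distance(d1, d2):
    # delta(d1, d2): straight-line distance between the two vectors
    return np.linalg.norm(d1 - d2)

d1 = np.array([1.0, 0.0, 2.0])   # toy tf-idf vectors
d2 = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(d1, d2), euclidean_distance(d1, d2))
```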
Techniques:
- Bottom-up clustering
- Top-down clustering
- Self-organizing maps
- Multidimensional scaling
- Latent semantic indexing
- Generative models and probabilistic approaches: a single topic per document, or documents as mixtures of multiple topics
Partition the document collection into $k$ clusters $\{D_1, D_2, \ldots, D_k\}$. Choices:
- Minimize intra-cluster distance: $\sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$
- Maximize intra-cluster semblance: $\sum_i \sum_{d_1, d_2 \in D_i} \rho(d_1, d_2)$
If cluster representations $\bar{D}_i$ are available:
- Minimize $\sum_i \sum_{d \in D_i} \delta(d, \bar{D}_i)$
- Maximize $\sum_i \sum_{d \in D_i} \rho(d, \bar{D}_i)$
Soft clustering: $d$ is assigned to cluster $D_i$ with confidence $z_{d,i}$. Find the $z_{d,i}$ so as to minimize $\sum_i \sum_d z_{d,i}\,\delta(d, \bar{D}_i)$ or maximize $\sum_i \sum_d z_{d,i}\,\rho(d, \bar{D}_i)$.
Two ways to get partitions: bottom-up clustering and top-down clustering (a code sketch of the hard-assignment objective follows).
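As a concrete reading of the hard-assignment objective, a small sketch (illustrative names and data, numpy only) that scores a partition by its total intra-cluster distance:

```python
import numpy as np

def intra_cluster_cost(docs, labels):
    """Total intra-cluster distance: sum, over clusters, of pairwise
    Euclidean distances delta(d1, d2) between their member documents."""
    cost = 0.0
    for c in np.unique(labels):
        members = docs[labels == c]
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                cost += np.linalg.norm(members[a] - members[b])
    return cost

docs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
# tight, well-separated clusters yield a low cost
print(intra_cluster_cost(docs, np.array([0, 0, 1, 1])))
```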
Hierarchical Agglomerative Clustering
Initially, G is a collection of singleton groups, each containing one document d.
Repeat: find the pair of groups $\Gamma, \Delta \in G$ with the maximum similarity measure $s(\Gamma \cup \Delta)$, and merge group $\Gamma$ with group $\Delta$.
Use the recorded merges to plot the hierarchical merging process (a dendrogram).
To get the desired number of clusters, cut across a level of the dendrogram.
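A compact sketch of bottom-up clustering using SciPy's hierarchical clustering routines; the toy matrix is made up, and 'average' linkage is chosen as a rough stand-in for the average pairwise similarity criterion discussed on the next slide:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy tf-idf matrix: one row per document (values are illustrative)
docs = np.array([[1.0, 0.0, 2.0],
                 [0.9, 0.1, 1.8],
                 [0.0, 3.0, 0.1],
                 [0.1, 2.5, 0.0]])

# bottom-up merging of the most similar pair at each step
Z = linkage(docs, method='average', metric='cosine')

# cut the dendrogram to obtain k = 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g., [1 1 2 2]
```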
Example Dendrogram (figure)
Similarity Measures
Self-similarity: the average pairwise similarity between documents in $\Phi$:
$s(\Phi) = \frac{1}{\binom{|\Phi|}{2}} \sum_{d_1, d_2 \in \Phi} s(d_1, d_2)$
where $s(d_1, d_2)$ is the inter-document similarity measure (e.g., cosine of tf-idf vectors).
Other criteria: maximum/minimum pairwise similarity between documents in the clusters.
Complexity: $O(n^2 \log n)$ time with $O(n^2)$ space.
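The self-similarity formula translates directly into code; a sketch with illustrative toy vectors:

```python
import itertools
import numpy as np

def self_similarity(phi):
    """Average pairwise cosine similarity of the documents in cluster phi."""
    pairs = list(itertools.combinations(phi, 2))
    sims = [v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            for v1, v2 in pairs]
    # divide by |phi| choose 2, the number of document pairs
    return sum(sims) / len(pairs)

cluster = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([1.0, 0.2])]
print(self_similarity(cluster))
```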
Top-Down
Uses an internal representation for documents as well as clusters.
Partitions the documents into k clusters. Two variants:
- Hard: 0/1 assignment of documents to clusters
- Soft: documents belong to clusters with fractional scores
Termination: when the assignment of documents to clusters ceases to change much, or when the cluster centroids move negligibly over successive iterations.
The k-means Algorithm
Hard k-means:
1. Choose k arbitrary centroids
2. Assign each document to its nearest centroid
3. Recompute the centroids
Contribution of document $d$ for updating cluster centroid $\mu_c$:
$\Delta\mu_c = \begin{cases} \eta\,(d - \mu_c) & \text{if } \mu_c \text{ is closest to } d \\ 0 & \text{otherwise} \end{cases}$
$\mu_c \leftarrow \mu_c + \Delta\mu_c$
The parameter $\eta$ is called the learning rate.
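A minimal sketch of hard k-means, assuming a batch update (each centroid is recomputed as the mean of its assigned documents) rather than the slide's incremental η-rule, which would instead nudge only the nearest centroid toward each arriving document:

```python
import numpy as np

def kmeans(docs, k, n_iter=100, seed=0):
    """Hard k-means: each document contributes only to its nearest centroid."""
    rng = np.random.default_rng(seed)
    # choose k arbitrary centroids among the documents
    centroids = docs[rng.choice(len(docs), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each document to the nearest centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recompute each non-empty centroid as the mean of its documents
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = docs[assign == c].mean(axis=0)
    return centroids, assign
```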
The k-means Algorithm
Soft k-means: don't break close ties between document assignments to clusters, and don't make documents contribute to a single cluster that wins narrowly.
The contribution for updating cluster centroid $\mu_c$ from document $d$ is related to the current similarity between them:
$\Delta\mu_c = \eta\,\frac{1/\|d - \mu_c\|^2}{\sum_\gamma 1/\|d - \mu_\gamma\|^2}\,(d - \mu_c)$
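A sketch of one soft update step, following the formula above (the function name and the small eps guard against division by zero are illustrative additions):

```python
import numpy as np

def soft_update(centroids, d, eta=0.1, eps=1e-9):
    """One soft k-means step: every centroid moves toward document d,
    weighted by inverse squared distance, so narrow winners don't take all."""
    inv_sq = 1.0 / (np.linalg.norm(centroids - d, axis=1) ** 2 + eps)
    weights = inv_sq / inv_sq.sum()          # fractional responsibilities
    return centroids + eta * weights[:, None] * (d - centroids)
```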
An Example (online demo)
Visualization Techniques
Goal: visualize the corpus in a low-dimensional space.
- Hierarchical Agglomerative Clustering (HAC): lends itself easily to visualization
- Self-Organizing Map (SOM): a technique related to k-means
- Multidimensional Scaling (MDS): minimizes the distortion of interpoint distances in the low-dimensional embedding, as compared to the dissimilarities given in the input data
- Latent Semantic Indexing (LSI): linear transformations to reduce the number of dimensions
Self-Organizing Maps
Like soft k-means:
- Determine an association between clusters and documents
- Associate a representative vector with each cluster and iteratively refine it
Unlike k-means:
- The clusters are embedded in a low-dimensional space right from the beginning
- A large number of clusters can be initialized, even if many are eventually to remain devoid of documents
Each cluster can be a slot in a square/hexagonal grid. The grid structure defines the neighborhood N(c) for each cluster c. A proximity function $h(\gamma, c)$ between clusters is also involved.
Update Rule
A data item d activates the node $c_d$ (the closest cluster) as well as its neighborhood nodes, e.g., all nodes within n hops.
The update rule for a node $\gamma$ under the influence of d is:
$\mu_\gamma \leftarrow \mu_\gamma + \eta\, h(\gamma, c_d)\,(d - \mu_\gamma)$
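A sketch of one SOM step on a square grid, assuming a Gaussian proximity function $h(\gamma, c_d)$ in grid distance (a common but here assumed choice):

```python
import numpy as np

def som_update(grid, d, eta=0.1, sigma=1.0):
    """One SOM step on an (R, C, dim) float grid of cluster vectors."""
    R, C, _ = grid.shape
    # winner c_d: the grid cell whose vector is closest to d
    dists = np.linalg.norm(grid - d, axis=2)
    ci, cj = np.unravel_index(dists.argmin(), dists.shape)
    for i in range(R):
        for j in range(C):
            grid_dist2 = (i - ci) ** 2 + (j - cj) ** 2
            h = np.exp(-grid_dist2 / (2 * sigma ** 2))  # proximity h(gamma, c_d)
            grid[i, j] += eta * h * (d - grid[i, j])    # pull toward d
    return grid
```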
Example (figure)
Multidimensional Scaling
Goal: a distance-preserving low-dimensional embedding of the documents (often called Principal Coordinate Analysis).
Symmetric inter-document distances are given a priori or computed from an internal representation.
Coarse-grained user feedback: the user provides the similarity $d_{i,j}$ between documents i and j; with increasing feedback, the prior distances are overridden.
Objective: minimize the stress of the embedding:
$\text{stress} = \frac{\sum_{i,j} (\hat{d}_{i,j} - d_{i,j})^2}{\sum_{i,j} d_{i,j}^2}$
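The stress objective written out directly; a sketch assuming Euclidean distances $\hat{d}_{i,j}$ in the embedding:

```python
import numpy as np

def stress(X, D):
    """Stress of an embedding X (n x 2) against target distances D (n x n)."""
    n = len(X)
    num, den = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_hat = np.linalg.norm(X[i] - X[j])   # embedded distance
            num += (d_hat - D[i, j]) ** 2
            den += D[i, j] ** 2
    return num / den
```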
Problems
Stress is not easy to optimize. Iterative hill climbing:
1. Points (documents) are assigned random coordinates by an external heuristic
2. Points are moved, one at a time, by a small distance in the direction of locally decreasing stress
For n documents, each point takes O(n) time to be moved, for a total of O(n²) time per relaxation.
Faster approach: FastMap [Faloutsos and Lin 1995].
Extended Similarity
In the vector space model, different terms are treated as having different meanings: car and automobile are distinct dimensions. However, car and automobile are related: they are likely to co-occur often, and documents containing related words are related. This is useful for search and clustering.
Two basic approaches:
- A hand-made thesaurus (e.g., WordNet)
- Co-occurrence and associations
Latent Semantic Indexing
The vector-space model assigns a distinct orthogonal direction to each term. However, not all terms are orthogonal, e.g., car and automobile.
We need a matrix with a lower rank than the traditional document-term matrix. Such a matrix can be found through Singular Value Decomposition.
Singular Value Decomposition (figure)
Querying
A query q is projected onto the new space:
$\hat{q} = D_r^{-1} U_r^T q^T$
where $D_r$ is the diagonal matrix of the r largest singular values and $U_r$ holds the corresponding left singular vectors.
Cosine similarity can now be applied to the projections. The results are often better than tf-idf retrieval: the SVD filters out noise and discovers semantic associations between terms.
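A self-contained sketch of LSI retrieval with numpy: build the rank-r decomposition via SVD, project the documents and a query into the latent space, and score by cosine similarity (the matrix and the choice r = 2 are illustrative):

```python
import numpy as np

# toy term-document matrix A (terms x documents); values are illustrative
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2                                    # number of latent dimensions kept
U_r, D_r_inv = U[:, :r], np.diag(1.0 / s[:r])

docs_r = D_r_inv @ U_r.T @ A             # documents in latent space (r x n_docs)
q = np.array([1.0, 1.0, 0.0, 0.0])       # query as a term vector
q_r = D_r_inv @ U_r.T @ q                # projected query q_hat

# cosine similarity between the projected query and each projected document
sims = (docs_r.T @ q_r) / (np.linalg.norm(docs_r, axis=0) * np.linalg.norm(q_r))
print(sims)
```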
Example (figure)
Example (figure, continued)
Problems
Computationally expensive: LSI can run for minutes to hours on collections of 10³ to 10⁴ documents. Current implementations need to store the whole input matrix in memory, which is not feasible for the Web.
Probabilistic Latent Semantic Indexing adds a sounder probabilistic model, based on a mixture decomposition derived from a latent class model.
Questions?