MIA - Master on Artificial Intelligence

Outline: Hierarchical, Non-hierarchical, Evaluation

The Concept of Similarity

Related terms: proximity, affinity, distance, difference, divergence.

We use distance when the metric properties hold:
- $d(x, x) = 0$
- $d(x, y) > 0$ when $x \neq y$
- $d(x, y) = d(y, x)$ (symmetry)
- $d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality)

We use similarity in the general case:
- Function: $sim : A \times B \rightarrow S$ (where $S$ is often $[0, 1]$)
- Homogeneous: $sim : A \times A \rightarrow S$ (e.g. word-to-word)
- Heterogeneous: $sim : A \times B \rightarrow S$ (e.g. word-to-document)
- Not necessarily symmetric, nor satisfying the triangle inequality.

The Concept of Similarity (cont.)

If $A$ is a metric space, the distance in $A$ may be used:
$D_{euclidean}(\vec{x}, \vec{y}) = \|\vec{x} - \vec{y}\| = \sqrt{\sum_i (x_i - y_i)^2}$

Similarity vs. distance: $sim_D(A, B) = \frac{1}{1 + D(A, B)}$

Monotonicity: $\min\{sim(x, y), sim(x, z)\} \geq sim(x, y \cup z)$ (merging cannot increase similarity)

Applications: case-based reasoning, IR, ...
- Discovering related words: distributional similarity
- Resolving syntactic ambiguity: taxonomic similarity
- Resolving semantic ambiguity: ontological similarity
- Acquiring selectional restrictions/preferences

Relevant Information
- Content (information about the compared units):
  Words: form, morphology, PoS, ...
  Senses: synset, topic, domain, ...
  Syntax: parse trees, syntactic roles, ...
  Documents: words, collocations, NEs, ...
- Context (information about the situation in which similarity is computed):
  Window-based vs. syntax-based
- External knowledge: monolingual/bilingual dictionaries, ontologies, corpora

Vectorial methods (1)

$L_1$ norm (Manhattan, taxi-cab, city-block distance):
$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{N} |x_i - y_i|$

$L_2$ norm (Euclidean distance):
$L_2(\vec{x}, \vec{y}) = \|\vec{x} - \vec{y}\| = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$

Cosine similarity:
$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$

Vectorial methods (2)

$L_1$ and $L_2$ norms are particular cases of the Minkowski measure:
$D_{minkowski}(\vec{x}, \vec{y}) = L_r(\vec{x}, \vec{y}) = \left( \sum_{i=1}^{N} |x_i - y_i|^r \right)^{1/r}$

Canberra distance:
$D_{canberra}(\vec{x}, \vec{y}) = \sum_{i=1}^{N} \frac{|x_i - y_i|}{|x_i| + |y_i|}$

Chebyshev distance:
$D_{chebyshev}(\vec{x}, \vec{y}) = \max_i |x_i - y_i|$
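
A minimal NumPy sketch of these vector measures (an illustration; the function names and example vectors are mine, not from the slides):

    import numpy as np

    def manhattan(x, y):                  # L1 norm
        return np.abs(x - y).sum()

    def euclidean(x, y):                  # L2 norm
        return np.sqrt(((x - y) ** 2).sum())

    def cosine_sim(x, y):
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    def minkowski(x, y, r):               # generalises L1 (r = 1) and L2 (r = 2)
        return (np.abs(x - y) ** r).sum() ** (1.0 / r)

    def canberra(x, y):                   # assumes no position where both vectors are 0
        return (np.abs(x - y) / (np.abs(x) + np.abs(y))).sum()

    def chebyshev(x, y):
        return np.abs(x - y).max()

    x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
    print(manhattan(x, y), euclidean(x, y), cosine_sim(x, y), chebyshev(x, y))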

Set-oriented methods (3): binary-valued vectors seen as sets

Dice: $S_{dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$

Jaccard: $S_{jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$

Overlap: $S_{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$

Cosine: $\cos(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}$

These similarities are in $[0, 1]$ and can be turned into distances simply by subtracting: $D = 1 - S$.
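
The same measures over Python sets, as a rough sketch (the example sets are mine):

    import math

    def dice(X, Y):     return 2 * len(X & Y) / (len(X) + len(Y))
    def jaccard(X, Y):  return len(X & Y) / len(X | Y)
    def overlap(X, Y):  return len(X & Y) / min(len(X), len(Y))
    def cosine(X, Y):   return len(X & Y) / math.sqrt(len(X) * len(Y))

    X, Y = {"the", "cat", "sat"}, {"the", "cat", "ran"}
    print(jaccard(X, Y), 1 - jaccard(X, Y))   # a similarity and the corresponding distance 1 - S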

Set-oriented methods (4): agreement contingency table

                  Object j
                  1        0
    Object i  1   a        b        a + b
              0   c        d        c + d
                  a + c    b + d    p

Dice: $S_{dice}(i, j) = \frac{2a}{2a + b + c}$
Jaccard: $S_{jaccard}(i, j) = \frac{a}{a + b + c}$
Overlap: $S_{overlap}(i, j) = \frac{a}{\min(a + b,\ a + c)}$
Cosine: $S_{cosine}(i, j) = \frac{a}{\sqrt{(a + b)(a + c)}}$
Matching coefficient: $S_{mc}(i, j) = \frac{a + d}{p}$

Distributional

Particular case of vectorial representation where the attributes are probability distributions:
$\vec{x}^T = [x_1 \ldots x_N]$ such that $\forall i,\ 0 \leq x_i \leq 1$ and $\sum_{i=1}^{N} x_i = 1$

Kullback-Leibler divergence (relative entropy), non-symmetrical:
$D(q \,\|\, r) = \sum_{y \in Y} q(y) \log \frac{q(y)}{r(y)}$

Mutual information (KL divergence between the joint and the product of the marginals):
$I(A; B) = D(h \,\|\, f \cdot g) = \sum_{a \in A} \sum_{b \in B} h(a, b) \log \frac{h(a, b)}{f(a)\,g(b)}$
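
A hedged NumPy sketch of both quantities (function names and toy distributions are mine; the KL helper assumes $r(y) > 0$ wherever $q(y) > 0$):

    import numpy as np

    def kl_divergence(q, r):
        # D(q || r) = sum_y q(y) log(q(y) / r(y)); not symmetric
        q, r = np.asarray(q, float), np.asarray(r, float)
        mask = q > 0
        return np.sum(q[mask] * np.log(q[mask] / r[mask]))

    def mutual_information(joint):
        # I(A;B) = D(h || f*g): KL between the joint h and the product of its marginals
        h = np.asarray(joint, float)
        f = h.sum(axis=1, keepdims=True)      # marginal of A
        g = h.sum(axis=0, keepdims=True)      # marginal of B
        return kl_divergence(h.ravel(), (f * g).ravel())

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
    print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))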

Semantic

Project objects onto a semantic space: $D_A(x_1, x_2) = D_B(f(x_1), f(x_2))$
- Semantic spaces: an ontology (WordNet, CYC, SUMO, ...) or a graph-like knowledge base (e.g. Wikipedia).
- Not easy to project words, since the semantic space is composed of concepts and a word may map to more than one concept.
- Not obvious how to compute distance in the semantic space.

WordNet

Distances in WordNet

WordNet::Similarity: http://maraca.d.umn.edu/cgi-bin/similarity/similarity.cgi

Some definitions:
- $SLP(s_1, s_2)$ = shortest path length from concept $s_1$ to $s_2$ (which subset of arcs is used? antonymy, gloss, ...)
- $depth(s)$ = depth of concept $s$ in the ontology
- $MaxDepth = \max_{s \in WN} depth(s)$
- $LCS(s_1, s_2)$ = lowest common subsumer of $s_1$ and $s_2$
- $IC(s) = \log \frac{1}{P(s)}$ = information content of $s$ (given a corpus)

Distances in WordNet

- Shortest path length: $D(s_1, s_2) = SLP(s_1, s_2)$
- Leacock & Chodorow: $sim(s_1, s_2) = -\log \frac{SLP(s_1, s_2)}{2 \cdot MaxDepth}$
- Wu & Palmer: $sim(s_1, s_2) = \frac{2 \cdot depth(LCS(s_1, s_2))}{depth(s_1) + depth(s_2)}$
- Resnik: $sim(s_1, s_2) = IC(LCS(s_1, s_2))$
- Jiang & Conrath: $D(s_1, s_2) = IC(s_1) + IC(s_2) - 2 \cdot IC(LCS(s_1, s_2))$
- Lin: $sim(s_1, s_2) = \frac{2 \cdot IC(LCS(s_1, s_2))}{IC(s_1) + IC(s_2)}$
- Gloss overlap: sum of squares of the lengths of word overlaps between glosses
- Gloss vector: cosine of second-order co-occurrence vectors of the glosses
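
Most of these measures are available through NLTK's WordNet interface; a sketch assuming nltk plus its wordnet and wordnet_ic data are installed (the chosen synsets are just an example):

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    car, cat = wn.synset('car.n.01'), wn.synset('cat.n.01')
    brown_ic = wordnet_ic.ic('ic-brown.dat')      # information content estimated from the Brown corpus

    print(car.path_similarity(cat))               # based on shortest path length
    print(car.lch_similarity(cat))                # Leacock & Chodorow
    print(car.wup_similarity(cat))                # Wu & Palmer
    print(car.res_similarity(cat, brown_ic))      # Resnik
    print(car.jcn_similarity(cat, brown_ic))      # Jiang & Conrath
    print(car.lin_similarity(cat, brown_ic))      # Lin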

Distances in Wikipedia

- Measures using links, including measures used on WordNet but applied to the Wikipedia graph: http://www.h-its.org/english/research/nlp/download/wikipediasimilarity.php
- Measures using the content of articles (vector spaces)
- Measures using Wikipedia categories

Outline: Hierarchical, Non-hierarchical, Evaluation

Partition a set of objects into clusters.
- Objects: features and values, plus a similarity/distance measure.
- Utilities: Exploratory Data Analysis (EDA); generalization (learning), e.g. from "on Monday", "on Sunday", ... infer "on Friday".
- Supervised vs. unsupervised classification.
- Object assignment to clusters:
  Hard: one cluster per object.
  Soft: a distribution $P(c_i \mid x_j)$; degree of membership.

Produced structures
- Hierarchical (set of clusters + relationships between them):
  Good for detailed data analysis; provides more information; less efficient; no single best algorithm.
- Flat / non-hierarchical (set of clusters):
  Preferable if efficiency is required or for large data sets.
  K-means: simple method, a sufficient starting point. K-means assumes a Euclidean space; if that is not the case, EM may be used.
- Cluster representative: centroid $\vec{\mu} = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$

Dendrogram (Hierarchical)

Single-link clustering of 22 frequent English words, represented as a dendrogram (leaves: be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was).

Hierarchical clustering
- Bottom-up (agglomerative): start with individual objects, iteratively group the most similar ones.
- Top-down (divisive): start with all the objects, iteratively divide them maximizing within-group similarity.

Agglomerative (bottom-up) hierarchical clustering

Input:  a set X = {x_1, ..., x_n} of objects
        a function sim: P(X) × P(X) → R
Output: a cluster hierarchy

for i := 1 to n do c_i := {x_i} end
C := {c_1, ..., c_n};  j := n + 1
while |C| > 1 do
    (c_n1, c_n2) := argmax_{(c_u, c_v) ∈ C × C, c_u ≠ c_v} sim(c_u, c_v)
    c_j := c_n1 ∪ c_n2
    C := (C \ {c_n1, c_n2}) ∪ {c_j}
    j := j + 1
end while
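
A small, naive Python rendering of this loop, using single-link similarity over a toy 1-D data set (the function names and the example data are mine, not from the slides):

    import itertools

    def single_link_sim(cu, cv, sim):
        # cluster similarity = similarity of the two most similar members
        return max(sim(x, y) for x in cu for y in cv)

    def agglomerative(objects, sim):
        clusters = [frozenset([x]) for x in objects]          # c_i := {x_i}
        merges = []                                           # records the hierarchy
        while len(clusters) > 1:
            cu, cv = max(itertools.combinations(clusters, 2),
                         key=lambda p: single_link_sim(p[0], p[1], sim))
            merged = cu | cv                                  # c_j := c_n1 ∪ c_n2
            clusters = [c for c in clusters if c not in (cu, cv)] + [merged]
            merges.append(merged)
        return merges

    # toy example: 1-D points, similarity = negative absolute difference
    print(agglomerative([1.0, 1.1, 5.0, 5.2, 9.0], lambda a, b: -abs(a - b)))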

Cluster similarity (Hierarchical)
- Single link: similarity of the two most similar members.
  Local coherence (close objects end up in the same cluster); elongated clusters (chaining effect).
- Complete link: similarity of the two least similar members.
  Global coherence, avoids elongated clusters; better (?) clusters.
- UPGMA (Unweighted Pair Group Method with Arithmetic Mean): average pairwise similarity between members,
  $\frac{1}{|X|\,|Y|} \sum_{x \in X} \sum_{y \in Y} sim(x, y)$
  Trade-off between global coherence and efficiency.
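
For real data one would normally use a library; a sketch with SciPy comparing the three linkage criteria (the toy data and the choice of two output clusters are mine):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]])
    for method in ("single", "complete", "average"):          # "average" corresponds to UPGMA
        Z = linkage(X, method=method)                          # merge history (dendrogram data)
        print(method, fcluster(Z, t=2, criterion="maxclust"))  # cut into 2 flat clusters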

Examples (Hierarchical)

Figures: a cloud of points in a plane, its single-link clustering, an intermediate clustering, and its complete-link clustering.

Divisive (top-down) hierarchical clustering

Input:  a set X = {x_1, ..., x_n} of objects
        a function coh: P(X) → R
        a function split: P(X) → P(X) × P(X)
Output: a cluster hierarchy

C := {X};  c_1 := X;  j := 1
while ∃ c_i ∈ C such that |c_i| > 1 do
    c_u := argmin_{c_v ∈ C} coh(c_v)
    (c_{j+1}, c_{j+2}) := split(c_u)
    C := (C \ {c_u}) ∪ {c_{j+1}, c_{j+2}}
    j := j + 2
end while

Top-down clustering (Hierarchical)
- Cluster splitting: finding two sub-clusters.
- Split clusters with lower coherence (single-link, complete-link, group-average).
- Splitting is a sub-clustering task: non-hierarchical clustering, or bottom-up clustering.
- Example: distributional noun clustering (Pereira et al., 1993):
  Nouns with similar verb probability distributions; KL divergence as the distance between distributions,
  $D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$
  Bottom-up clustering is not applicable because some $q(x) = 0$.

Non-hierarchical clustering
- Start with a partition based on random seeds.
- Iteratively refine the partition by reallocating objects.
- Stop when cluster quality does not improve further; quality can be measured as group-average similarity, mutual information between adjacent clusters, or the likelihood of the data given the cluster model.
- Number of desired clusters? Test different values, or use Minimum Description Length: the goodness function includes information about the number of clusters.

K-means (Non-hierarchical)
- Clusters are represented by centers of mass (centroids) or by a prototypical member (medoid).
- Euclidean distance.
- Sensitive to outliers.
- Hard clustering.
- O(n).

K-means algorithm (Non-hierarchical)

Input:  a set X = {x_1, ..., x_n} ⊆ R^m
        a distance measure d: R^m × R^m → R
        a function for computing the mean µ: P(R^m) → R^m
Output: a partition of X into k clusters

select k initial centers f_1, ..., f_k
while stopping criterion is not true do
    for all clusters c_j do
        c_j := {x_i | ∀ f_l: d(x_i, f_j) ≤ d(x_i, f_l)}
    for all means f_j do
        f_j := µ(c_j)
end while
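
A compact NumPy version of this loop (a sketch: the naming, the convergence test, and the toy data are mine; empty clusters are not handled):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # k initial centers
        for _ in range(n_iter):
            # assignment step: each point goes to its nearest center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # update step: recompute each mean
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):                # stopping criterion
                break
            centers = new_centers
        return labels, centers

    X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
    print(kmeans(X, k=2))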

K-means example (Non-hierarchical)

Figures: assignment step and recomputation of the means.

EM algorithm (Non-hierarchical)

Estimate the (hidden) parameters of a model given the data.
Estimation / maximization deadlock:
- Estimation: if we knew the parameters, we could compute the expected values of the hidden structure of the model.
- Maximization: if we knew the expected values of the hidden structure of the model, we could compute the MLE of the parameters.
NLP applications: Forward-Backward algorithm (Baum-Welch re-estimation), Inside-Outside algorithm, unsupervised WSD.

EM example (Non-hierarchical)

Can be seen as a soft version of K-means: random initial centroids, soft assignments, recompute (weighted average) centroids.
Figures: initial state, after iteration 1, after iteration 2 (an example of using the EM algorithm for soft clustering of points into clusters C1 and C2).
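
In practice, the soft, EM-trained counterpart of K-means is a Gaussian mixture; a sketch with scikit-learn (the toy data are mine; GaussianMixture runs EM internally):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM under the hood
    print(gmm.predict_proba(X))    # soft assignments P(cluster | point)
    print(gmm.means_)              # the estimated centroids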

Evaluation

Related to a reference clustering: purity and inverse purity.
$P = \frac{1}{|D|} \sum_{c} \max_{x} |c \cap x|$
$IP = \frac{1}{|D|} \sum_{x} \max_{c} |c \cap x|$
where $c$ ranges over the obtained clusters, $x$ over the expected (reference) clusters, and $D$ is the set of clustered objects.

Without a reference clustering: cluster quality measures such as coherence, average internal distance, average external distance, etc.
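
A small sketch of purity and inverse purity over two partitions of the same objects (the helper names and the toy partitions are mine):

    def purity(obtained, expected):
        # obtained, expected: lists of sets partitioning the same collection of objects
        n = sum(len(c) for c in obtained)
        return sum(max(len(c & x) for x in expected) for c in obtained) / n

    def inverse_purity(obtained, expected):
        return purity(expected, obtained)

    clusters = [{1, 2, 3}, {4, 5}]       # obtained clustering
    classes  = [{1, 2}, {3, 4, 5}]       # reference (expected) clustering
    print(purity(clusters, classes), inverse_purity(clusters, classes))   # 0.8 0.8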