Discovering Word Senses from Text


Patrick Pantel and Dekang Lin
University of Alberta, Department of Computing Science
Edmonton, Alberta T6H 2E1 Canada
{ppantel, lindek}@cs.ualberta.ca

ABSTRACT

Inventories of manually compiled dictionaries usually serve as a source for word senses. However, they often include many rare senses while missing corpus/domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers word senses from text. It initially discovers a set of tight clusters called committees that are well scattered in the similarity space. The centroid of the members of a committee is used as the feature vector of the cluster. We proceed by assigning words to their most similar clusters. After assigning an element to a cluster, we remove their overlapping features from the element. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. Each cluster that a word belongs to represents one of its senses. We also present an evaluation methodology for automatically measuring the precision and recall of discovered senses.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval---Clustering.

General Terms
Algorithms, Measurement, Experimentation.

Keywords
Word sense discovery, clustering, evaluation, machine learning.

1. INTRODUCTION

Using word senses versus word forms is useful in many applications such as information retrieval [20], machine translation [5] and question-answering [16]. In previous approaches, word senses are usually defined using a manually constructed lexicon. There are several disadvantages associated with these word senses. First, manually created lexicons often contain rare senses. For example, WordNet 1.5 [15] (hereon referred to as WordNet) included a sense of computer that means the person who computes. Using WordNet to expand queries to an information retrieval system, the expansion of computer includes words like estimator and reckoner. The second problem with these lexicons is that they miss many domain-specific senses. For example, WordNet misses the user-interface-object sense of the word dialog (as often used in software manuals).

The meaning of an unknown word can often be inferred from its context. Consider the following sentences:

    A bottle of tezgüino is on the table.
    Everyone likes tezgüino.
    Tezgüino makes you drunk.
    We make tezgüino out of corn.

The contexts in which the word tezgüino is used suggest that tezgüino may be a kind of alcoholic beverage. This is because other alcoholic beverages tend to occur in the same contexts as tezgüino. The intuition is that words that occur in the same contexts tend to be similar. This is known as the Distributional Hypothesis [3]. There have been many approaches to compute the similarity between words based on their distribution in a corpus [4][8][12]. The output of these programs is a ranked list of similar words to each word. For example, [12] outputs the following similar words for wine and suit:

    wine: beer, white wine, red wine, Chardonnay, champagne, fruit, food, coffee, juice, Cabernet, cognac, vinegar, Pinot noir, milk, vodka, ...
    suit: lawsuit, jacket, shirt, pant, dress, case, sweater, coat, trouser, claim, business suit, blouse, skirt, litigation, ...

The similar words of wine represent the meaning of wine.
However, the similar words of suit represent a mixture of its clothing and litigation senses. Such lists of similar words do not distinguish between the multiple senses of polysemous words. The algorithm we present in this paper automatically discovers word senses by clustering words according to their distributional similarity. Each cluster that a word belongs to corresponds to a sense of the word. Consider the following sample outputs from our algorithm:

    (suit
      Nq34  0.39 (blouse, slack, legging, sweater)
      Nq137 0.20 (lawsuit, allegation, case, charge))

    (plant
      Nq15  0.41 (plant, factory, facility, refinery)
      Nq35  0.20 (shrub, ground cover, perennial, bulb))

    (heart
      Nq7   0.27 (kidney, bone marrow, marrow, liver)
      Nq866 0.17 (psyche, consciousness, soul, mind))

Each entry shows the clusters to which the headword belongs. Nq34, Nq137, etc. are automatically generated names for the clusters. The number after each cluster name is the similarity between the cluster and the headword (i.e. suit, plant and heart). The lists of words are the top-4 most similar members to the cluster centroid. Each cluster corresponds to a sense of the headword. For example, Nq34 corresponds to the clothing sense of suit and Nq137 corresponds to the litigation sense of suit.

In this paper, we present a clustering algorithm, CBC (Clustering By Committee), in which the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members. The subset is viewed as a committee that determines which other elements belong to the cluster. By carefully choosing committee members, the features of the centroid tend to be the more typical features of the target class. We also propose an automatic evaluation methodology for senses discovered by clustering algorithms. Using the senses in WordNet, we measure the precision of a system's discovered senses and the recall of the senses it should discover.

2. RELATED WORK

Clustering algorithms are generally categorized as hierarchical and partitional. In hierarchical agglomerative algorithms, clusters are constructed by iteratively merging the most similar clusters. These algorithms differ in how they compute cluster similarity. In single-link clustering, the similarity between two clusters is the similarity between their most similar members, while complete-link clustering uses the similarity between their least similar members. Average-link clustering computes this similarity as the average similarity between all pairs of elements across clusters. The complexity of these algorithms is O(n² log n), where n is the number of elements to be clustered [6]. Chameleon is a hierarchical algorithm that employs dynamic modeling to improve clustering quality [7].

When merging two clusters, one might consider the sum of the similarities between pairs of elements across the clusters (e.g. average-link clustering). A drawback of this approach is that the existence of a single pair of very similar elements might unduly cause the merger of two clusters. An alternative considers the number of pairs of elements whose similarity exceeds a certain threshold [2]. However, this may cause undesirable mergers when there are a large number of pairs whose similarities barely exceed the threshold. Chameleon clustering combines the two approaches.

K-means clustering is often used on large data sets since its complexity is linear in n, the number of elements to be clustered. K-means is a family of partitional clustering algorithms that iteratively assigns each element to one of K clusters according to the centroid closest to it and recomputes the centroid of each cluster as the average of the cluster's elements. K-means has complexity O(K·T·n), where T is the number of iterations, and is efficient for many clustering tasks. Because the initial centroids are randomly selected, the resulting clusters vary in quality. Some sets of initial centroids lead to poor convergence rates or poor cluster quality.

Bisecting K-means [19], a variation of K-means, begins with a set containing one large cluster consisting of every element and iteratively picks the largest cluster in the set, splits it into two clusters and replaces it by the split clusters. Splitting a cluster consists of applying the basic K-means algorithm α times with K = 2 and keeping the split that has the highest average element-centroid similarity (see the sketch at the end of this section).

Hybrid clustering algorithms combine hierarchical and partitional algorithms in an attempt to have the high quality of hierarchical algorithms with the efficiency of partitional algorithms.
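Since Bisecting K-means reappears in the experiments below, a compact illustration may help. The following NumPy sketch is our own minimal rendering of the procedure just described (α trial splits of the largest cluster, keeping the split with the highest average element-centroid similarity); it is not the authors' implementation, and the cosine-based inner 2-means is a simplifying assumption.

    import numpy as np

    def two_means(X, iters=5, rng=None):
        # Basic K-means with K = 2 on row-normalized vectors, so dot products
        # are cosine similarities. Returns (labels, avg element-centroid sim).
        rng = rng or np.random.default_rng()
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        centroids = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(iters):
            labels = (X @ centroids.T).argmax(axis=1)
            for k in (0, 1):
                if (labels == k).any():
                    c = X[labels == k].mean(axis=0)
                    centroids[k] = c / np.linalg.norm(c)
        sims = (X @ centroids.T)[np.arange(len(X)), labels]
        return labels, float(sims.mean())

    def bisecting_kmeans(X, K, alpha=2):
        # Start with one cluster holding every element; repeatedly split the
        # largest cluster, keeping the best of alpha trial splits.
        clusters = [np.arange(len(X))]
        while len(clusters) < K:
            clusters.sort(key=len)
            biggest = clusters.pop()
            trials = [two_means(X[biggest]) for _ in range(alpha)]
            labels, _ = max(trials, key=lambda t: t[1])
            # Assumes both halves are non-empty; production code would guard this.
            clusters += [biggest[labels == 0], biggest[labels == 1]]
        return clusters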
Buckshot [1] addresses the problem of randomly selecting initial centroids in K-means by combining it with average-link clustering. Cutting et al. claim its clusters are comparable in quality to those of hierarchical algorithms but with a lower complexity. Buckshot first applies average-link to a random sample of √n elements to generate K clusters. It then uses the centroids of these clusters as the initial K centroids of K-means clustering. The sample size counterbalances the quadratic running time of average-link to make Buckshot efficient: O(K·T·n + n·log n). The parameters K and T are usually considered to be small numbers.

CBC is a descendant of UNICON [13], which also uses small and tight clusters to construct initial centroids. We compare them in Section 4.4 after presenting the CBC algorithm.

3. WORD SIMILARITY

Following [12], we represent each word by a feature vector. Each feature corresponds to a context in which the word occurs. For example, "sip __" is a verb-object context. If the word wine occurred in this context, the context is a feature of wine. The value of the feature is the pointwise mutual information [14] between the feature and the word. Let c be a context and F_c(w) be the frequency count of a word w occurring in context c. The pointwise mutual information between c and w, $mi_{w,c}$, is defined as:

$$ mi_{w,c} = \log \frac{F_c(w)/N}{\left(\sum_i F_i(w)/N\right) \times \left(\sum_j F_c(j)/N\right)} \qquad (1) $$

where $N = \sum_i \sum_j F_i(j)$ is the total frequency count of all words and their contexts. A well-known problem with mutual information is that it is biased towards infrequent words/features. We therefore multiplied $mi_{w,c}$ with a discounting factor:

$$ \frac{F_c(w)}{F_c(w)+1} \times \frac{\min\left(\sum_i F_i(w),\ \sum_j F_c(j)\right)}{\min\left(\sum_i F_i(w),\ \sum_j F_c(j)\right)+1} \qquad (2) $$

We compute the similarity between two words $w_1$ and $w_2$ using the cosine coefficient [17] of their mutual information vectors:

$$ sim(w_1, w_2) = \frac{\sum_c mi_{w_1,c} \times mi_{w_2,c}}{\sqrt{\sum_c mi_{w_1,c}^2 \times \sum_c mi_{w_2,c}^2}} \qquad (3) $$
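Equations 1-3 translate directly into code. Below is a small Python sketch of the feature-vector construction, assuming the counts are held in a nested {word: {context: count}} dictionary (our representation, not the paper's); the function and variable names are ours.

    import math
    from collections import defaultdict

    def mi_vectors(freq):
        # freq: {word: {context: count}} -> {word: {context: discounted PMI}}.
        # Implements Eqs. 1-2: pointwise mutual information between a word and
        # a context, multiplied by the discounting factor for infrequent events.
        word_tot = {w: sum(cs.values()) for w, cs in freq.items()}
        ctx_tot = defaultdict(float)
        for cs in freq.values():
            for c, f in cs.items():
                ctx_tot[c] += f
        N = sum(word_tot.values())
        vectors = {}
        for w, cs in freq.items():
            vec = {}
            for c, f in cs.items():
                pmi = math.log((f / N) / ((word_tot[w] / N) * (ctx_tot[c] / N)))
                m = min(word_tot[w], ctx_tot[c])
                vec[c] = pmi * (f / (f + 1)) * (m / (m + 1))
            vectors[w] = vec
        return vectors

    def cosine(v1, v2):
        # Eq. 3: cosine coefficient of two sparse mutual-information vectors.
        dot = sum(x * v2[c] for c, x in v1.items() if c in v2)
        n1 = math.sqrt(sum(x * x for x in v1.values()))
        n2 = math.sqrt(sum(x * x for x in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0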

4. ALGORITHM

CBC consists of three phases. In Phase I, we compute each element's top-k similar elements. In our experiments, we used k = 10. In Phase II, we construct a collection of tight clusters, where the elements of each cluster form a committee. The algorithm tries to form as many committees as possible on the condition that each newly formed committee is not very similar to any existing committee. If the condition is violated, the committee is simply discarded. In the final phase of the algorithm, each element e is assigned to its most similar clusters.

4.1 Phase I: Find top-similar elements

Computing the complete similarity matrix between pairs of elements is obviously quadratic. However, one can dramatically reduce the running time by taking advantage of the fact that the feature vectors are sparse. By indexing the features, one can retrieve the set of elements that have a given feature. To compute the top similar elements of an element e, we first sort the features according to their pointwise mutual information values and then only consider a subset of the features with highest mutual information. Finally, we compute the pairwise similarity between e and the elements that share a feature from this subset. Since high mutual information features tend not to occur in many elements, we only need to compute a fraction of the possible pairwise combinations. Using this heuristic, similar words that share only low mutual information features will be missed by our algorithm. However, in our experiments, this had no visible impact on cluster quality.
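The inverted-index heuristic is what makes Phase I tractable; here is one way it might look in Python, reusing mi_vectors and cosine from the sketch in Section 3. The cap of 50 high-MI features per element is an illustrative choice on our part; the paper only says a subset of the highest mutual information features is used.

    from collections import defaultdict

    def top_similar_elements(vectors, k=10, n_features=50):
        # feature -> set of elements having that feature
        index = defaultdict(set)
        for w, vec in vectors.items():
            for c in vec:
                index[c].add(w)
        top = {}
        for w, vec in vectors.items():
            # Only compare w against elements sharing one of its high-MI features.
            best_feats = sorted(vec, key=vec.get, reverse=True)[:n_features]
            candidates = set().union(*(index[c] for c in best_feats)) - {w}
            sims = sorted(((cosine(vec, vectors[u]), u) for u in candidates),
                          reverse=True)
            top[w] = sims[:k]
        return top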
4.2 Phase II: Find committees

The second phase of the clustering algorithm recursively finds tight clusters scattered in the similarity space. In each recursive step, the algorithm finds a set of tight clusters, called committees, and identifies residue elements that are not covered by any committee. We say a committee covers an element if the element's similarity to the centroid of the committee exceeds some high similarity threshold. The algorithm then recursively attempts to find more committees among the residue elements. The output of the algorithm is the union of all committees found in each recursive step. The details of Phase II are presented in Figure 1.

In Step 1, the score reflects a preference for bigger and tighter clusters. Step 2 gives preference to higher quality clusters in Step 3, where a cluster is only kept if its similarity to all previously kept clusters is below a fixed threshold. In our experiments, we set θ1 = 0.35. Step 4 terminates the recursion if no committee is found in the previous step. The residue elements are identified in Step 5 and if no residues are found, the algorithm terminates; otherwise, we recursively apply the algorithm to the residue elements. Each committee that is discovered in this phase defines one of the final output clusters of the algorithm.

    Input:  A list of elements E to be clustered, a similarity database S from
            Phase I, thresholds θ1 and θ2.
    Step 1: For each element e ∈ E
              Cluster the top similar elements of e from S using average-link
              clustering. For each cluster c discovered, compute the following
              score: |c| × avgsim(c), where |c| is the number of elements in c
              and avgsim(c) is the average pairwise similarity between elements
              in c. Store the highest-scoring cluster in a list L.
    Step 2: Sort the clusters in L in descending order of their scores.
    Step 3: Let C be a list of committees, initially empty.
            For each cluster c ∈ L in sorted order
              Compute the centroid of c by averaging the frequency vectors of
              its elements and computing the mutual information vector of the
              centroid in the same way as we did for individual elements.
              If c's similarity to the centroid of each committee previously
              added to C is below a threshold θ1, add c to C.
    Step 4: If C is empty, we are done and return C.
    Step 5: For each element e ∈ E
              If e's similarity to every committee in C is below threshold θ2,
              add e to a list of residues R.
    Step 6: If R is empty, we are done and return C. Otherwise, return the
            union of C and the output of a recursive call to Phase II using
            the same input except replacing E with R.
    Output: a list of committees.

    Figure 1. Phase II of CBC.

4.3 Phase III: Assign elements to clusters

In Phase III, each element e is assigned to its most similar clusters in the following way:

    let C be a list of clusters, initially empty
    let S be the top-200 similar clusters to e
    while S is not empty {
      let c ∈ S be the most similar cluster to e
      if similarity(e, c) < σ, exit the loop
      if c is not similar to any cluster in C {
        assign e to c
        remove from e its features that overlap with the features of c
      }
      remove c from S
    }

When computing the similarity between a cluster and an element (or another cluster) we use the centroid of committee members as the representation for the cluster. This phase resembles K-means in that elements are assigned to their closest centroids. Unlike K-means, the number of clusters is not fixed and the centroids do not change (i.e. when an element is added to a cluster, it is not added to the committee of the cluster).
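The Phase III loop is short enough to render directly in Python. A minimal sketch under stated assumptions: clusters maps a cluster name to its committee centroid, cosine is the similarity from the Section 3 sketch, σ = 0.18 is the value used in the experiments below, and the "not similar to any cluster in C" test reuses the Phase II threshold θ1 = 0.35, which the pseudocode leaves unspecified (an assumption on our part).

    def assign_senses(e_vec, clusters, sigma=0.18, theta1=0.35, top=200):
        e = dict(e_vec)                    # working copy; features get removed
        # S: the top-200 clusters ranked by similarity to the original vector.
        S = dict(sorted(clusters.items(),
                        key=lambda nc: cosine(e_vec, nc[1]), reverse=True)[:top])
        C = {}                             # clusters (senses) assigned so far
        while S:
            # Most similar remaining cluster to the *current*, reduced e.
            name = max(S, key=lambda n: cosine(e, S[n]))
            centroid = S.pop(name)
            if cosine(e, centroid) < sigma:
                break                      # exit the loop
            if all(cosine(centroid, c) < theta1 for c in C.values()):
                C[name] = centroid         # assign e to the cluster
                for f in set(e) & set(centroid):
                    del e[f]               # remove overlapping features
        return list(C)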

The key to the algorithm for discovering senses is that once an element e is assigned to a cluster c, the intersecting features between e and c are removed from e. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses.

4.4 Comparison with UNICON

UNICON [13] also constructs cluster centroids using a small set of similar elements, like the committees in CBC. One of the main differences between UNICON and CBC is that UNICON only guarantees that the committees do not have overlapping members. However, the centroids of two committees may still be quite similar. UNICON deals with this problem by merging such clusters. In contrast, Step 3 in Phase II of CBC only outputs a committee if its centroid is not similar to any previously output committee.

Another main difference between UNICON and CBC is in Phase III of CBC. UNICON has difficulty discovering senses of a word when the word has a dominating sense. For example, in the newspaper corpus that we used in our experiments, the factory sense of plant is used much more frequently than its life sense. Consequently, the majority of the features of the word plant are related to its factory sense. This is evidenced in the following top-30 most similar words of plant:

    facility, factory, reactor, refinery, power plant, site, manufacturing plant, tree, building, complex, landfill, dump, project, mill, airport, station, farm, operation, warehouse, company, home, center, lab, store, industry, park, house, business, incinerator

All of the above, except the word tree, are related to the factory sense. Even though UNICON generated a cluster

    ground cover, perennial, shrub, bulb, annual, wildflower, shrubbery, fern, grass, ...

the similarity between plant and this cluster is very low. On the other hand, CBC removes the factory-related features from the feature vector of plant after it is assigned to the factory cluster. As a result, the similarity between the {ground cover, perennial, ...} cluster and the revised feature vector of plant becomes much higher.

5. EVALUATION METHODOLOGY

To evaluate our system, we compare its output with WordNet, a manually created lexicon.

5.1 WordNet

WordNet [15] is an electronic dictionary organized as a graph. Each node, called a synset, represents a set of synonymous words. The arcs between synsets represent hyponym/hypernym (subclass/superclass) relationships¹. Figure 2 shows a fragment of WordNet. The number attached to a synset s is the probability that a randomly selected noun refers to an instance of s or any synset below it. These probabilities are not included in WordNet. We use the frequency counts of synsets in the SemCor [9] corpus to estimate them. Since SemCor is a fairly small corpus (200K words), the frequency counts of the synsets in the lower part of the WordNet hierarchy are very sparse. We smooth the probabilities by assuming that all siblings are equally likely given the parent.

¹ WordNet also contains other semantic relationships such as meronyms (part-whole relationships) and antonyms; however, we do not use them here.

    Figure 2. Example hierarchy of synsets in WordNet along with each synset's probability: entity 0.395; inanimate-object 0.167; natural-object 0.0163; geological-formation 0.00176; natural elevation 0.000113; shore 0.0000836; hill 0.0000189; coast 0.0000216.

Lin [11] defined the similarity between two WordNet synsets s1 and s2 as:

$$ sim(s_1, s_2) = \frac{2 \log P(s)}{\log P(s_1) + \log P(s_2)} \qquad (4) $$

where s is the most specific synset that subsumes s1 and s2. For example, using Figure 2, if s1 = hill and s2 = shore then s = geological-formation and sim(hill, shore) = 0.626.
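Eq. 4 is the same information-content similarity that ships with NLTK's WordNet interface, so the hill/shore example can be approximated in a few lines. A sketch, assuming the nltk wordnet and wordnet_ic data packages are installed; NLTK's SemCor-based counts lack the sibling smoothing described above, so the result will be close to, but not exactly, 0.626.

    from nltk.corpus import wordnet as wn, wordnet_ic

    ic = wordnet_ic.ic('ic-semcor.dat')       # information content from SemCor
    hill, shore = wn.synset('hill.n.01'), wn.synset('shore.n.01')
    print(hill.lin_similarity(shore, ic))     # Eq. 4, s = lowest common subsumer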
5.2 Precision

For each word, CBC outputs a list of clusters to which the word belongs. Each cluster should correspond to a sense of the word. The precision of the system is measured by the percentage of output clusters that actually correspond to a sense of the word. To compute the precision, we must define what it means for a cluster to correspond to a correct sense of a word. To determine this automatically, we map clusters to WordNet senses.

Let S(w) be the set of WordNet senses of a word w (each sense is a synset that contains w). We define simW(s, u), the similarity between a synset s and a word u, as the maximum similarity between s and a sense of u:

$$ simW(s, u) = \max_{t \in S(u)} sim(s, t) \qquad (5) $$

Let c_k be the top-k members of a cluster c, where these are the k most similar members to the committee of c. We define the similarity between s and c, simC(s, c), as the average similarity between s and the top-k members of c:

$$ simC(s, c) = \frac{\sum_{u \in c_k} simW(s, u)}{k} \qquad (6) $$

Suppose a clustering algorithm assigns the word w to cluster c. We say that c corresponds to a correct sense of w if

$$ \max_{s \in S(w)} simC(s, c) \geq \theta \qquad (7) $$

In our experiments, we set k = 4 and varied the θ values. The WordNet sense of w that corresponds to c is then:

$$ \arg\max_{s \in S(w)} simC(s, c) \qquad (8) $$

It is possible that multiple clusters will correspond to the same WordNet sense. In this case, we only count one of them as correct. We define the precision of a word w as the percentage of correct clusters to which it is assigned. The precision of a clustering algorithm is the average precision of all the words.

5.3 Recall

The recall (completeness) of a word w measures the ratio between the correct clusters to which w is assigned and the actual number of senses in which w was used in the corpus. Clearly, there is no way to know the complete list of senses of a word in any nontrivial corpus. To address this problem, we pool the results of several clustering algorithms to construct the target senses. For a given word w, we use the union of the correct clusters of w discovered by the algorithms as the target list of senses for w. While this recall value is likely not the true recall, it does provide a relative ranking of the algorithms used to construct the pool of target senses. The overall recall is the average recall of all words.

5.4 F-measure

The F-measure [18] combines precision and recall aspects:

$$ F = \frac{2RP}{R+P} \qquad (9) $$

where R is the recall and P is the precision. F weights low values of precision and recall more heavily than higher values. It is high when both precision and recall are high.

6. EXPERIMENTAL RESULTS

In this section, we describe our experimental setup and present evaluation results of our system.

6.1 Setup

We used Minipar [10], a broad-coverage English parser, to parse about 1GB (144M words) of newspaper text from the TREC collection (1988 AP Newswire, 1989-90 LA Times, and 1991 San Jose Mercury) at a speed of about 500 words/second on a PIII-750 with 512MB memory. We collected the frequency counts of the grammatical relationships (contexts) output by Minipar and used them to compute the pointwise mutual information values from Section 3. (Minipar is available at www.cs.ualberta.ca/~lindek/minipar.htm.)

The test set is constructed by intersecting the words in WordNet with the nouns in the corpus whose total mutual information with all of their contexts exceeds a threshold (we used 50). Since WordNet has a low coverage of proper names, we removed all capitalized nouns. The resulting test set consists of 13403 words. The average number of features per word is 740.8.

We modified the average-link, K-means, Bisecting K-means and Buckshot algorithms of Section 2 since these algorithms only assign each element to a single cluster. For each of these algorithms, the modification is as follows:

    1. Apply the algorithm as described in Section 2.
    2. For each cluster c returned by the algorithm, create a centroid for c
       using all elements assigned to it.
    3. Apply MK-means using the above centroids.

where MK-means is the K-means algorithm, using the above centroids as initial centroids, except that each element is assigned to its most similar cluster plus all other clusters with which it has similarity greater than σ (as sketched below). We then use these modified algorithms to discover senses. These clustering algorithms were not designed for sense discovery. Like UNICON, when assigning an element to a cluster, they do not remove the overlapping features from the element. Thus, a word is often assigned to multiple clusters that are similar.
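The only change MK-means makes to the K-means assignment step is the rule below; a minimal sketch (names ours), reusing cosine from the Section 3 sketch.

    def mk_assign(e_vec, centroids, sigma):
        # Assign e to its most similar cluster, plus every other cluster whose
        # similarity to e exceeds sigma; this is what lets a word receive
        # multiple senses under the modified algorithms.
        sims = {name: cosine(e_vec, c) for name, c in centroids.items()}
        best = max(sims, key=sims.get)
        return [best] + [n for n, s in sims.items() if n != best and s > sigma]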
6.2 Word Sense Evaluation

We ran CBC and the modified clustering algorithms described in the previous subsection on the data set and applied the evaluation methodology from Section 5. Table 1 shows the results. For Buckshot and K-means, we set the number of clusters to 1000 and the maximum number of iterations to 5. For the Bisecting K-means algorithm, we applied the basic K-means algorithm twice (α = 2 in Section 2) with a maximum of 5 iterations per split. CBC returned 941 clusters and outperformed the next best algorithm by 7.5% on precision and 5.3% on recall.

    Table 1. Precision, Recall and F-measure on the data set for various
    algorithms with σ = 0.18 and θ = 0.25.

    ALGORITHM            PRECISION (%)   RECALL (%)   F-MEASURE (%)
    CBC                  60.8            50.8         55.4
    UNICON               53.3            45.5         49.2
    Buckshot             52.6            45.2         48.6
    K-means              48.0            44.2         46.0
    Bisecting K-means    33.8            31.8         32.8
    Average-link         50.0            41.0         45.0

In Section 5.2 we stated that a cluster corresponds to a correct sense of a word w if its maximum simC similarity with any synset in S(w) exceeds a threshold θ (Eq. 7).
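For concreteness, here is how Eqs. 5-8 combine into the per-word precision behind Table 1. This is our own sketch: S is assumed to be a callable returning the WordNet senses S(w), sim a callable implementing Eq. 4, and each cluster is given by its top-k member list.

    def simw(s, u, S, sim):
        # Eq. 5: max similarity between synset s and any sense of word u.
        return max((sim(s, t) for t in S(u)), default=0.0)

    def simc(s, members, S, sim, k=4):
        # Eq. 6: average simw over the top-k members of the cluster.
        return sum(simw(s, u, S, sim) for u in members[:k]) / k

    def word_precision(w, clusters, S, sim, theta=0.25):
        # Eqs. 7-8: a cluster is correct if its best-matching sense scores at
        # least theta; clusters mapping to the same sense count only once.
        matched, correct = set(), 0
        for members in clusters:
            scores = {s: simc(s, members, S, sim) for s in S(w)}
            if not scores:
                continue
            best = max(scores, key=scores.get)
            if scores[best] >= theta and best not in matched:
                matched.add(best)
                correct += 1
        return correct / len(clusters) if clusters else 0.0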

Figure 3 shows our experiments using different values of θ. The higher the θ value, the stricter we are in defining correct senses. Naturally, the systems' F-measures decrease when θ increases. The relative ranking of the algorithms is not sensitive to the choice of θ values. CBC has higher F-measure for all θ thresholds.

    Figure 3. F-measure of several algorithms (CBC, UNICON, Buckshot, K-means, Bisecting K-means, Average-link) with σ = 0.18 and varying θ thresholds from Eq. 7 (θ from 0.1 to 0.4).

For all sense discovery algorithms, we assign an element to a cluster if their similarity exceeds a threshold σ. The value of σ does not affect the first sense returned by the algorithms for each word because each word is always assigned to its most similar cluster. We experimented with different values of σ and present the results in Figure 4. With a lower σ value, words are assigned to more clusters. Consequently, the precision goes down while recall goes up. CBC has higher F-measure for all σ thresholds.

    Figure 4. F-measure of several algorithms with θ = 0.25 and varying σ thresholds (σ from 0.1 to 0.26).

6.3 Manual Evaluation

We manually evaluated a 1% random sample of the test data consisting of 133 words with 168 senses. Here is a sample of the instances that are manually judged for the words aria, capital and device:

    aria     S1: song, ballad, folk song, tune
    capital  S1: money, donation, funding, honorarium
    capital  S2: camp, shantytown, township, slum
    device   S1: camera, transmitter, sensor, electronic device
    device   S2: equipment, test equipment, microcomputer, video equipment

For each discovered sense of a word, we include its top-4 most similar words. The evaluation consists of assigning a tag to each sense as follows:

    ✓: The list of top-4 words describes a sense of the word that has not yet been seen
    +: The list of top-4 words describes a sense of the word that has already been seen (duplicate sense)
    −: The list of top-4 words does not describe a sense of the word

The S2 sense of device is an example of a sense that is evaluated with the duplicate sense tag.

Table 2 compares the agreements/disagreements between our manual and automatic evaluations. Our manual evaluation agreed with the automatic evaluation 88.1% of the time. This suggests that the evaluation methodology is reliable. Most of the disagreements (17 out of 20) were on senses that were incorrect according to the automatic evaluation but correct in the manual evaluation.

    Table 2. Comparison of manual and automatic evaluations of a 1% random
    sample of the data set.

                     MANUAL ✓   MANUAL −   MANUAL +
    AUTOMATIC ✓      104        2          0
    AUTOMATIC −      17         41         0
    AUTOMATIC +      0          1          3

The automatic evaluation misclassified these because sometimes WordNet misses a sense of a word and because of the organization of the WordNet hierarchy. Some words in WordNet should have high similarity (e.g. elected official and legislator) but they are not close to each other in the hierarchy.

Our manual evaluation of the sample gave a precision of 72.0%. The automatic evaluation of the same sample gave 63.1% precision. Of the 13,403 words in the test data, CBC found 869 of them polysemous.

7. DISCUSSION

We computed the average precision for each cluster, which is the percentage of elements in a cluster that correctly correspond to a WordNet sense according to Eq. 7. We inspected the low-precision clusters and found that they were low for three main reasons. First, some clusters suffer from part-of-speech confusion. Many of the nouns in our data set can also be used as verbs and adjectives. Since the feature vector of a word is constructed from all instances of that word (including its noun, verb and adjective usage), CBC outputs contain clusters of verbs and adjectives.
For example, the following cluster contains adjectives:

    weird, stupid, silly, old, bad, simple, normal, wrong, wild, good, romantic, tough, special, small, real, smart, ...

The noun senses of all of these words in WordNet are not similar. Therefore, the cluster has a very low 2.6% precision. In hindsight, we should have removed the verb and adjective usage features.

Secondly, CBC outputs some clusters of proper names. If a word that first occurs as a common noun also has a proper-noun usage, it will not be removed from the test data. For the same reasons as the part-of-speech confusion problem, CBC discovers proper name clusters but gets them evaluated as if they were common nouns (since WordNet contains few proper nouns). For example, the following cluster has an average precision of 10%:

    blue jay, expo, angel, mariner, cub, brave, pirate, twin, athletics, brewer

Finally, some concepts discovered by CBC are completely missing from WordNet. For example, the following cluster of government departments has a low precision of 3.3% because WordNet does not have a synset that subsumes these words:

    public works, city planning, forestry, finance, tourism, agriculture, health, affair, social welfare, transport, labor, communication, environment, immigration, public service, transportation, urban planning, fishery, aviation, telecommunication, mental health, procurement, intelligence, custom, higher education, recreation, preservation, lottery, correction, scouting

Somewhat surprisingly, all of the low-precision clusters that we inspected are reasonably good. At first sight, we thought the following cluster was bad:

    shamrock, nestle, dart, partnership, haft, consortium, blockbuster, whirlpool, delta, hallmark, rosewood, odyssey, bass, forte, cascade, citadel, metropolitan, hooker

By looking at the features of the centroid of this cluster, we realized that it is mostly a cluster of company names.

8. CONCLUSION

We presented a clustering algorithm, CBC, that automatically discovers word senses from text. We first find well-scattered tight clusters called committees and use them to construct the centroids of the final clusters. We proceed by assigning words to their most similar clusters. After assigning an element to a cluster, we remove their overlapping features from the element. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. Each cluster that a word belongs to represents one of its senses. We also presented an evaluation methodology for automatically measuring the precision and recall of discovered senses. In our experiments, we showed that CBC outperforms several well known hierarchical, partitional, and hybrid clustering algorithms. Our manual evaluation of sample CBC outputs agreed with 88.1% of the decisions made by the automatic evaluation.

9. ACKNOWLEDGEMENTS

The authors wish to thank the reviewers for their helpful comments. This research was partly supported by Natural Sciences and Engineering Research Council of Canada grant OGP121338 and scholarship PGSB207797.

10. REFERENCES

[1] Cutting, D. R.; Karger, D.; Pedersen, J.; and Tukey, J. W. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR-92. pp. 318-329. Copenhagen, Denmark.
[2] Guha, S.; Rastogi, R.; and Kyuseok, S. 1999. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of ICDE'99. pp. 512-521. Sydney, Australia.
[3] Harris, Z. 1985. Distributional structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47.
[4] Hindle, D. 1990. Noun classification from predicate-argument structures. In Proceedings of ACL-90. pp. 268-275. Pittsburgh, PA.
[5] Hutchins, J. and Somers, H. 1992. Introduction to Machine Translation. Academic Press.
[6] Jain, A. K.; Murty, M. N.; and Flynn, P. J. 1999. Data clustering: A review. ACM Computing Surveys 31(3):264-323.
[7] Karypis, G.; Han, E.-H.; and Kumar, V. 1999. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer: Special Issue on Data Analysis and Mining 32(8):68-75.
[8] Landauer, T. K. and Dumais, S. T. 1997. A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104:211-240.
[9] Landes, S.; Leacock, C.; and Tengi, R. I. 1998. Building semantic concordances. In WordNet: An Electronic Lexical Database, edited by C. Fellbaum. pp. 199-216. MIT Press.
[10] Lin, D. 1994. Principar - an efficient, broad-coverage, principle-based parser. In Proceedings of COLING-94. pp. 482-488. Kyoto, Japan.
[11] Lin, D. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of ACL-97. pp. 64-71. Madrid, Spain.
[12] Lin, D. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING/ACL-98. pp. 768-774. Montreal, Canada.
[13] Lin, D. and Pantel, P. 2001. Induction of semantic classes from natural language text. In Proceedings of SIGKDD-01. pp. 317-322. San Francisco, CA.
[14] Manning, C. D. and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
[15] Miller, G. 1990. WordNet: An online lexical database. International Journal of Lexicography, 1990.
[16] Pasca, M. and Harabagiu, S. 2001. The informative role of WordNet in Open-Domain Question Answering. In Proceedings of NAACL-01 Workshop on WordNet and Other Lexical Resources. pp. 138-143. Pittsburgh, PA.
[17] Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.
[18] Shaw Jr., W. M.; Burgin, R.; and Howell, P. 1997. Performance standards and evaluations in IR test collections: Cluster-based retrieval methods. Information Processing and Management 33:1-14.
[19] Steinbach, M.; Karypis, G.; and Kumar, V. 2000. A comparison of document clustering techniques. Technical Report #00-034, Department of Computer Science and Engineering, University of Minnesota.
[20] Voorhees, E. M. 1998. Using WordNet for text retrieval. In WordNet: An Electronic Lexical Database, edited by C. Fellbaum. pp. 285-303. MIT Press.