Keyword-based Document Clustering


Seung-Shik Kang
School of Computer Science, Kookmin University & AITrc
Chungnung-dong, Songbuk-gu, Seoul 136-702, Korea
sskang@kookmin.ac.kr

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

Abstract

Document clustering is an aggregation of related documents into a cluster, based on a similarity evaluation between documents and the representatives of clusters. Terms and their discriminating features are the clue to the clustering, and the discriminating features are based on term and document frequencies. Feature selection on the basis of frequency statistics limits how far the clustering algorithm can be improved, because it does not consider the contents of the cluster objects. In this paper, we adopt a content-based analytic approach to refine the similarity computation and propose a keyword-based clustering algorithm. Experimental results show that content-based keyword weighting outperforms the frequency-based weighting method.

Keywords: Document Clustering, Weighting Scheme, Feature Selection

1 Introduction

Document clustering is an aggregation of documents by discriminating the relevant documents from the irrelevant ones. The relevance of any two documents is determined by a similarity measure and the representatives of the documents [2, 3, 4]. There are several similarity measures, such as the Dice coefficient, Jaccard's coefficient, and the cosine measure. These similarity measures require that the documents be represented as document vectors, and the similarity of two documents is calculated from an operation on the document vectors. In general, the representatives of a document or a cluster are document vectors that consist of <term, weight> pairs, and the document similarities are determined by the terms and their weighting values that are extracted from the document [7, 9].

In the previous studies on document clustering, we focused on the clustering algorithm, but the document representation methodology was not the important issue.
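The three similarity measures named in the introduction can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard set-based definitions of the Dice and Jaccard coefficients and the standard cosine over sparse <term, weight> vectors, and the function names are hypothetical.

```python
import math

def dice(a: set, b: set) -> float:
    """Dice coefficient on term sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient on term sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u: dict, v: dict) -> float:
    """Cosine measure on sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

The set-based measures only need the terms themselves, while the cosine measure also uses the weights, which is why the choice of weighting scheme matters for the similarity result.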
Document vectors are simply constructed from the term frequency (TF) and the inverted document frequency (IDF). This term weighting method starts from the precondition that the terms or keywords representing the document are calculated by TF-IDF. Term weighting by TF-IDF is generally used to construct a document vector, but we cannot say that it is the best way of representing a document. So we suppose that there is a limit to improving the accuracy of the clustering system only by improving the clustering algorithm, without changing the document/cluster representation method.

Also, document clustering requires a large amount of memory to keep the representatives of documents/clusters and the similarity measures [6, 8]. Given N documents to be clustered, an N × N similarity matrix is needed to store the document similarity measures. In addition, the recursive iteration of similarity calculation and reconstruction of the cluster representatives needs a huge number of computations.

In this paper, we propose a new clustering method that is based on the keyword weighting approach. The clustering algorithm starts from seed documents, and the cluster is expanded by the keyword relationship. The evolution of the cluster stops when no more documents are added to the cluster and irrelevant documents have been removed from the cluster candidates.

2 Keyword-based Weighting Scheme

In general, the construction of a document vector depends on the term frequency and document frequency. If keywords are determined by the frequency information of the document, we are apt to generate an error: nouns that are used often regardless of the substance of the document, and words of a high frequency, are extracted. The clustering method that is focused on similarity calculation considers all words except stopwords as the representative of the document, and constitutes a document vector whose weight values are calculated from the term frequency and document frequency.
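As a point of reference for the frequency-based representation described above, the following sketch builds TF-IDF document vectors. The paper does not give its exact formula, so the common textbook variant tf × log(N / df) is assumed here, and the function name and input layout (one token list per document) are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs, stopwords=frozenset()):
    """Return one sparse {term: weight} vector per tokenized document.

    Assumes weight = tf * log(N / df); other TF-IDF variants exist.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update({t for t in tokens if t not in stopwords})
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t not in stopwords)
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors
```

Note that a term occurring in every document gets weight 0 under this variant, which illustrates the purely statistical character of the scheme: the weight says nothing about the role the term plays inside a document.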
It is common that terms and their weight values represent a document, and <term, weight> pairs are the unique elements of the document vector. When we construct a document vector, term frequency and document frequency are the most important features for calculating the weight of a term. As for the terms and their weight values, the weight value of a term means a ranking score, just as an importance factor to the document. So the term weighting can be seen as an evaluation of the term as a keyword or a stopword to the document. The weighting function w(t) from a term to its weight is described in expression (1).

$$ w : \text{term} \rightarrow \text{weight}, \qquad w(t) = \begin{cases} 0 & \text{if } t \text{ is a stopword} \\ 1 & \text{if } t \text{ is a keyword} \\ a & \text{otherwise, } 0 < a < 1 \end{cases} \tag{1} $$

For the weighting scheme of terms, there are two points of view on the representation of a document: (1) a discriminative value that distinguishes or characterizes the document from others; (2) an importance measure as a keyword or a stopword.

Frequency-based term weighting (FBW) is a statistical measure of terms in an inter-document relationship. This weighting scheme is a very efficient method for distinguishing and characterizing a document from others, and it performs well for applications of document classification or clustering in information retrieval systems. The only evaluation measure for characterizing a document in the frequency-based weighting scheme is frequency statistics, but term frequencies are not the best measure for characterizing the document by terms.

Another weighting scheme is the keyword-based term weighting (KBW) method, which is based on the keyword importance factors in a document. It is an analytic approach that analyses the contents of a document to get a keyword list from the document. The weight value of a word is calculated from its importance factors as a keyword in the document: it is a combination of keyword-weighting factors, and the terms are ordered by the keyword ranking score. The ranking scores in this weighting scheme are calculated from the analysis results of the document. Keyword-based term weighting will be a good solution to overcome the limitation of the frequency-based weighting scheme. Keywords in a text are the terms that represent a document, and the candidate keywords are extracted from the analysis results of the document.
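The weighting function of expression (1) and the ordering of terms by ranking score can be illustrated with a short sketch. The boundary values (0 for a stopword, 1 for a keyword) and the default a = 0.5 are assumptions made for illustration, since the extracted text does not preserve them, and both function names are hypothetical.

```python
def w(term, keywords, stopwords, a=0.5):
    """Piecewise term weight in the spirit of expression (1).

    Assumed values: 0 for a stopword, 1 for a keyword, and an
    intermediate constant `a` for every other term.
    """
    if term in stopwords:
        return 0.0
    if term in keywords:
        return 1.0
    return a

def rank_terms(terms, keywords, stopwords, a=0.5):
    """Order distinct terms by descending weight, ties broken alphabetically."""
    scored = {t: w(t, keywords, stopwords, a) for t in set(terms)}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
```

In a keyword-based scheme, the keyword set would come from content analysis of the document rather than from a fixed list, so the same term can receive different weights in different documents.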
Keyword ranking depends on several factors of a term, such as the type of the document and the location and role of words in a sentence or a paragraph [5]. Thematic words of a document are representative terms for the document. Thematic words are extracted from a text by analysing its contents, but keyword extraction depends on the type of text. Keywords are easily found in the title or the abstract of a research paper, which consists of a title, abstract, body, experiment, and conclusion. Also, a newspaper article contains keywords in the title or the first part of the text.

There are some clues for determining a keyword, and we may classify them as word-level, sentence-level, paragraph-level, and text-level features. Word-level features are the type of part-of-speech and case-role information. The part-of-speech of a Korean noun is divided into common noun, compound noun, proper noun, and numeral. Syntactic or sentence-level features are the type of a phrase or a clause, sentence location, and sentence type. From the rhetoric word in a sentence, the importance of the sentence is computed, and the terms in a sentence are affected by the type of the sentence. Also, the weight of a term in the subjective clause is not equal to that of the same term appearing in an auxiliary clause or in a modifying clause. A basic term weight is assigned by the type of a term and recomputed by the features that accompany it in the text. That is, the weight value of a term is also determined by the characteristics of the word, sentence, phrase, and clause from which the term is extracted.

3 Keyword-based Document Clustering

Keyword-based document clustering creates a cluster by the keywords of each document. Suppose that $C$ is the set of clusters that is finally created by the clustering algorithm. If $n$ is the number of clusters in $C$, then $C$ is a set of clusters $C_1, C_2, \ldots, C_n$.

$$ C = \{ C_1, C_2, \ldots, C_n \} $$

Each cluster $C_i$ is initialised by a document $d$ that is not assigned to the existing clusters, and $d$ is a seed document of $C_i$.
When a new cluster is created, expansion and reduction steps are repeated until it reaches a stable state from the start state. In each evolution step for cluster $C_i$, $C_i^j$ is the $j$-th state of $C_i$.

$C_i^j$: the $j$-th state of a cluster $C_i$

The characteristic vector of a cluster is a set of <keyword, weight value> pairs that represents the cluster. If $K_D$ is the keyword set of a document $D$ and $K_i$ is the keyword set of cluster $C_i$, then $K_i^j$ is the $j$-th state of $K_i$. Figure 1 shows a keyword-based clustering algorithm for the cluster $C_i$. Given the keyword sets for each document, cluster $C_i$ is created by the self-expanding algorithm.

3.1 Cluster Initialisation

The first step of the clustering algorithm is the creation and initialisation of a new cluster. A document $D$ is selected that does not belong to any other cluster, and it is assigned to a new cluster $C_i^0$ that is the initial state of cluster $C_i$.

$$ C_i^0 = \{ D \} $$

At this time, the document $D$, which is the first document in the new cluster, is called a seed document (or an initialisation document). The seed document is randomly selected among the documents that do not belong to the clusters $C_1$ ~ $C_{i-1}$. The keyword set $K_D$ of a document $D$ is the set of keywords $k_1, k_2, \ldots, k_n$ that are extracted from document $D$. The initial state of the keyword set, $K_i^0$, is initialised by $K_D$.

$$ K_i^0 = K_D, \qquad K_D = \{ k \mid k \text{ is a keyword that is extracted from } D \} $$

    C_i^0 = { D }
    K_i^0 = K_D
    C_i^1 = { x | document x where k ∈ K_x for k such that k ∈ K_i^0 }
    j = 1
    do {
        K_i^j = ∪ K_x  where x ∈ C_i^j
        C_i^(j+1) = C_i^j
        for all x ∈ C_i^j
        begin
            s = sim(x, K_i^j)
            if (s < threshold)  C_i^(j+1) = C_i^(j+1) − { x }
        end for
        j = j + 1
    } while ( deleteDocument() )
    C_i = C_i^j

Figure 1. Keyword-based clustering algorithm

3.2 Expanding the Cluster

In the initialisation step, a new cluster $C_i^0$, the initial state of cluster $C_i$, is established from the seed document, and the keyword set $K_i^0$ is initialised by the keyword set of the seed document. In the expanding step, the cluster is expanded by adding more related documents to it: documents that include the keywords of the seed document are treated as documents related to the seed document. That is, the cluster is expanded by adding all the documents in which a keyword of $K_i^0$ (the keywords extracted from the seed document) appears to the cluster $C_i^1$, which is the next state of cluster $C_i^0$.

$$ C_i^1 = \{ x \mid k \in K_x \text{ for some } k \in K_i^0 \}, \qquad \text{where } K_i^0 = K_D $$

The cluster expansion is performed by iterating keyword expansion and cluster expansion. More documents are added to a cluster by the similarity evaluation between the keyword set and the document. If a new document is added to a cluster, then the keywords in the added document are also added to the keyword set of the cluster. The first expansion is performed by the keyword set extracted from the seed document. The second expansion is performed by the new keywords that are added to the cluster as a result of the first expansion. In general, the $j$-th expansion is performed by the $(j-1)$-th state of the keyword set. The number of iterations is decided through the experiment.
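The expansion step described above can be sketched as follows. This is only an illustration under stated assumptions: the input layout (a mapping from document id to keyword set and an inverted mapping from keyword to document ids), the `rounds` parameter, and the function name are all hypothetical, and the similarity check that the paper applies while adding documents is left out to keep the sketch short.

```python
def expand_cluster(seed, doc_keywords, inverted, rounds=1):
    """Grow a cluster from a seed document by keyword overlap.

    doc_keywords: {doc_id: set of keywords}
    inverted:     {keyword: set of doc_ids containing it}
    Each round adds every document sharing a keyword with the current
    keyword set, then rebuilds the keyword set as the union over members.
    """
    cluster = {seed}
    keywords = set(doc_keywords[seed])
    for _ in range(rounds):
        candidates = set(cluster)
        for k in keywords:
            candidates |= inverted.get(k, set())
        if candidates == cluster:
            break  # stable: no new documents were added
        cluster = candidates
        keywords = set().union(*(doc_keywords[d] for d in cluster))
    return cluster, keywords
```

With `rounds=1` only documents sharing a keyword with the seed are collected; larger values let the cluster reach documents that are connected to the seed through intermediate keywords, which is why the number of iterations has to be tuned.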
When a cluster is expanded from $C_i^{j-1}$ to $C_i^j$, the keyword set $K_i^{j-1}$ is also expanded to a new keyword set $K_i^j$ that appears in all the documents of the cluster $C_i^j$. The keyword set $K_i^j$ of $C_i^j$ is the union of all the keyword sets of the documents in $C_i^j$.

$$ K_i^j = \bigcup_{x \in C_i^j} K_x $$

The keyword set of the cluster is used to calculate the characteristic vector of each cluster. The characteristic vector consists of the weight values calculated from the term frequency (TF) and inverted document frequency (IDF) of the keywords, and it is used to calculate the similarity measure between a document and the cluster.

3.3 Cluster Reduction and Completion

This step produces a complete cluster by removing the documents that are not related to the cluster. For the cluster $C_i^j$, documents with a low similarity to the cluster are identified through the similarity computation with the cluster and removed. The result of cluster reduction is a filtering of the documents that are not related to the cluster $C_i^j$, and the cluster $C_i^{j+1}$ is generated as the next step of the cluster $C_i^j$. Ultimately, the cluster $C_i$ is completed, consisting of the related documents that remain after filtering out the non-related documents. If a cluster $C_i$ is completed, the next cluster $C_{i+1}$ is created through the same process. Clustering is terminated if all the documents are clustered or no more clusters are created.
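A possible reading of the reduction step is sketched below. It is an assumption-laden illustration rather than the paper's method: the cluster's characteristic vector is approximated by the sum of its members' weight vectors, similarity is the cosine measure, and the function names and the `threshold` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine measure on sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def characteristic_vector(vectors):
    """Sum member vectors as a stand-in for the cluster's characteristic vector."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return total

def reduce_cluster(candidates, doc_vectors, threshold):
    """Repeatedly drop documents whose similarity to the cluster is below threshold."""
    cluster = set(candidates)
    while cluster:
        center = characteristic_vector(doc_vectors[d] for d in cluster)
        keep = {d for d in cluster if cosine(doc_vectors[d], center) >= threshold}
        if keep == cluster:
            break  # stable state: nothing was removed in this pass
        cluster = keep
    return cluster
```

Recomputing the characteristic vector after each removal matters, because an unrelated document distorts the vector that the remaining documents are compared against.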

[Figure 2. Overall architecture of keyword-based clustering. Stages shown: Input Document, Keyword Extraction, Create Inverted-File, Create a Cluster, Init. Cluster, Expand Cluster, Reduce/Complete Cluster, Clusters.]

4 Design and Implementation

The structure of a keyword-based clustering system is shown in Figure 2. At first, keywords are extracted from each input document and their weight values are computed. Keywords and their scores are stored in an inverted-file structure. The inverted-file structure is well suited to expanding the cluster, that is, to adding the documents that include a keyword to the initial cluster.

Figure 3 shows an example of the operation of the document clustering system: initialization, expansion, reduction, and completion of clusters. A new cluster is created, and it includes a seed document $D$. The initial set of keywords for the initial state of the cluster is the keyword set $K$ of document $D$.

$$ K = \{ T_1, T_2, \ldots, T_n \} $$

For the terms in $K$, documents that contain the same term are added as candidate documents of the cluster. Let the candidate documents be $D_{1a}, D_{1b}, D_{2a}, D_{2b}, \ldots, D_{na}, D_{nb}$; then $D_{xy}$ is a document that is added through expansion by term $T_x$. The keyword set of the cluster is reconstructed from the new set of documents. In each step of the cluster expansion, the number of keywords used for the expansion and the threshold of the weight value are decided through experiments, considering the maximum number of document candidates in a cluster. Also, the <keyword, weight> pairs, as an intermediate representative of the cluster, are a very important factor in the cluster expansion.

[Figure 3. Example of keyword-based clustering]

Now a new keyword set that is limited to the cluster candidates is constructed to get the cluster documents. Through the similarity calculation between each document and the candidate centroid of the cluster, relevant documents are selected to be members of the cluster.
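The inverted-file structure mentioned in this section can be sketched as a mapping from each keyword to the documents that contain it, together with the keyword's score in each document. The layout, the `min_score` parameter, and both function names are assumptions for illustration; the paper does not describe its storage format.

```python
from collections import defaultdict

def build_inverted_file(doc_keywords):
    """Map each keyword to a posting list of (doc_id, score) pairs.

    doc_keywords: {doc_id: {keyword: score}}
    Postings are sorted by descending score, then by doc_id.
    """
    inverted = defaultdict(list)
    for doc_id, scores in doc_keywords.items():
        for keyword, score in scores.items():
            inverted[keyword].append((doc_id, score))
    for postings in inverted.values():
        postings.sort(key=lambda p: (-p[1], p[0]))
    return dict(inverted)

def documents_with(keyword, inverted, min_score=0.0):
    """Return ids of documents whose score for `keyword` is at least min_score."""
    return [doc_id for doc_id, score in inverted.get(keyword, []) if score >= min_score]
```

Looking up candidate documents for a seed keyword is then a single dictionary access, which is what makes the expansion step cheap compared with computing an N × N similarity matrix.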
Through the iterations of keyword selection and reconstruction of the related documents, a new cluster is completed that reaches a stable status with a strong relationship between the keyword set and the document set.

5 The Experiments

We implemented our clustering algorithm and applied it to the clustering of similar documents. The test documents for the experiment were collected from three days of newspaper articles. The total number of articles is 383, and on average 32 terms are extracted from each article. We performed document clustering by applying different criteria for term selection: 1) frequency-based term selection; 2) percentage-based keyword selection; and 3) keyword selection by an absolute number of keywords.

Figure 4 shows the result of similarity clustering by frequency-based term selection. In this experiment, three types of term selection are performed.

- all terms are used for the clustering
- terms with a frequency of more than 2
- terms with a frequency of more than 3

In each experiment, we varied the similarity decision ratio by the percentage of term matches. Figure 4 shows that term selection by frequency 2 or 3 is not good for the representation of a document.

[Figure 4. Frequency-based keyword selection (axis label: term match ratio)]

In the experiment on percentage-based keyword selection, terms with high weight values are selected for the similarity calculation of the document. All the curves in Figure 5 have a similar shape except for the % selection. In the case of the % selection, we guess that fewer than % of the keywords are not sufficient for the similarity decision, and auxiliary keywords are also needed for accuracy. Another point in this experiment is that 3%~6% keyword selection gave better results than the selection of all terms.

[Figure 5. Percentage-based keyword selection (axis label: term match ratio)]

We compared the F-measure for the selection of a maximum number of keywords. All the experiments in Figure 6 gave better results than the experiment using all the terms in the document. Also, 3~7 keywords with a 6%~7% match ratio gave good performance for the comparison of document similarity.

[Figure 6. Keyword selection by maximum (axis label: term match ratio)]

6 Conclusion

It is common for a clustering algorithm to be based on similarity computation by frequency-based statistics to aggregate the related documents. This metric is an important factor for term weighting. We proposed a term weighting method that is based on keyword features, and we tried to compensate for the drawback of the frequency-based metric. Based on the keyword weighting scheme, documents with the same keywords are grouped into a cluster candidate, and a new cluster is created by removing irrelevant documents. We performed an experiment on the clustering of similar documents, and the results showed that the keyword-based weighting scheme is better than the frequency-based method.
Our keyword-based algorithm uses 3%~6% of the terms for clustering, and the similarity matrix is not a necessity, so it should be good for clustering a huge number of documents. We also expect that this algorithm will be good for the topic tracking of special events. In the experiment, we randomly selected a seed document, and the result is a bit sensitive to the choice of seed document. So our next research will focus on minimizing the effect of the seed document by getting representative keywords before starting the clustering.

References

[1] Anderberg M. R., Cluster Analysis for Applications, New York: Academic, 1973.
[2] Can F. and E. A. Ozkarahan, Dynamic Cluster Maintenance, Information Processing & Management, Vol. 25, pp. 275-291, 1989.
[3] Dubes R. and A. K. Jain, Clustering Methodologies in Exploratory Data Analysis, Advances in Computers, Vol. 19, pp. 113-227, 1980.
[4] Frakes W. B. and R. Baeza-Yates, Information Retrieval, Prentice Hall, 1992.
[5] Kang S. S., H. G. Lee, S. H. Son, G.. Hong and B. J. Moon, Term Weighting Method by Postposition and Compound Noun Recognition, Proceedings of 13th Conference on Korean Language Computing, pp. 96-98, 2001.
[6] Murtagh F., Complexities of Hierarchic Clustering Algorithms: State of the Art, Computational Statistics Quarterly, Vol. 1, pp. 101-113, 1984.
[7] Perry S. A. and P. Willett, A Review of the Use of Inverted Files for Best Match Searching in Information Retrieval Systems, Journal of Information Science, Vol. 6, pp. 59-66, 1983.
[8] Sibson R., SLINK: an Optimally Efficient Algorithm for the Single-Link Cluster Method, Computer Journal, Vol. 16, pp. 328-342, 1973.
[9] Willett P., Document Clustering Using an Inverted File Approach, Journal of Information Science, Vol. 2, pp. 223-231, 1980.
[10] Willett P., Recent Trends in Hierarchic Document Clustering: A Critical Review, Information Processing and Management, Vol. 24, No. 5, pp. 577-597, 1988.