A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval


Information and Management Sciences, Volume 18, Number 4, pp. 299-315, 2007

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval

Liang-Yu Chen, National Taiwan University of Science and Technology, R.O.C.
Shyi-Ming Chen, National Taiwan University of Science and Technology, R.O.C.

Abstract

In this paper, we present a new approach for automatic thesaurus construction and query expansion for document retrieval. We analyze the information between any two terms in each document cluster center of the final document clusters or the intermediate document clusters in the clustering process to automatically construct the thesaurus, where this information includes the co-occurrence frequency of any two terms in each document cluster center, the degree of effect of each term in each document cluster center, and the inner noise of each document cluster. We also present a query expansion method to expand the user's queries and a new method to calculate the degree of similarity between the user's query and the documents. The proposed thesaurus construction method and the proposed query expansion method can improve the performance of information retrieval systems for document retrieval.

Keywords: Document Retrieval, Query Expansion, Thesaurus Construction, Query Terms, Vector Space Models, Document Clusters.

1. Introduction

A thesaurus is commonly used in information retrieval (IR) systems [1], where a thesaurus is composed of a set of terms (phrases or words) plus a set of relationships between these terms. A system can expand users' queries based on the constructed thesaurus. There are two types of thesauri, i.e., manually constructed thesauri and automatically constructed thesauri. A manually constructed thesaurus is built by domain experts, who define the relationships between any two terms. The major problem with manually constructed thesauri is that they are expensive to build and hard to update.
Furthermore, even the same expert may define different relationships between two terms at different times. (Received September 2005; Revised January 2006; Accepted March 2006. Supported in part by the National Science Council, Republic of China, under Grant NSC 93-2213-E-011-018.)

By contrast, the construction of an automatic thesaurus is more objective. Different methods have been proposed for constructing thesauri automatically, e.g., the Similarity Thesaurus [18] and the Phrasefinder [14]. Existing techniques for automatic query expansion can be categorized as either global or local. The local query expansion technique uses a small number of retrieved top-ranked documents of a query to expand the query [3], [8]. But if only a few of the top-ranked documents retrieved by the original user's query are relevant, the retrieval performance will decrease seriously. In recent years, query expansion methods based on users' relevance feedback have been proposed [4], [13], which analyze the relevant documents identified by the user to deal with query expansion for improving the retrieval performance. The global query expansion technique requires some statistics which take a considerable amount of computing resources to obtain, such as the co-occurrence data about all possible pairs of terms in a corpus. One of the earliest global techniques is the term-clustering technique [15], which groups words into clusters based on their co-occurrences and uses the clusters for query expansion. In [2], Billhardt et al. presented a context vector model for information retrieval. In [10], He et al. presented a mining process to extract document cluster knowledge from the Web Citation Database to support the retrieval of Web publications. In [16], Kalczynski and Chu presented a temporal document retrieval model for business news archives, where the classical vector space model is extended to a temporal document retrieval model that incorporates fuzzy representations of temporal expressions. In this paper, we present a new approach for automatic thesaurus construction and query expansion for document retrieval.
We analyze the information between any two terms in each document cluster center of the final document clusters or the intermediate document clusters in the clustering process to automatically construct the thesaurus, where this information includes the co-occurrence frequency of any two terms in each document cluster center, the degree of effect of each term in each document cluster center, and the inner noise of each document cluster. We also present a query expansion method to expand the user's queries and a new method to calculate the degree of similarity between the user's query and the documents. The proposed thesaurus construction method and the proposed query expansion method can improve the performance of information retrieval systems for document retrieval. The rest of this paper is organized as follows. In Section 2, we briefly review the vector space model [19] used in information retrieval systems and the document

cluster method we presented in [5]. In Section 3, we present a new method for automatic thesaurus construction based on document clusters. In Section 4, we present a new query expansion method for document retrieval based on the constructed thesaurus. In Section 5, we analyze the experimental results of the proposed method. The conclusions are discussed in Section 6.

2. Preliminaries

In the vector space model [19], a document d_k can be represented as a document vector d_k = (w_{1k}, w_{2k}, ..., w_{nk}), where n denotes the number of terms appearing in document d_k and w_{ik} denotes the weight of term t_i in document d_k. In [19], Salton used formula (1) to calculate the weight w_{ik} of term t_i in document d_k, where formula (2) is the inverse document frequency:

w_{ik} = \frac{tf_{ik}}{\max_j tf_{jk}} \times IDF_i,   (1)

IDF_i = \log_{10} \frac{N}{n_i},   (2)

where IDF_i denotes the inverse document frequency of term t_i, N denotes the number of documents in the database, n_i denotes the number of documents which contain term t_i, and tf_{ik} denotes the frequency of term t_i appearing in document d_k.

In [5], we presented a fuzzy hierarchical clustering method based on dynamic document cluster centers to cluster documents. We used the terms in documents to construct a document cluster center of documents. The number of terms in a document cluster center will be different when document clusters are merged, and the terms in a document cluster center will affect the degree of similarity between two document clusters. In the following, we briefly describe some characteristics of a dynamic document cluster center [5]:

(1) During the generating or merging process of a document cluster, every term in the document cluster center is associated with a value of relative time to live (RTL), where the RTL value is the main factor determining whether a term can stay in the document cluster center or not.

(2) During the generating or merging process of a document cluster, every term in the document cluster center is associated with a value of degree of effect. The higher the value of the degree of effect, the more significant the term is with respect to the document cluster.

(3) The number of terms in a document cluster center will increase or decrease dynamically, depending on the merging of single or multiple document clusters in the merging process.

In [5], we used formula (3) to calculate the degree of similarity sim(C_i, C_j) between two document clusters C_i and C_j:

sim(C_i, C_j) = \frac{2S}{A + B},   (3)

where

A = \frac{\sum_{k=1}^{s} Min(v_{ik}, v_{jk})\, T(w_{ik}, w_{jk})}{\sum_{k=1}^{s} v_{ik}}, \qquad B = \frac{\sum_{k=1}^{s} Max(v_{ik}, v_{jk})\, T(w_{ik}, w_{jk})}{\sum_{k=1}^{s} v_{jk}},

S = \sum_{k=1}^{s} Min(v_{ik}, v_{jk}), \qquad T(w_{ik}, w_{jk}) = 1 - |w_{ik} - w_{jk}|,

w_{ik} denotes the weight of term t_k in cluster C_i, w_{jk} denotes the weight of term t_k in cluster C_j, v_{ik} and v_{jk} denote the values of the degree of effect of term t_k in clusters C_i and C_j, respectively, and s denotes the number of identical terms. In formula (3), we considered both the weight and the degree of effect of a term to calculate the degree of similarity between two document clusters.

3. An Automatic Thesaurus Construction Algorithm

In this section, we present an automatic thesaurus construction algorithm. The thesaurus is constructed as a network based on the document clustering techniques we presented in [5]. In the constructed thesaurus network, every term is represented as a node, and the relationship between any two terms is represented by a link associated with a degree of relationship; a higher degree of relationship between two terms indicates a stronger relationship between them. There are some automatic global thesaurus construction methods based on document clusters. Some of them only consider the intermediate clusters (i.e., the document clusters formed in the middle of the clustering process) [9]; others only analyze the final clusters (i.e., the document clusters formed in the final stage of the clustering process) [16]. In the following, we describe the characteristics of a document cluster in the clustering process. In general, there are

usually a few documents in the intermediate clusters, and the similarity between any two documents is usually high. In other words, the documents in intermediate clusters are usually closely related. On the other hand, the final clusters usually contain more documents than the intermediate clusters, and the degree of similarity between two documents is usually low. In either case, information may be lost, whether the system considers only the intermediate clusters or only the final clusters. For example, it is hard to gather all documents belonging to the same category into one cluster. In general, when the number of documents in a document cluster increases, the inner noise (i.e., irrelevant documents) in the document cluster increases at the same time. Therefore, a method that only considers final clusters usually cannot extract the most closely related terms and cannot precisely establish the degree of relationship between related terms. On the other hand, if a method only considers intermediate clusters, then it usually cannot extract related and important terms, because intermediate clusters do not contain enough documents. In this paper, we present an automatic thesaurus construction algorithm based on all of the document clusters except the initial clusters. In order to calculate the degree of relationship between any two terms more precisely, we also present a method to calculate the degree of relationship between any two terms based on the co-occurrence frequency of these two terms and the degree of effect of a term in a document cluster.
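As a concrete reference for the base quantities used in the rest of the paper, the term weighting of formulas (1) and (2) and the weight-agreement measure T of formula (3) can be sketched as follows. This is a minimal illustration with made-up counts; the helper names are ours, not from the paper:

```python
import math

def idf(N, n_i):
    # Formula (2): IDF_i = log10(N / n_i), with N documents in the database
    # and n_i documents containing term t_i.
    return math.log10(N / n_i)

def term_weight(tf_ik, max_tf_jk, N, n_i):
    # Formula (1): w_ik = (tf_ik / max_j tf_jk) * IDF_i, i.e. the term
    # frequency normalized by the most frequent term of d_k, times IDF.
    return (tf_ik / max_tf_jk) * idf(N, n_i)

def T(w_a, w_b):
    # Weight-agreement measure used in formula (3): T(w, w') = 1 - |w - w'|.
    return 1 - abs(w_a - w_b)

# Toy numbers: a 1000-document database, term t_i appearing in 10 documents,
# occurring 3 times in d_k whose most frequent term occurs 6 times.
w_ik = term_weight(3, 6, 1000, 10)
print(w_ik)          # 1.0, since (3/6) * log10(1000/10) = 0.5 * 2
print(T(w_ik, 0.8))  # 0.8
```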
The proposed automatic thesaurus construction algorithm consists of two parts: (1) link generation and (2) calculating the degree of relationship between any two terms.

(1) Link Generation: We define a parameter γ, called the ratio of co-occurrence frequency in a document cluster, to determine whether there is a link between terms t_x and t_y in a document cluster center C_k:

γ = \frac{Num_d}{Num_c},   (4)

where γ ∈ [0, 1], Num_d denotes the number of documents in document cluster C_k in which both term t_x and term t_y appear, and Num_c denotes the number of documents in document cluster C_k. If the value γ of the ratio of co-occurrence frequency between terms t_x and t_y in document cluster C_k is larger than the user-supplied parameter α, called the threshold value of the ratio of co-occurrence frequency in a document cluster, then the system generates a link between terms t_x and t_y. Otherwise, the system does not generate a link between the two terms.
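Iterating this link-generation test over every term pair of every training cluster center already gives the skeleton of the construction algorithm presented later in this section. The sketch below is illustrative: the clusters are toy data, and `degree` is a caller-supplied stand-in for the degree-of-relationship formula developed next, not its actual definition:

```python
from itertools import combinations

def cooccurrence_ratio(cluster_docs, t_x, t_y):
    # Formula (4): gamma = Num_d / Num_c, where Num_d counts the documents
    # of the cluster containing both t_x and t_y, and Num_c is the number
    # of documents in the cluster.
    num_c = len(cluster_docs)
    num_d = sum(1 for doc in cluster_docs if t_x in doc and t_y in doc)
    return num_d / num_c

def build_thesaurus(clusters, alpha, degree):
    # For every training cluster and every pair of terms in its center,
    # generate a link when gamma > alpha; when the link already exists,
    # accumulate the new degree onto the stored one (Step 3.5: omega + delta).
    links = {}
    for center_terms, docs in clusters:
        for t_x, t_y in combinations(sorted(center_terms), 2):
            if cooccurrence_ratio(docs, t_x, t_y) > alpha:
                key = (t_x, t_y)
                links[key] = links.get(key, 0.0) + degree(docs, t_x, t_y)
    return links

# Toy training clusters: (center terms, documents represented as term sets).
docs1 = [{"query", "expansion"}, {"query", "expansion", "retrieval"}, {"retrieval"}]
docs2 = [{"query", "expansion"}, {"query", "expansion"}]
clusters = [(["query", "expansion", "retrieval"], docs1), (["expansion", "query"], docs2)]

print(build_thesaurus(clusters, alpha=0.5, degree=lambda d, x, y: 0.4))
# {('expansion', 'query'): 0.8}
```

Note how the ("expansion", "query") link is generated by the first cluster and then reinforced by the second, which is exactly the accumulation behavior of Step 3.5.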

(2) Calculating the Degree of Relationship between any Two Terms: Because we consider both the intermediate clusters and the final clusters of the clustering process when constructing the thesaurus, once the link between two terms has been generated, we must provide a method to calculate the degree of relationship between the two terms in the document cluster center precisely. The number of documents in a document cluster and the number of terms in a document cluster center can affect the calculated value of the degree of relationship between any two terms. In the following, we consider three factors that affect this calculation:

(i) Inner Noises in a Document Cluster (Irrelevant Documents in a Document Cluster): Once terms t_x and t_y both appear in document cluster center C_k, we apply \bar{c}_k, called the average of the distances of instances in the same category [7], as shown in formula (5), and define \bar{c}_{kp}, called the average of the distances of partial instances in the same category, as shown in formula (6), to measure the inner noise with respect to these two terms in this document cluster:

\bar{c}_k = \frac{\sum_{l=1}^{Num_c - 1} \sum_{m=l+1}^{Num_c} \mu_k(d_l)\, \mu_k(d_m)\, sim(d_l, d_m)}{\sum_{l=1}^{Num_c - 1} \sum_{m=l+1}^{Num_c} \mu_k(d_l)\, \mu_k(d_m)} \text{ if } Num_c > 1; \quad \bar{c}_k = T_v \text{ otherwise},   (5)

\bar{c}_{kp} = \frac{\sum_{l=1}^{Num_d - 1} \sum_{m=l+1}^{Num_d} \mu_k(d_l)\, \mu_k(d_m)\, sim(d_l, d_m)}{\sum_{l=1}^{Num_d - 1} \sum_{m=l+1}^{Num_d} \mu_k(d_l)\, \mu_k(d_m)} \text{ if } Num_d > 1; \quad \bar{c}_{kp} = T_v \text{ otherwise},   (6)

where T_v is a parameter representing the variance for a single-instance category [7]; d_l and d_m in formula (5) are any two documents in document cluster C_k; Num_c denotes the number of documents in document cluster C_k; Num_d denotes the number of documents in document cluster C_k which contain both terms t_x and t_y; d_l and d_m in formula (6) range over the documents which contain both terms t_x and t_y; µ_k(d_l) and µ_k(d_m) denote the degrees of membership of documents d_l and d_m with respect to document cluster C_k; and sim(d_l, d_m) is the similarity measure shown in formula (3), used to calculate the degree of similarity between documents d_l and d_m. By observing the ratio between \bar{c}_{kp} and \bar{c}_k, i.e., \bar{c}_{kp} / \bar{c}_k, we can tell whether there are inner noises in a document cluster with respect to terms t_x and t_y. For example, if the ratio \bar{c}_{kp} / \bar{c}_k is larger than 1, then it indicates that there are inner noises in this

document cluster with respect to terms t_x and t_y. Furthermore, in this paper, we also consider the fact that the number of documents in a document cluster may affect the degree of relationship between any two terms. Therefore, we use the ratio between Num_d and Num_c to adjust the degree of inner noise, which yields formula (7):

\frac{\bar{c}_{kp}}{\bar{c}_k} \times \frac{Num_d}{Num_c},   (7)

where Num_d denotes the number of documents in document cluster C_k which contain both terms t_x and t_y, and Num_c denotes the number of documents in document cluster C_k.

(ii) The Degree of Effect between any Two Terms in a Document Cluster Center: In the clustering process, a document cluster will have different terms in its cluster center when different document clusters are merged, and the degrees of effect of two terms in the document cluster center can also affect the calculation of the degree of relationship between them. If a term has a larger value of the degree of effect in a document cluster center, then it is more significant in this document cluster. However, the number of documents in a document cluster will affect the distribution of the values of the degrees of effect of terms in the document cluster center. In general, the more documents in a document cluster, the larger the difference between the values of the degrees of effect of terms. In order to analyze this difference, the degrees of effect of the terms can be regarded as a real-value sequence. Then, we consider the entropy E of the real-value sequence [1] and the values of the degrees of effect v_x and v_y of terms t_x and t_y, respectively, to calculate a partial degree of relationship between terms t_x and t_y:

\frac{Min(v_x, v_y)}{Max(v_x, v_y)} \times \frac{E}{T},   (8)

where

E = \sum_{i=1}^{n} \frac{v_i}{\sum_{k=1}^{n} v_k} \log_2 \frac{1}{v_i / \sum_{k=1}^{n} v_k},

v_i denotes the degree of effect of term t_i in the document cluster center, v_x and v_y are the degrees of effect of terms t_x and t_y in the document cluster center, respectively, and T denotes the number of terms in the document cluster center.

(iii) Co-Occurrence Frequency Analysis: Co-occurrence frequency analysis is commonly used for constructing a global thesaurus. In this paper, we also consider the

co-occurrence of any two terms t_x and t_y of document cluster C to calculate the degree of relationship between terms t_x and t_y. After the documents have been clustered, a higher value of co-occurrence with respect to terms t_x and t_y means that these two terms are more closely related to each other. In this paper, we use formula (9) to measure the co-occurrence frequency with respect to terms t_x and t_y:

\frac{co\_occ}{Max(occ_x, occ_y)},   (9)

where occ_x denotes the number of documents containing term t_x in document cluster C, occ_y denotes the number of documents containing term t_y in document cluster C, and co_occ denotes the number of documents containing both terms t_x and t_y in document cluster C.

Based on the discussions in (i), (ii) and (iii), once the ratio of co-occurrence frequency with respect to terms t_x and t_y in document cluster C_k is larger than the threshold value α, we use formula (10), which combines formulas (7), (8) and (9), to calculate the degree of relationship δ between terms t_x and t_y:

δ = \frac{\bar{c}_{kp}}{\bar{c}_k} \times \frac{Num_d}{Num_c} \times \frac{Min(v_x, v_y)}{Max(v_x, v_y)} \times \frac{E}{T} \times \frac{co\_occ}{Max(occ_x, occ_y)}.   (10)

The automatic thesaurus construction algorithm is now presented as follows.

Automatic Thesaurus Construction Algorithm Based on Document Clusters:

Input: The threshold value α of the ratio of co-occurrence frequency in a document cluster, where α ∈ [0, 1], and the intermediate clusters and final clusters C_1, C_2, ..., C_p of the documents.
Output: The constructed thesaurus.
Step 1: Initially, set the variables i = 0 and k = 0.
Step 2: Let k = k + 1. If k > p, then Stop. Otherwise, let document cluster C_k be the training document cluster. Assume that there exist n_k terms t_1, t_2, ..., t_{n_k} in the document cluster center C_k.
Step 3: Let i = i + 1. If i > n_k - 1, then go to Step 2. Otherwise, let j = i, choose term t_i from the document cluster C_k, and perform Step 3.1 to Step 3.5.
Step 3.1: Let j = j + 1 and find term t_j.

Step 3.2: Based on formula (4), calculate the ratio of co-occurrence frequency γ between terms t_i and t_j in the document cluster C_k, where γ ∈ [0, 1]. If γ < α, then go to Step 3.3; otherwise, go to Step 3.4.
Step 3.3: If j < n_k, then go to Step 3.1; otherwise, go to Step 3.
Step 3.4: Check whether there is already a link between terms t_i and t_j. If there is, then go to Step 3.5. Otherwise, generate a new link L_{t_i,t_j} between terms t_i and t_j and calculate the degree of relationship δ between terms t_i and t_j based on formula (10), where δ ∈ [0, 1]. If j < n_k, then go to Step 3.1; otherwise, go to Step 3.
Step 3.5: Assume that the original degree of relationship between terms t_i and t_j is ω, where ω ∈ [0, 1]. Based on formula (10), calculate the degree of relationship δ between terms t_i and t_j in the document cluster C_k, where δ ∈ [0, 1]. Then, let the new degree of relationship between terms t_i and t_j be equal to ω + δ. If j < n_k, then go to Step 3.1; otherwise, go to Step 3.

4. Query Expansion

In general, users retrieve documents through information retrieval systems [11]. However, the query terms submitted by the users usually do not provide enough information to retrieve most of the relevant documents. Since the query expansion method was proposed [15], it has been the main method for improving the performance of information retrieval systems. In this paper, we apply the constructed thesaurus to expand the user's query. When the user submits his/her query terms, the system chooses the term that has the highest IDF value among the query terms as the center of the query expansion terms and chooses the terms having higher degrees of relationship with respect to that center as query expansion terms.
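A minimal sketch of this center-selection step follows; the thesaurus is modeled here as a plain dictionary of link degrees, the IDF values are made up, and the bookkeeping that also keeps the remaining original query terms (Step 4 of the algorithm below) is omitted:

```python
def choose_expansion_terms(query_terms, idf, links, beta):
    # Pick the query term with the highest IDF as the center of the query
    # expansion, then collect the thesaurus terms most strongly linked to it
    # until beta expansion terms (including the center) are gathered.
    center = max(query_terms, key=lambda t: idf[t])
    neighbours = sorted(links.get(center, {}).items(),
                        key=lambda item: item[1], reverse=True)
    expansion = [center]
    for term, _degree in neighbours:
        if len(expansion) >= beta:
            break
        if term not in expansion:
            expansion.append(term)
    return center, expansion

# Illustrative IDF values and thesaurus links.
idf = {"fuzzy": 2.0, "retrieval": 1.0}
links = {"fuzzy": {"clustering": 0.8, "thesaurus": 0.6, "noise": 0.1}}

print(choose_expansion_terms(["fuzzy", "retrieval"], idf, links, beta=3))
# ('fuzzy', ['fuzzy', 'clustering', 'thesaurus'])
```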
Then, the system calculates the degree of relationship and the weight of each expansion term. Finally, it replaces the original query terms by the expansion terms. In the following, we introduce some parameters and formulas that are needed for the proposed query expansion algorithm:

(i) The Number of Expansion Terms β: This is a user-defined parameter; the system will generate β expansion terms, where β ≥ 1.

(ii) The Calculation of the Relevant Degree of Expansion Terms: After the system gets the center t of the query expansion terms, every term t′ found according to

term t must calculate its relevant degree based on formula (11):

\frac{\theta}{\rho} \times \frac{f'}{f},   (11)

where ρ denotes the highest degree of relationship between the center t of the query expansion terms and the other terms, θ denotes the degree of relationship between terms t and t′, and f and f′ are the IDF values of terms t and t′, respectively.

(iii) The Threshold Value ϕ for Filtering Document Clusters: After finishing the query expansion process, the system will filter some training document clusters (assume that the training clusters are C_1, C_2, ..., C_o) and then calculate the weight of each expanded term based on some documents of the document clusters filtered by the system. For a document cluster C_a to be selected, it must satisfy formula (12):

\frac{v_{C_a}}{DC_{C_a}} \ge \varphi, \quad 1 \le a \le o,   (12)

where v_{C_a} denotes the degree of effect of term t in document cluster C_a, t being the center of the query expansion terms, and DC_{C_a} denotes the number of documents in document cluster C_a.

(iv) The Calculation of the Weights of Expanded Terms: After the system has found the center t of the expansion terms, the weight of every expanded term t′ is calculated based on formula (13):

\frac{\sum_{a=1}^{o} \sum_{s=1}^{p_{C_a}} w_{C_a, d_s}}{\sum_{a=1}^{o} \sum_{s=1}^{p_{C_a}} I},   (13)

where p_{C_a} denotes the number of documents in document cluster C_a and w_{C_a, d_s} denotes the weight of term t′ in document d_s of document cluster C_a. If term t′ does not appear in document d_s of document cluster C_a, then the value of w_{C_a, d_s} is set to 0. If term t′ appears in document d_s of document cluster C_a, then the value of I is set to 1; otherwise, the value of I is set to 0.

The query expansion algorithm is now presented as follows:

Query Expansion Algorithm:

Input: The number β of expansion terms, the threshold value ϕ for filtering document clusters, and the query-terms set Q = {q_1, q_2, ..., q_m} submitted by the user. The IDF vector of the query-terms set Q is IDF_Q = (f_1, f_2, ..., f_m), where f_i denotes the

IDF value of query term q_i, 1 ≤ i ≤ m, m ≤ β, and ϕ ∈ [0, 1]; the intermediate clusters and final clusters in the clustering process are C_1, C_2, ..., C_p. Initial variable: i = 0.
Output: The query expansion term set K = {e_1, e_2, ..., e_β}; the relevant degree vector G of term set K, where G = (g_1, g_2, ..., g_β) and g_i denotes the relevant degree of term e_i, 1 ≤ i ≤ β; and the weight vector W of term set K, where W = (w_1, w_2, ..., w_β) and w_i denotes the weight of term e_i, 1 ≤ i ≤ β.
Step 1: Choose the query term q_l that has the largest IDF value from the query term set Q (assume that f_l denotes the IDF value of term q_l) and let term q_l be the center of the query expansion. Put q_l into the query expansion term set K and let e_l = q_l, where 1 ≤ l ≤ m.
Step 2: Let the relevant degree g_l of term e_l be equal to 1.
Step 3: Find the term t in the constructed thesaurus that has the largest degree of relationship with term e_l (assume that the degree of relationship between terms t and e_l is ρ, where ρ ∈ [0, 1]).
Step 4: Put every query term q_s in the query term set Q except term q_l into K (assume that the link between query term q_s and the center e_l of the query expansion terms is L_{e_l,q_s}), such that e_s = q_s, where 1 ≤ s ≤ m and s ≠ l. Calculate the relevant degree g_s of every term e_s in the query term set K based on formula (11) and mark the link L_{e_l,e_s}, where 1 ≤ s ≤ m and s ≠ l. Set i = m. If i = β, then Stop. Otherwise, go to Step 5.
Step 5: Choose the term t_r from the thesaurus which has the largest degree of relationship with term e_l, where the link between term t_r and term e_l has not been marked yet. Mark the link L_{e_l,t_r} between term t_r and the center e_l of the query expansion terms. Set i = i + 1 and let e_i = t_r. Calculate the relevant degree g_i of term e_i based on formula (11). If i < β, then go to Step 5. Otherwise, go to Step 6.
Step 6: Find the document clusters among the training document clusters C_1, C_2, ..., C_p whose cluster centers contain term e_l, where the degree of effect of term e_l is larger than that of the other terms in the cluster center and where the document clusters satisfy formula (12).
Step 7: Calculate the weight w_j of each term e_j in the query expansion term set K based on formula (13), where 1 ≤ j ≤ β.

After expanding the user's original query, we apply the proposed similarity calculation algorithm to improve the retrieval performance. The algorithm considers the degrees of

relationship of the terms generated by the proposed query expansion algorithm as the default degrees of relationship when querying. The degree of relationship of every query term is changed dynamically according to the previous degrees of relationship of the terms. In the following, we present the method to calculate the degree of similarity between the query expansion term set and a document. Assume that the query expansion term set is K = {e_1, e_2, ..., e_β}, where its degree of relationship vector is G = ⟨g_1, g_2, ..., g_β⟩ and its weighting vector is W = ⟨w_1, w_2, ..., w_β⟩. The formula to calculate the degree of similarity between the query expansion term set K and document d_k is shown as follows:

$\frac{\sum_{j=1}^{\beta} g'_j \times T(w_j, w_{j,d_k})}{\sum_{j=1}^{\beta} g'_j}$, (14)

where w_j denotes the weight of term e_j in W; w_{j,d_k} denotes the weight of term e_j in document d_k; T(w_j, w_{j,d_k}) [12] calculates the degree of similarity between w_j and w_{j,d_k}; and g'_j denotes the dynamic degree of relationship of term e_j. The formula used to calculate g'_j is shown as follows:

$g'_j = g_j + \sum_{s=1}^{j-1} \left( (T(w_s, w_{s,d_k}) - g'_s) \times g_j \right)$, (15)

where if g'_j > 1, then let g'_j = 1; if g'_j < 0, then let g'_j = 0.

The proposed similarity calculation algorithm is now presented as follows:

The Similarity Calculation Algorithm:
Input: The query expansion term set K = {e_1, e_2, ..., e_β}, its degree of relationship vector G = ⟨g_1, g_2, ..., g_β⟩, and its weighting vector W = ⟨w_1, w_2, ..., w_β⟩.
Output: The degree of similarity between each document and the query expansion term set K.
Step 1: Sort the terms in the query expansion term set K according to their degrees of relationship in descending order to form a new query expansion term set K = {e_1, e_2, ..., e_β}, where its degree of relationship vector is G = ⟨g_1, g_2, ..., g_β⟩ with g_1 ≥ g_2 ≥ ... ≥ g_β, and its weighting vector is W = ⟨w_1, w_2, ..., w_β⟩.
Step 2: Based on formulas (14) and (15), calculate the degree of similarity between the query expansion term set and each document. Sort the documents according to their degrees of similarity in descending order for the user's browsing.
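As a reading aid, the similarity computation of formulas (14) and (15) can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the matching function T of [12] is not reproduced in the paper, so a simple placeholder (1 − |w − w'|) is used here, and the dynamic degrees are assumed to be updated in rank order using the already-updated values of the preceding terms.

```python
def similarity(G, W, doc_weights, T=lambda w, wd: 1.0 - abs(w - wd)):
    """Degree of similarity between the query expansion term set and one
    document, following formulas (14) and (15).

    G:           relevance degrees g_1..g_beta, sorted descending (Step 1)
    W:           expansion term weights w_1..w_beta
    doc_weights: weights w_{j,d_k} of the expansion terms in the document
    T:           placeholder for the matching function of [12] (assumption)
    """
    g = list(G)  # dynamic degrees g'_j, updated in place
    for j in range(len(g)):
        # Formula (15): adjust g_j by the terms ranked before it,
        # then clamp the result to [0, 1].
        adjust = sum((T(W[s], doc_weights[s]) - g[s]) * g[j]
                     for s in range(j))
        g[j] = min(1.0, max(0.0, g[j] + adjust))
    # Formula (14): relevance-weighted average of the T values.
    num = sum(g[j] * T(W[j], doc_weights[j]) for j in range(len(g)))
    den = sum(g)
    return num / den if den > 0 else 0.0
```

For example, `similarity([1.0, 0.8], [0.5, 0.4], [0.5, 0.4])` returns 1.0, since every term weight matches the document weight exactly.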

In summary, Figure 1 illustrates with a flowchart how the three proposed algorithms are applied to document retrieval.

Figure 1. The flowchart of the proposed method.

5. Experimental Results

We have implemented the proposed method on a Pentium 4 PC using Delphi Version 5.0. We chose 292 research reports spanning 15 categories from [20] for constructing the thesaurus; they are a subset of the collection of the research reports of the National Science Council (NSC), Republic of China. The number of documents in each category is between 13 and 15. Each document consists of several parts, including a report ID, a title, a Chinese abstract, an English abstract, etc., and a document may belong to several different categories at the same time. The system extracts the report ID and the English abstract of each report automatically and uses the stemming method [1] to sieve out the roots of terms and to generate the term database based on these term roots. We use the proposed method shown in Figure 1 to construct the thesaurus automatically based on the document clusters obtained by [5] and apply the constructed thesaurus to expand the user's original queries. In our experiment, we set the threshold value α = 0.5 when executing the automatic thesaurus construction algorithm and set the value β = 5 when executing the query expansion algorithm. We test the performance of query expansion on the same database [20] by observing the improvement in the retrieval performance of the nine queries shown in Table 1. Figure 2 and Figure 3 compare the precision rates and the recall rates of the top 20 documents of the nine queries, respectively, where the precision rate and the recall rate are defined as follows [1]:

$\text{Precision rate} = \frac{R_e}{R_a}$, (16)

$\text{Recall rate} = \frac{R_e}{R_r}$, (17)

where R_e denotes the number of relevant retrieved documents, R_r denotes the number of relevant documents in the collection, and R_a denotes the number of retrieved documents. From Figure 2 and Figure 3, we can see that the query expansion method proposed in this paper can expand the user's queries and obtains higher precision rates and recall rates.

Table 1. A list of the user's queries.
Q_1: Heterogeneous Database
Q_2: Natural Language Processing
Q_3: Network Security
Q_4: Multimedia Database
Q_5: Parallel Computing
Q_6: Speech Recognition
Q_7: Expert System
Q_8: Mobile Communication
Q_9: Robot Arm

Figure 2. The precision rates of the top 20 retrieved documents of the user's queries.

Figure 3. The recall rates of the top 20 retrieved documents of the user's queries.
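The precision and recall computations of formulas (16) and (17) are straightforward to reproduce; the following sketch uses illustrative document IDs rather than data from the experiment.

```python
def precision_recall(retrieved, relevant):
    """Formulas (16) and (17): precision = R_e / R_a and recall = R_e / R_r,
    where R_e is the number of relevant retrieved documents, R_a the number
    of retrieved documents, and R_r the number of relevant documents."""
    r_e = len(set(retrieved) & set(relevant))
    precision = r_e / len(retrieved) if retrieved else 0.0
    recall = r_e / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 overlap.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here p = 2/4 = 0.5 and r = 2/3, matching the definitions above.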

6. Conclusions

In this paper, we have presented a new approach for automatic thesaurus construction and query expansion for document retrieval. We analyze the information between any two terms in each document cluster center of the final document clusters and the intermediate document clusters in the clustering process to construct the thesaurus automatically, where this information includes the co-occurrence frequency of any two terms in each document cluster center, the degree of effect of each term in each document cluster center, and the inner noise of each document cluster. The thesaurus is automatically constructed using a network structure. We have also presented a query expansion algorithm to expand the user's queries and a new method to calculate the degree of similarity between the user's query and the documents. The proposed thesaurus construction method and the proposed query expansion method can improve the performance of information retrieval systems for document retrieval.

References

[1] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, ACM Press, New York, 1999.
[2] Billhardt, H., Borrajo, D. and Maojo, V., A context vector model for information retrieval, Journal of the American Society for Information Science and Technology, Vol.53, No.3, pp.236-249, 2002.
[3] Buckley, C., Salton, G., Allan, J. and Singhal, A., Automatic query expansion using SMART, Proceedings of the 3rd Text Retrieval Conference (TREC-3), edited by Donna K. Harman, National Institute of Standards and Technology, Gaithersburg, MD, pp.69-80, 1995.
[4] Chang, Y. C., Chen, S. M. and Liau, C. J., A new query expansion method based on fuzzy rules, Proceedings of the 2003 Joint Conference on AI, Fuzzy System, and Grey System, Taipei, Taiwan, Republic of China, 2003.
[5] Chen, L. Y. and Chen, S.
M., A new fuzzy hierarchical clustering method based on dynamic cluster centers, Proceedings of the Ninth Conference on Information Management Research, Changhua, Taiwan, Republic of China, 2003.
[6] Chen, L. Y. and Chen, S. M., A new method for automatic thesaurus construction and query expansion, Proceedings of the 15th International Conference on Information Management, Taipei, Taiwan, Republic of China, 2004.
[7] Chen, C. L. P. and Lu, Y., FUZZ: A fuzzy-based concept formation system that integrates human categorization and numerical clustering, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, Vol.27, No.1, pp.79-94, 1997.
[8] Croft, W. and Harper, D. J., Using probabilistic models of document retrieval without relevance information, Journal of Documentation, Vol.35, No.4, pp.285-295, 1979.
[9] Crouch, C. J., An approach to the automatic construction of global thesauri, Information Processing & Management, Vol.26, pp.629-640, 1990.
[10] He, Y., Hui, S. C. and Fong, A. C. M., Mining a Web citation database for document clustering, Applied Artificial Intelligence, Vol.16, No.4, pp.283-302, 2002.
[11] Horng, Y. J., Chen, S. M. and Lee, C. H., A new fuzzy information retrieval method based on document terms reweighting techniques, International Journal of Information and Management Sciences, Vol.14, No.4, pp.63-82, 2003.

[12] Horng, Y. J., Chen, S. M. and Lee, C. H., Fuzzy information retrieval using fuzzy hierarchical clustering and fuzzy inference techniques, Proceedings of the 13th International Conference on Information Management, Taipei, Taiwan, Republic of China, pp.215-222, 2002.
[13] Ide, E., New experiments in relevance feedback, The SMART Retrieval System, edited by G. Salton, Prentice Hall, Englewood Cliffs, NJ, pp.337-354, 1971.
[14] Jing, Y. and Croft, W. B., An association thesaurus for information retrieval, Proceedings of the 1994 Intelligent Multimedia Information Retrieval Systems, NY, pp.146-160, 1994.
[15] Jones, K. S., Automatic Keyword Classification for Information Retrieval, Butterworths, London, UK, 1971.
[16] Kalczynski, P. J. and Chou, A., Temporal document retrieval model for business news archives, Information Processing and Management, Vol.41, No.3, pp.635-650, 2005.
[17] Ma, Z. M., Zhang, W. J. and Ma, W. Y., Extending object-oriented databases for fuzzy information modeling, Information Systems, Vol.29, No.5, pp.421-435, 2004.
[18] Qiu, Y. and Frei, H. P., Concept based query expansion, Proceedings of the 16th Annual International ACM Conference on Research and Development in Information Retrieval, NY, pp.160-169, 1993.
[19] Salton, G., The SMART Retrieval System - Experiments in Automatic Document Processing, Prentice Hall, Englewood Cliffs, NJ, 1971.
[20] A Subset of the Collection of the Research Reports of the National Science Council, Taiwan, R. O. C., http://fuzzylab.et.ntust.edu.tw/nsc Report Data base/292documents.html (Data Source: http://sticnet.stic.gov.tw).

Authors' Information

Liang-Yu Chen received the B.S. degree from the Department of Computer Science and Information Engineering, Tamkang University, Taipei, Taiwan, Republic of China, in June 2002, and received the M.S.
degree from the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, in June 2004. His current research interests include information retrieval systems, fuzzy systems, and artificial intelligence.
Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R. O. C. E-mail: M9115054@mail.ntust.edu.tw

Shyi-Ming Chen is currently a Professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R. O. C. He received the Ph.D. degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in June 1991. He has published more than 260 papers in refereed journals, book chapters, and conference proceedings. His research interests include fuzzy systems, information retrieval systems, knowledge-based systems, neural networks, artificial intelligence, data mining, and genetic algorithms. He is currently the President of the Taiwanese Association for Artificial Intelligence (TAAI).
He is an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics - Part C, an Associate Editor of the IEEE Computational Intelligence Magazine, an Associate Editor of the Journal of Intelligent & Fuzzy Systems, an Editor of the New Mathematics and Natural Computation Journal, an Associate Editor of the International Journal of Fuzzy Systems, an Editorial Board Member of the International Journal of Information and Communication Technology, an Editorial Board Member of the WSEAS Transactions on Systems, an Associate Editor of the WSEAS Transactions on Computers, an Editor of the Journal of Advanced Computational Intelligence and Intelligent Informatics, an Associate Editor of the International Journal of Applied Intelligence, an Associate Editor of the International Journal of Artificial Intelligence Tools, an Editorial Board Member of the International Journal of Computational Intelligence and Applications, an Editorial Board Member of the Advances in Fuzzy Sets and Systems Journal, an Editor of the International Journal of Soft Computing, an Editor of the Asian Journal of Information

Technology, an Editorial Board Member of the International Journal of Intelligent Systems Technologies and Applications, an Editor of the Asian Journal of Information Management, an Associate Editor of the International Journal of Innovative Computing, Information and Control, an Editorial Board Member of the International Journal of Computer Applications in Technology, an Associate Editor of the Journal of Uncertain Systems, an Editorial Board Member of the Advances in Computer Sciences and Engineering Journal, and an Associate Editor of the International Journal of Intelligent Information and Database Systems. He was an Editor of the Journal of the Chinese Grey System Association from 1998 to 2003. He is currently also the Dean of the College of Electrical Engineering and Computer Science, Jinwen University of Science and Technology, Taipei, Taiwan, R.O.C. He is an IET Fellow (Fellow of the IEE).
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C. E-mail: smchen@mail.ntust.edu.tw Tel: +886-2-27376417