A Patent Retrieval Method Using a Hierarchy of Clusters at TUT


Hironori Doi, Yohei Seki, Masaki Aono
Toyohashi University of Technology
1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan
doi@kde.ics.tut.ac.jp, {seki,aono}@ics.tut.ac.jp

Abstract

To retrieve relevant documents from an enormous document collection, we usually utilize a similarity or distance measure between a query and the documents, or apply document clustering techniques that partition the collection into relevant document groups. For patent retrieval, however, it is difficult to retrieve documents using query terms alone, because patents contain complex, domain-specific terminology. One approach to this problem is query expansion. We have extended the usual vector space model by utilizing co-clustering techniques: we generate a hierarchy of clusters by applying co-clustering to the document collection at different levels of cluster granularity, and the query is then expanded using this hierarchy. We participated in the NTCIR-5 Patent Retrieval Task (Document Retrieval Subtask) with our system, and we present the effectiveness of our approach for patent retrieval with experiments using the NTCIR-4 and NTCIR-5 test collections.

Keywords: Query Expansion, Document Clustering

1 Introduction

Patent data is usually classified into groups with codes such as IPC (International Patent Classification) codes, which classify themes from a single viewpoint using names, summaries, or claims, or F-terms, which classify themes from multiple viewpoints using narratives. On the other hand, document clustering techniques, which partition a document set into groups according to a similarity or distance measure between its elements, are widely used and applied to document retrieval. For patent data, concept-level searches based on the similarity of contents are more effective than exact keyword matching with queries, because different terms are frequently used to express the same concept.

In this paper, we explore our approach to the NTCIR-5 Patent Retrieval Task (Document Retrieval Subtask). We previously proposed this approach in [1] as a query expansion method with a hierarchy of clusters, generated by applying co-clustering techniques to the document collection at different levels of cluster granularity. Cluster granularity was defined using powers of two (16, 32, 64, ...).

2 Related Work

Researchers have applied clustering techniques to information retrieval since the 1990s. Cutting et al. [3] applied an exclusive clustering technique to retrieval results using the k-means algorithm. Eguchi et al. [6] proposed a retrieval method combining clustering similar to Cutting's with relevance feedback techniques, and they reported improvements in search effectiveness. Sasaki et al. [7] improved retrieval effectiveness by using spherical k-means clustering and reducing dimensionality with a random projection technique. Chang et al. [2] proposed an automated query expansion method using clustering features.

3 Co-clustering

Recent work that used clustering to improve search effectiveness was restricted to document clustering. We propose a retrieval method that uses not only document clusters but also term clusters, by utilizing co-clustering techniques [5, 4], to improve search effectiveness, especially for concept-level search.
This proposal is based on the observation that a set of co-occurring terms in a document is an effective index for retrieving special-purpose documents such as patents. In this section, we give an overview of co-clustering; we present our proposal in the next section.

3.1 Definition

Let $X$ and $Y$ be two discrete random variables that take values in a document set $X = \{x_1, x_2, \ldots\}$ and a term (index) set $Y = \{y_1, y_2, \ldots\}$, respectively.

Suppose that we know their joint probability distribution $p(X, Y)$; in practice, this can be estimated from co-occurrence data, expressed as a (normalized) matrix of term-document co-occurrences. Let $\hat{X} = \{h_1, h_2, \ldots, h_s\}$ be a set of $s$ document clusters, and $\hat{Y} = \{g_1, g_2, \ldots, g_t\}$ be a set of $t$ term clusters. Let $\pi_x \equiv p(x)$ denote the probability that a document $x$ appears, and $\pi_y \equiv p(y)$ the probability that a term $y$ appears. For clusters, let $\pi_h \equiv p(h) = \sum_{x \in h} p(x)$ be the probability of a document cluster $h$, and $\pi_g \equiv p(g) = \sum_{y \in g} p(y)$ the probability of a term cluster $g$. Given a document $x$, the distribution of the terms $Y$ appearing in $x$ can be expressed as $z(x) = p(Y \mid x)$. Similarly, given a document cluster $h$, the distribution of the terms $Y$ appearing in $h$ can be expressed as $\mu_h = p(Y \mid h)$.

The co-clustering algorithm finds an optimal document clustering $\hat{X}$ and an optimal term clustering $\hat{Y}$ that minimize the loss in mutual information $I(X; Y) - I(\hat{X}; \hat{Y})$, taking the co-occurring terms in the documents into account:

$$ I(X; Y) - I(\hat{X}; \hat{Y}) = \mathrm{KL}\bigl( p(X, Y) \,\|\, q(X, Y) \bigr) \qquad (1) $$

where $\mathrm{KL}(\cdot \| \cdot)$ on the right-hand side stands for the Kullback-Leibler divergence, and we assume $q(x, y) = p(h, g)\, p(x \mid h)\, p(y \mid g)$ for $x \in h$, $y \in g$, with $p(h, g) = \sum_{x \in h} \sum_{y \in g} p(x, y)$.
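To make the objective concrete, the following is a minimal Python (NumPy) sketch, ours rather than the paper's, that evaluates the right-hand side of Eq. (1) for hard cluster assignments over a normalized co-occurrence matrix. The function name and argument layout are illustrative assumptions.

```python
import numpy as np

def coclustering_objective(P, row_labels, col_labels, s, t):
    """Loss in mutual information, KL(p(X,Y) || q(X,Y)), for hard
    document/term cluster assignments (illustrative sketch).

    P          : (n_docs, n_terms) array whose entries sum to 1 (p(x, y))
    row_labels : int array, cluster index h(x) of each document, in [0, s)
    col_labels : int array, cluster index g(y) of each term, in [0, t)
    Assumes no empty clusters.
    """
    # p(h, g): probability mass of each document-cluster/term-cluster block
    phg = np.zeros((s, t))
    for h in range(s):
        for g in range(t):
            phg[h, g] = P[np.ix_(row_labels == h, col_labels == g)].sum()

    px, py = P.sum(axis=1), P.sum(axis=0)                          # p(x), p(y)
    ph = np.array([px[row_labels == h].sum() for h in range(s)])   # p(h)
    pg = np.array([py[col_labels == g].sum() for g in range(t)])   # p(g)

    # q(x, y) = p(h, g) p(x|h) p(y|g)  with  x in h, y in g
    q = (phg[row_labels][:, col_labels]
         * (px / ph[row_labels])[:, None]
         * (py / pg[col_labels])[None, :])

    nz = P > 0                                                     # 0 log 0 = 0
    return float(np.sum(P[nz] * np.log(P[nz] / q[nz])))
```

The co-clustering algorithm of Dhillon et al. [5] searches over assignments to minimize this quantity; the sketch only shows how the objective is scored for one candidate assignment.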
3.2 Advantages of Co-clustering

With the co-clustering algorithm, document clusters and term clusters are generated together by minimizing the difference in mutual information between the original random variables and the clustered random variables. This implies that, with co-clustering, a low-dimensional document-term matrix that approximates the original matrix preserves highly associated local subspaces. This contrasts with singular value decomposition (SVD): SVD usually reduces dimensionality through a global effect on the matrix and generates a dense matrix whose elements take both positive and negative values. In co-clustering, the effect of the approximation is localized, the sparseness of the original matrix is preserved, and no negative values are introduced. This is preferable, because document-term matrices containing TF-IDF values have no negative values.

4 Our Proposed Method

We have developed a hierarchical data structure over granularity levels of clusters to cope with the problems inherent in most clustering methods, discussed in Section 4.1. In this section, we briefly enumerate the problems of typical clustering, then describe the cluster averaging algorithm, the hierarchical cluster generation algorithm, and a query expansion method based on these algorithms.

4.1 Clustering Problems

We identify three problems with clustering:

1. The quality of the clusters depends heavily on the initial parameters of the clustering algorithm.
2. Partition-based clustering suffers from noisy data.
3. The number of clusters to be generated cannot be properly determined a priori.

We propose three solutions to these problems. For the dependency on initial parameters, we change the initial seed of the random variables, run the clustering algorithm more than once, and average similar clusters together; this is described as the Cluster Averaging Algorithm in Section 4.1.1. For the noise problem, we compute the similarity between documents in each cluster, and remove noisy documents that were not contained in any cluster during cluster averaging; we thereby generate an effective partitioning into clusters. For the problem of the number of clusters, we generate clusters at many granularity levels, link similar clusters between adjacent granularity levels (16, 32, 64, ...) whose similarity exceeds a threshold, and thus build a hierarchical cluster data structure, shown in Figure 1. This is described as the Hierarchical Cluster Generation Algorithm in Section 4.1.2.

4.1.1 Cluster Averaging Algorithm

Step 1: Perform clustering at granularity level $M$ ($M \in \{M_1, \ldots, M_k\}$) $R$ times, changing the initial seed of the random variables each time. Denote the resulting document clusters by $D^r = \{D_1^r, D_2^r, \ldots, D_M^r\}$ ($r = 1, \ldots, R$) and the term clusters by $W^r = \{W_1^r, W_2^r, \ldots, W_M^r\}$ ($r = 1, \ldots, R$). The document cluster $D_j^r$ corresponds to the term cluster $W_j^r$.

Step 2: Initialize a vector $H = \{H_1, H_2, \ldots, H_M\}$ (which we call the document cluster vector) with $D^1$. In the same way, initialize a vector $G = \{G_1, G_2, \ldots, G_M\}$ (which we call the term cluster vector) with $W^1$.

Step 3: Apply the cluster smoothing algorithm shown in Figure 2. In this step, we perturb the cluster vector $H_j$ if the currently scanned (processed) document cluster vector $D$ has similarity to $H_j$ above a predefined threshold $\delta$. We call this step cluster smoothing.
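The smoothing details are deferred to Figure 2, which is not reproduced in this text, so the following Python sketch is only our reading of Steps 1-3 under stated assumptions: each run is summarized by $M$ centroid vectors, cosine similarity matches clusters across runs, and matched centroids are folded into a running average. The names average_clusters and delta are illustrative.

```python
import numpy as np

def average_clusters(runs, delta=0.6):
    """Cluster Averaging Algorithm (Steps 1-3), sketched over centroids.

    runs  : list of R clustering results, one per random seed; each is a
            list of M cluster centroid vectors (document or term side)
    delta : similarity threshold from Step 3 (cluster smoothing)
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    # Step 2: initialize the cluster vector H with the first run
    H = [np.asarray(c, dtype=float).copy() for c in runs[0]]
    counts = [1] * len(H)

    # Step 3: scan the remaining runs and smooth matching clusters
    for run in runs[1:]:
        for D in run:
            D = np.asarray(D, dtype=float)
            j = max(range(len(H)), key=lambda k: cos(H[k], D))
            if cos(H[j], D) >= delta:
                counts[j] += 1
                H[j] += (D - H[j]) / counts[j]   # running average of H_j
            # a cluster that matches nothing above delta is treated as
            # noise and dropped, as in the paper's noise-removal step
    return H
```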

Figure 1. A hierarchical cluster data structure

4.1.2 Hierarchical Cluster Generation Algorithm

Step 1: Compute the similarity between adjacent granularity levels $H_i$ and $H_{i+1}$ using the average cluster vectors.

Step 2: If the similarity between a document cluster $D_i$ in $H_i$ and a document cluster $D_{i+1}$ in $H_{i+1}$ is larger than a threshold, add a link from $D_i$ to $D_{i+1}$.

Step 3: Repeat Step 2 until the granularity level reaches the finest document cluster set $H_k$.
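A minimal sketch of this linking pass follows, assuming each granularity level is represented by its averaged cluster (centroid) vectors; the threshold theta and the function name link_levels are our illustrative choices.

```python
import numpy as np

def link_levels(levels, theta=0.5):
    """Hierarchical Cluster Generation (Steps 1-3): link similar clusters
    between adjacent granularity levels, coarsest (16) down to finest (H_k).

    levels : list of levels, each a list of averaged cluster vectors
    Returns links as (level, cluster_idx, level + 1, cluster_idx) tuples.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    links = []
    for lv in range(len(levels) - 1):            # Step 3: repeat per level
        for i, coarse in enumerate(levels[lv]):
            for j, fine in enumerate(levels[lv + 1]):
                if cos(coarse, fine) > theta:    # Step 2: threshold test
                    links.append((lv, i, lv + 1, j))
    return links
```

The resulting link set is exactly the hierarchical cluster data structure of Figure 1: each coarse cluster points to the finer clusters it subsumes.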

5 Query Expansion

In this section, we explain our query expansion method using the generated hierarchy of clusters. With the hierarchical cluster granularity data structure, we compute the similarity between the query vector and the average cluster vectors in order of granularity, from coarsest to finest. If the similarity is above a threshold value, the query is expanded by adding the terms and weights of the average cluster vectors, as in the following expression:

$$ q' = \alpha q + \sum_i w_i v_i \qquad (2) $$

Figure 2. Cluster smoothing algorithm

Note that $v_i$ represents an expanded word vector, $w_i$ represents its weight, and $\alpha$ is a nonnegative coefficient. Clusters found at coarser granularity levels are basically major clusters, and their number of elements is generally large. This is preferable for applications that focus on recall and that search for as much similar data as possible.
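A minimal sketch of Eq. (2) over the hierarchy follows, scanning levels from coarsest to finest. The paper does not state how $w_i$ is chosen, so using the query-cluster similarity as $w_i$ (and the names expand_query and theta) is our assumption.

```python
import numpy as np

def expand_query(q, levels, alpha=1.0, theta=0.4):
    """Query expansion per Eq. (2): q' = alpha * q + sum_i w_i * v_i.

    q      : query term-weight vector (same dimension as cluster vectors)
    levels : averaged cluster vectors per granularity level, coarsest first
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    qa = np.asarray(q, dtype=float)
    q_new = alpha * qa                        # start from the weighted query
    for level in levels:                      # in order of granularity
        for v in level:
            v = np.asarray(v, dtype=float)
            w = cos(qa, v)
            if w > theta:                     # cluster is similar enough to q
                q_new += w * v                # add its terms and their weights
    return q_new
```

Because coarse levels are scanned first, recall-oriented expansions from major clusters are added before the more specific fine-grained ones.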

6 Results

We participated in the NTCIR-5 Patent Retrieval Task (Document Retrieval Subtask) using our system; the results are shown in Figures 3 and 4. We also conducted additional experiments using the NTCIR-4 Patent Test Collection; these results are shown in Figures 5 and 6.

Figure 3. Evaluation A on NTCIR-5 Patent Retrieval Task (Document Retrieval Subtask)
Figure 4. Evaluation B on NTCIR-5 Patent Retrieval Task (Document Retrieval Subtask)
Figure 5. Evaluation A on NTCIR-4 Patent Test Collection
Figure 6. Evaluation B on NTCIR-4 Patent Test Collection

In the figures, "claim" refers to results with query expansion using only terms in patent claims as queries, while "document" refers to results with query expansion using terms from the whole patent body text as queries. The results show that query expansion using the whole patent body text outperforms query expansion using only patent claims: in the latter case, the expanded queries did not match the relevant clusters in the hierarchical cluster granularity data structure, and retrieval accuracy suffered accordingly. We also analyzed the effectiveness of our approach against other methods, for example full-text search.

7 Conclusion

We focused on the problems of clustering for patent search and explored the query expansion method described in [1] for the NTCIR-5 Patent Retrieval Task (Document Retrieval Subtask), using a hierarchical cluster granularity data structure constructed by applying multiple co-clustering runs with different initial conditions. The NTCIR-5 results showed that retrieval effectiveness improved for some topics when queries were expanded with conceptually similar terms. For other topics, the expanded queries performed worse because they contained noise (outliers). In future work, we will explore solutions to the problem of increased noise. We will also explore compact data structures to address the problem of very large collections, focusing especially on the selection of cluster granularities, and we will compare against other clustering methods, such as k-means. In addition, we will perform comparative experiments with other query expansion methods to assess the effectiveness and practical utility of our proposal.

References

[1] M. Aono and H. Doi. A Method for Query Expansion Using a Hierarchy of Clusters. In G. G. Lee, A. Yamada, H. Meng, and S. H. Myaeng, editors, Information Retrieval Technology: Second Asia Information Retrieval Symposium (AIRS 2005), volume 3689 of Lecture Notes in Computer Science, pages 479-484. Springer-Verlag, Jeju Island, Korea, October 2005.

[2] Y. Chang, I. Choi, J. Choi, M. Kim, and V. V. Raghavan. Conceptual Retrieval Based on Feature Clustering of Documents. In Proc. of the Workshop on Mathematical/Formal Methods in Information Retrieval at the 25th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR 2002), pages 89-104, August 2002.

[3] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proc. of the 15th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR '92), pages 318-329, June 1992.

[4] I. S. Dhillon and Y. Guan. Information Theoretic Clustering of Sparse Co-Occurrence Data. In Proc. of the Third IEEE International Conference on Data Mining (ICDM 2003), pages 517-520, November 2003.

[5] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-Theoretic Co-Clustering. In Proc. of the Ninth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD 2003), pages 89-98, August 2003.

[6] K. Eguchi, H. Ito, A. Kumamoto, and Y. Kanata. Adaptive Query Expansion Based on Clustering Search Results. Transactions of the Information Processing Society of Japan, 40(5):2439-2449, 1999.

[7] M. Sasaki and K. Kita. Dimensionality Reduction of Vector Space Information Retrieval Model Based on Random Projection (in Japanese). Journal of Natural Language Processing, 8(1):5-19, 2001.