Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Similar documents
UB at GeoCLEF Department of Geography Abstract

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Parallelism for Nested Loops with Non-uniform and Flow Dependences

A Binarization Algorithm specialized on Document Images and Photos

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Pruning Training Corpus to Speedup Text Classification 1

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

The Research of Support Vector Machine in Agricultural Data Classification

Performance Evaluation of Information Retrieval Systems

A Novel Term_Class Relevance Measure for Text Categorization

Deep Classification in Large-scale Text Hierarchies

Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Edge Detection in Noisy Images Using the Support Vector Machines

Support Vector Machines

Application of k-nn Classifier to Categorizing French Financial News

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Query Clustering Using a Hybrid Query Similarity Measure

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

User Authentication Based On Behavioral Mouse Dynamics Biometrics

Fingerprint matching based on weighting method and SVM

Classifier Selection Based on Data Complexity Measures *

Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

A Misclassification Reduction Approach for Automatic Call Routing

Optimizing Document Scoring for Query Retrieval

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

Using an Automatic Weighted Keywords Dictionary for Intelligent Web Content Filtering

Combining Multiple Resources, Evidence and Criteria for Genomic Information Retrieval

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Automatic Text Categorization of Mathematical Word Problems

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Classic Term Weighting Technique for Mining Web Content Outliers

Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification

Information Retrieval

Efficient Text Classification by Weighted Proximal SVM *

Semantic Image Retrieval Using Region Based Inverted File

SURFACE PROFILE EVALUATION BY FRACTAL DIMENSION AND STATISTIC TOOLS USING MATLAB

Web Document Classification Based on Fuzzy Association

Issues and Empirical Results for Improving Text Classification

Simulation Based Analysis of FAST TCP using OMNET++

Feature Selection as an Improving Step for Decision Tree Construction

Chi Square Feature Extraction Based Svms Arabic Language Text Categorization System

Load-Balanced Anycast Routing

An Image Fusion Approach Based on Segmentation Region

Cluster Analysis of Electrical Behavior

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers

Impact of a New Attribute Extraction Algorithm on Web Page Classification

A Web Site Classification Approach Based On Its Topological Structure

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Web Spam Detection Using Multiple Kernels in Twin Support Vector Machine

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES

Relevance Feedback Document Retrieval using Non-Relevant Documents

Selecting Query Term Alterations for Web Search by Exploiting Query Contexts

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies

Novel Pattern-based Fingerprint Recognition Technique Using 2D Wavelet Decomposition

Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks

Design of Structure Optimization with APDL

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc.

Biostatistics 615/815

Face Recognition Based on SVM and 2DPCA

Reducing Frame Rate for Object Tracking

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

Enhanced AMBTC for Image Compression using Block Classification and Interpolation

High-Boost Mesh Filtering for 3-D Shape Enhancement

Virtual Machine Migration based on Trust Measurement of Computer Node

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

Smoothing Spline ANOVA for variable screening

Comparison Study of Textural Descriptors for Training Neural Network Classifiers

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Query classification using topic models and support vector machine

Wishing you all a Total Quality New Year!

Dynamic Integration of Regression Models

CUM: An Efficient Framework for Mining Concept Units

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Classifying Acoustic Transient Signals Using Artificial Intelligence

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM

Web-supported Matching and Classification of Business Opportunities

Machine Learning 9. week

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

Reliable Negative Extracting Based on knn for Learning from Positive and Unlabeled Examples

A Method of Hot Topic Detection in Blogs Using N-gram Model

Corner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

Audio Content Classification Method Research Based on Two-step Strategy

Comparison of Performance in Text Mining using Categorization of Unstructured Data

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY

Oracle Database: SQL and PL/SQL Fundamentals Certification Course

Vol. 5, No. 3 March 2014 ISSN Journal of Emerging Trends in Computing and Information Sciences CIS Journal. All rights reserved.

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Engineering Science and Technology Review 7 (3) (2014) Research Article

Transcription:

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto + Nagaoka Unversty of Technology 1603-1 Kamtomoka-cho, Nagaoka-sh, Ngata 940-2188, Japan Takash Yukawa Nagaoka Unversty of Technology 1603-1 Kamtomoka-cho, Nagaoka-sh, Ngata 940-2188, Japan yukawa@vos.nagaokaut.ac.p Abstract In the present paper, a term weghtng classfcaton method usng the ch-square statstc s proposed and evaluated n the classfcaton subtask at NTCIR-6 patent retreval task. In ths task, large numbers of patent applcatons are classfed nto F- term categores. Therefore, a patent classfcaton system requres hgh classfcaton speed, as well as hgh classfcaton accuracy. The ch-square statstc can calculate the frequency of word appearance n the F-term and the frequency of word non-appearance n the F-term. The proposed method treats words as a scalar value and a rankng algorthm smply adds the word values of each word ncluded n the test patent document n each F-term. Therefore, the proposed method provdes classfcaton that s sgnfcantly faster than other methods. The proposed method s evaluated n A-precson, R-precson, and F-measure. Although the proposed method dd not obtan the best score, ths method acheves a classfcaton accuracy that s as hgh as those of other methods usng machne learnng or the vector classfcaton method. In ths task, the processng speed s not evaluated. Therefore, processng speed s also evaluated. The evaluaton results show that the proposed method s much faster than that usng the vector classfcaton +Current afflaton: Nppon Telegraph and Telephone West Corporaton method. Evaluaton results of classfcaton accuracy and processng speed show that the proposed method s confrmed to be effectve and to be practcal. Keywords: patent classfcaton, F-term, ch- Square statstc 1. Introducton In the prevous NTCIR workshop, a machne learnng method and a vector classfcaton method, such as the k-nearest Neghbor method, provded good results for the classfcaton subtask. However, these methods are expected to requre a long processng tme when classfyng a large number of patent documents. At the classfcaton subtask, the number of F-term themes and test documents ncrease consderably. Therefore, the patent classfcaton system s requred to have a hgh classfcaton speed as well as hgh classfcaton accuracy. In the present paper, a hgh-speed classfcaton method havng a classfcaton accuracy that s as good as or better than those of tradtonal methods s proposed usng a complex machne learnng classfcaton method. In addton, the detals of proposed method are descrbed heren. The proposed method s mplemented and evaluated for a classfed test collecton n the classfcaton subtask. The results of the evaluaton are presented and dscussed.

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan 2. Background 2.1 Classfcaton subtask at NTCIR-6 scalar value. In ths secton, the proposed patent classfcaton method usng the ch-square statstc weghtng [6] s descrbed n detal. In the classfcaton subtask at NTCIR-5 [2], two subtasks, the theme classfcaton subtask and the F- term classfcaton subtask, are performed. The F- term classfcaton subtask s used for only fve themes, ncludng 2,562 test documents. These documents assgn multple F-terms. In ths NTCIR-5 F-term classfcaton subtask, popular methods nclude the use of the K Nearest Neghbor method [3] and the Support Vector Machne method [4]. These methods provded good classfcaton accuracy but requre a long processng tme. 2.2 Classfcaton subtask at NTCIR-6 In classfcaton subtask at NTCIR-6 [5] only the F-term classfcaton task s performed. The total number of F-term themes s 108 and total number of test documents s 21,606, and these ncrease at a great rate for NTCIR-5. Therefore, the classfcaton system requres faster classfcaton. Although the Japan Patent Offce (JPO) categorzes approxmately 1,900 F-term themes and JPO has over 4.5 mllon patent applcatons, the NTCIR-6 test collecton has a smaller number of documents, for the purpose of practcal use. In the evaluaton, A-Precson, R-precson, and F- measures are used. The A-precson s the average precson when each relevant F-term s ranked to a test document. The R-precson ndcates the precson when the top R relevant F-term s ranked n a test document, where R s the number of relevant categores. The F-measure s the average nverse of the combned recall and precson. The recall s the rato of correct outputs to the total number of correct categores. The precson s the rato of correct outputs to the total number of outputs. 3.1 Preprocessng A patent applcaton composes fve parts: bblographcal nformaton (ttle of nventon, applcaton number, patent applcant, nventor, etc.), an abstract, a clam, a detaled descrpton, and a bref descrpton and schematc drawngs. For ths task, the proposed method uses only abstract and clam. ChaSen s used for morphologcal analyss and only nouns are used n ths method. 3.2 Ch-square statstc Ch-square statstc weghtng s a term weghtng classfcaton method, such as the TF-IDF [7] method. However, ch-square statstc weghtng consders weghts for words whch do not appear n the documents as well as word whch appear n them. The ch-square statstc weghtng apples the followng equaton to each word of each F-term: 2 (x,y) = D(y,n) + D(x-y,(1-)n) +D(m-y,(1-)n)+D(n-x-(m-y),(1-)(1-)n) (1) where = x/n, = m/n, and D(o,e) = (o-e) 2 /e (2) Each parameter can be estmated usng the followng 2x2 contngency table, whch lsts the F- terms of each word n the tranng patent documents. Table 2 2x2 contngency table Word B Word B Subtotal F-term A y x-y x F-term A m-y n-x-m+y n-x Table 1 Number of themes and documents Subtotal m n-m n NTCIR-5 NTCIR-6 JPO Theme 5 108 about 1,900 Test Over 4.5 2562 21606 Here, y s the number of appearances of word B n Document mllon the assgned F-term A n the document, x-y s the number of appearances n the document other than word B n the assgned F-term A, m-y s the number 3. Basc Ch-square Statstc Weghtng of appearances n the document of word B n other than the assgned F-term A, n-x-m+y s the number of appearances n the document of words other than To acheve a hgh-speed classfcaton system, the word B n other than the assgned F-term A, x s the method avodng usng a machne learnng technque total number of the patent documents wth assgned s proposed. The proposed method treats a word as a F-term A, m s the total number of appearances n the

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan document of word B, and n s the total number of the documents n each theme. 3.3 Classfcaton The rankng algorthm sums the calculated chsquare statstc weghtng n each F-term n each word appearng n the test documents. After summng the ch-square statstc weghtng, they are sorted n descendng order: R ( D, F ) W ( F ) W D (3) where W (F ) s the ch-square statstc weghtng of F-term ncluded n document. 3.4 Evaluaton at tentatve test set Table 3 shows the evaluaton results for the average precson usng the NTCIR-5 test collecton as a tentatve test set as well as the same results for the TF-IDF method as a baselne method of term weghtng. 4.1 N-gram ch-square statstc weghtng The N-gram s used to consder the co-occurrence relaton between words. The N-gram ch-square statstc can be calculated n the same way as that for the words. Table 4 N-gram 2x2 contngency table N-gram B N-gram B Subtotal F-term A y x-y x F-term A m-y n-x-m+y n-x Subtotal m n-m n The value of document s calculated as sums of the followng two values: sums of ch-square statstc weghts for words appeared n the document. sums of ch-square statstc weghts for N-gram appeared n the document. WD R ( D, F ) W ( F ) N ( F ) N D (4) Table 3 Evaluaton results of the ch-square statstc and baselne method at NTCIR-5 test collecton System 2B022 3G301 4B064 5H180 5J104 Ch- Square 0.4232 0.3171 0.4262 0.4837 0.3770 TF-IDF 0.3352 0.3053 0.4104 0.4648 0.3080 As show n Table 3, the ch-square statstc term weghtng method performs well for all themes. In partcular, the results of 2B022 and 5J104 are good. These themes are few tranng patent document. The ch-square statstc term weghtng classfcaton method performs well even wth few tranng data. 4. Improved Ch-square The prevous secton ndcates that the proposed method performs well compared wth the baselne of the term weghtng method. However, these evaluaton results ndcated that the proposed method had no advantage over the other methods used at NTCIR-5. Therefore, two mprovements are appled to the proposed method. In ths secton, the mprovement of the ch-square statstc term weghtng s descrbed n detal. where N (F ) s the N-gram ch-squared weghtng of F-term ncluded n document. In ths paper, bgram (N=2) s used for the evaluaton. 4.2 Weght emphass for Words n F-term descrptons Words n F-term descrptons are consdered to be mportant words. Therefore, f the words n F-term descrptons appear n the test document, these word ch-square statstc weghts are added to ts weght: W ( F ) W L R ( D, F ) N ( F ) W D W ( F ) W L ND (5) where L s word n F-term descrptons ncludng F- term. 4.3 Evaluaton wth the mproved method Table 5 shows the evaluaton results for the average precson at NTCIR-5 test collecton, as compared wth the other methods consdered heren. In the Table 5, VSM, SVM and K-NN are method and result of other teams partcpated n classfcaton subtask at NTCIR-5. As show n Table 5, the mproved method acheves accuracy as hgh as or better than the accuracy of other vector classfcaton methods. These results show that although term weghtng and smple methods lke the proposed method can obtan good results for a classfed patent document.

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Table 5 Evaluaton results of the mproved chsquare method at NTCIR-5 test collecton System 2B022 3G301 4B064 5H180 5J104 Proposed 0.4349 0.3956 0.4877 0.5451 0.4051 VSM 0.3142 0.1498 0.2196 0.2060 0.1516 SVM 0.3705 0.3568 0.4271 0.5080 0.3387 K-NN 0.4812 0.4091 0.5701 0.6222 0.4281 5. Evaluaton of the NTCIR-6 test set In ths secton, the evaluaton results at NTCIR-6 of the proposed method are descrbed and compared wth all of the teams that partcpated n the classfcaton subtask at NTCIR-6 patent retreval task. The weght for each word n the F-term descrpton s vared from 1 to 2, and the N-gram chsquare statstc weghtng s vared from 0.5 to 1.0. Table 7 shows the parameters and results for each system. For the proposed system, the best score was obtaned by NUT5 usng the N-gram ch-square statstc weght of 0.8 and the F-term label key word weght of 2. 5.2 Evaluaton result of NTCIR-6 Fgures 1, 2, and 3 show the results for all of the teams for A-Precson, R-Precson and F-measures, respectvely. Sx teams of 46 systems, ncludng the team of the present study, partcpated n the workshop. The proposed method ranked 5th among all teams. However, there s no sgnfcant dfference n A-Precson and R-Precson between the top four teams. 5.1 Evaluaton results of the proposed system The proposed method s appled to sx runs wth dfferent parameters as shown n Table 6. The runs 1 through 3 use word ch-square statstc term weghtng and b-gram ch-square statstc term weghtng. The runs 4 through 6 use word ch-square statstc term weghtng, b-gram ch-square statstc term weghtng, and weghts for words n the F-term descrpton. Table 6 System parameters Run ID N-gram F-term label NUT01 0.5 1 NUT02 0.8 1 NUT03 1.0 1 NUT04 0.5 2 NUT05 0.8 2 NUT06 1.0 2 Table 7 Evaluaton results of NTCIR-6 Fg. 1 Evaluaton Result of A-Precson Fg. 2 Evaluaton Result of R-Precson Run ID A-Precson R-Precson F-measure NUT01 0.4039 0.3644 0.2391 NUT02 0.4053 0.3656 0.2366 NUT03 0.4065 0.3656 0.2357 NUT04 0.4090 0.3687 0.2469 NUT05 0.4101 0.3700 0.2432 NUT06 0.4100 0.3698 0.2413 Fg. 3 Evaluaton Result of F-measure

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan 5.3 Dscusson Table 8 shows the classfcaton methods of all teams. Most teams use a machne learnng system or the vector classfcaton method. These systems are assumed to requre a long tme to learn and classfy all of the documents. The proposed system acheves fast classfcaton. In ths task, the classfcaton speed s not evaluated. The proposed method s evaluated n comparson wth the vector classfcaton method. Durng the preprocessng phase, the proposed method performs approxmately 3.5 tmes faster than the vector classfcaton method. In addton, durng the classfcaton phase, the proposed method performs approxmately fve tmes faster than the vector classfcaton method. Table 8 System Team 1 Team 2 Team 3 Team 4 Team 5 Proposed system 6. Concluson Classfcaton method of the NTCIR-6 patent workshop HSMV, SVM SVM, NB NB, Maxmum Entropy K-NN K-NN Ch-Squared References [1] Japan Patent Offce. Admnstraton of Patent 2006 Annual report. [2] Makoto Iwayama, Atush Fu, Norko Kando: Overvew of Classfcaton Subtask at NTCIR-5 Patent Retreval Task, Proc. NTCIR-5 Workshop Meetng (2005). [3] Y. Yang and X. Lu. A re-examnaton of text categorzaton methods. Proc. 22nd Annual Internatonal ACM SIGIR Conference on Research and Development n Informaton Retreval (1999). [4] N. Crstann and J. Shawe-Taylor. An Introducton to Support Vector Machnes and Other Kernel-based Learnng s. Cambrdge Unversty Press (2000). [5] Makoto Iwayama, Atsush Fu, Norko Kando, Overvew of Classfcaton Subtask at NTCIR-6 Patent Retreval Task, Proceedngs of the 6th NTCIR Workshop, 2007. [6] Shnch Morshta, Jun Sese: Traversng Itemset Lattces wth Statstcal Metrc Prunng, Proc. ACM SIGACT-SIGMOD-SIGART Symp. On Database Systems (PODS), pp226-236 (2000). [7] Salton, G., and Buckley, C. Term-weghtng approaches n automatc text retreval. Informaton Processng and Management 24 (1998), 513-523. [8] Shannon, C., A Mathematcal Theory of Communcaton, Bell Syst. Tech. J.27. In the present paper, a hgh-accuracy and hghspeed patent classfcaton method s proposed for the F-term classfcaton subtask. The proposed method s a fast weghtng method usng the ch-square statstc term. The proposed method was appled to sx systems n the classfcaton subtask at NTICR-6 and the results of ths applcaton were evaluated. The results of accuracy evaluaton were good, even though the best score was not obtaned by the proposed method, confrmng the effectveness of the proposed method.