Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto + Nagaoka Unversty of Technology 1603-1 Kamtomoka-cho, Nagaoka-sh, Ngata 940-2188, Japan Takash Yukawa Nagaoka Unversty of Technology 1603-1 Kamtomoka-cho, Nagaoka-sh, Ngata 940-2188, Japan yukawa@vos.nagaokaut.ac.p Abstract In the present paper, a term weghtng classfcaton method usng the ch-square statstc s proposed and evaluated n the classfcaton subtask at NTCIR-6 patent retreval task. In ths task, large numbers of patent applcatons are classfed nto F- term categores. Therefore, a patent classfcaton system requres hgh classfcaton speed, as well as hgh classfcaton accuracy. The ch-square statstc can calculate the frequency of word appearance n the F-term and the frequency of word non-appearance n the F-term. The proposed method treats words as a scalar value and a rankng algorthm smply adds the word values of each word ncluded n the test patent document n each F-term. Therefore, the proposed method provdes classfcaton that s sgnfcantly faster than other methods. The proposed method s evaluated n A-precson, R-precson, and F-measure. Although the proposed method dd not obtan the best score, ths method acheves a classfcaton accuracy that s as hgh as those of other methods usng machne learnng or the vector classfcaton method. In ths task, the processng speed s not evaluated. Therefore, processng speed s also evaluated. The evaluaton results show that the proposed method s much faster than that usng the vector classfcaton +Current afflaton: Nppon Telegraph and Telephone West Corporaton method. Evaluaton results of classfcaton accuracy and processng speed show that the proposed method s confrmed to be effectve and to be practcal. Keywords: patent classfcaton, F-term, ch- Square statstc 1. Introducton In the prevous NTCIR workshop, a machne learnng method and a vector classfcaton method, such as the k-nearest Neghbor method, provded good results for the classfcaton subtask. However, these methods are expected to requre a long processng tme when classfyng a large number of patent documents. At the classfcaton subtask, the number of F-term themes and test documents ncrease consderably. Therefore, the patent classfcaton system s requred to have a hgh classfcaton speed as well as hgh classfcaton accuracy. In the present paper, a hgh-speed classfcaton method havng a classfcaton accuracy that s as good as or better than those of tradtonal methods s proposed usng a complex machne learnng classfcaton method. In addton, the detals of proposed method are descrbed heren. The proposed method s mplemented and evaluated for a classfed test collecton n the classfcaton subtask. The results of the evaluaton are presented and dscussed.

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan 2. Background 2.1 Classfcaton subtask at NTCIR-6 scalar value. In ths secton, the proposed patent classfcaton method usng the ch-square statstc weghtng [6] s descrbed n detal. In the classfcaton subtask at NTCIR-5 [2], two subtasks, the theme classfcaton subtask and the F- term classfcaton subtask, are performed. The F- term classfcaton subtask s used for only fve themes, ncludng 2,562 test documents. These documents assgn multple F-terms. In ths NTCIR-5 F-term classfcaton subtask, popular methods nclude the use of the K Nearest Neghbor method [3] and the Support Vector Machne method [4]. These methods provded good classfcaton accuracy but requre a long processng tme. 2.2 Classfcaton subtask at NTCIR-6 In classfcaton subtask at NTCIR-6 [5] only the F-term classfcaton task s performed. The total number of F-term themes s 108 and total number of test documents s 21,606, and these ncrease at a great rate for NTCIR-5. Therefore, the classfcaton system requres faster classfcaton. Although the Japan Patent Offce (JPO) categorzes approxmately 1,900 F-term themes and JPO has over 4.5 mllon patent applcatons, the NTCIR-6 test collecton has a smaller number of documents, for the purpose of practcal use. In the evaluaton, A-Precson, R-precson, and F- measures are used. The A-precson s the average precson when each relevant F-term s ranked to a test document. The R-precson ndcates the precson when the top R relevant F-term s ranked n a test document, where R s the number of relevant categores. The F-measure s the average nverse of the combned recall and precson. The recall s the rato of correct outputs to the total number of correct categores. The precson s the rato of correct outputs to the total number of outputs. 3.1 Preprocessng A patent applcaton composes fve parts: bblographcal nformaton (ttle of nventon, applcaton number, patent applcant, nventor, etc.), an abstract, a clam, a detaled descrpton, and a bref descrpton and schematc drawngs. For ths task, the proposed method uses only abstract and clam. ChaSen s used for morphologcal analyss and only nouns are used n ths method. 3.2 Ch-square statstc Ch-square statstc weghtng s a term weghtng classfcaton method, such as the TF-IDF [7] method. However, ch-square statstc weghtng consders weghts for words whch do not appear n the documents as well as word whch appear n them. The ch-square statstc weghtng apples the followng equaton to each word of each F-term: 2 (x,y) = D(y,n) + D(x-y,(1-)n) +D(m-y,(1-)n)+D(n-x-(m-y),(1-)(1-)n) (1) where = x/n, = m/n, and D(o,e) = (o-e) 2 /e (2) Each parameter can be estmated usng the followng 2x2 contngency table, whch lsts the F- terms of each word n the tranng patent documents. Table 2 2x2 contngency table Word B Word B Subtotal F-term A y x-y x F-term A m-y n-x-m+y n-x Table 1 Number of themes and documents Subtotal m n-m n NTCIR-5 NTCIR-6 JPO Theme 5 108 about 1,900 Test Over 4.5 2562 21606 Here, y s the number of appearances of word B n Document mllon the assgned F-term A n the document, x-y s the number of appearances n the document other than word B n the assgned F-term A, m-y s the number 3. Basc Ch-square Statstc Weghtng of appearances n the document of word B n other than the assgned F-term A, n-x-m+y s the number of appearances n the document of words other than To acheve a hgh-speed classfcaton system, the word B n other than the assgned F-term A, x s the method avodng usng a machne learnng technque total number of the patent documents wth assgned s proposed. The proposed method treats a word as a F-term A, m s the total number of appearances n the

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan document of word B, and n s the total number of the documents n each theme. 3.3 Classfcaton The rankng algorthm sums the calculated chsquare statstc weghtng n each F-term n each word appearng n the test documents. After summng the ch-square statstc weghtng, they are sorted n descendng order: R ( D, F ) W ( F ) W D (3) where W (F ) s the ch-square statstc weghtng of F-term ncluded n document. 3.4 Evaluaton at tentatve test set Table 3 shows the evaluaton results for the average precson usng the NTCIR-5 test collecton as a tentatve test set as well as the same results for the TF-IDF method as a baselne method of term weghtng. 4.1 N-gram ch-square statstc weghtng The N-gram s used to consder the co-occurrence relaton between words. The N-gram ch-square statstc can be calculated n the same way as that for the words. Table 4 N-gram 2x2 contngency table N-gram B N-gram B Subtotal F-term A y x-y x F-term A m-y n-x-m+y n-x Subtotal m n-m n The value of document s calculated as sums of the followng two values: sums of ch-square statstc weghts for words appeared n the document. sums of ch-square statstc weghts for N-gram appeared n the document. WD R ( D, F ) W ( F ) N ( F ) N D (4) Table 3 Evaluaton results of the ch-square statstc and baselne method at NTCIR-5 test collecton System 2B022 3G301 4B064 5H180 5J104 Ch- Square 0.4232 0.3171 0.4262 0.4837 0.3770 TF-IDF 0.3352 0.3053 0.4104 0.4648 0.3080 As show n Table 3, the ch-square statstc term weghtng method performs well for all themes. In partcular, the results of 2B022 and 5J104 are good. These themes are few tranng patent document. The ch-square statstc term weghtng classfcaton method performs well even wth few tranng data. 4. Improved Ch-square The prevous secton ndcates that the proposed method performs well compared wth the baselne of the term weghtng method. However, these evaluaton results ndcated that the proposed method had no advantage over the other methods used at NTCIR-5. Therefore, two mprovements are appled to the proposed method. In ths secton, the mprovement of the ch-square statstc term weghtng s descrbed n detal. where N (F ) s the N-gram ch-squared weghtng of F-term ncluded n document. In ths paper, bgram (N=2) s used for the evaluaton. 4.2 Weght emphass for Words n F-term descrptons Words n F-term descrptons are consdered to be mportant words. Therefore, f the words n F-term descrptons appear n the test document, these word ch-square statstc weghts are added to ts weght: W ( F ) W L R ( D, F ) N ( F ) W D W ( F ) W L ND (5) where L s word n F-term descrptons ncludng F- term. 4.3 Evaluaton wth the mproved method Table 5 shows the evaluaton results for the average precson at NTCIR-5 test collecton, as compared wth the other methods consdered heren. In the Table 5, VSM, SVM and K-NN are method and result of other teams partcpated n classfcaton subtask at NTCIR-5. As show n Table 5, the mproved method acheves accuracy as hgh as or better than the accuracy of other vector classfcaton methods. These results show that although term weghtng and smple methods lke the proposed method can obtan good results for a classfed patent document.

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Table 5 Evaluaton results of the mproved chsquare method at NTCIR-5 test collecton System 2B022 3G301 4B064 5H180 5J104 Proposed 0.4349 0.3956 0.4877 0.5451 0.4051 VSM 0.3142 0.1498 0.2196 0.2060 0.1516 SVM 0.3705 0.3568 0.4271 0.5080 0.3387 K-NN 0.4812 0.4091 0.5701 0.6222 0.4281 5. Evaluaton of the NTCIR-6 test set In ths secton, the evaluaton results at NTCIR-6 of the proposed method are descrbed and compared wth all of the teams that partcpated n the classfcaton subtask at NTCIR-6 patent retreval task. The weght for each word n the F-term descrpton s vared from 1 to 2, and the N-gram chsquare statstc weghtng s vared from 0.5 to 1.0. Table 7 shows the parameters and results for each system. For the proposed system, the best score was obtaned by NUT5 usng the N-gram ch-square statstc weght of 0.8 and the F-term label key word weght of 2. 5.2 Evaluaton result of NTCIR-6 Fgures 1, 2, and 3 show the results for all of the teams for A-Precson, R-Precson and F-measures, respectvely. Sx teams of 46 systems, ncludng the team of the present study, partcpated n the workshop. The proposed method ranked 5th among all teams. However, there s no sgnfcant dfference n A-Precson and R-Precson between the top four teams. 5.1 Evaluaton results of the proposed system The proposed method s appled to sx runs wth dfferent parameters as shown n Table 6. The runs 1 through 3 use word ch-square statstc term weghtng and b-gram ch-square statstc term weghtng. The runs 4 through 6 use word ch-square statstc term weghtng, b-gram ch-square statstc term weghtng, and weghts for words n the F-term descrpton. Table 6 System parameters Run ID N-gram F-term label NUT01 0.5 1 NUT02 0.8 1 NUT03 1.0 1 NUT04 0.5 2 NUT05 0.8 2 NUT06 1.0 2 Table 7 Evaluaton results of NTCIR-6 Fg. 1 Evaluaton Result of A-Precson Fg. 2 Evaluaton Result of R-Precson Run ID A-Precson R-Precson F-measure NUT01 0.4039 0.3644 0.2391 NUT02 0.4053 0.3656 0.2366 NUT03 0.4065 0.3656 0.2357 NUT04 0.4090 0.3687 0.2469 NUT05 0.4101 0.3700 0.2432 NUT06 0.4100 0.3698 0.2413 Fg. 3 Evaluaton Result of F-measure

Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan 5.3 Dscusson Table 8 shows the classfcaton methods of all teams. Most teams use a machne learnng system or the vector classfcaton method. These systems are assumed to requre a long tme to learn and classfy all of the documents. The proposed system acheves fast classfcaton. In ths task, the classfcaton speed s not evaluated. The proposed method s evaluated n comparson wth the vector classfcaton method. Durng the preprocessng phase, the proposed method performs approxmately 3.5 tmes faster than the vector classfcaton method. In addton, durng the classfcaton phase, the proposed method performs approxmately fve tmes faster than the vector classfcaton method. Table 8 System Team 1 Team 2 Team 3 Team 4 Team 5 Proposed system 6. Concluson Classfcaton method of the NTCIR-6 patent workshop HSMV, SVM SVM, NB NB, Maxmum Entropy K-NN K-NN Ch-Squared References [1] Japan Patent Offce. Admnstraton of Patent 2006 Annual report. [2] Makoto Iwayama, Atush Fu, Norko Kando: Overvew of Classfcaton Subtask at NTCIR-5 Patent Retreval Task, Proc. NTCIR-5 Workshop Meetng (2005). [3] Y. Yang and X. Lu. A re-examnaton of text categorzaton methods. Proc. 22nd Annual Internatonal ACM SIGIR Conference on Research and Development n Informaton Retreval (1999). [4] N. Crstann and J. Shawe-Taylor. An Introducton to Support Vector Machnes and Other Kernel-based Learnng s. Cambrdge Unversty Press (2000). [5] Makoto Iwayama, Atsush Fu, Norko Kando, Overvew of Classfcaton Subtask at NTCIR-6 Patent Retreval Task, Proceedngs of the 6th NTCIR Workshop, 2007. [6] Shnch Morshta, Jun Sese: Traversng Itemset Lattces wth Statstcal Metrc Prunng, Proc. ACM SIGACT-SIGMOD-SIGART Symp. On Database Systems (PODS), pp226-236 (2000). [7] Salton, G., and Buckley, C. Term-weghtng approaches n automatc text retreval. Informaton Processng and Management 24 (1998), 513-523. [8] Shannon, C., A Mathematcal Theory of Communcaton, Bell Syst. Tech. J.27. In the present paper, a hgh-accuracy and hghspeed patent classfcaton method s proposed for the F-term classfcaton subtask. The proposed method s a fast weghtng method usng the ch-square statstc term. The proposed method was appled to sx systems n the classfcaton subtask at NTICR-6 and the results of ths applcaton were evaluated. The results of accuracy evaluaton were good, even though the best score was not obtaned by the proposed method, confrmng the effectveness of the proposed method.