Classification and clustering methods for documents by probabilistic latent semantic indexing model


A Short Course at Tamkang University, Taipei, Taiwan, R.O.C., March 7-9, 2006

Classification and clustering methods for documents by probabilistic latent semantic indexing model

Shigeichi Hirasawa
Department of Industrial and Management Systems Engineering, School of Science and Engineering, Waseda University
3-4-1, Okubo, Shinjuku, Tokyo 169-8555 Japan
Phone: +81-3-5286-3290  Fax: +81-3-5273-7215  e-mail: hirasawa@hirasa.mgmt.waseda.ac.jp
Related areas: data base, information retrieval, knowledge acquisition, text mining, classification, clustering

Abstract

Based on an information retrieval model, especially the probabilistic latent semantic indexing (PLSI) model, we discuss methods for classification and clustering of a set of documents. A classification method is presented, and its good performance is demonstrated by applying it to a set of benchmark documents with free format (text only). The classification method is then modified into a clustering method, and the clustering method is applied to a set of experimental documents with fixed and free formats to partition them into two clusters, where the experimental documents are obtained from student questionnaires. Since the experimental documents are already categorized, the performance of the clustering method can be evaluated clearly. The method shows better performance than the conventional one based on the vector space model.

1. Introduction

Recent developments in information retrieval techniques enable us to process a large amount of text data. The information retrieval techniques include classification and clustering techniques for a set of documents [BYRN99]. These techniques are used not only for classical information retrieval systems such as technical paper archives, but also for customer relationship management (CRM), knowledge management systems (KMS), or questionnaire analysis (QA) at enterprises for the purpose of market research or personnel management. In this paper, we discuss classification and clustering techniques based on the probabilistic latent semantic indexing (PLSI) model [Hofmann99]. A new classification method [IIGSH03-a] [HC04] for a set of documents is presented using the maximum a posteriori probability criterion, similar to the naive Bayesian technique. Using Japanese benchmark documents with free format [Mainichi95], the method is evaluated and demonstrates good performance compared to conventional techniques for relatively small sets of documents, where a document with free format implies texts. The method successfully uses the characteristic that the EM algorithm converges to a value dependent on an initial value. Then the classification method is modified into a new clustering method. The clustering method is applied to a set of real documents, where real documents are those obtained by student questionnaires which are composed of both fixed and free formats [HIIGS03] [IGH05]. A document with fixed format implies items such as those of selecting one from sentences, words, symbols, or numbers, while a document with free format implies the usual texts. We can find such documents in technical paper archives, questionnaires, or knowledge collaboration. In the case of paper archives, the documents with fixed format (called items in this paper) correspond to the name of authors, the name of journals, the year of publication, the name of publishers, the name of countries, and so on, i.e., discrete concepts.
As is found in the traditional vector space model of information retrieval systems, a co-occurrence matrix is used for the representation of a document set. The documents with fixed format are represented by an item-document matrix G = [g_mj], where g_mj is the selected result of the item m (i_m) in the document j (d_j). The documents with free format are also represented by a term-document matrix H = [h_ij], where h_ij is the frequency of the term i (t_i) in the document j (d_j). The dimensions of the matrices G and H are I × D and T × D, respectively. Both matrices are compressed into matrices of smaller dimension by the probabilistic decomposition in PLSI [Hofmann99], similar to the singular value decomposition (SVD) in LSI (latent semantic indexing) [BYRN99]. The unobserved states are z_k (k = 1, 2, ..., K). Introducing a weight λ (0 ≤ λ ≤ 1), the log-likelihood function corresponding to the matrix [λG^T, (1−λ)H^T]^T is maximized by the EM algorithm [CH01], where A^T is the transpose matrix of A. Then we obtain the probabilities Pr(z_k) (k = 1, 2, ..., K) and the conditional probabilities Pr(t_i | z_k), Pr(i_m | z_k), and Pr(d_j | z_k). Using these probabilities, Pr(i_m, d_j) and Pr(t_i, d_j) are derived. We decide the state z_k for d_j depending on Pr(z_k | d_j), and a similarity function between z_k and z_k' can be defined in the usual way, i.e., by cosine or by inner product. With these preparations, we use the group average distance method with the similarity measure for agglomeratively clustering the states z_k until the number of clusters becomes S, where S ≤ K [HC03].

To show the effectiveness of the methods, we first apply them to the benchmark test document set [Mainichi95], which has already been categorized. Then, as an experiment, we apply the proposed method to a document set given by student questionnaires [SIGIH03], where the students are the members of a class (Introduction to Computer Science, in the second academic year, undergraduate school) of the present author. The contents of the questionnaires consist of questions answered with fixed format, e.g., "Are you interested in wearable computers?" (its answer is yes or no), and questions with free format, e.g., "Write your image of computers." Merging the documents of students from two different classes, the merged documents are partitioned into two categories. We show that each member of the partitioned classes coincides with that of the original classes at a high rate. Its performance is better than that of the conventional method based on the vector space model. A final object of this experiment is to find helpful leads for faculty development [HIASG04] [IIGSH03-b].

2. Information Retrieval Model

Early information retrieval systems adopted the (1) Boolean model, based on index terms (i.e., keywords), some of which are still in use for commercial purposes. To avoid over-simplification by this model, and to enable ranking the relevant documents together with automatic indexing, the (2) vector space (VS) model was proposed in the early '70s [Salton71]. To improve the performance of the VS model, the latent semantic indexing (LSI) model was studied, reducing the dimension of the vector space by using singular value decomposition (SVD) [BYRN99]. As a similar approach, the probabilistic latent semantic indexing (PLSI) model, based on a statistical latent class model, has recently been proposed by T. Hofmann [Hofmann99]. Information retrieval models are shown in Table 2.1.

Table 2.1: Information retrieval models

Base: Set theory — (Classic) Boolean Model; Fuzzy; Extended Boolean Model
Base: Algebraic — (Classical) Vector Space Model (VSM) [7]; Generalized VSM; Latent Semantic Indexing (LSI) Model [2]; Probabilistic LSI (PLSI) Model [4]; Neural Network Model
Base: Probabilistic — (Classical) Probabilistic Model; Extended Probabilistic Model; Inference Network Model; Bayesian Network Model

2.1 The Vector Space Model (VSM)

The VS model uses non-binary weights for the i-th index term (t_i) in the j-th document (d_j) for a given document set D and queries q.
[Vector Space Model] Let T be a term set used for representing a document set D. Let t_i (i = 1, 2, ..., T) be the i-th term in T, where T is a subset of the set T_0 of all terms appearing in D, and let d_j (j = 1, 2, ..., D) be the j-th document in D. Then a term-document matrix A = [a_ij] is given by the weight w_ij ≥ 0 associated with a pair (t_i, d_j).

In the VS model, the weight w_ij is usually given by the so-called tf-idf value, where tf stands for the term frequency and idf for the inverse document frequency. When the number of occurrences of the i-th term (t_i) in the j-th document (d_j) is f_ij, then tf(i, j) = f_ij. When the number of documents in D in which the term t_i appears is f(i), then idf(i) = log(D / f(i)). The tf-idf value is calculated by their product. As the result, for the VS model the weight w_ij is given by

  w_ij = tf(i, j) · idf(i)   (2.1)

and is equal to a_ij. The i-th row of the matrix A represents the frequency vector of the term t_i in D, and the j-th column, that of d_j in T. We use the term vector t_i and the document vector d_j as

  t_i = (a_i1, a_i2, ..., a_iD)^T   (2.2)
  d_j = (a_1j, a_2j, ..., a_Tj)^T   (2.3)

where x^T is the transpose vector of x. Similarly to the vector d_j, we also use a query vector q for a query q given by the weights associated with the pairs (t_i, q) as follows:

  q = (q_1, q_2, ..., q_T)^T   (2.4)

Then we can define the similarity s(q, d_j) between q and d_j. In the case of measuring it by the cosine of the angle between the vectors q and d_j, we have

  s(q, d_j) = q^T d_j / (||q|| ||d_j||)   (2.5)

where ||x|| is the norm of x.

2.2 The Latent Semantic Indexing (LSI) Model

The LSI model is accomplished by mapping each document and query vector into a lower-dimensional space by using SVD.

[Truncated LSI Model] Let an element of a term-document matrix A ∈ R^(T×D) be given by eq. (2.1). Then the matrix A is decomposed into A_K by the truncated SVD as follows:

  A ≈ A_K = U_K Σ_K V_K^T   (2.6)

where U_K ∈ R^(T×K), Σ_K ∈ R^(K×K), V_K ∈ R^(D×K), and K ≤ p ≤ min{T, D}. In eq. (2.6), ||A − A_K||_F is minimized for any K, where p is the rank of A and ||·||_F is the Frobenius matrix norm. Let the term-document matrix A be given by the reduced-rank matrix A_K by the truncated SVD; then a query vector q ∈ R^(T×1) in eq. (2.4) is represented by q̂ ∈ R^(K×1) in the K-dimensional space:

  q̂ = U_K^T q ∈ R^(K×1)   (2.7)

(in the other case, q̂ = Σ_K^(−1) U_K^T q ∈ R^(K×1) is used). Then s(q, d_j) is also computed by

  s(q, d_j) = q̂^T d̂_j / (||q̂|| ||d̂_j||)   (2.8)

where

  d̂_j = Σ_K V_K^T e_j ∈ R^(K×1)   (2.9)

and e_j = (0, 0, ..., 0, 1, 0, ..., 0)^T is the j-th canonical vector.

2.3 The Probabilistic Latent Semantic Indexing (PLSI) Model

In contrast to the LSI model, the PLSI model is based on a mixture decomposition derived from a latent state model. A term-document matrix A = [a_ij] is directly given by the term frequency tf(i, j) = f_ij, i.e., a_ij is the number of occurrences of a term t_i in a document d_j. In the LSI model, the matrix A ∈ R^(T×D) is decomposed into A_K with smaller dimension by the SVD, using the K principal eigenvectors. In the PLSI model, on the other hand, the matrix A is probabilistically decomposed into K unobserved states, where the k-th state is denoted by z_k ∈ Z (k = 1, 2, ..., K), and Z is the set of states.
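The tf-idf weighting of eq. (2.1), the cosine similarity of eq. (2.5), and the truncated-SVD projections of eqs. (2.6)-(2.9) can be sketched in a few lines of NumPy. The count matrix below is an illustrative toy example, not data from the paper:

```python
import numpy as np

# Toy term-document count matrix F (T = 4 terms x D = 3 documents); illustrative values
F = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 1., 2.]])
T, D = F.shape

# tf-idf weights per eq. (2.1): w_ij = tf(i, j) * idf(i), idf(i) = log(D / f(i))
df = (F > 0).sum(axis=1)                  # f(i): number of documents containing t_i
idf = np.log(D / df)
A = F * idf[:, None]                      # a_ij = tf(i, j) * idf(i)

def cosine(u, v):
    """Cosine similarity, eq. (2.5)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q = np.array([1., 0., 0., 1.]) * idf      # a query, weighted like a document
sims = [cosine(q, A[:, j]) for j in range(D)]

# Truncated SVD for the LSI model, eqs. (2.6)-(2.9), with K = 2
K = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_K, S_K, Vt_K = U[:, :K], np.diag(s[:K]), Vt[:K, :]
A_K = U_K @ S_K @ Vt_K                    # best rank-K approximation in Frobenius norm

q_hat = U_K.T @ q                         # eq. (2.7)
d_hat = S_K @ Vt_K                        # column j is Sigma_K V_K^T e_j, eq. (2.9)
sims_lsi = [cosine(q_hat, d_hat[:, j]) for j in range(D)]
```

Note that the columns of `d_hat` equal U_K^T A_K, so eqs. (2.7)-(2.9) compare the query and the documents in the same K-dimensional space.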
First, we assume both (i) independence between the pairs (t_i, d_j), and (ii) conditional independence between t_i and d_j, i.e., the term t_i and the document d_j are independent conditioned on the latent state z_k. A graphical representation is depicted in Fig. 2.1.

Fig. 2.1: A graphical model for the PLSI model (nodes d_j, z_k, t_i with the probabilities Pr(d_j | z_k), Pr(z_k), and Pr(t_i | z_k))

The joint probability of t_i and d_j, Pr(t_i, d_j), is given by

  Pr(t_i, d_j) = Pr(d_j) Σ_{z_k ∈ Z} Pr(t_i | z_k) Pr(z_k | d_j)   (2.10)
             = Σ_{z_k ∈ Z} Pr(z_k) Pr(t_i | z_k) Pr(d_j | z_k)   (2.11)

Then we have the probabilities Pr(d_j | z_k), Pr(t_i | z_k), and Pr(z_k). The number of states, or the cardinality of Z, K = |Z|, satisfies

  K ≤ max{T, D}   (2.12)

[PLSI Model] Let a term-document matrix A = [a_ij] be given by only tf(i, j) of eq. (2.1). Then the probabilities Pr(d_j | z_k), Pr(t_i | z_k), and Pr(z_k) are determined by the likelihood principle, i.e., by maximization of the following log-likelihood function:

  L = Σ_{i,j} a_ij log Pr(t_i, d_j)   (2.13)

The maximization technique usually used for the likelihood function is the Expectation-Maximization (EM) algorithm. The EM algorithm iteratively performs the E-step and the M-step as follows:

[EM algorithm] According to eq. (2.11), the maximum value of eq. (2.13) is computed by alternating the E-step and the M-step until convergence.

E-step:

  Pr(z_k | t_i, d_j) = Pr(z_k) Pr(t_i | z_k) Pr(d_j | z_k) / Σ_{k'} Pr(z_k') Pr(t_i | z_k') Pr(d_j | z_k')   (2.14)

M-step:

  Pr(t_i | z_k) = Σ_j a_ij Pr(z_k | t_i, d_j) / Σ_{i',j} a_i'j Pr(z_k | t_i', d_j)   (2.15)
  Pr(d_j | z_k) = Σ_i a_ij Pr(z_k | t_i, d_j) / Σ_{i,j'} a_ij' Pr(z_k | t_i, d_j')   (2.16)
  Pr(z_k) = Σ_{i,j} a_ij Pr(z_k | t_i, d_j) / Σ_{i,j} a_ij   (2.17)

To avoid overtraining to the data in the EM algorithm, a temperature variable β (β > 0) is used; this is called the tempered EM (TEM) [Hofmann99]. At the E-step of the TEM, the numerator and each term of the denominator of eq. (2.14) are replaced by those raised to the power of β.

3. Proposed Methods

We propose new classification and clustering methods based on the PLSI model. The methods depend strongly on the fact that the EM algorithm usually converges to a local optimum solution starting from an initial value. Hence we use representative documents as the initial values for the EM algorithm. Since the latent states are regarded as concepts in the PLSI model, a state corresponds to a category or a cluster. Consequently, we can state that the methods presented in this paper have good performance for a document set of relatively small size.
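The tempered EM iteration of eqs. (2.14)-(2.17) can be sketched as follows. The random initialization and the variable names are our own; in the proposed methods of Section 3, representative documents would replace the random initial values:

```python
import numpy as np

def plsi_em(A, K, n_iter=100, beta=1.0, seed=0):
    """Tempered EM for the PLSI model, eqs. (2.13)-(2.17).

    A    : (T, D) term-document count matrix, a_ij = tf(i, j)
    K    : number of latent states z_k
    beta : temperature (beta = 1 gives the plain EM algorithm)
    Returns Pr(z) of shape (K,), Pr(t|z) of shape (T, K), Pr(d|z) of shape (D, K).
    """
    rng = np.random.default_rng(seed)
    T, D = A.shape
    Pz = np.full(K, 1.0 / K)
    Pt_z = rng.random((T, K)); Pt_z /= Pt_z.sum(axis=0)
    Pd_z = rng.random((D, K)); Pd_z /= Pd_z.sum(axis=0)

    for _ in range(n_iter):
        # E-step, eq. (2.14), with each factor raised to the power beta (TEM)
        post = (Pz[None, None, :] * Pt_z[:, None, :] * Pd_z[None, :, :]) ** beta
        post /= post.sum(axis=2, keepdims=True) + 1e-300   # Pr(z_k | t_i, d_j)

        weighted = A[:, :, None] * post                     # a_ij * Pr(z_k | t_i, d_j)
        Pt_z = weighted.sum(axis=1); Pt_z /= Pt_z.sum(axis=0)   # eq. (2.15)
        Pd_z = weighted.sum(axis=0); Pd_z /= Pd_z.sum(axis=0)   # eq. (2.16)
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()          # eq. (2.17)
    return Pz, Pt_z, Pd_z
```

The log-likelihood of eq. (2.13) can then be evaluated from eq. (2.11) as `(A * np.log(Pt_z @ np.diag(Pz) @ Pd_z.T + 1e-300)).sum()` to monitor convergence.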
3.1 Classification method [IIGSH03-a]

Suppose a set of documents D for which the number of categories is K, where the categories are denoted by C_1, C_2, ..., C_K.

[Proposed classification method]
(1) Choose a subset of documents D* (⊂ D) which are already categorized, and compute the representative document vectors d*_1, d*_2, ..., d*_K:

  d*_k = (1 / n_k) Σ_{d_j ∈ C_k} d_j   (3.1)

where n_k is the number of selected documents used to compute the representative document vector from C_k.
(2) Compute the probabilities Pr(z_k), Pr(d_j | z_k), and Pr(t_i | z_k) which maximize eq. (2.13) by the EM algorithm, where |Z| = K.
(3) Decide the state z_k̂ = C_k̂ for d_j as

  Pr(z_k̂ | d_j) = max_k Pr(z_k | d_j)   (3.2)

By the algorithm described above, a set of documents is classified into K categories. If we can obtain the representative documents prior to classification, they are used for d*_k in eq. (3.1).

3.2 Clustering method [HC03][HIASG04]

Suppose a set of documents to be clustered into S clusters, where the S clusters are denoted by c_1, c_2, ..., c_S.

[Proposed clustering method]
(1) Choose a proper K (≥ S) and compute the probabilities Pr(z_k), Pr(d_j | z_k), and Pr(t_i | z_k) which maximize eq. (2.13) by the EM algorithm, where |Z| = K.
(2) Decide the state z_k̂ for d_j as

  Pr(z_k̂ | d_j) = max_k Pr(z_k | d_j)   (3.3)

If S = K, then c_k̂ = z_k̂.
(3) If S < K, then compute a similarity measure s(z_k, z_k'):

  s(z_k, z_k') = z_k^T z_k' / (||z_k|| ||z_k'||)   (3.4)
  z_k = (Pr(t_1 | z_k), Pr(t_2 | z_k), ..., Pr(t_T | z_k))^T   (3.5)

and use the group average distance method with the similarity measure s(z_k, z_k') for agglomeratively clustering the states z_k until the number of clusters becomes S. Then we have S clusters, and the members of each cluster are the documents assigned to the states in that cluster of states.

By the above algorithm, a set of documents is clustered into S clusters.

4. Experimental Results

We first apply the classification method to the set of benchmark documents [Sakai99] and verify its effectiveness. We then apply the clustering method to the set of student questionnaires, real documents to be analyzed, whose answers are written in both fixed and free formats. All documents used in this paper are written in Japanese.

4.1 Document sets

The document sets which we use as experimental data are shown in Table 4.1.
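Step (3) above, the group-average agglomerative merging of the states z_k under the similarity of eqs. (3.4)-(3.5), might be sketched as follows (a naive implementation for illustration; the function and variable names are ours):

```python
import numpy as np

def cluster_states(Pt_z, S):
    """Step (3) of the proposed clustering method: agglomeratively merge the K
    states z_k, represented by their term distributions Pr(t_i | z_k) (the
    columns of Pt_z), using group-average cosine similarity, eqs. (3.4)-(3.5),
    until S clusters remain. Returns a list of S lists of state indices."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    K = Pt_z.shape[1]
    clusters = [[k] for k in range(K)]
    while len(clusters) > S:
        # Group-average similarity: mean pairwise cosine between the two groups
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = np.mean([cos(Pt_z[:, i], Pt_z[:, j])
                               for i in clusters[a] for j in clusters[b]])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # merge the most similar pair of groups
    return clusters
```

Each document d_j then belongs to the cluster containing its state k̂ from eq. (3.3).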
Table 4.1: Document sets

(a), (b): contents: articles of the Mainichi newspaper in '94 [Mainichi95]; format: free (texts only); # words: 107,835; amount: 101,058; selected documents D_L + D_T: (a) 300 (S = 3), (b) 200-300 per category (S = 2-8, see Table 4.2); categorized: yes (9+1 categories).
(c): contents: questionnaire (see Table 4.3 in detail); format: fixed and free; # words: 3,993; selected documents D_L + D_T: 135+35; categorized: yes (2 categories).

Table 4.2: Selected categories of newspaper

category | contents | # articles | # used for training D_L | # used for test D_T
C_1 | business | 100 | 50 | 50
C_2 | local | 100 | 50 | 50
C_3 | sports | 100 | 50 | 50
Total |  | 300 | 150 | 150

Table 4.3: Contents of the initial questionnaire

Fixed format (items) — 7 major questions (2) — Examples: For how many years have you used computers? / Do you have a plan to study abroad? / Can you assemble a PC? / Do you have any license in information technology?
Free format (texts) — 5 questions (3) — Examples: Write 10 terms in information technology which you know. (4) / Write about your knowledge and experience on computers. / What kind of job will you have after graduation? / What do you imagine from the name of the subject?

(2) Each question has 4-21 minor questions.
(3) Each text is written within 250-300 Chinese and Japanese characters.
(4) There is a possibility to improve the performance of the proposed method by elimination of these items.

Table 4.4: Object classes

Name of subject | Course | Number of students
Introduction to Computer Science (Class CS) | Science course | 135
Introduction to Information Society (Class IS) | Literary course | 35

As shown in Table 4.1, the benchmark data in Japan [Mainichi95], [Sakai99] is a document set composed of 101,058 articles of the Mainichi newspaper in '94, which is prepared for the evaluation of Japanese information retrieval systems. The articles are categorized into 9 categories and the others, depending on their contents (edited location in the newspaper) such as economics, business, sports, or local. (a) is a case of 3 categories as shown in Table 4.2, and (b) covers cases of S (= 2-8) categories, each of which contains 100-450 articles. (c) in Table 4.1, on the other hand, is actual data, i.e., student questionnaires which the present authors want to analyze in order to obtain useful knowledge for managing the classes. Effective clustering gives a proper class partition depending on students' interests, their levels, their experiences, and so on.

4.2 Classification problem: (a)

4.2.1 Experimental conditions of (a)

As shown in Table 4.2, we choose three categories. 100 articles are randomly chosen from each category. Half of them, D_L, are used for training, and the rest, D_T, for test. As baseline classification methods to be compared with the proposed method, the following conventional methods are evaluated, where we call the classification method by the VS model simply the VS method, that by the LSI model the LSI method, and that by the PLSI model the PLSI method.

The VS method: classifies by the cosine similarity measure between the representative document vector and a given document vector in the VS model, where a_ij is given by eq. (2.1).
The LSI method: the same as the VS method except that the term-document matrix [a_ij] is compressed by the SVD in the LSI model, where K = 81, which corresponds to the condition that the cumulative distribution rate = 70[%].
The PLSI method: the same as the VS method except that the matrix [Pr(d_j | z_k)] is compressed by the PLSI model, where K = 10.
The proposed method uses K = 3.
4.2.2 Results of (a)

Table 4.5 shows, for each method, the number n_{k,k̂} of articles classified from category C_k to C_k̂; hence the diagonal elements give the numbers of correct classifications. Table 4.6 indicates the classification error rate for each method.

Table 4.5: Classified number from C_k to C_k̂ for each method

method | from \ to | C_1 | C_2 | C_3
VS method | C_1 | 17 | 4 | 29
 | C_2 | 8 | 38 | 4
 | C_3 | 15 | 4 | 31
LSI method | C_1 | 16 | 6 | 28
 | C_2 | 6 | 43 | 1
 | C_3 | 12 | 5 | 33
PLSI method | C_1 | 41 | 0 | 9
 | C_2 | 0 | 47 | 3
 | C_3 | 13 | 6 | 31
Proposed method | C_1 | 47 | 0 | 3
 | C_2 | 0 | 50 | 0
 | C_3 | 4 | 2 | 44

Table 4.6: Classification error rate

method | classification error [%]
VS method | 42.7
LSI method | 38.7
PLSI method | 20.7
Proposed method | 6.0

The classification process by the EM algorithm is shown in Fig. 4.1 for steps 1, 4, and 4096. At step 1, almost all document vectors are located in the center of the triangle. They then move toward each state z_k (k = 1, 2, 3), depending on the probability Pr(z_k | d_j) at step L, as L increases. Finally, each document vector lies on the lines satisfying Pr(z_1 | d_j) + Pr(z_2 | d_j) + Pr(z_3 | d_j) = 1.

Fig. 4.1: Classification process by the EM algorithm

We see that the proposed classification method is clearly superior to the conventional methods.

4.3 Classification problem (b)

4.3.1 Experimental conditions of (b)

We choose S = 2, 3, ..., 8 categories, each of which contains 100-450 articles randomly chosen. Half of them, D_L, are used for training, and the rest, D_T, for test.

4.3.2 Results of (b)

The number of documents used for training, |D_L|, vs. the classification error rate P_C with parameter S is shown in Fig. 4.2. The number of categories S vs. the classification error rate P_C with parameter |D_L| is also shown in Fig. 4.3, where we always choose |D_L| = |D_T|.

Fig. 4.2: Classification error rate for |D_L|
Fig. 4.3: Classification error rate for S

From Figs. 4.2 and 4.3, we see that the proposed classification method has good performance for small document sets, and the classification error P_C increases as the number of categories S increases.

4.4 Clustering problem: (c)

As stated above, we have demonstrated the effectiveness of the proposed classification method. Based on this verification, we extend it to a clustering method. We assume that the characteristics of the students in Class CS differ from those in Class IS, because their majors are obviously distinct. First, the documents of students in Class CS and those in Class IS are merged. Then the merged documents are partitioned into two clusters by the clustering method stated in 3.2, as shown in Fig. 4.4. We can expect that one cluster contains only the documents in Class CS and the other cluster only those in Class IS. Since we know whether a document comes from Class CS or Class IS, the experiment can be regarded as a classification problem; hence we can easily evaluate the performance of the clustering method by the clustering error rate C(e), the probability that a document is assigned to the wrong class.
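The error rates in Table 4.6 follow directly from the confusion matrices in Table 4.5: each error rate is the off-diagonal mass divided by the 150 test articles. A quick check:

```python
import numpy as np

# Confusion matrices from Table 4.5 (rows: true C_k, columns: assigned C_k-hat)
tables = {
    "VS method":       [[17, 4, 29], [8, 38, 4], [15, 4, 31]],
    "LSI method":      [[16, 6, 28], [6, 43, 1], [12, 5, 33]],
    "PLSI method":     [[41, 0, 9], [0, 47, 3], [13, 6, 31]],
    "Proposed method": [[47, 0, 3], [0, 50, 0], [4, 2, 44]],
}

def error_rate(m):
    """Classification error [%]: off-diagonal mass divided by the total count."""
    m = np.asarray(m)
    return 100.0 * (m.sum() - np.trace(m)) / m.sum()

for name, m in tables.items():
    print(f"{name}: {error_rate(m):.1f}%")
# Reproduces Table 4.6: 42.7, 38.7, 20.7, and 6.0 [%]
```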

Fig. 4.4: Class partition problem solved by the clustering method

4.4.1 Experimental conditions of (c)

Substituting [λG^T, (1−λ)H^T]^T into A = [a_ij] in eq. (2.13), the log-likelihood function L is computed [HC03]. This condition is added to the clustering method stated in 3.2 before step (1). Then the documents given by the student questionnaires in the two classes, Class CS and Class IS, are used. As shown in Table 4.4, the total number of students is 170. As another clustering method to be compared with that developed in this paper, the VS clustering method is evaluated, where the VS (clustering) method uses the tf-idf value given by eq. (2.1) for the matrix A in the VS model.

4.4.2 Results of (c)

Since S = 2, a clustering error occurs when a d_j in Class CS is classified into Class IS, and vice versa. Fig. 4.5 shows the clustering error rate C(e) vs. λ, where λ is the weight for the matrices G and H. If λ = 0, then only the matrix H is used, which corresponds to the use of text (free format) only.

Fig. 4.5: Clustering error rate C(e) vs. λ

The result shows the superiority of the clustering method discussed in this paper. Choosing λ = 0.5 will be favorable to minimize the clustering error. We see that C(e) decreases as K increases. If K becomes too large, however, the performance will degrade because of overfitting. Fig. 4.6 shows that there is an optimum value of K that minimizes C(e), although it is difficult to find it out. We also show the clustering process of the EM algorithm at steps 1, 4, and 1024 for K = 2 and K = 3 in Fig. 4.7. We see that the EM algorithm works well for clustering document sets.

Fig. 4.6: Clustering error rate C(e) vs. K
Fig. 4.7: Clustering process by the EM algorithm

5. Concluding Remarks

We have proposed a classification method for a set of documents and extended it to a clustering method. The classification method exhibits better performance for a document set of comparatively small size by using the idea of the PLSI model. The clustering method also has good performance. We have shown that it is applicable to documents with both fixed and free formats by introducing a weighting parameter λ. In the case of clustering problem (c), we tried to apply the MDL principle [Rissanen83] to decide the optimum value of K. We see that the negative log-likelihood function, −L, decreases slowly as K increases, while the penalty term, proportional to log D, increases rapidly as K increases. If the number of clusters S and the number of states K are small, then the optimum value of K will be small. This suggests that there is a possibility of applying the Bayesian probabilistic latent semantic indexing (BPLSI) model [GISH03], [GIH03] to the clustering problems. Although the optimum value of the number of states K is still hard to decide, the method is robust in the choice of K for small K and S. As an important related study, it is necessary to develop a method for abstracting the characteristics of each cluster [HIIGS03], [IIGSH03-b]. An extension to a method for a set of documents of comparatively large size also remains for further investigation.

Acknowledgement

The author would like to thank Mr. T. Ishida for his valuable support in writing this paper. This work was partially supported by Waseda University Grant for Special Research Project no. 2005B-189.

References

[BYRN99] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, New York, 1999.
[CH01] D. Cohn and T. Hofmann, "The missing link: A probabilistic model of document content and hypertext connectivity," Advances in Neural Information Processing Systems (NIPS) 13, MIT Press, 2001.
[GIH03] M. Goto, T. Ishida, and S. Hirasawa, "Representation method for a set of documents from the viewpoint of Bayesian statistics," Proc. IEEE 2003 Int. Conf. on SMC, pp. 4637-4642, Washington DC, Oct. 2003.
[GISH03] M. Gotoh, J. Itoh, T. Ishida, T. Sakai, and S. Hirasawa, "A method to analyze a set of documents based on Bayesian statistics" (in Japanese), Proc. of 2003 Fall Conference on Information Management, JASMIN, pp. 28-31, Hakodate, Nov. 2003.
[GSIIH03] M. Gotoh, T. Sakai, J. Itoh, T. Ishida, and S. Hirasawa, "Knowledge discovery from questionnaires with selecting and describing answers" (in Japanese), Proc. of PC Conference, pp. 43-46, Kagoshima, Aug. 2003.
[HC03] S. Hirasawa and W. W. Chu, "Knowledge acquisition from documents with both fixed and free formats," Proc. IEEE 2003 Int. Conf. on SMC, pp. 4694-4699, Washington DC, Oct. 2003.
[HC04] S. Hirasawa and W. W. Chu, "Classification methods for documents with both fixed and free format by PLSI model," Proc. of 2004 Int. Conf. on Management Science and Decision Making, pp. 427-444, Taipei, Taiwan, R.O.C., May 2004.
[HIAGS05] S. Hirasawa, T. Ishida, H. Adachi, M. Goto, and T. Sakai, "A document classification and its application to questionnaire analyses" (in Japanese), Proc. of 2005 Spring Conference on Information Management, JASMIN, pp. 54-57, Tokyo, June 2005.
[HIIGS03] S. Hirasawa, T. Ishida, J. Ito, M. Goto, and T. Sakai, "Analyses on student questionnaires with fixed and free formats" (in Japanese), Proc. of Comp. Edu., JUCE, pp. 144-145, Sept. 2003.
[Hofmann99] T. Hofmann, "Probabilistic latent semantic indexing," Proc. of SIGIR'99, ACM Press, pp. 50-57, 1999.
[IGH05] T. Ishida, M. Gotoh, and S. Hirasawa, "Analysis of student questionnaires in the lecture of computer science" (in Japanese), Computer Education, CIEC, vol. 18, pp. 152-159, July 2005.
[IIGSH03-a] J. Itoh, T. Ishida, M. Gotoh, T. Sakai, and S. Hirasawa, "Knowledge discovery in documents based on PLSI" (in Japanese), IEICE 2003 FIT, pp. 83-84, Ebetsu, Sept. 2003.
[IIGSH03-b] T. Ishida, J. Itoh, M. Gotoh, T. Sakai, and S. Hirasawa, "A model of class and its verification" (in Japanese), Proc. of 2003 Fall Conference on Information Management, JASMIN, pp. 226-229, Hakodate, Nov. 2003.
[Mainichi95] Mainichi Newspaper Co., Ltd., '94 Mainichi Newspaper Articles, CD, Naigai Associates, 1995.
[Rissanen83] J. Rissanen, "A universal prior for integers and estimation by minimum description length," Ann. Statist., vol. 11, no. 2, pp. 416-431, 1983.
[Sakai99] T. Sakai, et al., "BMIR-J2: A test collection for evaluation of Japanese information retrieval systems," ACM SIGIR Forum, vol. 33, no. 1, pp. 13-17, 1999.
[Salton71] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall Inc., Englewood Cliffs, NJ, 1971.
[SIGIH03] T. Sakai, J. Itoh, M. Gotoh, T. Ishida, and S. Hirasawa, "Efficient analysis of student questionnaires using information retrieval techniques" (in Japanese), Proc. of 2003 Spring Conference on Information Management, JASMIN, pp. 182-185, Tokyo, June 2003.