Classification and clustering methods for documents by probabilistic latent semantic indexing model


A Short Course at Tamkang University, Taipei, Taiwan, R.O.C., March 7-9, 2006

Classification and clustering methods for documents by probabilistic latent semantic indexing model

Shigeichi Hirasawa
Department of Industrial and Management Systems Engineering, School of Science and Engineering, Waseda University
3-4-1, Okubo, Shinjuku, Tokyo 169-8555 Japan
Phone: +81-3-5286-3290  Fax: +81-3-5273-7215  e-mail: hirasawa@hirasa.mgmt.waseda.ac.jp
Related areas: data base, information retrieval, knowledge acquisition, text mining, classification, clustering

Abstract

Based on an information retrieval model, especially the probabilistic latent semantic indexing (PLSI) model, we discuss methods for classification and clustering of a set of documents. A classification method is presented, and its good performance is demonstrated by applying it to a set of benchmark documents with free format (text only). The classification method is then modified into a clustering method, and the clustering method is applied to a set of experimental documents with fixed and free formats to partition them into two clusters, where the experimental documents are obtained from student questionnaires. Since the experimental documents are already categorized, the performance of the clustering method can be evaluated clearly. The method shows better performance than the conventional one based on the vector space model.

1. Introduction

Recent developments in information retrieval techniques enable us to process a large amount of text data. The information retrieval techniques include classification and clustering techniques for a set of documents [BYRN99]. These techniques are used not only for classical information retrieval systems such as technical paper archives, but also for customer relationship management (CRM), knowledge management systems (KMS), or questionnaire analysis (QA) at enterprises for the purpose of market research or personnel management. In this paper, we discuss classification and clustering techniques based on the probabilistic latent semantic indexing (PLSI) model [Hofmann99]. A new classification method [IIGSH03-a] [HC04] for a set of documents is presented using the maximum a posteriori probability criterion, similar to the naive Bayesian technique. Using Japanese benchmark documents with free format [Mainichi95], the method is evaluated and demonstrates good performance compared to conventional techniques for relatively small sets of documents, where a document with free format implies texts. The method successfully uses the characteristic that the EM algorithm converges to a value dependent on an initial value. Then the classification method is modified into a new clustering method. The clustering method is applied to a set of real documents, where real documents are those obtained by student questionnaires which are composed of both fixed and free formats [HIIGS03] [IGH05]. A document with fixed format implies items such as those of selecting one from sentences, words, symbols, or numbers, while a document with free format implies the usual texts. We can find such documents in technical paper archives, questionnaires, or knowledge collaboration. In the case of paper archives, the documents with fixed format (called items in this paper) correspond to the name of authors, the name of journals, the year of publication, the name of publishers, the name of countries, and so on, i.e., discrete concepts.
As is found in the traditional vector space model of information retrieval systems, a co-occurrence matrix is used for the representation of a document set. The documents with fixed format are represented by an item-document matrix G = [g_mj], where g_mj is the selected result of the item m (i_m) in the document j (d_j). The documents with free format are also represented by a term-document matrix H = [h_ij], where h_ij is the frequency of the term i (t_i) in the document j (d_j). The dimensions of the matrices G and H are I × D and T × D, respectively. Both matrices are compressed into matrices of smaller dimension by the probabilistic decomposition in PLSI [Hofmann99], similar to the singular value decomposition (SVD) in LSI (latent semantic indexing) [BYRN99]. The unobserved states are z_k (k = 1, 2, ..., K). Introducing a weight λ (0 ≤ λ ≤ 1), the log-likelihood function corresponding to the matrix [λG^T, (1−λ)H^T]^T is maximized by the EM algorithm [CH01], where A^T is the transpose matrix of A. Then we obtain the probabilities Pr(z_k) (k = 1, 2, ..., K) and the conditional probabilities Pr(t_i | z_k), Pr(i_m | z_k), and Pr(d_j | z_k). Using these probabilities, Pr(i_m, d_j) and Pr(t_i, d_j) are derived. We decide the state z_k for d_j depending on Pr(z_k | d_j), and a similarity function between z_k and z_k' can be defined in the usual way, i.e., by cosine or by inner product. With these preparations, we use the group average distance method with the similarity measure for agglomeratively clustering the states z_k until the number of clusters becomes S, where S ≤ K [HC03].

To show the effectiveness of the methods, we first apply them to the benchmark test document set [Mainichi95], which has already been categorized. Then, as an experiment, we apply the proposed method to a document set given by student questionnaires [SIGIH03], where the students are the members of a class (Introduction to Computer Science, in the second academic year, undergraduate school) of the present author. The contents of the questionnaires consist of questions answered with fixed format, e.g., "Are you interested in wearable computers?" (its answer is yes or no), and questions with free format, e.g., "Write your image of computers." Merging the documents of students from two different classes, the merged documents are partitioned into two categories. We show that each member of the partitioned classes coincides with that of the original classes at a high rate. Its performance is better than that of the conventional method based on the vector space model. A final object of this experiment is to find helpful leads for faculty development [HIASG04] [IIGSH03-b].

2. Information Retrieval Model

Early information retrieval systems adopted the (1) Boolean model, based on index terms (i.e., keywords), some of which are still in use for commercial purposes. To avoid over-simplification by this model, and to enable ranking the relevant documents together with automatic indexing, the (2) vector space (VS) model was proposed in the early '70s [Salton71]. To improve the performance of the VS model, the latent semantic indexing (LSI) model was studied, reducing the dimension of the vector space by using singular value decomposition (SVD) [BYRN99]. As a similar approach, the probabilistic latent semantic indexing (PLSI) model, based on a statistical latent class model, has recently been proposed by T. Hofmann [Hofmann99]. Information retrieval models are shown in Table 2.1.

Table 2.1: Information retrieval models

Base: Set theory — (Classic) Boolean Model; Fuzzy; Extended Boolean Model
Base: Algebraic — (Classical) Vector Space Model (VSM) [7]; Generalized VSM; Latent Semantic Indexing (LSI) Model [2]; Probabilistic LSI (PLSI) Model [4]; Neural Network Model
Base: Probabilistic — (Classical) Probabilistic Model; Extended Probabilistic Model; Inference Network Model; Bayesian Network Model

2.1 The Vector Space Model (VSM)

The VS model uses non-binary weights for the i-th index term (t_i) in the j-th document (d_j) for a given document set D and queries q.
[Vector Space Model] Let T be a term set used for representing a document set D. Let t_i (i = 1, 2, ..., T) be the i-th term in T, where T is a subset of the set T_0 of all terms appearing in D, and let d_j (j = 1, 2, ..., D) be the j-th document in D. Then a term-document matrix A = [a_ij] is given by the weight w_ij ≥ 0 associated with a pair (t_i, d_j).

In the VS model, the weight w_ij is usually given by the so-called tf-idf value, where tf stands for the term frequency and idf for the inverse document frequency. When the number of occurrences of the i-th term (t_i) in the j-th document (d_j) is f_ij, then tf(i, j) = f_ij. When the number of documents in D in which the term t_i appears is f(i), then idf(i) = log(D / f(i)). The tf-idf value is calculated by their product. As the result, for the VS model the weight w_ij is given by

  w_ij = tf(i, j) · idf(i)   (2.1)

and is equal to a_ij. The i-th row of the matrix A represents the frequency vector of the term t_i in D, and the j-th column, that of d_j in T. We use the term vector t_i and the document vector d_j as

  t_i = (a_i1, a_i2, ..., a_iD)^T   (2.2)
  d_j = (a_1j, a_2j, ..., a_Tj)^T   (2.3)

where x^T is the transpose vector of x. Similarly to the vector d_j, we also use a query vector q for a query q given by the weights associated with the pairs (t_i, q) as follows:

  q = (q_1, q_2, ..., q_T)^T   (2.4)

Then we can define the similarity s(q, d_j) between q and d_j. In the case of measuring it by the cosine of the angle between the vectors q and d_j, we have

  s(q, d_j) = q^T d_j / (||q|| ||d_j||)   (2.5)

where ||x|| is the norm of x.

2.2 The Latent Semantic Indexing (LSI) Model

The LSI model is accomplished by mapping each document and query vector into a lower-dimensional space by using SVD.

[Truncated LSI Model] Let an element of a term-document matrix A ∈ R^(T×D) be given by eq. (2.1). Then the matrix A is decomposed into A_K by the truncated SVD as follows:

  A ≈ A_K = U_K Σ_K V_K^T   (2.6)

where U_K ∈ R^(T×K), Σ_K ∈ R^(K×K), V_K ∈ R^(D×K), and K ≤ p ≤ min{T, D}. In eq. (2.6), ||A − A_K||_F is minimized for any K, where p is the rank of A and ||·||_F is the Frobenius matrix norm. Let the term-document matrix A be given by the reduced-rank matrix A_K by the truncated SVD; then a query vector q ∈ R^(T×1) in eq. (2.4) is represented by q̂ ∈ R^(K×1) in the K-dimensional space:

  q̂ = U_K^T q ∈ R^(K×1)   (2.7)

(in the other case, q̂ = Σ_K^(−1) U_K^T q ∈ R^(K×1) is used). Then s(q, d_j) is also computed by

  s(q, d_j) = q̂^T d̂_j / (||q̂|| ||d̂_j||)   (2.8)

where

  d̂_j = Σ_K V_K^T e_j ∈ R^(K×1)   (2.9)

and e_j = (0, 0, ..., 0, 1, 0, ..., 0)^T is the j-th canonical vector.

2.3 The Probabilistic Latent Semantic Indexing (PLSI) Model

In contrast to the LSI model, the PLSI model is based on a mixture decomposition derived from a latent state model. A term-document matrix A = [a_ij] is directly given by the term frequency tf(i, j) = f_ij, i.e., a_ij is the number of occurrences of a term t_i in a document d_j. In the LSI model, the matrix A ∈ R^(T×D) is decomposed into A_K with smaller dimension by the SVD, using the K principal eigenvectors. In the PLSI model, on the other hand, the matrix A is probabilistically decomposed into K unobserved states, where the k-th state is denoted by z_k ∈ Z (k = 1, 2, ..., K), and Z is the set of states.
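The tf-idf weighting of eq. (2.1), the cosine similarity of eq. (2.5), and the truncated-SVD projections of eqs. (2.6)-(2.9) can be sketched in a few lines of NumPy. The count matrix below is an illustrative toy example, not data from the paper:

```python
import numpy as np

# Toy term-document count matrix F (T = 4 terms x D = 3 documents); illustrative values
F = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 1., 2.]])
T, D = F.shape

# tf-idf weights per eq. (2.1): w_ij = tf(i, j) * idf(i), idf(i) = log(D / f(i))
df = (F > 0).sum(axis=1)                  # f(i): number of documents containing t_i
idf = np.log(D / df)
A = F * idf[:, None]                      # a_ij = tf(i, j) * idf(i)

def cosine(u, v):
    """Cosine similarity, eq. (2.5)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q = np.array([1., 0., 0., 1.]) * idf      # a query, weighted like a document
sims = [cosine(q, A[:, j]) for j in range(D)]

# Truncated SVD for the LSI model, eqs. (2.6)-(2.9), with K = 2
K = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_K, S_K, Vt_K = U[:, :K], np.diag(s[:K]), Vt[:K, :]
A_K = U_K @ S_K @ Vt_K                    # best rank-K approximation in Frobenius norm

q_hat = U_K.T @ q                         # eq. (2.7)
d_hat = S_K @ Vt_K                        # column j is Sigma_K V_K^T e_j, eq. (2.9)
sims_lsi = [cosine(q_hat, d_hat[:, j]) for j in range(D)]
```

Note that the columns of `d_hat` equal U_K^T A_K, so eqs. (2.7)-(2.9) compare the query and the documents in the same K-dimensional space.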
First, we assume both (i) independence between the pairs (t_i, d_j), and (ii) conditional independence between t_i and d_j, i.e., the term t_i and the document d_j are independent conditioned on the latent state z_k. A graphical representation is depicted in Fig. 2.1.

Fig. 2.1: A graphical model for the PLSI model (nodes d_j, z_k, t_i with the probabilities Pr(d_j | z_k), Pr(z_k), and Pr(t_i | z_k))

The joint probability of t_i and d_j, Pr(t_i, d_j), is given by

  Pr(t_i, d_j) = Pr(d_j) Σ_{z_k ∈ Z} Pr(t_i | z_k) Pr(z_k | d_j)   (2.10)
             = Σ_{z_k ∈ Z} Pr(z_k) Pr(t_i | z_k) Pr(d_j | z_k)   (2.11)

Then we have the probabilities Pr(d_j | z_k), Pr(t_i | z_k), and Pr(z_k). The number of states, or the cardinality of Z, K = |Z|, satisfies

  K ≤ max{T, D}   (2.12)

[PLSI Model] Let a term-document matrix A = [a_ij] be given by only tf(i, j) of eq. (2.1). Then the probabilities Pr(d_j | z_k), Pr(t_i | z_k), and Pr(z_k) are determined by the likelihood principle, i.e., by maximization of the following log-likelihood function:

  L = Σ_{i,j} a_ij log Pr(t_i, d_j)   (2.13)

The maximization technique usually used for the likelihood function is the Expectation-Maximization (EM) algorithm. The EM algorithm iteratively performs the E-step and the M-step as follows:

[EM algorithm] According to eq. (2.11), the maximum value of eq. (2.13) is computed by alternating the E-step and the M-step until convergence.

E-step:

  Pr(z_k | t_i, d_j) = Pr(z_k) Pr(t_i | z_k) Pr(d_j | z_k) / Σ_{k'} Pr(z_k') Pr(t_i | z_k') Pr(d_j | z_k')   (2.14)

M-step:

  Pr(t_i | z_k) = Σ_j a_ij Pr(z_k | t_i, d_j) / Σ_{i',j} a_i'j Pr(z_k | t_i', d_j)   (2.15)
  Pr(d_j | z_k) = Σ_i a_ij Pr(z_k | t_i, d_j) / Σ_{i,j'} a_ij' Pr(z_k | t_i, d_j')   (2.16)
  Pr(z_k) = Σ_{i,j} a_ij Pr(z_k | t_i, d_j) / Σ_{i,j} a_ij   (2.17)

To avoid overtraining to the data in the EM algorithm, a temperature variable β (β > 0) is used; this is called the tempered EM (TEM) [Hofmann99]. At the E-step of the TEM, the numerator and each term of the denominator of eq. (2.14) are replaced by those raised to the power of β.

3. Proposed Methods

We propose new classification and clustering methods based on the PLSI model. The methods depend strongly on the fact that the EM algorithm usually converges to a local optimum solution starting from an initial value. Hence we use representative documents as the initial values for the EM algorithm. Since the latent states are regarded as concepts in the PLSI model, a state corresponds to a category or a cluster. Consequently, we can state that the methods presented in this paper have good performance for a document set of relatively small size.
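The tempered EM iteration of eqs. (2.14)-(2.17) can be sketched as follows. The random initialization and the variable names are our own; in the proposed methods of Section 3, representative documents would replace the random initial values:

```python
import numpy as np

def plsi_em(A, K, n_iter=100, beta=1.0, seed=0):
    """Tempered EM for the PLSI model, eqs. (2.13)-(2.17).

    A    : (T, D) term-document count matrix, a_ij = tf(i, j)
    K    : number of latent states z_k
    beta : temperature (beta = 1 gives the plain EM algorithm)
    Returns Pr(z) of shape (K,), Pr(t|z) of shape (T, K), Pr(d|z) of shape (D, K).
    """
    rng = np.random.default_rng(seed)
    T, D = A.shape
    Pz = np.full(K, 1.0 / K)
    Pt_z = rng.random((T, K)); Pt_z /= Pt_z.sum(axis=0)
    Pd_z = rng.random((D, K)); Pd_z /= Pd_z.sum(axis=0)

    for _ in range(n_iter):
        # E-step, eq. (2.14), with each factor raised to the power beta (TEM)
        post = (Pz[None, None, :] * Pt_z[:, None, :] * Pd_z[None, :, :]) ** beta
        post /= post.sum(axis=2, keepdims=True) + 1e-300   # Pr(z_k | t_i, d_j)

        weighted = A[:, :, None] * post                     # a_ij * Pr(z_k | t_i, d_j)
        Pt_z = weighted.sum(axis=1); Pt_z /= Pt_z.sum(axis=0)   # eq. (2.15)
        Pd_z = weighted.sum(axis=0); Pd_z /= Pd_z.sum(axis=0)   # eq. (2.16)
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()          # eq. (2.17)
    return Pz, Pt_z, Pd_z
```

The log-likelihood of eq. (2.13) can then be evaluated from eq. (2.11) as `(A * np.log(Pt_z @ np.diag(Pz) @ Pd_z.T + 1e-300)).sum()` to monitor convergence.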
3.1 Classification method [IIGSH03-a]

Suppose a set of documents D for which the number of categories is K, where the categories are denoted by C_1, C_2, ..., C_K.

[Proposed classification method]
(1) Choose a subset of documents D* (⊂ D) which are already categorized, and compute the representative document vectors d*_1, d*_2, ..., d*_K:

  d*_k = (1 / n_k) Σ_{d_j ∈ C_k} d_j   (3.1)

where n_k is the number of selected documents used to compute the representative document vector from C_k.
(2) Compute the probabilities Pr(z_k), Pr(d_j | z_k), and Pr(t_i | z_k) which maximize eq. (2.13) by the EM algorithm, where |Z| = K.
(3) Decide the state z_k̂ = C_k̂ for d_j as

  Pr(z_k̂ | d_j) = max_k Pr(z_k | d_j)   (3.2)

By the algorithm described above, a set of documents is classified into K categories. If we can obtain the representative documents prior to classification, they are used for d*_k in eq. (3.1).

3.2 Clustering method [HC03][HIASG04]

Suppose a set of documents to be clustered into S clusters, where the S clusters are denoted by c_1, c_2, ..., c_S.

[Proposed clustering method]
(1) Choose a proper K (≥ S) and compute the probabilities Pr(z_k), Pr(d_j | z_k), and Pr(t_i | z_k) which maximize eq. (2.13) by the EM algorithm, where |Z| = K.
(2) Decide the state z_k̂ for d_j as

  Pr(z_k̂ | d_j) = max_k Pr(z_k | d_j)   (3.3)

If S = K, then c_k̂ = z_k̂.
(3) If S < K, then compute a similarity measure s(z_k, z_k'):

  s(z_k, z_k') = z_k^T z_k' / (||z_k|| ||z_k'||)   (3.4)
  z_k = (Pr(t_1 | z_k), Pr(t_2 | z_k), ..., Pr(t_T | z_k))^T   (3.5)

and use the group average distance method with the similarity measure s(z_k, z_k') for agglomeratively clustering the states z_k until the number of clusters becomes S. Then we have S clusters, and the members of each cluster are the documents assigned to the states in that cluster of states.

By the above algorithm, a set of documents is clustered into S clusters.

4. Experimental Results

We first apply the classification method to the set of benchmark documents [Sakai99] and verify its effectiveness. We then apply the clustering method to the set of student questionnaires, real documents to be analyzed, whose answers are written in both fixed and free formats. All documents used in this paper are written in Japanese.

4.1 Document sets

The document sets which we use as experimental data are shown in Table 4.1.
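Step (3) above, the group-average agglomerative merging of the states z_k under the similarity of eqs. (3.4)-(3.5), might be sketched as follows (a naive implementation for illustration; the function and variable names are ours):

```python
import numpy as np

def cluster_states(Pt_z, S):
    """Step (3) of the proposed clustering method: agglomeratively merge the K
    states z_k, represented by their term distributions Pr(t_i | z_k) (the
    columns of Pt_z), using group-average cosine similarity, eqs. (3.4)-(3.5),
    until S clusters remain. Returns a list of S lists of state indices."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    K = Pt_z.shape[1]
    clusters = [[k] for k in range(K)]
    while len(clusters) > S:
        # Group-average similarity: mean pairwise cosine between the two groups
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = np.mean([cos(Pt_z[:, i], Pt_z[:, j])
                               for i in clusters[a] for j in clusters[b]])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # merge the most similar pair of groups
    return clusters
```

Each document d_j then belongs to the cluster containing its state k̂ from eq. (3.3).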
Table 4.1: Document sets

(a), (b): contents: articles of the Mainichi newspaper in '94 [Mainichi95]; format: free (texts only); # words: 107,835; amount: 101,058; selected documents D_L + D_T: (a) 300 (S = 3), (b) 200-300 per category (S = 2-8, see Table 4.2); categorized: yes (9+1 categories).
(c): contents: questionnaire (see Table 4.3 in detail); format: fixed and free; # words: 3,993; selected documents D_L + D_T: 135+35; categorized: yes (2 categories).

Table 4.2: Selected categories of newspaper

category | contents | # articles | # used for training D_L | # used for test D_T
C_1 | business | 100 | 50 | 50
C_2 | local | 100 | 50 | 50
C_3 | sports | 100 | 50 | 50
Total |  | 300 | 150 | 150

Table 4.3: Contents of the initial questionnaire

Fixed format (items) — 7 major questions (2) — Examples: For how many years have you used computers? / Do you have a plan to study abroad? / Can you assemble a PC? / Do you have any license in information technology?
Free format (texts) — 5 questions (3) — Examples: Write 10 terms in information technology which you know. (4) / Write about your knowledge and experience on computers. / What kind of job will you have after graduation? / What do you imagine from the name of the subject?

(2) Each question has 4-21 minor questions.
(3) Each text is written within 250-300 Chinese and Japanese characters.
(4) There is a possibility to improve the performance of the proposed method by elimination of these items.

Table 4.4: Object classes

Name of subject | Course | Number of students
Introduction to Computer Science (Class CS) | Science course | 135
Introduction to Information Society (Class IS) | Literary course | 35

As shown in Table 4.1, the benchmark data in Japan [Mainichi95], [Sakai99] is a document set composed of 101,058 articles of the Mainichi newspaper in '94, which is prepared for the evaluation of Japanese information retrieval systems. The articles are categorized into 9 categories and the others, depending on their contents (edited location in the newspaper) such as economics, business, sports, or local. (a) is a case of 3 categories as shown in Table 4.2, and (b) covers cases of S (= 2-8) categories, each of which contains 100-450 articles. (c) in Table 4.1, on the other hand, is actual data, i.e., student questionnaires which the present authors want to analyze in order to obtain useful knowledge for managing the classes. Effective clustering gives a proper class partition depending on students' interests, their levels, their experiences, and so on.

4.2 Classification problem: (a)

4.2.1 Experimental conditions of (a)

As shown in Table 4.2, we choose three categories. 100 articles are randomly chosen from each category. Half of them, D_L, are used for training, and the rest, D_T, for test. As baseline classification methods to be compared with the proposed method, the following conventional methods are evaluated, where we call the classification method by the VS model simply the VS method, that by the LSI model the LSI method, and that by the PLSI model the PLSI method.

The VS method: classifies by the cosine similarity measure between the representative document vector and a given document vector in the VS model, where a_ij is given by eq. (2.1).
The LSI method: the same as the VS method except that the term-document matrix [a_ij] is compressed by the SVD in the LSI model, where K = 81, which corresponds to the condition that the cumulative distribution rate = 70[%].
The PLSI method: the same as the VS method except that the matrix [Pr(d_j | z_k)] is compressed by the PLSI model, where K = 10.
The proposed method uses K = 3.
4.2.2 Results of (a)

Table 4.5 shows, for each method, the number n_{k,k̂} of articles classified from category C_k to C_k̂; hence the diagonal elements give the numbers of correct classifications. Table 4.6 indicates the classification error rate for each method.

Table 4.5: Classified number from C_k to C_k̂ for each method

method | from \ to | C_1 | C_2 | C_3
VS method | C_1 | 17 | 4 | 29
 | C_2 | 8 | 38 | 4
 | C_3 | 15 | 4 | 31
LSI method | C_1 | 16 | 6 | 28
 | C_2 | 6 | 43 | 1
 | C_3 | 12 | 5 | 33
PLSI method | C_1 | 41 | 0 | 9
 | C_2 | 0 | 47 | 3
 | C_3 | 13 | 6 | 31
Proposed method | C_1 | 47 | 0 | 3
 | C_2 | 0 | 50 | 0
 | C_3 | 4 | 2 | 44

Table 4.6: Classification error rate

method | classification error [%]
VS method | 42.7
LSI method | 38.7
PLSI method | 20.7
Proposed method | 6.0

The classification process by the EM algorithm is shown in Fig. 4.1 for steps 1, 4, and 4096. At step 1, almost all document vectors are located in the center of the triangle. They then move toward each state z_k (k = 1, 2, 3), depending on the probability Pr(z_k | d_j) at step L, as L increases. Finally, each document vector lies on the lines satisfying Pr(z_1 | d_j) + Pr(z_2 | d_j) + Pr(z_3 | d_j) = 1.

Fig. 4.1: Classification process by the EM algorithm

We see that the proposed classification method is clearly superior to the conventional methods.

4.3 Classification problem (b)

4.3.1 Experimental conditions of (b)

We choose S = 2, 3, ..., 8 categories, each of which contains 100-450 articles randomly chosen. Half of them, D_L, are used for training, and the rest, D_T, for test.

4.3.2 Results of (b)

The number of documents used for training, |D_L|, vs. the classification error rate P_C with parameter S is shown in Fig. 4.2. The number of categories S vs. the classification error rate P_C with parameter |D_L| is also shown in Fig. 4.3, where we always choose |D_L| = |D_T|.

Fig. 4.2: Classification error rate for |D_L|
Fig. 4.3: Classification error rate for S

From Figs. 4.2 and 4.3, we see that the proposed classification method has good performance for small document sets, and the classification error P_C increases as the number of categories S increases.

4.4 Clustering problem: (c)

As stated above, we have demonstrated the effectiveness of the proposed classification method. Based on this verification, we extend it to a clustering method. We assume that the characteristics of the students in Class CS differ from those in Class IS, because their majors are obviously distinct. First, the documents of students in Class CS and those in Class IS are merged. Then the merged documents are partitioned into two clusters by the clustering method stated in 3.2, as shown in Fig. 4.4. We can expect that one cluster contains only the documents in Class CS and the other cluster only those in Class IS. Since we know whether a document comes from Class CS or Class IS, the experiment can be regarded as a classification problem; hence we can easily evaluate the performance of the clustering method by the clustering error rate C(e), the probability that a document is assigned to the wrong class.
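The error rates in Table 4.6 follow directly from the confusion matrices in Table 4.5: each error rate is the off-diagonal mass divided by the 150 test articles. A quick check:

```python
import numpy as np

# Confusion matrices from Table 4.5 (rows: true C_k, columns: assigned C_k-hat)
tables = {
    "VS method":       [[17, 4, 29], [8, 38, 4], [15, 4, 31]],
    "LSI method":      [[16, 6, 28], [6, 43, 1], [12, 5, 33]],
    "PLSI method":     [[41, 0, 9], [0, 47, 3], [13, 6, 31]],
    "Proposed method": [[47, 0, 3], [0, 50, 0], [4, 2, 44]],
}

def error_rate(m):
    """Classification error [%]: off-diagonal mass divided by the total count."""
    m = np.asarray(m)
    return 100.0 * (m.sum() - np.trace(m)) / m.sum()

for name, m in tables.items():
    print(f"{name}: {error_rate(m):.1f}%")
# Reproduces Table 4.6: 42.7, 38.7, 20.7, and 6.0 [%]
```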

Fig. 4.4: Class partition problem solved by the clustering method

4.4.1 Experimental conditions of (c)

Substituting [λG^T, (1−λ)H^T]^T into A = [a_ij] in eq. (2.13), the log-likelihood function L is computed [HC03]. This condition is added to the clustering method stated in 3.2 before step (1). Then the documents given by the student questionnaires in the two classes, Class CS and Class IS, are used. As shown in Table 4.4, the total number of students is 170. As another clustering method to be compared with that developed in this paper, the VS clustering method is evaluated, where the VS (clustering) method uses the tf-idf value given by eq. (2.1) for the matrix A in the VS model.

4.4.2 Results of (c)

Since S = 2, a clustering error occurs when a d_j in Class CS is classified into Class IS, and vice versa. Fig. 4.5 shows the clustering error rate C(e) vs. λ, where λ is the weight for the matrices G and H. If λ = 0, then only the matrix H is used, which corresponds to the use of text (free format) only.

Fig. 4.5: Clustering error rate C(e) vs. λ

The result shows the superiority of the clustering method discussed in this paper. Choosing λ = 0.5 will be favorable to minimize the clustering error. We see that C(e) decreases as K increases. If K becomes too large, however, the performance will degrade because of overfitting. Fig. 4.6 shows that there is an optimum value of K that minimizes C(e), although it is difficult to find it out. We also show the clustering process of the EM algorithm at steps 1, 4, and 1024 for K = 2 and K = 3 in Fig. 4.7. We see that the EM algorithm works well for clustering document sets.

Fig. 4.6: Clustering error rate C(e) vs. K
Fig. 4.7: Clustering process by the EM algorithm

5. Concluding Remarks

We have proposed a classification method for a set of documents and extended it to a clustering method. The classification method exhibits better performance for a document set of comparatively small size by using the idea of the PLSI model. The clustering method also has good performance. We have shown that it is applicable to documents with both fixed and free formats by introducing a weighting parameter λ. In the case of clustering problem (c), we tried to apply the MDL principle [Rissanen83] to decide the optimum value of K. We see that the negative log-likelihood function, −L, decreases slowly as K increases, while the penalty term, proportional to log D, increases rapidly as K increases. If the number of clusters S and the number of states K are small, then the optimum value of K will be small. This suggests that there is a possibility of applying the Bayesian probabilistic latent semantic indexing (BPLSI) model [GISH03], [GIH03] to the clustering problems. Although the optimum value of the number of states K is still hard to decide, the method is robust in the choice of K for small K and S. As an important related study, it is necessary to develop a method for abstracting the characteristics of each cluster [HIIGS03], [IIGSH03-b]. An extension to a method for a set of documents of comparatively large size also remains for further investigation.

Acknowledgement

The author would like to thank Mr. T. Ishida for his valuable support in writing this paper. This work was partially supported by Waseda University Grant for Special Research Project no. 2005B-189.

References

[BYRN99] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, New York, 1999.
[CH01] D. Cohn and T. Hofmann, "The missing link: A probabilistic model of document content and hypertext connectivity," Advances in Neural Information Processing Systems (NIPS) 13, MIT Press, 2001.
[GIH03] M. Goto, T. Ishida, and S. Hirasawa, "Representation method for a set of documents from the viewpoint of Bayesian statistics," Proc. IEEE 2003 Int. Conf. on SMC, pp. 4637-4642, Washington DC, Oct. 2003.
[GISH03] M. Gotoh, J. Itoh, T. Ishida, T. Sakai, and S. Hirasawa, "A method to analyze a set of documents based on Bayesian statistics" (in Japanese), Proc. of 2003 Fall Conference on Information Management, JASMIN, pp. 28-31, Hakodate, Nov. 2003.
[GSIIH03] M. Gotoh, T. Sakai, J. Itoh, T. Ishida, and S. Hirasawa, "Knowledge discovery from questionnaires with selecting and describing answers" (in Japanese), Proc. of PC Conference, pp. 43-46, Kagoshima, Aug. 2003.
[HC03] S. Hirasawa and W. W. Chu, "Knowledge acquisition from documents with both fixed and free formats," Proc. IEEE 2003 Int. Conf. on SMC, pp. 4694-4699, Washington DC, Oct. 2003.
[HC04] S. Hirasawa and W. W. Chu, "Classification methods for documents with both fixed and free format by PLSI model," Proc. of 2004 Int. Conf. on Management Science and Decision Making, pp. 427-444, Taipei, Taiwan, R.O.C., May 2004.
[HIAGS05] S. Hirasawa, T. Ishida, H. Adachi, M. Goto, and T. Sakai, "A document classification and its application to questionnaire analyses" (in Japanese), Proc. of 2005 Spring Conference on Information Management, JASMIN, pp. 54-57, Tokyo, June 2005.
[HIIGS03] S. Hirasawa, T. Ishida, J. Ito, M. Goto, and T. Sakai, "Analyses on student questionnaires with fixed and free formats" (in Japanese), Proc. of Comp. Edu., JUCE, pp. 144-145, Sept. 2003.
[Hofmann99] T. Hofmann, "Probabilistic latent semantic indexing," Proc. of SIGIR'99, ACM Press, pp. 50-57, 1999.
[IGH05] T. Ishida, M. Gotoh, and S. Hirasawa, "Analysis of student questionnaires in the lecture of computer science" (in Japanese), Computer Education, CIEC, vol. 18, pp. 152-159, July 2005.
[IIGSH03-a] J. Itoh, T. Ishida, M. Gotoh, T. Sakai, and S. Hirasawa, "Knowledge discovery in documents based on PLSI" (in Japanese), IEICE 2003 FIT, pp. 83-84, Ebetsu, Sept. 2003.
[IIGSH03-b] T. Ishida, J. Itoh, M. Gotoh, T. Sakai, and S. Hirasawa, "A model of class and its verification" (in Japanese), Proc. of 2003 Fall Conference on Information Management, JASMIN, pp. 226-229, Hakodate, Nov. 2003.
[Mainichi95] Mainichi Newspaper Co., Ltd., '94 Mainichi Newspaper Articles, CD, Naigai Associates, 1995.
[Rissanen83] J. Rissanen, "A universal prior for integers and estimation by minimum description length," Ann. Statist., vol. 11, no. 2, pp. 416-431, 1983.
[Sakai99] T. Sakai, et al., "BMIR-J2: A test collection for evaluation of Japanese information retrieval systems," ACM SIGIR Forum, vol. 33, no. 1, pp. 13-17, 1999.
[Salton71] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall Inc., Englewood Cliffs, NJ, 1971.
[SIGIH03] T. Sakai, J. Itoh, M. Gotoh, T. Ishida, and S. Hirasawa, "Efficient analysis of student questionnaires using information retrieval techniques" (in Japanese), Proc. of 2003 Spring Conference on Information Management, JASMIN, pp. 182-185, Tokyo, June 2003.