Categorisation tool, final prototype


February 16, 1999

Project ref. no.: LE
Project title: EuroSearch
Deliverable status: Restricted
Contractual date of delivery: Month 11
Actual date of delivery: Month 13
Deliverable number: 4.3
Deliverable title: Categorisation tool, final prototype
Type: Prototype
Status & version: Final 1.21
Number of pages: 29
WP contributing to the deliverable: WP 4
WP Task responsible: UNIDO
Authors: Norbert Fuhr, Norbert Gövert, Mounia Lalmas, Fabrizio Sebastiani
EC Project Officer: Yves Paternoster
Keywords: automatic categorisation, classification, knn, Rocchio, category description, relevance feedback, probabilistic indexing

Abstract: WP 4 deals with automatic categorisation of Web documents. The categorisation is based on a description-oriented approach to document indexing. This Deliverable describes further progress with respect to the work carried out in Deliverable 4.2, and the final prototype which implements all the components of our categorisation tool.

Summary

The goal of Work Package 4 (WP 4) is to implement a categorisation tool that allows for the automatic categorisation of Web documents. This Deliverable describes further progress with respect to the work carried out in Deliverable 4.2, and the final prototype, which implements all the components of our categorisation tool. The categorisation approach is grounded in an automatic textual analysis of Web documents that associates weighted terms with documents. We use probabilistic indexing based on the description-oriented indexing approach developed at the University of Dortmund. This method and its implementation are described in detail in this Deliverable. In our earlier work, we identified two different tasks with respect to document categorisation: category-centred and document-centred categorisation. Both tasks have been implemented, and we provide details of both implementations. Extensive experimentation and evaluation are needed to demonstrate and to improve the effectiveness of our automatic categorisation tool. In this Deliverable, we describe our experimentation environment and present some preliminary results concerning the effectiveness of our approach. Further experiments will be carried out as part of Work Package 6.

Contents

1 Introduction
2 Architecture
3 The intermediary prototype
  3.1 Creation of the test-bed
  3.2 Document indexing
    3.2.1 Term extraction
    3.2.2 Description step
    3.2.3 Refinement
  3.3 The knn classifier
  3.4 The Web interface
4 The final prototype
  4.1 Decision step
  4.2 Refinement
  4.3 Category-centred categorisation
    4.3.1 Category description generation
    4.3.2 Determining thresholds
  4.4 Document-centred categorisation
    4.4.1 A new probabilistic interpretation of knn
  4.5 The Web interface
5 Experimentation environment
  5.1 Baseline
  5.2 Recall and precision
  5.3 Dimensions of the experimentation space
    5.3.1 Collection
    5.3.2 Document normalisation
    5.3.3 Category normalisation
    5.3.4 Term space modification
    5.3.5 Feature selection and calculation
    5.3.6 Polynomial structure
    5.3.7 Query document indexing
    5.3.8 Query term selection
    5.3.9 Retrieval function
    5.3.10 Evaluation
6 Results and analysis
  6.1 Probabilistic indexing vs. baseline indexing
  6.2 Probabilistic vs. cosine retrieval function
  6.3 radius1 vs. root indexing
  6.4 Analysis
7 Future work
  7.1 Application of the categorisation tool to German Web documents
  7.2 Evaluation
  7.3 Improvements
  7.4 Integration
8 Conclusion
References

1 Introduction

The goal of Work Package 4 (WP 4) is to implement a categorisation tool that allows for the automatic categorisation of Web documents. In Deliverable 4.1, the specification of the categorisation tool was described and an overall architecture was defined. The implementation of some of the components, as well as the implementation of a baseline indexing method, led to an intermediary prototype of the categorisation tool; this was described in Deliverable 4.2.

This Deliverable describes further progress with respect to the work carried out in Deliverable 4.2: we have implemented the final prototype. In this Deliverable, we describe the implementation of the components of our categorisation tool. In particular, we present in detail the probabilistic indexing method upon which our automatic categorisation is based, the generation of the category descriptions, which are necessary to categorise documents, and a new implementation of the refinement process, which is the fourth step of the indexing phase. From the indexing of pre-categorised documents, new documents can be categorised. In Deliverable 4.1, two different categorisation tasks were identified: category-centred and document-centred classification. Both tasks have been implemented and are described in detail in this Deliverable.

Extensive experimentation and evaluation are needed to demonstrate and improve the effectiveness of our approach to automatically categorising Web documents. Within this Deliverable we describe our experimentation environment. We identified various dimensions which can be considered when implementing our categorisation tool, and we present some preliminary results obtained from the experiments carried out so far. Further experiments will be done as part of Work Package 6.

The outline of this Deliverable is as follows. In Section 2 we summarise the main components of the architecture of the categorisation tool. Section 3 gives an overview of what has been implemented for the intermediary prototype (this was described in Deliverable 4.2). Newly implemented components, and where appropriate the theory behind them, are described in Section 4. Section 5 gives a specification of the dimensions to be considered when experimenting with our categorisation tool. Results of preliminary experiments are presented in Section 6, together with some conclusions. Section 7 gives an outlook on further work. Finally, we conclude in Section 8.

2 Architecture

In Deliverable 4.1, the overall architecture was defined with the following components:

Test-bed creation Automatic categorisation of documents requires a test-bed of pre-categorised documents, upon which the classifiers are trained, and with which the categorisation tasks are experimented and validated.

Document indexing Assigning categories to documents, or vice versa, requires a suitable representation of the documents. The approach followed in this project is grounded in an automatic textual analysis of Web documents that associates weighted terms with documents. We use a probabilistic approach based on the description-oriented indexing approach developed in [Fuhr & Buckley 91]. To allow for a more efficient categorisation of documents, a refinement step for the term space is further applied.

Category description generation One of the categorisation tasks in this project works by issuing to the retrieval system a query consisting of a description of the category of interest. For the automatic generation of such category descriptions the Rocchio method [Rocchio 71] is used.

Classification tasks Using both the probabilistic indexing and the category descriptions, documents can then be classified according to some given categories. There are two categorisation tasks. Category-centred categorisation is used to search a database for the documents that best satisfy a given category description. Document-centred categorisation is used to identify the category to which a given document belongs. The former task is performed by processing category descriptions as queries against an information retrieval system containing the documents to be classified (and returned to the user). The latter task is performed based on the probabilistic indexing of documents taken from a learning sample, using the knn classifier [Yang 94].

3 The intermediary prototype

In this section, we describe the components of the categorisation tool that were fully implemented for the previous Deliverable 4.2. These include the creation of the test-bed, some steps of the indexing, a preliminary implementation of the knn classifier, and a Web interface to the prototype.

3.1 Creation of the test-bed

The categorisation of documents requires a test-bed of pre-categorised documents, for learning, experimentation, and evaluation purposes. The creation of the test-bed required the spidering and the normalisation of documents. We used the Computers and Internet category of the Yahoo! catalogue. Documents from this catalogue were spidered and then normalised to handle the various structures of Web documents. The two processes were described in Deliverable 4.1, and their outcomes, including various statistics, were presented in Deliverable 4.2.

3.2 Document indexing

Assigning categories to documents, or vice versa, requires a suitable representation of the documents. This necessitates the indexing of documents. In this project, we represent documents as vectors of weighted terms. We perform a probabilistic indexing of document terms based on the description-oriented indexing approach developed in [Fuhr & Buckley 91]. The approach consists of three steps: term extraction, description step, and decision step (see Figure 1). An additional step, the refinement step, is performed to allow for a more efficient categorisation of documents. The term extraction step, description step and refinement step were implemented in the intermediary prototype, and are briefly described in the following subsections. The decision step has been implemented in the final prototype, and is described in Section 4.1. A simple indexing based on the standard tf·idf weighting [Salton & Buckley 88] was used in the intermediary prototype. This indexing is also used as a baseline for the evaluation of our indexing approach.

3.2.1 Term extraction

We extracted terms from the normalised documents, using standard knowledge extraction methods for text. The outcome was a list of single words forming the indexing vocabulary, referred to as the term space.

3.2.2 Description step

The description step consists of the construction of relevance descriptions for term-document pairs (t, d). A relevance description x(t, d) is defined for each term t in the term space and each document d. The vector comprises a set of features that are considered to be important for the task of assigning weights to terms with respect to a given document. These features can be classified into three groups:

- document-related features, e.g. document length, maximum term frequency;
- term-document-related features, e.g. term frequency in a given document;
- term-related features, e.g. inverse document frequency.

The description-oriented indexing approach makes no additional assumptions about the choice of features and the structure of the description vector x. Therefore the actual definition of relevance descriptions can be adapted to the specific application context, namely the representation of documents and the amount of learning data available. In Deliverable 4.2 we described ten features which reflect our field of application, namely the categorisation of Web documents. Besides features known to be useful in standard text retrieval applications, we used features taking into account the HTML nature of the documents. These are, for instance, features indicating whether a term is highlighted, or whether it occurs in a document heading. The outcome of the description step is a database of triplets (t, d, x(t, d)) of term, document, and their associated relevance description x(t, d). A relevance description is an n-dimensional vector, where each dimension corresponds to one of the features used (n is the number of features used in total).
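To make the description step concrete, the following sketch computes a relevance description for a term-document pair. It is illustrative only: the actual prototype uses the ten features defined in Deliverable 4.2, so the feature set, the Document structure and all names below are our assumptions, not the prototype's implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Document:
    terms: list        # all terms of the document, in order of occurrence
    title_terms: set   # terms occurring in the HTML title
    heading_terms: set # terms occurring in HTML headings (h1..h6)

def relevance_description(t, d, num_docs, doc_freq):
    """Build a feature vector x(t, d) for the term-document pair (t, d),
    mixing the three feature groups: document-related (length, maximum term
    frequency), term-document-related (term frequency, HTML markup) and
    term-related (inverse document frequency)."""
    counts = {w: d.terms.count(w) for w in set(d.terms)}
    tf = counts.get(t, 0)                      # term-document related
    max_tf = max(counts.values())              # document related
    doc_len = len(d.terms)                     # document related
    idf = math.log(num_docs / doc_freq[t])     # term related
    in_title = 1.0 if t in d.title_terms else 0.0      # HTML-specific
    in_heading = 1.0 if t in d.heading_terms else 0.0  # HTML-specific
    return (tf / max_tf, math.log(doc_len), idf, in_title, in_heading)
```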

3.2.3 Refinement

The purpose of the refinement phase is the reduction in size of the term space. This means that, while before refinement a document is represented as a vector of n weighted terms (output from the decision step), with n being the cardinality of the term space, after refinement a document is represented as a vector of m weighted terms, with m < n. This reduction is accomplished for reasons of computational efficiency: the categorisation task can be carried out much more efficiently, basically at no cost in terms of effectiveness, in both the document-centred and the category-centred categorisation tasks. Term space reduction was discussed in detail in Deliverable 4.2. For the intermediary prototype discussed in Deliverable 4.2, a term space reduction technique based on document frequency had been implemented. For the final prototype, a more sophisticated technique, based on a simplified version of the χ² measure, has been implemented. This is discussed in detail in Section 4.2.

3.3 The knn classifier

The knn algorithm [Yang 94] computes, for each document d to be categorised and each category C:

$$sim(d, C) = \sum_{d' \in NN} sim(d, d') \cdot C(d')$$

where NN is the set of the k documents d' in the set of training documents (the k nearest neighbours of d) for which sim(d, d') (the similarity between d and d') is maximum. The function C(d') yields 1 if document d' belongs to category C, and 0 otherwise. The document d is categorised under the category C with the highest sim(d, C) value. In the intermediary prototype, we applied this initial version of knn based on our baseline indexing (tf·idf). The implementation uses k = 30, which was demonstrated by [Yang 94] to give good performance. The similarity score between documents is the cosine function [Salton & Buckley 88].
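A compact sketch of this scoring rule follows, assuming documents are represented as sparse term-weight dictionaries; the function names and data layout are ours, not the prototype's.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
         * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_categorise(d, training, k=30):
    """Rank categories for document d by sim(d, C) = sum of sim(d, d') over
    the k nearest training documents d' that belong to C (i.e. C(d') = 1).
    training is a list of (term_vector, categories) pairs; k = 30 as in the
    intermediary prototype."""
    sims = sorted(((cosine(d, vec), cats) for vec, cats in training),
                  key=lambda sc: sc[0], reverse=True)
    scores = {}
    for s, cats in sims[:k]:
        for c in cats:
            scores[c] = scores.get(c, 0.0) + s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For document-centred categorisation, the document is then assigned to the top-ranked category (or categories).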

3.4 The Web interface

Based on the baseline indexing, we presented a Web interface for performing document-centred categorisation. The user of this interface could browse through our test-bed (much like browsing Yahoo!'s Computers and Internet catalogue). At any point, categories and their assigned documents could be viewed. In addition, the user was able to submit documents of the test set to our knn classifier, in order to get a ranking of categories for a given document.

4 The final prototype

The newly implemented components are the decision step (deriving probabilistic term weights), the generation of the category descriptions, and the two categorisation tasks.

4.1 Decision step

In the decision step, probabilistic index term weights are determined based on the outcome of the description step (Section 3.2.2). In this section, we describe the decision step in detail, together with an example illustrating it. The difference between standard probabilistic indexing approaches and the description-oriented approach is that in the former, term weights are estimated based on the probability P(R | t, d), whereas in the latter, the estimates are based on the probability P(R | x(t, d)). This is shown in Figure 1.

[Figure 1: Subdivision of the indexing task. A term-document pair (t, d) is mapped by the description step to a relevance description x(t, d), from which the decision step derives the probabilistic indexing weight P(R | x(t, d)); standard probabilistic indexing estimates the weight P(R | t, d) directly.]

In a classification problem, P(R | t, d) is the probability that a document d is judged relevant to an arbitrary category C, given that the document is indexed by the term t. The probability P(R | x(t, d)) can be viewed as the probability that a document d is judged relevant to an arbitrary category C, given that (1) t has relevance description x with d, and (2) t has relevance description x with a document of the same category C.

The estimation of P(R | t, d) requires relevance data (for each category, the document-term pairs). The amount of relevance data per category may not be sufficient, since the number of occurrences of term-document pairs per category may be too small. With the description-oriented approach, document-term pairs with different documents or terms can be mapped to the same relevance description (the same x). Therefore, the amount of relevance data available for the estimation of a specific indexing weight does not depend on the number of categories or documents for which we have relevance assessments.

The probability P(R | x(t, d)) is derived from a learning sample L ⊆ D × D × R, where D is the set of documents, R = {R, R̄} stands for relevant and not relevant (the method can be generalised to include a wider relevance scale), and

$$L = \{(d, d', r(d, d')) \mid d, d' \in D\} \quad \text{with} \quad r(d, d') = \begin{cases} R & \text{if } d \text{ and } d' \text{ belong to the same category,} \\ \bar{R} & \text{otherwise.} \end{cases}$$

Based on L, we form a multi-set of relevance descriptions with relevance judgements:

$$L_x = [\,(x(t, d),\ r(d, d')) \mid t \in d \cap d',\ (d, d', r(d, d')) \in L\,]$$

This set with multiple occurrences of elements (a bag) forms the basis for the estimation of the probabilistic index term weights P(R | x(t, d)). Following the concepts of other probabilistic information retrieval models, the probabilities P(R | x(t, d)) could be estimated directly by computing the corresponding relative frequencies from those elements of L_x that have the same relevance description.

As an example, assume that the relevance description consists of a two-dimensional vector x = (x₁, x₂) with the following features:

$$x_1 = \begin{cases} 1 & \text{if } t \text{ occurs in the title of } d, \\ 0 & \text{otherwise,} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if } t \text{ occurs once in } d, \\ 2 & \text{if } t \text{ occurs at least twice in } d. \end{cases}$$

Table 1 shows a small learning sample (the relevance data) with three documents d₁, d₂ and d₃, and seven terms t₁, ..., t₇. From this data, the estimates shown in Table 2 can be derived by means of relative frequencies.

  d     d'    r(d, d')   t     x(t, d)
  d₁    d₂    R          t₁    (1, 1)
                         t₂    (0, 1)
                         t₃    (1, 2)
  d₂    d₁    R          t₁    (0, 2)
                         t₂    (1, 1)
                         t₃    (0, 1)
  d₁    d₃    R          t₂    (0, 2)
                         t₅    (0, 2)
                         t₆    (1, 1)
                         t₇    (1, 2)
  d₃    d₁    R          t₂    (1, 2)
                         t₅    (0, 2)
                         t₆    (0, 1)
                         t₇    (1, 1)
  d₂    d₃    R̄          t₁    (1, 2)
                         t₃    (0, 1)
                         t₇    (1, 1)
  d₃    d₂    R̄          t₁    (1, 2)
                         t₃    (1, 1)
                         t₇    (1, 1)

Table 1: Example of a learning sample

  x        P(R | x)
  (0, 2)   4/4
  (0, 1)   3/4
  (1, 2)   3/5
  (1, 1)   4/7

Table 2: Probability estimates for the example

Better estimates can be achieved by applying probabilistic classification procedures as developed in pattern recognition or machine learning, because they use additional (plausible) assumptions to compute the estimates. The classification procedure yielding estimates of the probabilities P(R | x(t, d)) is termed an indexing function e(x(t, d)). Let y(d, d') denote a class variable representing the relevance judgement r(d, d') for each element of L:

$$y(d, d') = \begin{cases} 1 & \text{if } r(d, d') = R, \\ 0 & \text{otherwise.} \end{cases}$$

Now we seek a regression function e_opt(x) which yields an optimal approximation of

the class variable y (E denotes the expectation). As optimisation criterion, the minimum squared error is used:

$$E\left((y - e_{opt}(x))^2\right) \overset{!}{=} \min$$

To derive the (optimal) indexing function from the learning sample L_x, we use the least square polynomials (LSP) approach [Knorz 83] [Fuhr 89], which was shown to be effective in [Fuhr & Buckley 93]. In this approach, polynomials with a predefined structure are taken as function classes; therefore, the class of polynomials from which the indexing function is to be selected has to be defined first. Based on the relevance description in vector form x, a polynomial structure v(x) = (v₁, ..., v_L) has to be defined:

$$v(x) = (1, x_1, x_2, \ldots, x_N, x_1^2, x_1 x_2, \ldots)$$

Here N denotes the number of dimensions of x. In practice, mostly linear and quadratic polynomials are considered. The indexing function then takes the form e(x) = aᵀv(x), where a = (aᵢ), i = 1, ..., L, is the coefficient vector to be estimated. So P(R | x(t, d)) is estimated by the polynomial

$$e(x(t, d)) = a_1 + a_2 x_1 + a_3 x_2 + \ldots + a_{N+1} x_N + a_{N+2} x_1^2 + a_{N+3} x_1 x_2 + \ldots = a^T v(x)$$

For our example, we take the following (linear) polynomial structure:

$$v(x) = (1, x_1, x_2)$$

So a = (a₁, a₂, a₃) and the indexing function is e(x(t, d)) = a₁ + a₂x₁ + a₃x₂. The coefficient vector a is computed by solving the following linear equation system [Schürmann 77]:

$$E(v\,v^T)\,a = E(v\,y)$$

As an approximation of the expectations, the corresponding arithmetic means over the learning sample are taken. The actual computation process is based on the empirical momental matrix M, which contains both sides of the above equation system:

$$M = \overline{(v\,v^T,\ v\,y)} = \frac{1}{|L_x|} \sum_{(x(t,d),\, r(d,d')) \in L_x} \left( v(x(t,d))\, v(x(t,d))^T,\ v(x(t,d))\, y(d,d') \right)$$
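Before continuing with the example, here is a minimal numpy sketch of this estimation step for the linear structure v(x) = (1, x₁, x₂). The data layout is our assumption, but the learning sample is the one from Table 1.

```python
import numpy as np

def lsp_coefficients(samples):
    """Solve E(v v^T) a = E(v y) for the linear structure v(x) = (1, x1, x2),
    approximating the expectations by arithmetic means over the sample."""
    V = np.array([[1.0, x1, x2] for (x1, x2), _ in samples])
    y = np.array([float(label) for _, label in samples])
    n = len(samples)
    lhs = V.T @ V / n                  # empirical E(v v^T)
    rhs = V.T @ y / n                  # empirical E(v y)
    return np.linalg.solve(lhs, rhs)

# The learning sample L_x of Table 1; y = 1 encodes the judgement R.
L_x = [((1, 1), 1), ((0, 1), 1), ((1, 2), 1),               # (d1, d2)
       ((0, 2), 1), ((1, 1), 1), ((0, 1), 1),               # (d2, d1)
       ((0, 2), 1), ((0, 2), 1), ((1, 1), 1), ((1, 2), 1),  # (d1, d3)
       ((1, 2), 1), ((0, 2), 1), ((0, 1), 1), ((1, 1), 1),  # (d3, d1)
       ((1, 2), 0), ((0, 1), 0), ((1, 1), 0),               # (d2, d3)
       ((1, 2), 0), ((1, 1), 0), ((1, 1), 0)]               # (d3, d2)
a = lsp_coefficients(L_x)  # approx. (0.70, -0.28, 0.12): the coefficient
                           # vector derived in the text, up to rounding
```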

The matrix M can then be solved to yield the coefficient vector a. For our example, we obtain the following momental matrix (the first three columns are the empirical E(v vᵀ), the last column the empirical E(v y)):

$$M = \begin{pmatrix} 1.00 & 0.60 & 1.45 & 0.70 \\ 0.60 & 0.60 & 0.85 & 0.35 \\ 1.45 & 0.85 & 2.35 & 1.05 \end{pmatrix}$$

From there we derive the coefficient vector a = (0.69, -0.28, 0.11). The resulting indexing function is:

$$e(x) = 0.69 - 0.28\,x_1 + 0.11\,x_2$$

Table 3 shows the probability estimates derived from the above indexing function. As one can see from a comparison of the values of P(R | x) and e(x), the approximation e(x) of P(R | x) preserves their ordering.

  x        P(R | x)   e(x)
  (0, 2)   4/4        0.91
  (0, 1)   3/4        0.80
  (1, 2)   3/5        0.63
  (1, 1)   4/7        0.52

Table 3: Probability estimates for the example with LSP

4.2 Refinement

A new term space reduction technique has been implemented for this final prototype. We call it simplified χ² (sχ²), to emphasise its relationship with the well-known χ² measure. Simplified χ² is defined by the following formula:

$$s\chi^2(t, C) = P(t, C) \cdot P(\bar{t}, \bar{C}) - P(t, \bar{C}) \cdot P(\bar{t}, C)$$

where P is a probability function on the training set. For example, P(t, C̄) represents the probability that a random document belonging to the training set is indexed by term t and is not tagged by category C; the other probabilities are to be interpreted accordingly. If we use the abbreviations shown in Table 4, we can write the previous formula as

$$s\chi^2(t, C) = \alpha\delta - \beta\gamma$$

  P(t, C)   P(t, C̄)   P(t̄, C)   P(t̄, C̄)
  α         β          γ          δ

Table 4: Abbreviations
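A sketch of how this measure might be computed from training-set counts follows; the count structures are assumptions, but the score is the αδ - βγ formula above, maximised over categories as introduced at the end of this section.

```python
def s_chi2_max(t, categories, n_docs, df, cat_size, joint):
    """Simplified chi-square score of a term t for term space reduction.
    n_docs: number of training documents; df[t]: documents containing t;
    cat_size[c]: documents tagged with category c; joint[(t, c)]: documents
    containing t AND tagged with c. Returns the maximum over categories of
    alpha * delta - beta * gamma."""
    best = 0.0
    for c in categories:
        alpha = joint.get((t, c), 0) / n_docs                  # P(t, C)
        beta = (df[t] - joint.get((t, c), 0)) / n_docs         # P(t, not C)
        gamma = (cat_size[c] - joint.get((t, c), 0)) / n_docs  # P(not t, C)
        delta = 1.0 - alpha - beta - gamma                     # P(not t, not C)
        best = max(best, alpha * delta - beta * gamma)
    return best
```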

Recent investigations ([Schütze et al. 95], [Yang 97]) have shown that χ² is an extremely effective measure for term space reduction, especially when reduction by a large factor is needed. The χ² measure is defined by the following formula:

$$\chi^2(t, C) = \frac{N \cdot (\alpha\delta - \beta\gamma)^2}{(\alpha + \gamma)(\beta + \delta)(\alpha + \beta)(\gamma + \delta)}$$

where N is the cardinality of the training set. In [Ng et al. 97], however, it is argued that this measure is counter-intuitive, because squaring the (αδ - βγ) factor in the numerator of the χ² measure gives equal emphasis to the α and δ factors (which indicate a positive correlation between t and C, and should therefore be emphasised) and to the β and γ factors (which indicate a negative correlation between t and C, and should therefore be de-emphasised). Thus a correlation coefficient is proposed, defined by the equation

$$CC(t, C) = \frac{\sqrt{N} \cdot (\alpha\delta - \beta\gamma)}{\sqrt{(\alpha + \gamma)(\beta + \delta)(\alpha + \beta)(\gamma + \delta)}}$$

which is the square root of χ². In this way, pairs (t, C) that show positive correlation correctly receive a high CC value, while pairs (t, C) that show negative correlation correctly receive a low CC value.

However, some factors are either not influential or of dubious intuitive value. For instance, the √N factor in the numerator has no effect, since it is equal for all pairs (t, C). Further, the rationale for the presence of the (α + β) factor (corresponding to P(t)) and of the (γ + δ) factor (corresponding to P(t̄)) is not clear; their effect is to emphasise extremely rare terms (since it is for these terms that P(t) · P(t̄) is lowest, and consequently CC is highest), which a recent investigation [Yang 97] has shown to be the least interesting for categorisation purposes. Also, the rationale for the presence of the (α + γ) factor (corresponding to P(C)) and of the (β + δ) factor (corresponding to P(C̄)) is not clear; their effect is to emphasise extremely rare categories (since, analogously to the case discussed earlier, it is for these categories that P(C) · P(C̄) is lowest, and consequently CC is highest), which is extremely counter-intuitive. Therefore these factors, either because of their irrelevance or because of their counter-intuitive effect, should be omitted. This corresponds to removing the entire factor

$$\frac{\sqrt{N}}{\sqrt{(\alpha + \gamma)(\beta + \delta)(\alpha + \beta)(\gamma + \delta)}}$$

leaving us with the formulation proposed at the beginning of this section.

However, this gives a measure for pairs (t, C), while we want a measure for terms t. Following [Yang & Pedersen 97], we take

$$s\chi^2_{max}(t) = \max_C\, s\chi^2(t, C)$$

which is the final measure we use for term selection.

4.3 Category-centred categorisation

Category-centred categorisation is used to search a database for the documents that best satisfy a given category. It works by applying the following steps to a category C of interest:

1. Based on the indexing of the training documents (Section 4.1), generate for category C a compact description of C. This also means generating a description of the characteristics that a generic document d should possess in order to be classified under C. The category description generation is described in Section 4.3.1.

2. Issue the description of category C as a query to an information retrieval engine against the documents to be classified. In order to do this we need a proper representation of the category description; as query model we use a vector of weighted terms, which is common to most information retrieval systems. As a result of this query, the information retrieval system ranks the documents to be categorised in order of decreasing similarity between the document vector and the query vector. The top-ranked documents are to be categorised under C. Exactly how many top-ranked documents are to be categorised under C is defined by a threshold T_C. The determination of this threshold is described in Section 4.3.2.

4.3.1 Category description generation

The generation of category descriptions is implemented in the final prototype by the Rocchio method, also used in [Cohen & Singer 96] and [Ittner et al. 95]. Rocchio's original equation for the revision of a query after relevance feedback [Rocchio 71] is

$$w_i^{new} = \alpha \cdot w_i^{old} + \frac{\beta}{|R|} \sum_{d_j \in R} w_{ij} - \frac{\gamma}{|\bar{R}|} \sum_{d_j \in \bar{R}} w_{ij}$$

with α + β - γ = 1 and 0 ≤ α, β, γ ≤ 1. In this formula, w_i^new and w_i^old are the weights of term t_i in the revised and unrevised query vectors, respectively, w_ij is the weight that term t_i has in the vector representing document d_j, and R and R̄ are the sets of the documents known to be relevant and irrelevant (marked as such by the user), respectively. Here, α, β and γ are control parameters that allow tuning the formula by bestowing more or less importance on the factors they multiply; the constraints that α + β - γ = 1 and 0 ≤ α, β, γ ≤ 1 ensure that if the weights w_i^old and w_ij on the right-hand side of the formula belong to the [0, 1] interval, so do the weights w_i^new on the left-hand side.

The original formula is modified for text categorisation purposes into the following:

$$w_i^C = \frac{\beta}{|R_C|} \sum_{d_j \in R_C} w_{ij} - \frac{\gamma}{|\bar{R}_C|} \sum_{d_j \in \bar{R}_C} w_{ij}$$

with β - γ = 1 and 0 ≤ β, γ ≤ 1. Here, w_i^C is the weight that term t_i has in the description of category C, w_ij is the weight that term t_i has in the training document d_j, and R_C and R̄_C are the sets of positive instances and negative instances of C, respectively (i.e. the sets of training documents d_j that are categorised (R_C) and are not categorised (R̄_C) under C, respectively). The constraints that β - γ = 1 and 0 ≤ β, γ ≤ 1 have the same effect as in the original Rocchio formula. The factor

$$\frac{1}{|R_C|} \sum_{d_j \in R_C} w_{ij}$$

may be interpreted as the average weight that term t_i has in the positive instances of C, while the factor

$$\frac{1}{|\bar{R}_C|} \sum_{d_j \in \bar{R}_C} w_{ij}$$

may be interpreted as the average weight that term t_i has in the negative instances of C.

The Rocchio method allows learning a category description not only from positive instances of the category, but also from negative instances. In our case this seems interesting because our catalogue is tree-shaped, which means that documents that are negative instances of C but positive instances of sibling categories of C are extremely interesting negative instances. We call these documents near-positive instances of C. A similar intuition has been successfully explored in [Ng et al. 97], although not in the context of the Rocchio method. We therefore implement the following variant of the Rocchio formula:

$$w_i^C = \frac{\beta}{|R_C|} \sum_{d_j \in R_C} w_{ij} - \frac{\gamma}{|\bar{R}_C|} \sum_{d_j \in \bar{R}_C} w_{ij}$$

Here, near-positive instances are used in place of negative instances: R̄_C is now the set of near-positive instances of C, i.e. documents categorised under a sibling of category C.
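A minimal sketch of this near-positive variant follows, assuming training documents are given as sparse term-weight dictionaries; the parameter values and names are placeholders, not the settings of the prototype.

```python
from collections import defaultdict

def category_description(positives, near_positives, beta=1.0, gamma=0.25):
    """Rocchio-style category description: beta times the average term
    weight over the positive instances R_C, minus gamma times the average
    over the near-positive instances (documents filed under a sibling
    category). Documents are sparse dicts {term: weight}; the defaults
    here are illustrative only."""
    w = defaultdict(float)
    for vec in positives:
        for t, wij in vec.items():
            w[t] += beta * wij / len(positives)
    for vec in near_positives:
        for t, wij in vec.items():
            w[t] -= gamma * wij / len(near_positives)
    # one common choice: keep only positively weighted terms for the query
    return {t: wt for t, wt in w.items() if wt > 0.0}
```

Issuing the resulting vector as a query then reduces the category-centred task to ordinary retrieval, as in step 2 above.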

4.3.2 Determining thresholds

Different notions of threshold were described in Deliverable 4.2. In the final prototype we have implemented the following method, referred to as proportional assignment. The threshold takes the form of a percentage T_C, learned from the training sample: T_C% of the documents to be categorised are to be categorised under C if and only if T_C% of the training documents are categorised under it. The threshold is thus dependent on the category C.

4.4 Document-centred categorisation

Given a document d, the most appropriate category (or categories) under which to categorise d is identified. This task is performed using the knn (k-nearest neighbours) classifier.

In the intermediary prototype, we used the knn algorithm as described in [Yang 94] (see also Section 3.3). In the final prototype, we use a new probabilistic interpretation of knn, because (1) it gives a theoretically sound justification of the various knn parameters, and (2) it promises better results in terms of effectiveness.

4.4.1 A new probabilistic interpretation of knn

Given an arbitrary document d, the knn method ranks d's nearest neighbours among the set of pre-classified documents, and uses the categories of the k nearest neighbours to predict the category or categories of document d. The similarity score of each of the k neighbour documents is used as a weight for its categories, and the sum of category weights over the k nearest neighbours is used for category ranking.

Let d be the document to be classified and C a category. The probability that d belongs to category C is viewed as the probability that d implies C. The latter can be estimated as follows:

$$P(d \to C) \approx P(d \to NN) \cdot \sum_{d' \in NN} P(d \to d') \cdot P(d' \to C)$$

where NN is the set of nearest-neighbour documents, P(d → NN) is a normalisation factor (ensuring that the probabilities P(d → d') sum to 1 over NN), and P(d → d') corresponds to the similarity between d and d'. The factor P(d' → C) reflects the relevance data available for d':

$$P(d' \to C) = \begin{cases} 1 & \text{if } d' \text{ belongs to } C, \\ 0 & \text{otherwise.} \end{cases}$$

Following [Wong & Yao 95], P(d → d') is computed as

$$P(d \to d') = \sum_t P(d \to t)\, P(t \to d') = \sum_t P(t \to d)\, P(d' \to t) = \sum_t \frac{P(d \to t)\, P(t)}{P(d)}\, P(d' \to t)$$

where P(t) reflects the probability of a term, and is approximated by the inverse document frequency (idf) of the term. P(d) corresponds to a normalisation factor with respect to the document to be categorised (i.e., Σ_t P(t → d) = 1). The probabilities P(d → t) and P(d' → t) reflect the indexing weights of term t in documents d and d', respectively.

4.5 The Web interface

The prototype can be accessed at the following site:

The final prototype has the same functionality as the intermediary prototype, the difference being that the weighted terms are derived with our probabilistic indexing approach. We have also added a new piece of functionality: it is now possible to categorise arbitrary documents from the Web (by providing their URL). It should be noted that for the categorisation to be meaningful, the document must be related to the domain of Computers and Internet.

5 Experimentation environment

Designing experiments for evaluating information retrieval applications means taking decisions on a number of varying parameters. To carry out some preliminary experiments (see Section 6 for the results), we identified several dimensions in our experimentation space. Furthermore, we need a baseline against which to compare our approach. Also, we require measures with which we can compare the outcomes of the experiments; recall and precision are used for this purpose.

5.1 Baseline

As a baseline, we use the tf·idf approach described in Section 3.2.

5.2 Recall and precision

To evaluate our categorisation approach, we perform an evaluation based on the classic notions of precision and recall, adapted to text categorisation:

Precision The probability that, if a document d is categorised under category C, this decision is correct.

Recall The probability that, if a document d should be categorised under category C, this decision is taken.

The computation of the precision and recall values will be described as part of Work Package 6.
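Although the exact computation is deferred to Work Package 6, a rough sketch of micro-averaged precision and recall (the averaging used for the curves in Section 6) could look as follows; the data layout is our assumption.

```python
def micro_precision_recall(decisions):
    """decisions: list of (assigned, correct) pairs, one per test document,
    where each element is a set of categories. Micro-averaging pools the
    individual category-assignment decisions over all documents before
    computing the two measures."""
    tp = sum(len(assigned & correct) for assigned, correct in decisions)
    fp = sum(len(assigned - correct) for assigned, correct in decisions)
    fn = sum(len(correct - assigned) for assigned, correct in decisions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```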

5.3 Dimensions of the experimentation space

Here, we describe the dimensions we have identified while implementing our categorisation approach. Each dimension has a domain; the values of the domains are given in a typewriter font.

5.3.1 Collection

Domain: { yahoo, dino, ... }. One design aim of the categorisation tool is that it be applicable to arbitrary Web catalogues, regardless of the language. So far, we have experimented with English documents which were spidered from the Yahoo! catalogue. In future work (see Section 7), we will experiment with German Web documents of the DINO-Online Web catalogue.

5.3.2 Document normalisation

Domain: { root, sameprefix, radius1 }. The indexing of a document can be based on the document only (root strategy), or on the document together with those documents it links to on the same Web site as the root document (radius1 strategy). One of the problems with the Yahoo! catalogue is that document topics vary widely from document to document. For example, a document referred to by a root document can have little relation to the root document; therefore, the radius1 strategy may yield too low a precision. On the other hand, many of the root documents do not contain much content (in the extreme case, some documents contain an HTML frame set only), thus yielding low recall. A compromise between these two strategies is the sameprefix strategy. This strategy restricts the radius1 document nodes to those documents that appear on the same site, in the same directory as, or in subdirectories of, the root document node (a small sketch of this restriction is given below, after Section 5.3.3). This can help increase precision while retaining recall.

5.3.3 Category normalisation

Domain: { top, merge, nomerge }. Our description-oriented approach has a learning phase which uses relevance data. These are based on the documents that appear under a category. In Deliverable 4.2, some statistics were given showing that the number of documents per category in the Yahoo! collection may be too small. Therefore, we consider various degrees of category merging, i.e. keeping all the categories (nomerge), merging the leaf categories into their respective super categories (merge), or keeping only the top categories (top).
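As referenced in Section 5.3.2, the sameprefix restriction can be sketched as a test on URL paths; the helper below is an approximation under our own assumptions (it uses a plain string-prefix check rather than comparing path components).

```python
from urllib.parse import urlparse
import posixpath

def same_prefix(root_url, linked_url):
    """sameprefix strategy: accept a linked document only if it is on the
    same site as the root document and lies in the same directory as the
    root document or in a subdirectory of it."""
    root, link = urlparse(root_url), urlparse(linked_url)
    if (root.scheme, root.netloc) != (link.scheme, link.netloc):
        return False  # different Web site
    root_dir = posixpath.dirname(root.path)  # directory of the root page
    return posixpath.dirname(link.path).startswith(root_dir)

# same_prefix("http://host/cat/index.html", "http://host/cat/sub/a.html") -> True
# same_prefix("http://host/cat/index.html", "http://host/other/b.html")   -> False
```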

5.3.4 Term space modification

Domain: { none, df, chisquare }. As discussed in Deliverable 4.2 and in the present Deliverable (Sections 3.2.3 and 4.2), various strategies can be used to select terms which have high discriminating power with respect to categorisation. One option is to use a document-frequency-based strategy (df). Another option is the chisquare strategy.

5.3.5 Feature selection and calculation

So far, we have experimented with ten features (see Section 3.2.2 and Deliverable 4.2). Some features may be better than others for deriving the weights for Web documents. We can use any subset of the ten features. Also, the computation of a feature can be modified, for example by normalising it or taking its logarithm, which are known to increase indexing effectiveness in information retrieval.

5.3.6 Polynomial structure

Domain: { power set of the set of components of the complete quadratic polynomial }. The set of components of the complete quadratic polynomial depends on the dimension of the feature vector. We consider linear and quadratic polynomial structures.

5.3.7 Query document indexing

Domain: { binary, tf, tfidf, probabilistic }. A document that has to be classified is issued as a query to the database containing the learning sample. Therefore, a suitable representation of the document must be determined. The simplest option is to not weight the document (query) terms (binary). A second option is to use the occurrence frequency of a document term as its weight (tf). A third option is to apply the indexing method performed on the documents in the database; this means using the tfidf weighting with respect to the baseline indexing, and probabilistic weighting with respect to the description-oriented indexing.

5.3.8 Query term selection

Domain: { top1, top2, ..., all }. For efficiency reasons, for large query documents we cannot consider all their terms when categorising such a document. Therefore we limit the number of query terms to be used.

Either all terms are taken (if appropriate) or the topn most important terms (i.e. those terms with the highest query term weights).

5.3.9 Retrieval function

Domain: { cosine, probabilistic, ... }. To categorise a document, a suitable retrieval function must be chosen. One standard function in information retrieval is the cosine function. With a probabilistic interpretation of knn, we obtain a probabilistic retrieval function.

5.3.10 Evaluation

Domain: { nomerge, merge, top }. The outcome of any experiment can be evaluated in several ways, depending on how the categories are normalised (see Section 5.3.3). The evaluation can be based on perfect match (nomerge) or partial match. An extreme strategy in the partial-match case is to consider the top category only (top). Another option is to merge the leaf categories into their respective super categories (merge).

6 Results and analysis

The aim of experimentation is to find one or several optimal paths (with respect to effectiveness) through the various dimensions of the experimentation space described in the previous section. We performed a preliminary evaluation of our methods in order to assess the effectiveness of our approach for document-centred categorisation. The results are presented first; we analyse them in Section 6.4.

Common settings with respect to the dimensions of our experimentation space are shown in Table 5. We considered about 70% of the test-bed as training documents (used in the learning phase); the other 30% of the documents were taken as test documents, i.e. as input to the knn classifier. For each result, recall-precision curves were derived by applying micro-averaging, as described in Deliverable 4.2.

We decided to set the category normalisation for our preliminary evaluation to top, because setting it to nomerge (i.e. whole-category matching) leads to very low effectiveness for both the baseline indexing and our probabilistic indexing. Figure 2 shows the respective recall-precision curves. The average precision for the baseline is 2.45% and for the probabilistic indexing 2.13%. With such low results, proper comparison, and hence enhancement, is not possible.

  Dimension                 Setting
  Collection                yahoo
  Term space modification   none
  Feature selection         as described in D 4.2
  Polynomial structure      linear
  Query document indexing   probabilistic
  Evaluation                top

Table 5: Common settings in the pre-evaluation

[Figure 2: Recall-precision curves for probabilistic indexing vs. baseline indexing (radius1, nomerge).]

6.1 Probabilistic indexing vs. baseline indexing

Figure 3 shows the results obtained with our description-oriented indexing approach and with the baseline tf·idf indexing. Here we only considered the root nodes of the test-bed documents. The average precision for the baseline is % and for the probabilistic indexing %; our approach outperforms the baseline indexing by 9.83%.

[Figure 3: Recall-precision curves for probabilistic indexing vs. baseline indexing (top, root).]

6.2 Probabilistic vs. cosine retrieval function

Figure 4 shows the results obtained with the probabilistic interpretation of knn (see Section 4.4.1) and with the cosine function (see Section 5.3.9). On average, the probabilistic retrieval function yields a 3.36% better performance than the standard cosine value.

6.3 radius1 vs. root indexing

Our previous two experiments were performed with the root node indexing strategy. In the present experiment, we wanted to find out the impact of the document normalisation strategy on effectiveness. Figure 5 shows the recall-precision curves for the radius1 vs. the root indexing strategy. On average, the radius1 indexing strategy yields a precision of % whereas root has an average precision of %.

[Figure 4: Recall-precision curves for the probabilistic vs. the cosine retrieval function (root, top).]

[Figure 5: Recall-precision curves for radius1 vs. root indexing (top).]

6.4 Analysis

In our first experiment, we categorised Web documents with respect to 2806 categories. The results were very poor (very few documents were correctly categorised): the average precision of both the baseline indexing and the probabilistic indexing was low, and in addition the description-oriented approach did not perform significantly better than the baseline method. There are three main reasons for this. First, the training sample was small (it consisted of 11,699 documents), which means that our classifier had to learn from approximately 4.2 documents per category only. Second, although the size of the training sample was small, the size of the learning sample was enormous (the multi-occurrence set L_x described in Section 4.1): the indexing function was derived from more than 8.7 million different feature vectors, which we believe is too high. From previous experience with description-oriented indexing, a learning sample size of 50 to 100 feature vectors per feature was sufficient to yield satisfying results [Fuhr & Buckley 91]. The third reason is the skewness of our test-bed. Looking into the term space, we see that it is polluted with many non-terms; we applied only simple methods for text analysis and term extraction, and enhanced methods should definitely yield improvements.

We carried out other experiments which show promising results. We can already see that using our probabilistic indexing as a basis for categorising Web documents outperforms the standard tf·idf indexing. Not only do we have a theoretical justification for our probabilistic retrieval function (for the knn categorisation task), but also an experimental indication that this retrieval function is effective. Furthermore, the results obtained with the radius1 and root node indexing show that a Web document should be indexed by considering its own content together with the content of the Web documents that are linked from it.

Our main conclusion is that our description-oriented indexing approach promises effective results with respect to the document-centred categorisation task. Further experiments are needed to substantiate our conclusions, to enhance our prototype, and to evaluate its effectiveness with respect to the category-centred categorisation task. They will be done as part of Work Package 6. Many variants of our approach can be experimented with. In particular, the following two dimensions:

- choice of the polynomial structure,
- selection and calculation of the features,

are only possible with our approach (and not with the baseline approach), thus giving us more scope to refine our prototype.

7 Future work

7.1 Application of the categorisation tool to German Web documents

The categorisation tool developed in this project is a fully automatic approach, and is portable to the various languages involved in the EuroSearch federation. We are currently setting up another experiment using German documents; this will demonstrate the application of our categorisation tool to non-English documents. The document test-bed collection we are using is based on the DINO-Online catalogue, available at http://. The adaptation requires the following steps:

- changing the modules that are language dependent,
- forming a test-bed of pre-categorised documents from the DINO-Online catalogue,
- training the categorisation algorithm on the selected test-bed,
- running the categorisation tool, and
- testing the results, making manual corrections where/if necessary.

7.2 Evaluation

This work is part of WP 6, and will be described in Deliverable 6.2, associated with that work package.

7.3 Improvements

The outcome of the evaluation will inform us about the effectiveness of our categorisation tool. We may need to refine the implementation of some of the modules involved in building the categorisation tool. Where appropriate, it will also be necessary to experimentally compare the alternative solutions implemented. For example, it may be discovered that not all attributes that we have used to derive the weighted terms are useful; some may have to be discarded to obtain a more effective categorisation tool. As another example, for the term space reduction phase it will be necessary to verify whether the more sophisticated simplified χ² coefficient implemented in this final prototype is actually more effective than the simpler technique based on document frequency implemented in the intermediary prototype. The set of dimensions listed in Section 5.3 will constitute the basis for refining our prototype.

7.4 Integration

The technologies developed so far are aimed at the classification of Web pages. The category-centred categorisation approach selects the most relevant pages for a given category from a database of indexed Web pages, such as the database of a Web search engine. The next steps for producing a Web catalogue providing the same service as the Yahoo! one are to provide a link to a Web site and to automatically generate a site description.

The first step requires a technique, which we call URL clustering, aimed at identifying pages belonging to the same site and extracting the most relevant page as a reference for the site. It must be noted that the host:port portion of a URL does not necessarily identify a Web site; a notable example of this are the Web communities, which host hundreds or thousands of sites belonging to different categories.

The second step requires abstracting technologies, which, in the case of a Web catalogue, should be able to identify the most relevant phrases for the given category. This technology is called query-biased abstracting, meaning that the abstract obtained depends on the query used for retrieving the document. These technologies are currently being developed and will be described in the next Deliverable.

8 Conclusion

This Deliverable describes the implementation of the final prototype of the categorisation tool. All components have been implemented, and the theory behind these components has been explained in detail. We are now carrying out various experiments in order to evaluate our prototype and to enhance its effectiveness.

References

Cohen, W. W.; Singer, Y. (1996). Context-sensitive Learning Methods for Text Categorization. In: Frei, H.-P.; Harman, D.; Schäuble, P.; Wilkinson, R. (eds.): Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York.

Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems 9(3).

Fuhr, N.; Buckley, C. (1993). Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models. In: Harman, D. (ed.): The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology Special Publication, Gaithersburg, Md.

Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and Management 25(1).

Ittner, D. J.; Lewis, D. D.; Ahn, D. D. (1995). Text Categorization of Low Quality Images. In: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval.

Knorz, G. (1983). Automatisches Indexieren als Erkennen abstrakter Objekte. Niemeyer, Tübingen.

Ng, H.-T.; Goh, W.-B.; Low, K.-L. (1997). Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization. In: Belkin, N. J.; Narasimhalu, A. D.; Willet, P. (eds.): Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York.

Rocchio, J. (1971). Relevance Feedback in Information Retrieval. In: Salton, G. (ed.): The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, New Jersey.

Salton, G.; Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5).

Schürmann, J. (1977). Polynomklassifikatoren für die Zeichenerkennung. Ansatz, Adaption, Anwendung. Oldenbourg, München, Wien.

Schütze, H.; Pedersen, J. O.; Hull, D. A. (1995). A Comparison of Classifiers and Document Representations for the Routing Problem. In: Fox, E.; Ingwersen, P.; Fidel, R. (eds.): Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York.

Wong, S.; Yao, Y. (1995). On Modeling Information Retrieval with Probabilistic Inference. ACM Transactions on Information Systems 13(1).


Multi-Dimensional Text Classification

Multi-Dimensional Text Classification Multi-Dimensional Text Classification Thanaruk THEERAMUNKONG IT Program, SIIT, Thammasat University P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani, Thailand, 12121 ping@siit.tu.ac.th Verayuth LERTNATTEE

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Retrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents

Retrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents Retrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents Norbert Fuhr University of Duisburg-Essen Mohammad Abolhassani University of Duisburg-Essen Germany Norbert Gövert University

More information

Patent Classification Using Ontology-Based Patent Network Analysis

Patent Classification Using Ontology-Based Patent Network Analysis Association for Information Systems AIS Electronic Library (AISeL) PACIS 2010 Proceedings Pacific Asia Conference on Information Systems (PACIS) 2010 Patent Classification Using Ontology-Based Patent Network

More information

RMIT University at TREC 2006: Terabyte Track

RMIT University at TREC 2006: Terabyte Track RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline Relevance Feedback and Query Reformulation Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price IR on the Internet, Spring 2010 1 Outline Query reformulation Sources of relevance

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 1603-1 Kamitomioka-cho, Nagaoka-shi Niigata, 940-2188 JAPAN

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l Anette Hulth, Lars Asker Dept, of Computer and Systems Sciences Stockholm University [hulthi asker]ø dsv.su.s e Jussi Karlgren Swedish

More information

Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization)

Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization) Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization) Andrei V. Anghelescu Ilya B. Muchnik Dept. of Computer Science DIMACS Email: angheles@cs.rutgers.edu

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document BayesTH-MCRDR Algorithm for Automatic Classification of Web Document Woo-Chul Cho and Debbie Richards Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {wccho, richards}@ics.mq.edu.au

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Social Media Computing

Social Media Computing Social Media Computing Lecture 4: Introduction to Information Retrieval and Classification Lecturer: Aleksandr Farseev E-mail: farseev@u.nus.edu Slides: http://farseev.com/ainlfruct.html At the beginning,

More information

Context based Re-ranking of Web Documents (CReWD)

Context based Re-ranking of Web Documents (CReWD) Context based Re-ranking of Web Documents (CReWD) Arijit Banerjee, Jagadish Venkatraman Graduate Students, Department of Computer Science, Stanford University arijitb@stanford.edu, jagadish@stanford.edu}

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute

More information

User Profiling for Interest-focused Browsing History

User Profiling for Interest-focused Browsing History User Profiling for Interest-focused Browsing History Miha Grčar, Dunja Mladenič, Marko Grobelnik Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia {Miha.Grcar, Dunja.Mladenic, Marko.Grobelnik}@ijs.si

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Chirag Shah Dept. of CSE IIT Madras Chennai - 600036 Tamilnadu, India. chirag@speech.iitm.ernet.in A. Nayeemulla Khan Dept. of CSE

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

Information Retrieval

Information Retrieval Information Retrieval WS 2016 / 2017 Lecture 2, Tuesday October 25 th, 2016 (Ranking, Evaluation) Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University

More information

In this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.

In this project, I examined methods to classify a corpus of  s by their content in order to suggest text blocks for semi-automatic replies. December 13, 2006 IS256: Applied Natural Language Processing Final Project Email classification for semi-automated reply generation HANNES HESSE mail 2056 Emerson Street Berkeley, CA 94703 phone 1 (510)

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Annotated multitree output

Annotated multitree output Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version

More information

Spatially-Aware Information Retrieval on the Internet

Spatially-Aware Information Retrieval on the Internet Spatially-Aware Information Retrieval on the Internet SPIRIT is funded by EU IST Programme Contract Number: Deliverable number: D18 5302 Deliverable type: R Contributing WP: WP 5 Contractual date of delivery:

More information

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ - 1 - ITERATIVE SEARCHING IN AN ONLINE DATABASE Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ 07962-1910 ABSTRACT An experiment examined how people use

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 5 Relevance Feedback and Query Expansion Introduction A Framework for Feedback Methods Explicit Relevance Feedback Explicit Feedback Through Clicks Implicit Feedback

More information

Document Structure Analysis in Associative Patent Retrieval

Document Structure Analysis in Associative Patent Retrieval Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Using Query History to Prune Query Results

Using Query History to Prune Query Results Using Query History to Prune Query Results Daniel Waegel Ursinus College Department of Computer Science dawaegel@gmail.com April Kontostathis Ursinus College Department of Computer Science akontostathis@ursinus.edu

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Automatically Generating Queries for Prior Art Search

Automatically Generating Queries for Prior Art Search Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows: CS299 Detailed Plan Shawn Tice February 5, 2013 Overview The high-level steps for classifying web pages in Yioop are as follows: 1. Create a new classifier for a unique label. 2. Train it on a labelled

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Alfonso F. Cárdenas UC Los Angeles 420 Westwood Plaza Los Angeles, CA 90095

Alfonso F. Cárdenas UC Los Angeles 420 Westwood Plaza Los Angeles, CA 90095 Online Selection of Parameters in the Rocchio Algorithm for Identifying Interesting News Articles Raymond K. Pon UC Los Angeles 42 Westwood Plaza Los Angeles, CA 995 rpon@cs.ucla.edu Alfonso F. Cárdenas

More information

Performance Evaluation

Performance Evaluation Chapter 4 Performance Evaluation For testing and comparing the effectiveness of retrieval and classification methods, ways of evaluating the performance are required. This chapter discusses several of

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Combining CORI and the decision-theoretic approach for advanced resource selection

Combining CORI and the decision-theoretic approach for advanced resource selection Combining CORI and the decision-theoretic approach for advanced resource selection Henrik Nottelmann and Norbert Fuhr Institute of Informatics and Interactive Systems, University of Duisburg-Essen, 47048

More information

Making Retrieval Faster Through Document Clustering

Making Retrieval Faster Through Document Clustering R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e

More information

Using XML Logical Structure to Retrieve (Multimedia) Objects

Using XML Logical Structure to Retrieve (Multimedia) Objects Using XML Logical Structure to Retrieve (Multimedia) Objects Zhigang Kong and Mounia Lalmas Queen Mary, University of London {cskzg,mounia}@dcs.qmul.ac.uk Abstract. This paper investigates the use of the

More information

A study on optimal parameter tuning for Rocchio Text Classifier

A study on optimal parameter tuning for Rocchio Text Classifier A study on optimal parameter tuning for Rocchio Text Classifier Alessandro Moschitti University of Rome Tor Vergata, Department of Computer Science Systems and Production, 00133 Rome (Italy) moschitti@info.uniroma2.it

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Reading group on Ontologies and NLP:

Reading group on Ontologies and NLP: Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

Information-Theoretic Feature Selection Algorithms for Text Classification

Information-Theoretic Feature Selection Algorithms for Text Classification Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval SCCS414: Information Storage and Retrieval Christopher Manning and Prabhakar Raghavan Lecture 10: Text Classification; Vector Space Classification (Rocchio) Relevance

More information

Birkbeck (University of London)

Birkbeck (University of London) Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information