Categorisation tool, final prototype


February 16, 1999

Project ref. no.: LE
Project title: EuroSearch
Deliverable status: Restricted
Contractual date of delivery: Month 11
Actual date of delivery: Month 13
Deliverable number: 4.3
Deliverable title: Categorisation tool, final prototype
Type: Prototype
Status & version: Final 1.21
Number of pages: 29
WP contributing to the deliverable: WP 4
WP Task responsible: UNIDO
Authors: Norbert Fuhr, Norbert Gövert, Mounia Lalmas, Fabrizio Sebastiani
EC Project Officer: Yves Paternoster
Keywords: automatic categorisation, classification, knn, Rocchio, category description, relevance feedback, probabilistic indexing

Abstract: WP 4 deals with automatic categorisation of Web documents. The categorisation is based on a description-oriented approach to document indexing. This Deliverable describes further progress with respect to the work carried out in Deliverable 4.2, and the final prototype which implements all the components of our categorisation tool.

Summary

The goal of Work Package 4 (WP 4) is to implement a categorisation tool that allows for the automatic categorisation of Web documents. This Deliverable describes further progress with respect to the work carried out in Deliverable 4.2, and the final prototype, which implements all the components of our categorisation tool. The categorisation approach is grounded in an automatic textual analysis of Web documents that associates weighted terms with documents. We use probabilistic indexing based on the description-oriented indexing approach developed at the University of Dortmund. This method and its implementation are described in detail in this Deliverable. In our earlier work, we identified two different tasks with respect to document categorisation: category-centred and document-centred categorisation. Both tasks have been implemented, and we provide details of both implementations. Extensive experimentation and evaluation are needed to demonstrate and to improve the effectiveness of our automatic categorisation tool. In this Deliverable, we describe our experimentation environment and present some preliminary results concerning the effectiveness of our approach. Further experiments will be carried out as part of Work Package 6.

Contents

1 Introduction
2 Architecture
3 The intermediary prototype
  3.1 Creation of the test-bed
  3.2 Document indexing
    3.2.1 Term extraction
    3.2.2 Description step
    3.2.3 Refinement
  3.3 The knn classifier
  3.4 The Web interface
4 The final prototype
  4.1 Decision step
  4.2 Refinement
  4.3 Category-centred categorisation
    4.3.1 Category description generation
    4.3.2 Determining thresholds
  4.4 Document-centred categorisation
    4.4.1 A new probabilistic interpretation of knn
  4.5 The Web interface
5 Experimentation environment
  5.1 Baseline
  5.2 Recall and precision
  5.3 Dimensions of the experimentation space
    5.3.1 Collection
    5.3.2 Document normalisation
    5.3.3 Category normalisation
    5.3.4 Term space modification
    5.3.5 Feature selection and calculation
    5.3.6 Polynomial structure
    5.3.7 Query document indexing
    5.3.8 Query term selection
    5.3.9 Retrieval function
    5.3.10 Evaluation
6 Results and analysis
  6.1 Probabilistic indexing vs. baseline indexing
  6.2 Probabilistic vs. cosine retrieval function
  6.3 radius1 vs. root indexing
  6.4 Analysis
7 Future work
  7.1 Application of the categorisation tool to German Web documents
  7.2 Evaluation
  7.3 Improvements
  7.4 Integration
8 Conclusion
References

1 Introduction

The goal of Work Package 4 (WP 4) is to implement a categorisation tool that allows for the automatic categorisation of Web documents. In Deliverable 4.1, the specification of the categorisation tool was described and an overall architecture was defined. The implementation of some of the components, as well as the implementation of a baseline indexing method, led to an intermediary prototype of the categorisation tool; this was described in Deliverable 4.2.

This Deliverable describes further progress with respect to the work carried out in Deliverable 4.2: we have implemented the final prototype. In this Deliverable, we describe the implementation of the components of our categorisation tool. In particular, we present in detail the probabilistic indexing method upon which our automatic categorisation is based, the generation of the category descriptions, which are necessary to categorise documents, and a new implementation of the refinement process, which is the fourth step of the indexing phase. From the indexing of pre-categorised documents, new documents can be categorised. In Deliverable 4.1, two different categorisation tasks were identified: category-centred and document-centred classification. Both tasks have been implemented and are described in detail in this Deliverable.

Extensive experimentation and evaluation are needed to demonstrate and improve the effectiveness of our approach to automatically categorising Web documents. Within this Deliverable we describe our experimentation environment. We identified various dimensions which can be considered when implementing our categorisation tool, and we present some preliminary results obtained from the experiments carried out so far. Further experiments will be done as part of Work Package 6.

The outline of this Deliverable is as follows. In Section 2 we summarise the main components of the architecture of the categorisation tool. Section 3 gives an overview of what has been implemented for the intermediary prototype (this was described in Deliverable 4.2). Newly implemented components, and where appropriate the theory behind them, are described in Section 4. Section 5 gives a specification of the dimensions to be considered when experimenting with our categorisation tool. Results of preliminary experiments are presented in Section 6, together with some conclusions. Section 7 gives an outlook on further work. Finally, we conclude in Section 8.

2 Architecture

In Deliverable 4.1, the overall architecture was defined with the following components:

Test-bed creation Automatic categorisation of documents requires a test-bed of pre-categorised documents, upon which the classifiers are trained, and with which the categorisation tasks are experimented and validated.

Document indexing Assigning categories to documents, or vice versa, requires a suitable representation of the documents. The approach followed in this project is grounded in an automatic textual analysis of Web documents that associates weighted terms with documents. We use a probabilistic approach based on the description-oriented indexing approach developed in [Fuhr & Buckley 91]. To allow for a more efficient categorisation of documents, a refinement step for the term space is further applied.

Category description generation One of the categorisation tasks in this project works by issuing to the retrieval system a query consisting of a description of the category of interest. For the automatic generation of such category descriptions the Rocchio method [Rocchio 71] is used.

Classification tasks Using both the probabilistic indexing and the category descriptions, documents can then be classified according to some given categories. There are two categorisation tasks. Category-centred categorisation is used to search a database for the documents that best satisfy a given category description. Document-centred categorisation is used to identify the category to which a given document belongs. The former task is performed by processing category descriptions as queries against an information retrieval system containing the documents to be classified (and returned to the user). The latter task is performed based on the probabilistic indexing of documents taken from a learning sample, using the knn classifier [Yang 94].

3 The intermediary prototype

In this section, we describe the components of the categorisation tool that were fully implemented for the previous Deliverable 4.2. These include the creation of the test-bed, some steps of the indexing, a preliminary implementation of the knn classifier, and a Web interface to the prototype.

3.1 Creation of the test-bed

The categorisation of documents requires a test-bed of pre-categorised documents, for learning, experimentation, and evaluation purposes. The creation of the test-bed required the spidering and the normalisation of documents. We used the Computers and Internet category of the Yahoo! catalogue. Documents from this catalogue were spidered and then normalised to handle the various structures of Web documents. The two processes were described in Deliverable 4.1, and their outcomes, including various statistics, were presented in Deliverable 4.2.

3.2 Document indexing

Assigning categories to documents, or vice versa, requires a suitable representation of the documents. This necessitates the indexing of documents. In this project, we represent documents as vectors of weighted terms. We perform a probabilistic indexing of document terms based on the description-oriented indexing approach developed in [Fuhr & Buckley 91]. The approach consists of three steps: term extraction, description step, and decision step (see Figure 1). An additional step, the refinement step, is performed to allow for a more efficient categorisation of documents. The term extraction step, description step and refinement step were implemented in the intermediary prototype, and are briefly described in the following subsections. The decision step has been implemented in the final prototype, and is described in Section 4.1. A simple indexing based on the standard tf·idf weighting [Salton & Buckley 88] was used in the intermediary prototype. This indexing is also used as a baseline for the evaluation of our indexing approach.

3.2.1 Term extraction

We extracted terms from the normalised documents, using standard knowledge extraction methods for text. The outcome was a list of single words forming the indexing vocabulary, referred to as the term space.

3.2.2 Description step

The description step consists of the construction of relevance descriptions for term-document pairs (t, d). A relevance description x(t, d) is defined for each term t in the term space and each document d. The vector comprises a set of features that are considered to be important for the task of assigning weights to terms with respect to a given document. These features can be classified into three groups:

- document-related features, e.g. document length, maximum term frequency;
- term-document-related features, e.g. term frequency in a given document;
- term-related features, e.g. inverse document frequency.

The description-oriented indexing approach makes no additional assumptions about the choice of features and the structure of the description vector x. Therefore the actual definition of relevance descriptions can be adapted to the specific application context, namely the representation of documents and the amount of learning data available. In Deliverable 4.2 we described ten features which reflect our field of application, namely the categorisation of Web documents. Besides features known to be useful in standard text retrieval applications, we used features taking into account the HTML nature of the documents. These are, for instance, features indicating whether a term is highlighted, or whether it occurs in a document heading. The outcome of the description step is a database of triplets (t, d, x(t, d)) of term, document, and their associated relevance description x(t, d). A relevance description is an n-dimensional vector, where each dimension corresponds to one of the features used (n is the number of features used in total).
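To make the description step concrete, the following sketch computes a relevance description for a term-document pair. It is illustrative only: the actual prototype uses the ten features defined in Deliverable 4.2, so the feature set, the Document structure and all names below are our assumptions, not the prototype's implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Document:
    terms: list        # all terms of the document, in order of occurrence
    title_terms: set   # terms occurring in the HTML title
    heading_terms: set # terms occurring in HTML headings (h1..h6)

def relevance_description(t, d, num_docs, doc_freq):
    """Build a feature vector x(t, d) for the term-document pair (t, d),
    mixing the three feature groups: document-related (length, maximum term
    frequency), term-document-related (term frequency, HTML markup) and
    term-related (inverse document frequency)."""
    counts = {w: d.terms.count(w) for w in set(d.terms)}
    tf = counts.get(t, 0)                      # term-document related
    max_tf = max(counts.values())              # document related
    doc_len = len(d.terms)                     # document related
    idf = math.log(num_docs / doc_freq[t])     # term related
    in_title = 1.0 if t in d.title_terms else 0.0      # HTML-specific
    in_heading = 1.0 if t in d.heading_terms else 0.0  # HTML-specific
    return (tf / max_tf, math.log(doc_len), idf, in_title, in_heading)
```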

3.2.3 Refinement

The purpose of the refinement phase is the reduction in size of the term space. This means that, while before refinement a document is represented as a vector of n weighted terms (output from the decision step), with n being the cardinality of the term space, after refinement a document is represented as a vector of m weighted terms, with m < n. This reduction is accomplished for reasons of computational efficiency: the categorisation task can be carried out much more efficiently, basically at no cost in terms of effectiveness, in both the document-centred and the category-centred categorisation tasks. Term space reduction was discussed in detail in Deliverable 4.2. For the intermediary prototype discussed in Deliverable 4.2, a term space reduction technique based on document frequency had been implemented. For the final prototype, a more sophisticated technique, based on a simplified version of the χ² measure, has been implemented. This is discussed in detail in Section 4.2.

3.3 The knn classifier

The knn algorithm [Yang 94] computes, for each document d to be categorised and each category C:

$$sim(d, C) = \sum_{d' \in NN} sim(d, d') \cdot C(d')$$

where NN is the set of the k documents d' in the set of training documents (the k nearest neighbours of d) for which sim(d, d') (the similarity between d and d') is maximum. The function C(d') yields 1 if document d' belongs to category C, and 0 otherwise. The document d is categorised under the category C with the highest sim(d, C) value. In the intermediary prototype, we applied this initial version of knn based on our baseline indexing (tf·idf). The implementation uses k = 30, which was demonstrated by [Yang 94] to give good performance. The similarity score between documents is the cosine function [Salton & Buckley 88].
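A compact sketch of this scoring rule follows, assuming documents are represented as sparse term-weight dictionaries; the function names and data layout are ours, not the prototype's.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
         * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_categorise(d, training, k=30):
    """Rank categories for document d by sim(d, C) = sum of sim(d, d') over
    the k nearest training documents d' that belong to C (i.e. C(d') = 1).
    training is a list of (term_vector, categories) pairs; k = 30 as in the
    intermediary prototype."""
    sims = sorted(((cosine(d, vec), cats) for vec, cats in training),
                  key=lambda sc: sc[0], reverse=True)
    scores = {}
    for s, cats in sims[:k]:
        for c in cats:
            scores[c] = scores.get(c, 0.0) + s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For document-centred categorisation, the document is then assigned to the top-ranked category (or categories).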

3.4 The Web interface

Based on the baseline indexing, we presented a Web interface for performing document-centred categorisation. The user of this interface could browse through our test-bed (much like browsing Yahoo!'s Computers and Internet catalogue). At any point, categories and their assigned documents could be viewed. In addition, the user was able to submit documents of the test set to our knn classifier, in order to get a ranking of categories for a given document.

4 The final prototype

The newly implemented components are the decision step (deriving probabilistic term weights), the generation of the category descriptions, and the two categorisation tasks.

4.1 Decision step

In the decision step, probabilistic index term weights are determined based on the outcome of the description step (Section 3.2.2). In this section, we describe the decision step in detail, together with an example illustrating it. The difference between standard probabilistic indexing approaches and the description-oriented approach is that in the former, term weights are estimated based on the probability P(R | t, d), whereas in the latter, the estimates are based on the probability P(R | x(t, d)). This is shown in Figure 1.

[Figure 1: Subdivision of the indexing task. A term-document pair (t, d) is mapped by the description step to a relevance description x(t, d), from which the decision step derives the probabilistic indexing weight P(R | x(t, d)); standard probabilistic indexing estimates the weight P(R | t, d) directly.]

In a classification problem, P(R | t, d) is the probability that a document d is judged relevant to an arbitrary category C, given that the document is indexed by the term t. The probability P(R | x(t, d)) can be viewed as the probability that a document d is judged relevant to an arbitrary category C, given that (1) t has relevance description x with d, and (2) t has relevance description x with a document of the same category C.

The estimation of P(R | t, d) requires relevance data (for each category, the document-term pairs). The amount of relevance data per category may not be sufficient, since the number of occurrences of term-document pairs per category may be too small. With the description-oriented approach, document-term pairs with different documents or terms can be mapped to the same relevance description (the same x). Therefore, the amount of relevance data available for the estimation of a specific indexing weight does not depend on the number of categories or documents for which we have relevance assessments.

The probability P(R | x(t, d)) is derived from a learning sample L ⊆ D × D × R, where D is the set of documents, R = {R, R̄} stands for relevant and not relevant (the method can be generalised to include a wider relevance scale), and

$$L = \{(d, d', r(d, d')) \mid d, d' \in D\} \quad \text{with} \quad r(d, d') = \begin{cases} R & \text{if } d \text{ and } d' \text{ belong to the same category,} \\ \bar{R} & \text{otherwise.} \end{cases}$$

Based on L, we form a multi-set of relevance descriptions with relevance judgements:

$$L_x = [\,(x(t, d),\ r(d, d')) \mid t \in d \cap d',\ (d, d', r(d, d')) \in L\,]$$

This set with multiple occurrences of elements (a bag) forms the basis for the estimation of the probabilistic index term weights P(R | x(t, d)). Following the concepts of other probabilistic information retrieval models, the probabilities P(R | x(t, d)) could be estimated directly by computing the corresponding relative frequencies from those elements of L_x that have the same relevance description.

As an example, assume that the relevance description consists of a two-dimensional vector x = (x₁, x₂) with the following features:

$$x_1 = \begin{cases} 1 & \text{if } t \text{ occurs in the title of } d, \\ 0 & \text{otherwise,} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if } t \text{ occurs once in } d, \\ 2 & \text{if } t \text{ occurs at least twice in } d. \end{cases}$$

Table 1 shows a small learning sample (the relevance data) with three documents d₁, d₂ and d₃, and seven terms t₁, ..., t₇. From this data, the estimates shown in Table 2 can be derived by means of relative frequencies.

  d     d'    r(d, d')   t     x(t, d)
  d₁    d₂    R          t₁    (1, 1)
                         t₂    (0, 1)
                         t₃    (1, 2)
  d₂    d₁    R          t₁    (0, 2)
                         t₂    (1, 1)
                         t₃    (0, 1)
  d₁    d₃    R          t₂    (0, 2)
                         t₅    (0, 2)
                         t₆    (1, 1)
                         t₇    (1, 2)
  d₃    d₁    R          t₂    (1, 2)
                         t₅    (0, 2)
                         t₆    (0, 1)
                         t₇    (1, 1)
  d₂    d₃    R̄          t₁    (1, 2)
                         t₃    (0, 1)
                         t₇    (1, 1)
  d₃    d₂    R̄          t₁    (1, 2)
                         t₃    (1, 1)
                         t₇    (1, 1)

Table 1: Example of a learning sample

  x        P(R | x)
  (0, 2)   4/4
  (0, 1)   3/4
  (1, 2)   3/5
  (1, 1)   4/7

Table 2: Probability estimates for the example

Better estimates can be achieved by applying probabilistic classification procedures as developed in pattern recognition or machine learning, because they use additional (plausible) assumptions to compute the estimates. The classification procedure yielding estimates of the probabilities P(R | x(t, d)) is termed an indexing function e(x(t, d)). Let y(d, d') denote a class variable representing the relevance judgement r(d, d') for each element of L:

$$y(d, d') = \begin{cases} 1 & \text{if } r(d, d') = R, \\ 0 & \text{otherwise.} \end{cases}$$

Now we seek a regression function e_opt(x) which yields an optimal approximation of

the class variable y (E denotes the expectation). As optimisation criterion, the minimum squared error is used:

$$E\left((y - e_{opt}(x))^2\right) \overset{!}{=} \min$$

To derive the (optimal) indexing function from the learning sample L_x, we use the least square polynomials (LSP) approach [Knorz 83] [Fuhr 89], which was shown to be effective in [Fuhr & Buckley 93]. In this approach, polynomials with a predefined structure are taken as function classes; therefore, the class of polynomials from which the indexing function is to be selected has to be defined first. Based on the relevance description in vector form x, a polynomial structure v(x) = (v₁, ..., v_L) has to be defined:

$$v(x) = (1, x_1, x_2, \ldots, x_N, x_1^2, x_1 x_2, \ldots)$$

Here N denotes the number of dimensions of x. In practice, mostly linear and quadratic polynomials are considered. The indexing function then takes the form e(x) = aᵀv(x), where a = (aᵢ), i = 1, ..., L, is the coefficient vector to be estimated. So P(R | x(t, d)) is estimated by the polynomial

$$e(x(t, d)) = a_1 + a_2 x_1 + a_3 x_2 + \ldots + a_{N+1} x_N + a_{N+2} x_1^2 + a_{N+3} x_1 x_2 + \ldots = a^T v(x)$$

For our example, we take the following (linear) polynomial structure:

$$v(x) = (1, x_1, x_2)$$

So a = (a₁, a₂, a₃) and the indexing function is e(x(t, d)) = a₁ + a₂x₁ + a₃x₂. The coefficient vector a is computed by solving the following linear equation system [Schürmann 77]:

$$E(v\,v^T)\,a = E(v\,y)$$

As an approximation of the expectations, the corresponding arithmetic means over the learning sample are taken. The actual computation process is based on the empirical momental matrix M, which contains both sides of the above equation system:

$$M = \overline{(v\,v^T,\ v\,y)} = \frac{1}{|L_x|} \sum_{(x(t,d),\, r(d,d')) \in L_x} \left( v(x(t,d))\, v(x(t,d))^T,\ v(x(t,d))\, y(d,d') \right)$$
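Before continuing with the example, here is a minimal numpy sketch of this estimation step for the linear structure v(x) = (1, x₁, x₂). The data layout is our assumption, but the learning sample is the one from Table 1.

```python
import numpy as np

def lsp_coefficients(samples):
    """Solve E(v v^T) a = E(v y) for the linear structure v(x) = (1, x1, x2),
    approximating the expectations by arithmetic means over the sample."""
    V = np.array([[1.0, x1, x2] for (x1, x2), _ in samples])
    y = np.array([float(label) for _, label in samples])
    n = len(samples)
    lhs = V.T @ V / n                  # empirical E(v v^T)
    rhs = V.T @ y / n                  # empirical E(v y)
    return np.linalg.solve(lhs, rhs)

# The learning sample L_x of Table 1; y = 1 encodes the judgement R.
L_x = [((1, 1), 1), ((0, 1), 1), ((1, 2), 1),               # (d1, d2)
       ((0, 2), 1), ((1, 1), 1), ((0, 1), 1),               # (d2, d1)
       ((0, 2), 1), ((0, 2), 1), ((1, 1), 1), ((1, 2), 1),  # (d1, d3)
       ((1, 2), 1), ((0, 2), 1), ((0, 1), 1), ((1, 1), 1),  # (d3, d1)
       ((1, 2), 0), ((0, 1), 0), ((1, 1), 0),               # (d2, d3)
       ((1, 2), 0), ((1, 1), 0), ((1, 1), 0)]               # (d3, d2)
a = lsp_coefficients(L_x)  # approx. (0.70, -0.28, 0.12): the coefficient
                           # vector derived in the text, up to rounding
```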

The matrix M can then be solved to yield the coefficient vector a. For our example, we obtain the following momental matrix (the first three columns are the empirical E(v vᵀ), the last column the empirical E(v y)):

$$M = \begin{pmatrix} 1.00 & 0.60 & 1.45 & 0.70 \\ 0.60 & 0.60 & 0.85 & 0.35 \\ 1.45 & 0.85 & 2.35 & 1.05 \end{pmatrix}$$

From there we derive the coefficient vector a = (0.69, -0.28, 0.11). The resulting indexing function is:

$$e(x) = 0.69 - 0.28\,x_1 + 0.11\,x_2$$

Table 3 shows the probability estimates derived from the above indexing function. As one can see from a comparison of the values of P(R | x) and e(x), the approximation e(x) of P(R | x) preserves their ordering.

  x        P(R | x)   e(x)
  (0, 2)   4/4        0.91
  (0, 1)   3/4        0.80
  (1, 2)   3/5        0.63
  (1, 1)   4/7        0.52

Table 3: Probability estimates for the example with LSP

4.2 Refinement

A new term space reduction technique has been implemented for this final prototype. We call it simplified χ² (sχ²), to emphasise its relationship with the well-known χ² measure. Simplified χ² is defined by the following formula:

$$s\chi^2(t, C) = P(t, C) \cdot P(\bar{t}, \bar{C}) - P(t, \bar{C}) \cdot P(\bar{t}, C)$$

where P is a probability function on the training set. For example, P(t, C̄) represents the probability that a random document belonging to the training set is indexed by term t and is not tagged by category C; the other probabilities are to be interpreted accordingly. If we use the abbreviations shown in Table 4, we can write the previous formula as

$$s\chi^2(t, C) = \alpha\delta - \beta\gamma$$

  P(t, C)   P(t, C̄)   P(t̄, C)   P(t̄, C̄)
  α         β          γ          δ

Table 4: Abbreviations
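A sketch of how this measure might be computed from training-set counts follows; the count structures are assumptions, but the score is the αδ - βγ formula above, maximised over categories as introduced at the end of this section.

```python
def s_chi2_max(t, categories, n_docs, df, cat_size, joint):
    """Simplified chi-square score of a term t for term space reduction.
    n_docs: number of training documents; df[t]: documents containing t;
    cat_size[c]: documents tagged with category c; joint[(t, c)]: documents
    containing t AND tagged with c. Returns the maximum over categories of
    alpha * delta - beta * gamma."""
    best = 0.0
    for c in categories:
        alpha = joint.get((t, c), 0) / n_docs                  # P(t, C)
        beta = (df[t] - joint.get((t, c), 0)) / n_docs         # P(t, not C)
        gamma = (cat_size[c] - joint.get((t, c), 0)) / n_docs  # P(not t, C)
        delta = 1.0 - alpha - beta - gamma                     # P(not t, not C)
        best = max(best, alpha * delta - beta * gamma)
    return best
```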

Recent investigations ([Schütze et al. 95], [Yang 97]) have shown that χ² is an extremely effective measure for term space reduction, especially when reduction by a large factor is needed. The χ² measure is defined by the following formula:

$$\chi^2(t, C) = \frac{N \cdot (\alpha\delta - \beta\gamma)^2}{(\alpha + \gamma)(\beta + \delta)(\alpha + \beta)(\gamma + \delta)}$$

where N is the cardinality of the training set. In [Ng et al. 97], however, it is argued that this measure is counter-intuitive, because squaring the (αδ - βγ) factor in the numerator of the χ² measure gives equal emphasis to the α and δ factors (which indicate a positive correlation between t and C, and should therefore be emphasised) and to the β and γ factors (which indicate a negative correlation between t and C, and should therefore be de-emphasised). Thus a correlation coefficient is proposed, defined by the equation

$$CC(t, C) = \frac{\sqrt{N} \cdot (\alpha\delta - \beta\gamma)}{\sqrt{(\alpha + \gamma)(\beta + \delta)(\alpha + \beta)(\gamma + \delta)}}$$

which is the square root of χ². In this way, pairs (t, C) that show positive correlation correctly receive a high CC value, while pairs (t, C) that show negative correlation correctly receive a low CC value.

However, some factors are either not influential or of dubious intuitive value. For instance, the √N factor in the numerator has no effect, since it is equal for all pairs (t, C). Further, the rationale for the presence of the (α + β) factor (corresponding to P(t)) and of the (γ + δ) factor (corresponding to P(t̄)) is not clear; their effect is to emphasise extremely rare terms (since it is for these terms that P(t) · P(t̄) is lowest, and consequently CC is highest), which a recent investigation [Yang 97] has shown to be the least interesting for categorisation purposes. Also, the rationale for the presence of the (α + γ) factor (corresponding to P(C)) and of the (β + δ) factor (corresponding to P(C̄)) is not clear; their effect is to emphasise extremely rare categories (since, analogously to the case discussed earlier, it is for these categories that P(C) · P(C̄) is lowest, and consequently CC is highest), which is extremely counter-intuitive. Therefore these factors, either because of their irrelevance or because of their counter-intuitive effect, should be omitted. This corresponds to removing the entire factor

$$\frac{\sqrt{N}}{\sqrt{(\alpha + \gamma)(\beta + \delta)(\alpha + \beta)(\gamma + \delta)}}$$

leaving us with the formulation proposed at the beginning of this section.

However, this gives a measure for pairs (t, C), while we want a measure for terms t. Following [Yang & Pedersen 97], we take

$$s\chi^2_{max}(t) = \max_C\, s\chi^2(t, C)$$

which is the final measure we use for term selection.

4.3 Category-centred categorisation

Category-centred categorisation is used to search a database for the documents that best satisfy a given category. It works by applying the following steps to a category C of interest:

1. Based on the indexing of the training documents (Section 4.1), generate for category C a compact description of C. This also means generating a description of the characteristics that a generic document d should possess in order to be classified under C. The category description generation is described in Section 4.3.1.

2. Issue the description of category C as a query to an information retrieval engine against the documents to be classified. In order to do this we need a proper representation of the category description; as query model we use a vector of weighted terms, which is common to most information retrieval systems. As a result of this query, the information retrieval system ranks the documents to be categorised in order of decreasing similarity between the document vector and the query vector. The top-ranked documents are to be categorised under C. Exactly how many top-ranked documents are to be categorised under C is defined by a threshold T_C. The determination of this threshold is described in Section 4.3.2.

4.3.1 Category description generation

The generation of category descriptions is implemented in the final prototype by the Rocchio method, also used in [Cohen & Singer 96] and [Ittner et al. 95]. Rocchio's original equation for the revision of a query after relevance feedback [Rocchio 71] is

$$w_i^{new} = \alpha \cdot w_i^{old} + \frac{\beta}{|R|} \sum_{d_j \in R} w_{ij} - \frac{\gamma}{|\bar{R}|} \sum_{d_j \in \bar{R}} w_{ij}$$

with α + β - γ = 1 and 0 ≤ α, β, γ ≤ 1. In this formula, w_i^new and w_i^old are the weights of term t_i in the revised and unrevised query vectors, respectively, w_ij is the weight that term t_i has in the vector representing document d_j, and R and R̄ are the sets of the documents known to be relevant and irrelevant (marked as such by the user), respectively. Here, α, β and γ are control parameters that allow tuning the formula by bestowing more or less importance on the factors they multiply; the constraints that α + β - γ = 1 and 0 ≤ α, β, γ ≤ 1 ensure that if the weights w_i^old and w_ij on the right-hand side of the formula belong to the [0, 1] interval, so do the weights w_i^new on the left-hand side.

The original formula is modified for text categorisation purposes into the following:

$$w_i^C = \frac{\beta}{|R_C|} \sum_{d_j \in R_C} w_{ij} - \frac{\gamma}{|\bar{R}_C|} \sum_{d_j \in \bar{R}_C} w_{ij}$$

with β - γ = 1 and 0 ≤ β, γ ≤ 1. Here, w_i^C is the weight that term t_i has in the description of category C, w_ij is the weight that term t_i has in the training document d_j, and R_C and R̄_C are the sets of positive instances and negative instances of C, respectively (i.e. the sets of training documents d_j that are categorised (R_C) and are not categorised (R̄_C) under C, respectively). The constraints that β - γ = 1 and 0 ≤ β, γ ≤ 1 have the same effect as in the original Rocchio formula. The factor

$$\frac{1}{|R_C|} \sum_{d_j \in R_C} w_{ij}$$

may be interpreted as the average weight that term t_i has in the positive instances of C, while the factor

$$\frac{1}{|\bar{R}_C|} \sum_{d_j \in \bar{R}_C} w_{ij}$$

may be interpreted as the average weight that term t_i has in the negative instances of C.

The Rocchio method allows learning a category description not only from positive instances of the category, but also from negative instances. In our case this seems interesting because our catalogue is tree-shaped, which means that documents that are negative instances of C but positive instances of sibling categories of C are extremely interesting negative instances. We call these documents near-positive instances of C. A similar intuition has been successfully explored in [Ng et al. 97], although not in the context of the Rocchio method. We therefore implement the following variant of the Rocchio formula:

$$w_i^C = \frac{\beta}{|R_C|} \sum_{d_j \in R_C} w_{ij} - \frac{\gamma}{|\bar{R}_C|} \sum_{d_j \in \bar{R}_C} w_{ij}$$

Here, near-positive instances are used in place of negative instances: R̄_C is now the set of near-positive instances of C, i.e. documents categorised under a sibling of category C.
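A minimal sketch of this near-positive variant follows, assuming training documents are given as sparse term-weight dictionaries; the parameter values and names are placeholders, not the settings of the prototype.

```python
from collections import defaultdict

def category_description(positives, near_positives, beta=1.0, gamma=0.25):
    """Rocchio-style category description: beta times the average term
    weight over the positive instances R_C, minus gamma times the average
    over the near-positive instances (documents filed under a sibling
    category). Documents are sparse dicts {term: weight}; the defaults
    here are illustrative only."""
    w = defaultdict(float)
    for vec in positives:
        for t, wij in vec.items():
            w[t] += beta * wij / len(positives)
    for vec in near_positives:
        for t, wij in vec.items():
            w[t] -= gamma * wij / len(near_positives)
    # one common choice: keep only positively weighted terms for the query
    return {t: wt for t, wt in w.items() if wt > 0.0}
```

Issuing the resulting vector as a query then reduces the category-centred task to ordinary retrieval, as in step 2 above.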

4.3.2 Determining thresholds

Different notions of threshold were described in Deliverable 4.2. In the final prototype we have implemented the following method, referred to as proportional assignment. The threshold takes the form of a percentage T_C, learned from the training sample: T_C% of the documents to be categorised are to be categorised under C if and only if T_C% of the training documents are categorised under it. The threshold is thus dependent on the category C.

4.4 Document-centred categorisation

Given a document d, the most appropriate category (or categories) under which to categorise d is identified. This task is performed using the knn (k-nearest neighbours) classifier.

In the intermediary prototype, we used the knn algorithm as described in [Yang 94] (see also Section 3.3). In the final prototype, we use a new probabilistic interpretation of knn, because (1) it gives a theoretically sound justification of the various knn parameters, and (2) it promises better results in terms of effectiveness.

4.4.1 A new probabilistic interpretation of knn

Given an arbitrary document d, the knn method ranks d's nearest neighbours among the set of pre-classified documents, and uses the categories of the k nearest neighbours to predict the category or categories of document d. The similarity score of each of the k neighbour documents is used as a weight for its categories, and the sum of category weights over the k nearest neighbours is used for category ranking.

Let d be the document to be classified and C a category. The probability that d belongs to category C is viewed as the probability that d implies C. The latter can be estimated as follows:

$$P(d \to C) \approx P(d \to NN) \cdot \sum_{d' \in NN} P(d \to d') \cdot P(d' \to C)$$

where NN is the set of nearest-neighbour documents, P(d → NN) is a normalisation factor (ensuring that the probabilities P(d → d') sum to 1 over NN), and P(d → d') corresponds to the similarity between d and d'. The factor P(d' → C) reflects the relevance data available for d':

$$P(d' \to C) = \begin{cases} 1 & \text{if } d' \text{ belongs to } C, \\ 0 & \text{otherwise.} \end{cases}$$

Following [Wong & Yao 95], P(d → d') is computed as

$$P(d \to d') = \sum_t P(d \to t)\, P(t \to d') = \sum_t P(t \to d)\, P(d' \to t) = \sum_t \frac{P(d \to t)\, P(t)}{P(d)}\, P(d' \to t)$$

where P(t) reflects the probability of a term, and is approximated by the inverse document frequency (idf) of the term. P(d) corresponds to a normalisation factor with respect to the document to be categorised (i.e., Σ_t P(t → d) = 1). The probabilities P(d → t) and P(d' → t) reflect the indexing weights of term t in documents d and d', respectively.

4.5 The Web interface

The prototype can be accessed at the following site:

The final prototype has the same functionality as the intermediary prototype, the difference being that the weighted terms are derived with our probabilistic indexing approach. We have also added a new piece of functionality: it is now possible to categorise arbitrary documents from the Web (by providing their URL). It should be noted that for the categorisation to be meaningful, the document must be related to the domain of Computers and Internet.

5 Experimentation environment

Designing experiments for evaluating information retrieval applications means taking decisions on a number of varying parameters. To carry out some preliminary experiments (see Section 6 for the results), we identified several dimensions in our experimentation space. Furthermore, we need a baseline against which to compare our approach. Also, we require measures with which we can compare the outcomes of the experiments; recall and precision are used for this purpose.

5.1 Baseline

As a baseline, we use the tf·idf approach described in Section 3.2.

5.2 Recall and precision

To evaluate our categorisation approach, we perform an evaluation based on the classic notions of precision and recall, adapted to text categorisation:

Precision The probability that, if a document d is categorised under category C, this decision is correct.

Recall The probability that, if a document d should be categorised under category C, this decision is taken.

The computation of the precision and recall values will be described as part of Work Package 6.
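Although the exact computation is deferred to Work Package 6, a rough sketch of micro-averaged precision and recall (the averaging used for the curves in Section 6) could look as follows; the data layout is our assumption.

```python
def micro_precision_recall(decisions):
    """decisions: list of (assigned, correct) pairs, one per test document,
    where each element is a set of categories. Micro-averaging pools the
    individual category-assignment decisions over all documents before
    computing the two measures."""
    tp = sum(len(assigned & correct) for assigned, correct in decisions)
    fp = sum(len(assigned - correct) for assigned, correct in decisions)
    fn = sum(len(correct - assigned) for assigned, correct in decisions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```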

5.3 Dimensions of the experimentation space

Here, we describe the dimensions we have identified while implementing our categorisation approach. Each dimension has a domain; the values of the domains are given in a typewriter font.

5.3.1 Collection

Domain: { yahoo, dino, ... }. One design aim of the categorisation tool is that it be applicable to arbitrary Web catalogues, regardless of the language. So far, we have experimented with English documents which were spidered from the Yahoo! catalogue. In future work (see Section 7), we will experiment with German Web documents of the DINO-Online Web catalogue.

5.3.2 Document normalisation

Domain: { root, sameprefix, radius1 }. The indexing of a document can be based on the document only (root strategy), or on the document together with those documents it links to on the same Web site as the root document (radius1 strategy). One of the problems with the Yahoo! catalogue is that document topics vary widely from document to document. For example, a document referred to by a root document can have little relation to the root document; therefore, the radius1 strategy may yield too low a precision. On the other hand, many of the root documents do not contain much content (in the extreme case, some documents contain an HTML frame set only), thus yielding low recall. A compromise between these two strategies is the sameprefix strategy. This strategy restricts the radius1 document nodes to those documents that appear on the same site, in the same directory as, or in subdirectories of, the root document node (a small sketch of this restriction is given below, after Section 5.3.3). This can help increase precision while retaining recall.

5.3.3 Category normalisation

Domain: { top, merge, nomerge }. Our description-oriented approach has a learning phase which uses relevance data. These are based on the documents that appear under a category. In Deliverable 4.2, some statistics were given showing that the number of documents per category in the Yahoo! collection may be too small. Therefore, we consider various degrees of category merging, i.e. keeping all the categories (nomerge), merging the leaf categories into their respective super categories (merge), or keeping only the top categories (top).
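As referenced in Section 5.3.2, the sameprefix restriction can be sketched as a test on URL paths; the helper below is an approximation under our own assumptions (it uses a plain string-prefix check rather than comparing path components).

```python
from urllib.parse import urlparse
import posixpath

def same_prefix(root_url, linked_url):
    """sameprefix strategy: accept a linked document only if it is on the
    same site as the root document and lies in the same directory as the
    root document or in a subdirectory of it."""
    root, link = urlparse(root_url), urlparse(linked_url)
    if (root.scheme, root.netloc) != (link.scheme, link.netloc):
        return False  # different Web site
    root_dir = posixpath.dirname(root.path)  # directory of the root page
    return posixpath.dirname(link.path).startswith(root_dir)

# same_prefix("http://host/cat/index.html", "http://host/cat/sub/a.html") -> True
# same_prefix("http://host/cat/index.html", "http://host/other/b.html")   -> False
```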

5.3.4 Term space modification

Domain: { none, df, chisquare }. As discussed in Deliverable 4.2 and in the present Deliverable (Sections 3.2.3 and 4.2), various strategies can be used to select terms which have high discriminating power with respect to categorisation. One option is to use a document-frequency-based strategy (df). Another option is the chisquare strategy.

5.3.5 Feature selection and calculation

So far, we have experimented with ten features (see Section 3.2.2 and Deliverable 4.2). Some features may be better than others for deriving the weights for Web documents. We can use any subset of the ten features. Also, the computation of a feature can be modified, for example by normalising it or taking its logarithm, which are known to increase indexing effectiveness in information retrieval.

5.3.6 Polynomial structure

Domain: { power set of the set of components of the complete quadratic polynomial }. The set of components of the complete quadratic polynomial depends on the dimension of the feature vector. We consider linear and quadratic polynomial structures.

5.3.7 Query document indexing

Domain: { binary, tf, tfidf, probabilistic }. A document that has to be classified is issued as a query to the database containing the learning sample. Therefore, a suitable representation of the document must be determined. The simplest option is to not weight the document (query) terms (binary). A second option is to use the occurrence frequency of a document term as its weight (tf). A third option is to apply the indexing method performed on the documents in the database; this means using the tfidf weighting with respect to the baseline indexing, and probabilistic weighting with respect to the description-oriented indexing.

5.3.8 Query term selection

Domain: { top1, top2, ..., all }. For efficiency reasons, for large query documents we cannot consider all their terms when categorising such a document. Therefore we limit the number of query terms to be used.

Either all terms are taken (if appropriate) or the topn most important terms (i.e. those terms with the highest query term weights).

5.3.9 Retrieval function

Domain: { cosine, probabilistic, ... }. To categorise a document, a suitable retrieval function must be chosen. One standard function in information retrieval is the cosine function. With a probabilistic interpretation of knn, we obtain a probabilistic retrieval function.

5.3.10 Evaluation

Domain: { nomerge, merge, top }. The outcome of any experiment can be evaluated in several ways, depending on how the categories are normalised (see Section 5.3.3). The evaluation can be based on perfect match (nomerge) or partial match. An extreme strategy in the partial-match case is to consider the top category only (top). Another option is to merge the leaf categories into their respective super categories (merge).

6 Results and analysis

The aim of experimentation is to find one or several optimal paths (with respect to effectiveness) through the various dimensions of the experimentation space described in the previous section. We performed a preliminary evaluation of our methods in order to assess the effectiveness of our approach for document-centred categorisation. The results are presented first; we analyse them in Section 6.4.

Common settings with respect to the dimensions of our experimentation space are shown in Table 5. We considered about 70% of the test-bed as training documents (used in the learning phase); the other 30% of the documents were taken as test documents, i.e. as input to the knn classifier. For each result, recall-precision curves were derived by applying micro-averaging, as described in Deliverable 4.2.

We decided to set the category normalisation for our preliminary evaluation to top, because setting it to nomerge (i.e. whole-category matching) leads to very low effectiveness for both the baseline indexing and our probabilistic indexing. Figure 2 shows the respective recall-precision curves. The average precision for the baseline is 2.45% and for the probabilistic indexing 2.13%. With such low results, proper comparison, and hence enhancement, is not possible.

  Dimension                 Setting
  Collection                yahoo
  Term space modification   none
  Feature selection         as described in D 4.2
  Polynomial structure      linear
  Query document indexing   probabilistic
  Evaluation                top

Table 5: Common settings in the pre-evaluation

[Figure 2: Recall-precision curves for probabilistic indexing vs. baseline indexing (radius1, nomerge).]

6.1 Probabilistic indexing vs. baseline indexing

Figure 3 shows the results obtained with our description-oriented indexing approach and with the baseline tf·idf indexing. Here we only considered the root nodes of the test-bed documents. The average precision for the baseline is % and for the probabilistic indexing %; our approach outperforms the baseline indexing by 9.83%.

[Figure 3: Recall-precision curves for probabilistic indexing vs. baseline indexing (top, root).]

6.2 Probabilistic vs. cosine retrieval function

Figure 4 shows the results obtained with the probabilistic interpretation of knn (see Section 4.4.1) and with the cosine function (see Section 5.3.9). On average, the probabilistic retrieval function yields a 3.36% better performance than the standard cosine value.

6.3 radius1 vs. root indexing

Our previous two experiments were performed with the root node indexing strategy. In the present experiment, we wanted to find out the impact of the document normalisation strategy on effectiveness. Figure 5 shows the recall-precision curves for the radius1 vs. the root indexing strategy. On average, the radius1 indexing strategy yields a precision of % whereas root has an average precision of %.

[Figure 4: Recall-precision curves for the probabilistic vs. the cosine retrieval function (root, top).]

[Figure 5: Recall-precision curves for radius1 vs. root indexing (top).]

6.4 Analysis

In our first experiment, we categorised Web documents with respect to 2806 categories. The results were very poor (very few documents were correctly categorised): the average precision of both the baseline indexing and the probabilistic indexing was low, and in addition the description-oriented approach did not perform significantly better than the baseline method. There are three main reasons for this. First, the training sample was small (it consisted of 11,699 documents), which means that our classifier had to learn from approximately 4.2 documents per category only. Second, although the size of the training sample was small, the size of the learning sample was enormous (the multi-occurrence set L_x described in Section 4.1): the indexing function was derived from more than 8.7 million different feature vectors, which we believe is too high. From previous experience with description-oriented indexing, a learning sample size of 50 to 100 feature vectors per feature was sufficient to yield satisfying results [Fuhr & Buckley 91]. The third reason is the skewness of our test-bed. Looking into the term space, we see that it is polluted with many non-terms; we applied only simple methods for text analysis and term extraction, and enhanced methods should definitely yield improvements.

We carried out other experiments which show promising results. We can already see that using our probabilistic indexing as a basis for categorising Web documents outperforms the standard tf·idf indexing. Not only do we have a theoretical justification for our probabilistic retrieval function (for the knn categorisation task), but also an experimental indication that this retrieval function is effective. Furthermore, the results obtained with the radius1 and root node indexing show that a Web document should be indexed by considering its own content together with the content of the Web documents that are linked from it.

Our main conclusion is that our description-oriented indexing approach promises effective results with respect to the document-centred categorisation task. Further experiments are needed to substantiate our conclusions, to enhance our prototype, and to evaluate its effectiveness with respect to the category-centred categorisation task. They will be done as part of Work Package 6. Many variants of our approach can be experimented with. In particular, the following two dimensions:

- choice of the polynomial structure,
- selection and calculation of the features,

are only possible with our approach (and not with the baseline approach), thus giving us more scope to refine our prototype.

7 Future work

7.1 Application of the categorisation tool to German Web documents

The categorisation tool developed in this project is a fully automatic approach, and is portable to the various languages involved in the EuroSearch federation. We are currently setting up another experiment using German documents; this will demonstrate the application of our categorisation tool to non-English documents. The document test-bed collection we are using is based on the DINO-Online catalogue, available at http://. The adaptation requires the following steps:

- changing the modules that are language dependent,
- forming a test-bed of pre-categorised documents from the DINO-Online catalogue,
- training the categorisation algorithm on the selected test-bed,
- running the categorisation tool, and
- testing the results, making manual corrections where/if necessary.

7.2 Evaluation

This work is part of WP 6, and will be described in Deliverable 6.2, associated with that work package.

7.3 Improvements

The outcome of the evaluation will inform us about the effectiveness of our categorisation tool. We may need to refine the implementation of some of the modules involved in building the categorisation tool. Where appropriate, it will also be necessary to experimentally compare the alternative solutions implemented. For example, it may be discovered that not all attributes that we have used to derive the weighted terms are useful; some may have to be discarded to obtain a more effective categorisation tool. As another example, for the term space reduction phase it will be necessary to verify whether the more sophisticated simplified χ² coefficient implemented in this final prototype is actually more effective than the simpler technique based on document frequency implemented in the intermediary prototype. The set of dimensions listed in Section 5.3 will constitute the basis for refining our prototype.

7.4 Integration

The technologies developed so far are aimed at the classification of Web pages. The category-centred categorisation approach selects the most relevant pages for a given category from a database of indexed Web pages, such as the database of a Web search engine. The next steps for producing a Web catalogue providing the same service as the Yahoo! one are to provide a link to a Web site and to automatically generate a site description.

The first step requires a technique, which we call URL clustering, aimed at identifying pages belonging to the same site and extracting the most relevant page as a reference for the site. It must be noted that the host:port portion of a URL does not necessarily identify a Web site; a notable example of this are the Web communities, which host hundreds or thousands of sites belonging to different categories.

The second step requires abstracting technologies, which, in the case of a Web catalogue, should be able to identify the most relevant phrases for the given category. This technology is called query-biased abstracting, meaning that the abstract obtained depends on the query used for retrieving the document. These technologies are currently being developed and will be described in the next Deliverable.

8 Conclusion

This Deliverable describes the implementation of the final prototype of the categorisation tool. All components have been implemented, and the theory behind these components has been explained in detail. We are now carrying out various experiments in order to evaluate our prototype and to enhance its effectiveness.

References

Cohen, W. W.; Singer, Y. (1996). Context-sensitive Learning Methods for Text Categorization. In: Frei, H.-P.; Harman, D.; Schäuble, P.; Wilkinson, R. (eds.): Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York.

Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems 9(3).

Fuhr, N.; Buckley, C. (1993). Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models. In: Harman, D. (ed.): The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology Special Publication, Gaithersburg, Md.

Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and Management 25(1).

Ittner, D. J.; Lewis, D. D.; Ahn, D. D. (1995). Text Categorization of Low Quality Images. In: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval.

Knorz, G. (1983). Automatisches Indexieren als Erkennen abstrakter Objekte. Niemeyer, Tübingen.

Ng, H.-T.; Goh, W.-B.; Low, K.-L. (1997). Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization. In: Belkin, N. J.; Narasimhalu, A. D.; Willet, P. (eds.): Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York.

Rocchio, J. (1971). Relevance Feedback in Information Retrieval. In: Salton, G. (ed.): The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, New Jersey.

Salton, G.; Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5).

Schürmann, J. (1977). Polynomklassifikatoren für die Zeichenerkennung. Ansatz, Adaption, Anwendung. Oldenbourg, München, Wien.

Schütze, H.; Pedersen, J. O.; Hull, D. A. (1995). A Comparison of Classifiers and Document Representations for the Routing Problem. In: Fox, E.; Ingwersen, P.; Fidel, R. (eds.): Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York.

Wong, S.; Yao, Y. (1995). On Modeling Information Retrieval with Probabilistic Inference. ACM Transactions on Information Systems 13(1).


Multi-Dimensional Text Classification

Multi-Dimensional Text Classification Multi-Dimensional Text Classification Thanaruk THEERAMUNKONG IT Program, SIIT, Thammasat University P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani, Thailand, 12121 ping@siit.tu.ac.th Verayuth LERTNATTEE

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Retrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents

Retrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents Retrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents Norbert Fuhr University of Duisburg-Essen Mohammad Abolhassani University of Duisburg-Essen Germany Norbert Gövert University

More information

Patent Classification Using Ontology-Based Patent Network Analysis

Patent Classification Using Ontology-Based Patent Network Analysis Association for Information Systems AIS Electronic Library (AISeL) PACIS 2010 Proceedings Pacific Asia Conference on Information Systems (PACIS) 2010 Patent Classification Using Ontology-Based Patent Network

More information

RMIT University at TREC 2006: Terabyte Track

RMIT University at TREC 2006: Terabyte Track RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline Relevance Feedback and Query Reformulation Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price IR on the Internet, Spring 2010 1 Outline Query reformulation Sources of relevance

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 1603-1 Kamitomioka-cho, Nagaoka-shi Niigata, 940-2188 JAPAN

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l Anette Hulth, Lars Asker Dept, of Computer and Systems Sciences Stockholm University [hulthi asker]ø dsv.su.s e Jussi Karlgren Swedish

More information

Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization)

Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization) Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization) Andrei V. Anghelescu Ilya B. Muchnik Dept. of Computer Science DIMACS Email: angheles@cs.rutgers.edu

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document BayesTH-MCRDR Algorithm for Automatic Classification of Web Document Woo-Chul Cho and Debbie Richards Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {wccho, richards}@ics.mq.edu.au

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Social Media Computing

Social Media Computing Social Media Computing Lecture 4: Introduction to Information Retrieval and Classification Lecturer: Aleksandr Farseev E-mail: farseev@u.nus.edu Slides: http://farseev.com/ainlfruct.html At the beginning,

More information

Context based Re-ranking of Web Documents (CReWD)

Context based Re-ranking of Web Documents (CReWD) Context based Re-ranking of Web Documents (CReWD) Arijit Banerjee, Jagadish Venkatraman Graduate Students, Department of Computer Science, Stanford University arijitb@stanford.edu, jagadish@stanford.edu}

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute

More information

User Profiling for Interest-focused Browsing History

User Profiling for Interest-focused Browsing History User Profiling for Interest-focused Browsing History Miha Grčar, Dunja Mladenič, Marko Grobelnik Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia {Miha.Grcar, Dunja.Mladenic, Marko.Grobelnik}@ijs.si

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Chirag Shah Dept. of CSE IIT Madras Chennai - 600036 Tamilnadu, India. chirag@speech.iitm.ernet.in A. Nayeemulla Khan Dept. of CSE

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

Information Retrieval

Information Retrieval Information Retrieval WS 2016 / 2017 Lecture 2, Tuesday October 25 th, 2016 (Ranking, Evaluation) Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University

More information

In this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.

In this project, I examined methods to classify a corpus of  s by their content in order to suggest text blocks for semi-automatic replies. December 13, 2006 IS256: Applied Natural Language Processing Final Project Email classification for semi-automated reply generation HANNES HESSE mail 2056 Emerson Street Berkeley, CA 94703 phone 1 (510)

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Annotated multitree output

Annotated multitree output Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version

More information

Spatially-Aware Information Retrieval on the Internet

Spatially-Aware Information Retrieval on the Internet Spatially-Aware Information Retrieval on the Internet SPIRIT is funded by EU IST Programme Contract Number: Deliverable number: D18 5302 Deliverable type: R Contributing WP: WP 5 Contractual date of delivery:

More information

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ - 1 - ITERATIVE SEARCHING IN AN ONLINE DATABASE Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ 07962-1910 ABSTRACT An experiment examined how people use

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 5 Relevance Feedback and Query Expansion Introduction A Framework for Feedback Methods Explicit Relevance Feedback Explicit Feedback Through Clicks Implicit Feedback

More information

Document Structure Analysis in Associative Patent Retrieval

Document Structure Analysis in Associative Patent Retrieval Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Using Query History to Prune Query Results

Using Query History to Prune Query Results Using Query History to Prune Query Results Daniel Waegel Ursinus College Department of Computer Science dawaegel@gmail.com April Kontostathis Ursinus College Department of Computer Science akontostathis@ursinus.edu

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Automatically Generating Queries for Prior Art Search

Automatically Generating Queries for Prior Art Search Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows: CS299 Detailed Plan Shawn Tice February 5, 2013 Overview The high-level steps for classifying web pages in Yioop are as follows: 1. Create a new classifier for a unique label. 2. Train it on a labelled

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Alfonso F. Cárdenas UC Los Angeles 420 Westwood Plaza Los Angeles, CA 90095

Alfonso F. Cárdenas UC Los Angeles 420 Westwood Plaza Los Angeles, CA 90095 Online Selection of Parameters in the Rocchio Algorithm for Identifying Interesting News Articles Raymond K. Pon UC Los Angeles 42 Westwood Plaza Los Angeles, CA 995 rpon@cs.ucla.edu Alfonso F. Cárdenas

More information

Performance Evaluation

Performance Evaluation Chapter 4 Performance Evaluation For testing and comparing the effectiveness of retrieval and classification methods, ways of evaluating the performance are required. This chapter discusses several of

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Combining CORI and the decision-theoretic approach for advanced resource selection

Combining CORI and the decision-theoretic approach for advanced resource selection Combining CORI and the decision-theoretic approach for advanced resource selection Henrik Nottelmann and Norbert Fuhr Institute of Informatics and Interactive Systems, University of Duisburg-Essen, 47048

More information

Making Retrieval Faster Through Document Clustering

Making Retrieval Faster Through Document Clustering R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e

More information

Using XML Logical Structure to Retrieve (Multimedia) Objects

Using XML Logical Structure to Retrieve (Multimedia) Objects Using XML Logical Structure to Retrieve (Multimedia) Objects Zhigang Kong and Mounia Lalmas Queen Mary, University of London {cskzg,mounia}@dcs.qmul.ac.uk Abstract. This paper investigates the use of the

More information

A study on optimal parameter tuning for Rocchio Text Classifier

A study on optimal parameter tuning for Rocchio Text Classifier A study on optimal parameter tuning for Rocchio Text Classifier Alessandro Moschitti University of Rome Tor Vergata, Department of Computer Science Systems and Production, 00133 Rome (Italy) moschitti@info.uniroma2.it

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Reading group on Ontologies and NLP:

Reading group on Ontologies and NLP: Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

Information-Theoretic Feature Selection Algorithms for Text Classification

Information-Theoretic Feature Selection Algorithms for Text Classification Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval SCCS414: Information Storage and Retrieval Christopher Manning and Prabhakar Raghavan Lecture 10: Text Classification; Vector Space Classification (Rocchio) Relevance

More information

Birkbeck (University of London)

Birkbeck (University of London) Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information