PROBLEM 4: TERM WEIGHTING SCHEMES IN INFORMATION RETRIEVAL

MARY PAT CAMPBELL†, GRACE E. CHO‡, SUSAN NELSON§, CHRIS ORUM¶, JANELLE V. REYNOLDS-FLEMING‖, AND ILYA ZAVORINE

Problem Presenter: Laura Mather, National Security Agency

Abstract. Information retrieval is the process of evaluating a user's query, or information need, against a set of documents (books, journal articles, web pages, etc.) to determine which of the documents satisfies the query. With the advent of the World Wide Web, there is suddenly a need to query enormous sets of documents both efficiently and accurately. In the vector space model of information retrieval, documents are represented by sparse vectors, each component of which corresponds to a term, usually a word, in the document set. In the simplest case, the components of these vectors are the raw frequency counts of each term in each document. More sophisticated term weighting schemes are used to improve information retrieval accuracy. We study a specific term weighting scheme (log-entropy weighting) to determine its effectiveness on different aspects of retrieval. New approaches to term weighting are also examined. In addition, we describe our workshop experience and some of our technical work.

1. Introduction and Motivation. Browsing through the pages on the World Wide Web has become one of the fastest growing hobbies in the world. The World Wide Web is also a great source of information when conducting personal research. It is estimated that there are currently 50 million web pages on the internet. To that end, search engines for the World Wide Web and other information sources have become a tool to help the querent in the search for information. Search engines are based on certain information retrieval models. Some examples are the boolean model, the probabilistic model, and the vector space model. There are advantages and disadvantages to working with any of these models. Their main purpose, however, is to retrieve relevant documents specific to a search. We work with the vector space model. This model uses a storage matrix whose columns represent the documents in a collection and whose rows represent the term frequencies among the documents. We are interested in ad-hoc querying only, i.e. when dynamic queries are compared against a static document database in order to find documents closest to the query. When conducting a query, one method is to search through the storage matrix and match the query terms with row terms, producing the document closest to the query. Many times, ad-hoc querying requires more sophisticated measures of searching, such as term weighting schemes, which attempt to prioritize documents according to relevant terms. We concentrate on a term weighting scheme called log-entropy. Our goal is to answer the question "Is log-entropy weighting optimal for querying?" As part of the answer to this question, we need to develop other term weighting schemes to compare with log-entropy weighting. The group as a whole created the term weighting schemes used in this paper. When certain aspects of the problem were handled individually, those involved are credited individually.

2. Definitions and Mathematical Framework. A document database is a collection of documents which are relevant to certain subjects. Documents consist of words, or terms, which to a greater or lesser extent identify these subjects. For instance, the word "and" may appear frequently in all documents but it says almost nothing about the contents of these documents.
Users are interested in certain subjects and would like to search for and find those, and only those, documents relevant to these subjects. In order to find these documents, the user generates concise queries which consist of words relevant to these subjects. Search engines then use these queries to find pertinent documents. For the purposes of this paper, we define a search engine by three components. First, the static database of documents needs to be characterized using a representation model. Second, a query processor is needed to convert incoming (dynamic) queries into a format compatible with the representation model. Finally, a relevance measure is used to compare converted queries against documents.

(This work was supported in part by the NSA and NSF. Author affiliations: † New York University; ‡ North Carolina State University; § Georgia Institute of Technology; ¶ University of Utah; ‖ Texas A&M University; and University of Maryland at College Park.)

In this paper we develop more efficient representation models for the collection of documents while keeping the query processor and the relevance measure fixed. We use the vector space model for the database. A good introduction to the model is given in [4]. This model is used because it yields relatively simple and inexpensive computations while being easy to understand. Another attractive feature of the vector space model is that it is amenable to the techniques of linear algebra.

We consider a finite collection of documents d_1, ..., d_ndocs, where ndocs is the cardinality of this collection. The total number of distinct terms appearing in all the documents is nterms. The database is represented by an nterms x ndocs matrix A = [a_ij], with nterms >> ndocs, where a_ij is some function of the f_ij. The quantity f_ij is simply the number of times the term i appears in document j. In the simplest case a_ij = f_ij. In this case, the matrix A is called the raw frequency matrix. Each document is represented by a vector of size nterms. A query is also represented by a vector of size nterms whose entries are one or zero, representing the presence or absence of a search term. Information is retrieved by comparing query vectors with document vectors and determining which documents are "near" the queries in some sense. Example 2.1 shows a simple database and a query, as well as the corresponding raw frequency matrix. Notice that the word which appears most often, "the", is least informative.

Example 2.1:
d_1 = "the first short document"
d_2 = "another one"
d_3 = "the longest document in the database"
query = "document"
nterms = 9, ndocs = 3

                 d_1   d_2   d_3   query
    another       0     1     0      0
    database      0     0     1      0
    document      1     0     1      1
    first         1     0     0      0
    in            0     0     1      0
    longest       0     0     1      0
    one           0     1     0      0
    short         1     0     0      0
    the           1     0     2      0

In the above example, the raw frequency matrix is represented by the 9 rows and the first 3 columns of the matrix above. Notice that all entries are integers. The vector space model, however, allows for scaling these to non-integer values. This may be accomplished through various weighting schemes. A discussion of some weighting schemes and their uses can be found in [1]. One such weighting scheme that has received attention is log-entropy, discussed in [2], where it is stated to be the best of several different weighting schemes. Before discussing entropy and log-entropy, we fix some notation. Let p_ij and q_ij be the word counts normalized to sum to one over the rows and columns respectively:

    p_{ij} = \frac{f_{ij}}{\sum_{j=1}^{ndocs} f_{ij}}, \qquad q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{nterms} f_{ij}}.

A single occurrence of term i selected uniformly at random from all appearances of term i in the document set has probability p_ij of coming from the j-th document. A single word selected uniformly at random from the j-th document has probability q_ij of being the i-th term.
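To make the example concrete, the following sketch builds the raw frequency matrix of Example 2.1 and the normalized counts p_ij and q_ij. It is written in modern MATLAB/Octave syntax (the paper's own experiments used MATLAB 4.2c and a C word counter), and the variable names F, P, Q and the query vector q are ours, not the paper's.

```matlab
% Raw frequency matrix F for Example 2.1: rows are terms (alphabetical),
% columns are the documents d_1, d_2, d_3.
F = [0 1 0;   % another
     0 0 1;   % database
     1 0 1;   % document
     1 0 0;   % first
     0 0 1;   % in
     0 0 1;   % longest
     0 1 0;   % one
     1 0 0;   % short
     1 0 2];  % the
[nterms, ndocs] = size(F);

% Binary query vector for the query "document" (term 3).
q = zeros(nterms, 1);  q(3) = 1;

% p_ij: each row of F normalized to sum to one (where a term's occurrences live).
P = F ./ repmat(sum(F, 2), 1, ndocs);
% q_ij: each column of F normalized to sum to one (term distribution within a document).
Q = F ./ repmat(sum(F, 1), nterms, 1);
```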

The log-entropy weighting scheme operates row-wise and replaces each number f_ij with the positive real number

    \log(1 + f_{ij}) \left( 1 - \frac{-\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}}{\log(ndocs)} \right).

Here the row entropy is the quantity -\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}. It measures uncertainty of location (in terms of documents) when a single occurrence of term i is selected at random. Its minimum value is zero, and this occurs when term i appears in at most one document. Its largest possible value is log(ndocs), occurring when term i appears the same number of times in all ndocs documents. Since the maximum entropy is an increasing function of ndocs, this functional dependence is eliminated by using the normalized entropy

    \frac{-\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}}{\log(ndocs)},

which takes values between zero and one. In log-entropy weighting, a term whose appearance tends to be equally likely among the documents is given low weight across a row, and a term whose appearance is concentrated in a few documents is given higher weight in those documents where it appears. The effect of the expression log(1 + f_ij) is to dampen large variations in the raw term frequency. Log-entropy is a weighting scheme that operates row-by-row, with no interaction between terms.

We also consider a column weighting scheme that uses the entropy calculated from the column probabilities q_ij. We have a heuristic that documents with low column entropy (high repetition of a few terms) are less informative than documents with high column entropy (documents with a large variety of terms), as stated below. This rests on the idea that the information conveyed in a document increases as the uncertainty of the content increases (as measured by the uniformity of the probabilities q_ij). This column weighting scheme operates column-by-column, replacing each f_ij by the positive number

    f_{ij} \cdot \frac{-\sum_{i=1}^{nterms} q_{ij} \log q_{ij}}{\log(nterms)}.
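A small sketch of the two weightings just defined, applied to a raw frequency matrix F such as the one from Example 2.1. The handling of 0 log 0 = 0 is the usual convention; this is a reading aid under our own variable names, not the authors' code.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
[nterms, ndocs] = size(F);

% Row probabilities p_ij and row entropies, with the convention 0*log(0) = 0.
P = F ./ repmat(sum(F, 2), 1, ndocs);
PlogP = P .* log(P);  PlogP(P == 0) = 0;
Hrow = -sum(PlogP, 2);                                  % nterms-by-1 row entropies

% Log-entropy weighting: log(1 + f_ij) * (1 - Hrow_i / log(ndocs)).
Wlogent = log(1 + F) .* repmat(1 - Hrow / log(ndocs), 1, ndocs);

% Column-entropy weighting: f_ij * (normalized column entropy of document j).
Q = F ./ repmat(sum(F, 1), nterms, 1);
QlogQ = Q .* log(Q);  QlogQ(Q == 0) = 0;
Hcol = -sum(QlogQ, 1);                                  % 1-by-ndocs column entropies
Wcolent = F .* repmat(Hcol / log(nterms), nterms, 1);
```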
3. Model Development. In order to test the log-entropy weighting scheme, we developed several alternative weighting schemes. We discussed the assumptions that motivate weighting schemes and how they apply to our problem. We organized our weighting ideas into categories for clarity and systematic testing and, finally, developed a mathematical model for each idea. The assumptions can be divided into two categories: Standard Assumptions, which are well established by the information retrieval community, and Additional Assumptions, which are original ideas that motivated some of the alternative weighting schemes.

3.1. Standard Assumptions.
1. Most terms in a document do not contribute to the content.
2. "Relevant" terms are those that occur with moderate frequency.
3. "Irrelevant" terms are those that occur either the most or the least frequently.
4. If no other term but the query term appears in the document, then the document is considered irrelevant to the querent.
5. The classic definitions of precision and recall will be used to determine the efficiency of weighting schemes.
6. The inner product will be used as the measure of similarity.

3.2. Additional Assumptions.
1. The length of a document contributes to the relevance of the subject matter. For example, the longer a document is, the more relevant it is to any subject because it contains more words, whereas the shorter the document, the less relevant it is to any subject.
2. Medium document length is most useful to the searcher, in that the searcher is looking for the most relevant documents that are the quickest to read and to understand. This idea of medium document length can be represented by a unimodal function of length (one possible form is sketched after this list).
3. Important "content" material appears at the beginning of a document.
4. Documents are considered to be more useful to the reader when the term frequencies in a document are more evenly distributed.
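Additional Assumption 2 calls for a unimodal weight on document length, but the paper does not commit to a particular function. Purely as an illustration, one possible choice is a Gaussian bump in log-length centered on an assumed "medium" length L0; both L0 and the width sigma are hypothetical tuning parameters, not values from the paper.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
% One possible unimodal document-length weight (hypothetical, not from the paper):
% a Gaussian bump in log-length, centered on an assumed preferred length L0.
len   = sum(F, 1);                                       % document lengths (word counts)
L0    = median(len);                                     % assumed "medium" length
sigma = 1;                                               % assumed width in log-length units
wlen  = exp(-((log(len) - log(L0)).^2) / (2 * sigma^2)); % equals 1 at len == L0
Wlen  = F .* repmat(wlen, size(F, 1), 1);                % applied as a column scaling
```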

Development of Weighting Schemes. Our ultimate goal is to develop an information retrieval system based on the vector space model that would allow us to find the most relevant information pertinent to our search. To achieve this goal we seek representations of the knowledge base which are more efficient than the standard "raw frequency" representation. In response to our goal, we divided our weighting schemes into three categories: local weightings, column normalizations, and row normalizations. Local weighting centers on the manipulation of the f_ij term in some manner. Column and row normalizations are schemes that use an entire column or row. Normalization of the j-th column involves elementwise multiplication of the column vector d_j by a scalar-valued function of f_{1,j}, ..., f_{nterms,j}, that is, just the entries of the j-th column. Normalization of the i-th row involves elementwise multiplication of that row vector by a scalar-valued function of f_{i,1}, ..., f_{i,ndocs}. We consistently apply the same ordering to the weighting: first, column normalization is performed simultaneously on all columns, and then the result of this operation is used as input to the local weighting scheme and the row normalization. The results of local weighting and row normalization are multiplied to obtain the final matrix.

Column Normalizations. The column normalizations use all the information in the columns to prioritize the terms in the documents in some way; a code sketch of these normalizations follows this list.
1. Identity: The identity is the operation that does nothing to the word count data; the column vector does not change.
2. l_inf: The l_inf normalization divides each column entry by the maximum of all the column entries. This brings the raw frequency counts of each term to lie between zero and one. The effect is that the column vectors are scaled to have the same l_inf norm. This causes all documents to be pulled "closer" in the term space. The l_inf normalization gives the ratio of the frequency of a given word to that of the word of highest frequency.
3. Entropy: One of our assumptions is that documents with low term entropy (high repetition of few words) are less useful to the searcher than documents with high term entropy (all words equally likely). This rests on the idea that the information conveyed in a document increases as the uncertainty of the content increases. When normalizing a column by entropy we therefore multiply each raw frequency count in the column vector by (-\sum_{i=1}^{nterms} q_{ij} \log q_{ij}) / \log(nterms).
4. Median centering: When searching on documents of different lengths, very long or very short documents are problematic, since very long ones tend to be judged to be more relevant than they actually are and very short ones tend to be judged to be less relevant than they actually are, as per our assumptions. The intention of median centering is to correct this problem and yet preserve some variation in the length of the documents. To this end we compute the l_2 norm of each column and find the median of these values and their deviation from the median. Let med_d be the median and dev_{d_i} be the deviation of the l_2 norm of document i from med_d. We then multiply each column entry by med_d - sgn(dev_{d_i}) log(log|dev_{d_i}| + 1), where a scaling parameter controls the size of the logarithmic correction.
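The three non-trivial column normalizations can be sketched as follows (modern MATLAB/Octave, our own variable names). In the median-centering step, the exact placement of the scaling parameter alpha and the guard against very small deviations are our assumptions, not details given in the paper.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
[nterms, ndocs] = size(F);

% 2. l_inf normalization: divide each column by its largest entry.
Clinf = F ./ repmat(max(F, [], 1), nterms, 1);

% 3. Entropy normalization: multiply each column by its normalized entropy.
Q = F ./ repmat(sum(F, 1), nterms, 1);
QlogQ = Q .* log(Q);  QlogQ(Q == 0) = 0;
Hcol = -sum(QlogQ, 1) / log(nterms);                   % normalized entropies in [0, 1]
Cent = F .* repmat(Hcol, nterms, 1);

% 4. Median centering: scale each column by med_d - sgn(dev)*alpha*log(log|dev| + 1),
%    where dev is the deviation of the column's l2 norm from the median norm.
alpha   = 1;                                           % scaling parameter (placement assumed)
norms   = sqrt(sum(F .^ 2, 1));                        % l2 norm of each column
med_d   = median(norms);
dev     = norms - med_d;
dev_mag = max(abs(dev), 1);                            % our guard so the double log stays real
factor  = med_d - sign(dev) .* (alpha * log(log(dev_mag) + 1));
Ccent   = F .* repmat(factor, nterms, 1);
```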
Local Weightings. The local weightings refer to information specific to an individual term in one particular document. This information will be used after column normalization is done; a code sketch follows this list.
1. Raw frequency (Identity): The raw frequency f_ij is the number of times that term i appears in document j. It represents the pure information collected from the document.
2. Log-frequency: The log-frequency scales the raw frequency using the formula log(f_ij + 1). The scaling is done as a means for comparing both large and small values of raw frequency.
3. x log(1/x): One of our assumptions is that relevant words that actually suggest the content of documents occur with moderate frequency, as explained in [4]. In support of this idea, we have developed a non-symmetric, quadratic-like, concave-down function with a long right tail, x log(1/x), where x = f_ij / max_i(f_ij) is the relative frequency. The graph of x log(1/x) is shown in Fig. 3.1. The idea behind the function is that as the frequency of the term increases, the value of the term with respect to content increases until a certain maximum. Once the maximum has occurred, the value of the term decreases with frequency, but at a much slower rate.
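A minimal sketch of the two non-identity local weightings; x log(1/x) is taken to be 0 at x = 0, its limiting value. The input matrix C would normally be the output of a column normalization; here we start from the raw counts for illustration.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
C = F;                                   % input matrix (possibly already column-normalized)
[nterms, ndocs] = size(C);

% 2. Log-frequency.
Wlog = log(C + 1);

% 3. x*log(1/x), with x the entry relative to the largest entry in its column.
X = C ./ repmat(max(C, [], 1), nterms, 1);
Wxlx = X .* log(1 ./ X);
Wxlx(X == 0) = 0;                        % 0*log(1/0) taken as its limit, 0
```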

Row Normalizations. The row normalizations use all the information in the rows to determine which terms have priority in the documents.
1. Identity: The identity leaves the row vectors unchanged, i.e. we make no attempt to correct for words that occur equally in almost all documents, nor to emphasize words that are concentrated in just a few documents.
2. l_1: The l_1 normalization divides each row entry by the total sum of the entries in the row. This makes the value of each term lie between zero and one. For an average row, it is unlikely that any entry will be close to one, so it is the rows that are concentrated in one document that end up getting the most weight with this normalization.
3. Entropy: In documents (column vectors), high normalized entropy is considered good and low normalized entropy is considered bad. Here the weighting scheme operates row-wise and multiplies each locally weighted entry in row i by the positive real number

    1 - \frac{-\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}}{\log(ndocs)}.

4. Testing Procedure. A database of 134 document files and 25 query files was created using a C program written by Ilya. The program goes through the documents and queries one by one and counts frequencies for all distinct words. It creates an alphabetical list of terms and a raw frequency matrix in the MATLAB (Version 4.2c) sparse matrix format. Of the 134-document database, 109 files were downloaded from the World Wide Web using standard search engines and clustered around 5 subjects that may or may not have some overlap. The five subjects were "film noir", "sports", "broken ankle", "salmon", and "dolphins." The other 25 document files were artificially generated in order to test certain normalization and weighting schemes. For example, in order to measure the performance of column entropy normalization, several documents were created which contain nothing but certain query words repeated many times. As pointed out in the assumptions section, these files have no essential information in them and should not be retrieved during the search. In conjunction with the document database, a relevance matrix was created that rated the relevance of each document to a particular search. Mary Pat and Grace created a program in MATLAB (Version 4.2c) to apply certain combinations of weighting schemes to the sparse document database. Using the order in which we conduct our weighting schemes, there are a total of 36 different combinations. Since we are using precision and recall as our basis for efficiency, a precision/recall graph is drawn indicating the average for each weighting scheme. The combined document/query database produced a sparse matrix. It took approximately 70 seconds to generate this matrix on a Sun Sparc 20. It took approximately 300 seconds on average to perform the experiments with one combination of a weighting and normalizations on a Sun Sparc 20.

5. Performance of the Information-Retrieval System. We tested all 36 combinations of column/row normalizations and local weighting to compare the performance.
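Each combination is assembled in the order described earlier: column normalization first, then local weighting and row normalization applied to its output and multiplied together, and documents are then ranked by the inner product with the query vector. The sketch below spells this out for one combination (l_inf column normalization, x log(1/x) local weighting, no row normalization); it is our own illustrative code, not the workshop's MATLAB program, and the variable names are assumptions.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
[nterms, ndocs] = size(F);
q = zeros(nterms, 1);  q(3) = 1;         % binary query vector for "document"

% Step 1: column normalization (here l_inf).
C = F ./ repmat(max(F, [], 1), nterms, 1);

% Step 2: local weighting applied to the column-normalized matrix (here x*log(1/x)).
X = C ./ repmat(max(C, [], 1), nterms, 1);
L = X .* log(1 ./ X);  L(X == 0) = 0;

% Step 3: row normalization factors (here the identity, i.e. all ones).
R = ones(nterms, ndocs);

% Final weighted matrix and inner-product relevance scores for the query.
W = L .* R;
scores = W' * q;                                     % one score per document
[sortedScores, order] = sort(scores, 'descend');     % documents ranked best first
```

Any of the other combinations is obtained by swapping the three steps for the alternatives listed in the previous section.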
There are two parameters used to measure performance: recall and precision. Recall is the proportion of all relevant documents in the collection that are retrieved by the system, and precision is the proportion of relevant documents in the set returned to the user:

    Recall    = (number of retrieved relevant documents) / (number of relevant documents),
    Precision = (number of retrieved relevant documents) / (number of retrieved documents).
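A minimal sketch of how these two measures are computed from a ranked result list and a 0/1 relevance vector; the example data below are made up for illustration (in the experiments, relevance judgments come from the relevance matrix described in Section 4).

```matlab
% A tiny made-up example: 5 documents ranked by score; documents 2 and 4 are relevant.
order = [2 5 4 1 3];   % document indices, best first (e.g., from the previous sketch)
rel   = [0 1 0 1 0];   % rel(d) = 1 if document d is relevant to the query
k     = 3;             % number of documents returned to the user

retrieved         = order(1:k);              % top-ranked documents: 2, 5, 4
num_rel_retrieved = sum(rel(retrieved));     % 2 of them are relevant
recall    = num_rel_retrieved / sum(rel)     % = 2/2 = 1.0
precision = num_rel_retrieved / k            % = 2/3
```

The example quoted next in the text (100 returned documents of which 40 are relevant, giving precision 0.4) follows from the same formula.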

For example, when a user gets 100 documents returned by a system and 40 of them are relevant to the query, then the precision is 0.4. For each precision/recall graph, we calculated several levels of recall, and for each recall level, precision is calculated and averaged across the queries.

Fig. 5.1 is the recall-precision graph for the combination none-none-none. In this scheme, we use the raw frequency matrix without any column/row normalization or local weighting. This curve shows the typical behavior of recall-precision: when recall is small, precision increases as recall increases, and once precision reaches its maximum it starts decreasing slowly. This behavior is typical because, as there is a fixed number of relevant documents in the collection, when a small number of documents are returned to a user, the chance of more relevant documents being returned increases as the number of returned documents increases. However, once all relevant documents have been retrieved, requesting more documents makes the precision go down. Fig. 5.2 is the recall-precision graph for our baseline none-log-entropy. Here, the scheme uses log(f_ij + 1) for local weighting, entropy for row normalization, and no column normalization. The basic shape of the curve is the same as Fig. 5.1, except precision is slightly higher everywhere. The last graph, Fig. 5.3, is a little different. It is the recall-precision graph for the combination none-x log(1/x)-none: x log(1/x) is used for local weighting, and no column/row normalization. First, precision increases rapidly at the beginning. Secondly, the maximum precision is about 30% higher than the none-none-none combination, and about 20% higher than our baseline none-log-entropy.

We analyzed the performance of all 36 different weighting schemes. As the measure of performance, the maximum precision of each weighting scheme is used. The summary of our analysis is shown in Table 5.1.

    Column normalization (relative to none):
        entropy      centering    l_inf
        +1%          +8%          +10%      (with local weighting none or log(f+1))
        0%           0%           0%        (with local weighting x log(1/x))

    Local weighting (relative to none, with any column/row normalization):
        log(f+1)     x log(1/x)
        +1-8%        +24-29%

    Row normalization (relative to none):
        entropy      l_1
        +0-3%        +2-6%                  (with local weighting none or log(f+1))
        0 to -1%     -3 to -4%              (with local weighting x log(1/x))

    Table 5.1. Performance of Weighting Schemes

First, for column normalization, when none or log(f_ij + 1) is used as local weighting, median centering and l_inf normalization performed about 8-10% better than when we use no column normalization, whereas entropy does not seem to make a big difference. However, with x log(1/x) local weighting, column normalization does not affect the performance. Secondly, for local weighting, our new weighting x log(1/x) made a visible improvement, about 24-29%; log(f_ij + 1) improved the performance only modestly. Finally, for row normalization, when none or log(f_ij + 1) is used as local weighting, l_1 normalization performed best. On the other hand, when x log(1/x) is used, the performance was best when no row normalization was done. This suggests that we should not use any row normalization when x log(1/x) is used as local weighting.

Among all 36 weighting schemes we tested, the best performance was achieved by the combinations c-x log(1/x)-none, where c is any column normalization. In that case, precision was about 70%. The worst combination in performance was none-none-none, i.e., when no normalization or weighting is done. Precision in this case was about 43%. To summarize, our new local weighting x log(1/x) performed substantially better than log(f_ij + 1). When using x log(1/x), it is recommended that no row normalization be performed.
Column normalization does not improve or worsen the performance when x log(1/x) is used. Finally, the baseline none-log-entropy had a maximum precision of 49%, which is only slightly better than the performance of none-none-none and about 20% lower than c-x log(1/x)-none. Thus, the none-log-entropy weighting scheme does not seem to be optimal.

6. Workshop Experience. We spent approximately 7 full days working on our problem. The first two days were spent familiarizing ourselves with the basics of information retrieval and how it applied to our problem.

Some of our learning came from group sessions where Laura presented background and answered questions. Other learning came from independent research in the library and independent reading of papers suggested by Laura. In addition, we tried to outline specific goals for the remainder of the workshop. After the goals were decided, we worked both individually and in our group to address them. The group sessions served a two-fold purpose. One purpose was to generate new ideas, but another was to give members a chance to sound out their ideas and get feedback from the group. Some of these ideas had already been conceived and implemented. Others could not be further developed at the workshop for lack of time. We decided to focus on developing different variations of the representation model, partially because we believed that the remaining time would be sufficient for us to thoroughly test our ideas. The most time-consuming part of the testing phase was coding. It took about a day to fully debug and fine-tune the word counter code and about a day to get the MATLAB programs running.

7. Conclusions. We have shown that the particular weighting scheme known as log-entropy is not optimal for ad-hoc querying. As discussed in the previous section, the log-entropy weighting scheme performs at a precision rate of about 50%. A new weighting scheme consisting of c-x log(1/x)-none, where c is any column normalization, performs at a precision rate of about 70%. It would appear that the weighting scheme that has the greatest effect on performance is the local weighting.

There are three improvements that could be made in future work. First, a different similarity measure could be explored. As stated in the Assumptions Section 3.1, we employed the inner-product similarity measure. There are other types of similarity measures, as presented in [3], but each of these measures uses the inner product as a basis and creates a more complicated measure from there. Second, the assumption from Section 3.2 could be applied: that more important content material appears at the beginning of the document. In this case, the whole document would not have to be searched. The first two or three paragraphs could be searched, and this would reduce the size of our representation matrix. Finally, the query terms could be weighted. Using a more clearly defined search would limit the number of "useless" documents that were returned.

We pushed the limits of MATLAB while experimenting with our database. Considering the size of the documents, the computation times were rather slow. In order to test more realistic and therefore larger databases, it would be necessary to use a lower-level language like C or FORTRAN, or possibly to employ a parallel computer. Finally, our group still believes that column normalizations should have some effect on the search engine. We would like to explore alternate schemes to prove the validity of column normalizations.

Our group contained six people who each brought different skills to the problem-solving endeavor. We were able to work well with each other and respect each other's opinions. While this workshop was one week of intense problem solving, it is safe to say that we learned something about our problem and about each other.

REFERENCES

[1] C. Buckley and G. Salton, Term weighting approaches in automatic text retrieval, tech. rep., Cornell University, 1987.
[2] S. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments and Computers, 23 (1991), pp. 229-236.
[3] W. Jones and G. Furnas, Pictures of relevance: a geometric analysis of similarity measures, Journal of the American Society for Information Science, 38 (1987), pp. 420-442.
[4] G. Salton, A. Wong, and C. Yang, A vector space model for automatic indexing, Communications of the ACM, 18 (1975), pp. 613-620.

Fig. 3.1. The x log(1/x) local weighting scheme.

Fig. 5.1. No weighting scheme: none-none-none (precision vs. recall).

Fig. 5.2. Baseline weighting scheme: none-log-entropy (precision vs. recall).

Fig. 5.3. Best weighting combination: any column-x log(1/x)-none (precision vs. recall).


More information

A Parallel Intermediate Representation based on. Lambda Expressions. Timothy A. Budd. Oregon State University. Corvallis, Oregon.

A Parallel Intermediate Representation based on. Lambda Expressions. Timothy A. Budd. Oregon State University. Corvallis, Oregon. A Parallel Intermediate Representation based on Lambda Expressions Timothy A. Budd Department of Computer Science Oregon State University Corvallis, Oregon 97331 budd@cs.orst.edu September 20, 1994 Abstract

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a Multi-Layer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org

More information

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

TABLES AND HASHING. Chapter 13

TABLES AND HASHING. Chapter 13 Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/ TABLES AND HASHING Chapter 13

More information

4. Write sets of directions for how to check for direct variation. How to check for direct variation by analyzing the graph :

4. Write sets of directions for how to check for direct variation. How to check for direct variation by analyzing the graph : Name Direct Variations There are many relationships that two variables can have. One of these relationships is called a direct variation. Use the description and example of direct variation to help you

More information

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

FSRM Feedback Algorithm based on Learning Theory

FSRM Feedback Algorithm based on Learning Theory Send Orders for Reprints to reprints@benthamscience.ae The Open Cybernetics & Systemics Journal, 2015, 9, 699-703 699 FSRM Feedback Algorithm based on Learning Theory Open Access Zhang Shui-Li *, Dong

More information

CHAPTER 2: SAMPLING AND DATA

CHAPTER 2: SAMPLING AND DATA CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),

More information

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs DATA PRESENTATION At the end of the chapter, you will learn to: Present data in textual form Construct different types of table and graphs Identify the characteristics of a good table and graph Identify

More information

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo Two-Stage Service Provision by Branch and Bound Shane Dye Department ofmanagement University of Canterbury Christchurch, New Zealand s.dye@mang.canterbury.ac.nz Asgeir Tomasgard SINTEF, Trondheim, Norway

More information

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an Competitive Neural Trees for Vector Quantization Sven Behnke and Nicolaos B. Karayiannis Department of Mathematics Department of Electrical and Computer Science and Computer Engineering Martin-Luther-University

More information