Similarity search in multimedia databases


Performance evaluation for similarity calculations in multimedia databases
JO TRYTI AND JOHAN CARLSSON
Bachelor's Thesis at CSC
Supervisor: Michael Minock
Examiner: Karl Meinke


Abstract

Similarity search is an increasingly popular topic in information retrieval. This report focuses on text similarity in conjunction with other metrics, and explores and reviews the statistical methods described in the information retrieval literature regarding similarity search and vector space modelling for use in multimedia databases. The aim of this report was to investigate how to efficiently calculate the k nearest neighbours in a multimedia database. Our results confirm earlier work in the field: heuristic implementation methods are more efficient than a naïve method, without any significant loss in precision.

Contents

1 Introduction
2 Background
   2.1 Similarity search and information retrieval
       Recall and precision
       Multimedia objects
   2.2 Text analysis
       Term frequency
       Term discrimination
       Length normalization
       TF-IDF
       Stopwords
       Stemming
   2.3 Vector space model
       Limitations
3 Method
   3.1 Index implementations
       Database
       Measurement scale
       Test cases
4 Results
   Computational time
   Index comparison
   Precision
5 Discussion
6 Conclusion
Bibliography
Appendices

Chapter 1 Introduction

Similarity search is an increasingly popular topic in the information retrieval field, both in the academic and in the commercial world. Many online companies and other services on the Internet strive to provide accurate and relevant recommendations to their users based upon their preferences. For multimedia objects (movies, music, etc.) it can be hard to measure how well two objects correlate with each other, as they often contain what are called "fuzzy" objects, in other words objects to which it is hard to assign a specific value or meaning. There are many ways of computing the similarity between multimedia objects, but this report will focus mostly on text similarity between media objects and on comparing different heuristic implementations. Our aim with this report is to investigate how to efficiently calculate the k nearest neighbours in a multimedia database.


Chapter 2 Background

2.1 Similarity search and information retrieval

Recall and precision

As this report focuses on information retrieval and heuristic methods, some terms specific to the field will be used throughout the text, namely precision and recall. Precision is the fraction of the returned results that are relevant. Recall is the fraction of all relevant objects in the collection that are returned.

Multimedia objects

A multimedia object can consist of an arbitrary number of different fields, but the fields are usually classified into three kinds:

Token or text: a description or category consisting of multiple tokens.
Metric: an enumerable attribute, for example the date when a music track was released.
Precalculated: items with a precalculated distance, for example countries with geographical distances.

For each field a similarity score is calculated separately, and the scores are then combined. Each field is weighted according to (2.1):

score(q, d) = \frac{\sum_{i=1}^{n} c_i \, sim(q_i, d_i)}{\sum_{i=1}^{n} c_i}   (2.1)
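To make the field weighting in (2.1) concrete, the sketch below combines per-field similarity scores into one object-level score. It is a minimal illustration of the weighted average under the definitions above, not the thesis' actual implementation; the field names and weights are hypothetical.

def combined_score(field_sims, field_weights):
    """Weighted combination of per-field similarities, as in equation (2.1).

    field_sims    -- dict mapping field name to sim(q_i, d_i) in [0, 1]
    field_weights -- dict mapping field name to its weight c_i
    """
    numerator = sum(field_weights[f] * field_sims[f] for f in field_sims)
    denominator = sum(field_weights[f] for f in field_sims)
    return numerator / denominator if denominator else 0.0

# Hypothetical example: plot text, genre and release year similarities.
sims = {"plot": 0.42, "genre": 0.80, "year": 0.65}
weights = {"plot": 3.0, "genre": 1.0, "year": 1.0}
print(combined_score(sims, weights))  # weighted average in [0, 1]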

2.2 Text analysis

There are many different methods and implementations for comparing text fields. In short, a given text can be analysed with semantic or with statistical tools. The implementations described below focus on statistical tools. When referring to a collection of documents, for example our database of movie descriptions, the term corpus will be used.

Table 2.1. Definitions
c(w, d)   the count of word w in document d
f(q, d)   probability score for document d given the query document q
df(w)     document frequency, the count of documents that contain w

Term frequency

Term frequency is a summary of how many times a word appears in a document. As mentioned in the stopwords section, common words like "a", "and" and "or" will likely get a high count if stop words are not removed first. Term frequency is often an indicator of how closely related documents are, based upon how frequently certain keywords recur in them, albeit an unreliable one, because even after removing stop words there are still recurring words which may be unrelated or add no further information. A common modification is to use the logarithm of the term frequency: if a term occurs twenty times more often than another term in a document, it is unlikely to be twenty times more significant [Christopher D. Manning and Schutze, p. 127]. Raw counts are also easily skewed if there is a large discrepancy in document sizes, which is why document length is normalized. A formal constraint is given by [Hui Fang, p. 50]:

If q = \{w\}, |d1| = |d2| and c(w, d1) > c(w, d2), then f(q, d1) > f(q, d2).   (2.2)

Common TF heuristics, listed from [Gerard Salton, 1988], are shown in Table 2.2.

Table 2.2. Term frequency heuristics
boolean                 tf(w, d) \in \{0, 1\}                                    1 if the term is in the document, else 0
raw                     tf(w, d) = c(w, d)                                       raw term frequency for the document
logarithmic             tf(w, d) = \log(c(w, d) + 1)                             logarithmically scaled term frequency
augmented normalized    tf(w, d) = \frac{1}{2} + \frac{c(w, d)}{2 \max_{w_i \in d} c(w_i, d)}    where \max_{w_i \in d} c(w_i, d) is the maximum count of any term in d
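The following sketch implements the four TF heuristics of Table 2.2 in Python. It is an illustrative textbook-style implementation under the definitions above, not the code used in the experiments.

import math
from collections import Counter

def term_counts(document_tokens):
    """c(w, d): raw counts of each term in a tokenized document."""
    return Counter(document_tokens)

def tf_boolean(counts, w):
    return 1.0 if counts[w] > 0 else 0.0

def tf_raw(counts, w):
    return float(counts[w])

def tf_logarithmic(counts, w):
    return math.log(counts[w] + 1)

def tf_augmented(counts, w):
    """Augmented normalized TF: 0.5 + 0.5 * c(w, d) / max_i c(w_i, d)."""
    max_count = max(counts.values())
    return 0.5 + 0.5 * counts[w] / max_count

doc = "the cat sat on the mat the cat".split()
c = term_counts(doc)
print(tf_raw(c, "cat"), tf_logarithmic(c, "cat"), tf_augmented(c, "cat"))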

Term discrimination

In 1972, K. Spärck Jones described a new statistical method [Jones, 2004] called the inverse document frequency. The inverse document frequency is similar to term frequency, but counts how often a term occurs across the corpus rather than within a single document. If the corpus only contains documents on a very specific topic, the topic-specific terms will have a high document frequency, and because the inverse frequency is used they will be given a low weight, as they are less informative than, say, an uncommon word in the corpus. The inverse document frequency thus gives an idea of how frequent terms are in relation to all the given documents: common terms get a low value and uncommon terms get a high value.

Table 2.3. Term discrimination heuristics
raw             idf(w) = 1                          no change in weight
logarithmic     idf(w) = \log(N / n)                where N is the total number of documents in the corpus and n is the number of documents containing w
probabilistic   idf(w) = \log((N - n) / n)          probabilistic inverse frequency factor

Length normalization

As mentioned by [Hui Fang, p. 50], length normalization is used to penalize long documents, so as not to favour documents according to their size.

TF-IDF

Term frequency (TF) coupled with the inverse document frequency (IDF) gives the TF-IDF method. It is a good statistical tool for assigning weights to terms: each term is weighted by how many times it appears in a specific document in relation to how often it occurs in the corpus as a whole. It is a powerful statistical method used in many search engines.

Stopwords

One thing to take into consideration when comparing text fields is whether there are words not worth including. Words like "a", "or" and "and" neither add nor remove any value from a given text, and can thus be safely removed. There are several widely used stop word lists, but many of them are context dependent. In one of the implementations, described in the method section, a dynamic stop word list is used, based upon the removal of every term below a certain threshold in the idf table.
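As an illustration of the IDF weights in Table 2.3 and the idf-based dynamic stop word list described above, the sketch below builds document frequencies for a tiny corpus and derives a stop word set. It is a minimal example under the definitions given here; the corpus and the threshold value are hypothetical, not those used in the thesis.

import math
from collections import Counter

def document_frequencies(corpus_tokens):
    """df(w): number of documents in which each term occurs."""
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    return df

def idf_logarithmic(df, n_docs):
    """Logarithmic IDF: idf(w) = log(N / n)."""
    return {w: math.log(n_docs / n) for w, n in df.items()}

def idf_probabilistic(df, n_docs):
    """Probabilistic IDF: idf(w) = log((N - n) / n); unbounded below for ubiquitous terms."""
    return {w: math.log((n_docs - n) / n) if n < n_docs else float("-inf")
            for w, n in df.items()}

def dynamic_stopwords(idf_table, threshold):
    """Terms whose idf falls below the threshold are treated as stop words."""
    return {w for w, v in idf_table.items() if v < threshold}

corpus = [d.split() for d in ["a fast car", "a slow car", "a red boat"]]
idf = idf_logarithmic(document_frequencies(corpus), len(corpus))
print(dynamic_stopwords(idf, threshold=0.5))  # {'a', 'car'}: the most common terms get low idf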

Stemming

Linguistic morphology is the study of word structure and can be used in information retrieval as a way of reducing the number of distinct words based upon their structure. Words such as "search", "searching" and "searches" can all be reduced to the root "search". This comes with a reduction in precision, as some words can lose their actual meaning: there are many words which share a similar root but have different meanings. This is called over-stemming. Just as with stop words, stemming is to some degree context dependent and there are different stemming algorithms which can be used; the best known is M. Porter's algorithm [Porter, 1980], which has been translated into several different programming languages since it was first published in 1980. We decided not to use any stemming on our dataset, as it reduces precision and our documents are of limited size.

2.3 Vector space model

One way of measuring the similarity between two or more documents in a corpus is to index each document as a vector of term weights and then compare the vectors using the cosine of the angle between them, derived from the scalar product as described in [Christopher D. Manning and Schutze] and shown in (2.3):

cosine(A, B) = \frac{A \cdot B}{|A| \, |B|}   (2.3)

where A and B are the term vectors of the two documents and |A| and |B| are their lengths.

Limitations

There are several limitations to using a vector space model, as noted by V. Raghavan et al. [Raghavan and Wong, 1986]. The vector space for large documents can grow very rapidly, as each term represents one dimension of the vector; vector spaces with more than a thousand dimensions are not uncommon. There is an inherent problem with high-dimensional vector spaces: the vectors become sparse and tend to be almost perpendicular to one another. Since the cosine is the dot product divided by the vector lengths (2.3), and the dot product stays relatively small as the number of dimensions increases, the cosine similarity tends towards zero. This is called the curse of dimensionality, something which is to some extent reduced in this report by the use of stop words and stemming and by the minor size of our movie descriptions. Another limitation of the vector space model is that no semantic analysis is performed. As a result, documents which have a similar meaning but use slightly different words might not be matched at all, because the vector space model does not analyse the context of the documents, merely the frequency of terms. It also assumes that every term is statistically independent. We take note of these inherent problems with the vector model in regard to semantic analysis, but we will not consider them further, as linguistic methods are not in the scope of this paper.
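A small sketch of the cosine similarity in (2.3), computed over sparse term-weight vectors represented as dictionaries. It is a textbook implementation included for illustration, not the thesis code; the example vectors are hypothetical.

import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    # Dot product over the terms the two vectors share.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf weighted vectors for two short movie plots.
d1 = {"space": 1.2, "alien": 2.1, "ship": 0.7}
d2 = {"space": 0.9, "ship": 1.5, "crew": 1.1}
print(cosine(d1, d2))  # value in [0, 1]; higher means more similar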

Chapter 3 Method

In this section the different implementations used in this report are described further. The implementations mostly differ in what kind of weights are applied, and there is a base case, a naive implementation, used when comparing the different implementations' time and, to some extent, memory consumption.

3.1 Index implementations

The most common approach when calculating similarities for information retrieval is to construct an index. A standard implementation is an inverted index [Christopher D. Manning and Schutze, p. 67]; this index improves calculation speed at the cost of memory. An inverted index maps a term to a list of documents together with their term frequency for that term. When doing a similarity search to find the k nearest neighbours with the inverted index, it is only necessary to calculate similarities for documents that share at least one term with the query. For some web services the similarity search is just a minor part, and the cost of keeping a large index might make a simpler but slower implementation desirable. The computational drawback of an inverted index is that it makes document length calculations expensive. We have implemented similarity search using both an inverted index and a simpler index. The simple index calculates similarities by comparing two documents at a time, and it also links terms with their document frequency.
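To illustrate the inverted index described in section 3.1, the following sketch maps each term to the documents containing it and scores only documents that share at least one term with the query. It is a simplified illustration assuming precomputed term weights (length normalization is omitted), not the implementation evaluated in the thesis; the document vectors are hypothetical.

from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def knn(query_vector, index, k):
    """Score only documents sharing at least one query term, then take the top k."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight  # accumulate the dot product
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

# Hypothetical precomputed tf-idf vectors for three movie plots.
docs = {
    "m1": {"alien": 1.4, "ship": 0.8},
    "m2": {"romance": 1.1, "paris": 1.6},
    "m3": {"alien": 0.9, "invasion": 1.3},
}
index = build_inverted_index(docs)
print(knn({"alien": 1.0, "invasion": 0.5}, index, k=2))  # m3 and m1 score; m2 is never touched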

Database

The database on which the different methods and implementations are used is a small subset of IMDB. We have decided to focus on just a few attributes which were deemed interesting, more precisely the plot, genre and release year. The description of each movie is between words, and the test database consists of about 100 thousand movies. Each movie object has a description, a release date and a genre descriptor. Both the description of the movie and the genres are compared using a vector space, detailed in 2.3, and the release year is computed as a metric, as can be seen in (3.1):

yearsim(A, B) = \log |B - A|   (3.1)

Measurement scale

To measure the heuristics used in the test cases, one implementation was chosen as a reference scale. The scale used is defined in (3.2) and is a commonly used weighting scheme:

tf(t, d) = \log(c(t, d) + 1)
idf(t) = \log \frac{N}{|\{d \in N : c(t, d) > 0\}|}
cosim(d, q) = \frac{\sum_{t \in q} idf(t) \, tf(t, d) \cdot idf(t) \, tf(t, q)}{|d| \, |q|}   (3.2)

3.4 Test cases

Term frequency weights: boolean, raw and logarithmic term frequency heuristics were implemented as described in Table 2.2.

Inverse document frequency weights: raw and logarithmic inverse document frequency heuristics were implemented as described in Table 2.3.

Length normalization: the two length normalization heuristics used were the vector length and the number of terms in the document. The vector length based upon the number of terms, as described by [Christopher D. Manning and Schutze], is commonly used for inverted indexes.

Stopwords: the word removal technique used was to remove all words with a logarithmic idf value of more than 2. Over half of the documents contain terms that have a logarithmic idf value of 2 or less.
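The reference scale in (3.2) can be sketched as follows: log-scaled term frequencies, logarithmic idf, and a length-normalized cosine between the query and each document. This is one interpretation of the formulas above under the stated definitions, not the thesis' actual code; the toy corpus is hypothetical.

import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """tf(t, d) = log(c(t, d) + 1), multiplied by idf(t)."""
    counts = Counter(tokens)
    return {t: math.log(c + 1) * idf.get(t, 0.0) for t, c in counts.items()}

def cosim(d_vec, q_vec):
    """Length-normalized dot product of the two tf-idf vectors, as in (3.2)."""
    dot = sum(w * q_vec[t] for t, w in d_vec.items() if t in q_vec)
    norm_d = math.sqrt(sum(w * w for w in d_vec.values())) or 1.0
    norm_q = math.sqrt(sum(w * w for w in q_vec.values())) or 1.0
    return dot / (norm_d * norm_q)

# Hypothetical three-document corpus of movie plots.
corpus = {"m1": "alien ship attacks earth".split(),
          "m2": "two friends travel to paris".split(),
          "m3": "alien invasion of a small town".split()}
N = len(corpus)
df = Counter(t for tokens in corpus.values() for t in set(tokens))
idf = {t: math.log(N / n) for t, n in df.items()}

query = tfidf_vector("alien attacks a town".split(), idf)
ranking = sorted(((cosim(tfidf_vector(toks, idf), query), movie_id)
                  for movie_id, toks in corpus.items()), reverse=True)
print(ranking)  # most similar plots first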

Chapter 4 Results

Computational time

Figure 4.1 shows the average kNN computation time for twenty movie queries, measuring the TF, IDF and length normalization heuristics.

Index comparison

Figure 4.2 shows the computational time for the different index implementations, averaged over twenty movie kNN queries.

Precision

Figure 4.3 shows the average difference in results for k = 10 over twenty movie kNN queries.

Figure 4.2: Index comparison. Figure 4.3: Precision.

Chapter 5 Discussion

The outcome of our experiments was in some sense surprising, as we thought that each weight optimization added or modified would result in a time reduction at the cost of precision. This was not the case: computation time was fairly constant, and the reduction in precision was lower than expected. The question is of course what an acceptable trade-off is between fast computation time and precision. Quite surprising was the accuracy of the raw TF test case (Figure 4.3), one case which we thought would have a low precision. We are not quite sure whether this was caused by an error in our implementation or whether there was not enough testing to give an accurate value. In the test database we used, searches were quick across the different implementations, and a loss of precision was only seen with some of the heuristic methods. The genres in our database had a dimension space of size six and we used TF-IDF for them, even though with such a low-dimensional space the effect was negligible and the genre could instead have been treated as an enumerable discrete token. One of the problems with our implementation is that building the TF-IDF index requires extensive computation, and each time a movie is added to the database the entire movie index needs to be updated. This makes the index difficult to maintain, but it is required when using the TF-IDF method.


Chapter 6 Conclusion

Using statistical methods to improve retrieval performance is a valid approach, and it is no surprise that it is a favored option for commercial use. There is a trade-off between efficient search queries and the exhaustive index maintenance needed when the database is updated. Using different kinds of heuristic methods in conjunction with the TF-IDF index improves query processing speed at some cost in precision, depending on the heuristic used. We found that the use of stop words greatly reduced computation time while also having the lowest precision decrease of the heuristic methods.


Bibliography

[Christopher D. Manning and Schutze, ] Christopher D. Manning, P. R., and Schutze, H. Introduction to Information Retrieval. Cambridge University Press.

[Gerard Salton, 1988] Gerard Salton, C. B. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5).

[Hui Fang, ] Hui Fang, Tao Tao, C. Z. A formal study of information retrieval heuristics. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

[Jones, 2004] Jones, K. S. (2004). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 60(5).

[Porter, 1980] Porter, M. (1980). Porter stemmer.

[Raghavan and Wong, 1986] Raghavan, V. V. and Wong, S. M. (1986). A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, 37(5).
