Bachelor Thesis: Approximate String Joins


Bachelor Informatica
Universiteit van Amsterdam

Bachelor Thesis: Approximate String Joins

Vasco Visser

August 17, 2009

Supervisor(s): Arjen de Vries (CWI)


Abstract

This thesis addresses the problem of joining two relations that contain the same entities but lack attributes suitable to serve as conditions in an equi-join. Because of the lack of such attributes, the only way to match the relations is to compute some measure of similarity that hopefully correlates with the underlying semantics. A naive solution computes the cross product of the two relations and then calculates the similarity for each pair of tuples in the Cartesian product. Computing the cross product has quadratic complexity and is generally considered unscalable. This thesis explores alternatives with lower complexity. The focus lies on a particular solution using TF-IDF weights and cosine similarity as the measure of similarity between strings. A probabilistic sampling scheme is examined that produces a sample of the original relation(s) participating in the join, which can be used to approximate the true cosine similarity join. The sampling scheme can be implemented on any database system that supports SQL. Experiments examine the correctness and scalability of this solution. The effectiveness of the sampling scheme is also examined experimentally, both in terms of performance and in terms of precision and recall with respect to the true cosine similarity join. The empirical data shows that the join and the sampling scheme both work correctly. The scalability of the similarity join without first sampling the data is poor: the running time grows quadratically with the input size. The scalability of the similarity join using the samples generated by the sampling scheme is better; the running time is a function of the size of the sample set and is therefore controlled by a user-defined variable.


Contents

1 Introduction
2 Similarity
   2.1 Bit difference
   2.2 Longest Common Substring and Sequence
   2.3 Edit distance
   2.4 Cosine Similarity
       2.4.1 Dimensionality, or choice of token
       2.4.2 Term weights
       2.4.3 Cosine distance calculation
3 Similarity Join
   3.1 Cosine Similarity Join
       3.1.1 High frequency tokens
       3.1.2 Sample scheme
       3.1.3 Choice of Sample Size
       3.1.4 Increasing recall
4 SQL Solutions
   4.1 A naive approach
   4.2 Excluding trivial cases
   4.3 Similarity Join using Cosine Similarity
       4.3.1 Token weighting
       4.3.2 Normalization
       4.3.3 Sampling the Tokens
5 Top-k extension
   5.1 Top-k as a post-processing step
       5.1.1 Standard SQL solution
       5.1.2 Generalized top-k
       5.1.3 Generalized top-k extension
   5.2 Other solutions
6 Experimental Results
   6.1 Preprocessing and sampling
   6.2 Join
   6.3 Effect of ǫ
7 Conclusion

CHAPTER 1

Introduction

Many databases contain overlapping data but lack shared attributes to identify semantically equal entities between them. As an example, consider a database containing information about actors, like the Internet Movie Database, and a generic encyclopaedia, like Wikipedia. One might wonder which actors have an article about their persona in the generic encyclopaedia. To answer this question a join has to be computed between the two databases. However, conventional (equi-) joins combine tuples based on an equality predicate; the alternative of a theta-join matches tuples on an inequality over a deterministic ordering, which given two values will always produce the same order (usually some numerical quantity). No attributes can be expected to exist in two unrelated databases for which equality or an ordering identifies all or most tuples (possibly) belonging to the same entity.

The example above calls for a join with approximate predicates. The result of such a join is not a set of tuples equal to one another on some attributes, but a set of tuples that are approximately equal to one another. To determine similarity, tuples should be compared using some function whose result is a quantity expressing some form of similarity. It is not trivial to compute such an approximate or similarity join with lower than quadratic cost, since in principle all tuples from the relations being joined have to be compared to determine their similarity.

As another example, consider price comparison websites. Price comparison websites have bots crawling online warehouses, collecting data from many different sources, each source likely representing products in a different but rather similar way. Possibly some universal product identification number exists, but not all warehouses will show this number on a product web page. For those products that cannot be associated with an identification number, an approximate join with the comparison website's own product database can be computed in order to add the unidentified products as instances of known products.

An approximate join generally plays a part in a data-integration project, by which is meant the undertaking of combining data from different sources to form a new unified data source. An approximate join is generally only a part of this, since the result set of such a join will likely contain false positives (see chapter 2). The result set of an approximate join is a set of tuple pairs (t_i, t_j) where the tuples originate from different databases and the tuples in each pair possibly belong to the same entity.

An approximate join thus provides data for another procedure that determines whether two tuples indeed belong to the same entity.

This project investigates how an approximate string join can be accomplished in existing, unaltered database systems using plain SQL. In addition, the variables involved in an approximate join are explained, and their effect on the result and on performance is discussed. Performance measurements are made to determine the real-world performance of the investigated solutions and the effects of the variables involved. Finally, a note is made on the possibility of a join that finds the top k best matches for each tuple in the original relation.

As a side note it is worth mentioning that this project is about joining databases using relatively small strings: the database attributes on which the join conditions are set are assumed to fit in a regular varchar field, and thus are no more than 255 characters long. The performance (both quality- and time-wise) of the investigated solutions for larger strings like paragraphs or articles is not considered. The investigated solutions might however have an application for such larger strings, for example in the detection of plagiarism.

CHAPTER 2

Similarity

As mentioned earlier, a join with approximate predicates is needed. The predicate should be a value of some metric that says something about how similar two tuples are. Similarity is inversely proportional to the difference between strings, and the difference can be expressed as the sum of the errors in one string whose correction would make both strings equal. The notion of what constitutes an error differs from one metric to another, making certain metrics better suited for situations in which particular errors¹ are likely to occur.

Now the question arises: what similarity metric should be used? Before answering, it is worthwhile to consider that an approximate join is likely to play a part in a data-integration project, so it helps to know the requirements on the result set of an approximate join in that role. Besides trivial requirements of correctness, two requirements can be documented: 1) the result should be precise, meaning there must be no tuple pair in the join result that holds tuples not belonging to the same entity² (false positives); and 2) the result should be complete, meaning no pairs of tuples should exist that belong to the same entity but are not in the join result (false negatives).

In practice a tradeoff between the two requirements has to be made, because both precision and completeness are affected by the single threshold set for a given similarity metric: a more precise result is the consequence of a tighter threshold, a more complete result of a looser one. For an approximate join in its role as part of a data-integration project, the second requirement is far more important than the first, as false positives can be excluded after the join has been executed; this is clearly not possible for false negatives. Altogether this means that a good metric makes the join result both complete and precise. It seems reasonable to expect that the better a metric fits the data at hand³, the higher the threshold can be while still giving reasonably complete results, and that a lower threshold produces a less precise result. Therefore it seems reasonable to favour a metric that is suited for as many as possible of the types of errors that (can) occur.

¹ For example spelling or typing errors, or errors resulting from convention differences, e.g. <firstname> <lastname> vs. <lastname>, <firstname>.
² It is assumed that any entity recognition, i.e. determining if two strings designate the same entity, will conform to human perception.
³ That is, there is a high correspondence between the types of errors that occur and the types of errors the metric is suited for.
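The terms precise and complete correspond to the standard information retrieval measures of precision and recall; stated as formulas (added here for reference, not part of the original text):

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN count the true positives, false positives and false negatives of the join result with respect to the set of tuple pairs that truly designate the same entity.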

In the remainder of this chapter some similarity metrics are discussed, with examples illustrating that certain metrics are suited only for specific errors.

2.1 Bit difference

One can state that if the exclusive-or (XOR) of the binary representations of two character sequences shows relatively few ones, the strings are highly similar. The XOR of two bit strings is called the bit difference. It is indeed true that when two strings have a low ones count in their bit difference, the strings are highly similar, as a one in the bit difference can at most indicate a difference of one character in the string. The inverse is not necessarily true; when the bit difference of two strings has many ones, the strings can still, at least in human perception, be very much the same. As an example consider the two names Johnson and Jonson: XORing the bitwise representations of these strings results in many ones, because the missing h shifts every subsequent character. However, a human would look at these names and consider them highly similar, probably suspecting a spelling error. Therefore the bit difference is not suited where missing characters are possible errors.

2.2 Longest Common Substring and Sequence

A different approach is to look at the length of the longest common substring (LCSstr) of two strings. For any two strings we can determine the longest string they both have in common; using the example from the previous paragraph, Johnson and Jonson, the LCSstr is nson. We can further relate the length of the LCSstr to the length of the longer of the two compared strings, in this case giving 4/7. Based on this factor one would find the strings to be about 57 percent equal. Still this does not conform to our human intuition of the two strings being highly similar. This approach suffers the same drawback as the bit-difference approach: a high rating does indicate high similarity, but a low rating does not necessarily mean two strings are not similar by human standards.

As an alternative to LCSstr, the length of the longest common subsequence (LCSseq) can be used. The LCSseq looks at subsequences, which unlike substrings need not be consecutive parts of a string. This makes the LCSseq less sensitive to spelling errors; a single spelling error in one of two otherwise identical strings alters the length of the LCSseq by exactly one. Using the example of the preceding paragraph, Johnson and Jonson, the LCSseq is Jonson, yielding a similarity factor of 6/7. Based on this factor one would find the two strings to be about 86 percent equal. Of all metrics discussed so far this rating is closest to what human intuition says about the similarity between Johnson and Jonson.

Still the LCSseq length has a drawback. Consider for example the names Frank Brussels and Brussels, Frank. Once again a human can see that these could very well designate the same person. However, the LCSseq of Frank Brussels and Brussels, Frank is Brussels, a length of only 8, giving a factor of 8/15. Based on this factor one would find the strings to be only about 53 percent equal. This example illustrates that the LCSseq is unsuited for data in which changes in the order of words (or tokens in general) are possible errors.

2.3 Edit distance

A more common similarity metric for strings is the edit distance. This metric is closely related to the length of the LCSseq. The edit distance expresses how many edit operations are needed to transform one string into another; the edit operations allowed depend on which variation of the algorithm is used.

The most basic operations are insertion and deletion of single characters. In the case where only insertion and deletion are allowed, the relation between the edit distance and the LCSseq length of two strings p and q is given by

$$ed(p, q) = |p| + |q| - 2 \cdot |LCSseq(p, q)|$$

Because the edit distance is so closely related to the length of the LCSseq, it suffers the same functional drawbacks: the edit distance is not suited to express similarity between strings when word order changes can occur. A common addition to the allowed edit operations is substitution, allowing a change of character to count as one operation instead of two (a deletion and an insertion). This is called the Levenshtein distance [9].

A variation on the edit distance is the block edit distance; this variation is suited for situations where word order is a possible error, because copies and moves of substrings are allowed. The block edit distance also has a drawback: it is sensitive to insertions of words in either one of the compared strings. Consider the company names Microsoft and Microsoft corporation. Most humans would recognize that both designate the same entity; the (block) edit distance will however give a very low score. One could argue the edit distance neglects or underrates the similar parts of the strings in this example, because in this case it would have been better to rate the similar parts of the two strings higher. Of course it is easily seen that it is not better in all cases to rate similar parts higher than dissimilar parts; for example Microsoft corporation and corporation are clearly not so similar. Whether or not similar parts should be rated higher can depend on application-specific requirements and on the semantics of the (dis)similar parts.

2.4 Cosine Similarity

A more complex approach is to conceptually map strings to vectors (the vector space model). The vectors can then be used to determine string similarity. When vectors are normalized to unit length, the only discriminator between them is their direction, and the relative difference in their directions is the angle they make. The measured angle is a number between 0 and 2π; a value of zero or 2π means the vectors are the same, a value of π means their directions are opposite. Using the cosine of the angle, a 1 is found when two vectors are the same and a -1 when two vectors have opposite directions⁴. This measure of similarity between vectors is known as the cosine similarity.

⁴ In case of TF-IDF (see section 2.4.2) values for the vector terms, only positive values are possible, thus the cosine similarity will be in the (0,1) range.

2.4.1 Dimensionality, or choice of token

When a string is more formally defined as a sequence of elements of some alphabet Σ = {σ_1, ..., σ_n}, a dimensionality of n could be used: each string then has some value in the σ_1 direction, the σ_2 direction, etcetera. Choosing the elements of the alphabet as dimensions will however not be a good choice (at least not for the Latin alphabet, which is relatively small), because even two random strings will have values in many of the same dimensions. The dimensionality should thus be higher; for example a dimension for each distinct triplet σ_i σ_j σ_k with σ_i, σ_j, σ_k ∈ Σ could be used, or more generally a dimension for each possible q-gram, but a dimension for each occurring word can also be used.
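As a small illustration of the q-gram option (an example added here, not part of the original argument): with q = 3 the name Johnson produces the tri-grams

Joh, ohn, hns, nso, son

so a string of length ℓ yields ℓ - 2 tri-grams. The tokenization query of chapter 4 additionally pads each string with two start (#) and two stop (%) symbols, yielding ℓ + 2 tri-grams, so that every character, including the first and the last, occurs in exactly three of them.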

2.4.2 Term weights

After the dimensions are chosen, a string can be mapped into the vector space and a value assigned to each term. To do this the string is tokenized; the choice of token should correspond to the dimensions. For the sake of argument, q-grams with q = 3 are chosen as tokens. For each tri-gram not occurring in the string a value of zero is set. Note that the resulting vector is very sparse; this sparseness can be exploited. For each token that does occur in a string some non-zero weight is set, for which the tuple frequency - inverse document frequency (TF-IDF) [1] value can be used. The TF-IDF weight for a token t is proportional to its frequency of occurrence in its document (which is what is called a tuple here), but inversely proportional to the number of documents (so tuples) it occurs in. The weights then only need to be normalized so that all vectors have unit length; this makes the distance calculation easier.

2.4.3 Cosine distance calculation

The cosine similarity between two normalized vectors is determined using the dot product:

$$\vec{u} \cdot \vec{v} = \sum_i u_i v_i$$

Observe that the only vector elements contributing to the value of the dot product are those that are non-zero in both u and v. As will be explained in the next chapter, the SQL based implementation of the similarity join can exploit this. Cosine similarity is not sensitive to word order, as any order is mostly discarded. Due to the use of TF-IDF weights, the cosine similarity metric is also less sensitive to insertions of words in a string. The argument here is that inserted words that should be rated lower are words that occur frequently in a relation, and these are indeed rated lower because of the low IDF values of their constituting tokens.
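For reference, the weighting pipeline just described can be written out as follows (notation added here for clarity; it matches Queries 4.5 through 4.8 in chapter 4, with N the number of tuples in the relation, df(i) the number of tuples token i occurs in, and tf the frequency of the token within a tuple):

$$\mathrm{idf}(i) = \log\frac{N}{\mathrm{df}(i)}, \qquad w_{t_j}(i) = \mathrm{tf}_{t_j}(i)\cdot\mathrm{idf}(i), \qquad u_{t_j}(i) = \frac{w_{t_j}(i)}{\sqrt{\sum_{k=1}^{T} w_{t_j}(k)^2}}$$

With the vectors normalized this way, the cosine similarity of two tuples reduces to the dot product $\sum_i u_{t_j}(i)\,v_{t_k}(i)$ over the tokens they share.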

CHAPTER 3

Similarity Join

Once a similarity metric has been chosen, it can be used as a predicate for a (theta) join. The following notation is borrowed from [7]: a join between two relations R_1 and R_2, where all resulting tuples have a similarity greater than φ on some attributes, is denoted by R_1 ⋈_φ R_2. From this point onward, whenever a tuple is mentioned, it refers only to the tuple's attributes used in the similarity join. To compute R_1 ⋈_φ R_2, the similarity of each tuple in R_1 with each tuple in R_2 is determined; if and only if the similarity is greater than φ are the tuples included in the result. Assuming no relevant index is possible on R_1 and R_2, the cost of this join is quadratic.

Given some choice of token (see 2.4.1), every tuple t = σ_1, ..., σ_n in R_1 and R_2 is represented as a sequence of tokens. For any two tuples t_{R_1} ∈ R_1 and t_{R_2} ∈ R_2 to be called related, at least one common token must exist; precisely, the set {(σ_i, σ_j) | σ_i ∈ t_{R_1}, σ_j ∈ t_{R_2}, σ_i = σ_j} must be non-empty. Two tuples that are not related are in turn called unrelated. Whether or not two tuples are related can depend on the choice of token. For example, when single characters are used as tokens there will probably not be many unrelated tuples; if however tri-grams are used, more tuples will be unrelated. In any computation of R_1 ⋈_φ R_2, unrelated tuples should not be compared at all, as unrelated tuples can never reach the similarity threshold φ. So rather than using the raw data of R_1 and R_2, a join can be done using the tokens of all tuples in R_1 and R_2. A join between tokens has the advantage that ordinary indices can be used on the token attribute, and it can thus be computed much more efficiently.

3.1 Cosine Similarity Join

When using the cosine distance as similarity metric, the threshold φ will be some number between 0 and 1, where zero means unrelated and one means identical. The inputs for the cosine similarity function are the weight vectors associated with the tuples being compared. The basic steps before computing a similarity join using the cosine similarity are:

1. Tokenization
2. Token weighting
   (a) Inverse document frequency calculation
   (b) Tuple frequency calculation
3. Term weight normalization

When the above steps are completed, each tuple in R_1 and R_2 has an associated weight vector. The vectors are used to calculate the similarity of the associated tuples. Just as in the general case, only related tuples should be considered; in terms of vectors this means only vectors sharing a non-zero value in at least one corresponding dimension should be compared.

3.1.1 High frequency tokens

Even when only considering related tuples, many comparisons could still be made that turn out not to reach the threshold φ, because some tokens are likely to occur very often. Such frequently occurring tokens greatly increase the number of related tuples, yet many tuples related through these tokens are otherwise unrelated. An often used approach to the problem of high frequency tokens is a stop list. This list contains the tokens that occur very frequently; the tokens on the stop list are subsequently eliminated from the equation. The downside of using a stop list is that tuples containing many stop tokens become difficult to compare. To illustrate this, suppose words are chosen as tokens; the string "to be or not to be" contains tokens that are likely all on the stop list, so this string could be discarded completely.

3.1.2 Sample scheme

As an alternative to a stop list, a probabilistic sampling scheme has been proposed in [7]. The probabilistic sampling scheme is a sequence of sampling and weighting steps that results in a subset of all tokens in R_1 and R_2; this subset can be used for an approximation to R_1 ⋈_φ R_2.

The first step in the scheme is to take a random sample of all tokens in R_1 and R_2. Let u_{t_j}(i) be the weight of term i in the vector associated with tuple t_j ∈ R_1. The probability of selecting the token associated with u_{t_j}(i) in the sample for R_1 is u_{t_j}(i)/sum(i), where

$$sum(i) = \sum_{j=1}^{|R_1|} u_{t_j}(i)$$

is the accumulated weight of the i-th terms of all associated vectors in the relation. As opposed to the stop list approach, in which all occurrences of the most frequent tokens are removed, the probabilistic sampling scheme conceptually removes a token occurrence only if its weight relative to the accumulated weight of all occurrences of that specific token is low.

A sample size S determines how many passes of the sampling step are made; on average there will be

$$I_{ij} = \left[\, S \cdot \frac{u_{t_j}(i)}{sum(i)} \,\right]$$

insertions of u_{t_j}(i) in the sample set (square brackets denoting rounding to the nearest integer). Using sum(i) and I_{ij} it is possible to approximate the cosine similarity of a tuple u_{t_j} in the sample set and a tuple v_{t_k} in the original R_2 relation:

$$\vec{u}_{t_j} \cdot \vec{v}_{t_k} \approx \sum_{i=1}^{T} v_{t_k}(i) \cdot sum(i) \cdot \frac{I_{ij}}{S} \qquad (3.1)$$

where T is the number of unique tokens in R_1 and R_2 (i.e. the number of dimensions in the vector space).

Substituting I_{ij} indeed gives the dot product:

$$\sum_{i=1}^{T} v_{t_k}(i) \cdot sum(i) \cdot \frac{I_{ij}}{S} = \sum_{i=1}^{T} v_{t_k}(i) \cdot sum(i) \cdot \frac{S \cdot \frac{u_{t_j}(i)}{sum(i)}}{S} = \sum_{i=1}^{T} v_{t_k}(i) \cdot u_{t_j}(i)$$

It follows from (3.1) that tokens for which the value of I_{ij} is zero can be discarded in the calculation of the approximation of u_{t_j} · v_{t_k}. An implementation of a similarity join with the implications of (3.1) in mind will join the tokens of one table with the samples of the other. This means the scheme needs to be executed twice, once using the sampled version of R_1, comparing with each tuple in R_2, and once using the sampled version of R_2, comparing with each tuple in R_1. The union of both is the end result and the approximation to R_1 ⋈_φ R_2. However, it is also possible to determine an approximation to the dot product using only the samples of R_1 and R_2, and sum(i):

$$\vec{u}_{t_j} \cdot \vec{v}_{t_k} \approx \sum_{i=1}^{T} \frac{sum_{R_1}(i) \cdot I^{R_1}_{ij} \cdot sum_{R_2}(i) \cdot I^{R_2}_{ik}}{S^2} \qquad (3.2)$$

The correctness of (3.2) can be shown in the same manner as that of (3.1). From (3.2) it follows that a token can be discarded in the calculation of the approximation if its insert count is zero in either sample. The practical implication is that an approximation to R_1 ⋈_φ R_2 can be computed using only the samples of R_1 and R_2.

3.1.3 Choice of Sample Size

The sample size S is very important for the quality of the approximation. As I_{ij} is an average, it comes closer to the true number of inserts as S becomes larger. But S does not merely control the precision of I_{ij}: when S becomes larger, the number of inserts in the sample set becomes larger. Because $\sum_{j=1}^{|R_1|} \frac{u_{t_j}(i)}{sum(i)} = 1$, there will on average be one insertion per token in T for each increment of S. For any given S, the number of discarded token occurrences relative to the size of the original relation R_i becomes larger as R_i grows. This can be seen by realizing that the number of insertions in the sample set is on average S · |T| and therefore largely independent of the size of R_i.

According to [7], reasonable values for S are between 32 and 128. This value was determined experimentally, and is thus suited to the specifics of the data used there. The bottom line is that the best value for S depends on the data at hand; in particular it depends on (the size of) R_i. The final decision for S should be based on a consideration of performance versus quality. Higher values make the cardinality of the sample set (a reduced form, where a counter is kept for each occurring token in R_i) approach the number of token occurrences in R_i, and therefore undermine the sampling's objective. Lower values cause too many token occurrences to be discarded, and reduce the quality of the approximation.
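As a small numeric illustration (the numbers are invented here for exposition, not taken from the thesis): suppose token i has weight u_{t_j}(i) = 0.05 in tuple t_j, accumulated weight sum(i) = 3.2 over the relation, and S = 128. Then

$$I_{ij} = \left[\, 128 \cdot \frac{0.05}{3.2} \,\right] = [2.0] = 2$$

so on average two copies of this token occurrence land in the sample, whereas an occurrence with weight 0.01 would give [0.4] = 0 and be discarded entirely.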

3.1.4 Increasing recall

The previous section explained how the sample size S controls the quality of the approximation. More specifically, S determines how complete the result will be¹: a higher value for S results in a more complete result. Still, for any given S the result remains an approximation, and the approximation in this case always underrates the similarity of any given two tuples. This means there is a probability that in the approximation some tuples will not reach the similarity threshold (unless S is so large that all token occurrences are sampled), while those tuples actually should be in the result.

The effect of the underestimation can be countered by a rather simple measure: decreasing the threshold. By decreasing the threshold the result becomes more complete, at the cost of being less precise. This decrease can be implemented by introducing a correction factor ǫ, applied to the threshold φ as follows: φ′ = (1 - ǫ)φ. For example, with φ = 0.8 and ǫ = 0.1 the effective threshold becomes φ′ = 0.72. The decision on the value of ǫ should be based on a trade-off between recall and precision.

As will be shown in the next chapter, the implications of both equation (3.1) and equation (3.2) can be used to devise an implementation of an approximation to R_1 ⋈_φ R_2 using SQL statements. It is to be expected that when using only the implications of (3.1), the result will turn out to be closer to R_1 ⋈_φ R_2, as (3.1) uses one approximated variable whereas (3.2) uses two; the correction factor ǫ should be used to minimize the effects. On the other hand, the solution using the implication of (3.2) can be expected to perform faster, as the relations participating in the join are smaller and the join is executed only once.

¹ By complete is meant the recall of the approximation with respect to the true similarity join: a measure of how many tuples of R_1 ⋈_φ R_2 are present in the approximation. Precision is a measure of how many tuples in the approximation of R_1 ⋈_φ R_2 are not in R_1 ⋈_φ R_2.

CHAPTER 4

SQL Solutions

It has already been shown that an approximate string join can be accomplished using vanilla SQL on an unmodified database system [6, 7, 8]. In this chapter some possible approaches are discussed and compared.

4.1 A naive approach

A naive SQL based approach is to compute the cross product of two relations. A cursor can then be opened on the cross product, calculating a similarity value for each element, thus effectively calculating the similarity of each possible combination of tuples in the original relations. A filter can then be applied to exclude pairs of tuples that are not similar enough. The cross product calculation and the filtering step can also be combined. Although this naive approach computes the exact answer, it is poorly suited for many real-life situations. Query 4.1 shows a query using a user defined function to compute the similarity between two strings; it can be altered to calculate similarity over multiple attributes by extending the WHERE clause.

4.2 Excluding trivial cases

A better approach is to not compare unrelated tuples (as defined in the previous chapter) at all. To be able to consider only related tuples, both relations must be tokenized. The query to tokenize a relation using tri-grams is given in Query 4.2. The cardinality of the relation holding the tokenized representation of R_i is |R_i| · (avg(|t_j|) + 2), where avg(|t_j|) is the average length of a tuple in R_i; 2 is added because of the start and stop symbols added to each tuple.

SELECT *
FROM R1, R2
WHERE similarity(R1.a, R2.a) > φ

Query 4.1: Quadratic similarity join

SELECT R1.tupleid,
       SUBSTRING(
         SUBSTRING('#...#', 1, 2) || UPPER(R1.a) || SUBSTRING('%...%', 1, 2),
         strlens.len, 3
       )
FROM R1, strlens
WHERE strlens.len <= LENGTH(R1.a) + 2;

Query 4.2: Tokenization query

SELECT R1tokens.tupleid AS r1id, R2tokens.tupleid AS r2id
FROM R1tokens, R2tokens
WHERE R1tokens.token = R2tokens.token
GROUP BY R1tokens.tupleid, R2tokens.tupleid

Query 4.3: Joining related tuples

Notice the usage of an auxiliary table called strlens; this table holds an enumeration of all possible string lengths for the attributes to be tokenized. Another choice is the UPPER function, which discards case as a discriminator in the similarity determination.

Assuming the tokenized representations of R_1 and R_2 are stored in the relations R1tokens and R2tokens respectively, a join can be computed using both token tables. Query 4.3 shows the SQL statement needed to find only those tuples that have at least one token in common. Query 4.3 could be extended to select only tuples that have two or more tokens in common by the use of a HAVING clause (a sketch is given below). The worst-case cardinality (in case all tuples in R_1 are related to all tuples in R_2) of a relation in which the results of Query 4.3 are stored is |R_1| · |R_2|. Depending on the choice of token (e.g. tri-grams), it is not realistic to assume the worst-case scenario; a poor choice of token could however result in a cardinality approaching that of the worst case.

Assuming the result of Query 4.3 has been stored in a relation R1R2related, Query 4.4 executes the user defined similarity function only for related tuples. The result of Query 4.4 is the same as that of Query 4.1.

SELECT R1.*, R2.*
FROM R1, R2, R1R2related
WHERE R1R2related.r1id = R1.id
  AND R1R2related.r2id = R2.id
  AND similarity(R1.a, R2.a) > φ

Query 4.4: Filtering on similarity condition
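The HAVING extension of Query 4.3 mentioned above could look as follows (a sketch added here, not a query from the original text):

SELECT R1tokens.tupleid AS r1id, R2tokens.tupleid AS r2id
FROM R1tokens, R2tokens
WHERE R1tokens.token = R2tokens.token
GROUP BY R1tokens.tupleid, R2tokens.tupleid
HAVING COUNT(*) >= 2

Requiring at least two common tokens prunes pairs related only through a single, possibly very frequent, token, at the risk of losing pairs of very short strings.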

SELECT T.tupleid, T.token, COUNT(*)
FROM Ritokens T
GROUP BY T.tupleid, T.token;

Query 4.5: Finding the token frequency

SELECT T.token, LOG(S.size) - LOG(COUNT(DISTINCT T.tupleid))
FROM Ritokens T, Risize S
GROUP BY T.token, S.size;

Query 4.6: Finding the idf

So far the choice of similarity metric has been independent of the SQL statements; the similarity calculation has been abstracted by means of a similarity function. Any metric that can be calculated using two simple input strings as its only input parameters can be used this way. Similarity metrics such as cosine similarity, which need parameters other than the strings themselves, cannot be calculated in this way. The remainder of this chapter considers a similarity join using cosine similarity.

4.3 Similarity Join using Cosine Similarity

The following steps must be completed before a similarity join using cosine similarity can be computed.

1. Tokenization
2. Token weighting
   (a) Tuple frequency calculation
   (b) Inverse document frequency calculation
3. Term weight normalization

Step 1 has already been covered in the previous section (Query 4.2); for the cosine similarity join implementation the same SQL statement can be used.

4.3.1 Token weighting

Query 4.5 shows the SQL statement to compute the tuple frequency of each token occurrence. The cardinality of a relation containing the tuple frequencies for tokens in R_i is equal to the cardinality of the R_i token relation.

SELECT T.tupleid, SQRT(SUM(I.idf*I.idf*T.tf*T.tf))
FROM RiTF T, RiIDF I
WHERE I.token=T.token
GROUP BY T.tupleid;

Query 4.7: Compute the length of each weight vector
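As a worked illustration of Query 4.6 above (the numbers are invented here for exposition): for a relation of N = 1000 tuples in which a token occurs in 10 distinct tuples, the query computes

$$\mathrm{idf} = \log(1000) - \log(10) = \log\frac{1000}{10} = \log(100)$$

Whatever base the system's LOG function uses, the difference of logarithms equals log(N/df), so rare tokens receive a high weight and ubiquitous tokens a weight near zero.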

SELECT T.tupleid, T.token, I.idf*T.tf/L.len
FROM RiTF T, RiIDF I, RiLength L
WHERE I.token=T.token AND T.tupleid=L.tupleid;

Query 4.8: Normalize the token weights according to vector lengths

SELECT r1w.tupleid, r2w.tupleid
FROM R1weights r1w, R2weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tupleid, r2w.tupleid
HAVING SUM(r1w.weight * r2w.weight) > φ;

Query 4.9: Join using the calculated weights

Query 4.6 shows the SQL statement to compute the inverse document frequency of each distinct token occurring in R_i. The cardinality of a relation containing the inverse document frequencies for tokens in R_i is equal to the number of unique tokens in R_i. As an alternative, the DISTINCT keyword in Query 4.6 (to compute the inverse document frequency) can also be removed; the rationale is that most tokens will occur only once in each tuple (as we are dealing with short tuples).

4.3.2 Normalization

The length of each vector is calculated using the formula for the Euclidean norm. The weight of each term is given by tf · idf, so the Euclidean norm is

$$\sqrt{\sum_{i=1}^{T} (\mathrm{idf}_i \cdot \mathrm{tf}_i)^2}$$

Query 4.7 computes this length. Query 4.8 uses the results of Queries 4.5, 4.6 and 4.7 to compute the final weight of each token. The weights tables can now be joined, as shown in Query 4.9. Note that Query 4.9 also demonstrates how the SQL implementation of the cosine join exploits the sparseness of the weight vectors. The result of Query 4.9 is the set of ids of all tuple pairs in R_1 and R_2 with cosine similarity greater than φ.

SELECT riw.tupleid, riw.token, riw.weight, ROUND(S * riw.weight / rs.total)
FROM Riweights riw, Risum rs
WHERE riw.token = rs.token
  AND ROUND(S * riw.weight / rs.total) > 0

Query 4.10: SQL implementation of the sample scheme

21 4.3. SIMILARITY JOIN USING COSINE SIMILARITY 21 SELECT tupleid1, tupleid2 FROM ( SELECT r1w.tupleid AS tupleid1, r2s.tupleid AS tupleid2 SUM(r1w.weight * r2s.count * r2sum.total) AS approx FROM R1weights r1w, R2sample r2s, R2sum r2sum WHERE r1w.token = r2s.token AND r1w.token = r2sum.token GROUP BY r1w.tupleid, r2s.tupleid UNION ALL SELECT r2w.tupleid AS tupleid2, r1s.tupleid AS tupleid1 SUM(r2w.weight * r1s.count * r2sum.total) AS approx FROM R2weights r2w, R1sample r1s, R1sum r1sum WHERE r2w.token = r1s.token AND r2w.token = r1sum.token GROUP BY r2w.tupleid, r1s.tupleid ) sim GROUP BY tupleid1, tupleid2 HAVING AVG(approx) S * φ Query 4.11: SQL implementation of (3.1) SELECT r1s.tupleid, r2s.tupleid FROM R1sample r1s, R2sample r2s, R1sum r1sum, R2sum r2s WHERE r1s.token = r1sum.token AND r2s.token = r2sum.token AND r1s.token = r2s.token GROUP BY r1s.tupleid, r2s.tupleid HAVING SUM(r1s.count * r1sum.total * r2s.count * r2sum.total) S * S * φ Query 4.12: SQL implementation of (3.2) Sampling the Tokens As was stipulated in the previous chapter, even when only considering related tuples in R1 and R2, still many tuples might not reach the similarity threshold φ. A probabilistic sampling scheme was introduced to reduce the number of unnecessary comparisons. In this section the sampling scheme will be shown to be implementable using only SQL statements. Query 4.1 shows the sampling scheme in SQL. The ROUND(S * riw.weight / rs.total) term represents the expected average number of insertion of a token occurrence in the sample set, where the number of inserts is stored for each sampled token. The size of the sample relation is bounded by the size of the token relation. Query 4.11 shows the SQL implementation of the approximation to R 1 φ R 2 making use of the implications of (3.1). A join is executed two times between a sample set of one relation and the unsampled tokens from the other relation. The union of both is thresholded by φ. Query 4.12 shows the SQL implementation of the approximation to R 1 φ R 2 using the im-

SELECT r1s.tupleid, r2s.tupleid
FROM R1sample r1s, R2sample r2s
WHERE r1s.token = r2s.token
GROUP BY r1s.tupleid, r2s.tupleid
HAVING SUM(r1s.weight * r2s.weight) >= φ

Query 4.13: Simplification of Query 4.12

Looking more closely at the HAVING clause of Query 4.12, the inequality is:

$$\sum_i \mathrm{r1s.count} \cdot \mathrm{r1sum.total} \cdot \mathrm{r2s.count} \cdot \mathrm{r2sum.total} \; \geq \; S^2 \cdot \phi$$

Since count was calculated as ROUND(S * riw.weight / rs.total) and all token occurrences are unique in the sample relation (see Query 4.10), the inequality can be simplified:

$$\sum_i \frac{\mathrm{r1s.count} \cdot \mathrm{r1sum.total} \cdot \mathrm{r2s.count} \cdot \mathrm{r2sum.total}}{S^2} \geq \phi$$

$$\sum_i \frac{S \cdot \frac{\mathrm{r1s.weight}}{\mathrm{r1sum.total}} \cdot \mathrm{r1sum.total} \cdot S \cdot \frac{\mathrm{r2s.weight}}{\mathrm{r2sum.total}} \cdot \mathrm{r2sum.total}}{S^2} \geq \phi$$

$$\sum_i \mathrm{r1s.weight} \cdot \mathrm{r2s.weight} \geq \phi$$

The sum relations can now be completely discarded from the query; Query 4.13 shows the simplified query. As a consequence of this simplification it becomes clear that no tuple pair in the join approximation can have an overestimated similarity. Therefore the precision of the approximation is always 1 (provided that ǫ = 0).
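To tie the pieces together, the sample relation and the per-token weight sums can be materialized with plain SQL before running Query 4.13. The sketch below is added here for illustration; the table and column names follow the queries above, S stands for the chosen sample size constant, and the exact CREATE TABLE ... AS syntax varies slightly per system (MonetDB, for instance, may require a trailing WITH DATA). R_2 is handled identically.

-- Accumulated weight sum(i) per token, as defined in chapter 3
CREATE TABLE R1sum AS
  SELECT token, SUM(weight) AS total
  FROM R1weights
  GROUP BY token;

-- Query 4.10: keep a token occurrence only if it would be
-- inserted into the sample at least once on average
CREATE TABLE R1sample AS
  SELECT riw.tupleid, riw.token, riw.weight,
         ROUND(S * riw.weight / rs.total) AS count
  FROM R1weights riw, R1sum rs
  WHERE riw.token = rs.token
    AND ROUND(S * riw.weight / rs.total) > 0;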

CHAPTER 5

Top-k extension

The previous chapter explained how a similarity join can be implemented using SQL statements. That similarity join answered the question: which tuples in R_1 and R_2 have a similarity greater than φ? Another interesting question is: what are the k best matches in R_j for each tuple in R_i? Both questions can also be combined, so that not only the top-k results are returned, but the similarity of the tuples must also reach the threshold φ. In this chapter various possible solutions to the top-k question are briefly explored.

5.1 Top-k as a post-processing step

The top-k question can be answered in a few different ways. The most preferable way would be to calculate the similarity of related tuples (more or less) in the order of their similarity and stop when k is reached, or when it is certain all tuples in the top-k are included. This ordered top-k needs a custom tailored algorithm; it is not possible to express it using standard SQL or generic set operations. Another possibility is to calculate R_1 ⋈_φ R_2 and make a top-k selection on the results for each tuple in R_1 and for each tuple in R_2. This will be called the post-processing approach and is what is discussed in this section.

SELECT outerresult.*
FROM r1r2 outerresult
WHERE outerresult.r2id IN (
    SELECT innerresult.r2id
    FROM r1r2 innerresult
    WHERE innerresult.r1id = outerresult.r1id
    ORDER BY innerresult.similarity DESC
    FETCH FIRST K ROWS ONLY
)

Query 5.1: Query to retrieve the top-k similarity results for each tuple in R1

SELECT ... FROM ... WHERE ...
STOP AFTER <value expression>
FOR EACH <stop-grouping attributes>
RANK BY <ranking specification>
ORDER BY ...

where:
<value expression> defines an expression, not associated with the rest of the query, resulting in an integer value.
<stop-grouping attributes> is a set of attribute names, defining a group.
<ranking specification> defines the ranking used within each group to determine which k tuples will be returned.

Figure 5.1: Syntax for the SQL addition

SELECT * FROM (
    SELECT * FROM r1r2
    STOP AFTER K
    FOR EACH r1r2.tupleid1
    RANK BY cosine_similarity
  UNION ALL
    SELECT * FROM r1r2
    STOP AFTER K
    FOR EACH r1r2.tupleid2
    RANK BY cosine_similarity
) topnsim

Query 5.2: Applying generalized top-k to the result of the standard SQL solution

5.1.1 Standard SQL solution

It is (theoretically) possible to answer the top-k question as a post-processing step by executing a standard SQL query with a correlated FETCH FIRST subquery¹. Query 5.1 shows how to retrieve a top-k for each tuple in R_1, using the already calculated similarity join result stored in the r1r2 relation. This query must be executed once to find all top-k matches for the tuples in R_1, and once to find all top-k matches for the tuples in R_2. The union of the results is the final answer to the top-k question.

5.1.2 Generalized top-k

In this section an addition to SQL is discussed with which the top-k selection can be expressed in a more natural way, without forcing the database system to execute a subquery numerous times. The addition is proposed in [5] and is called generalized top-k. The proposal adds to the language the possibility to define groups and a ranking within those groups.

¹ FETCH FIRST in subqueries was added in the SQL 2008 standard.

SELECT ... FROM ... WHERE ...
STOP AFTER <value expression>
FOR EACH <stop-grouping attributes>
GROUP BY <nested group attributes>
RANK BY <ranking specification>
ORDER BY ...

where:
<nested group attributes> is a set of attribute names defining a group within the FOR EACH group.
<ranking specification> in addition to the specification in Figure 5.1, the following extra constraints are put on the specification if a GROUP BY clause is specified: aggregate functions are allowed; no references to attributes not in <nested group attributes> are allowed.

Figure 5.2: Proposed syntax for the SQL addition

A specified number of returned tuples per group can be set, so that the top-k values in the ranking are returned. The syntax of the addition is shown in Figure 5.1. Query 5.2 shows how to filter out the k best matches. As can be seen, a generalized top-k query is executed twice, first grouping for each tuple of R_1, then for each tuple of R_2; the result is the best k matches for each tuple in R_1 and each tuple in R_2.

5.1.3 Generalized top-k extension

A limitation of the addition proposed in [5] is the inability to define groups within a FOR EACH group. With that ability it becomes possible to rank the groups within each FOR EACH group, returning only the top-k groups. This extra addition brings exactly the functionality needed for a best-matches query. Figure 5.2 shows a possible syntax for the extra addition needed to execute the top-k during the computation of R_1 ⋈ R_2. Query 5.3 shows the query that would then be possible. It should be noted that with a naive query execution plan the join on the sample sets of R_1 and R_2 might be executed twice; the extension to generalized top-k thus possibly only saves the disk I/O of writing the related tuples not in the top-k.

5.2 Other solutions

Fast algorithms exist that can do top-k TF-IDF lookups [4]. Using such fast algorithms in the context of a database system will probably mean the introduction of a join operator specific to the calculation of a cosine similarity join. It is not possible to create a User Defined Aggregate Function for this, because the result is not a single value but k values. A further study of these fast top-k algorithms and of how a specific join operator might look is beyond the scope of this project.

SELECT * FROM (
    SELECT * FROM r1sample r1s, r2sample r2s
    WHERE r1s.token = r2s.token
    STOP AFTER N
    FOR EACH r1s.tupleid
    GROUP BY r1s.tupleid, r2s.tupleid
    RANK BY SUM(r1s.weight * r2s.weight)
  UNION ALL
    SELECT * FROM r1sample r1s, r2sample r2s
    WHERE r1s.token = r2s.token
    STOP AFTER N
    FOR EACH r2s.tupleid
    GROUP BY r1s.tupleid, r2s.tupleid
    RANK BY SUM(r1s.weight * r2s.weight)
) topnsim

Query 5.3: Applying generalized top-k to the result of Query 4.13

CHAPTER 6

Experimental Results

In this chapter the SQL based approach of chapter 4 is benchmarked. The database system used for this benchmark is MonetDB, a database system developed at CWI. All tests are run on a high end server machine with an eight core CPU and 64 gigabytes of RAM. Three data sets are used in the experiments. The first is a list of (English) Wikipedia article titles [2], the second is a list of actors and actresses from the IMDB [3] (Internet Movie Database), the third is a list of authors from DBLP [1] (a database of computer science journals and proceedings). Both the DBLP and the IMDB data are joined with the Wikipedia dataset.

The different steps needed to complete (an approximation to) a similarity join are benchmarked separately. The data produced by each step is also examined where possible and interesting. All benchmarks are run for different sizes of R_1 and R_2. Random tuples from the original relation are chosen to create a relation of a specific size. This is done such that all tuples in a relation of a specific size are also present in every relation of a bigger size, which makes it somewhat easier to compare results between relations of different sizes.

6.1 Preprocessing and sampling

The preprocessing and sampling steps combined result in the sampled relation, which in turn is used for the symmetric and union join plans. The preprocessing steps consist of tokenization and TF-IDF weighting of the token occurrences. Figure 6.1 shows that both the preprocessing and the sampling behave linearly with respect to the number of tuples in R_i.

Figure 6.2 plots the size of the sample relations resulting from various values of S, for different sizes of R_1 and R_2. Note the striking similarity between Figures 6.2b and 6.2c, not only in shape (all three graphs show a similar shape) but also numerically: the two sample relations have about the same number of tuples. The explanation for this similarity is that both original relations contain names, for the most part probably English/European names; the data is thus quite similar. The Wikipedia set contains article titles, which is quite different from names; therefore the sample relation shown in Figure 6.2a differs more from 6.2b and 6.2c. More specifically, the cardinality of the sample relations in Figure 6.2a is larger.

[Figure 6.1: Preprocessing benchmark. Panels: (a) Tokenization, (b) TF-IDF weighting, (c) Sampling; execution time (msec) plotted against R_i cardinality.]

[Figure 6.2: Sample set sizes. Panels: (a) Wikipedia sample set size, (b) DBLP sample set size, (c) IMDB sample set size; sample relation cardinality plotted against sample size S, one curve per |R_i|.]

[Figure 6.3: Join execution times. Panels: (a) Wikipedia ⋈ DBLP and (b) Wikipedia ⋈ IMDB, join duration (msec) against sample size, one curve per |R_i|; (c) Wikipedia ⋈ DBLP and (d) Wikipedia ⋈ IMDB, join execution time (msec) against source table size, for S = 32, 64, 128, 256.]

[Figure 6.4: Size of the result sets. Panels: (a) Wikipedia ⋈ DBLP and (b) Wikipedia ⋈ IMDB, number of results against sample size, one curve per |R_i|; (c) Wikipedia ⋈ DBLP and (d) Wikipedia ⋈ IMDB, number of results against R_1 and R_2 size, for S = 32, 64, 128, 256.]

[Figure 6.5: Execution time and result set size for the union join for Wikipedia ⋈ DBLP. Panels: (a) execution time against sample size, (b) result set size against R_1 and R_2 size, for S = 32, 64, 128.]

[Figure 6.6: Comparison of the three join approaches (naive, normal S=128, union S=128). Panels: (a) execution time and (b) result set size against R_i cardinality.]

This does not (necessarily) mean more token occurrences have been sampled (remember from chapter 3 that the number of sample insertions is given by S · |T|), but it does mean more distinct token occurrences have been sampled.

6.2 Join

Three join plans have been benchmarked. The most important one is the join shown in Query 4.13 (called the symmetric join from here on). This plan is expected to be the most scalable, but also to produce the most incomplete result. The next plan is the query that joins the samples of R_1 with the tokens of R_2 and vice versa, using the union as the result. This plan is shown in Query 4.11 (called the union join) and is expected to scale worse than the symmetric join, but in turn will most likely produce a more complete result. The final plan is the join purely on tokens, using no form of sampling at all. This plan produces the best result, but scales the worst of all plans; it is shown in Query 4.9 (called the baseline). Note that for all queries the results are also fetched: all join queries are encapsulated in a query that fetches the actual corresponding records of the join result in R_1 and R_2 from disk.


More information

One of the most important areas where quantifier logic is used is formal specification of computer programs.

One of the most important areas where quantifier logic is used is formal specification of computer programs. Section 5.2 Formal specification of computer programs One of the most important areas where quantifier logic is used is formal specification of computer programs. Specification takes place on several levels

More information

Relational Algebra and SQL

Relational Algebra and SQL Relational Algebra and SQL Relational Algebra. This algebra is an important form of query language for the relational model. The operators of the relational algebra: divided into the following classes:

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

The SQL data-definition language (DDL) allows defining :

The SQL data-definition language (DDL) allows defining : Introduction to SQL Introduction to SQL Overview of the SQL Query Language Data Definition Basic Query Structure Additional Basic Operations Set Operations Null Values Aggregate Functions Nested Subqueries

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Relational Databases

Relational Databases Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 49 Plan of the course 1 Relational databases 2 Relational database design 3 Conceptual database design 4

More information

Database Systems SQL SL03

Database Systems SQL SL03 Checking... Informatik für Ökonomen II Fall 2010 Data Definition Language Database Systems SQL SL03 Table Expressions, Query Specifications, Query Expressions Subqueries, Duplicates, Null Values Modification

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Introduction to Clustering

Introduction to Clustering Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of

More information

Chapter 3: Introduction to SQL

Chapter 3: Introduction to SQL Chapter 3: Introduction to SQL Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 3: Introduction to SQL Overview of the SQL Query Language Data Definition Basic Query

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Disjunctive and Conjunctive Normal Forms in Fuzzy Logic

Disjunctive and Conjunctive Normal Forms in Fuzzy Logic Disjunctive and Conjunctive Normal Forms in Fuzzy Logic K. Maes, B. De Baets and J. Fodor 2 Department of Applied Mathematics, Biometrics and Process Control Ghent University, Coupure links 653, B-9 Gent,

More information

Chapter 2 - Graphical Summaries of Data

Chapter 2 - Graphical Summaries of Data Chapter 2 - Graphical Summaries of Data Data recorded in the sequence in which they are collected and before they are processed or ranked are called raw data. Raw data is often difficult to make sense

More information

Chapter 3: SQL. Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Chapter 3: SQL. Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 3: SQL Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 3: SQL Data Definition Basic Query Structure Set Operations Aggregate Functions Null Values Nested

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Relational Model: History

Relational Model: History Relational Model: History Objectives of Relational Model: 1. Promote high degree of data independence 2. Eliminate redundancy, consistency, etc. problems 3. Enable proliferation of non-procedural DML s

More information

Chapter 3: SQL. Chapter 3: SQL

Chapter 3: SQL. Chapter 3: SQL Chapter 3: SQL Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 3: SQL Data Definition Basic Query Structure Set Operations Aggregate Functions Null Values Nested

More information

Database Systems SQL SL03

Database Systems SQL SL03 Inf4Oec10, SL03 1/52 M. Böhlen, ifi@uzh Informatik für Ökonomen II Fall 2010 Database Systems SQL SL03 Data Definition Language Table Expressions, Query Specifications, Query Expressions Subqueries, Duplicates,

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

CS352 Lecture - Introduction to SQL

CS352 Lecture - Introduction to SQL CS352 Lecture - Introduction to SQL Objectives: last revised September 12, 2002 1. To introduce the SQL language 2. To introduce basic SQL DML operations (select, insert, update, delete, commit, rollback)

More information

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 3 Relational Model Hello everyone, we have been looking into

More information

Essay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM).

Essay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM). Question 1 Essay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM). By specifying participation conditions By specifying the degree of relationship

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

Relational Database: The Relational Data Model; Operations on Database Relations

Relational Database: The Relational Data Model; Operations on Database Relations Relational Database: The Relational Data Model; Operations on Database Relations Greg Plaxton Theory in Programming Practice, Spring 2005 Department of Computer Science University of Texas at Austin Overview

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Similarity Joins of Text with Incomplete Information Formats

Similarity Joins of Text with Incomplete Information Formats Similarity Joins of Text with Incomplete Information Formats Shaoxu Song and Lei Chen Department of Computer Science Hong Kong University of Science and Technology {sshaoxu,leichen}@cs.ust.hk Abstract.

More information

Exercise: Graphing and Least Squares Fitting in Quattro Pro

Exercise: Graphing and Least Squares Fitting in Quattro Pro Chapter 5 Exercise: Graphing and Least Squares Fitting in Quattro Pro 5.1 Purpose The purpose of this experiment is to become familiar with using Quattro Pro to produce graphs and analyze graphical data.

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

II B.Sc(IT) [ BATCH] IV SEMESTER CORE: RELATIONAL DATABASE MANAGEMENT SYSTEM - 412A Multiple Choice Questions.

II B.Sc(IT) [ BATCH] IV SEMESTER CORE: RELATIONAL DATABASE MANAGEMENT SYSTEM - 412A Multiple Choice Questions. Dr.G.R.Damodaran College of Science (Autonomous, affiliated to the Bharathiar University, recognized by the UGC)Re-accredited at the 'A' Grade Level by the NAAC and ISO 9001:2008 Certified CRISL rated

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML CS276B Text Retrieval and Mining Winter 2005 Plan for today Vector space approaches to XML retrieval Evaluating text-centric retrieval Lecture 15 Text-centric XML retrieval Documents marked up as XML E.g.,

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

CPS122 Lecture: From Python to Java

CPS122 Lecture: From Python to Java Objectives: CPS122 Lecture: From Python to Java last revised January 7, 2013 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

CSCI 403: Databases 13 - Functional Dependencies and Normalization

CSCI 403: Databases 13 - Functional Dependencies and Normalization CSCI 403: Databases 13 - Functional Dependencies and Normalization Introduction The point of this lecture material is to discuss some objective measures of the goodness of a database schema. The method

More information

Chapter 14: Query Optimization

Chapter 14: Query Optimization Chapter 14: Query Optimization Database System Concepts 5 th Ed. See www.db-book.com for conditions on re-use Chapter 14: Query Optimization Introduction Transformation of Relational Expressions Catalog

More information

RELATIONAL DATA MODEL: Relational Algebra

RELATIONAL DATA MODEL: Relational Algebra RELATIONAL DATA MODEL: Relational Algebra Outline 1. Relational Algebra 2. Relational Algebra Example Queries 1. Relational Algebra A basic set of relational model operations constitute the relational

More information

Set theory is a branch of mathematics that studies sets. Sets are a collection of objects.

Set theory is a branch of mathematics that studies sets. Sets are a collection of objects. Set Theory Set theory is a branch of mathematics that studies sets. Sets are a collection of objects. Often, all members of a set have similar properties, such as odd numbers less than 10 or students in

More information

Programming Lecture 3

Programming Lecture 3 Programming Lecture 3 Expressions (Chapter 3) Primitive types Aside: Context Free Grammars Constants, variables Identifiers Variable declarations Arithmetic expressions Operator precedence Assignment statements

More information

CPS122 Lecture: From Python to Java last revised January 4, Objectives:

CPS122 Lecture: From Python to Java last revised January 4, Objectives: Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views Modification of the Database Data Definition

Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views Modification of the Database Data Definition Chapter 4: SQL Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views Modification of the Database Data Definition Language 4.1 Schema Used in Examples

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VI Lecture 14, March 12, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part V Hash-based indexes (Cont d) and External Sorting Today s Session:

More information

SQL. Dean Williamson, Ph.D. Assistant Vice President Institutional Research, Effectiveness, Analysis & Accreditation Prairie View A&M University

SQL. Dean Williamson, Ph.D. Assistant Vice President Institutional Research, Effectiveness, Analysis & Accreditation Prairie View A&M University SQL Dean Williamson, Ph.D. Assistant Vice President Institutional Research, Effectiveness, Analysis & Accreditation Prairie View A&M University SQL 1965: Maron & Levien propose Relational Data File 1968:

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 124 Section #8 Hashing, Skip Lists 3/20/17 1 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look

More information

NULLs & Outer Joins. Objectives of the Lecture :

NULLs & Outer Joins. Objectives of the Lecture : Slide 1 NULLs & Outer Joins Objectives of the Lecture : To consider the use of NULLs in SQL. To consider Outer Join Operations, and their implementation in SQL. Slide 2 Missing Values : Possible Strategies

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research

More information

Lecture 3 SQL. Shuigeng Zhou. September 23, 2008 School of Computer Science Fudan University

Lecture 3 SQL. Shuigeng Zhou. September 23, 2008 School of Computer Science Fudan University Lecture 3 SQL Shuigeng Zhou September 23, 2008 School of Computer Science Fudan University Outline Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views

More information

Database System Concepts, 5th Ed.! Silberschatz, Korth and Sudarshan See for conditions on re-use "

Database System Concepts, 5th Ed.! Silberschatz, Korth and Sudarshan See   for conditions on re-use Database System Concepts, 5th Ed.! Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use " Data Definition! Basic Query Structure! Set Operations! Aggregate Functions! Null Values!

More information

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati

More information

Solutions to Homework 10

Solutions to Homework 10 CS/Math 240: Intro to Discrete Math 5/3/20 Instructor: Dieter van Melkebeek Solutions to Homework 0 Problem There were five different languages in Problem 4 of Homework 9. The Language D 0 Recall that

More information

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa ICS 624 Spring 2011 Overview of DB & IR Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 1 Example

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

Optimizing Testing Performance With Data Validation Option

Optimizing Testing Performance With Data Validation Option Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

BOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen

BOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen BOOLEAN MATRIX FACTORIZATIONS with applications in data mining Pauli Miettinen MATRIX FACTORIZATIONS BOOLEAN MATRIX FACTORIZATIONS o THE BOOLEAN MATRIX PRODUCT As normal matrix product, but with addition

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information