Bachelor Thesis: Approximate String Joins


Bachelor Informatica
Universiteit van Amsterdam

Bachelor Thesis: Approximate String Joins

Vasco Visser

August 17, 2009

Supervisor(s): Arjen de Vries (CWI)


Abstract

This thesis addresses the problem of joining two relations that contain the same entities but lack attributes suitable to serve as conditions in an equi-join. Because of the lack of such attributes, the only way to match the relations is to compute some measure of similarity that hopefully correlates with the underlying semantics. A naive solution computes the cross product of the two relations and then calculates the similarity for each pair of tuples in the Cartesian product. Computing the cross product has quadratic complexity and is generally considered unscalable. This thesis explores alternatives with lower complexity. The focus lies on a particular solution using TF-IDF weights and cosine similarity as the measure of similarity between strings. A probabilistic sampling scheme is examined that produces a sample of the original relation(s) participating in the join, which can be used to approximate the true cosine similarity join. The sampling scheme can be implemented on any database system that supports SQL. Experiments examine the correctness and scalability of this solution. The effectiveness of the sampling scheme is also examined experimentally, both in terms of performance and in terms of precision and recall with respect to the true cosine similarity join. The empirical data shows that the join and the sampling scheme both work correctly. The scalability of the similarity join without first sampling the data is poor: the running time grows quadratically with the input size. The scalability of the similarity join using the samples generated by the sampling scheme is better; the running time is a function of the size of the sample set and is therefore controlled by a user-defined variable.


Contents

1 Introduction
2 Similarity
   2.1 Bit difference
   2.2 Longest Common Substring and Sequence
   2.3 Edit distance
   2.4 Cosine Similarity
       2.4.1 Dimensionality, or choice of token
       2.4.2 Term weights
       2.4.3 Cosine distance calculation
3 Similarity Join
   3.1 Cosine Similarity Join
       3.1.1 High frequency tokens
       3.1.2 Sample scheme
       3.1.3 Choice of Sample Size
       3.1.4 Increasing recall
4 SQL Solutions
   4.1 A naive approach
   4.2 Excluding trivial cases
   4.3 Similarity Join using Cosine Similarity
       4.3.1 Token weighting
       4.3.2 Normalization
       4.3.3 Sampling the Tokens
5 Top-k extension
   5.1 Top-k as a post-processing step
       5.1.1 Standard SQL solution
       5.1.2 Generalized top-k
       5.1.3 Generalized top-k extension
   5.2 Other solutions
6 Experimental Results
   6.1 Preprocessing and sampling
   6.2 Join
   6.3 Effect of ǫ
7 Conclusion

CHAPTER 1

Introduction

Many databases contain overlapping data but lack shared attributes to identify semantically equal entities between them. As an example, consider a database containing information about actors, like the Internet Movie Database, and a generic encyclopaedia, like Wikipedia. One might wonder which actors have an article about their persona in the generic encyclopaedia. To answer this question a join has to be computed between the two databases. However, conventional (equi-) joins combine tuples based on an equality predicate; the alternative of a theta-join matches tuples on an inequality over a deterministic ordering, which given two values will always produce the same order (usually some numerical quantity). No attributes can be expected to exist in two unrelated databases for which equality or an ordering identifies all or most tuples (possibly) belonging to the same entity.

The example above calls for a join with approximate predicates. The result of such a join is not a set of tuples equal to one another on some attributes, but a set of tuples that are approximately equal to one another. To determine similarity, tuples should be compared using some function whose result is a quantity expressing some form of similarity. It is not trivial to compute such an approximate or similarity join with lower than quadratic cost, since in principle all tuples from the relations being joined have to be compared to determine their similarity.

As another example, consider price comparison websites. Price comparison websites have bots crawling online warehouses, collecting data from many different sources, each source likely representing products in a different but rather similar way. Possibly some universal product identification number exists, but not all warehouses will show this number on a product web page. For those products that cannot be associated with an identification number, an approximate join with the comparison website's own product database can be computed in order to add the unidentified products as instances of known products.

An approximate join generally plays a part in a data-integration project, by which is meant the undertaking of combining data from different sources to form a new unified data source. An approximate join is generally only a part of this, since the result set of such a join will likely contain false positives (see chapter 2). The result set of an approximate join is a set of tuple pairs (t_i, t_j) where the tuples originate from different databases and the tuples in each pair possibly belong to the same entity.

An approximate join thus provides data for another procedure that determines whether two tuples indeed belong to the same entity.

This project investigates how an approximate string join can be accomplished in existing, unaltered database systems using plain SQL. In addition, the variables involved in an approximate join are explained, and their effect on the result and on performance is discussed. Performance measurements are made to determine the real-world performance of the investigated solutions and the effects of the variables involved. Finally, a note is made on the possibility of a join that finds the top k best matches for each tuple in the original relation.

As a side note it is worth mentioning that this project is about joining databases using relatively small strings: the database attributes on which the join conditions are set are assumed to fit in a regular varchar field, and thus are no more than 255 characters long. The performance (both quality- and time-wise) of the investigated solutions for larger strings like paragraphs or articles is not considered. The investigated solutions might however have an application for such larger strings, for example in the detection of plagiarism.

CHAPTER 2

Similarity

As mentioned earlier, a join with approximate predicates is needed. The predicate should be a value of some metric that says something about how similar two tuples are. Similarity is inversely proportional to the difference between strings, and the difference can be expressed as the sum of the errors in one string whose correction would make both strings equal. The notion of what constitutes an error differs from one metric to another, making certain metrics better suited for situations in which particular errors¹ are likely to occur.

Now the question arises: what similarity metric should be used? Before answering, it is worthwhile to consider that an approximate join is likely to play a part in a data-integration project, so it helps to know the requirements on the result set of an approximate join in that role. Besides trivial requirements of correctness, two requirements can be documented: 1) the result should be precise, meaning there must be no tuple pair in the join result that holds tuples not belonging to the same entity² (false positives); and 2) the result should be complete, meaning no pairs of tuples should exist that belong to the same entity but are not in the join result (false negatives).

In practice a tradeoff between the two requirements has to be made, because both precision and completeness are affected by the single threshold set for a given similarity metric: a more precise result is the consequence of a tighter threshold, a more complete result of a looser one. For an approximate join in its role as part of a data-integration project, the second requirement is far more important than the first, as false positives can be excluded after the join has been executed; this is clearly not possible for false negatives. Altogether this means that a good metric makes the join result both complete and precise. It seems reasonable to expect that the better a metric fits the data at hand³, the higher the threshold can be while still giving reasonably complete results, and that a lower threshold produces a less precise result. Therefore it seems reasonable to favour a metric that is suited for as many as possible of the types of errors that (can) occur.

¹ For example spelling or typing errors, or errors resulting from convention differences, e.g. <firstname> <lastname> vs. <lastname>, <firstname>.
² It is assumed that any entity recognition, i.e. determining if two strings designate the same entity, will conform to human perception.
³ That is, there is a high correspondence between the types of errors that occur and the types of errors the metric is suited for.
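The terms precise and complete correspond to the standard information retrieval measures of precision and recall; stated as formulas (added here for reference, not part of the original text):

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN count the true positives, false positives and false negatives of the join result with respect to the set of tuple pairs that truly designate the same entity.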

In the remainder of this chapter some similarity metrics are discussed, with examples illustrating that certain metrics are suited only for specific errors.

2.1 Bit difference

One can state that if the exclusive-or (XOR) of the binary representations of two character sequences shows relatively few ones, the strings are highly similar. The XOR of two bit strings is called the bit difference. It is indeed true that when two strings have a low ones count in their bit difference, the strings are highly similar, as a one in the bit difference can at most indicate a difference of one character in the string. The inverse is not necessarily true; when the bit difference of two strings has many ones, the strings can still, at least in human perception, be very much the same. As an example consider the two names Johnson and Jonson: XORing the bitwise representations of these strings results in many ones, because the missing h shifts every subsequent character. However, a human would look at these names and consider them highly similar, probably suspecting a spelling error. Therefore the bit difference is not suited where missing characters are possible errors.

2.2 Longest Common Substring and Sequence

A different approach is to look at the length of the longest common substring (LCSstr) of two strings. For any two strings we can determine the longest string they both have in common; using the example from the previous paragraph, Johnson and Jonson, the LCSstr is nson. We can further relate the length of the LCSstr to the length of the longer of the two compared strings, in this case giving 4/7. Based on this factor one would find the strings to be about 57 percent equal. Still this does not conform to our human intuition of the two strings being highly similar. This approach suffers the same drawback as the bit-difference approach: a high rating does indicate high similarity, but a low rating does not necessarily mean two strings are not similar by human standards.

As an alternative to LCSstr, the length of the longest common subsequence (LCSseq) can be used. The LCSseq looks at subsequences, which unlike substrings need not be consecutive parts of a string. This makes the LCSseq less sensitive to spelling errors; a single spelling error in one of two otherwise identical strings alters the length of the LCSseq by exactly one. Using the example of the preceding paragraph, Johnson and Jonson, the LCSseq is Jonson, yielding a similarity factor of 6/7. Based on this factor one would find the two strings to be about 86 percent equal. Of all metrics discussed so far this rating is closest to what human intuition says about the similarity between Johnson and Jonson.

Still the LCSseq length has a drawback. Consider for example the names Frank Brussels and Brussels, Frank. Once again a human can see that these could very well designate the same person. However, the LCSseq of Frank Brussels and Brussels, Frank is Brussels, a length of only 8, giving a factor of 8/15. Based on this factor one would find the strings to be only about 53 percent equal. This example illustrates that the LCSseq is unsuited for data in which changes in the order of words (or tokens in general) are possible errors.

2.3 Edit distance

A more common similarity metric for strings is the edit distance. This metric is closely related to the length of the LCSseq. The edit distance expresses how many edit operations are needed to transform one string into another; the edit operations allowed depend on which variation of the algorithm is used.

The most basic operations are insertion and deletion of single characters. In the case where only insertion and deletion are allowed, the relation between the edit distance and the LCSseq length of two strings p and q is given by

$$ed(p, q) = |p| + |q| - 2 \cdot |LCSseq(p, q)|$$

Because the edit distance is so closely related to the length of the LCSseq, it suffers the same functional drawbacks: the edit distance is not suited to express similarity between strings when word order changes can occur. A common addition to the allowed edit operations is substitution, allowing a change of character to count as one operation instead of two (a deletion and an insertion). This is called the Levenshtein distance [9].

A variation on the edit distance is the block edit distance; this variation is suited for situations where word order is a possible error, because copies and moves of substrings are allowed. The block edit distance also has a drawback: it is sensitive to insertions of words in either one of the compared strings. Consider the company names Microsoft and Microsoft corporation. Most humans would recognize that both designate the same entity; the (block) edit distance will however give a very low score. One could argue the edit distance neglects or underrates the similar parts of the strings in this example, because in this case it would have been better to rate the similar parts of the two strings higher. Of course it is easily seen that it is not better in all cases to rate similar parts higher than dissimilar parts; for example Microsoft corporation and corporation are clearly not so similar. Whether or not similar parts should be rated higher can depend on application-specific requirements and on the semantics of the (dis)similar parts.

2.4 Cosine Similarity

A more complex approach is to conceptually map strings to vectors (the vector space model). The vectors can then be used to determine string similarity. When vectors are normalized to unit length, the only discriminator between them is their direction, and the relative difference in their directions is the angle they make. The measured angle is a number between 0 and 2π; a value of zero or 2π means the vectors are the same, a value of π means their directions are opposite. Using the cosine of the angle, a 1 is found when two vectors are the same and a -1 when two vectors have opposite directions⁴. This measure of similarity between vectors is known as the cosine similarity.

⁴ In case of TF-IDF (see section 2.4.2) values for the vector terms, only positive values are possible, thus the cosine similarity will be in the (0,1) range.

2.4.1 Dimensionality, or choice of token

When a string is more formally defined as a sequence of elements of some alphabet Σ = {σ_1, ..., σ_n}, a dimensionality of n could be used: each string then has some value in the σ_1 direction, the σ_2 direction, etcetera. Choosing the elements of the alphabet as dimensions will however not be a good choice (at least not for the Latin alphabet, which is relatively small), because even two random strings will have values in many of the same dimensions. The dimensionality should thus be higher; for example a dimension for each distinct triplet σ_i σ_j σ_k with σ_i, σ_j, σ_k ∈ Σ could be used, or more generally a dimension for each possible q-gram, but a dimension for each occurring word can also be used.
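As a small illustration of the q-gram option (an example added here, not part of the original argument): with q = 3 the name Johnson produces the tri-grams

Joh, ohn, hns, nso, son

so a string of length ℓ yields ℓ - 2 tri-grams. The tokenization query of chapter 4 additionally pads each string with two start (#) and two stop (%) symbols, yielding ℓ + 2 tri-grams, so that every character, including the first and the last, occurs in exactly three of them.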

2.4.2 Term weights

After the dimensions are chosen, a string can be mapped into the vector space and a value assigned to each term. To do this the string is tokenized; the choice of token should correspond to the dimensions. For the sake of argument, q-grams with q = 3 are chosen as tokens. For each tri-gram not occurring in the string a value of zero is set. Note that the resulting vector is very sparse; this sparseness can be exploited. For each token that does occur in a string some non-zero weight is set, for which the tuple frequency - inverse document frequency (TF-IDF) [1] value can be used. The TF-IDF weight for a token t is proportional to its frequency of occurrence in its document (which is what is called a tuple here), but inversely proportional to the number of documents (so tuples) it occurs in. The weights then only need to be normalized so that all vectors have unit length; this makes the distance calculation easier.

2.4.3 Cosine distance calculation

The cosine similarity between two normalized vectors is determined using the dot product:

$$\vec{u} \cdot \vec{v} = \sum_i u_i v_i$$

Observe that the only vector elements contributing to the value of the dot product are those that are non-zero in both u and v. As will be explained in the next chapter, the SQL based implementation of the similarity join can exploit this. Cosine similarity is not sensitive to word order, as any order is mostly discarded. Due to the use of TF-IDF weights, the cosine similarity metric is also less sensitive to insertions of words in a string. The argument here is that inserted words that should be rated lower are words that occur frequently in a relation, and these are indeed rated lower because of the low IDF values of their constituting tokens.
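For reference, the weighting pipeline just described can be written out as follows (notation added here for clarity; it matches Queries 4.5 through 4.8 in chapter 4, with N the number of tuples in the relation, df(i) the number of tuples token i occurs in, and tf the frequency of the token within a tuple):

$$\mathrm{idf}(i) = \log\frac{N}{\mathrm{df}(i)}, \qquad w_{t_j}(i) = \mathrm{tf}_{t_j}(i)\cdot\mathrm{idf}(i), \qquad u_{t_j}(i) = \frac{w_{t_j}(i)}{\sqrt{\sum_{k=1}^{T} w_{t_j}(k)^2}}$$

With the vectors normalized this way, the cosine similarity of two tuples reduces to the dot product $\sum_i u_{t_j}(i)\,v_{t_k}(i)$ over the tokens they share.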

CHAPTER 3

Similarity Join

Once a similarity metric has been chosen, it can be used as a predicate for a (theta) join. The following notation is borrowed from [7]: a join between two relations R_1 and R_2, where all resulting tuples have a similarity greater than φ on some attributes, is denoted by R_1 ⋈_φ R_2. From this point onward, whenever a tuple is mentioned, it refers only to the tuple's attributes used in the similarity join. To compute R_1 ⋈_φ R_2, the similarity of each tuple in R_1 with each tuple in R_2 is determined; if and only if the similarity is greater than φ are the tuples included in the result. Assuming no relevant index is possible on R_1 and R_2, the cost of this join is quadratic.

Given some choice of token (see 2.4.1), every tuple t = σ_1, ..., σ_n in R_1 and R_2 is represented as a sequence of tokens. For any two tuples t_{R_1} ∈ R_1 and t_{R_2} ∈ R_2 to be called related, at least one common token must exist; precisely, the set {(σ_i, σ_j) | σ_i ∈ t_{R_1}, σ_j ∈ t_{R_2}, σ_i = σ_j} must be non-empty. Two tuples that are not related are in turn called unrelated. Whether or not two tuples are related can depend on the choice of token. For example, when single characters are used as tokens there will probably not be many unrelated tuples; if however tri-grams are used, more tuples will be unrelated. In any computation of R_1 ⋈_φ R_2, unrelated tuples should not be compared at all, as unrelated tuples can never reach the similarity threshold φ. So rather than using the raw data of R_1 and R_2, a join can be done using the tokens of all tuples in R_1 and R_2. A join between tokens has the advantage that ordinary indices can be used on the token attribute, and it can thus be computed much more efficiently.

3.1 Cosine Similarity Join

When using the cosine distance as similarity metric, the threshold φ will be some number between 0 and 1, where zero means unrelated and one means identical. The inputs for the cosine similarity function are the weight vectors associated with the tuples being compared. The basic steps before computing a similarity join using the cosine similarity are:

1. Tokenization
2. Token weighting
   (a) Inverse document frequency calculation
   (b) Tuple frequency calculation
3. Term weight normalization

When the above steps are completed, each tuple in R_1 and R_2 has an associated weight vector. The vectors are used to calculate the similarity of the associated tuples. Just as in the general case, only related tuples should be considered; in terms of vectors this means only vectors sharing a non-zero value in at least one corresponding dimension should be compared.

3.1.1 High frequency tokens

Even when only considering related tuples, many comparisons could still be made that turn out not to reach the threshold φ, because some tokens are likely to occur very often. Such frequently occurring tokens greatly increase the number of related tuples, yet many tuples related through these tokens are otherwise unrelated. An often used approach to the problem of high frequency tokens is a stop list. This list contains the tokens that occur very frequently; the tokens on the stop list are subsequently eliminated from the equation. The downside of using a stop list is that tuples containing many stop tokens become difficult to compare. To illustrate this, suppose words are chosen as tokens; the string "to be or not to be" contains tokens that are likely all on the stop list, so this string could be discarded completely.

3.1.2 Sample scheme

As an alternative to a stop list, a probabilistic sampling scheme has been proposed in [7]. The probabilistic sampling scheme is a sequence of sampling and weighting steps that results in a subset of all tokens in R_1 and R_2; this subset can be used for an approximation to R_1 ⋈_φ R_2.

The first step in the scheme is to take a random sample of all tokens in R_1 and R_2. Let u_{t_j}(i) be the weight of term i in the vector associated with tuple t_j ∈ R_1. The probability of selecting the token associated with u_{t_j}(i) in the sample for R_1 is u_{t_j}(i)/sum(i), where

$$sum(i) = \sum_{j=1}^{|R_1|} u_{t_j}(i)$$

is the accumulated weight of the i-th terms of all associated vectors in the relation. As opposed to the stop list approach, in which all occurrences of the most frequent tokens are removed, the probabilistic sampling scheme conceptually removes a token occurrence only if its weight relative to the accumulated weight of all occurrences of that specific token is low.

A sample size S determines how many passes of the sampling step are made; on average there will be

$$I_{ij} = \left[\, S \cdot \frac{u_{t_j}(i)}{sum(i)} \,\right]$$

insertions of u_{t_j}(i) in the sample set (square brackets denoting rounding to the nearest integer). Using sum(i) and I_{ij} it is possible to approximate the cosine similarity of a tuple u_{t_j} in the sample set and a tuple v_{t_k} in the original R_2 relation:

$$\vec{u}_{t_j} \cdot \vec{v}_{t_k} \approx \sum_{i=1}^{T} v_{t_k}(i) \cdot sum(i) \cdot \frac{I_{ij}}{S} \qquad (3.1)$$

where T is the number of unique tokens in R_1 and R_2 (i.e. the number of dimensions in the vector space).

Substituting I_{ij} indeed gives the dot product:

$$\sum_{i=1}^{T} v_{t_k}(i) \cdot sum(i) \cdot \frac{I_{ij}}{S} = \sum_{i=1}^{T} v_{t_k}(i) \cdot sum(i) \cdot \frac{S \cdot \frac{u_{t_j}(i)}{sum(i)}}{S} = \sum_{i=1}^{T} v_{t_k}(i) \cdot u_{t_j}(i)$$

It follows from (3.1) that tokens for which the value of I_{ij} is zero can be discarded in the calculation of the approximation of u_{t_j} · v_{t_k}. An implementation of a similarity join with the implications of (3.1) in mind will join the tokens of one table with the samples of the other. This means the scheme needs to be executed twice, once using the sampled version of R_1, comparing with each tuple in R_2, and once using the sampled version of R_2, comparing with each tuple in R_1. The union of both is the end result and the approximation to R_1 ⋈_φ R_2. However, it is also possible to determine an approximation to the dot product using only the samples of R_1 and R_2, and sum(i):

$$\vec{u}_{t_j} \cdot \vec{v}_{t_k} \approx \sum_{i=1}^{T} \frac{sum_{R_1}(i) \cdot I^{R_1}_{ij} \cdot sum_{R_2}(i) \cdot I^{R_2}_{ik}}{S^2} \qquad (3.2)$$

The correctness of (3.2) can be shown in the same manner as that of (3.1). From (3.2) it follows that a token can be discarded in the calculation of the approximation if its insert count is zero in either sample. The practical implication is that an approximation to R_1 ⋈_φ R_2 can be computed using only the samples of R_1 and R_2.

3.1.3 Choice of Sample Size

The sample size S is very important for the quality of the approximation. As I_{ij} is an average, it comes closer to the true number of inserts as S becomes larger. But S does not merely control the precision of I_{ij}: when S becomes larger, the number of inserts in the sample set becomes larger. Because $\sum_{j=1}^{|R_1|} \frac{u_{t_j}(i)}{sum(i)} = 1$, there will on average be one insertion per token in T for each increment of S. For any given S, the number of discarded token occurrences relative to the size of the original relation R_i becomes larger as R_i grows. This can be seen by realizing that the number of insertions in the sample set is on average S · |T| and therefore largely independent of the size of R_i.

According to [7], reasonable values for S are between 32 and 128. This value was determined experimentally, and is thus suited to the specifics of the data used there. The bottom line is that the best value for S depends on the data at hand; in particular it depends on (the size of) R_i. The final decision for S should be based on a consideration of performance versus quality. Higher values make the cardinality of the sample set (a reduced form, where a counter is kept for each occurring token in R_i) approach the number of token occurrences in R_i, and therefore undermine the sampling's objective. Lower values cause too many token occurrences to be discarded, and reduce the quality of the approximation.
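As a small numeric illustration (the numbers are invented here for exposition, not taken from the thesis): suppose token i has weight u_{t_j}(i) = 0.05 in tuple t_j, accumulated weight sum(i) = 3.2 over the relation, and S = 128. Then

$$I_{ij} = \left[\, 128 \cdot \frac{0.05}{3.2} \,\right] = [2.0] = 2$$

so on average two copies of this token occurrence land in the sample, whereas an occurrence with weight 0.01 would give [0.4] = 0 and be discarded entirely.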

3.1.4 Increasing recall

The previous section explained how the sample size S controls the quality of the approximation. More specifically, S determines how complete the result will be¹: a higher value for S results in a more complete result. Still, for any given S the result remains an approximation, and the approximation in this case always underrates the similarity of any given two tuples. This means there is a probability that in the approximation some tuples will not reach the similarity threshold (unless S is so large that all token occurrences are sampled), while those tuples actually should be in the result.

The effect of the underestimation can be countered by a rather simple measure: decreasing the threshold. By decreasing the threshold the result becomes more complete, at the cost of being less precise. This decrease can be implemented by introducing a correction factor ǫ, applied to the threshold φ as follows: φ′ = (1 - ǫ)φ. For example, with φ = 0.8 and ǫ = 0.1 the effective threshold becomes φ′ = 0.72. The decision on the value of ǫ should be based on a trade-off between recall and precision.

As will be shown in the next chapter, the implications of both equation (3.1) and equation (3.2) can be used to devise an implementation of an approximation to R_1 ⋈_φ R_2 using SQL statements. It is to be expected that when using only the implications of (3.1), the result will turn out to be closer to R_1 ⋈_φ R_2, as (3.1) uses one approximated variable whereas (3.2) uses two; the correction factor ǫ should be used to minimize the effects. On the other hand, the solution using the implication of (3.2) can be expected to perform faster, as the relations participating in the join are smaller and the join is executed only once.

¹ By complete is meant the recall of the approximation with respect to the true similarity join: a measure of how many tuples of R_1 ⋈_φ R_2 are present in the approximation. Precision is a measure of how many tuples in the approximation of R_1 ⋈_φ R_2 are not in R_1 ⋈_φ R_2.

CHAPTER 4

SQL Solutions

It has already been shown that an approximate string join can be accomplished using vanilla SQL on an unmodified database system [6, 7, 8]. In this chapter some possible approaches are discussed and compared.

4.1 A naive approach

A naive SQL based approach is to compute the cross product of two relations. A cursor can then be opened on the cross product, calculating a similarity value for each element, thus effectively calculating the similarity of each possible combination of tuples in the original relations. A filter can then be applied to exclude pairs of tuples that are not similar enough. The cross product calculation and the filtering step can also be combined. Although this naive approach computes the exact answer, it is poorly suited for many real-life situations. Query 4.1 shows a query using a user defined function to compute the similarity between two strings; it can be altered to calculate similarity over multiple attributes by extending the WHERE clause.

4.2 Excluding trivial cases

A better approach is to not compare unrelated tuples (as defined in the previous chapter) at all. To be able to consider only related tuples, both relations must be tokenized. The query to tokenize a relation using tri-grams is given in Query 4.2. The cardinality of the relation holding the tokenized representation of R_i is |R_i| · (avg(|t_j|) + 2), where avg(|t_j|) is the average length of a tuple in R_i; 2 is added because of the start and stop symbols added to each tuple.

SELECT *
FROM R1, R2
WHERE similarity(R1.a, R2.a) > φ

Query 4.1: Quadratic similarity join

SELECT R1.tupleid,
       SUBSTRING(
         SUBSTRING('#...#', 1, 2) || UPPER(R1.a) || SUBSTRING('%...%', 1, 2),
         strlens.len, 3
       )
FROM R1, strlens
WHERE strlens.len <= LENGTH(R1.a) + 2;

Query 4.2: Tokenization query

SELECT R1tokens.tupleid AS r1id, R2tokens.tupleid AS r2id
FROM R1tokens, R2tokens
WHERE R1tokens.token = R2tokens.token
GROUP BY R1tokens.tupleid, R2tokens.tupleid

Query 4.3: Joining related tuples

Notice the usage of an auxiliary table called strlens; this table holds an enumeration of all possible string lengths for the attributes to be tokenized. Another choice is the UPPER function, which discards case as a discriminator in the similarity determination.

Assuming the tokenized representations of R_1 and R_2 are stored in the relations R1tokens and R2tokens respectively, a join can be computed using both token tables. Query 4.3 shows the SQL statement needed to find only those tuples that have at least one token in common. Query 4.3 could be extended to select only tuples that have two or more tokens in common by the use of a HAVING clause (a sketch is given below). The worst-case cardinality (in case all tuples in R_1 are related to all tuples in R_2) of a relation in which the results of Query 4.3 are stored is |R_1| · |R_2|. Depending on the choice of token (e.g. tri-grams), it is not realistic to assume the worst-case scenario; a poor choice of token could however result in a cardinality approaching that of the worst case.

Assuming the result of Query 4.3 has been stored in a relation R1R2related, Query 4.4 executes the user defined similarity function only for related tuples. The result of Query 4.4 is the same as that of Query 4.1.

SELECT R1.*, R2.*
FROM R1, R2, R1R2related
WHERE R1R2related.r1id = R1.id
  AND R1R2related.r2id = R2.id
  AND similarity(R1.a, R2.a) > φ

Query 4.4: Filtering on similarity condition
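The HAVING extension of Query 4.3 mentioned above could look as follows (a sketch added here, not a query from the original text):

SELECT R1tokens.tupleid AS r1id, R2tokens.tupleid AS r2id
FROM R1tokens, R2tokens
WHERE R1tokens.token = R2tokens.token
GROUP BY R1tokens.tupleid, R2tokens.tupleid
HAVING COUNT(*) >= 2

Requiring at least two common tokens prunes pairs related only through a single, possibly very frequent, token, at the risk of losing pairs of very short strings.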

SELECT T.tupleid, T.token, COUNT(*)
FROM Ritokens T
GROUP BY T.tupleid, T.token;

Query 4.5: Finding the token frequency

SELECT T.token, LOG(S.size) - LOG(COUNT(DISTINCT T.tupleid))
FROM Ritokens T, Risize S
GROUP BY T.token, S.size;

Query 4.6: Finding the idf

So far the choice of similarity metric has been independent of the SQL statements; the similarity calculation has been abstracted by means of a similarity function. Any metric that can be calculated using two simple input strings as its only input parameters can be used this way. Similarity metrics such as cosine similarity, which need parameters other than the strings themselves, cannot be calculated in this way. The remainder of this chapter considers a similarity join using cosine similarity.

4.3 Similarity Join using Cosine Similarity

The following steps must be completed before a similarity join using cosine similarity can be computed.

1. Tokenization
2. Token weighting
   (a) Tuple frequency calculation
   (b) Inverse document frequency calculation
3. Term weight normalization

Step 1 has already been covered in the previous section (Query 4.2); for the cosine similarity join implementation the same SQL statement can be used.

4.3.1 Token weighting

Query 4.5 shows the SQL statement to compute the tuple frequency of each token occurrence. The cardinality of a relation containing the tuple frequencies for tokens in R_i is equal to the cardinality of the R_i token relation.

SELECT T.tupleid, SQRT(SUM(I.idf*I.idf*T.tf*T.tf))
FROM RiTF T, RiIDF I
WHERE I.token=T.token
GROUP BY T.tupleid;

Query 4.7: Compute the length of each weight vector
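As a worked illustration of Query 4.6 above (the numbers are invented here for exposition): for a relation of N = 1000 tuples in which a token occurs in 10 distinct tuples, the query computes

$$\mathrm{idf} = \log(1000) - \log(10) = \log\frac{1000}{10} = \log(100)$$

Whatever base the system's LOG function uses, the difference of logarithms equals log(N/df), so rare tokens receive a high weight and ubiquitous tokens a weight near zero.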

SELECT T.tupleid, T.token, I.idf*T.tf/L.len
FROM RiTF T, RiIDF I, RiLength L
WHERE I.token=T.token AND T.tupleid=L.tupleid;

Query 4.8: Normalize the token weights according to vector lengths

SELECT r1w.tupleid, r2w.tupleid
FROM R1weights r1w, R2weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tupleid, r2w.tupleid
HAVING SUM(r1w.weight * r2w.weight) > φ;

Query 4.9: Join using the calculated weights

Query 4.6 shows the SQL statement to compute the inverse document frequency of each distinct token occurring in R_i. The cardinality of a relation containing the inverse document frequencies for tokens in R_i is equal to the number of unique tokens in R_i. As an alternative, the DISTINCT keyword in Query 4.6 (to compute the inverse document frequency) can also be removed; the rationale is that most tokens will occur only once in each tuple (as we are dealing with short tuples).

4.3.2 Normalization

The length of each vector is calculated using the formula for the Euclidean norm. The weight of each term is given by tf · idf, so the Euclidean norm is

$$\sqrt{\sum_{i=1}^{T} (\mathrm{idf}_i \cdot \mathrm{tf}_i)^2}$$

Query 4.7 computes this length. Query 4.8 uses the results of Queries 4.5, 4.6 and 4.7 to compute the final weight of each token. The weights tables can now be joined, as shown in Query 4.9. Note that Query 4.9 also demonstrates how the SQL implementation of the cosine join exploits the sparseness of the weight vectors. The result of Query 4.9 is the set of ids of all tuple pairs in R_1 and R_2 with cosine similarity greater than φ.

SELECT riw.tupleid, riw.token, riw.weight, ROUND(S * riw.weight / rs.total)
FROM Riweights riw, Risum rs
WHERE riw.token = rs.token
  AND ROUND(S * riw.weight / rs.total) > 0

Query 4.10: SQL implementation of the sample scheme

21 4.3. SIMILARITY JOIN USING COSINE SIMILARITY 21 SELECT tupleid1, tupleid2 FROM ( SELECT r1w.tupleid AS tupleid1, r2s.tupleid AS tupleid2 SUM(r1w.weight * r2s.count * r2sum.total) AS approx FROM R1weights r1w, R2sample r2s, R2sum r2sum WHERE r1w.token = r2s.token AND r1w.token = r2sum.token GROUP BY r1w.tupleid, r2s.tupleid UNION ALL SELECT r2w.tupleid AS tupleid2, r1s.tupleid AS tupleid1 SUM(r2w.weight * r1s.count * r2sum.total) AS approx FROM R2weights r2w, R1sample r1s, R1sum r1sum WHERE r2w.token = r1s.token AND r2w.token = r1sum.token GROUP BY r2w.tupleid, r1s.tupleid ) sim GROUP BY tupleid1, tupleid2 HAVING AVG(approx) S * φ Query 4.11: SQL implementation of (3.1) SELECT r1s.tupleid, r2s.tupleid FROM R1sample r1s, R2sample r2s, R1sum r1sum, R2sum r2s WHERE r1s.token = r1sum.token AND r2s.token = r2sum.token AND r1s.token = r2s.token GROUP BY r1s.tupleid, r2s.tupleid HAVING SUM(r1s.count * r1sum.total * r2s.count * r2sum.total) S * S * φ Query 4.12: SQL implementation of (3.2) Sampling the Tokens As was stipulated in the previous chapter, even when only considering related tuples in R1 and R2, still many tuples might not reach the similarity threshold φ. A probabilistic sampling scheme was introduced to reduce the number of unnecessary comparisons. In this section the sampling scheme will be shown to be implementable using only SQL statements. Query 4.1 shows the sampling scheme in SQL. The ROUND(S * riw.weight / rs.total) term represents the expected average number of insertion of a token occurrence in the sample set, where the number of inserts is stored for each sampled token. The size of the sample relation is bounded by the size of the token relation. Query 4.11 shows the SQL implementation of the approximation to R 1 φ R 2 making use of the implications of (3.1). A join is executed two times between a sample set of one relation and the unsampled tokens from the other relation. The union of both is thresholded by φ. Query 4.12 shows the SQL implementation of the approximation to R 1 φ R 2 using the im-

SELECT r1s.tupleid, r2s.tupleid
FROM R1sample r1s, R2sample r2s
WHERE r1s.token = r2s.token
GROUP BY r1s.tupleid, r2s.tupleid
HAVING SUM(r1s.weight * r2s.weight) >= φ

Query 4.13: Simplification of Query 4.12

Looking more closely at the HAVING clause of Query 4.12, the inequality is:

$$\sum_i \mathrm{r1s.count} \cdot \mathrm{r1sum.total} \cdot \mathrm{r2s.count} \cdot \mathrm{r2sum.total} \; \geq \; S^2 \cdot \phi$$

Since count was calculated as ROUND(S * riw.weight / rs.total) and all token occurrences are unique in the sample relation (see Query 4.10), the inequality can be simplified:

$$\sum_i \frac{\mathrm{r1s.count} \cdot \mathrm{r1sum.total} \cdot \mathrm{r2s.count} \cdot \mathrm{r2sum.total}}{S^2} \geq \phi$$

$$\sum_i \frac{S \cdot \frac{\mathrm{r1s.weight}}{\mathrm{r1sum.total}} \cdot \mathrm{r1sum.total} \cdot S \cdot \frac{\mathrm{r2s.weight}}{\mathrm{r2sum.total}} \cdot \mathrm{r2sum.total}}{S^2} \geq \phi$$

$$\sum_i \mathrm{r1s.weight} \cdot \mathrm{r2s.weight} \geq \phi$$

The sum relations can now be completely discarded from the query; Query 4.13 shows the simplified query. As a consequence of this simplification it becomes clear that no tuple pair in the join approximation can have an overestimated similarity. Therefore the precision of the approximation is always 1 (provided that ǫ = 0).
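To tie the pieces together, the sample relation and the per-token weight sums can be materialized with plain SQL before running Query 4.13. The sketch below is added here for illustration; the table and column names follow the queries above, S stands for the chosen sample size constant, and the exact CREATE TABLE ... AS syntax varies slightly per system (MonetDB, for instance, may require a trailing WITH DATA). R_2 is handled identically.

-- Accumulated weight sum(i) per token, as defined in chapter 3
CREATE TABLE R1sum AS
  SELECT token, SUM(weight) AS total
  FROM R1weights
  GROUP BY token;

-- Query 4.10: keep a token occurrence only if it would be
-- inserted into the sample at least once on average
CREATE TABLE R1sample AS
  SELECT riw.tupleid, riw.token, riw.weight,
         ROUND(S * riw.weight / rs.total) AS count
  FROM R1weights riw, R1sum rs
  WHERE riw.token = rs.token
    AND ROUND(S * riw.weight / rs.total) > 0;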

CHAPTER 5

Top-k extension

The previous chapter explained how a similarity join can be implemented using SQL statements. That similarity join answered the question: which tuples in R_1 and R_2 have a similarity greater than φ? Another interesting question is: what are the k best matches in R_j for each tuple in R_i? Both questions can also be combined, so that not only the top-k results are returned, but the similarity of the tuples must also reach the threshold φ. In this chapter various possible solutions to the top-k question are briefly explored.

5.1 Top-k as a post-processing step

The top-k question can be answered in a few different ways. The most preferable way would be to calculate the similarity of related tuples (more or less) in the order of their similarity and stop when k is reached, or when it is certain all tuples in the top-k are included. This ordered top-k needs a custom tailored algorithm; it is not possible to express it using standard SQL or generic set operations. Another possibility is to calculate R_1 ⋈_φ R_2 and make a top-k selection on the results for each tuple in R_1 and for each tuple in R_2. This will be called the post-processing approach and is what is discussed in this section.

SELECT outerresult.*
FROM r1r2 outerresult
WHERE outerresult.r2id IN (
    SELECT innerresult.r2id
    FROM r1r2 innerresult
    WHERE innerresult.r1id = outerresult.r1id
    ORDER BY innerresult.similarity DESC
    FETCH FIRST K ROWS ONLY
)

Query 5.1: Query to retrieve the top-k similarity results for each tuple in R1

SELECT ... FROM ... WHERE ...
STOP AFTER <value expression>
FOR EACH <stop-grouping attributes>
RANK BY <ranking specification>
ORDER BY ...

where:
<value expression> defines an expression, not associated with the rest of the query, resulting in an integer value.
<stop-grouping attributes> is a set of attribute names, defining a group.
<ranking specification> defines the ranking used within each group to determine which k tuples will be returned.

Figure 5.1: Syntax for the SQL addition

SELECT * FROM (
    SELECT * FROM r1r2
    STOP AFTER K
    FOR EACH r1r2.tupleid1
    RANK BY cosine_similarity
  UNION ALL
    SELECT * FROM r1r2
    STOP AFTER K
    FOR EACH r1r2.tupleid2
    RANK BY cosine_similarity
) topnsim

Query 5.2: Applying generalized top-k to the result of the standard SQL solution

5.1.1 Standard SQL solution

It is (theoretically) possible to answer the top-k question as a post-processing step by executing a standard SQL query with a correlated FETCH FIRST subquery¹. Query 5.1 shows how to retrieve a top-k for each tuple in R_1, using the already calculated similarity join result stored in the r1r2 relation. This query must be executed once to find all top-k matches for the tuples in R_1, and once to find all top-k matches for the tuples in R_2. The union of the results is the final answer to the top-k question.

5.1.2 Generalized top-k

In this section an addition to SQL is discussed with which the top-k selection can be expressed in a more natural way, without forcing the database system to execute a subquery numerous times. The addition is proposed in [5] and is called generalized top-k. The proposal adds to the language the possibility to define groups and a ranking within those groups.

¹ FETCH FIRST in subqueries was added in the SQL 2008 standard.

SELECT ... FROM ... WHERE ...
STOP AFTER <value expression>
FOR EACH <stop-grouping attributes>
GROUP BY <nested group attributes>
RANK BY <ranking specification>
ORDER BY ...

where:
<nested group attributes> is a set of attribute names defining a group within the FOR EACH group.
<ranking specification> in addition to the specification in Figure 5.1, the following extra constraints are put on the specification if a GROUP BY clause is specified: aggregate functions are allowed; no references to attributes not in <nested group attributes> are allowed.

Figure 5.2: Proposed syntax for the SQL addition

A specified number of returned tuples per group can be set, so that the top-k values in the ranking are returned. The syntax of the addition is shown in Figure 5.1. Query 5.2 shows how to filter out the k best matches. As can be seen, a generalized top-k query is executed twice, first grouping for each tuple of R_1, then for each tuple of R_2; the result is the best k matches for each tuple in R_1 and each tuple in R_2.

5.1.3 Generalized top-k extension

A limitation of the addition proposed in [5] is the inability to define groups within a FOR EACH group. With that ability it becomes possible to rank the groups within each FOR EACH group, returning only the top-k groups. This extra addition brings exactly the functionality needed for a best-matches query. Figure 5.2 shows a possible syntax for the extra addition needed to execute the top-k during the computation of R_1 ⋈ R_2. Query 5.3 shows the query that would then be possible. It should be noted that with a naive query execution plan the join on the sample sets of R_1 and R_2 might be executed twice; the extension to generalized top-k thus possibly only saves the disk I/O of writing the related tuples not in the top-k.

5.2 Other solutions

Fast algorithms exist that can do top-k TF-IDF lookups [4]. Using such fast algorithms in the context of a database system will probably mean the introduction of a join operator specific to the calculation of a cosine similarity join. It is not possible to create a User Defined Aggregate Function for this, because the result is not a single value but k values. A further study of these fast top-k algorithms and of how a specific join operator might look is beyond the scope of this project.

SELECT * FROM (
    SELECT * FROM r1sample r1s, r2sample r2s
    WHERE r1s.token = r2s.token
    STOP AFTER N
    FOR EACH r1s.tupleid
    GROUP BY r1s.tupleid, r2s.tupleid
    RANK BY SUM(r1s.weight * r2s.weight)
  UNION ALL
    SELECT * FROM r1sample r1s, r2sample r2s
    WHERE r1s.token = r2s.token
    STOP AFTER N
    FOR EACH r2s.tupleid
    GROUP BY r1s.tupleid, r2s.tupleid
    RANK BY SUM(r1s.weight * r2s.weight)
) topnsim

Query 5.3: Applying generalized top-k to the result of Query 4.13

CHAPTER 6

Experimental Results

In this chapter the SQL based approach of chapter 4 is benchmarked. The database system used for this benchmark is MonetDB, a database system developed at CWI. All tests are run on a high end server machine with an eight core CPU and 64 gigabytes of RAM. Three data sets are used in the experiments. The first is a list of (English) Wikipedia article titles [2], the second is a list of actors and actresses from the IMDB [3] (Internet Movie Database), the third is a list of authors from DBLP [1] (a database of computer science journals and proceedings). Both the DBLP and the IMDB data are joined with the Wikipedia dataset.

The different steps needed to complete (an approximation to) a similarity join are benchmarked separately. The data produced by each step is also examined where possible and interesting. All benchmarks are run for different sizes of R_1 and R_2. Random tuples from the original relation are chosen to create a relation of a specific size. This is done such that all tuples in a relation of a specific size are also present in every relation of a bigger size, which makes it somewhat easier to compare results between relations of different sizes.

6.1 Preprocessing and sampling

The preprocessing and sampling steps combined result in the sampled relation, which in turn is used for the symmetric and union join plans. The preprocessing steps consist of tokenization and TF-IDF weighting of the token occurrences. Figure 6.1 shows that both the preprocessing and the sampling behave linearly with respect to the number of tuples in R_i.

Figure 6.2 plots the size of the sample relations resulting from various values of S, for different sizes of R_1 and R_2. Note the striking similarity between Figures 6.2b and 6.2c, not only in shape (all three graphs show a similar shape) but also numerically: the two sample relations have about the same number of tuples. The explanation for this similarity is that both original relations contain names, for the most part probably English/European names; the data is thus quite similar. The Wikipedia set contains article titles, which is quite different from names; therefore the sample relation shown in Figure 6.2a differs more from 6.2b and 6.2c. More specifically, the cardinality of the sample relations in Figure 6.2a is larger.

[Figure 6.1: Preprocessing benchmark. Panels: (a) Tokenization, (b) TF-IDF weighting, (c) Sampling; execution time (msec) plotted against R_i cardinality.]

[Figure 6.2: Sample set sizes. Panels: (a) Wikipedia sample set size, (b) DBLP sample set size, (c) IMDB sample set size; sample relation cardinality plotted against sample size S, one curve per |R_i|.]

[Figure 6.3: Join execution times. Panels: (a) Wikipedia ⋈ DBLP and (b) Wikipedia ⋈ IMDB, join duration (msec) against sample size, one curve per |R_i|; (c) Wikipedia ⋈ DBLP and (d) Wikipedia ⋈ IMDB, join execution time (msec) against source table size, for S = 32, 64, 128, 256.]

[Figure 6.4: Size of the result sets. Panels: (a) Wikipedia ⋈ DBLP and (b) Wikipedia ⋈ IMDB, number of results against sample size, one curve per |R_i|; (c) Wikipedia ⋈ DBLP and (d) Wikipedia ⋈ IMDB, number of results against R_1 and R_2 size, for S = 32, 64, 128, 256.]

[Figure 6.5: Execution time and result set size for the union join for Wikipedia ⋈ DBLP. Panels: (a) execution time against sample size, (b) result set size against R_1 and R_2 size, for S = 32, 64, 128.]

[Figure 6.6: Comparison of the three join approaches (naive, normal S=128, union S=128). Panels: (a) execution time and (b) result set size against R_i cardinality.]

This does not (necessarily) mean more token occurrences have been sampled (remember from chapter 3 that the number of sample insertions is given by S · |T|), but it does mean more distinct token occurrences have been sampled.

6.2 Join

Three join plans have been benchmarked. The most important one is the join shown in Query 4.13 (called the symmetric join from here on). This plan is expected to be the most scalable, but also to produce the most incomplete result. The next plan is the query that joins the samples of R_1 with the tokens of R_2 and vice versa, using the union as the result. This plan is shown in Query 4.11 (called the union join) and is expected to scale worse than the symmetric join, but in turn will most likely produce a more complete result. The final plan is the join purely on tokens, using no form of sampling at all. This plan produces the best result, but scales the worst of all plans; it is shown in Query 4.9 (called the baseline). Note that for all queries the results are also fetched: all join queries are encapsulated in a query that fetches the actual corresponding records of the join result in R_1 and R_2 from disk.


More information

One of the most important areas where quantifier logic is used is formal specification of computer programs.

One of the most important areas where quantifier logic is used is formal specification of computer programs. Section 5.2 Formal specification of computer programs One of the most important areas where quantifier logic is used is formal specification of computer programs. Specification takes place on several levels

More information

Relational Algebra and SQL

Relational Algebra and SQL Relational Algebra and SQL Relational Algebra. This algebra is an important form of query language for the relational model. The operators of the relational algebra: divided into the following classes:

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

The SQL data-definition language (DDL) allows defining :

The SQL data-definition language (DDL) allows defining : Introduction to SQL Introduction to SQL Overview of the SQL Query Language Data Definition Basic Query Structure Additional Basic Operations Set Operations Null Values Aggregate Functions Nested Subqueries

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Relational Databases

Relational Databases Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 49 Plan of the course 1 Relational databases 2 Relational database design 3 Conceptual database design 4

More information

Database Systems SQL SL03

Database Systems SQL SL03 Checking... Informatik für Ökonomen II Fall 2010 Data Definition Language Database Systems SQL SL03 Table Expressions, Query Specifications, Query Expressions Subqueries, Duplicates, Null Values Modification

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Introduction to Clustering

Introduction to Clustering Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of

More information

Chapter 3: Introduction to SQL

Chapter 3: Introduction to SQL Chapter 3: Introduction to SQL Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 3: Introduction to SQL Overview of the SQL Query Language Data Definition Basic Query

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Disjunctive and Conjunctive Normal Forms in Fuzzy Logic

Disjunctive and Conjunctive Normal Forms in Fuzzy Logic Disjunctive and Conjunctive Normal Forms in Fuzzy Logic K. Maes, B. De Baets and J. Fodor 2 Department of Applied Mathematics, Biometrics and Process Control Ghent University, Coupure links 653, B-9 Gent,

More information

Chapter 2 - Graphical Summaries of Data

Chapter 2 - Graphical Summaries of Data Chapter 2 - Graphical Summaries of Data Data recorded in the sequence in which they are collected and before they are processed or ranked are called raw data. Raw data is often difficult to make sense

More information

Chapter 3: SQL. Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Chapter 3: SQL. Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 3: SQL Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 3: SQL Data Definition Basic Query Structure Set Operations Aggregate Functions Null Values Nested

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Relational Model: History

Relational Model: History Relational Model: History Objectives of Relational Model: 1. Promote high degree of data independence 2. Eliminate redundancy, consistency, etc. problems 3. Enable proliferation of non-procedural DML s

More information

Chapter 3: SQL. Chapter 3: SQL

Chapter 3: SQL. Chapter 3: SQL Chapter 3: SQL Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 3: SQL Data Definition Basic Query Structure Set Operations Aggregate Functions Null Values Nested

More information

Database Systems SQL SL03

Database Systems SQL SL03 Inf4Oec10, SL03 1/52 M. Böhlen, ifi@uzh Informatik für Ökonomen II Fall 2010 Database Systems SQL SL03 Data Definition Language Table Expressions, Query Specifications, Query Expressions Subqueries, Duplicates,

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

CS352 Lecture - Introduction to SQL

CS352 Lecture - Introduction to SQL CS352 Lecture - Introduction to SQL Objectives: last revised September 12, 2002 1. To introduce the SQL language 2. To introduce basic SQL DML operations (select, insert, update, delete, commit, rollback)

More information

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 3 Relational Model Hello everyone, we have been looking into

More information

Essay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM).

Essay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM). Question 1 Essay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM). By specifying participation conditions By specifying the degree of relationship

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

Relational Database: The Relational Data Model; Operations on Database Relations

Relational Database: The Relational Data Model; Operations on Database Relations Relational Database: The Relational Data Model; Operations on Database Relations Greg Plaxton Theory in Programming Practice, Spring 2005 Department of Computer Science University of Texas at Austin Overview

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Similarity Joins of Text with Incomplete Information Formats

Similarity Joins of Text with Incomplete Information Formats Similarity Joins of Text with Incomplete Information Formats Shaoxu Song and Lei Chen Department of Computer Science Hong Kong University of Science and Technology {sshaoxu,leichen}@cs.ust.hk Abstract.

More information

Exercise: Graphing and Least Squares Fitting in Quattro Pro

Exercise: Graphing and Least Squares Fitting in Quattro Pro Chapter 5 Exercise: Graphing and Least Squares Fitting in Quattro Pro 5.1 Purpose The purpose of this experiment is to become familiar with using Quattro Pro to produce graphs and analyze graphical data.

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

II B.Sc(IT) [ BATCH] IV SEMESTER CORE: RELATIONAL DATABASE MANAGEMENT SYSTEM - 412A Multiple Choice Questions.

II B.Sc(IT) [ BATCH] IV SEMESTER CORE: RELATIONAL DATABASE MANAGEMENT SYSTEM - 412A Multiple Choice Questions. Dr.G.R.Damodaran College of Science (Autonomous, affiliated to the Bharathiar University, recognized by the UGC)Re-accredited at the 'A' Grade Level by the NAAC and ISO 9001:2008 Certified CRISL rated

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML CS276B Text Retrieval and Mining Winter 2005 Plan for today Vector space approaches to XML retrieval Evaluating text-centric retrieval Lecture 15 Text-centric XML retrieval Documents marked up as XML E.g.,

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

CPS122 Lecture: From Python to Java

CPS122 Lecture: From Python to Java Objectives: CPS122 Lecture: From Python to Java last revised January 7, 2013 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

CSCI 403: Databases 13 - Functional Dependencies and Normalization

CSCI 403: Databases 13 - Functional Dependencies and Normalization CSCI 403: Databases 13 - Functional Dependencies and Normalization Introduction The point of this lecture material is to discuss some objective measures of the goodness of a database schema. The method

More information

Chapter 14: Query Optimization

Chapter 14: Query Optimization Chapter 14: Query Optimization Database System Concepts 5 th Ed. See www.db-book.com for conditions on re-use Chapter 14: Query Optimization Introduction Transformation of Relational Expressions Catalog

More information

RELATIONAL DATA MODEL: Relational Algebra

RELATIONAL DATA MODEL: Relational Algebra RELATIONAL DATA MODEL: Relational Algebra Outline 1. Relational Algebra 2. Relational Algebra Example Queries 1. Relational Algebra A basic set of relational model operations constitute the relational

More information

Set theory is a branch of mathematics that studies sets. Sets are a collection of objects.

Set theory is a branch of mathematics that studies sets. Sets are a collection of objects. Set Theory Set theory is a branch of mathematics that studies sets. Sets are a collection of objects. Often, all members of a set have similar properties, such as odd numbers less than 10 or students in

More information

Programming Lecture 3

Programming Lecture 3 Programming Lecture 3 Expressions (Chapter 3) Primitive types Aside: Context Free Grammars Constants, variables Identifiers Variable declarations Arithmetic expressions Operator precedence Assignment statements

More information

CPS122 Lecture: From Python to Java last revised January 4, Objectives:

CPS122 Lecture: From Python to Java last revised January 4, Objectives: Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views Modification of the Database Data Definition

Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views Modification of the Database Data Definition Chapter 4: SQL Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views Modification of the Database Data Definition Language 4.1 Schema Used in Examples

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VI Lecture 14, March 12, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part V Hash-based indexes (Cont d) and External Sorting Today s Session:

More information

SQL. Dean Williamson, Ph.D. Assistant Vice President Institutional Research, Effectiveness, Analysis & Accreditation Prairie View A&M University

SQL. Dean Williamson, Ph.D. Assistant Vice President Institutional Research, Effectiveness, Analysis & Accreditation Prairie View A&M University SQL Dean Williamson, Ph.D. Assistant Vice President Institutional Research, Effectiveness, Analysis & Accreditation Prairie View A&M University SQL 1965: Maron & Levien propose Relational Data File 1968:

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 124 Section #8 Hashing, Skip Lists 3/20/17 1 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look

More information

NULLs & Outer Joins. Objectives of the Lecture :

NULLs & Outer Joins. Objectives of the Lecture : Slide 1 NULLs & Outer Joins Objectives of the Lecture : To consider the use of NULLs in SQL. To consider Outer Join Operations, and their implementation in SQL. Slide 2 Missing Values : Possible Strategies

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research

More information

Lecture 3 SQL. Shuigeng Zhou. September 23, 2008 School of Computer Science Fudan University

Lecture 3 SQL. Shuigeng Zhou. September 23, 2008 School of Computer Science Fudan University Lecture 3 SQL Shuigeng Zhou September 23, 2008 School of Computer Science Fudan University Outline Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views

More information

Database System Concepts, 5th Ed.! Silberschatz, Korth and Sudarshan See for conditions on re-use "

Database System Concepts, 5th Ed.! Silberschatz, Korth and Sudarshan See   for conditions on re-use Database System Concepts, 5th Ed.! Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use " Data Definition! Basic Query Structure! Set Operations! Aggregate Functions! Null Values!

More information

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati

More information

Solutions to Homework 10

Solutions to Homework 10 CS/Math 240: Intro to Discrete Math 5/3/20 Instructor: Dieter van Melkebeek Solutions to Homework 0 Problem There were five different languages in Problem 4 of Homework 9. The Language D 0 Recall that

More information

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa ICS 624 Spring 2011 Overview of DB & IR Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 1 Example

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

Optimizing Testing Performance With Data Validation Option

Optimizing Testing Performance With Data Validation Option Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

BOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen

BOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen BOOLEAN MATRIX FACTORIZATIONS with applications in data mining Pauli Miettinen MATRIX FACTORIZATIONS BOOLEAN MATRIX FACTORIZATIONS o THE BOOLEAN MATRIX PRODUCT As normal matrix product, but with addition

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information