Continuous similarity search for evolving queries


Knowl Inf Syst — Regular Paper

Continuous similarity search for evolving queries

Xiaoning Xu (Fortinet Inc., Burnaby, BC, Canada) · Chuancong Gao (Simon Fraser University, Burnaby, BC, Canada) · Jian Pei (Simon Fraser University; King Abdulaziz University) · Ke Wang (Simon Fraser University) · Abdullah Al-Barakati (King Abdulaziz University, Jeddah, Saudi Arabia)

Received: 14 May 2015 / Revised: 3 August 2015 / Accepted: 5 October 2015
© Springer-Verlag London 2015
Corresponding author: Jian Pei, jpei@cs.sfu.ca

This work is partly supported by an NSERC Discovery grant, the Canada Research Chair program, and a Yahoo! Faculty Research and Engagement Program (FREP) award. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Abstract In this paper, we study a novel problem of continuous similarity search for evolving queries. Given a set of objects, each being a set or multiset of items, and a data stream, we want to continuously maintain the top-k most similar objects using the last n items in the stream as an evolving query. We show that the problem has several important applications. At the same time, the problem is challenging. We develop a filtering-based method and a hashing-based method. Our experimental results on both real data sets and synthetic data sets show that our methods are effective and efficient.

Keywords Similarity search · Data stream · Evolving query

1 Introduction

Let us consider a recommendation problem at a Q&A website. Suppose you want to run 3 ads on the website for a user who does not yet have a profile. What ads should you display? In addition to many factors, such as the click-through rates of ads and bidding price information, a natural and important idea is to consider the ads that are most related to the questions recently asked at the website by all users. For example, you may model ads and questions as sets of keywords, and may want to retrieve the top-k ads that are most similar to the last n questions asked, where the similarity measure captures the relevance between ads and questions. Such ads can be used as candidates for further selection.

The above scenario is just one of many applications that motivate the problem studied in this paper. Given a set of static data objects (e.g., the ads in the above example) and an evolving data stream (e.g., the questions asked in the above example), a sliding window on the data stream (e.g., the last n questions in the above example) presents an evolving query. The problem of continuous similarity search for evolving queries is to continuously conduct top-k similarity search on the set of static objects using the evolving queries.

The problem of continuous similarity search for evolving queries has many important applications. As another example, in a computer role-playing game, a player has a set of weapons and tools, which is relatively stable. The player goes through a game mission scene, where the objects in the continuously updated surrounding environment, such as different types of enemies, scoring opportunities, and obstacles, present a stream. The window over the stream containing the most recent objects approaching the player presents an evolving query. The player has to select the weapons and tools that best match the current surrounding environment. Here, an enemy can be modeled as a set of factors, such as explosion, by which it may be tackled, and a weapon/tool can be modeled as a set of factors for which it may be used. Again, before any gaming strategies can be applied, an essential task is to continuously maintain the top-k best weapons and tools with respect to the evolving enemies.

Our problem can be considered the dynamic version of the top-k set similarity search problem. More generally, a traditional similarity search problem involves a collection of objects, a similarity function, and a user-defined threshold. The search is to find all objects in the collection whose similarity scores with regard to the query are no less than the predefined threshold. Similarity search has many applications, such as information retrieval [14,16,33], near-duplicate web page detection [19], record linkage [37], data compression [16], data integration [9], image and video search and recommendation [13,15,31,35], statistics and data analysis [12,22], machine learning [10], and data mining [3,18]. Besides answering threshold-based queries, top-k queries are also of great value, since, given a threshold, the size of the result may be unpredictable, and, in many real applications, we are only interested in a small number of most similar objects. Moreover, as reviewed in Sect. 2.2, the problem of similarity search on data streams has been extensively explored, especially for nearest neighbor search. However, to the best of our knowledge, the problem of continuous set-based similarity search for evolving queries has not been systematically investigated.

In this paper, we tackle the problem of continuous similarity search for evolving queries. Since in many applications an object can be represented as a set or multiset, such as using a keyword vector to represent a document, we use weighted Jaccard similarity as the measure. The major challenge is how to speed up the similarity computation and avoid exactly checking the evolving query against every static object at every time point. We develop an upper bound for incremental maintenance of similarity scores that can be computed in constant time. We propose two algorithms: an exact one based on the pruning and verification framework, and an approximate one based on MinHash. We report an empirical evaluation on both synthetic and real-world data sets, which validates the efficiency and effectiveness of our proposed methods.

The rest of the paper is organized as follows. In Sect. 2, we review the related work. We then formulate the problem in Sect. 3. In Sect. 4, we propose a filtering-based method. In Sect. 5, we present a hashing-based method. We report our experimental results in Sect. 6 and conclude the paper in Sect. 7.

2 Related work

Our study is mainly related to the existing work on similarity search and continuous top-k queries. In this section, we briefly review the state-of-the-art methods related to our study. A thorough survey of these topics is far beyond the capacity of this paper.

2.1 Similarity join and search

The static version of similarity search has been studied extensively. The state-of-the-art methods can be categorized into two major groups.

The first category is the filtering-based approaches. The general idea is to develop upper and lower bounds on the similarity between an object and a query that can be computed efficiently. Many objects can be filtered out using the bounds. Consequently, the exact similarity scores, which are supposed to be more expensive to compute, are calculated for only the small number of objects surviving the filtering. For instance, Sarawagi and Kirpal [34] proposed an inverted index-based probing method for similarity joins on sets. Chaudhuri et al. [8] developed the prefix-filtering principle for similarity joins. The all-pairs algorithm developed by Bayardo et al. [3] further improves this approach by adding a minimum length constraint and other filtering techniques to speed up similarity joins. Nevertheless, the all-pairs and prefix-filtering methods often generate a nontrivial number of candidates, which have to be verified using the exact similarity measure. Xiao et al. [38] extended the all-pairs method and proposed a new positional filtering method, PPJoin, which makes use of ordering information. PPJoin+ [38] combines suffix filtering with PPJoin and can further reduce the number of candidate pairs. A new similarity measure, PathSim, which is based on meta paths and is used in heterogeneous networks, is defined in [36].

The second category is the hashing-based methods. The general idea is to develop hash functions with good locality-preservation properties: similar objects are likely to be hashed to the same bucket. The idea was introduced by Indyk and Motwani [21] for approximate nearest neighbor search in d-dimensional Euclidean space. The basic principle is to hash the points using multiple hash functions such that closer points have a higher probability of collision than points that are far away. Gionis et al. [17] further improved the algorithms and achieved better query time guarantees. Later, an improved algorithm that almost achieves the space and time lower bounds was presented by Andoni and Indyk [1]. The MinHash technique [5] is used to approximate the resemblance and the containment of sets; it is also used to estimate the rarity and similarity between two windowed data streams in [11]. Moreover, Charikar [7] proposed SimHash to hash similar data to similar values; an estimation of vector-based cosine similarity using a random projection method is also discussed in [7].

In this paper, we explore both filtering-based and hashing-based methods, which have not been addressed in the existing literature for evolving similarity search queries.

2.2 Continuous queries over a data stream

Different evolving models are used in previous studies of continuous queries over a data stream. For example, Kontaki et al. [24] studied similarity range queries in streaming time sequences using Euclidean distance, where both the query and the data objects are evolving. An indexing method based on incremental computation of the discrete Fourier transform is used to prune candidates efficiently. Lian et al. [26] tackled the similarity search problem over multiple stream time series, given a static time series as a query. An approximation algorithm is developed using a weighted locality-sensitive hashing technique.

Motivated by a wide range of applications such as network intrusion detection, much work [4,25,28,29] has been devoted to continuously monitoring nearest neighbor (NN) queries over a data stream. The basic idea is to utilize indexing structures to reduce memory consumption and support efficient updates. Mouratidis et al. [28] proposed two approaches for continuous monitoring of NN queries over sliding window streams. Koudas et al. [25] developed an approximation algorithm that utilizes an indexing scheme, DISC, and has guaranteed bounds on error and performance. The existing work on continuously monitoring nearest neighbors for a mobile query object is different from the problem studied here. In those previous studies, the mobile object is assumed to move along a trajectory that is potentially predictable to some extent. In this paper, the stream presenting an evolving query is not assumed to be a moving object. Instead, we simply use the current sliding window as the current query. The existing methods for continuous nearest neighbor monitoring of mobile objects cannot solve our problem.

In addition to continuous similarity search, some other interesting problems have been defined over data streams. For example, Pan and Zhu [30] developed a two-level candidate checking scheme for continuously querying the top-k correlated graphs in a data stream scenario where static queries are posed on evolving graph streams. Mouratidis et al. [27] proposed two approaches for continuously answering top-k queries where the query is a static preference function over a fixed-size sliding window: one approach computes new answers whenever a current top-k point expires, and the other partially precomputes future changes. Rao et al. [32] devised methods for the problem of using an aggregation function to measure the relevance between a document stream and a query consisting of terms. They modeled documents as a data stream; that is, new documents keep coming while the query is not evolving, which differs from our model, where the evolving query consists of the elements in the current sliding window of a data stream. Kollios and Tsotras [23] proposed efficient hashing methods to answer membership queries in a temporal setting.

Table 1 summarizes the differences between our study and some previous methods related to ours.

3 Problem definition

Let us consider an alphabet Σ of items, a collection R of objects where each object is a set (or multiset) over Σ, a data stream S whose elements keep arriving, and an integer n as the size of a query sliding window on S. A query Q is the last n elements in S. A query is in general a multiset, but it can also be modeled as a set. The top-k continuous similarity search for an evolving query Q is to continuously find the k objects in R that are most similar to Q. We answer the query continuously whenever a new element arrives in S.

In this paper, we consider each object to be a set or multiset; queries can be sets or multisets as well. Since a set is a special case of a multiset where all elements are distinct, we use weighted Jaccard similarity to measure the similarity between objects and queries. Specifically, let X be a nonempty multiset consisting of items in the alphabet Σ = {v_1, v_2, ..., v_{|Σ|}}. We map X to a |Σ|-dimensional vector x such that the value of the ith component is the absolute frequency of item v_i in X. The vector x is called the vector representation of multiset X.
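The evolving query itself can be maintained incrementally as items arrive. The following is a minimal Python sketch of this model (the class and method names are ours, not from the paper): the query multiset is kept as a frequency vector that is patched in constant time per arriving element.

from collections import Counter, deque

class SlidingQuery:
    """The evolving query: the multiset of the last n stream elements."""
    def __init__(self, n):
        self.n = n                # window size
        self.window = deque()     # last n elements in arrival order
        self.counts = Counter()   # vector representation of the query

    def push(self, item):
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.n:      # expire the oldest element
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]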

Table 1 Comparison of our methods and the previous work

Our methods:
- Pruning method (GP). Problem: top-k set/multiset similarity search. Algorithm: upper bounding similarity scores w.r.t. evolving queries. Features: top-k, set/multiset similarity search, static objects, evolving queries.
- Hashing method (MHI). Problem: top-k set similarity search. Algorithm: MinHash and inverted indices for fast computation of top-k results. Features: top-k, set similarity search, static objects, evolving queries.

Similarity join/search, pruning-based methods:
- Sarawagi and Kirpal [34]. Problem: threshold-based set similarity join. Algorithm: a probing method based on an inverted index. Features: threshold, set similarity join, static objects.
- Chaudhuri et al. [8]. Problem: threshold-based set similarity join. Algorithm: prefix-filtering principle. Features: threshold, set similarity join, static objects.
- Bayardo et al. [3]. Problem: threshold-based set similarity join. Algorithm: All-Pairs, which improved the prefix-filtering method. Features: threshold, set similarity join, static objects.
- Xiao et al. [38]. Problem: threshold-based set similarity join. Algorithm: PPJoin and PPJoin+, which add positional filtering and suffix filtering. Features: threshold, set similarity join, static objects.
- Sun et al. [36]. Problem: PathSim-based top-k similarity search. Algorithm: co-clustering-based pruning. Features: top-k, similarity search, static objects.

Similarity join/search, hashing-based methods:
- Indyk and Motwani [21]. Problem: approximate NN search in Euclidean space. Algorithm: locality-sensitive hashing (LSH). Features: threshold, NN search in Euclidean space, static objects.
- Gionis et al. [17]. Problem: approximate NN search in Euclidean space. Algorithm: improved [21] and achieved better time guarantees. Features: threshold, NN search in Euclidean space, static objects.
- Andoni and Indyk [1]. Problem: approximate NN search in Euclidean space. Algorithm: almost achieves the space and time lower bounds. Features: threshold, NN search in Euclidean space, static objects.
- Broder [5]. Problem: finding similar documents. Algorithm: the MinHash technique. Features: Jaccard similarity estimation, static documents.
- Datar and Muthukrishnan [11]. Problem: estimating rarity and similarity of windowed data streams. Algorithm: applying the MinHash technique. Features: Jaccard similarity estimation, two evolving data windows.
- Charikar [7]. Problem: designing new locality-sensitive hashing schemes. Algorithm: the SimHash technique. Features: similarity estimation, static objects.

Continuous queries over a data stream:
- Kontaki et al. [24]. Problem: similarity range queries in streaming time sequences. Algorithm: indexing and discrete Fourier transformation. Features: Euclidean distance, evolving objects and evolving queries.
- Lian et al. [26]. Problem: similarity search over multiple stream time series. Algorithm: a weighted LSH. Features: a static time series as the query, multiple stream time series as the objects.
- Mouratidis et al. [28]. Problem: continuous monitoring of NN queries over sliding window streams. Algorithm: conceptual partitioning and precomputing future changes. Features: Euclidean space, evolving objects and static queries.

Table 1 (continued):
- Pan and Zhu [30]. Problem: continuously querying the top-k correlated graphs in a data stream. Algorithm: a two-level candidate checking scheme. Features: evolving graph streams and static queries.
- Rao et al. [32]. Problem: measuring the relevance between a document stream and a query. Algorithm: a graph indexing structure for result sharing among queries. Features: evolving documents and static queries.

Table 2 A collection of sets R
T1 = {a, b, c, f, i}
T2 = {a, d, e}
T3 = {d, e}
T4 = {b, c, d, i}
T5 = {a, c, g, h, i, j}
T6 = {a, c, e, f, h}

Fig. 1 Query stream S: (a) the current query Q = {b, i, c, a, d}; (b) the query Q′ = {i, c, a, d, f} after a new item arrives.

Let X and Y be two nonempty multisets over alphabet Σ. Let x = [x_1, x_2, ..., x_{|Σ|}] and y = [y_1, y_2, ..., y_{|Σ|}] be the vector representations of X and Y, respectively. The weighted Jaccard similarity is

sim_jac(X, Y) = sim_jac(x, y) = Σ_{i=1}^{|Σ|} min(x_i, y_i) / Σ_{i=1}^{|Σ|} max(x_i, y_i)   (1)

Example 1 (Computing similarity scores) Given an alphabet Σ = {a, b, c, d, e, f, g}, consider two multisets X = {c, b, a, e, f} and Y = {a, b, a, c, e, d}. The vector representations of X and Y are x = [1, 1, 1, 0, 1, 1, 0] and y = [2, 1, 1, 1, 1, 0, 0], respectively. Since Σ_{i=1}^{7} max(x_i, y_i) = 7 and Σ_{i=1}^{7} min(x_i, y_i) = 4, the weighted Jaccard similarity score between X and Y is sim_jac(X, Y) = 4/7 ≈ 0.57.

We now model how a query evolves. Given a query Q = {e_{p+1}, e_{p+2}, ..., e_{p+|Q|}}, where item e_{p+i} (1 ≤ i ≤ |Q|) is the ith arrived item in the query, the updated query after u time instants (u ≤ |Q|), that is, u updates in the stream, is Q′ = {e_{p+u+1}, e_{p+u+2}, ..., e_{p+|Q|}, e_{p+|Q|+1}, e_{p+|Q|+2}, ..., e_{p+|Q|+u}}, where item e_{p+|Q|+j} is the jth newly arriving element in the updated query, 1 ≤ j ≤ u.

Example 2 (Continuous top-k queries) A collection of sets R in a database is shown in Table 2, and a data stream S is shown in Fig. 1a. The alphabet Σ = {a, b, c, d, e, f, g, h, i, j} consists of 10 items. Suppose k = 2 and the size of the sliding window is 5; that is, we use the set of the last 5 entries in stream S as the evolving query Q, and find the top-2 most similar sets in R.
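The weighted Jaccard similarity of Eq. 1 is straightforward to compute when multisets are stored as frequency maps. Below is a small Python sketch (our own helper, equivalent to the vector form of Eq. 1) that reproduces Example 1.

from collections import Counter

def weighted_jaccard(x, y):
    """Eq. (1): sum of component-wise minima over sum of maxima."""
    items = x.keys() | y.keys()
    inter = sum(min(x[v], y[v]) for v in items)
    union = sum(max(x[v], y[v]) for v in items)
    return inter / union if union else 0.0

X = Counter(['c', 'b', 'a', 'e', 'f'])
Y = Counter(['a', 'b', 'a', 'c', 'e', 'd'])
assert abs(weighted_jaccard(X, Y) - 4 / 7) < 1e-9   # Example 1: 4/7 ~ 0.57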

Table 3 Jaccard similarity scores

      T1    T2    T3    T4    T5    T6
Q     0.67  0.33  0.17  0.80  0.38  0.25
Q′    0.67  0.33  0.17  0.50  0.38  0.43

Table 4 Summary of frequently used symbols

For all methods:
- R: a collection of objects in the database
- S: the querying data stream
- Σ: the alphabet used in the database
- T_i: the ith object in R
- Q_t: the query containing the elements in the sliding window at time t
- x: the vector representation of a set (or multiset) X
- sim(X, Y): the exact similarity of two sets (or multisets) X and Y
- k: the number of objects in the result
- topk_t: the k objects that are most similar to the query at time t
- score(topk_t[k]): the similarity score of the kth most similar result at time t

For the pruning-based method:
- sim(X, Y)^{+u}: the upper bound of the similarity score of two sets (or multisets) X and Y, after u updates on Y
- min_step_i: the least number of updates needed for object T_i's progressive upper bound sim(T_i, Q_t)^{+min_step_i} to exceed score(topk_t[k])

For the MinHash-based method:
- {[n]}: the set {0, ..., n−1}
- sim*(X, Y): the MinHash-approximated similarity of two sets X and Y
- H: a set of hash functions h_1, h_2, ..., h_{|H|}
- index_i: the inverted index built on the MinHash values of the ith hash function, i ∈ [1, |H|]
- index_i[v]: the inverted list of hash value v, consisting of the objects whose MinHash value is v according to the ith hash function, i ∈ [1, |H|] and v ∈ [1, |Σ|]
- min(h_i(X)): the minimum hash (MinHash) value of set X according to the ith hash function h_i, i ∈ [1, |H|]

For experiments:
- |Q_t|: the average object length of the data set

We can compute the Jaccard similarity score between each object in R and the current query Q = {b, i, c, a, d}. The scores are shown in the first row of Table 3. Objects T4 and T1 are the top-2 similar objects with respect to query Q. Suppose an item f arrives next in the data stream, as shown in Fig. 1b. We then have an updated query Q′ = {i, c, a, d, f}. The updated Jaccard similarity scores are shown in the last row of Table 3. The rankings of the objects in R ordered by similarity score change slightly. For example, the rank of T4 changes from 1 to 2 (the lower the rank, the higher the similarity). Consequently, T1 and T4 are the top-2 similar objects with respect to query Q′.

Table 4 lists the frequently used symbols in the paper.

4 A pruning-based method

Computing the exact similarity score for every object against the updated query at every time point is costly. A brute-force method takes O(|Q_t| + |T_i|) time to compute the similarity between each object T_i ∈ R and the current query Q_t at every time instant t, and in total O(|R| · (|Q_t| + max_{1≤i≤|R|} |T_i|)) time to update the top-k results at each time instant. In this section, we derive an upper bound on Jaccard similarity scores and then present a pruning-based method that utilizes the upper bound.

4.1 A progressive upper bound

We define the length of a multiset as the number of elements, including duplicates, in the multiset. To compute the upper bound of the Jaccard similarity between two multisets X and Y after u updates on Y, we can use the following property.

Theorem 1 (A progressive upper bound) Let X and Y be two multisets, and let Y′ be the multiset with u updates on Y. Then,

sim_jac(X, Y′) ≤ sim_jac(X, Y)^{+u}

where

sim_jac(X, Y)^{+u} = (sim_jac(X, Y) · α + β) / (α − β)

and α = |X| + |Y|, β = (1 + sim_jac(X, Y)) · u.

Proof By definition, we have

sim_jac(X, Y) = Σ_{i=1}^{|Σ|} min{x_i, y_i} / Σ_{i=1}^{|Σ|} max{x_i, y_i}   (2)

and

Σ_{i=1}^{|Σ|} max{x_i, y_i} = |X| + |Y| − Σ_{i=1}^{|Σ|} min{x_i, y_i}.   (3)

Using Eqs. 2 and 3, we have

Σ_{i=1}^{|Σ|} min{x_i, y_i} = sim_jac(X, Y) · Σ_{i=1}^{|Σ|} max{x_i, y_i} = sim_jac(X, Y) · (|X| + |Y| − Σ_{i=1}^{|Σ|} min{x_i, y_i})

Consequently, Σ_{i=1}^{|Σ|} min{x_i, y_i} can be rewritten using |X|, |Y| and sim_jac(X, Y) as follows.

Σ_{i=1}^{|Σ|} min{x_i, y_i} = (|X| + |Y|) · sim_jac(X, Y) / (sim_jac(X, Y) + 1)   (4)

After u updates, the maximum increase in the intersection size (the numerator of Eq. 2) and the maximum decrease in the union size (the denominator of Eq. 2) are both u. Combining Eqs. 3 and 4, we have

sim_jac(X, Y′) ≤ (u + Σ_{i=1}^{|Σ|} min{x_i, y_i}) / (Σ_{i=1}^{|Σ|} max{x_i, y_i} − u)
             = (u + Σ_{i=1}^{|Σ|} min{x_i, y_i}) / (|X| + |Y| − Σ_{i=1}^{|Σ|} min{x_i, y_i} − u)
             = ((|X| + |Y| + u) · sim_jac(X, Y) + u) / (|X| + |Y| − u · sim_jac(X, Y) − u)

Let α = |X| + |Y| and β = (1 + sim_jac(X, Y)) · u; we have

sim_jac(X, Y′) ≤ (sim_jac(X, Y) · α + β) / (α − β) = sim_jac(X, Y)^{+u}  ∎

The upper bound can be computed from |X|, |Y|, and sim_jac(X, Y). As shown soon in the algorithm, sim_jac(X, Y) is already known when computing the upper bound. Thus, it takes only constant time.

Example 3 (Progressive upper bound) Consider the same multisets X and Y as in Example 1. We have |X| = 5, |Y| = 6, and sim_jac(X, Y) = 4/7 ≈ 0.57. Suppose u = 1; the upper bound of the Jaccard similarity after one update is sim_jac(X, Y)^{+1} = 5/6 ≈ 0.83.

Obviously, the progressive upper bound is monotonic with respect to the number of updates u. Thus, if the upper bound sim_jac(X, Y)^{+u} is not large enough, neither is any sim_jac(X, Y)^{+u′} (1 ≤ u′ < u).

Our strategy for deriving the upper bound after u steps can be further extended to other multiset-based similarity measures, such as the weighted cosine similarity, the weighted dice similarity, and the weighted overlap similarity. The Appendix gives the details of the extensions to other similarity measures.
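Since the bound depends only on |X|, |Y|, and the already-known similarity score, it is a constant-time formula. A minimal Python sketch (the function name is ours) that reproduces Example 3:

def jaccard_upper_bound(sim_xy, len_x, len_y, u):
    """Theorem 1: upper bound on sim_jac(X, Y') after u updates on Y."""
    alpha = len_x + len_y
    beta = (1.0 + sim_xy) * u
    return (sim_xy * alpha + beta) / (alpha - beta)

# Example 3: |X| = 5, |Y| = 6, sim = 4/7; after u = 1 update the bound is 5/6.
assert abs(jaccard_upper_bound(4 / 7, 5, 6, 1) - 5 / 6) < 1e-9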

4.2 A general pruning-based algorithm

A brute-force method is to compute the similarity score between the updated query and each object at each time instant, and then obtain the top-k list. This approach is costly and involves much unnecessary computation, since when the query window slides by only a small number of positions, such as 1 or 2, the changes in the similarity scores are limited. In this subsection, we present a heuristic algorithm, GP, that finds the exact top-k objects continuously.

For the sake of simplicity, let us assume that new elements come in a manner synchronized with time; that is, a new element arrives in the stream when there is an update in time. The main idea is based on the observation that the query at time t+1 shares a large portion of elements, namely at least (n−1)/n, with the query at time t. Thus, the change in the top-k list is limited. In other words, the objects with small similarity scores at time t have a very limited chance of entering the top-k list at time t+1.

To implement the above idea, we divide the objects into two categories according to their current similarity scores. The first category, C1, contains the objects whose similarity scores are small enough that they cannot enter the top-k list within the next several updates. The objects in this category do not need to be checked using the exact similarity scores at the next several time instants. Instead, we only need to maintain the upper bounds of their similarity scores.

The second category, C2, contains the other objects, those not in the first category. The objects in this category need to be checked by computing the exact similarity scores.

Suppose that at time instant t we obtain the top-k list of objects topk_t. Denote by score(topk_t[k]) the similarity score of the kth object, which is the least similar to the current query Q_t. Apparently, score(topk_t[k]) can be obtained without any extra cost. For each object T_i not in the list topk_t, we compute min_step_i, the smallest number of updates needed for object T_i to have its progressive upper bound exceed score(topk_t[k]). According to Theorem 1, min_step_i is the smallest integer satisfying score(topk_t[k]) < sim(T_i, Q_t)^{+min_step_i}. min_step_i can be easily computed for different similarity functions according to the respective specific forms of the upper bounds. For weighted Jaccard similarity, we have

min_step_i = ⌈ (score(topk_t[k]) · Σ_{i=1}^{|Σ|} max{x_i, y_i} − Σ_{i=1}^{|Σ|} min{x_i, y_i}) / (1 + score(topk_t[k])) ⌉   (5)
           = ⌈ ((score(topk_t[k]) − sim(T_i, Q_t)) · (|T_i| + |Q_t|)) / ((1 + score(topk_t[k])) · (1 + sim(T_i, Q_t))) ⌉   (6)

Given Eq. 6, we likewise need only constant time to compute min_step_i. For the other three similarities (see the Appendix), it is simply

min_step_i = ⌈ (score(topk_t[k]) − sim(T_i, Q_t)) / (sim(T_i, Q_t)^{+1} − sim(T_i, Q_t)) ⌉

Example 4 (Computing min_step_i) Again, consider the collection of sets R shown in Table 2 and the data stream S shown in Fig. 1a. As the size of the sliding window is 5, the list of the top-2 most similar objects in R at the current time t is {T1: 0.67, T4: 0.8}, where the numbers after the colons are the similarity scores. The similarity score of object T5 at the current time instant t is 3/8 ≈ 0.38, so T5 is not one of the top-2 results. However, for T5, Σ_{i=1}^{|Σ|} min{x_i, y_i} = 3 and Σ_{i=1}^{|Σ|} max{x_i, y_i} = 8, with x_i from T5's vector and y_i from Q_t's vector. Using Eq. 5, we can compute that, after min_step_5 = ⌈1.43⌉ = 2 updates, the progressive upper bound of T5, sim(T5, Q_t)^{+2} ≈ 0.83, is larger than the score of the current kth most similar object, score(topk_t[k]) = sim(T1, Q_t) = 0.67.
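A direct rendering of Eq. 6 in Python, reproducing Example 4 (the function name is ours; sigma denotes score(topk_t[k])):

import math

def min_step(sim, sigma, len_t, len_q):
    """Smallest u whose progressive upper bound exceeds sigma (Eq. 6)."""
    if sim >= sigma:
        return 0
    return math.ceil((sigma - sim) * (len_t + len_q)
                     / ((1.0 + sigma) * (1.0 + sim)))

# Example 4: T5 with sim = 3/8 against sigma = 2/3, |T5| = 6, |Q_t| = 5.
assert min_step(3 / 8, 2 / 3, 6, 5) == 2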

At time instant t, for each object T_i, we maintain min_step_i and the corresponding progressive upper bound sim(T_i, Q_t)^{+min_step_i}. Please note that the min_step_i values and the sim(T_i, Q_t)^{+min_step_i} values may be computed at time instant t or before. After processing the objects at time instant t−1, at time instant t, with respect to query Q_t, we process the objects as follows.

1. For all objects T_i such that min_step_i = 0, including those already in topk_{t−1}, we compute the exact similarity between T_i and Q_t and obtain a top-k list topk_t among those objects. These objects are assigned to the second category, C2. We set the similarity threshold σ = score(topk_t[k]).

2. For all objects T_i such that min_step_i > 0 but sim(T_i, Q_{t−1})^{+min_step_i} > σ, since their similarity to query Q_t may be greater than score(topk_t[k]), we have to compute their exact similarity to Q_t, too. We update the top-k list topk_t and also the similarity threshold σ = score(topk_t[k]). Obviously, in this step, σ = score(topk_t[k]) is monotonically increasing and never decreases. At the end of this step, topk_t is finalized. The objects whose exact similarity scores are larger than score(topk_t[k]), that is, those in the list topk_t, are also added to category C2.

3. We update min_step_i and the corresponding progressive upper bound sim(T_i, Q_t)^{+min_step_i} for the objects T_i ∈ C2.

4. The other objects, those not in C2, belong to the first category, C1: they have no hope of being more similar to Q_t than any object in the list topk_t and thus do not need to be checked in this round. We only need to decrease their min_step_i values by 1.

The pseudocode of the algorithm is given in Algorithms 1 and 2. In our implementation, we organize the objects in bins according to their min_step_i values and use a heap to maintain the top-k list. All the bins are maintained in a queue structure (called queue in the pseudocode).

Algorithm 1: A general pruning algorithm (GP)
Input: top-k results topk_{t−1} at previous time t−1; query Q_t at current time t
Output: top-k results topk_t at current time t
1  for u = 0 to |queue| − 1 do
2      foreach T_i ∈ queue[u] do
3          if u = 0 or sim(T_i, Q_t)^{+u} > score(topk_{t−1}[k]) then
4              delete T_i from bin queue[u]
5              compute sim(T_i, Q_t) and update topk_t
6              min_step_i ← null
7  remove bin queue[0]
8  foreach T_i with min_step_i = null do
9      (min_step_i, sim(T_i, Q_t)^{+min_step_i}) ← est_bound(sim(T_i, Q_t), score(topk_t[k]))
10     insert T_i into bin queue[min_step_i]

Algorithm 2: Update min_step_i and similarity bounds
Function: est_bound(sim(T_i, Q_t), score(topk_t[k]))
Input: T_i's similarity score sim(T_i, Q_t) at current time t; the kth largest similarity score score(topk_t[k]) at current time t
Output: the pair (min_step_i, sim(T_i, Q_t)^{+min_step_i})
1  if sim(T_i, Q_t) < score(topk_t[k]) then
2      compute min_step_i
3      sim(T_i, Q_t)^{+min_step_i} ← upper bound after min_step_i updates
4  else
5      min_step_i ← 0
6      sim(T_i, Q_t)^{+min_step_i} ← sim(T_i, Q_t)
7  return (min_step_i, sim(T_i, Q_t)^{+min_step_i})
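The following Python sketch condenses one round of GP using the helpers introduced above (weighted_jaccard, jaccard_upper_bound, min_step); the per-object state dictionaries last_sim, age, and steps are our own simplification of the bin queue, not the paper's exact implementation.

import heapq

def gp_round(objects, query, last_sim, age, steps, k, sigma_prev):
    """objects: id -> Counter; returns the exact top-k ids and the new sigma."""
    len_q = sum(query.values())
    for t, obj in objects.items():
        age[t] += 1                      # one more update since last verification
        bound = jaccard_upper_bound(last_sim[t], sum(obj.values()), len_q, age[t])
        if age[t] >= steps[t] or bound > sigma_prev:    # due, or bound violated
            last_sim[t] = weighted_jaccard(obj, query)  # verify exactly (C2)
            age[t] = 0
    verified = [t for t in objects if age[t] == 0]
    topk = heapq.nlargest(k, verified, key=lambda t: last_sim[t])
    sigma = last_sim[topk[-1]]
    for t in verified:                   # re-bin by the new threshold
        steps[t] = min_step(last_sim[t], sigma, sum(objects[t].values()), len_q)
    return topk, sigma

Objects in the current top-k receive min_step = 0 and are therefore re-verified in every round, matching step 1 above.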

5 A MinHash-based method

In this section, we use the MinHash technique [7], a locality-sensitive hashing (LSH) scheme, to approximate Jaccard similarity. We design indices for efficiently updating the estimated similarity scores. Since the MinHash technique is designed for estimating sets rather than multisets, we limit the definition of the current query to the set of its distinct elements.

5.1 Approximating Jaccard similarity over a static query

We first discuss how to estimate Jaccard similarity using MinHash for static objects and queries.

Definition 1 (Image under permutation [2]) Denote by {[n]} the set {0, ..., n−1}. A permutation π on {[n]} is a bijective function (one-to-one correspondence) from {[n]} to itself. If x ∈ {[n]}, then π(x), the value of π when applied to x, is the image of x under π. The image of a subset X ⊆ {[n]} under π is π[X] = {y | y = π(x) for some x ∈ X}.

Definition 2 (Min-wise independent permutations [6]) Let S_n be the set of all permutations of {[n]}. A family of permutations F ⊆ S_n is min-wise independent if for any set X ⊆ {[n]} and any element x ∈ X, we have

Pr(min{π[X]} = π(x)) = 1/|X|   (7)

where the permutation π is chosen randomly from F.

The following property enables efficient estimation of the Jaccard similarity between two sets.

Theorem 2 (Jaccard similarity estimation [20]) Consider a query Q and an object T_i whose elements are drawn from an alphabet Σ. Let h be a hash function that maps the elements of Σ to distinct integers in the range [0, |Σ|−1] and is randomly picked from a family of min-wise independent permutations. Let the MinHash value min(h(T_i)) be the element of T_i with the smallest hash value. Then, Pr[min(h(T_i)) = min(h(Q))] = |T_i ∩ Q| / |T_i ∪ Q|.

The MinHash values of all the objects and that of the query can be stored in a MinHash signature matrix M, where each entry M(i, j) is the MinHash value of the jth itemset under hash function h_i. Based on Theorem 2, the Jaccard similarity between two sets can be estimated by the ratio of the number of rows containing the same MinHash values to the total number of rows in the signature matrix. We illustrate the computation in Example 5.

Table 5 An example of Jaccard similarity estimation.
(a) Matrix representation of the sets: each column represents one of T1, T2, T3, T4, Q as a 0/1 membership vector over the elements a–e (entries omitted here).
(b) Random permutations: five permutations h1–h5 of the elements, used as min-wise independent hash functions; e.g., h4 maps a, b, c to 1, 3, 4 (full values omitted here).
(c) Signature matrix:

     T1  T2  T3  T4  Q
h1   a   a   e   e   e
h2   c   a   c   e   c
h3   b   d   b   d   b
h4   a   d   e   d   d
h5   a   a   e   a   d

(d) Exact and approximated Jaccard similarity: the estimated scores sim*(T_i, Q) for T1–T4 are 2/5, 1/5, 3/5, and 2/5, respectively (the exact scores are omitted here).

Example 5 (Jaccard similarity estimation) Consider the matrix representation of an example with four objects and a static query Q, shown in Table 5a. Each column in the matrix represents a set, and an element is in the set if the corresponding entry is 1. We apply the five random permutations defined in Table 5b as the min-wise independent hash functions on the objects and the query. The signature matrix is shown in Table 5c. For example, since object T1 contains elements a, b, and c, the corresponding hash values according to h4 are 1, 3, and 4, respectively. The MinHash value of T1 according to h4 is a, since a is the element of T1 with the smallest hash value. We can estimate the Jaccard similarity score between two sets by computing the ratio of the number of rows containing the same MinHash values to the number of random permutations. For example, the MinHash values of T1 and Q are the same according to random permutations h2 and h3. Thus, the estimated Jaccard similarity between T1 and Q is 2/5. We compare the estimated Jaccard similarity with the exact Jaccard similarity for each object in Table 5d. When the objects are ordered in descending order of estimated similarity score with regard to the query, we have T3 > T1 = T4 > T2, which is identical to the order using exact similarity. Therefore, in this example, we can get the top-k result accurately using estimated scores.
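A compact Python sketch of this estimation scheme, using explicit random permutations of the alphabet as the hash family (the names are ours; a production implementation would likely use universal hashing rather than materialized permutations):

import random

def make_permutations(alphabet, num_hashes, seed=0):
    """One random permutation of the alphabet per hash function."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_hashes):
        ranks = list(range(len(alphabet)))
        rng.shuffle(ranks)
        perms.append(dict(zip(alphabet, ranks)))   # element -> hash value
    return perms

def signature(s, perms):
    """MinHash value of set s under each permutation (one matrix column)."""
    return [min(s, key=p.__getitem__) for p in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching rows in the two signature columns."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)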

5.2 Approximating Jaccard similarity over an evolving query

To answer evolving queries, given a fixed set of hash functions, the MinHash values of all the objects are fixed, and only the MinHash values of the query are subject to modification. That is, we only need to update the MinHash values of the query. Therefore, we have a fast solution, shown in Algorithm 3.

Algorithm 3 first computes the MinHash signature of the updated query. To determine whether we need to update the estimated Jaccard similarity of object T_j with respect to the updated query Q_t according to a hash function h_i, we consider the following four cases, based on whether the MinHash values of the object and the queries are equal.

Case 1: If min(h_i(T_j)) ≠ min(h_i(Q_{t−1})) and min(h_i(T_j)) = min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query should be increased by 1/|H|, where |H| is the number of hash functions applied.

Case 2: If min(h_i(T_j)) = min(h_i(Q_{t−1})) and min(h_i(T_j)) ≠ min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query should be decreased by 1/|H|.

Case 3: If min(h_i(T_j)) = min(h_i(Q_{t−1})) and min(h_i(T_j)) = min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query remains unchanged.

Case 4: If min(h_i(T_j)) ≠ min(h_i(Q_{t−1})) and min(h_i(T_j)) ≠ min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query also remains unchanged.
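The case analysis translates directly into an incremental update: when the window slides, only the hash functions whose query MinHash value changed can move an estimate, and then only by ±1/|H|. A sketch (the dictionary names are ours):

def update_estimates(signatures, sig_q_old, sig_q_new, est):
    """signatures: object id -> signature list; est: id -> estimated score."""
    H = len(sig_q_old)
    for tid, sig_t in signatures.items():
        for i in range(H):
            old_match = sig_t[i] == sig_q_old[i]
            new_match = sig_t[i] == sig_q_new[i]
            if not old_match and new_match:     # Case 1: gained a match
                est[tid] += 1.0 / H
            elif old_match and not new_match:   # Case 2: lost a match
                est[tid] -= 1.0 / H
            # Cases 3 and 4: match status unchanged, estimate unchanged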

Algorithm 3: A MinHash-based algorithm (MHB)
Input: current query Q_t; |H| hash functions; the set of previous MinHash values {min(h_i(Q_{t−1}))}; the MinHash table for the objects in R
Output: top-k results topk_t at current time t
1  compute {min(h_i(Q_t))}
2  foreach j ∈ {1, ..., |R|} do
3      foreach i ∈ {1, ..., |H|} do
4          if min(h_i(T_j)) ≠ min(h_i(Q_{t−1})) and min(h_i(T_j)) = min(h_i(Q_t)) then
5              sim*(T_j, Q_t) ← sim*(T_j, Q_{t−1}) + 1/|H|
6          else if min(h_i(T_j)) = min(h_i(Q_{t−1})) and min(h_i(T_j)) ≠ min(h_i(Q_t)) then
7              sim*(T_j, Q_t) ← sim*(T_j, Q_{t−1}) − 1/|H|
8      if T_j ∉ topk_t and sim*(T_j, Q_t) > score(topk_t[k]) then
9          update topk_t

We can further reduce the computational cost using inverted indices. The MinHash signature matrix that stores the MinHash values of the objects can be transformed into |H| inverted indices, whose structure is shown in Fig. 2. For each hash function, the corresponding inverted index stores a mapping from each existing MinHash value to a list of object ids, that is, an inverted list. Moreover, the elements in each inverted list are sorted in ascending order of object id, which enables more efficient list merge and intersection operations. In this way, we can efficiently retrieve the objects with a specified MinHash value according to a certain hash function.

Fig. 2 Inverted indices for MinHash values

The algorithm that uses the inverted indices is presented in Algorithm 4. The MinHash signature of the updated query is computed first. We need to search the inverted index of a hash function h_i only when the MinHash values of Q_t and Q_{t−1} regarding h_i are different. Suppose the MinHash value of the updated query changes according to h_i. To determine the objects whose similarity scores need to be updated, we recall Case 1 and Case 2 mentioned earlier in this section. Each case corresponds to a scan of one inverted list of h_i. Specifically, in Case 1, we increase the estimated Jaccard similarity by 1/|H| for the objects in the inverted list of hash value min(h_i(Q_t)). That is, instead of checking each object as in Algorithm 3, we search for the objects that satisfy Case 1 using the index. Similarly, in Case 2, we decrease the estimated Jaccard similarity by 1/|H| for the objects in the inverted list of hash value min(h_i(Q_{t−1})).
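A sketch of the inverted indices and of the index-based update that Algorithm 4 below formalizes (the structure names are ours): each index maps a MinHash value to the ascending list of object ids taking that value, so an update touches only the two affected inverted lists per changed hash function.

from collections import defaultdict

def build_indices(signatures, H):
    """indices[i][v] = ids of objects whose MinHash value under h_i is v."""
    indices = [defaultdict(list) for _ in range(H)]
    for tid in sorted(signatures):          # ascending id within each list
        for i, v in enumerate(signatures[tid]):
            indices[i][v].append(tid)
    return indices

def mhi_update(indices, sig_q_old, sig_q_new, est):
    H = len(sig_q_old)
    for i in range(H):
        if sig_q_old[i] == sig_q_new[i]:
            continue                        # h_i contributes no change
        for tid in indices[i][sig_q_new[i]]:     # Case 1: now matching
            est[tid] += 1.0 / H
        for tid in indices[i][sig_q_old[i]]:     # Case 2: no longer matching
            est[tid] -= 1.0 / H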

Algorithm 4: A MinHash-based algorithm using inverted indices (MHI)
Input: current query Q_t; |H| hash functions; the set of MinHash values {min(h_i(Q_{t−1}))}; the inverted indices
Output: top-k results topk_t at current time t
1  compute {min(h_i(Q_t))}
2  foreach i ∈ {1, ..., |H|} do
3      if min(h_i(Q_{t−1})) ≠ min(h_i(Q_t)) then
4          for T_j ∈ index_i[min(h_i(Q_t))] do
5              sim*(T_j, Q_t) ← sim*(T_j, Q_t) + 1/|H|
6          for T_j ∈ index_i[min(h_i(Q_{t−1}))] do
7              sim*(T_j, Q_t) ← sim*(T_j, Q_t) − 1/|H|
8  foreach j ∈ {1, ..., |R|} do
9      if T_j ∉ topk_t and sim*(T_j, Q_t) > score(topk_t[k]) then
10         update topk_t

The worst-case time complexity of this algorithm is O(|H| · |R|), where |R| is the number of objects and |H| is the number of hash functions. It is reached when the MinHash values of consecutive queries differ under all the hash functions. However, this case rarely occurs, because the Jaccard similarity between consecutive queries is in fact very high.

6 Experimental results

In this section, we discuss our experimental results on a series of synthetic data sets and two real data sets. All experiments were conducted on a Mac Pro (Late 2013) server with an Intel Xeon 3.70 GHz CPU, 64 GB of memory, and a 256 GB SSD. All the algorithms were implemented in Python 3 and run with PyPy, a much faster just-in-time compiler.

6.1 Results on synthetic data sets

To evaluate the effectiveness and efficiency of our two methods, the general pruning-based method (GP for short) and the MinHash-based method (MHI for short), we first report the experimental results on synthetic data sets. We used the IBM Quest data generator to produce the synthetic data sets. We conducted experiments to test the efficiency and accuracy of our methods with respect to the following parameters:

- k: the number of results in the top-k list;
- |Q_t|: the parameter controlling both the number of items per query and the average number of items per object;
- |Σ|: the alphabet size;
- |H|: the number of different hash functions;
- |R|: the number of objects.

To generate synthetic data sets using the IBM Quest data generator, we set the parameters |Q_t|, |Σ|, and |R|. The synthetic query stream was generated by concatenating random objects whose lengths are between 0.8|Q_t| and 1.2|Q_t|, where |Q_t| is the average object length of the data set.

Table 6 Values of controlled variables for the tests on Jaccard similarity

Data set       | Varying | k        | |Q_t|    | |Σ|      | |H|      | |R|      | Figures
Synthetic      | k       | multiple | —        | 10^4     | —        | —        | 3a, 4a, 6a
Synthetic      | k       | multiple | —        | 20       | —        | 20,000   | 3b, 4b, 6b
Synthetic      | |Q_t|   | 10       | multiple | —        | —        | —        | 3d, 4d, 6d
Synthetic      | |Σ|     | —        | —        | multiple | —        | —        | 3c, 4c, 6c
Synthetic      | |H|     | —        | —        | —        | multiple | —        | 3e, 6e
Synthetic      | |R|     | —        | —        | —        | —        | multiple | 3f, 4e, 5a, 6f
Market basket  | k       | multiple | 10       | 16,470   | —        | 88,162   | 7a–c
Market basket  | |H|     | —        | —        | 16,470   | multiple | 88,162   | 7g, h
Market basket  | |R|     | —        | —        | 16,470   | —        | multiple | 7d–f, 5b
Click stream   | k       | multiple | —        | 17       | —        | 31,790   | 8a–c
Click stream   | |H|     | —        | —        | 17       | multiple | 31,790   | 8g, h
Click stream   | |R|     | —        | —        | 17       | —        | multiple | 5c, 8d–f

The way we produce synthetic query streams mimics some real application scenarios. Take the online RPG gaming scenario as an example. Suppose one is exploring the game map and walks from one game scene to another. While each scene has its own combination of enemies (the objects here), when moving from one scene to another, the most suitable query is the combination of enemies from both scenes (the concatenation of the two objects within the sliding window).

In each experiment, we varied one factor and fixed the others. Table 6 shows the values of the controlled variables along with the corresponding figures for each test.

We compare the average querying time of GP and MHI with two baseline methods, namely BFM and MHB. BFM is a brute-force method that computes the exact similarity score of every object with respect to a query. MHB is a method based on the MinHash technique that does not use any indexing structures.

To better illustrate how the upper bounds derived in Sect. 4 can be used to prune unpromising objects, we define the pruning effectiveness with respect to an update as follows.

Definition 3 (Pruning effectiveness of GP) Suppose we have |R| objects in total and we compute the exact similarity scores for |R′| of them during an update. The pruning effectiveness with respect to this update is 1 − |R′|/|R|.

In this section, we report the average pruning effectiveness over the stream updates. Moreover, we provide the peak memory usage of all the methods when testing scalability. The accuracy in our context is defined as follows.

Definition 4 (Accuracy) For a method A that returns a top-k list topk_A_t with respect to query Q_t, the accuracy of A is the proportion of objects in topk_A_t whose exact similarity scores with respect to Q_t are no smaller than the similarity between topk_t[k] and Q_t, where topk_t is the ground-truth top-k list. That is,

acc_A = |{T | T ∈ topk_A_t ∧ sim(T, Q_t) ≥ sim(topk_t[k], Q_t)}| / k
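Both evaluation measures are simple ratios; a small sketch of how we read Definitions 3 and 4 (the argument names are ours):

def pruning_effectiveness(num_verified, num_objects):
    """Definition 3: fraction of objects whose exact scores were NOT computed."""
    return 1.0 - num_verified / num_objects

def accuracy(reported_ids, exact_sim, truth_kth_score, k):
    """Definition 4: reported objects reaching the ground-truth kth score."""
    hits = sum(1 for t in reported_ids if exact_sim[t] >= truth_kth_score)
    return hits / k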

The idea here is to check the percentage of objects reported in the answer that indeed have a similarity score passing the minimum similarity threshold set by the kth most similar object in the ground truth. The higher this percentage, the more accurate the answer. The pruning algorithm GP always returns the exact query results and thus has 100 % accuracy. Since the MinHash-based methods compute similarity scores approximately, MHB and MHI give estimated answers to top-k queries. We report the average accuracy of the MinHash methods in Fig. 6.

Efficiency

The average querying time of the four methods when k varies is shown in Fig. 3a, b on two synthetic data sets with a large alphabet (10^4) and a small one (20), respectively. In both cases, our baseline methods, BFM and MHB, show almost no change in average processing time. MHI, which uses inverted indices on the MinHash values, greatly outperforms the other three methods. The pruning effectiveness of GP is shown in Fig. 4a, b for alphabet sizes 10^4 and 20, respectively. The pruning effectiveness of GP drops gradually as k increases, because the similarity score of the kth best object in the top-k result tends to decrease with k. With the other parameters fixed, a larger alphabet also makes the similarity score of the kth best object smaller; thus, the pruning effectiveness generally is weaker for larger alphabet sizes.

Figures 3c and 4c illustrate the trends when the alphabet size changes. When the alphabet size increases dramatically, the pruning effectiveness of GP gradually drops from 69.2 to 56.3 %. Thus, the average processing time of GP slightly increases. The average running time of BFM remains stable with respect to the alphabet size. The average running time of MHI first decreases and then increases. When the alphabet size is small, the length of each inverted list is relatively large, so a change in a MinHash value of the query may result in many updates to the estimated similarity scores. When the alphabet size becomes very large, although the size of each inverted list is small, the number of unmatched MinHash values between the signatures of the previous query and the new query increases, which results in more searches on the inverted indices.

We also examined how the average query answering time changes with respect to the average object length. The results on efficiency and pruning effectiveness are shown in Figs. 3d and 4d, respectively. The average processing time of BFM increases linearly, while that of MHB does not increase greatly, because we transform objects of varied length into MinHash signatures of the same length. MHI still performs much better than the others, and its cost increases linearly: when the average object length becomes larger, the size of each inverted list becomes larger, too, which results in longer processing time. When the average object length increases, the pruning effectiveness of GP increases dramatically, starting from 61.1 %. However, its average running time still increases slightly, since there is an increase in the cost of computing exact similarity scores for larger sets in the verification phase.

The performance with respect to the number of hash functions is shown in Fig. 3e. The average processing time of both MHB and MHI increases; MHI's growth is roughly linear. The scalability of our methods is shown in Fig. 3f. The runtime of all the methods increases almost linearly as the number of objects increases. The pruning effectiveness of GP increases from 48.3 to 64.2 % when the number of objects increases from 10^4 to 10^6, as shown in Fig. 4e.

The peak memory usage of our methods is shown in Fig. 5a. BFM and GP follow a similar growth trend, and GP requires only a little more memory than BFM, because we use extra data structures, such as the bin queue and a max-heap, to store the upper bounds of similarity scores. Compared to BFM and GP, the MinHash-based methods need more space, since we need to store a MinHash signature for each object. MHB costs less space than MHI because we store MinHash signatures differently in the two methods: for MHB, we simply store each MinHash signature as an array, while in MHI, we construct an inverted index for each hash function to enable more efficient updates, which requires more space.

Fig. 3 Average processing time on the synthetic data sets (methods: BFM, MHB, GP, MHI): (a) vary k, large |Σ|; (b) vary k, small |Σ|; (c) vary |Σ|; (d) vary |Q_t|; (e) vary |H|; (f) vary |R|.

Fig. 4 Pruning effectiveness of GP on the synthetic data sets: (a) vary k, large |Σ|; (b) vary k, small |Σ|; (c) vary |Σ|; (d) vary |Q_t|; (e) vary |R|.

Accuracy

We also tested the accuracy of the MinHash-based algorithms. By our definition, the accuracy of the two MinHash-based algorithms is the same given the same set of hash functions. We denote the MinHash methods by MH in the figures that report accuracy. The accuracy can be affected by two factors: the number of hash functions used, and the intrinsic characteristics of the data set, such as the distribution of similarity scores among objects.

Fig. 5 Peak memory usage on the different data sets: (a) synthetic data set, (b) market basket data set, (c) click stream data set.

The results show that the MinHash-based methods achieve an accuracy ranging from 68 to 98 % on the synthetic data sets when the default number of hash functions is used. Figure 6a, b shows the change in accuracy when k increases, on the data sets with a large alphabet and a small alphabet, respectively. On the data set with alphabet size 10^4, the accuracy decreases gradually, from 98 to 68.7 %. When the alphabet size is 20, the average accuracy does not vary much but still shows a slightly descending trend.

The average accuracy increases quickly and then decreases as the alphabet size increases, as shown in Fig. 6c. The highest accuracy is achieved at a moderate alphabet size. The reason is as follows. With a fixed number of objects, when the alphabet size is small, there are many similar objects in the data set, which lowers the accuracy. In contrast, when the alphabet size is too large, the objects in the data set are less similar, which also leads to slightly lower accuracy.

In Fig. 6d, when the average object length increases from 10 to 10^3, the accuracy drops drastically, from 86.7 to 10 %. The reason is apparent: in MinHash, the Jaccard similarity is the probability of two objects having the same minimal hash values with respect to all the hash functions. With a fixed number of hash functions, the larger the window size (the average object length here), the smaller the portion of the elements in the window that are hashed to the common MinHash value, which leads to a less accurate estimation. Thus, more hash functions are needed to achieve the same level of accuracy when the object length becomes larger.

Fig. 6 Average accuracy of the hashing-based method (MH) on the synthetic data sets: (a) vary k, large |Σ|; (b) vary k, small |Σ|; (c) vary |Σ|; (d) vary |Q_t|; (e) vary |H|; (f) vary |R|.

We also tested the trend of accuracy when the number of hash functions and the number of objects increase. The results are shown in Fig. 6e and f, respectively. When more hash functions are used, the estimated Jaccard similarity is closer to the exact score, which leads to higher accuracy. As shown in Fig. 6e, the results follow this trend; our methods achieve average accuracy ranging from 86.7 to 95 %. Figure 6f suggests that the average accuracy increases gradually when the number of objects increases.

Table 7 Statistics of the real data sets

Data set       | |R|    | |Σ|    | Avg. |T_i|
Market basket  | 88,162 | 16,470 | —
Click stream   | 31,790 | 17     | 5.33

6.2 Results on real data sets

We conducted experimental studies on two real data sets: a Market Basket data set from an anonymous Belgian retail store, and a Click Stream data set in which a predefined category, such as news or tech, replaces each URL on the MSNBC website. The data processing was conducted by the provider. Moreover, the provider of the Click Stream data set also removed the short sequences of length no larger than 8. We then transformed each sequence into a set. For the Click Stream data set, each static object is the set of categories of the URLs visited by a user, while for the Market Basket data set, each static object is the set of items purchased by a customer. There is a significant difference in the alphabet sizes of the two real data sets: the Market Basket data set has 16,470 distinct items, while the Click Stream data set contains only 17 distinct items. Table 7 shows the detailed statistics of the data sets.

To generate the querying stream, we used the same method as for the synthetic data sets; that is, we concatenated randomly selected objects from the data set whose sizes are between 0.8|Q_t| and 1.2|Q_t|, where |Q_t| is the average object length of the data set.

For each data set, we compared the average processing time and the average accuracy when k or |H| varies, respectively. We also include scalability tests on both time and peak memory usage. The results are shown in Figs. 5, 7, and 8.

The trends of the curves for the Market Basket data set are consistent with the results on the synthetic data sets. To compare the average processing time of the different methods when k varies, we set the number of hash functions to its default value. The processing time of BFM and MHB is stable. MHI always has the best performance, and its processing time increases gradually. The pruning effectiveness of GP decreases gradually from 64.9 to 38.3 % when k increases; therefore, the processing time of GP increases with k. The MinHash-based algorithms always achieve accuracy over 70 % across the tested range of k, as shown in Fig. 7b.

Figure 7g, h shows the trends of the processing time and accuracy of the MinHash-based methods, respectively, when different numbers of hash functions are used. The average processing time of both MHB and MHI increases almost linearly with the number of hash functions. MHI is between 31 and 96 times faster than MHB. The average accuracy increases steadily, from 73.1 to 93.4 %, as the number of hash functions increases to 500.

We also tested the scalability on this data set by duplicating it 5, 10, 50, and 100 times. We report the average processing time, average accuracy, average pruning effectiveness, and peak memory usage in Figs. 7d–f and 5b, respectively. Figure 7d shows that the processing time of all the methods increases linearly as the data set size increases. The accuracy first increases drastically, then reaches a level of over 95 % and remains steady. The pruning effectiveness does not change much across scale factors.

Fig. 7 Results on the market basket data set (methods: BFM, MHB, GP, MHI): (a) vary k (time), (b) vary k (accuracy), (c) vary k (pruning effectiveness), (d) vary |R| (time), (e) vary |R| (accuracy), (f) vary |R| (pruning effectiveness), (g) vary |H| (time), (h) vary |H| (accuracy).

Fig. 8 Results on the click stream data set (methods: BFM, MHB, GP, MHI): (a) vary k (time), (b) vary k (accuracy), (c) vary k (pruning effectiveness), (d) vary |R| (time), (e) vary |R| (accuracy), (f) vary |R| (pruning effectiveness), (g) vary |H| (time), (h) vary |H| (accuracy).

25 Continuous similarity search for evolving queries The peak memory usage is shown in Fig. 5b. The peak memory size used of all the methods increases almost linearly when the data set increases. Similar to the results in the scalability test on the synthetic data sets, GP uses only a little more memory than BFM. Their trends are similar. The MinHash-based methods consume more space than BFM and GP. The reason is the same as explained on the synthetic data sets in Sect The results on the click stream data set are shown in Figs. 8 and 5c. Since the average length of the objects in this data set is only 5.33, the MinHash-based algorithms do not need many hash functions in order to achieve high accuracy. As shown in Fig. 8h, the average accuracy is higher than 90 % when 40 or more hash functions are used. To test the performance when k varies, we used 50 hash functions as the default. When k varies, MHI is still the best method with respect to average query answering time. The average accuracy drops gradually from 90 to 67 %. As shown in Fig. 8c, the pruning effectiveness of GP decreases from 75.8 to 54.2 %, which is generally better than the results on the market basket data set. We tested the scalability on the click stream data set by duplicating the data set 10,, and 0 times. Figures 8d f and 5c show how the average processing time, average accuracy, average pruning effectiveness, and peak memory usage change with respect to different scale of data set, respectively. The average processing time of all the methods increases almost linearly when the data set size increases. MHI always achieves the lowest processing time and outperforms the baselines by more than an order of magnitude. Moreover, it reports with reasonably high accuracy in the range of 81.6 and 90 % as shown in Fig. 8e. When the data set size increases, the pruning effectiveness of GP increases steadily and is generally higher than the results on the market basket data set. The memory usage of BFM, GP, and MHB is very close and the corresponding curves in Fig. 5c overlap with each other. Comparing to the other data sets, the memory usage of MHB is much closer to BFM and GP because we use less number of hash functions. MHI still consumes more memory due to the inverted indices data structure. However, its memory usage is much closer to the other three methods comparing to the results on the other data sets. Recall that the size of our indexing structure depends on the number of hash functions used and the alphabet size. 6.3 Summary We conclude the following from our experiments. First, the pruning-based method is an exact method and the hashing-based method can achieve very good accuracy when a few hundreds of hash functions are used. The hashing-based method can lead to good accuracy and can achieve faster average querying time than the pruning-based method in most cases. However, when k increases, the accuracy drops and the running time increases greatly. The pruningbased method runs slower but maintains relatively stable running time with increasing k. For the hashing-based method, usually larger number of hash functions H provides better accuracy but also takes more running time. However, when the number of hash functions exceeds 1,000, the increase in accuracy usually is very minor. 
As shown on the synthetic data sets, when the average object length becomes larger, the accuracy of the hashing-based method drops greatly while the pruning effectiveness of the pruning-based method increases significantly, which implies that the pruning-based method is preferable in this case. In general, both methods achieve better performance on data sets with a larger lexicon: the sparser the data, the more pronounced the differences between similarity scores, and usually the better the runtime performance.
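To make the accuracy–cost tradeoff of the hashing-based method concrete, the following is a minimal sketch of MinHash estimation over plain (unweighted) sets. It is illustrative only: the method evaluated above extends the same idea to weighted Jaccard similarity and maintains signatures incrementally, and all names and parameters below are of the sketch, not of our implementation.

import random

PRIME = 2_147_483_647  # a large prime for the universal hash family

def make_hash_funcs(h, seed=0):
    # One random pair (a, b) per hash function: x -> (a * x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(h)]

def signature(items, funcs):
    # The MinHash signature stores, for each hash function, the minimum
    # hash value over all item ids in the set.
    return [min((a * x + b) % PRIME for x in items) for (a, b) in funcs]

def estimate_jaccard(sig_x, sig_y):
    # Two sets agree at a signature position with probability equal to their
    # Jaccard similarity, so the agreement rate is an unbiased estimate.
    return sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)

funcs = make_hash_funcs(h=50)  # 50 hash functions, as in the click stream test
x, y = {1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}
print(estimate_jaccard(signature(x, funcs), signature(y, funcs)))
# true Jaccard similarity: |{3,4,5}| / |{1,...,7}| = 3/7, about 0.43

Since each signature position agrees with probability equal to the true similarity J, the standard error of the estimate is sqrt(J(1-J)/H). This explains why accuracy improves quickly over the first tens or hundreds of hash functions and only marginally beyond 1,000.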

7 Conclusions

In this paper, we studied the problem of continuous similarity search for evolving queries. To the best of our knowledge, this is the first research endeavor on this problem. We devised two efficient methods with different frameworks. The pruning-based method uses pruning strategies to reduce the cost of computing exact similarity scores. The MinHash-based method approximates the Jaccard similarity score using the MinHash technique and efficiently updates the estimated scores using indexing structures. The empirical results on synthetic and real data sets verify the effectiveness and efficiency of our methods.

As for future work, we plan to consider the following interesting directions.

Enhance the pruning effectiveness. For example, we may incorporate the evolving feature into the pruning techniques used in static similarity join and search algorithms to further prune candidates.

Improve the hashing-based method for other similarity measures. We may extend our hashing-based framework to other similarity and distance measures that can be approximated by an LSH scheme, such as the one used in MinHash.

Similarity search over multiple evolving streams. Given a static object and multiple data streams with a fixed sliding window size, we may want to continuously find the top-k data streams whose last n items are most similar to the static object. The techniques proposed in this paper can be further extended to this problem.

Appendix: Other similarity measures and their upper bounds for the pruning method

In this section, we extend the upper bounds used by the pruning method to weighted overlap, weighted cosine, and weighted dice similarity. Similar to the case of weighted Jaccard similarity, all of these bounds are monotonic with respect to the number of updates $u$. Throughout, $x_i$ and $y_i$ denote the multiplicities of the $i$-th item in the multisets $X$ and $Y$, respectively, $m$ is the alphabet size, $|X| = \sum_{i=1}^{m} x_i$, and $Y'$ is the multiset obtained from $Y$ by $u$ updates.

Property 1 (A progressive upper bound for weighted overlap similarity) We first define the weighted overlap similarity as

$\mathrm{sim}_{\mathrm{over}}(X, Y) = \sum_{i=1}^{m} \min(x_i, y_i)$ (8)

Given $X$, $Y$, and the weighted overlap similarity score $\mathrm{sim}_{\mathrm{over}}(X, Y)$, without knowledge of the updated elements in $Y'$, we have the upper bound

$\mathrm{sim}_{\mathrm{over}}(X, Y') \le \mathrm{sim}_{\mathrm{over}}(X, Y) + u$ (9)

Proof By definition, $\mathrm{sim}_{\mathrm{over}}(X, Y') = \sum_{i=1}^{m} \min(x_i, y'_i)$. Each update changes the multiplicity of a single item in $Y$ by one and thus can increase one term $\min(x_i, y_i)$ by at most 1. Hence, the maximum possible increase after $u$ updates is $u$.

Property 2 (A progressive upper bound for weighted cosine similarity) We first define the weighted cosine similarity as

$\mathrm{sim}_{\cos}(X, Y) = \frac{\sum_{i=1}^{m} \min(x_i, y_i)}{\sqrt{|X|\,|Y|}}$ (10)

Given $X$, $Y$, and the weighted cosine similarity score $\mathrm{sim}_{\cos}(X, Y)$, without knowledge of the updated elements in $Y'$, we have the upper bound

$\mathrm{sim}_{\cos}(X, Y') \le \mathrm{sim}_{\cos}(X, Y) + \frac{u}{\sqrt{|X|\,|Y|}}$ (11)

Proof In our scenario, the sliding window size is fixed, so $|Y'| = |Y|$ and thus $\sqrt{|X|\,|Y'|} = \sqrt{|X|\,|Y|}$. By Property 1, we have

$\mathrm{sim}_{\cos}(X, Y') = \frac{\sum_{i=1}^{m} \min(x_i, y'_i)}{\sqrt{|X|\,|Y'|}} \le \frac{\sum_{i=1}^{m} \min(x_i, y_i) + u}{\sqrt{|X|\,|Y|}} = \mathrm{sim}_{\cos}(X, Y) + \frac{u}{\sqrt{|X|\,|Y|}}$

Property 3 (A progressive upper bound for weighted dice similarity) We first define the weighted dice similarity as

$\mathrm{sim}_{\mathrm{dice}}(X, Y) = \frac{2\sum_{i=1}^{m} \min(x_i, y_i)}{|X| + |Y|}$ (12)

Given $X$, $Y$, and the weighted dice similarity score $\mathrm{sim}_{\mathrm{dice}}(X, Y)$, without knowledge of the updated elements in $Y'$, we have the upper bound

$\mathrm{sim}_{\mathrm{dice}}(X, Y') \le \mathrm{sim}_{\mathrm{dice}}(X, Y) + \frac{2u}{|X| + |Y|}$ (13)

Proof In our scenario, $|Y'| = |Y|$, so $|X| + |Y'| = |X| + |Y|$. By Property 1, we have

$\mathrm{sim}_{\mathrm{dice}}(X, Y') = \frac{2\sum_{i=1}^{m} \min(x_i, y'_i)}{|X| + |Y'|} \le \frac{2\sum_{i=1}^{m} \min(x_i, y_i) + 2u}{|X| + |Y|} = \mathrm{sim}_{\mathrm{dice}}(X, Y) + \frac{2u}{|X| + |Y|}$
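To illustrate how these progressive bounds support pruning, here is a minimal sketch assuming multisets are stored as item-to-multiplicity dictionaries. The helper names (size, overlap, can_skip) and the threshold test are illustrative and do not reproduce the full pruning algorithm in the paper.

import math

def size(x):
    # |X|: total multiplicity of a multiset stored as {item: count}.
    return sum(x.values())

def overlap(x, y):
    # Weighted overlap: sum over items of the smaller multiplicity (Eq. 8).
    return sum(min(c, y.get(i, 0)) for i, c in x.items())

def overlap_bound(prev, u):
    # Property 1: u updates raise the overlap by at most u (Eq. 9).
    return prev + u

def cosine_bound(prev, u, x, y):
    # Property 2: the window size is fixed, so |Y'| = |Y| and the
    # denominator does not change (Eq. 11).
    return prev + u / math.sqrt(size(x) * size(y))

def dice_bound(prev, u, x, y):
    # Property 3: the numerator grows by at most 2u while the
    # denominator is fixed (Eq. 13).
    return prev + 2 * u / (size(x) + size(y))

def can_skip(prev, u, threshold):
    # Pruning test for the overlap measure: if even the optimistic bound
    # cannot beat the current top-k threshold, the exact score need not
    # be recomputed for this object.
    return overlap_bound(prev, u) <= threshold

Because each bound is computed in constant time from the previously known score and the number of updates, an object needs to be re-examined exactly only when its optimistic bound exceeds the current top-k threshold.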

References

1. Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Proceedings of the 47th annual IEEE symposium on foundations of computer science, FOCS 06, Washington, DC, USA. IEEE Computer Society
2. Artin E (2011) Geometric algebra. Wiley, Hoboken
3. Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on World Wide Web, WWW 07, New York, NY, USA. ACM
4. Böhm C, Ooi BC, Plant C, Yan Y (2007) Efficiently processing continuous k-nn queries on data streams. In: Proceedings of the international conference on data engineering, ICDE 07, Washington, DC, USA. IEEE Computer Society
5. Broder A (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences 1997, SEQUENCES 97, Washington, DC, USA. IEEE Computer Society
6. Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (2000) Min-wise independent permutations. J Comput Syst Sci 60(3):630–659
7. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing, STOC 02, New York, NY, USA. ACM, pp 380–388

8. Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering, ICDE 06, Washington, DC, USA. IEEE Computer Society
9. Cohen WW (1998) Integration of heterogeneous databases without common domains using queries based on textual similarity. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data, SIGMOD 98, New York, NY, USA. ACM
10. Cost S, Salzberg S (1993) A weighted nearest neighbor algorithm for learning with symbolic features. Mach Learn 10(1)
11. Datar M, Muthukrishnan S (2002) Estimating rarity and similarity over data stream windows. In: Proceedings of the 10th annual European symposium on algorithms, ESA 02. Springer, London
12. Devroye L, Wagner TJ (1982) Nearest neighbor methods in discrimination. Handbook of statistics, vol 2
13. Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, Equitz W (1994) Efficient and effective querying by image content. J Intell Inf Syst 3(3–4)
14. Faloutsos C, Oard DW (1995) A survey of information retrieval and filtering methods. Technical report, University of Maryland Institute for Advanced Computer Studies, College Park, MD, USA
15. Flickner M, Sawhney H, Niblack W, Ashley J, Huang Q, Dom B, Gorkani M, Hafner J, Lee D, Petkovic D, Steele D, Yanker P (1995) Query by image and video content: the QBIC system. Computer 28(9)
16. Gersho A, Gray RM (1991) Vector quantization and signal compression. Kluwer Academic Publishers, Norwell
17. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the 25th international conference on very large data bases, VLDB 99, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
18. Hastie T, Tibshirani R (1995) Discriminant adaptive nearest neighbor classification. In: Proceedings of the first international conference on knowledge discovery and data mining, KDD 95, Palo Alto, CA, USA. AAAI Press
19. Henzinger M (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 06, New York, NY, USA. ACM
20. Indyk P (2001) A small approximately min-wise independent family of hash functions. J Algorithms 38(1)
21. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, STOC 98, New York, NY, USA. ACM
22. Koivunen V, Kassam S (1995) Nearest neighbor filters for multivariate data. In: IEEE workshop on nonlinear signal and image processing, Washington, DC, USA. IEEE Computer Society
23. Kollios G, Tsotras VJ (2002) Hashing methods for temporal data. IEEE Trans Knowl Data Eng 14(4)
24. Kontaki M, Papadopoulos AN (2004) Efficient similarity search in streaming time sequences. In: Proceedings of the 16th international conference on scientific and statistical database management, SSDBM 04, Washington, DC, USA. IEEE Computer Society
25. Koudas N, Ooi BC, Tan K-L, Zhang R (2004) Approximate NN queries on streams with guaranteed error/performance bounds. In: Proceedings of the thirtieth international conference on very large data bases, VLDB 04. VLDB Endowment
26. Lian X, Chen L, Wang B (2007) Approximate similarity search over multiple stream time series. In: Proceedings of the 12th international conference on database systems for advanced applications, DASFAA 07. Springer, Berlin
27. Mouratidis K, Bakiras S, Papadias D (2006) Continuous monitoring of top-k queries over sliding windows. In: Proceedings of the 2006 ACM SIGMOD international conference on management of data, SIGMOD 06, New York, NY, USA. ACM
28. Mouratidis K, Papadias D (2007) Continuous nearest neighbor queries over sliding windows. IEEE Trans Knowl Data Eng 19(6)
29. Mouratidis K, Papadias D, Bakiras S, Tao Y (2005) A threshold-based algorithm for continuous monitoring of k nearest neighbors. IEEE Trans Knowl Data Eng 17(11)
30. Pan S, Zhu X (2012) Continuous top-k query for graph streams. In: Proceedings of the 21st ACM international conference on information and knowledge management, CIKM 12, New York, NY, USA. ACM
31. Pentland A, Picard RW, Sclaroff S (1994) Photobook: content-based manipulation of image databases. In: Storage and retrieval for image and video databases, Bellingham, WA, USA. SPIE, pp 34–47

32. Rao W, Chen L, Chen S, Tarkoma S (2014) Evaluating continuous top-k queries over document streams. World Wide Web 17(1)
33. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc., New York
34. Sarawagi S, Kirpal A (2004) Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data, SIGMOD 04, New York, NY, USA. ACM
35. Smeulders A, Jain R (eds) (1998) Image databases and multimedia search. World Scientific Publishing Co., Inc., River Edge
36. Sun Y, Han J, Yan X, Yu PS, Wu T (2011) PathSim: meta path-based top-k similarity search in heterogeneous information networks. Proc VLDB Endow 4(11)
37. Winkler WE (1999) The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, Suitland
38. Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th international conference on World Wide Web, WWW 08, New York, NY, USA. ACM

Xiaoning Xu received her Master's degree in computing science from Simon Fraser University, Canada, in 2014, supervised by Prof. Jian Pei. Previously, she was in the SFU-ZJU computing science dual degree program, receiving a B.Sc. degree from Simon Fraser University, Canada, and a B.Eng. degree from Zhejiang University, China, both in 2012. Her research interests focus on similarity search and data stream processing. She is currently a software engineer at Fortinet Inc., working on data analytics and threat detection in network log data.

Chuancong Gao has been a Ph.D. student in computing science at Simon Fraser University, Canada, since 2013, supervised by Prof. Jian Pei. Previously, he received a master's degree from Tsinghua University, China, in 2010 and a bachelor's degree from Shandong University, China, in 2007, both in computer science and engineering. His current research interests are data cleaning and data integration.

Jian Pei is the Canada Research Chair in Big Data Science and a professor at the School of Computing Science and the Department of Statistics and Actuarial Science at Simon Fraser University, Canada. He received a Ph.D. degree at the same school in 2002 under Dr. Jiawei Han's supervision. His research interests are to develop effective and efficient data analysis techniques for novel data-intensive applications. He has published prolifically and is one of the top cited authors in data mining. He has received a series of prestigious awards. He is also active in providing consulting services to industry and in transferring the research outcomes of his group to industry and applications. He is an editor of several esteemed journals in his areas and a passionate organizer of the premier academic conferences and initiatives defining the frontiers of the areas. He is an IEEE fellow.

Ke Wang received his Ph.D. from the Georgia Institute of Technology. He is currently a professor at the School of Computing Science, Simon Fraser University. His research interests include database technology, data mining, and knowledge discovery, with emphasis on massive data sets, graph and network data, and data privacy. Ke Wang has published more than 100 research papers in database, information retrieval, and data mining conferences. He is currently an associate editor of the ACM TKDD journal.

Abdullah Al-Barakati was the head of the Information Systems Department at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, and is a faculty member and a lecturer in the Faculty of Computing and Information Technology. Dr. Al-Barakati's academic interests range from teaching and academic research to modern theories in applied informatics. His research interests revolve around the area of Big Data, ranging from modern practices in Big Data to advanced concepts and special issues in Big Data mining (pattern extraction, visualization, and analysis). Language Processing (LP) and Machine Learning (ML) also fall within his research interests, as he has developed a passion for text analysis and manipulation. Furthermore, web applications, especially the dynamics of social media and Web 2.0, are important elements of Dr. Al-Barakati's research interests, as he believes they form the cornerstones of many astonishing innovations now and in the future.
