Continuous similarity search for evolving queries


Knowl Inf Syst — Regular Paper

Continuous similarity search for evolving queries

Xiaoning Xu (Fortinet Inc., Burnaby, BC, Canada) · Chuancong Gao (Simon Fraser University, Burnaby, BC, Canada) · Jian Pei (Simon Fraser University; King Abdulaziz University) · Ke Wang (Simon Fraser University) · Abdullah Al-Barakati (King Abdulaziz University, Jeddah, Saudi Arabia)

Received: 14 May 2015 / Revised: 3 August 2015 / Accepted: 5 October 2015
© Springer-Verlag London 2015
Corresponding author: Jian Pei, jpei@cs.sfu.ca

This work is partly supported by an NSERC Discovery grant, the Canada Research Chair program, and a Yahoo! Faculty Research and Engagement Program (FREP) award. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Abstract In this paper, we study a novel problem of continuous similarity search for evolving queries. Given a set of objects, each being a set or multiset of items, and a data stream, we want to continuously maintain the top-k most similar objects using the last n items in the stream as an evolving query. We show that the problem has several important applications. At the same time, the problem is challenging. We develop a filtering-based method and a hashing-based method. Our experimental results on both real data sets and synthetic data sets show that our methods are effective and efficient.

Keywords Similarity search · Data stream · Evolving query

1 Introduction

Let us consider a recommendation problem at a Q&A website. Suppose you want to run 3 ads on the website for a user who does not yet have a profile. What ads should you display? In addition to many factors, such as the click-through rates of ads and bidding price information, a natural and important idea is to consider the ads that are most related to the questions recently asked at the website by all users. For example, you may model ads and questions as sets of keywords, and may want to retrieve the top-k ads that are most similar to the last n questions asked, where the similarity measure captures the relevance between ads and questions. Such ads can be used as candidates for further selection.

The above scenario is just one of many applications that motivate the problem studied in this paper. Given a set of static data objects (e.g., the ads in the above example) and an evolving data stream (e.g., the questions asked in the above example), a sliding window on the data stream (e.g., the last n questions in the above example) presents an evolving query. The problem of continuous similarity search for evolving queries is to continuously conduct top-k similarity search on the set of static objects using the evolving queries.

The problem of continuous similarity search for evolving queries has many important applications. As another example, in a computer role-playing game, a player has a set of weapons and tools, which is relatively stable. The player goes through a game mission scene, where the objects in the continuously updated surrounding environment, such as different types of enemies, scoring opportunities, and obstacles, present a stream. The window over the stream containing the most recent objects approaching the player presents an evolving query. The player has to select the weapons and tools that best match the current surrounding environment. Here, an enemy can be modeled as a set of factors, such as explosion, by which it may be tackled, and a weapon/tool can be modeled as a set of factors for which it may be used. Again, before any gaming strategies can be applied, an essential task is to continuously maintain the top-k best weapons and tools with respect to the evolving enemies.

Our problem can be considered the dynamic version of the top-k set similarity search problem. More generally, a traditional similarity search problem involves a collection of objects, a similarity function, and a user-defined threshold. The search is to find all objects in the collection whose similarity scores with regard to the query are no less than the predefined threshold. Similarity search has many applications, such as information retrieval [14,16,33], near-duplicate web page detection [19], record linkage [37], data compression [16], data integration [9], image and video search and recommendation [13,15,31,35], statistics and data analysis [12,22], machine learning [10], and data mining [3,18]. Besides answering threshold-based queries, top-k queries are also of great value, since, given a threshold, the size of the result may be unpredictable, and, in many real applications, we are only interested in a small number of most similar objects. Moreover, as reviewed in Sect. 2.2, the problem of similarity search on data streams has been extensively explored, especially for nearest neighbor search. However, to the best of our knowledge, the problem of continuous set-based similarity search for evolving queries has not been systematically investigated.

In this paper, we tackle the problem of continuous similarity search for evolving queries. Since in many applications an object can be represented as a set or multiset, such as using a keyword vector to represent a document, we use weighted Jaccard similarity as the measure. The major challenge is how to speed up the similarity computation and avoid exactly checking the evolving query against every static object at every time point. We develop an upper bound for incremental maintenance of similarity scores that can be computed in constant time. We propose two algorithms: an exact one based on the pruning and verification framework, and an approximate one based on MinHash. We report an empirical evaluation on both synthetic and real-world data sets, which validates the efficiency and effectiveness of our proposed methods.

The rest of the paper is organized as follows. In Sect. 2, we review the related work. We then formulate the problem in Sect. 3. In Sect. 4, we propose a filtering-based method. In Sect. 5, we present a hashing-based method. We report our experimental results in Sect. 6 and conclude the paper in Sect. 7.

2 Related work

Our study is mainly related to the existing work on similarity search and continuous top-k queries. In this section, we briefly review the state-of-the-art methods related to our study. A thorough survey of these topics is far beyond the capacity of this paper.

2.1 Similarity join and search

The static version of similarity search has been studied extensively. The state-of-the-art methods can be categorized into two major groups.

The first category is the filtering-based approaches. The general idea is to develop upper and lower bounds on the similarity between an object and a query that can be computed efficiently. Many objects can be filtered out using the bounds. Consequently, the exact similarity scores, which are supposed to be more expensive to compute, are calculated for only the small number of objects surviving the filtering. For instance, Sarawagi and Kirpal [34] proposed an inverted index-based probing method for similarity joins on sets. Chaudhuri et al. [8] developed the prefix-filtering principle for similarity joins. The all-pairs algorithm developed by Bayardo et al. [3] further improves this approach by adding a minimum length constraint and other filtering techniques to speed up similarity joins. Nevertheless, the all-pairs and prefix-filtering methods often generate a nontrivial number of candidates, which have to be verified using the exact similarity measure. Xiao et al. [38] extended the all-pairs method and proposed a new positional filtering method, PPJoin, which makes use of ordering information. PPJoin+ [38] combines suffix filtering with PPJoin and can further reduce the number of candidate pairs. A new similarity measure, PathSim, which is based on meta paths and is used in heterogeneous networks, is defined in [36].

The second category is the hashing-based methods. The general idea is to develop hash functions with good locality-preservation properties: similar objects are likely to be hashed to the same bucket. The idea was introduced by Indyk and Motwani [21] for approximate nearest neighbor search in d-dimensional Euclidean space. The basic principle is to hash the points using multiple hash functions such that closer points have a higher probability of collision than points that are far away. Gionis et al. [17] further improved the algorithms and achieved better query time guarantees. Later, an improved algorithm that almost achieves the space and time lower bounds was presented by Andoni and Indyk [1]. The MinHash technique [5] is used to approximate the resemblance and the containment of sets; it is also used to estimate the rarity and similarity between two windowed data streams in [11]. Moreover, Charikar [7] proposed SimHash to hash similar data to similar values; an estimation of vector-based cosine similarity using a random projection method is also discussed in [7].

In this paper, we explore both filtering-based and hashing-based methods, which have not been addressed in the existing literature for evolving similarity search queries.

2.2 Continuous queries over a data stream

Different evolving models are used in previous studies of continuous queries over a data stream. For example, Kontaki et al. [24] studied similarity range queries in streaming time sequences using Euclidean distance, where both the query and the data objects are evolving. An indexing method based on incremental computation of the discrete Fourier transform is used to prune candidates efficiently. Lian et al. [26] tackled the similarity search problem over multiple stream time series, given a static time series as a query. An approximation algorithm is developed using a weighted locality-sensitive hashing technique.

Motivated by a wide range of applications such as network intrusion detection, much work [4,25,28,29] has been devoted to continuously monitoring nearest neighbor (NN) queries over a data stream. The basic idea is to utilize indexing structures to reduce memory consumption and support efficient updates. Mouratidis et al. [28] proposed two approaches for continuous monitoring of NN queries over sliding window streams. Koudas et al. [25] developed an approximation algorithm that utilizes an indexing scheme, DISC, and has guaranteed bounds on error and performance. The existing work on continuously monitoring nearest neighbors for a mobile query object is different from the problem studied here. In those previous studies, the mobile object is assumed to move along a trajectory that is potentially predictable to some extent. In this paper, the stream presenting an evolving query is not assumed to be a moving object. Instead, we simply use the current sliding window as the current query. The existing methods for continuous nearest neighbor monitoring of mobile objects cannot solve our problem.

In addition to continuous similarity search, some other interesting problems have been defined over data streams. For example, Pan and Zhu [30] developed a two-level candidate checking scheme for continuously querying the top-k correlated graphs in a data stream scenario where static queries are posed on evolving graph streams. Mouratidis et al. [27] proposed two approaches for continuously answering top-k queries where the query is a static preference function over a fixed-size sliding window: one approach computes new answers whenever a current top-k point expires, and the other partially precomputes future changes. Rao et al. [32] devised methods for the problem of using an aggregation function to measure the relevance between a document stream and a query consisting of terms. They modeled documents as a data stream; that is, new documents keep coming while the query is not evolving, which differs from our model, where the evolving query consists of the elements in the current sliding window of a data stream. Kollios and Tsotras [23] proposed efficient hashing methods to answer membership queries in a temporal setting.

Table 1 summarizes the differences between our study and some previous methods related to ours.

3 Problem definition

Let us consider an alphabet Σ of items, a collection R of objects where each object is a set (or multiset) over Σ, a data stream S whose elements keep arriving, and an integer n as the size of a query sliding window on S. A query Q is the last n elements in S. A query is in general a multiset, but it can also be modeled as a set. The top-k continuous similarity search for an evolving query Q is to continuously find the k objects in R that are most similar to Q. We answer the query continuously whenever a new element arrives in S.

In this paper, we consider each object to be a set or multiset; queries can be sets or multisets as well. Since a set is a special case of a multiset where all elements are distinct, we use weighted Jaccard similarity to measure the similarity between objects and queries. Specifically, let X be a nonempty multiset consisting of items in the alphabet Σ = {v_1, v_2, ..., v_{|Σ|}}. We map X to a |Σ|-dimensional vector x such that the value of the ith component is the absolute frequency of item v_i in X. The vector x is called the vector representation of multiset X.
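The evolving query itself can be maintained incrementally as items arrive. The following is a minimal Python sketch of this model (the class and method names are ours, not from the paper): the query multiset is kept as a frequency vector that is patched in constant time per arriving element.

from collections import Counter, deque

class SlidingQuery:
    """The evolving query: the multiset of the last n stream elements."""
    def __init__(self, n):
        self.n = n                # window size
        self.window = deque()     # last n elements in arrival order
        self.counts = Counter()   # vector representation of the query

    def push(self, item):
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.n:      # expire the oldest element
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]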

Table 1 Comparison of our methods and the previous work

Our methods:
- Pruning method (GP). Problem: top-k set/multiset similarity search. Algorithm: upper bounding similarity scores w.r.t. evolving queries. Features: top-k, set/multiset similarity search, static objects, evolving queries.
- Hashing method (MHI). Problem: top-k set similarity search. Algorithm: MinHash and inverted indices for fast computation of top-k results. Features: top-k, set similarity search, static objects, evolving queries.

Similarity join/search, pruning-based methods:
- Sarawagi and Kirpal [34]. Problem: threshold-based set similarity join. Algorithm: a probing method based on an inverted index. Features: threshold, set similarity join, static objects.
- Chaudhuri et al. [8]. Problem: threshold-based set similarity join. Algorithm: prefix-filtering principle. Features: threshold, set similarity join, static objects.
- Bayardo et al. [3]. Problem: threshold-based set similarity join. Algorithm: All-Pairs, which improved the prefix-filtering method. Features: threshold, set similarity join, static objects.
- Xiao et al. [38]. Problem: threshold-based set similarity join. Algorithm: PPJoin and PPJoin+, which add positional filtering and suffix filtering. Features: threshold, set similarity join, static objects.
- Sun et al. [36]. Problem: PathSim-based top-k similarity search. Algorithm: co-clustering-based pruning. Features: top-k, similarity search, static objects.

Similarity join/search, hashing-based methods:
- Indyk and Motwani [21]. Problem: approximate NN search in Euclidean space. Algorithm: locality-sensitive hashing (LSH). Features: threshold, NN search in Euclidean space, static objects.
- Gionis et al. [17]. Problem: approximate NN search in Euclidean space. Algorithm: improved [21] and achieved better time guarantees. Features: threshold, NN search in Euclidean space, static objects.
- Andoni and Indyk [1]. Problem: approximate NN search in Euclidean space. Algorithm: almost achieves the space and time lower bounds. Features: threshold, NN search in Euclidean space, static objects.
- Broder [5]. Problem: finding similar documents. Algorithm: the MinHash technique. Features: Jaccard similarity estimation, static documents.
- Datar and Muthukrishnan [11]. Problem: estimating rarity and similarity of windowed data streams. Algorithm: applying the MinHash technique. Features: Jaccard similarity estimation, two evolving data windows.
- Charikar [7]. Problem: designing new locality-sensitive hashing schemes. Algorithm: the SimHash technique. Features: similarity estimation, static objects.

Continuous queries over a data stream:
- Kontaki et al. [24]. Problem: similarity range queries in streaming time sequences. Algorithm: indexing and discrete Fourier transformation. Features: Euclidean distance, evolving objects and evolving queries.
- Lian et al. [26]. Problem: similarity search over multiple stream time series. Algorithm: a weighted LSH. Features: a static time series as the query, multiple stream time series as the objects.
- Mouratidis et al. [28]. Problem: continuous monitoring of NN queries over sliding window streams. Algorithm: conceptual partitioning and precomputing future changes. Features: Euclidean space, evolving objects and static queries.

Table 1 (continued):
- Pan and Zhu [30]. Problem: continuously querying the top-k correlated graphs in a data stream. Algorithm: a two-level candidate checking scheme. Features: evolving graph streams and static queries.
- Rao et al. [32]. Problem: measuring the relevance between a document stream and a query. Algorithm: a graph indexing structure for result sharing among queries. Features: evolving documents and static queries.

Table 2 A collection of sets R
T1 = {a, b, c, f, i}
T2 = {a, d, e}
T3 = {d, e}
T4 = {b, c, d, i}
T5 = {a, c, g, h, i, j}
T6 = {a, c, e, f, h}

Fig. 1 Query stream S: (a) the current query Q = {b, i, c, a, d}; (b) the query Q′ = {i, c, a, d, f} after a new item arrives.

Let X and Y be two nonempty multisets over alphabet Σ. Let x = [x_1, x_2, ..., x_{|Σ|}] and y = [y_1, y_2, ..., y_{|Σ|}] be the vector representations of X and Y, respectively. The weighted Jaccard similarity is

sim_jac(X, Y) = sim_jac(x, y) = Σ_{i=1}^{|Σ|} min(x_i, y_i) / Σ_{i=1}^{|Σ|} max(x_i, y_i)   (1)

Example 1 (Computing similarity scores) Given an alphabet Σ = {a, b, c, d, e, f, g}, consider two multisets X = {c, b, a, e, f} and Y = {a, b, a, c, e, d}. The vector representations of X and Y are x = [1, 1, 1, 0, 1, 1, 0] and y = [2, 1, 1, 1, 1, 0, 0], respectively. Since Σ_{i=1}^{7} max(x_i, y_i) = 7 and Σ_{i=1}^{7} min(x_i, y_i) = 4, the weighted Jaccard similarity score between X and Y is sim_jac(X, Y) = 4/7 ≈ 0.57.

We now model how a query evolves. Given a query Q = {e_{p+1}, e_{p+2}, ..., e_{p+|Q|}}, where item e_{p+i} (1 ≤ i ≤ |Q|) is the ith arrived item in the query, the updated query after u time instants (u ≤ |Q|), that is, u updates in the stream, is Q′ = {e_{p+u+1}, e_{p+u+2}, ..., e_{p+|Q|}, e_{p+|Q|+1}, e_{p+|Q|+2}, ..., e_{p+|Q|+u}}, where item e_{p+|Q|+j} is the jth newly arriving element in the updated query, 1 ≤ j ≤ u.

Example 2 (Continuous top-k queries) A collection of sets R in a database is shown in Table 2, and a data stream S is shown in Fig. 1a. The alphabet Σ = {a, b, c, d, e, f, g, h, i, j} consists of 10 items. Suppose k = 2 and the size of the sliding window is 5; that is, we use the set of the last 5 entries in stream S as the evolving query Q, and find the top-2 most similar sets in R.
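The weighted Jaccard similarity of Eq. 1 is straightforward to compute when multisets are stored as frequency maps. Below is a small Python sketch (our own helper, equivalent to the vector form of Eq. 1) that reproduces Example 1.

from collections import Counter

def weighted_jaccard(x, y):
    """Eq. (1): sum of component-wise minima over sum of maxima."""
    items = x.keys() | y.keys()
    inter = sum(min(x[v], y[v]) for v in items)
    union = sum(max(x[v], y[v]) for v in items)
    return inter / union if union else 0.0

X = Counter(['c', 'b', 'a', 'e', 'f'])
Y = Counter(['a', 'b', 'a', 'c', 'e', 'd'])
assert abs(weighted_jaccard(X, Y) - 4 / 7) < 1e-9   # Example 1: 4/7 ~ 0.57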

Table 3 Jaccard similarity scores

      T1    T2    T3    T4    T5    T6
Q     0.67  0.33  0.17  0.80  0.38  0.25
Q′    0.67  0.33  0.17  0.50  0.38  0.43

Table 4 Summary of frequently used symbols

For all methods:
- R: a collection of objects in the database
- S: the querying data stream
- Σ: the alphabet used in the database
- T_i: the ith object in R
- Q_t: the query containing the elements in the sliding window at time t
- x: the vector representation of a set (or multiset) X
- sim(X, Y): the exact similarity of two sets (or multisets) X and Y
- k: the number of objects in the result
- topk_t: the k objects that are most similar to the query at time t
- score(topk_t[k]): the similarity score of the kth most similar result at time t

For the pruning-based method:
- sim(X, Y)^{+u}: the upper bound of the similarity score of two sets (or multisets) X and Y, after u updates on Y
- min_step_i: the least number of updates needed for object T_i's progressive upper bound sim(T_i, Q_t)^{+min_step_i} to exceed score(topk_t[k])

For the MinHash-based method:
- {[n]}: the set {0, ..., n−1}
- sim*(X, Y): the MinHash-approximated similarity of two sets X and Y
- H: a set of hash functions h_1, h_2, ..., h_{|H|}
- index_i: the inverted index built on the MinHash values of the ith hash function, i ∈ [1, |H|]
- index_i[v]: the inverted list of hash value v, consisting of the objects whose MinHash value is v according to the ith hash function, i ∈ [1, |H|] and v ∈ [1, |Σ|]
- min(h_i(X)): the minimum hash (MinHash) value of set X according to the ith hash function h_i, i ∈ [1, |H|]

For experiments:
- |Q_t|: the average object length of the data set

We can compute the Jaccard similarity score between each object in R and the current query Q = {b, i, c, a, d}. The scores are shown in the first row of Table 3. Objects T4 and T1 are the top-2 similar objects with respect to query Q. Suppose an item f arrives next in the data stream, as shown in Fig. 1b. We then have an updated query Q′ = {i, c, a, d, f}. The updated Jaccard similarity scores are shown in the last row of Table 3. The rankings of the objects in R ordered by similarity score change slightly. For example, the rank of T4 changes from 1 to 2 (the lower the rank, the higher the similarity). Consequently, T1 and T4 are the top-2 similar objects with respect to query Q′.

Table 4 lists the frequently used symbols in the paper.

4 A pruning-based method

Computing the exact similarity score for every object against the updated query at every time point is costly. A brute-force method takes O(|Q_t| + |T_i|) time to compute the similarity between each object T_i ∈ R and the current query Q_t at every time instant t, and in total O(|R| · (|Q_t| + max_{1≤i≤|R|} |T_i|)) time to update the top-k results at each time instant. In this section, we derive an upper bound on Jaccard similarity scores and then present a pruning-based method that utilizes the upper bound.

4.1 A progressive upper bound

We define the length of a multiset as the number of elements, including duplicates, in the multiset. To compute the upper bound of the Jaccard similarity between two multisets X and Y after u updates on Y, we can use the following property.

Theorem 1 (A progressive upper bound) Let X and Y be two multisets, and let Y′ be the multiset with u updates on Y. Then,

sim_jac(X, Y′) ≤ sim_jac(X, Y)^{+u}

where

sim_jac(X, Y)^{+u} = (sim_jac(X, Y) · α + β) / (α − β)

and α = |X| + |Y|, β = (1 + sim_jac(X, Y)) · u.

Proof By definition, we have

sim_jac(X, Y) = Σ_{i=1}^{|Σ|} min{x_i, y_i} / Σ_{i=1}^{|Σ|} max{x_i, y_i}   (2)

and

Σ_{i=1}^{|Σ|} max{x_i, y_i} = |X| + |Y| − Σ_{i=1}^{|Σ|} min{x_i, y_i}.   (3)

Using Eqs. 2 and 3, we have

Σ_{i=1}^{|Σ|} min{x_i, y_i} = sim_jac(X, Y) · Σ_{i=1}^{|Σ|} max{x_i, y_i} = sim_jac(X, Y) · (|X| + |Y| − Σ_{i=1}^{|Σ|} min{x_i, y_i})

Consequently, Σ_{i=1}^{|Σ|} min{x_i, y_i} can be rewritten using |X|, |Y| and sim_jac(X, Y) as follows.

Σ_{i=1}^{|Σ|} min{x_i, y_i} = (|X| + |Y|) · sim_jac(X, Y) / (sim_jac(X, Y) + 1)   (4)

After u updates, the maximum increase in the intersection size (the numerator of Eq. 2) and the maximum decrease in the union size (the denominator of Eq. 2) are both u. Combining Eqs. 3 and 4, we have

sim_jac(X, Y′) ≤ (u + Σ_{i=1}^{|Σ|} min{x_i, y_i}) / (Σ_{i=1}^{|Σ|} max{x_i, y_i} − u)
             = (u + Σ_{i=1}^{|Σ|} min{x_i, y_i}) / (|X| + |Y| − Σ_{i=1}^{|Σ|} min{x_i, y_i} − u)
             = ((|X| + |Y| + u) · sim_jac(X, Y) + u) / (|X| + |Y| − u · sim_jac(X, Y) − u)

Let α = |X| + |Y| and β = (1 + sim_jac(X, Y)) · u; we have

sim_jac(X, Y′) ≤ (sim_jac(X, Y) · α + β) / (α − β) = sim_jac(X, Y)^{+u}  ∎

The upper bound can be computed from |X|, |Y|, and sim_jac(X, Y). As shown soon in the algorithm, sim_jac(X, Y) is already known when computing the upper bound. Thus, it takes only constant time.

Example 3 (Progressive upper bound) Consider the same multisets X and Y as in Example 1. We have |X| = 5, |Y| = 6, and sim_jac(X, Y) = 4/7 ≈ 0.57. Suppose u = 1; the upper bound of the Jaccard similarity after one update is sim_jac(X, Y)^{+1} = 5/6 ≈ 0.83.

Obviously, the progressive upper bound is monotonic with respect to the number of updates u. Thus, if the upper bound sim_jac(X, Y)^{+u} is not large enough, neither is any sim_jac(X, Y)^{+u′} (1 ≤ u′ < u).

Our strategy for deriving the upper bound after u steps can be further extended to other multiset-based similarity measures, such as the weighted cosine similarity, the weighted dice similarity, and the weighted overlap similarity. The Appendix gives the details of the extensions to other similarity measures.
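Since the bound depends only on |X|, |Y|, and the already-known similarity score, it is a constant-time formula. A minimal Python sketch (the function name is ours) that reproduces Example 3:

def jaccard_upper_bound(sim_xy, len_x, len_y, u):
    """Theorem 1: upper bound on sim_jac(X, Y') after u updates on Y."""
    alpha = len_x + len_y
    beta = (1.0 + sim_xy) * u
    return (sim_xy * alpha + beta) / (alpha - beta)

# Example 3: |X| = 5, |Y| = 6, sim = 4/7; after u = 1 update the bound is 5/6.
assert abs(jaccard_upper_bound(4 / 7, 5, 6, 1) - 5 / 6) < 1e-9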

4.2 A general pruning-based algorithm

A brute-force method is to compute the similarity score between the updated query and each object at each time instant, and then obtain the top-k list. This approach is costly and involves much unnecessary computation, since when the query window slides by only a small number of positions, such as 1 or 2, the changes in the similarity scores are limited. In this subsection, we present a heuristic algorithm, GP, that finds the exact top-k objects continuously.

For the sake of simplicity, let us assume that new elements come in a manner synchronized with time; that is, a new element arrives in the stream when there is an update in time. The main idea is based on the observation that the query at time t+1 shares a large portion of elements, namely at least (n−1)/n, with the query at time t. Thus, the change in the top-k list is limited. In other words, the objects with small similarity scores at time t have a very limited chance of entering the top-k list at time t+1.

To implement the above idea, we divide the objects into two categories according to their current similarity scores. The first category, C1, contains the objects whose similarity scores are small enough that they cannot enter the top-k list within the next several updates. The objects in this category do not need to be checked using the exact similarity scores at the next several time instants. Instead, we only need to maintain the upper bounds of their similarity scores.

The second category, C2, contains the other objects, those not in the first category. The objects in this category need to be checked by computing the exact similarity scores.

Suppose that at time instant t we obtain the top-k list of objects topk_t. Denote by score(topk_t[k]) the similarity score of the kth object, which is the least similar to the current query Q_t. Apparently, score(topk_t[k]) can be obtained without any extra cost. For each object T_i not in the list topk_t, we compute min_step_i, the smallest number of updates needed for object T_i to have its progressive upper bound exceed score(topk_t[k]). According to Theorem 1, min_step_i is the smallest integer satisfying score(topk_t[k]) < sim(T_i, Q_t)^{+min_step_i}. min_step_i can be easily computed for different similarity functions according to the respective specific forms of the upper bounds. For weighted Jaccard similarity, we have

min_step_i = ⌈ (score(topk_t[k]) · Σ_{i=1}^{|Σ|} max{x_i, y_i} − Σ_{i=1}^{|Σ|} min{x_i, y_i}) / (1 + score(topk_t[k])) ⌉   (5)
           = ⌈ ((score(topk_t[k]) − sim(T_i, Q_t)) · (|T_i| + |Q_t|)) / ((1 + score(topk_t[k])) · (1 + sim(T_i, Q_t))) ⌉   (6)

Given Eq. 6, we likewise need only constant time to compute min_step_i. For the other three similarities (see the Appendix), it is simply

min_step_i = ⌈ (score(topk_t[k]) − sim(T_i, Q_t)) / (sim(T_i, Q_t)^{+1} − sim(T_i, Q_t)) ⌉

Example 4 (Computing min_step_i) Again, consider the collection of sets R shown in Table 2 and the data stream S shown in Fig. 1a. As the size of the sliding window is 5, the list of the top-2 most similar objects in R at the current time t is {T1: 0.67, T4: 0.8}, where the numbers after the colons are the similarity scores. The similarity score of object T5 at the current time instant t is 3/8 ≈ 0.38, so T5 is not one of the top-2 results. However, for T5, Σ_{i=1}^{|Σ|} min{x_i, y_i} = 3 and Σ_{i=1}^{|Σ|} max{x_i, y_i} = 8, with x_i from T5's vector and y_i from Q_t's vector. Using Eq. 5, we can compute that, after min_step_5 = ⌈1.43⌉ = 2 updates, the progressive upper bound of T5, sim(T5, Q_t)^{+2} ≈ 0.83, is larger than the score of the current kth most similar object, score(topk_t[k]) = sim(T1, Q_t) = 0.67.
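A direct rendering of Eq. 6 in Python, reproducing Example 4 (the function name is ours; sigma denotes score(topk_t[k])):

import math

def min_step(sim, sigma, len_t, len_q):
    """Smallest u whose progressive upper bound exceeds sigma (Eq. 6)."""
    if sim >= sigma:
        return 0
    return math.ceil((sigma - sim) * (len_t + len_q)
                     / ((1.0 + sigma) * (1.0 + sim)))

# Example 4: T5 with sim = 3/8 against sigma = 2/3, |T5| = 6, |Q_t| = 5.
assert min_step(3 / 8, 2 / 3, 6, 5) == 2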

At time instant t, for each object T_i, we maintain min_step_i and the corresponding progressive upper bound sim(T_i, Q_t)^{+min_step_i}. Please note that the min_step_i values and the sim(T_i, Q_t)^{+min_step_i} values may be computed at time instant t or before. After processing the objects at time instant t−1, at time instant t, with respect to query Q_t, we process the objects as follows.

1. For all objects T_i such that min_step_i = 0, including those already in topk_{t−1}, we compute the exact similarity between T_i and Q_t and obtain a top-k list topk_t among those objects. These objects are assigned to the second category, C2. We set the similarity threshold σ = score(topk_t[k]).

2. For all objects T_i such that min_step_i > 0 but sim(T_i, Q_{t−1})^{+min_step_i} > σ, since their similarity to query Q_t may be greater than score(topk_t[k]), we have to compute their exact similarity to Q_t, too. We update the top-k list topk_t and also the similarity threshold σ = score(topk_t[k]). Obviously, in this step, σ = score(topk_t[k]) is monotonically increasing and never decreases. At the end of this step, topk_t is finalized. The objects whose exact similarity scores are larger than score(topk_t[k]), that is, those in the list topk_t, are also added to category C2.

3. We update min_step_i and the corresponding progressive upper bound sim(T_i, Q_t)^{+min_step_i} for the objects T_i ∈ C2.

4. The other objects, those not in C2, belong to the first category, C1: they have no hope of being more similar to Q_t than any object in the list topk_t and thus do not need to be checked in this round. We only need to decrease their min_step_i values by 1.

The pseudocode of the algorithm is given in Algorithms 1 and 2. In our implementation, we organize the objects in bins according to their min_step_i values and use a heap to maintain the top-k list. All the bins are maintained in a queue structure (called queue in the pseudocode).

Algorithm 1: A general pruning algorithm (GP)
Input: top-k results topk_{t−1} at previous time t−1; query Q_t at current time t
Output: top-k results topk_t at current time t
1  for u = 0 to |queue| − 1 do
2      foreach T_i ∈ queue[u] do
3          if u = 0 or sim(T_i, Q_t)^{+u} > score(topk_{t−1}[k]) then
4              delete T_i from bin queue[u]
5              compute sim(T_i, Q_t) and update topk_t
6              min_step_i ← null
7  remove bin queue[0]
8  foreach T_i with min_step_i = null do
9      (min_step_i, sim(T_i, Q_t)^{+min_step_i}) ← est_bound(sim(T_i, Q_t), score(topk_t[k]))
10     insert T_i into bin queue[min_step_i]

Algorithm 2: Update min_step_i and similarity bounds
Function: est_bound(sim(T_i, Q_t), score(topk_t[k]))
Input: T_i's similarity score sim(T_i, Q_t) at current time t; the kth largest similarity score score(topk_t[k]) at current time t
Output: the pair (min_step_i, sim(T_i, Q_t)^{+min_step_i})
1  if sim(T_i, Q_t) < score(topk_t[k]) then
2      compute min_step_i
3      sim(T_i, Q_t)^{+min_step_i} ← upper bound after min_step_i updates
4  else
5      min_step_i ← 0
6      sim(T_i, Q_t)^{+min_step_i} ← sim(T_i, Q_t)
7  return (min_step_i, sim(T_i, Q_t)^{+min_step_i})
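The following Python sketch condenses one round of GP using the helpers introduced above (weighted_jaccard, jaccard_upper_bound, min_step); the per-object state dictionaries last_sim, age, and steps are our own simplification of the bin queue, not the paper's exact implementation.

import heapq

def gp_round(objects, query, last_sim, age, steps, k, sigma_prev):
    """objects: id -> Counter; returns the exact top-k ids and the new sigma."""
    len_q = sum(query.values())
    for t, obj in objects.items():
        age[t] += 1                      # one more update since last verification
        bound = jaccard_upper_bound(last_sim[t], sum(obj.values()), len_q, age[t])
        if age[t] >= steps[t] or bound > sigma_prev:    # due, or bound violated
            last_sim[t] = weighted_jaccard(obj, query)  # verify exactly (C2)
            age[t] = 0
    verified = [t for t in objects if age[t] == 0]
    topk = heapq.nlargest(k, verified, key=lambda t: last_sim[t])
    sigma = last_sim[topk[-1]]
    for t in verified:                   # re-bin by the new threshold
        steps[t] = min_step(last_sim[t], sigma, sum(objects[t].values()), len_q)
    return topk, sigma

Objects in the current top-k receive min_step = 0 and are therefore re-verified in every round, matching step 1 above.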

5 A MinHash-based method

In this section, we use the MinHash technique [7], a locality-sensitive hashing (LSH) scheme, to approximate Jaccard similarity. We design indices for efficiently updating the estimated similarity scores. Since the MinHash technique is designed for estimating sets rather than multisets, we limit the definition of the current query to the set of its distinct elements.

5.1 Approximating Jaccard similarity over a static query

We first discuss how to estimate Jaccard similarity using MinHash for static objects and queries.

Definition 1 (Image under permutation [2]) Denote by {[n]} the set {0, ..., n−1}. A permutation π on {[n]} is a bijective function (one-to-one correspondence) from {[n]} to itself. If x ∈ {[n]}, then π(x), the value of π when applied to x, is the image of x under π. The image of a subset X ⊆ {[n]} under π is π[X] = {y | y = π(x) for some x ∈ X}.

Definition 2 (Min-wise independent permutations [6]) Let S_n be the set of all permutations of {[n]}. A family of permutations F ⊆ S_n is min-wise independent if for any set X ⊆ {[n]} and any element x ∈ X, we have

Pr(min{π[X]} = π(x)) = 1/|X|   (7)

where the permutation π is chosen randomly from F.

The following property enables efficient estimation of the Jaccard similarity between two sets.

Theorem 2 (Jaccard similarity estimation [20]) Consider a query Q and an object T_i whose elements are drawn from an alphabet Σ. Let h be a hash function that maps the elements of Σ to distinct integers in the range [0, |Σ|−1] and is randomly picked from a family of min-wise independent permutations. Let the MinHash value min(h(T_i)) be the element of T_i with the smallest hash value. Then, Pr[min(h(T_i)) = min(h(Q))] = |T_i ∩ Q| / |T_i ∪ Q|.

The MinHash values of all the objects and that of the query can be stored in a MinHash signature matrix M, where each entry M(i, j) is the MinHash value of the jth itemset under hash function h_i. Based on Theorem 2, the Jaccard similarity between two sets can be estimated by the ratio of the number of rows containing the same MinHash values to the total number of rows in the signature matrix. We illustrate the computation in Example 5.

Table 5 An example of Jaccard similarity estimation.
(a) Matrix representation of the sets: each column represents one of T1, T2, T3, T4, Q as a 0/1 membership vector over the elements a–e (entries omitted here).
(b) Random permutations: five permutations h1–h5 of the elements, used as min-wise independent hash functions; e.g., h4 maps a, b, c to 1, 3, 4 (full values omitted here).
(c) Signature matrix:

     T1  T2  T3  T4  Q
h1   a   a   e   e   e
h2   c   a   c   e   c
h3   b   d   b   d   b
h4   a   d   e   d   d
h5   a   a   e   a   d

(d) Exact and approximated Jaccard similarity: the estimated scores sim*(T_i, Q) for T1–T4 are 2/5, 1/5, 3/5, and 2/5, respectively (the exact scores are omitted here).

Example 5 (Jaccard similarity estimation) Consider the matrix representation of an example with four objects and a static query Q, shown in Table 5a. Each column in the matrix represents a set, and an element is in the set if the corresponding entry is 1. We apply the five random permutations defined in Table 5b as the min-wise independent hash functions on the objects and the query. The signature matrix is shown in Table 5c. For example, since object T1 contains elements a, b, and c, the corresponding hash values according to h4 are 1, 3, and 4, respectively. The MinHash value of T1 according to h4 is a, since a is the element of T1 with the smallest hash value. We can estimate the Jaccard similarity score between two sets by computing the ratio of the number of rows containing the same MinHash values to the number of random permutations. For example, the MinHash values of T1 and Q are the same according to random permutations h2 and h3. Thus, the estimated Jaccard similarity between T1 and Q is 2/5. We compare the estimated Jaccard similarity with the exact Jaccard similarity for each object in Table 5d. When the objects are ordered in descending order of estimated similarity score with regard to the query, we have T3 > T1 = T4 > T2, which is identical to the order using exact similarity. Therefore, in this example, we can get the top-k result accurately using estimated scores.
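A compact Python sketch of this estimation scheme, using explicit random permutations of the alphabet as the hash family (the names are ours; a production implementation would likely use universal hashing rather than materialized permutations):

import random

def make_permutations(alphabet, num_hashes, seed=0):
    """One random permutation of the alphabet per hash function."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_hashes):
        ranks = list(range(len(alphabet)))
        rng.shuffle(ranks)
        perms.append(dict(zip(alphabet, ranks)))   # element -> hash value
    return perms

def signature(s, perms):
    """MinHash value of set s under each permutation (one matrix column)."""
    return [min(s, key=p.__getitem__) for p in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching rows in the two signature columns."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)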

5.2 Approximating Jaccard similarity over an evolving query

To answer evolving queries, given a fixed set of hash functions, the MinHash values of all the objects are fixed, and only the MinHash values of the query are subject to modification. That is, we only need to update the MinHash values of the query. Therefore, we have a fast solution, shown in Algorithm 3.

Algorithm 3 first computes the MinHash signature of the updated query. To determine whether we need to update the estimated Jaccard similarity of object T_j with respect to the updated query Q_t according to a hash function h_i, we consider the following four cases, based on whether the MinHash values of the object and the queries are equal.

Case 1: If min(h_i(T_j)) ≠ min(h_i(Q_{t−1})) and min(h_i(T_j)) = min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query should be increased by 1/|H|, where |H| is the number of hash functions applied.

Case 2: If min(h_i(T_j)) = min(h_i(Q_{t−1})) and min(h_i(T_j)) ≠ min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query should be decreased by 1/|H|.

Case 3: If min(h_i(T_j)) = min(h_i(Q_{t−1})) and min(h_i(T_j)) = min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query remains unchanged.

Case 4: If min(h_i(T_j)) ≠ min(h_i(Q_{t−1})) and min(h_i(T_j)) ≠ min(h_i(Q_t)), the estimated Jaccard similarity between T_j and the updated query also remains unchanged.
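The case analysis translates directly into an incremental update: when the window slides, only the hash functions whose query MinHash value changed can move an estimate, and then only by ±1/|H|. A sketch (the dictionary names are ours):

def update_estimates(signatures, sig_q_old, sig_q_new, est):
    """signatures: object id -> signature list; est: id -> estimated score."""
    H = len(sig_q_old)
    for tid, sig_t in signatures.items():
        for i in range(H):
            old_match = sig_t[i] == sig_q_old[i]
            new_match = sig_t[i] == sig_q_new[i]
            if not old_match and new_match:     # Case 1: gained a match
                est[tid] += 1.0 / H
            elif old_match and not new_match:   # Case 2: lost a match
                est[tid] -= 1.0 / H
            # Cases 3 and 4: match status unchanged, estimate unchanged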

Algorithm 3: A MinHash-based algorithm (MHB)
Input: current query Q_t; |H| hash functions; the set of previous MinHash values {min(h_i(Q_{t−1}))}; the MinHash table for the objects in R
Output: top-k results topk_t at current time t
1  compute {min(h_i(Q_t))}
2  foreach j ∈ {1, ..., |R|} do
3      foreach i ∈ {1, ..., |H|} do
4          if min(h_i(T_j)) ≠ min(h_i(Q_{t−1})) and min(h_i(T_j)) = min(h_i(Q_t)) then
5              sim*(T_j, Q_t) ← sim*(T_j, Q_{t−1}) + 1/|H|
6          else if min(h_i(T_j)) = min(h_i(Q_{t−1})) and min(h_i(T_j)) ≠ min(h_i(Q_t)) then
7              sim*(T_j, Q_t) ← sim*(T_j, Q_{t−1}) − 1/|H|
8      if T_j ∉ topk_t and sim*(T_j, Q_t) > score(topk_t[k]) then
9          update topk_t

We can further reduce the computational cost using inverted indices. The MinHash signature matrix that stores the MinHash values of the objects can be transformed into |H| inverted indices, whose structure is shown in Fig. 2. For each hash function, the corresponding inverted index stores a mapping from each existing MinHash value to a list of object ids, that is, an inverted list. Moreover, the elements in each inverted list are sorted in ascending order of object id, which enables more efficient list merge and intersection operations. In this way, we can efficiently retrieve the objects with a specified MinHash value according to a certain hash function.

Fig. 2 Inverted indices for MinHash values

The algorithm that uses the inverted indices is presented in Algorithm 4. The MinHash signature of the updated query is computed first. We need to search the inverted index of a hash function h_i only when the MinHash values of Q_t and Q_{t−1} regarding h_i are different. Suppose the MinHash value of the updated query changes according to h_i. To determine the objects whose similarity scores need to be updated, we recall Case 1 and Case 2 mentioned earlier in this section. Each case corresponds to a scan of one inverted list of h_i. Specifically, in Case 1, we increase the estimated Jaccard similarity by 1/|H| for the objects in the inverted list of hash value min(h_i(Q_t)). That is, instead of checking each object as in Algorithm 3, we search for the objects that satisfy Case 1 using the index. Similarly, in Case 2, we decrease the estimated Jaccard similarity by 1/|H| for the objects in the inverted list of hash value min(h_i(Q_{t−1})).
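A sketch of the inverted indices and of the index-based update that Algorithm 4 below formalizes (the structure names are ours): each index maps a MinHash value to the ascending list of object ids taking that value, so an update touches only the two affected inverted lists per changed hash function.

from collections import defaultdict

def build_indices(signatures, H):
    """indices[i][v] = ids of objects whose MinHash value under h_i is v."""
    indices = [defaultdict(list) for _ in range(H)]
    for tid in sorted(signatures):          # ascending id within each list
        for i, v in enumerate(signatures[tid]):
            indices[i][v].append(tid)
    return indices

def mhi_update(indices, sig_q_old, sig_q_new, est):
    H = len(sig_q_old)
    for i in range(H):
        if sig_q_old[i] == sig_q_new[i]:
            continue                        # h_i contributes no change
        for tid in indices[i][sig_q_new[i]]:     # Case 1: now matching
            est[tid] += 1.0 / H
        for tid in indices[i][sig_q_old[i]]:     # Case 2: no longer matching
            est[tid] -= 1.0 / H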

Algorithm 4: A MinHash-based algorithm using inverted indices (MHI)
Input: current query Q_t; |H| hash functions; the set of MinHash values {min(h_i(Q_{t−1}))}; the inverted indices
Output: top-k results topk_t at current time t
1  compute {min(h_i(Q_t))}
2  foreach i ∈ {1, ..., |H|} do
3      if min(h_i(Q_{t−1})) ≠ min(h_i(Q_t)) then
4          for T_j ∈ index_i[min(h_i(Q_t))] do
5              sim*(T_j, Q_t) ← sim*(T_j, Q_t) + 1/|H|
6          for T_j ∈ index_i[min(h_i(Q_{t−1}))] do
7              sim*(T_j, Q_t) ← sim*(T_j, Q_t) − 1/|H|
8  foreach j ∈ {1, ..., |R|} do
9      if T_j ∉ topk_t and sim*(T_j, Q_t) > score(topk_t[k]) then
10         update topk_t

The worst-case time complexity of this algorithm is O(|H| · |R|), where |R| is the number of objects and |H| is the number of hash functions. It is reached when the MinHash values of consecutive queries differ under all the hash functions. However, this case rarely occurs, because the Jaccard similarity between consecutive queries is in fact very high.

6 Experimental results

In this section, we discuss our experimental results on a series of synthetic data sets and two real data sets. All experiments were conducted on a Mac Pro (Late 2013) server with an Intel Xeon 3.70 GHz CPU, 64 GB of memory, and a 256 GB SSD. All the algorithms were implemented in Python 3 and run with PyPy, a much faster just-in-time compiler.

6.1 Results on synthetic data sets

To evaluate the effectiveness and efficiency of our two methods, the general pruning-based method (GP for short) and the MinHash-based method (MHI for short), we first report the experimental results on synthetic data sets. We used the IBM Quest data generator to produce the synthetic data sets. We conducted experiments to test the efficiency and accuracy of our methods with respect to the following parameters:

- k: the number of results in the top-k list;
- |Q_t|: the parameter controlling both the number of items per query and the average number of items per object;
- |Σ|: the alphabet size;
- |H|: the number of different hash functions;
- |R|: the number of objects.

To generate synthetic data sets using the IBM Quest data generator, we set the parameters |Q_t|, |Σ|, and |R|. The synthetic query stream was generated by concatenating random objects whose lengths are between 0.8|Q_t| and 1.2|Q_t|, where |Q_t| is the average object length of the data set.

Table 6 Values of controlled variables for the tests on Jaccard similarity

Data set       | Varying | k        | |Q_t|    | |Σ|      | |H|      | |R|      | Figures
Synthetic      | k       | multiple | —        | 10^4     | —        | —        | 3a, 4a, 6a
Synthetic      | k       | multiple | —        | 20       | —        | 20,000   | 3b, 4b, 6b
Synthetic      | |Q_t|   | 10       | multiple | —        | —        | —        | 3d, 4d, 6d
Synthetic      | |Σ|     | —        | —        | multiple | —        | —        | 3c, 4c, 6c
Synthetic      | |H|     | —        | —        | —        | multiple | —        | 3e, 6e
Synthetic      | |R|     | —        | —        | —        | —        | multiple | 3f, 4e, 5a, 6f
Market basket  | k       | multiple | 10       | 16,470   | —        | 88,162   | 7a–c
Market basket  | |H|     | —        | —        | 16,470   | multiple | 88,162   | 7g, h
Market basket  | |R|     | —        | —        | 16,470   | —        | multiple | 7d–f, 5b
Click stream   | k       | multiple | —        | 17       | —        | 31,790   | 8a–c
Click stream   | |H|     | —        | —        | 17       | multiple | 31,790   | 8g, h
Click stream   | |R|     | —        | —        | 17       | —        | multiple | 5c, 8d–f

The way we produce synthetic query streams mimics some real application scenarios. Take the online RPG gaming scenario as an example. Suppose one is exploring the game map and walks from one game scene to another. While each scene has its own combination of enemies (the objects here), when moving from one scene to another, the most suitable query is the combination of enemies from both scenes (the concatenation of the two objects within the sliding window).

In each experiment, we varied one factor and fixed the others. Table 6 shows the values of the controlled variables along with the corresponding figures for each test.

We compare the average querying time of GP and MHI with two baseline methods, namely BFM and MHB. BFM is a brute-force method that computes the exact similarity score of every object with respect to a query. MHB is a method based on the MinHash technique that does not use any indexing structures.

To better illustrate how the upper bounds derived in Sect. 4 can be used to prune unpromising objects, we define the pruning effectiveness with respect to an update as follows.

Definition 3 (Pruning effectiveness of GP) Suppose we have |R| objects in total and we compute the exact similarity scores for |R′| of them during an update. The pruning effectiveness with respect to this update is 1 − |R′|/|R|.

In this section, we report the average pruning effectiveness over the stream updates. Moreover, we provide the peak memory usage of all the methods when testing scalability. The accuracy in our context is defined as follows.

Definition 4 (Accuracy) For a method A that returns a top-k list topk_A_t with respect to query Q_t, the accuracy of A is the proportion of objects in topk_A_t whose exact similarity scores with respect to Q_t are no smaller than the similarity between topk_t[k] and Q_t, where topk_t is the ground-truth top-k list. That is,

acc_A = |{T | T ∈ topk_A_t ∧ sim(T, Q_t) ≥ sim(topk_t[k], Q_t)}| / k
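Both evaluation measures are simple ratios; a small sketch of how we read Definitions 3 and 4 (the argument names are ours):

def pruning_effectiveness(num_verified, num_objects):
    """Definition 3: fraction of objects whose exact scores were NOT computed."""
    return 1.0 - num_verified / num_objects

def accuracy(reported_ids, exact_sim, truth_kth_score, k):
    """Definition 4: reported objects reaching the ground-truth kth score."""
    hits = sum(1 for t in reported_ids if exact_sim[t] >= truth_kth_score)
    return hits / k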

The idea here is to check the percentage of objects reported in the answer that indeed have a similarity score passing the minimum similarity threshold set by the kth most similar object in the ground truth. The higher this percentage, the more accurate the answer. The pruning algorithm GP always returns the exact query results and thus has 100 % accuracy. Since the MinHash-based methods compute similarity scores approximately, MHB and MHI give estimated answers to top-k queries. We report the average accuracy of the MinHash methods in Fig. 6.

Efficiency

The average querying time of the four methods when k varies is shown in Fig. 3a, b on two synthetic data sets with a large alphabet (10^4) and a small one (20), respectively. In both cases, our baseline methods, BFM and MHB, show almost no change in average processing time. MHI, which uses inverted indices on the MinHash values, greatly outperforms the other three methods. The pruning effectiveness of GP is shown in Fig. 4a, b for alphabet sizes 10^4 and 20, respectively. The pruning effectiveness of GP drops gradually as k increases, because the similarity score of the kth best object in the top-k result tends to decrease with k. With the other parameters fixed, a larger alphabet also makes the similarity score of the kth best object smaller; thus, the pruning effectiveness generally is weaker for larger alphabet sizes.

Figures 3c and 4c illustrate the trends when the alphabet size changes. When the alphabet size increases dramatically, the pruning effectiveness of GP gradually drops from 69.2 to 56.3 %. Thus, the average processing time of GP slightly increases. The average running time of BFM remains stable with respect to the alphabet size. The average running time of MHI first decreases and then increases. When the alphabet size is small, the length of each inverted list is relatively large, so a change in a MinHash value of the query may result in many updates to the estimated similarity scores. When the alphabet size becomes very large, although the size of each inverted list is small, the number of unmatched MinHash values between the signatures of the previous query and the new query increases, which results in more searches on the inverted indices.

We also examined how the average query answering time changes with respect to the average object length. The results on efficiency and pruning effectiveness are shown in Figs. 3d and 4d, respectively. The average processing time of BFM increases linearly, while that of MHB does not increase greatly, because we transform objects of varied length into MinHash signatures of the same length. MHI still performs much better than the others, and its cost increases linearly: when the average object length becomes larger, the size of each inverted list becomes larger, too, which results in longer processing time. When the average object length increases, the pruning effectiveness of GP increases dramatically, starting from 61.1 %. However, its average running time still increases slightly, since there is an increase in the cost of computing exact similarity scores for larger sets in the verification phase.

The performance with respect to the number of hash functions is shown in Fig. 3e. The average processing time of both MHB and MHI increases; MHI's growth is roughly linear. The scalability of our methods is shown in Fig. 3f. The runtime of all the methods increases almost linearly as the number of objects increases. The pruning effectiveness of GP increases from 48.3 to 64.2 % when the number of objects increases from 10^4 to 10^6, as shown in Fig. 4e.

The peak memory usage of our methods is shown in Fig. 5a. BFM and GP follow a similar growth trend, and GP requires only a little more memory than BFM, because we use extra data structures, such as the bin queue and a max-heap, to store the upper bounds of similarity scores. Compared to BFM and GP, the MinHash-based methods need more space, since we need to store a MinHash signature for each object. MHB costs less space than MHI because we store MinHash signatures differently in the two methods: for MHB, we simply store each MinHash signature as an array, while in MHI, we construct an inverted index for each hash function to enable more efficient updates, which requires more space.

Fig. 3 Average processing time on the synthetic data sets (methods: BFM, MHB, GP, MHI): (a) vary k, large |Σ|; (b) vary k, small |Σ|; (c) vary |Σ|; (d) vary |Q_t|; (e) vary |H|; (f) vary |R|.

Fig. 4 Pruning effectiveness of GP on the synthetic data sets: (a) vary k, large |Σ|; (b) vary k, small |Σ|; (c) vary |Σ|; (d) vary |Q_t|; (e) vary |R|.

Accuracy

We also tested the accuracy of the MinHash-based algorithms. By our definition, the accuracy of the two MinHash-based algorithms is the same given the same set of hash functions. We denote the MinHash methods by MH in the figures that report accuracy. The accuracy can be affected by two factors: the number of hash functions used, and the intrinsic characteristics of the data set, such as the distribution of similarity scores among objects.

Fig. 5 Peak memory usage on the different data sets: (a) synthetic data set, (b) market basket data set, (c) click stream data set.

The results show that the MinHash-based methods achieve an accuracy ranging from 68 to 98 % on the synthetic data sets when the default number of hash functions is used. Figure 6a, b shows the change in accuracy when k increases, on the data sets with a large alphabet and a small alphabet, respectively. On the data set with alphabet size 10^4, the accuracy decreases gradually, from 98 to 68.7 %. When the alphabet size is 20, the average accuracy does not vary much but still shows a slightly descending trend.

The average accuracy increases quickly and then decreases as the alphabet size increases, as shown in Fig. 6c. The highest accuracy is achieved at a moderate alphabet size. The reason is as follows. With a fixed number of objects, when the alphabet size is small, there are many similar objects in the data set, which lowers the accuracy. In contrast, when the alphabet size is too large, the objects in the data set are less similar, which also leads to slightly lower accuracy.

In Fig. 6d, when the average object length increases from 10 to 10^3, the accuracy drops drastically, from 86.7 to 10 %. The reason is apparent: in MinHash, the Jaccard similarity is the probability of two objects having the same minimal hash values with respect to all the hash functions. With a fixed number of hash functions, the larger the window size (the average object length here), the smaller the portion of the elements in the window that are hashed to the common MinHash value, which leads to a less accurate estimation. Thus, more hash functions are needed to achieve the same level of accuracy when the object length becomes larger.

Fig. 6 Average accuracy of the hashing-based method (MH) on the synthetic data sets: (a) vary k, large |Σ|; (b) vary k, small |Σ|; (c) vary |Σ|; (d) vary |Q_t|; (e) vary |H|; (f) vary |R|.

We also tested the trend of accuracy when the number of hash functions and the number of objects increase. The results are shown in Fig. 6e and f, respectively. When more hash functions are used, the estimated Jaccard similarity is closer to the exact score, which leads to higher accuracy. As shown in Fig. 6e, the results follow this trend; our methods achieve average accuracy ranging from 86.7 to 95 %. Figure 6f suggests that the average accuracy increases gradually when the number of objects increases.

Table 7 Statistics of the real data sets

Data set       | |R|    | |Σ|    | Avg. |T_i|
Market basket  | 88,162 | 16,470 | —
Click stream   | 31,790 | 17     | 5.33

6.2 Results on real data sets

We conducted experimental studies on two real data sets: a Market Basket data set from an anonymous Belgian retail store, and a Click Stream data set in which a predefined category, such as news or tech, replaces each URL on the MSNBC website. The data processing was conducted by the provider. Moreover, the provider of the Click Stream data set also removed the short sequences of length no larger than 8. We then transformed each sequence into a set. For the Click Stream data set, each static object is the set of categories of the URLs visited by a user, while for the Market Basket data set, each static object is the set of items purchased by a customer. There is a significant difference in the alphabet sizes of the two real data sets: the Market Basket data set has 16,470 distinct items, while the Click Stream data set contains only 17 distinct items. Table 7 shows the detailed statistics of the data sets.

To generate the querying stream, we used the same method as for the synthetic data sets; that is, we concatenated randomly selected objects from the data set whose sizes are between 0.8|Q_t| and 1.2|Q_t|, where |Q_t| is the average object length of the data set.

For each data set, we compared the average processing time and the average accuracy when k or |H| varies, respectively. We also include scalability tests on both time and peak memory usage. The results are shown in Figs. 5, 7, and 8.

The trends of the curves for the Market Basket data set are consistent with the results on the synthetic data sets. To compare the average processing time of the different methods when k varies, we set the number of hash functions to its default value. The processing time of BFM and MHB is stable. MHI always has the best performance, and its processing time increases gradually. The pruning effectiveness of GP decreases gradually from 64.9 to 38.3 % when k increases; therefore, the processing time of GP increases with k. The MinHash-based algorithms always achieve accuracy over 70 % across the tested range of k, as shown in Fig. 7b.

Figure 7g, h shows the trends of the processing time and accuracy of the MinHash-based methods, respectively, when different numbers of hash functions are used. The average processing time of both MHB and MHI increases almost linearly with the number of hash functions. MHI is between 31 and 96 times faster than MHB. The average accuracy increases steadily, from 73.1 to 93.4 %, as the number of hash functions increases to 500.

We also tested the scalability on this data set by duplicating it 5, 10, 50, and 100 times. We report the average processing time, average accuracy, average pruning effectiveness, and peak memory usage in Figs. 7d–f and 5b, respectively. Figure 7d shows that the processing time of all the methods increases linearly as the data set size increases. The accuracy first increases drastically, then reaches a level of over 95 % and remains steady. The pruning effectiveness does not change much across scale factors.

Fig. 7 Results on the market basket data set (methods: BFM, MHB, GP, MHI): (a) vary k (time), (b) vary k (accuracy), (c) vary k (pruning effectiveness), (d) vary |R| (time), (e) vary |R| (accuracy), (f) vary |R| (pruning effectiveness), (g) vary |H| (time), (h) vary |H| (accuracy).

Fig. 8 Results on the click stream data set (methods: BFM, MHB, GP, MHI): (a) vary k (time), (b) vary k (accuracy), (c) vary k (pruning effectiveness), (d) vary |R| (time), (e) vary |R| (accuracy), (f) vary |R| (pruning effectiveness), (g) vary |H| (time), (h) vary |H| (accuracy).

25 Continuous similarity search for evolving queries The peak memory usage is shown in Fig. 5b. The peak memory size used of all the methods increases almost linearly when the data set increases. Similar to the results in the scalability test on the synthetic data sets, GP uses only a little more memory than BFM. Their trends are similar. The MinHash-based methods consume more space than BFM and GP. The reason is the same as explained on the synthetic data sets in Sect The results on the click stream data set are shown in Figs. 8 and 5c. Since the average length of the objects in this data set is only 5.33, the MinHash-based algorithms do not need many hash functions in order to achieve high accuracy. As shown in Fig. 8h, the average accuracy is higher than 90 % when 40 or more hash functions are used. To test the performance when k varies, we used 50 hash functions as the default. When k varies, MHI is still the best method with respect to average query answering time. The average accuracy drops gradually from 90 to 67 %. As shown in Fig. 8c, the pruning effectiveness of GP decreases from 75.8 to 54.2 %, which is generally better than the results on the market basket data set. We tested the scalability on the click stream data set by duplicating the data set 10,, and 0 times. Figures 8d f and 5c show how the average processing time, average accuracy, average pruning effectiveness, and peak memory usage change with respect to different scale of data set, respectively. The average processing time of all the methods increases almost linearly when the data set size increases. MHI always achieves the lowest processing time and outperforms the baselines by more than an order of magnitude. Moreover, it reports with reasonably high accuracy in the range of 81.6 and 90 % as shown in Fig. 8e. When the data set size increases, the pruning effectiveness of GP increases steadily and is generally higher than the results on the market basket data set. The memory usage of BFM, GP, and MHB is very close and the corresponding curves in Fig. 5c overlap with each other. Comparing to the other data sets, the memory usage of MHB is much closer to BFM and GP because we use less number of hash functions. MHI still consumes more memory due to the inverted indices data structure. However, its memory usage is much closer to the other three methods comparing to the results on the other data sets. Recall that the size of our indexing structure depends on the number of hash functions used and the alphabet size. 6.3 Summary We conclude the following from our experiments. First, the pruning-based method is an exact method and the hashing-based method can achieve very good accuracy when a few hundreds of hash functions are used. The hashing-based method can lead to good accuracy and can achieve faster average querying time than the pruning-based method in most cases. However, when k increases, the accuracy drops and the running time increases greatly. The pruningbased method runs slower but maintains relatively stable running time with increasing k. For the hashing-based method, usually larger number of hash functions H provides better accuracy but also takes more running time. However, when the number of hash functions exceeds 1,000, the increase in accuracy usually is very minor. 
As shown on the synthetic data sets, when the average object length becomes larger, the accuracy of the hashing-based method drops greatly while the pruning effectiveness of the pruning-based method increases significantly, which implies that the pruning-based method is preferable in this case. In general, both methods achieve better performance on data sets with a larger lexicon: the sparser the data, the more pronounced the differences between similarity scores, and usually the better the runtime performance.
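To make the accuracy–cost tradeoff of the hashing-based method concrete, the following is a minimal sketch of MinHash estimation over plain (unweighted) sets. It is illustrative only: the method evaluated above extends the same idea to weighted Jaccard similarity and maintains signatures incrementally, and all names and parameters below are of the sketch, not of our implementation.

import random

PRIME = 2_147_483_647  # a large prime for the universal hash family

def make_hash_funcs(h, seed=0):
    # One random pair (a, b) per hash function: x -> (a * x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(h)]

def signature(items, funcs):
    # The MinHash signature stores, for each hash function, the minimum
    # hash value over all item ids in the set.
    return [min((a * x + b) % PRIME for x in items) for (a, b) in funcs]

def estimate_jaccard(sig_x, sig_y):
    # Two sets agree at a signature position with probability equal to their
    # Jaccard similarity, so the agreement rate is an unbiased estimate.
    return sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)

funcs = make_hash_funcs(h=50)  # 50 hash functions, as in the click stream test
x, y = {1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}
print(estimate_jaccard(signature(x, funcs), signature(y, funcs)))
# true Jaccard similarity: |{3,4,5}| / |{1,...,7}| = 3/7, about 0.43

Since each signature position agrees with probability equal to the true similarity J, the standard error of the estimate is sqrt(J(1-J)/H). This explains why accuracy improves quickly over the first tens or hundreds of hash functions and only marginally beyond 1,000.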

7 Conclusions

In this paper, we studied the problem of continuous similarity search for evolving queries. To the best of our knowledge, this is the first research endeavor on this problem. We devised two efficient methods with different frameworks. The pruning-based method uses pruning strategies to reduce the cost of computing exact similarity scores. The MinHash-based method approximates the Jaccard similarity score using the MinHash technique and efficiently updates the estimated scores using indexing structures. The empirical results on synthetic and real data sets verify the effectiveness and efficiency of our methods.

As for future work, we plan to consider the following interesting directions.

Enhance the pruning effectiveness. For example, we may incorporate the evolving feature into the pruning techniques used in static similarity join and search algorithms to further prune candidates.

Improve the hashing-based method for other similarity measures. We may extend our hashing-based framework to other similarity and distance measures that can be approximated by an LSH scheme, such as the one used in MinHash.

Similarity search over multiple evolving streams. Given a static object and multiple data streams with a fixed sliding window size, we may want to continuously find the top-k data streams whose last n items are most similar to the static object. The techniques proposed in this paper can be further extended to this problem.

Appendix: Other similarity measures and their upper bounds for the pruning method

In this section, we extend the upper bounds used by the pruning method to weighted overlap, weighted cosine, and weighted dice similarity. Similar to the case of weighted Jaccard similarity, all of these bounds are monotonic with respect to the number of updates $u$. Throughout, $x_i$ and $y_i$ denote the multiplicities of the $i$-th item in the multisets $X$ and $Y$, respectively, $m$ is the alphabet size, $|X| = \sum_{i=1}^{m} x_i$, and $Y'$ is the multiset obtained from $Y$ by $u$ updates.

Property 1 (A progressive upper bound for weighted overlap similarity) We first define the weighted overlap similarity as

$\mathrm{sim}_{\mathrm{over}}(X, Y) = \sum_{i=1}^{m} \min(x_i, y_i)$ (8)

Given $X$, $Y$, and the weighted overlap similarity score $\mathrm{sim}_{\mathrm{over}}(X, Y)$, without knowledge of the updated elements in $Y'$, we have the upper bound

$\mathrm{sim}_{\mathrm{over}}(X, Y') \le \mathrm{sim}_{\mathrm{over}}(X, Y) + u$ (9)

Proof By definition, $\mathrm{sim}_{\mathrm{over}}(X, Y') = \sum_{i=1}^{m} \min(x_i, y'_i)$. Each update changes the multiplicity of a single item in $Y$ by one and thus can increase one term $\min(x_i, y_i)$ by at most 1. Hence, the maximum possible increase after $u$ updates is $u$.

Property 2 (A progressive upper bound for weighted cosine similarity) We first define the weighted cosine similarity as

$\mathrm{sim}_{\cos}(X, Y) = \frac{\sum_{i=1}^{m} \min(x_i, y_i)}{\sqrt{|X|\,|Y|}}$ (10)

Given $X$, $Y$, and the weighted cosine similarity score $\mathrm{sim}_{\cos}(X, Y)$, without knowledge of the updated elements in $Y'$, we have the upper bound

$\mathrm{sim}_{\cos}(X, Y') \le \mathrm{sim}_{\cos}(X, Y) + \frac{u}{\sqrt{|X|\,|Y|}}$ (11)

Proof In our scenario, the sliding window size is fixed, so $|Y'| = |Y|$ and thus $\sqrt{|X|\,|Y'|} = \sqrt{|X|\,|Y|}$. By Property 1, we have

$\mathrm{sim}_{\cos}(X, Y') = \frac{\sum_{i=1}^{m} \min(x_i, y'_i)}{\sqrt{|X|\,|Y'|}} \le \frac{\sum_{i=1}^{m} \min(x_i, y_i) + u}{\sqrt{|X|\,|Y|}} = \mathrm{sim}_{\cos}(X, Y) + \frac{u}{\sqrt{|X|\,|Y|}}$

Property 3 (A progressive upper bound for weighted dice similarity) We first define the weighted dice similarity as

$\mathrm{sim}_{\mathrm{dice}}(X, Y) = \frac{2\sum_{i=1}^{m} \min(x_i, y_i)}{|X| + |Y|}$ (12)

Given $X$, $Y$, and the weighted dice similarity score $\mathrm{sim}_{\mathrm{dice}}(X, Y)$, without knowledge of the updated elements in $Y'$, we have the upper bound

$\mathrm{sim}_{\mathrm{dice}}(X, Y') \le \mathrm{sim}_{\mathrm{dice}}(X, Y) + \frac{2u}{|X| + |Y|}$ (13)

Proof In our scenario, $|Y'| = |Y|$, so $|X| + |Y'| = |X| + |Y|$. By Property 1, we have

$\mathrm{sim}_{\mathrm{dice}}(X, Y') = \frac{2\sum_{i=1}^{m} \min(x_i, y'_i)}{|X| + |Y'|} \le \frac{2\sum_{i=1}^{m} \min(x_i, y_i) + 2u}{|X| + |Y|} = \mathrm{sim}_{\mathrm{dice}}(X, Y) + \frac{2u}{|X| + |Y|}$
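To illustrate how these progressive bounds support pruning, here is a minimal sketch assuming multisets are stored as item-to-multiplicity dictionaries. The helper names (size, overlap, can_skip) and the threshold test are illustrative and do not reproduce the full pruning algorithm in the paper.

import math

def size(x):
    # |X|: total multiplicity of a multiset stored as {item: count}.
    return sum(x.values())

def overlap(x, y):
    # Weighted overlap: sum over items of the smaller multiplicity (Eq. 8).
    return sum(min(c, y.get(i, 0)) for i, c in x.items())

def overlap_bound(prev, u):
    # Property 1: u updates raise the overlap by at most u (Eq. 9).
    return prev + u

def cosine_bound(prev, u, x, y):
    # Property 2: the window size is fixed, so |Y'| = |Y| and the
    # denominator does not change (Eq. 11).
    return prev + u / math.sqrt(size(x) * size(y))

def dice_bound(prev, u, x, y):
    # Property 3: the numerator grows by at most 2u while the
    # denominator is fixed (Eq. 13).
    return prev + 2 * u / (size(x) + size(y))

def can_skip(prev, u, threshold):
    # Pruning test for the overlap measure: if even the optimistic bound
    # cannot beat the current top-k threshold, the exact score need not
    # be recomputed for this object.
    return overlap_bound(prev, u) <= threshold

Because each bound is computed in constant time from the previously known score and the number of updates, an object needs to be re-examined exactly only when its optimistic bound exceeds the current top-k threshold.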

References

1. Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Proceedings of the 47th annual IEEE symposium on foundations of computer science, FOCS 06, Washington, DC, USA. IEEE Computer Society
2. Artin E (2011) Geometric algebra. Wiley, Hoboken
3. Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on World Wide Web, WWW 07, New York, NY, USA. ACM
4. Böhm C, Ooi BC, Plant C, Yan Y (2007) Efficiently processing continuous k-nn queries on data streams. In: Proceedings of the international conference on data engineering, ICDE 07, Washington, DC, USA. IEEE Computer Society
5. Broder A (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences 1997, SEQUENCES 97, Washington, DC, USA. IEEE Computer Society
6. Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (2000) Min-wise independent permutations. J Comput Syst Sci 60(3):630–659
7. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing, STOC 02, New York, NY, USA. ACM, pp 380–388

8. Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering, ICDE 06, Washington, DC, USA. IEEE Computer Society
9. Cohen WW (1998) Integration of heterogeneous databases without common domains using queries based on textual similarity. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data, SIGMOD 98, New York, NY, USA. ACM
10. Cost S, Salzberg S (1993) A weighted nearest neighbor algorithm for learning with symbolic features. Mach Learn 10(1)
11. Datar M, Muthukrishnan S (2002) Estimating rarity and similarity over data stream windows. In: Proceedings of the 10th annual European symposium on algorithms, ESA 02. Springer, London
12. Devroye L, Wagner TJ (1982) Nearest neighbor methods in discrimination. Handbook of statistics, vol 2
13. Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, Equitz W (1994) Efficient and effective querying by image content. J Intell Inf Syst 3(3–4)
14. Faloutsos C, Oard DW (1995) A survey of information retrieval and filtering methods. Technical report, University of Maryland Institute for Advanced Computer Studies, College Park, MD, USA
15. Flickner M, Sawhney H, Niblack W, Ashley J, Huang Q, Dom B, Gorkani M, Hafner J, Lee D, Petkovic D, Steele D, Yanker P (1995) Query by image and video content: the QBIC system. Computer 28(9)
16. Gersho A, Gray RM (1991) Vector quantization and signal compression. Kluwer Academic Publishers, Norwell
17. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the 25th international conference on very large data bases, VLDB 99, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
18. Hastie T, Tibshirani R (1995) Discriminant adaptive nearest neighbor classification. In: Proceedings of the first international conference on knowledge discovery and data mining, KDD 95, Palo Alto, CA, USA. AAAI Press
19. Henzinger M (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 06, New York, NY, USA. ACM
20. Indyk P (2001) A small approximately min-wise independent family of hash functions. J Algorithms 38(1)
21. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, STOC 98, New York, NY, USA. ACM
22. Koivunen V, Kassam S (1995) Nearest neighbor filters for multivariate data. In: IEEE workshop on nonlinear signal and image processing, Washington, DC, USA. IEEE Computer Society
23. Kollios G, Tsotras VJ (2002) Hashing methods for temporal data. IEEE Trans Knowl Data Eng 14(4)
24. Kontaki M, Papadopoulos AN (2004) Efficient similarity search in streaming time sequences. In: Proceedings of the 16th international conference on scientific and statistical database management, SSDBM 04, Washington, DC, USA. IEEE Computer Society
25. Koudas N, Ooi BC, Tan K-L, Zhang R (2004) Approximate NN queries on streams with guaranteed error/performance bounds. In: Proceedings of the thirtieth international conference on very large data bases, VLDB 04. VLDB Endowment
26. Lian X, Chen L, Wang B (2007) Approximate similarity search over multiple stream time series. In: Proceedings of the 12th international conference on database systems for advanced applications, DASFAA 07. Springer, Berlin
27. Mouratidis K, Bakiras S, Papadias D (2006) Continuous monitoring of top-k queries over sliding windows. In: Proceedings of the 2006 ACM SIGMOD international conference on management of data, SIGMOD 06, New York, NY, USA. ACM
28. Mouratidis K, Papadias D (2007) Continuous nearest neighbor queries over sliding windows. IEEE Trans Knowl Data Eng 19(6)
29. Mouratidis K, Papadias D, Bakiras S, Tao Y (2005) A threshold-based algorithm for continuous monitoring of k nearest neighbors. IEEE Trans Knowl Data Eng 17(11)
30. Pan S, Zhu X (2012) Continuous top-k query for graph streams. In: Proceedings of the 21st ACM international conference on information and knowledge management, CIKM 12, New York, NY, USA. ACM
31. Pentland A, Picard RW, Sclaroff S (1994) Photobook: content-based manipulation of image databases. In: Storage and retrieval for image and video databases, Bellingham, WA, USA. SPIE, pp 34–47

32. Rao W, Chen L, Chen S, Tarkoma S (2014) Evaluating continuous top-k queries over document streams. World Wide Web 17(1)
33. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc., New York
34. Sarawagi S, Kirpal A (2004) Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data, SIGMOD 04, New York, NY, USA. ACM
35. Smeulders A, Jain R (eds) (1998) Image databases and multimedia search. World Scientific Publishing Co., Inc., River Edge
36. Sun Y, Han J, Yan X, Yu PS, Wu T (2011) PathSim: meta path-based top-k similarity search in heterogeneous information networks. Proc VLDB Endow 4(11)
37. Winkler WE (1999) The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, Suitland
38. Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th international conference on World Wide Web, WWW 08, New York, NY, USA. ACM

Xiaoning Xu received her Master's degree in computing science from Simon Fraser University, Canada, in 2014, supervised by Prof. Jian Pei. Previously, she was in the SFU-ZJU computing science dual degree program, receiving a B.Sc. degree from Simon Fraser University, Canada, and a B.Eng. degree from Zhejiang University, China, both in 2012. Her research interests focus on similarity search and data stream processing. She is currently a software engineer at Fortinet Inc., working on data analytics and threat detection in network log data.

Chuancong Gao has been a Ph.D. student in computing science at Simon Fraser University, Canada, since 2013, supervised by Prof. Jian Pei. Previously, he received a master's degree from Tsinghua University, China, in 2010 and a bachelor's degree from Shandong University, China, in 2007, both in computer science and engineering. His current research interests are data cleaning and data integration.

Jian Pei is the Canada Research Chair in Big Data Science and a professor at the School of Computing Science and the Department of Statistics and Actuarial Science at Simon Fraser University, Canada. He received a Ph.D. degree at the same school in 2002 under Dr. Jiawei Han's supervision. His research interests are to develop effective and efficient data analysis techniques for novel data-intensive applications. He has published prolifically and is one of the top cited authors in data mining. He has received a series of prestigious awards. He is also active in providing consulting services to industry and in transferring the research outcomes of his group to industry and applications. He is an editor of several esteemed journals in his areas and a passionate organizer of the premier academic conferences and initiatives defining the frontiers of the areas. He is an IEEE fellow.

Ke Wang received his Ph.D. from the Georgia Institute of Technology. He is currently a professor at the School of Computing Science, Simon Fraser University. His research interests include database technology, data mining, and knowledge discovery, with emphasis on massive data sets, graph and network data, and data privacy. Ke Wang has published more than 100 research papers in database, information retrieval, and data mining conferences. He is currently an associate editor of the ACM TKDD journal.

Abdullah Al-Barakati was the head of the Information Systems Department at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, and is a faculty member and a lecturer in the Faculty of Computing and Information Technology. Dr. Al-Barakati's academic interests range from teaching and academic research to modern theories in applied informatics. His research interests revolve around the area of Big Data, ranging from modern practices in Big Data to advanced concepts and special issues in Big Data mining (pattern extraction, visualization, and analysis). Language Processing (LP) and Machine Learning (ML) also fall within his research interests, as he has developed a passion for text analysis and manipulation. Furthermore, web applications, especially the dynamics of social media and Web 2.0, are important elements of Dr. Al-Barakati's research interests, as he believes they form the cornerstones of many astonishing innovations now and in the future.
