Fast Parallel Set Similarity Joins on Many-core Architectures
Sidney Ribeiro-Junior, Rafael David Quirino, Leonardo Andrade Ribeiro, Wellington Santos Martins
Universidade Federal de Goiás, Brazil
{sydneyribeirojunior, rafaelquirino, laribeiro,

Abstract. Set similarity join is a core operation for text data integration, cleaning, and mining. Previous research work on improving the performance of set similarity joins mostly focused on sequential, CPU-based algorithms. The main optimizations of such algorithms exploit high threshold values and the underlying data characteristics to derive efficient filters. In this article, we investigate strategies to accelerate set similarity join by exploiting the massive parallelism available in modern Graphics Processing Units (GPUs). We develop two new parallel set similarity join algorithms, which implement an inverted index on the GPU to quickly identify similar sets. The first algorithm, called gssjoin, does not rely on any filtering scheme and, thus, exhibits much better robustness to variations of threshold values and data distributions. Moreover, gssjoin adopts a load balancing strategy to evenly distribute the similarity calculations among the GPU's threads. The second algorithm, called sf-gssjoin, applies a block division scheme for dealing with large datasets that do not fit in GPU memory. This scheme also enables a substantial reduction of the comparison space by discarding entire blocks based on size limits. We present variants of both algorithms for a multi-GPU platform to further exploit task parallelism in addition to data parallelism. Experimental evaluation on real-world datasets shows that we obtain up to 109x and 19.5x speedups over state-of-the-art algorithms for CPU and GPU, respectively.
Categories and Subject Descriptors: H.2 [Database Management]: Systems; C.1 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)

Keywords: Advanced Query Processing, GPU, High Performance Computing, Parallel Set Similarity Join

1. INTRODUCTION

The problem of efficiently answering set similarity join queries has attracted growing attention over the years [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006; Arasu et al. 2006; Bayardo et al. 2007; Vernica et al. 2010; Xiao et al. 2011; Ribeiro and Härder 2011; Wang et al. 2012; Deng et al. 2015; Cruz et al. 2016; Mann et al. 2016; Rong et al. 2017]. Set similarity join returns all pairs of similar sets from a dataset; two sets are considered similar if the value returned by a set similarity function for them is not less than a given threshold. This operation is of great interest and practical importance both in itself and as a basic operator for more advanced data processing tasks. Indeed, set similarity join has been used in a wide range of application domains, including data integration [Doan et al. 2012], entity extraction [Agrawal et al. 2008], data cleaning [Chaudhuri et al. 2006], collaborative filtering [Spertus et al. 2005], and data mining [Leskovec et al. 2014]. Set similarity join is a popular approach to dealing with text data, whose representation is typically sparse and high-dimensional. Text data can be mapped to sets, and there is a rich variety of set similarity functions to capture various notions of similarity; the well-known Jaccard similarity is a prime example of such functions. Furthermore, predicates involving set similarity functions can be equivalently expressed as an overlap constraint [Chaudhuri et al. 2006]; intuitively, the overlap of similar sets should be high. As a result, set similarity join is reduced to the problem of identifying set pairs with enough overlap. This research was partially supported by CAPES, CNPq, FAPEG, and FINEP.
We thank NVIDIA for equipment donations. Copyright © 2017. Permission to copy without fee all or part of the material printed in JIDM is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação. Journal of Information and Data Management, Vol. 8, No. 3, December 2017, Pages
The set overlap abstraction provides the basis for several optimizations. Prefix filtering [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006] is arguably the most important of such optimizations. In fact, this technique is employed by all state-of-the-art algorithms considered in a recent experimental evaluation [Mann et al. 2016]. Prefix filtering exploits the threshold value to prune dissimilar set pairs by inspecting only a fraction of them. A filtering-verification framework is typically used in this context, where input sets are sequentially processed: in the filtering step, prefix filtering is employed with the support of an inverted index to discard set pairs that cannot meet the overlap constraint; then, in the verification step, the overlap of the surviving pairs is measured and those deemed similar are sent to the output. Unfortunately, prefix filtering is only effective at high threshold values. Its pruning power drops drastically at lower thresholds, leading to an explosion of the number of set pairs that need to be compared in the verification step. This serious drawback may render set similarity join algorithms based on prefix filtering unsuitable for applications that often require lower thresholds to produce accurate results; important examples are duplicate detection and clustering [Hassanzadeh et al. 2009]. Moreover, filtering effectiveness and, in turn, algorithm performance heavily rely on characteristics of the underlying data distribution [Sidney et al. 2015]. In particular, the filtering effect is severely reduced (or even eliminated) on more uniform data distributions. To address the challenge of efficiently processing set similarity joins, we can exploit parallelism and take advantage of modern architectures with multiple cores. The multi-core trend affected not only CPUs but also GPUs (graphics processing units), which are known as many-core processors or accelerators.
GPUs offer an attractive performance-to-energy-consumption ratio and are becoming increasingly more affordable due to mass marketing for gaming. They have a large number of relatively simple, and slower, processing cores. The possibility of using GPUs to perform general-purpose computations appeared a few years ago and quickly caught the attention of researchers. However, due to their significant differences when compared to CPUs, novel algorithms and implementation approaches need to be designed. In order to take full advantage of their performance, considerable parallelism (tens of thousands of threads) and adequate control of high-latency operations are necessary, particularly for global memory accesses, which should be reduced to a minimum. In this article, we present an exact parallel algorithm and a GPU-based implementation for the set similarity join problem. Our solution takes advantage of data parallelism by processing individual sets in parallel. Each individual operation can be seen as a set similarity search, since it finds all sets in a set collection that are similar to a given input set. Furthermore, elements in a set are, in turn, processed in parallel as well, thereby exploiting both inter- and intra-set parallelism. This greatly improves the similarity join processing and can be straightforwardly mapped to modern highly-threaded accelerators like many-core GPUs. The proposed solution, called gssjoin (GPU-based Set Similarity Join), efficiently implements an inverted index by using a parallel counting operation followed by a parallel prefix-sum calculation. At search time, this inverted index is used to quickly find sets sharing tokens with the input set. We construct an index for the input set, which is used by a load balancing strategy to evenly distribute the similarity calculations among the GPU's threads. Finally, set pairs whose similarity is not less than the given threshold are returned to the CPU.
This article is an extended and revised version of a previous conference paper [Ribeiro-Júnior et al. 2016]. As part of the new material, we include a new, improved version of the gssjoin algorithm, called sf-gssjoin (Size-Filtered gssjoin), which uses size filtering to discard dissimilar pairs before comparing the sets, block division to deal with larger datasets, and a simpler compaction technique to improve performance at high threshold values. Furthermore, we provide an extensive set of experiments in which we compare our algorithm with other parallel solutions on several datasets. Our contributions are summarized as follows.

- A fine-grained parallel algorithm for both data indexing and set similarity join processing.
- A highly threaded GPU implementation that takes advantage of intensive occupation, hierarchical memory, load balancing, and coalesced memory access.
- A scalable multi-GPU implementation that exploits both data parallelism and task parallelism.
- A block division scheme for dealing with larger datasets that do not fit in the GPU memory, which also enables an aggressive reduction of the comparison space by pruning entire blocks of sets based on size limits.
- An extensive experimental evaluation comparing our solutions with state-of-the-art sequential as well as parallel algorithms on standard real-world textual datasets.

This article is organized as follows. Section 2 covers related work. Section 3 provides background material and Section 4 presents an overview of the architecture and programming model of a GPU. The gssjoin algorithm and its improved version, sf-gssjoin, are presented in Section 5 and Section 6, respectively. Experimental results are described in Section 7 and Section 8 concludes the article.

2. RELATED WORK

There is a substantial body of literature on the efficient computation of set similarity joins. Most proposals focus on the design of CPU-based, intrinsically sequential algorithms [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006; Arasu et al. 2006; Bayardo et al. 2007; Xiao et al. 2011; Ribeiro and Härder 2011; Wang et al. 2012] and, therefore, cannot fully exploit modern many-core architectures. The main optimizations use the threshold and the underlying data characteristics to derive effective filters [Sidney et al. 2015]. Popular filtering schemes are derived from the set overlap constraint, such as size filtering [Sarawagi and Kirpal 2004], prefix filtering [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006], and positional filtering [Xiao et al. 2011].
Other approaches partition the input dataset based on the symmetric set difference such that two sets are similar only if they share a common partition [Arasu et al. 2006; Deng et al. 2015]. Interestingly, overly complex filters can actually increase execution time: PPJoin+ [Xiao et al. 2011] and AdaptJoin [Wang et al. 2012], which employ sophisticated filtering strategies, have been shown to be the slowest competitors in the study presented by Mann et al. [2016]. Our proposed solutions either use no filter (i.e., gssjoin) or only size filtering to reduce the comparison space (i.e., sf-gssjoin). Another popular type of string similarity join employs constraints based on the edit distance, which is defined as the minimum number of character-editing operations (i.e., insertions, deletions, and substitutions) needed to make two strings equal. Because one can derive a (q-gram) set overlap bound from the edit distance [Gravano et al. 2001], our algorithms can be used in a pre-processing step to reduce the number of distance computations; edit distance also lends itself to efficient GPU implementation [Chacón et al. 2014]. Lieberman et al. [2008] presented a parallel similarity join algorithm for distance functions of the Minkowski family (e.g., the Euclidean distance). The algorithm first maps the smaller input dataset to a set of space-filling curves and then performs interval searches for each point in the other dataset in parallel. The overall performance of the algorithm drastically decreases as the number of dimensions increases (see Figure 5b in [Lieberman et al. 2008]) because every additional dimension requires the construction of a new space-filling curve. Thus, this approach can be prohibitively expensive on text data, whose representation typically involves several thousands of dimensions. Besides stand-alone algorithms, set similarity join can also be realized using relational database technology.
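Returning to the edit-distance discussion above: a commonly used conservative form of the q-gram bound states that strings within edit distance k share at least (max(|s1|, |s2|) − q + 1) − k·q q-grams, because each edit operation can change at most q q-grams. The following sketch illustrates the resulting pruning test; it is our own minimal illustration (function names are ours), not code from the paper:

```python
from collections import Counter

def qgrams(s, q=2):
    # all overlapping substrings of length q (kept as a multiset)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def count_filter(s1, s2, k, q=2):
    # Necessary condition: strings within edit distance k share at least
    # (max(|s1|,|s2|) - q + 1) - k*q q-grams. False means the pair can be
    # safely pruned before any (expensive) edit distance computation.
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    shared = sum(min(c1[g], c2[g]) for g in c1)
    bound = max(len(s1), len(s2)) - q + 1 - k * q
    return shared >= bound
```

For instance, the pair ("gssjoin", "gsjoin") survives the filter for k = 1, while two strings with no q-grams in common are pruned immediately.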
Previous work proposed expressing set similarity join declaratively in SQL [Ribeiro et al. 2016] or implementing it as a physical operator within the query engine [Chaudhuri et al. 2006]. Considering the increasing adoption of hardware accelerators by relational databases, including commercial products [Meraji et al. 2016], set similarity join can be a candidate operation for off-loading to the GPU in such systems.
Other works focused on performing set similarity join on a distributed platform such as MapReduce [Vernica et al. 2010; Deng et al. 2014; Rong et al. 2017]. In general, these proposals apply a partition scheme to send dissimilar strings to different reducers and, thus, avoid unnecessary similarity calculations. Existing partition schemes are based on prefix filtering [Vernica et al. 2010], the symmetric set difference [Deng et al. 2014], and vertical partitioning [Rong et al. 2017]. Nevertheless, intensive computations still have to be performed in the reducers. We plan to investigate the integration of our solutions into a distributed setting to accelerate such local computations in future work. Our algorithms, as well as all other algorithms discussed so far, are exact, i.e., they always produce the correct answer. Approximate set similarity joins may miss valid output set pairs to trade accuracy for query time. Locality Sensitive Hashing (LSH) is the most popular technique for approximate set similarity joins [Indyk and Motwani 1998]; it is based on hashing functions that are approximately similarity-preserving. Cruz et al. [2016] propose an approximate set similarity join algorithm designed for a many-core architecture (GPU). The authors estimate the Jaccard similarity between two sets using MinHash [Broder et al. 1998], an LSH scheme for Jaccard. MinHash can be orthogonally combined with our algorithm to reduce set sizes and, thus, obtain greater scalability. Similarity search queries retrieve strings similar to a given query string. They could be treated as a special case of join queries, where one of the two input datasets contains a single string. Techniques to increase filtering effectiveness in similarity search queries include exploiting multiple set orderings [Kim and Lee 2012], extracting different signature schemes from strings in the input data and the query string [Qin et al.], and defining a global order to insert the input sets into a tree index structure [Zhang et al. 2017]. The work in [Matsumoto and Yiu 2015] proposes a top-k similarity search algorithm for data represented as points in Euclidean space; a top-k similarity query returns the k most similar data objects. This algorithm leverages both the GPU and the CPU: expensive distance calculations are executed on the GPU, whereas sorting is executed on the host CPU. To the best of our knowledge, our gssjoin algorithm was the first GPU-based algorithm for exact set similarity join [Ribeiro-Júnior et al. 2016]. In subsequent work, we proposed fgssjoin [Quirino et al. 2017], which employs several filters to reduce the number of candidates. Specifically, fgssjoin applies prefix filtering, size filtering, and positional filtering to prune dissimilar pairs of sets in the filtering phase. These filters are very effective at high threshold values; however, their performance declines as lower threshold values are used. Size filtering is also exploited by fgssjoin in a block division scheme similar to the one adopted by sf-gssjoin. We experimentally compare the algorithms presented in this article with fgssjoin in Section 7.

3. BACKGROUND

In this section, we first formally define the set similarity join problem. Then, we review state-of-the-art optimization techniques and describe a high-level algorithmic framework for set similarity joins. Finally, we discuss the limitations of current solutions.

3.1 Basic Concepts and Problem Definition

Strings can be mapped to sets of tokens in several ways. A well-known method is based on the concept of q-grams, i.e., sub-strings of length q obtained by sliding a window over the characters of the input string. For example, the string gssjoin can be mapped to the set of 3-gram tokens {gss, ssj, sjo, joi, oin}.
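The q-gram mapping just described can be sketched in a couple of lines of Python (a minimal illustration with a function name of our own choosing; tokens are lowercase substrings of the input):

```python
def tokenize(s, q=3):
    # slide a window of length q over the characters of the input string
    # and collect the distinct q-grams as a set of tokens
    return {s[i:i + q] for i in range(len(s) - q + 1)}
```

For example, tokenize("gssjoin") yields the five 3-grams {"gss", "ssj", "sjo", "joi", "oin"}.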
Given two sets r and s, a set similarity function sim(r, s) returns a value in [0, 1] to represent their similarity; a greater value indicates that r and s have higher similarity. We formally define the set similarity join operation as follows.
Algorithm 1: General set similarity join algorithm.
Input: A sorted set collection C, a threshold τ
Output: A set S containing all pairs (r, s) s.t. sim(r, s) ≥ τ
1  I_1, I_2, ..., I_|U| ← ∅; S ← ∅
2  foreach r ∈ C do
3      foreach t ∈ pref(r) do
4          foreach s ∈ I_t do
5              if not filter(r, s) then
6                  S ← S ∪ verify(r, s)
7          I_t ← I_t ∪ {r}
8  return S

Definition 3.1 (Set Similarity Join). Given two set collections R and S, a set similarity function sim, and a threshold τ, the set similarity join between R and S returns all scored set pairs ⟨(r, s), τ′⟩ s.t. (r, s) ∈ R × S and sim(r, s) = τ′ ≥ τ.

A popular set similarity function is the well-known Jaccard similarity. Given two sets r and s, the Jaccard similarity is defined as J(r, s) = |r ∩ s| / |r ∪ s|. A predicate involving the Jaccard similarity and a threshold τ can be equivalently rewritten into a set overlap constraint: J(r, s) ≥ τ ⟺ |r ∩ s| ≥ (τ / (1 + τ)) · (|r| + |s|). In this article, we focus on the Jaccard similarity, but all concepts and techniques presented henceforth hold for other set similarity functions as well, such as Dice and Cosine [Ribeiro and Härder 2011]. Intuitively, the size difference of similar sets should not be large. We can derive bounds for immediate pruning of candidate pairs whose sizes differ enough. Formally, the size bounds of r, denoted by minsize(r, τ) and maxsize(r, τ), are functions that map τ and |r| to a real value s.t., for any s, if sim(r, s) ≥ τ, then minsize(r, τ) ≤ |s| ≤ maxsize(r, τ) [Sarawagi and Kirpal 2004]. Thus, given a set r, we can define a size-based filter to safely ignore all other sets whose sizes do not fall within the interval [minsize(r, τ), maxsize(r, τ)]. For the Jaccard similarity, we have [minsize(r, τ), maxsize(r, τ)] = [τ · |r|, |r| / τ]. Further, we can use the prefix filtering principle [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006] to prune dissimilar sets by examining only a subset of them.
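The Jaccard similarity, its equivalent overlap constraint, and the size bounds can be written out directly (a brief sketch of our own, not the paper's code):

```python
def jaccard(r, s):
    # J(r, s) = |r ∩ s| / |r ∪ s|
    return len(r & s) / len(r | s)

def overlap_bound(r, s, tau):
    # J(r, s) >= tau  <=>  |r ∩ s| >= tau / (1 + tau) * (|r| + |s|)
    return tau / (1 + tau) * (len(r) + len(s))

def size_bounds(r, tau):
    # [minsize(r, tau), maxsize(r, tau)] = [tau * |r|, |r| / tau] for Jaccard
    return tau * len(r), len(r) / tau
```

With r = {A, B, C, D, E}, s = {A, B, D, E, F}, and τ = 0.6, these give J(r, s) = 4/6 ≈ 0.67, an overlap bound of 3.75, and size bounds of [3, 8.3], matching Example 3.1 below.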
We first assume that the tokens of all sets are sorted according to a total order. A prefix r_p ⊆ r is the subset of r containing its first p tokens. Given two sets r and s, if |r ∩ s| ≥ α, then pref_{|r|−α+1}(r) ∩ pref_{|s|−α+1}(s) ≠ ∅. For a threshold τ, we can identify all candidate matches of a given set r using a prefix of length |r| − ⌈τ · |r|⌉ + 1; we denote such a prefix by pref(r). The original overlap constraint only needs to be verified on set pairs sharing a prefix token. Finally, we pick the token frequency ordering as the total order, thereby sorting the tokens of each set by increasing frequency in the set collection. Thus, we move lower-frequency tokens to prefix positions to minimize the number of verifications.

Example 3.1. Consider the sets r = {A, B, C, D, E} and s = {A, B, D, E, F}. We have |r| = |s| = 5 and |r ∩ s| = 4; thus sim(r, s) = 4/6 ≈ 0.67. For a threshold τ = 0.6, the set overlap constraint is |r ∩ s| ≥ (0.6/1.6) · (5 + 5) = 3.75, [minsize(r, τ), maxsize(r, τ)] = [3, 8.3] (the values for s are the same), pref(r) = {A, B, C}, and pref(s) = {A, B, D}.

3.2 General Algorithm

Most current set similarity join algorithms for main memory employ an inverted index and follow a filtering-verification framework [Mann et al. 2016]. A high-level description of this framework is presented by Algorithm 1. An inverted list I_t stores all sets containing token t in their prefix. The input collection C is scanned and, for each set r, its prefix tokens are used to find candidate sets in the
corresponding inverted lists (lines 2–4). Each candidate s is checked using filters, such as positional [Xiao et al. 2011] and length-based filtering [Sarawagi and Kirpal 2004] (line 5); if the candidate passes, the actual similarity computation is performed in the verification step and r and s are added to the result if they are similar (line 6). Finally, r is appended to the inverted lists of its prefix tokens (line 7). An important observation is that Algorithm 1 is intrinsically sequential: sets, prefix tokens, and candidate sets are processed sequentially, while the inverted index is dynamically created. Algorithm 1 is actually a self-join. It can be extended to a binary join by first indexing the smaller collection and then going through the larger collection to identify matching pairs.

Fig. 1. Limitations of algorithms based on prefix filtering: (a) prefix size for varying threshold values; (b) number of candidates generated for varying threshold values; (c) inverted list length for different token frequency distributions; (d) number of candidates generated for varying q values.

3.3 Prefix Filtering Limitations

Currently, prefix filtering is the prevalent optimization technique for CPU-based set similarity join algorithms. As such, all these algorithms benefit from its pruning power, but also suffer from its limitations. In particular, prefix filtering effectiveness is heavily dictated by two factors: the threshold value and the token frequency distribution [Sidney et al. 2015]. There is a clear correlation between threshold values and similarity join performance. Invariably, execution time increases with lower threshold values. The explanation is that lower threshold values imply larger prefixes, as illustrated in Figure 1(a). As a result, more inverted lists have to be scanned (Algorithm 1, lines 3–4) and a larger number of candidate pairs have to be verified.
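The filtering-verification framework of Algorithm 1 can be made concrete with a short sequential sketch (our own illustration: names are ours, the filter step is left out, and verify is inlined as an exact Jaccard computation):

```python
from collections import defaultdict
from math import ceil

def general_ssjoin(C, tau):
    # C: list of token lists, each sorted by the global token order.
    # Inverted lists are built dynamically over prefix tokens, as in
    # Algorithm 1; no positional/length filters are applied here.
    I = defaultdict(list)
    result = []
    for rid, r in enumerate(C):
        plen = len(r) - ceil(tau * len(r)) + 1   # prefix length for r
        candidates = set()
        for t in r[:plen]:
            candidates.update(I[t])              # sets sharing a prefix token
            I[t].append(rid)                     # index r under token t
        for sid in candidates:                   # verification step
            rs, ss = set(r), set(C[sid])
            sim = len(rs & ss) / len(rs | ss)
            if sim >= tau:
                result.append((sid, rid, sim))
    return result
```

On the two sets of Example 3.1 with τ = 0.6, this reports the single pair with similarity 4/6.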
Figure 1(b) shows the number of candidates (in log scale) for decreasing Jaccard thresholds on a 100K sample taken from the DBLP dataset (details about the datasets are given in Section 7). As we decrease the threshold from 0.9 to 0.3, there is an increase of three orders of magnitude in the number of candidates. Disregarding pruning due to other filters, the frequency of the tokens in the prefix determines the number of candidates for a given set. The total token order based on frequency places rare tokens in the prefixes. As a result, inverted lists are shorter and there is much less prefix overlap between dissimilar sets. Of course, the effectiveness of this strategy depends on the underlying token frequency distribution. For a uniform distribution, it behaves no better than an ordinary lexicographical order. Figure 1(c) depicts the impact of the token frequency skew on the size of the lists associated with prefix tokens, and Figure 1(d) shows the number of candidates (in log scale) for increasing q-gram sizes (larger values of q result in a more skewed token distribution). We can observe an increase of one order of magnitude in the number of candidates. The above observations indicate intrinsic limitations of prefix filtering and, in turn, of current set similarity join algorithms. Next, we provide a general overview of the underlying many-core architecture before presenting our massively parallel approach, which avoids such drawbacks.
4. GPU ARCHITECTURE AND PROGRAMMING MODEL

Fig. 2. Overview of a GPU architecture.

This section provides a brief description of both the GPU architecture and its programming model (we refer to [Kirk and Wen-mei 2012] for details). GPUs were originally designed as special-purpose coprocessors for dedicated graphics rendering. However, in recent years, they have become powerful accelerators for general-purpose computing. Nowadays, GPUs are regarded as massively parallel processors with approximately ten times the computation power and memory bandwidth of CPUs. They can perform a large number of independent parallel computations, this performance is improving at a rate higher than that of CPUs, and they offer an exceptionally high performance-to-cost ratio. Figure 2 shows a simplified diagram of a GPU that is common to most GPUs currently available. A GPU can be regarded as a Multiple SIMD (Single Instruction Multiple Data) processor, with each SIMD unit corresponding to a streaming multiprocessor (SM). Inside the SM there are several streaming processor (SP) cores that operate on different data but in a synchronous way. SPs belonging to the same SM share, among others, instruction fetch and decoding, load and store units, the register file, and local (shared) memory. A group of threads in an SM is divided into multiple schedule units, called warps, that are dynamically scheduled on the SM. The GPU supports a large number of light-weight threads, often tens or hundreds of thousands, and, unlike CPU threads, the overhead of their creation and switching is negligible. Threads for which data have already been transferred from global memory can perform calculations, while threads still waiting for the completion of a data transfer stall.
Thus, in order to hide the global memory's high latency, it is important to have more threads than the number of SPs and to have threads access consecutive memory addresses so that accesses can be coalesced. In most platforms, the communication between CPU and GPU takes place through a PCI Express connection. This channel permits the exchange of data between each one's address space, but at a much slower speed. Thus, the GPU programming model requires that part of the program run on the CPU while the computationally-intensive part is processed by the GPU. A GPU program exposes parallelism through data-parallel functions, called kernels, that are offloaded to the GPU. The programmer needs to configure the number of threads to be used. These threads are organized in groups (thread blocks) that are further assembled into a grid structure. When a kernel is launched, the blocks within a grid are distributed to idle SMs and the threads are mapped to the SPs.

5. THE GSSJOIN ALGORITHM

In this section, we present a detailed description of our GPU-based algorithm, called gssjoin. We start by describing the inverted index creation, that is, the pre-processing step. Then, the actual set similarity join is detailed. Finally, we present the multi-GPU version of gssjoin.
5.1 Pre-processing

Before performing the set similarity join, we map input strings to sets of tokens using q-grams and create an inverted index in the GPU memory, assuming the input collection fits in memory and is static. Let S be the set collection ordered by set size and V be the vocabulary of the collection, that is, the set of distinct tokens of the input collection. The input collection is represented by the set E of distinct (t, s) token-set pairs occurring in the original collection, with t ∈ V and s ∈ S. An array of size |E| is used to store the inverted index. Once the set E has been moved to the GPU memory, each pair in it is examined in parallel, so that each time a token is visited the number of sets in which it appears is incremented and stored in the array count, of size |V|. A parallel prefix sum is executed on the count array by mapping each element to the sum of all elements before it and storing the results in the index array. Thus, each element of the index array points to the position of the corresponding first element in the invertedIndex array, where all (t, s) pairs are stored sorted by token. The cardinality of all sets is computed in parallel, with each processor contributing partially (incrementing) to the respective position of the array cardinality, of size |S|. Algorithm 2 depicts the data indexing process.

5.2 Set Similarity Search

After the inverted index construction, gssjoin can be viewed as a batch of set similarity search operations. Given an input set, the set similarity search consists of two steps. First, the Jaccard similarity of the input set s to all sets has to be computed. Then, the sets similar to the input set are selected. The Jaccard similarity computation takes advantage of the inverted index model, because only the similarities between the input set s and those sets in S that have tokens in common with s have to be computed.
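The counting and prefix-sum steps of the indexing scheme (formalized in Algorithm 2) can be emulated sequentially in Python. This is a minimal sketch with names of our own choosing; the plain loops stand in for the parallel GPU kernels, and the per-token write cursor plays the role of atomicAdd:

```python
def build_inverted_index(entries, vocab_size):
    # entries: list of (token, set_id) pairs, i.e., the set E of Section 5.1
    count = [0] * vocab_size
    for t, _ in entries:                 # parallel atomicAdd on the GPU
        count[t] += 1
    index, acc = [0] * vocab_size, 0
    for t in range(vocab_size):          # exclusive prefix sum of count
        index[t] = acc
        acc += count[t]
    inverted = [0] * len(entries)
    cursor = list(index)                 # per-token write position
    for t, sid in entries:               # scatter set ids, grouped by token
        inverted[cursor[t]] = sid
        cursor[t] += 1
    return inverted, count, index
```

For entries [(0, 0), (1, 0), (0, 1), (2, 1)] over a 3-token vocabulary, this produces count = [2, 1, 1], index = [0, 2, 3], and the inverted array [0, 1, 0, 1] (set ids grouped by token).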
These sets are the elements of the invertedIndex array pointed to by the entries of the index array associated with the tokens occurring in the input set s. A straightforward solution to computing the Jaccard similarity is to distribute the tokens of the input set s evenly among the processors and let each processor p access the inverted lists corresponding to the tokens allocated to it. However, the distribution of tokens in the sets of the collection approximately follows Zipf's Law [Zipf and Wilson 1949]. This means that few tokens occur in a large number of sets and most tokens occur in only a few sets. Consequently, the size of the inverted lists also varies according to Zipf's Law; thus, distributing the workload according to the tokens of s would cause a great imbalance of the work among the processors. Thus, we propose a load balance method to distribute the sets evenly among the processors so that each processor computes approximately the same number of Jaccard similarities.

Algorithm 2: Indexing
input : The entries array E, where each element is a pair (token, setId)
output: The inverted index I
1  array of integers count[0 .. |V| − 1]
2  array of integers index[0 .. |V| − 1]
3  inverted index I[0 .. |E| − 1]
4  for each entry e in E, in parallel do
5      atomicAdd(count[e.token], 1)
6  end
7  Perform an exclusive prefix sum on count in parallel and store the result in index
8  for each entry e in E, in parallel do
9      position ← atomicAdd(index[e.token], 1)
10     I[position] ← e.setId
11 end
12 return I, count, and index

In order to facilitate the explanation of this method, suppose that we concatenate all the inverted lists corresponding to tokens in s into a logical vector E_s = [0 .. |E_s| − 1], where |E_s| is the sum of the sizes of all inverted lists
of tokens in s. Given a set of processors P = {p_0, ..., p_{|P|−1}}, the load balance method should allocate elements of E_s in intervals of approximately the same size, that is, each processor p_i ∈ P should process the elements of E_s in the interval [i · ⌈|E_s|/|P|⌉, min((i + 1) · ⌈|E_s|/|P|⌉ − 1, |E_s| − 1)]. Since each processor knows the interval of indexes of the logical vector E_s it has to process, all that is necessary to execute the load balancing is a mapping of the logical indexes of E_s to the appropriate indexes in the inverted index (array invertedIndex). Each processor executes the mapping for the indexes in its interval and finds the corresponding elements in the invertedIndex array for which it has to compute the Jaccard similarity to the input set. Let V_s ⊆ V be the vocabulary of the input set s. The mapping uses three auxiliary arrays: count_s[0 .. |V_s| − 1], start_s[0 .. |V_s| − 1], and index_s[0 .. |V_s| − 1]. The arrays count_s and start_s are obtained together by copying in parallel count[t_i] to count_s[i] and index[t_i] to start_s[i], respectively, for each token t_i in the input set s. Once count_s is obtained, an inclusive parallel prefix sum on count_s is performed and the results are stored in index_s. Algorithm 3 shows the pseudo-code for the complete similarity search. In lines 3–6, the arrays count_s and start_s are obtained. In line 7, the array index_s is obtained by applying a parallel prefix sum on count_s. Next, each processor maps each position x in its interval of indexes of E_s to the appropriate position of invertedIndex. This mapping is described in lines 9–16 of the algorithm. Then, the mapped entries of the inverted index are used to compute the intersection between each set associated with these entries and the input set.
The intersections are computed partially by each processor, but the complete intersections are available once all processors have finished this phase. Lines 19–21 show how the Jaccard similarities are computed and compacted. Each processor is responsible for ⌈|S|/|P|⌉ similarities. Each Jaccard similarity calculation uses the intersection (computed in the previous phase) and the cardinalities of the sets. Similarities above the given threshold are flagged and then compacted by an exclusive parallel prefix sum, so that only those similarities are returned to the CPU (lines 24–25).

Algorithm 3: SimilaritySearch(invertedIndex, s)
  input : invertedIndex, count, index, cardinality, threshold, input set s[0..|V_s|−1]
  output: Jaccard similarity array jac_sim[0..|S|−1], initialized with zeros
   1  array of integers count_s[0..|V_s|−1], initialized with zeros
   2  arrays of integers start_s[0..|V_s|−1] and index_s[0..|V_s|−1]
   3  for each token t_i ∈ s, in parallel do
   4      count_s[i] ← count[t_i]
   5      start_s[i] ← index[t_i]
   6  end
   7  Perform an inclusive parallel prefix sum on count_s and store the result in index_s
   8  foreach processor p_i ∈ P do
   9      for x ∈ [i·⌈|E_s|/|P|⌉, min((i+1)·⌈|E_s|/|P|⌉ − 1, |E_s|−1)] do
              // Map x to the correct position indInvPos of the invertedIndex
  10          pos ← min{ k : index_s[k] > x }
  11          if pos = 0 then
  12              p ← 0; offset ← x
  13          else
  14              p ← index_s[pos−1]; offset ← x − p
  15          end
  16          indInvPos ← start_s[pos] + offset
  17          use s[pos] and invertedIndex[indInvPos] in the partial computation of the intersection between s and the set associated with invertedIndex[indInvPos]
  18      end
  19      for x ∈ [i·⌈|S|/|P|⌉, min((i+1)·⌈|S|/|P|⌉ − 1, |S|−1)] do
              // Jaccard similarity calculation for |S| sets using |P| processors
  20          use the intersection and the union (through cardinality) to compute the Jaccard similarity
  21          flag sets with Jaccard similarity above the threshold
  22      end
  23  end
  24  Perform an exclusive parallel prefix sum on the flagged sets to compact the selected sets
  25  return the array jac_sim with the selected Jaccard similarities
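The similarity computation and the scan-based compaction of lines 19–24 can be sketched sequentially as follows; the GPU version evaluates the flags and the prefix sum in parallel:

```python
def jaccard(inter, card_x, card_y):
    """Jaccard similarity from the intersection size and the two cardinalities."""
    return inter / (card_x + card_y - inter)

def compact_similar(jac_sim, threshold):
    """Flag similarities >= threshold and compact them with an exclusive prefix
    sum over the flags (sequential sketch of the parallel scan)."""
    flags = [1 if sim >= threshold else 0 for sim in jac_sim]
    scan = [0] * len(flags)                 # exclusive prefix sum of flags
    for i in range(1, len(flags)):
        scan[i] = scan[i - 1] + flags[i - 1]
    out = [0.0] * sum(flags)
    for i, flagged in enumerate(flags):
        if flagged:
            out[scan[i]] = jac_sim[i]       # scan[i] gives the compacted slot
    return out

sims = [jaccard(3, 4, 5), jaccard(1, 4, 6), jaccard(4, 4, 4)]
result = compact_similar(sims, 0.5)
```

Only the compacted array is copied back to the CPU, which is what keeps the transfer cost proportional to the number of similar sets rather than to |S|.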
5.3 Multi-GPU Similarity Join

The gssjoin algorithm was designed with a many-core architecture (accelerator) with (global) shared memory in mind. It exploits data parallelism when processing a single input set and uses a single accelerator, making use of thousands of threads to index the dataset and to find the sets most similar to a given input set. However, when dealing with self-joins, the gssjoin search has to be invoked repeatedly to process many input sets. The gssjoin algorithm can handle that by processing the queries one after another, once the input data has been indexed. This streaming operation requires that the most similar sets be returned before another input set can be processed, and a set-specific memory allocation is needed for every set. Moreover, machines with more than one accelerator (GPU) cannot take advantage of the extra computing power for the gssjoin search. These observations motivated us to extend gssjoin to deal with multiple sets on a multi-GPU platform. In the multi-GPU version, called mgssjoin, task parallelism is exploited in addition to data parallelism. The data indexing step is performed by replicating the input data on each of the g available GPUs and then, in parallel, creating g copies of the inverted index. Thus, each GPU receives the same task and all of them produce the same inverted index in their memory. Next, the gssjoin search proceeds by partitioning the m sets (tasks) among the g GPUs, so that each GPU receives m/g tasks. This is possible because the tasks (sets) are completely independent of each other. Since the sets have different sizes, we preallocate memory based on the biggest set, i.e., the set with the largest number of tokens. This saves a lot of time, since GPU memory allocation can be very costly. Algorithm 4 shows the pseudo-code for mgssjoin.
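The m/g task distribution described above can be illustrated with a small sketch (the function name and the chunking scheme are ours; any even split of the independent tasks works):

```python
def partition_tasks(m, g):
    """Split m independent input sets into g contiguous chunks, one per GPU,
    spreading the remainder so chunk sizes differ by at most one."""
    base, rest = divmod(m, g)
    chunks, start = [], 0
    for i in range(g):
        size = base + (1 if i < rest else 0)   # first `rest` chunks get one extra task
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = partition_tasks(10, 4)
```

Because the tasks are independent, each GPU can run its chunk of similarity searches against its local copy of the inverted index without any inter-GPU communication.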
Note that this algorithm, differently from the previous ones, exploits task parallelism and runs on the CPU. The GPU function (kernel) calls, for data indexing and for the gssjoin search, take place in lines 3 and 9, respectively.

Algorithm 4: MultiGPUSearch(E)
  input : token-set pairs in E[0..|E|−1]
  output: A list of the most similar sets, one for each input set
   1  for each i ∈ g, in parallel do
   2      set.gpu.device(i)
   3      DataIndexing(E)
   4  end
   5  Allocate memory for the biggest set
   6  for each j ∈ g, in parallel do
   7      set.gpu.device(j)
   8      for each set s ∈ (m/g), in parallel do
   9          SimilaritySearch(invertedIndex, s)
  10      end
  11  end
  12  return a list of the most similar sets, one for each input set

6. THE SF-GSSJOIN ALGORITHM

In this section, we present sf-gssjoin, an improved version of gssjoin that includes a block division scheme for dealing with larger datasets that do not fit in GPU memory. This division strategy also allows us to prune entire blocks based on size filtering. As the previous algorithm, sf-gssjoin builds an inverted index to quickly identify sets containing tokens in common. The sf-gssjoin algorithm proceeds in two phases: indexing and probing. In the indexing phase, the sets are indexed in an inverted index. The similarity join happens in the probing phase, when probe sets are compared with the indexed sets and the similar pairs are returned. When dealing with datasets whose results do not fit in GPU memory, sf-gssjoin divides the data into blocks of sets and works with partial inverted indexes. The blocks are indexed successively, and each indexed block is probed against itself and all previous blocks. Blocks that have not been indexed yet are not probed because it would be redundant work: due to the commutative property of the Jaccard similarity, if the pair (x, y) meets the threshold, so does (y, x).
Fig. 3. The block division strategy.

Algorithm 5: sf-gssjoin
  input : A collection of sets C, the block size n, and a threshold γ
  output: A set S with all similar pairs (x, y)
   1  Divide C into blocks of size n
   2  for each set block i in C do
   3      I ← indexing(C[i·n .. min(i·n + n − 1, |C|−1)])
   4      for each set block j in C do
   5          if j ≤ i and |C[min(j·n + n − 1, |C|−1)]| ≥ minsize(C[i·n], γ) then
   6              S ← S ∪ probing(C[j·n .. min(j·n + n − 1, |C|−1)], I, γ)
   7          end
   8      end
   9  end
  10  return S

The block division strategy is illustrated in Figure 3 with a dataset divided into 4 blocks. The first block is probed only against itself, and the last is probed against all blocks. The size filtering technique is applied between blocks before the probing phase, as well as between sets during the probing phase. When applied between blocks, sf-gssjoin verifies whether the last probe set is larger than or equal to the minsize value of the first indexed set. If it is not, none of the previous probe sets can be similar either, because they are smaller than the last probe set, and all the following indexed sets are also too large to be similar to any probing set. Thus, it is not necessary to probe the block. The sf-gssjoin algorithm is shown in Algorithm 5. The sets are divided into blocks of size n in line 1. Each block is indexed in line 3 and probed against the previous blocks and against itself in line 6; the partial result is then included in the final result. The size pruning between blocks is done in line 5. The indexing phase is the same as in gssjoin and is described by Algorithm 2.

6.1 Probing

The probing phase achieves massive multi-level parallelism on the GPU by dividing the probe sets among the streaming multiprocessors (SMs) and, inside each SM, dividing the inverted lists among the streaming processors (SPs).
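The block enumeration and between-block size pruning of Algorithm 5 can be sketched as follows; we assume sets sorted by size and the standard Jaccard lower bound minsize(x, γ) = ⌈γ·|x|⌉ (the bound itself comes from the size filtering literature, not from this sketch):

```python
from math import ceil

def block_pairs(set_sizes, n, gamma):
    """Enumerate the (indexed block, probe block) pairs that survive the
    between-block size filter of Algorithm 5. set_sizes must be sorted."""
    num_blocks = ceil(len(set_sizes) / n)
    pairs = []
    for i in range(num_blocks):                   # block i is indexed
        first_indexed = set_sizes[i * n]          # smallest set of the indexed block
        for j in range(i + 1):                    # probe only blocks j <= i
            last_probe = set_sizes[min(j * n + n - 1, len(set_sizes) - 1)]
            if last_probe >= ceil(gamma * first_indexed):
                pairs.append((i, j))              # block j must be probed against I
    return pairs

sizes = [2, 2, 4, 4, 8, 8]                        # sorted set sizes, 3 blocks of 2
pairs = block_pairs(sizes, 2, 0.8)
```

At a high threshold the survivors cluster near the diagonal, which is exactly why entire off-diagonal blocks can be discarded without ever being transferred to the GPU.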
As can be seen in Algorithm 6, the sets are examined in parallel, and the inverted lists related to their tokens are also examined in parallel (lines 2 and 4). Before incrementing the intersection value between the probe set and the sets in the inverted lists, the set id is checked to avoid unnecessary computations, and the indexed set size is checked against the probe set's maxsize (line 5). If the indexed set is larger than the probe set's maximum allowed size, the two sets cannot be similar and the intersection increment is skipped. This is the size filtering technique applied between sets. Note that pairs of sets that cannot be similar because of their sizes keep an intersection value equal to 0. The intersection increment is done with an atomic function in line 6 because many threads can be accessing the same counters at the same time.
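Both parts of the probing phase can be sketched sequentially as follows, assuming the standard Jaccard upper bound maxsize(x) = |x|/γ for the between-set filter (the GPU version runs the two outer loops in parallel and accumulates the counters with atomicAdd):

```python
def probe(probe_sets, indexed_sets, gamma):
    """Sequential sketch of Algorithm 6: accumulate intersections through the
    inverted lists of the indexed block, apply the between-set size filter,
    then verify the candidates against the threshold gamma."""
    # Partial inverted index for the indexed block.
    inverted = {}
    for y_id, y in indexed_sets.items():
        for f in y:
            inverted.setdefault(f, []).append(y_id)
    # First part: intersection counters (atomicAdd on the GPU).
    M = {}
    for x_id, x in probe_sets.items():
        for f in x:
            for y_id in inverted.get(f, []):
                if x_id < y_id and len(indexed_sets[y_id]) <= len(x) / gamma:
                    M[(x_id, y_id)] = M.get((x_id, y_id), 0) + 1
    # Second part: verify candidates and compact the similar pairs.
    similar = []
    for (x_id, y_id), inter in M.items():
        union = len(probe_sets[x_id]) + len(indexed_sets[y_id]) - inter
        if inter / union >= gamma:
            similar.append((x_id, y_id))
    return similar

probe_sets = {0: {"ab", "bc", "cd"}}
indexed_sets = {1: {"ab", "bc", "cd", "de"}, 2: {"xy", "yz"}}
pairs = probe(probe_sets, indexed_sets, 0.7)
```

Pairs filtered out by size or sharing no tokens never appear in M, so only genuine candidates reach the verification step.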
Algorithm 6: Probing
  input : The block of probe sets P, the partial inverted index I, and a threshold γ
  output: The set of similar pairs S
   1  matrix of integers M[|P|][|P|], initialized with zeros
   2  for each set x in P, in parallel do
   3      for each feature f in x do
   4          for each set y in f's inverted list I[f], in parallel do
   5              if x.id < y.id and maxsize(x) ≥ |y| then
   6                  atomicAdd(M[x.id][y.id], 1)
   7              end
   8          end
   9      end
  10  end
  11  totalSimilars ← 0
  12  for each pair (x, y) in M, in parallel do
  13      if M[x.id][y.id] > 0 then
  14          similarity ← calculateSimilarity(x, y, M[x.id][y.id])
  15          if similarity ≥ γ then
  16              position ← atomicAdd(totalSimilars, 1)
  17              S[position] ← (x.id, y.id)
  18          end
  19      end
  20  end
  21  return S

The second part of the probing phase verifies, in parallel, the pairs with non-zero intersection (lines 12 and 13), calculates the similarity between them (line 14), and stores the pairs with similarity equal to or greater than the chosen threshold in the result array (line 17). An atomic add function is used in line 16 to determine the position where each pair should be stored. Thus, only similar pairs are returned, reducing the amount of data to be copied back to CPU memory. This data compaction, consequently, reduces the cost of the memory copy.

6.2 Multi-GPU Version

The multi-GPU version of sf-gssjoin divides the indexed blocks among the GPUs and sends the probing sets to the respective GPUs. The algorithm is the same as described in Section 5, except that the indexed blocks and their probing blocks are divided among the GPUs in line 2. This may cause a load imbalance, because each indexed block has a different number of probing blocks. For example, if we have two GPUs and two blocks, one GPU will index the first block and probe that same block, while the other GPU will index the second block and probe both the first and the second blocks.
However, if the number of blocks is large and the set sizes are variable, this imbalance is generally negligible, since some blocks will not be probed as a result of size pruning.

7. EXPERIMENTS

We now experimentally evaluate our algorithms. The goals of our study are: 1) to evaluate the performance of gssjoin and sf-gssjoin with varying thresholds on different datasets; 2) to compare our algorithms against state-of-the-art CPU-based and GPU-based similarity join algorithms; 3) to compare the performance gain of sf-gssjoin against gssjoin; 4) to measure the speedup gains of our multi-GPU implementation.
7.1 Experimental Setup

We used three publicly available, real-world datasets: DBLP, a database of Computer Science publications, Spotify, a database of song names and artists, and IMDb, a database of movie, series, and documentary titles. For DBLP, we randomly selected 20k publications and extracted their title and authors; then, we generated 4 additional dirty copies of each string, i.e., duplicates into which we injected textual modifications consisting of 1-5 character-level modifications (insertions, deletions, and substitutions). For all datasets, we converted all strings to lower case and eliminated repeated white spaces. Statistics about the datasets are shown in Table I. Finally, we tokenized all strings into sets of 2-grams and 3-grams and sorted the sets as described in Section 3.

Table I. Dataset statistics.
  Database | Number of sets | Max length | Mean length | Standard deviation
  DBLP     | 100k           | -          | -           | -
  Spotify  | 500k           | -          | -           | -
  IMDb     | 300k           | -          | -           | -

We compared sf-gssjoin against ppjoin, one of the best-performing CPU-based algorithms according to the evaluation in [Mann et al. 2016], fgssjoin, a GPU-based algorithm which showed significant speedups over ppjoin [Quirino et al. 2017], and gssjoin, our earlier algorithm [Ribeiro-Júnior et al. 2016]. We used the binary provided by the authors of ppjoin. All three parallel algorithms were implemented using the CUDA Toolkit version 7.5. The experiments were conducted on a machine running CentOS (64 bits), with 2 Intel Xeon E5-2620 processors, 16GB of ECC RAM, and four Zotac Nvidia GeForce GTX Titan Black GPUs, each with 6GB of RAM and 2,880 CUDA cores.

7.2 Performance Results

We now analyze and compare the efficiency of sf-gssjoin against ppjoin, fgssjoin, and gssjoin. To this end, we measured the runtime performance with varying threshold values and for q-grams of sizes 2 and 3. Figures 4 and 5 show the results. As a general trend, sf-gssjoin outperforms ppjoin in almost all scenarios.
They exhibit similar performance at high threshold values. However, as the threshold decreases, ppjoin's performance drops dramatically. As discussed in Section 3.3, ppjoin, like most set similarity join algorithms, depends heavily on the effectiveness of prefix filtering to obtain performance. At lower thresholds, prefix filtering becomes ineffective and the verification workload skyrockets. In contrast, sf-gssjoin's runtimes increase smoothly, thereby achieving increasing speedups over ppjoin. Furthermore, the performance advantage of sf-gssjoin over ppjoin increases with the dataset size. The peak speedup for sf-gssjoin is obtained on the DBLP dataset with 2-grams and a threshold value of 0.3: in this setting, sf-gssjoin is 109x faster than ppjoin. Unfortunately, we could not obtain the ppjoin runtimes for all databases at low thresholds due to memory limitations. However, the performance gains clearly increase as threshold values decrease. Like ppjoin, the fgssjoin algorithm exhibits increasing runtimes at lower thresholds. It outperforms sf-gssjoin at high thresholds (usually above 0.6), but its performance deteriorates as threshold values decrease. This behavior is due to the use of the same filtering techniques as ppjoin: as in ppjoin, the cost of the filtering phase increases and more candidates need to be evaluated in the verification phase. The highest speedup of sf-gssjoin over fgssjoin was 19.5x, obtained on the Spotify database using 2-grams at a low threshold value.

3 Information courtesy of IMDb ( Used with permission.
4 weiw/project/simjoin.html
Fig. 4. ppjoin, fgssjoin and sf-gssjoin runtimes: (a) DBLP dataset, 3-grams; (b) IMDb dataset, 3-grams; (c) Spotify dataset, 3-grams; (d) DBLP dataset, 2-grams; (e) IMDb dataset, 2-grams; (f) Spotify dataset, 2-grams.

When compared with gssjoin, sf-gssjoin shows better performance at higher threshold values, when size pruning can discard more blocks of sets. At low thresholds, the minimum size limits are too low to be effective and the two algorithms exhibit similar runtimes, but sf-gssjoin generally outperforms our former algorithm. As Figure 5(b) shows, the highest speedup is obtained on the IMDb database with 2-grams and 0.9 as the threshold value; in this scenario, sf-gssjoin is 18x faster than gssjoin. The block division scheme of sf-gssjoin plays an important role in effectively pruning the comparison space based on size filtering. Moreover, gssjoin indexes all the sets, reads all the sets on the inverted lists when probing a set, and verifies whether the sets were compared earlier to avoid returning the same pair twice. In sf-gssjoin, this verification is needed only when the indexed block and the probe block are the same; therefore, a large number of dissimilar set pairs are never compared. We observe that sf-gssjoin and gssjoin suffer a drop in performance at the threshold value of 0.2 because of the huge number of set pairs in the result that have to be copied to CPU memory. On the DBLP database with 2-grams and a 0.2 threshold value, sf-gssjoin performs worse than gssjoin as a result of its compaction technique. Using a simple atomic operation to compact the result can be inefficient when a large number of sets are similar: many atomic sums are performed by sf-gssjoin to determine the positions of the set pairs in the result array, and these atomic operations have a high cost and may lead to an almost sequential execution.
The gssjoin compaction technique divides the result array into blocks to avoid the overhead of the atomic sums. Table II shows the execution times using multiple GPUs on all databases with 0.5 as the threshold value. The data is divided between the GPUs, making our solution scalable. The use of multiple GPUs showed an almost linear speedup increase for the larger databases, as can be seen in Figure 5(c). However, for small databases or at high threshold values, the cost of the memory copies between the CPU memory and the GPUs' memories is high. Moreover, as mentioned before, if the set size distribution is more homogeneous, the irregular number of blocks to be compared can cause a workload imbalance between the GPUs, and the speedups will not be linear.
Fig. 5. Speedups: (a) sf-gssjoin speedups when compared to gssjoin using 3-grams; (b) sf-gssjoin speedups when compared to gssjoin using 2-grams; (c) Multi-GPU speedup.

Table II. Execution Time (secs) Using Multiple GPUs, Threshold = 0.5 (rows: DBLP 2-gram, DBLP 3-gram, IMDb 2-gram, IMDb 3-gram, Spotify 2-gram, Spotify 3-gram; columns: number of GPUs).

8. CONCLUSIONS

In this article, we proposed GPU-based solutions to the set similarity join problem. We presented two new parallel algorithms based on an inverted index implemented on the GPU. We exploited a number of strategies to optimize our algorithms, including high GPU occupancy, hierarchical memory, coalesced memory access, load balancing, and a block division scheme. As a result, we achieve significant speedups over state-of-the-art CPU-based and GPU-based algorithms in most settings. We also proposed multi-GPU variants to further exploit task parallelism in addition to data parallelism, which exhibit an almost linear speedup with the number of GPUs. In future work, we plan to investigate the integration of our GPU-based algorithms into a distributed framework.

REFERENCES

Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. Scalable ad-hoc entity extraction from text collections. Proceedings of the VLDB Endowment 1 (1), 2008.
Arasu, A., Ganti, V., and Kaushik, R. Efficient Exact Set-Similarity Joins. In Proceedings of the International Conference on Very Large Data Bases. Seoul, Korea, 2006.
Bayardo, R. J., Ma, Y., and Srikant, R. Scaling up All Pairs Similarity Search. In Proceedings of the International World Wide Web Conferences. Banff, Canada, 2007.
Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. Min-Wise Independent Permutations (Extended Abstract). In STOC, 1998.
Chacón, A., Marco-Sola, S., Espinosa, A., Ribeca, P., and Moure, J. C. Thread-cooperative, Bit-parallel Computation of Levenshtein Distance on GPU. In ICS, 2014.
Chaudhuri, S., Ganti, V., and Kaushik, R. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the IEEE International Conference on Data Engineering. Atlanta, USA, 2006.
Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. Accelerating set similarity joins using GPUs. LNCS Transactions on Large-Scale Data- and Knowledge-Centered Systems vol. 28, pp. 1-22, 2016.
Deng, D., Li, G., Hao, S., Wang, J., and Feng, J. MassJoin: A MapReduce-based Method for Scalable String Similarity Joins. In Proceedings of the IEEE International Conference on Data Engineering. Chicago, USA, 2014.
More informationGPU Implementation of a Multiobjective Search Algorithm
Department Informatik Technical Reports / ISSN 29-58 Steffen Limmer, Dietmar Fey, Johannes Jahn GPU Implementation of a Multiobjective Search Algorithm Technical Report CS-2-3 April 2 Please cite as: Steffen
More informationGPU-accelerated Verification of the Collatz Conjecture
GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,
More informationCHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL
68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationCSE 5243 INTRO. TO DATA MINING
CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationAdvanced Databases: Parallel Databases A.Poulovassilis
1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger
More informationAssignment 5. Georgia Koloniari
Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last
More informationCompressing and Decoding Term Statistics Time Series
Compressing and Decoding Term Statistics Time Series Jinfeng Rao 1,XingNiu 1,andJimmyLin 2(B) 1 University of Maryland, College Park, USA {jinfeng,xingniu}@cs.umd.edu 2 University of Waterloo, Waterloo,
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationMining Frequent Itemsets for data streams over Weighted Sliding Windows
Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology
More informationA Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads
A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads Sardar Anisul Haque Marc Moreno Maza Ning Xie University of Western Ontario, Canada IBM CASCON, November 4, 2014 ardar
More informationParallelizing Inline Data Reduction Operations for Primary Storage Systems
Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr
More informationChapter 18: Parallel Databases
Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery
More informationChapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction
Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of
More informationSemantic text features from small world graphs
Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK
More informationAC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery
: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,
More informationGPU ACCELERATED DATABASE MANAGEMENT SYSTEMS
CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationEgemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for
Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and
More informationCommunication and Memory Efficient Parallel Decision Tree Construction
Communication and Memory Efficient Parallel Decision Tree Construction Ruoming Jin Department of Computer and Information Sciences Ohio State University, Columbus OH 4321 jinr@cis.ohiostate.edu Gagan Agrawal
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationChapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL
More informationSimDataMapper: An Architectural Pattern to Integrate Declarative Similarity Matching into Database Applications
SimDataMapper: An Architectural Pattern to Integrate Declarative Similarity Matching into Database Applications Natália Cristina Schneider 1, Leonardo Andrade Ribeiro 2, Andrei de Souza Inácio 3, Harley
More informationMobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm
Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Saiyad Sharik Kaji Prof.M.B.Chandak WCOEM, Nagpur RBCOE. Nagpur Department of Computer Science, Nagpur University, Nagpur-441111
More informationCost Models for Query Processing Strategies in the Active Data Repository
Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272
More informationFast Fuzzy Clustering of Infrared Images. 2. brfcm
Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.
More informationMulti-Stage Rocchio Classification for Large-scale Multilabeled
Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationOn the Max Coloring Problem
On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive
More informationSchool of Computer and Information Science
School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast
More informationUsing Statistics for Computing Joins with MapReduce
Using Statistics for Computing Joins with MapReduce Theresa Csar 1, Reinhard Pichler 1, Emanuel Sallinger 1, and Vadim Savenkov 2 1 Vienna University of Technology {csar, pichler, sallinger}@dbaituwienacat
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationEfficient Range Query Processing on Uncertain Data
Efficient Range Query Processing on Uncertain Data Andrew Knight Rochester Institute of Technology Department of Computer Science Rochester, New York, USA andyknig@gmail.com Manjeet Rege Rochester Institute
More informationUniversity of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision
report University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision Web Server master database User Interface Images + labels image feature algorithm Extract
More informationImproving Performance of Machine Learning Workloads
Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,
More informationFast Efficient Clustering Algorithm for Balanced Data
Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut
More informationUsing Graphics Processors for High Performance IR Query Processing
Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,
More informationFinding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification
More informationSmurf: Self-Service String Matching Using Random Forests
Smurf: Self-Service String Matching Using Random Forests Paul Suganthan G. C., Adel Ardalan, AnHai Doan, Aditya Akella University of Wisconsin-Madison {paulgc, adel, anhai, akella}@cs.wisc.edu ABSTRACT
More information