Fast Parallel Set Similarity Joins on Many-core Architectures


Sidney Ribeiro-Junior, Rafael David Quirino, Leonardo Andrade Ribeiro, Wellington Santos Martins
Universidade Federal de Goiás, Brazil
{sydneyribeirojunior, rafaelquirino, laribeiro,

Abstract. Set similarity join is a core operation for text data integration, cleaning, and mining. Previous research work on improving the performance of set similarity joins mostly focused on sequential, CPU-based algorithms. Main optimizations of such algorithms exploit high threshold values and the underlying data characteristics to derive efficient filters. In this article, we investigate strategies to accelerate set similarity join by exploiting the massive parallelism available in modern Graphics Processing Units (GPUs). We develop two new parallel set similarity join algorithms, which implement an inverted index on the GPU to quickly identify similar sets. The first algorithm, called gssjoin, does not rely on any filtering scheme and, thus, exhibits much better robustness to variations of threshold values and data distributions. Moreover, gssjoin adopts a load balancing strategy to evenly distribute the similarity calculations among the GPU's threads. The second algorithm, called sf-gssjoin, applies a block division scheme for dealing with large datasets that do not fit in GPU memory. This scheme also enables a substantial reduction of the comparison space by discarding entire blocks based on size limits. We present variants of both algorithms for a multi-GPU platform to further exploit task parallelism in addition to data parallelism. Experimental evaluation on real-world datasets shows that we obtain up to 109x and 19.5x speedups over state-of-the-art algorithms for CPU and GPU, respectively.

Categories and Subject Descriptors: H.2 [Database Management]: Systems; C.1 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)

Keywords: Advanced Query Processing, GPU, High Performance Computing, Parallel Set Similarity Join

1. INTRODUCTION

The problem of efficiently answering set similarity join queries has attracted growing attention over the years [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006; Arasu et al. 2006; Bayardo et al. 2007; Vernica et al. 2010; Xiao et al. 2011; Ribeiro and Härder 2011; Wang et al. 2012; Deng et al. 2015; Cruz et al. 2016; Mann et al. 2016; Rong et al. 2017]. Set similarity join returns all pairs of similar sets from a dataset; two sets are considered similar if the value returned by a set similarity function for them is not less than a given threshold. This operation is of great interest and practical importance both in itself and as a basic operator for more advanced data processing tasks. Indeed, set similarity join has been used in a wide range of application domains including data integration [Doan et al. 2012], entity extraction [Agrawal et al. 2008], data cleaning [Chaudhuri et al. 2006], collaborative filtering [Spertus et al. 2005], and data mining [Leskovec et al. 2014].

Set similarity join is a popular approach to dealing with text data, whose representation is typically sparse and high-dimensional. Text data can be mapped to sets, and there is a rich variety of set similarity functions to capture various notions of similarity; the well-known Jaccard similarity is a prime example of such functions. Furthermore, predicates involving set similarity functions can be equivalently expressed as an overlap constraint [Chaudhuri et al. 2006]; intuitively, the overlap of similar sets should be high. As a result, set similarity join is reduced to the problem of identifying set pairs with enough overlap.
This research was partially supported by CAPES, CNPq, FAPEG, and FINEP. We thank NVIDIA for equipment donations. Copyright © 2017. Permission to copy without fee all or part of the material printed in JIDM is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação. Journal of Information and Data Management, Vol. 8, No. 3, December 2017.

The set overlap abstraction provides the basis for several optimizations. Prefix filtering [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006] is arguably the most important of such optimizations. In fact, this technique is employed by all state-of-the-art algorithms considered in a recent experimental evaluation [Mann et al. 2016]. Prefix filtering exploits the threshold value to prune dissimilar set pairs by inspecting only a fraction of them. A filtering-verification framework is typically used in this context, where input sets are sequentially processed: in the filtering step, prefix filtering is employed with the support of an inverted index to discard set pairs that cannot meet the overlap constraint; then, in the verification step, the overlap of the surviving pairs is measured and those deemed similar are sent to the output.

Unfortunately, prefix filtering is only effective at high threshold values. Its pruning power drops drastically at lower thresholds, leading to an explosion of the number of set pairs that need to be compared in the verification step. This serious drawback may render set similarity join algorithms based on prefix filtering unsuitable for applications which often require lower thresholds to produce accurate results; important examples are duplicate detection and clustering [Hassanzadeh et al. 2009]. Moreover, filtering effectiveness and, in turn, algorithm performance heavily rely on characteristics of the underlying data distribution [Sidney et al. 2015]. In particular, the filtering effect is severely reduced (or even eliminated) on more uniform data distributions.

To address the challenge of efficiently processing set similarity joins, we can exploit parallelism and take advantage of modern architectures with multiple cores. The multi-core trend affected not only CPUs but also GPUs (graphics processing units), which are known as many-core processors or accelerators. GPUs offer an attractive performance-to-energy-consumption ratio and are becoming increasingly affordable due to mass marketing for gaming. They have a large number of relatively simple, and slower, processing cores. The possibility of using GPUs to perform general-purpose computations appeared a few years ago and quickly caught the attention of researchers. However, due to their significant differences when compared to CPUs, novel algorithms and implementation approaches need to be designed. In order to take full advantage of their performance, considerable parallelism (tens of thousands of threads) and adequate control of high-latency operations are necessary, particularly for global memory accesses, which should be reduced to a minimum.

In this article, we present an exact parallel algorithm and a GPU-based implementation for the set similarity join problem. Our solution takes advantage of data parallelism by processing individual sets in parallel. Each individual operation can be seen as a set similarity search, since it finds all sets in a set collection that are similar to a given input set. Furthermore, elements in a set are, in turn, processed in parallel as well, thereby exploiting both inter- and intra-set parallelism. This greatly improves the similarity join processing and can be straightforwardly mapped to modern highly-threaded accelerators like many-core GPUs.
The proposed solution, called gssjoin (GPU-based Set Similarity Join), efficiently implements an inverted index by using a parallel counting operation followed by a parallel prefix-sum calculation. At search time, this inverted index is used to quickly find sets sharing tokens with the input set. We construct an index for the input set which is used by a load balancing strategy to evenly distribute the similarity calculations among the GPU's threads. Finally, set pairs whose similarity is not less than the given threshold are returned to the CPU.

This article is an extended and revised version of a previous conference paper [Ribeiro-Júnior et al. 2016]. As part of the new material, we include a new, improved version of the gssjoin algorithm, called sf-gssjoin (Size-filtered gssjoin), which uses size filtering to discard dissimilar pairs before comparing the sets, block division to deal with larger datasets, and a simpler compaction technique to improve performance at high threshold values. Furthermore, we provide an extensive set of experiments in which we compare our algorithm with other parallel solutions on several datasets. Our contributions are summarized as follows.

- A fine-grained parallel algorithm for both data indexing and set similarity join processing.

- A highly threaded GPU implementation that takes advantage of intensive occupation, hierarchical memory, load balancing, and coalesced memory access.
- A scalable multi-GPU implementation that exploits both data parallelism and task parallelism.
- A block division scheme for dealing with larger datasets that do not fit in the GPU memory, which also enables an aggressive reduction of the comparison space by pruning entire blocks of sets based on size limits.
- An extensive experimental evaluation comparing our solutions with state-of-the-art sequential as well as parallel algorithms on standard real-world textual datasets.

This article is organized as follows. Section 2 covers related work. Section 3 provides background material and Section 4 presents an overview of the architecture and programming model of a GPU. The gssjoin algorithm and its improved version, sf-gssjoin, are presented in Section 5 and Section 6, respectively. Experimental results are described in Section 7 and Section 8 concludes the article.

2. RELATED WORK

There is a substantial body of literature on the efficient computation of set similarity joins. Most proposals focus on the design of CPU-based, intrinsically sequential algorithms [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006; Arasu et al. 2006; Bayardo et al. 2007; Xiao et al. 2011; Ribeiro and Härder 2011; Wang et al. 2012] and, therefore, cannot fully exploit modern many-core architectures. Main optimizations use the threshold and the underlying data characteristics to derive effective filters [Sidney et al. 2015]. Popular filtering schemes are derived from the set overlap constraint, such as size filtering [Sarawagi and Kirpal 2004], prefix filtering [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006], and positional filtering [Xiao et al. 2011]. Other approaches partition the input dataset based on the symmetric set difference such that two sets are similar only if they share a common partition [Arasu et al. 2006; Deng et al. 2015]. Interestingly, overly complex filters can actually increase execution runtime: PPJoin+ [Xiao et al. 2011] and AdaptJoin [Wang et al. 2012], which employ sophisticated filtering strategies, have been shown to be the slowest competitors in the study presented by Mann et al. [2016]. Our proposed solutions either use no filter (i.e., gssjoin) or only size filtering to reduce the comparison space (i.e., sf-gssjoin).

Another popular type of string similarity join employs constraints based on the edit distance, which is defined by the minimum number of character-editing operations (i.e., insertion, deletion, and substitution) to make two strings equal. Because one can derive a (q-gram) set overlap bound from the edit distance [Gravano et al. 2001], our algorithms can be used in a pre-processing step to reduce the number of distance computations; edit distance also lends itself to efficient GPU implementation [Chacón et al. 2014]. Lieberman et al. [2008] presented a parallel similarity join algorithm for distance functions of the Minkowski family (e.g., the Euclidean distance). The algorithm first maps the smaller input dataset to a set of space-filling curves and then performs interval searches for each point in the other dataset in parallel. The overall performance of the algorithm drastically decreases as the number of dimensions increases (see Figure 5b in [Lieberman et al.
2008]) because every additional dimension requires the construction of a new space-filling curve. Thus, this approach can be prohibitively expensive on text data, whose representation typically involves several thousands of dimensions. Besides stand-alone algorithms, set similarity join can also be realized using relational database technology. Previous work proposed expressing set similarity join declaratively in SQL [Ribeiro et al. 2016] or implementing it as a physical operator within the query engine [Chaudhuri et al. 2006]. Considering the increasing adoption of hardware accelerators by relational databases, including commercial products [Meraji et al. 2016], set similarity join can be a candidate operation for off-loading to the GPU in such systems.

Other works focused on performing set similarity join in a distributed platform such as MapReduce [Vernica et al. 2010; Deng et al. 2014; Rong et al. 2017]. In general, these proposals apply a partition scheme to send dissimilar strings to different reducers and, thus, avoid unnecessary similarity calculations. Existing partition schemes are based on prefix filtering [Vernica et al. 2010], symmetric set difference [Deng et al. 2014], and vertical partitioning [Rong et al. 2017]. Nevertheless, intensive computations still have to be performed in the reducers. We plan to investigate the integration of our solutions into a distributed setting to accelerate such local computations in future work.

Our algorithms, as well as all other algorithms discussed so far, are exact, i.e., they always produce the correct answer. Approximate set similarity joins may miss valid output set pairs to trade accuracy for query time. Locality Sensitive Hashing (LSH) is the most popular technique for approximate set similarity joins [Indyk and Motwani 1998]; it is based on hashing functions that are approximately similarity-preserving. Cruz et al. [2016] propose an approximate set similarity join algorithm designed for a many-core architecture (GPU). The authors estimate the Jaccard similarity between two sets using MinHash [Broder et al. 1998], an LSH scheme for Jaccard. MinHash can be orthogonally combined with our algorithm to reduce set sizes and, thus, obtain greater scalability.

Similarity search queries retrieve strings similar to a given query string. They could be treated as a special case of join queries, where one of the two input datasets contains a single string. Techniques to increase filtering effectiveness in similarity search queries include exploiting multiple set orderings [Kim and Lee 2012], extracting different signature schemes from strings in the input data and the query string [Qin et al.], and defining a global order to import the input sets into a tree index structure [Zhang et al. 2017]. The work in [Matsumoto and Yiu 2015] proposes a top-k similarity search algorithm for data represented as points in Euclidean space; a top-k similarity query returns the k most similar data objects. This algorithm leverages both the GPU and the CPU: expensive distance calculations are executed on the GPU, whereas sorting is executed on the host CPU.

To the best of our knowledge, our gssjoin algorithm was the first GPU-based algorithm for exact set similarity join [Ribeiro-Júnior et al. 2016]. In a subsequent work, we proposed fgssjoin [Quirino et al. 2017], which employs several filters to reduce the number of candidates. Specifically, fgssjoin applies prefix filtering, size filtering, and positional filtering to prune dissimilar pairs of sets in the filtering phase. These filters are very effective at high threshold values; however, performance declines as lower threshold values are used. Size filtering is also exploited by fgssjoin in a block division scheme similar to the one adopted by sf-gssjoin. We experimentally compare the algorithms presented in this article with fgssjoin in Section 7.

3. BACKGROUND

In this section, we first formally define the set similarity join problem. Then, we review state-of-the-art optimization techniques and describe a high-level algorithmic framework for set similarity joins. Finally, we discuss the limitations of current solutions.

3.1 Basic Concepts and Problem Definition

Strings can be mapped to sets of tokens in several ways.
A well-known method is based on the concept of q-grams, i.e., sub-strings of length q obtained by sliding a window over the characters of the input string. For example, the string "gssjoin" can be mapped to the set of 3-gram tokens {"gss", "ssj", "sjo", "joi", "oin"}. Given two sets r and s, a set similarity function sim(r, s) returns a value in [0, 1] to represent their similarity; a greater value indicates that r and s have higher similarity.
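To make the mapping concrete, the following minimal sketch (host-side C++; helper names are illustrative and not part of the implementation evaluated later) extracts the q-gram set of a string and evaluates the Jaccard similarity of two token sets:

#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Set of q-grams of a string, obtained by sliding a window of length q.
std::set<std::string> qgrams(const std::string& str, std::size_t q) {
    std::set<std::string> tokens;
    for (std::size_t i = 0; i + q <= str.size(); ++i)
        tokens.insert(str.substr(i, q));
    return tokens;
}

// Jaccard similarity: |r ∩ s| / |r ∪ s|, a value in [0, 1].
double jaccard(const std::set<std::string>& r, const std::set<std::string>& s) {
    std::vector<std::string> inter;
    std::set_intersection(r.begin(), r.end(), s.begin(), s.end(),
                          std::back_inserter(inter));
    double overlap = static_cast<double>(inter.size());
    double uni = static_cast<double>(r.size() + s.size()) - overlap;
    return uni == 0.0 ? 0.0 : overlap / uni;
}

// qgrams("gssjoin", 3) yields {"gss", "joi", "oin", "sjo", "ssj"};
// the Jaccard similarity of the 3-gram sets of "gssjoin" and "gssjoint" is 5/6.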

We formally define the set similarity join operation as follows.

Definition 3.1 (Set Similarity Join). Given two set collections R and S, a set similarity function sim, and a threshold τ, the set similarity join between R and S returns all scored set pairs ⟨(r, s), τ'⟩ s.t. (r, s) ∈ R × S and sim(r, s) = τ' ≥ τ.

A popular set similarity function is the well-known Jaccard similarity. Given two sets r and s, the Jaccard similarity is defined as J(r, s) = |r ∩ s| / |r ∪ s|. A predicate involving the Jaccard similarity and a threshold τ can be equivalently rewritten into a set overlap constraint: J(r, s) ≥ τ ⟺ |r ∩ s| ≥ (τ / (1 + τ)) · (|r| + |s|). In this article, we focus on the Jaccard similarity, but all concepts and techniques presented henceforth hold for other set similarity functions as well, such as Dice and Cosine [Ribeiro and Härder 2011].

Intuitively, the size difference of similar sets should not be large. We can derive bounds for immediate pruning of candidate pairs whose sizes differ enough. Formally, the size bounds of r, denoted by minsize(r, τ) and maxsize(r, τ), are functions that map τ and r to a real value s.t., for all s, if sim(r, s) ≥ τ, then minsize(r, τ) ≤ |s| ≤ maxsize(r, τ) [Sarawagi and Kirpal 2004]. Thus, given a set r, we can define a size-based filter to safely ignore all other sets whose sizes do not fall within the interval [minsize(r, τ), maxsize(r, τ)]. For the Jaccard similarity, we have [minsize(r, τ), maxsize(r, τ)] = [τ · |r|, |r| / τ].

Further, we can use the prefix filtering principle [Sarawagi and Kirpal 2004; Chaudhuri et al. 2006] to prune dissimilar sets by examining only a subset of them. We first assume that the tokens of all sets are sorted according to a total order. A prefix r_p of r is the subset of r containing its first p tokens. Given two sets r and s, if |r ∩ s| ≥ α, then r_(|r| − α + 1) ∩ s_(|s| − α + 1) ≠ ∅. For a threshold τ, we can identify all candidate matches of a given set r using a prefix of length |r| − ⌈|r| · τ⌉ + 1; we denote such a prefix by pref(r). The original overlap constraint only needs to be verified on set pairs sharing a prefix token. Finally, we pick the token frequency ordering as the total order, thereby sorting the sets by increasing token frequencies in the set collection. Thus, we move lower-frequency tokens to prefix positions to minimize the number of verifications.

Example 3.1. Consider the sets r = {A, B, C, D, E} and s = {A, B, D, E, F}. We have |r| = |s| = 5 and |r ∩ s| = 4; thus sim(r, s) = 4/6 ≈ 0.67. For a threshold τ = 0.6, the set overlap constraint is |r ∩ s| ≥ (0.6/1.6) · (5 + 5) = 3.75, [minsize(r, τ), maxsize(r, τ)] = [3, 8.3] (values for s are the same), pref(r) = {A, B, C}, and pref(s) = {A, B, D}.
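To illustrate these Jaccard bounds, the minimal helpers below (host-side C++; illustrative only) compute minsize, maxsize, and the prefix length, reproducing the values of Example 3.1:

#include <cmath>
#include <cstddef>

// Any set s similar to r under threshold tau satisfies
// minsize(|r|, tau) <= |s| <= maxsize(|r|, tau).
double minsize(std::size_t r_size, double tau) { return tau * r_size; }
double maxsize(std::size_t r_size, double tau) { return r_size / tau; }

// Prefix length |r| - ceil(|r| * tau) + 1: only these first tokens of the
// frequency-sorted set need to be inspected to find all candidate matches.
std::size_t prefix_length(std::size_t r_size, double tau) {
    return r_size - static_cast<std::size_t>(std::ceil(r_size * tau)) + 1;
}

// For |r| = 5 and tau = 0.6: minsize = 3, maxsize ≈ 8.33, prefix_length = 3,
// matching Example 3.1.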

3.2 General Algorithm

Most current set similarity join algorithms for main memory employ an inverted index and follow a filtering-verification framework [Mann et al. 2016]. A high-level description of this framework is presented by Algorithm 1.

Algorithm 1: General set similarity join algorithm.
Input: A sorted set collection C, a threshold τ
Output: A set S containing all pairs (r, s) s.t. sim(r, s) ≥ τ
1  I_1, I_2, ..., I_|U| ← ∅; S ← ∅
2  foreach r ∈ C do
3      foreach t ∈ pref(r) do
4          foreach s ∈ I_t do
5              if not filter(r, s) then
6                  S ← S ∪ verify(r, s)
7          I_t ← I_t ∪ {r}
8  return S

An inverted list I_t stores all sets containing a token t in their prefix. The input collection C is scanned and, for each set r, its prefix tokens are used to find candidate sets in the corresponding inverted lists (lines 2-4). Each candidate s is checked using filters, such as positional [Xiao et al. 2011] and length-based filtering [Sarawagi and Kirpal 2004] (line 5); if the candidate passes through, the actual similarity computation is performed in the verification step and r and s are added to the result if they are similar (line 6). Finally, r is appended to the inverted lists of its prefix tokens (line 7). An important observation is that Algorithm 1 is intrinsically sequential: sets, prefix tokens, and candidate sets are processed sequentially, while the inverted index is dynamically created. Algorithm 1 is actually a self-join. It can be extended to a binary join by first indexing the smaller collection and then going through the larger collection to identify matching pairs.

Fig. 1. Limitations of algorithms based on prefix filtering: (a) prefix size with varying threshold values; (b) number of candidates generated with varying threshold values; (c) inverted list length with different token frequency distributions; (d) number of candidates generated with varying q values.

3.3 Prefix Filtering Limitations

Currently, prefix filtering is the prevalent optimization technique for CPU-based set similarity join algorithms. As such, all these algorithms benefit from its pruning power, but also suffer from its limitations. In particular, prefix filtering effectiveness is heavily dictated by two factors: threshold value and token frequency distribution [Sidney et al. 2015]. There is a clear correlation between threshold values and similarity join performance. Invariably, execution time increases with lower threshold values. The explanation is that lower threshold values imply larger prefixes, as illustrated in Figure 1(a). As a result, more inverted lists have to be scanned (Algorithm 1, lines 3-4) and a larger number of candidate pairs have to be verified. Figure 1(b) shows the number of candidates (in log scale) for decreasing Jaccard thresholds on a 100K sample taken from the DBLP dataset (details about the datasets are given in Section 7). As we decrease the threshold from 0.9 to 0.3, there is an increase of three orders of magnitude in the number of candidates.

Disregarding pruning due to other filters, the frequency of tokens in the prefix determines the number of candidates for a given set. The total token order based on frequency places rare tokens in the prefixes. As a result, inverted lists are shorter and there is much less prefix overlap between dissimilar sets. Of course, the effectiveness of this strategy depends on the underlying token frequency distribution. For a uniform distribution, it behaves no better than an ordinary lexicographical order. Figure 1(c) depicts the impact of the token frequency skew on the size of the lists associated with prefix tokens, and Figure 1(d) shows the number of candidates (in log scale) for increasing q-gram sizes (larger values of q result in a more skewed token distribution). We can observe an increase of one order of magnitude in the number of candidates.

The above observations indicate intrinsic limitations of prefix filtering and, in turn, of current set similarity join algorithms. Next, we provide a general overview of the underlying many-core architecture before presenting our massively parallel approach, which avoids such drawbacks.

4. GPU ARCHITECTURE AND PROGRAMMING MODEL

This section provides a brief description of both the GPU architecture and its programming model (we refer to [Kirk and Wen-mei 2012] for details). GPUs were originally designed as special-purpose coprocessors for dedicated graphics rendering. However, in the last years, they have become powerful accelerators for general-purpose computing. Nowadays, GPUs are regarded as massively parallel processors with approximately ten times the computation power and memory bandwidth of CPUs. They can perform a large number of independent parallel computations, and this performance is improving at a rate higher than that of CPUs and at an exceptionally high performance-to-cost ratio.

Figure 2 shows a simplified diagram of a GPU that is common to most GPUs currently available.

Fig. 2. Overview of a GPU architecture.

A GPU can be regarded as a Multiple SIMD (Single Instruction Multiple Data) processor, with each SIMD unit corresponding to a streaming multiprocessor (SM). Inside the SM there are several streaming processor (SP) cores that operate on different data but in a synchronous way. SPs belonging to the same SM share, among others, instruction fetch and decoding, load and store units, the register file, and local (shared) memory. A group of threads in an SM is divided into multiple schedule units, called warps, that are dynamically scheduled on the SM. The GPU supports a large number of light-weight threads, often tens or hundreds of thousands, and, unlike CPU threads, the overhead of creation and switching is negligible. Threads for which data have already been transferred from global memory can perform calculations, while threads waiting for the completion of a data transfer stall. Thus, in order to hide global memory's high latency, it is important to have more threads than the number of SPs and to have threads accessing consecutive memory addresses that can be coalesced.

In most platforms, the communication between CPU and GPU takes place through a PCI Express connection. This channel permits the exchange of data between each one's address space, but at a much slower speed. Thus, the GPU programming model requires that part of the program runs on the CPU while the computationally-intensive part is processed by the GPU. A GPU program exposes parallelism through data-parallel functions, called kernels, that are offloaded to the GPU. The programmer needs to configure the number of threads to be used. These threads are organized in groups (thread blocks) that are further assembled into a grid structure. When a kernel is launched, the blocks within a grid are distributed to idle SMs and the threads are mapped to the SPs.
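As a minimal illustration of this programming model (unrelated to the join kernels presented later), the following CUDA snippet launches a grid of thread blocks in which consecutive threads access consecutive global-memory addresses, so that the accesses can be coalesced:

#include <cuda_runtime.h>

// Each thread handles one element; thread i of the grid touches data[i].
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    // Launch configuration: enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}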

5. THE GSSJOIN ALGORITHM

In this section, we present a detailed description of our GPU-based algorithm called gssjoin. We start by describing the inverted index creation, that is, the pre-processing step. Then, the actual set similarity join is detailed. Finally, we present the multi-GPU version of gssjoin.

5.1 Pre-processing

Before performing the set similarity join, we map input strings to sets of tokens using q-grams and create an inverted index in the GPU memory, assuming the input collection fits in memory and is static. Let S be the set collection ordered by set size and V be the vocabulary of the collection, that is, the set of distinct tokens of the input collection. The input collection is the set E of distinct token-set (t, s) pairs occurring in the original collection, with t ∈ V and s ∈ S. An array of size |E| is used to store the inverted index. Once the set E has been moved to the GPU memory, each pair in it is examined in parallel, so that each time a token is visited the number of sets where it appears is incremented and stored in the array count, of size |V|. A parallel prefix sum is executed on the count array by mapping each element to the sum of the counts of all tokens before it and storing the results in the index array. Thus, each element of the index array points to the position of the corresponding first element in invertedIndex, where all (t, s) pairs will be stored sorted by token. The cardinality of all sets is computed in parallel, with each processor contributing partially (incrementing) to the respective position of the array cardinality, of size |S|. Algorithm 2 depicts the data indexing process.
Algorithm 2: Indexing
input : The entries array E, where each element is a pair (token, setId)
output: The inverted index I
1  array of integers count[0 ... |V| − 1]
2  array of integers index[0 ... |V| − 1]
3  inverted index I[0 ... |E| − 1]
4  for each entry e in E, in parallel do
5      atomicAdd(count[e.token], 1)
6  end
7  Perform an exclusive prefix sum on count in parallel and store the result in index
8  for each entry e in E, in parallel do
9      position ← atomicAdd(index[e.token], 1)
10     I[position] ← e.setId
11 end
12 return I, count and index
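A possible CUDA realization of this indexing step is sketched below; this is a simplified illustration rather than the exact code used in the experiments. The counting kernel mirrors lines 4-6 of Algorithm 2, Thrust's exclusive scan mirrors line 7, and the scatter kernel mirrors lines 8-11.

#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

struct Entry { int token; int setId; };   // one (token, set) pair of E

// Lines 4-6: count in how many (token, set) pairs each token appears.
__global__ void countTokens(const Entry* E, int numEntries, int* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numEntries) atomicAdd(&count[E[i].token], 1);
}

// Lines 8-11: scatter each set id into the slice of invertedIndex reserved
// for its token; index holds the running write position of each list.
__global__ void fillIndex(const Entry* E, int numEntries,
                          int* index, int* invertedIndex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numEntries) {
        int pos = atomicAdd(&index[E[i].token], 1);
        invertedIndex[pos] = E[i].setId;
    }
}

// Host driver: dE, dCount, dIndex, and dInvertedIndex are device arrays of
// size |E|, |V|, |V|, and |E|, respectively.
void buildInvertedIndex(const Entry* dE, int numEntries, int vocabSize,
                        int* dCount, int* dIndex, int* dInvertedIndex) {
    int threads = 256, blocks = (numEntries + threads - 1) / threads;
    cudaMemset(dCount, 0, vocabSize * sizeof(int));
    countTokens<<<blocks, threads>>>(dE, numEntries, dCount);
    // Line 7: exclusive prefix sum of count gives each list's start offset.
    thrust::device_ptr<int> c(dCount), idx(dIndex);
    thrust::exclusive_scan(c, c + vocabSize, idx);
    fillIndex<<<blocks, threads>>>(dE, numEntries, dIndex, dInvertedIndex);
    cudaDeviceSynchronize();
}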

5.2 Set Similarity Search

After the inverted index construction, gssjoin can be viewed as a batch of set similarity search operations. Given an input set, the set similarity search consists of two steps. First, the Jaccard similarity of the input set s to all sets has to be computed. Then, the sets similar to the input set are selected. The Jaccard similarity computation takes advantage of the inverted index, because only the similarities between the input set s and those sets in S that have tokens in common with s have to be computed. These sets are the elements of invertedIndex pointed to by the entries of the index array associated with the tokens occurring in the input set s.

A straightforward solution to compute the Jaccard similarity is to distribute the tokens of the input set s evenly among the processors and let each processor p access the inverted lists corresponding to the tokens allocated to it. However, the distribution of tokens in the sets of the collection follows approximately Zipf's Law [Zipf and Wilson 1949]. This means that few tokens occur in a large number of sets and most tokens occur in only a few sets. Consequently, the size of the inverted lists also varies according to Zipf's Law; thus, distributing the workload according to the tokens of s would cause a great imbalance of the work among the processors. Thus, we propose a load balancing method to distribute the sets evenly among the processors so that each processor computes approximately the same number of Jaccard similarities. In order to facilitate the explanation of this method, suppose that we concatenate all the inverted lists corresponding to tokens in s in a logical vector E_s = [0 .. |E_s| − 1], where |E_s| is the sum of the sizes of all inverted lists of tokens in s. Given a set of processors P = {p_0, ..., p_{|P|−1}}, the load balancing method should allocate the elements of E_s in intervals of approximately the same size, that is, each processor p_i ∈ P should process the elements of E_s in the interval [i · |E_s|/|P|, min((i + 1) · |E_s|/|P| − 1, |E_s| − 1)]. Since each processor knows the interval of indexes of the logical vector E_s it has to process, all that is necessary to execute the load balancing is a mapping of the logical indexes of E_s to the appropriate indexes in the inverted index (array invertedIndex). Each processor executes the mapping for the indexes in the interval corresponding to it and finds the corresponding elements in the invertedIndex array for which it has to compute the Jaccard similarity to the input set.

Let V_s ⊆ V be the vocabulary of the input set s. The mapping uses three auxiliary arrays: count_s[0 .. |V_s| − 1], start_s[0 .. |V_s| − 1], and index_s[0 .. |V_s| − 1]. The arrays count_s and start_s are obtained together by copying in parallel count[t_i] to count_s[t_i] and index[t_i] to start_s[t_i], respectively, for each token t_i in the input set s. Once count_s is obtained, an inclusive parallel prefix sum on count_s is performed and the results are stored in index_s.

Algorithm 3 shows the pseudo-code for the complete similarity search. In lines 3-6, the arrays count_s and start_s are obtained. In line 7, the array index_s is obtained by applying a parallel prefix sum on the array count_s. Next, each processor executes a mapping of each position x in the interval of indexes of E_s associated to it to the appropriate position of invertedIndex. This mapping is described in lines 10-16 of the algorithm. Then, the mapped entries of the inverted index are used to compute the intersection between each set associated with these entries and the input set. The intersections are computed partially by each processor, but the complete intersections are available when all processors have finished this phase. Lines 19-24 show how the Jaccard similarities are computed and compacted. Each processor is responsible for |S|/|P| similarities. Each Jaccard similarity calculation uses the intersection (calculated in the previous phase) and the cardinality of the sets. Similarities above the given threshold are flagged and then compacted by an exclusive parallel prefix sum; thus, only those similarities are returned to the CPU (lines 24-25).
Algorithm 3: SimilaritySearch(invertedIndex, s)
input : invertedIndex, count, index, cardinality, threshold, input set s[0 .. |V_s| − 1].
output: Jaccard similarity array jac_sim[0 .. |S| − 1], initialized with zeros.
1  array of integers count_s[0 .. |V_s| − 1] initialized with zeros
2  array of integers index_s[0 .. |V_s| − 1]
3  for each token t_i ∈ s, in parallel do
4      count_s[i] = count[t_i];
5      start_s[i] = index[t_i];
6  end
7  Perform an inclusive parallel prefix sum on count_s and store the results in index_s
8  foreach processor p_i ∈ P do
9      for x ∈ [i · |E_s|/|P|, min((i + 1) · |E_s|/|P| − 1, |E_s| − 1)] do
           // Map x to the correct position indInvPos of invertedIndex
10         pos = min(i : index_s[i] > x);
11         if pos = 0 then
12             p = 0; offset = x;
13         else
14             p = index_s[pos − 1]; offset = x − p;
15         end
16         indInvPos = start_s[pos] + offset
17         use s[pos] and invertedIndex[indInvPos] in the partial computation of the intersection between s and the set associated to invertedIndex[indInvPos]
18     end
19     for x ∈ [i · |S|/|P|, min((i + 1) · |S|/|P| − 1, |S| − 1)] do
           // Jaccard similarity calculation for |S| sets using |P| processors
20         use intersection and union (through cardinality) to compute the Jaccard similarity
21         flag sets with Jaccard similarity above the threshold
22     end
23 end
24 Perform an exclusive parallel prefix sum on the flagged sets to compact the selected sets
25 Return the array jac_sim with the selected Jaccard similarities.
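The load-balanced mapping of lines 8-17 can be expressed in CUDA roughly as follows; in this simplified sketch each thread plays the role of one processor for a single logical position x and locates its inverted-index entry with a binary search over index_s:

// Accumulates |s ∩ y| for every set y sharing a token with the query set s.
// start_s and index_s are the per-query arrays of Algorithm 3 (index_s is the
// inclusive prefix sum of count_s); intersection has one counter per set.
__global__ void accumulateIntersections(const int* invertedIndex,
                                        const int* start_s, const int* index_s,
                                        int vs, int Es, int* intersection) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // logical position in E_s
    if (x >= Es) return;
    // pos = min(i : index_s[i] > x), found by binary search over index_s.
    int lo = 0, hi = vs - 1, pos = vs - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (index_s[mid] > x) { pos = mid; hi = mid - 1; } else { lo = mid + 1; }
    }
    int offset = (pos == 0) ? x : x - index_s[pos - 1];
    int setId = invertedIndex[start_s[pos] + offset];
    atomicAdd(&intersection[setId], 1);   // each shared token adds 1 to |s ∩ y|
}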

5.3 Multi-GPU Similarity Join

The gssjoin algorithm was designed having in mind a many-core architecture (accelerator) with (global) shared memory. It exploits data parallelism when processing a single input set and uses a single accelerator. It does that by making use of thousands of threads to index the dataset and to find the most similar sets to a given input set. However, when dealing with self-joins, the gssjoin search has to be invoked repeatedly to deal with the processing of many input sets. The gssjoin algorithm can handle that by processing the queries one after another, once the input data has been indexed. This streaming operation requires that the most similar sets are returned before another input set can be processed. In addition, a set-specific memory allocation is needed for every set. Moreover, machines with more than one accelerator (GPU) cannot take advantage of the extra computing power for the gssjoin search. These observations have motivated us to extend gssjoin to deal with multiple sets on a multi-GPU platform.

In the multi-GPU version, called mgssjoin, task parallelism is exploited in addition to data parallelism. The data indexing step is performed by replicating the input data in each of the g available GPUs and then, in parallel, creating g copies of the inverted index. Thus, each GPU receives the same task and they all produce the same inverted index in their memory. Next, the gssjoin search proceeds by partitioning the m sets (tasks) among the g GPUs. Each GPU receives m/g tasks. This is possible since the tasks (sets) are completely independent of each other. Since the sets are of different sizes, we preallocate memory based on the biggest set (i.e., the set with the largest number of tokens). This saves us a lot of time since GPU memory allocation can be very costly. Algorithm 4 shows the pseudo-code for mgssjoin. Note that this algorithm, differently from the previous ones, exploits task parallelism and runs on the CPU. The GPU function (kernel) calls, for data indexing and gssjoin search, take place in lines 3 and 9, respectively.
Algorithm 4: MultiGPUSearch(E)
input : token-set pairs in E[0 .. |E| − 1].
output: A list of the most similar sets, one for each set.
1  for each i ∈ g, in parallel do
2      set.gpu.device(i);
3      DataIndexing(E);
4  end
5  Allocate memory for the biggest set;
6  for each j ∈ g, in parallel do
7      set.gpu.device(j);
8      for each set s ∈ (m/g), in parallel do
9          SimilaritySearch(invertedIndex, s);
10     end
11 end
12 Return a list of the most similar sets, one for each set.
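On the host side, the task distribution of Algorithm 4 can be organized as in the sketch below (a schematic illustration using one OpenMP thread per GPU; DataIndexing and SimilaritySearch are placeholder wrappers around the kernels of Algorithms 2 and 3, not the exact interfaces used in our implementation):

#include <cuda_runtime.h>
#include <omp.h>

// Placeholder wrappers around the GPU indexing and search phases.
void DataIndexing(int /*gpu*/) { /* launch the kernels of Algorithm 2 on the current device */ }
void SimilaritySearch(int /*gpu*/, int /*setId*/) { /* launch the kernels of Algorithm 3 */ }

void multiGpuSearch(int numSets) {
    int g = 0;
    cudaGetDeviceCount(&g);
    // Indexing: every GPU receives the same task and builds its own copy
    // of the inverted index in its memory.
    #pragma omp parallel for num_threads(g)
    for (int i = 0; i < g; ++i) {
        cudaSetDevice(i);
        DataIndexing(i);
    }
    // Search: the m query sets are partitioned so that each GPU gets m/g tasks.
    int chunk = (numSets + g - 1) / g;
    #pragma omp parallel for num_threads(g)
    for (int j = 0; j < g; ++j) {
        cudaSetDevice(j);
        for (int s = j * chunk; s < numSets && s < (j + 1) * chunk; ++s)
            SimilaritySearch(j, s);
    }
}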

6. THE SF-GSSJOIN ALGORITHM

In this section, we present sf-gssjoin, an improved version of our gssjoin, which includes a block division scheme for dealing with larger datasets that do not fit in the GPU memory. This division strategy also allows us to prune entire blocks based on size filtering. Like the previous algorithm, sf-gssjoin builds an inverted index to quickly identify sets containing tokens in common. The sf-gssjoin algorithm proceeds in two phases: indexing and probing. In the indexing phase, the sets are indexed in an inverted index. The similarity join happens in the probing phase, when probe sets are compared with indexed sets and the similar pairs are returned. When dealing with datasets whose results do not fit in the GPU's memory, sf-gssjoin divides the data into blocks of sets and works with partial inverted indexes. The blocks are indexed successively and probed with all previous blocks. The blocks that were not indexed yet are not probed because it would be redundant work: due to the commutative property of Jaccard, if the pair (x, y) meets the threshold, so does (y, x).

Fig. 3. The block division strategy.

The block division strategy is shown in Figure 3 with a dataset divided into 4 blocks. The first block is probed only against itself and the last is probed against all blocks. The size filtering technique is applied between blocks before the probing phase, as well as between sets in the probing phase. When applied between blocks, sf-gssjoin verifies whether the last probe set is larger than or equal to the minsize value of the first indexed set. If it is not, none of the other probe sets can be similar either, because they are smaller than the last probe set, and all the following indexed sets are even larger; thus, it is not necessary to probe the block. The sf-gssjoin algorithm is shown in Algorithm 5.

Algorithm 5: sf-gssjoin
input : A collection of sets C, the size of blocks n, and a threshold γ
output: A set S with all similar pairs (x, y)
1  Divide C in blocks of size n
2  for each set block i in C do
3      I ← indexing(C[i·n ... min(i·n + n − 1, |C| − 1)])
4      for each set block j in C do
5          if j ≤ i and |C[min(j·n + n − 1, |C| − 1)]| ≥ minsize(C[i·n], γ) then
6              S ← S ∪ probing(C[j·n ... min(j·n + n − 1, |C| − 1)], I, γ)
7          end
8      end
9  end
10 return S

The sets are divided into blocks of size n in line 1. Each block is indexed in line 3 and probed by the previous blocks and by itself in line 6. Then the partial result is included in the final result. The size pruning between blocks is done in line 5. The indexing phase is the same as in gssjoin and is described by Algorithm 2.
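On the host, the block loop and its size-based pruning test (Algorithm 5, line 5) could be organized as in the following schematic C++ sketch, with placeholder indexBlock/probeBlock wrappers and sets assumed sorted by increasing size:

#include <algorithm>
#include <vector>

// Placeholder wrappers around the GPU indexing and probing phases.
void indexBlock(int /*first*/, int /*last*/) { /* build the partial inverted index */ }
void probeBlock(int /*first*/, int /*last*/, double /*gamma*/) { /* launch Algorithm 6 */ }

// setSizes[k] = |C[k]|, with sets sorted by increasing size.
void sfGssjoin(const std::vector<int>& setSizes, int n, double gamma) {
    int total = static_cast<int>(setSizes.size());
    int numBlocks = (total + n - 1) / n;
    for (int i = 0; i < numBlocks; ++i) {
        int iFirst = i * n, iLast = std::min(i * n + n - 1, total - 1);
        indexBlock(iFirst, iLast);                    // Algorithm 5, line 3
        double minSz = gamma * setSizes[iFirst];      // minsize of the first indexed set
        for (int j = 0; j <= i; ++j) {                // only already-seen blocks
            int jFirst = j * n, jLast = std::min(j * n + n - 1, total - 1);
            // Skip the whole probe block if even its largest set is too small.
            if (setSizes[jLast] >= minSz)
                probeBlock(jFirst, jLast, gamma);     // Algorithm 5, line 6
        }
    }
}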

6.1 Probing

The probing phase achieves massive multi-level parallelism in the GPU by dividing the probe sets between the streaming multiprocessors (SMs) and, inside each SM, dividing the inverted lists between the streaming processors (SPs). As can be seen in Algorithm 6, the sets are examined in parallel and the inverted lists related to their tokens are also examined in parallel (lines 2 and 4). Before incrementing the intersection value between the probe set and the sets in the inverted lists, the set id is verified to avoid unnecessary computations, and the indexed set size is verified against the probe set's maxsize (line 5). If the indexed set is not smaller than the probe set's maximum size, they cannot be similar and the intersection increment is not done. This is the size filtering technique applied between sets. It is important to notice that sets that cannot be similar because of their sizes will keep the intersection value equal to 0. The intersection increment is done with an atomic function in line 6 because many threads can be accessing the same sets at the same time.

Algorithm 6: Probing
input : The block of probe sets P, the partial inverted index I, and a threshold τ
output: The set of similar pairs S
1  matrix of integers M[|P|][|P|]
2  for each set x in P, in parallel do
3      for each feature f in x do
4          for each set y in f's inverted list I[f], in parallel do
5              if x.id < y.id and maxsize(x) > |y| then
6                  atomicAdd(M[x.id][y.id], 1)
7              end
8          end
9      end
10 end
11 totalSimilars ← 0
12 for each pair (x, y) in M, in parallel do
13     if M[x.id][y.id] > 0 then
14         similarity ← calculateSimilarity(x, y, M[x.id][y.id])
15         if similarity ≥ τ then
16             position ← atomicAdd(totalSimilars, 1)
17             S[position] ← (x.id, y.id)
18         end
19     end
20 end
21 return S

The second part of the probing phase verifies in parallel the pairs with non-zero intersection (lines 12 and 13), calculates the similarity between them (line 14), and stores the pairs with similarity equal to or greater than the chosen threshold in the result array (line 17). An atomic add function is used in line 16 to determine the position where each pair should be stored. Thus, only similar pairs will be returned, reducing the amount of data to be copied back to the CPU memory. This data compaction, consequently, reduces the cost of the memory copy.
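The intersection-counting part of the probing phase (Algorithm 6, lines 2-10) can be sketched in CUDA as follows, with one thread block per probe set and the block's threads striding over the inverted lists of its tokens; the array names and the flattened layout of M are illustrative assumptions rather than the exact data structures of the implementation:

// One thread block per probe set x; threads stride over the inverted lists
// I[f] of x's tokens and count shared tokens in M (|P| x numIndexed, flattened).
__global__ void probeKernel(const int* probeTokens, const int* probeOffsets,
                            const int* probeSizes, const int* indexedSizes,
                            const int* invertedIndex, const int* listStart,
                            const int* listLen, double tau,
                            int numIndexed, int* M) {
    int x = blockIdx.x;                                  // probe set id
    double maxsizeX = probeSizes[x] / tau;               // Jaccard maxsize(x)
    for (int t = probeOffsets[x]; t < probeOffsets[x + 1]; ++t) {
        int f = probeTokens[t];                          // one token of x
        for (int k = threadIdx.x; k < listLen[f]; k += blockDim.x) {
            int y = invertedIndex[listStart[f] + k];     // indexed set id
            // Size filter between sets plus id check to avoid duplicate pairs.
            if (x < y && indexedSizes[y] < maxsizeX)
                atomicAdd(&M[x * numIndexed + y], 1);
        }
    }
}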

6.2 Multi-GPU Version

The multi-GPU version of sf-gssjoin divides the indexed blocks among the GPUs and sends the probing sets to the respective GPUs. The algorithm is the same as described in Section 5, except that indexed blocks and their probing blocks are divided among the GPUs in line 2. This may cause a load imbalance because each indexed block has a different number of probing blocks. For example, if we have two GPUs and two blocks, one GPU will index the first block and probe the same block. The other GPU will index the second block and probe the first and second blocks. But if the number of blocks is large and the set sizes are variable, this imbalance is generally negligible, since some blocks will not be probed as a result of size pruning.

7. EXPERIMENTS

We now experimentally evaluate our algorithms. The goals of our study are: 1) to evaluate the performance of gssjoin and sf-gssjoin with varying thresholds on different datasets; 2) to compare our algorithms against state-of-the-art CPU-based and GPU-based similarity join algorithms; 3) to compare the performance gain of sf-gssjoin against gssjoin; 4) to measure the speedup gains of our multi-GPU implementation.

7.1 Experimental Setup

We used three publicly available, real-world datasets: DBLP, a database of Computer Science publications, Spotify, a database of song names and artists, and IMDb, a database of movie, series, and documentary titles (information courtesy of IMDb; used with permission). For DBLP, we randomly selected 20k publications and extracted their title and authors; then, we generated 4 additional dirty copies of each string, i.e., duplicates to which we injected textual modifications consisting of 1-5 character-level modifications (insertions, deletions, and substitutions). For all datasets, we converted all strings to lower case and eliminated repeated white spaces. Statistics about the datasets are shown in Table I. Finally, we tokenized all strings into sets of 2-grams and 3-grams and sorted the sets as described in Section 3.

Table I. Dataset statistics.
Database    Number of sets    Max length    Mean length    Standard deviation
DBLP        100k
Spotify     500k
IMDb        300k

We compared sf-gssjoin against ppjoin, one of the best-performing CPU-based algorithms according to the evaluation in [Mann et al. 2016], fgssjoin, a GPU-based algorithm which showed significant speedups against ppjoin [Quirino et al. 2017], and gssjoin, our earlier algorithm [Ribeiro-Júnior et al. 2016]. We used the binary provided by the authors of ppjoin (available at weiw/project/simjoin.html). All three parallel algorithms were implemented using the CUDA Toolkit version 7.5. The experiments were conducted on a machine running CentOS (64 bits), with 2 Intel Xeon E5-2620 processors, 16GB of ECC RAM, and four GeForce Zotac Nvidia GTX Titan Black GPUs, each with 6GB of RAM and 2,880 CUDA cores.

7.2 Performance Results

We now analyze and compare the efficiency of sf-gssjoin against ppjoin, fgssjoin, and gssjoin. To this end, we measured the runtime performance with varying threshold parameter and for q-grams of size 2 and 3. Figures 4 and 5 show the results. As a general trend, sf-gssjoin outperforms ppjoin in almost all scenarios. They exhibit similar performance at high threshold values. However, as the threshold decreases, ppjoin performance drops dramatically. As discussed in Section 3.3, ppjoin, as most set similarity join algorithms, heavily depends on prefix filtering effectiveness to obtain performance. At lower thresholds, prefix filtering becomes ineffective and the verification workload skyrockets. In contrast, sf-gssjoin runtimes increase smoothly, thereby achieving increasing speedups over ppjoin. Furthermore, the performance advantage of sf-gssjoin over ppjoin increases with the dataset size. The peak speedup gain for sf-gssjoin is obtained on the DBLP dataset with 2-grams and a threshold value of 0.3: in this setting, sf-gssjoin is 109x faster than ppjoin. Unfortunately, we could not obtain the ppjoin runtimes for all databases at low thresholds due to memory limitations. However, the performance gains clearly increase when threshold values decrease.

As with ppjoin, the fgssjoin algorithm exhibits increasing runtimes at lower thresholds. It outperforms sf-gssjoin at high thresholds (usually above 0.6), but its performance deteriorates as threshold values decrease. This behavior is due to the use of the same filtering techniques of ppjoin: as in ppjoin, the filtering phase cost increases and more candidates need to be evaluated in the verification phase. The highest speedup gain of sf-gssjoin against fgssjoin was 19.5x, on the Spotify database using 2-grams.

Fig. 4. ppjoin, fgssjoin, and sf-gssjoin runtimes: (a) DBLP dataset, 3-grams; (b) IMDb dataset, 3-grams; (c) Spotify dataset, 3-grams; (d) DBLP dataset, 2-grams; (e) IMDb dataset, 2-grams; (f) Spotify dataset, 2-grams.

When compared with gssjoin, sf-gssjoin shows better performance at higher threshold values, when size pruning can discard more blocks of sets. At low thresholds, minimum size limits are too low to be effective and the two algorithms exhibit similar runtimes, but generally sf-gssjoin outperforms our former algorithm. As Figure 5(b) shows, the highest speedup is obtained on the IMDb database with 2-grams and 0.9 as the threshold value. In this scenario, sf-gssjoin is 18x faster than gssjoin. The block division scheme of sf-gssjoin plays an important role in effectively pruning the comparison space based on size filtering. Moreover, gssjoin indexes all the sets, reads all the sets in the inverted lists when probing a set, and verifies whether the sets were compared earlier to avoid returning the same pair twice. In sf-gssjoin, this verification is needed only when the indexed block and the probe block are the same; therefore, a large number of dissimilar set pairs are never compared.

We observe that sf-gssjoin and gssjoin suffer a drop in performance for the threshold value of 0.2 because of the huge number of set pairs in the result that have to be copied to the CPU memory. On the DBLP database with 2-grams and a 0.2 threshold value, sf-gssjoin shows inferior performance to gssjoin as a result of its compaction technique. Using a simple atomic operation to compact the result can be inefficient when a large number of sets are similar: many atomic sums are performed by sf-gssjoin to determine the positions of set pairs in the result array. These atomic operations have a high cost and may lead to an almost sequential execution. The gssjoin compaction technique divides the result array into blocks to avoid the atomic sums overhead.

Table II shows the execution time using multiple GPUs on all databases with 0.5 as the threshold value. The data is divided between the GPUs, making our solution scalable. The use of multiple GPUs showed an almost linear speedup increase for larger databases, as can be seen in Figure 5(c). But for small databases or at high threshold values, the cost of memory copy between the CPU memory and the GPUs' memories is high. And, as mentioned before, if the set size distribution is more homogeneous, the irregular number of blocks to be compared can cause a workload imbalance between GPUs and speedups will not be linear.

Fig. 5. Speedups: (a) sf-gssjoin speedups when compared to gssjoin using 3-grams; (b) sf-gssjoin speedups when compared to gssjoin using 2-grams; (c) multi-GPU speedup.

Table II. Execution Time (secs) Using Multiple GPUs, Threshold = 0.5
                 Number of GPUs
DBLP 2gram
DBLP 3gram
IMDb 2gram
IMDb 3gram
Spotify 2gram
Spotify 3gram

8. CONCLUSIONS

In this article, we proposed GPU-based solutions to the set similarity join problem. We present two new parallel algorithms based on an inverted index implemented on the GPU. We exploit a number of strategies to optimize our algorithms, including intense GPU occupancy, hierarchical memory, coalesced memory access, load balancing, and a block division scheme. As a result, we achieve significant speedups over state-of-the-art CPU-based and GPU-based algorithms in most settings. We also propose multi-GPU variants to further exploit task parallelism in addition to data parallelism, which exhibit almost linear speedup with the number of GPUs. In future work, we plan to investigate the integration of our GPU-based algorithms into a distributed framework.

REFERENCES

Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. Scalable ad-hoc entity extraction from text collections. Proceedings of the VLDB Endowment 1 (1), 2008.
Arasu, A., Ganti, V., and Kaushik, R. Efficient Exact Set-Similarity Joins. In Proceedings of the International Conference on Very Large Data Bases. Seoul, Korea, 2006.
Bayardo, R. J., Ma, Y., and Srikant, R. Scaling up All Pairs Similarity Search. In Proceedings of the International World Wide Web Conference. Banff, Canada, 2007.
Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. Min-Wise Independent Permutations (Extended Abstract). In STOC, 1998.
Chacón, A., Marco-Sola, S., Espinosa, A., Ribeca, P., and Moure, J. C. Thread-cooperative, Bit-parallel Computation of Levenshtein Distance on GPU. In ICS, 2014.
Chaudhuri, S., Ganti, V., and Kaushik, R. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the IEEE International Conference on Data Engineering. Atlanta, USA, p. 5, 2006.
Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. Accelerating set similarity joins using GPUs. LNCS Transactions on Large-Scale Data- and Knowledge-Centered Systems vol. 28, pp. 1-22, 2016.
Deng, D., Li, G., Hao, S., Wang, J., and Feng, J. MassJoin: A MapReduce-based Method for Scalable String Similarity Joins. In Proceedings of the IEEE International Conference on Data Engineering. Chicago, USA, 2014.

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions

More information

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

A Batched GPU Algorithm for Set Intersection

A Batched GPU Algorithm for Set Intersection A Batched GPU Algorithm for Set Intersection Di Wu, Fan Zhang, Naiyong Ao, Fang Wang, Xiaoguang Liu, Gang Wang Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University Weijin

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

A Comparative Study on Exact Triangle Counting Algorithms on the GPU A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

Generalizing Prefix Filtering to Improve Set Similarity Joins

Generalizing Prefix Filtering to Improve Set Similarity Joins Generalizing Prefix Filtering to Improve Set Similarity Joins Leonardo Andrade Ribeiro a,1,, Theo Härder a a AG DBIS, Department of Computer Science, University of Kaiserslautern, Germany Abstract Identification

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Accelerating MapReduce on a Coupled CPU-GPU Architecture Accelerating MapReduce on a Coupled CPU-GPU Architecture Linchuan Chen Xin Huo Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210 {chenlinc,huox,agrawal}@cse.ohio-state.edu

More information

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

GPU Implementation of a Multiobjective Search Algorithm

GPU Implementation of a Multiobjective Search Algorithm Department Informatik Technical Reports / ISSN 29-58 Steffen Limmer, Dietmar Fey, Johannes Jahn GPU Implementation of a Multiobjective Search Algorithm Technical Report CS-2-3 April 2 Please cite as: Steffen

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Compressing and Decoding Term Statistics Time Series

Compressing and Decoding Term Statistics Time Series Compressing and Decoding Term Statistics Time Series Jinfeng Rao 1,XingNiu 1,andJimmyLin 2(B) 1 University of Maryland, College Park, USA {jinfeng,xingniu}@cs.umd.edu 2 University of Waterloo, Waterloo,

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads Sardar Anisul Haque Marc Moreno Maza Ning Xie University of Western Ontario, Canada IBM CASCON, November 4, 2014 ardar

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery : Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,

More information

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

Communication and Memory Efficient Parallel Decision Tree Construction

Communication and Memory Efficient Parallel Decision Tree Construction Communication and Memory Efficient Parallel Decision Tree Construction Ruoming Jin Department of Computer and Information Sciences Ohio State University, Columbus OH 4321 jinr@cis.ohiostate.edu Gagan Agrawal

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

SimDataMapper: An Architectural Pattern to Integrate Declarative Similarity Matching into Database Applications

SimDataMapper: An Architectural Pattern to Integrate Declarative Similarity Matching into Database Applications SimDataMapper: An Architectural Pattern to Integrate Declarative Similarity Matching into Database Applications Natália Cristina Schneider 1, Leonardo Andrade Ribeiro 2, Andrei de Souza Inácio 3, Harley

More information

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Saiyad Sharik Kaji Prof.M.B.Chandak WCOEM, Nagpur RBCOE. Nagpur Department of Computer Science, Nagpur University, Nagpur-441111

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

Using Statistics for Computing Joins with MapReduce

Using Statistics for Computing Joins with MapReduce Using Statistics for Computing Joins with MapReduce Theresa Csar 1, Reinhard Pichler 1, Emanuel Sallinger 1, and Vadim Savenkov 2 1 Vienna University of Technology {csar, pichler, sallinger}@dbaituwienacat

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Efficient Range Query Processing on Uncertain Data

Efficient Range Query Processing on Uncertain Data Efficient Range Query Processing on Uncertain Data Andrew Knight Rochester Institute of Technology Department of Computer Science Rochester, New York, USA andyknig@gmail.com Manjeet Rege Rochester Institute

More information

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision report University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision Web Server master database User Interface Images + labels image feature algorithm Extract

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Using Graphics Processors for High Performance IR Query Processing

Using Graphics Processors for High Performance IR Query Processing Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification

More information

Smurf: Self-Service String Matching Using Random Forests

Smurf: Self-Service String Matching Using Random Forests Smurf: Self-Service String Matching Using Random Forests Paul Suganthan G. C., Adel Ardalan, AnHai Doan, Aditya Akella University of Wisconsin-Madison {paulgc, adel, anhai, akella}@cs.wisc.edu ABSTRACT

More information