A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data

1 A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data Ibrahim Abdelaziz Razen Harbi Zuhair Khayyat Panos Kalnis King Abdullah University of Science and Technology Saudi Aramco ABSTRACT Distributed SPARQL engines promise to support very large RDF datasets by utilizing shared-nothing computer clusters. Some are based on distributed frameworks such as MapReduce; others implement proprietary distributed processing; and some rely on expensive preprocessing for data partitioning. These systems exhibit a variety of trade-offs that are not well-understood, due to the lack of any comprehensive quantitative and qualitative evaluation. In this paper, we present a survey of 22 state-of-the-art systems that cover the entire spectrum of distributed RDF data processing and categorize them by several characteristics. Then, we select 12 representative systems and perform extensive experimental evaluation with respect to preprocessing cost, query performance, scalability and workload adaptability, using a variety of synthetic and real large datasets with up to 4.3 billion triples. Our results provide valuable insights for practitioners to understand the trade-offs for their usage scenarios. Finally, we publish online our evaluation framework, including all datasets and workloads, for researchers to compare their novel systems against the existing ones. 1. INTRODUCTION The Resource Description Framework (RDF) [8] is a versatile data model that provides a simple way to express facts in the semantic web. Many large public knowledge bases, including UniProt [10], PubChemRDF [7], Bio2RDF [3] and DBpedia [4] have billions of facts in RDF format. These databases are usually interlinked, and are globally queried using SPARQL [42]. As the volume of RDF data grows, the computational complexity of indexing and querying large datasets becomes challenging. Single-machine RDF systems, like RDF-3X [53] and gstore [37], do not scale well to complex queries on web-scale RDF data [29, 33]. To overcome this problem, many distributed SPARQL query engines [29, 33, 47, 41, 32, 48, 30, 28, 36, 25, 46, 18, 15, 38, 16] have been introduced. They utilize shared-nothing computing clusters and are either built on top of distributed data processing frame- This work is licensed under the Creative Commons Attribution- NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit For any use beyond those covered by this license, obtain permission by ing info@vldb.org. Proceedings of the VLDB Endowment, Vol. 10, No. 13 Copyright 2017 VLDB Endowment /17/08. works, such as MapReduce, or implement proprietary distributed computation approaches. They partition the RDF graph among multiple machines (a.k.a., workers) to handle big datasets, and parallelize query execution to reduce the running time. Answering queries typically involve processing of local data at each worker, interleaved with data exchange among workers. The main challenge of distributed systems is scaling-out, which is affected by various factors including data partitioning, load balancing, indexing, query optimization and communication overhead. Despite the plethora of distributed RDF systems and their practical applications, there is limited information about their comparative performance. Some published surveys provide qualitative comparisons. For example, Sakr et al. [49] present an overview of using the relational model for RDF data. Svoboda et al. 
[52] classify state-of-the-art indexing approaches for linked data. Kaoudi et al. [34] survey RDF data management systems designed for the cloud. Ozsu [43] provides a broader overview of existing centralized and distributed RDF engines and discusses querying techniques for linked data. Finally, Ma et al. [40] survey techniques that use relational and NoSQL databases to store large RDF data. None of the aforementioned surveys contain any experimental evaluation. Motivated by the lack of quantitative comparison, in this paper we provide a comprehensive experimental evaluation of state-of-the-art distributed RDF systems, using a variety of very large real datasets and query loads. We start with an extensive survey that covers 22 relevant systems. We describe the execution model and the graph partitioning strategy of each system, discuss the similarities and differences and explain the various trade-offs. We also categorize the systems based on the: (i) underlying implementation framework (e.g., MapReduce, vertex-centric or proprietary distributed processing); (ii) use of generic joins versus specialized graph exploration; (iii) data replication strategy; and (iv) adaptivity to query workload. Then, we perform extensive experimental evaluation of the following 12 representative systems: S2RDF [15], Ad- Part [46], DREAM [38], Urika-GD [11], CliqueSquare [25], S2X [51], TriAD [48], SHAPE [32], H-RDF-3X [29], H2RDF+ [41], SHARD [47] and gstored [45]. We use all the standard synthetic benchmarks (e.g., LUBM [6]) and a variety of very large real datasets (e.g., Bio2RDF [3]) with up to 4.3 billion triples, to stretch the systems to their limits. We analyse the following metrics: (i) Startup cost: it includes the time overhead of data partitioning, indexing, replication and loading to HDFS/memory, as well as the storage overhead of replication; (ii) Query efficiency: we utilize query workloads with varying query complexities and selectivities, 2049

and report the execution time; (iii) Scalability to very large datasets; and (iv) Adaptability to different workloads. Our results suggest that specialized in-memory systems, such as AdPart [46] or TriAD [48], provide the best performance, assuming the data can fit in the cumulative memory of the computing cluster. If this condition is not satisfied, MapReduce-based systems (e.g., H2RDF+ [41]) are an acceptable alternative. In contrast, the startup costs of some systems (e.g., S2RDF [15]) or the excessive replication (e.g., DREAM [38]) severely limit their applicability to large datasets. In an attempt to standardize the evaluation of future systems and assist practitioners to select the appropriate solution for their data and applications, we publish online all datasets, our evaluation methodology and links to the systems. The rest of the paper is organized as follows: Section 2 provides essential background on RDF and an overview of single-machine RDF stores. Section 3 contains a survey of the 22 state-of-the-art distributed RDF systems, whereas Section 4 presents the experimental evaluation of the selected 12 systems. Finally, Section 5 concludes our findings.

Figure 1: An example RDF graph. Each edge and its associated vertices correspond to an RDF triple; e.g., (Bill, worksFor, CS).
Figure 2: A SPARQL query that finds CS professors with their advisees.

2. BACKGROUND
RDF [8] is a standard data model and the core component of the W3C Semantic Web. RDF datasets consist of triples (subject, predicate, object), where the predicate (P) represents a relationship between a subject (S) and an object (O). RDF data can be viewed as a long relational table with three columns, or as a directed labeled graph with vertices reflecting the entities and edge labels as the predicates. Figure 1 shows an example RDF graph of students and professors in an academic network.

SPARQL [9] is the de-facto query language for RDF data. In its simplest form, a.k.a. Basic Graph Pattern (BGP), a SPARQL query consists of a set of RDF triple patterns; some of the nodes in a pattern are variables that may appear in multiple patterns. For example, the query in Figure 2(a) returns all professors who work for CS with their advisees. The query corresponds to the graph pattern in Figure 2(b). The answer is the set of ordered bindings of (?prof, ?stud) that render the query graph isomorphic to subgraphs in the data. Assuming data are stored in a table D(s, p, o), the query can be answered by first decomposing it into two subqueries: q1 = σ_{p=worksFor ∧ o=CS}(D) and q2 = σ_{p=advisor}(D). The subqueries are answered independently by scanning table D; then, their intermediate results are joined on the subject and object attributes: q1 ⋈_{q1.s=q2.o} q2. By applying the query on the data of Figure 1, we get (?prof, ?stud) ∈ {(James, Lisa), (Bill, John), (Bill, Fred), (Bill, Lisa)}.

2.1 Single-machine RDF stores
We discuss the storage models of single-machine RDF systems since several distributed systems [38, 32, 29] rely on them for querying RDF data within a single partition.

Triple Table uses a single table with three columns corresponding to subject, predicate and object to store RDF data. An index is created per column for faster join evaluation. A query with several predicates corresponds to a set of self-joins on the large triple table. This approach scales poorly due to expensive self-joins [26]. RDF-3X [53] and Hexastore [23] reduce this cost by using a set of indices that cover all possible permutations of S, P and O.
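To make the triple-table evaluation of Section 2 concrete, here is a minimal sketch (plain Python; the variable names and data layout are ours) that runs the two selections q1, q2 and their join over the worksFor/advisor triples of Figure 1; it produces exactly the four (?prof, ?stud) bindings listed above:

```python
# Minimal sketch: evaluating the BGP of Figure 2 over a triple table D(s, p, o).
# Only the worksFor/advisor facts of Figure 1 are included; the full example
# graph also contains subOrgOf, gradFrom, ugradFrom and type triples.
D = [
    ("Bill", "worksFor", "CS"), ("James", "worksFor", "CS"),
    ("John", "advisor", "Bill"), ("Fred", "advisor", "Bill"),
    ("Lisa", "advisor", "Bill"), ("Lisa", "advisor", "James"),
]

# q1 = sigma_{p=worksFor AND o=CS}(D): candidate bindings for ?prof
q1 = [(s, p, o) for (s, p, o) in D if p == "worksFor" and o == "CS"]
# q2 = sigma_{p=advisor}(D): candidate bindings for (?stud, ?prof)
q2 = [(s, p, o) for (s, p, o) in D if p == "advisor"]

# Join q1 and q2 on q1.s = q2.o to bind (?prof, ?stud)
result = {(t1[0], t2[0]) for t1 in q1 for t2 in q2 if t1[0] == t2[2]}
print(sorted(result))
# [('Bill', 'Fred'), ('Bill', 'John'), ('Bill', 'Lisa'), ('James', 'Lisa')]
```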
These indices are stored as clustered B+-trees and are compressed using rigorous byte-level techniques. Aggregate indices, selectivity histograms and statistics about frequently accessed paths are used to select the lowest-cost execution plan. Their optimizer use order-preserving merge-joins for most of the operations and hash-joins for the last few ones. Property table is a wider and flattened representation of RDF data [54]. The dimensions of this table are determined by the number of subjects and distinct predicates. Each cell in the table contains the object value of the corresponding subject and predicate. This representation has high storage overhead when the number of unique predicates is large due to its sparse representation. Moreover, it can not represent multi-valued attributes, i.e., a subject connected to different objects using the same predicate. Jena2 [55] solves this problem by introducing two alternative representations, namely, the clustered property table and property-class tables. BitMat [20] proposes an alternative representation of property tables that uses a compressed 3-dimensions bitmatrix. Each dimension of the bit-matrix corresponds to part of RDF triple; S, P and O. Each cell represents the existence of an RDF triple defined by the S,P,O positions. Queries are executed on the compressed data without materializing intermediate join results. Vertical Partitioning is another representation for RDF data proposed by SW-Store [24]. The triples table is vertically partitioned into n tables, where n is the number of distinct predicates. A two columns table is created for each predicate where a row is a pair of subject-object values connected through the predicate. Tables are sorted on the subject to render subject lookup and merge joins faster. This approach stores multi-valued attributes as successive rows and does not store NULL values. It provides good performance for queries with bounded predicates, however, it requires scanning multiple tables to reconstruct information related to a single entity. 3. DISTRIBUTED RDF SYSTEMS Distributed RDF systems scale to large datasets by partitioning the RDF graph among many compute nodes (workers) and evaluating queries in a distributed fashion. Each SPARQL query is decomposed into multiple subqueries, which are then evaluated independently. Since the data is 2050

distributed, nodes may need to exchange intermediate results during query evaluation; therefore, queries with large intermediate results incur high communication cost [48, 29]. To achieve acceptable response time, distributed systems attempt to minimize communication cost and maximize parallelism. This is accomplished by efficient data partitioning that maximizes data locality, load balancing that avoids stragglers, efficient join implementations, and the utilization of native RDF indexing techniques to speed up joins and index lookups. We summarize in Table 1 the various features of existing distributed RDF systems. Note that Replication in this table refers to systems that explicitly replicate RDF data, excluding HDFS replication.

Table 1: Summary of state-of-the-art distributed RDF systems. Each row lists the system, its partitioning strategy and its execution model; the table further classifies systems as MapReduce-based, Graph-based or Specialized, and flags workload awareness and explicit replication.
System | Partitioning Strategy | Execution Model
AdPart [46] | Subject Hash + workload adaptive | Distributed Semi-Join
AdPart-NA [46] | Subject Hash | Distributed Semi-Join
CliqueSquare [25] | Hybrid (Hash + VP) | MapReduce-based Join
DREAM [38] | No partitioning; full replication | RDF-3X [53]
EAGRE [56] | METIS | MapReduce-based Join
gstored [45] | Partitioning Agnostic | gstore [37]
H-RDF-3X [29] | METIS | RDF-3X [53]
H2RDF+ [41] | HBase partitioner (range) | Centralized + MapReduce
HadoopRDF [30] | VP + predicate files on HDFS | MapReduce Join
Partout [36] | Workload-based fragmentation | RDF-3X [53]
PigSPARQL [14] | Hash + Triple-based files | SPARQL to PigLatin
S2RDF [15] | Extended Vertical Partitioning | SPARQL to SQL
S2X [51] | GraphX partitioning strategy | Vertex-Centric BGP matching
Sedge [57] | Subject Hash | Vertex-Centric BGP matching
Sempala [50] | VP | SPARQL to SQL
SHAPE [32] | Semantic Hash Partitioning | RDF-3X [53]
SHARD [47] | Hash | MapReduce-based Join
TriAD [48] | Hash-based Sharding | Distributed Merge/Hash Joins
TriAD-SG [48] | METIS + Horizontal Sharding | Distributed Merge/Hash Joins
Trinity.RDF [33] | Key-value store on graph | Graph Exploration
WARP [28] | METIS on query workload | RDF-3X [53]

Figure 3: Vertical partitions of the data of Fig. 1.

In this survey, we categorize distributed RDF management systems into two broad categories based on their execution model: (i) MapReduce and Graph-based systems, which rely on general-purpose frameworks, i.e., Hadoop or Spark, that offer seamless data distribution and parallelization at the cost of flexibility; and (ii) Specialized RDF systems, which are built specifically for SPARQL query evaluation by utilizing custom physical layouts, native RDF indexing, efficient communication protocols and explicit replication. Within the second category, we define three subcategories based on the data partitioning scheme: lightweight, sophisticated and workload-aware partitioning. Systems based on sophisticated partitioning offer faster query execution at the cost of startup time and storage requirements. Workload-aware systems achieve faster query execution by adapting their data partitioning to the entire query workload.

3.1 MapReduce and Graph Based Systems
SHARD [47] is a triple store implemented on top of MapReduce [31]. The entire RDF dataset is stored in a single file within HDFS [2], where each line represents all triples of a single subject. The input dataset is hash-partitioned among workers such that each worker is responsible for a distinct set of triples. SHARD does not use indexing; consequently, during query evaluation it scans the entire dataset. SPARQL queries are executed as a sequence of MapReduce iterations.
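As a rough illustration of this style of evaluation, the following single-process sketch (our own simplification; SHARD actually runs each step as a MapReduce job over the HDFS-resident data, and the helper names are ours) matches one triple pattern per iteration and joins the matches with the bindings accumulated so far:

```python
# Illustrative sketch (not SHARD's code): evaluate a BGP as a sequence of
# iterations, one triple pattern per iteration, joining new matches with the
# bindings accumulated so far. A real MapReduce engine distributes each step.
def is_var(term):
    return term.startswith("?")

def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for p, v in zip(pattern, triple):
        if is_var(p):
            if p in binding and binding[p] != v:
                return None
            binding[p] = v
        elif p != v:
            return None
    return binding

def evaluate_bgp(patterns, triples):
    bindings = [{}]                     # start with one empty binding
    for pattern in patterns:            # one "iteration" per triple pattern
        new_bindings = []
        for b in bindings:
            # substitute already-bound variables before matching
            bound = tuple(b.get(t, t) for t in pattern)
            for triple in triples:
                m = match(bound, triple)
                if m is not None:
                    new_bindings.append({**b, **m})
        bindings = new_bindings
    return bindings

# The query of Figure 2 over the worksFor/advisor triples of Figure 1:
D = [("Bill", "worksFor", "CS"), ("James", "worksFor", "CS"),
     ("John", "advisor", "Bill"), ("Fred", "advisor", "Bill"),
     ("Lisa", "advisor", "Bill"), ("Lisa", "advisor", "James")]
Q = [("?prof", "worksFor", "CS"), ("?stud", "advisor", "?prof")]
print(evaluate_bgp(Q, D))
# 4 bindings: (Bill, John), (Bill, Fred), (Bill, Lisa), (James, Lisa)
```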
Each iteration is responsible for a single subquery, while the results are continuously joined with subsequent iterations. The final iteration is responsible for filtering the bounded variables and removing redundant results. HadoopRDF [30] also uses HDFS to store the RDF data as flat files; replication and distribution is left to HDFS. Unlike SHARD, HadoopRDF uses multiple files on HDFS, a file for each predicate which is similar to SW-Store s [24] vertical partitioning. HadoopRDF also splits each predicate file into multiple smaller files based on explicit and implicit type information. Initially, it divides the rdf:type file into as many files as the number of distinct objects. Then, a set of files are created for each type type object. For example, rdf:type in Figure 3 results in only one file (type Grad) because predicate type has only one object (Grad). Then, HadoopRDF divides the remaining predicate files into multiple files based on the object type. It splits triples based on the RDF class their object belongs to. For example, triple John, teacherof, OS will be stored in a file named teacherof Course. To retrieve this implicit type information, it needs to join the predicate file with the type * files. HadoopRDF executes queries as a sequence of MapReduce iterations. It contains a query optimizer to select the query plan that minimizes the number of MapReduce iterations and the size of intermediate results. CliqueSquare [25] exploits the replication of HDFS to maximize the efficiency of parallel joins and to reduce the communication cost. To achieve this, CliqueSquare partitions and stores the data in three different ways, by hashing each triple on the subject, predicate and object. Furthermore, it partitions the data within each machine into smaller files by applying property-based grouping similar to HadoopRDF. The integration of HDFS replication and partitioning enables CliqueSquare to perform all first-level joins (i.e., subject-subject, subject-predicate, etc.) locally on each machine. The CliqueSquare optimizer minimizes the number of joins by generating relatively shallow plans that use multi-way joins. It finds the possible clique decom- 2051

4 positions of the query graph, where each clique corresponds to a multi-way join, and then selects the decomposition with the lowest cost. The resulting plan is executed as a sequence of MapReduce iterations. H2RDF+ [41] is a distributed RDF engine based on MapReduce and Apache HBase [5]. It materializes all six permutations of RDF triples using HBase tables, which are sorted key-value maps. Data partitioning is left to HBase that range-partitions tables based on keys. Maintaining these indices offers several benefits: (i) all SPARQL triple patterns can be answered efficiently by a single scan on the corresponding index; (ii) merge join can be employed to exploit the precomputed ordering in these indices; and (iii) every join between triple patterns is executed as merge-join. H2RDF+ maintains a set of aggregated statistics to estimate the selectivity of triple patterns, join results and join cost. It uses a greedy algorithm that finds at each execution step the join with the lowest cost. It uses multi-way merge join for sorted data, and sort-merge join for unsorted intermediate results. Simple queries are executed efficiently in a centralized fashion, while complex queries with large intermediate results are evaluated as a sequence of MapReduce jobs. H2RDF+ utilizes lazy materialization to minimize the size of the intermediate results. S2X [51] exploits the inherited graph structure of RDF to process SPARQL as graph-based computations on top of GraphX. It uses the parallel vertex-centric model to evaluate the BGP matching of SPARQL while other operators, such as OPTIONAL and FILTER, are processed through Spark RDD operators. BGP matching starts by distributing all triple patterns to all graph vertices. Each vertex matches its edge labels with the triple s predicate. Graph vertices cooperatively validate their triple candidacy with their direct neighbours by exchanging messages. Then, the partial results are collected and incrementally merged. S2X uses two string encoding types: hash and count-based. Hash-based encoding utilizes a 64-bit hash function to encode subjects and objects, while count-based assigns unique numeric values to them. S2X does not have a special RDF partitioner; it uses GraphX 2D hashing, which hashes on the encoded subject then hashes on the encoded object, to partition the input graph among workers. Sedge [57] proposes similar techniques for SPARQL query execution on top of the vertex-centric processing model. The entire graph is replicated several times and each replica is partitioned differently. Each SPARQL query is executed against the replica that minimizes communication. Sedge does not provide automatic translation of SPARQL queries to the vertex-centric model; a vertex-centric program has to be written manually for each query, which is counterproductive and requires prior knowledge about the data. 3.2 Specialized RDF Systems Lightweight Partitioning Trinity.RDF [33] is a distributed in-memory RDF engine that can handle large datasets. It represents RDF data in a graph form using adjacency lists, stored in the Trinity keyvalue store [21]. The graph is hash-partitioned on vertex-id; this is equivalent to partitioning the data twice, on subject and object. Trinity.RDF uses graph exploration for query evaluation. In every iteration, a single subquery is explored starting from the valid bindings in all workers. For example, Figure 4 shows how Trinity.RDF executes Q prof. 
Starting with the pattern (?prof, worksFor, CS), it explores the neighbours of CS connected via worksFor. It finds that the possible bindings for ?prof are James and Bill. In the next iteration, it starts to explore from nodes James and Bill via edge advisor, and generates the bindings for (?stud, advisor, ?prof). Graph exploration avoids the generation of redundant intermediate results. However, because exploration only involves two vertices (source and target), Trinity.RDF cannot prune invalid intermediate results without carrying all their historical bindings. Hence, workers need to ship candidate results to the master to finalize processing, which is a potential bottleneck.

Figure 4: Trinity.RDF graph exploration execution plan for Qprof, using the data in Figure 1.

TriAD [48] employs lightweight hash partitioning based on both subjects and objects. Since partitioning information is encoded into the triples, TriAD has full locality awareness of the data and processes a large number of concurrent joins without communication. It creates six in-memory tables on each machine, one for each permutation of subject, predicate and object. The six SPO permutations are arranged into two groups: subject-key indices (SPO, SOP, PSO) and object-key indices (OSP, OPS, POS). Each of these indices is then hash-partitioned among different machines and sorted within each machine in lexicographic order. This enables TriAD to perform efficient distributed merge-joins over the different SPO indices. Multiple join operators are executed concurrently by all workers, which communicate via asynchronous message passing. At each compute node, TriAD uses multiple threads to evaluate multiple operators in the query plan in parallel. TriAD shards one (both) relation(s) when evaluating distributed merge (hash) joins, which does not preserve the locality of intermediate results. This causes TriAD to re-shard intermediate results if the sharding column of the previous join is not the current join column. This cost is significant for large intermediate results with multiple attributes.

AdPart-NA [46] employs lightweight partitioning that hashes triples on subject. Each worker stores its local set of triples using three in-memory data structures: P-index, PS-index and PO-index. The P-index returns the set of triples having the given predicate. Similarly, the PS and PO indices return the nodes connected to the given predicate-subject or predicate-object pair, respectively. AdPart-NA exploits the query structure and the hash-based data locality in order to minimize the communication cost during query evaluation. It capitalizes on the subject-based locality and the locality of its intermediate results (pinning) to evaluate joins in parallel without communication. Star queries joining on subjects are processed in parallel by all workers. Whenever possible, intermediate results are hash-distributed among workers instead of broadcast to all workers. Figure 5 shows how AdPart-NA evaluates Qprof by a single subject-object join, assuming the following execution order: q1: (?prof, worksFor, CS), q2: (?stud, advisor, ?prof). For such

queries, AdPart-NA employs a distributed semi-join. Each worker scans its PO index to find all triples matching q1, projects on the join column ?prof, and exchanges it with the other worker. Once the projected column is received, each worker computes the semi-join q1 ⋉_{?prof} q2 using its PO index. Specifically, w1 probes (p = advisor, o = Bill) while w2 probes (p = advisor, o = James). Finally, each worker computes q1 ⋈_{?prof} q2.

Figure 5: AdPart-NA query execution plan for the SPARQL query in Figure 2.

DREAM [38] utilizes data replication instead of partitioning by building a single database that is replicated to all workers. It also avoids expensive intermediate data shuffling and only exchanges small auxiliary data. Each machine uses RDF-3X on its assigned data for statistics estimation and query evaluation. DREAM decomposes each query into multiple, usually non-overlapping, subqueries, where each subquery is answered by a single worker. Depending on the query complexity, DREAM's optimizer decides to run it either in a centralized or in a distributed fashion. Although DREAM does not incur any partitioning overhead, it exhibits excessive replication and costly preprocessing because of the centralized database construction.

gstored [45] is a distributed partitioning-agnostic system. It does not decompose the input query, which is sent as-is to all workers. gstored starts the query evaluation by computing the partial local matches at each worker. This process depends on a revised version of gstore [37], a single-machine graph-based RDF engine. Then, gstored assembles the partial matches to build the cross-partition results. gstored allows for two modes of assembly: centralized and distributed. The centralized assembly mode sends the partial results to a centralized site, whereas the distributed mode assembles them at multiple sites in parallel.

Sophisticated Partitioning
H-RDF-3X [29] uses METIS [27], a balanced vertex partitioning method, to efficiently assign each graph node to a single partition. Then, H-RDF-3X enforces the so-called k-hop guarantee where, for any vertex v assigned to partition p, all vertices up to k hops away and the corresponding edges are replicated in p. This way, any query within radius k can be executed without communication. For example, partitioning the graph in Figure 1 among two workers using a 1-hop undirected guarantee yields the partitions shown in Table 2. Each partition is stored and managed by a standalone centralized RDF-3X store; duplicate results are expected due to replication. For example, query Q = (?stud, advisor, Bill) returns the duplicates (Lisa, advisor, Bill) and (Fred, advisor, Bill), one from each partition. To solve this problem, H-RDF-3X introduces the notion of triple ownership. For each vertex v assigned to partition p, H-RDF-3X stores a new triple (v, isOwned, yes) at partition p. During query evaluation, an extra join is required for filtering out duplicate results. Queries with radius larger than k are executed using expensive MapReduce joins. k must be small (e.g., k ≤ 2 in [29]) because replication increases exponentially with k.

Table 2: 1-hop undirected guarantee partitioning of the RDF in Figure 1. Subjects and objects highlighted in blue are owned by W1; the rest belong to W2. Replicated triples are highlighted in yellow. Each row lists one (subject, predicate, object) triple of W1 and one of W2.
W1: (HPC, suborgof, MIT) | W2: (CS, suborgof, MIT)
W1: (EE, suborgof, MIT) | W2: (HCI, suborgof, CMU)
W1: (CHEM, suborgof, CMU) | W2: (Bill, worksfor, CS)
W1: (James, worksfor, CS) | W2: (Bill, gradfrom, CMU)
W1: (James, ugradfrom, CMU) | W2: (Bill, ugradfrom, CMU)
W1: (James, gradfrom, MIT) | W2: (John, type, Grad)
W1: (Lisa, ugradfrom, MIT) | W2: (John, ugradfrom, CMU)
W1: (Lisa, type, Grad) | W2: (John, advisor, Bill)
W1: (Lisa, advisor, James) | W2: (CHEM, suborgof, CMU)
W1: (Lisa, advisor, Bill) | W2: (James, ugradfrom, CMU)
W1: (Fred, advisor, Bill) | W2: (Lisa, advisor, Bill)
W1: (John, type, Grad) | W2: (James, worksfor, CS)
W1: (CS, suborgof, MIT) | W2: (Fred, advisor, Bill)

EAGRE [56] transforms the RDF data into an entity graph by grouping triples based on the subject, where each subject is called an entity. Then, it groups entities with similar properties into an entity class. EAGRE generates a compressed entity graph containing only the entity classes and their relationships, which is then partitioned using METIS. At each machine, entities belonging to the same class are treated as high-dimensional data indexed by a Space Filling Curve. This maintains an order-preserving layout of the data, which fits range and order-by queries well. EAGRE aims at minimizing the I/O costs by a distributed scheduling approach that reduces the total number of data blocks to read for query evaluation. Similar to H-RDF-3X, EAGRE also suffers from the overhead of MapReduce joins for queries that cannot be evaluated locally.

Figure 6: An example of semantic hash partitioning with forward direction and k = 1.

SHAPE [32] uses semantic hash partitioning to group vertices based on URI hierarchy for the sake of increasing data locality. It identifies groups of triples anchored at the same subject or object and tries to place these grouped triples in the same partition. Then, SHAPE applies its semantic hashing technique in two phases: (i) baseline hash partitioning and (ii) k-hop expansion, which adds to each partition all triples whose shortest distance to any anchor of the partition is at most k. Figure 6 shows an example of how to expand a baseline partition in the forward direction with k = 1. Each resulting partition is managed by a standalone RDF-3X store. Similar to H-RDF-3X, SHAPE suffers from the high overhead of MapReduce joins. It also requires an extra join for filtering duplicate results. Furthermore, URI-based grouping results in skewed partitioning if a large percentage of vertices share prefixes.

TriAD-SG [48] is a variation of TriAD that uses METIS for data partitioning. Edges that cross partitions are replicated,

resulting in a 1-hop guarantee. It defines a summary graph which includes a vertex for each partition; edges connect vertices that share cross-partition edges. Figure 7 shows the summary graph of the data in Figure 1. Queries in TriAD-SG are evaluated against the summary graph first, in order to prune partitions that do not contribute to query results. Then, they are evaluated on the RDF data residing in the partitions retrieved from the summary graph. Multiple join operators are executed concurrently by all workers, which communicate via an asynchronous message passing protocol.

Figure 7: A summary graph for the RDF in Figure 1.

S2RDF [15] is a SPARQL engine built on top of Spark [39]. It proposes a relational partitioning technique for RDF data called Extended Vertical Partitioning (ExtVP). ExtVP extends the vertical partitioning approach used by HadoopRDF to minimize the size of input data during query evaluation. ExtVP uses semi-join reduction [44] to minimize data skewness and eliminate dangling triples that do not contribute to any join. For every two vertical partitions (see Figure 3), ExtVP pre-computes join reductions. The results are materialized as tables in HDFS. Specifically, for two partitions P1 and P2, S2RDF pre-computes: (i) subject-subject: P1 ⋉_{s=s} P2 and P2 ⋉_{s=s} P1; (ii) subject-object: P1 ⋉_{s=o} P2 and P2 ⋉_{s=o} P1; and (iii) object-subject: P1 ⋉_{o=s} P2 and P2 ⋉_{o=s} P1. The objective of this reduction is to use the semi-join reduced tables for joins instead of the base tables, since the reduced tables are much smaller, i.e., T1 ⋈_{A=B} T2 = (T1 ⋉_{A=B} T2) ⋈_{A=B} (T2 ⋉_{B=A} T1). S2RDF does not run on Spark directly; it translates SPARQL queries into SQL jobs which are then executed on top of Spark SQL [19]. S2RDF follows a similar approach to Sempala [50] and PigSPARQL [14]. Sempala is a distributed RDF engine that translates SPARQL into SQL which runs on top of Apache Impala [35]. Similarly, PigSPARQL translates SPARQL queries into Pig Latin [22] scripts that run on Apache Pig.

Workload-Aware Partitioning
Partout [36] is a workload-aware distributed RDF engine. It relies on a given query workload to extract representative triple patterns and uses them to partition the data into fragments. Partout has two objectives: (i) collocate fragments that are used together in queries; and (ii) achieve load balancing among workers. Partout defines a load score for each fragment and sorts fragments in descending order. For each fragment, it calculates a benefit score for allocating it to each machine. The benefit score takes into account both the machine utilization and the fragment locality. Each worker runs RDF-3X on its assigned fragments. Partout uses global statistics to generate an initial query plan, which is then refined by a cost model that considers data fragments and their locations. The final query plan is executed in parallel by all machines, where each machine sends the results to other hosts in the pipeline.

WARP [28] uses a representative query workload to replicate frequently accessed data by extending the n-hop guarantee method [29]. Given a user query, WARP determines its center node and radius. If the query is within the n-hop guarantee, WARP sends the query to all machines, which evaluate the query in parallel. Otherwise, the query is decomposed into subqueries for which a distributed query evaluation plan is created. Subqueries are evaluated in parallel by all machines and the results are sent to the master which combines them using merge join.
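To illustrate the k-hop (n-hop) guarantee enforced by H-RDF-3X, applied in SHAPE's k-hop expansion and extended per workload by WARP, here is a rough single-machine sketch (function and variable names are ours; the real systems start from METIS or semantic-hash partitions and additionally track triple ownership):

```python
from collections import defaultdict

# Rough sketch of the k-hop undirected guarantee (our own simplification):
# given the vertices assigned to one partition ("core"), replicate every triple
# that lies within k undirected hops of the core.
def k_hop_partition(triples, core_vertices, k):
    adj = defaultdict(set)
    for s, _, o in triples:
        adj[s].add(o)
        adj[o].add(s)
    within = set(core_vertices)      # vertices within k-1 hops of the core
    for _ in range(k - 1):
        within |= {n for v in within for n in adj[v]}
    # keep every triple incident to that set: its other endpoint is then at
    # most k hops away from some core vertex, giving the k-hop guarantee
    return [t for t in triples if t[0] in within or t[2] in within]

# Toy usage on a few Figure 1 triples with James as the only core vertex:
D = [("Bill", "worksFor", "CS"), ("James", "worksFor", "CS"),
     ("Lisa", "advisor", "James"), ("Fred", "advisor", "Bill")]
print(k_hop_partition(D, {"James"}, k=1))
# [('James', 'worksFor', 'CS'), ('Lisa', 'advisor', 'James')]
```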
AdPart [46] extends AdPart-NA with adaptive workloadawareness, to cope with the dynamism of RDF workloads. It monitors the query workload and incrementally redistributes parts of the data that are frequently accessed by hot patterns. By maintaining these patterns, many future queries are evaluated without communication. The adaptivity of AdPart complements its good performance on queries that can benefit from its hash-based data locality. Frequent query patterns that are not favored by the initial partitioning (e.g., star joins on an object) are processed in parallel. 4. EXPERIMENTAL EVALUATION In this section, we evaluate the performance of 12 representative distributed RDF systems using multiple real and synthetic datasets. All datasets, workloads, and the detailed results of our experiments are available online [1]. 4.1 System Setup Datasets: We use real and synthetic datasets of variable sizes, summarized in Table 3. We utilize the LUBM [6] synthetic data generator to create a dataset of 10,240 universities consisting of 1.36 billion triples. LUBM and its template queries are widely used for testing most distributed RDF engines [33, 41, 32, 48]. We also use WatDiv [12] which is a recent benchmark that provides a wide spectrum of queries with varying structural characteristics and selectivity classes. We consider two versions of this synthetic dataset: WatDiv with 109 million and WatDiv-1B with 1 billion triples. We also use two real datasets; YAGO2 [13] and Bio2RDF [3]. YAGO2 is derived from Wikipedia, Word- Net and GeoNames containing 284 million triples. Bio2RDF provides linked data for life sciences and contains around 4.3 billion triples connecting 24 different biological datasets. Hardware Setup: All systems are deployed on a 12 machine cluster connected by a 10Gbps Ethernet switch. Each machine has a 148GB RAM and two 2.1GHz AMD Opteron 6172 CPUs (12 cores each). The cluster runs a 64-bit Linux. Hadoop and Spark configuration: We use Hadoop and Spark for MapReduce and Spark-based systems and configure them to best utilize the available resources. For Hadoop, we use 12 mappers per worker (a total of 144 mappers), while we configure Spark with 12 cores per worker to achieve a total of 144 Spark cores. Note that we do not use all 24 cores for Hadoop and Spark workers to allow for other background processes, such as HDFS threads and Spark communicator threads. Compared Systems: We evaluate the performance of 5 systems from the MapReduce and Graph-based category and 7 from Specialized RDF systems. The 12 evaluated systems are: (i) AdPart [46], the current state-ofthe-art distributed RDF engine. (ii) TriAD [48], a recent efficient in-memory RDF system that uses lightweight hash partitioning. We also consider TriAD-SG, which uses graph summaries for join-ahead pruning. (iii) Three Hadoop-based systems which use lightweight partitioning: CliqueSquare [25], H2RDF+ [41] and SHARD [47]. (iv) SHAPE [32], a semantic hash partitioning approach for RDF data. (v) H-RDF-3X [29], a system that uses METIS for graph partitioning and applies the k hop guarantee scheme. We configure SHAPE with full level semantic hash 2054

7 Table 3: Datasets; M: millions. #S, #P, #O denote number of distinct subjects, predicates and objects. Dataset Triples (M) #S (M) #O (M) #S O (M) #P Size (GB) Indegree Outdegree (Avg/StDev) (Avg/StDev) WatDiv / /89.25 YAGO / /35.89 WatDiv-1B 1, / /89.05 LUBM , / /5.97 Bio2RDF 4, , , / / Table 4: Partitioning config. and replication ratio. (a) Partitioning Config. (b) Initial Replication H-RDF-3X SHAPE H-RDF-3X SHAPE DREAM LUBM undirected 2 forward 19.5% 42.9% 1200% WatDiv 3 undirected 3 undirected 1090% 0% 1200% YAGO2 2 undirected 2 forward 73.7% 0% 1200% Bio2RDF 2 undirected 2 undirected N/A N/A 1200% partitioning and enable the type optimization. For H-RDF- 3X, we enable the type and high degree vertices optimizations. (vi) S2RDF [15], a SQL-based RDF engine on top of Spark. (vii) S2X [51], an RDF engine on top of GraphX. (viii) DREAM [38] which distributes the query execution among fully-fledged unpartitioned data stores. (ix) Urika- GD; a data analytics appliance which provides an RDF triplestore and a SPARQL query engine. Urika-GD provides graph-optimized hardware with 2TB of global sharedmemory and 64 Threadstorm processors with 128 hardware threads per processor. (x) gstored [45] a partitioning agnostic approach for distributed SPARQL query processing. Finally, as baselines, we also compare to two single-machine engines; (xi) RDF-3X [53] and (xii) gstore [37]. These 12 systems were selected based on: (a) the availability of their source codes, (b) they were successfully run on our cluster and provided the correct results, and (c) they either achieve the best performance in their category or introduce novel approaches for distributed SPARQL query evaluation. Configuration: We use the source codes provided by each system s authors and enable all optimizations. H-RDF-3X and SHAPE were configured to partition each dataset such that all queries are processable without communication (Table 4(a)). To achieve this, we configure H-RDF-3X with undirected guarantee, instead of forward guarantee. For TriAD-SG, we use the same number of partitions reported in [48] for LUBM and WatDiv. Determining the number of summary graph partitions requires empirical evaluation of some data workload or a representative sample. Generating a representative sample from real data is tricky, whereas empirical evaluation on the original data is costly. Therefore, we do not evaluate TriAD-SG on Bio2RDF and YAGO2. For experiments on individual queries, the query performance is averaged over five runs for each system. The query workloads in scalability and workload adaptivity experiments are evaluated only once for each system. 4.2 Startup Overhead Our first experiment measures the time it takes all systems for preparing the data prior to answering queries. In Table 5, systems are only allowed 24 hours to complete preprocessing. For a fair comparison, we include the overhead of loading data into HDFS for Hadoop and Spark-based systems; however, we exclude the string-to-id mapping time for all systems. Lightweight partitioning: The preprocessing phase of systems under this category require the least time due to their lightweight partitioning overhead. SHARD and H2RDF+ employ random and range-based partitioning, respectively, while CliqueSquare uses a combination of hash and vertical partitioning. S2X depends on GraphX s default partitioning strategy, however, its encoding techniques Table 5: Preprocessing time (Minutes). 
LUBM WatDiv YAGO2 Bio2RDF Single Machine Systems gstore >24h >24h RDF-3X >24h Distributed Systems SHARD H2RDF CliqueSquare N/A S2X (Count) S2X (Hash) N/A N/A S2RDF (VP) >24h S2RDF (ExtVP) ,144 >24h AdPart-NA TriAD TriAD-SG N/A N/A H-RDF-3X >24h SHAPE >24h gstored >24h >24h DREAM >24h may affect GraphX s partitioning results. The hash-based encoding has very long loading time and cannot load all graphs, such as YAGO2. On the other hand, count-based encoding has faster data loading in GraphX but is slightly slower in query runtime. In our experiments, we only report the best query evaluation results for S2X irrespective of the encoding scheme since their performance do not vary significantly. MapReduce-based systems suffer from the overhead of storing their data on HDFS first before they can start their preprocessing phase. Both TriAD and AdPart-NA use hash-based partitioning. However, TriAD is slower than AdPart-NA because it sorts indices, gathers statistics and partitions its data twice (on subject and object columns). Sophisticated partitioning: The preprocessing overhead of LUBM using RDF-3X is significant compared to several distributed engines, whereas gstore failed to preprocess this dataset within 24 hours. Both systems could preprocess WatDiv-100 and YAGO2 in a reasonable time as these datasets are much smaller than LUBM However, they are still slower than several distributed systems such as AdPart-NA and H2RDF+. Both single-machine based systems failed to preprocess Bio2RDF within 24 hours. As Table 5 shows, systems that rely on METIS for partitioning (i.e., H-RDF-3X and TriAD-SG) have significant startup cost since METIS does not scale to large RDF graphs [32]. To apply METIS, we had to remove all triples connected to literals; otherwise, METIS will take several days to partition LUBM and YAGO2. For LUBM , SHAPE incurs less preprocessing time compared to METIS-based systems. However, SHAPE performs worse for WatDiv and YAGO2 due to data imbalance, causing some of its RDF-3X engines to take more time in data indexing. Due to the uniform URI s of YAGO2 and WatDiv, SHAPE could not utilize its semantic hash partitioning and placed the entire dataset in a single partition. Finally, both SHAPE and H-RDF-3X did not finish partitioning Bio2RDF and were terminated after 24 hours. S2RDF has two preprocessing modes: VP and ExtVP. VP mode partitions the RDF graph based on the triples predicate, and stores each predicate in HDFS as a compressed columnar storage format (parquet file). ExtVP mode builds on top of VP. For every two VPs, ExtVP pre-computes its join reductions and materializes the results as new partitions (tables for Spark SQL) in HDFS. Hence, ExtVP in- 2055

8 Table 6: Runtime for LUBM queries (ms). SM: Single Machine, MR: MapReduce, and SS: Specialized systems. S2X failed to execute all queries; gstore and gstored could not preprocess the data within 24 hours. LUBM Complex Queries Simple Queries Geo- L1 L2 L3 L7 L4 L5 L6 Mean Query/h SM RDF-3X 1,812, ,750 1,898,500 98, ,466 6 SHARD 413, ,310 N/A 469, , , , ,362 N/A H2RDF+ 285,430 71, , ,320 24,120 4,760 22,910 59, MR CliqueSquare 125,020 71,010 80, ,040 90,010 24,000 37,010 74, S2RDF-VP 217,537 28, ,761 29,965 46,770 5,233 11,308 35, S2RDF-ExtVP 46,552 35,802 21,533 47,343 9,222 2,718 4,555 15, AdPart-NA 2, , ,920 TriAD 6,023 1,519 2,387 17, TriAD-SG 5,392 1,774 4,636 21, Urika-GD 5,835 2,396 1,871 6,951 1, ,588 2,259 1,211 SS H-RDF-3X 7,004 2,640 7,957 7,175 1,635 1,586 1,965 3, H-RDF-3X (in-memory) 6,841 2,597 7,948 7,551 1,596 1,594 1,926 3, SHAPE 25,319 4,387 25,360 15,026 1,603 1,574 1,567 5, DREAM 13,031,410 98,263 2,358 4,700, ,755 12,110 1 DREAM (cached) 1,843,376 98,263 <1 83, curs significantly higher overhead, an order of magnitude slower, compared to its VP version. The overhead incurred by ExtVP depends on several factors including dataset size, density and number of predicates. For example, the size of YAGO2 is almost five times less than the size of LUBM , however, ExtVP requires almost an order of magnitude extra time to process YAGO. This behaviour is also noticed with ExtVP on WatDiv in comparison to LUBM This is mainly due to the sparsity and the lower number of predicates in LUBM compared to YAGO and WatDiv. In particular, ExtVP generates a total of 179 partitions (30,337 HDFS objects) for LUBM compared to 2,221 partitions (319,662 HDFS objects) for Wat- Div and 8,125 partitions ( 2 million HDFS objects) for YAGO2. Due to this high overhead, S2RDF failed to preprocess Bio2RDF dataset within 24 hours. 4.3 Initial Replication Cost We only report the initial replication for H-RDF-3X, SHAPE and DREAM (Table 4(b)) as other systems do not explicitly apply replication. H-RDF-3X results in a 19.5% replication for LUBM and a 73.7% for YAGO2. The sparsity and uniformity of LUBM allow METIS to generate a small edge cut between the different partitions of LUBM which significantly reduces the cross partition replication of H-RDF-3X. METIS, however, fails to reduce the graph edge-cuts for WatDiv because of its dense nature. It caused H-RDF-3X to replicate the whole partitions blindly using k- hop guarantee resulting in a 1090% replication. SHAPE incurs 42.9% replication for LUBM due to its full level semantic hash partitioning and type optimization. It also places WatDiv and YAGO2 in a single partition since the URI s of both datasets are uniform. This results in a 0% replication and causes SHAPE to be as slow as a single-machine RDF- 3X store for these two datasets. Similarly, DREAM indexes the entire database on a single machine which is then replicated to other workers. DREAM s preprocessing overhead is reasonable for small datasets (Table 5), however; it does not scale for larger graphs where it requires more than a day to build the Bio2RDF database. For our 12 machines cluster, the replication ratio of DREAM is 1200% for all datasets. 4.4 Query Performance LUBM dataset In this experiment (see Table 6), we use the LUBM dataset and L1-L7 [20] queries. We classify those queries, based on their structure and selectivity into simple and complex. L4 and L5 are simple selective star queries. L6 is also a simple query because it is highly selective. 
L1, L3 and L7 are complex queries with large intermediate results but very few final results. Finally, L2 is a simple yet non-selective star query that generates large final results. RDF-3X (single-machine) performs well for simple queries while it is significantly slower for complex queries. SHARD, H2RDF+ and CliqueSquare suffer from the expensive overhead of MapReduce joins; hence, their performance is significantly worse than other systems. For selective simple queries, H2RDF+ avoids the overhead of MapReducebased joins by solving these queries in a centralized manner which is an order of magnitude faster. The flat plans of CliqueSquare significantly reduce the join overhead for complex queries and achieve up to an order of magnitude better performance. S2X fails to run all queries as it generates a lot of intermediate results at the vertex level. Compared to MapReduce systems, S2RDF-ExtVP shows significant performance improvement due to its in-memory caching technique as well as the materialized join reduction tables. Note that S2RDF requires loading multiple partitions into memory for each query before execution. For example, it loads 6 partitions (1,200 HDFS object) to process L1 and L3. SHAPE and H-RDF-3X perform better than MapReducebased systems because they do not require communication. H-RDF-3X performs better than SHAPE as it has less replication. However, as both SHAPE and H-RDF-3X use MapReduce for dispatching queries to workers, they still suffer from the non-negligible overhead of MapReduce (1.5 sec on our cluster). Without this overhead, both systems would perform well for simple selective queries. For complex queries, these systems still perform reasonably well as they run in parallel without any communication overhead. For example, for query L7 which requires excessive communication, H-RDF-3X and SHAPE perform better than some of the specialized systems, such as TriAD. With low hop guarantee, the preprocessing cost for SHAPE and H-RDF-3X can be reduced at the cost of worse query performance because of the MapReduce joins. Although SHAPE and H-RDF-3X do not incur any communication, they still suffer from two limitations: (i) managing the original and replicated data in the same set of indices results in large and duplicate intermediate results, rendering the cost of join evaluation higher and (ii) to filter out duplicate results, H-RDF-3X requires an additional join with the ownership triple pattern. For fairness, we also stored H-RDF-3X databases in a memorymounted partition; still, it did not affect the performance significantly. As a result, we do not consider in-memory H-RDF-3X in further experiments. 2056
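Tables 6 and 7 aggregate per-query runtimes with the geometric mean (GeoMean). For reference, the following is a generic sketch of this aggregation (not the authors' evaluation scripts; the runtime values below are placeholders):

```python
import math

def geometric_mean(runtimes_ms):
    """Geometric mean of per-query runtimes; it dampens the effect of a single
    very slow query compared to the arithmetic mean."""
    return math.exp(sum(math.log(t) for t in runtimes_ms) / len(runtimes_ms))

# Hypothetical example with placeholder runtimes (milliseconds):
print(round(geometric_mean([100, 200, 400, 800])))  # about 283
```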

9 Table 7: Query runtimes (ms) for WatDiv and YAGO2 datasets. SHARD and gstored crashed in several queries while DREAM could not finish WatDiv queries within 24 hours. WatDiv-100 (GeoMean) YAGO2 L1-L5 S1-S7 F1-F5 C1-C3 Y1 Y2 Y3 Y4 GeoMean Query/h SM gstore 1, , ,105 2,178 2, RDF-3X , ,750 3, , SHARD N/A N/A N/A N/A 238, ,861 N/A N/A 238,861 N/A H2RDF+ 5,441 8,679 18,457 65,786 10,962 12,349 43,868 35,517 21, MR CliqueSquare 29,216 23,908 40,464 55, ,021 73,011 36, ,015 77, S2X 202,964 24, , ,393 1,811,658 1,863,374 1,720,424 1,876,964 1,817,050 2 S2RDF-VP 2,180 2,764 4,717 6,969 7,893 9,128 10,161 10,473 9, S2RDF-ExtVP 1,880 2,269 3,531 5,295 2,822 3,032 3,393 3,628 3,203 1,118 AdPart-NA ,225 TriAD , ,903 Urika-GD 1,264 1,330 1,743 2,357 1,864 1,649 1,523 1,415 1,604 2,232 SS H-RDF-3X 1,662 1,693 1,778 1,929 1, ,081 1,933 1,720 6, SHAPE 1,870 1,824 1,836 2,723 1, ,514 1,823 1,871 8, gstored ,949 8,482 N/A N/A N/A N/A N/A N/A DREAM (cached) N/A N/A N/A N/A 2,161 N/A 14, ,748 N/A AdPart-NA and TriAD perform the best in the specialized systems category. They are comparable for L4 and L5 due to their high selectivity. AdPart-NA exploits its hash distribution to solve L4 and L5 without communication. It also solves L2 without communication but slower due to L2 s low selectivity. AdPart-NA is more than an order of magnitude faster than TriAD and TriAD-SG for L2 because it reports the results through a single scan followed by hash lookups by utilizing hash indices and right deep tree plans. Compared to AdPart-NA, TriAD solves L2 by using two distributed index scans (one for each base subquery) followed by a binary search-based merge join that is only effective for selective queries. TriAD-SG outperforms TriAD and AdPart-NA in L6 as its pruning technique eliminates communication. In DREAM, statistics are collected on a query-by-query basis and then cached for future queries. This causes huge performance difference between the first run of DREAM and its subsequent runs. The number of machines that can be utilized during query execution is bounded by the number of join vertices; only a maximum of 3 workers for LUBM queries. This explains its huge overhead when processing complex queries, i.e., L1 and L7. For L3, the query planner detects the empty result during statistics collection phase and terminates the query execution early. DREAM executes simple queries on a single machine by using RDF-3X WatDiv dataset The WatDiv benchmark defines 20 query templates 1 classified into four categories: linear (L), star (S), snowflake (F) and complex queries (C). Table 7 shows the geometric mean of running 20 queries from each category on WatDiv-100M. RDF-3X and gstore perform well for most queries, where they are faster than several distributed systems including S2X, CliqueSquare SHAPE and gstored. H2RDF+ and CliqueSquare perform worse than most distributed systems due to the overhead of MapReduce. H2RDF+ performs much better than CliqueSquare. Even though the flat plans of CliqueSquare reduce the number of MapReduce joins, H2RDF+ uses a more efficient join implementation using traditional RDF indices. Furthermore, H2RDF+ encodes the URIs and literals of RDF data; hence it has lower overhead than CliqueSquare when shuffling intermediate results. S2X performs much worse due to the significant network overhead of shuffling vertices with high memory footprint. The performance of S2RDF, on the other hand, is close to MapReduce-based systems on WatDiv dataset. 
Unlike the results in LUBM-10240, S2RDF-ExtVP shows only a slight improvement in comparison to S2RDF-VP. 1 Even though SHAPE and H-RDF-3X use 3-hop undirected guarantee replication, they do not outperform the single-machine RDF-3X. gstored is only faster than its centralized version (gstore) for L and S queries. Its parallelization overhead affected its performance on F and C queries. AdPart-NA and TriAD, on the other hand, provide better performance than all systems. TriAD performs better than AdPart-NA for L and F queries as it requires multiple subject-object joins. In contrast to AdPart, TriAD utilizes subject-object locality to answer these joins without communication. For complex queries with large diameters, AdPart-NA performs better as a result of its locality awareness. DREAM results are omitted because the overhead of its statistics calculation is extremely high (more than 24 hours) due to the high number of triple patterns in WatDiv YAGO dataset YAGO2 does not provide benchmark queries. Therefore, we use the test queries (Y1-Y4) defined by AdPart to benchmark the performance of different systems (Table 7). Similar, to WatDiv dataset, H2RDF+ outperforms CliqueSquare and SHARD due to the utilization of HBase indices and its distributed implementation of merge and sort-merge joins. Moreover, S2X is still the slowest system while S2RDF provides better performance compared to MapReduce-based systems. Similar to the result in LUBM-10240, ExtVP shows better performance compared to VP. gstored fails to process all YAGO queries due to its parallelization overhead. AdPart-NA solves most of the joins in Y1 and Y2 without communication, which explains the superior performance of AdPart-NA compared to TriAD for Y1 and Y2, respectively. On the other hand, Y3 requires an object-object join on which AdPart-NA needs to broadcast the join column; therefore, it performs worse than TriAD Bio2RDF dataset Bio2RDF dataset does not have benchmark queries; therefore, we used the test queries (B1-B5) which are extracted from a real query log. The results of this experiment are shown in Table 8. Note that S2X failed to execute all Bio2RDF queries because of the huge amount of data it generates at its initial supersteps. Moreover, gstore, gstored, RDF-3X, H-RDF-3X, SHAPE, S2RDF and DREAM could not finish their preprocessing phase within 24 hours. The fastest systems in this experiment are AdPart-NA and TriAD while H2RDF+ and SHARD performed worse than other systems due to the MapReduce overhead. TriAD outperforms all other systems on queries B1, B2 and B3 since these queries require subject-object or object-object joins 2057


MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Optimizing Testing Performance With Data Validation Option

Optimizing Testing Performance With Data Validation Option Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Resource and Performance Distribution Prediction for Large Scale Analytics Queries Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia

More information

Technical Sheet NITRODB Time-Series Database

Technical Sheet NITRODB Time-Series Database Technical Sheet NITRODB Time-Series Database 10X Performance, 1/10th the Cost INTRODUCTION "#$#!%&''$!! NITRODB is an Apache Spark Based Time Series Database built to store and analyze 100s of terabytes

More information

Big Linked Data Storage and Query Processing. Prof. Sherif Sakr

Big Linked Data Storage and Query Processing. Prof. Sherif Sakr Big Linked Data Storage and Query Processing Prof. Sherif Sakr ACM and IEEE Distinguished Speaker The 3rd KEYSTONE Training School on Keyword search in Big Linked Data Vienna, Austria August 21, 2017 http://www.cse.unsw.edu.au/~ssakr/

More information

Adaptive Workload-based Partitioning and Replication for RDF Graphs

Adaptive Workload-based Partitioning and Replication for RDF Graphs Adaptive Workload-based Partitioning and Replication for RDF Graphs Ahmed Al-Ghezi and Lena Wiese Institute of Computer Science, University of Göttingen {ahmed.al-ghezi wiese}@cs.uni-goettingen.de Abstract.

More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

Apache Spark Graph Performance with Memory1. February Page 1 of 13

Apache Spark Graph Performance with Memory1. February Page 1 of 13 Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing

More information

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage Arijit Khan Nanyang Technological University (NTU), Singapore Gustavo Segovia ETH Zurich, Switzerland Donald Kossmann Microsoft

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

An Efficient Approach to Triple Search and Join of HDT Processing Using GPU

An Efficient Approach to Triple Search and Join of HDT Processing Using GPU An Efficient Approach to Triple Search and Join of HDT Processing Using GPU YoonKyung Kim, YoonJoon Lee Computer Science KAIST Daejeon, South Korea e-mail: {ykkim, yjlee}@dbserver.kaist.ac.kr JaeHwan Lee

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

arxiv: v1 [cs.db] 16 Feb 2018

arxiv: v1 [cs.db] 16 Feb 2018 PRoST: Distributed Execution of SPARQL Queries Using Mixed Partitioning Strategies Matteo Cossu elcossu@gmail.com Michael Färber michael.faerber@cs.uni-freiburg.de Georg Lausen lausen@informatik.uni-freiburg.de

More information

RDF in the clouds: a survey

RDF in the clouds: a survey The VLDB Journal (2015) 24:67 91 DOI 10.1007/s00778-014-0364-z REGULAR PAPER RDF in the clouds: a survey Zoi Kaoudi Ioana Manolescu Received: 28 October 2013 / Revised: 14 April 2014 / Accepted: 10 June

More information

Advanced Data Management

Advanced Data Management Advanced Data Management Medha Atre Office: KD-219 atrem@cse.iitk.ac.in Aug 11, 2016 Assignment-1 due on Aug 15 23:59 IST. Submission instructions will be posted by tomorrow, Friday Aug 12 on the course

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /34 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability 2014 IEEE International Conference on Big Data Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability Rong Gu, Wei Hu, Yihua Huang National Key Laboratory for Novel Software

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

A Comparison of MapReduce Join Algorithms for RDF

A Comparison of MapReduce Join Algorithms for RDF A Comparison of MapReduce Join Algorithms for RDF Albert Haque 1 and David Alves 2 Research in Bioinformatics and Semantic Web Lab, University of Texas at Austin 1 Department of Computer Science, 2 Department

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

HYRISE In-Memory Storage Engine

HYRISE In-Memory Storage Engine HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University

More information

Graph-Aware, Workload-Adaptive SPARQL Query Caching

Graph-Aware, Workload-Adaptive SPARQL Query Caching Graph-Aware, Workload-Adaptive SPARQL Query Caching Nikolaos Papailiou Dimitrios Tsoumakos Panagiotis Karras Nectarios Koziris National Technical University of Athens, Greece {npapa, nkoziris}@cslab.ece.ntua.gr

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11 DATABASE PERFORMANCE AND INDEXES CS121: Relational Databases Fall 2017 Lecture 11 Database Performance 2 Many situations where query performance needs to be improved e.g. as data size grows, query performance

More information

An Empirical Evaluation of RDF Graph Partitioning Techniques

An Empirical Evaluation of RDF Graph Partitioning Techniques An Empirical Evaluation of RDF Graph Partitioning Techniques Adnan Akhter, Axel-Cyrille Ngonga Ngomo,, and Muhammad Saleem AKSW, Germany, {lastname}@informatik.uni-leipzig.de University of Paderborn, Germany

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Chapter 17: Parallel Databases

Chapter 17: Parallel Databases Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems

More information

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

Huge market -- essentially all high performance databases work this way

Huge market -- essentially all high performance databases work this way 11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

Scalable Indexing and Adaptive Querying of RDF Data in the cloud

Scalable Indexing and Adaptive Querying of RDF Data in the cloud Scalable Indexing and Adaptive Querying of RDF Data in the cloud Nikolaos Papailiou Computing Systems Laboratory, National Technical University of Athens npapa@cslab.ece.ntua.gr Panagiotis Karras Management

More information

Efficient Algorithm for Frequent Itemset Generation in Big Data

Efficient Algorithm for Frequent Itemset Generation in Big Data Efficient Algorithm for Frequent Itemset Generation in Big Data Anbumalar Smilin V, Siddique Ibrahim S.P, Dr.M.Sivabalakrishnan P.G. Student, Department of Computer Science and Engineering, Kumaraguru

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

Advanced Data Management Technologies Written Exam

Advanced Data Management Technologies Written Exam Advanced Data Management Technologies Written Exam 02.02.2016 First name Student number Last name Signature Instructions for Students Write your name, student number, and signature on the exam sheet. This

More information

Outline Introduction Triple Storages Experimental Evaluation Conclusion. RDF Engines. Stefan Schuh. December 5, 2008

Outline Introduction Triple Storages Experimental Evaluation Conclusion. RDF Engines. Stefan Schuh. December 5, 2008 December 5, 2008 Resource Description Framework SPARQL Giant Triple Table Property Tables Vertically Partitioned Table Hexastore Resource Description Framework SPARQL Resource Description Framework RDF

More information

arxiv: v4 [cs.db] 21 Mar 2016

arxiv: v4 [cs.db] 21 Mar 2016 Noname manuscript No. (will be inserted by the editor) Processing SPARQL Queries Over Distributed RDF Graphs Peng Peng Lei Zou M. Tamer Özsu Lei Chen Dongyan Zhao arxiv:4.6763v4 [cs.db] 2 Mar 206 the date

More information

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop HAWQ: A Massively Parallel Processing SQL Engine in Hadoop Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? White Paper Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? How to Accelerate BI on Hadoop: Cubes or Indexes? Why not both? 1 +1(844)384-3844 INFO@JETHRO.IO Overview Organizations are storing more

More information

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel

More information

Graph-Processing Systems. (focusing on GraphChi)

Graph-Processing Systems. (focusing on GraphChi) Graph-Processing Systems (focusing on GraphChi) Recall: PageRank in MapReduce (Hadoop) Input: adjacency matrix H D F S (a,[c]) (b,[a]) (c,[a,b]) (c,pr(a) / out (a)), (a,[c]) (a,pr(b) / out (b)), (b,[a])

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Design Tradeoffs for Data Deduplication Performance in Backup Workloads Design Tradeoffs for Data Deduplication Performance in Backup Workloads Min Fu,DanFeng,YuHua,XubinHe, Zuoning Chen *, Wen Xia,YuchengZhang,YujuanTan Huazhong University of Science and Technology Virginia

More information

Big Data and Scripting map reduce in Hadoop

Big Data and Scripting map reduce in Hadoop Big Data and Scripting map reduce in Hadoop 1, 2, connecting to last session set up a local map reduce distribution enable execution of map reduce implementations using local file system only all tasks

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( ) Shark: SQL and Rich Analytics at Scale Yash Thakkar (2642764) Deeksha Singh (2641679) RDDs as foundation for relational processing in Shark: Resilient Distributed Datasets (RDDs): RDDs can be written at

More information

[This is not an article, chapter, of conference paper!]

[This is not an article, chapter, of conference paper!] http://www.diva-portal.org [This is not an article, chapter, of conference paper!] Performance Comparison between Scaling of Virtual Machines and Containers using Cassandra NoSQL Database Sogand Shirinbab,

More information

Cray Graph Engine / Urika-GX. Dr. Andreas Findling

Cray Graph Engine / Urika-GX. Dr. Andreas Findling Cray Graph Engine / Urika-GX Dr. Andreas Findling Uniprot / EU Open Data Portal http://sparql.uniprot.org/ https://data.europa.eu/euodp/en/linked-data SPARQL Query counting total number of triples SELECT

More information