Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase


Artem Chebotko, John Abraham, Pearl Brazier
Department of Computer Science, University of Texas - Pan American, Edinburg, TX, USA
{chebotkoa, jabraham, brazier}@utpa.edu

Anthony Piazza
Piazza Software Consulting, Corpus Christi, TX, USA
tony@piazzaconsulting.com

Andrey Kashlev, Shiyong Lu
Department of Computer Science, Wayne State University, Detroit, MI, USA
{andrey.kashlev, shiyong}@wayne.edu

Abstract: Provenance, which records the history of an in-silico experiment, has been identified as an important requirement for scientific workflows to support scientific discovery reproducibility, result interpretation, and problem diagnosis. Large provenance datasets are composed of many smaller provenance graphs, each of which corresponds to a single workflow execution. In this work, we explore and address the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. Specifically, we propose: (i) novel storage and indexing techniques for RDF data in HBase that are better suited for provenance datasets than for generic RDF graphs and (ii) novel SPARQL query evaluation algorithms that rely solely on indices to compute expensive join operations, make use of numeric values that represent triple positions rather than actual triples, and eliminate the need for intermediate data transfers over a network. The empirical evaluation of our algorithms using provenance datasets and queries of the University of Texas Provenance Benchmark confirms that our approach is efficient and scalable.

Keywords: scientific workflow; provenance; big data; HBase; distributed database; SPARQL; RDF; query; scalability

I. INTRODUCTION

In scientific workflow environments, scientific discovery reproducibility, result interpretation, and problem diagnosis primarily depend on the availability of provenance metadata that records the complete history of an in-silico experiment [1], [2], [3], [4]. Using the terminology of the Open Provenance Model (OPM) [5], the provenance of a single workflow execution is a directed graph with three kinds of nodes: artifacts (e.g., data products), processes (e.g., computations or actions), and agents (e.g., catalysts of a process). Nodes are connected via directed edges that represent five types of dependencies: process used artifact, artifact wasGeneratedBy process, process wasControlledBy agent, process wasTriggeredBy process, and artifact wasDerivedFrom artifact. In addition, OPM includes a number of other constructs that are helpful for provenance modelling. A sample provenance graph that uses the OPM notation and encodes the provenance of a workflow for creating and populating a relational database is shown in Fig. 1. In this graph, ellipses and rectangles denote artifacts and processes, respectively, and edges denote dependencies whose interpretations can be inferred based on their domains and ranges.

Figure 1. Sample OPM Provenance Graph (artifacts: Create Table SQL Statements, Create Index SQL Statements, Create Trigger SQL Statements, Schema, Instance, and Dataset; processes: Create Database Schema and Load Data).

This provenance graph can be serialized using the Resource Description Framework (RDF) and OPM vocabularies, such as the Open Provenance Model Vocabulary (OPMV) or the Open Provenance Model OWL Ontology (OPMO).
As an example, we show a partial serialization of the presented provenance graph using OPMV and the Terse RDF Triple Language (Turtle):

  utpb:schema   rdf:type opmv:Artifact .
  utpb:instance rdf:type opmv:Artifact .
  utpb:dataset  rdf:type opmv:Artifact .
  utpb:loadData rdf:type opmv:Process .
  utpb:loadData opmv:used utpb:schema, utpb:dataset .
  utpb:instance opmv:wasGeneratedBy utpb:loadData .
  utpb:instance opmv:wasDerivedFrom utpb:schema, utpb:dataset .

The provenance graph is now effectively converted to an RDF graph (or a set of triples that encode its edges) and can be further stored and queried using the SPARQL query language. SPARQL and other RDF query languages have been frequently used for provenance querying [6], [7], [8]. A provenance query, such as "Find all artifacts and their values, if any, in a provenance graph with a given identifier," can be expressed in SPARQL as

  SELECT ?artifact ?value
  FROM NAMED <...>
  WHERE {
    GRAPH utpb:opmGraph {
      ?artifact rdf:type opmv:Artifact .
      OPTIONAL { ?artifact rdfs:label ?value . }
    }
  }

The main focus of our research in this work is the efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. With the development of user-friendly and powerful tools, such as scientific workflow management systems [9], [10], [11], [12], [13], [14], scientists are able to design and repeatedly execute workflows with different input datasets and varying input parameters with just a few mouse clicks. Each workflow execution generates a provenance graph that will be stored and queried on different occasions. A single provenance graph is readily manageable, as its size is correlated with the workflow size, and even workflows with many hundreds of processes produce a relatively small metadata footprint that fits into the main memory of a single machine. The challenge arises when hundreds of thousands or even millions of provenance graphs constitute a provenance dataset. Managing large and constantly growing provenance datasets on a single machine eventually fails, and we turn to distributed data management solutions. We design such a solution for large provenance datasets based on Apache HBase [15], an open-source implementation of Google's BigTable [16]. While we deploy and evaluate our solution on a small cluster of commodity machines, HBase is readily available in cloud environments, suggesting virtually unlimited elasticity.

The main contributions of this work are: (i) novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets, and (ii) novel and efficient querying algorithms to evaluate SPARQL queries in HBase that are optimized to make use of bitmap indices and numeric values instead of triples. Our solution enables the evaluation of queries over an individual provenance graph without intermediate data transfers over a network. In addition, we conducted an empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark [17]. Our experiments confirmed that our proposed storage, indexing and querying techniques are efficient and scalable for large provenance datasets.

II. RELATED WORK

Besides HBase, there are multiple projects under the Apache umbrella that focus on distributed computing, including Hadoop, Cassandra, Hive, Pig, and CouchDB. Hadoop implements a MapReduce software framework and a distributed file system. Cassandra blends a fully distributed design with a column-oriented storage model. Hive deals with data warehousing on top of Hadoop and provides its own Hive QL query language. Pig is geared towards analyzing large datasets through the use of its high-level Pig Latin language for expressing data analysis programs, which are then turned into MapReduce jobs. CouchDB is a distributed, document-oriented database that supports incremental MapReduce queries written in JavaScript. Along the same lines, other projects in academia and industry include Cheetah (data warehousing on top of MapReduce), Hadoop++ (an improved MapReduce framework based on Hadoop), G-Store (a key-value store with multi-key access functionality), and Hadapt (data warehousing on top of MapReduce). None of the above projects targets RDF data specifically or supports SPARQL. RDF data management in non-relational (often called NoSQL) databases has only recently been gaining momentum.
Due to the paper size limit, we only briefly introduce the reader to the most relevant works in this area. Techniques for evaluating SPARQL basic graph patterns using MapReduce are presented in [18] and [19]. Efficient approaches to analytical query processing and distributed reasoning on RDF graphs in MapReduce-based systems are proposed in [20] and [21]. The translation of SPARQL queries into Pig Latin queries that can be evaluated using Hadoop is presented in [22]. Efficient RDF querying in distributed RDF-3X is reported in [23]. RDF storage schemes and querying algorithms for HBase and MySQL Cluster are proposed in our own work [24]. Bitmap indices for RDF join processing on a single machine have been previously studied in [25]. While existing works deal with very large graphs that require partitioning, this work deals with very large numbers of relatively small RDF graphs, which enables us to apply unique optimizations in our storing, indexing, and querying techniques.

III. STORING AND INDEXING RDF GRAPHS IN HBASE

In this section, we first formalize the definitions of an RDF dataset, an RDF graph, and a SPARQL basic graph pattern. We then propose our indexing and storage schemes for RDF data in HBase.

A. RDF Data and Queries

Definition 3.1 (RDF dataset): An RDF dataset D is a set of RDF graphs {G_1, G_2, ..., G_n}, where each graph G_i ∈ D is a named graph that has a unique identifier G_i.id and n ≥ 1.

The RDF dataset definition requires each RDF graph to have a unique identifier, which is frequently the case in large collections of RDF graphs to allow easy distinction among graphs. Such an identifier is either a part of the graph description or can be assigned automatically without a loss of generality.

Definition 3.2 (RDF graph): An RDF graph G is a set of RDF triples {t_1, t_2, ..., t_n}, where n = |G| and each triple t_i ∈ G is a tuple of the form (s, p, o), with s, p, and o denoting a subject, predicate, and object, respectively.

The RDF graph definition views an RDF graph as a set of triples whose subjects and objects correspond to labeled nodes and whose predicates correspond to labeled edges. Each node in an RDF graph is labeled with a unique identifier (i.e., a Universal Resource Identifier, or URI) or a value (i.e., a literal). Edges are labeled with identifiers (i.e., URIs), and multiple edges with the same label are common. While the number of distinct labels for nodes, which correspond to subjects and objects, increases with the growth of an RDF graph or RDF dataset as new nodes are added, the number of labels for edges, which correspond to predicates, is usually bound by the number of properties defined in an annotation vocabulary (e.g., OPMV defines 13 properties and reuses a few properties from the RDF and RDFS namespaces; OPMO extends OPMV and supports around 50 properties). Therefore, for any large RDF dataset, it is safe to assume that the number of distinct predicates is substantially smaller (on the order of dozens or hundreds) than the number of distinct subjects or objects. We use P to denote the set of all predicates in an RDF dataset D; formally, P = P_1 ∪ P_2 ∪ ... ∪ P_n, where P_i = {p | (s, p, o) ∈ G_i}, G_i ∈ D, and D = {G_1, G_2, ..., G_n}.

In an RDF graph, the order of RDF triples is not semantically important. Since any concrete RDF graph serialization has to store triples in some order, it is convenient to view the set of triples in an RDF graph as an ordered set. We use the function num to denote the position of a triple in an RDF graph G, such that num(t_i) returns position i, where t_i ∈ G and G = {t_1, t_2, ..., t_n}. Furthermore, the inverse function num^-1(i) returns the triple t_i found at position i in graph G = {t_1, t_2, ..., t_n}.

To query RDF datasets and individual RDF graphs, the standard RDF query language, called SPARQL, is used. SPARQL allows defining various graph patterns that can be matched over RDF graphs to retrieve results. While SPARQL distinguishes basic graph patterns, optional graph patterns, and alternative graph patterns, in this paper we restrict our presentation to basic graph patterns as defined in the following.

Definition 3.3 (Basic graph pattern): A basic graph pattern bgp is a set of triple patterns {tp_1, tp_2, ..., tp_n}, also denoted as tp_1 AND tp_2 AND ... AND tp_n, where n ≥ 1, AND is a binary operator that corresponds to the conjunction in SPARQL, and each tp_i ∈ bgp is a triple (sp, pp, op), such that sp, pp, and op are a subject pattern, predicate pattern, and object pattern, respectively.

A basic graph pattern consists of the simplest querying constructs, called triple patterns. A triple pattern can contain variables or URIs as subject, predicate, and object patterns (object patterns can also be represented by literals) that are to be matched over the respective components of individual triples. Unlike a URI or a literal in a triple pattern, which has to match itself in a triple, a variable can match anything. Multiple occurrences of the same variable in a triple pattern or a basic graph pattern must be bound to the same values.

B. Indexing Scheme

Matching a basic graph pattern over an RDF graph involves matching the constituent triple patterns over a set of RDF triples. Each triple pattern yields an intermediate set of triples, and such intermediate results must be further joined together to find matching subgraphs. To speed up this computation, we define several bitmap indices.
Definition 3.4 (Index I_p): A bitmap index I_p for an RDF graph G ∈ D is a set of tuples {(p_1, v_1), (p_2, v_2), ..., (p_n, v_n)}, where p_i ∈ P is a predicate in the set of all predicates in an RDF dataset D, n = |P|, and v_i is a bit vector of size |G| that has 1 in the k-th position iff triple t_k = num^-1(k), t_k ∈ G, and t_k.p = p_i.

Index I_p helps quickly identify triples (their positions) in an RDF graph that have a particular predicate. I_p(p) denotes the bit vector for predicate p. The size of each bit vector is fixed and equals the number of triples in the graph (i.e., |G|). The number of vectors in the index equals the number of distinct predicates (i.e., |P|), which is relatively small (usually |P| < |G|). Similarly, indices I_s and I_o to quickly identify triples with a given subject and object can be defined. Intuitively, to find a triple with subject s, predicate p, and object o in an RDF graph, a logical AND of the corresponding vectors can be computed: I_s(s) ∧ I_p(p) ∧ I_o(o).

While the purpose of indices I_s, I_p, and I_o is to speed up the matching of individual triple patterns, the indices that we define next can be used to join intermediate results obtained via triple pattern matching into subgraphs.

Definition 3.5 (Index I_ss): A bitmap index I_ss for an RDF graph G ∈ D is a set of tuples {(1, v_1), (2, v_2), ..., (n, v_n)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and v_i is a bit vector of size |G| that has 1 in the k-th position iff t_k = num^-1(k), t_k ∈ G, t_i = num^-1(i), t_i ∈ G, and t_k.s = t_i.s.

Definition 3.6 (Index I_oo): A bitmap index I_oo for an RDF graph G ∈ D is a set of tuples {(1, v_1), (2, v_2), ..., (n, v_n)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and v_i is a bit vector of size |G| that has 1 in the k-th position iff t_k = num^-1(k), t_k ∈ G, t_i = num^-1(i), t_i ∈ G, and t_k.o = t_i.o.

Definition 3.7 (Indices I_so and I_os): A bitmap index I_so for an RDF graph G ∈ D is a set of tuples {(1, v_1), (2, v_2), ..., (n, v_n)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and v_i is a bit vector of size |G| that has 1 in the k-th position iff t_k = num^-1(k), t_k ∈ G, t_i = num^-1(i), t_i ∈ G, and t_k.o = t_i.s. A bitmap index I_os is the transpose of I_so, such that I_os = I_so^T.
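To make the selection indices concrete, the following minimal sketch builds I_s, I_p, and I_o for a small graph with java.util.BitSet and probes them with the conjunction I_s(s) ∧ I_p(p) ∧ I_o(o) described above. The sketch is illustrative only: the Triple record and the buildIndex helper are assumptions made here for the example rather than part of our system, and positions are 0-based in the code while the definitions use 1-based positions.

import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SelectionIndexSketch {

    record Triple(String s, String p, String o) {}

    // Maps a term (subject, predicate, or object) to a bit vector whose
    // k-th bit is set iff the triple at position k uses that term.
    static Map<String, BitSet> buildIndex(List<Triple> graph, Function<Triple, String> term) {
        Map<String, BitSet> index = new HashMap<>();
        for (int k = 0; k < graph.size(); k++) {
            index.computeIfAbsent(term.apply(graph.get(k)), t -> new BitSet(graph.size())).set(k);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Triple> g = List.of(
                new Triple("utpb:schema",   "rdf:type",            "opmv:Artifact"),
                new Triple("utpb:loadData", "rdf:type",            "opmv:Process"),
                new Triple("utpb:loadData", "opmv:used",           "utpb:schema"),
                new Triple("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"));

        Map<String, BitSet> Is = buildIndex(g, Triple::s);
        Map<String, BitSet> Ip = buildIndex(g, Triple::p);
        Map<String, BitSet> Io = buildIndex(g, Triple::o);

        // Find the triple (utpb:loadData, opmv:used, utpb:schema) via
        // the conjunction I_s(s) AND I_p(p) AND I_o(o).
        BitSet v = (BitSet) Is.get("utpb:loadData").clone();
        v.and(Ip.get("opmv:used"));
        v.and(Io.get("utpb:schema"));
        System.out.println(v); // prints {2}: the matching triple is at position 2
    }
}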

Table T_D:

rowid  | data:graph | index:I_s | index:I_p | index:I_o | index:I_ss | index:I_oo | index:I_so
G_1.id | G_1        | I_s1      | I_p1      | I_o1      | I_ss1      | I_oo1      | I_so1
G_2.id | G_2        | I_s2      | I_p2      | I_o2      | I_ss2      | I_oo2      | I_so2
...
G_n.id | G_n        | I_sn      | I_pn      | I_on      | I_ssn      | I_oon      | I_son

Figure 2. HBase Storage Scheme.

Indices I_ss, I_oo, I_so, and I_os are all of the same size of |G| × |G| bits for a given graph G. They can be used to quickly match triples from two sets based on the equality of their subjects, objects, subjects-objects, and objects-subjects, respectively. Intuitively, given a position i that corresponds to a triple t_i ∈ G such that t_i = num^-1(i), other triples (their positions) in G with the same subject t_i.s can be found in the bit vector I_ss(i). It should be noted that indices that allow matching triples based on predicate, subject-predicate, and object-predicate equalities can also be defined; however, their usability is limited, since graph patterns with variables shared by predicate patterns and other patterns are rarely used. We denote indices I_s, I_p, and I_o as selection indices, and indices I_ss, I_oo, I_so, and I_os as join indices. Note that index I_os can be obtained from index I_so and vice versa. Although we introduce both indices for theoretical completeness, only one is required in practice.

C. Storage Scheme

HBase stores data in tables that can be described as sparse multidimensional sorted maps and are structurally different from the relations found in conventional relational databases. An HBase table (hereafter, table for short) stores data rows that are sorted based on the row keys. Each row has a unique row key and an arbitrary number of columns, such that columns in two distinct rows do not have to be the same. A full column name (hereafter, column for short) consists of a column family and a column qualifier (e.g., family:qualifier), where column families are usually specified at the time of table creation and their number does not change, and column qualifiers are dynamically added or deleted as needed. Rows in a table can be distributed over different machines in an HBase cluster and efficiently retrieved based on a given row key and, if available, columns.

To store provenance datasets composed of provenance graphs serialized as RDF graphs, we propose the single-table storage scheme shown in Fig. 2. Each row in the table stores: (1) an RDF graph identifier as a unique row id/key, (2) a complete RDF graph as one aggregate value in the data column family, and (3) the precomputed bitmap indices for the respective RDF graph in the index column family. The decision to store each RDF graph as one value, rather than partition it into subgraphs or even individual triples, is motivated by the following observations. First, such storage avoids unnecessary data transfers that may occur if a graph is partitioned and distributed over different machines. Second, as we show in detail in the next section, expensive query processing operations (i.e., joins) can be performed using compact bitmap indices, and an RDF graph is only required to be accessed once to replace triple positions in query results with actual triples. Finally, unlike some applications that deal with very large graphs that cannot fit into the main memory of a single machine and therefore require partitioning, individual provenance graphs are relatively small in general (yet their number can be very large) and can be stored as one aggregate value. We present query processing over this storage scheme next.
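For illustration, a minimal sketch of writing and reading a row of this scheme against the HBase 0.94-era Java client API follows. The table name T_D, the column qualifiers, and the byte-level serialization of graphs and bit vectors are illustrative assumptions, not a prescription of our exact implementation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ProvenanceStoreSketch {

    // Stores one provenance graph and its six bitmap indices as a single row.
    public static void storeGraph(Configuration conf, String graphId,
                                  byte[] serializedGraph, byte[][] indexBytes) throws Exception {
        // indexBytes holds the serialized I_s, I_p, I_o, I_ss, I_oo, I_so structures.
        String[] qualifiers = {"I_s", "I_p", "I_o", "I_ss", "I_oo", "I_so"};
        HTable table = new HTable(conf, "T_D");
        try {
            Put put = new Put(Bytes.toBytes(graphId)); // row key = G.id
            put.add(Bytes.toBytes("data"), Bytes.toBytes("graph"), serializedGraph);
            for (int i = 0; i < qualifiers.length; i++) {
                put.add(Bytes.toBytes("index"), Bytes.toBytes(qualifiers[i]), indexBytes[i]);
            }
            table.put(put); // one row per provenance graph, stored whole on one region server
        } finally {
            table.close();
        }
    }

    // Reads a single index of a graph without transferring the graph itself.
    public static byte[] readIndex(Configuration conf, String graphId, String qualifier) throws Exception {
        HTable table = new HTable(conf, "T_D");
        try {
            Get get = new Get(Bytes.toBytes(graphId));
            get.addColumn(Bytes.toBytes("index"), Bytes.toBytes(qualifier));
            Result result = table.get(get);
            return result.getValue(Bytes.toBytes("index"), Bytes.toBytes(qualifier));
        } finally {
            table.close();
        }
    }
}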
IV. RDF QUERY PROCESSING IN HBASE

To be able to evaluate SPARQL queries in HBase, we design four efficient functions that deal with the application of selection indices, the application of join indices, the handling of special cases not supported by the indices, and basic graph pattern evaluation.

Algorithm 1 Applying selection indices
1: function applySelectionIndices
2: input: graph identifier G.id, triple pattern tp = (sp, pp, op), table T_D
3: output: bit vector v that has 1 in the k-th position, i.e., v[k] = 1, if triple t_k = num^-1(k), t_k ∈ G, and t_k matches the non-variable components of tp
4: Let I_s, I_p, and I_o be the respective indices in the row with rowid G.id of table T_D
5: Let v be a bit vector that has 1 in every position k, where 1 ≤ k ≤ |G|
6: if tp.sp is not a variable then
7:   v = v ∧ I_s(tp.sp)
8: end if
9: if tp.pp is not a variable then
10:   v = v ∧ I_p(tp.pp)
11: end if
12: if tp.op is not a variable then
13:   v = v ∧ I_o(tp.op)
14: end if
15: return v
16: end function

Function applySelectionIndices is outlined in Algorithm 1. It takes a graph identifier and a triple pattern and returns a bit vector of triple positions in the graph, where the value 1 signifies that the triple at the corresponding position matches the URIs and/or literals found in the triple pattern. The selection indices for a particular graph identifier are applied using the conjunction of bit vectors. The resulting bit vector encodes the result of matching one triple pattern over a graph.
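An in-memory Java sketch of Algorithm 1 follows. The TriplePattern record, the use of null to mark variable components, and the Map-based index lookups are simplifying assumptions; in our scheme, the three indices would be read from the index column family of the row with key G.id.

import java.util.BitSet;
import java.util.Map;

public class ApplySelectionIndicesSketch {

    record TriplePattern(String sp, String pp, String op) {} // null component = variable

    static BitSet applySelectionIndices(TriplePattern tp, int graphSize,
                                        Map<String, BitSet> Is,
                                        Map<String, BitSet> Ip,
                                        Map<String, BitSet> Io) {
        BitSet v = new BitSet(graphSize);
        v.set(0, graphSize);                                                // line 5: all positions set to 1
        if (tp.sp() != null) v.and(Is.getOrDefault(tp.sp(), new BitSet())); // lines 6-8
        if (tp.pp() != null) v.and(Ip.getOrDefault(tp.pp(), new BitSet())); // lines 9-11
        if (tp.op() != null) v.and(Io.getOrDefault(tp.op(), new BitSet())); // lines 12-14
        return v;                                                           // line 15
    }
}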

Algorithm 2 Applying join indices
1: function applyJoinIndices
2: input: graph identifier G.id, triple pattern with known solution tp, to-be-joined triple pattern tp′, triple position p, i.e., triple t_p = num^-1(p) matches tp, table T_D
3: output: bit vector v that has 1 in the k-th position, i.e., v[k] = 1, if triple t_k = num^-1(k), t_k ∈ G, and t_k can join with t_p based on the equality of their subjects, objects, and/or subjects-objects
4: Let I_ss, I_oo, and I_so be the respective indices in the row with rowid G.id of table T_D
5: Let v be a bit vector that has 1 in every position k, where 1 ≤ k ≤ |G|
6: Let v[p] = 0 /* to avoid joining the triple at position p with itself */
7: if tp.sp and tp′.sp are variables and tp.sp = tp′.sp then
8:   v = v ∧ I_ss(p)
9: end if
10: if tp.op and tp′.op are variables and tp.op = tp′.op then
11:   v = v ∧ I_oo(p)
12: end if
13: if tp.sp and tp′.op are variables and tp.sp = tp′.op then
14:   v = v ∧ I_so(p)
15: end if
16: if tp.op and tp′.sp are variables and tp.op = tp′.sp then
17:   v = v ∧ I_so^T(p)
18: end if
19: return v
20: end function

Function applyJoinIndices is outlined in Algorithm 2. This function, given a graph identifier, a triple pattern with one known solution represented by a triple position, and a to-be-joined triple pattern, can quickly compute a bit vector that encodes the solutions (triple positions) that join with the known solution. A join condition is implicitly encoded by the use of the same variable in the two triple patterns. It can be represented by the equality of subjects, objects, and/or subjects-objects in the two triple patterns. Join indices are also applied using the conjunction of the respective bit vectors.
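A corresponding in-memory sketch of Algorithm 2 follows, again under simplifying assumptions: variable components are strings starting with '?', the join indices are arrays of bit vectors indexed by (0-based) triple position, and I_os(p) is obtained by probing the transpose of I_so on demand.

import java.util.BitSet;

public class ApplyJoinIndicesSketch {

    record TriplePattern(String sp, String pp, String op) {}

    static boolean isVar(String x) { return x != null && x.startsWith("?"); }

    // tp has a known solution at triple position p; tpNext is the to-be-joined pattern.
    static BitSet applyJoinIndices(TriplePattern tp, TriplePattern tpNext, int p,
                                   BitSet[] Iss, BitSet[] Ioo, BitSet[] Iso) {
        int n = Iss.length;
        BitSet v = new BitSet(n);
        v.set(0, n);
        v.clear(p); // line 6: avoid joining the triple at position p with itself
        if (isVar(tp.sp()) && isVar(tpNext.sp()) && tp.sp().equals(tpNext.sp()))
            v.and(Iss[p]);                     // subject-subject join (lines 7-9)
        if (isVar(tp.op()) && isVar(tpNext.op()) && tp.op().equals(tpNext.op()))
            v.and(Ioo[p]);                     // object-object join (lines 10-12)
        if (isVar(tp.sp()) && isVar(tpNext.op()) && tp.sp().equals(tpNext.op()))
            v.and(Iso[p]);                     // candidate's object must equal t_p's subject (lines 13-15)
        if (isVar(tp.op()) && isVar(tpNext.sp()) && tp.op().equals(tpNext.sp()))
            v.and(transposeColumn(Iso, p, n)); // candidate's subject must equal t_p's object (lines 16-18)
        return v;
    }

    // Row p of the transpose I_so^T, i.e., I_os(p).
    static BitSet transposeColumn(BitSet[] Iso, int p, int n) {
        BitSet col = new BitSet(n);
        for (int k = 0; k < n; k++) {
            if (Iso[k].get(p)) col.set(k);
        }
        return col;
    }
}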
Algorithm 3 Handling special cases
1: function handleSpecialCases
2: input: basic graph pattern bgp, set of solutions S
3: output: set of solutions F ⊆ S
4: Let F = S
5: /* Special case not supported by selection indices - rare in practice */
6: if any tp ∈ bgp contains any two variables with the same name then
7:   for each s ∈ F do
8:     Discard s, i.e., F = F \ {s}, if the triple t ∈ s that corresponds to tp has different bindings for such variables
9:   end for
10: end if
11: /* Special case not supported by join indices - rare in practice */
12: if any tp ∈ bgp has a variable at tp.pp that also occurs in some other tp′ ∈ bgp, tp′ ≠ tp then
13:   for each s ∈ F do
14:     Discard s, i.e., F = F \ {s}, if the triples t ∈ s and t′ ∈ s that correspond to tp and tp′ have different bindings for such variables
15:   end for
16: end if
17: return F
18: end function

Function handleSpecialCases is outlined in Algorithm 3. This function performs post-processing of the final results obtained via basic graph pattern matching. It takes a basic graph pattern and its set of solutions, where each solution is represented by a sequence of actual triples, and deals with the special cases not supported by the selection and join indices. In particular, the selection indices have no means to verify that, if a triple pattern contains the same variable twice (or even three times), a matching triple must have identical values bound to the multiple occurrences of this variable. The join indices do not support join conditions based on the equality of a predicate and any other term in a triple. It is possible to add additional indices to handle selection and join operations on predicates; however, such indices would rarely be needed for real-life queries.

Algorithm 4 Matching a basic graph pattern over a graph
1: function matchBGP
2: input: graph identifier G.id, basic graph pattern bgp = {tp_1, tp_2, ..., tp_n}, n ≥ 1, table T_D
3: output: set of subgraph solutions S = {g | g is a (sub)graph of G and g matches bgp}
4: Order the triple patterns in bgp, such that triple patterns that yield a smaller result and triple patterns that have a shared variable with preceding triple patterns are evaluated first
5: Let the ordered bgp = (tp′_1, tp′_2, ..., tp′_n)
6: v_sel = applySelectionIndices(G.id, tp′_1, T_D)
7: S = {(k) | v_sel[k] = 1} /* solutions for the first triple pattern */
8: if S = ∅ then return S end if
9: for each tp′_i in (tp′_2, ..., tp′_n) do
10:   v_sel = applySelectionIndices(G.id, tp′_i, T_D)
11:   Let set S_join = ∅ /* solutions for the current join */
12:   Let set TP = {tp′_j | tp′_j ∈ (tp′_1, tp′_2, ..., tp′_(i-1)), j < i, and tp′_i and tp′_j have variables with the same name as subject or object patterns}
13:   for each s in S do
14:     v_join = v_sel
15:     for each tp′_j in (tp′_1, tp′_2, ..., tp′_(i-1)) do
16:       if tp′_j ∈ TP then
17:         v_join = v_join ∧ applyJoinIndices(G.id, tp′_j, tp′_i, s[j], T_D) /* s[j] is the solution (triple position) for tp′_j found in sequence s at position j */
18:       end if
19:     end for
20:     S_tp′_i = {(k) | v_join[k] = 1} /* solutions for the current triple pattern */
21:     Compute the Cartesian product of {s} and S_tp′_i, i.e., S_join = S_join ∪ ({s} × S_tp′_i)
22:   end for
23:   S = S_join
24:   if S = ∅ then return S end if
25: end for
26: /* Replace triple positions in S with actual triples */
27: for each s in S do
28:   s′ = {num^-1(k) | k ∈ s}
29:   Replace s with s′ in S
30: end for
31: S = handleSpecialCases(bgp, S) /* handle special cases that are not supported by the selection and join indices */
32: return S
33: end function

Finally, the main function matchBGP is outlined in Algorithm 4. This function matches a SPARQL basic graph pattern bgp that consists of a set of triple patterns tp_1, tp_2, ..., tp_n over an RDF graph with a known identifier that is stored in HBase. The final result is a set of subgraph solutions S. The algorithm starts by ordering the triple patterns in bgp (lines 4 and 5) using two criteria: (1) triple patterns that yield a smaller result should be evaluated first to decrease the number of iterations, and (2) triple patterns that have a shared variable with preceding triple patterns should be given preference over triple patterns with no shared variables to avoid unnecessary Cartesian products. Next (lines 6-8), the algorithm applies the selection indices and obtains a set of solutions for the first triple pattern. Each solution in the set is represented by a sequence with one triple position. An empty set of solutions results in an empty result for the whole basic graph pattern. All subsequent triple patterns are then evaluated and joined with the already available results (lines 9-25). For any subsequent triple pattern, the selection indices are applied (line 10), an empty set of join solutions is prepared (line 11), and the preceding triple patterns that share variables with the current triple pattern are identified (line 12). For each solution that has been obtained for the preceding triple patterns (lines 13-22), the join indices are applied (lines 14-19), the bit vector resulting from both the selection and join index applications is converted to a set of solutions for the current triple pattern (line 20), and the join result is computed by combining the known solution with the newly computed ones (line 21). The set of known solutions is then updated (line 23) and verified to be non-empty (line 24). The process repeats for the next available triple pattern. Once all joins are processed, each triple position in the set of solutions is replaced with the actual triple from the graph using the num^-1 function (lines 26-30). The solutions are then post-processed by function handleSpecialCases to accommodate the cases that are not supported by the selection and join indices (line 31). Finally, the resulting set S is returned (line 32).

Some of the advantages of these algorithms include: (1) the expensive selection and join computations are performed over indices rather than over a graph; (2) the computation heavily relies on numeric values that represent triple positions rather than actual triples with lengthy literals and URIs; and (3) the computation can be fully completed on the same machine where the data resides, eliminating all intermediate data transfers over a network.
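To tie the pieces together, the following self-contained toy run (an illustration under simplifying assumptions, not our production code; positions are 0-based) evaluates the two-pattern basic graph pattern ?p opmv:used ?a . ?a rdf:type opmv:Artifact over a five-triple graph, mirroring the selection step, the object-subject join via I_os = I_so^T, and the final replacement of positions by actual triples:

import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MatchBgpToy {
    public static void main(String[] args) {
        String[][] g = {
                {"utpb:schema",   "rdf:type",  "opmv:Artifact"},
                {"utpb:dataset",  "rdf:type",  "opmv:Artifact"},
                {"utpb:loadData", "rdf:type",  "opmv:Process"},
                {"utpb:loadData", "opmv:used", "utpb:schema"},
                {"utpb:loadData", "opmv:used", "utpb:dataset"}};
        int n = g.length;

        // Selection indices I_p and I_o (I_s omitted: no constant subjects in this pattern).
        Map<String, BitSet> Ip = new HashMap<>(), Io = new HashMap<>();
        for (int k = 0; k < n; k++) {
            Ip.computeIfAbsent(g[k][1], x -> new BitSet(n)).set(k);
            Io.computeIfAbsent(g[k][2], x -> new BitSet(n)).set(k);
        }
        // Join index I_os (the transpose of I_so): Ios[i] has bit k set iff t_k.s = t_i.o.
        BitSet[] Ios = new BitSet[n];
        for (int i = 0; i < n; i++) {
            Ios[i] = new BitSet(n);
            for (int k = 0; k < n; k++) {
                if (g[k][0].equals(g[i][2])) Ios[i].set(k);
            }
        }

        // tp_1 = (?p, opmv:used, ?a): selection only.
        BitSet v1 = (BitSet) Ip.get("opmv:used").clone();

        // tp_2 = (?a, rdf:type, opmv:Artifact): selection, then join on ?a,
        // which is the object of tp_1 and the subject of tp_2.
        List<int[]> solutions = new ArrayList<>();
        for (int p = v1.nextSetBit(0); p >= 0; p = v1.nextSetBit(p + 1)) {
            BitSet v2 = (BitSet) Ip.get("rdf:type").clone();
            v2.and(Io.get("opmv:Artifact"));
            v2.and(Ios[p]); // join: candidate's subject must equal t_p's object
            for (int k = v2.nextSetBit(0); k >= 0; k = v2.nextSetBit(k + 1)) {
                solutions.add(new int[]{p, k});
            }
        }
        // Only now are positions replaced by actual triples (num^-1).
        for (int[] s : solutions) {
            System.out.println(String.join(" ", g[s[0]]) + "  |  " + String.join(" ", g[s[1]]));
        }
        // Prints the two matches: the loadData-used-schema and loadData-used-dataset
        // edges, each paired with the typing triple of its artifact.
    }
}

Note that, as in Algorithm 4, the graph itself is touched only in the last step; all selection and join work happens on compact bit vectors.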

V. PERFORMANCE STUDY

This section reports our empirical evaluation of the proposed approach and algorithms.

A. Experimental Setup

Hardware. Our experiments used nine commodity machines with identical hardware. Each machine had a late-model 3.0 GHz 64-bit Pentium 4 processor, 2 GB of DDR2-533 RAM, and an 80 GB 7200 rpm Serial ATA hard drive. The machines were networked together via their add-on gigabit Ethernet adapters connected to a Dell PowerConnect 2724 gigabit Ethernet switch and were all running 64-bit Debian Linux 6.0 and Oracle JDK 7.

Hadoop and HBase. Hadoop 1.0.0 and HBase 0.94 were used. Minor changes to the default configuration for stability included setting each block of data to replicate two times and increasing the HBase max heap size to 1.2 GB. Out of the nine identical machines in the cluster, one was designated as the HBase master and the other eight were HBase region servers (slaves).

Our implementation. Our algorithms were implemented in Java, and the experiments were conducted using Bash shell scripts to execute the Java class files and store the results in an automated and repeatable manner.

B. Datasets and Queries

The experiments used the University of Texas Provenance Benchmark (UTPB) [17].
UTPB includes provenance templates defined according to the three vocabularies of the Open Provenance Model (OPM)¹, provenance generation software capable of generating provenance for any number of workflow runs based on a provenance template, and provenance test queries in several categories. We used UTPB to generate datasets of varying sizes using the Database Experiment template for a successful workflow execution, serialized based on the Open Provenance Model OWL Ontology (OPMO). Each generated RDF graph in these datasets represented the provenance of a single workflow execution and contained roughly 4,000 RDF triples. Table I indicates the characteristics of each generated UTPB dataset. The table does not take into account the dictionary file (also an RDF graph) that was generated by UTPB for each dataset and contained all graph identifiers. The number of triples in this graph was the same as the number of RDF graphs in the dataset from Table I (e.g., 1,000 triples for D1).

We used 11 UTPB test queries in the first four categories (Graphs, Dependencies, Artifacts, and Processes) to benchmark the performance of our implementation. The exact queries expressed in SPARQL can be found on the UTPB website².

When a provenance dataset was stored in our HBase cluster according to the proposed schema, HBase automatically partitioned the table into regions (subsets of rows). The available region servers were assigned to handle certain regions. In other words, the provenance dataset was partitioned into subsets of provenance graphs that were stored on individual machines in the cluster. Every provenance graph (along with its indices) was stored as a whole on one of the machines, with no partitioning. Therefore, any query over an individual provenance graph was processed by the machine that stored the graph, avoiding any expensive data transfers of intermediate results among region servers. The final result of a query was transferred to a client application running on the HBase master.

¹Open Provenance Model. ²University of Texas Provenance Benchmark, chebotkoa/utpb/

Table I. DATASET CHARACTERISTICS.

Dataset | # of RDF graphs (# of workflow runs) | # of RDF triples | Size
D1      | 1,000                                | 4,000,000        | 2.1 GB
D2      | 2,000                                | 8,000,000        | 4.2 GB
D3      | 3,000                                | 12,000,000       | 6.3 GB
D4      | 4,000                                | 16,000,000       | 8.4 GB
D5      | 5,000                                | 20,000,000       | 10.5 GB

Figure 3. Query Performance and Scalability (X-axis: dataset; Y-axis: query execution time, ms; one series per query Q1-Q11).

C. Query Evaluation Performance and Scalability

The query performance and scalability of our approach are reported in Fig. 3. Queries Q1 and Q2 were in the Graphs category and had basic graph patterns with one triple pattern. Both Q1 and Q2 yielded larger results compared to the other queries. Q1 returned all graph identifiers in a dataset (e.g., 1,000 triples for D1 and 5,000 triples for D5), and Q2 returned all triples in a particular provenance graph (around 4,000 triples for each dataset). Even though these queries were the simplest among the 11 UTPB test queries, they proved to be more expensive due to their larger result sets, which is especially evident for query Q1, the only query whose performance was on the order of seconds. In the case of both Q1 and Q2, which involved no joins, the major factor in query performance is the transfer time of the final query results to a client machine, and it is hardly possible to achieve better performance on the given hardware. By contrast, all other queries performed on the order of tens of milliseconds, required joins, and returned subsets of triples in a particular provenance graph (< 4,000 triples).

Queries Q3-Q7 were in the Dependencies category and dealt with various dependencies among artifacts and processes in provenance graphs. They all had similar complexities: basic graph patterns with three triple patterns each. They also returned comparable result sets in terms of the number of triples (except query Q4, which returned an empty result set for the selected UTPB provenance graph template). As a result, these queries showed very similar query evaluation performance, with Q4 being the fastest, as it only required the evaluation of its first triple pattern to compute the final (empty) result.

Queries Q8 and Q9 were in the Artifacts category and dealt with data artifacts in provenance graphs. Q8 contained six triple patterns and two optional clauses. Q9 had 18 triple patterns, two optional clauses, two filter constructs, and one union construct. Q9 is the most complex query of all, yet it was shown to be efficient and scalable with our approach. The last two queries, Q10 and Q11, were in the Processes category and dealt with processes in provenance graphs. Q10 had two triple patterns and one optional clause. While Q11 is a complex query with 11 triple patterns and one union clause, it yielded an empty query result in our experiments due to the selected provenance template.

In summary, the proposed approach and its implementation proved to be efficient and scalable. Q1 showed linear scalability and took the most time to execute due to its relatively large result set. The other queries showed nearly constant scalability (technically, linear with a small slope). This can be explained by the fact that each query (except Q1) dealt with a single provenance graph of fixed size, with minimal data transfers and fast index-based join processing.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we studied the problem of storing and querying large collections of scientific workflow provenance graphs serialized as RDF graphs in Apache HBase. We designed novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets.
Our storage scheme takes advantage of the fact that individual provenance graphs generally fit into the memory of a single machine and require no partitioning. Our bitmap indices are stored together with the graphs and support both selection and join operations for efficient query processing. We also proposed efficient querying algorithms to evaluate SPARQL queries in HBase. Our algorithms rely on indices to compute expensive join operations, make use of numeric values that represent triple positions rather than actual triples with lengthy literals and URIs, and eliminate the need for intermediate data transfers over a network. Finally, we conducted an empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark. Our experiments confirmed that our proposed storage, indexing and querying techniques are efficient and scalable for large provenance datasets. In the future, we plan to compare our approach with other SQL and NoSQL solutions in the context of distributed scientific workflow provenance management, as well as experiment with a multi-user workload to measure the query throughput of our system.

REFERENCES

[1] Y. Simmhan, B. Plale, and D. Gannon, "A survey of data provenance in e-science," SIGMOD Record, vol. 34, no. 3, 2005.
[2] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher, T. M. McPhillips, S. Bowers, M. K. Anand, and J. Freire, "Provenance in scientific workflow systems," IEEE Data Engineering Bulletin, vol. 30, no. 4, 2007.
[3] S. B. Davidson and J. Freire, "Provenance and scientific workflows: challenges and opportunities," in Proc. of SIGMOD Conference, 2008.
[4] V. Cuevas-Vicenttín, S. C. Dey, S. Köhler, S. Riddle, and B. Ludäscher, "Scientific workflows and provenance: Introduction and research opportunities," Datenbank-Spektrum, vol. 12, no. 3, 2012.
[5] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. T. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. G. Stephan, and J. V. den Bussche, "The Open Provenance Model core specification (v1.1)," Future Gen. Comp. Syst., vol. 27, no. 6, 2011.
[6] A. Chebotko, S. Lu, X. Fei, and F. Fotouhi, "RDFProv: A relational RDF store for querying and managing scientific workflow provenance," Data Knowl. Eng., vol. 69, no. 8, 2010.
[7] J. Zhao, C. A. Goble, R. Stevens, and D. Turi, "Mining Taverna's semantic web of provenance," Concurr. Comput.: Pract. Exper., vol. 20, no. 5, 2008.
[8] Third Provenance Challenge, Challenge/ThirdProvenanceChallenge.
[9] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua, "A reference architecture for scientific workflow management systems and the VIEW SOA solution," IEEE Transactions on Services Computing, vol. 2, no. 1, 2009.
[10] T. M. Oinn, et al., "Taverna: lessons in creating a workflow environment for the life sciences," Concurr. Comput.: Pract. Exper., vol. 18, no. 10, 2006.
[11] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. B. Jones, E. A. Lee, J. Tao, and Y. Zhao, "Scientific workflow management and the Kepler system," Concurr. Comput.: Pract. Exper., vol. 18, no. 10, 2006.
[12] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and H. T. Vo, "Managing the evolution of dataflows with VisTrails," in Proc. of ICDE Workshops, 2006, p. 71.
[13] J. Kim, E. Deelman, Y. Gil, G. Mehta, and V. Ratnakar, "Provenance trails in the Wings/Pegasus system," Concurr. Comput.: Pract. Exper., vol. 20, no. 5, 2008.
[14] Y. Zhao, et al., "Swift: Fast, reliable, loosely coupled parallel computation," in Proc. of SWF, 2007.
[15] Apache HBase.
[16] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems, vol. 26, no. 2, 2008.
[17] A. Chebotko, E. D. Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly, "UTPB: A benchmark for scientific workflow provenance storage and querying systems," in Proc. of SWF, 2012.
[18] M. F. Husain, L. Khan, M. Kantarcioglu, and B. M. Thuraisingham, "Data intensive query processing for large RDF graphs using cloud computing tools," in Proc. of CLOUD, 2010.
[19] J. Myung, J. Yeon, and S. Lee, "SPARQL basic graph pattern processing with iterative MapReduce," in Proc. of MDAC, 2010, pp. 6:1-6:6.
[20] P. Ravindra, V. V. Deshpande, and K. Anyanwu, "Towards scalable RDF graph analytics on MapReduce," in Proc. of MDAC, 2010, pp. 5:1-5:6.
[21] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen, "Scalable distributed reasoning using MapReduce," in Proc. of ISWC, 2009.
[22] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen, "PigSPARQL: mapping SPARQL to Pig Latin," in Proc. of SWIM, 2011, p. 4.
[23] J. Huang, D. J. Abadi, and K. Ren, "Scalable SPARQL querying of large RDF graphs," PVLDB, vol. 4, no. 11, 2011.
[24] C. Franke, S. Morin, A. Chebotko, J. Abraham, and P. Brazier, "Distributed semantic web data management in HBase and MySQL Cluster," in Proc. of CLOUD, 2011.
[25] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler, "Matrix 'Bit' loaded: a scalable lightweight join query processor for RDF data," in Proc. of WWW, 2010.


More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Orchestrating Music Queries via the Semantic Web

Orchestrating Music Queries via the Semantic Web Orchestrating Music Queries via the Semantic Web Milos Vukicevic, John Galletly American University in Bulgaria Blagoevgrad 2700 Bulgaria +359 73 888 466 milossmi@gmail.com, jgalletly@aubg.bg Abstract

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011

RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011 29 May 2011 RDFPath Path Query Processing on Large RDF Graphs with MapReduce 1 st Workshop on High-Performance Computing for the Semantic Web (HPCSW 2011) Martin Przyjaciel-Zablocki Alexander Schätzle

More information

When, Where & Why to Use NoSQL?

When, Where & Why to Use NoSQL? When, Where & Why to Use NoSQL? 1 Big data is becoming a big challenge for enterprises. Many organizations have built environments for transactional data with Relational Database Management Systems (RDBMS),

More information

Managing the Evolution of Dataflows with VisTrails Extended Abstract

Managing the Evolution of Dataflows with VisTrails Extended Abstract Managing the Evolution of Dataflows with VisTrails Extended Abstract Steven P. Callahan Juliana Freire Emanuele Santos Carlos E. Scheidegger Cláudio T. Silva Huy T. Vo University of Utah vistrails@sci.utah.edu

More information

A Granular Concurrency Control for Collaborative Scientific Workflow Composition

A Granular Concurrency Control for Collaborative Scientific Workflow Composition A Granular Concurrency Control for Collaborative Scientific Workflow Composition Xubo Fei, Shiyong Lu, Jia Zhang Department of Computer Science, Wayne State University, Detroit, MI, USA {xubo, shiyong}@wayne.edu

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

WebLab PROV : Computing fine-grained provenance links for XML artifacts

WebLab PROV : Computing fine-grained provenance links for XML artifacts WebLab PROV : Computing fine-grained provenance links for XML artifacts Bernd Amann LIP6 - UPMC, Paris Camelia Constantin LIP6 - UPMC, Paris Patrick Giroux EADS-Cassidian, Val de Reuil Clément Caron EADS-Cassidian,

More information

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases Khalid Mahmood Shaheed Zulfiqar Ali Bhutto Institute of Science and Technology, Karachi Pakistan khalidmdar@yahoo.com

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Abstract Provenance Graphs: Anticipating and Exploiting Schema-Level Data Provenance

Abstract Provenance Graphs: Anticipating and Exploiting Schema-Level Data Provenance Abstract Provenance Graphs: Anticipating and Exploiting Schema-Level Data Provenance Daniel Zinn and Bertram Ludäscher fdzinn,ludaeschg@ucdavis.edu Abstract. Provenance graphs capture flow and dependency

More information

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko RAMCloud Scalable High-Performance Storage Entirely in DRAM 2009 by John Ousterhout et al. Stanford University presented by Slavik Derevyanko Outline RAMCloud project overview Motivation for RAMCloud storage:

More information

An Efficient Approach to Triple Search and Join of HDT Processing Using GPU

An Efficient Approach to Triple Search and Join of HDT Processing Using GPU An Efficient Approach to Triple Search and Join of HDT Processing Using GPU YoonKyung Kim, YoonJoon Lee Computer Science KAIST Daejeon, South Korea e-mail: {ykkim, yjlee}@dbserver.kaist.ac.kr JaeHwan Lee

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows

A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows A Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows Cui Lin, Shiyong Lu, Xubo Fei, Darshan Pai and Jing Hua Department of Computer Science, Wayne State University {cuilin,

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

A New HadoopBased Network Management System with Policy Approach

A New HadoopBased Network Management System with Policy Approach Computer Engineering and Applications Vol. 3, No. 3, September 2014 A New HadoopBased Network Management System with Policy Approach Department of Computer Engineering and IT, Shiraz University of Technology,

More information

An Entity Based RDF Indexing Schema Using Hadoop And HBase

An Entity Based RDF Indexing Schema Using Hadoop And HBase An Entity Based RDF Indexing Schema Using Hadoop And HBase Fateme Abiri Dept. of Computer Engineering Ferdowsi University Mashhad, Iran Abiri.fateme@stu.um.ac.ir Mohsen Kahani Dept. of Computer Engineering

More information

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads WHITE PAPER Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads December 2014 Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Publishing and Consuming Provenance Metadata on the Web of Linked Data

Publishing and Consuming Provenance Metadata on the Web of Linked Data Publishing and Consuming Provenance Metadata on the Web of Linked Data Olaf Hartig 1 and Jun Zhao 2 1 Humboldt-Universität zu Berlin hartig@informatik.hu-berlin.de 2 University of Oxford jun.zhao@zoo.ox.ac.uk

More information

New Approaches to Big Data Processing and Analytics

New Approaches to Big Data Processing and Analytics New Approaches to Big Data Processing and Analytics Contributing authors: David Floyer, David Vellante Original publication date: February 12, 2013 There are number of approaches to processing and analyzing

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

MRBench : A Benchmark for Map-Reduce Framework

MRBench : A Benchmark for Map-Reduce Framework MRBench : A Benchmark for Map-Reduce Framework Kiyoung Kim, Kyungho Jeon, Hyuck Han, Shin-gyu Kim, Hyungsoo Jung, Heon Y. Yeom School of Computer Science and Engineering Seoul National University Seoul

More information

L22: NoSQL. CS3200 Database design (sp18 s2) 4/5/2018 Several slides courtesy of Benny Kimelfeld

L22: NoSQL. CS3200 Database design (sp18 s2)   4/5/2018 Several slides courtesy of Benny Kimelfeld L22: NoSQL CS3200 Database design (sp18 s2) https://course.ccs.neu.edu/cs3200sp18s2/ 4/5/2018 Several slides courtesy of Benny Kimelfeld 2 Outline 3 Introduction Transaction Consistency 4 main data models

More information

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability 2014 IEEE International Conference on Big Data Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability Rong Gu, Wei Hu, Yihua Huang National Key Laboratory for Novel Software

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

SMCCSE: PaaS Platform for processing large amounts of social media

SMCCSE: PaaS Platform for processing large amounts of social media KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and

More information

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis 1 NoSQL So-called NoSQL systems offer reduced functionalities compared to traditional Relational DBMS, with the aim of achieving

More information