Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase
Artem Chebotko, John Abraham, Pearl Brazier
Department of Computer Science
University of Texas - Pan American
Edinburg, TX, USA
{chebotkoa, jabraham, brazier}@utpa.edu

Anthony Piazza
Piazza Software Consulting
Corpus Christi, TX, USA
tony@piazzaconsulting.com

Andrey Kashlev, Shiyong Lu
Department of Computer Science
Wayne State University
Detroit, MI, USA
{andrey.kashlev, shiyong}@wayne.edu

Abstract: Provenance, which records the history of an in-silico experiment, has been identified as an important requirement for scientific workflows to support scientific discovery reproducibility, result interpretation, and problem diagnosis. Large provenance datasets are composed of many smaller provenance graphs, each of which corresponds to a single workflow execution. In this work, we explore and address the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. Specifically, we propose: (i) novel storage and indexing techniques for RDF data in HBase that are better suited for provenance datasets than for generic RDF graphs and (ii) novel SPARQL query evaluation algorithms that rely solely on indices to compute expensive join operations, make use of numeric values that represent triple positions rather than actual triples, and eliminate the need for intermediate data transfers over a network. The empirical evaluation of our algorithms using provenance datasets and queries of the University of Texas Provenance Benchmark confirms that our approach is efficient and scalable.

Keywords: scientific workflow; provenance; big data; HBase; distributed database; SPARQL; RDF; query; scalability

I.
INTRODUCTION

In scientific workflow environments, scientific discovery reproducibility, result interpretation, and problem diagnosis primarily depend on the availability of provenance metadata that records the complete history of an in-silico experiment [1], [2], [3], [4]. Using the terminology of the Open Provenance Model (OPM) [5], the provenance of a single workflow execution is a directed graph with three kinds of nodes: artifacts (e.g., data products), processes (e.g., computations or actions), and agents (e.g., catalysts of a process). Nodes are connected via directed edges that represent five types of dependencies: process used artifact, artifact wasGeneratedBy process, process wasControlledBy agent, process wasTriggeredBy process, and artifact wasDerivedFrom artifact. In addition, OPM includes a number of other constructs that are helpful for provenance modelling. A sample provenance graph that uses the OPM notation and encodes the provenance of a workflow for creating and populating a relational database is shown in Fig. 1. In this graph, ellipses and rectangles denote artifacts and processes, respectively, and edges denote dependencies whose interpretations can be inferred based on their domains and ranges.

[Figure 1. Sample OPM Provenance Graph (nodes: Create Table SQL Statements, Create Index SQL Statements, Create Trigger SQL Statements, Create Database Schema, Schema, Load Data, Instance, Dataset).]

This provenance graph can be serialized using the Resource Description Framework (RDF) and OPM vocabularies, such as the Open Provenance Model Vocabulary (OPMV) or the Open Provenance Model OWL Ontology (OPMO). As an example, we show a partial serialization of the presented provenance graph using OPMV and the Terse RDF Triple Language (Turtle):

utpb:schema rdf:type opmv:Artifact .
utpb:instance rdf:type opmv:Artifact .
utpb:dataset rdf:type opmv:Artifact .
utpb:loaddata rdf:type opmv:Process .
utpb:loaddata opmv:used utpb:schema, utpb:dataset .
utpb:instance opmv:wasGeneratedBy utpb:loaddata .
utpb:instance opmv:wasDerivedFrom utpb:schema, utpb:dataset .

The provenance graph is now effectively converted to an RDF graph (or a set of triples that encode its edges) and can be further stored and queried using the SPARQL query language. SPARQL and other RDF query languages have been frequently used for provenance querying [6], [7], [8]. A provenance query, such as "Find all artifacts and their values, if any, in a provenance graph with a given identifier", can be expressed in SPARQL as
SELECT ?artifact ?value
FROM NAMED <…>
WHERE {
  GRAPH utpb:opmgraph {
    ?artifact rdf:type opmv:Artifact .
    OPTIONAL { ?artifact rdfs:label ?value . }
  }
}

The main focus of our research in this work is the efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. With the development of user-friendly and powerful tools, such as scientific workflow management systems [9], [10], [11], [12], [13], [14], scientists are able to design and repeatedly execute workflows with different input datasets and varying input parameters with just a few mouse clicks. Each workflow execution generates a provenance graph that will be stored and queried on different occasions. A single provenance graph is readily manageable: its size is correlated with the workflow size, and even workflows with many hundreds of processes produce a relatively small metadata footprint that fits into the main memory of a single machine. The challenge arises when hundreds of thousands or even millions of provenance graphs constitute a provenance dataset. Managing large and constantly growing provenance datasets on a single machine eventually fails, and we turn to distributed data management solutions. We design such a solution for large provenance datasets based on Apache HBase [15], an open-source implementation of Google's BigTable [16]. While we deploy and evaluate our solution on a small cluster of commodity machines, HBase is readily available in cloud environments, suggesting virtually unlimited elasticity. The main contributions of this work are: (i) novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets, and (ii) novel and efficient querying algorithms to evaluate SPARQL queries in HBase that are optimized to make use of bitmap indices and numeric values instead of triples.
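For illustration only, the semantics of this query can be sketched in plain Python over an in-memory list of triples; the graph content mirrors the OPMV serialization shown above, the label on utpb:schema is a hypothetical addition, and None models an unbound OPTIONAL variable. This sketch is not part of our system.

```python
# Illustrative sketch: the artifacts-with-optional-labels query over a
# plain Python list of (subject, predicate, object) triples.
graph = [
    ("utpb:schema",   "rdf:type",            "opmv:Artifact"),
    ("utpb:instance", "rdf:type",            "opmv:Artifact"),
    ("utpb:dataset",  "rdf:type",            "opmv:Artifact"),
    ("utpb:loaddata", "rdf:type",            "opmv:Process"),
    ("utpb:loaddata", "opmv:used",           "utpb:schema"),
    ("utpb:loaddata", "opmv:used",           "utpb:dataset"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loaddata"),
    ("utpb:schema",   "rdfs:label",          "database schema"),  # hypothetical label
]

def artifacts_with_labels(triples):
    # OPTIONAL { ?artifact rdfs:label ?value } -> dict lookup with None default
    labels = {s: o for (s, p, o) in triples if p == "rdfs:label"}
    return [(s, labels.get(s))
            for (s, p, o) in triples
            if p == "rdf:type" and o == "opmv:Artifact"]
```

The OPTIONAL clause becomes a lookup that leaves the value unbound (None) when no label triple exists for an artifact.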
Our solution enables the evaluation of queries over an individual provenance graph without intermediate data transfers over a network. In addition, we conducted an empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark [17]. Our experiments confirmed that our proposed storage, indexing and querying techniques are efficient and scalable for large provenance datasets.

II. RELATED WORK

Besides HBase, there are multiple projects under the Apache umbrella that focus on distributed computing, including Hadoop, Cassandra, Hive, Pig, and CouchDB. Hadoop implements a MapReduce software framework and a distributed file system. Cassandra blends a fully distributed design with a column-oriented storage model. Hive deals with data warehousing on top of Hadoop and provides its own Hive QL query language. Pig is geared towards analyzing large datasets through its high-level Pig Latin language for expressing data analysis programs, which are then turned into MapReduce jobs. CouchDB is a distributed, document-oriented database that supports incremental MapReduce queries written in JavaScript. Along the same lines, other projects in academia and industry include Cheetah (data warehousing on top of MapReduce), Hadoop++ (an improved MapReduce framework based on Hadoop), G-Store (a key-value store with multi-key access functionality), and Hadapt (data warehousing on top of MapReduce). None of the above projects targets RDF data specifically or supports SPARQL. RDF data management in non-relational (often called NoSQL) databases has only recently been gaining momentum. Due to the paper size limit, we only briefly introduce the reader to the most relevant works in this area. Techniques for evaluating SPARQL basic graph patterns using MapReduce are presented in [18] and [19].
Efficient approaches to analytical query processing and distributed reasoning on RDF graphs in MapReduce-based systems are proposed in [20] and [21]. The translation of SPARQL queries into Pig Latin queries that can be evaluated using Hadoop is presented in [22]. Efficient RDF querying in distributed RDF-3X is reported in [23]. RDF storage schemes and querying algorithms for HBase and MySQL Cluster are proposed in our own work [24]. Bitmap indices for RDF join processing on a single machine have been previously studied in [25]. While existing works deal with very large graphs that require partitioning, this work deals with very large numbers of relatively small RDF graphs, which enables us to apply unique optimizations in our storing, indexing, and querying techniques.

III. STORING AND INDEXING RDF GRAPHS IN HBASE

In this section, we first formalize the definitions of RDF dataset, RDF graph, and SPARQL basic graph pattern. We then propose our indexing and storage schemes for RDF data in HBase.

A. RDF Data and Queries

Definition 3.1 (RDF dataset): An RDF dataset D is a set of RDF graphs {G1, G2, ..., Gn}, where each graph Gi ∈ D is a named graph that has a unique identifier Gi.id and 1 ≤ i ≤ n.

The RDF dataset definition requires each RDF graph to have a unique identifier, which is frequently the case in large collections of RDF graphs to allow easy distinction among graphs. Such an identifier is either a part of the graph description or can be assigned automatically without a loss of generality.

Definition 3.2 (RDF graph): An RDF graph G is a set of RDF triples {t1, t2, ..., tn}, where n = |G| and each triple
ti ∈ G is a tuple of the form (s, p, o), with s, p, and o denoting a subject, predicate, and object, respectively.

The RDF graph definition views an RDF graph as a set of triples whose subjects and objects correspond to labeled nodes and whose predicates correspond to labeled edges. Each node in an RDF graph is labeled with a unique identifier (i.e., a Uniform Resource Identifier, or URI) or a value (i.e., a literal). Edges are labeled with identifiers (i.e., URIs), and multiple edges with the same label are common. While the number of distinct labels for nodes, which correspond to subjects and objects, increases with the growth of an RDF graph or RDF dataset as new nodes are added, the number of labels for edges, which correspond to predicates, is usually bound by the number of properties defined in an annotation vocabulary (e.g., OPMV defines 13 properties and reuses a few properties from the RDF and RDFS namespaces; OPMO extends OPMV and supports around 50 properties). Therefore, for any large RDF dataset, it is safe to assume that the number of distinct predicates is substantially smaller (on the order of dozens or hundreds) than the number of distinct subjects or objects. We use P to denote the set of all predicates in an RDF dataset D; formally, P = P1 ∪ P2 ∪ ... ∪ Pn, where Pi = {p | (s, p, o) ∈ Gi}, Gi ∈ D, and D = {G1, G2, ..., Gn}.

In an RDF graph, the order of RDF triples is not semantically important. Since any concrete RDF graph serialization has to store triples in some order, it is convenient to view the set of triples in an RDF graph as an ordered set. We use the function num to denote the position of a triple in an RDF graph G, such that num(ti) returns position i, where ti ∈ G and G = {t1, t2, ..., tn}. Furthermore, the inverse function num⁻¹(i) returns the triple ti found at position i in graph G = {t1, t2, ..., tn}.

To query RDF datasets and individual RDF graphs, the standard RDF query language, called SPARQL, is used.
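As a concrete (if simplistic) illustration of this ordered view, storing the triples of one RDF graph in a Python list makes num and num⁻¹ trivial: num(t) is the 1-based list index of t, and num⁻¹(i) is the element at that index. The graph below is illustrative only.

```python
# Ordered view of an RDF graph: a list fixes a position for every triple.
G = [
    ("utpb:schema",   "rdf:type",  "opmv:Artifact"),   # position 1
    ("utpb:dataset",  "rdf:type",  "opmv:Artifact"),   # position 2
    ("utpb:loaddata", "opmv:used", "utpb:schema"),     # position 3
]

def num(G, t):
    return G.index(t) + 1      # position of triple t in graph G (1-based)

def num_inv(G, i):
    return G[i - 1]            # triple at position i
```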
SPARQL allows defining various graph patterns that can be matched over RDF graphs to retrieve results. While SPARQL distinguishes basic graph patterns, optional graph patterns, and alternative graph patterns, in this paper we restrict our presentation to basic graph patterns, as defined in the following.

Definition 3.3 (Basic graph pattern): A basic graph pattern bgp is a set of triple patterns {tp1, tp2, ..., tpn}, also denoted as tp1 AND tp2 AND ... AND tpn, where n ≥ 1, AND is a binary operator that corresponds to the conjunction in SPARQL, and each tpi ∈ bgp is a triple (sp, pp, op), such that sp, pp, and op are a subject pattern, predicate pattern, and object pattern, respectively.

A basic graph pattern consists of the simplest querying constructs, called triple patterns. A triple pattern can contain variables or URIs as subject, predicate, and object patterns (object patterns can also be represented by literals) that are to be matched over the respective components of individual triples. Unlike a URI or a literal in a triple pattern, which has to match itself in a triple, a variable can match anything. Multiple occurrences of the same variable in a triple pattern or a basic graph pattern must be bound to the same values.

B. Indexing Scheme

Matching a basic graph pattern over an RDF graph involves matching of the constituent triple patterns over a set of RDF triples. Each triple pattern yields an intermediate set of triples, and such intermediate results must be further joined together to find matching subgraphs. To speed up this computation, we define several bitmap indices.

Definition 3.4 (Index I_p): A bitmap index I_p for an RDF graph G ∈ D is a set of tuples {(p1, v1), (p2, v2), ..., (pn, vn)}, where pi ∈ P is a predicate in the set of all predicates in an RDF dataset D, n = |P|, and vi is a bit vector of size |G| that has 1 in the k-th position iff triple tk = num⁻¹(k), tk ∈ G, and tk.p = pi.
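A minimal construction of I_p can be sketched in Python with ints as bit vectors (bit k-1 is set iff the k-th triple has predicate p); the encoding and the sample graph are illustrative, not the on-disk encoding used by our implementation.

```python
# Build the selection index I_p: predicate -> bit vector over triple positions.
def build_Ip(G):
    Ip = {}
    for k, (s, p, o) in enumerate(G, start=1):
        Ip[p] = Ip.get(p, 0) | (1 << (k - 1))   # set bit for position k
    return Ip

G = [
    ("utpb:schema",   "rdf:type",  "opmv:Artifact"),   # position 1
    ("utpb:dataset",  "rdf:type",  "opmv:Artifact"),   # position 2
    ("utpb:loaddata", "opmv:used", "utpb:schema"),     # position 3
]
Ip = build_Ip(G)
# I_p(rdf:type) covers positions 1 and 2 -> binary 011
```

Indices I_s and I_o admit the same construction over the subject and object components.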
Index I_p helps quickly identify triples (their positions) in an RDF graph that have a particular predicate. I_p(p) denotes the bit vector for predicate p. The size of each bit vector is fixed and equals the number of triples in the graph (i.e., |G|). The number of vectors in the index equals the number of distinct predicates (i.e., |P|), which is relatively small (usually |P| < |G|). Similarly, indices I_s and I_o that quickly identify triples with a given subject and object can be defined. Intuitively, to find a triple with subject s, predicate p, and object o in an RDF graph, a logical AND of the corresponding vectors can be computed: I_s(s) ∧ I_p(p) ∧ I_o(o).

While the purpose of indices I_s, I_p and I_o is to speed up the matching of individual triple patterns, the indices that we define next can be used to join intermediate results obtained via triple pattern matching into subgraphs.

Definition 3.5 (Index I_ss): A bitmap index I_ss for an RDF graph G ∈ D is a set of tuples {(1, v1), (2, v2), ..., (n, vn)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and vi is a bit vector of size |G| that has 1 in the k-th position iff tk = num⁻¹(k), tk ∈ G, ti = num⁻¹(i), ti ∈ G, and tk.s = ti.s.

Definition 3.6 (Index I_oo): A bitmap index I_oo for an RDF graph G ∈ D is a set of tuples {(1, v1), (2, v2), ..., (n, vn)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and vi is a bit vector of size |G| that has 1 in the k-th position iff tk = num⁻¹(k), tk ∈ G, ti = num⁻¹(i), ti ∈ G, and tk.o = ti.o.

Definition 3.7 (Indices I_so and I_os): A bitmap index I_so for an RDF graph G ∈ D is a set of tuples {(1, v1), (2, v2), ..., (n, vn)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and vi is a bit vector of size |G| that has 1 in the k-th position iff tk = num⁻¹(k), tk ∈ G, ti = num⁻¹(i), ti ∈ G, and tk.o = ti.s. A bitmap index I_os is the transpose of I_so, such that I_os = I_so^T.
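Under the same illustrative int-as-bit-vector encoding, the join index I_ss of Definition 3.5 can be sketched as follows (the graph and names are made up for the example):

```python
# Build the join index I_ss: position i -> bit vector of all positions k whose
# triple shares the subject of the triple at position i (t_k.s = t_i.s).
def build_Iss(G):
    n = len(G)
    Iss = {}
    for i in range(1, n + 1):
        v = 0
        for k in range(1, n + 1):
            if G[k - 1][0] == G[i - 1][0]:   # compare subjects
                v |= 1 << (k - 1)
        Iss[i] = v
    return Iss

G = [("utpb:a", "p:used", "utpb:x"),   # position 1
     ("utpb:b", "p:used", "utpb:y"),   # position 2
     ("utpb:a", "p:wgb",  "utpb:z")]   # position 3, same subject as position 1
# I_ss(1) marks positions 1 and 3 -> binary 101
```

I_oo, I_so, and I_os follow the same shape with the object/subject comparisons swapped accordingly.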
Table T_D:

rowid  | data:graph | index:I_s | index:I_p | index:I_o | index:I_ss | index:I_oo | index:I_so
G1.id  | G1         | I_s1      | I_p1      | I_o1      | I_ss1      | I_oo1      | I_so1
G2.id  | G2         | I_s2      | I_p2      | I_o2      | I_ss2      | I_oo2      | I_so2
...    | ...        | ...       | ...       | ...       | ...        | ...        | ...
Gn.id  | Gn         | I_sn      | I_pn      | I_on      | I_ssn      | I_oon      | I_son

Figure 2. HBase Storage Scheme.

Indices I_ss, I_oo, I_so, and I_os are all of the same size of |G| × |G| bits for a given graph G. They can be used to quickly match triples from two sets based on the equality of their subjects, objects, subjects-objects, and objects-subjects, respectively. Intuitively, given a position i that corresponds to a triple ti ∈ G such that ti = num⁻¹(i), other triples (their positions) in G with the same subject as ti.s can be found in the bit vector I_ss(i). It should be noted that indices that allow matching triples based on predicate, subject-predicate, and object-predicate equalities can also be defined; however, their usability is limited, since graph patterns with variables shared by predicate patterns and other patterns are rarely used. We denote indices I_s, I_p and I_o as selection indices and indices I_ss, I_oo, I_so, and I_os as join indices. Note that index I_os can be obtained from index I_so and vice versa. Although we introduce both indices for theoretical completeness, only one is required in practice.

C. Storage Scheme

HBase stores data in tables that can be described as sparse multidimensional sorted maps and are structurally different from the relations found in conventional relational databases. An HBase table (hereafter table for short) stores data rows that are sorted based on the row keys. Each row has a unique row key and an arbitrary number of columns, such that columns in two distinct rows do not have to be the same.
A full column name (hereafter column for short) consists of a column family and a column qualifier (e.g., family:qualifier), where column families are usually specified at the time of table creation and their number does not change, while column qualifiers are dynamically added or deleted as needed. Rows in a table can be distributed over different machines in an HBase cluster and efficiently retrieved based on a given row key and, if available, columns.

To store provenance datasets composed of provenance graphs serialized as RDF graphs, we propose the single-table storage scheme shown in Fig. 2. Each row in the table stores: (1) an RDF graph identifier as a unique row id/key, (2) a complete RDF graph as one aggregate value in the data column family, and (3) precomputed bitmap indices for the respective RDF graph in the index column family. The decision to store each RDF graph as one value, rather than partition it into subgraphs or even individual triples, is motivated by the following observations. First, such storage avoids unnecessary data transfers that may occur if a graph is partitioned and distributed over different machines.
Second, as we show in detail in the next section, expensive query processing operations (i.e., joins) can be performed using compact bitmap indices, and an RDF graph only needs to be accessed once, to replace triple positions in query results with actual triples. Finally, unlike some applications that deal with very large graphs that cannot fit into the main memory of a single machine and therefore require partitioning, individual provenance graphs are generally small (yet their number can be very large) and can be stored as one aggregate value. We present query processing over this storage scheme next.

IV. RDF QUERY PROCESSING IN HBASE

To evaluate SPARQL queries in HBase, we design four efficient functions that deal with the application of selection indices, the application of join indices, the handling of special cases not supported by the indices, and basic graph pattern evaluation.

Function applySelectionIndices is outlined in Algorithm 1. It takes a graph identifier and a triple pattern and returns a bit vector of triple positions in the graph, where the value 1 signifies that the position-corresponding triple matches the URIs and/or literals found in the triple pattern. Selection indices for a particular graph identifier are applied using the conjunction of bit vectors.

Algorithm 1 Applying selection indices
1: function applySelectionIndices
2: input: graph identifier G.id, triple pattern tp = (sp, pp, op), table T_D
3: output: bit vector v that has 1 in the k-th position, i.e., v[k] = 1, if triple tk = num⁻¹(k), tk ∈ G, and tk matches the non-variable components of tp
4: Let I_s, I_p, and I_o be the respective indices in the row with rowid G.id of table T_D
5: Let v be a bit vector that has 1 in every position k, where 1 ≤ k ≤ |G|
6: if tp.sp is not a variable then
7:   v = v ∧ I_s(tp.sp)
8: end if
9: if tp.pp is not a variable then
10:   v = v ∧ I_p(tp.pp)
11: end if
12: if tp.op is not a variable then
13:   v = v ∧ I_o(tp.op)
14: end if
15: return v
16: end function
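Algorithm 1 can be transcribed almost line for line into Python, assuming the selection indices are dicts of int bit vectors as in the earlier sketches and that a variable is any term starting with "?". The actual implementation is in Java over HBase; this is illustrative only.

```python
def is_var(term):
    return term.startswith("?")

def apply_selection_indices(Is, Ip, Io, n, tp):
    """Bit vector with 1 at position k iff triple k matches the
    non-variable components of triple pattern tp = (sp, pp, op)."""
    v = (1 << n) - 1                      # all |G| positions set
    for term, index in zip(tp, (Is, Ip, Io)):
        if not is_var(term):
            v &= index.get(term, 0)       # conjunction with I_s / I_p / I_o
    return v

# Example over a three-triple graph, with indices built per component:
G = [
    ("utpb:schema",   "rdf:type",  "opmv:Artifact"),
    ("utpb:dataset",  "rdf:type",  "opmv:Artifact"),
    ("utpb:loaddata", "opmv:used", "utpb:schema"),
]
def build(G, c):
    idx = {}
    for k, t in enumerate(G, 1):
        idx[t[c]] = idx.get(t[c], 0) | 1 << (k - 1)
    return idx
Is, Ip, Io = build(G, 0), build(G, 1), build(G, 2)
# (?a, rdf:type, opmv:Artifact) matches positions 1 and 2 -> binary 011
```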
A resulting bit vector encodes the result of matching one triple pattern over a graph.

Function applyJoinIndices is outlined in Algorithm 2. This function, given a graph identifier, a triple pattern with one known solution represented by a triple position, and a to-be-joined triple pattern, quickly computes a bit vector that encodes the solutions (triple positions) that join with the known solution.
Algorithm 2 Applying join indices
1: function applyJoinIndices
2: input: graph identifier G.id, triple pattern with known solution tp, to-be-joined triple pattern tp′, triple position p, i.e., triple t_p = num⁻¹(p) matches tp, table T_D
3: output: bit vector v that has 1 in the k-th position, i.e., v[k] = 1, if triple tk = num⁻¹(k), tk ∈ G, and tk can join with t_p based on the equality of their subjects, objects, and/or subjects-objects
4: Let I_ss, I_oo, and I_so be the respective indices in the row with rowid G.id of table T_D
5: Let v be a bit vector that has 1 in every position k, where 1 ≤ k ≤ |G|
6: Let v[p] = 0 /* to avoid joining the triple at position p with itself */
7: if tp.sp and tp′.sp are variables and tp.sp = tp′.sp then
8:   v = v ∧ I_ss(p)
9: end if
10: if tp.op and tp′.op are variables and tp.op = tp′.op then
11:   v = v ∧ I_oo(p)
12: end if
13: if tp.sp and tp′.op are variables and tp.sp = tp′.op then
14:   v = v ∧ I_so(p)
15: end if
16: if tp.op and tp′.sp are variables and tp.op = tp′.sp then
17:   v = v ∧ I_so^T(p)
18: end if
19: return v
20: end function

Algorithm 3 Handling special cases
1: function handleSpecialCases
2: input: basic graph pattern bgp, set of solutions S
3: output: set of solutions F ⊆ S
4: Let F = S
5: /* Special case not supported by selection indices - rare in practice */
6: if any tp ∈ bgp contains any two variables with the same name then
7:   for each s ∈ F do
8:     Discard s, i.e., F = F \ {s}, if the triple t ∈ s that corresponds to tp has different bindings for such variables
9:   end for
10: end if
11: /* Special case not supported by join indices - rare in practice */
12: if any tp ∈ bgp has a variable at tp.pp that also occurs in some other tp′ ∈ bgp, tp′ ≠ tp then
13:   for each s ∈ F do
14:     Discard s, i.e., F = F \ {s}, if the triples t ∈ s and t′ ∈ s that correspond to tp and tp′ have different bindings for such variables
15:   end for
16: end if
17: return F
18: end function
A join condition is implicitly coded by the use of the same variable in the two triple patterns. It can be represented by the equality of subjects, objects, and/or subjects-objects in the two triple patterns. Join indices are also applied using the conjunction of the respective bit vectors.

Function handleSpecialCases is outlined in Algorithm 3. This function performs post-processing of the final results obtained via basic graph pattern matching. It takes a basic graph pattern and its set of solutions, where each solution is represented by a sequence of actual triples, and deals with the special cases not supported by the selection and join indices. In particular, selection indices have no means to verify that, if a triple pattern contains the same variable twice (or even three times), a matching triple must have identical values bound to the multiple occurrences of this variable. Join indices do not support join conditions based on the equality of a predicate and any other term in a triple. It is possible to add additional indices to handle selection and join operations on predicates; however, such indices would rarely be needed for real-life queries.

Finally, the main function matchBGP is outlined in Algorithm 4.

Algorithm 4 Matching a basic graph pattern over a graph
1: function matchBGP
2: input: graph identifier G.id, basic graph pattern bgp = {tp1, tp2, ..., tpn}, n ≥ 1, table T_D
3: output: set of subgraph solutions S = {g | g is a (sub)graph of G and g matches bgp}
4: Order the triple patterns in bgp, such that triple patterns that yield a smaller result and triple patterns that have a shared variable with preceding triple patterns are evaluated first
5: Let the ordered bgp = (tp1′, tp2′, ..., tpn′)
6: v_sel = applySelectionIndices(G.id, tp1′, T_D)
7: S = {(k) | v_sel[k] = 1} /* solutions for the first triple pattern */
8: if S = ∅ then return S end if
9: for each tpi′ in (tp2′, ..., tpn′) do
10:   v_sel = applySelectionIndices(G.id, tpi′, T_D)
11:   Let set S_join = ∅ /* solutions for the current join */
12:   Let set TP = {tpj′ | tpj′ ∈ (tp1′, tp2′, ..., tpi-1′), j < i, and tpi′ and tpj′ have variables with the same name as subject or object patterns}
13:   for each s in S do
14:     v_join = v_sel
15:     for each tpj′ in (tp1′, tp2′, ..., tpi-1′) do
16:       if tpj′ in TP then
17:         v_join = v_join ∧ applyJoinIndices(G.id, tpj′, tpi′, s[j], T_D) /* s[j] is the solution (triple position) for tpj′ found in sequence s at position j */
18:       end if
19:     end for
20:     S_tpi′ = {(k) | v_join[k] = 1} /* solutions for the current triple pattern */
21:     Compute the Cartesian product of {s} and S_tpi′, i.e., S_join = S_join ∪ ({s} × S_tpi′)
22:   end for
23:   S = S_join
24:   if S = ∅ then return S end if
25: end for
26: /* Replace triple positions in S with actual triples */
27: for each s in S do
28:   s′ = {num⁻¹(k) | k ∈ s}
29:   Replace s with s′ in S
30: end for
31: /* Handle special cases that are not supported by the selection and join indices */
    S = handleSpecialCases(bgp, S)
32: return S
33: end function
This function matches a SPARQL basic graph pattern bgp, consisting of a set of triple patterns tp1, tp2, ..., tpn, over an RDF graph with a known identifier that is stored in HBase. The final result is a set of subgraph solutions S. The algorithm starts by ordering the triple patterns in bgp (lines 4 and 5) using two criteria: (1) triple patterns that yield a smaller result should be evaluated first, to decrease the number of iterations, and (2) triple patterns that have
a shared variable with preceding triple patterns should be given preference over triple patterns with no shared variables, to avoid unnecessary Cartesian products. Next (lines 6-8), the algorithm applies the selection indices and obtains a set of solutions for the first triple pattern. Each solution in the set is represented by a sequence with one triple position. An empty set of solutions results in an empty result for the whole basic graph pattern. All subsequent triple patterns are further evaluated and joined with the already available results (lines 9-25). For any subsequent triple pattern, the selection indices are applied (line 10), an empty set of join solutions is prepared (line 11), and the preceding triple patterns that share variables with the current triple pattern are identified (line 12). For each solution that has been obtained for the preceding triple patterns (lines 13-22), the join indices are applied (lines 14-19), the bit vector resulting from both the selection and join index applications is converted to a set of solutions for the current triple pattern (line 20), and the join result is computed by combining the known solution with the newly computed ones (line 21). The set of known solutions is then updated (line 23) and verified to be non-empty (line 24). The process repeats for the next available triple pattern. Once all joins are processed, each triple position in the set of solutions is replaced with an actual triple from the graph using the num⁻¹ function (lines 26-30). The solutions are then post-processed by function handleSpecialCases to accommodate the cases that are not supported by the selection and join indices (line 31). Finally, the resulting set S is returned (line 32).
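To make the control flow of Algorithms 1, 2, and 4 concrete, the following self-contained Python sketch composes the bit-vector indices and the join loop for one in-memory graph. It covers shared variables in subject/object positions only; triple pattern reordering and the special cases of Algorithm 3 are omitted, and the encoding (ints as bit vectors, tuples as triples) is illustrative rather than the Java/HBase implementation.

```python
def is_var(t):
    return t.startswith("?")

def build_indices(G):
    # Selection indices I_s, I_p, I_o: term -> bit vector of positions.
    Is, Ip, Io = {}, {}, {}
    for k, (s, p, o) in enumerate(G, 1):
        Is[s] = Is.get(s, 0) | 1 << (k - 1)
        Ip[p] = Ip.get(p, 0) | 1 << (k - 1)
        Io[o] = Io.get(o, 0) | 1 << (k - 1)
    n = len(G)
    # Join indices: position i -> positions k with equal components.
    Iss = {i: Is[G[i - 1][0]] for i in range(1, n + 1)}         # t_k.s = t_i.s
    Ioo = {i: Io[G[i - 1][2]] for i in range(1, n + 1)}         # t_k.o = t_i.o
    Iso = {i: Io.get(G[i - 1][0], 0) for i in range(1, n + 1)}  # t_k.o = t_i.s
    Ios = {i: Is.get(G[i - 1][2], 0) for i in range(1, n + 1)}  # t_k.s = t_i.o
    return Is, Ip, Io, (Iss, Ioo, Iso, Ios)

def apply_selection(Is, Ip, Io, n, tp):
    v = (1 << n) - 1
    for term, idx in zip(tp, (Is, Ip, Io)):
        if not is_var(term):
            v &= idx.get(term, 0)
    return v

def join_mask(join_idx, tpk, tpn, p, n):
    # Positions joinable with the known solution at position p, or None
    # when the two patterns share no subject/object variable.
    Iss, Ioo, Iso, Ios = join_idx
    sk, _, ok = tpk
    sn, _, on = tpn
    v, used = ((1 << n) - 1) & ~(1 << (p - 1)), False  # exclude p itself
    if is_var(sk) and sk == sn: v, used = v & Iss[p], True
    if is_var(ok) and ok == on: v, used = v & Ioo[p], True
    if is_var(sk) and sk == on: v, used = v & Iso[p], True
    if is_var(ok) and ok == sn: v, used = v & Ios[p], True
    return v if used else None

def match_bgp(G, bgp):
    n = len(G)
    Is, Ip, Io, join_idx = build_indices(G)
    pos = lambda v: [k for k in range(1, n + 1) if v >> (k - 1) & 1]
    S = [(k,) for k in pos(apply_selection(Is, Ip, Io, n, bgp[0]))]
    for i in range(1, len(bgp)):
        v_sel = apply_selection(Is, Ip, Io, n, bgp[i])
        S_join = []
        for s in S:
            v = v_sel
            for j in range(i):                 # join with each predecessor
                m = join_mask(join_idx, bgp[j], bgp[i], s[j], n)
                if m is not None:
                    v &= m
            S_join += [s + (k,) for k in pos(v)]
        S = S_join
        if not S:
            return []
    return [tuple(G[k - 1] for k in s) for s in S]  # positions -> triples

G = [
    ("utpb:schema",   "rdf:type",            "opmv:Artifact"),    # 1
    ("utpb:instance", "rdf:type",            "opmv:Artifact"),    # 2
    ("utpb:loaddata", "rdf:type",            "opmv:Process"),     # 3
    ("utpb:loaddata", "opmv:used",           "utpb:schema"),      # 4
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loaddata"),    # 5
]
bgp = [("?p", "opmv:used", "?a"), ("?x", "opmv:wasGeneratedBy", "?p")]
# -> one solution joining triples 4 and 5 on the shared variable ?p
```

Note how the join never touches the triples themselves: only int bit operations over positions, with the graph consulted once at the end, mirroring the advantage discussed below.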
Some of the advantages of these algorithms include: (1) expensive selection and join computations are performed over indices rather than over a graph; (2) the computation heavily relies on numeric values that represent triple positions, rather than actual triples with lengthy literals and URIs; and (3) the computation can be fully completed on the same machine where the data resides, eliminating all intermediate data transfers over a network.

V. PERFORMANCE STUDY

This section reports our empirical evaluation of the proposed approach and algorithms.

A. Experimental Setup

Hardware. Our experiments used nine commodity machines with identical hardware. Each machine had a late-model 3.0 GHz 64-bit Pentium 4 processor, 2 GB of DDR2-533 RAM, and an 80 GB 7200 rpm Serial ATA hard drive. The machines were networked together via their add-on gigabit Ethernet adapters connected to a Dell PowerConnect 2724 gigabit Ethernet switch, and all ran 64-bit Debian Linux 6.0 and Oracle JDK 7.

Hadoop and HBase. Hadoop 1.0.0 and HBase 0.94 were used. Minor changes to the default configuration for stability included setting each block of data to replicate two times and increasing the HBase max heap size to 1.2 GB. Out of the nine identical machines in the cluster, one was designated as the HBase master and the other eight were HBase region servers (slaves).

Our implementation. Our algorithms were implemented in Java, and the experiments were conducted using Bash shell scripts to execute the Java class files and store the results in an automated and repeatable manner.

B. Datasets and Queries

The experiments used the University of Texas Provenance Benchmark (UTPB) [17]. UTPB includes provenance templates defined according to the three vocabularies of the Open Provenance Model (OPM)¹, provenance generation software capable of generating provenance for any number of workflow runs based on a provenance template, and provenance test queries in several categories.
We used UTPB to generate datasets of varying sizes using the Database Experiment template for a successful workflow execution, serialized based on the Open Provenance Model OWL Ontology (OPMO). Each generated RDF graph in these datasets represented the provenance of a single workflow execution and contained roughly 400 RDF triples. Table I shows the characteristics of each generated UTPB dataset. The table does not take into account the dictionary file (also an RDF graph) that was generated by UTPB for each dataset and contained all graph identifiers; the number of triples in this graph was the same as the number of RDF graphs in the dataset from Table I (e.g., 10,000 triples for D1). We used 11 UTPB test queries in the first four categories (Graphs, Dependencies, Artifacts, and Processes) to benchmark the performance of our implementation. The exact queries expressed in SPARQL can be found on the UTPB website².

When a provenance dataset was stored in our HBase cluster according to the proposed schema, HBase automatically partitioned the table into regions (subsets of rows). Available region servers were assigned to handle certain regions. In other words, the provenance dataset was partitioned into subsets of provenance graphs that were stored on individual machines in the cluster. Every provenance graph (along with its indices) was stored as a whole on one of the machines, with no partitioning. Therefore, any query over an individual provenance graph was processed by the machine that stored the graph, avoiding any expensive data transfers of intermediate results among region servers. The final result of a query was transferred to a client application running on the HBase master.

¹ Open Provenance Model.
² University of Texas Provenance Benchmark, chebotkoa/utpb/
Table I. DATASET CHARACTERISTICS

Dataset | # of RDF graphs (# of workflow runs) | # of RDF triples | Size
D1      | 10,000                               | 4,000,000        | 2.1 GB
D2      | 20,000                               | 8,000,000        | 4.2 GB
D3      | 30,000                               | 12,000,000       | 6.3 GB
D4      | 40,000                               | 16,000,000       | 8.4 GB
D5      | 50,000                               | 20,000,000       | 10.5 GB

[Figure 3. Query Performance and Scalability. X-axis: dataset; Y-axis: query execution time (ms); one curve per query Q1-Q11.]

C. Query Evaluation Performance and Scalability

The query performance and scalability of our approach are reported in Fig. 3. Queries Q1 and Q2 were in the Graphs category and had basic graph patterns with one triple pattern. Both Q1 and Q2 yielded larger results compared to the other queries: Q1 returned all graph identifiers in a dataset (e.g., 10,000 triples for D1 and 50,000 triples for D5), and Q2 returned all triples in a particular provenance graph (around 400 triples for each dataset). Even though these queries were the simplest among the 11 UTPB test queries, they proved to be more expensive due to their larger result sets, which is especially evident for query Q1, the only query whose performance was on the order of seconds. In the case of both Q1 and Q2, which involved no joins, the major factor in query performance is the transfer time of the final query results to a client machine, and it is hardly possible to achieve better performance on the given hardware. By contrast, all other queries performed on the order of tens of milliseconds, required joins, and returned subsets of triples in a particular provenance graph (< 400 triples). Queries Q3-Q7 were in the Dependencies category and dealt with various dependencies among artifacts and processes in provenance graphs. They all had similar complexities: basic graph patterns with three triple patterns each. They also returned comparable result sets in terms of the number of triples (except query Q4, which returned an empty result set for the selected UTPB provenance graph template).
As a result, these queries showed very similar query evaluation performance, with Q4 being the fastest, since it only required the evaluation of its first triple pattern to compute the final (empty) result. Queries Q8 and Q9 were in the Artifacts category and dealt with data artifacts in provenance graphs. Q8 contained six triple patterns and two optional clauses. Q9 had 18 triple patterns, two optional clauses, two filter constructs, and one union construct. Q9 was the most complex query of all, yet it proved to be efficient and scalable with our approach. The last two queries, Q10 and Q11, were in the Processes category and dealt with processes in provenance graphs. Q10 had two triple patterns and one optional clause. While Q11 is a complex query with 11 triple patterns and one union clause, it yielded an empty query result in our experiments due to the selected provenance template. In summary, the proposed approach and its implementation proved to be efficient and scalable. Q1 showed linear scalability and took the most time to execute due to its relatively large result set. The other queries showed nearly constant scalability (technically, linear with a small slope). This can be explained by the fact that each query (except Q1) dealt with a single provenance graph of fixed size, with minimal data transfers and fast index-based join processing.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we studied the problem of storing and querying large collections of scientific workflow provenance graphs serialized as RDF graphs in Apache HBase. We designed novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets. Our storage scheme takes advantage of the fact that individual provenance graphs generally fit into memory of
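The fast index-based join processing mentioned above rests on operating over numeric triple positions instead of the triples themselves. The sketch below is an illustration of this general idea, not the paper's actual data structures: each triple in a graph gets a position, a per-term index maps (role, term) pairs to position sets, and a two-pattern join reduces to intersecting small sets of integers; full triples with long URIs are only materialized for the final result. The tiny graph mirrors the OPMV example from the introduction.

```python
from collections import defaultdict

# A tiny provenance graph mirroring the paper's OPMV example.
triples = [
    ("utpb:LoadData", "opmv:used", "utpb:Schema"),
    ("utpb:LoadData", "opmv:used", "utpb:Dataset"),
    ("utpb:Instance", "opmv:wasGeneratedBy", "utpb:LoadData"),
]

# Position index: (role, term) -> set of triple positions where term occurs.
index = defaultdict(set)
for pos, (s, p, o) in enumerate(triples):
    index[("s", s)].add(pos)
    index[("p", p)].add(pos)
    index[("o", o)].add(pos)

# Pattern: ?x opmv:wasGeneratedBy ?proc . ?proc opmv:used ?y
results = []
for g in index[("p", "opmv:wasGeneratedBy")]:
    proc = triples[g][2]  # binding for ?proc
    # Join on ?proc via set intersection: positions where proc is the
    # subject of an opmv:used triple.
    for u in index[("s", proc)] & index[("p", "opmv:used")]:
        results.append((triples[g][0], proc, triples[u][2]))

print(sorted(results))
# [('utpb:Instance', 'utpb:LoadData', 'utpb:Dataset'),
#  ('utpb:Instance', 'utpb:LoadData', 'utpb:Schema')]
```

Since each provenance graph (with its indices) resides on one machine, such intersections run locally, which matches the tens-of-milliseconds latencies observed for the join queries.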
a single machine and require no partitioning. Our bitmap indices are stored together with graphs and support both selection and join operations for efficient query processing. We also proposed efficient query evaluation algorithms for SPARQL queries in HBase. Our algorithms rely on indices to compute expensive join operations, make use of numeric values that represent triple positions rather than actual triples with lengthy literals and URIs, and eliminate the need for intermediate data transfers over a network. Finally, we conducted an empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark. Our experiments confirmed that the proposed storage, indexing, and querying techniques are efficient and scalable for large provenance datasets. In the future, we plan to compare our approach with other SQL and NoSQL solutions in the context of distributed scientific workflow provenance management, as well as experiment with a multi-user workload to measure the query throughput of our system.

REFERENCES

[1] Y. Simmhan, B. Plale, and D. Gannon, A survey of data provenance in e-science, SIGMOD Record, vol. 34, no. 3, 2005.
[2] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher, T. M. McPhillips, S. Bowers, M. K. Anand, and J. Freire, Provenance in scientific workflow systems, IEEE Data Engineering Bulletin, vol. 30, no. 4, pp. 44-50, 2007.
[3] S. B. Davidson and J. Freire, Provenance and scientific workflows: challenges and opportunities, in Proc. of SIGMOD Conference, 2008.
[4] V. Cuevas-Vicenttín, S. C. Dey, S. Köhler, S. Riddle, and B. Ludäscher, Scientific workflows and provenance: Introduction and research opportunities, Datenbank-Spektrum, vol. 12, no. 3, 2012.
[5] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. T. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. G. Stephan, and J. V. den Bussche, The Open Provenance Model core specification (v1.1), Future Gen. Comp. Syst., vol. 27, no. 6, 2011.
[6] A. Chebotko, S. Lu, X. Fei, and F. Fotouhi, RDFProv: A relational RDF store for querying and managing scientific workflow provenance, Data Knowl. Eng., vol. 69, no. 8, 2010.
[7] J. Zhao, C. A. Goble, R. Stevens, and D. Turi, Mining Taverna's semantic web of provenance, Concurr. Comput.: Pract. Exper., vol. 20, no. 5, 2008.
[8] Third Provenance Challenge, Challenge/ThirdProvenanceChallenge.
[9] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua, A reference architecture for scientific workflow management systems and the VIEW SOA solution, IEEE Transactions on Services Computing, vol. 2, no. 1, 2009.
[10] T. M. Oinn, et al., Taverna: lessons in creating a workflow environment for the life sciences, Concurr. Comput.: Pract. Exper., vol. 18, no. 10, 2006.
[11] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. B. Jones, E. A. Lee, J. Tao, and Y. Zhao, Scientific workflow management and the Kepler system, Concurr. Comput.: Pract. Exper., vol. 18, no. 10, 2006.
[12] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and H. T. Vo, Managing the evolution of dataflows with VisTrails, in Proc. of ICDE Workshops, 2006, p. 71.
[13] J. Kim, E. Deelman, Y. Gil, G. Mehta, and V. Ratnakar, Provenance trails in the Wings/Pegasus system, Concurr. Comput.: Pract. Exper., vol. 20, no. 5, 2008.
[14] Y. Zhao, et al., Swift: Fast, reliable, loosely coupled parallel computation, in Proc. of SWF, 2007.
[15] Apache HBase,
[16] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, Bigtable: A distributed storage system for structured data, ACM Transactions on Computer Systems, vol. 26, no. 2, 2008.
[17] A. Chebotko, E. D. Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly, UTPB: A benchmark for scientific workflow provenance storage and querying systems, in Proc. of SWF, 2012.
[18] M. F. Husain, L. Khan, M. Kantarcioglu, and B. M. Thuraisingham, Data intensive query processing for large RDF graphs using cloud computing tools, in Proc. of CLOUD, 2010.
[19] J. Myung, J. Yeon, and S. Lee, SPARQL basic graph pattern processing with iterative MapReduce, in Proc. of MDAC, 2010, pp. 6:1-6:6.
[20] P. Ravindra, V. V. Deshpande, and K. Anyanwu, Towards scalable RDF graph analytics on MapReduce, in Proc. of MDAC, 2010, pp. 5:1-5:6.
[21] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen, Scalable distributed reasoning using MapReduce, in Proc. of ISWC, 2009.
[22] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen, PigSPARQL: mapping SPARQL to Pig Latin, in Proc. of SWIM, 2011, p. 4.
[23] J. Huang, D. J. Abadi, and K. Ren, Scalable SPARQL querying of large RDF graphs, PVLDB, vol. 4, no. 11, 2011.
[24] C. Franke, S. Morin, A. Chebotko, J. Abraham, and P. Brazier, Distributed semantic web data management in HBase and MySQL Cluster, in Proc. of CLOUD, 2011.
[25] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler, Matrix "Bit" loaded: a scalable lightweight join query processor for RDF data, in Proc. of WWW, 2010.
More informationMAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti
International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department
More informationSMCCSE: PaaS Platform for processing large amounts of social media
KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and
More informationAdvances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis
Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis 1 NoSQL So-called NoSQL systems offer reduced functionalities compared to traditional Relational DBMS, with the aim of achieving
More information