Trading Consistency for Scalability in Scientific Metadata


Scott Jensen and Beth Plale
School of Informatics and Computing, Indiana University, Bloomington, Indiana
(This research was funded in part by NSF CNS.)

Abstract: Long-term repositories that are able to represent the detailed descriptive metadata of scientific data have been recognized as key to both data reuse and preservation of the initial investment in generating the data. Detailed metadata captured during scientific investigation not only enables the efficient discovery of relevant data sets but is also a source for exploring ongoing activity. In the XMC Cat metadata catalog, an XML catalog that uses a novel hybrid model to store XML in a relational database, we exploit differences in the temporal utility of browse and search metadata to selectively relax the consistency model used. We determine through experimental analysis that by ensuring only eventual consistency on parts of the solution, the performance and scalability of the catalog can be substantially improved.

Keywords: Scientific Metadata, Eventual Consistency, XML, e-Science

I. INTRODUCTION

In e-Science, metadata repositories have been recognized as crucial to preserving an organization's investment in data by enabling metadata collection, long-term preservation, and reuse of scientific data. This includes both observational data and data sets derived through scientific investigation and analysis. Science communities facilitate communication within their community through the development of domain-specific XML schemas, often as an extension or profile of metadata standards such as the FGDC [12] and ISO [17] standards for geospatial data. Researchers in computer science and library science have categorized the metadata captured by these detailed schemata [16,25,23,31,33]. Common to such categorizations is a breakdown between discovery and administrative or structural metadata, where discovery metadata is used by researchers to locate datasets of interest, and administrative or structural metadata is needed to actually use the dataset once it has been discovered [16].

With the XMC Cat metadata catalog, developed in the context of the Linked Environments for Atmospheric Discovery (LEAD) project [9], our focus is on automating the capture of detailed discovery metadata during scientific investigation utilizing a workflow-oriented cyberinfrastructure. The LEAD science gateway currently has 126 users, with 5 TB of atmospheric data, workflows, and forecast results. This collection is described by over 2 million metadata concepts containing more than 8 million metadata elements. Our focus in automating the capture of metadata is two-fold. As noted by Gray et al., metadata is ephemeral and must be captured as data is generated since it cannot be recreated later [15], so capture during workflow execution enables long-term preservation by ensuring that ephemeral metadata is not lost. However, it also addresses a recognized issue in archiving scientific data: the benefits of metadata annotation and archiving inure to the organization and not to the scientist generating the data [6,36]. Metadata captured during workflow execution provides value to scientists as a tool to monitor and review results prior to publication.

In a science discovery cyberinfrastructure such as LEAD, four major activities determine the workload on the metadata catalog (Fig. 1): browse, search, cataloging of new data objects, and automated harvesting of metadata.
Through a portal interface a researcher can browse detailed metadata in a time-organized hierarchy of their experiments, including inputs, intermediate results, and workflow output, enabling a researcher to monitor ongoing, long-running experiments. In this context, only the object's ID is needed to retrieve the metadata that describes it. A search interface allows scientists to search both private and public data based on the detailed metadata captured for each experiment, including all configuration settings and critical workflow notifications. As experiments are executed, core metadata is captured about inputs, the experiment itself, intermediate data products, and the output generated.

Figure 1. Four operations dominate a metadata catalog's workload.

The dotted lines in Fig. 1 between harvest and browse illustrate a relationship that is not applicable to long-term archives: concurrent with ingest, metadata is used by scientists to monitor workflow progress. If a workflow has bursts of data generation activity, this translates into a workload for the metadata catalog that is bursty in nature. Consider, for instance, the metadata catalog operations of a 15-hour weather forecast workflow: the workflow takes 1 hour to run from initializing boundary conditions through to generating 2D visualizations that are viewable on a cell phone [29]. Although the entire workflow takes 1 hour to execute, the metadata ingest operations are highly non-uniform throughout the hour.

When the workflow is first launched, metadata for the experiment and its collections is inserted within the first minute of the workflow. This insert includes core metadata, the workflow, and initial configuration parameters. Once the job has moved off the wait queue of the supercomputer, all forecast model initialization data are added in a burst. After this initial spin-up, the early stages of model initialization and data assimilation execute relatively quickly, resulting in a handful of early workflow stages completing within 30 minutes and cataloging their metadata, followed by a 20-minute lull (other than periodic workflow notifications) as the main forecast model (WRF [24]) completes.

Were one to catalog metadata in a long-term archive, the bursty behavior of workflows would not be relevant, since the workflow run would have finished executing long before the results of the experiment are published. Even if submitted in bursts postmortem, the inserts could be queued to balance the load. The bursty nature of the metadata workload created by scientific workflows could be smoothed by adding hardware, such as through replication or clustering, but aside from the hardware cost, this also imposes a cost on browse queries, which depend on concurrently ingested metadata. We show that the scalability of cataloging scientific metadata in response to an increased workload can be addressed by loosening the consistency required between browse and search metadata, exploiting temporal differences in when metadata is needed for these two operations.

The remainder of this paper is organized as follows: Section 2 discusses eventual consistency. Section 3 discusses the application of eventual consistency to browsing and searching scientific metadata. Section 4 gives details of an experimental evaluation of eventual consistency as applied to scientific metadata. Section 5 discusses related work, and Section 6 concludes with future directions.

II. EVENTUAL CONSISTENCY IN DATA ACCESS

In relational databases, changes to the data are commonly performed through transactions that ensure the properties of Atomicity, Consistency, Isolation, and Durability (ACID) [14]; that is, the set of changes comprising a transaction is either performed in its entirety or not at all (atomicity), the set of changes does not interact with other transactions and is thus serializable (isolation), the changes are permanent once the transaction completes (durability), and the constraints imposed on relationships are maintained (consistency). With the emergence of the Internet, cloud-based web services, and large-scale data stores, the need for scalability and availability challenges the feasibility of the ACID properties. One of the earliest discussions of the trade-offs between ACID and availability is by Eric Brewer, co-founder of Inktomi, who in a 2000 PODC keynote introduced Basically Available, Soft-state, Eventual consistency (BASE) [8]. Proposed as an alternative to ACID, BASE acknowledges the tension between consistency and availability. Brewer states that there are three properties of distributed systems, (1) consistency, (2) availability, and (3) tolerance of network partitions, and that only two of these are possible at one time, a property he refers to as the CAP theorem. As noted by Vogels [35], partitions are a given for large distributed systems, so under the CAP theorem, consistency must be traded off to gain availability.
However, Brewer views ACID and BASE not as two absolutes but as forming a spectrum, with ACID providing strong consistency and BASE providing weak consistency in which approximate answers and stale data are accepted. In applying CAP to the management of scientific metadata, scalability in the face of bursty workloads can similarly be addressed by relaxing consistency: providing fresh results where fresh results are needed, and allowing some operations to see older data where time-critical needs are not as great. In our research, this trade-off comes in preserving freshness for browse operations while allowing the search metadata to gain consistency over a longer period of time, that is, eventually.

III. E-SCIENCE DISCOVERY METADATA

A. Browse and Search Metadata as Functional Groups

Discovery metadata is captured for two distinct uses, browsing and searching, and the temporal utility of these two types of metadata differs. Browsing is most useful when the data of interest is known in advance, such as during workflow execution or when reviewing a set of search results. In contrast, search capabilities address the longer-term preservation of a community's investment in data by allowing a scientist to find datasets that address her needs based on querying for measured variables, configuration settings, data quality metrics, or other detailed discovery metadata. In contrast to browse metadata, which can be located based solely on the unique ID of a data object, search metadata requires a greater level of detail to be of value as data ages, corresponding, as Michener notes, to our lack of ability to hold such details in memory [25].

In the XMC Cat metadata catalog, the metadata used to reconstruct XML in response to browse queries is managed separately from the detailed shredded metadata used to search for data products. This hybrid XML-relational approach [19] is used to maximize the performance of browse queries that are executed as users interactively click on data products in the web portal. It is this bifurcation of the browse and search representations of the metadata that is exploited to relax the consistency between these two subclasses of discovery metadata. As Pritchett notes in [30], consistency is always in a state of flux when it is relaxed, and since temporal inconsistency cannot be hidden from users, one must look for opportunities where consistency can be relaxed, preferably between different functions in a system. Browsing and searching can be viewed as two distinct functions. We exploit this distinction to improve the scalability of the metadata catalog and its adaptability under bursty workloads.

B. Introduction to XMC Cat

XMC Cat is a Java-based web service metadata catalog that uses a lightweight software stack based on Apache Tomcat and the Axis2 web services engine [3]. It adapts to community metadata schemata by partitioning each schema based on the metadata concepts it contains [20]. When ingesting metadata, character large objects (CLOBs) containing the XML for each concept are stored in the catalog and later used to rebuild the full metadata for an object in response to queries.
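To make the bifurcation concrete, the sketch below contrasts the two access paths over a hybrid XML-relational store: a browse lookup that needs only the object's ID and reads the concept CLOBs, and a search that queries the shredded elements. The table and column names (CONCEPT_CLOB, SHREDDED_ELEMENT, and their columns) are illustrative assumptions and not XMC Cat's actual schema.

```java
import java.sql.*;
import java.util.*;

// Sketch of the two access paths over a hybrid XML-relational store.
// Table and column names are illustrative assumptions only.
public class HybridMetadataStore {
    private final Connection conn;

    public HybridMetadataStore(Connection conn) { this.conn = conn; }

    /** Browse: rebuild an object's XML from its concept CLOBs, keyed only by object ID. */
    public String browse(String objectId) throws SQLException {
        String sql = "SELECT concept_xml FROM CONCEPT_CLOB " +
                     "WHERE object_id = ? ORDER BY global_order";
        StringBuilder xml = new StringBuilder("<metadata>");
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, objectId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) xml.append(rs.getString(1));
            }
        }
        return xml.append("</metadata>").toString();
    }

    /** Search: find object IDs whose shredded elements match a name/value predicate. */
    public List<String> search(String elementName, String value) throws SQLException {
        String sql = "SELECT DISTINCT object_id FROM SHREDDED_ELEMENT " +
                     "WHERE element_name = ? AND element_value = ?";
        List<String> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, elementName);
            ps.setString(2, value);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getString(1));
            }
        }
        return ids;
    }
}
```

Keeping the two paths on separate tables is what later allows their consistency to be relaxed independently.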

Browse queries require only these CLOBs to build a response. In addition, each concept is also parsed using a process commonly referred to as shredding [32]. The shredded elements and the concept structure containing them provide the detailed discovery metadata needed for search queries. If the CLOBs and the shredded metadata for a concept are stored using a single transaction, the recall and precision of the search results depend only on the accuracy and completeness of the metadata cataloged: any object that matches the search criteria will be found (recall), and any object listed in the results will match the search criteria (precision). Ensuring consistency between browse and search metadata requires the CLOBs and the shredded metadata to be inserted as an atomic operation.

The shredding process transforms metadata based on a domain schema into a domain-neutral concept and element structure using the following transformations:

Metadata Document -> Globally Unique ID, Concept+
Concept -> Global order (int), CLOB, Shredded Concept
Shredded Concept -> Concept name (string), Concept source (string), Shredded Concept*, Shredded Element*
Shredded Element -> Element name (string), Element source (string), Element value (string)

During ingest, each concept contained in the metadata is tagged with its global order based on the partitioning of the schema. The CLOBs are indexed based on the object ID and the global order of the concept they represent. This covering index is used to determine the concepts that describe an object, allowing metadata to be quickly reconstructed in response to queries even if it was added incrementally through additional metadata harvesting.
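As a reading aid, the transformations above can be mirrored by plain data classes. This is a minimal sketch using simple Java value objects; the class and field names are illustrative and are not taken from the XMC Cat source.

```java
import java.util.ArrayList;
import java.util.List;

// Data classes mirroring the domain-neutral structure produced by shredding.
// Field names follow the transformations above; the classes are illustrative.
class MetadataDocument {
    String globallyUniqueId;                          // ID of the described data object
    List<Concept> concepts = new ArrayList<>();       // one or more concepts
}

class Concept {
    int globalOrder;                                  // position within the partitioned schema
    String clob;                                      // XML fragment kept for browse queries
    ShreddedConcept shredded;                         // search representation (built eagerly under
                                                      // strict consistency, later under eventual)
}

class ShreddedConcept {
    String conceptName;
    String conceptSource;
    List<ShreddedConcept> subConcepts = new ArrayList<>();  // nested sub-concepts
    List<ShreddedElement> elements = new ArrayList<>();     // leaf elements
}

class ShreddedElement {
    String elementName;
    String elementSource;
    String elementValue;
}
```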
C. Eventual Consistency of Search Metadata

An earlier performance evaluation of XMC Cat's hybrid XML-relational approach compared it to in-lining (an alternative often used for storing XML in a relational database) and found shredding and storing the leaf elements to be a significant cost [18]. Storing concepts as both CLOBs and shredded elements greatly improves query response time, so shredding cannot simply be done away with. However, since the CLOBs and shredded elements are used for different functions (browsing versus searching), both insert performance and scalability can be improved by relaxing the consistency between browse and search metadata and ensuring only eventual consistency instead.

Eventual consistency requires changes to both the metadata catalog's architecture and the ingest process, as shown in Fig. 2. It is implemented architecturally by the introduction of concept shredders deployed on distributed servers, which asynchronously shred concepts into searchable metadata.

Figure 2. Distributed shredding of metadata concepts. The shredding of search metadata is queued and performed asynchronously.

Under this new approach, only a subset of the first two transformations is performed during the ingest process: parsing out the concept CLOBs (Step 1a in Fig. 2). Additionally, the internal IDs of the CLOBs are inserted into a FIFO queue for eventual processing (Step 1b). These steps are captured by the following reduced set of transformations:

Metadata Document -> Globally Unique ID, Concept+
Concept -> Global order (int), CLOB

Under this approach, the shredders are distributed to different servers and are independent: no shredder is aware of other shredders, and the XMC Cat catalog does not know how many shredders are operational at any point in time. Each distributed shredder has a controller thread that fetches batches of concepts (the CLOB, its ID, and its version) from the catalog's queue (Step 2) and stores them in its own local queue (Step 3). The frequency of retrieval adjusts automatically based on the number of concepts remaining in the local queue at the start of each fetch cycle. Although fetched by a shredder for processing, concepts must remain in the catalog's queue until the CLOB is actually shredded and inserted back into the catalog, to prevent the loss of queryable metadata should a shredder fail. To minimize the retrieval of concepts by multiple shredders, an approach similar to that used in [7] is employed: each concept in the catalog's queue has a timestamp that is set on retrieval to a configurable time in the future, before which it cannot again be retrieved. Although the queue is a FIFO queue, concepts are not retrieved by other shredders until their timestamp has expired. A set of worker threads at each shredder dequeues concepts from the local queue in a synchronized manner (Step 4), performs the remaining transformations to shred the CLOBs into a generalized concept and element representation (Step 5), and inserts the shredded metadata into the catalog (Step 6a).
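The fetch cycle (Steps 2 and 3) and the visibility-timestamp idea can be sketched as follows. This is a hypothetical illustration that assumes a CatalogQueueClient interface for the catalog's queue and a QueuedConcept holder; the adaptive-interval policy shown is one plausible reading of the behavior described above, not the actual XMC Cat implementation.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Assumed remote interface to the catalog's concept queue (illustrative only).
interface CatalogQueueClient {
    // Tags up to batchSize queued concepts with a future visibility timestamp
    // and returns them; other shredders skip them until the timeout expires.
    List<QueuedConcept> tagAndFetch(int batchSize, long visibilityTimeoutMs);
}

class QueuedConcept {
    String objectId; long clobId; int version; String clobXml;
}

// Controller thread of one distributed shredder (Steps 2-3).
public class ShredderController implements Runnable {
    private static final long MIN_INTERVAL_MS = 1_000;      // lower bound on C_t
    private static final long MAX_INTERVAL_MS = 120_000;    // upper bound on C_t
    private static final long VISIBILITY_TIMEOUT_MS = 300_000;

    private final CatalogQueueClient catalog;
    private final BlockingQueue<QueuedConcept> local = new LinkedBlockingQueue<>();
    private final int batchSize = 100;
    private long fetchIntervalMs = MIN_INTERVAL_MS;

    public ShredderController(CatalogQueueClient catalog) { this.catalog = catalog; }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Adapt the interval to the backlog at the start of the cycle:
                // fetch sooner when the workers have drained the local queue,
                // back off when unprocessed concepts remain.
                fetchIntervalMs = local.isEmpty()
                        ? Math.max(MIN_INTERVAL_MS, fetchIntervalMs / 2)
                        : Math.min(MAX_INTERVAL_MS, fetchIntervalMs * 2);

                // Step 2: tag and fetch a batch; Step 3: load the local queue.
                List<QueuedConcept> batch = catalog.tagAndFetch(batchSize, VISIBILITY_TIMEOUT_MS);
                local.addAll(batch);

                TimeUnit.MILLISECONDS.sleep(fetchIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Worker threads (Step 4) take concepts from here, shred them, and insert them. */
    public QueuedConcept take() throws InterruptedException { return local.take(); }
}
```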

After the shredded metadata is inserted, the entry in the catalog's concept queue can be removed (Step 6b).

Idempotent Concept Inserts: Inserting the shredded metadata into the catalog in Step 6a must be idempotent, since multiple inserts could otherwise occur in the following two situations:

1. If an existing concept is updated, the prior version and the update could both be in the queue to be shredded. If the controller at each distributed shredder retrieves a large batch of concepts to reduce frequent fetching, the update could be processed by one shredder (because it is near the front of its local queue) before the prior value has been shredded. Each CLOB has a version assigned, and if the version for the update does not match the current CLOB, the update is abandoned.

2. If a shredder is slow, it may not complete processing a concept before the future timestamp set during the fetch cycle in Step 2 expires. This could result in two distributed shredders fetching the same concept. In this case, a unique key in the shredded concept table prevents the insert of an identical concept for the same object and CLOB ID, so the insert is abandoned.
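A minimal sketch of such an idempotent insert is shown below, assuming JDBC access and illustrative table names (CONCEPT_CLOB, SHREDDED_CONCEPT); the version check handles the first situation and the unique-key violation handles the second.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Sketch of an idempotent shredded-concept insert (Step 6a). Table and column
// names are assumptions made for illustration.
public class IdempotentInserter {
    private final Connection conn;

    public IdempotentInserter(Connection conn) { this.conn = conn; }

    /** Returns true if the shredded concept was inserted, false if it was abandoned. */
    public boolean insertShreddedConcept(String objectId, long clobId, int version,
                                         String conceptName, String conceptSource)
            throws SQLException {
        // Situation 1: a newer version of the CLOB exists, so this shred is stale.
        String check = "SELECT version FROM CONCEPT_CLOB WHERE clob_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(check)) {
            ps.setLong(1, clobId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next() || rs.getInt(1) != version) return false;   // abandon
            }
        }
        // Situation 2: another shredder already processed this concept; the unique
        // key on (object_id, clob_id) rejects the duplicate and it is abandoned.
        String insert = "INSERT INTO SHREDDED_CONCEPT " +
                        "(object_id, clob_id, concept_name, concept_source) VALUES (?,?,?,?)";
        try (PreparedStatement ps = conn.prepareStatement(insert)) {
            ps.setString(1, objectId);
            ps.setLong(2, clobId);
            ps.setString(3, conceptName);
            ps.setString(4, conceptSource);
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            return false;   // a second shredder got there first; abandon quietly
        }
    }
}
```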
D. Bounds on Eventual Consistency for Search Metadata

The argument for using eventual consistency in managing scientific metadata is improved insert performance and scalability. However, if the time lapse until consistency is achieved exceeds the temporal difference in the use of browse and search metadata, search results will suffer. The lower and upper bounds on achieving consistency are defined as EC_t and determined as follows:

EC_t = W_t + T_t + R_t + S_t + I_t

where W_t, T_t, and R_t represent the queue and fetch time from when a concept is first queued at the catalog in Step 1b until it has been loaded into a shredder's local queue in Step 3. W_t is the time a concept waits in the catalog's queue after being queued in Step 1b until it is retrieved by a shredder in Step 2. The fetching of a concept and loading it into a shredder's local queue (Steps 2 and 3) can be decomposed into two operations: the time to identify or tag the concepts to be retrieved (T_t), and the time to retrieve and load the tagged concepts (R_t). The total time for a distributed shredder to process a concept (Steps 4-6) is the sum of S_t and I_t, where S_t is the time required to shred the concept (Steps 4 and 5) and I_t is the cost of inserting the shredded metadata into the catalog (Step 6).

W_t depends on two factors: the time the distributed shredder's controller waits to fetch (C_t), and whether the insert rate into the concept queue in Step 1b exceeds the maximum service rate of the distributed shredders. When the shredders are processing on par with the insert rate, W_t = C_t (which is configurable for each distributed shredder). The default lower and upper bounds on C_t are 1 second and 120 seconds respectively. The lower bound for EC_t is dominated by C_t; for example, if a concept is queued midway through the minimum wait time of 1,000 ms, the wait would be 500 ms.

We determine by means of experimental investigation the average component costs to fetch and process concepts. The average time to fetch a batch of 100 concepts and load the shredder's local queue is 139 ms, which breaks down almost evenly between 64.42 ms on average to tag the concepts (T_t) and 74.58 ms to retrieve and load the concepts (R_t). The time to process the concepts averages 17.22 ms, with the time to shred averaging 3.48 ms and the time to insert the shredded metadata averaging 13.74 ms. As with the lower bound, the average time delay for eventual consistency is also dominated by the wait time in the queue (W_t), since the average time for fetching, processing, and inserting a concept is on the order of milliseconds. Since the insertion of each CLOB's ID into the concept queue in Step 1b is performed in a stored procedure that is executed atomically with the insert of the CLOBs in Step 1a, no delay in consistency occurs to initially queue the concept.
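As a rough illustration of the formula, the measured component costs can be combined into a per-concept estimate. This is our own back-of-the-envelope arithmetic under the assumption that the per-batch costs T_t and R_t are amortized over the 100 concepts in a batch; it is not a figure reported in the evaluation.

```latex
% Average-case estimate, amortizing the 139 ms batch fetch over 100 concepts:
EC_t \approx W_t + \frac{T_t + R_t}{100} + S_t + I_t
     \approx W_t + 1.39\,\mathrm{ms} + 3.48\,\mathrm{ms} + 13.74\,\mathrm{ms}
     \approx W_t + 18.6\,\mathrm{ms}
```

With C_t bounded between 1 s and 120 s, the queue wait W_t is on the order of seconds and dominates this per-concept cost.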

IV. EXPERIMENTAL EVALUATION

A. Insert Metadata

Although XMC Cat supports a number of administrative operations, the main functionality of a catalog is inserting and querying metadata. To evaluate the benefits of eventual consistency, two measurement metrics are used: insert operation response time, and the scalability of inserts and queries.

For all tests we use real metadata harvested from the workspaces of meteorological researchers using the LEAD portal. The base scalability workload used in testing is adopted from [28], a 2007 study that estimated and characterized the volume of experiments that would be run in a production LEAD environment. The execution time of experiments generating metadata in LEAD is dominated by the execution time of the Weather Research and Forecasting (WRF) model. Likewise, the number of files generated by an experiment is also based on WRF. The metadata added for each experiment consists of spatial and temporal metadata, citation and usage elements from the FGDC schema that are similar to elements from the Dublin Core standard [10], notifications harvested from the LEAD notification bus during WRF experiments, and configuration parameters (contained in FORTRAN namelist configuration files). The metadata cataloged for files within an experiment is taken from the ARPS Data Analysis System (ADAS) files used in LEAD. The metadata for an individual file is less voluminous than for an experiment and consists of metadata describing the data's owner, distribution, spatial and temporal bounds, and terms from the Climate and Forecasting controlled vocabulary [11]. Each experiment catalogs 2,202 separate metadata elements grouped into 256 concepts, which frequently exhibit complex structure, as is the case for workflow notifications. The metadata catalog is initialized from the workspaces of 125 actual users.

B. Test Environment

The metadata catalog's web service is hosted on a Dell PowerEdge 6950 configured with quad dual-core 2.4 GHz AMD Opteron processors, 16 GB of memory, and running RHEL 4. The web service stack uses version 1.3 of Apache Axis2 on Tomcat, and the database is MySQL. The clients and the distributed shredder are located on separate servers configured with dual 2-core 2.0 GHz AMD Opteron processors and 16 GB of memory, running RHEL 4 and connected by Gigabit Ethernet.

Figure 3. Baseline comparing strict consistency with eventual consistency when inserting metadata: (a) file metadata, (b) experiment metadata, (c) incremental concepts. The performance improvement from eventual consistency is mainly due to inserting shredded search metadata asynchronously, but it also reduces the cost of parsing and manipulating the metadata using XML beans.

C. Baseline Performance

Since browse metadata is of value to scientists in monitoring experiments, the metadata describing a digital object must be able to grow incrementally during long-running workflows. This requires cataloging metadata for new objects and incrementally adding metadata for existing digital objects. To measure insert performance, we focus on these two operations and vary the size of the insert from the moderately sized metadata describing data files to the more substantial metadata used to describe WRF experiments. Additionally, we evaluate incremental cataloging of metadata for an experiment by initially inserting only core metadata and then incrementally adding metadata concepts, such as notifications and configuration settings, that would be added during workflow execution. XMC Cat also supports updates and deletions of existing metadata concepts for an object, but in LEAD those operations are invoked much less frequently.

Metadata for both files and experiments is inserted for 25 iterations and, eliminating the spin-up of the first operation, Fig. 3(a) and 3(b) illustrate the respective mean execution times broken down into their component costs. Although it takes longer to insert the substantially greater metadata describing an experiment, the dominant costs for both files and experiments are (1) transforming domain metadata to concepts and elements using XSLT, (2) parsing the transformed metadata as an XML Bean, and (3) inserting the shredded metadata. For experiments, operations that manipulate larger XML documents start to become more expensive, such as extracting the root metadata element from the shredded metadata. The majority of the savings in execution time under eventual consistency comes from the cost of inserting shredded metadata, since that step is delegated to the distributed shredders and done asynchronously outside of the ingest operation. However, there is also a reduction in the time required for the transformation, since the XSLT template and the resulting XML bean are smaller due to delegating the shredding of leaf elements to the distributed shredders. Overall, strict consistency takes 41% and 58% longer than eventual consistency to execute the metadata insert operation for files and experiments, respectively.

Since metadata is added incrementally to experiments during workflow execution, we test the performance of adding concepts to previously cataloged experiments.
As a baseline, we ran 25 iterations that added 76 concepts containing actual workflow notifications and configuration parameters from WRF experiments in LEAD. Each concept is added in a separate web service call to XMC Cat. Fig. 3(c) shows the mean execution time, excluding the first concept added. Consistent with files and experiments, the time saved under eventual consistency is mainly attributable to delegating inserts to the distributed shredders. However, the additional cost of strict consistency is only 26%, because a significant portion of the time is used to check that adding the concept does not result in metadata that no longer validates against the schema.

D. Catalog Workload

The scalability evaluation is carried out using a pseudo-realistic workload derived from usage patterns observed in the LEAD science gateway. The data and data characteristics of each of the experiments are realistic, taken with permission from the workspaces of users of the LEAD portal. Plale [28] reports the characteristics of a workload based on 125 active users who each execute 4 workflows in a 12-hour time frame, for a total of 500 workflows. These workflows are classified into the following four categories with their respective frequencies:

Educational experiments (50%)
Canonical experiments (10%)
Ensemble experiments (1%)
Data import (39%)

The number of files generated by a real workflow will vary depending on factors such as configuration; the workload estimates are based on average observed behavior. An ensemble experiment is set at 100 canonical experiments, where each canonical experiment generates 72 files.

The educational experiments are smaller and generate only 16 files. The data import only adds files. Since metadata for experiments is added incrementally as the workflow executes, metadata for the canonical and ensemble experiments is added incrementally, with the frequency of the incremental inserts based on the running time of the experiments and the number of concepts to be added. The metadata for each educational experiment is added as a single larger metadata insert.

The query workload is based on a 76.9:23.1 total query-to-insert ratio. This is adopted from the TPC-E benchmark [34] because of the resemblance between the usage captured in TPC-E and that observed in the LEAD workspace. Based on the patterns observed in the LEAD portal, three quarters of the queries in the scalability test are classified as browse queries based on the object's unique ID, and one quarter are context search queries. As a final aspect of the workload, a set of test clients executes each type of insert or query; the clients are configured with a test plan that specifies the different users for whom metadata should be added and queries executed. The test plan specifies the rate at which each client executes a query or insert, and each client executes the web service calls on a separate thread so that saturation at the metadata catalog service cannot impede the rate at which operations are launched.

All scalability tests are run over a 30-minute window. After running the initial base case, which is modeled on the projected LEAD workload, the workload is first doubled and then increased in subsequent runs by 2 times the base workload until the service is saturated. Each increment of the base workload adds 43,837 concepts containing 212,565 metadata elements that describe data files and experiments and can be queried by scientists to locate relevant datasets; at 10 times the base workload, the scalability test is therefore adding over 2 million queryable shredded metadata elements in the 30-minute window while executing a total of 62,827 browse and search queries.
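The fire-and-forget client behavior described above can be sketched as follows; WorkloadClient and CatalogOperation are hypothetical names used only for illustration and are not part of the LEAD test harness.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a test client that launches web service calls at a fixed rate,
// each on its own thread, so a saturated catalog cannot slow the launch rate.
public class WorkloadClient {
    interface CatalogOperation { void invoke(String userId); }

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    /** Launch one operation for the given user every periodMillis, fire-and-forget. */
    public void start(CatalogOperation op, String userId, long periodMillis) {
        scheduler.scheduleAtFixedRate(
                () -> new Thread(() -> op.invoke(userId)).start(),
                0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stop launching operations after the test window, e.g. 30 minutes. */
    public void stopAfter(long minutes) throws InterruptedException {
        TimeUnit.MINUTES.sleep(minutes);
        scheduler.shutdownNow();
    }
}
```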
E. Scalability Performance

The scalability piece of the evaluation compares the total execution time of strict and eventual consistency under a workload that is increased in size through increments that are multiples of the base catalog workload described in subsection D above. The turnaround time for inserts and queries is measured as the workload is scaled from 1 to 8 times the base catalog workload of inserts and queries. Query performance does not differ significantly, except for higher overall scalability under eventual consistency.

Figure 4. Scalability of inserts under an increased load. As the workload is scaled from 1 to 8 times the base workload, the performance of eventual consistency for files (a) scales well. Larger metadata inserts for experiments (b) are negatively impacted by validating the web service request.

Insert performance is shown in Fig. 4 as the workload is scaled to 8 times the base workload. Fig. 4(a) shows the average time in milliseconds to insert metadata for a file as the workload is scaled from the initial base workload to 8 times that workload. The solid line indicates the total time for strict consistency, and the dashed line represents the portion of that total spent inserting the shredded metadata. The dotted line shows the total time required under eventual consistency using distributed shredding. The performance difference is almost entirely attributable to inserting the shredded metadata (which under eventual consistency is distributed and done asynchronously). The advantage of eventual consistency begins to decrease at 6 times the base workload, but strict consistency still takes 42% longer to execute, and even at that workload eventual consistency still performs better than strict consistency does at the initial base workload. The dashed-and-dotted line in Fig. 4(a) shows the total time required for eventual consistency with the distributed shredder turned off. Under this alternative, eventual consistency instead becomes more advantageous at 6 times the base workload, since the cost of shredding under strict consistency begins to rise. Turning off the shredders increases the consistency delay between browse and search metadata, but if a gateway expects a bursty pattern, shredding can be adapted to this bursty behavior.

When adding metadata concepts to existing experiments, eventual consistency performs only marginally better unless the shredders are turned off, in which case strict consistency takes 27% to 53% longer to execute. This is expected, since determining the validity of the resulting metadata is a significant cost when incrementally adding concepts. The performance when inserting all of the metadata for an experiment as a single large document is depicted in Fig. 4(b). The total time to insert the metadata shows wide fluctuations in performance. The lower section of Fig. 4(b) shows the total cost of inserting the same metadata but excludes the initial validation of the request. Without this validation step, the performance fluctuations exhibited by both consistency models are eliminated and eventual consistency is advantageous. The non-linear cost of validating large metadata documents highlights a performance advantage to inserting metadata incrementally. Validating large XML Bean instances is CPU-intensive and, under the concurrent inserts and queries of the scalability workload, significantly degrades performance.

In XMC Cat, metadata schemata are partitioned into concepts, and validating concepts added incrementally does not exhibit this same performance cost. Small and relatively static tables identify and validate mutually exclusive and required concepts, so although concepts are added incrementally, this never results in an invalid metadata document.

F. Moderating the Workload with Eventual Consistency

As the workload scales up, eventual consistency allows the catalog to scale to 10 times the base workload without errors or timeouts that preclude successful completion of the web service calls. Under strict consistency the workload can only scale to 8 times the base workload without encountering fatal errors. While the difference may not appear great, hidden in it is the fact that under strict consistency non-fatal insert errors start to occur at 2 times the base workload. These non-fatal errors occur when the CLOBs have been inserted but the shredding of the search metadata fails. For these inserts, consistency is less well behaved than under eventual consistency.

Eventual consistency provides the ability to automatically adjust the trade-off between consistency and scalability by dynamically adapting the configuration of the distributed shredders to moderate a heavy workload on the catalog. Under strict consistency, all search metadata must be shredded and inserted before an insert operation is complete (or a non-fatal error occurs, but in our measurements handling these non-fatal errors does not shorten the response time). Under eventual consistency, the number of shredders, and the number of worker threads at each shredder, provide an upper bound on the work performed to shred and insert search metadata in a given time frame. At 6 times the base workload, the distributed shredders keep pace with the concept ingest rate: of the 263,022 concepts added for that workload, only 100 remain in the queue at the end of the test. As the workload scales beyond this level, the concept queue acts as a buffer. At 8 times the base workload the queue takes an additional 5 minutes to finish processing concepts; at 10 times the base workload it requires an additional 23.5 minutes.

The distributed shredder's controller thread automatically adjusts the interval between fetches to minimize the overhead incurred in calls to retrieve concepts. The fetch cycle is a two-stage process in which the controller first calls the catalog to determine whether there are concepts available to process and tags them to be fetched. During this stage, if the catalog service is experiencing a heavy workload, a flag can be set with a timer to tell the shredder controller that there are no concepts available. This allows the metadata catalog to dynamically adjust the number of active shredders based on the current system load.
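A minimal sketch of this catalog-side throttle is shown below; the class and method names are hypothetical and the actual database call is stubbed out.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the catalog-side throttle used in the first stage of the fetch
// cycle: under heavy load the catalog reports "no concepts available" until a
// timer expires, which idles the distributed shredders. Names are illustrative.
public class ConceptQueueService {
    private final AtomicLong throttledUntil = new AtomicLong(0);

    /** Called by a load monitor (or an administrator) to pause shredding for a while. */
    public void throttleFor(long millis) {
        throttledUntil.set(System.currentTimeMillis() + millis);
    }

    /** Stage one of the fetch cycle: tag a batch of queued concepts, or report none. */
    public List<Long> tagAvailableConcepts(int batchSize, long visibilityTimeoutMs) {
        if (System.currentTimeMillis() < throttledUntil.get()) {
            return Collections.emptyList();   // shredders see an empty queue and back off
        }
        return tagBatchInDatabase(batchSize, visibilityTimeoutMs);
    }

    // Placeholder for the database call that tags queued concepts with a future
    // visibility timestamp and returns their IDs.
    private List<Long> tagBatchInDatabase(int batchSize, long visibilityTimeoutMs) {
        return Collections.emptyList();
    }
}
```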
V. RELATED WORK

Globus MCS [33] and SRB's MCAT [4] both take a domain-neutral approach to storing scientific metadata, and the EAV approach used in MCS for storing user-defined attributes as name/value pairs (originally inspired by MCAT) influenced the storage of shredded metadata in XMC Cat. However, as noted by Singh et al. in [33], the conversion from XML to name/value pairs caused a performance bottleneck when storing metadata communicated as XML in MCS. In the ecological sciences, the Metacat metadata catalog [5,21] ingests metadata as XML, shreds it, and stores it in a relational database. The approach used in Metacat to store XML is similar to an early schema-less approach referred to as the edge table [13]. In Metacat, adding metadata incrementally to a previously cataloged object requires retrieving the entire XML document, updating it, and reinserting it. Reinserting the document also requires rebuilding all of the path indexes for the document, making incremental updates expensive. Additionally, when an object's metadata is updated, the entire prior metadata for that object is archived [22], making hundreds of incremental concept inserts during workflow execution impractical.

More recently, one of the motivating use cases [2] for Amazon's SimpleDB was as a metadata index to data stored in Amazon's Simple Storage Service (S3). Although deployed in a cloud, SimpleDB takes an EAV approach similar to MCS and MCAT in that it uses a schema-less model to store up to 256 simple name/value pairs per object. Unlike the shredded metadata stored in XMC Cat or the user-defined attributes stored in MCS, values in SimpleDB are not strongly typed but are instead stored as strings. Although SimpleDB allows for range and value comparisons other than equality, numeric data must be consistently zero-padded since comparisons are lexicographic. SimpleDB does use eventual consistency for scalability and availability; updates are expected to propagate to all replicas within 2 seconds [1].
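The zero-padding requirement is easy to see with a small example (ours, not taken from the SimpleDB documentation):

```java
// Why zero-padding matters when numeric values are compared as strings:
// without padding, "9" > "10" lexicographically; padded to a fixed width,
// string order matches numeric order for non-negative values.
public class ZeroPadDemo {
    static String pad(long value) {
        return String.format("%010d", value);   // fixed width of 10 digits
    }

    public static void main(String[] args) {
        System.out.println("9".compareTo("10") > 0);         // true: wrong order as raw strings
        System.out.println(pad(9).compareTo(pad(10)) < 0);   // true: correct order when padded
        System.out.println(pad(9));                          // 0000000009
    }
}
```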

VI. CONCLUSION AND FUTURE WORK

In this paper we have shown that the hybrid XML-relational approach used in the XMC Cat metadata catalog web service can exploit differences in the temporal utility of browse and search metadata to improve the performance of metadata inserts and overall system scalability. By relaxing the consistency between these two subcategories of discovery metadata to ensure only eventual consistency, the shredding of search metadata can be done asynchronously by delegating the task to distributed concept shredders. Evaluation of the catalog's ability to scale in the face of an incrementally expanding user workload shows that relaxing the consistency between browse and search metadata results in an almost 30% improvement for insert operations. This approach is also well suited to handling incremental updates of metadata communicated as XML, allowing browse metadata to provide greater value to scientists as a tool for monitoring long-running experiments and workflows.

Currently we are working with a research project in astronomy that is ingesting data based on a streaming model. In this context, we are exploring scalability and moderation of the catalog's workload in a streaming environment. The hybrid architecture of XMC Cat allows for a loose coupling between the community XML schema and the concepts scientists search over. Based on this loose coupling, we are researching the use of the metadata catalog, combined with semantic tools, as a front-end to legacy structured data, to provide scientists with a query capability more closely aligned with their model of the data. The current implementation distributes the shredding of metadata concepts, but the metadata is eventually inserted back into the main catalog. We are exploring approaches to distributing the metadata catalog itself to further scale with the deluge of scientific data being generated.

ACKNOWLEDGMENT

We thank the LEAD Gateway community, particularly the meteorological researchers at the University of Oklahoma and at other institutions using LEAD. We also thank the reviewers for their valuable feedback.

REFERENCES

[1] Amazon Web Services, Structured Data Storage, last accessed at: entry.jspa?externalid=3087&categoryid=152
[2] Amazon Web Services, Indexing Amazon S3 Object Metadata, last accessed at:
[3] Apache Axis2 Next Generation Web Services. Available at:
[4] C. Baru, R. Moore, A. Rajasekar, and M. Wan, The storage resource broker, in Proceedings of the 1998 Conference of the Centre for Advanced Studies on Collaborative Research (CASCON), Toronto, Canada, November.
[5] C. Berkley, M. Jones, J. Bojilova, and D. Higgins, Metacat: a schema-independent XML database system, in Proceedings of the 13th International Conference on Scientific and Statistical Database Management, Fairfax, Virginia, July.
[6] J. Birnholtz and M. Bietz, Data at work: sharing in science and engineering, in Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work, Sanibel Island, Florida, November.
[7] M. Brantner, D. Florescu, D. Graf, D. Kossmann, and T. Kraska, Building a database on S3, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, June.
[8] E. A. Brewer, Towards robust distributed systems, keynote at the 19th ACM Symposium on Principles of Distributed Computing (PODC 2000), Portland, Oregon, July.
[9] K. Droegemeier, K. Brewster, M. Xue, D. Weber, D. Gannon, B. Plale, D. Reed, L. Ramakrishnan, J. Alameda, R. Wilhelmson, T. Baltzer, B. Domenico, D. Murray, A. Wilson, R. Clark, S. Yalda, S. Graves, R. Ramachandran, J. Rushing, and E. Joseph, Service-oriented environments for dynamically interacting with mesoscale weather, Computing in Science and Engineering, IEEE Computer Society Press and American Institute of Physics, vol. 7, no. 6.
[10] Dublin Core Metadata Initiative. Available at:
[11] B. Eaton, J. Gregory, B. Drach, K. Taylor, and S. Hankin, NetCDF Climate and Forecast (CF) Metadata Conventions, Version 1.0.
[12] Federal Geographic Data Committee, Washington, D.C., Content Standard for Digital Geospatial Metadata Workbook, Version 2.0.
[13] D. Florescu and D. Kossmann, Storing and querying XML data using an RDBMS, IEEE Data Engineering Bulletin, vol. 22, no. 3.
[14] H. Garcia-Molina, J. D. Ullman, and J. Widom, Database Systems: The Complete Book, Prentice-Hall, Upper Saddle River, New Jersey.
[15] J. Gray, A. S. Szalay, A. R. Thakar, C. Stoughton, and J. vandenBerg, Online scientific data curation, publication, and archiving, Microsoft Research, Technical Report MSR-TR, July.
[16] B. J. Hurley, J. Price-Wilkin, M. Proffitt, and H. Besser, The Making of America II testbed project: a digital library service model (part III: implementing the MOA II service model), Council on Library and Information Resources, December 1999, last retrieved from:
[17] International Organization for Standardization, Geographic Information Metadata (ISO 19115:2003).
[18] S. Jensen and B. Plale, Using characteristics of computational science schemas for workflow metadata management, in Proceedings of the 2008 IEEE Congress on Services, Second International Workshop on Scientific Workflows (SWF 2008), Hawaii, July.
[19] S. Jensen, B. Plale, S. Lee Pallickara, and Y. Sun, A hybrid XML-relational grid metadata catalog, Workshop on Web Services-based Grid Applications (WGSA'06), in association with the International Conference on Parallel Processing (ICPP-06), August.
[20] S. Jensen and B. Plale, Extended abstract: schema-independent and schema-friendly scientific metadata management, Fourth IEEE International Conference on e-Science.
[21] M. B. Jones, C. Berkley, J. Bojilova, and M. Schildhauer, Managing scientific metadata, IEEE Internet Computing, vol. 5, no. 5, Sept./Oct.
[22] Metacat Administrator's Guide for Metacat 1.9.1, version 1.0, April. Available at:
[23] METS: Metadata Encoding & Transmission Standard: An Overview & Tutorial, last retrieved from:
[24] J. Michalakes, J. Dudhia, D. Gill, J. Klemp, and W. Skamarock, Design of a next-generation regional weather research and forecast model, in Towards Teracomputing, World Scientific, River Edge, New Jersey, 1998.
[25] W. K. Michener, J. W. Brunt, J. J. Helly, T. B. Kirchner, and S. G. Stafford, Nongeospatial metadata for the ecological sciences, Ecological Applications, vol. 7, no. 1, February.
[26] National Information Standards Organization (NISO), Understanding Metadata, NISO Press, last retrieved from:
[27] S. Newhouse, J. M. Schopf, A. Richards, and M. P. Atkinson, Study of User Priorities for e-Infrastructure for e-Research (SUPER), in Proceedings of the UK e-Science All Hands Conference.
[28] B. Plale, Workload characterization and analysis of storage and bandwidth needs of LEAD workspace, LEAD TR 001, Linked Environments for Atmospheric Discovery (LEAD), Version 3.0, 2007.
[29] B. Plale, LEAD II/Trident workflows for timely weather products: the challenge of Vortex2, Microsoft External Research Symposium, April.
[30] D. Pritchett, BASE: an ACID alternative, ACM Queue, vol. 6, no. 3, May.
[31] A. Rajasekar and R. Moore, Data and metadata collections for scientific applications, in Proceedings of the 9th International Conference on High-Performance Computing and Networking (HPCN Europe), June.
[32] M. Rys, D. Chamberlin, and D. Florescu, XML and relational database management systems: the inside story, in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, New York, June.
[33] G. Singh, S. Bharathi, A. Chervenak, E. Deelman, C. Kesselman, M. Manohar, S. Patil, and L. Pearlman, A metadata catalog service for data intensive applications, in Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, Phoenix, November.
[34] Transaction Processing Performance Council, TPC Benchmark E Standard Specification, Version (2007).
[35] W. Vogels, Eventually consistent, Communications of the ACM, vol. 52, no. 1, January.
[36] A. Voss et al., e-Research infrastructure development and community engagement, in Proceedings of the UK e-Science All Hands Meeting 2007, Nottingham, UK, September 2007.


Oracle Database 10g Resource Manager. An Oracle White Paper October 2005 Oracle Database 10g Resource Manager An Oracle White Paper October 2005 Oracle Database 10g Resource Manager INTRODUCTION... 3 SYSTEM AND RESOURCE MANAGEMENT... 3 ESTABLISHING RESOURCE PLANS AND POLICIES...

More information

A Metadata Catalog Service for Data Intensive Applications

A Metadata Catalog Service for Data Intensive Applications Metadata Catalog Service Draft August 5, 2002 A Metadata Catalog Service for Data Intensive Applications Ann Chervenak, Ewa Deelman, Carl Kesselman, Laura Pearlman, Gurmeet Singh Version 1.0 1 Introduction

More information

Flexible Design for Simple Digital Library Tools and Services

Flexible Design for Simple Digital Library Tools and Services Flexible Design for Simple Digital Library Tools and Services Lighton Phiri Hussein Suleman Digital Libraries Laboratory Department of Computer Science University of Cape Town October 8, 2013 SARU archaeological

More information

BRANCH:IT FINAL YEAR SEVENTH SEM SUBJECT: MOBILE COMPUTING UNIT-IV: MOBILE DATA MANAGEMENT

BRANCH:IT FINAL YEAR SEVENTH SEM SUBJECT: MOBILE COMPUTING UNIT-IV: MOBILE DATA MANAGEMENT - 1 Mobile Data Management: Mobile Transactions - Reporting and Co Transactions Kangaroo Transaction Model - Clustering Model Isolation only transaction 2 Tier Transaction Model Semantic based nomadic

More information

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

Flexible Framework for Mining Meteorological Data

Flexible Framework for Mining Meteorological Data Flexible Framework for Mining Meteorological Data Rahul Ramachandran *, John Rushing, Helen Conover, Sara Graves and Ken Keiser Information Technology and Systems Center University of Alabama in Huntsville

More information

Exploiting peer group concept for adaptive and highly available services

Exploiting peer group concept for adaptive and highly available services Computing in High Energy and Nuclear Physics, 24-28 March 2003 La Jolla California 1 Exploiting peer group concept for adaptive and highly available services Muhammad Asif Jan Centre for European Nuclear

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Adapting Mixed Workloads to Meet SLOs in Autonomic DBMSs

Adapting Mixed Workloads to Meet SLOs in Autonomic DBMSs Adapting Mixed Workloads to Meet SLOs in Autonomic DBMSs Baoning Niu, Patrick Martin, Wendy Powley School of Computing, Queen s University Kingston, Ontario, Canada, K7L 3N6 {niu martin wendy}@cs.queensu.ca

More information

Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading

Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading Mario Almeida, Liang Wang*, Jeremy Blackburn, Konstantina Papagiannaki, Jon Crowcroft* Telefonica

More information

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Unleash Your Data Center s Hidden Power September 16, 2014 Molly Rector CMO, EVP Product Management & WW Marketing

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

dan.fay@microsoft.com Scientific Data Intensive Computing Workshop 2004 Visualizing and Experiencing E 3 Data + Information: Provide a unique experience to reduce time to insight and knowledge through

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid THE GLOBUS PROJECT White Paper GridFTP Universal Data Transfer for the Grid WHITE PAPER GridFTP Universal Data Transfer for the Grid September 5, 2000 Copyright 2000, The University of Chicago and The

More information

Scalable Hybrid Search on Distributed Databases

Scalable Hybrid Search on Distributed Databases Scalable Hybrid Search on Distributed Databases Jungkee Kim 1,2 and Geoffrey Fox 2 1 Department of Computer Science, Florida State University, Tallahassee FL 32306, U.S.A., jungkkim@cs.fsu.edu, 2 Community

More information

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles

More information

QLogic TrueScale InfiniBand and Teraflop Simulations

QLogic TrueScale InfiniBand and Teraflop Simulations WHITE Paper QLogic TrueScale InfiniBand and Teraflop Simulations For ANSYS Mechanical v12 High Performance Interconnect for ANSYS Computer Aided Engineering Solutions Executive Summary Today s challenging

More information

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL CISC 7610 Lecture 5 Distributed multimedia databases Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL Motivation YouTube receives 400 hours of video per minute That is 200M hours

More information

Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality

Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality Amin Vahdat Department of Computer Science Duke University 1 Introduction Increasingly,

More information

B2SAFE metadata management

B2SAFE metadata management B2SAFE metadata management version 1.2 by Claudio Cacciari, Robert Verkerk, Adil Hasan, Elena Erastova Introduction The B2SAFE service provides a set of functions for long term bit stream data preservation:

More information

Ch 4 : CPU scheduling

Ch 4 : CPU scheduling Ch 4 : CPU scheduling It's the basis of multiprogramming operating systems. By switching the CPU among processes, the operating system can make the computer more productive In a single-processor system,

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

VoltDB vs. Redis Benchmark

VoltDB vs. Redis Benchmark Volt vs. Redis Benchmark Motivation and Goals of this Evaluation Compare the performance of several distributed databases that can be used for state storage in some of our applications Low latency is expected

More information

Chapter 10: Performance Patterns

Chapter 10: Performance Patterns Chapter 10: Performance Patterns Patterns A pattern is a common solution to a problem that occurs in many different contexts Patterns capture expert knowledge about best practices in software design in

More information

Chapter 18 Distributed Systems and Web Services

Chapter 18 Distributed Systems and Web Services Chapter 18 Distributed Systems and Web Services Outline 18.1 Introduction 18.2 Distributed File Systems 18.2.1 Distributed File System Concepts 18.2.2 Network File System (NFS) 18.2.3 Andrew File System

More information

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs

More information

Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering

Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Nithya N. Vijayakumar and Beth Plale Department of Computer Science, Indiana University {nvijayak, plale}@cs.indiana.edu Abstract.

More information

Datacenter replication solution with quasardb

Datacenter replication solution with quasardb Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION

More information

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS 28 CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS Introduction Measurement-based scheme, that constantly monitors the network, will incorporate the current network state in the

More information

Consistency and Replication 1/65

Consistency and Replication 1/65 Consistency and Replication 1/65 Replicas and Consistency??? Tatiana Maslany in the show Orphan Black: The story of a group of clones that discover each other and the secret organization Dyad, which was

More information

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND

More information

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL Building High Performance Apps using NoSQL Swami Sivasubramanian General Manager, AWS NoSQL Building high performance apps There is a lot to building high performance apps Scalability Performance at high

More information

The Computation and Data Needs of Canadian Astronomy

The Computation and Data Needs of Canadian Astronomy Summary The Computation and Data Needs of Canadian Astronomy The Computation and Data Committee In this white paper, we review the role of computing in astronomy and astrophysics and present the Computation

More information

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language Martin C. Rinard (martin@cs.ucsb.edu) Department of Computer Science University

More information

Eventual Consistency 1

Eventual Consistency 1 Eventual Consistency 1 Readings Werner Vogels ACM Queue paper http://queue.acm.org/detail.cfm?id=1466448 Dynamo paper http://www.allthingsdistributed.com/files/ amazon-dynamo-sosp2007.pdf Apache Cassandra

More information

Data Management for Distributed Scientific Collaborations Using a Rule Engine

Data Management for Distributed Scientific Collaborations Using a Rule Engine Data Management for Distributed Scientific Collaborations Using a Rule Engine Sara Alspaugh Department of Computer Science University of Virginia alspaugh@virginia.edu Ann Chervenak Information Sciences

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

DAMAGE DISCOVERY IN DISTRIBUTED DATABASE SYSTEMS

DAMAGE DISCOVERY IN DISTRIBUTED DATABASE SYSTEMS DAMAGE DISCOVERY IN DISTRIBUTED DATABASE SYSTEMS Yanjun Zuo and Brajendra Panda Abstract Damage assessment and recovery in a distributed database system in a post information attack detection scenario

More information

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation report prepared under contract with Dot Hill August 2015 Executive Summary Solid state

More information

CA Test Data Manager Key Scenarios

CA Test Data Manager Key Scenarios WHITE PAPER APRIL 2016 CA Test Data Manager Key Scenarios Generate and secure all the data needed for rigorous testing, and provision it to highly distributed teams on demand. Muhammad Arif Application

More information

Joe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain)

Joe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain) 1 of 11 5/4/2011 4:49 PM Joe Wingbermuehle, wingbej@wustl.edu (A paper written under the guidance of Prof. Raj Jain) Download The Auto-Pipe system allows one to evaluate various resource mappings and topologies

More information

Users and utilization of CERIT-SC infrastructure

Users and utilization of CERIT-SC infrastructure Users and utilization of CERIT-SC infrastructure Equipment CERIT-SC is an integral part of the national e-infrastructure operated by CESNET, and it leverages many of its services (e.g. management of user

More information

Performance and Scalability with Griddable.io

Performance and Scalability with Griddable.io Performance and Scalability with Griddable.io Executive summary Griddable.io is an industry-leading timeline-consistent synchronized data integration grid across a range of source and target data systems.

More information

3) CHARLIE HULL. Implementing open source search for a major specialist recruiting firm

3) CHARLIE HULL. Implementing open source search for a major specialist recruiting firm Advice: The time spent on pre-launch analysis is worth the effort to avoid starting from scratch and further alienating already frustrated users by implementing a search which appears to have no connection

More information

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338

More information

Tracking Stream Provenance in Complex Event Processing Systems for Workflow-Driven Computing

Tracking Stream Provenance in Complex Event Processing Systems for Workflow-Driven Computing Tracking Stream Provenance in Complex Event Processing Systems for Workflow-Driven Computing Nithya N. Vijayakumar and Beth Plale Department of Computer Science Indiana University {nvijayak, plale}@cs.indiana.edu

More information

Comparing Open Source Digital Library Software

Comparing Open Source Digital Library Software Comparing Open Source Digital Library Software George Pyrounakis University of Athens, Greece Mara Nikolaidou Harokopio University of Athens, Greece Topic: Digital Libraries: Design and Development, Open

More information

Nuno Freire National Library of Portugal Lisbon, Portugal

Nuno Freire National Library of Portugal Lisbon, Portugal Date submitted: 05/07/2010 UNIMARC in The European Library and related projects Nuno Freire National Library of Portugal Lisbon, Portugal E-mail: nuno.freire@bnportugal.pt Meeting: 148. UNIMARC WORLD LIBRARY

More information

... IBM Advanced Technical Skills IBM Oracle International Competency Center September 2013

... IBM Advanced Technical Skills IBM Oracle International Competency Center September 2013 Performance benefits of IBM Power Systems and IBM FlashSystem for JD Edwards EnterpriseOne IBM Power 780 server with AIX and IBM FlashSystem 820 flash storage improves batch performance in a client proof

More information

Wringing Kilobytes of Knowledge from Petabytes of Data: Something has to Change

Wringing Kilobytes of Knowledge from Petabytes of Data: Something has to Change Wringing Kilobytes of Knowledge from Petabytes of Data: Something has to Change Beth Plale Department of Computer Science School of Informatics Indiana University Research not possible were it not for

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Oracle Database 12c: JMS Sharded Queues

Oracle Database 12c: JMS Sharded Queues Oracle Database 12c: JMS Sharded Queues For high performance, scalable Advanced Queuing ORACLE WHITE PAPER MARCH 2015 Table of Contents Introduction 2 Architecture 3 PERFORMANCE OF AQ-JMS QUEUES 4 PERFORMANCE

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

Hierarchical Chubby: A Scalable, Distributed Locking Service

Hierarchical Chubby: A Scalable, Distributed Locking Service Hierarchical Chubby: A Scalable, Distributed Locking Service Zoë Bohn and Emma Dauterman Abstract We describe a scalable, hierarchical version of Google s locking service, Chubby, designed for use by systems

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information