Putting the Archives to Work: Workflow and Metadata-driven Analysis in LTER Science
Wade Sheldon
Georgia Coastal Ecosystems LTER, University of Georgia
Acknowledgements:
- John Porter (Virginia Coast Reserve LTER)
- Duane Costa (LTER Network Office)
- Corinna Gries (North Temperate Lakes LTER)
- Inigo San Gil (LTER Network Office, McMurdo Dry Valleys LTER)
Background
- Ecological science in the LTER Network is a data-intensive effort covering vast temporal and spatial scales
- The practice of informatics is critical for
  - Managing LTER data for analysis
  - Curating LTER data for accuracy and accessibility
  - Archiving LTER data for interpretation and use by future scientists
- LTER sites have adopted many informatics standards and practices to meet these goals, including EML (Ecological Metadata Language) metadata and PASTA (Provenance Aware Synthesis Tracking Architecture)
- LTER EML implementation targets data integration, not just resource description
EML Metadata in LTER
- Integration-level EML is comprehensive
  - Data discovery metadata: title, abstract, keywords, personnel
  - Research context metadata: study description, methods, protocols, project description
  - Data set coverage metadata: temporal, spatial, taxonomic
  - Physical metadata for entities (e.g. tables): file format, delimiters, terminators, header format, download URL
  - Attribute metadata: data types, names, descriptions, units, codes, Q/C limits
- Supports software-mediated discovery, download and parsing of entities, and integration with other data
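The physical and attribute metadata listed above are what let software parse an entity without human help. A minimal Python sketch of that idea follows. It assumes a small hand-written fragment modeled on EML's dataTable structure (not a complete or schema-valid EML document), and the entity name, column names and the `describe_table` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written illustration modeled on an EML dataTable; real EML documents
# carry far more detail (coverage, methods, Q/C bounds, download URL, ...).
EML_FRAGMENT = """
<dataTable>
  <entityName>water_temperature</entityName>
  <physical>
    <dataFormat>
      <textFormat>
        <numHeaderLines>1</numHeaderLines>
        <simpleDelimited>
          <fieldDelimiter>,</fieldDelimiter>
        </simpleDelimited>
      </textFormat>
    </dataFormat>
  </physical>
  <attributeList>
    <attribute>
      <attributeName>Site</attributeName>
      <attributeDefinition>Sampling site code</attributeDefinition>
    </attribute>
    <attribute>
      <attributeName>Temp</attributeName>
      <attributeDefinition>Water temperature</attributeDefinition>
      <measurementScale>
        <ratio>
          <unit><standardUnit>celsius</standardUnit></unit>
        </ratio>
      </measurementScale>
    </attribute>
  </attributeList>
</dataTable>
"""


def describe_table(xml_text):
    """Collect the parsing hints (physical) and column descriptions (attributes)."""
    table = ET.fromstring(xml_text)
    columns = []
    for attr in table.iter("attribute"):
        columns.append({
            "name": attr.findtext("attributeName"),
            "definition": attr.findtext("attributeDefinition"),
            # None when the attribute has no unit (e.g. a site code)
            "unit": attr.findtext(".//standardUnit"),
        })
    return {
        "entity": table.findtext("entityName"),
        "delimiter": table.findtext(".//fieldDelimiter"),
        "header_lines": int(table.findtext(".//numHeaderLines")),
        "columns": columns,
    }


info = describe_table(EML_FRAGMENT)
for col in info["columns"]:
    print(col["name"], col["unit"])
```

The resulting dictionary holds everything a generic loader would need to read the table: delimiter, header line count, and column names with units.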
EML Generation
- LTER sites have developed many approaches for generating EML from site catalogs
  - Morpho editor
  - XML editors (oXygen, XMLSpy)
  - Custom application frameworks/databases
- Two software systems are emerging as community tools used at multiple sites and outside LTER
  - Metabase Metadata Management System (Metabase)
  - Drupal Ecological Information Management System (DEIMS)
Metabase MMS
- Generalized RDBMS for managing environmental metadata (GCE 2002)
  - Personnel
  - Site geography (study area polygons, point locations)
  - Instrumentation
  - Research projects
  - Data sets (studies, methods, entities, attributes, files)
- Linked to bibliographic and taxonomic databases
  - Supports automatic cross-links between people/research/pubs and data
- RESTful web services for mapping, personnel, data set descriptions
- Automated metadata generation for data sets, cross-links between all related information
- Used by GCE, CWT, MCR, SBC, SREL (HBR adopting)
- http://gce-lter.marsci.uga.edu/public/app/resource_details.asp?id=434
DEIMS
- IMS built on the popular Drupal CMS framework (LNO 2010)
- A Drupal installation profile for storing, editing, and sharing data and information about biological and ecological research
- Provides user-friendly forms to describe all contextual information about your data
- Produces EML metadata to register data with metadata clearinghouses (LTER PASTA, ORNL-DAAC, KNB)
- Allows you to query external databases using the Data Explorer feature
- Used by MCM, SEV, JRN, LUQ, U. Michigan Biological Station and others
- https://www.drupal.org/project/deims
Metadata-driven Analysis
- PASTA has simplified using EML-described data for metadata-driven analysis and workflows
  - Unified repository for LTER data
  - Quality checks ensure data-metadata conformity
  - API for versioned metadata and data retrieval
  - Trigger mechanism for running workflows on changes
- Variety of workflow tools in use
  - Kepler
  - R, SAS, SPSS statistical software
  - MATLAB technical computing software
  - GCE Data Toolbox (MATLAB-based framework)
- Web services (VCR) enabled on LTER Data Portal for generating data loaders for R, SAS, SPSS, MATLAB
- PASTA/EML support included in GCE Data Toolbox for interactive and programmatic data mining and workflows
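Versioned retrieval depends on the three-part package identifier (scope, identifier, revision), as in knb-lter-vcr.26.20. The Python sketch below splits such an identifier and builds request URLs from it. The base URL and the path layout are assumptions made for illustration and should be checked against the PASTA API documentation. The helper names are hypothetical, and no network request is made.

```python
from typing import NamedTuple

# Assumed service root and path layout; verify against the PASTA API docs.
PASTA_BASE = "https://pasta.lternet.edu/package"


class PackageId(NamedTuple):
    scope: str
    identifier: int
    revision: int


def parse_package_id(package_id):
    """Split 'scope.identifier.revision'; the scope itself may contain hyphens."""
    scope, identifier, revision = package_id.rsplit(".", 2)
    return PackageId(scope, int(identifier), int(revision))


def metadata_url(pkg, base=PASTA_BASE):
    """URL for the EML metadata of one specific revision (assumed layout)."""
    return f"{base}/metadata/eml/{pkg.scope}/{pkg.identifier}/{pkg.revision}"


def data_url(pkg, base=PASTA_BASE):
    """URL listing the data entities of one specific revision (assumed layout)."""
    return f"{base}/data/eml/{pkg.scope}/{pkg.identifier}/{pkg.revision}"


pkg = parse_package_id("knb-lter-vcr.26.20")
print(metadata_url(pkg))
print(data_url(pkg))
```

Because the revision is part of the address, a workflow that records the package identifier can re-fetch exactly the metadata and data it originally ran against.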
Kepler
- Kepler supports data downloading via REST URLs
- EML metadata actor loads entities, configures ports for attributes
- Demo workflows for downloading PASTA data, ClimDB export
- Demo: http://intranet2.lternet.edu/content/video-and-presentations-2012-nsf-lter-mini-symposium-now-available
R, SAS, SPSS and MATLAB
- EML transformed via XSLT to generate native data acquisition programs for target platform (documented source code)
  - Based on R stat program generator from TERN (I-LTER)
  - Programs download compatible entities, load data into appropriate structures from within analysis environment
  - Attribution and research origin metadata included as code comments (R, SAS, SPSS) or in data structure itself (MATLAB)
  - Flexible: can be run locally or via RESTful web services
- Many benefits to users
  - EML, source code and entities (e.g. CSV files) can be saved, re-used
  - User can debug code using native IDE if data features are incompatible
  - Generated code can be modified and extended as part of a custom workflow
  - Leverages tools researchers use every day!
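The generator idea (metadata in, documented loader program out) can be sketched briefly. The services described above use XSLT. The Python example below swaps in plain string templating to show the same principle, and it is not the actual generator. The `generate_r_loader` function, the metadata dictionary layout, the data set title and the example.org URL are all hypothetical.

```python
def generate_r_loader(meta):
    """Emit a small R script that loads one delimited-text entity.

    Attribution goes into code comments, mirroring the approach the slide
    describes for R, SAS and SPSS.
    """
    col_names = ", ".join('"' + col["name"] + '"' for col in meta["columns"])
    lines = [
        "# Package ID: " + meta["package_id"],
        "# Data set title: " + meta["title"],
        'infile <- "' + meta["url"] + '"',
        "dt1 <- read.csv(infile, header = FALSE,",
        "                skip = " + str(meta["header_lines"]) + ",",
        '                sep = "' + meta["delimiter"] + '",',
        "                col.names = c(" + col_names + "))",
    ]
    # Units are carried along as comments so they stay with the code
    for col in meta["columns"]:
        if col.get("unit"):
            lines.append("# " + col["name"] + " unit: " + col["unit"])
    lines.append("summary(dt1)")
    return "\n".join(lines) + "\n"


# Hypothetical metadata, e.g. as extracted from an EML document
EXAMPLE_META = {
    "package_id": "knb-lter-vcr.26.20",
    "title": "Example data set (hypothetical title)",
    "url": "https://example.org/data/table.csv",  # placeholder, not a real entity
    "header_lines": 1,
    "delimiter": ",",
    "columns": [
        {"name": "Site", "unit": None},
        {"name": "Temp", "unit": "celsius"},
    ],
}

script = generate_r_loader(EXAMPLE_META)
print(script)
```

Since the output is ordinary R source, a user can save it, step through it in an IDE, or extend it, which is the benefit the slide highlights.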
R, SAS, SPSS, MATLAB
- Generated R program: http://www.vcrlter.virginia.edu/webservice/pastaprog/knb-lter-vcr.26.20.r
- Generated MATLAB program: http://www.vcrlter.virginia.edu/webservice/pastaprog/knb-lter-vcr.26.20.m
- Running code downloads files, loads data
LTER Data Portal Web Services
- Links to code generator services on summary page for every LTER data set in PASTA
- Provides code (copy/paste or download) and instructions
GCE Data Toolbox
- MATLAB framework for metadata-based processing, quality control and analysis of environmental data (see ESIP poster)
- Imports EML-described data from local file system, PASTA, KNB, site catalogs, ...
  - Leverages generic EML-to-MATLAB XSLT
  - Complete metadata imported along with entities into data model
  - Supports authentication, entity selection (GUI or workflow)
GCE Data Toolbox
- Wide range of data visualization, transformation, integration tools for developing workflows
- Derived data contain complete metadata, QA/QC information, processing history
- Export results to text files, MATLAB arrays, RDBMS, CUAHSI ODM, KML, HTML
- EML and text files can be uploaded to PASTA
- Open source software library for MATLAB: https://gce-svn.marsci.uga.edu/trac/gce_toolbox
Conclusion
- Currently just scratching the surface of what EML and PASTA can do
- Vision of machine-readable data and metadata-enabled analysis and integration is now a reality
- Use cases and collaborations inside and outside LTER welcome