Wade Sheldon
Georgia Coastal Ecosystems LTER, University of Georgia
Email: sheldon@uga.edu
Regardless of Q/A procedures, data quality issues are guaranteed with environmental sensor data.
Without good Q/C, data users can draw invalid conclusions (untrustworthy data).
Q/C analysis is a critical part of any monitoring program.
See: Campbell et al. 2013. Quantity is Nothing without Quality: Automated QA/QC for Streaming Environmental Sensor Data. BioScience 63(7):574-585.
A wide variety of software can be used to perform Q/C analysis:
- Spreadsheets (conditional colors)
- Statistical software (outlier tests, consistency checks, models)
- Plotting software (visual checks)
Assigning and managing qualifiers (flags) and revising data is often a manual process.
Traditional Q/C techniques don't scale well for streaming sensor data:
- Data volume too high to plot and look at every value before posting
- Tedious to document Q/C steps (provenance)
- Poor scope for automation
Streaming sensor data needs different approaches and tools that support automation and provenance tracking.
Requirements:
- Algorithmic Q/C analysis (automation)
- Visual Q/C analysis (review/revision)
- Ability to repeat analysis with new data (scalability)
- Method to assign and manage qualifiers (flagging)
- Support for data correction and revision (cleaning)
- Documentation of Q/C steps (provenance)
Example frameworks:
- GCE Data Toolbox (MATLAB)
- CUAHSI ODM Tools (Python)
GCE Data Toolbox is a lightweight, portable, file-based data management system implemented in MATLAB.
Key components:
- Generalized tabular data model for metadata, data, and Q/C rules and flags
- Command-line function library (API)
- Graphical user interface (GUI)
Data import support:
- Delimited text (csv, tab, space)
- Logger files (CSI, SBE, YSI,...)
- SQL RDBMS
- Web services (NWIS, NOAA HADS, CHORDS,...)
Data export: delimited text, SQL, XML/HTML
Metadata export: plain text, EML/XML,...
Algorithmic Q/C Analysis
- Rules (i.e. criteria) define conditions in which values should be flagged
- Unlimited Q/C rules for each variable
- Rules are evaluated when data are loaded and when data or rules change
- Rules can be predefined in metadata templates to automate Q/C on import
Interactive Q/C Analysis and Revision
- Qualifiers can be assigned/cleared visually on data plots with the mouse
- Qualifiers can be propagated to dependent columns
- Qualifiers can be removed or edited (search/replace) if standards change
Automatic Documentation of Q/C Steps
- Q/C operations (including revisions) are logged to the processing lineage
- Data anomaly reports can be auto-generated and annotated
Data correction, analysis, and synthesis tools are Q/C-aware
- Qualified values can be filtered, summarized, and visualized during analysis
- Statistics about missing/qualified values are tabulated and used to qualify derived data
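The last point, using statistics about missing and qualified values to qualify derived data, can be illustrated with a minimal Python sketch. It is not the toolbox's implementation: the function name `summarize`, the 25% threshold, and the choice of "Q" as the derived flag are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def summarize(
    values: Sequence[Optional[float]],
    flags: Sequence[str],
    max_excluded_fraction: float = 0.25,  # hypothetical threshold
) -> Dict[str, object]:
    """Average the unflagged values and qualify the result if too many were excluded."""
    good: List[float] = [
        v for v, f in zip(values, flags) if v is not None and not f
    ]
    n_total = len(values)
    n_missing = sum(1 for v in values if v is None)
    n_flagged = sum(1 for v, f in zip(values, flags) if v is not None and f)
    excluded = (n_missing + n_flagged) / n_total if n_total else 0.0
    return {
        "mean": sum(good) / len(good) if good else None,
        "n_missing": n_missing,
        "n_flagged": n_flagged,
        # Assumption: the derived value is marked questionable ("Q")
        # when the excluded fraction exceeds the threshold.
        "flag": "Q" if excluded > max_excluded_fraction else "",
    }


hourly = [10.0, 12.0, None, 50.0]
hourly_flags = ["", "", "", "I"]
print(summarize(hourly, hourly_flags))
# {'mean': 11.0, 'n_missing': 1, 'n_flagged': 1, 'flag': 'Q'}
```

Carrying the counts alongside the derived value lets a later user judge how much of the underlying record actually supports it.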
flag_novaluechange(col_salinity,0.3,0.3,3)='F'
col_depth<0.2='Q'
col_depth<0='I'
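The rules above pair a condition on a column with a flag code. A minimal Python sketch of that idea follows. It is an illustration only, not the toolbox's MATLAB implementation: `apply_rules` and `flag_flatline` are hypothetical names, concatenating the codes of all matching rules is an assumption, and `flag_flatline` is a simplified stuck-sensor check whose parameters do not correspond to those of `flag_novaluechange`, which the slide does not explain.

```python
from typing import Callable, List, Sequence, Tuple

# Each rule pairs a per-value test with the flag code assigned when the test is true.
Rule = Tuple[Callable[[float], bool], str]


def apply_rules(values: Sequence[float], rules: Sequence[Rule]) -> List[str]:
    """Return one flag string per value; codes of all matching rules are concatenated."""
    return ["".join(code for test, code in rules if test(v)) for v in values]


def flag_flatline(values: Sequence[float], tolerance: float, run_length: int) -> List[bool]:
    """Hypothetical stuck-sensor check: mark every value in a run of at least
    `run_length` consecutive values whose successive differences stay within `tolerance`."""
    marked = [False] * len(values)
    start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the data or when the value jumps.
        if i == len(values) or abs(values[i] - values[i - 1]) > tolerance:
            if i - start >= run_length:
                for j in range(start, i):
                    marked[j] = True
            start = i
    return marked


depth = [1.5, 0.15, -0.2, 0.8]
depth_rules = [(lambda d: d < 0.2, "Q"), (lambda d: d < 0, "I")]
print(apply_rules(depth, depth_rules))  # ['', 'Q', 'QI', '']

salinity = [30.1, 30.1, 30.1, 30.1, 28.4, 27.9]
print(flag_flatline(salinity, tolerance=0.0, run_length=3))
# [True, True, True, True, False, False]
```

Because the rules are plain data, they can be re-evaluated whenever the values or the rule set change, which is the behavior the slides describe.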
Visual Q/C tool can be invoked from interactive data plots
- Actions are variable-specific to prevent inadvertent flagging of wrong values
- Left-click/drag to assign, right-click/drag to clear
Anomaly reports can be auto-generated on demand and annotated to explain the rationale for revision
Composite flags can be manually propagated to derived variables
- Flags can be meshed with or overwrite existing flags
- Often easier to propagate flags than to compose multi-column rule sets
Whenever flags are interactively edited, automatic Q/C rules are locked to prevent overriding the edits
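Flag propagation to a derived variable can be sketched in Python as follows. This is a hedged illustration rather than the toolbox's behavior: `propagate_flags` is a hypothetical name, "mesh" is assumed to mean the union of flag characters, and "overwrite" is assumed to replace the target flag only where the source column carries one.

```python
from typing import List, Sequence


def propagate_flags(
    source_flags: Sequence[str],
    target_flags: Sequence[str],
    mode: str = "mesh",
) -> List[str]:
    """Copy flags from a source column onto a dependent (derived) column."""
    if len(source_flags) != len(target_flags):
        raise ValueError("flag columns must have the same length")
    if mode not in ("mesh", "overwrite"):
        raise ValueError("mode must be 'mesh' or 'overwrite'")
    result = []
    for src, tgt in zip(source_flags, target_flags):
        if mode == "overwrite":
            # Assumption: an empty source flag leaves the target flag untouched.
            result.append(src if src else tgt)
        else:
            # Mesh: keep existing characters, append any new ones once.
            merged = tgt
            for ch in src:
                if ch not in merged:
                    merged += ch
            result.append(merged)
    return result


# Salinity is derived from conductivity and temperature, so it inherits both sets of flags.
conductivity_flags = ["", "Q", "I", ""]
temperature_flags = ["", "", "Q", "F"]
salinity_flags = ["", "", "", ""]
salinity_flags = propagate_flags(conductivity_flags, salinity_flags)
salinity_flags = propagate_flags(temperature_flags, salinity_flags)
print(salinity_flags)  # ['', 'Q', 'IQ', 'F']
```

Propagating per-column flags this way avoids writing a separate multi-column rule for every derived variable.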
Q/C flags can be visualized in the data editor grid and plots
Flagged values can be excluded from analyses and summarized
Flagged values can be selectively removed from data sets and filled
- Replace flagged/missing values using constants, equations, models
- Fill values from replicate sensors (coalesce)
- Interpolate using linear regression, splines, shape-preserving cubic Hermite
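A minimal Python sketch of the remove-and-fill sequence described above: flagged values are removed, gaps are filled from a replicate sensor where possible (coalesce), and remaining interior gaps are filled by piecewise linear interpolation between neighboring good values. The function names are hypothetical, and the interpolation here is the simplest of the options the slide lists, not the toolbox's own routine.

```python
from typing import List, Optional, Sequence

Value = Optional[float]


def mask_flagged(values: Sequence[Value], flags: Sequence[str]) -> List[Value]:
    """Replace every flagged value with None (treated as missing)."""
    return [None if f else v for v, f in zip(values, flags)]


def coalesce(primary: Sequence[Value], replicate: Sequence[Value]) -> List[Value]:
    """Fill gaps in the primary series with values from a replicate sensor."""
    return [p if p is not None else r for p, r in zip(primary, replicate)]


def interpolate_gaps(times: Sequence[float], values: Sequence[Value]) -> List[Value]:
    """Linearly interpolate interior gaps; leading and trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (times[i] - times[left]) / (times[right] - times[left])
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


times = [0.0, 1.0, 2.0, 3.0, 4.0]
temperature = [10.0, 99.0, 14.0, None, 18.0]
flags = ["", "I", "", "", ""]
replicate = [None, 12.5, None, None, None]

cleaned = mask_flagged(temperature, flags)   # [10.0, None, 14.0, None, 18.0]
cleaned = coalesce(cleaned, replicate)       # [10.0, 12.5, 14.0, None, 18.0]
cleaned = interpolate_gaps(times, cleaned)
print(cleaned)  # [10.0, 12.5, 14.0, 16.0, 18.0]
```

Coalescing before interpolating prefers a measured replicate value over an estimated one.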
Interactive tools for sensor drift correction
Harvest Manager
- Data processing and Q/C workflows can be run on a timed basis
- Harvest management tools for defining, starting, stopping workflows and viewing logs
- Demo workflows provided with toolbox
Workflow (diagram): Raw Data, Import Data, Add / Import Metadata, QA/QC Analysis, Post-Process, Synthesis, Archive / Publish, Products, Reports
Finalized data can be published to a DataONE member node (EDI, KNB, etc.) as an EML-described data package
Can also export data/metadata in a wide variety of formats for other repositories or local archiving
ODM Tools Python: Python application for working with time series data in a CUAHSI Observations Data Model (ODM) database
- Multi-platform support (Windows, Linux, Mac)
- Multi-database support (Microsoft SQL Server, MySQL, and PostgreSQL)
- Implements a scripting interface to save the provenance of data edits in the Q/C process
- Modern graphical user interface (GUI)
Horsburgh, Jeffery S.; Reeder, Stephanie; and Spackman Jones, Amber, "ODM Tools Python: Open Source Software For Managing Continuous Sensor Data" (2014). CUNY Academic Works.
Data querying and visualization
Data editing and visualization
Data Q/C analysis and visualization
Scriptable quality control editing
- Python code is generated automatically with each editing step
Revised data can be saved to the ODM database as a new time series processing level for the station
Generated scripts can be saved and re-run to reproduce the edited time series or used for similar data
Scripts document provenance of the data flagging and revision
Finalized data can be published to a CUAHSI Hydroserver and accessed via web services and CUAHSI tools (HydroDesktop)
Important Note: ODM Tools requires local installation of a legacy ODM database and does not connect to the latest CUAHSI Cloud Hydroserver platform
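The idea that a saved script both reproduces and documents the edits can be sketched generically in Python. This does not use or imitate the ODM Tools API: `EditSession`, `replay`, `add_offset`, and `clip_below` are hypothetical names invented for illustration, and the two edit operations are arbitrary examples.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Series = List[float]
Step = Tuple[str, Dict[str, float]]

# Registry of named edit operations (hypothetical examples).
OPERATIONS: Dict[str, Callable[..., Series]] = {}


def operation(func: Callable[..., Series]) -> Callable[..., Series]:
    """Register an edit operation under its function name."""
    OPERATIONS[func.__name__] = func
    return func


@operation
def add_offset(series: Series, offset: float) -> Series:
    return [v + offset for v in series]


@operation
def clip_below(series: Series, minimum: float) -> Series:
    return [max(v, minimum) for v in series]


class EditSession:
    """Apply edits to a copy of the data and record each step as it happens."""

    def __init__(self, series: Sequence[float]) -> None:
        self.series: Series = list(series)
        self.steps: List[Step] = []

    def apply(self, name: str, **kwargs: float) -> None:
        self.series = OPERATIONS[name](self.series, **kwargs)
        self.steps.append((name, kwargs))

    def script(self) -> str:
        """Render the recorded steps as readable text for documentation."""
        lines = []
        for name, kwargs in self.steps:
            args = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
            lines.append(f"{name}({args})")
        return "\n".join(lines)


def replay(series: Sequence[float], steps: Sequence[Step]) -> Series:
    """Re-run recorded steps on raw data to reproduce the edited series."""
    out = list(series)
    for name, kwargs in steps:
        out = OPERATIONS[name](out, **kwargs)
    return out


raw = [1.0, -0.5, 2.0]
session = EditSession(raw)
session.apply("add_offset", offset=0.25)
session.apply("clip_below", minimum=0.0)
print(session.series)  # [1.25, 0.0, 2.25]
print(session.script())
```

Replaying `session.steps` against the untouched `raw` list yields the same edited series, and the same steps can be applied to a similar series from another station.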
GCE Data Toolbox
Website: https://gce-svn.marsci.uga.edu/trac/gce_toolbox
Ref: Sheldon Jr., W.M., 2008. Dynamic, rule-based quality control framework for realtime sensor data. In: Gries, C., Jones, M.B. (Eds.), Proceedings of the Environmental Information Management Conference 2008: Sensor Networks, September 2008, pp. 145-150. http://gcelter.marsci.uga.edu/public/files/pubs/wsheldon_dynamic_qc_eimc2008_final.pdf
ODM Tools Python
Website: https://github.com/odm2/odmtoolspython
Ref: Horsburgh, J. et al. 2015. Open source software for visualization and quality control of continuous hydrologic and water quality sensor data. Environmental Modelling & Software, 70:32-44 (doi.org/10.1016/j.envsoft.2015.04.002)
Website: http://wiki.esipfed.org/index.php/envirosensing_cluster