Automated Data Quality Assurance for Marine Observations
James V. Koziana
Science Applications International Corporation (SAIC), Hampton, VA 23666 USA
Third Meeting of GCOOS DMAC
Renaissance Orlando Hotel, Forbes Place, Orlando, FL
23-24 February 2009
Outline
- Introduction
  - IOOS
  - State of Oceans (time and space)
- Data Quality Assurance System
  - Quality Assurance Pyramid
  - Science Data Challenges
- Quality Assurance System
  - Block diagram and discussion
- Results
- Conclusions
  - Present and Future Applications
  - Wrap-up
U.S. Integrated Ocean Observing System (IOOS)
Vision: Lead the way in the provision of products and services based on ocean observations for a wide range of societal benefits.
Goal: Achieve unprecedented levels of resolution, quality, and distribution of all global and coastal ocean observations to improve predictions of ecosystem, weather and water, and climate events.
Vice Admiral Lautenbacher, Jr., U.S. Navy (Ret.), Under Secretary of Commerce for Oceans & Atmosphere; November 21, 2005

IOOS Requirements
- The vision for observing systems will bring streams of real-time data from a distributed sensor system.
- Each data provider will prepare their data using Data Management and Communications (DMAC) standards and protocols.
State of Oceans & Coasts Varies Across Time and Space

IOOS Core Variables

Geophysical
1. Sea surface meteorological variables
2. Stream flows
3. Sea level
4. Surface waves, currents
5. Ice distribution
6. Temperature, salinity
7. Bathymetry

Biophysical
1. Optical properties
2. Benthic habitats

Chemical
1. pCO2
2. Dissolved inorganic nutrients
3. Contaminants
4. Dissolved oxygen

Biological
1. Fish species, abundance
2. Zooplankton species, abundance
3. Phytoplankton species, biomass (ocean color)
4. Waterborne pathogens
Data Quality
Data quality refers to the fitness of data for use. Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J.M. Juran). Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer. These two views can often be in disagreement, even about the same set of data used for the same purpose.

Quality Assurance (QA) and Quality Control (QC)
Quality assurance (QA): an integrated system of activities involving planning, quality control, quality assessment, reporting and quality improvement to ensure that a product or service meets defined standards of quality with a stated level of confidence.
Quality control (QC): the overall system of technical activities whose purpose is to measure and control the quality of a product or service so that it meets the needs of users. The aim is to provide quality that is satisfactory, adequate, dependable, and economical.
Source: Quality Assurance Division, National Center for Environmental Research and Quality Assurance, Office of Research and Development, U.S. Environmental Protection Agency; December 10, 1997
Quality Checks
Static: single-station, single-time checks that locate outliers in the observations. These checks are unaware of the previous or current meteorological or hydrological situation described by other observations and grids.
- Validity
- Internal consistency
- Vertical consistency
Dynamic: checks that derive QC information by taking advantage of other available hydrological information.
- Position consistency
- Temporal consistency
- Spatial consistency
Data descriptor: a single-character descriptor for each observation provides an overall opinion of quality by combining the information from the various quality checks. The algorithms used to compute the data descriptors are functions of the type and sophistication of the QC checks applied to the observations:
- Level 1: least sophisticated
- Level 2: medium
- Level 3: most sophisticated
Quality Assurance Pyramid
Begin at the bottom of the pyramid and work upward; accuracy increases toward the top. Algorithms near the bottom are more concrete and work on a wider variety of data; algorithms near the top are more conceptual and work on a narrower variety of data.
From bottom to top:
1. Range limit checks (dynamic, seasonal and regional)
2. Rate checks (time continuity)
3. Inter-comparisons (same sensor, same platform)
4. Inter-comparisons (nearby similar sensor/platform)
5. Inter-comparisons (dissimilar sensor/platform)
6. Comparison with statistical trends
7. Comparison with remotely sensed data
8. Comparison with model
Science Data Lifecycle (System Development → Deployment/O&M → Exploitation)
- Sensor development; platform development; platform operations; platform maintenance
- Basic research → enhanced understanding → algorithm development → algorithm implementation
- Data acquisition → processing → data center operations → QA → archiving → dissemination
- Data center development; model development → model implementation → model runs → forecasts
- Applications; decision support; enhanced understanding
Science Data Management Challenges
Data management systems are an important component of Earth Science missions:
- They support the primary mission goal of timely delivery of high-quality data products to the science community.
- They are expensive and time consuming to develop and maintain.
Relative size of data management code vs. science code:
- Data management functions represent 60%-80% of the code for typical science applications.
- The science team needs to be able to allocate their time and resources to science, not data management.
Continuous change is inherent in research environments:
- The highly iterative nature of algorithm development drives numerous changes to code during development and after launch.
Problems with the traditional stovepipe development approach:
- Data management code that is tightly coupled to algorithms and data products can be time consuming to modify, as changes ripple throughout the code.
- This leads to a stovepipe approach that results in significant duplication of code and effort.
Data Quality Assurance System for Earth Science Data and Information
A scalable, modular system that can be used to address various methods of characterizing the quality of data products. This approach facilitates science software development:
- Reduces the level of effort required and program risk (cost effective)
- Allows the data management team to be more responsive to science algorithm developers (flexibility)
The system is designed to:
- Include substantial core functionality that is common to any science application
- Be easily configurable to work with many different data sets (observations and model output)
- Readily accommodate algorithm and data product additions and modifications with minimal code changes
- Balance flexibility and performance
Data Quality Assurance System Block Diagram
Framework components:
- Input Subsystem: reads the input data flow (NetCDF, HDF, SensorML, XML, ASCII, database)
- Control Subsystem: directs the data flow through the framework
- Output Subsystem: writes the output file (NetCDF, HDF, SensorML, XML, ASCII, database)
- Data Store: built from the common data structure components
- Algorithm Library: user-supplied QC algorithms (Algorithm QC 1, QC 2, QC 3, ...)
- Run-time defined configuration files for each subsystem
Input Subsystem
A configuration file defines the format and content of the input data products. The subsystem:
- Reads the configuration file to identify the input parameters
- Reads a data record from the file and parses it into discrete parameters
- Attaches the parameters to the data structure
Example: an upstream subsystem adds a parameter to its output product that must be included in the current application's output.
- The traditional approach requires modifying the input file reader, the data structures, and the output file writer.
- This QA approach requires only modifying the ASCII input configuration file.
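The config-driven reader described above can be sketched as follows. This is a hypothetical illustration, not the system's actual code: the config syntax (`name:type` lines) and the function names are assumptions.

```python
# Hypothetical sketch of a config-driven input reader: the ASCII
# configuration file names each column, so adding a parameter to the
# input product only requires a new config line, not new parser code.

def read_config(cfg_lines):
    """Parse lines like 'station:str' into (name, cast) pairs."""
    types = {"str": str, "float": float, "int": int}
    return [(name, types[t]) for name, t in
            (line.split(":") for line in cfg_lines if line.strip())]

def read_record(line, schema):
    """Parse one whitespace-delimited data record into named parameters."""
    fields = line.split()
    return {name: cast(value)
            for (name, cast), value in zip(schema, fields)}

schema = read_config(["station:str", "parameter:str", "value:float"])
rec = read_record("42007 ATMP1 29.0", schema)
print(rec)  # {'station': '42007', 'parameter': 'ATMP1', 'value': 29.0}
```

Because the schema lives in the configuration file, a new input parameter is a one-line config change rather than a code change.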
Output Subsystem
A configuration file defines the format and content of the output data products; new formats [ASCII, HDF, SensorML] can be added without redefining the QA system. The subsystem:
- Retrieves parameters by name from the common data structures
- Formats the data for a specific output product
- Writes the data to the output file
Example: add a parameter to the output product (a new configuration element).
- The traditional approach requires modifying an algorithm, the data structures, and the output file writer.
- This QA approach requires modifying an algorithm and adding a line to the ASCII configuration file.
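The name-based retrieval in the output path can be sketched like this. Again a hypothetical illustration: the data-store layout (a simple dict) and the function name are assumptions.

```python
# Hypothetical output-subsystem sketch: the output configuration file
# lists which parameters to write and in what order; the writer pulls
# them by name from the common data store, so adding a field to the
# output product is a one-line config change plus the algorithm edit.

def write_record(data_store, output_cfg):
    """Format one output record from the named parameters."""
    return " ".join(str(data_store[name]) for name in output_cfg)

store = {"station": "42007", "time": "6/25/2008 0:00:00",
         "parameter": "ATMP1", "value": 29.0, "flag": "0"}
cfg = ["station", "time", "parameter", "value", "flag"]
print(write_record(store, cfg))
# 42007 6/25/2008 0:00:00 ATMP1 29.0 0
```

Dropping `"flag"` from `cfg` removes the column without touching the writer, which is the point of the configuration-driven design.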
Control Subsystem
- Controls the data flow of the quality control processing
- Provides the linkage between the QA subsystems
- Balances the flexibility and performance of the QA system
- Readily accommodates algorithm and data product additions with minimal code changes
- Activities of the subsystem are controlled by a configuration file
Example: perform a range limit check and a time continuity check on sea surface temperature.
- The configuration file is set up to execute the two specific data quality algorithms.
- The range limit check compares each value with pre-established thresholds.
- The time continuity check examines the range of the parameter over time, based on the appropriate limits provided.
- The data and aggregate quality flag information are put into the Data Store.
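The two-check example above can be sketched as a config-driven pipeline. This is a hypothetical illustration: the flag letters follow the deck's convention (0 = pass, L = failed limits check, V = failed time continuity), but the thresholds, function names, and aggregation rule are assumptions.

```python
# Hypothetical control-subsystem sketch: the configuration file names
# the QC algorithms to run; the controller executes them in order and
# aggregates the per-check flags into one flag per observation.

def range_check(values, lo, hi):
    return ["0" if lo <= v <= hi else "L" for v in values]

def time_continuity_check(values, max_step):
    flags = ["0"]  # first value has no predecessor to compare against
    flags += ["0" if abs(b - a) <= max_step else "V"
              for a, b in zip(values, values[1:])]
    return flags

def run_pipeline(values, config):
    # June air-temperature limits (24.3-31.3 C) and a 5 C/hour step
    # are illustrative thresholds, not the system's configured values.
    checks = {"range": lambda v: range_check(v, 24.3, 31.3),
              "continuity": lambda v: time_continuity_check(v, 5.0)}
    flag_sets = [checks[name](values) for name in config]
    # aggregate: report the first failing check for each observation
    return [next((f for f in flags if f != "0"), "0")
            for flags in zip(*flag_sets)]

temps = [28.2, 26.6, 64.5, 37.5, 35.5, 28.2]
print(run_pipeline(temps, ["range", "continuity"]))
# ['0', '0', 'L', 'L', 'L', 'V']
```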
Data Store
Implemented using a common data structure: application-specific data structures are assembled from standard building blocks.
- Subsystem elements are generic and configurable; the architectural approach facilitates changing them on the fly.
- The underlying data structures allow the addition of parameters or new algorithms.
Parameters are retrieved by name:
- The subsystems and the algorithm library can query the data structures and retrieve the necessary information.
- Internal data sets are represented by parameter classes that no longer correspond to the input and/or output data formats.
- Parameter classes correspond to computer language-specific data types (e.g., character, unsigned character, short, unsigned short, etc.).
- Data processing algorithms are easy to implement, as they do not need code specific to the original input format or the ultimate output format.
- Changes to algorithms are localized, with minimal changes to the input, control, data store and output modules.
Common Data Structures
Application-specific data structures are assembled from standard building blocks:
- Can be used to model many different instruments
Facilitates changing the underlying data structures:
- Parameters can be added on the fly
Facilitates adding new algorithms and new parameters to the application:
- Localizes changes to algorithms and minimizes changes to the input, data structure, and output modules
Parameters can be retrieved by name:
- Allows other modules (framework and algorithms) to query the data structures and retrieve the required information
- Allows other framework components to be generic and configurable
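A minimal sketch of the building-block idea, assuming nothing about the real implementation beyond what the slides state (generic parameter objects, name-based retrieval, run-time additions); the class and method names are hypothetical.

```python
# Hypothetical sketch of the common data structure: parameters are
# generic named building blocks, so framework modules can retrieve
# them by name, and new parameters can be added on the fly without
# changing the input, control, or output modules.

class Parameter:
    def __init__(self, name, dtype, values=None):
        self.name = name
        self.dtype = dtype            # language-level type, e.g. float
        self.values = values if values is not None else []

class DataStore:
    def __init__(self):
        self._params = {}

    def add(self, param):             # add a parameter at run time
        self._params[param.name] = param

    def get(self, name):              # retrieve a parameter by name
        return self._params[name]

store = DataStore()
store.add(Parameter("ATMP1", float, [29.0, 28.9]))
store.add(Parameter("flag", str))     # added later, no code changes
print(store.get("ATMP1").values)      # [29.0, 28.9]
```

Because every module goes through `get(name)`, none of them depends on the layout of any particular input or output format.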
Algorithm Library
Implemented using a modular architecture that encapsulates algorithms into decoupled, reusable modules while providing the mechanism for assembling them into a working system.
- A library of reusable validation algorithms: the algorithms are implemented as generic, reusable modules, based on NDBC Technical Document 03-02, that can be configured to work with parameters from any source.
- The library comprises data quality algorithms that range from fairly simple rate and limit checks to sophisticated comparisons that exploit the relationships between parameters.
- Some data providers will require more sophisticated algorithms that provide the higher level of accuracy afforded by a specific configuration of sensors; such algorithms can be easily added to the algorithm library.
- For this paper we have configured rate limit check and time continuity check algorithms to process several marine parameters.
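One common way to realize "decoupled, reusable modules assembled into a working system" is a registry that the framework consults by name. The sketch below is an assumption about how such a library could be wired, not the system's actual mechanism; the registry and decorator names are hypothetical.

```python
# Hypothetical algorithm-library sketch: each QC algorithm registers
# itself under a name, so the framework can assemble a working system
# from whichever checks the configuration file requests, without the
# framework code knowing any algorithm in advance.

QC_REGISTRY = {}

def qc_algorithm(name):
    """Decorator that registers a QC check under a config-file name."""
    def register(fn):
        QC_REGISTRY[name] = fn
        return fn
    return register

@qc_algorithm("range_limit")
def range_limit(values, lo, hi):
    return ["0" if lo <= v <= hi else "L" for v in values]

@qc_algorithm("time_continuity")
def time_continuity(values, max_step):
    return ["0"] + ["0" if abs(b - a) <= max_step else "V"
                    for a, b in zip(values, values[1:])]

# the framework looks algorithms up by the names in the config file
check = QC_REGISTRY["range_limit"]
print(check([28.4, 64.5], 24.3, 31.3))  # ['0', 'L']
```

Adding a new, more sophisticated check is then just another decorated function; no framework code changes.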
Example Run: NDBC Air Temperature Data, Buoy 42007, 6/25/08

Invocation:
Dqa_app input_data input_data_cfg input_limits input_limits_cfg output_data output_data_cfg

The input, output and control configuration files define the file type, data parameters and data set dimensions; the Control Subsystem (i.e., data flow) applies the quality assurance procedures: QA range checks and time continuity checks (hard and soft flags).

Quality-checked data for buoy 42007, 6/25/08. The input records are identical apart from the appended quality flag (0 = passed, L = failed limits check, V = failed time continuity check):

42007 6/25/2008  0:00:00 ATMP1 29.0 0
42007 6/25/2008  1:00:00 ATMP1 29.0 0
42007 6/25/2008  2:00:00 ATMP1 28.9 0
42007 6/25/2008  3:00:00 ATMP1 29.0 0
42007 6/25/2008  4:00:00 ATMP1 29.1 0
42007 6/25/2008  5:00:00 ATMP1 28.7 0
42007 6/25/2008  6:00:00 ATMP1 28.8 0
42007 6/25/2008  7:00:00 ATMP1 28.9 0
42007 6/25/2008  8:00:00 ATMP1 28.5 0
42007 6/25/2008  9:00:00 ATMP1 28.5 0
42007 6/25/2008 10:00:00 ATMP1 28.4 0
42007 6/25/2008 11:00:00 ATMP1 28.2 0
42007 6/25/2008 12:00:00 ATMP1 28.3 0
42007 6/25/2008 13:00:00 ATMP1 28.2 0
42007 6/25/2008 14:00:00 ATMP1 26.6 0
42007 6/25/2008 15:00:00 ATMP1 64.5 V
42007 6/25/2008 16:00:00 ATMP1 37.5 V
42007 6/25/2008 17:00:00 ATMP1 35.5 L
42007 6/25/2008 18:00:00 ATMP1 28.2 0
42007 6/25/2008 19:00:00 ATMP1 28.2 0
42007 6/25/2008 20:00:00 ATMP1 28.3 0
42007 6/25/2008 21:00:00 ATMP1 28.4 0
42007 6/25/2008 22:00:00 ATMP1 28.6 0
42007 6/25/2008 23:00:00 ATMP1 28.4 0
Source of Data
- Western Gulf of Mexico Recent Marine Data: http://ndbc.noaa.gov/maps/westgulf.shtml
- Louisiana/Mississippi Coastal Region Recent Marine Data: http://ndbc.noaa.gov/maps/westgulf_inset.shtml
Regional and Seasonal Limits, Central Gulf of Mexico (NDBC Technical Document 03-02)

Parameter    JAN     FEB     MAR     APR     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
BARO MAX   1029.3  1028.4  1026.7  1025.3  1022.4  1021.4  1022.2  1021.1  1020.1  1022.7  1026.2  1028.1
BARO MIN   1010.3  1007.7  1007.9  1008.2  1009.8  1010.1  1012.5  1011.1  1008.5  1008.3  1010.2  1011.0
ATMP MAX     27.9    27.9    28.5    28.9    30.0    31.3    32.1    32.1    31.5    30.6    29.3    28.1
ATMP MIN     12.7    13.2    14.6    18.1    21.6    24.3    25.5    25.5    24.6    21.2    16.9    14.4
WTMP MAX     28.2    28.1    28.3    28.5    29.4    30.6    31.5    31.8    32.7    30.5    29.5    28.7
WTMP MIN     18.5    17.8    18.3    18.0    23.1    25.8    27.2    27.6    25.5    24.9    22.2    20.3
WDIR MAX    360.0   360.0   360.0   360.0   360.0   360.0   360.0   360.0   360.0   360.0   360.0   360.0
WDIR MIN      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
WSPD MAX     14.5    14.2    13.9    12.6    11.3    10.3     9.4     9.8    12.8    13.4    13.6    13.8
WSPD MIN      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
WVHGT MAX     3.0     3.0     2.9     2.5     2.2     2.0     1.7     1.8     2.7     2.7     2.7     2.8
WVHGT MIN     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
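Applying the table above amounts to indexing the limits by month before running the range check. A minimal sketch; the ATMP values are copied from the table, but the function name and interface are assumptions.

```python
# Hypothetical sketch of a seasonal/regional range check: a
# month-indexed table (air temperature row for the central Gulf of
# Mexico, from NDBC Technical Document 03-02) drives the check, so
# the limits vary with season without changing the check code.

ATMP_MAX = [27.9, 27.9, 28.5, 28.9, 30.0, 31.3,
            32.1, 32.1, 31.5, 30.6, 29.3, 28.1]
ATMP_MIN = [12.7, 13.2, 14.6, 18.1, 21.6, 24.3,
            25.5, 25.5, 24.6, 21.2, 16.9, 14.4]

def seasonal_range_check(value, month):
    """Return '0' if value is within the month's limits, 'L' otherwise."""
    lo, hi = ATMP_MIN[month - 1], ATMP_MAX[month - 1]
    return "0" if lo <= value <= hi else "L"

print(seasonal_range_check(28.4, 6))   # 0 (within the June limits)
print(seasonal_range_check(64.5, 6))   # L (above the June maximum)
```

The same table-driven pattern extends to the other parameter rows (BARO, WTMP, WSPD, WVHGT) by swapping in their limit arrays.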
[Time-series plot of the quality-checked observations, with flagged points marked V and L]
Note: L = failed limits check; V = failed time continuity test.
Buoy 42007 During Hurricane Katrina, August 23-29, 2005
[Plot of wind speed vs. time (hours) against the seasonal/regional max and min limits; a storm flag assigns soft flags to observations above the seasonal/regional max limit.]
Image: http://visibleearth.nasa.gov/view_rec.php?id=7938
Present and Future Applications
The QA system is well-suited for a diverse science community employing many different methods to characterize the quality of their data products.
- Allows data providers from single stations to large-scale operations (i.e., observational and model) to perform automatic quality assurance on their data products.
- The QA system processes a real-time data stream into a high-quality data product with associated metadata and aggregated quality flags.
Expanding the QA system:
- Address additional input/output data types.
- Enhance the algorithm library by developing additional quality control algorithms:
  - simple quality tests (e.g., storm limits, time continuity, internal consistency and others)
  - higher-order algorithms that exploit the relationships between sensors and parameters
  - more sophisticated algorithms that provide the higher level of accuracy afforded by a specific configuration of sensors
- Explore applying data mining to quality control.
- Define the interface to the QA system for data providers.
- Use state-of-the-art visualization and analysis tools to monitor the real-time data streams, and study how users analyze data to determine the root causes of problems and perform editing.
Wrap-Up
A scalable, modular Data Quality Assurance System:
- The QA Algorithm Library is extendable to other data parameters by changing ASCII configuration files.
- Reduces the level of effort required (cost effective).
- Is easily configurable to work with many different data sets (observations and model output).
- Readily accommodates algorithm and data product additions and modifications with minimal code changes.
- QA was performed at the same confidence level (range limit and time continuity checks).
Initial validation testing with NDBC products:
- Limited set of daily air temperature (24 hours, station 42007) and wind speed (3 days, station 42001) data.
- Air temperature: 3 hard flags. Wind speed: 48 soft flags.
First Quality Assurance of Real-Time Ocean Data (QARTOD) Workshop
- Defined minimum standards for QA/QC of real-time observations.
- Flags and tests are documented in metadata.
- Each observing system should decide its own best method for data description, delivery, and testing.
- No storage format, transport, or metadata standards were recommended.
Additional Charts
How to Update the Input Files
1. General upper limit
2. General seasonal limit
3. Seasonal upper limit
4. Seasonal lower limit
5. Time continuity general limit
6. Time continuity seasonal limit

Plotting Capabilities
1. Plotting input data
2. Plotting output data
3. Google maps