HANDLING PUBLICLY GENERATED AIR QUALITY DATA
PETE TENEBRUSO & MIKE MATSKO
MARCH 8TH, 2017
EXAMPLES OF DEP DATA AND CROWDSOURCING
- Storm Readiness
- Beach Assessments
- Park Closings
- Emergency Management
- Social Media
- Watershed Ambassadors
- ARMS
AIR INFORMATION MANAGEMENT SYSTEM (ARMS)
- There are 286 monitors at 40 air quality stations, tracking 35 distinct parameters (continuous and non-continuous)
- Monitored parameters include NOx, NO, SO2, PM2.5, NO2, CO, O3, wind speed, wind direction, and temperature
- Approximately 150,000,000 continuous, minute-level air quality data points are collected every year
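As a rough check on the yearly figure above, the arithmetic can be sketched as follows. It assumes, purely for illustration, that all 286 monitors report one value every minute for a full non-leap year; the slide itself notes that some parameters are non-continuous, so this is an approximation rather than the agency's own calculation.

```python
# Assumption (not from the slides): every monitor reports once per minute,
# all year. Some parameters are non-continuous, so treat this as a rough check.
MONITORS = 286
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes in a non-leap year


def yearly_minute_points(monitors: int, minutes_per_year: int = MINUTES_PER_YEAR) -> int:
    """Return the number of minute-level data points collected in one year."""
    return monitors * minutes_per_year


total = yearly_minute_points(MONITORS)
print(f"{total:,}")  # 150,321,600
```

Under that assumption the result is consistent with the "approximately 150,000,000" stated on the slide.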
ARMS SITES & MONITORS
[Figure: NJDEP Air and Radiation Monitoring System network diagram, including the backup wireless polling system. Prepared by Harry Chen, 12/1/2010.]
- Site types: regular Air, Air FRM, Air TEOM, and CREST wireless sites (Digi wireless routers, Verizon air cards, Verizon wireless network); CREST leased line sites; Air FRM and Air TEOM phone line sites
- Backup: phone line modems and MOXA NPort modem banks
- Core layer: primary database and master comm. center & FTP (clustered), standby database and standby comm. center & FTP, slave comm. center & FTP, Oracle Observer (manages fast-start failover), Envista comm center and database
- Locations: primary NJ Air/Rad Monitoring System at Troop C; standby NJ Air/Rad Monitoring System at 401 East State Street
- Public access layer: WWW.NJAQINOW.NET, hosted by Envitech and updated via FTP from the ARMS comm center
- Green devices represent future projects
CROWDSOURCING
- A specific sourcing model in which individuals or organizations use contributions from Internet users to obtain needed services or ideas
- Examples: Amazon Mechanical Turk, Kickstarter, Wikipedia
BACKGROUND
- Massive data deluge in recent years
- 80% of the world's data is unstructured (images, videos, raw text, etc.)
- Algorithms to fully comprehend unstructured data have not been developed yet
- Many experts believe we are at least several decades away from this goal
CONSIDERATIONS OF USING INFO FROM 'CROWDS'
- Crowds can disseminate both valid and invalid information
- Crowds often have no immediate way to discern truth from falsehood
- Crowds are prone to add opinion to data, and the opinion sometimes sticks more than the credible data themselves
- Separating opinion from credible data through expert interpretation and curation, both centralized and decentralized, is important
- Very few organizational or procedural channels specify how to aggregate and incorporate crowd information in decision making
- Better information is needed, not necessarily more monitors
INTEGRATING EXPERTS, CROWDS, & ALGORITHMS.
CROWDSOURCING CONCERNS
- How to solicit users
- What they can contribute
- How to combine their contributions
- How to manage quality, open versus closed worlds, query semantics, query execution, optimization, and user interfaces
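One simple way to approach "how to combine their contributions" is a robust aggregate such as the median, which is less affected by a single bad reading than the mean. The sketch below is only an illustration: the function name `combine_reports`, the `min_reports` threshold, and the sample PM2.5 values are all assumptions, not something taken from the presentation.

```python
from statistics import median


def combine_reports(readings, min_reports=3):
    """Combine crowd readings for one place and time into a single value.

    Returns the median, or None when there are too few reports to trust.
    The threshold of 3 is an arbitrary illustrative choice.
    """
    if len(readings) < min_reports:
        return None
    return median(readings)


# Made-up PM2.5 readings in ug/m3; 95.0 stands in for a faulty sensor.
reports = [12.1, 11.8, 12.4, 95.0, 12.0]
print(combine_reports(reports))      # 12.1
print(combine_reports([12.1, 95.0]))  # None (too few reports)
```

The outlier barely moves the median, whereas the mean of the same five readings would be pulled well above every plausible value.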
BENEFITS OF MACHINE LEARNING
- Feature extraction: interpreting text to infer the time, location, people, etc. referred to in it
- Classification: classifying, grouping, or tagging information based on some explicit or unknown criteria
- Clustering: machines can process vast amounts of data and present correlations and proximities that escape the human eye and brain, sometimes discovering non-obvious correlations between variables
- With large amounts of data available, it is not even necessary to have a deep understanding of the relationships within the data themselves: machines can, on their own, separate the relevant correlations from the noise through successive optimization
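To make the feature extraction item concrete, here is a deliberately simple, rule-based sketch that pulls a time of day and a few pollution-related keywords out of a free-text report. It uses regular expressions rather than a trained model, and the keyword list, the function name `extract_features`, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical keyword list; a real system would need a far richer vocabulary.
POLLUTION_KEYWORDS = ("smoke", "ozone", "dust", "odor")

# Matches times such as "4:30 PM" or "11:05am".
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*(am|pm)\b", re.IGNORECASE)


def extract_features(text):
    """Extract a 24-hour time string and known keywords from a text report."""
    time_24h = None
    match = TIME_RE.search(text)
    if match:
        hour = int(match.group(1)) % 12  # 12 AM -> 0, 12 PM -> 0 (+12 below)
        if match.group(3).lower() == "pm":
            hour += 12
        time_24h = f"{hour:02d}:{int(match.group(2)):02d}"
    lowered = text.lower()
    keywords = [word for word in POLLUTION_KEYWORDS if word in lowered]
    return {"time": time_24h, "keywords": keywords}


print(extract_features("Heavy smoke and dust near the park at 4:30 PM"))
# {'time': '16:30', 'keywords': ['smoke', 'dust']}
```

A machine learning approach would replace the fixed rules with a model trained on labeled reports, but the input and output would have the same general shape.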
MACHINE LEARNING SHORTCOMINGS
- Algorithms are more specific than sensitive, meaning that important signals may be missed (false negatives)
- A combination of algorithms is important to draw different types of events and event features from undifferentiated data; understanding which algorithms to use, through experience, is essential
- Algorithms need to be thoroughly validated, tested, and reassessed
- Algorithms need data to train and feedback to learn; out-of-the-box value is difficult to obtain
- Human factor: users grow lazy over time as they gain experience with accepted algorithms, and over-dependency and improper cross-checks of an algorithm's results may result in missed or misinterpreted signals
- Low social acceptance of systems that do not function in a way that is predictable or describable
- Past misuse of machine learning has led users to fear and distrust algorithms that run without some human interaction
- Should an algorithm declare a health emergency, or should it present data to an expert or authority with 'suggestions' and 'red flags', so that the authority can declare a health emergency?
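The phrase "more specific than sensitive" can be illustrated with the standard definitions: sensitivity is the share of real events the algorithm catches, and specificity is the share of non-events it correctly ignores. The counts below are invented for illustration only.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real events that were detected: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-events correctly left unflagged: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Made-up evaluation counts for a hypothetical event detector.
tp, fn, tn, fp = 30, 20, 940, 10

print(f"sensitivity = {sensitivity(tp, fn):.3f}")  # sensitivity = 0.600
print(f"specificity = {specificity(tn, fp):.3f}")  # specificity = 0.989
```

In this made-up case the detector rarely raises a false alarm, yet it misses 20 of 50 real events, which is exactly the false-negative risk the slide describes.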
STANDARDIZATION NEEDED FOR INTEROPERABILITY
- Interoperability challenges with data formats, service interfaces, semantics, and measurement uniformity
- Broad usage of open sensor standards is needed
- The Sensor Web Enablement (SWE) initiative by the OGC (Open Geospatial Consortium) seeks to provide open standards and protocols for enhanced operability within and between multiple platforms and vendors. It aims to make sensors discoverable, queryable, and controllable over the Internet.
- Currently, the SWE family consists of seven standards:
  - Sensor Model Language (SensorML): XML schemas for defining the geometric, dynamic, and observational properties of a sensor. Accommodates sensor discovery, processing and analysis of the retrieved data, and geolocation of observed values.
  - Observations & Measurements (O&M)
  - Transducer Model Language (TML): generally speaking, TML can be understood as O&M's counterpart for streaming data, providing a method and message format that describe how to interpret raw transducer data.
  - Sensor Observation Service (SOS): provides a service to retrieve measurement results from a sensor or a sensor network.
STANDARDIZATION CONTINUED
- Sensor Planning Service (SPS): provides a standardized interface for collection assets and aims at automating complex information flows in large networks.
- Sensor Alert Service (SAS): interfaces that enable sensors to advertise and publish alerts, including the associated metadata.
- Web Notification Service (WNS): enables one-way and two-way message exchange with other services. This is especially useful when several services are needed to fulfill a client's request, or when a response is possible only after a considerable delay.
SENSOR OBSERVATION SERVICE (SOS)
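As a hedged illustration of how a client might ask an SOS server for measurements, the sketch below builds a GetObservation request URL in the key-value-pair style defined for SOS 2.0. The endpoint, offering identifier, and observed-property URI are hypothetical placeholders, and the example assumes the server supports the KVP binding; it only constructs the URL and does not contact any service.

```python
from urllib.parse import urlencode


def build_get_observation_url(endpoint, offering, observed_property, start, end):
    """Build an SOS 2.0 GetObservation URL (KVP style) for a time window.

    `start` and `end` are ISO 8601 timestamps. The server is assumed to
    accept the KVP binding; all identifiers passed in are placeholders.
    """
    params = {
        "service": "SOS",
        "version": "2.0.0",
        "request": "GetObservation",
        "offering": offering,
        "observedProperty": observed_property,
        "temporalFilter": f"om:phenomenonTime,{start}/{end}",
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and identifiers, for illustration only.
url = build_get_observation_url(
    endpoint="https://sos.example.org/service",
    offering="station-01-pm25",
    observed_property="http://example.org/properties/PM2.5",
    start="2017-03-01T00:00:00Z",
    end="2017-03-08T00:00:00Z",
)
print(url)
```

The response would normally be an O&M document, which the client then parses to obtain the individual observations.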
NEED A GOOD PLAN
- What are you trying to do, and what's the value of this data?
- What's the approach?
  - Selecting location and placement
  - Collecting
  - Quality control
  - Sensor maintenance
  - Data review
  - Data validation
  - Issues (interference and drift)
  - Analyze, interpret, communicate results
- QA/QC
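The quality control and drift items in the plan above could start with checks as simple as the following sketch. It flags out-of-range values and "stuck" sensors, and estimates bias against a co-located reference monitor. The thresholds, function names, and sample values are illustrative assumptions, not an established QA/QC procedure.

```python
def qc_flags(values, low, high, max_repeat=3):
    """Flag each reading as 'range', 'stuck', or 'ok'.

    'range': outside [low, high]; 'stuck': the same value has repeated
    max_repeat or more times in a row. Thresholds are illustrative.
    """
    flags = []
    run = 1
    for i, value in enumerate(values):
        run = run + 1 if i > 0 and value == values[i - 1] else 1
        if value < low or value > high:
            flags.append("range")
        elif run >= max_repeat:
            flags.append("stuck")
        else:
            flags.append("ok")
    return flags


def mean_bias(sensor, reference):
    """Average (sensor - reference) difference; a growing bias suggests drift."""
    diffs = [s - r for s, r in zip(sensor, reference)]
    return sum(diffs) / len(diffs)


readings = [10.0, 11.0, -3.0, 12.0, 12.0, 12.0, 500.0]  # made-up values
print(qc_flags(readings, low=0.0, high=300.0))
# ['ok', 'ok', 'range', 'ok', 'ok', 'stuck', 'range']
print(mean_bias([12.0, 14.0, 16.0], [10.0, 11.0, 12.0]))  # 3.0
```

Flagged readings would still go to data review and validation rather than being discarded automatically.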
SENSOR CONSIDERATIONS
- Low cost
- Varying reliability, quality, and accuracy
- Questionable maintenance and calibration
- Pollutants measured (ozone, PM, volatiles)
- Location and placement: fixed/mobile, inside/outside, below/above ground
- IoT security of devices
DATA MANAGEMENT CONSIDERATIONS
- Several existing repositories: would not want to replicate them
- DEP has experience in managing large sets of data, but not at this potential scale
- Large cost of managing data (infrastructure, tools, etc.)
- Leverage existing real-time and historical APIs
- Separation of local, state, and nationwide data
- Integration and analysis with existing state data