Discovery of the Source of Contaminant Release


Devina Sanjaya, Henry Qin

1 Introduction

The ability of computers to model contaminant release events and predict the source of release in real time is crucial in many applications, especially environmental safety monitoring and homeland security. In the event of an unintentional industrial accident or a biological attack in an urban environment, an immediate, accurate response is required. A real-time computer program that identifies the contaminant, locates the source of the release, and predicts the subsequent path of contamination can assist the decision-making process for evacuation and countermeasures.

The contaminant source inversion problem involves intricate geometry, uncertain flow conditions, and limited, noisy sensor readings. Moreover, the problem is generally ill-conditioned in the sense that small changes in the sensor readings can cause large changes in the calculated source of release [2]. This makes single-point deterministic calculations not robust: some inputs may produce nearly the same outputs, especially once measurement error is taken into account. Statistical approaches increase robustness, but they often require large samples and numerous forward simulations, which quickly become computationally expensive. Multiple previous studies have sought to reduce this computational cost, for example through grid coarsening, reduced-order modeling, and stochastic expansions. Previous studies have also applied uncertainty quantification methods to analyze the propagation of input uncertainties.

In this paper, we combine machine learning models and computational fluid dynamics to discover the source of release for large-scale problems in real time. Various machine learning models are evaluated for robustness to noisy sensor readings and limited training data. We also compare our results with statistical results from Markov chain Monte Carlo (MCMC) with a single walker [3] and with ensemble walkers [1].

2 Data Format

Our training and test data are obtained using the computational fluid dynamics software XFLOW, developed by Dr. Fidkowski at the University of Michigan, Ann Arbor. For the 2D case, we simulate a contaminant release around cross sections of buildings (see Figure 1, left). Five sensors are placed around the buildings in a pseudorandom fashion, without iteration or tuning. Each sensor takes three readings spaced equally in time, for a total of 15 sensor readings. A spatial approximation order of p = 2 is used, and the Peclet number for the simulations, based on the mean velocity and the domain size in the x-direction, is Pe = 100. We use the 15 sensor readings as the input features to our machine learning models, and X and Y as the output features. Each forward simulation used to obtain sensor readings completes in less than 1 minute when parallelized across 8 processors.

Figure 1: Mesh and setup used for the CFD simulations in the 2D (left) and 3D (right) cases. Sensor readings from both cases are used as our training and test sets.

For the 3D case, we simulate a contaminant release in a realistic urban area (see Figure 1, right); this domain is the same as in Lieberman et al. [4]. Ten sensors are placed around the buildings, again in a pseudorandom fashion without iteration or tuning. Each sensor provides 4 readings, for a total of 40 sensor readings. A spatial approximation order of p = 1 was used, and the Peclet number for the simulations, based on the mean velocity and the domain extent in the direction of the velocity, was Pe = 50. Our input features are the 40 sensor readings, and our outputs are X, Y, Z, and Amplitude. Each forward simulation takes about 8 minutes on 100 processors.

3 Implementation & Discussion

In this section, we discuss how we apply machine learning models to our problem. All of our models are trained using the statistical programming language R or using Matlab. For the purposes of the discussion below, we assume that we are only attempting to predict X, since predicting the other variables (Y in the 2D case; Y, Z, and Amplitude in the 3D case) is symmetric. Moreover, to create realistic test cases, sensor errors are considered; we consider both uniform and Gaussian error distributions. Due to time constraints, we mainly discuss results from the 2D case.

3.1 Test Error Definition

All reported test errors are defined as a percentage of the interval of the parameter being predicted. Equation (1) shows how we compute the test error when predicting X:

\%\,\text{Test Error} = \frac{|x - \hat{x}|}{\max(X) - \min(X)} \times 100,    (1)

where x is the true value of X, \hat{x} is the predicted value of X, and \max(X) and \min(X) are the endpoints of the interval of X. In the 2D case, X ∈ [0, 1], and in the 3D case, X ∈ [0, 4.71]. We believe this definition of test error, rather than the standard definition of (expected − actual)/actual, allows us to fairly compare predictions across examples with different true locations of the source of release. Intuitively, since we are trying to predict a location rather than a quantity, an error of 0.01 model units should be interpreted the same way whether the ground truth is 0.1 or 0.2, and our definition of error reflects this. Furthermore, this percentage error definition enables us to compare test errors between the 2D and 3D cases.

3.2 Perfect Sensor Readings

First, we consider the case where all sensor readings are perfect. For our 2D test case, we have 144 examples in total: 72 for training and 72 for testing. We found that ordinary linear regression, which directly models the output values as a linear combination of the raw input feature values, does not perform well, with a mean error of 24%. However, linear regression with a logarithmic feature mapping (equation (2)) performs quite well, with a mean error of 1%. This suggests that there is a log relationship between the sensor readings and the location of the contaminant release:

X = \beta_0 + \beta_1 \log r_1 + \beta_2 \log r_2 + \cdots + \beta_{15} \log r_{15}.    (2)

To use the log-transformed model, we first replace any readings that are less than or equal to zero with the fixed constant 1 × 10^-10, take the log of each sensor reading, and then apply multiple linear regression. Figure 2 (left) shows the residual plot from our testing; the mean error is 1%, with a standard deviation of 1%, in predicting the source of release.

Next, we consider the 3D case. Here, we have 256 examples in total: 200 for training and 56 for testing.
As with the 2D case, ordinary linear regression does not perform well, with a mean error of 22.5%, while linear regression with the logarithmic feature mapping works well: performing the same steps as before, we found a mean error of 0.26% and a standard deviation of 0.26%, as shown in Figure 2 (right). We acknowledge that these errors are unusually low, and we do not claim that they will generalize to other 3D simulations, even assuming perfect sensors.
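To make the fitting procedure of Section 3.2 concrete, the following is a minimal R sketch of the log-feature regression and the test error of equation (1). It is an illustration only, not the code behind the reported numbers; the data frames train and test, the sensor column names r1, ..., r15, and the coordinate column x are assumed names.

# Log-feature linear regression (assumed data frames 'train'/'test' with
# sensor columns r1..r15 and the true source coordinate x).
log_features <- function(df) {
  r <- as.matrix(df[, paste0("r", 1:15)])
  r[r <= 0] <- 1e-10                             # floor non-positive readings before taking logs
  data.frame(x = df$x, log(r))
}

fit  <- lm(x ~ ., data = log_features(train))    # multiple linear regression on log readings
pred <- predict(fit, newdata = log_features(test))

# Percentage test error from equation (1); max(X) - min(X) = 1 for the 2D case
x_range <- 1
err <- abs(test$x - pred) / x_range * 100
c(mean = mean(err), sd = sd(err))

The 3D case follows the same steps, with 40 sensor columns and X ∈ [0, 4.71].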

Figure 2: Test residuals of linear regression with logarithmic feature mapping for the 2D (left) and 3D (right) cases with perfect sensor readings.

3.3 Sensor Readings with Uniform Error Distribution

Having successfully modeled the simple case, we move on to a more complex one: uniform, substantial sensor error. Based on knowledge of the field, 1 × 10^-2 is a reasonable sensor error. To apply the uniform error, we add this fixed constant to each of our sensor readings and treat the perturbed readings as the new raw features. During training, we naturally assume that the constant is unknown, as it would be in practice. Our training and test sets are the same as in the previous case.

Unfortunately, this more complex case clearly demonstrated that our previous method is not robust against uniform sensor error: our test errors increased by an order of magnitude. This behavior is consistent with the ill-conditioned nature of the inverse problem. In retrospect, we could have anticipated this. When a systematic sensor error is added, the true model takes the form of equation (3), while we were still trying to model it with equation (2):

X = \beta_0 + \beta_1 \log (r_1 + \epsilon) + \beta_2 \log (r_2 + \epsilon) + \cdots + \beta_{15} \log (r_{15} + \epsilon).    (3)

We tried to fit equation (3) using R's nonlinear least squares function nls, but ran into singularity problems. Since direct model fitting did not pan out, we implemented a hill-climbing algorithm to greedily discover the value of the hidden constant ε. Specifically, the algorithm varies ε to maximize the R^2 statistic of a least squares fit against log(r_i − ε). The procedure is as follows:

1. Initialize a step size s to the constant 0.001.
2. Choose a random starting value ε_0 in [1 × 10^-2, 5 × 10^-2].
3. Fit a least squares model using the features log(r_i − ε_0).
4. Fit two more least squares models using ε = ε_0 − s and ε = ε_0 + s.
5. Set ε_0 equal to the ε among these three models that produced the highest R^2 statistic.
6. Halve the step size: s ← s/2.
7. If R^2 > 0.99, terminate the algorithm and report ε_0; otherwise, return to step 3.

This algorithm pinpoints ε well in the 2D case, and we can then substitute the recovered ε into equation (3) before modeling the data. However, the hill-climbing algorithm does not work well in the 3D case due to multiple local maxima.
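A minimal R sketch of this hill-climbing search is given below. It is an illustrative reimplementation of steps 1-7, not the authors' original code; the data frame train and the sensor/target column names are the same assumed names as in the earlier sketch.

# R^2 of a log-feature least squares fit for a trial offset eps
r2_for_eps <- function(eps, df) {
  r <- as.matrix(df[, paste0("r", 1:15)]) - eps  # remove the assumed uniform offset
  r[r <= 0] <- 1e-10                             # keep the logarithms defined
  feats <- data.frame(x = df$x, log(r))
  summary(lm(x ~ ., data = feats))$r.squared
}

# Greedy hill climbing over eps (a safety cap on iterations is added for the sketch)
hill_climb_eps <- function(df, s = 1e-3, max_iter = 50) {
  eps0 <- runif(1, 1e-2, 5e-2)                   # random start in [1e-2, 5e-2]
  for (i in seq_len(max_iter)) {
    cand <- c(eps0 - s, eps0, eps0 + s)
    r2   <- vapply(cand, r2_for_eps, numeric(1), df = df)
    eps0 <- cand[which.max(r2)]                  # keep the eps with the highest R^2
    s    <- s / 2                                # halve the step size
    if (max(r2) > 0.99) break                    # terminate once R^2 exceeds 0.99
  }
  eps0
}

eps_hat <- hill_climb_eps(train)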

3.4 Mixed Sensor Readings

We now consider the case where some sensor readings happen to contain no errors while others have uniform or Gaussian-distributed errors. To simulate these errors, we first replicate the original set of examples three times, creating a new data set with three times the number of examples. Next, we add a constant error term to one full replica, add Gaussian-distributed errors (mean 0, standard deviation 1 × 10^-2) to the second full replica, and then randomize the order of the data set. From this mixed data set, we randomly select half the examples for training and hold out the other half as a test set.

Multiple machine learning models are trained: linear regression, linear regression with logarithmic feature mapping, locally weighted linear regression with logarithmic feature mapping, decision trees, boosting, random forests, and K-nearest neighbors. Figure 3 compares these models on several test error metrics (mean, standard deviation, median, and 90th percentile) for our 2D case. The random forest gives both the lowest mean test error, about 5%, and the lowest 90th-percentile error, about 10%. Figure 4 shows the test residuals of the random forest for our 2D and 3D cases.

Figure 3: Test error metrics (mean, standard deviation, median, and 90th percentile) for all machine learning methods (linear, linear with log mapping, locally weighted linear with log mapping, decision tree, boosting, random forest, and K-nearest neighbor) applied to the 2D case with mixed sensor readings.

Figure 4: Test residuals of the random forest applied to the 2D (left) and 3D (right) cases with mixed sensor readings. Note that these figures are not on the same scale because the 2D and 3D cases have different ranges for their dimensions.
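For reference, the mixed data set and the random forest of this section can be sketched in R roughly as follows. This is an illustration under assumptions, not the authors' code: the clean 2D data frame clean with columns r1, ..., r15 and x is an assumed name, and the CRAN randomForest package is assumed as the random forest implementation.

library(randomForest)                            # assumed random forest implementation

sensors <- paste0("r", 1:15)

# Three replicas: clean, constant (uniform) offset of 1e-2, and Gaussian noise (sd = 1e-2)
rep_unif <- clean
rep_unif[sensors] <- rep_unif[sensors] + 1e-2
rep_gauss <- clean
rep_gauss[sensors] <- rep_gauss[sensors] +
  matrix(rnorm(nrow(clean) * length(sensors), sd = 1e-2), nrow = nrow(clean))

mixed <- rbind(clean, rep_unif, rep_gauss)
mixed <- mixed[sample(nrow(mixed)), ]            # randomize the order of the data set

train_idx <- sample(nrow(mixed), nrow(mixed) %/% 2)    # half for training, half held out
rf   <- randomForest(x ~ ., data = mixed[train_idx, c(sensors, "x")])
pred <- predict(rf, newdata = mixed[-train_idx, ])
err  <- abs(mixed$x[-train_idx] - pred) * 100          # equation (1) with max(X) - min(X) = 1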

3.5 Comparison with MCMC

We compare our results with the statistical results from MCMC with single and ensemble walkers presented in [5]. Using MCMC with noisy sensor readings (errors of 1 × 10^-2), we obtained errors of less than 1% in predicting X for both the 2D and 3D cases. Although the results from MCMC are far more accurate, it takes substantial time to obtain a single prediction from MCMC, because generating a set of sensor readings at each proposed location of the MCMC walker(s) is time consuming. For instance, the 2D case converges in about 4 minutes on 32 processors, and the 3D case converges in about 6 minutes on 100 processors. On the other hand, we can train a random forest and evaluate hundreds of examples in less than 1.5 seconds on a single processor.

4 Conclusion

To summarize, we make the following contributions in this paper. With perfect sensor data, the relationship between the sensor readings and the contaminant source is a simple log-linear one. Of all the models we experimented with, the random forest proved to be the most robust against noisy data. Compared with MCMC, supervised learning requires far less computational power, but it is less accurate and less robust to noisy data. More research is required to increase the robustness of supervised learning to noisy data. To the best of our knowledge, predicting the location of a contaminant release in a realistic setting remains an open problem.

5 Acknowledgement

We gratefully acknowledge Dr. Fidkowski at the University of Michigan, Ann Arbor for the use of his computing resources and simulation software (XFLOW) in generating our training and test data.

References

[1] J. Goodman and J. Weare. Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science, 5(1):65-80, 2010.

[2] J. Hadamard. Lectures on the Cauchy Problem in Linear Partial Differential Equations. Yale University Press, 1923.

[3] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.

[4] C. Lieberman, K. Fidkowski, K. Willcox, and B. van Bloemen Waanders. Hessian-based model reduction: large-scale inversion and prediction. International Journal for Numerical Methods in Fluids, 2012.

[5] D. Sanjaya, I. Tobasco, and K. Fidkowski. Adjoint-accelerated statistical and deterministic inversion of atmospheric contaminant transport. Unpublished.