A Knowledge Framework For Sensor Data Quality Control

Anshuman Sahu, Bo Yang, Member, IEEE

Abstract--Data-driven approaches are increasingly used in grid monitoring and decision support for power system operation and planning. Because more and more critical business decisions are based on measurement data, the communication and information infrastructure must meet high reliability requirements. Data quality is a major concern for many power companies because of communication network inefficiency, malfunctioning sensors, and inappropriate data ingestion. This paper proposes a framework for sensor data quality control that is data driven and requires minimal domain expert input. The proposed framework identifies poor quality data (outliers, adverse operational data) based on feature extraction and machine learning techniques. The framework also generates additional insights into the data that the operator can leverage. We demonstrate a prototype of the framework on real data.

Index Terms--Sensor data quality, data complexity, machine learning models.

I. INTRODUCTION

With the proliferation of digital measuring devices, such as smart meters on distribution systems and phasor measurement units (PMUs) on transmission systems, power systems have become more observable through field measurements. As a result, measurement data are increasingly used in grid monitoring and decision support for power system operation and planning, which previously relied heavily on information derived from model simulation. For instance, conventional transmission grid operations rely on remedial control strategies that are determined by offline analysis. Such analysis may not be optimal for real-time operation. PMUs and other grid sensors provide improved situational awareness, so that abnormalities such as oscillations and unexpected generation drops can be detected from high-resolution synchronized field measurements.
In recent years, research on grid analysis facilitated by PMU data has attracted great attention and covered a wide range of topics: state estimation [1]; adaptive control and protection [2]; voltage stability assessment [3]; and low frequency oscillation detection [4]. A similar trend has been observed in distribution power systems, where the popularity of smart meters and intelligent electronic devices is reshaping the way distribution companies operate, with improved feeder management, resilient operations, and enhanced customer engagement.

Because more and more critical business decisions are based on measurement data, the communication and information infrastructure must meet high reliability requirements. Unfortunately, data quality is a concern for many power companies because of communication network inefficiency, malfunctioning sensors, and inappropriate data ingestion. Research efforts dedicated to improving data quality often focus on recovering missing data [5], mitigating the impact of network latency [6], or capturing the noise level [7]. Common practice is 1) to exclude obviously unrealistic data points based on engineering judgment or a rough estimate from physical models, and then 2) to develop models that capture the impact of noise and uncertainties. This process works well for many applications, but it requires a lot of domain expert input and is difficult to generalize to data of a different nature. For example, a rule of thumb that works well on voltage measurements cannot be easily adapted to current measurements. When the volume, velocity, and dimensionality of the data stream are sufficiently high, the process becomes very challenging and cannot be handled well by existing techniques.

Anshuman Sahu is with Hitachi America Big Data Lab, Santa Clara, CA (anshuman.sahu@hal.hitachi.com). Bo Yang is with Smartwires (Bo.Yang@smartwires.com).
This paper proposes a framework for data quality control that is purely data driven and requires minimal domain expert input. The proposed framework identifies poor quality data (outliers, adverse operational data) based on feature extraction and machine learning. The extracted data signatures are clustered and fed into a learner to glean useful insights. The framework can be used to detect abnormalities in very high dimensional settings and is robust to changes in operating conditions. Section II describes the architecture and methodology behind our framework. Section III describes in detail an instantiation of the framework components on a dataset of paired wind cup anemometer measurements. Section IV further demonstrates the framework on high-resolution PMU time series data. We conclude with future work in Section V.

II. ARCHITECTURE AND METHODOLOGY

It is assumed that the outputs of physical systems are interrelated and can be described mathematically, and that their measurements reflect these principles. Good or acceptable measurements will fit the mathematical models well, while unacceptable data points, i.e. bad or missing data, will fall outside the solution space. The strategy for data quality control is to identify this acceptable solution space from the observed data and to train a machine model that describes the principles, so that when a new data stream comes in, bad data can easily be screened out. The architecture, shown in Figure 1, consists of four major components.

Feature extraction (FE) is often deployed to preprocess the data matrix for the subsequent learning and generalization steps. When the input data is very large and many of its features are correlated, FE can reduce the dimensionality of the input data while retaining useful features. Dimensionality reduction methods include principal component analysis (PCA) and other subspace learning methods.

Clustering is a data mining process that discovers groups of similar data points. Normal and abnormal data points are expected to have different characteristics and thus to belong to different groups. Different cluster analysis techniques can be employed. Commonly used methods consider the neighborhood distribution, density, and connectivity between data points.

Figure 1 Work flow for data quality control framework.

Machine model training: once data points are clustered, a label can be assigned to them based on cluster membership. Machine learning models in the form of supervised classifiers are then learned to characterize each data group. One of the groups will closely represent the normal operating data, while the other groups will correspond to contaminated data points that cannot be used to make predictions or decisions.

In many cases, the physical systems under measurement are partially known. Domain expertise can then be leveraged to improve model accuracy, which might otherwise be limited by a lack of data coverage. The domain expert also has the opportunity to validate or further improve the machine model. The refined model can then be applied for data classification, after which clean data are identified for further analysis.

III. CASE STUDY: DEMONSTRATION ON THE PAIRED WIND CUP ANEMOMETER DATA

We demonstrate the efficacy of our framework on sensor data obtained from a pair of cup anemometers.
The dataset was released during a Prognostics and Health Management (PHM) Society challenge. Different attributes (features) such as wind speed, wind direction, and temperature were recorded and summarized in intervals of 10 minutes for both anemometers. For more details about the origin of the dataset, we refer the interested reader to the following website: Note that, unlike the goal of the contest, our contribution here is to show how the framework can use real-life data to tease out valuable insights in an almost automated fashion. We describe our procedure step by step on one of the datasets. The first eight attributes correspond to summary statistics for wind speed; the next three attributes correspond to summary statistics for wind direction (all entries in the column corresponding to the minimum value of wind direction were found to be zero, and the column was removed); and the final four attributes correspond to summary statistics for temperature. We instantiate our framework with a specific algorithm for each step. These can be substituted with other appropriate methods to deal with different situations.

A. Feature Extraction Using Principal Component Analysis (PCA)

The goal of feature extraction is to understand the dependencies between the attributes and to identify the inherent subspace in which the data resides. A survey of such techniques can be found in [8]. PCA is a simple yet highly effective approach in practice. We performed PCA on the scaled data and projected the data onto the top 3 principal components (PCs), as shown in Figure 3. As shown in Figure 2, these 3 PCs explain slightly more than 80% of the total variance of the data. Upon inspecting the loadings, we found that the first PC focused on wind speed, the second PC on temperature and wind direction, and the third PC on the standard deviation of the attributes.
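The PCA step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `pca_by_variance`, the 80% target expressed as a parameter, and the synthetic ten-column data standing in for the anemometer attributes are all assumptions made for the example.

```python
import numpy as np

def pca_by_variance(X, variance_target=0.80):
    """Standardize columns, run PCA via SVD, keep the fewest PCs reaching the target."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0               # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std    # scaled data, as in the text
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)   # cumulative explained variance
    k = min(int(np.searchsorted(cumulative, variance_target)) + 1, len(s))
    loadings = Vt[:k]                 # each row is one principal direction
    scores = Z @ loadings.T           # data projected onto the top-k PCs
    return scores, loadings, cumulative

# Synthetic stand-in (assumption): 10 correlated attributes driven by 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

scores, loadings, cumulative = pca_by_variance(X)
print("PCs kept:", scores.shape[1])
print("cumulative variance of first PCs:", np.round(cumulative[:4], 3))
```

Inspecting the rows of `loadings` is the counterpart of the loading inspection described in the text: large absolute entries show which attributes a component emphasizes.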
Figure 2 Cumulative variance of principal components: the top 3 principal components are selected, covering 80% of the variance.

B. Sub-Group Identification via Clustering

Once we extract the relevant features, we need to discover sub-groups within the data. Model-based clustering [9] utilizing the Bayesian Information Criterion (BIC) was employed for this purpose. Our premise is that the data set consists of outliers and points corresponding to adverse weather conditions (for example icing) in addition to normal points. Generally, the number of clusters identified depends on the application of interest. We show the results of clustering in Figure 4, where we identify three clusters.
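For reference, the criterion used in model-based clustering can be written out as below. This follows the sign convention of Fraley and Raftery [9], in which a larger BIC indicates a better model. The parameter count assumes unconstrained covariance matrices, which the text does not specify, so the worked number is illustrative only.

```latex
% BIC for a Gaussian mixture with G components fitted to n points in d dimensions
\mathrm{BIC}_G = 2 \ln \hat{L}_G - \nu_G \ln n

% \hat{L}_G : maximized likelihood of the G-component mixture
% \nu_G     : number of free parameters; with unconstrained covariances (assumption)
\nu_G = (G - 1) + G\,d + G\,\frac{d(d+1)}{2}

% Worked example for the setting in the text (d = 3 PCs, G = 3 clusters):
\nu_3 = 2 + 9 + 18 = 29
```

The mixture is fitted for a range of candidate values of G, and the value with the largest BIC is retained. The penalty term grows with the parameter count, which keeps the criterion from always preferring more clusters.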

C. Machine Model By Decision Tree Method

After discovering the sub-groups, we are naturally interested in understanding them. Supervised machine learning algorithms can be employed for this purpose. Decision trees [10] are one such technique. They are robust, can handle high dimensionality, provide importance scores for the features, and produce interpretable rules. The cluster memberships learned in the previous step are assigned as class labels to the data in the original feature space. The tree model learned from the data is visualized in Figure 5. The first row in each node of the tree shows the dominant class in that node (1 represents the class for outlier data, 2 the class for adverse weather data, and 3 the class for normal data); the second row shows the estimated probability for each class (in the order of classes 1, 2, 3); and the final row shows the percentage of the total data that falls in the node as a result of applying the rule denoted below each node. If a data point satisfies the rule, it goes to the left child node; otherwise it goes to the right child node.

Figure 3 Clustered data in dominant PCA space: normal data are shown in green, data from adverse weather conditions are shown in red, and outliers are shown in blue. Clustering in PCA space tends to separate out abnormal points.

Figure 4 Highly coupled clusters and their covariance structure. Color labels for each class are the same as in Figure 3.

Figure 5 Classification tree based on rules learned from data.

In addition to estimating the class probabilities, we also look at the important features identified by the tree model and show box plots of those features for the three classes in Figure 6. It is interesting to note that the features and rules identified above have physical meaning in discriminating between the classes. A previous study [11] had shown some rules for such cases. Those rules, however, were based on human judgment with no empirical evidence. Furthermore, the rules were derived by looking at one feature at a time, which can miss interactions between features that are useful descriptors. Our method provides a systematic approach to discovering such rules, and decision trees can handle interactions between features. Moreover, with additional ground truth data and feedback from the operator, the accuracy of our method can be improved. Table 1 compares the classification results of our approach with those of the previous study [11], which is based on engineering judgment and rough estimates of weather effects. Substantially more data points are retained in our approach.

Figure 6 Boxplot showing the distribution of key features for each class (color coded as before) identified through the decision tree.

Table 1 Comparison of data clustering with physical rule based method

                      Bad data (outlier+icing)    Good data
  Rule-based method
  Our method
  Difference          55%                         -38%

D. Data Classification

The fine-tuned machine model is then applied to a new data file for prediction. The test file has 720 data rows, each with the same format as the training data set. Each row is passed through the developed decision tree and labeled with one class. In this case, normal data points (530 rows) dominate the data file, while outliers and icing points combined account for 190 rows. Class 3 will be retained for further analysis.
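The way a classification tree derives a rule from cluster labels, as in Section III.C, can be sketched with a single split chosen by Gini impurity. This is a simplified illustration rather than the CART implementation of [10]: the helper names, the two features (mean wind speed and mean temperature), and the six toy rows are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Find the rule 'x[feature] < threshold' with the lowest weighted Gini impurity."""
    n = len(rows)
    best = (None, None, gini(labels))
    for feature in range(len(rows[0])):
        values = sorted(set(row[feature] for row in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2          # midpoint between neighbors
            left = [y for row, y in zip(rows, labels) if row[feature] < threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (feature, threshold, score)
    return best

# Hypothetical rows: (mean wind speed, mean temperature); labels are cluster ids
# following the text: 1 = outlier, 2 = adverse weather, 3 = normal.
rows = [(0.2, -5.0), (0.1, -3.0), (6.5, -4.0), (7.1, 6.0), (5.9, 5.5), (30.0, 5.0)]
labels = [2, 2, 3, 3, 3, 1]

feature, threshold, score = best_split(rows, labels)
print(f"rule: x[{feature}] < {threshold:.2f}, weighted Gini = {score:.3f}")
```

Rows that satisfy the rule go to the left child, matching the convention of Figure 5. A full tree repeats the same search recursively inside each child node, which is how rules that combine several features arise.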

The resulting classification can then be used for equipment health analysis, where equipment health can be judged intuitively from whether a substantial number of outliers or icing points has been detected.

IV. DEMONSTRATION ON HIGH RESOLUTION TIME SERIES DATA

The proposed framework is further demonstrated on time series data that are raw field measurements, more dynamic, and contaminated with noise. Data from phasor measurement units are used to benchmark the performance of the proposed approaches on outlier identification and nuisance data detection. The PMU data are available through the Texas Synchrophasor Network. The measurements on 22pm of January 12, 2012 are randomly selected for demonstration. The original data file contains one hour of voltage, phasor, and frequency measurements for six PMUs scattered around Texas. The experiments were run on the first five minutes to demonstrate the efficacy of the methodology in detecting data points of interest. The goal is to test the proposed data quality control framework on the selected dataset without input from domain experts. Both phasor and frequency signals are selected as ground truth for comparison, although they may contain noise and outliers by nature. In the experiments, white noise of different levels and distributions is also applied to the raw data for performance evaluation. The proposed approaches are expected to identify the added noise and outliers by relying only on correlations between angles and frequencies from different PMUs.

Figure 7 Frequency and angle measurements for 6 PMUs

A. Contaminated Data

Since there is no information about the noise level and quality of the raw data, artificial white noise of different magnitudes is added to test the performance of the proposed methods. There is no missing data in the selected data file. The following noise scenarios are considered.
The simulated white noise has a Gaussian distribution with a noise level of 1-3% of the measurement signals. The occurrence of the noise is also randomly distributed, with the total occurrence set to 20-60% of the overall duration. Because the magnitude and occurrence of the contaminated measurements are random, the scenarios test the flexibility of the proposed methodology and its ability to address unpredictable issues.

Figure 8 Phase angle and frequency with artificial white noise. Panels: 5-minute overview (upper left); noise at various levels (upper right); comparison of magnitude (lower left); similarly noised frequency measurements (lower right). Legend: raw data; 1% Mag, 20% ToD; 3% Mag, 60% ToD; 3% Mag, 40% ToD; 1% Mag, 40% ToD.

Figure 8 shows the angle and frequency measurements after various levels of noise have been added. The two figures on the top show the absolute voltage angle with 1-3% white noise, where the upper left one is a closer look at the noised signals. The figure on the lower left compares the noise levels, which in general become more obvious as the measured value increases with time. The figure on the lower right depicts the noised frequency signals. Since a 1% change corresponds to roughly 0.6 Hz, the noisy measurement points can be easily distinguished from the clean data points.

B. Outlier Detection

As shown in Figure 8, the contaminated frequency data points, i.e. the outliers, are relatively easy to identify. However, the noise applied to the angle measurements is closely coupled with the clean data points and follows roughly the same trend as time progresses. It is challenging to use a rule-based approach, i.e. angle thresholds, to separate the contaminated angle measurements. The proposed principal component analysis method, however, successfully captures the correlations between angle and frequency and applies them to outlier identification. Figure 9 depicts the outliers in red and the clean data in blue. It is confirmed that all contaminated data points are detected.
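The noise scenario and the correlation-based detection described above can be sketched as follows. Several parts are assumptions made for illustration and are not taken from the paper: the synthetic channels stand in for PMU frequency signals, a separate reference window is assumed to be clean when fitting the PCA subspace, and the threshold of 1.5 times the largest reference residual is an ad hoc choice.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_channels(n_samples, n_channels=6):
    """Hypothetical frequency channels (Hz): a shared drift plus small per-channel jitter."""
    drift = 0.02 * np.sin(np.linspace(0.0, 8.0 * np.pi, n_samples))
    jitter = 0.002 * rng.normal(size=(n_samples, n_channels))
    return 60.0 + drift[:, None] + jitter

def add_white_noise(X, magnitude=0.01, occurrence=0.40):
    """Scale randomly chosen rows by (1 + magnitude * N(0, 1)); return data and row mask."""
    hit = rng.random(X.shape[0]) < occurrence
    noisy = X.copy()
    noisy[hit] *= 1.0 + magnitude * rng.normal(size=(int(hit.sum()), X.shape[1]))
    return noisy, hit

def fit_subspace(X_ref, n_components=1):
    """Standardize the reference data and return its leading principal directions."""
    mean = X_ref.mean(axis=0)
    std = X_ref.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd((X_ref - mean) / std, full_matrices=False)
    return mean, std, Vt[:n_components]

def residual_norm(X, mean, std, components):
    """Distance of each row from the PCA subspace, in standardized units."""
    Z = (X - mean) / std
    return np.linalg.norm(Z - (Z @ components.T) @ components, axis=1)

reference = simulate_channels(2000)                        # assumed clean (assumption)
model = fit_subspace(reference)
threshold = 1.5 * residual_norm(reference, *model).max()   # ad hoc threshold (assumption)

# 1% magnitude, 40% of the duration: one of the scenarios listed in the text.
stream, contaminated = add_white_noise(simulate_channels(2000))
flagged = residual_norm(stream, *model) > threshold

recall = (flagged & contaminated).sum() / contaminated.sum()
false_alarm_rate = (flagged & ~contaminated).sum() / (~contaminated).sum()
print(f"recall={recall:.3f}, false alarm rate={false_alarm_rate:.3f}")
```

Clean rows stay close to the subspace spanned by the shared drift, so their residuals are small. Noise that breaks the correlation between channels pushes a row away from that subspace, which is why a single threshold on the residual separates the two groups in this synthetic setting.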
Figure 9 Angle and frequency for 1 PMU with 1% noise level and 40% total duration

Figure 10 shows that the tri-modal feature of frequency is captured and adopted as the clustering criterion. The root of the tree divides the whole dataset into two groups, where the group with frequencies equal to or above 61 Hz is seen as one cluster, shown in orange.

This group contains all the outliers that are higher than the normal values. The rest of the data is further divided into two groups, where the group with frequencies lower than 59 Hz contains the outliers that are lower than the normal values. The group in the green box is the collection of all normal measurements.

Figure 10 Classification tree used to differentiate contaminated data points (with 3% noise level and 40% total duration)

C. Nuisance Data Detection

Sensors can sometimes malfunction or be improperly calibrated, which induces nuisance data. This type of data quality problem is difficult to detect unless the measurements are far outside the expected range. In this paper, we propose to use the correlation among measurements to identify problematic measurements caused by device malfunction. It is assumed that measurements from different locations are bound by the underlying physical rules, so the correlation of the data should be relatively consistent. For example, the time derivative of the angle should yield the frequency. The malfunction of one device will therefore change the data correlation, and the new pattern will not revert on its own unless the data channel is removed. To test this approach, a vertical shift of 3% is applied to one PMU angle measurement after 85 seconds (out of 1 hour). The assumption is that this PMU was calibrated incorrectly and henceforth yields wrong angle measurements. Figure 11 shows that the change was captured immediately, as marked in red. However, both frequencies and angles are marked as problematic by the PCA method, since it only detects a change of correlation among data channels and cannot indicate the root cause. The decision tree method [10] is then applied to explore the cause of the change, which reveals the angle measured by PMU 1 to be the differentiator (Figure 12).

Figure 11 Angle and frequency when 1 PMU angle measurement has vertical shift

Figure 12 Classification tree used to differentiate contaminated data points

V. CONCLUSION AND FUTURE WORK

A knowledge framework for sensor data quality control is proposed to identify abnormal data points through feature extraction and clustering techniques. Supervised learner models are then used to describe the data signature for normal operating conditions, which can be incorporated into future decision making. The prototype of the framework has been described in detail on the paired wind cup anemometer data, which demonstrated its feasibility. It was then further demonstrated on dynamic time series data, i.e. PMU measurements. The whole process is automatic and requires minimal domain expert input; it leverages the data signature and the correlation among measurements. The experimental results show that the proposed framework functions as expected on both types of datasets. Future research will take into account location and topology (spatial) information as well as the associated temporal information to refine the predictions.

VI. REFERENCES

[1] M. Gol and A. Abur, A Fast Decoupled State Estimator for Systems Measured by PMUs, IEEE Transactions on Power Systems, vol. PP, no. 99, pp. 1-6, 2014
[2] M. Ariff, B. Pal and A. Singh, Estimating Dynamic Model Parameters for Adaptive Protection and Control in Power System, IEEE Transactions on Power Systems, vol. PP, no. 99, pp. 1-11, 2014
[3] V. Vijay, Application of phasor measurements for dynamic security assessment using decision trees, IEEE PES general meeting, San Diego, July 2012
[4] N. Zhou, J. Dagle, Initial Results in Using a Self-Coherence Method for Detecting Sustained Oscillations, IEEE Transactions on Power Systems, vol. 30, no. 1, pp , 2014
[5] M. Wang, J. H. Chow and et al, A Low-Rank Matrix Approach for the Analysis of Large Amounts of Power System Synchrophasor 48th Hawaii International Conference on System Sciences, 2015
[6] O. Al-Khatib, W. Hardjawana, B. Vucetic, Traffic Modeling and Optimization in Public and Private Wireless Access Networks for Smart Grids, IEEE Transactions on Smart Grid, vol 5, issue 4, pp
[7] J.S. Erkelens, R. Heusdens, Tracking of Nonstationary Noise Based on Data-Driven Recursive Noise Power Estimation, IEEE Transactions on Audio, Speech, and Language Processing, vol 16, issue 6, pp
[8] S. Ding, H. Zhu, W. Jia, and C. Su, "A survey on feature extraction for pattern recognition." Artificial Intelligence Review vol. 37, no. 3, pp ,
[9] C. Fraley, and A. E. Raftery, "Model-based clustering, discriminant analysis, and density estimation." Journal of the American Statistical Association, vol. 97, no. 458, pp ,
[10] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression trees. CRC press,
[11] D. Siegel, and J. Lee, "An auto-associative residual processing and K-means clustering approach for anemometer health assessment." International Journal of Prognostics and Health Management, vol 2, pp , 2011.
