A Knowledge Framework For Sensor Data Quality Control
Anshuman Sahu, Bo Yang, Member, IEEE

Abstract--Data-driven approaches are increasingly used in grid monitoring and decision support for power system operation and planning. Because more and more critical business decisions are based on measurement data, the communication and information infrastructure must be highly reliable. Data quality is a major concern for many power companies due to communication network inefficiency, malfunctioning sensors, and inappropriate data ingestion. This paper proposes a framework for sensor data quality control which is data driven and requires minimal domain expert input. The proposed framework identifies poor quality data (outliers, adverse operational data) based on feature extraction and machine learning techniques. The framework also generates additional insights into the data which can be leveraged by the operator. We demonstrate a prototype of the framework on real data.

Index Terms--Sensor data quality, data complexity, machine learning models.

I. INTRODUCTION

With the proliferation of digital measuring devices, such as smart meters on distribution systems and phasor measurement units (PMUs) on transmission systems, power systems have become more observable through field measurements. As a result, measurement data are increasingly used in grid monitoring and decision support for power system operation and planning, which previously relied heavily on information derived from model simulation. For instance, conventional transmission grid operations rely on remedial control strategies that are determined by offline analysis. Such analysis may not be optimal for real-time operation. PMUs and other grid sensors provide improved situational awareness, so that abnormalities such as oscillations and unexpected generation drops can be detected from high-resolution synchronized field measurements.
In recent years, research on grid analysis facilitated by PMU data has attracted great attention and covered a wide range of topics: state estimation [1]; adaptive control and protection [2]; voltage stability assessment [3]; and low frequency oscillation detection [4]. A similar trend has been observed in distribution power systems, where the spread of smart meters and intelligent electronic devices is reshaping the way distribution companies operate, with improved feeder management, resilient operations, and enhanced customer engagement.

Anshuman Sahu is with Hitachi America Big Data Lab, Santa Clara, CA (anshuman.sahu@hal.hitachi.com). Bo Yang is with Smartwires (Bo.Yang@smartwires.com).

Because more and more critical business decisions are based on measurement data, the communication and information infrastructure must be highly reliable. Unfortunately, data quality is a concern for many power companies due to inefficient communication networks, malfunctioning sensors, and inappropriate data ingestion. Research dedicated to improving data quality often focuses on recovering missing data [5], mitigating the impact of network latency [6], or capturing the noise level [7]. Common practice is to 1) exclude obviously unrealistic data points based on engineering judgment or rough estimates from physical models, and then 2) develop models that capture the impact of noise and uncertainties. This process works well for many applications, but it requires a lot of domain expert input and is difficult to generalize to data of a different nature. For example, a rule of thumb that works well on voltage measurements cannot be easily adapted to current measurements. When the volume, velocity, and dimensionality of the data stream are sufficiently high, the process becomes very challenging and cannot be handled well by existing techniques.
This paper proposes a framework for data quality control which is purely data driven and requires minimal domain expert input. The proposed framework identifies poor quality data (outliers, adverse operational data) based on feature extraction and machine learning. The extracted data signatures are clustered and fed into a learner to glean useful insights. The framework can be used to detect abnormalities in very high dimensional settings and is robust to changes in operating conditions. Section II describes the architecture and methodology behind our framework. Section III describes an instantiation of the framework's components in detail on a dataset of paired wind cup anemometer measurements. Section IV demonstrates the framework on high resolution PMU time series data, and Section V concludes with future work.

II. ARCHITECTURE AND METHODOLOGY

It is assumed that the outputs of physical systems are interrelated and can be described mathematically, and that their measurements reflect these relationships. Good or acceptable measurements will fit the mathematical models well, while unacceptable data points, i.e. bad or missing data, will fall outside the solution space. The strategy for data quality control is to identify the acceptable solution space from the observed data and to train a machine model that describes the underlying principles, so that when a new data stream comes in, bad data can easily be screened out. The architecture, shown in Figure 1, consists of four major components.

Feature extraction (FE) is often deployed to preprocess the data matrix for the subsequent learning and generalization steps. When the input data is very large and many of its features are correlated, FE can reduce the dimensionality of the input data while retaining useful features. Dimensionality reduction methods include principal component analysis (PCA) and other subspace learning methods.

Clustering is a data mining process that discovers groups of similar data points. Normal and abnormal data points are expected to have different characteristics and therefore to belong to different groups. Different cluster analysis techniques can be employed. Commonly used methods consider the neighborhood distribution, density, and connectivity between data points.

Figure 1 Work flow for data quality control framework.

Machine model training. Once data points are clustered, a label can be assigned to each of them based on cluster membership. Machine learning models in the form of supervised classifiers are then learned to characterize each data group. One of the groups will closely represent the normal operating data, while the other groups will correspond to contaminated data points that cannot be used to make predictions or decisions.

In many cases, the physical systems under measurement are partially known. Domain expertise can then be leveraged to improve model accuracy, which might otherwise be limited by a lack of data coverage. The domain expert also has the opportunity to validate or further improve the machine model. The refined model can then be applied for data classification, after which clean data are identified for further analysis.

III. CASE STUDY: DEMONSTRATION ON THE PAIRED WIND CUP ANEMOMETER DATA

We demonstrate the efficacy of our framework on sensor data obtained from a pair of cup anemometers.
The dataset was released for a Prognostics and Health Management (PHM) Society data challenge. Different attributes (features) such as wind speed, wind direction, and temperature were recorded and summarized in 10-minute intervals for both anemometers. For more details about the origin of the dataset, we refer the interested reader to the PHM Society challenge website. Note that, unlike the goal of the contest, our contribution here is to show how the framework can use real-life data to tease out valuable insights in an almost automated fashion. We describe our procedure step by step on one of the datasets. The first eight attributes correspond to summary statistics for wind speed; the next three attributes correspond to summary statistics for wind direction (all entries in the column for the minimum value of wind direction were found to be zero, so that column was removed); and the final four attributes correspond to summary statistics for temperature. We instantiate our framework with a specific algorithm for each step. These can be substituted with other appropriate methods to deal with different situations.

A. Feature Extraction Using Principal Component Analysis (PCA)

The goal of feature extraction is to understand the dependencies between the attributes and to identify the inherent subspace in which the data resides. A survey of such techniques can be found in [8]. PCA is a simple yet highly effective approach in practice. We performed PCA on the scaled data and projected the data onto the top 3 principal components (PCs), as shown in Figure 3. As shown in Figure 2, these 3 PCs explain slightly more than 80% of the total variance of the data. Upon inspecting the loadings, we found that the first PC focused on wind speed, the second PC on temperature and wind direction, and the third PC on the standard deviation of the attributes.
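The feature-extraction step described above can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data matrix is synthetic (15 hypothetical attributes generated from three latent factors, standing in for the anemometer summary statistics), and the 80% cumulative-variance target simply mirrors the figure quoted in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 15 summary-statistic attributes (assumption):
# three latent factors mixed into 15 correlated columns plus small noise.
n_rows, n_attrs = 1000, 15
latent = rng.normal(size=(n_rows, 3))
mixing = rng.normal(size=(3, n_attrs))
X = latent @ mixing + 0.3 * rng.normal(size=(n_rows, n_attrs))

# Scale each attribute to zero mean and unit variance before PCA.
X_scaled = StandardScaler().fit_transform(X)

# Fit a full PCA and compute the cumulative explained-variance curve.
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%.
k = int(np.searchsorted(cum_var, 0.80) + 1)

# Project the scaled data onto the top-k principal components.
scores = PCA(n_components=k).fit_transform(X_scaled)

# Loadings: each row shows how strongly every attribute contributes to a PC.
loadings = pca.components_[:k]

print(f"components kept: {k}, cumulative variance: {cum_var[k - 1]:.2f}")
```

Inspecting the rows of `loadings` is the programmatic counterpart of reading off which attributes dominate each PC.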
Figure 2 Cumulative variance of principal components: the top 3 principal components are selected, covering 80% of the variance.

B. Sub-Group identification via clustering

Once we extract the relevant features, we need to discover sub-groups within the data. Model-based clustering [9] using the Bayesian Information Criterion (BIC) was employed for this purpose. Our premise is that the data set contains outliers and points corresponding to adverse weather conditions (for example icing) in addition to normal points. In general, the number of clusters identified depends on the application of interest. We show the results of clustering in Figure 4, where we identify three clusters.
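Model-based clustering with BIC selection can be sketched as follows. The paper cites [9] but does not name a software package, so the use of scikit-learn's `GaussianMixture` here is an assumption, and the three synthetic blobs are hypothetical stand-ins for the data projected onto the top 3 PCs. Note that scikit-learn defines BIC so that lower values are better.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in for the PCA scores: one large "normal" group and
# two smaller groups playing the roles of adverse-weather points and outliers.
normal = rng.normal([0.0, 0.0, 0.0], 1.0, size=(300, 3))
adverse = rng.normal([6.0, 6.0, 0.0], 1.0, size=(100, 3))
outliers = rng.normal([-6.0, 5.0, 4.0], 1.5, size=(50, 3))
scores = np.vstack([normal, adverse, outliers])

# Fit Gaussian mixtures with 1..6 components and record the BIC of each.
bics, models = {}, {}
for k in range(1, 7):
    gmm = GaussianMixture(
        n_components=k, covariance_type="full", n_init=3, random_state=0
    ).fit(scores)
    bics[k] = gmm.bic(scores)
    models[k] = gmm

# Keep the model with the lowest BIC and read off the cluster memberships.
best_k = min(bics, key=bics.get)
labels = models[best_k].predict(scores)

print(f"number of clusters selected by BIC: {best_k}")
```

The resulting `labels` array is what the next step would treat as class labels for a supervised learner.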
Figure 3 Clustered data in dominant PCA space: normal data are shown in green, data from adverse weather conditions in red, and outliers in blue. Clustering in PCA space tends to separate out abnormal points.

Figure 4 Highly coupled clusters and their covariance structure. Color labels for each class are the same as in Figure 3.

C. Machine Model By Decision Tree Method

After discovering the sub-groups, we are naturally interested in understanding them. Supervised machine learning algorithms can be employed for this purpose, and decision trees [10] are one such technique. They are robust, can handle high dimensionality, provide importance scores for the features, and produce interpretable rules. The cluster memberships learned in the previous step are assigned as class labels to the data in the original feature space. The tree model learned from the data is visualized in Figure 5. The first row in each node of the tree shows the dominant class in that node (1 represents the class for outlier data; 2 represents the class for adverse weather data; and 3 represents the class for normal data); the second row shows the estimated probability for each class (in the order of classes 1, 2, 3); and the final row gives the percentage of the total data that falls in the node as a result of applying the rule denoted below each node. If a data point satisfies the rule, it goes to the left child node; otherwise it goes to the right child node.

Figure 5 Classification tree based on rules learned from data.

In addition to estimating the class probabilities, we also look at the important features identified by the tree model and show box plots of those features for the three classes in Figure 6. It is interesting to note that the features and rules identified above have physical meaning in discriminating between the classes. A previous study [11] had shown some rules for such cases. Those rules, however, were based on human judgment with no empirical evidence. Furthermore, they were derived by looking at one feature at a time, which can miss potential interactions between features that are useful descriptors. Our method provides a systematic approach to discovering such rules, and decision trees can handle interactions between features. Moreover, as additional ground truth data and operator feedback become available, the accuracy of our method can be improved.

Figure 6 Boxplot showing the distribution of key features for each class (color coded as before) identified through the decision tree.

Table 1 compares the classification results of our approach with those of the previous study [11], which is based on engineering judgment and rough estimates of weather effects. It can be seen that substantially more data points are retained in our approach.

Table 1 Comparison of data clustering with physical rule based method

                     Bad data (outlier+icing)   Good data
Rule-based method
Our method
Difference           55%                        -38%

D. Data Classification

The fine-tuned machine model is then applied to a new data file for prediction. The test file has 720 data rows, each in the same format as the training data set. Each row is passed through the developed decision tree and labeled with one class. In this case, normal data points (530 rows) dominate the data file, while outliers and icing points together account for 190 rows. Class 3 will be retained for further analysis. The classification can then support equipment health analysis: equipment health can be judged intuitively when a substantial number of outliers or icing points has been detected.

IV. DEMONSTRATION ON HIGH RESOLUTION TIME SERIES DATA

The proposed framework is further demonstrated on time series data that are raw field measurements, more dynamic, and contaminated with noise. Data from phasor measurement units are used to benchmark the proposed approach on outlier identification and nuisance data detection. The PMU data is available through the Texas Synchrophasor Network. The measurements on 22pm of January 12, 2012 were randomly selected for demonstration. The original data file contains one hour of voltage, phasor, and frequency measurements for six PMUs scattered around Texas. The experiments were run on the first five minutes to demonstrate the efficacy of the methodology in detecting data points of interest. The goal is to test the proposed data quality control framework on the selected dataset without input from domain experts. Both the phasor and frequency signals are taken as ground truth for comparison, although they may naturally contain noise and outliers. In the experiments, white noise of different levels and distributions is also applied to the raw data for performance evaluation. The proposed approach is expected to identify the added noise and outliers relying only on the correlations between angles and frequencies from different PMUs.

Figure 7 Frequency and angle measurements for 6 PMUs.

A. Contaminated data

Since there is no information about the noise level and quality of the raw data, artificial white noise of different magnitudes is added to test the performance of the proposed methods. There is no missing data in the selected data file. The following noise scenarios are considered.
Simulated white noise has a Gaussian distribution with a noise level of 1-3% of the measurement signals. The noise occurs at random times, with the total occurrence set to 20-60% of the overall duration. Because both the magnitude and the occurrence of the contaminated measurements are random, these scenarios test the flexibility of the proposed methodology and its ability to address unpredictable issues.

Figure 8 Phase angle and frequency with artificial white noise: 5-minute overview (upper left), noise at various levels (upper right), comparison of magnitude (lower left), similarly noised frequency measurements (lower right). Legend: raw data; 1% Mag, 20% ToD; 3% Mag, 60% ToD; 3% Mag, 40% ToD; 1% Mag, 40% ToD.

Figure 8 shows the angle and frequency measurements after various levels of noise have been added. The two panels on the top show the absolute voltage angle with 1-3% white noise, where the upper right panel is a closer look at the noised signals. The lower left panel compares the noise levels, whose effect generally becomes more obvious as the measured value increases with time. The lower right panel depicts the noised frequency signals. Since a 1% change corresponds to roughly 0.6 Hz (1% of the nominal 60 Hz), the noisy measurement points can be easily differentiated from the clean data points.

B. Outlier Detection

As Figure 8 shows, the contaminated frequency data points, i.e. outliers, are relatively easy to identify. However, the noise applied to the angle measurements is closely coupled with the clean data points and follows roughly the same trend as time progresses. It is therefore challenging to use a rule-based approach, i.e. angle thresholds, to separate the contaminated angle measurements. The proposed principal component analysis method, however, successfully captures the correlations between angle and frequency and applies them for outlier identification. Figure 9 depicts the outliers in red and the clean data in blue. It is confirmed that all contaminated data points are detected.
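The noise scenario and a correlation-based PCA detector can be sketched as below. This is only an illustration under several assumptions: the six frequency channels are synthetic rather than the Texas Synchrophasor Network data, only frequency (not angle) is used, and the detection statistic (squared PCA reconstruction error with a median/MAD threshold) is a common choice that the paper does not specify. The 1% magnitude and 40% occurrence match one of the scenarios above; the 30 samples-per-second rate is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic frequency channels for 6 PMUs over 5 minutes (assumed 30 samples/s).
fs, duration_s, n_pmu = 30, 300, 6
t = np.arange(0, duration_s, 1 / fs)
common = 60.0 + 0.02 * np.sin(2 * np.pi * 0.1 * t)   # shared grid frequency
freq = common[:, None] + rng.normal(0, 0.0002, size=(t.size, n_pmu))

# Contaminate PMU 0: Gaussian noise at 1% of 60 Hz on a random 40% of samples.
contaminated = rng.random(t.size) < 0.40
noisy = freq.copy()
noisy[contaminated, 0] += rng.normal(0, 0.01 * 60.0, size=contaminated.sum())

# One principal component captures the correlation shared by all channels.
scaled = StandardScaler().fit_transform(noisy)
pca = PCA(n_components=1).fit(scaled)
residual = scaled - pca.inverse_transform(pca.transform(scaled))

# Squared prediction error: large when a sample breaks the learned correlation.
spe = np.sum(residual**2, axis=1)

# Robust threshold from the median and the median absolute deviation (MAD).
med = np.median(spe)
mad = np.median(np.abs(spe - med))
threshold = med + 6 * 1.4826 * mad
flagged = spe > threshold

recall = (flagged & contaminated).sum() / contaminated.sum()
false_alarm = (flagged & ~contaminated).sum() / (~contaminated).sum()
print(f"recall: {recall:.3f}, false alarm rate: {false_alarm:.3f}")
```

In this sketch, contaminated samples whose random perturbation happens to be close to zero are not flagged, since they are practically indistinguishable from clean samples.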
Figure 9 Angle and frequency for 1 PMU with 1% noise level and 40% total duration.

Figure 10 shows that the tri-modal distribution of frequency is captured and adopted as the clustering criterion. The root of the tree divides the data into two groups; the group with frequencies equal to or above 61 Hz forms one cluster, shown in orange.
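A tree of this kind can be reproduced in miniature with scikit-learn. The sketch below is illustrative only: the frequency values are synthetic, the three labels are assigned directly as a stand-in for cluster memberships, and the learned thresholds depend on this synthetic data, so they will not coincide exactly with the 59 Hz and 61 Hz values read from Figure 10.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic tri-modal frequency data (assumption): a tight normal band near
# 60 Hz plus outlier groups well below and well above it.
low = rng.uniform(57.0, 58.8, 200)
normal = rng.uniform(59.98, 60.02, 600)
high = rng.uniform(61.2, 63.0, 200)
freq = np.concatenate([low, normal, high]).reshape(-1, 1)

# Labels standing in for cluster memberships: 0 = low, 1 = normal, 2 = high.
labels = np.repeat([0, 1, 2], [200, 600, 200])

# A depth-2 tree is enough to express two frequency thresholds.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(freq, labels)

# Print the learned rules in readable form.
print(export_text(tree, feature_names=["frequency_hz"]))

# Internal nodes have feature index >= 0; leaves are marked with a negative index.
is_split = tree.tree_.feature >= 0
split_thresholds = sorted(tree.tree_.threshold[is_split])
```

Each printed rule has the same "if the condition holds go left, otherwise go right" form as the nodes in the classification tree figure.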
The group at or above 61 Hz contains all the outliers that are higher than the normal values. The rest of the data is further split into two groups, where the group with frequencies lower than 59 Hz contains the outliers that are lower than the normal values. The group in the green box is the collection of all normal measurements.

Figure 10 Classification tree used to differentiate contaminated data points (with 3% noise level and 40% total duration).

C. Nuisance Data Detection

Sensors sometimes malfunction or are not calibrated properly, which produces nuisance data. This type of data quality problem is difficult to detect unless the measurements are far outside the expected range. In this paper, we propose to use the correlation among measurements to identify problematic measurements caused by device malfunction. It is assumed that measurements from different locations are bound by the underlying physical rules, so the correlation of the data should be relatively consistent. For example, the time derivative of angle yields frequency. The malfunction of one device will therefore change the data correlation, and the new pattern will not revert on its own unless the data channel is removed. To test this approach, a vertical shift of 3% is applied to one PMU angle measurement after 85 seconds (out of 1 hour). The assumption is that this PMU was calibrated incorrectly and henceforth yields wrong angle measurements. Figure 11 shows that the change was captured immediately, as marked in red. However, both frequencies and angles are marked as problematic by the PCA method, since it only detects a change of correlation among data channels and cannot indicate the root cause. The decision tree method [10] is then applied to explore the cause of the change, and it identifies the angle measured by PMU 1 as the differentiator (Figure 12).

Figure 11 Angle and frequency when 1 PMU angle measurement has vertical shift.

Figure 12 Classification tree used to differentiate contaminated data points.

V. CONCLUSION AND FUTURE WORK

A knowledge framework for sensor data quality control is proposed to identify abnormal data points through feature extraction and clustering techniques. Supervised learner models are then used to describe the data signature for normal operating conditions, which can be incorporated into future decision making. The prototype of the framework has been described in detail on the paired wind cup anemometer data, which demonstrated its feasibility. It was then further demonstrated on dynamic time series data, i.e. PMU measurements. The whole process is automatic and requires minimal domain expert input, since it leverages the data signature and the correlation among measurements. The experimental results show that the proposed framework functions as expected on both types of datasets. Future research will take into account location and topology (spatial) information as well as the associated temporal information to refine the predictions.

VI. REFERENCES

[1] M. Gol and A. Abur, "A Fast Decoupled State Estimator for Systems Measured by PMUs," IEEE Transactions on Power Systems, vol. PP, no. 99, pp. 1-6, 2014.
[2] M. Ariff, B. Pal and A. Singh, "Estimating Dynamic Model Parameters for Adaptive Protection and Control in Power System," IEEE Transactions on Power Systems, vol. PP, no. 99, pp. 1-11, 2014.
[3] V. Vijay, "Application of phasor measurements for dynamic security assessment using decision trees," IEEE PES General Meeting, San Diego, July 2012.
[4] N. Zhou and J. Dagle, "Initial Results in Using a Self-Coherence Method for Detecting Sustained Oscillations," IEEE Transactions on Power Systems, vol. 30, no. 1, 2014.
[5] M. Wang, J. H. Chow, et al., "A Low-Rank Matrix Approach for the Analysis of Large Amounts of Power System Synchrophasor," 48th Hawaii International Conference on System Sciences, 2015.
[6] O. Al-Khatib, W. Hardjawana and B. Vucetic, "Traffic Modeling and Optimization in Public and Private Wireless Access Networks for Smart Grids," IEEE Transactions on Smart Grid, vol. 5, issue 4.
[7] J. S. Erkelens and R. Heusdens, "Tracking of Nonstationary Noise Based on Data-Driven Recursive Noise Power Estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, issue 6.
[8] S. Ding, H. Zhu, W. Jia and C. Su, "A survey on feature extraction for pattern recognition," Artificial Intelligence Review, vol. 37, no. 3.
[9] C. Fraley and A. E. Raftery, "Model-based clustering, discriminant analysis, and density estimation," Journal of the American Statistical Association, vol. 97, no. 458.
[10] L. Breiman, J. Friedman, C. J. Stone and R. A. Olshen, Classification and Regression Trees. CRC Press.
[11] D. Siegel and J. Lee, "An auto-associative residual processing and K-means clustering approach for anemometer health assessment," International Journal of Prognostics and Health Management, vol. 2, 2011.
More informationOptimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing
Optimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing A. Rao +, A.P. Jayasumana * and Y.K. Malaiya* *Colorado State University, Fort Collins, CO 8523 + PalmChip Corporation,
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationChapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction
CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro