CHAPTER 3. Preprocessing and Feature Extraction Techniques

3.1 Need for Preprocessing and Feature Extraction Schemes for Pattern Recognition and Discrimination

With the advent of fast computers equipped with large storage and data acquisition systems, researchers in diverse application areas such as engineering, medicine, geophysics and astronomy face the problem of processing and analyzing huge volumes of observations for decision-making. Such datasets, in contrast with the smaller, more traditional datasets that have been studied extensively in the past, present new challenges in data analysis. Though traditional statistical methods provide reasonably good solutions in general, an increase in the number of samples in real-time studies is usually accompanied by an increase in the number of variables measured in each observation. The dimension of the data, which is the number of variables measured on each observation, hence plays a vital role in processing such datasets. In general, dimensionality reduction may be accomplished either by eliminating data that are closely related to the rest of the data in the set, or by merging data into a smaller set of representative features. The trade-off between the accuracy offered by the complete dataset and the computational cost of retaining all the parameters without feature extraction or selection plays a vital role in pattern recognition performance. This trade-off is closely tied to a central issue in pattern recognition known as the curse of dimensionality.

While preprocessing refers to the construction of features (attributes) from a set of raw data using techniques such as standardization (scaling), normalization (centering), extraction of local attributes (kernel or syntactic methods) and attribute discretization (discrete and finite sets), the role of feature extraction is to acquire the most appropriate set of information from the original data, so that the information can be represented in a lower-dimensional space through feature selection, assessment of an evaluation criterion and similar steps. Since both components play a vital role in pattern recognition, an appropriate choice of preprocessing and feature extraction is essential to obtain optimal results in the classification task.

3.2 Methods of Feature Extraction and Their Relevance to PD Pattern Recognition: An Overview

Data from the PD measurement and acquisition system that best describe the dynamics of the discharge patterns are obtained as φ-q-n characteristics. Though several researchers have utilized both the phase-resolved and the time-resolved approaches for PD pattern recognition of insulation, this study resorts to the former, since it has been observed that each discharge pulse reflects the physical process at the discharge site and a strong relationship has been established between the nature of the patterns and the type of defect. In this research, the data describing the source of PD are characterized into seven categories based on the phase window technique:

1. Measures based on maximum values of q (10° and 30°);
2. Measures based on minimum values of q (10° and 30°);
3. Measures based on central tendency (10° and 30°);
4. Measures based on types of mean values (10° and 30°);
5. Measures based on mean-slope-angle (10° and 30°);
6. Measures based on statistical moments; and
7. Measures based on the Two Pass Split Window (TPSW) scheme (10° and 30°).

3.3 Types of Feature Extraction Techniques for PD Pattern Recognition

3.3.1 Measures based on Basic Statistical Operators and Inequality Relationship of Types of Mean Measures

These measures are primarily used for classification of PD patterns utilizing the phase resolved PD (PRPD) approach. Basic univariate statistical analysis is used to obtain the statistical operators representing the distribution of the φ-q-n patterns. Initially, a phase window representation of the pulse distribution is used to compute basic statistical operators describing central tendency (mean, median and mode), dispersion (standard deviation, quartile deviation and range) and the extremes of the discharge values (maximum and minimum quantities), in order to ascertain the role played by the distribution of pulses in the PD signatures taken up for study. Throughout the analysis, phase windows of 30° and 10° have been used, since these widths were found to give a reasonably good indication of the issues related to the curse of dimensionality during the training phase. The following measures are considered during the analysis:

1. Measures based on maximum and minimum values:
   (a) φ-q_max-n (phase window of 10°)
   (b) φ-q_min-n (phase window of 30°)
   (c) φ-q-n_max (phase window of 10°)
   (d) φ-q-n_min (phase window of 30°)
2. Measures of central tendency (magnitude of q and number of occurrences n):
   (a) Mean, median and mode (10° window)
   (b) Mean, median and mode (30° window)
3. Measures of dispersion (magnitude of q and number of occurrences n):
   (a) Range, variance, standard deviation and quartile deviation (10° window)
   (b) Range, variance, standard deviation and quartile deviation (30° window)

A new approach that utilizes statistical measures pertaining to types of mean and their inequality expression has also been exploited in this research. This approach has recently been used by a few researchers in allied fields of engineering [75], who have reported significant success, since it serves as an effective technique for reducing dimensionality while conserving the relationship among the input feature vectors. Hence an attempt has also been made in this research to ascertain the effectiveness of the proposed inequality expression in providing a compact set of extracted features. The following phase window measures (30° and 10° windows) have been taken up for analysis:

- Harmonic Mean (HM)
- Geometric Mean (GM)
- Arithmetic Mean (AM)
- Root Mean Square (RMS)
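As an illustration of the phase window measures listed above, the following minimal Python sketch groups discharge magnitudes into phase windows and computes, for each non-empty window, the pulse count, the maximum and minimum magnitude, and the four types of mean. The function name `phase_window_means`, the input layout (one phase angle in degrees and one strictly positive magnitude per pulse) and the decision to skip empty windows are assumptions made for illustration; they are not taken from the study. For positive data the four means satisfy the well-known inequality HM ≤ GM ≤ AM ≤ RMS, which can serve as a sanity check on the output.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean


def phase_window_means(phases_deg, charges, window_deg=30):
    """Compute per-window features from phase-resolved pulse data.

    phases_deg : phase angle of each pulse, in degrees
    charges    : discharge magnitude q of each pulse (assumed > 0)
    window_deg : phase window width in degrees (e.g. 10 or 30)

    Returns a dict mapping window index -> dict of features.
    Empty windows are skipped (an illustrative choice).
    """
    n_windows = 360 // window_deg
    windows = [[] for _ in range(n_windows)]

    # Assign each pulse to its phase window.
    for phi, q in zip(phases_deg, charges):
        idx = min(int((phi % 360) // window_deg), n_windows - 1)
        windows[idx].append(q)

    features = {}
    for idx, qs in enumerate(windows):
        if not qs:
            continue
        features[idx] = {
            "n": len(qs),                                # pulse count
            "q_max": max(qs),                            # maximum magnitude
            "q_min": min(qs),                            # minimum magnitude
            "hm": harmonic_mean(qs),                     # harmonic mean
            "gm": geometric_mean(qs),                    # geometric mean
            "am": fmean(qs),                             # arithmetic mean
            "rms": math.sqrt(fmean(q * q for q in qs)),  # root mean square
        }
    return features
```

With a 30° window the full 0° to 360° cycle is reduced to at most 12 windows, and with a 10° window to at most 36, which is how the phase window width controls the dimensionality of the resulting feature vector.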

3.3.2 Measures based on Higher Order Statistical Moments as Mathematical Descriptors

Since several studies [11] have utilized the traditional statistical operators as mathematical descriptors for PD pattern recognition with a considerable level of success, this study also analyzes the role played by the mathematical descriptors, namely mean, standard deviation, kurtosis, skewness, cross-correlation and modified cross-correlation, in obtaining the features that describe the pulse pattern signatures. These descriptors have been computed for 10° and 30° phase windows.

3.3.3 Measures based on Two Pass Split Window (TPSW) Filter Technique

Recently, a few research studies have effectively utilized the TPSW technique for preprocessing and feature extraction in diverse fields such as speech recognition [76], sonar signal processing [77] and target recognition [78]. This scheme earlier found wide application in audio signal processing, wherein the primary focus is to reduce the influence of spurious noise artifacts that are invariably introduced during recording or playback, while preserving the original sound to a considerable extent. It has been reported recently that this scheme has been successfully utilized in the classification of underwater sonar signals radiated by ships. This aspect has been clearly delineated in [39] wherein, in essence, the spectrum of such radiated sonar signals consists of two different spectral types:

1. Broad-band, which comprises a continuous spectrum; and
2. Narrow-band, which has a discontinuous spectrum containing components occurring at discrete frequencies, wherein the long pulses are filtered from the tonals.

Hence, effective extraction of tonal features from mixed spectra is the essence of applications involving classification of sonar signals. This task is analogous to extracting the discontinuous pulsating components of the PD patterns, since these correspond well to the tones of the radiated sonar spectrum. The TPSW filtering scheme provides a mechanism for obtaining smooth local-mean estimates of the signal notwithstanding the presence of spurious noisy spikes in the analyzed signal. The basic concept involves carrying out a simple moving-average filtering over a segment of the signal comprising a long pulse. In this approach, the continuous spectrum is estimated first and the tonal components are subsequently extracted. The algorithm for the TPSW scheme is indicated in Figure 3.1.

Step 1: For a signal f(k), select a window centered on bin k:
\[ R_k = \{\, k-M,\; k-M+1,\; \ldots,\; k,\; \ldots,\; k+M-1,\; k+M \,\} \]
(the number of bins in the window is 2M+1).

Step 2: First pass: compute the local mean:
\[ \bar{f}(k) = \frac{1}{2M+1} \sum_{i=k-M}^{k+M} f(i) \]

Step 3: Form a clipped sequence (to avoid biasing the estimate of the local mean due to a tonal):
\[ g(k) = \begin{cases} f(k), & f(k) \le \alpha\,\bar{f}(k) \\ \bar{f}(k), & f(k) > \alpha\,\bar{f}(k) \end{cases} \]
where α is a constant usually set to 0.5 and serves as an a-priori estimate.

Step 4: Second pass: obtain the continuous spectrum (broad-band component) by calculating the local mean using g(k):
\[ m(k) = \frac{1}{2M+1} \sum_{i=k-M}^{k+M} g(i) \]

Step 5: Compute the narrow-band component h(k):
\[ h(k) = f(k) - m(k) \]

Figure 3.1: Algorithm for TPSW Feature Extraction Scheme
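The five steps of Figure 3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `local_mean` and `tpsw` are hypothetical, the averaging window is truncated at the two ends of the signal (the figure does not specify edge handling), and the default α = 0.5 simply follows the value quoted in the figure.

```python
def local_mean(x, M):
    """Moving average over R_k = {k-M, ..., k+M}.

    Assumption: near the ends of the signal the window is truncated
    to the available samples and the mean is taken over those only.
    """
    n = len(x)
    out = []
    for k in range(n):
        lo = max(0, k - M)
        hi = min(n, k + M + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def tpsw(f, M, alpha=0.5):
    """Two-pass local-mean scheme following Steps 1-5 of Figure 3.1.

    f     : sequence of signal values f(k)
    M     : half-width of the window (window length 2M+1)
    alpha : clipping constant from Step 3

    Returns (m, h): the broad-band estimate m(k) and the
    narrow-band component h(k) = f(k) - m(k).
    """
    # Step 2: first pass, local mean of the raw signal.
    f_bar = local_mean(f, M)

    # Step 3: clip samples that exceed alpha times the local mean,
    # replacing them by the local mean itself.
    g = [fb if fk > alpha * fb else fk for fk, fb in zip(f, f_bar)]

    # Step 4: second pass, local mean of the clipped sequence.
    m = local_mean(g, M)

    # Step 5: narrow-band component.
    h = [fk - mk for fk, mk in zip(f, m)]
    return m, h
```

For a flat signal containing one isolated spike, the narrow-band output h is zero well away from the spike and largest at the spike location, which is the behaviour the scheme exploits when separating pulse-like PD components from the smooth background.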

3.4 Summary

The need for preprocessing and feature extraction in pattern recognition studies, the relevance of data compaction and dimensionality reduction, and the Phase Resolved PD (PRPD) based phase window approach have been deliberated. An insight has been provided into the various preprocessing and feature extraction techniques based on simple statistical measures (central tendency, dispersion and range), stochastic measures based on higher order moments (skewness, kurtosis, correlation and cross-correlation), measures based on the statistical inequality expression for types of mean (arithmetic mean, harmonic mean, geometric mean and root mean square) and signal processing based measures (TPSW scheme). The following major aspects are summarized:

1. The need is evident for feature extraction techniques that become progressively more complex depending on the nature of the PD signature datasets, thus necessitating methodologies ranging from simple statistical measures to signal processing based measures.
2. The relevance of utilizing the novel TPSW scheme for feature extraction in PD pattern recognition is expounded. Its ability to discriminate signals in terms of two frequency bands (during two different phases of the algorithm) makes the scheme relevant for extracting appropriate features pertaining to PD pattern signatures.