
Griffith Research Online
https://research-repository.griffith.edu.au

Simultaneous Design of Feature Extractor and Pattern Classifier Using the Minimum Classification Error Training Algorithm

Author(s): Paliwal, Kuldip; Bacchiani, M.; Sagisaka, Y.
Published: 1995
Conference Title: Proc. IEEE Workshop on Neural Networks for Signal Processing
DOI: https://doi.org/10.1109/nnsp.1995.514880
Copyright Statement: Copyright 1995 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Downloaded from: http://hdl.handle.net/10072/19781

SIMULTANEOUS DESIGN OF FEATURE EXTRACTOR AND PATTERN CLASSIFIER USING THE MINIMUM CLASSIFICATION ERROR TRAINING ALGORITHM

K.K. Paliwal*, M. Bacchiani and Y. Sagisaka
ATR Interpreting Telecommunications Res. Labs.
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan

ABSTRACT

Recently, a minimum classification error training algorithm has been proposed for minimizing the misclassification probability based on a given set of training samples using a generalized probabilistic descent method. This algorithm is a type of discriminative learning algorithm, but it approaches the objective of minimum classification error in a more direct manner than the conventional discriminative training algorithms. We apply this algorithm to the simultaneous design of the feature extractor and the pattern classifier, and demonstrate some of its properties and advantages.

1. INTRODUCTION

Juang and Katagiri [1] have recently proposed a minimum classification error training algorithm which minimizes the misclassification probability based on a given set of training samples using a generalized probabilistic descent method. This algorithm is a type of discriminative learning algorithm, but it approaches the objective of minimum classification error in a more direct manner than the conventional discriminative training algorithms [2]. Because of this, it has been used in a number of pattern classification applications [3, 4, 5, 6, 7, 8]. For example, Chang et al. [3] and Komori and Katagiri [4] have used this algorithm for designing the pattern classifier for dynamic time-warping based speech recognition, Chou et al. [5] and Rainton and Sagayama [6] for hidden Markov model (HMM) based speech recognition, Sukkar and Wilpon [7] for word spotting, and Liu et al. [8] for HMM-based speaker recognition.

More recently, the minimum classification error training algorithm has been used for feature extraction [9, 10, 11, 12]. For example, Biem and Katagiri have used it for determining the parameters of a cepstral lifter [9] and a filter bank [10]. They have found that the resulting parameters of the cepstral lifter and filter bank have a good physical interpretation. Bacchiani and Aikawa [11] have used this algorithm for designing a dynamic cepstral filter. Watanabe and Katagiri [12] have used a class-dependent unitary transformation for feature extraction whose parameters are determined by the minimum classification error training algorithm.

*Present Address: School of Microelectronic Engineering, Griffith University, Brisbane, QLD 4111, Australia

In the present paper, we use the minimum classification error (MCE) training algorithm to design the feature extractor as well as the classifier. Consider the canonic pattern recognizer shown in Fig. 1. The aim of the pattern recognizer is to classify an input signal as one of K classes. This is done in two steps: a feature analysis step and a pattern classification step. In the feature analysis step, the input signal is analyzed and D parameters are measured. The D-dimensional parameter vector X is processed by the feature extractor, which produces at its output a feature vector Y of dimensionality at most D. In our study, we use a D x D linear transformation T for feature extraction; i.e., Y = TX. The transformation T is a general linear transformation; i.e., it is not restricted, for example, to be unitary as done by Watanabe and Katagiri [12]. The feature vector Y is a D-dimensional vector in our study. In the pattern classification step, the feature vector is compared with each of the K class models and a dissimilarity (or distance) measure is computed for each class. The class that gives the minimum distance is considered to be the recognized class.

Figure 1: A pattern recognition system.

In the present paper, we study the minimum classification error training algorithm for the following configurations of feature extractor and classifier:

- Configuration 1: This is a baseline configuration where the transformation T is unity and the class models are determined by the maximum-likelihood (ML) training algorithm (as the class-dependent means of the training vectors).
- Configuration 2: Here the transformation T is kept fixed to a unity matrix and the class models are computed using the MCE training algorithm.
- Configuration 3: Here the transformation T is computed by the MCE training algorithm and the class models are kept fixed to the baseline class models.
- Configuration 4: This configuration is similar to Configuration 3, except that the transformation T is applied to the parameter vector as well as to the class models of the baseline configuration prior to distance computation.
- Configuration 5: Here both the transformation T and the class models are computed independently using the MCE training algorithm.
- Configuration 6: This configuration is similar to Configuration 3, except that the transformation T is made class-dependent; i.e., we now have K different transformations, T_k, k = 1, 2, ..., K, for the K different classes. We apply transformation T_k to the parameter vector X before computing its distance from the k-th class.
- Configuration 7: This configuration is similar to Configuration 5, except that the transformation T is made class-dependent (similar to Configuration 6); i.e., we now have K different transformations, T_k, k = 1, 2, ..., K, for the K different classes.

Note that Configurations 2, 3, 4 and 5 use a class-independent transformation as shown in Fig. 2, while Configurations 6 and 7 use a class-dependent transformation as shown in Fig. 3; a minimal sketch of the class-dependent case is given below.

Figure 2: A pattern recognition system with class-independent transformation.
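To make the distinction between Figs. 2 and 3 concrete, the following is a minimal sketch, not taken from the paper, of how a class-dependent set of transformations T_k would be used, assuming the squared Euclidean distance measure introduced in Section 2; the function name and array shapes are illustrative only.

```python
import numpy as np

def class_dependent_distances(x, Ts, prototypes):
    """Distances for Configurations 6 and 7 (Fig. 3): one transformation per class.

    x: (D,) parameter vector; Ts: (K, D, D) class-dependent transformations T_k;
    prototypes: (K, D) class models m^(k). Returns the K distances ||T_k x - m^(k)||^2.
    """
    Y = Ts @ x                                   # Y_k = T_k x for each class k, shape (K, D)
    return np.sum((Y - prototypes) ** 2, axis=1)

# The class-independent scheme of Fig. 2 (Configurations 2-5) is the special case in which
# all T_k are the same matrix T, so a single Y = TX is compared with every class model m^(k).
```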

Figure 3: A pattern recognition system with class-dependent transformation.

2. MCE TRAINING ALGORITHM

In this section, we describe briefly the minimum classification error (MCE) training algorithm. For more details, see [1]. The MCE algorithm is described here only for Configuration 5. It can be extended to other configurations in a straightforward manner.

In the pattern recognition system shown in Fig. 1, the input parameter vector X is transformed to a feature vector Y (= TX) and classified into class i if

    D_i \leq D_j,  for all j \neq i,    (1)

where D_i is the distance of feature vector Y from class i. In the present paper, we use a simple Euclidean distance measure to define this distance. It is given by

    D_i = \|Y - m^{(i)}\|^2,    (2)

where m^{(i)} is the prototype vector representing class i.

Here, we are given a total of P labeled training vectors; i.e., we have P parameter vectors x^{(1)}, x^{(2)}, ..., x^{(P)} available for training with the corresponding classifications C^{(1)}, C^{(2)}, ..., C^{(P)} known to us. Our aim here is to use the MCE algorithm to estimate the transformation matrix T and the class prototype vectors m^{(1)}, m^{(2)}, ..., m^{(K)} using these labeled training vectors. The procedure for doing this is described below.
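As a minimal sketch of the decision rule in (1) and (2), given here purely as an illustration rather than the authors' code, the recognizer transforms the parameter vector and picks the class whose prototype is nearest in the squared Euclidean sense:

```python
import numpy as np

def classify(x, T, prototypes):
    """Return the class index i minimizing D_i = ||Tx - m^(i)||^2, cf. Eqs. (1)-(2).

    x: (D,) parameter vector; T: (D, D) transformation; prototypes: (K, D) prototypes m^(i).
    """
    y = T @ x                                      # feature extraction, Y = TX
    D = np.sum((y - prototypes) ** 2, axis=1)      # distance of Y from each class prototype
    return int(np.argmin(D))                       # the class giving the minimum distance
```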

We use this distance to define the misclassification measure for the p-th training vector as follows:

    d^{(p)} = D^{(p)}_{C^{(p)}} - D^{(p)}_{j^{(p)}},    (3)

where D^{(p)}_{C^{(p)}} is the distance of the p-th training vector from its known class C^{(p)} and the distance D^{(p)}_{j^{(p)}} is computed from the relation

    D^{(p)}_{j^{(p)}} = \min_{j \neq C^{(p)}} D^{(p)}_j.    (4)

The loss function L^{(p)} for the p-th training vector is then defined as the sigmoid of the misclassification measure as follows:

    L^{(p)} = f(d^{(p)})    (5)

with

    f(d^{(p)}) = \frac{1}{1 + e^{-a d^{(p)}}},    (6)

where a is a parameter defining the slope of the sigmoid function. The total loss function L is defined as

    L = \sum_{p=1}^{P} L^{(p)}.    (7)

In the MCE algorithm, the transformation matrix and class prototype vectors are obtained by minimizing this loss function through the steepest gradient descent algorithm. This is an iterative algorithm where the parameters at the (k+1)-th iteration are computed from the k-th iteration results as follows:

    T_{k+1} = T_k - \eta \left. \frac{\partial L}{\partial T} \right|_{T = T_k}    (8)

and

    m^{(i)}_{k+1} = m^{(i)}_k - \eta \left. \frac{\partial L}{\partial m^{(i)}} \right|_{m^{(i)} = m^{(i)}_k},  i = 1, 2, ..., K,    (9)

where \eta is a positive constant (known as the adaptation constant) and the gradients follow from (2)-(7) by the chain rule,

    \frac{\partial L}{\partial T} = \sum_{p=1}^{P} a L^{(p)} (1 - L^{(p)}) \left[ \frac{\partial D^{(p)}_{C^{(p)}}}{\partial T} - \frac{\partial D^{(p)}_{j^{(p)}}}{\partial T} \right]    (10)

and

    \frac{\partial L}{\partial m^{(i)}} = \sum_{p=1}^{P} a L^{(p)} (1 - L^{(p)}) \left[ \frac{\partial D^{(p)}_{C^{(p)}}}{\partial m^{(i)}} - \frac{\partial D^{(p)}_{j^{(p)}}}{\partial m^{(i)}} \right].    (11)

For the initialization of the MCE algorithm, the transformation matrix T is taken to be a unity matrix. The prototype vectors for the different classes are initialized by their maximum likelihood estimates (i.e., by their class-conditioned means).

3. RESULTS

The MCE algorithm is studied here on a multispeaker vowel recognition task. The Peterson-Barney vowel data base [13] is used for this purpose. Here each vowel is represented in terms of 4 parameters: the fundamental frequency and the frequencies of the first three formants. The data base consists of two repetitions of 10 vowels in /hVd/ context recorded from 76 speakers (33 men, 28 women and 15 children). Fundamental and formant frequencies were measured by Peterson and Barney from the central steady-state portions of the /hVd/ utterances. We use the first repetition for training and the second for testing. We use the Euclidean distance measure for classification of the 4-dimensional parameter vector into one of the 10 classes. The model for each class is determined as the mean vector of the training patterns of that class.

In our implementation of the MCE training algorithm, we use the steepest gradient descent algorithm with the adaptation parameter updated at every iteration using a fast-converging algorithm described by Lucke [14]. In order to study the convergence properties of this algorithm, we study it for Configuration 5. Figure 4 shows the loss function and the recognition error on training and test data as a function of iteration number. It can be seen from this figure that the loss function decreases with the number of iterations and the algorithm converges reasonably fast (within 500 iterations). Also, recognition results on test data are similar to those on training data, showing the generalization property of the algorithm.
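Pulling the pieces of Section 2 together, the following is a minimal sketch of MCE training for Configuration 5 under the assumptions stated above: T initialized to the identity, prototypes initialized to the class-conditioned means, the misclassification measure of (3)-(4), the sigmoid loss of (5)-(7), and plain steepest descent with a fixed adaptation constant eta (the paper instead updates the adaptation parameter every iteration with the fast-converging scheme of Lucke [14]). This is an illustration, not the authors' implementation; the function name and default constants are assumptions.

```python
import numpy as np

def mce_train_config5(X, labels, K, alpha=1.0, eta=1e-4, iterations=500):
    """Minimal MCE/GPD sketch for Configuration 5 (class-independent T, MCE-trained prototypes).

    X: (P, D) labeled training parameter vectors; labels: (P,) class indices in 0..K-1;
    alpha: slope parameter a of the sigmoid; eta: adaptation constant.
    """
    P, D = X.shape
    T = np.eye(D)                                                   # T initialized to the unity matrix
    m = np.stack([X[labels == k].mean(axis=0) for k in range(K)])   # ML (class-mean) prototypes

    for _ in range(iterations):
        grad_T = np.zeros_like(T)
        grad_m = np.zeros_like(m)
        for x, c in zip(X, labels):
            y = T @ x
            diff = y - m                                    # (K, D): Tx - m^(k) for each class
            dist = np.sum(diff ** 2, axis=1)                # D_k = ||Tx - m^(k)||^2
            j = int(np.argmin(np.where(np.arange(K) == c, np.inf, dist)))  # best competing class
            d = dist[c] - dist[j]                           # misclassification measure d^(p), Eq. (3)
            L = 1.0 / (1.0 + np.exp(-alpha * d))            # sigmoid loss L^(p), Eqs. (5)-(6)
            w = alpha * L * (1.0 - L)                       # dL^(p)/dd^(p)
            # Chain rule: dD_k/dT = 2 (Tx - m^(k)) x^T and dD_k/dm^(k) = -2 (Tx - m^(k))
            grad_T += w * 2.0 * (np.outer(diff[c], x) - np.outer(diff[j], x))
            grad_m[c] += w * (-2.0) * diff[c]
            grad_m[j] += w * 2.0 * diff[j]
        T -= eta * grad_T                                   # steepest-descent updates, Eqs. (8)-(9)
        m -= eta * grad_m
    return T, m
```

For the Peterson-Barney task above (K = 10, D = 4), this corresponds to updating D^2 + KD = 16 + 40 = 56 parameters, matching the Configuration 5 row of Table 1.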

The MCE algorithm is studied for all seven configurations. The results for the different configurations are listed in Table 1. Column 2 of this table lists the total number of free parameters used in the transformation and the classifier. Column 3 lists the number of parameters computed by the MCE training algorithm. The numbers shown within square brackets in columns 2 and 3 correspond to the vowel recognition task used in this study, where K = 10 and D = 4.

Figure 4: Results for Configuration 5 as a function of iteration number. (a) Loss function, (b) Recognition error rate (in %) on training data, and (c) Recognition error rate (in %) on test data.

Table 1: Recognition error rate (in %) for different configurations of feature extractors and classifiers studied using the minimum classification error (MCE) training algorithm

Configuration | Total no. of free parameters | No. of parameters updated by MCE | Training | Test
Conf. 1       | KD [40]                      | 0 [0]                            | 48.29    | 48.16
Conf. 2       | KD [40]                      | KD [40]                          | 34.47    | 36.18
Conf. 3       | D^2 + KD [56]                | D^2 [16]                         | 33.16    | 33.29
Conf. 4       | D^2 + KD [56]                | D^2 [16]                         | 13.95    | 16.32
Conf. 5       | D^2 + KD [56]                | D^2 + KD [56]                    | 9.87     | 11.45
Conf. 6       | K(D^2 + D) [200]             | KD^2 [160]                       | 9.47     | 12.37
Conf. 7       | K(D^2 + D) [200]             | K(D^2 + D) [200]                 | 9.34     | 12.76

From this table, we can make the following observations:

1. The MCE training algorithm performs better than the ML algorithm (compare Configuration 1 with Configuration 2).

2. The recognition performance of the pattern recognizer improves with an increase in the total number of free parameters in the transformation and the classifier.

3. For a given number of free parameters, the recognition performance improves as more of the parameters are updated by the MCE training algorithm (compare Configuration 3 with Configuration 5, or Configuration 6 with Configuration 7).

4. Observe Configurations 3 and 4. Both of these configurations have the same number of free parameters, and the same number of parameters are updated by the MCE training algorithm. However, Configuration 4 gives significantly better results than Configuration 3. This is because in Configuration 4 the transformation T is applied to the parameter vector X as well as to the class models prior to distance computation.

5. Observe Configurations 5 and 7. Configuration 5 uses a class-independent transformation, while Configuration 7 uses class-dependent transformations. Therefore, Configuration 7 shows better recognition performance than Configuration 5 on training data, though the difference in the recognition rates for the two configurations is small. However, note that the recognition performance of Configuration 7 on test data is inferior to that of Configuration 5. This happens because the number of parameters updated by the MCE algorithm is too large for the limited amount of data available for training and, hence, the results do not generalize properly to the test data.

4. SUMMARY

In this paper, we have studied the use of the minimum classification error (MCE) training algorithm for the design of the feature extractor and the pattern classifier. We have investigated a number of configurations of the feature extractor and the pattern classifier, and have demonstrated a number of properties and advantages of the MCE training algorithm.

References

[1] B.H. Juang and S. Katagiri, Discriminative learning for minimum error classification, IEEE Trans. Signal Processing, Vol. 40, No. 12, pp. 3043-3054, Dec. 1992.

[2] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, New York: Wiley, 1973.

[3] P.C. Chang, S.H. Chen and B.H. Juang, Discriminative analysis of distortion sequences in speech recognition, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1991, Vol. 1, pp. 549-552.

[4] T. Komori and S. Katagiri, GPD training of dynamic programming-based speech recognizers, Journal of Acoust. Soc. of Japan (E), Vol. 13, No. 6, pp. 341-349, Nov. 1992.

[5] W. Chou, B.H. Juang and C.H. Lee, Segmental GPD training of HMM based speech recognizer, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1992, Vol. 1, pp. 473-476.

[6] D. Rainton and S. Sagayama, Minimum error classification training of HMMs - Implementation details and experimental results, Journal of Acoust. Soc. of Japan (E), Vol. 13, No. 6, pp. 379-387, Nov. 1992.

[7] R.A. Sukkar and J.G. Wilpon, A two-pass classifier for utterance rejection in keyword spotting, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1993, Vol. 2, pp. 451-454.

[8] C.S. Liu, C.H. Lee, W. Chou, B.H. Juang and A.E. Rosenberg, A study on minimum error discriminative training for speaker recognition, Journal of Acoust. Soc. of Am., Vol. 97, No. 1, pp. 637-648, Jan. 1995.

[9] A. Biem and S. Katagiri, Feature extraction based on minimum classification error/generalized probabilistic descent method, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1993, Vol. 2, pp. 275-278.

[10] A. Biem and S. Katagiri, Filter bank design based on discriminative feature extraction, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1994, Vol. 1, pp. 485-488.

[11] M. Bacchiani and K. Aikawa, Optimization of time-frequency masking filters using the minimum classification error criterion, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1994, Vol. 2, pp. 197-200.

[12] H. Watanabe, T. Yamaguchi and S. Katagiri, Discriminative metric design for pattern recognition, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1995, Vol. 5, pp. 3439-3442.

[13] G. Peterson and H.L. Barney, Control methods used in a study of the vowels, Journal of Acoust. Soc. of Am., Vol. 24, pp. 175-184, 1952.

[14] H. Lucke, On the representation of temporal data for connectionist word recognition, Ph.D. Thesis, Cambridge University, Nov. 1991.