Transductive Phoneme Classification Using Local Scaling And Confidence


2012 IEEE 27-th Convention of Electrical and Electronics Engineers in Israel

Matan Orbach, Dept. of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel, matanorb@tx.technion.ac.il

Koby Crammer, Dept. of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel, koby@ee.technion.ac.il

Abstract

We apply a graph-based Transduction Algorithm with COnfidence, named TACO, to the task of phoneme classification. In recent work, TACO outperformed two state-of-the-art transductive learning algorithms on several natural language processing tasks. However, although TACO is a general-purpose algorithm, it has not yet been used for tasks in other domains, nor applied to graphs with millions of vertices. We show its effectiveness, as well as its scalability, by performing transductive phoneme classification on data from the TIMIT speech corpus. In addition, we experiment with two methods for graph construction, including local scaling, previously used for unsupervised clustering. Our results show that local scaling combined with TACO outperforms other combinations of graph construction methods and graph-based transductive algorithms.

I. INTRODUCTION

A key challenge in automatic speech recognition systems is transcribing acoustic signals with phonemes. For this task, much research has been dedicated to the design of supervised classifiers, which take as input a set of acoustic signals annotated with corresponding phonemes [1,2]. However, such data may not always be available in the amounts required for achieving satisfactory classification performance. In contrast, unannotated speech signals can be easily obtained in a variety of languages and accents. For that reason, phoneme classification has recently attracted attention from researchers in the field of semi-supervised learning (SSL), and specifically graph-based methods.
SSL algorithms take as input a set of labeled data and an additional, typically large, unlabeled dataset. In graph-based SSL, the learner assumes the existence of an undirected weighted graph consisting of both labeled and unlabeled examples. Each input example is associated with a vertex. An edge weight is a measure of similarity between the two vertices the edge connects. In the transductive setting, the goal of the learner is to label the unlabeled examples in the graph.

We focus on graph-based transduction for phoneme classification. Each example is a vector of features describing a frame of the acoustic signal in time. For measuring similarity, Euclidean distances can be used, then transformed into graph weights using a Gaussian kernel parametrized by a bandwidth [3]. A similar approach [4] is to first decorrelate input examples and then use a Gaussian kernel, without a bandwidth parameter, for forming graph weights. Such methods apply the same weight generation method throughout the entire input graph. However, the density of input examples is likely to vary in input space. Therefore, we propose to form edge weights using a locally scaled Gaussian kernel, previously proposed for graph construction in unsupervised clustering [5].

Recently we introduced TACO, a graph-based transductive algorithm that was shown to perform well on several text classification tasks [6]. However, it has not yet been applied to tasks in other domains. Unlike previous algorithms, TACO maintains confidence information in addition to propagating labels, and uses it to estimate the quality of each propagated label. This information is used to better propagate label information throughout the graph, reducing the effect of poorly estimated labels. In addition, TACO has been shown to adapt well to unbalanced data. This is an important property since phoneme classes are inherently unbalanced. We use TACO as our transductive learning algorithm.
Previous work on transductive phoneme classification was typically performed in one of two possible settings. Alexandrescu & Kirchhoff [3,7] first use a supervised classifier to output a soft phonetic labeling of feature vectors. Then, the graph is constructed in the label space, and a transductive graph-based algorithm is used as a second phase, smoothing the output labeling of the supervised classifier. In contrast, Subramanya & Bilmes [4] construct the graph directly from feature vectors representing the acoustic signals. Liu & Kirchhoff experiment in both settings [8], comparing phoneme classification performance using several graph-based learning algorithms and graph construction methods. We follow the latter setting, without performing a first phase using a supervised classifier.

II. GRAPH-BASED TRANSDUCTION

The input to a graph-based transductive learner consists of two sets. The first set $D_l = \{(x_i, y_i)\}_{i=1}^{n_l}$ contains examples $x_i$, each associated with a label from a given label set, $y_i \in L = \{1, \ldots, m\}$. The second set $D_u = \{x_i\}_{i=n_l+1}^{n_l+n_u}$ contains additional unlabeled examples. We assume each example is embedded within a vector space, $x_i \in \mathbb{R}^d$. The goal of the learner is to assign a label $\hat{y}_i$ to each of the unlabeled examples in $D_u$. We denote the total number of input examples by $n = n_l + n_u$.

In phoneme classification, examples are feature vectors representing time frames of acoustic speech signals. The labeled input set $D_l$ contains feature vectors representing a set of labeled utterances $A_l = \{(a_s, p_s)\}$. Each utterance $a_s$ is a sequence of feature vectors, $a_s = (\ldots, x_i, \ldots)$, labeled with the corresponding phoneme sequence $p_s = (\ldots, y_i, \ldots)$. For simplicity, we assume the length of the acoustic signal $a_s$ is equal to the length of the corresponding phoneme sequence $p_s$. In practice, a phoneme sequence may contain consecutive occurrences of the same phoneme to represent a single spoken phoneme spanning more than one consecutive time frame. Similarly, the unlabeled input set $D_u$ contains feature vectors corresponding to a set of unlabeled utterances $A_u = \{a_s\}$.

The first step in graph-based transduction is the construction of an undirected weighted graph $G = (V, E, W)$ from the input. Each input feature vector $x_i$ is associated with a vertex $v_i \in V$. The set of edges is $E = V \times V$. Edge weights are described by a symmetric matrix with non-negative elements, denoted $W \in \mathbb{R}^{n \times n}$. An edge weight $w_{i,j} \in W$ represents the strength of our belief that predictions for vertices $v_i$ and $v_j$ should be similar. A large value of $w_{i,j}$ means these two predictions should be close. However, the opposite is not true. Specifically, a small or even zero edge weight does not mean predictions for $v_i$ and $v_j$ should be different. Rather, it states our lack of knowledge about the correct relationship between these two predictions. In practice, most edge weights are zero, and $W$ is sparse. We discuss several ways of setting edge weights in Sec. III.

Prior knowledge about examples in the labeled input set is formulated by associating a prior label vector $y_i \in \{0, 1\}^m$ with each vertex.
For every vertex $v_i$ associated with a feature vector from the labeled input set $D_l$, our input contains the correct phoneme $p$, so we set $y_{i,p} = 1$ and all other entries to zero. For vertices $v_j$ associated with feature vectors from the unlabeled input set $D_u$, we set $y_j = 0$, the vector with all elements equal to zero. For simplicity, we assume the first $n_l$ vertices in $V$ are labeled vertices, associated with feature vectors from the labeled input set, and the last $n_u$ vertices in $V$ are unlabeled vertices, associated with feature vectors from the unlabeled input set. We denote by $\delta_i^l = [i \le n_l]$ the indicator that a vertex is labeled, that is, $\delta_i^l = 1$ iff the vertex $v_i$ is a labeled vertex.

III. GRAPH WEIGHTS

The choice of weights for the graph edges is of key importance to the overall performance of graph-based transductive algorithms. Typically, for phoneme classification, a distance measure $d(x_i, x_j)$ is used to calculate distances between pairs of feature vectors. Then, the distances are transformed to weights, representing similarity, using a Gaussian kernel

\[ w_{i,j} = \exp\left( -\frac{[d(x_i, x_j)]^2}{a^2} \right), \]

where $a$ is a kernel bandwidth hyper-parameter. The quality of the generated weights is controlled by the choice of both the distance measure and the bandwidth hyper-parameter.

Several methods have been previously proposed for setting the value of the bandwidth hyper-parameter $a$. In one approach [9], a gradient descent based method is used to select a per-dimension bandwidth parameter such that the output labeling has low entropy, and thus forms a confident labeling. Another approach [10] minimizes the leave-one-out prediction error on labeled data points, also using a gradient based algorithm. However, both gradient based methods add considerable computational cost. A more computationally efficient approach [3,7] uses a single bandwidth parameter.
First, the average between-class distance $d_b$ and the average within-class distance $d_w$ are computed:

\[ d_b = \frac{1}{N_b} \sum_{y_i \ne y_j} d(x_i, x_j) ; \qquad d_w = \frac{1}{N_w} \sum_{y_i = y_j} d(x_i, x_j), \tag{1} \]

where $N_b$ and $N_w$ are the respective counts of elements in each sum. Next, the bandwidth parameter is chosen such that two samples at distance $(d_b + d_w)/2$ have a similarity of 0.5:

\[ \exp\left( -\frac{[(d_b + d_w)/2]^2}{a^2} \right) = \frac{1}{2} \quad\Rightarrow\quad a = \frac{d_b + d_w}{2\sqrt{\ln 2}}. \tag{2} \]

The intuition behind this method is that two samples placed at the most ambiguous distance should also have an ambiguous similarity value. We refer to this method as global scaling.

Using a single bandwidth parameter, or even a set of bandwidth parameters, one per dimension, implies that the same notion of closeness is used throughout the entire graph. However, input data is likely to be denser in some areas than others, and also possibly denser for one or more specific labels. Therefore, we propose using local scaling [5]. For each vertex $v_i$ we maintain its own local bandwidth parameter $a_i$, and set its value according to the local neighbourhood of $v_i$. Using the local scaling parameters we set the graph weights as

\[ w_{i,j} = \exp\left( -\frac{[d(x_i, x_j)]^2}{a_i a_j} \right). \tag{3} \]

Various methods can be used for selecting the local scaling parameters. We follow Zelnik-Manor & Perona [5] and simply set

\[ a_i = d(x_i, x_i^k), \tag{4} \]

where $x_i^k$ is the $k$th nearest neighbour of $x_i$.
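The two weighting schemes of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, distances are plain Euclidean, and the nearest-neighbour search is a brute-force scan that would not scale to graphs with millions of vertices.

```python
import math

def euclidean(x, y):
    """Euclidean distance d(x, y) between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def global_bandwidth(d_between, d_within):
    """Solve exp(-((d_b + d_w) / 2)^2 / a^2) = 1/2 for the bandwidth a."""
    return (d_between + d_within) / (2.0 * math.sqrt(math.log(2.0)))

def global_weight(d, a):
    """Gaussian kernel with a single (global) bandwidth a."""
    return math.exp(-(d ** 2) / (a ** 2))

def local_bandwidths(points, k):
    """a_i = distance from x_i to its k-th nearest neighbour.

    Brute force, O(n^2 log n); the point itself is excluded (an assumption).
    """
    scales = []
    for i, x in enumerate(points):
        dists = sorted(euclidean(x, y) for j, y in enumerate(points) if j != i)
        scales.append(dists[k - 1])
    return scales

def local_weight(d, a_i, a_j):
    """Locally scaled Gaussian kernel: the product a_i * a_j replaces a^2."""
    return math.exp(-(d ** 2) / (a_i * a_j))
```

With global scaling, a pair at the "most ambiguous" distance $(d_b + d_w)/2$ receives weight 0.5 by construction. With local scaling, the same absolute distance yields a larger weight in a sparse region (large $a_i a_j$) than in a dense one, which is the adaptive behaviour motivated above.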

IV. TRANSDUCTION WITH CONFIDENCE

Recently we introduced TACO [6], a graph-based transductive algorithm. We apply TACO to our task of phoneme classification. For completeness, we briefly describe it here.

TACO maintains both first order and second order information for every vertex in the input graph. The first order information consists of per-vertex label scores $\mu_i = [\mu_{i,1}, \ldots, \mu_{i,m}] \in \mathbb{R}^m$. The larger the $r$th element $\mu_{i,r}$ is, the stronger the belief that the input $x_i$ associated with vertex $v_i$ belongs to class $r$. Prediction is given according to the common multiclass inference rule $\hat{y}_i = \arg\max_r \mu_{i,r}$. Typically, graph-based transductive algorithms maintain only first order information [4,9,11]. However, TACO maintains additional second order confidence information, a per-vertex diagonal non-negative matrix $\Sigma_i \in \mathbb{R}^{m \times m}$, where the $r$th diagonal element of $\Sigma_i$ is denoted by $\sigma_{i,r}$. Each parameter $\sigma_{i,r}$ is associated with the uncertainty in the corresponding score parameter $\mu_{i,r}$. The lower the value of $\sigma_{i,r}$, the higher the confidence in the score value $\mu_{i,r}$.

TACO casts learning as minimizing the following unconstrained convex objective in the parameters $\{\mu_i, \Sigma_i\}_{i=1}^{n}$:

\[ C = \frac{1}{4} \sum_{i,j=1}^{n} w_{i,j} \left[ (\mu_i - \mu_j)^\top \left( \Sigma_i^{-1} + \Sigma_j^{-1} \right) (\mu_i - \mu_j) \right] \tag{5} \]
\[ \quad + \frac{1}{2} \sum_{i=1}^{n_l} \left[ (\mu_i - y_i)^\top \left( \Sigma_i^{-1} + \frac{1}{\gamma} I \right) (\mu_i - y_i) \right] \tag{6} \]
\[ \quad + \alpha \sum_{i=1}^{n} \mathrm{Tr}\, \Sigma_i - \beta \sum_{i=1}^{n} \log \det \Sigma_i, \tag{7} \]

where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters. The objective consists of three terms. The manifold term (5) promotes smoothness of the output labeling, requiring scores for close vertices (large $w_{i,j}$) to be similar, unless uncertainty is high in either of the predicted scores. The second term (6) requires the scores for labeled vertices to be close to their corresponding prior label vectors, again unless the uncertainty in the score parameters is high. The last term (7) regularizes the uncertainty parameters, keeping them away from both infinity and zero. An efficient iterative algorithm for minimizing the above objective was derived by Orbach & Crammer [6].
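To make the objective concrete, the following Python sketch evaluates it for diagonal $\Sigma_i$ and runs the iterative scheme in which every score becomes a confidence-weighted average of neighbouring scores and every uncertainty is the positive root of the stationarity condition of $C$ in $\sigma_{i,r}$. It is an illustration under assumptions rather than the authors' implementation: $W$ is a dense list of lists (a real graph would be stored sparsely), self-loops are ignored, a fixed iteration count replaces a convergence test, and the function names are hypothetical.

```python
import math

def objective(W, Y, labeled, mu, sigma, alpha, beta, gamma):
    """Value of C for diagonal uncertainties sigma[i][r]."""
    n, m = len(Y), len(Y[0])
    total = 0.0
    for i in range(n):
        for r in range(m):
            for j in range(n):  # manifold term
                total += 0.25 * W[i][j] * (mu[i][r] - mu[j][r]) ** 2 * (
                    1.0 / sigma[i][r] + 1.0 / sigma[j][r])
            if labeled[i]:      # agreement with prior labels
                total += 0.5 * (mu[i][r] - Y[i][r]) ** 2 * (
                    1.0 / sigma[i][r] + 1.0 / gamma)
            # trace and log-det regularizers (diagonal case)
            total += alpha * sigma[i][r] - beta * math.log(sigma[i][r])
    return total

def taco(W, Y, labeled, alpha, beta, gamma, iters=50):
    """Iterate score and uncertainty updates from mu = 0, Sigma = I."""
    n, m = len(Y), len(Y[0])
    mu = [[0.0] * m for _ in range(n)]
    sigma = [[1.0] * m for _ in range(n)]
    for _ in range(iters):
        new_mu = [row[:] for row in mu]
        new_sigma = [row[:] for row in sigma]
        for i in range(n):
            for r in range(m):
                # c is non-zero only for labeled vertices
                c = (1.0 / sigma[i][r] + 1.0 / gamma) if labeled[i] else 0.0
                num, den = c * Y[i][r], c
                g = (mu[i][r] - Y[i][r]) ** 2 if labeled[i] else 0.0
                for j in range(n):
                    if j == i or W[i][j] == 0.0:
                        continue
                    # static weight scaled by both vertices' confidences
                    w = W[i][j] * (1.0 / sigma[i][r] + 1.0 / sigma[j][r])
                    num += w * mu[j][r]
                    den += w
                    g += W[i][j] * (mu[j][r] - mu[i][r]) ** 2
                if den > 0.0:  # isolated unlabeled vertices keep their score
                    new_mu[i][r] = num / den
                # positive root of alpha*s^2 - beta*s - g/2 = 0
                new_sigma[i][r] = (beta + math.sqrt(beta ** 2 + 2.0 * alpha * g)) / (2.0 * alpha)
        mu, sigma = new_mu, new_sigma
    return mu, sigma

def predict(mu):
    """Multiclass inference rule: argmax over label scores."""
    return [max(range(len(row)), key=lambda r: row[r]) for row in mu]
```

Both updates read only values from the previous iteration, so vertices can be processed in any order, or in parallel. Note that every uncertainty is at least $\beta/\alpha$, which reflects the regularizer (7) keeping $\sigma_{i,r}$ away from zero.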
Let $\mu_{i,r}^{(t)}$ and $\sigma_{i,r}^{(t)}$ denote the score and uncertainty parameters maintained by the iterative algorithm at iteration $t$ for vertex $v_i$ and label $r$. Iterations are based on two update equations. The first updates the score value $\mu_{i,r}^{(t)}$ for a specific vertex and label, using neighbouring score and uncertainty parameters from the previous iteration:

\[ \mu_{i,r}^{(t)} = \frac{\sum_{j=1}^{n} \tilde{w}_{i,j,r}^{(t-1)} \mu_{j,r}^{(t-1)} + c_{i,r}^{(t-1)} y_{i,r}}{\sum_{j=1}^{n} \tilde{w}_{i,j,r}^{(t-1)} + c_{i,r}^{(t-1)}}, \tag{8} \]

where

\[ \tilde{w}_{i,j,r}^{(t)} = w_{i,j} \left( \frac{1}{\sigma_{i,r}^{(t)}} + \frac{1}{\sigma_{j,r}^{(t)}} \right) ; \qquad c_{i,r}^{(t)} = \delta_i^l \left( \frac{1}{\sigma_{i,r}^{(t)}} + \frac{1}{\gamma} \right). \]

This update sets the score for label $r$ and vertex $v_i$ to be a weighted average of neighbouring scores for label $r$ from the previous iteration. The weights $\tilde{w}_{i,j,r}^{(t)}$ in (8) are based on the static graph weights $w_{i,j}$ and the dynamic uncertainty parameters. The second step updates the uncertainty value $\sigma_{i,r}^{(t)}$ for a particular vertex and label, using scores of neighbouring vertices from the previous iteration:

\[ \sigma_{i,r}^{(t)} = \frac{\beta}{2\alpha} + \frac{1}{2\alpha} \sqrt{\beta^2 + 2\alpha\, g_{i,r}^{(t-1)}}, \tag{9} \]

where

\[ g_{i,r}^{(t)} = \sum_{j=1}^{n} w_{i,j} \left( \mu_{j,r}^{(t)} - \mu_{i,r}^{(t)} \right)^2 + \delta_i^l \left( \mu_{i,r}^{(t)} - y_{i,r} \right)^2 . \]

Here, the uncertainty for label $r$ and vertex $v_i$ is monotonic in a quadratic measure of divergence between the previous score $\mu_{i,r}^{(t-1)}$ and the previous neighbouring scores $\{\mu_{j,r}^{(t-1)}\}$. The complete pseudocode for TACO is given in Fig. 1.

Parameters: $\alpha > 0$, $\beta > 0$, $\gamma > 0$
Input: Graph $G = (V, E, W)$ and prior labeling $y_i$ for every $v_i \in V$
Initialize: $t = 1$, $\mu_i^{(0)} = 0$ and $\Sigma_i^{(0)} = I$ for all $v_i \in V$
Repeat
    For $v_i \in V$:
        Compute $\mu_i^{(t)}$ from $\mu_j^{(t-1)}$ and $\Sigma_j^{(t-1)}$ using (8)
        Compute $\Sigma_i^{(t)}$ from $\mu_j^{(t-1)}$ using (9)
    $t \leftarrow t + 1$
Until convergence
Output: Score vectors $\mu_i^{(t)}$ and confidence matrices $\Sigma_i^{(t)}$.

Fig. 1. The TACO algorithm for graph-based transduction.

V. EXPERIMENTS

We evaluate the performance of TACO on the task of phoneme classification, along with two other state-of-the-art graph-based transductive algorithms: Modified Adsorption (MAD) [11] and Measure Propagation (MP) [4,12].

Data: The TIMIT corpus contains speech signals manually annotated with frame-based phonetic transcriptions [13].
We use pre-processed data [14] partitioned into a training set of 3,696 utterances, a development set of 400 utterances and a test set of 192 utterances. We use a standard mapping of the 61 phonemes in TIMIT to a subset of 39 classes [15]. The data contains feature vectors consisting of 13 Mel-frequency coefficients along with their first and second derivatives (39 values). Structural information is incorporated by adding to each feature vector its immediate three predecessors and successors, such that the final dimension of input examples is $39 \times 7 = 273$.
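The context stacking described above can be sketched as follows. How the first and last frames of an utterance are handled is not stated in the text; clamping to the boundary frame, as done here, is an assumption, and the function name is hypothetical.

```python
def stack_context(frames, context=3):
    """Concatenate each frame with its `context` predecessors and successors.

    `frames` is a list of equal-length feature vectors for one utterance.
    Boundary handling is an assumption: indices outside the utterance are
    clamped, so edge frames are repeated.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        row = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)
            row.extend(frames[idx])
        stacked.append(row)
    return stacked
```

With 39-dimensional frames and `context=3`, each stacked vector has $39 \times (2 \cdot 3 + 1) = 273$ entries, matching the input dimension stated above.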

Graph construction: From the input partition we construct two graphs. First, a development graph, including examples from the training and development sets, for a total of 4,096 utterances and roughly 1.2 million vertices. Second, a test graph, with examples from the training and test sets, containing 3,888 utterances and around 1.1 million vertices. For measuring distances in input space we use the Euclidean distance, $d(x_i, x_j)^2 = \| x_i - x_j \|_2^2$. We prune each graph by keeping for each vertex only the edges to its $k$ nearest neighbours ($k$-NN), yielding a directed graph. Then, the direction of edges is removed, resulting in an undirected graph in which a vertex degree may be larger than $k$. We fix $k = 10$, as previously used by Subramanya & Bilmes [4].

We transform distances to similarities using two graph construction methods. For global scaling we calculate the global bandwidth parameter using (1) and (2). The averages in (1) are calculated by applying random sampling [3]. For local scaling, we select local bandwidth parameters according to (4) and form edge weights using (3). The same value of $k$ used for nearest neighbours graph construction is also used for local bandwidth parameter selection, so there is no additional computational cost. To conclude, we have four input graphs, differing in (a) whether they contain development or test data along with the training data, and (b) whether weights are formed using global or local scaling.

Setting: We select utterances for the labeled utterances set $A_l$ by randomly sampling utterances from the training set. The labeled input set $D_l$ contains the feature vectors composing the sampled utterances. This is a more realistic scenario than randomly sampling feature vectors without regard to their source utterance. Utterances are sampled until a fraction $f$ of the feature vectors in the training set is labeled, under the constraint that each phoneme class is selected at least once. We use $f \in \{0.01, 0.05, 0.1, 0.2, 0.3, 0.5\}$. On the sampled labeled information we perform class prior normalization, for both TACO and MAD [6].
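Two steps of this setup lend themselves to a short sketch: $k$-NN pruning followed by removal of edge direction, and utterance-level sampling of the labeled set. The code below is illustrative only. The brute-force neighbour search, the rejection loop used to satisfy the "each class at least once" constraint, and all function and parameter names are assumptions, not details taken from the paper.

```python
import random

def knn_undirected_edges(dist, k):
    """Keep each vertex's k nearest neighbours, then drop edge direction.

    `dist` is a symmetric n x n distance matrix. Returns a set of pairs
    (i, j) with i < j. A vertex may end up with more than k neighbours,
    because it can be chosen by vertices it did not choose itself.
    """
    n = len(dist)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: dist[i][j])[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges

def sample_labeled_utterances(utterances, f, classes, rng, max_tries=1000):
    """Sample whole utterances until a fraction f of all frames is labeled.

    `utterances` is a list of per-frame label sequences. A sample is
    redrawn until every class in `classes` occurs at least once (one
    possible way to enforce the constraint; an assumption).
    """
    total = sum(len(u) for u in utterances)
    for _ in range(max_tries):
        order = list(range(len(utterances)))
        rng.shuffle(order)
        chosen, frames, seen = [], 0, set()
        for idx in order:
            if frames >= f * total:
                break
            chosen.append(idx)
            frames += len(utterances[idx])
            seen.update(utterances[idx])
        if seen >= set(classes):
            return chosen
    raise RuntimeError("no sample covering all classes was found")
```

Sampling whole utterances rather than individual frames keeps all frames of a labeled utterance together, which is the more realistic scenario argued for above. Passing a seeded `random.Random` instance as `rng` makes a sample reproducible.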
The development graph is used for hyper-parameter tuning. We tune by performing a grid search over a predefined range of values for each algorithm. The ranges for the hyper-parameters of the three algorithms are as follows. For TACO, $\alpha \in \{10^{-4}, 10^{-2}, 1, 10^{2}, 10^{4}\}$, $\beta \in \{10^{-4}, 10^{-2}, 1, 10^{2}\}$ and $\gamma \in \{1, 100\}$. For MP, $\nu \in \{10^{-8}, 10^{-6}, 10^{-4}, 0.01, 0.1\}$ and $\mu \in \{10^{-8}, 10^{-4}, 0.01, 0.1, 1, 10, 100\}$, fixing $\alpha = 1$. This is a superset of the range used before [4]. For MAD, $\mu_1 = 1$ and $\mu_2, \mu_3 \in \{10^{-8}, 10^{-4}, 0.01, 1, 10, 100, 1000\}$, following Talukdar & Crammer [11]. Performance is evaluated on vertices that belong to the development set, and the optimal hyper-parameter combination is selected.

Final evaluation is performed on the test graph. We repeat the described labeled sampling procedure, and set the hyper-parameters to the optimal values selected on the development graph. Performance is evaluated on vertices belonging to the test set.

Results: We use two metrics to evaluate performance [4]: phone accuracy, computed using the Levenshtein distance, and frame accuracy, the percentage of frames classified correctly. For all results, the reported evaluation metric is the same as the metric used for hyper-parameter tuning.

Fig. 2. A comparison of phone accuracy for different amounts of supervision. Results on (a) 56,692 test set vertices from the test graph; (b) 20,448 development set vertices from the development graph.

A comparison of phone accuracy on the test graph, for all evaluated combinations of algorithms and graph construction methods, is given in Fig. 2a. Local scaling for graph construction with TACO as the transductive algorithm outperforms all other combinations for all values of $f$. Results on the development graph in Fig. 2b are similar, with slightly higher absolute values. In Fig. 3, we use frame accuracy as the evaluation metric, and results are similar.
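The two metrics can be sketched as below. The text states only that phone accuracy is computed using the Levenshtein distance; collapsing runs of identical frame labels into phones and reporting one minus the normalized edit distance is a common convention and is assumed here, as are the function names.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def collapse(frame_labels):
    """Merge consecutive repeats so each run of frames becomes one phone."""
    phones = []
    for label in frame_labels:
        if not phones or phones[-1] != label:
            phones.append(label)
    return phones

def frame_accuracy(ref_frames, hyp_frames):
    """Fraction of frames whose predicted label equals the reference."""
    correct = sum(r == h for r, h in zip(ref_frames, hyp_frames))
    return correct / len(ref_frames)

def phone_accuracy(ref_frames, hyp_frames):
    """1 - edit distance / reference length, on collapsed phone sequences."""
    ref, hyp = collapse(ref_frames), collapse(hyp_frames)
    return 1.0 - levenshtein(ref, hyp) / len(ref)
```

The two metrics can disagree: a prediction that places a phone boundary one frame late loses frame accuracy but keeps the phone sequence, and hence the phone accuracy, intact.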
Comparing graph construction methods, both TACO and MP perform better on graphs constructed with local scaling. MAD performs better using local scaling when relatively small amounts of labeled training data are available. The performance gain attained by using local over global scaling is further illustrated in Fig. 4. For all algorithms, the most significant performance boost is where only 1% of the training data is labeled. The largest gain is for MP, improving by roughly 4.5% of phone accuracy, for 1% of labeled data. As more data is labeled, the performance gap favouring local scaling decreases. For TACO, the performance gain decreases monotonically, from over 4% for 1% of labeled data to just above 0.5% at the largest labeled fraction. A similar trend appears for MAD, which gains from local scaling only until a fraction of 20% of the training set is labeled. From this point on, local scaling has a negative effect on performance, and global scaling is better. This implies local scaling is more beneficial when small amounts of labeled utterances are available.

Fig. 3. A comparison of frame accuracy for different amounts of supervision on test set vertices from the test graph.

Fig. 4. Change in phone accuracy comparing local and global scaling, for TACO, MP and MAD. Results are on test set vertices from the test graph. Positive values indicate an increase in phone accuracy gained by using local scaling.

VI. CONCLUSION

We have demonstrated the effectiveness and scalability of TACO, a recently introduced graph-based transductive algorithm, on the task of phoneme classification. TACO outperforms two other state-of-the-art algorithms, MAD and MP. In addition, we introduced local scaling as a graph construction method for transductive phoneme classification. Local scaling improves the input graph, and with it the phoneme classification accuracy of TACO.

In future work we plan to modify current transduction algorithms to better use the sequential nature of acoustic utterances. We believe the use of such structured information may contribute an additional performance boost to current results. We also plan to perform induction, allowing the labeling of previously unseen unlabeled utterances.

REFERENCES

[1] K. Crammer and D. D. Lee, Online discriminative learning of phoneme recognition via collections of generalized linear models, ICASP, 2012.
[2] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, Hidden conditional random fields for phone classification, INTERSPEECH.
[3] A. Alexandrescu and K. Kirchhoff, Graph-based learning for phonetic classification, ASRU.
[4] A. Subramanya and J. Bilmes, Semi-supervised learning with measure propagation, JMLR, 2011.
[5] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering, in NIPS.
[6] M. Orbach and K. Crammer, Graph-based transduction with confidence, in ECML, 2012.
[7] K. Kirchhoff and A. Alexandrescu, Phonetic classification using controlled random walks, Interspeech, vol. 2, no., pp. 2 5, 2011.
[8] Y. Liu and K. Kirchhoff, A comparison of graph construction and learning algorithms for graph-based phonetic classification, UWEE Technical Report, 2012.
[9] X. Zhu, Z. Ghahramani, and J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in ICML.
[10] X. Zhang and W. S. Lee, Hyperparameter learning for graph based semi-supervised learning algorithms, in NIPS.
[11] P. P. Talukdar and K. Crammer, New regularized algorithms for transductive learning, in ECML.
[12] A. Subramanya and J. Bilmes, Soft-supervised learning for text classification, in EMNLP.
[13] L. F. Lamel, R. H. Kassel, and S. Seneff, Speech database development: design and analysis of the acoustic-phonetic corpus, Proceedings of the DARPA Speech Recognition Workshop, 1986.
[14] C. C. Cheng, F. Sha, and L. K. Saul, A fast online algorithm for large margin training of continuous density hidden Markov models, INTERSPEECH.
[15] L. Kai-fu and H. Hsiao-Wuen, Speaker-independent phone recognition using hidden Markov models, IEEE Transactions on Acoustics, Speech and Signal Processing, 1989.
