QUERY BY EXAMPLE IN LARGE DATABASES USING KEY-SAMPLE DISTANCE TRANSFORMATION AND CLUSTERING

Marko Helén
Tampere University of Technology
Institute of Signal Processing
Korkeakoulunkatu 1, Tampere, Finland

Tommi Lahti
Nokia Research Center
Interaction Core Technology Center, Personal Media and Content team, Finland

ABSTRACT

Calculating similarity estimates between a query sample and the database samples becomes a prohibitively expensive task in large, usually continuously updated multimedia databases. In this paper, a fast and low-complexity transformation from the original feature space into a k-dimensional vector space, combined with clustering, is proposed to alleviate the problem. First, k key-samples are chosen randomly from the database. These samples and a distance function specify a transformation from a series of feature vectors into a k-dimensional vector space, in which the database can be (re)clustered quickly with any of a variety of traditional clustering techniques whenever required. In the experiments, the similarity between samples was calculated using the Euclidean distance between their associated feature-vector probability density functions, and the k-means algorithm was used to cluster the transformed samples in the vector space. The experiments show that considerable savings in time and computation are achieved with only a marginal drop in performance.

1. INTRODUCTION

Query by example aims at the automatic retrieval of samples from a database which are similar to an example provided by the user. The most accurate way of making a query is naturally the exhaustive full query, in which features are extracted for the query sample and for every database sample, and the distance from the query sample to every sample in the database is calculated. Similarity calculation between audio samples is used as an example throughout this paper. After feature extraction, a series of feature vectors is associated with each sample. Helén and Virtanen provided a closed-form solution for the Euclidean distance between probability density functions (pdfs) and proposed a full query using this distance measure for series of feature vectors [1]. However, in large, usually continuously updated multimedia databases, the full query is impractically slow and computationally expensive.
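
For concreteness, the following is a minimal Python sketch of the exhaustive full query just described; the distance argument stands for any sample-to-sample distance measure (such as the Euclidean distance between pdfs of Section 3.1), and all names are illustrative rather than taken from the paper.

    import numpy as np

    def full_query(query, database, distance, n_retrieved=5):
        """Exhaustive full query: compute the distance from the example to
        every database sample and return the indices of the closest ones."""
        dists = np.array([distance(query, x) for x in database])
        return np.argsort(dists)[:n_retrieved]

Its cost grows linearly with the database size, which is exactly what the method proposed below avoids.
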
In order to reduce the computation time, clustering can be applied as a preprocessing step. However, clustering techniques like k-means, which rely on vector operations when calculating the cluster centroids, cannot be applied directly to series of feature vectors. There are alternatives which can be applied; for example, the k-medoids algorithm, which is strongly related to k-means. The difference is that in k-means the cluster centroid is the mean of the cluster, whereas in k-medoids the cluster centroid is an actual sample in the database having the smallest sum of distances to the other samples in the cluster. The drawback is that during the clustering, the expensive distance calculations (discussed below) need to be carried out many times iteratively. Shapiro calculated distances from the database samples to predefined reference points; the query was then made only among the samples lying at nearly the same distance from the reference points [2]. In this way the number of distance calculations could be reduced, speeding up the query. Several indexing techniques have also been developed for the purpose. Spatial Access Methods (SAMs) utilize hierarchical tree structures to cluster the feature space [3]. Another common class of techniques, Metric Access Methods (MAMs), assumes only the availability of a distance function [4]. Clustering algorithms like k-means, again, cannot operate directly in the distance space. One obvious solution is therefore to map all the samples into an n-dimensional vector space while preserving the distances between all the database samples. Multidimensional Scaling (MDS) techniques like FastMap are known to preserve the distances, but they essentially require calculating the distances between all database pairs at some point [5]. Kiranyaz and Gabbouj proposed a technique called Progressive Query (PQ) that performs a series of intermediate queries, returns intermediate results to the user, and finally converges to the full-scale query [6].

Having the most promising intermediate queries (in other words, the most promising clusters of samples) processed first naturally improves the usability of this idea. Prior information on the underlying classes can be utilized in ranking the clusters; categorizing music into genres and training statistical models on the genres is one common example. Berenzweig, Ellis, and Lawrence proposed a method of mapping music into an anchor space [7]. In the general query by example situation, however, prior information enabling semantic or similar modeling cannot always be assumed. Moreover, such modeling may become costly and may even require hand-labeled training material.

We are aiming at a very general query by example method which would be practical in large multimedia databases. In this paper, a transformation from a series of feature vectors to a fixed-dimensional feature space is proposed. This enables the use of effective clustering methods in order to reduce the query times.

This paper is organized as follows. Section 2 gives an overview of the fast query by example system. Section 3 presents the proposed transformation and introduces the distance measure used here. Section 4 presents the experimental results of the fast search compared to the exhaustive full search. Finally, the conclusions are presented in Section 5.

[Fig. 1. Overview of the fast query by example system: for the database, feature extraction, GMM estimation, key-sample distance transformation (after key-sample selection), and clustering; for the example signal, feature extraction, GMM estimation, and the same key-sample distance transformation, followed by finding the nearest cluster, calculating distances inside the cluster, sorting by distance, and returning the similar database samples.]

2. SYSTEM OVERVIEW

An overview of the system is illustrated in Fig. 1. First, the features are extracted from the example signal given by the user. Second, a GMM which models the feature distribution is estimated using the expectation maximization (EM) algorithm; the same set of features and a GMM are estimated for each sample in the query database beforehand. Third, the samples are transformed into a k-dimensional space using the key-sample distance transformation (explained in Section 3). Fourth, the database is clustered into n clusters using standard clustering algorithms. Fifth, the example signal is compared against the clusters and the one with the shortest distance is chosen. Sixth, the example is compared against the signals in the chosen cluster one by one, and the similarity of each pair is estimated by the Euclidean distance between their pdfs. Finally, when all the similarity values have been calculated, a decision is made regarding the similarity of the samples to the example, and those considered similar are returned to the user.
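
As an illustration of the second step, the per-sample GMM estimation could look as follows. This is a minimal sketch, assuming scikit-learn's GaussianMixture as the EM implementation; the paper itself does not name a library, and the variance floor value is an arbitrary placeholder (Section 3.1 only states that the variances must be kept above a fixed minimum level).

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def estimate_gmm(features, n_components=8, min_var=1e-2):
        """Fit a diagonal-covariance GMM to one sample's series of feature
        vectors (shape: n_frames x n_features) with the EM algorithm."""
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type="diag",  # diagonal covariances, as in Section 3.1
            reg_covar=min_var,       # keeps low-variance components from
                                     # dominating the distance measure
        )
        gmm.fit(features)
        return gmm
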
3. KEY-SAMPLE DISTANCE TRANSFORMATION

Clustering is an important step in the query process, because the size of the database can easily grow so large that going linearly through the whole database would be too expensive. However, traditional clustering algorithms require setting cluster centroids, which are points in the feature space [8]. Unfortunately, if the samples consist of series of feature vectors, they cannot be placed in any fixed-dimensional feature space. One solution would be to apply some statistical measure (mean, median, ...) to the series of feature vectors, but then a lot of information that could be used in the distance calculation would be lost. The feature vectors could also be concatenated if the series were all of the same length, but since varying-length series are considered here, this does not solve the problem.

We propose a transformation which enables the usage of effective clustering methods in large databases where each sample is a series of feature vectors. The transformation is based on distances to key-samples chosen from the database, and is defined as

    T(x, O, d) : F \to \mathbb{R}^k,    (1)

where x is the original series of feature vectors, O is the set of k key-samples, d is the distance measure, F is the original feature space, and \mathbb{R}^k is the k-dimensional feature space in which the i-th element is the distance from x to the i-th key-sample (i = 1, ..., k).
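
A minimal Python sketch of Eq. (1): each sample is mapped to the vector of its distances to the k key-samples. The function and variable names are illustrative; distance stands for the measure d (the one used in the paper is sketched in Section 3.1).

    import numpy as np

    def key_sample_transform(samples, key_samples, distance):
        """Map every sample (a series of feature vectors, here represented
        by its GMM) to a k-dimensional vector whose i-th element is the
        distance to the i-th key-sample, as in Eq. (1)."""
        return np.array([[distance(x, o) for o in key_samples]
                         for x in samples])

    # usage sketch: draw 10 key-samples at random, then transform
    # rng = np.random.default_rng(0)
    # keys = [samples[i] for i in rng.choice(len(samples), 10, replace=False)]
    # V = key_sample_transform(samples, keys, gmm_euclidean_distance)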

[Fig. 2. The key-sample distance transformation: original feature set, random sampling, data transformation, clustering.]

The system is illustrated in Fig. 2. First, k samples are chosen randomly from the database to work as key-samples. Second, the distances from each sample in the database to these key-samples are calculated; the distances from one sample to all of the key-samples are taken as the feature vector of that sample. Third, the database is clustered using these new feature vectors. The samples are now points in a k-dimensional feature space, and thus traditional clustering algorithms can be applied. We used k-means clustering in the simulations, since it is known to be a very efficient algorithm. When a query is made, the nearest cluster to the query sample is found using the key-sample distances as features, and the actual query is made only inside the closest clusters. In order to achieve the most accurate results, we applied the Euclidean distance between pdfs [1] for the query inside the cluster.

The advantage of this transformation is a significant speedup of the clustering, since instead of series of feature vectors we can operate with single feature vectors. Simultaneously, we are able to use very accurate distance measures since, in contrast to the full search, only a small fraction of all sample pairs has to be evaluated.

3.1. Euclidean distance between pdfs

The distribution p(x) of the features of each sample is modelled using a Gaussian mixture model (GMM), defined as

    p(x) = \sum_{i=1}^{I} w_i N_i(x; \mu_i, \Sigma_i),    (2)

where w_i is the weight of the i-th component, I is the number of components, and N_i is the multivariate normal distribution with mean vector \mu_i and diagonal covariance matrix \Sigma_i. The weights are non-negative and sum to unity. The parameters of the GMMs (the means, variances, and weights for a fixed number of components) are estimated using the expectation maximization (EM) algorithm. It should be noted that the variances have to be restricted above a relatively high fixed minimum level, since low-variance components would otherwise dominate the measure.

The similarity of two samples is measured by the square of the Euclidean distance e between their distributions p_1(x) and p_2(x). This is obtained by integrating the squared difference over the whole feature space:

    e = \int \cdots \int [p_1(x) - p_2(x)]^2 \, dx_1 \cdots dx_N.    (3)

Helén and Virtanen derived a closed-form solution for this in [1]:

    e = \sum_{i=1}^{I} \sum_{j=1}^{I} w_i w_j Q_{i,j,1,1} - 2 \sum_{i=1}^{I} \sum_{j=1}^{J} w_i v_j Q_{i,j,1,2} + \sum_{i=1}^{J} \sum_{j=1}^{J} v_i v_j Q_{i,j,2,2},    (4)

where w_i and w_j are the weights of the i-th and j-th components of GMM 1, v_i and v_j are the weights of the i-th and j-th components of GMM 2, and I and J are the numbers of components in GMM 1 and GMM 2, respectively. Q_{i,j,k,m} denotes the integral of the product of the i-th component of GMM k ∈ {1, 2} and the j-th component of GMM m ∈ {1, 2}:

    \int N_1(x; \mu_1, \Sigma_1) N_2(x; \mu_2, \Sigma_2) \, dx = \frac{1}{(2\pi)^{N/2} \prod_{n=1}^{N} \sqrt{\sigma_{1,n}^2 + \sigma_{2,n}^2}} \exp\left[ -\frac{1}{2} \sum_{n=1}^{N} \frac{(\mu_{1,n} - \mu_{2,n})^2}{\sigma_{1,n}^2 + \sigma_{2,n}^2} \right],    (5)

where \mu_{k,n} is the n-th entry of the mean vector \mu_k, k ∈ {1, 2}, and \sigma_{k,n}^2 is the corresponding variance.
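
A direct Python transcription of Eqs. (4) and (5) might look as follows. This is a sketch under the assumption that the GMMs are objects exposing weights_, means_, and diagonal covariances_ arrays (as a scikit-learn GaussianMixture with covariance_type='diag' does); the function names are illustrative.

    import numpy as np

    def gauss_product_integral(mu1, var1, mu2, var2):
        """Integral of the product of two diagonal-covariance Gaussians, Eq. (5)."""
        s = var1 + var2                      # elementwise sigma_1^2 + sigma_2^2
        norm = (2.0 * np.pi) ** (len(mu1) / 2.0) * np.sqrt(np.prod(s))
        return np.exp(-0.5 * np.sum((mu1 - mu2) ** 2 / s)) / norm

    def gmm_euclidean_distance(gmm1, gmm2):
        """Squared Euclidean distance between two GMM densities, Eq. (4)."""
        def cross(a, b):
            # double sum of weight-weighted product integrals Q_{i,j,.,.}
            return sum(wa * wb * gauss_product_integral(ma, va, mb, vb)
                       for wa, ma, va in zip(a.weights_, a.means_, a.covariances_)
                       for wb, mb, vb in zip(b.weights_, b.means_, b.covariances_))
        return cross(gmm1, gmm1) - 2.0 * cross(gmm1, gmm2) + cross(gmm2, gmm2)
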
4. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed transformation in query by example, the following simulations were run. The audio database used in the tests contained 1332 samples with a 16 kHz sampling rate. The samples were manually annotated into 4 main categories and 17 subcategories, and samples falling into the same class were considered to be similar. The classes and the number of samples in each class are listed in Table 1.

Table 1. Classes used in the simulations.

    Main class           Subclasses
    Environmental (231)  Inside car (151), In restaurant (42), Traffic (38)
    Music (620)          Acoustic (264), Drums (56), Electroacoustic (249), Symphony (51)
    Sing (165)           Humming (52), Singing (60), Whistling (53)
    Speech (316)         Speaker1 (50), Speaker2 (47), Speaker3 (44), Speaker4 (40), Speaker5 (47), Speaker6 (38), Speaker7 (50)

The representative samples for each class were selected by listening and choosing, for each class, the samples which do not contain content from the other classes. The samples of the Environmental class are taken from the CASR recordings [9]; the subclasses correspond to the classes in CASR (car, restaurant, road). The drum samples are the acoustic drum sequences used by Paulus [10]. The rest of the Music subclasses are from the RWC Music Database [11]: the Acoustic class is from the RWC Jazz Music Database, Electroacoustic is from the RWC Popular Music Database, and Symphony is from the RWC Classical Music Database. The Sing class was taken from the Vox database presented in [12].

The speech samples are from the CMU Arctic speech database [13]. All the samples in our database are 10 seconds long. The length of the speech samples in the Arctic database is 2-4 seconds, so the samples from each speaker were combined to obtain 10-second samples. The original samples in the other databases are longer than 10 seconds, so random 10-second clips were cut from them.

4.1. Feature extraction

Feature extraction aims at modelling the perceptually most relevant information in the original signal using only a small number of parameters. Most features are extracted in short (20-60 ms) frames, and typically they parametrize the spectrum of the sound, because in comparison to the time-domain signal, the spectrum correlates better with human sound perception. Since we are aiming at a very general audio signal query, we chose features which measure different properties of the sound. In our earlier studies, different feature sets were tested, and the best feature set was chosen based on those experiments [1] [14]. The frequency content of a frame is described using three Mel-frequency cepstral coefficients, spectral centroid, noise likeness [15], spectral spread, spectral flux, harmonic ratio [16], and maximum autocorrelation lag. Temporal characteristics of the signal are described using zero crossing rate, crest factor, total energy, and variance of instantaneous power. The features are extracted in 46 ms frames, and each feature is normalized to have zero mean and unity variance over the whole database. The total number of features is 13.

4.2. Evaluation procedure

The database was first transformed and clustered using the proposed method. One sample at a time was drawn from the database to serve as the query sample, and the rest were considered as the database. The nearest cluster was found by calculating the Euclidean distance between the transformed query sample and the cluster centroids. The query was then made inside the nearest cluster using the original series of feature vectors of the samples and the Euclidean distance between their pdfs. The number of Gaussian components used in the simulations was 8. In addition to the nearest cluster, the effect of searching the two and three nearest clusters was also tested.

If a retrieved sample was labeled with the same class as the query sample, it was counted as correctly retrieved from the database. The query results were compared to the full search, which gives an upper limit for the precision achievable with optimal clustering. The results are presented here in terms of the precision rate, the portion of correctly retrieved samples over all samples retrieved from the database:

    \text{precision} = \frac{c}{r},    (6)

where c is the number of correctly retrieved samples and r is the number of all retrieved samples.
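
The evaluation procedure above could be sketched in Python as follows, reusing the helpers sketched in earlier sections. The 17-cluster k-means setting matches the experiments, but everything else (names, structure, labels being a NumPy array of class annotations) is an illustrative assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    # V: transformed database (n_samples x k), gmms: per-sample GMMs,
    # labels: np.array of class annotations; all produced as sketched earlier.
    def evaluate(V, gmms, labels, n_clusters=17, n_retrieved=5):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(V)
        precisions = []
        for q in range(len(gmms)):           # leave-one-out over the database
            nearest = np.argmin(np.linalg.norm(km.cluster_centers_ - V[q], axis=1))
            members = np.flatnonzero(km.labels_ == nearest)
            members = members[members != q]  # the rest act as the database
            if members.size == 0:
                continue
            # accurate pdf distance only inside the chosen cluster
            dists = [gmm_euclidean_distance(gmms[q], gmms[i]) for i in members]
            retrieved = members[np.argsort(dists)][:n_retrieved]
            precisions.append(np.mean(labels[retrieved] == labels[q]))  # Eq. (6)
        return float(np.mean(precisions))
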
4.3. Results

Figure 3 presents the results of query by example using the key-sample distance transformation with 17 clusters, compared to the full search. Here the 5 most similar samples are retrieved from the database. The figure illustrates how the precision changes with the number of key-samples. As can be seen, a higher number of key-samples results in higher precision. However, the improvement in precision is quite small beyond 10 key-samples. It is advantageous to keep the number of key-samples as small as possible, because the distances from these samples to all the other samples in the database have to be calculated, and thus with a large number of key-samples the clustering becomes expensive. Compared to the full search, the difference in precision when using 10 key-samples is only 3 percentage points. On the other hand, the speedup of the query is directly proportional to the number of clusters, assuming that the clustering is done offline and the time to find the nearest cluster is negligible. The search for the nearest cluster may not always be negligible, since the distances from the query sample to all key-samples have to be calculated; with 10 key-samples, however, the computation time is acceptable.

[Fig. 3. Precision values when the number of key-samples is changing (fast search vs. full search; precision as a function of the number of key-samples).]

Figure 4 illustrates the effect that the number of clusters has on the retrieval accuracy. Here 10 key-samples are used and the 5 most similar samples are retrieved. The one-cluster case corresponds to the full search, and, as expected, the search accuracy decreases as the number of clusters is increased.

[Fig. 4. Precision values when the number of clusters is changing (fast search within the 1, 2, and 3 nearest clusters vs. full search; precision as a function of the number of clusters).]

The choice of how many clusters should be used depends on the application. When the number of clusters is low, the clustering phase is faster but the query is slower; likewise, when the number of clusters is high, the query is fast but the clustering is slow. Since the clustering phase can be done offline, the query part is usually the more critical one. The accuracy of the query can be increased by searching for similar samples in the other nearby clusters as well. Figure 4 also illustrates the effect of making the query in the two or three nearest clusters: in these cases the precision is very close to the full search even with cluster counts as high as 50.

Table 2 presents the confusion matrix of the query with 10 key-samples and 17 clusters, when the 10 most similar samples were retrieved. The overall precision in this test case was 91.1 %. It can be seen that almost all falsely retrieved samples were nevertheless inside the correct main class. The worst cases were acoustic vs. electroacoustic music and singing vs. humming vs. whistling. These errors are understandable, because those classes are close to each other for a human listener as well.

[Table 2. Confusion matrix for the proposed method when the 10 most similar samples are retrieved, over the 17 subclasses of Table 1.]

5. CONCLUSIONS AND FUTURE WORK

A novel method for speeding up query by example in large databases using a key-sample distance transformation and clustering was proposed. The method was tested on audio query by example, but it is applicable to any query by example or classification task. The running time of query by example was reduced significantly (to less than one tenth) compared to the full search, while the precision dropped by only 3 percentage points when the search was made only inside the closest cluster. When the search was expanded to the 2 or 3 nearest clusters, the difference in precision to the full search was only around 1 percentage point.

In our future work we will concentrate on choosing the number of clusters and on updating the clusters as the number of samples in the database changes. When samples are added to or removed from the database, the existing clusters must be updated, since the samples inside the clusters change. Furthermore, at some point clusters must be split or combined in order to maintain the desired cluster size.

6. ACKNOWLEDGEMENTS

This work was supported by the Academy of Finland (Finnish Centre of Excellence program) and Nokia Research Center.

7. REFERENCES

[1] M. Helén and T. Virtanen, "Query by Example of Audio Signals Using Euclidean Distance Between Gaussian Mixture Models," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA, Apr. 2007.

[2] M. Shapiro, "The Choice of Reference Points in Best-Match File Searching," Communications of the ACM, vol. 20, no. 5, 1977.

[3] V. Gaede and O. Günther, "Multidimensional Access Methods," ACM Computing Surveys, vol. 30, no. 2, 1998.

[4] C. Traina, A. Traina, B. Seeger, and C. Faloutsos, "Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes," Lecture Notes in Computer Science, vol. 1777, 2000.

[5] C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets," in Proc. ACM SIGMOD International Conference on Management of Data, San Jose, California, 1995.

[6] S. Kiranyaz and M. Gabbouj, "A Novel Multimedia Retrieval Technique: Progressive Query (WHY WAIT?)," IEE Proceedings - Vision, Image and Signal Processing, vol. 152, no. 3, June 2005.

[7] A. Berenzweig, D. P. W. Ellis, and S. Lawrence, "Anchor Space for Classification and Similarity Measurement of Music," in Proc. International Conference on Multimedia and Expo (ICME '03), 2003.

[8] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. El Abbadi, "Approximate Nearest Neighbor Searching in Multimedia Databases," in Proc. 17th International Conference on Data Engineering, Heidelberg, Germany, 2001.

[9] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa, "Computational Auditory Scene Recognition," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Florida, May 2002.

[10] J. Paulus and T. Virtanen, "Drum Transcription with Non-negative Spectrogram Factorisation," in Proc. 13th European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, Sept. 2005.

[11] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC Music Database: Popular, Classical, and Jazz Music Databases," in Proc. 3rd International Conference on Music Information Retrieval, Oct. 2002.

[12] T. Viitaniemi, A. Klapuri, and A. Eronen, "A Probabilistic Model for the Transcription of Single-Voice Melodies," in Proc. Finnish Signal Processing Symposium (FINSIG '03), Finland, May 2003.

[13] J. Kominek and A. Black, "The CMU Arctic Speech Databases," in Proc. 5th ISCA Speech Synthesis Workshop, Pittsburgh, USA, 2004.

[14] M. Helén and T. Lahti, "Query by Example Methods for Audio Signals," in Proc. 7th IEEE Nordic Signal Processing Symposium, Iceland, June 2006.

[15] C. Uhle, C. Dittmar, and T. Sporer, "Extraction of Drum Tracks From Polyphonic Music Using Independent Subspace Analysis," in Proc. 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, Japan, Apr. 2003.

[16] J. J. Burred and A. Lerch, "A Hierarchical Approach to Automatic Musical Genre Classification," in Proc. 6th International Conference on Digital Audio Effects (DAFX), London, UK, Sept. 2003.
