Feature and Dissimilarity Representations for the Sound-Based Recognition of Bird Species

José Francisco Ruiz-Muñoz, Mauricio Orozco-Alzate, César Germán Castellanos-Domínguez
Universidad Nacional de Colombia, Sede Manizales
16th Iberoamerican Congress on Pattern Recognition, Pucón, Chile. November 17

Outline
1 Introduction
2 Representations
3 Experimental setup
4 Results
5 Discussion
6 Conclusions

Introduction
Advances in pattern recognition (PR) and digital signal processing (DSP) allow the identification of bird species by their emitted sounds and, thereby, the design of automated systems for avian monitoring.
In spite of those advances, biodiversity assessments have typically been carried out by visual inspection, which requires human involvement and may therefore be expensive and have limited coverage.

Automatic systems: a solution
Automatic acoustic monitoring is a non-intrusive and cost-effective alternative that may provide good temporal and spatial coverage [Brandes, 2008].
Source: [Frommolt et al., 2008]

Units in a bird song
[Figure: Hierarchical levels of song of the Common Chaffinch. Source: [Somervuo et al., 2006], Fig. 1.]
Sound-based recognition of bird species is suitable when considering syllables as elementary units [Fagerlund, 2007, Acevedo et al., 2009]. Raw measurements corresponding to those elementary units can be represented in vector spaces where classification rules can afterwards be applied.
Excerpt shown with the figure, from [Somervuo et al., 2006]: "In this paper, we call the smallest unit a syllable. A syllable is basically a sound that a bird produces with a single blow of air from the lungs. This is also somewhat inaccurate definition as many birds are capable of complicated circular breathing cycles during singing [18]. The rate of events in bird vocalization may also be so high that the separation of individual syllables is difficult to perform in a natural environment due to reverberation. The segmentation of a recording to individual syllables is performed using an iterative time-domain algorithm [19]. First, a smooth energy envelope of the signal is computed and the global minimum energy is selected as the initial background noise level estimate. Initial threshold is set to the half of the initial noise level, which is set to the lowest signal envelope energy. Noise level and threshold are updated using the following ..."

Representations
Representations for bioacoustic signals have traditionally been built by feature extraction.
Dissimilarity representations are also a feasible option for this problem: dissimilarities have the potential to build either simpler or richer representations. The latter case, a richer representation, arises for instance when considering time-varying measurements that take non-stationarity into account.

Feature representations
Several feature-based representations commonly used in bioacoustics and bird sound classification are compared:
Standard features [Acevedo et al., 2009]: minimum and maximum frequencies, temporal duration and maximum power.
Coarse representation of segment structure [Acevedo et al., 2009]: minimum and maximum frequencies, temporal duration and frequency of maximum power in eight non-overlapping frames.

Feature representations
Descriptive features [Fagerlund, 2007]: this set includes both temporal and spectral features. Segments are divided into overlapping frames of 256 samples with 50% overlap.
Mel-cepstrum representation: the first 12 MFCCs, the log-energy and the so-called delta and delta-delta coefficients are obtained for each frame. Their mean values along the frames are used as features.
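
The framing and averaging steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and log-energy and zero-crossing rate stand in for the full MFCC and descriptive feature sets, which the slides do not specify in detail.

```python
import numpy as np

def frame_signal(x, frame_len=256, overlap=0.5):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    x = np.asarray(x, dtype=float)
    hop = int(frame_len * (1.0 - overlap))  # 128 samples for 50% overlap
    n_frames = 1 + (len(x) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    # Row i holds samples i*hop ... i*hop + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def mean_frame_features(x, frame_len=256, overlap=0.5):
    """Average two per-frame features over all frames of a segment.

    Assumption: log-energy and zero-crossing rate are placeholders for
    the feature sets named on the slide.
    """
    frames = frame_signal(x, frame_len, overlap)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    signs = np.signbit(frames)
    # Fraction of adjacent sample pairs whose sign differs.
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    per_frame = np.column_stack([log_energy, zcr])
    return per_frame.mean(axis=0)  # one fixed-length vector per segment
```

Averaging along the frames yields a vector of the same length for every syllable, regardless of its duration, which is what lets a standard classifier operate on it.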

Dissimilarity representations
A dissimilarity representation is obtained by building a vector space whose coordinate axes represent dissimilarities, typically distances to prototypes.
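
A minimal sketch of this construction, assuming the objects and prototypes are already available as arrays and that a dissimilarity function is supplied by the caller (the names `dissimilarity_space` and `euclidean` are illustrative, not taken from the presentation):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two equally sized arrays."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def dissimilarity_space(objects, prototypes, dist=euclidean):
    """Map each object to its vector of dissimilarities to the prototypes.

    Row i of the result is the representation of objects[i]; column j is
    the coordinate axis associated with prototypes[j].
    """
    return np.array([[dist(o, p) for p in prototypes] for o in objects])
```

For example, with the single prototype `[0, 0]`, the objects `[0, 0]` and `[3, 4]` map to the one-dimensional coordinates 0 and 5.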

Dissimilarity representations: one-dimensional (1-D) transforms
Signal properties are usually analyzed in the frequency domain. In this study, the spectrum of each signal was computed with two different approaches: the Fast Fourier Transform (FFT) and the Power Spectral Density (PSD).
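
As a rough illustration of the two 1-D transforms, the sketch below computes a magnitude spectrum and a PSD estimate with a fixed transform length so that every segment yields a vector of the same size. The transform length of 1024 and the use of a plain periodogram for the PSD are assumptions; the slides do not state which estimator or length was used for the 1-D case.

```python
import numpy as np

def magnitude_spectrum(x, n_fft=1024):
    """Magnitude of the FFT; the signal is cropped or zero-padded to n_fft."""
    return np.abs(np.fft.rfft(np.asarray(x, float), n=n_fft))

def periodogram_psd(x, fs=44100.0, n_fft=1024):
    """One-sided periodogram estimate of the PSD (n_fft assumed even)."""
    x = np.asarray(x, float)
    n = min(len(x), n_fft)          # number of samples actually used
    spec = np.fft.rfft(x, n=n_fft)
    psd = np.abs(spec) ** 2 / (fs * n)
    psd[1:-1] *= 2.0                # fold negative frequencies, except DC and Nyquist
    return psd
```

Because both functions return `n_fft // 2 + 1` values for any input length, the resulting vectors can be compared directly with the Euclidean distance.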

Dissimilarity representations: two-dimensional (2-D) transforms
Time-varying representations are computed by dividing sound segments into frames and converting each frame to either a spectral or a feature representation. The feature sets measured for each frame were:
1 Spectrogram, also known as short-time Fourier transform (SFT).
2 PSD, estimated with the Yule-Walker method.
3 Selected descriptive features (spectral centroid, signal bandwidth, spectral rolloff frequency, spectral flux, spectral flatness, zero-crossing rate and short-time energy).
4 Mel-cepstrum representation.
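
The first and third items in the list above can be sketched as follows. The frame length of 256 samples with a hop of 128 mirrors the framing mentioned for the feature representations; the Hann window and the function names are assumptions made for this illustration only.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    x = np.asarray(x, float)
    n_frames = 1 + (len(x) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)  # assumed window; not specified in the slides
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_centroid(spec, fs=44100.0):
    """Per-frame spectral centroid (Hz) of a magnitude spectrogram."""
    frame_len = 2 * (spec.shape[1] - 1)          # valid for even frame lengths
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return (spec @ freqs) / (spec.sum(axis=1) + 1e-12)
```

Note that the number of rows depends on the duration of the segment, so two syllables generally produce matrices of different sizes.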

Dissimilarity representations: measures
For equally sized representations (e.g. FFT or PSD), a classical measure such as the Euclidean distance can be used.
The Euclidean distance cannot be applied directly to time-varying representations, because their size varies with the duration of each segment. To overcome this difficulty, this study uses the Earth Mover's Distance (EMD) [Rubner et al., 2000].
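
A small sketch of how the EMD can compare two time-varying representations with different numbers of frames. It treats each frame as a point with equal weight, uses the Euclidean distance between frames as ground distance, and solves the resulting transportation problem with `scipy.optimize.linprog`. The uniform weights, the ground distance and the function name `emd` are assumptions; the slides do not say how the signatures were built.

```python
import numpy as np
from scipy.optimize import linprog

def emd(P, Q):
    """Earth Mover's Distance between two frame sequences.

    P has shape (m, d) and Q has shape (n, d); every frame carries weight
    1/m or 1/n respectively, so both signatures have total weight 1.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    m, n = len(P), len(Q)
    # Ground distance between every frame of P and every frame of Q.
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    # Flow variables f[i, j] are flattened row-major: index i * n + j.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # flow leaving frame i of P
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # flow arriving at frame j of Q
    b_eq = np.concatenate([np.full(m, 1.0 / m), np.full(n, 1.0 / n)])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)  # total flow is 1, so no further normalization
```

For instance, a single frame at 0 compared with two frames at 1 and 3 gives 0.5 · 1 + 0.5 · 3 = 2. The linear program grows with the product of the two frame counts, so a dedicated transportation solver would be preferable for long segments.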

Dataset
A set of birdsong recordings of eleven species was used in the experiments.

Dataset
Raw field recordings were taken at Reserva Natural Río Blanco in Manizales, Colombia.
The sampling frequency of the recordings is 44.1 kHz.
Bird sounds are detected in the raw data using energy-based segmentation techniques [Härmä, 2003].
The data set contains a total of 595 syllables, distributed per species as shown in Table 1.
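
To give an idea of what energy-based detection involves, here is a deliberately simplified sketch that marks frames whose energy lies within a fixed margin of the loudest frame and merges consecutive active frames into segments. It is not the method of [Härmä, 2003]; the fixed threshold of -20 dB, the frame sizes and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def energy_segments(x, frame_len=256, hop=128, threshold_db=-20.0):
    """Return (start, end) sample indices of high-energy regions."""
    x = np.asarray(x, float)
    n_frames = 1 + (len(x) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    energy = np.array([np.sum(x[i * hop: i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    # Energy in dB relative to the loudest frame.
    energy_db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    active = energy_db > threshold_db
    # Locate the borders of runs of consecutive active frames.
    padded = np.concatenate([[False], active, [False]])
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]  # ends are exclusive frame indices
    return [(int(s * hop), int((e - 1) * hop + frame_len))
            for s, e in zip(starts, ends)]
```

On field recordings a fixed threshold is fragile, which is why iterative schemes that re-estimate the noise level are used in practice.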

Dataset
Table 1. Bird species in the current study.

Latin name                        Abbreviation  Syllables
Grallaria ruficapilla             GR            33
Henicorhina leucophrys            HL            64
Mimus gilvus                      MG            66
Myadestes ralloides               MR            58
Pitangus sulphuratus              PS            53
Pyrrhomyias cinnamomea            PC            36
Troglodytes aedon                 TA            33
Turdus ignobilis                  TI            74
Turdus serranus                   TS            78
Xiphocolaptes promeropirhynchus   XP            46
Zonotrichia capensis              ZC            54

Results
Three approaches for representing bird sounds have been analyzed:
1 Feature representations.
2 Dissimilarity representations for signals in the frequency domain.
3 Dissimilarity representations for time-varying signal transforms.

Performance measure
Each representation was evaluated by the leave-one-out performance of the 1-nearest-neighbor (1-NN) classifier.
For each representation, a confusion matrix is reported.
The following performance measures per class are presented: True Positive rate (TP), False Positive rate (FP), Accuracy (ACC) and F1 score.
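
The evaluation protocol above can be sketched as follows, assuming a square matrix of pairwise dissimilarities between all syllables is available and that the nearest neighbor is searched directly in that matrix (the slides do not say whether the search is done there or in the dissimilarity space). The per-class measures use the usual one-versus-rest definitions; function names are illustrative.

```python
import numpy as np

def loo_1nn_predict(D, labels):
    """Leave-one-out 1-NN: each object takes the label of its nearest other object."""
    D = np.array(D, dtype=float)       # copy, so the caller's matrix is untouched
    np.fill_diagonal(D, np.inf)        # an object may not be its own neighbor
    return np.asarray(labels)[np.argmin(D, axis=1)]

def per_class_measures(y_true, y_pred):
    """One-versus-rest TP rate, FP rate, accuracy and F1 score for every class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {}
    for c in np.unique(y_true):
        tp = int(np.sum((y_true == c) & (y_pred == c)))
        fp = int(np.sum((y_true != c) & (y_pred == c)))
        fn = int(np.sum((y_true == c) & (y_pred != c)))
        tn = int(np.sum((y_true != c) & (y_pred != c)))
        out[c.item()] = {
            "TP": tp / (tp + fn),
            "FP": fp / (fp + tn) if (fp + tn) else 0.0,
            "ACC": (tp + tn) / len(y_true),
            "F1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
        }
    return out
```

With 595 syllables, leave-one-out means that every syllable is classified once using the remaining 594 as the reference set.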

Results for feature representations
[Result tables for the feature representations.]
The last table of this group shows the best performance among the feature representations.

Results for dissimilarity representations: 1-D transform
The table for the FFT shows the poorest performance, with a total accuracy of 50.58%.

Results for dissimilarity representations: 1-D transform
This table shows the best performance of the dissimilarity representations for signals in the frequency domain.

Results for dissimilarity representations: time-varying
[Result tables for the dissimilarity representations of time-varying signal transforms.]
The last table of this group shows the best performance among the dissimilarity representations for time-varying signal transforms.

Discussion
Mel-cepstrum and descriptive features performed well in both feature and dissimilarity representations, as expected, because those representations are specifically designed for audio recognition.
The dissimilarity representation for PSD, in spite of being a rather simple representation, yielded an acceptable performance, with a total accuracy of 80.67%.

Discussion
The performance of the dissimilarity representations for time-varying signal transforms is remarkable.
This reveals the importance of taking non-stationarity into account, as can be seen by comparing the dissimilarity representation computed for the FFT (the poorest result, with a total accuracy of 50.58%) with the dissimilarity representation for the time-varying FFT (SFT), which reached a total accuracy of 93.78%.

Discussion
In the case of PSD, the performance also increased when deriving dissimilarities for time-varying transforms, but to a lesser extent: the total accuracy rose from 80.67% to 86.22%.

Conclusions
We conclude that Mel-cepstrum coefficients are suitable for bird sound representation, even more so when dissimilarities are computed from them, i.e. when non-stationarity is taken into account.
Representation in dissimilarity spaces derived from time-varying representations was found to be preferable to representation in spaces derived from the corresponding 1-D representations.

Future work
Continuation of the data collection stage: larger and more representative set.
Construction of trained classifiers in the dissimilarity space.
Comparison with hidden Markov model-based classification: classical Bayesian approach; generative embeddings.

References
Acevedo, M. A., Corrada-Bravo, C. J., Corrada-Bravo, H., Villanueva-Rivera, L. J., and Aide, T. M. (2009). Automated classification of bird and amphibian calls using machine learning: A comparison of methods. Ecological Informatics, 4(4).
Brandes, T. S. (2008). Automated sound recording and analysis techniques for bird surveys and conservation. Bird Conservation International, 18(S1):S163-S173.
Fagerlund, S. (2007). Bird species recognition using support vector machines. EURASIP Journal on Advances in Signal Processing, 2007(1).
Frommolt, K.-H., Bardeli, R., and Clausen, M., editors (2008). Computational bioacoustics for assessing biodiversity: Proceedings of the International Expert meeting on IT-based detection of bioacoustical patterns, December 7th until December 10th, 2007 at the International Academy for Nature Conservation (INA), Isle of Vilm, Germany, volume 234 of BfN-Skripten, Bonn, Germany. Federal Agency for Nature Conservation.
Härmä, A. (2003). Automatic identification of bird species based on sinusoidal modeling of syllables. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '03, volume 5.
Rubner, Y., Tomasi, C., and Guibas, L. (2000). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99-121.
Somervuo, P., Härmä, A., and Fagerlund, S. (2006). Parametric representations of bird sounds for automatic species recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14(6).

Questions?
