Audio Event Detection and classification using extended R-FCN Approach. Kaiwu Wang, Liping Yang, Bin Yang

Size: px

Start display at page:

Download "Audio Event Detection and classification using extended R-FCN Approach. Kaiwu Wang, Liping Yang, Bin Yang"

Elisabeth Mason
6 years ago
Views:

1 Audo Event Detecton and classfcaton usng extended R-FCN Approach Kawu Wang, Lpng Yang, Bn Yang Key Laboratory of Optoelectronc Technology and Systems(Chongqng Unversty), Mnstry of Educaton, ChongQng Unversty, Chna {amsturdy, yanglp, ABSTRACT In ths study, we present a new audo event detecton and classfcaton approach based on R-FCN a state-of-the-art fully convolutonal network framework for vsual object detecton. Spectrogram features of audo sgnals are used as the nput of the approach. The proposed approach conssts of two stages lke R-FCN network. In the frst stage, we detect whether there are audo events by sldng convolutonal kernel n tme axs, and then proposals whch possbly contan audo events are generated by RPN (Regon Proposal Networks). In the second stage, tme and frequency doman nformaton are ntegrated to classfy these proposals and refne ther boundares. Our approach can output the postons of audo events drectly whch can nput a two-dmensonal representaton of arbtrary length sound wthout any sze regularzaton. Index Terms audo event detecton, Convolutonal Neural Network, spectrogram feature 1. INTRODUCTION Intellgent survellance system s becomng ncreasngly ubqutous n our lvng envronment. At present, most of the vdeo-based survellance system s lack of robustness and relablty n practcal applcaton. For example, vdeo-based survellance doesn t work well n some specfc scenaros, such as the nght or cloudy crcumstances. Audo survellance has been alone or n combnaton wth vdeo survellance to solve ths problem. The work n [1] descrbed a framework for scene analyss n a typcal survellance scenaro through ntegratng audo and vsual nformaton. Generally, audo stream s much less onerous than vdeo stream and the audo devces are more nexpensve. Audo events detecton has been one of the mportant components to ntellgent survellance of securty. Unlke the statc or changng slowly backgrounds n vdeo survellance backgrounds, there may be some mpulsve sounds n audo backgrounds. Moreover, the audo sgnal s more versatle when audo events are supermposed on one or more backgrounds wth dfferent sgnal-to-nose rato (SNR). Thus, t s a challenge work to detect audo events correctly from an audo segment. Early works manly concentrated on extractng dfferent types of hand-crafted features, and tranng effectve classfers for recognton wth tradtonal machne learnng algorthms. The most typcal classfer s Support Vector Machnes (SVM). For nstance, a method amed at recognzng envronmental sounds for survellance and securty applcatons s presented n [2], whch apples one-class support vector machnes together wth a sophstcated measure. Another approach s to utlze a Gaussan Mxture Model (GMM) for sounds recognton. The proposed approach n [3] frst classfes a gven audo frame nto vocal and non-vocal events, and then performs further classfcaton nto normal and target events usng GMM. The non-statonary (tmefrequency) technques are appled to sound classfcaton and produce a good result. In [4], Jonathan Denns et al. extract mage feature from the SPD mage a novel two-dmensonal representaton that characterzes the spectral power dstrbuton over tme n each frequency sub-band. In [5], [6], features are extracted from the spectrogram mage of sound sgnals for automatc sound recognton. More recently, methods based on Deep Neural Networks (DNNs) have acheved good performance for sound event classfcaton and detecton. In [12], the authors outlne a sound event classfcaton framework that compares audtory mage front end features wth spectrogram mage-based front end features, usng deep neural network classfers. The work n [13] employs an ensemble learnng framework a stack of ensemble classfers named mult-resoluton stackng (MRS). A concatenaton of lower buldng blocks predctons and the expanson of the raw acoustc feature s fed nto each classfer n MRS. The lower buldng blocks descrbe a base classfer n MRS, named boosted deep neural network. In [14], the authors record a database of sounds occurrng n subway trans n real condtons of explotaton and use DNNs to classfy the sounds nto screams, shouts and other categores. In these audo events detecton methods, Deep Neural Networks (DNNs) mostly work as a classfer whch acheve better performance than other tradtonal classfers, but can t drectly detect audo events n real-tme when the audo events occur. In ths paper, we present a new audo event detecton and classfcaton approach by extendng the R-FCN framework [11] whch s a fully convolutonal neural network used n vsual object detecton. Inspred by the framework, the proposed approach also conssts of three parts (not nclude the extractng of feature maps). The frst part s RPN Network [10] for generatng a lst of proposals whch possbly contan audo events. However, unlke the proposals wth dfferent wdths and heghts n [11], the proposals n our extended R-FCN framework have the same heght and just vary n wdth. We do ths just because the sound s one dmenson sgnal. The second part refnes the boundares of above proposals. The thrd part classfes above proposals nto audo events or backgrounds. In the second and thrd parts, we not only use the nformaton n tme doman but also use the nformaton n dfferent frequency ranges for classfcaton and boundary regresson. Spectrogram features of audo sgnals are

2 Fgure 1: Overvew of the system archtecture for audo detecton and classfcaton fed nto the archtecture as nputs. Based on the capacty of Res- Net network [8] for extractng features and solvng the gradent degradaton problem when tranng a deep neural network, feature maps are produced by the modfed ResNet50 network fed wth these spectrogram feature. These feature maps are shared by above three parts. The rest of ths paper s organzed as follows. Secton 2 presents the proposed deep learnng approach for audo detecton and classfcaton. Secton 3 dscusses our experments and results. Secton 4 concludes ths work. 2. PROPOSED APPROACH Our approach follows the deep learnng framework of R- FCN whch s a fully convolutonal neural network for vsual object detecton [11]. We mprove the Regon Proposal Network (RPN) to make the framework sutable for audo detecton. To make the overall system more smply and generate real-tme result, we substtute ResNet50 for ResNet101, and modfy Res- Net50 network. The overall system archtecture essentally conssts of two stages through three parts n Fgure 1. Frstly, we detect the approxmate poston (proposal) of audo event and don t care the audo event class by our frst part. So, ths can be treated as a bnary classfcaton problem. Secondly, we classfy the proposal nto audo events or background and refne the boundares n our second and thrd parts. These proposals are also called as regons of nterest (RoIs) The Modfed ResNet50 Network Based on the capacty of ResNet50 [8] network for extractng features and solvng the degradaton problem when tranng a deep neural network, we extract the shared feature maps from spectrogram feature by the modfed ResNet50 network. Experments show that the performance wll be reduced because of the reducton of feature maps resoluton. So, we remove the last fc layer, poolng layer and three buldng blocks n ResNet50. The modfed ResNet50 begns wth a 7 7 convolutonal layer whch has 64 kernels and a strde of 2, then follows a 3 3 max poolng layer wth a strde of 2, the rest of the modfed Res- Net50 are seres of buldng blocks. The last convolutonal layer has 1024 kernels, so the generated shared feature maps has 1024 channels The Improved RPN Network For audo events detecton, the mproved RPN network outputs the start tme, the length of proposals wth scores estmatng the probablty that these proposals belong to audo events or background. We model ths process followed a fully convolutonal network [10]. The shared feature maps produced by the modfed ResNet50 are fed nto RPN network. A strp wndow sldes over the shared feature maps to generate m base regons for proposals at each poston on the tme axs of feature map. These m base regons are dfferent n length but start at the same tme of audo sgnal corresponded to the current poston of sldng wndow. The sldng wndow has a sze of n 1, n s the heght of the shared feature maps. At each poston of sldng wndow, 256-dmensonal vector representng these m base regons smultaneously s generated, whch s used to produce scores and coordnates of proposals based on these m base regons. The coordnates represent the dfference of start tme and length between a base regon and proposal. The scores estmate the probablty that these m proposals belong to audo

Then, the 256-d feature vector s used for generatng scores and refnng the poston.

3 events or background. So, 2m scores and 2m coordnates are generated for each poston of sldng wndow. In another word, we generate m base regons at every few frames of audo sgnal, and extract 256-d feature vector from front part of these m base regons to represent all m base regons smultaneously. Then, the 256-d feature vector s used for generatng scores and refnng the poston. Takng nspraton from the character of convolutonal layer, the mproved RPN network s mplemented by three normal convolutonal layers. The 256-d feature vector s produced by a n 1 convolutonal layer whch has 256 kernels and works as a sldng wndow. Then the 256-d vector s smultaneously fed nto two sblng 1 1 convolutonal layers a classfcaton layer (cls layer) and a regresson layer (reg layer). The cls layer and reg layer have 2m convolutonal kernels respectvely. The cls layer outputs 2 scores that estmate the probablty that the proposal belongs to audo events or background. So the cls layer has 2m outputs. The reg layer has 2m outputs representng the dfference between base regons and proposals n start tme and length. The key deal of RPN network for audo detecton s llustrated n the Fgure 2. We use m lnes wth dfferent lengths represent the m base regons n the Fgure 2. Fgure 2: Key deal of RPN network for audo detecton We use two 2-d vectors (B s, B l ) and (G s, G l ) denote the poston of a base regon and ground truth respectvely. We are more concerned wth the tme that audo event occurs, so the 2-d poston vector denotes the start and length of an audo event. We need to defne an operaton F to produce poston vector of proposal from a base regon: ( P, P) F( B, B ) (1) s l s l Here, (P s, P l ) denote the poston vector of a proposal. When the base regon near to the ground truth, we take shft and scale transformaton nto consderaton from a base regon. The operaton F can be smply defned as: P B * t B, P B * exp( t ) (2) s l s s l l l t s,t l s the shft factor and scale factor need to be learned, and they are the reg layer s output for a base regon. So, the correspondng labels are defned as: * * ts ( Gs Bs ) / B, l tl log( Gl / Bl ) (3) When tranng the mproved RPN network, our loss functon for audo detecton followed the mult-task loss n [10] whch can solve the classfcaton and coordnates refnng smultaneously, defned as: 1 L p t L p p p L t t 2 2 * * *, cls, reg, (4) Here, s the ndex of base regons, P represents the probablty that base regon s predcted as an audo event or background. We set the balance weght λ=1. The classfcaton loss L cls s log loss over two classes (audo event or background). The groundtruth label P * s 1 f the base regon s an audo event, otherwse s 0. Ths means the regresson loss L reg (smooth L 1 loss functon, defned n [9]) s actvated only for the base regon near to ground truth. t s a two-dmensonal vector (t s,t l ) for base regon, whch s produced by reg layer. t * s a twodmensonal vector (t * s,t * l ) for base regon The classfcaton and refnng of proposals After the frst stage, we select 6000 proposals whch are most lkely to be audo events from the translatons of all base regons. We do ths by checkng ther probablty to be audo event because the frst stage s a bnary problem. In the second stage, k 2 (C+1)-channel score maps are produced by a convolutonal layer whch has k 2 (C+1) kernels and 2k 2 -channel coordnate maps are produced by a convolutonal layer whch has 2k 2 kernels as seen n Fgure 1. C s the categores of audo events (+1 for background), k equals 3, but whch s a hyper parameter related to the parameter of the followng RoI poolng layer. Gven the 6000 proposals (RoIs), each RoI s dvded nto k k parts, and k k score / coordnate features are produced through RoI poolng layer from score / coordnate maps respectvely. The detals of the RoI poolng layer defned n [11]. After the RoI poolng layer, a vote prncple (mplemented by an average poolng layer) s appled to every channel of scores / coordnates features. Then the 2-d coordnates are generated, and the (C+1)-dmensonal scores for each category are generated after a softmax layer. In the score or coordnate maps, the three red, green, blue maps represent the hgh, mddle, low frequency components respectvely, and each of these three maps n every color (red, green, blue) maps represents sequentally one thrd segment of a proposal n tme axs correspond to the color depth, the deepest color score map represents frst segment of a proposal and the lghtest color score map represents the last segment. Each component of score / coordnate features s from average poolng on only one of k k score / coordnate maps correspond to ther same color. Ths can be understood that not only components at dfferent tme perods but also the dfferent frequency components at same tme perods have dfferent response to the classfyng or refnng of a RoI. For example, there are three RoIs whch have same sze and one thrd overlap as showed n Fgure 3. However, the same one thrd part has dfferent response to the classfyng or refnng of these three RoIs showed by the lghter of every color. Moreover, dfferent frequency components have dfferent response to the classfyng or refnng of each RoI showed by the dfferent color (red, green, blue). What s more, there are always k k score / coordnate features regardless of the sze of RoI. In another word, the sound wth arbtrary length can be processed.

4 Fgure 3: Illustraton of the RoI poolng layer s functon 3. EXPERIMENTS We perform experments on the dataset of IEEE DCASE Challenge 2017 Task 2, whch conssts of solated sound events for three target class (babycry, glassbreak, gunshot) and recordngs of everyday acoustc scenes to serve as background. The background audo materal conssts of recordngs from 15 dfferent audo scenes, and s part of TUT Acoustc Scenes 2016 dataset [7]. We regularze audo sgnals to same sample frequency of 44.1 KHz and synthesze our tranng data accordng to the event-to-background rato (EBR, -6.0, 0, 6.0). Our tranng data has mxtures (5000 per target class, each mxture only contans a target class event). Then we generate the grayscale spectrogram through short-tme Fast Fourer transform. A hammng wndow wth a length of 1024 samples and an overlap of 441 samples s appled. The generated spectrogram feature has a heght of 512. We evaluate our approach on the development dataset. The manly evaluaton metrcs are Error Rate (ER) and F-Score (F) calculated usng event-based onset-only condton wth a collar of 500ms [15]. Table 1: The results for each class on development dataset Class F (%) ER DR IR Baselne babycry Our approach glassbreak gunshot All babycry glassbreak gunshot All Table 1 shows the result on Event-based overall metrcs for each class on development dataset compared wth Baselne system for DCASE Challenge 2017 Task 2. DR, IR are the Deleton Rate, Inserton Rate respectvely [15]. The mplementaton of Baselne s based on a multlayer perceptron archtecture (MLP) and uses log mel-band energes as features. The features are calculated n frames of 40 ms wth a 50% overlap, usng 40 mel bands coverng the frequency range 0 to Hz. The feature vector was constructed usng a 5-frame context, resultng n a feature vector length of 200. The MLP conssts of two dense layers of 50 hdden unts each, wth 20% dropout. For each of the target classes, there s a separate bnary classfer wth one output neuron wth sgmod actvaton, ndcatng the actvty of the target class. Our approach has an ER value of 0.16, F-Scores of 91.4% for all classes, whch outperforms the Baselne system (ER: 0.60, F: 70.0%). The ER values of babycry and gunshot are mproved by 0.67 and 0.53 respectvely. It shows great mprovement for babycry and gunshot. The result on Event-based overall metrcs for each EBR on development dataset s shown n Table 2. F-Scores and ER for all EBRs (not ncludes the mxture whch doesn t contan target event) are 91.7% and The dfference of ER value between each EBR s lower than that of Baselne system, whch proves our approach s more robust to nose. Our approach has lower IR value for each class and EBR, whch ndcates our approach s more relable. Table 2: The results for each EBR on development dataset Class F (%) ER DR IR Baselne Our approach All All On the evaluaton dataset, F-score and ER on segment-based metrcs are and 82.0%, respectvely. Our approach generally produced better result because of the two stage strategy. After the frst stage, there wll be a global summary for an audo event compared wth Baselne s strategy. The 5-frame context of Baselne system only has a local summary for an audo event. At the same tme, the usng of deeper network and the ntegraton of tme and frequency doman nformaton s another reason for the better performance. 4. CONCLUSION We present an audo event detecton and classfcaton approach whch can drectly output the postons of audo events from an audo sgnals wth arbtrary length. We mproved RPN network for generatng proposals whch possbly contan audo events. Tme and frequency doman nformaton are fully utlzed to classfy these proposals and refne ther boundares.

5 5. REFERENCES [1] Crstan M, Bcego M, Murno V, et al. Audo-Vsual Event Recognton n Survellance Vdeo Sequences[J]. IEEE Transactons on Multmeda, 2007, 9(2): [2] Rabaou A, Davy M, Rossgnol S, et al. Usng One-Class SVMs and Wavelets for Audo Survellance[J]. IEEE Transactons on Informaton Forenscs and Securty, 2008, 3(4): [3] Atrey P K, Maddage N C, Kankanhall M S, et al. Audo Based Event Detecton for Multmeda Survellance[C]. nternatonal conference on acoustcs, speech, and sgnal processng, 2006: [4] Denns J, Tran H D, Chng E S, et al. Image Feature Representaton of the Subband Power Dstrbuton for Robust Sound Event Classfcaton[J]. IEEE Transactons on Audo, Speech, and Language Processng, 2013, 21(2): [5] Sharan R V, Mor T J. Nose robust audo survellance usng reduced spectrogram mage feature and one-aganst-all SVM[J]. Neurocomputng, 2015: [6] Sharan R V, Mor T J. Subband Tme-Frequency Image Texture Features for Robust Audo Survellance[J]. IEEE Transactons on Informaton Forenscs and Securty, 2015, 10(12): [7] Mesaros A, Hettola T, Vrtanen T. TUT database for acoustc scene classfcaton and sound event detecton[c]// European Sgnal Processng Conference. 2016: [8] He K, Zhang X, Ren S, et al. Deep Resdual Learnng for Image Recognton[C]. computer vson and pattern recognton, 2015: [9] Grshck R. Fast R-CNN[C]. nternatonal conference on computer vson, 2015: [10] Ren S, He K, Grshck R, et al. Faster R-CNN: Towards Real-Tme Object Detecton wth Regon Proposal Networks[J]. IEEE Transactons on Pattern Analyss and Machne Intellgence, 2015: 1-1. [11] Da J, L Y, He K, et al. R-FCN: Object Detecton va Regon-based Fully Convolutonal Networks[C]. neural nformaton processng systems, 2016: [12] Mcloughln I V, Zhang H, Xe Z, et al. Robust sound event classfcaton usng deep neural networks[j]. IEEE Transactons on Audo, Speech, and Language Processng, 2015, 23(3): [13] Zhang X, Wang D. Boostng contextual nformaton for deep neural network based voce actvty detecton[j]. IEEE Transactons on Audo, Speech, and Language Processng, 2016, 24(2): [14] Lafftte P, Sodoyer D, Tatkeu C, et al. Deep neural networks for automatc detecton of screams and shouted speech n subway trans[c]. nternatonal conference on acoustcs, speech, and sgnal processng, 2016: [15] Mesaros A, Hettola T, Vrtanen T. Metrcs for polyphonc sound event detecton[j]. Appled Scences, 2016, 6(6): 162. [16] Wabel A, Hanazawa T, Hnton G, et al. Phoneme recognton usng tme-delay neural networks[j]. IEEE transactons on acoustcs, speech, and sgnal processng, 1989, 37(3):

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below