arxiv: v1 [cs.sd] 22 Dec 2017


Music Genre Classification with Paralleling Recurrent Convolutional Neural Network

Lin Feng, Shenlan Liu, Jianing Yao

December 2017

Abstract

Deep learning has demonstrated its effectiveness and efficiency in music genre classification. However, existing approaches still have several shortcomings that impair performance on this task. In this paper, we propose a hybrid architecture that consists of paralleling CNN and Bi-RNN blocks, which focus on extracting spatial features and temporal frame orders, respectively. The two outputs are then fused into one powerful representation of the musical signal and fed into a softmax function for classification. The paralleling network makes the extracted features robust enough to represent music. Moreover, the experiments show that the proposed architecture improves music genre classification performance and that the additional Bi-RNN block is a useful supplement to CNNs.

1 Introduction

With the extensive use of various music platforms, an increasing amount of music is widely distributed, which makes it hard for audiences and platforms to organize it. It is impossible to organize and distinguish such a large amount of music manually. Therefore, constructing a convenient way to deal with this problem is of vital importance, but challenging. Most state-of-the-art methods aim to classify the music genre, a top-level label that helps audiences categorize and describe music [1]. Meanwhile, accurate genre classification is crucial for music platforms to organize music into different groups. For this reason, music genre classification has attracted wide attention in the field of music information retrieval (MIR) [2][3].

As two crucial components of music genre classification, feature extraction and classifier learning greatly influence the performance of most classification systems [4]. Feature extraction concentrates on finding suitable representations of the samples to be classified, in terms of feature vectors or pairwise similarities [5]. After feature extraction, the features are fed into a classifier, which aims to map feature vectors into different music genres. Baniya et al. [6] adopt timbral texture features (i.e., Mel-frequency Cepstral Coefficients) and rhythm content features such as the beat histogram (BH) [1] to represent music signals. They then combine the Extreme Learning Machine (ELM) [7] with bagging [8] as a classifier. Arabi et al. [9] bring chord features and chord progression information into feature extraction. Using a Support Vector Machine (SVM), they show that chord features in conjunction with low-level features [5] can provide higher classification accuracy. The state-of-the-art result is reported by Sarkar et al. [10], who employ Empirical Mode Decomposition (EMD) for signal component extraction and depend only on pitch-based features.

Even though all the methods above achieve good performance in certain situations, hand-crafted features have serious disadvantages. Extracting them from music signals requires complex processing, so researchers need expertise in the musical domain. Furthermore, features extracted for one task lack universality, since they may perform poorly in other tasks.

In recent years, deep learning, especially Convolutional Neural Networks (CNNs), has been used successfully in various image classification tasks [11][12]. Meanwhile, Dieleman et al. [13] show that, like ordinary images, spectrograms of music audio can also achieve good performance with CNNs. As a result, there is a growing tendency to learn robust feature representations from music spectrograms with CNNs [14][15]. In contrast with traditional methods, CNNs provide an end-to-end training architecture that combines feature extraction with music classification in one stage, and multiple CNN-based works have shown their superiority for music genre classification. It is worth noticing, however, that unlike ordinary images, spectrograms of music carry strong sequential relationships.

Existing CNN-based music genre classifiers are not able to model the long-term temporal information in spectrograms of music data. Recurrent Neural Networks (RNNs) [16] can model long-term dependencies such as music structure or recurrent harmonies [17], which are significant for music classification. To address the limitations mentioned above, we propose a hybrid learning architecture named Paralleling Recurrent and Convolutional Neural Network (PRCNN), which consists of a CNN block and a Bidirectional Recurrent Neural Network (Bi-RNN) block [18]. The main contribution of the proposed architecture is that the hybrid structure models not only spatial features but also the temporal frame order of music data, which is highly complementary for music genre classification compared with plain CNNs.

The rest of this paper is organized as follows. Section 2 reviews related work on music genre classification and analyzes its contributions and limitations. Section 3 describes the construction of the proposed hybrid architecture PRCNN in detail. In Section 4, we carry out experiments on several datasets and demonstrate the validity of PRCNN. Finally, Section 5 draws conclusions and presents some future work.

2 Related work

Music genre classification is a widely studied area in Music Information Retrieval for categorizing and describing enormous amounts of music [1]. Various studies indicate that extracting representative features from music signals can substantially improve classification performance. Thus, most existing works focus on extracting robust features to represent music.

Motivated by the success of CNNs in computer vision [19], CNNs have also attracted much attention in music genre classification. Trained end to end, CNNs have a powerful capacity to represent music with higher-level features. In addition, CNNs require less engineering effort and prior knowledge of the field. Li et al. [14] state that the variations of musical patterns after a transformation such as the Fast Fourier Transform (FFT) or Mel-frequency Cepstral Coefficients (MFCC) are similar to images, on which CNNs work well [12]. Moreover, they show that CNNs are feasible alternatives for extracting musical pattern features automatically. Although their work opens the way to replacing hand-crafted features, the experimental results show that the proposed structure does not generalize well: performance on test data is clearly below that on training data.

Zhang et al. [15] proposed two networks to improve the performance of music genre classification with CNNs. In one network, max- and average-pooling are applied jointly across the entire time axis in order to offer more statistical information to the following layers. In the other, aiming to improve accuracy through increased depth, they use shortcut connections inspired by residual learning [20]. Both CNNs are shown to improve on previous results on the GTZAN dataset [1].

However, as mentioned in the previous section, musical patterns have temporal relationships that are crucial for music genre classification but are dropped in CNNs. For this reason, Choi et al. [21] design a hybrid model named convolutional recurrent neural network (CRNN), in which CNNs and RNNs are used as feature extractor and temporal summarizer, respectively. Compared with three existing CNNs, CRNN is shown to improve music classification performance by learning more temporal information. But this hybrid model also has a limitation that impairs its performance. Even though CRNN uses RNNs as the temporal summarizer, it can only summarize temporal information from the output of the CNNs; the temporal relationships of the original musical signal are not preserved through the CNN operations. To preserve both the spatial features and the temporal frame order of the original music signal, we design a hybrid model that consists of paralleling CNN and Bi-RNN blocks. The next section describes the proposed hybrid architecture in detail.

Figure 1: The network architecture of PRCNN
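As a rough textual companion to the architecture figure, the sketch below traces tensor sizes through the CNN block using the layer sizes given in Section 3.1 and the 513 × 128 input described in Section 4.2. It rests on two assumptions that the paper does not state explicitly: the padded convolutions keep the spatial size unchanged, and pooling rounds down. The function and variable names are hypothetical.

```python
def pool_out(size, window):
    """Output length of max-pooling whose stride equals its window (floor-rounded)."""
    return size // window


def cnn_block_shapes(freq_bins=513, frames=128):
    """Trace (freq, time, channels) through the five convolution + max-pooling stages."""
    filters = [16, 32, 64, 128, 64]  # filters per convolutional layer (Section 3.1)
    pools = [2, 2, 2, 4, 4]          # pooling window and stride per stage (Section 3.1)
    shapes = []
    f, t = freq_bins, frames
    for n_filters, p in zip(filters, pools):
        # Assumption: the padded convolution leaves f and t unchanged,
        # so only the pooling step shrinks the feature map.
        f, t = pool_out(f, p), pool_out(t, p)
        shapes.append((f, t, n_filters))
    return shapes


shapes = cnn_block_shapes()
f, t, c = shapes[-1]
cnn_features = f * t * c                        # flattened CNN block output
rnn_features = 256                              # BGRU block output size (Section 3.3)
fused_by_addition = cnn_features                # element-wise sum keeps the size
fused_by_concat = cnn_features + rnn_features   # concatenation stacks both vectors

print(shapes)
print(cnn_features, fused_by_addition, fused_by_concat)
```

Under these assumptions the flattened CNN output has the same length as the 256-dimensional BGRU output, which is what makes element-wise addition of the two vectors possible.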

3 Methodology

As illustrated in Figure 1, our proposed hybrid architecture is divided into four blocks whose weights play different roles. At the bottom of Figure 1, we use the Short-Time Fourier Transform (STFT) spectrogram of the musical signal as the input of the network. The input (a spectrogram of 128 frames, each a 513-dimensional vector, as constructed in Section 4.2) is simultaneously fed into the paralleling CNN and Bi-RNN blocks for feature extraction. As mentioned above, CNNs perform very well at extracting spatial features of music. However, the STFT spectrogram of a musical signal has significant sequential relationships that are lost in CNNs during supervised learning. Thus, the paralleling Bi-RNN block is employed to extract the temporal frame order from the spectrogram as a supplement. The outputs of the two paralleling blocks are then fused into one feature vector, which is classified next. After a dense layer, we apply a softmax operation as a post-processing stage to obtain a vector of normalized probabilities of the different music genres. As mentioned in Section 1, feature extraction is a crucial part of music genre classification. Therefore, in the rest of this section, we describe in detail the paralleling CNN and Bi-RNN blocks used for feature extraction.

3.1 Convolutional Neural Network Block

Except for the input and output layers, the CNN block of our proposed hybrid architecture has 10 layers, organized as five convolution-pooling pairs: each convolutional layer is followed by a max-pooling operation that further processes its output. Each kernel detects a fixed 3 × 1 region in the previous layer with 1 × 1 padding. The padding is designed to reduce information loss during convolution. In order to acquire more meaningful representations from the spectrogram, we design the five convolutional layers with 16, 32, 64, 128 and 64 filters, respectively. The first three max-pooling layers output the maximum value within a 2 × 2 rectangular neighborhood with 2 × 2 strides.
The upper two max-pooling layers report the maximum value of a 4 × 4 region with 4 × 4 strides to extract more robust representations. The output of the CNN block is a 256-dimensional vector, which is fed into the classifier in conjunction with the output of the Bi-RNN block.

Convolution kernel size

Kernels are regarded as feature detectors in convolutional layers. In general, a kernel size defined as k × r × c means the kernel can learn k features of size r × c, where r and c refer to the rows and columns of a kernel, respectively. The kernel size determines the range of the feature map that a kernel can detect, so it affects the performance of feature learning. When the kernel size is too small, the kernel cannot learn representative features from the given data. Thus some researchers, such as Krizhevsky et al. [22], used large convolution kernels to detect features. However, increasing the kernel size increases the number of parameters per feature detector, and therefore both storage and computation. Moreover, large kernels lose the invariance within their ranges [23]. Aiming to learn more representative features with fewer parameters, the kernel size used in our proposed architecture is 3 × 1, which has shown excellent feature detection performance with moderate parameter storage and computation.

Pooling

The pooling function is regarded as a subsampling process and is a crucial stage in CNNs. In contrast with convolution, pooling is a non-linear operation that produces a summary statistic of the nearby outputs. The max-pooling operation employed in the CNN block can represent the most prominent features of music, such as amplitudes. Max-pooling also reduces the dimension of the previous output, and therefore helps prevent the network from overfitting by requiring fewer parameters. The pooling size is also an important factor in music genre classification. In general, an undersized pooling window makes the network insufficiently invariant to small translations. Conversely, if the pooling window is oversized, some required feature locations are lost and errors may be introduced into the classification result.

Rectified Linear Units

Convolution is a linear operation, which is usually not enough to reflect the representations of features. Thus, we employ Rectified Linear Units (ReLUs) [24] to introduce non-linearity. The ReLU activation function is defined as f(x) = max(0, x). ReLUs produce sparse feature representations in hidden layers, since components below 0 are cut off. In contrast with the sigmoid, ReLUs do not saturate for positive inputs, where the derivative of the activation function is constantly 1; this alleviates the vanishing gradient problem to some degree. ReLUs also converge faster than traditional sigmoid and tanh activations.

3.2 Bidirectional Gated Recurrent Units Block

As illustrated in Figure 1, the BGRU-RNN block consists of 7 layers, not counting the input and fused output layers. In this block, the input is first processed by a max-pooling layer to reduce the dimension.
After this step, the dimension of the spectrogram is reduced. Since the upper BGRU layers are rather complex, we employ an embedding layer for further dimension reduction to decrease the number of parameters. After these reduction steps, the reduced input is fed into two stacked BGRUs, illustrated in Figure 2, for feature extraction. In contrast to the output of the CNN block, we simply splice the outputs of the two stacked BGRU layers into one 256-dimensional feature vector.

Standard recurrent neural networks (RNNs) only take advantage of previous context and ignore the backward dependencies, which are also important for feature learning. However, many applications have demonstrated that the prediction of y^(t) depends heavily on the whole input sequence, including both past and future information. Another limitation of traditional RNNs is that they suffer from vanishing and exploding gradients when dealing with long-term dependencies. Thus, in our hybrid architecture, we use two stacked BGRUs, a variant of RNNs, to improve the performance of feature extraction. The structure of a BGRU is shown in Figure 2 and described in detail below.

Figure 2: The network architecture of BGRU (layers from bottom to top: input layer, forward layer, backward layer, output layer; shown at time steps t-1, t and t+1)

Bidirectional Gated Recurrent Units

The design of the BGRU is motivated by two main considerations: 1) using the Gated Recurrent Unit (GRU) to extract from the spectrogram the temporal features that are lost in CNNs; 2) extracting powerful representations by taking full advantage of both the past and the future information of a sequence. The GRU is proposed in [25] to make the recurrent blocks adaptively capture information from variable-length sequences. A BGRU architecture means that we employ GRUs in both the forward-state part and the backward-state part. As illustrated in Figure 2, the input layer feeds both the forward and backward layers, and the output layer is produced from both of them, but the two opposite-direction layers have no direct connections to each other.

The GRU is a simplified variant of the Long Short-Term Memory (LSTM) [26]: it merges the input and forget gates into one update gate and adds a reset gate. A single gating unit therefore simultaneously controls the forgetting factor and the decision to update the state unit. In the i-th GRU, the activation $h_i^{(t)}$ at time t is calculated from the previous activation $h_i^{(t-1)}$ and the current candidate update $\tilde{h}_i^{(t)}$:

$$h_i^{(t)} = u_i^{(t)} \tilde{h}_i^{(t)} + (1 - u_i^{(t)})\, h_i^{(t-1)}, \tag{1}$$

where $u_i$ and $\tilde{h}_i$ respectively stand for the update gate and the candidate activation. The update gate decides how much the unit updates its activation:

$$u_i^{(t)} = \sigma\left(b_i^{u} + U_i^{u} x^{(t)} + W_i^{u} h^{(t-1)}\right), \tag{2}$$

where b, U and W respectively denote the biases, input weights and recurrent weights into the i-th GRU. The input vector at time t is denoted $x^{(t)}$. The candidate activation is computed analogously to the update gate:

$$\tilde{h}_i^{(t)} = \tanh\left(b_i + U_i x^{(t)} + W_i \left(r^{(t)} \odot h^{(t-1)}\right)\right), \tag{3}$$

where r stands for the reset gate and $\odot$ denotes element-wise multiplication. If $r_i^{(t)}$ is close to 0, the reset gate is off and the unit forgets the past information. The reset gate is defined as:

$$r_i^{(t)} = \sigma\left(b_i^{r} + U_i^{r} x^{(t)} + W_i^{r} h^{(t-1)}\right). \tag{4}$$

The update and reset gates can each ignore parts of the state vector. The update gates decide how much the past states should influence the current states, while the reset gates introduce a nonlinear effect into the relationship between the past state and the future state: they decide which parts of the state are used to compute the future state. In our bidirectional architecture, the forward GRUs are computed from past states along the positive time axis, while the backward GRUs are computed from future states along the reverse time axis. For instance, the activation at time t of a backward GRU is calculated from the future activation $h_i^{(t+1)}$ and the current candidate update:

$$h_i^{(t)} = u_i^{(t)} \tilde{h}_i^{(t)} + (1 - u_i^{(t)})\, h_i^{(t+1)}, \tag{5}$$

and the other formulas are analogous, computed along the reverse time axis.

Compared with the LSTM, the GRU has a simpler structure, yet it still captures temporal correlations in musical signals and mitigates the problem of vanishing and exploding gradients. Both GRU and LSTM preserve important information through their gates when dealing with long-term dependencies. In the GRU, however, the gate activations depend only on the previous output and the current input. Thus, the simpler GRU is less prone to overfitting and tends to converge faster than the LSTM, with fewer parameters.
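To make the gate equations concrete, the following is a minimal pure-Python sketch of one GRU update following Eqs. (1)-(4), plus a bidirectional pass in which the backward direction follows Eq. (5). It is an illustration under stated assumptions, not the authors' implementation: the parameter dictionary layout, the zero initial state, the random initialization in `init_params`, and the per-time-step concatenation of forward and backward states are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(b, U, x, W, h):
    """Return b + U x + W h, with vectors as lists and matrices as lists of rows."""
    return [
        b_i
        + sum(u_ij * x_j for u_ij, x_j in zip(U_i, x))
        + sum(w_ij * h_j for w_ij, h_j in zip(W_i, h))
        for b_i, U_i, W_i in zip(b, U, W)
    ]


def gru_step(x, h_adj, p):
    """One GRU update; h_adj is h(t-1) for the forward pass, h(t+1) for the backward pass."""
    u = [sigmoid(z) for z in affine(p["b_u"], p["U_u"], x, p["W_u"], h_adj)]  # Eq. (2)
    r = [sigmoid(z) for z in affine(p["b_r"], p["U_r"], x, p["W_r"], h_adj)]  # Eq. (4)
    gated = [r_i * h_i for r_i, h_i in zip(r, h_adj)]                         # r ⊙ h
    cand = [math.tanh(z) for z in affine(p["b"], p["U"], x, p["W"], gated)]   # Eq. (3)
    return [u_i * c_i + (1.0 - u_i) * h_i for u_i, c_i, h_i in zip(u, cand, h_adj)]  # Eq. (1)/(5)


def run(xs, p, n_hidden):
    """Apply gru_step along a sequence, starting from an all-zero state (assumption)."""
    h = [0.0] * n_hidden
    states = []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(xs, p_fwd, p_bwd, n_hidden):
    """Forward and backward passes share no connections; their states are joined per step."""
    fwd = run(xs, p_fwd, n_hidden)
    bwd = run(xs[::-1], p_bwd, n_hidden)[::-1]  # process the reversed sequence, then realign
    return [f + b for f, b in zip(fwd, bwd)]    # list concatenation at each time step


def init_params(n_in, n_hidden, seed):
    """Hypothetical helper: small random weights and zero biases for illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for gate in ("_u", "_r", ""):
        p["b" + gate] = [0.0] * n_hidden
        p["U" + gate] = mat(n_hidden, n_in)
        p["W" + gate] = mat(n_hidden, n_hidden)
    return p


xs = [[0.1 * (t + k) for k in range(3)] for t in range(5)]  # 5 frames, 3 features each
out = bigru(xs, init_params(3, 4, seed=0), init_params(3, 4, seed=1), n_hidden=4)
print(len(out), len(out[0]))
```

Because each new state is a convex combination of a tanh output and the previous state, every activation stays strictly inside (-1, 1) when the initial state is zero, which is a convenient sanity check for an implementation.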
3.3 Feature Fusion and Classifier Block

The outputs of the two paralleling blocks are two 256-dimensional vectors. In our hybrid architecture, the CNN and Bi-RNN blocks respectively focus on extracting spatial features and the temporal frame order of musical signals. Thus, the two vectors need to be fused into one powerful representation to improve the performance of music genre classification. Since the two vectors have the same size, we consider two methods of fusing them into one feature representation: 1) directly add the values of the two vectors element-wise, which yields a new 256-dimensional vector; 2) keep the original values of the two vectors and concatenate them into a 512-dimensional vector. After feature fusion, the combined representation is fed into dense and softmax layers to carry out the classification.

In the classifier block, a dense layer maps the fused vector into a feature vector of size 10. A softmax function is then applied to this feature vector for music genre classification. The softmax function is defined as:

$$P(i) = \frac{\exp(x_i)}{\sum_{k=1}^{K} \exp(x_k)}, \tag{6}$$

where $P(i)$ and $x_i$ respectively represent the probability of the i-th music genre and the i-th value of the feature vector, and K = 10 is the number of genres. The aim of using a softmax function is to map each value of the feature vector into the range between 0 and 1, such that $\sum_{k=1}^{K} P(k) = 1$. The 10 resulting values can therefore be regarded as the probabilities of the 10 music genres.

4 Experiment

In this section, we introduce the two datasets used in our experiments and report comparative experimental results that validate the effectiveness of the proposed paralleling architecture.

4.1 Dataset Description

Two classical datasets are used in our experiments. One is the GTZAN dataset [1], which has been used as a benchmark in various systems for music genre classification. It consists of 1000 song excerpts that are evenly distributed over ten genres: Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae and Rock. Each excerpt is about 30 seconds long and sampled at 22050 Hz with 16 bits. The other is the Extended Ballroom dataset [27], an extended version of the Ballroom dataset [28]. The Extended Ballroom dataset we use for training and testing consists of 4180 excerpts of 30 seconds each.
Its audio quality is better than that of the Ballroom dataset, and 5 new genres of ballroom dance music are added: Foxtrot, Pasodoble, Salsa, Slowwaltz and Wcswing.

4.2 Experimental Setup

Dataset pre-processing

Deep neural networks need a large amount of input data to learn robust feature representations. However, the datasets used in our experiments contain only 1000 song excerpts and 4180 music tracks, respectively. In order to increase the number of tracks, we cut each song excerpt into shorter music clips of 3 seconds with 50% overlap. The enlarged training set partly helps our architecture avoid overfitting and improves feature extraction. Similar to the processing in [29][15], we calculate Fast Fourier Transforms (FFTs) on frames of length 1024 at a 22.05 kHz sampling rate with 50% overlap and use the absolute value of each FFT frame. We finally construct an STFT spectrogram with 128 frames, where each frame is a 513-dimensional vector.

4.3 Result

The music genre classification accuracy of the proposed PRCNN is reported in Table 1. For comparison, we also report other results on the GTZAN dataset presented in [15]. As shown in Table 1, we design our Bi-RNN block with 2-layer and 1-layer RNNs, respectively. Both configurations perform better than the other results reported on the same dataset. Nevertheless, overfitting can easily appear with 2-layer RNNs during feature learning on such a small dataset. Thus, we only use a 1-layer RNN in our Bi-RNN block to extract features from the spectrogram, and the results show that the Bi-RNN block with a 1-layer RNN achieves better performance than one with 2-layer RNNs.

Table 1: Genre classification results on the GTZAN dataset

Methods                                  Features                       Accuracy
CNN+2-layer RNN                          STFT                           88.8%
CNN+1-layer RNN                          STFT                           90.2%
nnet1                                    STFT                           84.8%
nnet2                                    STFT                           87.4%
KCNN(k=5)+SVM [30]                       Mel-spectrum, SFM, SCF         83.9%
DNN(ReLU+SGD+Dropout) [29]               FFT (aggregation)              83.0%
Multilayer invariant representation [31] STFT with log representation   82.0%

In order to validate the effectiveness of the additional paralleling RNN block, we design comparative experiments with other typical CNNs. All results in Table 2 are obtained on the GTZAN dataset. As can be seen, compared with using CNNs alone, every CNN combined with the paralleling RNN improves the performance of music genre classification.

Table 2: Improved performance with RNN for different CNNs

CNNs       Without RNN   With RNN
Our CNN    88.0%         92.0%
AlexNet    81.4%         88.8%
Vgg        %             88.7%
ResNet     %             87.6%
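The clip and frame counts described in the pre-processing step of Section 4.2 can be checked with a short calculation. The sketch below assumes that each excerpt is exactly 30 seconds long, that audio is sampled at 22050 Hz (the GTZAN rate given in Section 4.1), and that incomplete trailing windows are dropped rather than padded; the helper name `num_windows` is hypothetical.

```python
def num_windows(total, window, hop):
    """Number of full windows of length `window` taken every `hop` samples."""
    if total < window:
        return 0
    return 1 + (total - window) // hop


SAMPLE_RATE = 22050      # Hz, GTZAN sampling rate (Section 4.1)
EXCERPT_SECONDS = 30     # assumption: excerpts are exactly 30 s long
CLIP_SECONDS = 3         # clip length used for data augmentation (Section 4.2)
FFT_SIZE = 1024          # FFT frame length (Section 4.2)

clip_len = CLIP_SECONDS * SAMPLE_RATE
excerpt_len = EXCERPT_SECONDS * SAMPLE_RATE

# 50% overlap means the hop is half the window, for clips and for FFT frames alike.
clips_per_excerpt = num_windows(excerpt_len, clip_len, clip_len // 2)
frames_per_clip = num_windows(clip_len, FFT_SIZE, FFT_SIZE // 2)
bins_per_frame = FFT_SIZE // 2 + 1   # non-redundant bins of a real-valued FFT

print(clips_per_excerpt, frames_per_clip, bins_per_frame)
```

Under these assumptions a 3-second clip yields 128 frames of 513 magnitude values, matching the spectrogram size stated in Section 4.2, and each 30-second excerpt is cut into 19 overlapping clips.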

5 Conclusion

In this paper, we propose a hybrid architecture, PRCNN, to improve the performance of music genre classification. This end-to-end model consists of paralleling CNN and Bi-RNN blocks for feature extraction. The CNN block focuses on extracting spatial features from the spectrogram of the musical signal. In contrast, the Bi-RNN block is designed to model the temporal frame order. Furthermore, the bidirectional architecture lets the current state depend not only on previous information but also on the future context of the sequence during supervised learning. The outputs of the two paralleling blocks are fused into a more powerful feature vector for music classification. The experiments in this paper demonstrate the effectiveness of our hybrid architecture. Moreover, compared with using CNNs alone, the experimental results show that extracting the temporal frame order from musical signals with RNNs improves the performance of music genre classification.

References

[1] G. Tzanetakis and P. Cook, Musical genre classification of audio signals, IEEE Transactions on speech and audio processing, vol. 10, no. 5, pp ,
[2] J. Shawe-Taylor and A. Meng, An investigation of feature models for music genre classification using the support vector classifier,
[3] K. West and S. Cox, Finding an optimal segmentation for audio genre classification, in ISMIR, pp ,
[4] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification (2nd edition), En Broeck the Statistical Mechanics of Learning Rsity,
[5] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, A survey of audio-based music classification and annotation, IEEE transactions on multimedia, vol. 13, no. 2, pp ,
[6] B. K. Baniya, D. Ghimire, and J. Lee, A novel approach of automatic music genre classification based on timbral texture and rhythmic content features, in Advanced Communication Technology (ICACT), th International Conference on, pp , IEEE,
[7] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing, vol. 70, no. 1, pp ,
[8] L. Breiman, Bagging predictors, Machine learning, vol. 24, no. 2, pp ,
[9] A. F. Arabi and G. Lu, Enhanced polyphonic music genre classification using high level features, in Signal and Image Processing Applications (ICSIPA), 2009 IEEE International Conference on, pp , IEEE,
[10] R. Sarkar and S. K. Saha, Music genre classification using emd and pitch based feature, in Advances in Pattern Recognition (ICAPR), 2015 Eighth International Conference on, pp. 1-6, IEEE,
[11] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, Hcp: A flexible cnn framework for multi-label image classification, IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 9, pp ,
[12] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, Flexible, high performance convolutional neural networks for image classification, in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1237, Barcelona, Spain,
[13] S. Dieleman and B. Schrauwen, End-to-end learning for music audio, in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp , IEEE,
[14] T. L. Li, A. B. Chan, and A. Chun, Automatic musical pattern feature extraction using convolutional neural network, in Proc. Int. Conf. Data Mining and Applications,
[15] W. Zhang, W. Lei, X. Xu, and X. Xing, Improved music genre classification with convolutional neural networks, in INTERSPEECH, pp ,
[16] J. L. Elman, Finding structure in time, Cognitive science, vol. 14, no. 2, pp ,
[17] J. Pons, T. Lidy, and X. Serra, Experimenting with musically motivated convolutional neural networks, in Content-Based Multimedia Indexing (CBMI), th International Workshop on, pp. 1-6, IEEE,
[18] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol. 45, no. 11, pp ,
[19] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, Face recognition: A convolutional neural-network approach, IEEE transactions on neural networks, vol. 8, no. 1, pp ,
[20] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp ,
[21] K. Choi, G. Fazekas, M. Sandler, and K. Cho, Convolutional recurrent neural networks for music classification, arXiv preprint arXiv: ,

[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems, pp ,
[23] K. Choi, G. Fazekas, and M. Sandler, Automatic tagging using deep convolutional neural networks, arXiv preprint arXiv: ,
[24] V. Nair and G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in Proceedings of the 27th international conference on machine learning (ICML-10), pp ,
[25] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv: ,
[26] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol. 9, no. 8, pp ,
[27] U. Marchand and G. Peeters, The extended ballroom dataset,
[28] F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, Evaluating rhythmic descriptors for musical genre classification,
[29] S. Sigtia and S. Dixon, Improved music feature learning with deep neural networks, in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp , IEEE,
[30] P. Zhang, X. Zheng, W. Zhang, S. Li, S. Qian, W. He, S. Zhang, and Z. Wang, A deep neural network for modeling music, in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pp , ACM,
[31] C. Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco, and T. Poggio, A deep representation for invariance and music classification, in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp , IEEE,
