

Spec-ResNet: A General Audio Steganalysis Scheme Based on Deep Residual Network of Spectrogram

Yanzhen Ren, Member, IEEE, Dengkai Liu, Qiaochu Xiong, Jianming Fu, and Lina Wang

arXiv: v1 [cs.MM] 21 Jan 2019

Abstract: The widespread use of audio and video communication technology has put large volumes of compressed audio on the Internet and made it an important carrier for covert communication. Many steganographic schemes have emerged for mainstream compressed audio formats such as AAC and MP3, followed by many steganalysis schemes. However, these steganalysis schemes are effective only in one specific embedding domain. In this paper, a general steganalysis scheme, Spec-ResNet (Deep Residual Network of Spectrogram), is proposed to detect AAC and MP3 steganography schemes that operate in different embedding domains. The basic idea is that steganographic modifications in any embedding domain all change the decoded audio signal. The spectrogram, which is the visual representation of the frequency spectrum of an audio signal, is adopted as the input of the feature network to extract the universal features introduced by steganography schemes; the deep neural network Spec-ResNet is designed to represent the steganalysis features; and the features extracted from different spectrogram windows are combined to capture the steganalysis features fully. The experimental results show that the proposed scheme has good detection accuracy and generality. It achieves better detection accuracy for three different AAC steganographic schemes and for MP3Stego than state-of-the-art steganalysis schemes based on traditional hand-crafted or CNN-based features. To the best of our knowledge, this paper is the first to propose an audio steganalysis scheme based on the spectrogram and a deep residual network. The proposed method can be extended to steganalysis of other codecs and to audio forensics.
Index Terms: Audio steganalysis, spectrogram, deep residual network, feature representation.

I. INTRODUCTION

Steganography is the technique of concealing secret information in digital carriers such as text, image, audio, and video to build a covert communication channel. In many cases, steganography schemes are used by malicious software or malicious organizations. Steganalysis, the countermeasure that detects whether secret information is hidden in data that looks normal, has therefore been studied extensively.

With the widespread application of audio and video communication technologies, compressed audio is stored and transmitted ubiquitously on the Internet. Today, the most widely used audio compression standards are AAC (Advanced Audio Coding) and MP3 (MPEG-1 Audio Layer III). AAC plays a dominant role in Internet communication applications and achieves better sound quality than MP3 at a similar bit rate, so AAC is gradually replacing MP3. AAC is specified as part of both the MPEG-2 and MPEG-4 standards [1], so it is used in almost all mainstream audio applications. AAC is the default audio coding format for many mainstream HDTV standards, hardware devices, mobile phones, and Internet communication applications [2], such as DVB (Digital Video Broadcasting), iTunes, iPod, iPhone, PlayStation, YouTube, Twitter, Facebook, Youku, Tudou, Baidu Music, QQ Music, Himalayan, etc.

This work is supported by the Natural Science Foundation of China (NSFC) under the grant NO , NO. U , NO. U153624, and by the China Scholarship Council. Yanzhen Ren, Dengkai Liu, Qiaochu Xiong, Jianming Fu and Lina Wang are with the Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China (e-mail: renyz@whu.edu.cn; dengkailiu@whu.edu.cn; emmaxiong@whu.edu.cn; jmfu@whu.edu.cn; lnawang@163.com).
The ubiquitous storage and transmission of compressed audio provides a huge carrier channel for covert communication. Many steganography schemes based on AAC and MP3 have been proposed [3]-[15]. There are three main embedding domains in the compression parameters of AAC: Modified Discrete Cosine Transform (MDCT) coefficients [3], scale factors [4], [5], and Huffman coding [6]-[8]. The basic method of these schemes is to modify the compression parameters to hide the secret information. Wang et al. [3] proposed to embed secret information by modifying the small values of the quantized MDCT coefficients to achieve good imperceptibility. In [6], the least significant bits (LSB) of the escape sequences in AAC Huffman coding are modified with matrix encoding [16] to embed secret information. In [7], the sign bit of the corresponding MDCT coefficient is modified, which exploits the fact that the sign bits of MDCT coefficients are coded separately for codebooks that carry sign bits. In [8], Zhu et al. proposed to embed secret information by modifying the sections of AAC Huffman coding. In [9], Luo et al. proposed an adaptive audio steganography scheme in the time domain based on AAC and Syndrome-Trellis Coding (STC). Except for [9], all of the above schemes modify the compression parameters of AAC. Although the embedding domain differs from scheme to scheme, the MDCT coefficients of each AAC frame are eventually modified to some degree.

Many steganographic schemes have also emerged for MP3 [10]-[12]. Some MP3 steganographic tools were published early, such as MP3Stego [13], UnderMP3Cover [14], and MP3Stegz [15]. Among them, MP3Stego [13] is the most widely tested MP3 steganography scheme; it embeds the secret information by modifying the quantization step size in the encoding procedure.

With the development of steganography schemes for AAC and MP3, many steganalysis schemes have followed [17]-[25]. To detect steganography in the AAC Huffman coding domain, Ren et al. [17] proposed to extract the Markov transition probability of the codebooks of neighboring scale factor bands (SFB) as the steganalysis feature, which models the correlation between neighboring SFB codebooks. Calibration is used to improve the detection accuracy. To detect steganography in the AAC MDCT coefficient domain, Ren et al. [19] proposed to extract the Markov transition probabilities and joint probability densities of the first-order and second-order differential residuals of inter-frame and intra-frame MDCT coefficients as steganalysis features. The schemes in [17] and [19] perform well in their specific embedding domain, but their performance degrades in other embedding domains.

Many steganalysis schemes target MP3Stego [13]. Yan et al. [18] proposed to extract the difference between the quantization steps of neighboring MP3 frames as the steganalysis feature. Yan et al. [20] proposed to use the statistical inconsistencies in the distribution of the number of bits in the audio bit pool before and after recompression. Yu et al. [21] used recompression calibration and proposed a steganalysis scheme based on the statistical distribution of the big-value region. Wang et al. [25] used a Markov feature to capture the correlation of the quantized MDCT coefficients, which is destroyed by the MP3Stego embedding procedure. However, the features of all the schemes above are extracted from a specific embedding domain, so they perform well only for steganographic schemes of that same embedding domain.

Given the excellent results of Deep Neural Networks (DNN) in various research fields, DNNs have been used in many steganalysis schemes [26]-[38].
Among them, there are two audio steganalysis schemes based on DNN. In [32], a convolutional neural network (CNN) is designed to detect audio steganography in the time domain. This work shows that a CNN can improve the performance of audio steganalysis. However, the scheme is effective only for time-domain audio steganography and is ineffective for AAC and MP3 steganography. In [36], a CNN-based steganalysis scheme is proposed to detect MP3 steganography in the Huffman coding domain. The scheme has three characteristics: the quantized MDCT matrix is used as the input of the CNN; a high-pass filter is used to enhance the difference introduced by steganography; and several network optimization strategies are adopted. The experimental results show that the CNN-based scheme performs better than hand-crafted steganalysis features. However, the same problem remains: the performance of the scheme drops when it is used to detect steganography schemes of a different embedding domain.

The current results in the steganalysis research area indicate that deep neural networks can improve the detection accuracy of steganalysis [32], [35]-[37]. The key points are how to represent the steganalysis data more richly and accurately, and how to design a deep neural network that is more effective and better suited to steganalysis. This paper addresses these points to improve the generality of steganalysis for AAC and MP3. Although there are many embedding schemes in the compression parameters of AAC and MP3, such as MDCT coefficients [3], scale factors [4], [5], and Huffman coding [6]-[8], all of them change the MDCT values in different ways and eventually change the time-frequency correlation of the audio signal.
Therefore, if the steganalysis features are extracted from the time-frequency characteristics of the audio signal, the steganalysis scheme will be more general and can detect any steganographic scheme that changes those characteristics. Almost all steganographic schemes for AAC and MP3 compressed audio [10]-[12] belong to this category. Based on this idea, a steganalysis scheme, Spec-ResNet (Deep Residual Network of Spectrogram), is proposed in this paper. The proposed scheme makes three main contributions.

First, to extract the universal features introduced by steganography schemes based on perceptual audio coding, the spectrogram, which is the visual representation of the frequency spectrum of an audio signal, is adopted as the input of the feature network.

Second, to extract the classification features of steganography schemes more effectively, the deep residual network is used as the basic framework of Spec-ResNet. A deep residual network can use more layers without suffering from vanishing gradients, so it is suitable for extracting steganalysis features that are based on a weak signal.

Third, to capture the steganalysis features more completely, the features extracted from multiple spectrograms with different window sizes are combined, because the time-frequency relationship of the spectrogram differs with the window size. A simple SVM classifier is used to verify the performance improvement brought by feature combination.

The paper is organized as follows. The related background knowledge is introduced in Section II. In Section III, the proposed steganalysis scheme Spec-ResNet is described in detail. The experimental results follow in Section IV. Finally, the conclusion is drawn and future work is discussed in Section V.

II. PRELIMINARY

A. The principle of audio codec

Both AAC and MP3 follow the framework of perceptual audio coding, so their encoding principles are similar: MDCT transform, psychoacoustic model, quantization, and entropy coding. The improvements of AAC over MP3 include support for more sampling rates and channels, more efficient coding, and additional modules that increase compression efficiency. The framework of perceptual audio coding and the three main embedding domains are shown in Fig. 1.

Fig. 1: The procedure of perceptual audio coding and the three main embedding domains. (Blocks: audio signal, psychoacoustic model, analysis filter bank, quantization coding, entropy coding, coding stream. Embedding domains: MDCT coefficient, scale factor, Huffman coding.)

During perceptual audio coding, the psychoacoustic model is first used to remove the part of the sound signal that human hearing cannot perceive, which reduces the data rate. Then the input signal is divided into frames or blocks, and an analysis filter bank performs the time-frequency conversion. For each frequency-domain sub-band, the psychoacoustic model yields the corresponding masking threshold and maximum permissible distortion. Three kinds of loop quantization and entropy coding are performed on the transformed MDCT coefficients, and finally the entropy-coded audio stream is output.

B. The steganographic schemes of AAC and MP3

Because quantization in AAC and MP3 coding is lossy, a slight modification of the quantized integer MDCT coefficients does not noticeably decrease the audio quality. The secret message can therefore be embedded by fine-tuning the values of the compression parameters. There are three main embedding domains, shown in Fig. 1: MDCT coefficients, scale factors, and Huffman coding parameters. The basic idea of these steganography schemes is to modify the compression parameters slightly to hide the secret information. Of the many AAC and MP3 steganography schemes, four schemes from different embedding domains are introduced in this section.

1) LSB-EE: In [6], the least significant bits (LSB) of the escape sequences in AAC Huffman coding are modified with matrix encoding to embed secret information. In the AAC codec, the escape sequence is a special codebook used to encode MDCT coefficients over 16. Modifying the LSB of an escape sequence has little negative effect on audio quality.
In AAC compressed data, MDCT coefficients whose value is over 16 account for 5%-6% of all coefficients, and they lie mainly in the low-frequency portion, so this scheme changes the AAC audio signal almost exclusively in the low-frequency range.

2) MIN: In [3], Wang et al. propose a method that implements steganography in the small value region of the MDCT coefficients. The energy of audio is mainly concentrated in the low and middle frequency portion, whose MDCT coefficients are quantized with a smaller quantization step size, while a large quantization step size is used for the high-frequency portion to reduce the distortion as much as possible. For the high-frequency portion, the quantization range is larger and fewer coding bits are needed. The amplitude of the quantized coefficients is basically in {-1, 0, 1}, which is called the small value region of the MDCT coefficients. In the AAC encoding stage for quantized MDCT coefficients, the coefficients of the small value region are generally encoded by codebook 1 and codebook 2, and four quantized MDCT coefficients are grouped into one index value, index, computed as in (1).

$$\mathrm{index} = \sum_{i=0}^{3} 3^{i}\, q[i] + 40, \qquad q[i] \in \{-1, 0, 1\} \qquad (1)$$

The codeword is looked up in the codebook according to index. To further reduce the distortion of the AAC audio, the steganographic algorithm modifies only the last MDCT coefficient q[3] and calculates a new index to replace the original one. In AAC compressed data, the small MDCT coefficients are mainly in the low and middle frequency portion, so this scheme changes the AAC audio signal almost exclusively in the low and middle frequency range.

3) SIGN: In [7], Zhu et al. proposed to embed the message during the Huffman encoding process of the AAC coding stream. The MDCT coefficients are encoded with several codebooks. According to the number of bits after encoding, the encoder selects an optimal codebook to encode 2 or 4 MDCT coefficients as one Huffman code. If the selected codebook has sign bits, the sign bits of the non-zero MDCT coefficients are appended to the codeword. The coding stream structure corresponding to the codebooks with sign bits is shown in Table I and Table II. In codebooks 1 and 2 or codebooks 5 and 6, a Huffman code represents 4 or 2 MDCT coefficients respectively, and sign_w, sign_x, sign_y, sign_z are the sign bits of the MDCT coefficients w, x, y, z. sign = 0 indicates that the MDCT coefficient is negative, and sign = 1 indicates that it is positive.

TABLE I: Coding stream structure of codebooks 1 and 2.
Huffman code | sign_w | sign_x | sign_y | sign_z

TABLE II: Coding stream structure of codebooks 5 and 6.
Huffman code | sign_x | sign_y

With an appropriate threshold T, SIGN modifies the sign bits of MDCT coefficients in the range from -T to T to implement audio steganography. This subtle modification changes no parameter of the AAC encoding procedure except the sign bit of specific MDCT coefficients, so it introduces little distortion to audio quality.

4) MP3Stego: MP3Stego [13] is a public tool that hides information in MP3 files during the MP3 compression process. The MP3 encoder quantizes the MDCT coefficients in two nested loops, an inner loop and an outer loop. The outer loop checks the distortion of the audio signal to guarantee the audio quality. The inner loop selects a suitable scale factor to quantize the MDCT coefficients within the available number of bits. MP3Stego hides data in the inner loop: the part2-3-length variable, which records the number of main-data bits, is controlled through the selection of the scale factor so that the LSB of part2-3-length equals the secret bit.

All four steganographic schemes modify the compression parameters to hide information. Although the embedding domains differ, the final impact of every scheme is a change of the MDCT coefficients of the encoded audio signal in a different frequency range. Because the four schemes together cover almost the whole frequency range of audio signals, they are chosen as the target steganography schemes to verify the generalization ability of the proposed steganalysis scheme.

C. Spectrogram

The spectrogram is a basic visual representation of the frequency spectrum of an audio signal. It shows how the energy of different frequency bands changes over time and contains abundant time-frequency information about the audio signal. The spectrogram is therefore a good object of study for general audio steganalysis, because it can capture the features introduced by different audio steganography schemes. Fig. 2 shows the spectrogram of an audio segment together with its waveform. In the spectrogram, the horizontal axis represents time and the vertical axis represents frequency; the third dimension, represented by the color intensity of each point in the image, indicates the amplitude of that frequency at that time. To plot the spectrogram, the audio signal is divided into frames of a specific window size. The formants in the spectrogram express the acoustic parameters effectively, while its local characteristics represent additive noise very well, so it is feasible to capture the traces left by the embedding operation through the spectrogram.

Fig. 2: Waveform (amplitude versus time in seconds) and spectrogram (frequency in kHz versus time in seconds) of an audio segment.

III. THE ARCHITECTURE OF SPEC-RESNET

Based on the analysis above, a universal audio steganalysis scheme, Spec-ResNet, is proposed in this paper.
The spectrogram is adopted as the steganalysis object, a deep residual network is used to extract the steganalysis features, and spectrograms with several different windows are combined to extract richer features. The main architecture of the Spec-ResNet scheme is shown in Fig. 3. It consists of three main components: the Spectrogram Preprocess Module (SPM), the Deep Residual Steganalysis Network (S-ResNet), and the Classification Module (CM). SPM generates the spectrogram of the input audio signal for a specific window size and uses filters to preprocess it. S-ResNet is the feature network, based on the architecture of the deep residual network, that extracts the steganalysis features from the filtered spectrogram. CM is a traditional classification process that fuses the features from multiple spectrograms with different window sizes. Each module is described in detail below.

A. Spectrogram Preprocess Module

In this module, the training or test audio signals are represented as a spectrogram with a specific window size. Audio signals are highly correlated in both the time and frequency domains, so the spectrogram is well suited to analysing the changes in time and frequency correlation that are introduced by the steganography operation. The spectrogram of an audio signal contains three kinds of signal: the audio content, the noise of the real audio signal, and the noise introduced by the embedding operation. As in image steganalysis, the noise introduced by steganography is weak, so the noise component of the audio signal should be amplified to improve the signal-to-noise ratio (SNR) of the steganographic noise. In [39], Fridrich et al. proposed an image steganalysis scheme based on the spatial rich model (SRM) feature, in which multiple filters are introduced to capture the noise distribution characteristics of the tested image in different directions and dimensions.
Inspired by this work, the intra-frame and inter-frame correlation in the spectrograms of cover audio samples is analysed by linear regression, and multiple filters are designed to preprocess the spectrogram. The audio signals are AAC files of 2 s duration at 32 kHz, decoded to WAV files. The spectrogram dimension is n × m, where n is half of the window size because of the symmetry of the Fourier transform, and m is the total number of frames in one piece of audio with 50% overlap (e.g., if the window size N = 1024, the dimension of the spectrogram is 512 × 128). The multiple linear regression model is shown in (2), where y is the dependent variable, x_t, t = 1, 2, ..., s are the independent variables, and β_0 is the bias. For a 3 × 3 filter, the value of s is 8, and the correlation between the center point of the matrix and the 8 surrounding signal points is presented in Fig. 4.

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_t x_t + \cdots + \beta_s x_s \qquad (2)$$

Fig. 4: Result of the linear regression. The depth of the color or the absolute value of the regression coefficients represents the strength of the correlation.

In Fig. 4, the correlation between neighboring points is relatively strong in both the horizontal and vertical directions, especially in the vertical direction. In the spectrogram, the vertical correlation is the relevance of neighboring frequency bands in the same frame, while the horizontal direction describes the relation between successive frames in the same frequency band. Based on this analysis, a set of filters with different directions is chosen as the first convolutional layer of the deep residual network to extract the energy difference introduced by steganographic noise between cover and stego samples. In this work, four fixed-parameter filters are chosen to preprocess the spectrogram. The filter size is 3 × 3, and blanks are filled with zero, as shown in Fig. 5. In future work, the filter parameters can be trained by the network to improve the performance of the proposed scheme.

Fig. 3: The architecture of the proposed steganalysis scheme Spec-ResNet. (For each window size: original audio signal, spectrogram, filters with fixed parameters, deep residual network (S-ResNet); all branches feed the classification module (CM).)

Fig. 5: Filters based on the correlation of intra-frame and inter-frame.

B. Deep Residual Steganalysis Network

The spectrogram of the audio signal is filtered by the four filters. It is then processed by the initial convolutional layer with 10 output channels to extract the potential correlation in the feature matrix. The deep residual steganalysis network (S-ResNet) in this paper consists of 30 layers, with a shortcut across every two convolution layers. After that, a global average pooling layer is applied. The whole architecture of S-ResNet is shown in Fig. 6, and its details are described as follows.

1) Convolution layers: All convolution layers in this paper use 3 × 3 kernels, whether the filter parameters are fixed or not. The filters introduced in SPM are applied to weaken the influence of the audio content in the spectrogram. The remaining convolution layers, including the initial convolution layer after SPM, are used to capture the potential correlation inherent in the feature matrix. According to the number of channels, the convolution layers in the residual network are classified into three groups, shown in Fig. 7. The groups have 10, 20, and 40 channels respectively, and the stride is 1 in both the horizontal and vertical directions unless stated otherwise. Moreover, the number of residual units in each group is 5, which means 10 convolutional layers per group.
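The layer count implied by this grouping can be checked with a short sketch. It assumes channel widths of 10, 20 and 40, five residual units per group, two convolution layers per unit, and one initial convolution layer in front. The function name and the layer labels are hypothetical and are used only for illustration; they are not taken from the paper or its source code.

```python
def s_resnet_conv_plan(channels=(10, 20, 40), units_per_group=5, convs_per_unit=2):
    """List (layer_name, out_channels) for every convolution layer in the sketch."""
    # One initial convolution after the fixed preprocessing filters (assumed width).
    plan = [("init_conv", channels[0])]
    for group_index, width in enumerate(channels, start=1):
        for unit_index in range(1, units_per_group + 1):
            # Each residual unit wraps two convolution layers with one shortcut.
            for conv_index in range(1, convs_per_unit + 1):
                name = f"group{group_index}_unit{unit_index}_conv{conv_index}"
                plan.append((name, width))
    return plan


plan = s_resnet_conv_plan()

# Count the convolution layers per channel width, skipping the initial layer.
per_group = {}
for _name, width in plan[1:]:
    per_group[width] = per_group.get(width, 0) + 1

print(per_group)    # convolution layers in each group
print(len(plan))    # all convolution layers, including the initial one
print(plan[-1][1])  # channel width seen by global average pooling
```

Under these assumptions each group holds 10 convolution layers, the three groups together hold 30, and the last group's width equals the dimension of the feature produced by global average pooling.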

6 6 Audio Spectrogram Preprocess Module 3 3 Init conv, 1 BN ReLu, stride=2 in both directions 3 3 average pooling, stride=2 in both directions 3 3 average pooling 3 3 Conv, 1 BN ReLu 3 3 Conv, 1 + Global average pooling Classification Module Fig. 6: The structure of deep residual network S-ResNet.. x x BN ReLU Weight layer ReLU F(x) Weight layer ReLU x Conv, 3 3 channels, stride=1 both in horizonal and vertical direction if not specified explicitly. Fig. 7: Convolution layers with different architecture, in which 3 3 channels(channels = 1, 2, or 4) represents filter length filter width number of channels respectively. is with 1 channels, is with 2 channels, and is with 4 channels.. Weight layer H(x) ReLU Weight layer F(x)+x ReLU (a) Convolutional unit (b) Residual unit. Fig. 8: The architecture of Convolutional unit and Residual unit. 2) Residual unit: In the common sense, for the Convolutional Neural Networks(CNN), more convolutional layers means that the features extracted from different dimensions are richer and will achieve better classification ability. However, once the convolutional neural network reaches a certain depth, the increasing number of neural network layers will lead to the degradation of the performance due to the possible vanishing gradient problem. At this time, the training accuracy and test accuracy will be declining, and it will be hard to train an effective neural network, and the learning ability of the Deep Neural Network will be limited. Compared with classical CNN, residual network [4] is capable of integrating more convolutional layers without declining network performance. The model learning can be simplified to approximate an mapping function learning, which is described in Fig. 8. H (x) = x is the hypothesis model of cascaded convolutional layers. the work of CNN is trying to find a mapping function H (x) directly. However residual network is different due to the identity function x, it

7 7 try to find a residual function F (x) = H (x) x instead. In this way, residual network is transformed to approximate ideal hypothesis by learning the residual F (x) =, which is easier than approximating H (x) directly. Obviously, convolutional output is more sensitive to the change of input than before with identity function x introduced. In Fig.6, the arc across each two convolution layers represents the shortcut. The real arc indicates x has the same size as F (x) and the two tensors can be added together directly, while the dashed arc between and or and indicates that they have different size and x needs downsampling before the add operation. 3) Batch normalization layer: Batch normalization is proposed in [41] to accelerate deep network training. The distribution of the inputs before each hidden layer varies with each training iterations. Batch normalization performs normalization for each training iterations instead of setting lower learning rates at first or restricting parameter initialization, which needs more time to train the model. It is applied before each new convolution layer in Fig.6. 4) Activation function: Besides batch normalization, activation function is also applied after each batch normalization in S-ResNet. Nonlinear links between convolution layers have better expression ability than linear model, therefore ReLU (Rectified Linear Unit) is adopted as activation function in this paper. 5) Pooling layers: In order to reduce computational complexity, max pooling layer or average pooling layer are used in deep network to reduce feature dimension. 3 3 Average pooling layer is applied to S-ResNet between and or Between and with stride = 2 in both horizontal and vertical directions. There exists no pooling layer in shortcut across,, or to keep the dimension of x and F (x) consistent. A global average pooling is applied right before classification module (CM) to get 4-dimensional classification feature. C. 
Classification module

In existing CNN-based [32], [36] or RNN-based [42] steganalysis schemes, the classification module of the network is always a fully-connected layer with a 2-way softmax, shown in Fig. 9, which outputs the probabilities of the two class labels. Given a threshold ranging from 0 to 1 (0.5 by default), each tested sample is judged as cover or stego.

S-ResNet → 4-dimension feature → Fully connected layer → Softmax → Classification result
Fig. 9: Original CNN classification module.

In the short-time Fourier transform, spectrograms of different scales are generated when different window sizes are used. When the window size N is relatively small, a broad-band spectrogram is generated: the time bandwidth is narrow and the corresponding frequency bandwidth is broad. Conversely, a large window size yields a narrow-band spectrogram. The broad-band spectrogram reflects the time-varying characteristics of the spectrum well, but its frequency resolution is low, so it does not reflect the texture characteristics of the sound well. The narrow-band spectrogram resolves frequency so finely that the position of the formant is hard to identify properly. Because the characteristics of the audio signal differ across spectrogram feature maps of different window sizes, selecting a window size suitable for steganalysis is very important.

In this paper, spectrograms with different window sizes are considered together. For a specific audio sample, the spectrogram dimension changes with N, but the dimension of the classification feature after global average pooling is constant. In our experiments, three spectrograms with different windows are chosen to increase the diversity of features, and the three 4-dimensional features are concatenated into a 12-dimensional feature that serves as the final steganalysis feature. Fig. 10 shows the classification module used in this paper; the window sizes N are 1024, 512, and 256 for AAC. LibSVM [43] is used to train the final classification model on the 12-dimensional features extracted from the audio samples. In the detection stage, the 12-dimensional feature extracted from a test audio sample is used to judge whether it is cover or stego audio.

Spectrogram with N = 1024 (512*128) → Trained model-1 → 4-dimension feature
Spectrogram with N = 512 (256*256) → Trained model-2 → 4-dimension feature
Spectrogram with N = 256 (128*512) → Trained model-3 → 4-dimension feature
Fig. 10: The procedure of combining three 4-dimension features into the 12-dimension classification feature.

IV. EXPERIMENTS

To evaluate the performance of the proposed scheme Spec-ResNet, three main experiments have been carried out. The first experiment evaluates the detection accuracy of the proposed scheme for three AAC steganography schemes of different embedding domains; spectrograms with three different windows are used to assess the influence of the window size on detection performance. The second experiment compares Spec-ResNet with existing steganalysis schemes. The third experiment detects MP3Stego, which operates on the MP3 codec and whose embedding domain is the scale factor.
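As a rough illustration of how the window size N trades time resolution for frequency resolution in the spectrograms described for the classification module, the sketch below computes magnitude spectrograms of one signal with N = 1024, 512, and 256. The Hann window, the 50% overlap, the random stand-in signal, and the helper name `magnitude_spectrogram` are assumptions made for this example; the framing that produces the exact shapes quoted in Fig. 10 is not given in the text, so the shapes here follow the same trend without matching them exactly.

```python
import numpy as np

def magnitude_spectrogram(signal, n_win, hop=None):
    """Magnitude STFT of a 1-D signal; returns shape (n_win // 2 + 1, n_frames)."""
    hop = hop or n_win // 2            # assumption: 50% overlap between frames
    window = np.hanning(n_win)         # assumption: Hann analysis window
    n_frames = 1 + (len(signal) - n_win) // hop
    frames = np.stack([signal[i * hop:i * hop + n_win] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies: n_win // 2 + 1 bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T

rng = np.random.default_rng(0)
audio = rng.standard_normal(64000)     # hypothetical stand-in for a decoded clip

# Larger windows give more frequency bins and fewer time frames
shapes = {n: magnitude_spectrogram(audio, n).shape for n in (1024, 512, 256)}
for n, shape in shapes.items():
    print(f"N = {n}: {shape[0]} frequency bins x {shape[1]} frames")
```

Halving N roughly halves the number of frequency bins and doubles the number of frames, which is why each window size needs its own trained model before the features can be combined.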

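The feature fusion described for the classification module can be sketched as follows: global average pooling turns feature maps of any spatial size into a vector whose length equals the number of channels, so the three per-window vectors can be concatenated. The feature-map sizes and random values below are hypothetical placeholders for the outputs of the three trained models, and the final LibSVM training step is left out.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over its spatial axes: (C, H, W) -> (C,)."""
    return feature_maps.mean(axis=(1, 2))

rng = np.random.default_rng(0)

# Hypothetical final-layer outputs of the three models (4 channels each);
# the spatial sizes differ because the input spectrograms differ in shape.
maps = [
    rng.standard_normal((4, 16, 4)),   # stand-in for the N = 1024 model
    rng.standard_normal((4, 8, 8)),    # stand-in for the N = 512 model
    rng.standard_normal((4, 4, 16)),   # stand-in for the N = 256 model
]

pooled = [global_average_pool(m) for m in maps]   # three 4-dimensional vectors
fused = np.concatenate(pooled)                    # one 12-dimensional vector
print(fused.shape)
```

The fused vector is what a separate classifier such as an SVM would receive, one vector per audio sample.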
A. Experiment setup

1) Data set: To the best of our knowledge, there is still no public audio data set for AAC or MP3 steganalysis, so we constructed the audio data set ourselves. 186 pieces of music were downloaded from the Internet, covering different musical styles (such as jazz, rock, country, and pop), different languages (mainly Chinese and English), different genders, and so on. The music samples were cut into clips to construct the AAC audio data set and the MP3 audio data set. All the audio data sets of the experiments described below, together with the source code of the proposed scheme, will be made public on GitHub. Specifically, this project is based on TensorFlow, CUDA, cuDNN, and Python. In addition, all parameters are trained with the Adam optimizer: the mini-batch size is 32, which means 16 pairs of cover and stego audio samples, the learning rate decay is 0.9, and weight decay is applied.

a) AAC audio data set: The 10,000 audio samples with 16 kHz, 16 bit and 2s duration are used to construct 4 different audio sample sets: one cover sample set (CDB AAC) and three stego sample sets (SDB AAC LSB, SDB AAC MIN, SDB AAC SIGN). The four data sets are constructed as follows.

CDB AAC: The WAV audio samples are encoded at 32 kbps by the open-source audio encoder FAAC [44], then decoded to WAV by FAAD2. The decoded audio samples are 32 kHz with 2 channels, and the total number of cover samples in CDB AAC is 10,000.

SDB AAC LSB: The 10,000 audio samples (.m4a format) are generated by the LSB-EE steganographic scheme [6], and the secret message is embedded with a relative embedding rate (EBR) of 10%, 20%, 30%, 50%, and 100% respectively. The total number of stego samples (.wav format) in SDB AAC LSB is 10,000 × 5 = 50,000 after the compressed audio bitstreams are decoded by FAAD2.

SDB AAC MIN: The 10,000 audio samples (.m4a format) are generated by the MIN steganographic scheme [3], and the secret message is embedded with an EBR of 10%, 20%, 30%, 50%, and 100% respectively.
The total number of stego samples (.wav format) in SDB AAC MIN is 10,000 × 5 = 50,000 after the compressed audio samples are decoded by FAAD2.

SDB AAC SIGN: The 10,000 audio samples (.m4a format) are generated by the SIGN steganographic scheme [7], and the secret information is embedded with an EBR of 10%, 20%, 30%, 50%, and 100% respectively. The total number of stego samples (.wav format) in SDB AAC SIGN is 10,000 × 5 = 50,000 after the compressed audio samples are decoded by FAAD2.

b) MP3 audio data sets: The 9, audio samples with 16 bit, 44.1 kHz, and 5s segments are used to construct 2 different audio data sets: a cover sample set (CDB MP3) and a stego sample set (SDB MP3). The two data sets are constructed as follows.

CDB MP3: The WAV audio samples are encoded at 128 kbps and decoded by the public MP3 codec. The decoded audio samples (.wav format) are 44.1 kHz, and the total number of cover samples in CDB MP3 is 9,.

SDB MP3: The WAV audio samples are encoded by MP3Stego [13]. Since the maximum embedding capacity of MP3Stego is roughly 2 bits per frame, and a 5s audio sample into which a 3 Bytes message is embedded has 191 frames, the EBR is nearly 6%. The total number of stego samples in SDB MP3 is also 9,.

2) Metrics: To evaluate the detection accuracy of the proposed scheme and the compared schemes, TPR (True Positive Rate), TNR (True Negative Rate) and ACC (Classification Accuracy) are adopted as metrics. TPR is the proportion of stego samples detected as stego, and TNR is the proportion of cover samples detected as cover. ACC is the detection accuracy over all cover and stego samples. When the numbers of cover samples and stego samples are equal, ACC is the average of TPR and TNR.

B. Experiments and Results

1) Experiment I: This experiment evaluates the detection accuracy of the proposed scheme. Three AAC steganography schemes with different embedding domains are detected.
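The three metrics defined under 2) Metrics can be computed as in the minimal sketch below. The label convention (1 for stego, 0 for cover), the function name `detection_metrics`, and the toy label lists are assumptions made for illustration; they are not taken from the experiments.

```python
def detection_metrics(y_true, y_pred):
    """Return (TPR, TNR, ACC); label 1 means stego, label 0 means cover."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_stego = sum(t == 1 for t in y_true)
    n_cover = len(y_true) - n_stego
    return tp / n_stego, tn / n_cover, (tp + tn) / len(y_true)

# Toy example with equal numbers of stego and cover samples (made-up labels)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

tpr, tnr, acc = detection_metrics(y_true, y_pred)
print(tpr, tnr, acc)
```

With balanced classes, as in this toy example, ACC equals the mean of TPR and TNR; with unbalanced classes it is weighted by the class sizes instead.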
Spectrograms with three different window sizes, N = 256, 512, or 1,024, are extracted from 6,000 audio samples in CDB AAC, SDB AAC LSB, SDB AAC MIN, and SDB AAC SIGN respectively. In the training procedure, the spectrograms extracted from 3,000 audio samples in CDB AAC are used as the cover input signal, and the spectrograms of 3,000 audio samples with 0.3 EBR in SDB AAC LSB, SDB AAC MIN and SDB AAC SIGN are selected as the stego input signal. For a specific window size, the train set contains the 6,000 feature matrices described above (the dimension of the spectrogram feature matrix is 128*512 for N = 256, 256*256 for N = 512, and 512*128 for N = 1024). The composition of the test set is different. For cover samples, the spectrograms of the remaining 3,000 audio samples in CDB AAC are selected, and the spectrograms of the corresponding 3,000 audio samples generated by the different schemes at different EBRs (200 samples for each algorithm at each EBR) are selected as the stego samples. Three models are trained, one for each spectrogram window size.

The results of Experiment I are shown in TABLE III. N = 256, 512, or 1024 denotes the spectrogram window size. EA is the embedding scheme. ACC is the average detection accuracy of the trained model on cover or stego samples generated by a specific steganographic algorithm across multiple EBRs, while AVERAGE is the average detection accuracy on the cover or stego samples of a specific EBR across multiple schemes. The results show that the classification models trained on spectrograms of different window sizes perform differently. The average accuracy for cover samples is above 83.89% for all models. When N = 512, the accuracy for cover samples reaches 93.47%, and the average detection accuracy for stego samples with 1% EBR is above 82.6%. When N = 1024, the accuracy reaches 89.5%. This shows that the proposed scheme Spec-ResNet can detect multiple steganographic schemes of different embedding domains with good performance.

2) Experiment II: This experiment compares the detection accuracy of the proposed scheme with the existing steganalysis schemes. Although the spectrogram feature matrices for different window sizes have different shapes, the length of the feature after the global average pooling layer is constant, namely 4. For each window size, a classification model is trained on the corresponding spectrograms and saved. The three 4-dimensional features produced by the three trained classification models are then merged into a 12-dimensional feature. An SVM is used to train the final classification model, which thus takes spectrogram features of different scales into account. The composition of the train set and test set is the same as in Experiment I.

The detection performance of Spec-ResNet is compared with Ren et al.'s scheme [19], Luo et al.'s scheme [32], and Zhao et al.'s scheme [36]. Ren et al.'s scheme [19] introduces a classical rich model feature to fully analyze the Markov transition probabilities and joint probability densities of the first-order differentials and second-order differential residuals of inter-frame and intra-frame MDCT coefficients. In Luo et al.'s scheme [32], the classifier is trained with a seven-layer convolutional neural network on the original audio signal. In contrast, [36] adopts a CNN to process the matrix of quantized MDCT coefficients; in our experiments this CNN is replaced with the S-ResNet described above. Audio samples with a duration of 2s are adopted for performance testing in this experiment: 8,000 audio samples, including 4,000 cover samples and 4,000 stego samples, are used for training, and the remaining 12,000 audio samples are used for testing. The results of Experiment II are shown in TABLE IV.
Method denotes the steganalysis scheme and S len is the sample length. EA, AVERAGE and ACC have the same meaning as in TABLE III. TABLE IV shows that the detection performance of Spec-ResNet is better than that of each of the three single-window models in TABLE III, which indicates that combining the features of spectrograms with different windows improves the performance of Spec-ResNet. The average detection accuracy of Spec-ResNet is also higher than that of the existing schemes, including the traditional hand-crafted feature [19] and the CNN-based schemes [32], [36]. In [19], the length of the audio samples is 2 seconds, but Spec-ResNet still outperforms it with 2s audio samples. Fig. 11 is the graphical representation of TABLE IV.

3) Experiment III: To evaluate the detection performance of Spec-ResNet for MP3, MP3Stego is chosen as the detection target. The training procedure is the same as in Experiment I. The audio duration is 5s and, since the samples in the MP3 audio data set are 44.1 kHz, the three window sizes are N = 512, 1024, or 2048. The train set consists of the spectrograms of 15 audio samples in CDB MP3 and 15 audio samples in SDB MP3, and the test set has the same composition. There is no intersection between the train set and the test set.

The performance of Spec-ResNet is compared with existing steganalysis schemes. Wang et al.'s method [25] designs a Markov feature that captures the correlation of the quantized MDCT coefficients (referred to as QMDCTs), which is destroyed by MP3Stego's embedding behaviour. Luo et al.'s scheme [32] is also used to detect MP3Stego with an audio duration of 5 seconds; 8 audio samples are chosen to train the CNN model, and 2 audio samples are used to test the trained model.
The detection performance of the proposed scheme Spec-ResNet on MP3Stego is compared with the existing schemes proposed in [25], [32]. The results are presented in TABLE V. Mixture denotes the classification model trained on the three spectrograms with different window sizes, and S len & EBR is the sample length and relative embedding rate. The mixture model performs better than the models trained on a spectrogram of a single window size. The detection accuracy (ACC) of the proposed scheme, shown in bold, is higher than that of the existing steganalysis schemes based on a traditional hand-crafted feature [25] or a deep neural network [32]. This means that the proposed steganalysis scheme Spec-ResNet can also be applied to other audio coding standards, such as MP3, with good detection accuracy.

V. CONCLUSION

In this paper, a general steganalysis scheme, Spec-ResNet, based on the spectrogram and a deep residual network is proposed. It can detect steganography schemes of different embedding domains for AAC and MP3. Three conclusions can be drawn from this work: 1) The spectrogram, which represents the spectral relationship in the time series of the audio signal, is well suited as the analysis signal for an audio steganalysis scheme to obtain more universal features. It can be extended to other audio analysis areas, such as audio forensics, to improve classification performance; 2) The deep residual network, which addresses the vanishing gradient problem, is suitable for extracting steganalysis features that are based on weak signal changes; 3) Fusing training features from different scales improves classification accuracy. The experimental results show that the proposed scheme has good detection accuracy and generality. It achieves better detection accuracy than the existing steganalysis schemes for the three main AAC embedding domains (MDCT, scale factor and Huffman coding) and for MP3Stego.
To the best of our knowledge, this paper is the first to propose an audio steganalysis scheme based on the spectrogram and a deep residual network. The method can also be applied to other kinds of audio classification work.

REFERENCES

[1] ISO/IEC, Information technology: coding of audio-visual objects, part 3: audio, 2004.
[2] (2018) Advanced Audio Coding. [Online]. Available: wikipedia.org/wiki/advanced Audio Coding
[3] Y.-J. Wang, L. Guo, and C.-P. Wang, Steganography method for advanced audio coding, Journal of Chinese Computer Systems, vol. 32, no. 7, 2011.

TABLE III: The detecting performance of spectrogram with different window size.
Columns: Spectrogram | EA | TNR (Cover) | TPR (EBR 10%, 20%, 30%, 50%, 100%) | ACC
Row groups: N = 1024, N = 512, N = 256
Rows in each group: LSB-EE [6], MIN [3], SIGN [7], AVERAGE

TABLE IV: The performance of Spec-ResNet compared with the existing steganalysis schemes.
Columns: Method | S len | EA | TNR (Cover) | TPR (EBR 10%, 20%, 30%, 50%, 100%) | ACC
Row groups: Spec-ResNet, Ren et al. [19], Zhao et al. [36], Luo et al. [32]
Rows in each group: LSB-EE [6], MIN [3], SIGN [7], AVERAGE

Fig. 11: The performance of Spec-ResNet compared with existing steganalysis scheme. Panels: (a) detecting accuracy for LSB-EE [6], (b) detecting accuracy for MIN [3], (c) detecting accuracy for SIGN [7]. Axes: relative embedding rate (%) against detection accuracy. Curves: proposed, Ren et al., Zhao et al., Luo et al. (all 2s).

TABLE V: The performance to detect MP3Stego [13].
Columns: Algorithm | S len & EBR | TNR | TPR | ACC
Spec-ResNet: N = 512 | 5s, 6%
Spec-ResNet: N = 1024 | 5s, 6%
Spec-ResNet: N = 2048 | 5s, 6%
Spec-ResNet: Mixture | 5s, 6%
Luo et al. [32] | 5s, 6%
Wang et al. [25] | 1s, 1% | .925

[4] S. Xu, P. Zhang, P. Wang, and H. Yang, Performance analysis of data hiding in mpeg-4 aac audio, Tsinghua Science and Technology, vol. 14, no. 1, 2009.
[5] Y. Wei, L. Guo, and Y. Wang, Controlling bitrate steganography on aac audio, in Image and Signal Processing (CISP), 2010 3rd International Congress on, vol. 9. IEEE, 2010.
[6] Y. Wang, L. Guo, Y. Wei, and C. Wang, A steganography method for aac audio based on escape sequences, in 2010 International Conference on Multimedia Information Networking and Security. IEEE, 2010.
[7] J. Zhu, R. Wang, and D. Yan, The sign bits of huffman codeword-based steganography for aac audio, in Multimedia Technology (ICMT), 2010 International Conference on. IEEE, 2010.
[8] J. Zhu, R.-D. Wang, J. Li, and D.-Q. Yan, A huffman coding section-based steganography for aac audio, Information Technology Journal, vol. 1, no. 1, 2011.
[9] W. Luo, Z. Yue, and H. Li, Adaptive audio steganography based on advanced audio coding and syndrome-trellis coding, 2017.
[10] M. Bazyar and R. Sudirman, A recent review of mp3 based steganography methods, International Journal of Security and Its Applications, vol. 8, no. 6, 2014.
[11] K. Yang, X. Yi, X. Zhao, and L. Zhou, Adaptive mp3 steganography using equal length entropy codes substitution, in International Workshop on Digital Watermarking. Springer, 2017.
[12] R. Zhang, J. Liu, and F. Zhu, A steganography algorithm based on mp3 linbits bit of huffman codeword, in International Conference on Intelligent Information Hiding and Multimedia Signal Processing. Springer, 2017.
[13] F. Petitcolas, mp3stego, cl. cam. ac. uk/ fapp2/steganography/mp3stego/index. html,
[14] C. Platt, Undermp3cover, 2004.
[15] Z. Achmad, Mp3stegz, 2008.
[16] J. Fridrich and D. Soukal, Matrix embedding for large payloads, IEEE Transactions on Information Forensics & Security, vol. 1, no. 3, 2006.
[17] Y. Ren, Q. Xiong, and L. Wang, Steganalysis of aac using calibrated markov model of adjacent codebook, in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
[18] D. Yan, R. Wang, X. Yu, and J. Zhu, Steganalysis for mp3stego using differential statistics of quantization step, Digital Signal Processing, vol. 23, no. 4, 2013.
[19] Y. Ren, Q. Xiong, and L. Wang, A steganalysis scheme for aac audio based on mdct difference between intra and inter frame, in International Workshop on Digital Watermarking. Springer, 2017.
[20] D. Yan and R. Wang, Detection of mp3stego exploiting recompression calibration-based feature, Multimedia Tools and Applications, vol. 72, no. 1, 2014.
[21] X. Yu, R. Wang, and D. Yan, Detecting mp3stego using calibrated side information features, Journal of Software, vol. 8, no. 1, 2013.
[22] M. Qiao, A. H. Sung, and Q. Liu, Mp3 audio steganalysis, Information Sciences, vol. 231, 2013.
[23] R. Kuriakose and P. Premalatha, A novel method for mp3 steganalysis, in Intelligent Computing, Communication and Devices. Springer, 2015.
[24] C. Jin, R. Wang, D. Yan, P. Ma, and K. Yang, A novel detection scheme for mp3stego with low payload, in IEEE China Summit & International Conference on Signal and Information Processing, 2014.
[25] C. Jin, R. Wang, and D. Yan, Steganalysis of mp3stego with low embedding-rate using markov feature, Multimedia Tools and Applications, vol. 76, no. 5, 2017.
[26] V. Holub and J. Fridrich, Random projections of residuals for digital image steganalysis, IEEE Transactions on Information Forensics & Security, vol. 8, no. 12, 2013.
[27] S. Tan and B. Li, Stacked convolutional auto-encoders for steganalysis of digital images, in Signal and Information Processing Association Summit and Conference, 2014.
[28] Y. Qian, J. Dong, W. Wang, and T. Tan, Deep learning for steganalysis via convolutional neural networks, in Media Watermarking, Security, and Forensics 2015. International Society for Optics and Photonics, 2015, p. 94090J.
[29] G. Xu, H. Z. Wu, and Y. Q. Shi, Structural design of convolutional neural networks for steganalysis, IEEE Signal Processing Letters, vol. 23, no. 5, 2016.
[30] Y. Qian, J. Dong, W. Wang, and T. Tan, Learning and transferring representations for image steganalysis using convolutional neural network, in Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 2016.
[31] G. Xu, H. Z. Wu, and Y. Q. Shi, Ensemble of cnns for steganalysis: An empirical study, in ACM Workshop on Information Hiding and Multimedia Security, 2016.
[32] B. Chen, W. Luo, and H. Li, Audio steganalysis with convolutional neural network, in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 2017.
[33] V. Sedighi and J. Fridrich, Histogram layer, moving convolutional neural networks towards feature-based steganalysis, Electronic Imaging, vol. 2017, no. 7, pp. 50-55, 2017.

Fig. 1: The perceptual audio coding procedure and three main embedding domains.

Spec-ResNet: A General Audio Steganalysis Scheme based on a Deep Residual Network for Spectrograms. Yanzhen Ren, Member, IEEE, Dengkai Liu, Qiaochu Xiong, Jianming Fu, and Lina Wang. arXiv:1901.06838v2 [cs.MM]