Identifying Compression History of Wave Audio and Its Applications

DA LUO, WEIQI LUO, and RUI YANG, Sun Yat-sen University
JIWU HUANG, Shenzhen University

An audio signal is sometimes stored and/or processed in WAV (waveform) format without any knowledge of its previous compression operations. To perform some subsequent processing, such as digital audio forensics, audio enhancement and blind audio quality assessment, it is necessary to identify its compression history. In this article, we investigate how to identify a decompressed wave audio that went through one of three popular compression schemes: MP3, WMA (Windows Media Audio) and AAC (Advanced Audio Coding). By analyzing the corresponding frequency coefficients, including modified discrete cosine transform (MDCT) coefficients and Mel-frequency cepstral coefficients (MFCCs), of original audio clips and their decompressed versions with different compression schemes and bit rates, we propose several statistics to identify the compression scheme as well as the corresponding bit rate previously used for a given WAV signal. The experimental results evaluated on 8,800 audio clips with various contents have shown the effectiveness of the proposed method. In addition, some potential applications of the proposed method are discussed.

Categories and Subject Descriptors: K.6.5 [Management of Computing and Information Systems]: Security and Protection - Authentication; I.5.4 [Pattern Recognition]: Applications - Waveform analysis; H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing - Signal analysis, synthesis, and processing

General Terms: Security

Additional Key Words and Phrases: Audio compression history identification, mel-frequency cepstral coefficients, modified discrete cosine transform

ACM Reference Format: Da Luo, Weiqi Luo, Rui Yang, and Jiwu Huang. 2014. Identifying compression history of wave audio and its applications. ACM Trans. Multimedia Comput. Commun. Appl. 10, 3, Article 30 (April 2014), 19 pages. DOI: http://dx.doi.org/10.1145/2575978

A part of this work was presented at IEEE ICASSP '12. This work is supported in part by National Science & Technology Pillar Program (No:2012BAK16B06), NSFC (U1135001, 61332012, 61272191, 61202497), the funding of Zhujiang Science and Technology (2011J2200091) and the Guangdong NSF (S2013010012039).

Authors' addresses: D. Luo, R. Yang, School of Information Science Technology, Sun Yat-sen University, Guangzhou 510006, China; email: is04ld@mail2.sysu.edu.cn, yrui@mail2.sysu.edu.cn; W. Luo (corresponding author), School of Software, Sun Yat-sen University, Guangzhou 510006, China; email: luoweiqi@mail.sysu.edu.cn; J. Huang, College of Information Engineering, Shenzhen University, Shenzhen 518060, China; email: jwhuang@szu.edu.cn.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2014 ACM 1551-6857/2014/04-ART30 $15.00 DOI: http://dx.doi.org/10.1145/2575978

1. INTRODUCTION

WAV, a popular format for raw and typically uncompressed audio, preserves the original waveform information well and is therefore widely used in many applications. Usually, the

WAV audio comes from two sources: direct recording, and decompression from a compressed format. With audio editing software, it is convenient to obtain decompressed audio. However, this may cause some social problems. For example, forgers can create a fake audio/speech clip in this way: they decompress a compressed audio/speech clip, perform splicing, and then re-save the result in a consistent WAV format. Such operations are very easy to achieve with audio editing software such as CoolEdit and GoldWave. How to identify such forged audio has not yet been solved. As another example, someone can obtain various compressed audio clips of lower quality from the Internet, and then decompress and re-save them at higher bit rates or in a lossless format for distribution to seek more commercial benefit [Yang et al. 2009]. Since we cannot obtain any information about the previous compression operations, such as the type of compression scheme and the bit rate, from the file header of a WAV audio, identifying the compression history of wave audio becomes a very important issue for exposing those fake audio clips, locating spliced segments, and evaluating the quality of audio blindly. Furthermore, compression history estimation can provide useful information for other subsequent processing, such as audio enhancement and audio transcoding.

Up to now, some works related to compression history estimation for media have been reported for digital image and video. Fan and de Queiroz [2003] first tried to expose JPEG decompressed images and further estimate their quantization tables with maximum likelihood estimation. Based on Benford's law (first digit law) on the DCT coefficients, Fu et al. [2007] proposed a method for estimating the JPEG quantization table and detecting double JPEG compressed images. Lukáš and Fridrich [2003] proposed a method to estimate the primary quantization matrix in double compressed JPEG images. Luo et al.
proposed a method to estimate the JPEG compression history from bitmaps based on compression error analysis [Luo et al. 2010], and also proposed a method for identifying the image source encoder based on quantization artifacts [Luo et al. 2010]. For video forensics, Bestagini et al. [2012] proposed a method to identify the type of video codec previously used by analyzing its coding-based footprints. Tagliasacchi and Tubaro [2010] proposed a method for blindly estimating the quantization parameters in H.264/AVC decoded video.

There has been some research in audio forensics, such as acoustic reverberation detection [Malik and Farid 2010] and microphone/environment classification [Kraetzer et al. 2007]. However, only a few related works have been proposed for detecting digital audio compression history. Yang et al. [2009] tried to expose fake-quality MP3 audio clips that have been recompressed at higher bit rates (up-transcoding). Based on double quantization artifacts, Qiao et al. [2010] and Liu et al. [2010] proposed methods for detecting MP3 recompressed clips for both up-transcoding and down-transcoding cases. Bianchi et al. [2013], Chen et al. [2012], and Yang et al. [2010] detected double compressed MP3 by measuring the quantized MDCT coefficients. Hiçsönmez et al. [2013] introduced a method that could discriminate between single and double compressed audio and identify the codec and bit rate. Jenner and Kwasinski [2012] proposed a method for identifying several speech codecs. Hiçsönmez and Avcibas [2011] proposed a method for audio codec identification through payload sampling. In our previous work [Luo et al. 2012], we analyzed the quantization artifacts in the MDCT domain and proposed a 21-dimension feature to measure the artifacts. The preliminary results have shown its effectiveness for identifying MP3 decompressed audio clips.
However, the performance for detecting the bit rates of WMA decompressed audio clips is still far from satisfactory. Furthermore, the effectiveness for identifying AAC audio was not evaluated in our previous work.

In this article, we further investigate the problem of identifying audio compression history; namely, we aim to identify the compression scheme and the bit rate previously used for a decompressed audio clip. It is well known that quantization is one of the necessary operations in various lossy media compression schemes, including audio, image and video, and that it introduces artifacts in the corresponding frequency domain of the resulting media, for instance, the DCT domain for JPEG

images and the wavelet domain for JPEG 2000 images. Typically, the more severe the compression, the more quantization artifacts are present. By analyzing such artifacts, it is possible to distinguish decompressed audio clips from uncompressed ones, and further estimate their compression schemes and parameters. Therefore, how to detect and measure the quantization artifacts is a crucial issue. Our previous work [Luo et al. 2012] utilized the MDCT coefficients to recognize the quantization artifacts introduced during compression. For further improvement, we introduce and analyze another important feature in this article, namely the Mel-frequency cepstral coefficients (MFCCs), which have been successfully used in speech recognition and speaker identification [Reynolds et al. 2000].

The remainder of this article is organized as follows. Section 2 proposes and analyzes some statistics from the MDCT coefficients and MFCCs. Section 3 shows the experimental results and analyses. Section 4 discusses some potential applications of the proposed method, and finally the concluding remarks are given in Section 5.

2. PROPOSED METHOD

As illustrated in Figure 1, a given WAV audio signal may have been previously compressed with some compression scheme at a given bit rate. The proposed method aims to identify its compression history. In this article, three popular compression schemes in digital audio, that is, MP3, WMA and AAC, are investigated. In the following, we first give a brief overview of general lossy audio compression, and then propose some features for compression history estimation.

Fig. 1. Illustration of decompressed WAV audio signal.

Fig. 2. Block diagram of the general lossy compression for digital audio.

2.1 Overview of Lossy Audio Compression

To reduce the storage of digital audio, most popular audio compression schemes, such as MP3, WMA and AAC, are lossy.
Figure 2 shows the general lossy compression system for digital audio [Pan 1995; Painter and Spanias 2000]. Usually, the input audio signal is first divided into many frames in the temporal domain, which are then converted into the frequency domain with some transform. Based on the psychoacoustic model, human ears are not sensitive to high-frequency components, and these components are removed via quantization. Finally, the resulting quantized coefficients are further encoded into a bitstream. In the following, we take MP3 compression as an example.

In MP3 audio compression [MP3Standard; Hacker 2000], the audio signal is first divided into half-overlapping frames of 1152 samples, and then each frame is fed to the MP3 encoder.

The frame is separated into 32 subbands with the analysis filterbank. The modified discrete cosine transform (MDCT) is performed in each subband, yielding 18 frequency coefficients per subband, so that 576 frequency coefficients are obtained for each frame. Based on the properties of the HAS (human auditory system), the psychoacoustic model is then used to analyze the resulting coefficients and to derive the masking thresholds used in the succeeding quantization operation. In order to obtain a trade-off between the bit rate and the quality distortion of the compressed audio, quantization is necessary to remove some of the less audible components, and thus quantization artifacts are introduced at this stage. Finally, the quantized coefficients are further encoded using lossless coding to obtain a bitstream. The decoder works in a reverse manner. In the following, we analyze two types of frequency coefficients (i.e., MDCT and MFCC) after quantization.

2.2 Feature Extraction

Based on extensive experiments, we found that different compression schemes and/or compression bit rates significantly affect the quantization artifacts in different frequency domains. How to detect and measure the quantization artifacts is the key issue in our method. To this end, we analyze some statistics on the frequency coefficients of decompressed audio clips. Two different types of frequency coefficients are studied in this article: MDCT coefficients and MFCCs. In the following, we briefly describe how to derive the two types of frequency coefficients from a WAV signal, and then analyze some statistics from the corresponding coefficients for identifying compression history.

Modified Discrete Cosine Transform Coefficients. MDCT is a Fourier-related transform based on the type-IV discrete cosine transform (DCT-IV) [Princen et al. 1987].
It is widely used in most modern lossy audio compression schemes, including MP3, WMA, and AAC. In this article, in order to obtain the MDCT coefficients of any given WAV signal, the following operations are performed.

(1) The input WAV signal is first divided into half-overlapping frames of 1152 samples.
(2) For each frame, the audio samples are separated into 32 subbands by the analysis filterbank, and the MDCT window further divides each of these 32 subbands into 18 subbands (long window) or 6 subbands (short window). So 18 spectral lines (coefficients) can be obtained. Note that 3 short windows will be combined together.
(3) Finally, a total of 576 (32 × 18 = 576) MDCT coefficients for each frame can be obtained.

Please note that the preceding operations are exactly the same as the processing of MP3 compression [MP3Standard] before the coefficient quantization and entropy coding. In the implementation, we use the LAME MP3 encoder [LAME MP3 Encoder] with its default parameters to extract the MDCT coefficients.

Mel-Frequency Cepstral Coefficients. The MFCCs are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency [Reynolds et al. 2000]. For a given WAV signal, the following operations are performed to obtain the MFCCs.

(1) The input WAV signal is divided into half-overlapping frames.
(2) Perform the discrete Fourier transform (DFT) on each frame separately.
(3) Map the energy of the DFT spectrum onto the Mel scale with K triangular bandpass filters.
(4) Take the logs of the energy at each Mel frequency.
(5) Perform the DCT on the Mel log energy; the MFCCs are the amplitudes of the resulting spectrum.
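The five MFCC steps listed above can be illustrated with a minimal NumPy sketch. It is not the VoiceBox implementation used in the article: the Hamming window, the Mel-scale formula, the filter edges (0 Hz to half the sampling rate), the unnormalized DCT-II, and the function names are all assumptions made for illustration only.

```python
import numpy as np

def frame_signal(x, frame_len=1152):
    """Step (1): split x into half-overlapping frames (hop = frame_len // 2)."""
    hop = frame_len // 2
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def hz_to_mel(f):
    # One common Mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """K triangular filters equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    fb = np.zeros((n_filters, freqs.size))
    for k in range(n_filters):
        lo, mid, hi = hz_pts[k], hz_pts[k + 1], hz_pts[k + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc(x, fs=44100, frame_len=1152, n_filters=32, n_ceps=18):
    """Return an (n_frames, n_ceps) array of MFCCs for a mono signal x."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len)
    frames = frames * np.hamming(frame_len)            # window is an assumption
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # step (2): DFT power
    mel_energy = power @ mel_filterbank(n_filters, frame_len, fs).T  # step (3)
    log_energy = np.log(mel_energy + 1e-12)            # step (4), guarded log
    # Step (5): unnormalized DCT-II of the log energies, first n_ceps outputs.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return log_energy @ basis.T
```

With the article's settings (1152-sample frames, 44100 Hz, K = 32), a five-second clip yields one row of coefficients per frame; how many coefficients to keep (`n_ceps`) is a free parameter of this sketch.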

In the implementation, we use VoiceBox [VoiceBox] to extract MFCCs with the following parameters: frame size of 1152 samples, sampling rate of 44,100 Hz, K = 32 triangular bandpass filters, and default values for the others.

In order to identify the compression history of a given WAV signal, we design three feature sets to measure the quantization artifacts under different compression schemes and bit rates. Please note that the first two sets are derived from the MDCT coefficients, as illustrated in Figure 3, and the third one is derived from the MFCCs. The details are described as follows.

Fig. 3. Illustration of feature extraction for feature sets #1 and #2.

Feature Set #1 (1-D). For compression history estimation, we should first determine whether a given WAV signal has been compressed or not. It is well known that most lossy compression schemes try to remove some frequency components via quantization. Usually, a low-pass filtering is performed in this process. One of the obvious quantization artifacts is that many high-frequency MDCT coefficients in each frame are quantized to zero. We found that, after quantization, the average number of MDCT coefficients with a value of exactly zero per frame (feature set #1) increases compared with the original uncompressed audio clips. Such a simple statistic can serve as a useful sign for determining whether the WAV audio has previously been compressed with some scheme or not.

In the implementation, the audio signal is divided into frames and we obtain 576 MDCT coefficients for each frame as described previously. The average number of exact zero values per frame is obtained by dividing the total number of zero values by the total number of frames.
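As a minimal sketch of feature set #1, the statistic can be computed as follows, assuming the MDCT coefficients have already been extracted into an array with one row of 576 coefficients per frame. The function names and the array layout are hypothetical; the extraction itself (done with LAME in the article) is not reproduced here.

```python
import numpy as np

def zero_count_feature(mdct):
    """Average number of exactly-zero MDCT coefficients per frame.

    mdct: array of shape (n_frames, 576), assumed to be extracted beforehand.
    """
    mdct = np.asarray(mdct)
    return np.count_nonzero(mdct == 0) / mdct.shape[0]

def zero_ratio_feature(mdct):
    """Same statistic expressed as a ratio of the coefficients in a frame."""
    mdct = np.asarray(mdct)
    return zero_count_feature(mdct) / mdct.shape[1]

# Toy example: three frames whose last 100, 120 and 140 coefficients are zero.
toy = np.ones((3, 576))
for row, n_zero in enumerate((100, 120, 140)):
    toy[row, 576 - n_zero:] = 0.0
print(zero_count_feature(toy))   # average count per frame
print(zero_ratio_feature(toy))   # count divided by 576
```

The ratio form corresponds to the Y-axis convention described in the caption of Figure 4.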
In Figure 4, we show the boxplots of the average number of zero values per frame for 8,800 original audio clips (please refer to Section 3 for more detail about the 8,800 test audio clips) and their decompressed versions with different compression schemes and bit rates. It is obvious that more zero MDCT coefficients are present after quantization. With a proper threshold, we can obtain a satisfactory detection accuracy as high as 96% for differentiating original audio clips from their corresponding decompressed versions with a randomly selected compression scheme and a random bit rate.

Feature Set #2 (20-D). From Figure 4, it can be clearly seen that feature set #1 cannot be used for estimating the compression bit rates for a given compression scheme. Taking MP3 compression

for example, most values in feature set #1 are centered on 0.211 for MP3 decompressed audio clips at bit rates ranging from 32kbps to 128kbps. For the same reason, with feature set #1, it is difficult to discriminate between the WMA and AAC decompressed audio clips.

Fig. 4. The boxplots of the average number of zero values per frame for 8,800 original audio clips and their decompressed versions at different compression schemes and bit rates. The red crosses denote the outlier data. The value of the Y-axis is the ratio of zero values per frame.

We analyze the distributions of the MDCT coefficients for WAV signals that have been previously compressed with different schemes and bit rates. As illustrated in Figure 3, we first obtain 576 MDCT coefficients from each frame as described above, then average the corresponding frequency coefficients over all frames and take the absolute values of the resulting means. In this step, we obtain 576 absolute values for each WAV audio clip. Figure 5 shows the distributions evaluated on 8,800 original uncompressed audio clips and their decompressed versions using the MP3, WMA and AAC compression schemes at compression bit rates of 32kbps, 64kbps and 128kbps, respectively. Three important properties can be observed from the figures. Firstly, all curves in each figure approximately decrease as the index on the x-axis increases, which means that the energy (amplitude) of the frequency components decreases from lower frequencies to higher ones. Secondly, for a given compression scheme, the lower the bit rate, the fewer valid high-frequency coefficients exist. Thirdly, and most importantly, different bit rates have different cut-off frequencies, which makes the shapes of the curves quite different for audio clips with different compression schemes and bit rates.
Therefore, the curve should be a very promising feature for compression history estimation. In order to reduce the dimension of the feature vector, we divide the resulting 576 coefficients into 24 non-overlapping bins. For each bin, we calculate the average of its 24 (24 = 576/24) coefficients. Therefore, we finally obtain 24 mean values. Based on our experiments, we found that the last 4 mean values are exactly zero for all compression schemes and bit rates. Thus, we only employ the first 20 values in feature set #2.

Feature Set #3 (54-D). With feature sets #1 and #2, we can obtain good results for identifying MP3 decompressed clips [Luo et al. 2012]. However, the performance of bit rate estimation is still far from satisfactory. For instance, the detection accuracy for higher bit rates is around 84% for WMA decompressed audio, and only around 70% for AAC decompressed audio. Based on our extensive experiments, we found that lossy audio compression significantly affects the distributions of the MFCCs of the resulting audio clips.
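The averaging and binning that produce feature set #2 can be sketched as follows. As before, the input is assumed to be a pre-extracted array of shape (number of frames, 576); the function name and parameters are hypothetical.

```python
import numpy as np

def mdct_envelope_feature(mdct, n_bins=24, n_keep=20):
    """Feature set #2 sketch: 20 bin averages of the |mean| MDCT spectrum.

    mdct: array of shape (n_frames, 576), assumed to be extracted beforehand.
    """
    mdct = np.asarray(mdct, dtype=float)
    # Average each of the 576 coefficients over all frames, then take |.|.
    mean_abs = np.abs(mdct.mean(axis=0))
    # Split into 24 non-overlapping bins of 24 coefficients and average each.
    bin_means = mean_abs.reshape(n_bins, -1).mean(axis=1)
    # Keep only the first 20 bin means.
    return bin_means[:n_keep]

# Toy example: four identical frames holding the values 0, 1, ..., 575.
toy = np.tile(np.arange(576, dtype=float), (4, 1))
feature = mdct_envelope_feature(toy)
print(feature.shape)
```

Note that the absolute value is taken after averaging over frames, following the order of operations described in the text.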

Fig. 5. The average of 576 MDCT coefficients for 8,800 original audio clips and their MP3/WMA/AAC decompressed versions at different bit rates.

Fig. 6. The average of the first ten MFCCs for 8,800 original audio clips and their MP3/WMA/AAC decompressed versions at different bit rates.

To further improve the performance, we introduce another set of statistics, derived from the MFCCs described in Section 2.2. Similarly, we plot the average MFCCs for different compression schemes and bit rates in Figure 6. For display purposes, only the first ten average MFCC values are shown in the figures. Please note that each individual point of a curve denotes the average value of the corresponding MFCC coefficient over 8,800 audio clips. We can observe that the distributions of MFCCs are quite different for different compression schemes and bit rates, which may help to estimate the compression bit rate. For each WAV audio, the original MFCC features and the first and second derivatives of the MFCCs (ΔMFCCs, Δ²MFCCs) are employed. The extensive experiments in Section 3.5 demonstrate that

the feature combination of 18 MFCC, 18 ΔMFCC and 18 Δ²MFCC coefficients can achieve a good tradeoff between the feature dimension and detection results. Therefore, we obtain an 18 × 3 = 54-D feature vector in all.

Finally, we combine feature sets #1, #2, and #3, and obtain a feature vector of 75 (1 + 20 + 54 = 75) dimensions for identifying the compression history of a WAV signal.

3. EXPERIMENTAL RESULTS AND DISCUSSIONS

In the experiments, we randomly collect 8,800 mono audio clips of five seconds at 44.1 kHz. They are cut from original uncompressed WAV audio files with different contents, including blues, country, disco, jazz, rock, pop and classical music. We employ the audio software GoldWave [GoldWave] and FormatFactory [FormatFactory] to obtain different compressed stereo audio clips, including MP3, WMA and AAC, with different compression bit rates. Here, the test bit rates for the MP3 and WMA formats are 32, 48, 64, 80, 96, and 128kbps, and the test bit rates for the AAC format are 32, 64, 96, and 128kbps, which are the ones supported by the software. These compressed audio clips are then decompressed and finally stored in mono WAV format at a 44.1 kHz sampling rate to remove all the previous compression information in the file header.

For a given WAV audio clip, the 75-D feature vector mentioned above is extracted. We employ the support vector machine toolbox [Chang and Lin 2011] and use the RBF (radial basis function) kernel for classification. The 8,800 audio clips are used for all the experiments; 30% of them are randomly selected for the training stage, and the remaining 70% are used for testing. In order to show the effectiveness of the proposed features, the following two experiments have been conducted for a given compression scheme, and the results are given in Sections 3.1, 3.2, and 3.3.
(1) Fixed and random compression bit rate test: In this experiment, 8,800 original WAV clips and their decompressed versions at each fixed bit rate or at a randomly selected bit rate are used. We try to identify whether or not a given WAV audio clip has been compressed at a fixed/random compression bit rate.
(2) Compression bit rates identification: In this experiment, 8,800 original WAV clips and their decompressed versions at all the bit rates are used. We try to estimate the compression bit rate previously used for a given WAV audio.

For more experiments and analyses, please refer to Sections 3.4 to 3.11.

3.1 Results of MP3 Audio Clips

The experimental results for the fixed and random compression bit rate test are shown in the first row of Table I. It can be seen that all detection accuracies are over 99% for MP3 decompressed audio clips at fixed bit rates ranging from 32kbps to 128kbps, and the detection accuracy for the random bit rate test is 98.91%, which is slightly poorer than those of the fixed bit rate cases. The results demonstrate that our method can effectively identify whether a given WAV has been previously compressed by an MP3 encoder or not.

Table II is the confusion matrix for identifying the compression bit rates. The detection accuracy for every bit rate is above 98%, and the average detection accuracy, obtained by averaging the diagonal entries of the confusion matrix, is 98.52%. It can be clearly seen that the proposed method can effectively estimate the compression bit rates for previously MP3 compressed audio clips.

3.2 Results of WMA Audio Clips

As shown in the second row of Table I, the proposed method is also effective for identifying WMA decompressed audio clips, even when the fixed bit rate is as high as 128kbps. For the random compression bit rate test, we obtain a satisfactory result with a detection accuracy of 95.87%, which implies

that the proposed method is able to identify whether a given WAV has been previously compressed by a WMA encoder or not.

The confusion matrix for identifying WMA decompressed audio clips at different bit rates is shown in Table III. It can be seen that the detection accuracy for the WMA format is slightly poorer than that for the MP3 format shown in Table II. However, the detection results are still satisfactory, with an average accuracy of 93.79%.

Table I. Detection accuracy for identifying uncompressed WAV clips and MP3/WMA/AAC decompressed ones at a fixed and random bit rate (%). The symbol "-" denotes that the compression bit rate is not supported by the software.

          32k     48k     64k     80k     96k     128k    Random
MP3       99.62   99.69   99.65   99.65   99.72   99.35   98.91
WMA       99.56   98.82   98.78   98.81   98.04   97.49   95.87
AAC       99.19   -       98.80   -       98.82   99.01   98.42

Table II. Confusion matrix for identifying MP3 audio at different bit rates (%), where the symbol * denotes the value is less than 5%.

          Original  32k     48k     64k     80k     96k     128k
Original  98.45     *       *       *       *       *       *
32k       *         99.59   *       0       0       0       0
48k       *         *       98.58   *       *       0       0
64k       *         *       *       98.24   *       0       0
80k       *         *       *       *       98.05   *       0
96k       *         *       *       *       *       98.27   *
128k      *         *       *       *       *       *       98.47

Table III. Confusion matrix for identifying WMA audio at different bit rates (%), where the symbol * denotes the value is less than 5%.

          Original  32k     48k     64k     80k     96k     128k
Original  95.62     *       *       *       *       *       *
32k       *         99.20   *       0       0       0       0
48k       *         *       94.59   *       *       *       *
64k       *         *       6.89    89.83   *       0       0
80k       *         0       *       *       92.58   *       0
96k       *         *       *       *       *       93.08   *
128k      *         0       *       *       *       6.34    91.60

3.3 Results of AAC Audio Clips

The third row of Table I shows the detection results of the fixed/random bit rate test for AAC decompressed audio clips. It shows that our method is also effective for detecting AAC decompressed audio clips. For the random test, the average detection accuracy is as high as 98.42%.

The confusion matrix for identifying the bit rate of AAC decompressed audio clips is shown in Table IV.
Similarly, the experimental results are relatively poorer than those for MP3, especially for detecting those audio clips with higher compression bit rate. For identifying different bit rates, however, we still achieve an average detection accuracy as high as 92.33%.
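The average accuracies quoted in Sections 3.1 to 3.3 are the means of the diagonal entries of the corresponding confusion matrices. The short sketch below recomputes them from the diagonal values reported in Tables II, III and IV; the helper name is hypothetical.

```python
def average_diagonal(diagonal):
    """Mean of the diagonal entries of a confusion matrix, in percent."""
    return round(sum(diagonal) / len(diagonal), 2)

# Diagonal entries copied from Tables II (MP3), III (WMA) and IV (AAC).
diagonals = {
    "MP3": [98.45, 99.59, 98.58, 98.24, 98.05, 98.27, 98.47],
    "WMA": [95.62, 99.20, 94.59, 89.83, 92.58, 93.08, 91.60],
    "AAC": [97.71, 98.11, 90.17, 81.60, 94.07],
}

averages = {name: average_diagonal(d) for name, d in diagonals.items()}
for name, value in averages.items():
    print(name, value)
```

The three printed values match the reported averages of 98.52%, 93.79% and 92.33%.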

Table IV. Confusion matrix for identifying AAC audio at different bit rates (%), where the symbol * denotes the value is less than 5%.

          Original  32k     64k     96k     128k
Original  97.71     *       *       *       *
32k       *         98.11   *       0       0
64k       *         *       90.17   7.32    *
96k       *         *       14.07   81.60   *
128k      *         *       *       *       94.07

Table V. Confusion matrix for identifying original WAV, MP3, WMA and AAC decompressed audio clips (%), where the symbol * denotes the value is less than 5%.

          Original  MP3     WMA     AAC
Original  94.75     *       *       *
MP3       *         92.12   *       5.55
WMA       *         *       93.21   *
AAC       *         *       *       93.53

3.4 Compression Scheme Identification

In the previous experiments, we assumed that all testing WAV audio clips were compressed with a fixed compression scheme, for instance, MP3 in Section 3.1. In this experiment, we assume that the compression scheme previously used may be MP3, WMA or AAC, and we aim to determine which scheme has been employed for the WAV signal. For each uncompressed WAV audio clip in the experiments, we obtain the corresponding MP3, WMA and AAC decompressed audio at a random bit rate, respectively. In all, we have 8,800 × 4 = 35,200 WAV clips in the experiment. Similarly, 30% of these clips are used to train an SVM classifier, and the remaining audio clips are used for testing. The detection results are shown in Table V. On average, the detection accuracy is 93.40%.

3.5 Feature Dimension of MFCCs vs. Performance

In this section, we report the average detection accuracies along the diagonals of the confusion matrices in Tables II, III, IV, and V under different feature set combinations. The experimental results are listed in Table VI. First of all, it can be seen from Table VI that the performance increases after introducing the MFCC features. Besides, we found that the higher the order (i.e., MFCC, ΔMFCC and Δ²MFCC) and the higher the dimension (i.e., 12-D, 18-D and 24-D) of the MFCC features we employ, the better the detection accuracy we usually obtain.
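The feature dimensions compared in Table VI follow directly from the construction in Section 2.2: 1 value from feature set #1, 20 from feature set #2, and one block of MFCC-based values per order used (MFCC, ΔMFCC, Δ²MFCC). A small sketch of this bookkeeping, with a hypothetical helper name, is:

```python
def feature_dimension(n_mfcc=0, n_orders=0):
    """Total dimension: set #1 (1-D) + set #2 (20-D) + n_mfcc per MFCC order."""
    return 1 + 20 + n_mfcc * n_orders

# Baseline without MFCCs, i.e., feature sets #1 and #2 only.
print(feature_dimension())

# All combinations of MFCC dimension and number of orders listed in Table VI.
for n_mfcc in (12, 18, 24):
    for n_orders in (1, 2, 3):
        print(n_mfcc, n_orders, feature_dimension(n_mfcc, n_orders))
```

With 18 coefficients and all three orders this gives the 75-D vector used in the experiments.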
As described in Section 2.2, we use feature sets #1 and #2 together with 18-D MFCCs, 18-D ΔMFCCs and 18-D Δ²MFCCs, as highlighted in Table VI, which achieves a better tradeoff between the feature dimension and detection performance.

3.6 Comparative Analysis with Our Previous Work

In this section, we compare the proposed method with our previous work [Luo et al. 2012] for identifying the compression bit rates. The comparative results are shown in Tables VII, VIII, and IX, respectively. Overall, the performance is improved after introducing the MFCC-based features, especially for identifying the decompressed audio clips of AAC (about 11% improvement on average) and WMA (over 5% improvement on average); see also Table VI.

Table VI. Experimental Results under Different Feature Combinations

Feature set used                            AAC     MP3     WMA     Scheme  Dimension
#1, #2 (method of [Luo et al. 2012])        81.67   97.08   88.78   82.08   21
#1, #2, 12 MFCCs                            91.49   97.23   92.00   88.12   33
#1, #2, 12 MFCCs, 12 ΔMFCCs                 92.20   97.44   91.15   89.49   45
#1, #2, 12 MFCCs, 12 ΔMFCCs, 12 Δ²MFCCs     92.55   98.36   91.50   92.79   57
#1, #2, 18 MFCCs                            91.04   97.76   94.32   88.78   39
#1, #2, 18 MFCCs, 18 ΔMFCCs                 91.91   97.76   93.89   90.36   57
#1, #2, 18 MFCCs, 18 ΔMFCCs, 18 Δ²MFCCs     92.33   98.52   93.79   93.40   75
#1, #2, 24 MFCCs                            90.52   97.99   95.58   88.35   45
#1, #2, 24 MFCCs, 24 ΔMFCCs                 91.55   97.95   95.30   91.09   69
#1, #2, 24 MFCCs, 24 ΔMFCCs, 24 Δ²MFCCs     91.50   98.48   94.65   93.57   93

Table VII. The improvement of detection results for identifying MP3 audio at different bit rates after introducing feature set #3. Columns: Original, 32k, 48k, 64k, 80k, 96k, 128k. Only entries whose absolute change is at least 1% are listed, row by row, without their column positions or increment/decrement markers.

Original:
32k:
48k:      1.12%
64k:      1.76%
80k:      2.79%
96k:      1.86%
128k:     1.50%, 1.63%

Table VIII. The improvement of detection results for identifying WMA audio at different bit rates after introducing feature set #3. Columns and listing convention as in Table VII.

Original: 2.01%
32k:      1.02%
48k:      2.93%, 1.60%
64k:      7.74%, 7.58%
80k:      1.03%, 5.71%, 1.16%, 6.37%
96k:      2.28%, 2.12%, 6.63%, 1.03%
128k:     3.34%, 3.18%, 8.45%

For compression scheme identification, we also achieved a better result (please refer to Table VI and Table X). On average, we obtain an 11% improvement after introducing the MFCC features, and the detection error has declined to about 7%.

3.7 Robustness Analysis on Different Audio Subsets

In all the experiments above, we used 8,800 audio clips. In this section, we use only a subset of them as the training and testing data to test the robustness of the proposed method.
In this experiment, the subset sizes are 1000, 2500, 4000, 5500 and 7000, and each subset is randomly selected from the 8,800 audio clips. The compression bit rate identification experiments are then repeated for the MP3, WMA and AAC formats.
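The subset selection can be sketched as follows. This is only a minimal illustration: the clip identifiers, the seeding scheme and the function name are assumptions, and how each subset is further divided into training and testing data is not specified here, so it is left out.

```python
import random

TOTAL_CLIPS = 8800
SUBSET_SIZES = (1000, 2500, 4000, 5500, 7000)


def draw_subset(clip_ids, size, seed):
    """Randomly pick `size` distinct clips; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(clip_ids, size)


# Hypothetical clip identifiers 0..8799 stand in for the real audio files.
clip_ids = list(range(TOTAL_CLIPS))
subsets = {size: draw_subset(clip_ids, size, seed=size) for size in SUBSET_SIZES}

for size, subset in subsets.items():
    # Each subset would then be used to repeat the bit rate identification
    # experiment for MP3, WMA and AAC.
    print(size, len(subset), len(set(subset)))
```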

Table IX. The improvement of detection results for identifying AAC audio at different bit rates after introducing the feature set #3

Columns: Original, 32k, 64k, 96k, 128k
Original: 2.53%, 1.36%
32k: 1.25%
64k: 1.50%, 5.12%, 10.95%, 3.31%, 1.00%
96k: 2.53%, 2.50%, 8.47%, 18.36%, 4.85%
128k: 5.81%, 1.81%, 5.04%, 7.53%, 20.21%

Table X. The improvement of detection results for identifying compression schemes after introducing the feature set #3

| | Original | MP3 | WMA | AAC |
|---|---|---|---|---|
| Original | 6.55% | 0 | 4.57% | 1.89% |
| MP3 | 1.05% | 4.07% | 4.83% | 1.81% |
| WMA | 5.01% | 5.17% | 20.27% | 10.08% |
| AAC | 1.67% | 8.62% | 4.10% | 14.39% |

Fig. 7. The detection accuracies for the audio subsets with different sizes.

As shown in Figure 7, the average detection accuracies rise by 3-6% for MP3/WMA/AAC when the subset size increases from 1000 to 5500, whereas they improve by barely 1% when the subset size increases from 5500 to 8800. The more training data we use, the better and more reliable the results.

3.8 Robustness Analysis on Noise Attack

In this section, we evaluate the performance of the proposed method on audio clips contaminated with white Gaussian noise, which is a common attack in practice. In the experiments, 8,800 audio clips are used and their MP3/WMA/AAC decompressed versions are first obtained. White Gaussian noise at an SNR of 35dB is then added to all decompressed audio clips. The experimental results are as follows.

Fixed and random compression bit rate test: The experimental results in Table XI show that our proposed method is able to effectively separate the original audio from the noisy decompressed audio. On average, over 95% detection accuracy can be achieved.
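The noise contamination step can be sketched as below, assuming that "35dB" refers to the signal-to-noise ratio measured over the whole clip. The function names and the 440 Hz test tone are hypothetical stand-ins for a decompressed audio clip; this is not the authors' implementation.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=0):
    """Add zero-mean white Gaussian noise so the clip has the requested SNR."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of `noisy` relative to `clean`, in dB."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone at 44.1 kHz as a toy "decompressed clip".
sample_rate = 44100
clean = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate)
         for t in range(sample_rate)]
noisy = add_white_gaussian_noise(clean, snr_db=35.0)
print(round(measured_snr_db(clean, noisy), 1))
```

Because the noise is random, the measured SNR only approximates the target, but for a clip of this length the deviation is a small fraction of a decibel.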

Table XI. Detection accuracy for identifying uncompressed WAV clips and MP3/WMA/AAC decompressed ones at a fixed and random bit rate with 35dB white noise (%). The symbol - denotes that the compression bit rate is not supported by the software.

| | 32k | 48k | 64k | 80k | 96k | 128k | Random |
|---|---|---|---|---|---|---|---|
| MP3 | 99.76 | 99.81 | 99.72 | 99.69 | 99.60 | 99.54 | 98.39 |
| WMA | 98.14 | 96.74 | 96.53 | 96.64 | 96.04 | 95.01 | 95.03 |
| AAC | 99.69 | - | 99.55 | - | 99.24 | 99.21 | 99.24 |

Table XII. Confusion matrix for identifying MP3 audio at different bit rates with 35dB noise (%), where the symbol * denotes a value less than 5%. The average detection rate is 94.25%.

| | Original | 32k | 48k | 64k | 80k | 96k | 128k |
|---|---|---|---|---|---|---|---|
| Original | 98.42 | * | * | * | * | * | * |
| 32k | * | 99.13 | * | * | * | * | * |
| 48k | * | * | 96.10 | * | * | * | * |
| 64k | * | * | 6.89 | 92.09 | * | * | * |
| 80k | * | * | * | * | 91.76 | * | * |
| 96k | * | * | * | * | * | 90.53 | * |
| 128k | * | * | * | * | * | 6.34 | 91.73 |

Table XIII. Confusion matrix for identifying WMA audio at different bit rates with 35dB noise (%), where the symbol * denotes a value less than 5%. The average detection rate is 77.61%.

| | Original | 32k | 48k | 64k | 80k | 96k | 128k |
|---|---|---|---|---|---|---|---|
| Original | 88.87 | * | * | * | * | * | 5.42 |
| 32k | * | 98.01 | * | * | * | * | * |
| 48k | * | * | 83.79 | 7.88 | * | * | * |
| 64k | * | * | 17.53 | 66.29 | 11.55 | * | * |
| 80k | * | * | 8.66 | 15.38 | 70.16 | * | * |
| 96k | * | * | * | * | 7.37 | 70.55 | 12.69 |
| 128k | * | * | * | * | * | 22.27 | 65.58 |

Compression bit rate identification: The confusion matrices for MP3/WMA/AAC are shown in Tables XII, XIII and XIV, respectively. Overall, the performance decreases after the noise attack. However, the average detection accuracies remain satisfactory for the MP3 and AAC formats, decreasing by only 4.27% and 7.70%, respectively, compared with those without noise. For the WMA format, the performance degrades significantly (by about 16%), which means that the proposed method is sensitive to noise for this type of audio in compression bit rate estimation. Overcoming this limitation requires new robust features, which may be considered in future work. Two other noise strengths (i.e., 30dB and 40dB) are also evaluated.
Please note that the noise is clearly audible when the SNR is as low as 30dB. On average, the detection accuracy fluctuates within ±3% relative to that at 35dB. Overall, our proposed method can still achieve satisfactory results for identifying decompressed audio with slight noise.

3.9 Frame Offset Problem

As described in Section 2.1, most lossy audio compression is performed frame by frame. Therefore, such a frame structure would be preserved after decompression. The proposed features

(refer to feature sets #1, #2, and #3) are also frame-based. Based on our experiments, the distribution of the frequency coefficients (including MDCT and MFCC coefficients) changes with the frame parameters, which may affect the effectiveness of the algorithm. The frame offset problem can be regarded as a special attack on frame-based algorithms, as in the previous forensic work [Yang et al. 2008]. In this section, we evaluate the performance of the proposed method on audio clips with frame structure desynchronization. In the experiments, some samples of each decompressed audio clip are first removed. The number of deleted samples is randomly selected from 1 to 22050 (half a second). The experimental results and analyses are as follows.

(1) Random bit rate test for MP3 decompressed audio clips. The detection accuracy is 97.03% (98.91% with no frame offset; please refer to Section 3.1).
(2) Random bit rate test for WMA decompressed audio clips. The detection accuracy is 94.86% (95.87% with no frame offset; please refer to Section 3.2).
(3) Random bit rate test for AAC decompressed audio clips. The detection accuracy is 97.06% (98.42% with no frame offset; please refer to Section 3.3).

When the compression scheme of a questionable WAV signal is fixed, the above results show that the proposed method obtains results similar to those with no frame offset. However, the detection performance drops for compression scheme identification, especially for MP3 decompressed audio clips; see Table XV. On average, there is a decrease of around 6% compared with the results shown in Table V.

Table XIV. Confusion matrix for identifying AAC audio at different bit rates with 35dB noise (%), where the symbol * denotes a value less than 5%. The average detection rate is 84.63%.

| | Original | 32k | 64k | 96k | 128k |
|---|---|---|---|---|---|
| Original | 98.37 | * | * | * | * |
| 32k | * | 98.11 | * | * | * |
| 64k | * | * | 86.96 | 9.72 | * |
| 96k | * | * | 21.91 | 67.09 | 9.65 |
| 128k | * | * | 7.46 | 18.47 | 72.62 |

Table XV. Confusion matrix for identifying original WAV, MP3, WMA and AAC decompressed audio clips with frame offsets (%), where the symbol * denotes a value less than 5%.

| | Original | MP3 | WMA | AAC |
|---|---|---|---|---|
| Original | 93.87 | * | * | * |
| MP3 | * | 82.30 | * | 12.14 |
| WMA | * | 5.92 | 89.43 | * |
| AAC | * | 11.18 | * | 85.24 |

3.10 Evaluation on Compressed Audio with VBR

In the previous experiments, compression is performed with constant bit rates. In this section, we evaluate the proposed method on audio clips compressed with variable bit rates (VBR). In our experiments, we use CoolEdit, GoldWave and NeroAAC to compress the 8,800 original uncompressed audio clips into MP3, WMA and AAC files, respectively, and set the VBR quality to three different

levels (low, medium and high). In this way, the bit rates of the resulting audio clips fall into three different ranges based on our experiments, that is, 80-95kbps, 105-140kbps and 145-220kbps. We aim to determine whether or not a WAV audio has been previously compressed with VBR. The experimental results are shown in Table XVI. It is observed that our proposed method achieves an accuracy above 85% even when the compression quality is high. For audio clips with low and medium quality levels (i.e., bit rates below 140kbps), we obtain an average accuracy as high as 95%.

Table XVI. Detection accuracy for MP3/WMA/AAC compressed audio with the VBR option at different qualities (%)

| Quality (bit rates) | Low (80-95kbps) | Medium (105-140kbps) | High (145-220kbps) |
|---|---|---|---|
| MP3 | 96.26 | 95.75 | 92.06 |
| WMA | 95.12 | 95.26 | 85.38 |
| AAC | 95.92 | 93.69 | 91.28 |

3.11 Experiments on the Dataset of GTZAN Genre Collection

In this section, we evaluate the proposed method on another dataset, the GTZAN Genre Collection [GTZAN], which includes 1000 original audio clips with a length of around 30 seconds each. In order to obtain sufficient training/testing data, each original audio clip is divided into 5 non-overlapping segments. Therefore, we have 5 × 1000 = 5000 audio segments (around 5 seconds for each segment) in all. We repeat all the previous experiments, and show the experimental results as follows:

Fixed and random compression bit rate test: identifying whether or not a given audio clip has been compressed. The results are shown in Table XVII.

Table XVII. Detection accuracy for identifying uncompressed WAV clips and MP3/WMA/AAC decompressed ones at a fixed and random bit rate (%). The symbol - denotes that the compression bit rate is not supported by the software.

| | 32k | 48k | 64k | 80k | 96k | 128k | Random |
|---|---|---|---|---|---|---|---|
| MP3 | 99.94 | 99.91 | 99.91 | 99.87 | 99.88 | 99.72 | 99.68 |
| WMA | 99.92 | 99.14 | 99.28 | 99.12 | 98.41 | 98.00 | 97.50 |
| AAC | 99.92 | - | 99.84 | - | 99.62 | 99.31 | 99.22 |
Compression bit rate identification: For MP3/WMA/AAC, the confusion matrices are shown in Tables XVIII-XX, respectively. The averages of the diagonal entries of the three tables are 98.65%, 88.66% and 94.74%.

The above experimental results show that the proposed method is also effective on the GTZAN Genre Collection. For both fixed and random compression bit rates, most detection accuracies are over 99% and none falls below 97.50% (refer to Table XVII). For compression bit rate estimation for MP3, WMA and AAC (refer to Tables XVIII-XX), we compare the average of the diagonal values with the results on our previous dataset, and show them in Table XXI. From this table, it is observed that the average detection accuracies for AAC and MP3 increase (by 2.41% and 0.13%, respectively), while the accuracy for WMA decreases (by about 5.1%). Overall, the performances evaluated on the two datasets are similar, which shows the effectiveness of our method.
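The diagonal averages quoted for the confusion matrices can be checked with a few lines of code. The sketch below uses the AAC matrix of Table XX; entries printed as * (below 5%) are not needed for the diagonal and are stored as None, and the function name is illustrative.

```python
def average_diagonal(matrix):
    """Mean of the diagonal entries, i.e. the average per-class detection rate."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


STAR = None  # placeholder for "*" entries (values below 5%)

# Table XX: rows and columns are Original, 32k, 64k, 96k, 128k.
table_xx = [
    [99.02, STAR, STAR, STAR, STAR],
    [STAR, 99.71, STAR, STAR, STAR],
    [STAR, STAR, 97.05, STAR, STAR],
    [STAR, STAR, STAR, 89.37, 7.08],
    [STAR, STAR, STAR, 10.60, 88.54],
]

print(round(average_diagonal(table_xx), 2))  # 94.74
```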

Table XVIII. Confusion matrix for identifying MP3 audio at different bit rates (%), where the symbol * denotes a value less than 5%.

| | Original | 32k | 48k | 64k | 80k | 96k | 128k |
|---|---|---|---|---|---|---|---|
| Original | 99.45 | * | 0 | 0 | * | 0 | * |
| 32k | 0 | 99.88 | * | 0 | 0 | 0 | 0 |
| 48k | 0 | * | 99.20 | * | * | 0 | 0 |
| 64k | 0 | * | * | 99.02 | * | 0 | 0 |
| 80k | * | * | * | * | 98.37 | * | 0 |
| 96k | * | * | * | * | * | 96.91 | * |
| 128k | * | * | * | * | * | * | 97.74 |

Table XIX. Confusion matrix for identifying WMA audio at different bit rates (%), where the symbol * denotes a value less than 5%.

| | Original | 32k | 48k | 64k | 80k | 96k | 128k |
|---|---|---|---|---|---|---|---|
| Original | 98.05 | * | * | 0 | * | * | * |
| 32k | 0 | 99.88 | * | 0 | 0 | 0 | 0 |
| 48k | * | * | 88.68 | 9.17 | * | * | * |
| 64k | * | * | 28.71 | 67.00 | * | * | * |
| 80k | * | * | 8.14 | 7.57 | 82.45 | * | * |
| 96k | * | * | * | * | * | 90.02 | * |
| 128k | * | * | * | * | * | * | 94.54 |

Table XX. Confusion matrix for identifying AAC audio at different bit rates (%), where the symbol * denotes a value less than 5%.

| | Original | 32k | 64k | 96k | 128k |
|---|---|---|---|---|---|
| Original | 99.02 | * | * | * | * |
| 32k | * | 99.71 | * | * | * |
| 64k | * | * | 97.05 | * | * |
| 96k | * | * | * | 89.37 | 7.08 |
| 128k | * | * | * | 10.60 | 88.54 |

4. POTENTIAL APPLICATIONS

Three potential applications of identifying the compression history of WAV audio are discussed in this section: digital audio splicing detection, fake-quality CD identification and blind audio quality assessment.

Audio splicing is one of the most commonly used tampering operations in practice. To modify the content of an audio clip, audio clips with different compression histories (including uncompressed versions, various compression schemes and/or bit rates) may be spliced together. To this end, all compressed audio clips must first be decompressed into the temporal domain (i.e., into WAV form), and then some audio segments are carefully selected and inserted into suitable positions of the target clip. Finally, there are two ways of saving the resulting spliced WAV audio. The first is to re-save it in compressed form, such as MP3. In this case, double quantization artifacts are introduced, which can be effectively detected by some existing works, such as Qiao et al.
[2010] and Liu et al. [2010]. The second is to save it in uncompressed form. In this case, our proposed method becomes applicable by identifying the compression history of every small segment. As illustrated in Figure 8, two audio segments with different compression histories are spliced together to modify the original meaning. Please note that no obvious audible artifacts are introduced, especially when the true bit rate of the inserted segment is similar to that of the target. To expose such a spliced audio clip, we first divide it into small non-overlapping segments, and then employ the proposed method to identify the compression history of each segment. If inconsistent segments are found, the suspect audio is regarded as spliced with high probability. Based on our extensive experiments, we can obtain satisfactory results even when the audio segments are as short as 1 second.

Table XXI. Comparison of the average of diagonal elements of the confusion matrix for our dataset and GTZAN Genre Collection (%).

| | Our dataset | GTZAN dataset |
|---|---|---|
| MP3 | 98.52 | 98.65 |
| WMA | 93.79 | 88.66 |
| AAC | 92.33 | 94.74 |

Fig. 8. Illustration of digital audio splicing using 2 audio segments with different compression history.

Another possible application of the proposed method is to identify fake-quality CDs (pirated CDs). A CD usually records the original waveform of the music, so its quality is very high. However, a forger may first download some compressed music clips (or lower-quality trial editions) from the Internet, for example, MP3 audio at 128kbps, decompress them, and then burn them to a CD for commercial benefit. In this case, the resulting CD is actually of low quality, since it is produced from compressed audio clips with lower bit rates. By estimating the compression history of the CD music, it is possible to determine whether or not the CD is of fake quality.

Furthermore, the proposed features can be extended to blind audio quality assessment. For instance, when we search for a favorite piece of music on the Internet, we usually find many near-duplicate versions.
In fact, the true quality (bit rate) of most of the retrieved audio clips is much lower than the alleged value, since the compression bit rate can easily be up-converted using audio editing software such as GoldWave and FormatFactory. To pick out the best-quality version among them, blind audio quality assessment becomes important. With the proposed method, we can estimate the true bit rate of the audio, which provides a promising measure for blind quality assessment.
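The segment-wise splicing check described in this section can be sketched as follows. The `classify_segment` callback is a hypothetical stand-in for the trained compression-history classifier, and the toy classifier, sample rate and function names below are assumptions made only so that the example runs on its own.

```python
def split_into_segments(samples, segment_length):
    """Cut the signal into non-overlapping segments, dropping a trailing remainder."""
    count = len(samples) // segment_length
    return [samples[i * segment_length:(i + 1) * segment_length]
            for i in range(count)]


def detect_splicing(samples, sample_rate, classify_segment, segment_seconds=1.0):
    """Label every segment and flag the clip if the labels are inconsistent."""
    segment_length = int(sample_rate * segment_seconds)
    segments = split_into_segments(samples, segment_length)
    labels = [classify_segment(segment) for segment in segments]
    return labels, len(set(labels)) > 1


def toy_classifier(segment):
    """Hypothetical classifier: a real one would use the proposed features."""
    return "MP3" if segment[0] > 0.5 else "original"


# Toy clip: two seconds labelled "original" followed by one second labelled "MP3".
sample_rate = 8000
audio = [0.0] * (2 * sample_rate) + [1.0] * sample_rate
labels, spliced = detect_splicing(audio, sample_rate, toy_classifier)
print(labels, spliced)  # ['original', 'original', 'MP3'] True
```

In practice the per-segment labels would also carry the estimated bit rate, so that segments from the same scheme but different bit rates are flagged as inconsistent as well.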

5. CONCLUDING REMARKS

In this article, we propose a method for exposing the compression history of WAV audio, and describe its potential applications in audio splicing detection and quality evaluation. The proposed method is mainly based on the statistics of MDCT and MFCC coefficients. We first analyze the MDCT and MFCC coefficients of original uncompressed audio clips and of their decompressed versions with different compression schemes and bit rates, and then propose three feature sets for exposing the compression history of WAV audio. Three popular audio compression schemes, that is, MP3, WMA and AAC, have been investigated in our experiments. The extensive experiments have shown that the proposed method can effectively identify whether or not a WAV audio has been previously compressed, and can further identify the compression scheme and estimate its compression bit rate with high detection accuracy. Furthermore, the robustness against noise attack, frame offset and the VBR compression mode has been studied.

There is still room to improve the robustness against the types of attacks mentioned above. We will also consider other possible attacks in the future. Furthermore, we will extend our study to investigate whether or not the proposed method can identify audio clips that have been recompressed with different compression schemes, and further estimate the primary compression schemes and bit rates previously used.

REFERENCES

P. Bestagini, A. Allam, S. Milani, M. Tagliasacchi, and S. Tubaro. 2012. Video codec identification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. 2257-2260.
T. Bianchi, A. Rosa, and M. Fontani. 2013. Detection and classification of double compressed MP3 audio tracks. In Proceedings of the 1st ACM Workshop on Information Hiding and Multimedia Security. 159-164.
C.-C. Chang and C.-J. Lin. 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1-27:27.
G. Chen, X. Kong, W. Zhong, and B. Wang. 2012. Detection of double mp3 compression based on fluctuation intensity of quantized MDCT coefficients. In Proceedings of the China Information Hiding and Multimedia Security Workshop. 164-167.
Z. Fan and R. L. De Queiroz. 2003. Identification of bitmap compression history: JPEG detection and quantizer estimation. IEEE Trans. Image Process. 12, 2, 230-235.
Formatfactory. Formatfactory software - http://www.formatoz.com/.
D. Fu, Y. Shi, and W. Su. 2007. A generalized Benford's law for JPEG coefficients and its applications in image forensics. In Proceedings of SPIE on Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Contents. Vol. 6505.
Goldwave. Goldwave software - http://www.goldwave.ca/.
GTZAN. GTZAN Genre Collection - http://marsyas.info/download/data_sets/.
S. Hacker. 2000. MP3: The Definitive Guide. O'Reilly Media.
S. Hiçsönmez, H. T. Sencar, and I. Avcibas. 2011. Audio codec identification through payload sampling. In Proceedings of the International Workshop on Information Forensics and Security.
S. Hiçsönmez, E. Uzun, and H. T. Sencar. 2013. Methods for identifying traces of compression in audio. In Proceedings of the 1st International Conference on Communications, Signal Processing, and Their Applications. 1-6.
F. Jenner and A. Kwasinski. 2012. Highly accurate non-intrusive speech forensics for codec identifications from observed decoded signals. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Kyoto, 1737-1740.
C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang. 2007. Digital audio forensics: A first practical evaluation on microphone and environment classification. In Proceedings of the Workshop on Multimedia and Security. 63-74.
Lame MP3 Encoder. http://sourceforge.net/projects/lame/.
Q. Liu, A. Sung, and M. Qiao. 2010. Detection of double mp3 compression. Cognitive Comput. 2, 291-296.
J. Lukáš and J. Fridrich. 2003. Estimation of primary quantization matrix in double compressed JPEG images. In Proceedings of the Digital Forensic Research Workshop.
D. Luo, W. Luo, R. Yang, and J. Huang. 2012. Compression history identification for digital audio signal. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. 1733-1736.
W. Luo, J. Huang, and G. Qiu. 2010a. JPEG error analysis and its applications to digital image forensics. IEEE Trans. Inf. Forensics Secur. 5, 3, 480-491.