Exposing MP3 Audio Forgeries Using Frame Offsets

RUI YANG, ZHENHUA QU, and JIWU HUANG, Sun Yat-sen University

Audio recordings should be authenticated before they are used as evidence. Although audio watermarking and signatures are widely applied for authentication, both techniques require access to the original audio before it is published. Passive authentication is therefore necessary for digital audio, especially for the most popular audio format, MP3. In this article, we propose a passive approach to detect forgeries in MP3 audio. During MP3 encoding the audio samples are divided into frames, so each frame has its own frame offset after encoding. Forgeries break the framing grids, so the frame offset is a good indicator for locating forgeries, and it can be retrieved by identifying the quantization characteristic. In this way, the doctored positions can be located automatically. Experimental results demonstrate that the proposed approach is effective in detecting common forgeries such as deletion, insertion, substitution, and splicing. Even when the bit rate is as low as 32 kbps, the detection rate is above 99%.

Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General; K.6.5 [Management of Computing and Information Systems]: Security and Protection

General Terms: Security, Algorithms, Verification

Additional Key Words and Phrases: MP3 audio forgery, forgery detection, audio authentication

ACM Reference Format: Yang, R., Qu, Z., and Huang, J. 2012. Exposing MP3 audio forgeries using frame offsets. ACM Trans. Multimedia Comput. Commun. Appl. 8, S2, Article 35 (September 2012), 20 pages. DOI = 10.1145/2344436.2344441 http://doi.acm.org/10.1145/2344436.2344441

1. INTRODUCTION

With the development of digital voice recorders and cell phones, speech and conversations can now be easily recorded as evidence.
However, hearing cannot be believing, since these audio recordings can be tampered with very easily using pervasive audio editing software. An audio recording may contain important words or sentences synthesized from other audio, so authentication technologies need to be developed for digital audio. The existing audio authentication technologies can be divided into two groups: active authentication (including digital watermarking and digital signatures) and passive authentication. Active authentication requires access to the original audio before it is distributed, for example, to embed a watermark or generate a signature, while passive audio authentication means checking the integrity of an audio recording by analyzing its inherent properties. In most authentication cases, the audio does not actually contain any digital watermark or signature. Thus it is necessary to passively examine the integrity of the digital audio.

A portion of this article was presented at the 10th ACM Multimedia and Security Workshop. The work was supported in part by 973 Program (211CB3224) in China and NSFC (U11351, 6122497). J. Huang is also a visiting researcher of State Key Laboratory of Information Security, Beijing 119, China. Authors' addresses: R. Yang, Z. Qu, and J. Huang (corresponding author), Sun Yat-sen University, Guangzhou 516, China; email: isshjw@mail.sysu.edu.cn. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org. (c) 2012 ACM 1551-6857/2012/09-ART35 $15.00 DOI 10.1145/2344436.2344441 http://doi.acm.org/10.1145/2344436.2344441

ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 8, No. S2, Article 35, Publication date: September 2012.

Until now, there have been few works on passive authentication for digital audio. Based on the assumption that a natural signal has weak higher-order statistical correlations in the frequency domain and that forgery in speech would introduce unnatural correlations, Farid [1999] used bispectral analysis to detect digital forgery in speech signals. It was shown that the zero phase of the bispectrum decreased considerably for forged speech. However, the method is only suitable for uncompressed audio. Grigoras [2005] pointed out that digital equipment captures not only the intended speech but also the 50/60 Hz Electric Network Frequency (ENF) when recording. The ENF criterion could be used to check the integrity of digital audio recordings and to verify the exact time when a digital recording was created. This could be done by comparing the ENF of audio recordings with a reference frequency database from the electric company or the laboratory. The method is highly dependent on the accuracy of the extracted ENF, while the ENF is quite a weak signal compared to the audio recording. Dittmann et al. [Kraetzer et al. 2007] proposed a method to determine the authenticity of the speaker's environment. Their paper stated that the extraction of background features in an audio stream could provide an informative basis for determining the location of its origin and the microphone used. However, a lot of audio recordings are required for training.

The MP3 audio format is used in most applications, and is now the most popular format in digital voice recorders. The top 20 best-selling digital voice recorders on amazon.com all support the MP3 format, and some of them only support the MP3 format.
For most cell phones, the default recording format is MP3. Digital voice recorders and cell phones are the recording devices people use most frequently in daily life. It would be fairly easy to remove complete sections of a recording or to splice two sentences from different recordings. Small changes in the audio stream can change the meaning of a whole sentence. Exposing forgeries in MP3 files can authenticate the daily recordings presented as evidence in criminal and civil court cases, such as undercover surveillance recordings made by the police, recordings presented by feuding parties in a divorce, recorded telephone conversations in domestic violence cases, and recordings from corporations seeking to prove employee wrongdoing or industrial espionage. At the same time, forgery detection solutions are needed by manufacturers of audio recording equipment.

There are as yet no reported passive authentication methods focusing on MP3 audio. An existing related work is the classification of MP3 encoders proposed by Boehm and Westfeld [2004]. That work outlines a method to discriminate 20 different MP3 encoders with 10 features. Experimental results show that these features classify MP3 encoders accurately and can improve the performance of MP3 steganalysis. The application of the method to passive authentication is not discussed in the paper. Theoretically the method could handle audio tampered by splicing recordings from different recorders, but tampering within a single audio recording is outside its scope. As MP3 audio becomes popular, it is necessary to develop passive approaches to check the integrity of MP3 audio.

Passive authentication of JPEG images and MPEG video has attracted many researchers. Several approaches have been proposed, such as the quantization-table-based method [Lukas and Fridrich 2003], the periodical-artifacts-based method [Popescu and Farid 2004], the Benford's-law-based method [Fu et al.
2007], and the shifted double JPEG detection-based method [Qu et al. 2008]. One direct question arises: can these methods be applied to passive authentication of MP3? Unfortunately, direct extension of the existing JPEG methods to MP3 audio does not work, because there are many differences between MP3 compression and JPEG compression. For example, an MP3 encoder divides the time-domain samples into frames with 50% overlap, while JPEG compression uses no overlap. This makes it impossible to detect block artifacts in MP3 compression. The calculation and quantization in MP3 compression are performed with floating-point representation, so the quantization-table-based method for JPEG, which performs well with integer numbers, is useless for MP3 compression.

Fig. 1. Block diagram of MP3: (a) encoder; (b) decoder.

In this article we propose a forgery detection method for digital audio in MP3 format. Note that forgeries of MP3 files are always performed in three steps: first decoding, then tampering, and finally re-encoding. Based on the discovery that forgeries break the original frame segmentation, we utilize frame offsets to locate forgeries automatically. The original frame offsets are retrieved by a quantization characteristic. Extensive experiments show that the proposed method can detect most common forgeries, such as deletion, insertion, substitution, and splicing. At the same time, the proposed method is robust to some common postprocessing operations such as filtering and adding noise.

The article is organized as follows. In Section 2, we give a brief analysis of MP3 coding and claim that only an identical frame offset can reveal the quantized spectral characteristic. We then develop a method to detect frame offsets in Section 3. Based on the detection method, we propose in Section 4 that changes in frame offsets can locate forgeries effectively. The experimental results are shown in Section 5. Finally, we conclude our article with a discussion and future work in Section 6.

2. ANALYSIS OF MP3 COMPRESSION CHARACTERISTICS

In this section, we first give a brief overview of MP3 coding, then explain two important concepts of this article: frame offset and quantization characteristics. In Section 2.1 we only explain those principles that are relevant to our detection method, especially the spectral decomposition and quantization. For the detailed architecture and specification of MP3 coding, refer to ISO [1992].
In Section 2.2, the definition of frame offset is demonstrated via an example. In Section 2.3, the quantization characteristics are analyzed.

2.1 MP3 Coding

Figure 1(a) shows the block diagram of a typical MP3 encoder [Painter and Spanias 2000]. The input PCM signal is first separated into 32 sub-bands by the analysis filterbank, and the Modified Discrete Cosine Transform (MDCT) window further divides each of these 32 sub-bands into 18 sub-bands (long windows) or 6 sub-bands (short windows). A total of 576 or 192 spectral lines are thus generated, respectively.

Fig. 2. Framing grids and frame offsets. The top panel shows three continuous framing grids for the first encoding, and the bottom panel shows the corresponding framing grids for the second encoding. The frame offsets of the three framing grids are identical.

The psychoacoustic model analyzes the audio content and estimates the masking thresholds. The output of this model consists of the just noticeable noise level for each sub-band and the information about the window type for the MDCT. According to the masking thresholds estimated by the psychoacoustic model, the spectral values are quantized via a power-law quantizer. The quantization step uses an iterative algorithm to control both the bit rate and the distortion level, so that the perceived distortion is as small as possible within the limits of the desired bit rate. Finally, the quantized spectral values are encoded using Huffman code tables to form a bitstream.

The block diagram of the MP3 decoder is shown in Figure 1(b). First, Huffman decoding is performed on the MP3 bitstream, and then the decoder restores the quantized MDCT coefficient values and the side information related to them, such as the window type assigned to each frame. After inverse quantization, the coefficients are inverse-MDCT transformed to the sub-band domain. Finally, the PCM waveforms are reconstructed by the synthesis filterbank.

2.2 Frame Offset

The frame offset [Yang et al. 2008] is defined in this article as the number of samples by which the frame grid is shifted between the first and second encoding. As noted, forgeries of MP3 files are always performed in three steps: first decoding, then tampering, and finally re-encoding. So the frame offset becomes nonzero when forgeries are conducted on MP3 files, and is always zero when there is no forgery. Figure 2 illustrates how the frame offset is generated. When performing the first encoding, the framing grids of the original signal are as shown at the top of Figure 2.
Each framing grid contains 1152 samples with 50% overlap. After decoding, some extra zero samples are added at the beginning of the signal by the decoder. During the second encoding, new framing grids are generated. Obviously, if forgeries occur, the frame offsets of some frames may change.

Fig. 3. Unquantized and quantized spectral coefficients: (a) and (b) are in a real-value form, while (c) and (d) are in a logarithmic representation (magnitude in dB versus frequency index). The major difference between the unquantized and quantized spectra is the number of zero coefficients, which are shown as troughs: panel (c), unquantized, shows no troughs, while panel (d), quantized, shows many troughs.

2.3 Quantization Characteristics

Many spectral coefficients are usually quantized to zero during encoding. This happens because some spectral components are completely masked by other components, and because many coefficients lie around zero as a result of the inherent probability distribution of the spectral coefficients. The increase in zero spectral coefficients is a quantization characteristic of MP3 coding. This characteristic was first described by Herre and Schug [2000] and Herre et al. [2002], who utilized it to optimize cascaded audio coding. In the following, we analyze this characteristic.

The difference between an unquantized spectral coefficient and its quantized counterpart is not easily visible in real-value form, as illustrated in Figures 3(a) and (b). But the two can be discriminated by looking at the spectral coefficients in a logarithmic representation. As shown in Figures 3(c) and (d), there are many zero values which appear as troughs in the quantized spectrum, while this phenomenon cannot be found in the unquantized spectrum.
These troughs in the spectral representation will be visible only if the framing grids are the same as those in the first encoding. This means that these troughs appear only if the frame offset identical to that of the first encoding is applied. This fact is illustrated by Figure 4, which shows MDCT coefficients of a decoded signal with a one-sample left shift (offset = -1), no shift (offset = 0), and a one-sample right shift (offset = +1) from the encoder framing grid, respectively. As we see, the troughs disappear even when the frame offset is only a one-sample shift in the decoded signal.

Fig. 4. Spectral coefficients (magnitude in dB versus frequency index) with frame offsets of -1, 0, and +1 samples: (a) offset = -1; (b) offset = 0; (c) offset = +1. The quantization characteristics appear only if the correct frame offset (0) is applied.

3. METHOD OF RETRIEVING FRAME OFFSETS

The key to detecting frame offsets is the identification of quantization characteristics. In this section, we develop a method of retrieving frame offsets based on the observations in the previous section.

3.1 Number of Active Coefficients

From Figure 4, it is noted that a significant difference between spectral coefficients without offset (Figure 4(b)) and with offset (Figures 4(a) and (c)) is the number of active (nonzero) spectral coefficients. For convenience, we denote the number of active coefficients as NAC in this article. In Figure 4, the NACs for offsets -1 and +1 (shifted offsets) are 36 and 3, respectively, while the NAC for offset 0 (matching offset) is only 197. For a robust and automatic identification of the characteristic spectrum, the NAC as a function of frame offset can be used as a feature. Such a criterion yields reliable results, as shown in Figure 5. We observe that the beginning of each frame is clearly detectable by an obvious decrease in the NAC. A period of 576 can be observed. Why is there a period of 576? It is noted that 576 = 1152 x 50%, where 1152 is the length of a frame and 50% is the amount of overlap specified by the MP3 standard.
A frame with offset 576 exactly corresponds to the next frame.
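The 576-sample period can be checked directly on the framing grid. The following is a minimal sketch, assuming only the frame length (1152) and hop size (576) stated above; the helper name `split_frames` and the random test signal are hypothetical and are not part of the original method.

```python
import numpy as np

FRAME_LEN = 1152  # samples per frame, as stated in the text
HOP = 576         # 50% overlap, so consecutive frames start 576 samples apart


def split_frames(x, offset):
    """Prepend `offset` zero samples, then cut 1152-sample frames every 576 samples."""
    shifted = np.concatenate([np.zeros(offset), x])
    count = (len(shifted) - FRAME_LEN) // HOP + 1
    return np.stack([shifted[i * HOP:i * HOP + FRAME_LEN] for i in range(count)])


rng = np.random.default_rng(1)
signal = rng.standard_normal(10 * HOP)  # arbitrary stand-in for decoded audio

frames_0 = split_frames(signal, 0)
frames_576 = split_frames(signal, HOP)

# Shifting by one hop reproduces the same frames, delayed by one frame index.
same_grid = np.array_equal(frames_576[1:], frames_0)
print(same_grid)
```

Because an offset of 576 yields the same set of frames, any per-frame feature such as the NAC repeats with that period, which is why only offsets within one hop need to be searched.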

Fig. 5. NACs via different frame offsets. NAC achieves minimums when the frame offsets are multiples of 576.

3.2 Theoretical Analysis

Now let us examine why the quantization characteristics appear only if the matching offset is applied. It arises from an inherent property of the MDCT. The MDCT transform performed in MP3 coding is as follows [Wang and Velermo 2003].

\[ X^{(p)}[k] = \sqrt{\frac{2}{N}} \sum_{n=0}^{2N-1} x^{(p)}[n]\, h[n] \cos\left( \frac{\pi}{N} \left( n + \frac{N+1}{2} \right) \left( k + \frac{1}{2} \right) \right), \quad 0 \le k \le N-1 \tag{1} \]

By applying an inverse-MDCT transform to the frame, we get 2N time-aliased samples.

\[ \hat{x}^{(p)}[n] = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} X^{(p)}[k] \cos\left( \frac{\pi}{N} \left( n + \frac{N+1}{2} \right) \left( k + \frac{1}{2} \right) \right), \quad 0 \le n \le 2N-1 \tag{2} \]

In order to cancel the aliasing and get the original samples, we have to use the OLA (Overlapping Addition) procedure. An inverse-MDCT is applied to the previous and the next frame. Then, each of the resulting aliased segments is multiplied by its corresponding window function and the overlapping time segments are added together. We thus recover the original samples.

\[ x^{(p)}[n] = \begin{cases} \hat{x}^{(p-1)}[n+N]\, h[N-n-1] + \hat{x}^{(p)}[n]\, h[n], & 0 \le n \le N-1 \\ \hat{x}^{(p)}[n]\, h[2N-n-1] + \hat{x}^{(p+1)}[n-N]\, h[n-N], & N \le n \le 2N-1 \end{cases} \tag{3} \]

Denote

\[ \tilde{x}^{(p)}[n] = x^{(p)}[n]\, h[n], \quad 0 \le n \le 2N-1. \tag{4} \]

If a signal exhibits local symmetry such that

\[ \begin{cases} \tilde{x}^{(p)}[n] = \tilde{x}^{(p)}[N-n-1], & 0 \le n \le N-1 \\ \tilde{x}^{(p)}[n] = -\tilde{x}^{(p)}[3N-n-1], & N \le n \le 2N-1 \end{cases} \tag{5} \]

its MDCT coefficients become zero. That is, $X^{(p)}[k] = 0$ for $k = 0, \ldots, N-1$. In Wang et al. [2000], it has been proven that $\tilde{x}^{(p)}[n]$ fulfills Eq. (5) if $X^{(p)}[k] = 0$.

This inherent property of the MDCT explains why the NAC decreases significantly only if the identical frame offset is applied. After MP3 encoding, many spectral coefficients are masked or quantized to zero. When decoding, these zero spectral coefficients are restored to the time domain, and $\tilde{x}^{(p)}[n]$ fulfills Eq. (5). When performing the MDCT on the decoded data with the frame offset identical to that of the first encoding process, we get many $X^{(p)}[k]$ equal to zero. If there is a different frame offset, the local symmetry in Eq. (5) is broken, and the corresponding spectral values $X^{(p)}[k]$ will not be zero.

Table I. Mean Value and Standard Deviation of NACs at Different Bit Rates

Bit rate    Shifted NACs         Matching NAC
            Mean      Std        Mean      Std
32 kbps     175.61    13.45      67.8      12.34
64 kbps     313.46    19.99      178.38    11.6
96 kbps     331.72    18.3       249.15    25.7
128 kbps    345.45    19.14      31.23     25.6

3.3 Experiments on Retrieving Frame Offsets

To illustrate the preceding analysis, we randomly select 30 different audio frames, and encode these frames with LAME v3.97 [LAME 2012] at bit rates of 32 kbps, 64 kbps, 96 kbps, and 128 kbps, respectively. For each bit rate, we apply offsets from -575 to 575 to these frames, and calculate the NACs corresponding to all offsets. We thus get 1151 NACs in total for each frame. The 1150 NACs corresponding to wrong offsets are called shifted NACs, and the NAC corresponding to the correct offset is called the matching NAC. The shifted NACs and the matching NAC are plotted, respectively. As shown in Figure 6, for each bit rate, there are 30 boxes representing the distribution of shifted NACs. As shown in Figure 6(a), the minimum value of the shifted NACs is larger than 150 for each frame, while the matching NAC is below 80. For all frames, we observe that the matching NAC is clearly distinguishable from the shifted NACs. The cases of 64 kbps, 96 kbps, and 128 kbps are illustrated in Figures 6(b), (c), and (d), respectively. Although frames may be encoded with different bit rates, the matching NAC is always smaller than the shifted NACs. This means that we can regard the minimum NAC as the matching NAC.
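The behavior described above can be reproduced on a toy model. The sketch below is not the authors' experiment: it assumes a single-stage MDCT with a sine window and half-frame length N = 64 (instead of the MP3 hybrid filterbank with 576 lines), and it replaces the psychoacoustic power-law quantizer of LAME with a hypothetical uniform quantizer of step `STEP`. All names and constants are illustrative assumptions. It only shows that re-analyzing decoded samples on the original grid restores the zero coefficients, so the minimum NAC marks the matching offset.

```python
import numpy as np

N = 64        # toy half-frame length (assumption; MP3 long blocks use 576)
STEP = 0.05   # hypothetical uniform quantizer step
EPS = 1e-5    # |coefficient| at or below this counts as inactive (zero)

n = np.arange(2 * N)
k = np.arange(N)
WINDOW = np.sin(np.pi * (n + 0.5) / (2 * N))               # sine window h[n]
BASIS = np.sqrt(2.0 / N) * np.cos(
    np.pi / N * np.outer(k + 0.5, n + (N + 1) / 2.0))      # MDCT kernel, shape (N, 2N)


def mdct(frame):
    """Windowed forward MDCT of one 2N-sample frame."""
    return BASIS @ (WINDOW * frame)


def encode_decode(x):
    """Toy codec: MDCT, uniform quantization, inverse MDCT with overlap-add."""
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
    out = np.zeros_like(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        coeffs = mdct(padded[start:start + 2 * N])
        quantized = STEP * np.round(coeffs / STEP)          # small values become 0
        out[start:start + 2 * N] += WINDOW * (BASIS.T @ quantized)
    return out


def nac(decoded, start):
    """Number of active coefficients for the frame beginning at `start`."""
    return int(np.sum(np.abs(mdct(decoded[start:start + 2 * N])) > EPS))


rng = np.random.default_rng(0)
t = np.arange(40 * N)
x = 0.5 * np.sin(2 * np.pi * 0.03 * t) + 0.01 * rng.standard_normal(t.size)
decoded = encode_decode(x)

grid_start = 20 * N                      # a frame boundary of the first encoding
offsets = list(range(-(N - 1), N))       # every shift within one hop, both directions
nacs = [nac(decoded, grid_start + j) for j in offsets]
best_offset = offsets[int(np.argmin(nacs))]
print(best_offset, min(nacs))
```

In this toy setting the windowed MDCT frames form an orthogonal lapped transform, so analysis on the original grid returns the quantized coefficients up to rounding error, while any shifted grid mixes them and almost every coefficient becomes active.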
From Figure 6, we also notice that the distance between the shifted NACs and the matching NAC becomes smaller as the bit rate increases. This is because less signal distortion and information loss occur when the bit rate is higher, so the MDCT coefficients contain fewer zeros.

Fig. 6. The distribution of NACs corresponding to frame offsets from -575 to 575 on 30 different audio frames, which are encoded using LAME v3.97, mono. Each box stands for the distribution of the 1150 NACs with wrong offsets, while the isolated point is the NAC with the correct offset. Panels (a), (b), (c), and (d) show the cases of 32 kbps, 64 kbps, 96 kbps, and 128 kbps, respectively.

As the aforesaid investigation is based on only 30 frames, the conclusion may not be general enough. In the following, we take statistics on 128 frames, including 64 frames of speech and 64 frames of music. We compute the 1150 shifted NACs and the matching NAC for each frame. Table I displays the mean values and standard deviations of the NACs based on the 128 frames. The mean values of the shifted NACs and the matching NAC are clearly separated, and the standard deviations are all small compared to the mean values. However, as noted before, the difference between the shifted NACs and the matching NAC becomes small at a high bit rate, such as 128 kbps.

4. LOCATING FORGERIES VIA CHECKING FRAME OFFSETS

As audio samples are divided into frames for encoding, the frame offset can be useful evidence of tampering. When forgeries occur, all frames after the forged points are affected, and the detected offsets of the corresponding frames change.

Figure 7 is an example of cropping. The original sentence "I am not guilty" is recorded with a sampling rate of 44.1 kHz and saved in MP3 format by a digital recorder, as shown in Figure 7(a). We manipulate this audio recording with CoolEdit v2.1, and remove the key word "not". The meaning of the sentence becomes the opposite, "I am guilty", as shown in Figure 7(b). The detected offsets of all frames in the original audio and the doctored one are shown in Figure 7(c) and Figure 7(d), respectively. We observe that all frames in the original audio have the same offset. But for the doctored one, the detected offsets take two different values: 0 for frames 1 to 118, and 384 for the remainder. We can conclude that there is a forgery at frame 119.

Fig. 7. Example of locating one cropping. The sentence "I am not guilty" is cropped to "I am guilty", shown as (a) original waveform and (b) doctored waveform. (c) is the detection result of the original audio. The detected offsets of all frames are 0, which means there are no forgeries. (d) is the detection result of the doctored audio. The detected offsets change at frame 119, which means there is a forgery. Note that the horizontal axis represents samples in (a) and (b), but frames in (c) and (d); 160000 samples correspond to 277 frames.

From the previous example, we obtain the general procedure for locating forgeries: (i) detect the offsets of all frames; (ii) check the differences between frame offsets. Now how can the offsets of all frames be retrieved effectively?

Given an audio signal of L samples, we denote it in vector notation as $\mathbf{x}$, and mark the j-sample-shifted version (which means prepending j zero samples at the beginning of $\mathbf{x}$) as $\mathbf{x}^{(j)}$ ($0 \le j < 576$).

\[ \mathbf{x}^{(0)} = \mathbf{x}, \qquad \mathbf{x}^{(j+1)} = [0, \mathbf{x}^{(j)}], \quad j = 0, \ldots, 574 \]

For each offset j, we split $\mathbf{x}^{(j)}$ into frames of 1152 samples with 50% overlap, so we get $N = \lfloor L/576 \rfloor - 1$ frames in total as follows. We have

\[ [\hat{\mathbf{x}}_0^{(j)} \;\cdots\; \hat{\mathbf{x}}_{N-1}^{(j)}] = F \mathbf{x}^{(j)}, \]

where $F$ represents frame segmentation as well as applying the window function, and $\hat{\mathbf{x}}_k^{(j)}$ is the k-th frame of $\mathbf{x}^{(j)}$. We apply the filterbank and MDCT to each frame and obtain its spectrum (576 MDCT coefficients). We have

\[ \mathbf{s}_k^{(j)} = T \hat{\mathbf{x}}_k^{(j)}, \]

where $T$ represents both filtering by the filterbank and the MDCT, and $\mathbf{s}_k^{(j)}$ represents the spectrum of the k-th frame of $\mathbf{x}^{(j)}$. We change $\mathbf{s}_k^{(j)}$ into the logarithmic representation $M_k^{(j)}$.

\[ M_k^{(j)} = 10 \log_{10}\left( \max\left( \frac{\mathbf{s}_k^{(j)} \circ \mathbf{s}_k^{(j)}}{10^{-10}},\; 1 \right) \right) \]

This expresses $M_k^{(j)}$ in a logarithmic representation by projecting all values into the range [0, 100]. We then count the number of active values in $M_k^{(j)}$. We have

\[ c_k^{(j)} = C M_k^{(j)}, \]

where $C$ represents the counting operation. For frame k, the detected offset is

\[ \mathrm{offset}_k = \begin{cases} \arg\min_j c_k^{(j)}, & \text{if } \mathrm{mean}(c_k^{(j)}) - \min_j c_k^{(j)} \ge \theta, \\ -1, & \text{if } \mathrm{mean}(c_k^{(j)}) - \min_j c_k^{(j)} < \theta, \end{cases} \]

where $\mathrm{mean}(c_k^{(j)}) = \frac{1}{576} \sum_{j=0}^{575} c_k^{(j)}$ and $\theta$ is a threshold that determines whether the frame offset is detectable. In some cases the frame offset does not exist or cannot be detected; all $c_k^{(j)}$ are then close to each other, but there is always a $\min_j c_k^{(j)}$. So we need a threshold $\theta$ to indicate these cases, and we accept that the frame offset is detectable only when $\mathrm{mean}(c_k^{(j)}) - \min_j c_k^{(j)}$ is large enough. Otherwise the frame offset is undetectable.

Note that each frame is expected to have a 0 offset when there is no forgery, since there is no sample shift on any frame. However, the detection results of some frames come up with a nonzero offset when there is a forgery. To locate the forgeries, we simply difference $\mathrm{offset}_k$. If $\mathrm{offset}_k \ne \mathrm{offset}_{k-1}$, a forgery occurs at frame k.

5. EXPERIMENTAL RESULTS

5.1 Illustration of Locating Forgeries

In Section 4, we showed that the proposed method can locate one deletion correctly. However, the frame offset method is effective not only for one deletion, but also for multiple deletions. Here we demonstrate an example where a sentence consists only of numbers, as often appears in witness statements.
As shown in Figure 8, three numbers are cropped away from the original sentence. The detected offsets of all frames in the doctored audio are shown in Figure 8(c). We observe that the frame offsets change at the 7th, 18th, and 47th frame. This means that forgeries occur at these locations. Figure 8 shows that if manipulations of the MP3 audio destroy the frame segmentation of the previous encoding, the frame offset method is able to locate those forgeries.

After insertion, the doctored audio is separated into three segments. Obviously the three segments have different frame offsets. Figure 9 shows an example of insertion detection. The method locates those forgeries very precisely.

As two spliced parts often come from different sources, they often have different frame offsets, so our method is also effective for detecting splicing. The case of substitution is illustrated in Figure 10.
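The decision rule of Section 4 and the differencing step can be sketched as follows. This is a minimal illustration, assuming the 576 NAC values per frame have already been computed; the function names, the threshold value 50, and the choice to skip undetectable frames (offset -1) when comparing neighbors are assumptions made here, not details given in the text. The offset sequence mimics the cropping example of Figure 7 (0 for frames 1 to 118, 384 afterwards, 277 frames in total).

```python
import numpy as np


def detect_offset(nac_per_offset, theta):
    """Return argmin_j c^(j) if mean(c) - min(c) >= theta, else -1 (undetectable)."""
    c = np.asarray(nac_per_offset, dtype=float)
    if c.mean() - c.min() >= theta:
        return int(np.argmin(c))
    return -1


def locate_forgeries(offsets):
    """Return 1-based frame indices whose offset differs from the previous detectable frame.

    Skipping frames marked -1 is an assumption of this sketch.
    """
    suspects = []
    previous = None
    for frame, off in enumerate(offsets, start=1):
        if off == -1:
            continue
        if previous is not None and off != previous:
            suspects.append(frame)
        previous = off
    return suspects


# Hypothetical NAC curves: one with a clear trough at offset 384, one nearly flat.
peaked = [300] * 576
peaked[384] = 190
flat = [300, 299] * 288

print(detect_offset(peaked, theta=50))  # 384
print(detect_offset(flat, theta=50))    # -1

# Offsets as in the cropping example: 0 for frames 1..118, 384 for the remainder.
offsets = [0] * 118 + [384] * 159
print(locate_forgeries(offsets))        # [119]
```

A multiple-deletion case such as Figure 8 would simply produce several entries in the returned list, one per change of offset.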

Fig. 8. Example of locating multiple deletions. Three numbers are cropped away from a series of numbers, shown as (a) waveform of the original audio ("one two three four five six seven eight nine") and (b) waveform of the doctored audio ("one three five six seven nine"). (c) is the detection result of the doctored audio. Frame offsets change at the 7th, 18th, and 47th frames, which means there are forgeries at these frames.

5.2 Extensive Experiments

Our experiments also include extensive tests on different types of audio clips. Our test audio includes 64 speech clips (each 30 s long) and 64 music clips (each 30 s long). The original audio clips are in WAV format, 22.05 kHz, 16 bit, mono. We use LAME 3.97 to encode the audio clips into MP3 with bit rates of 32 kbps, 64 kbps, and 96 kbps, respectively. Each clip then consists of 1142 frames. For each clip, we randomly select 10 frames, and perform a 2-sample deletion and a 2-sample insertion on each frame, respectively. So for each bit rate, we test our approach on 1280 doctored frames with deletion and another 1280 frames with insertion.

We apply our method to these audio clips. The false positive error measures the undoctored frames incorrectly identified as doctored, while the false negative error represents the doctored ones that are not detected. We denote the false positive error rate and false negative error rate as $f_p$ and $f_n$, respectively. The accurate detection rate AR is calculated as follows.

\[ AR = \left( 1 - \frac{f_p + f_n}{2} \right) \times 100\% \tag{6} \]

The test results for speech and music are shown in Table II and Table III, respectively. As we see, whether we are locating deletion or insertion in these audio frames, all accuracy rates are above 99%.
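Eq. (6) is a balanced average of the two error rates. The short sketch below evaluates it for hypothetical error rates chosen only for illustration; they are not values from Table II or Table III, and the function name is an assumption.

```python
def accuracy_rate(false_positive_rate, false_negative_rate):
    """Accurate detection rate AR in percent, following Eq. (6).

    Both arguments are fractions in [0, 1].
    """
    return (1.0 - (false_positive_rate + false_negative_rate) / 2.0) * 100.0


# Hypothetical rates: 1% false positives and 0.5% false negatives.
example_ar = accuracy_rate(0.01, 0.005)
print(round(example_ar, 2))  # 99.25
```

Averaging the two rates keeps AR meaningful even though undoctored frames vastly outnumber doctored ones in each clip.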
We notice that the detection results at low bit rates are slightly better than those at high bit rates. This is because MP3 audio with a lower bit rate has stronger compression traces, so the frame offset can be detected more accurately. The $f_p$ values for speech are higher than those for music, while the opposite is the case for the $f_n$ values. This may be due to the presence of fewer silent samples in the music clips, since frame offset detection on silent portions introduces errors more easily.

Table II. Detection Results for Speech

Forgery Type   Bit Rate   f_p     f_n     AR
deletion       32 kbps    0.50%   0.03%   99.73%
deletion       64 kbps    0.90%   0.14%   99.48%
deletion       96 kbps    1.12%   0.34%   99.27%
insertion      32 kbps    0.51%   0.03%   99.73%
insertion      64 kbps    0.85%   0.20%   99.47%
insertion      96 kbps    1.01%   0.37%   99.31%

Table III. Detection Results for Music

Forgery Type   Bit Rate   f_p     f_n     AR
deletion       32 kbps    0.20%   0.27%   99.76%
deletion       64 kbps    0.27%   0.47%   99.63%
deletion       96 kbps    0.32%   0.61%   99.53%
insertion      32 kbps    0.16%   0.20%   99.82%
insertion      64 kbps    0.23%   0.42%   99.67%
insertion      96 kbps    0.28%   0.45%   99.63%

Fig. 9. Example of locating insertion. Panels: (a) original waveform 1 ("I don't think so"); (b) original waveform 2 ("I agree with it"); (c) forgery waveform ("I don't agree with it"); (d) detection result of doctored audio (detected offset versus frame). The key word "don't" from (a) is inserted into the sentence in (b), giving the forgery in (c). (d) is the detection result of the doctored audio. Frame offsets change at the 48th and 100th frames, which means there are forgeries at these frames.

Fig. 10. Example of locating substitution. Panels: (a) original waveform 1 ("I like it"); (b) original waveform 2 ("I hate doing that"); (c) forgery waveform ("I like doing that"); (d) detection result of doctored audio (detected offset versus frame). The key word "hate" in (b) is replaced by "like" from (a), giving the forgery in (c). (d) is the detection result of the doctored audio. Frame offsets change at the 48th and 90th frames, which means there are forgeries at these frames.

It is noted that the detection rate cannot reach 100%. In some special cases our method fails to locate forgeries. When the frame contains many zero samples, for example, one half, the correct offset cannot be detected via NAC, as shown in Figure 11. The actual offset of the frame is 2. However, the detected offset is 575. As different offsets are applied, the number of zero samples varies rapidly, which leads to an unstable NAC.

Fig. 11. An example of a failure case. Shown in (a) is the waveform of one frame with an undetectable frame offset (amplitude versus sample index). Shown in (b) are the NACs obtained with different frame offsets (NAC versus frame offset).

5.3 Sensitivity and Robustness

In this subsection, we discuss the sensitivity and robustness of the proposed method against a variety of attack schemes.

5.3.1 Splicing at the Boundary. If the adversary is smart enough to splice or crop an exact multiple of 576 samples so as to hit the exact boundary of a frame, will the detection method still work? After generating the desired audio, the adversary only needs to adjust some (1 to 575) samples to match the frame boundary. Because 1 to 575 samples last no more than 575/44100 = 0.013 s at a 44.1 kHz sampling rate, this adjustment would not affect the meaning of the desired audio. Thanks to the 50% overlap framing used during MP3 encoding, we can still find the trace of this forgery. We give a demonstration in Figure 12. Suppose that a forgery occurs at the boundary of frame k, where exactly 576 samples are cropped. The spectrum of the new frame k + 1 will not have the quantization characteristic with any offset, but frame k and frame k + 2 still have many troughs with the original offset.

5.3.2 Additive Noise. Additive noise may be added to the tampered speech to cover forgeries, which presents a challenge for forgery detection. To investigate the robustness of the proposed scheme under additive noise, a short speech clip consisting of 45 frames is tested. White Gaussian noise of 3dB is added to the audio samples of the 20th frame, as shown in Figure 13(a). Since both the 19th and 21st frames overlap the 20th frame by 50%, the 19th and 21st frames are half doctored at the same time. Then we investigate the effect of additive noise on NAC. Offsets from 0 to 575 are applied to all frames, and the corresponding NACs are recorded and plotted vertically, as shown in Figure 13(b). All the plots contain a significantly small value except those of the 18th, 19th, 20th, 21st, and 22nd frames. This means that the frame offsets of all frames except these five can be detected via NAC. Since there is no such remarkable decrease among the NACs of the 18th, 19th, 20th, 21st, and 22nd frames, the frame offsets of these five frames are undetectable and are marked with a special value -1, as mentioned in Section 3. The detection result of the tampered speech is shown in Figure 13(c). The detection result shows that the proposed method can resist locally added noise, which means that forgeries covered by noise can be located.
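The way a per-frame detection result such as Figure 13(c) is read can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name, the 1-based frame numbering, and the choice to compare each detectable frame with the most recent detectable frame before it are all hypothetical, and the step that estimates the offsets from NAC (Section 3) is not reproduced here.

```python
from typing import List, Tuple

UNDETECTABLE = -1  # special value marking a frame whose offset cannot be detected


def locate_suspect_frames(offsets: List[int]) -> Tuple[List[int], List[int]]:
    """Return (change_points, undetectable) as 1-based frame numbers.

    change_points: frames whose detected offset differs from that of the
                   most recent detectable frame before them.
    undetectable:  frames marked with UNDETECTABLE.
    """
    change_points: List[int] = []
    undetectable: List[int] = []
    last_offset = None
    for frame_no, offset in enumerate(offsets, start=1):
        if offset == UNDETECTABLE:
            undetectable.append(frame_no)
            continue
        if last_offset is not None and offset != last_offset:
            change_points.append(frame_no)
        last_offset = offset
    return change_points, undetectable


# Illustrative 45-frame clip: frames 18 to 22 are undetectable, and all other
# frames are assumed to share offset 0 (made-up values for this example).
offsets = [0] * 17 + [UNDETECTABLE] * 5 + [0] * 23
changes, unknown = locate_suspect_frames(offsets)
print(changes)  # []
print(unknown)  # [18, 19, 20, 21, 22]
```

In this sketch a run of undetectable frames is reported separately from offset changes, so a locally noised region shows up even when the offsets on both sides of it agree.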
However, if the noise is added globally after the forgeries, all the frame offsets become undetectable and are marked as -1. In this case, the proposed method is not able to locate the forgeries, but it still indicates that the audio is abnormal and must have been postprocessed. Such audio is suspect and is rejected as evidence.

Fig. 12. The case of splicing at the boundary. Panels: (a) waveform of original audio and doctored audio (amplitude versus sample index); (b) spectra of frame k, frame k + 1, and frame k + 2 (magnitude in dB). Shown in (a) is a waveform of audio from which 576 samples are cropped starting at the 1153rd sample. Shown in (b) are the spectra of the three frames of the doctored audio. All the frames have the quantization characteristic except the middle frame.

Fig. 13. The effect of additive noise on NAC. Panels: (a) audio with additive noise; (b) NAC result of each frame; (c) detection result of each frame (detected offset versus frame). Shown in (a) is the waveform of audio with partially added noise. Shown in (b) are the NAC results of all frames. Shown in (c) is the detection result of frame offsets.

5.3.3 Filtering. Another common way to cover forgeries is to filter the tampered signal. Here we test with a median filter, a mean filter, and a low-pass filter. The same speech clip as in the preceding section is selected for testing. Since the effect of the different filters on NAC is similar, and because of space limitations, only the result of the median filter is illustrated.
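To make the local filtering attack concrete, the sketch below median-filters the samples of a single frame and leaves the rest of the signal untouched. It is an illustration only: the function name, the mapping of frame numbers to non-overlapping 576-sample blocks (which ignores the 50% overlap and any encoder delay), the truncated windows at the signal edges, and the default window length of 7 are assumptions of this example.

```python
from statistics import median
from typing import List, Sequence

FRAME_LEN = 576  # samples per frame, matching the 0 to 575 offset range


def median_filter_frame(samples: Sequence[float], frame_no: int,
                        window: int = 7) -> List[float]:
    """Median-filter only the samples of one frame (1-based frame number)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    start = (frame_no - 1) * FRAME_LEN
    if frame_no < 1 or start >= len(samples):
        raise ValueError("frame_no is outside the signal")
    stop = min(start + FRAME_LEN, len(samples))
    half = window // 2
    out = list(samples)
    for i in range(start, stop):
        lo = max(0, i - half)                 # window is truncated at the edges
        hi = min(len(samples), i + half + 1)
        out[i] = median(samples[lo:hi])       # always read the unfiltered input
    return out


# Illustrative signal: three frames of silence with two isolated spikes.
signal = [0.0] * (3 * FRAME_LEN)
signal[5] = 1.0                 # spike in frame 1 (not filtered)
signal[FRAME_LEN + 10] = 1.0    # spike in frame 2 (filtered)
filtered = median_filter_frame(signal, frame_no=2)
print(filtered[5], filtered[FRAME_LEN + 10])  # 1.0 0.0
```

Because the filter window reaches a few samples into the neighbouring blocks, and because MP3 frames overlap by 50%, the frames adjacent to the filtered one are also affected in practice.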

First, the 20th frame of the audio signal is filtered by a median filter of length 7, as shown in Figure 14(a). Since both the 19th and 21st frames overlap the 20th frame by 50%, the 19th and 21st frames are half filtered at the same time. Then the NACs of all frames are investigated and the proposed detection method is applied to the whole speech clip. As shown in Figure 14(b), similar to the case of adding noise, the NAC plots of the 18th, 19th, 20th, 21st, and 22nd frames show no significant decrease, while the plots of the other frames contain an obviously small value. The detection result in Figure 14(c) shows that the frame offsets of the 18th, 19th, 20th, 21st, and 22nd frames are undetectable, while the other frames have an obvious offset of 0. This means that the proposed method can indicate the filtered portion of an audio signal if the signal is partially filtered. However, as in the case of adding noise, if the audio signal is globally filtered, the proposed method cannot locate forgeries automatically, but it still indicates that the filtered signal has been manipulated.

Fig. 14. The effect of median filtering on NAC. Panels: (a) audio with filtering; (b) NAC result of each frame; (c) detection result of each frame (detected offset versus frame). Shown in (a) is the waveform of the partially filtered audio. Shown in (b) are the NAC results of all frames. Shown in (c) is the detection result of frame offsets.

6. DISCUSSIONS AND CONCLUSIONS

6.1 Extension to Other Formats

Although we only investigate audio in MP3 format, the idea of locating forgeries via the frame offset is also suitable for audio in other compressed formats, such as AAC, WMA, and OGG Vorbis. Since audio in these formats is generated frame by frame, the frame offset of each frame is obtainable.
To confirm this, we test an audio signal encoded with AAC. Note that the length of each frame in AAC is 1024, and the frequency spectrum also consists of MDCT coefficients. The tool we use to encode and decode audio signals is FAAC [FAAC 2012]. The test clip consists of 40 frames of audio, and its sampling rate is 44.1 kHz. The encoding parameters of FAAC are 96 kbps, mono. First, we investigate whether the AAC audio has the quantization characteristic. Offsets -1, 0, and +1 are applied to the 9th frame, respectively. For each offset, 1024 MDCT coefficients can be obtained. Then we plot these coefficients in a logarithmic representation, as shown in Figures 15(a), (b), and (c). It is obvious that only Figure 15(b) shows the quantization characteristic. Furthermore, we apply offsets 1 to 2500 to the frame and obtain the corresponding NAC results, as shown in Figure 15(d). A period of 1024 can be observed. Within the length of the frame, there is only one matching offset, and its NAC is clearly distinguishable from the other 1023 NACs.

Fig. 15. Quantization characteristic of AAC. Subfigures (a), (b), and (c) correspond to the spectra of the 9th frame (magnitude in dB versus frequency index) with offsets -1, 0, and +1, respectively. As in the case of MP3, the quantization characteristic shows up only with the matching offset (0). Subfigure (d) shows the NAC result of the 9th frame (NAC versus frame offset) with offsets 1 to 2500.

Now we check forgeries in AAC audio. The audio with 40 frames has 40960 samples in total. We delete samples from index 1 to 15. Then we apply the proposed method to the doctored AAC audio. Each frame generates 1024 NACs, and the matching offset is recognized as the one that minimizes NAC. The detection result is shown in Figure 16. We thus show that the proposed method can detect forgeries in AAC audio.

Our method can also be extended to other frame-based encoders, since the spectrum obtained with the matching offset approximates the first-encoding spectrum more closely than the spectra obtained with other shifted offsets. What must be kept in mind is that the procedure for extracting the spectrum varies between encoders, since they use different frame lengths and windows.

6.2 Conclusions

In this article, we propose a method to expose MPEG audio forgeries using frame offsets. The main contributions of this work are as follows. First, to the best of our knowledge, this is the first piece of work on detecting forgeries in MP3 audio.
It extends the research topics of forgery detection. Second, this work illustrates that MDCT coefficients can reflect forgery traces very well for MPEG audio. Via theoretical analysis and extensive experiments, we show that NAC is a reliable feature for retrieving frame offsets. Based on the fact that most common forgeries change the frame offsets of audio, the proposed method can locate these forgeries effectively. Extensive experimental results show that the proposed method performs very well on both speech and music. All the accuracy rates are above 99%, which shows the effectiveness of our proposed method. Another advantage of the proposed method is its computational simplicity: we only need to investigate the MDCT coefficients of the audio.

However, if audio is transcoded between different compressed formats, the frame offset is difficult to obtain and the proposed method will fail. It is also noted that at a high bit rate such as 128 kbps the NAC method is not well suited to retrieving frame offsets, since zero coefficients are few at high bit rates. In the future, we will therefore focus on obtaining the frame offset after transcoding and at high bit rates.

Fig. 16. Forgery detection result for AAC audio. Panels: (a) original audio; (b) doctored audio; (c) detection result (detected offset versus frame index).

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their constructive comments. Their suggestions will be very helpful for our future work.

REFERENCES

BOEHM, R. AND WESTFELD, A. 2004. Statistical characterisation of MP3 encoders for steganalysis. In Proceedings of the 6th ACM Multimedia and Security Workshop. ACM.
FAAC. 2012. Freeware advanced audio coder. http://www.audiocoding.com/faac.html.
FARID, H. 1999. Detecting digital forgeries using bispectral analysis. MIT AI Memo AIM-1657, MIT.
FU, D., SHI, Y., AND SU, W. 2007. A generalized Benford's law for JPEG coefficients and its applications in image forensics. In Proceedings of SPIE Conference on Security, Steganography, and Watermarking of Multimedia Contents.
GRIGORAS, C. 2005. Digital audio recording analysis: The electric network frequency (ENF) criterion. Int. J. Speech Lang. Law 2, 1, 63-76.
HERRE, J. AND SCHUG, M. 2000. Analysis of decompressed audio - The inverse decoder. In Proceedings of the 109th AES Convention.
HERRE, J., SCHUG, M., AND GEIGER, R. 2002. Analysing decompressed audio with the inverse decoder - Towards an operative algorithm. In Proceedings of the 112th AES Convention.

ISO. 1992. ISO/IEC international standard IS 11172-3. Information technology - Coding of moving pictures and associated audio for digital storage media up to about 1.5 Mbit/s. http://www.iso.org/iso/catalouge detail.htm?csnumber=22412.
KRAETZER, C., OERMANN, A., DITTMANN, J., AND LANG, A. 2007. Digital audio forensics: A first practical evaluation on microphone and environment classification. In Proceedings of the 9th ACM Multimedia and Security Workshop.
LAME 3.97. 2012. MP3 encoder. http://lame.sourceforge.net.
LUKAS, J. AND FRIDRICH, J. 2003. Estimation of primary quantization matrix in double compressed JPEG images. In Proceedings of the Digital Forensic Research Workshop.
PAINTER, T. AND SPANIAS, A. 2000. Perceptual coding of digital audio. Proc. IEEE 88, 4, 451-513.
POPESCU, A. AND FARID, H. 2004. Statistical tools for digital forensics. In Proceedings of the 6th International Workshop on Information Hiding.
QU, Z., LUO, W., AND HUANG, J. 2008. A convolutive mixing model for shift double JPEG compression with application to passive image authentication. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.
WANG, Y. AND VILERMO, M. 2003. Modified discrete cosine transform - Its implications for audio coding and error concealment. AES J. 51, 1, 51-62.
WANG, Y., YAROSLAVSKY, L., VILERMO, M., AND VAANANEN, M. 2000. Some peculiar properties of the MDCT. In Proceedings of the 16th IFIP World Computer Congress.
YANG, R., QU, Z., AND HUANG, J. 2008. Detecting digital audio forgeries by checking frame offsets. In Proceedings of the 10th ACM Multimedia and Security Workshop. ACM.

Received November 2010; revised July 2011; accepted August 2011