REWIND - REVerse engineering of audio-VIsual coNtent Data


FP7-ICT-2007-C, GA No.
Footprint detectors
LAST MODIFICATION:
REPORTING ORGANIZATION: FRAUNHOFER (INSTITUTE FOR DIGITAL MEDIA TECHNOLOGY)
AUTHORS: LUCA CUCCOVILLO, PATRICK AICHROTH, DANIEL GÄRTNER (FRAUNHOFER); PAOLO BESTAGINI (POLIMI); MARCO FONTANI (CNIT); DAVID VÁZQUEZ-PADÍN (UVIGO); MARCO VISENTINI SCARZANELLA (IMPERIAL); MARINA OIKAWA (UNICAMP)

REVISIONS:
- Initial version (Luca)
- Added evaluations for WP4 and WP7 (Daniel, David, Paolo, Marco VS, Marco F, Marina, Luca)
- Modified chapter on Automatic Evaluation (Patrick, Luca)
- Modified introduction (Patrick)
- Proofreading and formatting (Luca)
- Minor edits p.52, title; modified related entities chapters; modified combined inverse decoder description (Patrick)

CONTENTS

1 Introduction
2 WP3. Technical achievements and Performance analysis
   2.1 Acquisition-based footprint detectors
   2.2 Coding-based footprint detectors
   2.3 Image tampering detection and localization
   Conclusions
   Bibliography
3 WP4. Technical achievements and Performance analysis
   3.1 Multi-modal Analysis
   3.2 Anti-Forensics Test Tools
   3.3 Attacker-aware Detectors
   3.4 Analysis of heterogeneous chains
   Conclusions
   Bibliography
4 WP7. Technical achievements and Performance analysis
   Overview
   Evaluation Methodology
   Metrics
   How far we can get?
   Results
   Conclusions
   Bibliography
5 Automatic Evaluation Approach
   DC-09 Video Codec Detection
   DC-06 Image Recapture Detection
   DC-19 Resampling Footprint Detection
   DC-02 Image Splicing Detection
   DC-16 MP3 Bitrate Estimation and Classification
6 Annex A: XML definitions for automatic testing examples
   6.1-6.5 DC examples
   UC examples (UC-01a, UC-01b, ...)

1 Introduction

D5.6 presents the progress achieved in Task 5.4 at the end of the REWIND project, including four parts:

1. A summary of WP3 detectors and tools, and the respective evaluation results for D3.3 and D3.5 detectors, which have already been reported in D5.4 (albeit in slightly different form).
2. A summary of WP4 detectors and tools, and the respective evaluation results for D4.2 and D4.4 detectors, which have been a focus of REWIND work in year three.
3. A summary of WP7 detectors and tools, and the respective evaluation results.
4. Evaluation results from detector examples used to validate the automatic evaluation approach developed within REWIND (and reported in D5.5), thereby extending D5.4.

The goal of the deliverable is to provide a concise view of the results achieved by the aforementioned tools, comparing them with state-of-the-art approaches. To reflect this, the structure of the previous evaluation report D5.4 was slightly adapted, detectors were grouped differently, and extra chapters for conclusions were added. To ensure that all aspects of REWIND work are covered, the deliverable includes work on WP3 activities that were already reported in D5.4 (in grey), complementing them with conclusions and related use cases, detectors and datasets.

Throughout the document, common WP5 terminology and IDs will be applied for use cases (UC), detector candidates (DC) and data sets (DS), all of which were described in previous WP5 deliverables, especially D5.5. In order to reflect some additions related, e.g., to WP7, an update of D5.5, D5.5-2, will be provided together with this deliverable for improved readability and completeness.

The following will not repeat the details of the annotation schemas and the automatic evaluation approach, nor detailed information about use cases, detector candidates, anti-forensics tools, phylogeny tools and datasets identified and provided by the REWIND project, all of which have already been reported in previous WP5 deliverables, especially D5.5.

2 WP3. Technical achievements and Performance analysis

2.1 Acquisition-based footprint detectors

Overview

The analysis of acquisition footprints has two main forensic roles. First, acquisition footprints can reveal important parameters of the capturing or recapturing device, thus allowing camera or microphone identification. Perhaps more importantly, recapture is often used to mask malicious activity, since recapture scrambles compression and editing footprints. It is therefore important for the forensic analyst to be able to systematically identify data that has been recaptured, as an indicator of potential malicious activity. The latter case can be represented schematically as shown below:

Figure 1: Recapture as anti-forensics to cover traces of prior tampering activity.

Automatic detection of recaptured data is all the more important because, under controlled recapture conditions, it is challenging to discern original from recaptured data by inspection. As an example, an original scene and its recaptured version are shown below.

Figure 2: Example of (a) original and (b) recaptured images. Recaptured images do not present obvious footprints for detection-by-inspection in cases where the original image is not available.

Here we report on the results obtained by five representative detectors covering the full set of acquisition-based footprints on videos, images and audio data.

The detector in [1] exploits scene jitter due to either physiological tremor or environmental factors as a cue for classification: when recapturing planar surfaces approximately parallel to the camera imaging plane (such as a cinema screen), any added motion due to jitter will result in approximately uniform high-frequency 2D motion fields. Inter-frame motion trajectories are retrieved with feature tracking techniques, while a uniformity measure of the high-frequency components of the obtained trajectories is used as the input to a binary classifier.
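To make the uniformity cue concrete, the following minimal sketch (an illustration under assumed inputs, not the implementation of [1]; the smoothing window and the decision threshold are placeholder values) scores how uniform the high-frequency motion field is across tracked features:

import numpy as np

def jitter_uniformity(traj, win=9):
    # traj: (n_features, n_frames, 2) feature positions, assumed to come
    # from an off-the-shelf feature tracker applied to the video.
    kernel = np.ones(win) / win
    lowpass = np.stack([
        np.stack([np.convolve(traj[i, :, c], kernel, mode="same")
                  for c in range(2)], axis=-1)
        for i in range(traj.shape[0])])
    hf = traj - lowpass                    # high-frequency motion component
    energy = np.linalg.norm(hf, axis=2)    # per-feature, per-frame magnitude
    per_feature = energy.mean(axis=1)      # average jitter energy per feature
    # For a recaptured planar scene the high-frequency field is nearly
    # uniform across features, so the relative dispersion is small.
    return per_feature.std() / (per_feature.mean() + 1e-9)

def looks_recaptured(traj, threshold=0.3):  # threshold: placeholder value
    return jitter_uniformity(traj) < threshold

In practice the uniformity score would feed a trained binary classifier rather than a fixed threshold.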

However, this method fails whenever the scene is captured in a controlled environment, with a steady camera and no environmental disturbances. The second method is an extension to the more challenging case where the camera is held steady with a tripod [2]. In this scenario, the lack of temporal synchronization between the projector/LCD screen and the recapture device causes a phase shift even when the frame rates in the system are perfectly homogeneous. This results in ghosting artifacts in the recaptured video, which can be detected for automatic recapture classification.

Concerning images, the detector in [3] infers the image recapture chain through the analysis of visualized edges. The proposed method uses elements of sampling theory to characterize the way edges are blurred and distorted by the sampling kernel modeling camera optics during acquisition. Since the shape of the sampling kernel is device-dependent, a dictionary of edge profiles was created, corresponding to edges obtained from known devices at different stages of the recapture chain. Such device-dependency of the kernel makes it possible not only to identify recaptured images, but also to identify which devices were involved in the recapture.

Acquisition-based detectors for audio are presented in [4, 5]. In [4], the algorithm identifies the recording device used for individual audio portions by leveraging the inherent microphone characteristics; inconsistencies across portions can then reveal, for example, splicing attacks that distort the original meaning of the target audio file. In [5], footprints due to changes in ENF phase are used to detect potentially tampered regions. ENF information is extracted from each portion of the audio file and then matched against an ENF reference database to validate the order and duration of the detected signals.

Acquisition footprints can also be caused by the transmission medium, if we consider Internet streaming as an acquisition process. The work in [7] analyses the issue of illicit online content distribution. This work is motivated by the fact that, as video-on-demand becomes increasingly popular, technological solutions have to be found for the problem of third-party distribution without the proper legal rights, which causes significant financial damage to the original content provider. Specifically, Digital Video Broadcast (DVB) standards consider both terrestrial (DVB-T) and satellite (DVB-S) distribution; the differences between the communications technologies used in the two kinds of systems have significant implications for the transmission bandwidth, and consequently for the number of programs that are statistically multiplexed within the DVB Transport Stream (TS); these differences reflect on the instantaneous rate of each program in the TS. On the other hand, IPTV streams have a piecewise constant rate, meaning that the rate of the streamed program must be arranged in constant-rate bursts. Therefore, the mentioned differences in the instantaneous bitrate distribution of DVB-T and DVB-S leave a footprint on the piecewise constant bitrate characteristics of the IPTV stream.
By building statistics on the packet timing distributions, the proposed method is able to classify the stream sources as either terrestrial or satellite, and to compare them with the expected sources.

Related DC, UC and DS

The tools evaluated in this group represent the following Detector Candidates (DC), described in D5.5:

DC-06 - Image: Recapturing detector
DC-11 - Video: Recapture detector
DC-13 - Video: Streaming source detector
DC-17 - Audio: ENF-based tampering detector
DC-22 - Audio: Microphone classification-based tampering detector

They address the following Use Cases (UC), described in D5.5:

UC-06 - Image editing: Hiding tampering via recapturing
UC-08 - Image reproduction: Face recognition fraud
UC-14 - Video editing: Hiding tampering via recapturing
UC-18 - Video reproduction: Obtaining bootlegs in cinemas
UC-19 - Video reproduction: Rebroadcasting 3rd party content
UC-22 - Audio editing: Tampering speech by cutting and merging

The following Datasets (DS) mentioned in D5.5 were used:

DS-05 - Image: Recaptured set
DS-09 - Audio: Edited speech set
DS-13 - Video: Recaptured videos with fixed camera
DS-18 - Multimedia: Uncompressed or single-compressed A/V content

Results

Results from the detectors [1-5] are shown in Figure 3 below. Results are given in terms of true/false classifications of the two possible cases, recaptured and original, marked as positive and negative respectively. Then, the detector in [5] is compared against the state-of-the-art (SotA) in ENF analysis [6]. Finally, the performance of the IPTV stream source classifier analyzing network footprints [7] is shown.

[Table: TP, TN, FP and FN rates for Video recapture [1], Video recapture [2], Image recapture [3], Audio capture [4], Audio capture [5], SotA [6] against [5], and IPTV streaming [7]; the numeric entries are visualized in Figure 3.]

Figure 3: Binary classification rates (TP, TN, FP, FN) for the proposed detectors and comparison with the state-of-the-art, where available.

For the video recapture detectors in [1, 2], the methods have been tested against a database of 18 videos (9 original, 9 recaptured). No SotA comparison is presented, as the literature contains no detectors with quantitative validation specifically tailored to the video recapture case.

The method for image recapture detection [3] exploits cues caused by the shape of the sampling kernel modelling camera optics during acquisition, and was validated against a database of 160 images. This is in stark contrast with the few methods in the literature explicitly targeting image recapture detection, which are based on natural image statistics of color distributions. The methods are therefore not mutually exclusive and can be integrated within a fusion framework to increase the overall robustness.

The two audio methods [4, 5] have been tested on databases containing 192 (24 original, 168 tampered) and 412 (59 original, 353 tampered) samples respectively. In particular, the method in [5] performs significantly better than the SotA by fusing together cues on ENF and statistical pattern matching, which considerably lowers the FP rate.

The methods do, however, still have limitations. Video recapture remains a challenging subject, since it is not possible to leverage CFA artifacts or PRNU signatures to detect inconsistencies as in tampering detection/localization. Furthermore, video compression, professional film editing in original films, and automatic balancing mechanisms in consumer-grade cameras alter natural image statistics and traditional image signatures even for original images. The cues used by the detectors in [1, 2] exploit artifacts introduced exclusively during video recapture. Recapture jitter [1], however, can be confused with environmental jitter in natural scenes, such as handheld aerial sequences where the parallax effect is low and the high-frequency motion appears uniform throughout the scene. This explains the relatively high false positive rate in the tests. Likewise, the approach in [2] can only work if there are enough features present in the scene with a high ratio of feature motion to motion blur. Moreover, the phase shift between the recording device and the displayed video could be such that the camera's integration window falls entirely within a single frame of the original video. In this case, there are no ghosting artifacts and the method fails.
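The failure condition just described can be checked with a small calculation. The sketch below (illustrative only; the frame rate, exposure time and phase values are assumed) tests whether the camera's integration window spans a frame boundary of the displayed video, which is the prerequisite for ghosting artifacts:

from math import floor

def ghosting_possible(display_fps, exposure, phase):
    # True if the exposure window [phase, phase + exposure) crosses a
    # frame boundary of the displayed video (boundaries at k / display_fps).
    frame = 1.0 / display_fps
    return floor(phase / frame) != floor((phase + exposure) / frame)

# A 25 fps screen recaptured with a 10 ms exposure: depending on the phase
# shift, the exposure may or may not straddle a frame transition.
print(ghosting_possible(25.0, 0.010, phase=0.005))  # False: no ghosting
print(ghosting_possible(25.0, 0.010, phase=0.035))  # True: ghosting occurs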

Conversely, there is still much scope for increasing the robustness of both approaches. Chiefly, all experiments have been carried out on very short segments (less than 500 frames for [1] and ~30 frames for [2]). By segmenting a feature-length video into scenes and examining them separately, it would be possible to implement a majority voting mechanism for both methods that would greatly improve their robustness.

The image recapture detector does require user interaction, as appropriate edge neighborhoods have to be selected manually. Because of the numerical sensitivity of the operators involved in modelling the sampling chain, the images need to be fairly clean, as even a little noise can throw off the edge profile estimation needed for the kernel classification. Lastly, the method relies on the presence of a sampling kernel dictionary that is used for classification: results for a camera that is not included in the dictionary can be unpredictable.

The audio method in [4] suffers from a similar problem, requiring a prior dataset of different microphone characteristics for classification. Performance improvements are still possible through the inclusion of a de-reverberation processing phase to limit the impact of the scene on the track, thus enhancing the discriminative power of the microphone channel magnitude response. The performance of [5] is significantly improved with respect to the state of the art, but it can still be improved in terms of TP rate. Finally, while the audio methods are designed to identify tampering and/or splicing attacks on an audio file, we still do not have at our disposal a reliable technique for detecting a track that has been wholly recaptured.

Finally, concerning the IPTV stream classifier, the performance of the method proposed in [7] is practically the same for IPTV generated from DVB-T and DVB-S streams. Further work will include the classification of IPTV streams generated from standard DVDs.

2.2 Coding-based footprint detectors

Overview

Multimedia objects are typically distributed in compressed format. This is done both to save space on storage devices and to reduce the required bandwidth when network sharing is involved. Furthermore, multimedia data usually undergoes multiple compressions during its lifetime. Indeed, it is customary to compress media at the origin during the acquisition process, re-encode it after any editing operation, and apply an additional coding step when the media is uploaded to a sharing platform. This workflow is shown in Figure 4 below.

Figure 4: Typical video chain of operations involving two coding steps: i) the first during the acquisition stage; ii) the second after either sharing or editing. Each coding step leaves peculiar footprints that can be studied to trace back the file history.

By studying the chain of coding steps applied to a multimedia object it is possible to recover valuable information about the media itself; e.g., the detection of an abnormally high number of coding steps might suggest prior tampering, or the retrieval of coding parameters indicating the use of a proprietary scheme can lead to device identification. In the rest of the document we comment on some of the main results achieved by the REWIND consortium as part of WP3 on the study of coding chains for all types of media.

With regard to the study of compression chains on images, existing works show how to detect whether an image is uncompressed or has gone through up to two compression stages. However, chain identification for a higher number of compression stages remains an open issue. To this end, the technique presented in [8] tackles the problem of detecting long chains of JPEG compressions, up to four consecutive compression stages. The proposed technique is based on analyzing the distribution of the most significant digit of the DCT coefficients, which is modeled according to Benford's law.

When dealing with videos, the problem of detecting the number of compression steps is inextricably intertwined with the issue of recovering the codec used. Indeed, while for images JPEG is considered a universal standard, video sequences are often compressed using different codecs (e.g., MPEG2, MPEG4, etc.). Within the Consortium, two tools have been developed to address both the problem of inferring the number of compression steps and the detection of the coding architecture. In [9], the Benford's-law principle studied in [8] was extended to the video case, making it possible to detect whether a video sequence was compressed up to three times. For doubly encoded videos, the work in [10] explains how to determine which codec was used during the first coding step. The proposed method relies on the notion that lossy coding is an almost idempotent operation, i.e., re-encoding the reconstructed sequence with the same codec and coding parameters produces a sequence that is highly correlated with the original input. As a consequence, it is possible to analyse this correlation to identify the first codec, provided that the second codec does not cause a significant decrease in quality. An example is shown in Figure 5, where an image compressed 5 times is perceptually similar to the original input.

Figure 5: Comparison between images JPEG-compressed multiple times: an image compressed once (on the left), and an image compressed 5 times with an average quality factor equal to 70 (on the right).

Regarding audio data, the method in [11] has been developed to address the fake quality detection problem. This occurs when lossless audio files (such as WAV) or allegedly high-quality MP3s are sold, while the content has actually been encoded with a lossy audio codec. The tool analyses the Modified Discrete Cosine Transform (MDCT) coefficients, which are processed during MP3 encoding and therefore carry specific footprints, and classifies the file as fake or not. Whenever double MP3 encoding is detected, the tool also estimates the bitrate of the first encoding.
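As a concrete illustration of the first-digit statistics on which [8, 9] rely, the sketch below extracts the empirical distribution of first significant digits from block-DCT coefficients and measures its distance from Benford's law; the 8x8 blocking and the chi-square-style distance are simplifications, not the exact feature set of [8]:

import numpy as np
from scipy.fftpack import dct

def first_digit_histogram(image):
    # Distribution of first significant digits (1-9) of the non-zero
    # AC coefficients of the 8x8 block DCT of a grayscale image.
    h, w = image.shape
    coeffs = []
    for i in range(0, h - h % 8, 8):
        for j in range(0, w - w % 8, 8):
            block = dct(dct(image[i:i+8, j:j+8].astype(float),
                            axis=0, norm="ortho"), axis=1, norm="ortho")
            coeffs.append(block.ravel()[1:])  # skip the DC coefficient
    c = np.abs(np.concatenate(coeffs))
    c = c[c > 1e-6]
    digits = (c / 10.0 ** np.floor(np.log10(c))).astype(int)
    return np.bincount(digits, minlength=10)[1:10] / len(digits)

benford = np.log10(1.0 + 1.0 / np.arange(1, 10))  # Benford's law pmf

def benford_distance(image):
    p = first_digit_histogram(image)
    return np.sum((p - benford) ** 2 / benford)

Repeated compression progressively perturbs the first-digit distribution, so features of this kind can be fed to a classifier that estimates the number of coding steps.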

Related DC, UC and DS

The tools evaluated in this group represent the following Detector Candidates (DC), described in D5.5:

DC-01 - Image: Multiple JPEG compression detector
DC-09 - Video: Codec detector
DC-21 - Audio: Fake quality detector
DC-26 - Video: Multiple coding detector

They address the following Use Cases (UC), described in D5.5:

UC-01 - Image editing: Cut/paste attack
UC-02 - Image editing: Copy-move attack
UC-03 - Image editing: Modifying personal content
UC-04 - Image editing: Modifying and sharing 3rd party content
UC-05 - Image editing: Hiding traces of compression
UC-10 - Video coding: Multiple transcoding
UC-20 - Audio coding: Bitrate fraud

The following Datasets (DS) mentioned in D5.5 were used:

DS-01 - Image: Uncompressed Nikon camera set
DS-03 - Image: Aligned JPEG compression splicing set
DS-04 - Image: Not-aligned JPEG compression splicing set
DS-06 - Video: Single and double compressed sequence set
DS-11 - Audio: Fake quality MP3 set
DS-14 - Image: Single, double, and triple JPEG compressed images

Results

All the detectors studied for coding chain classification solve N-class classification problems with N > 2. Indeed, the detector in [8] infers whether an image has been JPEG-compressed up to four times. This was extended in [9] to videos, which are classified as compressed up to three times. In [10], the proposed approach is able to detect whether a double-encoded video was compressed with MPEG2, MPEG4 or AVC in the first coding step. Finally, in [11], the original bitrate of double-MP3-encoded audio tracks is recovered. A common metric was established to evaluate these detectors within a single framework. In particular, confusion matrices were used, which report the probability of assigning an object to the class given by the column, knowing that the ground truth is indicated by the row. Confusion matrices for the described detectors are reported below.

Table 1: Confusion matrix (N, N*) reporting the probability of identifying N* compression steps, given that N steps were applied with an average quality factor equal to 75. On the left, the method proposed in [8]; on the right, the SotA method proposed in [12]. [Numeric entries not recoverable.]

Table 2: On the left, confusion matrix (N, N*) showing the probability of detecting N* compression steps for video compressed N times with an average quantization parameter equal to 25, using [9]. On the right, confusion matrix (C, C*) over the codecs MPEG2, MPEG4 and AVC, for the identification of codec C* on videos encoded with codec C, exploiting the algorithm discussed in [10]. [Numeric entries not recoverable.]

Table 3: Confusion matrix (b, b*) showing the probability of associating the bitrate b* with an MP3 encoded at bitrate b, using either the method proposed in [11] (on the left) or the SotA one proposed in [14] (on the right). The second compression was performed at 192 kbit/s. [Numeric entries not recoverable.]

Results for the detector proposed in [8] were obtained with an 11-fold cross-validation on 110 images from the UCID dataset [13]. For each image tested and for each of the four compression stages considered, the quality factor QF was sampled from a random variable uniformly distributed in the interval [QF_N - 10, QF_N + 10], where QF_N is the quality factor of the last compression stage. Table 1 reports the results for the proposed algorithm when QF_N = 75 and for the state-of-the-art [12], on the left and the right, respectively, of each column indicating the compression stage number.
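For reference, the row-normalized confusion matrices reported in Tables 1-3 can be computed with a generic helper of this kind (a sketch, independent of any specific detector):

import numpy as np

def confusion_matrix(ground_truth, predicted, n_classes):
    # Entry (r, c) is the probability of predicting class c given that
    # the ground truth is class r (rows sum to 1).
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(ground_truth, predicted):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True)

# Example with three compression-count classes (N = 1, 2, 3 -> 0, 1, 2):
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
print(confusion_matrix(truth, pred, 3))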

While the results show that the two methods achieve nearly the same results for N ≤ 3, the proposed method is characterized by a lower computational complexity, because of the small size of the feature vector. Moreover, when N = 4, the accuracy of the proposed solution is approximately 94%, while [12] achieves a probability of correct detection equal to 87%.

The accuracy of coding chain estimation applied to videos [9] was evaluated by considering 12 video sequences coded using a generic hybrid transform-based motion-compensated video codec. At compression stage N, the quantization parameter QP_N was sampled from a random variable uniformly distributed in the interval [QP_r - 10, QP_r + 10], where QP_r is the average quantization parameter for the sequence. Compared with the still-image case presented earlier, this scenario presents a significantly harder challenge due to the motion estimation phase, which scrambles the statistics of the coefficients. As shown in Table 2 (left), sequences coded twice are not easily distinguishable from those coded three times, and therefore correctly determining N for longer coding chains still leaves room for improvement. However, state-of-the-art detectors cannot currently go beyond double compression, so having broken this barrier with a sizeable degree of reliability represents a promising start and the foundation for new research avenues.

To test the ability to detect double compression as well as to identify the coding scheme employed, the method in [10] was evaluated on a dataset of six video sequences: four at CIF spatial resolution (352x288) and two at 4CIF spatial resolution (704x576). Each original sequence was encoded with either MPEG-2, MPEG-4 or AVC at three different bitrates. At the second coding pass, the sequence was then re-encoded with one of the three codecs from the same set. Results showing the performance for the different codecs used in the second pass are shown in Table 2 (right). As this is the first work targeting codec identification in double encoding chains, no state-of-the-art is available. However, the results are satisfying and show the potential of the proposed technique. Some limitations are still present and need to be addressed in future work; chiefly, whenever the second coding step has a particularly low bitrate, the traces of the first codec are suppressed, and the proposed method will not perform satisfactorily.

Lastly, the method proposed in [11] for the problem of fake bitrate detection on audio data was tested with a collection of over 1,000 uncompressed audio files. These were then doubly compressed at different combinations of bitrates for the two coding stages, yielding a grand total of 25,000 doubly compressed MP3 files. Table 3 shows the comparison between the proposed method and the state-of-the-art in [14] when the second bitrate is 192 kbit/s. The results generally highlight that both methods perform well when the first bitrate is greater than the second; otherwise, the previous coding information is lost. Compared to [14], the proposed method achieves better results for high bitrates, and in general performs better when small audio segments are taken into account (even shorter than 1 second).

2.3 Image tampering detection and localization

Overview

Editing operations constitute the most common type of tampering on images, videos and audio. This is also due to the existence and diffusion of powerful yet user-friendly software suites for professional editing of multimedia objects. Apart from laying the foundations for tampering detection within more complex processing chains, the techniques presented in this section aim to start filling the technological gap between malicious users and forensic analysts. Even with basic attacks, it is possible to generate convincing examples of tampered images, as shown in Figure 6. A large number of more sophisticated examples have been devised for the REWIND challenge and will be evaluated in the coming months.

Figure 6: Example of (a) an original image and (b) its tampered version after a copy-move attack (from [15]).

The detectors presented here are representative of the overall work done on image tampering detection and are compared to the state-of-the-art. The method proposed in [15] detects the presence of Colour Filter Array (CFA) artifacts in the image, and identifies tampered regions by finding areas where the artifacts of the demosaicing algorithm have been removed. The algorithm localizes such regions by computing the tampering likelihood of each 2x2 block in the image. The two tools in [16, 17] assume that tampered images undergo double JPEG compression on the tampered areas; the two possible cases, i.e., whether the second-compression blocks are aligned or not aligned with the first, are considered in the two papers respectively. Similarly to the method in [15], the two approaches allow estimating the tampering likelihood of each 8x8 image block.

Related DC, UC and DS

The tools evaluated in this group represent the following Detector Candidates (DC), described in D5.5:

DC-02 - Image: Splicing detection
DC-04 - Image: Forgery localization detectors

They address the following Use Cases (UC), described in D5.5:

UC-01 - Image editing: Cut/paste attack
UC-02 - Image editing: Copy-move attack
UC-03 - Image editing: Modifying personal content
UC-04 - Image editing: Modifying and sharing 3rd party content
UC-05 - Image editing: Hiding traces of compression

The following Datasets (DS) mentioned in D5.5 were used:

DS-01 - Image: Uncompressed Nikon camera set
DS-03 - Image: Aligned JPEG compression splicing set
DS-04 - Image: Not-aligned JPEG compression splicing set
DS-14 - Image: Single, double, and triple JPEG compressed images

Results

The detectors are evaluated and compared against the state-of-the-art in Figure 7 below.

Figure 7: ROC curves for the CFA-based tampering detection algorithm in [15] (left) and for the two methods on double JPEG compression [16, 17] compared against the state-of-the-art [18] (right).

The method in [15] has been evaluated on 400 different images. The results are consistently high regardless of which of the four demosaicing algorithms is used. However, there are some limitations on the scenarios in which this detector can be applied: first, the demosaicing filter is unknown, so the prediction of the neighboring pixels is imperfect. Second, JPEG compression destroys CFA footprints, even at relatively high quality factors. Finally, global operations such as filtering or resizing may destroy CFA footprints.

The two methods in [16, 17] have been tested against the state-of-the-art in [18], and the results are shown in Figure 7 (right) for different relative quality factors employed in the two compression steps. However, only the method in [17] is directly comparable with the one in [18], as they both consider the case where the tampered blocks are aligned with the second JPEG compression grid. In this case, the method outperforms the state-of-the-art. The performance worsens slightly for the case of non-aligned blocks, but for this case no fair comparison with other existing methods is available. As shown in the results, one of the main limitations of these methods is that a strong final compression is likely to disrupt the double compression artifacts. Furthermore, any filtering and/or resizing of the image would disrupt the searched traces, making detection with these tools impossible. These drawbacks are not negligible when considering a scenario where a tampered image is distributed on the internet, as most sharing platforms, like Flickr or Facebook, re-encode or resize the image during upload.

Finally, localization of forgeries based on non-aligned double JPEG compression requires the pasted region to be relatively big (at least 25% of the total image area). Conversely, the performance indicated in the results greatly improves when the tools are considered together within a decision fusion framework. This has been investigated as part of WP4 and significantly outperforms the state-of-the-art.

Conclusions

The table below summarizes the methods presented in this document, together with a single accuracy value indicating their performance. For each scenario, a target accuracy was also indicated. Whenever possible, this value was obtained directly from state-of-the-art methods presented for the same problem. However, for methods breaking completely new ground, for which prior state-of-the-art is not available, results from the closest scenario considered in the literature were reported. Similarly, we have also reported results for methods tackling the same challenge but with significantly different evaluation settings. These two cases are indicated by an asterisk (*) next to the result. Three symbols indicate whether a target can be considered achieved:

- ✓✓: Results exceed the SotA, with thorough testing in realistic conditions.
- ✓: Results are on a level with or better than the closest SotA, where available; there might not be a close match in the literature for the proposed method.
- ✗: Results are significantly worse than the SotA or below an acceptability threshold.

A fourth target completion class could be defined for methods ready to be marketed; however, this is beyond the scope of this FET project.

Method | Reference | Target Reference
Video recapture | [1,2]* | [21]
Image recapture | [3] | [23]
Audio tampering detection | [4,5]* | [6,20]
IPTV stream classification | [7] | [25]
Image multiple compression | [8] | [12]
Video multiple compression | [9]* | [24]
Video codec identification | [10]* | [22]
MP3 fake bitrate detection | [11] | [14]
Image tampering localization | [15-17] | [18]

Table 4: Results from the proposed methods and target accuracies. Whenever the target is taken from techniques that differ significantly in terms of scenario or evaluation conditions, this is indicated by an asterisk (*). [The accuracy percentages and target-achieved marks are not recoverable.]

Video recapture shows a significant improvement over the state-of-the-art [21], especially considering that the only available method has been tested exclusively on synthetic data (from which its accuracy is obtained) and a single real video. There are still issues to be addressed, as outlined in the description of the method, but work on video recapture detection constitutes a promising area of research, especially for its relevance to automatic bootleg detection.

The technique developed for image recapture outperforms the target accuracy, but needs further testing with larger databases of cameras.

Concerning audio tampering detection, the results were obtained by averaging the two proposed methods [4, 5] and comparing them with the average of the state-of-the-art for microphone classification and ENF analysis [6, 20]. If only the microphone classification results are considered, the method in [20] performs better than the method proposed in [4]. The evaluation conditions, however, are different: in [20] fewer microphones are used, all of professional grade, thus reducing the chance of confusion and decreasing the noise in the data. Moreover, the test content in [20] was not processed with any kind of lossy compression algorithm.

The Benford's-law-based methods [8, 9] outperform the state-of-the-art for images, and while the results are lower than the reported target for videos, existing methods do not go beyond double video coding, whereas our work considers chains of up to 3 coding steps. Moreover, most existing methods simply apply image-based techniques to I-frames, whereas the proposed method is able to work on P-frames as well.

The proposed method for video codec identification cannot be compared to any similar method in the literature. The value reported as target is taken from a conceptually similar method [22] that identifies the codecs used in the case of double audio compression. Apart from the fact that codec identification is arguably more challenging on videos than on audio data, the method from the literature can only tackle the case of single coding, whereas our method in [10] can deal with double coding in video.

The MP3 fake bitrate detector [11] is, on average, outperformed by the method in the literature [14]. However, a single accuracy value cannot properly capture the method's performance: looking more carefully at the confusion matrix reported in Section 2, it can be seen that the proposed method does outperform the SotA at medium-to-high bitrates, whereas it fails at low bitrates. Moreover, the proposed method can handle short audio segments, whereas the SotA needs longer tracks for classification.

Finally, the image tampering classification methods presented in [15-17] on average outperform the SotA from the literature [18]. Moreover, one of the detectors can handle the case of a non-aligned second JPEG compression with comparable accuracy, and all the methods are fused together for enhanced robustness and even better performance in a REWIND-published tool that is presented as part of WP4.

Bibliography

[1] M. Visentini-Scarzanella and P. L. Dragotti, "Video jitter analysis for automatic bootleg detection," IEEE Multimedia Signal Processing Workshop (MMSP '12).
[2] P. Bestagini, M. Visentini-Scarzanella, M. Tagliasacchi, P. L. Dragotti, and S. Tubaro, "Video recapture detection based on ghosting artifact analysis," IEEE International Conference on Image Processing (ICIP '13).
[3] T. Thongkamwitoon, H. Muammar, and P. L. Dragotti, "Identification of image acquisition chains using a dictionary of edge features," EURASIP Signal Processing Conference (EUSIPCO '12).
[4] L. Cuccovillo, S. Mann, M. Tagliasacchi, and P. Aichroth, "Audio tampering detection via microphone classification," IEEE Multimedia Signal Processing Workshop (MMSP '13).
[5] S. Mann, L. Cuccovillo, P. Aichroth, and C. Dittmar, "Combining ENF phase discontinuity checking and temporal pattern matching for audio tampering detection," Workshop Audiosignal- und Sprachverarbeitung (WASP '13).
[6] D. P. N. Rodríguez, J. A. Apolinário, and L. W. P. Biscainho, "Audio authenticity: Detecting ENF discontinuity with high precision phase analysis," IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, 2010.
[7] M. Masciopinto and P. Comesaña, "IPTV streaming source classification," IEEE International Workshop on Information Forensics and Security (WIFS '12).
[8] S. Milani, M. Tagliasacchi, and S. Tubaro, "Discriminating multiple JPEG compression using first digit features," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12).
[9] S. Milani, P. Bestagini, M. Tagliasacchi, and S. Tubaro, "Multiple compression detection for video sequences," IEEE Multimedia Signal Processing Workshop (MMSP '12).
[10] P. Bestagini, A. Allam, S. Milani, M. Tagliasacchi, and S. Tubaro, "Video codec identification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12).
[11] T. Bianchi, A. De Rosa, M. Fontani, G. Rocciolo, and A. Piva, "Detection and classification of double compressed MP3 audio tracks," 1st ACM International Workshop on Information Hiding and Multimedia Security, June 2013.
[12] B. Li, Y. Q. Shi, and J. Huang, "Detecting doubly compressed JPEG images by using mode based first digit features," IEEE Multimedia Signal Processing Workshop (MMSP '08).
[13] G. Schaefer and M. Stich, "UCID - An uncompressed color image database," Proc. SPIE, 2004.
[14] M. Qiao, A. H. Sung, and Q. Liu, "Revealing real quality of double compressed MP3 audio," ACM International Conference on Multimedia (MM '10).

[15] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva, "Image forgery localization via fine-grained analysis of CFA artifacts," IEEE Transactions on Information Forensics and Security, vol. 7, no. 5, Oct. 2012.
[16] T. Bianchi and A. Piva, "Analysis of non-aligned double JPEG artifacts for the localization of image forgeries," IEEE International Workshop on Information Forensics and Security (WIFS '11).
[17] T. Bianchi, A. De Rosa, and A. Piva, "Improved DCT coefficient analysis for forgery localization in JPEG images," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11).
[18] Z. C. Lin, J. F. He, X. Tang, and C. K. Tang, "Fast, automatic and fine-grained tampered JPEG image detection via DCT coefficient analysis," Pattern Recognition, vol. 42, no. 11, Nov. 2009.
[19] H. Farid, "Exposing digital forgeries from JPEG ghosts," IEEE Transactions on Information Forensics and Security, vol. 4, no. 1, Mar. 2009.
[20] C. Kraetzer, K. Qian, M. Schott, and J. Dittmann, "A context model for microphone forensics and its application in evaluations," Proc. SPIE 7880, Media Watermarking, Security, and Forensics III, 2011.
[21] W. Wang and H. Farid, "Detecting re-projected video," 10th International Workshop on Information Hiding, 2008.
[22] S. Hicsonmez, H. T. Sencar, and I. Avcibas, "Audio codec identification through payload sampling," IEEE International Workshop on Information Forensics and Security (WIFS '11).
[23] H. Cao, "Identification of recaptured photographs on LCD screens," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10).
[24] Y. Su and J. Xu, "Detection of double-compression in MPEG-2 videos," 2nd International Workshop on Intelligent Systems and Applications (ISA), 2010.
[25] W.-H. Lin and A. Hauptmann, "News video classification using SVM-based multimodal classifiers and combination strategies," ACM Multimedia, 2002.

3 WP4. Technical achievements and Performance analysis

3.1 Multi-modal Analysis

Most of the methods developed within WP3 work by searching for subtle traces left in the media by common processing operations. The kind and strength of such traces are strongly related both to the specific kind of processing applied and to the encoding algorithm used to store the media. Hence, it is very hard to develop a single analysis tool that fits every possible situation. An interesting solution to this problem is the synergistic use of different tools, which can widen the analyst's arsenal without forcing them to select the most appropriate analysis algorithm each time. Moreover, the strengths of some tools may compensate for the weaknesses of others. With the above considerations in mind, two multi-clue analysis tools have been developed within WP4, one targeting image forensics and one targeting audio forensics.

Overview

The Multi-clue image forgery localization tool, introduced in deliverable D4.2, makes it possible to evaluate the integrity of a digital image by revealing whether the image is a malicious composition of different contents or an original shot of an event. This tool combines different image forensic algorithms for splicing localization [26, 27, 28] through a specifically tailored decision fusion framework (a paper is currently in preparation), improving the detection performance with respect to the single tools. The proposed multi-clue approach makes use of background information that is available to the analyst, such as local properties of the image (e.g., saturation and quantization strength), to improve the interpretation of single-tool outputs, as suggested in [29]. Once interpreted and transformed into an appropriate mathematical form, the tool outputs are merged together, taking into account the set of possible relationships between the searched traces. The output of the fusion framework is a single forgery localization map. As a final step, this map can be refined by segmenting the image under analysis and assigning each segment one of the two classes, tampered or original, according to the values assumed by the forgery map at the corresponding locations.

Lossy compression is a common step in the creation of audio material: music is distributed online in AAC or MP3 format in order to save bandwidth, and mobile devices record audio and store it compressed in order to save storage space. After decoding a compressed audio file, traces of the codec used can sometimes still be found in the audio signal. The combined inverse decoder is a tool for the analysis of audio files that looks for traces left by MP3, MP3PRO, AAC and HE-AAC. It is based on the principles of the inverse decoder, which was introduced for MP3 and AAC in [30, 31] and extended to support MP3PRO in [32]. The processing is done on a segment basis: the combined inverse decoder identifies segments containing known coding traces, and then, for each segment, various properties are identified, including codec type, framing grid offset, encoding bitrate, and stereo mode. The combined inverse decoder tool can be used for audio tampering detection, as introduced in [33], where tampering is indicated by a codec change or a framing grid offset change from one segment to the following one.
Further possible applications include fake bitrate detection, or detection of the original framing grid offset to decrease the number of artifacts in subsequent transcoding steps.
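A tampering check on top of the combined inverse decoder output can be sketched in a few lines; the segment fields below mirror the XML output shown later in this section, and a real implementation would of course include confidence handling:

def find_discontinuities(segments):
    # Each segment is a dict with the properties estimated by the
    # combined inverse decoder, e.g.
    # {"start": 2.0, "codec": "AAC", "frame_offset": 584}.
    suspicious = []
    for prev, cur in zip(segments, segments[1:]):
        # A codec change or a framing grid offset change from one segment
        # to the next indicates a potential cutting point.
        if (prev["codec"] != cur["codec"]
                or prev["frame_offset"] != cur["frame_offset"]):
            suspicious.append(cur["start"])
    return suspicious

segments = [
    {"start": 0.0, "codec": "unknown", "frame_offset": 0},
    {"start": 2.0, "codec": "AAC", "frame_offset": 584},
    {"start": 5.0, "codec": "MP3", "frame_offset": 469},
]
print(find_discontinuities(segments))  # [2.0, 5.0]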

Figure 8: Visualization of an exemplary combined inverse decoder output.

The output of the tool is formatted as XML; see the example reported below. This format, being at the same time human- and machine-readable, can easily be used to infer more complex information. An example of an application built on top of the combined inverse decoder, displaying visual information, is shown in Figure 8.

<?xml version="1.0" encoding="iso-8859-1"?>
<AudioFootprintDetection xsi:noNamespaceSchemaLocation="AudioFootprintDetection_v1.xsd"
                         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <GeneralPropertiesType>
    <Duration Type="time">00:00:08.000</Duration>
    <Channels>1</Channels>
    <SamplingRate>44100</SamplingRate>
  </GeneralPropertiesType>
  <SegmentType>
    <Start Type="time">00:00:00.000</Start>
    <Duration Type="duration">00:00:02.000</Duration>
    <CodingTracesType>
      <Codec>unknown</Codec>
      <EstimatedBitrate type="huffman">unknown</EstimatedBitrate>
      <EstimatedBitrate type="total_rounded">unknown</EstimatedBitrate>
      <StereoCoding>unknown</StereoCoding>
      <FrameOffset>0</FrameOffset>
    </CodingTracesType>
  </SegmentType>
  <SegmentType>
    <Start Type="time">00:00:02.000</Start>
    <Duration Type="duration">00:00:03.000</Duration>
    <CodingTracesType>
      <Codec>AAC</Codec>
      <EstimatedBitrate type="huffman"></EstimatedBitrate>
      <EstimatedBitrate type="total_rounded">48</EstimatedBitrate>
      <StereoCoding>mono</StereoCoding>
      <FrameOffset>584</FrameOffset>
    </CodingTracesType>
  </SegmentType>
  <SegmentType>
    <Start Type="time">00:00:05.000</Start>
    <Duration Type="duration">00:00:03.000</Duration>
    <CodingTracesType>
      <Codec>MP3</Codec>
      <EstimatedBitrate type="huffman"></EstimatedBitrate>
      <EstimatedBitrate type="total_rounded">128</EstimatedBitrate>
      <StereoCoding>mono</StereoCoding>
      <FrameOffset>469</FrameOffset>
    </CodingTracesType>
  </SegmentType>
</AudioFootprintDetection>

Besides tampering detection, a codec detection evaluation is also carried out in [33]. The evaluation is performed on a dataset that can be accessed online and is explained in detail in [34].

Related DC, UC and DS

The tools evaluated in this group represent the following Detector Candidates (DC), described in D5.5:

DC-02 - Image: Splicing detection
DC-04 - Image: Forgery localization detectors
DC-16 - Audio: Codec detector
DC-23 - Audio: Combined inverse decoding tampering detector

They address the following Use Cases (UC), described in D5.5:

UC-01 - Image editing: Cut/paste attack
UC-02 - Image editing: Copy-move attack
UC-03 - Image editing: Modifying personal content
UC-04 - Image editing: Modifying and sharing 3rd party content
UC-22 - Audio editing: Tampering speech by cutting and merging

The following Datasets (DS) mentioned in D5.5 were used:

DS-02 - Image: Removing color filter array splicing set
DS-03 - Image: Aligned JPEG compression splicing set
DS-04 - Image: Not-aligned JPEG compression splicing set
DS-09 - Audio: Edited speech set

Results

The Multi-clue image forgery localization tool [29] has been evaluated using Receiver Operating Characteristic (ROC) curves, calculating the corresponding Area Under the Curve (AUC). A collection of images from the datasets DS-02, DS-03 and DS-04 has been used to test the detector. The same set of images has been analysed using each of the tools integrated within the framework, so as to compare the performance obtained with the fusion techniques against that achieved by the single detectors. Results are reported in Figure 9, together with the AUC values.

Figure 9: ROC curve displaying the advantage of multi-clue image forgery localization.

The combined inverse decoder has been evaluated for tampering detection in [33], on both a global and a discontinuity level. In the global-level analysis, performance is evaluated on retrieving complete audio files. Three scenarios are distinguished:

A. The tampered audio file is retrieved (TP) if at least one discontinuity is detected.
B. The tampered audio file is retrieved (TP) if at least one discontinuity belonging to one cutting point is detected.
C. The tampered audio file is retrieved (TP) if each cutting point is detected by at least one discontinuity.

In the discontinuity-level analysis, the retrieval of explicit cuts (rather than of the tampered files) is analyzed:

D. A discontinuity is correct (TP) if it belongs to a cutting point, and incorrect (FP) if it does not; cutting points for which no corresponding discontinuity is detected count as misses.

In all cases, a detected discontinuity is considered to belong to a cutting point if the location difference between the detected discontinuity and the cutting point in the ground truth is smaller than 300 ms. In order to compare the method in [33] with the state of the art in [35], only files having no codec traces or MP3 codec traces are used. Since the original state-of-the-art system was not available, our own approach was modified to resemble the state of the art in order to allow comparison.

Test Case | Statistics (%): Precision / Recall / Accuracy
A | 99.5 / 97.2 / - / 93.3 / - / 92.2 / 91.8
B | 99.5 / 97.1 / - / 90.9 / - / 90.2 / 89.2
C | 99.4 / 96.8 / - / 81.9 / - / 82.6 / 78.8
D | 92.7 / 93.0 / - / 91.0 / 95.7

Table 5: Comparison of the proposed system (MP3, AAC, HE-AAC, MP3PRO) / proposed system (MP3 only) / state of the art (MP3 only). [Several numeric entries and their exact column assignment were lost; values are reproduced as they appear in the source.]

3.2 Anti-Forensics Test Tools

The most recent research on processing-operation footprints has been developed taking into account the two opposite points of view of the analyst and the attacker. On the adversary side, this research has produced several attack strategies that an opponent may use to fool state-of-the-art detectors. This set of methodologies is commonly referred to as anti-forensics or counter-forensics; it aims at revealing the weaknesses of forensic technology in order to improve its next generation.

Overview

An example of counter-forensics as an adversarial approach to forensic detectors is [36]. Most of the existing counter-forensics strategies, although successful, are based on heuristic criteria, and their optimality is not proven. In [37], the optimal strategy for modifying content in order to fool a histogram-based forensic detector is derived. The objective of this work, i.e., of the Optimal counter-forensics for histogram-based forensics tool, is to propose a general (non-targeted) attacking method, with the distinctive feature of a single target function whose optimization is consistently pursued in the different steps of the attack design, obtaining the optimal attack in the MSE sense.

Specifically, the attention is focused on the so-called histogram-based forensic detectors, which take their decisions based solely on the histogram of a (generally transform-domain) function of the input samples. In order to prove the usefulness of the proposed strategy, we employ it to successfully attack a well-known algorithm for detecting double-JPEG compression [38]. Note that for the sake of brevity we do not include results from the similar methodology proposed in [39], which is also part of WP4 and has been shown to be able to cope with two different problems, namely gamma correction and histogram stretching.

Related DC, AT, UC and DS

The tools evaluated in this group represent the following Detector Candidates (DC) and Anti-forensics Tools (AT), described in D5.5:

DC-01 - Image: Multiple JPEG compression detector (although the attack can be applied to any histogram-based forensic detector)
AT-01 - Image: Universal anti-forensic tool against histogram-based detectors

They address the following Use Cases (UC), described in D5.5:

UC-01 - Image editing: Cut/paste attack
UC-02 - Image editing: Copy-move attack
UC-03 - Image editing: Modifying personal content
UC-04 - Image editing: Modifying and sharing 3rd party content
UC-05 - Image editing: Hiding traces of compression
UC-24 - Image editing: Hiding traces of enhancement

The following Datasets (DS) mentioned in D5.5 were used:

DS-01 - Image: Uncompressed Nikon camera set

Results

The strategy derived in [37] has been tested against the double-JPEG compression detector proposed in [38], using images from the UCIDv2 database [40]. After configuring the SVM classifier from the detector in [38] (cf. [37]), a fresh set of 380 images was used in the performance evaluation. One of those images (used for defining the starting point of the optimization) was JPEG-compressed only once with QF = 70, while the others were compressed twice: first with QF = 10 and then with QF = 70. Under these settings, the PSNR achieved by averaging the MSE is ... dB, and 94.7% of the images were classified as non-doubly compressed. As a comparison, we tested the attack proposed in [36], which is specifically designed to fool double-JPEG compression detectors, in the same experimental framework. In this case, the PSNR achieved by averaging the MSE is ... dB, and 50.4% of the images were classified as non-doubly compressed.
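As a toy illustration of the principle behind such attacks (not the algorithm of [37]): among all modifications that give a scalar signal a prescribed histogram, the monotone, sorting-based remapping minimizes the mean squared error. A minimal sketch, assuming integer-valued samples and a target histogram with the same total count:

import numpy as np

def mse_optimal_histogram_match(x, target_counts, bin_values):
    # Return y whose histogram over bin_values equals target_counts,
    # minimizing the MSE to x (monotone rearrangement argument).
    assert target_counts.sum() == x.size
    order = np.argsort(x, kind="stable")
    # Smallest inputs receive the smallest target values, which is
    # MSE-optimal for scalar data.
    sorted_targets = np.repeat(bin_values, target_counts)
    y = np.empty(x.size, dtype=float)
    y[order] = sorted_targets
    return y

# Example: force a comb-free (uniform) histogram onto quantized data.
rng = np.random.default_rng(0)
x = rng.integers(0, 8, size=800) * 2           # comb-like: only even values
target = np.full(16, 50)                       # 16 bins x 50 samples = 800
y = mse_optimal_histogram_match(x, target, np.arange(16))
print(np.histogram(y, bins=np.arange(17))[0])  # flat histogram
print(np.mean((x - y) ** 2))                   # distortion paid by the attack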

3.3 Attacker-aware Detectors

As a consequence of the presence of anti-forensics, the most recent forensic detectors take into account the possibility that adversarial processing took place during the creation and/or distribution of the content. Two examples of such attacker-aware detectors, both targeting image forensics and developed within WP4, are presented hereafter.

Overview

The increased possibility of tampering with digital content has lately cast doubts on the traditional trust in images as representations of reality. To re-establish part of this credibility, a number of digital image forensic techniques have been proposed to detect possible image forgeries. Several of these techniques leverage the statistical footprints left by JPEG compression. Indeed, when an image is compressed using JPEG, the histogram of quantized Discrete Cosine Transform (DCT) coefficients exhibits a characteristic comb-like shape, which can be exploited to find, e.g., the quantization matrix that was used. It has been shown, however, that a knowledgeable attacker can restore the original distribution of transform coefficients, thus hiding the traces left by JPEG compression [41]. By doing so, all forensic techniques that leverage the presence of JPEG footprints can be fooled. The necessity of developing methods to counter anti-forensic attacks is thus more than evident.

In [42, 43] we propose a method, referred to in the following as Countering JPEG anti-forensics, that can be employed by a forensic analyst to detect whether an image has been attacked with the anti-forensic technique proposed in [41] with the purpose of hiding the traces of JPEG compression. In addition, the developed method is able to accurately estimate the original JPEG quality factor. The algorithm is validated by means of large-scale tests on 1338 images taken from the UCID dataset [44].

In [45] we study the combination of JPEG compression and full-frame linear filtering, analyzing their impact on the statistical distribution of the Discrete Cosine Transform (DCT) coefficients of the image. In this work, addressing the Joint detection of full-frame linear filtering and JPEG compression in digital images, we relax the quite strong assumption made in [46] (which is part of WP3) about the knowledge of the compression quantization step, and propose a simple yet very effective forensic tool that is able to jointly detect the filter kernel and the quality factor of the JPEG compression applied to an image, so as to retrieve the entire processing history of the content. We extract a set of significant features from the DCT distributions of the compressed and filtered image and build a linear classifier able to effectively discriminate different combinations of filtering and compression.

Related DC, AT, UC and DS

The tools evaluated in this group represent the following Detector Candidates (DC) and Anti-forensics Tools (AT), described in D5.5:

DC-24 - Image: Joint linear-filtering and JPEG compression factor detection
AT-02 - Image: JPEG compression anti-forensics tool

They address the following Use Cases (UC), described in D5.5:

UC-03 - Image editing: Modifying personal content
UC-04 - Image editing: Modifying and sharing 3rd party content
UC-05 - Image editing: Hiding traces of compression

UC-05 - Image editing: Hiding traces of compression

The following Datasets (DS) mentioned in D5.5 were used:

DS-01 - Image: Uncompressed Nikon camera set
DS-14 - Image: Single, double, and triple JPEG compressed images

3.3.3 Results

In order to evaluate the detectability of attacked images using the method proposed in [42] for Countering JPEG anti-forensics, we carried out a set of large-scale tests on 1338 images of the UCID dataset [44]. To this end, we compressed half of the images at a quality factor randomly picked in the interval. To these images, we applied the anti-forensic attack proposed in [41]. Then, we used our algorithm to label each test image either as an attacked (positive case) or as an uncompressed (negative case) picture. Since the performance of the detector depends on both a threshold and an algorithm parameter n, we evaluated the Receiver Operating Characteristic (ROC) curve by changing the threshold value for different parameters n (see Figure 10). In our experiment, we obtained a value of the Area Under the Curve (AUC) that always exceeds

Figure 10: ROC curve of the tool countering JPEG anti-forensics

The proposed method for the Joint Detection of full-frame linear filtering and JPEG compression in digital images in [45] has been tested using images from the UCID dataset [44]. After configuring the SVM classifier (cf. [45]), 300 images were compressed with a quality factor (QF) belonging to the set {40, 50, 60, 70, 80, 90}. Each compressed image has then been convolved with a filter kernel chosen among a fixed set of linear filters, gathered in Table 6.

Group ID | Filter Kernel
Group 1 | 1. LP Average [3x3]; 2. LP Average [5x5]; 3. LP Gaussian [3x3], σ_F² = 1; 4. LP Gaussian [5x5], σ_F² = 1
Group 2 | 5. LP Gaussian [3x3], σ_F² = ; 6. LP Gaussian [5x5], σ_F² = 0.5
Group 3 | 7. HP Laplacian, α = ; 8. HP Laplacian, α =
Group 4 | 9. LP Laplacian, α = ; 10. HP Laplacian, α =
Group 5 | 11. HP Average [3x3]; 12. HP Average [5x5]
Group 6 | 13. Identity filter

Table 6: Filters grouped according to the similarity of their frequency response.

Note that we group the filters based on the similarity of their frequency responses for the analyzed DCT coefficients, and allow filters within the same group to be classified as the same class. Given the 6 groups used for representing the filters and the 6 considered compression factors QF, a total of 36 classes has to be discriminated by the SVM classifier. The performance of the proposed algorithm is verified in terms of the percentage of correct classification over the image database, jointly for JPEG compression and image filtering. The overall accuracy is 89.2%, as reflected in the first row of Table 7. If we discard the frequency-response similarity among the filters, so as to have a total of 78 classes (13 filters and 6 different compression factors), we reach an average of 74.5%, as shown in the second row of Table 7, thus confirming the confusion introduced in the classification by the similarity among filters. As a last experiment, we tested the robustness of the proposed approach with respect to double compression (simulating a realistic application scenario). To that end, after the first compression and the application of the linear filtering, each image is further compressed using a quality factor QF2 = 90. The accuracy of correct classification is reported in the last row of Table 7, reaching 88.6%.
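As a rough illustration of the kind of pipeline described above, the sketch below extracts simple statistics of the block-DCT coefficient histograms and feeds them to a linear SVM. The specific features, the scikit-learn classifier and all names are assumptions for illustration; they are not the feature set of [45].

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.svm import LinearSVC

def dct_histogram_features(img):
    """Simple per-frequency statistics of 8x8 block-DCT coefficients:
    variance, a kurtosis proxy, and the fraction of near-zero values.
    An illustrative stand-in for the features used in [45]."""
    h, w = (img.shape[0] // 8) * 8, (img.shape[1] // 8) * 8
    blocks = img[:h, :w].reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)
    coeffs = dct(dct(blocks, axis=-1, norm='ortho'), axis=-2, norm='ortho')
    feats = []
    for (u, v) in [(0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (2, 1), (1, 2), (2, 2)]:
        c = coeffs[:, :, u, v].ravel()
        var = c.var() + 1e-9
        feats += [var, np.mean(c ** 4) / var ** 2, np.mean(np.abs(c) < 0.5)]
    return np.asarray(feats)

# Hypothetical training loop: y encodes the (filter group, QF) pair
# as one of the 36 joint class labels described above.
# X = np.stack([dct_histogram_features(i) for i in train_images])
# clf = LinearSVC().fit(X, y)
```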

Detect | # classes | SVM Accuracy
Fgroups + QF | 36 | 89.2%
F + QF | 78 | 74.5%
Fgroups + QF (double compressed with second QF2 = 90) | 36 | 88.6%

Table 7: Experimental results in terms of classification accuracy obtained in different application scenarios.

3.4 Analysis of heterogeneous chains

Most of the detectors developed within WP3 aim to detect the traces left by single processing operations. The analysis of heterogeneous chains was thus one of the tasks of WP4: its goal, rather than focusing on the detection of single operations, is to handle content that underwent several operations, and to detect not only the last operation that took place, but a subset of the operations in the generation chain.

3.4.1 Overview

Image recapture is, on its own, implicitly a complex heterogeneous chain. Rather than being a single atomic operation, image recapture consists of a sequence of operations including resampling due to pre-processing and the geometric recapture setup, low-pass filtering from the monitor and camera lens, demosaicing, built-in camera image compression, as well as image post-processing such as color balancing, gamma correction and edge enhancement. However, developing an understanding of image recapture is important in the context of the REWIND project not only from a purely theoretical perspective on the modelling of heterogeneous chains. Rather, image recapture is an effective technique in the malicious user's arsenal, as it can be applied for claiming ownership of copyrighted data, concealing prior tampering, or distributing low-resolution material while passing it off as high-definition content. Image recapture detection is therefore an important counter-anti-forensics technique to automatically detect non-original material.

In [47] we proposed a technique for automatic image recapture detection that requires no prior training. By drawing parallels between camera projection and image resampling, we showed how image recapture can be seen as a series of local resampling operations. We then described a practical detector and classifier to extract local resampling footprints and distinguish between original, resampled and recaptured image data, identifying the resampling rate where appropriate. The method was tested on the UCID database, where an accuracy rate greater than 95% is achieved under realistic conditions. An example of the different outputs provided by the Image recapture and resampling detector on the three aforementioned classes of images is shown in Figure 11.

Figure 11: Output of the Image resampling and recapture detection on (a) source, (b) recaptured or (c) resampled content.

In a realistic manipulation scenario, a series of operations are involved jointly. For example, a JPEG image could be opened, processed and resaved in JPEG format, forming a heterogeneous processing chain. The tool for the detection of Double JPEG compression in the presence of image contrast enhancement, proposed in [48], allows detecting and reverse engineering a chain composed of a double JPEG compression interleaved with a linear contrast enhancement. The approach is based on the well-known peak-to-valley behavior of the histogram of double-quantized DCT coefficients, and provides a method to detect the presence of such a heterogeneous chain, an estimate of the quality factor of the previous JPEG compression, and the amount of linear contrast enhancement.

3.4.2 Related DC, UC and DS

The tools evaluated in this group represent the following Detector Candidates (DC), described in D5.5:

DC-06 - Image: Recapturing detector
DC-27 - Image: Joint contrast enhancement and JPEG compression factor detection

They address the following Use Cases (UC), described in D5.5:

UC-03 - Image editing: Modifying personal content
UC-04 - Image editing: Modifying and sharing 3rd party content
UC-06 - Image editing: Hiding tampering via recapturing
UC-08 - Image reproduction: Face recognition fraud

The following Datasets (DS) mentioned in D5.5 were used:

DS-05 - Image: Recaptured set
DS-30 - Image: Joint contrast enhancement and JPEG compression factor estimation

3.4.3 Results

The technique for Image recapture and resampling detection proposed in [47] is able to correctly classify original, recaptured and resampled images by analyzing their local resampling footprints, with an AUC of relative to the classification of recaptured and resampled images. The technique has been validated

on the UCID dataset under realistic recapture conditions and 11 different resampling factors, for a total of 440 images. During validation, each tested image was assessed for accuracy in both classification and resampling factor estimation. The method has been compared with the performance of the detector in [49]. While the performance appears similar, it is important to highlight that the method in [49] is limited to the simpler case of resampling detection. Whenever the method in [49] is adopted for recapture detection, it breaks down and offers no classification power for this scenario (see [47] for details). Hence, the proposed detector maintains State-of-the-Art performance for resampling detection while at the same time extending it to the more complex case of compound resampling and recapture detection.

The tool for the detection of Double JPEG compression in the presence of image contrast enhancement [48] was tested on a dataset generated from 300 TIFF images, with different combinations of first (quality factor QF1) and second (quality factor QF2) compression and different amounts of linear contrast enhancement, as DS-26. A set of tests was carried out to analyze the capability of detecting the presence or absence of such a heterogeneous chain. The results are reported in Table 8 by means of AUC values for different combinations of QF1 and QF2, averaging over all possible values of contrast enhancement.

Table 8: Detection performance: AUC values for a subset of pairs (QF1, QF2), obtained by averaging over all possible values of contrast enhancement.

In order to evaluate the capability of jointly estimating the first compression and the amount of contrast enhancement, we refer to Figure 12 and Figure 13. The results are shown as Root Mean Square Error (RMSE) and Accuracy of the estimation of the contrast enhancement and of QF1, respectively.

Figure 12: Estimation of contrast enhancement: RMSE for different combinations of QF1 and QF2.

Figure 13: Accuracy of classification of QF1, for different QF2 and amounts of contrast enhancement.
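The peak-to-valley behaviour that the detector exploits can be captured by a very simple statistic: the histogram of double-quantized DCT coefficients is (nearly) periodic, so a strong non-trivial peak in the Fourier transform of the histogram hints at a previous compression. The sketch below shows this idea; the binning, the skipped frequency range and the score itself are illustrative choices, not the actual detector of [48].

```python
import numpy as np

def double_quantization_score(dct_coeffs, half_range=100):
    """Periodicity score of a DCT-coefficient histogram.

    Double quantization leaves periodic peaks and valleys in the
    histogram; a large non-DC component in its Fourier magnitude is
    evidence of an earlier compression. Illustrative statistic only.
    """
    bins = np.arange(-half_range - 0.5, half_range + 1.5)   # unit-width bins
    hist, _ = np.histogram(dct_coeffs, bins=bins)
    h = hist - hist.mean()
    spectrum = np.abs(np.fft.rfft(h))
    # Skip the lowest frequencies, which encode the histogram envelope.
    return spectrum[3:].max() / (spectrum.mean() + 1e-9)
```

A decision can then be taken by comparing the score against a threshold learned on singly-compressed images; sweeping the assumed first quantization step would similarly allow estimating QF1.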

3.5 Conclusions

The table below summarizes the methods presented in this document together with a single score value indicating their performance. Some tools have been evaluated by means of accuracy; for other tools, the Area Under the Curve (AUC) of the ROC curve is provided. A target score value is also indicated: whenever possible, this value was obtained directly from State-of-the-Art methods presented for the same problem. However, for methods breaking completely new ground, for which a prior State-of-the-Art is not available, results from the closest scenario considered in the literature were reported. Similarly, we have also reported results for methods tackling the same challenge but with significantly different evaluation settings. These two cases are indicated by an asterisk * next to the result. The three symbols below indicate whether a target can be considered achieved or not:

- : Results exceed the SotA with thorough testing in realistic conditions.
- : Results are on level with or better than the closest SotA, where available. However, there might not be a close match in the literature for the proposed method.
- Х: Results are significantly worse than the SotA or below an acceptability threshold.

A fourth target completion class could be defined for methods ready to be marketed. However, this is beyond the scope of this FET project.

Method | Reference | Proposed AUC | Target AUC | Target Reference | Target achieved?
Multi-Clue Image Forgery Localization | [29] | | N/A | |
Countering JPEG anti-forensics | [42] | | * | [50] |
Image Recapture and Resampling Detector | [47] | | * | [49] |
Double JPEG compression in presence of image contrast enhancement | [48] | | N/A | |

Method | Reference | Proposed Accuracy | Target Accuracy | Target Reference | Target achieved?
Combined Inverse Decoder | [33] | | 86.6%* | [35] |
Optimal Counter-forensics for Histogram-Based Forensics Tool | [37] | 94.7% | 50.4% | [36] |
Joint Detection of full-frame linear filtering and JPEG compression in digital images | [45] | 89.2% | * | [46] |

Table 9: Results from the proposed methods and target scores. Whenever the target is taken from techniques that differ significantly in terms of scenario or evaluation conditions, this is indicated by an asterisk *.

Concerning the Multi-clue image forgery localization task, based on experimental validation we can conclude that the target accuracy has been achieved through the use of the multi-clue analysis framework [29]. A comparison with the State-of-the-Art was not possible because no comparable tools had been published by the end of the REWIND project.

Regarding the success in fooling the detector in [38], our Optimal Counter-forensics for histogram-based detectors strategy presented in [37] largely improves on the accuracy obtained by the attack proposed in [36]. Moreover, the PSNR obtained by averaging the MSE using [36] is more than 9 dB worse than ours. In conclusion, our strategy outperforms the targeted SotA method. Nevertheless, given that the PSNR is known to be an inadequate distortion measure in perceptual terms, this topic cannot be considered solved, and in the future we will address the use of other measures such as the SSIM. Although the SSIM is non-convex, it would be worth deriving a convex proxy that could speed up the optimization proposed here.

Concerning the Tool for countering JPEG compression anti-forensics [42], we evaluated the performance of detecting compressed and uncompressed images, comparing results with the method proposed in [50]. Notice that the method in [50] does not take into account the use of anti-forensics tools: it is only able to detect whether an image has been compressed or not, provided no other operations are applied afterwards. Thus, for a fairer comparison, we tested [50] on content generated without using anti-forensics tools, and our method in the more challenging scenario where anti-forensics operations are present. Despite the task for [42] being harder, the AUC that we obtain is greater than that of [50]. This means that the proposed method works very well in detecting compressed images, even though the JPEG traces have been deliberately hidden. As a limitation of this method, notice that the higher the JPEG quality used (e.g., > 95), the worse the performance.

The method proposed in [45], performing the Joint detection of full-frame linear filtering and JPEG compression in digital images, cannot be compared to any similar method found in the literature. The value reported as target is taken from our previous work in [46], where only the detection of the applied linear filter is taken into account. This framework may be regarded as a first approach to analyze and successfully classify JPEG images that have been further post-processed with full-frame linear filtering.

The problem of image recapture within heterogeneous chains is effectively dealt with by the detector in [47], i.e., the Image recapture and resampling detector, which is able to correctly differentiate between original, recaptured and resampled images. The performance is more than satisfactory, and the method has been validated on realistic data. Moreover, it offers the same accuracy as the method in [49], which is limited to the simpler case of resampling detection, while at the same time extending it to the recaptured case. In terms of limitations, the method in [47] is currently unable to detect recaptured images if the overall equivalent resampling factor is less than 2, including all cases of downsampling.
Alternative methods will have to be investigated to cover these cases.

Concerning the Double JPEG compression in the presence of image contrast enhancement task, based on experimental validation we can conclude that the target accuracy (i.e., AUC) has been achieved through the use of the framework proposed in [48] in terms of detecting such a heterogeneous chain. For the estimation of the parameters, in particular the amount of contrast enhancement, the target accuracy (RMSE) has not been achieved. A comparison with the State-of-the-Art was not possible because no comparable tools have been published.

The Combined inverse decoder approach applied to audio tampering detection, proposed in [33], proved to be applicable and effective. The algorithm has been extensively tested with a realistic tampered-audio dataset, which has been made public in [34]. Table 9 reports the accuracy averaged across the tampering detection test cases A, B and C, as presented in Section 3.1.3, for both the State-of-the-Art algorithm in [35] and the algorithm proposed in [33]. The accuracy is computed by using only files having no codec traces or MP3 codec. The proposed algorithm shows an accuracy of %, which is higher than the 86.6% achieved by the State of the Art. Moreover, the proposed algorithm can also handle content encoded with MP3PRO, AAC and HE-AAC, thus achieving a much wider applicability than [35].

3.6 Bibliography

[26] T. Bianchi, A. De Rosa, and A. Piva, "Improved DCT coefficient analysis for forgery localization in JPEG images," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[27] T. Bianchi and A. Piva, "Analysis of non-aligned double JPEG artifacts for the localization of image forgeries," in IEEE International Workshop on Information Forensics and Security (WIFS).
[28] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva, "Image forgery localization via fine-grained analysis of CFA artifacts," IEEE Transactions on Information Forensics and Security.
[29] M. Fontani, E. Argones-Rua, C. Troncoso, and M. Barni, "The watchful forensic analyst: Multi-clue information fusion with background knowledge," in IEEE International Workshop on Information Forensics and Security (WIFS), Nov. 2013, pp. 120-125.
[30] J. Herre and M. Schug, "Analysis of Decompressed Audio - The Inverse Decoder," in Proceedings of the 109th AES Convention, Los Angeles, 2000.
[31] S. Moehrs, J. Herre, and R. Geiger, "Analysing decompressed audio with the Inverse Decoder - towards an operative algorithm," in Proceedings of the 112th AES Convention, Munich, 2002.
[32] P. Bießmann, D. Gärtner, C. Dittmar, P. Aichroth, M. Schnabel, G. Schuller, and R. Geiger, "Estimating MP3PRO Encoder Parameters From Decoded Audio," in Proceedings of the 2nd Workshop Audiosignal- und Sprachverarbeitung, Koblenz, 2013.
[33] D. Gärtner, C. Dittmar, P. Aichroth, L. Cuccovillo, S. Mann, and G. Schuller, "Efficient Cross-Codec Framing Grid Analysis For Audio Tampering Detection," in Proceedings of the 136th AES Convention, Berlin, 2014.
[34] D. Gärtner, L. Cuccovillo, S. Mann, and P. Aichroth, "A multi-codec audio dataset for codec analysis and tampering detection," in Proceedings of the 54th AES Conference on Audio Forensics, London, 2014.
[35] R. Yang, Z. Qu, and J. Huang, "Detecting digital audio forgeries by checking frame offsets," in Proceedings of the 10th ACM Workshop on Multimedia and Security (MM&Sec).

[36] P. Sutthiwan and Y. Q. Shi, "Anti-forensics of double JPEG compression detection," in Lecture Notes in Computer Science, Proc. of the International Workshop on Digital Forensics and Watermarking, Atlantic City, NJ, October 2011, vol. 7128, Springer.
[37] P. Comesaña and F. Pérez-González, "Optimal counterforensics for histogram-based forensics," in IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[38] T. Pevny and J. Fridrich, "Detection of double-compression in JPEG images for applications in steganography," IEEE Transactions on Information Forensics and Security, vol. 3, no. 2, June 2008.
[39] M. Barni, M. Fontani, and B. Tondi, "A universal technique to hide traces of histogram-based image manipulations," in Proc. of ACM Workshop on Multimedia and Security, Coventry, UK, September 2012.
[40] G. Schaefer and M. Stich, "UCID - An Uncompressed Colour Image Database," in Proc. of SPIE: Storage and Retrieval Methods and Applications for Multimedia 2004, San Jose, CA, vol. 5307.
[41] M. C. Stamm and K. J. R. Liu, "Anti-forensics of digital image compression," IEEE Transactions on Information Forensics and Security, vol. 6, no. 3, 2011.
[42] G. Valenzise, V. Nobile, M. Tagliasacchi, and S. Tubaro, "Countering JPEG anti-forensics," in IEEE ICIP 2011.
[43] G. Valenzise, M. Tagliasacchi, and S. Tubaro, "Revealing the traces of JPEG compression anti-forensics," IEEE Transactions on Information Forensics and Security, vol. 8, no. 2, 2013.
[44] G. Schaefer and M. Stich, "UCID: an uncompressed color image database," in Proc. SPIE: Storage and Retrieval Methods and Applications for Multimedia, 2004, vol. 5307.
[45] V. Conotter, P. Comesaña, and F. Pérez-González, "Joint detection of full-frame linear filtering and JPEG compression in digital images," in IEEE International Workshop on Information Forensics and Security (WIFS), Guangzhou, China, 2013.
[46] V. Conotter, P. Comesaña, and F. Pérez-González, "Forensic analysis of full-frame linearly filtered JPEG images," in IEEE International Conference on Image Processing (ICIP), Melbourne, Australia, 2013.
[47] M. Visentini-Scarzanella, P. Bestagini, M. Tagliasacchi, S. Tubaro, and P. L. Dragotti, "Image Recapture Detection via Local Resampling Analysis," submitted to the 21st IEEE International Conference on Image Processing (ICIP 2014), Paris, France, October 2014.
[48] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva, "Reverse engineering of double compressed images in the presence of contrast enhancement," in IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), pp. 141-146, Sept./Oct. 2013.
[49] M. Kirchner and T. Gloe, "On Resampling Detection in Recompressed Images," in IEEE International Workshop on Information Forensics and Security (WIFS), 2009.
[50] Z. Fan and R. L. de Queiroz, "Identification of bitmap compression history: JPEG detection and quantizer estimation," IEEE Trans. Image Process., vol. 12, no. 2, February 2003.

4 WP7. Technical achievements and Performance analysis

4.1 Overview

As indicated in the DOW, the REWIND project aims at synergistically combining principles of signal processing, machine learning and information theory to answer relevant questions on the past history of multimedia objects, leveraging the footprints left by each processing step applied to the considered content. To understand the past history of multimedia objects, we need to perform multimedia object linking and multimedia phylogeny analysis. These are exactly the main objectives of WP7.

The work within WP7 is articulated along five fronts: (i) the creation/collection over time of a representative corpus containing multimedia-related objects; (ii) approaches for dealing with object dependencies over time in large-scale scenarios; (iii) exact and heuristic approaches for reconstructing the past history relations of a set of multimedia objects; (iv) approaches for dealing with phylogeny forests; and (v) approaches for dealing with multiple parenting relationships.

During the activity of the REWIND Inco Extension, we have achieved significant progress within WP7 towards the development of algorithms for multimedia phylogeny. As a result of this working period, a range of algorithms is now available, and a large part of them are (or will shortly be) sufficiently mature to be part of the phylogeny toolbox that will be delivered in the final WP7 report.

4.1.1 Related PT, UC, RA and DS

The tools evaluated in this group represent the following Phylogeny Tools (PT), described in D5.5:

PT-01 - Image: Reconstructing Image Phylogeny Trees
PT-02 - Image: Reconstructing Image Phylogeny Forests
PT-03 - Image: Reconstructing Multiple Parenting Relations
PT-05 - Video: Reconstructing parent sequence

They address the following Use Cases (UC), described in D5.5:

UC-26 - Image phylogeny: Modifying and sharing an image multiple times
UC-29 - Image phylogeny: Modifying and sharing semantically similar images
UC-30 - Image phylogeny: Combining and sharing multiple images several times
UC-28 - Video editing: Splicing videos to generate compilation sequences

The following Relationship Annotators (RA) mentioned in D5.5 were exploited:

RA-01 - Image: Dissimilarity

The following Datasets (DS) mentioned in D5.5 were used:

DS-15 - Image: Image phylogeny tree set with complete scenarios
DS-16 - Image: Realistic image phylogeny tree set
DS-17 - Image: Multiple parents set
DS-25 - Video: Near-duplicate video sequences
DS-26 - Image: Image phylogeny tree set with missing links

DS-27 - Image: Large scale image phylogeny tree set
DS-28 - Image: Image phylogeny forest set (Dataset A)
DS-29 - Image: Image phylogeny forest set (Dataset B)

4.2 Evaluation Methodology

4.2.1 Metrics

To evaluate the phylogeny algorithms devised in this project, we need both controlled datasets, for which we know the structure of modifications and the evolution of the documents over time, and uncontrolled datasets, in which we assess how the approaches perform in real-world cases. In addition, for all datasets, we need specific metrics to evaluate how good the approaches are. Thus, in this section, we introduce and describe the quantitative metrics used to evaluate reconstructed trees and forests in both controlled and uncontrolled scenarios.

4.2.1.1 Metrics for Controlled Environments

In the methodology introduced by Dias et al. [51] for the validation of Image Phylogeny Trees (IPTs), we look at four different quantitative metrics (Roots, Edges, Leaves and Ancestry) to evaluate a reconstructed tree in scenarios where the ground truth is available (controlled scenarios):

Root(IPT1, IPT2) = 1 if root(IPT1) = root(IPT2), and 0 otherwise
Edges(IPT1, IPT2) = |E(IPT1) ∩ E(IPT2)| / (n − 1)
Leaves(IPT1, IPT2) = |L(IPT1) ∩ L(IPT2)| / |L(IPT1) ∪ L(IPT2)|
Ancestry(IPT1, IPT2) = |A(IPT1) ∩ A(IPT2)| / |A(IPT1) ∪ A(IPT2)|

where | · | is the cardinality of a set, E(·), L(·) and A(·) denote the sets of edges, leaves and ancestry relations of a tree, and n denotes the total number of nodes of the known Image Phylogeny Tree.

With the first metric, we analyse whether an algorithm finds the correct root of a tree. With the Edges metric, we evaluate how many edges are in common between the tree output by an algorithm and the ground-truth tree. The third metric, Leaves, evaluates how many leaves the tree found by the algorithm has in common with the ground truth, while the Ancestry metric measures how many correct ancestors of each possible node are in common with the ground truth. According to the above metrics, if the reconstructed IPT's root matches the root of the reference tree, it is correct. If an edge connects two nodes in the IPT and the same holds in the reference tree, it is correct. The same applies to leaves and ancestry information.

Figure 14 shows an example of an original image phylogeny tree, as well as a reconstructed tree and the metrics used to evaluate the reconstruction process. Considering the Root evaluation metric, for instance, the reconstructed tree has Root(IPT1, IPT2) = 1, given that the algorithm found the correct root of the reference tree. According to the Edges evaluation metric, the reconstructed tree has Edges(IPT1, IPT2) = |{(1,6), (6,4)}| / 5 = 2/5 = 40% correct edges. The same reasoning applies to the other metrics.
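The four metrics are straightforward to compute from the parent-vector representation described in Figure 14 below. The sketch assumes trees are given as dictionaries mapping each node to its parent, with None marking the root; this encoding is an assumption for illustration.

```python
def ancestors(parent, node):
    """All ancestors of `node` in a tree given as a parent map."""
    result = set()
    while parent[node] is not None:
        node = parent[node]
        result.add(node)
    return result

def ipt_metrics(gt, rec):
    """Root/Edges/Leaves/Ancestry scores between two trees on the
    same node set, following the definitions of Dias et al. [51]."""
    nodes = set(gt)
    root = lambda p: next(n for n in p if p[n] is None)
    edges = lambda p: {(n, p[n]) for n in p if p[n] is not None}
    leaves = lambda p: nodes - {p[n] for n in p if p[n] is not None}
    ancestry = lambda p: {(n, a) for n in p for a in ancestors(p, n)}
    jaccard = lambda a, b: len(a & b) / len(a | b)
    return {
        'root': float(root(gt) == root(rec)),
        'edges': len(edges(gt) & edges(rec)) / (len(nodes) - 1),
        'leaves': jaccard(leaves(gt), leaves(rec)),
        'ancestry': jaccard(ancestry(gt), ancestry(rec)),
    }

# In the spirit of Figure 14: node -> parent, with None for the root,
# e.g. gt = {1: 6, 2: 4, 3: 1, 4: 6, 5: 4, 6: None}; ipt_metrics(gt, rec)
```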

Figure 14: On the left, an example of an original tree encompassing an original image and its transformations to create near-duplicates. On the right, the reconstructed phylogeny tree of transformations and near-duplications. On the bottom, a table with details about the data structures necessary to run and evaluate an IPT reconstruction algorithm. The vector representation [6,4,1,...] stands for the tree representation in which the index represents a node and the number in an index position represents the parent of that node. For instance, the parent of node 1 is node 6, the parent of node 2 is node 4, and so on.

To evaluate Image Phylogeny Forests (IPFs), a similar methodology is used, in which the metrics are adapted and calculated according to the following equation:

M(IPF1, IPF2) = |S1 ∩ S2| / |S1 ∪ S2|

where M is the evaluation metric of interest (root, edges, leaves or ancestry), IPF1 is the original (ground-truth) forest, IPF2 is the reconstructed forest, S1 is the set of elements in the first forest corresponding to one of the metrics (e.g., the set of roots of the first forest), and S2 is the equivalent for the reconstructed forest. For instance, to obtain the Root metric, we calculate the intersection of the roots found in the ground-truth forest with the roots found by the forest reconstruction algorithm, and normalize it by the union of both sets. As an example, suppose the algorithm finds three roots S2 = {r1, r2, r3} and two of

them turn out to be correct with respect to the reference forest S1 = {r1, r3}. The root metric then yields:

Root(IPF1, IPF2) = |S1 ∩ S2| / |S1 ∪ S2| = 2/3

4.2.1.2 Metrics for Uncontrolled Environments

For uncontrolled environments, we do not have any definitive proof of a parent-child relationship between any two images. However, we are able to define some metrics to evaluate the robustness of an image phylogeny algorithm and its associated input dissimilarity matrix.

For that, we can select one image I_a in a set of near-duplicate images I and artificially generate a near-duplicate I_b = T_β(I_a), where T_β denotes a family of pre-defined operations T with respective parameters β. We devise four error metrics Er_i (i ∈ {1, ..., 4}) and one accuracy metric (P) that compare the tree reconstructed from the original set with the tree reconstructed from the set augmented with the new near-duplicate I_b. Er_1 is one if the new node I_b is not a child of its generating node I_a, and zero otherwise. Er_2 is one if the structure of the tree relating the nodes in the original set changes with the insertion of the new node I_b into the set, and zero otherwise. Er_3 is one if the new node I_b appears as a parent of another node of the original tree, and zero otherwise. Er_4 is one if the root of the tree reconstructed from the original set differs from the root of the tree reconstructed from the set augmented with I_b, and zero otherwise. P is one if the reconstructed tree is perfect compared to the original tree (Er_1 = Er_2 = 0), and zero otherwise.

As an example, suppose we have a tree with two images (nodes X and Y). We do not know the relationship between X and Y, as this is an uncontrolled scenario. However, suppose we select X to artificially generate an offspring (node Z). After the insertion of node Z into the original tree, we expect a phylogeny algorithm to preserve the relationship originally found between X and Y and to put Z as a descendant of X. If this happens, we can say the algorithm is reliable. This is exactly what the four errors devised above seek to measure.

4.2.1.3 Metrics for Multiple Parents Evaluation

Multiple parenting phylogeny studies the phylogeny of images created through the combination of content of two or more other images. In the most common scenario, we have three types of images: hosts, aliens and compositions, each one related to a near-duplicate set. The composition is the result of inserting a portion of an alien image into a host image.

For evaluating multiple parenting relationships, in addition to the four metrics explained in Section 5.2.1, a new metric named subset was created, which measures whether images with the same semantic content end up in the same trees of the reconstructed forest. In other words, it measures whether the image phylogeny forest algorithm correctly separates the images into meaningful groups. First, we define the set:

S(F_x) = {(I_A, I_B) : τ(I_A, F_x) = τ(I_B, F_x)}

where F_x, with x ∈ {R, GT}, is a reconstructed (R) or a ground-truth (GT) forest, (I_A, I_B) is a generic couple of images, and π(I, F) is a function returning the parent of an image I in the forest F. Finally, τ(I, F) returns the tree to which the image I belongs in the forest F, and ρ(F) gives the roots of F. The subset metric is defined as:

subset(F_R, F_GT) = |S(F_R) ∩ S(F_GT)| / |S(F_R) ∪ S(F_GT)|

The subset metric is important because it gives information about the separation of the host, alien and composition subsets. Additionally, to evaluate the results of our multiple parenting approach, we introduce the metrics composition root (CR), host parent (HP) and alien parent (AP), which test whether such nodes were correctly found in each test case. We also employ the metrics composition node (CN), host node (HN) and alien node (AN), used to check whether the composition root and the host and alien parents are, respectively, composition, host and alien images. This second set of metrics is used to evaluate the classification of the trees.

4.3 How far we can get?

When evaluating a multimedia phylogeny algorithm, it is important to keep in mind what the results mean and how far we can get when developing new solutions. In this regard, below we present an evaluation system which reflects our level of satisfaction with the research. For each algorithm we developed within the project regarding multimedia phylogeny, there is a target satisfaction level. Whenever possible, this value is obtained directly from State-of-the-Art methods presented for the same problem. However, for methods breaking completely new ground, for which a prior State-of-the-Art is not available, results from the closest scenario considered in the literature are reported. The three symbols below indicate whether a target can be considered achieved or not:

: Results exceed the SotA with thorough testing in realistic conditions.
: Results are on level with or better than the closest SotA, where available. However, there might not be a close match in the literature for the proposed method.
Х: Results are significantly worse than the SotA or below an acceptability threshold.

A fourth target completion class could be defined for methods ready to be marketed. However, this is beyond the scope of this FET project.

4.4 Results

In the following sections, we present the results we achieved and discuss our satisfaction level regarding each one.

4.4.1 Contribution #1

Multimedia Phylogeny investigates the history and evolutionary process of digital objects, which includes finding the causal and ancestry relationships among documents, the sources of the modifications, and the order of the transformations that originally created the set of near-duplicates. Multimedia Phylogeny has direct applications in security, forensics, and information retrieval. In this contribution, we explore the phylogeny

problem for near-duplicate images in large-scale scenarios, and present solutions that extend straightforwardly to other media such as videos. Experiments with about two million test cases (with synthetic and real data) show that our methods automatically build image phylogeny trees from partial information about the near-duplicates, improving the efficiency and effectiveness of the whole process, and represent a step forward in determining causal relationships of digital images over time.

The most expensive task in reconstructing an image phylogeny tree is the dissimilarity matrix calculation. Often we need to deal with hundreds or thousands of near-duplicate images or videos at a time, and 11.5 computing days to calculate the dissimilarity matrix before using an IPT reconstruction algorithm for n = 1,000 images is prohibitive. In this contribution, also described in a previous REWIND report, we expand upon the best-known image phylogeny solution (SotA), Oriented Kruskal [51], for dealing with large-scale image phylogeny problems. We introduce three methods able to reconstruct an image phylogeny tree from a partially complete dissimilarity matrix. The results in this branch led to our paper published in [52].

Figure 15: Results considering the (In)Direct Ancestry Correction heuristic compared to the State-of-the-Art.

As presented in the report, we design and implement three heuristics for the problem: Grandpa, Direct, and (In)Direct. All of them represent major improvements over the State-of-the-Art. Figure 15 shows how the best proposed heuristic, (In)Direct Ancestry, compares against Oriented Kruskal [51]. As we can see, for all metrics, the proposed solution outperforms the SotA. For instance, when we have access to only 20% of the entries in the dissimilarity matrix, the SotA solution correctly finds the root of the tree in about 30% of the cases. Our solution, on the other hand, finds the correct root in about 100% of the cases. Therefore, we can tag this contribution as : Results exceed the SotA with thorough testing in realistic conditions.
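For reference, the sketch below shows an Oriented-Kruskal-style greedy reconstruction that naturally tolerates a partially computed dissimilarity map: pairs that were never computed are simply never considered. This is a simplified, illustrative variant in the spirit of [51]; it is not the Grandpa/Direct/(In)Direct heuristics of [52].

```python
def greedy_ipt(dissim, nodes):
    """Greedy tree reconstruction from a (possibly partial) dissimilarity map.

    `dissim` maps ordered pairs (u, v) to the cost of the directed edge
    u -> v; missing pairs are just absent from the map. Edges are scanned
    in increasing cost and accepted if the child has no parent yet and
    no cycle would be created. Illustrative variant of Oriented Kruskal.
    """
    parent = {n: None for n in nodes}

    def creates_cycle(u, v):
        # A cycle appears iff v is already an ancestor of u (or u == v).
        while u is not None:
            if u == v:
                return True
            u = parent[u]
        return False

    for (u, v), cost in sorted(dissim.items(), key=lambda kv: kv[1]):
        if parent[v] is None and not creates_cycle(u, v):
            parent[v] = u
    return parent   # node -> parent; roots keep None
```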

4.4.2 Contribution #2

With the second contribution, we aim at automatically identifying the structure of relationships underlying a set of images, correctly reconstructing their past history and ancestry information, and grouping them into distinct trees of processing history. We introduce a new algorithm that automatically handles sets comprising different related images, and outputs the phylogeny trees (also known as a forest) associated with them. For this problem, we present a statistics-based approach that can infer the number of trees in a set and then use a standard phylogeny algorithm (e.g., [51]) for finding the trees. Results from this approach have been published in [53].

Table 10 shows the results for the Oriented Kruskal [51] phylogeny algorithm considering different numbers of trees per forest. In this case, the algorithm requires input from the user regarding the number of trees to reconstruct. As we deal with controlled experiments here, we feed the algorithm with the correct required parameter k. The algorithm is robust in scenarios with a single camera and with different cameras. For instance, with forests of five trees, the algorithm can successfully find the roots of such trees in about 92% of the cases in a scenario with near-duplicates from semantically similar images coming from multiple cameras.

Table 10: Reconstructing a forest of size F ∈ {1, ..., 5} trees using the Oriented Kruskal (OK) algorithm [51]. This algorithm requires input from the user for the size of the forest to reconstruct. Results are relative to the ground truth.

However, the algorithm clearly has a major drawback: it requires the number of trees to look for in the forest. Table 11 shows the same results as Table 10, but in a realistic situation in which we do not know the number of trees in the forest: without this information, the performance of the State-of-the-Art Oriented Kruskal [51] decreases with the number of trees in the two most important metrics to consider, roots and ancestry. For edges and leaves, a tree reconstruction algorithm normally behaves similarly for trees and forests since, in the case of a forest, there is a difference of only a few edges. Therefore, an automatic solution for finding the number of trees in the forest is paramount, and this is where our contribution shines.

Table 12 depicts the results of our Automatic Oriented Kruskal (AOK) algorithm with respect to the baseline proposed in [51]. For instance, AOK is only 2% worse than the baseline when finding the

roots of the trees in a forest with five trees. In addition, the algorithm correctly finds the ancestors of all images (parents, grandparents, great-grandparents, etc.) in 77% of the cases, which represents only a 0.5% decrease compared with the baseline in Table 10. This is a major result of the proposed algorithm, since it performs statistically on par with the State-of-the-Art approach without requiring any input from the user with respect to the number of trees in the forest. Note also that the algorithm improves the results for finding the roots and ancestors of all trees in the forest without sacrificing the edges and leaves metrics.

Table 11: Reconstructing a forest of size F ∈ {1, ..., 5} trees using the Oriented Kruskal (OK) algorithm [51] with no information about the size of the forest to reconstruct. Results are relative to the baseline in Table 10. The redder the value, the worse the metric; the bluer, the better. Our contribution in this work relies on improving upon such results.

Table 12: Reconstructing a forest of size F ∈ {1, ..., 5} trees using the proposed Automatic Oriented Kruskal (AOK) algorithm. Results are relative to the baseline in Table 10. The bluer the value, the better.

Therefore, we can tag this contribution as : Results exceed the SotA with thorough testing in realistic conditions.
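The key idea of AOK, as described above, is to decide statistically which edges of a single reconstructed tree actually join unrelated trees. A minimal stand-in for such a test is shown below: edges whose dissimilarity is an outlier with respect to the accepted edge weights are cut, turning their children into new roots. The outlier rule (mean + k·std) is an illustrative assumption, not the actual criterion of AOK [53].

```python
import numpy as np

def cut_outlier_edges(parent, edge_cost, k=2.0):
    """Split a reconstructed tree into a forest by cutting outlier edges.

    `parent` maps node -> parent (None for roots); `edge_cost` maps
    (child, parent) -> dissimilarity of the accepted edge. Edges whose
    cost exceeds mean + k * std are assumed to join unrelated trees.
    """
    costs = np.array(list(edge_cost.values()))
    threshold = costs.mean() + k * costs.std()
    forest = dict(parent)
    for (child, par), cost in edge_cost.items():
        if cost > threshold:
            forest[child] = None   # child becomes the root of a new tree
    return forest
```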

4.4.3 Contribution #3

In this research branch, we explore an optimum branching algorithm for reconstructing the evolution tree associated with a set of image documents. Results from this contribution have been published in [58].

In the literature, the problem called Optimal Branching (OB) deals with the construction of minimum spanning trees on directed graphs with known roots. Solutions to this problem were proposed independently by Chu and Liu [54], Edmonds [55], and Bock [56]. The version proposed by Edmonds [55] is recursive and receives as input a graph G with n nodes and weights associated with the edges, as well as a special vertex r (the root of the branching). For the purposes of this work, we consider a black-box C++ implementation of the Chu-Liu, Bock and Edmonds optimum branching algorithm as described by Tarjan [57].

For validation of the algorithms in this work, we follow the methodology introduced by Dias et al. [51], and we compare the newly developed algorithm to Oriented Kruskal [51] in a controlled environment with full trees and with missing links. Figure 16 depicts the summary results for the algorithm presented in this research branch compared to Oriented Kruskal. In the chart, the x-axis denotes the number of nodes in the tested tree, while the y-axis denotes the percentage of correct reconstructions (correct score) according to the four metrics used (Root, Edges, Leaves, and Ancestry). For instance, for trees with 10 nodes, Oriented Kruskal (OK) correctly finds the root in 98.6% of the cases. The Chu-Liu, Bock and Edmonds (CLBE) algorithm outperforms both approaches and finds the correct root in 99.3% of the cases. As we consider 125,000 test cases in the validation, CLBE outperforms OK at finding the correct root in 875 cases.

(a) Oriented Kruskal [chart: correct score vs. number of nodes in the tree, for the Root, Edges, Leaves and Ancestry metrics]

(b) Chu-Liu, Bock and Edmonds [chart: correct score vs. number of nodes in the tree, for the Root, Edges, Leaves and Ancestry metrics]

Figure 16: Summary results for (a) Oriented Kruskal and (b) Chu-Liu, Bock and Edmonds algorithms for image phylogeny tree reconstruction considering the scenario with complete trees (no missing links).

Therefore, we can tag this contribution as : Results exceed the SotA with thorough testing in realistic conditions.

4.4.4 Contribution #4

Without user intervention, an Image Phylogeny Forest (IPF) reconstruction algorithm relies on the choice of a good threshold to correctly decide the number of trees in the forest in advance. In this fourth contribution, we propose a new approach to automatically reconstruct image phylogeny forests using the Optimum Branching algorithm proposed in [54], introducing two new methods: the Automatic Optimum Branching algorithm (AOB) and the Extended Automatic Optimum Branching algorithm (E-AOB). Furthermore, we also propose a new fusion approach combining the results given by each of the methods developed so far for IPF reconstruction (AOK [53], AOB, and E-AOB), in such a way that errors introduced by one method can be fixed by the other method(s). The contribution presented in this section is currently submitted and under review at the IEEE Transactions on Information Forensics and Security (TIFS) [59].

For evaluating the reconstructed IPFs, we consider the same quantitative metrics (roots, edges, leaves and ancestry) introduced by Dias et al. [51], considering scenarios where the ground truth is available. For our experiments in a controlled scenario, we consider images taken with a single camera (OC) and with multiple cameras (MC) having similar scene semantics (the main content of the image is the same, but with small variations in the camera parameters, such as viewpoint, zoom, etc.). In our experiments, we analysed the robustness of AOK, AOB, and E-AOB in two parts: (a) using each method separately and (b) their possible combinations C = {AOK x AOB, AOK x E-AOB, AOB x E-AOB, AOK x AOB x E-AOB}.

(a) Semantically similar images from the scenario using a single camera (OC)
(b) Semantically similar images using multiple cameras (MC)

Table 13: Comparison among AOK [53] and the variations of the AOB algorithm for single and multiple cameras.

(a) Semantically similar images from OC
(b) Semantically similar images from MC

Table 14: Comparison among AOK [53] and the variations of the AOB algorithm for single and multiple cameras.

A Wilcoxon signed-rank test was performed for all metrics on Dataset B, comparing (i) AOK with AOB, and (ii) AOK with E-AOB. In Table 14, the results of this test are reported in the last row, with blue dots indicating that the differences among the results are statistically significant, at the 95% confidence level, in favour of

AOB or E-AOB, while red crosses represent a statistically significant difference in favour of the baseline AOK. In case (i), a statistically significant difference in favour of AOK was found for the metric roots in the OC scenario, and for the metrics roots and ancestry in the MC scenario. These results show that AOB is only able to improve the results of AOK with regard to the metrics edges and leaves. On the other hand, when comparing AOK and E-AOB (case ii), all differences are statistically significant in favour of E-AOB, confirming that this method performs better than the State-of-the-Art method presented in the literature [53].

Using as baseline the results presented in Table 13, the best results for the fusion approach were found for the combination AOK x AOB x E-AOB. Table 14 shows the results for this fusion and the error variation Δ_error in comparison to AOK and to the current best-performing algorithm, E-AOB. The error variation was calculated with respect to each metric (roots, edges, leaves, ancestry), using the same error-variation equation introduced in [58], in which M1 represents the method being evaluated in comparison to method M2. In the fusion (AOK x AOB x E-AOB), the OC scenario had lower performance only on dataset B, introducing more error for the metrics roots and ancestry when the forest has more than eight trees. However, after running a Wilcoxon signed-rank test on these results, no statistically significant difference was found among them (represented by the green dashes in the last row of the table). For the other metrics in OC, and for all metrics in the MC scenario, the differences are statistically significant at the 95% confidence level in favour of the fusion approach. Therefore, we can tag this contribution as : Results exceed the SotA with thorough testing in realistic conditions.

4.4.5 Contribution #5

In the Image Phylogeny problem, it was assumed that each image may inherit content from at most one parent image, hence the resulting graph structure is a tree. In this research branch, we extend the original formulation of the Image Phylogeny problem to deal with situations where an image may inherit content from multiple different parents. This scenario arises when an image is a composition created through the combination of content from different source images. This combination can be done by, for example, removing some content of one image and pasting it into another, or arranging some images under a new frame. With Multiple Parenting Phylogeny we aim to identify the phylogeny as well as the relationships existing in a set of images.

Our approach to the multiple parenting phylogeny problem in a set of images is based on first identifying the groups of near-duplicates existing in the set and reconstructing their individual phylogenies. After separating the groups, we find the ones that represent compositions, as well as the ones representing their sources, finally looking into the source groups for the exact images used to create each of the compositions. Results from this approach have been submitted to the IEEE International Conference on Image Processing (ICIP) [60] and are currently under review.

For evaluation, 600 test cases were used, all extracted from dataset DS-17. The cases were equally divided between Direct Pasting and Poisson Blending compositions. A smaller, disjoint set of 100 test cases of each type of composition was used to tune parameters.

Table 15 shows the results for the forest algorithm. Two variations of the algorithm were used: a modification of Oriented Kruskal [51] to identify exactly three trees (K3T), and the Automatic Oriented Kruskal [53] (AOK). K3T is used as our upper bound, as it works under the assumption that we know the exact number of trees in the forest. To evaluate the results, we applied the metrics defined by Dias et al. [51] (roots, edges, leaves and ancestry), plus a new one named subset. It verifies whether nodes that are in the same tree in the ground truth end up in the same tree in the result, and is used to measure the separation of the trees. Most of the results were comparable for both algorithms, with the K3T results slightly higher, especially for the metric roots. Considering that we do not know the exact number of trees in a real scenario, the results show that AOK still obtains good accuracy even when this information is absent. We observed that in about 30% of the test cases, AOK does not find three trees. The high results for the subset metric are of special importance, as they show the alien, host and composition nodes are correctly grouped together in the reconstructed forest. A bad grouping could lead to wrong results in the subsequent steps of the method.

Table 15: Forest algorithm results (roots, edges, leaves, ancestry and subset, in %) for finding near-duplicate groups, with AOK and K3T, for Direct Pasting and Poisson Blending compositions.

In Table 16, we have the results for the classification of the trees and the identification of the host and alien parents. The composition root (CR), host parent (HP) and alien parent (AP) metrics show how accurately the method identifies those nodes. We also employ the composition node (CN), host node (HN) and alien node (AN) metrics to check whether the composition root, host parent and alien parent are composition, host and alien images, respectively. Those metrics measure the classification of the trees. The local dissimilarity method used for identifying the alien parent was defined as the mean of the distance between the SIFT descriptors in the shared content region between the composition root and the alien images.

From the results, we observe that AOK and K3T obtained high and very close results for the CN, HN and AN metrics, showing that both algorithms achieve a good classification of the trees. This is of special importance, as it shows that even though AOK does not always find the correct number of trees, its separation of groups is still accurate enough for the classification of the trees. The CR and HP results show the method has good accuracy in identifying the original composition and its host parent. After classifying a tree as a composition, we choose its root as CR, meaning that this value is not

only dependent on the tree classification step, but also on the algorithm used to reconstruct the forest. Finally, even though the proposed method is good at identifying the alien tree, as shown by the close to 99% accuracy of the AN metric, it is still not good at finding the alien node used to create the composition.

Table 16: Multiple parenting results (CR, CN, HP, HN, AP and AN, in %) for AOK and K3T, for Direct Pasting and Poisson Blending compositions.

From these results, we identified two issues that directly affect the outcome of the alien parent identification: the standardization of the shared content region used to compare the aliens with the composition root, and the comparison measure used. The descriptor distance and the sum of squared errors (SSE) were tried as dissimilarity measures. The descriptor distance, although not dependent on a comparison mask, proved not to be a good measure, as shown by the low results. The sum of squared errors, on the other hand, relies too much on the compared region, so using individual masks is not suitable for this kind of comparison. With both problems in mind, we are currently working on two improvements to the alien parent identification: the computation of a single shared-region mask, using each individual mask found when comparing the alien nodes to the composition root, and the development of stronger comparison methods.

As part of our ongoing work, Table 17 shows the results of using the gradient dissimilarity and the mutual information as new dissimilarity measures, in comparison with our previous best results. These results were all obtained using AOK to separate the trees. Both measures brought a large accuracy improvement, for Direct Pasting and Poisson Blending alike, in comparison with our previous best results. We almost doubled the accuracy for Poisson Blending and obtained a 21% gain with mutual information on Direct Pasting. This second result is of special importance, as it is 5% off the 73% rate at which we correctly classify the composition root. Yet, Poisson Blending is still far from the expected 66% rate of identifying the composition root.

Type | Algorithm | AP (%)
Direct | SSE with combined mask | 47.3
Direct | Gradient dissimilarity on 3x3 region, X and Y directions | 58.3
Direct | Mutual information of each colour channel | 68.3
Poisson | SSE with combined mask | 21.7
Poisson | Gradient dissimilarity on 3x3 region, X and Y directions | 40.3
Poisson | Mutual information of each colour channel | 40.3

Table 17: Alien parent identification.
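Of the measures in Table 17, mutual information is a generic quantity and can be sketched compactly: it scores how much the pixel statistics of the shared content region in the composition agree with the corresponding region of a candidate alien parent. The implementation below is a standard histogram-based estimate and only illustrates the idea; it is not the exact measure used in our experiments.

```python
import numpy as np

def mutual_information(region_a, region_b, bins=32):
    """Histogram-based mutual information between two aligned regions.

    Higher values indicate stronger statistical dependence between the
    shared content region of the composition and the candidate parent.
    """
    joint, _, _ = np.histogram2d(region_a.ravel(), region_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)     # marginal of region_a
    py = pxy.sum(axis=0, keepdims=True)     # marginal of region_b
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))
```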

The results show that there is still large room for improvement, especially in the identification of the alien parent for Poisson Blending compositions. As errors tend to propagate through the pipeline of our proposed method, it is important to have a strong algorithm that correctly separates the trees in the first step. It is also important to improve the second step, especially its adaptation to a more general scenario. Finally, we expect to further investigate other ways of improving our host and alien parent identification, based on the improvements proposed above and also on other ideas, such as better registration and colour estimation.

Since, as far as we know, we are the first to tackle this challenging problem, we can tag this contribution as : Results are on level with or better than the closest SotA, where available. However, there might not be a close match in the literature for the proposed method.

4.4.6 Contribution #6

Another contribution refers to the reconstruction of video sequences from partially overlapping matching shots. A significant fraction of the available video content is created by reusing already existing online videos (e.g., videos from YouTube). With the approach proposed in this work, it is possible to reveal how the original content was edited (e.g., by deleting some of the video frames) before being inserted into a new content, thus explaining the intent of the content creator. This is helpful to understand the semantics behind content processing, which would be otherwise impossible if the video sequence were analysed alone. Moreover, it is possible to reconstruct a parent sequence even when this sequence is no longer available online (because it has been deleted by the original content owner, or it has never been publicly released in its totality, for instance). This work has been developed with partners from Politecnico di Milano, and results from this contribution have been submitted to the IEEE International Conference on Image Processing (ICIP) [61], currently under review.

Another attempt regarding video phylogeny extends the work proposed by Dias et al. [62]. We have been working on an initial approach for video phylogeny reconstruction considering videos that are not temporally coherent. Our initial experiments using dataset DS-25 showed lower performance when using the same dissimilarity function developed by Dias et al. [62]. Therefore, we are currently working on other methods for calculating the dissimilarity between videos that are not temporally coherent.

4.5 Conclusions

In this report, we presented the results regarding WP7 and our satisfaction levels regarding them. First and foremost, as this is cutting-edge research, there is no significant literature about the problem yet. Therefore, our contributions also go in the direction of setting new grounds regarding experimental protocols, benchmarking and target objectives.

Second, regarding the presented evaluation measures, although 100% accuracy is not possible, it is important to keep in mind what their impact is in a real investigation. For instance, if we need to find the root of the tree in a pornography case, even if we cannot achieve 100% accuracy in finding the root, being able to reduce the search scope to 10% of the original space is an important breakthrough. Therefore, when looking at the binary measure Root, it is not only that value that is

Furthermore, when developing such multimedia phylogeny algorithms, we are also interested in evaluating robustness in real scenarios, which is why we proposed additional measures for such cases. However, as no research had been done on this problem before, these measures might not be the best ones, and we are currently studying new possibilities and evaluation protocols.

With WP7, it was possible to develop contributions in different aspects of multimedia phylogeny, from single phylogeny trees to phylogeny forests, as well as the first steps towards the resolution of the multiple parenting phylogeny problem. In addition, a range of algorithms is now available, as well as a large corpus with specific evaluation metrics for each approach, covering controlled and uncontrolled scenarios.

Bibliography

[51] Z. Dias, A. Rocha, and S. Goldenstein, "Image phylogeny by minimal spanning trees," IEEE Transactions on Information Forensics and Security (TIFS), vol. 7, no. 2, 2012.
[52] Z. Dias, S. Goldenstein, and A. Rocha, "Large-scale image phylogeny: Tracing back image ancestry relationships," IEEE Multimedia, vol. 20, 2013.
[53] Z. Dias, S. Goldenstein, and A. Rocha, "Toward image phylogeny forests: Automatically recovering semantically-similar image relationships," Elsevier Forensic Science International (FSI), vol. 231, 2013.
[54] Y. J. Chu and T. H. Liu, "On the shortest arborescence of a directed graph," Science Sinica, vol. 14, 1965.
[55] J. Edmonds, "Optimum branchings," Journal of Research of the National Bureau of Standards, vol. 71B, 1967.
[56] F. Bock, "An algorithm to construct a minimum directed spanning tree in a directed network," Developments in Operations Research, 1971.
[57] R. E. Tarjan, "Finding optimum branchings," Networks, vol. 7, no. 1, 1977.
[58] Z. Dias, S. Goldenstein, and A. Rocha, "Exploring heuristic and optimum branching algorithms for image phylogeny," Journal of Visual Communication and Image Representation, vol. 24, 2013.
[59] F. de O. Costa, M. A. Oikawa, Z. Dias, S. Goldenstein, and A. Rocha, "Image phylogeny forests reconstruction," submitted to IEEE Transactions on Information Forensics and Security.
[60] A. Oliveira, P. Ferrara, A. De Rosa, A. Piva, M. Barni, S. Goldenstein, Z. Dias, and A. Rocha, "Multiple parenting identification in image phylogeny," submitted to IEEE International Conference on Image Processing (ICIP).
[61] S. Lameri, P. Bestagini, A. Melloni, S. Milani, A. Rocha, M. Tagliasacchi, and S. Tubaro, "Who is my parent? Reconstructing video sequences from partially overlapping matching shots," submitted to IEEE International Conference on Image Processing (ICIP).
[62] Z. Dias, A. Rocha, and S. Goldenstein, "Video phylogeny: Recovering near-duplicate video relationships," IEEE International Workshop on Information Forensics and Security (WIFS), 2011.

5 Automatic Evaluation Approach

As described in previous deliverables, REWIND included the development of an automatic evaluation framework using a dedicated storage and testing service, based on fine-grained XML annotations of content and on schemas for detector interfaces and test case definitions. The respective motivation, lessons learned and service description were reported in D5.5, and the related XML schemas in D5.4. The previous evaluation report, D5.4, also included two detector evaluation examples that were used to validate the PoC implementation. These were extended by three more detectors, totaling five examples for validation. The examples covered various aspects and levels of testing with respect to test set size and selection, as well as evaluation types, including binary and multiclass classification and the deviation of estimated outputs. They are briefly described in the following.

5.1 DC-09 Video Codec Detection

DC-09 "Video Codec Detection" takes a video sequence, the last codec and the last GOP used as input, and provides the first codec used as output (multiclass). The detector component interface definition XML can be found in 6.3. The XML definition of the test case can be found in 6.10, and a screenshot of the summary page in Figure 17. As this case represented the first PoC test of the automatic approach, only a small test set of 5 items from the DS-06 dataset was chosen.

Figure 17: Screenshot of the summary page for the DC-09 / UC-10 test case
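For multiclass cases such as DC-09, the automatic evaluation essentially compares each detector output against the annotated ground truth and accumulates a confusion count. The snippet below is a minimal sketch of that comparison step; the tuple layout and codec labels are illustrative assumptions, not the actual schema used by the REWIND testing service.

```python
from collections import Counter

def evaluate_multiclass(items):
    """items: list of (ground_truth_label, detector_output_label) pairs.
    Returns overall accuracy and per-(truth, output) confusion counts."""
    confusion = Counter(items)
    correct = sum(n for (truth, output), n in confusion.items()
                  if truth == output)
    return correct / sum(confusion.values()), confusion

# Hypothetical first-codec labels for a tiny DC-09-style test set:
accuracy, confusion = evaluate_multiclass([
    ("MPEG-2", "MPEG-2"), ("MPEG-4", "MPEG-4"), ("H.264", "MPEG-4"),
])
```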

The results for the first test case were as follows:

Figure 18: Screenshot of the results page for the DC-09 / UC-10 test case

5.2 DC-06 Image Recapture Detection

DC-06 "Image Recapture Detection" takes an image as input, and provides a binary classification for recapture detection as output. The detector component interface definition XML can be found in 6.2. The XML definition of the test case can be found in 6.9, and a screenshot of the summary page in Figure 19. A set of 180 items from the DS-05 dataset was used (no FP/TN testing, i.e. only positive items were evaluated).

Figure 19: Screenshot of the summary page for the DC-06 / UC-06 test case

The results for the first test case were as follows:

Figure 20: Screenshot of the results page for the DC-06 / UC-06 test case

5.3 DC-19 Resampling Footprint Detection

DC-19 "Resampling Footprint Detection" takes an image and, optionally, a specific resampling factor to analyze as input, and provides a binary classification for resampling detection and an estimated resampling factor as output. The detector component interface definition XML can be found in 6.5. The XML definition of the test case can be found in 6.8, and a screenshot of the summary page in Figure 21. A set of about 800 items from dataset DS-10 was used.

Figure 21: Screenshot of the summary page for the DC-19 / UC-03 test case
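DC-19 is the example exercising the "deviation of estimated outputs" evaluation type: besides the binary decision, the estimated resampling factor is compared against the annotated one. Below is a minimal sketch of such a combined score; the tuple layout and the tolerance threshold are illustrative assumptions, not the service's actual metric definitions.

```python
def evaluate_resampling(items, tolerance=0.02):
    """items: list of (is_resampled, detected, true_factor, est_factor).
    Returns the detection accuracy, the mean absolute deviation of the
    factor estimate over correctly detected resampled items, and the
    number of estimates within the tolerance."""
    accuracy = sum(d == r for r, d, _, _ in items) / len(items)
    deviations = [abs(est - true) for r, d, true, est in items
                  if r and d and est is not None]
    mean_dev = sum(deviations) / len(deviations) if deviations else None
    within_tol = sum(dev <= tolerance for dev in deviations)
    return accuracy, mean_dev, within_tol
```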

The results for the first test case were as follows:

Figure 22: Screenshot of the results page for the DC-19 / UC-03 test case

5.4 DC-02 Image Splicing Detection

DC-02 "Image Splicing Detection - Not Aligned JPEG Compression detector" takes an image and a suspect region as input, and provides a binary classification for splicing detection and a confidence value as output. The respective detector component interface definition XML can be found in 6.1. DC-02 has been tested using two different test cases, both including image splicings created by pasting part of a JPEG image into an uncompressed image and then performing a final JPEG compression. The first test case includes a group of 500 spliced images and 500 original images, distributed among two pools for positive/negative testing. The XML definition of this first case can be found in 6.6, and a screenshot of the summary page in Figure 23.

Figure 23: Screenshot of the summary page for the DC-02 / UC-01a test case
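With two pools, the service can derive the full binary confusion: true and false positives from the spliced pool, and true and false negatives from the original pool. A minimal sketch of this bookkeeping, under assumed input conventions, is shown below.

```python
def evaluate_two_pools(positive_decisions, negative_decisions):
    """Decisions are booleans (True = 'spliced') for the spliced pool
    and the original pool, respectively."""
    tp = sum(positive_decisions)              # spliced, flagged as spliced
    fn = len(positive_decisions) - tp         # spliced, missed
    fp = sum(negative_decisions)              # original, wrongly flagged
    tn = len(negative_decisions) - fp         # original, correctly cleared
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "TPR": tp / (tp + fn), "FPR": fp / (fp + tn)}
```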

The results for the first test case were as follows:

Figure 24: Screenshot of the results page for the DC-02 / UC-01a test case

As it is known that a higher quality setting for the final compression step simplifies the detection task, a second test case was designed, using two pools with 100 spliced and 100 original images each, but only considering images where the last JPEG compression had a quality factor above 85/100. The XML definition of this case can be found in 6.7, and a screenshot of the summary page in Figure 25.

Figure 25: Screenshot of the summary page for the DC-02 / UC-01b test case

Not surprisingly, the results for the second test case are better than those for the first:

Figure 26: Screenshot of the results page for the DC-02 / UC-01b test case

5.5 DC-16 MP3 Bitrate Estimation and Classification

DC-16 "MP3 Bitrate Estimation and Classification" takes a decompressed audio file as input, and provides an estimated standard bitrate (multiclass) of the preceding MP3 encoding step as output. The detector component interface definition XML can be found in 6.4. The XML definition of the test case can be found in 6.11, and a screenshot of the summary page in Figure 27. A set of 400 items from the DS-08 dataset was used, previously MP3-encoded at 32, 64, 96 and 128 kbps.
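Since DC-16 estimates a bitrate and then classifies it against the four standard rates of the test set, the classification step can be expressed as nearest-neighbour assignment. The sketch below illustrates that idea under this assumption; it is not the detector's actual decision rule.

```python
STANDARD_RATES_KBPS = (32, 64, 96, 128)  # classes used in the test case

def classify_bitrate(estimated_kbps, classes=STANDARD_RATES_KBPS):
    """Map a (possibly non-integer) bitrate estimate to the closest
    standard MP3 bitrate class."""
    return min(classes, key=lambda c: abs(c - estimated_kbps))

assert classify_bitrate(61.5) == 64   # e.g. a 61.5 kbps estimate -> 64 kbps
```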

Figure 27: Screenshot of the summary page for the DC-16 / UC-20 test case

The results for bitrate classification were as follows, using a slightly improved visualization for multiclass evaluation results, which was included with the second version of the testing service:
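The improved multiclass visualization is, in essence, a rendering of the confusion matrix across the four bitrate classes. As a rough illustration of the underlying data (not the service's actual rendering code), such a matrix can be printed as follows:

```python
def print_confusion(pairs, classes=(32, 64, 96, 128)):
    """pairs: (true_kbps, predicted_kbps) per test item. Prints the
    confusion matrix: true classes as rows, predictions as columns."""
    counts = {(t, p): 0 for t in classes for p in classes}
    for t, p in pairs:
        counts[(t, p)] += 1
    print("true\\pred" + "".join(f"{c:>8}" for c in classes))
    for t in classes:
        print(f"{t:>9}" + "".join(f"{counts[(t, p)]:>8}" for p in classes))
```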
