CS89/189 Project Milestone: Differentiating CGI from Photographic Images


Shruti Agarwal and Liane Makatura
February 18, 2016

1 Overview

In response to continuing improvements in modeling, rendering, and image manipulation, this project seeks to use deep convolutional neural networks (CNNs) to perform, and hopefully improve upon, the task of differentiating between photographs and computer-generated images (CGI). Due to the lack of a suitably large dataset (which limited our ability to train a neural network directly), we set out to test our hypothesis in the following three ways:

1. Extract the visual representations from the penultimate layer of a CNN (AlexNet) that has been trained on an unrelated, natural-image dataset such as ImageNet, then use these as features in an SVM to see whether it can effectively classify CGI and photographic images.

2. Use a relevant dataset (composed of comparable CGI and real image patches) to fine-tune the penultimate layer of the above-mentioned CNN; then repeat the process described in (1) to see if our results improve.

3. Test the performance of our fine-tuned CNN by directly feeding it novel CGI and photographic input, to see whether it generalizes effectively at inference time.

Figure 1: Left shows artist Max Edwin Wahyudi's CGI rendering of Korean actress Song Hye Kyo. Right is a photograph of the same.

Note that we have switched to using AlexNet instead of VGG because it is faster to fine-tune, which works well in our constrained time frame. While VGG might produce better results, and thus would be an interesting future extension, we believe that AlexNet will be sufficient for this project. By this milestone, we anticipated having a complete dataset, along with preliminary results from our SVM on general AlexNet features (without fine-tuning).

2 Dataset

One of our biggest hurdles in this project is the lack of a pre-existing dataset that is both relevant and sufficiently large. Ultimately, our goal was to amass approximately 100,000 image patches for each class (CGI and photograph), creating a final dataset of 200,000 patches. This section outlines our acquisition and processing pipeline.

2.1 Image Collection

To create our binary classifier, we needed to collect a two-part dataset:

Photographs: We were supplied with a dataset of approximately 96 million photographic images (courtesy of Professor Hany Farid). These images span a wide range of semantic content, and we aimed to match this diversity in the CGI dataset as well.

CGI: No established CGI dataset currently exists, so we created our own. We initially intended to render our own images using an Autodesk Maya cityscape model that was made available to us (also courtesy of Professor Hany Farid). However, we had two main concerns about the dataset we would amass through such an approach:

Believability: The level of photorealism we could obtain with the provided architecture, texture maps, and lighting models did not appear convincing enough to generate challenging cases for our network. Ideally, we wanted to train our network on images that are challenging even for a human to categorize.

Homogeneity: The model had a very distinct style and relatively limited diversity of architecture, scenes/objects, textures, lighting conditions, etc. This homogeneity might tempt the network to simply learn these content-specific identifiers, rather than general features specific to the image type (photo vs. CGI).

To avoid these issues, we chose to collect our images from individual artists (namely those in the Dartmouth Digital Arts program), along with various sources around the web. In particular, we used pictures generated by professional rendering companies, such as Render Atelier and Maxwell Render, along with many personal portfolios, university competition results, and creative content-sharing sites such as CGTrader, Adobe Behance, and VRayWorld. Due to the nature of these websites, the process-related content of the posts, and the structure of the sites' content tagging, we are reasonably confident that all collected images are in fact examples of CGI. This process also gave us the benefit of collecting samples produced with varying modeling software (Autodesk Maya, Blender, ZBrush, 3ds Max) and renderers (Mental Ray, V-Ray, Corona, RenderMan).

It is worth noting that our collection process ended up consuming much more time than expected, because it devolved into a manual crawling effort. Small scripts often collected many undesirable images, ranging from advertisements and banners to actual photographs. Given the quality of the renders we were gathering, it was often difficult and time-consuming to verify the data and pick out any photographs that slipped into the CGI bin; hence, we opted to shift the time-intensive part of the process to the gathering itself. This way, we always saw the images in their respective contexts, which allowed us to be more confident in their classification.

2.2 Image Processing

After collecting sufficiently many CG images, we went through the following steps to finalize our dataset: categorizing full images, extracting patches, and extracting features. These steps were conducted for both CG and photographic images. Then, each CGI patch was matched with its nearest neighbor in the photo set, and both patches were moved to our final dataset. Details for each step are outlined below.

Preliminary Image Categorization: We fed each full image through AlexNet (the CNN we intend to use) and let it be classified into one of the 1000 categories that AlexNet was trained to recognize.
We were not concerned with whether or not the classification was correct; this step was simply meant to split the CGI/photo dataset into semantically similar subsets. It was conducted to reduce the time complexity of the final step (nearest neighbor search), so that we did not have to search through nearly a billion unrelated photo patches to find the best possible match for a particular CGI patch. We intuited that patches from semantically similar images would be more likely to provide good matches, so we partitioned our image sets into bins numbered 1..1000 based on AlexNet's classification. Thus, our file structure had folders CGI_1..CGI_1000 and Photo_1..Photo_1000. For simplicity, we will consider one pair of corresponding class bins, CGI_i and Photo_i. A sketch of this binning step follows.
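To make this step concrete, here is a minimal sketch of the binning procedure, written with torchvision's pretrained AlexNet as a stand-in for the network we actually used; the directory layout and helper name are purely illustrative.

```python
# Hypothetical sketch of the preliminary binning step, using torchvision's
# pretrained AlexNet as a stand-in. Paths and folder names are placeholders.
import os
import shutil
from PIL import Image
import torch
from torchvision import models, transforms

alexnet = models.alexnet(pretrained=True).eval()

# Standard ImageNet preprocessing; 227x227 crops match the patch size we use later.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(227),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def bin_by_class(src_dir, dst_root):
    """Copy every image into a folder named after its predicted ImageNet class."""
    for fname in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, fname)).convert("RGB")
        x = preprocess(img).unsqueeze(0)                    # (1, 3, 227, 227)
        with torch.no_grad():
            class_id = alexnet(x).argmax(dim=1).item()      # 0..999
        dst = os.path.join(dst_root, str(class_id))
        os.makedirs(dst, exist_ok=True)
        shutil.copy(os.path.join(src_dir, fname), dst)

# e.g. bin_by_class("raw/cgi", "binned/CGI") and bin_by_class("raw/photo", "binned/Photo")
```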

Patch Extraction: From each image in CGI_i and Photo_i, we randomly extracted 20 patches of size 227x227 to use in our final dataset. We chose patches instead of full images to increase the size of our dataset, to reduce the amount of spatial and contextual information the network can exploit, and to increase the probability of finding a reasonably well-matching patch in the complementary image set. The patch size (227x227) was chosen to accommodate AlexNet's architecture, whose fully connected layers expect input of this size.

Feature Extraction: We fed each individual patch through AlexNet and extracted the corresponding features from the first fully connected layer (FC6) of the network. These features are used by our nearest-neighbor pairing search (below) and by the SVM.

Nearest-Match Pairing: For each patch in CGI_i, we performed a nearest neighbor search through all the candidate matches in Photo_i, using the representative feature vectors extracted from AlexNet in the previous step. We mapped in this order (CGI to photo) because we had a far larger photo dataset available to us, and we wanted to ensure that every CGI instance was matched and included in the final dataset. The idea behind this pairing is to ensure that our dataset contains challenging examples in which content is very similarly distributed between the two classes; thus, the network is forced to learn something specific about CGI vs. photo. A sketch of this pipeline follows, and a snapshot of our paired datasets is shown in Figure 2.
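The patch, feature, and pairing steps can be sketched in the same framework. The helpers below assume the `alexnet` stand-in from the previous sketch, a CHW image tensor already normalized with the ImageNet statistics, and candidate sets small enough for a brute-force distance computation (the class binning described above is what keeps those sets manageable).

```python
# Illustrative sketch only: random 227x227 patches, 4096-d FC6 activations,
# and CGI-to-photo nearest-neighbor pairing.
import numpy as np
import torch

def random_patches(img, n=20, size=227):
    """Sample n random size x size patches from a normalized CHW image tensor."""
    _, h, w = img.shape
    ys = np.random.randint(0, h - size + 1, n)
    xs = np.random.randint(0, w - size + 1, n)
    return torch.stack([img[:, y:y + size, x:x + size] for y, x in zip(ys, xs)])

def fc6_features(patches):
    """Return the 4096-d FC6 activations for a batch of 227x227 patches."""
    with torch.no_grad():
        f = alexnet.features(patches)
        f = alexnet.avgpool(f)          # identity at this input size
        f = torch.flatten(f, 1)
        # classifier[0] is dropout (a no-op in eval mode); classifier[1] is FC6
        return alexnet.classifier[1](alexnet.classifier[0](f)).numpy()

def pair_cgi_to_photos(cgi_feats, photo_feats):
    """For each CGI patch, return the index of its nearest photo patch (L2 distance)."""
    dists = np.linalg.norm(cgi_feats[:, None, :] - photo_feats[None, :, :], axis=2)
    return dists.argmin(axis=1)
```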

Figure 2: CGI patches (top) and their nearest-neighbor photo matches (bottom, in corresponding order).

2.3 Current Dataset

At the time of SVM training, we had approximately 2,800 CG images contributing to our fully processed database, where each image was the highest available resolution. We also classified and sampled approximately 100,000 random photographs, to ensure that there were enough candidate matches for each CG image while still maintaining reasonable time complexity for operations such as nearest neighbor search. Extracting 20 patches from each image gave us roughly 56,000 patches each for CGI and photographs.

Our data also includes roughly 800 additional CG images that have not yet been processed (and thus were not included in our initial SVM). Our recent discovery of Behance and VRayWorld, where rendered images are well tagged, will also enable us to collect the remaining data for our target set much more quickly, now that we do not have to manually crawl the web from link to link. Additionally, we found that the images we collected were of very high resolution (usually 1-2K, with some approaching 4K), so we intend to extract more patches from each image to continue enlarging our dataset. We have not yet included these additional patches because we have not finished processing them, and we did not want to over-represent some images relative to others. This process will easily allow us to surpass our original dataset goal, which called for 100,000 patches of size 64x64 for each category.

2.4 Known Limitations & Proposed Solutions

In collecting our dataset, we encountered the following limitations:

Presence of watermarks or signatures in CGI (typically in the lower left-hand corner): We brainstormed ways to discard the affected region of the image (manual editing, cropping off the bottom/side, or discouraging selection of the lower left-hand region). Ultimately, we opted not to address this issue explicitly, because 1) manual editing would add time and destroy the integrity of the render, and 2) we did not want to sacrifice valuable, usable information that also lives in these affected regions. In practice, we expect the random patch selection to keep this from becoming much of an issue.

Available photorealistic CGI content is heavily skewed toward architecture and modern interior design: We intend to 1) target more organic CG images (image searches, frames from natural VR/video game reels), and 2) mimic this bias in our photographic database by adding comparable images. As long as the content is equally skewed in both halves of the dataset, this should prevent our network from using particular content to determine the CGI or photo classification, which is the only real concern.

Preliminary classification of the images could result in suboptimal patch matches: If time permits, we may rerun the nearest neighbor search without this preliminary filtering to see if our results improve. However, the purpose of this matching is simply to ensure that our network has challenging examples to learn on, and our visualizations seem to corroborate that our matches are satisfactory for this purpose. As such, we are not certain the improvement would be worth the additional time complexity. Since we had already structured our dataset with this filtering, we have not yet run any experiments without it.

3 Preliminary Results

After generating our initial dataset, we were able to run a few preliminary tests, which are detailed below.

3.1 t-SNE Visualization

As we saw in [2], deep features have the ability to cluster not only similar-looking images, but also images from the same domain. We wanted to see whether the features we extracted from AlexNet (without fine-tuning) exhibit any inherent separation between the two classes, so we created a t-SNE visualization (a sketch of this step follows). The results of our visualization can be seen in Figures 3 and 4.
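For reference, here is a rough sketch of how such a visualization can be produced with scikit-learn; the saved feature files and the label convention are placeholders, not our actual file names.

```python
# Rough sketch of the t-SNE visualization over FC6 features (placeholder file names).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

cgi_feats = np.load("fc6_cgi.npy")      # shape (n_cgi, 4096)
photo_feats = np.load("fc6_photo.npy")  # shape (n_photo, 4096)

feats = np.vstack([cgi_feats, photo_feats])
labels = np.array([0] * len(cgi_feats) + [1] * len(photo_feats))

# Project the 4096-d FC6 space down to 2-d for plotting.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(feats)

for lbl, name, color in [(0, "CGI", "tab:red"), (1, "photo", "tab:blue")]:
    pts = embedded[labels == lbl]
    plt.scatter(pts[:, 0], pts[:, 1], s=4, c=color, label=name)
plt.legend()
plt.title("t-SNE of FC6 features (CGI vs. photo)")
plt.show()
```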

Figure 3: A 2-dimensional visualization of our 4096-dimensional feature space; three separate ImageNet classes (spanning 4 semantic content categories, detailed in Figure 4) are represented here, as denoted in the legend.

Clearly, the images in our two sets are comparable to one another, as images within distinct ImageNet class categories (including both CGI and photos) cluster very distinctly according to the AlexNet representation. We also see some separation between CGI and photos within the class clusters, giving further validation to the idea that our network will be able to differentiate between these image types.

Figure 4: The same visualization as above, but with representative thumbnails flanking each class cluster. Note that class 979 inherited both water and sky images, but they cluster very distinctly in the representation. The thumbnails are placed as an interpretive tool; while they are drawn from the represented data, their location in the image does not necessarily correspond to the location of that thumbnail's representative dot in the visualization. An effort has been made to ensure that the thumbnails do not cover any data points. All images are framed in the appropriate color and labeled for clarity.

As we can see in Figures 3 and 4, the CGI and photo features show some clustering within each class itself. It is possible that AlexNet is capturing very minute appearance differences between real and CGI images. It is also possible that the AlexNet features capture some differences between CGI and real photos that generalize across all classes. To analyse this hypothesis, we used these features to train a linear SVM. The results are discussed in the following section.

3.2 SVM Classification via LIBLINEAR

We used the FC6 features obtained from a non-fine-tuned AlexNet on our CGI and real dataset and trained a linear SVM for the task of binary classification. We used the LIBLINEAR library, available at https://www.csie.ntu.edu.tw/~cjlin/liblinear/. We also tried to train a linear SVM with the LIBSVM package, but it took too long to train on our dataset.

We used features from 104,446 patches, with an equal number of CGI and real image patches, i.e., 52,223 patches from each category. Each feature is 4096-dimensional, so our dataset has size 104,446 x 4096. We scale the dataset so that each feature dimension lies in the [0,1] range; this ensures that training is not biased toward dimensions with large values. For training we randomly select 84,000 instances from the dataset and keep the remaining instances for testing. We train the linear SVM classifier using the L2-regularization and L2-loss options, then compute classification accuracy on both the training and test data. We also analyse the per-class accuracy for both the training and test data. Due to time constraints, we repeat the training and testing procedure only 10 times, each time selecting the training and testing data randomly from the full dataset. A sketch of this experiment appears below.
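The experiment can be approximated with scikit-learn's LinearSVC, which wraps the same LIBLINEAR solver; the feature and label files below are placeholders, and the exact LIBLINEAR options we passed are not reproduced here.

```python
# Hedged re-creation of the linear SVM experiment (placeholder file names).
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X = np.load("fc6_all.npy")       # (104446, 4096) FC6 features
y = np.load("labels_all.npy")    # 0 = CGI, 1 = photo, balanced classes

# Scale every feature dimension into [0, 1] so no dimension dominates training.
X = MinMaxScaler().fit_transform(X)

accs = []
for trial in range(10):
    idx = np.random.permutation(len(X))
    train, test = idx[:84000], idx[84000:]
    # L2 regularization with squared-hinge (L2) loss, as in our LIBLINEAR runs.
    clf = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0)
    clf.fit(X[train], y[train])
    accs.append(clf.score(X[test], y[test]))

print("mean test accuracy over 10 trials: %.3f" % np.mean(accs))
```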

The training and testing accuracy results, averaged over the 10 experiments, are given in Figure 5. We achieve 77.7% test accuracy without fine-tuning the CNN features for our task. We can also see that the error rate in classifying CGI images is higher than that for real images. We assume this is because AlexNet was trained only on real images, so the FC6 features cannot capture the fine details of a CGI image that are required to differentiate it from a real image. We hope that fine-tuning the network for our task will improve the accuracy.

Figure 5: % Accuracy with the linear SVM.

4 Future Work

As our preliminary results suggest, the deep features from AlexNet are able to differentiate between CGI and real images. We expect the accuracy of our results to improve as we acquire a larger training dataset and as we fine-tune AlexNet for our task. We plan to finish acquiring the dataset and begin fine-tuning AlexNet. For fine-tuning, we will replace the final 1000-way classification layer with a 2-way classification layer, as sketched below.

Currently, our dataset includes only those real image patches that AlexNet considers similar to CGI patches. By doing this, we hypothesize that we are selecting examples that are genuinely difficult for AlexNet to separate into CGI and real images. We will validate that our method of generating the dataset is better than random selection: we plan to fine-tune AlexNet separately on the data generated by our method and on a randomly generated dataset, and analyse the classification accuracy obtained in each case.
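Below is a minimal sketch of the planned fine-tuning step, again using PyTorch/torchvision as a stand-in for our actual setup; the data loader, learning rates, and epoch count are placeholder assumptions.

```python
# Sketch of the planned fine-tuning: swap the 1000-way layer for a 2-way
# (CGI vs. photo) layer and train on our patch dataset. Hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(4096, 2)   # replace the 1000-way classification layer

criterion = nn.CrossEntropyLoss()
# Smaller learning rate for pretrained layers, larger for the new 2-way head.
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], momentum=0.9)

def finetune(loader, epochs=5):
    model.train()
    for _ in range(epochs):
        for patches, labels in loader:     # 227x227 patches, labels in {0, 1}
            optimizer.zero_grad()
            loss = criterion(model(patches), labels)
            loss.backward()
            optimizer.step()
```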

5 References

1. H. Farid and M.J. Bravo. Perceptual Discrimination of Computer Generated and Photographic Faces. Digital Investigation, 8:226-235, 2012.

2. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. ICML, 2014.

3. K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.

4. http://www.robots.ox.ac.uk/~vgg/research/very_deep/

5. https://www.dropbox.com/home/3dcity

6. https://github.com/BVLC/caffe

7. https://github.com/Itseez/opencv/archive/3.1.0.zip

8. Z. Li, Z. Zhang, and Y. Shi. Distinguishing computer graphics from photographic images using a multiresolution approach based on local binary patterns. Security and Communication Networks, vol. 7, no. 11, pp. 2153-2159, 2014.

9. L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR, 2004.

10. K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. ECCV, 2010.