CS89/189 Project Milestone: Differentiating CGI from Photographic Images


Shruti Agarwal and Liane Makatura
February 18, 2016

1 Overview

In response to continuing improvements in modeling, rendering, and image manipulation, this project seeks to use deep convolutional neural networks (CNNs) to perform, and hopefully improve upon, the task of differentiating between photographs and computer-generated images (CGI). Due to the lack of a suitably large dataset (which limited our ability to train a neural network directly), we set out to test our hypothesis in the following three ways:

1. Extract the visual representations from the penultimate layer of a CNN (AlexNet) that has been trained on an unrelated, natural-image dataset such as ImageNet, then use these as features in an SVM to see whether it can effectively classify CGI and photographic images.

2. Use a relevant dataset (composed of comparable CGI and real image patches) to fine-tune the penultimate layer of the above-mentioned CNN; then repeat the process described in (1) to see if our results improve.

3. Test the performance of our fine-tuned CNN by directly feeding it novel CGI and photographic input, to see whether it generalizes effectively at inference time.

Figure 1: Left shows artist Max Edwin Wahyudi's CGI rendering of Korean actress Song Hye Kyo. Right is a photograph of the same.

Note that we have switched to using AlexNet instead of VGG because it is faster to fine-tune, which works well in our constrained time frame. While VGG might produce better results, and thus would be an interesting future extension, we believe that AlexNet will be sufficient for this project. By this milestone, we anticipated having a complete dataset, along with preliminary results from our SVM on general AlexNet features (without fine-tuning).

2 Dataset

One of our biggest hurdles in this project is the lack of a pre-existing dataset that is both relevant and sufficiently large. Ultimately, our goal was to amass approximately 100,000 image patches for each class (CGI and photograph), creating a final dataset of 200,000 patches. This section outlines our acquisition and processing pipeline.

2.1 Image Collection

To create our binary classifier, we needed to collect a two-part dataset:

Photographs: We were supplied with a dataset of approximately 96 million photographic images (courtesy of Professor Hany Farid). These images span a wide range of semantic content, and we aimed to match this diversity in the CGI dataset as well.

CGI: No established CGI dataset currently exists, so we created our own. We initially intended to render our own images using an Autodesk Maya cityscape model that was made available to us (also courtesy of Professor Hany Farid). However, we had two main concerns about the dataset we would amass through such an approach:

Believability: The level of photorealism we could obtain with the provided architecture, texture maps, and lighting models did not appear convincing enough to generate challenging cases for our network. Ideally, we wanted to train our network on images that are challenging even for a human to categorize.

Homogeneity: The model had a very distinct style and relatively limited diversity of architecture, scenes/objects, textures, lighting conditions, etc. This homogeneity might tempt the network to simply learn these content-specific identifiers, rather than general features specific to the image type (photo vs. CGI).

To avoid these issues, we chose to collect our images from individual artists (namely those in the Dartmouth Digital Arts program), along with various sources around the web. In particular, we used pictures generated by professional rendering companies, such as Render Atelier and Maxwell Render, along with many personal portfolios, university competition results, and creative content-sharing sites such as CGTrader, Adobe Behance, and VRayWorld. Due to the nature of these websites, the process-related content of the posts, and the structure of the sites' content tagging, we are reasonably confident that all collected images are in fact examples of CGI. This process also gave us the benefit of collecting samples produced with varying modeling software (Autodesk Maya, Blender, ZBrush, 3ds Max) and renderers (Mental Ray, V-Ray, Corona, RenderMan).

It is worth noting that our collection process ended up consuming much more time than expected, because it devolved into a manual crawling effort. Small scripts often collected many undesirable images, ranging from advertisements and banners to actual photographs. Given the quality of the renders we were gathering, it was often difficult and time-consuming to verify the data and pick out any photographs that slipped into the CGI bin; hence, we opted to shift the time-intensive part of the process to the gathering itself. This way, we always saw the images in their respective contexts, which allowed us to be more confident in their classification.

2.2 Image Processing

After collecting sufficiently many CG images, we went through the following steps to finalize our dataset: categorizing full images, extracting patches, and extracting features. These steps were conducted for both CG and photographic images. Then, each CGI patch was matched with its nearest neighbor in the photo set, and both patches were moved to our final dataset. Details for each step are outlined below.

Preliminary Image Categorization: We fed each full image through AlexNet (the CNN we intend to use) and let it be classified into one of the 1000 categories that AlexNet was trained to recognize.
We were not concerned with whether or not the classification was correct; this step was simply meant to split the CGI/photo dataset into semantically similar subsets. It was conducted to reduce the time complexity of the final step (nearest neighbor search), so that we did not have to search through nearly a billion unrelated photo patches to find the best possible match for a particular CGI patch. We intuited that patches from semantically similar images would be more likely to provide good matches, so we partitioned our image sets into bins numbered 1..1000 based on AlexNet's classification. Thus, our file structure had folders CGI_1..CGI_1000 and Photo_1..Photo_1000. For simplicity, we will consider one pair of corresponding class bins, CGI_i and Photo_i. A sketch of this binning step follows.
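To make this step concrete, here is a minimal sketch of the binning procedure, written with torchvision's pretrained AlexNet as a stand-in for the network we actually used; the directory layout and helper name are purely illustrative.

```python
# Hypothetical sketch of the preliminary binning step, using torchvision's
# pretrained AlexNet as a stand-in. Paths and folder names are placeholders.
import os
import shutil
from PIL import Image
import torch
from torchvision import models, transforms

alexnet = models.alexnet(pretrained=True).eval()

# Standard ImageNet preprocessing; 227x227 crops match the patch size we use later.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(227),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def bin_by_class(src_dir, dst_root):
    """Copy every image into a folder named after its predicted ImageNet class."""
    for fname in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, fname)).convert("RGB")
        x = preprocess(img).unsqueeze(0)                    # (1, 3, 227, 227)
        with torch.no_grad():
            class_id = alexnet(x).argmax(dim=1).item()      # 0..999
        dst = os.path.join(dst_root, str(class_id))
        os.makedirs(dst, exist_ok=True)
        shutil.copy(os.path.join(src_dir, fname), dst)

# e.g. bin_by_class("raw/cgi", "binned/CGI") and bin_by_class("raw/photo", "binned/Photo")
```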

Patch Extraction: From each image in CGI_i and Photo_i, we randomly extracted 20 patches of size 227x227 to use in our final dataset. We chose patches instead of full images to increase the size of our dataset, to reduce the amount of spatial and contextual information the network can exploit, and to increase the probability of finding a reasonably well-matching patch in the complementary image set. The patch size (227x227) was chosen to accommodate AlexNet's architecture, whose fully connected layers expect input of this size.

Feature Extraction: We fed each individual patch through AlexNet and extracted the corresponding features from the first fully connected layer (FC6) of the network. These features are used by our nearest-neighbor pairing search (below) and by the SVM.

Nearest-Match Pairing: For each patch in CGI_i, we performed a nearest neighbor search through all the candidate matches in Photo_i, using the representative feature vectors extracted from AlexNet in the previous step. We mapped in this order (CGI to photo) because we had a far larger photo dataset available to us, and we wanted to ensure that every CGI instance was matched and included in the final dataset. The idea behind this pairing is to ensure that our dataset contains challenging examples in which content is very similarly distributed between the two classes; thus, the network is forced to learn something specific about CGI vs. photo. A sketch of this pipeline follows, and a snapshot of our paired datasets is shown in Figure 2.
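The patch, feature, and pairing steps can be sketched in the same framework. The helpers below assume the `alexnet` stand-in from the previous sketch, a CHW image tensor already normalized with the ImageNet statistics, and candidate sets small enough for a brute-force distance computation (the class binning described above is what keeps those sets manageable).

```python
# Illustrative sketch only: random 227x227 patches, 4096-d FC6 activations,
# and CGI-to-photo nearest-neighbor pairing.
import numpy as np
import torch

def random_patches(img, n=20, size=227):
    """Sample n random size x size patches from a normalized CHW image tensor."""
    _, h, w = img.shape
    ys = np.random.randint(0, h - size + 1, n)
    xs = np.random.randint(0, w - size + 1, n)
    return torch.stack([img[:, y:y + size, x:x + size] for y, x in zip(ys, xs)])

def fc6_features(patches):
    """Return the 4096-d FC6 activations for a batch of 227x227 patches."""
    with torch.no_grad():
        f = alexnet.features(patches)
        f = alexnet.avgpool(f)          # identity at this input size
        f = torch.flatten(f, 1)
        # classifier[0] is dropout (a no-op in eval mode); classifier[1] is FC6
        return alexnet.classifier[1](alexnet.classifier[0](f)).numpy()

def pair_cgi_to_photos(cgi_feats, photo_feats):
    """For each CGI patch, return the index of its nearest photo patch (L2 distance)."""
    dists = np.linalg.norm(cgi_feats[:, None, :] - photo_feats[None, :, :], axis=2)
    return dists.argmin(axis=1)
```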

Figure 2: CGI patches (top) and their nearest-neighbor photo matches (bottom, in corresponding order).

2.3 Current Dataset

At the time of SVM training, we had approximately 2,800 CG images contributing to our fully processed database, where each image was the highest available resolution. We also classified and sampled approximately 100,000 random photographs, to ensure that there were enough candidate matches for each CG image while still maintaining reasonable time complexity for operations such as nearest neighbor search. Extracting 20 patches from each image gave us roughly 56,000 patches each for CGI and photographs.

Our data also includes roughly 800 additional CG images that have not yet been processed (and thus were not included in our initial SVM). Our recent discovery of Behance and VRayWorld, where rendered images are well tagged, will also enable us to collect the remaining data for our target set much more quickly, now that we do not have to manually crawl the web from link to link. Additionally, we found that the images we collected were of very high resolution (usually 1-2K, with some approaching 4K), so we intend to extract more patches from each image to continue enlarging our dataset. We have not yet included these additional patches because we have not finished processing them, and we did not want to over-represent some images relative to others. This process will easily allow us to surpass our original dataset goal, which called for 100,000 patches of size 64x64 for each category.

2.4 Known Limitations & Proposed Solutions

In collecting our dataset, we encountered the following limitations:

Presence of watermarks or signatures in CGI (typically in the lower left-hand corner): We brainstormed ways to discard the affected region of the image (manual editing, cropping off the bottom/side, or discouraging selection of the lower left-hand region). Ultimately, we opted not to address this issue explicitly, because 1) manual editing would add time and destroy the integrity of the render, and 2) we did not want to sacrifice valuable, usable information that also lives in these affected regions. In practice, we expect the random patch selection to keep this from becoming much of an issue.

Available photorealistic CGI content is heavily skewed toward architecture and modern interior design: We intend to 1) target more organic CG images (image searches, frames from natural VR/video game reels), and 2) mimic this bias in our photographic database by adding comparable images. As long as the content is equally skewed in both halves of the dataset, this should prevent our network from using particular content to determine the CGI or photo classification, which is the only real concern.

Preliminary classification of the images could result in suboptimal patch matches: If time permits, we may rerun the nearest neighbor search without this preliminary filtering to see if our results improve. However, the purpose of this matching is simply to ensure that our network has challenging examples to learn on, and our visualizations seem to corroborate that our matches are satisfactory for this purpose. As such, we are not certain the improvement would be worth the additional time complexity. Since we had already structured our dataset with this filtering, we have not yet run any experiments without it.

3 Preliminary Results

After generating our initial dataset, we were able to run a few preliminary tests, which are detailed below.

3.1 t-SNE Visualization

As we saw in [2], deep features have the ability to cluster not only similar-looking images, but also images from the same domain. We wanted to see whether the features we extracted from AlexNet (without fine-tuning) exhibit any inherent separation between the two classes, so we created a t-SNE visualization (a sketch of this step follows). The results of our visualization can be seen in Figures 3 and 4.
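For reference, here is a rough sketch of how such a visualization can be produced with scikit-learn; the saved feature files and the label convention are placeholders, not our actual file names.

```python
# Rough sketch of the t-SNE visualization over FC6 features (placeholder file names).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

cgi_feats = np.load("fc6_cgi.npy")      # shape (n_cgi, 4096)
photo_feats = np.load("fc6_photo.npy")  # shape (n_photo, 4096)

feats = np.vstack([cgi_feats, photo_feats])
labels = np.array([0] * len(cgi_feats) + [1] * len(photo_feats))

# Project the 4096-d FC6 space down to 2-d for plotting.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(feats)

for lbl, name, color in [(0, "CGI", "tab:red"), (1, "photo", "tab:blue")]:
    pts = embedded[labels == lbl]
    plt.scatter(pts[:, 0], pts[:, 1], s=4, c=color, label=name)
plt.legend()
plt.title("t-SNE of FC6 features (CGI vs. photo)")
plt.show()
```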

Figure 3: A 2-dimensional visualization of our 4096-dimensional feature space; three separate ImageNet classes (spanning 4 semantic content categories, detailed in Figure 4) are represented here, as denoted in the legend.

Clearly, the images in our two sets are comparable to one another, as images within distinct ImageNet class categories (including both CGI and photos) cluster very distinctly according to the AlexNet representation. We also see some separation between CGI and photos within the class clusters, giving further validation to the idea that our network will be able to differentiate between these image types.

Figure 4: The same visualization as above, but with representative thumbnails flanking each class cluster. Note that class 979 inherited both water and sky images, but they cluster very distinctly in the representation. The thumbnails are placed as an interpretive tool; while they are drawn from the represented data, their location in the image does not necessarily correspond to the location of that thumbnail's representative dot in the visualization. An effort has been made to ensure that the thumbnails do not cover any data points. All images are framed in the appropriate color and labeled for clarity.

As we can see in Figures 3 and 4, the CGI and photo features show some clustering within each class itself. It is possible that AlexNet is capturing very minute appearance differences between real and CGI images. It is also possible that the AlexNet features capture some differences between CGI and real photos that generalize across all classes. To analyse this hypothesis, we used these features to train a linear SVM. The results are discussed in the following section.

3.2 SVM Classification via LIBLINEAR

We used the FC6 features obtained from a non-fine-tuned AlexNet on our CGI and real dataset and trained a linear SVM for the task of binary classification. We used the LIBLINEAR library, available at https://www.csie.ntu.edu.tw/~cjlin/liblinear/. We also tried to train a linear SVM with the LIBSVM package, but it took too long to train on our dataset.

We used features from 104,446 patches, with an equal number of CGI and real image patches, i.e., 52,223 patches from each category. Each feature is 4096-dimensional, so our dataset has size 104,446 x 4096. We scale the dataset so that each feature dimension lies in the [0,1] range; this ensures that training is not biased toward dimensions with large values. For training we randomly select 84,000 instances from the dataset and keep the remaining instances for testing. We train the linear SVM classifier using the L2-regularization and L2-loss options, then compute classification accuracy on both the training and test data. We also analyse the per-class accuracy for both the training and test data. Due to time constraints, we repeat the training and testing procedure only 10 times, each time selecting the training and testing data randomly from the full dataset. A sketch of this experiment appears below.
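The experiment can be approximated with scikit-learn's LinearSVC, which wraps the same LIBLINEAR solver; the feature and label files below are placeholders, and the exact LIBLINEAR options we passed are not reproduced here.

```python
# Hedged re-creation of the linear SVM experiment (placeholder file names).
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X = np.load("fc6_all.npy")       # (104446, 4096) FC6 features
y = np.load("labels_all.npy")    # 0 = CGI, 1 = photo, balanced classes

# Scale every feature dimension into [0, 1] so no dimension dominates training.
X = MinMaxScaler().fit_transform(X)

accs = []
for trial in range(10):
    idx = np.random.permutation(len(X))
    train, test = idx[:84000], idx[84000:]
    # L2 regularization with squared-hinge (L2) loss, as in our LIBLINEAR runs.
    clf = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0)
    clf.fit(X[train], y[train])
    accs.append(clf.score(X[test], y[test]))

print("mean test accuracy over 10 trials: %.3f" % np.mean(accs))
```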

The training and testing accuracy results, averaged over the 10 experiments, are given in Figure 5. We achieve 77.7% test accuracy without fine-tuning the CNN features for our task. We can also see that the error rate in classifying CGI images is higher than that for real images. We assume this is because AlexNet was trained only on real images, so the FC6 features cannot capture the fine details of a CGI image that are required to differentiate it from a real image. We hope that fine-tuning the network for our task will improve the accuracy.

Figure 5: % Accuracy with the linear SVM.

4 Future Work

As our preliminary results suggest, the deep features from AlexNet are able to differentiate between CGI and real images. We expect the accuracy of our results to improve as we acquire a larger training dataset and as we fine-tune AlexNet for our task. We plan to finish acquiring the dataset and begin fine-tuning AlexNet. For fine-tuning, we will replace the final 1000-way classification layer with a 2-way classification layer, as sketched below.

Currently, our dataset includes only those real image patches that AlexNet considers similar to CGI patches. By doing this, we hypothesize that we are selecting examples that are genuinely difficult for AlexNet to separate into CGI and real images. We will validate that our method of generating the dataset is better than random selection: we plan to fine-tune AlexNet separately on the data generated by our method and on a randomly generated dataset, and analyse the classification accuracy obtained in each case.
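Below is a minimal sketch of the planned fine-tuning step, again using PyTorch/torchvision as a stand-in for our actual setup; the data loader, learning rates, and epoch count are placeholder assumptions.

```python
# Sketch of the planned fine-tuning: swap the 1000-way layer for a 2-way
# (CGI vs. photo) layer and train on our patch dataset. Hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(4096, 2)   # replace the 1000-way classification layer

criterion = nn.CrossEntropyLoss()
# Smaller learning rate for pretrained layers, larger for the new 2-way head.
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], momentum=0.9)

def finetune(loader, epochs=5):
    model.train()
    for _ in range(epochs):
        for patches, labels in loader:     # 227x227 patches, labels in {0, 1}
            optimizer.zero_grad()
            loss = criterion(model(patches), labels)
            loss.backward()
            optimizer.step()
```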

5 References

1. H. Farid and M.J. Bravo. Perceptual Discrimination of Computer Generated and Photographic Faces. Digital Investigation, 8:226-235, 2012.

2. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. ICML, 2014.

3. K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.

4. http://www.robots.ox.ac.uk/~vgg/research/very_deep/

5. https://www.dropbox.com/home/3dcity

6. https://github.com/BVLC/caffe

7. https://github.com/Itseez/opencv/archive/3.1.0.zip

8. Z. Li, Z. Zhang, and Y. Shi. Distinguishing computer graphics from photographic images using a multiresolution approach based on local binary patterns. Security and Communication Networks, vol. 7, no. 11, pp. 2153-2159, 2014.

9. L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR, 2004.

10. K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. ECCV, 2010.