Deep Learning Approaches to 3D Shape Completion

Prafull Sharma, Stanford University
Jarrod Cingel, Stanford University

Abstract

This project explores various methods rooted in deep learning to address the problem of 3D inpainting. We investigate shallow autoencoders, deep convolutional autoencoders, and generative adversarial networks, and their applications to shape completion in three dimensions. We find that deep convolutional autoencoders serve as robust tools for completing 3D shapes belonging to a distinct taxonomical object group, training a model that completes predetermined known cut regions with over 97% accuracy and random cut regions with over 94% accuracy. We also assess potential shortcomings of shallow and convolutional autoencoders for this purpose, training one of each to restore both deterministic and random cuts in three-dimensional voxelized models. Finally, we address GANs and their potential use in solving this problem.

1. Introduction

Deep learning is influencing many fields, including computer vision, robotics, and natural language processing. As the volume of visual data grows, it is important to have algorithms and systems that can restore corrupted or partial images. Inpainting is a technique used to reconstruct damaged or missing parts of an image, with applications such as recovering corrupt data in image files. In 2D image inpainting, the input is a 2D image and a 2D mask marking the area of the image that needs to be painted. The challenging part of the task is to reconstruct the masked region in a visually believable way. In this paper, we apply the same idea to 3D reconstruction. Shape completion is a well-researched area in computer graphics and vision, especially in processing partial 3D CAD models [5] [8]. We work with 3D models and try to recover a missing part of each model. We present two methods to perform this task: autoencoders and deep convolutional generative adversarial networks (DCGANs).

Our paper is organized as follows. In Section 2, we discuss previous work in image inpainting and 3D shape completion. We then describe our dataset in Section 3, followed by our methodology in Section 4. In Section 5, we present and discuss our results. The paper concludes in Section 6 with a brief summary of our work and the future work we would pursue with the CVGL lab under the supervision of Prof. Savarese.

2. Previous Work

One of the seminal works in image inpainting was presented by Bertalmio et al. [2]. This paper presented an algorithm for digital inpainting that replicated the techniques used by professional artists. The main idea is to smoothly propagate information from the areas surrounding the mask, in the direction of the isophotes, to reconstruct the missing region of the image. The results in the paper were sharp and free of color artifacts. One major drawback of the algorithm was its inability to reconstruct large textured regions. The authors followed up their work with a paper on simultaneous structure and texture image inpainting [3]. There, they presented an algorithm for simultaneously filling texture and structure in the missing part of the image. The basic idea is to decompose the image into the sum of two functions with different basic characteristics, and then to reconstruct each function with structure- and texture-filling algorithms, respectively.
The functions are decomposed so that the first function represents the underlying image structure while the second represents the texture and noise. The image is then recovered by adding the two decomposed sub-images back together. Similar ideas were followed in the 3D domain to complete the missing regions of a 3D scan. An example-based 3D shape completion method was presented by Pauly et al. [8]. They discuss a new method to obtain a complete 3D model from incomplete surface scans, using a dataset of 3D shapes as a prior for the regions of missing data. Their method chooses a few models and warps them to conform with the input data; these models are then consistently blended to obtain the resulting 3D shape. They use a penalty function for shape matching and a corresponding optimization scheme to compute the non-rigid alignment of the context models with the input data.

Their method achieved an efficient reconstruction from highly incomplete scans, allowing easy 3D content acquisition with simple 3D scans.

In Shape Completion using 3D-Encoder-Predictor CNNs and Shape Synthesis, Dai et al. present a deep learning approach to shape completion using a volumetric deep neural network and 3D shape synthesis [5]. They introduce a 3D-Encoder-Predictor Network (3D-EPN), which is trained to predict and fill in the missing data in 3D models. They encode both known and unknown space, which allows them to predict the global structure of the unknown area with high accuracy. Using a patch-based 3D shape synthesis method, they are able to reconstruct fine details on the surface and generate high-resolution outputs.

3. Dataset

Our dataset is comprised of voxelized models from ShapeNet, a large, information-rich repository of 3D models [4]. ShapeNet models are sorted and grouped by taxonomy, making it convenient to train on only a single class of objects. We arbitrarily chose the chair class, consisting of 6778 unique chair models, and divided these models into two sets: the first 6278 models were designated as the training set, and the remaining 500 were set aside as the validation set.

Once the training and validation models were separated, we used two different methods of voxel cutting for the study. Both methods result in the removal of a cube of voxels from somewhere in the model. In the first method, this cube was cut away from a fixed location, the back-bottom-right corner of each model. This was designed with the intention of cutting out the back leg of every chair (of course, variations in chair type would sometimes mean that more or less than a single leg ultimately got cut away). The second cutting method removed a cube of voxels from a randomly generated location in the model; these cuts were made independently for each model and were not consistent between models. The only stipulation was that cuts must be centered over a region that contains a positive voxel, unless this is not physically possible given the model. This avoids cutting away empty regions of a model, which would be essentially useless for training. It is important to note that for both the deterministic and nondeterministic cutting methods, the location of the cut cube was stored as a mask and passed on to the autoencoder. This way, the models we discuss in subsequent sections had information about which parts of the original voxelized model were obfuscated, in order to aid the training process.

4. Methodology

4.1. Autoencoder

An autoencoder is comprised of two models working in tandem, an encoder model and a decoder model [1]. The encoder portion takes in some input and maps it to a representation in an alternate dimensional space. The decoder portion accepts some encoded data as input and then attempts to reconstruct the original input to the encoder as closely as possible. We can formalize this general autoencoder model as follows:

y = s(Wx + b)
z = s(W'y + b')

Here, x signifies the initial input, y is the output of the encoder, z is the output of the decoder, s is an activation function, and W, b, W', b' are learned parameters. The ultimate objective is to minimize the difference between z and x. There are several viable objective functions for this task, the most common being the squared-error loss and the cross-entropy loss.
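To make the two mappings concrete, the following minimal NumPy sketch runs a single forward pass of such an autoencoder on a flattened voxel grid and evaluates a cross-entropy loss. The random initialization and the 32-unit code size are illustrative choices, not the trained parameters of our models.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    n_in, n_hidden = 32 * 32 * 32, 32  # flattened voxel grid -> 32-d code

    # Learned parameters (randomly initialized here for illustration).
    W, b = rng.normal(0, 0.01, (n_hidden, n_in)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.01, (n_in, n_hidden)), np.zeros(n_in)

    x = rng.integers(0, 2, n_in).astype(float)  # binary voxel occupancies
    y = sigmoid(W @ x + b)    # encoder: y = s(Wx + b)
    z = sigmoid(W2 @ y + b2)  # decoder: z = s(W'y + b')

    # Binary cross-entropy between reconstruction z and input x.
    eps = 1e-9
    loss = -np.mean(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))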
To make autoencoders better suited to learning robust representations, we can incorporate deep learning into the encoder and decoder layers. Specifically, we can use convolutional neural networks (CNNs) as the encoder and decoder models. This allows more sophisticated features to be learned from the data, since CNNs are much better able to interpret spatially structured data like images and 3D models than traditional learning approaches. For this project, we experimented with shallow and convolutional autoencoders on both the deterministically-cut and randomly-cut voxelized model datasets.

Shallow Autoencoder

We implemented the shallow autoencoder using Keras with a TensorFlow backend. The original and cut voxelized models are loaded and flattened into 32,768-dimensional arrays (32 x 32 x 32). We also add a mask indicating the cut area (in the mask, entries corresponding to the cut cube are given a value of 1, and untouched areas a value of 0); the mask is flattened as well. The full structure of the shallow autoencoder is depicted in Figure 1. The shallow autoencoder has two input layers, the flattened cut model and the flattened mask. These are concatenated and encoded into a 32-dimensional dense feature space with a rectified linear unit (ReLU) activation function. The final layer uses a sigmoid activation function to output a flattened 32,768-dimensional array, which can be reshaped into a voxelized model. Since each entry here corresponds to the probability that a voxel is present in that space, a threshold can be used to convert the output to a binary voxelized grid.

[Figure 1. Shallow Autoencoder Graphical Representation]
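A minimal Keras sketch of this two-input shallow autoencoder might look like the following. The concatenated mask input, 32-unit bottleneck, and ReLU/sigmoid activations follow the description above; the Adam optimizer is an assumption on our part, and binary cross-entropy is assumed to match the loss reported for the convolutional variant.

    from tensorflow.keras.layers import Input, Concatenate, Dense
    from tensorflow.keras.models import Model

    N = 32 * 32 * 32  # flattened 32x32x32 voxel grid

    cut_voxels = Input(shape=(N,), name="cut_model")  # flattened cut model
    cut_mask = Input(shape=(N,), name="cut_mask")     # 1 inside the cut cube, else 0

    merged = Concatenate()([cut_voxels, cut_mask])
    code = Dense(32, activation="relu")(merged)   # 32-d dense feature space
    recon = Dense(N, activation="sigmoid")(code)  # per-voxel occupancy probability

    autoencoder = Model(inputs=[cut_voxels, cut_mask], outputs=recon)
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    # autoencoder.fit([x_cut, masks], x_full, epochs=..., batch_size=...)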

Deep Convolutional Autoencoder

The deep convolutional autoencoder has several key differences from the shallow version. Its structure can be seen in Figure 2. The input and mask are defined the same as before, but this time they are not flattened. Instead, they are concatenated and then passed through the encoder, which is a series of three units of 3D convolution layers and 3D max pooling layers. The 3D convolutions are used to learn more complex representations at each level, and the max pooling layers are used to downsample and prevent overfitting. Each convolution layer uses ReLU activation and a unit stride, with double the number of filters of the previous level (the first level has 64, so level 2 has 128 and level 3 has 256). The increasing filter count was selected to help take some of the burden off of the decoder. This gives our encoder feature space a dimensionality of (4, 4, 4, 256).

Our decoder is a series of three units of 3D convolution layers and 3D upsampling layers. These are used to return our encoded input to the proper dimensionality of the output feature space (the same as the input feature space). These three units use progressively decreasing filter counts with ReLU activation, mimicking the inverse of the encoder's structure, with upsampling of size 2 on each axis. Before our final 3D convolution, we add dropout with probability 0.4 in order to help reduce overfitting. Finally, our last 3D convolution restores the original desired dimensionality, and its sigmoidal activation outputs a probability for the presence of every voxel.

In the convolutional autoencoder, we make one final modification. Since we do not want our loss function to be concerned with the known areas (sigmoidal activation will never output exactly 0 or exactly 1 in practice), we use our mask to replace probability values with binary values for the known portions that remain untouched. This functionality is accomplished by the final multiplication and addition layers in the model diagram. We use the binary cross-entropy loss function for training.

Both models were trained on each cutting dataset: once on the deterministic cuts, and once on the random cuts. Training was completed on Google Cloud virtual machine instances with NVIDIA Tesla GPUs in order to fit the models in a reasonable timeframe.
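Given the stated filter counts (64/128/256), pooling and upsampling factors of 2, dropout of 0.4, and the mask-based substitution of known voxels, one possible Keras realization is sketched below. The 3x3x3 kernel size, "same" padding, and Adam optimizer are our assumptions, since those details are not stated above.

    from tensorflow.keras.layers import (Input, Concatenate, Conv3D, MaxPooling3D,
                                         UpSampling3D, Dropout, Lambda, Multiply, Add)
    from tensorflow.keras.models import Model

    vox = Input(shape=(32, 32, 32, 1))   # cut voxel model
    mask = Input(shape=(32, 32, 32, 1))  # 1 inside the cut cube, else 0

    x = Concatenate()([vox, mask])
    # Encoder: three Conv3D + MaxPooling3D units, 64 -> 128 -> 256 filters.
    for filters in (64, 128, 256):
        x = Conv3D(filters, 3, strides=1, padding="same", activation="relu")(x)
        x = MaxPooling3D(2)(x)  # 32 -> 16 -> 8 -> 4; encoded space is (4, 4, 4, 256)
    # Decoder: three Conv3D + UpSampling3D units with decreasing filter counts.
    for filters in (256, 128, 64):
        x = Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = UpSampling3D(2)(x)
    x = Dropout(0.4)(x)  # dropout before the final convolution
    prob = Conv3D(1, 3, padding="same", activation="sigmoid")(x)

    # Replace predictions in the known region with the input voxels:
    # output = mask * prob + (1 - mask) * vox.
    inv_mask = Lambda(lambda m: 1.0 - m)(mask)
    out = Add()([Multiply()([mask, prob]), Multiply()([inv_mask, vox])])

    model = Model(inputs=[vox, mask], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy")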
4.2. Generative Adversarial Network

Another method that we tried was the generative adversarial network (GAN). GANs were introduced in a paper by Goodfellow et al. as a training method for generative models [6]. A GAN is comprised of two networks, a discriminator and a generator. The discriminator is a classifier that takes in data and classifies it as real or fake. The generator is responsible for generating the input to the discriminator, and improves by generating data that is closer to the real data. This setup can be thought of as a minimax game, formulated in the equation below:

min_G max_D E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]

We explored the 3D-GAN model of Wu et al. on our dataset [10]. This method generates 3D objects from a probabilistic space using volumetric convolutional networks and GANs.

Deep Convolutional Generative Adversarial Networks

Deep convolutional generative adversarial networks (DCGANs) were introduced by Radford et al. [9]. In this model of GANs, both the generator and the discriminator are convolutional neural networks (CNNs). This method was also used by Yeh et al. for semantic image inpainting [11]. We used a similar approach, but adapted it for our 3D models.

Our discriminator had two 3D convolutional layers with leaky ReLU activations. Each convolution layer is followed by a 3D max pooling layer to reduce the dimensionality of the input. Then, we have two fully connected layers which output a score for each of the samples in the batch. The generator uses fully connected layers followed by a transposed-convolution (ConvTranspose) layer to upscale the input and generate a 32 x 32 x 32 volume.

We use two different loss functions for the discriminator and generator, respectively. Both loss functions are based on least squares, adapted from the paper on least squares generative adversarial networks by Mao et al. [7]. The generator loss is:

l_G = (1/2) E_{z~p(z)}[(D(G(z)) - 1)^2]

and the discriminator loss is:

l_D = (1/2) E_{x~p_data}[(D(x) - 1)^2] + (1/2) E_{z~p(z)}[(D(G(z)))^2]

[Figure 2. Deep Autoencoder Graphical Representation]

4.3. Evaluation Metric

In addition to the cross-entropy validation loss of the trained models, we devised a more concrete metric to evaluate accuracy. For each trained model and each cut type, we make cuts in a test set of voxelized models, feed these cut versions into the trained autoencoder/GAN, and obtain probability models as output. For each probability p_i in this output, we define a threshold α such that the corresponding entry in the voxelized output model is:

o_i = 0 if p_i <= α, and 1 otherwise.

Once the output model O is produced as a voxelized model consisting of all the o_i's, O can be compared with the original model M. Treating O and M as arrays of shape (32, 32, 32, 1), we can compute the total number of erroneous voxels as follows:

d(O, M) = Σ_{i=1}^{32} Σ_{j=1}^{32} Σ_{k=1}^{32} |O_{ijk} - M_{ijk}|

This counts the number of mistakes in a single reconstruction (i.e., the number of nonzero entries in the absolute difference between O and M). We then compute the average number of errors over all attempted reconstructions:

err_avg = (1 / |test set|) Σ_{i in test set} d_i(O, M)

Finally, we compute and report the average accuracy in reconstructing each cut region specifically, letting l be the side length of the cubic cut region:

Accuracy = 1 - err_avg / l^3

In our case, we chose α = 0.5 and l = 12, denoting the threshold and cut length values, respectively. Our test set consisted of 500 chair models, and we performed this process for both the deterministic and random cuts.
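This metric is a direct transcription into NumPy; the sketch below assumes binary ground-truth grids of shape (32, 32, 32, 1) and counts errors exactly as defined above.

    import numpy as np

    ALPHA, L = 0.5, 12  # threshold and cubic cut side length from our experiments

    def voxel_errors(prob, original, alpha=ALPHA):
        """Threshold a probability grid and count erroneous voxels, d(O, M)."""
        O = (prob > alpha).astype(np.int32)  # o_i = 0 if p_i <= alpha, else 1
        return int(np.abs(O - original).sum())

    def cut_region_accuracy(probs, originals, l=L, alpha=ALPHA):
        """Average accuracy over a test set: 1 - err_avg / l^3."""
        errs = [voxel_errors(p, m, alpha) for p, m in zip(probs, originals)]
        return 1.0 - np.mean(errs) / l**3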

5. Results

5.1. Autoencoder

We trained a total of four models, listed as follows:

1. Shallow autoencoder, deterministic cuts
2. Shallow autoencoder, random cuts
3. Deep convolutional autoencoder, deterministic cuts
4. Deep convolutional autoencoder, random cuts

Once each model was trained, we tested it on a test set consisting of 500 3D voxelized models. We recorded key statistics about each model in Table 1, and also computed the average accuracy within the reconstructed region using the metric described in Section 4.3.

[Figure 3. Shallow Autoencoder Deterministic Reconstruction Example: (a) original model, (b) cut model, (c) shallow reconstruction.]

[Figure 4. Shallow Autoencoder Random Reconstruction Example: (a) original model, (b) model with a cut, (c) reconstructed model.]

Model | Cross-entropy Validation Loss | Number of Epochs | Metric Accuracy
Shallow AE, Deterministic Cuts | | | %
Shallow AE, Random Cuts | | | %
Deep AE, Deterministic Cuts | | | %
Deep AE, Random Cuts | | | %

Table 1. Results of autoencoder performance.

[Figure 5. Deep Autoencoder Deterministic Reconstruction Example: (a) original model, (b) model with a cut, (c) reconstructed model.]

[Figure 6. Deep Autoencoder Random Reconstruction Example: (a) original model, (b) model with a cut, (c) reconstructed model.]

We can see that the deep convolutional autoencoders generally performed better than their shallow counterparts, although the deterministic cut results are very close. We can also see that the reconstruction accuracy on randomly cut regions is lower than the accuracy on the deterministically chosen cuts. This is most likely a result of the numerous viable random cut positions combined with the many different types of chairs, all of which look distinct. It is a lot to ask of the decoder to reconstruct an arbitrary region of an arbitrary shape in the random case, in contrast to the much simpler task of reconstructing what is almost always part of the chair's back leg.

Let us first examine the shallow autoencoder operating on a deterministic cut in Figure 3. We can see that the reconstruction, even with the shallow autoencoder, is almost perfect, with only slight discrepancies in the upper decorative edge of the chair. This makes sense, since the accuracies for the deep and shallow autoencoders were almost identical.

Next, examine the shallow autoencoder operating on a random cut in Figure 4. We can see that this reconstruction is quite flawed: the shallow autoencoder is simply not robust enough to represent the complexities associated with cuts in random positions. We can see significant artifacts in the reconstruction of the chair back, consistent with the fact that our accuracy for this category was significantly lower than the other accuracies.

In Figure 5 we have a case where the deep convolutional autoencoder reconstructed the back leg portion of a chair model (deterministic cuts). As we can see, it performed fairly well, and this was the trend for most of the data. In the example, the area was reconstructed almost perfectly, with the exception of being one voxel too wide toward the back and one voxel too high in the front. Areas like this, which the model sees as borderline between having and omitting a voxel, are naturally the toughest for the model to decide, since they carry probability values very close to 0.5; a probability near 0.5 is typically returned for voxels adjacent to regions that should be positive. This performance could potentially be improved by modifying the threshold described in Section 4.3, but it seems reasonable to expect a single voxel's worth of error in questionable regions. Overall, this reconstruction is very promising, since it correctly reconstructs over 97% of the region that was cut away.

Next, we have a case where the deep convolutional autoencoder reconstructed a random cut from a chair model; refer to Figure 6. Although performance was weaker than on the deterministic cut data, it was still acceptable overall. In this example, we can see that the portion cut away included a large chunk of the chair's seat as well as the top of its front leg. The seat portion was reconstructed very nicely, but the area where the leg connects to the chair seat is missing. This is probably a result of the autoencoder not seeing enough examples of a chair leg connecting to a chair seat during the training process.

5.2. GAN

We did achieve reconstruction of a chair from scratch, as shown in Figure 7. We tried to run our DCGAN on our dataset with several different parameters and learning rates, but could not achieve any publishable results; due to a lack of compute power, we were unable to experiment with the DCGAN enough to achieve good results. In terms of architecture, in retrospect we think we should have used the same convolutional model architecture for the DCGAN as in our autoencoder.

[Figure 7. Reconstruction of a chair from scratch.]

6. Conclusion

Deep learning methods appear to be very useful in their capacity to solve the 3D shape completion problem. We have trained both shallow and deep autoencoders suited for inpainting 3D models of chairs from ShapeNet, with both deterministic, constant cuts as well as random cuts anywhere in the model. We managed to achieve upwards of 97% and 94% accuracy on the respective categories, which is a significant benchmark and yields visually similar reconstructions. We also explored the use of GANs for this task.

In the future, we hope to continue our work with Professor Savarese in the CVGL lab. We plan to explore different autoencoder designs, as well as optimal reconstruction threshold values. Specifically, an interesting problem would be to learn the optimal reconstruction threshold α over a significantly larger test dataset. We also hope to further our work with GANs to attain more consistent results.
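As a simple illustration of the threshold-learning idea above, the hypothetical sketch below grid-searches α on held-out data using the accuracy metric of Section 4.3; the candidate grid and helper names are our own, not part of our trained pipeline.

    import numpy as np

    def cut_region_accuracy(probs, originals, alpha, l=12):
        """Accuracy = 1 - err_avg / l^3 for a given threshold alpha (Section 4.3)."""
        errs = [np.abs((p > alpha).astype(int) - m).sum()
                for p, m in zip(probs, originals)]
        return 1.0 - np.mean(errs) / l**3

    def best_threshold(probs, originals, candidates=np.linspace(0.05, 0.95, 19)):
        """Pick the alpha that maximizes reconstruction accuracy on held-out models."""
        return max(candidates, key=lambda a: cut_region_accuracy(probs, originals, a))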
Acknowledgments

Thanks to Professor Savarese, Trevor Standley, Chris Choy, and Lynne Tchapmi for their guidance and mentorship during the course of the project.

References

[1] P. Baldi. Autoencoders, unsupervised learning, and deep architectures. In ICML Workshop on Unsupervised and Transfer Learning, JMLR W&CP 27:37-50, 2012.
[2] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, pages 417-424, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[3] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher. Simultaneous structure and texture image inpainting. IEEE Transactions on Image Processing, 12(8):882-889, 2003.
[4] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. CoRR, abs/1512.03012, 2015.
[5] A. Dai, C. R. Qi, and M. Nießner. Shape completion using 3D-encoder-predictor CNNs and shape synthesis. arXiv preprint arXiv:1612.00101, 2016.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[7] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, and Z. Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016.
[8] M. Pauly, N. J. Mitra, J. Giesen, M. H. Gross, and L. J. Guibas. Example-based 3D scan completion. In Symposium on Geometry Processing, pages 23-32, 2005.
[9] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
[10] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82-90, 2016.
[11] R. Yeh, C. Chen, T. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. CoRR, abs/1607.07539, 2016.
