arXiv: v2 [cs.LG] 22 Jan 2019


Spatial Variational Auto-Encoding via Matrix-Variate Normal Distributions

Zhengyang Wang (Department of Computer Science and Engineering at Texas A&M University, zhengyang.wang@tamu.edu)
Hao Yuan (School of Electrical Engineering and Computer Science at Washington State University, hao.yuan@wsu.edu)
Shuiwang Ji (Department of Computer Science and Engineering at Texas A&M University, sji@tamu.edu)

Abstract

The key idea of variational auto-encoders (VAEs) resembles that of traditional auto-encoder models in which spatial information is supposed to be explicitly encoded in the latent space. However, the latent variables in VAEs are vectors, which can be interpreted as multiple feature maps of size $1 \times 1$. Such representations can only convey spatial information implicitly when coupled with powerful decoders. In this work, we propose spatial VAEs that use feature maps of larger size as latent variables to explicitly capture spatial information. This is achieved by allowing the latent variables to be sampled from matrix-variate normal (MVN) distributions whose parameters are computed from the encoder network. To increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial VAEs via low-rank MVN distributions. Experimental results show that the proposed spatial VAEs outperform original VAEs in capturing rich structural and spatial information.

Keywords: Deep learning, variational auto-encoders, matrix-variate normal distributions, generative models, unsupervised learning

1 Introduction.

The mathematical and computational modeling of probability distributions in high-dimensional space and generating samples from them are highly useful yet very challenging. With the development of deep learning methods, deep generative models have been shown to be effective and scalable [12, 22, 5, 9, 19, 8, 21] in capturing probability distributions over high-dimensional data spaces and generating samples from them. Among them, variational auto-encoders (VAEs) [12, 22, 6, 11] are one of the most promising approaches. In machine learning, the auto-encoder architecture is applied to train scalable models by learning latent representations. For image modeling tasks, it is preferred to encode spatial information into the latent space explicitly. However, the latent variables in VAEs are vectors, which can be interpreted as $1 \times 1$ feature maps with no explicit spatial information. While such lack of explicit spatial information does not lead to major performance problems on simple tasks such as digit generation from the MNIST dataset [16], it greatly limits the model's abilities when images are more complicated [13, 17].

To overcome this limitation, we propose spatial VAEs that employ $d \times d$ ($d > 1$) feature maps as latent representations. Such latent feature maps are generated from matrix-variate normal (MVN) distributions whose parameters are computed from the encoder network. Specifically, MVN distributions are able to generate feature maps with appropriate dependencies among locations. To increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial VAEs via low-rank MVN distributions. In this low-rank formulation, the mean matrix of the MVN distribution is computed as the outer product of two vectors computed from the encoder network. Experimental results on image modeling tasks demonstrate the capabilities of our spatial VAEs in complicated image generation tasks.

It is worth noting that the original VAEs can be considered as a special case of spatial VAEs via MVN distributions. That is, if we set the size of feature maps generated via MVN distributions to $1 \times 1$, spatial VAEs via MVN distributions reduce to the original VAEs. More importantly, when the size of feature maps is larger than $1 \times 1$, direct structural ties have been built into elements of the feature maps via MVN distributions.
Thus, our proposed spatial VAEs are intrinsically different from the original VAEs when the size of feature maps is larger than $1 \times 1$. Specifically, our proposed spatial VAEs cannot be obtained by enlarging the size of the latent representations in the original VAEs.

2 Background and Related Work.

In this section, we introduce the architectures of auto-encoders and variational auto-encoders.

2.1 Auto-Encoder Architectures.

The auto-encoder (AE) is a model architecture used in tasks like image segmentation [30, 23, 18], machine translation [2, 25] and denoising reconstruction [28, 29]. It consists of two parts: an encoder that encodes the input data into lower-dimensional latent representations and a decoder that generates outputs by decoding the representations. Depending on the task, the latent representations will focus on different properties of the input data. Nevertheless, these tasks usually require outputs to have similar or exactly the same structure as inputs. Thus, structural information is expected to be preserved through the encoder-decoder process.

In computer vision tasks, structural information usually means spatial information of images. There are two main strategies to preserve spatial information in AEs for image tasks. One is to apply very powerful decoders, like conditional pixel convolutional neural networks (PixelCNNs) [20, 27, 24, 9], that generate output images pixel-by-pixel. In this way, the decoders can recover spatial information in the form of dependencies among pixels. However, pixel-by-pixel generation is very slow, resulting in major speed problems in practice. The other method is to let the latent representations explicitly contain spatial information and apply decoders that can make use of such information. To apply this strategy for image tasks, usually the latent representations are feature maps of size between the size of a pixel ($1 \times 1$) and that of the input image, while the decoders are deconvolutional neural networks (DCNNs) [30]. Since most computer vision tasks only require high-level spatial information, like relative locations of objects, instead of detailed relationships among pixels, preserving only rough spatial information is enough, and this strategy has proved effective and efficient.

2.2 Variational Auto-Encoders.
In unsupervised learning, generative models aim to model the underlying data distribution. Formally, for data space $\mathcal{X}$, let $p_{true}(x)$ denote the probability density function (PDF) of the true data distribution for $x \in \mathcal{X}$. Given a dataset $D = \{x^{(i)}\}_{i=1}^{N}$ of i.i.d. samples from $\mathcal{X}$, generative models try to approximate $p_{true}(x)$ using a model distribution $p_\theta(x)$, where $\theta$ represents model parameters. To train the model, maximum likelihood (ML) inference is performed on $\theta$; that is, parameters are updated to maximize $\log p_\theta(D) = \log p_\theta(x^{(1)}, \ldots, x^{(N)}) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})$.

The approximation quality of $p_\theta(x)$ relies on the generalization ability of the model. In machine learning, it highly depends on learning latent representations which can encode common features among data samples and disentangle abstract explanatory factors behind the data [3]. In data generation tasks, we apply $p_\theta(x) = \int p_\theta(x|z)\, p_\theta(z)\, dz$ for modeling, where $p_\theta(z)$ is the PDF of the distribution of latent representations and $p_\theta(x|z)$ represents a complex mapping from the latent space to the data space. A major advantage of using latent representations is dimensionality reduction of data, since they are low-dimensional. The prior $p_\theta(z)$ can be simple and easy to model, while the mapping represented by $p_\theta(x|z)$ can be learned automatically through complicated deep learning models.

Recently, [12] pointed out that the above model has intractability problems and can only be trained by costly sampling-based methods. To tackle this, they propose variational auto-encoders (VAEs), which instead maximize a variational lower bound of the log-likelihood as

(2.1)  $\log p_\theta(x) \ge \mathcal{L}_{VAE} = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \,\|\, p_\theta(z)]$,

where $q_\phi(z|x)$ is an approximation model to the intractable $p_\theta(z|x)$, parameterized by $\phi$, and $D_{KL}[\cdot \| \cdot]$ represents the Kullback-Leibler divergence.
In VAEs, $p_\theta(x|z) = \mathcal{N}(x; f_\theta(z), \sigma^2 I)$, $q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \Sigma_\phi(x))$, and $p_\theta(z) = \mathcal{N}(z; 0, I)$ are modeled as multivariate Gaussian distributions with diagonal covariance matrices. Here, $f_\theta(z)$, $\mu_\phi(x)$ and $\Sigma_\phi(x)$ are computed with deep neural networks like CNNs. Figure 1 shows the architecture of VAEs. The model parameters $\theta$ and $\phi$ can be trained using the reparameterization trick [22], where the sampling process $z \sim q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \Sigma_\phi(x))$ is decomposed into two steps as

(2.2)  $\epsilon \sim \mathcal{N}(\epsilon; 0, I)$,  $z = \mu_\phi(x) + \Sigma_\phi^{1/2}(x)\, \epsilon$.

3 Spatial Variational Auto-Encoders.

In this section, we analyze a problem of the original VAEs and propose spatial VAEs in Section 3.1 to overcome it. Afterwards, several ways to implement spatial VAEs are discussed. A naïve implementation is introduced and analyzed in Section 3.2, followed by a method that incorporates the use of matrix-variate normal (MVN) distributions in Section 3.3. Finally, we propose our final model, spatial VAEs via low-rank MVN distributions, by applying a low-rank formulation of MVN distributions in Section 3.4.
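As a rough illustration of Section 2.2, the sketch below shows the two-step sampling of Equation 2.2 and the closed-form KL term of Equation 2.1 for a diagonal Gaussian posterior and a standard normal prior. It uses only the Python standard library; the function names `sample_latent` and `kl_to_standard_normal` and the example values of `mu` and `var` are assumptions made for illustration, standing in for the outputs of an encoder network, and are not taken from the paper.

```python
import math
import random


def sample_latent(mu, var, rng):
    """Two-step sampling of Equation 2.2 for a diagonal covariance.

    mu and var are equal-length lists: the mean vector and the diagonal of
    the covariance matrix (hypothetical encoder outputs).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]          # eps ~ N(0, I)
    return [m + math.sqrt(v) * e for m, v, e in zip(mu, var, eps)]


def kl_to_standard_normal(mu, var):
    """Closed-form KL[N(mu, diag(var)) || N(0, I)], the KL term in Equation 2.1."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))


rng = random.Random(0)          # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 0.0]           # illustrative values only
var = [1.0, 0.25, 2.0]
z = sample_latent(mu, var, rng)
kl = kl_to_standard_normal(mu, var)
```

Because the noise `eps` is drawn independently of `mu` and `var`, gradients can flow through `z` to the encoder outputs, which is the point of the reparameterization; the KL term vanishes exactly when the posterior equals the prior.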

3.1 Overview.

Note that $q_\phi(z|x)$ and $p_\theta(x|z)$ in VAEs resemble the encoder and decoder, respectively, in AEs for image reconstruction tasks, where $z$ represents the latent representations. However, in VAEs, $z$ is commonly a vector, which can be considered as multiple $1 \times 1$ feature maps. While $z$ may implicitly preserve some spatial information of the input image $x$, it raises the requirement for a more complex decoder. Given a fixed architecture, the hypothesis space of decoder models is limited. As a result, the optimal decoder may not lie in the hypothesis space [31]. This problem significantly hampers the performance of VAEs, especially when spatial information is important for images in $\mathcal{X}$.

Based on the above analysis, it is beneficial to either have a larger hypothesis space for decoders or let $z$ explicitly contain spatial information. Note that these two methods correspond to the two strategies introduced in Section 2.1. [9] follow the first strategy and propose PixelVAEs, whose decoders are conditional PixelCNNs [27] instead of simple DCNNs. As conditional PixelCNNs themselves are also generative models, PixelVAEs can be considered as conditional PixelCNNs with the conditions replaced by $z$. In spite of their impressive results, the performance of PixelVAEs and conditional PixelCNNs is similar, which indicates that conditional PixelCNNs are responsible for capturing most properties of images in $\mathcal{X}$. In this case, $z$ contributes little to the performance. In addition, applying conditional PixelCNNs leads to a very slow generation process in practice.

In this work, the second strategy is explored by constructing spatial latent representations $z$ in the form of feature maps of size larger than $1 \times 1$. Such feature maps can explicitly contain spatial information. We term VAEs with spatial latent representations as spatial VAEs. The main distinction between spatial VAEs and the original VAEs is the size of latent feature maps.
By having $d \times d$ ($d > 1$) feature maps instead of $1 \times 1$ ones, the total dimension of the latent representations $z$ significantly increases. However, spatial VAEs are essentially different from the original VAEs with a higher-dimensional latent vector $z$. Suppose the vector $z$ is extended by $d^2$ times in order to match the total dimension; the number of hidden nodes in each layer of the decoders will then explode correspondingly. This results in an explosion in the number of decoder parameters, which slows down the generation process. In spatial VAEs, by contrast, decoders become even simpler, since $d \times d$ is closer to the required size of output images. Conversely, when using decoders of similar capacities, spatial VAEs must have higher-dimensional latent representations than the original VAEs. It is demonstrated that this only slightly influences the training process by requiring more outputs from encoders, while the generation process, which only involves decoders, remains unaffected. Our experimental results show that with proper designs, spatial VAEs substantially outperform the original VAEs when applying similar decoders.

3.2 Naïve Spatial VAEs.

To achieve spatial VAEs, a direct and naïve way is to simply reshape the original vector $z$ into $N$ feature maps of size $d \times d$. But this naïve way is problematic since the sampling process does not change. Note that in the original VAEs, the vector $z$ is sampled from $q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \Sigma_\phi(x))$. The covariance matrix $\Sigma_\phi(x)$ is diagonal, meaning the variables are mutually uncorrelated. In particular, for jointly Gaussian variables, being uncorrelated implies independence. Therefore, the components of $z$ are independent random variables, and the variances of their distributions correspond to entries on the diagonal of $\Sigma_\phi(x)$. Specifically, suppose $z$ is a $C$-dimensional vector; the $i$-th component is a random variable that follows the univariate normal distribution $z_i \sim \mathcal{N}(z_i; \mu_\phi(x)_i, \mathrm{diag}(\Sigma_\phi(x))_i)$, $i = 1, \ldots, C$, where $\mathrm{diag}(\cdot)$ represents the vector consisting of a matrix's diagonal entries.
After applying the reparameterization trick, we can rewrite Equation 2.2 as

(3.3)  $\epsilon_i \sim \mathcal{N}(\epsilon_i; 0, 1)$,  $z_i = \mu_\phi(x)_i + \mathrm{diag}(\Sigma_\phi(x))_i^{1/2}\, \epsilon_i$,  $i = 1, \ldots, C$.

To sample $N$ feature maps of size $d \times d$ in naïve spatial VAEs, the above process is followed by a reshape operation while setting $C = d^2 N$. However, between two different components $z_i$ and $z_j$, the only relationship is that their respective distribution parameters $(\mu_\phi(x)_i, \mathrm{diag}(\Sigma_\phi(x))_i)$ and $(\mu_\phi(x)_j, \mathrm{diag}(\Sigma_\phi(x))_j)$ are both computed from $x$. Such dependencies are implicit and weak. After reshaping, there is no direct relationship among locations within each feature map, while spatial latent representations should contain spatial information like dependencies among locations. To overcome this limitation, we propose spatial VAEs via matrix-variate normal distributions.

3.3 Spatial VAEs via Matrix-Variate Normal Distributions.

Instead of obtaining $N$ feature maps of size $d \times d$ by first sampling a $d^2 N$-dimensional vector from multivariate normal distributions and then reshaping, we propose to directly sample matrices as feature maps from matrix-variate normal (MVN) distributions [10], resulting in an improved model known as spatial VAEs via MVN distributions. Specifically, we modify $q_\phi(z|x)$ in the original VAEs and keep other parts the same. As explained below, MVN distributions can model dependencies between the rows and columns


in a matrix. In this way, dependencies among locations within a feature map are established.

Figure 1: Illustration of the differences between the proposed spatial VAEs via low-rank MVN distributions and the original VAEs. At the top is the architecture of the original VAEs, where the latent $z$ is a vector sampled from a multivariate Gaussian distribution with a diagonal covariance matrix. Below is the proposed model, which is explained in detail in Section 3.4. Briefly, it modifies the sampling process by incorporating a low-rank formulation of the MVN distributions and produces latent representations that explicitly retain spatial information.

We proceed by providing the definition of MVN distributions.

Definition: A random matrix $A \in \mathbb{R}^{m \times n}$ is said to follow a matrix-variate normal distribution $\mathcal{N}_{m,n}(A; M, \Omega \otimes \Psi)$ with mean matrix $M \in \mathbb{R}^{m \times n}$ and covariance matrix $\Omega \otimes \Psi$, where $\Omega \in \mathbb{R}^{m \times m} \succ 0$ and $\Psi \in \mathbb{R}^{n \times n} \succ 0$, if $\mathrm{vec}(A^T)$ follows the multivariate normal distribution $\mathcal{N}(\mathrm{vec}(A^T); \mathrm{vec}(M^T), \Omega \otimes \Psi)$. Here, $\otimes$ denotes the Kronecker product and $\mathrm{vec}(\cdot)$ denotes transforming an $\mathbb{R}^{m \times n}$ matrix into an $mn$-dimensional vector by concatenating the columns.

In MVN distributions, $\Omega$ and $\Psi$ capture the relationships across rows and columns, respectively, of a matrix. By constructing the covariance matrix through the Kronecker product of these two matrices, dependencies among values in a matrix can be modeled. In spatial VAEs, a feature map $F$ can be considered as an $\mathbb{R}^{d \times d}$ matrix that follows an MVN distribution $\mathcal{N}_{d,d}(F; M, \Omega \otimes \Psi)$, where $\Omega \in \mathbb{R}^{d \times d}$ and $\Psi \in \mathbb{R}^{d \times d}$ are diagonal matrices. Although within $F$ the random variables corresponding to each location are still independent, since $\Omega \otimes \Psi$ is diagonal, MVN distributions are able to add direct structural ties among locations through their variances.
For example, for two locations $(i_1, j_1)$ and $(i_2, j_2)$ in $F$,

(3.4)  $F_{(i_1,j_1)} \sim \mathcal{N}(F_{(i_1,j_1)}; M_{(i_1,j_1)}, \mathrm{diag}(\Omega \otimes \Psi)_{(i_1,j_1)})$,
(3.5)  $F_{(i_2,j_2)} \sim \mathcal{N}(F_{(i_2,j_2)}; M_{(i_2,j_2)}, \mathrm{diag}(\Omega \otimes \Psi)_{(i_2,j_2)})$.

Here, $F_{(i_1,j_1)}$ and $F_{(i_2,j_2)}$ are independently sampled from two univariate Gaussian distributions, and $\mathrm{diag}(\Omega \otimes \Psi)_{(i,j)}$ denotes the entry of $\mathrm{diag}(\Omega \otimes \Psi)$ that corresponds to location $(i, j)$. However, the variances $\mathrm{diag}(\Omega \otimes \Psi)_{(i_1,j_1)}$ and $\mathrm{diag}(\Omega \otimes \Psi)_{(i_2,j_2)}$ have built direct interactions through the Kronecker product. Based on this, we propose spatial VAEs via MVN distributions, which sample $N$ feature maps of size $d \times d$ from $N$ independent MVN distributions as

(3.6)  $F_k \sim \mathcal{N}_{d,d}(F_k; M_{k\phi}(x), \Omega_{k\phi}(x) \otimes \Psi_{k\phi}(x))$,  $k = 1, \ldots, N$,

where $M_{k\phi}(x)$, $\Omega_{k\phi}(x)$ and $\Psi_{k\phi}(x)$ are computed through the encoder. Here, compared to the original VAEs, $q_\phi(z|x)$ is replaced but $p_\theta(z)$ remains the same. Since MVN distributions are defined based on multivariate Gaussian distributions, the term $D_{KL}[q_\phi(z|x) \,\|\, p_\theta(z)]$ in Equation 2.1 can be calculated in a similar way.

To demonstrate the differences from naïve spatial VAEs, we reexamine the original VAEs. Note that naïve spatial VAEs have the same sampling process as the original VAEs. The original VAE samples a $C = d^2 N$-dimensional vector $z$ from $q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \Sigma_\phi(x))$, where $\mu_\phi(x)$ is a $C$-dimensional vector and $\Sigma_\phi(x)$ is an $\mathbb{R}^{C \times C}$ diagonal matrix. Because $\Sigma_\phi(x)$ is diagonal, it can be represented by the $C$-dimensional vector $\mathrm{diag}(\Sigma_\phi(x))$. To summarize, the encoder of the original VAEs outputs $2C = 2d^2 N$ values, which are interpreted as $\mu_\phi(x)$ and $\mathrm{diag}(\Sigma_\phi(x))$.

In spatial VAEs via MVN distributions, according to Equation 3.6, $M_{k\phi}(x)$ is an $\mathbb{R}^{d \times d}$ matrix while $\Omega_{k\phi}(x)$ and $\Psi_{k\phi}(x)$ are $\mathbb{R}^{d \times d}$ diagonal matrices that can be represented by $d$-dimensional vectors. In this case, the required number of outputs from the encoder is changed to $(d^2 + 2d)N$, corresponding to $[M_{1\phi}(x), \ldots, M_{N\phi}(x)]$, $[\mathrm{diag}(\Omega_{1\phi}(x)), \ldots, \mathrm{diag}(\Omega_{N\phi}(x))]$ and $[\mathrm{diag}(\Psi_{1\phi}(x)), \ldots, \mathrm{diag}(\Psi_{N\phi}(x))]$. As has been explained in Section 3.2, since $\Omega_{k\phi}(x) \otimes \Psi_{k\phi}(x)$ is diagonal, sampling the matrix $F_k$ is equivalent to sampling $d^2$ scalar numbers from independent univariate normal distributions. So the modified sampling process with the reparameterization trick is

(3.7)  $\epsilon_{(i,j,k)} \sim \mathcal{N}(\epsilon_{(i,j,k)}; 0, 1)$,
       $z_{(i,j,k)} = M_{k\phi}(x)_{(i,j)} + \mathrm{diag}(\Omega_{k\phi}(x) \otimes \Psi_{k\phi}(x))_{(i,j)}^{1/2}\, \epsilon_{(i,j,k)}$,  $i, j = 1, \ldots, d$,  $k = 1, \ldots, N$,

where $\mathrm{diag}(\Omega_{k\phi}(x) \otimes \Psi_{k\phi}(x))_{(i,j)} = [\mathrm{diag}(\Omega_{k\phi}(x))\, \mathrm{diag}(\Psi_{k\phi}(x))^T]_{(i,j)}$. Here, we take advantage of the fact that for diagonal matrices, the Kronecker product is equivalent to the outer product of vectors.
To be specific, suppose $D_1$ and $D_2$ are two $\mathbb{R}^{d \times d}$ diagonal matrices; then $d_1 = \mathrm{diag}(D_1)$ and $d_2 = \mathrm{diag}(D_2)$ are two $d$-dimensional vectors and satisfy

(3.8)  $\mathrm{diag}(D_1 \otimes D_2) = \mathrm{vec}\big((d_1 d_2^T)^T\big)$,

that is, the entries of the outer product $d_1 d_2^T$ read row by row.

It is worth noting that, compared to naïve spatial VAEs, the required number of outputs from the encoder decreases from $2d^2 N$ to $(d^2 + 2d)N$. As a result, spatial VAEs via MVN distributions lead to a simpler model while adding structural ties among locations. Note that the original VAEs can be considered as a special case of the spatial VAEs via MVN distributions. That is, if we set $d = 1$, spatial VAEs via MVN distributions reduce to the original VAEs.

3.4 A Low-Rank Formulation.

The use of MVN distributions makes locations directly related to each other within a feature map by adding restrictions on variances. However, in probability theory, variance only measures the expected squared deviation from the mean. To have more direct relationships, it is preferred to have restricted means. In this section, we introduce a low-rank formulation of MVN distributions [1] for spatial VAEs.

The low-rank formulation of an MVN distribution $\mathcal{N}_{m,n}(M, \Omega \otimes \Psi)$ is denoted as $\mathcal{N}_{m,n}(\mu, \nu, \Omega \otimes \Psi)$, where the mean matrix $M$ is instead computed by the outer product $\mu \nu^T$. Here, $\mu$ and $\nu$ are $m$-dimensional and $n$-dimensional vectors, respectively. Similar to computing the covariance matrix through the Kronecker product of two separate matrices, it explicitly forces structural interactions among entries of the mean matrix.

Applying this low-rank formulation leads to our final model, spatial VAEs via low-rank MVN distributions, which is illustrated in Figure 1. By using two distinct $d$-dimensional vectors to construct $M_{k\phi}(x) \in \mathbb{R}^{d \times d}$, Equation 3.6 is modified as

(3.9)  $F_k \sim \mathcal{N}_{d,d}(F_k; \mu_{k\phi}(x)\, \nu_{k\phi}(x)^T, \Omega_{k\phi}(x) \otimes \Psi_{k\phi}(x))$,  $k = 1, \ldots, N$,

where $\mu_{k\phi}(x)$ and $\nu_{k\phi}(x)$ are $d$-dimensional vectors.
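The diagonal Kronecker identity of Equation 3.8 and the sampling of a single feature map with a low-rank mean (Equation 3.9) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `kron_diag`, `outer` and `sample_feature_map` are hypothetical, the vectors `omega`, `psi`, `mu` and `nu` are made-up values standing in for encoder outputs, and plain Python lists are used in place of a tensor library.

```python
import math
import random


def kron_diag(omega, psi):
    """Diagonal of the Kronecker product of diag(omega) and diag(psi)."""
    return [o * p for o in omega for p in psi]


def outer(u, v):
    """Outer product u v^T, returned as a list of rows."""
    return [[a * b for b in v] for a in u]


def sample_feature_map(mean, omega, psi, rng):
    """Sample a map whose entry (i, j) ~ N(mean[i][j], omega[i] * psi[j])."""
    var = outer(omega, psi)
    return [
        [mean[i][j] + math.sqrt(var[i][j]) * rng.gauss(0.0, 1.0)
         for j in range(len(psi))]
        for i in range(len(omega))
    ]


omega = [0.5, 1.0, 2.0]    # stands in for diag(Omega); illustrative values
psi = [1.0, 0.1, 0.3]      # stands in for diag(Psi); illustrative values

# Equation 3.8: the Kronecker diagonal equals the outer product read row by row.
flat = [x for row in outer(omega, psi) for x in row]
identity_holds = kron_diag(omega, psi) == flat

mu = [1.0, -1.0, 0.5]      # hypothetical d-dimensional encoder outputs
nu = [0.2, 0.0, 2.0]
low_rank_mean = outer(mu, nu)   # rank-one mean matrix mu nu^T
F = sample_feature_map(low_rank_mean, omega, psi, random.Random(0))
```

Passing a full `d x d` matrix as `mean` instead of `outer(mu, nu)` gives the sampling of Equation 3.7; the low-rank variant only changes how the mean is parameterized, not how the noise is applied.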
For the encoder, the number of outputs is further reduced to $4dN$ from $(d^2 + 2d)N$ (a reduction whenever $d > 2$), replacing $d^2 N$ outputs for $(M_{1\phi}(x), \ldots, M_{N\phi}(x))$ with $dN$ outputs for $(\mu_{1\phi}(x), \ldots, \mu_{N\phi}(x))$ and another $dN$ outputs for $(\nu_{1\phi}(x), \ldots, \nu_{N\phi}(x))$. In contrast to Equation 3.7, the two-step sampling process can be expressed as

(3.10)  $\epsilon_{(i,j,k)} \sim \mathcal{N}(\epsilon_{(i,j,k)}; 0, 1)$,
        $z_{(i,j,k)} = (\mu_{k\phi}(x)\, \nu_{k\phi}(x)^T)_{(i,j)} + \mathrm{diag}(\Omega_{k\phi}(x) \otimes \Psi_{k\phi}(x))_{(i,j)}^{1/2}\, \epsilon_{(i,j,k)}$,  $i, j = 1, \ldots, d$,  $k = 1, \ldots, N$,

where $\mathrm{diag}(\Omega_{k\phi}(x) \otimes \Psi_{k\phi}(x))_{(i,j)} = [\mathrm{diag}(\Omega_{k\phi}(x))\, \mathrm{diag}(\Psi_{k\phi}(x))^T]_{(i,j)}$.

As has been demonstrated in Section 3.1, spatial VAEs require more outputs from encoders than the original


VAEs, which slows down the training process. Spatial VAEs via low-rank MVN distributions properly address the problem while achieving appropriate spatial latent representations. According to the experimental results, they outperform the original VAEs in several image generation tasks when similar decoders are used.

Figure 2: Sample face images generated by different VAEs when trained on the CelebA dataset. The first and second rows show training images and images generated by the original VAEs. The remaining three rows are the results of naïve spatial VAEs, spatial VAEs via MVN distributions and spatial VAEs via low-rank MVN distributions, respectively.

4 Experimental Studies.

We use the original VAEs as the baseline models in our experiments, as most recent improvements on VAEs are derived from the vector latent representations and can be easily incorporated into our matrix-based models. To elucidate the performance differences of various spatial VAEs, we compare the results of the three spatial VAEs introduced in Section 3, namely naïve spatial VAEs, spatial VAEs via MVN distributions and spatial VAEs via low-rank MVN distributions. We train the models on the CelebA, CIFAR-10 and MNIST datasets, and analyze sample images generated from the models to evaluate the performance. For the same task, the encoders of all compared models are composed of the same convolutional neural networks (CNNs) and a fully-connected output layer [15, 14]. While the fully-connected layer may differ as required by different numbers of output units, it only slightly affects the training process. As discussed in Section 3.1, it is reasonable to compare spatial VAEs with the original VAEs in the case that their decoders have similar architectures and model capabilities. Therefore, following the original VAEs, deconvolutional neural networks (DCNNs) are used as decoders in spatial VAEs.
Meanwhile, the total numbers of trainable parameters in the decoders of all compared models are set to be as similar as possible while accommodating different input sizes.

4.1 CelebA.

The CelebA dataset contains 202,599 color face images. The generative models are supposed to generate faces that are similar but not identical to those in the dataset. For this task, the CNNs in the encoders have 3 layers, while the decoders are 5-layer DCNNs for spatial VAEs and 6-layer DCNNs for the original VAEs. This difference is caused by the fact that spatial VAEs have $d \times d$ ($d > 1$) feature maps as latent representations, which require fewer up-sampling operations to obtain outputs. We set $d = 3$ and $N = 64$, and the dimension of $z$ in the original VAEs is 81 in order to have decoders with similar numbers of trainable parameters. Figure 2 shows sample face images generated by the original VAEs and three different variants of spatial VAEs. It is clear that spatial VAEs can generate images with more details than the original VAEs.
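For the configuration above ($d = 3$, $N = 64$, an 81-dimensional $z$), the encoder output counts implied by the formulas of Section 3 can be tallied with a few lines of arithmetic. The function name `encoder_outputs` and its argument names are hypothetical; the formulas $2C$, $2d^2N$, $(d^2 + 2d)N$ and $4dN$ are the ones stated in the text.

```python
def encoder_outputs(d, n_maps, vector_dim):
    """Number of encoder outputs for each model, following Section 3."""
    return {
        "original": 2 * vector_dim,            # mean and variance per component
        "naive": 2 * d * d * n_maps,           # C = d^2 N components, two values each
        "mvn": (d * d + 2 * d) * n_maps,       # mean matrix plus two diagonals per map
        "low_rank_mvn": 4 * d * n_maps,        # mu, nu and two diagonals per map
    }


counts = encoder_outputs(d=3, n_maps=64, vector_dim=81)
print(counts)
# prints {'original': 162, 'naive': 1152, 'mvn': 960, 'low_rank_mvn': 768}
```

These totals agree with the encoder output counts the paper reports in its timing comparison for the same setting.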

Due to the lack of explicit spatial information, the original VAEs produce face images with few details, such as hair, near the borders. While naïve spatial VAEs seem to address this problem, most faces have only incomplete hair, as naïve spatial VAEs cannot capture the relationships among different locations. Theoretically, spatial VAEs via MVN distributions are able to incorporate interactions among locations. However, the results are strange faces with some distortions. We believe the reason is that adding dependencies among locations through restrictions on distribution variances is neither effective nor sufficient. Spatial VAEs via low-rank MVN distributions, which have restricted means, tackle this well and generate faces with appealing visual appearances.

Figure 3: Sample images generated by different VAEs when trained on the CIFAR-10 dataset. From top to bottom, the five rows are training images and images generated by the original VAEs, naïve spatial VAEs, spatial VAEs via MVN distributions and spatial VAEs via low-rank MVN distributions, respectively.

4.2 CIFAR-10.

The CIFAR-10 dataset consists of 60,000 color images of size $32 \times 32$ in 10 classes. VAEs usually perform poorly in generating photo-realistic images since there are significant differences among images in different classes, indicating that the underlying true distribution of the data is multi-modal. In this case, VAEs tend to output very blurry images [26, 8, 7]. However, comparison among different models can still demonstrate the differences in terms of generative capabilities. In this experiment, we set $d = 3$ and $N = 128$, and the dimension of $z$ in the original VAEs is 150. The encoders have 4 layers while the decoders have 4 or 5 layers.

Some sample images are provided in Figure 3. The original VAEs only produce images composed of several colored areas, which is consistent with the results of a similar model reported in [22]. All three implementations of spatial VAEs generate images with more details.
However, naïve spatial VAEs still produce meaningless images as there is no relationship among different parts. The images generated by spatial VAEs via MVN distributions look like distorted objects, which is a problem similar to that seen in the results on the CelebA dataset. Again, spatial VAEs via low-rank MVN distributions outperform the other models, producing blurry but object-like images.

Table 1: Parzen window log-likelihood estimates of test data on the MNIST dataset. We follow the same procedure as in [8].

Model                    Log-Likelihood
Original VAE             297
Naïve SVAE               275
SVAE via MVN             267
SVAE via low-rank MVN    296

4.3 MNIST.

We perform quantitative analysis on the real-valued MNIST dataset by employing the Parzen window log-likelihood estimates [4]. This evaluation method is used for several generative models where the exact likelihood is not tractable [8, 19]. The results are reported in Table 1, where SVAE is short for spatial VAE. Despite the difference in visual quality of generated images, the spatial VAE via low-rank MVN distributions shares similar quantitative results with the original VAE. Note that generative models for images are supposed to capture the underlying data distribution by maximizing log-likelihood and generate images that are similar to real ones. However, it has been pointed out in [26] that these two objectives are not consistent, and generative models need to be evaluated directly with respect to the applications for which they were intended. A model that can generate samples with good visual appearances may have poor average log-likelihood on the test dataset and vice versa. Common examples of deep generative models are VAEs and generative adversarial networks (GANs) [8]. VAEs usually have higher average log-likelihood while GANs

can generate more photo-realistic images. This is basically caused by the different training objectives of these two models [7]. Currently there is no commonly accepted standard for evaluating generative models.

4.4 Timing Comparison.

To show the influence of different spatial VAEs on the training process, we compare the training time on the CelebA dataset. Theoretically, spatial VAEs slow down training due to the larger numbers of outputs from encoders. To keep the number of trainable parameters in decoders roughly equal, we set the dimension of $z$ in the original VAEs to be 81 while $d = 3$ and $N = 64$ for spatial VAEs. According to Section 3, the numbers of outputs from their encoders are 162, 1152, 960, and 768 for the original VAE, naïve spatial VAE, spatial VAE via MVN distributions and spatial VAE via low-rank MVN distributions, respectively. We train our models on an Nvidia Tesla K40C GPU and report the average time for training one epoch in Table 2. Comparisons of the time for generating 10,000 images are also provided to show that the increase in the total dimension of latent representations does not affect the generation process. The results show consistent relationships between the training time and the number of outputs from encoders; that is, spatial VAEs cost more time than the original VAE, but spatial VAEs via low-rank MVN distributions can alleviate this problem. Moreover, spatial VAEs only slightly slow down the training process since they only affect one single layer in the models.

Table 2: Training and generation time of different models when trained on the CelebA dataset using an Nvidia Tesla K40C GPU. The average time for training one epoch and the time for generating 10,000 images are reported and compared.

Model                    Training time    Generation time
Original VAE             s                s
Naïve SVAE               s                s
SVAE via MVN             s                s
SVAE via low-rank MVN    s                s

5 Conclusion.
In this work, we propose spatial VAEs for image generation tasks, which improve VAEs by requiring the latent representations to explicitly contain spatial information of images. Specifically, in spatial VAEs, $d \times d$ ($d > 1$) feature maps are sampled to serve as spatial latent representations, in contrast to a vector. This is achieved by sampling the latent feature maps from MVN distributions, which can model dependencies between the rows and columns in a matrix. We further propose to employ a low-rank formulation of MVN distributions to establish stronger dependencies. Qualitative results on different datasets show that spatial VAEs via low-rank MVN distributions substantially outperform the original VAEs.

Acknowledgements.

This work was supported by the National Science Foundation grants IIS and DBI

References

[1] G. I. Allen and R. Tibshirani, Transposable regularized covariance models with an application to missing data imputation, The Annals of Applied Statistics, 4 (2010).
[2] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint, (2014).
[3] Y. Bengio, A. Courville, and P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (2013).
[4] O. Breuleux, Y. Bengio, and P. Vincent, Quickly generating representative samples from an RBM-derived process, Neural Computation, 23 (2011).
[5] Y. Burda, R. Grosse, and R. Salakhutdinov, Importance weighted autoencoders, arXiv preprint, (2015).
[6] C. Doersch, Tutorial on variational autoencoders, arXiv preprint, (2016).
[7] I. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, arXiv preprint, (2016).
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, 2014.
[9] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, PixelVAE: A latent variable model for natural images, arXiv preprint, (2016).
[10] A. K. Gupta and D. K. Nagar, Matrix variate distributions, vol. 104, CRC Press, 1999.

[11] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, Improved variational inference with inverse autoregressive flow, in Advances in Neural Information Processing Systems, 2016.
[12] D. P. Kingma and M. Welling, Auto-encoding variational Bayes, arXiv preprint, (2013).
[13] A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, (2009).
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, 2012.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998).
[16] Y. LeCun, C. Cortes, and C. J. Burges, The MNIST database of handwritten digits.
[17] Z. Liu, P. Luo, X. Wang, and X. Tang, Deep learning face attributes in the wild, in Proceedings of International Conference on Computer Vision (ICCV).
[18] J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[19] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, Adversarial autoencoders, arXiv preprint, (2015).
[20] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, Pixel recurrent neural networks, arXiv preprint, (2016).
[21] A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint, (2015).
[22] D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint, (2014).
[23] O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015.
[24] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications, arXiv preprint, (2017).
[25] I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems, 2014.
[26] L. Theis, A. v. d. Oord, and M. Bethge, A note on the evaluation of generative models, arXiv preprint, (2015).
[27] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., Conditional image generation with PixelCNN decoders, in Advances in Neural Information Processing Systems, 2016.
[28] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in Proceedings of the 25th International Conference on Machine Learning, ACM, 2008.
[29] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research, 11 (2010).
[30] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, Deconvolutional networks, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010.
[31] S. Zhao, J. Song, and S. Ermon, Towards deeper understanding of variational autoencoding models, arXiv preprint, (2017).
