






The challenges of training Generative Adversarial Networks (GANs) on realistic datasets and the benefits of using a mixture of multiple generators and discriminators. The authors find that this approach can improve the generative distribution and reduce variance, leading to better performance. They also explore the impact of network depth, width, and training set size on GANs.







¹ School of Information Science & Technology, ShanghaiTech University
² Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
³ University of Chinese Academy of Sciences
Abstract —Generative Adversarial Networks (GANs) have made great progress in synthesizing realistic images in recent years. However, they are often trained on image datasets with either too few samples or too many classes belonging to different data distributions. Consequently, GANs are prone to underfitting or overfitting, making the analysis of them difficult and constrained. Therefore, in order to conduct a thorough study on GANs while obviating unnecessary interferences introduced by the datasets, we train them on artificial datasets where there are infinitely many samples and the real data distributions are simple, high-dimensional and have structured manifolds. Moreover, the generators are designed such that optimal sets of parameters exist. Empirically, we find that under various distance measures, the generator fails to learn such parameters with the GAN training procedure. We also find that training mixtures of GANs leads to more performance gain compared to increasing the network depth or width when the model complexity is high enough. Our experimental results demonstrate that a mixture of generators can discover different modes or different classes automatically in an unsupervised setting, which we attribute to the distribution of the generation and discrimination tasks across multiple generators and discriminators. As an example of the generalizability of our conclusions to realistic datasets, we train a mixture of GANs on the CIFAR-10 dataset and our method significantly outperforms the state-of-the-art in terms of popular metrics, i.e., Inception Score (IS) and Fréchet Inception Distance (FID).
I. Introduction
The past few years have witnessed the rising popularity of generative models. Generative models have energized tasks in image processing (e.g., image super-resolution and editing) and machine learning (e.g., reinforcement learning and semi-supervised learning) [1]. Typically, a generative model learns a distribution $P_g$ to approximate the true distribution $P_r$, given a set of observed samples. The Generative Adversarial Network (GAN) [2] is, without doubt, the most prevalent generative model. It is composed of a generator $G$ that maps random noise to synthesized data points, and a discriminator $D$ which aims to tell whether its input comes from the real data distribution $P_r$ or the generative distribution $P_g$. During training, $D$ and $G$ are updated simultaneously or alternately.

In a vanilla GAN, $D$ gives an estimate of the Jensen–Shannon divergence between $P_r$ and $P_g$ while $G$ tries to minimize it [2]. Unfortunately, the objective of $G$ can saturate when $P_g$ and $P_r$ do not have a non-negligible overlapping manifold, causing vanishing gradients for the generator [3]. Let $\mathcal{Z}$ and $\mathcal{X}$ be the domain and codomain of $G$ respectively. $G(\mathcal{Z})$ is contained in a countable union of manifolds of dimension at most $\dim \mathcal{Z}$. Then, according to [3], if the dimension of $\mathcal{Z}$ is less than that of $\mathcal{X}$, $G(\mathcal{Z})$ will be a set of measure 0 in $\mathcal{X}$, so $P_r$ and $P_g$ can be distinguished with accuracy 1 by $D$ and no gradient is provided to $G$.

Besides, GANs suffer from mode collapse. Mode collapse refers to the phenomenon that the samples of the generator lack the diversity exhibited in $P_r$. [4] prove that the generator can fool the discriminator by generating a limited number of images from the training set. In other cases of mode collapse, the generated samples are even meaningless, as $G$ needs only to fool $D$ in the current iteration. When mode collapse happens, the model fails to generate diverse and realistic data.

To cope with these challenges, variants of GAN were proposed (e.g., [5]–[10]). Because these methods are applied to high-dimensional realistic datasets with inadequate samples from each class, the behavior of GANs is still not completely understood. Another problem with realistic datasets is that the performance of GANs can degrade simply due to data scarcity or insufficient model complexity [4], [11]. Since we aim to study the behavior of GANs, conventional image datasets might not be good choices. Hence we train GANs on artificially constructed datasets (e.g., mixtures of Gaussians in high-dimensional space), applying neural networks with sufficiently high capacity. In this way, we can avoid the influence of the aforementioned factors and focus on the inherent problems of GAN training. The contributions of this work can be summarized as follows:
II. Related Work

There have been attempts to make a GAN converge to an equilibrium [4], [10], [12]. However, even if a GAN reaches an equilibrium, it might fail to learn the desired real data distribution. To support this conjecture, [13] adopt the birthday paradox to measure the diversity of the generative distribution. They present empirical evidence that $P_g$ has smaller support than $P_r$. However, it should be noted that this problem might also be due to the dimension of the manifold of the latent distribution being lower than the dimension of the manifold of $P_r$ [3]. In order to rule out this possibility, we set the dimension of $z$ to be no lower than the dimension of $x$ in our experiments on the artificial datasets. We share a similar goal with [13], but we conduct experiments on artificially constructed datasets with infinitely many data samples. Consistent with [13], our experiments reveal that even when a GAN converges to a diverse distribution, it still differs from the true distribution. Considering that the birthday paradox test in [13] is rather restrictive on continuous data, we propose to use some other measures for validating whether GANs can learn the real data distribution.

Recently, large-scale GAN training (e.g., [11], [14]–[16]) has proven effective on the ImageNet [17] dataset. Its superiority over previous models is mainly due to high model complexity and large batch sizes. While current state-of-the-art GAN models on ImageNet are still limited by model complexity and batch size, our work focuses on synthetic datasets that allow the batch size and model complexity to be sufficiently high, which enables us to explore the properties of GANs in ideal cases.

Some previous work has studied the feasibility of using multiple discriminators [18], multiple generators [19], [20], or both [4] to improve the performance of GANs. Our experiments on artificial datasets are based on MIX+GAN [4] and we find it beneficial to use multiple generators and discriminators. Further, our experimental results unveil the relations between the number of generators and discriminators and the performance of GANs. As the computation of MIX+GAN is expensive or even infeasible, we modify it to allow larger mixtures and achieve state-of-the-art results on CIFAR-10. In our work, we also explore how factors such as network depth, network width and training set size affect the performance of GANs.
III. Models
WGAN-GP [8] has been gaining popularity (e.g., [21]–[23]) for its stability, while MIX+GAN [4] guarantees the existence of an approximate equilibrium using a mixture of generators and discriminators. MIX+GAN is also effective in modeling multi-modal data, which is common in realistic datasets. Therefore, we combine WGAN-GP and MIX+GAN for our experiments on the artificial datasets. In the following, we will first introduce Wasserstein GAN (WGAN) [7] and WGAN-GP [8], then introduce MIX+GAN [4].
In a vanilla GAN, the generator tries to minimize the approximate Jensen–Shannon divergence defined by the discriminator. Unlike in a vanilla GAN, the discriminator in WGAN calculates an approximate Wasserstein distance between the real and fake data distributions. The discriminator in WGAN is also referred to as the "critic". We will use both terms interchangeably in this paper. The minimax game for WGAN is formulated as
$$W(P_r, P_g) = \min_G \max_D \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \tag{1}$$

where $D$ is in the set of all 1-Lipschitz functions and $P_g$ is the model distribution implicitly defined by $z \sim p(z)$, $\tilde{x} = G(z)$. Note that Eq. 1 can be reformulated as

$$W(P_r, P_g) = \frac{1}{k} \left\{ \min_G \max_D \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \right\} \tag{2}$$
where $D$ is in the set of all $k$-Lipschitz functions. WGAN [7] adopts a weight clipping approach to enforce the Lipschitz constraint. However, it can lead to optimization problems and pathological behaviors. To overcome these problems, an improved version of WGAN was proposed in [8], introducing a new objective for the critic:

$$\mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \| \nabla_{\hat{x}} D(\hat{x}) \|_2 - 1 \right)^2 \right] \tag{3}$$

where $\hat{x}$ comes from the distribution $P_{\hat{x}}$ whose samples are interpolated between samples from $P_g$ and $P_{data}$. This choice is based upon the fact that the L2-norm of the gradient of the optimal $D$ is 1 between the manifolds of $P_g$ and $P_{data}$ [8]. The last term can be interpreted as a regularizer that keeps the gradient between the real and fake datasets at a moderate scale, so that $P_g$ is moved smoothly towards the real data distribution.
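To make the three terms of Eq. 3 concrete, the following is a minimal NumPy sketch of a batch estimate of the critic objective. It is an illustration under stated assumptions, not the implementation used in the experiments: the critic is a hypothetical one-hidden-layer tanh network (chosen so that its input gradient has a closed form), the interpolation coefficient is drawn uniformly from [0, 1] as in [8], and the names `critic`, `critic_input_grad` and `wgan_gp_critic_loss` are invented for this example.

```python
import numpy as np


def critic(x, W, b, v):
    """Toy critic D(x) = v . tanh(W x + b), applied to each row of x (assumed architecture)."""
    return np.tanh(x @ W.T + b) @ v


def critic_input_grad(x, W, b, v):
    """Closed-form gradient of the toy critic with respect to each input row."""
    h = np.tanh(x @ W.T + b)          # shape (batch, hidden)
    return ((1.0 - h ** 2) * v) @ W   # shape (batch, dim)


def wgan_gp_critic_loss(x_real, x_fake, W, b, v, lam=10.0, rng=None):
    """Batch estimate of Eq. 3: E[D(fake)] - E[D(real)] + lam * gradient penalty."""
    rng = np.random.default_rng() if rng is None else rng
    # Interpolate between paired real and fake samples (uniform weights, as in [8]).
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, W, b, v), axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    wasserstein_term = np.mean(critic(x_fake, W, b, v)) - np.mean(critic(x_real, W, b, v))
    return wasserstein_term + lam * penalty, penalty
```

In a real training loop the gradient would come from automatic differentiation rather than a hand-derived formula; the closed form is used here only to keep the sketch dependency-free.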
projection of) the generated data. If a human inspector cannot distinguish whether the generated data are real or fake, then one can conclude that the generative model is very successful. If the inspectors say that samples generated by one model are significantly better than those generated by another model, then it can also be concluded that the former model is better. On the other hand, if a human inspector cannot tell a significant difference, then one may want to resort to more objective and more accurate metrics.

To inspect the generated synthetic high-dimensional samples manually, we project the generated samples onto the plane determined by $a = (1, 0, 0, 0, \ldots, 0)$, $b = (0, 1, 0, 0, \ldots, 0)$ and $c = (0, 0, 1, 0, \ldots, 0)$. In the projection plane, the origin is $a$, the x-axis is in the same direction as $\overrightarrow{ab}$, and the y-axis is in a direction that is perpendicular to the x-axis, as shown in Figure 1. The projections of sample data can be seen in Figure 2 and some other figures in this paper, where real data points are indicated by red dots and generated data points are indicated by blue dots. Background colors show the contour plot of the output of the discriminator(s): red corresponds to high values while blue corresponds to low values.
Fig. 1: In our projection method, all the data points are projected from the 1024-dimensional space onto the plane that a, b and c lie in. The diagram labels the points a(1,0,0,0,…,0), b(0,1,0,0,…,0) and c(0,0,1,0,…,0) together with O and O′.
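The projection described above can be sketched in a few lines of NumPy. This is one possible reading rather than the authors' code: the text only states that the y-axis is perpendicular to the x-axis within the plane, so the sketch assumes it points towards the side on which $c$ lies (a Gram-Schmidt step), and the function names `plane_basis` and `project_to_plane` are hypothetical.

```python
import numpy as np


def plane_basis(a, b, c):
    """Orthonormal in-plane axes: u along ab, v perpendicular to u on the side of c (assumed)."""
    u = (b - a) / np.linalg.norm(b - a)
    w = (c - a) - np.dot(c - a, u) * u   # remove the component of ac along u
    v = w / np.linalg.norm(w)
    return u, v


def project_to_plane(points, a, b, c):
    """Return the 2-D coordinates of each row of points, with a as the origin."""
    u, v = plane_basis(a, b, c)
    centered = points - a
    return np.stack([centered @ u, centered @ v], axis=1)


# The three anchor points in the 1024-dimensional data space.
dim = 1024
a, b, c = np.zeros(dim), np.zeros(dim), np.zeros(dim)
a[0], b[1], c[2] = 1.0, 1.0, 1.0
coords = project_to_plane(np.stack([a, b, c]), a, b, c)
```

Under these assumptions the anchors land on an equilateral triangle with side length $\sqrt{2}$: $a$ at the origin, $b$ on the positive x-axis, and $c$ above the midpoint of the segment between them.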
2) Fréchet Distance: [12] proposed to use the Fréchet Inception Distance (FID) as a metric for evaluating generative models. The Fréchet Distance (FD, also known as the Wasserstein-2 distance) for two Gaussian distributions $\mathcal{N}(m_1, C_1)$ and $\mathcal{N}(m_2, C_2)$ is given by $\|m_1 - m_2\|_2^2 + \mathrm{tr}\left(C_1 + C_2 - 2 (C_1 C_2)^{1/2}\right)$ [27]. During the computation of FID, images from the real and fake distributions are fed into the Inception model [28] to get their activations in the last pooling layer. The distributions of the activations are approximately treated as Gaussian so that their means and covariances can be used to compute the FID. In this paper, since we are dealing with artificial data and the Inception model was intended for realistic data, we use the means and covariances of the artificial data directly without passing them through the Inception network. 50, data points are sampled from $P_r$ and $P_g$ respectively for computing the Fréchet Distance.
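As an illustration of how the Fréchet Distance can be computed directly on raw samples, the sketch below estimates the means and covariances from two sample matrices and evaluates the formula above. It is a minimal example and not the authors' implementation: the function name `frechet_distance` is hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of $C_1 C_2$ (which are real and non-negative for positive semi-definite covariances) instead of forming the square root explicitly.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet Distance between Gaussians fitted to the rows of x and of y."""
    m1, m2 = x.mean(axis=0), y.mean(axis=0)
    c1 = np.cov(x, rowvar=False)
    c2 = np.cov(y, rowvar=False)
    # tr((C1 C2)^{1/2}) equals the sum of the square roots of the eigenvalues of C1 C2.
    eigvals = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((m1 - m2) ** 2)
    return float(mean_term + np.trace(c1) + np.trace(c2) - 2.0 * tr_sqrt)
```

Two sample sets that differ only by a constant shift share the same covariance, so their distance reduces to the squared norm of the shift; this gives a quick sanity check for any implementation.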
3) Critic output: As noted in [7], the loss of the critic provides a meaningful estimate of the Wasserstein distance between $P_r$ and $P_g$. We can log the average value of $D(x_r) - D(x_g)$ during each iteration with almost no additional computation cost. If it is positive, then it tells us that $P_r$ is different from $P_g$. Moreover, it is an indicator of the training dynamics of WGAN.

4) Wasserstein distance: The above estimate of the Wasserstein distance can be inaccurate due to adversarial training. Alternatively, one can train an independent critic to approximate the Wasserstein distance after training a GAN [24]. Note that the gradient penalty term might be larger than 0 and the critic may be a $k$-Lipschitz function, thus we normalize the estimated Wasserstein distance using Eq. 2. For fair comparison, we train an independent critic with the same architecture across different experiments. Specifically, it has 5 layers and 1024 neurons in each hidden layer. In our experiments, we estimate the approximate Wasserstein distance $W(P_r, P_g)$ with 25,600 sample points from $P_r$ and $P_g$ respectively.

5) "Judge" accuracy: In all our experiments, an independent classifier called "Judge" is trained to distinguish samples from $P_g$ and $P_r$. The accuracy of the Judge is an objective metric for evaluating all of our GANs. After the Judge is fully trained, its classification accuracy is expected to range between 0.5 and 1. If the generator(s) has learned the distribution, then the Judge should have an accuracy of around 0.5. Conversely, if the generator(s) produces a distribution different from $P_r$, the Judge is expected to have an accuracy higher than 0.5. By Theorem 2.2 of [3], if $P_g$ and $P_r$ have supports contained in two closed manifolds $M$ and $P$ that don't perfectly align and don't have full dimension, and $P_g$ and $P_r$ are continuous in their respective manifolds, then there exists a perfect classifier that has accuracy 1.
One can show that the expected Judge accuracy is related to the total variation distance:
Proposition 1. Let $J$ be a deterministic classifier for samples from two distributions $P_r$ and $P_g$ with equal prior probabilities. Let $\delta(P_r, P_g)$ be the total variation distance between $P_r$ and $P_g$, then

$$\delta(P_r, P_g) \geq 2\,\mathbb{E}[J_{acc}] - 1 \tag{9}$$
The proof of Proposition 1 is provided in Appendix A. Proposition 1 is intuitive: If the total variation distance between two distributions is very low, then it is hard for any classifier to tell them apart and the accuracy of a classifier can hardly get above 0.5; if a classifier has an accuracy of 1, then the total variation distance between them is high. One can in turn show that the total variation distance is related to the Kullback–Leibler Divergence [29].
For fair comparison, we train an independent Judge with the same architecture across different experiments. Specifically, it has 5 layers and 1024 neurons in each layer. In our experiments, we estimate Jacc with 25,600 sample points from P r and P g respectively.
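Proposition 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the two four-outcome distributions are made up, the helper names are hypothetical, and the total variation distance is taken as half the L1 distance between the probability vectors. It enumerates every deterministic classifier and confirms that none exceeds the accuracy allowed by the bound.

```python
from itertools import product


def total_variation(p_r, p_g):
    """Total variation distance between two discrete distributions (half the L1 distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_r, p_g))


def judge_accuracy(rule, p_r, p_g):
    """Expected accuracy under equal priors; rule[i] is 1 (predict real) or 0 (predict fake)."""
    return 0.5 * sum(a if label == 1 else b for label, a, b in zip(rule, p_r, p_g))


def best_judge_accuracy(p_r, p_g):
    """Brute-force search over all deterministic classifiers on the finite outcome set."""
    rules = product((0, 1), repeat=len(p_r))
    return max(judge_accuracy(rule, p_r, p_g) for rule in rules)


# Hypothetical distributions over four outcomes.
p_r = [0.5, 0.3, 0.2, 0.0]
p_g = [0.1, 0.3, 0.2, 0.4]
delta = total_variation(p_r, p_g)
best = best_judge_accuracy(p_r, p_g)
```

For these distributions the total variation distance is 0.4 and the best achievable accuracy is 0.7, so the bound holds with equality for the optimal classifier and strictly for every other one.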
D. Training
We follow some experimental setups of [8] for toy data: the batch size is 256; there are 100,000 GAN iterations, each of which includes 1 generator update and 5 discriminator updates; after the training of the GAN, we train the Judge and the independent critic for another 100, iterations respectively. Adam optimizers [30] are used for optimizing all models.

Our experiments differ from [8] in the following ways: in order to imitate the training of GANs on high-dimensional realistic data, the data points lie in 1024-dimensional space; motivated by [3], in order to allow the manifold of $P_g$ to have the same dimension as that of $P_r$, the input noise $z$ follows a 1024-dimensional Gaussian distribution and the activation layers are chosen to be LeakyReLU layers, which are injective; in all the experiments, $\lambda$ in WGAN-GP is set to 10 to improve stability; we adopt a "two time-scale update rule" (TTUR) [12]: the learning rate of the discriminator(s) is set to 1e-4 and the learning rate of the generator(s) is set to 1e-5 after some hyperparameter searching.

During each iteration of the training of the GAN, the generator(s) and the discriminator(s) are updated in the following order:
E. Results
In this part, we will report the experimental results on the artificial datasets.

1) Generation of Mixtures of Gaussians: In Figure 2, we present the qualitative results on the 3-Gaussians dataset. The projections of real data points and generated data points are indicated by red and blue dots, respectively. The contour plot shows that the output of the critic of WGAN-GP is quite smooth. In Figure 3, we compare MIX+GANs with different combinations of mixtures quantitatively using the aforementioned metrics. In each experiment, MIX+GAN successfully learns a 3-modal mixture, but it differs from the real distribution. Nevertheless, the generative distribution $P_g$ is closer to $P_r$ with larger mixtures.
Fig. 2: Projections of real data (red dots) and samples (blue dots) generated by "MIX+GAN" with different mixtures of models. "nGmD" indicates that there are n generators and m discriminators. Panels: (a) 1G1D, (b) 3G3D, (c) 5G5D, (d) 10G10D.
Fig. 3: Comparisons of MIX+GANs with different components for generating mixtures of 3 Gaussians. We evaluate the Fréchet distance during training and train the Judge and the independent critic after the training of WGAN-GP. For all the metrics, lower is better. Panels, each plotted against the iteration: (a) Fréchet distance, (b) $D(x_r) - D(x_g)$, (c) Judge accuracy, (d) Wasserstein distance. Curves: 1G1D, 3G1D, 1G3D, 3G3D, 5G1D, 1G5D, 5G5D, 10G1D, 1G10D, 10G10D.
There are at least two ways for the generator(s) to win the game. For one thing, Corollary 3.2 in [4] states that low-capacity discriminators are unable to detect a lack of diversity, thus the generator(s) can memorize a large quantity of training data to win the game. For another, since the generator(s) can learn to be injective with all the hidden dimensions being 1024, which is the same as the input dimension and the output dimension, a mixture of 3 generators can learn 3 individual Gaussian components
information about the input and do not need to be larger.
Fig. 8: Quantitative results of varying the depth of the networks. "n layers" indicates that there are n layers in both the generator and the discriminator. For all the metrics, lower is better. Panels, each plotted against the iteration: (a) Fréchet distance, (b) $D(x_r) - D(x_g)$, (c) Judge accuracy, (d) Wasserstein distance. Curves: 3 layers, 5 layers, 8 layers, 10 layers, 15 layers.
Fig. 9: Quantitative results of varying the width of the networks. The numbers in the legends indicate the numbers of neurons in each hidden layer of the generator and the discriminator. For all the metrics, lower is better. Panels, each plotted against the iteration: (a) Fréchet distance, (b) $D(x_r) - D(x_g)$, (c) Judge accuracy, (d) Wasserstein distance.
2) Generation of datasets defined by neural networks: In this part, we define the real data distribution as the distribution of the output of a neural network $R$ that has the same input and architecture as the generator(s). The parameters of $R$ are randomly initialized with the Glorot uniform initializer [35] and fixed thereafter. There are again at least two ways in which the generator(s) can win the game: either memorize a large sample of the training data according to [4], or learn to have the same parameters as $R$ (of course, there are other sets of parameters that enable the generator(s) to generate $P_r$ due to the symmetry and complexity of neural networks). We consider the simplest situation where $R$ has only two layers, that is, it defines an affine transformation from $\mathbb{R}^{1024}$ to $\mathbb{R}^{1024}$. Therefore, this dataset is in fact a 1024-dimensional Gaussian distribution with randomly initialized mean and covariance. We plot the quantitative results in Figure 10. The results show that GAN training can have difficulty in learning an affine transformation.
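The claim that a two-layer $R$ yields a Gaussian dataset follows from the fact that an affine map $x = Wz + b$ of $z \sim \mathcal{N}(0, I)$ is distributed as $\mathcal{N}(b, W W^\top)$. The sketch below illustrates this with NumPy under explicit assumptions: the weight matrix uses the standard Glorot uniform limit $\sqrt{6/(\text{fan\_in} + \text{fan\_out})}$, the bias initialization is a hypothetical choice (the text only says the mean is randomly initialized), and the names `glorot_uniform` and `make_affine_dataset` are invented for this example.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Glorot uniform weights: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))


def make_affine_dataset(dim, rng):
    """Fix a random affine map x = W z + b and return a sampler with its exact mean and covariance."""
    W = glorot_uniform(dim, dim, rng)
    b = rng.uniform(-1.0, 1.0, size=dim)   # hypothetical bias initialization

    def sample(n):
        z = rng.standard_normal((n, dim))  # z ~ N(0, I)
        return z @ W.T + b

    return sample, b, W @ W.T
```

Because infinitely many samples can be drawn from `sample`, the true mean and covariance returned alongside it can be used directly as the reference statistics when evaluating a generator.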
Fig. 10: Results of the dataset generated by a network $R$. nG(l)mD indicates that there are n generators of l layers and m discriminators (of 5 layers by default). For all the metrics, lower is better. Panels, each plotted against the iteration: (a) Fréchet distance, (b) $D(x_r) - D(x_g)$, (c) Judge accuracy, (d) Wasserstein distance. Curves: 1G(2)1D, 1G(5)1D, 3G(2)3D, 3G(5)3D.
3) Varying the training set size: Now that we have access to infinite training data, we are able to study the influence that the training set size has on the quantitative metrics, and we show the results in Figure 11. In this set of experiments, we have a MIX+GAN consisting of 3 generators and 3 discriminators, each of which has 5 layers and 1024 neurons in every hidden layer. There are infinite samples in the test set. The only factor of variation is the training set size. The results show that GANs perform worse with smaller training sets. In contrast, the GAN trained on the largest training set performs among the best in terms of all the metrics. We can see that the distances to the training set are larger with smaller training set sizes. This phenomenon is counterintuitive, as one might expect that it is easier for the generator(s) to overfit smaller training sets. A possible explanation is that with fewer training data, the discriminator(s) can memorize the training set and reject fake samples more easily, providing less informative feedback to the generator. This explanation is consistent with the one in Section 4.2 of [11].
Fig. 11: Results on the 3 Gaussians dataset when varying the training set size. The numbers in the legends indicate the training set sizes. For all the metrics, lower is better. Panels, each plotted against the iteration: (a) $J_{acc}$ (training set), (b) $J_{acc}$ (test set), (c) W-dist (training set), (d) W-dist (test set), (e) $D(x_r) - D(x_g)$.
Note that the dimension of our data is 1024 = 32 × 32, which is the same as the spatial dimension of the CIFAR-10 dataset [36]. However, the CIFAR-10 dataset is more complex and consists of only 50,000 training images. Therefore, one can expect a performance boost when there are more training data for the training of GANs on CIFAR-10 or other small-scale image datasets.
V. Extended Experiments on CIFAR-10

In this section, we will show that the lessons we learned from artificial datasets apply to realistic datasets. Inspired by our empirical finding on the artificial datasets that increasing the mixture size can improve the performance of GANs, we modify MIX+GAN and train mixtures of GANs on the CIFAR-10 [36] dataset.
Fig. 12: Illustration of the generation and discrimination of fake samples when there are 5 generators and 5 discriminators distributed across 5 devices. The input noise to each generator is omitted.
Since the time and space complexity of MIX+GAN is $O(n_G n_D)$, it is computationally infeasible to train very large mixtures of GANs. Thus, we propose to use a modified version of MIX+GAN. We assume that $w_i$ is uniformly distributed (which is true for the data distribution of CIFAR-10 and many other datasets). The batch generated by each $G_i$ is split into $n_D$ parts uniformly and fed to different discriminators. Therefore, the actual batch size for each generator and each discriminator remains unchanged, but each discriminator can receive samples from different generators. Inspired by the finding that different generators can capture different modes in a distribution, we do not make each generator generate samples for 10 classes, but for $\max\{10/n_G, 1\}$ classes, which can ease the difficulty of generation for each generator. In this way, the generators can be viewed as a mixture of experts [37] and the discriminators can be viewed as an ensemble of discriminative models.

We use a model-parallelism setting where generators and discriminators are distributed across different devices. If we have $n$ GPU/TPU devices, then $G_i$ and $D_j$ are allocated to device $((i - 1) \bmod n) + 1$ and device $((j - 1) \bmod n) + 1$ respectively. In this way, there is no need to synchronize parameters across different devices, and load balance can be achieved if both $n_G$ and $n_D$ are divisible by $n$. Figure 12 illustrates the flow of the generation and discrimination of fake samples when there are 5 generators and 5 discriminators distributed across 5 devices.
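The routing and placement rules described above can be sketched in plain Python. This is an illustrative reading of the text, not the authors' code: batches are represented as lists, the splits are assumed to be contiguous and equal in size, and the function names `device_for`, `split_batch` and `route_fake_samples` are hypothetical.

```python
def device_for(index, n_devices):
    """Return the 1-based device for a 1-based model index: ((index - 1) mod n) + 1."""
    return (index - 1) % n_devices + 1


def split_batch(batch, n_discriminators):
    """Split one generator's batch into n_discriminators equal, contiguous parts (assumed layout)."""
    if len(batch) % n_discriminators != 0:
        raise ValueError("batch size must be divisible by the number of discriminators")
    part = len(batch) // n_discriminators
    return [batch[k * part:(k + 1) * part] for k in range(n_discriminators)]


def route_fake_samples(generator_batches, n_discriminators):
    """Collect, for each discriminator, its share of every generator's batch."""
    inputs = [[] for _ in range(n_discriminators)]
    for batch in generator_batches:
        for j, part in enumerate(split_batch(batch, n_discriminators)):
            inputs[j].extend(part)
    return inputs
```

With equal numbers of generators and discriminators, each discriminator ends up with as many fake samples as a single generator produces, which matches the statement that the effective batch size stays unchanged while every discriminator sees samples from every generator.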
BigGAN in Figure 14. To some extent, the 10 generators can learn different concepts automatically without label supervision, although they do not correspond to the 10 classes perfectly.
Fig. 14: CIFAR-10 samples generated by our unsupervised model with 10 Gs and 10 Ds. Samples in different columns are generated by different generators.
VI. Conclusions and Future Work

In this work, we explore different distance measures to investigate whether GAN training succeeds in learning the distribution. Our empirical results show that even when the distances between $P_g$ and $P_r$ are short, there exists a simple classifier with a model complexity similar to the discriminator that can distinguish $P_g$ and $P_r$ accurately. This suggests that the manifolds of $P_g$ and $P_r$ have little non-negligible overlap [3]. Empirically, we also find that even when an optimal set of generator parameters exists, GAN training fails to find it. Therefore, it remains an open question whether GANs should be replaced by non-adversarial generative models (e.g., [45]–[49]).

In our experiments on the synthetic datasets, increasing the size of the training set can improve the performance of GANs, even when it is already very large. On the other hand, a small training set can negatively affect GANs. Therefore, current datasets might not be large enough for GANs to learn the real data distribution, and might even result in overfitting.

Our experimental results show that training a mixture of GANs is more beneficial than simply increasing the complexity of standalone networks (that are sufficiently complex) for modeling multi-modal data. Devising different ways to combine models in the mixtures is an interesting topic. It is also promising to measure and promote the diversity of the ensemble [50], [51] of discriminators.

Finally, while current state-of-the-art GAN models such as BigGAN [11], CR-BigGAN [16] and LOGAN [15] use a number of TPU cores that is the same as the height or width of the images, we are not able to conduct such large-scale experiments. But we believe that with more computing power, a large mixture of GANs can be trained on datasets such as ImageNet 128 × 128 and improve on the current state of the art.
Appendix A
Proof of Proposition 1

Proposition 1. Let $J$ be a deterministic classifier for samples from two distributions $P_r$ and $P_g$ with equal prior probabilities. Let $\delta(P_r, P_g)$ be the total variation distance between $P_r$ and $P_g$, then

$$\delta(P_r, P_g) \geq 2\,\mathbb{E}[J_{acc}] - 1 \tag{16}$$
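Proof. Without loss of generality, assume that the label $y$ of a sample point $x$ equals 1 if $x$ is from $P_r$ and 0 otherwise. Let $J_{opt}$ be an optimal classifier with the highest expected accuracy. For all $x$ such that $p_r(x) + p_g(x) > 0$, we have $P(y = 1 \mid x) = \frac{p_r(x)}{p_r(x) + p_g(x)}$ and $P(y = 0 \mid x) = \frac{p_g(x)}{p_r(x) + p_g(x)}$. Then there exists a $J_{opt}$ that predicts $J(x) = 1$ if $p_r(x) \geq p_g(x)$ and $J(x) = 0$ if $p_r(x) < p_g(x)$. Therefore,

\begin{align}
\mathbb{E}[J_{acc}] &\leq \mathbb{E}[J^{opt}_{acc}] \tag{17} \\
&= \frac{1}{2} \int p_r(x) \, \mathbb{1}(J(x) = 1) \, dx \tag{18} \\
&\quad + \frac{1}{2} \int p_g(x) \, \mathbb{1}(J(x) = 0) \, dx \tag{19} \\
&= \frac{1}{2} \int_{p_r(x) \geq p_g(x)} p_r(x) \, dx \tag{20} \\
&\quad + \frac{1}{2} \int_{p_r(x) < p_g(x)} p_g(x) \, dx \tag{21} \\
&= \frac{1}{2} \int \max\{p_r(x), p_g(x)\} \, dx \tag{22} \\
&= \frac{1}{2} \int \max\{p_r(x), p_g(x)\} \, dx - \frac{1}{2} + \frac{1}{2} \tag{23} \\
&= \frac{1}{2} \int \max\{p_r(x), p_g(x)\} \, dx \tag{24} \\
&\quad - \frac{1}{4} \int \big( p_r(x) + p_g(x) \big) \, dx + \frac{1}{2} \tag{25} \\
&= \frac{1}{4} \int \big( \max\{p_r(x), p_g(x)\} \tag{26} \\
&\quad - \min\{p_r(x), p_g(x)\} \big) \, dx + \frac{1}{2} \tag{27} \\
&= \frac{1}{4} \int | p_r(x) - p_g(x) | \, dx + \frac{1}{2} \tag{28} \\
&= \frac{1}{2} \delta(P_r, P_g) + \frac{1}{2} \tag{29}
\end{align}

Here (25) uses $\int (p_r(x) + p_g(x)) \, dx = 2$, (26) and (27) use $p_r(x) + p_g(x) = \max\{p_r(x), p_g(x)\} + \min\{p_r(x), p_g(x)\}$, and (29) uses $\delta(P_r, P_g) = \frac{1}{2} \int | p_r(x) - p_g(x) | \, dx$.

It follows that $\delta(P_r, P_g) \geq 2\,\mathbb{E}[J_{acc}] - 1$.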
Appendix B
Network architectures

In Tables III and IV, we list the network architectures we use for CIFAR-10.

TABLE III: MHingeGAN Generator for 32 × 32 images

| Block or layer(s) | Output shape |
| --- | --- |
| z ∈ R^80 ∼ N(0, I) | 20 + 20 + 20 + 20 |
| Embed(y) ∈ R^128 | |
| Linear(20 + 128 → 4 × 4 × 256) | 4 × 4 × 256 |
| Resblock up 256 → 256 | 8 × 8 × 256 |
| Resblock up 256 → 256 | 16 × 16 × 256 |
| Resblock up 256 → 256 | 32 × 32 × 256 |
| BN, ReLU, conv 3 × 3, tanh | 32 × 32 × 3 |

TABLE IV: MHingeGAN Discriminator for 32 × 32 images

| Block or layer(s) | Output shape |
| --- | --- |
| input | 32 × 32 × 3 |
| Resblock down 3 → 256 | 16 × 16 × 256 |
| Resblock down 256 → 256 | 8 × 8 × 256 |
| Resblock 256 → 256 | 8 × 8 × 256 |
| Resblock 256 → 256 | 8 × 8 × 256 |
| ReLU, global sum pooling | 256 |
| Linear(256 → K + 1) | K + 1 |
Acknowledgement

This work was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC). Special thanks go to Dr. Kun Huang who helped revise an early draft.
References

[1] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[3] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” in NIPS 2016 Workshop on Adversarial Training. In review for ICLR, vol. 2016, 2017.
[4] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, “Generalization and equilibrium in generative adversarial nets (GANs),” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 224–232.
[5] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann, “Stabilizing training of generative adversarial networks through regularization,” in Advances in Neural Information Processing Systems, 2017, pp. 2018–2028.
[6] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
[7] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[9] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in International Conference on Learning Representations,