ML PAPER: PROGRESSIVE GANs — TL;DR
Paper published 26 Feb 2018 by Nvidia.
Disclaimer: These are just notes, and a lot of the text is taken from the paper.
They describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, they add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing it to produce images of unprecedented quality.
- Total trainable params in generator: 23.1M, Total trainable params in discriminator: 23.1M
- Both networks consist mainly of replicated 3-layer blocks that are introduced one by one during the course of the training. The last Conv 1 × 1 layer of the generator corresponds to the toRGB block, and the first Conv 1 × 1 layer of the discriminator similarly corresponds to fromRGB. They start with 4 × 4 resolution and train the networks until they have shown the discriminator 800k real images in total. They then alternate between two phases: fade in the first 3-layer block during the next 800k images, stabilize the networks for 800k images, fade in the next 3-layer block during 800k images, etc.
- Their latent/noise vectors correspond to random points on a 512-dimensional hypersphere, and they represent training and generated images in [-1,1].
- Activation functions: leaky ReLU with leakiness 0.2 in all layers of both networks, except for the last layer that uses linear activation
- No batch normalization, layer normalization, or weight normalization in either network, but they perform pixelwise normalization of the feature vectors after each Conv 3×3 layer in the generator
- Initialization — all bias parameters to zero and all weights according to the normal distribution with unit variance. However, they scale the weights with a layer-specific constant at runtime.
- They inject the across-minibatch standard deviation as an additional feature map at 4 × 4 resolution toward the end of the discriminator
- The upsampling and downsampling operations correspond to 2 × 2 element replication and average pooling, respectively
- Optimizer: Adam (Kingma & Ba, 2015) with α = 0.001, β1 = 0, β2 = 0.99, and ε = 10^-8
- Do not use any learning rate decay or rampdown
- For visualizing generator output at any given point during the training, they use an exponential running average for the weights of the generator with decay 0.999.
- They use a minibatch size of 16 for resolutions 4 × 4 to 128 × 128 and then gradually decrease the size according to 256 × 256 → 14, 512 × 512 → 6, and 1024 × 1024 → 3 to avoid exceeding the available memory budget
- Loss function: WGAN-GP loss, but unlike Gulrajani et al. (2017), they alternate between optimizing the generator and discriminator on a per-minibatch basis, i.e., they set n_critic = 1
- Additionally, they introduce a fourth term into the discriminator loss with an extremely small weight to keep the discriminator output from drifting too far away from zero. To be precise, they set $L' = L + \epsilon_{\text{drift}} \, \mathbb{E}_{x \in P_r}[D(x)^2]$, where $\epsilon_{\text{drift}} = 0.001$.
- Their contributions are largely independent of the choice of loss function: the same network can also be trained with LSGAN loss instead of WGAN-GP loss
- To meaningfully demonstrate their results at high output resolutions, they need a sufficiently varied high-quality dataset. However, virtually all publicly available datasets previously used in GAN literature are limited to relatively low resolutions ranging from 32*32 to 480*480. To this end, they created a high-quality version of the CELEBA dataset consisting of 30000 of the images at 1024 × 1024 resolution.
- Trained the network on 8 Tesla V100 GPUs for 4 days, after which they no longer observed qualitative differences between the results of consecutive training iterations. Their implementation used an adaptive minibatch size depending on the current output resolution so that the available memory budget was optimally utilized.
- In order to demonstrate that their contributions are largely orthogonal to the choice of a loss function, they also trained the same network using LSGAN loss instead of WGAN-GP loss.
- The best inception scores for CIFAR10 (10 categories of 32 × 32 RGB images) they are aware of are 7.90 for unsupervised and 8.87 for label conditioned setups (Grinblat et al., 2017). The large difference between the two numbers is primarily caused by “ghosts” that necessarily appear between classes in the unsupervised setting, while label conditioning can remove many such transitions. When all of their contributions are enabled, they get 8.80 in the unsupervised setting.
- They find that LSGAN is generally a less stable loss function than WGAN-GP, and it also has a tendency to lose some of the variation towards the end of long runs. Thus they prefer WGAN-GP, but have also produced high-resolution images by building on top of LSGAN.
- They need one additional hack with LSGAN that prevents the training from spiraling out of control when the dataset is too easy for the discriminator, and the discriminator gradients are at risk of becoming meaningless as a result. They adaptively increase the magnitude of multiplicative Gaussian noise in the discriminator as a function of the discriminator's output. The noise is applied to the input of each Conv 3 × 3 and Conv 4 × 4 layer. There is a long history of adding noise to the discriminator, and it is generally detrimental for the image quality (Arjovsky et al., 2017). Ideally one would never have to do that, which according to their tests is the case for WGAN-GP (Gulrajani et al., 2017). The magnitude of the noise is determined as $0.2 \cdot \max(0, \hat{d}_t - 0.5)^2$, where $\hat{d}_t = 0.1 d + 0.9 \hat{d}_{t-1}$ is an exponential moving average of the discriminator output $d$. The motivation behind this hack is that LSGAN is seriously unstable when $d$ approaches (or exceeds) 1.0.
- Creating the CELEBA-HQ dataset: they start with a JPEG image from the CelebA in-the-wild dataset. They improve the visual quality through JPEG artifact removal using a convolutional autoencoder trained to remove JPEG artifacts in natural images, and increase the resolution using an adversarially-trained 4x super-resolution network. They then extend the image through mirror padding and Gaussian filtering to produce a visually pleasing depth-of-field effect. Finally, they use the facial landmark locations to select an appropriate crop region and perform high-quality resampling to obtain the final image at 1024 × 1024 resolution.
- They perform the above processing for all 202599 images in the dataset, analyze the resulting 1024 × 1024 images further to estimate the final image quality, sort the images accordingly, and discard all but the best 30000 images. They use a frequency-based quality metric that favors images whose power spectrum contains a broad range of frequencies and is approximately radially symmetric. This penalizes blurry images as well as images that have conspicuous directional features due to, e.g., visible halftoning patterns. They selected the cutoff point of 30000 images as a practical sweet spot between variation and image quality, because it appeared to yield the best results.
- CIFAR10 results: no data augmentation used
- Datasets trained/tested on: CELEBA-HQ, LSUN, CIFAR10, MNIST-1K
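The fade-in/stabilize schedule from the notes above (800k real images at 4 × 4, then alternating 800k-image fade-in and stabilization phases) can be sketched as a small function. This is only an illustrative sketch: the function name `progressive_schedule`, its parameters, and the linear ramp of the fade-in weight `alpha` are assumptions made here, not code from the paper's release.

```python
def progressive_schedule(images_shown, phase_images=800_000,
                         start_res=4, max_res=1024):
    """Return (resolution, alpha) after `images_shown` real images.

    alpha is the fade-in weight of the newest block (hypothetical name):
    0.0 means the block has just been added, 1.0 means fully faded in.
    """
    if images_shown < phase_images:
        # Initial training phase at the starting resolution.
        return start_res, 1.0

    # After the initial phase, every resolution gets a fade-in phase
    # followed by a stabilization phase of equal length.
    remaining = images_shown - phase_images
    step, within = divmod(remaining, 2 * phase_images)
    resolution = start_res * 2 ** (step + 1)
    if resolution > max_res:
        # Training has passed the final resolution; nothing left to fade in.
        return max_res, 1.0

    # Assumed linear ramp during fade-in, then held at 1.0 while stabilizing.
    alpha = min(1.0, within / phase_images)
    return resolution, alpha


if __name__ == "__main__":
    for n in (0, 800_000, 1_200_000, 1_600_000, 2_400_000):
        print(n, progressive_schedule(n))
```

Under these assumptions, the 8 × 8 block is halfway faded in after 1.2M images and the 16 × 16 block starts fading in at 2.4M images.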
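The pixelwise normalization mentioned in the notes rescales the feature vector at each pixel to unit mean square. A minimal pure-Python sketch for a single pixel follows. The function name `pixel_norm` is made up for illustration, and the formula with ε = 10^-8 reflects my reading of the paper rather than anything stated in these notes. A real implementation would apply this across a whole feature map with a tensor library.

```python
import math


def pixel_norm(features, eps=1e-8):
    """Normalize one pixel's feature vector to (approximately) unit mean square.

    `features` holds the values of all feature maps at a single pixel.
    eps guards against division by zero (value assumed from the paper).
    """
    mean_sq = sum(f * f for f in features) / len(features)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [f * scale for f in features]


if __name__ == "__main__":
    out = pixel_norm([3.0, 4.0])
    print(out, sum(v * v for v in out) / len(out))
```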
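The drift term added to the discriminator loss is just a small multiple of the mean squared discriminator output on real samples. The sketch below computes only that extra term. `drift_penalty` and `real_scores` are hypothetical names, and the WGAN-GP part of the loss is assumed to be computed elsewhere.

```python
def drift_penalty(real_scores, eps_drift=0.001):
    """Return eps_drift * mean(D(x)^2) over a batch of real samples.

    `real_scores` is a list of discriminator outputs D(x) for real images.
    """
    mean_sq = sum(s * s for s in real_scores) / len(real_scores)
    return eps_drift * mean_sq


if __name__ == "__main__":
    # The term stays tiny unless the outputs drift far from zero.
    print(drift_penalty([0.5, -0.5]))
    print(drift_penalty([50.0, -50.0]))
```

Because the weight is so small, the term barely affects the loss when the outputs are near zero and only matters once they grow large.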
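The adaptive noise magnitude used in the LSGAN hack can be written as a two-line update. This sketch follows the formula in the notes. The function name `update_noise_strength` and the starting value of 0.0 for the moving average in the usage loop are assumptions.

```python
def update_noise_strength(d_hat_prev, d_output):
    """Update the moving average of the discriminator output and
    return (new average, noise magnitude)."""
    # Exponential moving average: d_hat_t = 0.1 * d + 0.9 * d_hat_{t-1}
    d_hat = 0.1 * d_output + 0.9 * d_hat_prev
    # Noise is only switched on once the average exceeds 0.5.
    strength = 0.2 * max(0.0, d_hat - 0.5) ** 2
    return d_hat, strength


if __name__ == "__main__":
    d_hat = 0.0  # assumed initial value
    for step in range(30):
        # Pretend the discriminator keeps outputting 1.0 on every minibatch.
        d_hat, strength = update_noise_strength(d_hat, 1.0)
    print(d_hat, strength)
```

With a constant discriminator output of 1.0, the average climbs toward 1.0 and the magnitude approaches 0.2 · 0.5² = 0.05.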