ML PAPER: PIX2PIX — TL;DR

Sneha Ghantasala
8 min read · Dec 27, 2020

Research Paper: Image-to-Image Translation with Conditional Adversarial Networks

Paper published — 26th Nov 2018 — Berkeley AI Research (BAIR) Laboratory, UC Berkeley

Disclaimer: These are just notes, and a lot of the text is taken from the paper.

Earlier papers have focused on specific applications, and it has remained unclear how effective image-conditional GANs can be as a general-purpose solution for image-to-image translation.

Contributions

1. The primary contribution is to demonstrate that, on a wide variety of problems, conditional GANs produce reasonable results.

2. The second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices. Code is available at https://github.com/phillipi/pix2pix.

Related work

1. Structured losses for image modeling: Image-to-image translation problems are often formulated as per-pixel classification or regression.

2. Conditional GANs — They are not the first to apply GANs in the conditional setting. Prior and concurrent works are tailored for a specific application. Their framework differs in that nothing is application specific. This makes their setup considerably simpler than most others.

3. Their method also differs from the prior works in several architectural choices for the generator and discriminator.

Unlike past work, for their generator they use a “U-Net”-based architecture, and for their discriminator they use a convolutional “PatchGAN” classifier, which only penalizes structure at the scale of image patches. A similar PatchGAN architecture was previously proposed to capture local style statistics. Here they show that this approach is effective on a wider range of problems, and they investigate the effect of changing the patch size.

Loss Function

GANs are generative models that learn a mapping from random noise vector z to output image y, G : z → y. In contrast, conditional GANs learn a mapping from observed image x and random noise vector z, to y, G : {x, z} → y.

The generator G is trained to produce outputs that cannot be distinguished from “real” images by an adversarially trained discriminator, D, which is trained to do as well as possible at detecting the generator’s “fakes”.

The objective of a conditional GAN can be expressed as

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))],  (1)

where G tries to minimize this objective against an adversarial D that tries to maximize it, i.e. G* = arg min_G max_D L_cGAN(G, D).

To test the importance of conditioning the discriminator, they also compare to an unconditional variant in which the discriminator does not observe x:

L_GAN(G, D) = E_y[log D(y)] + E_{x,z}[log(1 − D(G(x, z)))].  (2)
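The difference between the two objectives is simply what the discriminator gets to see. As a rough PyTorch sketch (not taken from the paper's code), conditioning is commonly implemented by concatenating x and y along the channel dimension:

```python
import torch

# Hypothetical helpers: `D` is any discriminator network that takes a single
# image tensor. In the conditional setting, the input image x and the (real
# or generated) output y are concatenated along the channel dimension so the
# discriminator sees the pair; the unconditional variant drops x entirely.
def d_conditional(D, x, y):
    return D(torch.cat([x, y], dim=1))   # corresponds to D(x, y) in Eq. (1)

def d_unconditional(D, y):
    return D(y)                          # corresponds to D(y) in Eq. (2)
```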

Previous approaches have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance. The discriminator’s job remains unchanged, but the generator is tasked not only to fool the discriminator but also to be near the ground-truth output in an L2 sense. They also explore this option, using L1 distance rather than L2, as L1 encourages less blurring:

L_L1(G) = E_{x,y,z}[‖y − G(x, z)‖_1].  (3)

Their final objective is

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G).  (4)
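As a rough PyTorch sketch of how these terms might be computed for one training pair (x, y) (G, D, and a patch-level discriminator output are assumed; λ = 100 is the weight the paper reports using in its experiments):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # binary cross-entropy on raw logits
l1 = nn.L1Loss()
lambda_l1 = 100.0              # L1 weight; the paper uses λ = 100

def discriminator_loss(D, x, y, y_fake):
    # D is conditioned on x via channel-wise concatenation (see earlier sketch)
    real_logits = D(torch.cat([x, y], dim=1))
    fake_logits = D(torch.cat([x, y_fake.detach()], dim=1))
    loss_real = bce(real_logits, torch.ones_like(real_logits))
    loss_fake = bce(fake_logits, torch.zeros_like(fake_logits))
    return 0.5 * (loss_real + loss_fake)   # D's objective halved, as noted in the training section

def generator_loss(D, x, y, y_fake):
    fake_logits = D(torch.cat([x, y_fake], dim=1))
    adv = bce(fake_logits, torch.ones_like(fake_logits))  # non-saturating GAN term
    return adv + lambda_l1 * l1(y_fake, y)                # plus λ times the L1 reconstruction term
```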

Without z (noise), the net could still learn a mapping from x to y, but would produce deterministic outputs, and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise z as an input to the generator, in addition to x. In initial experiments, they did not find this strategy effective — the generator simply learned to ignore the noise — which is consistent with Mathieu et al.. Instead, for their final models, they provide noise only in the form of dropout, applied on several layers of their generator at both training and test time. Despite the dropout noise, they observe only minor stochasticity in the output of their nets. Designing conditional GANs that produce highly stochastic output, and thereby capture the full entropy of the conditional distributions they model, is an important question left open by the present work.
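A minimal sketch of what “noise only in the form of dropout” can look like inside a generator decoder block (PyTorch; the layer sizes and the 0.5 dropout rate are illustrative assumptions, not taken from the paper text):

```python
import torch.nn as nn

def decoder_block(in_ch, out_ch, use_dropout=False):
    # Upsampling block: when enabled, dropout is the only stochastic element,
    # and per the paper it stays active at test time as well.
    layers = [
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
    ]
    if use_dropout:
        layers.append(nn.Dropout(0.5))   # assumed rate; applied on several decoder layers
    layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)
```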

Network Architecture

A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid. In addition, for the problems they consider, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output. They design the generator architecture around these considerations.

Generator — U-Net

Many previous solutions to problems in this area have used an encoder-decoder network. In such a network, the input is passed through a series of layers that progressively downsample, until a bottleneck layer, at which point the process is reversed. Such a network requires that all information flow pass through all the layers, including the bottleneck. For many image translation problems, there is a great deal of low-level information shared between the input and output, and it would be desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges. To give the generator a means to circumvent the bottleneck for information like this, they add skip connections, following the general shape of a “U-Net”. Specifically, they add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.
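A toy PyTorch sketch of the skip-connection idea (only two encoder and two decoder layers; the real generator is much deeper and its channel counts differ):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)     # downsample
        self.enc2 = nn.Conv2d(64, 128, 4, stride=2, padding=1)   # "bottleneck"
        self.dec2 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        # decoder input = 64 upsampled channels + 64 channels skipped from enc1
        self.dec1 = nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        d2 = torch.relu(self.dec2(e2))
        # skip connection: concatenate encoder channels with the mirrored decoder layer
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return torch.tanh(d1)
```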

Discriminator — Markovian discriminator (PatchGAN)

It is well known that the L2 loss — and L1 — produces blurry results on image generation problems. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, they do not need an entirely new framework to enforce correctness at the low frequencies. L1 will already do.

This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness. In order to model high-frequencies, it is sufficient to restrict their attention to the structure in local image patches. Therefore, they design a discriminator architecture — which they term a PatchGAN — that only penalizes structure at the scale of patches. This discriminator tries to classify if each N×N patch in an image is real or fake. They run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D. They demonstrate that N can be much smaller than the full size of the image and still produce high quality results. This is advantageous because a smaller PatchGAN has fewer parameters, runs faster, and can be applied to arbitrarily large images.
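A stripped-down PyTorch sketch of a PatchGAN-style discriminator (the paper's 70×70 variant uses a deeper stack of stride-2 convolutions; this only shows the fully convolutional, one-logit-per-patch structure and the averaging of responses):

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=6):  # e.g. input and output images concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch location
        )

    def forward(self, xy):
        return self.net(xy)   # shape (B, 1, N, N): a grid of real/fake logits

# Averaging the per-patch responses gives the image-level output of D:
# score = PatchDiscriminator()(torch.cat([x, y], dim=1)).mean()
```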

Optimization and Training

Training phase:

1. To optimize their networks, they follow the standard approach: they alternate between one gradient descent step on D, then one step on G.

2. As suggested in the original GAN paper, rather than training G to minimize log(1 − D(x, G(x, z))), they instead train it to maximize log D(x, G(x, z)).

3. In addition, they divide the objective by 2 while optimizing D, which slows down the rate at which D learns relative to G.

4. They use minibatch SGD and apply the Adam solver, with a learning rate of 0.0002 and momentum parameters β1 = 0.5, β2 = 0.999 (a schematic training loop combining these steps is sketched after this list).
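Putting the items above together, a schematic training loop might look like this (PyTorch; `G`, `D`, the loss helpers from the earlier sketches, and a `loader` yielding (x, y) pairs are assumed):

```python
import torch

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

for x, y in loader:
    y_fake = G(x)

    # one gradient step on D (its objective is already halved inside the helper)
    opt_d.zero_grad()
    discriminator_loss(D, x, y, y_fake).backward()
    opt_d.step()

    # one gradient step on G: maximize log D(x, G(x, z)) via the
    # non-saturating loss, plus the weighted L1 term
    opt_g.zero_grad()
    generator_loss(D, x, y, y_fake).backward()
    opt_g.step()
```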

Testing phase:

At inference time, they run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that:

1. They apply dropout at test time.

2. They apply batch normalization using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed “instance normalization” and has been demonstrated to be effective at image generation tasks. They use batch sizes between 1 and 10 depending on the experiment (a sketch of this inference protocol follows the list).
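In code, this unusual protocol amounts to not switching the generator into eval mode (a PyTorch sketch, assuming `G` uses nn.Dropout and nn.BatchNorm2d as in the earlier sketches):

```python
import torch

@torch.no_grad()
def translate(G, x):
    # Deliberately keep train-mode behaviour at inference: dropout stays active
    # and batch norm uses the statistics of the current (test) batch, which
    # with batch size 1 behaves like instance normalization.
    G.train()
    return G(x)
```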

Data requirements and Speed

1. They note that decent results can often be obtained even on small datasets.

2. Their facade training set consists of just 400 images (see results in paper).

3. The day→night training set consists of only 91 unique webcams (see results in paper). On datasets of this size, training can be very fast: for example, the results shown in Figure 14 of the paper took less than two hours of training on a single Pascal Titan X GPU.

4. At test time, all models run in well under a second on this GPU.

Experiments to test generality of conditional GAN

• Semantic labels↔photo, trained on the Cityscapes dataset.

• Architectural labels→photo, trained on CMP Facades.

• Map↔aerial photo, trained on data scraped from Google Maps.

• BW→color photos.

• Edges→photo, trained on data from [65] and [60]; binary edges generated using the HED edge detector plus postprocessing.

• Sketch→photo: tests edges→photo models on human drawn sketches.

• Day→night.

• Thermal→color photos.

• Photo with missing pixels→inpainted photo, trained on Paris StreetView.

Evaluation metrics

Evaluating the quality of synthesized images is an open and difficult problem. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.

To more holistically evaluate the visual quality of their results, they employ two tactics.

  1. First, they run “real vs. fake” perceptual studies on Amazon Mechanical Turk (AMT). For graphics problems like colorization and photo generation, plausibility to a human observer is often the ultimate goal. Therefore, they evaluate their map generation, aerial photo generation, and image colorization with this approach. The colorization study confirms the hypothesis that L1 encourages average, grayish colors, whereas using a cGAN pushes the output distribution closer to the ground truth.
  2. Second, they measure whether or not their synthesized cityscapes are realistic enough that an off-the-shelf recognition system can recognize the objects in them. This metric is similar to the “inception score”, the object detection evaluation, and the “semantic interpretability” measures.
  3. “FCN-score”: While quantitative evaluation of generative models is known to be challenging, recent works have tried using pre-trained semantic classifiers to measure the discriminability of the generated stimuli as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized images correctly as well. To this end, they adopt the popular FCN-8s architecture for semantic segmentation and train it on the Cityscapes dataset. They then score synthesized photos by the classification accuracy against the labels these photos were synthesized from (a sketch of this score follows the list).
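A rough sketch of the FCN-score computation (only per-pixel accuracy is shown; the paper also reports per-class accuracy and class IoU, and `seg_net` stands in for an FCN-8s model trained on real Cityscapes images):

```python
import torch

@torch.no_grad()
def fcn_pixel_accuracy(seg_net, fake_photos, label_maps):
    # Segment the synthesized photos and compare against the label maps they
    # were generated from: realistic outputs should still be segmentable.
    pred = seg_net(fake_photos).argmax(dim=1)           # (B, H, W) predicted class ids
    return (pred == label_maps).float().mean().item()   # per-pixel accuracy
```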

Results

FCN scores suggest:

1. A patch size of 70×70 produces the best results.

2. The combined cGAN + L1 loss produces the best results.

3. The U-Net generator produces better results than an encoder-decoder architecture.

Conclusion

  1. Conditional GANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks. What about vision problems, like semantic segmentation, where the output is instead less complex than the input? To begin to test this, they train a cGAN (with and without the L1 loss) on Cityscapes photo→labels. The paper reports qualitative results and quantitative classification accuracies. Interestingly, cGANs trained without the L1 loss are able to solve this problem at a reasonable degree of accuracy. To their knowledge, this is the first demonstration of GANs successfully generating “labels”, which are nearly discrete, rather than “images”, with their continuous-valued variation. Although cGANs achieve some success, they are far from the best available method for solving this problem: simply using L1 regression gets better scores than using a cGAN, as shown in Table 6 of the paper. They argue that for vision problems the goal (i.e. predicting output close to the ground truth) may be less ambiguous than for graphics tasks, and reconstruction losses like L1 are mostly sufficient.
  2. The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.
