ML PAPER: CYCLEGAN — TL;DR

Sneha Ghantasala
16 min read · Dec 29, 2020

Research Paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Paper published at ICCV 2017 (arXiv version last revised 24 August 2020) — Berkeley AI Research (BAIR) Laboratory, UC Berkeley

Website: https://junyanz.github.io/CycleGAN/

Pytorch implementation: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

Torch implementation: https://github.com/junyanz/CycleGAN

Disclaimer: These are just notes, and a lot of the text is taken from the paper.

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. They present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Their goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss.

Because this mapping is highly under-constrained, they couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of their approach.

In this paper, they present a method that captures the special characteristics of one image collection and figures out how these characteristics could be translated into another image collection, all in the absence of any paired training examples.

This problem can be more broadly described as image-to-image translation, converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph.

Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs {x_i, y_i}_{i=1}^{N} are available. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse), the desired output is not even well-defined.

They therefore seek an algorithm that can learn to translate between domains without paired input-output examples. They assume there is some underlying relationship between the domains — for example, that they are two different renderings of the same underlying scene — and seek to learn that relationship. Although they lack supervision in the form of paired examples, they can exploit supervision at the level of sets: they are given one set of images in domain X and a different set in domain Y. Paired training data consists of training examples {x_i, y_i}_{i=1}^{N}, where the correspondence between x_i and y_i exists [22]; they instead consider unpaired training data, consisting of a source set {x_i}_{i=1}^{N} (x_i ∈ X) and a target set {y_j}_{j=1}^{M} (y_j ∈ Y), with no information provided as to which x_i matches which y_j.

They may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution p_data(y) (in general, this requires G to be stochastic). The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y.

However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way — there are infinitely many mappings G that will induce the same distribution over yˆ. Moreover, in practice, they have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress.

These issues call for adding more structure to their objective. Therefore, they exploit the property that translation should be “cycle consistent”, in the sense that if they translate, e.g., a sentence from English to French, and then translate it back from French to English, they should arrive back at the original sentence. Mathematically, if they have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections. They apply this structural assumption by training both the mappings G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields their full objective for unpaired image-to-image translation. They apply their method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. They also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that their method outperforms these baselines. They provide both PyTorch and Torch implementations. Check out more results at their website.

The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. They adopt an adversarial loss to learn the mapping such that the translated images cannot be distinguished from images in the target domain.

Image-to-Image Translation

The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies, who employ a non-parametric texture model on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs. Their approach builds on the “pix2pix” framework of Isola et al., which uses a conditional generative adversarial network to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches or from attribute and semantic layouts. However, unlike the above prior work, they learn the mapping without paired training examples.

Unpaired Image-to-Image Translation

Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y. Rosales et al. propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN and cross-modal scene networks use a weight-sharing strategy to learn a common representation across domains. Concurrent to their method, Liu et al. extend the above framework with a combination of variational autoencoders and generative adversarial networks. Another line of concurrent work encourages the input and output to share specific “content” features even though they may differ in “style”. These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space, image pixel space, and image feature space.

Unlike the above approaches, their formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do they assume that the input and output have to lie in the same low-dimensional embedding space. This makes their method a general-purpose solution for many vision and graphics tasks. They directly compare against several prior and contemporary approaches in Section 5.1 of the paper.

Cycle Consistency

The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators (including, humorously, by Mark Twain), as well as by machines. More recently, higher-order cycle consistency has been used in structure from motion, 3D shape matching, cosegmentation, dense semantic alignment, and depth estimation. Of these, Zhou et al. and Godard et al. are most similar to their work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, they introduce a similar loss to push G and F to be consistent with each other. Concurrent with their work, in these same proceedings, Yi et al. independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation.

Neural Style Transfer (one of the papers I read) is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Their primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, their method can also be applied to other tasks, such as painting→photo, object transfiguration, etc., where single-sample transfer methods do not perform well. They compare these two methods in Section 5.2 of the paper.

Their goal is to learn mapping functions between two domains X and Y given training samples {x_i}_{i=1}^{N} where x_i ∈ X and {y_j}_{j=1}^{M} where y_j ∈ Y. They denote the data distributions as x ∼ p_data(x) and y ∼ p_data(y). Their model includes two mappings G : X → Y and F : Y → X. In addition, they introduce two adversarial discriminators D_X and D_Y, where D_X aims to distinguish between images {x} and translated images {F(y)}; in the same way, D_Y aims to discriminate between {y} and {G(x)}. Their objective contains two types of terms: adversarial losses for matching the distribution of generated images to the data distribution in the target domain, and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.

They apply adversarial losses to both mapping functions. For the mapping function G : X → Y and its discriminator D_Y, they express the objective as:

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))],   (1)

where G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y. G aims to minimize this objective against an adversary D that tries to maximize it, i.e., min_G max_{D_Y} L_GAN(G, D_Y, X, Y).

They introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator D_X as well: i.e., min_F max_{D_X} L_GAN(F, D_X, Y, X).
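
As a rough, hedged sketch (not the authors' code), the two adversarial terms of Equation 1 can be written in PyTorch as below; `D` is assumed to output raw, pre-sigmoid patch scores, and the log-likelihood form shown here is what the paper later replaces with a least-squares loss for stability:

```python
import torch
from torch.nn import BCEWithLogitsLoss

bce = BCEWithLogitsLoss()  # expects raw (pre-sigmoid) discriminator scores

def discriminator_loss(D, real, fake):
    """Discriminator side of Eq. (1): maximize log D(real) + log(1 - D(fake)),
    written here as an equivalent binary cross-entropy minimization."""
    real_scores = D(real)
    fake_scores = D(fake.detach())  # detach so no gradient flows into the generator
    return (bce(real_scores, torch.ones_like(real_scores))
            + bce(fake_scores, torch.zeros_like(fake_scores)))

def generator_loss(D, fake):
    """Generator side: fool D (the usual non-saturating variant of log(1 - D(G(x))))."""
    fake_scores = D(fake)
    return bce(fake_scores, torch.ones_like(fake_scores))

# Both directions of the objective:
#   min_G max_{D_Y}: generator_loss(D_Y, G(x))  vs.  discriminator_loss(D_Y, y, G(x))
#   min_F max_{D_X}: generator_loss(D_X, F(y))  vs.  discriminator_loss(D_X, x, F(y))
```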

Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as the target domains Y and X respectively (strictly speaking, this requires G and F to be stochastic functions). However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input x_i to a desired output y_i. To further reduce the space of possible mapping functions, they argue that the learned mapping functions should be cycle-consistent: for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. They call this forward cycle consistency. Similarly, for each image y from domain Y, G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. They incentivize this behavior using a cycle consistency loss:

L_cyc(G, F) = E_{x∼p_data(x)}[‖F(G(x)) − x‖₁] + E_{y∼p_data(y)}[‖G(F(y)) − y‖₁].   (2)

In preliminary experiments, they also tried replacing the L1 norm in this loss with an adversarial loss between F(G(x)) and x, and between G(F(y)) and y, but did not observe improved performance. The behavior induced by the cycle consistency loss can be observed in practice: the reconstructed images F(G(x)) end up closely matching the input images x.
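
A minimal sketch of Equation 2 in PyTorch, assuming `G` and `F` are the two generator networks (illustrative, not the authors' code):

```python
import torch.nn.functional as nnf

def cycle_consistency_loss(G, F, x, y):
    """Eq. (2): L1 reconstruction error of both translation cycles."""
    forward_cycle = nnf.l1_loss(F(G(x)), x)    # x -> G(x) -> F(G(x)) should give back x
    backward_cycle = nnf.l1_loss(G(F(y)), y)   # y -> F(y) -> G(F(y)) should give back y
    return forward_cycle + backward_cycle
```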

Loss Function

Their full objective is:

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F),   (3)

where λ controls the relative importance of the two objectives. They aim to solve:

G*, F* = arg min_{G,F} max_{D_X, D_Y} L(G, F, D_X, D_Y).   (4)

Notice that their model can be viewed as training two “autoencoders”: they learn one autoencoder F ◦ G : X → X jointly with another G ◦ F : Y → Y. However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of “adversarial autoencoders” [34], which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In their case, the target distribution for the X → X autoencoder is that of the domain Y.

In Section 5.1.4, they compare their method against ablations of the full objective, including the adversarial loss L_GAN alone and the cycle consistency loss L_cyc alone, and empirically show that both objectives play critical roles in arriving at high-quality results. They also evaluate their method with only the cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.

Network Architecture

Generator architectures

  1. They adopt the architecture for their generative networks from Johnson et al., who have shown impressive results for neural style transfer and super-resolution. This network contains three convolutions, several residual blocks, two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB. They use 6 blocks for 128 × 128 images and 9 blocks for 256 × 256 and higher-resolution training images.
  2. Similar to Johnson et al., they use instance normalization.
  3. Let c7s1-k denote a 7 × 7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a 3 × 3 Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding was used to reduce artifacts. Rk denotes a residual block that contains two 3 × 3 convolutional layers with the same number of filters on both layers. uk denotes a 3 × 3 fractionally-strided Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2.
  4. The network with 6 residual blocks consists of:

c7s1-64, d128, d256, R256, R256, R256, R256, R256, R256, u128, u64, c7s1-3

  5. The network with 9 residual blocks consists of:

c7s1-64, d128, d256, R256, R256, R256, R256, R256, R256, R256, R256, R256, u128, u64, c7s1-3
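
The specification above maps fairly directly onto a PyTorch module. The sketch below is an illustrative reconstruction under the stated notation; the official ResnetGenerator differs in details (e.g. dropout and padding options), and the final Tanh on the output layer follows the public implementation rather than the appendix text:

```python
import torch.nn as nn

def c7s1(in_c, out_c, last=False):
    """c7s1-k: 7x7 conv, stride 1, reflection padding, InstanceNorm + ReLU.
    The final c7s1-3 layer uses Tanh and no normalization (as in the public code)."""
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_c, out_c, kernel_size=7, stride=1)]
    layers += [nn.Tanh()] if last else [nn.InstanceNorm2d(out_c), nn.ReLU(True)]
    return layers

def dk(in_c, out_c):
    """dk: 3x3 conv, stride 2 (downsampling), InstanceNorm + ReLU."""
    return [nn.Conv2d(in_c, out_c, 3, stride=2, padding=1),
            nn.InstanceNorm2d(out_c), nn.ReLU(True)]

def uk(in_c, out_c):
    """uk: 3x3 fractionally-strided conv, stride 1/2 (upsampling), InstanceNorm + ReLU."""
    return [nn.ConvTranspose2d(in_c, out_c, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(out_c), nn.ReLU(True)]

class Rk(nn.Module):
    """Rk: residual block with two 3x3 convs keeping k filters."""
    def __init__(self, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k))

    def forward(self, x):
        return x + self.block(x)

def resnet_generator(n_blocks=9):
    """c7s1-64, d128, d256, n_blocks x R256, u128, u64, c7s1-3."""
    layers = c7s1(3, 64) + dk(64, 128) + dk(128, 256)
    layers += [Rk(256) for _ in range(n_blocks)]
    layers += uk(256, 128) + uk(128, 64) + c7s1(64, 3, last=True)
    return nn.Sequential(*layers)
```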

Discriminator architectures

  1. For the discriminator networks they use 70 × 70 PatchGANs, which aim to classify whether 70 × 70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily sized images in a fully convolutional fashion.
  2. Let Ck denote a 4 × 4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. After the last layer, they apply a convolution to produce a 1-dimensional output.
  3. They do not use InstanceNorm for the first C64 layer.
  4. They use leaky ReLUs with a slope of 0.2.
  5. The discriminator architecture is: C64-C128-C256-C512
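
A hedged sketch of the 70 × 70 PatchGAN under the notation above. Note that in the public implementation the last C512 layer (and the final 1-channel convolution) uses stride 1, which is what makes the effective receptive field 70 × 70; that choice is followed here:

```python
import torch.nn as nn

def Ck(in_c, out_c, norm=True, stride=2):
    """Ck: 4x4 Convolution-InstanceNorm-LeakyReLU(0.2) layer with k filters."""
    layers = [nn.Conv2d(in_c, out_c, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_c))
    layers.append(nn.LeakyReLU(0.2, True))
    return layers

def patchgan_discriminator():
    """C64-C128-C256-C512, then a conv producing a 1-channel map of patch scores."""
    layers = (Ck(3, 64, norm=False)          # no InstanceNorm on the first C64 layer
              + Ck(64, 128)
              + Ck(128, 256)
              + Ck(256, 512, stride=1)       # stride 1 here gives a 70x70 receptive field
              + [nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1)])
    return nn.Sequential(*layers)
```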

Training Details

They apply two techniques from recent works to stabilize their model training procedure:

  1. First, for L_GAN (Equation 1), they replace the negative log likelihood objective with a least-squares loss [35]. This loss is more stable during training and generates higher-quality results. In particular, for a GAN loss L_GAN(G, D, X, Y), they train G to minimize E_{x∼p_data(x)}[(D(G(x)) − 1)²] and train D to minimize E_{y∼p_data(y)}[(D(y) − 1)²] + E_{x∼p_data(x)}[D(G(x))²].
  2. Second, to reduce model oscillation, they follow Shrivastava et al.’s strategy and update the discriminators using a history of generated images rather than the ones produced by the latest generators. They keep an image buffer that stores the 50 previously created images.
  3. For all the experiments, they set λ = 10 in Equation 3.
  4. They use the Adam solver with a batch size of 1.
  5. All networks were trained from scratch with a learning rate of 0.0002.
  6. In practice, they divide the objective by 2 while optimizing D, which slows down the rate at which D learns, relative to the rate of G.
  7. They keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs.
  8. Weights are initialized from a Gaussian distribution N (0, 0.02).
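
Putting the pieces together, below is a hedged sketch of one training iteration with the least-squares GAN loss, λ = 10, the halved discriminator objective, and a 50-image history buffer. The networks reuse the generator/discriminator sketches above; the buffer class and the Adam β₁ = 0.5 setting follow the public implementation and are assumptions here, not quotes from the paper:

```python
import random
import torch

class ImagePool:
    """Buffer of up to 50 previously generated images (Shrivastava et al.-style)."""
    def __init__(self, size=50):
        self.size, self.images = size, []

    def query(self, img):
        if len(self.images) < self.size:       # fill the buffer first
            self.images.append(img)
            return img
        if random.random() < 0.5:              # half the time, swap in a stored image
            idx = random.randrange(self.size)
            old, self.images[idx] = self.images[idx], img
            return old
        return img

G, F = resnet_generator(), resnet_generator()                  # X -> Y and Y -> X
D_X, D_Y = patchgan_discriminator(), patchgan_discriminator()
mse, l1, lam = torch.nn.MSELoss(), torch.nn.L1Loss(), 10.0     # least-squares GAN loss, lambda = 10
opt_G = torch.optim.Adam(list(G.parameters()) + list(F.parameters()), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(list(D_X.parameters()) + list(D_Y.parameters()), lr=2e-4, betas=(0.5, 0.999))
pool_X, pool_Y = ImagePool(), ImagePool()

def train_step(x, y):
    # Generators: least-squares adversarial terms + cycle consistency.
    fake_y, fake_x = G(x), F(y)
    pred_fy, pred_fx = D_Y(fake_y), D_X(fake_x)
    loss_G = (mse(pred_fy, torch.ones_like(pred_fy))
              + mse(pred_fx, torch.ones_like(pred_fx))
              + lam * (l1(F(fake_y), x) + l1(G(fake_x), y)))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    # Discriminators: real images vs. buffered fakes, objective divided by 2.
    fy, fx = pool_Y.query(fake_y.detach()), pool_X.query(fake_x.detach())
    ry, rx, py, px = D_Y(y), D_X(x), D_Y(fy), D_X(fx)
    loss_D = 0.5 * (mse(ry, torch.ones_like(ry)) + mse(py, torch.zeros_like(py))
                    + mse(rx, torch.ones_like(rx)) + mse(px, torch.zeros_like(px)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```

The linear learning-rate decay over the second 100 epochs would be handled outside this step, e.g. with a torch.optim.lr_scheduler.LambdaLR schedule.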

They first compare their approach against recent methods for unpaired image-to-image translation on paired datasets where ground-truth input-output pairs are available for evaluation. They then study the importance of both the adversarial loss and the cycle consistency loss and compare their full method against several variants. Finally, they demonstrate the generality of their algorithm on a wide range of applications where paired data does not exist. For brevity, they refer to their method as CycleGAN.

Using the same evaluation datasets and metrics as “pix2pix”, they compare their method against several baselines both qualitatively and quantitatively. The tasks include semantic labels↔photo on the Cityscapes dataset, and map↔aerial photo on data scraped from Google Maps. They also perform an ablation study on the full loss function.

Evaluation

AMT perceptual studies: On the map↔aerial photo task, they run “real vs fake” perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of their outputs. They follow the same perceptual study protocol from Isola et al., except they only gather data from 25 participants per algorithm tested. Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by their algorithm or a baseline), and asked to click on the image they thought was real. The first 10 trials of each session were practice, and feedback was given as to whether the participant’s response was correct or incorrect. The remaining 40 trials were used to assess the rate at which each algorithm fooled participants. Each session only tested a single algorithm, and participants were only allowed to complete a single session. The numbers they report here are not directly comparable to those in [22], as their ground truth images were processed slightly differently and the participant pool may be distributed differently; the numbers should only be used to compare their current method against the baselines (which were run under identical conditions), rather than against [22].

FCN score

Although perceptual studies may be the gold standard for assessing graphical realism, they also seek an automatic quantitative measure that does not require human experiments. For this, they adopt the “FCN score” from pix2pix and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using the standard semantic segmentation metrics described below. The intuition is that if they generate a photo from a label map of “car on the road”, then they have succeeded if the FCN applied to the generated photo detects “car on the road”.

Semantic segmentation metrics

To evaluate the performance of photo→labels, they use the standard metrics from the Cityscapes benchmark, including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU).
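
All three metrics can be read off a class confusion matrix. The sketch below (NumPy, illustrative; not the Cityscapes benchmark's official scripts) shows one common way to compute them from integer label maps:

```python
import numpy as np

def segmentation_scores(pred, gt, n_classes):
    """Per-pixel accuracy, per-class accuracy, and mean class IoU from
    integer label maps `pred` and `gt` of the same shape."""
    mask = (gt >= 0) & (gt < n_classes)                 # ignore out-of-range / void labels
    conf = np.bincount(n_classes * gt[mask].astype(int) + pred[mask].astype(int),
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    tp = np.diag(conf)
    with np.errstate(divide="ignore", invalid="ignore"):
        per_pixel_acc = tp.sum() / conf.sum()
        per_class_acc = np.nanmean(tp / conf.sum(axis=1))                        # mean recall per class
        class_iou = np.nanmean(tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp))  # TP / (TP + FP + FN)
    return per_pixel_acc, per_class_acc, class_iou
```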

Baselines

CoGAN

This method learns one GAN generator for domain X and one for domain Y , with tied weights on the first few layers for shared latent representations. Translation from X to Y can be achieved by finding a latent representation that generates image X and then rendering this latent representation into style Y .

SimGAN

Like their method, Shrivastava et al. use an adversarial loss to train a translation from X to Y. The regularization term ‖x − G(x)‖₁ is used to penalize making large changes at the pixel level.

Feature loss + GAN

They also test a variant of SimGAN where the L1 loss is computed over deep image features using a pretrained network (VGG-16 relu4_2), rather than over RGB pixel values. Computing distances in deep feature space like this is also sometimes referred to as using a “perceptual loss”.

BiGAN/ALI

Unconditional GANs learn a generator G : Z → X that maps a random noise vector z to an image x. BiGAN and ALI propose to also learn the inverse mapping function F : X → Z. Though they were originally designed for mapping a latent vector z to an image x, the authors implemented the same objective for mapping a source image x to a target image y.

pix2pix

They also compare against pix2pix, which is trained on paired data, to see how close they can get to this “upper bound” without using any paired data.

For a fair comparison, they implement all the baselines using the same architecture and details as their method, except for CoGAN. CoGAN builds on generators that produce images from a shared latent representation, which is incompatible with their image-to-image network. They use the public implementation of CoGAN instead.

In Table 4 and Table 5, they compare against ablations of their full loss. Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss. They therefore conclude that both terms are critical to their results. They also evaluate their method with the cycle loss in only one direction: GAN + forward cycle loss E_{x∼p_data(x)}[‖F(G(x)) − x‖₁], or GAN + backward cycle loss E_{y∼p_data(y)}[‖G(F(y)) − y‖₁] (Equation 2), and find that it often incurs training instability and causes mode collapse, especially for the direction of the mapping that was removed.

Photo generation from paintings

For painting→photo, they find that it is helpful to introduce an additional loss to encourage the mapping to preserve color composition between the input and output. In particular, they adopt the technique of Taigman et al. [49] and regularize the generator to be near an identity mapping when real samples of the target domain are provided as the input to the generator, i.e., L_identity(G, F) = E_{y∼p_data(y)}[‖G(y) − y‖₁] + E_{x∼p_data(x)}[‖F(x) − x‖₁].

To generate results, they passed images of width 512 pixels with the correct aspect ratio to the generator network as input. The weight for the identity mapping loss was 0.5λ, where λ was the weight for the cycle consistency loss; they set λ = 10. The identity mapping loss helps preserve the color of the input paintings. Without L_identity, the generators G and F are free to change the tint of input images even when there is no need to.
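
As a hedged sketch, reusing the names from the training snippet above (`G`, `F`, `l1`, `lam`, `loss_G` are assumed from there), the identity term would be added to the generator objective like this:

```python
# Identity loss (Taigman et al.-style regularizer), weighted by 0.5 * lambda:
# G should leave real Y images unchanged, and F should leave real X images unchanged.
loss_idt = 0.5 * lam * (l1(G(y), y) + l1(F(x), x))
loss_G = loss_G + loss_idt   # added before calling loss_G.backward()
```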

For example, when learning the mapping between Monet’s paintings and Flickr photographs, the generator often maps paintings of daytime to photographs taken during sunset, because such a mapping may be equally valid under the adversarial loss and cycle consistency loss.

Limitations and Discussion

Although their method can achieve compelling results in many cases, the results are far from uniformly positive. Figure 17 of the paper shows several typical failure cases.

  1. On translation tasks that involve color and texture changes, like many of those reported above, the method often succeeds.
  2. They have also explored tasks that require geometric changes, with little success. For example, on the task of dog→cat transfiguration, the learned translation degenerates into making minimal changes to the input. This failure might be caused by their generator architectures, which are tailored for good performance on appearance changes.
  3. Handling more varied and extreme transformations, especially geometric changes, is an important problem for future work.
  4. Some failure cases are caused by the distribution characteristics of the training datasets. For example, their method gets confused in the horse→zebra example because their model was trained on the wild horse and zebra synsets of ImageNet, which do not contain images of a person riding a horse or zebra.
  5. They also observe a lingering gap between the results achievable with paired training data and those achieved by their unpaired method. In some cases, this gap may be very hard — or even impossible — to close: for example, their method sometimes permutes the labels for tree and building in the output of the photos→labels task. Resolving this ambiguity may require some form of weak semantic supervision. Integrating weak or semi-supervised data may lead to substantially more powerful translators, still at a fraction of the annotation cost of the fully-supervised systems.

Nonetheless, in many cases completely unpaired data is plentifully available and should be made use of. This paper pushes the boundaries of what is possible in this “unsupervised” setting.
