In this paper, we propose a novel generator architecture that translates a selfie photo into an anime face image in an unsupervised manner. The goal is to produce anime face images that preserve the traits and shapes of the facial parts (e.g., eyes and nose) of the input source image while adopting the reference anime style. Recent face translation methods often fail to preserve the characteristics of facial parts in selfie photos when transferring them to anime images. To address this, the proposed method develops a new generative adversarial network (GAN) architecture composed of (1) a simple cycle content loss, (2) multi-scale assisted self-attention, and (3) adaptive feature fusion. The cycle content loss encourages the GAN to preserve a wide range of selfie image content, including hair shape, facial expression, and face shape. By comparing feature maps during image translation, the cycle content loss prevents the feature maps from being overly summarized by the encoder and decoder. This keeps the hair styles and facial expressions of the selfie from being oversimplified into generic anime features such as straight hair, a tiny nose, and a small mouth. In addition, multi-scale assisted self-attention complements existing attention with self-attention computed at multiple scales. This multi-scale assistance captures spatial relationships among feature maps that cannot be perceived by single-scale attention, allowing the model to gain additional facial characteristics useful for generating an anime image that reflects the hair style and facial expression of the selfie photo. Adaptive feature fusion helps the model determine which of the multiple self-attention maps produced by the multi-scale self-attention module are important. It lets the GAN learn which scales of self-attention are critical for image translation and select the optimal elements without being overwhelmed by excessive information.
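To make the three components concrete, the following is a minimal numpy sketch of the ideas described above. It is an illustration under our own simplifying assumptions, not the paper's implementation: the function names, the strided downsampling, and the given (rather than learned) fusion weights are all hypothetical.

```python
import numpy as np

def cycle_content_loss(feat_src, feat_cyc):
    """L1 distance between encoder feature maps of the source selfie and
    its cycle-reconstructed image (selfie -> anime -> selfie). Comparing
    intermediate features rather than only pixels discourages the
    encoder/decoder from over-summarizing hair shape and expression."""
    assert feat_src.shape == feat_cyc.shape
    return np.mean(np.abs(feat_src - feat_cyc))

def multiscale_self_attention(feat, scales=(1, 2, 4)):
    """Toy multi-scale self-attention: compute self-attention on the
    feature map downsampled at several scales, then upsample each
    attention output back to the original spatial resolution."""
    c, h, w = feat.shape
    outputs = []
    for s in scales:
        f = feat[:, ::s, ::s]                     # naive strided downsample
        tokens = f.reshape(c, -1).T               # (positions, channels)
        logits = tokens @ tokens.T / np.sqrt(c)   # pairwise similarity
        logits -= logits.max(axis=1, keepdims=True)
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True)   # softmax over positions
        out = (attn @ tokens).T.reshape(f.shape)  # re-weighted features
        out = out.repeat(s, axis=1).repeat(s, axis=2)[:, :h, :w]  # upsample
        outputs.append(out)
    return outputs

def adaptive_feature_fusion(outputs, weights):
    """Fuse the per-scale attention outputs with softmax-normalized
    weights (learned in practice; given here), letting the model
    emphasize whichever scale is critical for the translation."""
    w = np.exp(weights - np.max(weights))
    w /= w.sum()
    return sum(wi * oi for wi, oi in zip(w, outputs))
```

For example, an 8-channel 4x4 feature map yields three attention outputs (one per scale), and `adaptive_feature_fusion` combines them back into a single 8x4x4 map; in the actual model the fusion weights would be trainable parameters.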
Extensive comparative experiments on the selfie2anime and photo2anime datasets demonstrate the effectiveness of our method over other state-of-the-art methods.