Videos are best viewed on desktop.
3D GAN inversion aims to project a single image into the latent space of a 3D Generative Adversarial Network (GAN), thereby achieving 3D geometry reconstruction. While there exist encoders that achieve good results in 3D GAN inversion, they are predominantly built on EG3D, which specializes in synthesizing near-frontal views and is limited in synthesizing comprehensive 3D scenes from diverse viewpoints. In contrast to existing approaches, we propose a novel framework built on PanoHead, which excels in synthesizing images from a 360-degree perspective. To achieve realistic 3D modeling of the input image, we introduce a dual encoder system tailored for high-fidelity reconstruction and realistic generation from different viewpoints. Accompanying this, we propose a stitching framework on the triplane domain to combine the best predictions from both encoders. To achieve seamless stitching, both encoders must output consistent results despite being specialized for different tasks. For this reason, we carefully train these encoders using specialized losses, including an adversarial loss based on our novel occlusion-aware triplane discriminator. Experiments reveal that our approach surpasses existing encoder training methods both qualitatively and quantitatively.
We opt to train two separate encoders rather than a single one, each with a specific focus: one encoder is dedicated to capturing regions visible in the input, while the other specializes in occluded regions. This division allows each encoder to excel at its respective task. By employing our novel stitching pipeline, we effectively combine the strengths of both encoders, yielding superior results compared to a single encoder, which would struggle to handle both tasks with the same precision.
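To make the stitching idea concrete, here is a minimal PyTorch sketch of blending two triplane predictions with an occlusion-derived mask. It assumes triplanes are stored as (B, 96, 256, 256) tensors as in EG3D/PanoHead; `stitch_triplanes` and `vis_mask` are illustrative names, and the soft convex blend is one plausible realization of the stitching, not necessarily the exact operator used in the paper.

```python
import torch

def stitch_triplanes(tri_visible: torch.Tensor,
                     tri_occluded: torch.Tensor,
                     vis_mask: torch.Tensor) -> torch.Tensor:
    """Blend two triplane predictions on the triplane domain.

    tri_visible  : (B, 96, 256, 256) triplane from the encoder that
                   specializes in regions visible in the input view.
    tri_occluded : (B, 96, 256, 256) triplane from the encoder that
                   specializes in occluded regions.
    vis_mask     : (B, 1, 256, 256) soft mask in [0, 1], 1 where the
                   triplane location is supported by visible pixels.
    """
    # Convex combination: keep the high-fidelity prediction wherever
    # the input image provides evidence, and fall back to the
    # generative prediction elsewhere. Softening the mask (e.g., with
    # a Gaussian blur) helps avoid seams at the boundary.
    return vis_mask * tri_visible + (1.0 - vis_mask) * tri_occluded
```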
Our training methodology for the triplane discriminator involves generating real samples by sampling latent vectors from the Z+ space and producing in-domain triplanes with PanoHead; fake samples are generated from encoded images. Although the adversarial loss is effective in enhancing reconstructions, achieving high fidelity to the input remains challenging because the real samples originate from the generator. To address this, we propose an occlusion-aware discriminator trained exclusively on features of occluded pixels, which ensures that visible regions, such as frontal faces, do not influence discriminator training. Additionally, a masking mechanism for the synthesized triplanes alleviates the distribution mismatch between encoded and synthesized samples.
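The step below sketches, in PyTorch, how such an occlusion-aware discriminator could be driven. `G.synthesize_triplane`, the single-Gaussian approximation of Z+ sampling, and the way `occ_mask` is shared between real and fake samples are all assumptions made for illustration, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, encoder, images, occ_mask):
    """One sketched training step for an occlusion-aware triplane discriminator.

    D        : triplane discriminator (hypothetical module)
    G        : pretrained PanoHead generator; G.synthesize_triplane(z)
               is an assumed helper mapping latents to triplanes
    encoder  : inversion encoder producing triplanes from images
    images   : (B, 3, H, W) input images
    occ_mask : (B, 1, 256, 256) mask, 1 on occluded triplane regions
    """
    # Real samples: in-domain triplanes synthesized from random latents
    # (a single Gaussian latent stands in for Z+ sampling here).
    z = torch.randn(images.size(0), 512, device=images.device)
    with torch.no_grad():
        tri_real = G.synthesize_triplane(z)

    # Fake samples: triplanes predicted from encoded images.
    tri_fake = encoder(images).detach()

    # Restrict both to occluded regions so visible content (e.g.,
    # frontal faces) never drives the discriminator; masking the
    # synthesized triplanes the same way reduces the distribution
    # mismatch between encoded and sampled triplanes.
    logits_real = D(tri_real * occ_mask)
    logits_fake = D(tri_fake * occ_mask)

    # Standard non-saturating GAN loss for the discriminator.
    loss = F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean()
    return loss
```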
We recognize that, in certain cases, artifacts are visible at the middle of the back of the head and are more noticeable in the mesh rendering. Further research and improvements are required.
For GOAE, we trained the model on our dataset, and it produced realistic results for the input view. However, while multi-view consistency was achieved, realism across other views was lacking. Further tuning could improve the visual quality.
@misc{bilecen2024dualencoderganinversion,
      title={Dual Encoder GAN Inversion for High-Fidelity 3D Head Reconstruction from Single Images},
      author={Bahri Batuhan Bilecen and Ahmet Berke Gokmen and Aysegul Dundar},
      year={2024},
      eprint={2409.20530},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.20530},
}