Recent progress in NeRF-based GANs has introduced a number of approaches for high-resolution and high-fidelity generative modeling of human heads with the ability to render novel views.
At the same time, one must solve an inverse problem to be able to re-render or modify an existing image or video. Despite the success of universal optimization-based methods for 2D GAN inversion, those methods, when applied to 3D GANs, may fail to produce 3D-consistent renderings. Fast encoder-based techniques, such as those developed for StyleGAN, may also be less appealing due to weaker identity preservation.
In our work, we introduce a real-time method that bridges the gap between the two approaches by directly utilizing the tri-plane representation introduced for the EG3D generative model. In particular, we build upon a feed-forward convolutional encoder for the latent code and extend it with a fully-convolutional predictor of tri-plane numerical offsets. As shown in our work, the renderings are similar in quality to those of optimization-based techniques and significantly outperform the baselines for novel view synthesis. As we empirically show, this is a consequence of directly operating in the tri-plane space, not in the GAN parameter space, while making use of an encoder-based trainable approach.
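For background, the sketch below illustrates how features are queried from a tri-plane representation in EG3D-style models: a 3D point is projected onto three axis-aligned feature planes, features are bilinearly sampled from each plane, and the results are aggregated. The shapes, axis ordering, and aggregation choice here are assumptions for illustration, not EG3D's exact implementation.

```python
# Minimal sketch of tri-plane feature sampling in EG3D-style models.
# Shapes, plane ordering, and summation-based aggregation are assumptions.
import torch
import torch.nn.functional as F


def sample_triplane_features(planes, points):
    """planes: (3, C, H, W) feature planes (assumed XY, XZ, YZ).
    points: (N, 3) 3D coordinates normalized to [-1, 1].
    Returns (N, C) aggregated features."""
    # Project each 3D point onto the three axis-aligned planes.
    projections = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]
    feats = 0
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                            # (1, N, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode="bilinear", align_corners=False)
        feats = feats + sampled.view(plane.shape[0], -1).t()   # (N, C)
    return feats


planes = torch.randn(3, 32, 256, 256)    # toy tri-planes
points = torch.rand(1024, 3) * 2 - 1      # toy query points in [-1, 1]
print(sample_triplane_features(planes, points).shape)  # torch.Size([1024, 32])
```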
Using TriPlaneNet, you can re-render in-the-wild videos from novel views. The framework is capable of representing tiny details of in-the-wild portrait imagery in 3D and supports complex facial expressions.
TriPlaneNet runs in real time on a single RTX 3090 GPU.
Inversion is performed in two phases.
In the first phase, an encoder is used to predict a pivotal latent code and obtain an initial reconstruction.
In the second phase, the initial reconstruction and its difference with the input image are processed by an auto-encoder to estimate tri-plane offsets. The offsets are numerically added to the tri-planes output by the EG3D generator. The final reconstruction is obtained by processing the refined tri-planes with the renderer block.
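A minimal PyTorch sketch of this two-phase pipeline is given below. The pretrained EG3D generator and renderer are replaced by a dummy stand-in, and all module names, channel counts, and resolutions (LatentEncoder, OffsetPredictor, 96 tri-plane channels, etc.) are illustrative assumptions rather than the actual TriPlaneNet implementation.

```python
# Sketch of the two-phase inversion: (1) encoder -> pivotal latent code ->
# initial reconstruction; (2) auto-encoder on (reconstruction, difference) ->
# tri-plane offsets added to the EG3D tri-planes -> refined rendering.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentEncoder(nn.Module):
    """Phase 1: feed-forward encoder predicting a pivotal latent code w."""

    def __init__(self, w_dim: int = 512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128, w_dim)

    def forward(self, img):
        return self.head(self.features(img))


class OffsetPredictor(nn.Module):
    """Phase 2: fully-convolutional auto-encoder predicting tri-plane offsets
    from the initial reconstruction and its difference with the input image."""

    def __init__(self, triplane_channels: int = 96, triplane_res: int = 256):
        super().__init__()
        self.triplane_res = triplane_res
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, triplane_channels, 3, padding=1),
        )

    def forward(self, recon, diff):
        x = torch.cat([recon, diff], dim=1)
        x = F.interpolate(x, size=self.triplane_res, mode="bilinear",
                          align_corners=False)
        return self.net(x)


class DummyEG3D(nn.Module):
    """Stand-in for the pretrained EG3D generator and neural renderer."""

    def __init__(self, w_dim: int = 512, triplane_channels: int = 96,
                 triplane_res: int = 256, img_res: int = 128):
        super().__init__()
        self.triplane_shape = (triplane_channels, triplane_res, triplane_res)
        self.to_planes = nn.Linear(w_dim, triplane_channels)
        self.to_rgb = nn.Conv2d(triplane_channels, 3, 1)
        self.img_res = img_res

    def synthesize_triplanes(self, w):
        c, h, wdt = self.triplane_shape
        return self.to_planes(w).view(-1, c, 1, 1).expand(-1, -1, h, wdt)

    def render(self, triplanes, cam):
        # A real renderer would volume-render the tri-planes from camera `cam`;
        # here we simply project the planes to RGB as a placeholder.
        feat = F.interpolate(triplanes, size=self.img_res, mode="bilinear",
                             align_corners=False)
        return self.to_rgb(feat)


def invert(img, cam, encoder, eg3d, offset_net):
    # Phase 1: pivotal latent code and initial reconstruction.
    w = encoder(img)
    triplanes = eg3d.synthesize_triplanes(w)
    recon = eg3d.render(triplanes, cam)
    recon_up = F.interpolate(recon, size=img.shape[-2:], mode="bilinear",
                             align_corners=False)
    # Phase 2: tri-plane offsets are added numerically to the EG3D tri-planes.
    offsets = offset_net(recon_up, img - recon_up)
    return eg3d.render(triplanes + offsets, cam)


if __name__ == "__main__":
    img = torch.randn(1, 3, 256, 256)   # placeholder input portrait
    cam = torch.randn(1, 25)            # placeholder camera parameters
    out = invert(img, cam, LatentEncoder(), DummyEG3D(), OffsetPredictor())
    print(out.shape)                    # torch.Size([1, 3, 128, 128])
```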
TriPlaneNet reconstructs faces in more detail, with higher fidelity for features such as hats, hair, and the background.
For novel view rendering, TriPlaneNet preserves identity and multi-view consistency better than other approaches.
The inference time is given for a single RTX 3090 GPU.
For more work on similar tasks, please check out:
PTI: Pivotal Tuning for Latent-based editing of Real Images introduces an optimization mechanism for solving the StyleGAN inversion task.
Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation presents an encoder-based approach to embed input images into the W+ space of StyleGAN.
@article{bhattarai2023triplanenet,
title={TriPlaneNet: An Encoder for EG3D Inversion},
author={Bhattarai, Ananta R. and Nie{\ss}ner, Matthias and Sevastopolsky, Artem},
journal={arXiv preprint arXiv:2303.13497},
year={2023}
}