TL;DR: Real-time, generalizable novel view synthesis, enabled by 3D priors without an explicit 3D representation.



The role of 3D priors

LagerNVS shows that 3D priors are useful for Novel View Synthesis, but explicit 3D representations are not necessary. Our model uses two types of 3D priors: a 3D-inspired architecture and 3D pre-training.

First, we design the architecture around the insight that 3D reconstruction is difficult, traditionally taking minutes to hours, while 3D rendering is easy and can run in real time. To mirror this asymmetry, we choose an encoder-decoder architecture with a large, powerful, slower encoder and a lightweight, real-time decoder. Second, we initialize our encoder with VGGT, a 3D reconstructor that has been shown to provide a strong 3D prior.

Our 3D priors lead to quality improvements (see below) while permitting real-time rendering. Our work suggests that 3D objectives are a useful learning signal for spatial tasks, and that a 3D-centric perspective can guide the design of efficient architectures.



How it works

Images and cameras are fed to a large encoder, which inherits a 3D bias at initialization from pre-trained VGGT. Its output tokens serve as a 3D-aware latent geometry representation and are passed to a lightweight decoder, which renders novel views in real time. The encoder-decoder architecture mirrors the asymmetry between slow 3D reconstruction and fast rendering: reconstruction is difficult and requires a large network, while rendering is much simpler, so the decoder can be small.

An animated diagram briefly describing the method.
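The one-slow-encode, many-fast-renders pattern described above can be illustrated with a toy sketch. This is pure NumPy with invented names, shapes, and layer counts; the actual model is a transformer, and this only demonstrates the amortization, not the real computation:

```python
import numpy as np

# Toy illustration of the encoder-decoder split: a heavy encoder runs
# once per scene, a light decoder runs once per novel view.
# All names, shapes, and layer counts are invented for illustration.

rng = np.random.default_rng(0)

N_TOKENS, DIM = 1024, 256           # latent geometry tokens
HEAVY_LAYERS, LIGHT_LAYERS = 24, 2  # encoder is much deeper than decoder

def encode_scene(scene_features):
    """Slow path: produce 3D-aware latent geometry tokens (run once)."""
    tokens = scene_features
    for _ in range(HEAVY_LAYERS):   # stand-in for transformer blocks
        tokens = np.tanh(tokens @ (rng.standard_normal((DIM, DIM)) * 0.05))
    return tokens                   # (N_TOKENS, DIM)

def render_view(tokens, target_camera):
    """Fast path: decode one novel view from the cached tokens."""
    x = tokens + target_camera      # condition on the target pose
    for _ in range(LIGHT_LAYERS):
        x = np.tanh(x @ (rng.standard_normal((DIM, DIM)) * 0.05))
    return x.mean(axis=0)           # stand-in for a rendered image

scene = rng.standard_normal((N_TOKENS, DIM))
tokens = encode_scene(scene)        # cost paid once per scene
views = [render_view(tokens, rng.standard_normal(DIM)) for _ in range(5)]
```

Because `encode_scene` is amortized over all target views, its cost is paid once, while each call to `render_view` touches only a couple of light layers, which is what makes real-time rendering feasible.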


Comparisons

Compare renders from our method, LagerNVS (right), with other state-of-the-art methods (left). Compared to another implicit method (LVSM), our method has better 3D consistency. Compared to explicit methods (Depthsplat, AnySplat), our method better represents reflections and thin structures; the captions under the videos point out these areas. Explore the scenes below by clicking on the thumbnails.

LVSM (Implicit, 2-view, posed)

LVSM
LagerNVS (ours)

Observe the balconies

Depthsplat 4V (Explicit, 4-view, posed)

Depthsplat
LagerNVS (ours)

Depthsplat 6V (Explicit, 6-View, posed)

Depthsplat
LagerNVS (ours)

AnySplat (Explicit, unposed)

AnySplat
LagerNVS (ours)


Pre-training

3D pre-training with VGGT improves Novel View Synthesis quality through better geometry estimation, especially in foreground regions. Explore the scenes below by clicking on the thumbnails.

Inputs
No pre-training
2D pre-training (DinoV2)
Ours: 3D pre-training (VGGT)
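In spirit, 3D pre-training here amounts to initializing the encoder from a reconstructor checkpoint. A framework-agnostic sketch of such partial initialization, using plain dicts of NumPy arrays (the parameter names are invented, not VGGT's actual ones):

```python
import numpy as np

# Generic sketch of pre-training as weight initialization: copy every
# pretrained tensor whose name and shape match the new model, and leave
# the rest (e.g. freshly added decoder layers) at their random init.
# Dicts stand in for a framework state dict; all names are invented.

def init_from_pretrained(model_params, pretrained_params):
    loaded, skipped = [], []
    for name, value in model_params.items():
        src = pretrained_params.get(name)
        if src is not None and src.shape == value.shape:
            model_params[name] = src.copy()   # inherit the 3D prior
            loaded.append(name)
        else:
            skipped.append(name)              # stays randomly initialized
    return loaded, skipped

rng = np.random.default_rng(2)
model = {
    "encoder.block0.w": rng.standard_normal((8, 8)),  # matches checkpoint
    "decoder.head.w": rng.standard_normal((8, 3)),    # new, no match
}
pretrained = {"encoder.block0.w": np.ones((8, 8))}
loaded, skipped = init_from_pretrained(model, pretrained)
```

In a real framework this is the usual non-strict checkpoint loading: encoder weights inherit the 3D prior, while the new rendering decoder trains from scratch.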

Architecture

Our proposed 'highway' encoder-decoder architecture allows increasing network capacity without sacrificing rendering speed. Compared to a 'bottleneck' encoder-decoder, the improved information flow in our architecture yields better Novel View Synthesis quality.

Inputs
Decoder only
Enc-dec. bottleneck
Ours
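The contrast between the 'bottleneck' and 'highway' designs above can be sketched schematically. This toy NumPy example (invented names and shapes, not the actual layers) only illustrates where information is lost: a bottleneck decoder sees a pooled summary of the encoder tokens, while a highway decoder reads from all of them:

```python
import numpy as np

# Toy contrast between a 'bottleneck' and a 'highway' encoder-decoder.
# Shapes and names are illustrative only.

rng = np.random.default_rng(1)
N_TOKENS, DIM = 512, 128
enc_tokens = rng.standard_normal((N_TOKENS, DIM))  # encoder output

def bottleneck_decode(tokens, pose):
    """Decoder conditioned on a single pooled vector: cheap, but pooling
    discards per-token (per-region) geometry information."""
    summary = tokens.mean(axis=0)           # (DIM,) bottleneck
    return np.tanh(summary + pose)

def highway_decode(tokens, pose):
    """Decoder that reads over *all* encoder tokens, so fine-grained
    geometry can flow straight through to the rendered view."""
    scores = tokens @ pose / np.sqrt(DIM)   # (N_TOKENS,) similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax attention weights
    return np.tanh(weights @ tokens + pose) # pose-dependent readout

pose = rng.standard_normal(DIM)
bottleneck_out = bottleneck_decode(enc_tokens, pose)
highway_out = highway_decode(enc_tokens, pose)
```

The key difference: the bottleneck's pooled summary is the same for every target view, whereas the highway readout selects different tokens per pose, so decoder capacity can stay small while encoder capacity grows.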

Citation

@InProceedings{szymanowicz2026lagernvs,
  title={LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis},
  author={Stanislaw Szymanowicz and Minghao Chen and Jianyuan Wang and Christian Rupprecht and Andrea Vedaldi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}