Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data

Visual Geometry Group - University of Oxford
Method figure.

Viewset Diffusion trains image-conditioned 3D generative models from 2D data. A single model is capable of both ambiguity-aware 3D reconstruction and unconditional 3D generation.


We present Viewset Diffusion: a framework for training image-conditioned 3D generative models from 2D data.

Image-conditioned 3D generative models tackle the inherent ambiguity in single-view 3D reconstruction: given one image of an object, there is often more than one possible 3D volume that matches the input image. This problem is often treated as a deterministic task, where a single instance of a 3D volume is regressed or optimised given an input image. However, a single image never captures all sides of an object. Deterministic models are inherently limited to producing one possible reconstruction and therefore make mistakes in ambiguous settings. We treat 3D reconstruction as a conditional generation task to capture ambiguities.

Modelling the distribution of 3D shapes is challenging because 3D ground truth data is often not available. We propose to solve the issue of data availability by training a conditional diffusion model which jointly denoises a multi-view image set. The input image set can consist of any number of 'clean' (conditioning) and 'noisy' (target) images. We constrain the output of Viewset Diffusion models to a single 3D volume per image set, guaranteeing consistent geometry. Training is done through reconstruction losses on renderings, which allows training with as few as three images per object. Our architecture and training scheme allow the model to perform both 3D generation and generative, ambiguity-aware single-view reconstruction in a feed-forward manner.
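As a rough illustration of the mixed clean/noisy input set described above, the following sketch noises only the target views of one viewset. It assumes a DDPM-style variance-preserving noising with a single signal level `alpha_bar`; that schedule, the function name `make_viewset`, and the array layout are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_viewset(views, is_clean, alpha_bar, rng):
    """Noise the target views of one viewset; leave conditioning views untouched.

    views:     (N, H, W, 3) array of clean images of one object
    is_clean:  (N,) booleans, True for conditioning views
    alpha_bar: scalar in (0, 1], signal level for the noisy views
               (assumed DDPM-style schedule value, not from the paper)
    """
    views = np.asarray(views, dtype=np.float64)
    is_clean = np.asarray(is_clean, dtype=bool)
    # Per-view signal level: 1.0 (no noise) for conditioning views.
    level = np.where(is_clean, 1.0, alpha_bar)
    a = level[:, None, None, None]
    eps = rng.standard_normal(views.shape)
    # Variance-preserving noising; reduces to the identity where a == 1.
    noisy = np.sqrt(a) * views + np.sqrt(1.0 - a) * eps
    return noisy, level

rng = np.random.default_rng(0)
views = rng.random((3, 4, 4, 3))            # three views of one object
noisy, level = make_viewset(views, [True, False, False], 0.5, rng)
```

Here the first view acts as the clean conditioning image and the other two as noisy targets. Setting every entry of `is_clean` to `False` gives the all-noisy input that corresponds to unconditional generation.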

Results

Single-view 3D Reconstruction (non-cherry-picked)

Unconditional 3D Generation (non-cherry-picked)


In the presence of ambiguity, deterministic methods blur possible shapes (orange car’s back, Minecraft characters’ poses) and colours (black car’s back, occluded sides of Minecraft characters). Our method instead samples plausible 3D reconstructions.

Ambiguity figure.


Viewset Diffusion takes in any number of clean conditioning images and target images with Gaussian noise and jointly denoises the input image set. The denoising function is defined as reconstructing and rendering a 3D volume. When there is at least one clean conditioning view, Viewset Diffusion samples plausible 3D reconstructions. When all input views are noisy, Viewset Diffusion generates 3D volumes.
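The structure of this denoising function can be sketched as follows: one volume is built from the whole image set, every view is rendered from that same volume, and the renderings are compared with the clean 2D images. The functions `toy_reconstruct` and `toy_render` are hypothetical placeholders standing in for the learned reconstruction network and the differentiable renderer; they only make the sketch runnable and do not reflect the paper's architecture.

```python
import numpy as np

def denoise_viewset(noisy_views, levels, cameras, reconstruct, render):
    """Denoise a viewset by reconstructing ONE volume and rendering every view from it."""
    volume = reconstruct(noisy_views, levels, cameras)  # single 3D volume per viewset
    rendered = np.stack([render(volume, cam) for cam in cameras])
    return rendered, volume

def rendering_loss(rendered, clean_views):
    """Mean squared error between renderings and the clean 2D images."""
    return float(np.mean((rendered - clean_views) ** 2))

# Hypothetical stand-ins, NOT the paper's network or renderer.
def toy_reconstruct(views, levels, cameras):
    # Weight each view by its signal level, so cleaner views count more.
    w = levels / levels.sum()
    return np.tensordot(w, views, axes=1)  # (H, W, 3) placeholder "volume"

def toy_render(volume, cam):
    # Placeholder renderer: ignores the camera and returns the volume as an image.
    return volume

rng = np.random.default_rng(0)
clean = np.repeat(rng.random((1, 4, 4, 3)), 3, axis=0)   # three identical toy views
levels = np.array([1.0, 0.5, 0.5])                       # first view is clean
noisy = clean.copy()
noisy[1:] += 0.1 * rng.standard_normal(noisy[1:].shape)  # noise the target views only
cameras = [0, 1, 2]                                      # dummy camera identifiers
rendered, volume = denoise_viewset(noisy, levels, cameras, toy_reconstruct, toy_render)
loss = rendering_loss(rendered, clean)
```

Because every output view is rendered from the same volume, the denoised views are consistent with each other by construction, and the only supervision needed is the set of 2D images.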

Method figure.

Related works

Also check out a great related paper, Diffusion with Forward Models.


  author    = {Szymanowicz, Stanislaw and Rupprecht, Christian and Vedaldi, Andrea},
  title     = {Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data},
  journal   = {International Conference on Computer Vision},
  year      = {2023},