GINA-3D: Learning to Generate Implicit Neural Assets in the Wild
Bokui Shen1∗
Xinchen Yan2
Charles R. Qi2
Mahyar Najibi2
Boyang Deng1,2†
Leonidas Guibas3
Yin Zhou2
Dragomir Anguelov2
1Stanford University, 2Waymo LLC, 3Google
Figure 1. Leveraging in-the-wild data for generative assets modeling embodies a scalable approach for simulation. GINA-3D uses real-world
driving data to perform various synthesis tasks for realistic 3D implicit neural assets. Left: Multi-sensor observations in the wild. Middle:
Asset reconstruction and conditional synthesis. Right: Scene composition with background neural fields [1].
Abstract
Modeling the 3D world from sensor data for simula-
tion is a scalable way of developing testing and valida-
tion environments for robotic learning problems such as
autonomous driving. However, manually creating or re-
creating real-world-like environments is difficult, expensive,
and not scalable. Recent generative model techniques have
shown promising progress to address such challenges by
learning 3D assets using only plentiful 2D images – but still
suffer limitations as they leverage either human-curated im-
age datasets or renderings from manually-created synthetic
3D environments. In this paper, we introduce GINA-3D, a
generative model that uses real-world driving data from cam-
era and LiDAR sensors to create realistic 3D implicit neural
assets of diverse vehicles and pedestrians. Compared to existing
image datasets, the real-world driving setting poses new
challenges due to occlusions, lighting variations and
long-tail distributions. GINA-3D tackles these challenges by
decoupling representation learning and generative modeling
into two stages with a learned tri-plane latent structure, in-
spired by recent advances in generative modeling of images.
To evaluate our approach, we construct a large-scale object-
centric dataset containing over 1.2M images of vehicles and
pedestrians from the Waymo Open Dataset, and a new set
of 80K images of long-tail instances such as construction
equipment, garbage trucks, and cable cars. We compare
our model with existing approaches and demonstrate that it
achieves state-of-the-art performance in quality and diver-
sity for both generated images and geometries.
∗Work done during an internship at Waymo. † Work done at Waymo.
1. Introduction
Learning to perceive, reason, and interact with the 3D
world has been a longstanding challenge in the computer
vision and robotics communities [2–9]. Mod-
ern robotic systems [10–16] deployed in the wild are often
equipped with multiple sensors (e.g. cameras, LiDARs, and
Radars) that perceive the 3D environments, followed by an
intelligent unit for reasoning and interacting with the com-
plex scene dynamics. End-to-end testing and validation of these
intelligent agents in real-world environments is difficult
and expensive, especially in safety-critical and resource-constrained
domains like autonomous driving.
On the other hand, the use of simulated data has proliferated
over the last few years to train and evaluate
intelligent agents under controlled settings [17–27] in a safe,
scalable and verifiable manner. Such developments were
fueled by rapid advances in computer graphics, including
rendering frameworks [28–30], physical simulation [31,32]
and large-scale open-sourced asset repositories [33–39]. A
key concern is to create realistic virtual worlds that align in
asset content, composition, and behavior with real distribu-
tions, so as to give the practitioner confidence that using such
simulations for development and verification can transfer to
performance in the real world [40–48]. However, manual
asset creation faces two major obstacles. First, manual cre-
ation of 3D assets requires dedicated efforts from engineers
and artists with 3D domain expertise, which is expensive
and difficult to scale [26]. Second, the real-world distribution
contains diverse examples (including interesting rare cases)
and is constantly evolving [49,50].
Recent developments in generative 3D modeling offer
new perspectives to tackle the aforementioned obstacles,
as they allow producing additional realistic but previously
unseen examples. A sub-class of these approaches, gener-
ative 3D-aware image synthesis [51, 52], holds significant
promise since it enables 3D modeling from partial observa-
tions (e.g. image projections of the 3D object). Moreover,
many real-world robotic applications already capture, an-
notate and update multi-sensor observations at scale. Such
data thus offer an accurate, diverse, task-relevant, and up-
to-date representation of the real-world distribution, which
the generative model can potentially capture. However, ex-
isting works use either human-curated image datasets with
clean observations [53–58] or renderings from synthetic 3D
environments [33,36]. Scaling generative 3D-aware image
synthesis models to the real world faces several challenges,
as many factors are entangled in the partial observations.
First, bridging in-the-wild images from a simple prior
without 3D structure makes learning difficult. Second,
unconstrained occlusions entangle the object of interest with its
surroundings in pixel space, which is hard to disentangle
in a purely unsupervised
in a purely unsupervised manner. Lastly, the above chal-
lenges are compounded by a lack of effort in constructing an
asset-centric benchmark for sensor data captured in the wild.
In this work, we introduce a 3D-aware generative trans-
former for implicit neural asset generation, named GINA-3D
(Generative Implicit Neural Assets). To tackle the real world
challenges, we propose a novel 3D-aware Encoder-Decoder
framework with a learned structured prior. Specifically, we
embed a tri-plane structure into the latent prior (or tri-plane
latents) of our generative model, where each entry is param-
eterized by a discrete representation from a learned code-
book [59,60]. The Encoder-Decoder framework is composed
of a transformation encoder and a decoder with neural ren-
dering components. To handle unconstrained occlusions, we
explicitly disentangle object pixels from their surroundings with
an occlusion-aware composition, using pseudo-labels from
an off-the-shelf segmentation model [61]. Finally, the learned
prior of tri-plane latents from a discrete codebook can be
used to train conditional latents sampling models [62]. The
same codebook can be readily applied to various conditional
synthesis tasks, including object scale, class, semantics, and
time-of-day.
To evaluate our model, we construct a large-scale object-
centric benchmark from multi-sensor driving data captured
in the wild. We first extract over 1.2M images of diverse
variations for vehicles and pedestrians from the Waymo Open
Dataset [14]. We then augment the benchmark with long-tail
instances from real-world driving scenes, including rare ob-
jects like construction equipment, cable cars, school buses
and garbage trucks. We demonstrate through extensive ex-
periments that GINA-3D outperforms the state-of-the-art
3D-aware generative models, measured by image quality,
geometry consistency, and geometry diversity. Moreover,
we showcase example applications of various conditional
synthesis tasks and shape editing results by leveraging the
learned 3D-aware codebook. The benchmark is publicly
available through waymo.com/open.
2. Related Work
We discuss the relevant work on generative 3D-aware
image synthesis, 3D shape modeling, and applications in
autonomous driving.
Generative 3D-aware image synthesis.
Learning gener-
ative 3D-aware representations from image collections has
been increasingly popular for the past decade [63–69]. Early
work explored image synthesis from disentangled factors
such as learned pose embedding [64,66,69] or compact scene
representations [65,67]. Representing the 3D-structure as a
compressed embedding, this line of work approached image
synthesis by upsampling from the embedding space with a
stack of 2D deconvolutional layers. Driven by progress
in differentiable rendering, there have been efforts [70–73]
to bake explicit 3D structures into generative architectures.
These efforts, however, are often confined to a coarse
3D discretization due to memory consumption. Moving beyond
explicit representations, more recent work leverages neural radiance
fields to learn implicit 3D-aware structures [51,52,74–82]
for image synthesis. Schwarz et al. [74] introduced the Gen-
erative Radiance Fields (GRAF) that disentangles the 3D
shape, appearance and camera pose of a single object with-
out occlusions. Built on top of GRAF, Niemeyer et al. [51]
proposed the GIRAFFE model, which handles scenes involving
multiple objects using a compositional 3D scene
structure. Notably, the query operation in the volumetric ren-
dering becomes computationally heavy at higher resolutions.
To tackle this, Chan et al. [52] introduced hybrid explicit-
implicit 3D representations with tri-plane features (EG3D),
which showcases image synthesis at higher resolutions. Con-
currently, [83] and [84] pioneer high-resolution unbounded
3D scene generation on ImageNet using tri-plane represen-
tations, where [84] uses a vector-quantized framework and
[83] uses a GAN framework. Our work is designed for ap-
plications in autonomous driving sensor simulation with an
emphasis on object-centric modeling.
Generative 3D shape modeling.
Generative modeling
of complete 3D shapes has also been extensively studied,
including efforts on synthesizing 3D voxel grids [85–93],
point clouds [94–96], surface meshes [97–103], shape primi-
tives [104,105], and implicit functions or hybrid representa-
tions [103,106–112] using various deep generative models.
Shen et al. [111] introduced a differentiable explicit sur-
face extraction method called Deep Marching Tetrahedra
(DMTet) that learns to reconstruct 3D surface meshes with
arbitrary topology directly. Built on top of the EG3D [52]
tri-plane features for image synthesis, Gao et al. [103] pro-
posed an extension that is capable of generating textured
surface meshes using DMTet for geometry generation and
tri-plane features for texture synthesis. These existing efforts
assume access to accurate multi-view silhouettes (often from
complete ground-truth 3D shapes), which do not reflect
the real challenges present in data captured in the wild.
Assets modeling in driving simulation.
Simulated en-
vironment modeling has drawn great attention in the au-
tonomous driving domain. In a nutshell, the problem can
be decomposed into asset creation (e.g., dynamic objects
and background), scene generation, and rendering. Early
work leverages artist-created objects and background assets
to build virtual driving environments [18, 20, 113] using
classic graphics rendering pipelines. While being able to
generate virtual scenes with varying configurations, these
methods produce scenes with limited diversity and a sig-
nificant reality gap. Many recent works explored different
aspects of data-driven simulation, including image synthe-
sis [114–117], assets modeling [47,48,118–121], scene gen-
eration [49,122,123], and scene rendering [1,124–126]. In
particular, Chen et al. [48] and Zakharov et al. [119] per-
formed explicit texture warping or implicit rendering from a
single-view observation for each vehicle object. Therefore,
their asset reconstruction quality is sensitive to occlusions
and bounded by the view angle from a single observation.
Building upon these efforts, more recent work including
Müller et al. [121] and Kundu et al. [125] approached ob-
ject completion with global or instance-specific latent codes,
representing each object asset under the Normalized Object
Coordinate Space (NOCS). In comparison, the latent codes
in our proposed model have a 3D tri-plane structure, which
offers several benefits in learning and applications. More
importantly, we can generate previously unseen 3D assets,
which is fundamentally different from object reconstruction.
3. Generative Implicit Neural Assets
We propose GINA-3D, a scalable framework to acquire
3D assets from in-the-wild data (Sec. 3.1). Core to our frame-
work is a novel 3D-aware Encoder-Decoder model with a
learned structured prior (Sec. 3.2). The learned structured prior
facilitates various downstream applications through a
per-application iterative latents sampling model (Sec. 3.3).
3.1. Background.
Given a collection of images containing 3D objects cap-
tured in the wild X = {x} (x is an image data sample),
3D-aware image synthesis [51,52,63–79,81] aims to learn a
distribution of 3D objects. The core idea is to represent each
3D object as a hidden variable h within a generative model
and further leverage a neural rendering module NR to synthe-
size a sample image at viewpoint v through x = NR(h, v).
To model the hidden 3D structure h, the formulation introduces
a low-dimensional latent space from which latent variables z
(typically Gaussian) are sampled, and connects h and z
through a generator h = fθ(z) parameterized by θ.
Pr(x, z|v) = Pr(x|z, v) · Pr(z)
(1)
The probabilistic formulation is shown in Fig. 2-a, and Eq. 1.
Here, Pr(x|z, v) is the conditional probability of the image
given the latent variables and viewpoint, where Pr(z) and
Pr(v) are the prior distributions. As the latent variable z
models the 3D objects, one can sample and extract assets for
downstream applications. The assets can be either injected
into neural representations of scenes [1,125], or transformed
into explicit 3D structures such as textured meshes for
traditional renderers [20] or geometry-aware compositing [48,124].
Figure 2. Probabilistic views: (a) controlled view; (b) real-world data.
The challenges in the wild.
While human-curated image
datasets [53–58] or syntheti-
cally generated images with
clean background [33,36,68,
103] fit into the formulation
in Eq. 1, real-world distribu-
tions have unconstrained occlusions due to complex object-
scene entanglement. For example, a moving vehicle can be
easily occluded by another object (e.g. traffic cones and cars)
in an urban driving environment, which further entangles the
object and the scene in pixel space. Moreover, environmental
lighting and object diversity lead to a more complex under-
lying distribution.
As illustrated in Fig. 2-b and Eq. 2, these challenges yield
a new probabilistic formulation in which the hidden structure
h, the surrounding scene S and the viewpoint v jointly contribute
to the occlusion mask m and the visible object pixels x
through x = NR(h, v) ⊙ m(S, v, h).
Pr(x, z|v, S) = Pr(x|z, v, S) · Pr(z)
(2)
Prior art such as GIRAFFE [51] tackles the challenges with
two assumptions: (1) the scene is composed of a limited
number of same-class foreground objects and a background
backdrop S; and (2) the real data distribution can be bridged
using a one-pass generator fθ(x; z, S, v) (θ is the learned
parametrization) conditioned on independently sampled object
latents z, scene background S and camera viewpoint v (e.g.
multivariate Gaussian distributions with diagonal covariance)
through adversarial training. Unfortunately, the first assump-
tion barely holds for in-the-wild images with unconstrained
foreground occlusions. As shown in Niemeyer et al. [51],
the second assumption can already introduce artifacts due to
disentanglement failures.
Our proposal.
We focus on interpreting the visible pixels
of the object of interest, as synthesizing objects and scene
jointly with a generative model is very challenging. We
leverage an auxiliary encoder Eϕ(x) that approximates the
posterior Pr(z|x) in training the generative model to recon-
struct the input. This way, we bypass the need to model com-
plex scene and occlusions explicitly, since paired inputs and
outputs are now available for supervising the auto-encoding-style
training. Specifically, given an image x and the corre-
sponding occlusion mask m, our objective is to reconstruct
the visible pixels of the object in the image through x̂ ⊙ m,
where the reconstruction is x̂ = NR(Gθ(z), v) and the
latent is z = Eϕ(x). In practice, we use an off-the-shelf
model to obtain a pseudo-labeled object mask as
the supervision through x ⊙ m. At inference time, we can
discard the auxiliary encoder Eϕ as our goal is to generate
assets from a learned latent distribution (tri-plane latents in
our case). To facilitate this, we leverage the vector-quantized
formulation [59,60] to learn a codebook K := {z_n}_{n=1}^{K} of
size K and the mapping from a continuous-valued vector
to a discrete codebook entry, where each entry follows a
K-way categorical distribution.
3.2. 3D Triplane Latents Learning
We explain in detail the Encoder-Decoder training framework
for learning the tri-plane latents z (Fig. 3-left). The framework
consists of a 2D-to-3D encoder Eϕ, learnable codebook
quantization K and a 3D-to-2D decoder Gθ.
Eϕ: 2D-to-3D Encoder.
We adopt Vision Transformer
(ViT) [127] as our image feature extractor that maps 16 × 16
non-overlapping image patches into image tokens of dimen-
sion Dimg. Since the goal is to infer the latent 3D-structure
from a 2D image observation, we associate each image token
with tokens in the tri-plane latents using cross-attention mod-
ules, which have previously shown strong performance in
cross-domain and 2D-to-3D information passing [128–131].
The cross-attention module uses a learnable tri-plane positional
encoding as the query, and the image patch tokens as keys
and values. The module produces tri-plane embeddings
e3D = Eϕ(x) ∈ R^{N_Z×N_Z×3×D_tok}, where D_tok = 32 and
N_Z = 16 denote the dimension of each 3D token and the
spatial resolution, respectively.
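To make the cross-attention design concrete, here is a minimal sketch in PyTorch; the module name, feature dimensions and single-attention-layer structure are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TriplaneCrossAttentionEncoder(nn.Module):
    """Sketch: learnable tri-plane positional queries attend to ViT patch tokens."""
    def __init__(self, d_img=768, d_tok=32, n_z=16, n_heads=4):
        super().__init__()
        self.n_z = n_z
        # One learnable positional embedding per tri-plane cell (3 planes of n_z x n_z).
        self.triplane_queries = nn.Parameter(torch.randn(3 * n_z * n_z, d_tok))
        # Project image patch tokens to the tri-plane token width before attention.
        self.kv_proj = nn.Linear(d_img, d_tok)
        self.cross_attn = nn.MultiheadAttention(d_tok, n_heads, batch_first=True)

    def forward(self, patch_tokens):  # patch_tokens: (B, N_patches, d_img) from a ViT
        B = patch_tokens.shape[0]
        q = self.triplane_queries.unsqueeze(0).expand(B, -1, -1)   # (B, 3*16*16, 32)
        kv = self.kv_proj(patch_tokens)                            # (B, N_patches, 32)
        e3d, _ = self.cross_attn(q, kv, kv)                        # (B, 3*16*16, 32)
        return e3d.view(B, 3, self.n_z, self.n_z, -1)              # continuous tri-plane embedding

# Example: a 256x256 input split into 16x16 patches yields 256 patch tokens.
enc = TriplaneCrossAttentionEncoder()
e3d = enc(torch.randn(2, 256, 768))
print(e3d.shape)  # torch.Size([2, 3, 16, 16, 32])
```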
K: Codebook Quantization for tri-plane latents.
Given the continuous tri-plane embedding e3D, we project it onto our
K-way categorical prior K through vector quantization. We apply a
quantization q(·) that maps each spatial code e3D_ijk ∈ R^{D_tok}
of the tri-plane embeddings onto its closest entry z_n in the
codebook, which gives the tri-plane latents z = q(e3D):

z_{ijk} := \operatorname{argmin}_{z_n,\, n \in \mathcal{K}} \lVert e^{3D}_{ijk} - z_n \rVert \;\in\; \mathbb{R}^{D_{tok}}    (3)
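A minimal sketch of the nearest-neighbor lookup in Eq. 3, assuming the codebook is stored as a flat (K, D_tok) tensor; the straight-through gradient trick in the last step and the function name are illustrative, not the paper's exact implementation.

```python
import torch

def quantize_triplane(e3d, codebook):
    """Map each continuous tri-plane code to its nearest codebook entry (Eq. 3).

    e3d:      (B, 3, N_Z, N_Z, D_tok) continuous tri-plane embeddings
    codebook: (K, D_tok) learnable embedding table
    Returns quantized latents z (same shape as e3d) and integer indices s.
    """
    B, P, H, W, D = e3d.shape
    flat = e3d.reshape(-1, D)                    # (B*3*N_Z*N_Z, D_tok)
    dists = torch.cdist(flat, codebook)          # pairwise L2 distances to all entries
    s = dists.argmin(dim=1)                      # index of the closest entry per code
    z = codebook[s].reshape(B, P, H, W, D)       # quantized latents
    # Straight-through estimator: copy gradients from z back to e3d during training.
    z = e3d + (z - e3d).detach()
    return z, s.reshape(B, P, H, W)

codebook = torch.randn(2048, 32, requires_grad=True)
z, s = quantize_triplane(torch.randn(1, 3, 16, 16, 32), codebook)
```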
Gθ: 3D-to-2D Decoder with neural rendering.
Our decoder takes the tri-plane latents z as input and outputs
high-dimensional feature maps h ∈ R^{N_H×N_H×3×D_H} used
for rendering, where N_H = 256 and D_H = 32 denote the
spatial resolution of the tri-plane feature maps and the feature
dimension, respectively. We adopt a token Transformer
followed by a Style-based generator [132] as our 3D decoder.
The token Transformer first produces high-dimensional intermediate
features ẑ ∈ R^{N_Z×N_Z×3×D_H} with an extra CLS
token using self-attention modules, which are then fed to
the Style-based generator for upsampling. We use 4 blocks
of weight-modulated convolutional layers, each guided by a
mapping network conditioned on the CLS token.
Given the feature maps, we use a shallow MLP that takes
a 3D point p and the hidden feature tri-linearly interpolated
at the query location h(p) as input, following [52,133,134].
It outputs a density value σ and a view-independent color
value c. We perform volume-rendering with the neural radi-
ance field formulation [135].
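As an illustrative approximation of this step (not the paper's exact implementation), one can bilinearly sample each of the three feature planes at a query point's projected 2D coordinates, sum the per-plane features, run a small MLP to get density and color, and alpha-composite along each ray:

```python
import torch
import torch.nn.functional as F

def sample_triplane(feat_planes, pts):
    """feat_planes: (3, C, N_H, N_H) for the XY, XZ, YZ planes; pts: (N, 3) in [-1, 1]."""
    coords = torch.stack([pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]])  # (3, N, 2)
    grid = coords.unsqueeze(2)                                              # (3, N, 1, 2)
    feats = F.grid_sample(feat_planes, grid, align_corners=True)            # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).t()                                 # (N, C), summed over planes

def volume_render(sigmas, colors, deltas):
    """Standard NeRF-style compositing along one ray: sigmas, deltas (S,), colors (S, 3)."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10]), dim=0)[:-1]
    weights = alphas * trans
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=0)
    acc_alpha = weights.sum()        # accumulated alpha, used by the mask losses below
    return rgb, acc_alpha

# Toy usage: 64 samples along one ray through random tri-plane feature maps.
planes = torch.randn(3, 32, 256, 256)
pts = torch.rand(64, 3) * 2 - 1
h = sample_triplane(planes, pts)                                            # (64, 32) features
mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
out = mlp(h)
sigma, color = F.softplus(out[:, 0]), torch.sigmoid(out[:, 1:])
rgb, alpha = volume_render(sigma, color, torch.full((64,), 1.0 / 64))
```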
Training.
Our framework builds upon the vector-quantized
formulations [59, 60, 62, 136–140] where we focus on to-
ken learning in the first stage. Specifically, we extend the
VQ-GAN training losses, where the encoder Eϕ, decoder
Gθ and codebook K are trained jointly with an image dis-
criminator D. As illustrated in Eq. 4, we encourage our
Encoder-Decoder model to reconstruct the real image x with
an L2 reconstruction loss, LPIPS [141], and an adversarial loss.

\mathcal{L}_{RGB} = \lVert (\hat{x} - x) \odot m \rVert_2 + \mathrm{LPIPS}(\hat{x} \odot m,\ x \odot m), \quad \mathcal{L}_{GAN} = \big[\log D(x) + \log(1 - D(\hat{x}))\big]    (4)
To regularize the codebook learning, we apply the latent
embedding supervision with a commitment term in Eq. 5,
where sg[·] denotes the stop-gradient operation.
\mathcal{L}_{VQ} = \lVert \mathrm{sg}[e^{3D}] - z \rVert_2^2 + \lambda_{commit} \lVert \mathrm{sg}[z] - e^{3D} \rVert_2^2    (5)
We additionally regularize the 3D density field in a weakly
supervised manner using the rendered aggregated density
(alpha value) x_α, encouraging object pixels to have an alpha
value of 1. To make the loss occlusion-aware, we further require
pixels lying in the non-object region to have zero density,
inspired by Müller et al. [121]. This is achieved by restricting
the non-object region to the sky and road classes of the
pseudo-labeled segmentations (denoted as m_{sky,road}).

\mathcal{L}_{\alpha} = \lVert (x_{\alpha} - 1) \odot m \rVert_2 + \lVert x_{\alpha} \odot m_{sky,road} \rVert_2    (6)
To summarize, we optimize the total objective \mathcal{L}^{*} in Eq. 7.

\mathcal{L}^{*} = \arg\min_{\phi,\, \theta,\, \mathcal{K}}\ \max_{D}\ \mathbb{E}_{x} \big[ \mathcal{L}_{VQ} + \mathcal{L}_{RGB} + \mathcal{L}_{\alpha} + \mathcal{L}_{GAN} \big]    (7)
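Putting Eqs. 4-6 together, a single training step could be sketched as follows; the LPIPS and adversarial terms are omitted for brevity, and the function name and commitment weight are illustrative rather than the paper's exact values.

```python
import torch

def stage1_losses(x, x_hat, x_alpha, m, m_sky_road, e3d, z, lam_commit=0.25):
    """Occlusion-aware reconstruction, VQ, and alpha regularization (Eqs. 4-6, sketch).

    x, x_hat:    (B, 3, H, W) real and rendered images
    x_alpha:     (B, 1, H, W) rendered accumulated alpha
    m:           (B, 1, H, W) pseudo-labeled object mask
    m_sky_road:  (B, 1, H, W) pseudo-labeled sky/road mask (known empty space)
    e3d, z:      continuous and quantized tri-plane latents
    """
    l_rgb = (((x_hat - x) * m) ** 2).mean()                              # masked L2 reconstruction
    l_vq = ((e3d.detach() - z) ** 2).mean() + lam_commit * ((z.detach() - e3d) ** 2).mean()
    l_alpha = (((x_alpha - 1.0) * m) ** 2).mean() + ((x_alpha * m_sky_road) ** 2).mean()
    return l_rgb + l_vq + l_alpha

# Toy shapes just to show the call signature.
B, H, W = 2, 128, 128
loss = stage1_losses(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W), torch.rand(B, 1, H, W),
                     torch.randint(0, 2, (B, 1, H, W)).float(),
                     torch.randint(0, 2, (B, 1, H, W)).float(),
                     torch.randn(B, 3, 16, 16, 32), torch.randn(B, 3, 16, 16, 32))
```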
3.3. Iterative Latents Sampling for Neural Assets
Once the first stage training is finished, we can now rep-
resent neural assets using the learned tri-plane latents and
reconstruct a collection of assets from image inputs. To
generate previously unseen assets with various conditions,
we further learn to sample the tri-plane latents in a second
stage, following prior work on Generative Transformers
[59,60,62,138]. More precisely, we transform the quantized
embedding z ∈ R^{N_Z×N_Z×3×D_tok} into a discrete sequence
s ∈ {1, ..., K}^{N_Z×N_Z×3}, where each element is the index
of the selected codebook entry, i.e., s_ijk = n such that z_ijk = z_n.
Figure 3. We introduce GINA-3D, a 3D-aware generative transformer for implicit neural asset generation. GINA-3D follows a two-stage
pipeline, where we learn discrete 3D tri-plane latents in stage 1 (Sec. 3.2) and iterative latents sampling in stage 2 (Sec. 3.3). In stage 1,
an input image is first encoded into continuous tri-plane latents e3D using a Transformer-based 2D-to-3D Encoder Eϕ. Then, a learnable
codebook K quantizes the latents into discrete latents z. Finally, a 3D-to-2D Decoder Gθ maps z back to an image, using a token
Transformer, a Style-based generator and volume rendering. The rendered image is supervised via an occlusion-aware reconstruction loss.
In stage 2, we learn iterative latents sampling using MaskGIT [62]. Optional conditioning information can be used to perform conditional
synthesis. The sampled latents can then be decoded into neural assets using the decoder Gθ learned in stage 1.
Following the recent work MaskGIT [62], we use a bidirectional
transformer as our latent generator M_ψ and learn to iteratively
sample the latent sequence (Fig. 3-right). During training, we
learn to predict randomly masked latents s_M̄ by minimizing the
negative log-likelihood of the masked ones.

\mathcal{L}_{mask} = -\mathbb{E}_{s} \Big[ \sum_{\forall ijk:\ s_{ijk} = [\mathrm{MASK}]} \log \Pr\big(s_{ijk} \mid s_{\bar{M}}\big) \Big]    (8)
At inference time, we iteratively generate and refine latents.
Starting with all latents set to [MASK], we predict all
latents simultaneously at each step but only keep the most
confident ones. The remaining ones are reset to [MASK]
and the refinement continues. Finally, the sequence s can be
readily mapped back to neural assets by indexing the code-
book K to generate tri-plane latents z and decoding using
Gθ. This iterative approach can be applied to asset variations
by selectively masking out tokens of a given instance.
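The inference loop can be sketched as confidence-based iterative unmasking in the spirit of MaskGIT [62]; the cosine masking schedule, the mask_id convention and the latent_generator interface below are illustrative assumptions.

```python
import math
import torch

@torch.no_grad()
def iterative_sample(latent_generator, seq_len=16 * 16 * 3, vocab=2048, steps=8, mask_id=2048):
    """Start from all-[MASK] tokens and keep the most confident predictions each step."""
    s = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for t in range(steps):
        logits = latent_generator(s)                          # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)       # per-token confidence and argmax
        conf = torch.where(s == mask_id, conf, torch.ones_like(conf))  # fixed tokens stay fixed
        # Cosine schedule: fraction of tokens that remain masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        cutoff = conf.sort(dim=-1).values[:, keep_masked]     # confidence threshold for this step
        s = torch.where(conf >= cutoff.unsqueeze(-1),
                        torch.where(s == mask_id, pred, s),   # accept confident predictions
                        torch.full_like(s, mask_id))          # re-mask the rest
    return s                                                  # indices into the learned codebook

# Toy usage with a random stand-in for the trained bidirectional transformer.
dummy_generator = lambda s: torch.randn(s.shape[0], s.shape[1], 2048)
tokens = iterative_sample(dummy_generator)
```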
3.4. Expanding Supervision and Conditioning
The two-stage training of GINA-3D is flexible in supervi-
sion and conditioning. When we have additional information,
we can incorporate it in stage 1 as auxiliary supervision for
token learning, or in stage 2 for conditional synthesis.
Unit box vs. Scaled box.
Object scale information can
serve as an additional input to the tri-linear interpolation on
the tri-plane feature maps by rescaling the feature maps to
span the object bounding box (instead of a unit box).
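A small sketch of this rescaling, under the assumption that query points are expressed in the object frame and divided by the per-axis box extents before the tri-plane lookup:

```python
import torch

def normalize_to_box(pts_obj, box_dims):
    """Map object-frame points into [-1, 1] using the 3D box extents (L, W, H),
    so tri-plane features span the actual bounding box instead of a unit cube."""
    return pts_obj / (0.5 * box_dims)   # pts_obj: (N, 3), box_dims: (3,)

pts = normalize_to_box(torch.tensor([[2.0, 0.5, 0.4]]), torch.tensor([4.8, 2.1, 1.6]))
# The normalized points are then used for the tri-plane sampling shown earlier.
```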
Semantic feature fields.
Various recent works have
demonstrated the effectiveness of learning hybrid representations
in neural rendering [142–144] and 2D image synthesis
[145]. We can naturally incorporate semantic feature
fields in our formulation by computing additional channels
in our neural rendering MLP. We precompute DINO-ViT fea-
tures [146] for each image and learn a semantic feature field
to build part correspondence among generated instances.
LiDAR depth supervision.
When a LiDAR point cloud is
available in the data, it can be used as additional supervision
through a reconstruction term between the rendered
depth and the LiDAR depth.
Conditional synthesis.
Last but not least, additional
information supports various applications in conditional synthesis.
Denoted as C, it can be fed into our latent generator as
M_ψ(s_ijk | s_M̄, C). For example, object scale, object class,
time-of-day and object semantic embeddings can all serve
as C to control the generation process.
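One simple way to realize M_ψ(s_ijk | s_M̄, C), sketched below, is to embed the condition and prepend it as an extra token to the bidirectional transformer; the module name, conditioning dimension and prepend-token scheme are assumptions, since the exact conditioning mechanism is detailed in the supplementary material.

```python
import torch
import torch.nn as nn

class ConditionalLatentGenerator(nn.Module):
    """Bidirectional transformer over tri-plane token indices with a prepended condition token."""
    def __init__(self, vocab=2048 + 1, d_model=256, cond_dim=8, seq_len=16 * 16 * 3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)            # codebook indices + [MASK]
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len + 1, d_model))
        self.cond_proj = nn.Linear(cond_dim, d_model)           # e.g. scale / class / time-of-day embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab - 1)                # predict real codebook entries only

    def forward(self, s, cond):                                  # s: (B, seq_len) ints, cond: (B, cond_dim)
        tok = torch.cat([self.cond_proj(cond).unsqueeze(1), self.tok_emb(s)], dim=1) + self.pos_emb
        out = self.encoder(tok)
        return self.head(out[:, 1:])                             # (B, seq_len, K) logits per position

gen = ConditionalLatentGenerator()
logits = gen(torch.randint(0, 2049, (2, 768)), torch.randn(2, 8))
```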
4. Experiments
4.1. Object-centric Benchmark
We select the Waymo Open Dataset (WOD) [14] as it
is one of the largest and most diverse autonomous driving
datasets, containing rich geometric and semantic labels
such as object bounding boxes and per-pixel instance masks.
                     Images    Unique Instances
WOD-Vehicle          901K      23.6K
WOD-Pedestrian       321K      8.1K
Longtail-Vehicle     80K       3.7K
Table 1. Statistics of our object-centric benchmark. Experiments
were conducted on a subset with image patches rescaled to 256²
resolution.
                                            Image Quality       Image Semantic Div.   Geometry Quality       Geometry Mesh Div.
Method (tri-plane z / scaled box / LiDAR)   FID↓   Mask FOU↓    COV↑    MMD↓          Cons.↓   Mesh FOU↓     COV↑    MMD↓
GIRAFFE [51]                                105.3  43.66        8.24    2.35          15.87    N/A           N/A     N/A
EG3D [52]                                   137.6  7.40         6.26    2.37          2.38     25.7          3.12    4.70
GINA-3D (× / × / ×)                         147.9  1.85         4.78    2.00          1.55     N/A           1.95    2.43
GINA-3D (✓ / × / ×)                         79.0   1.82         19.67   1.52          1.27     11.7          5.75    2.21
GINA-3D (✓ / ✓ / ×)                         60.5   1.77         20.68   1.53          1.06     2.33          8.69    2.26
GINA-3D (✓ / ✓ / ✓)                         59.5   1.80         25.00   1.46          0.98     4.57          11.42   2.17
Table 2. Quantitative evaluation on the realism and diversity of generated images and geometry (metric details in Sec. 4.3).
Specifically, the dataset includes 1,150 driving scenes cap-
tured mostly in downtown San Francisco and Phoenix, each
consisting of 200 frames of multi-sensor observations. To
construct an object-centric benchmark, we propose a coarse-
to-fine procedure to extract collections of single-view 2D
photographs by leveraging 3D object boxes, camera-LiDAR
synchronization, and fine-grained 2D panoptic labels. First,
we leverage the 3D box annotations to exclude objects be-
yond certain distances to the surveying vehicle in each data
frame (e.g., 40m for pedestrians and 60m for vehicles, re-
spectively). At a given frame, we project 3D point clouds
within each 3D bounding box to the most visible camera and
extract the centering patch to build our single-view 2D im-
age collections. Furthermore, we train a Panoptic-Deeplab
model [61,147] using the 2D panoptic segmentations on the
labeled subset and create per-pixel pseudo-labels for each
camera image on the entire dataset. This allows us to differ-
entiate pixels belonging to the object of interest, background,
and occluder (e.g., standing pole in front of a person). We
further exclude certain patches where objects are heavily
occluded using the 2D panoptic predictions. Even with the
filtering criterion applied, we believe that the resulting bench-
mark is still very challenging due to occlusions, intra-class
variations (e.g., truck and sedan), partial observations (e.g.,
we do not have full 360 degree observations of a single
vehicle), and imperfect segmentation. In particular, we pro-
vide accurate registration of camera rays and LiDAR point
clouds to the object coordinate frame, taking into account
the camera rolling shutter, object motion and ego motion.
We repeat the same process to extract vehicles and pedestri-
ans from WOD, and additional longtail vehicles from our
Longtail dataset. The proposed object-centric benchmark is
one of the largest datasets for generative modeling to date,
including diverse and longtail examples in the wild.
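The coarse-to-fine filtering above can be summarized by the following sketch, which operates on pre-extracted candidate records; the record fields, function name and visibility threshold are illustrative (the released benchmark follows the full procedure with camera-LiDAR registration).

```python
import numpy as np

def filter_object_crops(records, max_range, min_mask_ratio=0.5):
    """Coarse-to-fine filtering of candidate object crops (sketch with synthetic records).

    Each record is assumed to carry: 'category', 'range_m' (distance to the ego vehicle),
    'mask' (H, W bool, object pixels from the pseudo-labeled panoptic model), and
    'box_2d_area' (pixel area of the projected 3D box).
    """
    kept = []
    for r in records:
        if r["range_m"] > max_range[r["category"]]:
            continue                                   # coarse filter: object too far from the ego vehicle
        visible = r["mask"].sum() / max(r["box_2d_area"], 1)
        if visible < min_mask_ratio:
            continue                                   # fine filter: heavily occluded object
        kept.append(r)
    return kept

# Toy example with two synthetic candidates.
records = [
    {"category": "vehicle", "range_m": 35.0, "mask": np.ones((64, 64), bool), "box_2d_area": 64 * 64},
    {"category": "pedestrian", "range_m": 55.0, "mask": np.zeros((64, 64), bool), "box_2d_area": 64 * 64},
]
print(len(filter_object_crops(records, {"vehicle": 60.0, "pedestrian": 40.0})))  # -> 1
```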
4.2. Implementation Details
GINA-3D.
Our encoder takes in images at a resolution of
256² and renders at 128² during training. Our tri-plane
latents have a resolution of 16² with a codebook containing
2048 entries and a lookup dimension of 32.
Figure 4. Image samples from our object-centric benchmark.
We trained our models on 8 Tesla V100 GPUs using the Adam optimizer [148],
with batch sizes of 32 and 64 in stage 1 and stage 2, respectively. We
trained stage 1 for 150K steps and stage 2 for 80K steps.
Baselines.
We compare against two state-of-the-art meth-
ods in the domain, GIRAFFE [51] and EG3D [52], which
we train on our dataset at a resolution of 128². We noticed
that the GIRAFFE model trained on full pixels fails to disentangle
viewpoints, occlusions and identities. This makes extracting
the foreground pixels difficult, as the rendered
mask is only defined at a low resolution of 16².
We instead report numbers for a model trained by
whitening out non-object regions. For EG3D, we observed
that training with unmasked images leads to training
collapse, due to the absence of foreground and background
modeling. Thus, we trained EG3D under the same setting.
4.3. Evaluations on WOD-Vehicle
We conduct quantitative evaluations in Table 2 and visualize
qualitative results of the different models in Fig. 5.
Image Evaluation.
For image quality, we calculate
Fréchet Inception Distance (FID) [149] between 50K gen-
erated images and all available validation images. To better
reflect the metric on object completeness, we filter images
where its object segmentation mask take up at least 50% of
the projected 3D bounding box (Fig.5-right). We addition-
ally measure the completeness of the generated images by
Mask Floater-Over-Union (Mask FOU), which is defined as
the percentage of unconnected pixels over the rendered ob-
ject region. To measure the semantic diversity, we compute
the Coverage (COV) score and Minimum Matching Distance
(MMD) [94] using CLIP [150] embeddings. COV measures
the fraction of CLIP embeddings in the validation set that have
matches in the generated set, and MMD measures the distance
from each generated embedding to the closest embedding in the
validation set.
Figure 5. Qualitative comparison between GIRAFFE, EG3D and ours with images rendered from a horizontal 30° viewpoint. Both baselines
fail to disentangle real-world sensor data. GIRAFFE fails to disentangle rotation in the object representation, while both baselines fail to
disentangle occlusion and produce incomplete shapes. We show samples from the occlusion-filtered WOD-Vehicle validation set on the right.
Figure 6. Generation from GINA-3D variants. (a) GINA-3D trained on WOD-Vehicle. (b) GINA-3D with additional DINO feature field
generation. (c) GINA-3D trained on Longtail-Vehicle. (d) GINA-3D trained on WOD-Pedestrian.
Our model demonstrates significant
improvements in FID, image completeness and semantic
diversity. Without explicit disentanglement, the baselines can
hardly handle the real-world distribution, resulting in artifacts such as
incomplete shapes (Fig. 5).
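Both diversity metrics reduce to nearest-neighbor statistics over embedding sets; a minimal sketch following the definitions above (the CLIP embedding step itself is assumed to be precomputed):

```python
import numpy as np

def coverage_and_mmd(gen_emb, val_emb):
    """COV: fraction of validation embeddings matched as a nearest neighbor of some generated one.
    MMD: mean distance from each generated embedding to its closest validation embedding.
    gen_emb: (N_g, D), val_emb: (N_v, D), e.g. CLIP image embeddings."""
    d = np.linalg.norm(gen_emb[:, None, :] - val_emb[None, :, :], axis=-1)  # (N_g, N_v)
    cov = len(np.unique(d.argmin(axis=1))) / len(val_emb)
    mmd = d.min(axis=1).mean()
    return cov, mmd

# Toy usage with random stand-ins for precomputed embeddings.
cov, mmd = coverage_and_mmd(np.random.randn(100, 512), np.random.randn(80, 512))
```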
Geometry Evaluation. To measure the underlying volume
rendering consistency, we follow Or-El et al. [79] and compute
the alignment errors between the volume-rendered depth
from two viewpoints. We extract the mesh using march-
ing cubes [151] with a density threshold of 10 following
EG3D [52]. We measure the completeness by Mesh Floater-
Over-Union (Mesh FOU), which is defined as the percentage
of the surface area on unconnected mesh pieces over the
entire mesh. Since we do not have ground-truth meshes for
the real-world data, we approximate mesh diversity by measuring
distances between generated meshes and aggregated in-box LiDAR
point clouds from the validation set.
We measure mesh diversity using the aforementioned COV
and MMD with a new distance metric. To account for the
incompleteness of LiDAR point clouds, we use a one-way
Chamfer distance, which is defined as the mean distance be-
tween validation point clouds and their nearest neighbor from
a given generated mesh. Our model demonstrates signifi-
cant improvements in volume rendering consistency, shape
completeness and shape diversity.
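The one-way Chamfer distance can be sketched as follows, with the generated mesh approximated by points sampled on its surface (the surface sampling step is assumed to be done beforehand):

```python
import numpy as np

def one_way_chamfer(val_points, mesh_points):
    """Mean distance from each aggregated LiDAR point to its nearest point sampled
    from a generated mesh surface; one-way to tolerate incomplete LiDAR coverage."""
    d = np.linalg.norm(val_points[:, None, :] - mesh_points[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Toy usage with random stand-ins for LiDAR points and mesh surface samples.
print(one_way_chamfer(np.random.rand(500, 3), np.random.rand(2000, 3)))
```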
Augmentation and Ablation. GINA-3D can naturally incorporate
additional supervision when available. We present
variations of GINA-3D trained with object scale, LiDAR
and DINO [146] supervision. With object scale informa-
tion available, we normalize tri-plane feature maps with the
scale on each dimension. The model trained with rescaled
tri-plane resolution yields a significant performance boost in
both quality and diversity over the unit bounding cube, as the
latents are better utilized. Moreover, we observe that by adding
auxiliary L2 depth supervision from LiDAR, most metrics
are improved except Mask and Mesh FOU. While LiDAR
provides strong signal to underlying geometry, it also intro-
duces inconsistencies on transparent surfaces. We hypothesize
that this challenge leads to slightly more floaters, which
we leave as a future direction to explore. Alternatively, we
can learn additional neural semantic fields through 2D-to-3D
feature lifting [142].
Figure 7. GINA-3D unifies a wide range of asset synthesis tasks, all obtained with the same stage 1 decoder and variations of stage 2 training.
Top row: Conditional synthesis using discrete conditions (object classes and time-of-day). 2nd row: Conditional synthesis using continuous
conditions (semantic token and object scale). 3rd row: Image-conditioned assets variations by randomizing tri-plane latents.
By only changing the final layer of the NR MLP, we can learn an additional view-consistent and
instance-invariant semantic feature field (Fig. 6-b), which
can enable future applications of language-conditioned and
part-based editing [8]. Finally, we perform ablation studies
on the key design of tri-plane latents. If we remove the tri-plane
structure and use an MLP-only NR, the model fails to
capture the diversity of the real-world data and suffers from mode
collapse, always generating a mean car shape.
4.4. Applications
Generating long-tail instances. Our data-driven framework
is scalable to new data. We provide results for GINA-3D
trained on the Longtail-Vehicle and WOD-Ped datasets in Fig. 6-c,d,
respectively. Without finetuning the architecture on the
newly collected data, GINA-3D can readily learn to generate
long-tail objects from noisy segmentation masks. As shown
in Fig. 6-c, the generated results range from trams and trucks to
construction equipment of various shapes. GINA-3D can
also be applied to other categories (e.g. pedestrians, Fig. 6-d).
Results show moderate shape and texture diversity.
Conditional synthesis. As described in Sec. 3.4, the flexibil-
ity of the two-stage approach makes it a promising candidate
for conditional asset synthesis. Specifically, we freeze the
stage 1 model, and train variations of MaskGIT by passing
in different conditions. We provide results for three kinds of
conditional synthesis tasks in Fig. 7, namely discrete embed-
dings (object class, time-of-day), continuous embeddings,
and image-conditioned generation. For image-conditioned
asset reconstruction and variations, we first infer the latents
using the encoder model and then sample asset variations
by controlling the masking ratio of the reconstructed tri-plane
latents. The more tokens are masked, the wider the variation
range becomes. We provide more details for conditional
synthesis in the supplementary material.
4.5. Limitations
Misaligned 3D bounding boxes. As in our WOD-Ped results,
misaligned boxes lead to mismatches in pixel space, resulting
in blurrier outputs. Recent methods in ray-based [130]
or patch-based [81] learning are promising directions.
Few-shot and transfer learning. Though our data-driven
approach achieves reasonable performance by training on
Longtail-Vehicle alone, the comparative scarcity of data
leads to lower diversity. How to enable few-shot learning or
transfer learning remains an open question.
Transient effects. Direction-dependent effects can be incorporated
into our pipeline. We believe modeling materials [152]
together with LiDAR is an interesting direction.
5. Conclusion
In this work, we presented GINA-3D, a scalable learn-
ing framework to synthesize 3D assets from robotic sensors
deployed in the wild. Core to our framework is a deep
encoder-decoder backbone that learns discrete tri-plane la-
tent variables from partially-observed 2D input pixels. Our
backbone is composed of an encoder with cross-attentions, a
decoder with tri-plane feature maps, and a neural volumetric
rendering module. We further introduce a latent transformer
to generate tri-plane latents with various conditions includ-
ing bounding box size, time of day, and semantic features.
To evaluate our framework, we have established a large-scale
object-centric benchmark containing diverse vehicles and
pedestrians. Experimental results have demonstrated strong
performance on image quality, geometry consistency and
geometry diversity over existing methods. The benchmark
is publicly available through waymo.com/open.
Acknowledgements:
We based our MaskGIT implemen-
tation on Chang et al. [62]. We thank Huiwen Chang for
helpful MaskGIT pointers. We acknowledge the helpful
discussions and support from Qichi Yang and James Guo.
We thank Mathilde Caron for her DINO implementation and
helpful pointers. We based our GIRAFFE baseline on the
reimplementation by Kyle Sargent. We thank Golnaz Ghiasi
for helpful pointers on segmentation models.
References
[1] Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Prad-
han, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron,
and Henrik Kretzschmar. Block-nerf: Scalable large scene
neural view synthesis. In CVPR, 2022. 1, 3
[2] Harry Barrow, J Tenenbaum, A Hanson, and E Riseman.
Recovering intrinsic scene characteristics. Comput. vis. syst,
2(3-26):2, 1978. 1
[3] David Marr. Vision: A computational investigation into the
human representation and processing of visual information.
W. H. Freeman and Company, San Francisco, 1982. 1
[4] Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and
Mubarak Shah. Shape-from-shading: a survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
21(8):690–706, 1999. 1
[5] Marshall Tappen, William Freeman, and Edward Adelson.
Recovering intrinsic images from a single image. NIPS, 15,
2002. 1
[6] Derek Hoiem, Alexei A Efros, and Martial Hebert. Au-
tomatic photo pop-up. In ACM SIGGRAPH 2005 Papers,
pages 577–584. 2005. 1
[7] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d:
Learning 3d scene structure from a single still image. IEEE
transactions on pattern analysis and machine intelligence,
31(5):824–840, 2008. 1
[8] Stephen Gould, Richard Fulton, and Daphne Koller. Decom-
posing a scene into geometric and semantically consistent
regions. In ICCV. IEEE, 2009. 1, 8
[9] Abhinav Gupta, Alexei A Efros, and Martial Hebert. Blocks
world revisited: Image understanding using qualitative ge-
ometry and mechanics. In ECCV. Springer, 2010. 1
[10] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel
Urtasun. Vision meets robotics: The kitti dataset. The
International Journal of Robotics Research, 32(11):1231–
1237, 2013. 1
[11] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul
Newman. 1 year, 1000 km: The oxford robotcar dataset.
The International Journal of Robotics Research, 36(1):3–15,
2017. 1, 15
[12] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet
Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter
Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d
tracking and forecasting with rich maps. In CVPR, 2019. 1,
15
[13] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora,
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan,
Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi-
modal dataset for autonomous driving. In CVPR, 2020. 1,
15
[14] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,
Yuning Chai, Benjamin Caine, et al. Scalability in perception
for autonomous driving: Waymo open dataset. In CVPR,
2020. 1, 2, 5, 15
[15] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair,
Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh,
Sergey Levine, and Chelsea Finn. Robonet: Large-scale
multi-robot learning.
arXiv preprint arXiv:1910.11215,
2019. 1
[16] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Cheb-
otar, Omar Cortes, Byron David, Chelsea Finn, Keerthana
Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i
can, not as i say: Grounding language in robotic affordances.
arXiv preprint arXiv:2204.01691, 2022. 1
[17] German Ros, Laura Sellart, Joanna Materzynska, David
Vazquez, and Antonio M Lopez. The synthia dataset: A large
collection of synthetic images for semantic segmentation of
urban scenes. In CVPR, 2016. 1
[18] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora
Vig. Virtual worlds as proxy for multi-object tracking analy-
sis. In CVPR, 2016. 1, 3
[19] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten,
Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr:
A diagnostic dataset for compositional language and elemen-
tary visual reasoning. In CVPR, 2017. 1
[20] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio
Lopez, and Vladlen Koltun. Carla: An open urban driving
simulator. In Conference on robot learning, pages 1–16.
PMLR, 2017. 1, 3
[21] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt,
Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu,
Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d
environment for visual ai. arXiv preprint arXiv:1712.05474,
2017. 1
[22] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish
Kapoor. Airsim: High-fidelity visual and physical simulation
for autonomous vehicles. In Field and service robotics,
pages 621–635. Springer, 2018. 1
[23] Amir R Zamir, Alexander Sax, William Shen, Leonidas J
Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy:
Disentangling task transfer learning. In CVPR, 2018. 1
[24] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars
Mescheder, Andreas Geiger, and Carsten Rother.
Aug-
mented reality meets computer vision: Efficient data gen-
eration for urban driving scenes. International Journal of
Computer Vision, 126(9):961–972, 2018. 1
[25] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets,
Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia
Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A plat-
form for embodied ai research. In ICCV, 2019. 1
[26] Yohann Cabon, Naila Murray, and Martin Humenberger.
Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020. 1
[27] Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín,
Linxi Fan, Guanzhi Wang, Claudia Pérez-D’Arpino, Shya-
mal Buch, Sanjana Srivastava, Lyne Tchapmi, et al. igibson
1.0: a simulation environment for interactive tasks in large
realistic scenes. In IROS, 2021. 1
[28] Blender Online Community. Blender - a 3d modelling and
rendering package, 2018. 1
[29] Arthur Juliani, Vincent-Pierre Berges, Ervin Teng, Andrew
Cohen, Jonathan Harper, Chris Elion, Chris Goy, Yuan Gao,
Hunter Henry, Marwan Mattar, et al. Unity: A general plat-
form for intelligent agents. arXiv preprint arXiv:1809.02627,
2018. 1
[30] Epic Games. Unreal engine. 1
[31] Miles Macklin, Matthias Müller, Nuttapong Chentanez, and
Tae-Yong Kim. Unified particle physics for real-time appli-
cations. ACM Transactions on Graphics (TOG), 33(4):1–12,
2014. 1
[32] Erwin Coumans and Yunfei Bai. Pybullet, a python module
for physics simulation for games, robotics and machine
learning. 2016. 1
[33] Angel X Chang, Thomas Funkhouser, Leonidas Guibas,
Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese,
Manolis Savva, Shuran Song, Hao Su, et al.
Shapenet:
An information-rich 3d model repository. arXiv preprint
arXiv:1512.03012, 2015. 1, 2, 3
[34] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber,
Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-
annotated 3d reconstructions of indoor scenes. In CVPR,
2017. 1
[35] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej
Halber, Matthias Niessner, Manolis Savva, Shuran Song,
Andy Zeng, and Yinda Zhang. Matterport3d: Learning
from rgb-d data in indoor environments. arXiv preprint
arXiv:1709.06158, 2017. 1
[36] Keunhong Park, Konstantinos Rematas, Ali Farhadi, and
Steven M Seitz. Photoshape: Photorealistic materials for
large-scale shape collections. arXiv preprint
arXiv:1809.09761, 2018. 1, 2, 3
[37] Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jiten-
dra Malik, and Silvio Savarese. Gibson env: Real-world
perception for embodied agents. In CVPR, 2018. 1
[38] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna
Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-
scale benchmark for fine-grained and hierarchical part-level
3d object understanding. In CVPR, 2019. 1
[39] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler,
Luca Sbordone, Patrick Labatut, and David Novotny. Com-
mon objects in 3d: Large-scale learning and evaluation of
real-life 3d category reconstruction. In ICCV, 2021. 1
[40] Fereshteh Sadeghi and Sergey Levine. Cad2rl: Real single-
image flight without a single real image. arXiv preprint
arXiv:1611.04201, 2016. 1
[41] Matthias Müller, Alexey Dosovitskiy, Bernard Ghanem, and
Vladlen Koltun. Driving policy transfer via modularity and
abstraction. arXiv preprint arXiv:1804.09364, 2018. 1
[42] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles
Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing
the sim-to-real loop: Adapting simulation randomization
with real world experience. In ICRA, 2019. 1
[43] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Ma-
teusz Litwin, Bob McGrew, Arthur Petron, Alex Paino,
Matthias Plappert, Glenn Powell, Raphael Ribas, et al.
Solving rubik’s cube with a robot hand. arXiv preprint
arXiv:1910.07113, 2019. 1
[44] Bła˙zej Osi´nski, Adam Jakubowski, Paweł Zi˛ecina, Piotr
Miło´s, Christopher Galias, Silviu Homoceanu, and Henryk
Michalewski. Simulation-based reinforcement learning for
real-world autonomous driving. In ICRA, 2020. 1
[45] Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexan-
der Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia
Chernova, and Dhruv Batra. Sim2real predictivity: Does
evaluation in simulation predict real-world performance?
IEEE Robotics and Automation Letters, 5(4):6670–6677,
2020. 1
[46] Saminda Abeyruwan, Laura Graesser, David B D’Ambrosio,
Avi Singh, Anish Shankar, Alex Bewley, and Pannag R
Sanketi. i-sim2real: Reinforcement learning of robotic poli-
cies in tight human-robot interaction loops. arXiv preprint
arXiv:2207.06572, 2022. 1
[47] Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong,
Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang,
Wei-Chiu Ma, and Raquel Urtasun. Lidarsim: Realistic lidar
simulation by leveraging the real world. In CVPR, 2020. 1,
3
[48] Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang,
Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin
Yumer, and Raquel Urtasun. Geosim: Realistic video simu-
lation via geometry-aware composition for self-driving. In
CVPR, 2021. 1, 3
[49] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci,
Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba,
and Sanja Fidler. Meta-sim: Learning to generate synthetic
datasets. In ICCV, 2019. 1, 3
[50] Girish Varma, Anbumani Subramanian, Anoop Namboodiri,
Manmohan Chandraker, and CV Jawahar. Idd: A dataset
for exploring problems of autonomous navigation in uncon-
strained environments. In WACV. IEEE, 2019. 1
[51] Michael Niemeyer and Andreas Geiger. Giraffe: Represent-
ing scenes as compositional generative neural feature fields.
In CVPR, 2021. 2, 3, 6
[52] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano,
Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J
Guibas, Jonathan Tremblay, Sameh Khamis, et al. Effi-
cient geometry-aware 3d generative adversarial networks.
In CVPR, 2022. 2, 3, 4, 6, 7, 17
[53] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas
Funkhouser, and Jianxiong Xiao. Lsun: Construction of a
large-scale image dataset using deep learning with humans
in the loop. arXiv preprint arXiv:1506.03365, 2015. 2, 3
[54] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang.
A large-scale car dataset for fine-grained categorization and
verification. In CVPR, 2015. 2, 3
[55] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of gans for improved quality, stability,
and variation. arXiv preprint arXiv:1710.10196, 2017. 2, 3
[56] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.
Large-scale celebfaces attributes (celeba) dataset. Retrieved
August, 15(2018):11, 2018. 2, 3
[57] Tero Karras, Samuli Laine, and Timo Aila. A style-based
generator architecture for generative adversarial networks.
In CVPR, 2019. 2, 3
[58] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha.
Stargan v2: Diverse image synthesis for multiple domains.
In CVPR, 2020. 2, 3
[59] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete
representation learning. In NeurIPS, 2017. 2, 4
[60] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming
transformers for high-resolution image synthesis. In CVPR,
2021. 2, 4, 20
[61] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu,
Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen.
Panoptic-deeplab: A simple, strong, and fast baseline for
bottom-up panoptic segmentation. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recog-
nition, pages 12475–12485, 2020. 2, 6, 16
[62] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T
Freeman. Maskgit: Masked generative image transformer.
In CVPR, 2022. 2, 4, 5, 8, 17
[63] Joshua B Tenenbaum and William T Freeman. Separating
style and content with bilinear models. Neural computation,
12(6):1247–1283, 2000. 2, 3
[64] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee.
Learning to disentangle factors of variation with manifold
interaction. In ICML. PMLR, 2014. 2, 3
[65] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas
Brox. Learning to generate chairs with convolutional neural
networks. In CVPR, 2015. 2, 3
[66] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak
Lee. Weakly-supervised disentangling with recurrent trans-
formations for 3d view synthesis. In NIPS, 2015. 2, 3
[67] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and
Josh Tenenbaum. Deep convolutional inverse graphics net-
work. In NIPS, 2015. 2, 3
[68] Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed,
Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsu-
pervised learning of 3d structure from images. NIPS, 2016.
2, 3
[69] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Man-
mohan Chandraker. Towards large-pose face frontalization
in the wild. In ICCV, 2017. 2, 3
[70] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun
Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman.
Visual object networks: Image generation with disentangled
3d representations. In NeurIPS, 2018. 2, 3
[71] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian
Richardt, and Yong-Liang Yang. Hologan: Unsupervised
learning of 3d representations from natural images. In ICCV,
2019. 2, 3
[72] Thu H Nguyen-Phuoc, Christian Richardt, Long Mai,
Yongliang Yang, and Niloy Mitra. Blockgan: Learning 3d
object-aware scene representations from unlabelled images.
2020. 2, 3
[73] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas
Geiger. Towards unsupervised learning of generative models
for 3d controllable image synthesis. In CVPR, 2020. 2, 3
[74] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas
Geiger. Graf: Generative radiance fields for 3d-aware image
synthesis. In NeurIPS, 2020. 2, 3
[75] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu,
and Gordon Wetzstein. pi-gan: Periodic implicit generative
adversarial networks for 3d-aware image synthesis. In CVPR,
2021. 2, 3
[76] Zekun Hao, Arun Mallya, Serge Belongie, and Ming-Yu Liu.
Gancraft: Unsupervised 3d neural rendering of minecraft
worlds. In ICCV, 2021. 2, 3
[77] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D:
A 3D-Aware Generator of GANs Based on Conditionally-
Independent Pixel Synthesis. 2021. 2, 3
[78] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt.
Stylenerf:
A style-based 3d aware generator for high-
resolution image synthesis. In ICLR, 2022. 2, 3
[79] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman,
Jeong Joon Park,
and Ira Kemelmacher-Shlizerman.
Stylesdf: High-resolution 3d-consistent image and geometry
generation. In CVPR, 2022. 2, 3, 7, 18
[80] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden-
hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv
preprint arXiv:2209.14988, 2022. 2
[81] Ivan Skorokhodov, Sergey Tulyakov, Yiqun Wang, and Peter
Wonka. Epigraf: Rethinking training of 3d gans. 2022. 2, 3,
8
[82] Kangle Deng, Gengshan Yang, Deva Ramanan, and Jun-Yan
Zhu. 3d-aware conditional image synthesis. In CVPR, 2023.
2
[83] Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian
Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov.
3d generation on imagenet. In International Conference on
Learning Representations. 2
[84] Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang,
Charles Herrmann, Pratul Srinivasan, Jiajun Wu, and De-
qing Sun. Vq3d: Learning a 3d-aware generative model on
imagenet. arXiv preprint arXiv:2302.06833, 2023. 2
[85] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu,
Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d
shapenets: A deep representation for volumetric shapes. In
CVPR, 2015. 2
[86] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and
Josh Tenenbaum. Learning a probabilistic latent space of
object shapes via 3d generative-adversarial modeling. 2016.
2
[87] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Ab-
hinav Gupta. Learning a predictable and generative vector
representation for objects. In ECCV. Springer, 2016. 2
[88] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape
induction from 2d views of multiple objects. In 3DV, 2017.
2
[89] Edward J Smith and David Meger. Improved adversarial
systems for 3d object generation and reconstruction. In
CoRL. PMLR, 2017. 2
[90] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. Escap-
ing plato’s cave: 3d shape from adversarial rendering. In
ICCV, 2019. 2
[91] Sebastian Lunz, Yingzhen Li, Andrew Fitzgibbon, and
Nate Kushman. Inverse graphics gan: Learning to gen-
erate 3d shapes from unstructured 2d data. arXiv preprint
arXiv:2002.12674, 2020. 2
[92] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation
and completion through point-voxel diffusion. In ICCV,
2021. 2
[93] Moritz Ibing, Gregor Kobsik, and Leif Kobbelt. Octree
transformer: Autoregressive 3d shape generation on
hierarchically structured sequences. arXiv preprint
arXiv:2111.12480, 2021. 2
[94] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and
Leonidas Guibas. Learning representations and generative
models for 3d point clouds. In ICML. PMLR, 2018. 2, 6, 18
[95] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu,
Serge Belongie, and Bharath Hariharan. Pointflow: 3d point
cloud generation with continuous normalizing flows. In
ICCV, 2019. 2
[96] Kaichun Mo, He Wang, Xinchen Yan, and Leonidas Guibas.
PT2PC: Learning to generate 3d point cloud shapes from
part tree conditions. 2020. 2
[97] Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik
Ramani. Surfnet: Generating 3d shape surfaces using deep
residual networks. In CVPR, 2017. 2
[98] Thibault Groueix, Matthew Fisher, Vladimir G Kim,
Bryan C Russell, and Mathieu Aubry.
A papier-mâché
approach to learning 3d surface generation. In CVPR, 2018.
2
[99] Dario Pavllo, Graham Spinks, Thomas Hofmann, Marie-
Francine Moens, and Aurelien Lucchi. Convolutional gener-
ation of textured 3d meshes. NeurIPS, 2020. 2
[100] Dario Pavllo, Jonas Kohler, Thomas Hofmann, and Aurelien
Lucchi. Learning generative models of textured 3d meshes
from real-world images. In ICCV, 2021. 2
[101] Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith,
Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning
to predict 3d objects with an interpolation-based differen-
tiable renderer. NeurIPS, 2019. 2
[102] Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter
Battaglia. Polygen: An autoregressive generative model of
3d meshes. In ICML. PMLR, 2020. 2
[103] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen,
Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja
Fidler. Get3d: A generative model of high quality 3d tex-
tured shapes learned from images. In NeurIPS, 2022. 2,
3
[104] Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka,
Niloy Mitra, and Leonidas Guibas. Structurenet: Hierarchi-
cal graph networks for 3d shape generation. ACM Transac-
tions on Graphics (TOG), Siggraph Asia 2019, 38(6):Article
242, 2019. 2
[105] Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A
Efros, and Jitendra Malik. Learning shape abstractions by
assembling volumetric primitives. In CVPR, 2017. 2
[106] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li.
Learning to infer implicit surfaces without 3d supervision.
NeurIPS, 2019. 2
[107] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se-
bastian Nowozin, and Andreas Geiger. Occupancy networks:
Learning 3d reconstruction in function space. In CVPR,
2019. 2
[108] Andrew Luo, Tianqin Li, Wen-Hao Zhang, and Tai Sing
Lee. Surfgen: Adversarial 3d shape synthesis with explicit
surface discriminators. In ICCV, 2021. 2
[109] Zhiqin Chen and Hao Zhang. Learning implicit fields for
generative shape modeling. In CVPR, 2019. 2
[110] Xingguang Yan, Liqiang Lin, Niloy J Mitra, Dani Lischin-
ski, Daniel Cohen-Or, and Hui Huang.
Shapeformer:
Transformer-based shape completion via sparse representa-
tion. In CVPR, 2022. 2
[111] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and
Sanja Fidler. Deep marching tetrahedra: a hybrid representa-
tion for high-resolution 3d shape synthesis. NeurIPS, 2021.
2
[112] Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shub-
ham Tulsiani. Autosdf: Shape priors for 3d completion,
reconstruction and generation. In CVPR, 2022. 2
[113] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen
Koltun.
Playing for data: Ground truth from computer
games. In ECCV. Springer, 2016. 3
[114] Seunghoon Hong, Xinchen Yan, Thomas S Huang, and
Honglak Lee. Learning hierarchical semantic image manip-
ulation through structured representations. NeurIPS, 2018.
3
[115] Seung Wook Kim, Jonah Philion, Antonio Torralba, and
Sanja Fidler. Drivegan: Towards a controllable high-quality
neural simulation. In CVPR, 2021. 3
[116] Wei Li, CW Pan, Rong Zhang, JP Ren, YX Ma, Jin Fang,
FL Yan, QC Geng, XY Huang, HJ Gong, et al. Aads: Aug-
mented autonomous driving simulation using data-driven
algorithms. Science robotics, 4(28):eaaw0863, 2019. 3
[117] Huan Ling, David Acuna, Karsten Kreis, Seung Wook Kim,
and Sanja Fidler. Variational amodal object completion. In
NeurIPS, 2020. 3
[118] Jason Zhang, Gengshan Yang, Shubham Tulsiani, and Deva
Ramanan. Ners: Neural reflectance surfaces for sparse-view
3d reconstruction in the wild. In NeurIPS, 2021. 3
[119] Sergey Zakharov, Rares Andrei Ambrus, Vitor Campag-
nolo Guizilini, Dennis Park, Wadim Kehl, Fredo Durand,
Joshua B Tenenbaum, Vincent Sitzmann, Jiajun Wu, and
Adrien Gaidon. Single-shot scene reconstruction. In CoRL,
2021. 3
[120] Tom Monnier, Matthew Fisher, Alexei A Efros, and Mathieu
Aubry. Share with thy neighbors: Single-view reconstruction
by cross-instance consistency. In ECCV, 2022. 3
[121] Norman Müller, Andrea Simonelli, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, and Peter Kontschieder. Autorf: Learning 3d object radiance fields from single view observations. In CVPR, 2022. 3, 4
[122] Jeevan Devaranjan, Amlan Kar, and Sanja Fidler. Meta-
sim2: Unsupervised learning of scene structure for synthetic
data generation. In ECCV. Springer, 2020. 3
[123] Shuhan Tan, Kelvin Wong, Shenlong Wang, Sivabalan Mani-
vasagam, Mengye Ren, and Raquel Urtasun. Scenegen:
Learning to generate realistic traffic scenes. In CVPR, 2021.
3
[124] Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou,
Pei Sun, Dumitru Erhan, Sean Rafferty, and Henrik Kret-
zschmar. Surfelgan: Synthesizing realistic sensor data for
autonomous driving. In CVPR, 2020. 3
[125] Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi,
Caroline Pantofaru, Leonidas J Guibas, Andrea Tagliasacchi,
Frank Dellaert, and Thomas Funkhouser. Panoptic neural
fields: A semantic object-aware neural scene representation.
In CVPR, 2022. 3
[126] Konstantinos Rematas, Andrew Liu, Pratul P Srinivasan, Jonathan T Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. Urban radiance fields. In CVPR, 2022. 3
[127] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
vain Gelly, et al. An image is worth 16x16 words: Trans-
formers for image recognition at scale.
arXiv preprint
arXiv:2010.11929, 2020. 4
[128] Ronghang Hu and Amanpreet Singh. Unit: Multimodal
multitask learning with a unified transformer. In ICCV,
2021. 4
[129] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac,
Carl Doersch, Catalin Ionescu, David Ding, Skanda Kop-
pula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al.
Perceiver io: A general architecture for structured inputs &
outputs. arXiv preprint arXiv:2107.14795, 2021. 4
[130] Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In CVPR, 2022. 4, 8
[131] Daniel Rebain, Mark J Matthews, Kwang Moo Yi, Gopal
Sharma, Dmitry Lagun, and Andrea Tagliasacchi. Attention
beats concatenation for conditioning neural fields. arXiv
preprint arXiv:2209.10684, 2022. 4
[132] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,
Jaakko Lehtinen, and Timo Aila. Analyzing and improving
the image quality of stylegan. In CVPR, 2020. 4, 16, 17
[133] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc
Pollefeys, and Andreas Geiger. Convolutional occupancy
networks. In ECCV. Springer, 2020. 4, 17
[134] Bokui Shen, Zhenyu Jiang, Christopher Choy, Leonidas J
Guibas, Silvio Savarese, Anima Anandkumar, and Yuke
Zhu. Acid: Action-conditional implicit visual dynamics for
deformable object manipulation. 2022. 4, 17
[135] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik,
Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
Representing scenes as neural radiance fields for view syn-
thesis. In ECCV, 2020. 4
[136] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Gener-
ating diverse high-fidelity images with vq-vae-2. In NeurIPS,
2019. 4
[137] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee-
woo Jun, David Luan, and Ilya Sutskever. Generative pre-
training from pixels. In ICML. PMLR, 2020. 4
[138] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang,
James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge,
and Yonghui Wu. Vector-quantized image modeling with
improved vqgan. In ICLR, 2021. 4, 16
[139] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun-
jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin-
fei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive
models for content-rich text-to-image generation. arXiv
preprint arXiv:2206.10789, 2022. 4
[140] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin-
dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi
Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan.
Phenaki: Variable length video generation from open do-
main textual description. arXiv preprint arXiv:2210.02399,
2022. 4
[141] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman,
and Oliver Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR, 2018. 4
[142] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitz-
mann. Decomposing nerf for editing via feature field distil-
lation. arXiv preprint arXiv:2205.15585, 2022. 5, 7
[143] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea
Vedaldi.
Neural feature fusion fields: 3d distillation of
self-supervised 2d image representations. arXiv preprint
arXiv:2209.03494, 2022. 5
[144] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen,
Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger.
Nerfplayer: A streamable dynamic scene representation
with decomposed neural radiance fields. arXiv preprint
arXiv:2210.15947, 2022. 5
[145] Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-
Francois Lafleche, Adela Barriuso, Antonio Torralba, and
Sanja Fidler. Datasetgan: Efficient labeled data factory with
minimal human effort. In CVPR, 2021. 5
[146] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jé-
gou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.
Emerging properties in self-supervised vision transformers.
In ICCV, 2021. 5, 7, 19
[147] Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan
Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar,
and Dragomir Anguelov. Waymo open dataset: Panoramic
video panoptic segmentation. In ECCV, 2022. 6, 16
[148] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980,
2014. 6, 17
[149] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern-
hard Nessler, and Sepp Hochreiter. Gans trained by a two
time-scale update rule converge to a local nash equilibrium.
NIPS, 30, 2017. 6, 17
[150] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
transferable visual models from natural language supervi-
sion. In ICML. PMLR, 2021. 6, 18
[151] William E Lorensen and Harvey E Cline. Marching cubes:
A high resolution 3d surface construction algorithm. ACM
siggraph computer graphics, 21(4):163–169, 1987. 7, 23
[152] Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler,
Jonathan T Barron, and Pratul P Srinivasan. Ref-nerf: Struc-
tured view-dependent appearance for neural radiance fields.
In CVPR, 2022. 8
[153] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin-
ton. Layer normalization. arXiv preprint arXiv:1607.06450,
2016. 16
[154] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter
Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan.
Mip-nerf: A multiscale representation for anti-aliasing neu-
ral radiance fields. In ICCV, 2021. 17
[155] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 17
[156] Itseez. Open source computer vision library. https://
github.com/itseez/opencv, 2015. 18
[157] Dawson-Haggerty et al. trimesh. 18
Appendix
In this supplementary document, we first describe in detail our proposed dataset and the processing behind it in Sec. A. Then, we discuss various implementation details, including network architectures, evaluation metrics and conditional synthesis, in Sec. B. We present additional visualizations in Sec. C, and examine ablations of the different loss terms and evaluate the stage 1 model's performance in Sec. D. Finally, we discuss the baselines in more detail in Sec. E and showcase mesh extraction visualization results in Sec. F.
Figure 8. Our object-centric benchmark: (a) WOD-Vehicle (Box, Mask, LiDAR); (b) WOD-Pedestrian (Box, Mask, LiDAR); (c) Longtail-Vehicle.
Appendix A. Dataset
We build the object-centric benchmark on top of the Waymo Open Dataset (WOD) [14] and our Longtail dataset. WOD is one of the largest and most diverse autonomous driving datasets [11–13], containing rich geometric and semantic labels such as 3D bounding boxes and per-pixel instance masks. Specifically, the
dataset includes 1,150 driving scenes captured mostly in downtown San Francisco and Phoenix, each consisting of 200 frames
of multi-sensor data. Each data frame includes 3D point clouds from LiDAR sensors and high-resolution images from five
cameras (positioned at Front, Front-Left, Front-Right, Side-Left, and Side-Right). The objects were captured in the wild and
their images exhibit large variations due to object interactions (e.g., heavy occlusion and distance to the robotic platform),
sensor artifacts (e.g., motion blur and rolling shutter) and environmental factors (e.g., lighting and weather conditions).
To construct a benchmark for object-centric modeling, we propose a coarse-to-fine procedure to extract collections of
single-view 2D photographs by leveraging 3D object boxes, camera-LiDAR synchronization, and fine-grained 2D panoptic
labels. First, we leverage the 3D box annotations to exclude objects beyond certain distances to the surveying vehicle in each
data frame (e.g., 40m for pedestrians and 60m for vehicles, respectively). At a given frame, we project 3D point clouds within
each 3D bounding box to the most visible camera and extract the centered patch to build our single-view 2D image collections.
Furthermore, we train a Panoptic-Deeplab model [61] using the 2D panoptic segmentations on the labeled subset [147] and
create per-pixel pseudo-labels for each camera image on the entire WOD. This allows us to differentiate pixels belonging to
the object of interest, the background, and occluders (e.g., a standing pole in front of a person). We further exclude patches where objects are heavily occluded using the 2D panoptic predictions. Even with these filtering criteria applied, we believe that the resulting benchmark is still very challenging due to occlusions, intra-class variations (e.g., truck vs. sedan), partial observations (e.g., we do not have full 360-degree observations of a single vehicle), and imperfect segmentation. In particular,
we provide accurate registration of camera rays and LiDAR point clouds to the object coordinate frame, taking into account
the camera rolling shutter, object motion and ego motion. Our WOD-ObjectAsset can be accessed through waymo.com/open,
organized in the Waymo Open Dataset modular format, enabling users to selectively download only the components they need.
Finally, we provide code examples to access and visualize data in the tutorial_object_asset.
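To make the procedure concrete, the sketch below illustrates the range-based filtering and occlusion-based patch rejection in Python. The helper names, the 0.5 occlusion threshold and the assumed panoptic label encoding are illustrative placeholders rather than the released tooling; only the 40m/60m distance limits come from the description above.

```python
import numpy as np

# Illustrative sketch of the coarse-to-fine patch extraction described above.
# Only the per-class distance limits are taken from the text; everything else
# (function names, label encoding, occlusion threshold) is a placeholder.

MAX_RANGE = {"pedestrian": 40.0, "vehicle": 60.0}  # meters

def keep_object(box_center_xyz, object_class):
    """Discard objects beyond the per-class range from the surveying vehicle."""
    return np.linalg.norm(box_center_xyz) <= MAX_RANGE[object_class]

def occlusion_ratio(panoptic_patch, object_id, background_id=0):
    """Fraction of the patch covered by occluder pixels (from pseudo-labels)."""
    occluder = (panoptic_patch != object_id) & (panoptic_patch != background_id)
    return float(occluder.mean())

def extract_patch(image, box_2d, panoptic_patch, object_id, max_occlusion=0.5):
    """Crop the patch around the projected 3D box; drop heavily occluded objects."""
    if occlusion_ratio(panoptic_patch, object_id) > max_occlusion:
        return None
    x0, y0, x1, y1 = box_2d  # projected, image-space box corners
    return image[y0:y1, x0:x1]
```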
Our Longtail dataset contains LiDAR point clouds and camera images, along with 3D bounding box annotations. We obtain
the pseudo-labeled segmentations using the same 2D panoptic model pretrained on WOD. We apply the same coarse-to-fine
procedure to obtain the Longtail-Vehicle benchmark.
Appendix B. More Implementation Details
B.1. Network Architecture
All models use an exponential moving average of their weights.
Encoder Eϕ.
Our encoder contains three vision transformer blocks and three cross-attention blocks. The vision transformer takes input images at a resolution of 256², and first maps each patch into a 512-dimensional token. A CLS token is appended to the list of image patch tokens. The transformer blocks then process the image patch tokens; each block has 8 heads, an embedding dimension of 512 and a hidden dimension of 2048. For the cross-attention blocks, we first initialize a tri-plane positional embedding of shape 16 × 16 × 3, where each embedding is 512-dimensional. The tri-plane positional embedding is passed through a 512-dimensional fully-connected layer. The processed tri-plane positional embedding is then used as the query input to the cross-attention transformer blocks, while the image patch tokens serve as keys and values. Each cross-attention transformer block has 8 heads, an embedding dimension of 512 and a hidden dimension of 2048. Finally, the output of the cross-attention transformer blocks is passed through a fully-connected layer with Layer Normalization [153] and tanh activation into 16 × 16 × 3 tokens of 32 dimensions, matching the dimension of each entry in the codebook K.
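For concreteness, a minimal PyTorch sketch of an encoder with this shape is given below: three self-attention blocks over image patch tokens, followed by cross-attention from learned tri-plane positional embeddings. The 16 × 16 patch size and the use of off-the-shelf TransformerEncoderLayer/TransformerDecoderLayer modules (the latter adds an extra self-attention over the queries) are assumptions for illustration, not the exact released architecture.

```python
import torch
import torch.nn as nn

class TriplaneEncoder(nn.Module):
    """Sketch of the encoder E_phi: ViT-style self-attention over image patches,
    then cross-attention where tri-plane positional embeddings act as queries."""

    def __init__(self, patch_dim=3 * 16 * 16, dim=512, heads=8, ffn=2048, code_dim=32):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)          # patch -> 512-d token
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # appended CLS token
        self.self_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, ffn, batch_first=True) for _ in range(3)]
        )
        # 16 x 16 x 3 learned tri-plane positional embeddings used as queries.
        self.triplane_pos = nn.Parameter(torch.randn(1, 16 * 16 * 3, dim))
        self.pos_proj = nn.Linear(dim, dim)
        # Approximation: each cross-attention block is a standard decoder layer.
        self.cross_blocks = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, ffn, batch_first=True) for _ in range(3)]
        )
        self.out = nn.Sequential(nn.Linear(dim, code_dim), nn.LayerNorm(code_dim), nn.Tanh())

    def forward(self, patches):  # patches: (B, N, patch_dim) from a 256x256 image
        x = self.patch_embed(patches)
        x = torch.cat([self.cls.expand(x.shape[0], -1, -1), x], dim=1)
        for blk in self.self_blocks:
            x = blk(x)
        q = self.pos_proj(self.triplane_pos).expand(x.shape[0], -1, -1)
        for blk in self.cross_blocks:
            q = blk(q, x)  # queries attend to the image tokens (keys/values)
        return self.out(q)  # (B, 768, 32) tri-plane latents e3D
```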
Codebook K.
Our discrete codebook contains 2048 entries with a lookup dimension of 32, i.e., each entry is 32-dimensional. The codebook is initialized using fan-in variance scaling with a scale of 1 and a uniform distribution. Similar to Yu et al. [138], we use l2-normalized codes, i.e., we apply l2 normalization to both the encoded tri-plane latents e3D and the codebook entries in K.
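A minimal sketch of the l2-normalized codebook lookup with a straight-through estimator is shown below; the commitment/codebook loss terms and the fan-in initialization are omitted, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def vq_lookup(e3d, codebook):
    """Nearest-neighbor lookup with l2-normalized codes (illustrative sketch).
    e3d: (B, 768, 32) encoded tri-plane latents; codebook: (2048, 32) entries in K."""
    z = F.normalize(e3d, dim=-1)        # l2-normalize the encoded latents
    c = F.normalize(codebook, dim=-1)   # l2-normalize the codebook entries
    # For unit-norm vectors, minimizing squared l2 distance is equivalent to
    # maximizing the dot product, so the nearest code is the argmax below.
    indices = torch.argmax(z @ c.t(), dim=-1)   # (B, 768) discrete token ids
    quantized = c[indices]                      # (B, 768, 32) quantized latents
    # Straight-through estimator so gradients still reach the encoder.
    quantized = z + (quantized - z).detach()
    return quantized, indices
```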
Decoder Gθ - Token Transformer.
The token transformer contains 3 self-attention transformer blocks. A CLS token is appended to the tri-plane latents, and positional encoding is used to represent 3D spatial locations. Each transformer block has 8 heads, an embedding dimension of 512 and a hidden dimension of 2048. Finally, the output of the transformer blocks is passed through a fully-connected layer with Layer Normalization [153] and tanh activation into 16 × 16 × 3 tokens of 256 dimensions (plus an additional CLS token).
Decoder Gθ - Style-based Generator.
We first use a mapping network [132] to map the aforementioned CLS token into the intermediate latent space W. The mapping network contains 8 fully-connected layers with a hidden dimension of 512 and outputs a 512-dimensional vector w. Following Karras et al. [132], we use w to modulate a style-based generator. For each plane in our tri-plane representation (the xy, xz and yz planes), we use a generator containing three up-sampling blocks with hidden dimensions of 512, 256 and 128, respectively. Finally, the style-based generators output tri-plane feature maps with 32 feature channels.
Decoder Gθ - Volume Rendering.
Our volume renderer is implemented as 2 fully-connected layers, similar to Chan et
al. [52]. The decoder takes as input the 32-dimensional aggregated feature vector from the style-based generator. For each
pixel, we query 40 points, with 24 uniformly sampled and 16 importance-sampled. We use MipNeRF [154] as our volume
rendering module. Volume rendering is performed at a resolution of 128 × 128.
Discriminator.
We use a StyleGAN2 [132] discriminator with hidden dimensions 16, 32, 64, 128, 256. We use R1 regular-
ization with γ = 1.
Stage-2 Modeling Mψ.
We follow a shallower version of the network architecture and training setup introduced in [62]. We use 12 layers, 8 attention heads, an embedding dimension of 768 and a hidden dimension of 3072. The model uses learnable positional embeddings, Layer Normalization, and truncated normal initialization (stddev = 0.02). We use the following training hyperparameters: label smoothing of 0.1, dropout rate of 0.1, and the Adam optimizer [148] with β1 = 0.9 and β2 = 0.96. We use a cosine masking schedule. During inference, token synthesis is performed in 10 steps.
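The snippet below sketches the cosine masking schedule and the 10-step iterative token synthesis in the MaskGIT style. The transformer interface (token ids in, per-token logits over the 2048 codebook entries out, with id 2048 reserved for the mask token), the greedy confidence-based re-masking and the absence of sampling temperature are simplifying assumptions made for illustration.

```python
import math
import torch

def cosine_mask_ratio(step, total_steps=10):
    """Fraction of tokens that remain masked after `step` of `total_steps` steps."""
    return math.cos(0.5 * math.pi * step / total_steps)

@torch.no_grad()
def iterative_decode(transformer, seq_len=768, vocab=2048, mask_id=2048, steps=10):
    """Illustrative MaskGIT-style sampler; `transformer` is a stand-in for M_psi."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = transformer(tokens)                        # (1, seq_len, vocab)
        probs = logits.softmax(-1)
        sampled = torch.multinomial(probs.view(-1, vocab), 1).view(1, seq_len)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-committed tokens keep infinite confidence so they stay fixed.
        conf = torch.where(tokens == mask_id, conf, torch.full_like(conf, float("inf")))
        tokens = torch.where(tokens == mask_id, sampled, tokens)
        # Re-mask the least confident predictions according to the cosine schedule.
        num_masked = int(cosine_mask_ratio(step, steps) * seq_len)
        if num_masked > 0:
            remask = conf.topk(num_masked, largest=False).indices
            tokens[0, remask[0]] = mask_id
    return tokens
```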
B.2. Aligning Tri-plane to Object Scale
Figure 9. Illustration of using a uniform tri-plane versus a scale-aligned tri-plane.
Since vehicles can have drastically different scales along their x, y, z directions, using a naive uniform-scale tri-plane to cover the object leaves much of the representational capacity under-utilized. As illustrated in the top row of Fig. 9, if we cover a normal sedan with a uniform-size tri-plane, most of the entries in the tri-plane features correspond to empty space. The problem becomes more severe for longer-tail instances such as trucks and buses, where the scale ratios among x, y, z become even more extreme.
To encourage more efficient usage of the tri-plane features, we align the tri-plane latents to the object scale during the coordinate-feature orthographic projection step. As illustrated in the bottom row of Fig. 9, when querying the feature of a coordinate p ∈ [0, 1]³ ⊂ ℝ³, given the object scale (s_x, s_y, s_z) we simply rescale p as

\hat{p} := p / [s_x, s_y, s_z] \in [0, 1/s_x] \times [0, 1/s_y] \times [0, 1/s_z],

and query the tri-plane features using \hat{p}. The orthographic projection follows the same tri-plane grid-sampling and aggregation as in prior works [52, 133, 134].
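A minimal sketch of the scale-aligned tri-plane query is given below, assuming the three planes are stored as 2D feature maps and aggregated by averaging (summation is another common choice in prior works); the plane/axis conventions and the handling of queries outside the object box are illustrative.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, p, scale):
    """Scale-aligned tri-plane lookup (illustrative). planes: dict of (B, C, H, W)
    feature maps for the xy, xz, yz planes; p: (B, N, 3) points in [0, 1]^3;
    scale: (B, 3) object scale (s_x, s_y, s_z)."""
    p_hat = p / scale.unsqueeze(1)          # align the query to the object scale
    grid = 2.0 * p_hat - 1.0                # grid_sample expects coords in [-1, 1]
    feats = []
    for name, dims in (("xy", (0, 1)), ("xz", (0, 2)), ("yz", (1, 2))):
        uv = grid[..., dims].unsqueeze(1)                         # (B, 1, N, 2)
        f = F.grid_sample(planes[name], uv, align_corners=True)   # (B, C, 1, N)
        feats.append(f.squeeze(2).permute(0, 2, 1))               # (B, N, C)
    return torch.stack(feats, dim=0).mean(0)                      # aggregate planes
```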
In the basic GINA-3D pipeline without scaled tri-plane features, the model learns to handle object scale implicitly. In our scaled box model variations, the model leverages the object scale only in the tri-plane feature orthographic projection step, and implicitly learns to produce feature maps that align with the object scale. As shown in the main paper, this design greatly improves model performance. We leave feeding object scale information explicitly to the model as a future direction to explore.
B.3. Evaluation Metrics
We discuss in detail the metrics used for our quantitative evaluations.
Image Quality.
To evaluate image quality, we employ two metrics, Fréchet Inception Distance (FID) [149] and Mask Floater-Over Union (Mask FOU), over 50K generated images. FID [149] is commonly used to evaluate the quality of 2D images. The generated images are encoded using a pretrained Inception v3 [155] model, and the last pooling layer's output is used as the final encoding. The FID metric is computed as:
\mathrm{FID}(I_g, I_v) = \|\mu_g - \mu_v\|_2^2 + \mathrm{Tr}\big[\Sigma_g + \Sigma_v - 2(\Sigma_g \Sigma_v)^{1/2}\big]    (9)
where Tr denotes the trace operator, µ_g, Σ_g are the mean and covariance matrix of the generated image encodings, and µ_v, Σ_v are the mean and covariance matrix of the validation image encodings.
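Eq. (9) can be computed directly from the two sets of Inception features, for example as sketched below (using scipy's matrix square root; numerical safeguards such as adding a small diagonal offset are omitted).

```python
import numpy as np
from scipy import linalg

def fid(feat_g, feat_v):
    """FID between generated and validation Inception features of shape (N, D), Eq. (9)."""
    mu_g, mu_v = feat_g.mean(0), feat_v.mean(0)
    sigma_g = np.cov(feat_g, rowvar=False)
    sigma_v = np.cov(feat_v, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_g @ sigma_v, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_g - mu_v) ** 2) + np.trace(sigma_g + sigma_v - 2.0 * covmean))
```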
We additionally measure whether the generated texture forms a single full object, implemented by checking whether the generated pixels span a single connected region; we compute the percentage of pixels that do not belong to the largest connected component. Since all images from the baselines and GINA-3D are generated with a white background, we find connected components using the findContours function from OpenCV [156] and use contourArea to find the largest connected component, which we denote C_l. We then use the aggregated density (alpha) values to find the entire shape's projection onto the image, which we denote S. Mask FOU is simply the mean over the entire generated image set (as a percentage):
\mathrm{Mask\,FOU}(I_g) = \frac{1}{|I_g|} \sum_{i \in I_g} \Big(1 - \frac{\mathrm{Area}(C_{l,i})}{\mathrm{Area}(S_i)}\Big)    (10)
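A sketch of this computation with OpenCV is shown below; the 0.5 alpha threshold is an illustrative choice, and using contourArea for the largest component is a polygon-area approximation of the pixel count.

```python
import cv2  # OpenCV >= 4
import numpy as np

def mask_fou(alpha_maps, thresh=0.5):
    """Mask FOU (Eq. 10) over rendered alpha maps, each of shape (H, W) in [0, 1]."""
    scores = []
    for alpha in alpha_maps:
        mask = (alpha > thresh).astype(np.uint8)       # projection S of the shape
        total_area = float(mask.sum())
        if total_area == 0:
            continue
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            continue
        largest = max(cv2.contourArea(c) for c in contours)   # largest component C_l
        scores.append(1.0 - largest / total_area)
    return 100.0 * float(np.mean(scores))              # reported as a percentage
```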
Image Diversity.
We want to evaluate the semantic diversity of the generated images, which we measure with the Coverage (COV) score and Minimum Matching Distance (MMD) [94] using pretrained CLIP [150] embeddings. Specifically, the Coverage (COV) score measures the fraction of images in the validation set that are matched to at least one of the images in the generated set. Formally, it is defined as:
\mathrm{COV}(I_g, I_v) = \frac{\big|\{\operatorname{argmin}_{i \in I_v} \|\mathrm{CLIP}(i) - \mathrm{CLIP}(j)\|_2^2 \mid j \in I_g\}\big|}{|I_v|}    (11)
Intuitively, COV uses the CLIP embedding distance to perform nearest-neighbor matching for each generated image against the validation set, and measures diversity by checking what percentage of the validation set is matched as a nearest neighbor. However, COV is only one side of the story: a set of generated images can achieve a high COV score with purely random images that happen to be matched across the validation set. This issue is alleviated by incorporating the Minimum Matching Distance (MMD), which measures whether the nearest-neighbor matching yields high-quality pairs:
\mathrm{MMD}(I_g, I_v) = \frac{1}{|I_v|} \sum_{i \in I_v} \min_{j \in I_g} \|\mathrm{CLIP}(i) - \mathrm{CLIP}(j)\|_2^2    (12)
Intuitively, MMD measures the average distance between images in the validation set and their corresponding nearest neighbors in the generated set. MMD correlates well with how faithful (with respect to the validation set) the elements of the generated set are [94].
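Given CLIP embeddings of the generated and validation sets, Eqs. (11) and (12) can be computed as in the sketch below; the dense pairwise-distance matrix is written out for clarity and would be chunked in practice for 50K images.

```python
import numpy as np

def cov_and_mmd(emb_g, emb_v):
    """COV (Eq. 11) and MMD (Eq. 12) from CLIP embeddings emb_g (Ng, D), emb_v (Nv, D)."""
    # Pairwise squared l2 distances between validation and generated embeddings.
    d = ((emb_v[:, None, :] - emb_g[None, :, :]) ** 2).sum(-1)   # (Nv, Ng)
    # COV: fraction of validation images matched as the nearest neighbor of
    # at least one generated image.
    matched = np.unique(d.argmin(axis=0))
    cov = len(matched) / emb_v.shape[0]
    # MMD: average distance from each validation image to its closest generated image.
    mmd = float(d.min(axis=1).mean())
    return cov, mmd
```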
Geometry Quality.
Due to the lack of 3D geometry ground truth for in-the-wild data, we measure geometry quality using an existing metric, the Consistency score from Or-El et al. [79], and a Mesh Floater-Over Union (Mesh FOU) which measures whether the geometry forms a single connected object. The Consistency score measures whether the implicit fields are evaluated at consistent 3D locations, which is an important characteristic for view-consistent renderings [79]. In practice, it measures depth-map consistency across viewpoints by back-projecting depth maps into 3D space. For each model, we normalize the object's longest edge to a length of 10 for numerical clarity, and compare two depth maps at an angle difference of 45 degrees along the z-axis (yaw). We compute the consistency across depth maps for all images in the generated set, denoted D_g:
\mathrm{Consistency}(D_g) = \frac{1}{|D_g|} \sum_{i \in D_g} \mathrm{CD}(i, i_{\mathrm{rot}})    (13)
where i_rot represents the depth map after rotating the viewpoint by 45 degrees along the z-axis.
We additionally measure whether each generated shape forms a single full object by checking whether the generated mesh consists of a single connected component; we compute the percentage of mesh surface area that does not belong to the largest component. We use surface area rather than volume because we observe that volume computation is unstable for non-watertight meshes. For each generated mesh S, we use the split function from Trimesh [157] to find the largest connected component, which we denote C_l. Mesh FOU is simply the mean over the entire generated mesh set M_g (as a percentage):
\mathrm{Mesh\,FOU}(M_g) = \frac{1}{|M_g|} \sum_{i \in M_g} \Big(1 - \frac{\mathrm{Area}(C_{l,i})}{\mathrm{Area}(S_i)}\Big)    (14)
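A sketch of Eq. (14) using Trimesh's split function is shown below.

```python
import numpy as np
import trimesh

def mesh_fou(meshes):
    """Mesh FOU (Eq. 14): mean fraction of surface area outside the largest
    connected component, over a list of trimesh.Trimesh objects."""
    scores = []
    for mesh in meshes:
        parts = mesh.split(only_watertight=False)   # connected components
        if len(parts) == 0 or mesh.area == 0:
            continue
        largest = max(p.area for p in parts)        # surface area of C_l
        scores.append(1.0 - largest / mesh.area)    # Area(S) = total surface area
    return 100.0 * float(np.mean(scores))           # reported as a percentage
```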
Figure 10. Illustrations of aggregated point clouds.
Geometry Diversity.
We again use Coverage (COV) and Minimum Matching Distance (MMD) to measure diversity. However, due to the lack of ground-truth full 3D shapes for in-the-wild data, our metric needs to be designed more carefully. A source of accurate but partial geometry can be obtained by aggregating LiDAR point-cloud scans of a given instance across different observations; we then uniformly subsample 2048 points from the aggregated point cloud. We show examples of aggregated point clouds in Fig. 10. As shown in the figure, the aggregated point clouds are indicative of the underlying shapes, but are incomplete. Chamfer distance, a common metric for shape similarity, computes bi-directional nearest neighbors. However, due to this incompleteness, finding the nearest neighbors of the generated points in the partial point cloud would only result in noisy matches. Therefore, we do not measure the two-sided Chamfer distance, but only the distance from the validation point clouds to their nearest neighbors in the generated mesh. Formally, we have:
\mathrm{COV}(M_g, P_v) = \frac{\big|\{\operatorname{argmin}_{i \in P_v} D(i, j) \mid j \in M_g\}\big|}{|P_v|}    (15)

\mathrm{MMD}(M_g, P_v) = \frac{1}{|P_v|} \sum_{i \in P_v} \min_{j \in M_g} D(i, j)    (16)

D(i, j \mid i \in P_v, j \in M_g) = \frac{1}{|i|} \sum_{x \in i} \min_{y \in j} \|x - y\|_2^2    (17)
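The one-sided distance D(i, j) of Eq. (17) can be approximated as below by sampling points on the generated mesh surface (here with Trimesh); the brute-force distance computation and the 2048-point sample count mirror the subsampling above and are for clarity only.

```python
import numpy as np

def one_sided_chamfer(points_v, mesh_g, n_samples=2048):
    """Approximate D(i, j) in Eq. (17): average squared distance from the aggregated
    LiDAR points i (points_v, shape (N, 3)) to the generated mesh j, where the mesh
    is represented by n_samples points sampled on its surface (mesh_g: trimesh.Trimesh)."""
    samples = mesh_g.sample(n_samples)                               # (n_samples, 3)
    d = ((points_v[:, None, :] - samples[None, :, :]) ** 2).sum(-1)  # (N, n_samples)
    return float(d.min(axis=1).mean())
```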
B.4. Conditional Synthesis
We showcased in the main paper various conditional synthesis tasks, for which we provide more details here.
Discrete Conditions.
We feed discrete conditions (object class, time-of-day) as additional tokens to MaskGIT. Specifically, we increase the vocabulary size by the number of classes in the discrete condition. Object class has 4 options: car, truck, bus and other; time-of-day is a binary variable (day versus night). The vocabulary thus becomes 2048 + 4 for object class, and 2048 + 2 for time-of-day. We feed the conditional input as an additional token to the 768 tri-plane latents by concatenating the two, resulting in an input sequence of length 769. The sequence is then fed into MaskGIT for masked token prediction, as in the unconditional case.
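As a sketch, prepending a discrete condition as an extra token can look like the following; whether the condition token is prepended or appended, and the exact id layout beyond the 2048 codebook entries, are illustrative choices.

```python
import torch

def add_class_token(token_ids, class_id, vocab=2048):
    """Attach a discrete condition (e.g. object class in {0, ..., 3}) as one extra
    token whose id lies beyond the codebook; the sequence grows from 768 to 769."""
    cond = torch.full((token_ids.shape[0], 1), vocab + class_id,
                      dtype=token_ids.dtype, device=token_ids.device)
    return torch.cat([cond, token_ids], dim=1)
```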
Continuous Conditions.
Alternatively, we feed continuous conditions to MaskGIT by concatenating the conditional input with the output of an intermediate MaskGIT layer. Specifically, MaskGIT first generates a word embedding for each token in the sequence. We pass the continuous condition through a fully-connected layer and concatenate the output with each token's word embedding. The concatenated embedding is then passed through the rest of the network. To synthesize samples conditioned on object semantics, we feed a semantic embedding from a pre-trained DINO model [146]. To condition on object scale, we pass in a positional embedding of the object scale, using standard cosine and sine positional embeddings of degree 6.
Image-conditioned Assets Variations.
Given our mask-based iterative sampling stage, we can generate image-conditioned assets with variations. We first use the stage-1 model to perform reconstruction, retrieving a full set of predicted tri-plane latents. We then generate variations of the reconstructed instance by randomly masking out tri-plane latents. The degree of variation can be controlled by masking out different numbers of tokens. When masking 90% of the tokens, we observe that the variations mostly appear as different textures on the generated assets. When masking out 99% of the tokens, we see more significant changes in object shape, while the general object class remains the same. We believe that how to better control the variation process is an interesting direction to explore in the future.

                                 L_GAN   L_LPIPS   L_α    L_VQ   No VQ   |K| = 2^10   |K| = 2^12   Full
Generative Metric (FID)          65.1    83.0      80.3   64.7   -       66.2         58.9         59.5
Recon. Metric (ℓ2 input view)    1.78    1.92      1.44   1.62   1.01    2.21         1.81         1.55
Recon. Metric (ℓ2 cross view)    2.42    2.28      1.55   2.14   1.71    2.28         2.30         1.83

Table 3. We perform various ablation studies on 1) removing each term of our overall loss function (the L_GAN, L_LPIPS, L_α and L_VQ columns); 2) removing vector quantization entirely; 3) different codebook sizes |K|. We further report the stage 1 model's reconstruction quality using ℓ2 losses on input views as well as novel views. The original table color-codes the 1st, 2nd and 3rd best results in each row.
Appendix C. Additional GINA Visualizations
We present additional visualizations of the GINA-3D model in Fig. 11.
Appendix D. Ablation on Loss Terms and Stage 1 Evaluation
Ablation study.
In this experiment, we use our scaled box model trained with LiDAR supervision as the base model. We conduct ablation studies by removing each loss term, removing quantization entirely, and training with different codebook sizes. As shown in Table 3, the ablation results justify each loss term introduced in the paper, as removing any one of them leads to a higher FID compared to the full model. This finding is consistent with Esser et al. [60], which suggests that LPIPS is important for visual fidelity. In addition, a larger codebook K (2^12) has only a marginal impact in our setting.
Evaluating stage 1 model.
We report the ℓ2 reconstruction loss (×10^-2) on the input and novel views of unseen instances. The model obtains better reconstruction performance when quantization is removed entirely (No VQ), but this deprives stage 2 generative training of the discrete codebook. While generation and reconstruction quality correlate to some extent, the performance rankings (color-coded in Table 3) differ between them.
Appendix E. Discussions on Baselines
Generating the full images.
Directly modeling full images of in-the-wild data poses significant challenges. In the early stage of the project, we experimented with directly applying GAN-based approaches to full images. As illustrated in Fig. 12-a,b, feeding full images without explicit modeling of occlusion makes learning challenging on our benchmark. For EG3D, we observed that training with unmasked images leads to training collapse, due to the absence of foreground and background modeling. For example, the generated image samples in Fig. 12-b lack diversity in shape and appearance (e.g., color).
We clarify that a key difference from pure GAN-based approaches (GIRAFFE and EG3D) is that our approach has two training stages, and the masked loss is only applied in the first stage to reconstruct the input. In other words, the masked loss cannot be directly applied to existing GAN-based approaches, as the corresponding object mask for each generated RGB image is not observed during adversarial (encoder-free) training. Alternatively, one can still apply the masked loss by factorizing RGB, object silhouette and occlusion. We tried many variations of this idea in the early stage of the project to no avail, as learning disentangled factors proved challenging for adversarial training. We provide such examples in Fig. 12-c. In this experiment, we extended EG3D by generating occlusion masks with a separate branch. However, training became very unstable, and we were not able to produce improved results beyond the original EG3D on our benchmark. As shown, the model fails to disentangle object silhouette and occlusion: it still generates partial shapes, while producing some plausible foreground occlusions. In fact, occlusion is even more challenging to generate explicitly on our data, where object silhouettes and occlusion masks are entangled, as the outcome depends on the view and layout.
Generating the object images.
Whitening out non-object regions has been used by EG3D (see ShapeNet-Cars in its supplementary material) and GET3D. It blends white into pixels with α < 1 during neural rendering, which implicitly supervises α. Such a setup separates object pixels from the surroundings and focuses generation on object modeling. We follow this design, and found in our experiments that the baselines fail to generate a separated target object without whitening out.
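The whitening-out operation amounts to compositing the rendered color over a white background using the accumulated opacity, e.g.:

```python
import torch

def composite_white(rgb, alpha):
    """Blend the rendered color toward white where accumulated density is low.
    rgb: (..., 3) in [0, 1]; alpha: (..., 1) accumulated opacity from volume rendering."""
    return rgb * alpha + (1.0 - alpha)  # white background = 1.0, implicitly supervising alpha
```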
We provide additional details about the baseline methods GIRAFFE and EG3D in Fig. 13. We noticed that the learned GIRAFFE models are capable of generating vehicle-like patches, but with viewpoints, occlusions and identities entangled in the latent space. For example, we generate a pair of images (Fig. 13(a)) by varying the viewpoint variable while keeping the identity latent variable fixed. It turns out that the generations are not easily controllable by the viewpoint variable, and the vehicle identities often change across views. The entangled representation makes the extracted meshes not very meaningful for the GIRAFFE baseline on our benchmark. Additionally, geometry extraction becomes even harder as the rendering mask is defined at a low resolution of 16².

Figure 11. Additional qualitative results of GINA-3D.

Figure 12. Additional generation results trained on full images: (a) image samples from GIRAFFE trained on full images (view, view+45°); (b) image samples from EG3D trained on full images (random views); (c) image samples from an augmented EG3D model that jointly generates object RGB, background RGB and occlusion masks (random views), with the object RGB and its corresponding occlusion mask visualized in alternating columns. Results suggest that it is difficult for the model to disentangle object shape and occlusion.

Figure 13. Additional generation results trained on masked images: (a) image samples from the GIRAFFE baseline reported in the main paper, trained on masked images (fixed views); (b) image samples from the EG3D baseline reported in the main paper, trained on masked images (fixed views). Results suggest that it is difficult for GIRAFFE to disentangle rotation, and both baselines show significant occlusion artifacts.

Figure 14. Example mesh extractions from EG3D and GINA-3D: (a) a random batch of 16 EG3D extracted meshes; (b) a random batch of 16 GINA-3D extracted meshes.
Appendix F. Extracted Meshes
As mentioned in the main text, we use marching cubes [151] with a density threshold of 10 to extract meshes for geometry evaluation. We showcase random samples of extracted meshes from EG3D and GINA-3D, showing 16 examples each in Fig. 14a-14b. As we can see, EG3D meshes can contain artifacts such as missing parts of the shape (row 3, right two), and they show relatively little diversity. GINA-3D not only preserves complete shapes, but also demonstrates greater diversity, including more shape and semantic variation (mini-van in row 2, column 4; bus in row 4, column 4). These observations are consistent with our quantitative evaluations.
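For reference, mesh extraction from a sampled density volume at the threshold of 10 can be sketched as follows; the density grid and voxel size are illustrative placeholders for however the implicit field is sampled.

```python
import trimesh
from skimage import measure

def extract_mesh(density_grid, level=10.0, voxel_size=1.0):
    """Run marching cubes on a sampled density volume and return a trimesh.Trimesh."""
    verts, faces, normals, _ = measure.marching_cubes(density_grid, level=level)
    return trimesh.Trimesh(vertices=verts * voxel_size, faces=faces, vertex_normals=normals)
```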
However, we do observe that GINA-3D meshes can be non-watertight and contain holes. We hope to address such problems in future work. We believe that by incorporating other representations, such as Signed Distance Fields (SDFs), the mesh quality can be further improved.