GINA-3D: Learning to Generate Implicit Neural Assets in the Wild

Bokui Shen 1∗  Xinchen Yan 2  Charles R. Qi 2  Mahyar Najibi 2  Boyang Deng 1,2†  Leonidas Guibas 3  Yin Zhou 2  Dragomir Anguelov 2
1 Stanford University, 2 Waymo LLC, 3 Google

Figure 1. Leveraging in-the-wild data for generative asset modeling embodies a scalable approach for simulation. GINA-3D uses real-world driving data to perform various synthesis tasks for realistic 3D implicit neural assets. Left: Multi-sensor observations in the wild. Middle: Asset reconstruction and conditional synthesis. Right: Scene composition with background neural fields [1].

Abstract

Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative model techniques have shown promising progress in addressing such challenges by learning 3D assets using only plentiful 2D images, but they still suffer limitations as they leverage either human-curated image datasets or renderings from manually-created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting variations and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.

∗ Work done during an internship at Waymo. † Work done at Waymo.

1. Introduction

Learning to perceive, reason, and interact with the 3D world has been a longstanding challenge in the computer vision and robotics community for decades [2–9]. Modern robotic systems [10–16] deployed in the wild are often equipped with multiple sensors (e.g. cameras, LiDARs, and Radars) that perceive the 3D environments, followed by an intelligent unit for reasoning and interacting with the complex scene dynamics. End-to-end testing and validation of these intelligent agents in real-world environments is difficult and expensive, especially in safety-critical and resource-constrained domains like autonomous driving. On the other hand, the use of simulated data has proliferated over the last few years to train and evaluate intelligent agents under controlled settings [17–27] in a safe, scalable and verifiable manner.
Such developments were fueled by rapid advances in computer graphics, including rendering frameworks [28–30], physical simulation [31, 32] and large-scale open-sourced asset repositories [33–39]. A key concern is to create realistic virtual worlds that align in asset content, composition, and behavior with real distributions, so as to give the practitioner confidence that using such simulations for development and verification can transfer to performance in the real world [40–48]. However, manual asset creation faces two major obstacles. First, manual creation of 3D assets requires dedicated efforts from engineers and artists with 3D domain expertise, which is expensive and difficult to scale [26]. Second, the real-world distribution contains diverse examples (including interesting rare cases) and is also constantly evolving [49, 50].

Recent developments in generative 3D modeling offer new perspectives to tackle the aforementioned obstacles, as they allow producing additional realistic but previously unseen examples. A sub-class of these approaches, generative 3D-aware image synthesis [51, 52], holds significant promise since it enables 3D modeling from partial observations (e.g. image projections of the 3D object). Moreover, many real-world robotic applications already capture, annotate and update multi-sensor observations at scale. Such data thus offer an accurate, diverse, task-relevant, and up-to-date representation of the real-world distribution, which the generative model can potentially capture. However, existing works use either human-curated image datasets with clean observations [53–58] or renderings from synthetic 3D environments [33, 36].

Scaling generative 3D-aware image synthesis models to the real world faces several challenges, as many factors are entangled in the partial observations. First, bridging in-the-wild images from a simple prior without 3D structures makes learning difficult. Second, unconstrained occlusions entangle the object of interest and its surroundings in pixel space, which is hard to disentangle in a purely unsupervised manner. Lastly, the above challenges are compounded by a lack of effort in constructing an asset-centric benchmark for sensor data captured in the wild.

In this work, we introduce a 3D-aware generative transformer for implicit neural asset generation, named GINA-3D (Generative Implicit Neural Assets). To tackle the real-world challenges, we propose a novel 3D-aware Encoder-Decoder framework with a learned structured prior. Specifically, we embed a tri-plane structure into the latent prior (or tri-plane latents) of our generative model, where each entry is parameterized by a discrete representation from a learned codebook [59, 60]. The Encoder-Decoder framework is composed of a transformation encoder and a decoder with neural rendering components. To handle unconstrained occlusions, we explicitly disentangle object pixels from their surroundings with an occlusion-aware composition, using pseudo labels from an off-the-shelf segmentation model [61]. Finally, the learned prior of tri-plane latents from a discrete codebook can be used to train conditional latent sampling models [62]. The same codebook can be readily applied to various conditional synthesis tasks, including object scale, class, semantics, and time-of-day. To evaluate our model, we construct a large-scale object-centric benchmark from multi-sensor driving data captured in the wild.
We first extract over 1.2M images of diverse variations for vehicles and pedestrians from the Waymo Open Dataset [14]. We then augment the benchmark with long-tail instances from real-world driving scenes, including rare objects like construction equipment, cable cars, school buses and garbage trucks. We demonstrate through extensive experiments that GINA-3D outperforms state-of-the-art 3D-aware generative models, measured by image quality, geometry consistency, and geometry diversity. Moreover, we showcase example applications of various conditional synthesis tasks and shape editing results by leveraging the learned 3D-aware codebook. The benchmark is publicly available through waymo.com/open.

2. Related Work

We discuss the relevant work on generative 3D-aware image synthesis, 3D shape modeling, and applications in autonomous driving.

Generative 3D-aware image synthesis. Learning generative 3D-aware representations from image collections has been increasingly popular for the past decade [63–69]. Early work explored image synthesis from disentangled factors such as learned pose embeddings [64, 66, 69] or compact scene representations [65, 67]. Representing the 3D structure as a compressed embedding, this line of work approached image synthesis by upsampling from the embedding space with a stack of 2D deconvolutional layers. Driven by progress in differentiable rendering, there have been efforts [70–73] to bake explicit 3D structures into the generative architectures. These efforts, however, are often confined to a coarse 3D discretization due to memory consumption. Moving beyond explicit structures, more recent work leverages neural radiance fields to learn implicit 3D-aware structures [51, 52, 74–82] for image synthesis. Schwarz et al. [74] introduced Generative Radiance Fields (GRAF), which disentangle the 3D shape, appearance and camera pose of a single object without occlusions. Built on top of GRAF, Niemeyer et al. [51] proposed the GIRAFFE model, which handles scenes involving multiple objects by using a compositional 3D scene structure. Notably, the query operation in the volumetric rendering becomes computationally heavy at higher resolutions. To tackle this, Chan et al. [52] introduced hybrid explicit-implicit 3D representations with tri-plane features (EG3D), which showcase image synthesis at higher resolutions. Concurrently, [83] and [84] pioneer high-resolution unbounded 3D scene generation on ImageNet using tri-plane representations, where [84] uses a vector-quantized framework and [83] uses a GAN framework. Our work is designed for applications in autonomous driving sensor simulation, with an emphasis on object-centric modeling.

Generative 3D shape modeling. Generative modeling of complete 3D shapes has also been extensively studied, including efforts on synthesizing 3D voxel grids [85–93], point clouds [94–96], surface meshes [97–103], shape primitives [104, 105], and implicit functions or hybrid representations [103, 106–112] using various deep generative models. Shen et al. [111] introduced a differentiable explicit surface extraction method called Deep Marching Tetrahedra (DMTet) that learns to reconstruct 3D surface meshes with arbitrary topology directly. Built on top of the EG3D [52] tri-plane features for image synthesis, Gao et al. [103] proposed an extension that is capable of generating textured surface meshes, using DMTet for geometry generation and tri-plane features for texture synthesis.
The existing efforts assume access to accurate multi-view silhouettes (often from complete ground-truth 3D shapes), which does not reflect the real challenges present in data captured in the wild.

Assets modeling in driving simulation. Simulated environment modeling has drawn great attention in the autonomous driving domain. In a nutshell, the problem can be decomposed into asset creation (e.g., dynamic objects and background), scene generation, and rendering. Early work leverages artist-created objects and background assets to build virtual driving environments [18, 20, 113] using classic graphics rendering pipelines. While being able to generate virtual scenes with varying configurations, these methods produce scenes with limited diversity and a significant reality gap. Many recent works explored different aspects of data-driven simulation, including image synthesis [114–117], assets modeling [47, 48, 118–121], scene generation [49, 122, 123], and scene rendering [1, 124–126]. In particular, Chen et al. [48] and Zakharov et al. [119] performed explicit texture warping or implicit rendering from a single-view observation for each vehicle object. Therefore, their asset reconstruction quality is sensitive to occlusions and bounded by the view angle of a single observation. Building upon these efforts, more recent work including Müller et al. [121] and Kundu et al. [125] approached object completion with global or instance-specific latent codes, representing each object asset under the Normalized Object Coordinate Space (NOCS). In comparison, the latent codes in our proposed model have 3D tri-plane structures, which offer several benefits in learning and applications. More importantly, we can generate previously unseen 3D assets, which is essentially different from object reconstruction.

3. Generative Implicit Neural Assets

We propose GINA-3D, a scalable framework to acquire 3D assets from in-the-wild data (Sec. 3.1). Core to our framework is a novel 3D-aware Encoder-Decoder model with a learned structured prior (Sec. 3.2). The learned structured prior can facilitate various downstream applications with an iterative latent sampling model (Sec. 3.3) per application.

3.1. Background

Given a collection of images containing 3D objects captured in the wild X = {x} (x is an image data sample), 3D-aware image synthesis [51, 52, 63–79, 81] aims to learn a distribution of 3D objects. The core idea is to represent each 3D object as a hidden variable h within a generative model and further leverage a neural rendering module NR to synthesize a sample image at viewpoint v through x = NR(h, v). To model the hidden 3D structure h, the formulation introduces a low-dimensional space from which latent variables z (typically Gaussian) are sampled, and connects h and z through a generator h = f_θ(z), parameterized by θ.

Pr(x, z | v) = Pr(x | z, v) · Pr(z)   (1)

The probabilistic formulation is shown in Fig. 2-a and Eq. 1. Here, Pr(x | z, v) is the conditional probability of the image given the latent variables and viewpoint, where Pr(z) and Pr(v) are the prior distributions. As the latent variable z models the 3D objects, one can sample and extract assets for downstream applications. The assets can be either injected into neural representations of scenes [1, 125], or transformed into explicit 3D structures such as textured meshes for traditional renderers [20] or geometry-aware compositing [48, 124].
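To make the sampling process behind Eq. 1 concrete, the following minimal sketch draws an asset image under this formulation: sample z from a simple Gaussian prior, map it to a hidden 3D code with a generator, and render it from a viewpoint. The module sizes, the toy renderer, and all names (latent_dim, f_theta, neural_render) are illustrative assumptions, not any method's actual architecture.

```python
import torch

# Illustrative dimensions; the paper's actual hidden structure h is a tri-plane (Sec. 3.2).
latent_dim, hidden_dim = 128, 256

# f_theta: maps a latent sample z to the hidden 3D structure h (Eq. 1 discussion).
f_theta = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, hidden_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_dim, hidden_dim),
)

def neural_render(h: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Stand-in for NR(h, v): a real renderer would decode h into a radiance
    field and volume-render it from viewpoint v."""
    toy = torch.einsum("bd,bk->bdk", h, v).mean(dim=-1)   # toy code/viewpoint interaction
    return toy.reshape(-1, 16, 16)                         # fake 16x16 "image"

z = torch.randn(4, latent_dim)          # z ~ Pr(z), a simple Gaussian prior
v = torch.randn(4, 3)                   # sampled camera viewpoints
x = neural_render(f_theta(z), v)        # x = NR(f_theta(z), v)
```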
Figure 2. Probabilistic views: (a) controlled view; (b) real-world data.

The challenges in the wild. While human-curated image datasets [53–58] or synthetically generated images with clean backgrounds [33, 36, 68, 103] fit the formulation in Eq. 1, real-world distributions have unconstrained occlusions due to complex object-scene entanglement. For example, a moving vehicle can be easily occluded by another object (e.g. traffic cones and cars) in an urban driving environment, which further entangles object and scene in the pixel space. Moreover, environmental lighting and object diversity lead to a more complex underlying distribution. As illustrated in Fig. 2-b and Eq. 2, these challenges yield a new probabilistic formulation in which the hidden structure h, the surrounding scene S and the viewpoint v jointly contribute to the occlusion m and the visible pixels on the object x through x = NR(h, v) ⊙ m(S, v, h).

Pr(x, z | v, S) = Pr(x | z, v, S) · Pr(z)   (2)

Prior art such as GIRAFFE [51] tackles the challenges with two assumptions: (1) the scene is composed of a limited number of same-class foreground objects and a background backdrop S; and (2) the real data distribution can be bridged using a one-pass generator f_θ(x; z, S, v) (θ is the learned parametrization) conditioned on independently sampled objects z, scene background S and camera viewpoint v (e.g. multi-variate Gaussian distributions with diagonal variance) through adversarial training. Unfortunately, the first assumption barely holds for in-the-wild images with unconstrained foreground occlusions. As shown in Niemeyer et al. [51], the second assumption can already introduce artifacts due to disentanglement failures.

Our proposal. We focus on interpreting the visible pixels of the object of interest, as synthesizing objects and scene jointly with a generative model is very challenging. We leverage an auxiliary encoder E_φ(x) that approximates the posterior Pr(z | x) in training the generative model to reconstruct the input. This way, we bypass the need to model complex scenes and occlusions explicitly, since paired input and output are now available for supervising the auto-encoding style training. Specifically, given an image x and the corresponding occlusion mask m, our objective is to reconstruct the visible pixels of the object on the image through x̂ ⊙ m, where we have the reconstruction x̂ = NR(G_θ(z), v) and the latent z = E_φ(x), respectively. In practice, we use an off-the-shelf model to obtain the pseudo-labeled object mask as the supervision through x ⊙ m. At inference time, we can discard the auxiliary encoder E_φ, as our goal is to generate assets from a learned latent distribution (tri-plane latents in our case). To facilitate this, we leverage the vector-quantized formulation [59, 60] to learn a codebook K := {z_n}_{n=1}^{K} of size K and the mapping from a continuous-valued vector to a discrete codebook entry, where each entry follows a K-way categorical distribution.
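A minimal sketch of the occlusion-aware reconstruction target described above: only pixels covered by the pseudo-labeled object mask m contribute to the loss, so the model never has to explain occluders or background. The function and variable names are assumptions for illustration, not the paper's implementation.

```python
import torch

def masked_reconstruction_loss(x_hat: torch.Tensor,  # rendered image, NR(G_theta(z), v)
                               x: torch.Tensor,      # observed image
                               m: torch.Tensor) -> torch.Tensor:  # object mask in {0, 1}
    """Compare rendered and observed pixels only on the visible object region,
    i.e. supervise x_hat ⊙ m against x ⊙ m (Sec. 3.1, 'Our proposal')."""
    diff = (x_hat - x) * m                              # zero out occluded / background pixels
    return diff.pow(2).sum() / m.sum().clamp(min=1.0)   # average over visible pixels only

# Toy usage with random tensors standing in for a batch of 128x128 RGB crops.
x_hat = torch.rand(2, 3, 128, 128)
x = torch.rand(2, 3, 128, 128)
m = (torch.rand(2, 1, 128, 128) > 0.5).float()          # broadcasts over the channel dimension
loss = masked_reconstruction_loss(x_hat, x, m)
```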
3.2. 3D Triplane Latents Learning

We explain in detail the Encoder-Decoder training framework used to learn the tri-plane latents z (Fig. 3-left). The framework consists of a 2D-to-3D encoder E_φ, a learnable codebook quantization K, and a 3D-to-2D decoder G_θ.

E_φ: 2D-to-3D Encoder. We adopt a Vision Transformer (ViT) [127] as our image feature extractor, which maps 16 × 16 non-overlapping image patches into image tokens of dimension D_img. Since the goal is to infer the latent 3D structure from a 2D image observation, we associate each image token with tokens in the tri-plane latents using cross-attention modules, which have previously shown strong performance in cross-domain and 2D-to-3D information passing [128–131]. The cross-attention module uses a learnable tri-plane positional encoding as query, and image patch tokens as key and value. The module produces tri-plane embeddings e^{3D} = E_φ(x) ∈ R^{N_Z × N_Z × 3 × D_tok}, where D_tok = 32 and N_Z = 16 denote the dimension of each 3D token and the spatial resolution, respectively.

K: Codebook Quantization for tri-plane latents. Given the continuous tri-plane embedding e^{3D}, we project it onto our K-way categorical prior K through vector quantization. We apply the quantization q(·) of each spatial code e^{3D}_{ijk} ∈ R^{D_tok} on the tri-plane embeddings onto its closest entry z_n in the codebook, which gives tri-plane latents z = q(e^{3D}).

z_{ijk} := (argmin_{z_n, n ∈ K} ‖e^{3D}_{ijk} − z_n‖) ∈ R^{D_tok}   (3)

G_θ: 3D-to-2D Decoder with neural rendering. Our decoder takes the tri-plane latents z as input and outputs high-dimensional feature maps h ∈ R^{N_H × N_H × 3 × D_H} used for rendering, where N_H = 256 and D_H = 32 denote the spatial resolution of the tri-plane feature maps and the feature dimension, respectively. We adopt a token Transformer followed by a Style-based generator [132] as our 3D decoder. The token Transformer first produces high-dimensional intermediate features ẑ ∈ R^{N_Z × N_Z × 3 × D_H} with an extra CLS token using self-attention modules, which are then fed to the Style-based generator for upsampling. We use 4 blocks of weight-modulated convolutional layers, each guided by a mapping network conditioned on the CLS token. Given the feature maps, we use a shallow MLP that takes a 3D point p and the hidden feature tri-linearly interpolated at the query location, h(p), as input, following [52, 133, 134]. It outputs a density value σ and a view-independent color value c. We perform volume rendering with the neural radiance field formulation [135].
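As a concrete illustration of the tri-plane lookup feeding the shallow MLP above, the sketch below projects a 3D query point onto the three axis-aligned feature planes, bilinearly interpolates each plane, sums the plane features, and maps them to a density and color. The summation, network sizes, and names are assumptions in the spirit of tri-plane decoders [52], not GINA-3D's exact implementation.

```python
import torch
import torch.nn.functional as F

D_H = 32  # feature channels per plane (matches the text's D_H)

# A tiny MLP mapping an interpolated tri-plane feature to (density, RGB).
mlp = torch.nn.Sequential(
    torch.nn.Linear(D_H, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4),
)

def sample_triplane(planes: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """planes: (3, D_H, N_H, N_H) feature maps for the xy, xz and yz planes.
    p: (M, 3) query points in [-1, 1]^3. Returns (M, D_H) aggregated features."""
    coords = torch.stack([p[:, [0, 1]], p[:, [0, 2]], p[:, [1, 2]]], dim=0)        # (3, M, 2)
    grid = coords.unsqueeze(1)                                                     # (3, 1, M, 2)
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)       # (3, D_H, 1, M)
    return feats.squeeze(2).sum(dim=0).t()                                         # sum planes -> (M, D_H)

planes = torch.randn(3, D_H, 256, 256)               # decoded tri-plane feature maps
p = torch.rand(1024, 3) * 2 - 1                       # query points along camera rays
sigma_rgb = mlp(sample_triplane(planes, p))           # (M, 4): density + view-independent color
sigma, rgb = sigma_rgb[:, :1], torch.sigmoid(sigma_rgb[:, 1:])
```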
Training. Our framework builds upon the vector-quantized formulations [59, 60, 62, 136–140], where we focus on token learning in the first stage. Specifically, we extend the VQ-GAN training losses, where the encoder E_φ, decoder G_θ and codebook K are trained jointly with an image discriminator D. As illustrated in Eq. 4, we encourage our Encoder-Decoder model to reconstruct the real image x with an L2 reconstruction loss, LPIPS [141], and an adversarial loss.

L_RGB = ‖(x̂ − x) ⊙ m‖_2 + f_LPIPS(x̂ ⊙ m, x ⊙ m),   L_GAN = [log D(x) + log(1 − D(x̂))]   (4)

To regularize the codebook learning, we apply the latent embedding supervision with a commitment term in Eq. 5, where sg[·] denotes the stop-gradient operation.

L_VQ = ‖sg[e^{3D}] − z‖_2^2 + λ_commit ‖sg[z] − e^{3D}‖_2^2   (5)

We additionally regularize the 3D density field in a weakly supervised manner using the rendered aggregated density (alpha value) x_α, encouraging object pixels to have an alpha value of 1. To make the loss occlusion-aware, we further require a pixel lying in the non-object region to have zero density, inspired by Müller et al. [121]. This is achieved by restricting the non-object region to cover the sky or road class in the pseudo-labeled segmentations (denoted as m_{sky,road}).

L_α = ‖(x_α − 1) ⊙ m‖_2 + ‖x_α ⊙ m_{sky,road}‖_2   (6)

To summarize, we optimize the total objective L* in Eq. 7.

L* = arg min_{φ,θ,K} max_D E_x [L_VQ + L_RGB + L_α + L_GAN]   (7)

Figure 3. We introduce GINA-3D, a 3D-aware generative transformer for implicit neural asset generation. GINA-3D follows a two-stage pipeline, where we learn discrete 3D tri-plane latents in stage 1 (Sec. 3.2) and iterative latents sampling in stage 2 (Sec. 3.3). In stage 1, an input image is first encoded into continuous tri-plane latents e^{3D} using a Transformer-based 2D-to-3D Encoder E_φ. Then, a learnable codebook K quantizes the latents into discrete latents z. Finally, a 3D-to-2D Decoder G_θ maps z back to an image, using a sequence of a token Transformer, a Style-based generator and volume rendering. The rendered image is supervised via an occlusion-aware reconstruction loss. In stage 2, we learn iterative latents sampling using MaskGIT [62]. Optional conditional information can be used to perform conditional synthesis. The sampled latents can then be decoded into neural assets using the decoder G_θ learned in stage 1.

3.3. Iterative Latents Sampling for Neural Assets

Once the first-stage training is finished, we can represent neural assets using the learned tri-plane latents and reconstruct a collection of assets from image inputs. To generate previously unseen assets under various conditions, we further learn to sample the tri-plane latents in the second stage, following prior work on Generative Transformers [59, 60, 62, 138]. More precisely, we transform the quantized embedding z ∈ R^{N_Z × N_Z × 3 × D_tok} into a discrete sequence s ∈ {1, ..., K}^{N_Z × N_Z × 3}, where each element corresponds to the index we select from the codebook K through s_{ijk} = n : z_{ijk} = z_n. Following the recent work MaskGIT [62], we use a bidirectional transformer as our latent generator M_ψ(z), which we learn to iteratively sample the latent sequence (Fig. 3-right). During training, we learn to predict randomly masked latents s_M̄ by minimizing the negative log-likelihood of the masked ones.

L_mask = −E_s [ Σ_{∀ijk: s_{ijk} = [MASK]} log Pr(s_{ijk} | s_M̄) ]   (8)

At inference time, we iteratively generate and refine latents. Starting from all latents set to [MASK], we iteratively predict all latents simultaneously but only keep the most confident ones in each step. The remaining ones are assigned as [MASK] and the refinement continues. Finally, the sequence s can be readily mapped back to neural assets by indexing the codebook K to generate tri-plane latents z and decoding with G_θ. This iterative approach can be applied to asset variations by selectively masking out tokens of a given instance.
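The iterative decoding loop of Sec. 3.3 can be sketched as follows: start from an all-[MASK] sequence, predict every token with the bidirectional transformer, keep only the most confident predictions, and re-mask the rest for the next round. The predictor interface, the linear schedule, and the token counts below are simplified assumptions; MaskGIT [62] additionally uses a cosine masking schedule and confidence-based sampling.

```python
import torch

def iterative_sample(predict_logits, seq_len: int, codebook_size: int,
                     mask_id: int, num_steps: int = 8) -> torch.Tensor:
    """predict_logits: callable mapping a (seq_len,) token sequence (mask_id at
    unknown positions) to (seq_len, codebook_size) logits, e.g. a bidirectional
    transformer M_psi. Returns a fully decoded sequence of codebook indices."""
    seq = torch.full((seq_len,), mask_id, dtype=torch.long)      # start fully masked
    for step in range(num_steps):
        logits = predict_logits(seq)                             # predict all positions at once
        probs, preds = logits.softmax(-1).max(-1)                # confidence + argmax token
        still_masked = seq == mask_id
        # Linear schedule: commit a growing fraction of the remaining tokens each step.
        n_keep = max(1, int(still_masked.sum() * (step + 1) / num_steps))
        conf = torch.where(still_masked, probs, torch.full_like(probs, -1.0))
        keep = conf.topk(n_keep).indices                         # most confident masked positions
        seq[keep] = preds[keep]                                  # commit those; the rest stay masked
    seq[seq == mask_id] = preds[seq == mask_id]                  # fill any leftovers on exit
    return seq

# Toy usage: a random "transformer" over a 16*16*3 tri-plane token sequence.
K, L, MASK = 2048, 16 * 16 * 3, 2048
tokens = iterative_sample(lambda s: torch.randn(L, K), L, K, MASK)
```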
3.4. Expanding Supervision and Conditioning

The two-stage training of GINA-3D is flexible in supervision and conditioning. When we have additional information, we can incorporate it in stage 1 as auxiliary supervision for token learning, or in stage 2 for conditional synthesis.

Unit box vs. scaled box. Object scale information can serve as an additional input to the tri-linear interpolation on the tri-plane feature maps by rescaling the feature maps to span the object bounding box (instead of a unit box).

Semantic feature fields. Various recent works have demonstrated the effectiveness of learning hybrid representations in neural rendering [142–144] and 2D image synthesis [145]. We can naturally incorporate semantic feature fields in our formulation by computing additional channels in our neural rendering MLP. We precompute DINO-ViT features [146] for each image and learn a semantic feature field to build part correspondence among generated instances.

LiDAR depth supervision. When LiDAR point clouds are available in the data, they can be used as additional supervision through a reconstruction term between the rendered depth and the LiDAR depth.

Conditional synthesis. Last but not least, additional information supports various applications in conditional synthesis. Denoted as C, it can be fed into our latent prior as M_ψ(s_{ijk} | s_M̄, C). For example, object scale, object class, time-of-day and object semantic embeddings can serve as C for control over the generation process.

4. Experiments

4.1. Object-centric Benchmark

We select the Waymo Open Dataset (WOD) [14] as it is one of the largest and most diverse autonomous driving datasets, containing rich geometric and semantic labels such as object bounding boxes and per-pixel instance masks. Specifically, the dataset includes 1,150 driving scenes captured mostly in downtown San Francisco and Phoenix, each consisting of 200 frames of multi-sensor observations. To construct an object-centric benchmark, we propose a coarse-to-fine procedure to extract collections of single-view 2D photographs by leveraging 3D object boxes, camera-LiDAR synchronization, and fine-grained 2D panoptic labels. First, we leverage the 3D box annotations to exclude objects beyond certain distances from the surveying vehicle in each data frame (e.g., 40 m for pedestrians and 60 m for vehicles, respectively). At a given frame, we project the 3D point cloud within each 3D bounding box to the most visible camera and extract the centered patch to build our single-view 2D image collections. Furthermore, we train a Panoptic-DeepLab model [61, 147] using the 2D panoptic segmentations on the labeled subset and create per-pixel pseudo-labels for each camera image on the entire dataset. This allows us to differentiate pixels belonging to the object of interest, the background, and occluders (e.g., a standing pole in front of a person). We further exclude certain patches where objects are heavily occluded using the 2D panoptic predictions. Even with the filtering criterion applied, we believe that the resulting benchmark is still very challenging due to occlusions, intra-class variations (e.g., truck and sedan), partial observations (e.g., we do not have full 360-degree observations of a single vehicle), and imperfect segmentation. In particular, we provide accurate registration of camera rays and LiDAR point clouds to the object coordinate frame, taking into account the camera rolling shutter, object motion and ego motion. We repeat the same process to extract vehicles and pedestrians from WOD, and additional long-tail vehicles from our Longtail dataset. The proposed object-centric benchmark is one of the largest datasets for generative modeling to date, including diverse and long-tail examples in the wild.

Figure 4. Image samples from our object-centric benchmark: (a) WOD-Vehicle; (b) WOD-Pedestrian; (c) Longtail-Vehicle.

                     Images   Unique Instances
WOD-Vehicle           901K        23.6K
WOD-Pedestrian        321K         8.1K
Longtail-Vehicle       80K         3.7K

Table 1. Statistics of our object-centric benchmark. Experiments were conducted on a subset with image patches rescaled to 256^2 resolution.

Method (tri-plane z / scaled box / LiDAR)    FID↓    Mask FOU↓   COV↑    MMD↓  |  Cons.↓   Mesh FOU↓   COV↑    MMD↓
GIRAFFE [51]                                 105.3   43.66        8.24   2.35  |  15.87    N/A         N/A     N/A
EG3D [52]                                    137.6    7.40        6.26   2.37  |   2.38    25.7        3.12    4.70
GINA-3D (× / × / ×)                          147.9    1.85        4.78   2.00  |   1.55    N/A         1.95    2.43
GINA-3D (✓ / × / ×)                           79.0    1.82       19.67   1.52  |   1.27    11.7        5.75    2.21
GINA-3D (✓ / ✓ / ×)                           60.5    1.77       20.68   1.53  |   1.06     2.33       8.69    2.26
GINA-3D (✓ / ✓ / ✓)                           59.5    1.80       25.00   1.46  |   0.98     4.57      11.42    2.17

Table 2. Quantitative evaluation of the realism and diversity of generated images and geometry (metric details in Sec. 4.3). Image columns: quality (FID, Mask FOU) and semantic diversity (COV, MMD); Geometry columns: quality (Cons., Mesh FOU) and mesh diversity (COV, MMD).
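A rough sketch of the patch-extraction step described in Sec. 4.1 above: points inside a 3D box are projected into a camera with a pinhole model, and the 2D extent of the projections defines the crop. The simple intrinsics, the camera-frame convention (z forward), the absence of rolling-shutter and motion compensation, and all names are assumptions; the benchmark pipeline additionally handles camera selection, distortion, and ego/object motion.

```python
import numpy as np

def project_points(pts_vehicle: np.ndarray, T_cam_from_vehicle: np.ndarray,
                   K: np.ndarray) -> np.ndarray:
    """Project LiDAR points (N, 3) in the vehicle frame into pixel coordinates (N, 2)
    using a 4x4 extrinsic transform and a 3x3 pinhole intrinsic matrix."""
    pts_h = np.concatenate([pts_vehicle, np.ones((len(pts_vehicle), 1))], axis=1)  # (N, 4)
    pts_cam = (T_cam_from_vehicle @ pts_h.T).T[:, :3]                              # (N, 3)
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                     # keep points in front of the camera
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]                            # perspective division

def box_patch(image: np.ndarray, uv: np.ndarray, pad: int = 8) -> np.ndarray:
    """Crop the image around the 2D extent of the projected box points."""
    h, w = image.shape[:2]
    u0, v0 = np.maximum(uv.min(axis=0).astype(int) - pad, 0)
    u1, v1 = np.minimum(uv.max(axis=0).astype(int) + pad, [w - 1, h - 1])
    return image[v0:v1, u0:u1]

# Toy usage with random data standing in for one frame.
K = np.array([[1000.0, 0, 640], [0, 1000.0, 480], [0, 0, 1]])
T = np.eye(4)
pts = np.random.randn(200, 3) * 0.5 + np.array([0.0, 0.0, 10.0])   # points ~10 m ahead
patch = box_patch(np.zeros((960, 1280, 3), dtype=np.uint8), project_points(pts, T, K))
```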
4.2. Implementation Details

GINA-3D. Our encoder takes in images at a resolution of 256^2 and renders at 128^2 during training. Our tri-plane latents have a resolution of 16^2, with a codebook containing 2048 entries and a lookup dimension of 32. We trained our models on 8 Tesla V100 GPUs using the Adam optimizer [148], with batch sizes of 32 and 64 in each stage, respectively. We trained stage 1 for 150K steps and stage 2 for 80K steps.

Baselines. We compare against two state-of-the-art methods in the domain, GIRAFFE [51] and EG3D [52], which we train on our dataset at a resolution of 128^2. We noticed that the GIRAFFE model trained on full pixels fails to disentangle viewpoints, occlusions and identities. This makes the extraction of foreground pixels difficult, as the rendered mask is only defined at the low resolution of 16^2. We instead report the numbers using a model trained by whitening out non-object regions. For EG3D, we observed that training with unmasked images leads to training collapse, due to the absence of foreground and background modeling. Thus, we trained EG3D under the same setting.

4.3. Evaluations on WOD-Vehicle

We conduct quantitative evaluations in Table 2 and visualize qualitative results of the different models in Fig. 5.

Image Evaluation. For image quality, we calculate the Fréchet Inception Distance (FID) [149] between 50K generated images and all available validation images. To better reflect the metric on object completeness, we filter for images whose object segmentation mask takes up at least 50% of the projected 3D bounding box (Fig. 5-right). We additionally measure the completeness of the generated images by Mask Floater-Over-Union (Mask FOU), which is defined as the percentage of unconnected pixels over the rendered object region. To measure semantic diversity, we compute the Coverage (COV) score and Minimum Matching Distance (MMD) [94] using CLIP [150] embeddings. COV measures the fraction of CLIP embeddings in the validation set that have matches in the generated set, and MMD measures the distance from each generated embedding to the closest one in the validation set. Our model demonstrates significant improvements in FID, image completeness and semantic diversity. Without explicit disentanglement, the baselines can hardly handle the real distributions, resulting in artifacts of incomplete shapes (Fig. 5).

Figure 5. Qualitative comparison between GIRAFFE, EG3D and ours, with images rendered from a horizontal 30° viewpoint. Both baselines fail to disentangle real-world sensor data: GIRAFFE fails to disentangle rotation in the object representation, while both baselines fail to disentangle occlusion and produce incomplete shapes. We show samples from the occlusion-filtered WOD-Vehicle validation set on the right.
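The coverage and minimum matching distance scores used above can be sketched as follows over precomputed CLIP embeddings; the cosine distance and the exact matching convention are assumptions, following the spirit of [94] and the definitions given in the text.

```python
import numpy as np

def cov_mmd(gen_emb: np.ndarray, val_emb: np.ndarray):
    """gen_emb: (G, D) embeddings of generated images; val_emb: (V, D) of validation images.
    COV: fraction of validation embeddings that are the nearest validation neighbor
    of at least one generated embedding. MMD: mean distance from each generated
    embedding to its closest validation embedding."""
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    val = val_emb / np.linalg.norm(val_emb, axis=1, keepdims=True)
    dist = 1.0 - gen @ val.T                  # (G, V) cosine distances
    nearest = dist.argmin(axis=1)             # closest validation item for each generated one
    cov = len(np.unique(nearest)) / len(val)  # coverage of the validation set
    mmd = dist.min(axis=1).mean()             # minimum matching distance
    return cov, mmd

cov, mmd = cov_mmd(np.random.randn(500, 512), np.random.randn(300, 512))
```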
Geometry Evaluation. To measure the underlying volume rendering consistency, we follow Or-El et al. [79] and compute the alignment errors between the volume-rendered depths from two viewpoints. We extract the mesh using marching cubes [151] with a density threshold of 10, following EG3D [52]. We measure completeness by Mesh Floater-Over-Union (Mesh FOU), which is defined as the percentage of the surface area on unconnected mesh pieces over the entire mesh. Since we do not have ground-truth meshes for the real-world data, we approximate mesh diversity by measuring between generated meshes and aggregated LiDAR point clouds within a bounding box from the validation set. We measure mesh diversity using the aforementioned COV and MMD with a new distance metric. To account for the incompleteness of LiDAR point clouds, we use a one-way Chamfer distance, which is defined as the mean distance between validation point clouds and their nearest neighbors on a given generated mesh. Our model demonstrates significant improvements in volume rendering consistency, shape completeness and shape diversity.

Figure 6. Generation from GINA-3D variants. (a) GINA-3D trained on WOD-Vehicle. (b) GINA-3D with additional DINO feature field generation. (c) GINA-3D trained on Longtail-Vehicle. (d) GINA-3D trained on WOD-Pedestrian.

Augmentation and Ablation. GINA-3D can naturally incorporate additional supervision when available. We present variations of GINA-3D trained with object scale, LiDAR and DINO [146] supervision. With object scale information available, we normalize the tri-plane feature maps with the scale along each dimension. The model trained with rescaled tri-plane resolution yields a significant performance boost in both quality and diversity over the unit bounding cube, as the latents are better utilized. Moreover, we observe that by adding auxiliary L2 depth supervision from LiDAR, most metrics are improved except Mask and Mesh FOU. While LiDAR provides a strong signal on the underlying geometry, it also introduces inconsistency on transparent surfaces. We hypothesize that this challenge leads to slightly more floaters, which we leave as a future direction to explore. Alternatively, we can learn additional neural semantic fields through 2D-to-3D feature lifting [142]. By only changing the final layer of the NR MLP, we can learn an additional view-consistent and instance-invariant semantic feature field (Fig. 6-b), which can enable future applications of language-conditioned and part-based editing [8]. Finally, we perform ablation studies on the key design of tri-plane latents. If we remove the tri-plane structure and use an MLP-only NR, the model fails to capture the diversity of real-world data and results in mode collapse, always generating a mean car shape.

Figure 7. GINA-3D unifies a wide range of asset synthesis tasks, all obtained with the same stage 1 decoder and variations of stage 2 training. Top row: conditional synthesis using discrete conditions (object classes and time-of-day). 2nd row: conditional synthesis using continuous conditions (semantic token and object scale). 3rd row: image-conditioned asset variations by randomizing tri-plane latents.
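For the mesh-diversity metrics above, the one-way Chamfer distance between an aggregated LiDAR point cloud and a generated mesh (represented here by points sampled on its surface) might look like the following; the surface sampling and any normalization are assumptions.

```python
import numpy as np

def one_way_chamfer(lidar_pts: np.ndarray, mesh_pts: np.ndarray) -> float:
    """Mean distance from each validation LiDAR point to its nearest neighbor among
    points sampled on a generated mesh (Sec. 4.3, Geometry Evaluation). One-way so
    that regions missing from the partial LiDAR scan are not penalized."""
    # (N, M) pairwise Euclidean distances; fine for the modest point counts used here.
    d2 = ((lidar_pts[:, None, :] - mesh_pts[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2.min(axis=1)).mean())

score = one_way_chamfer(np.random.rand(1024, 3), np.random.rand(2048, 3))
```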
4.4. Applications

Generating long-tail instances. Our data-driven framework is scalable to new data. We provide results for GINA-3D trained on the Longtail-Vehicle and WOD-Pedestrian datasets in Fig. 6-c,d, respectively. Without finetuning the architecture on the newly collected data, GINA-3D can readily learn to generate long-tail objects from noisy segmentation masks. As shown in Fig. 6-c, generation results range from trams and trucks to construction equipment of various shapes. GINA-3D can also be applied to other categories (e.g. pedestrians, Fig. 6-d). Results show moderate shape and texture diversity.

Conditional synthesis. As described in Sec. 3.4, the flexibility of the two-stage approach makes it a promising candidate for conditional asset synthesis. Specifically, we freeze the stage 1 model and train variations of MaskGIT by passing in different conditions. We provide results for three kinds of conditional synthesis tasks in Fig. 7, namely discrete embeddings (object class, time-of-day), continuous embeddings, and image-conditioned generation. For image-conditioned asset reconstruction and variation, we first infer the latents using the encoder model and then sample asset variations by controlling the masking ratio of the reconstructed tri-plane latents. The more tokens are masked, the wider the variation range becomes. We provide more details on conditional synthesis in the supplementary material.

4.5. Limitations

Misaligned 3D bounding boxes. As in our WOD-Pedestrian results, misaligned boxes lead to mismatches in pixel space, resulting in blurrier results. Recent methods in ray-based [130] or patch-based [81] learning are promising directions.

Few-shot and transfer learning. Though our data-driven approach achieves reasonable performance by training on Longtail-Vehicle alone, the comparative scarcity of data leads to lower diversity. How to enable few-shot or transfer learning remains an open question.

Transient effects. Direction-dependent effects can be incorporated in our pipeline. We believe modeling materials [152] together with LiDAR is an interesting direction.

5. Conclusion

In this work, we presented GINA-3D, a scalable learning framework to synthesize 3D assets from robotic sensors deployed in the wild. Core to our framework is a deep encoder-decoder backbone that learns discrete tri-plane latent variables from partially-observed 2D input pixels. Our backbone is composed of an encoder with cross-attention, a decoder with tri-plane feature maps, and a neural volumetric rendering module. We further introduce a latent transformer to generate tri-plane latents with various conditions including bounding box size, time of day, and semantic features. To evaluate our framework, we have established a large-scale object-centric benchmark containing diverse vehicles and pedestrians. Experimental results have demonstrated strong performance in image quality, geometry consistency and geometry diversity over existing methods.
The benchmark is publicly available through waymo.com/open. Acknowledgements: We based our MaskGIT implemen- tation on Chang et al . [62]. We thank Huiwen Chang for helpful MaskGIT pointers. We acknowledge the helpful discussions and support from Qichi Yang and James Guo. We thank Mathilde Caron for her DINO implementation and helpful pointers. We based our GIRAFFE baseline on the reimplementation by Kyle Sargent. We thank Golnaz Ghiasi for helpful pointers on segmentation models. 8 References [1] Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Prad- han, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In CVPR , 2022. 1, 3 [2] Harry Barrow, J Tenenbaum, A Hanson, and E Riseman. Recovering intrinsic scene characteristics. Comput. vis. syst , 2(3-26):2, 1978. 1 [3] D Man and A Vision. A computational investigation into the human representation and processing of visual information. WH San Francisco: Freeman and Company, San Francisco , 1982. 1 [4] Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape-from-shading: a survey. IEEE transactions on pattern analysis and machine intelligence , 21(8):690–706, 1999. 1 [5] Marshall Tappen, William Freeman, and Edward Adelson. Recovering intrinsic images from a single image. NIPS , 15, 2002. 1 [6] Derek Hoiem, Alexei A Efros, and Martial Hebert. Au- tomatic photo pop-up. In ACM SIGGRAPH 2005 Papers , pages 577–584. 2005. 1 [7] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence , 31(5):824–840, 2008. 1 [8] Stephen Gould, Richard Fulton, and Daphne Koller. Decom- posing a scene into geometric and semantically consistent regions. In ICCV . IEEE, 2009. 1, 8 [9] Abhinav Gupta, Alexei A Efros, and Martial Hebert. Blocks world revisited: Image understanding using qualitative ge- ometry and mechanics. In ECCV . Springer, 2010. 1 [10] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research , 32(11):1231– 1237, 2013. 1 [11] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research , 36(1):3–15, 2017. 1, 15 [12] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In CVPR , 2019. 1, 15 [13] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In CVPR , 2020. 1, 15 [14] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR , 2020. 1, 2, 5, 15 [15] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215 , 2019. 1 [16] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Cheb- otar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. 
Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 , 2022. 1 [17] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR , 2016. 1 [18] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analy- sis. In CVPR , 2016. 1, 3 [19] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elemen- tary visual reasoning. In CVPR , 2017. 1 [20] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning , pages 1–16. PMLR, 2017. 1, 3 [21] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 , 2017. 1 [22] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics , pages 621–635. Springer, 2018. 1 [23] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR , 2018. 1 [24] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. Aug- mented reality meets computer vision: Efficient data gen- eration for urban driving scenes. International Journal of Computer Vision , 126(9):961–972, 2018. 1 [25] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A plat- form for embodied ai research. In ICCV , 2019. 1 [26] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773 , 2020. 1 [27] Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia Pérez-D’Arpino, Shya- mal Buch, Sanjana Srivastava, Lyne Tchapmi, et al. igibson 1.0: a simulation environment for interactive tasks in large realistic scenes. In IROS , 2021. 1 9 [28] Blender Online Community. Blender - a 3d modelling and rendering package, 2018. 1 [29] Arthur Juliani, Vincent-Pierre Berges, Ervin Teng, Andrew Cohen, Jonathan Harper, Chris Elion, Chris Goy, Yuan Gao, Hunter Henry, Marwan Mattar, et al. Unity: A general plat- form for intelligent agents. arXiv preprint arXiv:1809.02627 , 2018. 1 [30] Epic Games. Unreal engine. 1 [31] Miles Macklin, Matthias Müller, Nuttapong Chentanez, and Tae-Yong Kim. Unified particle physics for real-time appli- cations. ACM Transactions on Graphics (TOG) , 33(4):1–12, 2014. 1 [32] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. 2016. 1 [33] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 , 2015. 1, 2, 3 [34] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. In CVPR , 2017. 
1 [35] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158 , 2017. 1 [36] Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M Seitz. Photoshape: Photorealistic materi- als for large-scale shape collections. arXiv preprint arXiv:1809.09761 , 2018. 1, 2, 3 [37] Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jiten- dra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In CVPR , 2018. 1 [38] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large- scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In CVPR , 2019. 1 [39] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV , 2021. 1 [40] Fereshteh Sadeghi and Sergey Levine. Cad2rl: Real single- image flight without a single real image. arXiv preprint arXiv:1611.04201 , 2016. 1 [41] Matthias Müller, Alexey Dosovitskiy, Bernard Ghanem, and Vladlen Koltun. Driving policy transfer via modularity and abstraction. arXiv preprint arXiv:1804.09364 , 2018. 1 [42] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In ICRA , 2019. 1 [43] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Ma- teusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113 , 2019. 1 [44] Bła ̇ zej Osi ́ nski, Adam Jakubowski, Paweł Zi ̨ ecina, Piotr Miło ́ s, Christopher Galias, Silviu Homoceanu, and Henryk Michalewski. Simulation-based reinforcement learning for real-world autonomous driving. In ICRA , 2020. 1 [45] Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexan- der Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Sim2real predictivity: Does evaluation in simulation predict real-world performance? IEEE Robotics and Automation Letters , 5(4):6670–6677, 2020. 1 [46] Saminda Abeyruwan, Laura Graesser, David B D’Ambrosio, Avi Singh, Anish Shankar, Alex Bewley, and Pannag R Sanketi. i-sim2real: Reinforcement learning of robotic poli- cies in tight human-robot interaction loops. arXiv preprint arXiv:2207.06572 , 2022. 1 [47] Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang, Wei-Chiu Ma, and Raquel Urtasun. Lidarsim: Realistic lidar simulation by leveraging the real world. In CVPR , 2020. 1, 3 [48] Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. Geosim: Realistic video simu- lation via geometry-aware composition for self-driving. In CVPR , 2021. 1, 3 [49] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. Meta-sim: Learning to generate synthetic datasets. In ICCV , 2019. 1, 3 [50] Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. Idd: A dataset for exploring problems of autonomous navigation in uncon- strained environments. In WACV . IEEE, 2019. 
1 [51] Michael Niemeyer and Andreas Geiger. Giraffe: Represent- ing scenes as compositional generative neural feature fields. In CVPR , 2021. 2, 3, 6 [52] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Effi- cient geometry-aware 3d generative adversarial networks. In CVPR , 2022. 2, 3, 4, 6, 7, 17 [53] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 , 2015. 2, 3 [54] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In CVPR , 2015. 2, 3 10 [55] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 , 2017. 2, 3 [56] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August , 15(2018):11, 2018. 2, 3 [57] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR , 2019. 2, 3 [58] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In CVPR , 2020. 2, 3 [59] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS , 2017. 2, 4 [60] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR , 2021. 2, 4, 20 [61] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition , pages 12475–12485, 2020. 2, 6, 16 [62] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR , 2022. 2, 4, 5, 8, 17 [63] Joshua B Tenenbaum and William T Freeman. Separating style and content with bilinear models. Neural computation , 12(6):1247–1283, 2000. 2, 3 [64] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In ICML . PMLR, 2014. 2, 3 [65] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In CVPR , 2015. 2, 3 [66] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent trans- formations for 3d view synthesis. In NIPS , 2015. 2, 3 [67] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics net- work. In NIPS , 2015. 2, 3 [68] Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsu- pervised learning of 3d structure from images. NIPS , 2016. 2, 3 [69] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Man- mohan Chandraker. Towards large-pose face frontalization in the wild. In ICCV , 2017. 2, 3 [70] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman. Visual object networks: Image generation with disentangled 3d representations. In NeurIPS , 2018. 2, 3 [71] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. 
Hologan: Unsupervised learning of 3d representations from natural images. In ICCV , 2019. 2, 3 [72] Thu H Nguyen-Phuoc, Christian Richardt, Long Mai, Yongliang Yang, and Niloy Mitra. Blockgan: Learning 3d object-aware scene representations from unlabelled images. 2020. 2, 3 [73] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas Geiger. Towards unsupervised learning of generative models for 3d controllable image synthesis. In CVPR , 2020. 2, 3 [74] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. In NeurIPS , 2020. 2, 3 [75] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR , 2021. 2, 3 [76] Zekun Hao, Arun Mallya, Serge Belongie, and Ming-Yu Liu. Gancraft: Unsupervised 3d neural rendering of minecraft worlds. In ICCV , 2021. 2, 3 [77] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-Aware Generator of GANs Based on Conditionally- Independent Pixel Synthesis. 2021. 2, 3 [78] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d aware generator for high- resolution image synthesis. In ICLR , 2022. 2, 3 [79] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. Stylesdf: High-resolution 3d-consistent image and geometry generation. In CVPR , 2022. 2, 3, 7, 18 [80] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 , 2022. 2 [81] Ivan Skorokhodov, Sergey Tulyakov, Yiqun Wang, and Peter Wonka. Epigraf: Rethinking training of 3d gans. 2022. 2, 3, 8 [82] Kangle Deng, Gengshan Yang, Deva Ramanan, and Jun-Yan Zhu. 3d-aware conditional image synthesis. In CVPR , 2023. 2 [83] Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3d generation on imagenet. In International Conference on Learning Representations . 2 [84] Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, and De- qing Sun. Vq3d: Learning a 3d-aware generative model on imagenet. arXiv preprint arXiv:2302.06833 , 2023. 2 [85] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR , 2015. 2 11 [86] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. 2016. 2 [87] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Ab- hinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV . Springer, 2016. 2 [88] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. In 3DV , 2017. 2 [89] Edward J Smith and David Meger. Improved adversarial systems for 3d object generation and reconstruction. In CoRL . PMLR, 2017. 2 [90] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. Escap- ing plato’s cave: 3d shape from adversarial rendering. In ICCV , 2019. 2 [91] Sebastian Lunz, Yingzhen Li, Andrew Fitzgibbon, and Nate Kushman. Inverse graphics gan: Learning to gen- erate 3d shapes from unstructured 2d data. arXiv preprint arXiv:2002.12674 , 2020. 2 [92] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In ICCV , 2021. 
2 [93] Moritz Ibing, Gregor Kobsik, and Leif Kobbelt. Oc- tree transformer: Autoregressive 3d shape generation on hierarchically structured sequences. arXiv preprint arXiv:2111.12480 , 2021. 2 [94] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In ICML . PMLR, 2018. 2, 6, 18 [95] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV , 2019. 2 [96] Kaichun Mo, He Wang, Xinchen Yan, and Leonidas Guibas. PT2PC: Learning to generate 3d point cloud shapes from part tree conditions. 2020. 2 [97] Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik Ramani. Surfnet: Generating 3d shape surfaces using deep residual networks. In CVPR , 2017. 2 [98] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In CVPR , 2018. 2 [99] Dario Pavllo, Graham Spinks, Thomas Hofmann, Marie- Francine Moens, and Aurelien Lucchi. Convolutional gener- ation of textured 3d meshes. NeurIPS , 2020. 2 [100] Dario Pavllo, Jonas Kohler, Thomas Hofmann, and Aurelien Lucchi. Learning generative models of textured 3d meshes from real-world images. In ICCV , 2021. 2 [101] Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3d objects with an interpolation-based differen- tiable renderer. NeurIPS , 2019. 2 [102] Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. In ICML . PMLR, 2020. 2 [103] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d tex- tured shapes learned from images. In NeurIPS , 2022. 2, 3 [104] Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas Guibas. Structurenet: Hierarchi- cal graph networks for 3d shape generation. ACM Transac- tions on Graphics (TOG), Siggraph Asia 2019 , 38(6):Article 242, 2019. 2 [105] Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR , 2017. 2 [106] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3d supervision. NeurIPS , 2019. 2 [107] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se- bastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR , 2019. 2 [108] Andrew Luo, Tianqin Li, Wen-Hao Zhang, and Tai Sing Lee. Surfgen: Adversarial 3d shape synthesis with explicit surface discriminators. In ICCV , 2021. 2 [109] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In CVPR , 2019. 2 [110] Xingguang Yan, Liqiang Lin, Niloy J Mitra, Dani Lischin- ski, Daniel Cohen-Or, and Hui Huang. Shapeformer: Transformer-based shape completion via sparse representa- tion. In CVPR , 2022. 2 [111] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representa- tion for high-resolution 3d shape synthesis. NeurIPS , 2021. 2 [112] Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shub- ham Tulsiani. Autosdf: Shape priors for 3d completion, reconstruction and generation. In CVPR , 2022. 
2 [113] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV . Springer, 2016. 3 [114] Seunghoon Hong, Xinchen Yan, Thomas S Huang, and Honglak Lee. Learning hierarchical semantic image manip- ulation through structured representations. NeurIPS , 2018. 3 [115] Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. In CVPR , 2021. 3 [116] Wei Li, CW Pan, Rong Zhang, JP Ren, YX Ma, Jin Fang, FL Yan, QC Geng, XY Huang, HJ Gong, et al. Aads: Aug- mented autonomous driving simulation using data-driven algorithms. Science robotics , 4(28):eaaw0863, 2019. 3 12 [117] Huan Ling, David Acuna, Karsten Kreis, Seung Wook Kim, and Sanja Fidler. Variational amodal object completion. In NeurIPS , 2020. 3 [118] Jason Zhang, Gengshan Yang, Shubham Tulsiani, and Deva Ramanan. Ners: Neural reflectance surfaces for sparse-view 3d reconstruction in the wild. In NeurIPS , 2021. 3 [119] Sergey Zakharov, Rares Andrei Ambrus, Vitor Campag- nolo Guizilini, Dennis Park, Wadim Kehl, Fredo Durand, Joshua B Tenenbaum, Vincent Sitzmann, Jiajun Wu, and Adrien Gaidon. Single-shot scene reconstruction. In CoRL , 2021. 3 [120] Tom Monnier, Matthew Fisher, Alexei A Efros, and Mathieu Aubry. Share with thy neighbors: Single-view reconstruction by cross-instance consistency. In ECCV , 2022. 3 [121] Norman Müller, Andrea Simonelli, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, and Peter Kontschieder. Autorf: Learning 3d object radiance fields from single view observations. In CVPR , 2022. 3, 4 [122] Jeevan Devaranjan, Amlan Kar, and Sanja Fidler. Meta- sim2: Unsupervised learning of scene structure for synthetic data generation. In ECCV . Springer, 2020. 3 [123] Shuhan Tan, Kelvin Wong, Shenlong Wang, Sivabalan Mani- vasagam, Mengye Ren, and Raquel Urtasun. Scenegen: Learning to generate realistic traffic scenes. In CVPR , 2021. 3 [124] Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, and Henrik Kret- zschmar. Surfelgan: Synthesizing realistic sensor data for autonomous driving. In CVPR , 2020. 3 [125] Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas Funkhouser. Panoptic neural fields: A semantic object-aware neural scene representation. In CVPR , 2022. 3 [126] Konstantinos Rematas, Andrew Liu, Pratul P Srini- vasan, Jonathan T Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. Urban radiance fields. In CVPR , 2022. 3 [127] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929 , 2020. 4 [128] Ronghang Hu and Amanpreet Singh. Unit: Multimodal multitask learning with a unified transformer. In ICCV , 2021. 4 [129] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Kop- pula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795 , 2021. 4 [130] Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Luˇ ci ́ c, Daniel Duckworth, Alexey Dosovitskiy, et al. 
Scene representation transformer: Geometry-free novel view syn- thesis through set-latent scene representations. In CVPR , 2022. 4, 8 [131] Daniel Rebain, Mark J Matthews, Kwang Moo Yi, Gopal Sharma, Dmitry Lagun, and Andrea Tagliasacchi. Attention beats concatenation for conditioning neural fields. arXiv preprint arXiv:2209.10684 , 2022. 4 [132] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In CVPR , 2020. 4, 16, 17 [133] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In ECCV . Springer, 2020. 4, 17 [134] Bokui Shen, Zhenyu Jiang, Christopher Choy, Leonidas J Guibas, Silvio Savarese, Anima Anandkumar, and Yuke Zhu. Acid: Action-conditional implicit visual dynamics for deformable object manipulation. 2022. 4, 17 [135] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. In ECCV , 2020. 4 [136] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Gener- ating diverse high-fidelity images with vq-vae-2. In NeurIPS , 2019. 4 [137] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. In ICML . PMLR, 2020. 4 [138] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. In ICLR , 2021. 4, 16 [139] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun- jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin- fei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 , 2022. 4 [140] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open do- main textual description. arXiv preprint arXiv:2210.02399 , 2022. 4 [141] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR , 2018. 4 [142] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitz- mann. Decomposing nerf for editing via feature field distil- lation. arXiv preprint arXiv:2205.15585 , 2022. 5, 7 [143] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. arXiv preprint arXiv:2209.03494 , 2022. 5 13 [144] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields. arXiv preprint arXiv:2210.15947 , 2022. 5 [145] Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean- Francois Lafleche, Adela Barriuso, Antonio Torralba, and Sanja Fidler. Datasetgan: Efficient labeled data factory with minimal human effort. In CVPR , 2021. 5 [146] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jé- gou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV , 2021. 5, 7, 19 [147] Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar, and Dragomir Anguelov. 
Waymo open dataset: Panoramic video panoptic segmentation. In ECCV, 2022. 6, 16
[148] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6, 17
[149] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017. 6, 17
[150] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML. PMLR, 2021. 6, 18
[151] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987. 7, 23
[152] Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T Barron, and Pratul P Srinivasan. Ref-nerf: Structured view-dependent appearance for neural radiance fields. In CVPR, 2022. 8
[153] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 16
[154] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021. 17
[155] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 17
[156] Itseez. Open source computer vision library. https://github.com/itseez/opencv, 2015. 18
[157] Dawson-Haggerty et al. trimesh. 18

Appendix

In this supplementary document, we first describe in detail our proposed dataset and the processing behind it in Sec. A. Then, we discuss various implementation details, including network architectures, evaluation metrics, and conditional synthesis, in Sec. B. Next, we examine ablations of the loss terms and evaluate the stage 1 model's performance in Sec. D. Finally, we discuss the baselines in more detail in Sec. E and showcase mesh extraction visualization results in Sec. F.

Figure 8. Our object-centric benchmark: (a) WOD-Vehicle (box, mask, LiDAR); (b) WOD-Pedestrian (box, mask, LiDAR); (c) Longtail-Vehicle.

Appendix A. Dataset

We build the object-centric benchmark on top of the Waymo Open Dataset (WOD) [14] and our Longtail dataset. The Waymo Open Dataset is one of the largest and most diverse autonomous driving datasets [11-13], containing rich geometric and semantic labels such as 3D bounding boxes and per-pixel instance masks. Specifically, the dataset includes 1,150 driving scenes captured mostly in downtown San Francisco and Phoenix, each consisting of 200 frames of multi-sensor data. Each data frame includes 3D point clouds from LiDAR sensors and high-resolution images from five cameras (positioned at Front, Front-Left, Front-Right, Side-Left, and Side-Right). The objects were captured in the wild, and their images exhibit large variations due to object interactions (e.g., heavy occlusion and distance to the robotic platform), sensor artifacts (e.g., motion blur and rolling shutter), and environmental factors (e.g., lighting and weather conditions).
To construct a benchmark for object-centric modeling, we propose a coarse-to-fine procedure to extract collections of single-view 2D photographs by leveraging 3D object boxes, camera-LiDAR synchronization, and fine-grained 2D panoptic labels. First, we leverage the 3D box annotations to exclude objects beyond certain distances from the surveying vehicle in each data frame (e.g., 40 m for pedestrians and 60 m for vehicles, respectively). At a given frame, we project the 3D points within each 3D bounding box to the most visible camera and extract the centered patch to build our single-view 2D image collections. Furthermore, we train a Panoptic-Deeplab model [61] using the 2D panoptic segmentations on the labeled subset [147] and create per-pixel pseudo-labels for every camera image in the entire WOD. This allows us to differentiate pixels belonging to the object of interest, the background, and occluders (e.g., a pole standing in front of a person). We further exclude patches where objects are heavily occluded, using the 2D panoptic predictions. Even with these filtering criteria applied, the resulting benchmark remains very challenging due to occlusions, intra-class variations (e.g., truck vs. sedan), partial observations (e.g., we do not have full 360-degree observations of a single vehicle), and imperfect segmentation. In particular, we provide accurate registration of camera rays and LiDAR point clouds to the object coordinate frame, taking into account the camera rolling shutter, object motion, and ego motion. Our WOD-ObjectAsset can be accessed through waymo.com/open, organized in the Waymo Open Dataset modular format, enabling users to selectively download only the components they need. Finally, we provide code examples to access and visualize the data in the tutorial_object_asset.

Our Longtail dataset contains LiDAR point clouds and camera images, along with 3D bounding box annotations. We obtain pseudo-labeled segmentations using the same 2D panoptic model pretrained on WOD, and apply the same coarse-to-fine procedure to obtain the Longtail-Vehicle benchmark.

Appendix B. More Implementation Details

B.1. Network Architecture

All models use an exponential moving average of weights.

Encoder E_φ. Our encoder contains three vision transformer blocks and three cross-attention blocks. The vision transformer takes input images at a resolution of 256 × 256 and first maps each patch to a 512-dimensional token. A CLS token is appended to the list of image patch tokens, and the transformer blocks process the patch tokens. Each transformer block has 8 heads, an embedding dimension of 512, and a hidden dimension of 2048. For the cross-attention blocks, we first initialize a tri-plane positional embedding of shape 16 × 16 × 3, where each embedding is 512-dimensional. The tri-plane positional embedding is passed through a 512-dimensional fully-connected layer and then used as the query input to the cross-attention blocks, while the image patch tokens serve as keys and values. Each cross-attention block has 8 heads, an embedding dimension of 512, and a hidden dimension of 2048. Finally, the outputs of the cross-attention blocks are passed through a fully-connected layer with Layer Normalization [153] and tanh activation to produce 16 × 16 × 3 tokens of dimension 32, which matches the dimension of each entry in the codebook K.
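For concreteness, the following minimal sketch (assuming PyTorch; the class name, GELU feed-forward, and the collapse of the three cross-attention blocks into a single block are our own simplifications, not the exact implementation) illustrates how the tri-plane positional embeddings query the ViT patch tokens and are projected to the 32-dimensional codebook space:

```python
# Minimal sketch of the tri-plane cross-attention encoder head (PyTorch assumed).
import torch
import torch.nn as nn

class TriplaneCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8, hidden=2048, code_dim=32, res=16):
        super().__init__()
        # 16 x 16 x 3 learnable tri-plane positional embeddings used as queries.
        self.plane_pos = nn.Parameter(torch.randn(3 * res * res, dim))
        self.pos_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
        # Project to the codebook entry dimension (32) with LayerNorm + tanh.
        self.to_code = nn.Sequential(nn.LayerNorm(dim),
                                     nn.Linear(dim, code_dim), nn.Tanh())

    def forward(self, patch_tokens):                 # (B, N_patches, 512) from the ViT
        b = patch_tokens.shape[0]
        q = self.pos_proj(self.plane_pos).expand(b, -1, -1)   # (B, 768, 512)
        x, _ = self.attn(q, patch_tokens, patch_tokens)       # queries attend to patches
        x = x + self.ffn(x)
        return self.to_code(x)                       # (B, 16*16*3, 32) tri-plane latents

tokens = torch.randn(2, 257, 512)                    # e.g. 256 patch tokens + CLS
latents = TriplaneCrossAttention()(tokens)           # -> (2, 768, 32)
```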
Codebook K. Our discrete codebook contains 2048 entries with a lookup dimension of 32, i.e., each entry is a 32-dimensional vector. The codebook is initialized with fan-in variance scaling (scale 1, uniform distribution). Similar to Yu et al. [138], we use l2-normalized codes, i.e., we apply l2 normalization to both the encoded tri-plane latents e_3D and the codebook entries in K.

Decoder G_θ - Token Transformer. The token transformer contains three self-attention transformer blocks. A CLS token is appended to the tri-plane latents, and positional encoding is used to represent 3D spatial locations. Each transformer block has 8 heads, an embedding dimension of 512, and a hidden dimension of 2048. Finally, the outputs of the transformer blocks are passed through a fully-connected layer with Layer Normalization [153] and tanh activation into 16 × 16 × 3 tokens of dimension 256 (plus the additional CLS token).

Decoder G_θ - Style-based Generator. We first use a mapping network [132] to map the aforementioned CLS token into an intermediate latent space W. The mapping network contains 8 fully-connected layers with a hidden dimension of 512 and outputs a 512-dimensional vector w. Following Karras et al. [132], we use w to drive a style-based generator. For each plane in our tri-plane representation (the xy, xz, and yz planes), we use a generator containing three up-sampling blocks with hidden dimensions of 512, 256, and 128, respectively. Finally, the style-based generators output tri-plane feature maps with 32 feature channels.

Decoder G_θ - Volume Rendering. Our volume renderer is implemented as two fully-connected layers, similar to Chan et al. [52]. The decoder takes as input the 32-dimensional aggregated feature vector from the style-based generator. For each pixel, we query 40 points: 24 uniformly sampled and 16 importance-sampled. We use MipNeRF [154] as our volume rendering module. Volume rendering is performed at a resolution of 128 × 128.

Discriminator. We use a StyleGAN2 [132] discriminator with hidden dimensions 16, 32, 64, 128, and 256, and R1 regularization with γ = 1.

Stage-2 Modeling M_ψ. We follow a shallower version of the network architecture and training setup introduced in [62]. We use 12 layers, 8 attention heads, an embedding dimension of 768, and a hidden dimension of 3072. The model uses learnable positional embeddings, Layer Normalization, and truncated normal initialization (stddev = 0.02). We use the following training hyperparameters: label smoothing of 0.1, dropout rate of 0.1, and the Adam optimizer [148] with β1 = 0.9 and β2 = 0.96. We use a cosine masking schedule, and during inference, token synthesis is performed in 10 steps.
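To illustrate the iterative stage-2 sampling, the sketch below shows a MaskGIT-style decoding loop over the 768 tri-plane tokens with a cosine masking schedule and 10 steps; the exact schedule form gamma(r) = cos(pi/2 * r) is an assumption based on the common MaskGIT formulation, and the model calls are placeholders:

```python
# Illustrative MaskGIT-style decoding with a cosine masking schedule
# (schedule form assumed; `model.predict` and `remask_least_confident`
# are hypothetical placeholders for the stage-2 transformer logic).
import math

NUM_TOKENS, NUM_STEPS = 16 * 16 * 3, 10   # 768 tri-plane latents, 10 inference steps

def num_masked(step: int) -> int:
    """Number of tokens still masked after `step` decoding iterations."""
    r = (step + 1) / NUM_STEPS
    return math.ceil(NUM_TOKENS * math.cos(math.pi / 2 * r))

for step in range(NUM_STEPS):
    remaining = num_masked(step)
    # tokens = model.predict(tokens)                      # fill all masked positions
    # tokens = remask_least_confident(tokens, remaining)  # keep confident predictions
    print(f"step {step}: {remaining} tokens remain masked")
```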
B.2. Aligning Tri-plane to Object Scale

Figure 9. Illustration of using a uniform tri-plane versus a scale-aligned tri-plane.

Since vehicles can have drastically different scales along their x, y, and z directions, using a naive uniform-scale tri-plane to cover the object leaves a lot of capacity under-utilized. As illustrated in the top row of Fig. 9, if we cover a normal sedan with a uniform-size tri-plane, most of the entries in the tri-plane features correspond to empty space. The problem becomes more severe for longer-tail instances such as trucks and buses, where the scale ratios among x, y, and z become even more extreme. To encourage more efficient use of the tri-plane features, we align the tri-plane latents to the object scale during the coordinate-feature orthographic projection step. As illustrated in the bottom row of Fig. 9, when querying the feature at a coordinate $p \in [0,1]^3 \subset \mathbb{R}^3$, given the object scale $(s_x, s_y, s_z)$ we simply rescale $p$ as $\hat{p} := p / [s_x, s_y, s_z] \in [0, \tfrac{1}{s_x}] \times [0, \tfrac{1}{s_y}] \times [0, \tfrac{1}{s_z}]$ and query the tri-plane features using $\hat{p}$. The orthographic projection follows the same tri-plane grid-sampling and aggregation as in prior works [52, 133, 134]. In the basic GINA-3D pipeline without scaled tri-plane features, the model learns to handle object scale implicitly. In our scaled-box model variations, the model leverages the object scale only in the tri-plane feature orthographic projection step and implicitly learns to produce feature maps that align with the object scale. As illustrated in the main paper, this design greatly improves model performance. We leave feeding object scale information to the model explicitly as a future direction to explore.

B.3. Evaluation Metrics

We discuss in detail the metrics used for our quantitative evaluations.

Image Quality. To evaluate image quality, we employ two metrics, Fréchet Inception Distance (FID) [149] and Mask Floater-Over-Union (Mask FOU), computed over 50K generated images. FID [149] is commonly used to evaluate the quality of 2D images. The generated images are encoded using a pretrained Inception v3 [155] model, and the output of the last pooling layer is used as the final encoding. The FID metric is computed as

$$\mathrm{FID}(I_g, I_v) = \lVert \mu_g - \mu_v \rVert_2^2 + \mathrm{Tr}\!\left[\Sigma_g + \Sigma_v - 2\sqrt{\Sigma_g \cdot \Sigma_v}\right] \quad (9)$$

where Tr denotes the trace operation, $\mu_g, \Sigma_g$ are the mean and covariance matrix of the generated image encodings, and $\mu_v, \Sigma_v$ are the mean and covariance matrix of the validation image encodings.

We additionally measure whether the generated texture forms a single full object, implemented by checking whether the generated pixels span a connected region; we compute the percentage of pixels that are not connected. Since all images from the baselines and GINA-3D are generated over a white background, we find connected components of object pixels using the findContours function from OpenCV [156] and use contourArea to find the largest connected component, which we denote $C_l$. We then use the aggregated density (alpha) values to find the entire shape's projection on the image, which we denote $S$. Mask FOU is simply the mean over the entire generated image set (as a percentage):

$$\text{Mask FOU}(I_g) = \frac{1}{|I_g|} \sum_{i \in I_g} \left(1 - \frac{\mathrm{Area}(C_{l,i})}{\mathrm{Area}(S_i)}\right) \quad (10)$$
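A minimal sketch of the Mask FOU computation for a single rendered image is shown below (assuming OpenCV and NumPy; the binarization threshold of 0.5 and the use of contour areas for both $S$ and $C_l$ are simplifying assumptions for illustration):

```python
# Sketch of Mask FOU for one image: fraction of the projected shape that
# does not belong to the largest connected component ("floaters").
import cv2
import numpy as np

def mask_fou_single(alpha: np.ndarray) -> float:
    """alpha: (H, W) aggregated density from volume rendering, in [0, 1]."""
    obj_mask = (alpha > 0.5).astype(np.uint8)          # binarization threshold assumed
    contours, _ = cv2.findContours(obj_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 1.0
    areas = [cv2.contourArea(c) for c in contours]
    area_s = sum(areas)                                # whole projected shape S
    area_cl = max(areas)                               # largest connected component C_l
    return 1.0 - area_cl / max(area_s, 1e-8)

# Mask FOU over a generated set is the mean of mask_fou_single, reported as a percentage.
```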
Image Diversity. We evaluate the semantic diversity of the generated images using the Coverage (COV) score and Minimum Matching Distance (MMD) [94] over pretrained CLIP [150] embeddings. Specifically, the Coverage (COV) score measures the fraction of images in the validation set that are matched to at least one image in the generated set. Formally, it is defined as

$$\mathrm{COV}(I_g, I_v) = \frac{\left|\left\{ \arg\min_{i \in I_v} \lVert \mathrm{CLIP}(i) - \mathrm{CLIP}(j) \rVert_2^2 \;\middle|\; j \in I_g \right\}\right|}{|I_v|} \quad (11)$$

Intuitively, COV uses the CLIP embedding distance to perform nearest-neighbor matching from each generated image to the validation set, and measures diversity as the percentage of the validation set that is matched as a nearest neighbor. However, COV is only one side of the story: a generated set can achieve a high COV score with purely random images that happen to be matched to the validation set. This issue is alleviated by incorporating the Minimum Matching Distance (MMD), which measures whether the nearest-neighbor matching yields high-quality pairs:

$$\mathrm{MMD}(I_g, I_v) = \frac{1}{|I_v|} \sum_{i \in I_v} \min_{j \in I_g} \lVert \mathrm{CLIP}(i) - \mathrm{CLIP}(j) \rVert_2^2 \quad (12)$$

Intuitively, MMD measures the average distance between each image in the validation set and its nearest neighbor in the generated set. MMD correlates well with how faithful the elements of the generated set are with respect to the validation set [94].

Geometry Quality. Due to the lack of 3D geometry ground truth for in-the-wild data, we measure geometry quality using the existing Consistency score from Or-El et al. [79] and a Mesh Floater-Over-Union (Mesh FOU), which measures whether the geometry forms a single connected object. The Consistency score measures whether the implicit fields are evaluated at consistent 3D locations, an important characteristic for view-consistent renderings [79]. In practice, it measures depth-map consistency across viewpoints by back-projecting depth maps into 3D space. For each model, we normalize the object's longest edge to a length of 10 for numerical clarity, and compare two depth maps at an angle difference of 45 degrees along the z-axis (yaw). We calculate consistency across depth maps for all images in the generated set, denoted $D_g$:

$$\mathrm{Consistency}(D_g) = \frac{1}{|D_g|} \sum_{i \in D_g} \mathrm{CD}(i, i_{\mathrm{rot}}) \quad (13)$$

where $i_{\mathrm{rot}}$ is the depth map after rotating the viewpoint by 45 degrees along the z-axis.

We additionally measure whether each generated shape forms a single full object, by checking whether the generated mesh is a single connected mesh; we compute the percentage of mesh surface area that is not connected. We use surface area rather than volume because volume computation is unstable for non-watertight meshes. For each generated mesh $S$, we use the split function from Trimesh [157] to find the largest connected component, denoted $C_l$. Mesh FOU is simply the mean over the entire generated mesh set $M_g$ (as a percentage):

$$\text{Mesh FOU}(M_g) = \frac{1}{|M_g|} \sum_{i \in M_g} \left(1 - \frac{\mathrm{Area}(C_{l,i})}{\mathrm{Area}(S_i)}\right) \quad (14)$$

Figure 10. Illustrations of aggregated point clouds.

Geometry Diversity. We again use Coverage (COV) and Minimum Matching Distance (MMD) to measure diversity. However, due to the lack of ground-truth full 3D shapes for in-the-wild data, the metric needs to be designed more carefully. A source of accurate but partial geometry is the aggregation of LiDAR point-cloud scans of a given instance across different observations. We uniformly subsample 2048 points from each aggregated point cloud; examples are shown in Fig. 10. As shown in the figure, the aggregated point clouds are indicative of the underlying shapes but are incomplete. Chamfer distance, a common metric for shape similarity, computes bi-directional nearest neighbors; due to the incompleteness, however, finding nearest neighbors of the generated points in the partial point cloud would only produce noisy matches. Therefore, we do not measure the two-sided Chamfer distance, but only the distance from the validation point clouds to their nearest neighbors in the generated meshes. Formally, we have

$$\mathrm{COV}(M_g, P_v) = \frac{\left|\left\{ \arg\min_{i \in P_v} D(i, j) \;\middle|\; j \in M_g \right\}\right|}{|P_v|} \quad (15)$$

$$\mathrm{MMD}(M_g, P_v) = \frac{1}{|P_v|} \sum_{i \in P_v} \min_{j \in M_g} D(i, j) \quad (16)$$

$$D(i, j \mid i \in P_v, j \in M_g) = \frac{1}{|i|} \sum_{x \in i} \min_{y \in j} \lVert x - y \rVert_2^2 \quad (17)$$
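The one-sided matching in Eqs. 15-17 can be sketched as follows (assuming NumPy; generated meshes are assumed to be pre-sampled into point sets, so $D(i, j)$ reduces to a one-sided squared Chamfer distance):

```python
# Sketch of geometry COV/MMD with one-sided nearest-neighbor distances.
import numpy as np

def one_sided_dist(val_pts: np.ndarray, gen_pts: np.ndarray) -> float:
    """D(i, j): mean squared distance from each validation point to its
    nearest neighbor among the generated points."""
    d2 = ((val_pts[:, None, :] - gen_pts[None, :, :]) ** 2).sum(-1)  # (Nv, Ng)
    return d2.min(axis=1).mean()

def cov_and_mmd(val_clouds, gen_clouds):
    """val_clouds: list of (2048, 3) aggregated LiDAR clouds P_v;
    gen_clouds: list of point sets sampled from generated meshes M_g."""
    dists = np.array([[one_sided_dist(v, g) for v in val_clouds]
                      for g in gen_clouds])           # (|M_g|, |P_v|)
    matched = {int(dists[j].argmin()) for j in range(len(gen_clouds))}
    cov = len(matched) / len(val_clouds)              # Eq. 15
    mmd = dists.min(axis=0).mean()                    # Eq. 16
    return cov, mmd
```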
B.4. Conditional Synthesis

We showcased various conditional synthesis tasks in the main paper; we provide more details here.

Discrete Conditions. We feed discrete conditions (object class, time-of-day) as additional tokens to MaskGIT. Specifically, we increase the vocabulary size by the number of classes in the discrete condition. Object class has 4 options (cars, trucks, buses, and others), and time-of-day is a binary variable (day versus night). The vocabulary thus becomes 2048 + 4 for object class and 2048 + 2 for time-of-day. We feed the conditional input as an additional token concatenated to the 768 tri-plane latents, resulting in an input sequence of length 769. The sequence is then fed into MaskGIT for masked token prediction, as in the unconditional case.

Continuous Conditions. Alternatively, we feed continuous conditions to MaskGIT by concatenating the conditional input with the output of an intermediate MaskGIT layer. Specifically, MaskGIT first generates a word embedding for each token in the sequence. We pass the continuous condition through a fully-connected layer and concatenate the output with each token's word embedding; the concatenated embedding is then passed through the rest of the network. To synthesize samples conditioned on object semantics, we feed the semantic embedding from a pretrained DINO model [146]. To condition on object scale, we pass in a positional embedding of the object scale, using standard cosine and sine positional embeddings of degree 6.

Image-conditioned Asset Variations. Given our mask-based iterative sampling stage, we can generate image-conditioned assets with variations. We first use the stage-1 model to perform reconstruction, retrieving a full set of predicted tri-plane latents. We then generate variations of the reconstructed instance by randomly masking out tri-plane latents, where the degree of variation is controlled by the number of masked tokens. When masking 90% of the tokens, the variations are mostly reflected in different textures of the generated assets; when masking 99% of the tokens, object shapes change more significantly while the general object class remains the same. We believe that better controlling this variation process is an interesting direction to explore in the future.
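A hypothetical sketch of this variation procedure is shown below; `encoder`, `quantize`, and `maskgit_sample` stand in for the stage-1 encoder, the codebook lookup, and the stage-2 sampler, and the mask-token id is illustrative:

```python
# Hedged sketch of image-conditioned variations via partial re-masking
# of the reconstructed tri-plane tokens (PyTorch-style tensors assumed).
import torch

def generate_variation(image, encoder, quantize, maskgit_sample,
                       mask_ratio=0.9, mask_id=2048):
    # Stage 1: reconstruct the full set of 768 tri-plane token ids.
    latents = encoder(image)                       # (1, 768, 32) continuous latents
    token_ids = quantize(latents)                  # (1, 768) codebook indices
    # Randomly re-mask a fraction of the tokens; higher ratios change shape,
    # lower ratios mostly change texture.
    mask = torch.rand_like(token_ids.float()) < mask_ratio
    token_ids = token_ids.masked_fill(mask, mask_id)
    # Stage 2: iteratively fill in the masked tokens with the learned prior.
    return maskgit_sample(token_ids)
```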
Appendix C. Additional GINA Visualizations

We present additional visualizations of the GINA-3D model in Fig. 11.

Figure 11. Additional qualitative results of GINA-3D.

Appendix D. Ablation on Loss Terms and Stage 1 Evaluation

|                                | w/o L_GAN | w/o LPIPS | w/o L_α | w/o L_VQ | No VQ | \|K\| = 2^10 | \|K\| = 2^12 | Full |
|--------------------------------|-----------|-----------|---------|----------|-------|--------------|--------------|------|
| Generative Metric (FID)        | 65.1      | 83.0      | 80.3    | 64.7     | -     | 66.2         | 58.9         | 59.5 |
| Recon. Metric (ℓ2 input view)  | 1.78      | 1.92      | 1.44    | 1.62     | 1.01  | 2.21         | 1.81         | 1.55 |
| Recon. Metric (ℓ2 cross view)  | 2.42      | 2.28      | 1.55    | 2.14     | 1.71  | 2.28         | 2.30         | 1.83 |

Table 3. We perform various ablation studies on 1) removing each term in our overall loss function; 2) removing vector quantization entirely; 3) different codebook sizes K. We further report the stage 1 model's reconstruction quality using ℓ2 losses for input views as well as novel views. (The 1st, 2nd, and 3rd best results are color-coded in the original table.)

Ablation study. In this experiment, we use our scaled-box model trained with LiDAR supervision as the base model. We conduct ablation studies by removing each loss term, removing quantization entirely, and training with different codebook sizes. As shown in Table 3, the ablation results justify each loss term introduced in the paper, as removing any one of them leads to a higher FID compared to the full model. This finding is consistent with Esser et al. [60], which suggests that LPIPS is important for visual fidelity. In addition, a larger codebook K (2^12) has marginal impact in our setting.

Evaluating the stage 1 model. We report the ℓ2 reconstruction loss (in 10^-2) on input and novel views of unseen instances. The model obtains better reconstruction performance when quantization is removed entirely (No VQ), but this removes the discrete codebook needed for stage-2 generative training. While generation and reconstruction quality correlate to some extent, the performance rankings (color-coded) differ between them.

Appendix E. Discussions on Baselines

Generating the full images. Directly modeling full images of in-the-wild data poses significant challenges. In the early stage of the project, we experimented with directly applying GAN-based approaches to full images. As illustrated in Fig. 12-a,b, feeding full images without explicit modeling of occlusion makes learning challenging on our benchmark. For EG3D, we observed that training with unmasked images leads to training collapse, due to the absence of foreground and background modeling. For example, the generated image samples in Fig. 13-b lack diversity in shape and appearance (e.g., color). We clarify that a key difference from pure GAN-based approaches (GIRAFFE and EG3D) is that our approach has two training stages, and the masked loss is only applied in the first stage to reconstruct the input. In other words, the masked loss cannot be directly applied to existing GAN-based approaches, as the corresponding object mask for each generated RGB image is not observed in the adversarial (encoder-free) training. Alternatively, one can still apply a masked loss by factorizing RGB, object silhouette, and occlusion. We tried many variations of this idea in the early stage of the project, to no avail, as learning disentangled factors proved challenging for adversarial training. We provide such examples in Fig. 12-c: in this experiment, we tried to extend EG3D by generating occlusion masks with a separate branch. However, the training became very unstable, and we were not able to improve over the original EG3D on our benchmark. As shown, the model fails to disentangle object silhouette and occlusion: it still generates partial shapes while producing some plausible foreground occlusion. In fact, occlusion is even more challenging to generate explicitly on our data, where object silhouettes and occlusion masks are entangled, as the outcome depends on the view and layout.

Generating the object images. Whitening out non-object regions has been used by EG3D (see ShapeNet-Cars in its supplementary material) and GET3D. It blends white color into pixels with α < 1 during neural rendering, which implicitly supervises α. This setup separates object pixels from the surroundings and focuses generation on object modeling. We follow this design, and have found in our experiments that the baselines fail to generate a separated target object without whitening-out.
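A minimal sketch of this whitening-out composition is shown below (assuming PyTorch tensors from a NeRF-style renderer, where `rgb` is the alpha-accumulated color and `alpha` the accumulated opacity):

```python
# Blend pixels with accumulated opacity < 1 toward a white background.
import torch

def composite_on_white(rgb: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """rgb: (..., 3) accumulated color; alpha: (..., 1) accumulated opacity in [0, 1]."""
    white = torch.ones_like(rgb)
    return rgb + (1.0 - alpha) * white
```

Because the white background is a constant, any mismatch between the composited pixel and the target image sends gradients through the (1 - α) term, which is how α ends up being implicitly supervised.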
We provide additional details about the baseline methods GIRAFFE and EG3D in Fig. 13. We noticed that the learned GIRAFFE models are capable of generating vehicle-like patches, but with viewpoints, occlusions, and identities entangled in the latent space. For example, we generate a pair of images (Fig. 13(a)) by varying the viewpoint variable while keeping the identity latent variable fixed. It turns out that the generations are not easily controllable by the viewpoint variables, while the vehicle identities often change across views. The entangled representation makes the extracted meshes not very meaningful for the GIRAFFE baseline on our benchmark. Additionally, geometry extraction becomes even harder as the rendering mask is defined at a low resolution of 16 × 16.

Figure 12. Additional generation results trained on full images: (a) generated image samples from GIRAFFE trained on full images (view, view+45°); (b) generated image samples from EG3D trained on full images (random views); (c) generated image samples from jointly generating object RGB and occlusion masks (random views). In (c), we augmented the EG3D model to jointly generate object RGB, background RGB, and occlusion masks, and we visualize object RGB and its corresponding occlusion mask in alternating columns. The results suggest that it is difficult for the model to disentangle object shape and occlusion.

Figure 13. Additional generation results trained on masked images: (a) generated image samples from GIRAFFE trained on masked images (fixed views); (b) generated image samples from EG3D trained on masked images (fixed views). Both are additional visualizations of the baselines reported in the main paper. The results suggest that it is difficult for GIRAFFE to disentangle rotation, and both baselines show significant occlusion artifacts.

Figure 14. Example mesh extractions from EG3D and GINA-3D: (a) a random batch of 16 EG3D extracted meshes; (b) a random batch of 16 GINA-3D extracted meshes.

Appendix F. Extracted Meshes

As mentioned in the main text, we use marching cubes [151] with a density threshold of 10 to extract meshes for geometry evaluation. We showcase random samples of extracted meshes from EG3D and GINA-3D, with 16 examples each in Figs. 14a-14b. As we can see, EG3D meshes can contain artifacts such as missing parts of the shape (row 3, right two), and they show relatively little diversity. GINA-3D not only preserves complete shapes but also demonstrates greater diversity, including more shape variation and semantic variation (mini-van: row 2, column 4; bus: row 4, column 4). This observation is consistent with our quantitative evaluations. However, we do observe that GINA-3D meshes can be non-watertight and contain holes; we hope to address this in future work. We believe that incorporating other representations, such as Signed Distance Fields (SDFs), could further improve mesh quality.
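For reference, mesh extraction from a sampled density grid at this threshold might look like the following sketch (assuming scikit-image and trimesh; `query_density` is a hypothetical handle to the generator's density field, and the 128^3 grid resolution and coordinate normalization are illustrative):

```python
# Hedged sketch of marching-cubes mesh extraction at density threshold 10.
import numpy as np
import trimesh
from skimage import measure

def extract_mesh(query_density, res=128, threshold=10.0):
    # Sample the implicit density field on a regular grid in [0, 1]^3.
    xs = np.linspace(0.0, 1.0, res)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (res, res, res, 3)
    density = query_density(grid.reshape(-1, 3)).reshape(res, res, res)
    # Run marching cubes at the chosen iso-level and build a mesh.
    verts, faces, _, _ = measure.marching_cubes(density, level=threshold)
    verts = verts / (res - 1)                       # back to [0, 1]^3 coordinates
    mesh = trimesh.Trimesh(vertices=verts, faces=faces)
    # The Mesh FOU metric (Eq. 14) uses the largest connected component by area.
    largest = max(mesh.split(only_watertight=False), key=lambda m: m.area)
    return mesh, largest
```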