SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving

Zhenpei Yang1∗, Yuning Chai2, Dragomir Anguelov2, Yin Zhou2, Pei Sun2, Dumitru Erhan3, Sean Rafferty2, Henrik Kretzschmar2
1UT Austin, 2Waymo, 3Google Brain
∗Work done as an intern at Waymo. Correspondence: yzp@utexas.edu

Abstract

Autonomous driving system development is critically dependent on the ability to replay complex and diverse traffic scenarios in simulation. In such scenarios, the ability to accurately simulate the vehicle sensors such as cameras, LiDAR, or radar is essential. However, current sensor simulators leverage gaming engines such as Unreal or Unity, requiring manual creation of environments, objects, and material properties. Such approaches have limited scalability and fail to produce realistic approximations of camera, LiDAR, and radar data without significant additional work. In this paper, we present a simple yet effective approach to generate realistic scenario sensor data, based only on a limited amount of LiDAR and camera data collected by an autonomous vehicle. Our approach uses texture-mapped surfels to efficiently reconstruct the scene from an initial vehicle pass or set of passes, preserving rich information about object 3D geometry and appearance, as well as the scene conditions. We then leverage a SurfelGAN network to reconstruct realistic camera images for novel positions and orientations of the self-driving vehicle and moving objects in the scene. We demonstrate our approach on the Waymo Open Dataset and show that it can synthesize realistic camera data for simulated scenarios. We also create a novel dataset that contains cases in which two self-driving vehicles observe the same scene at the same time. We use this dataset to provide additional evaluation and demonstrate the usefulness of our SurfelGAN model.

1. Introduction

Recent advances in deep learning have inspired breakthroughs in multiple areas related to autonomous driving, such as perception [15, 26], prediction [4, 6], and planning [12]. These recent trends only underscore the increasingly significant role of data-driven system development. One aspect is that deep learning networks benefit from large training datasets. Another is that autonomous driving system evaluation requires the ability to realistically replay a large set of diverse and complex scenarios in simulation, capturing sensor properties, seasons, time of day, and weather. Developing simulators that support the levels of realism required for autonomous system evaluation is a challenging task.

There are many ways to design simulators, including simulating mid-level object representations [4, 11]. However, mid-level representations omit subtle perceptual cues that are important for scene understanding, such as pedestrian gestures and blinking lights on vehicles. Furthermore, as end-to-end models that combine perception, prediction, and sometimes even control become an increasingly popular direction of research, we are faced with the need to faithfully simulate the sensor data, which is the input to such models during scenario replay.

Frameworks for autonomous driving that support realistic sensor simulation are traditionally built on top of gaming engines such as Unreal or Unity [11]. The environment and its object models are created and arranged manually to approximate real-world scenes of interest. In order to enable realistic LiDAR and radar modeling, material properties often need to be manually specified as well.
The overall process is time-consuming and effort-intensive. Furthermore, simple ray-casting or ray-tracing techniques are often insufficient to generate realistic camera, LiDAR, or radar data for a specific self-driving system, and additional work is required to adapt the simulated sensor statistics to the real sensors.

In this work, we propose a simple yet effective data-driven approach for creating realistic scenario sensor data. Our approach relies on camera and LiDAR data collected during a single pass, or several passes, of an autonomous vehicle through a scene of interest. We use this data to reconstruct the scene using a texture-mapped surfel representation. This representation is simple and computationally efficient to create, and it preserves rich information about the 3D geometry, semantics, and appearance of all objects in the scene. Given the surfel reconstruction, we can render the scene for novel poses of the self-driving vehicle (SDV) and the other scenario agents. The rendered reconstruction for these novel views may have some missing parts due to occlusion differences between the initial and the new scene configuration. It can also have visual quality artifacts due to the limited fidelity of the surfel reconstruction. We address this gap by applying a GAN network [14] to the rendered surfel views to produce the final high-quality image reconstructions. An overview of our proposed system is illustrated in Fig. 1.

Figure 1. Overview of our proposed system. a) The goal of this work is the generation of camera images for autonomous driving simulation. When provided with a novel trajectory of the self-driving vehicle in simulation, the system generates realistic visual sensor data that is useful for downstream modules such as an object detector, a behavior predictor, or a motion planner. At a high level, the method consists of two steps: b) First, we scan the target environment and reconstruct a scene consisting of richly textured surfels. c) Surfels are rendered at the camera pose of the novel trajectory, alongside semantic and instance segmentation masks. Through a GAN [14], we generate realistic-looking camera images.

Our work makes the following contributions: 1) We describe a pipeline that builds a detailed reconstruction of a dynamic scene from real-world sensor data. This representation allows us to render novel views in the scene, corresponding to deviations of the SDV and the other agents in the environment from their initially captured trajectories (Sec. 3.1). 2) We propose a GAN architecture that takes in the rendered surfel views and synthesizes images with quality and statistics approaching that of real images (Tab. 1). 3) We build the first dataset for reliably evaluating the task of novel view synthesis for autonomous driving, which contains cases in which two self-driving vehicles observe the same scene at the same time. We use this dataset to provide additional evaluation and demonstrate the usefulness of our SurfelGAN model.

2. Related Work

Simulated Environments for Driving Agents. There have been many efforts towards building simulated environments for various tasks [5, 11, 40, 41, 42]. Much work has focused on indoor environments [5, 40, 42] based on public indoor datasets such as SUNCG [35] or Matterport3D [7].
In contrast to indoor settings, where the environment is relatively simple and easy to model, simulators for autonomous driving face significant challenges in modeling the complicated and dynamic scenarios of real-world scenes. TORCS [41] is one of the first simulation environments that support multi-agent racing, but it is not tailored for real-world autonomous driving research and development. DeepGTAV [1] provides a plugin that transforms the Grand Theft Auto gaming environment into a vision-based self-driving car research environment. CARLA [11] is a popular open-source simulation engine that supports the training and testing of SDVs. All these simulators rely on the manual creation of synthetic environments, which is a formidable and laborious process. In CARLA [11], the 3D model of the environment, which includes buildings, roads, vegetation, vehicles, and pedestrians, is manually created. The simulator provides one town with 2.9 km of drivable roads for training and another town with 1.4 km of drivable roads for testing. In contrast, our system is easily extendable to new scenes that are driven by an SDV. Furthermore, because the environment we are building is a high-quality reconstruction based on the vehicle sensors, it naturally closes the domain gap between synthetic and real content, which is present in most traditional simulation environments. Similar to this work, AADS [23] utilizes real sensor data to synthesize novel views. The major difference is that we reconstruct the 3D environment, while AADS uses purely image-based novel view synthesis. Reconstructing the 3D environment gives us the freedom to synthesize novel views that could not be easily captured in the real world. Moreover, once our environment is built, we no longer need to store the images or query the nearest K views at synthesis time, which saves time at deployment.

Learning on Synthetic Data. Besides enabling end-to-end training and evaluation of agents, simulated environments can also provide a large amount of data for training deep neural networks. [32] uses a synthetic scene to generate a large amount of fully labeled training data for urban scene segmentation. [19] generates images containing novel placements of dynamic objects to boost the performance of object detection.

Geometric Reconstruction and 3D Representations. A typical approach to the 3D reconstruction of outdoor environments is to use structure from motion [37, 39] or multi-view stereo [13] to recover a dense 3D point cloud from image collections, and then optionally use Poisson reconstruction [21] to obtain a mesh representation. Such a paradigm is most suitable when we have multiple images covering the same area from different perspectives, which is not always true in our case. Thanks to the rapid advancement of LiDAR technology, we have access to accurate depth information that complements the camera image data. Our approach leverages the traditional surfel representation [31] augmented with fine-grained image textures, which not only greatly simplifies the 3D reconstruction process, but also effectively models object appearance and color with high fidelity. Truncated Signed Distance Functions [10] and their most recent variants [29] are also promising alternatives to surfel-based modeling. Recent work by Aliev et al.
[2] augments 3D point clouds with a learnable neural descriptor for rendering purposes and has also shown promising results; however, it assumes a static environment and is not applicable in practice to outdoor driving scenarios, which usually contain tens of millions of points.

GAN-based Image Translation. Generative Adversarial Networks (GANs) [14] have attracted broad interest in both academia and industry. While [14] aims to synthesize realistic images directly, [18] targets the conditional image synthesis setting. Subsequent research [3, 20, 30, 44] has made great strides in improving the quality of images generated by GAN methods; we refer the readers to [9] for an overview. [38] proposes a model trained on Cityscapes [8] that can convert videos of semantic segmentation masks into videos of realistic images. Its requirement of having accurate per-pixel semantic annotations of the scenes of interest may be difficult to satisfy. In contrast, our approach requires only accurate 3D bounding boxes for the moving objects in the scene, which can be more cost-effective to obtain by human annotation. We believe even this requirement can be further relaxed by replacing the ground-truth 3D boxes with 3D boxes produced by running a high-quality offline 3D perception pipeline on the SDV sensor data. Finally, in our work we also address the traditional challenge of GAN evaluation by proposing two new metrics that are suitable for the task of novel view synthesis.

3. Approach

In this section, we describe the key innovations of this work: 1) texture-enhanced surfel scene reconstruction and 2) SurfelGAN-based image synthesis applied to the novel rendered scene views. Their combination enables the creation of a realistic data-driven sensor simulation environment.

3.1. Surfel Scene Reconstruction

Enhanced Surfel Map. A good scene reconstruction model enables the faithful preservation of the sensor information while remaining efficient in terms of computation and storage. Towards this goal, we propose a novel texture-enhanced surfel map representation. Surfels are compact, easy to reconstruct, and, because of their fixed size, easy to texture and compress. Below we describe our approach, which can preserve more fine-grained details compared to traditional surfel map representations [31].

We discretize the scene into a 3D voxel grid of fixed size and process the LiDAR scans in the order they are captured. For each voxel, we construct a surfel disk by estimating the mean coordinate and the surfel normal based on all the LiDAR points in that voxel. The surfel disk radius is defined as √3 · v, where v denotes the voxel size. For the LiDAR points binned in a voxel, we also have the corresponding colors from the camera image, which we can use to estimate the surfel color. Note that traditional surfel maps suffer from a trade-off between geometry consistency and fine-grained detail: a large voxel size gives better geometry consistency but fewer details, while a small voxel size yields finer details but less stable geometry. Therefore, we take an alternative approach that aims to achieve both good geometry consistency and rich texture details. Specifically, we discretize each surfel disk into a k × k grid centered on its point centroid, as illustrated in Fig. 1b. Each grid center is assigned an independent color to encode higher-resolution texture details.
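As an illustration of this construction, the following numpy sketch builds a single texture-enhanced surfel from the LiDAR points that fall into one voxel. It assumes the per-point colors have already been looked up by projecting the points into the camera image, and the nearest-point texel coloring is an illustrative choice rather than the authors' exact procedure.

```python
import numpy as np

def build_textured_surfel(points, colors, voxel_size=0.2, k=5):
    """Build one texture-enhanced surfel from the LiDAR points of a single voxel.

    points: (N, 3) LiDAR points falling into the voxel (assumes several points).
    colors: (N, 3) RGB colors of those points, obtained by projecting them
            into the camera image (projection omitted here for brevity).
    Returns the surfel center, normal, radius, and a k x k texel color grid.
    """
    center = points.mean(axis=0)
    # Normal: eigenvector of the point covariance with the smallest eigenvalue.
    cov = np.cov((points - center).T)
    _, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]

    # Disk radius sqrt(3) * v, as defined above.
    radius = np.sqrt(3.0) * voxel_size

    # Two tangent directions spanning the disk plane.
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u /= np.linalg.norm(u)
    w = np.cross(normal, u)

    # k x k texel grid on the disk; each texel takes the color of the
    # nearest LiDAR point after projection into the disk plane.
    ticks = (np.arange(k) + 0.5) / k * 2.0 - 1.0          # texel centers in [-1, 1]
    pts_2d = np.stack([(points - center) @ u, (points - center) @ w], axis=-1)
    texels = np.zeros((k, k, 3))
    for i, ti in enumerate(ticks):
        for j, tj in enumerate(ticks):
            texel_2d = np.array([ti, tj]) * radius
            nearest = np.argmin(np.linalg.norm(pts_2d - texel_2d, axis=-1))
            texels[i, j] = colors[nearest]
    return center, normal, radius, texels
```

With v = 0.2 m and k = 5, as used in our experiments, each surfel stores 25 texel colors in addition to its geometric parameters.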
Since each surfel may have a different appearance across frames, due to variations in lighting conditions and changes in relative pose (distance and viewing angle), we propose to enhance the surfel representation by creating a codebook of such k × k grids at n different distances. For each grid bin, we determine its color from the first observation, which we found to be important for obtaining smooth rendered images. During the rendering stage, we determine which k × k patch to use based on the camera pose. The final rendering is shown in Fig. 2. We can see that the baseline surfel map introduces many artifacts at object boundaries and yields non-smooth coloring in non-boundary areas. In contrast, our texture-enhanced surfel map eliminates most of these artifacts and leads to vivid-looking images. In our experiments, we use v = 0.2 m, k = 5, and n = 10.

Figure 2. Visualization of different scene modeling strategies. Top row: surfel baseline; center row: our Texture-Enhanced Surfel Map (also referred to as the surfel rendering in the rest of the paper); bottom row: real camera image.

Handling Dynamic Objects. We consider vehicles to be rigid dynamic objects and reconstruct a separate model for each. For simplicity, we leverage the high-quality 3D bounding box annotations from the Waymo Open Dataset [36] to accumulate the LiDAR points from multiple scans for each object of interest. We apply the Iterative Closest Point (ICP) algorithm [33] to refine the point cloud registration, producing a dense point cloud that allows an accurate, enhanced surfel reconstruction of each vehicle. Please see Sec. A for reconstructed examples. Our approach does not strictly require 3D box ground truth; we could also leverage state-of-the-art vehicle detection and tracking algorithms [27, 34] to obtain initial estimates for ICP. However, we leave this experiment for future work.

When simulating the environment, the reconstructed vehicle models can be placed in any location of choice. In the case of pedestrians, which are deformable objects, we reconstruct a separate surfel model from each LiDAR scan. We allow placement of the reconstructed pedestrian anywhere in the scene for that scan. We leave the task of accurate deformable model reconstruction from multiple scans to future work.

3.2. Image Synthesis via SurfelGAN

While the surfel scene reconstruction provides a rich representation of the environment, it produces surfel-based renderings that have a non-negligible realism gap when compared to real images, due to incomplete reconstruction and imperfect geometry and texturing (see Fig. 2). Our SurfelGAN model is explicitly designed to address this issue.

SurfelGAN is a generative model that converts surfel image renderings into realistic-looking images. We treat semantic and instance segmentation maps as additional rendered image channels. For the sake of simplicity, we omit their explicit mention in the rest of this section. Let the generator $G^{S \to I}_{\theta_S}$ be an encoder-decoder model with learnable parameters $\theta_S$. Given pairs of surfel renderings $S_p$ and images $I_p$, a supervised loss can be applied to train the generator. We call a SurfelGAN model that is trained solely with supervised learning SurfelGAN-S. Additionally, we can add an adversarial loss from a real-image discriminator $D^I_{\phi_I}$. A SurfelGAN trained with this additional loss is named SurfelGAN-SA. However, paired training data between surfel renderings and real images is very limited.
Unpaired data, however, is easy to obtain. We leverage unpaired data for two purposes: improving the generalization of the discriminator by training with more unlabeled examples, and regularizing the generator by enforcing cycle consistency. Let the reverse generator $G^{I \to S}_{\theta_I}$ be another encoder-decoder model with the same architecture as $G^{S \to I}_{\theta_S}$, except for additional output channels for the semantic and instance maps. Then any surfel rendering, paired ($S_p$) or unpaired ($S_u$), can be translated to a real image and translated back to a surfel rendering, where a cycle-consistency loss can be applied. The same applies to any paired ($I_p$) or unpaired ($I_u$) real image. Finally, we add a surfel rendering discriminator $D^S_{\phi_S}$ that judges generated surfel images. We call SurfelGANs trained with the additional cycle consistency SurfelGAN-SAC. An intuitive overview of the training strategy is shown in Fig. 3, while Sec. 4 contains a detailed description of our paired and unpaired data. We optimize the following objective:

$$
\max_{\phi_S,\phi_I}\;\min_{\theta_S,\theta_I}\;
\mathcal{L}_r(G^{S\to I}_{\theta_S}, S_p, I_p)
+ \lambda_1 \mathcal{L}_r(G^{I\to S}_{\theta_I}, I_p, S_p)
+ \lambda_2 \mathcal{L}_a(G^{S\to I}_{\theta_S}, D^I_{\phi_I}, S_{p,u})
+ \lambda_3 \mathcal{L}_a(G^{I\to S}_{\theta_I}, D^S_{\phi_S}, I_{p,u})
+ \lambda_4 \mathcal{L}_c(G^{S\to I}_{\theta_S}, G^{I\to S}_{\theta_I}, S_{p,u})
+ \lambda_5 \mathcal{L}_c(G^{I\to S}_{\theta_I}, G^{S\to I}_{\theta_S}, I_{p,u}),
\qquad (1)
$$

where $\mathcal{L}_r$, $\mathcal{L}_a$, and $\mathcal{L}_c$ denote the supervised reconstruction, adversarial, and cycle-consistency losses, respectively. We use a hinged Wasserstein loss for adversarial training [24, 28, 43] in our experiments, as it helps stabilize the training. We use the ℓ1 loss as the reconstruction and cycle-consistency loss for renderings and images, and the cross-entropy loss for the semantic and instance maps.
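For concreteness, the following PyTorch-style sketch shows how the generator-side terms of Eq. (1) and the hinge discriminator loss can be assembled. The names G_s2i, G_i2s, D_img, and D_surf stand for $G^{S\to I}$, $G^{I\to S}$, $D^I$, and $D^S$; the semantic/instance channels, their cross-entropy terms, and the optimizer loop are omitted, and the optional per-pixel weight map anticipates the distance-weighted loss described below. This is an illustration of the loss structure under those assumptions, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def weighted_l1(pred, target, weight=None):
    # l1 reconstruction loss, optionally modulated by a per-pixel weight map
    # (the distance-based weighting introduced below); weight=None is plain l1.
    diff = (pred - target).abs()
    return (diff * weight).mean() if weight is not None else diff.mean()

def generator_objective(G_s2i, G_i2s, D_img, D_surf,
                        S_p, I_p, S_u, I_u, weight=None,
                        lambdas=(1.0, 0.001, 0.001, 0.1, 0.1)):
    """Generator-side terms of Eq. (1) with the weights used in our experiments."""
    l1, l2, l3, l4, l5 = lambdas
    S_all, I_all = torch.cat([S_p, S_u]), torch.cat([I_p, I_u])

    # Supervised reconstruction on paired data (SurfelGAN-S).
    loss = weighted_l1(G_s2i(S_p), I_p, weight)
    loss = loss + l1 * weighted_l1(G_i2s(I_p), S_p, weight)

    # Hinge-style adversarial terms on paired + unpaired data (SurfelGAN-SA).
    loss = loss + l2 * (-D_img(G_s2i(S_all)).mean())
    loss = loss + l3 * (-D_surf(G_i2s(I_all)).mean())

    # Cycle-consistency terms through both generators (SurfelGAN-SAC).
    loss = loss + l4 * F.l1_loss(G_i2s(G_s2i(S_all)), S_all)
    loss = loss + l5 * F.l1_loss(G_s2i(G_i2s(I_all)), I_all)
    return loss

def discriminator_hinge_loss(D, real, fake):
    # Hinge (hinged Wasserstein) loss for either discriminator;
    # `fake` should be detached from the generator graph before this call.
    return F.relu(1.0 - D(real)).mean() + F.relu(1.0 + D(fake)).mean()
```

The discriminators maximize their own hinge objectives while the generators minimize the combined loss above, matching the max–min structure of Eq. (1).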
Figure 3. (Best viewed in color.) SurfelGAN training paradigm. The training setup has two symmetric encoder-decoder generators mapping from surfel renderings to real images ($G^{S \to I}$) and vice versa ($G^{I \to S}$). Additionally, there are two discriminators, $D^S$ and $D^I$, which specialize in the surfel and the real domain, respectively. The losses are shown as colored arrows. Green: supervised reconstruction loss. Red: adversarial loss. Blue/yellow: cycle-consistency losses. When training with paired data, e.g., WOD-TRAIN, the surfel renderings translate to real images, and we can apply either a one-directional supervised reconstruction loss only (SurfelGAN-S) or an additional adversarial loss (SurfelGAN-SA). When training with unpaired data, we can start from either the surfel renderings (e.g., WOD-TRAIN-NV) or the real images (e.g., the Internal Camera Image Dataset), use one of the encoder-decoder networks to map to the other domain and back, and then apply a cycle-consistency loss (SurfelGAN-SAC). The encoder-decoder generators consist of 8 convolutional and 8 deconvolutional layers. The discriminators consist of 5 convolutional layers. All networks operate on 256 × 256 inputs.

Distance Weighted Loss. Due to the limited coverage of the surfel map, our surfel renderings contain large unknown regions. The uncertainty in those regions is much higher than in regions with surfel information. In addition, the distance between the camera and the surfels introduces another source of uncertainty. Therefore, we use a distance-weighted loss to stabilize GAN training. Specifically, during data pre-processing, we generate a distance map that records, for each pixel, the nearest distance to the observed region, and we then use this distance information as a weighting coefficient to modulate our reconstruction loss.

Training Details. We use the Adam optimizer [22] for training. We set the initial learning rate to 2e-4 for both the generator and the discriminator, with β1 = 0.5 and β2 = 0.9. We use batch normalization [17] after the ReLU activations. We set λ1 = 1, λ2 = λ3 = 0.001, and λ4 = λ5 = 0.1 in all of our experiments. The total training time of our network is 3 days on one Nvidia Titan V100 GPU with batch size 8.

4. Experimental Results

We base our experiments mainly on the Waymo Open Dataset [36], but we also collected two additional datasets in order to obtain a higher-quality model and enable a more extensive evaluation.

Waymo Open Dataset (WOD) [36]. The dataset consists of 798 training (WOD-TRAIN) and 202 validation (WOD-EVAL) sequences. Each sequence contains 20 seconds of camera and LiDAR data captured at 10 Hz, as well as fully annotated 3D bounding boxes for vehicles, pedestrians, and cyclists. The LiDAR data covers a full 360 degrees around the agent, while five cameras capture the frontal 180 degrees. After reconstructing the surfel scenes, we can render the surfel images at the same poses as the original camera images, hence generating surfel-image-to-camera-image pairs that can be used for paired training and evaluation. Since we know the category of each surfel from the reconstruction process, we can easily derive both semantic and instance segmentation masks by first rendering an index map that associates each pixel with a surfel index and then determining the semantic class or instance number through a look-up table.

We derive another dataset from WOD, which we call the Waymo Open Dataset-Novel View (WOD-TRAIN-NV and WOD-EVAL-NV). We again start from the reconstructed surfel scenes, but we now render surfel images from novel camera poses perturbed from the existing camera poses. The perturbation consists of applying a random translation and a random yaw angle perturbation to the camera-mounted vehicle. We use the annotated 3D bounding boxes to ensure that the perturbed vehicle does not intersect with other objects in the scene. We generate one new surfel image rendering for each frame in the original dataset. Note that although this dataset comes for free, i.e., we could generate any number of testing frames, it does not have corresponding camera images. Therefore, it can only be used for unpaired training and only for some types of evaluation.

Internal Camera Image Dataset. We collected an additional 9.8k short sequences (100 frames each), similar to the WOD images. These unannotated images are used for unpaired training on real images.

Dual-Camera-Pose Dataset (DCP). Finally, we built a unique dataset tailored to measuring the realism of our model. The dataset contains scenarios in which two vehicles observe the same scene at the same time. Specifically, we find the intervals where two vehicles are within 20 m of each other. We use the sensor data from the first vehicle to reconstruct the scene and render the surfel image at the exact pose of the second vehicle. After filtering cases where the scene reconstruction is too incomplete, we obtain around 1k pairs, for which we can directly measure the pixel-wise accuracy of the generated image.
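To make the novel-view construction of WOD-TRAIN-NV and WOD-EVAL-NV above concrete, the sketch below perturbs the camera-mounted vehicle's pose with a random translation and yaw offset and rejects samples that would intersect an annotated 3D box. The sampling ranges and the conservative bird's-eye-view circle test are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def perturb_pose(xyz, yaw, boxes, sdv_radius=2.5,
                 max_trans=2.0, max_yaw=np.deg2rad(10.0), rng=None):
    """Sample a perturbed SDV pose for novel-view rendering.

    xyz, yaw : original SDV position (3,) and heading.
    boxes    : iterable of (center_xyz, length, width) for annotated 3D boxes.
    Returns a perturbed (xyz, yaw) that passes a conservative bird's-eye-view
    circle overlap test against all boxes (an assumed collision check).
    """
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(100):                      # retry until collision-free
        offset = rng.uniform(-max_trans, max_trans, size=2)
        new_xyz = xyz + np.array([offset[0], offset[1], 0.0])
        new_yaw = yaw + rng.uniform(-max_yaw, max_yaw)
        collides = any(
            np.linalg.norm(new_xyz[:2] - np.asarray(c)[:2])
            < sdv_radius + 0.5 * np.hypot(l, w)
            for c, l, w in boxes)
        if not collides:
            return new_xyz, new_yaw
    return xyz, yaw                           # fall back to the original pose
```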
Figure 4. Qualitative comparison between different SurfelGAN variants and the baseline on WOD-EVAL under different weather conditions.

                      WOD-TRAIN-NV                   WOD-EVAL                       WOD-EVAL-NV
                      AP@50  AP@75  AP    Rec        AP@50  AP@75  AP    Rec        AP@50  AP@75  AP    Rec
Surfel (baseline)     0.444  0.168  0.211 0.342      0.521  0.168  0.239 0.371      0.462  0.154  0.213 0.348
SurfelGAN-S (ours)    0.508  0.177  0.236 0.359      0.576  0.164  0.252 0.341      0.514  0.159  0.230 0.358
SurfelGAN-SA (ours)   0.554  0.200  0.259 0.382      0.610  0.174  0.266 0.394      0.567  0.180  0.257 0.387
SurfelGAN-SAC (ours)  0.564  0.200  0.263 0.385      0.620  0.181  0.272 0.400      0.570  0.181  0.258 0.388
Real (upper bound)    -      -      -     -          0.619  0.198  0.281 0.424      -      -      -     -

Table 1. Realism w.r.t. an off-the-shelf vehicle object detector. We generated images using the proposed SurfelGAN and ran inference on them using an off-the-shelf object detector. We report the standard COCO object detection metrics [25], including variants of the average precision (AP) and recall at 100 (Rec); higher is better for all metrics. Surfel is the surfel rendering that is the input to SurfelGAN. SurfelGAN is the proposed model. The S variant is trained with paired supervised learning only. The SA variant adds the adversarial loss, and the SAC variant makes use of additional unpaired data and applies a cyclic adversarial loss. Real is the real image captured by the cameras, which is only available for WOD-EVAL; it serves as an upper bound on the detector's quality. As shown above, SurfelGAN significantly improves over the baseline and reaches quality metric values similar to those of the real images.

4.1. Model Variants and Baseline

Most experiments were performed on three variants of our proposed model. Supervised (S): we train the surfel-rendering-to-image model in a supervised way by minimizing an ℓ1 loss between the generated image and the ground-truth real image. This type of training requires paired data; hence, it is only possible to train on WOD-TRAIN. Supervised + Adversarial (SA): we still only consider WOD-TRAIN as the training data, but we add an adversarial loss alongside the ℓ1 loss. Supervised + Adversarial + Cycle (SAC): in this variant, we also use WOD-TRAIN-NV and the Internal Camera Image Dataset. Since these two sets are unpaired, the supervised loss does not apply; we instead use a cycle-consistency loss in addition to the adversarial loss, as discussed in Sec. 3.2. The baseline for our evaluations is the direct surfel rendering (Surfel) that serves as the input to our model.

4.2. Vehicle Detector Realism

Since the primary application of this work is simulation for autonomous driving, it is natural to evaluate the generated camera data using a downstream perception module. Specifically, we want to know how well an off-the-shelf vehicle object detector performs on the generated images without any fine-tuning. This is a test of whether the detector statistics on the generated images match those it obtains on the real images. We chose a vehicle detector with a ResNet architecture [16] and an SSD detection head [26], trained and evaluated on images resized to 512 × 512 resolution from a mixture of datasets that include WOD-TRAIN. We trained our SurfelGAN model variants on a mixture of WOD-TRAIN, WOD-TRAIN-NV, and the Internal Camera Image Dataset, and generated images on WOD-TRAIN-NV, WOD-EVAL, and WOD-EVAL-NV. Tab. 1 shows a quantitative comparison of the detector's quality on the original surfel renderings that are the input to SurfelGAN, on images generated by the SurfelGAN variants, and on real images.
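The detector-based realism protocol can be scripted roughly as follows, assuming the ground truth and the detections produced by a frozen detector have been exported in COCO format; the file names are placeholders, and the detector used in the paper is an internal ResNet+SSD model rather than anything provided here.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def detector_realism_metrics(gt_json, detections_json):
    """COCO-style AP/AR for one image source (surfel, SurfelGAN-*, or real).

    gt_json         : ground-truth vehicle boxes in COCO format.
    detections_json : detections from the same frozen, off-the-shelf detector
                      run on the corresponding (real or synthesized) images.
    """
    coco_gt = COCO(gt_json)
    coco_dt = coco_gt.loadRes(detections_json)
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    # stats[1] = AP@50, stats[2] = AP@75, stats[0] = AP, stats[8] = AR@100.
    return {"AP@50": ev.stats[1], "AP@75": ev.stats[2],
            "AP": ev.stats[0], "Rec": ev.stats[8]}

# The realism test compares these numbers across image sources, e.g.
# detections on SurfelGAN-SAC outputs versus detections on the real
# WOD-EVAL images, with the same ground truth in both cases.
```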
A few generated image examples are displayed in Fig. 4. Please find additional visualizations in the supplementary material. Fig. 5 highlights our system's ability to generate images from novel views.

Notably, our texture-enhanced surfel scene reconstruction already produces surfel renderings that achieve good detection quality on the WOD-EVAL set, at 52.1% AP@50. But there is still a significant gap between these surfel renderings and real images at 61.9%, which motivates our SurfelGAN work. As shown in Tab. 1, the SurfelGAN-S, -SA, and -SAC variants gradually improve over the baseline surfel renderings. SurfelGAN-SAC ultimately improves the AP@50 metric from 52.1% to 62.0% on WOD-EVAL, which is on par with the real images at 61.9%. This shows that images generated by SurfelGAN-SAC are close to real images in the eyes of the detector, which is the primary motivation of this work.

There are two types of generalization worth evaluating. The first is whether a SurfelGAN model trained on one set of scenes (e.g., WOD-TRAIN and WOD-TRAIN-NV) generalizes to new scenes (e.g., WOD-EVAL and WOD-EVAL-NV). We believe that the SurfelGAN model generalizes well, since the relative improvement of SurfelGAN over the baseline is very similar between the WOD-TRAIN-NV and WOD-EVAL-NV columns in Tab. 1.

The second type of generalization is whether the surfel rendering has a strong bias towards the poses from which the scene was reconstructed. We compare the metric values between WOD-EVAL and WOD-EVAL-NV. Although SurfelGAN improved by roughly 10% over the baseline in both cases, there is a noticeable quality difference between the two columns. To better understand this difference, in Tab. 2 we break down the metrics of SurfelGAN-SAC on WOD-EVAL-NV according to how much each pose deviates from the original poses in WOD-EVAL. The deviation d(·) is defined as a weighted sum of the translational and rotational differences of the poses:

$$
d\big((t, R), (t', R')\big) = \|t - t'\| + \lambda_R \, \frac{\|\log(R^\top R')\|}{\sqrt{2}}, \qquad (2)
$$

where t and R are the pose (translation and rotation) of the novel view in WOD-EVAL-NV, and t', R' the closest pose in WOD-EVAL. λ_R is chosen to be 1.0.

Perturbation     AP@50  AP@75  AP
d ≤ 1.0          0.574  0.174  0.257
1.0 < d ≤ 2.0    0.547  0.173  0.246
2.0 < d          0.488  0.153  0.218

Table 2. Detector metric breakdown at different perturbation levels on WOD-EVAL-NV for SurfelGAN-SAC.

               Surfel  SGAN-S  SGAN-SA  SGAN-SAC
ℓ1-distance ↓  0.262   0.229   0.240    0.238

Table 3. Image-pixel realism. We applied SurfelGAN to the Dual-Camera-Pose Dataset, where it is possible to measure the ℓ1-distance error between the generated images and the real ones.

The results suggest that the surfel renderings do have a quality bias w.r.t. the viewing direction, which means we should not perturb the poses too much from the original ones if we want higher-quality synthesized data. However, we believe that this problem can be ameliorated by reconstructing the surfel scene from multiple runs, a direction we leave to future work.
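For reference, if the norm of the matrix logarithm in Eq. (2) is taken to be the Frobenius norm (an assumption consistent with the √2 factor), the rotation term reduces to the geodesic angle between the two rotations, which makes the deviation straightforward to compute:

```python
import numpy as np

def pose_deviation(t, R, t_ref, R_ref, lambda_r=1.0):
    """Deviation d((t, R), (t', R')) from Eq. (2).

    The term ||log(R^T R')||_F / sqrt(2) equals the geodesic rotation angle,
    which we recover from the trace of the relative rotation.
    """
    trans = np.linalg.norm(t - t_ref)
    cos_angle = np.clip((np.trace(R.T @ R_ref) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    return trans + lambda_r * angle
```

Bucketing the WOD-EVAL-NV frames by this value with the thresholds d ≤ 1.0, 1.0 < d ≤ 2.0, and 2.0 < d gives the breakdown reported in Tab. 2.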
4.3. Image-Pixel Realism

The Dual-Camera-Pose (DCP) Dataset contains scenarios in which two vehicles observe the same scene at the same time, allowing us to reconstruct the surfel scene using one camera and generate images from the point of view of the second camera. We can match each generated image to the real one and report the ℓ1-distance error on the pixels that are covered by the surfel rendering. This ensures a fair comparison between the surfel renderings and the generated images. As in the previous experiment, the model is trained using WOD-TRAIN, WOD-TRAIN-NV, and the Internal Camera Image Dataset. The results are shown in Tab. 3. SurfelGAN improves on top of the surfel renderings, generating images that are closer to the real images in ℓ1-distance. However, it is worth noting that the SurfelGAN-S version outperforms both SA and SAC, which use additional losses and data during training. This finding is not unexpected, since SurfelGAN-S directly optimizes the ℓ1-distance.

Figure 5. Novel view synthesis. The first column is the surfel image under the novel view, the second column is our synthesized result, and the third column is the original view. Additional visualizations can be found in Sec. B.

Training Set                AP@50  AP@75  AP
WOD-TRAIN                   0.219  0.108  0.119
+ WOD-TRAIN-NV Surfel       0.228  0.111  0.120
+ WOD-TRAIN-NV SurfelGAN    0.254  0.121  0.130

Table 4. Detector metrics on the Open Dataset validation set when trained with different combinations of data.

4.4. Improving Detection by Data Augmentation

In an additional experiment, we explore whether the SurfelGAN-generated images from perturbed views are a helpful form of data augmentation for training a vehicle object detector. For the baseline, we trained a vehicle detector on WOD-TRAIN and evaluated the detector's quality on WOD-EVAL. We then trained another vehicle detector using both WOD-TRAIN and surfel images generated from WOD-TRAIN-NV, and also evaluated it on WOD-EVAL. WOD-TRAIN-NV only inherits 3D bounding boxes from WOD-TRAIN and does not contain tightly fitting 2D bounding boxes like those in WOD-TRAIN. We approximate the latter by projecting all surfels inside the 3D bounding boxes into the 2D novel view and taking the axis-aligned bounding box as an approximation.

The results are shown in Tab. 4. The data augmentation significantly boosts the average precision metrics, improving the AP@50 score from 21.9% to 25.4%, the AP@75 from 10.8% to 12.1%, and the average AP from 11.9% to 13.0%. It is worth noting that these AP scores are much lower than those in Tab. 1; the main reason for the discrepancy is that the images are resized differently in order to use the off-the-shelf detector in Tab. 1. We also trained on the surfel renderings directly. They give a slight improvement compared to training only on WOD-TRAIN, while using SurfelGAN-synthesized images yields a much more significant improvement, which further demonstrates the realism of the SurfelGAN model.
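The approximate 2D boxes used for the WOD-TRAIN-NV training data can be obtained along the lines of the sketch below: project the surfels that fall inside an object's annotated 3D box into the novel camera view and take the axis-aligned box of the projected pixels. The simple pinhole projection and the visibility/clipping heuristics are assumptions for illustration; the actual WOD camera model includes distortion and rolling-shutter effects.

```python
import numpy as np

def approx_2d_box(surfel_centers, T_world_to_cam, K, img_w, img_h):
    """Axis-aligned 2D box from the surfels of one 3D-box-annotated object.

    surfel_centers : (N, 3) surfel centers inside the object's 3D box (world frame).
    T_world_to_cam : (4, 4) world-to-camera transform for the novel view.
    K              : (3, 3) pinhole intrinsics (simplified camera model).
    Returns (x_min, y_min, x_max, y_max) or None if the object is not visible.
    """
    pts_h = np.hstack([surfel_centers, np.ones((len(surfel_centers), 1))])
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > 0.1]                 # keep points in front of the camera
    if len(cam) == 0:
        return None
    pix = (K @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:3]
    x = np.clip(pix[:, 0], 0, img_w - 1)
    y = np.clip(pix[:, 1], 0, img_h - 1)
    if x.max() - x.min() < 1 or y.max() - y.min() < 1:
        return None                            # degenerate box after clipping
    return x.min(), y.min(), x.max(), y.max()
```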
Figure 6. Two typical failure cases: the first occurs when the reconstructed surfel map contains large errors; the second occurs in an unmapped region (the top part of the building in this example).

4.5. Limitations

The surfel scene reconstruction has its limitations. For example, it may fail to reconstruct certain areas of the scene, as illustrated in the top row of Fig. 6. In this particular case, SurfelGAN was unable to recover from the broken geometry, resulting in an unrealistic-looking vehicle. Suboptimal renderings may also occur in areas that are missing surfels. Lacking any surfel cues forces the model output to have high variance, especially when it tries to hallucinate patterns that appear infrequently in the dataset, such as tall buildings. This observation suggests that there is room for improvement in the reconstruction stage. In cases where we only have partial geometry, applying a learned geometry completion model first could be more helpful than relying solely on the generative module to resolve all artifacts.

5. Conclusion

We propose a simple yet effective data-driven approach to synthesizing camera data for autonomous driving simulations. Based on the camera and LiDAR data captured by a vehicle pass through a scene, we reconstruct a 3D model using our Enhanced Surfel Map representation. Given this representation, we can render novel views and configurations of objects in the environment. We use our SurfelGAN image synthesis model to fix any reconstruction, occlusion, or rendering artifacts. To the best of our knowledge, we have built the first purely data-driven camera simulation system for autonomous driving. Experimental results not only demonstrate the high level of realism of our synthesized sensor data, but also show that the data can be used for training-set augmentation for deep neural networks. In future work, we plan to enhance the camera simulation further by improving the dynamic object modeling process and by investigating temporally consistent video generation.

References
[1] DeepGTAV v2.
[2] Kara-Ali Aliev, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. arXiv preprint arXiv:1906.08240, 2019.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[4] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. In RSS, 2019.
[5] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron Courville. HoME: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
[6] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
[7] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[9] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 2018.
[10] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '96), 1996.
[11] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.
[12] Michael Everett, Yu Fan Chen, and Jonathan P How. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In IROS, 2018.
[13] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. TPAMI, 2010.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In CVPR, 2016.
[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[19] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. Meta-Sim: Learning to generate synthetic datasets. arXiv preprint arXiv:1904.11621, 2019.
[20] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[21] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, 2006.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Wei Li, Chengwei Pan, Rong Zhang, Jiaping Ren, Yuexin Ma, Jin Fang, Feilong Yan, Qichuan Geng, Xinyu Huang, Huajun Gong, et al. AADS: Augmented autonomous driving simulation using data-driven algorithms. arXiv preprint arXiv:1901.07849, 2019.
[24] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
[25] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[26] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[27] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and Furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In CVPR, 2018.
[28] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[29] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, 2019.
[30] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[31] Hanspeter Pfister, Matthias Zwicker, Jeroen Van Baar, and Markus Gross. Surfels: Surface elements as rendering primitives. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, 2000.
[32] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
[33] A. Segal, D. Haehnel, and S. Thrun. Generalized-ICP. In Proceedings of Robotics: Science and Systems, 2009.
[34] Shaoshuai Shi, Zhe Wang, Xiaogang Wang, and Hongsheng Li. Part-A^2 net: 3D part-aware and aggregation neural network for object detection from point cloud. arXiv preprint arXiv:1907.03670, 2019.
[35] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[36] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurélien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
[37] Shimon Ullman. The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences, 1979.
[38] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
[39] Changchang Wu et al. VisualSFM: A visual structure from motion system, 2011.
[40] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209, 2018.
[41] Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net, 2000.
[42] Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real-world perception for embodied agents. In CVPR, 2018.
[43] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[44] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

A. Additional Details on Surfel Scene Reconstruction

For vehicle reconstruction, we aggregate LiDAR points across frames into a single model using the bounding box annotations and a local registration method (ICP). We also exploit the symmetry of man-made objects to complete their geometry. We reconstruct a total of 46,786 vehicle models; some examples are shown in Figure 7. We also include surfel scene reconstruction results in Figure 8. These surfel maps are built from camera-LiDAR sequences in the Waymo Open Dataset training split.

Figure 7. Sample reconstructed surfel vehicles.

Figure 8. Reconstructed surfel scene maps from four different scenes.

B. Additional Qualitative Synthesized Image Results

We provide additional qualitative results: Fig. 9 contains synthesized output when gradually perturbing the current viewpoint. Fig. 10 contains synthesized output when we also perturb the other objects in the scene. Fig. 13 provides a visual comparison between different model variants. Generated surfel images and semantic maps from SurfelGAN-SAC can be found in Fig. 14. More visual results of SurfelGAN-SAC can be found in Fig. 11 and Fig. 12.

C. Analysis of Surfel Image Coverage

We examine the object detection metrics for different surfel image coverage ratios. Specifically, we define the surfel image coverage ratio as the percentage of the area in the surfel-rendered image that is non-empty. The results can be found in Tab. 5. We found that our SurfelGAN model performs similarly when the coverage ratio is above 30%, but the performance declines significantly as the input surfel map becomes more and more incomplete. This can happen when the SDV is looking in a direction where the surfel scene reconstruction is incomplete, for example, if it is positioned at the surfel map boundary looking outward. This observation suggests that building more complete surfel scene maps over multiple runs should be helpful.

Coverage ratio    AP@50  AP@75  AP
r ≤ 0.3           0.490  0.155  0.218
0.3 < r ≤ 0.5     0.577  0.235  0.279
0.5 < r           0.566  0.172  0.253

Table 5. Object detection metrics for different surfel image coverage ratios (r) on WOD-EVAL-NV.
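The coverage ratio itself is a one-line computation; a minimal sketch, assuming that empty pixels of the surfel rendering are encoded as zeros in every channel:

```python
import numpy as np

def surfel_coverage_ratio(surfel_rendering):
    """Fraction of pixels in an (H, W, C) surfel rendering covered by surfels.

    Assumes pixels with no surfel are all-zero, which is an assumption about
    how empty regions are encoded rather than a documented convention.
    """
    covered = np.any(surfel_rendering != 0, axis=-1)
    return covered.mean()
```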
Figure 9. Synthesized images when gradually perturbing the camera viewpoint. Each column contains one example.

Figure 10. Synthesized images obtained by perturbing objects in the scene. Top row: original surfel rendering. Center row: surfel rendering after perturbing the objects. Bottom row: synthesized images.

Figure 11. Qualitative results of SurfelGAN-SAC (1/2). We show pairs of surfel renderings and synthesized images.

Figure 12. Qualitative results of SurfelGAN-SAC (2/2). We show surfel rendering and synthesized image pairs.

Figure 13. Qualitative comparison between different SurfelGAN variants on WOD-EVAL.

Figure 14. Visualization of the different outputs of SurfelGAN-SAC. From left to right: surfel rendering, generated surfel rendering, semantic map, generated semantic map, camera image, generated camera image.