Block-NeRF: Scalable Large Scene Neural View Synthesis

Matthew Tancik 1*   Vincent Casser 2   Xinchen Yan 2   Sabeek Pradhan 2   Ben Mildenhall 3   Pratul P. Srinivasan 3   Jonathan T. Barron 3   Henrik Kretzschmar 2
1 UC Berkeley   2 Waymo   3 Google Research
*Work done as an intern at Waymo.

Figure 1. Block-NeRF is a method that enables large-scale scene reconstruction by representing the environment using multiple compact NeRFs that each fit into memory. At inference time, Block-NeRF seamlessly combines renderings of the relevant NeRFs for the given area. In this example, we reconstruct the Alamo Square neighborhood in San Francisco using data collected over 3 months. Block-NeRF can update individual blocks of the environment without retraining on the entire scene, as demonstrated by the construction on the right. Video results can be found on the project website waymo.com/research/block-nerf.

Abstract

We present Block-NeRF, a variant of Neural Radiance Fields that can represent large-scale environments. Specifically, we demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment. We adopt several architectural changes to make NeRF robust to data captured over months under different environmental conditions. We add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined. We build a grid of Block-NeRFs from 2.8 million images to create the largest neural scene representation to date, capable of rendering an entire neighborhood of San Francisco.

1. Introduction

Recent advancements in neural rendering such as Neural Radiance Fields [42] have enabled photo-realistic reconstruction and novel view synthesis given a set of posed camera images [3, 40, 45]. Earlier works tended to focus on small-scale and object-centric reconstruction. Though some methods now address scenes the size of a single room or building, these are generally still limited and do not naïvely scale up to city-scale environments. Applying these methods to large environments typically leads to significant artifacts and low visual fidelity due to limited model capacity.

Reconstructing large-scale environments enables several important use-cases in domains such as autonomous driving [32, 44, 68] and aerial surveying [14, 35]. One example is mapping, where a high-fidelity map of the entire operating domain is created to act as a powerful prior for a variety of problems, including robot localization, navigation, and collision avoidance. Furthermore, large-scale scene reconstructions can be used for closed-loop robotic simulations [13]. Autonomous driving systems are commonly evaluated by re-simulating previously encountered scenarios; however, any deviation from the recorded encounter may change the vehicle's trajectory, requiring high-fidelity novel view renderings along the altered path. Beyond basic view synthesis, scene-conditioned NeRFs are also capable of changing environmental lighting conditions such as camera exposure, weather, or time of day, which can be used to further augment simulation scenarios.
Reconstructing such large-scale environments introduces additional challenges, including the presence of transient objects (cars and pedestrians), limitations in model capacity, and memory and compute constraints. Furthermore, training data for such large environments is highly unlikely to be collected in a single capture under consistent conditions. Rather, data for different parts of the environment may need to be sourced from different data collection efforts, introducing variance in both scene geometry (e.g., construction work and parked cars) and appearance (e.g., weather conditions and time of day).

We extend NeRF with appearance embeddings and learned pose refinement to address the environmental changes and pose errors in the collected data. We additionally add exposure conditioning to provide the ability to modify the exposure during inference. We refer to this modified model as a Block-NeRF. Scaling up the network capacity of a Block-NeRF makes it possible to represent increasingly large scenes. However, this approach comes with a number of limitations: rendering time scales with the size of the network, networks can no longer fit on a single compute device, and updating or expanding the environment requires retraining the entire network.

To address these challenges, we propose dividing up large environments into individually trained Block-NeRFs, which are then rendered and combined dynamically at inference time. Modeling these Block-NeRFs independently allows for maximum flexibility, scales up to arbitrarily large environments, and provides the ability to update or introduce new regions in a piecewise manner without retraining the entire environment, as demonstrated in Figure 1. To compute a target view, only a subset of the Block-NeRFs are rendered and then composited based on their geographic location relative to the camera. To allow for more seamless compositing, we propose an appearance matching technique which brings different Block-NeRFs into visual alignment by optimizing their appearance embeddings.
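As a rough illustration of this inference-time procedure (cf. Figure 2), the sketch below selects nearby Block-NeRFs and merges their renderings with weights that decay with block-origin distance. The function names, the visibility threshold, and the inverse-distance weighting exponent are illustrative assumptions for this example, not the exact scheme used in the paper.

```python
# Illustrative sketch of Block-NeRF selection and compositing at inference time.
# The visibility threshold and inverse-distance weighting exponent are assumptions
# made for this example, not the exact scheme used in the paper.
import numpy as np

def select_blocks(block_origins, cam_origin, radius, visibilities, vis_threshold=0.1):
    """Return indices of Block-NeRFs near the camera with sufficient predicted visibility.

    block_origins: (B, 3) Block-NeRF origin coordinates.
    cam_origin:    (3,) target camera position.
    visibilities:  (B,) mean visibility predicted by each block for this view.
    """
    dists = np.linalg.norm(block_origins - cam_origin, axis=-1)
    keep = (dists < radius) & (visibilities > vis_threshold)
    return np.nonzero(keep)[0]

def composite_renders(renders, block_origins, cam_origin, power=2.0):
    """Merge per-block renderings with weights that fall off with block-origin distance.

    renders:       (K, H, W, 3) images rendered by the selected Block-NeRFs.
    block_origins: (K, 3) origins of those blocks.
    """
    dists = np.linalg.norm(block_origins - cam_origin, axis=-1)
    weights = 1.0 / np.maximum(dists, 1e-6) ** power     # inverse-distance weighting
    weights = weights / weights.sum()
    return np.tensordot(weights, renders, axes=1)        # (H, W, 3) blended image

# Toy usage: two of four blocks survive the distance and visibility checks.
rng = np.random.default_rng(0)
origins = np.array([[0.0, 0.0, 0.0], [50.0, 0.0, 0.0], [0.0, 60.0, 0.0], [200.0, 0.0, 0.0]])
vis = np.array([0.9, 0.5, 0.05, 0.8])
cam = np.array([10.0, 0.0, 0.0])
idx = select_blocks(origins, cam, radius=100.0, visibilities=vis)   # -> [0, 1]
renders = rng.uniform(0.0, 1.0, size=(len(idx), 4, 4, 3))           # stand-in renderings
merged = composite_renders(renders, origins[idx], cam)
```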
Figure 2. The scene is split into multiple Block-NeRFs that are each trained on data within some radius (dotted orange line) of a specific Block-NeRF origin coordinate (orange dot). To render a target view in the scene, the visibility maps are computed for all of the NeRFs within a given radius. Block-NeRFs with low visibility are discarded (bottom Block-NeRF) and the color output is rendered for the remaining blocks. The renderings are then merged based on each block origin's distance to the target view.

2. Related Work

2.1. Large Scale 3D Reconstruction

Researchers have been developing and refining techniques for 3D reconstruction from large image collections for decades [1, 16, 33, 47, 57, 77], and much current work relies on mature and robust software implementations such as COLMAP to perform this task [55]. Nearly all of these reconstruction methods share a common pipeline: extract 2D image features (such as SIFT [39]), match these features across different images, and jointly optimize a set of 3D points and camera poses to be consistent with these matches (the well-explored problem of bundle adjustment [23, 65]). Extending this pipeline to city-scale data is largely a matter of implementing highly robust and parallelized versions of these algorithms, as explored in work such as Photo Tourism [57] and Building Rome in a Day [1]. Core graphics research has also explored breaking up scenes for fast high-quality rendering [38].

These approaches typically output a camera pose for each input image and a sparse 3D point cloud. To get a complete 3D scene model, these outputs must be further processed by a dense multi-view stereo algorithm (e.g., PMVS [18]) to produce a dense point cloud or triangle mesh. This process presents its own scaling difficulties [17]. The resulting 3D models often contain artifacts or holes in areas with limited texture or specular reflections, as these are challenging to triangulate across images. As such, they frequently require further postprocessing to create models that can be used to render convincing imagery [56]. However, this task is mainly the domain of novel view synthesis, whereas 3D reconstruction techniques primarily focus on geometric accuracy. In contrast, our approach does not rely on large-scale SfM to produce camera poses, instead performing odometry using various sensors on the vehicle as the images are collected [64].

2.2. Novel View Synthesis

Given a set of input images of a given scene and their camera poses, novel view synthesis seeks to render observed scene content from previously unobserved viewpoints, allowing a user to navigate through a recreated environment with high visual fidelity.

Geometry-based Image Reprojection. Many approaches to view synthesis start by applying traditional 3D reconstruction techniques to build a point cloud or triangle mesh representing the scene. This geometric "proxy" is then used to reproject pixels from the input images into new camera views, where they are blended by heuristic [6] or learning-based methods [24, 52, 53]. This approach has been scaled to long trajectories of first-person video [31], panoramas collected along a city street [30], and single landmarks from the Photo Tourism dataset [41]. Methods reliant on geometry proxies are limited by the quality of the initial 3D reconstruction, which hurts their performance in scenes with complex geometry or reflectance effects.

Volumetric Scene Representations. Recent view synthesis work has focused on unifying reconstruction and rendering and learning this pipeline end-to-end, typically using a volumetric scene representation. Methods for rendering small-baseline view interpolation often use feed-forward networks to learn a mapping directly from input images to an output volume [15, 76], while methods such as Neural Volumes [37] that target larger-baseline view synthesis run a global optimization over all input images to reconstruct every new scene, similar to traditional bundle adjustment. Neural Radiance Fields (NeRF) [42] combines this single-scene optimization setting with a neural scene representation capable of representing complex scenes much more efficiently than a discrete 3D voxel grid; however, its rendering model scales very poorly to large-scale scenes in terms of compute. Follow-up work has proposed making NeRF more efficient by partitioning space into smaller regions, each containing its own lightweight NeRF network [48, 49]. Unlike our method, these network ensembles must be trained jointly, limiting their flexibility. Another approach is to provide extra capacity in the form of a coarse 3D grid of latent codes [36].
This approach has also been applied to compress detailed 3D shapes into neural signed distance functions [62] and to represent large scenes using occupancy networks [46].

We build our Block-NeRF implementation on top of mip-NeRF [3], which improves aliasing issues that hurt NeRF's performance in scenes where the input images observe the scene from many different distances. We incorporate techniques from NeRF in the Wild (NeRF-W) [40], which adds a latent code per training image to handle inconsistent scene appearance when applying NeRF to landmarks from the Photo Tourism dataset. NeRF-W creates a separate NeRF for each landmark from thousands of images, whereas our approach combines many NeRFs to reconstruct a coherent large environment from millions of images. Our model also incorporates a learned camera pose refinement, which has been explored in previous works [34, 59, 66, 69, 70].

Some NeRF-based methods use segmentation data to isolate and reconstruct static [67] or moving objects (such as people or cars) [44, 73] across video sequences. As we focus primarily on reconstructing the environment itself, we choose to simply mask out dynamic objects during training.

2.3. Urban Scene Camera Simulation

Camera simulation has become a popular data source for training and validating autonomous driving systems on interactive platforms [2, 28]. Early works [13, 19, 51, 54] synthesized data from scripted scenarios and manually created 3D assets. These methods suffered from domain mismatch and limited scene-level diversity. Several recent works tackle the simulation-to-reality gap by minimizing the distribution shifts in the simulation and rendering pipeline. Kar et al. [26] and Devaranjan et al. [12] proposed to minimize the scene-level distribution shift from rendered outputs to real camera sensor data through a learned scenario generation framework. Richter et al. [50] leveraged intermediate rendering buffers in the graphics pipeline to improve the photorealism of synthetically generated camera images.

Towards the goal of building photo-realistic and scalable camera simulation, prior methods [9, 32, 68] leverage rich multi-sensor driving data collected during a single drive to reconstruct 3D scenes for object injection [9] and novel view synthesis [68] using modern machine learning techniques, including image GANs for 2D neural rendering. Relying on a sophisticated surfel reconstruction pipeline, SurfelGAN [68] is still susceptible to errors in graphical reconstruction and can suffer from the limited range and vertical field-of-view of LiDAR scans. In contrast to existing efforts, our work tackles the 3D rendering problem and is capable of modeling the real camera data captured from multiple drives under varying environmental conditions, such as weather and time of day, which is a prerequisite for reconstructing large-scale areas.

Figure 3. Our model is an extension of the model presented in mip-NeRF [3]. The first MLP f_σ predicts the density σ for a position x in space. The network also outputs a feature vector that is concatenated with the viewing direction d, the exposure level, and an appearance embedding. These are fed into a second MLP f_c that outputs the color for the point. We additionally train a visibility network f_v to predict whether a point in space was visible in the training views, which is used for culling Block-NeRFs during inference.
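To make the per-point data flow of Figure 3 concrete, the following minimal sketch mirrors the two conditioned MLPs and the visibility head using plain NumPy with untrained weights. The layer widths, encoding sizes, appearance embedding dimension, and helper names are illustrative assumptions, not the paper's configuration.

```python
# Minimal NumPy sketch of the per-point data flow in Figure 3, using random
# (untrained) weights. Layer widths, encoding sizes, the appearance embedding
# dimension, and all helper names are illustrative assumptions, not the
# paper's configuration.
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Create a stack of randomly initialized dense layers."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """ReLU MLP forward pass with a linear final layer."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def posenc(x, num_freqs=4):
    """Sinusoidal encoding (a simple stand-in for mip-NeRF's integrated positional encoding)."""
    scaled = x[..., None, :] * (2.0 ** np.arange(num_freqs))[:, None]
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1).reshape(*x.shape[:-1], -1)

D_ENC = 3 * 2 * 4          # encoded size of a 3D position or direction
D_FEAT, D_APP = 16, 8      # feature vector and appearance embedding sizes

f_sigma = mlp([D_ENC, 64, 64, 1 + D_FEAT])              # position -> density + feature vector
f_c     = mlp([D_FEAT + D_ENC + 1 + D_APP, 64, 3])      # feature, direction, exposure, appearance -> RGB
f_v     = mlp([2 * D_ENC, 32, 1])                       # position + direction -> visibility

def block_nerf_point(x, d, exposure, appearance):
    """Evaluate density, color, and visibility for a single 3D point."""
    h = forward(f_sigma, posenc(x))
    sigma, feat = np.exp(h[0]), h[1:]                    # exp keeps the density non-negative
    color_in = np.concatenate([feat, posenc(d), [exposure], appearance])
    rgb = 1.0 / (1.0 + np.exp(-forward(f_c, color_in)))  # sigmoid maps color to [0, 1]
    vis = 1.0 / (1.0 + np.exp(-forward(f_v, np.concatenate([posenc(x), posenc(d)]))))
    return sigma, rgb, float(vis[0])

sigma, rgb, vis = block_nerf_point(
    x=np.array([0.1, 0.2, 0.3]), d=np.array([0.0, 0.0, 1.0]),
    exposure=0.5, appearance=rng.normal(size=8))
```

As in the description above, the density branch sees only the (encoded) position, so conditioning on direction, exposure, and appearance affects color but not the underlying geometry.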
3. Background

We build upon NeRF [42] and its extension mip-NeRF [3]. Here, we summarize relevant parts of these methods. For details, please refer to the original papers.

3.1. NeRF and mip-NeRF Preliminaries

Neural Radiance Fields (NeRF) [42] is a coordinate-based neural scene representation that is optimized through a differentiable rendering loss to reproduce the appearance of a set of input images from known camera poses. After optimization, the NeRF model can be used to render previously unseen viewpoints.

The NeRF scene representation is a pair of multilayer perceptrons (MLPs). The first MLP f_σ takes in a 3D position x and outputs volume density σ and a feature vector. This feature vector is concatenated with a 2D viewing direction d and fed into the second MLP f_c, which outputs an RGB color c. This architecture ensures that the output color can vary when observed from different angles, allowing NeRF to represent reflections and glossy materials, but that the underlying geometry represented by σ is only a function of position.

Each pixel in an image corresponds to a ray r(t) = o + t d through 3D space. To calculate the color of r, NeRF randomly samples distances \{t_i\}_{i=0}^{N} along the ray and passes the points r(t_i) and direction d through its MLPs to calculate σ_i and c_i. The resulting output color is

c_{\text{out}} = \sum_{i=1}^{N} w_i c_i, \quad \text{where} \quad w_i = T_i \left(1 - e^{-\Delta_i \sigma_i}\right),   (1)

T_i = \exp\left(-\sum_{j<i} \Delta_j \sigma_j\right),

where Δ_i = t_i − t_{i−1} is the distance between adjacent samples.
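As a compact numerical illustration of Eq. (1), the sketch below composites per-sample colors along a single ray. The densities, colors, and sample locations are random placeholders standing in for the MLP outputs, and the function name is hypothetical.

```python
# Numerical sketch of Eq. (1): alpha compositing of per-sample colors along a ray.
# Densities and colors would come from f_sigma and f_c; here they are placeholders.
import numpy as np

def composite_ray(t, sigma, color):
    """Compute c_out = sum_i w_i c_i with w_i = T_i * (1 - exp(-delta_i * sigma_i)).

    t:     (N+1,) sample distances along the ray (t_0 ... t_N).
    sigma: (N,) densities at samples r(t_1) ... r(t_N).
    color: (N, 3) RGB colors at those samples.
    """
    delta = np.diff(t)                                                        # delta_i = t_i - t_{i-1}
    alpha = 1.0 - np.exp(-delta * sigma)                                      # per-sample opacity
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(delta * sigma)[:-1]]))   # T_i = exp(-sum_{j<i} delta_j sigma_j)
    weights = trans * alpha                                                   # w_i
    return weights @ color                                                    # (3,) output color

# Toy example: a single ray with 8 random samples.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 5.0, size=9))
c_out = composite_ray(t, sigma=rng.uniform(0.0, 2.0, size=8), color=rng.uniform(0.0, 1.0, size=(8, 3)))
```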