Block-NeRF: Scalable Large Scene Neural View Synthesis

Matthew Tancik1*  Vincent Casser2  Xinchen Yan2  Sabeek Pradhan2  Ben Mildenhall3  Pratul P. Srinivasan3  Jonathan T. Barron3  Henrik Kretzschmar2
1UC Berkeley  2Waymo  3Google Research
*Work done as an intern at Waymo.

arXiv:2202.05263v1 [cs.CV] 10 Feb 2022

[Figure 1: renderings of the Alamo Square neighborhood, SF, from June and Sept. captures; 1 km scale bar.]
Figure 1. Block-NeRF is a method that enables large-scale scene reconstruction by representing the environment using multiple compact NeRFs that each fit into memory. At inference time, Block-NeRF seamlessly combines renderings of the relevant NeRFs for the given area. In this example, we reconstruct the Alamo Square neighborhood in San Francisco using data collected over 3 months. Block-NeRF can update individual blocks of the environment without retraining on the entire scene, as demonstrated by the construction on the right. Video results can be found on the project website waymo.com/research/block-nerf.

Abstract

We present Block-NeRF, a variant of Neural Radiance Fields that can represent large-scale environments. Specifically, we demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment. We adopt several architectural changes to make NeRF robust to data captured over months under different environmental conditions. We add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined. We build a grid of Block-NeRFs from 2.8 million images to create the largest neural scene representation to date, capable of rendering an entire neighborhood of San Francisco.

1. Introduction

Recent advancements in neural rendering such as Neural Radiance Fields [42] have enabled photo-realistic reconstruction and novel view synthesis given a set of posed camera images [3,40,45]. Earlier works tended to focus on small-scale and object-centric reconstruction. Though some methods now address scenes the size of a single room or building, these are generally still limited and do not naïvely scale up to city-scale environments. Applying these methods to large environments typically leads to significant artifacts and low visual fidelity due to limited model capacity.

Reconstructing large-scale environments enables several important use-cases in domains such as autonomous driving [32,44,68] and aerial surveying [14,35]. One example is mapping, where a high-fidelity map of the entire operating domain is created to act as a powerful prior for a variety of problems, including robot localization, navigation, and collision avoidance. Furthermore, large-scale scene reconstructions can be used for closed-loop robotic simulations [13]. Autonomous driving systems are commonly evaluated by re-simulating previously encountered scenarios; however, any deviation from the recorded encounter may change the vehicle's trajectory, requiring high-fidelity novel view renderings along the altered path. Beyond basic view synthesis, scene-conditioned NeRFs are also capable of changing environmental lighting conditions such as camera exposure, weather, or time of day, which can be used to further augment simulation scenarios.

Reconstructing such large-scale environments introduces additional challenges, including the presence of transient objects (cars and pedestrians), limitations in model capacity, and memory and compute constraints. Furthermore, training data for such large environments is highly unlikely to be collected in a single capture under consistent conditions. Rather, data for different parts of the environment may need to be sourced from different data collection efforts, introducing variance in both scene geometry (e.g., construction work and parked cars) and appearance (e.g., weather conditions and time of day).
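Per-capture appearance variation of this kind is commonly absorbed by attaching a learned latent code to each training image and feeding it to the color branch of the network, as in NeRF-W [40] (discussed in Section 2.2). A minimal sketch of such conditioning follows; the dimensions, names, and layout are illustrative, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_IMAGES, EMBED_DIM, FEAT_DIM = 1000, 32, 256  # illustrative sizes

# One learned appearance code per training image, optimized jointly with
# the network weights so each image can explain its own lighting/weather.
appearance_codes = 0.01 * rng.normal(size=(NUM_IMAGES, EMBED_DIM))

def color_branch_input(feature, view_dir, exposure, image_idx):
    """Assemble the conditioning for the color MLP: the density network's
    feature vector, the viewing direction, the camera exposure, and the
    appearance embedding of the image the ray was sampled from."""
    return np.concatenate([
        feature,                      # (FEAT_DIM,) from the density MLP
        view_dir,                     # (3,) viewing direction
        np.atleast_1d(exposure),      # (1,) scalar exposure conditioning
        appearance_codes[image_idx],  # (EMBED_DIM,) per-image code
    ])

x = color_branch_input(np.zeros(FEAT_DIM), np.array([0.0, 0.0, 1.0]), 0.5, 7)
```

At inference time the code becomes a free control: rendering the same viewpoint with different images' codes changes the apparent lighting conditions while the geometry, which is not conditioned on the code, stays fixed.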
We extend NeRF with appearance embeddings and learned pose refinement to address the environmental changes and pose errors in the collected data. We additionally add exposure conditioning to provide the ability to modify the exposure during inference. We refer to this modified model as a Block-NeRF. Scaling up the network capacity of Block-NeRF enables the ability to represent increasingly large scenes. However, this approach comes with a number of limitations: rendering time scales with the size of the network, networks can no longer fit on a single compute device, and updating or expanding the environment requires retraining the entire network.

To address these challenges, we propose dividing up large environments into individually trained Block-NeRFs, which are then rendered and combined dynamically at inference time. Modeling these Block-NeRFs independently allows for maximum flexibility, scales up to arbitrarily large environments, and provides the ability to update or introduce new regions in a piecewise manner without retraining the entire environment, as demonstrated in Figure 1. To compute a target view, only a subset of the Block-NeRFs are rendered and then composited based on their geographic location compared to the camera. To allow for more seamless compositing, we propose an appearance matching technique which brings different Block-NeRFs into visual alignment by optimizing their appearance embeddings.

[Figure 2: map diagram of Block-NeRF origins, training radii, a target view, and per-block visibility and color predictions.]
Figure 2. The scene is split into multiple Block-NeRFs that are each trained on data within some radius (dotted orange line) of a specific Block-NeRF origin coordinate (orange dot). To render a target view in the scene, the visibility maps are computed for all of the NeRFs within a given radius. Block-NeRFs with low visibility are discarded (bottom Block-NeRF) and the color output is rendered for the remaining blocks. The renderings are then merged based on each block origin's distance to the target view.

2. Related Work

2.1. Large Scale 3D Reconstruction

Researchers have been developing and refining techniques for 3D reconstruction from large image collections for decades [1,16,33,47,57,77], and much current work relies on mature and robust software implementations such as COLMAP to perform this task [55]. Nearly all of these reconstruction methods share a common pipeline: extract 2D image features (such as SIFT [39]), match these features across different images, and jointly optimize a set of 3D points and camera poses to be consistent with these matches (the well-explored problem of bundle adjustment [23,65]). Extending this pipeline to city-scale data is largely a matter of implementing highly robust and parallelized versions of these algorithms, as explored in works such as Photo Tourism [57] and Building Rome in a Day [1]. Core graphics research has also explored breaking up scenes for fast high-quality rendering [38].

These approaches typically output a camera pose for each input image and a sparse 3D point cloud. To get a complete 3D scene model, these outputs must be further processed by a dense multi-view stereo algorithm (e.g., PMVS [18]) to produce a dense point cloud or triangle mesh. This process presents its own scaling difficulties [17]. The resulting 3D models often contain artifacts or holes in areas with limited texture or specular reflections, as these are challenging to triangulate across images. As such, they frequently require further postprocessing to create models that can be used to render convincing imagery [56]. However, this task is mainly the domain of novel view synthesis, and 3D reconstruction techniques primarily focus on geometric accuracy.

In contrast, our approach does not rely on large-scale SfM to produce camera poses, instead performing odometry using various sensors on the vehicle as the images are collected [64].
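The inference-time procedure summarized in Figure 2 — keep only nearby Block-NeRFs whose predicted visibility is sufficient, then blend the survivors' renderings according to each block origin's distance to the target view — can be illustrated with a toy version. The visibility threshold, the plain inverse-distance weighting, and all names here are our own simplifications, not the paper's exact procedure:

```python
import numpy as np

def select_and_merge(origins, radii, visibilities, renders,
                     cam_pos, vis_thresh=0.1):
    """Toy Block-NeRF compositing: discard blocks that are out of range
    or poorly visible, then blend the remaining blocks' renderings with
    normalized inverse-distance weights."""
    d = np.linalg.norm(origins - cam_pos, axis=1)      # camera-to-origin
    keep = (d < radii) & (visibilities > vis_thresh)   # culling step
    w = 1.0 / np.maximum(d[keep], 1e-6)                # inverse distance
    w /= w.sum()                                       # weights sum to 1
    return np.tensordot(w, renders[keep], axes=1)      # weighted blend
```

With this weighting, a camera directly at a block origin is rendered almost entirely from that block, while a camera on the overlap between two blocks receives a smooth mixture of both — the property that motivates the appearance-matching step, since blended blocks must agree visually.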
2.2. Novel View Synthesis

Given a set of input images of a given scene and their camera poses, novel view synthesis seeks to render observed scene content from previously unobserved viewpoints, allowing a user to navigate through a recreated environment with high visual fidelity.

Geometry-based Image Reprojection. Many approaches to view synthesis start by applying traditional 3D reconstruction techniques to build a point cloud or triangle mesh representing the scene. This geometric "proxy" is then used to reproject pixels from the input images into new camera views, where they are blended by heuristic [6] or learning-based methods [24,52,53]. This approach has been scaled to long trajectories of first-person video [31], panoramas collected along a city street [30], and single landmarks from the Photo Tourism dataset [41]. Methods reliant on geometry proxies are limited by the quality of the initial 3D reconstruction, which hurts their performance in scenes with complex geometry or reflectance effects.

Volumetric Scene Representations. Recent view synthesis work has focused on unifying reconstruction and rendering and learning this pipeline end-to-end, typically using a volumetric scene representation. Methods for rendering small-baseline view interpolation often use feed-forward networks to learn a mapping directly from input images to an output volume [15,76], while methods such as Neural Volumes [37] that target larger-baseline view synthesis run a global optimization over all input images to reconstruct every new scene, similar to traditional bundle adjustment.

Neural Radiance Fields (NeRF) [42] combines this single-scene optimization setting with a neural scene representation capable of representing complex scenes much more efficiently than a discrete 3D voxel grid; however, its rendering model scales very poorly to large-scale scenes in terms of compute. Follow-up work has proposed making NeRF more efficient by partitioning space into smaller regions, each containing its own lightweight NeRF network [48,49]. Unlike our method, these network ensembles must be trained jointly, limiting their flexibility. Another approach is to provide extra capacity in the form of a coarse 3D grid of latent codes [36]. This approach has also been applied to compress detailed 3D shapes into neural signed distance functions [62] and to represent large scenes using occupancy networks [46].

We build our Block-NeRF implementation on top of mip-NeRF [3], which improves aliasing issues that hurt NeRF's performance in scenes where the input images observe the scene from many different distances. We incorporate techniques from NeRF in the Wild (NeRF-W) [40], which adds a latent code per training image to handle inconsistent scene appearance when applying NeRF to landmarks from the Photo Tourism dataset. NeRF-W creates a separate NeRF for each landmark from thousands of images, whereas our approach combines many NeRFs to reconstruct a coherent large environment from millions of images. Our model also incorporates a learned camera pose refinement, which has been explored in previous works [34,59,66,69,70].

Some NeRF-based methods use segmentation data to isolate and reconstruct static [67] or moving objects (such as people or cars) [44,73] across video sequences. As we focus primarily on reconstructing the environment itself, we choose to simply mask out dynamic objects during training.

[Figure 3: diagram of the MLPs f_σ, f_c, and f_v and their inputs x, d, exposure, and appearance embedding.]
Figure 3. Our model is an extension of the model presented in mip-NeRF [3]. The first MLP f_σ predicts the density σ for a position x in space. The network also outputs a feature vector that is concatenated with the viewing direction d, the exposure level, and an appearance embedding. These are fed into a second MLP f_c that outputs the color for the point. We additionally train a visibility network f_v to predict whether a point in space was visible in the training views, which is used for culling Block-NeRFs during inference.

2.3. Urban Scene Camera Simulation

Camera simulation has become a popular data source for training and validating autonomous driving systems on interactive platforms [2,28]. Early works [13,19,51,54] synthesized data from scripted scenarios and manually created 3D assets. These methods suffered from domain mismatch and limited scene-level diversity. Several recent works tackle the simulation-to-reality gap by minimizing the distribution shifts in the simulation and rendering pipeline. Kar et al. [26] and Devaranjan et al. [12] proposed to minimize the scene-level distribution shift from rendered outputs to real camera sensor data through a learned scenario generation framework. Richter et al. [50] leveraged intermediate rendering buffers in the graphics pipeline to improve the photorealism of synthetically generated camera images.

Towards the goal of building photo-realistic and scalable camera simulation, prior methods [9,32,68] leverage rich multi-sensor driving data collected during a single drive to reconstruct 3D scenes for object injection [9] and novel view synthesis [68] using modern machine learning techniques, including image GANs for 2D neural rendering. Relying on a sophisticated surfel reconstruction pipeline, SurfelGAN [68] is still susceptible to errors in graphical reconstruction and can suffer from the limited range and vertical field-of-view of LiDAR scans. In contrast to existing efforts, our work tackles the 3D rendering problem and is capable of modeling the real camera data captured from multiple drives under varying environmental conditions, such as weather and time of day, which is a prerequisite for reconstructing large-scale areas.

3. Background

We build upon NeRF [42] and its extension mip-NeRF [3]. Here, we summarize relevant parts of these methods. For details, please refer to the original papers.

3.1. NeRF and mip-NeRF Preliminaries

Neural Radiance Fields (NeRF) [42] is a coordinate-based neural scene representation that is optimized through a differentiable rendering loss to reproduce the appearance of a set of input images from known camera poses. After optimization, the NeRF model can be used to render previously unseen viewpoints.

The NeRF scene representation is a pair of multilayer perceptrons (MLPs). The first MLP f_σ takes in a 3D position x and outputs volume density σ and a feature vector. This feature vector is concatenated with a 2D viewing direction d and fed into the second MLP f_c, which outputs an RGB color c. This architecture ensures that the output color can vary when observed from different angles, allowing NeRF to represent reflections and glossy materials, but that the underlying geometry represented by σ is only a function of position.

Each pixel in an image corresponds to a ray r(t) = o + td through 3D space. To calculate the color of r, NeRF randomly samples distances {t_i}_{i=0}^{N} along the ray and passes the points r(t_i) and direction d through its MLPs to calculate σ_i and c_i. The resulting output color is

$$\mathbf{c}_{\mathrm{out}} = \sum_{i=1}^{N} w_i \mathbf{c}_i \,, \quad \text{where } w_i = T_i \left(1 - e^{-\Delta_i \sigma_i}\right), \tag{1}$$

$$T_i = \exp\Big(-\sum_{j<i} \Delta_j \sigma_j\Big) \,, \quad \Delta_i = t_i - t_{i-1} \,. \tag{2}$$

Rather than shading individual points along each ray, mip-NeRF reasons about the conical frustums obtained by splitting the cone cast through each pixel into a series of intervals. To feed these frustums into the MLP, mip-NeRF approximates each of them as Gaussian distributions with parameters μ_i, Σ_i and replaces the positional encoding γ_PE with its expectation over the input Gaussian,

$$\gamma_{\mathrm{IPE}}(\mu, \Sigma) = \mathbb{E}_{X \sim \mathcal{N}(\mu, \Sigma)}\left[\gamma_{\mathrm{PE}}(X)\right], \tag{4}$$

referred to as an integrated positional encoding.

4. Method

Training a single NeRF does not scale when trying to represent scenes as large as cities. We instead propose splitting the environment into a set of Block-NeRFs that can be independently trained in parallel and composited during inference. This independence enables the ability to expand the environment with additional Block-NeRFs or update blocks without retraining the entire environment (see Figure 1). We dynamically select relevant Block-NeRFs for rendering, which are then composited in a smooth manner when traversing the scene. To aid with this compositing, we optimize the appearance codes to match lighting conditions and use interpolation weights computed based on each Block-NeRF's distance to the novel view.

4.1. Block Size and Placement

The individual Block-NeRFs should be arranged to collectively ensure full coverage of the target environment. We typically place one Block-NeRF at each intersection, covering the intersection itself and any connected street 75% of the way until it converges into the next intersection (see Figure 1). This results in a 50% overlap between any two adjacent blocks on the connecting street segments, making appearance alignment easier between them. Following this procedure means that the block size is variable; where neces-
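The volume-rendering quadrature of Eqs. (1)-(2) in Section 3.1 maps directly to code. The following is a minimal NumPy sketch (function and variable names are ours) that computes the interval lengths Δ_i, the transmittance T_i, the weights w_i, and the composited ray color:

```python
import numpy as np

def composite_ray(ts, sigmas, colors):
    """Volume-rendering quadrature of Eqs. (1)-(2).
    ts:     (N+1,) sample distances t_0..t_N along the ray
    sigmas: (N,)   densities for the N intervals
    colors: (N,3)  RGB values for the N intervals
    Returns the composited color and the per-sample weights w.
    """
    deltas = np.diff(ts)                        # Δ_i = t_i − t_{i−1}
    optical_depth = np.cumsum(deltas * sigmas)  # running Σ Δ_j σ_j
    # T_i = exp(−Σ_{j<i} Δ_j σ_j): transmittance before interval i
    T = np.exp(-np.concatenate([[0.0], optical_depth[:-1]]))
    w = T * (1.0 - np.exp(-deltas * sigmas))    # Eq. (1) weights
    return (w[:, None] * colors).sum(axis=0), w
```

Note that the weights telescope: since T_i α_i = T_i − T_{i+1}, their sum is 1 − exp(−Σ_i Δ_i σ_i), so a ray through dense geometry saturates toward an opaque surface color while a ray through empty space contributes nothing.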