NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors

Congyue Deng2* Chiyu "Max" Jiang1 Charles R. Qi1 Xinchen Yan1 Yin Zhou1 Leonidas Guibas2,3 Dragomir Anguelov1
1Waymo 2Stanford University 3Google Research

Figure 1. From left to right: We present a single-image NeRF synthesis framework for in-the-wild images without 3D supervision by leveraging general priors from large-scale image diffusion models. Given an input image, we optimize for a NeRF by minimizing an image distribution loss for arbitrary-view renderings with the diffusion model conditioned on the input image. We design a two-section semantic feature as the conditioning input to the diffusion model. The first section is the image caption s₀, which carries the overall semantics; the second section is a text embedding s∗ extracted from the input image with textual inversion, which captures additional visual cues. Our two-section semantic feature provides an appropriate image prior, allowing the synthesis of a realistic NeRF coherent to the input image.

Abstract

2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving it due to the prior knowledge of the 3D world they develop over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary-view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence, as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.

*Work done as an intern at Waymo.

1. Introduction

Novel view synthesis is a long-standing problem in computer vision and computer graphics. Recent progress in neural rendering, such as NeRFs [23], has made huge strides in novel view synthesis. Given a set of multi-view images with known camera poses, NeRFs represent a static 3D scene as a radiance field parametrized by a neural network, which enables rendering at novel views with the learned network. A line of work has been focusing on reducing the required inputs to NeRF reconstruction, ranging from dense inputs with calibrated camera poses to sparse images [12, 26, 52] with noisy or without camera poses [48]. Yet the problem of NeRF synthesis from one single view remains challenging due to its ill-posed nature, as a one-to-one correspondence from a 2D image to a 3D scene does not exist. Most existing works formulate this as a reconstruction problem and tackle it by training a network to predict the NeRF parameters from the input image [9, 52]. But they require matched multiview images with calibrated camera poses as supervision, which is inaccessible in many cases, such as images from the Internet or captured by non-expert users with mobile devices. Recent attempts have focused on relaxing this constraint by using unsupervised training with novel-view adversarial losses and self-consistency [22, 51]. But they still require the test cases to follow the training distribution, which limits their generalizability. There is also work [45] that aggregates priors learned on synthetic multi-view datasets and transfers them to in-the-wild images using data distillation. But it misses fine details and generalizes poorly to unseen categories.

Despite the difficulty of 2D-to-3D mapping for computers, it is actually not a difficult task for human beings. Humans gain knowledge of the 3D world through daily observations and form a common sense of how things should and should not look. Given a specific image, they can quickly narrow down their prior knowledge to the visual input. This makes humans good at solving ill-posed perception problems like single-view 3D reconstruction. Inspired by this, we propose a single-image NeRF synthesis framework without 3D supervision by leveraging a large-scale diffusion-based 2D image generation model (Figure 1). Given an input image, we optimize for a NeRF by minimizing an image distribution loss for arbitrary-view renderings with the diffusion model conditioned on the input image. An unconstrained image diffusion is the 'general prior', which is inclusive but also vague. To narrow down the prior knowledge and relate it to the input image, we design a two-section semantic feature as the conditioning input to the diffusion model. The first section is the image caption, which carries the overall semantics; the second is a text embedding extracted from the input image with textual inversion [10], which captures additional visual cues. These two sections of language guidance facilitate our realistic NeRF synthesis with semantic and visual coherence between different views. In addition, we introduce a geometric loss based on the estimated depth of the input view for regularizing the underlying 3D structure. Learned with all the guidance and constraints, our model is able to leverage the general image prior and perform zero-shot NeRF synthesis on single image inputs. Experimental results show that we can generate high quality novel views from diverse in-the-wild images.

To summarize, our key contributions are:
• We formulate single-view reconstruction as a conditioned 3D generation problem and propose a single-image NeRF synthesis framework without 3D supervision, using 2D priors from diffusion models trained on large image datasets.
• We design a two-section semantic guidance to narrow down the general prior knowledge conditioned on the input image, enforcing synthesized novel views to be semantically and visually coherent.
• We introduce a geometric regularization term on estimated depth maps with 3D uncertainties.
• We validate our zero-shot novel view synthesis results on the DTU MVS [13] dataset, achieving higher quality than supervised baselines. We also demonstrate our capability of generating novel-view renderings with high visual quality on in-the-wild images.

2. Related Work

Novel view synthesis with NeRF. The recently proliferating NeRF representation [23] has shown great success in novel view synthesis, a long-standing task in computer graphics and vision. Combining differentiable rendering [16, 53, 54, 55] with neural network scene parametrizations, NeRF is able to recover the underlying 3D scene from a collection of posed images and render it at novel views realistically. A number of follow-up works have been focusing on relaxing NeRF inputs to less informative data such as unposed images [21, 48, 50] or sparse views [7, 12, 26, 34]. As less data gives rise to a more complex optimization landscape, a variety of regularization losses have been studied, for example: RegNeRF [26] regularizes the geometry and appearance of patches, DDP [34] and DS-NeRF [7] regularize the depth maps, DietNeRF [12] enforces semantic consistency between views by minimizing a CLIP [30] feature loss, and GNeRF [21] adopts a patch-based adversarial loss. Another line of work learns NeRF-based novel-view prediction for few- or single-image inputs by pre-training a scene prior on a large dataset of 3D scenes containing dense views [5, 6, 18, 44, 46, 52]. With additional self-supervision techniques such as equivariance [9] or cycle-consistency [22], the learning of scene priors can be done simply from sparse- or single-view data, or even purely from unposed image collections with an image adversarial loss [2, 3, 27, 39]. These two lines of work both have their specialties and constraints: the first is generalizable to any scene configuration, but is also less competitive in the more challenging scenarios such as single-image novel view synthesis with high quality requirements; the second, on the other hand, has a strong ability to infer unseen novel views from very limited inputs, but is also restricted to certain scene categories modeled by the scene priors learned from the training data. In our work, we leverage a diffusion-based image prior for NeRF synthesis that is general enough to model variations of in-the-wild images while having the adaptivity to each specific input image.
Diffusion-based generative models. Denoising diffusion probabilistic models [11, 41], or score-based generative models [42, 43], have recently caught a surge of interest due to their simple designs and excellent performance across a variety of computer vision tasks such as image generation [11, 41, 42, 43], completion [36, 43], and editing [14, 20]. In visual content creation, language-guided image diffusion models such as DALL-E 2 [32], Imagen [37], and Stable Diffusion [35] have shown great success in generating photorealistic images with strong semantic correlation to the given text-prompt inputs. In addition to the success of 2D image diffusion models, more recent works have also extended diffusion models to 3D content generation. [19, 57] generate 3D point clouds with point diffusions. 3DiM [49] shows uncertainty-aware novel view synthesis with image diffusions conditioned on input views and poses, but it does not have guaranteed multiview consistency, as no underlying 3D representation is adopted. More related to ours are DreamFusion [28] and GAUDI [1], which also generate NeRFs with diffusions: [28] generates NeRFs under language guidance by optimizing for their renderings at randomly sampled views with a 2D image diffusion model [37]; [1] trains a diffusion model on the latent space of NeRF scenes, but the learned scene distribution is limited to a set of indoor 3D scenes and does not generalize to in-the-wild images. Similar to [28], we also leverage 2D image diffusions to optimize for the NeRF renderings at novel views, but instead of unconstrained NeRF generation with user-specified language inputs, we study how to faithfully capture the features of single-view image inputs and use them to constrain the novel-view image distributions.

3. Method

An overview of our method is shown in Figure 2.

Figure 2. Method overview. We represent the underlying 3D scene as a NeRF and optimize for its parameters with three losses: a reconstruction loss at the fixed input view; a diffusion loss at arbitrarily sampled views, which also takes a conditioning text input generated from the input image with our two-section feature extraction; and finally, a depth correlation loss at the input view regularizing the 3D geometry.

Given an input image x₀, we would like to learn a NeRF representation F_ω : (x, y, z) → (c, σ) as its 3D reconstruction†. The NeRF holds the rendering equation that, for any camera view with pose P, one can sample camera rays r(t) = o + td and render the image x at this view with

Ĉ(r) = ∫_{t_n}^{t_f} T(t) σ(t) c(t) dt    (1)

where T(t) = exp(−∫_{t_n}^{t} σ(s) ds). For more details, please refer to Mildenhall et al. [23]. For simplicity, we denote this whole rendering equation by x = f(P, ω), which means NeRF f renders image x at camera pose P with parameters ω.

†Here we use a Lambertian NeRF without view direction inputs for enforcing stronger multiview consistency.

Instead of predicting the NeRF parameters ω from x₀ in a forward pass, we formulate this as a conditioned 3D generation problem

f(·, ω) ∼ 3D scene distribution | f(P₀, ω) = x₀    (2)

where we optimize the NeRF to follow a 3D scene distribution conditioned on that its rendering f(P₀, ω) at a given view P₀ should be the input image x₀.

Directly learning the 3D scene distribution prior requires large 3D datasets, which are less straightforward to acquire, and restricts its application to unseen scene categories. To enable better generalizability to in-the-wild scenarios, we instead leverage 2D image priors and reformulate the objective into

∀P, f(P, ω) ∼ 𝒫 | f(P₀, ω) = x₀    (3)

where the optimization is conducted on images f(P, ω) rendered at arbitrarily sampled views, pushing them to follow an image prior 𝒫 while satisfying the constraint x₀ = f(P₀, ω). The overall objective can be written as maximizing the conditional probability

max_ω E_P [ 𝒫(f(P, ω) | f(P₀, ω) = x₀, s) ].    (4)

Here, s is an additional semantic guidance term that we apply to further restrict the prior image distribution to fit the generation context. In contrast to DreamFusion [28], which also utilizes a language-guided image diffusion model as a 2D image prior for sampled views, our main contribution stands in our approach for further constraining the identity of the generated 3D volume to be consistent with the inputs.

We cover more details on this novel-view distribution loss in Sec. 3.1. We utilize natural language descriptions of the scene as the semantic guidance; more details on this will be discussed in Sec. 3.2. In addition, as the image diffusion model only operates on the rendered RGB colors, we further apply a geometric regularization with a depth map estimated at the input view to facilitate the NeRF optimization (Sec. 3.3).
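In practice, the rendering integral in Eq. (1) is approximated by numerical quadrature over discrete samples along each ray, as in Mildenhall et al. [23]. A minimal NumPy sketch of that quadrature, with toy densities and colors standing in for the outputs of a real NeRF network (all names here are illustrative):

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """Discretize Eq. (1): C_hat = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    deltas = np.diff(ts)                     # segment lengths between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i
    weights = trans * alphas                 # quadrature weights; they sum to <= 1
    color = (weights[:, None] * colors).sum(axis=0)
    # Expected depth along the ray (the quantity regularized in Sec. 3.3):
    depth = (weights * ts[:-1]).sum()
    return color, depth, weights

# Toy example: 64 samples on one ray through a uniform-density medium.
ts = np.linspace(2.0, 6.0, 65)               # t_n = 2, t_f = 6
sigmas = np.full(64, 0.5)
colors = np.tile([0.8, 0.2, 0.1], (64, 1))
color, depth, weights = render_ray(sigmas, colors, ts)
```

For a uniform medium this discretization telescopes exactly: the weights sum to 1 − exp(−∫σ), which matches the continuous transmittance in Eq. (1).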
3.1. Novel View Distribution Loss

Denoising diffusion probabilistic models (DDPMs) are generative models that learn a distribution over training data samples. Recently, there have been many advances in language-guided image synthesis with diffusion models. We build our method upon the recent Latent Diffusion Model (LDM) [35] for its high quality and efficiency in image generation. It adopts a pre-trained image auto-encoder, with an encoder E(x) = z mapping images x into latent codes z and a decoder D(E(x)) = x recovering the images. The diffusion process is then trained in the latent space by minimizing the objective

E_{z∼E(x), s, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, c_θ(s))‖²₂ ]    (5)

where t is a diffusion time scale, ε ∼ N(0,1) is a random noise sample, z_t is the latent code z noised to time t with ε, and ε_θ is the denoising network with parameters θ that regresses the noise ε. The diffusion model also takes a conditioning input, which is encoded as c_θ(s) and serves as guidance in the denoising process. For text-to-image generation models such as the LDM, c_θ is a pre-trained large language model that encodes the conditioning texts.

Textual inversion [10] optimizes for the text embedding of one or few images from a text-based image diffusion model. With the LDM objective in Equation 5, we can optimize for the text embedding s∗ for the input image x₀ by

s∗ = argmin_s E_{z∼E(x₀), ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, c_θ(s))‖²₂ ].    (6)

In Figure 3, middle row, images generated with textual inversion are shown. The colors and visual cues of the input image are well captured (orange-colored elements, food, and even the brand logos). However, the semantics at the macro level are sometimes wrong (in the second column, the image is a person playing sports). One reason is that, unlike the multi-image scenario where textual inversion can discover the common contents of the images, it is unclear for one single image what the key features are that the text embedding should focus on.
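Equation (6) optimizes only the conditioning embedding s while the denoiser parameters θ stay frozen. The sketch below mimics that structure with a toy linear "denoiser" in place of ε_θ and a single fixed noise sample; everything here is an illustrative stand-in, not the actual LDM:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_s = 8, 4

# Frozen toy "denoiser": predicts the noise from (z_t, s); stand-in for eps_theta.
W = rng.normal(size=(dim_z, dim_z + dim_s))

z = rng.normal(size=dim_z)     # latent code of the input image, E(x0)
eps = rng.normal(size=dim_z)   # one fixed noise sample for a deterministic demo
z_t = 0.7 * z + 0.7 * eps      # toy forward-noising of z to "time t"

def loss_and_grad(s):
    inp = np.concatenate([z_t, s])
    resid = W @ inp - eps                  # eps_hat - eps
    grad_s = 2.0 * (W.T @ resid)[dim_z:]   # gradient w.r.t. s only; W stays frozen
    return float(resid @ resid), grad_s

s = np.zeros(dim_s)            # the text embedding s* being inverted
losses = []
for _ in range(300):
    l, g = loss_and_grad(s)
    losses.append(l)
    s -= 0.005 * g
```

In the real method, the same frozen-denoiser objective is what back-propagates through the NeRF rendering to give the stochastic gradient on ω described next.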
In a pre-trained diffusion model, the network parameters θ are fixed, and we can instead optimize for the input image x with the same objective, which transforms x to follow the image distribution prior conditioned on s. Letting x = f(P, ω) be our NeRF rendering at an arbitrarily sampled view P, we can back-propagate gradients to the NeRF parameters ω and thus get a stochastic gradient descent on ω.

3.2. Semantics-Conditioned Image Priors

We argue that the prior distribution over all in-the-wild images is not specific enough to guide the novel view synthesis from an arbitrary image. We thus introduce a well-designed guidance s that narrows down the generic prior over natural images to a prior of images related to the input image x₀. Here we choose text as the guidance, which is flexible for describing arbitrary input images. Text-to-image diffusion models such as the LDM utilize a pre-trained large language model as the language encoder to learn a conditional distribution over images conditioned on language. This serves as a natural gateway for us to utilize language as a means to restrict the image prior space.

The most straightforward way of getting a text prompt from the input image is to use an image captioning or classification network S trained on (image, text) datasets and predict a text s₀ = S(x₀). However, while a text description can summarize the semantics of the image, it leaves a huge space of ambiguities, making it hard to include all the visual details of the image, especially with limited prompt length. In Figure 3, top row, we show the images generated with the caption "a collection of products" from the input image on the left. While their semantics are highly accurate with respect to the language description, the generated images have very high variance in their visual patterns and low correlation to the input image.

To reflect both semantic and visual characteristics of the input image in the novel view synthesis task, we combine the two methods by concatenating their text embeddings to form a joint feature s = [s₀, s∗] and use it as the guidance in the diffusion process in Equation 5. Figure 3, bottom row, shows the images generated with this joint feature, with balanced semantics and visual cues.

3.3. Geometric Regularization

While image diffusion shapes the appearance of the NeRF, multiview consistency is difficult to enforce, as the underlying 3D geometry can be different even with the same image rendering [15, 24], making the gradient back-propagation (from the image diffusion to the NeRF parameters ω) highly non-controllable. To this end, we further incorporate a geometric regularization term on the input-view depth to alleviate this issue. We adopt the Dense Prediction Transformer (DPT) model [33], trained on 1.4 million images for zero-shot monocular depth estimation, and apply it to the input image x₀ to estimate a depth map d₀,est. We use this estimated depth to regularize the depth

d̂₀ = ∫_{t_n}^{t_f} T(t) σ(t) t dt    (7)

rendered by the NeRF at the input view P₀. Due to the ambiguities of the estimated depth (including scales, shifts, and camera intrinsics) and estimation error (Figure 4), we cannot back-project pixels with depth to 3D and compute the regularization directly. Instead, we maximize the Pearson correlation between the estimated depth map and the NeRF-rendered depth

ρ(d̂₀, d₀,est) = Cov(d̂₀, d₀,est) / √(Var(d̂₀) Var(d₀,est))    (8)

which measures whether the rendered depth distribution and the noisy estimated depth distribution are linearly correlated.

Figure 4. Ambiguity in estimated depth map.
Bottomrow: Imagesgeneratedwith combinedimagecaptionandtextualinversion.Bothsemanticandvisualfeaturesoftheinputimageareaddressed. domly sampled novel views, we render 128×128 images and resize them to 512 × 512 before feeding them to the encoder of [35]. At the input view, we render at the same resolution as the inputimage to compute the image recon- structionanddepthcorrelationlosses. Figure4.Ambiguityinestimateddepthmap. Baselines. We compare with two state-of-the-art single- viewNeRFreconstructionalgorithms,PixelNeRF[52]and Table1.Single-imagenovelviewsynthesisresultsonDTU. itsfine-tunedmodelwithCLIP[30]featureconsistencyloss Method PSNR↑ SSIM↑ LPIPS↓ asproposedbyDietNeRF[12],bothofwhichtrainedonthe NeRF 8.000 0.286 0.703 trainingsetdatafromtheDTUMVSdataset. Togainbetter pixelNeRF 15.550 0.537 0.535 convergence, we use the predictions from [52] as an ini- pixelNeRF,L ft 16.048 0.564 0.515 tialization for our 3D scene optimization. But our method MSE DietPixelNeRF 14.242 0.481 0.487 isdirectlyappliedtothetestsceneswithoutanyadditional Ours 14.472 0.465 0.421 fine-tuningontheDTUtrainingset. quantitativecomparisonbetweenourmethodandthestate- Results. Table 1 shows the quantitative comparison be- of-the-art single-view NeRF reconstruction methods on a tweenourmethodandthebaselines. Followingtheconven- syntheticdataset. Section4.2showsaqualitativecompari- tion, we report the standard image quality metrics PSNR son as well as more synthesis results of our method on in- and SSIM [47]. Our PSNR and SSIM are slightly lower the-wildimages. than pixelNeRF [52] which directly learns the scene dis- tributions from the DTU training set and are on par with 4.1.SyntheticScenes DietPixelNeRF [12] which enforces semantic consistency Setup. Weevaluate ourmethodon theDTUMVS dataset between views. However, we emphasize that these two [13]with15testscenesasspecifiedin[52]. Foreachinput metrics are less indicative in our scenario as they are lo- image,weuseGPT-2[31]togenerateacaption. 
Weman- calpixel-alignedsimilaritymetricsbetweenthesynthesized ually correct the obvious mistakes made by GPT-2 while novel views and the ground truth images but uncertainties tryingourbesttoavoidintroducingadditionaldetails. The naturallyexistinsingle-view3Dinference.Themiddlecol- scenes and their captions are listed in the supplementary umnofthefirstsceneinFigure5showsanexampleofsuch material. uncertainty. The height of the tallest snack bag in the in- Implementation details. For the NeRF model, we imple- putimagecannotbeinferredasitstopextrudesbeyondthe mentthemulti-resolutiongridsamplerasdescribedin[25]. cameraview. Thewidthofthetoypigintheleftcolumnof For the diffusion model, we employ the text-guided diffu- thethirdsceneisanotherexamplewhichcannotbeinferred sionmodelfrom[35]whichwaspre-trainedontheLAION- fromtheinputsideview. Inbothcasesourmethodguesses 400Mdataset[38]. While[35]operateson512×512im- its novel view (bottom row) in a reasonable sense but dif- ages,NeRF’svolumetricrenderingatthisresolutionwould ferentfromthegroundtruth(toprow). Inaddition,wealso incuranextensivecomputationalburden. Thus,attheran- measurenovelviewswithLPIPS[56],whichisaperceptualFigure5. Single-imagenovelviewsynthesisresultsontheDTUtestscenes. VanillaNeRFcannotrecoverscenesfromsingleimage inputsduetotheill-posednatureofthisproblem.WhilepixelNeRFcaninferthenovelviewimageswiththepriorfromtheDTUtraining setofsimilarscenes,itssynthesizedrenderingsremainnoisyandblurry.WithapixelNeRFinitialization,ourmethodisabletosynthesize cleanernovelviewswithrealisticgeometriesandappearances,despitehavingneverbeentrainedonthisdataset. Uncertaintiesinnovel viewinference:(Thefirstscene,middlecolumn)theexactheightofthetallestsnackbagcannotbeinferredasthetopgoesoutsideofthe cameraview. (Thethirdscene,leftcolumn)thewidthofthetoypigfromthetopviewisundecidablefromtheinputview. Inbothcases, ourmethodguessesareasonableanswerinthesynthesizednovelviewthatisdifferentfromthegroundtruth. 
metriccomputingtheMeanSquaredError(MSE)between More results. Figure 7a shows our results on images of normalized features from all layers of a pre-trained VGG objects from the internet. The text prompts are words or encoder[40]. Ourmethodshowsasignificantimprovement phrasesusedtosearchfortheimages. Thebackgroundsare on this metric compared to the baselines as the diffusion maskedoutusinganoff-the-shelfdichotomousimageseg- modelhelpstoimproveimagequalitieswhilethelanguage mentation network from [29]. For each input, we show 3 guidancemaintainsthemulti-viewsemanticconsistency. different novel views that are distant from the input view. Figure 5 shows a qualitative comparison between our Figure 7b shows our results on images with more com- methodandthebaselines.Withthesceneinitializationfrom plexcontentsandbackgroundsfromtheCOCOdataset[17] [52],ourmethodremovesthenoisesandblurriness,synthe- whichcontains(image,caption)pairs.Withincameraviews sizinghighqualitynovelviews. closetotheinput,ourmodelisstillabletogeneraterealistic renderings.Butitcanhardlygeneralizetodistantviewsdue 4.2.ImagesintheWild tothelimitedcapacityoftheNeRFscenebox. Qualitative comparisons. Figure 6 shows a qualitative 4.3.AblationStudies comparison between our method and existing state-of-the- Weconductablationstudiestoshowtheefficacyofour art single-image to 3D synthesis methods for in-the-wild two-section semantic guidance and geometric regulariza- images [12, 45]. Input images are adopted from the tion. GoogleScannedObjectsdataset[8]withtheircategoryla- bels (‘bag’ and ‘hat’) as captions. Similar to ours, Diet- Semantic guidance. Figure 8a shows the ablation of the NeRF[12]usesaninput-viewconstrainedNeRFoptimiza- two text embeddings s 0 from image captions and s ∗ from tion technique where they minimize the CLIP [30] feature textual inversion. 
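The LPIPS-style comparison described above (MSE between channel-normalized deep features, averaged over layers) can be sketched as follows. The random convolution filters here are purely illustrative stand-ins for the pre-trained VGG layers and learned LPIPS weights; only the normalize-then-MSE structure is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random conv filters as a stand-in "feature extractor" (real LPIPS uses VGG).
filters = [rng.normal(size=(8, 3, 3, 3)), rng.normal(size=(16, 8, 3, 3))]

def features(img, filt):
    """Valid 2D cross-correlation + ReLU, channels-first: (C,H,W) -> (K,H-2,W-2)."""
    C, H, W = img.shape
    K = filt.shape[0]
    out = np.zeros((K, H - 2, W - 2))
    for k in range(K):
        for dy in range(3):
            for dx in range(3):
                w = filt[k, :, dy, dx][:, None, None]
                out[k] += (w * img[:, dy:H - 2 + dy, dx:W - 2 + dx]).sum(0)
    return np.maximum(out, 0.0)

def lpips_like(a, b):
    """Average over layers of the MSE between unit-normalized feature maps."""
    dist, fa, fb = 0.0, a, b
    for filt in filters:
        fa, fb = features(fa, filt), features(fb, filt)
        na = fa / (np.linalg.norm(fa, axis=0, keepdims=True) + 1e-8)  # channel-normalize
        nb = fb / (np.linalg.norm(fb, axis=0, keepdims=True) + 1e-8)
        dist += ((na - nb) ** 2).mean()
    return dist / len(filters)

x = rng.uniform(size=(3, 16, 16))
y = rng.uniform(size=(3, 16, 16))
```

As expected of a distance, identical inputs score zero and differing inputs score positive; unlike PSNR/SSIM, the comparison happens in feature space rather than per pixel.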
Without the captions s 0, the model fails betweenarbitraryviewrenderings.WhileCLIPfeaturesen- tolearntheoverallsemanticsandcannotgenerateamean- forceconsistentappearances,theyfailtocapturetheglobal ingful object. While both the full model and the caption- semanticsoftheobject. SS3D[45]isaforward-prediction only one (without textual inversion) successfully generate model for 3D geometries that transfers the priors learned backack novel views, the results without textual inversion onsyntheticdatasetstoin-the-wildimageswithknowledge s ∗havemoreblurrinessandnoises. Azoom-incomparison distillation. While it generates more structured global ge- isshowninFigure8b. ometries,itfailstocapturethefinegeometricdetailsofthe Figure8cshowsanothercomparisonofmodelswithand inputimage. Thegeometriesofthehatsinthebottomrows withouttextualinversions onthecanexamplefromFigure ∗ arealsoincorrect,withonlythesilhouetteshapepreserved 7b left. In the object regions visible to the input view, the butthestructureof‘hat’shapemissing. fullmodelbetterrecoversthefinedetails(thewhitelettersFigure6. NovelviewsynthesisresultsonobjectsfromtheGoogleScannedObjectsDataset. Left:Ourresultsgeneratedfromsingle- wordtextinputs‘backpack’(top3rows)and‘hat’(bottom3rows).Middle:DietNeRF[12]minimizestheCLIPfeaturedistancesbetween theinputviewandarbitrarilysampledviews. Thisresultsinnovelviewrenderswithconsistenttexturesandstyles,butfailstocapture theglobalsemanticmeaning. Forafaircomparison,DietNeRFisalsooptimizedwithdepthregularization. Right: SS3D[45]predicts coarsegeometriesinaconsistencymanner,butitfailstorecoverallthefinegeometricdetails. Additionally,thegeometriesofthehatsin thebottomrowsareincorrect,withonlythesilhouetteshapepreservedbutthestructureof‘hat’shapemissing. (a)Resultsonobject-centricimagesfromtheinternetwithsingle-word (b)ResultsonimagesfromtheCOCOdataset[17].Inputimageshavemore orshortphrasecaptions.Inputbackgroundsareremovedwith[29]. complexcontentswithbackgroundsandthecaptionsaresentences. 
Figure7.Resultsonimagesinthewild. on the lateral); and in the invisible regions, the full model try. Themodelwithouttheregularizationontheinputview completes the appearances with coherent styles of the in- depthcanstillgeneraterealisticappearancesatnovelviews put (red and white textures at the back of the can), while with the diffusion model, but the underlying 3D geometry themodelwithouttextualinversiondoesnothavesuchap- iserroneousandmulti-viewconsistencyisnotenforced.As pearance coherency. The model with textual inversion can a sanity check, we also visualize the results with only the even synthesize the pull tab at the top (second column of depth loss but without the diffusion model. The model is the zoom-in views) by inferring from the input side view unabletogeneratearealisticNeRFduetothe3Dambigui- thatthisisacancontainingdrinks. tiesofmonoculardepthasstatedinSection3.3. Geometricregularization. Figure9showsanablationon 5.Conclusions the geometric regularization term. Both image renderings and depth maps are visualized. The full model is able to In this paper, we propose a novel framework for zero- synthesize realistic novel views with coherent 3D geome- shot single-view NeRF synthesis for images in the wild(a)Toprow: Fullmodel. Middlerow: Caption-onlyguidancewithout textualinversion.Themodelisstillabletogenerateashapestrictlyfollow- ingthesemanticsandtheinputviewappearanceandgeometricconstraints, (a) A failure case due to the biases in the image diffusion model. butstrugglesmoreinsynthesizingthedetails. Azoom-incomparisonis Top: Novelviewsynthesisresultswithtextprompt‘a shoe in the shownin8bbelow.Bottomrow:Textual-inversion-onlywithoutcaption. style of ’. Bottom: Imagesgeneratedby[35]withtext Textualinversionfailstocapturetheglobalsemantics. prompt“asingleshoe”.Yethalfoftheimageshavetwoshoesinit. (b)Afailurecaseonahighlydeformableinstance. 
While the overall body shape of the cat is captured, the synthesized cat has two heads and two tails.

Figure 10. Failure cases.

Figure 8. Ablations on the two-section semantic guidance. (a) Top row: full model. Middle row: caption-only guidance without textual inversion; the model is still able to generate a shape strictly following the semantics and the input-view appearance and geometric constraints, but struggles more in synthesizing the details (a zoom-in comparison is shown in 8b). Bottom row: textual-inversion-only without the caption; textual inversion fails to capture the global semantics. (b) A zoom-in comparison between full-model results and results without textual inversion; the full model shows better capability of synthesizing less blurry details. (c) Another comparison between models with and without textual inversion. The input is from 7b, left, bottom row. The full model is able to synthesize better texture details at visible regions as well as completing the invisible regions with similar textures, while the caption-only model renderings are more blurry and cannot fill in the invisible regions.

Figure 9. Ablations on the geometric regularization. Visualization of input-view reconstruction and novel views on rendered images and depth maps. Top row: the full model is able to synthesize realistic novel views while preserving geometric coherency. Middle row: without the depth correlation loss, the diffusion model is still able to generate reasonable appearances, but the underlying 3D geometry is erroneous and the novel views are inconsistent with the input view. Bottom row: the input-view depth estimation cannot guide novel view synthesis by itself without the diffusion model, due to 3D ambiguities.

5. Conclusions

In this paper, we propose a novel framework for zero-shot single-view NeRF synthesis for images in the wild without 3D supervision. We leverage the general image priors in 2D diffusion models and apply them to 3D NeRF generation conditioned on the input image. To efficiently use these priors in synthesizing consistent views, we design a two-section language guidance as conditioning inputs to the diffusion model, which unifies the semantic and visual features of the input image. To our knowledge, we are the first to combine semantic and visual features in the text embedding space and apply this to novel view synthesis. In addition, we introduce a geometric regularization term while addressing the 3D ambiguity of monocular-estimated depth maps. Our experimental results show that, with well-designed guidance and constraints, one can leverage general image priors for specific image-to-3D tasks, enabling generalizable and adaptable reconstruction frameworks.

Limitations and future work. As our method relies on multiple large pre-trained image models [29, 31, 33, 35], any biases in these models will affect our synthesis results. Figure 10a shows an example where the image diffusion model [35] generates two shoes even though the text prompt is "a single shoe", resulting in our synthesized NeRF showing the features of multiple shoes. Our method is also less robust to highly deformable instances, as our language guidance focuses on semantics and styles but lacks a global description of physical states and dynamics. Figure 10b shows such a failure case: renderings from each independent view are visually plausible but represent different states of the same instance.

Besides, while formulation-wise the optimization is applicable to any scene, it is more suitable for object-centric images, as it takes the underlying assumption that the scene has exactly the same semantics from any view, which is not true for large scenes with complex configurations due to view changes and occlusions. The text embedding learned from textual inversion has the dimension of a single-word embedding, limiting its expressiveness in representing the subtleties of complex contents.

A. Additional Results

Figure 12 shows our additional results and comparisons for images in the wild. The results are presented in 4 groups, each group containing 3 objects from similar classes but with different content details and appearances. We use this to test the capability of each method in capturing the overall semantics and visual feature variations from input images.

Comparison to DietNeRF [12]. For a fair comparison, DietNeRF is also optimized with the estimated depth map from the input image. While DietNeRF is able to maintain appearance consistency between different views, it fails to capture the overall geometry of the objects, especially when the object has complex geometric structures (such as the chairs in the 1st group and the baskets in the 3rd group). In the 4th group (the skirts), our generated textures for the unseen back regions are also closer to the input image than DietNeRF's.

Our method also addresses the naturally existing ambiguity in novel-view inference, especially for the occluded regions in the input view. For example, in the 3rd group in Figure 12, the unseen spaces of the baskets are filled with different fruits/flowers/vegetables, instead of duplicating the input views as DietNeRF [12] does. As a feature or as an inductive bias, such synthesis results are also affected by the 2D distribution from the image diffusion model. For example, Figure 11 shows the image generation results by [35] with the text prompt 'a pumpkin'. Half of them are Jack-o'-lanterns. This makes our synthesized pumpkin also have a Jack-o'-lantern face at its back (the 3rd row of the 2nd group).

Figure 11. Images generated by [35] with 'a pumpkin'.

Comparison to SS3D [45]. As a geometry-based method, SS3D captures better global geometries than DietNeRF even without the depth regularization, especially on the object classes covered by ShapeNet [4] where the

Figure 12. Additional results for images in the wild.

References

[1] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for immersive 3d scene generation. arXiv preprint arXiv:2207.13751, 2022. 3
[2] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022. 2
[3] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021. 2
[4] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 9
[5] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021. 2
[6] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (srf): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7911–7920, 2021. 2
[7] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022. 2
[8] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. arXiv preprint arXiv:2204.11918, 2022. 6
[9] Emilien Dupont, Miguel Bautista Martin, Alex Colburn, Aditya Sankar, Josh Susskind, and Qi Shan. Equivariant neural rendering. In International Conference on Machine Learning, pages 2761–2770. PMLR, 2020. 2
[10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 2, 4
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3
[12] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021. 1, 2, 5, 6, 7, 9
[13] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014. 2, 5
[14] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022. 3
[15] Steven M Lehar. The world in your head: A gestalt view of the mechanism of conscious experience. Psychology Press, 2003. 4
[16] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable monte carlo ray tracing through edge sampling. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018. 2
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 6, 7
[18] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Com-
[28] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 3
[29] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, and Luc Van Gool. Highly accurate dichotomous image segmentation. In ECCV, 2022. 6, 7, 8
[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 5, 6
[31] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 5, 8
[32] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.
Hierarchical text-conditional image gen- puter Vision and Pattern Recognition, pages 7824–7833, erationwithcliplatents. arXivpreprintarXiv:2204.06125, 2022. 2 2022. 3 [19] ShitongLuoandWeiHu. Diffusionprobabilisticmodelsfor [33] Rene´ Ranftl,AlexeyBochkovskiy,andVladlenKoltun. Vi- 3dpointcloudgeneration. InProceedingsoftheIEEE/CVF sion transformers for dense prediction. In Proceedings of Conference on Computer Vision and Pattern Recognition, the IEEE/CVF International Conference on Computer Vi- pages2837–2845,2021. 3 sion,pages12179–12188,2021. 4,8 [20] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun- [34] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, YanZhu,andStefanoErmon. Sdedit: Imagesynthesisand PratulPSrinivasan,andMatthiasNießner. Densedepthpri- editingwithstochasticdifferentialequations. arXivpreprint ors for neural radiance fields from sparse input views. In arXiv:2108.01073,2021. 3 ProceedingsoftheIEEE/CVFConferenceonComputerVi- [21] QuanMeng,AnpeiChen,HaiminLuo,MinyeWu,HaoSu, sionandPatternRecognition,pages12892–12901,2022. 2 Lan Xu, Xuming He, and Jingyi Yu. Gnerf: Gan-based [35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, neuralradiancefieldwithoutposedcamera. InProceedings Patrick Esser, and Bjo¨rn Ommer. High-resolution image oftheIEEE/CVFInternationalConferenceonComputerVi- synthesis with latent diffusion models. In Proceedings of sion,pages6351–6361,2021. 2 theIEEE/CVFConferenceonComputerVisionandPattern [22] Lu Mi, Abhijit Kundu, David Ross, Frank Dellaert, Noah Recognition(CVPR),pages10684–10695,June2022. 3,4, Snavely, and Alireza Fathi. im2nerf: Image to neural ra- 5,8,9 diance field in the wild. arXiv preprint arXiv:2209.04061, [36] ChitwanSaharia,WilliamChan,HuiwenChang,ChrisLee, 2022. 2 Jonathan Ho, Tim Salimans, David Fleet, and Mohammad [23] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Norouzi. Palette: Image-to-image diffusion models. In JonathanTBarron,RaviRamamoorthi,andRenNg. 
Nerf: ACM SIGGRAPH 2022 Conference Proceedings, pages 1– Representingscenesasneuralradiancefieldsforviewsyn- 10,2022. 3 thesis.CommunicationsoftheACM,65(1):99–106,2021.1, [37] Chitwan Saharia, William Chan, Saurabh Saxena, Lala 2,3 Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed [24] NiloyJMitraandMarkPauly. Shadowart. ACMTransac- Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, tionsonGraphics,28(CONF):156–1,2009. 4 Rapha Gontijo Lopes, et al. Photorealistic text-to-image [25] ThomasMu¨ller,AlexEvans,ChristophSchied,andAlexan- diffusionmodelswithdeeplanguageunderstanding. arXiv derKeller.Instantneuralgraphicsprimitiveswithamultires- preprintarXiv:2205.11487,2022. 3 olution hash encoding. arXiv preprint arXiv:2201.05989, [38] Christoph Schuhmann, Richard Vencu, Romain Beaumont, 2022. 5 Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo [26] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Coombes,JeniaJitsev,andAranKomatsuzaki.Laion-400m: MehdiSMSajjadi,AndreasGeiger,andNohaRadwan.Reg- Open dataset of clip-filtered 400 million image-text pairs. nerf: Regularizingneuralradiancefieldsforviewsynthesis arXivpreprintarXiv:2111.02114,2021. 5 fromsparseinputs. InProceedingsoftheIEEE/CVFCon- [39] KatjaSchwarz, YiyiLiao, MichaelNiemeyer, andAndreas ferenceonComputerVisionandPatternRecognition,pages Geiger. Graf: Generative radiance fields for 3d-aware im- 5480–5490,2022. 1,2 agesynthesis. AdvancesinNeuralInformationProcessing [27] MichaelNiemeyerandAndreasGeiger. Giraffe:Represent- Systems,33:20154–20166,2020. 2 ingscenesascompositionalgenerativeneuralfeaturefields. [40] KarenSimonyanandAndrewZisserman. Verydeepconvo- In Proceedings of the IEEE/CVF Conference on Computer lutional networks for large-scale image recognition. arXiv VisionandPatternRecognition,pages11453–11464,2021. preprintarXiv:1409.1556,2014. 6 2 [41] Jiaming Song, Chenlin Meng, and Stefano Ermon. [28] BenPoole,AjayJain,JonathanTBarron,andBenMilden- Denoising diffusion implicit models. 
arXiv preprintarXiv:2010.02502,2020. 3 man, and Oliver Wang. The unreasonable effectiveness of [42] YangSongandStefanoErmon.Generativemodelingbyesti- deepfeaturesasaperceptualmetric. InProceedingsofthe matinggradientsofthedatadistribution.AdvancesinNeural IEEE conference on computer vision and pattern recogni- InformationProcessingSystems,32,2019. 3 tion,pages586–595,2018. 5 [43] YangSong,JaschaSohl-Dickstein,DiederikPKingma,Ab- [57] LinqiZhou,YilunDu,andJiajunWu. 3dshapegeneration hishekKumar,StefanoErmon,andBenPoole. Score-based andcompletionthroughpoint-voxeldiffusion. InProceed- generative modeling through stochastic differential equa- ings of the IEEE/CVF International Conference on Com- tions. arXivpreprintarXiv:2011.13456,2020. 3 puterVision,pages5826–5835,2021. 3 [44] AlexTrevithickandBoYang. Grf: Learningageneralradi- ancefieldfor3drepresentationandrendering. InProceed- ings of the IEEE/CVF International Conference on Com- puterVision,pages15182–15192,2021. 2 [45] KalyanAlwalaVasudev,AbhinavGupta,andShubhamTul- siani. Pre-train, self-train, distill: A simple recipe for su- persizing3dreconstruction.InComputerVisionandPattern Recognition(CVPR),2022. 2,6,7,9 [46] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla,NoahSnavely,andThomasFunkhouser.Ibr- net: Learning multi-view image-based rendering. In Pro- ceedingsoftheIEEE/CVFConferenceonComputerVision andPatternRecognition,pages4690–4699,2021. 2 [47] ZhouWang,AlanCBovik,HamidRSheikh,andEeroPSi- moncelli. Imagequalityassessment: fromerrorvisibilityto structuralsimilarity.IEEEtransactionsonimageprocessing, 13(4):600–612,2004. 5 [48] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064,2021. 2 [49] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. 
Novelviewsynthesiswithdiffusionmodels. arXiv preprintarXiv:2210.04628,2022. 3 [50] LiorYariv,YoniKasten,DrorMoran,MeiravGalun,Matan Atzmon, BasriRonen, andYaronLipman. Multiviewneu- ralsurfacereconstructionbydisentanglinggeometryandap- pearance. AdvancesinNeuralInformationProcessingSys- tems,33:2492–2502,2020. 2 [51] Yufei Ye, Shubham Tulsiani, and Abhinav Gupta. Shelf- supervised mesh prediction in the wild. In Proceedings of theIEEE/CVFConferenceonComputerVisionandPattern Recognition,pages8843–8852,2021. 2 [52] AlexYu,VickieYe,MatthewTancik,andAngjooKanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer VisionandPatternRecognition,pages4578–4587,2021. 1, 2,5,6 [53] ChengZhang,BaileyMiller,KanYan,IoannisGkioulekas, andShuangZhao.Path-spacedifferentiablerendering.ACM transactionsongraphics,39(4),2020. 2 [54] Cheng Zhang, Lifan Wu, Changxi Zheng, Ioannis Gkioulekas, Ravi Ramamoorthi, and Shuang Zhao. A dif- ferentialtheoryofradiativetransfer. ACMTransactionson Graphics(TOG),38(6):1–16,2019. 2 [55] ChengZhang,ZihanYu,andShuangZhao. Path-spacedif- ferentiablerenderingofparticipatingmedia. ACMTransac- tionsonGraphics(TOG),40(4):1–15,2021. 2 [56] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht-
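The comparisons above rely on a monocular depth estimate of the input view as a geometric regularizer (DietNeRF is "also optimized with the estimated depth map" for fairness), and the ablation figure contrasts results with and without the depth correlation loss. As an illustrative sketch only, not the authors' implementation, a depth correlation loss of this kind can be written as the negative Pearson correlation between the NeRF-rendered depth and the estimated depth, which tolerates the unknown scale and shift of monocular depth estimators:

```python
import numpy as np

def depth_correlation_loss(rendered_depth, estimated_depth, eps=1e-8):
    """Negative Pearson correlation between two depth maps.

    Illustrative sketch (not the paper's released code): correlation,
    unlike an L2 loss, is invariant to the positive scale and shift
    ambiguity inherent in monocular depth estimates.
    """
    r = np.asarray(rendered_depth, dtype=np.float64).ravel()
    e = np.asarray(estimated_depth, dtype=np.float64).ravel()
    r = r - r.mean()  # remove shift
    e = e - e.mean()
    denom = np.sqrt((r * r).sum() * (e * e).sum()) + eps  # normalize scale
    return -(r * e).sum() / denom  # in [-1, 1]; -1 at perfect correlation
```

Because any positively scaled and shifted version of the true depth attains the minimum value of -1, a regularizer of this form constrains the relative geometry of the scene without forcing the NeRF to match the estimator's absolute, and generally unreliable, depth scale.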