NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors

Congyue Deng2* Chiyu "Max" Jiang1 Charles R. Qi1 Xinchen Yan1 Yin Zhou1 Leonidas Guibas2,3 Dragomir Anguelov1
1Waymo 2Stanford University 3Google Research

Figure 1. From left to right: We present a single-image NeRF synthesis framework for in-the-wild images without 3D supervision by leveraging general priors from large-scale image diffusion models. Given an input image, we optimize for a NeRF by minimizing an image distribution loss for arbitrary-view renderings with the diffusion model conditioned on the input image. We design a two-section semantic feature as the conditioning input to the diffusion model. The first section is the image caption s₀, which carries the overall semantics; the second section is a text embedding s∗ extracted from the input image with textual inversion, which captures additional visual cues. Our two-section semantic feature provides an appropriate image prior, allowing the synthesis of a realistic NeRF coherent to the input image.

Abstract

2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving it due to the prior knowledge of the 3D world they develop over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary-view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence, as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.

*Work done as an intern at Waymo.

1. Introduction

Novel view synthesis is a long-standing problem in computer vision and computer graphics. Recent progress in neural rendering, such as NeRFs [23], has made huge strides in novel view synthesis. Given a set of multi-view images with known camera poses, NeRFs represent a static 3D scene as a radiance field parametrized by a neural network, which enables rendering at novel views with the learned network. A line of work has been focusing on reducing the required inputs to NeRF reconstruction, ranging from dense inputs with calibrated camera poses to sparse images [12, 26, 52] with noisy or without camera poses [48]. Yet the problem of NeRF synthesis from one single view remains challenging due to its ill-posed nature, as a one-to-one correspondence from a 2D image to a 3D scene does not exist. Most existing works formulate this as a reconstruction problem and tackle it by training a network to predict the NeRF parameters from the input image [9, 52]. But they require matched multiview images with calibrated camera poses as supervision, which is inaccessible in many cases, such as images from the Internet or captured by non-expert users with mobile devices. Recent attempts have focused on relaxing this constraint by using unsupervised training with novel-view adversarial losses and self-consistency [22, 51]. But they still require the test cases to follow the training distribution, which limits their generalizability. There is also work [45] that aggregates priors learned on synthetic multi-view datasets and transfers them to in-the-wild images using data distillation. But it misses fine details and generalizes poorly to unseen categories.

Despite the difficulty of 2D-to-3D mapping for computers, it is actually not a difficult task for human beings. Humans gain knowledge of the 3D world through daily observations and form a common sense of how things should and should not look. Given a specific image, they can quickly narrow down their prior knowledge to the visual input. This makes humans good at solving ill-posed perception problems like single-view 3D reconstruction. Inspired by this, we propose a single-image NeRF synthesis framework without 3D supervision by leveraging a large-scale diffusion-based 2D image generation model (Figure 1). Given an input image, we optimize for a NeRF by minimizing an image distribution loss for arbitrary-view renderings with the diffusion model conditioned on the input image. An unconstrained image diffusion is the 'general prior', which is inclusive but also vague. To narrow down the prior knowledge and relate it to the input image, we design a two-section semantic feature as the conditioning input to the diffusion model. The first section is the image caption, which carries the overall semantics; the second is a text embedding extracted from the input image with textual inversion [10], which captures additional visual cues. These two sections of language guidance facilitate our realistic NeRF synthesis with semantic and visual coherence between different views. In addition, we introduce a geometric loss based on the estimated depth of the input view for regularizing the underlying 3D structure. Learned with all the guidance and constraints, our model is able to leverage the general image prior and perform zero-shot NeRF synthesis on single image inputs. Experimental results show that we can generate high quality novel views from diverse in-the-wild images.

To summarize, our key contributions are:
• We formulate single-view reconstruction as a conditioned 3D generation problem and propose a single-image NeRF synthesis framework without 3D supervision, using 2D priors from diffusion models trained on large image datasets.
• We design a two-section semantic guidance to narrow down the general prior knowledge conditioned on the input image, enforcing synthesized novel views to be semantically and visually coherent.
• We introduce a geometric regularization term on estimated depth maps with 3D uncertainties.
• We validate our zero-shot novel view synthesis results on the DTU MVS [13] dataset, achieving higher quality than supervised baselines. We also demonstrate our capability of generating novel-view renderings with high visual quality on in-the-wild images.

2. Related Work

Novel view synthesis with NeRF. The recently proliferating NeRF representation [23] has shown great success in novel view synthesis, a long-standing task in computer graphics and vision. Combining differentiable rendering [16, 53, 54, 55] with neural network scene parametrizations, NeRF is able to recover the underlying 3D scene from a collection of posed images and render it at novel views realistically. A number of follow-up works have been focusing on relaxing NeRF inputs to less informative data such as unposed images [21, 48, 50] or sparse views [7, 12, 26, 34]. As less data gives rise to a more complex optimization landscape, a variety of regularization losses have been studied, for example: RegNeRF [26] regularizes the geometry and appearance of patches, DDP [34] and DS-NeRF [7] regularize the depth maps, DietNeRF [12] enforces semantic consistency between views by minimizing a CLIP [30] feature loss, and GNeRF [21] adopts a patch-based adversarial loss. Another line of work learns NeRF-based novel-view prediction for few- or single-image inputs by pre-training a scene prior on a large dataset of 3D scenes containing dense views [5, 6, 18, 44, 46, 52]. With additional self-supervision techniques such as equivariance [9] or cycle-consistency [22], the learning of scene priors can be done simply from sparse- or single-view data, or even purely from unposed image collections with an image adversarial loss [2, 3, 27, 39]. These two lines of work both have their specialties and constraints: the first is generalizable to any scene configuration, but is also less competitive in the more challenging scenarios such as single-image novel view synthesis with high quality requirements; the second, on the other hand, has a strong ability to infer unseen novel views from very limited inputs, but is also restricted to certain scene categories modeled by the scene priors learned from the training data. In our work, we leverage a diffusion-based image prior for NeRF synthesis that is general enough to model variations of in-the-wild images while having the adaptivity to each specific input image.
Diffusion-based generative models. Denoising diffusion probabilistic models [11, 41], or score-based generative models [42, 43], have recently caught a surge of interest due to their simple designs and excellent performance across a variety of computer vision tasks such as image generation [11, 41, 42, 43], completion [36, 43], and editing [14, 20]. In visual content creation, language-guided image diffusion models such as DALL-E 2 [32], Imagen [37], and Stable Diffusion [35] have shown great success in generating photorealistic images with strong semantic correlation to the given text-prompt inputs. In addition to the success of 2D image diffusion models, more recent works have also extended diffusion models to 3D content generation. [19, 57] generate 3D point clouds with point diffusions. 3DiM [49] shows uncertainty-aware novel view synthesis with image diffusions conditioned on input views and poses, but it does not have guaranteed multiview consistency, as no underlying 3D representation is adopted. More related to ours are DreamFusion [28] and GAUDI [1], which also generate NeRFs with diffusions: [28] generates NeRFs under language guidance by optimizing for their renderings at randomly sampled views with a 2D image diffusion model [37]; [1] trains a diffusion model on the latent space of NeRF scenes, but the learned scene distribution is limited to a set of indoor 3D scenes and does not generalize to in-the-wild images. Similar to [28], we also leverage 2D image diffusions to optimize for the NeRF renderings at novel views, but instead of unconstrained NeRF generation with user-specified language inputs, we study how to faithfully capture the features of single-view image inputs and use them to constrain the novel-view image distributions.

3. Method

An overview of our method is shown in Figure 2.

Figure 2. Method overview. We represent the underlying 3D scene as a NeRF and optimize for its parameters with three losses: a reconstruction loss at the fixed input view; a diffusion loss at arbitrarily sampled views, which also takes a conditioning text input generated from the input image with our two-section feature extraction; and finally, a depth correlation loss at the input view regularizing the 3D geometry.

Given an input image x₀, we would like to learn a NeRF representation F_ω : (x, y, z) → (c, σ) as its 3D reconstruction†. The NeRF holds the rendering equation that, for any camera view with pose P, one can sample camera rays r(t) = o + td and render the image x at this view with

Ĉ(r) = ∫_{t_n}^{t_f} T(t) σ(t) c(t) dt    (1)

where T(t) = exp(−∫_{t_n}^{t} σ(s) ds). For more details, please refer to Mildenhall et al. [23]. For simplicity, we denote this whole rendering equation by x = f(P, ω), which means NeRF f renders image x at camera pose P with parameters ω.

†Here we use a Lambertian NeRF without view direction inputs for enforcing stronger multiview consistency.

Instead of predicting the NeRF parameters ω from x₀ in a forward pass, we formulate this as a conditioned 3D generation problem

f(·, ω) ∼ 3D scene distribution | f(P₀, ω) = x₀    (2)

where we optimize the NeRF to follow a 3D scene distribution conditioned on that its rendering f(P₀, ω) at a given view P₀ should be the input image x₀.

Directly learning the 3D scene distribution prior requires large 3D datasets, which are less straightforward to acquire, and restricts its application to unseen scene categories. To enable better generalizability to in-the-wild scenarios, we instead leverage 2D image priors and reformulate the objective into

∀P, f(P, ω) ∼ 𝒫 | f(P₀, ω) = x₀    (3)

where the optimization is conducted on images f(P, ω) rendered at arbitrarily sampled views, pushing them to follow an image prior 𝒫 while satisfying the constraint x₀ = f(P₀, ω). The overall objective can be written as maximizing the conditional probability

max_ω E_P [ 𝒫(f(P, ω) | f(P₀, ω) = x₀, s) ].    (4)

Here, s is an additional semantic guidance term that we apply to further restrict the prior image distribution to fit the generation context. In contrast to DreamFusion [28], which also utilizes a language-guided image diffusion model as a 2D image prior for sampled views, our main contribution stands in our approach for further constraining the identity of the generated 3D volume to be consistent with the inputs.

We cover more details on this novel-view distribution loss in Sec. 3.1. We utilize natural language descriptions of the scene as the semantic guidance; more details on this will be discussed in Sec. 3.2. In addition, as the image diffusion model only operates on the rendered RGB colors, we further apply a geometric regularization with a depth map estimated at the input view to facilitate the NeRF optimization (Sec. 3.3).
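In practice, the rendering integral in Eq. (1) is approximated by numerical quadrature over discrete samples along each ray, as in Mildenhall et al. [23]. A minimal NumPy sketch of that quadrature, with toy densities and colors standing in for the outputs of a real NeRF network (all names here are illustrative):

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """Discretize Eq. (1): C_hat = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    deltas = np.diff(ts)                     # segment lengths between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i
    weights = trans * alphas                 # quadrature weights; they sum to <= 1
    color = (weights[:, None] * colors).sum(axis=0)
    # Expected depth along the ray (the quantity regularized in Sec. 3.3):
    depth = (weights * ts[:-1]).sum()
    return color, depth, weights

# Toy example: 64 samples on one ray through a uniform-density medium.
ts = np.linspace(2.0, 6.0, 65)               # t_n = 2, t_f = 6
sigmas = np.full(64, 0.5)
colors = np.tile([0.8, 0.2, 0.1], (64, 1))
color, depth, weights = render_ray(sigmas, colors, ts)
```

For a uniform medium this discretization telescopes exactly: the weights sum to 1 − exp(−∫σ), which matches the continuous transmittance in Eq. (1).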
3.1. Novel View Distribution Loss

Denoising diffusion probabilistic models (DDPMs) are generative models that learn a distribution over training data samples. Recently, there have been many advances in language-guided image synthesis with diffusion models. We build our method upon the recent Latent Diffusion Model (LDM) [35] for its high quality and efficiency in image generation. It adopts a pre-trained image auto-encoder, with an encoder E(x) = z mapping images x into latent codes z and a decoder D(E(x)) = x recovering the images. The diffusion process is then trained in the latent space by minimizing the objective

E_{z∼E(x), s, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, c_θ(s))‖²₂ ]    (5)

where t is a diffusion time scale, ε ∼ N(0,1) is a random noise sample, z_t is the latent code z noised to time t with ε, and ε_θ is the denoising network with parameters θ that regresses the noise ε. The diffusion model also takes a conditioning input, which is encoded as c_θ(s) and serves as guidance in the denoising process. For text-to-image generation models such as the LDM, c_θ is a pre-trained large language model that encodes the conditioning texts.

Textual inversion [10] optimizes for the text embedding of one or few images from a text-based image diffusion model. With the LDM objective in Equation 5, we can optimize for the text embedding s∗ for the input image x₀ by

s∗ = argmin_s E_{z∼E(x₀), ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, c_θ(s))‖²₂ ].    (6)

In Figure 3, middle row, images generated with textual inversion are shown. The colors and visual cues of the input image are well captured (orange-colored elements, food, and even the brand logos). However, the semantics at the macro level are sometimes wrong (in the second column, the image is a person playing sports). One reason is that, unlike the multi-image scenario where textual inversion can discover the common contents of the images, it is unclear for one single image what the key features are that the text embedding should focus on.
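Equation (6) optimizes only the conditioning embedding s while the denoiser parameters θ stay frozen. The sketch below mimics that structure with a toy linear "denoiser" in place of ε_θ and a single fixed noise sample; everything here is an illustrative stand-in, not the actual LDM:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_s = 8, 4

# Frozen toy "denoiser": predicts the noise from (z_t, s); stand-in for eps_theta.
W = rng.normal(size=(dim_z, dim_z + dim_s))

z = rng.normal(size=dim_z)     # latent code of the input image, E(x0)
eps = rng.normal(size=dim_z)   # one fixed noise sample for a deterministic demo
z_t = 0.7 * z + 0.7 * eps      # toy forward-noising of z to "time t"

def loss_and_grad(s):
    inp = np.concatenate([z_t, s])
    resid = W @ inp - eps                  # eps_hat - eps
    grad_s = 2.0 * (W.T @ resid)[dim_z:]   # gradient w.r.t. s only; W stays frozen
    return float(resid @ resid), grad_s

s = np.zeros(dim_s)            # the text embedding s* being inverted
losses = []
for _ in range(300):
    l, g = loss_and_grad(s)
    losses.append(l)
    s -= 0.005 * g
```

In the real method, the same frozen-denoiser objective is what back-propagates through the NeRF rendering to give the stochastic gradient on ω described next.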
In a pre-trained diffusion model, the network parameters θ are fixed, and we can instead optimize for the input image x with the same objective, which transforms x to follow the image distribution prior conditioned on s. Letting x = f(P, ω) be our NeRF rendering at an arbitrarily sampled view P, we can back-propagate gradients to the NeRF parameters ω and thus get a stochastic gradient descent on ω.

3.2. Semantics-Conditioned Image Priors

We argue that the prior distribution over all in-the-wild images is not specific enough to guide the novel view synthesis from an arbitrary image. We thus introduce a well-designed guidance s that narrows down the generic prior over natural images to a prior of images related to the input image x₀. Here we choose text as the guidance, which is flexible for describing arbitrary input images. Text-to-image diffusion models such as the LDM utilize a pre-trained large language model as the language encoder to learn a conditional distribution over images conditioned on language. This serves as a natural gateway for us to utilize language as a means to restrict the image prior space.

The most straightforward way of getting a text prompt from the input image is to use an image captioning or classification network S trained on (image, text) datasets and predict a text s₀ = S(x₀). However, while a text description can summarize the semantics of the image, it leaves a huge space of ambiguities, making it hard to include all the visual details of the image, especially with limited prompt length. In Figure 3, top row, we show the images generated with the caption "a collection of products" from the input image on the left. While their semantics are highly accurate with respect to the language description, the generated images have very high variance in their visual patterns and low correlation to the input image.

To reflect both semantic and visual characteristics of the input image in the novel view synthesis task, we combine the two methods by concatenating their text embeddings to form a joint feature s = [s₀, s∗] and use it as the guidance in the diffusion process in Equation 5. Figure 3, bottom row, shows the images generated with this joint feature, with balanced semantics and visual cues.

3.3. Geometric Regularization

While image diffusion shapes the appearance of the NeRF, multiview consistency is difficult to enforce, as the underlying 3D geometry can be different even with the same image rendering [15, 24], making the gradient back-propagation (from the image diffusion to the NeRF parameters ω) highly non-controllable. To this end, we further incorporate a geometric regularization term on the input-view depth to alleviate this issue. We adopt the Dense Prediction Transformer (DPT) model [33], trained on 1.4 million images for zero-shot monocular depth estimation, and apply it to the input image x₀ to estimate a depth map d₀,est. We use this estimated depth to regularize the depth

d̂₀ = ∫_{t_n}^{t_f} T(t) σ(t) t dt    (7)

rendered by the NeRF at the input view P₀. Due to the ambiguities of the estimated depth (including scales, shifts, and camera intrinsics) and estimation error (Figure 4), we cannot back-project pixels with depth to 3D and compute the regularization directly. Instead, we maximize the Pearson correlation between the estimated depth map and the NeRF-rendered depth

ρ(d̂₀, d₀,est) = Cov(d̂₀, d₀,est) / √(Var(d̂₀) Var(d₀,est))    (8)

which measures whether the rendered depth distribution and the noisy estimated depth distribution are linearly correlated.

Figure 4. Ambiguity in estimated depth map.
Bottomrow: Imagesgeneratedwith combinedimagecaptionandtextualinversion.Bothsemanticandvisualfeaturesoftheinputimageareaddressed. domly sampled novel views, we render 128×128 images and resize them to 512 × 512 before feeding them to the encoder of [35]. At the input view, we render at the same resolution as the inputimage to compute the image recon- structionanddepthcorrelationlosses. Figure4.Ambiguityinestimateddepthmap. Baselines. We compare with two state-of-the-art single- viewNeRFreconstructionalgorithms,PixelNeRF[52]and Table1.Single-imagenovelviewsynthesisresultsonDTU. itsfine-tunedmodelwithCLIP[30]featureconsistencyloss Method PSNR↑ SSIM↑ LPIPS↓ asproposedbyDietNeRF[12],bothofwhichtrainedonthe NeRF 8.000 0.286 0.703 trainingsetdatafromtheDTUMVSdataset. Togainbetter pixelNeRF 15.550 0.537 0.535 convergence, we use the predictions from [52] as an ini- pixelNeRF,L ft 16.048 0.564 0.515 tialization for our 3D scene optimization. But our method MSE DietPixelNeRF 14.242 0.481 0.487 isdirectlyappliedtothetestsceneswithoutanyadditional Ours 14.472 0.465 0.421 fine-tuningontheDTUtrainingset. quantitativecomparisonbetweenourmethodandthestate- Results. Table 1 shows the quantitative comparison be- of-the-art single-view NeRF reconstruction methods on a tweenourmethodandthebaselines. Followingtheconven- syntheticdataset. Section4.2showsaqualitativecompari- tion, we report the standard image quality metrics PSNR son as well as more synthesis results of our method on in- and SSIM [47]. Our PSNR and SSIM are slightly lower the-wildimages. than pixelNeRF [52] which directly learns the scene dis- tributions from the DTU training set and are on par with 4.1.SyntheticScenes DietPixelNeRF [12] which enforces semantic consistency Setup. Weevaluate ourmethodon theDTUMVS dataset between views. However, we emphasize that these two [13]with15testscenesasspecifiedin[52]. Foreachinput metrics are less indicative in our scenario as they are lo- image,weuseGPT-2[31]togenerateacaption. 
Weman- calpixel-alignedsimilaritymetricsbetweenthesynthesized ually correct the obvious mistakes made by GPT-2 while novel views and the ground truth images but uncertainties tryingourbesttoavoidintroducingadditionaldetails. The naturallyexistinsingle-view3Dinference.Themiddlecol- scenes and their captions are listed in the supplementary umnofthefirstsceneinFigure5showsanexampleofsuch material. uncertainty. The height of the tallest snack bag in the in- Implementation details. For the NeRF model, we imple- putimagecannotbeinferredasitstopextrudesbeyondthe mentthemulti-resolutiongridsamplerasdescribedin[25]. cameraview. Thewidthofthetoypigintheleftcolumnof For the diffusion model, we employ the text-guided diffu- thethirdsceneisanotherexamplewhichcannotbeinferred sionmodelfrom[35]whichwaspre-trainedontheLAION- fromtheinputsideview. Inbothcasesourmethodguesses 400Mdataset[38]. While[35]operateson512×512im- its novel view (bottom row) in a reasonable sense but dif- ages,NeRF’svolumetricrenderingatthisresolutionwould ferentfromthegroundtruth(toprow). Inaddition,wealso incuranextensivecomputationalburden. Thus,attheran- measurenovelviewswithLPIPS[56],whichisaperceptualFigure5. Single-imagenovelviewsynthesisresultsontheDTUtestscenes. VanillaNeRFcannotrecoverscenesfromsingleimage inputsduetotheill-posednatureofthisproblem.WhilepixelNeRFcaninferthenovelviewimageswiththepriorfromtheDTUtraining setofsimilarscenes,itssynthesizedrenderingsremainnoisyandblurry.WithapixelNeRFinitialization,ourmethodisabletosynthesize cleanernovelviewswithrealisticgeometriesandappearances,despitehavingneverbeentrainedonthisdataset. Uncertaintiesinnovel viewinference:(Thefirstscene,middlecolumn)theexactheightofthetallestsnackbagcannotbeinferredasthetopgoesoutsideofthe cameraview. (Thethirdscene,leftcolumn)thewidthofthetoypigfromthetopviewisundecidablefromtheinputview. Inbothcases, ourmethodguessesareasonableanswerinthesynthesizednovelviewthatisdifferentfromthegroundtruth. 
metriccomputingtheMeanSquaredError(MSE)between More results. Figure 7a shows our results on images of normalized features from all layers of a pre-trained VGG objects from the internet. The text prompts are words or encoder[40]. Ourmethodshowsasignificantimprovement phrasesusedtosearchfortheimages. Thebackgroundsare on this metric compared to the baselines as the diffusion maskedoutusinganoff-the-shelfdichotomousimageseg- modelhelpstoimproveimagequalitieswhilethelanguage mentation network from [29]. For each input, we show 3 guidancemaintainsthemulti-viewsemanticconsistency. different novel views that are distant from the input view. Figure 5 shows a qualitative comparison between our Figure 7b shows our results on images with more com- methodandthebaselines.Withthesceneinitializationfrom plexcontentsandbackgroundsfromtheCOCOdataset[17] [52],ourmethodremovesthenoisesandblurriness,synthe- whichcontains(image,caption)pairs.Withincameraviews sizinghighqualitynovelviews. closetotheinput,ourmodelisstillabletogeneraterealistic renderings.Butitcanhardlygeneralizetodistantviewsdue 4.2.ImagesintheWild tothelimitedcapacityoftheNeRFscenebox. Qualitative comparisons. Figure 6 shows a qualitative 4.3.AblationStudies comparison between our method and existing state-of-the- Weconductablationstudiestoshowtheefficacyofour art single-image to 3D synthesis methods for in-the-wild two-section semantic guidance and geometric regulariza- images [12, 45]. Input images are adopted from the tion. GoogleScannedObjectsdataset[8]withtheircategoryla- bels (‘bag’ and ‘hat’) as captions. Similar to ours, Diet- Semantic guidance. Figure 8a shows the ablation of the NeRF[12]usesaninput-viewconstrainedNeRFoptimiza- two text embeddings s 0 from image captions and s ∗ from tion technique where they minimize the CLIP [30] feature textual inversion. 
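The LPIPS-style comparison described above (MSE between channel-normalized deep features, averaged over layers) can be sketched as follows. The random convolution filters here are purely illustrative stand-ins for the pre-trained VGG layers and learned LPIPS weights; only the normalize-then-MSE structure is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random conv filters as a stand-in "feature extractor" (real LPIPS uses VGG).
filters = [rng.normal(size=(8, 3, 3, 3)), rng.normal(size=(16, 8, 3, 3))]

def features(img, filt):
    """Valid 2D cross-correlation + ReLU, channels-first: (C,H,W) -> (K,H-2,W-2)."""
    C, H, W = img.shape
    K = filt.shape[0]
    out = np.zeros((K, H - 2, W - 2))
    for k in range(K):
        for dy in range(3):
            for dx in range(3):
                w = filt[k, :, dy, dx][:, None, None]
                out[k] += (w * img[:, dy:H - 2 + dy, dx:W - 2 + dx]).sum(0)
    return np.maximum(out, 0.0)

def lpips_like(a, b):
    """Average over layers of the MSE between unit-normalized feature maps."""
    dist, fa, fb = 0.0, a, b
    for filt in filters:
        fa, fb = features(fa, filt), features(fb, filt)
        na = fa / (np.linalg.norm(fa, axis=0, keepdims=True) + 1e-8)  # channel-normalize
        nb = fb / (np.linalg.norm(fb, axis=0, keepdims=True) + 1e-8)
        dist += ((na - nb) ** 2).mean()
    return dist / len(filters)

x = rng.uniform(size=(3, 16, 16))
y = rng.uniform(size=(3, 16, 16))
```

As expected of a distance, identical inputs score zero and differing inputs score positive; unlike PSNR/SSIM, the comparison happens in feature space rather than per pixel.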
Without the captions s 0, the model fails betweenarbitraryviewrenderings.WhileCLIPfeaturesen- tolearntheoverallsemanticsandcannotgenerateamean- forceconsistentappearances,theyfailtocapturetheglobal ingful object. While both the full model and the caption- semanticsoftheobject. SS3D[45]isaforward-prediction only one (without textual inversion) successfully generate model for 3D geometries that transfers the priors learned backack novel views, the results without textual inversion onsyntheticdatasetstoin-the-wildimageswithknowledge s ∗havemoreblurrinessandnoises. Azoom-incomparison distillation. While it generates more structured global ge- isshowninFigure8b. ometries,itfailstocapturethefinegeometricdetailsofthe Figure8cshowsanothercomparisonofmodelswithand inputimage. Thegeometriesofthehatsinthebottomrows withouttextualinversions onthecanexamplefromFigure ∗ arealsoincorrect,withonlythesilhouetteshapepreserved 7b left. In the object regions visible to the input view, the butthestructureof‘hat’shapemissing. fullmodelbetterrecoversthefinedetails(thewhitelettersFigure6. NovelviewsynthesisresultsonobjectsfromtheGoogleScannedObjectsDataset. Left:Ourresultsgeneratedfromsingle- wordtextinputs‘backpack’(top3rows)and‘hat’(bottom3rows).Middle:DietNeRF[12]minimizestheCLIPfeaturedistancesbetween theinputviewandarbitrarilysampledviews. Thisresultsinnovelviewrenderswithconsistenttexturesandstyles,butfailstocapture theglobalsemanticmeaning. Forafaircomparison,DietNeRFisalsooptimizedwithdepthregularization. Right: SS3D[45]predicts coarsegeometriesinaconsistencymanner,butitfailstorecoverallthefinegeometricdetails. Additionally,thegeometriesofthehatsin thebottomrowsareincorrect,withonlythesilhouetteshapepreservedbutthestructureof‘hat’shapemissing. (a)Resultsonobject-centricimagesfromtheinternetwithsingle-word (b)ResultsonimagesfromtheCOCOdataset[17].Inputimageshavemore orshortphrasecaptions.Inputbackgroundsareremovedwith[29]. complexcontentswithbackgroundsandthecaptionsaresentences. 
Figure7.Resultsonimagesinthewild. on the lateral); and in the invisible regions, the full model try. Themodelwithouttheregularizationontheinputview completes the appearances with coherent styles of the in- depthcanstillgeneraterealisticappearancesatnovelviews put (red and white textures at the back of the can), while with the diffusion model, but the underlying 3D geometry themodelwithouttextualinversiondoesnothavesuchap- iserroneousandmulti-viewconsistencyisnotenforced.As pearance coherency. The model with textual inversion can a sanity check, we also visualize the results with only the even synthesize the pull tab at the top (second column of depth loss but without the diffusion model. The model is the zoom-in views) by inferring from the input side view unabletogeneratearealisticNeRFduetothe3Dambigui- thatthisisacancontainingdrinks. tiesofmonoculardepthasstatedinSection3.3. Geometricregularization. Figure9showsanablationon 5.Conclusions the geometric regularization term. Both image renderings and depth maps are visualized. The full model is able to In this paper, we propose a novel framework for zero- synthesize realistic novel views with coherent 3D geome- shot single-view NeRF synthesis for images in the wild(a)Toprow: Fullmodel. Middlerow: Caption-onlyguidancewithout textualinversion.Themodelisstillabletogenerateashapestrictlyfollow- ingthesemanticsandtheinputviewappearanceandgeometricconstraints, (a) A failure case due to the biases in the image diffusion model. butstrugglesmoreinsynthesizingthedetails. Azoom-incomparisonis Top: Novelviewsynthesisresultswithtextprompt‘a shoe in the shownin8bbelow.Bottomrow:Textual-inversion-onlywithoutcaption. style of ’. Bottom: Imagesgeneratedby[35]withtext Textualinversionfailstocapturetheglobalsemantics. prompt“asingleshoe”.Yethalfoftheimageshavetwoshoesinit. (b)Afailurecaseonahighlydeformableinstance. 
While the overall body shape of the cat is captured, the synthesized cat has two heads and two tails.

Figure 10. Failure cases.

Figure 8. Ablations on the two-section semantic guidance. (a) Top row: full model. Middle row: caption-only guidance without textual inversion; the model is still able to generate a shape strictly following the semantics and the input-view appearance and geometric constraints, but struggles more in synthesizing the details (a zoom-in comparison is shown in 8b). Bottom row: textual-inversion-only without the caption; textual inversion fails to capture the global semantics. (b) A zoom-in comparison between full-model results and results without textual inversion; the full model shows better capability of synthesizing less blurry details. (c) Another comparison between models with and without textual inversion. The input is from 7b, left, bottom row. The full model is able to synthesize better texture details at visible regions as well as completing the invisible regions with similar textures, while the caption-only model renderings are more blurry and cannot fill in the invisible regions.

Figure 9. Ablations on the geometric regularization. Visualization of input-view reconstruction and novel views on rendered images and depth maps. Top row: the full model is able to synthesize realistic novel views while preserving geometric coherency. Middle row: without the depth correlation loss, the diffusion model is still able to generate reasonable appearances, but the underlying 3D geometry is erroneous and the novel views are inconsistent with the input view. Bottom row: the input-view depth estimation cannot guide novel view synthesis by itself without the diffusion model, due to 3D ambiguities.

5. Conclusions

In this paper, we propose a novel framework for zero-shot single-view NeRF synthesis for images in the wild without 3D supervision. We leverage the general image priors in 2D diffusion models and apply them to 3D NeRF generation conditioned on the input image. To efficiently use these priors in synthesizing consistent views, we design a two-section language guidance as conditioning inputs to the diffusion model, which unifies the semantic and visual features of the input image. To our knowledge, we are the first to combine semantic and visual features in the text embedding space and apply this to novel view synthesis. In addition, we introduce a geometric regularization term while addressing the 3D ambiguity of monocular-estimated depth maps. Our experimental results show that, with well-designed guidance and constraints, one can leverage general image priors for specific image-to-3D tasks, enabling generalizable and adaptable reconstruction frameworks.

Limitations and future work. As our method relies on multiple large pre-trained image models [29, 31, 33, 35], any biases in these models will affect our synthesis results. Figure 10a shows an example where the image diffusion model [35] generates two shoes even though the text prompt is "a single shoe", resulting in our synthesized NeRF showing the features of multiple shoes. Our method is also less robust to highly deformable instances, as our language guidance focuses on semantics and styles but lacks a global description of physical states and dynamics. Figure 10b shows such a failure case: renderings from each independent view are visually plausible but represent different states of the same instance.

Besides, while formulation-wise the optimization is applicable to any scene, it is more suitable for object-centric images, as it takes the underlying assumption that the scene has exactly the same semantics from any view, which is not true for large scenes with complex configurations due to view changes and occlusions. The text embedding learned from textual inversion has the dimension of a single-word embedding, limiting its expressiveness in representing the subtleties of complex contents.

A. Additional Results

Figure 12 shows our additional results and comparisons for images in the wild. The results are presented in 4 groups, each group containing 3 objects from similar classes but with different content details and appearances. We use this to test the capability of each method in capturing the overall semantics and visual feature variations from input images.

Comparison to DietNeRF [12]. For a fair comparison, DietNeRF is also optimized with the estimated depth map from the input image. While DietNeRF is able to maintain appearance consistency between different views, it fails to capture the overall geometry of the objects, especially when the object has complex geometric structures (such as the chairs in the 1st group and the baskets in the 3rd group). In the 4th group (the skirts), our generated textures for the unseen back regions are also closer to the input image than DietNeRF's.

Our method also addresses the naturally existing ambiguity in novel-view inference, especially for the occluded regions in the input view. For example, in the 3rd group in Figure 12, the unseen spaces of the baskets are filled with different fruits/flowers/vegetables, instead of duplicating the input views as DietNeRF [12] does. As a feature or as an inductive bias, such synthesis results are also affected by the 2D distribution from the image diffusion model. For example, Figure 11 shows the image generation results by [35] with the text prompt 'a pumpkin'. Half of them are Jack-o'-lanterns. This makes our synthesized pumpkin also have a Jack-o'-lantern face at its back (the 3rd row of the 2nd group).

Figure 11. Images generated by [35] with 'a pumpkin'.

Comparison to SS3D [45]. As a geometry-based method, SS3D captures better global geometries than DietNeRF even without the depth regularization, especially on the object classes covered by ShapeNet [4] where the

Figure 12. Additional results for images in the wild.

References

[1] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for immersive 3d scene generation. arXiv preprint arXiv:2207.13751, 2022. 3
[2] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022. 2
[3] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021. 2
[4] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 9
[5] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021. 2
[6] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (srf): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7911–7920, 2021. 2
[7] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022. 2
[8] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. arXiv preprint arXiv:2204.11918, 2022. 6
[9] Emilien Dupont, Miguel Bautista Martin, Alex Colburn, Aditya Sankar, Josh Susskind, and Qi Shan. Equivariant neural rendering. In International Conference on Machine Learning, pages 2761–2770. PMLR, 2020. 2
[10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 2, 4
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3
[12] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021. 1, 2, 5, 6, 7, 9
[13] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014. 2, 5
[14] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022. 3
[15] Steven M Lehar. The world in your head: A gestalt view of the mechanism of conscious experience. Psychology Press, 2003. 4
[16] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable monte carlo ray tracing through edge sampling. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018. 2
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 6, 7
[18] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Com-
[28] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 3
[29] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, and Luc Van Gool. Highly accurate dichotomous image segmentation. In ECCV, 2022. 6, 7, 8
[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 5, 6
[31] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 5, 8
[32] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.
Hierarchical text-conditional image gen- puter Vision and Pattern Recognition, pages 7824–7833, erationwithcliplatents. arXivpreprintarXiv:2204.06125, 2022. 2 2022. 3 [19] ShitongLuoandWeiHu. Diffusionprobabilisticmodelsfor [33] Rene´ Ranftl,AlexeyBochkovskiy,andVladlenKoltun. Vi- 3dpointcloudgeneration. InProceedingsoftheIEEE/CVF sion transformers for dense prediction. In Proceedings of Conference on Computer Vision and Pattern Recognition, the IEEE/CVF International Conference on Computer Vi- pages2837–2845,2021. 3 sion,pages12179–12188,2021. 4,8 [20] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun- [34] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, YanZhu,andStefanoErmon. Sdedit: Imagesynthesisand PratulPSrinivasan,andMatthiasNießner. Densedepthpri- editingwithstochasticdifferentialequations. arXivpreprint ors for neural radiance fields from sparse input views. In arXiv:2108.01073,2021. 3 ProceedingsoftheIEEE/CVFConferenceonComputerVi- [21] QuanMeng,AnpeiChen,HaiminLuo,MinyeWu,HaoSu, sionandPatternRecognition,pages12892–12901,2022. 2 Lan Xu, Xuming He, and Jingyi Yu. Gnerf: Gan-based [35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, neuralradiancefieldwithoutposedcamera. InProceedings Patrick Esser, and Bjo¨rn Ommer. High-resolution image oftheIEEE/CVFInternationalConferenceonComputerVi- synthesis with latent diffusion models. In Proceedings of sion,pages6351–6361,2021. 2 theIEEE/CVFConferenceonComputerVisionandPattern [22] Lu Mi, Abhijit Kundu, David Ross, Frank Dellaert, Noah Recognition(CVPR),pages10684–10695,June2022. 3,4, Snavely, and Alireza Fathi. im2nerf: Image to neural ra- 5,8,9 diance field in the wild. arXiv preprint arXiv:2209.04061, [36] ChitwanSaharia,WilliamChan,HuiwenChang,ChrisLee, 2022. 2 Jonathan Ho, Tim Salimans, David Fleet, and Mohammad [23] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Norouzi. Palette: Image-to-image diffusion models. In JonathanTBarron,RaviRamamoorthi,andRenNg. 
Nerf: ACM SIGGRAPH 2022 Conference Proceedings, pages 1– Representingscenesasneuralradiancefieldsforviewsyn- 10,2022. 3 thesis.CommunicationsoftheACM,65(1):99–106,2021.1, [37] Chitwan Saharia, William Chan, Saurabh Saxena, Lala 2,3 Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed [24] NiloyJMitraandMarkPauly. Shadowart. ACMTransac- Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, tionsonGraphics,28(CONF):156–1,2009. 4 Rapha Gontijo Lopes, et al. Photorealistic text-to-image [25] ThomasMu¨ller,AlexEvans,ChristophSchied,andAlexan- diffusionmodelswithdeeplanguageunderstanding. arXiv derKeller.Instantneuralgraphicsprimitiveswithamultires- preprintarXiv:2205.11487,2022. 3 olution hash encoding. arXiv preprint arXiv:2201.05989, [38] Christoph Schuhmann, Richard Vencu, Romain Beaumont, 2022. 5 Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo [26] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Coombes,JeniaJitsev,andAranKomatsuzaki.Laion-400m: MehdiSMSajjadi,AndreasGeiger,andNohaRadwan.Reg- Open dataset of clip-filtered 400 million image-text pairs. nerf: Regularizingneuralradiancefieldsforviewsynthesis arXivpreprintarXiv:2111.02114,2021. 5 fromsparseinputs. InProceedingsoftheIEEE/CVFCon- [39] KatjaSchwarz, YiyiLiao, MichaelNiemeyer, andAndreas ferenceonComputerVisionandPatternRecognition,pages Geiger. Graf: Generative radiance fields for 3d-aware im- 5480–5490,2022. 1,2 agesynthesis. AdvancesinNeuralInformationProcessing [27] MichaelNiemeyerandAndreasGeiger. Giraffe:Represent- Systems,33:20154–20166,2020. 2 ingscenesascompositionalgenerativeneuralfeaturefields. [40] KarenSimonyanandAndrewZisserman. Verydeepconvo- In Proceedings of the IEEE/CVF Conference on Computer lutional networks for large-scale image recognition. arXiv VisionandPatternRecognition,pages11453–11464,2021. preprintarXiv:1409.1556,2014. 6 2 [41] Jiaming Song, Chenlin Meng, and Stefano Ermon. [28] BenPoole,AjayJain,JonathanTBarron,andBenMilden- Denoising diffusion implicit models. 
arXiv preprintarXiv:2010.02502,2020. 3 man, and Oliver Wang. The unreasonable effectiveness of [42] YangSongandStefanoErmon.Generativemodelingbyesti- deepfeaturesasaperceptualmetric. InProceedingsofthe matinggradientsofthedatadistribution.AdvancesinNeural IEEE conference on computer vision and pattern recogni- InformationProcessingSystems,32,2019. 3 tion,pages586–595,2018. 5 [43] YangSong,JaschaSohl-Dickstein,DiederikPKingma,Ab- [57] LinqiZhou,YilunDu,andJiajunWu. 3dshapegeneration hishekKumar,StefanoErmon,andBenPoole. Score-based andcompletionthroughpoint-voxeldiffusion. InProceed- generative modeling through stochastic differential equa- ings of the IEEE/CVF International Conference on Com- tions. arXivpreprintarXiv:2011.13456,2020. 3 puterVision,pages5826–5835,2021. 3 [44] AlexTrevithickandBoYang. Grf: Learningageneralradi- ancefieldfor3drepresentationandrendering. InProceed- ings of the IEEE/CVF International Conference on Com- puterVision,pages15182–15192,2021. 2 [45] KalyanAlwalaVasudev,AbhinavGupta,andShubhamTul- siani. Pre-train, self-train, distill: A simple recipe for su- persizing3dreconstruction.InComputerVisionandPattern Recognition(CVPR),2022. 2,6,7,9 [46] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla,NoahSnavely,andThomasFunkhouser.Ibr- net: Learning multi-view image-based rendering. In Pro- ceedingsoftheIEEE/CVFConferenceonComputerVision andPatternRecognition,pages4690–4699,2021. 2 [47] ZhouWang,AlanCBovik,HamidRSheikh,andEeroPSi- moncelli. Imagequalityassessment: fromerrorvisibilityto structuralsimilarity.IEEEtransactionsonimageprocessing, 13(4):600–612,2004. 5 [48] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064,2021. 2 [49] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. 
Novelviewsynthesiswithdiffusionmodels. arXiv preprintarXiv:2210.04628,2022. 3 [50] LiorYariv,YoniKasten,DrorMoran,MeiravGalun,Matan Atzmon, BasriRonen, andYaronLipman. Multiviewneu- ralsurfacereconstructionbydisentanglinggeometryandap- pearance. AdvancesinNeuralInformationProcessingSys- tems,33:2492–2502,2020. 2 [51] Yufei Ye, Shubham Tulsiani, and Abhinav Gupta. Shelf- supervised mesh prediction in the wild. In Proceedings of theIEEE/CVFConferenceonComputerVisionandPattern Recognition,pages8843–8852,2021. 2 [52] AlexYu,VickieYe,MatthewTancik,andAngjooKanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer VisionandPatternRecognition,pages4578–4587,2021. 1, 2,5,6 [53] ChengZhang,BaileyMiller,KanYan,IoannisGkioulekas, andShuangZhao.Path-spacedifferentiablerendering.ACM transactionsongraphics,39(4),2020. 2 [54] Cheng Zhang, Lifan Wu, Changxi Zheng, Ioannis Gkioulekas, Ravi Ramamoorthi, and Shuang Zhao. A dif- ferentialtheoryofradiativetransfer. ACMTransactionson Graphics(TOG),38(6):1–16,2019. 2 [55] ChengZhang,ZihanYu,andShuangZhao. Path-spacedif- ferentiablerenderingofparticipatingmedia. ACMTransac- tionsonGraphics(TOG),40(4):1–15,2021. 2 [56] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht-
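The comparisons above rely on a monocular depth estimate of the input view as a geometric regularizer (DietNeRF is "also optimized with the estimated depth map" for fairness), and the ablation figure contrasts results with and without the depth correlation loss. As an illustrative sketch only, not the authors' implementation, a depth correlation loss of this kind can be written as the negative Pearson correlation between the NeRF-rendered depth and the estimated depth, which tolerates the unknown scale and shift of monocular depth estimators:

```python
import numpy as np

def depth_correlation_loss(rendered_depth, estimated_depth, eps=1e-8):
    """Negative Pearson correlation between two depth maps.

    Illustrative sketch (not the paper's released code): correlation,
    unlike an L2 loss, is invariant to the positive scale and shift
    ambiguity inherent in monocular depth estimates.
    """
    r = np.asarray(rendered_depth, dtype=np.float64).ravel()
    e = np.asarray(estimated_depth, dtype=np.float64).ravel()
    r = r - r.mean()  # remove shift
    e = e - e.mean()
    denom = np.sqrt((r * r).sum() * (e * e).sum()) + eps  # normalize scale
    return -(r * e).sum() / denom  # in [-1, 1]; -1 at perfect correlation
```

Because any positively scaled and shifted version of the true depth attains the minimum value of -1, a regularizer of this form constrains the relative geometry of the scene without forcing the NeRF to match the estimator's absolute, and generally unreliable, depth scale.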