Unsupervised Monocular Depth Learning in Dynamic Scenes

Hanhan Li¹, Ariel Gordon¹,², Hang Zhao³, Vincent Casser³, Anelia Angelova¹,²
{uniqueness@google, gariel@google, hangz@waymo, casser@waymo, anelia@google}.com
¹Google Research  ²Robotics at Google  ³Waymo LLC

Abstract: We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision. We show that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be piecewise constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work for dynamic scenes, including methods that require semantic input.¹

Keywords: Unsupervised, Monocular Depth, Object Motion

1 Introduction

Understanding 3D geometry and object motion from camera images is an important problem for robotics applications, including autonomous vehicles [1] and drones [2]. While robotic systems are often equipped with various sensors, depth prediction from images remains appealing in that it solely requires an optical camera, which is a very cheap and robust sensor. Object motion estimation is a nontrivial problem across various sensors.

Estimating depth and object motion in 3D given a monocular video stream is an ill-posed problem, and generally relies heavily on prior knowledge. The latter is readily provided by deep networks, which can learn the priors through training on large collections of data.

Multiple methods have been devised for providing supervision to these networks. While depth prediction networks can be supervised by sensors [3, 4, 5], datasets providing object 3D motion supervision for image sequences are scarce. Self-supervised methods, which rely mainly on the monocular video itself for supervision, have been attracting increasing attention recently [6, 7, 8, 9, 10, 11, 12, 13], due to the virtually unlimited abundance of unlabeled video data.

[Figure 1 panels: Depth | Object motion map | Motion map]
Figure 1: Depth prediction (for each frame separately) and motion map prediction (for a pair of frames), shown on a training video from YouTube. The total 3D motion map is obtained by adding the learned camera motion vector to the object motion map. Note that the motion map is mostly zero, and nearly constant throughout a moving object. This is a result of the motion regularizers used.

¹ Code is available at github.com/google-research/google-research/tree/master/depth_and_motion_learning

Self-supervised learning of depth estimation is based on principles of structure from motion (SfM): when the same scene is observed from two different positions, the views will be consistent if a correct depth is assigned to each pixel, and the camera movement is correctly estimated. As such, these methods tend to suffer from many of the challenges of SfM: textureless areas, occlusions, reflections, and, perhaps above all, moving objects. Complete view consistency can only be achieved if the motion of every object between the capture times of the two frames is correctly accounted for. If any point in the scene is presumed to be in motion, it carries four unknowns (depth and three motion components), which is far too many for epipolar geometry constraints to disambiguate. Self-supervised methods thus often rely on additional cues.

One source of additional information is semantics [12].
Movable objects, such as vehicles, can be identified using an auxiliary segmentation model, and a network can be trained to estimate the motion of each object separately. However, such techniques depend on access to an auxiliary segmentation model capable of segmenting out all classes of movable objects that may appear in the video.

Other approaches utilize different types of prior knowledge. For example, a common case of object motion in the self-driving setting is where the observing car follows another car at approximately the same velocity. The observed car thus appears static. Godard et al. [13] propose a method that identifies this case by detecting regions that do not change between frames and excludes these regions from the photometric consistency loss. It is a very common case in the KITTI dataset, and addressing it results in significant improvements in depth estimation metrics. Yet, the method remains limited to only one specific type of object motion.

Lastly, there are approaches where optical flow is learned jointly with depth, unsupervised [14]. However, stereo input is used to disambiguate the depth prediction problem.

The main contribution of this paper is a method for jointly learning depth, ego-motion, and a dense 3D object motion map from monocular video only, where unlike prior work, our method:

• Does not utilize any auxiliary signals apart from the monocular video itself: neither semantic signals, nor stereo, nor any kind of ground truth.
• Accounts for any object motion pattern that can be approximated by a rigid object translation in an arbitrary direction.

A key contribution of our paper is a novel regularization method for the residual translation fields, based on the L_{1/2} norm, which casts the residual motion field into the desired pattern described above.

In our method, a deep network predicts a dense 3D translation field (from a pair of frames). The translation field can be decomposed into the sum of the background translation relative to the camera (due to ego-motion), which is constant, and an object translation field, which accounts for the motion of every point in the field of view relative to the scene. Another network predicts depth for each pixel (from each frame separately), totaling four predicted quantities per pixel. Figure 1 illustrates the depth and translation fields. At inference time, depth is obtained from a single frame, whereas camera motion and the object translation field are obtained from a pair of frames.

Since we aim to use only a monocular video for supervision, this problem requires significant amounts of regularization. We aim to strike the right balance between regularizing sufficiently, while at the same time preserving the ability to model diverse and complex motion patterns in highly dynamic scenes. We achieve this by utilizing two observations about the nature of residual translation fields: (1) they are sparse, since typically most of the pixels in a frame belong to the background or static objects, and (2) they tend to be constant throughout a rigid moving object in 3D space.

We evaluate the performance of the method on four challenging datasets with dynamic scenes. We establish new state-of-the-art results for unsupervised depth prediction on Cityscapes [15] and the Waymo Open Dataset [16], and match the state of the art on KITTI² [17]. To further demonstrate the generality of our approach, we also train and qualitatively evaluate our model on a collection of public YouTube videos taken with hand-held cameras while walking in a variety of environments. Qualitative results from all the above datasets are shown in Figure 2.

² Since dynamic scenes are rare in the KITTI dataset, the improvements of our method on this benchmark are less pronounced.
[Figure 2 panels: rows Cityscapes, KITTI, Waymo Open Dataset, YouTube; columns RGB, Depth, 3D Object Motion Map.]
Figure 2: Qualitative results of our unsupervised monocular depth and 3D object motion map learning in dynamic scenes across all datasets: Cityscapes, KITTI, Waymo Open Dataset and YouTube.

2 Related Work

Structure from Motion and Multiview Stereo. Depth estimation is an important task for 3D scene understanding and robotics. Traditional computer vision approaches rely on identifying correspondences between keypoints in two or more images of the same scene and using epipolar geometry to solve for their depths [18, 19, 20]. These methods yield sparse depth maps. They can be applied to dynamic scenes in a multi-camera setting (“multiview stereo”), or to static scenes in a single moving camera setting (“structure from motion”). Please see [21] for a detailed survey.

Depth Estimation. In deep-learning based approaches [3, 4, 22, 5, 23], a deep network predicts a dense depth map. These networks can be trained by direct supervision, such as through LiDAR sensors. Similar approaches are used for other dense predictions such as surface normals [24, 25]. More recently, deep-learning approaches have been used in conjunction with classical computer vision techniques to learn depth and ego-motion prediction [6, 26, 27, 28, 29, 7]. Instead of identifying keypoints in scenes and finding correspondences, deep networks predict dense depth maps and camera motion, and these are used for differentiably warping pixels from one view to another. By applying photometric consistency losses on a transformed view and the corresponding reference view, a supervision signal for the model is derived. A lot of progress in monocular depth learning has followed [28, 30, 29, 7, 31, 12, 32], and some work proposed to use stereo inputs for training, with the purpose of producing more accurate monocular depth at inference [27, 28, 14, 31, 33]. Alternative approaches learn to produce the stereo disparity [34, 35, 36], or apply depth completion from an online depth sensor [37, 38].

Depth and Motion. When the camera and objects move with respect to the scene, enforcing consistency across views requires estimating the motion of both the camera and individual objects. As estimating depth in dynamic scenes is very challenging, many approaches have used multiple views to obtain depth [39, 40]. Several approaches have recently been proposed for simultaneously learning depth, camera motion, and object motion from monocular videos. Yin et al. [7] and Zhou et al. [9] learn to jointly predict depth, ego-motion and optical flow. The first stage of their method estimates depth and camera motion, and thus the optical flow induced by camera motion. A second stage estimates the residual optical flow due to object motion relative to the scene. The residual flow is used to mask out moving objects and to reason about occlusions. Luo et al. [10] also jointly optimize depth, camera motion, and optical flow estimation, using a holistic motion parser. Stereo input is used to disambiguate the depth prediction training. Casser et al. [12, 41] estimate the motion of objects in the scenes with the assistance of pre-trained segmentation models, leading to significant improvement in depth estimation for moving objects. Gordon et al. [11] introduce a differentiable way of handling occlusions but still require an auxiliary segmentation model for moving objects. Li et al. [32] are able to handle scenes with moving people, but require human-segmentation masks and rely on a dataset with “static” people observed from multiple views. Godard et al. [13] address object motion by detecting cases where an object is static with respect to the camera (i.e. moving at the same speed), which addresses a very common case of object motion observed from vehicles in traffic.
Our work provides a more general way of performing unsupervised depth learning in dynamic scenes; unlike previous work, we do not need to segment out objects in order to estimate their motion, nor do we assume stereo data. Unlike prior work that uses residual optical flow to reason about moving objects and occlusions, our method directly regularizes motion in 3D, which turns out to lead to better depth prediction accuracy.

3 Method

In our approach, depth, ego-motion, and object motion are learned simultaneously from monocular videos using self-supervision. We use pairs of adjacent video frames (I_a and I_b) as training data. Our depth network predicts a depth map D(u,v) (where u and v are the image coordinates) at the original resolution from a single image, and we apply it independently on each of the two frames. The two depth maps are concatenated with I_a and I_b in the channel dimension and are fed into a motion prediction network (Figure 1). The latter predicts a 3D translation map T_obj(u,v) at the original resolution for the moving objects and a 6D ego-motion vector M_ego. M_ego consists of the 3 Euler angles that parameterize the 3D rotation matrix R and the elements of the 3D translation vector T_ego. The object motion relative to the camera is defined as the rotation R followed by a translation

T(u,v) = T_{obj}(u,v) + T_{ego}.    (1)

We propose new motion regularization losses on T(u,v) (Sec. 3.2.1) which facilitate training in highly dynamic scenes. The overall training setup is shown in Figure 3. Figure 2 visualizes examples of the learned motion T_obj(u,v) and disparity d(u,v) = 1/D(u,v) per frame.

[Figure 3 diagram: frames I_a, I_b (H×W×3) and depths D_a, D_b feed the motion network, which outputs the ego-motion M_ego (1×1×6: t_x, t_y, t_z, r_x, r_y, r_z) and the object motion map T_obj; the losses L_reg,mot, L_reg,dep, and L_cyc + L_rgb are applied via transformed views (unsupervised motion decomposition).]
Figure 3: Overall training setup. A depth network is independently applied on two adjacent RGB frames, I_a and I_b, to produce the depth maps D_a and D_b. The depth maps together with the two original images are fed into the motion network, which decomposes the motion into a global ego-motion estimate M_ego and a spatial object motion map T_obj. Given motion and depth estimates, a differentiable view transformer allows transitioning between them. Losses are highlighted in red. For example, we use a motion regularization loss (Section 3.2.1) on the motion map, and a motion cycle consistency loss and a photometric consistency loss (Section 3.2.3). At inference time, a depth map is obtained from a single frame, whereas a 3D motion map and ego-motion are obtained from two consecutive frames.

3.1 Depth and Motion Networks

Our depth network is an encoder-decoder architecture, identical to the one in Ref. [6], with the only difference that we use a softplus activation function for the depth, z(ℓ) = log(1 + e^ℓ). In addition, randomized layer normalization [11] is applied before each ReLU activation in the network.

The input to the motion network is a pair of consecutive frames, concatenated along the channel dimension. The motion prediction is similar to the one in Ref. [11], with the difference that each input image has four channels: the three RGB channels, and the predicted depth as the fourth channel. The rationale is that having depth as a fourth channel, even if predicted rather than exactly measured, provides helpful signals for the task of estimating motion in 3D.
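To make the decomposition in Eq. (1) concrete, the sketch below shows one way the two motion-network outputs could be combined into the per-pixel total translation field: the 6D ego-motion vector is split into Euler angles (converted to the rotation matrix R) and the ego translation T_ego, which is broadcast over the image and added to the dense object translation map T_obj. This is a minimal NumPy illustration, not the released implementation; the component ordering of the 6D vector, the Euler-angle convention, and the helper names are our own assumptions.

```python
import numpy as np

def euler_to_rotation_matrix(rx, ry, rz):
    """Rotation matrix from Euler angles (Z*Y*X composition assumed for illustration)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rot_z @ rot_y @ rot_x

def total_translation_field(m_ego, t_obj):
    """Eq. (1): T(u,v) = T_obj(u,v) + T_ego.

    m_ego: 6-vector (rx, ry, rz, tx, ty, tz); this ordering is an assumption.
    t_obj: (H, W, 3) dense object translation map, expected to be zero on the
           static background and roughly constant inside each moving object.
    Returns the rotation matrix R and the (H, W, 3) total translation field.
    """
    rx, ry, rz, tx, ty, tz = m_ego
    rotation = euler_to_rotation_matrix(rx, ry, rz)
    t_ego = np.array([tx, ty, tz])
    return rotation, t_obj + t_ego  # T_ego is broadcast over all pixels

# Toy example: a static scene except for one patch translating along x.
h, w = 32, 48
t_obj = np.zeros((h, w, 3))
t_obj[10:18, 20:28, 0] = 0.5
rotation, t_total = total_translation_field(
    np.array([0.0, 0.01, 0.0, 0.0, 0.0, 0.3]), t_obj)
print(rotation.shape, t_total.shape)  # (3, 3) (32, 48, 3)
```

In the actual training pipeline these quantities are of course batched network outputs with gradients; the sketch only illustrates the shapes and the additive decomposition.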
3.2 Losses

Training is driven by a number of self-supervised losses, as is standard in monocular or stereo-based unsupervised depth and ego-motion learning. The total loss is the sum of three components: the motion regularization, the depth regularization, and the consistency regularization. The motion regularization and consistency regularization are applied twice on the frame pair, with the frame order reversed in the second application. The depth regularization is applied independently on each one of the pair. We describe the losses below, starting with the motion regularization losses, which are a key contribution of this work.

3.2.1 Motion Regularization

The regularization L_reg,mot on the motion map T_obj(u,v) consists of the group smoothness loss L_g1 and the L_{1/2} sparsity loss. The group smoothness loss L_g1 on T_obj(u,v) minimizes changes within the moving areas, encouraging the motion map to be nearly constant throughout a moving object. This is done in anticipation that moving objects are mostly rigid. It is defined as:

L_{g1}[T(u,v)] = \iint \sqrt{\sum_{i \in \{x,y,z\}} (\partial_u T_i(u,v))^2 + (\partial_v T_i(u,v))^2} \, du \, dv    (2)

The L_{1/2} sparsity loss on T_obj(u,v) is defined as:

L_{1/2}[T(u,v)] = 2 \iint \sum_{i \in \{x,y,z\}} \langle|T_i|\rangle \sqrt{1 + |T_i(u,v)| / \langle|T_i|\rangle} \, du \, dv    (3)

where ⟨|T_i|⟩ is the spatial average of |T_i(u,v)|. The coefficients are designed in this way so that the regularization is self-normalizing. In addition, it approaches L_1 for small T(u,v), and its strength becomes weaker for larger T(u,v). We visualize its behavior in the Supplemental Material. Overall, the L_{1/2} loss encourages more sparsity than the L_1 loss.

The final motion regularization loss is a combination of the above losses:

L_{reg,mot} = \alpha_{mot} L_{g1}[T_{obj}(u,v)] + \beta_{mot} L_{1/2}[T_{obj}(u,v)]    (4)

where α_mot and β_mot are hyperparameters.

Strictly speaking, a piecewise-constant T_obj(u,v) can describe any scene where objects are moving in pure translation relative to the background. However, when objects are rotating, the residual translation field is generally not constant throughout them. Since fast rotation of objects relative to the background is uncommon, especially in road traffic, we expect the piecewise-constant approximation to be appropriate.

3.2.2 Depth Regularization

We apply a standard edge-aware smoothness regularization on the disparity maps d(u,v), as described in Godard et al. [27]. In other words, the regularization is weaker around pixels where the color variation is higher:

L_{reg,dep} = \alpha_{dep} \iint \left( |\partial_u d(u,v)| e^{-|\partial_u I(u,v)|} + |\partial_v d(u,v)| e^{-|\partial_v I(u,v)|} \right) du \, dv    (5)

where α_dep is a hyperparameter.

3.2.3 Consistency Regularization

The consistency regularization is the sum of the motion cycle consistency loss L_cyc and the occlusion-aware photometric consistency loss L_rgb. L_cyc encourages the forward and backward motion between any pair of frames to be the opposite of each other:

L_{cyc} = \alpha_{cyc} \frac{\|R R_{inv} - \mathbb{1}\|^2}{\|R - \mathbb{1}\|^2 + \|R_{inv} - \mathbb{1}\|^2} + \beta_{cyc} \iint \frac{\|R_{inv} T(u,v) + T_{inv}(u_{warp}, v_{warp})\|^2}{\|T(u,v)\|^2 + \|T_{inv}(u_{warp}, v_{warp})\|^2} \, du \, dv    (6)

where the ‘inv’ subscript indicates that the same quantity was obtained with the input frames reversed in order. α_cyc and β_cyc are hyperparameters.

L_rgb encourages photometric consistency of corresponding areas in the two input frames. Similar to prior work, it is a sum of an L1 loss and an SSIM structural similarity loss in the RGB space:

L_{rgb} = \alpha_{rgb} \iint |I(u,v) - I_{warp}(u,v)| \, \mathbb{1}_{D(u,v) > D_{warp}(u,v)} \, du \, dv + \beta_{rgb} \frac{1 - SSIM(I, I_{warp})}{2}    (7)

where I and I_warp are the original image and the warped image, and 1_{D(u,v) > D_warp(u,v)} is a mask introduced in Gordon et al. [11] to address occlusions. α_rgb and β_rgb are hyperparameters.

The frame warping is calculated using the camera intrinsic matrix K, the rotation matrix R, and the camera translation vector T as z′p′ = K R K^{−1} z p + K T. Here p and z are the homogeneous pixel coordinates and the depth, and their primed counterparts are the warped ones in the new frame.
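As a concrete reference for Eqs. (2)-(4), the sketch below evaluates discrete counterparts of the group smoothness and L_{1/2} sparsity regularizers on an object motion map, with the integrals replaced by sums over pixels and the derivatives by finite differences. This is a minimal NumPy illustration under those discretization assumptions, not the authors' training code; the default values of alpha_mot and beta_mot are placeholders standing in for α_mot and β_mot (the actual hyperparameters are given in the supplementary material).

```python
import numpy as np

def group_smoothness_loss(t_obj):
    """Discrete analogue of Eq. (2): sum over pixels of the square root of the
    summed squared spatial gradients of the three motion components (the 'group')."""
    du = np.diff(t_obj, axis=1)[:-1, :, :]   # horizontal finite differences, (H-1, W-1, 3)
    dv = np.diff(t_obj, axis=0)[:, :-1, :]   # vertical finite differences, (H-1, W-1, 3)
    return np.sum(np.sqrt(np.sum(du ** 2 + dv ** 2, axis=-1)))

def sparsity_loss_l_half(t_obj, eps=1e-12):
    """Discrete analogue of Eq. (3): 2 * <|T_i|> * sqrt(1 + |T_i| / <|T_i|>),
    summed over pixels and components. Normalizing by the spatial mean <|T_i|>
    makes the loss self-normalizing: it behaves like L1 for small motions and
    pulls more weakly on large ones, which favors sparse motion maps."""
    abs_t = np.abs(t_obj)
    mean_abs = abs_t.mean(axis=(0, 1), keepdims=True) + eps  # <|T_i|> per component
    return np.sum(2.0 * mean_abs * np.sqrt(1.0 + abs_t / mean_abs))

def motion_regularization(t_obj, alpha_mot=1.0, beta_mot=1.0):
    """Eq. (4): weighted sum of the two regularizers (weights are placeholders)."""
    return (alpha_mot * group_smoothness_loss(t_obj)
            + beta_mot * sparsity_loss_l_half(t_obj))

# A mostly-zero map with one constant moving patch is penalized far less than a
# noisy map of comparable magnitude, which is exactly the prior we want to impose.
h, w = 64, 96
patch = np.zeros((h, w, 3))
patch[20:30, 40:60, 0] = 0.5
noisy = np.random.uniform(0.0, 0.5, size=(h, w, 3))
print(motion_regularization(patch), motion_regularization(noisy))
```

Comparing the two printed values for the piecewise-constant and the noisy map shows how these terms steer the predicted translation field toward being zero on the background and constant within each moving object.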
4 Experiments

In this section, we present results on a variety of datasets, including Cityscapes, KITTI, the Waymo Open Dataset, and a collection of videos taken with moving cameras from YouTube. For all our experiments, the encoder part of the depth network is initialized from a network pretrained on ImageNet [42]. The camera intrinsic matrices are provided in all datasets except for the YouTube videos, where they are learned as part of our models. The hyperparameter values we used are the same for all experiments and are given in the supplementary material.

The inference time of our depth prediction model is about 5.3 ms per frame at a resolution of 480×192, at a batch size of 1 on an NVIDIA V100 GPU (unoptimized), which is equivalent to roughly 190 frames per second. As far as we are aware, this is among the fastest methods.

4.1 Cityscapes

The Cityscapes [15] dataset is an urban driving dataset, which is quite challenging for unsupervised monocular depth estimation because of the prevalence of dynamic scenes. As a result, not many works have published results on this dataset, with few exceptions [12, 11, 43]. We use standard evaluation protocols as in prior work [12, 43]. For training, we combine the densely and coarsely annotated splits to obtain 22,973 image pairs. For evaluation, we use the 1,525 test images. The evaluation uses the code and methodology from Struct2Depth [12].

Method             Uses semantics?   AbsRel   SqRel   RMSE   RMSE log   δ<1.25   δ<1.25²   δ<1.25³
Struct2Depth [12]  Yes               0.145    1.737   7.28   0.205      0.813    0.942     0.978
Gordon [11]        Yes               0.127    1.33    6.96   0.195      0.830    0.947     0.981
Pilzer [43]        No                0.440    6.04    5.44   0.398      0.730    0.887     0.944
Ours               No                0.119    1.29    6.98   0.190      0.846    0.952     0.982

Table 1: Performance comparison of unsupervised single-view depth learning approaches, for models trained and evaluated on Cityscapes using the standard split. The depth cutoff is 80 m. Our model uses a resolution of 416×128 for input/output. The ‘uses semantics’ column indicates whether the corresponding method requires a pretrained mask network to help identify moving objects. Our approach does not use semantics information. ‘AbsRel’, ‘SqRel’, ‘RMSE’, and ‘RMSE log’ denote the mean absolute relative error, mean squared relative error, root mean squared error, and root mean squared logarithmic error, respectively. δ < t denotes the fraction of predicted depths d whose ratio to the ground-truth depth d* satisfies max(d/d*, d*/d) < t.
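For completeness, the sketch below computes the metrics reported in Table 1 (AbsRel, SqRel, RMSE, RMSE log, and the δ accuracies) from a predicted and a ground-truth depth map, following the definitions commonly used in the unsupervised depth literature. It is a generic illustration rather than the exact Struct2Depth evaluation code; in particular, the median scaling, the zero-marked invalid pixels, and the 80 m cap are conventions we assume here, and protocol details may differ.

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=80.0):
    """Standard single-view depth metrics: AbsRel, SqRel, RMSE, RMSE log, δ thresholds.

    pred, gt: depth maps in meters with matching shapes. Pixels without ground
    truth are assumed to be marked with gt == 0. Predictions are rescaled by the
    ratio of medians, a common convention for monocular methods whose depth is
    defined only up to scale (an assumption about the evaluation protocol).
    """
    valid = (gt > 0) & (gt <= max_depth)
    pred, gt = pred[valid], gt[valid]
    pred = pred * np.median(gt) / np.median(pred)   # median scaling
    pred = np.clip(pred, 1e-3, max_depth)

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas

# Toy usage with synthetic depth maps at the evaluation resolution.
gt = np.random.uniform(1.0, 60.0, size=(192, 416))
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)
print(depth_metrics(pred, gt))
```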