Taskology: Utilizing Task Relations at Scale

Yao Lu^{1,2}, Sören Pirk^{1,2}, Jan Dlabal^{2}, Anthony Brohan^{1,2}, Ankita Pasad^{3*}, Zhao Chen^{4}, Vincent Casser^{4}, Anelia Angelova^{1,2}, Ariel Gordon^{1,2}
^{1} Robotics at Google, ^{2} Google Research, ^{3} Toyota Technological Institute at Chicago, ^{4} Waymo LLC
{yaolug, pirk, dlabal, brohan}@google.com, ankitap@ttic.edu, {zhaoch, casser}@waymo.com, {anelia, gariel}@google.com
* Work done while at Robotics at Google.

Abstract

Many computer vision tasks address the problem of scene understanding and are naturally interrelated, e.g. object classification, detection, scene segmentation, depth estimation, etc. We show that we can leverage the inherent relationships among collections of tasks, as they are trained jointly, supervising each other through their known relationships via consistency losses. Furthermore, explicitly utilizing the relationships between tasks allows improving their performance while dramatically reducing the need for labeled data, and allows training with additional unsupervised or simulated data. We demonstrate a distributed joint training algorithm with task-level parallelism, which affords a high degree of asynchronicity and robustness. This allows learning across multiple tasks, or with large amounts of input data, at scale. We demonstrate our framework on subsets of the following collection of tasks: depth and normal prediction, semantic segmentation, 3D motion and egomotion estimation, and object tracking and 3D detection in point clouds. We observe improved performance across these tasks, especially in the low-label regime.

Figure 1: Illustration of our framework for the collective training of multiple tasks with a consistency loss (two tasks are shown). Each task is performed by a separate network, and trained on its own dataset and a shared unlabeled mediator dataset. The consistency loss is imposed for samples from the mediator dataset.

1. Introduction

Many tasks in computer vision, such as depth and surface normal estimation, flow prediction, pose estimation, semantic segmentation, or classification, are inherently related as they describe the surrounding scene along with its dynamics. While solving for each of these tasks may require specialized methods, most tasks are connected by the underlying physics observed in the real world. A considerable amount of research aims to reveal the relationships between tasks [60, 5, 15, 56, 13, 54, 55], but only a few methods exploit these fundamental relationships. Some approaches rely on the unparalleled performance of deep networks to learn explicit mappings between tasks [55, 54].
However, while training task pairs leverages their relationships, it may lead to inconsistencies across multiple tasks, e.g. [55], and points to the alternative of training tasks jointly.

Multi-task learning targets the problem of training multiple tasks jointly. Common to many approaches is a shared feature-extractor component with multiple "heads" that perform separate tasks [15, 56, 13]. Training multiple tasks together increases the coherency between them and – in some setups – also enables their self-supervision [60, 5]. However, the joint training also has a few disadvantages. For one, a single model for multiple tasks is difficult to design, maintain and improve, as any changes in the training data, losses, or hyperparameters associated with one of the tasks also affect all others. Secondly, different modalities come with different architectures, which are difficult to merge into a single model. For example, point clouds require sparse processing [43], while images are processed with CNNs. Thirdly, it can become intractable to process a single model – built to perform multiple tasks – on a single compute node.

In this paper we introduce a novel approach for distributed collective training that explicitly leverages the inherent connections between multiple tasks (Fig. 1). Consistency losses are designed for related tasks, intended to enforce their logical or geometric structure. For example, given the two tasks of predicting surface normals and depth from RGB images, the consistency loss between them is based on the analytical relation between them – normals can be computed from the derivatives of a depth map. We show here that explicitly enforcing consistency between tasks can improve their individual performance, while their collective training also establishes the correspondence among the tasks, which – in turn – leads to a more sound visual understanding of the whole scene. We term the framework 'Taskology', as it connects tasks by their physical and logical constraints.

Using consistency losses to collectively train tasks enables a modular design for training neural networks, which offers three major advantages. First, we train structurally different tasks with entirely separate networks that are better suited for each individual task. This is also advantageous from a design, development, and maintainability point of view; each component can be replaced or improved separately from all others. Secondly, we benefit from unsupervised or partially labeled data. For example, many datasets are labeled for either segmentation or scene depth; with consistency losses, we can use partially labeled datasets for training both tasks, where the consistency losses are active for the unlabeled portion of the data (Fig. 1). Finally, we train multiple complex models jointly and asynchronously in a distributed manner, on different compute nodes. Each network is processed on a separate machine,
while their training is tied together through consistency losses. The communication between collectively trained networks – through their predictions – is asynchronous. Our experiments show that networks for different tasks can be trained with stale predictions from their peers; we do not observe a decrease in performance for predictions that are up to 20 minutes (∼2000 steps) old. Unlike existing methods for distributed training [37] that mostly rely on data- or model-parallelism to split training across multiple compute nodes, our framework separates training at the task level; each model is trained independently and asynchronously from all other models, while all models are coherently trained together. Distributed training allows scalability in multi-task learning, both in the number of tasks and in dataset sizes.

To summarize, the contributions are: (1) we present a framework that enables a modular design for training neural networks by separating tasks into modules that can be combined and then trained collectively; (2) we propose consistency losses for coherently training multiple tasks jointly, which allows improving their overall performance; (3) we demonstrate distributed training of multiple tasks, which allows for scalability; (4) we show that collectively trained tasks supervise themselves, which reduces the need for labeled data, and can leverage unsupervised or simulated data.

2. Related Work

Exploiting the structure of – and between – tasks has a long tradition in computer science [51, 53] and computer vision [36]. It has been recognized that knowing about the structure of a task and how it is related to other tasks can be used as a powerful learning scheme [3, 55, 54, 60, 5, 22]. In our work we are interested in making use of these relations more explicitly through the joint co-training of multiple tasks based on consistency losses. Therefore, our method is connected to related work on self- and unsupervised learning, multi-task learning, domain adaptation, and distributed training. While this spans a breadth of related work that we cannot comprehensively discuss, we aim to provide an overview of approaches closest to ours, with a focus on computer vision.

Methods based on self-supervision aim to autonomously generate labels for training data by exploiting the inherent structure of related tasks. As a prominent example, Doersch et al. [14] use unlabeled image collections to learn a representation for recognizing objects. Similarly, many other approaches use proxy or surrogate tasks to learn rich representations for visual data [40, 42, 58, 56, 41]. Self-supervised multi-view learning [60] is closely related to our approach as it aims to train tasks by establishing geometric consistency between them. By designing the consistency losses we directly make use of the known relations between tasks and thereby shape the task space, which is similar to common self-supervised training schemes.

The goal of multi-task learning is training models for tasks so as to obtain multiple outputs for a given input, while jointly improving the performance of each individual task [4, 59, 46]. Many approaches exist that extract features through a shared backbone and then train multiple heads for different objectives [15, 56]. These approaches are often restricted to tuning the loss function to balance contributions between different tasks [8, 30, 48].

While our framework is not limited to any specific task domain or combination of tasks, in this work we focus on the collective training of computer vision models. To this end, we are interested in using established methods for learning depth [22, 18, 32, 60], together with egomotion [60, 5], surface normals [57, 13, 25, 19, 44, 52], segmentation [49, 35, 29], optical flow [16, 28, 45, 9], or point cloud tracking [24, 1]. While we do not aim to change the model architectures for any of these tasks, our goal is to improve their performance by designing consistency losses for subsets of these tasks and by jointly training them.

Finally, the distributed version of our framework is closely related to the concept of distillation [26] and online distillation [2], where one network is trained toward an objective with the goal of guiding the training of another network. Federated learning [39] is another technique where learning is distributed among multiple instances of the same model. Unlike these methods, we explicitly leverage the structure between tasks by training multiple complex vision tasks simultaneously.

3. Method

Our main goal is to enable the distributed and collective training of individual network architectures for computer vision tasks that are inherently related.
The main idea is to connect tasks via shared consistency losses, which represent functional or logical relationships. In particular, we aim to exploit consistency constraints relating the tasks of predicting depth, surface normals, egomotion, semantic segmentation, object motion, and object tracking with 2D and 3D sensor data. For some tasks it is possible to directly formulate their relationship as an analytical differentiable expression, e.g. normals can be computed from the derivatives of depth values [27]; other tasks are related in more intricate ways, e.g. depth and egomotion, or segmentation and optical flow [23, 5, 9]. To train models collectively, we use existing model architectures for specific tasks (e.g. for predicting depth or segmentation) and define consistency losses between them. In this section we describe our framework, the collective training of two or multiple tasks (Sec. 3.1), the motivation for and examples of consistency losses (Sec. 3.2), as well as the distributed training of multiple tasks (Sec. 3.3).

3.1. Collective Training of Tasks

Given a set of tasks T = {t_1, ..., t_n} we define losses for supervising each task individually, which we denote as L^{sup}_i, as well as consistency losses for the collective training of sets of tasks, defined as L^{con}. In the following, tasks are referred to by their index i. We then define the overall loss as follows:

L = \sum_{i=1}^{n} L^{sup}_i\big(\hat{y}_i(w_i, x),\, y_i(x)\big) + L^{con}\big(\hat{y}_1(w_1, x), \hat{y}_2(w_2, x), \ldots, \hat{y}_n(w_n, x)\big),   (1)

where \hat{y}_i(\cdot) denotes the generated prediction based on the weights w_i of a task i, y_i(\cdot) denotes the ground truth label of a task i, and x is a data sample.
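To make Eq. (1) concrete, the sketch below shows how per-task supervised losses and a shared consistency loss could be combined for one batch; the task networks, loss functions, and dataset handling are placeholders of our own, not the paper's implementation.

```python
from typing import Callable, Sequence

def total_loss(
    dedicated_batches: Sequence,         # (x_i, y_i) pairs from each task's labeled dedicated dataset
    mediator_sample,                     # an unlabeled sample x from the shared mediator dataset
    predict: Sequence[Callable],         # per-task forward passes, y_hat_i(w_i, x)
    supervised_loss: Sequence[Callable], # per-task supervised losses L_sup_i
    consistency_loss: Callable,          # shared consistency loss L_con
):
    # Supervised terms: each task is trained on its own dedicated dataset.
    l_sup = sum(
        loss(pred(x), y)
        for pred, loss, (x, y) in zip(predict, supervised_loss, dedicated_batches)
    )
    # Consistency term: every task runs a forward pass on the same mediator
    # sample, and the consistency loss couples their predictions (no labels).
    l_con = consistency_loss(*[pred(mediator_sample) for pred in predict])
    return l_sup + l_con
```

In practice, each task alternates between batches from its dedicated dataset (supervised term) and batches from the mediator dataset (consistency term), as described next.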
We assume that each task is performed by a separate deep network accompanied by a labeled dataset, which we refer to as its dedicated dataset. Furthermore, we use the standard supervision loss for each model as if we wanted to train it in isolation. For collective training we then use a separate (unlabeled) dataset, which we refer to as the mediator dataset, to enforce consistency between the tasks. During training, both tasks receive samples from the mediator dataset and the results of this forward pass are used to compute the consistency loss. The training loop of each task alternates by drawing samples from the dedicated and the mediator dataset. The setup for the collective training of two tasks is illustrated in Fig. 1.

The setup described above can have a few special cases: for one, a dedicated dataset for either task, or all, can be empty. In this case the setup reduces to unsupervised training; the unsupervised learning of depth and egomotion [60] exemplifies this case. Datasets can overlap, i.e. a dataset can have labels for multiple tasks (e.g. for semantic segmentation and depth), or datasets can be only partially labeled, i.e. one dataset has labels for depth only and another one has labels for segmentation only. In the latter case the consistency loss will be applied to both datasets, whereas supervised losses are applied to individual datasets only. The setup naturally generalizes to the collective training of N tasks, where the consistency loss is generally a function of the predictions of all participating tasks.

3.2. Task Consistency Constraints

We rely on established knowledge [27, 47] in computer vision about tasks and their relationships to identify consistency constraints. Consistency constraints ensure the coherency between different tasks and are derived from laws of geometry and physics. Our goal is to leverage the consistency constraints to define consistency losses for task combinations. Any constraint that can be written as a differentiable analytic expression can be used within our framework. While this work focuses on already existing relations between tasks via, e.g., analytical loss relationships, future work can focus on potentially learning these losses. Sections 3.2.1, 3.2.2, and 3.2.3 below describe the specific task relations considered in this work.

3.2.1 Scene depth, segmentation and ego-motion

We first exploit consistency constraints between predicting depth, ego-motion and semantic segmentation. We establish consistency between these three tasks by considering the relations in image pixels and scene geometry between two consecutive frames during training. More specifically, we can 'deconstruct' a scene at a time step t, estimating its depth and potentially moving objects; at time t+1 we can 're-construct' the scene as a function of the scene geometry (depth) and the moving objects observed at the previous time step, considering the potential ego-motion and the objects' motion. Within this setup we impose both geometric and semantic consistencies to reflect the relations between these tasks. More specifically we consider training several tasks:

Motion Prediction Networks: Given two consecutive RGB frames, I_1(i,j) and I_2(i,j), predict the ego-motion between these frames, i.e. the transformation of the camera between frame 2 and frame 1. This can be subdivided into a translation vector T_{1→2} and a rotation matrix R_{1→2}. To model independently moving objects, for every pixel (i,j), another task can predict the movement of the point visible at the pixel of frame 1, relative to the scene, which occurred between frame 1 and frame 2, denoted as δt_{1→2}(i,j).

Depth Prediction Network: The depth prediction network predicts a depth map, z(i,j), for an image I(i,j).

Semantic Segmentation Network: From an RGB frame, the semantic segmentation network predicts a logit map l_c(i,j) for each class c. For each pixel (i,j), the class is given by c(i,j) = argmax_c l_c(i,j).

The tasks are interrelated as follows: the 3D translation fields can deviate from their background value (due to camera motion) only at pixels that belong to possibly-moving objects (e.g. vehicles, pedestrians). Therefore, semantic segmentation informs 3D motion prediction. Conversely, given a depth map and a 3D motion field, optical flow fields can be derived and then used to assert the consistency of segmentation masks in pairs of adjacent frames. The flow fields can then be used to inform the training of a semantic segmentation module.

Let us define m(i,j) to be the movable mask:

m(i,j) = \begin{cases} 1 & c(i,j) \in M \\ 0 & \text{otherwise} \end{cases}   (2)

M is the collection of all classes that represent movable objects, e.g. persons, cars. For each pixel (i,j), m(i,j) equals 1 if the pixel belongs to one of the movable object classes, and 0 otherwise.

We can now compute the warping of the first frame onto the second as a result of the scene motion and object motion. More specifically, given two adjacent video frames, 1 and 2, a depth map of frame 1, z_1(i,j), the camera matrix K, and a pixel position in homogeneous coordinates p_1(i,j) = (j, i, 1)^T, one can write the shift in p resulting from the rotation, translation and object motion δt that occurred between the two frames, and obtain new values z'_1(i,j) and p'_1(i,j), which are a function of z, p, δt, m and the ego-motion network predictions for R and T (see the supp. material):

z'_1(i,j)\, p'_1(i,j) = K R_{1\to 2} K^{-1} z_1(i,j)\, p_1(i,j) + K \big(m_1(i,j)\, \delta t_{1\to 2}(i,j) + T_{1\to 2}\big).

Here p'_1 and z'_1 are respectively the new homogeneous coordinates of the pixel and the new depth, projected onto frame 2. From them we can obtain new estimated values for the image I'_1(i,j), via back projection to the image space. The movable mask m_1(i,j) restricts the motion of objects relative to the scene to occur only at pixels that belong to movable objects, i.e. δt is applied for these objects only.
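As an illustration, the sketch below applies the warping equation above to a single pixel with NumPy; the variable names are ours, and batching, bilinear sampling and the differentiability needed for training are omitted.

```python
import numpy as np

def warp_pixel(z1, p1, K, R_12, T_12, m1, dt_12):
    """Project a pixel of frame 1 into frame 2, following the equation above.

    z1   : depth of the pixel in frame 1 (scalar)
    p1   : homogeneous pixel coordinates (j, i, 1)
    K    : 3x3 camera intrinsics matrix
    R_12 : 3x3 camera rotation from frame 1 to frame 2
    T_12 : camera translation (3-vector)
    m1   : 1 if the pixel belongs to a movable class, else 0
    dt_12: 3D motion of the point relative to the scene (3-vector)
    """
    rigid = K @ R_12 @ np.linalg.inv(K) @ (z1 * np.asarray(p1, dtype=float))
    objects = K @ (m1 * np.asarray(dt_12, dtype=float) + np.asarray(T_12, dtype=float))
    q = rigid + objects          # equals z'_1 * p'_1
    z1_new = q[2]                # new depth, projected onto frame 2
    p1_new = q / q[2]            # new homogeneous pixel coordinates (j', i', 1)
    return z1_new, p1_new
```

Applying this to every pixel and bilinearly sampling frame 2 at the resulting positions yields I'_1, which enters the photometric loss below.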
Consistency Losses: The first type of consistency constraint is photometric consistency across adjacent frames, imposing that RGB values will be preserved after warping. To formulate this constraint, we sample I_2(i,j) at the pixel positions p'_1(i,j) and, using bilinear interpolation, we obtain I'_1(i,j), frame 2's RGB image warped onto frame 1. The photometric loss can then be written generally as:

L_{ph} = \sum_{i,j} L_p\big(I'_1(i,j), I_1(i,j)\big) + \sum_{i,j} L_p\big(I'_2(i,j), I_2(i,j)\big),   (3)

where I'_2 is defined analogously to I'_1, just with 1 and 2 swapped everywhere. L_p stands for a pixelwise photometric loss, such as an L1 penalty on the difference in RGB space and structural similarity (SSIM), each weighed by a coefficient. In our experiments we used the same photometric loss as described in [23]. The depth prediction and motion prediction networks were taken therefrom as well. The segmentation network was taken from [21].

The second type of consistency constraint is segmentation logits consistency across adjacent frames. To formulate this constraint, we sample l_{c2}(i,j) at the pixel positions p'_1(i,j) and, using bilinear interpolation, we obtain l'_{c1}(i,j). The segmentation consistency loss can then be written generally as:

L_{seg} = \sum_{i,j,c} L_2\big(l'_{c1}(i,j), l_{c1}(i,j)\big) + \sum_{i,j,c} L_2\big(l'_{c2}(i,j), l_{c2}(i,j)\big),   (4)

where l'_{c2} is defined analogously to l'_{c1}, just with 1 and 2 swapped everywhere. L_2 stands for a squared L2 loss. Overall, the final consistency loss becomes

L^{Con}_{2D\,sem} = L_{ph} + L_{seg}.   (5)
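A compact sketch of Eqs. (3)-(5), assuming the warped images and logit maps (I'_1, I'_2, l'_{c1}, l'_{c2}) have already been produced by bilinear sampling at the warped pixel positions; the SSIM term and the per-term weights used in [23] are omitted.

```python
import numpy as np

def photometric_loss(i1, i2, i1_warped, i2_warped):
    # L_ph, Eq. (3): L1 penalty between each frame and the other frame
    # warped onto it (i1_warped is I'_1, i.e. frame 2 warped onto frame 1).
    return np.abs(i1_warped - i1).sum() + np.abs(i2_warped - i2).sum()

def segmentation_consistency_loss(l1, l2, l1_warped, l2_warped):
    # L_seg, Eq. (4): squared L2 penalty between each frame's logits and the
    # other frame's logits warped onto it, summed over pixels and classes.
    return ((l1_warped - l1) ** 2).sum() + ((l2_warped - l2) ** 2).sum()

def consistency_loss_2d_sem(i1, i2, i1_warped, i2_warped,
                            l1, l2, l1_warped, l2_warped):
    # Eq. (5): the combined 2D semantic consistency loss.
    return (photometric_loss(i1, i2, i1_warped, i2_warped)
            + segmentation_consistency_loss(l1, l2, l1_warped, l2_warped))
```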
3.2.2 3D Object Detection in Point Clouds in Time

In this section, we focus on object detection from 3D point clouds, a difficult task which is also crucial in many practical applications, such as autonomous vehicles. We show that, when we simultaneously train an object flow network, which predicts the motion of objects through time, we can apply a motion-consistency loss to significantly boost single-frame 3D object detector performance, especially in the low-label regime. Similarly to establishing consistencies between dynamic scenes in 2D space as in Section 3.2.1, one can establish consistencies in 3D. More specifically, we can track moving objects in 3D across frames. Given a module that detects movable objects in a scene in two consecutive frames, and another module that predicts rigid motion, we can assert that the motion-prediction module correctly estimates the motion of each object. Here we directly enforce this in 3D space using 3D bounding box detections in point clouds and optical flow.

We address 3D object detection in point clouds using two models: one predicting 3D bounding boxes on a point cloud, and the other predicting 2D box flow on point cloud sequences (Figure 2).

Figure 2: 3D Object Detection setup: from a sequence of point clouds (top), we predict bounding boxes per individual point cloud (middle) and the optical flow, denoted as green arrows (bottom).

We use a PointPillar-based [33] network as our 3D detector, which allows us to work in a pseudo-2D top-down space for all our experiments. At the shared feature layer preceding the detector prediction head we attach a flow prediction head that, given n_f frames, outputs 3(n_f − 1) channels corresponding to flow. Our detection model operates on the grid-voxelized input point cloud of shape (n_x, n_y), where we choose n_x = n_y = 468 (the z dimension has been marginalized out as we are using a PointPillar [33] model). The grid size in x−y corresponds to each grid point being of extent (0.32 m, 0.32 m) in real space. For each grid point, we predict:

1. (7n_a, n_f) residual values pinned to n_a fixed anchors. The 7 values correspond to displacements dx, dy, dz, dw, dℓ, dh, dθ of the final predicted box from the values for the anchor boxes. Ground truth boxes are automatically corresponded to anchors at training time.

2. (n_a, n_f) class logit values denoting the confidence that an object of the specified type exists for that anchor box.

3. (n_a, 3(n_f − 1)) values corresponding to the box flow (flow_x, flow_y, flow_θ) of a hypothetical box in the current frame to any of the n_f − 1 frames in the past.

As mentioned, we use the same backbone to predict box flow. Namely, through an equivalent backbone we predict a three-channel map (flow_x, flow_y, flow_θ) of the same resolution as the detector predictions, corresponding to the flow of the boxes in the current frame to any of the previous frames in the sequence. The flow is only supervised at locations within the grid associated with positive object detections.

Consistency Loss: This set of detection and flow predictions induces natural consistency constraints. Namely, given a predicted flow that transforms an anchor point centered at (x, y) in the current frame to the closest anchor point (x + flow_x, y + flow_y, θ + flow_θ) ≈ (x', y', θ') in another frame, we consider the following two loss functions (in the following, all primed coordinates are coordinates after the flow has been applied to the current frame):

1. L_{class}: The class confidences at two points (x, y, θ) ↦ (x + flow_x, y + flow_y, θ + flow_θ) connected by the predicted flow vector should be the same: L_{class} = (classlogit(x, y) − classlogit(x', y'))^2.

2. L_{residual}: The predicted flow can be used to calculate consistent residual values for (x, y, θ) at two points connected by the predicted flow. The residual values for z, w, ℓ, and h should also remain constant between two detections connected by flow (for short time spans we assume near-constant elevation):

L_{residual} = \sum_{i \in (x, y, \theta)} \big(di' − di + (\text{flow}_i − (i' − i))\big)^2 + \sum_{j \in (z, \ell, w, h)} (dj' − dj)^2.

Overall, our 3D point cloud motion-consistency loss becomes L^{Con}_{PC\,in\,time} = L_{class} + L_{residual}.

The class loss L_{class} ensures that class logits along predicted object tracks are equal, while the residual loss L_{residual} does the same for residual values along tracks. The first term in the residual consistency takes into account the predicted flow, which transforms (x, y, θ) to (x', y', θ'). Because the flow is continuous and not quantized like the voxel grid, we can normalize out the quantization noise exactly by adding a remainder term, (flow_i − (i' − i)) for i ∈ (x, y, θ). For the residual consistency, we further enforce that the bounding box residuals for z, ℓ, w, h do not change along object tracks. This reflects our assumption that the dimensions of the vehicle are preserved and that there is no appreciable elevation change over the span of a second.

3.2.3 Depth and Surface Normals

Given a depth prediction module and a surface-normal prediction module, we can assert that the normals obtained from the spatial derivatives of the depth map are consistent with the predicted normals. More specifically, the depth model produces continuous depth, from which one can analytically compute surface normal estimates n̂_d at each location, which are a function of depth. On the other hand, a surface normals model can be trained independently to produce surface normal predictions n̂_p, and a consistency loss between these two predictions of surface normals can be imposed on the shared data source (Figure 3). The consistency is then computed as:

L^{Con}_{Normals} = \text{cosine\_distance}(\hat{n}_d, \hat{n}_p),   (6)

where n̂_d is the surface normal map computed from the inferred depth and n̂_p is the normal map predicted by the normal prediction network (see the supp. material for the derivation).
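As an illustration of Eq. (6), the sketch below computes normals from a depth map with simple central differences and penalizes their cosine distance to the predicted normal map. This is a simplification of the point-cloud-based derivation in the supplementary material; intrinsics and the validity mask are ignored here, and the names are ours.

```python
import numpy as np

def normals_from_depth(depth):
    # Central differences of the depth map; a crude stand-in for the full
    # derivation in the supplementary material (Sec. 7).
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_consistency_loss(depth, predicted_normals):
    # Eq. (6): mean cosine distance between normals computed from the
    # inferred depth (n_d) and normals predicted by the normal network (n_p).
    n_d = normals_from_depth(depth)
    n_p = predicted_normals / np.linalg.norm(predicted_normals, axis=-1, keepdims=True)
    cosine_similarity = (n_d * n_p).sum(axis=-1)
    return (1.0 - cosine_similarity).mean()
```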
Interestingly, this can also be done by combining simulated and real data sources. Figure 3 visualizes the setup used in our experiments later, where the data source for the (supervised) surface normals training is simulated, whereas a real data source can be used by both tasks (it is used in an unsupervised manner for the normals model).

Figure 3: Depth and normals joint training for domain adaptation: we use SceneNet (simulated) data to supervise the training of separate models for depth and surface normal prediction, and apply a consistency loss to jointly train both models for images from ScanNet, where we do not use any ground truth normal labels.

3.3. Distributed Training

Collective training of multiple networks eventually requires distributing the computation across multiple compute nodes, to speed up the training, or simply because a large enough collection of models cannot be processed on a single machine. In distributed training it is often the communication between the nodes that sets the limitations [37]. Our framework provides the advantage of training tasks independently, with communication via consistency losses only. Furthermore, the modules share potentially vast amounts of unsupervised data, which allows for data-parallelism.

To reduce the communication load, distributed training schemes often aim to be asynchronous, which means that model parameters or their gradient updates develop some degree of 'staleness', which denotes the interval between updates to each model. It is easy to observe that stale predictions are less harmful than stale gradient updates [2], since predictions are expected to converge as the training progresses. Therefore, modules will query each other's predictions to compute shared losses, but propagate gradients locally, within their own module.

Our distributed implementation is based on this principle and takes advantage of shared losses. Each module is training on a separate machine ("trainer"), as illustrated in Fig. 4. The consistency loss depends on the outputs of all co-training tasks, which means that its computation requires evaluating a forward pass through all of them. Each trainer evaluates the forward pass of its own module and, to evaluate the forward passes of the other modules, it queries servers that host stale copies of the peer modules. At each training step, the trainer then pushes gradients to its own module and updates its weights.

Figure 4: Illustration of our distributed setup for collective training. Each task-module is training on its own machine (its "trainer"; only Trainer 1 is shown in the figure). In order to compute the consistency loss, Trainer 1 reaches out to a server that hosts a stale copy of Task 2's model and performs the forward pass (and vice versa). Each trainer pushes gradient updates to its respective model, and every so often, the stale copies on both forward-pass servers are refreshed with fresh copies from the respective trainer.

One advantage of our distributed method is that each module can train with its own hyperparameters, including optimizer, regularizers, and learning rate schedules. Typically, per-task modules are published together with these hyperparameters, and our method allows using them as necessary for each respective model. Moreover, more computationally expensive modules can be allocated more computational resources, to approximately equalize the training times among the modules. Finally, since the modules communicate through predictions, and predictions are typically much more lightweight than network weights, the communication overhead is significantly lower compared to other distributed training techniques.
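The sketch below captures one training step of a single trainer in Fig. 4, written in PyTorch style for concreteness. The `peer_forward` callable stands for the RPC to the forward-pass server holding a stale copy of the peer's model; the actual RPC mechanism, refresh schedule, and optimizers are deployment details not specified here.

```python
def train_step(model, optimizer, dedicated_batch, mediator_batch,
               supervised_loss, consistency_loss, peer_forward):
    """One step of a trainer: supervised term on the dedicated dataset plus
    a consistency term against a (possibly stale) peer prediction."""
    x, y = dedicated_batch
    loss = supervised_loss(model(x), y)            # supervised term, L_sup
    # Consistency term: our fresh prediction vs. the peer's stale prediction
    # on the same mediator sample; gradients flow only through our own model.
    peer_prediction = peer_forward(mediator_batch)  # RPC, no gradients
    loss = loss + consistency_loss(model(mediator_batch), peer_prediction)
    optimizer.zero_grad()
    loss.backward()                                 # gradients stay local
    optimizer.step()
    return float(loss)
```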
4. Experiments

In the following sections we report the results of experiments on using consistency losses for the co-training of multiple models as described in Sections 3.2.1, 3.2.2, and 3.2.3. We observe improved performance across tasks (Section 4.1), successful training with unlabeled data, where our approach is more helpful in lower label regimes (Section 4.2), and successful domain adaptation (Section 4.3). The experiments in Sections 4.1, 4.2, and 4.4 were run distributed, whereas the rest of the experiments were run on a single machine (e.g. as in Fig. 1).

4.1. Scene Depth, Segmentation and Ego-motion

For this experiment we show results for the distributed collective training of three tasks. The first task is semantic segmentation based on NAS-FPN [21]. The other two tasks are depth prediction and motion estimation (for both camera and objects), for which we rely on existing models [23]. Segmentation masks were used to regularize the 3D motion fields [23] (Section 3.2.1). Semantic segmentation was trained on COCO [34] 2017 as its dedicated dataset, and Cityscapes [10] was used as the unlabeled mediator dataset; no Cityscapes labels were used at training. Each of the three models was trained on a separate machine, as outlined in Sec. 3.3. The segmentation module received a greater allocation of compute resources than the others, since it is significantly more computationally expensive. The hardware configuration is described in the supp. material. The batch size, the optimizer, the learning rate, and other hyperparameters varied across the tasks, based on the respective published values for each model. During training, each of the three models queries its peers via RPC to obtain their predictions, which were up to one minute stale.

Configuration | Depth Error (Abs. Rel.) | Segmentation MIOU
A. Depth & motion only | 0.165 | -
B. Segmentation only | - | 0.455
C. Frozen segmentation model B with depth & motion | 0.129 | -
D. Frozen depth & motion model C & segmentation | - | 0.471
E. Depth, motion and segmentation training jointly | 0.125 | 0.478

Table 1: Results of the distributed collective training of three models: depth prediction, 3D motion prediction, and semantic segmentation. COCO was the dedicated dataset for semantic segmentation, and Cityscapes served as an unlabeled mediator dataset. Both depth prediction and segmentation were evaluated on Cityscapes, with segmentation evaluated only for predictions associated with pedestrians and vehicles (details of the evaluation protocol are given in the supp. material).
The effect of collective training on the performance of the participating models is shown in Tab. 1. Experiments A and B are the baselines, where the depth and motion models were trained jointly, but separately from segmentation. Experiment E shows the improvement in performance when all three tasks train jointly with consistency constraints. Rows C and D are ablations that demonstrate the changes in performance when consistency constraints are turned on progressively. In C the depth and motion models are supervised by the segmentation model from experiment B. C achieves the same depth error as a similar configuration trained on a single machine [23], where segmentation masks were precomputed. In experiment D, segmentation was consistency-supervised by the improved depth and motion model from experiment C, but the latter two models remained frozen. The progression in quality demonstrates the effect of consistency supervision on all tasks.

While consistency contributes to correctness, it does not guarantee the latter. This is reflected in the failure cases of the method. Some illustrative examples are shown in the supp. material.

4.2. 3D Object Detection in Point Clouds in Time

We perform all experiments on the vehicle class of the Waymo Open Dataset [50], which provides complete, tightly-fitting 3D bounding box annotations along with tracks for each vehicle, and use a sequence length of n_f = 3 point clouds (∆ = 0.5 s). Our backbone architecture is based on a PointPillar detector [33]. Given a sequence of input point clouds, we quantize the points for each frame into a grid in the x-y plane and then use our single-frame detection model to produce a confidence value at each grid point for the presence of an object box, as well as residual values (x, y, z, w, ℓ, h, θ) to refine the final box coordinates (Section 3.2.2). For all our reported experiments, n_f = 3 and n_a = 2. We follow the original PointPillar network settings in choosing all class thresholds. We perform experiments with partial labeling in which only 5% or 20% of the box labels are available. Our baseline (100%) is trained in isolation, whereas all partial label studies are performed in the distributed framework.

Method | Labels | 3D mAP/mAPH (%) | BEV mAP/mAPH (%)
No Consistency | 5% | 17.6/9.6 | 44.3/24.3
Adding L^{con} | 5% | 23.5/12.0 | 51.1/26.5
No Consistency | 20% | 30.8/16.4 | 63.0/34.1
Adding L^{con} | 20% | 31.6/19.1 | 65.7/39.2
No Consistency | 100% | 53.0/47.6 | 75.0/66.8
Adding L^{con} | 100% | 54.2/49.6 | 75.0/68.5

Table 2: 3D detection and 2D (BEV) metrics on the Waymo Open Dataset, given various degrees of dataset labels provided for training. We see consistent improvements when applying our motion-based consistency loss, especially with fewer labels.

Our results are shown in Table 2. We can see the three sets of experiments, in which we stripped the dataset of its labels to various degrees. Our metrics are based on the standard mean average precision (mAP) metrics for 2D and 3D detection. We also use the mAPH metric introduced in [50], which takes into account object heading. mAPH is calculated similarly to mAP, but all true positives are scaled by err_θ/π, with err_θ being the absolute angle error of the prediction in radians. We report results on both 3D detection and 2D bird's-eye-view (BEV) detection.

We can see that joint training with consistency losses is very beneficial. The consistency loss improves the object detector performance in all three settings, with more significant improvements when labels are scarce. This holds for both 3D detection and 2D BEV detection. Furthermore, the consistency loss has a beneficial effect on mAPH, i.e. it is able to correct errors in heading, as it enforces rotational consistency along each object track.
4.3. Depth and Surface Normals with Domain Shift

Since training in our framework involves multiple datasets, it is interesting to explore what happens when there is a large domain disparity between them. To this end, we select the extreme case of domain disparity between the dedicated datasets and the mediator dataset. As tasks for this experiment we selected depth estimation and surface normal prediction. We use SceneNet [38] (simulated data) as the dedicated dataset, and ScanNet [11] (real data) as the unlabeled mediator dataset (Figure 3). We use simulated SceneNet to train depth and normal estimation models and evaluate them on the real ScanNet data as our baseline. The strong domain disparity is evident from the fact that a model trained on simulated data performs poorly on the real dataset (Table 3). For the baseline, both models are trained separately to predict depth and surface normals, with a mean squared error loss for the depth model and a cosine loss for the surface normal model. The trained models are then used to predict surface normals for samples from ScanNet. Accuracy is measured using the ground truth depth data of ScanNet and surface normals generated by the method of [25].

We then train the models with the consistency loss on ScanNet. The consistency loss is computed as the cosine similarity of the computed surface normals and those predicted by the normal prediction network. The consistency is based on the fact that a normal map can be analytically computed from a depth map [31] (Section 3.2.3).

Table 3 shows the results of individual training of depth and surface normal prediction on SceneNet (simulated) and testing on ScanNet (real), and of training in the same transfer setting but with loss consistency. We observe that training with loss consistency improves the performance on both tasks on this challenging sim-to-real transfer task.

Method | Normals Accuracy (in %): <11.25° / <22.50° / <30.00° | Depth Error (Abs. Rel., in %)
SceneNet→ScanNet | 9.2 / 30.8 / 46.3 | 28.2
SceneNet→ScanNet (with Consistency) | 13.6 / 34.9 / 46.7 | 24.9

Table 3: Surface normal prediction transfer from SceneNet (simulated) to ScanNet (real).

4.4. Tolerance to Staleness

As discussed in Sec. 3.3, in our setup individual modules communicate with each other through their predictions. This is motivated by the increased resilience to staleness that predictions exhibit compared to weights and gradients [2]. To study the amount of staleness our setup can afford, we train depth and egomotion [5] on the KITTI dataset [20], each on a separate machine. This experiment is particularly challenging because it is fully unsupervised: each of the modules is only supervised by the predictions produced by its peer. Since both modules are randomly initialized, each model initially receives a random and stale peer-supervision signal.

In Fig. 5 we show the results of this experiment, namely the depth prediction error as a function of time for various values of staleness. While greater staleness values initially hinder the training, all experiments converge to approximately the same result, and approximately at the same time. Staleness of up to 20 minutes, or 2000 training steps, is shown to have no adverse effect on the convergence time or the test metric.

Figure 5: Depth prediction error as a function of time for different values of staleness for distributed collective training of depth and egomotion on KITTI. Staleness of 20 minutes means that the depth trainer receives egomotion labels from an egomotion model that refreshes every 20 minutes, and vice versa. The depth trainer performs about 100 training steps per minute, so 20 minutes translates to 2000 steps. A value of 0 staleness denotes a configuration where all networks were placed on the same machine and trained synchronously. The graphs in the inset show the long-time progression of training. All experiments achieve the same absolute relative depth prediction error of about 0.143 (which is on par with the state-of-the-art [5] for models that disregard object motion), at about the same time, irrespective of the staleness.

The negative effects of staleness on convergence time and on result metrics have been studied for various distributed training methods [12, 7, 6, 17]. While the tolerance to staleness varies widely, due to the diversity of methods, most studies only report the sensitivity of methods to staleness of up to a few tens of steps. Unlike these findings, our distributed training setup is much more robust and thereby enables training with staleness of up to thousands of steps.
5. Conclusions

We have introduced a novel framework, 'Taskology', for the collective training of multiple models of different computer vision tasks. Our main contribution is that our framework enables a modular design for training neural networks by separating tasks into modules that can be combined and trained collectively. Furthermore, we employ consistency losses so as to exploit the structure between tasks. By jointly training multiple tasks, we have shown that consistency losses help to improve performance and can take advantage of unlabeled and simulated data. Our approach achieves better results from joint training, especially when a large portion of the dataset is not labeled. We also demonstrated a distributed version of the framework, which trains models on separate machines and is robust to staleness.

References

[1] Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova, Rig Das, Gleb Gusev, Djamila Aouada, and Björn E. Ottersten. Deep learning advances on different 3d data representations: A survey. ArXiv, 2018.
[2] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. In ICLR, 2018.
[3] Arunkumar Byravan and Dieter Fox. Se3-nets: Learning rigid body motion using deep neural networks. 2017.
[4] Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In ICML, pages 41–48, 1993.
[5] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI, volume 33, pages 8001–8008, 2019.
[6] Chi-Chung Chen, Chia-Lin Yang, and Hsiang-Yun Cheng. Efficient and robust parallel dnn training through model parallelism on multi-gpu platform, 2018.
[7] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous sgd, 2016.
[8] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803, 2018.
[9] Jingchun Cheng, Yi-Hsuan Tsai, Shengjin Wang, and Ming-Hsuan Yang. Segflow: Joint learning for video object segmentation and optical flow. In ICCV, pages 686–695, 2017.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[11] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
[12] Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, and Eric P. Xing. Toward understanding the impact of staleness in distributed machine learning, 2018.
[13] Thanuja Dharmasiri, Andrew Spek, and Tom Drummond. Joint prediction of depths, normals and surface curvature from RGB images using cnns. CoRR, 2017.
[14] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
[15] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. CoRR, 2017.
[16] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. ICCV, pages 2758–2766, 2015.
[17] Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed sgd, 2018.
[18] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
[19] David F. Fouhey, Abhinav Gupta, and Martial Hebert. Data-driven 3d primitives for single image understanding. In ICCV, pages 3392–3399, 2013.
[20] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013.
[21] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In CVPR, June 2019.
[22] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. CVPR, 2017.
[23] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In ICCV, 2019.
[24] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. arXiv, 2019.
[25] Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, and Irfan A. Essa. Floors are flat: Leveraging semantics for real-time surface normal prediction. CoRR, 2019.
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv, 2015.
[27] S. Holzer, R. B. Rusu, M. Dixon, S. Gedikli, and N. Navab. Adaptive neighborhood selection for real-time surface normal estimation from organized point cloud data using integral images. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2684–2689, 2012.
[28] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In CVPR, 2018.
[29] X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, J. Feng, and S. Yan. Video scene parsing with predictive feature learning. In ICCV, pages 5581–5589, 2017.
[30] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491, 2018.
[31] K. Klasing, D. Althoff, D. Wollherr, and M. Buss. Comparison of surface normal estimation methods for range sensing applications. In ICRA, pages 3206–3211, 2009.
[32] Arun CS Kumar, Suchendra M. Bhandarkar, and Mukta Prasad. Depthnet: A recurrent neural network architecture for monocular depth prediction. In CVPRW, pages 396–3968, 2018.
[33] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, pages 12697–12705, 2019.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
[35] Pauline Luc, Natalia Neverova, Camille Couprie, Jacob Verbeek, and Yann LeCun. Predicting deeper into the future of semantic segmentation. ICCV, 2017.
[36] Jitendra Malik, Pablo Arbeláez, João Carreira, Katerina Fragkiadaki, Ross Girshick, Georgia Gkioxari, Saurabh Gupta, Bharath Hariharan, Abhishek Kar, and Shubham Tulsiani. The three r's of computer vision: Recognition, reconstruction and reorganization. Pattern Recognition Letters, pages 4–14, 2016.
[37] Ruben Mayer and Hans-Arno Jacobsen. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Computing Surveys (CSUR), pages 1–37, 2020.
[38] John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? 2017.
[39] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282. PMLR, 2017.
[40] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.
[41] Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. CoRR, 2017.
[42] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. CVPR, 2016.
[43] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. CVPR, 2017.
[44] Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In CVPR, 2018.
[45] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised deep learning for optical flow estimation. In AAAI, pages 1495–1501, 2017.
[46] Sebastian Ruder. An overview of multi-task learning in deep neural networks. CoRR, 2017.
[47] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, June 2016.
[48] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In NeurIPS, pages 527–538, 2018.
[49] Shai Shalev-Shwartz and Amnon Shashua. On the sample complexity of end-to-end training vs. semantic abstraction training. CoRR, 2016.
[50] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
[51] Alan M. Turing. Computing machinery and intelligence. (236):433–460, 1950.
[52] Peng Wang, Xiaohui Shen, Bryan Russell, Scott Cohen, Brian Price, and Alan L. Yuille. Surge: Surface regularized geometry estimation from a single image. NIPS, pages 172–180, 2016.
[53] Terry Winograd. Thinking Machines: Can There Be? Are We?, pages 167–189. Cambridge University Press, USA, 1990.
[54] Amir Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. pages 3712–3722, June 2018.
[55] Amir R. Zamir, Alexander Sax, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, and Leonidas J. Guibas. Robust learning through cross-task consistency. June 2020.
[56] Amir Roshan Zamir, Tilman Wekel, Pulkit Agrawal, Colin Wei, Jitendra Malik, and Silvio Savarese. Generic 3d representation via pose estimation and matching. In ECCV, 2016.
[57] Huangying Zhan, Chamara Saroj Weerasekera, Ravi Garg, and Ian D. Reid. Self-supervised learning for single view depth and surface normal estimation. CoRR, 2019.
[58] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, pages 649–666, 2016.
[59] Yu Zhang and Qiang Yang. A survey on multi-task learning. CoRR, 2017.
[60] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, pages 1851–1858, 2017.

Supplementary Material

6. Scene depth, segmentation and ego-motion

6.1. Modules and Interfaces

The interfaces of the three modules, depth, motion and semantic segmentation, are defined below:

Motion Prediction Network: Given two consecutive RGB frames, I_1(i,j) and I_2(i,j), of width w (0 ≤ j < w) and height h (0 ≤ i < h), the motion prediction network predicts the following quantities:

• δt_{1→2}(i,j): For every pixel (i,j), δt_{1→2}(i,j) estimates the movement of the point visible at the pixel (i,j) of frame 1, relative to the scene, which occurred between frame 1 and frame 2.
• T_{1→2}: The translation vector of the camera between frame 2 and frame 1.
• R_{1→2}: The rotation matrix of the camera between frame 2 and frame 1.

Similarly, the network predicts δt_{2→1}(i,j), T_{2→1}, and R_{2→1}, which are defined as above, with (1) and (2) swapped.

Depth Prediction Network: Given an RGB frame, I(i,j), the depth prediction network predicts a depth map, z(i,j), for every pixel (i,j).

Semantic Segmentation Network: Given an RGB frame, I(i,j), the semantic segmentation network predicts a logit map l_c(i,j) for each class c. For each pixel (i,j), the class is given by c(i,j) = argmax_c l_c(i,j).

Class | Ours | COCO 2017 | Cityscapes
person/rider | 1 | 1 | 24/25
bicycle | 2 | 2 | 33
car | 3 | 3 | 26
motorcycle | 4 | 4 | 32
traffic lights | 5 | 10 | 19
bus | 6 | 6 | 28
truck | 7 | 8 | 27
others | 8 | other labels | other labels

Table 4: Mapping between Cityscapes label IDs, COCO label IDs, and the label IDs we defined for this experiment.

6.2. Next frame warping

To construct the consistency losses for these tasks, as shown in the main paper, we need to derive the location of each pixel of the first frame in the next frame, which is also referred to as image warping from frame 1 to frame 2. We start by defining m(i,j) to be the movable mask:

m(i,j) = \begin{cases} 1 & c(i,j) \in M \\ 0 & \text{otherwise} \end{cases}   (7)

M is the collection of all classes that represent movable objects; these are detailed below. For each pixel (i,j), m(i,j) equals 1 if the pixel belongs to one of the movable object classes, and 0 otherwise.

Given two adjacent video frames, 1 and 2, a depth map of frame 1, z_1(i,j), the camera matrix K, and a pixel position in homogeneous coordinates

p(i,j) = (j, i, 1)^T,   (8)

one can write the shift in p resulting from the rotation and translation that occurred between the two frames as:

z'_1(i,j)\, p'_1(i,j) = K R_{1\to 2} K^{-1} z_1(i,j)\, p_1(i,j) + K \big(m_1(i,j)\, \delta t_{1\to 2}(i,j) + T_{1\to 2}\big),   (9)

where p'_1 and z'_1 are respectively the new homogeneous coordinates of the pixel and the new depth, projected onto frame 2, and K is the camera matrix. The above equation consists of the scene depth as obtained by the rigid motion of the scene, and the additional changes obtained from the motions of the individually movable objects. Note that the motion mask is only applied to regions of potentially movable objects m_1(i,j), determined by the semantic segmentation model. The movable mask m_1(i,j) (of frame 1) restricts the motion of objects relative to the scene to occur only at pixels that belong to movable objects.

6.3. Evaluation Protocol

In our experiment COCO served as the dedicated dataset for segmentation, and Cityscapes served as the unlabeled mediator dataset. Since the two datasets have different sets of labels, we had to create a mapping between the two. The mapping is shown in Table 4. Only labels that represent movable objects are of interest for our experiment. We therefore restricted our label set to 7 classes that are in the intersection of Cityscapes and COCO and represent movable objects. All other labels were mapped to label ID 8. When evaluating the segmentation on Cityscapes, we mapped the Cityscapes ground truth labels and the COCO-trained model predictions to these 8 labels.
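The label remapping of Table 4 can be expressed as two small lookup tables; the sketch below is our own illustration of how predictions and ground truth could be mapped to the 8 evaluation classes.

```python
# Map COCO and Cityscapes label IDs to the 8 evaluation classes of Table 4.
COCO_TO_OURS = {1: 1, 2: 2, 3: 3, 4: 4, 10: 5, 6: 6, 8: 7}
CITYSCAPES_TO_OURS = {24: 1, 25: 1, 33: 2, 26: 3, 32: 4, 19: 5, 28: 6, 27: 7}

def remap(label_id, table):
    # Any label not listed in Table 4 is a non-movable class: map it to ID 8 ("others").
    return table.get(label_id, 8)
```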
6.4. Hardware configuration

The three models in this experiment had different computational costs. Table 5 shows the duration of a training step for each of the three models on 8 NVIDIA P100 GPUs, for a batch of 32. Placing the segmentation model on a TPU node reduced its training step time to be closer to the other modules. This way the convergence was not gated on the segmentation model. Being able to train each module on a different hardware configuration is one of the strengths of our method.

Model | Hardware | Step time
Depth | GPU | 0.81 s
Motion | GPU | 0.83 s
Segmentation | GPU | 2.18 s
Segmentation | TPU | 1.42 s

Table 5: Time per training step for each of the three modules in Sec. 3.2.1 of the main paper, on various hardware platforms. The batch size is 32 in all cases. GPU denotes 8 NVIDIA P100 GPUs, and TPU denotes a Google Cloud TPU v2-8 unit.

6.5. Failure cases

Consistency improves correctness, but does not guarantee it. A set of predictions can be consistent with one another, but not correct. A simple example is a misdetection of a static object. If the segmentation network fails to segment out the same object on two consecutive frames, the consistency loss will not penalize this failure. Fig. 6 and Fig. 7 show some examples of failure cases that consistency was unable to fix.

Figure 6: Failure example 1: the segmentation network fails to segment out a white car on the left edge in two consecutive frames.

Figure 7: Failure example 2: the segmentation network fails to segment out some cars at the end of the road in two consecutive frames.

7. Depth and Surface Normals

To compute a consistency loss for the collective training of depth and normal prediction models, we compute surface normals from the predicted depth map and penalize their deviation from the predicted normal map, closely following the method in Ref. [27]. We first convert the depth map to a 3D point cloud, using the inverse of the intrinsics matrix:

\vec{r}_{ij} \equiv \begin{pmatrix} x_{i,j} \\ y_{i,j} \\ z_{i,j} \end{pmatrix} = z_{i,j} \cdot \begin{pmatrix} 1/f_x & 0 & -x_0/f_x \\ 0 & 1/f_y & -y_0/f_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} j \\ i \\ 1 \end{pmatrix},   (10)

where f_x and f_y denote the focal length, x_0 and y_0 the principal point offset, i and j are the pixel coordinates along the height and the width of the image respectively, and z_{ij} is the depth map evaluated at the pixel coordinates (i,j). \vec{r}_{ij} is a point in 3D space, in the camera coordinates, corresponding to pixel (i,j).

We then compute the spatial derivatives of the depth map:

(\partial_x \vec{r})_{i,j} = \vec{r}_{i,j+1} - \vec{r}_{i,j-1}, \qquad (\partial_y \vec{r})_{i,j} = \vec{r}_{i+1,j} - \vec{r}_{i-1,j}.   (11)

To exclude depth discontinuities, we invalidate pixels where the spatial gradient of the depth relative to the depth itself is greater than a certain threshold β. To this end, we define a validity mask V_{i,j}:

V_{i,j} = (V_x)_{i,j} \cdot (V_y)_{i,j},   (12)

where

(V_x)_{i,j} = \begin{cases} 1 & (\partial_x z)
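For illustration, the sketch below implements Eqs. (10)-(11) with NumPy and then derives a normal map from the two tangent vectors via a cross product; since the derivation is cut off above, the cross product and the exact form of the validity mask are our assumptions, and border handling is omitted (np.roll wraps around the image edges).

```python
import numpy as np

def normals_from_depth_full(depth, fx, fy, x0, y0, beta=0.05):
    """Back-project depth to a point cloud (Eq. 10), take spatial differences
    (Eq. 11), and form normals; the final steps are assumptions of ours."""
    h, w = depth.shape
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Eq. (10): r_ij = z_ij * K^{-1} (j, i, 1)^T
    x = depth * (j - x0) / fx
    y = depth * (i - y0) / fy
    r = np.stack([x, y, depth], axis=-1)
    # Eq. (11): central differences along the width (x) and height (y) axes.
    dr_dx = np.roll(r, -1, axis=1) - np.roll(r, 1, axis=1)
    dr_dy = np.roll(r, -1, axis=0) - np.roll(r, 1, axis=0)
    # Assumed normal direction: normalized cross product of the tangent vectors.
    n = np.cross(dr_dx, dr_dy)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    # Assumed validity mask (Eq. 12): invalidate pixels whose relative depth
    # gradient exceeds the threshold beta (depth discontinuities).
    dz_dx = np.roll(depth, -1, axis=1) - np.roll(depth, 1, axis=1)
    dz_dy = np.roll(depth, -1, axis=0) - np.roll(depth, 1, axis=0)
    valid = (np.abs(dz_dx) < beta * depth) & (np.abs(dz_dy) < beta * depth)
    return n, valid
```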