Pseudo-labeling for Scalable 3D Object Detection

Benjamin Caine*†, Rebecca Roelofs*†, Vijay Vasudevan†, Jiquan Ngiam†, Yuning Chai‡, Zhifeng Chen†, Jonathon Shlens†
†Google Brain, ‡Waymo
{bencaine,rofls}@google.com

arXiv:2103.02093v1 [cs.CV] 2 Mar 2021

* Denotes equal contribution and authors for correspondence.

Abstract

To safely deploy autonomous vehicles, onboard perception systems must work reliably at high accuracy across a diverse set of environments and geographies. One of the most common techniques to improve the efficacy of such systems in new domains involves collecting large labeled datasets, but such datasets can be extremely costly to obtain, especially if each new deployment geography requires additional data with expensive 3D bounding box annotations. We demonstrate that pseudo-labeling for 3D object detection is an effective way to exploit less expensive and more widely available unlabeled data, and can lead to performance gains across various architectures, data augmentation strategies, and sizes of the labeled dataset. Overall, we show that better teacher models lead to better student models, and that we can distill expensive teachers into efficient, simple students. Specifically, we demonstrate that pseudo-label-trained student models can outperform supervised models trained on 3-10 times the amount of labeled examples. Using PointPillars [24], a two-year-old architecture, as our student model, we are able to achieve state-of-the-art accuracy simply by leveraging large quantities of pseudo-labeled data. Lastly, we show that these student models generalize better than supervised models to a new domain in which we only have unlabeled data, making pseudo-label training an effective form of unsupervised domain adaptation.

Class     Geography    Baseline  Student  ∆
Vehicle   SF/MTV/PHX   49.1      58.9     +9.8
Ped       SF/MTV/PHX   53.4      64.6     +11.2
Vehicle   Kirkland     26.1      37.2     +11.1
Ped       Kirkland     14.5      27.1     +12.6

Figure 1: Pseudo-labeling for 3D object detection. Top: Training models with pseudo-labeling consists of a three-stage training process. (1) Supervised learning is performed on a teacher model using a limited corpus of human-labeled data. (2) The teacher model generates pseudo-labels on a larger corpus of unlabeled data. (3) A student model is trained on a union of labeled and pseudo-labeled data. Bottom: Summary of key results in 3D object detection performance on the Waymo Open Dataset [44] with a PointPillars model [24]. All numbers report validation set Level 1 difficulty average precision (AP) for vehicles and pedestrians. Both baselines and student models only have access to 10% of the labeled run segments from the original Waymo Open Dataset, which consists of data from San Francisco (SF), Mountain View (MTV), and Phoenix (PHX). We use no labels from the domain adaptation challenge dataset, Kirkland.

1. Introduction

Self-driving perception systems typically require sufficient human labels for all objects of interest and subsequently train machine learning systems using supervised learning techniques [48]. As a result, the autonomous vehicle industry allocates a vast amount of capital to gather large-scale human-labeled datasets in diverse environments [6, 14, 44]. However, supervised learning using human-labeled data faces a huge deployment hurdle: while the technique works well on in-domain problems, domain shifts can cause the performance to drop significantly [4, 17, 36, 45]. The reliance of self-driving vehicles on supervised learning implies that the rate at which one can gather human-labeled data in novel geographies and environmental conditions limits wider adoption of the technology. Furthermore, a supervised-learning-based approach is inefficient: for example, it would not leverage human-labeled data from Paris to improve self-driving perception in Rome [50]. Unfortunately, we currently have no scalable strategy to address these limitations.

We view the scaling limitations of supervised learning as a fundamental problem, and we identify a new training paradigm for adapting self-driving vehicle perception systems to different geographies and environmental conditions in which human-labeled data is limited or unavailable. We propose leveraging ideas from the literature on semi-supervised learning (SSL), which focuses on the low label regime, and boosts the performance of state-of-the-art models by leveraging unlabeled data. In particular, we employ a pseudo-labeling approach [26, 30, 39] to generate labeled data on additional datasets and find that such a strategy leads to significant boosts in performance on 3D object detection (Figure 1).

Additionally, we systematically investigate how to structure pseudo-label training to maximize model performance. We identify nuances not previously well understood in the literature for how best to implement pseudo-labeling and develop simple yet powerful recommendations for how to extract gains from it. Overall, our work demonstrates a viable method for leveraging unsupervised data – particularly from other domains – to boost state-of-the-art performance on in-domain and out-of-domain tasks. To summarize our contributions:

• We show pseudo-labeling is extremely effective for 3D object detection, and provide a systematic analysis of how to maximize its performance benefits.
• We demonstrate that pseudo-label training is effective and particularly useful for adapting to new geographical domains for autonomous vehicles.
• By optimizing the pseudo-label training pipeline (keeping both the architecture and labeled dataset fixed), we achieve state-of-the-art test set performance among comparable models, with 74.0 L1 AP for Vehicles and 69.8 L1 AP for Pedestrians, a gain of 5.4 and 1.9 AP, respectively, over the same supervised model.

2. Related Work

Semi-supervised learning. Semi-supervised learning (SSL) is an approach to training that typically combines a small amount of human-labeled data with a large amount of unlabeled data [27, 33, 35, 53]. Self-training refers to a style of SSL in which the predictions of a model on unlabeled data, termed pseudo-labels [26], are used as additional training data to improve performance [30, 39]. Several variants of self-training exist in the literature. Noisy-Student [52] uses a smaller, less noised teacher model to generate pseudo-labels, which are used to train a larger, noised student model, and the authors suggest performing multiple iterations of this process. FixMatch [41] combines self-training with consistency regularization [23, 38], a technique that applies random perturbations to the input or model to generate more labeled data. In prior work, self-training has been successfully applied to tasks such as speech recognition [20, 34], image segmentation [7], and 2D object detection in camera imagery [37, 42, 61] and video sequences [8].

3D object detection. Though several architectural innovations have been proposed for 3D object detection [29, 32, 54, 56, 57, 60], a recent focus has been on techniques that improve data efficiency, or the amount of data required to reach a certain performance. Data augmentation designed for 3D point clouds can significantly boost performance (see references in [9, 28]), and techniques to automatically learn appropriate data augmentation strategies have been shown to be 10 times more data efficient than baseline 3D detection models [9, 28]. Concurrent to our work, [51] shows gains applying knowledge distillation [18] to 3D detection, distilling a multi-frame model's features to a single-frame model in feature space, whereas we apply knowledge distillation in label space.

Several recent works also propose improving data efficiency by using weak supervision to augment existing labeled data: [46] incorporates existing 3D box priors to augment 2D bounding boxes, and [31] similarly generates additional 3D annotations by learning appropriate augmentations for labeled object centers. Finally, an automatic 3D bounding box labeling process is proposed by [55], which uses the full object trajectory to produce accurate bounding box predictions, though they do not show training results with these auto labels.

We view many of the techniques to improve data efficiency as complementary to our work, as improvements in either model architectures or data efficiency will provide additive performance benefits.

SSL for 3D object detection. Two prior works apply Mean Teacher [47] based semi-supervised learning techniques to 3D object detection [49, 58] on the indoor RGB-D datasets ScanNet [10] and SUN RGB-D [43]. SESS [58] trains a student model with several consistency losses between the student and the EMA-based teacher model, while 3DIoUMatch [49] proposes training directly on the pseudo-labels after filtering them via an IoU prediction mechanism. In contrast, we forgo a Mean-Teacher-based framework, finding separate teacher and student models to be practically advantageous, and we showcase performance on 3D LiDAR datasets designed to train self-driving car perception systems.

Domain adaptation. Robustness to geographies and environmental conditions is critical to making self-driving technology viable in the real world [48]. Recently, one group studied the task of adapting a 3D object detection architecture across self-driving vehicle datasets (e.g. [14, 19, 44]), and reported significant drops in accuracy when training on one dataset and testing on another [50]. Interestingly, such drops in accuracy could be attributed to differences in car sizes and are partially reduced by accounting for these size differences. In parallel, other recent work reports notable drops in accuracy across geographies within a single dataset [44] (see Table 9). However, unlike the former work, those drops in accuracy in this latter work cannot be accounted for by differences in car sizes¹. In our work, we experiment on this single dataset and are able to mitigate drops in accuracy across geographies.

We focus on one of the open challenges for the Waymo Open Dataset²: accurate 3D detection in a new city (Kirkland) with changing environmental conditions (rain) and limited human-labeled data. Currently, the state-of-the-art architecture for the Kirkland domain adaptation task [11] employs a single-stage, anchor-free and NMS-free 3D point cloud object detector equipped with multiple enhancements, including features from 2D camera neural networks, powerful data augmentation, frame stacking, test-time ensembling, and point cloud densification (but no pseudo-labeling). We do not implement this full set of enhancements, yet our baseline implementation achieves similar performance to their baseline architecture [13]; we focus instead on accuracy and robustness gains that can be achieved by leveraging a large amount of unlabeled data.

3. Methods

Our pseudo-labeling process (Figure 2) consists of three stages: training a teacher on labeled data, pseudo-labeling unlabeled data with said teacher, and training a student on the combination of the labeled and pseudo-labeled data. We perform and evaluate all of our experiments on the Waymo Open Dataset (version 1.1) [44] and the domain adaptation extension. We implement students and teachers as PointPillars [24] models using open-source implementations³, which are the baselines used by [9, 15, 32, 44].

3.1. Data Setup

The Waymo Open Dataset [44] is organized as a collection of run segments. Each run segment is a ∼200 frame sequence of LiDAR and camera data collected at 10 Hz.

¹ We found the average width and length of vehicles in Kirkland and the Waymo Open Dataset to be quite similar. For instance, in the validation splits of the Waymo OD and Kirkland datasets, we measured similar average lengths (4.8 m vs 4.6 m) and average widths (2.1 m vs 2.1 m) across O(10⁴) objects. These discrepancies are markedly less than those described in [50].
² https://waymo.com/open/challenges
³ https://github.com/tensorflow/lingvo/

Figure 2: Experimental setup. We conduct our experiments on the Waymo Open Dataset [44], where we artificially divide the dataset into labeled and unlabeled splits. We always treat run segments from Kirkland as unlabeled (even though a subset are labeled) and select subsets (e.g. 10%, 20%, ...) of the original Waymo Open Dataset run segments to train the teacher.
We use the teacher to pseudo-label all unseen run segments, and then train a student on the union of labeled and pseudo-labeled run segments. Finally, we evaluate both teacher and student models on the original Waymo Open Dataset and Kirkland validation splits.

These run segments come from two sets: the original Waymo Open Dataset, which has 798 labeled training run segments collected in San Francisco, Phoenix, and Mountain View, and the domain adaptation benchmark, which has 80 labeled and 480 unlabeled training run segments from Kirkland. Both datasets contain 3D bounding boxes for Pedestrian, Vehicle, and Cyclist, but, due to the low number of Cyclists in the data, we focus on the Pedestrian and Vehicle classes.

In our experiments, we treat all the Kirkland run segments as unlabeled data (even though labels do exist for 80 run segments). Our setup is similar to unsupervised domain adaptation, where only unlabeled data is available in the "target" domain, giving us a measure of how well the gains in accuracy on the Waymo Open Dataset generalize to a new domain⁴. In addition, our setup emulates a common scenario in which a practitioner has access to a large collection of unlabeled run segments and a much smaller subset of labeled run segments.

In order to study the effect of labeled dataset size, we randomly sample smaller training datasets from the Waymo Open Dataset. Because run segments are typically labeled efficiently as a sequence, we treat each run segment as either comprehensively labeled or unlabeled, and we sample based on the run segment IDs instead of individual frames. For example, selecting 10% of the original Waymo Open Dataset corresponds to selecting 10% of the run segments, i.e. 79 run segments, which provides ∼15,700 frames. If we were to instead randomly select 10% of frames, we would make the task artificially easier, as neighboring frames would be highly correlated, especially if the autonomous vehicle is moving slowly.

⁴ In addition to geographical nuances, Kirkland has notably different weather conditions, e.g. clouds and rain, than San Francisco, Phoenix, and Mountain View.
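The run-segment-level sampling described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the function name and string segment IDs are our own invention.

```python
import random

def split_by_run_segment(segment_ids, labeled_fraction, seed=0):
    """Sample a labeled subset at run-segment granularity.

    Sampling whole run segments, rather than individual frames,
    prevents highly correlated neighboring frames from landing on
    both sides of the labeled/unlabeled split.
    """
    rng = random.Random(seed)
    ids = sorted(segment_ids)  # deterministic base order before shuffling
    rng.shuffle(ids)
    n_labeled = int(labeled_fraction * len(ids))
    return set(ids[:n_labeled]), set(ids[n_labeled:])

# Selecting 10% of the 798 original Waymo OD run segments yields
# 79 labeled segments (~15,700 frames at ~200 frames per segment).
labeled, unlabeled = split_by_run_segment(
    [f"segment_{i:04d}" for i in range(798)], labeled_fraction=0.10)
```

Fixing the seed keeps the labeled subset identical across teacher and student runs, which the experiments in this paper require.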
3.2. Model Setup

All experiments use PointPillars [24] as a baseline architecture due to its simplicity, accuracy, and inference speed (see Appendix 6.1.1 for architecture details). To explore the impact of teacher accuracy, we use wider and multi-frame PointPillars models as teachers. To make the models wider, we multiply all channel dimensions by either 2× or 4×. To make a multi-frame teacher, we concatenate the point clouds from each frame with its previous N−1 frames transformed into the last frame's coordinate system.

3.3. Training Setup

Our training setup mirrors [9, 44]. We use the Adam optimizer [21] and train with an exponential decay schedule on the learning rate. All teachers and students are trained with the same schedule, but the length of an epoch for teacher and student models differs because the teacher is trained on less data than the student.

We use data augmentation strategies such as world rotation and scene mirroring, which showed strong improvement over not using augmentations. Table 1 provides an ablation study for these augmentations. Unless otherwise stated, all other training hyperparameters remain fixed between teacher and student. See Appendix 6.1.2 for additional details on the training setup.

3.4. Pseudo-Label Training

Pseudo-label training begins by training a teacher model using standard supervised learning on a labeled subset of run segments. Once we train the teacher, we select the best teacher model based on validation set performance on the Waymo Open Dataset and use it to pseudo-label the unlabeled run segments. Next, we train a student model on the same labeled data the teacher saw, plus all the pseudo-labeled run segments. The mixing ratio of labeled to pseudo-labeled data is determined by the percentage of data the teacher was trained on.

We filter the pseudo-labeled boxes to include only those with a classification score exceeding a threshold, which we select using accuracy on a validation set.
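This score filter amounts to a one-line predicate over the teacher's detections. A minimal sketch (the function name and the 7-value box layout are illustrative, not the paper's code):

```python
def filter_pseudo_labels(boxes, scores, threshold=0.5):
    """Keep only detections confident enough to serve as pseudo-labels.

    boxes:  list of 7-DOF boxes, e.g. (x, y, z, length, width, height,
            heading), as produced by the teacher on unlabeled frames.
    scores: the teacher's classification score for each box.
    """
    return [box for box, score in zip(boxes, scores) if score >= threshold]

# A threshold of 0.5 works well for most models; under-confident
# teachers (e.g. multi-frame Pedestrian models) prefer a lower ~0.3.
kept = filter_pseudo_labels(
    boxes=[(0.0, 0.0, 0.0, 4.5, 2.0, 1.6, 0.0),
           (5.0, 2.0, 0.0, 0.9, 0.8, 1.7, 1.6)],
    scores=[0.92, 0.41],
)
```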
We find a classification score threshold of 0.5 works well for most models, but a small subset of models (generally multi-frame Pedestrian models, which are poorly calibrated and systematically under-confident) benefit from a lower threshold. Finally, we evaluate the student's performance on the Waymo Open Dataset and Kirkland validation sets, where we always report Level 1 (L1) average precision (AP).

Figure 3: Better teachers lead to better students. We plot Level 1 AP on the Waymo Open Dataset validation set for Vehicles. When controlling for labeled dataset size, architecture, and training setup between teachers and students, teachers with a higher AP generally produce students with a higher AP. (Axes: x = Teacher AP, y = Student AP.)

4. Results

Using the Vehicle class, we first explore the relationship between teacher and student performance on the Waymo Open Dataset for various teacher configurations, and then evaluate generalization to Kirkland. Next, for both Vehicles and Pedestrians, we distill increasingly larger teachers into small, efficient student models, yielding large gains in accuracy with no additional labeled data or inference cost. Finally, we scale up these experiments with two orders of magnitude more unlabeled data, further demonstrating the efficacy of pseudo-labeling. We also describe some negative results, where we discuss some ideas we thought should work, but did not.

4.1. Better teachers lead to better students.

To understand how teacher performance impacts student performance, we control the accuracy of the teacher by varying the amount of labeled data, the teacher's width, and the strength of teacher training data augmentations. All experiments in this section are evaluated on Vehicles. In Figure 3, we show student-versus-teacher performance for teacher and student models with the same amount of labeled data, equivalent architectures, and equivalent training setups.

In general, higher accuracy teachers produce higher accuracy students. A relevant question is then: what techniques are most effective for improving teacher accuracy? To answer this, we evaluate each modification in turn on the Waymo Open Dataset.
Appendix 6.3 shows the corresponding experiments when evaluating on the Kirkland dataset.

Amount of labeled data. Compared to adding data augmentations or increasing teacher width, increasing the amount of labeled data yields the largest improvements. Figure 4 shows that increasing the fraction of labeled data improves both teacher and student performance, but the student gains diminish as the amount of unlabeled data decreases.

Note that Figure 4 shows the overall percent labeled (bottom axis) and unlabeled data (top axis) when we combine the 798 Waymo Open Dataset and 560 Kirkland run segments. Using 100% of the labeled data from the Waymo Open Dataset corresponds to having roughly 59% of the overall data labeled. We give teachers access to 10%, 20%, 30%, 50%, or 100% of the labels in the Waymo Open Dataset in this experiment, allowing us to evaluate the effect of having access to x% human-labeled and (1−x)% pseudo-labeled data from the Waymo Open Dataset.

Data augmentation. We find that adding data augmentation does lead to modest additional gains, mirroring the observations in [61], as long as it is applied to both the teacher and the student. In Table 1, we show that one way to generate stronger teacher models (and thus better students) is through stronger data augmentations. Although [52] emphasizes the importance of noising the student model, we found empirically that pseudo-label training can show gains even without data augmentation (see Appendix 6.2 for full results).

Teacher width. An additional way to generate better teachers is through scaling the model size (parameter count).

                        Waymo Open Dataset L1 AP
Teacher Augmentation    Teacher   Student   ∆
None                    56.3      62.2      +5.9
FlipY                   60.1      63.6      +3.5
RotateZ                 61.4      63.3      +2.1
RotateZ + FlipY         63.0      64.2      +1.2

Table 1: Stronger teacher augmentations lead to additive gains in student performance.
We increase the strength of teacher augmentations for a 1× width teacher model trained on 100% of the Waymo Open Dataset, while fixing the student to be a 1× width model trained with both RotateZ and FlipY augmentations. We report L1 validation set Vehicle AP on the Waymo Open Dataset. ∆ = Student AP − Teacher AP.

Figure 4: Pseudo-label training is most effective when the ratio of labeled to unlabeled data is small. Teacher and student L1 AP on the Waymo Open Dataset validation set for the Vehicle class versus overall percent labeled data.

Figure 5: Increasing teacher width leads to better students when labeled data is limited. We increase teacher width while fixing the student width at 1× and compare L1 AP for Vehicle models on the Waymo Open Dataset validation set. The teacher is trained on labeled data from either 10% or 100% of the original Waymo Open Dataset (bottom and top points, respectively). When the ratio of labeled to unlabeled data is small, student accuracy improves as the teacher gets wider. However, this effect disappears when the amount of pseudo-labeled data is small.

Because the teacher and student are different models, they can be of different sizes, architectures, or configurations. One useful strategy involves distilling a large, expensive offline model's performance into a small, efficient production model. In Figure 5, we vary the teacher width (1×, 2×, or 4×) by multiplying all its channel dimensions while keeping the student width fixed at 1×. We evaluate performance under two different fractions of available labeled data (10% or 100% of the original Waymo Open Dataset).

When only 10% of the original Waymo Open Dataset is labeled, the 1× width students outperform their wider teachers, in contrast to the findings in [52], which require the student to be equal to or larger than the teacher, suggesting that in the low labeled data regime this may not be as important. However, when 100% of the original Waymo Open Dataset is labeled, the 1× width student can no longer outperform the 4× width teacher on the original Waymo Open Dataset.
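For concreteness, the two world-level augmentations ablated in Table 1 can be sketched with numpy as below. This is our own minimal rendering, assuming boxes are parameterized as (x, y, z, l, w, h, heading); the production pipeline's exact conventions may differ.

```python
import numpy as np

def rotate_z(points, boxes, angle):
    """RotateZ: spin the whole scene about the vertical (Z) axis."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    new_points = points @ rot.T
    new_boxes = boxes.copy()
    new_boxes[:, :3] = boxes[:, :3] @ rot.T  # rotate box centers
    new_boxes[:, 6] += angle                 # headings rotate too
    return new_points, new_boxes

def flip_y(points, boxes):
    """FlipY: mirror the scene across the XZ plane."""
    new_points = points * np.array([1.0, -1.0, 1.0])
    new_boxes = boxes.copy()
    new_boxes[:, 1] *= -1.0   # mirror box centers
    new_boxes[:, 6] *= -1.0   # a mirror reverses headings
    return new_points, new_boxes
```

Because both transforms are applied to points and ground-truth boxes jointly, the label geometry stays consistent with the augmented scene.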
Ratio of labeled to unlabeled data. In our results, we found that in the setting where the ratio of labeled to unlabeled data is high – using 100% of the original Waymo Open Dataset's labels (798 segments) and only pseudo-labeling Kirkland's data (560 segments) – the student gains compared to the teacher diminish, and the student is unable to outperform a wider teacher. One hypothesis is that the lack of improvement is due to the small amount of unlabeled data that the student can benefit from. We test this hypothesis by using two orders of magnitude more unlabeled data in Section 4.4.

4.2. Generalization to Kirkland

In order to measure generalization to new geographies and environmental conditions, we evaluate all models on the Kirkland domain adaptation challenge dataset. The weather in Kirkland is rainier than the weather in the cities that comprise the Waymo Open Dataset, which increases the level of noise in the LiDAR data. We plot the model's performance on the Kirkland dataset versus the model's performance on the Waymo Open Dataset for both teacher and student models in Figure 6. We observe a clear linear relationship between the model's performance on the Waymo Open Dataset and the model's performance on Kirkland, implying that a model's accuracy on the Waymo Open Dataset can almost perfectly predict accuracy on the Kirkland dataset. Overall, the Kirkland performance is much lower than the Waymo Open Dataset performance, which we suspect is due to an underlying data distribution difference and the fact that we only use labeled data from the Waymo Open Dataset in training.

Figure 6: Pseudo-labeling improves performance on an unlabeled geographic domain. Stronger models on the original Waymo Open Dataset are also better on the Kirkland dataset (where we only have unlabeled data). Moreover, student models trained with pseudo-labeling generalize better to Kirkland than normally supervised teacher models. (Axes: x = Original Waymo OD 1.1 L1 AP, y = Kirkland L1 AP; markers distinguish Teachers from Students.)

Interestingly, the slope of the linear relationship changes depending on whether the model is a teacher or a student; the student models have a slightly higher slope than the teacher models, indicating that the student models are generalizing better to the Kirkland dataset. We find that the difference in slope is statistically significant by using an Analysis of Covariance (ANCOVA) test, which evaluates whether the means of our dependent variable (Kirkland AP) are equal across our categorical independent variable (whether the model is a student or not), while statistically controlling for accuracy on the Waymo Open Dataset. We find an F-score of 12.9, giving us a p-value less than 0.001, well below the 0.05 significance level for 95% confidence, leading us to reject the null hypothesis. Since the student models have a slightly higher Kirkland AP for a given Waymo Open Dataset AP, we conclude that the student models are slightly more robust to the Kirkland distribution shift.

4.3. Pushing labeled data efficiency

For practitioners, an important question is "How do I make the most accurate model given a fixed inference time budget and fixed amount of labeled data?" We assume that autonomous vehicle practitioners have more unlabeled than labeled data due to the relative ease of collecting vs. comprehensively labeling data. In our experiments, we show that better teachers still lead to better students, even as we make larger, more accurate teacher models, and that distilling an expensive, impractical offline model into an efficient, practical production model via pseudo-labeling is an effective technique. Additionally, we show via strong Kirkland validation set results (a domain where we use no labeled data) that pseudo-labeling is an effective form of unsupervised domain adaptation.

We improve the teacher by both scaling its width to 4× and concatenating up to four LiDAR frames as input. As with all of our experiments, our training setup for students mirrors the teacher, except that the student models are always 1× width, 1 frame. Our results are shown in Figure 7 and summarized in Figure 1.

For Vehicle models, we find that distilling a 4× width, 4 frame teacher model into a 1× width, 1 frame student model, using only 10% of the original Waymo Open Dataset labels, can match or exceed the performance of an equivalent supervised model trained with 5× that amount of labeled data.
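The statistical check in Section 4.2 can be reproduced with a small amount of linear algebra. The sketch below is our own formulation of the ANCOVA group-effect F-score as a comparison of two nested OLS fits (the paper presumably used a standard statistics package); function and variable names are illustrative.

```python
import numpy as np

def ancova_f_score(od_ap, kirkland_ap, is_student):
    """F-score for a student/teacher effect on Kirkland AP while
    controlling for Waymo OD AP, via two nested OLS fits:
      reduced:  kirkland_ap ~ 1 + od_ap
      full:     kirkland_ap ~ 1 + od_ap + is_student
    """
    od_ap, y, g = (np.asarray(v, dtype=float)
                   for v in (od_ap, kirkland_ap, is_student))
    n = len(y)

    def residual_ss(design):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return float(resid @ resid), design.shape[1]

    ones = np.ones(n)
    rss_reduced, _ = residual_ss(np.column_stack([ones, od_ap]))
    rss_full, k = residual_ss(np.column_stack([ones, od_ap, g]))
    # The full model has one extra parameter -> 1 numerator df.
    return (rss_reduced - rss_full) / (rss_full / (n - k))
```

A large F-score (the paper reports 12.9, p < 0.001) rejects the null hypothesis that teachers and students share the same Kirkland-vs-OD relationship.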
Our Pedestrian model is even more remarkable: using only 10% of the original Waymo Open Dataset run segment labels, our student model outperforms an equivalent supervised baseline on Kirkland trained on 10× the amount of labels from the original Waymo Open Dataset. Our results show that unlabeled data in-domain can be vastly more effective than labeled data from a different domain. Additionally, in Appendix 6.4 we show that these results hold when doubling the amount of labeled data.

Figure 7: Increasingly large teachers distill into small, accurate students. Increasing the width or number of frames for the teacher impacts performance of a fixed-size student (1× width, 1 frame) in the low label (10% of run segments) regime for vehicles (top) and pedestrians (bottom). Training a large, expensive teacher model, and distilling the teacher into a small, efficient student, is an efficient tactic. We present results for the Waymo Open Dataset (left) and Kirkland (right). (Supervised reference lines, for models trained on 10%/50%/100% of OD labels: Vehicles 49.1/57.7/63.0 AP on the Waymo OD and 26.1/37.0/41.8 AP on Kirkland; Pedestrians 53.4/66.6/69.0 AP on the Waymo OD and 14.5/25.2/24.9 AP on Kirkland.)

Training      OD/Kir    OD/Kir     OD L1 AP
Method        #label    #pseudo    Veh    ∆      Ped    ∆
baseline      800/0     0/0        63.0   –      69.0   –
semi-super    800/0     0/560      64.2   +1.2   69.8   +0.8
semi-super    800/0     0/8k       65.1   +2.1   68.8   −0.9
semi-super    800/0     67k/8k     68.8   +5.8   70.5   +1.5

Table 2: Pseudo-labeling increases accuracy in domain. The number of labels is reported in run segments. All performance numbers report validation set L1 difficulty AP for the original Waymo Open Dataset with the same 1× width, 1 frame network architecture. Only the training method varies across each experiment. ∆ indicates the difference in AP with respect to the baseline model, which is trained only on the Waymo Open Dataset (OD). Semi-supervised uses a 4× width, 4 frame teacher model trained on OD labeled data to provide pseudo-labels and then trains the student on the joint labeled and pseudo-labeled data. We include Kirkland data to show that out-of-domain data also provides gains, but not as large.

4.4. Pushing unlabeled dataset size

We return to our hypothesis that pseudo-labeling works best when the ratio of labeled data to unlabeled data is low. In practice, unlabeled self-driving data is plentiful, so understanding how pseudo-labeling performs as the unlabeled dataset gets significantly larger is important.

To scale the size of our unlabeled dataset, we were granted access to >100× more unlabeled data from San Francisco (one of the three cities in the original Waymo Open Dataset) and Kirkland. This data contains ∼67,000 run segments from San Francisco and ∼8,000 run segments from Kirkland, as compared to the original 798 run segments from the original Waymo Open Dataset and 560 run segments from Kirkland.

Empirically, we find that explicitly controlling the ratio of labeled to unlabeled data becomes important, as our unlabeled data otherwise overwhelms the labeled data.

⁵ Note that we find that the 4× width, 4 frame Pedestrian models were systematically under-confident, and lowering the pseudo-label score threshold from 0.5 to 0.3 improved results.
Training      OD/Kir    OD/Kir     Kirkland L1 AP
Method        #label    #pseudo    Veh    ∆      Ped    ∆
baseline      800/0     0/0        41.8   –      24.8   –
supervised    800/80    0/0        45.0   +3.2   30.3   +5.5
semi-super    800/0     0/560      44.5   +2.7   28.4   +3.6
semi-super    800/0     0/8k       48.0   +6.2   29.3   +4.5
semi-super    800/0     67k/8k     49.7   +7.9   27.3   +2.5

Table 3: Pseudo-labeling out-of-domain data outperforms supervised training on new geographies. We report validation set L1 difficulty AP for Kirkland with the same 1× width, 1 frame architecture, only varying the training method. ∆ is the difference in AP with respect to the baseline model. The baseline model is trained on the Waymo Open Dataset (OD) but tested on a distinct geography (Kirkland). Supervised is trained on labeled data from OD and the distinct geography (Kirkland). Semi-supervised uses a 4× width, 4 frame teacher model trained on OD labeled data to provide pseudo-labels, and trains on the joint labeled and pseudo-labeled data.

                       Vehicle            Pedestrian
Model                  L1 AP   L1 APH     L1 AP   L1 APH
Waymo Open Dataset
Second [54]            50.1    49.6       –       –
StarNet [32]           63.5    63.0       67.8    60.1
PointPillars† [24]     68.6    68.1       67.9    55.5
SA-SSD [16]            70.2    69.5       57.1    48.8
RCD [3]                71.9    71.6       –       –
Ours†                  74.0    73.6       69.8    57.9
Kirkland
PointPillars† [24]     49.3    48.8       37.5    29.7
Ours†                  56.2    55.7       36.1    28.5

Table 4: Test set results on the Waymo Open Dataset (top) and Kirkland Dataset (bottom). We compare to other published single frame, LiDAR-only, non-ensemble methods. † indicates that both models were implemented, trained, and evaluated by us, and are identical models in training setup and parameter count; the only difference is that our model was trained on ∼75k unlabeled run segments.

We train a 4× width, 4 frame teacher on the original Waymo Open Dataset, and use this to pseudo-label all ∼75,000 unlabeled run segments. We then train student models with a mix of all 798 labeled original Waymo Open Dataset run segments and a subset of these new pseudo-labeled run segments.
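The labeled/pseudo-labeled mixing described above can be sketched as a simple batch assembler. This is a minimal illustration under our own naming, treating training items as opaque values; the real pipeline mixes examples inside the input reader.

```python
import itertools
import random

def mixed_batches(labeled, pseudo_labeled, ratio=5, batch_size=6, seed=0):
    """Yield batches holding labeled and pseudo-labeled items at 1:ratio.

    Fixing the ratio keeps a huge pseudo-labeled corpus (e.g. ~75k run
    segments) from overwhelming the 798 labeled ones; the small labeled
    set is cycled so it is revisited as often as needed.
    """
    rng = random.Random(seed)
    n_labeled = max(1, batch_size // (1 + ratio))
    labeled_stream = itertools.cycle(labeled)
    pseudo = list(pseudo_labeled)
    rng.shuffle(pseudo)
    while pseudo:
        batch = [next(labeled_stream) for _ in range(n_labeled)]
        batch += [pseudo.pop()
                  for _ in range(min(batch_size - n_labeled, len(pseudo)))]
        rng.shuffle(batch)
        yield batch

batches = list(mixed_batches(["L0", "L1"], [f"P{i}" for i in range(10)]))
```

With the default 1:5 ratio and batch size 6, each batch carries exactly one labeled item alongside five pseudo-labeled ones.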
While we did not exhaustively sweep the ratio of labeled-to-unlabeled data, in general we found a 1:5 ratio to work best (except for our Pedestrian model, which used all ∼75,000 run segments and worked best with a ratio of 1:1)⁵.

Our results show continued gains as we scale the amount of unlabeled data on both the Waymo Open Dataset (Table 2) and Kirkland (Table 3). Vehicle models significantly improve, with a +5.8 AP improvement on the Waymo Open Dataset validation set and a +7.9 AP improvement on the Kirkland validation set. For Pedestrians on the validation set, we see smaller gains of +1.5 AP on the original Waymo Open Dataset and +4.5 on Kirkland, and more sensitivity to where the unlabeled data came from. Our analysis shows that many scenes, especially in Kirkland, have very few or zero pedestrians⁶. We suspect that this introduces biases into the training process, and we leave to future work how best to choose which pseudo-labeled frames to train on.

⁶ In the labeled validation sets, we found 70% of scenes in the original Waymo Open Dataset had Pedestrians, with an average of 12.4 per scene, whereas in Kirkland only 22% of scenes had Pedestrians, with an average of 0.57 per scene.

We confirm these gains by evaluating on the test sets in Table 4, where we achieve state-of-the-art accuracy on both Vehicles and Pedestrians among all published single frame, LiDAR-only, non-ensemble results available. We reiterate that we do not change the architecture, model hyperparameters, or training setup of the student; our only change is to add additional unlabeled data via pseudo-labeling.

4.5. Negative results

Finally, we briefly touch on ideas that did not work, despite positive evidence in the literature for other tasks [52]. First, we tried two forms of soft labels, neither of which showed a gain. Second, we performed multiple iterations of training, which showed a small gain but which we deemed too time-consuming to be worthwhile. Third, we explored whether there was an ambiguous range of classification scores within which we should assume the pseudo-object is neither labeled foreground nor background, such that anchors assigned to pseudo-label objects with these scores should receive no loss. We detail our experiments in Appendix 6.6.

5. Conclusion

Our work presents the first results of applying pseudo-label training to 3D object detection for self-driving car perception. We use a simple form of pseudo-labeling that requires no architecture innovation, yet when deployed in a semi-supervised learning paradigm, it leads to substantial gains over supervised learning baselines on vehicle and pedestrian detection. Most interestingly, gains persist in the presence of domain shift and new environments where building new supervised label datasets has been a barrier to safe, wide deployment. Furthermore, we identify several prescriptions for maximizing pseudo-label-based training, including the construction of better teacher model architectures and leveraging data augmentation. To summarize our main results:

• By distilling a large teacher model into a smaller student model and leveraging a large corpus of unlabeled data, we use a two-year-old architecture [24] to achieve state-of-the-art results⁷ of 74.0 / 69.8 L1 AP (+5.4 / +1.9 over supervised baseline) for Vehicles / Pedestrians, respectively, on the Waymo Open Dataset test set.
• Using only 10% of the labeled run segments, we show that Vehicle and Pedestrian student models can outperform equivalent supervised models trained with 3-10× as much labeled data, achieving a gain of 9.8 AP or larger for both classes and datasets.
• On the Kirkland Domain Adaptation Challenge, we show that pseudo-labeling produces more robust student models; our best model outperforms the equivalent supervised model by 7.9 / 4.5 L1 AP on the Kirkland validation set for Vehicles and Pedestrians, respectively.

Overall, our work continues a long-standing theme of adapting unsupervised and semi-supervised learning techniques to problems in domain adaptation and the low label limit [1, 2, 5, 40]. A majority of these methods have been tested on synthetic problems [5, 12] or small academic datasets [22, 25], and accordingly, such works leave open the question of how these methods may fare in the real world. We suspect that domain adaptation in self-driving car perception may present a large-scale problem that may address such concerns and may help orient the semi-supervised learning field to a problem of critical importance for self-driving cars.

Acknowledgements

We would like to thank Drago Anguelov, Shuyang Cheng, Ekin Dogus Cubuk, Barret Zoph, Rapha Gontijo Lopes, Wei Han, Zhaoqi Leng, Thang Luong, Charles Qi, Pei Sun, and Yin Zhou for helpful feedback on this work.

References

[5] … Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722–3731, 2017.
[6] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[7] Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D Collins, Ekin D Cubuk, Barret Zoph, Hartwig Adam, and Jonathon Shlens. Leveraging semi-supervised learning in video sequences for urban scene segmentation. In European Conference on Computer Vision (ECCV), 2020.
[8] Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D Collins, Ekin D Cubuk, Barret Zoph, Hartwig Adam, and Jonathon Shlens. Semi-supervised learning in video sequences for urban scene segmentation. arXiv preprint arXiv:2005.10266, 2020.
[9] Shuyang Cheng, Zhaoqi Leng, Ekin Dogus Cubuk, Barret Zoph, Chunyan Bai, Jiquan Ngiam, Yang Song, Benjamin Caine, Vijay Vasudevan, Congcong Li, et al. Improving 3D object detection through progressive population based augmentation. arXiv preprint arXiv:2004.00831, 2020.
[10] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes, 2017.
[11] Zhuangzhuang Ding, Yihan Hu, Runzhou Ge, Li Huang, Sijia Chen, Yu Wang, and Jie Liao. 1st place solution for Waymo Open Dataset challenge – 3D detection and domain adaptation. arXiv preprint arXiv:2006.15505, 2020.
[12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
9 References [13] RunzhouGe,ZhuangzhuangDing,YihanHu,YuWang,Sijia Chen,LiHuang,andYuanLi. Afdet:Anchorfreeonestage [1] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex 3dobjectdetection. arXivpreprintarXiv:2006.12671,2020. Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. 3 Remixmatch: Semi-supervised learning with distribution [14] AndreasGeiger,PhilipLenz,ChristophStiller,andRaquel alignment and augmentation anchoring. arXiv preprint Urtasun. Visionmeetsrobotics:Thekittidataset. TheInter- arXiv:1911.09785,2019. 9 nationalJournalofRoboticsResearch,32(11):1231–1237, [2] DavidBerthelot,NicholasCarlini,IanGoodfellow,Nicolas 2013. 1,3 Papernot,AvitalOliver,andColinARaffel. Mixmatch: A [15] WeiHan,ZhengdongZhang,BenjaminCaine,BrandonYang, holisticapproachtosemi-supervisedlearning. InAdvances ChristophSprunk,OuaisAlsharif,JiquanNgiam,VijayVa- inNeuralInformationProcessingSystems,pages5049–5059, sudevan, Jonathon Shlens, and Zhifeng Chen. Streaming 2019. 9 objectdetectionfor3-dpointclouds. InEuropeanConfer- [3] AlexBewley,PeiSun,ThomasMensink,DragomirAnguelov, enceonComputerVision(ECCV),2020. 3 andCristianSminchisescu. Rangeconditioneddilatedconvo- [16] ChenhangHe,HuiZeng,JianqiangHuang,Xian-ShengHua, lutionsforscaleinvariant3dobjectdetection,2020. 8 andLeiZhang. Structureawaresingle-stage3dobjectde- [4] BattistaBiggioandFabioRoli.Wildpatterns:Tenyearsafter tectionfrompointcloud. InProceedingsoftheIEEE/CVF theriseofadversarialmachinelearning. PatternRecognition, Conference on Computer Vision and Pattern Recognition, 2018. 1 pages11873–11882,2020. 8 [5] KonstantinosBousmalis,NathanSilberman,DavidDohan, [17] DanHendrycksandThomasDietterich.Benchmarkingneural DumitruErhan,andDilipKrishnan.Unsupervisedpixel-level networkrobustnesstocommoncorruptionsandperturbations. domainadaptationwithgenerativeadversarialnetworks. In In International Conference on Learning Representations (ICLR),2019. 
7 When compared to other single-frame, LiDAR-only, non-ensemble models.

[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.
[19] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. arXiv preprint arXiv:2006.14480, 2020.
[20] Jacob Kahn, Ann Lee, and Awni Hannun. Self-training for end-to-end speech recognition. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE, 2020.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40(7):1–9, 2010.
[23] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[24] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.
[25] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[26] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.
[27] Li-Jia Li and Li Fei-Fei. OPTIMOL: automatic online picture collection via incremental model learning. IJCV, 2010.
[28] Ruihui Li, Xianzhi Li, Pheng-Ann Heng, and Chi-Wing Fu. PointAugment: an auto-augmentation framework for point cloud classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6378–6387, 2020.
[29] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
[30] Geoffrey J McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365–369, 1975.
[31] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Luc Van Gool, and Dengxin Dai. Weakly supervised 3D object detection from lidar point cloud. arXiv preprint arXiv:2007.11901, 2020.
[32] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, et al. StarNet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019.
[33] George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, 2015.
[34] Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629, 2020.
[35] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In CVPR, 2018.
[36] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400, 2019.
[37] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. WACV/MOTION, 2005.
[38] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1163–1171, 2016.
[39] H Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 1965.
[40] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2107–2116, 2017.
[41] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.
[42] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
[43] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[44] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
[45] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2013.
[46] Yew Siang Tang and Gim Hee Lee. Transferable semi-supervised 3D object detection from RGB-D data. In Proceedings of the IEEE International Conference on Computer Vision, pages 1931–1940, 2019.
[47] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, 2018.
[48] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, et al. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9):661–692, 2006.
[49] He Wang, Yezhen Cong, Or Litany, Yue Gao, and Leonidas J. Guibas. 3DIoUMatch: Leveraging IoU prediction for semi-supervised 3D object detection, 2020.
[50] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Train in Germany, test in the USA: Making 3D object detectors generalize. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11713–11723, 2020.
[51] Yue Wang, Alireza Fathi, Jiajun Wu, Thomas Funkhouser, and Justin Solomon. Multi-frame to single-frame: Knowledge distillation for 3D object detection. arXiv preprint arXiv:2009.11859, 2020.
[52] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves ImageNet classification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[53] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
[54] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[55] Bin Yang, Min Bai, Ming Liang, Wenyuan Zeng, and Raquel Urtasun. Auto4D: Learning to label 4D objects from sequential point clouds, 2021.
[56] Bin Yang, Ming Liang, and Raquel Urtasun. HDNET: Exploiting HD maps for 3D object detection. In Conference on Robot Learning, pages 146–155, 2018.
[57] Bin Yang, Wenjie Luo, and Raquel Urtasun. PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
[58] Na Zhao, Tat-Seng Chua, and Gim Hee Lee. SESS: Self-ensembling semi-supervised 3D object detection, 2020.
[59] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3D object detection in lidar point clouds. In Conference on Robot Learning, pages 923–932, 2020.
[60] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
[61] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.

6. Appendix

6.1. Model and training details

6.1.1 PointPillars architecture

We use the "Pedestrian" version of the PointPillars architecture for both the Vehicle and the Pedestrian classes, which uses a stride of 1 for the first convolutional block (instead of 2). This results in the output resolution matching the input resolution, which we found important for maintaining accuracy when scaling PointPillars to larger scenes. We adopt a resolution of 512 pixels, spanning [-76.8 m, 76.8 m] in both X and Y, and a Z range of [-3 m, 3 m], giving us a pixel size of 0.33 m, which is similar to what is used by [59] on the Waymo Open Dataset. Additionally, on all models, we replace hard voxelization, which samples a fixed number of points per voxel, with dynamic voxelization [59], which allows the model to use all the points in the point cloud and makes it able to efficiently handle larger point clouds. Adding dynamic voxelization has a negligible effect on accuracy.

6.1.2 Training details

We use the Adam optimizer with an initial learning rate of 3.2e-3. We train for a total of 75 epochs with a batch size of 64. An exponential decay schedule of the learning rate starts at epoch 5. For models trained with 10% of the original Waymo Open Dataset labeled run segments, we double the training time, so that the total number of epochs is 150 and the exponential decay starts at epoch 10. Lastly, on the large-scale experiments in Section 4.4, we train for 15 total epochs, with our exponential decay starting at epoch 2. We apply an exponential moving average (EMA) decay of 0.99 on all variables and use L2 regularization with scaling constant 1e-4.
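A minimal sketch of this schedule and the EMA update is given below. The per-epoch decay factor `DECAY_RATE` is a hypothetical placeholder, since the text specifies when decay starts but not its rate:

```python
# Sketch of the Section 6.1.2 training schedule: constant learning rate until
# decay begins, then exponential decay per epoch, plus an exponential moving
# average (EMA) over all variables with decay 0.99.
# DECAY_RATE is an assumed value; the paper does not state the decay rate.

INIT_LR = 3.2e-3
DECAY_START_EPOCH = 5
DECAY_RATE = 0.9  # hypothetical per-epoch decay factor


def learning_rate(epoch):
    """Learning rate for a given epoch: flat, then exponentially decayed."""
    if epoch < DECAY_START_EPOCH:
        return INIT_LR
    return INIT_LR * DECAY_RATE ** (epoch - DECAY_START_EPOCH)


def ema_update(ema_params, params, decay=0.99):
    """One EMA step over a dict of variables: ema <- decay*ema + (1-decay)*param."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}
```

For the 10%-label runs described above, one would double `DECAY_START_EPOCH` to 10 along with the total epoch count.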
Our anchor box prior corresponds to the mean box dimensions for each class and is [4.725, 2.079, 1.768] for Vehicles and [0.901, 0.857, 1.712] for Pedestrians. Our anchors have two rotations of [0, π/2] and are placed in the middle of each voxel. In order to compute the loss function during training, we assign an anchor to a ground truth box if its IoU is greater than 0.6 for Vehicles and 0.5 for Pedestrians, and to background if the IoU is below 0.45 for Vehicles and 0.35 for Pedestrians. Boxes with IoU between these values have a loss weight of 0, and we use force matching to make sure every ground truth is assigned at least one box.

Unless specified otherwise, all students and teachers are trained with two data augmentations: RandomWorldRotationAboutZAxis and RandomFlipY. For RandomWorldRotationAboutZAxis, we choose a random rotation of up to π/4 to apply to the world around the Z axis. For RandomFlipY, we flip the Y coordinate, which can be thought of as mirroring the scene over the X axis, with probability 0.25.

Figure 8: Kirkland evaluation: Better teachers lead to better students. If the student model has an equivalent architecture or training setup compared to the teacher, teachers with a higher AP produce students with a higher AP. All numbers are Vehicle models reporting Level 1 AP on the Kirkland validation split.

6.2. Is data augmentation necessary?

Aug.?             OD/Kirkland AP
Teach.   Stud.    Teacher      Student      ∆
No       No       56.3/33.6    59.2/38.0    +2.9/+4.4
No       Yes      56.3/33.6    62.2/40.9    +5.9/+7.4
Yes      No       63.0/41.8    59.5/39.5    -3.5/-2.3
Yes      Yes      63.0/41.8    64.2/44.0    +1.2/+2.2

Table 5: Data augmentation is not necessary, but beneficial. While data augmentation is not necessary, the best student is achieved when both the student and teacher receive the same advantages. Results are on 1× width, 1 frame vehicle models where both the teacher and student saw 100% of the original Waymo Open Dataset labeled run segments.

6.3. Kirkland results

We provide all the corresponding Kirkland validation set figures on Vehicles for Section 4.1. We show that all of our results shown for the Waymo Open Dataset still hold when we evaluate on Kirkland.
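As a concrete illustration of the anchor matching rule in Section 6.1.2 above, here is a minimal per-anchor sketch. The thresholds come from the text; the force-matching step that guarantees every ground truth at least one anchor is omitted:

```python
# Sketch of the per-anchor assignment rule from Section 6.1.2: foreground above
# the high IoU threshold, background below the low one, and a loss weight of 0
# in the ambiguous band in between. Force matching is omitted for brevity.

IOU_THRESHOLDS = {"vehicle": (0.6, 0.45), "pedestrian": (0.5, 0.35)}  # (fg, bg)


def assign_anchor(iou, cls):
    """Return (assignment, loss_weight) for one anchor given its best ground-truth IoU."""
    fg_thresh, bg_thresh = IOU_THRESHOLDS[cls]
    if iou > fg_thresh:
        return "foreground", 1.0
    if iou < bg_thresh:
        return "background", 1.0
    return "ignore", 0.0  # ambiguous IoU band: contributes no loss
```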
Better teachers lead to better students. Similar to Figure 3, Figure 8 shows that improving the teacher accuracy leads to a corresponding increase in student accuracy.

Amount of labeled data. Again mirroring our results in Figure 4, Figure 9 shows that increasing the amount of labeled data increases both teacher and student performance. We also see similar (though less severe) diminishing returns as the ratio of labeled to unlabeled data gets larger. Both teachers and students are 1× width, 1 frame in these experiments.

Figure 9: Kirkland evaluation: Pseudo-label training is most effective when the ratio of labeled to unlabeled data is small. Teacher and student L1 AP on the Waymo Open Dataset validation set for the Vehicle class versus percent labeled data.

Data augmentations. We also show the equivalent of Table 6 when evaluating on Kirkland:

                      Kirkland L1 AP
Teacher Augmentation  Teacher   Student   ∆
None                  33.6      40.9      +7.3
RotateZ               39.7      43.0      +3.3
FlipY                 36.9      44.4      +7.5
RotateZ + FlipY       41.8      44.0      +2.2

Table 6: Increasing the strength of teacher augmentations leads to additive gains in student performance. We increase the strength of teacher augmentations for a 1× width teacher model trained on 100% of the Waymo Open Dataset. The student model is a fixed 1× width model trained with both RotateZ and FlipY augmentations. We report L1 validation set Vehicle AP on the Kirkland dataset. ∆ is the difference in AP between the student model and the teacher model.

Teacher width. Figure 10 shows the effect of increasing teacher width for teachers trained on either 10% or 100% of the Waymo Open Dataset. We see the same result as in Figure 5: when we hold the student configuration fixed at 1× width, 1 frame, increasing the teacher width leads to an increase in student accuracy in the low-data regime (10% of original Waymo Open Dataset run segments). We see for Kirkland that, similar to the original Waymo Open Dataset, when the ratio of labeled to unlabeled data is large (100% of original Waymo Open Dataset run segments), this effect disappears.

Figure 10: Kirkland evaluation: Increasing teacher width leads to better students. We make teachers wider while fixing the student width at 1× and report L1 AP for Vehicle models on the Kirkland validation set. The teacher is trained on labeled data from either 10% or 100% of the original Waymo Open Dataset (top and bottom points, respectively). The student is trained on the labeled data seen by the teacher plus all unlabeled data from the Waymo Open Dataset and its Kirkland dataset. When the ratio of labeled to unlabeled data is small, the student accuracy improves as the teacher gets wider; however, this effect disappears when the amount of pseudo-labeled data is small. We further investigate this by adding more unlabeled data in Section 4.4.

6.4. Pushing labeled data efficiency

Here we provide additional results where we push the accuracy of our student models on a limited amount of data organized as run segments. We replicated the experiment shown in Figure 7 using 20% of the original Waymo Open Dataset run segments (so ∼11.4% of the overall run segments are labeled), and show similar gains. Additionally, we provide raw numerical values for all data points from these plots in Table 7 and Table 8, to allow others to compare against us.
Figure 11: Increasingly large teachers distill into small, accurate students. Increasing the width or number of frames for the teacher impacts performance on a fixed-size student (1× width, 1 frame) in the low-label (20% of run segments) regime for vehicles (top) and pedestrians (bottom). Training a large, expensive teacher model and distilling the teacher into a small, efficient student is an effective tactic. Results presented for the Waymo Open Dataset (left) and Kirkland (right).
Model Details                                      OD/Kirkland L1 AP
Teacher            Student            %OD Labels   Teacher     Baseline    Student     ∆ Baseline
1x Width, 1 Frame  1x Width, 1 Frame  10           49.1/26.1   49.1/26.1   54.6/33.5   +5.5/+7.4
4x Width, 1 Frame  1x Width, 1 Frame  10           52.2/28.7   49.1/26.1   57.7/35.3   +8.6/+9.2
4x Width, 4 Frame  1x Width, 1 Frame  10           54.1/30.4   49.1/26.1   58.9/37.2   +9.8/+11.1
1x Width, 1 Frame  1x Width, 1 Frame  20           53.5/33.1   53.5/33.1   59.0/40.1   +5.5/+7.0
4x Width, 1 Frame  1x Width, 1 Frame  20           58.6/38.4   53.5/33.1   61.1/43.1   +7.6/+10.0
4x Width, 4 Frame  1x Width, 1 Frame  20           60.0/39.8   53.5/33.1   61.2/44.2   +7.7/+11.1
1x Width, 1 Frame  –                  30           56.4/36.0   –           –           –
1x Width, 1 Frame  –                  50           57.7/37.0   –           –           –
1x Width, 1 Frame  –                  100          63.0/41.8   –           –           –

Table 7: Vehicle results for single frame, normal width student models trained with increasingly complex (wider, multi-frame) teacher models. We show how it is advantageous to distill a complex, off-board model into a simple onboard model using pseudo-labeling. All numbers are on the corresponding validation set, and are Level 1 difficulty mean average precision (AP).

Model Details                                      OD/Kirkland L1 AP
Teacher            Student            %OD Labels   Teacher     Baseline    Student     ∆ Baseline
1x Width, 1 Frame  1x Width, 1 Frame  10           53.4/14.5   53.4/14.5   58.8/19.1   +5.4/+4.6
4x Width, 1 Frame  1x Width, 1 Frame  10           59.2/21.5   53.4/14.5   61.4/20.9   +8.0/+6.4
4x Width, 4 Frame  1x Width, 1 Frame  10           64.0/27.6   53.4/14.5   64.6/27.1   +11.2/+12.6
1x Width, 1 Frame  1x Width, 1 Frame  20           59.2/16.0   59.2/16.0   61.7/20.3   +2.5/+4.3
4x Width, 1 Frame  1x Width, 1 Frame  20           64.4/22.5   59.2/16.0   65.4/20.4   +6.2/+4.4
4x Width, 4 Frame  1x Width, 1 Frame  20           68.8/30.8   59.2/16.0   66.8/26.0   +7.6/+10.0
1x Width, 1 Frame  –                  30           62.3/23.3   –           –           –
1x Width, 1 Frame  –                  50           66.6/25.3   –           –           –
1x Width, 1 Frame  –                  100          69.0/24.8   –           –           –

Table 8: Pedestrian results for single frame, normal width student models trained with increasingly complex (wider, multi-frame) teacher models. We show how it is advantageous to distill a complex, off-board model into a simple onboard model using pseudo-labeling. All numbers are on the corresponding validation set, and are Level 1 difficulty mean average precision (AP).

6.5. Different Teacher and Student Architectures

In the main text we show that the teacher and student architecture can be different configurations, and in fact using a larger teacher is an effective way to generate significantly stronger, small student models. One remaining question is whether the teacher and student architectures need to be from the same architecture family, or even similar in their data representation. To test this, we design a very simple experiment where we take our best 10% original Waymo OD run segment PointPillars teacher model (the exact model used in Figures 1 & 7), and use it to pseudo-label the remaining Waymo Open Dataset. We then train a StarNet [32] student model on the union of the 10% labeled run segments and the remaining data pseudo-labeled by PointPillars. We chose StarNet because it is a purely point-cloud-based, convolution-free object detection system, which differs significantly from PointPillars' convolution-based architecture. Results are summarized in Table 9, which shows strong gains in StarNet accuracy when using a PointPillars teacher.

                      Vehicle         Pedestrian
Model                 L1 AP   ∆       L1 AP   ∆
Waymo Open Dataset
StarNet 10% Baseline  47.7    -       61.2    -
StarNet Student       55.6    +7.9    66.5    +4.3
Kirkland
StarNet 10% Baseline  26.3    -       6.7     -
StarNet Student       35.2    +8.9    22.2    +15.5

Table 9: Pseudo-labeling is effective across very different architectures. We distill a 4× width, 4 frame PointPillars teacher model into a single frame StarNet model and see large gains in StarNet performance, despite it being an extremely different architecture.

6.6. Negative result details

In this section, we provide some more details about our negative results.

Soft-labels: We explored two forms of soft labels: the first was to use the post-sigmoid score, bounded between [0, 1], as the target; the second was to use the logit itself. In object detection, because the outputs are passed through Non-Maximum Suppression (NMS), we only have scores and logits for foreground locations; therefore, background anchors were all assigned a score or logit of 1. We found both techniques resulted in slightly worse performance than simply using hard labels.

Multiple iterations: We tried multiple-iteration training, where we used the best student checkpoint to re-pseudo-label the unlabeled data, and used that updated pseudo-labeled data to train a new student. While our trend thus far has shown that better teachers lead to better students, it is challenging to combat the overfitting that will naturally occur. It is our understanding that this is one of the main reasons one wants to heavily noise the student [52], but we found it difficult to find a noise level (via augmentations) that did not hamper model performance to the point that the second iteration was worse. With default settings, using the same augmentations for the teacher, the first student, and the second student, we found a small gain in performance using 10% of the original Waymo Open Dataset run segments: ∼0.2 AP on the original validation set and ∼1.0 AP on the Kirkland validation set. Because of the small gains compared to the first iteration, and the time-consuming nature of performing these experiments, we left further exploration to future work.

Score thresholds: We wondered whether there may be some classification score range for pseudo-labels for which the class is ambiguous and we should assign no loss. We allowed anchors to be assigned a loss of zero if these anchors matched (via normal IoU matching) pseudo-label objects with scores between some [lower, upper] range. We then swept these two values, and found the most effective results were when both values were [0.5, 0.5], indicating this setting should be turned off. That said, we think the idea of limiting the noise induced by bad pseudo-labels merits future investigation.
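The score-threshold experiment above can be sketched as a loss mask over anchors. `pseudo_label_loss_weights` and its inputs are illustrative names, not the paper's implementation:

```python
# Sketch of the score-threshold negative result: anchors that matched (via the
# usual IoU matching) a pseudo-label box whose classification score lies in an
# ambiguous [lower, upper] band receive zero loss. Collapsing the band toward
# [0.5, 0.5] effectively disables the mechanism.

def pseudo_label_loss_weights(matched_scores, lower, upper):
    """matched_scores[i] is the teacher score of the pseudo-box matched by anchor i,
    or None if the anchor matched no pseudo-label box."""
    weights = []
    for score in matched_scores:
        ambiguous = score is not None and lower <= score <= upper
        weights.append(0.0 if ambiguous else 1.0)
    return weights
```

For example, with a band of [0.4, 0.6], an anchor matched to a pseudo-box scored 0.55 would be masked out, while anchors matched to confident boxes (or to no pseudo-box) would keep their normal loss weight.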