Offboard 3D Object Detection from Point Cloud Sequences

Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, Dragomir Anguelov
Waymo LLC

Abstract

While current 3D object recognition research mostly focuses on the real-time, onboard scenario, there are many offboard use cases of perception that are largely under-explored, such as using machines to automatically generate high-quality 3D labels. Existing 3D object detectors fail to satisfy the high-quality requirement for offboard uses due to the limited input and speed constraints. In this paper, we propose a novel offboard 3D object detection pipeline using point cloud sequence data. Observing that different frames capture complementary views of objects, we design the offboard detector to make use of the temporal points through both multi-frame object detection and novel object-centric refinement models. Evaluated on the Waymo Open Dataset, our pipeline named 3D Auto Labeling shows significant gains compared to the state-of-the-art onboard detectors and our offboard baselines. Its performance is even on par with human labels, verified through a human label study. Further experiments demonstrate the application of auto labels for semi-supervised learning and provide extensive analysis to validate various design choices.

1. Introduction

Recent years have seen rapid progress in 3D object recognition with advances in 3D deep learning and strong application demands. However, most 3D perception research has been focusing on real-time, onboard use cases and only considers sensor input from the current frame or a few history frames. Those models are sub-optimal for many offboard use cases where the best perception quality is needed.
Among them, one important direction is to have machines "auto label" the data to save the cost of human labeling. High-quality perception can also be used for simulation, or to build datasets to supervise or evaluate downstream modules such as behavior prediction.

In this paper, we propose a novel pipeline for offboard 3D object detection with a modular design and a series of tailored deep network models. The offboard pipeline makes use of the whole sensor sequence input (such video data is common in applications of autonomous driving and augmented reality).

Figure 1. Our offboard 3D Auto Labeling achieved significant gains over two representative onboard 3D detectors (the efficient PointPillar [24] and the top-performing PV-RCNN [50]). The relative gains (the percentage numbers) are higher under a more strict standard (higher IoU thresholds). The metric is 3D AP (L1) for vehicles on the Waymo Open Dataset [57] val set. (Axes: 3D average precision vs. 3D IoU threshold, 0.5 to 0.8.)

With no constraints on model causality and little constraint on model inference speed, we are able to greatly expand the design space of 3D object detectors and achieve significantly higher performance.

We design our offboard 3D detector based on a key observation: different viewpoints of an object, within a point cloud sequence, contain complementary information about its geometry (Fig. 2). An immediate baseline design is to extend the current detectors to use multi-frame inputs.
However, while multi-frame detectors are effective, they are still limited in the amount of context they can use and are not naively scalable to more frames: gains from adding more frames diminish quickly (Table 5).

In order to fully utilize temporal point clouds (e.g. 10 or more seconds), we step away from the common frame-based input structure where the entire frames of point clouds are merged. Instead, we turn to an object-centric design. We first leverage a top-performing multi-frame detector to give us initial object localization. Then, we link objects detected at different frames through multi-object tracking. Based on the tracked boxes and the raw point cloud sequences, we can extract the entire track data of an object, including all of its sensor data (point clouds) and detector boxes, which is 4D: 3D spatial plus 1D temporal. We then propose novel deep network models to process such 4D object track data and output temporally consistent and high-quality boxes of the object. As they are similar to how a human labels an object, and because of their high-quality output, we call the models processing the 4D track data "object-centric auto labeling models" and the entire pipeline "3D Auto Labeling" (Fig. 3).

Figure 2. Illustration of the complementary views of an object from the point cloud sequence. Point clouds (aggregated from multiple frames: 1 frame, 5 frames, 10 frames and all 146 frames) visualized in a top-down view for a mini-van.

We evaluate our proposed models on the Waymo Open Dataset (WOD) [57], which is a large-scale autonomous driving benchmark containing 1,000+ Lidar scan sequences with 3D annotations for every frame. Our 3D Auto Labeling pipeline dramatically lifts the perception quality compared to existing 3D detectors designed for the real-time, onboard use cases (Fig. 1 and Sec. 5.1). The gains are even more significant at higher standards. To understand how far we are from human performance in 3D object detection, we have conducted a human label study to compare auto labels with human labels (Sec. 5.2). To our delight, we found that auto labels are already on par with, or even slightly better than, human labels on the selected test segments.

In Sec. 5.3, we demonstrate the application of our pipeline for semi-supervised learning and show significantly improved student models trained with auto labels. We also conduct extensive ablation and analysis experiments to validate our design choices in Sec. 5.4 and Sec. 5.5 and provide visualization results in Sec. 5.6.

In summary, the contributions of our work are:

• Formulation of the offboard 3D object detection problem and proposal of a specific pipeline (3D Auto Labeling) that leverages our multi-frame detector and novel object-centric auto labeling models.

• State-of-the-art 3D object detection performance on the challenging Waymo Open Dataset.

• The human label study on 3D object detection with comparisons between human and auto labels.

• Demonstrated effectiveness of auto labels for semi-supervised learning.

2. Related Work

3D object detection. Most work has been focusing on using single-frame input. In terms of the representations used, methods can be categorized into voxel-based [59, 12, 27, 56, 23, 69, 53, 77, 68, 24, 73, 61], point-based [51, 71, 41, 45, 70, 52], perspective-view-based [28, 39, 5] as well as hybrid strategies [76, 72, 9, 16, 50]. Several recent works explored temporal aggregation of Lidar scans for point cloud densification and shape completion. [36] fuses multi-frame information by concatenating feature maps from different frames. [18] aggregates (motion-compensated) points from different Lidar sweeps into a single scene. [74] uses graph-based spatiotemporal feature encoding to enable message passing among different frames. [19] encodes previous frames with a LSTM to assist detection in the current frame. Using multi-modal input (camera views and 3D point clouds) [23, 8, 46, 66, 31, 30, 38, 54, 44] has shown improved 3D detection performance compared to point-cloud-only methods, especially for small and far-away objects. In this work, we focus on a point-cloud-only solution and on leveraging data over a long temporal interval.

Learning from point cloud sequences. Several recent works [34, 15, 40] proposed to learn to estimate scene flow from dynamic point clouds using end-to-end trained deep neural networks (from a pair of consecutive point clouds). Extending such ideas, MeteorNet [35] showed that longer sequence input can lead to performance gains for tasks such as action recognition, semantic segmentation and scene flow estimation. There are also other applications of learning in point cloud sequences, like point cloud completion [43], future point cloud prediction [63] and gesture recognition [42]. We also see more released datasets with sequence point cloud data, such as the Waymo Open Dataset [57] for detection and the SemanticKITTI dataset [4] for 3D semantic segmentation.

Auto labeling. The large datasets required for training data-hungry models have increased annotation costs noticeably in recent years. Accurate auto labeling can dramatically reduce annotation time and cost. Previous works on auto labeling were mainly focused on 2D applications. Lee et al. proposed pseudo-labeling [25] to use the most confident predicted category of an image classifier as labels to train it on the unlabeled part of the dataset. More recent works [20, 78, 67, 65] have further improved the procedures to use pseudo labels and demonstrated wide success, including state-of-the-art results on ImageNet [10].

For 3D object detection, recently, Zakharov et al. [75] proposed an auto labeling framework using pre-trained 2D detectors to annotate 3D objects. While effective for loose localization (i.e. IoU of 0.5), there is a considerable performance gap for applications requiring higher precision. [37] tried to leverage weak center-click supervision to reduce the 3D labels needed. Several other works [7, 3, 26, 32, 13] have also proposed methods to assist human annotators and consequently reduce the annotation cost.

3. Offboard 3D Object Detection

Problem statement. Given a sequence of sensor inputs (temporal data) of a dynamic environment, our goal is to localize and classify objects in the 3D scene for every frame. Specifically, we consider the input of a sequence of point clouds $\{P_i \in \mathbb{R}^{n_i \times C}\}, i = 1, 2, \ldots, N$, with the point cloud $P_i$ ($n_i$ points with $C$ channels per point) for each of the $N$ total frames. The point channels include the XYZ in the sensor's coordinate (at each frame) and other optional information such as color and intensity. We also assume known sensor poses $\{M_i = [R_i | t_i] \in \mathbb{R}^{3 \times 4}\}, i = 1, 2, \ldots, N$, at each frame in the world coordinate, such that we can compensate for the ego-motion. For each frame, we output amodal 3D bounding boxes (parameterized by center, size and orientation), class types (e.g. vehicles) and unique object IDs for all objects that appear in the frame.

Design space. Access to temporal data (history and future) has led to a much larger design space of detectors compared to just using single-frame input.

One baseline design is to extend the single-frame 3D object detectors to use multi-frame input. Although previous works [36, 18, 19, 74] have shown its effectiveness, a multi-frame detector is hard to scale up to more than a few frames and cannot compensate for the object motions, since frame stacking is done for the entire scene. We observe that the contributions of multi-frame input to the detector quality diminish as we stack more frames (Table 5). Another idea is to extend the second stage of two-stage detectors [46, 51] to take object points from multiple frames. Compared to taking multi-frame input of the whole scene, the second stage only processes proposed object regions. However, it is not intuitive to decide how many context frames to use. Setting a fixed number may work well for some objects but be suboptimal for others.

Compared to the frame-centric designs above, where input is always from a fixed number of frames, we recognize the necessity to adaptively choose the temporal context size for each object independently, leading to an object-centric design. As shown in Fig. 3, we can leverage the powerful multi-frame detector to give us the initial object localizations. Then for each object, through tracking, we can extract all relevant object point clouds and detection boxes from all frames in which it appears. Subsequent models can take such object track data to output the final track-level refined boxes of the objects. As this process emulates how a human labeler annotates a 3D object in the point cloud sequence (localize, track and refine the track over time), we chose to refer to our pipeline as 3D Auto Labeling.

Figure 3. The 3D Auto Labeling pipeline. Given a point cloud sequence as input, the pipeline first leverages a 3D object detector to localize objects in each frame. Then object boxes at different frames are linked through a multi-object tracker. Object track data (its point clouds at every frame as well as its 3D bounding boxes) are extracted for each object and then go through the object-centric auto labeling (with a divide-and-conquer for static and dynamic tracks) to generate the final "auto labels", i.e. refined 3D bounding boxes.

4. 3D Auto Labeling Pipeline

Fig. 3 illustrates our proposed 3D Auto Labeling pipeline. We will introduce each module of the pipeline in the following sub-sections.
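Before detailing the modules, the pose convention from the problem statement (Sec. 3) can be made concrete with a minimal NumPy sketch (illustrative only, not the paper's code; the array shapes are our assumptions): given the known pose $M_i = [R_i | t_i]$, a frame's points are mapped from the sensor coordinate to the world coordinate, which removes the ego-motion before any cross-frame aggregation.

```python
import numpy as np

def to_world(points_xyz: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Transform per-frame points from sensor to world coordinates.

    points_xyz: (n_i, 3) XYZ channels of frame i in the sensor frame.
    pose:       (3, 4) matrix M_i = [R_i | t_i] for frame i.
    """
    rotation, translation = pose[:, :3], pose[:, 3]
    return points_xyz @ rotation.T + translation

# Toy example: a 90-degree yaw plus a 5 m translation along X.
pose = np.array([[0.0, -1.0, 0.0, 5.0],
                 [1.0,  0.0, 0.0, 0.0],
                 [0.0,  0.0, 1.0, 0.0]])
pts = np.array([[1.0, 0.0, 0.0]])
print(to_world(pts, pose))  # → [[5. 1. 0.]]
```

Applying the inverse transform of another frame's pose after this step would likewise express the points in that frame's sensor coordinate.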
4.1. Multi-frame 3D Object Detection

MVF++. As the entry point to our pipeline, accurate object detection is essential for the downstream modules. In this work, we propose the MVF++ 3D detector by extending the top-performing Multi-View Fusion [76] (MVF) detector in three aspects: 1) to enhance the discriminative ability of point-level features, we add an auxiliary loss for 3D semantic segmentation, where points are labeled as positives/negatives if they lie inside/outside of a ground truth 3D box; 2) to obtain more accurate training targets and improve training efficiency, we eliminate the anchor matching step in the MVF paper and adopt the anchor-free design as in [58]; 3) to leverage the ample computational resources available in the offboard setting, we redesign the network architecture and increase the model capacity. Please see Sec. C in the Appendix for details.

Multi-frame MVF++. We extend the MVF++ model to use multiple LiDAR scans. Points from multiple consecutive scans are transformed to the current frame based on ego-motion. Each point is extended by one additional channel encoding the relative temporal offset, similar to [18]. The aggregated point cloud is used as the input to the MVF++.

Test-time augmentation. We further boost the 3D detection through test-time augmentation (TTA) [22], by rotating the point cloud around the Z-axis by 10 different angles (i.e. [0, ±1/8π, ±1/4π, ±3/4π, ±7/8π, π]) and ensembling predictions with weighted box fusion [55]. While it may lead to excessive computational complexity for onboard uses, in the offboard setting TTA can be parallelized across multiple devices for fast execution.

4.2. Multi-object Tracking

The multi-object tracking module links detected objects across frames. Given the powerful multi-frame detector, we choose to take the tracking-by-detection path and use a separate non-parametric tracker. This leads to a simpler and more modular design compared to the joint detection and tracking methods [36, 64, 29].
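As a concrete illustration of tracking-by-detection, the sketch below is a heavily simplified stand-in, not the paper's tracker (which is a variant of [62] with a Kalman filter): here a constant-velocity alpha-beta filter replaces the Kalman update, association is greedy nearest-neighbor on box centers, and the gains and distance gate are invented values.

```python
import numpy as np

class Track:
    """One tracked object: 2D/3D center plus a constant-velocity estimate."""

    def __init__(self, track_id, center):
        self.id = track_id
        self.center = np.asarray(center, dtype=float)
        self.velocity = np.zeros_like(self.center)

    def predict(self):
        # Constant-velocity motion model: where we expect the object next.
        return self.center + self.velocity

    def update(self, detection, alpha=0.6, beta=0.3):
        # Alpha-beta filter: a simplified stand-in for a Kalman update.
        residual = np.asarray(detection, dtype=float) - self.predict()
        self.center = self.predict() + alpha * residual
        self.velocity = self.velocity + beta * residual

def associate(tracks, detections, max_dist=2.0):
    """Greedily match each predicted track to the nearest unclaimed detection
    within `max_dist` meters; returns {track_index: detection_index}."""
    unmatched = list(range(len(detections)))
    matches = {}
    for t_idx, track in enumerate(tracks):
        if not unmatched:
            break
        dists = [np.linalg.norm(track.predict() - detections[d]) for d in unmatched]
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            matches[t_idx] = unmatched.pop(best)
    return matches
```

A production tracker would also handle track birth/death and could use optimal (Hungarian) assignment instead of the greedy loop.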
Our tracker is an implementation variant of [62], using detector boxes for association and a Kalman filter for state updates.

4.3. Object Track Data Extraction

Given tracked detection boxes for an object, we can extract object-specific LiDAR point clouds from the sequence. We use the term object track data to refer to such 4D (3D spatial and 1D temporal) object information.

To extract object track data, we first transform all boxes and point clouds to the world coordinate through the known sensor poses to remove the ego-motion. For each unique object (according to the object ID), we crop its object points within the estimated detector boxes (enlarged by α meters in each direction to include more context). Such extraction gives us a sequence of object point clouds $\{P_{j,k}\}, k \in S_j$, for each object $j$ and its visible frames $S_j$. Fig. 3 visualizes the object points for several vehicles. Besides the raw point clouds, we also extract the tracked boxes for each object at every frame, $\{B_{j,k}\}, k \in S_j$, in the world coordinate.

Figure 4. The static object auto labeling model. Taking as input the merged object points in the world coordinate, the model outputs a single box for the static object.

4.4. Object-centric Auto Labeling

In this section, we describe how we take the object track data to "auto label" the objects. As illustrated in Fig. 3, the process includes three sub-modules: the track-based motion state classification, static object auto labeling and dynamic object auto labeling, which are described in detail below.

Divide and conquer: motion state estimation. In the real world, lots of objects are completely static during a period of time.
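As an aside, the per-object cropping of Sec. 4.3 can be sketched as follows (an illustrative NumPy snippet, not the paper's code; the box parameterization with a single yaw heading and the default α value are our assumptions): points are expressed in the box coordinate, where the +X axis is the heading direction and the origin is the box center, and kept if they fall inside the enlarged box.

```python
import numpy as np

def crop_object_points(points_world, box_center, box_size, heading, alpha=0.5):
    """Return points inside a detector box enlarged by `alpha` meters per side.

    points_world: (n, 3) points with ego-motion removed (world coordinate).
    box_center:   (3,) box center.
    box_size:     (3,) (length, width, height).
    heading:      box yaw around the Z-axis, in radians.
    The default alpha is an illustrative placeholder, not the paper's value.
    """
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Box coordinate: +X is the heading direction, origin at the box center.
    # (points - center) @ rot equals applying the inverse rotation rot.T.
    local = (points_world - box_center) @ rot
    half = np.asarray(box_size) / 2.0 + alpha
    mask = np.all(np.abs(local) <= half, axis=1)
    return points_world[mask]
```

Running this per frame over a track's boxes yields the object point cloud sequence $\{P_{j,k}\}$ described above.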
For example, parked cars or furniture in a room do not move within a few minutes or hours. For offboard detection, it is preferable to assign a single 3D bounding box to a static object rather than separate boxes in different frames, to avoid jittering.

Based on this observation, we take a divide-and-conquer approach to handle static and moving objects differently, introducing a module to classify an object's motion state (static or not) before the auto labeling. While it could be hard to predict an object's motion state from just a few frames (due to perception noise), we find it relatively easy if all object track data is used. As the visualization in Fig. 3 shows, it is often obvious to tell whether an object is static or not from its trajectory. A linear classifier using a few heuristic features from the object track's boxes can already achieve 99%+ motion state classification accuracy for vehicles. More details are in Sec. E.

Static object auto labeling. For a static object, the model takes the merged object point clouds ($P_j = \cup \{P_{j,k}\}$ in the world coordinate) from points at different frames and predicts a single box. The box can then be transformed to each frame through the known sensor poses.

Fig. 4 illustrates our proposed model for static object auto labeling. Similar to [46, 51], we first transform (through rotation and translation) the object points to a box coordinate before the per-object processing, such that the point clouds are more aligned across objects. In the box coordinate, the +X axis is the box heading direction and the origin is the box center. Since we have the complete sequence of the detector boxes, we have multiple options for which box to use as the initial box. The choice actually has a significant impact on model performance. Empirically, using the box with the highest detector score leads to the best performance (see Sec. I for an ablation study).

To attend to the object, the object points are passed through an instance segmentation network to segment the foreground ($m$ foreground points are extracted by the mask). Inspired by Cascade-RCNN [6], we iteratively regress the object's bounding box. At test time, we can further improve box regression accuracy by test-time augmentation (similar to Sec. 4.1).

All networks are based on the PointNet [47] architecture. The model is supervised by the segmentation and box estimation ground truths. Details of the architecture, losses and the training process are described in Sec. F.

Dynamic object auto labeling. For a moving object, we need to predict different 3D bounding boxes for each frame. Due to the sequence input/output, the model design space is much larger than that for static objects. A baseline is to re-estimate the 3D bounding box with cropped point clouds. Similar to the smoothing in tracking, we can also refine boxes based on the sequence of the detector boxes. Another choice is to "align" or register object points with respect to a key frame (e.g. the current frame) to obtain a denser point cloud for box estimation. However, the alignment can be a harder problem than box estimation, especially for occluded or faraway objects with fewer points. Besides, it is challenging to align deformable objects like pedestrians.

We propose a design (Fig. 5) that leverages both the point cloud and the detector box sequences without aligning points to a key frame explicitly. Given a sequence of object point clouds $\{P_{j,k}\}$ and a sequence of detector boxes $\{B_{j,k}\}$ for the object $j$ at frames $k \in S_j$, the model predicts the object box at each frame $k$ in a sliding-window form. It consists of two branches, one taking the point sequence and the other taking the box sequence.

Figure 5. The dynamic object auto labeling model. Taking a sequence of object points and a sequence of object boxes, the model runs in a sliding window fashion and outputs a refined 3D box for the center frame. Input point and box colors represent frames.

For the point cloud branch, the model takes a sub-sequence of the object point clouds $\{P_{j,k}\}_{k=T-r}^{T+r}$. After adding a temporal encoding channel to each point (similar to [18]), the sub-sequence points are merged through union and transformed to the box coordinate of the detector box $B_{j,T}$ at the center frame. Following that, we have a PointNet [47] based segmentation network to classify the foreground points (of the $2r+1$ frames) and then encode the object points into an embedding through another point encoder network.

For the box sequence branch, the box sequence $\{B_{j,k}\}_{k=T-s}^{T+s}$ of $2s+1$ frames is transformed to the box coordinate of the detector box at frame $T$. Note that the box sub-sequence can be longer than the point sub-sequence to capture the longer trajectory shape. A box sequence encoder network (a PointNet variant) will then encode the box sequence into a trajectory embedding, where each box is a point with 7-dim geometry and 1-dim time encoding.

Next, the computed object embedding and the trajectory embedding are concatenated to form the joint embedding, which is then passed through a box regression network to predict the object box at frame $T$.

5. Experiments

We start the section by comparing our offboard 3D Auto Labeling with state-of-the-art 3D object detectors in Sec. 5.1. In Sec. 5.2 we compare the auto labels with the human labels. In Sec. 5.3, we show how the auto labels can be used to supervise a student model to achieve improved performance in the low-label regime or in another domain. We provide analysis of the multi-frame detector in Sec. 5.4, analysis experiments to validate our designs of the object-centric auto labeling models in Sec. 5.5, and finally visualize the results in Sec. 5.6.

Dataset. We evaluate our approach using the challenging Waymo Open Dataset (WOD) [57], as it provides a large collection of LiDAR sequences with 3D labels available for each frame. The dataset includes a total of 1150 sequences, with 798 for training, 202 for validation and 150 for testing.
Each LiDAR sequence lasts around 20 seconds with a sampling frequency of 10 Hz. For our experiments, we evaluate both 3D and bird's eye view (BEV) object detection metrics for vehicles and pedestrians.

Method | frames | Vehicle 3D AP (IoU=0.7 / 0.8) | Vehicle BEV AP (IoU=0.7 / 0.8) | Pedestrian 3D AP (IoU=0.5 / 0.6) | Pedestrian BEV AP (IoU=0.5 / 0.6)
StarNet [41] | 1 | 53.70 / - | - / - | 66.80 / - | - / -
PointPillar [24]* | 1 | 60.25 / 27.67 | 78.14 / 63.79 | 60.11 / 40.35 | 65.42 / 51.71
Multi-view fusion (MVF) [76] | 1 | 62.93 / - | 80.40 / - | 65.33 / - | 74.38 / -
AFDET [14] | 1 | 63.69 / - | - / - | - / - | - / -
ConvLSTM [19] | 4 | 63.60 / - | - / - | - / - | - / -
RCD [5] | 1 | 68.95 / - | 82.09 / - | - / - | - / -
PillarNet [61] | 1 | 69.80 / - | 87.11 / - | 72.51 / - | 78.53 / -
PV-RCNN [50]* | 1 | 70.47 / 39.16 | 83.43 / 69.52 | 65.34 / 45.12 | 70.35 / 56.63
Single-frame MVF++ (Ours) | 1 | 74.64 / 43.30 | 87.59 / 75.30 | 78.01 / 56.02 | 83.31 / 68.04
Multi-frame MVF++ w. TTA (Ours) | 5 | 79.73 / 49.43 | 91.93 / 80.33 | 81.83 / 60.56 | 85.90 / 73.00
3D Auto Labeling (Ours) | all | 84.50 / 57.82 | 93.30 / 84.88 | 82.88 / 63.69 | 86.32 / 75.60

Table 1. 3D object detection results for vehicles and pedestrians on the Waymo Open Dataset val set. Methods in comparison include prior state-of-the-art single-frame 3D detectors as well as our single-frame MVF++, our multi-frame MVF++ (5 frames) and our full 3D Auto Labeling pipeline. The metrics are L1 3D AP and bird's eye view (BEV) AP at two IoU thresholds: the common standard IoU=0.7 and a high standard IoU=0.8 for vehicles; and IoU=0.5, 0.6 for pedestrians. (*) reproduced results using the authors' released code.
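For reference, the AP numbers reported in Table 1 follow the usual average-precision recipe, sketched below as a generic illustration (not the official WOD evaluation, which additionally defines difficulty breakdowns such as L1 and its own matching rules): detections are sorted by confidence, precision and recall are accumulated, and precision is integrated over recall.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP from scored detections at a fixed IoU threshold.

    scores: detection confidences.
    is_tp:  whether each detection matched a ground-truth box with IoU
            above the threshold (one match per ground truth).
    num_gt: number of ground-truth boxes.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_gt
    # Integrate precision over recall (area under the PR curve).
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

For example, three detections scored [0.9, 0.8, 0.7] with match flags [True, False, True] against two ground-truth boxes give an AP of 5/6 under this recipe.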
5.1. Comparing with State-of-the-art Detectors

In Table 1, we show comparisons of our 3D object detectors and the 3D Auto Labeling with various single-frame and multi-frame detectors, under both the common-standard IoU threshold and a higher-standard IoU threshold to pressure test the models.

We show that our single-frame MVF++ already outperforms the prior art single-frame detector PV-RCNN [50]. The multi-frame version of the MVF++, as a baseline of the offboard 3D detection methods, significantly improves upon the single-frame MVF++ thanks to the extra information from the context frames.

For vehicles, comparing the last three rows, our complete 3D Auto Labeling pipeline, which leverages the multi-frame MVF++ and the object-centric auto labeling models, further improves the detection quality, especially at the higher-standard IoU threshold of 0.8. It improves the 3D AP@0.8 significantly, by 14.52 points compared to the single-frame MVF++ and by 8.39 points compared to the multi-frame MVF++, which is already very powerful by itself. These results show the great potential of leveraging long sequences of point clouds for offboard perception.

We also show the detection AP for the pedestrian class, where we consistently observe the leading performance of the 3D Auto Labeling pipeline, especially at the higher localization standard (IoU=0.6), with a 7.67-point gain compared to the single-frame MVF++ and a 3.13-point gain compared to the multi-frame MVF++.

Method | 3D AP (IoU=0.7 / 0.8) | BEV AP (IoU=0.7 / 0.8)
Human | 86.45 / 60.49 | 93.86 / 86.27
3DAL (Ours) | 85.37 / 56.93 | 92.80 / 87.55

Table 2. Comparing human labels and auto labels in 3D object detection. The metrics are 3D and BEV APs for vehicles on the 5 sequences from the Waymo Open Dataset val set. Human APs are computed by comparing them with the WOD's released ground truth and using the number of points in boxes as human label scores.

5.2. Comparing with Human Labels

In many perception domains such as image classification and speech recognition, researchers have collected data to understand humans' capability [49, 11, 33]. However, to the best of our knowledge, no such study exists for 3D recognition, especially for 3D object detection. To fill this gap, we conducted a small-scale human label study on the Waymo Open Dataset to understand the capability of humans in recognizing objects in a dynamic 3D scene. We randomly selected 5 sequences from the Waymo Open Dataset val set and asked three experienced labelers to re-label each sequence independently (with the same labeling protocol as WOD).

In Table 2, we report the mean AP of human labels and auto labels across the 5 sequences.
With the common 3D AP@0.7 (L1) metric, the auto labels are only around 1 point lower than the average labeler, although the gap is slightly larger under the more strict 3D AP@0.8 metric. With some visualization, we found the larger gap is mostly caused by inaccurate heights. The comparison with the BEV AP@0.8 metric verifies our observation: when we do not consider height, the auto labels even outperform the average human labels by 1.28 points.

With such high quality, we believe the auto labels can be used to pre-label point cloud sequences to assist and accelerate human labeling, or be used directly to train lightweight student models as shown in the following section.

5.3. Applications to Semi-supervised Learning

In this section, we study the effectiveness of our auto labeling pipeline in the task of semi-supervised learning to train a student model under two settings: intra-domain and cross-domain. We choose the student model as a single-frame MVF++ detector that can run in real-time.

For the intra-domain semi-supervised learning, we randomly select 10% of the sequences (79) in the main WOD training set to train our 3D Auto Labeling (3DAL) pipeline. Once trained, we apply it to the remaining 90% of the sequences (719) in the main training set to generate "auto labels" (we only keep boxes with scores higher than 0.1). In Table 3 (first two rows), we see that reducing the human annotations to 10% significantly lowers the student model's performance. However, when we use auto labels, the student model trained on 10% human labels and 90% auto labels gets similar performance compared to using 100% human labels (AP gaps smaller than 1 point), demonstrating the superb data efficiency auto labels can provide.

Training Data | Test Data | 3D AP | BEV AP
100% main train (Human) | main val | 71.2 | 86.9
10% main train (Human) | main val | 64.3 | 81.2
10% main train (Human) + 90% main train (3DAL) | main val | 70.0 | 86.4
100% main train (Human) | domain test | 59.4 | N/A
100% main train (Human) + domain (Self Anno.) | domain test | 60.3 | N/A
100% train (Human) + domain (3DAL) | domain test | 64.2 | N/A

Table 3. Results of semi-supervised learning with auto labels. Metrics are 3D and BEV AP for vehicles on the Waymo Open Dataset. The type of annotation is reported in parentheses. Please note, test set BEV AP is not provided by the submission server.

For the cross-domain semi-supervised learning, the teacher auto labels data from an unseen domain. The teacher is trained on the main WOD train set, and auto labels the domain adaptation WOD train and unlabeled sets (separate 680 sequences from the main WOD). The student is then trained on the union of these three sets. Evaluations are on the domain adaptation test set. The last three rows of Table 3 show the results. Without using any data from the new domain, the student gets an AP of 59.4. While using the student to self-label slightly helps (improves the results by ~1 point), using our 3DAL to auto label the new domain significantly improves the student AP by ~5 points.

5.4. Analysis of the Multi-frame Detector

Table 4 shows the ablations of our proposed MVF++ detectors. We see that the offboard techniques, such as the model capacity increase (+3.08 AP@0.7), using point clouds from 5 frames as input (+1.70 AP@0.7) and test-time augmentation (+3.39 AP@0.7), are all very effective in improving the detection quality.

anchor-free | cap. increase | seg loss | 5-frame | TTA | AP@0.7/0.8
✓ | - | - | - | - | 71.20 / 39.70
✓ | ✓ | - | - | - | 74.28 / 42.91
✓ | ✓ | ✓ | - | - | 74.64 / 43.30
✓ | ✓ | ✓ | ✓ | - | 76.34 / 45.57
✓ | ✓ | ✓ | ✓ | ✓ | 79.73 / 49.43

Table 4. Ablation studies on the improvements to the 3D detector MVF [76]. Metrics are 3D AP (L1) at IoU thresholds 0.7 and 0.8 for vehicles on the Waymo Open Dataset val set.

Table 5 shows how the number of consecutive input frames impacts the detection APs. The gains of adding frames quickly diminish as the number of frames increases: e.g. while the AP@0.8 improves by 0.81 from 1 to 2 frames, the gain from 4 to 5 frames is only 0.14 point.

#frames | 1 | 2 | 3 | 4 | 5 | 10
AP@0.7 | 74.64 | 75.32 | 75.63 | 76.17 | 76.34 | 76.96
AP@0.8 | 43.30 | 44.11 | 44.80 | 45.43 | 45.57 | 46.20

Table 5. Ablation studies on 3D detection AP vs. temporal contexts. Metrics are 3D AP (L1) for vehicles on the Waymo Open Dataset val set. We used the 5-frame model in 3D Auto Labeling.

5.5. Analysis of Object Auto Labeling Models

We evaluate the object auto labeling models using the box accuracy metric under two IoU thresholds, 0.7 and 0.8, on the Waymo Open Dataset val set. A predicted box is considered correct if its IoU with the ground truth is higher than the threshold. More analysis is in Sec. I.

Ablations of the static object auto labeling. In Table 6 we can see the importance of the initial coordinate transform (to the box coordinate) and of the foreground segmentation network in the first 3 rows. In the 4th and 5th rows, we see the gains of using iterative box re-estimation and test-time augmentation, respectively.

transform | segmentation | iterative | tta | Acc@0.7/0.8
- | - | - | - | 78.82 / 50.90
✓ | - | - | - | 81.35 / 54.76
✓ | ✓ | - | - | 81.37 / 55.67
✓ | ✓ | ✓ | - | 82.02 / 56.77
✓ | ✓ | ✓ | ✓ | 82.28 / 56.92

Table 6. Ablation studies of the static auto labeling model. Metrics are the box accuracy at 3D IoU=0.7 and IoU=0.8 for vehicles on the Waymo Open Dataset val set.

Alternative designs of the dynamic object auto labeling. Table 7 ablates the design of the dynamic object auto labeling model. For the align & refine model, we use the multi-frame MVF++ detector boxes to "align" the object point clouds from the nearby frames ([−2,+2]) to the center frame. For each context frame, we transform the coordinate by aligning the center and heading of the context frame boxes to the center frame box. The model using unaligned point clouds (in the center frame's coordinate, from [−2,+2] context frames), second row, actually gets higher accuracy than the aligned one. The model taking only the box sequence (third row) as input also performs reasonably well, by leveraging the trajectory shape and the box sizes. Our model jointly using the multi-frame object point clouds and the box sequences gets the best accuracy.

Method | Acc@0.7/0.8
Align & refine | 83.33 / 60.69
Points only | 83.79 / 61.95
Box sequence only | 83.13 / 58.96
Points and box sequence joint | 85.67 / 65.77

Table 7. Comparing with alternative designs of dynamic object auto labeling. Metrics are box accuracy with 3D IoU thresholds 0.7 and 0.8 for vehicles on the Waymo Open Dataset val set.

Effects of temporal context sizes for object auto labeling. Table 8 studies how the context frame sizes influence the box prediction accuracy. We also compare with our single-frame (S-MVF++) and multi-frame (M-MVF++) detectors to show the extra gains the object auto labeling can bring. We can clearly see that using large temporal contexts improves the performance, while using the entire object track (the last row) leads to the best performance. Note that for the static object model, we use the detector box with the highest score for the initial coordinate transform, which gives our auto labeling an advantage over the frame-based methods.

Method | Context frames | Static Acc@0.7/0.8 | Dynamic Acc@0.7/0.8
S-MVF++ | [−0,+0] | 67.17 / 36.61 | 80.07 / 57.71
M-MVF++ | [−4,+0] | 73.96 / 43.56 | 82.21 / 59.52
3DAL | [−0,+0] | 78.13 / 50.30 | 80.65 / 57.97
3DAL | [−2,+2] | 79.60 / 52.52 | 84.34 / 63.60
3DAL | [−5,+5] | 80.48 / 55.02 | 85.10 / 64.51
3DAL | all | 82.28 / 56.92 | 85.67 / 65.77

Table 8. Effects of temporal context sizes for object auto labeling. Metrics are the box accuracy at 3D IoU=0.7, 0.8 for vehicles in the WOD val set. Dynamic vehicles have a higher accuracy because they are closer to the sensor than static ones.

5.6. Qualitative Analysis

In Fig. 6, we visualize the auto labels for two representative scenes in autonomous driving: driving on a road with parked cars, and passing a busy intersection. Our model is able to accurately recognize vehicles and pedestrians in challenging cases with occlusions and very few points. The busy intersection scene also shows a few failure cases, including false negatives of pedestrians in rare poses (sitting), false negatives of severely occluded objects and false positives for objects with geometry similar to cars. Those hard cases can potentially be solved with added camera information through multi-modal learning.

Figure 6. Visualization of 3D auto labels on the Waymo Open Dataset val set (best viewed in color with zoom in). Object points are colored by object type, with blue for static vehicles, red for moving vehicles and orange for pedestrians. Boxes are colored as: green for true positive detections, red for false positives and cyan for ground truth boxes in the cases of false negatives.

6. Conclusion

In this work we have introduced 3D Auto Labeling, a state-of-the-art offboard 3D object detection solution using point cloud sequences as input. The pipeline leverages the long-term temporal data of objects in the 3D scene. Key to our success are our object-centric formulation, our powerful offboard multi-frame detector and our novel object auto labeling models. Evaluated on the Waymo Open Dataset, our solution has shown significant gains over prior art onboard 3D detectors, especially on high-standard metrics. A human label study has further shown the high quality of the auto labels, reaching comparable performance to experienced humans. Moreover, the semi-supervised learning experiments have demonstrated the usefulness of the auto labels for student training in cases of low labels and unseen domains.

References

flownet for scene flow estimation on large-scale point clouds.
[1] Waymo open dataset: 3d detection challenge. https://waymo.com/open/challenges/3d-detection/. Accessed: 2021-01-25.
[2] Waymo open dataset: Domain adaptation challenge. https://waymo.com/open/challenges/domain-adaptation/. Accessed: 2021-01-25.
[3] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859-868, 2018.
[4] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9297-9307, 2019.
[5] Alex Bewley, Pei Sun, Thomas Mensink, Dragomir Anguelov, and Cristian Sminchisescu. Range conditioned dilated convolutions for scale invariant 3d object detection, 2020.
[6] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, 2018.
[7] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5230-5238, 2017.
[8] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
[9] Y. Chen, S. Liu, X. Shen, and J. Jia. Fast point r-cnn. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9774-9783, 2019.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
[11] Neeraj Deshmukh, Richard Jennings Duncan, Aravind Ganapathiraju, and Joseph Picone. Benchmarking human performance for continuous speech recognition. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), volume 4, pages 2486-2489. IEEE, 1996.
[12] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355-1361, May 2017.
[13] Di Feng, Xiao Wei, Lars Rosenbaum, Atsuto Maki, and Klaus Dietmayer. Deep active learning for efficient training of a lidar 3d object detector. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 667-674. IEEE, 2019.
[14] Runzhou Ge, Zhuangzhuang Ding, Yihan Hu, Yu Wang, Sijia Chen, Li Huang, and Yuan Li. Afdet: Anchor free one stage 3d object detection, 2020.
[15] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254-3263, 2019.
[16] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. What you see is what you get: Exploiting visibility for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[19] Rui Huang, Wanyue Zhang, Abhijit Kundu, Caroline Pantofaru, David A. Ross, Thomas A. Funkhouser, and Alireza Fathi. An LSTM approach to temporal 3d object detection in lidar point clouds. CoRR, 2020.
[20] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5070-5079, 2019.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, 2014.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84-90, May 2017.
[23] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1-8. IEEE, 2018.
[24] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
[25] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.
[26] Jungwook Lee, Sean Walsh, Ali Harakeh, and Steven L Waslander. Leveraging pre-trained 3d object detection models for fast ground truth generation. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2504-2510. IEEE, 2018.
[27] B. Li. 3d fully convolutional network for vehicle detection in point cloud. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1513-1518, Sep. 2017.
[28] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3d lidar using fully convolutional network. In RSS, 2016.
[29] Peiliang Li, Jieqi Shi, and Shaojie Shen. Joint spatial-temporal optimization for stereo 3d object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6877-6886, 2020.
[30] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun. Multi-task multi-sensor fusion for 3d object detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7337-7345, 2019.
[31] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In ECCV, 2018.
[32] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5257-5266, 2019.
[33] Richard P Lippmann. Speech recognition by machines and humans. Speech Communication, 22(1):1-15, 1997.
[34] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529-537, 2019.
[35] Xingyu Liu, Mengyuan Yan, and Jeannette Bohg. Meteornet: Deep learning on dynamic 3d point cloud sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9246-9255, 2019.
[36] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[37] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Luc Van Gool, and Dengxin Dai. Weakly supervised 3d object detection from lidar point cloud. arXiv preprint arXiv:2007.11901, 2020.
[38] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez. Sensor fusion for joint 3d object detection and semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1230-1237, 2019.
[39] Gregory P. Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K. Wellington. LaserNet: An efficient probabilistic 3D object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[40] Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11177-11185, 2020.
[41] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, Zhifeng Chen, Jonathon Shlens, and Vijay Vasudevan. Starnet: Targeted computation for object detection in point clouds. CoRR, 2019.
[42] Joshua Owoyemi and Koichi Hashimoto. Spatiotemporal learning of dynamic gestures from 3d point cloud data. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1-5. IEEE, 2018.
[43] Lukas Prantl, Nuttapong Chentanez, Stefan Jeschke, and Nils Thuerey. Tranquil clouds: Neural networks for learning temporally coherent features in point clouds. arXiv preprint arXiv:1907.05279, 2019.
[44] Charles R Qi, Xinlei Chen, Or Litany, and Leonidas J Guibas. Imvotenet: Boosting 3d object detection in point clouds with image votes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4404-4413, 2020.
[45] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 9277-9286, 2019.
[46] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In CVPR, 2018.
[47] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
[48] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
[49] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[50] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529-10538, 2020.
[51] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. arXiv preprint arXiv:1812.04244, 2018.
[52] Weijing Shi and Ragunathan (Raj) Rajkumar. Point-gnn: Graph neural network for 3d object detection in a point cloud. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[53] Martin Simony, Stefan Milzy, Karl Amendey, and Horst-Michael Gross. Complex-yolo: An euler-region-proposal for real-time 3d object detection on point clouds. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[54] Vishwanath A. Sindagi, Yin Zhou, and Oncel Tuzel. Mvx-net: Multimodal voxelnet for 3d object detection. In International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, May 20-24, 2019, pages 7276-7282. IEEE, 2019.
[55] Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: ensembling boxes for object detection models. arXiv preprint arXiv:1910.13302, 2019.
[56] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, 2016.
[57] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446-2454, 2020.
[58] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proc. Int. Conf. Computer Vision (ICCV), 2019.
[59] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
[60] Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal networks hard? arXiv preprint arXiv:1905.12681, 2019.
[61] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Thomas Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. In ECCV, 2020.
[62] Xinshuo Weng and Kris Kitani. A baseline for 3d multi-object tracking. arXiv preprint arXiv:1907.03961, 2019.
[63] Xinshuo Weng, Jianren Wang, Sergey Levine, Kris Kitani, and Nicholas Rhinehart. 4d forecasting: Sequential forecasting of 100,000 points, 2020.
[64] Xinshuo Weng, Ye Yuan, and Kris Kitani. Joint 3d tracking and forecasting with graph neural network and diversity sampling. arXiv preprint arXiv:2003.07847, 2020.
[65] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687-10698, 2020.
[66] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. PointFusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[67] I Zeki Yalniz, Herve Jegou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
[68] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[69] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652-7660, 2018.
[70] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11040-11048, 2020.
[71] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Ipod: Intensive point-based object detector for point cloud, 2018.
[72] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia. Std: Sparse-to-dense 3d object detector for point cloud. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1951-1960, 2019.
[73] M. Ye, S. Xu, and T. Cao. Hvnet: Hybrid voxel network for lidar based 3d object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1628-1637, 2020.
[74] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang. Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11492-11501, 2020.
[75] Sergey Zakharov, Wadim Kehl, Arjun Bhargava, and Adrien Gaidon. Autolabeling 3d objects with differentiable rendering of sdf shape priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12224-12233, 2020.
[76] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pages 923-932, 2020.
[77] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, 2018.
[78] Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE International Conference on Computer Vision, pages 5982-5991, 2019.

Appendix

A. Overview

In this document, we provide more details of models and experiments and show more analysis results. Sec. B presents more evaluation results on the Waymo Open Dataset test set and shows how our offboard 3D detection can help domain adaptation and 3D tracking. Sec. C explains more details of the MVF++ detectors. Sec. D and Sec. E describe implementation details of our multi-object tracker and track-based motion state classifier respectively. Sec. F covers network architectures, losses and training details of the object auto labeling models. Sec. G describes the specifics of the human label study for 3D object detection and provides more statistics. Sec. H provides more information about the semi-supervised learning experiments. Lastly, Sec. I gives more analysis results supplementary to the main paper.

B. More Evaluation Results

B.1. 3D Detection Results on the Test Set

In Table 9 we report detection results on the Waymo Open Dataset test set, comparing our pipeline with a few leading methods on the leaderboard [1]. Note that our pipeline achieves the best results among Lidar-only methods. It also outperforms HorizonLidar3D, which uses both camera and Lidar input, in the L1 metrics. We expect that adding camera input can further improve our pipeline in hard cases (L2).

  Method          Sensor  AP L1  APH L1  AP L2  APH L2
  PV-RCNN         L       81.06  80.57   73.69  73.23
  CenterPoint     L       81.05  80.59   73.42  72.99
  HorizonLidar3D  CL      85.09  84.68   78.23  77.83
  3DAL (ours)     L       85.84  85.46   77.24  76.91
Table 9. 3D detection AP on the Waymo Open Dataset main test set for vehicles. Evaluation results were obtained from submitting to the test server. For the sensor, 'L' means Lidar-only; 'CL' means camera and Lidar. Note that our method peeks into the future for object-centric refinement, which is feasible in the offboard setting.

B.2. Domain Adaptation Results

In Table 10 we report detection results in another domain and compare our 3D Auto Labeling (3DAL) pipeline with two baselines: the popular PointPillars [24] detector and our offboard multi-frame MVF++ detector. We see that our 3D Auto Labeling pipeline achieves significantly higher detection APs than the baselines (32.56 higher 3D AP than PointPillars and 8.03 higher 3D AP than the multi-frame MVF++), showing the strong generalization ability of our models. Compared to a few leading methods on the leaderboard [2], our method also shows significant gains. These large gains are probably due to the temporal information aggregation, which compensates for the lower point densities in the WOD domain adaptation set (collected in Kirkland with mostly rainy weather).

  Method             3D AP  0-30m  30-50m  50m+
  PointPillar        45.48  74.02  36.49   14.94
  Multi-frame MVF++  70.01  86.54  67.72   43.25
  PV-RCNN-DA         71.40  90.00  66.45   45.92
  CenterPoint        67.04  86.62  60.95   38.59
  HorizonLidar3D     72.48  90.65  67.26   47.89
  3DAL (ours)        78.04  91.90  73.47   52.53
Table 10. 3D detection AP on the Waymo Open Dataset domain adaptation test set for vehicles. The PointPillar, MVF++ and 3DAL models were trained by us on the Waymo Open Dataset main train set. Evaluation results were obtained from submitting to the test server. The PV-RCNN-DA, CenterPoint and HorizonLidar3D results are leading entries from the leaderboard [2].

B.3. 3D Tracking Results

In Table 11 we show how our improved box estimation from the offboard 3D Auto Labeling enhances the tracking performance, compared to using the boxes from the single-frame or multi-frame detectors. All methods used the same tracker (Sec. D). This reflects that in the tracking-by-detection paradigm, the localization accuracy plays an important role in determining tracking quality in terms of MOTA and MOTP.

  Method                       MOTA (higher is better)  MOTP (lower is better)
  Single-frame MVF++ with KF   52.20                    17.08
  Multi-frame MVF++ with KF    61.92                    16.31
  3D Auto Labeling             66.90                    15.45
Table 11. 3D tracking results for vehicles on the Waymo Open Dataset val set. The metrics are L1 MOTA and MOTP. KF stands for using Kalman filtering for the track state update.

C. Implementation Details of the MVF++ Detectors

Figure 7. Point-wise feature fusion network of MVF++. Given an input point cloud encoding of shape N x C, the network maps it to a high-dimensional feature space and extracts contextual information from different views, i.e. the Bird's Eye View and the Perspective View. It fuses view-dependent features by concatenating information from three sources. The final output has shape N x 144, as a result of concatenating dimension-reduced point features of shape N x 128 with 3D segmentation features of shape N x 16.

Network Architecture.  Figure 7 illustrates the point-wise feature fusion network within the proposed MVF++. Given the C-dimensional input encoding of N points [76], the network first projects the points into a 128-D feature space via a multi-layer perceptron (MLP), where shape information can be better described. The MLP is composed of a linear layer, a batch normalization (BN) layer and a rectified linear unit (ReLU) layer. Then it processes the features by two separate MLPs for view-dependent information extraction, i.e. one for the Bird's Eye View and one for the Perspective View [76]. Next, the network employs voxelization [76] to transform view-dependent point features into the corresponding 2D feature maps, which are fed to view-dependent ConvNets (i.e. ConvNet_b and ConvNet_p) to further extract contextual information within an enlarged receptive field. Different from MVF [76], which uses one ResNet [17] layer to obtain each down-sampled feature map, we increase the depth of ConvNet_b and ConvNet_p by applying one more ResNet block in each down-sampling branch. At the end of the view-dependent processing, it applies devoxelization to transform the 2D feature maps back to point-wise features. The model fuses point-wise features by concatenating the three sources of information. To reduce computational complexity, it applies two MLPs consecutively, reducing the feature dimension to 128. To improve the discriminative capability of the features, it introduces a 3D segmentation auxiliary loss and augments the dimension-reduced features with segmentation features. The output of the point-wise feature fusion network has shape N x 144.

Upon obtaining point-wise features, we voxelize them into a 2D feature map and employ a backbone network to generate detection results. Specifically, we adopt the same architecture as in [24, 76]. To further boost detection performance in the offboard setting, we replace each plain convolution layer with a ResNet [17] layer maintaining the same output feature dimension and feature map resolution.

Loss Function.  We train MVF++ by minimizing a loss function defined as L = L_cls + w_1 L_centerness + w_2 L_reg + w_3 L_seg. L_cls and L_centerness are the focal loss and the centerness loss as in [58]. L_reg is a Smooth L1 loss, learning to regress the x, y, z center locations, length, width, height and heading orientation at foreground pixels, as in [24, 76]. L_seg is the auxiliary 3D segmentation loss for distinguishing foreground from background points (points are labeled as foreground/background if they lie inside/outside of a ground truth 3D box) [51, 46]. In our experiments, we set w_1 = 1.0, w_2 = 2.0, w_3 = 1.0. At inference time, the final score for ranking all detected boxes is computed as the multiplication of the classification score and the centerness score. By doing so, the centerness score can downplay boxes far away from an object center and thus encourage non-maximum suppression (NMS) to yield high-quality boxes, as recommended in [58].

Data Augmentation.  We perform three global augmentations that are applied to the LiDAR point cloud and the ground truth boxes simultaneously [77]. First, we apply a random flip along the x axis with probability 0.5. Then, we employ a random global rotation and scaling, where the rotation angle and the scaling factor are drawn uniformly from [-pi/4, +pi/4] and [0.9, 1.1], respectively. Finally, we add a global translation noise to x, y, z drawn from N(0, 0.6).

Hyperparameters.  For vehicles, we set the voxel size to [0.32, 0.32, 6.0] m and the detection range to [-74.88, 74.88] m along the X and Y axes and [-2, 4] m along the Z axis, which results in a 468 x 468 2D feature map in the Bird's Eye View. For pedestrians, we set the voxel size to [0.24, 0.24, 4.0] m and the detection range to [-74.88, 74.88] m along the X and Y axes and [-1, 3] m along the Z axis, which corresponds to a 624 x 624 2D feature map in the Bird's Eye View. During test-time augmentation, we set the (IoU threshold, box score) to (0.275, 0.5) for vehicles and (0.2, 0.5) for pedestrians, to trigger weighted box fusion [55].

Training.  During training, we use the Adam optimizer [21] and apply cosine decay to the learning rate. The initial learning rate is set to 1.33e-3 and ramps up to 3.0e-3 after 1000 warm-up steps. The training used 64 TPUs with a global batch size of 128 and finished after 43,000 steps.
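The three global augmentations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the (N, 3) point and (M, 7) box layouts, interpreting N(0, 0.6) as a standard deviation, and mirroring the y coordinate for the "flip along the x axis" are our own choices.

```python
import numpy as np

def global_augment(points, boxes, rng):
    """Sketch of the three global augmentations, applied jointly to
    LiDAR points (N, 3) and boxes (M, 7) with rows
    [cx, cy, cz, length, width, height, heading]."""
    points = points.copy()
    boxes = boxes.copy()
    # 1) Random flip along the x axis with probability 0.5
    #    (assumed here to mean mirroring y; headings negate).
    if rng.random() < 0.5:
        points[:, 1] *= -1.0
        boxes[:, 1] *= -1.0
        boxes[:, 6] *= -1.0
    # 2) Global rotation in [-pi/4, +pi/4] and scaling in [0.9, 1.1].
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    scale = rng.uniform(0.9, 1.1)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle
    points *= scale
    boxes[:, :6] *= scale  # centers and sizes scale; headings do not
    # 3) Global translation noise on x, y, z drawn from N(0, 0.6)
    #    (0.6 treated as the standard deviation).
    shift = rng.normal(0.0, 0.6, size=3)
    points += shift
    boxes[:, :3] += shift
    return points, boxes
```

Passing a seeded `np.random.default_rng` makes the augmentation reproducible, which is convenient when pairing augmented clouds with cached labels.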
D. Implementation Details of the Tracker

Our multi-object tracker is a similar implementation to [62]. To reduce the impact of sensor ego-motion on tracking, we transform all the boxes to the world coordinate for tracking. To reduce false positives, we also filter out all detections with scores less than 0.1 before the tracking. We use Bird's Eye View (BEV) boxes for detection and track association, using the Hungarian algorithm with an IoU threshold of 0.1. During the state updates, the heading is handled specially, as there can be flips and cyclic patterns. Before updating the heading state, we first adjust the detection heading to align with the track state heading: if the angle difference is obtuse, we add pi to the detection angle before the update. We also average the angles in the cyclic space (e.g. the average of 6 rad and 0.5 rad is 0.1084 rather than 3.25).

E. Implementation Details of the Motion State Estimator

As we introduced in the main paper, we use the object track data for motion state estimation, which is much easier than classifying the static/non-static state from a single or a few frames. Note that we define an object as static only if it is stationary in the entire sequence. Specifically, we extract two heuristic-based features and fit a linear classifier to estimate the motion state. The two features are the detection box centers' variance and the begin-to-end distance of the tracked boxes (the distance from the center of the first box of the track to the center of the last box of the track), with boxes all in the world coordinate. To ensure that the statistics are reliable, we only consider tracks with at least 7 valid measurements. For tracks that are too short, we run neither the classification nor the auto labeling models; the boxes of those short tracks are merged directly into the final auto labels.

The ground truth motion states are computed from ground truth boxes with pre-defined thresholds on the begin-to-end distance (1.0 m) and the max speed (1 m/s). The thresholds are needed because there can be small drifts in sensor poses, such that the ground truth boxes in the world coordinate are not exactly the same for a static object.

For vehicles, such a simple linear model can achieve more than 99% classification accuracy. The remaining rare error cases usually happen in short tracks with noisy detection boxes, or for objects that are heavily occluded or far away. For pedestrians, as most of them are moving and even the static ones tend to move their arms and heads, we consider all pedestrian tracks as dynamic.

F. Details of the Object Auto Labeling Models

F.1. Static Object Auto Labeling

Network architecture.  In the static object auto labeling model, the foreground segmentation is a PointNet [47] segmentation network, where each point is first processed by a multi-layer perceptron (MLP) with 5 layers with output channel sizes of 64, 64, 64, 128, 1024. Every layer of the MLP has batch normalization and ReLU. The 1024-dim per-point embeddings are pooled with a max pooling layer and concatenated with the output of the 2nd layer of the per-point MLP (64-dim). The concatenated 1088-dim features are further processed by an MLP of 5 layers with output channel sizes 512, 256, 128, 128, 2, where the last layer has no non-linearity or batch normalization. The predicted foreground logit scores are used to classify each point as foreground or background, and all the foreground points are extracted.

The box regression network is also a PointNet [47] variant that takes the foreground points and outputs the 3D box parameters. It has a per-point MLP with output sizes of 128, 128, 256, 512, a max pooling layer and a following MLP with output sizes 512, 256 on the max-pooled features. There is a final linear layer predicting the box parameters. We parameterize the boxes in a way similar to [46]: the box center regression (3-dim), the box heading regression and classification (to each of the heading bins), and the box size regression and classification (to each of the template sizes). For iterative refinement, we apply the same box regression network one more time on the foreground points transformed to the estimated box's coordinate. We found that if we use multi-frame MVF++ boxes, sharing weights between the two box regression networks works better than not sharing; while if we use the single-frame MVF++, the cascaded design without weight sharing works better. The numbers in the main paper are from the iterative model (shared weights).

For simplicity and higher generalizability of the model, we only use the XYZ coordinates of the points in the segmentation and box regression networks. Intensities and other point channels are not used. We have also tried the more powerful PointNet++ [48] models but did not see improvement over the PointNet-based models on this problem.

Losses.  The model is trained with supervision from the segmentation masks and the ground truth 3D bounding boxes. For the segmentation, the sub-network predicts two scores for each point (foreground or background) and is supervised with a cross-entropy loss L_seg. For the box regression, we implement a process similar to [46], where each box regression network regresses the box by predicting its center cx, cy, cz, its size class (among a few pre-defined template size classes) and residual sizes for each size class, as well as the heading bin class and a residual heading for each bin. We use 12 heading bins (each bin accounts for 30 degrees) and 3 size clusters: (4.8, 1.8, 1.5), (10.0, 2.6, 3.2), (2.0, 1.0, 1.6), where the dimensions are length, width, height. The box regression loss is defined as L_box,i = L_c-reg,i + w_1 L_s-cls,i + w_2 L_s-reg,i + w_3 L_h-cls,i + w_4 L_h-reg,i, where i in {1, 2} denotes the cascade/iterative box estimation step. The total loss is L = L_seg + w (L_box,1 + L_box,2). The w and w_i are hyperparameter weights of the losses. Empirically, we use w_1 = 0.1, w_2 = 2, w_3 = 0.1, w_4 = 2 and w = 10.

Training and data augmentation.  We train our models using the object tracks extracted (with the proposed multi-frame MVF++ model and our multi-object tracker) from the Waymo Open Dataset, for each class type separately. Ground truth boxes are assigned to every frame of the track (frames with no matched ground truth are skipped). During training, for each static object track, we randomly select an initial box from the sequence. We also randomly sub-sample Uniform[1, |S_j|] frames from all the visible frames S_j of an object j.

F.2. Dynamic Object Auto Labeling

Network architecture.  For the foreground segmentation network, we adopt a similar architecture to that of the static auto labeling model, except that the input points have one more channel besides the XYZ coordinates: the time encoding channel. The temporal encoding is 0 for points from the current frame, -0.1r for the r-th frame prior to the current frame and +0.1r for the r-th frame after the current frame. In our implementation we take 5 frames of object points, with each frame's points subsampled to 1,024 points, so in total there are 5,120 points input to the segmentation network.

The point sequence encoder network takes the segmented foreground points and uses a PointNet [47]-like architecture with a per-point MLP of output sizes 64, 128, 256, 512, a max-pooling layer and another MLP with output sizes 512, 256 on the max-pooled features. The output is a 256-dim feature vector.

For the box sequence encoder network, we consider each box (in the center frame's box coordinate) as a parameterized point with channels of box center (cx, cy, cz), box size (length, width, height), box heading theta and a temporal encoding. We use nearly the entire box sequence (setting s in the main paper to 50, leading to a sequence length of 101). The box sequence can thus be treated as a point cloud and processed by another PointNet. Note that such a sequence could also be processed by a 1D ConvNet, concatenated and processed by a fully connected network, or even fed to a graph neural network. Through empirical study we found that using a PointNet to encode the box sequence feature is both effective (compared with ConvNets and fully connected layers) and simple (compared with graph neural networks). The box sequence encoding PointNet has a per-point MLP
This naturally leads to j withoutputsizes64,64,128,512,amax-poolinglayerand a data augmentation effect. Note that at test time we al- anotherMLPwithoutputsizes128,128onthemax-pooled ways select the initial box with the highest score and use features.Thefinaloutputisa128-dimfeaturevector,which all frames. The merged points are randomly sub-sampled wecallthetrajectoryembedding. to4,096pointsandrandomlyflippedalongtheX,Y axes The point embedding and the trajectory embedding are with50%chancerespectivelyandrandomlyrotatedaround concatenatedandpassedthroughafinalboxregressionnet- the up-axis (Z) by Uniform[−10,10] degrees. To increase work, which is a MLP with two layers with output sizes thedataquantity,wealsoturnthedynamicobjecttrackdata 128,128 and a linear layer to regress the box parameters to pseudo static track. To achieve that, we use the ground (similartothatofthestaticobjectautolabelingmodel). truth object boxes to align the dynamic object points to a Toencouragecontributionsfrombothbranches,wefol- specificframe’sgroundtruthboxcoordinate.Thisincreases low [60] and also pass the trajectory embedding and the thenumberofobjecttracksofvehiclesby30%. object embedding to two additional box regression sub- In total, we have extracted around 50K (vehicle) object networkstopredictboxesindependently.Thesub-networks tracksfortraining(includingtheaugmentedonesfromdy- havethesamestructureastheoneforthejoint-embedding, namicobjects)andaround10Kobjecttracksforvalidation butwithnon-sharedweights. (static only). We trained the model using the Adam opti- mizerwithabatchsizeof32andaninitiallearningrateof 0.001. The learning rate was decayed by 10X at the 60th, Losses. Similartothestaticautolabelingmodel,wehave 100th and 140th epochs. The model was trained with 180 twotypesofloss,thesegmentationlossandtheboxregres- epochs in total, which took around 20 hours with a V100 sion loss. The box regression outputs are defined in the GPU. same way as that for the static objects. 
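To make the encoder description above concrete, here is a minimal NumPy sketch (our own illustrative code, not the authors' implementation; the weights are random and untrained, so it only demonstrates the tensor shapes and data flow, and all function names are ours):

```python
import numpy as np

def mlp(x, sizes, rng):
    """Stack of linear + ReLU layers applied along the last axis."""
    for s in sizes:
        w = rng.standard_normal((x.shape[-1], s)) * 0.01
        x = np.maximum(x @ w, 0.0)
    return x

def temporal_encoding(frame_offsets):
    """0 for the current frame, -0.1r / +0.1r for r frames in the past / future."""
    return 0.1 * np.asarray(frame_offsets, dtype=np.float64)

def encode_point_sequence(points, rng):
    """Point sequence encoder: per-point MLP (64,128,256,512), max-pool, MLP (512,256)."""
    per_point = mlp(points, [64, 128, 256, 512], rng)
    pooled = per_point.max(axis=0)            # max-pooling over all points
    return mlp(pooled, [512, 256], rng)       # 256-dim object embedding

def encode_box_sequence(boxes, rng):
    """Box sequence encoder: per-box MLP (64,64,128,512), max-pool, MLP (128,128)."""
    per_box = mlp(boxes, [64, 64, 128, 512], rng)
    pooled = per_box.max(axis=0)
    return mlp(pooled, [128, 128], rng)       # 128-dim trajectory embedding

rng = np.random.default_rng(0)

# 5 frames of object points, 1,024 points each, XYZ + temporal encoding channel.
xyz = rng.standard_normal((5, 1024, 3))
t = temporal_encoding([-2, -1, 0, 1, 2])      # [-0.2, -0.1, 0.0, 0.1, 0.2]
t = np.broadcast_to(t[:, None, None], (5, 1024, 1))
points = np.concatenate([xyz, t], axis=-1).reshape(-1, 4)   # (5120, 4)

# 101 boxes (s = 50): center(3) + size(3) + heading(1) + temporal encoding(1).
boxes = rng.standard_normal((101, 8))

obj_emb = encode_point_sequence(points, rng)
traj_emb = encode_box_sequence(boxes, rng)
joint = np.concatenate([obj_emb, traj_emb])   # 256 + 128 = 384-dim
print(obj_emb.shape, traj_emb.shape, joint.shape)   # (256,) (128,) (384,)
```

In the actual model, the 384-dim joint embedding then feeds the final box regression MLP (output sizes 128, 128) plus a linear layer; the temporal_encoding values follow the ±0.1r scheme described above.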
Losses. Similar to the static auto labeling model, we have two types of loss: the segmentation loss and the box regression loss. The box regression outputs are defined in the same way as those for the static objects. We used 12 heading bins (each bin accounts for 30 degrees) and the same size clusters as those for the static vehicle auto labeling. For pedestrians we use a single size cluster: (0.9, 0.9, 1.7) in length, width, height. The final loss is

    L = L_seg + v_1 L_box-traj + v_2 L_box-obj-pc + v_3 L_box-joint,

where the three box losses come from the trajectory head, the object point cloud head and the joint head respectively. The v_i, i = 1, 2, 3 are the weights for the loss terms, chosen to balance the learning of the three types of embeddings. Empirically, we use v_1 = 0.3, v_2 = 0.3, v_3 = 0.4.

Training and data augmentation. During training, we randomly select the center frame from each dynamic object track. If the context size is less than the required sequence length 2r+1 or 2s+1, or when the frames in the sequence are not consecutive (e.g. the object is occluded for a few frames), we use placeholder points and boxes (all zeros) for the empty frames. As there may be tracking errors, we match our object tracks with ground truth tracks and avoid training on the ones with switched track IDs.

As to augmentation, both points and boxes are randomly flipped along the X and Y axes with a 50% chance and randomly rotated around the Z axis by Uniform[−10, 10] degrees. We also add a light random shift and a random scaling to the point clouds. The point cloud from each frame is also randomly sampled to 1,024 points from the full observed point cloud.

We train vehicle and pedestrian models separately. For vehicles we extracted around 15.7K dynamic tracks for training and 3K for validation. For pedestrians, we extracted around 22.9K dynamic tracks for training and around 5.1K for validation. We train the model using the Adam optimizer with a batch size of 32 and an initial learning rate of 0.001. The learning rate is decayed by 10X at the 180th, 300th and 420th epochs. The model is trained for 500 epochs in total, which takes 1-2 days on a V100 GPU.

G. Details of the Human Label Study

We randomly selected 5 sequences from the Waymo Open Dataset val set, as listed in Table 12, to run the human label study. The 15 labeling tasks (3 sets of re-labels for each run segment) involved 12 labelers with experience in labeling 3D Lidar point clouds. In total we collected around 2.3K labels (one label per object track) across the 3 repeated labelings.

    segment-17703234244970638241 220 000 240 000
    segment-15611747084548773814 3740 000 3760 000
    segment-11660186733224028707 420 000 440 000
    segment-1024360143612057520 3580 000 3600 000
    segment-6491418762940479413 6520 000 6540 000

Table 12. Sequence (run segment) list for the human label study. The sequences are all from the Waymo Open Dataset val set.

How consistent are human labels? Auxiliary to the AP results in the main paper, we also analyze the IoUs between human labels. We found that even for humans, 3D bounding box labeling can be challenging, as the input point clouds are often partial and occluded. To understand how consistent human labels are, we compare labels from one labeler with the labels from the others and measure the 3D box consistency by their IoUs. Since we already have the verified public ground truth, we can compare the 3 sets of labels with the public WOD ground truth and get the average box IoU for all matched objects (due to occlusions, some objects may or may not be labeled by a specific labeler). Specifically, human boxes for which we could not find a ground truth box with more than 0.03 BEV IoU overlap (false positives or false negatives) were ignored and not counted in the computation.

The statistics are summarized in Table 13. Surprisingly, human labels do not have the consistency one may expect (e.g. 95% IoU). Due to the inherent uncertainty of the problem, even humans only achieve around 81% 3D IoU, or around 88% BEV IoU, in their box consistency. Breaking the numbers down by distance, we see, intuitively, that nearby objects have a significantly higher mean IoU, as they have more visible points and more complete viewpoints. The BEV 2D IoU is also higher than the 3D IoU, as the BEV box does not require correct height estimation, which simplifies the problem.

To get a rough understanding of how the boxes generated by our 3D Auto Labeling pipeline compare with human labels, we compute the average IoU of auto labels with the WOD ground truth. Note that those numbers are not directly comparable to the average human IoUs, as they cover different sets of objects (due to the false positives and false negatives). However, they still suggest that the auto labels are already on par in quality with human labels.

    IoU type     Label type   all      0-30m    30-50m   50m+
    valid boxes  human        25,641   11,543   7,963    6,135
                 auto         24,146   11,360   7,448    5,338
    3D mIoU      human        80.92    85.78    80.29    72.59
                 auto         80.29    84.04    77.45    76.28
    BEV mIoU     human        87.98    91.26    87.31    82.68
                 auto         87.50    90.36    85.09    84.78

Table 13. The mean IoU of human labels and auto labels compared with the Waymo Open Dataset ground truth for vehicles. Note that since different labels (human or machine) annotate different numbers of objects for each frame, those numbers are not directly comparable; they are summarized here for reference. For a fairer comparison between human and auto labels, see the Average Precision comparison table in the main paper. We only evaluate using ground truth boxes with at least one point in them, and only evaluate boxes that have a BEV IoU larger than 0.03 with some ground truth box.

H. More Details about the Semi-supervised Learning Experiment

In the semi-supervised experiments, we use an onboard single-frame MVF++ detector as the student. We train all networks with an effective batch size of 256 scenes per iteration. The training schedule starts with a warmup period where the learning rate is gradually increased to 0.03 over 1,000 iterations. Afterward, we use a cosine decay schedule to drop the learning rate from 0.03 to 0.

For the intra-domain semi-supervised learning, we randomly select 10% of the sequences (around 15K frames from 79 sequences) to train the 3DAL pipeline, which gets an AP of 78.11% on the validation set. Then, the 3DAL annotates the rest of the training set (around 142K frames from 719 sequences). Finally, the student is trained on the union of these sets (798 sequences). We train the models for a total of 43K iterations.

For the cross-domain semi-supervised learning, we first train 3DAL on the regular Waymo Open training set. Since the domain adaptation validation set is relatively small (it only contains 20 sequences), we submit the results to the submission server and report on the domain adaptation test set (containing 100 sequences). The 3DAL gets an AP of 78.0% on the domain adaptation test set without using any data from that domain. Then, we use the trained pipeline to annotate the domain adaptation training and unlabeled sets of the Waymo Open Dataset. Finally, the student is trained on the union of the regular training set annotated by humans and the domain adaptation training + unlabeled sets annotated by 3DAL. Since the data used for training the student is around 2X larger than in the intra-domain experiment, we also increase the training iterations to 80K.

I. More Analysis Experiments for Object Auto Labeling

In this section, we provide more analysis results auxiliary to the main paper.

Effects of key frame selection for static object auto labeling. Table 14 compares the effects of using different initial boxes (from the detectors) for the model (for the coordinate transform before foreground segmentation): a uniformly chosen random box, the average box and the box with the highest score. We also show the box accuracy of the detectors as a reference (i.e. the accuracy of those initial boxes).

    Model                 detector box    Acc@0.7/0.8
    Single-frame MVF++    random          67.17 / 36.61
    Multi-frame MVF++     random          73.96 / 43.56
                          average         79.29 / 48.67
                          highest score   78.67 / 52.42
    Auto labeling model   random          79.66 / 52.46
                          average         81.22 / 53.96
                          highest score   82.28 / 56.92

Table 14. Effects of initial box selection in static object auto labeling. Numbers are averaged over 3 runs for the random boxes.

Choosing a uniformly random box from the sequence is equivalent to the setting of a frame-centric approach. For the detector baselines (rows 1 and 2), it directly evaluates the average accuracy of the detector boxes. For the auto labeling, it means the box estimation runs for every frame, similar to a two-stage refinement step in two-stage detectors. We see that such frame-centric box estimation achieves the most unfavorable results, as it cannot leverage the best viewpoint in the sequence (in the object-centric way).

In the average box setting, we average all the boxes from the sequence (in the world coordinate) and use the averaged box for the transformation. For the highest score setting, we select the box with the highest confidence score as the initial box, which is similar to choosing the best viewpoint of the object. We see that the strategy for choosing the initial box has a great impact and can cause a 4.46 Acc@0.8 difference for the auto labeling model (the highest score box vs. a random box).
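The three initial-box strategies compared in Table 14 can be sketched as follows (an illustrative sketch under our own assumptions, not the paper's implementation; the box tuple layout and function names are hypothetical):

```python
import random

# A "box" here is a tuple (cx, cy, cz, length, width, height, heading)
# in the world coordinate; each box comes with a detector confidence score.

def select_random(boxes, scores, rng):
    """Frame-centric baseline: a uniformly random box from the track."""
    return boxes[rng.randrange(len(boxes))]

def select_average(boxes, scores):
    """Element-wise average of the box parameters over the track.
    (Naive heading averaging assumes roughly aligned headings.)"""
    n = len(boxes)
    return tuple(sum(b[k] for b in boxes) / n for k in range(len(boxes[0])))

def select_highest_score(boxes, scores):
    """Object-centric choice: the most confident detection, roughly the
    best-observed viewpoint of the object."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return boxes[best]

track = [
    (0.0, 0.0, 1.0, 4.8, 1.8, 1.5, 0.00),
    (0.2, 0.1, 1.0, 4.6, 1.8, 1.5, 0.02),
    (0.1, 0.0, 1.0, 4.7, 1.9, 1.5, 0.01),
]
scores = [0.6, 0.9, 0.7]

rng = random.Random(0)
print(select_highest_score(track, scores))        # the 0.9-score box
print(select_random(track, scores, rng) in track) # True
```

Naively averaging the heading only makes sense for a static object whose detections are roughly aligned, which is the setting studied here; the highest-score choice corresponds to picking the best-observed viewpoint.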
Causal model performance. Table 15 compares non-causal and causal models (for static object auto labeling). The causal model was trained using the causal input (the last row) and only used causal input for inference at every frame. We see that the causal model has a relatively lower accuracy than the non-causal ones, probably for two reasons. First, it has limited context, especially for the beginning frames of the track. Second, the pool of initial boxes is much more restricted if the input has to be causal. For non-causal models, we can select the frame with the highest confidence as the key frame and use the box from that frame for the initial coordinate transform. The causal model, however, can only select the highest-confidence box from the history frames, which are not necessarily the well-visible ones.

As the causal model's accuracy is even inferior to that of the model that uses only a single frame's points for refinement (the first row in Table 15), the ability to select the best key frame weighs more than the added points from a few history frames. Note that the performance is still better than that of the detector boxes without refinement (row 2 in Table 14). Such results indicate the benefit of having a non-causal model for offboard 3D detection.

    ref frame            context frames   Acc@0.7/0.8
    all highest score    [−0,+0]          78.13 / 50.30
    all highest score    all              82.28 / 56.92
    past highest score   all history      77.56 / 49.21

Table 15. Effects of temporal contexts for static object auto labeling. Note that for a causal model, we cannot output a single box for a static object: we have to output the best estimation for every frame using the current frame and the history frames. We compute the average accuracy across all frames for the causal case.

Table 16 reports a similar study for dynamic object auto labeling. We also see that the causal model's performance is inferior to the non-causal one, although it still improves upon the raw detector accuracy (Table 8 in the main paper, row 2, where the multi-frame MVF++ gets 82.21 / 59.52 accuracy).

    Point cloud context   Box context   Acc@0.7/0.8
    [−2,+2]               all           85.67 / 65.77
    [−4,0]                all history   84.30 / 62.68

Table 16. Effects of temporal contexts for dynamic object auto labeling. For causal models, we only use the causal point and box sequence input.

Ablations of data augmentation. Table 17 compares the performance of our auto labeling pipeline when trained with different data augmentations. The most accurate models are consistently trained with all the proposed augmentations. For static objects, all the augmentations contribute similarly in terms of accuracy, while for dynamic objects, random rotation around the Z-axis appears to be the most critical.

    Static                       Dynamic
    Aug.       Acc@0.7/0.8       Aug.       Acc@0.7/0.8
    All        82.28 / 56.92     All        85.67 / 65.77
    −D2S       81.72 / 55.96     −Shift     85.15 / 65.23
    −FlipX     81.42 / 55.98     −Scale     85.15 / 65.91
    −FlipY     81.50 / 55.49     −FlipY     85.76 / 63.66
    −RotateZ   81.72 / 56.52     −RotateZ   84.94 / 64.20

Table 17. Ablations of data augmentation. We use different data augmentations for static objects and dynamic objects. "All" means all the augmentations are used; "−X" means removing a specific augmentation "X" from the augmentation set. "D2S" represents the dynamic-to-static augmentation. Best results in each column are in bold.

Effects of the tracking accuracy. To study how tracking (association) accuracy affects our offboard 3D detection, we compare results using our Kalman filter tracker and an "oracle" tracker; they share the detection and object auto labeling models, and only the tracker differs. For the "oracle" tracker, we associate detector boxes using the ground truth boxes: for each detector box, we find its closest ground truth box and assign that ground truth box's object ID to it. In Table 18, we observe that better performance can be obtained when a more reliable tracker is used, although the difference is subtle. In particular, the "oracle" tracker brings more improvement for vehicles than for pedestrians. This difference implies that, for pedestrians, there is more room for improvement in detecting targets than in associating detected boxes.

                             3D AP             BEV AP
                Tracker      MOT      GT       MOT      GT
    Vehicle     IoU=0.7      84.50    85.77    93.30    96.74
                IoU=0.8      57.82    58.81    84.88    86.18
    Pedestrian  IoU=0.5      82.88    83.02    86.32    86.24
                IoU=0.6      63.69    64.80    75.60    75.65

Table 18. Effects of the tracking accuracy. "MOT" stands for Multi Object Tracker; "GT" represents the Ground Truth Tracker where the ground truth boxes are used. Best results of each comparable pair are in bold.

Effects of the motion state estimation accuracy. In Table 19 we study how motion state classification accuracy affects the offboard 3D detection AP. We replace the motion state classifier ("Pred") with one that uses the ground truth boxes ("GT") for classification and measure how much AP improvement it brings. We see that while there are some gains, they are not significant. This is understandable, as our linear classifier already achieves 99%+ accuracy.

    Motion    3D AP                 BEV AP
    State     IoU=0.7   IoU=0.8     IoU=0.7   IoU=0.8
    Pred      84.50     57.82       93.30     84.88
    GT        84.98     57.95       93.36     85.13

Table 19. Effects of the motion state estimation on the offboard 3D detection. The metric is AP for vehicles on the Waymo Open Dataset val set. "Pred" means we classify the motion state (static or not) using our linear classifier; "GT" means we use the ground truth boxes to classify the motion state.

Inference speed. Processing a 20-second sequence (200 frames at the 10 Hz sensor rate) on a V100 GPU, the detector takes the majority of the time (around 15 minutes) due to the multi-frame input and test-time augmentation. The tracking takes around 3s and the object-centric refinement around 25s, i.e. 28s in total, 0.14s per frame, or a 3% extra time over the detection. In the offboard setting, we can run the detection or refinement steps in parallel to further reduce the processing latency.
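To make the track-level statistics used in this analysis concrete, here is a minimal sketch (our own illustrative code, not the authors') of the two heuristic motion-state features described earlier, the box center variance and the begin-to-end distance, together with one plausible way to combine the stated ground-truth thresholds (1.0 m begin-to-end distance, 1 m/s max speed); exactly how the paper combines the two thresholds is our assumption:

```python
import math

def center_variance(centers):
    """Mean per-axis (population) variance of box centers over the track."""
    n = len(centers)
    total = 0.0
    for axis in range(3):
        mean = sum(c[axis] for c in centers) / n
        total += sum((c[axis] - mean) ** 2 for c in centers) / n
    return total / 3.0

def begin_to_end_distance(centers):
    """Distance from the first box center of the track to the last."""
    return math.dist(centers[0], centers[-1])

def gt_is_static(centers, dt=0.1, dist_thresh=1.0, speed_thresh=1.0):
    """Label a track static if its begin-to-end displacement and its maximum
    frame-to-frame speed both stay under the stated thresholds."""
    if begin_to_end_distance(centers) >= dist_thresh:
        return False
    return all(math.dist(a, b) / dt < speed_thresh
               for a, b in zip(centers, centers[1:]))

# A parked car with slight pose drift vs. a car moving at ~5 m/s (10 Hz frames).
parked = [(5.0 + 0.01 * i, 2.0, 0.8) for i in range(10)]
moving = [(5.0 + 0.5 * i, 2.0, 0.8) for i in range(10)]
print(gt_is_static(parked), gt_is_static(moving))   # True False
```

Recall that the pipeline only trusts these statistics for tracks with at least 7 valid measurements; shorter tracks skip the classification entirely and their boxes are merged directly into the final auto labels.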