Offboard 3D Object Detection from Point Cloud Sequences

Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, Dragomir Anguelov
Waymo LLC

Abstract

While current 3D object recognition research mostly focuses on the real-time, onboard scenario, there are many offboard use cases of perception that are largely under-explored, such as using machines to automatically generate high-quality 3D labels. Existing 3D object detectors fail to satisfy the high-quality requirement for offboard uses due to the limited input and speed constraints. In this paper, we propose a novel offboard 3D object detection pipeline using point cloud sequence data. Observing that different frames capture complementary views of objects, we design the offboard detector to make use of the temporal points through both multi-frame object detection and novel object-centric refinement models. Evaluated on the Waymo Open Dataset, our pipeline named 3D Auto Labeling shows significant gains compared to the state-of-the-art onboard detectors and our offboard baselines. Its performance is even on par with human labels, verified through a human label study. Further experiments demonstrate the application of auto labels for semi-supervised learning and provide extensive analysis to validate various design choices.

1. Introduction

Recent years have seen rapid progress in 3D object recognition with advances in 3D deep learning and strong application demands. However, most 3D perception research has been focusing on real-time, onboard use cases and only considers sensor input from the current frame or a few history frames. Those models are sub-optimal for many offboard use cases where the best perception quality is needed.
Among them, one important direction is to have machines "auto label" the data to save the cost of human labeling. High-quality perception can also be used for simulation, or to build datasets to supervise or evaluate downstream modules such as behavior prediction.

In this paper, we propose a novel pipeline for offboard 3D object detection with a modular design and a series of tailored deep network models. The offboard pipeline makes use of the whole sensor sequence input (such video data is common in applications of autonomous driving and augmented reality).

Figure 1. Our offboard 3D Auto Labeling achieved significant gains over two representative onboard 3D detectors (the efficient PointPillar [24] and the top-performing PV-RCNN [50]). The relative gains (the percentage numbers) are higher under a more strict standard (higher IoU thresholds). The metric is 3D AP (L1) for vehicles on the Waymo Open Dataset [57] val set. (Axes: 3D average precision vs. 3D IoU threshold, 0.5 to 0.8.)

With no constraints on model causality and little constraint on model inference speed, we are able to greatly expand the design space of 3D object detectors and achieve significantly higher performance.

We design our offboard 3D detector based on a key observation: different viewpoints of an object, within a point cloud sequence, contain complementary information about its geometry (Fig. 2). An immediate baseline design is to extend the current detectors to use multi-frame inputs.
However, while multi-frame detectors are effective, they are still limited in the amount of context they can use and are not naively scalable to more frames: gains from adding more frames diminish quickly (Table 5).

In order to fully utilize temporal point clouds (e.g. 10 or more seconds), we step away from the common frame-based input structure where the entire frames of point clouds are merged. Instead, we turn to an object-centric design. We first leverage a top-performing multi-frame detector to give us initial object localization. Then, we link objects detected at different frames through multi-object tracking. Based on the tracked boxes and the raw point cloud sequences, we can extract the entire track data of an object, including all of its sensor data (point clouds) and detector boxes, which is 4D: 3D spatial plus 1D temporal. We then propose novel deep network models to process such 4D object track data and output temporally consistent and high-quality boxes of the object. As they are similar to how a human labels an object, and because of their high-quality output, we call the models processing the 4D track data "object-centric auto labeling models" and the entire pipeline "3D Auto Labeling" (Fig. 3).

Figure 2. Illustration of the complementary views of an object from the point cloud sequence. Point clouds (aggregated from multiple frames: 1 frame, 5 frames, 10 frames and all 146 frames) visualized in a top-down view for a mini-van.

We evaluate our proposed models on the Waymo Open Dataset (WOD) [57], which is a large-scale autonomous driving benchmark containing 1,000+ Lidar scan sequences with 3D annotations for every frame. Our 3D Auto Labeling pipeline dramatically lifts the perception quality compared to existing 3D detectors designed for the real-time, onboard use cases (Fig. 1 and Sec. 5.1). The gains are even more significant at higher standards. To understand how far we are from human performance in 3D object detection, we have conducted a human label study to compare auto labels with human labels (Sec. 5.2). To our delight, we found that auto labels are already on par with, or even slightly better than, human labels on the selected test segments.

In Sec. 5.3, we demonstrate the application of our pipeline for semi-supervised learning and show significantly improved student models trained with auto labels. We also conduct extensive ablation and analysis experiments to validate our design choices in Sec. 5.4 and Sec. 5.5 and provide visualization results in Sec. 5.6.

In summary, the contributions of our work are:

• Formulation of the offboard 3D object detection problem and proposal of a specific pipeline (3D Auto Labeling) that leverages our multi-frame detector and novel object-centric auto labeling models.

• State-of-the-art 3D object detection performance on the challenging Waymo Open Dataset.

• The human label study on 3D object detection with comparisons between human and auto labels.

• Demonstrated effectiveness of auto labels for semi-supervised learning.

2. Related Work

3D object detection. Most work has been focusing on using single-frame input. In terms of the representations used, methods can be categorized into voxel-based [59, 12, 27, 56, 23, 69, 53, 77, 68, 24, 73, 61], point-based [51, 71, 41, 45, 70, 52], perspective-view-based [28, 39, 5] as well as hybrid strategies [76, 72, 9, 16, 50]. Several recent works explored temporal aggregation of Lidar scans for point cloud densification and shape completion. [36] fuses multi-frame information by concatenating feature maps from different frames. [18] aggregates (motion-compensated) points from different Lidar sweeps into a single scene. [74] uses graph-based spatiotemporal feature encoding to enable message passing among different frames. [19] encodes previous frames with a LSTM to assist detection in the current frame. Using multi-modal input (camera views and 3D point clouds) [23, 8, 46, 66, 31, 30, 38, 54, 44] has shown improved 3D detection performance compared to point-cloud-only methods, especially for small and far-away objects. In this work, we focus on a point-cloud-only solution and on leveraging data over a long temporal interval.

Learning from point cloud sequences. Several recent works [34, 15, 40] proposed to learn to estimate scene flow from dynamic point clouds using end-to-end trained deep neural networks (from a pair of consecutive point clouds). Extending such ideas, MeteorNet [35] showed that longer sequence input can lead to performance gains for tasks such as action recognition, semantic segmentation and scene flow estimation. There are also other applications of learning in point cloud sequences, like point cloud completion [43], future point cloud prediction [63] and gesture recognition [42]. We also see more released datasets with sequence point cloud data, such as the Waymo Open Dataset [57] for detection and the SemanticKITTI dataset [4] for 3D semantic segmentation.

Auto labeling. The large datasets required for training data-hungry models have increased annotation costs noticeably in recent years. Accurate auto labeling can dramatically reduce annotation time and cost. Previous works on auto labeling were mainly focused on 2D applications. Lee et al. proposed pseudo-labeling [25] to use the most confident predicted category of an image classifier as labels to train it on the unlabeled part of the dataset. More recent works [20, 78, 67, 65] have further improved the procedures to use pseudo labels and demonstrated wide success, including state-of-the-art results on ImageNet [10].

For 3D object detection, recently, Zakharov et al. [75] proposed an auto labeling framework using pre-trained 2D detectors to annotate 3D objects. While effective for loose localization (i.e. IoU of 0.5), there is a considerable performance gap for applications requiring higher precision. [37] tried to leverage weak center-click supervision to reduce the 3D labels needed. Several other works [7, 3, 26, 32, 13] have also proposed methods to assist human annotators and consequently reduce the annotation cost.

3. Offboard 3D Object Detection

Problem statement. Given a sequence of sensor inputs (temporal data) of a dynamic environment, our goal is to localize and classify objects in the 3D scene for every frame. Specifically, we consider the input of a sequence of point clouds $\{P_i \in \mathbb{R}^{n_i \times C}\}, i = 1, 2, \ldots, N$, with the point cloud $P_i$ ($n_i$ points with $C$ channels per point) for each of the $N$ total frames. The point channels include the XYZ in the sensor's coordinate (at each frame) and other optional information such as color and intensity. We also assume known sensor poses $\{M_i = [R_i | t_i] \in \mathbb{R}^{3 \times 4}\}, i = 1, 2, \ldots, N$, at each frame in the world coordinate, such that we can compensate for the ego-motion. For each frame, we output amodal 3D bounding boxes (parameterized by center, size and orientation), class types (e.g. vehicles) and unique object IDs for all objects that appear in the frame.

Design space. Access to temporal data (history and future) has led to a much larger design space of detectors compared to just using single-frame input.

One baseline design is to extend the single-frame 3D object detectors to use multi-frame input. Although previous works [36, 18, 19, 74] have shown its effectiveness, a multi-frame detector is hard to scale up to more than a few frames and cannot compensate for the object motions, since frame stacking is done for the entire scene. We observe that the contributions of multi-frame input to the detector quality diminish as we stack more frames (Table 5). Another idea is to extend the second stage of two-stage detectors [46, 51] to take object points from multiple frames. Compared to taking multi-frame input of the whole scene, the second stage only processes proposed object regions. However, it is not intuitive to decide how many context frames to use. Setting a fixed number may work well for some objects but be suboptimal for others.

Compared to the frame-centric designs above, where input is always from a fixed number of frames, we recognize the necessity to adaptively choose the temporal context size for each object independently, leading to an object-centric design. As shown in Fig. 3, we can leverage the powerful multi-frame detector to give us the initial object localizations. Then for each object, through tracking, we can extract all relevant object point clouds and detection boxes from all frames in which it appears. Subsequent models can take such object track data to output the final track-level refined boxes of the objects. As this process emulates how a human labeler annotates a 3D object in the point cloud sequence (localize, track and refine the track over time), we chose to refer to our pipeline as 3D Auto Labeling.

Figure 3. The 3D Auto Labeling pipeline. Given a point cloud sequence as input, the pipeline first leverages a 3D object detector to localize objects in each frame. Then object boxes at different frames are linked through a multi-object tracker. Object track data (its point clouds at every frame as well as its 3D bounding boxes) are extracted for each object and then go through the object-centric auto labeling (with a divide-and-conquer for static and dynamic tracks) to generate the final "auto labels", i.e. refined 3D bounding boxes.

4. 3D Auto Labeling Pipeline

Fig. 3 illustrates our proposed 3D Auto Labeling pipeline. We will introduce each module of the pipeline in the following sub-sections.
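Before detailing the modules, the pose convention from the problem statement (Sec. 3) can be made concrete with a minimal NumPy sketch (illustrative only, not the paper's code; the array shapes are our assumptions): given the known pose $M_i = [R_i | t_i]$, a frame's points are mapped from the sensor coordinate to the world coordinate, which removes the ego-motion before any cross-frame aggregation.

```python
import numpy as np

def to_world(points_xyz: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Transform per-frame points from sensor to world coordinates.

    points_xyz: (n_i, 3) XYZ channels of frame i in the sensor frame.
    pose:       (3, 4) matrix M_i = [R_i | t_i] for frame i.
    """
    rotation, translation = pose[:, :3], pose[:, 3]
    return points_xyz @ rotation.T + translation

# Toy example: a 90-degree yaw plus a 5 m translation along X.
pose = np.array([[0.0, -1.0, 0.0, 5.0],
                 [1.0,  0.0, 0.0, 0.0],
                 [0.0,  0.0, 1.0, 0.0]])
pts = np.array([[1.0, 0.0, 0.0]])
print(to_world(pts, pose))  # → [[5. 1. 0.]]
```

Applying the inverse transform of another frame's pose after this step would likewise express the points in that frame's sensor coordinate.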
4.1. Multi-frame 3D Object Detection

MVF++. As the entry point to our pipeline, accurate object detection is essential for the downstream modules. In this work, we propose the MVF++ 3D detector by extending the top-performing Multi-View Fusion [76] (MVF) detector in three aspects: 1) to enhance the discriminative ability of point-level features, we add an auxiliary loss for 3D semantic segmentation, where points are labeled as positives/negatives if they lie inside/outside of a ground truth 3D box; 2) to obtain more accurate training targets and improve training efficiency, we eliminate the anchor matching step in the MVF paper and adopt the anchor-free design as in [58]; 3) to leverage the ample computational resources available in the offboard setting, we redesign the network architecture and increase the model capacity. Please see Sec. C in the Appendix for details.

Multi-frame MVF++. We extend the MVF++ model to use multiple LiDAR scans. Points from multiple consecutive scans are transformed to the current frame based on ego-motion. Each point is extended by one additional channel encoding the relative temporal offset, similar to [18]. The aggregated point cloud is used as the input to the MVF++.

Test-time augmentation. We further boost the 3D detection through test-time augmentation (TTA) [22], by rotating the point cloud around the Z-axis by 10 different angles (i.e. [0, ±1/8π, ±1/4π, ±3/4π, ±7/8π, π]) and ensembling predictions with weighted box fusion [55]. While it may lead to excessive computational complexity for onboard uses, in the offboard setting TTA can be parallelized across multiple devices for fast execution.

4.2. Multi-object Tracking

The multi-object tracking module links detected objects across frames. Given the powerful multi-frame detector, we choose to take the tracking-by-detection path and use a separate non-parametric tracker. This leads to a simpler and more modular design compared to the joint detection and tracking methods [36, 64, 29].
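As a concrete illustration of tracking-by-detection, the sketch below is a heavily simplified stand-in, not the paper's tracker (which is a variant of [62] with a Kalman filter): here a constant-velocity alpha-beta filter replaces the Kalman update, association is greedy nearest-neighbor on box centers, and the gains and distance gate are invented values.

```python
import numpy as np

class Track:
    """One tracked object: 2D/3D center plus a constant-velocity estimate."""

    def __init__(self, track_id, center):
        self.id = track_id
        self.center = np.asarray(center, dtype=float)
        self.velocity = np.zeros_like(self.center)

    def predict(self):
        # Constant-velocity motion model: where we expect the object next.
        return self.center + self.velocity

    def update(self, detection, alpha=0.6, beta=0.3):
        # Alpha-beta filter: a simplified stand-in for a Kalman update.
        residual = np.asarray(detection, dtype=float) - self.predict()
        self.center = self.predict() + alpha * residual
        self.velocity = self.velocity + beta * residual

def associate(tracks, detections, max_dist=2.0):
    """Greedily match each predicted track to the nearest unclaimed detection
    within `max_dist` meters; returns {track_index: detection_index}."""
    unmatched = list(range(len(detections)))
    matches = {}
    for t_idx, track in enumerate(tracks):
        if not unmatched:
            break
        dists = [np.linalg.norm(track.predict() - detections[d]) for d in unmatched]
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            matches[t_idx] = unmatched.pop(best)
    return matches
```

A production tracker would also handle track birth/death and could use optimal (Hungarian) assignment instead of the greedy loop.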
Our tracker is an implementation variant of [62], using detector boxes for association and a Kalman filter for state updates.

4.3. Object Track Data Extraction

Given tracked detection boxes for an object, we can extract object-specific LiDAR point clouds from the sequence. We use the term object track data to refer to such 4D (3D spatial and 1D temporal) object information.

To extract object track data, we first transform all boxes and point clouds to the world coordinate through the known sensor poses to remove the ego-motion. For each unique object (according to the object ID), we crop its object points within the estimated detector boxes (enlarged by α meters in each direction to include more context). Such extraction gives us a sequence of object point clouds $\{P_{j,k}\}, k \in S_j$, for each object $j$ and its visible frames $S_j$. Fig. 3 visualizes the object points for several vehicles. Besides the raw point clouds, we also extract the tracked boxes for each object at every frame, $\{B_{j,k}\}, k \in S_j$, in the world coordinate.

Figure 4. The static object auto labeling model. Taking as input the merged object points in the world coordinate, the model outputs a single box for the static object.

4.4. Object-centric Auto Labeling

In this section, we describe how we take the object track data to "auto label" the objects. As illustrated in Fig. 3, the process includes three sub-modules: the track-based motion state classification, static object auto labeling and dynamic object auto labeling, which are described in detail below.

Divide and conquer: motion state estimation. In the real world, lots of objects are completely static during a period of time.
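As an aside, the per-object cropping of Sec. 4.3 can be sketched as follows (an illustrative NumPy snippet, not the paper's code; the box parameterization with a single yaw heading and the default α value are our assumptions): points are expressed in the box coordinate, where the +X axis is the heading direction and the origin is the box center, and kept if they fall inside the enlarged box.

```python
import numpy as np

def crop_object_points(points_world, box_center, box_size, heading, alpha=0.5):
    """Return points inside a detector box enlarged by `alpha` meters per side.

    points_world: (n, 3) points with ego-motion removed (world coordinate).
    box_center:   (3,) box center.
    box_size:     (3,) (length, width, height).
    heading:      box yaw around the Z-axis, in radians.
    The default alpha is an illustrative placeholder, not the paper's value.
    """
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Box coordinate: +X is the heading direction, origin at the box center.
    # (points - center) @ rot equals applying the inverse rotation rot.T.
    local = (points_world - box_center) @ rot
    half = np.asarray(box_size) / 2.0 + alpha
    mask = np.all(np.abs(local) <= half, axis=1)
    return points_world[mask]
```

Running this per frame over a track's boxes yields the object point cloud sequence $\{P_{j,k}\}$ described above.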
For example, parked cars or furniture in a room do not move within a few minutes or hours. For offboard detection, it is preferable to assign a single 3D bounding box to a static object rather than separate boxes in different frames, to avoid jittering.

Based on this observation, we take a divide-and-conquer approach to handle static and moving objects differently, introducing a module to classify an object's motion state (static or not) before the auto labeling. While it could be hard to predict an object's motion state from just a few frames (due to perception noise), we find it relatively easy if all object track data is used. As the visualization in Fig. 3 shows, it is often obvious to tell whether an object is static or not from its trajectory. A linear classifier using a few heuristic features from the object track's boxes can already achieve 99%+ motion state classification accuracy for vehicles. More details are in Sec. E.

Static object auto labeling. For a static object, the model takes the merged object point clouds ($P_j = \cup \{P_{j,k}\}$ in the world coordinate) from points at different frames and predicts a single box. The box can then be transformed to each frame through the known sensor poses.

Fig. 4 illustrates our proposed model for static object auto labeling. Similar to [46, 51], we first transform (through rotation and translation) the object points to a box coordinate before the per-object processing, such that the point clouds are more aligned across objects. In the box coordinate, the +X axis is the box heading direction and the origin is the box center. Since we have the complete sequence of the detector boxes, we have multiple options for which box to use as the initial box. The choice actually has a significant impact on model performance. Empirically, using the box with the highest detector score leads to the best performance (see Sec. I for an ablation study).

To attend to the object, the object points are passed through an instance segmentation network to segment the foreground ($m$ foreground points are extracted by the mask). Inspired by Cascade-RCNN [6], we iteratively regress the object's bounding box. At test time, we can further improve box regression accuracy by test-time augmentation (similar to Sec. 4.1).

All networks are based on the PointNet [47] architecture. The model is supervised by the segmentation and box estimation ground truths. Details of the architecture, losses and the training process are described in Sec. F.

Dynamic object auto labeling. For a moving object, we need to predict different 3D bounding boxes for each frame. Due to the sequence input/output, the model design space is much larger than that for static objects. A baseline is to re-estimate the 3D bounding box with cropped point clouds. Similar to the smoothing in tracking, we can also refine boxes based on the sequence of the detector boxes. Another choice is to "align" or register object points with respect to a key frame (e.g. the current frame) to obtain a denser point cloud for box estimation. However, the alignment can be a harder problem than box estimation, especially for occluded or faraway objects with fewer points. Besides, it is challenging to align deformable objects like pedestrians.

We propose a design (Fig. 5) that leverages both the point cloud and the detector box sequences without aligning points to a key frame explicitly. Given a sequence of object point clouds $\{P_{j,k}\}$ and a sequence of detector boxes $\{B_{j,k}\}$ for the object $j$ at frames $k \in S_j$, the model predicts the object box at each frame $k$ in a sliding-window form. It consists of two branches, one taking the point sequence and the other taking the box sequence.

Figure 5. The dynamic object auto labeling model. Taking a sequence of object points and a sequence of object boxes, the model runs in a sliding window fashion and outputs a refined 3D box for the center frame. Input point and box colors represent frames.

For the point cloud branch, the model takes a sub-sequence of the object point clouds $\{P_{j,k}\}_{k=T-r}^{T+r}$. After adding a temporal encoding channel to each point (similar to [18]), the sub-sequence points are merged through union and transformed to the box coordinate of the detector box $B_{j,T}$ at the center frame. Following that, we have a PointNet [47] based segmentation network to classify the foreground points (of the $2r+1$ frames) and then encode the object points into an embedding through another point encoder network.

For the box sequence branch, the box sequence $\{B_{j,k}\}_{k=T-s}^{T+s}$ of $2s+1$ frames is transformed to the box coordinate of the detector box at frame $T$. Note that the box sub-sequence can be longer than the point sub-sequence to capture the longer trajectory shape. A box sequence encoder network (a PointNet variant) will then encode the box sequence into a trajectory embedding, where each box is a point with 7-dim geometry and 1-dim time encoding.

Next, the computed object embedding and the trajectory embedding are concatenated to form the joint embedding, which is then passed through a box regression network to predict the object box at frame $T$.

5. Experiments

We start the section by comparing our offboard 3D Auto Labeling with state-of-the-art 3D object detectors in Sec. 5.1. In Sec. 5.2 we compare the auto labels with the human labels. In Sec. 5.3, we show how the auto labels can be used to supervise a student model to achieve improved performance in the low-label regime or in another domain. We provide analysis of the multi-frame detector in Sec. 5.4, analysis experiments to validate our designs of the object-centric auto labeling models in Sec. 5.5, and finally visualize the results in Sec. 5.6.

Dataset. We evaluate our approach using the challenging Waymo Open Dataset (WOD) [57], as it provides a large collection of LiDAR sequences with 3D labels available for each frame. The dataset includes a total of 1150 sequences, with 798 for training, 202 for validation and 150 for testing.
Each LiDAR sequence lasts around 20 seconds with a sampling frequency of 10 Hz. For our experiments, we evaluate both 3D and bird's eye view (BEV) object detection metrics for vehicles and pedestrians.

Method | frames | Vehicle 3D AP (IoU=0.7 / 0.8) | Vehicle BEV AP (IoU=0.7 / 0.8) | Pedestrian 3D AP (IoU=0.5 / 0.6) | Pedestrian BEV AP (IoU=0.5 / 0.6)
StarNet [41] | 1 | 53.70 / - | - / - | 66.80 / - | - / -
PointPillar [24]* | 1 | 60.25 / 27.67 | 78.14 / 63.79 | 60.11 / 40.35 | 65.42 / 51.71
Multi-view fusion (MVF) [76] | 1 | 62.93 / - | 80.40 / - | 65.33 / - | 74.38 / -
AFDET [14] | 1 | 63.69 / - | - / - | - / - | - / -
ConvLSTM [19] | 4 | 63.60 / - | - / - | - / - | - / -
RCD [5] | 1 | 68.95 / - | 82.09 / - | - / - | - / -
PillarNet [61] | 1 | 69.80 / - | 87.11 / - | 72.51 / - | 78.53 / -
PV-RCNN [50]* | 1 | 70.47 / 39.16 | 83.43 / 69.52 | 65.34 / 45.12 | 70.35 / 56.63
Single-frame MVF++ (Ours) | 1 | 74.64 / 43.30 | 87.59 / 75.30 | 78.01 / 56.02 | 83.31 / 68.04
Multi-frame MVF++ w. TTA (Ours) | 5 | 79.73 / 49.43 | 91.93 / 80.33 | 81.83 / 60.56 | 85.90 / 73.00
3D Auto Labeling (Ours) | all | 84.50 / 57.82 | 93.30 / 84.88 | 82.88 / 63.69 | 86.32 / 75.60

Table 1. 3D object detection results for vehicles and pedestrians on the Waymo Open Dataset val set. Methods in comparison include prior state-of-the-art single-frame 3D detectors as well as our single-frame MVF++, our multi-frame MVF++ (5 frames) and our full 3D Auto Labeling pipeline. The metrics are L1 3D AP and bird's eye view (BEV) AP at two IoU thresholds: the common standard IoU=0.7 and a high standard IoU=0.8 for vehicles; and IoU=0.5, 0.6 for pedestrians. (*) reproduced results using the authors' released code.
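For reference, the AP numbers reported in Table 1 follow the usual average-precision recipe, sketched below as a generic illustration (not the official WOD evaluation, which additionally defines difficulty breakdowns such as L1 and its own matching rules): detections are sorted by confidence, precision and recall are accumulated, and precision is integrated over recall.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP from scored detections at a fixed IoU threshold.

    scores: detection confidences.
    is_tp:  whether each detection matched a ground-truth box with IoU
            above the threshold (one match per ground truth).
    num_gt: number of ground-truth boxes.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_gt
    # Integrate precision over recall (area under the PR curve).
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

For example, three detections scored [0.9, 0.8, 0.7] with match flags [True, False, True] against two ground-truth boxes give an AP of 5/6 under this recipe.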
5.1. Comparing with State-of-the-art Detectors

In Table 1, we show comparisons of our 3D object detectors and the 3D Auto Labeling with various single-frame and multi-frame detectors, under both the common-standard IoU threshold and a higher-standard IoU threshold to pressure test the models.

We show that our single-frame MVF++ already outperforms the prior art single-frame detector PV-RCNN [50]. The multi-frame version of the MVF++, as a baseline of the offboard 3D detection methods, significantly improves upon the single-frame MVF++ thanks to the extra information from the context frames.

For vehicles, comparing the last three rows, our complete 3D Auto Labeling pipeline, which leverages the multi-frame MVF++ and the object-centric auto labeling models, further improves the detection quality, especially at the higher-standard IoU threshold of 0.8. It improves the 3D AP@0.8 significantly, by 14.52 points compared to the single-frame MVF++ and by 8.39 points compared to the multi-frame MVF++, which is already very powerful by itself. These results show the great potential of leveraging long sequences of point clouds for offboard perception.

We also show the detection AP for the pedestrian class, where we consistently observe the leading performance of the 3D Auto Labeling pipeline, especially at the higher localization standard (IoU=0.6), with a 7.67-point gain compared to the single-frame MVF++ and a 3.13-point gain compared to the multi-frame MVF++.

Method | 3D AP (IoU=0.7 / 0.8) | BEV AP (IoU=0.7 / 0.8)
Human | 86.45 / 60.49 | 93.86 / 86.27
3DAL (Ours) | 85.37 / 56.93 | 92.80 / 87.55

Table 2. Comparing human labels and auto labels in 3D object detection. The metrics are 3D and BEV APs for vehicles on the 5 sequences from the Waymo Open Dataset val set. Human APs are computed by comparing them with the WOD's released ground truth and using the number of points in boxes as human label scores.

5.2. Comparing with Human Labels

In many perception domains such as image classification and speech recognition, researchers have collected data to understand humans' capability [49, 11, 33]. However, to the best of our knowledge, no such study exists for 3D recognition, especially for 3D object detection. To fill this gap, we conducted a small-scale human label study on the Waymo Open Dataset to understand the capability of humans in recognizing objects in a dynamic 3D scene. We randomly selected 5 sequences from the Waymo Open Dataset val set and asked three experienced labelers to re-label each sequence independently (with the same labeling protocol as WOD).

In Table 2, we report the mean AP of human labels and auto labels across the 5 sequences.
With the common 3D AP@0.7 (L1) metric, the auto labels are only around 1 point lower than the average labeler, although the gap is slightly larger under the more strict 3D AP@0.8 metric. With some visualization, we found the larger gap is mostly caused by inaccurate heights. The comparison with the BEV AP@0.8 metric verifies our observation: when we do not consider height, the auto labels even outperform the average human labels by 1.28 points.

With such high quality, we believe the auto labels can be used to pre-label point cloud sequences to assist and accelerate human labeling, or be used directly to train lightweight student models as shown in the following section.

5.3. Applications to Semi-supervised Learning

In this section, we study the effectiveness of our auto labeling pipeline in the task of semi-supervised learning to train a student model under two settings: intra-domain and cross-domain. We choose the student model as a single-frame MVF++ detector that can run in real-time.

For the intra-domain semi-supervised learning, we randomly select 10% of the sequences (79) in the main WOD training set to train our 3D Auto Labeling (3DAL) pipeline. Once trained, we apply it to the remaining 90% of the sequences (719) in the main training set to generate "auto labels" (we only keep boxes with scores higher than 0.1). In Table 3 (first two rows), we see that reducing the human annotations to 10% significantly lowers the student model's performance. However, when we use auto labels, the student model trained on 10% human labels and 90% auto labels gets similar performance compared to using 100% human labels (AP gaps smaller than 1 point), demonstrating the superb data efficiency auto labels can provide.

Training Data | Test Data | 3D AP | BEV AP
100% main train (Human) | main val | 71.2 | 86.9
10% main train (Human) | main val | 64.3 | 81.2
10% main train (Human) + 90% main train (3DAL) | main val | 70.0 | 86.4
100% main train (Human) | domain test | 59.4 | N/A
100% main train (Human) + domain (Self Anno.) | domain test | 60.3 | N/A
100% train (Human) + domain (3DAL) | domain test | 64.2 | N/A

Table 3. Results of semi-supervised learning with auto labels. Metrics are 3D and BEV AP for vehicles on the Waymo Open Dataset. The type of annotation is reported in parentheses. Please note, test set BEV AP is not provided by the submission server.

For the cross-domain semi-supervised learning, the teacher auto labels data from an unseen domain. The teacher is trained on the main WOD train set, and auto labels the domain adaptation WOD train and unlabeled sets (separate 680 sequences from the main WOD). The student is then trained on the union of these three sets. Evaluations are on the domain adaptation test set. The last three rows of Table 3 show the results. Without using any data from the new domain, the student gets an AP of 59.4. While using the student to self-label slightly helps (improves the results by ~1 point), using our 3DAL to auto label the new domain significantly improves the student AP by ~5 points.

5.4. Analysis of the Multi-frame Detector

Table 4 shows the ablations of our proposed MVF++ detectors. We see that the offboard techniques, such as the model capacity increase (+3.08 AP@0.7), using point clouds from 5 frames as input (+1.70 AP@0.7) and test-time augmentation (+3.39 AP@0.7), are all very effective in improving the detection quality.

anchor-free | cap. increase | seg loss | 5-frame | TTA | AP@0.7/0.8
✓ | - | - | - | - | 71.20 / 39.70
✓ | ✓ | - | - | - | 74.28 / 42.91
✓ | ✓ | ✓ | - | - | 74.64 / 43.30
✓ | ✓ | ✓ | ✓ | - | 76.34 / 45.57
✓ | ✓ | ✓ | ✓ | ✓ | 79.73 / 49.43

Table 4. Ablation studies on the improvements to the 3D detector MVF [76]. Metrics are 3D AP (L1) at IoU thresholds 0.7 and 0.8 for vehicles on the Waymo Open Dataset val set.

Table 5 shows how the number of consecutive input frames impacts the detection APs. The gains of adding frames quickly diminish as the number of frames increases: e.g. while the AP@0.8 improves by 0.81 from 1 to 2 frames, the gain from 4 to 5 frames is only 0.14 point.

#frames | 1 | 2 | 3 | 4 | 5 | 10
AP@0.7 | 74.64 | 75.32 | 75.63 | 76.17 | 76.34 | 76.96
AP@0.8 | 43.30 | 44.11 | 44.80 | 45.43 | 45.57 | 46.20

Table 5. Ablation studies on 3D detection AP vs. temporal contexts. Metrics are 3D AP (L1) for vehicles on the Waymo Open Dataset val set. We used the 5-frame model in 3D Auto Labeling.

5.5. Analysis of Object Auto Labeling Models

We evaluate the object auto labeling models using the box accuracy metric under two IoU thresholds, 0.7 and 0.8, on the Waymo Open Dataset val set. A predicted box is considered correct if its IoU with the ground truth is higher than the threshold. More analysis is in Sec. I.

Ablations of the static object auto labeling. In Table 6 we can see the importance of the initial coordinate transform (to the box coordinate) and of the foreground segmentation network in the first 3 rows. In the 4th and 5th rows, we see the gains of using iterative box re-estimation and test-time augmentation, respectively.

transform | segmentation | iterative | tta | Acc@0.7/0.8
- | - | - | - | 78.82 / 50.90
✓ | - | - | - | 81.35 / 54.76
✓ | ✓ | - | - | 81.37 / 55.67
✓ | ✓ | ✓ | - | 82.02 / 56.77
✓ | ✓ | ✓ | ✓ | 82.28 / 56.92

Table 6. Ablation studies of the static auto labeling model. Metrics are the box accuracy at 3D IoU=0.7 and IoU=0.8 for vehicles on the Waymo Open Dataset val set.

Alternative designs of the dynamic object auto labeling. Table 7 ablates the design of the dynamic object auto labeling model. For the align & refine model, we use the multi-frame MVF++ detector boxes to "align" the object point clouds from the nearby frames ([−2,+2]) to the center frame. For each context frame, we transform the coordinate by aligning the center and heading of the context frame boxes to the center frame box. The model using unaligned point clouds (in the center frame's coordinate, from [−2,+2] context frames), second row, actually gets higher accuracy than the aligned one. The model taking only the box sequence (third row) as input also performs reasonably well, by leveraging the trajectory shape and the box sizes. Our model jointly using the multi-frame object point clouds and the box sequences gets the best accuracy.

Method | Acc@0.7/0.8
Align & refine | 83.33 / 60.69
Points only | 83.79 / 61.95
Box sequence only | 83.13 / 58.96
Points and box sequence joint | 85.67 / 65.77

Table 7. Comparing with alternative designs of dynamic object auto labeling. Metrics are box accuracy with 3D IoU thresholds 0.7 and 0.8 for vehicles on the Waymo Open Dataset val set.

Effects of temporal context sizes for object auto labeling. Table 8 studies how the context frame sizes influence the box prediction accuracy. We also compare with our single-frame (S-MVF++) and multi-frame (M-MVF++) detectors to show the extra gains the object auto labeling can bring. We can clearly see that using large temporal contexts improves the performance, while using the entire object track (the last row) leads to the best performance. Note that for the static object model, we use the detector box with the highest score for the initial coordinate transform, which gives our auto labeling an advantage over the frame-based methods.

Method | Context frames | Static Acc@0.7/0.8 | Dynamic Acc@0.7/0.8
S-MVF++ | [−0,+0] | 67.17 / 36.61 | 80.07 / 57.71
M-MVF++ | [−4,+0] | 73.96 / 43.56 | 82.21 / 59.52
3DAL | [−0,+0] | 78.13 / 50.30 | 80.65 / 57.97
3DAL | [−2,+2] | 79.60 / 52.52 | 84.34 / 63.60
3DAL | [−5,+5] | 80.48 / 55.02 | 85.10 / 64.51
3DAL | all | 82.28 / 56.92 | 85.67 / 65.77

Table 8. Effects of temporal context sizes for object auto labeling. Metrics are the box accuracy at 3D IoU=0.7, 0.8 for vehicles in the WOD val set. Dynamic vehicles have a higher accuracy because they are closer to the sensor than static ones.

5.6. Qualitative Analysis

In Fig. 6, we visualize the auto labels for two representative scenes in autonomous driving: driving on a road with parked cars, and passing a busy intersection. Our model is able to accurately recognize vehicles and pedestrians in challenging cases with occlusions and very few points. The busy intersection scene also shows a few failure cases, including false negatives of pedestrians in rare poses (sitting), false negatives of severely occluded objects and false positives for objects with geometry similar to cars. Those hard cases can potentially be solved with added camera information through multi-modal learning.

Figure 6. Visualization of 3D auto labels on the Waymo Open Dataset val set (best viewed in color with zoom in). Object points are colored by object type, with blue for static vehicles, red for moving vehicles and orange for pedestrians. Boxes are colored as: green for true positive detections, red for false positives and cyan for ground truth boxes in the cases of false negatives.

6. Conclusion

In this work we have introduced 3D Auto Labeling, a state-of-the-art offboard 3D object detection solution using point cloud sequences as input. The pipeline leverages the long-term temporal data of objects in the 3D scene. Key to our success are our object-centric formulation, our powerful offboard multi-frame detector and our novel object auto labeling models. Evaluated on the Waymo Open Dataset, our solution has shown significant gains over prior art onboard 3D detectors, especially on high-standard metrics. A human label study has further shown the high quality of the auto labels, reaching comparable performance to experienced humans. Moreover, the semi-supervised learning experiments have demonstrated the usefulness of the auto labels for student training in cases of low labels and unseen domains.

References

flownet for scene flow estimation on large-scale point clouds.
[1] Waymo open dataset: 3d detection challenge. https://waymo.com/open/challenges/3d-detection/. Accessed: 2021-01-25.
[2] Waymo open dataset: Domain adaptation challenge. https://waymo.com/open/challenges/domain-adaptation/. Accessed: 2021-01-25.
[3] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859-868, 2018.
[4] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9297-9307, 2019.
[5] Alex Bewley, Pei Sun, Thomas Mensink, Dragomir Anguelov, and Cristian Sminchisescu. Range conditioned dilated convolutions for scale invariant 3d object detection, 2020.
[6] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, 2018.
[7] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5230-5238, 2017.
[8] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
[9] Y. Chen, S. Liu, X. Shen, and J. Jia. Fast point r-cnn. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9774-9783, 2019.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
[11] Neeraj Deshmukh, Richard Jennings Duncan, Aravind Ganapathiraju, and Joseph Picone. Benchmarking human performance for continuous speech recognition. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), volume 4, pages 2486-2489. IEEE, 1996.
[12] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355-1361, May 2017.
[13] Di Feng, Xiao Wei, Lars Rosenbaum, Atsuto Maki, and Klaus Dietmayer. Deep active learning for efficient training of a lidar 3d object detector. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 667-674. IEEE, 2019.
[14] Runzhou Ge, Zhuangzhuang Ding, Yihan Hu, Yu Wang, Sijia Chen, Li Huang, and Yuan Li. Afdet: Anchor free one stage 3d object detection, 2020.
[15] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254-3263, 2019.
[16] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. What you see is what you get: Exploiting visibility for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[19] Rui Huang, Wanyue Zhang, Abhijit Kundu, Caroline Pantofaru, David A. Ross, Thomas A. Funkhouser, and Alireza Fathi. An LSTM approach to temporal 3d object detection in lidar point clouds. CoRR, 2020.
[20] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5070-5079, 2019.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, 2014.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84-90, May 2017.
[23] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1-8. IEEE, 2018.
[24] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
[25] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.
[26] Jungwook Lee, Sean Walsh, Ali Harakeh, and Steven L Waslander. Leveraging pre-trained 3d object detection models for fast ground truth generation. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2504-2510. IEEE, 2018.
[27] B. Li. 3d fully convolutional network for vehicle detection in point cloud. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1513-1518, Sep. 2017.
[28] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3d lidar using fully convolutional network. In RSS, 2016.
[29] Peiliang Li, Jieqi Shi, and Shaojie Shen. Joint spatial-temporal optimization for stereo 3d object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6877-6886, 2020.
[30] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun. Multi-task multi-sensor fusion for 3d object detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7337-7345, 2019.
[31] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In ECCV, 2018.
[32] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5257-5266, 2019.
[33] Richard P Lippmann. Speech recognition by machines and humans. Speech Communication, 22(1):1-15, 1997.
[34] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529-537, 2019.
[35] Xingyu Liu, Mengyuan Yan, and Jeannette Bohg. Meteornet: Deep learning on dynamic 3d point cloud sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9246-9255, 2019.
[36] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[37] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Luc Van Gool, and Dengxin Dai. Weakly supervised 3d object detection from lidar point cloud. arXiv preprint arXiv:2007.11901, 2020.
[38] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez. Sensor fusion for joint 3d object detection and semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1230-1237, 2019.
[39] Gregory P. Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K. Wellington. LaserNet: An efficient probabilistic 3D object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[40] Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11177-11185, 2020.
[41] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, Zhifeng Chen, Jonathon Shlens, and Vijay Vasudevan. Starnet: Targeted computation for object detection in point clouds. CoRR, 2019.
[42] Joshua Owoyemi and Koichi Hashimoto. Spatiotemporal learning of dynamic gestures from 3d point cloud data. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1-5. IEEE, 2018.
[43] Lukas Prantl, Nuttapong Chentanez, Stefan Jeschke, and Nils Thuerey. Tranquil clouds: Neural networks for learning temporally coherent features in point clouds. arXiv preprint arXiv:1907.05279, 2019.
[44] Charles R Qi, Xinlei Chen, Or Litany, and Leonidas J Guibas. Imvotenet: Boosting 3d object detection in point clouds with image votes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4404-4413, 2020.
[45] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 9277-9286, 2019.
[46] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In CVPR, 2018.
[47] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
[48] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
[49] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[50] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529-10538, 2020.
[51] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. arXiv preprint arXiv:1812.04244, 2018.
[52] Weijing Shi and Ragunathan (Raj) Rajkumar. Point-gnn: Graph neural network for 3d object detection in a point cloud. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[53] Martin Simony, Stefan Milzy, Karl Amendey, and Horst-Michael Gross. Complex-yolo: An euler-region-proposal for real-time 3d object detection on point clouds. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[54] Vishwanath A. Sindagi, Yin Zhou, and Oncel Tuzel. Mvx-net: Multimodal voxelnet for 3d object detection. In International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, May 20-24, 2019, pages 7276-7282. IEEE, 2019.
[55] Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: ensembling boxes for object detection models. arXiv preprint arXiv:1910.13302, 2019.
[56] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, 2016.
[57] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446-2454, 2020.
[58] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proc. Int. Conf. Computer Vision (ICCV), 2019.
[59] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
[60] Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal networks hard? arXiv preprint arXiv:1905.12681, 2019.
[61] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Thomas Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. In ECCV, 2020.
[62] Xinshuo Weng and Kris Kitani. A baseline for 3d multi-object tracking. arXiv preprint arXiv:1907.03961, 2019.
[63] Xinshuo Weng, Jianren Wang, Sergey Levine, Kris Kitani, and Nicholas Rhinehart. 4d forecasting: Sequential forecasting of 100,000 points, 2020.
[64] Xinshuo Weng, Ye Yuan, and Kris Kitani. Joint 3d tracking and forecasting with graph neural network and diversity sampling. arXiv preprint arXiv:2003.07847, 2020.
[65] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687-10698, 2020.
[66] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. PointFusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[67] I Zeki Yalniz, Herve Jegou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
[68] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[69] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652-7660, 2018.
[70] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11040-11048, 2020.
[71] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Ipod: Intensive point-based object detector for point cloud, 2018.
[72] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia. Std: Sparse-to-dense 3d object detector for point cloud. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1951-1960, 2019.
[73] M. Ye, S. Xu, and T. Cao. Hvnet: Hybrid voxel network for lidar based 3d object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1628-1637, 2020.
[74] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang. Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11492-11501, 2020.
[75] Sergey Zakharov, Wadim Kehl, Arjun Bhargava, and Adrien Gaidon. Autolabeling 3d objects with differentiable rendering of sdf shape priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12224-12233, 2020.
[76] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pages 923-932, 2020.
[77] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, 2018.
[78] Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE International Conference on Computer Vision, pages 5982-5991, 2019.

Appendix

A. Overview

In this document, we provide more details of models and experiments and show more analysis results. Sec. B presents more evaluation results on the Waymo Open Dataset test set and shows how our offboard 3D detection can help domain adaptation and 3D tracking. Sec. C explains more details of the MVF++ detectors. Sec. D and Sec. E describe implementation details of our multi-object tracker and track-based motion state classifier respectively. Sec. F covers network architectures, losses and training details of the object auto labeling models. Sec. G describes the specifics of the human label study for 3D object detection and provides more statistics. Sec. H provides more information about the semi-supervised learning experiments. Lastly, Sec. I gives more analysis results supplementary to the main paper.

B. More Evaluation Results

B.1. 3D Detection Results on the Test Set

In Table 9 we report detection results on the Waymo Open Dataset test set, comparing our pipeline with a few leading methods on the leaderboard [1]. Note that our pipeline achieves the best results among Lidar-only methods. It also outperforms HorizonLidar3D, which uses both camera and Lidar input, in the L1 metrics. We expect that adding camera input can further improve our pipeline in hard cases (L2).

  Method          Sensor  AP L1  APH L1  AP L2  APH L2
  PV-RCNN         L       81.06  80.57   73.69  73.23
  CenterPoint     L       81.05  80.59   73.42  72.99
  HorizonLidar3D  CL      85.09  84.68   78.23  77.83
  3DAL (ours)     L       85.84  85.46   77.24  76.91
Table 9. 3D detection AP on the Waymo Open Dataset main test set for vehicles. Evaluation results were obtained from submitting to the test server. For the sensor, 'L' means Lidar-only; 'CL' means camera and Lidar. Note that our method peeks into the future for object-centric refinement, which is feasible in the offboard setting.

B.2. Domain Adaptation Results

In Table 10 we report detection results in another domain and compare our 3D Auto Labeling (3DAL) pipeline with two baselines: the popular PointPillars [24] detector and our offboard multi-frame MVF++ detector. We see that our 3D Auto Labeling pipeline achieves significantly higher detection APs than the baselines (32.56 higher 3D AP than PointPillars and 8.03 higher 3D AP than the multi-frame MVF++), showing the strong generalization ability of our models. Compared to a few leading methods on the leaderboard [2], our method also shows significant gains. These large gains are probably due to the temporal information aggregation, which compensates for the lower point densities in the WOD domain adaptation set (collected in Kirkland with mostly rainy weather).

  Method             3D AP  0-30m  30-50m  50m+
  PointPillar        45.48  74.02  36.49   14.94
  Multi-frame MVF++  70.01  86.54  67.72   43.25
  PV-RCNN-DA         71.40  90.00  66.45   45.92
  CenterPoint        67.04  86.62  60.95   38.59
  HorizonLidar3D     72.48  90.65  67.26   47.89
  3DAL (ours)        78.04  91.90  73.47   52.53
Table 10. 3D detection AP on the Waymo Open Dataset domain adaptation test set for vehicles. The PointPillar, MVF++ and 3DAL models were trained by us on the Waymo Open Dataset main train set. Evaluation results were obtained from submitting to the test server. The PV-RCNN-DA, CenterPoint and HorizonLidar3D results are leading entries from the leaderboard [2].

B.3. 3D Tracking Results

In Table 11 we show how our improved box estimation from the offboard 3D Auto Labeling enhances the tracking performance, compared to using the boxes from the single-frame or multi-frame detectors. All methods used the same tracker (Sec. D). This reflects that in the tracking-by-detection paradigm, the localization accuracy plays an important role in determining tracking quality in terms of MOTA and MOTP.

  Method                       MOTA (higher is better)  MOTP (lower is better)
  Single-frame MVF++ with KF   52.20                    17.08
  Multi-frame MVF++ with KF    61.92                    16.31
  3D Auto Labeling             66.90                    15.45
Table 11. 3D tracking results for vehicles on the Waymo Open Dataset val set. The metrics are L1 MOTA and MOTP. KF stands for using Kalman filtering for the track state update.

C. Implementation Details of the MVF++ Detectors

Figure 7. Point-wise feature fusion network of MVF++. Given an input point cloud encoding of shape N x C, the network maps it to a high-dimensional feature space and extracts contextual information from different views, i.e. the Bird's Eye View and the Perspective View. It fuses view-dependent features by concatenating information from three sources. The final output has shape N x 144, as a result of concatenating dimension-reduced point features of shape N x 128 with 3D segmentation features of shape N x 16.

Network Architecture.  Figure 7 illustrates the point-wise feature fusion network within the proposed MVF++. Given the C-dimensional input encoding of N points [76], the network first projects the points into a 128-D feature space via a multi-layer perceptron (MLP), where shape information can be better described. The MLP is composed of a linear layer, a batch normalization (BN) layer and a rectified linear unit (ReLU) layer. Then it processes the features by two separate MLPs for view-dependent information extraction, i.e. one for the Bird's Eye View and one for the Perspective View [76]. Next, the network employs voxelization [76] to transform view-dependent point features into the corresponding 2D feature maps, which are fed to view-dependent ConvNets (i.e. ConvNet_b and ConvNet_p) to further extract contextual information within an enlarged receptive field. Different from MVF [76], which uses one ResNet [17] layer to obtain each down-sampled feature map, we increase the depth of ConvNet_b and ConvNet_p by applying one more ResNet block in each down-sampling branch. At the end of the view-dependent processing, it applies devoxelization to transform the 2D feature maps back to point-wise features. The model fuses point-wise features by concatenating the three sources of information. To reduce computational complexity, it applies two MLPs consecutively, reducing the feature dimension to 128. To improve the discriminative capability of the features, it introduces a 3D segmentation auxiliary loss and augments the dimension-reduced features with segmentation features. The output of the point-wise feature fusion network has shape N x 144.

Upon obtaining point-wise features, we voxelize them into a 2D feature map and employ a backbone network to generate detection results. Specifically, we adopt the same architecture as in [24, 76]. To further boost detection performance in the offboard setting, we replace each plain convolution layer with a ResNet [17] layer maintaining the same output feature dimension and feature map resolution.

Loss Function.  We train MVF++ by minimizing a loss function defined as L = L_cls + w_1 L_centerness + w_2 L_reg + w_3 L_seg. L_cls and L_centerness are the focal loss and the centerness loss as in [58]. L_reg is a Smooth L1 loss, learning to regress the x, y, z center locations, length, width, height and heading orientation at foreground pixels, as in [24, 76]. L_seg is the auxiliary 3D segmentation loss for distinguishing foreground from background points (points are labeled as foreground/background if they lie inside/outside of a ground truth 3D box) [51, 46]. In our experiments, we set w_1 = 1.0, w_2 = 2.0, w_3 = 1.0. At inference time, the final score for ranking all detected boxes is computed as the multiplication of the classification score and the centerness score. By doing so, the centerness score can downplay boxes far away from an object center and thus encourage non-maximum suppression (NMS) to yield high-quality boxes, as recommended in [58].

Data Augmentation.  We perform three global augmentations that are applied to the LiDAR point cloud and the ground truth boxes simultaneously [77]. First, we apply a random flip along the x axis with probability 0.5. Then, we employ a random global rotation and scaling, where the rotation angle and the scaling factor are drawn uniformly from [-pi/4, +pi/4] and [0.9, 1.1], respectively. Finally, we add a global translation noise to x, y, z drawn from N(0, 0.6).

Hyperparameters.  For vehicles, we set the voxel size to [0.32, 0.32, 6.0] m and the detection range to [-74.88, 74.88] m along the X and Y axes and [-2, 4] m along the Z axis, which results in a 468 x 468 2D feature map in the Bird's Eye View. For pedestrians, we set the voxel size to [0.24, 0.24, 4.0] m and the detection range to [-74.88, 74.88] m along the X and Y axes and [-1, 3] m along the Z axis, which corresponds to a 624 x 624 2D feature map in the Bird's Eye View. During test-time augmentation, we set the (IoU threshold, box score) to (0.275, 0.5) for vehicles and (0.2, 0.5) for pedestrians, to trigger weighted box fusion [55].

Training.  During training, we use the Adam optimizer [21] and apply cosine decay to the learning rate. The initial learning rate is set to 1.33e-3 and ramps up to 3.0e-3 after 1000 warm-up steps. The training used 64 TPUs with a global batch size of 128 and finished after 43,000 steps.
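The three global augmentations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the (N, 3) point and (M, 7) box layouts, interpreting N(0, 0.6) as a standard deviation, and mirroring the y coordinate for the "flip along the x axis" are our own choices.

```python
import numpy as np

def global_augment(points, boxes, rng):
    """Sketch of the three global augmentations, applied jointly to
    LiDAR points (N, 3) and boxes (M, 7) with rows
    [cx, cy, cz, length, width, height, heading]."""
    points = points.copy()
    boxes = boxes.copy()
    # 1) Random flip along the x axis with probability 0.5
    #    (assumed here to mean mirroring y; headings negate).
    if rng.random() < 0.5:
        points[:, 1] *= -1.0
        boxes[:, 1] *= -1.0
        boxes[:, 6] *= -1.0
    # 2) Global rotation in [-pi/4, +pi/4] and scaling in [0.9, 1.1].
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    scale = rng.uniform(0.9, 1.1)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle
    points *= scale
    boxes[:, :6] *= scale  # centers and sizes scale; headings do not
    # 3) Global translation noise on x, y, z drawn from N(0, 0.6)
    #    (0.6 treated as the standard deviation).
    shift = rng.normal(0.0, 0.6, size=3)
    points += shift
    boxes[:, :3] += shift
    return points, boxes
```

Passing a seeded `np.random.default_rng` makes the augmentation reproducible, which is convenient when pairing augmented clouds with cached labels.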
D. Implementation Details of the Tracker

Our multi-object tracker is a similar implementation to [62]. To reduce the impact of sensor ego-motion on tracking, we transform all the boxes to the world coordinate for tracking. To reduce false positives, we also filter out all detections with scores less than 0.1 before the tracking. We use Bird's Eye View (BEV) boxes for detection and track association, using the Hungarian algorithm with an IoU threshold of 0.1. During the state updates, the heading is handled specially, as there can be flips and cyclic patterns. Before updating the heading state, we first adjust the detection heading to align with the track state heading: if the angle difference is obtuse, we add pi to the detection angle before the update. We also average the angles in the cyclic space (e.g. the average of 6 rad and 0.5 rad is 0.1084 rather than 3.25).

E. Implementation Details of the Motion State Estimator

As we introduced in the main paper, we use the object track data for motion state estimation, which is much easier than classifying the static/non-static state from a single or a few frames. Note that we define an object as static only if it is stationary in the entire sequence. Specifically, we extract two heuristic-based features and fit a linear classifier to estimate the motion state. The two features are the detection box centers' variance and the begin-to-end distance of the tracked boxes (the distance from the center of the first box of the track to the center of the last box of the track), with boxes all in the world coordinate. To ensure that the statistics are reliable, we only consider tracks with at least 7 valid measurements. For tracks that are too short, we run neither the classification nor the auto labeling models; the boxes of those short tracks are merged directly into the final auto labels.

The ground truth motion states are computed from ground truth boxes with pre-defined thresholds on the begin-to-end distance (1.0 m) and the max speed (1 m/s). The thresholds are needed because there can be small drifts in sensor poses, such that the ground truth boxes in the world coordinate are not exactly the same for a static object.

For vehicles, such a simple linear model can achieve more than 99% classification accuracy. The remaining rare error cases usually happen in short tracks with noisy detection boxes, or for objects that are heavily occluded or far away. For pedestrians, as most of them are moving and even the static ones tend to move their arms and heads, we consider all pedestrian tracks as dynamic.

F. Details of the Object Auto Labeling Models

F.1. Static Object Auto Labeling

Network architecture.  In the static object auto labeling model, the foreground segmentation is a PointNet [47] segmentation network, where each point is first processed by a multi-layer perceptron (MLP) with 5 layers with output channel sizes of 64, 64, 64, 128, 1024. Every layer of the MLP has batch normalization and ReLU. The 1024-dim per-point embeddings are pooled with a max pooling layer and concatenated with the output of the 2nd layer of the per-point MLP (64-dim). The concatenated 1088-dim features are further processed by an MLP of 5 layers with output channel sizes 512, 256, 128, 128, 2, where the last layer has no non-linearity or batch normalization. The predicted foreground logit scores are used to classify each point as foreground or background, and all the foreground points are extracted.

The box regression network is also a PointNet [47] variant that takes the foreground points and outputs the 3D box parameters. It has a per-point MLP with output sizes of 128, 128, 256, 512, a max pooling layer and a following MLP with output sizes 512, 256 on the max-pooled features. There is a final linear layer predicting the box parameters. We parameterize the boxes in a way similar to [46]: the box center regression (3-dim), the box heading regression and classification (to each of the heading bins), and the box size regression and classification (to each of the template sizes). For iterative refinement, we apply the same box regression network one more time on the foreground points transformed to the estimated box's coordinate. We found that if we use multi-frame MVF++ boxes, sharing weights between the two box regression networks works better than not sharing; while if we use the single-frame MVF++, the cascaded design without weight sharing works better. The numbers in the main paper are from the iterative model (shared weights).

For simplicity and higher generalizability of the model, we only use the XYZ coordinates of the points in the segmentation and box regression networks. Intensities and other point channels are not used. We have also tried the more powerful PointNet++ [48] models but did not see improvement over the PointNet-based models on this problem.

Losses.  The model is trained with supervision from the segmentation masks and the ground truth 3D bounding boxes. For the segmentation, the sub-network predicts two scores for each point (foreground or background) and is supervised with a cross-entropy loss L_seg. For the box regression, we implement a process similar to [46], where each box regression network regresses the box by predicting its center cx, cy, cz, its size class (among a few pre-defined template size classes) and residual sizes for each size class, as well as the heading bin class and a residual heading for each bin. We use 12 heading bins (each bin accounts for 30 degrees) and 3 size clusters: (4.8, 1.8, 1.5), (10.0, 2.6, 3.2), (2.0, 1.0, 1.6), where the dimensions are length, width, height. The box regression loss is defined as L_box,i = L_c-reg,i + w_1 L_s-cls,i + w_2 L_s-reg,i + w_3 L_h-cls,i + w_4 L_h-reg,i, where i in {1, 2} denotes the cascade/iterative box estimation step. The total loss is L = L_seg + w (L_box,1 + L_box,2). The w and w_i are hyperparameter weights of the losses. Empirically, we use w_1 = 0.1, w_2 = 2, w_3 = 0.1, w_4 = 2 and w = 10.

Training and data augmentation.  We train our models using the object tracks extracted (with the proposed multi-frame MVF++ model and our multi-object tracker) from the Waymo Open Dataset, for each class type separately. Ground truth boxes are assigned to every frame of the track (frames with no matched ground truth are skipped). During training, for each static object track, we randomly select an initial box from the sequence. We also randomly sub-sample Uniform[1, |S_j|] frames from all the visible frames S_j of an object j.

F.2. Dynamic Object Auto Labeling

Network architecture.  For the foreground segmentation network, we adopt a similar architecture to that of the static auto labeling model, except that the input points have one more channel besides the XYZ coordinates: the time encoding channel. The temporal encoding is 0 for points from the current frame, -0.1r for the r-th frame prior to the current frame and +0.1r for the r-th frame after the current frame. In our implementation we take 5 frames of object points, with each frame's points subsampled to 1,024 points, so in total there are 5,120 points input to the segmentation network.

The point sequence encoder network takes the segmented foreground points and uses a PointNet [47]-like architecture with a per-point MLP of output sizes 64, 128, 256, 512, a max-pooling layer and another MLP with output sizes 512, 256 on the max-pooled features. The output is a 256-dim feature vector.

For the box sequence encoder network, we consider each box (in the center frame's box coordinate) as a parameterized point with channels of box center (cx, cy, cz), box size (length, width, height), box heading theta and a temporal encoding. We use nearly the entire box sequence (setting s in the main paper to 50, leading to a sequence length of 101). The box sequence can thus be treated as a point cloud and processed by another PointNet. Note that such a sequence could also be processed by a 1D ConvNet, concatenated and processed by a fully connected network, or even fed to a graph neural network. Through empirical study we found that using a PointNet to encode the box sequence feature is both effective (compared with ConvNets and fully connected layers) and simple (compared with graph neural networks). The box sequence encoding PointNet has a per-point MLP
This naturally leads to j withoutputsizes64,64,128,512,amax-poolinglayerand a data augmentation effect. Note that at test time we al- anotherMLPwithoutputsizes128,128onthemax-pooled ways select the initial box with the highest score and use features.Thefinaloutputisa128-dimfeaturevector,which all frames. The merged points are randomly sub-sampled wecallthetrajectoryembedding. to4,096pointsandrandomlyflippedalongtheX,Y axes The point embedding and the trajectory embedding are with50%chancerespectivelyandrandomlyrotatedaround concatenatedandpassedthroughafinalboxregressionnet- the up-axis (Z) by Uniform[−10,10] degrees. To increase work, which is a MLP with two layers with output sizes thedataquantity,wealsoturnthedynamicobjecttrackdata 128,128 and a linear layer to regress the box parameters to pseudo static track. To achieve that, we use the ground (similartothatofthestaticobjectautolabelingmodel). truth object boxes to align the dynamic object points to a Toencouragecontributionsfrombothbranches,wefol- specificframe’sgroundtruthboxcoordinate.Thisincreases low [60] and also pass the trajectory embedding and the thenumberofobjecttracksofvehiclesby30%. object embedding to two additional box regression sub- In total, we have extracted around 50K (vehicle) object networkstopredictboxesindependently.Thesub-networks tracksfortraining(includingtheaugmentedonesfromdy- havethesamestructureastheoneforthejoint-embedding, namicobjects)andaround10Kobjecttracksforvalidation butwithnon-sharedweights. (static only). We trained the model using the Adam opti- mizerwithabatchsizeof32andaninitiallearningrateof 0.001. The learning rate was decayed by 10X at the 60th, Losses. Similartothestaticautolabelingmodel,wehave 100th and 140th epochs. The model was trained with 180 twotypesofloss,thesegmentationlossandtheboxregres- epochs in total, which took around 20 hours with a V100 sion loss. The box regression outputs are defined in the GPU. same way as that for the static objects. 
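To make the encoder description above concrete, here is a minimal NumPy sketch (our own illustrative code, not the authors' implementation; the weights are random and untrained, so it only demonstrates the tensor shapes and data flow, and all function names are ours):

```python
import numpy as np

def mlp(x, sizes, rng):
    """Stack of linear + ReLU layers applied along the last axis."""
    for s in sizes:
        w = rng.standard_normal((x.shape[-1], s)) * 0.01
        x = np.maximum(x @ w, 0.0)
    return x

def temporal_encoding(frame_offsets):
    """0 for the current frame, -0.1r / +0.1r for r frames in the past / future."""
    return 0.1 * np.asarray(frame_offsets, dtype=np.float64)

def encode_point_sequence(points, rng):
    """Point sequence encoder: per-point MLP (64,128,256,512), max-pool, MLP (512,256)."""
    per_point = mlp(points, [64, 128, 256, 512], rng)
    pooled = per_point.max(axis=0)            # max-pooling over all points
    return mlp(pooled, [512, 256], rng)       # 256-dim object embedding

def encode_box_sequence(boxes, rng):
    """Box sequence encoder: per-box MLP (64,64,128,512), max-pool, MLP (128,128)."""
    per_box = mlp(boxes, [64, 64, 128, 512], rng)
    pooled = per_box.max(axis=0)
    return mlp(pooled, [128, 128], rng)       # 128-dim trajectory embedding

rng = np.random.default_rng(0)

# 5 frames of object points, 1,024 points each, XYZ + temporal encoding channel.
xyz = rng.standard_normal((5, 1024, 3))
t = temporal_encoding([-2, -1, 0, 1, 2])      # [-0.2, -0.1, 0.0, 0.1, 0.2]
t = np.broadcast_to(t[:, None, None], (5, 1024, 1))
points = np.concatenate([xyz, t], axis=-1).reshape(-1, 4)   # (5120, 4)

# 101 boxes (s = 50): center(3) + size(3) + heading(1) + temporal encoding(1).
boxes = rng.standard_normal((101, 8))

obj_emb = encode_point_sequence(points, rng)
traj_emb = encode_box_sequence(boxes, rng)
joint = np.concatenate([obj_emb, traj_emb])   # 256 + 128 = 384-dim
print(obj_emb.shape, traj_emb.shape, joint.shape)   # (256,) (128,) (384,)
```

In the actual model, the 384-dim joint embedding then feeds the final box regression MLP (output sizes 128, 128) plus a linear layer; the temporal_encoding values follow the ±0.1r scheme described above.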
Losses. Similar to the static auto labeling model, we have two types of loss: the segmentation loss and the box regression loss. The box regression outputs are defined in the same way as those for the static objects. We used 12 heading bins (each bin accounts for 30 degrees) and the same size clusters as those for the static vehicle auto labeling. For pedestrians we use a single size cluster: (0.9, 0.9, 1.7) in length, width, height. The final loss is

    L = L_seg + v_1 L_box-traj + v_2 L_box-obj-pc + v_3 L_box-joint,

where the three box losses come from the trajectory head, the object point cloud head and the joint head respectively. The v_i, i = 1, 2, 3 are the weights for the loss terms, chosen to balance the learning of the three types of embeddings. Empirically, we use v_1 = 0.3, v_2 = 0.3, v_3 = 0.4.

Training and data augmentation. During training, we randomly select the center frame from each dynamic object track. If the context size is less than the required sequence length 2r+1 or 2s+1, or when the frames in the sequence are not consecutive (e.g. the object is occluded for a few frames), we use placeholder points and boxes (all zeros) for the empty frames. As there may be tracking errors, we match our object tracks with ground truth tracks and avoid training on the ones with switched track IDs.

As to augmentation, both points and boxes are randomly flipped along the X and Y axes with a 50% chance and randomly rotated around the Z axis by Uniform[−10, 10] degrees. We also add a light random shift and a random scaling to the point clouds. The point cloud from each frame is also randomly sampled to 1,024 points from the full observed point cloud.

We train vehicle and pedestrian models separately. For vehicles we extracted around 15.7K dynamic tracks for training and 3K for validation. For pedestrians, we extracted around 22.9K dynamic tracks for training and around 5.1K for validation. We train the model using the Adam optimizer with a batch size of 32 and an initial learning rate of 0.001. The learning rate is decayed by 10X at the 180th, 300th and 420th epochs. The model is trained for 500 epochs in total, which takes 1-2 days on a V100 GPU.

G. Details of the Human Label Study

We randomly selected 5 sequences from the Waymo Open Dataset val set, as listed in Table 12, to run the human label study. The 15 labeling tasks (3 sets of re-labels for each run segment) involved 12 labelers with experience in labeling 3D Lidar point clouds. In total we collected around 2.3K labels (one label per object track) across the 3 repeated labelings.

    segment-17703234244970638241 220 000 240 000
    segment-15611747084548773814 3740 000 3760 000
    segment-11660186733224028707 420 000 440 000
    segment-1024360143612057520 3580 000 3600 000
    segment-6491418762940479413 6520 000 6540 000

Table 12. Sequence (run segment) list for the human label study. The sequences are all from the Waymo Open Dataset val set.

How consistent are human labels? Auxiliary to the AP results in the main paper, we also analyze the IoUs between human labels. We found that even for humans, 3D bounding box labeling can be challenging, as the input point clouds are often partial and occluded. To understand how consistent human labels are, we compare labels from one labeler with the labels from the others and measure the 3D box consistency by their IoUs. Since we already have the verified public ground truth, we can compare the 3 sets of labels with the public WOD ground truth and get the average box IoU for all matched objects (due to occlusions, some objects may or may not be labeled by a specific labeler). Specifically, human boxes for which we could not find a ground truth box with more than 0.03 BEV IoU overlap (false positives or false negatives) were ignored and not counted in the computation.

The statistics are summarized in Table 13. Surprisingly, human labels do not have the consistency one may expect (e.g. 95% IoU). Due to the inherent uncertainty of the problem, even humans only achieve around 81% 3D IoU, or around 88% BEV IoU, in their box consistency. Breaking the numbers down by distance, we see, intuitively, that nearby objects have a significantly higher mean IoU, as they have more visible points and more complete viewpoints. The BEV 2D IoU is also higher than the 3D IoU, as the BEV box does not require correct height estimation, which simplifies the problem.

To get a rough understanding of how the boxes generated by our 3D Auto Labeling pipeline compare with human labels, we compute the average IoU of auto labels with the WOD ground truth. Note that those numbers are not directly comparable to the average human IoUs, as they cover different sets of objects (due to the false positives and false negatives). However, they still suggest that the auto labels are already on par in quality with human labels.

    IoU type     Label type   all      0-30m    30-50m   50m+
    valid boxes  human        25,641   11,543   7,963    6,135
                 auto         24,146   11,360   7,448    5,338
    3D mIoU      human        80.92    85.78    80.29    72.59
                 auto         80.29    84.04    77.45    76.28
    BEV mIoU     human        87.98    91.26    87.31    82.68
                 auto         87.50    90.36    85.09    84.78

Table 13. The mean IoU of human labels and auto labels compared with the Waymo Open Dataset ground truth for vehicles. Note that since different labels (human or machine) annotate different numbers of objects for each frame, those numbers are not directly comparable; they are summarized here for reference. For a fairer comparison between human and auto labels, see the Average Precision comparison table in the main paper. We only evaluate using ground truth boxes with at least one point in them, and only evaluate boxes that have a BEV IoU larger than 0.03 with some ground truth box.

H. More Details about the Semi-supervised Learning Experiment

In the semi-supervised experiments, we use an onboard single-frame MVF++ detector as the student. We train all networks with an effective batch size of 256 scenes per iteration. The training schedule starts with a warmup period where the learning rate is gradually increased to 0.03 over 1,000 iterations. Afterward, we use a cosine decay schedule to drop the learning rate from 0.03 to 0.

For the intra-domain semi-supervised learning, we randomly select 10% of the sequences (around 15K frames from 79 sequences) to train the 3DAL pipeline, which gets an AP of 78.11% on the validation set. Then, the 3DAL annotates the rest of the training set (around 142K frames from 719 sequences). Finally, the student is trained on the union of these sets (798 sequences). We train the models for a total of 43K iterations.

For the cross-domain semi-supervised learning, we first train 3DAL on the regular Waymo Open training set. Since the domain adaptation validation set is relatively small (it only contains 20 sequences), we submit the results to the submission server and report on the domain adaptation test set (containing 100 sequences). The 3DAL gets an AP of 78.0% on the domain adaptation test set without using any data from that domain. Then, we use the trained pipeline to annotate the domain adaptation training and unlabeled sets of the Waymo Open Dataset. Finally, the student is trained on the union of the regular training set annotated by humans and the domain adaptation training + unlabeled sets annotated by 3DAL. Since the data used for training the student is around 2X larger than in the intra-domain experiment, we also increase the training iterations to 80K.

I. More Analysis Experiments for Object Auto Labeling

In this section, we provide more analysis results auxiliary to the main paper.

Effects of key frame selection for static object auto labeling. Table 14 compares the effects of using different initial boxes (from the detectors) for the model (for the coordinate transform before foreground segmentation): a uniformly chosen random box, the average box and the box with the highest score. We also show the box accuracy of the detectors as a reference (i.e. the accuracy of those initial boxes).

    Model                 detector box    Acc@0.7/0.8
    Single-frame MVF++    random          67.17 / 36.61
    Multi-frame MVF++     random          73.96 / 43.56
                          average         79.29 / 48.67
                          highest score   78.67 / 52.42
    Auto labeling model   random          79.66 / 52.46
                          average         81.22 / 53.96
                          highest score   82.28 / 56.92

Table 14. Effects of initial box selection in static object auto labeling. Numbers are averaged over 3 runs for the random boxes.

Choosing a uniformly random box from the sequence is equivalent to the setting of a frame-centric approach. For the detector baselines (rows 1 and 2), it directly evaluates the average accuracy of the detector boxes. For the auto labeling, it means the box estimation runs for every frame, similar to a two-stage refinement step in two-stage detectors. We see that such frame-centric box estimation achieves the most unfavorable results, as it cannot leverage the best viewpoint in the sequence (in the object-centric way).

In the average box setting, we average all the boxes from the sequence (in the world coordinate) and use the averaged box for the transformation. For the highest score setting, we select the box with the highest confidence score as the initial box, which is similar to choosing the best viewpoint of the object. We see that the strategy for choosing the initial box has a great impact and can cause a 4.46 Acc@0.8 difference for the auto labeling model (the highest score box vs. a random box).
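The three initial-box strategies compared in Table 14 can be sketched as follows (an illustrative sketch under our own assumptions, not the paper's implementation; the box tuple layout and function names are hypothetical):

```python
import random

# A "box" here is a tuple (cx, cy, cz, length, width, height, heading)
# in the world coordinate; each box comes with a detector confidence score.

def select_random(boxes, scores, rng):
    """Frame-centric baseline: a uniformly random box from the track."""
    return boxes[rng.randrange(len(boxes))]

def select_average(boxes, scores):
    """Element-wise average of the box parameters over the track.
    (Naive heading averaging assumes roughly aligned headings.)"""
    n = len(boxes)
    return tuple(sum(b[k] for b in boxes) / n for k in range(len(boxes[0])))

def select_highest_score(boxes, scores):
    """Object-centric choice: the most confident detection, roughly the
    best-observed viewpoint of the object."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return boxes[best]

track = [
    (0.0, 0.0, 1.0, 4.8, 1.8, 1.5, 0.00),
    (0.2, 0.1, 1.0, 4.6, 1.8, 1.5, 0.02),
    (0.1, 0.0, 1.0, 4.7, 1.9, 1.5, 0.01),
]
scores = [0.6, 0.9, 0.7]

rng = random.Random(0)
print(select_highest_score(track, scores))        # the 0.9-score box
print(select_random(track, scores, rng) in track) # True
```

Naively averaging the heading only makes sense for a static object whose detections are roughly aligned, which is the setting studied here; the highest-score choice corresponds to picking the best-observed viewpoint.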
Causal model performance. Table 15 compares non-causal and causal models (for static object auto labeling). The causal model was trained using the causal input (the last row) and only used causal input for inference at every frame. We see that the causal model has a relatively lower accuracy than the non-causal ones, probably for two reasons. First, it has limited context, especially for the beginning frames of the track. Second, the pool of initial boxes is much more restricted if the input has to be causal. For non-causal models, we can select the frame with the highest confidence as the key frame and use the box from that frame for the initial coordinate transform. The causal model, however, can only select the highest-confidence box from the history frames, which are not necessarily the well-visible ones.

As the causal model's accuracy is even inferior to that of the model that uses only a single frame's points for refinement (the first row in Table 15), the ability to select the best key frame weighs more than the added points from a few history frames. Note that the performance is still better than that of the detector boxes without refinement (row 2 in Table 14). Such results indicate the benefit of having a non-causal model for offboard 3D detection.

    ref frame            context frames   Acc@0.7/0.8
    all highest score    [−0,+0]          78.13 / 50.30
    all highest score    all              82.28 / 56.92
    past highest score   all history      77.56 / 49.21

Table 15. Effects of temporal contexts for static object auto labeling. Note that for a causal model, we cannot output a single box for a static object: we have to output the best estimation for every frame using the current frame and the history frames. We compute the average accuracy across all frames for the causal case.

Table 16 reports a similar study for dynamic object auto labeling. We also see that the causal model's performance is inferior to the non-causal one, although it still improves upon the raw detector accuracy (Table 8 in the main paper, row 2, where the multi-frame MVF++ gets 82.21 / 59.52 accuracy).

    Point cloud context   Box context   Acc@0.7/0.8
    [−2,+2]               all           85.67 / 65.77
    [−4,0]                all history   84.30 / 62.68

Table 16. Effects of temporal contexts for dynamic object auto labeling. For causal models, we only use the causal point and box sequence input.

Ablations of data augmentation. Table 17 compares the performance of our auto labeling pipeline when trained with different data augmentations. The most accurate models are consistently trained with all the proposed augmentations. For static objects, all the augmentations contribute similarly in terms of accuracy, while for dynamic objects, random rotation around the Z-axis appears to be the most critical.

    Static                       Dynamic
    Aug.       Acc@0.7/0.8       Aug.       Acc@0.7/0.8
    All        82.28 / 56.92     All        85.67 / 65.77
    −D2S       81.72 / 55.96     −Shift     85.15 / 65.23
    −FlipX     81.42 / 55.98     −Scale     85.15 / 65.91
    −FlipY     81.50 / 55.49     −FlipY     85.76 / 63.66
    −RotateZ   81.72 / 56.52     −RotateZ   84.94 / 64.20

Table 17. Ablations of data augmentation. We use different data augmentations for static objects and dynamic objects. "All" means all the augmentations are used; "−X" means removing a specific augmentation "X" from the augmentation set. "D2S" represents the dynamic-to-static augmentation. Best results in each column are in bold.

Effects of the tracking accuracy. To study how tracking (association) accuracy affects our offboard 3D detection, we compare results using our Kalman filter tracker and an "oracle" tracker; they share the detection and object auto labeling models, and only the tracker differs. For the "oracle" tracker, we associate detector boxes using the ground truth boxes: for each detector box, we find its closest ground truth box and assign that ground truth box's object ID to it. In Table 18, we observe that better performance can be obtained when a more reliable tracker is used, although the difference is subtle. In particular, the "oracle" tracker brings more improvement for vehicles than for pedestrians. This difference implies that, for pedestrians, there is more room for improvement in detecting targets than in associating detected boxes.

                             3D AP             BEV AP
                Tracker      MOT      GT       MOT      GT
    Vehicle     IoU=0.7      84.50    85.77    93.30    96.74
                IoU=0.8      57.82    58.81    84.88    86.18
    Pedestrian  IoU=0.5      82.88    83.02    86.32    86.24
                IoU=0.6      63.69    64.80    75.60    75.65

Table 18. Effects of the tracking accuracy. "MOT" stands for Multi Object Tracker; "GT" represents the Ground Truth Tracker where the ground truth boxes are used. Best results of each comparable pair are in bold.

Effects of the motion state estimation accuracy. In Table 19 we study how motion state classification accuracy affects the offboard 3D detection AP. We replace the motion state classifier ("Pred") with one that uses the ground truth boxes ("GT") for classification and measure how much AP improvement it brings. We see that while there are some gains, they are not significant. This is understandable, as our linear classifier already achieves 99%+ accuracy.

    Motion    3D AP                 BEV AP
    State     IoU=0.7   IoU=0.8     IoU=0.7   IoU=0.8
    Pred      84.50     57.82       93.30     84.88
    GT        84.98     57.95       93.36     85.13

Table 19. Effects of the motion state estimation on the offboard 3D detection. The metric is AP for vehicles on the Waymo Open Dataset val set. "Pred" means we classify the motion state (static or not) using our linear classifier; "GT" means we use the ground truth boxes to classify the motion state.

Inference speed. Processing a 20-second sequence (200 frames at the 10 Hz sensor rate) on a V100 GPU, the detector takes the majority of the time (around 15 minutes) due to the multi-frame input and test-time augmentation. The tracking takes around 3s and the object-centric refinement around 25s, i.e. 28s in total, 0.14s per frame, or a 3% extra time over the detection. In the offboard setting, we can run the detection or refinement steps in parallel to further reduce the processing latency.
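To make the track-level statistics used in this analysis concrete, here is a minimal sketch (our own illustrative code, not the authors') of the two heuristic motion-state features described earlier, the box center variance and the begin-to-end distance, together with one plausible way to combine the stated ground-truth thresholds (1.0 m begin-to-end distance, 1 m/s max speed); exactly how the paper combines the two thresholds is our assumption:

```python
import math

def center_variance(centers):
    """Mean per-axis (population) variance of box centers over the track."""
    n = len(centers)
    total = 0.0
    for axis in range(3):
        mean = sum(c[axis] for c in centers) / n
        total += sum((c[axis] - mean) ** 2 for c in centers) / n
    return total / 3.0

def begin_to_end_distance(centers):
    """Distance from the first box center of the track to the last."""
    return math.dist(centers[0], centers[-1])

def gt_is_static(centers, dt=0.1, dist_thresh=1.0, speed_thresh=1.0):
    """Label a track static if its begin-to-end displacement and its maximum
    frame-to-frame speed both stay under the stated thresholds."""
    if begin_to_end_distance(centers) >= dist_thresh:
        return False
    return all(math.dist(a, b) / dt < speed_thresh
               for a, b in zip(centers, centers[1:]))

# A parked car with slight pose drift vs. a car moving at ~5 m/s (10 Hz frames).
parked = [(5.0 + 0.01 * i, 2.0, 0.8) for i in range(10)]
moving = [(5.0 + 0.5 * i, 2.0, 0.8) for i in range(10)]
print(gt_is_static(parked), gt_is_static(moving))   # True False
```

Recall that the pipeline only trusts these statistics for tracks with at least 7 valid measurements; shorter tracks skip the classification entirely and their boxes are merged directly into the final auto labels.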