Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving

Mahyar Najibi*, Jingwei Ji*, Yin Zhou*†, Charles R. Qi, Xinchen Yan, Scott Ettinger, and Dragomir Anguelov
Waymo LLC
{najibi,jingweij,yinzhou,rqi,xcyan,settinger,dragomir}@waymo.com

Abstract. Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive to the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.

Keywords: Autonomous driving, unsupervised learning, generalization, detection, motion prediction, scene understanding

1 Introduction

Modern 3D object detection [68,61,102,112] and trajectory prediction models [10,32,51,104] are often designed to handle a predefined set of object types and rely on costly human-annotated datasets for their training. While such a paradigm has achieved great success in pushing the capability of autonomy systems, it has difficulty in generalizing to the open-set environment that includes a long-tail distribution of object types far beyond the predefined taxonomy. Towards solving the 3D object detection and behavior prediction of those open-set objects, an alternative and potentially more scalable approach to supervised training is unsupervised perception and prediction.

* Equal contribution
† Corresponding author

One central problem in autonomous driving is perceiving the amodal shape of moving objects in space and forecasting their future trajectories, such that the planner and control systems can maneuver safely. As motion estimation (also known as the scene flow problem) is a fundamental task agnostic to the scene semantics [50], it provides an opportunity to address the problem of perception and prediction of open-set moving objects, without any human labels. This leads to our motion-inspired unsupervised perception and prediction system.

Using only LiDAR, our system decomposes the unsupervised, open-set learning task into two steps, as shown in Fig. 1: (1) Auto Meta Labeling (AML), assisted by scene flow estimation and temporal aggregation, which generates pseudo labels for any moving objects in the scene; (2) training detection and trajectory prediction models based on the auto meta labels. By realizing such automatic supervision, we transform the challenging open-set learning task into a known, well-studied task of supervised detection or behavior prediction model training. To derive high-quality auto meta labels, we propose two key technologies: an unsupervised scene flow estimation model and a flow-based object proposal and concept construction approach.
Most prior works on unsupervised scene flowestimation[96,45,55,59]optimizefortheoverallflowqualitywithoutspecif- ically focusing on the moving objects or considering the usage of scene flow for onboard perception and prediction tasks. For example, the recently proposed Neural Scene Flow Prior (NSFP) [45] achieved state-of-the-art performance in overallsceneflowmetricsbylearningtoestimatesceneflowthroughrun-timeop- timization, without any labels. However, there are too many false positive flows generated for the background, which makes it not directly useful for flow-based object discovery. To tackle its limitations, we extend NSFP to a novel, more accurate and scalable version termed NSFP++. Based on the estimated flow, we propose an automatic pipeline to generate proposals for all moving objects and reconstruct the object shapes (represented as amodal 3D bounding boxes) through tracking, shape registration and refinement. The end product of the process is a set of 3D bounding boxes and tracklets. Given the auto labels, we can train top-performing 3D detection models to localize the open-set moving objects and train behavior prediction models to forecast their trajectories. Evaluated on the Waymo Open Dataset [75], we show that our unsuper- vised and data-driven method significantly outperforms non-parametric cluster- ing based approaches and is even competitive to supervised counterparts (using ground truth scene flow). More importantly, our method substantially extends the capability of handling open-set moving objects for 3D detection and trajec- tory prediction models, leading to a safety improved autonomy system. 2 Related Works LiDAR-based 3D Object Detection. Fully supervised 3D detection based on point clouds has been extensively studied in the literature. Based on their input representation, these detectors can be categorized as those operating directly on the points [68,61,102,69,54,46], on a voxelized space [21,87,73,56,100,70,112,43,89,103,109], a perspective projection of theMotion Inspired Unsupervised Perception and Prediction 3 scene [53,5,27], or a combination of these representations [76,12,111,34,67]. Semi-supervised 3D detection with a smaller labeled training set or under the annotator-in-the-loop setting have also been considered in [62,7,99]. However, unsupervised3Ddetectionhasbeenmostlyunexploredduetotheinherentprob- lem complexity. More recently, Tian et al. [78] proposed to use 3D point cloud cluestoperformunsupervised2Ddetectioninimages.Incontrast,inourpaper, we propose a novel method which performs 3D detection of moving objects in an unsupervised manner. Scene Flow Estimation. Most previous learning-based works for 3D point cloud scene flow estimation were supervised [47,91,60,33]. More recently, the unsupervised setting has been also studied. [55] used self-supervised cycle con- sistency and nearest-neighbour losses to train a flow prediction network. In con- trast, [45] took an inference-time optimization approach and trained a network per scene. We follow [45] to build our scene flow module given its unsupervised natureandrelativelybetterperformance.However,ouranalysisrevealsthelim- itationsofthismethodinhandlingcomplexscenes,makingitsdirectadaptation for proposing high-quality auto labels challenging. In our paper, we noticeably improvetheperformanceofthismethodbyproposingnoveltechniquestobetter capture the locality constraints of the scene and to reduce its false predictions. Unsupervised Object Detection. 
Existing efforts have been concentrated in the image and video domain, mostly evaluated on object-centric datasets or datasets containing only a handful of object instances per frame. These in- clude statistic-based methods [71,65], visual similarity-based clustering meth- ods [29,26,40], linkage analysis [42] with appearance and geometric consis- tency [15,84,85,86], visual saliency [105,39], and unsupervised feature learning using deep generative models [44,72,63,3]. In contrast, unsupervised object de- tectionfromLiDARsequencesisfairlyunder-explored[18,94,78,48].[18,57]pro- posed to sequentially update the detections and perform tracking based on mo- tion cues from 3D LiDAR scans. Cen et al. [9] used predictions of a supervised detector to yield proposals of unknown categories. However, this approach is in- applicabletounsupervisedsettingsandworksforunknowncategorieswithclose semanticstotheknownones.Wonget al.[94]introducedabottom-upapproach to segment both known and unknown object instances by clustering and aggre- gating LiDAR points in a single frame based on their embedding similarities. In comparison, our work leverages both motion cues and point locations for clus- tering, which puts more emphasis on detecting motion coherent objects and is able to generate amodal bounding boxes. Shape Registration. Shape registration has been an important topic in vision and graphics community for decades, spanning from clas- sical methods including Iterative Closest Point (ICP) [4,13,64,30] and Structure-from-Motion (SfM) [79,1,38,66] to their deep learning vari- ants[90,82,110,77,92,98,97,81,80,113,37,101].Thesemethodsusuallyworkunder the assumption that the object or scene to register is mostly static or at least4 M. Najibi et al. Fig.1.Proposedframework.TakingasinputLiDARsequences(aftergroundremoval), our approach first reasons about per point motion status (static or dynamic) and predicts accurate scene flow. Based on the motion signal, Auto Meta Labeling clus- ters points into semantic concepts, connects them across frames and estimates object amodalshapes(3Dboundingboxes).Thederivedamodalboxesandtrackletswillserve as automatic supervision to train 3D detection and trajectory prediction models. non-deformable.Inautonomousdriving,shaperegistrationhasgainedincreasing attentions where offline processing is required [22,23,56,74,31,88,20]. The shape registration outcome can further support downstream applications such as off- board auto labeling [62,107,99], and perception simulation [52,14]. In this work, we use sequential ICP with motion-inspired initializations to aggregate partial views of objects and produce the auto-labeled bounding boxes. Trajectory Prediction. The recent introduction of the large-scale trajectory prediction datasets [25,6,11,36], helped deep learning based methods to demon- strate new state-of-the-art performance. From a problem formulation stand- point, these methods can be categorized into uni-modal and multi-modal. Uni- modalapproaches[8,19,51,28]predictasingletrajectoryperagent.Multi-modal methods [10,16,35,2,108,58,104,49,106] take into account the possibility of hav- ing multiple plausible trajectories per agent. However, all these methods rely on fully labeled datasets. Unsupervised or open-set settings, although practi- cally important for autonomous driving, have so far remained unexplored. Our method enables existing behavior prediction models to generalize to all moving objects, without the need for predefining an object taxonomy. 3 Method Fig. 
1 illustrates an overview of our proposed method, which primarily relies on motion cues for recognizing moving objects in an unsupervised manner. The pipeline has two main modules: unsupervised scene flow estimation (Sec. 3.1) and Auto Meta Labeling (Sec. 3.2).

3.1 Neural Scene Flow Prior++

Background. Many prior works [41,95,33,47] on scene flow estimation only considered the supervised scenario where human annotations are available for training. However, these methods cannot generalize well to new environments or to newly seen categories [45]. Recently, Li et al. [45] proposed neural scene flow prior (NSFP), which learns point-wise 3D flow vectors by solving an optimization problem at run time without the need for human annotation. Thanks to its unsupervised nature, NSFP can generalize to new environments. It also achieved state-of-the-art performance in 3D scene flow estimation. Still, our study shows that it has notable limitations in handling complex scenes when a mixture of low- and high-speed objects are present. For example, as illustrated in Fig. 3, NSFP suffers from underestimating the velocity of moving objects, i.e., false negative flows over pedestrians and inaccurate estimation of fast-moving vehicles. It also introduces excessive false positive flows over static objects (e.g., buildings). We hypothesize that such issues are due to the fact that NSFP applies global optimization to the entire point cloud, and the highly diverse velocities of different objects set contradictory learning targets for the network to learn properly.

Fig. 2. Proposed NSFP++. Taking as input raw LiDAR sequences (after ground removal), our approach first reasons about the motion status of each point, decomposes the scene into connected components and predicts local flows accurately for each semantically meaningful component.

Overview. Our goal is to realize an unsupervised 3D scene flow estimation algorithm that can adapt to various driving scenarios. Here, we present our neural scene flow prior++ (NSFP++) method. As illustrated in Fig. 2, our method features three key innovations: 1) robustly identifying static points; 2) a divide-and-conquer strategy that handles different objects by decomposing a scene into semantically meaningful connected components and estimating a local flow for each of them in a targeted manner; 3) enforcing flow consistency among points in each component.

Problem Formulation. Let S_t ∈ R^{N_1×3} and S_{t+1} ∈ R^{N_2×3} be two sets of points captured by the LiDAR sensor of an autonomous vehicle at time t and time t+1, where N_1 and N_2 denote the number of points in each set. We denote F_t ∈ R^{N_1×3} as the scene flow, a set of flow vectors corresponding to each point in S_t. Given a point p ∈ S_t, we define f ∈ F_t as the corresponding flow vector such that p̂ = p + f represents the future position of p at t+1. Typically, points in S_t and S_{t+1} have no correspondence and N_1 differs from N_2.

As in Li et al. [45], we model the flow vector f = h(p; Θ) as the output of a neural network h containing a set of learnable parameters Θ. To estimate F_t, we solve for Θ by minimizing the following objective function:

\Theta^*, \Theta_{bwd}^* = \arg\min_{\Theta,\,\Theta_{bwd}} \sum_{p \in S_t} \mathcal{L}(p + f,\, S_{t+1}) + \sum_{\hat{p} \in \hat{S}_t} \mathcal{L}(\hat{p} + f_{bwd},\, S_t) \qquad (1)

where f = h(p; Θ) is the forward flow, f_bwd = h(p̂; Θ_bwd) is the backward flow, Ŝ_t is the set of predicted future positions for points in S_t, and L is the Chamfer distance function. Here the forward and backward flow models share the same network architecture but are parameterized by Θ and Θ_bwd, respectively.
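To make the run-time optimization concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of fitting the objective in Eq. (1) for a single frame pair with a small PyTorch MLP; the network width, learning rate, and iteration count are assumptions chosen for readability.

# Minimal sketch of the per-frame optimization in Eq. (1), assuming a small MLP h(p; Theta).
import torch
import torch.nn as nn

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """One-sided Chamfer distance: mean squared distance from each point in `a`
    to its nearest neighbor in `b`. Shapes: (N, 3) and (M, 3)."""
    d = torch.cdist(a, b)                      # (N, M) pairwise Euclidean distances
    return (d.min(dim=1).values ** 2).mean()

def make_flow_mlp(hidden: int = 64) -> nn.Module:
    # h maps a 3D point to a 3D flow vector.
    return nn.Sequential(
        nn.Linear(3, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 3),
    )

def estimate_flow(s_t: torch.Tensor, s_t1: torch.Tensor,
                  iters: int = 500, lr: float = 1e-3) -> torch.Tensor:
    """Optimize forward/backward flow networks for one frame pair and
    return the forward flow F_t for every point in s_t."""
    h_fwd, h_bwd = make_flow_mlp(), make_flow_mlp()
    opt = torch.optim.Adam(list(h_fwd.parameters()) + list(h_bwd.parameters()), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        p_hat = s_t + h_fwd(s_t)                           # predicted future positions
        loss = chamfer(p_hat, s_t1)                        # forward term of Eq. (1)
        loss = loss + chamfer(p_hat + h_bwd(p_hat), s_t)   # backward (cycle) term
        loss.backward()
        opt.step()
    with torch.no_grad():
        return h_fwd(s_t)

# Usage: f_t = estimate_flow(points_t, points_t1) with (N, 3) float tensors.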
Fig. 3. Flow quality comparison between NSFP [45] and our NSFP++ over the Waymo Open Dataset. Dashed circles in orange color highlight the major shortcomings suffered by NSFP, i.e., (a) underestimated flow for a fast-moving vehicle, (b)(c) false positive predictions at the background and (d) false negative predictions at pedestrians with subtle motion. In contrast, our NSFP++ generates accurate predictions in all these cases.

The model parameters Θ and Θ_bwd are initialized and optimized for each time stamp t. Although we only take the forward flow into the next-step processing, learning the flows bidirectionally helps improve the scene flow quality [45,47].

Identifying Static Points. Since our focus is moving objects, we start by strategically removing static points to reduce computational complexity and benefit scene flow estimation. In autonomous driving datasets, one large body of static points is ground. Ground is usually captured as a flat surface for which predicting local motion is not possible due to the aperture problem. We follow [45,47] and remove ground points prior to motion estimation. This is achieved by a RANSAC-based algorithm in which a parameterized close-to-horizontal plane is fitted to the points and points in its vicinity are marked as static. However, ground is not the only static part of the scene, and unsupervised flow predictions in other static regions (e.g., walls, buildings, trees) introduce noise, reducing the quality of our final auto labels. As a result, we further propose to identify more static regions in the scene prior to scene flow estimation. This is achieved by comparing the Chamfer distance between the points in the current frame and those in earlier frames. We mark points as static if the computed Chamfer distance is less than a threshold. We set a small threshold to have a high precision in this step (i.e., 20 cm/s in our experiments).

Estimate Local Flow via Scene Decomposition. Inspired by the fact that objects in outdoor scenes are often well-separated after detecting and isolating the ground points, we propose to further decompose the dynamic part of the scene into connected components. This strategy allows us to solve for local flows for each cluster in a targeted manner, which can greatly improve the accuracy of flow estimation for various traffic participants, e.g., vehicles, pedestrians, cyclists, travelling at highly different velocities. Fig. 2 gives an overview of our method.

Fig. 4. Illustration of the effectiveness of box query with expansion in more accurately estimating flow over the object shape. Top and bottom are without and with expansion.

Fig. 5. Illustration of the effectiveness of box query followed by pruning in accurately preserving the local flow for nearby objects. Top and bottom are without and with pruning.

More precisely, given the identified static points, we split the point sets as S_t = S_t^s ∪ S_t^d and S_{t+1} = S_{t+1}^s ∪ S_{t+1}^d, where S_t^s and S_{t+1}^s contain static points while S_t^d and S_{t+1}^d store dynamic points. This separation not only helps decompose the scene into semantically meaningful connected components, but also substantially reduces false positive flow predictions on static objects. We then further break down the dynamic points into S_t^d = ∪_{i=1}^K C_t^i, where C_t^i ∈ R^{m_i×3} is one disjoint cluster of m_i points (the number of clusters K can vary as the scene changes). In the rest of this section, we omit index i for brevity and let C_t represent one of the clusters.
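Before specifying the per-cluster objective, the snippet below sketches the static/dynamic split and the connected-component decomposition described above; it is our illustration rather than the paper's code, and the nearest-neighbor distance threshold and DBSCAN settings are assumptions.

# Static/dynamic split via a per-point distance check against an earlier frame,
# followed by DBSCAN decomposition of the dynamic points into components.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import DBSCAN

def split_static_dynamic(s_t: np.ndarray, s_prev: np.ndarray,
                         thresh: float = 0.2) -> np.ndarray:
    """Return a boolean mask that is True for points of s_t considered dynamic.
    Both arrays are (N, 3) point clouds expressed in the same (world) frame."""
    dist, _ = cKDTree(s_prev).query(s_t)       # nearest-neighbor distance per point
    return dist >= thresh                      # close to the earlier frame -> static

def decompose_dynamic(s_dyn: np.ndarray, eps: float = 1.0, min_pts: int = 5):
    """Cluster dynamic points into connected components C_t^1, ..., C_t^K."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(s_dyn)
    return [s_dyn[labels == k] for k in range(labels.max() + 1)]  # drop noise (-1)

# Usage:
# dyn_mask = split_static_dynamic(points_t, points_t_minus_1)
# clusters = decompose_dynamic(points_t[dyn_mask])
# Each cluster C_t is then handled by the per-cluster objective in Eq. (2).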
For every C_t ⊆ S_t^d at time t, we solve for model parameters to derive local flows by minimizing the following objective function:

\Theta^*, \Theta_{bwd}^* = \arg\min_{\Theta,\,\Theta_{bwd}} \sum_{p \in C_t \subseteq S_t^d} \mathcal{L}(p + f,\, C_{t+1}) + \sum_{\hat{p} \in \hat{C}_t \subseteq \hat{S}_t^d} \mathcal{L}(\hat{p} + f_{bwd},\, C_t) + \frac{\alpha}{|C_t|} \sum_{\substack{f_i, f_j \in F_{C_t} \\ i \neq j}} \lVert f_i - f_j \rVert_2^2 \qquad (2)

where the last term is the newly introduced local consistency regularizer with α set to 0.1, F_{C_t} consists of the flow vectors for each point in C_t, Ŝ_t^d contains predicted future positions of all points residing in S_t^d, Ĉ_t is a subset of Ŝ_t^d only storing future positions of points in C_t ⊆ S_t^d, and C_{t+1} is a subset of S_{t+1}^d derived by a box query within a neighborhood of C_t. Next we present our box query strategy: expansion with pruning.

Box Query Strategy. Considering that some objects (vehicles) may move at a high speed, we need to expand the field of view to find matching points in the next frame. Given a cluster C_t, we find the axis-aligned (along the X and Y axes) bounding box tightly covering C_t in the bird's eye view (BEV). The box is represented as b = [x_min, y_min, x_max, y_max]. Note that fast-moving objects, e.g., vehicles, can travel multiple meters between two LiDAR scans. To satisfactorily capture the points of such objects at time t+1, we propose to expand the box query with axis-aligned buffer distances δ_x, δ_y and use b′ = [x_min − δ_x, y_min − δ_y, x_max + δ_x, y_max + δ_y] to retrieve points from S_{t+1}^d, resulting in C_{t+1}. We set the buffer distances according to the aspect ratio of the box b, i.e., δ_y / δ_x = (y_max − y_min) / (x_max − x_min). We empirically set max{δ_x, δ_y} = 2.5 m. Fig. 4 illustrates that the expanded box query captures the full shape of a fast-moving truck, resulting in accurate prediction of the future position of the entire object point cloud (i.e., predicted future positions align nicely with the next frame).

Fig. 6. Auto Meta Labeling pipeline. Given point locations and scene flows on each scene, our Auto Meta Labeling pipeline first proposes objects by spatio-temporal clustering, connects visible bounding boxes of proposals into tracks, then performs shape registration on each track to obtain 3D amodal bounding boxes on each scene.

In crowded areas of the scene, the box query with b′ may introduce irrelevant points into the optimization process, causing the flow to drift erroneously. See Fig. 5 for an example, where two vehicles are moving fast and close to each other. The box query with b′ can include points from the other vehicle and lead to flow drifting. To address this challenge, we propose to prune the retrieved points based on the statistics of C_t. Formally, let Ω be the set of points retrieved by b′ from S_{t+1}^d. We select the n = min{|Ω|, |C_t|} nearest points from Ω with respect to the first moment of C_t and store them in the set C_{t+1} ∈ R^{n×3}. The effectiveness of pruning in keeping relevant points and thus preserving local flow is shown in Fig. 5.
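The box query strategy can be summarized in a few lines; the sketch below is our illustration (not the paper's implementation) of the expansion rule with max{δ_x, δ_y} = 2.5 m followed by centroid-based pruning, with minor details (e.g., the degenerate-box guard) left as assumptions.

# Expanded BEV box query with centroid-based pruning for one cluster C_t.
import numpy as np

def box_query(c_t: np.ndarray, s_next_dyn: np.ndarray,
              max_buffer: float = 2.5) -> np.ndarray:
    """c_t: (m, 3) cluster at time t; s_next_dyn: (N, 3) dynamic points at t+1.
    Returns the pruned query result, an (n, 3) array with n <= |c_t|."""
    x_min, y_min = c_t[:, 0].min(), c_t[:, 1].min()
    x_max, y_max = c_t[:, 0].max(), c_t[:, 1].max()

    # Buffers proportional to the BEV box aspect ratio, with the larger one capped at max_buffer.
    dx, dy = max(x_max - x_min, 1e-6), max(y_max - y_min, 1e-6)
    scale = max_buffer / max(dx, dy)
    delta_x, delta_y = scale * dx, scale * dy

    # Expanded axis-aligned box query b' in the bird's eye view.
    in_box = ((s_next_dyn[:, 0] >= x_min - delta_x) & (s_next_dyn[:, 0] <= x_max + delta_x) &
              (s_next_dyn[:, 1] >= y_min - delta_y) & (s_next_dyn[:, 1] <= y_max + delta_y))
    omega = s_next_dyn[in_box]

    # Prune: keep the min{|Omega|, |C_t|} points of Omega closest to the centroid of C_t.
    n = min(len(omega), len(c_t))
    dists = np.linalg.norm(omega - c_t.mean(axis=0), axis=1)
    return omega[np.argsort(dists)[:n]]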
3.2 Auto Meta Labeling

With the motion signals provided by the unsupervised scene flow module, we are able to generate 3D proposals for moving objects without any manual labels. We propose an Auto Meta Labeling pipeline, which takes point clouds and scene flows as inputs and generates high-quality 3D auto labels (Fig. 6). The Auto Meta Labeling pipeline has four components: (a) object proposal by clustering, which leverages spatio-temporal information to cluster points into visible boxes (tight boxes covering visible points), forming the concept of objects in each scene; (b) tracking, which connects visible boxes of objects across frames into tracklets; (c) shape registration, which aggregates points of each track to complete the shape of the object; (d) amodal box refinement, which transforms visible boxes into amodal boxes. See supplementary materials for implementation details.

Object Proposal by Clustering. On each scene, given the point cloud locations S = {p_n | p_n ∈ R^3}_{n=1}^N and the corresponding point-wise scene flows F = {f_n | f_n ∈ R^3}_{n=1}^N, the clustering module segments points into subsets, where each subset represents an object proposal. We further compute a bounding box of each subset as an object representation. Traditional clustering methods on point clouds often consider the 3D point locations S as the only feature. In autonomous driving data, with a large portion of points belonging to the background, such methods generate many irrelevant clusters (Fig. 7a). As we focus on moving objects, we leverage the motion signals to reduce false positives. Hence, a clustering method based on both point locations and scene flows is desired. One simple yet effective strategy is filtering the point cloud by scene flow before object proposal: we only keep points with a flow magnitude larger than a threshold. We then apply the DBSCAN [24] clustering algorithm on the filtered point sets. This filtering can largely reduce the false positives (Fig. 7b).

However, there is still a common case that the aforementioned approach cannot handle well: close-by objects tend to be under-segmented into a single cluster. To solve this issue, we propose clustering by both spatial locations and scene flows (Algorithm 1). After removing points with flow magnitudes smaller than a threshold |f|_min, we obtain the filtered point locations S′ and point-wise scene flows F′. We then apply DBSCAN to S′ and F′ separately, resulting in two sets of clusters. Based on its location and motion, a point may fall into different subsets under these two clusterings. We then intersect the subsets obtained by the location-based and the flow-based clusterings to form the final clusters. In this way, two points are clustered together only if they are close with respect to both their location and motion (Fig. 7c).

Fig. 7. Comparison between object proposals by different clustering approaches. Points are colored by scene flow magnitudes and directions. Dark for static points. (a) Clustering by location only. (b) Filter by flow magnitude and then cluster based on location. (c) Filter by flow and cluster based on both location and motion (ours).

Algorithm 1 Object proposal by spatio-temporal clustering on each scene.
Input: point locations S = {p_n}_{n=1}^N; point-wise scene flows F = {f_n}_{n=1}^N
Hyperparams: neighborhood thresholds ε_p, ε_f; minimum flow magnitude |f|_min
Output: 3D bounding boxes B_vis = {b_k}_{k=1}^{M_pf} of the visible parts of proposed objects

function FlowBasedClustering(S, F; ε_p, ε_f, |f|_min):
    S′, F′ ← FilterByFlowMagnitude(S, F; |f|_min)
    C_p ← DBSCAN(S′; ε_p)    ▷ C_p = {c_i}_{i=1}^{M_p}, point sets clustered by locations
    C_f ← DBSCAN(F′; ε_f)    ▷ C_f = {c_j}_{j=1}^{M_f}, point sets clustered by flows
    for c_i in C_p do
        for c_j in C_f do
            c_k ← c_i ∩ c_j
            f̄_k ← Average({f_l | ∀l : p_l ∈ c_k})
            b_k ← MinAreaBBoxAlongDirection(c_k, f̄_k)
    return {b_k}_{k=1}^{M_pf}
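A compact way to realize Algorithm 1 with off-the-shelf tools is sketched below; it is our illustration, not the released code. The DBSCAN thresholds mirror the values reported in the supplementary material (ε_p = 1.0, ε_f = 0.1, |f|_min = 1 m/s, assuming flows expressed in m/s), while min_samples is an assumption.

# Spatio-temporal clustering: filter by flow magnitude, run DBSCAN separately on
# locations and on flow vectors, then intersect the two label assignments.
import numpy as np
from sklearn.cluster import DBSCAN

def flow_based_clustering(points: np.ndarray, flows: np.ndarray,
                          eps_p: float = 1.0, eps_f: float = 0.1,
                          f_min: float = 1.0, min_samples: int = 5):
    """points, flows: (N, 3) arrays. Returns a list of (cluster_points, mean_flow)."""
    keep = np.linalg.norm(flows, axis=1) > f_min                         # FilterByFlowMagnitude
    s_f, f_f = points[keep], flows[keep]

    lab_p = DBSCAN(eps=eps_p, min_samples=min_samples).fit_predict(s_f)  # cluster by location
    lab_f = DBSCAN(eps=eps_f, min_samples=min_samples).fit_predict(f_f)  # cluster by flow

    clusters = []
    for i in np.unique(lab_p[lab_p >= 0]):
        for j in np.unique(lab_f[lab_f >= 0]):
            mask = (lab_p == i) & (lab_f == j)                           # c_k = c_i ∩ c_j
            if mask.sum() < min_samples:
                continue
            clusters.append((s_f[mask], f_f[mask].mean(axis=0)))
    return clusters

# Each cluster would then be turned into a minimum-area BEV box whose heading is
# parallel to the cluster's mean flow (MinAreaBBoxAlongDirection in Algorithm 1).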
Having the cluster label for each point, we form the concept of an object via a bounding box covering each cluster. Given the partial observation of objects within a single frame, we only generate boxes tightly covering the visible part at this stage, B_vis = {b_k}. Without object semantics, we use motion information to decide the heading of each box. We compute the average flow f̄_k of each cluster c_k. We then find the 7-DoF bounding box b_k surrounding c_k which has the minimum area on the xy-plane along the chosen heading direction parallel to f̄_k.

Multi-Object Tracking. The tracking module connects the visible boxes B_vis into object tracks. Following the tracking-by-detection paradigm [93,62], we use B_vis for data association and a Kalman filter for state updates. However, rather than relying on the Kalman filter to estimate object speeds, our tracking module leverages our estimated scene flows in the associations. In each step of the association, we advance previously tracked boxes using scene flows and match the advanced boxes with those in the next frame.

Shape Registration and Amodal Box Refinement. In the unsupervised setting, human annotations of object shapes are unavailable. It is hard to infer the amodal shapes of occluded objects purely based on sensor data from one timestamp. However, the observed views of an object often change across time as the autonomous driving car or the object moves. This enables temporal data aggregation to achieve more complete amodal perception of each object.

For temporal aggregation, we propose a shape registration method built upon sequentially applying ICP [4,13,64] (Algorithm 2). ICP performance is sensitive to the transformation initialization. In clustering, we have obtained the headings {θ_l}_{l=1}^L of all visible boxes in each track. The difference in headings of each source and target point set constructs a rotation initialization R_{θ_tgt − θ_src} for ICP.

In autonomous driving scenarios, shape registration among a sequence of observations poses special challenges: (a) objects are moving with large displacements in the world coordinate system; (b) many observations of objects are very sparse due to their far distance from the sensor and/or heavy occlusions. These two challenges make it hard to register points from different frames. To tackle this problem, we search in a grid to obtain the best translation for aligning the source (from frame A) and target (from frame B) point sets. The grid, or the search range, is defined by the size of the target frame bounding box. We initialize the translation T_j corresponding to different grid points and keep the best registration result among them.

Sequentially, partial views of an object in a track are aggregated into a more complete point set, whose extent is often close to the amodal box. We then compute a bounding box around the target point set similar to the last step in object proposal. During registration, we have estimated the transformation from each source point set to the target, so we can propagate the target bounding box back to each scene by inverting each transformation matrix. Finally, we obtain 3D amodal bounding boxes of the detected objects.

Algorithm 2 Sequential shape registration and box refinement.
Input: an object track with point locations {X_l}_{l=1}^L, bounding boxes {b_l}_{l=1}^L, headings {θ_l}_{l=1}^L, all in the world coordinate system.
Output: refined boxes {b′_l}_{l=1}^L.

function ShapeRegistrationAndBoxRefinement({X_l}_{l=1}^L, {b_l}_{l=1}^L, {θ_l}_{l=1}^L):
    X′_l = X_l − X̄_l, ∀l ∈ {1, ..., L}    ▷ Normalize points to object-centered
    X′_tgt ← X′_î : î = argmax_i |X′_i|    ▷ Init target as the most dense point cloud
    I = {î+1, î+2, ..., L, î−1, î−2, ..., 1}    ▷ Shape registration ordering
    for i in I do
        for T_j in SearchGrid(b_tgt) do
            T_init ← [R_{θ_tgt − θ_i} | T_j]
            X′_{tgt,j}, T_{i→tgt,j}, ε_j ← ICP(X′_i, X′_tgt, T_init)
        X′_tgt, T_{i→tgt} ← X′_{tgt,ĵ}, T_{i→tgt,ĵ} : ĵ = argmin_j ε_j    ▷ Registration w/ least error
    b′_tgt = MinAreaBBoxAlongDirection(X′_tgt + X̄_tgt, θ_tgt)
    for i in I do
        b′_i = Transform(b′_tgt, T^{-1}_{i→tgt})
    return {b′_l}_{l=1}^L
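For reference, the snippet below sketches one source-to-target step of Algorithm 2: the rotation is initialized from the heading difference and constrained to the z-axis, the translation is initialized on a 5×5 grid over the target box, and a small point-to-point ICP refines the alignment. It is our simplified illustration, not the paper's constrained ICP [30]; iteration counts and convergence handling are assumptions.

# z-rotation-constrained ICP with heading-based rotation init and grid-searched translation init.
import numpy as np
from scipy.spatial import cKDTree

def z_rotation(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def constrained_icp(src: np.ndarray, tgt: np.ndarray, iters: int = 20):
    """Point-to-point ICP with rotation restricted to the z-axis.
    Returns (R, t, error) such that src @ R.T + t aligns with tgt."""
    tree = cKDTree(tgt)
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        cur = src @ R.T + t
        _, idx = tree.query(cur)
        matched = tgt[idx]
        # Kabsch in the xy-plane only (enforces the z-rotation constraint).
        mu_s, mu_m = cur[:, :2].mean(0), matched[:, :2].mean(0)
        H = (cur[:, :2] - mu_s).T @ (matched[:, :2] - mu_m)
        U, _, Vt = np.linalg.svd(H)
        R2 = Vt.T @ U.T
        if np.linalg.det(R2) < 0:               # keep a proper rotation
            Vt[1] *= -1
            R2 = Vt.T @ U.T
        dR = z_rotation(np.arctan2(R2[1, 0], R2[0, 0]))
        dt = np.append(mu_m - R2 @ mu_s, (matched[:, 2] - cur[:, 2]).mean())
        R, t = dR @ R, dR @ t + dt              # compose incremental update
    err = tree.query(src @ R.T + t)[0].mean()
    return R, t, err

def register_to_target(src, tgt, theta_src, theta_tgt, tgt_box_lw, grid=5):
    """Rotation init from the heading difference plus a grid of translation inits
    spanning the target box (length, width); keep the lowest-error result."""
    R_init = z_rotation(theta_tgt - theta_src)
    l, w = tgt_box_lw
    best = None
    for tx in np.linspace(-l / 2, l / 2, grid):
        for ty in np.linspace(-w / 2, w / 2, grid):
            T0 = np.array([tx, ty, 0.0])
            R, t, err = constrained_icp(src @ R_init.T + T0, tgt)
            if best is None or err < best[-1]:
                best = (R @ R_init, R @ T0 + t, err)   # transform mapping raw src -> tgt
    return best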
4 Experiments

We evaluate our framework using the challenging Waymo Open Dataset (WOD) [75], as it provides a large collection of LiDAR sequences with 3D labels for each frame (we only use labels for evaluation unless noted otherwise). In our experiments, objects with speed > 1 m/s are regarded as moving. Hyperparameters and ablation studies are presented in the supplementary material.

4.1 Scene Flow

Metrics. We employ the widely adopted metrics as in [45,96]: the 3D end-point error (EPE3D), computed as the mean L2 distance between the prediction and the ground truth over all points; Acc_5, the percentage of points with EPE3D < 5 cm or relative error < 5%; Acc_10, the percentage of points with EPE3D < 10 cm or relative error < 10%; and θ, the mean angle error between predictions and ground truths. In addition, we evaluate our approach based on fine-grained speed breakdowns. We assign each point to one speed class (e.g., 0-3 m/s, 3-6 m/s, etc.) and employ the Intersection-over-Union (IoU) metric to measure the performance in terms of class-wise IoU and mean IoU. IoU is computed as TP / (TP + FP + FN), the same as in 3D semantic segmentation [6].

Results. We evaluate our NSFP++ over all frames of the WOD [75] validation set and compare it with the previous state-of-the-art scene flow estimator, NSFP [45]. Following [41], we use the provided vehicle pose to compensate for the ego motion, such that our metrics are independent of the autonomous vehicle motion and better reflect the flow quality on the moving objects. Fig. 3 visualizes the improvement of the proposed NSFP++ compared to NSFP. Our approach accurately predicts flows for both high- and low-speed objects (a, d). In addition, NSFP++ not only is highly reliable in detecting the subtle motion of vulnerable road users (d) but can also robustly distinguish all moving objects from the static background (b, c). Finally, our approach outperforms NSFP substantially across all quantitative metrics, as listed in Tab. 1 and Tab. 2.

Table 1. Comparison of scene flow methods on the WOD validation set.

Method          EPE3D (m) ↓   Acc5 (%) ↑   Acc10 (%) ↑   θ (rad) ↓
NSFP [45]       0.455         23.65        43.06         0.9190
NSFP++ (ours)   0.017         95.05        96.45         0.4737

Table 2. Comparison of scene flow methods on the WOD validation set, with speed breakdowns.

                IoU per Speed Breakdown (m/s)
Method          0-3     3-6     6-9     9-12    12-15   15+     mIoU
NSFP [45]       0.657   0.152   0.216   0.166   0.130   0.140   0.244
NSFP++ (ours)   0.989   0.474   0.522   0.479   0.442   0.608   0.586
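For clarity, the metric definitions above can be computed as in the following sketch (our illustration, not the official evaluation code); the handling of near-zero flow vectors in the angle error is an assumption.

# Scene flow metrics from per-point predicted and ground-truth flows of shape (N, 3).
import numpy as np

def scene_flow_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    err = np.linalg.norm(pred - gt, axis=1)              # per-point end-point error
    gt_norm = np.linalg.norm(gt, axis=1)
    rel = err / (gt_norm + eps)                          # relative error

    epe3d = err.mean()
    acc5 = np.mean((err < 0.05) | (rel < 0.05))          # Acc_5
    acc10 = np.mean((err < 0.10) | (rel < 0.10))         # Acc_10

    cos = np.sum(pred * gt, axis=1) / (np.linalg.norm(pred, axis=1) * gt_norm + eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0)).mean()    # mean angle error (rad)
    return {"EPE3D": epe3d, "Acc5": acc5, "Acc10": acc10, "theta": theta}

def speed_class_iou(pred_speed, gt_speed, bins=(0, 3, 6, 9, 12, 15, np.inf)):
    """Per-speed-class IoU = TP / (TP + FP + FN), as in 3D semantic segmentation."""
    p = np.digitize(pred_speed, bins) - 1
    g = np.digitize(gt_speed, bins) - 1
    ious = []
    for c in range(len(bins) - 1):
        tp = np.sum((p == c) & (g == c))
        fp = np.sum((p == c) & (g != c))
        fn = np.sum((p != c) & (g == c))
        ious.append(tp / max(tp + fp + fn, 1))
    return ious, float(np.mean(ious))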
4.2 Unsupervised 3D Object Detection

Our method aims at generating auto labels for training downstream autonomous driving tasks in a fully unsupervised manner. 3D object detection is a core component in autonomous driving systems. In this section, we evaluate the effectiveness of our unsupervised AML pseudo labels by training a 3D object detector. We adopt the PointPillars [43] detector for our experiments. All models are trained and evaluated on the WOD [75] training and validation sets. Since there is no category information during training, we use a single-class detector to detect any moving objects. We train and evaluate the detectors on a 100 m x 40 m rectangular region around the ego vehicle to reflect the egocentric importance of the predictions [17]. We set a 3D IoU of 0.4 during evaluation to account for the large variation in size of the class-agnostic moving objects, e.g., vehicles, pedestrians, cyclists. We employ a top-performing flow model [41] as the supervised counterpart to our unsupervised flow model NSFP++.

Table 3. Comparison between 3D detectors trained with auto labels generated by AML with supervised flow and unsupervised flow.

                                              3D mAP          2D mAP
Method                       Supervision      L1      L2      L1      L2
Sup Flow [41] + Clustering   Supervised       30.8    29.7    42.7    41.2
Sup Flow [41] + AML          Supervised       49.9    48.0    56.8    54.8
No flow + Clustering         Unsupervised     4.7     4.5     5.8     5.6
No flow + AML                Unsupervised     9.6     9.4     11.0    10.8
Unsup Flow + Clustering      Unsupervised     30.4    29.2    36.7    35.3
Unsup Flow + AML             Unsupervised     42.1    40.4    49.1    47.4

Tab. 3 compares the performance of detectors trained with auto labels generated by our pipelines and several baselines. The first two rows show detection results when a fully supervised flow model [41] (flow supervision derived from human box labels) is deployed for generating the auto labels. The first row represents a baseline where our hybrid clustering method is used to form the auto labels based on motion cues [18]. The second row shows the performance when the same supervised flow predictions are used in combination with our AML pipeline. Clearly, our AML pipeline greatly outperforms the clustering baseline, verifying the high quality of the auto labels generated by our method. The last four rows consider the unsupervised setting. No flow + Clustering is a baseline where DBSCAN is applied to the point locations to form the auto labels. No flow + AML is our pipeline when purely relying on a regular tracker without using any flow information. Unsup Flow + Clustering uses our proposed hybrid clustering technique on the outputs of our NSFP++ scene flow estimator without connecting with our AML. Unsup Flow + AML is our full unsupervised pipeline. Notably, not only does it outperform other unsupervised baselines by a large margin,
but it also achieves comparable performance with the supervised Sup Flow + AML counterpart. Moreover, comparing it with the other unsupervised baselines obtained by removing parts of our pipeline validates the importance of all components in our design (please see the supplementary for more ablations). Most importantly, our approach is a fully unsupervised 3D pipeline, capable of detecting moving objects in the open-set environment. This new capability is cost-efficient and safety-critical for the autonomous vehicle to reliably detect arbitrary moving objects, removing the need for human annotation and the constraint of a predefined taxonomy.

4.3 Open-set 3D Object Detection

In this section, we turn our attention to the open-set setting where only a subset of categories are annotated. Since there is no public 3D dataset designed for this purpose, we perform experiments in a leave-one-out manner on WOD [75]. WOD has three categories, namely vehicle, pedestrian, and cyclist. Considering the similar appearances and safety requirements, we combine pedestrian and cyclist into a larger category called VRU (vulnerable road user), resulting in a data size comparable with the vehicle category. We then assume to only have access to human annotations for one of the two categories, leaving the other one out for our auto meta labeling pipeline to pseudo label.

Tab. 4 shows the results. The first two rows show the performance of a fully supervised PointPillars detector. As expected, when the detector is trained on one of the categories, it cannot generalize to the other. In the last two rows, when human annotations are not available, we rely on our auto labels to fill in for the unknown category. When no vehicle label is available, our pipeline helps the detector to generalize and consequently improves the mAP from 48.8 to 77.1. Although generalizing to VRUs without any human labels is a more challenging scenario, our pipeline still improves the mAP by a noticeable margin, showing its effectiveness in the open-set settings.

Table 4. Open-set 3D detection results.

                          Human Labeled
Method                    Vehicle   VRU     Vehicle 3D AP   VRU 3D AP   3D mAP
Supervised Method         ✓                 97.5            0.0         48.8
                                    ✓       0.0             88.7        44.4
Ours (Supervised + AML)   ✓                 97.5            20.8        59.2
                                    ✓       65.4            88.7        77.1

4.4 Open-set Trajectory Prediction

For trajectory prediction, we have extracted road graph information for a subset of WOD (consisting of 625 training and 172 validation sequences). We use those WOD run segments with road graph information for our trajectory prediction experiments. Following [25], a trajectory prediction model is required to forecast the future positions of surrounding agents for 8 seconds into the future, based on the observation of 1 second of history. We use the MultiPath++ [83] model for our study. The model predicts 6 different trajectories for each object and a probability for each trajectory. To evaluate the impact of open-set moving objects on the behavior prediction task, we train models using perception labels derived via different strategies as the ground truth data and then evaluate the behavior prediction metrics of the trained models on a manually labeled validation set. We use the minADE and minFDE metrics as described in [25].
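As a reference for the metric definitions, the following is a small sketch (our illustration, not the evaluation code of [25]) of minADE/minFDE for K predicted modes and a single ground-truth future trajectory.

# minADE/minFDE over K predicted trajectory modes.
import numpy as np

def min_ade_fde(preds: np.ndarray, gt: np.ndarray):
    """preds: (K, T, 2) predicted BEV trajectories; gt: (T, 2) ground-truth future."""
    dists = np.linalg.norm(preds - gt[None], axis=-1)   # (K, T) per-step errors
    ade = dists.mean(axis=1)                            # average displacement per mode
    fde = dists[:, -1]                                  # final displacement per mode
    return ade.min(), fde.min()

# Usage: with the setup above, K = 6 modes and T covers the 8 s prediction horizon;
# the metrics are averaged over agents and scenes.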
Tab. 5 reports the trajectory prediction results. While the supervised method achieves a reasonable result when the vehicle class is labeled, its performance is poor when trained only on the VRU class. This is expected, as the motion learned from slow vehicles can be generalized to VRUs to some extent, but predicting the trajectory of fast-moving vehicles is out of reach for a model trained only on VRUs. The last two rows show the performance of the same model when AML is deployed for auto-labeling the missing category. Consistent with our observation in 3D detection, our method can bridge the gap in the open-set setting. Namely, our approach significantly remedies the generalization problem from VRUs to vehicles and achieves the best performance when combining human labels of the vehicle class with our auto labels for VRUs.

Table 5. Open-set trajectory prediction results.

                          Human Labeled
Method                    Vehicle   VRU     minADE   minFDE
Supervised Method         ✓                 2.12     5.39
                                    ✓       9.53     22.31
Ours (Supervised + AML)   ✓                 1.89     4.79
                                    ✓       2.15     5.55

5 Conclusion

In this paper, we proposed a novel unsupervised framework for training onboard 3D detection and prediction models to understand open-set moving objects. Extensive experiments show that our unsupervised approach is competitive in regular detection tasks to the counterpart which uses supervised scene flow. With promising results, it demonstrates great potential in enabling perception and prediction systems to handle open-set moving objects. We hope our findings encourage more research toward solving autonomy in an open-set environment.

References

1. Agarwal, S., Snavely, N., Seitz, S.M., Szeliski, R.: Bundle adjustment in the large. In: ECCV (2010) 3
2. Bansal, M., Krizhevsky, A., Ogale, A.: Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079 (2018) 4
3. Bau, D., Zhu, J.Y., Strobelt, H., Zhou, B., Tenenbaum, J.B., Freeman, W.T., Torralba, A.: Gan dissection: Visualizing and understanding generative adversarial networks. In: ICLR (2019) 3
4. Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. Spie (1992) 3, 10
5. Bewley, A., Sun, P., Mensink, T., Anguelov, D., Sminchisescu, C.: Range conditioned dilated convolutions for scale invariant 3d object detection (2020) 3
6. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: CVPR (2020) 4, 11
7. Caine, B., Roelofs, R., Vasudevan, V., Ngiam, J., Chai, Y., Chen, Z., Shlens, J.: Pseudo-labeling for scalable 3d object detection. arXiv preprint arXiv:2103.02093 (2021) 3
8. Casas, S., Luo, W., Urtasun, R.: Intentnet: Learning to predict intention from raw sensor data. In: CoRL (2018) 4
9. Cen, J., Yun, P., Cai, J., Wang, M.Y., Liu, M.: Open-set 3d object detection. In: 3DV (2021) 3
10. Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In: CoRL (2019) 1, 4
11. Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., et al.: Argoverse: 3d tracking and forecasting with rich maps. In: CVPR (2019) 4
12. Chen, Y., Liu, S., Shen, X., Jia, J.: Fast point r-cnn. In: ICCV (2019) 3
13. Chen, Y., Medioni, G.: Object modelling by registration of multiple range images. Image and vision computing 10(3), 145–155 (1992) 3, 10
14.
Chen, Y., Rong, F., Duggal, S., Wang, S., Yan, X., Manivasagam, S., Xue, S., Yumer, E., Urtasun, R.: Geosim: Realistic video simulation via geometry-aware composition for self-driving. In: CVPR (2021) 4 15. Cho, M., Kwak, S., Schmid, C., Ponce, J.: Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In: CVPR (2015) 3 16. Cui, H., Radosavljevic, V., Chou, F.C., Lin, T.H., Nguyen, T., Huang, T.K., Schneider,J.,Djuric,N.:Multimodaltrajectorypredictionsforautonomousdriv- ing using deep convolutional networks. In: ICRA (2019) 4 17. Deng,B.,Qi,C.R.,Najibi,M.,Funkhouser,T.,Zhou,Y.,Anguelov,D.:Revisiting 3d object detection from an egocentric perspective. NeurIPS (2021) 12 18. Dewan,A.,Caselitz,T.,Tipaldi,G.D.,Burgard,W.:Motion-baseddetectionand tracking in 3d lidar scans. In: ICRA (2016) 3, 12 19. Djuric,N.,Radosavljevic,V.,Cui,H.,Nguyen,T.,Chou,F.C.,Lin,T.H.,Schnei- der, J.: Short-term motion prediction of traffic actors for autonomous driving using deep convolutional networks (2018) 416 M. Najibi et al. 20. Duggal,S.,Wang,Z.,Ma,W.C.,Manivasagam,S.,Liang,J.,Wang,S.,Urtasun, R.: Mending neural implicit modeling for 3d vehicle reconstruction in the wild. In: WACV (2022) 4 21. Engelcke,M.,Rao,D.,Wang,D.Z.,Tong,C.H.,Posner,I.:Vote3deep:Fastobject detection in 3d point clouds using efficient convolutional neural networks. In: ICRA (2017) 2 22. Engelmann,F.,Stu¨ckler,J.,Leibe,B.:Jointobjectposeestimationandshapere- constructioninurbanstreetscenesusing3dshapepriors.In:GermanConference on Pattern Recognition. pp. 219–230. Springer (2016) 4 23. Engelmann, F., Stu¨ckler, J., Leibe, B.: Samp: shape and motion priors for 4d vehicle reconstruction. In: WACV (2017) 4 24. Ester,M.,Kriegel,H.P.,Sander,J.,Xu,X.,etal.:Adensity-basedalgorithmfor discovering clusters in large spatial databases with noise. In: KDD (1996) 9 25. Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp,B.,Qi,C.R.,Zhou,Y.,etal.:Largescaleinteractivemotionforecastingfor autonomous driving: The waymo open motion dataset. In: ICCV (2021) 4, 14 26. Faktor, A., Irani, M.: “clustering by composition”—unsupervised discovery of image categories. In: ECCV (2012) 3 27. Fan,L.,Xiong,X.,Wang,F.,Wang,N.,Zhang,Z.:Rangedet:Indefenseofrange view for lidar-based 3d object detection. In: ICCV (2021) 3 28. Gao,J.,Sun,C.,Zhao,H.,Shen,Y.,Anguelov,D.,Li,C.,Schmid,C.:Vectornet: Encodinghdmapsandagentdynamicsfromvectorizedrepresentation.In:CVPR (2020) 4 29. Grauman, K., Darrell, T.: Unsupervised learning of categories from sets of par- tially matching image features. In: CVPR (2006) 3 30. Groß, J., Oˇsep, A., Leibe, B.: Alignnet-3d: Fast point cloud registration of par- tially observed objects. In: 3DV (2019) 3, 21, 23 31. Gu, J., Ma, W.C., Manivasagam, S., Zeng, W., Wang, Z., Xiong, Y., Su, H., Urtasun,R.:Weakly-supervised3dshapecompletioninthewild.In:ECCV(2020) 4 32. Gu,J.,Sun,C.,Zhao,H.:Densetnt:End-to-endtrajectorypredictionfromdense goal sets. In: ICCV (2021) 1 33. Gu, X., Wang, Y., Wu, C., Lee, Y.J., Wang, P.: Hplflownet: Hierarchical permu- tohedral lattice flownet for scene flow estimation on large-scale point clouds. In: CVPR (2019) 3, 4 34. He, C., Zeng, H., Huang, J., Hua, X.S., Zhang, L.: Structure aware single-stage 3d object detection from point cloud. In: CVPR (June 2020) 3 35. Hong, J., Sapp, B., Philbin, J.: Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In: CVPR (2019) 4 36. 
Houston, J., Zuidhof, G., Bergamini, L., Ye, Y., Chen, L., Jain, A., Omari, S., Iglovikov, V., Ondruska, P.: One thousand and one hours: Self-driving motion prediction dataset. arXiv preprint arXiv:2006.14480 (2020) 4 37. Insafutdinov, E., Dosovitskiy, A.: Unsupervised learning of shape and pose with differentiable point clouds. In: NeurIPS (2018) 3 38. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shot- ton, J., Hodges, S., Freeman, D., Davison, A., et al.: Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In: Proceedings of the24thannualACMsymposiumonUserinterfacesoftwareandtechnology.pp. 559–568 (2011) 3 39. Jerripothula,K.R.,Cai,J.,Yuan,J.:Cats:Co-saliencyactivatedtrackletselection for video co-localization. In: ECCV (2016) 3Motion Inspired Unsupervised Perception and Prediction 17 40. Joulin, A., Bach, F., Ponce, J.: Discriminative clustering for image co- segmentation. In: CVPR (2010) 3 41. Jund, P., Sweeney, C., Abdo, N., Chen, Z., Shlens, J.: Scalable scene flow from point clouds in the real world. IEEE Robotics and Automation Letters 7(2), 1589–1596 (2022). https://doi.org/10.1109/LRA.2021.3139542 4, 11, 12, 13, 21 42. Kim,G.,Torralba,A.:Unsuperviseddetectionofregionsofinterestusingiterative link analysis. In: NIPS (2009) 3 43. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: CVPR (2019) 2, 12 44. Lee,H.,Grosse,R.,Ranganath,R.,Ng,A.Y.:Convolutionaldeepbeliefnetworks forscalableunsupervisedlearningofhierarchicalrepresentations.In:ICML(2009) 3 45. Li, X., Pontes, J.K., Lucey, S.: Neural scene flow prior. In: NeurIPS (2021) 2, 3, 4, 5, 6, 11, 12 46. Li, Z., Wang, F., Wang, N.: Lidar r-cnn: An efficient and universal 3d object detector. In: CVPR (2021) 2 47. Liu,X.,Qi,C.R.,Guibas,L.J.:Flownet3d:Learningsceneflowin3dpointclouds. In: CVPR (2019) 3, 4, 6 48. Liu, Y., Zulfikar, I.E., Luiten, J., Dave, A., Oˇsep, A., Ramanan, D., Leibe, B., Leal-Taix´e, L.: Opening up open-world tracking. In: CVPR (2022) 3 49. Liu, Y., Zhang, J., Fang, L., Jiang, Q., Zhou, B.: Multimodal motion prediction with stacked transformers. In: CVPR (2021) 4 50. Luo, C., Yang, X., Yuille, A.: Self-supervised pillar motion learning for au- tonomous driving. In: CVPR (2021) 2 51. Luo,W.,Yang,B.,Urtasun,R.:Fastandfurious:Realtimeend-to-end3ddetec- tion, tracking and motion forecasting with a single convolutional net. In: CVPR (2018) 1, 4 52. Manivasagam,S.,Wang,S.,Wong,K.,Zeng,W.,Sazanovich,M.,Tan,S.,Yang, B.,Ma,W.C.,Urtasun,R.:Lidarsim:Realisticlidarsimulationbyleveragingthe real world. In: CVPR (2020) 4 53. Meyer,G.P.,Laddha,A.,Kee,E.,Vallespi-Gonzalez,C.,Wellington,C.K.:Laser- net: An efficient probabilistic 3d object detector for autonomous driving. In: CVPR (2019) 3 54. Misra,I.,Girdhar,R.,Joulin,A.:Anend-to-endtransformermodelfor3dobject detection. In: ICCV (2021) 2 55. Mittal, H., Okorn, B., Held, D.: Just go with the flow: Self-supervised scene flow estimation. In: CVPR (2020) 2, 3 56. Najibi, M., Lai, G., Kundu, A., Lu, Z., Rathod, V., Funkhouser, T., Pantofaru, C., Ross, D., Davis, L.S., Fathi, A.: Dops: Learning to detect 3d objects and predict their 3d shapes. In: CVPR (2020) 2, 4 57. Pang, Z., Li, Z., Wang, N.: Model-free vehicle tracking and state estimation in point cloud sequences. In: IROS (2021) 3 58. Phan-Minh,T.,Grigore,E.C.,Boulton,F.A.,Beijbom,O.,Wolff,E.M.:Covernet: Multimodal behavior prediction using trajectory sets. In: CVPR (2020) 4 59. 
Pontes, J.K., Hays, J., Lucey, S.: Scene flow from point clouds with or without learning. In: 2020 International Conference on 3D Vision (3DV). pp. 261–270 (2020). https://doi.org/10.1109/3DV50981.2020.00036 2 60. Puy,G.,Boulch,A.,Marlet,R.:Flot:Sceneflowonpointcloudsguidedbyoptimal transport. In: ECCV (2020) 3 61. Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep hough voting for 3d object de- tection in point clouds. In: ICCV (2019) 1, 218 M. Najibi et al. 62. Qi,C.R.,Zhou,Y.,Najibi,M.,Sun,P.,Vo,K.,Deng,B.,Anguelov,D.:Offboard 3d object detection from point cloud sequences. In: CVPR (2021) 3, 4, 10, 21 63. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR (2015) 3 64. Rusinkiewicz,S.,Levoy,M.:Efficientvariantsoftheicpalgorithm.In:Proceedings thirdinternationalconferenceon3-Ddigitalimagingandmodeling.pp.145–152. IEEE (2001) 3, 10 65. Russell,B.C.,Freeman,W.T.,Efros,A.A.,Sivic,J.,Zisserman,A.:Usingmultiple segmentationstodiscoverobjectsandtheirextentinimagecollections.In:CVPR (2006) 3 66. Scho¨nberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016) 3 67. Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv-rcnn: Point- voxel feature set abstraction for 3d object detection. In: CVPR (2020) 3 68. Shi,S.,Wang,X.,Li,H.:Pointrcnn:3dobjectproposalgenerationanddetection from point cloud. In: CVPR (2019) 1, 2 69. Shi,W.,Rajkumar,R.R.:Point-gnn:Graphneuralnetworkfor3dobjectdetection in a point cloud. In: CVPR (2020) 2 70. Simony,M.,Milzy,S.,Amendey,K.,Gross,H.M.:Complex-yolo:aneuler-region- proposal for real-time 3d object detection on point clouds. In: ECCV (2018) 2 71. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering object categories in image collections. In: ICCV (2005) 3 72. Sohn,K.,Zhou,G.,Lee,C.,Lee,H.:Learningandselectingfeaturesjointlywith point-wise gated boltzmann machines. In: ICML (2013) 3 73. Song, S., Xiao, J.: Deep sliding shapes for amodal 3d object detection in rgb-d images. In: CVPR (2016) 2 74. Stutz, D., Geiger, A.: Learning 3d shape completion from laser scan data with weak supervision. In: CVPR (2018) 4 75. Sun,P.,Kretzschmar,H.,Dotiwalla,X.,Chouard,A.,Patnaik,V.,Tsui,P.,Guo, J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timofeev,A.,Ettinger,S.,Krivokon,M.,Gao,A.,Joshi,A.,Zhang,Y.,Shlens,J., Chen,Z.,Anguelov,D.:Scalabilityinperceptionforautonomousdriving:Waymo open dataset. In: CVPR (2020) 2, 11, 12, 13, 21 76. Sun, P., Wang, W., Chai, Y., Elsayed, G., Bewley, A., Zhang, X., Sminchisescu, C., Anguelov, D.: Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In: CVPR. pp. 5725–5734 (2021) 3 77. Tang, C.,Tan,P.:Ba-net: Dense bundleadjustmentnetwork. In: ICLR (2019) 3 78. Tian, H., Chen, Y., Dai, J., Zhang, Z., Zhu, X.: Unsupervised object detection with lidar clues. In: CVPR (2021) 3 79. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjust- ment—a modern synthesis. In: International workshop on vision algorithms. pp. 298–372. Springer (1999) 3 80. Tulsiani, S., Efros, A.A., Malik, J.: Multi-view consistency as supervisory signal for learning shape and pose prediction. In: CVPR (2018) 3 81. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single- view reconstruction via differentiable ray consistency. In: CVPR (2017) 3 82. 
Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.:Demon:Depthandmotionnetworkforlearningmonocularstereo.In:CVPR (2017) 3Motion Inspired Unsupervised Perception and Prediction 19 83. Varadarajan,B.,Hefny,A.,Srivastava,A.,Refaat,K.S.,Nayakanti,N.,Cornman, A., Chen, K., Douillard, B., Lam, C., Anguelov, D., Sapp, B.: Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction. CoRR abs/2111.14973 (2021), https://arxiv.org/abs/2111.14973 14 84. Vo, H.V., Bach, F., Cho, M., Han, K., LeCun, Y., P´erez, P., Ponce, J.: Unsu- pervisedimagematchingandobjectdiscoveryasoptimization.In:CVPR(2019) 3 85. Vo, H.V., P´erez, P., Ponce, J.: Toward unsupervised, multi-object discovery in large-scale image collections. In: ECCV (2020) 3 86. Vo,V.H.,Sizikova,E.,Schmid,C.,P´erez,P.,Ponce,J.:Large-scaleunsupervised object discovery. In: NeurIPS (2021) 3 87. Wang, D.Z., Posner, I.: Voting for voting in online point cloud object detection. In: Proceedings of Robotics: Science and Systems. Rome, Italy (July 2015) 2 88. Wang, R., Yang, N., Stu¨ckler, J., Cremers, D.: Directshape: Direct photometric alignmentofshapepriorsforvisualvehicleposeandshapeestimation.In:ICRA (2020) 4 89. Wang, Y., Fathi, A., Kundu, A., Ross, D., Pantofaru, C., Funkhouser, T., Solomon, J.: Pillar-based object detection for autonomous driving. In: ECCV (2020) 2 90. Wang, Y., Solomon, J.M.: Deep closest point: Learning representations for point cloud registration. In: ICCV (2019) 3 91. Wang, Z., Li, S., Howard-Jenkins, H., Prisacariu, V., Chen, M.: Flownet3d++: Geometric losses for deep scene flow estimation. In: WACV (2020) 3 92. Wei, X., Zhang, Y., Li, Z., Fu, Y., Xue, X.: Deepsfm: Structure from motion via deep bundle adjustment. In: ECCV (2020) 3 93. Weng, X., Kitani, K.: A baseline for 3d multi-object tracking. arXiv preprint arXiv:1907.03961 1(2), 6 (2019) 10 94. Wong, K., Wang, S., Ren, M., Liang, M., Urtasun, R.: Identifying unknown in- stances for autonomous driving. In: CoRL. PMLR (2020) 3 95. Wu, P., Chen, S., Metaxas, D.N.: Motionnet: Joint perception and motion pre- dictionforautonomousdrivingbasedonbird’seyeviewmaps.In:CVPR(2020) 4 96. Wu, W., Wang, Z.Y., Li, Z., Liu, W., Fuxin, L.: Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation. In: ECCV (2020) 2, 11 97. Yan,X.,Hsu,J.,Khansari,M.,Bai,Y.,Pathak,A.,Gupta,A.,Davidson,J.,Lee, H.: Learning 6-dof grasping interaction via deep geometry-aware 3d representa- tions. In: ICRA (2018) 3 98. Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In: NIPS (2016) 3 99. Yang, B., Bai, M., Liang, M., Zeng, W., Urtasun, R.: Auto4d: Learning to label 4d objects from sequential point clouds. arXiv preprint arXiv:2101.06586 (2021) 3, 4 100. Yang,B.,Luo,W.,Urtasun,R.:Pixor:Real-time3dobjectdetectionfrompoint clouds. In: CVPR (2018) 2 101. Yang,H.,Shi,J.,Carlone,L.:Teaser:Fastandcertifiablepointcloudregistration. IEEE Transactions on Robotics 37(2), 314–333 (2020) 3 102. Yang, Z., Sun, Y., Liu, S., Jia, J.: 3dssd: Point-based 3d single stage object de- tector. In: CVPR (2020) 1, 2 103. Ye, M., Xu, S., Cao, T.: Hvnet: Hybrid voxel network for lidar based 3d object detection. In: CVPR (2020) 220 M. Najibi et al. 104. Ye, M., Cao, T., Chen, Q.: Tpcn: Temporal point cloud networks for motion forecasting. In: CVPR (2021) 1, 4 105. 
Yuan, J., Liu, Z., Wu, Y.: Discriminative subvolume search for efficient action detection. In: CVPR (2009) 3 106. Yuan,Y.,Weng,X.,Ou,Y.,Kitani,K.M.:Agentformer:Agent-awaretransform- ers for socio-temporal multi-agent forecasting. In: ICCV (2021) 4 107. Zakharov, S., Kehl, W., Bhargava, A., Gaidon, A.: Autolabeling 3d objects with differentiable rendering of sdf shape priors. In: CVPR (2020) 4 108. Zeng,W.,Luo,W.,Suo,S.,Sadat,A.,Yang,B.,Casas,S.,Urtasun,R.:End-to- end interpretable neural motion planner. In: CVPR (2019) 4 109. Zheng, W., Tang, W., Jiang, L., Fu, C.W.: Se-ssd: Self-ensembling single-stage object detector from point cloud. In: CVPR (2021) 2 110. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017) 3 111. Zhou,Y.,Sun,P.,Zhang,Y.,Anguelov,D.,Gao,J.,Ouyang,T.,Guo,J.,Ngiam, J., Vasudevan, V.: End-to-end multi-view fusion for 3d object detection in lidar point clouds. In: CoRL (2020) 3 112. Zhou,Y.,Tuzel,O.:Voxelnet:End-to-endlearningforpointcloudbased3dobject detection. In: CVPR (2018) 1, 2 113. Zhu,R.,KianiGaloogahi,H.,Wang,C.,Lucey,S.:Rethinkingreprojection:Clos- ing the loop for pose-aware shape reconstruction from a single image. In: ICCV (2017) 3Motion Inspired Unsupervised Perception and Prediction 21 Appendix A Implementation Details of Auto Meta Labeling TheAutoMetaLabelingpipelinehasfourcomponents:objectproposalbyclus- tering,multi-objecttrackingandamodalboxrefinementbasedonshaperegistra- tion. In the object proposal step, we use DBSCAN for both clustering by point locations and by scene flows. Both clustering methods use Euclidean distance as the distance metric. The neighborhood thresholds (cid:15) and (cid:15) are set to be 1.0 p f and 0.1, respectively. The minimum flow magnitude |f| is set to 1m/s, so as min to include meaningful motions without introducing too much background noise. Ourtrackerfollowstheimplementationasin[62].Weusebird’seyeview(BEV) boxes for data association and use Hungarian matching with an IoU threshold of 0.1. In shape registration, we use a constrained ICP [30] which limits the rotation to be only round z-axis. We have compared the effect of contrained and unconstrained ICP in AML ablation study. The search grid for translation initialization is decided by the target box dimensions on the xy-plane, i.e. the length l and the width w of the target bounding box. We enumerate btgt btgt translation initialization T in a 5×5 grid covering the target bounding box j region with a list T of strides as [−l /2,−l /4,0,l /4,l /2] and a list x btgt btgt btgt btgt T of strides [−w /2,−w /4,0,w /4,w /2]. Each computation of ICP y btgt btgt btgt btgt outputs an error (cid:15) , which is defined as the mean of the Euclidean distances j among matched points between the source and the target point sets. B Ablation Study on Unsupervised Flow Estimation In this section we provide additional ablation studies focusing on our unsuper- vised flow estimation method, NSFP++. Static point removal AsmentionedinSec.3.1,weapplystaticpointremoval prior to unsupervised flow estimation. This step is designed to achieve a high precision to avoid removing dynamic points in the early stages of our pipeline. Here, we compute the precision/recall of this step on the WOD [75] validation set.Wedefineground-truthdynamic/staticlabelsbasedontheavailableground- truth bounding boxes [41]. 
B Ablation Study on Unsupervised Flow Estimation

In this section we provide additional ablation studies focusing on our unsupervised flow estimation method, NSFP++.

Static point removal. As mentioned in Sec. 3.1, we apply static point removal prior to unsupervised flow estimation. This step is designed to achieve a high precision, so as to avoid removing dynamic points in the early stages of our pipeline. Here, we compute the precision/recall of this step on the WOD [75] validation set. We define ground-truth dynamic/static labels based on the available ground-truth bounding boxes [41]. Dynamic points are defined as those with a ground-truth flow magnitude larger than |f|_min, and the remaining points belonging to any ground-truth box are assigned to the static class. Our static point removal step has a precision of 97.2% and a recall of 62.2%, validating the high precision of this step in determining the static points.

Local flow estimation. We also conduct an ablation study to validate the effectiveness of the proposed components in the local flow estimation step, i.e., box query with expansion followed by pruning, and the local consistency loss. As illustrated in Table 6, box query with expansion (second row) effectively boosts mIoU from 0.404 to 0.552, but suffers from higher 3D end-point error (EPE3D) and mean angle error (θ) compared to the method without box query (first row). This is because the expanded query region can capture more matching points, but at the cost of including irrelevant points. With the proposed pruning scheme (third row), all metrics are significantly improved compared to the previous two rows. Finally, by adding the local consistency loss (fourth row), we obtain the best performance across the board.

Table 6. Ablation study on different components in the proposed local flow estimation. BQ stands for the proposed box query strategy, which contains two steps, the first being expansion and the second being pruning. Local consistency represents the local consistency loss among flow predictions within each point cluster.

BQ w. Expansion   BQ w. Pruning   Local Consistency   EPE3D ↓   θ (rad) ↓   mIoU ↑
      –                 –                 –            0.020      0.515      0.404
      ✓                 –                 –            0.023      0.560      0.552
      ✓                 ✓                 –            0.018      0.504      0.571
      ✓                 ✓                 ✓            0.017      0.474      0.586

Comparison with the fully supervised model. In this subsection, we compare our unsupervised flow estimation method with the fully supervised scene flow model used in Sec. 4. Table 7 shows the comparison. As expected, the supervised model outperforms our unsupervised NSFP++ method, which does not use any human annotations. However, as shown in Tab. 3, the AML pipeline can robustly use our unsupervised NSFP++ predictions and eventually achieves comparable results to the counterpart using a supervised flow model on downstream tasks (e.g., L1 mAP of 42.1 for unsupervised vs. 49.9 for supervised in the object detection task).

Table 7. Flow comparison with the fully supervised model.

Method                        EPE3D (m) ↓   θ (rad) ↓   mIoU ↑
Fully Supervised Network        0.005         0.062      0.826
Unsupervised NSFP++ (ours)      0.017         0.474      0.586
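For reference, the metrics reported in Tables 6 and 7 can be computed along the following lines. This is a minimal sketch under our own assumptions: the angle error is taken between flow directions via the arccos of the normalized dot product, and mIoU is computed over a binary static/dynamic segmentation thresholded at |f|_min; the exact evaluation code may differ.

```python
import numpy as np


def flow_metrics(pred_flow, gt_flow, min_flow=1.0, eps=1e-8):
    """Rough sketch of per-frame scene flow metrics.

    pred_flow, gt_flow: (N, 3) flow vectors in m/s.
    Returns (EPE3D, mean angle error in rad, mIoU over {static, dynamic}).
    """
    # End-point error: mean L2 distance between predicted and ground-truth flow.
    epe3d = np.linalg.norm(pred_flow - gt_flow, axis=1).mean()

    # Angle error between predicted and ground-truth flow directions.
    dot = np.sum(pred_flow * gt_flow, axis=1)
    norms = np.linalg.norm(pred_flow, axis=1) * np.linalg.norm(gt_flow, axis=1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    theta = np.arccos(cos).mean()

    # mIoU over the binary static/dynamic segmentation induced by flow magnitude.
    pred_dyn = np.linalg.norm(pred_flow, axis=1) >= min_flow
    gt_dyn = np.linalg.norm(gt_flow, axis=1) >= min_flow
    ious = []
    for cls in (True, False):   # dynamic, then static
        inter = np.logical_and(pred_dyn == cls, gt_dyn == cls).sum()
        union = np.logical_or(pred_dyn == cls, gt_dyn == cls).sum()
        if union > 0:
            ious.append(inter / union)
    return epe3d, theta, float(np.mean(ious))
```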
C Ablation Study on Auto Meta Labeling

To examine the design choices in the AML pipeline, we compute detection metrics on the auto labels generated by our full AML pipeline and by several baselines (Table 8). Note that the numbers reported in Table 8 come from evaluating the auto labels themselves, rather than the predictions of trained detectors.

Table 8. Comparison of different variants of components in the AML pipeline. All methods are evaluated on the WOD validation set.

AML Variant                                             3D mAP (L1 / L2)   2D mAP (L1 / L2)
Filtered by flow + Clustering by position                 25.5 / 24.6        32.4 / 31.2
Spatio-temporal clustering                                30.4 / 29.2        36.7 / 35.3
Regis. w/o init.                                          32.2 / 31.0        36.6 / 35.3
Regis. w/ R init. by flow heading                         33.2 / 31.9        37.4 / 36.0
Regis. w/ T init. by grid search                          35.2 / 33.8        39.3 / 37.9
Regis. w/ Unconstrained ICP                               34.3 / 33.0        38.5 / 37.1
Regis. w/ RT init. & constrained ICP [30] (Full AML)      36.9 / 35.5        40.5 / 39.0

Filtered by flow + Clustering by position is a baseline where we generate auto labels using only this clustering method. Compared to our spatio-temporal clustering method described in Algorithm 1, this baseline does not perform clustering on the estimated flows, and as a result it leads to under-segmentation and lower performance. We also carry out experiments on variants of shape registration. Regis. w/o init. is a baseline with no initialization when performing constrained ICP. Adding either rotation initialization by flow heading (Regis. w/ R init. by flow heading) or translation initialization by grid search (Regis. w/ T init. by grid search) improves the quality of the auto labels. Another baseline, Regis. w/ Unconstrained ICP, applies both R and T initializations but uses an unconstrained ICP, such that arbitrary 3D rotations are allowed when aligning the source and target point sets. We find that limiting the rotation to be only around the z-axis generates auto labels of higher quality. Finally, our full AML (Regis. w/ RT init. & constrained ICP) outperforms all other variants; a sketch of this registration step is given below. Compared to the 3D detection results in the main paper (see Tab. 3), we find that the object detector achieves a higher mAP than the auto labels it is trained on. The reason is that auto labels by design pursue high recall while containing some false positives in the background, due to inaccurate flow or noise in the environment. As these false positive labels do not form a consistent data pattern, the object detector learns to focus only on auto labels with common patterns, such as vehicles and VRUs, and assigns high confidence scores to these objects at inference time.
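To illustrate the registration variant used in the full AML (constrained ICP with R initialization from the flow heading and T initialization by the 5×5 grid search of Appendix A), here is a minimal NumPy/SciPy sketch. The point-to-point formulation, the fixed iteration count, and all function names are our assumptions; this is not the implementation of [30].

```python
import numpy as np
from scipy.spatial import cKDTree


def fit_z_rotation_translation(src, tgt):
    """Best-fit transform from src to tgt with rotation restricted to the z-axis."""
    cs, ct = src.mean(0), tgt.mean(0)
    s, t = src - cs, tgt - ct
    # Optimal yaw for 2D (xy) Procrustes alignment of the centered points.
    yaw = np.arctan2(np.sum(s[:, 0] * t[:, 1] - s[:, 1] * t[:, 0]),
                     np.sum(s[:, 0] * t[:, 0] + s[:, 1] * t[:, 1]))
    c, si = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -si, 0.0], [si, c, 0.0], [0.0, 0.0, 1.0]])
    T = ct - R @ cs
    return R, T


def constrained_icp(source, target, init_R, init_T, iters=20):
    """Point-to-point ICP with rotation constrained to the z-axis.

    Returns (R, T, err); err is the mean distance between matched points,
    mirroring the epsilon_j used to select among initializations.
    """
    tree = cKDTree(target)
    R, T = init_R.copy(), init_T.copy()
    for _ in range(iters):
        moved = source @ R.T + T
        _, nn = tree.query(moved)
        dR, dT = fit_z_rotation_translation(moved, target[nn])
        R, T = dR @ R, dR @ T + dT          # compose the incremental update
    err = tree.query(source @ R.T + T)[0].mean()
    return R, T, err


def register_with_rt_init(source, target, flow_heading, l_tgt, w_tgt):
    """Try R init from the flow heading and a 5x5 grid of T inits; keep the best fit."""
    c, s = np.cos(flow_heading), np.sin(flow_heading)
    R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    tx = [-l_tgt / 2, -l_tgt / 4, 0.0, l_tgt / 4, l_tgt / 2]
    ty = [-w_tgt / 2, -w_tgt / 4, 0.0, w_tgt / 4, w_tgt / 2]
    best = None
    for dx in tx:
        for dy in ty:
            R, T, err = constrained_icp(source, target, R0, np.array([dx, dy, 0.0]))
            if best is None or err < best[2]:
                best = (R, T, err)
    return best
```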
D Qualitative Analysis

D.1 Auto Meta Labeling and Unsupervised Object Detection

Fig. 8. Visualization of auto labels and detection predictions compared with the ground truth of moving objects, for four example scenes (a)-(d); each scene shows Ground Truth, Auto Labels, and Predictions. Points are colored by flow magnitudes and directions; dark points are static. (a) The class-agnostic auto labels and unsupervised object detectors capture objects of multiple categories. (b) Although false positive flows occur, AML filters out many of them if they are inconsistent, and the detector learns to ignore these false positive flows. (c) Although the ground truth does not cover categories beyond vehicle, pedestrian, and cyclist, auto labels and our detector can capture open-set moving objects, such as the stroller. (d) A failure case where the detector may not be confident on objects with limited training data, such as cyclists.

Fig. 8 shows four examples from the WOD validation set comparing ground truth, auto labels, and unsupervised object detection results. In our unsupervised setting, both the auto labels and the object detectors localize objects in a class-agnostic manner and are not limited to certain categories. In example (a), we show that the auto labels and object detectors capture both pedestrians and vehicles.

In example (b), we demonstrate that even though there are false positive non-zero flow estimates, AML filters out many of these clusters during tracking and post-processing, where very short tracks are dropped. The resulting detector has also learned to ignore clusters of false positive flows. This example also shows that both auto labels and object detectors can infer the amodal boxes of some objects with only partial views.

Sometimes the unsupervised flow estimation captures true positive motion on points that are beyond the predefined categories in the ground truth. In example (c), a pedestrian is walking with a stroller, while stroller is not a class included in the ground truth labels and therefore no bounding box is annotated around it. NSFP++ has estimated the flow on the stroller, enabling AML and the detectors to localize it. Since the stroller is held by the pedestrian and moves at a similar speed, the clustering by design does not separate them. Clearly, it is safety-critical for autonomous vehicles to understand such moving objects in the open-set environment.

Example (d) shows a failure case where the detector could not confidently detect a cyclist. Although the auto labels have captured it, cyclists are less common than pedestrians and vehicles in the training set, which leads to inferior performance. We encourage future work to tackle the data imbalance issue under the unsupervised setting. Another failure pattern is that bounding boxes in auto labels tend to be larger than the actual object size, because temporal aggregation can include noisy points. More advanced shape registration methods may help reduce this noise, and we leave this for future work.

D.2 Open-set Trajectory Prediction

Fig. 9 and Fig. 10 show behavior prediction qualitative results on the validation set of our newly created Anonymized Dataset. For each example scenario, we show the trajectory predictions of two models: one trained only with a human-labeled category (the first column) and the other trained with the combination of the available human labels and our AML auto labels for all other moving objects (the second column). The red and magenta trajectories represent the ground-truth routes taken by the autonomous vehicle and by an agent of interest, respectively. The blue and yellow trajectories are the possible predictions for the agent of interest and other agents in the scene. Fig. 9 shows three examples where human labels are available for the VRU category. As can be seen in all three examples, without using our unsupervised auto labels, the model tends to erroneously underestimate the speed (e.g., the first row), have difficulty predicting trajectories consistent with the underlying road graph (e.g., the second row), and generate dangerous pedestrian-like trajectories along the pedestrian crosswalk (e.g., the third row). Fig. 10 shows the results when human labels are available only for the vehicle category. Similarly, when the model is trained only on the human labels (the first column), it cannot generalize well to the VRU class, predicting fast speeds and vehicle-like trajectories for VRUs. However, in both scenarios, adding auto labels (the second columns in Fig. 9 and 10) satisfactorily overcomes these errors, showing the effectiveness of our auto labels for training behavior prediction models in the open-set environment.

Fig. 9. Behavior prediction qualitative analysis. Trajectory predictions on three example scenarios for a model trained with human labeled VRUs (left column) vs. a model trained with a combination of human labeled VRUs and our generated auto labels (right column).
Red and magenta dotted trajectories represent the ground-truth routes of the autonomous vehicle and agents, respectively. Blue and yellow trajectories are the predictions for the agent of interest and other agents, respectively.

Fig. 10. Behavior prediction qualitative analysis. Trajectory predictions on three example scenarios for a model trained with human labeled vehicles (left column) vs. a model trained with a combination of human labeled vehicles and our generated auto labels (right column).

Fig. 11. Error distributions. The y-axis is probability density.

E Failure Analysis

In this section, we analyze the factors causing failure cases. At an IoU threshold of 0.4, the precision/recall of our auto meta labels is 0.69/0.50. Part of the failure cases comes from (1) false positive predictions that do not match any ground truth box and (2) false negatives where ground truth boxes are entirely missed. In addition, there are predicted boxes that overlap with ground truth boxes but whose IoUs are below the threshold. To better understand these cases, we break down the 3D bounding box parameters into three groups: localization (box center x, y, z), size (box length l, width w, height h), and orientation (BEV box heading r). We then summarize the distributions of the localization, size, and orientation errors of the generated bounding boxes that overlap with at least one ground truth box (Fig. 11). The errors are computed between each pair of a predicted box and the ground truth box that has the highest IoU with it.

Localization. The localization error is defined as

    ε_localization = sqrt((x_pr − x_gt)^2 + (y_pr − y_gt)^2 + (z_pr − z_gt)^2).   (3)

As shown in Fig. 11, most localization errors are within 1.0 meter.

Size. The size error is defined as

    ε_size = |l_pr − l_gt| + |w_pr − w_gt| + |h_pr − h_gt|.   (4)

Many predictions have relatively high size errors. This is often caused by the inclusion of noisy points in the registration step, or by missing parts of an object when those parts are never visible throughout the object track.

Orientation. The orientation error is defined as

    ε_orientation = r_pr − r_gt.   (5)

The orientation errors are generally small, as the orientation of each object is determined by the direction of the scene flows averaged over all points within the object bounding box. This error distribution verifies the quality of the unsupervised scene flows.
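A small sketch of how the per-box errors of Eqs. (3)-(5) can be computed for a matched prediction/ground-truth pair is given below; the wrap-around handling of the heading difference is our assumption.

```python
import numpy as np


def box_errors(pred, gt):
    """Errors between a predicted box and its best-matched ground truth box.

    Each box is (x, y, z, l, w, h, r), with r the BEV heading in radians.
    Follows Eqs. (3)-(5); the heading wrap to (-pi, pi] is an assumption.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    loc_err = np.linalg.norm(pred[:3] - gt[:3])            # Eq. (3)
    size_err = np.sum(np.abs(pred[3:6] - gt[3:6]))         # Eq. (4)
    d = pred[6] - gt[6]                                    # Eq. (5)
    heading_err = (d + np.pi) % (2 * np.pi) - np.pi        # wrap to (-pi, pi]
    return loc_err, size_err, heading_err
```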
To find the dominant factors leading to wrong auto meta labels, we construct several baselines by modifying the predictions and measuring their label quality. The baselines are as follows (see the sketch after this list):

1. (Oracle) GT localization + GT size + GT orientation: we replace the 7D values (x, y, z, l, w, h, r) of each predicted box with the values of its best matched ground truth box, if any;
2. Predicted localization + GT size + GT orientation: we replace the (l, w, h, r) of each predicted box with the ground truth values. Comparison with the oracle shows the impact of localization errors;
3. GT localization + Predicted size + GT orientation: we replace the (x, y, z, r) of each predicted box with the ground truth values. Comparison with the oracle shows the impact of size errors;
4. GT localization + GT size + Predicted orientation: we replace the (x, y, z, l, w, h) of each predicted box with the ground truth values. Comparison with the oracle shows the impact of orientation errors.

Table 9. Comparison between an oracle with GT box coordinates and baselines that switch the localization/size/orientation coordinates to AML predictions in turn. The performance drops show that the localization and size errors are dominant.

Variant                                                3D mAPH@IoU=0.4
(Oracle) GT localization + GT size + GT orientation        46.1
Predicted localization + GT size + GT orientation          39.7 (−6.4)
GT localization + Predicted size + GT orientation          39.7 (−6.4)
GT localization + GT size + Predicted orientation          44.5 (−1.6)

We report 3D mAPH@IoU=0.4 on the above baselines (Table 9), as mAPH additionally reflects the quality of heading prediction. We find that localization and size errors are the dominant factors; future work may focus on improving the quality of auto labels on these fronts.
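As a usage note, the coordinate-swapping baselines above can be built with a small helper like the following; the function name and the (x, y, z, l, w, h, r) box layout are assumptions for illustration, and predicted boxes without a matched ground truth box would be left unchanged by the caller.

```python
import numpy as np

# Index groups in a 7-DoF box (x, y, z, l, w, h, r).
GROUPS = {"loc": slice(0, 3), "size": slice(3, 6), "rot": slice(6, 7)}


def swap_coordinates(pred_boxes, matched_gt_boxes, keep_predicted):
    """Build one Table-9 baseline by copying GT values into predicted boxes.

    keep_predicted: subset of {"loc", "size", "rot"} that stays predicted;
    every other coordinate group is replaced with the matched GT values.
    """
    out = np.array(matched_gt_boxes, dtype=float, copy=True)
    pred = np.asarray(pred_boxes, dtype=float)
    for name in keep_predicted:
        out[:, GROUPS[name]] = pred[:, GROUPS[name]]
    return out


# Example: the "Predicted localization + GT size + GT orientation" baseline.
# ablated = swap_coordinates(pred_boxes, matched_gt_boxes, keep_predicted={"loc"})
```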