Improving the Intra-class Long-tail in 3D Detection via Rare Example Mining

Chiyu Max Jiang, Mahyar Najibi, Charles R. Qi, Yin Zhou, and Dragomir Anguelov
Waymo LLC., Mountain View CA 94043, USA
{maxjiang, najibi, rqi, yinzhou, dragomir}@waymo.com

Abstract. Continued improvements in deep learning architectures have steadily advanced the overall performance of 3D object detectors to levels on par with humans for certain tasks and datasets, where the overall performance is mostly driven by common examples. However, even the best performing models suffer from the most naive mistakes when it comes to rare examples that do not appear frequently in the training data, such as vehicles with irregular geometries. Most studies in the long-tail literature focus on class-imbalanced classification problems with known imbalanced label counts per class, but they are not directly applicable to the intra-class long-tail examples in problems with large intra-class variations such as 3D object detection, where instances with the same class label can have drastically varied properties such as shapes and sizes. Other works propose to mitigate this problem using active learning based on the criteria of uncertainty, difficulty, or diversity. In this study, we identify a new conceptual dimension - rareness - to mine new data for improving the long-tail performance of models. We show that rareness, as opposed to difficulty, is the key to data-centric improvements for 3D detectors, since rareness is the result of a lack of data support while difficulty is related to the fundamental ambiguity in the problem. We propose a general and effective method to identify the rareness of objects based on density estimation in the feature space using flow models, and propose a principled cost-aware formulation for mining rare object tracks, which improves overall model performance, but more importantly - significantly improves the performance for rare objects (by 30.97%).

Keywords: Intra-class Long Tail, Rare Example, Active Learning

1 Introduction

Long-tail learning is a challenging yet important topic in applied machine learning, particularly for safety-critical applications such as autonomous driving or medical diagnostics. However, even though imbalanced classification problems have been heavily studied in the literature, we have limited tools for defining, identifying, and improving on intra-class rare instances, such as irregularly shaped vehicles or pedestrians in Halloween costumes, since they come from a diverse open set of anything but common objects.

[Fig. 1 bar chart over "All Veh." and "Large Veh. (Rare)"; for Large Veh. (Rare): Fully-Supervised (100% labeled data) 0.6018; Semi-Supervised (13% labeled data, Qi et al., 2021) 0.4359; Semi-Supervised + Rare Ex. Mining (13% labeled data, Ours) 0.5709, a 30.97% improvement.]
Fig. 1: Vehicle 3D object detection Average Precision (AP) on the Waymo Open Dataset with fully-/semi-supervised learning. While standard semi-supervised learning (with a strong auto labeling teacher model [40]) can achieve on-par results with the fully supervised method on the common cases, the performance gap on rare objects (e.g. large vehicles) is significant (60.18 vs. 43.59). Our method is able to close this gap using rare example mining.
[Fig. 2 line chart; y-axis: Recall @ IoU 0.7; x-axis: Rareness Percentile (%); series: Fully Supervised (100%), Semi-Supervised (10%), Ours MD-REM++ (10%+3%); secondary y-axis: Improvement Ratio (%) (Ours vs Semi-Supervised).]
Fig. 2: Correlation between inferred rareness percentile (lower is more rare) and model performance for subsets of ground truth, indicated by recall. In all models (from fully-supervised to semi-supervised), model performance is strongly correlated to the rareness measure obtained from the log probability inferred by the flow model. By mining a mere 3% of remaining data, our model significantly improves upon the semi-supervised detector, with big gains in the rare intra-class long-tail.

Inspired by Leo Tolstoy's famous quote, we observe: "Common objects are all alike; Every rare object is rare in its own way". We refer to the spectrum of such rare instances as the intra-class long-tail, where we do not have the luxury of prespecified class-frequency-based rareness measurements. Objects of the intra-class long-tail can be of particular importance in 3D detection due to their safety relevance. While overall performance for modern 3D detectors can be quite high, we note that even fully supervised models perform significantly worse on rare subsets of the data, such as large vehicles (Fig. 1). The problem is exacerbated by semi-supervised learning, a popular and cost-efficient approach to quickly scale models on larger datasets, where average model performance has been shown to be on par with fully-supervised counterparts using a fraction of the labeled data.

Several challenges make it difficult to achieve targeted improvement on the intra-class long-tail for 3D detection. First, as box regression is an important aspect of object detection, conventional long-tail learning approaches utilizing class frequencies, or active learning approaches utilizing entropy or margin uncertainties that depend on classification output distributions, are not applicable. Second, since labeling cost given a run segment is proportional to the number of labeled instance tracks, not frames, we require a more granular mining approach that gracefully handles missing labels for objects in the scene. Last but not least, unlike long-tail problems for imbalanced classification tasks, it is challenging to define which examples belong to the intra-class long-tail, which leads to difficulty in evaluating and mining additional data to improve the long-tail performance of these models.

[Fig. 3 pipeline diagram: fully-labeled segments train a detector; inference extracts features; a flow model infers rareness; rare tracks are sent for human labeling while other tracks are auto-labeled; the detector is retrained on the resulting hybrid-labeled segments.]
Fig. 3: Overview of the Rare Example Mining (REM) pipeline. Our detector, bootstrap-trained on a smaller pool of fully labeled segments, extracts features for a flow model to infer the log probability of every detected instance, which is a strong indicator of rareness. The rare tracks in the unlabeled segments are sent for human labeling while all remaining tracks are labeled using an offboard auto-labeler. The combined dataset is then used for retraining the detector, resulting in an overall performance boost, particularly on rare examples.

In light of these challenges, we propose a generalizable yet effective way to measure and define rareness as the density of instances in the latent feature space.
We discover that normalizing flow models are highly effective for feature density estimation and robust for anomaly detection, contrary to negative results on anomaly detection using normalizing flows directly on high dimensional image inputs, as reported by prior work [38]. We present a cost-aware formulation for track-level data mining and active learning using the rareness criterion, as 3D object labeling cost is often proportional to the number of unique tracks in each run segment. We do this in conjunction with a powerful offboard 3D auto-labeler [40, 58] for filling in missing data, and show stronger model improvement compared to difficulty, uncertainty, or heuristics based active learning baselines, particularly for objects in the tail distributions.

Furthermore, we investigate rareness as a novel data-mining criterion, in relation to the conventional uncertainty or error-based mining methods. Though models tend to perform poorly on either rare or hard examples, we note a clear distinction between the concept of rare versus hard. In this discussion, "rare" maps to epistemic uncertainty (reducible error), where the model is uncertain due to a lack of data support in the training set, while "hard" maps to aleatoric uncertainty (irreducible error), where the model is uncertain due to the fundamental ambiguity and uncertainty of the given problem, for example, if the target object is heavily occluded. We further illustrate that while conventional uncertainty estimates (such as ensembling methods) will uncover both hard and rare objects, filtering out hard examples will result in a significantly higher concentration of rare examples, which significantly improves active learning performance, underscoring the importance of rare examples in active learning.

In summary, the main contributions of this work are:
– We identify rareness as a novel criterion for data mining and active learning, for improving model performance for problems with large intra-class variations such as 3D detection.
– We propose an effective way of identifying rare objects by estimating latent feature densities using a flow model, and demonstrate a strong correlation between estimated log probabilities, known rare subcategories, and model performance.
– We propose a fine-grained, cost-aware, track-level mining methodology for 3D detection that utilizes a powerful offboard 3D auto-labeler for annotating unlabeled objects in partially labeled frames, resulting in a strong performance boost (30.97%) on intra-class long-tail subcategories compared to conventional semi-supervised baselines.

2 Related Work

Long-tail visual recognition: Long-tail is conventionally defined as an imbalance in a multinomial distribution between various different class labels, either in the image classification context [8, 24, 26, 27, 36, 55, 62, 64], dense segmentation problems [20, 23, 52, 53, 56, 59], or between foreground / background labels in object detection problems [33, 34, 50, 51, 60]. Existing approaches for addressing class-imbalanced problems include resampling (oversampling tail classes or head classes), reweighting (using inverse class frequency, effective number of samples [8]), novel loss function design [1, 34, 50–52, 63], meta learning for head-to-tail knowledge transfer [7, 27, 35, 55], distillation [32, 57] and mixture of experts [54].
However, there is little work targeting improvements for the intra-class long-tail in datasets with inherently large intra-class variations, or for regression problems. Zhu et al. [66] studies the long-tail problem for subcategories, but assumes given subcategory labels. Dong et al. [12] studies imbalance between fine-grained attribute labels in clothing or facial datasets. To the best of our knowledge, our work is among the first to address the intra-class long-tail in 3D object detection.

Active learning: In this work we mainly address pool-based active learning [45], where we assume an existing smaller pool of fully-labeled data along with a larger pool of unlabeled data, from which we actively select samples for human labeling. Existing active learning methods mainly fall under two categories, uncertainty-based and diversity-based methods. Uncertainty-based methods select new labeling targets based on criteria such as ensemble variance [2] or classification output distributions such as entropy, margin or confidence [6, 14, 21, 22, 25, 41] in the case of classification outputs. More similar to our approach are diversity-based approaches, which aim at balancing the distribution of training data while mining from the unlabeled pool [18, 19, 39, 44]. Gudovskiy et al. [18] further targets unbalanced datasets. However, these methods are developed for classification problems and are not directly applicable to the intra-class long-tail for detection tasks. Similar to our approach, Sinha et al. [47] proposes to learn data distributions in the latent space, though they employ a discriminator in a variational setting that does not directly estimate the density of each data sample. Segal et al. [43] investigated fine-grained active learning in the context of self-driving vehicles using region-based selection, with a focus on joint perception and prediction. Similar to our approach, Elezi et al. [13] uses auto-labeling to improve active learning performance for 2D detection tasks.

Flow models: Normalizing flow models are a class of generative models that can approximate probability distributions and efficiently and exactly estimate the densities of high dimensional data [4, 10, 11, 17, 28, 30, 42]. Various studies have reported unsuccessful attempts at using density estimates from normalizing flows for detecting out-of-distribution data by directly learning to map from the high dimensional pixel space of images to the latent space of flow models [5, 38, 61], which can assign higher probability to out-of-distribution data. However, similar to our finding, Kirichenko et al. [29] find that the issue can be easily mitigated by training a flow model on the features extracted by a pretrained model, such as an EfficientNet pretrained on ImageNet [9], rather than directly learning on the input pixel space. This allows the model to better measure density in a semantically relevant space. We are among the first to use densities estimated by normalizing flows for identifying long-tail examples.

3 Methods

In this section, we present a general and effective method for mining rare examples based on density estimations from the data, which we refer to as data-centric rare example mining (REM). To offer further insights into rareness in relation to difficulty, we propose another conceptually simple yet effective method for mining rare examples by simply filtering out hard examples from overall uncertain examples.
In Section 4.2, we show that combining both approaches can further improve long-tail performance. Last but not least, we propose a cost-aware, fine-grained track-level active learning method that aggregates per-track rareness as a selection criterion for requesting human annotation, and utilizes a powerful offboard 3D auto-labeler for populating unmined, unlabeled tracks to maximize the utility of all data when retraining the model.

3.1 Rare Example Mining

Data-centric Rare Example Mining (D-REM) The main intuition behind data-centric REM is that we measure the density of every sample in a learned feature embedding space as an indicator for rareness. The full data-centric REM workflow (see Fig. 3) consists of the following steps. First, we pretrain the detection model on an existing source pool of fully-labeled data that might be underrepresenting long-tail examples. Second, we use the pretrained task model to run inference over the source pool along with a large unlabeled pool of data, and extract per-instance raw feature vectors via Region-of-Interest (ROI) pooling, followed by PCA dimensionality reduction and normalization. We then train a normalizing flow model over the feature vectors to estimate per-instance rareness (negative log probability) for data mining.

[Fig. 4 schematic comparing which kinds of data (easy, common, rare, and hard) are mined by Random Mining, Hard Example Mining, Ensemble Mining, Data-centric Rare Example Mining, Model-centric Rare Example Mining, and M+D Rare Example Mining (Ours).]
Fig. 4: Hard (aleatoric uncertainty) is a fundamentally different dimension compared to rare (epistemic uncertainty). Our REM method directly targets rare subsets of the data. Our Data-centric REM method directly estimates rareness based on inferred probabilities by a normalizing flow model trained on learned feature vectors, while our Model-centric REM method performs hard example filtering on top of generic uncertain objects mined by the ensemble mining approach. We further combine the two approaches (MD-REM) by performing hard example filtering on top of D-REM to increase easy-rare examples.

Object Feature Extraction: As previously mentioned, one major difference between our proposed approach for estimating rare examples, compared with earlier works in the literature that were not successful in using normalizing flows for out-of-distribution detection [5, 38, 61], is that we propose to estimate the probability density of each example in the latent feature space of pretrained models, to leverage the semantic similarity between objects for distinguishing rare instances. As observed by Kirichenko et al. [29], normalizing flows directly trained on high dimensional raw inputs tend to focus more on local pixel correlations rather than semantics, as they do not leverage high-level embeddings.

We extract per-object feature embeddings from the final Birds-Eye-View (BEV) feature map of a 3D object detector via region of interest (ROI) max-pooling [16] by cropping the feature map with the prediction boxes. We mainly apply this to our implementation of the state-of-the-art MVF [40, 65] 3D detector, though the process is generally applicable to the majority of detectors that produce intermediate feature maps [31, 37, 49].
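To make the feature-extraction step concrete, the sketch below max-pools a BEV feature map inside each predicted box to obtain one raw feature vector per object. It is a minimal sketch under stated assumptions, not the MVF implementation: the array shapes, the pre-computed grid-space box extents (boxes_xyxy), and the function name are our own.

```python
import numpy as np

def roi_max_pool(bev_features, boxes_xyxy):
    """Extract one feature vector per predicted box by max-pooling the
    BEV feature map inside each (axis-aligned) box region.

    bev_features: (H, W, C) float array, the detector's final BEV feature map.
    boxes_xyxy:   (n, 4) int array of per-box grid extents (x0, y0, x1, y1),
                  already converted from metric box coordinates (assumption).
    Returns:      (n, C) array of per-object raw feature vectors.
    """
    feats = []
    H, W, _ = bev_features.shape
    for x0, y0, x1, y1 in boxes_xyxy:
        # Clip the crop to the feature-map bounds so it stays non-empty.
        x0, x1 = np.clip([x0, x1], 0, W - 1)
        y0, y1 = np.clip([y0, y1], 0, H - 1)
        crop = bev_features[y0:y1 + 1, x0:x1 + 1, :]
        feats.append(crop.max(axis=(0, 1)))  # max-pool over the spatial crop
    return np.stack(feats, axis=0)
```

The PCA reduction and normalization applied to these raw vectors are described next.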
We further perform principal component analysis (PCA) for dimensionality reduction, for improved computational efficiency, followed by normalization, on the set of raw feature vectors X_roi ∈ R^{n×d} obtained via ROI pooling:

X_pca = (X_roi − mean(X_roi)) W_pca^T    (1)
X_norm = X_pca / std(X_pca)    (2)

where W_pca ∈ R^{k×d} is a weight matrix consisting of the top-k PCA components, and mean(·): R^{n×d} → R^d, std(·): R^{n×d} → R^d are the mean and standard deviation operators along the first dimension. In summary, the training dataset for our flow model consists of normalized feature vectors after PCA transformation, obtained via ROI max-pooling the final feature map of 3D detectors using predicted bounding boxes:

D_x = { X_norm[i], ∀ i ∈ [0, n) }    (3)

Rareness Estimation Using Normalizing Flow: We use continuous normalizing flow models to directly estimate the log probability of each example represented as a feature vector x. We present a quick review of normalizing flows below.

Typical normalizing flow models [28] consist of two main components: a base distribution p(z), and a learned invertible function f_θ(x), also known as a bijector, where θ are the learnable parameters of the bijector, f_θ(x) is the forward method and f_θ^{-1}(x) is the inverse method. The base distribution is generally chosen to be an analytically tractable distribution whose probability density function (PDF) can be easily computed, such as a spherical multivariate Gaussian distribution, where p(z) = N(z; 0, I). A learnable bijector function can take many forms; popular choices include masked scale-and-shift functions such as RealNVP [11, 28] or continuous bijectors utilizing learned ordinary differential equation (ODE) dynamics [4, 17].

The use of normalizing flows as generative models has been heavily studied in the literature [28], where new in-distribution samples can be generated by passing a randomly sampled latent vector through the forward bijector:

x = f_θ(z), where z ∼ p(z)    (4)

However, in this work, we are more interested in using normalizing flows for estimating the exact probabilities of each data example. The latent variable corresponding to a data example can be inferred via z = f_θ^{-1}(x). Under the change-of-variables formula, the log probability of a data sample can be estimated as:

log p_θ(x) = log p(f_θ^{-1}(x)) + log |det(df_θ^{-1}(x)/dx)|    (5)
           = log p(z) + log |det(dz/dx)|    (6)

The first term, log p(z), can be efficiently computed from the PDF of the base distribution, whereas the computation of the log determinant of the Jacobian, log |det(df_θ^{-1}(x)/dx)|, varies based on the bijector type.

The training process can be described as a simple maximization of the expected log probability of the data (or equivalently, minimization of the expected negative log likelihood of the parameters) over the training data D_x, and can be learned via batch stochastic gradient descent:

argmin_θ  E_{x∼D_x}[ −log p_θ(x) ]    (7)

In our experiments, we choose the base distribution p(z) to be a spherical multivariate Gaussian N(z; 0, I), and we use the FFJORD [17] bijector. For the final rare example scoring function for the i-th object, r_i, we have:

r_i = −log p_θ(x_i)    (8)
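The paper's implementation uses an FFJORD bijector; as a self-contained sketch of the same density-estimation idea (Eqs. (4)-(8)), the snippet below substitutes a small RealNVP-style coupling flow, which also gives exact log probabilities. All class and function names, the hyperparameters, and the full-batch training loop are our own assumptions, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling layer for an even-dimensional input: one half is
    affinely transformed conditioned on the other half, so the log-det of the
    Jacobian is simply the sum of the predicted (bounded) log-scales."""
    def __init__(self, dim, hidden=64, flip=False):
        super().__init__()
        assert dim % 2 == 0
        self.flip, self.half = flip, dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * self.half))

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        if self.flip:
            xa, xb = xb, xa
        s, t = self.net(xa).chunk(2, dim=1)
        s = torch.tanh(s)                        # keep log-scales bounded for stability
        za, zb = xa, xb * torch.exp(s) + t
        if self.flip:
            za, zb = zb, za
        return torch.cat([za, zb], dim=1), s.sum(dim=1)  # (z, log|det dz/dx|)

class FeatureFlow(nn.Module):
    """Stack of coupling layers mapping feature vectors x to latents z, with
    log p(x) = log N(z; 0, I) + sum of per-layer log-det terms (cf. Eqs. 5-6)."""
    def __init__(self, dim=10, n_layers=4):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=bool(i % 2)) for i in range(n_layers)])

    def log_prob(self, x):
        z, log_det = x, torch.zeros(x.shape[0], device=x.device)
        for layer in self.layers:
            z, ld = layer(z)
            log_det = log_det + ld
        log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * self.dim * math.log(2 * math.pi)
        return log_pz + log_det

def rareness_scores(feats_norm, epochs=100, lr=1e-4):
    """Fit the flow by maximum likelihood (Eq. 7) on PCA-normalized features
    and return r_i = -log p(x_i) (Eq. 8) for every object.
    Full-batch training is used here purely for brevity."""
    x = torch.as_tensor(feats_norm, dtype=torch.float32)
    flow = FeatureFlow(dim=x.shape[1])
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = -flow.log_prob(x).mean()          # expected negative log likelihood
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (-flow.log_prob(x)).numpy()
```

Objects with the largest r_i (lowest log probability under the flow) become the rare-example candidates fed into the mining steps described below.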
Model-centric Rare Example Mining (M-REM) We present an alternative model-centric formulation of REM that is conceptually simple and effective, yet illustrative of the dichotomy between rare and hard examples. Different from the data-centric REM perspective, model-centric REM leverages the divergence among an ensemble of detectors as a measurement of total uncertainty.

Different from methods that directly use ensemble divergence as a mining criterion for active learning [2], our key insight is that while ensemble divergence is a good measurement of the overall uncertainty for an instance, it could be either due to the problem being fundamentally difficult and ambiguous (i.e., hard), or due to the problem being uncommon and lacking training support for the model (i.e., rare). In the case of 3D object detection, leading reasons for an object being physically hard to detect are occlusion and a low number of LiDAR points on the object. Conceptually, adding more hard examples such as far-away and heavily occluded objects with very few visible LiDAR points would not be helpful, as these cases are fundamentally ambiguous and cannot be improved upon simply with increased data support.

A simple approach for obtaining rare examples, hence, is to filter out hard examples from the set of overall uncertain examples. In practice, a simple combination of two filters, (i) a low number of LiDAR points per detection example, or (ii) a large distance between the detection example and the LiDAR source, proves to be surprisingly effective for improving model performance through data mining and active learning.

We implement model-centric REM as follows. Let M = {M_1, M_2, ..., M_N} be a set of N independently trained detectors with identical architectures and training configurations, but different model initializations. Denote the detection score for the i-th object by the j-th detector as s_i^j; s_i^j is set to zero if there is a missed detection. The detection variance for the i-th object by the model ensemble M is defined as:

v_i = (1/N) Σ_{j=1}^{N} ( s_i^j − (1/N) Σ_{k=1}^{N} s_i^k )^2    (9)

For hard example filtering, denote the number of LiDAR points within the i-th object as p_i, and the distance of the i-th object from the LiDAR source as d_i. A simple hard example filter function can be defined as:

h_i = 1 if (p_i > p̃) and (d_i < d̃), else 0

where p̃ and d̃ are point-count and distance thresholds; objects with h_i = 0 are treated as hard and excluded from the mining candidates.
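A compact sketch of the model-centric computation is shown below: the ensemble variance of Eq. (9), the hard-example filter, and the Fig. 4 recipe of keeping only high-variance objects that pass the filter. The threshold values and helper names are placeholders of our own, not the paper's settings for vehicles.

```python
import numpy as np

def ensemble_variance(scores):
    """Eq. (9): per-object detection-score variance across the ensemble.
    scores: (N, n) array, scores[j, i] = s_i^j (0 if detector j missed object i)."""
    return ((scores - scores.mean(axis=0, keepdims=True)) ** 2).mean(axis=0)

def not_hard_mask(num_points, distance, p_thresh=10, d_thresh=50.0):
    """Hard-example filter h_i: keep objects with enough LiDAR points and a
    moderate range (threshold values here are placeholder assumptions)."""
    return (num_points > p_thresh) & (distance < d_thresh)

def m_rem_select(scores, num_points, distance, budget):
    """Rank by ensemble variance, then drop hard examples (Fig. 4): the
    remaining high-variance objects are the rare-mining candidates."""
    v = ensemble_variance(scores)
    keep = not_hard_mask(num_points, distance)
    order = np.argsort(-v)            # most uncertain objects first
    order = order[keep[order]]        # filter out hard examples
    return order[:budget]             # indices of mined objects
```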
3.2 Cost-aware Track-level Mining for Active Learning

Given per-object rareness scores, we aggregate them into per-track scores and greedily select the rarest tracks for human labeling; auto-labeled tracks that intersect a selected track (> 0 IoU) are removed. This procedure is iteratively performed until the number of tracks in S_h reaches the budget of K. All auto-labeled tracks in S_a that intersect with S_h are removed, and the two sets of tracks are merged into a hybrid, fully-labeled dataset S = S_a ∪ S_h.

4 Experiments

We use the Waymo Open Dataset [48] as the main dataset for our investigations due to its unparalleled diversity based on geographical coverage, compared with other camera + LiDAR datasets available [3, 15], as well as its large industry-level scale. The Waymo Open Dataset consists of 1150 scenes, each spanning 20 seconds, recorded across a range of weather conditions in multiple cities.

In the experiments below, we seek to answer three questions: (1) Does model performance correlate with our rareness measurement for the intra-class long-tail (Section 4.1)? (2) Can our proposed rare example mining methodology successfully find and retrieve more rare examples (Section 4.1)? (3) Does adding rare data to our existing training data in an active learning setting improve overall model performance, in particular for the long-tail (Section 4.2)?

4.1 Rare Example Mining Analysis

In this section, we investigate the ability of the normalizing flow model in our data-centric REM method to detect intra-class long-tail examples.

[Algorithm: TrackMining. Input: model detections D_b = {b_1, b_2, ..., b_n}, sorted by descending rareness score; auto-labeled tracks S_a = {T'_1, T'_2, ..., T'_m}; labeling budget K. Output: fully labeled tracks S = S_h ∪ (S_a − (S_a ∩ S_h)). The procedure initializes S_h ← Ø and greedily adds the rarest tracks while |S_h| < K.]

[Table 1: composition of mined tracks; large (> 7 m) objects serve as a reference for measuring the ratio of rare tracks mined by different approaches. REM is able to mine a higher proportion of rare instances.]

[Table 2: impact of the mining budget on model performance. With a small increase in mining budget (6%), we (MD-REM++) can match the performance of a fully-supervised model on both ends of the spectrum.]

The flow model assigns similar probability distributions to vehicles in the training and validation sets, while assigning lower probabilities to vehicles from an out-of-distribution set (the Kirkland set from the Waymo Open Dataset, collected from a different geographical region with mostly rainy weather conditions). Moreover, the model assigns significantly lower probabilities to an out-of-distribution category (Pedestrian) if we perform ROI pooling using the pedestrian ground-truth boxes to extract pedestrian feature vectors from the vehicle model and query the log probability distribution against the flow model.

4.2 Rare Example Mining for Active Learning

To demonstrate the applicability of the REM approach for targeted improvement of the model's performance in the intra-class long-tail, we utilize track-level REM for active learning, as detailed in Section 3.2.

Experiment Setup: Our experiment setup is as follows. Following Qi et al. [40], we perform a random split on the main training set of the Waymo Open Dataset [48] into a 10% fully-labeled source pool and a remaining 90% as a larger "unlabeled" pool, from which we withhold ground-truth labels. We first train our main model on the fully-labeled source pool, and perform track-level data mining on the remaining unlabeled pool using various methods, including our proposed data-centric and model-centric REM approaches. For all active learning baseline experiments, we mine for a fixed budget of 1268 tracks, amounting to ∼3% of all remaining tracks. Our main model consists of a single-frame MVF detector [40, 65]. While in all baseline experiments we utilize the main model for self-labeling unlabeled tracks in the unlabeled pool, we demonstrate that using a strong offboard 3D auto-labeler [40] trained on the same existing data can further boost the overall performance of our REM approach.
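To make the track-level selection used in these experiments concrete, the sketch below follows the greedy procedure of Section 3.2: sort candidate tracks by rareness, take the rarest until the budget K is reached, and drop auto-labeled tracks that overlap a selected track. The data structures, the IoU helper, and the per-track aggregation of rareness (e.g., taking the maximum per-detection score) are assumptions rather than the authors' implementation.

```python
def mine_rare_tracks(tracks, rareness, budget, track_iou):
    """Greedy track-level mining sketch (Section 3.2).

    tracks:    dict track_id -> track geometry, for every candidate track.
    rareness:  dict track_id -> aggregated rareness score (aggregation over
               per-detection scores, e.g. max, is an assumption).
    budget:    K, number of tracks to send for human labeling.
    track_iou: callable(track_a, track_b) -> overlap between two tracks.
    Returns (S_h, S_a_remaining): ids sent for human labeling, and the
    auto-labeled ids kept, so the final set is S = S_h ∪ S_a_remaining.
    """
    ranked = sorted(tracks, key=lambda tid: rareness[tid], reverse=True)
    human_ids, auto_ids = [], set(tracks)
    for tid in ranked:                     # rarest tracks first
        if len(human_ids) >= budget:
            break
        human_ids.append(tid)
        # Drop auto-labeled tracks that intersect the selected track (> 0 IoU),
        # including the selected track itself.
        auto_ids = {a for a in auto_ids
                    if a != tid and track_iou(tracks[a], tracks[tid]) == 0.0}
    return human_ids, list(auto_ids)
```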
(a) Reference experiments w/o active learning.
Experiment            Human Labels   All    Regular  Large
Partial-supervised    10%; 0%        0.845  0.853    0.378
Semi-supervised (SL)  10%; 0%        0.854  0.864    0.350
Semi-supervised (AL)  10%; 0%        0.902  0.910    0.419
Fully-supervised      100%; 0%       0.895  0.900    0.602

(b) Oracle active learning experiments.
Experiment            Human Labels   All    Regular  Large
Oracle Hard [46]      10%; 3%        0.865  0.875    0.341
Oracle Size           10%; 3%        0.869  0.875    0.583

(c) Main active learning experiments.
Experiment            Human Labels   All    Regular  Large
Partial-supervised    10%; 0%        0.845  0.853    0.378
Random                10%; 3%        0.873  0.881    0.355
Predict Size          10%; 3%        0.865  0.871    0.498
Ensemble [2]          10%; 3%        0.869  0.879    0.353
Ours (M-REM)          10%; 3%        0.886  0.893    0.478
Ours (D-REM)          10%; 3%        0.882  0.888    0.483
Ours (D-REM++)        10%; 3%        0.906  0.913    0.533
Ours (MD-REM++)       10%; 3%        0.904  0.909    0.571

Table 3: Active learning experiment results. Our method significantly improves model performance across the spectrum, particularly on rare subsets. We denote the human label ratio as (%s; %t) to indicate that the model is trained with %s of fully-labeled run segments, along with %t of the remaining tracks that are mined and labeled.

Composition of Mined Tracks: We first analyze the composition of the mined tracks, in all cases 1268 tracks obtained using various mining approaches (see Table 1). We derive three main findings from the composition analysis: (1) Data-centric REM is able to effectively retrieve known rare subsets, upsampling large objects by as much as 1214%. (2) Comparing model-centric REM to the ensemble mining method, a simple hard example filtering operator leads to drastically upsampled rare instances, signifying the dichotomy of rare and hard; by using a hard example filter we can significantly increase the ratio of rare examples among mined tracks. (3) Combining model- and data-centric REM (by further performing hard example filtering on instances mined by data-centric REM) further boosts the ratio of large vehicles.

Active Learning Experiment: We present our active learning experiments in Table 3. Results are on vehicles from the Waymo Open Dataset [48], reported as AP at IoU 0.5. We compute subset metrics on all vehicles ("All"), regular vehicles ("Regular") of size between 3 and 7 m, and large vehicles ("Large") of size > 7 m. The "Regular" subset is a proxy for common vehicles, while the "Large" subset is a proxy for rare ones.

In Table 3a, we present performances of the single-frame MVF model trained on different compositions of the data. We denote the semi-supervised method using self-labeled segments as "SL" and auto-labeled segments as "AL". The main observation is that although auto-labeling can significantly improve overall model performance, in particular for common (regular-sized) vehicles, the resulting model performance is significantly weaker on rare subsets, motivating our REM approach.

For the active learning experiments, we first compare two oracle-based approaches (Table 3b) that utilize 100% ground-truth knowledge for the mining process. "Oracle Hard" is an error-driven mining method inspired by [46], which ranks tracks by s = IoU(GT, Pred) * Probability Score(Pred) to mine tracks on which the base model either made a wrong prediction or an inconfident prediction. "Oracle Size" explicitly mines 3% of ground-truth tracks whose box size is > 7 m. The main observation is that error-based mining favors difficult examples, which do not help improve model performance. Though size-based mining can effectively improve large vehicle performance, it solely improves large vehicles and does not help on regular vehicles.

We then compare across a suite of active learning baselines and our proposed REM methods (Table 3c). "Random" mines tracks via randomized selection, "Predict Size" mines tracks associated with the largest predicted boxes, and "Ensemble" mines the tracks with the highest ensemble variance (Eq. (9)). For our proposed REM methods, we prefix model-centric REM approaches with "M-", data-centric approaches with "D-", and a hybrid approach leveraging hard-example filtering on top of the data-centric approach with "MD-". To further illustrate the importance of a strong offboard auto-labeler, we add the auto-labeler to our method, denoting those experiments with "++".

The active learning experiments show that: (1) Both data-centric and model-centric approaches significantly help to improve performance on the rare subset, and a combination of the two can further boost long-tail performance. (2) While heuristics-based mining methods ("Predict Size") can achieve targeted improvement for large vehicles, they likely fail to capture other degrees of rareness, resulting in lower overall performance.

5 Ablation studies

We further study the impact of increasing the mining budget on our REM approach (Table 2). With a small increase of the mining budget (6%), we can match the performance of a fully-supervised model for both common and rare subsets.
6 Discussions and Future Work

In this work, we illustrate the limitations of learned detectors with respect to rare examples in problems with large intra-class variations, such as 3D detection. We propose an active learning approach based on data-centric and model-centric rare example mining which is effective at discovering rare objects in unlabeled data. Our active learning approach, combined with a state-of-the-art semi-supervised method, can achieve full parity with a fully-supervised model on both common and rare examples, utilizing as little as 16% of human labels.

A limitation of this study is the scale of the existing datasets for active learning, where data mining beyond the scale of available datasets is limited. Results on a larger dataset will be more informative. Our work shares the same risks and opportunities for society as other works in 3D detection. Future work includes extending the REM approach beyond 3D detection, including other topics in self-driving such as trajectory prediction and planning.

Acknowledgements We thank Marshall Tappen, Zhao Chen, Tim Yang, Abhishek Sinh and Luna Yue Huang for helpful discussions, Mingxing Tan for proofreading and constructive feedback, and anonymous reviewers for in-depth discussions and feedback.

Bibliography

[1] Abdelkarim, S., Achlioptas, P., Huang, J., Li, B., Church, K., Elhoseiny, M.: Long-tail visual relationship recognition with a visiolinguistic hubless loss (2020)
[2] Beluch, W.H., Genewein, T., Nürnberger, A., Köhler, J.M.: The power of ensembles for active learning in image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9368–9377 (2018)
[3] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
[4] Chen, R.T., Rubanova, Y., Bettencourt, J., Duvenaud, D.: Neural ordinary differential equations. arXiv preprint arXiv:1806.07366 (2018)
[5] Choi, H., Jang, E., Alemi, A.A.: WAIC, but why? Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392 (2018)
[6] Choi, J., Elezi, I., Lee, H.J., Farabet, C., Alvarez, J.M.: Active learning for deep object detection via probabilistic modeling. arXiv preprint arXiv:2103.16130 (2021)
[7] Chu, P., Bian, X., Liu, S., Ling, H.: Feature space augmentation for long-tailed data. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16, pp. 694–710, Springer (2020)
[8] Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)
[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE (2009)
[10] Dinh, L., Krueger, D., Bengio, Y.: NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014)
[11] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 (2016)
[12] Dong, Q., Gong, S., Zhu, X.: Class rectification hard mining for imbalanced deep learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp.
1851–1860 (2017) [13] Elezi, I., Yu, Z., Anandkumar, A., Leal-Taixe, L., Alvarez, J.M.: Not all labels are equal: Rationalizing the labeling costs for training object detec- tion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14492–14501 (2022) [14] Gal, Y., Islam, R., Ghahramani, Z.: Deep bayesian active learning with image data. In: International Conference on Machine Learning, pp. 1183– 1192, PMLR (2017)Rare Example Mining for 3D Detection 17 [15] Geiger,A.,Lenz,P.,Stiller,C.,Urtasun,R.:Visionmeetsrobotics:Thekitti dataset.TheInternationalJournalofRoboticsResearch32(11),1231–1237 (2013) [16] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587 (2014) [17] Grathwohl, W., Chen, R.T., Bettencourt, J., Sutskever, I., Duvenaud, D.: Ffjord: Free-form continuous dynamics for scalable reversible generative models. In: ICLR (2018) [18] Gudovskiy,D.,Hodgkinson,A.,Yamaguchi,T.,Tsukizawa,S.:Deepactive learningforbiaseddatasetsviafisherkernelself-supervision.In:Proceedings oftheIEEE/CVFConferenceonComputerVisionandPatternRecognition, pp. 9041–9049 (2020) [19] Guo, Y.: Active instance sampling via matrix partition. In: NIPS, pp. 802– 810 (2010) [20] Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5356–5364 (2019) [21] Harakeh,A.,Smart,M.,Waslander,S.L.:Bayesod:Abayesianapproachfor uncertaintyestimationindeepobjectdetectors.In:2020IEEEInternational Conference on Robotics and Automation (ICRA), pp. 87–93, IEEE (2020) [22] Holub, A., Perona, P., Burl, M.C.: Entropy-based active learning for ob- jectrecognition.In:2008IEEEComputerSocietyConferenceonComputer Vision and Pattern Recognition Workshops, pp. 1–8, IEEE (2008) [23] Hsieh, T.I., Robb, E., Chen, H.T., Huang, J.B.: Droploss for long-tail in- stance segmentation. arXiv preprint arXiv:2104.06402 (2021) [24] Jamal, M.A., Brown, M., Yang, M.H., Wang, L., Gong, B.: Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7610–7619 (2020) [25] Joshi, A.J., Porikli, F., Papanikolopoulos, N.: Multi-class active learning forimageclassification.In:2009IEEEConferenceonComputerVisionand Pattern Recognition, pp. 2372–2379, IEEE (2009) [26] Kang,B.,Xie,S.,Rohrbach,M.,Yan,Z.,Gordo,A.,Feng,J.,Kalantidis,Y.: Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217 (2019) [27] Kim, J., Jeong, J., Shin, J.: M2m: Imbalanced classification via major-to- minor translation. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pp. 13896–13905 (2020) [28] Kingma,D.P.,Dhariwal,P.:Glow:Generativeflowwithinvertible1x1con- volutions. arXiv preprint arXiv:1807.03039 (2018) [29] Kirichenko, P., Izmailov, P., Wilson, A.G.: Why normalizing flows fail to detect out-of-distribution data. In: NIPS (2020) [30] Kobyzev, I., Prince, S., Brubaker, M.: Normalizing flows: An introduction andreviewofcurrentmethods.IEEETransactionsonPatternAnalysisand Machine Intelligence (2020)18 C. Jiang et al. 
[31] Lang,A.H.,Vora,S.,Caesar,H.,Zhou,L.,Yang,J.,Beijbom,O.:Pointpil- lars:Fastencodersforobjectdetectionfrompointclouds.In:Proceedingsof the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705 (2019) [32] Li,T.,Wang,L.,Wu,G.:Selfsupervisiontodistillationforlong-tailedvisual recognition.In:ProceedingsoftheIEEE/CVFInternationalConferenceon Computer Vision, pp. 630–639 (2021) [33] Li,Y.,Wang,T.,Kang,B.,Tang,S.,Wang,C.,Li,J.,Feng,J.:Overcoming classifier imbalance for long-tail object detection with balanced group soft- max. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10991–11000 (2020) [34] Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988 (2017) [35] Liu, B., Li, H., Kang, H., Hua, G., Vasconcelos, N.: Gistnet: a geomet- ric structure transfer network for long-tailed recognition. arXiv preprint arXiv:2105.00131 (2021) [36] Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., Yu, S.X.: Large-scale long- tailedrecognitioninanopenworld.In:ProceedingsoftheIEEE/CVFCon- ferenceonComputerVisionandPatternRecognition,pp.2537–2546(2019) [37] Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: Lasernet:Anefficientprobabilistic3dobjectdetectorforautonomousdriv- ing.In:ProceedingsoftheIEEE/CVFConferenceonComputerVisionand Pattern Recognition, pp. 12677–12686 (2019) [38] Nalisnick, E., Matsukawa, A., Teh, Y.W., Gorur, D., Lakshminarayanan, B.: Do deep generative models know what they don’t know? In: Inter- national Conference on Learning Representations (2018), URL https: //openreview.net/forum?id=H1xwNhCcYm [39] Nguyen, H.T., Smeulders, A.: Active learning using pre-clustering. In: Pro- ceedings of the twenty-first international conference on Machine learning, p. 79 (2004) [40] Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard3dobjectdetectionfrompointcloudsequences.In:Proceedingsof the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6134–6144 (2021) [41] Qi,G.J.,Hua,X.S.,Rui,Y.,Tang,J.,Zhang,H.J.:Two-dimensionalactive learning for image classification. In: 2008 IEEE conference on computer vision and pattern recognition, pp. 1–8, IEEE (2008) [42] Rezende,D.,Mohamed,S.:Variationalinferencewithnormalizingflows.In: Internationalconferenceonmachinelearning,pp.1530–1538,PMLR(2015) [43] Segal, S., Kumar, N., Casas, S., Zeng, W., Ren, M., Wang, J., Urtasun, R.: Just label what you need: Fine-grained active selection for perception and predictionthroughpartiallylabeledscenes.arXivpreprintarXiv:2104.03956 (2021) [44] Sener, O., Savarese, S.: Active learning for convolutional neural networks: A core-set approach (2017)Rare Example Mining for 3D Detection 19 [45] Settles, B.: Active learning literature survey (2009) [46] Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object de- tectors with online hard example mining. In: Proceedings of the IEEE con- ference on computer vision and pattern recognition, pp. 761–769 (2016) [47] Sinha, S., Ebrahimi, S., Darrell, T.: Variational adversarial active learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5972–5981 (2019) [48] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in percep- tion for autonomous driving: Waymo open dataset. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454 (2020) [49] Sun, P., Wang, W., Chai, Y., Elsayed, G., Bewley, A., Zhang, X., Smin- chisescu, C., Anguelov, D.: Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5725–5734 (2021) [50] Tan, J., Lu, X., Zhang, G., Yin, C., Li, Q.: Equalization loss v2: A new gradient balance approach for long-tailed object detection. arXiv preprint arXiv:2012.08548 (2020) [51] Tan,J.,Wang,C.,Li,B.,Li,Q.,Ouyang,W.,Yin,C.,Yan,J.:Equalization loss for long-tailed object recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11662–11671 (2020) [52] Wang,J.,Zhang,W.,Zang,Y.,Cao,Y.,Pang,J.,Gong,T.,Chen,K.,Liu, Z., Loy, C.C., Lin, D.: Seesaw loss for long-tailed instance segmentation. arXiv preprint arXiv:2008.10032 (2020) [53] Wang, T., Li, Y., Kang, B., Li, J., Liew, J.H., Tang, S., Hoi, S., Feng, J.: Classificationcalibrationforlong-tailinstancesegmentation.arXivpreprint arXiv:1910.13081 (2019) [54] Wang, X., Lian, L., Miao, Z., Liu, Z., Yu, S.X.: Long-tailed recognition by routing diverse distribution-aware experts. In: International Conference on Learning Representations (2021), URL https://openreview.net/forum? id=D9I3drBz4UC [55] Wang, Y.X., Ramanan, D., Hebert, M.: Learning to model the tail. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 7032–7042 (2017) [56] Wu, J., Song, L., Wang, T., Zhang, Q., Yuan, J.: Forest r-cnn: Large- vocabularylong-tailedobjectdetectionandinstancesegmentation.In:Pro- ceedings of the 28th ACM International Conference on Multimedia, pp. 1570–1578 (2020) [57] Xiang, L., Ding, G., Han, J.: Learning from multiple experts: Self-paced knowledgedistillationforlong-tailedclassification.In:EuropeanConference on Computer Vision, pp. 247–263, Springer (2020) [58] Yang,B.,Bai,M.,Liang,M.,Zeng,W.,Urtasun,R.:Auto4d:Learningtola- bel4dobjectsfromsequentialpointclouds.arXivpreprintarXiv:2101.06586 (2021)20 C. Jiang et al. [59] Zang, Y., Huang, C., Loy, C.C.: Fasa: Feature augmentation and sam- pling adaptation for long-tailed instance segmentation. arXiv preprint arXiv:2102.12867 (2021) [60] Zhang, C., Pan, T.Y., Li, Y., Hu, H., Xuan, D., Changpinyo, S., Gong, B., Chao, W.L.: A simple and effective use of object-centric images for long- tailed object detection. arXiv e-prints pp. arXiv–2102 (2021) [61] Zhang, L., Goldstein, M., Ranganath, R.: Understanding failures in out-of- distribution detection with deep generative models. In: International Con- ference on Machine Learning, pp. 12427–12436, PMLR (2021) [62] Zhao, Y., Chen, W., Tan, X., Huang, K., Xu, J., Wang, C., Zhu, J.: Improving long-tailed classification from instance level. arXiv preprint arXiv:2104.06094 (2021) [63] Zheng,Y.,Pal,D.K.,Savvides,M.:Ringloss:Convexfeaturenormalization for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5089–5097 (2018) [64] Zhong, Z., Cui, J., Liu, S., Jia, J.: Improving calibration for long-tailed recognition. 
arXiv preprint arXiv:2104.00466 (2021)
[65] Zhou, Y., Sun, P., Zhang, Y., Anguelov, D., Gao, J., Ouyang, T., Guo, J., Ngiam, J., Vasudevan, V.: End-to-end multi-view fusion for 3D object detection in lidar point clouds. In: Conference on Robot Learning, pp. 923–932, PMLR (2020)
[66] Zhu, X., Anguelov, D., Ramanan, D.: Capturing long-tail distributions of object subcategories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 915–922 (2014)

Improving the Intra-class Long-tail in 3D Detection via Rare Example Mining
Supplemental Materials

1 Experiment Details

We provide additional details for our experiments below.

1.1 Training the Normalizing Flow Model

We train a continuous normalizing flow model (FFJORD [1]) for estimating the rareness of detected objects. As detailed in the main text, we train the flow model on the normalized feature vectors after PCA transformation, obtained via ROI max-pooling the final feature map of the MVF detector using predicted or ground-truth bounding boxes.

The feature vectors are reduced to a dimension of 10 after the PCA transformation. The flow model consists of a stack of 4 consecutive FFJORD bijectors. Each FFJORD bijector consists of 4 hidden layers, each hidden layer consists of 64 units, and uses tanh() activation. The model is trained using an Adam optimizer, with an initial learning rate of 1e-4 and learning rate decay every 2400 steps, with a decay rate of 0.98. We train the flow model for a total of 100 epochs. For the base distribution of the flow model, we use a spherical Gaussian distribution with unit variance. See Fig. 1 for a visualization of the inferred feature densities using a flow model trained with the procedure above.
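For quick reference, the training settings above can be collected into a single configuration object. This is only a restatement of the stated hyperparameters; the key names are our own and this is not the authors' training code.

```python
# Flow-model training configuration as described in Section 1.1 (key names
# are our own; values are taken directly from the text above).
FLOW_TRAINING_CONFIG = {
    "feature_dim_after_pca": 10,
    "num_ffjord_bijectors": 4,
    "hidden_layers_per_bijector": 4,
    "units_per_hidden_layer": 64,
    "activation": "tanh",
    "optimizer": "adam",
    "initial_learning_rate": 1e-4,
    "lr_decay_every_steps": 2400,
    "lr_decay_rate": 0.98,
    "num_epochs": 100,
    "base_distribution": "spherical Gaussian, unit variance",
}
```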
1.2 3D Auto-labeler Implementation

We adopt the auto-labeling pipeline as outlined in [2]. We use a strong teacher model to serve as the auto-labeler. For the active learning experiments, we train a 5-frame MVF model [5] on the same 10% fully-labeled segments as the single-frame MVF student model. We then use this 5-frame MVF model to extract boxes, create object tracks using [4], and further refine the detections using the refiner from [2].

2 Additional Visualizations of Rare Examples

(a) PCA projection of the training embeddings. (b) PCA projection of generated embeddings. (c) Estimated log probability of embeddings.
Fig. 1: Visualization of (a) the embeddings used for training the normalizing flow model, (b) the generated embeddings from the flow model, obtained by sampling the learned distribution, which match the training distribution very well, signifying a good learned estimation of feature densities, and (c) the estimated log probability of the embeddings from a learned normalizing flow model. The model is able to assign higher log probability to denser features (common examples) and lower log probability to sparser features (rare examples).

[Fig. 2 image grid; color bar: log probability from -25 (rare) to 0 (common).]
Fig. 2: Additional visualizations of rare examples from the Waymo Open Dataset, as inferred by the trained flow model. Warmer colors indicate more rare instances in the scene. The most rare instances in the scene include large vehicles (construction vehicles, trucks), vehicles of irregular geometries such as a flatbed trailer, as well as potentially mislabeled objects such as cones.

3 Additional Experimental Evaluation

3.1 Pedestrian Rare Example Mining

We perform an additional experimental evaluation of rare example mining performance on pedestrians to offer more insights into using rare example mining for other object categories. Similar to the main experiments in the paper, we perform this experiment on the Waymo Open Dataset [3]. We use the same MVF [5] backbone as the main pedestrian detector and utilize [2] for auto-labeling the unlabeled instances. We mainly compare three sets of experiments. We compare two active learning experiments that train on 10% of the fully-labeled segments, plus an additional 10% of mined pedestrian tracks from the rest of the training segments, where the tracks are either mined using our MD-REM++ method or by randomly selecting from the remaining tracks. For our MD-REM++ method, we use a hard example filtering function where the number-of-points threshold is p̃ = 10 and d = 50 (m). We report Average Precision (AP) metrics at an IoU threshold of 0.5. Unlike the vehicle experiments, where we have an intuitive estimate of rareness based on vehicle size, for the pedestrian experiment we instead evaluate model performance on the intra-class long-tail by evaluating the AP on the top-5% rarest ground truth objects (based on inferred log probability). We present our results in Table 1.

Experiment         Human Labels (%)   All     Rare (Top 5%)
Fully Supervised   (100%; 0%)         0.757   0.111
Random             (10%; 10%)         0.705   0.038
Ours (MD-REM++)    (10%; 10%)         0.729   0.060

Table 1: Active learning experiments for pedestrians on the Waymo Open Dataset.

We draw two main findings from the pedestrian experiment that are consistent with the vehicle experiments:
1. Our rare example mining method is able to significantly outperform baselines.
2. Our flow-based method is able to find challenging instances for the model, as the model performs much worse on rare subsets compared to the general distribution.

3.2 Additional Evaluation for Vehicle Rare Example Mining

We show additional evaluation for the experiments presented in Table 3 (main paper). We present the results below.

Experiment         Human Labels (%)   All     Regular  Large  Rare (Top 5%)
Fully Supervised   (100%; 0%)         0.730   0.732    0.458  0.178
Ours (MD-REM++)    (10%; 3%)          0.732   0.735    0.423  0.126
Ours (MD-REM++)    (10%; 6%)          0.729   0.732    0.415  0.126
Ours (MD-REM++)    (10%; 9%)          0.730   0.732    0.443  0.152

Table 2: Additional evaluations for Table 3 in the main paper. Here we report AP @ IoU 0.7 for these experiments.

Similar to the pedestrian experiment metrics, we further introduce a Top-5% rare subset metric. By mining more data, i.e., increasing the mining budget using our REM approach, we are able to further close the gap to the fully labeled model at a much lower total labeling cost.

Bibliography

[1] Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. In ICLR, 2018.
[2] Charles R Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3d object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6134–6144, 2021. [3] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Pat- naik,PaulTsui,JamesGuo,YinZhou,YuningChai,BenjaminCaine,etal. Scal- abilityinperceptionforautonomousdriving:Waymoopendataset. InProceedings oftheIEEE/CVFConferenceonComputerVisionandPatternRecognition,pages 2446–2454, 2020. [4] XinshuoWeng,JianrenWang,DavidHeld,andKrisKitani. 3dmulti-objecttrack- ing: A baseline and new evaluation metrics. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10359–10366. IEEE, 2020. [5] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, JamesGuo,JiquanNgiam,andVijayVasudevan. End-to-endmulti-viewfusionfor 3d object detection in lidar point clouds. In Conference on Robot Learning, pages 923–932. PMLR, 2020.