Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

Mahyar Najibi*, Jingwei Ji*, Yin Zhou†, Charles R. Qi, Xinchen Yan, Scott Ettinger, Dragomir Anguelov
Waymo LLC
*Equal contribution †Corresponding author
arXiv:2309.14491v1 [cs.CV] 25 Sep 2023

Abstract

Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.

[Figure 1 panels, labeled by text query: “bulldozer”, “stop sign”, “USPS truck”.]

Figure 1. An illustration of three interesting urban scene examples of open-vocabulary perception. Left: our method can faithfully detect objects based on user-provided text queries during inference, without the need for 3D human supervision. Red points are points matched with the text queries. Right: camera images for readers’ reference. Note that the inference process solely relies on LiDAR points and does not require camera images.

1. Introduction

In autonomous driving, most existing 3D detection models [63, 23, 43] have been developed with the prior assumption that all possible categories of interest should be known and annotated during training. While significant progress has been made in this supervised closed-set setting, these methods still struggle to fully address the safety concerns that arise in high-stakes applications. Specifically, in the dynamic real-world environment, it is unacceptable for autonomous vehicles to fail to handle a category that is not present in the training data. To address this safety concern, a recent development by Najibi et al. [36] proposed an unsupervised auto labeling pipeline that uses motion cues from point cloud sequences to localize 3D objects. However, by design, this method does not localize static objects which constitute a significant portion of traffic participants. Moreover, it only models the problem in a class-agnostic way and fails to provide semantic labels for scene understanding. This is suboptimal as semantic information is essential for downstream tasks such as motion planning, where category-specific safety protocols are deliberately added to navigate through various traffic participants.

Recently, models trained with large-scale image-text datasets have demonstrated robust flexibility and generalization capabilities for open-vocabulary image-based classification [39, 20, 34], detection [21, 12, 25, 60] and semantic segmentation [24, 11] tasks. Yet, open-vocabulary recognition in the 3D domain [9, 16, 41] is in its early stages. In the context of autonomous driving it is even more underexplored. In this work, we fill this gap by leveraging a pre-trained vision-language model to realize open-vocabulary 3D perception in the wild.

We propose a novel paradigm of Unsupervised 3D Perception with 2D Vision-Language distillation (UP-VL). Specifically, by incorporating a pre-trained vision-language model, UP-VL can generate auto labels with substantially higher quality for objects in arbitrary motion states, compared to the latest work by Najibi et al. [36].

With our auto labels, we propose to co-train a 3D object detector with a knowledge distillation task, which can achieve two goals simultaneously, i.e. improving detection quality and transferring semantic features from 2D image pixels to 3D LiDAR points. The perception model therefore is capable of detecting all traffic participants and thanks to the distilled open-vocabulary features, we can flexibly query the detector’s output embedding with text prompts, for preserving specific types of objects at inference time (see Figure 1 for some examples).

We summarize the contributions of UP-VL as follows:

• UP-VL achieves state-of-the-art performance on unsupervised 3D perception (detection and tracking) of moving objects for autonomous driving.

• UP-VL introduces semantic-aware unsupervised detection for objects in any motion state, a first in the field of autonomous driving. This breakthrough eliminates the information bottleneck that has plagued previous work [36], where class-agnostic auto labels were used, covering only moving objects with a speed above a predetermined threshold.

• UP-VL enables 3D open-vocabulary detection of novel objects in the wild, with queries specified by users at inference time, therefore removing the need to re-collect data or re-train models.

2. Related works

Vision-language training. Contrastive vision language training on billions of image-text training pairs resulted in impressive improvements in the tasks of open-set and zero-shot image classification and language related applications [39, 20, 58]. More recently, open-set object localization in 2D images has been shown to benefit from such abundant image-text data as well. Specifically, [21, 12, 25, 60, 61, 33] used image-text training to improve the open-set capability of 2D object detectors and [24, 11] explored the use of large-scale scene-level vision-language data for the task of open-set 2D semantic segmentation. Recent research [31, 22, 51, 13, 18] has begun to explore the application of 2D vision-language pre-training in 3D perception tasks. However, these studies focused on static indoor scenario where the scene is small-scale and the RGB-D data is captured in high-resolution. Here we design a multi-modal pipeline that leverages vision-language pre-training for unsupervised open-set 3D perception in complex, sparse, and occlusion-rich environments for autonomous driving.

Unsupervised 3D object detection. Unsupervised 3D object detection from LiDAR data is largely under-explored [7, 54, 50, 28, 36]. Dewan et al. [7] proposed a model-free method to detect and track the visible part of objects, by using the motion cues from LiDAR sequences. However, this approach is incapable of generating amodal bounding boxes which is essential for autonomous driving. Cen et al. [3] relied on a supervised detector to produce proposals of unknown categories. However, this approach requires full supervision to train the base detector and has limited generalization capability to only semantically similar categories. Wong et al. [54] identified unknown instances via supervised segmentation and clustering, which by design cannot generate amodal boxes from partial observations. Most recently, Najibi et al. [36] developed an unsupervised auto meta labeling pipeline to generate pseudo labels for moving objects, which can be used to train real-time 3D detection models. This approach fails to provide semantics to detection boxes and ignores static objects, which limits its practical utility. Compared to all previous efforts, we realize open-vocabulary unsupervised 3D detection for both static and moving objects, by leveraging vision-language pre-training, and benchmark our system on the realistic and challenging scenario of autonomous driving. While utilizing 2D vision-language models that may have been pre-trained with human annotations, we avoid the need for any additional 3D labels within our paradigm, thereby creating a pragmatically unsupervised setting.

LiDAR 3D object detection. Most previous works focused on developing performant model architectures in the fully supervised setting, without considering the generalization capability to long-tail cases and unknown object types that are prevalent in the dynamic real world. These methods can be categorized into point based [43, 38, 56, 44, 35, 26], voxelization based [8, 52, 46, 37, 55, 45, 63, 23, 53, 57, 59, 5, 30], perspective projection based [32, 2, 10], and feature fusion [49, 6, 62, 14, 42]. Recent research also explore transferring knowledge from image for 3D point cloud understanding [40, 29, 19, 4]. Our method is compatible with any 3D detector, extending it to handle the open-set settings.

3. Method

We present UP-VL, a new approach for unsupervised open-vocabulary 3D detection and tracking of traffic participants. UP-VL advances the previous state-of-the-art [36] which was limited to class-agnostic detection of moving-only objects in two main directions: 1) It enables class-aware open-set 3D detection by incorporating open-vocabulary text queries at inference time, and 2) It is able

[Figure 2 diagram with Training and Inference panels. Labels: Point Cloud Sequence, Camera Images, BG Categories for Exclusion (Road, Building, ...), Unsupervised Multi-modal Auto Labeling, Open-Vocab. 3D Object Detector, Pointwise Features, Boxes / Tracklets, Open-Vocabulary Queries (Truck, Sedan, Cyclist, ...), ⊗, Semantic Label Assignment.]

Figure 2. Overview of the proposed UP-VL framework. During training (left), our method taps into multi-modal inputs (LiDAR, camera, text) and produces high-quality auto supervisions, via Unsupervised Multi-modal Auto Labeling, including 3D point-level features, 3D object-level bounding boxes and tracklets. Our auto labels are then used to supervise a class-agnostic open-vocabulary 3D detector. Besides, our 3D detector distills the features extracted from a pre-trained 2D vision-language model. At inference time (right), our trained 3D detector produces class-agnostic boxes and per-point features in the embedding space of the pre-trained vision-language model.
We then use the text encoder to map queries to the embedding space and compute the per-point similarity scores between the predicted feature and the text embeddings (⊗ refers to cosine similarity). These per-point scores are then aggregated to assign semantic labels to boxes.

to detect objects in all motion states as opposed to moving-only objects in the previous study. To achieve these goals, we deploy a multi-modal approach and combine intrinsic motion cues [36] available from the LiDAR sequences with the semantics captured by a vision-language model [11] trained on generic image-text pairs from the Internet. An overview of our approach is shown in Figure 2. As illustrated on the left, our training pipeline involves two main stages. First, our auto labeling method uses these motion and semantic cues to automatically label the raw sensor data, yielding class-agnostic 3D bounding boxes and tracklets as well as point-wise semantic features. Then, in the second stage, we use these auto labels to train open-vocabulary 3D perception models. The right side of the figure illustrates our inference pipeline where given raw LiDAR point clouds, our detector is able to perform open-vocabulary 3D detection given a set of text queries.

3.1. Background

The key challenges in unsupervised 3D perception are twofold: 1) generating high-quality 3D amodal bounding boxes and consistent tracklets for all open-set traffic participants, and 2) inferring per-object semantics. Najibi et al. [36] developed an auto labeling technique to address the first challenge partially. Their approach focuses on moving objects only. Specifically, their method takes LiDAR sequences as input, and removes ground points. It then breaks down the scene into individual connected components (i.e. point clusters). Next, it calculates local flow between pairs of point clusters from two adjacent frames and retains only clusters with speed above a predefined threshold. It then tracks each cluster across frames and aggregates points to obtain a more comprehensive view of the object, which enables the derivation of a faithful 3D amodal bounding box. Finally, the resulting 3D amodal boxes and tracklets can serve as auto labels for training 3D perception models.

While the previous work [36] has shown promising results, it suffers from significant limitations: 1) it can only deal with moving objects; and 2) it is unable to output semantics. These limitations hinder its practical utility for safety-critical applications such as autonomous driving.

3.2. Unsupervised Multi-modal Auto Labeling

In contrast to the traditional way of training a detection model by presenting box geometries and closed-set semantics, our unsupervised multi-modal auto labeling approach produces box geometries and point-wise semantic feature embeddings, where the former teaches the detector to localize all traffic participants and the latter informs the model to preserve certain types of objects based on the inference-time text queries.

Figure 3 shows an overview of the auto labeling pipeline and Algorithm 1 presents its details. Specifically, our system leverages multiple modalities as input, namely camera images, LiDAR point sequences, and natural language. It also employs a pre-trained vision-language model [11] to extract feature embeddings from images and texts, which naturally complements the 3D depth information and motion cues with rich semantics, compared to [36]. We begin by detailing the feature extraction process. We then describe how we utilize the extracted vision-language information in combination with the inherent motion cues from LiDAR sequences to generate auto labels in an unsupervised manner.

[Figure 3 diagram inputs: BG Categories for Exclusion (Road, Building, ...), Camera Images, Point Cloud Sequence.]

Algorithm 1 Unsupervised multi-modal auto labeling.
Input: A sequence of images across $T$ frames for each of the $K$ cameras $\{I_t^k\}$; a sequence of LiDAR point locations $\{P_t\}$.
Requires: Cosine similarity threshold for background categories $\epsilon^{bg}$; minimum scene flow magnitude $\epsilon^{sf}$; maximum ratio of background points within a box $r^{bg}$; a set of a priori background categories $C^{bg}$; a pre-trained open-vocabulary model with image encoder $E^{img}$ and text encoder $E^{txt}$.
Output: Amodal 3D bounding boxes $\{B_t\}$ and their track IDs $\{T_t\}$; point-wise open-vocabulary features $\{F_t^{vl}\}$.
Function:
 1: for $t = 1$ to $T$ do
 2:     $\{V_t^k\} \leftarrow E^{img}(\{I_t^k\})$    ▷ 2D VL features
 3:     $F_t^{vl} \leftarrow \mathrm{Unprojection}(\{V_t^k\}, P_t)$    ▷ 3D VL features
 4:     if $t \neq T$ then
 5:         $F_t^{sf} \leftarrow \mathrm{NSFP{+}{+}}(P_t, P_{t+1})$    ▷ Scene flow
 6:     else
 7:         $F_t^{sf} \leftarrow -\mathrm{NSFP{+}{+}}(P_t, P_{t-1})$
 8:     for $i = 1$ to $N_t$ do
 9:         $(M_t^{sf})_i \leftarrow \mathbb{1}\left(\|(F_t^{sf})_i\| \geq \epsilon^{sf}\right)$
10:         $(M_t^{bg})_i \leftarrow \mathbb{1}\left(\max_{c \in C^{bg}} \frac{(F_t^{vl})_i \cdot E^{txt}(c)}{\|(F_t^{vl})_i\| \, \|E^{txt}(c)\|} \geq \epsilon^{bg}\right)$
11:     $\tilde{P}_t, \tilde{F}_t^{sf} \leftarrow P_t[M_t^{sf}], F_t^{sf}[M_t^{sf}]$
12:     $B_t^{vis} \leftarrow \mathrm{InitialBoxProposal}(\tilde{P}_t, \tilde{F}_t^{sf}, M_t^{bg}; r^{bg})$
13: $\{T_t\} \leftarrow \mathrm{Tracking}(\{B_t^{vis}\})$
14: $\{B_t\} \leftarrow \mathrm{AmodalBoxGeneration}(\{B_t^{vis}\}, \{T_t\}, \{P_t\})$
15: return $\{B_t\}, \{T_t\}, \{F_t^{vl}\}$

[Figure 3 diagram stages: Pointwise VL Features, Scene Flow, Box Proposal Generation, Tracking, Amodal Box Generation.]

Figure 3. Overview of our unsupervised multi-modal auto labeling approach. This pipeline first extracts vision-language and motion features from multiple modalities, then proposes, tracks and completes bounding boxes of objects. The resulting pointwise VL features, 3D bounding boxes and tracklets will serve as automatic supervisions to train the perception model.

Feature Extraction

As the first step to our approach, we start by extracting open-vocabulary features from all available cameras and then transfer these 2D features to 3D LiDAR points using known sensor calibrations. Specifically, at each time $t$, we have a set of images $\{I_t^k \in \mathbb{R}^{H_k \times W_k \times 3}\}$ captured by $K$ cameras, where $H_k$ and $W_k$ are image dimensions of the camera $k$. We also have a collection of point clouds, $\{P_t \in \mathbb{R}^{N_t \times 3}\}$, captured over time using LiDAR sensors. Here, $N_t$ denotes the number of points at time $t$. We use a pre-trained open-vocabulary 2D image encoder $E^{img}$ to extract the pixel-wise visual features for each image, denoted as $\{V_t^k \in \mathbb{R}^{H_k \times W_k \times D}\}$, where $D$ represents the feature dimension. Next, we build the mapping between 3D LiDAR points and their corresponding image pixels using the camera and LiDAR calibration information. Once this mapping is created, we can associate each 3D point with its corresponding image feature vector. As a result, we obtain vision-language features for all the 3D points as $F_t^{vl} \in \mathbb{R}^{N_t \times D}$, where $N_t$ is the number of points at time $t$.

Additionally, we leverage motion signals as another crucial representation that can substantially aid in deducing the concept of objectness for moving instances in the open-set environment. Specifically, we employ the NSFP++ algorithm [36] to compute the scene flow $F_t^{sf} \in \mathbb{R}^{N_t \times 3}$ of points at each time $t$, which is a set of flow vectors corresponding to each point in $P_t$.

Bounding Box Proposal Generation

At each time step, we generate initial bounding box proposals $\{B_t^{vis} \in \mathbb{R}^{M_t \times 7}\}$ by clustering the points, where $M_t$ is the number of boxes at time $t$, and each box is parameterized as (center x, center y, center z, length, width, height, heading). Note that vis indicates that each box only covers the visible portion of an object. To cluster each point, we leverage a set of features which includes the point locations $P_t$, scene flow $F_t^{sf}$, and the vision-language features $F_t^{vl}$.

We design our pipeline to flexibly generate auto labels for objects in desired motion states. Given scene flow $F_t^{sf}$, we introduce a velocity threshold $\epsilon^{sf}$ to select points whose speed is greater than or equal to the threshold (e.g., 1.0 m/s). To capture objects in all motion states, we set $\epsilon^{sf} = 0$.

One major challenge of auto labeling objects in all motion states is how to automatically distinguish traffic participants (e.g., vehicles, pedestrians, etc.) from irrelevant scene elements (e.g., street, fence, etc.). We propose to leverage an a priori list of background object categories to exclude irrelevant scene elements from labeling. Specifically, we use the text encoder, $E^{txt}$, from the pre-trained 2D vision-language model [11], to encode each background category name $c$ into its feature embedding $E^{txt}(c) \in \mathbb{R}^{D}$. We further define a per-point binary background mask, denoted as $M_t^{bg} \in \{0, 1\}^{N_t}$, that takes on a value of 1 if a point is assigned to one of the a priori background categories, or 0 otherwise. See Algorithm 1 for the definition of $M_t^{bg}$, where $(\cdot)_i$ denotes the $i$-th row of a matrix and $\mathbb{1}(\cdot)$ represents the indicator function. We use this background mask to mark scene elements which are not of interest.

We then proceed to cluster the point cloud into neighboring regions using a spatio-temporal clustering algorithm, modified from [36], followed by calculating the tightest bounding box around each cluster. In addition to clustering points by their locations and motions, we also use $M_t^{bg}$ to eliminate bounding boxes which are likely to be background. To be precise, we discard any bounding box in which the ratio of background points exceeds a threshold of $r^{bg}$ (which is set to 99%). This process results in the initial set of bounding box proposals $\{B_t^{vis}\}$. Note that in this step, the box dimensions are determined based on the visible portion of each object, which can be significantly underestimated compared to the human labeled amodal box, due to ubiquitous occlusions and sparsity.

Amodal Auto Labeling

In autonomous driving, perception downstream tasks desire amodal boxes that encompass both the visible and occluded parts of the objects. To transform our visible-only proposals to amodal auto labels, we follow [36] by adopting a tracking-by-detection paradigm with Kalman filter state updates to link all proposals over time. We then perform shape registration for each object track of $\{T_t\}$ using ICP [1]. Within each track, we leverage the intuition that different viewpoints contain complementary information and temporal aggregation of the registered points from proposals would allow us to obtain a complete shape of the object. Hence, we fit a new box to the aggregated points to yield the amodal box. Finally, we undo the registration from aggregated points to individual frames and replace the original visible box proposal at each time step with the amodal box, which produces auto labeled 3D boxes and the tracklet.

In practice, background point filtering, point cloud registration and temporal aggregation may contain noise, leading to spurious boxes, e.g., tiny and sizable boxes and overlapping boxes. We apply non-maximum suppression (NMS) to clean the auto label boxes. This final set of unsupervised amodal auto labels $\{B_t\}$, their track IDs $\{T_t\}$, together with the extracted vision-language embeddings $\{F_t^{vl}\}$, are then used to train the open-vocabulary 3D object detection model as described in Sec. 3.3.

3.3. Open-vocabulary 3D Object Detection

In this subsection, we describe how the unsupervised auto labels can be used to train a 3D object detector capable of localizing open-set objects and assigning open-vocabulary semantics to them, all without using any 3D human annotations during training.

3.3.1 Model Architecture

Our design, as depicted in Figure 2, is based on decoupling object detection into class-agnostic object localization and semantic label assignment. For class-agnostic bounding box prediction, we add a branch to a 3D point cloud encoder backbone to generate 3D bounding box center, dimensions, and heading. This branch accompanies a binary classification branch which outputs a foreground/background class-agnostic per-box objectness score. To supervise these two branches, we treat our unsupervised auto labels (see Sec. 3.2) as ground-truth and add bounding box regression and classification losses to our learning objective. We would like to highlight that our pipeline is independent of a specific 3D point-cloud encoder [23, 62, 48] and the detection paradigm (either anchor-based or anchor-free detection). Here, we adopt an anchor-based PointPillar backbone [23] with Huber loss for box residual regression and Focal Loss [27] for objectness classification to have a fair comparison with prior works [36]. Besides predicting 3D bounding boxes, we also perform text query-based open-vocabulary semantic assignment by distilling knowledge from pre-trained 2D vision-language models using an extra branch which is described in the next subsections.

3.3.2 Vision-Language Knowledge Distillation

Besides class-agnostic bounding box generation, our 3D detector pipeline also distills the semantic knowledge from the per-point vision-language features provided by our auto labeling pipeline (i.e. $\{F_t^{vl}\}$, introduced in the vision-language feature extraction in Sec. 3.2). In our method, we directly distill these features, which as will be discussed in the next subsection, unlocks text query-based open-vocabulary category assignment at inference time. More precisely, as shown in the left side of Figure 2, we add a new linear branch to the model to predict per-point $D$-dimensional features (here $D$ is the dimensionality of the vision-language embedding space). As the input to this branch, we scatter the computed voxelized features in our backbone back into the points and concatenate them with the available per-point input features (i.e. 3D point locations and LiDAR intensity and elongation features). We then train the network to predict the feature vector $f_p^{vl} \in F_t^{vl}$ for any point $p$ visible in the camera images and add the following loss to the training objective:

$$L_{distill}(p) = \mathrm{CosineDist}(y_p, f_p^{vl}) \qquad (1)$$

where $y_p$ is the distillation prediction by the model for point $p$. This together with the bounding box regression and the objectness classification losses (based on our auto labels as discussed in Sec. 3.3.1) form our final training objective.

3.3.3 Open-Vocabulary Inference

Table 1.
Comparison of the methods on class-agnostic unsupervised 3D detection of moving objects. Top: Auto label boxes. Bottom: Detection boxes.

Method | Box Type | 3D AP@0.4 L1 | 3D AP@0.4 L2 | 3D AP@0.5 L1 | 3D AP@0.5 L2
MI-UP [36] | Auto labels | 36.9 | 35.5 | 27.4 | 26.4
UP-VL (ours) | Auto labels | 39.9 | 38.4 | 34.2 | 32.0
MI-UP [36] | Detections | 42.1 | 40.4 | 29.6 | 28.4
UP-VL (ours) | Detections | 49.9 | 48.1 | 38.4 | 36.9

So far, we have introduced how to train a detector to simultaneously localize all objects in a class-agnostic manner and predict vision-language features for all LiDAR points. Here, we discuss how we assign open-vocabulary semantics to the predicted boxes during inference. This process is depicted in the right side of Figure 2. The pre-trained 2D vision language model [11] contains an image encoder and a text encoder, which are jointly trained to map text and image data to a shared embedding space. As described in Sec. 3.3.2, we add a feature distillation branch that maps 3D input point clouds to the 2D image encoder embedding space, which essentially bridges the gap between point clouds and semantic text queries. As a result, at the inference time we can encode arbitrary open-vocabulary categories presented as text queries and compute their similarities with the observed 3D points. This can be achieved by computing the cosine similarity between the text query embeddings and the vision-language features predicted by our model for each 3D point. Finally, we assign open-vocabulary categories to boxes based on majority voting. Specifically, we associate with each point the category with the highest computed cosine similarity, and then assign to each box the most common category of its enclosing points.

We would like to emphasize that our approach does not need to process images at inference time, since we have distilled image encoder features to the point cloud. Therefore, the only added computation is a simple linear layer for predicting per-point vision-language embeddings, which is negligible compared to the rest of the detector architecture.

4. Experiments

Our UP-VL approach advances the previous state-of-the-art in unsupervised 3D perception for autonomous driving [36] in two main important directions: 1) enabling open-vocabulary category semantics and 2) detecting objects in all motion states (as opposed to moving-only objects in the previous study). In this section, we perform extensive evaluations with respect to each of these innovations. Note that unsupervised open-set 3D detection is still at an early stage in the research community with few published works. Therefore to fairly compare with the state-of-the-art [36], we perform our detection experiments first following the same setting as [36] (i.e. detecting class-agnostic moving objects) and then showcasing our new capabilities (i.e. detecting objects in any motion states with semantics).

Sec. 4.2 studies the performance of our system in the class-agnostic setting. This allows us to compare our approach with the existing state-of-the-art method on detecting moving-only objects, showing large improvements. Sec. 4.3 moves the needle beyond the capability of the previous class-agnostic state-of-the-art methods and reports results under the open-vocabulary class-aware setting for detecting moving-only objects (Sec. 4.3.1) and the most challenging setting of open-vocabulary detection of objects in all motion states (Sec. 4.3.2). Finally, Sec. 4.4 reports the open-set tracking quality of our auto labels and Sec. 4.5 presents qualitative results. See supplementary materials for more ablation studies and error analyses.

4.1. Experimental Setting

We evaluate our framework using the challenging Waymo Open Dataset (WOD) [47], which provides a large collection of run segments captured by multi-modal sensors in diverse environment conditions. To define moving-only objects in Sec. 4.2, we follow [36] and apply a threshold of 1.0 m/s (i.e. $\epsilon^{sf} = 1.0$). We set the cosine similarity threshold for background categories at $\epsilon^{bg} = 0.02$ to achieve best performance in practice. The background categories $C^{bg}$ we exclude from auto labeling are “vegetation”, “road”, “street”, “sky”, “tree”, “building”, “house”, “skyscraper”, “wall”, “fence”, and “sidewalk”. The WOD [47] has three common object categories, i.e. vehicle, pedestrian, and cyclist. In the class-aware 3D detection experiments (Sec. 4.3), we follow [36] and combine pedestrian and cyclist into one VRU (vulnerable road users) category, which contains a similar number of labels as the vehicle category. As in [36], we also train and evaluate the detectors on a 100m × 40m rectangular region around the ego vehicle. We use the popular PointPillars detector [23] for all our detection experiments and set an intersection over union, IoU=0.4, for evaluations unless noted otherwise. Please refer to Sec. 1 of supplementary materials for a more detailed description of all experimental settings.

4.2. Class-agnostic Unsupervised 3D Detection of Moving Objects

For fair comparison, we follow the same setting as [36] and tailor our approach to class-agnostic moving-only 3D detection. Specifically, we perform auto labeling as introduced in Sec. 3.2 with speed threshold $\epsilon^{sf} = 1.0$ m/s and train a class-agnostic detector with feature distillation as described in Sec. 3.3.1. However, we disable text queries at inference time. Note that [36] only considered detection of moving objects. We leave the study of more challenging settings to Sec. 4.3.

Table 1 shows our result and compares it with MI-UP [36]. The top part of the table compares the auto labeling quality. The bottom part compares the detector performance between our UP-VL approach and MI-UP. We use the exact same detection backbone and hyper-parameters to ensure a fair comparison. When evaluating at IoU=0.4 as suggested by [36], UP-VL significantly outperforms MI-UP, both in terms of the auto label as well as the detection performance. To better demonstrate our improved auto label quality, we also evaluate with a higher localization criterion at IoU=0.5, where our improvement becomes even more pronounced. We should also point out that in both methods, the final detection quality is superior to the auto label quality. We hypothesize that this is due to the network being able to learn a better objectness scoring function for ranking as well as its ability to denoise the auto labels given the inductive bias of the model [17].

4.3. Class-aware Unsupervised Open-vocabulary 3D Detection

In this section, we evaluate the capability of our UP-VL pipeline in class-aware open-vocabulary 3D detection of objects in different motion states. Please note that we don’t use any 3D human annotations during training and only use the available human labeled categories for evaluation. Moreover, it should be noted that the previous state-of-the-art [36], as a class-agnostic approach, falls short in this new setting, making comparisons not possible. In all experiments in this section, we assign labels to boxes by querying category names as text at inference time in an open-vocabulary fashion as described in Sec. 3.3.3 (see Sec. 1 of supplementary for a detailed list of text queries used).

4.3.1 Moving-only Objects

Table 2 reports the class-aware open-vocabulary 3D detection results on the moving-only objects. Since [36] is no longer applicable in this setting, we construct two baselines for comparison: i.e. geometric clustering [36] which additionally uses our extracted scene flow features ($F_t^{sf}$) and its variant which leverages both the scene flow features and the vision-language features ($F_t^{vl}$). 3D point-wise semantics for the baselines are extracted directly by projecting the 2D image features of the pre-trained vision-language model. We report per-category AP as well as the mAP of these baselines in the top two rows of Table 2. The bottom of the table presents the results for our unsupervised auto labels and our final UP-VL detections. Our auto labels and UP-VL detector both outperform baselines constructed from prior approaches. As discussed in Sec. 3.3.2, unlike the baselines that require applying the image encoder to all camera images at inference time, our detector directly predicts image features extracted by our auto labeling pipeline for 3D point clouds and consequently is more efficient.

Table 2. Comparison of methods on unsupervised class-aware moving object detection. (* since semantics are not available, we report class-agnostic AP for the first row, given that vehicle and VRU contain a similar number of samples.)

Method | Motion | Vision-Language | Box type | 3D AP Veh | 3D AP VRU | mAP
Clustering [36] | ✓ | | visible | N/A | N/A | 32.4*
Clustering [36] + OpenSeg [11] | ✓ | ✓ | visible | 47.8 | 21.5 | 34.7
Our auto labels | ✓ | ✓ | amodal | 57.5 | 29.8 | 43.7
Our UP-VL detector w. feature distillation | ✓ | ✓ | amodal | 76.9 | 28.6 | 52.8

4.3.2 Objects in All Motion States

Finally in this section, we report results on the most challenging setting: unsupervised class-aware open-vocabulary 3D detection for all objects with arbitrary motion states. Like Sec. 4.3.1, since [36] falls short in this setting, we construct three clustering baselines using different combinations of our features. More specifically, the first row only uses point locations ($P_t$), the second row uses both point locations and our vision-language features ($F_t^{vl}$), and the third row leverages all the features including our scene flow features ($F_t^{sf}$). As an ablation on the effectiveness of the introduced feature distillation in UP-VL, we also add a baseline called “Our detector w/o feature distillation”, where we remove the distillation head and its loss from our detector, and like the baselines in the first three rows, we directly project the vision-language features from camera images to the point cloud for semantic label assignment. As summarized in Table 3, our auto labels significantly outperform other baselines listed in the first three rows. Moreover, comparing the last two rows, we observe that the proposed vision-language feature distillation leads to significant performance improvement across all metrics. For example, our approach with feature distillation outperforms the counterpart without distillation by more than 8 points in mAP.

Table 3. Comparison of methods on unsupervised class-aware detection of objects in all motion states. (* since semantics are not available, we report class-agnostic AP for the first row, given that vehicle and VRU contain a similar number of samples.)

Method | Motion | Vision-Language | Box type | 3D AP Veh | 3D AP VRU | mAP
Clustering [36] | | | visible | N/A | N/A | 11.6*
Clustering [36] + OpenSeg [11] | | ✓ | visible | 15.8 | 9.9 | 12.9
Clustering [36] + OpenSeg [11] | ✓ | ✓ | visible | 16.1 | 10.0 | 13.1
Our auto labels | ✓ | ✓ | amodal | 30.2 | 14.7 | 22.4
Our detector w/o feature distillation | ✓ | ✓ | amodal | 40.0 | 15.2 | 27.6
Our UP-VL detector w. feature distillation | ✓ | ✓ | amodal | 52.0 | 19.7 | 35.8

4.4. Tracking

UP-VL exhibits high performance not only in detection, but also in tracking, a critical task in autonomous driving. We employ the motion-based tracker from [36], and conduct experiments in the tracking-by-detection manner. We evaluate tracking performance for moving objects and compare our UP-VL detector trained with feature distillation as outlined in Table 2 against two baselines: MI-UP detector from Table 1 and another open-set baseline from Table 2. To measure the effectiveness of our model, we employ the widely used MOTA and MOTP metrics, both in the class-agnostic and class-aware open-vocabulary settings. Our experimental results (Table 4) demonstrate that UP-VL outperforms both baselines by a significant margin.

Table 4. Comparison of tracking methods for moving objects with evaluations in class-agnostic (Cls. ag.) and class-aware settings. “MI-UP-C” refers to class-agnostic MI-UP clustering approach, which is unable to be evaluated in the class-aware setting. Each cell reports MOTA (↑) / MOTP (↓).

Method | Veh | VRU | Cls. ag.
MI-UP [36] | N/A | N/A | 12.8 / 45.5
MI-UP-C [36] + OpenSeg [11] | 39.6 / 37.4 | 13.5 / 53.7 | 22.8 / 43.4
UP-VL detector | 65.3 / 31.0 | 24.0 / 46.8 | 41.3 / 37.4

4.5. Qualitative Results

Our UP-VL enables open-vocabulary detection of arbitrary object types beyond the few human annotated categories in the autonomous driving datasets. Figure 4 illustrates some examples. In each row, we present the camera image on the right for readers’ reference. On the left, we show the corresponding 3D point cloud and the predicted 3D bounding box by our model based on the open-vocabulary text query provided at inference time.

[Figure 4 panels, labeled by text query: “trash bin”, “traffic cone”, “jay walking”, “jeep”.]

Figure 4. Open-vocabulary detection of both static and moving objects via user-provided text queries. Note that in the open-vocabulary setting, the text queries of interested object types are not given in either auto labeling or model training.

5. Conclusions

In this paper, we study the problem of unsupervised 3D object detection and tracking in the context of autonomous driving. We present a cost-efficient pipeline using multi-sensor information and an off-the-shelf vision-language model pre-trained on image-text pairs. Core to our approach is a multi-modal auto labeling pipeline, capable of generating class-agnostic amodal box annotations, tracklets, and per-point semantic features extracted from vision-language models. By combining the semantic information and motion cues observed from the LiDAR point clouds, our auto labeling pipeline can identify and track open-set traffic participants based on the raw sensory inputs. We have evaluated our auto labels by training a 3D open-vocabulary object detection model on the Waymo Open Dataset without any 3D human annotations. Strong results have been demonstrated on the task of open-vocabulary 3D detection with categories specified during inference by text queries which we believe opens up new directions towards more scalable software stacks for autonomous driving.

References

[17] Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, and Phillip Isola. The low-rank simplicity bias in deep networks, 2021.
[1] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes.
InSensorfusionIV:controlparadigmsanddata [18] AyushJain,NikolaosGkanatsios,IshitaMediratta,andKate- structures,volume1611,pages586–606.Spie,1992. 5 rinaFragkiadaki. Bottomuptopdowndetectiontransform- [2] Alex Bewley, Pei Sun, Thomas Mensink, Dragomir ersforlanguagegroundinginimagesandpointclouds. In Anguelov, and Cristian Sminchisescu. Range conditioned ECCV,2022. 2 dilatedconvolutionsforscaleinvariant3dobjectdetection, [19] AndrejJanda,BrandonWagstaff,EdwinGNg,andJonathan 2020. 2 Kelly. Self-supervised pre-training of 3d point cloud net- [3] JunCen,PengYun,JunhaoCai,MichaelYuWang,andMing works with image data. arXiv preprint arXiv:2211.11801, Liu. Open-set3dobjectdetection. In3DV,2021. 2 2022. 2 [4] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, [20] ChaoJia,YinfeiYang,YeXia,Yi-TingChen,ZaranaParekh, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wen- HieuPham, QuocLe, Yun-HsuanSung, ZhenLi, andTom ping Wang. Clip2scene: Towards label-efficient 3d scene Duerig.Scalingupvisualandvision-languagerepresentation understandingbyclip.InProceedingsoftheIEEE/CVFCon- learningwithnoisytextsupervision. InICML,2021. 1,2, ferenceonComputerVisionandPatternRecognition,pages 12 7020–7030,2023. 2 [21] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel [5] YukangChen,YanweiLi,XiangyuZhang,JianSun,andJi- Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr- ayaJia. Focalsparseconvolutionalnetworksfor3dobject modulateddetectionforend-to-endmulti-modalunderstand- detection. InCVPR,2022. 2 ing. InICCV,2021. 1,2 [6] Y. Chen, S. Liu, X. Shen, and J. Jia. Fast point r-cnn. In [22] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitz- ICCV,2019. 2 mann. Decomposing nerf for editing via feature field dis- [7] AyushDewan,TimCaselitz,GianDiegoTipaldi,andWol- tillation. 2022. 2 fram Burgard. Motion-based detection and tracking in 3d [23] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, lidarscans. InICRA,2016. 2 JiongYang,andOscarBeijbom. 
Pointpillars: Fastencoders [8] M.Engelcke,D.Rao,D.Z.Wang,C.H.Tong,andI.Posner. forobjectdetectionfrompointclouds. InCVPR,2019. 1,2, Vote3deep: Fast object detection in 3d point clouds using 5,6 efficientconvolutionalneuralnetworks. InICRA,2017. 2 [24] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen [9] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Koltun, and Rene´ Ranftl. Language-driven semantic seg- Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, mentation. 2022. 1,2 Yuke Zhu, and Anima Anandkumar. Minedojo: Building [25] LiunianHaroldLi,PengchuanZhang,HaotianZhang,Jian- open-endedembodiedagentswithinternet-scaleknowledge. wei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu InNeurIPS,2022. 1 Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded [10] Lue Fan, Xuan Xiong, Feng Wang, Naiyan Wang, and language-imagepre-training. InCVPR,2022. 1,2 ZhaoxiangZhang. Rangedet: Indefenseofrangeviewfor [26] ZhichaoLi,FengWang,andNaiyanWang. Lidarr-cnn:An lidar-based3dobjectdetection. InICCV,2021. 2 efficientanduniversal3dobjectdetector. InCVPR,2021. 2 [11] GolnazGhiasi,XiuyeGu,YinCui,andTsung-YiLin. Scal- [27] Tsung-YiLin,PriyaGoyal,RossGirshick,KaimingHe,and ing open-vocabulary image segmentation with image-level PiotrDolla´r. Focallossfordenseobjectdetection. InICCV, labels. InECCV,2022. 1,2,3,4,6,8,12 2017. 5 [12] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. [28] YangLiu,IdilEsenZulfikar,JonathonLuiten,AchalDave, Open-vocabulary object detection via vision and language AljosˇaOsˇep,DevaRamanan,BastianLeibe,andLauraLeal- knowledgedistillation. ICLR,2022. 1,2 Taixe´. Openingupopen-worldtracking. InCVPR,2022. 2 [13] Huy Ha and Shuran Song. Semantic abstraction: Open- [29] Yueh-ChengLiu,Yu-KaiHuang,Hung-YuehChiang,Hung- world3dsceneunderstandingfrom2dvision-languagemod- TingSu,Zhe-YuLiu,Chin-TangChen,Ching-YuTseng,and els. InCoRL,2022. 2 Winston H Hsu. 
Learning from 2d: Contrastive pixel-to- [14] ChenhangHe,HuiZeng,JianqiangHuang,Xian-ShengHua, pointknowledgetransferfor3dpretraining. arXivpreprint andLeiZhang.Structureawaresingle-stage3dobjectdetec- arXiv:2104.04687,2021. 2 tionfrompointcloud. InCVPR,June2020. 2 [30] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, [15] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun HuiziMao,DanielaRus,andSongHan. Bevfusion: Multi- Dai. Diagnosing error in object detectors. In ECCV (3), taskmulti-sensorfusionwithunifiedbird’s-eyeviewrepre- pages340–353,2012. 13 sentation. 2023. 2 [16] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky [31] Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Liang, PeteFlorence, AndyZeng, JonathanTompson, Igor MasayoshiTomizuka,KurtKeutzer,andShanghangZhang. Mordatch, YevgenChebotar, etal. Inner monologue: Em- Open-vocabulary3ddetectionviaimage-levelclassandde- bodied reasoning through planning with language models. biased cross-modal contrastive learning. arXiv preprint InCoRL,2022. 1 arXiv:2207.01987,2022. 2 9[32] GregoryPMeyer,AnkitLaddha,EricKee,CarlosVallespi- [46] Shuran Song and Jianxiong Xiao. Deep sliding shapes for Gonzalez, and Carl K Wellington. Lasernet: An efficient amodal3dobjectdetectioninrgb-dimages.InCVPR,2016. probabilistic3dobjectdetectorforautonomousdriving. In 2 CVPR,2019. 2 [47] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien [33] MatthiasMinderer,AlexeyGritsenko,AustinStone,Maxim Chouard,VijaysaiPatnaik,PaulTsui,JamesGuo,YinZhou, Neumann,DirkWeissenborn,AlexeyDosovitskiy,Aravindh Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et- Shen, et al. Simple open-vocabulary object detection with tinger,MaximKrivokon,AmyGao,AdityaJoshi,YuZhang, visiontransformers. 2022. 2 Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. 
[34] Yifei Ming, Ziyang Cai, Jiuxiang Gu, Yiyou Sun, Wei Li, Scalability in perception for autonomous driving: Waymo and Yixuan Li. Delving into out-of-distribution detection opendataset. InCVPR,2020. 6 withvision-languagerepresentations. InNeurIPS,2022. 1 [48] PeiSun,MingxingTan,WeiyueWang,ChenxiLiu,FeiXia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse [35] Ishan Misra, Rohit Girdhar, and Armand Joulin. An end- windowtransformerfor3dobjectdetectioninpointclouds. to-endtransformermodelfor3dobjectdetection. InICCV, In Shai Avidan, Gabriel Brostow, Moustapha Cisse´, Gio- 2021. 2 vanniMariaFarinella,andTalHassner,editors,ECCV,2022. [36] Mahyar Najibi, Jingwei Ji, Yin Zhou, Charles R Qi, 5 XinchenYan,ScottEttinger,andDragomirAnguelov. Mo- [49] PeiSun, WeiyueWang, YuningChai, GamaleldinElsayed, tioninspiredunsupervisedperceptionandpredictioninau- Alex Bewley, Xiao Zhang, Cristian Sminchisescu, and tonomousdriving. InECCV,2022. 1,2,3,4,5,6,7,8,12, DragomirAnguelov. Rsn:Rangesparsenetforefficient,ac- 13 curatelidar3dobjectdetection.InCVPR,pages5725–5734, [37] Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, 2021. 2 Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, [50] HaoTian,YuntaoChen,JifengDai,ZhaoxiangZhang,and DavidRoss,LarrySDavis,andAlirezaFathi. Dops:Learn- XizhouZhu. Unsupervisedobjectdetectionwithlidarclues. ingtodetect3dobjectsandpredicttheir3dshapes.InCVPR, InCVPR,2021. 2 2020. 2 [51] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea [38] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Vedaldi. Neuralfeaturefusionfields: 3ddistillationofself- Guibas. Deephoughvotingfor3dobjectdetectioninpoint supervised2dimagerepresentations. 2022. 2 clouds. InICCV,2019. 2 [52] Dominic Zeng Wang and Ingmar Posner. Voting for vot- [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya inginonlinepointcloudobjectdetection. InProceedingsof Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Robotics:ScienceandSystems,Rome,Italy,July2015. 
2 AmandaAskell,PamelaMishkin,JackClark,etal. Learn- [53] YueWang, AlirezaFathi, AbhijitKundu, DavidRoss, Car- ingtransferablevisualmodelsfromnaturallanguagesuper- oline Pantofaru, Thomas Funkhouser, and Justin Solomon. vision. InICML,2021. 1,2 Pillar-based object detection for autonomous driving. In [40] Corentin Sautier, Gilles Puy, Spyros Gidaris, Alexandre ECCV,2020. 2 Boulch,AndreiBursuc,andRenaudMarlet. Image-to-lidar [54] Kelvin Wong, Shenlong Wang, Mengye Ren, Ming Liang, self-superviseddistillationforautonomousdrivingdata. In andRaquelUrtasun. Identifyingunknowninstancesforau- ProceedingsoftheIEEE/CVFConferenceonComputerVi- tonomousdriving. InCoRL,2020. 2 sionandPatternRecognition,pages9891–9901,2022. 2 [55] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real- [41] Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel time3dobjectdetectionfrompointclouds. InCVPR,2018. Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: 2 Weakly supervised semantic fields for robotic memory. [56] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: arXivpreprintarXiv:2210.05663,2022. 1 Point-based3dsinglestageobjectdetector. InCVPR,2020. [42] ShaoshuaiShi,ChaoxuGuo,LiJiang,ZheWang,Jianping 2 Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point- [57] M.Ye,S.Xu,andT.Cao. Hvnet: Hybridvoxelnetworkfor voxel feature set abstraction for 3d object detection. In lidarbased3dobjectdetection. InCVPR,2020. 2 CVPR,2020. 2 [58] XiaohuaZhai,XiaoWang,BasilMustafa,AndreasSteiner, [43] ShaoshuaiShi,XiaogangWang,andHongshengLi. Pointr- Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. cnn: 3dobjectproposalgenerationanddetectionfrompoint Lit: Zero-shot transfer with locked-image text tuning. In cloud. InCVPR,2019. 1,2 CVPR,2022. 2 [44] Weijing Shi and Ragunathan (Raj) Rajkumar. Point-gnn: [59] WuZheng,WeiliangTang,LiJiang,andChi-WingFu. Se- Graph neural network for 3d object detection in a point ssd:Self-ensemblingsingle-stageobjectdetectorfrompoint cloud. InCVPR,2020. 2 cloud. 
InCVPR,2021. 2 [45] Martin Simony, Stefan Milzy, Karl Amendey, and Horst- [60] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan MichaelGross. Complex-yolo:aneuler-region-proposalfor Li,NoelCodella,LiunianHaroldLi,LuoweiZhou,Xiyang real-time 3d object detection on point clouds. In ECCV, Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based 2018. 2 language-imagepretraining. InCVPR,2022. 1,2 10[61] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Kra¨henbu¨hl, and Ishan Misra. Detecting twenty-thousand classesusingimage-levelsupervision. 2022. 2 [62] YinZhou, PeiSun, YuZhang, DragomirAnguelov, Jiyang Gao,TomOuyang,JamesGuo,JiquanNgiam,andVijayVa- sudevan. End-to-endmulti-viewfusionfor3dobjectdetec- tioninlidarpointclouds. InCoRL,2020. 2,5 [63] YinZhouandOncelTuzel. Voxelnet: End-to-endlearning forpointcloudbased3dobjectdetection.InCVPR,2018.1, 2 11Appendix acosinedecaylearningratescheduleandaninitiallearning rate of 0.003 and train the models for a total of 43K itera- A.ImplementationDetails tions. Thissectionprovidestheimplementationdetailsforthe maintwocomponentsofourapproachnamelymulti-modal autolabeling,andtheopen-vocabulary3Dobjectdetector. B.AdditionalQualitativeResults A.1.UnsupervisedMulti-modalAutoLabeling In the paper, we presented qualitative results demon- In our experiments, we use VEGETATION, ROAD, strating that UP-VL can detect open-set objects using text STREET, SKY, TREE, BUILDING, HOUSE, SKYSCRAPER, queries at inference (see Figure 1 and 4 of the main pa- WALL, FENCE and SIDEWALK as text queries for defining per). Additionally, we included a quantitative comparison backgroundcategories,Cbg,whichareexcludedfromauto with the previous state-of-the-art, MI-UP [36], in Table 1 labeling. Wealsosetthecosinesimilaritiesthresholdϵbg to of the main paper. 
Here, we present qualitative compari- be0.02.FortheexperimentsinSection4.2,and 4.3.1ofthe sonbetweenourUP-VLdetector(trainedwithdistillation) main paper which consider moving-only objects, we set a and MI-UP [36] detector in Figure 5. The top row shows sceneflowthresholdofϵsf =1m/s(thesameas[36]). For ourUP-VLclass-awarepredictionswheretheblueandred bounding box proposals, we follow Najibi et al. [36] and boxes represent the vehicle and VRU detections respec- set neighborhood threshold to be 1.0m in DBSCAN. With- tively. On the bottom, we are showing the class-agnostic out knowing the semantics of objects, it is challenging to predictions of the MI-UP model as green boxes. Compar- define the headings of all objects. For moving objects, we ing column (a), first we can see that unlike MI-UP which align their headings with the object moving direction. For is unable to predict semantics, our UP-VL approach can staticobjects,wechoosetheirheadingssuchthattheyhave reliably distinguish between objects of vehicle and VRU anacuteanglewiththeheadingoftheautonomousdriving categories. Moreover, UP-VL can detect many of the ob- vehicle. jectswhichwerecompletelymissedorgroupedtogetherby MI-UP. In column (b), we also mark static objects in the A.2.Open-vocabulary3DObjectDetection bottomrow. Comparingthiscolumnhighlightsanotherad- vantageofourapproach. WhileMI-UPislimitedtodetect- Regarding the vision-language model, in this paper we ingmoving-onlyobjectsbydesign,UP-VLisabletodetect use the pre-trained OpenSeg model [11] coupled with the staticobjectsaswell. Lastly,bycomparingcolumn(c),one BERT-Large text encoder in Jia et al. [20] without further can see that our UP-VL approach can significantly reduce fine-tuningonany2Dor3Dautononmousdrivingdatasets. 
the false positives on cluttered parts of the scene, showing For the knowledge distillation, as discussed in Section yetanotheradvantageofourapproachcomparedtotheprior 3.3.2ofthemainpaper,wedirectlydistillthefinal640di- work on unsupervised 3D object detection in autonomous mensional features of the OpenSeg model. However, for driving. memoryandcomputeefficiencyduringtraining,wefirstre- duce the dimensionality of the features to 64 using an in- cremental PCA fitted to the whole unsupervised training dataset. To evaluate the open-vocabulary detector on the C.EffectofHyperparameters Waymo Open Dataset, we choose the vehicle and VRU as categoriesofinterest,forwhichthedatasethasgroundtruth. In this subsection, we perform an ablation study on the More specifically, we use CAR, VEHICLE, PARKED VE- effectofthehyper-parametersintroducedinAlgorithm1of HICLE, SEDAN, TRUCK, BUS, VAN, MINIVAN, SCHOOL the main paper. More specifically, ϵbg which is used as a BUS,PICKUPTRUCK,AMBULANCE,FIRETRUCKtoquery thresholdonthecomputedcosinesimilaritiestodefinethe for the vehicle category and CYCLIST, HUMAN, PERSON, backgroundpoints,andrbg whichrepresentsathresholdon PEDESTRIAN, BICYCLE to query for the VRU category. the required ratio of background points within a box pro- We found that removing queries from this set will lead to posal to mark it as background and consequently filtering dropped mAPs. For the 3D detection experiments, we use theproposal. TheablationanalysisispresentedinTable5. thesametwo-frameanchor-basedPointPillarsbackboneas First thing to notice is that our approach is fairly robust to previous work [36] for fair comparisons. We also use the these hyper parameters when they are set in a reasonable same set of detection losses to train a class-agnostic 3D range. 
Moreover,comparingthemiddlerowswiththefirst bounding box regression branch and an objectness score andlastrowsdemonstratestheeffectivenessofintroducing branch,andsupplementthemwiththenewdistillationintro- these thresholding schemes in improving the mAP of the ducedinSection 3.3.2ofthemainpaper. Wetrainmodels model. Giventheseresults,inallexperimentsinthepaper on64TPUs,withabatchsizeof2peraccelerator. Weuse wesetϵbg =0.02andrbg =0.99. 12(a) (b) (c) UP-VL static MI-UP[36] Figure5.ComparisonofourUP-VLwithpriorworkMI-UP[36].Comparatively,ourUP-VL(a)localizesobjectsandclassifiesthem,(b) detectsbothmovingandstaticobjects,(c)producesfewerfalsepositives.Bestviewedincolor.Boxcolors:blueforvehicle,redforVRU, greenforclass-agnostic. Table5.Effectofhyperparametersofϵbgandrbg. 3DAP 3DAP ϵbg mAP rbg mAP Veh VRU Veh VRU 0.10 28.7 12.3 20.5 50% 20.7 7.5 14.1 0.05 29.7 14.1 21.9 90% 27.3 11.1 19.2 “tram” 0.02 30.2 14.7 22.4 99% 30.2 14.7 22.4 “truck” 0.00 29.9 14.3 22.1 100% 30.1 14.6 22.3 Moving only All motion states Vehicle VRU Vehicle VRU 34.9% 31.7% 38.1% 1.4% 63.7% 0.2% 68.1% 60.3% 1.6% 50.2% 49.4% Figure 7. Failure cases. (a) Detector fails to generate very large 0.4% boxesforrarecategorieslike”tram”althoughthepoint-wisese- Classification error: Classification error: Localization error manticassignmentiscorrect. (b)Textqueryof”truck”wrongly confusion with background confusion with other objects matcheswithanobjectofcrane. Figure 6. Error analysis of false positives. Fractions of false- positives that are caused by classification or localization errors. Ouranalysiscoverstwoscenarios: detectingmovingobjectsonly anddetectingobjectsinallmotionstates. Andweexamineboth vehicleandVRUcategories. All other false positives fall under the category of confu- sion with background. 
For each category, we count the D.ErrorAnalysis “top-ranked” false positives among the most confident N detections, where N is selected to be half the quantity of D.1.QuantitativeAnalysis groundtruthobjectsinthatcategory. Resultsarepresented Section 4 in the main paper discusses the overall accu- inFigure6. Itshouldbenotedthatgiventhedecoupledde- racy of our open-vocabulary 3D object detectors. In this signofourdetector, thelocalizationerrorcanbelinkedto subsection,wewilldelvedeeperintotheanalysisbybreak- ourclass-agnosticboundingboxpredictionbranch,andthe ingdowntheerrors. Onesignificanttypeoferrorsisfalse classificationerrorcanbelinkedtoourdistillationbranch. positivedetections, whichoccurswhenthedetectedobject Ascanbeseen,formovingobjects(theleftsideofthefig- doesnotcorrespondtoanygroundtruthobject,giveneval- ure),thelocalizationerroristhebottleneckinperformance. uation thresholds. Following Hoiem et al. [15], we cate- Thisiswhile,whenwealsoconsiderthestaticobjects(the gorize false positives into three types. Localization error right side of the figure), the share of the classification er- arises when a detected object belongs to the intended cat- rornoticeablyincreases. Moreover,asexpected,wecansee egory but has a misaligned bounding box (0.1 < 3D IoU that confusion between the categories (vehicles vs. VRUs) < 0.4). The remaining false positives, which have an IoU accountsforaverysmallportionofthefalsepositives. We of at least 0.1 with an ground-truth object from a different believe this analysis sheds light on the bottlenecks for fur- category, are classified as confusion with other objects. therimprovementsoftheproposedapproach. 13D.2.QualitativeAnalysis In the previous subsection, we performed quantitative error analysis on the available human annotations in the dataset. Here, wequalitativelypresentsomeerrorpatterns ofourmethodintheopen-vocabularysettingwherehuman annotationsareunavailable. 
Figure 7 illustrates some real-world challenges in unsupervised open-vocabulary 3D detection. One type of failure case is the detector failing to generate a bounding box even though the point-wise cosine similarity has captured the correct semantics from the user's query (e.g. “tram” in Figure 7). We believe this is because such large objects are rarely seen in the training data and our detector requires more unsupervised training data to confidently capture those objects. Another type of failure case is the mismatch between text queries and visual features for semantically similar concepts, as in the second example in Figure 7, where a text query of “truck” has matched with a crane. We hypothesize that this might be due to the similar appearance between cranes and construction trucks and the high co-occurrence of these two object types in the real world.
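To make the notion of point-wise cosine similarity in the failure cases above concrete, the following is a minimal sketch of matching points to a text query. It assumes each LiDAR point already carries a feature vector and that the query has been embedded into the same space; the function names and the idea of a single fixed `threshold` are illustrative assumptions, not the exact matching rule of the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_points_to_query(point_features: Sequence[Sequence[float]],
                          query_embedding: Sequence[float],
                          threshold: float) -> List[int]:
    """Indices of points whose similarity to the query exceeds `threshold`.

    `threshold` is a hypothetical parameter chosen for illustration.
    """
    return [i for i, feature in enumerate(point_features)
            if cosine_similarity(feature, query_embedding) > threshold]


# Toy usage with 2-D features: points 0 and 2 align with the query direction.
points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matched = match_points_to_query(points, [1.0, 0.0], threshold=0.5)
```

In the “tram” failure case, a rule of this kind would still mark the correct points as matched; the missing piece is the box prediction, which this sketch does not model.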
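The false-positive taxonomy of Section D.1 can also be sketched in a few lines. The example below assumes the 3D IoU between a false-positive detection and every ground-truth box has been computed elsewhere; the function names, the string labels, and rounding N down to an integer are illustrative assumptions, while the 0.1 and 0.4 IoU bounds are the ones stated in Section D.1.

```python
from typing import Dict, List, Sequence

LOC_IOU_MIN = 0.1  # lower IoU bound stated in Section D.1
LOC_IOU_MAX = 0.4  # upper IoU bound of the localization-error range


def categorize_false_positive(det_label: str,
                              ious: Sequence[float],
                              gt_labels: Sequence[str]) -> str:
    """Classify one false positive.

    ious[i] is the 3D IoU between the detection and ground-truth box i
    (hypothetical precomputed input); gt_labels[i] is that box's category.
    """
    pairs = list(zip(ious, gt_labels))
    # Same category, but the box is misaligned.
    if any(lab == det_label and LOC_IOU_MIN < iou < LOC_IOU_MAX
           for iou, lab in pairs):
        return "localization"
    # Overlaps an object of a different category.
    if any(lab != det_label and iou >= LOC_IOU_MIN for iou, lab in pairs):
        return "confusion_with_other_objects"
    return "confusion_with_background"


def top_ranked(detections: List[Dict], num_gt: int) -> List[Dict]:
    """Keep the N most confident detections, with N = half the ground-truth count.

    Rounding N down is an assumption; each detection is a dict with a "score" key.
    """
    n = num_gt // 2
    return sorted(detections, key=lambda d: d["score"], reverse=True)[:n]


# Toy usage: a vehicle detection overlapping a vehicle box with IoU 0.25.
kind = categorize_false_positive("vehicle", [0.25, 0.0], ["vehicle", "vru"])
kept = top_ranked([{"score": 0.9}, {"score": 0.2}, {"score": 0.7}, {"score": 0.5}], num_gt=4)
```

The function expects detections that have already been identified as false positives under the evaluation thresholds; it only decides which of the three categories each one falls into.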