Unsupervised 3D Perception with 2D Vision-Language Distillation
for Autonomous Driving
Mahyar Najibi* Jingwei Ji* Yin Zhou† Charles R. Qi Xinchen Yan
Scott Ettinger Dragomir Anguelov
Waymo LLC
Abstract
Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety-critical applications such as autonomous driving, where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in an unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.

Figure 1. An illustration of three interesting urban scene examples of open-vocabulary perception. Left: our method can faithfully detect objects based on user-provided text queries during inference, without the need for 3D human supervision. Red points are points matched with the text queries. Right: camera images for readers' reference. Note that the inference process solely relies on LiDAR points and does not require camera images.
[Figure 1 panels, text queries: “bulldozer”, “stop sign”, “USPS truck”.]
*Equal contribution
†Corresponding author
arXiv:2309.14491v1 [cs.CV] 25 Sep 2023

1. Introduction

In autonomous driving, most existing 3D detection models [63, 23, 43] have been developed with the prior assumption that all possible categories of interest should be known and annotated during training. While significant progress has been made in this supervised closed-set setting, these methods still struggle to fully address the safety concerns that arise in high-stakes applications. Specifically, in the dynamic real-world environment, it is unacceptable for autonomous vehicles to fail to handle a category that is not present in the training data. To address this safety concern, a recent development by Najibi et al. [36] proposed an unsupervised auto labeling pipeline that uses motion cues from point cloud sequences to localize 3D objects. However, by design, this method does not localize static objects, which constitute a significant portion of traffic participants. Moreover, it only models the problem in a class-agnostic way and fails to provide semantic labels for scene understanding. This is suboptimal as semantic information is essential for downstream tasks such as motion planning, where category-specific safety protocols are deliberately added to navigate through various traffic participants.

Recently, models trained with large-scale image-text datasets have demonstrated robust flexibility and generalization capabilities for open-vocabulary image-based classification [39, 20, 34], detection [21, 12, 25, 60] and semantic segmentation [24, 11] tasks. Yet, open-vocabulary recognition in the 3D domain [9, 16, 41] is in its early stages. In the context of autonomous driving it is even more underexplored. In this work, we fill this gap by leveraging a pre-trained vision-language model to realize open-vocabulary 3D perception in the wild.

We propose a novel paradigm of Unsupervised 3D Perception with 2D Vision-Language distillation (UP-VL). Specifically, by incorporating a pre-trained vision-language model, UP-VL can generate auto labels with substantially higher quality for objects in arbitrary motion states, compared to the latest work by Najibi et al. [36].

With our auto labels, we propose to co-train a 3D object detector with a knowledge distillation task, which can achieve two goals simultaneously, i.e. improving detection quality and transferring semantic features from 2D image pixels to 3D LiDAR points. The perception model is therefore capable of detecting all traffic participants and, thanks to the distilled open-vocabulary features, we can flexibly query the detector's output embedding with text prompts to preserve specific types of objects at inference time (see Figure 1 for some examples).

We summarize the contributions of UP-VL as follows:

• UP-VL achieves state-of-the-art performance on unsupervised 3D perception (detection and tracking) of moving objects for autonomous driving.

• UP-VL introduces semantic-aware unsupervised detection for objects in any motion state, a first in the field of autonomous driving. This breakthrough eliminates the information bottleneck that has plagued previous work [36], where class-agnostic auto labels were used, covering only moving objects with a speed above a predetermined threshold.

• UP-VL enables 3D open-vocabulary detection of novel objects in the wild, with queries specified by users at inference time, therefore removing the need to re-collect data or re-train models.

2. Related works

Vision-language training. Contrastive vision-language training on billions of image-text training pairs resulted in impressive improvements in the tasks of open-set and zero-shot image classification and language related applications [39, 20, 58]. More recently, open-set object localization in 2D images has been shown to benefit from such abundant image-text data as well. Specifically, [21, 12, 25, 60, 61, 33] used image-text training to improve the open-set capability of 2D object detectors and [24, 11] explored the use of large-scale scene-level vision-language data for the task of open-set 2D semantic segmentation. Recent research [31, 22, 51, 13, 18] has begun to explore the application of 2D vision-language pre-training in 3D perception tasks. However, these studies focused on static indoor scenarios where the scene is small-scale and the RGB-D data is captured in high resolution. Here we design a multi-modal pipeline that leverages vision-language pre-training for unsupervised open-set 3D perception in complex, sparse, and occlusion-rich environments for autonomous driving.

Unsupervised 3D object detection. Unsupervised 3D object detection from LiDAR data is largely under-explored [7, 54, 50, 28, 36]. Dewan et al. [7] proposed a model-free method to detect and track the visible part of objects, by using the motion cues from LiDAR sequences. However, this approach is incapable of generating amodal bounding boxes, which are essential for autonomous driving. Cen et al. [3] relied on a supervised detector to produce proposals of unknown categories. However, this approach requires full supervision to train the base detector and has limited generalization capability to only semantically similar categories. Wong et al. [54] identified unknown instances via supervised segmentation and clustering, which by design cannot generate amodal boxes from partial observations. Most recently, Najibi et al. [36] developed an unsupervised auto meta labeling pipeline to generate pseudo labels for moving objects, which can be used to train real-time 3D detection models. This approach fails to provide semantics to detection boxes and ignores static objects, which limits its practical utility. Compared to all previous efforts, we realize open-vocabulary unsupervised 3D detection for both static and moving objects, by leveraging vision-language pre-training, and benchmark our system on the realistic and challenging scenario of autonomous driving. While utilizing 2D vision-language models that may have been pre-trained with human annotations, we avoid the need for any additional 3D labels within our paradigm, thereby creating a pragmatically unsupervised setting.

LiDAR 3D object detection. Most previous works focused on developing performant model architectures in the fully supervised setting, without considering the generalization capability to long-tail cases and unknown object types that are prevalent in the dynamic real world. These methods can be categorized into point based [43, 38, 56, 44, 35, 26], voxelization based [8, 52, 46, 37, 55, 45, 63, 23, 53, 57, 59, 5, 30], perspective projection based [32, 2, 10], and feature fusion [49, 6, 62, 14, 42]. Recent research also explores transferring knowledge from images for 3D point cloud understanding [40, 29, 19, 4]. Our method is compatible with any 3D detector, extending it to handle the open-set settings.

3. Method

We present UP-VL, a new approach for unsupervised open-vocabulary 3D detection and tracking of traffic participants. UP-VL advances the previous state-of-the-art [36], which was limited to class-agnostic detection of moving-only objects, in two main directions: 1) It enables class-aware open-set 3D detection by incorporating open-vocabulary text queries at inference time, and 2) It is able
to detect objects in all motion states as opposed to moving-only objects in the previous study. To achieve these goals, we deploy a multi-modal approach and combine intrinsic motion cues [36] available from the LiDAR sequences with the semantics captured by a vision-language model [11] trained on generic image-text pairs from the Internet. An overview of our approach is shown in Figure 2. As illustrated on the left, our training pipeline involves two main stages. First, our auto labeling method uses these motion and semantic cues to automatically label the raw sensor data, yielding class-agnostic 3D bounding boxes and tracklets as well as point-wise semantic features. Then, in the second stage, we use these auto labels to train open-vocabulary 3D perception models. The right side of the figure illustrates our inference pipeline where given raw LiDAR point clouds, our detector is able to perform open-vocabulary 3D detection given a set of text queries.

Figure 2. Overview of the proposed UP-VL framework. During training (left), our method taps into multi-modal inputs (LiDAR, camera, text) and produces high-quality auto supervisions, via Unsupervised Multi-modal Auto Labeling, including 3D point-level features, 3D object-level bounding boxes and tracklets. Our auto labels are then used to supervise a class-agnostic open-vocabulary 3D detector. Besides, our 3D detector distills the features extracted from a pre-trained 2D vision-language model. At inference time (right), our trained 3D detector produces class-agnostic boxes and per-point features in the embedding space of the pre-trained vision-language model. We then use the text encoder to map queries to the embedding space and compute the per-point similarity scores between the predicted feature and the text embeddings (⊗ refers to cosine similarity). These per-point scores are then aggregated to assign semantic labels to boxes.

3.1. Background

The key challenges in unsupervised 3D perception are twofold: 1) generating high-quality 3D amodal bounding boxes and consistent tracklets for all open-set traffic participants, and 2) inferring per-object semantics. Najibi et al. [36] developed an auto labeling technique to address the first challenge partially. Their approach focuses on moving objects only. Specifically, their method takes LiDAR sequences as input, and removes ground points. It then breaks down the scene into individual connected components (i.e. point clusters). Next, it calculates local flow between pairs of point clusters from two adjacent frames and retains only clusters with speed above a predefined threshold. It then tracks each cluster across frames and aggregates points to obtain a more comprehensive view of the object, which enables the derivation of a faithful 3D amodal bounding box. Finally, the resulting 3D amodal boxes and tracklets can serve as auto labels for training 3D perception models.

While the previous work [36] has shown promising results, it suffers from significant limitations: 1) it can only deal with moving objects; and 2) it is unable to output semantics. These limitations hinder its practical utility for safety-critical applications such as autonomous driving.

3.2. Unsupervised Multi-modal Auto Labeling

In contrast to the traditional way of training a detection model by presenting box geometries and closed-set semantics, our unsupervised multi-modal auto labeling approach produces box geometries and point-wise semantic feature embeddings, where the former teaches the detector to localize all traffic participants and the latter informs the model to preserve certain types of objects based on the inference-time text queries.

Figure 3 shows an overview of the auto labeling pipeline and Algorithm 1 presents its details. Specifically, our system leverages multiple modalities as input, namely camera images, LiDAR point sequences, and natural language. It also employs a pre-trained vision-language model [11] to extract feature embeddings from images and texts, which naturally complements the 3D depth information and motion cues with rich semantics, compared to [36]. We begin by detailing the feature extraction process. We then describe how we utilize the extracted vision-language information in combination with the inherent motion cues from LiDAR sequences to generate auto labels in an unsupervised manner.
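To make the per-point masking used by Algorithm 1 more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: scene flow vectors and vision-language features are assumed to be plain lists of floats, the function names (`motion_mask`, `background_mask`, `keep_box`) are hypothetical, and the toy embeddings and thresholds are chosen only so the result is easy to follow.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def motion_mask(scene_flow, eps_sf):
    """Per-point mask: 1 where the flow magnitude is >= eps_sf, else 0."""
    return [1 if math.sqrt(sum(v * v for v in f)) >= eps_sf else 0
            for f in scene_flow]


def background_mask(point_features, background_embeddings, eps_bg):
    """Per-point mask: 1 where the best cosine similarity to any
    background text embedding is >= eps_bg, else 0."""
    mask = []
    for feat in point_features:
        best = max(cosine_similarity(feat, emb) for emb in background_embeddings)
        mask.append(1 if best >= eps_bg else 0)
    return mask


def keep_box(bg_mask, point_indices, r_bg):
    """Keep a box unless its ratio of background points exceeds r_bg."""
    if not point_indices:
        return False  # assumption: an empty box is never kept
    ratio = sum(bg_mask[i] for i in point_indices) / len(point_indices)
    return ratio <= r_bg


# Toy data (hypothetical): three points with 3D flow and 2D "features".
flow = [[1.2, 0.0, 0.0], [0.1, 0.1, 0.0], [0.0, -2.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
road_embedding = [[1.0, 0.0]]  # stand-in for a text-encoder output

print(motion_mask(flow, 1.0))                       # [1, 0, 1]
print(motion_mask(flow, 0.0))                       # [1, 1, 1] (all motion states)
print(background_mask(feats, road_embedding, 0.5))  # [1, 0, 0]
print(keep_box([1, 0, 0], [0, 1, 2], 0.99))         # True
```

Setting the flow threshold to zero keeps every point, which corresponds to labeling objects in all motion states; the background mask then becomes the main signal for rejecting scene elements that are not traffic participants.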
Figure 3. Overview of our unsupervised multi-modal auto labeling approach. This pipeline first extracts vision-language and motion features from multiple modalities, then proposes, tracks and completes bounding boxes of objects. The resulting pointwise VL features, 3D bounding boxes and tracklets will serve as automatic supervisions to train the perception model.

Algorithm 1 Unsupervised multi-modal auto labeling.
Input: A sequence of images across $T$ frames for each of the $K$ cameras $\{I^k_t\}$; a sequence of LiDAR point locations $\{P_t\}$.
Requires: Cosine similarity threshold for background categories $\epsilon^{bg}$; minimum scene flow magnitude $\epsilon^{sf}$; maximum ratio of background points within a box $r^{bg}$; a set of a priori background categories $C^{bg}$; a pre-trained open-vocabulary model with image encoder $E^{img}$ and text encoder $E^{txt}$.
Output: Amodal 3D bounding boxes $\{B_t\}$ and their track IDs $\{T_t\}$; point-wise open-vocabulary features $\{F^{vl}_t\}$.
Function:
 1: for $t = 1$ to $T$ do
 2:   $\{V^k_t\} \leftarrow E^{img}(\{I^k_t\})$   ▷ 2D VL features
 3:   $F^{vl}_t \leftarrow \mathrm{Unprojection}(\{V^k_t\}, P_t)$   ▷ 3D VL features
 4:   if $t \neq T$ then
 5:     $F^{sf}_t \leftarrow \mathrm{NSFP{+}{+}}(P_t, P_{t+1})$   ▷ Scene flow
 6:   else
 7:     $F^{sf}_t \leftarrow -\mathrm{NSFP{+}{+}}(P_t, P_{t-1})$
 8:   for $i = 1$ to $N_t$ do
 9:     $(M^{sf}_t)_i \leftarrow \mathbb{1}\left(\|(F^{sf}_t)_i\| \geq \epsilon^{sf}\right)$
10:     $(M^{bg}_t)_i \leftarrow \mathbb{1}\left(\max_{c \in C^{bg}} \frac{(F^{vl}_t)_i \cdot E^{txt}(c)}{\|(F^{vl}_t)_i\| \, \|E^{txt}(c)\|} \geq \epsilon^{bg}\right)$
11:   $\tilde{P}_t, \tilde{F}^{sf}_t \leftarrow P_t[M^{sf}_t], F^{sf}_t[M^{sf}_t]$
12:   $B^{vis}_t \leftarrow \mathrm{InitialBoxProposal}(\tilde{P}_t, \tilde{F}^{sf}_t, M^{bg}_t; r^{bg})$
13: $\{T_t\} \leftarrow \mathrm{Tracking}(\{B^{vis}_t\})$
14: $\{B_t\} \leftarrow \mathrm{AmodalBoxGeneration}(\{B^{vis}_t\}, \{T_t\}, \{P_t\})$
15: return $\{B_t\}, \{T_t\}, \{F^{vl}_t\}$

Feature Extraction

As the first step of our approach, we start by extracting open-vocabulary features from all available cameras and then transfer these 2D features to 3D LiDAR points using known sensor calibrations. Specifically, at each time $t$, we have a set of images $\{I^k_t \in \mathbb{R}^{H_k \times W_k \times 3}\}$ captured by $K$ cameras, where $H_k$ and $W_k$ are the image dimensions of camera $k$. We also have a collection of point clouds, $\{P_t \in \mathbb{R}^{N_t \times 3}\}$, captured over time using LiDAR sensors. Here, $N_t$ denotes the number of points at time $t$. We use a pre-trained open-vocabulary 2D image encoder $E^{img}$ to extract the pixel-wise visual features for each image, denoted as $\{V^k_t \in \mathbb{R}^{H_k \times W_k \times D}\}$, where $D$ represents the feature dimension. Next, we build the mapping between 3D LiDAR points and their corresponding image pixels using the camera and LiDAR calibration information. Once this mapping is created, we can associate each 3D point with its corresponding image feature vector. As a result, we obtain vision-language features for all the 3D points as $F^{vl}_t \in \mathbb{R}^{N_t \times D}$, where $N_t$ is the number of points at time $t$.

Additionally, we leverage motion signals as another crucial representation that can substantially aid in deducing the concept of objectness for moving instances in the open-set environment. Specifically, we employ the NSFP++ algorithm [36] to compute the scene flow $F^{sf}_t \in \mathbb{R}^{N_t \times 3}$ of points at each time $t$, which is a set of flow vectors corresponding to each point in $P_t$.

Bounding Box Proposal Generation

At each time step, we generate initial bounding box proposals $\{B^{vis}_t \in \mathbb{R}^{M_t \times 7}\}$ by clustering the points, where $M_t$ is the number of boxes at time $t$, and each box is parameterized as (center x, center y, center z, length, width, height, heading). Note that vis indicates that each box only covers the visible portion of an object. To cluster each point, we leverage a set of features which includes the point locations $P_t$, scene flow $F^{sf}_t$, and the vision-language features $F^{vl}_t$.

We design our pipeline to flexibly generate auto labels for objects in desired motion states. Given scene flow $F^{sf}_t$, we introduce a velocity threshold $\epsilon^{sf}$ to select points whose speed is greater than or equal to the threshold (e.g., 1.0 m/s). To capture objects in all motion states, we set $\epsilon^{sf} = 0$.

One major challenge of auto labeling objects in all motion states is how to automatically distinguish traffic participants (e.g., vehicles, pedestrians, etc.) from irrelevant scene elements (e.g., street, fence, etc.). We propose to leverage an a priori list of background object categories to exclude irrelevant scene elements from labeling. Specifically, we use the text encoder, $E^{txt}$, from the pre-trained 2D vision-language model [11], to encode each background category name $c$ into its feature embedding $E^{txt}(c) \in \mathbb{R}^{D}$. We further define a per-point binary background mask, denoted as $M^{bg}_t \in \{0,1\}^{N_t}$, that takes on a value of 1 if a point is assigned to one of the a priori background categories, or 0 otherwise. See Algorithm 1 for the definition of $M^{bg}_t$,
where $(\cdot)_i$ denotes the $i$-th row of a matrix and $\mathbb{1}(\cdot)$ represents the indicator function. We use this background mask to mark scene elements which are not of interest.

We then proceed to cluster the point cloud into neighboring regions using a spatio-temporal clustering algorithm, modified from [36], followed by calculating the tightest bounding box around each cluster. In addition to clustering points by their locations and motions, we also use $M^{bg}_t$ to eliminate bounding boxes which are likely to be background. To be precise, we discard any bounding box in which the ratio of background points exceeds a threshold of $r^{bg}$ (which is set to 99%). This process results in the initial set of bounding box proposals $\{B^{vis}_t\}$. Note that in this step, the box dimensions are determined based on the visible portion of each object, which can be significantly underestimated compared to the human labeled amodal box, due to ubiquitous occlusions and sparsity.

Amodal Auto Labeling

In autonomous driving, downstream perception tasks require amodal boxes that encompass both the visible and occluded parts of the objects. To transform our visible-only proposals to amodal auto labels, we follow [36] by adopting a tracking-by-detection paradigm with Kalman filter state updates to link all proposals over time. We then perform shape registration for each object track of $\{T_t\}$ using ICP [1]. Within each track, we leverage the intuition that different viewpoints contain complementary information and temporal aggregation of the registered points from proposals would allow us to obtain a complete shape of the object. Hence, we fit a new box to the aggregated points to yield the amodal box. Finally, we undo the registration from aggregated points to individual frames and replace the original visible box proposal at each time step with the amodal box, which produces auto labeled 3D boxes and the tracklet.

In practice, background point filtering, point cloud registration and temporal aggregation may contain noise, leading to spurious boxes, e.g., tiny and sizable boxes and overlapping boxes. We apply non-maximum suppression (NMS) to clean the auto label boxes. This final set of unsupervised amodal auto labels $\{B_t\}$, their track IDs $\{T_t\}$, together with the extracted vision-language embeddings $\{F^{vl}_t\}$, are then used to train the open-vocabulary 3D object detection model as described in Sec. 3.3.

3.3. Open-vocabulary 3D Object Detection

In this subsection, we describe how the unsupervised auto labels can be used to train a 3D object detector capable of localizing open-set objects and assigning open-vocabulary semantics to them, all without using any 3D human annotations during training.

3.3.1 Model Architecture

Our design, as depicted in Figure 2, is based on decoupling object detection into class-agnostic object localization and semantic label assignment. For class-agnostic bounding box prediction, we add a branch to a 3D point cloud encoder backbone to generate 3D bounding box center, dimensions, and heading. This branch accompanies a binary classification branch which outputs a foreground/background class-agnostic per-box objectness score. To supervise these two branches, we treat our unsupervised auto labels (see Sec. 3.2) as ground-truth and add bounding box regression and classification losses to our learning objective. We would like to highlight that our pipeline is independent of a specific 3D point-cloud encoder [23, 62, 48] and the detection paradigm (either anchor-based or anchor-free detection). Here, we adopt an anchor-based PointPillars backbone [23] with Huber loss for box residual regression and Focal Loss [27] for objectness classification to have a fair comparison with prior works [36]. Besides predicting 3D bounding boxes, we also perform text query-based open-vocabulary semantic assignment by distilling knowledge from pre-trained 2D vision-language models using an extra branch which is described in the next subsections.

3.3.2 Vision-Language Knowledge Distillation

Besides class-agnostic bounding box generation, our 3D detector pipeline also distills the semantic knowledge from the per-point vision-language features provided by our auto labeling pipeline (i.e. $\{F^{vl}_t\}$, introduced in the vision-language feature extraction in Sec. 3.2). In our method, we directly distill these features, which, as will be discussed in the next subsection, unlocks text query-based open-vocabulary category assignment at inference time. More precisely, as shown in the left side of Figure 2, we add a new linear branch to the model to predict per-point $D$-dimensional features (here $D$ is the dimensionality of the vision-language embedding space). As the input to this branch, we scatter the computed voxelized features in our backbone back into the points and concatenate them with the available per-point input features (i.e. 3D point locations and LiDAR intensity and elongation features). We then train the network to predict the feature vector $f^{vl}_p \in F^{vl}_t$ for any point $p$ visible in the camera images and add the following loss to the training objective:

$$\mathcal{L}_{distill}(p) = \mathrm{CosineDist}(y_p, f^{vl}_p) \tag{1}$$

where $y_p$ is the distillation prediction by the model for point $p$. This, together with the bounding box regression and the objectness classification losses (based on our auto labels as discussed in Sec. 3.3.1), forms our final training objective.
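As a small illustration of the distillation loss in Eq. (1), the sketch below computes a cosine distance between predicted and target per-point features. Two things here are assumptions rather than statements from the text: `CosineDist` is taken to mean one minus the cosine similarity, and the per-point losses are averaged over the points that are visible in the cameras. The function names and toy vectors are hypothetical.

```python
import math


def cosine_distance(pred, target):
    """Assumed definition: 1 - cosine similarity of the two vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    if norm == 0.0:
        return 1.0  # assumption: a zero vector is treated as orthogonal
    return 1.0 - dot / norm


def distillation_loss(predictions, targets, visible):
    """Mean cosine distance over points that have a camera-derived target.

    predictions: per-point model outputs y_p
    targets:     per-point vision-language features f_p^vl
    visible:     per-point flags, True if the point projects into an image
    """
    losses = [cosine_distance(y, f)
              for y, f, v in zip(predictions, targets, visible) if v]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example (hypothetical values): the third point is not visible
# in any camera, so it does not contribute to the loss.
y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f = [[2.0, 0.0], [0.0, -3.0], [1.0, 0.0]]
print(distillation_loss(y, f, [True, True, False]))  # 1.0
```

Because the cosine distance ignores vector magnitude, the first point incurs no loss even though its prediction and target differ in scale; only the direction in the embedding space is penalized.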
3.3.3 Open-Vocabulary Inference

So far, we have introduced how to train a detector to simultaneously localize all objects in a class-agnostic manner and predict vision-language features for all LiDAR points. Here, we discuss how we assign open-vocabulary semantics to the predicted boxes during inference. This process is depicted in the right side of Figure 2. The pre-trained 2D vision-language model [11] contains an image encoder and a text encoder, which are jointly trained to map text and image data to a shared embedding space. As described in Sec. 3.3.2, we add a feature distillation branch that maps 3D input point clouds to the 2D image encoder embedding space, which essentially bridges the gap between point clouds and semantic text queries. As a result, at inference time we can encode arbitrary open-vocabulary categories presented as text queries and compute their similarities with the observed 3D points. This can be achieved by computing the cosine similarity between the text query embeddings and the vision-language features predicted by our model for each 3D point. Finally, we assign open-vocabulary categories to boxes based on majority voting. Specifically, we associate each point with the category that has the highest computed cosine similarity, and then assign to each box the most common category of its enclosed points.

We would like to emphasize that our approach does not need to process images at inference time, since we have distilled image encoder features to the point cloud. Therefore, the only added computation is a simple linear layer for predicting per-point vision-language embeddings, which is negligible compared to the rest of the detector architecture.

4. Experiments

Our UP-VL approach advances the previous state-of-the-art in unsupervised 3D perception for autonomous driving [36] in two main directions: 1) enabling open-vocabulary category semantics and 2) detecting objects in all motion states (as opposed to moving-only objects in the previous study). In this section, we perform extensive evaluations with respect to each of these innovations. Note that unsupervised open-set 3D detection is still at an early stage in the research community with few published works. Therefore, to fairly compare with the state-of-the-art [36], we perform our detection experiments first following the same setting as [36] (i.e. detecting class-agnostic moving objects) and then showcasing our new capabilities (i.e. detecting objects in any motion state with semantics).

Sec. 4.2 studies the performance of our system in the class-agnostic setting. This allows us to compare our approach with the existing state-of-the-art method on detecting moving-only objects, showing large improvements. Sec. 4.3 moves the needle beyond the capability of the previous class-agnostic state-of-the-art methods and reports results under the open-vocabulary class-aware setting for detecting moving-only objects (Sec. 4.3.1) and the most challenging setting of open-vocabulary detection of objects in all motion states (Sec. 4.3.2). Finally, Sec. 4.4 reports the open-set tracking quality of our auto labels and Sec. 4.5 presents qualitative results. See supplementary materials for more ablation studies and error analyses.

Table 1. Comparison of the methods on class-agnostic unsupervised 3D detection of moving objects. Top: Auto label boxes. Bottom: Detection boxes.

Method        Box Type     3D AP@0.4      3D AP@0.5
                           L1     L2      L1     L2
MI-UP [36]    Auto labels  36.9   35.5    27.4   26.4
UP-VL (ours)  Auto labels  39.9   38.4    34.2   32.0
MI-UP [36]    Detections   42.1   40.4    29.6   28.4
UP-VL (ours)  Detections   49.9   48.1    38.4   36.9

4.1. Experimental Setting

We evaluate our framework using the challenging Waymo Open Dataset (WOD) [47], which provides a large collection of run segments captured by multi-modal sensors in diverse environment conditions. To define moving-only objects in Sec. 4.2, we follow [36] and apply a threshold of 1.0 m/s (i.e. $\epsilon^{sf} = 1.0$). We set the cosine similarity threshold for background categories at $\epsilon^{bg} = 0.02$ to achieve best performance in practice. The background categories $C^{bg}$ we exclude from auto labeling are “vegetation”, “road”, “street”, “sky”, “tree”, “building”, “house”, “skyscaper”, “wall”, “fence”, and “sidewalk”. The WOD [47] has three common object categories, i.e. vehicle, pedestrian, and cyclist. In the class-aware 3D detection experiments (Sec. 4.3), we follow [36] and combine pedestrian and cyclist into one VRU (vulnerable road users) category, which contains a similar number of labels as the vehicle category. As in [36], we also train and evaluate the detectors on a 100m × 40m rectangular region around the ego vehicle. We use the popular PointPillars detector [23] for all our detection experiments and set an intersection over union, IoU=0.4, for evaluations unless noted otherwise. Please refer to Sec. 1 of supplementary materials for a more detailed description of all experimental settings.

4.2. Class-agnostic Unsupervised 3D Detection of Moving Objects

For fair comparison, we follow the same setting as [36] and tailor our approach to class-agnostic moving-only 3D detection. Specifically, we perform auto labeling as introduced in Sec. 3.2 with speed threshold $\epsilon^{sf} = 1.0$ m/s and train a class-agnostic detector with feature distillation as described in Sec. 3.3.1. However, we disable text queries at inference time. Note that [36] only considered detection of moving objects.
6WeleavethestudyofmorechallengingsettingstoSec.4.3. dictsimagefeaturesextractedbyourautolabelingpipeline
Table 1 shows our result and compares it with MI- for3Dpointcloudsandconsequentlyismoreefficient.
UP [36]. The top part of the table compares the auto la-
belingquality. Thebottompartcomparesthedetectorper-
4.3.2 ObjectsinAllMotionStates
formancebetweenourUP-VLapproachandMI-UP.Weuse
theexactsamedetectionbackboneandhyper-parametersto Finally in this section, we report results on the most chal-
ensure a fair comparison. When evaluating at IoU=0.4 as lengingsetting: unsupervisedclass-awareopen-vocabulary
suggested by [36], UP-VL significantly outperforms MI- 3D detection for all objects with arbitrary motion states.
UP,bothintermsoftheautolabelaswellasthedetection Like Sec. 4.3.1, since [36] falls short in this setting, we
performance. To better demonstrate our improved auto la- construct three clustering baselines using different combi-
belquality, wealsoevaluatewithahigherlocalizationcri- nationsofourfeatures. Morespecifically,thefirstrowonly
terion at IoU=0.5, where our improvement becomes even uses point locations (P ), the second row uses both point
t
more pronounced. We should also point out that in both locations and our vision-language features (Fvl), and the
t
methods, the final detection quality is superior to the auto thirdrowleveragesallthefeaturesincludingoursceneflow
labelquality. Wehypothesizethatthisisduetothenetwork features(Fsf).Asanablationontheeffectivenessofthein-
t
being able to learn a better objectness scoring function for troducedfeaturedistillationinUP-VL,wealsoaddabase-
rankingaswellasitsabilitytodenoisetheautolabelsgiven line called “Our detector w/o feature distillation”, where
theinductivebiasofthemodel[17]. we remove the distillation head and its loss from our de-
tector, and like the baselines in the first three rows, we di-
4.3. Class-aware Unsupervised Open-vocabulary rectlyprojectthevision-languagefeaturesfromcameraim-
3DDetection ages to the point cloud for semantic label assignment. As
summarizedinTable3,ourautolabelssignificantlyoutper-
In this section, we evaluate the capability of our UP-
formotherbaselineslistedinthefirstthreerows.Moreover,
VL pipeline in class-aware open-vocabulary 3D detection
comparingthelasttworows,weobservethattheproposed
of objects in different motion states. Please note that we
vision-languagefeaturedistillationleadstosignificantper-
don’t use any 3D human annotations during training and only use the available human labeled categories for evaluation. Moreover, it should be noted that the previous state-of-the-art [36], as a class-agnostic approach, falls short in this new setting, making comparisons not possible. In all experiments in this section, we assign labels to boxes by querying category names as text at inference time in an open-vocabulary fashion as described in Sec. 3.3.3 (see Sec. 1 of supplementary for a detailed list of text queries used).

4.3.1 Moving-only Objects

Table 2 reports the class-aware open-vocabulary 3D detection results on the moving-only objects. Since [36] is no longer applicable in this setting, we construct two baselines for comparison: geometric clustering [36], which additionally uses our extracted scene flow features (F_t^sf), and its variant, which leverages both the scene flow features and the vision-language features (F_t^vl). 3D point-wise semantics for the baselines are extracted directly by projecting the 2D image features of the pre-trained vision-language model. We report per-category AP as well as the mAP of these baselines in the top two rows of Table 2. The bottom of the table presents the results for our unsupervised auto labels and our final UP-VL detections. Our auto labels and UP-VL detector both outperform baselines constructed from prior approaches. As discussed in Sec. 3.3.2, unlike the baselines that require applying the image encoder to all camera images at inference time, our detector directly pre-

formance improvement across all metrics. For example, our approach with feature distillation outperforms the counterpart without distillation by more than 8 points in mAP.

4.4. Tracking

UP-VL exhibits high performance not only in detection, but also in tracking, a critical task in autonomous driving. We employ the motion-based tracker from [36] and conduct experiments in the tracking-by-detection manner. We evaluate tracking performance for moving objects and compare our UP-VL detector trained with feature distillation, as outlined in Table 2, against two baselines: the MI-UP detector from Table 1 and another open-set baseline from Table 2. To measure the effectiveness of our model, we employ the widely used MOTA and MOTP metrics, both in the class-agnostic and class-aware open-vocabulary settings. Our experimental results (Table 4) demonstrate that UP-VL outperforms both baselines by a significant margin.

4.5. Qualitative Results

Our UP-VL enables open-vocabulary detection of arbitrary object types beyond the few human annotated categories in the autonomous driving datasets. Figure 4 illustrates some examples. In each row, we present the camera image on the right for readers’ reference. On the left, we show the corresponding 3D point cloud and the 3D bounding box predicted by our model based on the open-vocabulary text query provided at inference time.
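To make the text-query step concrete, the following is a minimal sketch of how a box could be labeled from text queries at inference time. It is not the procedure of Sec. 3.3.3: the mean pooling of per-point features inside a box, the per-category maximum over queries, and the names `classify_box` and `query_categories` are assumptions made for illustration, and the toy embeddings stand in for the outputs of a real vision-language model.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (guarding against zeros)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_box(point_feats, query_embeds, query_categories):
    """Assign a category to one box from open-vocabulary text queries.

    point_feats:      (N, D) per-point features of the points inside the box.
    query_embeds:     (Q, D) text embeddings, one per query string.
    query_categories: length-Q list mapping each query to a category name.

    Assumption: the box feature is the mean of its point features, and a
    category's score is the best cosine similarity among its queries.
    """
    box_feat = l2_normalize(np.asarray(point_feats, dtype=float).mean(axis=0))
    sims = l2_normalize(np.asarray(query_embeds, dtype=float)) @ box_feat
    scores = {}
    for sim, category in zip(sims, query_categories):
        scores[category] = max(scores.get(category, -1.0), float(sim))
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy 3-D embeddings standing in for "car", "truck", "pedestrian".
    queries = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
    categories = ["vehicle", "vehicle", "VRU"]
    feats = np.array([[1.0, 0.1, 0.0], [0.8, 0.0, 0.1]])
    print(classify_box(feats, queries, categories))
```

Grouping several queries under one category mirrors the evaluation setup, in which multiple text queries are used for each of the vehicle and VRU categories.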
Table 2. Comparison of methods on unsupervised class-aware moving object detection. (* since semantics are not available, we report class-agnostic AP for the first row, given that vehicle and VRU contain a similar number of samples.)

                                              Representations
Method                                        Motion  Vision-Language  Box type  3D AP Veh  3D AP VRU  mAP
Clustering [36]                               ✓       -                visible   N/A        N/A        32.4*
Clustering [36] + OpenSeg [11]                ✓       ✓                visible   47.8       21.5       34.7
Our auto labels                               ✓       ✓                amodal    57.5       29.8       43.7
Our UP-VL detector w. feature distillation    ✓       ✓                amodal    76.9       28.6       52.8

Table 3. Comparison of methods on unsupervised class-aware detection of objects in all motion states. (* since semantics are not available, we report class-agnostic AP for the first row, given that vehicle and VRU contain a similar number of samples.)

                                              Representations
Method                                        Motion  Vision-Language  Box type  3D AP Veh  3D AP VRU  mAP
Clustering [36]                               -       -                visible   N/A        N/A        11.6*
Clustering [36] + OpenSeg [11]                -       ✓                visible   15.8       9.9        12.9
Clustering [36] + OpenSeg [11]                ✓       ✓                visible   16.1       10.0       13.1
Our auto labels                               ✓       ✓                amodal    30.2       14.7       22.4
Our detector w/o feature distillation         ✓       ✓                amodal    40.0       15.2       27.6
Our UP-VL detector w. feature distillation    ✓       ✓                amodal    52.0       19.7       35.8
Table 4. Comparison of tracking methods for moving objects with evaluations in class-agnostic (Cls. ag.) and class-aware settings. “MI-UP-C” refers to class-agnostic MI-UP clustering approach, which is unable to be evaluated in the class-aware setting.

                                 MOTA (↑) / MOTP (↓)
Method                           Veh          VRU          Cls. ag.
MI-UP [36]                       N/A          N/A          12.8/45.5
MI-UP-C [36] + OpenSeg [11]      39.6/37.4    13.5/53.7    22.8/43.4
UP-VL detector                   65.3/31.0    24.0/46.8    41.3/37.4

5. Conclusions

In this paper, we study the problem of unsupervised 3D object detection and tracking in the context of autonomous driving. We present a cost-efficient pipeline using multi-sensor information and an off-the-shelf vision-language model pre-trained on image-text pairs. Core to our approach is a multi-modal auto labeling pipeline, capable of generating class-agnostic amodal box annotations, tracklets, and per-point semantic features extracted from vision-language models. By combining the semantic information and motion cues observed from the LiDAR point clouds, our auto labeling pipeline can identify and track open-set traffic participants based on the raw sensory inputs. We have evaluated our auto labels by training a 3D open-vocabulary object detection model on the Waymo Open Dataset without any 3D human annotations. Strong results have been demonstrated on the task of open-vocabulary 3D detection with categories specified during inference by text queries, which we believe opens up new directions towards more scalable software stacks for autonomous driving.

Figure 4. Open-vocabulary detection of both static and moving objects via user-provided text queries. Note that in the open-vocabulary setting, the text queries for the object types of interest are not given in either auto labeling or model training. Example queries shown: “trash bin”, “traffic cone”, “jay walking”, “jeep”.
References

[1] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, volume 1611, pages 586–606. Spie, 1992.
[2] Alex Bewley, Pei Sun, Thomas Mensink, Dragomir Anguelov, and Cristian Sminchisescu. Range conditioned dilated convolutions for scale invariant 3d object detection, 2020.
[3] Jun Cen, Peng Yun, Junhao Cai, Michael Yu Wang, and Ming Liu. Open-set 3d object detection. In 3DV, 2021.
[4] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7030, 2023.
[5] Yukang Chen, Yanwei Li, Xiangyu Zhang, Jian Sun, and Jiaya Jia. Focal sparse convolutional networks for 3d object detection. In CVPR, 2022.
[6] Y. Chen, S. Liu, X. Shen, and J. Jia. Fast point r-cnn. In ICCV, 2019.
[7] Ayush Dewan, Tim Caselitz, Gian Diego Tipaldi, and Wolfram Burgard. Motion-based detection and tracking in 3d lidar scans. In ICRA, 2016.
[8] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In ICRA, 2017.
[9] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In NeurIPS, 2022.
[10] Lue Fan, Xuan Xiong, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Rangedet: In defense of range view for lidar-based 3d object detection. In ICCV, 2021.
[11] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In ECCV, 2022.
[12] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. ICLR, 2022.
[13] Huy Ha and Shuran Song. Semantic abstraction: Open-world 3d scene understanding from 2d vision-language models. In CoRL, 2022.
[14] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In CVPR, June 2020.
[15] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In ECCV (3), pages 340–353, 2012.
[16] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In CoRL, 2022.
[17] Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, and Phillip Isola. The low-rank simplicity bias in deep networks, 2021.
[18] Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, and Katerina Fragkiadaki. Bottom up top down detection transformers for language grounding in images and point clouds. In ECCV, 2022.
[19] Andrej Janda, Brandon Wagstaff, Edwin G Ng, and Jonathan Kelly. Self-supervised pre-training of 3d point cloud networks with image data. arXiv preprint arXiv:2211.11801, 2022.
[20] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
[21] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In ICCV, 2021.
[22] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. 2022.
[23] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
[24] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. 2022.
[25] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, 2022.
[26] Zhichao Li, Feng Wang, and Naiyan Wang. Lidar r-cnn: An efficient and universal 3d object detector. In CVPR, 2021.
[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[28] Yang Liu, Idil Esen Zulfikar, Jonathon Luiten, Achal Dave, Aljoša Ošep, Deva Ramanan, Bastian Leibe, and Laura Leal-Taixé. Opening up open-world tracking. In CVPR, 2022.
[29] Yueh-Cheng Liu, Yu-Kai Huang, Hung-Yueh Chiang, Hung-Ting Su, Zhe-Yu Liu, Chin-Tang Chen, Ching-Yu Tseng, and Winston H Hsu. Learning from 2d: Contrastive pixel-to-point knowledge transfer for 3d pretraining. arXiv preprint arXiv:2104.04687, 2021.
[30] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. 2023.
[31] Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning. arXiv preprint arXiv:2207.01987, 2022.
[32] Gregory P Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K Wellington. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In CVPR, 2019.
[33] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection with vision transformers. 2022.
[34] Yifei Ming, Ziyang Cai, Jiuxiang Gu, Yiyou Sun, Wei Li, and Yixuan Li. Delving into out-of-distribution detection with vision-language representations. In NeurIPS, 2022.
[35] Ishan Misra, Rohit Girdhar, and Armand Joulin. An end-to-end transformer model for 3d object detection. In ICCV, 2021.
[36] Mahyar Najibi, Jingwei Ji, Yin Zhou, Charles R Qi, Xinchen Yan, Scott Ettinger, and Dragomir Anguelov. Motion inspired unsupervised perception and prediction in autonomous driving. In ECCV, 2022.
[37] Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S Davis, and Alireza Fathi. Dops: Learning to detect 3d objects and predict their 3d shapes. In CVPR, 2020.
[38] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In ICCV, 2019.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[40] Corentin Sautier, Gilles Puy, Spyros Gidaris, Alexandre Boulch, Andrei Bursuc, and Renaud Marlet. Image-to-lidar self-supervised distillation for autonomous driving data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9891–9901, 2022.
[41] Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022.
[42] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In CVPR, 2020.
[43] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, 2019.
[44] Weijing Shi and Ragunathan (Raj) Rajkumar. Point-gnn: Graph neural network for 3d object detection in a point cloud. In CVPR, 2020.
[45] Martin Simony, Stefan Milzy, Karl Amendey, and Horst-Michael Gross. Complex-yolo: an euler-region-proposal for real-time 3d object detection on point clouds. In ECCV, 2018.
[46] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, 2016.
[47] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
[48] Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, ECCV, 2022.
[49] Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin Elsayed, Alex Bewley, Xiao Zhang, Cristian Sminchisescu, and Dragomir Anguelov. Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In CVPR, pages 5725–5734, 2021.
[50] Hao Tian, Yuntao Chen, Jifeng Dai, Zhaoxiang Zhang, and Xizhou Zhu. Unsupervised object detection with lidar clues. In CVPR, 2021.
[51] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. 2022.
[52] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
[53] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Thomas Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. In ECCV, 2020.
[54] Kelvin Wong, Shenlong Wang, Mengye Ren, Ming Liang, and Raquel Urtasun. Identifying unknown instances for autonomous driving. In CoRL, 2020.
[55] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In CVPR, 2018.
[56] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In CVPR, 2020.
[57] M. Ye, S. Xu, and T. Cao. Hvnet: Hybrid voxel network for lidar based 3d object detection. In CVPR, 2020.
[58] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
[59] Wu Zheng, Weiliang Tang, Li Jiang, and Chi-Wing Fu. Se-ssd: Self-ensembling single-stage object detector from point cloud. In CVPR, 2021.
[60] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In CVPR, 2022.
[61] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. 2022.
[62] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In CoRL, 2020.
[63] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, 2018.
Appendix

A. Implementation Details

This section provides the implementation details for the two main components of our approach, namely multi-modal auto labeling and the open-vocabulary 3D object detector.

A.1. Unsupervised Multi-modal Auto Labeling

In our experiments, we use VEGETATION, ROAD, STREET, SKY, TREE, BUILDING, HOUSE, SKYSCRAPER, WALL, FENCE and SIDEWALK as text queries for defining background categories, C_bg, which are excluded from auto labeling. We also set the cosine similarity threshold ϵ_bg to 0.02. For the experiments in Sections 4.2 and 4.3.1 of the main paper, which consider moving-only objects, we set a scene flow threshold of ϵ_sf = 1 m/s (the same as [36]). For bounding box proposals, we follow Najibi et al. [36] and set the neighborhood threshold to 1.0 m in DBSCAN. Without knowing the semantics of objects, it is challenging to define the headings of all objects. For moving objects, we align their headings with the object moving direction. For static objects, we choose their headings such that they form an acute angle with the heading of the autonomous driving vehicle.

A.2. Open-vocabulary 3D Object Detection

Regarding the vision-language model, in this paper we use the pre-trained OpenSeg model [11] coupled with the BERT-Large text encoder in Jia et al. [20] without further fine-tuning on any 2D or 3D autonomous driving datasets. For the knowledge distillation, as discussed in Section 3.3.2 of the main paper, we directly distill the final 640-dimensional features of the OpenSeg model. However, for memory and compute efficiency during training, we first reduce the dimensionality of the features to 64 using an incremental PCA fitted to the whole unsupervised training dataset. To evaluate the open-vocabulary detector on the Waymo Open Dataset, we choose the vehicle and VRU as categories of interest, for which the dataset has ground truth. More specifically, we use CAR, VEHICLE, PARKED VEHICLE, SEDAN, TRUCK, BUS, VAN, MINIVAN, SCHOOL BUS, PICKUP TRUCK, AMBULANCE, FIRE TRUCK to query for the vehicle category and CYCLIST, HUMAN, PERSON, PEDESTRIAN, BICYCLE to query for the VRU category. We found that removing queries from this set leads to lower mAP. For the 3D detection experiments, we use the same two-frame anchor-based PointPillars backbone as previous work [36] for fair comparisons. We also use the same set of detection losses to train a class-agnostic 3D bounding box regression branch and an objectness score branch, and supplement them with the new distillation introduced in Section 3.3.2 of the main paper. We train models on 64 TPUs, with a batch size of 2 per accelerator. We use a cosine decay learning rate schedule with an initial learning rate of 0.003 and train the models for a total of 43K iterations.

B. Additional Qualitative Results

In the paper, we presented qualitative results demonstrating that UP-VL can detect open-set objects using text queries at inference (see Figures 1 and 4 of the main paper). Additionally, we included a quantitative comparison with the previous state-of-the-art, MI-UP [36], in Table 1 of the main paper. Here, we present a qualitative comparison between our UP-VL detector (trained with distillation) and the MI-UP [36] detector in Figure 5. The top row shows our UP-VL class-aware predictions, where the blue and red boxes represent the vehicle and VRU detections respectively. The bottom row shows the class-agnostic predictions of the MI-UP model as green boxes. Comparing column (a), we can first see that unlike MI-UP, which is unable to predict semantics, our UP-VL approach can reliably distinguish between objects of the vehicle and VRU categories. Moreover, UP-VL can detect many of the objects which were completely missed or grouped together by MI-UP. In column (b), we also mark static objects in the bottom row. Comparing this column highlights another advantage of our approach. While MI-UP is limited to detecting moving-only objects by design, UP-VL is able to detect static objects as well. Lastly, by comparing column (c), one can see that our UP-VL approach can significantly reduce the false positives on cluttered parts of the scene, showing yet another advantage of our approach compared to the prior work on unsupervised 3D object detection in autonomous driving.

C. Effect of Hyperparameters

In this section, we perform an ablation study on the effect of the hyperparameters introduced in Algorithm 1 of the main paper. More specifically, we study ϵ_bg, which is used as a threshold on the computed cosine similarities to define the background points, and r_bg, which represents a threshold on the required ratio of background points within a box proposal to mark it as background and consequently filter the proposal. The ablation analysis is presented in Table 5. The first thing to notice is that our approach is fairly robust to these hyperparameters when they are set in a reasonable range. Moreover, comparing the middle rows with the first and last rows demonstrates the effectiveness of introducing these thresholding schemes in improving the mAP of the model. Given these results, in all experiments in the paper we set ϵ_bg = 0.02 and r_bg = 0.99.
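The two thresholds can be illustrated with a short sketch. This is not Algorithm 1 of the main paper: it assumes that a point counts as background when its highest cosine similarity to any background text embedding exceeds ϵ_bg, and that a proposal is filtered when the fraction of background points inside it reaches r_bg. The function names `background_point_mask` and `keep_proposal` are hypothetical.

```python
import numpy as np


def background_point_mask(point_feats, bg_text_embeds, eps_bg=0.02):
    """Flag points whose best cosine similarity to a background query
    exceeds eps_bg (assumed form of the thresholding).

    point_feats:    (N, D) per-point vision-language features.
    bg_text_embeds: (Q, D) text embeddings of the background queries.
    Returns a boolean array of shape (N,).
    """
    p = np.asarray(point_feats, dtype=float)
    t = np.asarray(bg_text_embeds, dtype=float)
    p = p / np.maximum(np.linalg.norm(p, axis=1, keepdims=True), 1e-12)
    t = t / np.maximum(np.linalg.norm(t, axis=1, keepdims=True), 1e-12)
    sims = p @ t.T  # (N, Q) cosine similarities
    return sims.max(axis=1) > eps_bg


def keep_proposal(bg_mask_in_box, r_bg=0.99):
    """Keep a box proposal unless its background-point ratio reaches r_bg.

    bg_mask_in_box: boolean array over the points inside the proposal.
    Empty proposals are dropped (an assumption of this sketch).
    """
    bg_mask_in_box = np.asarray(bg_mask_in_box, dtype=bool)
    if bg_mask_in_box.size == 0:
        return False
    return float(bg_mask_in_box.mean()) < r_bg


if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    bg = np.array([[1.0, 0.0]])  # toy embedding for one background query
    mask = background_point_mask(feats, bg)
    print(mask, keep_proposal(mask))
```

Under these assumptions, r_bg = 1.0 filters a proposal only when every point in it is background, which corresponds to the last row of the r_bg ablation.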
Figure 5. Comparison of our UP-VL with prior work MI-UP [36]. Comparatively, our UP-VL (a) localizes objects and classifies them, (b) detects both moving and static objects, (c) produces fewer false positives. Best viewed in color. Box colors: blue for vehicle, red for VRU, green for class-agnostic. Panels: columns (a), (b), (c); top row UP-VL, bottom row MI-UP [36], with static objects marked.

Table 5. Effect of the hyperparameters ϵ_bg and r_bg.

ϵ_bg   3D AP Veh  3D AP VRU  mAP
0.10   28.7       12.3       20.5
0.05   29.7       14.1       21.9
0.02   30.2       14.7       22.4
0.00   29.9       14.3       22.1

r_bg   3D AP Veh  3D AP VRU  mAP
50%    20.7       7.5        14.1
90%    27.3       11.1       19.2
99%    30.2       14.7       22.4
100%   30.1       14.6       22.3

Figure 6. Error analysis of false positives. Fractions of false positives that are caused by classification or localization errors. Our analysis covers two scenarios: detecting moving objects only and detecting objects in all motion states. We examine both the vehicle and VRU categories. Panels: Moving only (Vehicle, VRU) and All motion states (Vehicle, VRU). Legend: classification error (confusion with background), classification error (confusion with other objects), localization error. Slice percentages per panel: Moving only, Vehicle: 34.9%, 1.4%, 63.7%; Moving only, VRU: 31.7%, 0.2%, 68.1%; All motion states, Vehicle: 38.1%, 60.3%, 1.6%; All motion states, VRU: 50.2%, 49.4%, 0.4%.

Figure 7. Failure cases. (a) Detector fails to generate very large boxes for rare categories like “tram” although the point-wise semantic assignment is correct. (b) Text query of “truck” wrongly matches with a crane.

D. Error Analysis

D.1. Quantitative Analysis

Section 4 in the main paper discusses the overall accuracy of our open-vocabulary 3D object detectors. In this subsection, we delve deeper into the analysis by breaking down the errors. One significant type of error is false positive detections, which occur when the detected object does not correspond to any ground truth object, given evaluation thresholds. Following Hoiem et al. [15], we categorize false positives into three types. Localization error arises when a detected object belongs to the intended category but has a misaligned bounding box (0.1 < 3D IoU < 0.4). The remaining false positives, which have an IoU of at least 0.1 with a ground-truth object from a different category, are classified as confusion with other objects. All other false positives fall under the category of confusion with background. For each category, we count the “top-ranked” false positives among the most confident N detections, where N is selected to be half the quantity of ground truth objects in that category. Results are presented in Figure 6. It should be noted that given the decoupled design of our detector, the localization error can be linked to our class-agnostic bounding box prediction branch, and the classification error can be linked to our distillation branch. As can be seen, for moving objects (the left side of the figure), the localization error is the bottleneck in performance. In contrast, when we also consider the static objects (the right side of the figure), the share of the classification error noticeably increases. Moreover, as expected, we can see that confusion between the categories (vehicles vs. VRUs) accounts for a very small portion of the false positives. We believe this analysis sheds light on the bottlenecks for further improvements of the proposed approach.
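The three-way categorization described above can be written down as a small decision rule. The sketch below follows the stated IoU ranges; the data layout (a list of `(gt_category, iou)` pairs per detection, and detections as dictionaries with `score` and `is_true_positive` keys), the function names, and the use of integer division for "half the quantity" are assumptions made for illustration.

```python
def categorize_false_positive(det_category, gt_overlaps, lo=0.1, hi=0.4):
    """Categorize one false positive detection.

    det_category: predicted category of the detection.
    gt_overlaps:  iterable of (gt_category, iou3d) pairs for the ground
                  truth boxes that overlap the detection (hypothetical layout).
    """
    gt_overlaps = list(gt_overlaps)
    same = [iou for cat, iou in gt_overlaps if cat == det_category]
    other = [iou for cat, iou in gt_overlaps if cat != det_category]
    # Right category, misaligned box: 0.1 < 3D IoU < 0.4.
    if any(lo < iou < hi for iou in same):
        return "localization"
    # Overlaps a ground truth object of a different category (IoU >= 0.1).
    if any(iou >= lo for iou in other):
        return "confusion_with_other_objects"
    # Everything else.
    return "confusion_with_background"


def top_ranked_false_positives(detections, num_gt):
    """Return the false positives among the N most confident detections,
    with N set to half the number of ground truth objects in the category.

    detections: list of dicts with "score" and "is_true_positive" keys.
    """
    n = num_gt // 2  # assumption: round down when num_gt is odd
    top = sorted(detections, key=lambda d: d["score"], reverse=True)[:n]
    return [d for d in top if not d["is_true_positive"]]


if __name__ == "__main__":
    print(categorize_false_positive("vehicle", [("vehicle", 0.3)]))
    print(categorize_false_positive("vehicle", [("VRU", 0.2)]))
    print(categorize_false_positive("vehicle", []))
```

Checking the localization case first matches the order in the text, where confusion with other objects is defined over the false positives that remain after localization errors are removed.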
D.2. Qualitative Analysis

In the previous subsection, we performed quantitative error analysis on the available human annotations in the dataset. Here, we qualitatively present some error patterns of our method in the open-vocabulary setting where human annotations are unavailable. Figure 7 illustrates some real-world challenges in unsupervised open-vocabulary 3D detection. One type of failure case is the detector failing to generate a bounding box even though the point-wise cosine similarity has captured the correct semantics from the user’s query (e.g. “tram” in Figure 7). We believe this is because such large objects are rarely seen in the training data and our detector requires more unsupervised training data to confidently capture those objects. Another type of failure case is the mismatch between text queries and visual features for semantically similar concepts, as in the second example in Figure 7, where a text query of “truck” has matched with a crane. We hypothesize that this might be due to the similar appearance between cranes and construction trucks and the high co-occurrence of these two object types in the real world.