Multi-Class 3D Object Detection with Single-Class Supervision

Mao Ye§, Chenxi Liu¶†, Maoqing Yao¶, Weiyue Wang¶, Zhaoqi Leng¶, Charles R. Qi¶, Dragomir Anguelov¶

§The University of Texas at Austin   ¶Waymo   †Corresponding author

Abstract: While multi-class 3D detectors are needed in many robotics applications, training them with fully labeled datasets can be expensive in labeling cost. An alternative approach is to have targeted single-class labels on disjoint data samples. In this paper, we are interested in training a multi-class 3D object detection model while using these single-class labeled data. We begin by detailing the unique stance of our "Single-Class Supervision" (SCS) setting with respect to related concepts such as partial supervision and semi supervision. Then, based on the case study of training the multi-class version of Range Sparse Net (RSN), we adapt a spectrum of algorithms, from supervised learning to pseudo-labeling, to fully exploit the properties of our SCS setting, and perform extensive ablation studies to identify the most effective algorithm and practice. Empirical experiments on the Waymo Open Dataset show that proper training under SCS can approach or match full supervision training while saving labeling costs.

Fig. 1: Full supervision (a), where all objects are properly labeled, is ideal in training a multi-class detector. Semi supervision (b) is a classic setting where each image is either fully labeled or not labeled. Single-class supervision (c) is an under-explored setting that we study: there is only one class of objects labeled for each image.

I. INTRODUCTION

3D object detection is a core component in various robotics and autonomous driving applications. Existing public datasets [1], [2], [3], [4] often provide the labels for all K classes on all data, which enables what we call full supervision training (Figure 1a). However, such fully labeled datasets are not scalable in real-world applications at the industrial scale, given the cost of labeling. A sensible approach, then, is to perform targeted labeling for each class of interest, resulting in K datasets that are possibly non-overlapping. We term this the "Single-Class Supervision" (SCS) setting (Figure 1c).

Intuitively, the SCS setting enjoys many advantages, such as dedicated allocation of labeling resources (e.g. only label the "rare / hard" class and save the cost of labeling the "common / easy" class) and better control of class-specific performance metrics (e.g. label more data for a certain class if the model's performance on that class is not accurate enough). Under this setting, it is straightforward to train a single-class detector, which is common in 3D detection for autonomous driving applications [5], [6], [7], [8], [9], [10], [11]. However, the training protocol becomes unclear when training a multi-class detector: since every training example has incomplete labels, even conducting supervised learning becomes challenging¹.

¹ When objects from different classes tend not to co-exist, SCS is easy as it is essentially fully supervised. But in this paper we target the opposite.

In this paper, we describe ways to learn under the SCS setting based on the case study of training a multi-class Range Sparse Net (RSN) [11], which is a state-of-the-art 3D object detection model. The RSN model is a two-stage detector. It performs segmentation as its first stage, which makes our solutions potentially generalizable to other tasks such as semantic segmentation as well. We find that it is critical to correctly handle the missing label property of SCS, and propose an Informed supervision scheme that makes the most out of the (partial) labels provided.
Based on the Informed supervision scheme, from simple to complex, we adapt supervised learning algorithms (Section IV) and pseudo-labeling algorithms (Section V) under the SCS setting. Other important training techniques, including dataset resampling and combining pseudo labels from different sources, are also studied. On the Waymo Open Dataset [4], we conduct experiments to demonstrate that proper modeling under SCS can approach or match the full supervision scenario, showing its practicality and promise in saving (re)labeling cost.

II. DEFINING & SITUATING SINGLE-CLASS SUPERVISION

We begin by considering object detection. Let M be the number of images (range images for the 3D case), and K be the number of classes in the dataset. Each image, indexed by i, contains N_i objects of various classes, and N = ∑_{i=1}^{M} N_i is the total number of objects in the dataset.

A. Full Supervision

Full supervision is the situation where all N objects are labeled. This is of course ideal, and the model performance under full supervision can be considered the upper bound to any of the partial supervision situations discussed next.

B. Partial Supervision

We use the term partial supervision to summarize all situations where n < N objects are labeled. While the partiality can arise from multiple aspects, there are several important special cases of partial supervision, including semi supervision and (our) single-class supervision.

1) Semi Supervision: Research on semi-supervised learning has mostly been conducted in the image classification setting [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], with interest in segmentation and object detection only rising recently [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. We follow this latter set of works and characterize the semi supervision setting as: (upon sorting,) for a certain integer S, full labels are provided for images with indices 1 ≤ i ≤ S, and no labels for images with indices S < i ≤ M.

2) Single-Class Supervision: We characterize our single-class supervision setting as: each image i provides labels for only one of the K classes, while objects of the remaining classes are left unlabeled (Figure 1c). Note that SCS is meaningful for tasks where objects of multiple classes can co-exist in the same image, such as detection and segmentation; for image classification, where each image contains a single object², SCS would collapse into full supervision.

² Admittedly there may be more than one object in an image used for classification, but here we make this simplification to illustrate the collapse.

III. MULTI-CLASS RANGE SPARSE NET

Our multi-class 3D object detector is based on Range Sparse Net (RSN) [11], which is a state-of-the-art single-class detector. In this section, we recap its architecture and describe how we extend it to the multi-class setting.

A. Architecture

RSN is a two-stage detector. In the first stage, the model segments LiDAR range image pixels into two categories: background and foreground. In the second stage, points classified as foreground are voxelized and fed into a sparse convolution network followed by a detection head to predict the 3D bounding boxes.

The RSN architectures described in [11] perform single-class 3D object detection. We extend RSN to multi-class by sharing the first stage among all K classes, but keeping individual second stages for each class. This means the foreground point selection stage is now a (K+1)-way instead of a (1+1) = 2-way segmentation, with index 0 being the background. In the second stage, only points that are classified as foreground class k are voxelized and fed into the corresponding layers. We choose not to share the second stage, as different object classes may call for different voxelization granularities (e.g. pedestrians require a finer resolution than vehicles).
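To make the extension concrete, the sketch below traces this two-stage data flow: one shared (K+1)-way segmentation stage, then one detection head per class. It is a minimal illustration under our own naming, not RSN's actual implementation; the function names, shapes, and the callables passed in are all hypothetical.

```python
import numpy as np

K = 2  # e.g. vehicle and pedestrian; segmentation index 0 is background

def multiclass_rsn_forward(range_image, shared_seg_stage, second_stages):
    """Hypothetical data flow of the multi-class RSN described above.

    range_image:      (H, W, C) LiDAR range image.
    shared_seg_stage: callable, returns (H, W, K+1) per-pixel class probabilities.
    second_stages:    list of K callables, one detection head per class
                      (each may use its own voxelization granularity).
    """
    probs = shared_seg_stage(range_image)    # shared (K+1)-way first stage
    pixel_class = probs.argmax(axis=-1)      # 0 = background, k = class k
    detections = []
    for k in range(1, K + 1):
        # Only points classified as foreground class k reach head k.
        fg_points = range_image[pixel_class == k]
        detections.append(second_stages[k - 1](fg_points))
    return detections
```

Keeping per-class second stages lets each head choose its own voxel size, at the cost of K separate heads; only the first stage's computation is shared.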
B. Losses for Training RSN

Below we describe how we extend the single-class training losses in RSN [11] to the multi-class setting.

1) Foreground Point Selection: In a training sample (X_ri, Y_bbox), X_ri ∈ R^{H×W×3} is the LiDAR range image, and Y_bbox = ∪_{k=1}^{K} Y^k_bbox is the union of the labeled 3D bounding boxes of all K classes. By examining whether a LiDAR point lies within any of the labeled 3D bounding boxes, we can obtain a ground truth segmentation label y^k_{ri,i,j} for every pixel (i,j) of the range image. The first stage is then trained as a (K+1)-way classification with a focal loss [40]:

  L_seg = ∑_{i,j} L_seg,i,j,   L_seg,i,j = ∑_{k=0}^{K} (1 − p̂^k_{i,j})^γ log(p̂^k_{i,j}) · y^k_{ri,i,j},

where p̂^k_{i,j} is the predicted probability of pixel (i,j) belonging to class k, and γ is the focusing hyper-parameter.

2) Box Regression: For each class, the detection head predicts a per-voxel heatmap together with box shape parameters.

a) Heatmap Loss: The heatmap ground truth y^k_hm,v is derived from the labeled boxes of class k (close to 1 near object centers), and the predicted heatmap is trained with a penalty-reduced focal loss (a code sketch follows at the end of this subsection):

  L^k_hm,v = (1 − ĥ^k_v)^α log(ĥ^k_v) · I{y^k_hm,v > 1−ε} + (1 − y^k_hm,v)^β (ĥ^k_v)^α log(1 − ĥ^k_v) · I{y^k_hm,v ≤ 1−ε},

where ĥ^k_v denotes the predicted heatmap value for class k at voxel v, ε is added for numerical stability, and α and β are focusing hyper-parameters.

b) Shape Regression Loss: For any y^k_hm,v higher than a threshold (i.e. close to an object center), a bin loss [41] is used to regress the object's heading, and the other box shape parameters are directly regressed under smooth L1 losses.
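As a concrete reference, here is a NumPy sketch of the per-class heatmap loss above, written as the negated log-likelihood that is actually minimized. The vectorized layout and the default values of alpha, beta, and eps are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def heatmap_focal_loss(h_pred, y_hm, alpha=2.0, beta=4.0, eps=1e-3):
    """Penalty-reduced focal loss L^k_hm for one class (a sketch).

    h_pred: (V,) predicted heatmap values in (0, 1), one per voxel.
    y_hm:   (V,) heatmap ground truth, close to 1 near object centers.
    """
    is_center = y_hm > 1.0 - eps  # I{y > 1 - eps}: voxels treated as centers
    pos = ((1.0 - h_pred) ** alpha) * np.log(h_pred + 1e-12)
    neg = ((1.0 - y_hm) ** beta) * (h_pred ** alpha) * np.log(1.0 - h_pred + 1e-12)
    # The text states the log-likelihood terms; training minimizes the negative.
    return -np.sum(np.where(is_center, pos, neg))
```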
IV. ADAPTATION FOR SUPERVISED LEARNING

In this section, we discuss how to use single-class labels to train a multi-class detector, which requires adapting supervised learning, more specifically the detector training losses. In the next section, we will discuss how we can leverage pseudo labels to further improve the training effectiveness.

A. Loss Modification

1) Foreground Point Selection: Consider a range image which contains both vehicles and pedestrians, but only pedestrians are labeled. Although we can ensure that the pixels labeled as pedestrian have accurate labels, the pixels that are background-to-pedestrian might actually belong to vehicles rather than to the real background (see Figure 2 for an illustration). It is critical to handle those pixels with uncertain labels properly. We propose and study several strategies.

Fig. 2: A range image with only the pedestrian class labeled. Pixels labeled as pedestrian (red box) have accurate labels. However, the other pixels have missing labels: they may belong to background (blue dashed box) or vehicle (yellow dashed box), since the vehicle class is unlabeled.

a) Aggressive Supervision: Aggressive supervision is a simple approach that ignores the fact that pixels that are background-to-class-k may not be the real background. It simply trains the model as if all the pixels were correctly labeled. This can still be practical, especially when the objects are sparse and thus most of the pixels labeled as background are truly background. However, this solution essentially injects wrong information into the training, which can be harmful.

b) Conservative Supervision: An alternative to the Aggressive approach is a conservative approach that does not produce loss on non-foreground pixels. Specifically, given the segmentation label Y_ri, we define the set of pixels that have missing labels and train the model using the modified segmentation loss:

  U_pixel = {(i,j) : ∑_{k=1}^{K} y^k_{ri,i,j} = 0},
  L^conservative_seg = ∑_{i,j} L_seg,i,j · I{(i,j) ∉ U_pixel}.

Although this conservative approach avoids injecting wrong information, all the background pixels are masked out for training. As a result, none of the pixels will be classified as background, which would lead to a difficult second stage.

c) Informed Supervision: Although the Aggressive and Conservative approaches are both reasonable, they suffer from flaws: injecting wrong supervision, or lacking supervision on the background pixels. To improve, we propose a third approach that uses all the pixels for training and does not inject wrong supervision, which we name Informed supervision (all three schemes are sketched in code at the end of this section).

The idea is derived from the principle of maximum likelihood estimation. Let U_class be the set of classes that are not labeled in the current data. For pixels in U_pixel, although we do not know their exact ground truth labels, we know that they must not belong to any class that has ground truth labels. Thus they can only be of a class in U_class or class 0 (background). For these pixels with missing labels, we transform the original (K+1)-way classification problem into a (K+1−|U_class|)-way classification problem by summing the predicted probabilities of the classes in U_class and class 0. The modified loss is as follows:

  L^informed_seg,i,j = ∑_{k ∉ U_class} (1 − p̃^k_{i,j})^γ log(p̃^k_{i,j}) · ỹ^k_{ri,i,j}   if (i,j) ∈ U_pixel;   L_seg,i,j   o/w,

where

  p̃^k_{i,j} = ∑_{c ∈ {0} ∪ U_class} p̂^c_{i,j}   if k = 0;   p̂^k_{i,j}   o/w,
  ỹ^k_{ri,i,j} = I{(i,j) ∈ U_pixel}   if k = 0;   y^k_{ri,i,j}   o/w.

2) Box Regression: Adapting the box regression losses is straightforward. To train the detection head of object class k, we only use the input data that labels class k:

  L_hm = ∑_{k=1}^{K} ∑_v L^k_hm,v · I{k ∉ U_class}.

Note that the shape regression loss is automatically masked out when there are no bounding boxes of the corresponding class, so no extra modification is required for this loss.
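The sketch below implements the three first-stage schemes, following the equations above. The flat (P, K+1) pixel layout, the function name, and the use of softmax probabilities rather than raw logits are simplifying assumptions for illustration.

```python
import numpy as np

def seg_loss_scs(probs, labels, unlabeled_classes, scheme="informed", gamma=2.0):
    """Segmentation loss under single-class supervision (a sketch).

    probs:  (P, K+1) per-pixel class probabilities; class 0 is background.
    labels: (P,) ground truth ids in {0..K}; under SCS, 0 marks both true
            background and missing-label pixels (they are indistinguishable).
    unlabeled_classes: U_class, the class ids unlabeled in this sample.
    """
    u_pixel = labels == 0  # U_pixel: pixels whose label is missing
    p_label = probs[np.arange(len(labels)), labels]
    keep = np.ones(len(labels), dtype=bool)
    if scheme == "aggressive":      # pretend missing-label pixels are background
        p_target = p_label
    elif scheme == "conservative":  # drop all loss on missing-label pixels
        p_target, keep = p_label, ~u_pixel
    else:                           # informed: merge {0} ∪ U_class into one bucket
        merged = [0] + sorted(unlabeled_classes)
        p_merged = probs[:, merged].sum(axis=1)
        p_target = np.where(u_pixel, p_merged, p_label)
    focal = -((1.0 - p_target) ** gamma) * np.log(p_target + 1e-12)
    return np.sum(focal * keep)
```

Only the Informed branch supervises missing-label pixels without forcing them to be background: the model merely has to place its probability mass somewhere in {0} ∪ U_class, which is exactly what maximum likelihood allows under missing labels.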
V. ADAPTATION FOR LEARNING WITH PSEUDO LABELS

In this section, we improve the multi-class detector performance by leveraging pseudo labels.

A. Pseudo Label Generating Strategy

Below we discuss several options to generate pseudo labels to augment the K single-class datasets.

1) Self Labeling: The self labeling approach is adapted from FixMatch [21] for image classification, in which we generate the pseudo labels using the model itself. During training, we feed the data into the multi-class RSN and generate predicted classes for range image pixels and bounding boxes. Predictions with high confidence are saved as pseudo labels. Notice that we need to generate two realizations of data augmentation of the same input to prevent the model from degenerating into a trivial solution.

2) Teacher Labeling: Similarly, we can generate pseudo labels using K well-trained teacher detectors, each being a standard RSN that detects a single class. The teacher model for class k is trained using only the portion of the data that labels class k. We use the corresponding teacher model to generate pseudo labels for the unlabeled classes.

3) Integrated Labeling: We can also combine the two approaches above. For segmentation, we generate the pixel classes based on the ensemble predictions from the teachers and the trained model with equal weight. For bounding boxes, we combine the predicted boxes from all models and filter out overlapping boxes by Non-Maximum Suppression.

B. Incorporating Pseudo Labels

We follow [21] and only include pseudo labels whose confidence scores are above a threshold. Specifically, the final label for the range image is

  ŷ^k_{ri,i,j} = I{k = argmax_{c ∈ {0} ∪ U_class} p̂^c_{i,j}}   if (i,j) ∈ U_pixel;   y^k_{ri,i,j}   o/w,

where p̂^c_{i,j} is the predicted score of class c from the model that generates the pseudo labels. Analogous to U_pixel, we maintain a set C_pixel to indicate pixels with "non-trustworthy labels", i.e., pixels that have neither a ground truth label nor a high-confidence pseudo label:

  C_pixel = {(i,j) : max_c p̂^c_{i,j} < τ_pixel and (i,j) ∈ U_pixel},

where τ_pixel is the threshold that decides whether a prediction is confident enough to be regarded as high quality.

Similarly, the final bounding box labels are the union of the original ground truth boxes and the generated pseudo bounding boxes with scores higher than a threshold τ_bbox:

  Ŷ^k_bbox = {b : score(b,k) ≥ τ_bbox}   if k ∈ U_class;   Y^k_bbox   o/w.

Here score(b,k) is the score of the pseudo bounding box b of class k. Notice that the pseudo heatmap can be generated accordingly using the pseudo bounding boxes.
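Below is a small sketch of this label fusion step, mirroring the two thresholding rules above. The container layouts and the names tau_pixel and tau_bbox are illustrative assumptions.

```python
import numpy as np

def fuse_pixel_labels(probs, labels, unlabeled_classes, tau_pixel):
    """Fill missing pixel labels with confident pseudo labels (a sketch).

    Returns (fused_labels, c_pixel), where c_pixel flags pixels that have
    neither a ground truth label nor a confident pseudo label (C_pixel).
    """
    u_pixel = labels == 0
    candidates = np.array([0] + sorted(unlabeled_classes))
    pseudo = candidates[probs[:, candidates].argmax(axis=1)]
    fused = np.where(u_pixel, pseudo, labels)
    c_pixel = u_pixel & (probs.max(axis=1) < tau_pixel)
    return fused, c_pixel

def fuse_box_labels(gt_boxes, pseudo_boxes, unlabeled_classes, tau_bbox):
    """Per class k: keep confident pseudo boxes only where labels are missing.

    gt_boxes: dict k -> list of boxes; pseudo_boxes: dict k -> (box, score) pairs.
    """
    return {k: [b for b, s in pseudo_boxes[k] if s >= tau_bbox]
               if k in unlabeled_classes else gt_boxes[k]
            for k in gt_boxes}
```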
C. Loss Modification

1) Foreground Point Selection: Pseudo labels help reduce the number of missing-label pixels U_pixel, but there are still pixels with non-trustworthy labels, C_pixel. Conveniently, the three schemes described in Section IV-A.1 therefore all still apply, simply by replacing U_pixel with C_pixel. Notice that under self labeling, the Conservative approach is still not suitable, as it will still classify all points as foreground.

2) Box Regression: With the pseudo labeled bounding boxes, the data whose classes in U_class were originally unlabeled can now be utilized to train the detection heads of those classes. Given the pseudo labeled bounding boxes, similar to the supervised learning discussed in Section IV-A.2, we do not need to modify the shape regression other than simply replacing the ground truth box labels by the pseudo box labels. For the heatmap loss, voxels that fall within any box belonging to Ŷ^k_bbox have trustworthy heatmap ground truth values. However, we cannot be sure whether the remaining voxels are background or within a bounding box of some class u ∈ U_class. Due to such uncertainty, the heatmap loss needs special design and treatment. Similar to Section IV-A.1, we propose three schemes for modifying the heatmap loss.

a) Aggressive Supervision: Aggressive supervision simply ignores the fact that some of the background voxels may actually belong to an object, and trains the model as if the pseudo boxes were perfect. Its performance can be sensitive to the quality of the pseudo labels, and it may inject wrong information into the training.

b) Conservative Supervision: Following the same philosophy as before, Conservative supervision masks out losses on voxels without trustworthy labels. As a result, we only train on voxels with foreground labels. The heatmap loss becomes

  L̂^conservative_hm = ∑_{k=1}^{K} ∑_v L̂^k_hm,v · I{v ∈ U^k_hm or k ∉ U_class}.

Here U^k_hm = {v : ∃ b ∈ Ŷ^k_bbox s.t. v is within b} denotes the set of voxels that fall within one of the trustworthy bounding boxes, and L̂^k_hm,v is calculated based on Ŷ^k_bbox as opposed to Y^k_bbox.

c) Informed Supervision: The Informed supervision we derived for the segmentation loss is not directly applicable to the heatmap loss. The main reason is that the segmentation is multi-class and mutually exclusive (one out of K), while the heatmap loss is single-class (k vs. not k). However, the same spirit remains. Here we utilize a special property of 3D detection: different from 2D detection, where each pixel might belong to different objects (since objects can overlap with each other when projected onto a 2D image), in a 3D point cloud we can assume that each point belongs to only one class. That means for the detection head of class k, if a voxel belongs to any other foreground class, its heatmap value for class k must be 0, as this voxel cannot belong to any bounding box of class k. This gives the following modified heatmap loss:

  L̂^informed_hm = ∑_{k=1}^{K} ∑_v L̂^k_hm,v · I{v ∈ U_hm or k ∉ U_class},
  U_hm = {v : ∃ b ∈ ∪_{k=1}^{K} Ŷ^k_bbox s.t. v is within b},

where, different from the Conservative supervision above, the construction of U_hm utilizes the information of the (pseudo) bounding boxes of all classes (without the superscript k).
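The sketch below builds the heatmap-loss masks for the Conservative and Informed schemes. The box-membership test and the dict layout are assumed for illustration.

```python
import numpy as np

def heatmap_loss_mask(voxel_xyz, boxes_by_class, labeled_classes,
                      point_in_box, scheme="informed"):
    """Per-class voxel masks for the pseudo-label heatmap loss (a sketch).

    voxel_xyz:       (V, 3) voxel center coordinates.
    boxes_by_class:  dict k -> list of trustworthy (ground truth or confident
                     pseudo) boxes of class k, i.e. the sets Yhat^k_bbox.
    labeled_classes: classes with ground truth labels (those not in U_class).
    point_in_box(voxel_xyz, box) -> (V,) boolean membership test.
    """
    num_voxels = len(voxel_xyz)
    in_box = {k: np.zeros(num_voxels, dtype=bool) for k in boxes_by_class}
    for k, boxes in boxes_by_class.items():
        for box in boxes:
            in_box[k] |= point_in_box(voxel_xyz, box)  # U^k_hm
    # U_hm: inside *any* class's box. A 3D point belongs to at most one
    # object, so a voxel in a class-j box is a certified negative for class k.
    u_hm = np.any(np.stack(list(in_box.values())), axis=0)
    masks = {}
    for k in boxes_by_class:
        if k in labeled_classes:
            masks[k] = np.ones(num_voxels, dtype=bool)  # fully supervised
        else:
            masks[k] = in_box[k] if scheme == "conservative" else u_hm
    return masks
```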
Fig. 3: Foreground point selection under different schemes: (a) Aggressive supervision, (b) Conservative supervision, (c) Informed supervision. Boxes are the ground truth. Points classified as background, vehicle, and pedestrian are colored blue, yellow, and red. The Aggressive scheme misclassifies many object points as background. The Conservative scheme would prompt all points to be predicted as foreground. The Informed scheme gives accurate predictions, despite never having both vehicle and pedestrian labels on the same training data.

VI. EXPERIMENTAL RESULTS

We perform experiments on the Waymo Open Dataset (WOD) [4], specifically detecting vehicles and pedestrians (i.e., K = 2). WOD provides 798 training sequences with all K classes labeled, so we create our single-class supervision scenario by first dividing these sequences into two disjoint sets, and then masking out the labels of either class on the corresponding set. Therefore, our experiments correspond to roughly 50% labeling cost savings. For the division of sequences, we consider two challenging and imbalanced settings: 10%-90% and 5%-95%. We denote the setting where x% of the sequences contain only vehicle labels and (100−x)% contain only pedestrian labels as #V/#P = x/(100−x). The evaluation on the validation set remains unchanged from the standard setting.

We employ the same training protocol for all experiments, tuned based on the adapted supervised learning. We train with batch size 64 for 80K iterations with 4K warmup. The number of channels in each layer of our multi-class RSN is 3/4 of that of CarL_1f / PedL_1f in [11], to reduce the memory cost. The remaining hyper-parameters and data augmentation (random flipping and rotation are applied) are the same as those in [11]. For self labeling and integrated labeling, we do not apply augmentation when generating pseudo labels using the multi-class RSN. Since the sizes of the two subsets are quite different, we also apply dataset resampling such that each data point in the mini-batch is drawn from the two subsets with equal probability.

A. Algorithm Comparison

We compare different methods for learning the multi-class detector from single-class labels in Table I. The evaluation metric is 3D detection AP at L1 difficulty. We consider the supervised learning algorithm with the Aggressive scheme to be our baseline (as we are in a new problem setting, there is no prior baseline from other works), and consider the performance of our multi-class RSN when trained on the unmasked, fully labeled data (which does not belong to SCS) to be the performance upper bound.

TABLE I: Comparing algorithms and schemes developed for single-class supervision (SCS). The two adjacent numbers are the detection AP and its gap to the full label upper bound. In the original, the best SCS result per column is bolded: Teacher Label for Vehicle at #V/#P=90/10, and Integrated Label everywhere else. Notice that the last row uses more labels and does not belong to SCS, hence these numbers are unfair comparisons.

Algorithm        | Scheme     | #V/#P=90/10           | #V/#P=10/90          | #V/#P=95/5            | #V/#P=5/95
                 |            | Vehicle    Pedestrian | Vehicle   Pedestrian | Vehicle    Pedestrian | Vehicle    Pedestrian
Supervised       | Aggressive | 70.2 -1.8  53.5 -18.9 | 63.5 -8.5  68.3 -4.1 | 69.8 -2.2  28.9 -43.5 | 18.4 -53.6  69.2 -3.2
Supervised       | Informed   | 70.2 -1.8  59.3 -13.1 | 65.4 -6.6  71.6 -0.8 | 70.7 -1.3  46.0 -26.4 | 61.8 -10.2  72.5 +0.1
Self Label       | Informed   | 70.2 -1.8  62.1 -10.3 | 66.0 -6.4  72.0 -0.4 | 70.9 -1.1  47.3 -25.1 | 64.1 -7.9   72.3 -0.1
Teacher Label    | Informed   | 71.7 -0.3  67.1 -5.3  | 68.8 -3.6  73.2 +0.8 | 71.3 -0.7  57.8 -14.6 | 66.0 -6.0   71.7 -0.8
Integrated Label | Informed   | 71.6 -0.4  68.5 -3.9  | 69.0 -3.0  73.8 +1.4 | 71.5 -0.5  59.5 -12.9 | 66.3 -5.7   73.2 +0.8
Full Label       | -          | 72.0  0.0  72.4  0.0  | 72.0  0.0  72.4  0.0 | 72.0  0.0  72.4  0.0  | 72.0  0.0   72.4  0.0

Within supervised learning, Informed supervision outperforms the Aggressive supervision baseline, showing the superiority of proper SCS modeling. As is evident from Figure 3, Aggressive supervision misclassifies many object points as background, and by the design of RSN those points will not be included in proposing bounding boxes. By comparison, Informed supervision gives accurate predictions for all three kinds of points in Figure 2, even though there are no points explicitly labeled as background. We do not include quantitative results for Conservative supervision, as all points are classified as foreground, causing the training of RSN's second stage to fail, though we can still visualize its first stage in Figure 3.

We then use the Informed scheme to compare supervised learning and pseudo labeling. We observe that the adapted pseudo labeling gives significant improvement over adapted supervised learning. Among the pseudo labeling algorithms, integrated labeling delivers the best performance, as expected. Teacher labeling outperforms the self labeling approach, where the teacher models are single-class detectors trained only on the subset of sequences labeled with the corresponding class. One reason may be that the different object features required by vehicle and pedestrian cause conflicts in the hidden representation given the limited model capacity, and thus the pseudo labels generated by the two single-class detectors are of better quality than those generated by the multi-class detector.

The fact that detecting vehicles and pedestrians may be conflicting may also explain the occasional cases where a pseudo labeling algorithm surpasses the full label upper bound in detecting one class (e.g. at #V/#P=90/10, teacher and integrated labeling outperform the full label upper bound in detecting pedestrians).
In cases where both classes have sufficient labels, the two tasks compete with each other and the model learns a representation that performs well for both tasks. In contrast, when there is not enough information for the model to learn a strong representation for vehicles, the learned representation will favor detecting pedestrians.

B. Ablations on Modeling Missing Label

Under the teacher labeling algorithm (#V/#P = 90/10), we ablate the effects of the three schemes we proposed (Aggressive, Conservative, Informed) on either the segmentation loss or the heatmap loss. The results are summarized in Table II. Employing Informed supervision for both the segmentation and the heatmap loss results in the best performance, as Informed supervision best exploits the trustworthy label information. Aggressive supervision is worse than the Conservative approach, and this holds for both the segmentation loss and the heatmap loss. We believe this shows that the negative influence of injecting wrong information outweighs the benefit of injecting more information.

TABLE II: Comparing different schemes for handling the segmentation / heatmap loss with missing labels. The two adjacent numbers are the detection AP and its gap to employing Informed supervision for both losses.

Segmentation | Heatmap      | Vehicle   | Pedestrian
Aggressive   | Informed     | 69.3 -3.4 | 54.1 -17.0
Conservative | Informed     | 70.8 -0.9 | 65.6 -1.5
Informed     | Aggressive   | 65.6 -6.1 | 48.8 -18.3
Informed     | Conservative | 71.0 -0.7 | 66.3 -0.8
Informed     | Informed     | 71.7  0.0 | 67.1  0.0

C. Dataset Resampling

After analyzing Table I vertically, we now analyze it horizontally. The general trend is that when a class has less data, the AP on this class is lower and has more room to improve, which is expected. However, there are a few occasions where having more data for a class results in worse AP, for example detecting vehicles using teacher labeling at #V/#P=90/10 vs. #V/#P=95/5 (71.7 vs. 71.3). We found the cause to be dataset resampling. In Figure 4 we vary the probability of sampling images from the dominant vehicle class when #V/#P=90/10 or #V/#P=95/5 (the default is 50%, i.e., equal probability). Towards the right side of the figure, when we sample predominantly from the vehicle class even though the amount of labeled vehicle sequences is already dominant, the performance on the minority pedestrian class worsens drastically. On the other hand, towards the left side, when we sample predominantly from the minority pedestrian class, the model sees too many repetitive pedestrians and not enough vehicle epochs, degrading the performance on both tasks. Therefore, it is critical to find the middle sweet spot.

Fig. 4: Detection AP with varying dataset resampling probability. P and V in the legend stand for Pedestrian and Vehicle.
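A minimal sketch of the resampling step discussed above: every mini-batch element independently picks a subset first, then an example within it. The names and the uniform within-subset draw are assumptions.

```python
import numpy as np

def resampled_batch(vehicle_only, pedestrian_only, batch_size,
                    p_vehicle=0.5, rng=None):
    """Draw each batch element from the vehicle-only subset with probability
    p_vehicle, regardless of the two subsets' raw sizes. The default 0.5 is
    the paper's equal-probability setting; Figure 4 sweeps this value."""
    rng = rng or np.random.default_rng()
    batch = []
    for _ in range(batch_size):
        src = vehicle_only if rng.random() < p_vehicle else pedestrian_only
        batch.append(src[rng.integers(len(src))])
    return batch
```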
VII. CONCLUSION

We study training a multi-class 3D object detector under the "single-class supervision" learning setting, by proposing and benchmarking various baselines and strategies. Notably, our proposed Informed supervision combined with pseudo labeling can approach or match the upper bound that is full supervision, saving significant labeling costs.

For future work, we plan to expand the applications of SCS. Within 3D object detection, we plan to experiment with more classes and more diverse architectures. Beyond 3D object detection, we plan to expand to other tasks such as 2D object detection, and move from multi-class to multi-task (e.g. joint detection and segmentation).

REFERENCES

[1] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3354–3361.
[2] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, "The ApolloScape dataset for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 954–960.
[3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9297–9307.
[4] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., "Scalability in perception for autonomous driving: Waymo Open Dataset," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[5] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
[6] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, 2018.
[7] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "LaserNet: An efficient probabilistic 3D object detector for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12677–12686.
[8] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[9] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, "End-to-end multi-view fusion for 3D object detection in LiDAR point clouds," in Conference on Robot Learning. PMLR, 2020, pp. 923–932.
[10] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "PV-RCNN: Point-voxel feature set abstraction for 3D object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10529–10538.
[11] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, "RSN: Range sparse net for efficient, accurate LiDAR 3D object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5725–5734.
[12] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in Advances in Neural Information Processing Systems, 2005, pp. 529–536.
[13] D.-H. Lee et al., "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in Workshop on Challenges in Representation Learning, ICML, vol. 3, no. 2, 2013, p. 896.
[14] S. Laine and T. Aila, "Temporal ensembling for semi-supervised learning," arXiv preprint arXiv:1610.02242, 2016.
[15] M. Sajjadi, M. Javanmardi, and T. Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," Advances in Neural Information Processing Systems, vol. 29, pp. 1163–1171, 2016.
[16] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," arXiv preprint arXiv:1703.01780, 2017.
[17] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.
[18] I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan, "Billion-scale semi-supervised learning for image classification," arXiv preprint arXiv:1905.00546, 2019.
[19] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel, "ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring," arXiv preprint arXiv:1911.09785, 2019.
[20] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, "Self-training with noisy student improves ImageNet classification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10687–10698.
[21] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," arXiv preprint arXiv:2001.07685, 2020.
[22] G. French, S. Laine, T. Aila, M. Mackiewicz, and G. Finlayson, "Semi-supervised semantic segmentation needs strong, varied perturbations," arXiv preprint arXiv:1906.01916, 2019.
[23] Y. Zou, Z. Zhang, H. Zhang, C.-L. Li, X. Bian, J.-B. Huang, and T. Pfister, "PseudoSeg: Designing pseudo labels for semantic segmentation," arXiv preprint arXiv:2010.09713, 2020.
[24] J. Kim, J. Jang, and H. Park, "Structured consistency loss for semi-supervised semantic segmentation," arXiv preprint arXiv:2001.04647, 2020.
[25] Y. Ouali, C. Hudelot, and M. Tami, "Semi-supervised semantic segmentation with cross-consistency training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12674–12684.
[26] X. Chen, Y. Yuan, G. Zeng, and J. Wang, "Semi-supervised semantic segmentation with cross pseudo supervision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2613–2622.
[27] Y. Zhu, Z. Zhang, C. Wu, Z. Zhang, T. He, H. Zhang, R. Manmatha, M. Li, and A. Smola, "Improving semantic segmentation via self-training," arXiv preprint arXiv:2004.14960, 2020.
[28] Z. Feng, Q. Zhou, G. Cheng, X. Tan, J. Shi, and L. Ma, "Semi-supervised semantic segmentation via dynamic self-training and class-balanced curriculum," arXiv preprint arXiv:2004.08514, vol. 1, no. 2, p. 5, 2020.
[29] C. Rosenberg, M. Hebert, and H. Schneiderman, "Semi-supervised self-training of object detection models."
[30] J. Jeong, S. Lee, J. Kim, and N. Kwak, "Consistency-based semi-supervised learning for object detection," Advances in Neural Information Processing Systems, vol. 32, pp. 10759–10768, 2019.
[31] Y. S. Tang and G. H. Lee, "Transferable semi-supervised 3D object detection from RGB-D data," 2019.
[32] P. Tang, C. Ramaiah, Y. Wang, R. Xu, and C. Xiong, "Proposal learning for semi-supervised object detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2291–2301.
[33] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, and T. Pfister, "A simple semi-supervised learning framework for object detection," arXiv preprint arXiv:2005.04757, 2020.
[34] Q. Yang, X. Wei, B. Wang, X.-S. Hua, and L. Zhang, "Interactive self-training with mean teachers for semi-supervised object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5941–5950.
[35] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, "Unbiased teacher for semi-supervised object detection," in International Conference on Learning Representations, 2021.
[36] N. Zhao, T.-S. Chua, and G. H. Lee, "SESS: Self-ensembling semi-supervised 3D object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11079–11087.
[37] W. Wei, P. Wei, and N. Zheng, "Semantic consistency networks for 3D object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2861–2869.
[38] H. Wang, Y. Cong, O. Litany, Y. Gao, and L. J. Guibas, "3DIoUMatch: Leveraging IoU prediction for semi-supervised 3D object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14615–14624.
[39] B. Caine, R. Roelofs, V. Vasudevan, J. Ngiam, Y. Chai, Z. Chen, and J. Shlens, "Pseudo-labeling for scalable 3D object detection," arXiv preprint arXiv:2103.02093, 2021.
[40] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[41] S. Shi, X. Wang, and H. P. Li, "3D object proposal generation and detection from point cloud," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 16–20.