Multi-Class 3D Object Detection with Single-Class Supervision

Mao Ye§, Chenxi Liu¶†, Maoqing Yao¶, Weiyue Wang¶, Zhaoqi Leng¶, Charles R. Qi¶, Dragomir Anguelov¶

§The University of Texas at Austin   ¶Waymo   †Corresponding author

Abstract— While multi-class 3D detectors are needed in many robotics applications, training them with fully labeled datasets can be expensive in labeling cost. An alternative approach is to have targeted single-class labels on disjoint data samples. In this paper, we are interested in training a multi-class 3D object detection model while using these single-class labeled data. We begin by detailing the unique stance of our “Single-Class Supervision” (SCS) setting with respect to related concepts such as partial supervision and semi supervision. Then, based on the case study of training the multi-class version of Range Sparse Net (RSN), we adapt a spectrum of algorithms — from supervised learning to pseudo-labeling — to fully exploit the properties of our SCS setting, and perform extensive ablation studies to identify the most effective algorithm and practice. Empirical experiments on the Waymo Open Dataset show that proper training under SCS can approach or match full supervision training while saving labeling costs.

I. INTRODUCTION

3D object detection is a core component in various robotics and autonomous driving applications. Existing public datasets [1], [2], [3], [4] often provide the labels for all K classes on all data, which enables what we call full supervision training (Figure 1a). However, such fully labeled datasets are not scalable in real-world applications at the industrial scale, given the cost of labeling. A sensible approach, then, is to perform targeted labeling for each class of interest, resulting in K datasets that are possibly non-overlapping. We term this the “Single-Class Supervision” (SCS) setting (Figure 1c). Intuitively the SCS setting enjoys many advantages, such as dedicated allocation of labeling resources (e.g. only label the “rare / hard” class and save the cost of labeling the “common / easy” class), and better control of class-specific performance metrics (e.g. label more data for a certain class if the model’s performance on that class is not accurate enough). Under this setting, it is straightforward to train a single-class detector, which is common in 3D detection for autonomous driving applications [5], [6], [7], [8], [9], [10], [11]. However, the training protocol becomes unclear when training a multi-class detector: since every training example has incomplete labels, even conducting supervised learning becomes challenging¹.

¹When objects from different classes tend not to co-exist, SCS is easy as it is essentially fully supervised. But in this paper we target the opposite.

Fig. 1: Full supervision (a), where all objects are properly labeled, is ideal in training a multi-class detector. Semi supervision (b) is a classic setting where each image is either fully labeled or not labeled. Single-class supervision (c) is an under-explored setting that we study. There is only one class of objects labeled for each image.

In this paper, we describe ways to learn under the SCS setting based on the case study of training a multi-class Range Sparse Net (RSN) [11], which is a state-of-the-art 3D object detection model. The RSN model is a two-stage detector.
It performs segmentation as its first stage, which makes our solutions potentially generalizable to other tasks such as semantic segmentation as well. We find that it is critical to correctly handle the missing label property of SCS, and propose an Informed supervision scheme that makes the most out of the (partial) labels provided. Based on the Informed supervision scheme, from simple to complex, we adapt supervised learning algorithms (Section IV) and pseudo-labeling algorithms (Section V) under the SCS setting. Other important training techniques, including dataset resampling and combining pseudo labels from different sources, are also studied. On the Waymo Open Dataset [4], we conduct experiments to demonstrate that proper modeling under SCS can approach or match the full supervision scenario, showing its practicality and promise in saving (re)labeling cost.

II. DEFINING & SITUATING SINGLE-CLASS SUPERVISION

We begin by considering object detection. Let M be the number of images (range images for the 3D case), and K be the number of classes in the dataset. Each image, indexed by i, contains N_i objects of various classes, and N = \sum_{i=1}^{M} N_i is the total number of objects in the dataset.

A. Full Supervision

Full supervision is the situation where all N objects are labeled. This is of course ideal, and the model performance under full supervision can be considered the upper bound to any of the partial supervision situations discussed next.

B. Partial Supervision

We use the term partial supervision to summarize all situations where n < N objects are labeled. While the partiality can come from multiple aspects, there are several important special cases of partial supervision, including semi supervision and (our) single-class supervision.

1) Semi Supervision: Research on semi-supervised learning has mostly been conducted on the image classification setting [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], with interest in segmentation and object detection only rising recently [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. We follow this last set of works and characterize the semi supervision setting as: (upon sorting,) for a certain integer S, full labels are provided for images with indices 1 ≤ i ≤ S, and no labels for images with indices S < i ≤ M.

2) Single-Class Supervision: We define (our) single-class supervision (SCS) as: for each image i, only objects from one class C_i ∈ {1, ..., K} are labeled. The collection of labeled classes {C_i} covers all K classes. We illustrate full supervision, semi supervision, and single-class supervision in Figure 1 for intuitive comparison.

3) Extensions to Single-Class Supervision: While this paper focuses on the single-class supervision defined above, there are natural extensions / relaxations. For example, we can relax the “only 1 class” constraint to having ≥ 1 classes, or having ≤ 1 classes. The latter implies that some images may have no labels at all, making it a hybrid between semi supervision and SCS. Under the framework of this section, image classification can be viewed as N_i = 1, ∀i². This effectively collapses semi supervision and SCS to the same setting.

²Admittedly there may be more than one object in an image used for classification, but here we make this simplification to illustrate the collapse.
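To make the setting concrete, below is a minimal sketch of how single-class labeled data could be derived from a fully labeled dataset by keeping, for each frame, only the boxes of one assigned class C_i (this also mirrors how we construct SCS from the Waymo Open Dataset in Section VI). The dict-based frame layout and the function name are illustrative assumptions, not part of any existing codebase.

# Minimal sketch of the single-class supervision (SCS) setting; field names are illustrative only.
import random
from typing import Dict, List

def make_scs_dataset(frames: List[Dict], num_classes: int, seed: int = 0) -> List[Dict]:
    """Keep, for each frame, only the boxes of one assigned class C_i.

    Each frame is assumed to look like:
      {"points": ..., "boxes": [{"cls": k, "box": ...}, ...]}
    """
    rng = random.Random(seed)
    scs_frames = []
    for frame in frames:
        c_i = rng.randrange(1, num_classes + 1)   # assigned labeled class C_i in {1, ..., K}
        kept = [b for b in frame["boxes"] if b["cls"] == c_i]
        scs_frames.append({
            "points": frame["points"],
            "boxes": kept,            # labels for class C_i only
            "labeled_class": c_i,     # every other class is "missing", not "absent"
        })
    return scs_frames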
III. MULTI-CLASS RANGE SPARSE NET

Our multi-class 3D object detector is based on Range Sparse Net (RSN) [11], which is a state-of-the-art single-class detector. In this section, we recap its architecture and describe how we extend it for the multi-class setting.

A. Architecture

RSN is a two-stage detector. In the first stage, the model segments LiDAR range image pixels into two categories: background and foreground. In the second stage, points classified as foreground are voxelized and fed into a sparse convolution network followed by a detection head to predict the 3D bounding boxes.

The RSN architectures described in [11] perform single-class 3D object detection. We extend RSN to multi-class by sharing the first stage among all K classes, but keeping individual second stages for each class. This means the foreground point selection stage is now a (K+1)-way instead of a (1+1) = 2-way segmentation, with index 0 being the background. In the second stage, only points that are classified as foreground class k are voxelized and fed into the corresponding layer. We choose not to share the second stage, as different object classes may call for different voxelization granularities (e.g. pedestrians require a finer resolution than vehicles).

B. Losses for training RSN

Below we describe how we extend the single-class training loss in RSN [11] to the multi-class setting.

1) Foreground Point Selection: In a training sample (X_{ri}, Y_{bbox}), X_{ri} \in \mathbb{R}^{H \times W \times 3} is the LiDAR range image, and Y_{bbox} = \cup_{k=1}^{K} Y^k_{bbox} is the union of the labeled 3D bounding boxes of all K classes. By examining whether a LiDAR point lies within any of the labeled 3D bounding boxes, we can generate the ground truth for foreground point selection Y_{ri} \in \mathbb{Z}^{H \times W \times (K+1)}, which is a one-hot vector for each pixel. Using i, j, k to index into row, column, and class,

L_{seg} = \sum_{i,j} L_{seg,i,j}, \quad L_{seg,i,j} = \sum_{k=0}^{K} (1 - \hat{p}^k_{i,j})^{\gamma} \log(\hat{p}^k_{i,j}) \, y^k_{ri,i,j},

where \hat{p}^k_{i,j} is the prediction logits of object class k at pixel (i, j), and \gamma is the focusing parameter of focal loss [40].

2) Box Regression: Two losses are involved in training the network to produce bounding boxes.

a) Heatmap Loss: The heatmap loss is used to train the network to locate centers of the objects. Since we have separate detection heads for different classes, we construct the ground truth heatmap value y^k_{hm,v} for each class k at the Cartesian coordinates of each voxel v.

L_{hm} = \sum_{k=1}^{K} \sum_{v} L^k_{hm,v},
L^k_{hm,v} = (1 - \hat{h}^k_v)^{\alpha} \log(\hat{h}^k_v) \, \mathbb{I}\{y^k_{hm,v} > 1 - \varepsilon\} + (1 - y^k_{hm,v})^{\beta} (\hat{h}^k_v)^{\alpha} \log(1 - \hat{h}^k_v) \, \mathbb{I}\{y^k_{hm,v} \le 1 - \varepsilon\},

where \hat{h}^k_v denotes the predicted heatmap value for class k at voxel v, \varepsilon is added for numerical stability, and \alpha and \beta are focusing hyper-parameters.

Fig. 2: A range image with only the pedestrian class labeled. Pixels labeled as pedestrian (red box) have accurate labels. However, the other pixels have missing labels: they may belong to background (blue dashed box) or vehicle (yellow dashed box) since the vehicle class is unlabeled.

b) Shape Regression Loss: For any y^k_{hm,v} higher than a threshold (i.e. close to an object center), a bin loss [41] is used to regress the object’s heading, and the other box shape parameters are directly regressed under smooth L1 losses.
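To make the extended losses concrete, here is a minimal NumPy sketch of the (K+1)-way segmentation focal loss and the per-class heatmap focal loss above. The array shapes, the explicit leading negative sign (the usual focal-loss convention), and the default hyper-parameter values are assumptions for illustration, not the exact training code.

# Sketch of the multi-class first-stage (segmentation) focal loss and the
# per-class heatmap focal loss; shapes and defaults are assumptions.
import numpy as np

def segmentation_focal_loss(p, y, gamma=2.0, eps=1e-7):
    """p: (H, W, K+1) predicted class probabilities; y: (H, W, K+1) one-hot labels."""
    p = np.clip(p, eps, 1.0)
    per_pixel = -np.sum(((1.0 - p) ** gamma) * np.log(p) * y, axis=-1)  # L_seg,i,j
    return per_pixel.sum()                                              # L_seg

def heatmap_focal_loss(h_hat, y_hm, alpha=2.0, beta=4.0, eps=1e-3, num_eps=1e-7):
    """h_hat, y_hm: (K, V) predicted / ground-truth heatmaps over voxels."""
    h_hat = np.clip(h_hat, num_eps, 1.0 - num_eps)
    pos = y_hm > 1.0 - eps                                   # voxels (near) object centers
    neg = ~pos
    loss_pos = -((1.0 - h_hat) ** alpha) * np.log(h_hat)
    loss_neg = -((1.0 - y_hm) ** beta) * (h_hat ** alpha) * np.log(1.0 - h_hat)
    return (loss_pos * pos).sum() + (loss_neg * neg).sum()   # L_hm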
IV. ADAPTATION FOR SUPERVISED LEARNING

In this section, we discuss how to use single-class labels to train a multi-class detector, which requires adaptation of supervised learning – more specifically, adaptation of the detector training losses. In the next section, we will discuss how we can leverage pseudo labels to further improve the training effectiveness.

A. Loss Modification

1) Foreground Point Selection: Consider a range image which contains both vehicles and pedestrians, but only pedestrians are labeled. Although we can ensure that the pixels labeled as pedestrian have accurate labels, the pixels that are background-to-pedestrians might actually belong to vehicles rather than the real background (see Figure 2 for illustration). It is critical to handle those pixels with uncertain labels properly. We propose and study several strategies.

a) Aggressive Supervision: Aggressive supervision is a simple approach that ignores the fact that pixels that are background-to-class-k may not be the real background. It simply trains the model as if all the pixels were correctly labeled. This can still be practical, especially when the objects are sparse and thus most of the pixels that are labeled as background are truly background. However, this solution essentially injects wrong information into the training, which can be harmful.

b) Conservative Supervision: An alternative to the Aggressive approach is a conservative approach that does not produce loss on non-foreground pixels. Specifically, given the segmentation label Y_{ri}, we define the set of pixels that have missing labels and train the model using the modified segmentation loss

U_{pixel} = \{(i, j) : \sum_{k=1}^{K} y^k_{ri,i,j} = 0\}, \quad L^{conservative}_{seg} = \sum_{i,j} L_{seg,i,j} \, \mathbb{I}\{(i, j) \notin U_{pixel}\}.

Although this conservative approach avoids injecting wrong information, all the background pixels are masked out for training. As a result, none of the pixels will be classified as background, which would lead to a difficult second stage.

c) Informed Supervision: Although the Aggressive and Conservative approaches are reasonable, they suffer from flaws such as injecting wrong supervision or lacking supervision on the background pixels. To improve, we propose a third approach that uses all the pixels for training and does not inject wrong supervision, which we name Informed supervision. The idea is derived from the principle of maximum likelihood estimation. Let U_{class} be the set of classes that are not labeled in the current data. For pixels in U_{pixel}, although we do not know their exact ground truth labels, we know that they must not belong to any class that has ground truth labels. Thus they can only be of a class in U_{class} or class 0 (background). For these pixels with missing labels, we transform the original (K+1)-way classification problem into a (K + 1 - |U_{class}|)-way classification problem by summing the prediction logits of the classes in U_{class} and class 0. The modified loss is as follows:

L^{informed}_{seg,i,j} =
\begin{cases}
\sum_{k \notin U_{class}} (1 - \tilde{p}^k_{i,j})^{\gamma} \log(\tilde{p}^k_{i,j}) \, \tilde{y}^k_{ri,i,j} & \text{if } (i, j) \in U_{pixel} \\
L_{seg,i,j} & \text{otherwise}
\end{cases}

where

\tilde{p}^k_{i,j} =
\begin{cases}
\sum_{c \in \{0\} \cup U_{class}} \hat{p}^c_{i,j} & \text{if } k = 0 \\
\hat{p}^k_{i,j} & \text{otherwise}
\end{cases}
\qquad
\tilde{y}^k_{ri,i,j} =
\begin{cases}
\mathbb{I}\{(i, j) \in U_{pixel}\} & \text{if } k = 0 \\
y^k_{ri,i,j} & \text{otherwise}
\end{cases}

2) Box Regression: Adapting the box regression losses is straightforward. To train the detection head of object class k, we only use the input data that labels class k:

L_{hm} = \sum_{k=1}^{K} \sum_{v} L^k_{hm,v} \, \mathbb{I}\{k \notin U_{class}\}.

Note that the shape regression loss is automatically masked out when there are no bounding boxes of the corresponding class, so no extra modification is required for this loss.
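The following is a minimal sketch of how the Informed scheme could sit on top of the segmentation focal loss sketched in Section III: for pixels with missing labels, the probabilities of the background and of the unlabeled classes are folded into one merged class, which then serves as the target. The NumPy layout and helper names are assumptions for illustration.

# Sketch of Informed supervision for the segmentation loss (assumed layout:
# p is (H, W, K+1) probabilities, y is (H, W, K+1) one-hot, index 0 = background).
import numpy as np

def informed_segmentation_loss(p, y, unlabeled_classes, gamma=2.0, eps=1e-7):
    missing = y.sum(axis=-1) == 0                 # U_pixel: pixels with no label at all
    merged = list({0} | set(unlabeled_classes))   # classes folded into "background or unlabeled"
    keep = [k for k in range(p.shape[-1]) if k not in unlabeled_classes]

    # Reduced (K + 1 - |U_class|)-way problem for pixels with missing labels.
    p_tilde = p[..., keep].copy()
    p_tilde[..., 0] = p[..., merged].sum(axis=-1)      # sum probabilities of {0} ∪ U_class
    y_tilde = y[..., keep].copy()
    y_tilde[..., 0] = missing.astype(p.dtype)          # merged class is the target there

    def focal(prob, target):
        prob = np.clip(prob, eps, 1.0)
        return -np.sum(((1.0 - prob) ** gamma) * np.log(prob) * target, axis=-1)

    loss_missing = focal(p_tilde, y_tilde)   # used on U_pixel
    loss_labeled = focal(p, y)               # used elsewhere
    return np.where(missing, loss_missing, loss_labeled).sum()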
V. ADAPTATION FOR LEARNING WITH PSEUDO LABEL

In this section, we improve the multi-class detector performance by leveraging pseudo labels.

A. Pseudo Label Generating Strategy

Below we discuss several options to generate pseudo labels to augment the K single-class datasets.

1) Self Labeling: The self labeling approach is adapted from FixMatch [21] for image classification, in which we generate the pseudo labels using the model itself. During training, we feed the data into the multi-class RSN and generate predicted classes for range image pixels and bounding boxes. Predictions with high confidence are saved as pseudo labels. Notice that we need to generate two realizations of data augmentation of the same input to prevent the model from degenerating into a trivial solution.

2) Teacher Labeling: Similarly, we can generate pseudo labels using K well-trained teacher detectors, each being a standard RSN that detects a single class. The teacher model for class k is trained using only the portion of the data that labels class k. We use the corresponding teacher model to generate pseudo labels for the unlabeled classes.

3) Integrated Labeling: We can also combine the two approaches above. For segmentation, we generate the pixel classes based on the ensemble predictions from the teacher and the trained model with equal weight. For bounding boxes, we combine the predicted boxes from all models and filter out overlapping boxes by Non-Maximum Suppression.

B. Incorporating Pseudo Labels

We follow [21] and only include pseudo labels whose confidence scores are above a threshold. Specifically, the final label for the range image is

\hat{y}^k_{ri,i,j} =
\begin{cases}
\mathbb{I}\{k = \arg\max_{c \in \{0\} \cup U_{class}} \hat{p}^c_{i,j}\} & \text{if } (i, j) \in U_{pixel} \\
y^k_{ri,i,j} & \text{otherwise}
\end{cases}

where \hat{p}^c_{i,j} is the prediction logits of the model that generates the pseudo labels. Analogous to U_{pixel}, we maintain a set C_{pixel} to indicate pixels with “non-trustworthy labels”, i.e., pixels that have neither a ground truth label nor a high-confidence pseudo label:

C_{pixel} = \{(i, j) : \max_c \hat{p}^c_{i,j} < \tau_{pixel} \text{ and } (i, j) \in U_{pixel}\},

where \tau_{pixel} is the threshold to decide whether the prediction is confident enough to be regarded as high quality. Similarly, the final bounding box labels are the union of the original ground truth boxes and the generated pseudo bounding boxes with scores higher than a threshold \tau_{bbox}:

\hat{Y}^k_{bbox} =
\begin{cases}
\{b : \mathrm{score}(b, k) \ge \tau_{bbox}\} & \text{if } k \in U_{class} \\
Y^k_{bbox} & \text{otherwise}
\end{cases}

Here \mathrm{score}(b, k) is the score of the pseudo bounding box of class k. Notice that the pseudo heatmap can be generated accordingly using the pseudo bounding boxes.
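The thresholding above can be sketched as follows; the box and prediction formats, the default threshold values, and the helper names are illustrative assumptions rather than the actual pipeline.

# Sketch of incorporating pseudo labels with confidence thresholds tau_pixel / tau_bbox.
import numpy as np

def merge_pixel_labels(y, p_pseudo, unlabeled_classes, tau_pixel=0.5):
    """y: (H, W, K+1) one-hot ground truth; p_pseudo: (H, W, K+1) pseudo-label probabilities."""
    missing = y.sum(axis=-1) == 0                          # U_pixel
    candidates = sorted({0} | set(unlabeled_classes))
    sub = p_pseudo[..., candidates]
    winner = np.array(candidates)[sub.argmax(axis=-1)]     # argmax over {0} ∪ U_class
    y_hat = y.copy()
    y_hat[missing] = 0
    y_hat[missing, winner[missing]] = 1
    # C_pixel: still-untrusted pixels (missing label and low pseudo-label confidence).
    non_trustworthy = missing & (p_pseudo.max(axis=-1) < tau_pixel)
    return y_hat, non_trustworthy

def merge_boxes(gt_boxes, pseudo_boxes, unlabeled_classes, tau_bbox=0.5):
    """Per-class box labels: ground truth where available, thresholded pseudo boxes elsewhere."""
    merged = {}
    for k, boxes in gt_boxes.items():
        if k in unlabeled_classes:
            merged[k] = [b for b in pseudo_boxes.get(k, []) if b["score"] >= tau_bbox]
        else:
            merged[k] = boxes
    return merged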
C. Loss Modification

1) Foreground Point Selection: Pseudo labels help reduce the number of pixels with missing labels, U_{pixel}, but there are still pixels with non-trustworthy labels, C_{pixel}. Therefore, conveniently, the three schemes described in Section IV-A.1 all still apply, simply by replacing U_{pixel} with C_{pixel}. Notice that under self labeling, the Conservative approach is still not suitable, as it will still classify all points as foreground.

2) Box Regression: With the pseudo labeled bounding boxes, data that originally lacked labels for the classes in U_{class} can now be utilized to train the detection heads of those classes. Given the pseudo labeled bounding boxes, similar to the supervised learning discussed in Section IV-A.2, we do not need to modify the shape regression other than simply replacing the ground truth box label by the pseudo box label. For the heatmap loss, voxels that fall within any box that belongs to \hat{Y}^k_{bbox} have trustworthy heatmap ground truth values. However, we cannot ensure whether the remaining voxels are background or within a bounding box of class u \in U_{class}. Due to such uncertainty, the heatmap loss needs special design and treatment. Similar to Section IV-A.1, we propose three schemes for modifying the heatmap loss.

a) Aggressive Supervision: Aggressive supervision simply ignores the fact that some of the background voxels may actually belong to an object, and trains the model as if the pseudo boxes were perfect. Its performance can be sensitive to the quality of the pseudo labels and may inject wrong information for training.

b) Conservative Supervision: Following the same philosophy, Conservative supervision masks out losses on voxels without trustworthy labels. As a result, we only train on voxels with foreground labels. The heatmap loss becomes

\hat{L}^{conservative}_{hm} = \sum_{k=1}^{K} \sum_{v} \hat{L}^k_{hm,v} \, \mathbb{I}\{v \in U^k_{hm} \text{ or } k \notin U_{class}\}.

Here U^k_{hm} = \{v : \exists b \in \hat{Y}^k_{bbox} \text{ s.t. } v \text{ is within } b\} denotes the set of voxels that fall within one of the trustworthy bounding boxes, and \hat{L}^k_{hm,v} is calculated based on \hat{Y}^k_{bbox} as opposed to Y^k_{bbox}.

c) Informed Supervision: The Informed supervision we derived for the segmentation loss is not directly applicable to the heatmap loss. The main reason is that the segmentation is multi-class and mutually exclusive (one out of K), while the heatmap loss is single-class (k vs. not k). However, the same spirit remains. Here we utilize a special property of 3D detection: different from 2D detection, where each pixel might belong to different objects (since objects can overlap with each other when projected onto a 2D image), in a 3D point cloud we can assume that each point only belongs to one class. That means, for the detection head of class k, if a voxel belongs to any other foreground class, its heatmap value of class k must be 0, as this voxel cannot belong to any bounding box of class k. This gives the following modified heatmap loss:

\hat{L}^{informed}_{hm} = \sum_{k=1}^{K} \sum_{v} \hat{L}^k_{hm,v} \, \mathbb{I}\{v \in U_{hm} \text{ or } k \notin U_{class}\}, \quad U_{hm} = \{v : \exists b \in \cup_{k=1}^{K} \hat{Y}^k_{bbox} \text{ s.t. } v \text{ is within } b\},

where, different from the Conservative supervision earlier, the construction of U_{hm} utilizes the information of the (pseudo) bounding boxes of all classes (without the superscript k).

Fig. 3: Foreground point selection under different schemes: (a) Aggressive Supervision, (b) Conservative Supervision, (c) Informed Supervision. Boxes are the ground truth. Points classified as background, vehicle, and pedestrian are colored blue, yellow, and red. The Aggressive scheme misclassifies many object points as background. The Conservative scheme would prompt all points to be predicted as foreground. The Informed scheme gives accurate prediction, despite never having both vehicle and pedestrian labels on the same training data.
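A minimal sketch of the Informed heatmap masking: a voxel contributes to the class-k heatmap loss only if it falls inside some (pseudo) box of any class, since a voxel inside a box of another class cannot also belong to class k. The axis-aligned box-membership test and the data layout are simplifying assumptions.

# Sketch of Informed supervision for the heatmap loss; formats are assumptions.
import numpy as np

def in_any_box(voxels, boxes):
    """voxels: (V, 3) centers; boxes: list of (min_xyz, max_xyz) pairs (axis-aligned for simplicity)."""
    inside = np.zeros(len(voxels), dtype=bool)
    for lo, hi in boxes:
        inside |= np.all((voxels >= lo) & (voxels <= hi), axis=1)
    return inside

def informed_heatmap_loss(per_voxel_loss, voxels, pseudo_boxes, unlabeled_classes):
    """per_voxel_loss: (K, V) loss from pseudo heatmaps; pseudo_boxes: {k: [boxes]}."""
    K, V = per_voxel_loss.shape
    all_boxes = [b for boxes in pseudo_boxes.values() for b in boxes]
    u_hm = in_any_box(voxels, all_boxes)             # voxels inside any class's (pseudo) box
    total = 0.0
    for k in range(1, K + 1):
        if k not in unlabeled_classes:
            total += per_voxel_loss[k - 1].sum()     # fully labeled class: use all voxels
        else:
            total += per_voxel_loss[k - 1][u_hm].sum()  # otherwise only trustworthy voxels
    return total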
VI. EXPERIMENTAL RESULTS

We perform experiments on the Waymo Open Dataset (WOD) [4], specifically detecting vehicles and pedestrians (i.e., K = 2). WOD provides 798 training sequences with all K classes labeled, so we create our single-class supervision scenario by first dividing these sequences into two disjoint sets, and then masking out the labels for either class on the corresponding set. Therefore, our experiments correspond to roughly 50% labeling cost savings. For the division of sequences, we consider two challenging and imbalanced settings: 10%-90% and 5%-95%. We denote the setting in which x% of the sequences contain only vehicle labels and (100-x)% contain only pedestrian labels as #V/#P = x/(100-x). The evaluation on the validation set remains unchanged from the standard setting.

We employ the same training protocol for all experiments, tuned based on the adapted supervised learning. We train with batch size 64 for 80K iterations with 4K warmup. The number of channels in each layer of our multi-class RSN is 3/4 that of CarL_1f / PedL_1f in [11], to reduce the memory cost. The remaining hyper-parameters and data augmentation (random flipping and rotation are applied) are the same as those in [11]. For self labeling and integrated labeling, we do not apply augmentation when generating pseudo labels using the multi-class RSN. Since the sizes of the two subsets are quite different, we also apply dataset resampling such that each example in the mini-batch is drawn from the two subsets with equal probability.
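One simple way to realize this resampling, sketched below under the assumption that each subset is a list of frame indices, is to pick the source subset for every mini-batch element with a fixed probability (0.5 by default; varied in Section VI-C).

# Sketch of dataset resampling between the two single-class subsets; the
# subset representation (lists of frames) is an assumption for illustration.
import random

def sample_minibatch(vehicle_frames, pedestrian_frames, batch_size, p_vehicle=0.5, rng=random):
    """Draw each mini-batch element from the vehicle- or pedestrian-labeled subset."""
    batch = []
    for _ in range(batch_size):
        source = vehicle_frames if rng.random() < p_vehicle else pedestrian_frames
        batch.append(rng.choice(source))
    return batch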
A. Algorithm Comparison

We compare different methods to learn the multi-class detector from single-class labels in Table I. The evaluation metric is 3D AP with L1 difficulty. We consider the supervised learning algorithm with the Aggressive scheme to be our baseline (as we are in a new problem setting, there is no prior baseline from other works), and consider the performance of our multi-class RSN when trained on the unmasked, fully labeled data (which does not belong to SCS) to be the performance upper bound.

Within supervised learning, Informed supervision outperforms the Aggressive supervision baseline, showing the superiority of proper SCS modeling. As is evident from Figure 3, Aggressive supervision misclassifies many object points as background, and by the design of RSN, those points will not be included in proposing bounding boxes. By comparison, Informed supervision gives accurate predictions for all three kinds of points in Figure 2, even though there are no points explicitly labeled as background. We do not include quantitative results with Conservative supervision, as all points are classified as foreground, causing failed training of the second stage in RSN, though we can still visualize its first stage in Figure 3.

We then use the Informed scheme to compare between supervised learning and pseudo labeling. We observe that the adapted pseudo labeling can give significant improvement over adapted supervised learning. Among the pseudo labeling algorithms, the integrated labeling delivers the best performance, as expected. The teacher labeling outperforms the self labeling approach, where the teacher models are single-class detectors trained only on the subset of sequences labeled with the corresponding class. One of the reasons may be that the different object features required by vehicle and pedestrian cause conflicts in the hidden representation given the limited model capacity, and thus the pseudo labels generated by the two single-class detectors are of higher quality than those generated by the multi-class detector.

Algorithms        Scheme       | #V/#P = 90/10             | #V/#P = 10/90             | #V/#P = 95/5              | #V/#P = 5/95
                               | Vehicle      Pedestrian   | Vehicle      Pedestrian   | Vehicle      Pedestrian   | Vehicle       Pedestrian
Supervised        Aggressive   | 70.2 (-1.8)  53.5 (-18.9) | 63.5 (-8.5)  68.3 (-4.1)  | 69.8 (-2.2)  28.9 (-43.5) | 18.4 (-53.6)  69.2 (-3.2)
Supervised        Informed     | 70.2 (-1.8)  59.3 (-13.1) | 65.4 (-6.6)  71.6 (-0.8)  | 70.7 (-1.3)  46.0 (-26.4) | 61.8 (-10.2)  72.5 (+0.1)
Self Label        Informed     | 70.2 (-1.8)  62.1 (-10.3) | 66.0 (-6.4)  72.0 (-0.4)  | 70.9 (-1.1)  47.3 (-25.1) | 64.1 (-7.9)   72.3 (-0.1)
Teacher Label     Informed     | 71.7 (-0.3)  67.1 (-5.3)  | 68.8 (-3.6)  73.2 (+0.8)  | 71.3 (-0.7)  57.8 (-14.6) | 66.0 (-6.0)   71.7 (-0.8)
Integrated Label  Informed     | 71.6 (-0.4)  68.5 (-3.9)  | 69.0 (-3.0)  73.8 (+1.4)  | 71.5 (-0.5)  59.5 (-12.9) | 66.3 (-5.7)   73.2 (+0.8)
Full Label        -            | 72.0 (0.0)   72.4 (0.0)   | 72.0 (0.0)   72.4 (0.0)   | 72.0 (0.0)   72.4 (0.0)   | 72.0 (0.0)    72.4 (0.0)

TABLE I: Comparing algorithms and schemes developed for single-class supervision (SCS). Each cell shows the detection AP and, in parentheses, its gap to the full label upper bound. Notice that the last row uses more labels and does not belong to SCS; its numbers are therefore not fair comparisons.

The fact that detecting vehicle and pedestrian may be conflicting may also explain the occasional cases where the pseudo labeling algorithm surpasses the full label upper bound in detecting one class (e.g. in #V/#P = 90/10, teacher and integrated labeling outperform the full label upper bound in detecting pedestrian). In cases where both classes have sufficient labels, the two tasks compete with each other and the model learns a representation that performs well for both tasks. In contrast, when there is not enough information for the model to learn a strong representation for vehicles, the learned representation will favor detecting pedestrians.

B. Ablations on Modeling Missing Label

Under the teacher labeling algorithm (#V/#P = 90/10), we ablate the effects of the three schemes we proposed (Aggressive, Conservative, Informed) on either the segmentation loss or the heatmap loss. The results are summarized in Table II.

Segmentation   Heatmap        | Vehicle      | Pedestrian
Aggressive     Informed       | 69.3 (-3.4)  | 54.1 (-17.0)
Conservative   Informed       | 70.8 (-0.9)  | 65.6 (-1.5)
Informed       Aggressive     | 65.6 (-6.1)  | 48.8 (-18.3)
Informed       Conservative   | 71.0 (-0.7)  | 66.3 (-0.8)
Informed       Informed       | 71.7 (0.0)   | 67.1 (0.0)

TABLE II: Comparing different schemes for handling the segmentation / heatmap loss with missing labels. Each cell shows the detection AP and, in parentheses, its gap to employing Informed supervision for both.

Employing Informed supervision for both segmentation and heatmap results in the best performance, as Informed supervision best exploits trustworthy label information. Aggressive supervision is worse than the Conservative approach. This holds for both the segmentation loss and the heatmap loss. We believe this shows that the negative influence of injecting wrong information outweighs the benefit of injecting more information.

C. Dataset Resampling

After analyzing Table I vertically, we now analyze it horizontally. The general trend is that when a class has fewer data, the AP on this class is lower and has more room to improve, which is expected.
However, there are a few occasions where having more data of a class results in worse AP, for example detecting vehicles using teacher labeling at #V/#P = 90/10 vs #V/#P = 95/5 (71.7 vs 71.3). We found the cause to be dataset resampling. In Figure 4 we vary the probability of sampling images from the dominant vehicle class when #V/#P = 90/10 or #V/#P = 95/5 (the default is 50%, i.e., equal probability).

Fig. 4: Detection AP with varying dataset resampling probability. P / V in the legend stand for Pedestrian / Vehicle.

Towards the right side of the figure, when we sample predominantly from the vehicle class while the amount of labeled vehicle sequences is already dominant, the performance of the minority pedestrian class worsens drastically. On the other hand, towards the left side, when we sample predominantly from the minority pedestrian class, the model sees too many repetitive pedestrian examples and too little of the vehicle data, degrading the performance on both tasks. Therefore, it is critical to find the middle sweet spot.

VII. CONCLUSION

We study training a multi-class 3D object detector under the “single-class supervision” learning setting, by proposing and benchmarking various baselines and strategies. Notably, our proposed Informed Supervision combined with pseudo labeling can approach or match the upper bound that is full supervision, saving significant labeling costs. For future work, we plan to expand the applications of SCS. Within 3D object detection, we plan to experiment with more classes and more diverse architectures. Beyond 3D object detection, we plan to expand to other tasks such as 2D object detection, and move from multi-class to multi-task (e.g. joint detection and segmentation).

REFERENCES

[1] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3354–3361.
[2] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, “The apolloscape dataset for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 954–960.
[3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9297–9307.
[4] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[5] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
[6] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[7] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, “Lasernet: An efficient probabilistic 3d object detector for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12677–12686.
[8] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[9] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, “End-to-end multi-view fusion for 3d object detection in lidar point clouds,” in Conference on Robot Learning. PMLR, 2020, pp. 923–932.
[10] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10529–10538.
[11] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, “Rsn: Range sparse net for efficient, accurate lidar 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5725–5734.
[12] Y. Grandvalet and Y. Bengio, “Semi-supervised learning by entropy minimization,” in Advances in neural information processing systems, 2005, pp. 529–536.
[13] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on challenges in representation learning, ICML, vol. 3, no. 2, 2013, p. 896.
[14] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” arXiv preprint arXiv:1610.02242, 2016.
[15] M. Sajjadi, M. Javanmardi, and T. Tasdizen, “Regularization with stochastic transformations and perturbations for deep semi-supervised learning,” Advances in neural information processing systems, vol. 29, pp. 1163–1171, 2016.
[16] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” arXiv preprint arXiv:1703.01780, 2017.
[17] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.
[18] I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan, “Billion-scale semi-supervised learning for image classification,” arXiv preprint arXiv:1905.00546, 2019.
[19] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel, “Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring,” arXiv preprint arXiv:1911.09785, 2019.
[20] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10687–10698.
[21] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
[22] G. French, S. Laine, T. Aila, M. Mackiewicz, and G. Finlayson, “Semi-supervised semantic segmentation needs strong, varied perturbations,” arXiv preprint arXiv:1906.01916, 2019.
[23] Y. Zou, Z. Zhang, H. Zhang, C.-L. Li, X. Bian, J.-B. Huang, and T. Pfister, “Pseudoseg: Designing pseudo labels for semantic segmentation,” arXiv preprint arXiv:2010.09713, 2020.
[24] J. Kim, J. Jang, and H. Park, “Structured consistency loss for semi-supervised semantic segmentation,” arXiv preprint arXiv:2001.04647, 2020.
[25] Y. Ouali, C. Hudelot, and M. Tami, “Semi-supervised semantic segmentation with cross-consistency training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12674–12684.
[26] X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2613–2622.
[27] Y. Zhu, Z. Zhang, C. Wu, Z. Zhang, T. He, H. Zhang, R. Manmatha, M. Li, and A. Smola, “Improving semantic segmentation via self-training,” arXiv preprint arXiv:2004.14960, 2020.
[28] Z. Feng, Q. Zhou, G. Cheng, X. Tan, J. Shi, and L. Ma, “Semi-supervised semantic segmentation via dynamic self-training and class-balanced curriculum,” arXiv preprint arXiv:2004.08514, vol. 1, no. 2, p. 5, 2020.
[29] C. Rosenberg, M. Hebert, and H. Schneiderman, “Semi-supervised self-training of object detection models.”
[30] J. Jeong, S. Lee, J. Kim, and N. Kwak, “Consistency-based semi-supervised learning for object detection,” Advances in neural information processing systems, vol. 32, pp. 10759–10768, 2019.
[31] Y. S. Tang and G. H. Lee, “Transferable semi-supervised 3d object detection from rgb-d data,” 2019.
[32] P. Tang, C. Ramaiah, Y. Wang, R. Xu, and C. Xiong, “Proposal learning for semi-supervised object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2291–2301.
[33] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, and T. Pfister, “A simple semi-supervised learning framework for object detection,” arXiv preprint arXiv:2005.04757, 2020.
[34] Q. Yang, X. Wei, B. Wang, X.-S. Hua, and L. Zhang, “Interactive self-training with mean teachers for semi-supervised object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5941–5950.
[35] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi-supervised object detection,” in International Conference on Learning Representations, 2021.
[36] N. Zhao, T.-S. Chua, and G. H. Lee, “Sess: Self-ensembling semi-supervised 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11079–11087.
[37] W. Wei, P. Wei, and N. Zheng, “Semantic consistency networks for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2861–2869.
[38] H. Wang, Y. Cong, O. Litany, Y. Gao, and L. J. Guibas, “3dioumatch: Leveraging iou prediction for semi-supervised 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14615–14624.
[39] B. Caine, R. Roelofs, V. Vasudevan, J. Ngiam, Y. Chai, Z. Chen, and J. Shlens, “Pseudo-labeling for scalable 3d object detection,” arXiv preprint arXiv:2103.02093, 2021.
[40] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[41] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 770–779.
In this paper, we are interested in training a multi-class 3D object detection model, while using these single-class labeled data. We begin by detailing the unique stance of our “Single-Class Supervision” (SCS) setting with respect to related concepts such as partial supervision and semi supervision. Then, based on the case study of training the multi-class version of Range Sparse Net (RSN), we adapt a spectrum of algorithms — from supervised learning to pseudo- labeling — to fully exploit the properties of our SCS setting, and perform extensive ablation studies to identify the most effective algorithm and practice. Empirical experiments on the Waymo Open Dataset show that proper training under SCS can approach or match full supervision training while saving labeling costs. I. INTRODUCTION 3D object detection is a core component in various robotics and autonomous driving applications. Existing public datasets [1], [2], [3], [4] often provide the labels for all K classes on all data, which enables what we call full supervision training (Figure 1a). However, such fully labeled datasets are not scalable in real-world applications at the industrial scale, given the cost of labeling. A sensible approach then, is to perform targeted labeling for each class of interest, resulting in K datasets that are possibly non- overlapping. We term this the “Single-Class Supervision” (SCS) setting (Figure 1c). Intuitively the SCS setting enjoys many advantages, such as dedicated allocation of labeling resources (e.g. only label the “rare / hard” class and save the cost of labeling the “common / easy” class), and better control of class-specific performance metrics (e.g. label more data for a certain class if the model’s performance on that class is not accurate enough). Under this setting, it is straightforward to train a single-class detector, which is common in 3D detection for autonomous driving application [5], [6], [7], [8], [9], [10], [11]. However, the training protocol becomes unclear when training a multi-class detector: since every training example has incomplete labels, even conducting supervised learning becomes challenging1. §The University of Texas at Austin ¶Waymo †Corresponding author 1When objects from different classes tend not to co-exist, SCS is easy as it is essentially fully supervised. But in this paper we target the opposite. (a) Full Supervision (b) Semi Supervision (c) Single-Class Supervision (ours) Fig. 1: Full supervision (a), where all objects are properly labeled, is ideal in training a multi-class detector. Semi supervision (b) is a classic setting where each image is either fully labeled or not labeled. Single-class supervision (c) is an under-explored setting that we study. There is only one class of objects labeled for each image. In this paper, we describe ways to learn under the SCS setting based on the case study of training a multi- class Range Sparse Net (RSN) [11], which is a state-of- the-art 3D object detection model. The RSN model is a two-stage detector. It performs segmentation as its first stage, which makes our solutions potentially generalizable to other tasks such as semantic segmentation as well. We find that it is critical to correctly handle the missing label property of SCS, and propose an Informed supervision scheme that makes the most out of the (partial) labels provided. Based on the Informed supervision scheme, from simple to complex, we adapt supervised learning algorithms (Section IV) and pseudo-labeling algorithms (Section V) under the SCS setting. 
Other important training techniques, including dataset resampling and combining pseudo labels from different sources, are also studied. On the Waymo arXiv:2205.05703v1 [cs.CV] 11 May 2022 Open Dataset [4], we conduct experiments to demonstrate that proper modeling under SCS can approach or match the full supervision scenario, showing its practicality and promise in saving (re)labeling cost. II. DEFINING&SITUATING SINGLE-CLASS SUPERVISION We begin by considering object detection. Let M be the number of images (range images for the 3D case), and K be the number of classes in the dataset. Each image, indexed by i, contains Ni objects of various classes, and N = ∑M i=1 Ni is the total number of objects in the dataset. A. Full Supervision Full supervision is the situation where all N objects are labeled. This is of course ideal, and the model performance under full supervision can be considered the upper bound to any of the partial supervision situations discussed next. B. Partial Supervision We use the term partial supervision to summarize all situations where n < N objects are labeled. While the partiality can be from multiple aspects, there are several important special cases of partial supervision, including semi supervision and (our) single-class supervision. 1) Semi Supervision: Research on semi-supervised learning has mostly been conducted on the image classification setting [12], [13], [14], [15], [16], [17], [13], [18], [19], [20], [21], with interest in segmentation and object detection only rising recently [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. We follow these last set of works and characterize the semi supervision setting as: (upon sorting,) for a certain integer S, full labels are provided for images with indices 1 ≤i ≤S, and no labels for images with indices S < i ≤M. 2) Single-Class Supervision: We define (our) single-class supervision (SCS) as: for each image i, only objects from 1 class Ci ∈{1,...,K} are labeled. The collection of labels Ci cover all K classes. We illustrate full supervision, semi supervision, and single-class supervision in Figure 1 for intuitive comparison. 3) Extensions to Single-Class Supervision: While this paper focuses on the single-class supervision defined above, there are natural extensions / relaxations. For example, we can relax the “only 1 class” constraint to having ≥1 classes, or having ≤1 classes. The latter implies some images may have no labels, making it closer to semi supervision, to become a hybrid between semi supervision and SCS. Under the framework of this section, image classification can be viewed as Ni = 1,∀i2. This effectively collapses semi supervision and SCS to the same setting. 2Admittedly there may be more than one object in an image used for classification, but here we make this simplification to illustrate the collapse. III. MULTI-CLASS RANGE SPARSE NET Our multi-class 3D object detector is based on Range Sparse Net (RSN) [11], which is a state-of-the-art single- class detector. In this section, we recap its architecture and describe how we extend it for the multi-class setting. A. Architecture RSN is a two-stage detector. In the first stage, the model segments LiDAR range image pixels into two categories: background and foreground. In the second stage, points classified as foreground are voxelized and fed into a sparse convolution network followed by a detection head to predict the 3D bounding boxes. 
The RSN architectures described in [11] perform single- class 3D object detection. We extend RSN to multi-class by sharing the first stage among all K classes, but keeping individual second stages for each class. This means the foreground point selection stage is now a K +1-way instead of a 1 + 1 = 2-way segmentation, with index 0 being the background. In the second stage, only points that are classified as foreground class k are voxelized and fed into the corresponding layer. We choose not to share the second stage, as different object classes may call for different voxelization granularities (e.g. pedestrians require a finer resolution than vehicles). B. Losses for training RSN Below we describe how we extend the single-class training loss in RSN [11] to the multi-class setting. 1) Foreground Point Selection: In a training sample (Xri,Ybbox), Xri ∈RH×W×3 is the LiDAR range image, and Ybbox = ∪K k=1Yk bbox is the union of the labeled 3D bounding boxes of all K classes. By examining whether a LiDAR point lies within any of the labeled 3D bounding boxes, we can generate the ground truth for foreground point selection Yri ∈ZH×W×K+1 which is a one-hot vector for each pixel. Using i, j,k to index into row, column, class, Lseg = ∑i, jLseg,i, j, Lseg,i, j = ∑K k=0(1−ˆpk i, j)γ log(ˆpk i, j)yk ri,i, j, where ˆpk i, j is the prediction logits of object class k at pixel (i, j), and γ is the focusing parameter of focal loss [40]. 2) Box Regression: Two losses are involved in training the network to produce bounding boxes. a) Heatmap Loss: The heatmap loss is used to train the network to locate centers of the objects. Since we have separate detection heads for different classes, we construct the ground truth heatmap value yk hm,v for each class k at the Cartesian coordinates of each voxel v. Lhm =∑K k=1∑vLk hm,v, Lk hm,v =(1−ˆh k v)α log(ˆh k v)I{yk hm,v > 1−ε}+ (1−yk hm,v)β(ˆh k v)α log(1−ˆh k k)I{yk hm,v ≤1−ε}, Fig. 2: A range image with only the pedestrian class labeled. Pixels labeled as pedestrian (red box) have accurate label. However, the other pixels have missing label: they may belong to background (blue dashed box) or vehicle (yellow dashed box) since the vehicle class is unlabeled. where ˆh k v denotes the predicted heatmap value for class k at voxel v, ε is added for numerical stability, α and β are focusing hyper-parameters. b) Shape Regression Loss: For any yk hm,v higher than a threshold (i.e. close to an object center), a bin loss [41] is used to regress the object’s heading, and the other box shape parameters are directly regressed under smooth L1 losses. IV. ADAPTATION FOR SUPERVISED LEARNING In this section, we discuss how to use single-class labels to train a multi-class detector, which requires adaptation of the supervised learning – more specifically adaptation of the detector training losses. In the next section, we will discuss how we can leverage pseudo labels to further improve the training effectiveness. A. Loss Modification 1) Foreground Point Selection: Consider a range image which contains both vehicles and pedestrians, but only pedestrians are labeled. Although we can ensure that the pixels that are labeled as pedestrian have accurate labels, the pixels that are background-to-pedestrians might actually be pixels that belong to vehicles, rather than the real background pixels (see Figure 2 for illustration). It is critical to handle those pixels with uncertain label properly. We propose and study several strategies. 
a) Aggressive Supervision: Aggressive supervision is a simple approach that ignores the fact that pixels that are background-to-class-k may not be the real background. It simply trains the model as if all the pixels were correctly labeled. This can still be practical, especially when the objects are sparse and thus most of the pixels that are labeled as background are truly background. However, this solution essentially injects wrong information into the training, which can be harmful. b) Conservative Supervision: An alternative to the Aggressive approach is a conservative approach that does not produce loss on non-foreground pixels. Specifically, given the segmentation label Yri, we define a set of pixels that have missing label and train the model using the modified segmentation loss Upixel = n (i, j) : ∑K k=1yk ri,i, j = 0 o , Lconservative seg = ∑i,jLseg,i, jI{(i, j) /∈Upixel}. Although this conservative approach avoids injecting wrong information, all the background pixels are masked out for training. As a result, none of the pixels will be classified as background, which would lead to a difficult second stage. c) Informed Supervision: Despite that the Aggressive and Conservative approaches are reasonable, they suffer from flaws such as injecting wrong supervision or lack of supervision on the background pixels. To improve, we propose a third approach that uses all the pixels for training and does not inject wrong supervision, which we name Informed supervision. The idea is derived from the principal of maximum likelihood estimation. Let Uclass be the set of classes that are not labeled in the current data. For pixels in Upixel, although we do not know its exact ground truth label, we know that they must not belong to any classes that has ground truth label. Thus they can only be of any class in Uclass or 0 (background). For these pixels with missing label, we transform the original K +1-way classification problem into a K + 1 −|Uclass|-way classification problem by summing the prediction logits of classes in Uclass and 0. The modified loss is as follows: Linformed seg,i, j =    ∑ k/∈Uclass (1−ˆp˜k i,j)γ log(ˆp˜k i, j)y˜k ri,i, j if (i, j) ∈Upixel Lseg,i, j o/w where ˆp˜k i, j = ( ∑c∈{0}∪Uclass ˆpc i, j if k = 0 ˆpk i, j o/w y˜k ri,i, j = ( I{(i.j) ∈Upixel} if k = 0 yk ri,i, j o/w 2) Box Regression: Adapting the box regression losses is straightforward. To train the detection head of object class k, we only use the input data that labels class k: Lhm = ∑K k=1∑vLk hm,vI{k /∈Uclass}. Note that the shape regression loss is automatically masked out when there is no bounding boxes of the corresponding class, so no extra modification is required for this loss. V. ADAPTATION FOR LEARNING WITH PSEUDO LABEL In this section, we improve the multi-class detector performance by leveraging pseudo labels. A. Pseudo Label Generating Strategy Below we discuss several options to generate pseudo labels to augment the K single-class datasets. 1) Self Labeling: The self labeling approach is adapted from FixMatch [21] for image classification, in which we generate the pseudo labels using the model itself. During training, we feed the data into the multi-class RSN and generate predicted classes for range image pixels and bounding boxes. Prediction with high confidence are saved as pseudo labels. Notice that we need to generate two realizations of data augmentation of the same input to prevent the model from degenerating into a trivial solution. 
2) Teacher Labeling: Similarly, we can generate pseudo labels using K well-trained teacher detectors, each being a standard RSN that detects a single class. The teacher model for class k is trained using only the portion of the data that labels class k. We use the corresponding teacher model to generate pseudo labels for the unlabeled classes. 3) Integrated Labeling: We can also combine the two approaches above. For segmentation, we generate the pixel classes based on the ensemble predictions from the teacher and the trained model with equal weight. For bounding boxes, we combine the predicted boxes from all models and filter out overlapping boxes by Non-Maximum Suppression. B. Incorporating Pseudo Labels We follow [21] and only include pseudo labels whose confidence scores are above a threshold. Specifically, the final label for the range image is ˆyk ri,i, j = ( I{k = argmaxc∈{0}∪Uclass ˆpc i, j} if (i, j) ∈Upixel yk ri,i, j o/w where ˆpc i,j is the prediction logits of the model that generates the pseudo labels. Analogous to Upixel, we maintain a set Cpixel to indicate pixels with “non-trustworthy labels”, i.e., pixels that does not have ground truth label nor high confidence pseudo label Cpixel = n (i, j) : max c ˆpc i, j < τpixel and (i, j) ∈Upixel o , where τpixel is the threshold to decide whether the prediction is confident enough to be regarded as high quality. Similarly, the final bounding box labels are the union of the original ground truth boxes and the generated pseudo bounding boxes with scores higher than a threshold τbbox. ˆY k bbox = ( {b : score(b,k) ≥τbbox} if k ∈Uclass Yk bbox o/w Here score(b,k) is the score of the pseudo bounding box of class k. Notice that the pseudo heatmap can be generated accordingly using the pseudo bounding boxes. C. Loss Modification 1) Foreground Point Selection: Pseudo labels help reduce the amount of missing label pixels Upixel, but there are still pixels with non-trustworthy labels Cpixel. Therefore, conveniently, the three schemes described in Section IV-A.1 all still apply, simply by replacing Upixel with Cpixel. Notice that under self labeling, the Conservative approach is still not suitable, as it will still classify all points as foreground. 2) Box Regression: With the pseudo labeled bounding boxes, the data that did not have Uclass originally labeled can now be utilized to train the detection head of those classes. Given the pseudo labeled bounding boxes, similar to supervised learning discussed in Section IV-A.2, we do not need to modify the shape regression other than simply replacing the ground truth box label by the pseudo box label. For the heatmap loss, voxels that fall within any box that belongs to ˆY k bbox have trustworthy heatmap ground truth value. However, we cannot ensure whether the remaining voxels are background or within a bounding box of class u ∈Uclass. Due to such uncertainty, the heatmap loss needs special design and treatment. Similar to Section IV-A.1, we propose three schemes for modifying the heatmap loss. a) Aggressive Supervision: Aggressive supervision simply ignores the fact that some of the background voxels may actually belong to an object, and trains the model as if the pseudo boxes were perfect. Its performance can be sensitive to the quality of the pseudo labels and may inject wrong information for training. b) Conservative Supervision: Following the same philosophy, Conservative supervision masks out losses on voxels without trustworthy labels. 
As a result, we only train on voxels with foreground labels. The heatmap loss becomes ˆL conservative hm = ∑K k=1∑v ˆL k hm,vI{v ∈Uk hm or k /∈Uclass} Here Uk hm = {v : ∃b ∈ˆY k bbox s.t. v is within b} denotes the set of voxels that fall within one of the trustworthy bounding boxes, and ˆL k hm,v is calculated based on ˆY k bbox as opposed to Yk bbox. c) Informed Supervision: The Informed supervision we derived for segmentation loss is not directly applicable here for heatmap loss. The main reason is that the segmentation is multi-class and mutual-exclusive (one out of K), while the heatmap loss is single-class (k vs not k). However the same spirit remains. Here we utilize a special property of 3D detection: different from 2D detection where each pixel might belong to different objects (since objects can overlap with each other when projected onto a 2D image), in 3D point cloud, we can assume that each point only belongs to one class. That means for the detection head of class k, if a voxel belongs to any other foreground class, its heatmap value of class k must be 0 as this voxel cannot (a) Aggressive Supervision (b) Conservation Supervision (c) Informed Supervision Fig. 3: Foreground point selection under different schemes. Boxes are the ground truth. Points being classified as background, vehicle and pedestrian are colored blue, yellow, and red. The Aggressive scheme misclassifies many object points as background. The Conservative scheme would prompt all points to be predicted as foreground. The Informed scheme gives accurate prediction, despite never having both vehicle and pedestrian labels on the same training data. belong to any bounding boxes of class k. This gives the following modified heatmap loss: ˆL informed hm = ∑K k=1∑v ˆL k hm,vI{v ∈Uhm or k /∈Uclass}, Uhm = {v : ∃b ∈∪K k=1 ˆY k bbox s.t. v is within b} where, different from the Conservative supervision earlier, the construction of Uhm utilizes the information of (pseudo) bounding boxes of all classes (without superscript k). VI. EXPERIMENTAL RESULTS We perform experiments on the Waymo Open Dataset (WOD) [4], specifically detecting vehicles and pedestrians (i.e., K = 2). WOD provides 798 training sequences with all K classes labeled, so we create our single-class supervision scenario by first dividing these sequences into two disjoint sets, and then masking out labels for either class on the corresponding set. Therefore, our experiments correspond to roughly 50% labeling cost savings. For the division of sequences, we consider two challenging and imbalance settings 10%-90% and 5%-95%. We denote the setting that x%, (100−x)% sequences contains only vehicle, pedestrian labels as #V/#P = x/100−x. The evaluation on the validation set remains unchanged from the standard setting. We employ the same training protocol for all experiments, tuned based on the adapted supervised learning. We train with batch size 64 for 80K iterations with 4K warmup. The number of channels in each layer of our multi-class RSN is 3/4 that of CarL 1f / PedL 1f in [11], to reduce the memory cost. The remaining hyper-parameters and data augmentation (random flipping and rotation are applied) are the same as those in [11]. For self labeling and integrated labeling, we do not apply augmentation when generating pseudo labels using the multi-class RSN. Since the sizes of the two subsets are quite different, we also apply dataset resampling such that each data in the mini-batch is drawn from the two subsets with equal probability. A. 
VI. EXPERIMENTAL RESULTS

We perform experiments on the Waymo Open Dataset (WOD) [4], specifically detecting vehicles and pedestrians (i.e., K = 2). WOD provides 798 training sequences with all K classes labeled, so we create our single-class supervision scenario by first dividing these sequences into two disjoint sets and then masking out the labels of one class on the corresponding set. Our experiments therefore correspond to roughly 50% labeling cost savings. For the division of sequences, we consider two challenging and imbalanced splits, 10%–90% and 5%–95%. We denote the setting in which x% of the sequences contain only vehicle labels and (100−x)% contain only pedestrian labels as #V/#P = x/(100−x). The evaluation on the validation set remains unchanged from the standard setting.

We employ the same training protocol for all experiments, tuned based on the adapted supervised learning. We train with batch size 64 for 80K iterations with 4K warmup steps. The number of channels in each layer of our multi-class RSN is 3/4 that of CarL_1f / PedL_1f in [11], to reduce the memory cost. The remaining hyper-parameters and data augmentation (random flipping and rotation) are the same as in [11]. For self labeling and integrated labeling, we do not apply augmentation when generating pseudo labels with the multi-class RSN. Since the sizes of the two subsets are quite different, we also apply dataset resampling such that each example in the mini-batch is drawn from either subset with equal probability.

A. Algorithm Comparison

We compare different methods for learning the multi-class detector from single-class labels in Table I. The evaluation metric is 3D AP at L1 difficulty. We take the supervised learning algorithm with the Aggressive scheme as our baseline (since we are in a new problem setting, there is no prior baseline from other works), and take the performance of our multi-class RSN trained on the unmasked, fully labeled data (which does not belong to SCS) as the upper bound.

Within supervised learning, Informed supervision outperforms the Aggressive supervision baseline, showing the benefit of properly modeling SCS. As is evident from Figure 3, Aggressive supervision misclassifies many object points as background, and by the design of RSN those points are not considered when proposing bounding boxes. By comparison, Informed supervision gives accurate predictions for all three kinds of points in Figure 3, even though no points are explicitly labeled as background. We do not include quantitative results for Conservative supervision, since it classifies all points as foreground and the second stage of RSN consequently fails to train, though its first stage can still be visualized in Figure 3.

We then use the Informed scheme to compare supervised learning with pseudo labeling. The adapted pseudo labeling gives a significant improvement over adapted supervised learning. Among the pseudo-labeling algorithms, integrated labeling delivers the best performance, as expected. Teacher labeling outperforms self labeling, where the teacher models are single-class detectors trained only on the subset of sequences labeled with the corresponding class. One possible reason is that the different features required for detecting vehicles and pedestrians conflict in the shared hidden representation given the limited model capacity, so the pseudo labels generated by the two single-class detectors are of higher quality than those generated by the multi-class detector.

| Algorithm | Scheme | 90/10 Veh. | 90/10 Ped. | 10/90 Veh. | 10/90 Ped. | 95/5 Veh. | 95/5 Ped. | 5/95 Veh. | 5/95 Ped. |
|---|---|---|---|---|---|---|---|---|---|
| Supervised | Aggressive | 70.2 (-1.8) | 53.5 (-18.9) | 63.5 (-8.5) | 68.3 (-4.1) | 69.8 (-2.2) | 28.9 (-43.5) | 18.4 (-53.6) | 69.2 (-3.2) |
| Supervised | Informed | 70.2 (-1.8) | 59.3 (-13.1) | 65.4 (-6.6) | 71.6 (-0.8) | 70.7 (-1.3) | 46.0 (-26.4) | 61.8 (-10.2) | 72.5 (+0.1) |
| Self Label | Informed | 70.2 (-1.8) | 62.1 (-10.3) | 66.0 (-6.4) | 72.0 (-0.4) | 70.9 (-1.1) | 47.3 (-25.1) | 64.1 (-7.9) | 72.3 (-0.1) |
| Teacher Label | Informed | *71.7 (-0.3) | 67.1 (-5.3) | 68.8 (-3.6) | 73.2 (+0.8) | 71.3 (-0.7) | 57.8 (-14.6) | 66.0 (-6.0) | 71.7 (-0.8) |
| Integrated Label | Informed | 71.6 (-0.4) | *68.5 (-3.9) | *69.0 (-3.0) | *73.8 (+1.4) | *71.5 (-0.5) | *59.5 (-12.9) | *66.3 (-5.7) | *73.2 (+0.8) |
| Full Label | - | 72.0 (0.0) | 72.4 (0.0) | 72.0 (0.0) | 72.4 (0.0) | 72.0 (0.0) | 72.4 (0.0) | 72.0 (0.0) | 72.4 (0.0) |

TABLE I: Comparing algorithms and schemes developed for single-class supervision (SCS). Columns are grouped by the #V/#P setting; each cell shows the detection AP and, in parentheses, its gap to the full label upper bound, with the best SCS result per column marked by *. Notice that the last row uses more labels and does not belong to SCS, so its numbers are not a fair comparison and only serve as a reference.
The fact that detecting vehicles and pedestrians may conflict may also explain the occasional cases where a pseudo-labeling algorithm surpasses the full label upper bound on one class (e.g., at #V/#P = 90/10, teacher and integrated labeling outperform the full label upper bound on pedestrian). When both classes have sufficient labels, the two tasks compete with each other and the model learns a representation that performs well on both. In these cases, by contrast, there is not enough information for the model to learn a strong representation for vehicles, so the learned representation ends up favoring pedestrian detection.

B. Ablations on Modeling Missing Label

Under the teacher labeling algorithm (#V/#P = 90/10), we ablate the effects of the three schemes we proposed (Aggressive, Conservative, Informed) on either the segmentation loss or the heatmap loss. The results are summarized in Table II.

| Segmentation | Heatmap | Vehicle | Pedestrian |
|---|---|---|---|
| Aggressive | Informed | 69.3 (-3.4) | 54.1 (-17.0) |
| Conservative | Informed | 70.8 (-0.9) | 65.6 (-1.5) |
| Informed | Aggressive | 65.6 (-6.1) | 48.8 (-18.3) |
| Informed | Conservative | 71.0 (-0.7) | 66.3 (-0.8) |
| Informed | Informed | 71.7 (0.0) | 67.1 (0.0) |

TABLE II: Comparing different schemes for handling the segmentation / heatmap loss with missing labels. Each cell shows the detection AP and, in parentheses, its gap to employing Informed supervision for both losses.

Employing Informed supervision for both the segmentation and the heatmap loss gives the best performance, as Informed supervision best exploits the trustworthy label information. Aggressive supervision is worse than the Conservative approach, for both the segmentation loss and the heatmap loss. We believe this shows that the negative influence of injecting wrong information outweighs the benefit of injecting more information.

C. Dataset Resampling

After analyzing Table I vertically, we now analyze it horizontally. The general trend is that when a class has fewer data, the AP on this class is lower and has more room to improve, which is expected. However, there are a few occasions where having more data of a class results in worse AP, for example detecting vehicles using teacher labeling at #V/#P = 90/10 vs. #V/#P = 95/5 (71.7 vs. 71.3). We found the cause to be dataset resampling. In Figure 4 we vary the probability of sampling frames from the dominant vehicle subset at #V/#P = 90/10 and #V/#P = 95/5 (the default is 50%, i.e., equal probability). Towards the right side of the figure, when we sample predominantly from the vehicle subset even though the labeled vehicle sequences are already dominant, the performance on the minority pedestrian class worsens drastically. Towards the left side, when we sample predominantly from the minority pedestrian subset, the model sees too many repetitions of the pedestrian data and too few passes over the vehicle data, degrading the performance on both tasks. It is therefore critical to find the sweet spot in between.

Fig. 4: Detection AP with varying dataset resampling probability. P / V in the legend stand for Pedestrian / Vehicle.
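For illustration, a minimal sketch of this equal-probability subset resampling might look as follows; the generator interface, sampling with replacement, and the parameter names are assumptions for the example, not details of our actual input pipeline.

```python
import random

def resampled_batches(vehicle_frames, pedestrian_frames, batch_size=64, p_vehicle=0.5):
    """Yield mini-batches whose examples are drawn from the vehicle-labeled or the
    pedestrian-labeled subset with probability p_vehicle / (1 - p_vehicle),
    irrespective of the relative sizes of the two subsets."""
    while True:
        batch = []
        for _ in range(batch_size):
            subset = vehicle_frames if random.random() < p_vehicle else pedestrian_frames
            batch.append(random.choice(subset))   # sample with replacement
        yield batch
```

Sweeping p_vehicle away from 0.5 corresponds to the two failure modes discussed above.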
VII. CONCLUSION

We study training a multi-class 3D object detector under the "single-class supervision" (SCS) learning setting, by proposing and benchmarking various baselines and strategies. Notably, our proposed Informed supervision combined with pseudo labeling can approach or match the full supervision upper bound, saving significant labeling costs. For future work, we plan to expand the applications of SCS. Within 3D object detection, we plan to experiment with more classes and more diverse architectures. Beyond 3D object detection, we plan to expand to other tasks such as 2D object detection, and to move from multi-class to multi-task (e.g., joint detection and segmentation).

REFERENCES

[1] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3354–3361.
[2] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, "The apolloscape dataset for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 954–960.
[3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "Semantickitti: A dataset for semantic scene understanding of lidar sequences," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9297–9307.
[4] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., "Scalability in perception for autonomous driving: Waymo open dataset," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[5] Y. Zhou and O. Tuzel, "Voxelnet: End-to-end learning for point cloud based 3d object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
[6] Y. Yan, Y. Mao, and B. Li, "Second: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, 2018.
[7] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "Lasernet: An efficient probabilistic 3d object detector for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 677–12 686.
[8] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 697–12 705.
[9] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, "End-to-end multi-view fusion for 3d object detection in lidar point clouds," in Conference on Robot Learning. PMLR, 2020, pp. 923–932.
[10] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "Pv-rcnn: Point-voxel feature set abstraction for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 529–10 538.
[11] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, "Rsn: Range sparse net for efficient, accurate lidar 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5725–5734.
[12] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in Advances in Neural Information Processing Systems, 2005, pp. 529–536.
[13] D.-H. Lee et al., "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in Workshop on Challenges in Representation Learning, ICML, vol. 3, no. 2, 2013, p. 896.
[14] S. Laine and T. Aila, "Temporal ensembling for semi-supervised learning," arXiv preprint arXiv:1610.02242, 2016.
[15] M. Sajjadi, M. Javanmardi, and T. Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," Advances in Neural Information Processing Systems, vol. 29, pp. 1163–1171, 2016.
[16] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," arXiv preprint arXiv:1703.01780, 2017.
[17] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: a regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.
[18] I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan, "Billion-scale semi-supervised learning for image classification," arXiv preprint arXiv:1905.00546, 2019.
[19] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel, "Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring," arXiv preprint arXiv:1911.09785, 2019.
[20] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, "Self-training with noisy student improves imagenet classification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 687–10 698.
[21] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, "Fixmatch: Simplifying semi-supervised learning with consistency and confidence," arXiv preprint arXiv:2001.07685, 2020.
[22] G. French, S. Laine, T. Aila, M. Mackiewicz, and G. Finlayson, "Semi-supervised semantic segmentation needs strong, varied perturbations," arXiv preprint arXiv:1906.01916, 2019.
[23] Y. Zou, Z. Zhang, H. Zhang, C.-L. Li, X. Bian, J.-B. Huang, and T. Pfister, "Pseudoseg: Designing pseudo labels for semantic segmentation," arXiv preprint arXiv:2010.09713, 2020.
[24] J. Kim, J. Jang, and H. Park, "Structured consistency loss for semi-supervised semantic segmentation," arXiv preprint arXiv:2001.04647, 2020.
[25] Y. Ouali, C. Hudelot, and M. Tami, "Semi-supervised semantic segmentation with cross-consistency training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 674–12 684.
[26] X. Chen, Y. Yuan, G. Zeng, and J. Wang, "Semi-supervised semantic segmentation with cross pseudo supervision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2613–2622.
[27] Y. Zhu, Z. Zhang, C. Wu, Z. Zhang, T. He, H. Zhang, R. Manmatha, M. Li, and A. Smola, "Improving semantic segmentation via self-training," arXiv preprint arXiv:2004.14960, 2020.
[28] Z. Feng, Q. Zhou, G. Cheng, X. Tan, J. Shi, and L. Ma, "Semi-supervised semantic segmentation via dynamic self-training and class-balanced curriculum," arXiv preprint arXiv:2004.08514, vol. 1, no. 2, p. 5, 2020.
[29] C. Rosenberg, M. Hebert, and H. Schneiderman, "Semi-supervised self-training of object detection models."
[30] J. Jeong, S. Lee, J. Kim, and N. Kwak, "Consistency-based semi-supervised learning for object detection," Advances in Neural Information Processing Systems, vol. 32, pp. 10 759–10 768, 2019.
[31] Y. S. Tang and G. H. Lee, "Transferable semi-supervised 3d object detection from rgb-d data," 2019.
[32] P. Tang, C. Ramaiah, Y. Wang, R. Xu, and C. Xiong, "Proposal learning for semi-supervised object detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2291–2301.
[33] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, and T. Pfister, "A simple semi-supervised learning framework for object detection," arXiv preprint arXiv:2005.04757, 2020.
[34] Q. Yang, X. Wei, B. Wang, X.-S. Hua, and L. Zhang, "Interactive self-training with mean teachers for semi-supervised object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5941–5950.
[35] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, "Unbiased teacher for semi-supervised object detection," in International Conference on Learning Representations, 2021.
[36] N. Zhao, T.-S. Chua, and G. H. Lee, "Sess: Self-ensembling semi-supervised 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 079–11 087.
[37] W. Wei, P. Wei, and N. Zheng, "Semantic consistency networks for 3d object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2861–2869.
[38] H. Wang, Y. Cong, O. Litany, Y. Gao, and L. J. Guibas, "3dioumatch: Leveraging iou prediction for semi-supervised 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 615–14 624.
[39] B. Caine, R. Roelofs, V. Vasudevan, J. Ngiam, Y. Chai, Z. Chen, and J. Shlens, "Pseudo-labeling for scalable 3d object detection," arXiv preprint arXiv:2103.02093, 2021.
[40] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[41] S. Shi, X. Wang, and H. P. Li, "3d object proposal generation and detection from point cloud," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 16–20.