Pseudo-labeling for Scalable 3D Object Detection

Benjamin Caine∗†, Rebecca Roelofs∗†, Vijay Vasudevan†, Jiquan Ngiam†, Yuning Chai‡, Zhifeng Chen†, Jonathon Shlens†
†Google Brain, ‡Waymo
{bencaine,rofls}@google.com
∗Denotes equal contribution and authors for correspondence.

Abstract

To safely deploy autonomous vehicles, onboard perception systems must work reliably at high accuracy across a diverse set of environments and geographies. One of the most common techniques to improve the efficacy of such systems in new domains involves collecting large labeled datasets, but such datasets can be extremely costly to obtain, especially if each new deployment geography requires additional data with expensive 3D bounding box annotations. We demonstrate that pseudo-labeling for 3D object detection is an effective way to exploit less expensive and more widely available unlabeled data, and can lead to performance gains across various architectures, data augmentation strategies, and sizes of the labeled dataset. Overall, we show that better teacher models lead to better student models, and that we can distill expensive teachers into efficient, simple students. Specifically, we demonstrate that pseudo-label-trained student models can outperform supervised models trained on 3-10 times the amount of labeled examples. Using PointPillars [24], a two-year-old architecture, as our student model, we are able to achieve state of the art accuracy simply by leveraging large quantities of pseudo-labeled data. Lastly, we show that these student models generalize better than supervised models to a new domain in which we only have unlabeled data, making pseudo-label training an effective form of unsupervised domain adaptation.

1. Introduction

Self-driving perception systems typically require sufficient human labels for all objects of interest and subsequently train machine learning systems using supervised learning techniques [48]. As a result, the autonomous vehicle industry allocates a vast amount of capital to gather large-scale human-labeled datasets in diverse environments [6, 14, 44]. However, supervised learning using human-labeled data faces a huge deployment hurdle: while the technique works well on in-domain problems, domain shifts can cause the performance to drop significantly [4, 17, 36, 45].

Class   | Geography  | Baseline | Student | ∆
Vehicle | SF/MTV/PHX | 49.1     | 58.9    | +9.8
Ped     | SF/MTV/PHX | 53.4     | 64.6    | +11.2
Vehicle | Kirkland   | 26.1     | 37.2    | +11.1
Ped     | Kirkland   | 14.5     | 27.1    | +12.6

Figure 1: Pseudo-labeling for 3D object detection. Top: Training models with pseudo-labeling consists of a three-stage training process. (1) Supervised learning is performed on a teacher model using a limited corpus of human-labeled data. (2) The teacher model generates pseudo-labels on a larger corpus of unlabeled data. (3) A student model is trained on a union of labeled and pseudo-labeled data. Bottom: Summary of key results in 3D object detection performance on the Waymo Open Dataset [44] with a PointPillars model [24]. All numbers report validation set Level 1 difficulty average precision (AP) for vehicles and pedestrians. Both baselines and student models only have access to 10% of the labeled run segments from the original Waymo Open Dataset, which consists of data from San Francisco (SF), Mountain View (MTV), and Phoenix (PHX). We use no labels from the domain adaptation challenge dataset, Kirkland.
The reliance of self-driving vehicles on supervised learning implies that the rate at which one can gather human-labeled data in novel geographies and environmental conditions limits wider adoption of the technology. Furthermore, a supervised-learning-based approach is inefficient: for example, it would not leverage human-labeled data from Paris to improve self-driving perception in Rome [50]. Unfortunately, we currently have no scalable strategy to address these limitations.

We view the scaling limitations of supervised learning as a fundamental problem, and we identify a new training paradigm for adapting self-driving vehicle perception systems to different geographies and environmental conditions in which human-labeled data is limited or unavailable. We propose leveraging ideas from the literature on semi-supervised learning (SSL), which focuses on the low label regime and boosts the performance of state-of-the-art models by leveraging unlabeled data. In particular, we employ a pseudo-labeling approach [26, 30, 39] to generate labeled data on additional datasets and find that such a strategy leads to significant boosts in performance on 3D object detection (Figure 1).

Additionally, we systematically investigate how to structure pseudo-label training to maximize model performance. We identify nuances not previously well understood in the literature for how best to implement pseudo-labeling and develop simple yet powerful recommendations for how to extract gains from it. Overall, our work demonstrates a viable method for leveraging unsupervised data – particularly from other domains – to boost state-of-the-art performance on in-domain and out-of-domain tasks. To summarize our contributions:

• We show pseudo-labeling is extremely effective for 3D object detection, and provide a systematic analysis of how to maximize its performance benefits.
• We demonstrate that pseudo-label training is effective and particularly useful for adapting to new geographical domains for autonomous vehicles.
• By optimizing the pseudo-label training pipeline (keeping both the architecture and labeled dataset fixed), we achieve state-of-the-art test set performance among comparable models, with 74.0 L1 AP for Vehicles and 69.8 L1 AP for Pedestrians, a gain of 5.4 and 1.9 AP respectively over the same supervised model.

2. Related Work

Semi-supervised learning. Semi-supervised learning (SSL) is an approach to training that typically combines a small amount of human-labeled data with a large amount of unlabeled data [27, 33, 35, 53]. Self-training refers to a style of SSL in which the predictions of a model on unlabeled data, termed pseudo-labels [26], are used as additional training data to improve performance [30, 39]. Several variants of self-training exist in the literature. Noisy-Student [52] uses a smaller, less noised teacher model to generate pseudo-labels, which are used to train a larger, noised student model, and the authors suggest performing multiple iterations of this process. FixMatch [41] combines self-training with consistency regularization [23, 38], a technique that applies random perturbations to the input or model to generate more labeled data. In prior work, self-training has been successfully applied to tasks such as speech recognition [20, 34], image segmentation [7], and 2D object detection in camera imagery [37, 42, 61] and video sequences [8].

3D object detection.
Though several architectural innovations have been proposed for 3D object detection [29, 32, 54, 56, 57, 60], a recent focus has been on techniques that improve data efficiency, or the amount of data required to reach a certain performance. Data augmentation designed for 3D point clouds can significantly boost performance (see references in [9, 28]), and techniques to automatically learn appropriate data augmentation strategies have been shown to be 10 times more data efficient than baseline 3D detection models [9, 28]. Concurrent to our work, [51] shows gains applying knowledge distillation [18] to 3D detection, distilling a multi-frame model's features to a single-frame model in feature space, whereas we apply knowledge distillation in label space. Several recent works also propose improving data efficiency by using weak supervision to augment existing labeled data: [46] incorporates existing 3D box priors to augment 2D bounding boxes, and [31] similarly generates additional 3D annotations by learning appropriate augmentations for labeled object centers. Finally, an automatic 3D bounding box labeling process is proposed by [55], which uses the full object trajectory to produce accurate bounding box predictions, though they don't show training results with these auto labels. We view many of the techniques to improve data efficiency as complementary to our work, as improvements in either model architectures or data efficiency will provide additive performance benefits.

SSL for 3D object detection. Two prior works apply Mean Teacher [47] based semi-supervised learning techniques to 3D object detection [49, 58] on the indoor RGB-D datasets ScanNet [10] and SUN RGB-D [43]. SESS [58] trains a student model with several consistency losses between the student and the EMA-based teacher model, while 3DIoUMatch [49] proposes training directly on the pseudo labels after filtering them via an IoU prediction mechanism. In contrast, we forgo a Mean-Teacher-based framework, finding separate teacher and student models to be practically advantageous, and we showcase performance on 3D LiDAR datasets designed to train self-driving car perception systems.

Domain adaptation. Robustness to geographies and environmental conditions is critical to making self-driving technology viable in the real world [48]. Recently, one group studied the task of adapting a 3D object detection architecture across self-driving vehicle datasets (e.g. [14, 19, 44]), and reported significant drops in accuracy when training on one dataset and testing on another [50]. Interestingly, such drops in accuracy could be attributed to differences in car sizes and are partially reduced by accounting for these size differences. In parallel, other recent work reports notable drops in accuracy across geographies within a single dataset [44] (see Table 9). However, unlike the former work, those drops in accuracy in this latter work cannot be accounted for by differences in car sizes.1 In our work, we experiment on this single dataset and are able to mitigate drops in accuracy across geographies. We focus on one of the open challenges for the Waymo Open Dataset:2 accurate 3D detection in a new city (Kirkland) with changing environmental conditions (rain) and limited human-labeled data.
Currently, the state-of-the-art architecture for the Kirkland domain adaptation task [11] employs a single-stage, anchor-free and NMS-free 3D point cloud object detector equipped with multiple enhancements including features from 2D camera neural networks, powerful data augmentation, frame stacking, test time ensembling, and point cloud densification (but no pseudo-labeling). We do not implement this full set of enhancements, focusing instead on the accuracy and robustness gains that can be achieved by leveraging a large amount of unlabeled data; nevertheless, our baseline implementation achieves performance similar to their baseline architecture [13].

3. Methods

Our pseudo-labeling process (Figure 2) consists of three stages: training a teacher on labeled data, pseudo-labeling unlabeled data with said teacher, and training a student on the combination of the labeled and pseudo-labeled data. We perform and evaluate all of our experiments on the Waymo Open Dataset (version 1.1) [44] and the domain adaptation extension. We implement students and teachers as PointPillars [24] models using open-source implementations,3 which are the baselines used by [9, 15, 32, 44].

3.1. Data Setup

The Waymo Open Dataset [44] is organized as a collection of run segments. Each run segment is a ∼200 frame sequence of LiDAR and camera data collected at 10 Hz. These run segments come from two sets: the original Waymo Open Dataset, which has 798 labeled training run segments collected in San Francisco, Phoenix, and Mountain View, and the domain adaptation benchmark, which has 80 labeled and 480 unlabeled training run segments from Kirkland. Both datasets contain 3D bounding boxes for Pedestrian, Vehicle, and Cyclist, but, due to the low number of Cyclists in the data, we focus on the Pedestrian and Vehicle classes.

Figure 2: Experimental setup. We conduct our experiments on the Waymo Open Dataset [44], where we artificially divide the dataset into labeled and unlabeled splits. We always treat run segments from Kirkland as unlabeled (even though a subset are labeled) and select subsets (e.g. 10%, 20%, ...) of the original Waymo Open Dataset run segments to train the teacher. We use the teacher to pseudo-label all unseen run segments, and then train a student on the union of labeled and pseudo-labeled run segments. Finally, we evaluate both teacher and student models on the original Waymo Open Dataset and Kirkland validation splits.

In our experiments, we treat all the Kirkland run segments as unlabeled data (even though labels do exist for 80 run segments). Our setup is similar to unsupervised domain adaptation, where only unlabeled data is available in the "target" domain, giving us a measure of how well the gains in accuracy on the Waymo Open Dataset generalize to a new domain.4

1 We found the average width and length of vehicles in Kirkland and the Waymo Open Dataset to be quite similar. For instance, in the validation splits of the Waymo OD and Kirkland datasets, we measured similar average lengths (4.8m vs 4.6m) and average widths (2.1m vs 2.1m) across O(10^4) objects. These discrepancies are markedly less than those described in [50].
2 https://waymo.com/open/challenges
3 https://github.com/tensorflow/lingvo/
4 In addition to geographical nuances, Kirkland has notably different weather conditions, e.g. clouds and rain, than San Francisco, Phoenix, and Mountain View.
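To make the data setup concrete, the sketch below shows one way the run-segment-level split could be organized. This is a minimal illustration under our own assumptions; the function and variable names are not from the released code.

```python
import random

def make_splits(od_segment_ids, kirkland_segment_ids, labeled_fraction, seed=0):
    """Split Waymo OD run segments into labeled/unlabeled pools.

    Run segments (not individual frames) are the unit of sampling, and all
    Kirkland run segments are always treated as unlabeled.
    """
    rng = random.Random(seed)
    od_ids = sorted(od_segment_ids)
    rng.shuffle(od_ids)
    num_labeled = int(labeled_fraction * len(od_ids))
    labeled = set(od_ids[:num_labeled])
    unlabeled = set(od_ids[num_labeled:]) | set(kirkland_segment_ids)
    return labeled, unlabeled
```

With the 798 original run segments, a 10% split along these lines yields 79 labeled segments (roughly 15,700 frames), matching the setup described in this section.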
In addition, our setup emulates a common scenario in which a practitioner has access to a large collection of unlabeled run segments and a much smaller subset of labeled run segments.

In order to study the effect of labeled dataset size, we randomly sample smaller training datasets from the Waymo Open Dataset. Because run segments are typically labeled efficiently as a sequence, we treat each run segment as either comprehensively labeled or unlabeled, and we sample based on the run segment IDs, instead of individual frames. For example, selecting 10% of the original Waymo Open Dataset corresponds to selecting 10% of the run segments, i.e. 79 run segments, which provides ∼15,700 frames. If we were to instead randomly select 10% of frames, we would make the task artificially easier, as neighboring frames would be highly correlated, especially if the autonomous vehicle is moving slowly.

3.2. Model Setup

All experiments use PointPillars [24] as a baseline architecture due to its simplicity, accuracy, and inference speed (see Appendix 6.1.1 for architecture details). To explore the impact of teacher accuracy, we use wider and multi-frame PointPillars models as teachers. To make the models wider, we multiply all channel dimensions by either 2× or 4×. To make a multi-frame teacher, we concatenate the point cloud from each frame with those of its previous N−1 frames, transformed into the last frame's coordinate system.

3.3. Training Setup

Our training setup mirrors [9, 44]. We use the Adam optimizer [21] and train with an exponential decay schedule on the learning rate. All teachers and students are trained with the same schedule, but the length of an epoch for teacher and student models differs because the teacher is trained on less data than the student. We use data augmentation strategies such as world rotation and scene mirroring, which showed strong improvement over not using augmentations. Table 1 provides an ablation study for these augmentations. Unless otherwise stated, all other training hyperparameters remain fixed between teacher and student. See Appendix 6.1.2 for additional details on the training setup.

3.4. Pseudo-Label Training

Pseudo-label training begins by training a teacher model using standard supervised learning on a labeled subset of run segments. Once we train the teacher, we select the best teacher model based on validation set performance on the Waymo Open Dataset and use it to pseudo-label the unlabeled run segments. Next, we train a student model on the same labeled data the teacher saw, plus all the pseudo-labeled run segments. The mixing ratio of labeled to pseudo-labeled data is determined by the percentage of data the teacher was trained on.

We filter the pseudo-labeled boxes to include only those with a classification score exceeding a threshold, which we select using accuracy on a validation set. We find that a classification score threshold of 0.5 works well for most models, but a small subset of models (generally multi-frame Pedestrian models, which are poorly calibrated and systematically under-confident) benefit from a lower threshold. Finally, we evaluate the student's performance on the Waymo Open Dataset and Kirkland validation sets, where we always report Level 1 (L1) average precision (AP).

Figure 3: Better teachers lead to better students. We plot Level 1 AP on the Waymo Open Dataset validation set for Vehicles. When controlling for labeled dataset size, architecture, and training setup between teachers and students, teachers with a higher AP generally produce students with a higher AP.
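As a minimal sketch of the pseudo-labeling step in Section 3.4, the snippet below filters teacher detections by classification score. The detector interface and field names are our own assumptions, not the released pipeline.

```python
import numpy as np

def pseudo_label_frame(teacher, points, score_threshold=0.5):
    """Run the teacher on one unlabeled frame and keep only confident boxes.

    `teacher.predict` is assumed to return post-NMS 7-DOF boxes, class ids,
    and classification scores. A threshold of 0.5 works for most models; the
    under-confident multi-frame Pedestrian teachers benefit from ~0.3.
    """
    boxes, classes, scores = teacher.predict(points)
    keep = scores >= score_threshold
    # The surviving detections are treated exactly like human labels downstream.
    return {
        "bboxes_3d": boxes[keep],   # [N, 7]: x, y, z, length, width, height, heading
        "labels": classes[keep],
        "is_pseudo": np.ones(int(keep.sum()), dtype=bool),
    }
```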
4. Results

Using the Vehicle class, we first explore the relationship between teacher and student performance on the Waymo Open Dataset for various teacher configurations, and then evaluate generalization to Kirkland. Next, for both Vehicles and Pedestrians, we distill increasingly larger teachers into small, efficient student models, yielding large gains in accuracy with no additional labeled data or inference cost. Finally, we scale up these experiments with two orders of magnitude more unlabeled data, further demonstrating the efficacy of pseudo-labeling. We also describe some negative results: ideas that we expected to work, but did not.

4.1. Better teachers lead to better students.

To understand how teacher performance impacts student performance, we control the accuracy of the teacher by varying the amount of labeled data, the teacher's width, and the strength of teacher training data augmentations. All experiments in this section are evaluated on Vehicles. In Figure 3, we show student-versus-teacher performance for teacher and student models with the same amount of labeled data, equivalent architectures, and equivalent training setups. In general, higher accuracy teachers produce higher accuracy students. A relevant question is then: what techniques are most effective for improving teacher accuracy? To answer this, we evaluate each modification in turn on the Waymo Open Dataset. Appendix 6.3 shows the corresponding experiments when evaluating on the Kirkland dataset.

Amount of labeled data. Compared to adding data augmentations or increasing teacher width, increasing the amount of labeled data yields the largest improvements. Figure 4 shows that increasing the fraction of labeled data improves both teacher and student performance, but the student gains diminish as the amount of unlabeled data decreases. Note that Figure 4 shows the overall percent labeled (bottom axis) and unlabeled data (top axis) when we combine the 798 Waymo Open Dataset and 560 Kirkland run segments. Using 100% of the labeled data from the Waymo Open Dataset corresponds to having roughly 59% of the overall data labeled, and we give teachers access to 10%, 20%, 30%, 50%, or 100% of the labels in the Waymo Open Dataset in this experiment, allowing us to evaluate the effect of having access to x% human labeled and (100−x)% pseudo-labeled data from the Waymo Open Dataset.

Data augmentation. We find that adding data augmentation does lead to modest additional gains, mirroring the observations in [61], as long as it is applied to both the teacher and the student. In Table 1, we show that one way to generate stronger teacher models (and thus better students) is through stronger data augmentations. Although [52] emphasizes the importance of noising the student model, we found empirically that pseudo-label training can show gains even without data augmentation (see Appendix 6.2 for full results).

Teacher width. An additional way to generate better teachers is through scaling the model size (parameter count).
Because the teacher and student are different models, they can be of different sizes, architectures, or configurations. One useful strategy involves distilling a large, expensive offline model's performance into a small, efficient production model. In Figure 5, we vary the teacher width (1×, 2×, or 4×) by multiplying all its channel dimensions while keeping the student width fixed at 1×. We evaluate performance under two different fractions of available labeled data (10% or 100% of the original Waymo Open Dataset). When only 10% of the original Waymo Open Dataset is labeled, the 1× width students outperform their wider teachers, in contrast to the findings in [52] that require the student to be equal to or larger than the teacher, suggesting that in the low labeled data regime, this may not be as important. However, when 100% of the original Waymo Open Dataset is labeled, the 1× width student can no longer outperform the 4× width teacher on the original Waymo Open Dataset.

Figure 4: Pseudo-label training is most effective when the ratio of labeled to unlabeled data is small. Teacher and student L1 AP on the Waymo Open Dataset validation set for the Vehicle class versus overall percent labeled data.

Waymo Open Dataset L1 AP
Teacher Augmentation | Teacher | Student | ∆
None                 | 56.3    | 62.2    | +5.9
FlipY                | 60.1    | 63.6    | +3.5
RotateZ              | 61.4    | 63.3    | +2.1
RotateZ + FlipY      | 63.0    | 64.2    | +1.2

Table 1: Stronger teacher augmentations lead to additive gains in student performance. We increase the strength of teacher augmentations for a 1× width teacher model trained on 100% of the Waymo Open Dataset, while fixing the student to be a 1× width model trained with both RotateZ and FlipY augmentations. We report L1 validation set Vehicle AP on the Waymo Open Dataset. ∆ = Student AP − Teacher AP.

Figure 5: Increasing teacher width leads to better students when labeled data is limited. We increase teacher width while fixing the student width at 1× and compare L1 AP for Vehicle models on the Waymo Open Dataset validation set. The teacher is trained on labeled data from either 10% or 100% of the original Waymo Open Dataset (bottom and top points, respectively). When the ratio of labeled to unlabeled data is small, student accuracy improves as the teacher gets wider. However, this effect disappears when the amount of pseudo-labeled data is small.

Ratio of labeled to unlabeled data. In our results, we found that in the setting where the ratio of labeled to unlabeled data is high – using 100% of the original Waymo Open Dataset's labels (798 segments) and only pseudo-labeling Kirkland's data (560 segments) – the student gains compared to the teacher diminish, and the student is unable to outperform a wider teacher. One hypothesis is that the lack of improvement is due to the small amount of unlabeled data that the student can benefit from. We test this hypothesis by using two orders of magnitude more unlabeled data in Section 4.4.
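For concreteness, the width scaling used for the teachers above amounts to multiplying every channel dimension in the network by a constant. A schematic configuration is sketched below; the field names are hypothetical, not the lingvo parameter names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PointPillarsConfig:
    # Base channel dimensions of the pillar feature net and backbone blocks.
    pillar_channels: int = 64
    block_channels: List[int] = field(default_factory=lambda: [64, 128, 256])
    num_input_frames: int = 1

def scale_width(cfg: PointPillarsConfig, multiplier: int) -> PointPillarsConfig:
    """Return a 2x/4x wide teacher config; student models keep multiplier = 1."""
    return PointPillarsConfig(
        pillar_channels=cfg.pillar_channels * multiplier,
        block_channels=[c * multiplier for c in cfg.block_channels],
        num_input_frames=cfg.num_input_frames,
    )
```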
4.2. Generalization to Kirkland

Figure 6: Pseudo-labeling improves performance on the unlabeled geographic domain. Stronger models on the original Waymo Open Dataset are also better on the Kirkland dataset (where we only have unlabeled data). Moreover, student models trained with pseudo-labeling generalize better to Kirkland than their conventionally supervised teacher models.

In order to measure generalization to new geographies and environmental conditions, we evaluate all models on the Kirkland domain adaptation challenge dataset. The weather in Kirkland is rainier than the weather in the cities that comprise the Waymo Open Dataset, which increases the level of noise in the LiDAR data. We plot the model's performance on the Kirkland dataset versus the model's performance on the Waymo Open Dataset for both teacher and student models in Figure 6. We observe a clear linear relationship between the model's performance on the Waymo Open Dataset and the model's performance on Kirkland, implying that a model's accuracy on the Waymo Open Dataset can almost perfectly predict its accuracy on the Kirkland dataset. Overall, the Kirkland performance is much lower than the Waymo Open Dataset performance, which we suspect is due to an underlying data distribution difference and the fact that we only use labeled data from the Waymo Open Dataset in training.

Interestingly, the slope of the linear relationship changes depending on whether the model is a teacher or student; the student models have a slightly higher slope than the teacher models, indicating that the student models are generalizing better to the Kirkland dataset. We find that the difference in slope is statistically significant by using an Analysis of Covariance (ANCOVA) test, which evaluates whether the means of our dependent variable (Kirkland AP) are equal across our categorical independent variable (whether the model is a student or not), while statistically controlling for accuracy on the Waymo Open Dataset. We find an F-statistic of 12.9, giving us a p-value less than 0.001, which is well below the 0.05 significance level, leading us to reject the null hypothesis. Since the student models have a slightly higher Kirkland AP for a given Waymo Open Dataset AP, we conclude that the student models are slightly more robust to the Kirkland distribution shift.
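The ANCOVA described above could be reproduced along the following lines. This is a sketch using statsmodels; the data frame columns and the results file are our own naming, not released artifacts.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per trained model: its Waymo OD validation AP, its Kirkland
# validation AP, and whether it is a pseudo-label-trained student.
# (Hypothetical file; columns: waymo_ap, kirkland_ap, is_student.)
df = pd.read_csv("model_results.csv")

# ANCOVA: does being a student shift Kirkland AP once Waymo OD AP is
# statistically controlled for?
model = smf.ols("kirkland_ap ~ waymo_ap + C(is_student)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F-statistic and p-value for is_student

# A difference in slope (rather than intercept) could be probed by adding an
# interaction term, e.g. "kirkland_ap ~ waymo_ap * C(is_student)".
```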
4.3. Pushing labeled data efficiency

For practitioners, an important question is "How do I make the most accurate model given a fixed inference time budget and fixed amount of labeled data?" We assume that autonomous vehicle practitioners have more unlabeled than labeled data due to the relative ease of collecting vs. comprehensively labeling data. In our experiment, we show that better teachers still lead to better students, even as we make larger, more accurate teacher models, and that distilling an expensive, impractical offline model into an efficient, practical production model via pseudo-labeling is an effective technique. Additionally, we show via strong Kirkland validation set results (a domain where we use no labeled data) that pseudo-labeling is an effective form of unsupervised domain adaptation.

We improve the teacher by both scaling its width to 4× and concatenating up to four LiDAR frames as input. As with all of our experiments, our training setup for students mirrors the teacher except that the student models are always 1× width, 1 frame. Our results are shown in Figure 7 and summarized in Figure 1. For Vehicle models, we find that distilling a 4× width, 4 frame teacher model into a 1× width, 1 frame student model, using only 10% of the original Waymo Open Dataset labels, can match or exceed the performance of an equivalent supervised model trained with 5× that amount of labeled data. Our Pedestrian model is even more remarkable: using only 10% of the original Waymo Open Dataset run segment labels, our student model outperforms an equivalent supervised baseline on Kirkland trained on 10× the amount of labels from the original Waymo Open Dataset.5 Our results show that unlabeled in-domain data can be vastly more effective than labeled data from a different domain. Additionally, in Appendix 6.4 we show that these results hold when doubling the amount of labeled data.

5 Note that we find that the 4× width, 4 frame Pedestrian models were systematically under-confident, and lowering the pseudo-label score threshold from 0.5 to 0.3 improved results.

Figure 7: Increasingly large teachers distill into small, accurate students. Increasing the width or number of frames for the teacher impacts performance on a fixed size student (1× width, 1 frame) in the low label (10% of run segments) regime for vehicles (top) and pedestrians (bottom). Training a large, expensive teacher model, and distilling the teacher into a small, efficient student is an efficient tactic. We present results for the Waymo Open Dataset (left) and Kirkland (right). Reference lines in each panel mark supervised models trained on 10%, 50%, and 100% of the original Waymo Open Dataset labels.

4.4. Pushing unlabeled dataset size

We return to our hypothesis that pseudo-labeling works best when the ratio of labeled data to unlabeled data is low. In practice, unlabeled self-driving data is plentiful, so understanding how pseudo-labeling performs as the unlabeled dataset gets significantly larger is important. To scale the size of our unlabeled dataset, we were granted access to >100x more unlabeled data from San Francisco (one of the three cities in the original Waymo Open Dataset) and Kirkland. This data contains ∼67,000 run segments from San Francisco and ∼8,000 run segments from Kirkland, as compared to the original 798 run segments from the original Waymo Open Dataset and 560 run segments from Kirkland.

Method     | # label (OD/Kir) | # pseudo (OD/Kir) | Veh L1 AP | ∆    | Ped L1 AP | ∆
baseline   | 800 / 0          | 0 / 0             | 63.0      | –    | 69.0      | –
semi-super | 800 / 0          | 0 / 560           | 64.2      | +1.2 | 69.8      | +0.8
semi-super | 800 / 0          | 0 / 8k            | 65.1      | +2.1 | 68.8      | -0.9
semi-super | 800 / 0          | 67k / 8k          | 68.8      | +5.8 | 70.5      | +1.5

Table 2: Pseudo-labeling increases accuracy in domain. The number of labels is reported in run segments. All performance numbers report validation set L1 difficulty AP for the original Waymo Open Dataset with the same 1× width, 1 frame network architecture. Only the training method varies across each experiment. ∆ indicates the difference in AP with respect to the baseline model, which is trained only on the Waymo Open Dataset (OD). Semi-supervised uses a 4× width, 4 frame teacher model trained on OD labeled data to provide pseudo-labels and then trains the student on the joint labeled and pseudo-labeled data. We include Kirkland data to show that out-of-domain data also provides gains, but not as large.
Empirically, we find that explicitly controlling the ratio of labeled to unlabeled data becomes important, as our unlabeled data otherwise overwhelms the labeled data. We train a 4× width, 4 frame teacher on the original Waymo Open Dataset, and use this teacher to pseudo-label all ∼75,000 unlabeled run segments. We then train student models with a mix of all 798 labeled original Waymo Open Dataset run segments and a subset of these new pseudo-labeled run segments. While we did not exhaustively sweep the ratio of labeled-to-unlabeled data, in general we found a 1:5 ratio to work best (except for our Pedestrian model that used all ∼75,000 run segments, which worked best with a ratio of 1:1).

Method     | # label (OD/Kir) | # pseudo (OD/Kir) | Veh L1 AP | ∆    | Ped L1 AP | ∆
baseline   | 800 / 0          | 0 / 0             | 41.8      | –    | 24.8      | –
supervised | 800 / 80         | 0 / 0             | 45.0      | +3.2 | 30.3      | +5.5
semi-super | 800 / 0          | 0 / 560           | 44.5      | +2.7 | 28.4      | +3.6
semi-super | 800 / 0          | 0 / 8k            | 48.0      | +6.2 | 29.3      | +4.5
semi-super | 800 / 0          | 67k / 8k          | 49.7      | +7.9 | 27.3      | +2.5

Table 3: Pseudo-labeling out-of-domain data outperforms supervised training on new geographies. We report validation set L1 difficulty AP for Kirkland with the same 1× width, 1 frame architecture, only varying the training method. ∆ is the difference in AP with respect to the baseline model. The baseline model is trained on the Waymo Open Dataset (OD) but tested on a distinct geography (Kirkland). Supervised is trained on labeled data from OD and the distinct geography (Kirkland). Semi-supervised uses a 4× width, 4 frame teacher model trained on OD labeled data to provide pseudo-labels, and trains on the joint labeled and pseudo-labeled data.

Our results show continued gains as we scale the amount of unlabeled data on both the Waymo Open Dataset (Table 2) and Kirkland (Table 3). Vehicle models significantly improve, with a +5.8 AP improvement on the Waymo Open Dataset validation set, and a +7.9 AP improvement on the Kirkland validation set. For Pedestrians on the validation set, we see smaller gains of +1.5 AP on the original Waymo Open Dataset and +4.5 AP on Kirkland, and more sensitivity to where the unlabeled data came from. Our analysis shows that many scenes, especially in Kirkland, have very few or zero pedestrians.6 We suspect that this introduces biases in the training process, and leave it to future work to explore how to best choose which pseudo-labeled frames to train on.

6 In the labeled validation sets, we found 70% of scenes in the original Waymo Open Dataset had Pedestrians, with an average of 12.4 per scene, whereas in Kirkland only 22% of scenes had Pedestrians, with an average of 0.57 per scene.

We confirm these gains by evaluating on the test sets in Table 4, where we achieve state of the art accuracy on both Vehicles and Pedestrians among all published single frame, LiDAR only, non-ensemble results available. We reiterate that we do not change the architecture, model hyperparameters, or training setup of the student; our only change is to add additional unlabeled data via pseudo-labeling.
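One simple way to enforce the roughly 1:5 labeled-to-pseudo-labeled ratio discussed above is to sample training examples from the two pools with fixed weights. The sketch below illustrates the idea; the pool interface is an assumption, not the pipeline we used.

```python
import random

def mix_examples(labeled_pool, pseudo_pool, labeled_weight=1.0, pseudo_weight=5.0):
    """Yield training examples with a fixed labeled:pseudo-labeled ratio.

    With weights 1:5, roughly one in six sampled examples is human labeled,
    independent of how large the pseudo-labeled pool grows.
    """
    total = labeled_weight + pseudo_weight
    while True:
        if random.random() < labeled_weight / total:
            yield random.choice(labeled_pool)
        else:
            yield random.choice(pseudo_pool)
```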
Waymo Open Dataset
Model               | Vehicle L1 AP | Vehicle L1 APH | Pedestrian L1 AP | Pedestrian L1 APH
Second [54]         | 50.1          | 49.6           | –                | –
StarNet [32]        | 63.5          | 63.0           | 67.8             | 60.1
PointPillars† [24]  | 68.6          | 68.1           | 67.9             | 55.5
SA-SSD [16]         | 70.2          | 69.5           | 57.1             | 48.8
RCD [3]             | 71.9          | 71.6           | –                | –
Ours†               | 74.0          | 73.6           | 69.8             | 57.9

Kirkland
PointPillars† [24]  | 49.3          | 48.8           | 37.5             | 29.7
Ours†               | 56.2          | 55.7           | 36.1             | 28.5

Table 4: Test set results on the Waymo Open Dataset (top) and Kirkland Dataset (bottom). We compare to other published single frame, LiDAR-only, non-ensemble methods. † indicates that both models were implemented, trained and evaluated by us, and are identical models in training setup and parameter count; the only difference is that our model was trained on ∼75k unlabeled run segments.

4.5. Negative results

Finally, we briefly touch on ideas that did not work, despite positive evidence in the literature for other tasks [52]. First, we tried two forms of soft labels, neither of which showed a gain. Second, we performed multiple iterations of training, which showed a small gain, but we deemed it too time-consuming to be worthwhile. Third, we explored whether there was an ambiguous range of classification scores within which a pseudo-labeled object should be treated as neither foreground nor background, so that anchors assigned to pseudo-label objects with these scores receive no loss. We detail our experiments in Appendix 6.6.

5. Conclusion

Our work presents the first results of applying pseudo-label training to 3D object detection for self-driving car perception. We use a simple form of pseudo-labeling that requires no architecture innovation, yet when deployed in a semi-supervised learning paradigm, leads to substantial gains over supervised learning baselines on vehicle and pedestrian detection. Most interestingly, gains persist in the presence of domain shift and new environments where building new supervised label datasets has been a barrier to safe, wide deployment. Furthermore, we identify several prescriptions for maximizing pseudo-label-based training, including the construction of better teacher model architectures and leveraging data augmentation. To summarize our main results:

• By distilling a large teacher model into a smaller student model and leveraging a large corpus of unlabeled data, we use a two-year-old architecture [24] to achieve state-of-the-art results7 of 74.0 / 69.8 L1 AP (+5.4 / +1.9 over supervised baseline) for Vehicles / Pedestrians, respectively, on the Waymo Open Dataset test set.
• Using only 10% of the labeled run segments, we show that Vehicle and Pedestrian student models can outperform equivalent supervised models trained with 3-10× as much labeled data, achieving a gain of 9.8 AP or larger for both classes and datasets.
• On the Kirkland Domain Adaptation Challenge, we show that pseudo-labeling produces more robust student models; our best model outperforms the equivalent supervised model by 7.9 / 4.5 L1 AP on the Kirkland validation set for Vehicles and Pedestrians, respectively.

Overall, our work continues a long-standing theme of adapting unsupervised and semi-supervised learning techniques to problems in domain adaptation and the low label limit [1, 2, 5, 40]. A majority of these methods have been tested on synthetic problems [5, 12] or small academic datasets [22, 25], and accordingly, such works leave open the question of how these methods may fare in the real world.
We suspect that domain adaptation in self-driving car perception may present a large-scale problem that may address such concerns and may help orient the semi-supervised learning field to a problem of critical importance for self-driving cars. Acknowledgements We would like to thank Drago Anguelov, Shuyang Cheng, Ekin Dogus Cubuk, Barret Zoph, Rapha Gontijo Lopes, Wei Han, Zhaoqi Leng, Thang Luong, Charles Qi, Pei Sun, and Yin Zhou for helpful feedback on this work. References [1] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019. 9 [2] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pages 5049–5059, 2019. 9 [3] Alex Bewley, Pei Sun, Thomas Mensink, Dragomir Anguelov, and Cristian Sminchisescu. Range conditioned dilated convo- lutions for scale invariant 3d object detection, 2020. 8 [4] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 2018. 1 [5] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In 7When compared to other single-frame, LiDAR only, non-ensemble models. Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722–3731, 2017. 9 [6] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. 1 [7] Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D Collins, Ekin D Cubuk, Barret Zoph, Hartwig Adam, and Jonathon Shlens. Leveraging semi-supervised learning in video sequences for urban scene segmentation. In European Conference on Computer Vision (ECCV), 2020. 2 [8] Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D Collins, Ekin D Cubuk, Barret Zoph, Hartwig Adam, and Jonathon Shlens. Semi-supervised learning in video sequences for urban scene segmentation. arXiv preprint arXiv:2005.10266, 2020. 2 [9] Shuyang Cheng, Zhaoqi Leng, Ekin Dogus Cubuk, Barret Zoph, Chunyan Bai, Jiquan Ngiam, Yang Song, Benjamin Caine, Vijay Vasudevan, Congcong Li, et al. Improving 3d object detection through progressive population based augmentation. arXiv preprint arXiv:2004.00831, 2020. 2, 3, 4 [10] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes, 2017. 2 [11] Zhuangzhuang Ding, Yihan Hu, Runzhou Ge, Li Huang, Sijia Chen, Yu Wang, and Jie Liao. 1st place solution for waymo open dataset challenge–3d detection and domain adaptation. arXiv preprint arXiv:2006.15505, 2020. 3 [12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marc- hand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016. 9 [13] Runzhou Ge, Zhuangzhuang Ding, Yihan Hu, Yu Wang, Sijia Chen, Li Huang, and Yuan Li. 
Afdet: Anchor free one stage 3d object detection. arXiv preprint arXiv:2006.12671, 2020. 3 [14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The Inter- national Journal of Robotics Research, 32(11):1231–1237, 2013. 1, 3 [15] Wei Han, Zhengdong Zhang, Benjamin Caine, Brandon Yang, Christoph Sprunk, Ouais Alsharif, Jiquan Ngiam, Vijay Va- sudevan, Jonathon Shlens, and Zhifeng Chen. Streaming object detection for 3-d point clouds. In European Confer- ence on Computer Vision (ECCV), 2020. 3 [16] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object de- tection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11873–11882, 2020. 8 [17] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019. 1 [18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the 9 knowledge in a neural network, 2015. 2 [19] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. arXiv preprint arXiv:2006.14480, 2020. 3 [20] Jacob Kahn, Ann Lee, and Awni Hannun. Self-training for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE, 2020. 2 [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 4 [22] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40(7):1–9, 2010. 9 [23] Samuli Laine and Timo Aila. Temporal ensembling for semi- supervised learning. arXiv preprint arXiv:1610.02242, 2016. 2 [24] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 12697–12705, 2019. 1, 3, 4, 8 [25] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010. 9 [26] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, 2013. 2 [27] Li-Jia Li and Li Fei-Fei. Optimol: automatic online picture collection via incremental model learning. IJCV, 2010. 2 [28] Ruihui Li, Xianzhi Li, Pheng-Ann Heng, and Chi-Wing Fu. Pointaugment: an auto-augmentation framework for point cloud classification. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 6378–6387, 2020. 2 [29] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furi- ous: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018. 2 [30] Geoffrey J McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365–369, 1975. 2 [31] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Luc Van Gool, and Dengxin Dai. 
Weakly supervised 3d object detection from lidar point cloud. arXiv preprint arXiv:2007.11901, 2020. 2 [32] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, et al. Starnet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019. 2, 3, 8, 15 [33] George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, 2015. 2 [34] Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629, 2020. 2 [35] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni- supervised learning. In CVPR, 2018. 2 [36] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to im- agenet? In International Conference on Machine Learning, pages 5389–5400, 2019. 1 [37] Chuck Rosenberg, Martial Hebert, and Henry Schneider- man. Semi-supervised self-training of object detection mod- els. WACV/MOTION, 2005. 2 [38] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Reg- ularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in neural information processing systems, pages 1163–1171, 2016. 2 [39] H Scudder. Probability of error of some adaptive pattern- recognition machines. IEEE Transactions on Information Theory, 1965. 2 [40] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial train- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2107–2116, 2017. 9 [41] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020. 2 [42] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020. 2 [43] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015. 2 [44] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020. 1, 3, 4 [45] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2013. 1 [46] Yew Siang Tang and Gim Hee Lee. Transferable semi- supervised 3d object detection from rgb-d data. In Proceed- ings of the IEEE International Conference on Computer Vi- sion, pages 1931–1940, 2019. 2 [47] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, 2018. 
2 [48] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, et al. Stanley: 10 The robot that won the darpa grand challenge. Journal of field Robotics, 23(9):661–692, 2006. 1, 3 [49] He Wang, Yezhen Cong, Or Litany, Yue Gao, and Leonidas J. Guibas. 3dioumatch: Leveraging iou prediction for semi- supervised 3d object detection, 2020. 2 [50] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei- Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11713–11723, 2020. 2, 3 [51] Yue Wang, Alireza Fathi, Jiajun Wu, Thomas Funkhouser, and Justin Solomon. Multi-frame to single-frame: Knowl- edge distillation for 3d object detection. arXiv preprint arXiv:2009.11859, 2020. 2 [52] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet clas- sification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 5, 8, 15 [53] I. Zeki Yalniz, Herv’e J’egou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv 1905.00546, 2019. 2 [54] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embed- ded convolutional detection. Sensors, 18(10):3337, 2018. 2, 8 [55] Bin Yang, Min Bai, Ming Liang, Wenyuan Zeng, and Raquel Urtasun. Auto4d: Learning to label 4d objects from sequential point clouds, 2021. 2 [56] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploit- ing HD maps for 3d object detection. In Conference on Robot Learning, pages 146–155, 2018. 2 [57] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real- time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018. 2 [58] Na Zhao, Tat-Seng Chua, and Gim Hee Lee. Sess: Self- ensembling semi-supervised 3d object detection, 2020. 2 [59] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detec- tion in lidar point clouds. In Conference on Robot Learning, pages 923–932, 2020. 12 [60] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018. 2 [61] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020. 2, 5 11 6. Appendix 6.1. Model and training details 6.1.1 PointPillars architecture We use the "Pedestrian" version of the PointPillars architec- ture for both the Vehicle and the Pedestrian classes, which uses a stride of 1 for the first convolutional block (instead of 2). This results in the output resolution matching the input resolution, which we found important for maintaining accuracy scaling PointPillars to larger scenes. We adopt a resolution of 512 pixels, spanning [-76.8m, 76.8m] in both X and Y, and a Z range of [-3m, 3m] giving us a pixel size of 0.33m, which is similar to what is used by [59] on the Waymo Open Dataset. 
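For concreteness, the pillar grid described above maps each point's (x, y) coordinates to a cell index roughly as follows. This is a sketch with our own variable names, not the lingvo implementation.

```python
import numpy as np

# Grid extent and resolution from Appendix 6.1.1.
X_RANGE = (-76.8, 76.8)   # meters
Y_RANGE = (-76.8, 76.8)
GRID_SIZE = 512           # output resolution matches the input resolution

def pillar_indices(points_xy):
    """Map LiDAR points to (ix, iy) pillar coordinates, dropping out-of-range points."""
    sx = (X_RANGE[1] - X_RANGE[0]) / GRID_SIZE
    sy = (Y_RANGE[1] - Y_RANGE[0]) / GRID_SIZE
    ix = np.floor((points_xy[:, 0] - X_RANGE[0]) / sx).astype(np.int32)
    iy = np.floor((points_xy[:, 1] - Y_RANGE[0]) / sy).astype(np.int32)
    valid = (ix >= 0) & (ix < GRID_SIZE) & (iy >= 0) & (iy < GRID_SIZE)
    return np.stack([ix[valid], iy[valid]], axis=-1)
```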
Additionally, on all models, we replace hard voxelization, which samples a fixed number of points per voxel, with dynamic voxelization [59], which allows the model to use all the points in the point cloud and to handle larger point clouds efficiently. Adding dynamic voxelization has a negligible effect on accuracy.

6.1.2 Training details

We use the Adam optimizer with an initial learning rate of 3.2e-3. We train for a total of 75 epochs with a batch size of 64. An exponential decay schedule of the learning rate starts at epoch 5. For models trained with 10% of the original Waymo Open Dataset labeled run segments, we double the training time, so that the total number of epochs is 150 and the exponential decay starts at epoch 10. Lastly, for the large scale experiments in Section 4.4, we train for 15 total epochs, with our exponential decay starting at epoch 2. We apply an exponential moving average (EMA) decay of 0.99 on all variables and use L2 regularization with scaling constant 1e-4.

Our anchor box prior corresponds to the mean box dimensions for each class and is [4.725, 2.079, 1.768] and [0.901, 0.857, 1.712] for Vehicles and Pedestrians respectively. Our anchors have two rotations of [0, π/2], and are placed in the middle of each voxel. In order to compute the loss function during training, we assign an anchor to a ground truth box if its IoU is greater than 0.6 for Vehicles and 0.5 for Pedestrians, and to background if the IoU is below 0.45 for Vehicles and 0.35 for Pedestrians. Anchors with an IoU between these values have a loss weight of 0, and we use force matching to make sure every ground truth box is assigned at least one anchor.

Unless specified otherwise, all students and teachers are trained with two data augmentations: RandomWorldRotationAboutZAxis and RandomFlipY. For RandomWorldRotationAboutZAxis we choose a random rotation of up to π/4 to apply to the world around the Z axis. For RandomFlipY, we flip the Y coordinate, which can be thought of as mirroring the scene over the X axis, with a probability of 0.25.
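The two augmentations could be sketched as follows for points and 7-DOF boxes. This is a simplified illustration under our own assumptions about the box parameterization ([x, y, z, l, w, h, heading]) and a symmetric rotation range.

```python
import numpy as np

def random_world_rotation_about_z(points, boxes, max_angle=np.pi / 4):
    """Rotate the whole scene (points and boxes) by a random yaw angle."""
    angle = np.random.uniform(-max_angle, max_angle)  # assuming a symmetric range
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T   # rotate box centers
    boxes[:, 6] += angle                  # shift box headings
    return points, boxes

def random_flip_y(points, boxes, prob=0.25):
    """Mirror the scene over the X axis by negating all Y coordinates."""
    if np.random.uniform() < prob:
        points[:, 1] *= -1.0
        boxes[:, 1] *= -1.0
        boxes[:, 6] *= -1.0               # heading flips sign under the mirror
    return points, boxes
```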
6.2. Is data augmentation necessary?

Teacher Aug. | Student Aug. | Teacher (OD / Kirkland AP) | Student (OD / Kirkland AP) | ∆
No           | No           | 56.3 / 33.6                | 59.2 / 38.0                | +2.9 / +4.4
No           | Yes          | 56.3 / 33.6                | 62.2 / 40.9                | +5.9 / +7.4
Yes          | No           | 63.0 / 41.8                | 59.5 / 39.5                | -3.5 / -2.3
Yes          | Yes          | 63.0 / 41.8                | 64.2 / 44.0                | +1.2 / +2.2

Table 5: Data augmentation is not necessary, but beneficial. While data augmentation is not necessary, the best student is achieved when both the student and teacher receive the same advantages. Results are on 1× width, 1 frame Vehicle models where both the teacher and student saw 100% of the original Waymo Open Dataset labeled run segments.

6.3. Kirkland results

We provide all the corresponding Kirkland validation set figures on Vehicles for Section 4.1. We show that all of our results shown for the Waymo Open Dataset still hold when we evaluate on Kirkland.

Better teachers lead to better students. Similar to Figure 3, Figure 8 shows that improving the teacher accuracy leads to a corresponding increase in student accuracy.

Figure 8: Kirkland evaluation: Better teachers lead to better students. If the student model has an equivalent architecture or training setup compared to the teacher, teachers with a higher AP produce students with a higher AP. All numbers are Vehicle models reporting Level 1 AP on the Kirkland validation split.

Amount of labeled data. Again mirroring our results in Figure 4, Figure 9 shows that increasing the amount of labeled data increases both the teacher and student performance. We also see similar (though less severe) diminishing returns as the ratio of labeled to unlabeled data gets larger. Both teachers and students are 1× width, 1 frame in these experiments.

Figure 9: Kirkland evaluation: Pseudo-label training is most effective when the ratio of labeled to unlabeled data is small. Teacher and student L1 AP on the Kirkland validation set for the Vehicle class versus percent labeled data.

Data augmentations. We also show the equivalent of Table 1 when evaluating on Kirkland in Table 6.

Kirkland L1 AP
Teacher Augmentation | Teacher | Student | ∆
None                 | 33.6    | 40.9    | +7.3
RotateZ              | 39.7    | 43.0    | +3.3
FlipY                | 36.9    | 44.4    | +7.5
RotateZ + FlipY      | 41.8    | 44.0    | +2.2

Table 6: Increasing the strength of teacher augmentations leads to additive gains in student performance. We increase the strength of teacher augmentations for a 1× width teacher model trained on 100% of the Waymo Open Dataset. The student model is a fixed 1× width model trained with both RotateZ and FlipY augmentations. We report L1 validation set Vehicle AP on the Kirkland dataset. ∆ is the difference in AP between the student model and the teacher model.

Teacher width. Figure 10 shows the effect of increasing teacher width for teachers trained on either 10% or 100% of the Waymo Open Dataset. We see the same result as in Figure 5: when we hold the student configuration fixed at 1× width, 1 frame, increasing the teacher width leads to an increase in student accuracy in the low data regime (10% of original Waymo Open Dataset run segments). We see for Kirkland that, similar to the original Waymo Open Dataset, when the ratio of labeled to unlabeled data is large (100% of original Waymo Open Dataset run segments), this effect disappears.

Figure 10: Kirkland evaluation: Increasing teacher width leads to better students. We make teachers wider while fixing the student width at 1× and report L1 AP for Vehicle models on the Kirkland validation set. The teacher is trained on labeled data from either 10% or 100% of the original Waymo Open Dataset (top and bottom points, respectively). The student is trained on the labeled data seen by the teacher plus all unlabeled data from the Waymo Open Dataset and the Kirkland dataset. When the ratio of labeled to unlabeled data is small, the student accuracy improves as the teacher gets wider; however, this effect disappears when the amount of pseudo-labeled data is small. We further investigate this by adding more unlabeled data in Section 4.4.

6.4. Pushing labeled data efficiency

Here we provide additional results where we push the accuracy of our student models on a limited amount of data organized as run segments. We replicated the experiment shown in Figure 7 using 20% of the original Waymo Open Dataset run segments (so ∼11.4% of the overall run segments are labeled), and show similar gains. Additionally, we provide raw numerical values for all data points from these 8 plots in Table 7 and Table 8, to allow others to compare against us.
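The multi-frame teachers in Tables 7 and 8 follow the procedure from Section 3.2: earlier point clouds are transformed into the latest frame's coordinate system and concatenated. A sketch of that step is below; the use of 4x4 world-from-vehicle poses and the point layout are our assumptions.

```python
import numpy as np

def aggregate_frames(point_clouds, vehicle_poses):
    """Concatenate the last N frames in the coordinate system of the newest frame.

    point_clouds: list of [Ni, 3] xyz arrays, ordered oldest to newest.
    vehicle_poses: list of [4, 4] world-from-vehicle transforms, one per frame.
    """
    latest_from_world = np.linalg.inv(vehicle_poses[-1])
    merged = []
    for points, world_from_frame in zip(point_clouds, vehicle_poses):
        latest_from_frame = latest_from_world @ world_from_frame
        homogeneous = np.concatenate(
            [points, np.ones((points.shape[0], 1))], axis=-1)
        merged.append((homogeneous @ latest_from_frame.T)[:, :3])
    return np.concatenate(merged, axis=0)
```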
[Figure 11 panels plot validation L1 AP against teacher configuration (1x width/1 frame, 4x width/1 frame, 4x width/4 frames), with horizontal reference lines marking supervised baselines trained on 20%, 50%, and 100% of OD labels.]

Figure 11: Increasingly large teachers distill into small, accurate students. Increasing the width or number of frames of the teacher improves the performance of a fixed-size student (1× width, 1 frame) in the low-label (20% of run segments) regime for vehicles (top) and pedestrians (bottom). Training a large, expensive teacher model and distilling it into a small, efficient student is an effective tactic. Results are presented for the Waymo Open Dataset (left) and Kirkland (right).

Teacher model     | Student model     | % OD Labels | Teacher AP  | Baseline AP | Student AP  | ∆ vs. Baseline
1x Width, 1 Frame | 1x Width, 1 Frame | 10          | 49.1 / 26.1 | 49.1 / 26.1 | 54.6 / 33.5 | +5.5 / +7.4
4x Width, 1 Frame | 1x Width, 1 Frame | 10          | 52.2 / 28.7 | 49.1 / 26.1 | 57.7 / 35.3 | +8.6 / +9.2
4x Width, 4 Frame | 1x Width, 1 Frame | 10          | 54.1 / 30.4 | 49.1 / 26.1 | 58.9 / 37.2 | +9.8 / +11.1
1x Width, 1 Frame | 1x Width, 1 Frame | 20          | 53.5 / 33.1 | 53.5 / 33.1 | 59.0 / 40.1 | +5.5 / +7.0
4x Width, 1 Frame | 1x Width, 1 Frame | 20          | 58.6 / 38.4 | 53.5 / 33.1 | 61.1 / 43.1 | +7.6 / +10.0
4x Width, 4 Frame | 1x Width, 1 Frame | 20          | 60.0 / 39.8 | 53.5 / 33.1 | 61.2 / 44.2 | +7.7 / +11.1
1x Width, 1 Frame | -                 | 30          | 56.4 / 36.0 | -           | -           | -
1x Width, 1 Frame | -                 | 50          | 57.7 / 37.0 | -           | -           | -
1x Width, 1 Frame | -                 | 100         | 63.0 / 41.8 | -           | -           | -
(All AP values are OD / Kirkland L1 AP.)

Table 7: Vehicle results for single frame, normal width student models trained with increasingly complex (wider, multi-frame) teacher models. We show how it is advantageous to distill a complex, off-board model into a simple onboard model using pseudo-labeling. All numbers are on the corresponding validation set and are Level 1 difficulty mean average precision (AP).

Teacher model     | Student model     | % OD Labels | Teacher AP  | Baseline AP | Student AP  | ∆ vs. Baseline
1x Width, 1 Frame | 1x Width, 1 Frame | 10          | 53.4 / 14.5 | 53.4 / 14.5 | 58.8 / 19.1 | +5.4 / +4.6
4x Width, 1 Frame | 1x Width, 1 Frame | 10          | 59.2 / 21.5 | 53.4 / 14.5 | 61.4 / 20.9 | +8.0 / +6.4
4x Width, 4 Frame | 1x Width, 1 Frame | 10          | 64.0 / 27.6 | 53.4 / 14.5 | 64.6 / 27.1 | +11.2 / +12.6
1x Width, 1 Frame | 1x Width, 1 Frame | 20          | 59.2 / 16.0 | 59.2 / 16.0 | 61.7 / 20.3 | +2.5 / +4.3
4x Width, 1 Frame | 1x Width, 1 Frame | 20          | 64.4 / 22.5 | 59.2 / 16.0 | 65.4 / 20.4 | +6.2 / +4.4
4x Width, 4 Frame | 1x Width, 1 Frame | 20          | 68.8 / 30.8 | 59.2 / 16.0 | 66.8 / 26.0 | +7.6 / +10.0
1x Width, 1 Frame | -                 | 30          | 62.3 / 23.3 | -           | -           | -
1x Width, 1 Frame | -                 | 50          | 66.6 / 25.3 | -           | -           | -
1x Width, 1 Frame | -                 | 100         | 69.0 / 24.8 | -           | -           | -
(All AP values are OD / Kirkland L1 AP.)

Table 8: Pedestrian results for single frame, normal width student models trained with increasingly complex (wider, multi-frame) teacher models. We show how it is advantageous to distill a complex, off-board model into a simple onboard model using pseudo-labeling. All numbers are on the corresponding validation set and are Level 1 difficulty mean average precision (AP).

6.5. Different Teacher and Student Architectures

In the main text, we show that the teacher and student can use different configurations, and in fact using a larger teacher is an effective way to generate significantly stronger, small student models.
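To make the recipe behind these tables concrete, the sketch below shows the data flow of distilling a large teacher into a small student via pseudo-labels. The helper callables (train_teacher, train_student), the data types, and the score threshold are hypothetical stand-ins rather than our actual training code; in particular, the exact pseudo-label filtering may differ.

```python
# Minimal sketch of the teacher -> pseudo-labels -> student recipe.
# Only the data flow is meant to be faithful; all names here are assumptions.
from typing import Callable, List, Sequence, Tuple

Scene = dict                              # a point cloud plus metadata
Detection = Tuple[list, int, float]       # (box parameters, class id, score)
Detector = Callable[[Scene], List[Detection]]
LabeledScene = Tuple[Scene, List[Detection]]

def distill_with_pseudo_labels(
    train_teacher: Callable[[List[LabeledScene]], Detector],
    train_student: Callable[[List[LabeledScene]], Detector],
    labeled: List[LabeledScene],
    unlabeled: Sequence[Scene],
    score_threshold: float = 0.5,         # assumed filter on teacher confidence
) -> Detector:
    """Train a large teacher on the labeled split, pseudo-label the unlabeled
    split with it, then train a smaller (or architecturally different) student
    on the union of labeled and pseudo-labeled scenes."""
    teacher = train_teacher(labeled)
    pseudo_labeled = [
        (scene, [det for det in teacher(scene) if det[2] >= score_threshold])
        for scene in unlabeled
    ]
    return train_student(labeled + pseudo_labeled)
```

Nothing in this sketch assumes the student shares the teacher's architecture, which is exactly the property exercised by the cross-architecture experiment below.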
One remaining question is whether the teacher and student architectures need to be from the same architecture family, or even similar in their data representation. To test this, we design a very simple experiment: we take our best PointPillars teacher model trained on 10% of the original Waymo OD run segments (the exact model used in Figures 1 & 7) and use it to pseudo-label the remaining Waymo Open Dataset. We then train a StarNet [32] student model on the union of the 10% labeled run segments and the remaining data pseudo-labeled by PointPillars. We chose StarNet because it is a purely point-cloud-based, convolution-free object detection system, which differs significantly from PointPillars' convolution-based architecture. Results are summarized in Table 9, which shows strong gains in StarNet accuracy when using a PointPillars teacher.

Model                | Vehicle L1 AP | ∆    | Pedestrian L1 AP | ∆
Waymo Open Dataset
StarNet 10% Baseline | 47.7          | -    | 61.2             | -
StarNet Student      | 55.6          | +7.9 | 66.5             | +4.3
Kirkland
StarNet 10% Baseline | 26.3          | -    | 6.7              | -
StarNet Student      | 35.2          | +8.9 | 22.2             | +15.5

Table 9: Pseudo-labeling is effective across very different architectures: We distill a 4× width, 4 frame PointPillars teacher model into a single frame StarNet model and see large gains in StarNet performance, despite it being an extremely different architecture.

6.6. Negative result details

In this section, we provide more details about our negative results.

Soft-labels: We explored two forms of soft labels: one used the post-sigmoid score, bounded between [0, 1], as the target, and the other used the logit itself. In object detection, because the outputs are passed through Non-Maximum Suppression (NMS), we only have scores and logits for foreground locations; therefore, background anchors were all assigned a score or logit of 1. We found both techniques resulted in slightly worse performance than simply using hard labels.

Multiple iterations: We tried multiple-iteration training, where we used the best student checkpoint to re-pseudo-label the unlabeled data, and used that updated pseudo-labeled data to train a new student. While our trend thus far has shown that better teachers lead to better students, it is challenging to combat the overfitting that naturally occurs. It is our understanding that this is one of the main reasons one wants to heavily noise the student [52], but we found it difficult to find a noise level (via augmentations) that did not hamper model performance so much that the second iteration became worse. With default settings, using the same augmentations for the teacher, the first student, and the second student, we found a small gain in performance when using 10% of the original Waymo Open Dataset run segments: ∼0.2 AP on the original validation set and ∼1.0 AP on the Kirkland validation set. Because of the small gains compared to the first iteration, and the time-consuming nature of these experiments, we left further exploration to future work.
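For clarity, the loop we mean by multiple-iteration training is sketched below; train_fn is a hypothetical callable that trains a detector from (scene, detections) pairs and returns a scene-to-detections function, and the score threshold is again an assumption of the sketch.

```python
# Sketch of multiple-iteration pseudo-label training: each newly trained
# student re-labels the unlabeled data and acts as the teacher for the next
# round. Names and the filtering rule are illustrative assumptions.
def iterative_pseudo_labeling(train_fn, labeled, unlabeled,
                              num_iterations=2, score_threshold=0.5):
    model = train_fn(labeled)                  # round 0: supervised teacher
    for _ in range(num_iterations):
        pseudo = [
            (scene, [det for det in model(scene) if det[2] >= score_threshold])
            for scene in unlabeled
        ]
        model = train_fn(labeled + pseudo)     # new student becomes next teacher
    return model
```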
Score thresholds: We wondered whether there may be some classification score range for pseudo-labels in which the class is ambiguous and we should assign no loss. We allowed anchors to be assigned a loss of zero if they matched (via normal IoU matching) pseudo-label objects with scores between some [lower, upper] range. We then swept these two values and found the most effective results were when both values were 0.5, i.e., [0.5, 0.5], indicating this setting should be turned off. That said, we think the idea of limiting the noise induced by bad pseudo-labels merits future investigation.
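A minimal sketch of this loss-masking idea follows, assuming anchors have already been matched to pseudo-label boxes via IoU and that matched_scores holds the teacher's confidence for each matched anchor; the names, array layout, and open-interval boundary handling are assumptions of the sketch rather than our implementation.

```python
# Illustrative per-anchor loss mask for the score-threshold experiment: anchors
# matched to pseudo-labels whose teacher score falls strictly inside
# (lower, upper) contribute no classification loss. With lower == upper == 0.5
# the ignore region is empty, i.e., the setting is effectively turned off.
import numpy as np

def classification_loss_weights(matched_scores, is_matched, lower=0.5, upper=0.5):
    weights = np.ones_like(matched_scores, dtype=np.float32)
    ambiguous = is_matched & (matched_scores > lower) & (matched_scores < upper)
    weights[ambiguous] = 0.0
    return weights
```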