LidarAugment: Searching for Scalable 3D LiDAR Data Augmentations

Zhaoqi Leng1*, Guowang Li1, Chenxi Liu1, Ekin Dogus Cubuk2, Pei Sun1, Tong He1, Dragomir Anguelov1 and Mingxing Tan1

1 Waymo Research, 2 Google Brain, * lengzhaoqi@waymo.com

Abstract—Data augmentations are important in training high-performance 3D object detectors for point clouds. Despite recent efforts on designing new data augmentations, perhaps surprisingly, most state-of-the-art 3D detectors only use a few simple data augmentations. In particular, different from 2D image data augmentations, 3D data augmentations need to account for different representations of input data and need to be customized for different models, which introduces significant overhead. In this paper, we resort to a search-based approach, and propose LidarAugment, a practical and effective data augmentation strategy for 3D object detection. Unlike previous approaches where all augmentation policies are tuned in an exponentially large search space, we propose to factorize and align the search space of each data augmentation, which cuts down the 20+ hyperparameters to 2 and significantly reduces the search complexity. We show LidarAugment can be customized for different model architectures with different input representations by a simple 2D grid search, and consistently improves both convolution-based UPillars/StarNet/RSN and transformer-based SWFormer. Furthermore, LidarAugment mitigates overfitting and allows us to scale up 3D detectors to much larger capacity. In particular, by combining with the latest 3D detectors, our LidarAugment achieves a new state-of-the-art 74.8 mAPH L2 on Waymo Open Dataset.

[Figure 1: bar chart of 3D mAPH L2 for UPillars and UPillars-L, comparing Baseline (57.8 and 60.0) with LidarAugment (63.7 and 71.0).]

Fig. 1: Model scaling with LidarAugment on Waymo Open Dataset. Baseline augmentations are from the prior art of [14]. When scaling up UPillars to UPillars-L, our LidarAugment improves both models, and the gains are more significant for the larger model, thanks to its customizable regularization. More results in Table IV.

I. INTRODUCTION

Data augmentations are widely used in training deep neural networks. In particular, for autonomous driving, many data augmentations are developed to improve data efficiency and model generalization. However, most recent 3D object detectors only use a few basic data augmentation operations such as rotation, flip and ground-truth sampling [1], [2], [3], [4], [5], [6], [7]. This is in surprising contrast to 2D image recognition and detection, where much more sophisticated 2D data augmentations are commonly used in modern image-based models [8], [9], [10], [11], [12], [13]. In this paper, we aim to answer: is it practical to adopt more advanced 3D data augmentations to improve modern 3D object detectors, especially for high-capacity models?

The main challenge of adopting advanced 3D data augmentations is that 3D augmentations are often sensitive to input representations and model capacity. For example, range image based models and point cloud based models require different types of data augmentation due to their different input representations. High-capacity 3D detectors are typically prone to overfitting and require stronger overall data augmentation compared to lite models with fewer parameters. Therefore, tailoring each 3D augmentation for different models is necessary. However, the search space scales exponentially with respect to the number of hyperparameters, which leads to significant search cost. Recent studies [15], [16] attempt to address these challenges by using efficient search algorithms.
Those approaches typically construct a fixed search space and run a complex search algorithm (such as population-based search [17]) to find a data augmentation strategy for a model. However, our studies reveal that the search spaces used in prior works are suboptimal. Despite having complex search algorithms, without a systematic way to define a good search space, we cannot unleash the potential of a model.

In this paper, we propose LidarAugment, a simplified search-based approach for 3D data augmentations. Unlike previous methods that rely on complex search algorithms to explore an exponentially large search space, our approach aims to define a simplified search space that contains a variety of data augmentations but has minimal (i.e., two) hyperparameters, such that users can easily customize a diverse set of 3D data augmentations for different models. Specifically, we construct the LidarAugment search space by first factorizing a large search space based on operations and exploring each sub search space with a per-operation search. Then, we normalize and align the sub search space of each data augmentation to form the LidarAugment search space. The final LidarAugment search space contains only two shared hyperparameters: m ∈ [0, ∞) controls the normalized magnitude and p ∈ [0, 1] controls the probability of applying each data augmentation policy. Our LidarAugment search space significantly simplifies prior works [15] by cutting down the number of hyperparameters to two, a 15× reduction in the number of hyperparameters.

Despite only having two hyperparameters, our LidarAugment search space contains a variety of existing 3D data augmentations, such as drop/paste 3D bounding boxes, rotate/scale/dropping points, and copy-paste objects and backgrounds. In addition, LidarAugment supports coherent augmentation across both point and range view representations, which generalizes to multi-view 3D detectors.

[Figure 2: (a) point-cloud visualizations of eleven panels: (1) Original, (2) Global Rotate, (3) Global Scale, (4) Global Translate, (5) Global Flip, (6) Global Drop, (7) Frustum Drop, (8) Frustum Noise, (9) Drop Box, (10) Paste Box, (11) Swap Background; (b) point clouds and range images before/after occlusion handling.]

Fig. 2: Visualizing LidarAugment. (a) All data augmentation operations used in LidarAugment. For non-global operations, we highlight the augmented parts in red (boxes). (b) Occlusion introduced by data augmentation, e.g., pasting a car object, is handled by removing overlapping rays in the range view based on distance. We show point clouds and the corresponding range images with (bottom)/without (top) removing overlapping rays in the range view.

We perform extensive experiments on the Waymo Open Dataset [18] and demonstrate LidarAugment is effective and generalizes well to different model architectures (convolution-based and transformer-based), different input views (3D point view and range image), and different temporal scales (single and multi frames). Notably, LidarAugment advances the state-of-the-art (SOTA) transformer-based SWFormer by 1.4 mAPH on the test set. Furthermore, LidarAugment provides customizable regularization, which allows us to scale up 3D object detectors to much higher capacity without overfitting. As summarized in Figure 1, LidarAugment consistently improves UPillars models, and the performance gains are particularly large for high-capacity models.
Our contributions can be summarized as:

1) New insight: we reveal that common 3D data augmentation search spaces are suboptimal and should be tailored for different models.
2) LidarAugment: we propose the LidarAugment search space, which supports jointly optimizing 10 augmentation policies with only two hyperparameters (a 15× reduction compared to prior works), offering diverse yet practical augmentations. In addition, we develop a new method to coherently augment both point and range-view input representations.
3) State-of-the-art performance: LidarAugment consistently improves both convolution-based UPillars/StarNet/RSN and attention-based SWFormer. With LidarAugment, we achieve new state-of-the-art results on Waymo Open Dataset. In addition, LidarAugment enables model scaling to achieve much better quality for high-capacity 3D detectors.

II. RELATED WORKS

Data augmentation. Data augmentation is widely used in training deep neural networks. In particular, for 3D object detection from point clouds, several global and local data augmentations, such as rotation, flip, pasting objects, and frustum noise, are used to improve model performance [19], [1], [20], [2], [4], [21], [15], [22], [23], [24]. However, as 3D data augmentations are sensitive to model architectures and capacity, extensive manual tuning is often required to use these augmentations. Therefore, most existing 3D object detectors [2], [6], [25], [26], [14] only adopt a few simple augmentations, such as flip and shift pixels.

Several recent works attempt to use range images for multi-view 3D detection, but very few augmentations are developed for range images. [5] attempts to paste objects in the range image without handling occlusions. Our PasteBox augmentation supports coherently augmenting both range-view and point-view input data while handling occluded objects in a simple way (more details in Figure 2), which enables more realistic augmented scenes and enriches the data augmentations for multi-view 3D detectors.

Learning data augmentation policies. Designing good data augmentation normally requires manual tuning and domain expertise. Several search-based approaches have been proposed for 2D images, such as AutoAugment [9], RandAugment [12], and Fast AutoAugment [27]. Our LidarAugment is inspired by RandAugment in the sense that we both try to construct a simplified search space. However, unlike 2D image augmentations, where one search space works well for many models, we reveal that existing search spaces for 3D detection tasks are suboptimal, which motivates us to propose the first systematic method to define search spaces for 3D detection tasks.

On the other hand, for 3D detection, PPBA [15] and PointAugment [16] propose efficient learning-based data augmentation frameworks for 3D point clouds. However, both works require users to run a complex algorithm on an exponentially large but not well-designed search space. In contrast, our work provides a systematic framework to design a simple and more effective search space with only two hyperparameters.
III. LIDARAUGMENT

In this section, we first introduce the data augmentation policies used in LidarAugment. Next, we analyze the performance of each data augmentation policy on Waymo Open Dataset [18]. Finally, we propose a systematic approach to progressively design the 3D augmentation search space.

A. Data augmentations for point clouds and range images.

3D point clouds and 2D range images are two different representations of LiDAR data. Despite being the native representation of LiDAR data, data augmentation for range images is not well studied compared to point clouds. Here, we revisit data augmentations for point clouds, and introduce a new method for coherently applying data augmentation to both point clouds and range images.

Augmenting point clouds. We follow the implementation of the data augmentation policies described in recent studies [1], [15], [28], which contain global operations (rotate, scale, translate, flip, and drop points) and local operations (drop boxes, paste boxes, swap background, drop points and add feature noise in a frustum), shown in Figure 2 (a).

Augmenting range images. Different from the sparse 3D point representation, pixels in a range image are compact. Data augmentations such as pasting objects and swapping background disturb the compact structure of the range representation. Here, we propose a novel approach to coherently augment both the 3D point view and the 2D range view by leveraging the bijective property between point clouds and range images, while accounting for occlusion.

First, we transform the range image pixels to a point cloud based on their (x, y, z) coordinates. To preserve the bijective mapping between a pixel in a range image and a point in the corresponding point cloud, we concatenate the (row, column) index of each pixel in the range image as additional features before scattering pixels to 3D. After performing data augmentation in the point representation, we transform the augmented point cloud back to the range view by scattering each point to a pixel in a 2D image based on its (row, column) index.

Leveraging the compactness of range images. Coherently augmenting both range and point views leads to more realistic augmented scenes. Because each pixel in a range image corresponds to a unique ray from the LiDAR, overlapping pixels in the range view represent the same light ray penetrating through multiple surfaces. When this happens, we compare the distances among the overlapping pixels in the range view and keep the pixel that is closest to the ego vehicle. This effectively removes occluded points in both the range and point views, as shown in Figure 2 (b).
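The round trip between the two views can be sketched in a few lines of NumPy. This is only an illustrative sketch of the idea, not the authors' implementation: the function names, the [H, W, C] range-image layout with range in channel 0, and the availability of precomputed per-pixel (x, y, z) coordinates (pixel_xyz) are assumptions made for the example.

```python
import numpy as np

def range_image_to_points(range_image, pixel_xyz):
    """Lift an [H, W, C] range image (channel 0 = range) to a point set.

    pixel_xyz: [H, W, 3] Cartesian coordinates of each pixel, assumed to be
    precomputed from the LiDAR calibration. Each point carries its (row, col)
    index as extra features so the pixel <-> point mapping survives augmentation.
    """
    h, w, _ = range_image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    points = np.concatenate(
        [pixel_xyz,                              # x, y, z
         rows[..., None].astype(np.float32),     # row index
         cols[..., None].astype(np.float32),     # column index
         range_image[..., 1:]],                  # remaining per-pixel features
        axis=-1)
    valid = range_image[..., 0] > 0              # keep pixels with a valid return
    return points[valid]

def points_to_range_image(points, shape, num_features):
    """Scatter (augmented) points back to the range view.

    When several points land on the same pixel (e.g. after pasting an object),
    only the point closest to the sensor is kept, which removes occluded
    geometry from both the range view and the point view.
    """
    h, w = shape
    range_image = np.full((h, w, 1 + num_features), np.inf, dtype=np.float32)
    ranges = np.linalg.norm(points[:, :3], axis=-1)
    rows = points[:, 3].astype(int)
    cols = points[:, 4].astype(int)
    for i in range(points.shape[0]):
        r, c = rows[i], cols[i]
        if ranges[i] < range_image[r, c, 0]:     # closer than the current return?
            range_image[r, c, 0] = ranges[i]
            range_image[r, c, 1:] = points[i, 5:]
    range_image[np.isinf(range_image)] = 0.0     # pixels with no return
    return range_image
```

In this sketch, points pasted from another scene would need their (row, col) indices recomputed from their new positions; the distance comparison is what implements the paper's occlusion rule of keeping only the nearest return per ray.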
B. Effects of each data augmentation.

In this section, we assess the effect of each data augmentation policy on Waymo Open Dataset [18]. To benchmark the policies, we develop a UPillars architecture, which is based on the popular PointPillars [2], but incorporates recent optimizations in architecture design, i.e., a U-Net backbone [29] and a CenterNet detection head [30].

Datasets and training. Waymo Open Dataset [18] contains 798 training and 202 validation sequences. For the following studies, we train UPillars with batch size 64, the Adam optimizer [31], and a cosine decay learning rate schedule with max learning rate 3e-3 and 80000 total steps.

Effect of each data augmentation. We factorize the LidarAugment search space into per-policy sub search spaces and show the UPillars performance when trained using only one policy on Waymo Open Dataset (WOD) in Table I. Interestingly, on WOD, the most effective data augmentation technique is global rotation, whereas on KITTI [32], pasting ground-truth bounding boxes is commonly regarded as the most effective data augmentation [1]. A closer look at the statistics of the two datasets reveals that, on average, each KITTI LiDAR frame contains about five objects, whereas each frame in WOD on average contains more than 50 objects. Thus, pasting ground-truth objects has a larger impact on KITTI than on WOD, due to KITTI's significantly lower object density. On the other hand, a smaller global rotation angle of π/4 is commonly used when training on the KITTI dataset, but we find a much stronger rotation of π is preferred for WOD.

Policy | Hyperparameters | Search space | WOD (Veh./Ped.) mAP L1
NoAug | - | - | 60.2
DropBox | Probability; Number of boxes | p/p; 2m/2.8m | 66.0 (+5.8)
PasteBox | Probability; Number of boxes | 1.4p/p; 3.2m/4.4m | 66.6 (+6.4)
SwapBackground | Probability | 0.6p | 63.6 (+3.4)
GlobalRot | Probability; Max rotation angle | 1.4p; 0.22πm | 73.3 (+13.1)
GlobalScale | Probability; Scaling factor | p; 0.036m | 66.0 (+5.8)
GlobalDrop | Probability; Drop ratio | p; 1−0.18m | 64.9 (+4.7)
FrustumDrop | Probability; Theta angle width; Phi angle width; R distance; Drop ratio | p; 0.1πm; 0.1πm; 75−7.5m; 1−0.1m | 64.1 (+3.9)
FrustumNoise | Probability; Theta angle width; Phi angle width; R distance; Max noise level | 0.6p; 0.14πm; 0.14πm; 75−10.5m; 0.14m | 65.1 (+4.9)
GlobalTranslate | Probability; Stdev. of noise (x, y) | 1.4p; 0.66m | 67.5 (+7.3)
GlobalFlip | Probability | p | 69.0 (+8.8)

TABLE I: Aligned search spaces and performance. The search space of each hyperparameter for Waymo Open Dataset (WOD) for UPillars is listed. (p, m) are the two global hyperparameters that control all data augmentation policies. After aligning the search space, the optimal (p, m) for each data augmentation is (0.5, 5). The probability of each policy is clipped to [0, 1]. The min R distance is clipped to 0. The maximum rotation angle is clipped to [0, π]. The maximum flip probability is clipped to 0.5. The ratio of dropped points is clipped to [0, 0.8]. The theta angle and phi angle are clipped to [0, π] and [0, 2π], respectively.

[Figure 3: UPillars block diagram — raw points are processed by MLP0 and MLP1, voxelized into pillars, passed through a Res U-Net with stages C0×B0 ... C6×B6, and decoded by a CenterNet heatmap, box regression, and other heads.]

Configs | MLP0,1 | C0,1,...,6 | B0,1,...,6
U-Pillars | 128, 128 | 128, 128, 256, 512, 512, 256, 256 | 1, 6, 2, 2, 1, 1, 1
U-Pillars-L | 128, 256 | 256, 256, 384, 768, 768, 384, 384 | 1, 8, 4, 4, 2, 2, 2

Fig. 3: UPillars architecture. Input points are processed by two fully connected layers with channel sizes (MLP0, MLP1) before being voxelized into pillars. The bird-eye-view pillars are processed by a Res U-Net, where the channel size and number of blocks at each resolution are (Ci, Bi). A CenterNet detection head, box regression, and other attribute regression heads are applied to the output of the U-Net. For both models, we use the same voxel size of 0.32m and range of 81.92m.
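For reference, the two configurations in Figure 3 can be written down as a small config object. This is an illustrative sketch only; the field names (mlp_channels, unet_channels, unet_blocks) are ours and not from the paper's code, while the values come directly from the table above.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class UPillarsConfig:
    """Channel/block configuration from Figure 3 (field names are illustrative)."""
    mlp_channels: Tuple[int, int]        # per-point MLP0, MLP1 before voxelization
    unet_channels: Tuple[int, ...]       # C0..C6 of the Res U-Net
    unet_blocks: Tuple[int, ...]         # B0..B6 residual blocks per resolution
    voxel_size_m: float = 0.32           # shared by both models
    range_m: float = 81.92

UPILLARS = UPillarsConfig(
    mlp_channels=(128, 128),
    unet_channels=(128, 128, 256, 512, 512, 256, 256),
    unet_blocks=(1, 6, 2, 2, 1, 1, 1),
)

UPILLARS_L = UPillarsConfig(  # the scaled-up variant used in the scaling study
    mlp_channels=(128, 256),
    unet_channels=(256, 256, 384, 768, 768, 384, 384),
    unet_blocks=(1, 8, 4, 4, 2, 2, 2),
)
```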
C. Defining LidarAugment search space.

As indicated in the previous section, different from RandAugment for 2D images [12], naively using the same search space across different datasets is suboptimal, which is a unique challenge for 3D detection tasks. To mitigate this new challenge, we propose to factorize the whole search space and align each data augmentation based on its optimal hyperparameters.

Align the search space. Using global hyperparameters to control all data augmentations requires normalizing the search domain of each hyperparameter. Without normalization, the same global magnitude could lead to an aggressive application of one data augmentation and an insufficient application of another. To align the search domain of each data augmentation policy, we train a UPillars model on a given dataset while only applying a single data augmentation policy at a time. Based on the optimal values of the hyperparameters, we rescale the search domain of each hyperparameter in a data augmentation policy from [0, arbitrary value] to [0, optimal value], such that the optimal value for each hyperparameter corresponds to the same global magnitude or probability hyperparameter. Since each data augmentation policy contains multiple parameters, to save cost, we perform a small-scale 2D grid search to scale the probabilities and magnitudes of all hyperparameters in each sub search space.

Here, we use Global Translate as an example. We define the initial search domain for the probability of applying the Global Translate data augmentation to be {0.3, 0.5, 0.7, 0.9}, and the domain for the magnitude of the translation noise to be {0.9, 1.5, 2.1, 2.7, 3.3, 3.9}. If the optimal values are (p_noise, m_noise) = (0.7, 3.3), we rescale the search domain to (p_noise, m_noise) = (1.4p, 0.66m), such that when the global hyperparameters are (p, m) = (0.5, 5), the hyperparameters for Global Translate are optimal. Details about all the hyperparameters are listed in Table I. The LidarAugment pseudocode is shown in Figure 4.

    augmentations = [
        DropBox, PasteBox, SwapBackground, GlobalRot,
        GlobalScale, GlobalDrop, FrustumDrop,
        FrustumNoise, GlobalTranslate, GlobalFlip]

    def lidaraugment(m, p, input_frame):
        # Apply every policy with the two shared hyperparameters;
        # each policy internally rescales (m, p) to its aligned domain.
        for aug in augmentations:
            aug.set_magnitude(m)
            aug.set_probability(p)
            input_frame = aug.transform(input_frame)
        return input_frame

Fig. 4: Pseudo Python code for LidarAugment.
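A minimal sketch of this alignment and of the final 2D grid search is given below, assuming the per-policy scaling constants from Table I; the function names, the specific grid values for (p, m), and train_and_evaluate (a placeholder for a full training run) are illustrative and not from the paper.

```python
import itertools
import math

# Per-policy scaling that maps the shared (p, m) to raw hyperparameters,
# following Table I (only a subset of the 10 policies is shown here).
# E.g. Global Translate: prob = 1.4 p, stdev = 0.66 m, so the global optimum
# (p, m) = (0.5, 5) recovers its per-policy optimum (0.7, 3.3).
POLICY_SCALES = {
    "GlobalTranslate": lambda p, m: {"prob": min(1.4 * p, 1.0),
                                     "noise_stdev_xy": 0.66 * m},
    "GlobalRot":       lambda p, m: {"prob": min(1.4 * p, 1.0),
                                     "max_angle": min(0.22 * math.pi * m, math.pi)},
    "GlobalFlip":      lambda p, m: {"prob": min(p, 0.5)},
    "GlobalDrop":      lambda p, m: {"prob": min(p, 1.0),
                                     "drop_ratio": min(max(1.0 - 0.18 * m, 0.0), 0.8)},
}

def lidaraugment_config(p, m):
    """Expand the two shared hyperparameters into per-policy settings."""
    return {name: scale(p, m) for name, scale in POLICY_SCALES.items()}

def tune(train_and_evaluate, p_grid=(0.3, 0.5, 0.7), m_grid=(3.0, 5.0, 7.0)):
    """Customize LidarAugment for a new model/dataset with a small 2D grid search."""
    return max(itertools.product(p_grid, m_grid),
               key=lambda pm: train_and_evaluate(lidaraugment_config(*pm)))
```

Because every policy is tied to the same (p, m), retuning LidarAugment for a new architecture or dataset costs |p_grid| × |m_grid| training runs rather than a search over 20+ independent hyperparameters.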
Despite that, LidarAugment still outperforms competitivewiththelatestSWFormers,i.e.,theirmAPHare SWFormer by 1.9 AP, establishing a new state-of-the-art 71.0 vs. 72.8. This opens up new research opportunities on resultforsingle-modalmodelswithoutensembleortesttime exploring much larger and higher performance 3D detectors augmentation on Waymo Open Dataset. in the future. Table III compares the test-set results among latest mod- els. Compared to the latest SWFormer, our LidarAugment Veh/PedAPL1 Veh/PedAPHL2 improves the test-set L2 mAPH by 1.4 AP, outperforming UPillars UPillars-L UPillars UPillars-L all prior arts by a large margin. BaseAugment 72.1/72.3 69.3/70.3 63.5/52.1 61.0/59.0 LidarAugment 77.1/77.5 79.5/81.6 68.5/58.9 71.5/70.5 mAPH VehicleAP/APH3D PedestrianAP/APH3D Method TABLE IV: UPillars scaling results on WOD. L2 L1 L2 L1 L2 P.Pillars[2]† 55.1 68.6/68.1 60.5/60.1 68.0/55.5 61.4/50.1 CenterPoint[25] 69.1 80.2/79.7 72.2/71.8 78.3/72.1 72.2/66.4 RSN3f[6] 69.7 80.7/80.3 71.9/71.6 78.9/75.6 70.7/67.8 D. LidarAugment supports different representations PVRCNN++[34] 71.2 81.6/81.2 73.9/73.5 80.4/75.0 74.1/69.0 SSTTS3f[26] 72.9 81.0/80.6 73.1/72.7 83.1/79.4 76.7/73.1 Different from 2D image models, 3D detectors are more SWFormer[14] 73.4 82.9/82.5 75.0/74.7 82.1/78.1 75.9/72.1 SWFormer(+LA) 74.8 84.0/83.6 76.3/76.0 83.1/79.3 77.2/73.5 diverse and could utilize different input representations due to the additional dimensionality and sparsity of point cloud TABLE III: WOD test-set results. LidarAugment (LA) data. Other than UPillars and SWFormer, which are both significantlyimprovesdetectionperformanceforSWFormer, pillar-based architectures and taking 3D sparse points as and achieves new state-of-the-art mAPH L2. inputs, we further demonstrate LidarAugment generalizes to other input representations. First, StarNet [35] is a point- C. LidarAugment enables better model scaling based detector which directly processes raw points in 3D to detect objects. RSN, on the other hand, utilize multi-view Scaling up model capacity is a common approach to propertyofpointcloudsandtakesbothrangeimagesand3D achieve better performance, but large 3D object detectors sparse points as inputs. However, due to the lack of multi- often suffer from overfitting. Table IV shows scaling results, view data augmentations in prior works, RSN only utilize where UPillars-L is a larger model with more layers and two simple augmentations, i.e. random flip and rotation. channels than UPillars, detailed in Figure 3. Here, we adopt the strong data augmentations used in the latest SWFormer as Baseline (see subsection IV-B). As Model Augmentation Vehicle Pedestrian shown in Table IV, with Baseline augmentations, UPillars-L baseline 58.2 71.9 StarNet[35] does not benefit much from its significantly larger capacity. +LidarAugment 61.6 74.2 Infact,severalmetrics,suchasVeh/PedL1AP,evenbecome baseline 75.2 77.2 RSN-1frame[6] worse(e.g.69.3/70.3forUPillar-Lvs72.1/72.3forUPillars). +LidarAugment 75.8 79.0 We observe the training loss of UPillars-L is much smaller baseline 77.0 79.1 RSN-3frame[6] compared to loss of UPillars, indicating severe overfitting. +LidarAugment 77.7 80.6 On the other hand, LidarAugment achieves much better baseline 79.4 82.9 SWFormer[14] performance on larger models, especially on the most chal- +LidarAugment 80.9 84.4 lenging metric, i.e., +7.3AP for 3D L2 mAPH as shown in TABLEV:LidarAugmentimprovesvariousmodels.Start- Figure 1. 
E. Ablation studies: comparing to other approaches.

In this section, we show that LidarAugment outperforms other common data augmentation approaches on UPillars.

Manually tuned data augmentation. Because the complexity of the search space scales exponentially with the number of parameters, commonly used data augmentation strategies often consist of only a few data augmentation operations. Here, we benchmark two sets of data augmentation strategies used in training high-performance 3D detectors. First, we adopt the random flip (probability 0.5) and rotation (probability 0.5, yaw angle uniformly sampled from [−π/4, π/4]) data augmentations used in training RSN [6]. Then we benchmark the more advanced and stronger data augmentation strategy used in training SWFormer [14], detailed in subsection IV-B. Our results show both data augmentation strategies significantly improve UPillars performance, by about +10 AP for Vehicle and Pedestrian 3D L1 AP, when compared to the no-augmentation baseline, as shown in Table VI. However, tuning the data augmentation hyperparameters is challenging: e.g., if we only search 4 values for each hyperparameter, the number of searches for 5 hyperparameters exceeds 1000.

AutoML-based data augmentation. To alleviate the challenge of an exponentially large search space, population-based training has been proposed to tune data augmentation hyperparameters online [17], [11], [15]. We follow the implementation of progressive population based augmentation (PPBA) [15] and use the same sets of data augmentation policies and search space. We set the population size to 16, the generation step to 4000, and the perturbation and exploration rates to 0.2. Our results, in Table VI, show PPBA significantly outperforms the no-augmentation baseline. Although PPBA introduces significantly more data augmentation policies, it is on par with the manually tuned data augmentations, which only contain 4 policies. We find the search space of PPBA is suboptimal after inspecting the search domain of each hyperparameter. For example, the maximum rotation angle for global rotation in the PPBA search space is π/4, a common value used for the KITTI dataset. However, π/4 is insufficient compared to the tailored max rotation angle of π used in our LidarAugment. Surprisingly, a single well-tuned global rotation augmentation achieves 73.3 L1 mAP, as shown in Table I, which outperforms PPBA's 72.1 L1 mAP over the vehicle and pedestrian tasks. Although the PPBA algorithm is more efficient than grid search and contains diverse augmentation policies, the suboptimal search domain of the rotation angle restricts the performance of PPBA, which highlights the importance of tailoring the 3D detection search space.

LidarAugment. Alternatively, LidarAugment mitigates both the curse of dimensionality and the suboptimal search domain issues by aligning and scaling the magnitude and probability of each data augmentation policy. This significantly reduces the search complexity (only 2 hyperparameters) while allowing exploration of a larger hyperparameter space. As indicated in Table VI, LidarAugment significantly outperforms both manually designed and AutoML-based data augmentation strategies by about 5 AP for both the vehicle and pedestrian detection tasks, and only requires a simple grid search over two hyperparameters.

UPillars (AP Level 1) | Hparams | Vehicle | Pedestrian
NoAugmentation | - | 58.0 | 62.4
Rotate & Flip [6] | 3 | 70.8 | 70.0
Rotate, Flip, Scale, Drop points [14] | 5 | 72.1 | 72.3
PPBA [15] | 29 | 71.6 | 72.6
LidarAugment | 2 | 77.1 (+5.0) | 77.5 (+4.9)

TABLE VI: LidarAugment outperforms common data augmentation strategies. UPillars L1 APs on the Waymo Open Dataset validation set are reported. LidarAugment requires the least number of hyperparameters (Hparams) but achieves the best results compared to manually designed and AutoML-based data augmentation strategies.

F. Generalize to nuScenes dataset

To further validate our method, we evaluate LidarAugment on a different dataset: nuScenes [33]. For simplicity, we adopt the same training settings as on Waymo Open Dataset, but reduce the voxel size to 0.25 and halve the total training steps for faster training. We use the same baseline augmentation as SWFormer, and redefine the LidarAugment search space for nuScenes following subsection III-C. Table VII shows LidarAugment is a general approach, which outperforms the baseline augmentation by a large margin on nuScenes.

UPillars | mAP | NDS
Rotate, Flip, Scale, Drop points [14] | 40.6 | 48.2
LidarAugment | 46.7 (+6.1) | 53.4 (+5.2)

TABLE VII: nuScenes validation-set results.
V. CONCLUSION

In this paper, we propose LidarAugment, a scalable and effective 3D augmentation approach for 3D object detection. Based on the insight that 3D data augmentations are sensitive to model architecture and capacity, we propose a simplified search space, which contains two hyperparameters to control a diverse set of augmentations. LidarAugment outperforms both manually tuned and existing search-based data augmentation strategies by a large margin. Extensive studies show that LidarAugment generalizes to convolution- and attention-based architectures, as well as point-based and range-based input representations. More importantly, LidarAugment significantly simplifies the search process for 3D data augmentations and opens up exciting new research opportunities, such as model scaling in 3D detection. With LidarAugment, we demonstrate new state-of-the-art 3D detection results on the challenging Waymo Open Dataset.

REFERENCES

[1] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[2] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[3] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, “End-to-end multi-view fusion for 3d object detection in lidar point clouds,” in Conference on Robot Learning. PMLR, 2020, pp. 923–932.
[4] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10529–10538.
[5] Z. Liang, M. Zhang, Z. Zhang, X. Zhao, and S. Pu, “Rangercnn: Towards fast and accurate 3d object detection with range image representation,” arXiv preprint arXiv:2009.00206, 2020.
[6] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, “Rsn: Range sparse net for efficient, accurate lidar 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5725–5734.
[7] A. Bewley, P. Sun, T. Mensink, D. Anguelov, and C. Sminchisescu, “Range conditioned dilated convolutions for scale invariant 3d object detection,” in Conference on Robot Learning. PMLR, 2021, pp. 627–641.
[8] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[9] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” arXiv preprint arXiv:1805.09501, 2018.
[10] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
[11] D. Ho, E. Liang, X. Chen, I. Stoica, and P. Abbeel, “Population based augmentation: Efficient learning of augmentation policy schedules,” in International Conference on Machine Learning. PMLR, 2019, pp. 2731–2741.
[12] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.
[13] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13001–13008.
[14] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, “Swformer: Sparse window transformer for 3d object detection in point clouds,” European Conference on Computer Vision (ECCV), 2022.
[15] S. Cheng, Z. Leng, E. D. Cubuk, B. Zoph, C. Bai, J. Ngiam, Y. Song, B. Caine, V. Vasudevan, C. Li et al., “Improving 3d object detection through progressive population based augmentation,” in European Conference on Computer Vision. Springer, 2020, pp. 279–294.
[16] R. Li, X. Li, P.-A. Heng, and C.-W. Fu, “Pointaugment: An auto-augmentation framework for point cloud classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6378–6387.
[17] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan et al., “Population based training of neural networks,” arXiv preprint arXiv:1711.09846, 2017.
[18] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[19] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
[20] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
[21] Y. Chen, V. T. Hu, E. Gavves, T. Mensink, P. Mettes, P. Yang, and C. G. Snoek, “Pointmixup: Augmentation for point clouds,” in European Conference on Computer Vision. Springer, 2020, pp. 330–345.
[22] J. S. Hu and S. L. Waslander, “Pattern-aware data augmentation for lidar 3d object detection,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021, pp. 2703–2710.
[23] J. Choi, Y. Song, and N. Kwak, “Part-aware data augmentation for 3d object detection in point cloud,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 3391–3397.
[24] M. Reuse, M. Simon, and B. Sick, “About the ambiguity of data augmentation for 3d object detection in autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 979–987.
[25] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
[26] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, “Embracing single stride 3d object detector with sparse transformer,” arXiv preprint arXiv:2112.06375, 2021.
[27] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim, “Fast autoaugment,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[28] Z. Leng, S. Cheng, B. Caine, W. Wang, X. Zhang, J. Shlens, M. Tan, and D. Anguelov, “Pseudoaugment: Learning to use unlabeled data for data augmentation in point clouds,” in European Conference on Computer Vision. Springer, 2022, pp. 279–294.
[29] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[30] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[32] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[33] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[34] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, “Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection,” arXiv preprint arXiv:2102.00463, 2021.
[35] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen et al., “Starnet: Targeted computation for object detection in point clouds,” arXiv preprint arXiv:1908.11069, 2019.