LidarNAS: Unifying and Searching Neural Architectures for 3D Point Clouds

Chenxi Liu, Zhaoqi Leng, Pei Sun, Shuyang Cheng, Charles R. Qi, Yin Zhou, Mingxing Tan, and Dragomir Anguelov
Waymo LLC
{cxliu, lengzhaoqi, peis, shuyangcheng, rqi, yinzhou, tanmingxing, dragomir}@waymo.com

Abstract. Developing neural models that accurately understand objects in 3D point clouds is essential for the success of robotics and autonomous driving. However, arguably due to the higher-dimensional nature of the data (as compared to images), existing neural architectures exhibit a large variety in their designs, including but not limited to the views considered, the format of the neural features, and the neural operations used. The lack of a unified framework and interpretation makes it hard to put these designs in perspective, as well as to systematically explore new ones. In this paper, we begin by proposing such a unified framework, whose key idea is to factorize the neural networks into a series of view transforms and neural layers. We demonstrate that this modular framework can reproduce a variety of existing works while allowing a fair comparison of backbone designs. Then, we show how this framework can easily materialize into a concrete neural architecture search (NAS) space, allowing a principled NAS-for-3D exploration. In performing evolutionary NAS on the 3D object detection task on the Waymo Open Dataset, not only do we outperform the state-of-the-art models, but we also report the interesting finding that NAS tends to discover the same macro-level architecture concept for both the vehicle and pedestrian classes.

1 Introduction

Being able to recognize, segment, or detect objects in 3D is one of the fundamental goals of computer vision. In this paper we consider the point cloud input representation, given the wide usage of RGBD cameras in robotics applications as well as LiDAR sensors in autonomous driving. There has been a lot of research in this area, including various deep learning based approaches.

But which neural architecture should you choose? PointNet [33]? VoxelNet [59]? PointPillars [18]? Range Sparse Net [45]? It is easy to get overwhelmed by the diverse set of concepts present in these names as well as the variety in the architectures themselves.

This level of variety at the macro-level is not observed in other areas, e.g., neural architectures developed for 2D images. The root cause is the higher-dimensional nature of the data. There are three major reasons in particular:

– Views: 2D images are captured by an egocentric photographer. A similar view exists for 3D, namely the perspective view, or range images. But when the scan is not egocentric, we have an unordered point set that can no longer be indexed by pixel coordinates. In addition, gravity makes the z axis special, and often a natural choice is to view an object from top-down. Each view has its unique properties and (dis)advantages.
– Sparsity: Images are dense in the sense that each pixel has an RGB value between 0 and 255. But in 3D, range images may have pixels that correspond to infinite depth. Also, objects typically occupy a small percentage of the space, meaning that when a scene is voxelized, the number of non-empty voxels is typically small compared with the total number of voxels.
– Neural operations: Due to views and sparsity, 2D convolution does not always apply, resulting in more diverse neural operations.

Our first contribution in this paper is a unified framework that can interpret and organize the variety of neural architecture designs, while adhering to the principles listed above.
This framework allows us to put existing designs in perspective and enables us to explore new designs. The key idea is to factorize the entire neural network into a series of transforms and layers. The framework supports four views (point, voxel, pillar, perspective) and two formats (dense, sparse), as well as the transforms between them. It is also possible to merge features from different views, building parallelism into the sequential stages. But once a view-format combination is set, it restricts the types of layers that can be applied. When visualized, this framework is a trellis, and any neural architecture corresponds to a connected subset of this trellis. We provide several examples of how popular architectures can be refactored and reproduced under this framework, proving its generality.

A direct benefit of this framework is that it can easily materialize into a search space, which immediately unlocks and enables NAS. NAS stands for neural architecture search [61], which tries to replace human labor and manual designs with machine computation and automatic discoveries. Despite its success on 2D architectures [46], its usage on 3D has been limited. In this paper we conduct a principled NAS-for-3D exploration, by not only considering the micro-level (such as the number of channels), but also embracing the macro-level (such as transforms between various views and formats).

We conduct our LidarNAS experiments on the 3D object detection task on the Waymo Open Dataset [44]. Using regularized evolution [36], our search finds LidarNASNet, which outperforms the state-of-the-art RSN model [45] on both the vehicle and the pedestrian classes. In addition to the superior accuracy and the competitive latency, there are also interesting observations about the LidarNASNet architecture itself. First of all, though the search / evolution was conducted separately on vehicle and pedestrian, the found architectures have essentially the same high-level design concept. Second, the modifications discovered by NAS coincidentally reflect ideas from human designs. We also analyze the hundreds of architectures sampled in the process and draw useful lessons that should inform future designs.

To summarize, the main contributions of this paper are:

– A unified framework general enough to include a wide range of backbones for 3D data processing
– A search space and an algorithm challenging enough to cover both the micro-level and the macro-level
– A successful NAS experiment which leads to state-of-the-art performance on the Waymo Open Dataset

2 Related Work

2.1 Neural Architectures for 3D

We partition neural architectures for 3D into four categories, according to the primary view(s) used. Since this paper studies backbone design for 3D object detection, we will mostly cover detection but will also touch on segmentation and classification.

The first category is top-down primary, which includes voxel and pillar based methods. The main idea is to divide 3D points into 3D voxels [11,8,52,59,51,9] or 2D pillars [18], which then become regular. The advantage is that voxelization enables locality, which in turn enables convolution operations. But the main limitation is memory consumption, which grows cubically (or quadratically). This either limits the maximum detection range or sacrifices the voxelization granularity. Even if sparse operations may be used, for egocentric scans the point densities at long range and short range are different, posing challenges in learning.

The second category is point primary, which treats the point cloud as unorganized sets.
Originally developed for classification and segmentation [33,34], the idea can also be used for detection [32,30]. The advantage is that it is more memory-friendly than voxelization based approaches. However, its limitation is that the neural layers do not perform as well, possibly due to the irregular coordinates. In addition, to achieve locality, nearest neighbor search is typically needed for the input, which can be expensive.

The third category is perspective primary, operating directly on the range image [29,5,6,12]. This is also very memory-friendly and can utilize powerful 2D convolution layers which have been extensively researched. However, as the depth can change drastically for adjacent pixels, these methods exhibit more difficulty in localizing objects accurately, as well as in handling occlusions.

The fourth and final category is fusion methods, which use two or more of the representations discussed above. The fusion may be either sequential or parallel. For example, RSN [45] sequentially performs foreground segmentation on the perspective view and delivers detection output on the top-down view. PVCNN [26] and SPVCNN [47] fuse information from the point view and the voxel view in a parallel fashion. MVF [58] fuses features from the perspective view, point view, and pillar view, also in a parallel fashion. The hope is that fusion methods can combine the best of multiple worlds, which is why it is important to keep all options open when doing architecture exploration.

2.2 Neural Architecture Search

Early works on neural architecture search primarily focused on the search algorithm. A variety of methods were introduced, including reinforcement learning [61,3], evolution [37,36], performance prediction [24], and weight-sharing [31,25]. Essentially, different methods make different approximations about the search process.

These search algorithm explorations started on image classification. The following phase consists of extending to other tasks, such as semantic segmentation [7,23] and object detection [50,14]. For 3D tasks, NAS research has been done on medical imaging [60,17,2,49,54]. However, the volumetric CT scans are different from point clouds, and as a result the search space is greatly simplified. There are also works on 3D shape classification [27,19], but their overall frameworks do not exceed that set by [25]. [47,20] are closer to our work, in the sense that they use NAS to optimize for segmentation and detection on 3D scenes (KITTI [13]). But generalizing the terminology used in [23], we believe there is also a two-level hierarchy in 3D neural architecture designs, with the outer macro-level controlling the views of the data / features, and the inner micro-level being the specifics of the neural layers. Under this terminology, [47,20] keep the macro-level fixed, while our search covers both.

3 Unifying Neural Architectures for 3D

3.1 Philosophy

In order to offer a unified interpretation of the growing variety of neural networks for 3D, we need to pinpoint their high-level design principles. Fortunately, we find these underlying principles to be surprisingly congruent, and we characterize them as: finding some neighborhood of the 3D points and then aggregating information within. The "aggregation" part is typically done through some form of convolution and / or pooling.
The "neighborhood" part has different choices:

– PointNet [33]: the neighborhood alternates between the point itself (MLP) and all points (max-pooling)
– PointNet++ [34]: the neighborhood is a Euclidean ball with a certain radius
– VoxelNet [59]: 3D neighborhood measured by Manhattan distance of Cartesian coordinates (x, y, z)
– PointPillars [18]: 2D neighborhood measured by Manhattan distance of (part of) Cartesian coordinates (x, y)
– LaserNet [29]: 2D neighborhood measured by Manhattan distance of pixel coordinates (i, j)

These common "neighborhood" choices have typically been expressed through the views of the data / features: point, voxel, pillar, perspective. We point out that there have been and will be more views proposed, which is why we feel the "neighborhood" interpretation is more generic. Notably, different data views can transform between each other back and forth. However, once the data view is determined, it restricts the type of layers that can be applied. This factorization into "transforms" and "layers", as well as their relationship, will be reflected in our framework described next.

3.2 A Unified Framework

In this subsection, we build upon the aforementioned high-level ideas and describe the main framework we use to think about neural architectures throughout this work. We describe its different levels of detail from fine to coarse.

Views and formats We consider a total of four views (point, pillar, voxel, perspective) and up to two data formats (dense and sparse):

– Point: The features for all N 3D points are stored in a matrix of size [N, C], where C is the number of channels. The Cartesian coordinates (x, y, z) for each point are stored in a separate matrix of size [N, 3], where the indices of the points are aligned between the two matrices.
– Pillar: In this view, we store a fixed-length feature for each pillar when viewing the scene from top-down. We allow the pillar view to be either dense or sparse. If dense, the features are stored in a tensor of size [B, X, Y, C], where B is the batch size, and X and Y are the numbers of pillars along the corresponding dimensions. If sparse, the features are stored in a matrix of size [N, C], where N is the number of non-empty pillars, and a separate matrix of size [N, 3] is used to store the indices (both batch and spatial) of these non-empty pillars. In both data formats, unlike the point view, the Cartesian coordinates of each pillar('s center) can be easily calculated from its spatial index (origin + index * pillar size).
– Voxel: Different from the pillar view, the voxel view partitions the scene along all three spatial dimensions. The additional partition along the z axis makes fitting a tensor of size [B, X, Y, Z, C] into memory very challenging. Therefore, in this work we only consider the sparse format for the voxel view. Features are stored in a matrix of size [N, C], where N is the number of non-empty voxels. A separate matrix of size [N, 4] is used to store their indices (both batch and spatial).
– Perspective: For egocentric 3D scans or RGBD images, simply using the original perspective view is a natural choice. We consider both the dense and sparse formats for this view. If dense, features are stored in a tensor of size [B, H, W, C], where H and W are the height and width of the range image. A separate tensor of size [B, H, W, 3] is used to store the Cartesian coordinates of each pixel on the range image. If sparse, features are stored in a matrix of size [N, C] and Cartesian coordinates in a separate matrix of size [N, 3], similar to the point view.
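To make the storage conventions above concrete, the following is a minimal NumPy sketch of the point-to-sparse-pillar case. The function and argument names are our own, and the mean-pooling of point features into a pillar is only one illustrative choice, not necessarily the exact operation used in the paper.

```python
import numpy as np

def pillarize(points_xyz, feats, origin_xy, pillar_size, grid_xy):
    """Illustrative point -> sparse-pillar transform.

    points_xyz: [N, 3] Cartesian coordinates; feats: [N, C] per-point features.
    Returns sparse pillar features [M, C], integer indices [M, 2], and pillar
    center coordinates [M, 2], where M is the number of non-empty pillars
    (the batch index is omitted for brevity).
    """
    # Spatial index of the pillar each point falls into.
    idx = np.floor((points_xyz[:, :2] - origin_xy) / pillar_size).astype(np.int64)
    # Drop points outside the grid.
    keep = np.all((idx >= 0) & (idx < np.asarray(grid_xy)), axis=1)
    idx, feats = idx[keep], feats[keep]
    # Unique non-empty pillars, plus an inverse map from point to pillar.
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    # Mean-pool point features into their pillar.
    pooled = np.zeros((uniq.shape[0], feats.shape[1]), dtype=feats.dtype)
    np.add.at(pooled, inverse, feats)
    counts = np.bincount(inverse, minlength=uniq.shape[0]).astype(feats.dtype)
    pooled /= counts[:, None]
    # Pillar-center coordinates recovered from the spatial index alone.
    centers = origin_xy + uniq * pillar_size
    return pooled, uniq, centers
```

Densifying this sparse output into the [B, X, Y, C] tensor, one of the transforms discussed next, simply scatters the rows into a zero-initialized dense grid.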
Transforms Now that the views and formats are established, the framework shall be general enough to include possible transforms from one to the other. The transforms are intended to be lightweight: powerful neural feature update is a non-goal. Since there are in total six possible representations (four views, with pillar and perspective having two formats), we have up to 6^2 = 36 different transforms. Though this number may seem daunting, some of these transforms have more familiar and friendly names. For example, the transform to itself is identity. The one from a sparse format to its dense counterpart is densification (by padding zero vectors). The point to voxel transform is voxelization. The reverse transform is devoxelization. The point to perspective transform is projection.

[Fig. 1: The LidarNAS framework for interpreting neural architectures on 3D point clouds. The entire backbone consists of S stages. Each stage consists of view & format transforms followed by corresponding neural layers. Within this framework, a backbone architecture corresponds to a connected subset of the S-stage trellis.]

Layers Once a view-format combination is set, we can apply neural layers to update the features. Generally, we do not put constraints on the number or form of the layers: it can be as simple as a one-layer convolution, or as complicated as an entire U-Net [38]. But the one constraint is that it conforms to the view-format combination for both its input and output. This is because, for instance, 2D convolution cannot be applied on 3D inputs, and sparse convolution does not work on dense features. Notably, 2D layer implementations can interchangeably work for both the pillar view and the perspective view.

Stages Putting these concepts together, we define a stage to be the sequential pair of possible transforms and their associated layers. Fig. 1 visualizes the concatenation of S stages. Within this framework, the backbone of a neural network for 3D corresponds to a connected subset of this S-stage trellis. A head can then be added to the end to perform 3D classification / detection / segmentation.

We emphasize that the word choice is "subset" and not "path", meaning that a stage can have more than one view present. This makes our framework more general, as it supports not only sequential designs but also parallel ones. Consequently, we may have multiple different views in stage s-1 transforming to the same view in stage s. In these cases, after applying the individual transforms, we merge the transformed features through either concatenation (the default in this work) or summation.

[Fig. 2: Examples of how existing designs may be interpreted within the LidarNAS framework: (a) Multi-View Fusion [58], (b) Sparse Point-Voxel [47].]

3.3 Inclusion of Existing Designs

In Fig. 2 and Fig. 3, we visualize several examples of how existing designs may be interpreted within the framework described above. Our framework is flexible enough to cover both entirely sequential designs such as Range Sparse Net [45] and more parallel designs such as Multi-View Fusion [58]. In addition to these networks developed for 3D detection, it can also explain those beyond, such as SPV [47]. More architecture designs fit in, including but not limited to [59,51,18,30,48], but we skip their visualization due to space limitations.
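As a purely illustrative restatement of this factorization, one way to write down a backbone as a connected subset of the S-stage trellis is to list, per stage, the branches with their view, format, parent branches, and layer. The sketch below encodes an RSN-style backbone following the description in Sec. 2.1 and Fig. 3(c); all type and field names are our own, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

View = str     # "point" | "pillar" | "voxel" | "perspective"
Format = str   # "dense" | "sparse"

@dataclass
class Branch:
    view: View
    fmt: Format
    parents: List[Tuple[View, Format]]  # branches in the previous stage feeding this one
    layer: str                          # e.g. "mlp", "2d_dense_unet", "3d_sparse_unet"

@dataclass
class Stage:
    branches: List[Branch] = field(default_factory=list)

# An RSN-style backbone (cf. Fig. 3c): 2D convolutions on the dense range image
# for foreground segmentation, then sparse 3D convolutions on voxels built from
# the retained foreground points.
rsn_like = [
    Stage([Branch("perspective", "dense", parents=[], layer="2d_dense_unet")]),
    Stage([Branch("voxel", "sparse",
                  parents=[("perspective", "dense")], layer="3d_sparse_unet")]),
]
```

A parallel design such as MVF would simply place several branches, with different views, inside the same Stage.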
4 Searching Neural Architectures for 3D

The framework described in the previous section brings many benefits, one of which is the potential to search for novel and better architectures. This section focuses on how the framework materializes into a search space (Sec. 4.1), as well as our choice of search algorithm (Sec. 4.2).

4.1 From Framework to Search Space

Fig. 2 demonstrated how, at a high level, various architectures fall within the LidarNAS framework. But delving into the details, specific implementations of the modules are going to differ across works. While this is very much expected and understandable, the variety and freedom in "layers" alone would make constructing a meaningful search space infeasible. We now discuss how we materialize the framework into a search space by making specific choices.

Transforms Among the 36 possible transforms, we did not implement the transforms from pillar to voxel, as nothing more can be done other than copying the same features along the z axis. From the voxel view, our implementation only supports transforms to the pillar view. Supporting 31 of the 36 transforms still provides high coverage.

Layers We need at least one type of neural layer for each of the following: point, 2D dense, 2D sparse, 3D sparse. Our search space picks one representative for each:

– Point: Multiple layers of dense-normalization-ReLU. The normalization can either be batch normalization [16] or layer normalization [1]. The number of units F in the dense layers is a hyperparameter that can be searched.
– 2D dense: A U-Net [38] with residual blocks [15]. We use up to five downsampling and upsampling scales. The numbers of channels for the scales are [F, 4F, 8F, 8F, 16F], with F being the hyperparameter that can be searched. The number of blocks per scale is 2, except for the highest-resolution scale, which uses 1.
– 2D sparse: Also a U-Net with residual blocks, except that each convolution is a sparse convolution (kernel size 3×3). We use up to 3 downsampling and upsampling scales, and the numbers of blocks are [1, 2, 3] and [0, 2, 2]. We use the same number of channels F for all downsampling and upsampling blocks.
– 3D sparse: Also a U-Net with residual blocks, except that 3D sparse convolution is used. The kernel size can either be 3×3×3 or 3×3×1, and the corresponding stride for each scale is 2×2×2 or 2×2×1. The other details follow the 2D sparse case above.

These choices of layer specifics, especially those for 2D and 3D, try to exactly follow RSN [45].

Stages Our search space considers S = 3 stages. For simplicity, we also have the constraint that the last stage can only have one view. Inspired by RSN, we add the option to perform foreground segmentation immediately after the first perspective branch that appears.

4.2 Regularized Evolution

We choose regularized evolution as our search algorithm, following [36]. Compared against other major classes of NAS methods, evolution arguably makes the least amount of approximations, which is desirable especially since we are exploring a less explored task and a complicated search space. We do not use weight-sharing NAS for GPU memory considerations: 3D tasks are understandably more memory intensive than 2D tasks, and the batch size on each GPU was already small (< 10), while even the best weight-sharing NAS implementations (a recent example is [4]) require 2-3× extra GPU memory.

Our mutation algorithm works by first randomly selecting a stage s and then randomly applying one of the following six mutation choices to this stage:

– Add a view: if the stage does not have all four views, then randomly add a view not yet present in this stage.
A random view present in the previous stage is selected as its predecessor. A random view present in the next stage is selected as its successor. A default layer of the corresponding type is used for this addition. The number of channels for all layers in this stage is halved.
– Remove a view: if the stage has more than one view, then randomly remove an existing view. Usage of the removed view in the next stage is also removed. The number of channels for all layers in this stage is doubled.
– Switch the view: if the stage has exactly one view, then switch the view to another. All usage of the old view in the next stage is changed to the new view.
– Adjust the pillar / voxel size: a key parameter in many of the transforms is the pillar / voxel size. Multiply the pillar / voxel size by either 0.8 or 1.2 for all views.
– Adjust the number of channels: multiply the number of channels for all layers in the stage by either 0.8 or 1.2.
– Adjust the layer progression:
  • Point: Either increase or decrease the number of dense-normalization-ReLU layers by 1.
  • 2D dense: Either increase or decrease the number of scales by 1.
  • 2D / 3D sparse: Increase or decrease the number of downsampling / upsampling scales by 1.

If a mutation fails (e.g., if the precondition does not hold, such as trying to remove a view when the stage only has one view), the algorithm mutates again until it succeeds. The first four mutation choices focus on the "transform" aspect of a stage, while the last two focus on the "layer" aspect. This level of coverage and variety makes the search comprehensive yet challenging.

5 Experimental Results

5.1 Experimental Setting

We perform 3D object detection experiments on the challenging Waymo Open Dataset [44]. It provides LiDAR scans in the range image form, which makes experiments on the perspective view much more natural and convenient. It contains 1150 LiDAR sequences, with 798 train, 202 validation, and 150 test sequences. Each sequence is 20 seconds long at 10 frames per second. Experiments are conducted on both the vehicle and the pedestrian classes, using the official evaluation metrics of 3D / BEV AP.

5.2 Existing Architectures under LidarNAS

In this subsection, we use the LidarNAS framework (Sec. 3) to reimplement several existing neural architectures for 3D. The goal here is to prove the generality and correctness of the LidarNAS framework interpretation, as well as to validate our implementation of the individual modules.

We selected four existing architectures: RSN [45], PointPillars [18], LaserNet [29], and MVF++ [35]. These are selected to cover a variety of views as well as topologies.

model             | class | frame | device | batch | steps | lr    | voxelization (m) | LidarNAS AP | previous AP
RSN-exact         | Veh   | 3     | GPU    | 2×16  | 120k  | 0.006 | 0.2×0.2×0.2      | 77.2        | 77.2 [45]
RSN-exact         | Ped   | 3     | GPU    | 3×16  | 120k  | 0.006 | 0.1×0.1×20.0     | 79.1        | 79.1 [45]
PointPillars-like | Veh   | 1     | GPU    | 2×16  | 120k  | 0.006 | 0.32×0.32        | 69.3        | 63.3 [45] / 60.3 [35]
PointPillars-like | Ped   | 1     | GPU    | 3×16  | 120k  | 0.006 | 0.32×0.32        | 66.1        | 68.9 [45] / 60.1 [35]
LaserNet-like     | Veh   | 1     | GPU    | 1×16  | 360k  | 0.001 | -                | 47.1        | 52.1 [45] / 56.1 [6]
LaserNet-like     | Ped   | 1     | GPU    | 1×16  | 240k  | 0.003 | -                | 59.0        | 63.4 [45] / 62.9 [6]
MVF++-like        | Veh   | 1     | TPU    | 2×128 | 43k   | 0.003 | 0.32×0.32        | 73.6        | 74.6 [35]
MVF++-like        | Ped   | 1     | TPU    | 2×128 | 43k   | 0.003 | 0.32×0.32        | 70.4        | 78.0 [35]

Table 1: A diverse set of existing 3D detection architectures under the LidarNAS framework. The second number in the batch size multiplication is the number of GPUs / TPU shards. The metric (last two columns) is L1 3D AP.
Note that the LidarNAS framework focuses on backbone design. In our reimplementation, we use an anchor-free detection head that is the same as RSN's if the backbone output is sparse voxels, but also works for the pillar and perspective views (details are described in the supplementary material). This means that our RSN reimplementation is exact while the others are not, and we add the suffix "-like" to indicate this difference.

Tab. 1 summarizes the results, using L1 3D AP. Key hyperparameter values are also provided. We use no color if our reimplementation is within 1% absolute of the previously reported number; green if higher by more than 5%; yellow if lower by at most 5%; and red if lower by more than 5%. Considering the diversity of these architectures, overall we consider our reimplementation to be acceptable and successful, validating our implementation of (some of the) transforms and layers modules. Notice that our implementation can support multi-frame inputs, as well as both GPU and TPU.

Looking into individual neural architectures, our reproduction of RSN is exact. Interestingly, PointPillars-like significantly outperforms previous reports on the vehicle class. This is an important reminder that revisiting previous architectures may be necessary and beneficial, as they may still be competitive when coupled with the latest developments in other areas (e.g., anchor-free detection heads). However, the performance on the pedestrian class is slightly worse. This is also observed on MVF++-like, where the vehicle class is within 1% but the pedestrian class is significantly worse. Our hypothesis is that, comparatively speaking, our detection head is better suited to larger objects but struggles more on smaller objects. Finally, our LaserNet-like performs noticeably worse than any network that detects on the top-down view (meaning pillar or voxel), despite training for 2-3× more steps. This proves that detection from the perspective view needs more specialized operations, such as those described in the original paper, or some recent developments [6,12].

5.3 Searching for New Architectures

In this subsection, we perform and analyze neural architecture search experiments, using the search space and algorithm described in Sec. 4.

                      |        |       |  Vehicle              |  Pedestrian
model                 | year   | frame | 3D AP | BEV AP | latency | 3D AP | BEV AP | latency
LaserNet [29]         | CVPR19 |       | 52.1  | 71.2   | 64.3    | 63.4  | 70.0   | 64.3
PointPillars [18]     | CVPR19 |       | 63.3  | 82.5   | 49.0    | 68.9  | 76.0   | 49.0
PV-RCNN [39]          | CVPR20 |       | 70.3  | 83.0   | -       | -     | -      | -
Pillar-based [48]     | ECCV20 | 1     | 69.8  | 87.1   | 66.7    | 72.5  | 78.5   | 66.7
PV-RCNN [40]          | WOD20  | 2     | 77.5  | -      | 300     | 78.9  | -      | 300
RCD [5]               | CoRL20 | 1     | 69.0  | 82.1   | -       | -     | -      | -
MVF++ [35]            | CVPR21 | 1     | 74.6  | 87.6   | -       | 78.0  | 83.3   | -
CenterPoint [53]      | CVPR21 | 2     | 76.7  | -      | -       | 79.0  | -      | -
PPC [6]               | CVPR21 |       | 65.2  | 80.8   | -       | 75.5  | 82.2   | -
RangeDet [12]         | ICCV21 | 1     | 72.9  | -      | -       | 75.9  | -      | -
PointPillars-like§    |        | 1     | 67.6  | 85.3   | -       | -     | -      | -
LidarNASNet-P (ours)  |        | 1     | 73.2  | 88.2   | -       | -     | -      | -
RSN [45]              | CVPR21 | 1     | 75.2  | 87.7   | 46.5†   | 77.1  | 81.7   | 21.0†
LidarNASNet-R (ours)  |        | 1     | 75.6  | 88.6   | 49.3†   | 77.4  | 82.0   | 22.6†

Table 2: 3D object detection results for the vehicle and pedestrian classes on the Waymo Open Dataset validation set. The AP is difficulty L1. The unit of latency is ms. Multi-frame models are grayed. §: Slightly different from PointPillars-like in Tab. 1, because here we have to swap the original layers with the U-Net explained in Sec. 4.1. †: our measurement using an identical setting: average over 10 scenes, each with more than 100 vehicles / pedestrians.
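For reference, the search procedure of Sec. 4.2 that produced the architectures in Tab. 2 can be summarized by the following sketch. It is a generic restatement of regularized evolution [36] under assumed function names, not our exact implementation; in our case train_and_eval would return the AP / latency objective defined below, and mutate would apply one of the six mutation choices of Sec. 4.2.

```python
import collections
import random

def regularized_evolution(train_and_eval, mutate, seed_architecture,
                          cycles, population_size=20, tournament_size=5):
    """Generic regularized-evolution loop in the style of Real et al. [36]."""
    population = collections.deque()
    history = []
    # Seed the population; warm starting simply means seeding with copies of a
    # known-good architecture instead of random ones.
    while len(population) < population_size:
        arch = seed_architecture()
        population.append((arch, train_and_eval(arch)))
        history.append(population[-1])
    # Evolve: tournament selection, mutation, and aging out the oldest member.
    while len(history) < cycles:
        tournament = random.sample(list(population), tournament_size)
        parent = max(tournament, key=lambda pair: pair[1])[0]
        child = mutate(parent)
        population.append((child, train_and_eval(child)))
        history.append(population[-1])
        population.popleft()   # "regularized": remove the oldest, not the worst
    return max(history, key=lambda pair: pair[1])
```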
Evolving past the state-of-the-art Based on the analysis above, picking a random architecture as the starting point would take much longer for the performance to ramp up, so we use warm starting [42,43] to speed up the search and save compute. Each search lasts 100 architectures, each trained with batch size 2 × 8 GPUs for 12k steps (10% of the standard number of steps) using a cosine learning rate. All architectures operate on a single frame. The population size and tournament size for the regularized evolution algorithm are 20 and 5, respectively. We also measure the V100 latency of the network on a (random) training batch immediately after 11k training steps. The measurement is taken close to the end of training because, for architectures that perform foreground segmentation, the latency may change throughout training. We note that this search-phase latency measurement is noisy, not only because the data batch is random, but also because the scheduler may allocate a GPU shared with other jobs. Regardless, we use 100 * L1 3D AP - 0.5 * latency in ms as the objective to guide the evolution (we picked these multipliers empirically and did not tune them heavily). Once an architecture is identified, we increase the per-GPU batch size from 2 to 5 and train for 120k steps as the final evaluation.

We conduct a separate search / evolution from three different starting points: PointPillars-like vehicle, RSN CarXL, and RSN PedL (we skipped PointPillars-like pedestrian because the corresponding number in Tab. 1 is yellow, not green). We name the found architecture LidarNASNet-P / R depending on whether the starting point was PointPillars-like or RSN, and compare them against other models in Tab. 2.

[Fig. 3: The macro-level architectures of the found LidarNASNet-P / R: (a) PointPillars-like, (b) LidarNASNet-P, (c) Range Sparse Net [45], (d) LidarNASNet-R. The illustration in (d) applies to both vehicle and pedestrian; it adds a sparse convolution on the pillar view in the first stage, utilizing all four views and both formats considered in this work.]

We first compare LidarNASNet-P against PointPillars-like, the evolution baseline. The L1 3D AP improves from 67.6 to 73.2, a significant gap of +5.6. The gain on BEV AP is also significant at +2.9. The large improvement and competitive end result clearly showcase the effectiveness of our search. When running the same evolution from RSN, LidarNASNet-R outperforms by 0.4 and 0.3 3D AP on vehicle and pedestrian respectively, and the gains on BEV AP are even larger. As we will see soon, LidarNASNet-R has an additional branch, so its latency is higher, but only slightly. To put this in perspective, we did an ablation study where we increase the number of channels in the sparse U-Net of RSN for vehicle (from 64 to 91) to reach AP parity with LidarNASNet-R (in fact, this architecture was sampled / discovered during our evolution). The latency of this architecture is 60.8 ms, which is significantly higher.

Comparing against other architectures, LidarNASNet-R also performs very competitively. Not only is this reflected in the superior AP, especially among single-frame models, but also in the small latency. We reiterate that our total search cost is about 80 GPU days, which is only 10 times the cost of training a single RSN (8 GPU days).

Visualizing and analyzing LidarNASNet We visualize the macro-level architecture of LidarNASNet in Fig. 3. We start by discussing LidarNASNet-P. At the macro-level, the evolution decided to add a 2D U-Net that enhances the features of each range image pixel before voxelization to the pillar view.
This change alone improves the 3D AP from 67.6 to 72.3. The evolution also learned to increase the voxelization granularity from 0.32×0.32 to 0.25×0.25, which is a micro-level change that is not reflected in Fig. 3. This change further improves the 3D AP to the 73.2 reported in Tab. 2.

For LidarNASNet-R, notice that though the search was conducted separately for the vehicle and pedestrian classes, the same macro-level architecture design was found, which is a positive signal regarding the generality of the found design. Specifically, LidarNASNet-R adds a pillar view in the first stage, as well as the associated sparse 2D U-Net. The idea of adding a pillar view resembles MVF [58,35] (though the sparse format is used here, while MVF did not consider sparse operations), making LidarNASNet-R a hybrid between RSN and MVF, two very successful human designs. The voxelization granularity of this sparse 2D U-Net is 0.32×0.32, and the number of channels F is 16. In the vehicle variant, the number of channels for the original perspective view is halved (from 16 to 8). In the pedestrian variant, this is reduced even more aggressively (from 16 to 3).

The search space is challenging A common critique of some of the NAS literature is that the search space can be "easy", in the sense that even random sampling of architectures (and taking the argmax) can find high-quality architectures indistinguishable from those found by NAS [21]. We prove our search space is not trivial by training 100 architectures randomly generated by the procedure detailed in the supplementary material. Figure 4 shows the side-by-side comparison of these random architectures against our LidarNAS evolution. It is clear that randomly sampled architectures have much worse quality, in terms of both detection AP and latency. Not only does this illustrate that the search space we consider is challenging and nontrivial, but it also justifies our use of warm starting.

[Fig. 4 (vehicle L1 3D AP vs. V100 latency in ms): Randomly sampled architectures (orange) in the LidarNAS search space have worse average quality and higher variance, calling for warm starting (blue).]

Lessons from the sampled architectures In addition to fixating on the top-performing architecture, there are also lessons to be learned from the hundreds of architectures sampled. We now choose a few angles to analyze these data.

Using the LidarNAS evolution data points, we investigate architectures that only mutated the "layer" aspect (i.e., the last two mutation choices in Sec. 4.2) versus the rest. The AP standard deviations of the two subsets are 0.04 and 0.14 respectively, which confirms that, on average, mutating "transforms" results in more aggressive changes than mutating "layers" only.

Using the random architecture data points, we study which views and stages have the most direct effect on detection quality. Specifically, we run a linear regression from the 12-dimensional binary feature indicating whether the corresponding branch exists in the architecture to the L1 3D AP. The coefficients are visualized in Fig. 5. By comparing the columns, it is clear that later stages have a much more direct influence on the detection AP than earlier stages. By comparing the rows of the last column, top-down views (voxel and pillar) positively influence the detection AP, while the perspective view impacts it negatively. This again resonates with the belief that detection from the perspective view tends to be more challenging and requires more specialized treatment.
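For completeness, the kind of fit behind Fig. 5 can be reproduced with an ordinary least-squares regression; the sketch below is our own minimal version, and whether the original analysis included an intercept is not specified, so we add one and discard it.

```python
import numpy as np

def branch_effect(X, y):
    """Least-squares coefficients relating branch presence to a target metric.

    X: [num_archs, 12] binary indicators (4 views x 3 stages; 1 if the branch
       exists in the sampled architecture). y: [num_archs] L1 3D AP.
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1]                                 # drop the intercept
```

The same call, with a 3-dimensional feature counting empty / dense / sparse branches per view, yields the latency coefficients analyzed next.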
Using the random architecture data points, we also study the effect of dense vs. sparse formats on latency. Recall that in our LidarNAS framework, the perspective view and the pillar view are the two that allow both dense and sparse formats. For each view, we run a linear regression to latency from a 3-dimensional feature indicating the total number of empty / dense / sparse branches. For the perspective view, the coefficients are [-5.24, -1.65, 6.90]. For the pillar view, the coefficients are [-3.03, 3.06, -0.03]. The first coefficient is the most negative for both, which is expected, because the more empty branches you have, the smaller the latency is. Interestingly, the coefficients reveal that on the perspective view, using more sparse branches results in larger latency, whereas on the pillar view, using more sparse branches results in smaller latency. This shows that sparse operations can offer a speedup, but not always: it depends on whether the view inherently has high sparsity.

[Fig. 5 (rows: Voxel, Persp., Point, Pillar; columns: Stage 0, Stage 1, Stage 2): Linear regression coefficients from the presence of individual views and stages to detection AP.]

6 Conclusion

This paper aims to achieve two goals for neural architecture research for 3D: first, a unified framework that summarizes and organizes existing designs, and second, an architecture search exploration enabled by this framework. We demonstrate the generality of our LidarNAS framework, not only through pictorial illustration, but also through empirical experiments. Then, we successfully and automatically discovered LidarNASNet, which achieves state-of-the-art results on Waymo Open Dataset 3D object detection. The searched architecture is interesting: not only is it identical when searching on two different classes, but it also embodies and reaffirms shades of existing designs.

There are still many limitations in this work, and we look forward to addressing them in future research. First, while the transforms coverage is fairly complete, the layers currently implemented do not capture much diversity, and we shall add more powerful layer choices into the search space, such as Transformers [55,10,28]. Second, all search experiments are single-frame; in extending to multi-frame, challenges include more memory pressure and the additional complication of deciding at which stage to perform temporal fusion.

References

1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
2. Bae, W., Lee, S., Lee, Y., Park, B., Chung, M., Jung, K.H.: Resource optimized neural architecture search for 3d medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 228–236. Springer (2019)
3. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016)
4. Bender, G., Liu, H., Chen, B., Chu, G., Cheng, S., Kindermans, P.J., Le, Q.V.: Can weight sharing outperform random architecture search? An investigation with TuNAS. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14323–14332 (2020)
5. Bewley, A., Sun, P., Mensink, T., Anguelov, D., Sminchisescu, C.: Range conditioned dilated convolutions for scale invariant 3d object detection. arXiv preprint arXiv:2005.09927 (2020)
6. Chai, Y., Sun, P., Ngiam, J., Wang, W., Caine, B., Vasudevan, V., Zhang, X., Anguelov, D.: To the point: Efficient 3d object detection in the range image with graph convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000–16009 (2021)
7. Chen, L.C., Collins, M.D., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., Adam, H., Shlens, J.: Searching for efficient multi-scale architectures for dense image prediction. arXiv preprint arXiv:1809.04184 (2018)
8. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1907–1915 (2017)
9. Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel R-CNN: Towards high performance voxel-based 3d object detection. arXiv preprint arXiv:2012.15712 (2020)
10. Engel, N., Belagiannis, V., Dietmayer, K.: Point transformer. IEEE Access 9, 134826–134840 (2021)
11. Engelcke, M., Rao, D., Wang, D.Z., Tong, C.H., Posner, I.: Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). pp. 1355–1361. IEEE (2017)
12. Fan, L., Xiong, X., Wang, F., Wang, N., Zhang, Z.: Rangedet: In defense of range view for lidar-based 3d object detection. arXiv preprint arXiv:2103.10039 (2021)
13. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3354–3361. IEEE (2012)
14. Ghiasi, G., Lin, T.Y., Le, Q.V.: NAS-FPN: Learning scalable feature pyramid architecture for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7036–7045 (2019)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
16. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456. PMLR (2015)
17. Kim, S., Kim, I., Lim, S., Baek, W., Kim, C., Cho, H., Yoon, B., Kim, T.: Scalable neural architecture search for 3d medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 220–228. Springer (2019)
18. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12697–12705 (2019)
19. Li, G., Qian, G., Delgadillo, I.C., Muller, M., Thabet, A., Ghanem, B.: SGAS: Sequential greedy architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1620–1630 (2020)
20. Li, G., Xu, M., Giancola, S., Thabet, A., Ghanem, B.: LC-NAS: Latency constrained neural architecture search for point cloud networks. arXiv preprint arXiv:2008.10309 (2020)
21. Li, L., Talwalkar, A.: Random search and reproducibility for neural architecture search. In: Uncertainty in Artificial Intelligence. pp. 367–377. PMLR (2020)
22. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
23. Liu, C., Chen, L.C., Schroff, F., Adam, H., Hua, W., Yuille, A.L., Fei-Fei, L.: Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 82–92 (2019)
24. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 19–34 (2018)
25. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
26. Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel CNN for efficient 3d deep learning. arXiv preprint arXiv:1907.03739 (2019)
27. Ma, Z., Zhou, Z., Liu, Y., Lei, Y., Yan, H.: Auto-ORVNet: Orientation-boosted volumetric neural architecture search for 3d shape classification. IEEE Access 8, 12942–12954 (2019)
28. Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., Xu, C.: Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3164–3173 (2021)
29. Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: LaserNet: An efficient probabilistic 3d object detector for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12677–12686 (2019)
30. Ngiam, J., Caine, B., Han, W., Yang, B., Chai, Y., Sun, P., Zhou, Y., Yi, X., Alsharif, O., Nguyen, P., et al.: StarNet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069 (2019)
31. Pham, H., Guan, M., Zoph, B., Le, Q., Dean, J.: Efficient neural architecture search via parameters sharing. In: International Conference on Machine Learning. pp. 4095–4104. PMLR (2018)
32. Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3d object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9277–9286 (2019)
33. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 652–660 (2017)
34. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413 (2017)
35. Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard 3d object detection from point cloud sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6134–6144 (2021)
36. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 4780–4789 (2019)
37. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., Kurakin, A.: Large-scale evolution of image classifiers. In: International Conference on Machine Learning. pp. 2902–2911. PMLR (2017)
38. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
39. Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: PV-RCNN: Point-voxel feature set abstraction for 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10529–10538 (2020)
40. Shi, S., Guo, C., Yang, J., Li, H.: PV-RCNN: The top-performing lidar-only solutions for 3d detection / 3d tracking / domain adaptation of Waymo Open Dataset challenges. arXiv preprint arXiv:2008.12599 (2020)
41. Shi, S., Wang, X., Li, H.: PointRCNN: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 770–779 (2019)
42. So, D., Le, Q., Liang, C.: The evolved transformer. In: International Conference on Machine Learning. pp. 5877–5886. PMLR (2019)
43. So, D.R., Mańke, W., Liu, H., Dai, Z., Shazeer, N., Le, Q.V.: Primer: Searching for efficient transformers for language modeling. arXiv preprint arXiv:2109.08668 (2021)
44. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2446–2454 (2020)
45. Sun, P., Wang, W., Chai, Y., Elsayed, G., Bewley, A., Zhang, X., Sminchisescu, C., Anguelov, D.: RSN: Range sparse net for efficient, accurate lidar 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5725–5734 (2021)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. pp. 6105–6114. PMLR (2019)
47. Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., Han, S.: Searching efficient 3d architectures with sparse point-voxel convolution. In: European Conference on Computer Vision. pp. 685–702. Springer (2020)
48. Wang, Y., Fathi, A., Kundu, A., Ross, D.A., Pantofaru, C., Funkhouser, T., Solomon, J.: Pillar-based object detection for autonomous driving. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16. pp. 18–34. Springer (2020)
49. Wong, K.C., Moradi, M.: SegNAS3D: Network architecture search with derivative-free global optimization for 3d image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 393–401. Springer (2019)
50. Xu, H., Yao, L., Zhang, W., Liang, X., Li, Z.: Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6649–6658 (2019)
51. Yan, Y., Mao, Y., Li, B.: SECOND: Sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
52. Yang, B., Luo, W., Urtasun, R.: PIXOR: Real-time 3d object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7652–7660 (2018)
53. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11784–11793 (2021)
54. Yu, Q., Yang, D., Roth, H., Bai, Y., Zhang, Y., Yuille, A.L., Xu, D.: C2FNAS: Coarse-to-fine neural architecture search for 3d medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4126–4135 (2020)
55. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16259–16268 (2021)
56. Zhou, D., Fang, J., Song, X., Guan, C., Yin, J., Dai, Y., Yang, R.: IoU loss for 2d/3d object detection. In: 2019 International Conference on 3D Vision (3DV). pp. 85–94. IEEE (2019)
57. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
58. Zhou, Y., Sun, P., Zhang, Y., Anguelov, D., Gao, J., Ouyang, T., Guo, J., Ngiam, J., Vasudevan, V.: End-to-end multi-view fusion for 3d object detection in lidar point clouds. In: Conference on Robot Learning. pp. 923–932. PMLR (2020)
59. Zhou, Y., Tuzel, O.: VoxelNet: End-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4490–4499 (2018)
60. Zhu, Z., Liu, C., Yang, D., Yuille, A., Xu, D.: V-NAS: Neural architecture search for volumetric medical image segmentation. In: 2019 International Conference on 3D Vision (3DV). pp. 240–248. IEEE (2019)
61. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016)

A Anchor-Free Detection Head

We describe the details of our anchor-free detection head that works across views and formats. The key is to abstract away from the specific views and formats, and to think about the individual elements. The elements are individual voxels / pixels / pillars under the voxel / perspective / pillar view.

A.1 Training Phase

The detection head has two sequential jobs: finding the centers, and regressing box parameters from them.

Finding the Centers In training the network to find the centers, we construct a ground truth heatmap. For each element $e \in E$, where $E$ is the set of all elements, we use $V(e)$ to represent its Cartesian coordinates, which can be either 2-dimensional (only $x$ and $y$) or 3-dimensional (all of $x, y, z$). We construct its ground truth heatmap value to be:

$$ h(e) = \max_{c \in C(e)} \exp\left( - \frac{\lVert V(e) - c \rVert - \min_{f \in E} \lVert V(f) - c \rVert}{\sigma^2} \right) $$

where $C(e)$ is the set of centers of the boxes that contain $e$, and $\sigma$ is a hyperparameter. $h(e) = 0$ if $|C(e)| = 0$. Intuitively, the heatmap value is high when the element is close to an object center ($\lVert V(e) - c \rVert$). This distance is modified / compensated by the closest distance among all the elements ($\min_{f \in E} \lVert V(f) - c \rVert$).

A penalty-reduced focal loss [22,57] is used to train the predicted heatmap:

$$ L_{\mathrm{center}} = -\frac{1}{|E|} \sum_{e \in E} \left\{ (1 - \tilde{h}(e))^{\alpha} \log(\tilde{h}(e)) \, I_{h(e) > 1-\epsilon} + (1 - h(e))^{\beta} \tilde{h}(e)^{\alpha} \log(1 - \tilde{h}(e)) \, I_{h(e) \le 1-\epsilon} \right\} $$

where $\tilde{h}(e)$ is the predicted heatmap value for element $e$, $\alpha = 2$, $\beta = 4$, $\epsilon = 0.001$.

Regressing Box Parameters We use a smooth L1 loss to regress the 3-dimensional box center offsets, as well as the 3-dimensional box length, width, and height. We use a bin loss [41] to regress the heading. We also add an IoU loss [56]. These losses are only active for elements whose ground truth heatmap values are greater than a threshold $\delta$.

A.2 Inference Phase

After the forward pass produces the predicted heatmap $\tilde{h}$, the predicted object centers are the elements whose predicted heatmap value exceeds a threshold and is a local maximum. The latter is achieved by max pooling (possible on both dense grids and sparse ones) within a local window (3×3 or 3×3×3). The box parameter predictions on these elements complete the inference. We conclude by reiterating that when the view is voxel and the format is sparse, this detection head exactly follows RSN [45].

B Randomly Generated Architectures

We describe our procedure of randomly generating architectures stage by stage.

For the first stage, we add each view with probability 0.5 independently. For views that may have either dense or sparse formats, the format is selected with equal probability. The pillar / voxel size is 0.32 m to avoid voxelization mismatch complications. The number of channels is 32 multiplied by either 0.8, 1.0, or 1.2. The layer progression is randomly chosen among five choices. If the layer type is point, this means the number of dense-normalization-ReLU layers is between 1 and 5.
For the other layer types, this means the numbers of downsampling / upsampling scales are chosen from (0,0), (1,0), (2,0), (2,1), (2,2).

For the second stage, we again add each view with probability 0.5 independently. For each added view, we iterate through the views selected in the first stage, and add each of them as an ancestor with probability 0.5 independently. The generation process for the other parameters (pillar / voxel size, number of channels, layer progression) is the same as in the first stage.

The third stage should only contain one view. We select the view among voxel, perspective, and pillar with equal probability. For views that may have either dense or sparse formats, the format is selected with equal probability. For this selected view, all the views selected in the second stage are its ancestors. The generation process for the other parameters is the same as in the preceding stages.

The randomly generated architecture may be invalid for several reasons. Examples include: no branches are added in a particular stage; no ancestors are selected for a second-stage view; there may be views in the first stage that are not selected by any view in the second stage. If any of these situations happens, we reject the sample and sample again until it succeeds.
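The rejection-sampling procedure above can be summarized by the following sketch. The dictionary encoding and field names are our own, and the five layer progressions are represented only by an index.

```python
import random

VIEWS = ["point", "pillar", "voxel", "perspective"]

def _branch(view):
    """Randomly parameterized branch for one view (assumed field names)."""
    if view in ("pillar", "perspective"):
        fmt = random.choice(["dense", "sparse"])
    elif view == "voxel":
        fmt = "sparse"
    else:
        fmt = "point"
    return {"view": view, "format": fmt,
            "cell_size": 0.32,                                   # metres, fixed
            "channels": int(32 * random.choice([0.8, 1.0, 1.2])),
            "progression": random.randrange(5),                  # one of five choices
            "parents": []}

def sample_architecture(max_tries=1000):
    """Rejection-sample a 3-stage architecture as described above."""
    for _ in range(max_tries):
        stage1 = [_branch(v) for v in VIEWS if random.random() < 0.5]
        stage2 = [_branch(v) for v in VIEWS if random.random() < 0.5]
        for b in stage2:
            b["parents"] = [p["view"] for p in stage1 if random.random() < 0.5]
        stage3 = [_branch(random.choice(["voxel", "perspective", "pillar"]))]
        stage3[0]["parents"] = [b["view"] for b in stage2]
        # Reject invalid samples: empty stages, orphan stage-2 branches, or
        # stage-1 branches that no stage-2 branch consumes.
        if not stage1 or not stage2:
            continue
        if any(not b["parents"] for b in stage2):
            continue
        used = {v for b in stage2 for v in b["parents"]}
        if any(b["view"] not in used for b in stage1):
            continue
        return [stage1, stage2, stage3]
    raise RuntimeError("no valid architecture sampled")
```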