SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds

Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov
Waymo LLC
{peis, tanmingxing, weiyuewang, cxliu, feixia, lengzhaoqi, dragomir}@waymo.com

Abstract. 3D object detection in point clouds is a core component of modern robotics and autonomous driving systems. A key challenge in 3D object detection comes from the inherently sparse nature of point occupancy within the 3D scene. In this paper, we propose Sparse Window Transformer (SWFormer), a scalable and accurate model for 3D object detection, which takes full advantage of the sparsity of point clouds. Built upon the idea of window-based Transformers, SWFormer converts 3D points into sparse voxels and windows, and then processes these variable-length sparse windows efficiently using a bucketing scheme. In addition to self-attention within each spatial window, our SWFormer also captures cross-window correlation with multi-scale feature fusion and window shifting operations. To further address the unique challenge of detecting 3D objects accurately from sparse features, we propose a new voxel diffusion technique. Experimental results on the Waymo Open Dataset show that SWFormer achieves a state-of-the-art 73.36 L2 mAPH on vehicles and pedestrians for 3D object detection on the official test set, outperforming all previous single-stage and two-stage models while being much more efficient.

1 Introduction

3D point cloud representation learning is critical for autonomous driving, especially for core tasks like 3D object detection. The challenges of learning from 3D point clouds mainly come from two aspects. The first is that 3D points are sparsely distributed in 3D space due to the nature of LiDAR sensors. This forces 3D models to differ from dense models in natural language processing (where words in a sentence are dense) or image understanding (where pixels in an image are dense). The second is that both the number of points in a point cloud frame and the point cloud sensing region keep growing as LiDAR sensor hardware improves. Some of the latest commercial LiDARs can sense up to 250m [15] and 300m [44] in all directions around the vehicle, leading to a large point cloud range.

To address these challenges, previous works have proposed many methods that can be roughly organized into five categories. PointNet-based methods [30,32,38] treat 3D point clouds as unordered sets and encode them with MLPs and max pooling. A hierarchical structure is introduced to deal with the large input space and to better capture local information. These methods usually have inferior representation capacity compared with more recent methods. PointPillars-style methods [18] divide the space into grids of fixed sizes to convert the sparse 3D problem to a dense 2D problem. This approach scales quadratically with the range, making it hard to scale with the advancement of LiDAR hardware. Sparse submanifold convolution based methods [14,36,40] can handle the sparse input efficiently. These methods usually use small 3×3 convolution kernels which cannot connect features that are sparsely disconnected without adding normal sparse convolutions and striding. This weakness limits their representation capacity. Another weakness is their need for heavily optimized custom ops to be efficient on modern GPUs, and their incompatibility with matmul-optimized accelerators such as TPUs.
Range images are a compact representation of point clouds. Multi-view methods [50,40,43,2] run dense convolutions in this view to extract features and fuse them with BEV features learned in the PointPillars style to improve 3D representation learning. It is hard to regress 3D objects directly from the range image due to its lack of 3D information encoding in the dense 2D perspective convolutions. To tackle this weakness, graph-style kernels [4,12] replace convolutions to make use of the range information in range images to capture 3D information, which greatly improves accuracy but is still inferior to the state of the art. The Transformer [41] is designed to process sequences of data. The challenge in applying it to a point cloud is the quadratic complexity in the number of inputs. Recent methods tackle this problem by attending to neighboring points [29], neighboring voxels [23], or voxels in fixed windows [11]. A generic and efficient Transformer-only model without limitations such as a limited receptive field, irregular memory access patterns, and lack of scalability has yet to be designed.

In this paper, we adapt window-based Transformers to 3D point clouds. The Transformer [41] architecture has been hugely successful in modeling language sequences and image patches. In particular, on 2D images, Swin Transformer [22] proposed to partition images into windows and merge context information in a hierarchical manner. Our Sparse Window Transformer (SWFormer) builds upon similar ideas, but with several key adaptations for sparse windows. Our first adaptation is to add a bucketing-based window partition for sparse windows. Although each window has the same spatial size, such as a 10×10 voxel grid, the number of non-empty voxels in each window can vary significantly, so we group these windows into buckets with different effective sequence lengths. Our second adaptation is to limit the expensive window shifting. Swin Transformer [22] uses window shifting once per Transformer layer to connect features between windows and increase receptive fields, but this shifting operation is expensive in the sparse world as it needs to re-order all the sparse features with gather operations. Moreover, it is extremely slow on matmul-optimized accelerators such as TPUs. To address this issue, SWFormer employs a new hierarchical backbone architecture, where each SWFormer block has many Transformer layers but only one shifting operation, as shown in Figure 3. It relies on multi-scale features to achieve large receptive fields for context information, and a multi-scale fusion network to effectively combine these features. The model uses additional custom downsample and upsample algorithms to properly handle the sparse features during feature fusion.

Our innovation continues from the backbone into the 3D object detection head. Existing 3D object detection methods [51,18,50,43,13,36,4,40,24,46] can mostly be viewed as either anchor-based methods with implicit or explicit anchors, or DETR [3] based methods [26]. The detection performance is closely related to the distribution of the difference between anchors and ground truth. Methods with inaccurate anchors [4,24] perform poorly when detecting large objects such as vehicles, though they can have reasonable performance on pedestrians. One way to solve this problem is to use a two-stage model to refine the boxes [24,36], which greatly improves detection accuracy. CenterNet-style detection methods [13,46,40] strive to define anchors only at the centers of the ground truth boxes, which enforces distributions closer to zero mean with smaller variance. However, when detecting objects directly from sparse features (e.g.,
features from PointNet, submanifold convolutions, or sparse Transformers), there are not necessarily features close to the object centers. To alleviate this issue, [40] applies normal sparse convolutions to insert points in the convolution output; [11] scatters the sparse features to a dense BEV grid and runs dense convolutions to expand features to missing positions. These methods are expensive. In this paper, we propose a voxel diffusion module to address this issue efficiently and in a scalable way by segmenting foreground voxels and diffusing them to their nearby regions, as described in §3.4.

Extensive experiments are conducted on the challenging Waymo Open Dataset [39] to show the state-of-the-art results of SWFormer on 3D object detection. We summarize our contributions as follows:

– We propose a hierarchical Sparse Window Transformer (SWFormer) backbone for 3D representation learning. Its flexible receptive fields and multi-scale features make it suitable for different self-driving tasks like object detection and semantic segmentation.
– We propose a generic voxel diffusion module to address the unique challenge of anchor placement in 3D object detection from sparse features.
– We conduct extensive experiments on the Waymo Open Dataset [39] to demonstrate the state-of-the-art performance of our SWFormer model.

2 Related Work

2.1 3D object detection

As one of the most important tasks in autonomous driving, 3D object detection has been extensively studied in prior works. Early works like PointNet [30] and PointNet++ [32] directly apply multilayer perceptrons on individual points, but it is difficult to scale them to large point clouds with good accuracy. Current mainstream 3D object detectors often convert point clouds into bird's-eye-view 3D [51] or 2D voxels [18] (2D voxels are also referred to as pillars), where each voxel aggregates the information from the points it contains. In this way, regular 2D or 3D convolutional neural networks can be applied to process these bird's-eye-view representations. The pseudo image of voxels also makes it easier to reuse the rich research advancements in 2D object detection, such as two-stage or anchor-based detection heads [46]. The downside is that the pseudo image of voxels grows cubically/quadratically with the voxelization granularity and detection range, not to mention that many of the voxels are effectively empty. Therefore, another type of approach is to perform 3D object detection without voxelization. This includes methods that detect objects from the perspective view [25,4,12], or look up nearest neighbors for each point [28]. However, their detection accuracy is typically inferior to the voxelization route.

To have the best of both worlds, recent approaches [45,40,36] start to explore multi-view approaches and make use of sparse convolutions on the voxelized point cloud. For example, the recent range sparse net (RSN [40]) adopts a two-step approach, where the first step performs class-specific segmentation on the range image view, and the second step applies sparse 3D convolutions on the voxel view for specific classes. However, submanifold sparse convolutions cannot connect features that are sparsely disconnected without adding normal sparse convolutions and striding, and they often require heavily optimized customized ops to be efficient on modern accelerators.

Our work aims to learn 3D representations from sparse point clouds without using any dense or sparse convolutions. Instead, we resort to a hierarchical Transformer to achieve our goal.
2.2 Transformers

Transformers [41] have shown great success in natural language processing [7]. Recently, researchers have brought this architecture to computer vision [1,33,42,6]. ViT [9] partitions images into patches, which greatly advanced the use of Transformers for image classification. Swin Transformer [22] further demonstrated better ways to fuse contextual information through window shifting and hierarchy, and also generalized to other tasks such as segmentation and detection.

Interestingly, Transformers are naturally suitable for sparse point clouds, because they can take sequences of any length as inputs and do not require dense 2D/3D image representations. Therefore, recent works have attempted to adopt Transformers for 3D representation learning, but they are primarily developed for object scans and indoor applications [47,10,27,29]. Voxel Transformer [24] is the submanifold sparse convolution [14] counterpart in the Transformer world, replacing the convolution kernel with attention. Its irregular memory access pattern is computationally inefficient, and its accuracy is worse than state-of-the-art methods. Recently, SST [11] proposed a single-stride transformer for 3D object detection and achieved impressive results on the Waymo Open Dataset, especially for pedestrian detection. However, due to its single-stride nature, SST has a limited receptive field and thus has difficulty dealing with large objects, making it ineffective in important tasks like large vehicle detection, large object segmentation (e.g. buildings), lane detection, and trajectory prediction. It needs to scatter features to a dense BEV grid to run several dense convolutions, which limits its scalability. It is also computationally expensive, as it runs many layers of transformers on the high resolution feature map, which limits its applications in real-time systems.

Our work is inspired by window-based Transformers (e.g., Swin Transformer [22]) in the sense that we also adopt a hierarchical window-based Transformer backbone, but to address the unique challenges of 3D sparse point clouds, we propose several novel techniques such as the improved SWFormer blocks, multi-scale feature fusion, and voxel diffusion.

3 Sparse Window Transformer

3.1 Overall Architecture

SWFormer is a pure Transformer-based model without any convolutions. Figure 1 shows the overall network architecture: given a sequence of point cloud frames as inputs, each point is augmented with per-frame voxel features [18] and an auxiliary frame timestamp offset [40]. It uses dynamic voxelization [50] and a PointNet [18,30] based feature embedding net to get sparse voxel features. Note that our voxels are also referred to as pillars in other works [18]. These sparse voxels are then processed by a hierarchical sparse window Transformer network described in §3.2. The resulting multi-scale features are then fused with Transformer-based feature fusion blocks. To address the unique challenge of detecting 3D boxes from sparse features, we first segment the foreground voxels and then apply a voxel diffusion module to expand foreground voxels to neighboring locations with pseudo voxels. In the end, we apply a CenterNet-style [46,40,49] detection head to regress 3D boxes.
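To make the voxelization step concrete, here is a minimal NumPy sketch of dynamic voxelization with a per-voxel max-pooled point embedding. The helper name `dynamic_voxelize`, the random stand-in for the two-layer embedding MLP, and the 0.32 m pillar size (taken from §4.2) are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

def dynamic_voxelize(points_xyz, point_feats, voxel_size=0.32):
    """Dynamic voxelization sketch: every point is kept and assigned to the
    BEV pillar (infinite z extent) that contains it; per-pillar features are
    obtained by max pooling the point features, as in a PointNet embedding."""
    # Integer pillar coordinates in x/y; z is collapsed (pillars).
    coords = np.floor(points_xyz[:, :2] / voxel_size).astype(np.int64)
    # Unique non-empty pillars and the pillar index of every point.
    uniq_coords, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    num_voxels, feat_dim = len(uniq_coords), point_feats.shape[1]
    # Max pool point features into their pillars (scatter-max).
    voxel_feats = np.full((num_voxels, feat_dim), -np.inf)
    np.maximum.at(voxel_feats, point_to_voxel, point_feats)
    return uniq_coords, voxel_feats, point_to_voxel

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-50.0, 50.0, size=(10000, 3))   # fake LiDAR points
    feats = rng.normal(size=(10000, 128))             # stand-in for MLP outputs
    coords, vfeat, p2v = dynamic_voxelize(pts, feats)
    print(coords.shape, vfeat.shape)  # only non-empty pillars are materialized
```

The key property this sketch illustrates is that only non-empty pillars are materialized, so downstream layers operate on a sparse set of voxel features rather than a dense grid.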
Fig. 1. Overview of SWFormer model architecture. Given a sparse point cloud, we first perform voxelization to generate a grid of 2D voxels. These voxels are then processed with a 5-scale sequence of hierarchical SWFormer blocks (Figure 3), with strides {1, 2, 4, 16, 32}. The output features are combined with a multi-scale feature fusion network (Section 3.3). The fused features are fed to a head, which performs foreground segmentation and voxel diffusion (Section 3.4), and computes CenterNet-style classification and box regression losses (Section 3.5). Different object classes (e.g. vehicles and pedestrians) may use a separate head on different feature scales.

3.2 Hierarchical Sparse Window Transformer Encoder

A key concept of our SWFormer is the sparse window in the bird's-eye view. After points are converted to a grid of 2D voxels in the bird's-eye view, the voxel grid is further partitioned into a list of non-overlapping windows with fixed size H × W (e.g., 10×10), similar to Swin Transformer [22]; however, since points are often sparse, many voxels are empty with no valid points. Therefore, the number of non-empty voxels in each window may vary from 0 to HW. As we will explain later, all non-empty voxels within the same window are flattened to a single variable-length sequence and fed into Transformer layers. In practice, these variable-length sequences prevent us from batch training, causing lower training efficiency. To solve this issue, we borrow a widely used idea from natural language processing [41,8] and recent work [11], which groups these sparse windows into different buckets based on their sequence lengths. Concretely, we divide sparse windows into at most k buckets {B_0, B_1, ..., B_k}, where windows in B_i are always padded to a maximum sequence length of HW/2^i. All padded tokens are masked in Transformer layers.

Based on the aforementioned sparse windows, our encoder adopts hierarchical Transformers to process the inputs and produce a list of multi-scale BEV features. As shown in Figure 1, each scale starts with a sparse window partition layer followed by a multi-layer SWFormer block.

Sparse Window Partition: We divide the BEV voxels into non-overlapping windows with fixed size H × W, which are then grouped into buckets {B_0, B_1, ..., B_k}. For each bucket B_i, we flatten all voxels within the same window into a sequence and zero-pad the sequence length to HW/2^i. These sequences are then batched and fed to the Transformer blocks, where the self-attention shares the keys and values for all query voxels coming from the same window [22]. Since SWFormer processes inputs in a hierarchical fashion with multiple feature scales, we need to apply strided window partitions at the beginning of each scale. The strided window partition is similar to traditional strided convolutions, except that it always picks the voxel closest to the center of the window, with deterministic rules to break ties. Notably, no max or average pooling operations are applied because they are not friendly to sparse implementations. Figure 2 illustrates an example of a stride-4 window partition.
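As a concrete illustration of the sparse window partition and bucketing described above, the following NumPy sketch groups non-empty voxels into fixed-size windows and pads each window to the sequence length of its bucket (HW/2^i). The bucket count, helper names, and data layout are our own assumptions for illustration; they are not taken from the paper's implementation.

```python
import numpy as np

def sparse_window_partition(voxel_coords, voxel_feats, window=10, num_buckets=3):
    """Group non-empty voxels into H x W windows, then bucket windows by how
    many valid voxels they contain and pad each bucket to HW / 2^i tokens."""
    win_xy = voxel_coords // window                       # window index of each voxel
    win_ids, voxel_to_win = np.unique(win_xy, axis=0, return_inverse=True)
    counts = np.bincount(voxel_to_win)                    # valid voxels per window

    hw = window * window
    buckets = {i: {"feats": [], "mask": []} for i in range(num_buckets)}
    for w in range(len(win_ids)):
        # Tightest bucket i whose capacity HW / 2^i still fits this window.
        i = max(j for j in range(num_buckets) if counts[w] <= hw // (2 ** j))
        cap = hw // (2 ** i)
        feats = voxel_feats[voxel_to_win == w]
        pad = np.zeros((cap - len(feats), voxel_feats.shape[1]))
        mask = np.concatenate([np.ones(len(feats)), np.zeros(cap - len(feats))])
        buckets[i]["feats"].append(np.concatenate([feats, pad], axis=0))
        buckets[i]["mask"].append(mask)

    # Each bucket becomes one padded batch: [num_windows, capacity, channels].
    return {i: (np.stack(b["feats"]), np.stack(b["mask"]))
            for i, b in buckets.items() if b["feats"]}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.integers(0, 100, size=(5000, 2))          # non-empty voxel coords
    feats = rng.normal(size=(5000, 128))
    for i, (f, m) in sparse_window_partition(coords, feats).items():
        print(f"bucket {i}: windows={f.shape[0]}, seq_len={f.shape[1]}")
```

Within one bucket, every window has the same padded length, so the windows can be stacked into a regular batch and the padding handled with attention masks, which is what enables efficient batched training on accelerators.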
Fig. 2. Strided Sparse Window Partition. Left shows a grid of 16×16 BEV voxels, where grey voxels are empty and others are non-empty. Right shows the result of a stride-4 window partition, leading to a grid of 4×4 voxels. For each striding window, it picks the nearest-neighbor non-empty voxel feature (light green) to the center (black dot), with any deterministic rule to break ties; if all voxels are empty in the striding window, then the corresponding voxel after striding is also empty. Best viewed in color.

Fig. 3. Sparse Window Transformer Block. Given a sequence of sparse features, it first applies multi-head self-attention (MSA) on all valid voxels within the same window, followed by an MLP and layer norm. After repeating the Transformer layer N times, it performs a shifted sparse window partition to re-generate the sparse windows, and then processes the shifted windows with another M Transformer layers. If N and M are the same, we name it an N-layer SWFormer block for simplicity.

Sparse Window Transformer block: The Transformer [41] is inherently suitable for sparse point clouds, as it does not require the dense 2D/3D inputs needed by convolutional networks; unfortunately, due to the quadratic complexity of self-attention with respect to the input sequence length, it is prohibitively expensive to feed the whole point cloud (with millions of points) or voxel features (with tens of thousands of valid voxels) as a single input sequence to a Transformer. In this paper, we adopt the idea of Swin Transformer [22]: the sparse BEV voxels are first partitioned into windows, and the Transformer is applied to each window separately. To increase the receptive field and connect the features across windows, Swin Transformer uses a window shifting technique to re-partition the windows for every layer of the Transformer. However, as we are operating on sparse voxel features, such a shift-window operation is memory-read/write intensive, especially for matrix-optimized accelerators like TPUs. To alleviate this problem, we propose to limit the shift-window operation to once per stride rather than once per layer. Figure 3 shows the detailed architecture of a SWFormer block: it largely follows the same style as Swin Transformer, performing self-attention within a local window, except it only performs the shift-window operation once in the middle. Formally, our SWFormer block can be described as follows:

\begin{align}
z^0 &= [x;\ \mathrm{mask}_z] + \mathrm{PE}_z, \tag{1}\\
\hat{z}^l &= \mathrm{LN}\big(z^{l-1} + \mathrm{MSA}(z^{l-1})\big), \quad l = 1 \ldots N \nonumber\\
z^l &= \mathrm{LN}\big(\hat{z}^l + \mathrm{MLP}(\hat{z}^l)\big), \quad l = 1 \ldots N \nonumber\\
u^0 &= [\text{shift-window}(z^N);\ \mathrm{mask}_u] + \mathrm{PE}_u, \tag{2}\\
\hat{u}^l &= \mathrm{LN}\big(u^{l-1} + \mathrm{MSA}(u^{l-1})\big), \quad l = 1 \ldots M \nonumber\\
u^l &= \mathrm{LN}\big(\hat{u}^l + \mathrm{MLP}(\hat{u}^l)\big), \quad l = 1 \ldots M \nonumber
\end{align}

where x is the input features after the sparse window partition, mask_z is the mask for input padding, and PE_z is the positional encoding. The process contains two stages: (1) the first stage applies N Transformer layers to z^0 and outputs z^N. Each Transformer layer consists of a standard multi-head self-attention (MSA) and multilayer perceptron (MLP), but slightly differently from the standard version, here we adopt the post-norm scheme where layer norm (LN) is added after MSA and MLP. For simplicity, we use the standard sine/cosine absolute positional encoding in this paper. (2) The second stage first applies a window shift to z^N, and adds the updated mask_u and positional encoding PE_u based on z^N; afterwards, M Transformer layers are added to process u^0 and generate the final output u^M. Notably, each SWFormer block has N+M Transformer layers but only one window-shift operation.
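To make the block structure in Eqs. (1)–(2) concrete, here is a minimal PyTorch sketch of an N-M-layer SWFormer block: post-norm Transformer layers with masked self-attention over the voxels of one window, and a single window-shift callback in the middle. The module and argument names, the channel sizes, and the placeholder shift function are our own; this is an illustration of the equations, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PostNormLayer(nn.Module):
    """One Transformer layer in post-norm form: LN(x + MSA(x)), then LN(x + MLP(x))."""
    def __init__(self, dim=128, heads=8, mlp_ratio=2):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, pad_mask):
        # pad_mask: [batch, seq], True where the token is padding.
        a, _ = self.msa(x, x, x, key_padding_mask=pad_mask)
        x = self.ln1(x + a)
        return self.ln2(x + self.mlp(x))

class SWFormerBlock(nn.Module):
    """N layers on the current windows, one window shift, then M more layers."""
    def __init__(self, n=2, m=2, dim=128):
        super().__init__()
        self.stage1 = nn.ModuleList([PostNormLayer(dim) for _ in range(n)])
        self.stage2 = nn.ModuleList([PostNormLayer(dim) for _ in range(m)])

    def forward(self, z, valid_mask, pos_z, shift_window_fn):
        z = z + pos_z                                    # Eq. (1): add PE; padding handled via mask
        for layer in self.stage1:
            z = layer(z, ~valid_mask)
        u, valid_u, pos_u = shift_window_fn(z, valid_mask)  # Eq. (2): re-partition once
        u = u + pos_u
        for layer in self.stage2:
            u = layer(u, ~valid_u)
        return u, valid_u

if __name__ == "__main__":
    # Toy bucket: 4 windows, 50 tokens each, 128 channels; identity "shift" as a stand-in.
    z = torch.randn(4, 50, 128)
    mask = torch.rand(4, 50) > 0.3                       # True = valid voxel
    pos = torch.zeros_like(z)
    block = SWFormerBlock()
    out, _ = block(z, mask, pos, lambda f, m: (f, m, torch.zeros_like(f)))
    print(out.shape)
```

In a full model, `shift_window_fn` would re-run the sparse window partition on coordinates offset by half a window (Appendix A) and re-bucket the voxels; here it is stubbed out to keep the sketch self-contained.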
By restricting window-shift operations, our SWFormer block is more efficient than the conventional Swin Transformer; however, it also limits the receptive field, since each Transformer layer is only applied to a small window. To address this challenge, SWFormer is designed as a hierarchical network with multiple scales, where the strides are gradually increased: for simplicity, this paper uses strides {1, 2, 4, 16, 32} for the five scales. For each scale, we always keep the window size fixed (e.g., 10×10); however, as the later scales have larger strides, the same window in later scales covers a much larger area. As an example, for the last scale with stride 32, a 10×10 window covers a 320×320 area on the original BEV voxel grid, and a single window shift connects all features within an area as large as 480×480.

3.3 Multi Scale Feature Fusion

Inspired by the feature pyramid network (FPN [20]), SWFormer adopts a Transformer-based multi-scale feature network to effectively combine all features from the hierarchical Transformer encoder. Figure 4 shows the overall architecture of the feature network: given a list of encoder features {P_0, P_1, ..., P_5}, it iteratively fuses (P_{i+1}, P_i) from the large-stride P_5 to the small-stride P_0. Formally, our feature fusion process can be described as:

\begin{align}
\hat{P}_5 &= P_5, \tag{3}\\
\hat{P}_i &= \mathrm{SWFormer}\big(\mathrm{Concat}(P_i, \mathrm{Upsample}(\hat{P}_{i+1}))\big), \quad i = 0, \ldots, 4 \tag{4}
\end{align}

Fig. 4. Feature Fusion. Feature P_{i+1} is upsampled and concatenated with P_i to generate P'_i and the final P̂_i. During upsampling, we only duplicate P_{i+1} features to locations that are non-empty in P_i.

Starting from the last feature map P_5, we first upsample it to have the same stride as P_4 such that they can be concatenated into a single feature map; afterwards, we simply apply a 1-layer SWFormer block to process the concatenated feature and generate the new P̂_4. The process is iterated until all fused features {P̂_0, ..., P̂_5} have been generated, which have the same strides as the {P_0, ..., P_5} features. The fused features are further used in voxel diffusion and box regression as described in the following sections.

One challenge in sparse upsampling is that one cannot naively duplicate the feature to all upsampled locations (as commonly done in dense upsampling), which would cause unnecessary excessive feature duplication and significantly reduce the sparsity. In this paper, we restrict features in P_{i+1} to only duplicate to locations that have non-empty features in P_i, as shown in Figure 4. In this way, we can ensure P̂_i has the same sparsity as P_i.
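The sparsity-preserving upsampling in the fusion step can be sketched in a few lines of NumPy: for every non-empty voxel in P_i we look up its parent location in the coarser P̂_{i+1} and copy that feature, so no new locations are created. The dictionary-based lookup, the stride arithmetic, and the function name below are our own simplifications for illustration; the concatenated result would then be processed by a 1-layer SWFormer block as in Eq. (4).

```python
import numpy as np

def sparse_upsample_and_concat(coords_i, feats_i, coords_c, feats_c, stride_ratio):
    """Fuse a fine scale P_i with a coarser scale P_{i+1} (Eq. 4, before the
    1-layer SWFormer block). Coarse features are duplicated only to locations
    that are already non-empty in P_i, so the output keeps P_i's sparsity."""
    # Map each coarse voxel coordinate to its feature row.
    coarse_lut = {tuple(c): k for k, c in enumerate(coords_c)}
    zeros = np.zeros(feats_c.shape[1])
    upsampled = np.stack([
        feats_c[coarse_lut[tuple(c)]] if tuple(c) in coarse_lut else zeros
        for c in coords_i // stride_ratio           # parent coarse cell of each fine voxel
    ])
    # P'_i: concatenation of the fine features with the upsampled coarse features.
    return np.concatenate([feats_i, upsampled], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords_i = rng.integers(0, 64, size=(300, 2))   # non-empty voxels at stride s
    coords_c = np.unique(coords_i // 4, axis=0)     # their parents at stride 4s
    p_i = rng.normal(size=(300, 128))
    p_c = rng.normal(size=(len(coords_c), 128))
    fused = sparse_upsample_and_concat(coords_i, p_i, coords_c, p_c, stride_ratio=4)
    print(fused.shape)                              # (300, 256): same sparsity as P_i
```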
3.4 Voxel Diffusion

Fig. 5. Voxel Diffusion. After foreground segmentation, each voxel receives a segmentation score s ∈ [0,1]. All voxels with scores greater than a threshold γ = 0.05 are scattered to a dense BEV grid, and then we apply a k×k max pooling on the dense BEV grid to expand valid voxel features to their neighboring locations, where k is set to 5 in this example. (Left) before diffusion, there are only two foreground voxels with segmentation scores {0.5, 0.9} greater than γ; (Right) after voxel diffusion, 47 voxels become valid. Best viewed in color.

To detect 3D objects from sparse voxel features, a unique challenge is that there might be no valid voxel feature near object centers, which are the best positions to place implicit [46] or explicit anchors [34]. Prior works have attempted to resolve this issue by: 1) second-stage box refinement [36]; 2) sparse convolutions [40] or coordinate refinement [29] that can expand features to empty voxels close to the object centers; 3) scattering sparse voxel features to a dense grid and applying dense convolutions [11]. In this paper, we propose a novel voxel diffusion module to effectively and efficiently address this challenge.

Voxel diffusion is based on two simple ideas. First, we segment all foreground voxels by jointly performing foreground/background segmentation, thus effectively filtering out the majority of background voxels. Second, we expand all foreground voxels to their neighboring locations (initialized with zeros) with a simple k×k max pooling operation on the dense BEV grid, where k is the detection-head-specific diffusion factor that controls the magnitude of the expansion. The diffused voxel features are further connected and processed with a few Transformer layers. Combining these two ideas, we can simultaneously keep voxel features sparse (by filtering out background voxels) and filled (by voxel diffusion) for voxels close to the object centers. Figure 5 illustrates an example of voxel diffusion.

Our foreground segmentation is jointly trained with object detection. Specifically, for each voxel, we assign a binary ground truth label: 0 (background, the voxel does not overlap with any object) and 1 (foreground, the voxel overlaps with at least one object). The foreground segmentation is trained with a two-class focal loss [21] for each object class c:

\begin{equation}
L^c_{\mathrm{seg}} = \frac{1}{N}\sum_i L_i \tag{5}
\end{equation}

where N is the total number of valid voxels and L_i is the focal loss for voxel i. At inference time, we keep voxels as foreground if their foreground scores are greater than a threshold γ.
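A minimal NumPy/SciPy sketch of the diffusion step described above: threshold the per-voxel foreground scores, scatter the surviving voxels to a dense BEV grid, and run a k×k maximum filter so their features spread to zero-initialized neighbors. The grid size, helper names, and the use of `scipy.ndimage.maximum_filter` as the max-pooling primitive are our own choices for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def voxel_diffusion(coords, feats, fg_scores, grid=(64, 64), k=5, gamma=0.05):
    """Expand foreground voxel features to a k x k neighborhood on a dense BEV
    grid, then return the resulting (larger) set of sparse valid voxels."""
    keep = fg_scores > gamma                       # drop background voxels
    dense = np.zeros(grid + (feats.shape[1],))     # zero-initialized dense grid
    occ = np.zeros(grid, dtype=np.uint8)
    dense[coords[keep, 0], coords[keep, 1]] = feats[keep]
    occ[coords[keep, 0], coords[keep, 1]] = 1

    # k x k max pooling with stride 1 over each channel (zero padding at borders).
    # Features are assumed non-negative here for simplicity.
    diffused = maximum_filter(dense, size=(k, k, 1), mode="constant", cval=0.0)
    occ = maximum_filter(occ, size=(k, k), mode="constant", cval=0) > 0

    ys, xs = np.nonzero(occ)                       # new, denser set of valid voxels
    return np.stack([ys, xs], axis=1), diffused[ys, xs]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.integers(0, 64, size=(40, 2))
    feats = np.abs(rng.normal(size=(40, 128)))
    scores = rng.uniform(size=40)
    new_coords, new_feats = voxel_diffusion(coords, feats, scores)
    print(len(coords), "->", len(new_coords), "valid voxels after diffusion")
```

The output is still a sparse set of voxels, only moderately larger than the foreground set, which is what keeps the subsequent Transformer layers and the detection head cheap.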
3.5 Box Regression

SWFormer follows [40] in using a modified CenterNet [49,13,40,46] head to regress boxes from voxel features. The heatmap loss is computed as a penalty-reduced focal loss [49,21] per object class:

\begin{equation}
L^c_{\mathrm{hm}} = -\frac{1}{N}\sum_i \Big\{ (1-\tilde{h}_i)^{\alpha}\log(\tilde{h}_i)\, I_{h_i > 1-\epsilon} + (1-h_i)^{\beta}\,\tilde{h}_i^{\alpha}\log(1-\tilde{h}_i)\, I_{h_i \le 1-\epsilon} \Big\}, \tag{6}
\end{equation}

where h̃_i and h_i are the predicted and ground truth heatmap values for object class c at voxel i, and N is the number of boxes in class c. We use ε = 1e-3, α = 2 and β = 4 in all experiments, following [49,19,40]. SWFormer parameterizes 3D boxes as b = {d_x, d_y, d_z, l, w, h, θ}, where d_x, d_y, d_z are the box center offsets relative to the voxel centers, and l, w, h, θ are the box length, width, height and heading. We follow [40] in applying a bin loss [38] to regress the heading θ, a smooth L1 loss to regress the other box parameters, and an IoU loss [48] to improve overall box accuracy, on the voxels with ground truth heatmap values above a threshold δ_1:

\begin{align}
L^c_{\theta_i} &= L_{\mathrm{bin}}(\theta_i, \tilde{\theta}_i), \tag{7}\\
L^c_{b_i \setminus \theta_i} &= \mathrm{SmoothL1}(b_i \setminus \theta_i - \tilde{b}_i \setminus \tilde{\theta}_i), \tag{8}\\
L^c_{\mathrm{box}} &= \frac{1}{N}\sum_i \big(L_{\theta_i} + L_{b_i \setminus \theta_i} + L_{\mathrm{iou}_i}\big)\, I_{h_i > \delta_1}, \tag{9}
\end{align}

where b̃_i, b_i are the predicted and ground truth box parameters respectively, and θ̃_i, θ_i are the predicted and ground truth box headings respectively.

The net is trained end to end with the total loss defined as

\begin{equation}
L = \sum_c \big(\lambda_1 L^c_{\mathrm{seg}} + \lambda_2 L^c_{\mathrm{hm}} + L^c_{\mathrm{box}}\big). \tag{10}
\end{equation}

When decoding prediction boxes, we first filter voxels with heatmap values less than a threshold δ_2, then run max pooling on the heatmap to select boxes corresponding to the local heatmap maxima, without any non-maximum suppression.

4 Experiments

We describe the SWFormer implementation details, and demonstrate its efficiency and accuracy in multiple experiments. Ablation studies are conducted to understand the importance of various design choices.

4.1 Waymo Open Dataset

Our experiments are primarily based on the challenging Waymo Open Dataset (WOD) [39], which has been adopted in many recent state-of-the-art 3D detection methods [36,46,40,11,31]. The dataset contains 1150 scenes, split into 798 training, 202 validation, and 150 test scenes. Each scene has about 200 frames, where each frame captures the full 360 degrees around the ego-vehicle. The dataset has one long-range LiDAR with range capped at 75 meters, four near-range LiDARs and five cameras. SWFormer uses all five LiDARs in the experiments.

4.2 Implementation Details

We normalize intensity and elongation in the raw point cloud with the tanh function. The dynamic voxelization uses a 0.32m voxel size in x, y and an infinite size in z. During training, we ignore all ground truth boxes with fewer than five points inside. The voxel feature embedding net has two layers of MLPs with channel size 128. All of the Transformer layers have channel size 128, 8 heads, and an inner MLP ratio of 2. We also use stochastic depth [16] with survival probability 0.6. The segmentation cutoff γ in §3.4 is set to 0.05. The heatmap thresholds δ_1, δ_2 are set to 0.2, 0.1 respectively for both the vehicle and pedestrian heads. For training efficiency, we cap the number of regression targets in each frame at 1024 for vehicles and 800 for pedestrians, sorted by ground truth heatmap values. λ_1, λ_2 are set to 200 and 10 in Eq. 10.

Data augmentation. We have adopted several popular 3D data augmentation techniques described in [5] during training: randomly rotating the world by yaws uniformly chosen from [−π, π] with probability 0.74, randomly flipping the world along the y-axis with probability 0.5, randomly scaling the world with a scaling factor uniformly chosen within [0.95, 1.05), and randomly dropping points with probability 0.05.

Training and Inference. The SWFormer models are trained end-to-end with 32 TPUv3 cores using the Adam optimizer [17] for a total of 128 epochs with an initial learning rate of 1e-3. We apply cosine learning rate decay and an 8-epoch warmup with the initial warmup learning rate set to 5e-4.
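The augmentation recipe above can be sketched directly in NumPy. The probabilities and ranges are the ones quoted in the Data augmentation paragraph; the box parameterization, the heading/flip sign conventions, and the reading of "dropping points with probability 0.05" as per-point dropout are our own assumptions for illustration.

```python
import numpy as np

def augment_frame(points, boxes, rng):
    """World-level 3D augmentation: random yaw rotation (p=0.74), y-flip (p=0.5),
    global scaling in [0.95, 1.05), and random per-point dropout (p=0.05).
    boxes: [M, 7] = (cx, cy, cz, length, width, height, heading)."""
    points, boxes = points.copy(), boxes.copy()

    if rng.uniform() < 0.74:                                 # random global yaw
        yaw = rng.uniform(-np.pi, np.pi)
        c, s = np.cos(yaw), np.sin(yaw)
        rot = np.array([[c, -s], [s, c]])
        points[:, :2] = points[:, :2] @ rot.T
        boxes[:, :2] = boxes[:, :2] @ rot.T
        boxes[:, 6] += yaw

    if rng.uniform() < 0.5:                                  # flip along the y-axis
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1

    scale = rng.uniform(0.95, 1.05)                          # global scaling
    points[:, :3] *= scale
    boxes[:, :6] *= scale

    keep = rng.uniform(size=len(points)) >= 0.05             # drop 5% of points
    return points[keep], boxes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-50, 50, size=(1000, 3))
    bxs = np.array([[10.0, 5.0, 1.0, 4.5, 2.0, 1.6, 0.3]])
    aug_pts, aug_bxs = augment_frame(pts, bxs, rng)
    print(aug_pts.shape, aug_bxs[0])
```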
4.3 Main Results

We measured the detection results using the official WOD detection metrics: BEV and 3D average precision (AP), and heading-error-weighted BEV and 3D average precision (APH), for L1 (easy) and L2 (hard) difficulty levels [39]. The official metrics used for ranking on the leaderboard use an IoU cutoff of 0.7 for vehicles and 0.5 for pedestrians. We report additional AP results at IoU of 0.8 for vehicles and 0.6 for pedestrians. Large vehicles, with a maximum dimension greater than 7 meters, are also reported. Table 1 reports the main results on the validation set, Table 2 reports additional results for high IoU and large vehicles on the validation set, and Table 3 shows the test set results obtained by submitting our predictions to the official test server. Results from methods with test time augmentation or ensembling are not included.

As shown in Table 1, SWFormer achieves new state-of-the-art results for vehicle detection on the WOD validation set: it is 1.5 APH/L2 higher than the prior best single-stage model RSN [40]. SWFormer even outperforms the prior best-performing two-stage method PVRCNN++ [37] by 0.42 APH/L2. Importantly, SWFormer performs very well at detecting large vehicles, 6.35 AP/L2 higher than the prior art RSN [40], as shown in Table 2. SWFormer slightly outperforms the state-of-the-art single-stage method SST 3f [11] by 0.12 APH/L2. Notably, the single-frame, single-stage SWFormer 1f also outperforms all prior single-frame methods.

We have compiled the model with XLA [35] and ran inference on the 15th frame of scene 8907419590259234067_1960_000_1980_000, which has 68 vehicles and 69 pedestrians, on an Nvidia T4 GPU. The latency is 43ms, more efficient than the popular real-time detector PointPillars [18], which takes about 100ms on the same GPU with our own implementation. With fused Transformer GPU kernels and optimized GPU sparse window partition operations, the latency can be further reduced to 20ms.

Table 3 compares vehicle and pedestrian detection results with published results on the WOD test set, and shows that SWFormer outperforms all previous single-stage or two-stage methods on the official ranking metric mAPH/L2.

Table 1. WOD validation set results. † is from [40]. Top methods are highlighted; top one-frame (cyan) and single-stage (blue) results are colored. TS: two-stage. BEV: BEV L1 AP.

Method              | TS | Vehicle 3D L1 AP/APH | Vehicle 3D L2 AP/APH | Vehicle BEV | Pedestrian 3D L1 AP/APH | Pedestrian 3D L2 AP/APH | Pedestrian BEV
PVRCNN++ [37]       | ✓  | 79.3/78.8 | 70.6/70.2 | -    | 81.8/76.3 | 73.2/68.0 | -
VoTr-TSD [24]       | ✓  | 75.0/74.3 | 65.9/65.3 | -    | -         | -         | -
SST_TS 3f [11]      | ✓  | 78.7/78.2 | 70.0/69.6 | -    | 83.8/80.1 | 75.9/72.4 | -
CenterPoint_TS [46] | ✓  | 76.6/76.1 | 68.9/68.4 | -    | 79.0/73.4 | 71.0/65.8 | -
PointPillars [18]†  | ✗  | 63.3/62.7 | 55.2/54.7 | 82.5 | 68.9/56.6 | 60.0/49.1 | 76.0
MVF++ 1f [31]       | ✗  | 74.6/-    | -         | 87.6 | 78.0/-    | -         | 83.3
RSN 1f [40]         | ✗  | 75.1/74.6 | 66.0/65.5 | 88.5 | 77.8/72.7 | 68.3/63.7 | 83.4
RSN 3f [40]         | ✗  | 78.4/78.1 | 69.5/69.1 | 91.3 | 79.4/76.2 | 69.9/67.0 | 85.0
SST 1f [11]         | ✗  | 74.2/73.8 | 65.5/65.1 | -    | 78.7/69.6 | 70.0/61.7 | -
SST 3f [11]         | ✗  | 77.0/76.6 | 68.5/68.1 | -    | 82.4/78.0 | 75.1/70.9 | -
SWFormer 1f (Ours)  | ✗  | 77.8/77.3 | 69.2/68.8 | 91.7 | 80.9/72.7 | 72.5/64.9 | 86.1
SWFormer 3f (Ours)  | ✗  | 79.4/78.9 | 71.1/70.6 | 92.6 | 82.9/79.0 | 74.8/71.1 | 87.5

Table 2. Additional WOD validation set results. Top methods are highlighted.

Method             | Vehicle L1 AP (3D, IoU=0.8) | Vehicle L1 AP (BEV, Large) | Vehicle L1 AP (3D, Large) | Pedestrian L1 AP (3D, IoU=0.6)
MVF++ 1f [31]      | 43.3 | -    | -    | 56.0
RSN 3f [40]        | 46.4 | 53.1 | 45.2 | -
SWFormer 3f (Ours) | 47.5 | 60.1 | 51.5 | 62.1

Table 3. WOD test set results. † is from [40]. Top methods are highlighted. mAPH/L2 is the official ranking metric on the WOD leaderboard. TS is short for two-stage.

Method             | TS | mAPH L2 | Vehicle 3D L1 AP/APH | Vehicle 3D L2 AP/APH | Pedestrian 3D L1 AP/APH | Pedestrian 3D L2 AP/APH
CenterPoint [46]   | ✓  | 69.1  | 80.20/79.70 | 72.20/71.80 | 78.30/72.10 | 72.20/66.40
SST_TS 3f [11]     | ✓  | 72.94 | 80.99/80.62 | 73.08/72.74 | 83.05/79.38 | 76.65/73.14
PVRCNN++ [37]      | ✓  | 71.24 | 81.62/81.20 | 73.86/73.47 | 80.41/74.99 | 74.12/69.00
PointPillars [18]† | ✗  | 55.10 | 68.60/68.10 | 60.50/60.10 | 68.00/55.50 | 61.40/50.10
RSN 3f [40]        | ✗  | 69.70 | 80.70/80.30 | 71.90/71.60 | 78.90/75.60 | 70.70/67.80
SWFormer 3f (Ours) | ✗  | 73.36 | 82.89/82.49 | 75.02/74.65 | 82.13/78.13 | 75.87/72.07

4.4 Ablation Study

Voxel diffusion is one of the primary contributions of this paper. We study its impact by varying the diffusion window size k introduced in §3.4. The results in Table 4 show the significance of voxel diffusion. Disabling voxel diffusion (i.e. setting k = 1) results in a 6.37 and 3.22 3D AP drop compared with k = 9 on vehicle and pedestrian detection respectively. Increasing k can slightly improve the detection accuracy, especially on vehicles.

Multi-scale features improve the model accuracy as shown in Table 5, especially when going from one scale to two scales. The impact is larger on vehicle detection (+2.72 3D AP) than on pedestrian detection (+1.15 3D AP). The 3-scale model has accuracy very close to the full 5-scale model. In practice, we can trade off between accuracy and latency by adjusting the number of scales. Note that some autonomous driving tasks, such as lane detection and behavior prediction, require a larger receptive field. The success of training a deep five-scale SWFormer model shows its potential in those tasks.
Table 4. Impact of Voxel Diffusion. Compared to the baseline (window size = 1), our voxel diffusion improves accuracy, especially with a large diffusion window size.

Diffusion Window Size | 1     | 3     | 5     | 9
Vehicle 3D AP/L1      | 72.13 | 78.07 | 78.58 | 78.50
Pedestrian 3D AP/L1   | 79.23 | 82.28 | 82.44 | 82.45
Vehicle BEV AP/L1     | 82.42 | 91.19 | 92.09 | 92.03
Pedestrian BEV AP/L1  | 83.65 | 87.01 | 87.15 | 87.47

Table 5. Impact of Multi-Scale and Window Shifting. Compared to a single scale, multi-scale has much better accuracy. Window shifting is also important for performance.

                      | Number of Scales              | Window Shift
                      | 1     | 2     | 3     | 5     | ✗     | ✓
Vehicle 3D AP/L1      | 74.96 | 77.68 | 78.88 | 79.36 | 76.74 | 79.36
Pedestrian 3D AP/L1   | 81.24 | 82.39 | 82.19 | 82.91 | 81.19 | 82.91
Vehicle BEV AP/L1     | 89.55 | 91.83 | 92.23 | 92.60 | 90.74 | 92.60
Pedestrian BEV AP/L1  | 86.48 | 87.30 | 87.13 | 87.54 | 86.46 | 87.54

Window shifting was introduced in Swin Transformer [22] to connect the features among windows. We have limited its usage to once per scale. What happens if we completely remove it? Table 5 shows a clear accuracy drop, especially on vehicles, if the window-shift operations are removed from the SWFormer blocks. This matches our intuition that it is important to keep one window shift operation per scale to make sure every voxel gets a similar receptive field in all directions.

5 Conclusion

This paper presents SWFormer, a scalable and accurate sparse-window, Transformer-only model, to effectively learn 3D point cloud representations for object detection. Built upon window-based Transformers, it addresses the unique challenges brought by sparse 3D point clouds and proposes a bucketing-based, multi-scale Transformer neural network. SWFormer takes full advantage of the sparsity of point clouds, and can effectively process sparse windows of point clouds using pure Transformer layers without any convolutions. It also proposes a novel voxel diffusion module to further detect 3D objects from sparse features. Experiments show state-of-the-art results on the challenging Waymo Open Dataset.

References

1. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3286–3295 (2019)
2. Bewley, A., Sun, P., Mensink, T., Anguelov, D., Sminchisescu, C.: Range conditioned dilated convolutions for scale invariant 3d object detection. In: Conference on Robot Learning (2020)
3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision. pp. 213–229. Springer (2020)
4. Chai, Y., Sun, P., Ngiam, J., Wang, W., Caine, B., Vasudevan, V., Zhang, X., Anguelov, D.: To the point: Efficient 3d object detection in the range image with graph convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000–16009 (2021)
5. Cheng, S., Leng, Z., Cubuk, E.D., Zoph, B., Bai, C., Ngiam, J., Song, Y., Caine, B., Vasudevan, V., Li, C., et al.: Improving 3d object detection through progressive population based augmentation. In: European Conference on Computer Vision. pp. 279–294. Springer (2020)
6. Dai, Z., Liu, H., Le, Q., Tan, M.: Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems 34 (2021)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
9. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
10. Engel, N., Belagiannis, V., Dietmayer, K.: Point transformer. IEEE Access 9, 134826–134840 (2021)
11. Fan, L., Pang, Z., Zhang, T., Wang, Y.X., Zhao, H., Wang, F., Wang, N., Zhang, Z.: Embracing single stride 3d object detector with sparse transformer. arXiv preprint arXiv:2112.06375 (2021)
12. Fan, L., Xiong, X., Wang, F., Wang, N., Zhang, Z.: Rangedet: In defense of range view for lidar-based 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2918–2927 (2021)
13. Ge, R., Ding, Z., Hu, Y., Wang, Y., Chen, S., Huang, L., Li, Y.: Afdet: Anchor free one stage 3d object detection. arXiv preprint arXiv:2006.12671 (2020)
14. Graham, B., van der Maaten, L.: Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307 (2017)
15. Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
16. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. ECCV pp. 646–661 (2016)
17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
18. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: CVPR (2019)
19. Law, H., Deng, J.: Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 734–750 (2018)
20. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)
21. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
22. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. CVPR (2021)
23. Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., Xu, C.: Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3164–3173 (2021)
24. Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., Xu, C.: Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3164–3173 (2021)
25. Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12677–12686 (2019)
26. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2906–2917 (2021)
27. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2906–2917 (2021)
28. Ngiam, J., Caine, B., Han, W., Yang, B., Chai, Y., Sun, P., Zhou, Y., Yi, X., Alsharif, O., Nguyen, P., et al.: Starnet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069 (2019)
29. Pan, X., Xia, Z., Song, S., Li, L.E., Huang, G.: 3d object detection with pointformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7463–7472 (2021)
30. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR (2017)
31. Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard 3d object detection from point cloud sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6134–6144 (2021)
32. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: NeurIPS (2017)
33. Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.: Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems 32 (2019)
34. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (2016)
35. Sabne, A.: Xla: Compiling machine learning for peak performance (2020)
36. Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In: CVPR (2020)
37. Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., Li, H.: Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. arXiv preprint arXiv:2102.00463 (2021)
38. Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: CVPR (2019)
39. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: CVPR (2020)
40. Sun, P., Wang, W., Chai, Y., Elsayed, G., Bewley, A., Zhang, X., Sminchisescu, C., Anguelov, D.: Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5725–5734 (2021)
41. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
42. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803 (2018)
43. Wang, Y., Fathi, A., Kundu, A., Ross, D., Pantofaru, C., Funkhouser, T., Solomon, J.: Pillar-based object detection for autonomous driving. In: ECCV (2020)
44. Waymo: Waymo's 5th generation driver. https://blog.waymo.com/2020/03/introducing-5th-generation-waymo-driver.html
45. Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sensors (2018)
46. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11784–11793 (2021)
47. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16259–16268 (2021)
48. Zhou, D., Fang, J., Song, X., Guan, C., Yin, J., Dai, Y., Yang, R.: Iou loss for 2d/3d object detection (2019)
49. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
50. Zhou, Y., Sun, P., Zhang, Y., Anguelov, D., Gao, J., Ouyang, T., Guo, J., Ngiam, J., Vasudevan, V.: End-to-end multi-view fusion for 3d object detection in lidar point clouds. In: CoRL (2019)
51. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: CVPR (2018)

A Window Shift

The window shift operation (Figure 6) is implemented by adding offsets of half of the window sizes to the voxelized coordinates and then running the same window partition algorithm. We have proposed limiting the window shift operation to once per scale for efficiency in processing sparse inputs. We wondered what happens if we add more window shifts to the model. Will it impact model accuracy? We added one more window shift to scale 1 and scale 2 respectively. This slowed down our training by 10%. Surprisingly, it slightly decreased the model accuracy, as shown in Table 6. Our hypothesis is that more shifts make the model harder to train when we do not need to rely on window shifts to increase the receptive field.

Fig. 6. Sparse window shift. The dark blue cells are voxels with points. The light blue cells are voxels without points. Left shows a grid of 8×8 BEV voxels partitioned into 4 non-empty sparse windows with window size 4×4. After the window shift, this results in 5 non-empty sparse windows, as shown on the right.

Table 6. Impact of adding more window shifts.

More Window Shifts | Vehicle 3D AP/L1 | Pedestrian 3D AP/L1
✗                  | 79.36            | 82.91
✓                  | 79.17            | 82.36
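As described above, the shifted partition is simply the regular window partition run on offset coordinates. A minimal NumPy sketch follows; the helper names and the window size of 10 are our own illustrative choices.

```python
import numpy as np

def window_partition(voxel_coords, window=10):
    """Assign each non-empty voxel to a window; returns the unique (non-empty)
    windows and the per-voxel window index."""
    win_xy = voxel_coords // window
    win_ids, voxel_to_win = np.unique(win_xy, axis=0, return_inverse=True)
    return win_ids, voxel_to_win

def shifted_window_partition(voxel_coords, window=10):
    """Window shift: offset coordinates by half a window, then reuse the same
    partition algorithm, so no dedicated shifted-window code path is needed."""
    return window_partition(voxel_coords + window // 2, window)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.integers(0, 80, size=(2000, 2))
    wins, _ = window_partition(coords)
    shifted_wins, _ = shifted_window_partition(coords)
    print(len(wins), "windows before shift,", len(shifted_wins), "after shift")
```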
B Qualitative Results

Figure 7 visualizes ground truth boxes, detected boxes, and attention scores for layers selected from different scales for the 15th frame in scene 8907419590259234067_1960_000_1980_000, selected from the Waymo Open Dataset validation set. The selected layers are strides 1 and 2 after multi-scale feature fusion, and strides 1, 2, 4, and 16 from the main backbone. We use all foreground points as the query points. The predicted boxes almost overlap perfectly with the ground truth boxes. The attention score pattern shown in these subplots indicates that different information is captured in different layers and scales. Interestingly, we have found that most of the attention scores are either 0 or 1 for foreground query points. We hope that these findings can inspire more research in the future.

Fig. 7. Attention scores and model prediction visualization (panels: Stride 1 Fused, Stride 2 Fused, Stride 1, Stride 2, Stride 4, Stride 16). Blue box: ground truth vehicle. Yellow box: vehicle prediction. Green box: ground truth pedestrian. Purple box: pedestrian detection. Points are colored with the bwr colormap (0: blue, 1: red), where red points mean attention scores close to 1. As red points are distributed differently in each subfigure, it is clear that different layers are attending to different locations.

C Future Work: More Tasks

The Waymo Open Dataset [39] has recently added semantic segmentation labels for about 14% of the frames per scene for all of the 1150 scenes. We have extended the SWFormer detection network to perform joint semantic segmentation and detection. Figure 8 illustrates the joint detection and semantic segmentation network architecture. We concatenate the per-point feature from the voxel embedding net (before per-voxel max pooling) and its corresponding voxel feature from a selected scale after multi-scale feature fusion to predict the per-point semantic segmentation logits. Without much tuning, we have obtained reasonable semantic segmentation results, as shown in Table 7 and Figure 9. We plan to further improve this model and extend it to more autonomous driving related tasks.

Fig. 8. Overview of the updated neural architecture for joint 3D detection and semantic segmentation. (The diagram shows the five encoder scales with 2-, 3-, 2-, 3-, and 2-layer SWFormer blocks, the detection head, and the added head that concatenates per-voxel query features from the fused scale with per-point features for semantic segmentation.) On top of Figure 1, it adds an extra segmentation head for the additional segmentation task.

Table 7. Joint detection and semantic segmentation results on the Waymo Open Dataset validation set and test set.

Class Name        | Validation IOU | Test IOU
Bicycle           | 36.76          | 38.15
Bicyclist         | 51.43          | 51.77
Building          | 75.18          | 65.75
Bus               | 65.45          | 39.50
Car               | 75.05          | 72.29
Construction Cone | 48.34          | 21.37
Curb              | 55.54          | 48.46
Lane Marker       | 43.97          | 30.73
Motorcycle        | 56.68          | 58.37
Motorcyclist      | 1.48           | 0.57
Other Ground      | 34.34          | 37.52
Other Vehicle     | 23.95          | 25.43
Pedestrian        | 60.87          | 61.08
Pole              | 55.50          | 51.65
Road              | 78.46          | 68.06
Sidewalk          | 59.67          | 59.77
Sign              | 53.70          | 43.60
Traffic Light     | 22.74          | 22.30
Tree Trunk        | 54.74          | 50.64
Truck             | 48.73          | 55.86
Vegetation        | 79.78          | 68.08
Walkable          | 65.87          | 59.08
mIOU              | 52.19          | 46.82

Fig. 9. Joint detection and semantic segmentation qualitative results. Green boxes: vehicle. Lavender boxes: pedestrian. Lavender points: building. Grey points: road. Orange points: sidewalk. Blue points: vehicle. Black points: pedestrian. Red points: pole/sign/tree trunk. Green points: vegetation.
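As one concrete reading of the per-point segmentation head described in Appendix C, the sketch below gathers each point's fused voxel feature (via the point-to-voxel index from voxelization), concatenates it with the point's own embedding from the voxel embedding net, and applies a small MLP to produce per-point class logits. The layer sizes, the class count of 22 (taken from Table 7), and the module names are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PerPointSegHead(nn.Module):
    """Per-point semantic segmentation head: concat(point embedding,
    fused voxel feature of the point's pillar) -> MLP -> class logits."""
    def __init__(self, point_dim=128, voxel_dim=128, num_classes=22):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(point_dim + voxel_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes))

    def forward(self, point_feats, fused_voxel_feats, point_to_voxel):
        # Gather each point's fused per-voxel feature, then classify the point.
        gathered = fused_voxel_feats[point_to_voxel]          # [P, voxel_dim]
        return self.mlp(torch.cat([point_feats, gathered], dim=-1))

if __name__ == "__main__":
    head = PerPointSegHead()
    point_feats = torch.randn(5000, 128)           # pre-max-pool point embeddings
    voxel_feats = torch.randn(900, 128)            # fused features from one scale
    point_to_voxel = torch.randint(0, 900, (5000,))
    logits = head(point_feats, voxel_feats, point_to_voxel)
    print(logits.shape)                            # [5000, 22]
```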