SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov Waymo LLC { peis, tanmingxing, weiyuewang, cxliu, feixia, lengzhaoqi, dragomir } @waymo.com Abstract. 3D object detection in point clouds is a core component for modern robotics and autonomous driving systems. A key challenge in 3D object detection comes from the inherent sparse nature of point oc- cupancy within the 3D scene. In this paper, we propose Sparse Window Transformer ( SWFormer ), a scalable and accurate model for 3D ob- ject detection, which can take full advantage of the sparsity of point clouds. Built upon the idea of window-based Transformers, SWFormer converts 3D points into sparse voxels and windows, and then processes these variable-length sparse windows efficiently using a bucketing scheme. In addition to self-attention within each spatial window, our SWFormer also captures cross-window correlation with multi-scale feature fusion and window shifting operations. To further address the unique challenge of detecting 3D objects accurately from sparse features, we propose a new voxel diffusion technique. Experimental results on the Waymo Open Dataset show our SWFormer achieves state-of-the-art 73.36 L2 mAPH on vehicle and pedestrian for 3D object detection on the official test set, outperforming all previous single-stage and two-stage models, while being much more efficient. 1 Introduction 3D point cloud representation learning is critical for autonomous driving, espe- cially for core tasks like 3D object detection. The challenges of learning from 3D point clouds mainly come from two aspects. The first aspect is that 3D points are sparsely distributed in the 3D space due to the nature of LiDAR sensors. This forces 3D models to be different from dense models in natural language pro- cessing (where words in a sentence are dense) or image understanding (where pixels in an image are dense). The second aspect is that both the number of points in a point cloud frame and the point cloud sensing region are increasing along with the improvement of the LiDAR sensor hardware. Some of the latest commercial LiDARs can sense up to 250m [15] and 300m [44] in all directions around the vehicle, leading to a large range of point clouds. To address these challenges, previous works have proposed many methods that can be roughly organized as five categories. PointNet [30,32,38] based arXiv:2210.07372v1 [cs.CV] 13 Oct 2022 2 P. Sun et al. method treats 3D point clouds as unordered sets and encodes them with MLPs and max pooling. Hierarchical structure is introduced to deal with the large in- put space and to better capture local information. These methods usually have inferior representation capacity compared with more recent methods. PointPil- lars -style methods [18] divide the space into grids of fixed sizes to convert the sparse 3D problem to a dense 2D problem. This method scales quadratically with the range, making it hard to scale with the advancement of LiDAR hard- wares. Sparse submanifold convolutions [14,36,40] based method can handle the sparse input efficiently. Usually these methods use small 3 × 3 convolution kernels which cannot connect features that are sparsely disconnected without adding normal sparse convolution and striding. This weakness limits its repre- sentation capacity. 
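As a minimal illustration of this limitation (a sketch we add here, not code from the paper), the following NumPy toy applies a stack of 1-D submanifold convolutions with an all-ones kernel to two occupied voxels separated by an empty gap wider than the kernel: because the active set never grows, no amount of depth lets the two voxels exchange information.

```python
# Toy 1-D submanifold convolution: outputs exist only at active sites,
# and only active neighbors contribute, so empty gaps are never bridged.
import numpy as np

def submanifold_conv1d(feat, active, kernel=3):
    r = kernel // 2
    out = np.zeros_like(feat)
    for i in np.flatnonzero(active):
        lo, hi = max(0, i - r), min(len(feat), i + r + 1)
        out[i] = (feat[lo:hi] * active[lo:hi]).sum()  # empty neighbors contribute nothing
    return out

active = np.zeros(9, dtype=bool)
active[[1, 7]] = True                 # two occupied voxels, six empty cells apart
feat = np.zeros(9)
feat[1], feat[7] = 1.0, 2.0

for _ in range(10):                   # stacking layers does not help
    feat = submanifold_conv1d(feat, active)
print(feat[1], feat[7])               # 1.0 2.0 -- the two voxels never see each other
```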
Another weakness of this method is their need for heavily optimized custom ops to be efficient on the modern GPUs and incompatibility with matmul optimized accelerators such as TPUs. Range image is a compact representation of point cloud. Multi-view methods [50,40,43,2] run dense convo- lutions in this view to extract features and fuse with BEV features learned in the PointPillars-style to improve 3D representation learning. It is hard to regress 3D objects directly from the range image due to its lack of 3D information encoding in the dense 2D perspective convolutions. To tackle this weakness, graph-style kernels [4,12] replace convolutions to make use of the range information in range images to capture 3D information which greatly improves the accuracy but is still inferior to the state of the art. Transformer [41] is designed to process sequences of data. The challenge in applying it to a point cloud is to solve the quadratic complexity on the number of inputs. Recent methods tackle this prob- lem by attending to neighboring points [29], neighboring voxels [23] or voxels in fixed windows [11]. A generic and efficient transformer-only model without limitations like limited receptive field, irregular memory access pattern, and lack of scalability is still to be designed. In this paper, we adapt window-based Transformers to 3D point clouds. The Transformer [41] architecture has been hugely successful in modeling language sequences and image patches. In particular, on 2D images, Swin Transformer [22] proposed to partition images into windows and merge context information in a hierarchical manner. Our Sparse Window Transformer (SWFormer) builds upon similar ideas, but with several key adaptations for sparse windows. Our first adaption is to add a bucketing-based window partition for sparse windows. Although each window has the same spatial size, such as a 10 × 10 voxel grid, the number of non-empty voxels in each window can vary significantly, so we group these windows into buckets with different effective sequence lengths. Our second adaptation is to limit the expensive window shifting. Swin Transformer [22] uses window shifting once per Transformer layer to connect features between windows and increases receptive fields, but this shifting operation is expensive in the sparse world as it needs to re-order all the sparse features with gather operations. Moreover, it is extremely slow on matmul optimized accelerators such as TPUs. To address this issue, SWFormer employs a new hierarchical backbone architecture, where each SWFormer block has many Transformer layers but only SWFormer 3 one shifting operation, as shown in Figure 3. It relies on multi-scale features to achieve large receptive fields for context information, and a multi-scale fusion network to effectively combine these features. The model uses additional custom downsample and upsample algorithms to properly handle the sparse features during feature fusion. Our innovation continues from the backbone into the 3D object detection head. Existing 3D object detection methods [51,18,50,43,13,36,4,40,24,46] can mostly be viewed as either anchor based methods with implicit or explicit an- chors or DETR [3] based methods [26]. The detection performance is closely related with the distribution of the difference between anchor and groundtruth. Methods with inaccurate anchors [4,24] have poor performance in detecting large objects such as vehicles though they can have reasonable performance on pedes- trians. 
One way to solve this problem is to have a two-stage model to refine the boxes [24,36] which greatly improves the detection accuracy. CenterNet- style detection methods [13,46,40] strive to define anchors in the center of the groundtruth boxes only which enforces distributions of closer to zero mean and smaller variance. However, when detecting objects directly from sparse features (e.g. features from PointNet, Submanifold convolutions, sparse Transformers), there are not necessarily features close to the object centers. To alleviate this issue, [40] applies normal sparse convolutions to insert points in the convolution output; [11] scatters the sparse features to a dense BEV grid and runs dense convolutions to expand features to missing positions. These methods are expen- sive. In this paper, we propose a voxel diffusion module to address this issue efficiently in a scalable way by segmenting and diffusing foreground voxels to their nearby regions as described in § 3.4. Extensive experiments are conducted on the challenging Waymo Open Dataset [39] to show state of the art results of SWFormer on 3D object detection. We summarize our contributions as follows: – We propose a hierarchical Sparse Window Transformer (SWFormer) back- bone for 3D representation learning. Its flexible receptive fields and multi- scale features make it suitable for different self-driving tasks like object de- tection and semantic segmentation. – We propose a generic voxel diffusion module to address the unique challenge of anchor placement in 3D object detection from sparse features. – We conduct extensive experiments on Waymo Open Dataset [39] to demon- strate the state of the art performance of our SWFormer model. 2 Related Work 2.1 3D object detection As one of the most important tasks in autonomous driving, 3D object detection has been extensively studied in prior works. Early works like PointNet [30] and PointNet++ [32] directly apply multilayer perceptions on individual points, but it is difficult to scale them to large point clouds with good accuracy. The current 4 P. Sun et al. mainstream 3D object detectors often convert point clouds into bird eye view 3D [51] or 2D voxels [18] (2D voxels are also referred as pillars), where each voxel aggregates the information from points it contains. In this way, regular 2D or 3D convolutional neural networks can be applied to process these bird-eye-view representations. The pseudo image of voxels also makes it easier to reuse the rich research advancements in 2D object detection, such as two-stage or anchor- based detection heads [46]. The downside is that the pseudo image of voxels grows cubically/quadratically with the voxelization granularity and detection range, not to mention that many of the voxels are effectively empty. Therefore, another type of approach is to perform 3D object detection without voxelization. This includes methods that detect objects from the perspective view [25,4,12], or lookup nearest neighbors for each point [28]. However, the detection accuracy is typically inferior to the voxelization route. To have the best of both worlds, recent approaches [45,40,36] start to explore multi-view approaches and make use of sparse convolutions on the voxelized point cloud. For example, the recent range sparse net (RSN [40]) adopts a two- step approach, where the first step performs class-specific segmentation on the range image view, and the second step applies sparse 3D convolutions on the voxel view for specific classes. 
However, submanifold sparse convolutions cannot connect features that are sparsely disconnected without adding normal sparse convolutions and striding, and they often require heavily optimized customized ops to be efficient on modern accelerators. Our work aims to learn the 3D representations from sparse point clouds with- out using any dense or sparse convolutions. Instead, we resort to a hierarchical Transformer to achieve our goal. 2.2 Transformers Transformers [41] have shown great success in natural language processing [7]. Recently, researchers have brought this architecture to computer vision [1,33,42,6]. ViT [9] partitions images into patches, which greatly advanced the use of Trans- formers for image classification. Swin Transformer [22] further demonstrated better ways to fuse contextual information through window shifting and hierar- chy, and also generalized to other tasks such as segmentation and detection. Interestingly, Transformers are naturally suitable for sparse point clouds, be- cause they can take any length of sequences as inputs and do not require dense 2D/3D image representations. Therefore, recent works have attempted to adopt Transformers for 3D representation learning, but they are primary developed for object scans and indoor applications [47,10,27,29]. Voxel Transformer [24] is the submanifold sparse convolution [14] counterpart in the Transformer world, by replacing the convolution kernel with attention. Its irregular memory access pat- tern is computationally inefficient, and its accuracy is worse than state of the art methods. Recently, SST [11] proposes a single-stride transformer for 3D object detection and achieved impressive results on Waymo Open Datasets especially for pedestrian object detection. However, due to its single stride nature, SST has a limited receptive field and thus has difficulty dealing with large objects, SWFormer 5 making it ineffective in important tasks like large vehicle detection, large ob- ject segmentation (e.g. buildings), lane detection, and trajactory prediction. It needs to scatter features to a dense BEV grid to run several dense convolutions which limits its scalability. It is also computationally expensive as it needs to run many layers of transformers on the high resolution feature map which limits its applications in realtime systems. Our work is inspired by window-based Transformers (e.g., SwinTransformer [22]) in the sense that we also adopt the hierarchical window-based Transformer back- bone, but to address the unique challenges of 3D sparse point clouds, we propose several novel techniques such as the improved SWFormer blocks, multi-scale fea- ture fusion, and voxel diffusion. 3 Sparse Window Transformer 3.1 Overall Architecture SWFormer is a pure Transformer-based model without any convolutions. Fig- ure 1 shows the overall network architecture: given a sequence of point cloud frames as inputs, each point is augmented with per-frame voxel features [18] and an auxiliary frame timestamp offset [40]. It uses dynamic voxelization [50] and a point net [18,30] based feature embedding net to get sparse voxel features. Note, our voxels are also referred as pillars in other works [18]. These sparse voxels are then processed by a hierarchical sparse window Transformer network described in § 3.2. The resulting multi-scale features are then fused with a Trans- former based feature fusion blocks. 
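To make the input pipeline concrete, below is a minimal NumPy sketch of dynamic voxelization followed by a PointNet-style pillar embedding. It is our own illustration, not the authors' implementation: the 0.32 m grid follows § 4.2, but the single toy linear+ReLU layer and all function and variable names are assumptions (the actual embedding net uses a two-layer MLP with 128 channels).

```python
# Sketch: dynamic voxelization + PointNet-style pillar embedding (illustrative only).
import numpy as np

def dynamic_voxelize(points_xy, voxel_size=0.32):
    """Assign every point to a BEV pillar without a fixed per-pillar point budget."""
    coords = np.floor(points_xy / voxel_size).astype(np.int64)      # (N, 2) pillar indices
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)  # one row per non-empty pillar
    return uniq, inverse                                            # inverse[i] = pillar id of point i

def pillar_embed(point_feats, inverse, num_pillars, w):
    """Per-point linear layer + ReLU, then max pool over the points of each pillar."""
    h = np.maximum(point_feats @ w, 0.0)                            # (N, C) per-point features
    pillar_feats = np.full((num_pillars, h.shape[1]), -np.inf)
    np.maximum.at(pillar_feats, inverse, h)                         # segment-wise max pool
    return pillar_feats

rng = np.random.default_rng(0)
pts = rng.uniform(0, 8, size=(1000, 2))                             # toy x, y coordinates in meters
feats = rng.normal(size=(1000, 8))                                  # toy per-point features
coords, inv = dynamic_voxelize(pts)
voxel_feats = pillar_embed(feats, inv, len(coords), rng.normal(size=(8, 16)))
print(coords.shape, voxel_feats.shape)                              # only non-empty pillars are kept
```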
To address the unique challenge of detecting 3D boxes from sparse features, we first segment the foreground voxels and then apply a voxel diffusion module to expand foreground voxels to neighboring lo- cations with pseudo voxels. In the end, we apply a center net [46,40,49] style detection head to regress 3D boxes. 3.2 Hierarchical Sparse Window Transformer Encoder A key concept of our SWFormer is the sparse window in the birds eye view. After points are converted to a grid of 2D voxels on bird eye view, the voxel grid is further partitioned into a list of non-overlapping windows with fixed size H × W (e.g., 10 × 10), similar to Swin Transformer[22]; however, since points are often sparse, many voxels are empty with no valid points. Therefore, the number of non-empty voxels in each window may vary from 0 to HW . As we will explain later, all non-empty voxels within the same window will be flattened to a single variable-length sequence and fed into Transformer layers. In practice, these variable-length sequences prevent us from batch training, causing lower training efficiency. To solve this issue, we borrow a widely used ideas from nat- ural language processing [41,8] and recent works [11], which group these sparse windows into different buckets based on their sequence lengths. Concretely, we divide sparse windows into at most k buckets { B 0 , B 1 , ..., B k } , where windows 6 P. Sun et al. fuse Segmentation Voxel Diffusion Center Head fuse fuse fuse fuse Scale 2: /2 Scale 1: /1 Scale 3: /4 Scale 4: /16 Scale 5: /32 Fig. 1. Overview of SWFormer model architecture. Given a sparse point cloud, we first perform voxelization to generate a grid of 2D voxels. These voxels are then pro- cessed with a 5-scale sequence of hierarchical SWFormer blocks (Figure 3), with strides { 1 , 2 , 4 , 16 , 32 } . The output features are combined with a multi-scale feature fusion net- work (section 3.3). The fused features are fed to a head, which performs foreground segmentation and voxel diffusion (section 3.4), and computes center net style classifi- cation and box regression loss (section 3.5). Different object classes (e.g. vehicles and pedestrians) may use a separate head on different feature scales. in B i are always padded to a maximum sequence length of HW/ 2 i . All padded tokens are masked in Transformer layers. Based on the aforementioned sparse windows, our encoder adopts hierarchi- cal Transformers to process the inputs and produce a list of multi-scale BEV features. As shown in Figure 1, each scale starts with a sparse window partition layer followed by a multi-layer SWFormer block. Sparse Window Partition: We divide the BEV voxels into non-overlapping windows with fixed size H × W , which are then grouped into buckets { B 0 , B 1 , ..., B k } . For each bucket B i , we flatten all voxels within the same window into a sequence and zero-pad the sequence length to HW/ 2 i . These sequences are then batched and fed to the Transformer blocks, where the self-attention shares the keys and values for all query voxels coming from the same window [22]. Since SWFormer processes inputs in a hierarchical fashion with multiple feature scales, we need to apply strided window partitions at the beginning of each scale. The strided window partition is similar to traditional strided convolutions, except that it always picks the closest voxel to the center of the window with deterministic rules to break ties. 
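The sparse window partition with length bucketing and the strided partition can be sketched as follows. This is an illustrative NumPy version with our own function names and data layout, not the authors' implementation; the bucket capacities HW/2^i and the nearest-to-center rule follow the description above, and setting the shift to half the window size gives the shifted partition used later in the SWFormer block.

```python
# Sketch: bucketed sparse window partition and strided partition (illustrative only).
import numpy as np

def window_partition(coords, window=10, shift=0):
    """Map BEV voxel coords to window ids; shift=window//2 gives the shifted partition."""
    wcoords = (coords + shift) // window
    win_ids, win_index = np.unique(wcoords, axis=0, return_inverse=True)
    return win_ids, win_index

def bucket_windows(win_index, num_windows, window=10, num_buckets=3):
    """Group windows by occupancy so each bucket pads to HW, HW/2, HW/4, ... voxels."""
    occupancy = np.bincount(win_index, minlength=num_windows)
    caps = [window * window // (2 ** i) for i in range(num_buckets)]  # e.g. 100, 50, 25
    buckets = {cap: [] for cap in caps}
    for w, n in enumerate(occupancy):
        if n == 0:
            continue
        cap = min(c for c in caps if c >= n)       # tightest bucket that fits this window
        buckets[cap].append(w)
    return buckets                                 # windows in a bucket are padded (and masked)
                                                   # to the same sequence length for batching

def strided_partition(coords, feats, stride):
    """Stride-s downsample: per striding window, keep the non-empty voxel nearest its center."""
    cell = coords // stride
    center = cell * stride + (stride - 1) / 2.0
    dist = np.square(coords - center).sum(axis=1)
    order = np.lexsort((dist, cell[:, 1], cell[:, 0]))   # sort by (cell, distance), stable ties
    cell_s = cell[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = np.any(cell_s[1:] != cell_s[:-1], axis=1)
    keep = order[first]                                  # nearest-to-center voxel of each cell
    return cell_s[first], feats[keep]
```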
Notably, no max or average pooling operations are applied, because they are not friendly to sparse implementations. Figure 2 illustrates an example of a stride-4 window partition.

Fig. 2. Strided Sparse Window Partition. Left shows a grid of 16x16 BEV voxels, where grey voxels are empty and the others are non-empty. Right shows the result of a stride-4 window partition, leading to a grid of 4x4 voxels. For each striding window, it picks the non-empty voxel feature (light green) nearest to the center (black dot), with any deterministic rule to break ties; if all voxels in the striding window are empty, then the corresponding voxel after striding is also empty. Best viewed in color.

Sparse Window Transformer block: The Transformer [41] is inherently suitable for sparse point clouds, as it does not require the dense 2D/3D inputs used in convolutional networks; unfortunately, due to the quadratic complexity of self-attention with respect to the input sequence length, it is prohibitively expensive to feed the whole point cloud (with millions of points) or the voxel features (with tens of thousands of valid voxels) as a single input sequence to a Transformer. In this paper, we adopt the idea of Swin Transformer [22]: the sparse BEV voxels are first partitioned into windows, and the Transformer is applied to each window separately. To increase the receptive field and connect features across windows, Swin Transformer uses a window shifting technique to re-partition the windows for every Transformer layer. However, as we operate on sparse voxel features, this shift-window operation is memory-read/write intensive, especially on matrix-optimized accelerators like TPUs. To alleviate this problem, we propose to limit the shift-window operation to once per stride rather than once per layer. Figure 3 shows the detailed architecture of a SWFormer block: it largely follows the style of Swin Transformer and performs self-attention within a local window, except that it performs the shift-window operation only once, in the middle of the block.

Fig. 3. Sparse Window Transformer Block. Given a sequence of sparse features, it first applies multi-head self-attention (MSA) on all valid voxels within the same window, followed by an MLP and layer norm. After repeating the Transformer layer N times, it performs a shifted sparse window partition to re-generate the sparse windows, and then processes the shifted windows with another M Transformer layers. If N and M are the same, we call it an N-layer SWFormer block for simplicity.

Formally, our SWFormer block can be described as follows:

z_0 = [x; \mathrm{mask}_z] + \mathrm{PE}_z,   (1)
\hat{z}_l = \mathrm{LN}(z_{l-1} + \mathrm{MSA}(z_{l-1})), \quad l = 1 \ldots N,
z_l = \mathrm{LN}(\hat{z}_l + \mathrm{MLP}(\hat{z}_l)), \quad l = 1 \ldots N,

u_0 = [\text{shift-window}(z_N); \mathrm{mask}_u] + \mathrm{PE}_u,   (2)
\hat{u}_l = \mathrm{LN}(u_{l-1} + \mathrm{MSA}(u_{l-1})), \quad l = 1 \ldots M,
u_l = \mathrm{LN}(\hat{u}_l + \mathrm{MLP}(\hat{u}_l)), \quad l = 1 \ldots M,

where x is the input feature after sparse window partition, mask_z is the mask for the input padding, and PE_z is the positional encoding. The process contains two stages: (1) the first stage applies N Transformer layers to z_0 and outputs z_N. Each Transformer layer consists of a standard multi-head self-attention (MSA) and a multilayer perceptron (MLP), but slightly different from the standard version, here we adopt the post-norm scheme where layer norm (LN) is added after MSA and MLP.
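A single post-norm Transformer layer over a bucket of padded windows can be sketched in PyTorch as below. This is our illustration, not the released code: the channel size 128, 8 heads, and MLP ratio 2 follow § 4.2, while the GELU activation and all names are assumptions.

```python
# Sketch: one post-norm windowed Transformer layer with padding masks (illustrative only).
import torch
import torch.nn as nn

class PostNormWindowLayer(nn.Module):
    def __init__(self, dim=128, heads=8, mlp_ratio=2):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, z, pad_mask):
        # z: (num_windows, max_voxels, dim); pad_mask: True where a slot is padding.
        attn, _ = self.msa(z, z, z, key_padding_mask=pad_mask)
        z = self.ln1(z + attn)            # post-norm: LN after the residual add
        z = self.ln2(z + self.mlp(z))
        return z

windows = torch.randn(4, 100, 128)        # 4 windows from one bucket, padded to 100 voxels
mask = torch.zeros(4, 100, dtype=torch.bool)
mask[:, 60:] = True                       # the last 40 slots of each window are padding
out = PostNormWindowLayer()(windows, mask)
print(out.shape)                          # torch.Size([4, 100, 128])
```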
For simplicity, we use the standard sine/cosine absolute positional encoding in this paper. (2) The second stage first applies window-shift to z N , and adds the updated mask u and positional encoding PE u based on z N ; afterwards, M Transformer layers are added to process u 0 and generate the final output u M . Notably, each SWFormer block has N + M Transformer layers but only one window-shift operation. By restricting window-shift operations, our SWFormer block is more efficient than the conventional Swin Transformer; however, it also limits the receptive field, since each Transformer layer is only applied to a small window. To address this challenge, SWFormer is designed as a hierarchical network with multiple scales, where the strides are gradually increased: for simplicity, this paper uses strides { 1 , 2 , 4 , 16 , 32 } for the five scales. For each scale, we always keep the window size fixed (e.g., 10 × 10); however, as the later scales have larger strides, the same window in later scales will cover much larger area. As an example, for the last scale with stride 32, a 10 × 10 window would cover 320 × 320 area on the original BEV voxel grid, and a single window-shift would connect all features within an area as large as 480 × 480. 3.3 Multi Scale Feature Fusion Inspired by feature pyramid network (FPN [20]), SWFormer adopts Transformer- based multi-scale feature network to effectively combine all features from the hierarchical Transformer encoder. Figure 4 shows the overall architecture of the feature network: given a list of encoder features { P 0 , P 1 , ..P 5 } , it iteratively fuses ( P i +1 , P i ) from large-stride P 5 to small-stride P 0 . Formally, our feature fusion process can be described as: ˆ P 5 = P 5 (3) ˆ P i = SWFormer(Concat( P i , Upsample( ˆ P i +1 ))) i = 0 , ..., 4 (4) SWFormer 9 P i P i+1 concat upsample P’ i Fig. 4. Feature Fusion. Fea- ture P i +1 is upsampled and concatenated with P i to gener- ate P ′ i and the final P i . During upsampling, we only duplicate P i +1 features to locations that are non-empty in P i . Starting from the last feature map P 5 , we first upsample it to have the same stride as P 4 such that they can be concatenated into a single fea- ture map; afterwards, we simple apply a 1-layer SWFormer block to process the concatenated fea- ture and generate the new ˆ P 4 . The process is it- erated until all fused features { ˆ P 0 , ..., ˆ P 5 } have been generated, which have the same strides as { P 0 , ..., P 5 } features. The fused features are fur- ther used in voxel diffusion and box regression as described in the following sections. One challenge in sparse upsamping is that one cannot naively duplicate the feature to all up- sampled locations (like commonly done in dense upsampling), which will cause unnecessary exces- sive feature duplication and significantly reduce the sparsity. In this paper, we restrict features in P i +1 to only duplicate to locations that have non- empty features in P i , as shown in Figure 4. In this way, we can ensure ˆ P i has the same sparsity as P i . 3.4 Voxel Diffusion Diffusion 0.9 0.5 Fig. 5. Voxel Diffusion. After foreground segmentation, each voxel receives a segmen- tation score s ∈ [0 , 1]. All voxels with scores greater than a threshold γ = 0 . 05 are scattered to a dense BEV grid, and then we apply a k × k max pooling on the dense BEV grid to expand valid voxel features to their neighboring locations where k is set to 5 in this example. 
(Left) Before diffusion, there are only two foreground voxels, with segmentation scores {0.5, 0.9} greater than γ; (right) after voxel diffusion, 47 voxels become valid. Best viewed in color.

To detect 3D objects from sparse voxel features, a unique challenge is that there may be no valid voxel feature near the object centers, which are the best positions to place implicit [46] or explicit anchors [34]. Prior works have attempted to resolve this issue by: 1) second-stage box refinement [36]; 2) sparse convolutions [40] or coordinate refinement [29] that can expand features to empty voxels close to the object centers; 3) scattering sparse voxel features to a dense grid and applying dense convolutions [11]. In this paper, we propose a novel voxel diffusion module to address this challenge effectively and efficiently.

Voxel diffusion is based on two simple ideas. First, we identify all foreground voxels by jointly performing foreground/background segmentation, thus effectively filtering out the majority of background voxels. Second, we expand all foreground voxels to their neighboring locations, inserted as zero-initialized pseudo voxels, with a simple k × k max pooling operation on the dense BEV grid, where k is a detection-head-specific diffusion factor that controls the magnitude of the expansion. The diffused voxel features are further connected and processed with a few Transformer layers. Combining these two ideas, we can simultaneously keep the voxel features sparse (by filtering out background voxels) and keep features filled in near the object centers (by voxel diffusion). Figure 5 illustrates an example of voxel diffusion.

Our foreground segmentation is jointly trained with object detection. Specifically, for each voxel, we assign a binary groundtruth label: 0 (background, the voxel does not overlap with any object) and 1 (foreground, the voxel overlaps with at least one object). The foreground segmentation is trained with a two-class focal loss [21] for each object class c:

L^c_{\mathrm{seg}} = \frac{1}{N} \sum_i L_i,   (5)

where N is the total number of valid voxels and L_i is the focal loss for voxel i. At inference time, we keep voxels as foreground if their foreground scores are greater than a threshold γ.

3.5 Box Regression

SWFormer follows [40] and uses a modified CenterNet [49,13,40,46] head to regress boxes from voxel features. The heatmap loss is computed as a penalty-reduced focal loss [49,21] per object class:

L^c_{\mathrm{hm}} = -\frac{1}{N} \sum_i \big\{ (1-\tilde{h}_i)^{\alpha} \log(\tilde{h}_i) \, \mathbb{I}_{h_i > 1-\epsilon} + (1-h_i)^{\beta} \tilde{h}_i^{\alpha} \log(1-\tilde{h}_i) \, \mathbb{I}_{h_i \le 1-\epsilon} \big\},   (6)

where \tilde{h}_i and h_i are the predicted and ground truth heatmap values for object class c at voxel i, and N is the number of boxes in class c. We use \epsilon = 1e-3, \alpha = 2, and \beta = 4 in all experiments, following [49,19,40].

SWFormer parameterizes 3D boxes as b = \{d_x, d_y, d_z, l, w, h, \theta\}, where d_x, d_y, d_z are the box center offsets relative to the voxel centers, and l, w, h, \theta are the box length, width, height, and heading. We follow [40] to apply a bin loss [38] to regress the heading \theta, a smooth L1 loss to regress the other box parameters, and an IoU loss [48] to improve the overall box accuracy, computed on the voxels with ground truth heatmap values above a threshold \delta_1.
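Before these losses are formalized below, the voxel diffusion step of § 3.4 can be made concrete with the following NumPy sketch; it is our own illustration with an assumed grid size and names, and the features of the inserted pseudo voxels would be zero-initialized as described above.

```python
# Sketch of voxel diffusion: threshold foreground scores, scatter to a dense BEV
# grid, and expand with a k x k max pool (illustrative only).
import numpy as np

def voxel_diffusion(coords, scores, grid_hw, k=5, gamma=0.05):
    """Return the BEV coordinates that are valid after diffusion.

    coords: (N, 2) integer BEV coordinates of non-empty voxels.
    scores: (N,) per-voxel foreground segmentation scores in [0, 1].
    """
    fg = scores > gamma                                  # drop background voxels
    dense = np.zeros(grid_hw, dtype=bool)
    dense[coords[fg, 0], coords[fg, 1]] = True
    r = k // 2
    padded = np.pad(dense, r)
    diffused = np.zeros_like(dense)
    for dy in range(k):                                  # k x k max pool, stride 1
        for dx in range(k):
            diffused |= padded[dy:dy + grid_hw[0], dx:dx + grid_hw[1]]
    return np.argwhere(diffused)                         # pseudo voxels get zero-initialized features

coords = np.array([[10, 10], [10, 30]])
scores = np.array([0.9, 0.5])
print(len(voxel_diffusion(coords, scores, grid_hw=(64, 64))))  # 2 foreground voxels -> 50 valid
```

The heading, box, and total training losses of the detection head are given next.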
L^c_{\theta_i} = L_{\mathrm{bin}}(\theta_i, \tilde{\theta}_i),   (7)
L^c_{b_i \setminus \theta_i} = \mathrm{SmoothL1}(b_i \setminus \theta_i - \tilde{b}_i \setminus \tilde{\theta}_i),   (8)
L^c_{\mathrm{box}} = \frac{1}{N} \sum_i (L_{\theta_i} + L_{b_i \setminus \theta_i} + L_{\mathrm{iou}_i}) \, \mathbb{I}_{h_i > \delta_1},   (9)

where \tilde{b}_i, b_i are the predicted and ground truth box parameters, and \tilde{\theta}_i, \theta_i are the predicted and ground truth box headings, respectively. The network is trained end to end with the total loss defined as

L = \sum_c (\lambda_1 L^c_{\mathrm{seg}} + \lambda_2 L^c_{\mathrm{hm}} + L^c_{\mathrm{box}}).   (10)

When decoding predicted boxes, we first filter out voxels whose heatmap values are less than a threshold \delta_2, and then run max pooling on the heatmap to select the boxes corresponding to local heatmap maxima, without any non-maximum suppression.

4 Experiments

We describe the SWFormer implementation details and demonstrate its efficiency and accuracy in multiple experiments. Ablation studies are conducted to understand the importance of various design choices.

4.1 Waymo Open Dataset

Our experiments are primarily based on the challenging Waymo Open Dataset (WOD) [39], which has been adopted by many recent state-of-the-art 3D detection methods [36,46,40,11,31]. The dataset contains 1150 scenes, split into 798 training, 202 validation, and 150 test scenes. Each scene has about 200 frames, where each frame captures the full 360 degrees around the ego-vehicle. The dataset has one long-range LiDAR with range capped at 75 meters, four near-range LiDARs, and five cameras. SWFormer uses all five LiDARs in the experiments.

4.2 Implementation Details

We normalize intensity and elongation in the raw point cloud with the tanh function. The dynamic voxelization uses a 0.32 m voxel size in x, y and an infinite size in z. During training, we ignore all ground truth boxes with fewer than five points inside. The voxel feature embedding net has two layers of MLPs with channel size 128. All of the Transformer layers have channel size 128, 8 heads, and an inner MLP ratio of 2. We also use stochastic depth [16] with survival probability 0.6. The segmentation cutoff γ in § 3.4 is set to 0.05. The heatmap thresholds \delta_1, \delta_2 are set to 0.2 and 0.1, respectively, for both the vehicle and pedestrian heads. For training efficiency, we cap the number of regression targets in each frame at 1024 for vehicles and 800 for pedestrians, sorted by ground truth heatmap values. \lambda_1, \lambda_2 are set to 200 and 10 in Eq. 10.

Data augmentation. We adopt several popular 3D data augmentation techniques described in [5] during training: randomly rotating the world by yaws uniformly chosen from [−π, π] with probability 0.74, randomly flipping the world along the y-axis with probability 0.5, randomly scaling the world with a scaling factor uniformly chosen within [0.95, 1.05), and randomly dropping points with probability 0.05.

Training and Inference. The SWFormer models are trained end-to-end on 32 TPUv3 cores using the Adam optimizer [17] for a total of 128 epochs with an initial learning rate of 1e-3. We apply cosine learning rate decay and an 8-epoch warmup with the initial warmup learning rate set to 5e-4.

4.3 Main Results

We measure the detection results using the official WOD detection metrics: BEV and 3D average precision (AP) and heading-error-weighted BEV and 3D average precision (APH), for L1 (easy) and L2 (hard) difficulty levels [39]. The official leaderboard ranking metric uses an IoU cutoff of 0.7 for vehicles and 0.5 for pedestrians. We report additional AP results at IoU 0.8 for vehicles and 0.6 for pedestrians.
Large vehicles that have max dimension greater than 7 meters are also reported. Table 1 reports the main results on validation set, Table 2 reports additional results for high IoU and large vechiels on the validation set, and Table 3 shows the test set results by submitting our predictions to the official test server. Results from methods with test time augmentation or emsemble are not included. As shown in Table 1, SWFormer achieves new state-of-the-art results for ve- hicle detection on the WOD validation set : it has 1.5 APH/L2 higher than the prior best single-stage model RSN [40]. SWFormer even outperforms the prior best performing two-stage method PVRCNN++[37] by 0.42 APH/L2. Impor- tantly, SWFormer performs very well at detecting large vehicles, 6.35 AP/L2 higher than the prior art of RSN [40] as shown in Table 2. SWFormer slightly outperforms the state of the art single stage method SST 3f [11] by 0.12 APH/L2. Notably, the single frame single stage SWFormer 1f also outperforms all prior single frame methods. We have compiled the model with XLA [35] and ran inference for the 15th frame in scene 8907419590259234067 1960 000 1980 000 that has 68 vehicles and 69 pedestrians on a Nvidia T4 GPU. The latency is 43ms, more efficient than the popular realtime detector PointPillars [18] which takes about 100ms on the same GPU with our own implementation. With fused transformer GPU kernels and optimized GPU sparse window partition operations, the latency can be further reduced to 20ms. Table 3 shows vehicle and pedestrian detection result comparison with pub- lished results on the WOD test set , which shows SWFormeroutperforms all previ- ous single-stage or two-stage methods on the official ranking method mAPH/L2. SWFormer 13 Table 1. WOD validation set results. † is from [40]. Top methods are highlighted. Top one-frame (cyan), single-stage (blue) are colored. TS: two-stage. BEV: BEV L1 AP. Method TS AP/APH Vehicle AP/APH Pedestrian 3D L1 3D L2 BEV 3D L1 3D L2 BEV PVRCNN++ [37] 3 79.3/78.8 70.6/70.2 - 81.8/76.3 73.2/68.0 - VoTr-TSD[24] 3 75.0/74.3 65.9/65.3 - - - - SST TS 3f [11] 3 78.7/78.2 70.0/69.6 - 83.8/80.1 75.9/72.4 - CenterPoint TS [46] 3 76.6/76.1 68.9/68.4 - 79.0/73.4 71.0/65.8 - PointPillars [18] † 7 63.3/62.7 55.2/54.7 82.5 68.9/56.6 60.0/49.1 76.0 MVF++ 1f [31] 7 74.6/- - 87.6 78.0/- - 83.3 RSN 1f [40] 7 75.1/74.6 66.0/65.5 88.5 77.8/72.7 68.3/63.7 83.4 RSN 3f [40] 7 78.4/78.1 69.5/69.1 91.3 79.4/76.2 69.9/67.0 85.0 SST 1f [11] 7 74.2/73.8 65.5/65.1 - 78.7/69.6 70.0/61.7 - SST 3f [11] 7 77.0/76.6 68.5/68.1 - 82.4/78.0 75.1/70.9 SWFormer 1f (Ours) 7 77.8/77.3 69.2/68.8 91.7 80.9/72.7 72.5/64.9 86.1 SWFormer 3f (Ours) 7 79.4/78.9 71.1/70.6 92.6 82.9/79.0 74.8/71.1 87.5 Table 2. Additional WOD validation set results. Top methods are highlighted. Method Vehicle L1 AP Pedestrian L1 AP 3D IoU=0.8 BEV Large 3D Large 3D IoU=0.6 MVF++ 1f [31] 43.3 - - 56.0 RSN 3f [40] 46.4 53.1 45.2 - SWFormer 3f (Ours) 47.5 60.1 51.5 62.1 Table 3. WOD test set results. † is from [40]. Top methods are highlighted. mAPH/L2 is the official ranking metric on the WOD leaderboard. TS is short for two-stage. 
Method TS mAPH Vehicle AP/APH 3D Pedestrian AP/APH 3D L2 L1 L2 L1 L2 CenterPoint [46] 3 69.1 80.20/79.70 72.20/71.80 78.30/72.10 72.20/66.40 SST TS 3f [11] 3 72.94 80.99/80.62 73.08/72.74 83.05/79.38 76.65/73.14 PVRCNN++ [37] 3 71.24 81.62/81.20 73.86/73.47 80.41/74.99 74.12/69.00 P.Pillars [18] † 7 55.10 68.60/68.10 60.50/60.10 68.00/55.50 61.40/50.10 RSN 3f [40] 7 69.70 80.70/80.30 71.90/71.60 78.90/75.60 70.70/67.80 SWFormer 3f (Ours) 7 73.36 82.89/82.49 75.02/74.65 82.13/78.13 75.87/72.07 4.4 Ablation Study Voxel diffusion is one of the primary contributions of this paper. We study its impacts by varying the diffusion window size k introduced in § 3.4. The result in Table 4 shows the significance of voxel diffusion. Disabling voxel diffusion (i.e. setting k = 1) results in 6.37 and 3.22 3D AP drop compared with k = 9 on vehicle and pedestrian detection respectively. Increasing k can slightly improve the detection accuracy especially on vehicle. Multi-scale feature improves the model accuracy as shown in Table 5 espe- cially going from one scale to two scales. The impact is larger on vehicle detection 14 P. Sun et al. (+2.72 3D AP) than pedestrian detection (+1.15 3D AP). The 3-scale model has pretty close accuracy as the full 5-scale model. In practice, we can trade-off be- tween accuracy and latency by adjusting the number of scales. Note that some autonomous driving tasks such as lane detection, behavior prediction require larger receptive field. The success of training a deep five-scale SWFormer model shows its potential in those tasks. Table 4. Impact of Voxel Diffusion. Compared to the baseline (window size = 1), our voxel diffusion improves accuracy, especially with large diffusion window size. Diffusion Window Size 1 3 5 9 Vehicle 3D AP/L1 72.13 78.07 78.58 78.50 Pedestrian 3D AP/L1 79.23 82.28 82.44 82.45 Vehicle BEV AP/L1 82.42 91.19 92.09 92.03 Pedestrian BEV AP/L1 83.65 87.01 87.15 87.47 Table 5. Impact of Multi-Scale and Window Shifting. Compared to single scale, multi- scale have much better accuracy. Window shifting is also important for performance. Number of Scales Window Shift 1 2 3 5 7 X Vehicle 3D AP/L1 74.96 77.68 78.88 79.36 76.74 79.36 Pedestrian 3D AP/L1 81.24 82.39 82.19 82.91 81.19 82.91 Vehicle BEV AP/L1 89.55 91.83 92.23 92.60 90.74 92.60 Pedestrian BEV AP/L1 86.48 87.30 87.13 87.54 86.46 87.54 Window shifting is introduced in SwinTransformer [22] to connect the features among windows. We have limited its usage to one per scale. What happens if we completely remove it? Table 5 shows clear accuracy drop especially on vehicles if the window-shift operations are removed from the SWFormer blocks. This meets our intuition that it is important to keep one window shift operation per scale to make sure every voxel gets the similar receptive field in all directions. 5 Conclusion This paper presents SWFormer , a scalable and accurate sparse window transformer- only model, to effectively learn 3D point cloud representations for object detec- tion. Built upon window-based Transformers, it addresses the unique challenges brought by the sparse 3D point clouds, and proposes a bucketing-based multi- scale Transformer neural network. SWFormer takes full advantage of the sparsity of point clouds, and can effectively processes sparse windows of point clouds using pure Transformer layers without any convolutions. It also proposes a novel voxel diffusion module to further detect 3D objects from sparse features. 
Experiments show state-of-the-art results on the challenging Waymo Open Dataset. SWFormer 15 References 1. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convo- lutional networks. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3286–3295 (2019) 2. Bewley, A., Sun, P., Mensink, T., Anguelov, D., Sminchisescu, C.: Range condi- tioned dilated convolutions for scale invariant 3d object detection. In: Conference on Robot Learning (2020) 3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End- to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020) 4. Chai, Y., Sun, P., Ngiam, J., Wang, W., Caine, B., Vasudevan, V., Zhang, X., Anguelov, D.: To the point: Efficient 3d object detection in the range image with graph convolution kernels. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. pp. 16000–16009 (2021) 5. Cheng, S., Leng, Z., Cubuk, E.D., Zoph, B., Bai, C., Ngiam, J., Song, Y., Caine, B., Vasudevan, V., Li, C., et al.: Improving 3d object detection through progressive population based augmentation. In: European Conference on Computer Vision. pp. 279–294. Springer (2020) 6. Dai, Z., Liu, H., Le, Q., Tan, M.: Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems 34 (2021) 7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 9. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 10. Engel, N., Belagiannis, V., Dietmayer, K.: Point transformer. IEEE Access 9 , 134826–134840 (2021) 11. Fan, L., Pang, Z., Zhang, T., Wang, Y.X., Zhao, H., Wang, F., Wang, N., Zhang, Z.: Embracing single stride 3d object detector with sparse transformer. arXiv preprint arXiv:2112.06375 (2021) 12. Fan, L., Xiong, X., Wang, F., Wang, N., Zhang, Z.: Rangedet: In defense of range view for lidar-based 3d object detection. In: Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision. pp. 2918–2927 (2021) 13. Ge, R., Ding, Z., Hu, Y., Wang, Y., Chen, S., Huang, L., Li, Y.: Afdet: Anchor free one stage 3d object detection. arXiv preprint arXiv:2006.12671 (2020) 14. Graham, B., van der Maaten, L.: Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307 (2017) 15. Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self- supervised monocular depth estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 16. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. ECCV pp. 646–661 (2016) 17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 16 P. Sun et al. 18. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: CVPR (2019) 19. Law, H., Deng, J.: Cornernet: Detecting objects as paired keypoints. 
In: Proceed- ings of the European Conference on Computer Vision (ECCV). pp. 734–750 (2018) 20. Lin, T.Y., Doll ́ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017) 21. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll ́ ar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017) 22. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. CVPR (2021) 23. Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., Xu, C.: Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 3164–3173 (2021) 24. Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., Xu, C.: Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 3164–3173 (2021) 25. Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: Laser- net: An efficient probabilistic 3d object detector for autonomous driving. In: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12677–12686 (2019) 26. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision. pp. 2906–2917 (2021) 27. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision. pp. 2906–2917 (2021) 28. Ngiam, J., Caine, B., Han, W., Yang, B., Chai, Y., Sun, P., Zhou, Y., Yi, X., Alsharif, O., Nguyen, P., et al.: Starnet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069 (2019) 29. Pan, X., Xia, Z., Song, S., Li, L.E., Huang, G.: 3d object detection with point- former. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7463–7472 (2021) 30. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR (2017) 31. Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard 3d object detection from point cloud sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6134–6144 (2021) 32. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learn- ing on point sets in a metric space. In: NeurIPS (2017) 33. Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.: Stand-alone self-attention in vision models. Advances in Neural Information Pro- cessing Systems 32 (2019) 34. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de- tection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), 1137–1149 (2016) 35. Sabne, A.: Xla : Compiling machine learning for peak performance (2020) 36. Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In: CVPR (2020) SWFormer 17 37. 
Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., Li, H.: Pv- rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. arXiv preprint arXiv:2102.00463 (2021) 38. Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: CVPR (2019) 39. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: CVPR (2020) 40. Sun, P., Wang, W., Chai, Y., Elsayed, G., Bewley, A., Zhang, X., Sminchisescu, C., Anguelov, D.: Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5725–5734 (2021) 41. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017) 42. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Pro- ceedings of the IEEE conference on computer vision and pattern recognition. pp. 7794–7803 (2018) 43. Wang, Y., Fathi, A., Kundu, A., Ross, D., Pantofaru, C., Funkhouser, T., Solomon, J.: Pillar-based object detection for autonomous driving. In: ECCV (2020) 44. Waymo: Waymo’s 5th generation driver. https://blog.waymo.com/2020/03/ introducing-5th-generation-waymo-driver.html 45. Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sensors (2018) 46. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and track- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11784–11793 (2021) 47. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 16259– 16268 (2021) 48. Zhou, D., Fang, J., Song, X., Guan, C., Yin, J., Dai, Y., Yang, R.: Iou loss for 2d/3d object detection (2019) 49. Zhou, X., Wang, D., Kr ̈ ahenb ̈ uhl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019) 50. Zhou, Y., Sun, P., Zhang, Y., Anguelov, D., Gao, J., Ouyang, T., Guo, J., Ngiam, J., Vasudevan, V.: End-to-end multi-view fusion for 3d object detection in lidar point clouds. In: CORL (2019) 51. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: CVPR (2018) 18 P. Sun et al. A Window Shift The window shift operation (Figure 6) is implemented by adding offsets of half of the window sizes to the voxelized coordinates and then running the same window partition algorithm. We have proposed to limit the window shift operation per scale for efficiency of processing sparse inputs. We wondered what happens if we add more window shifts to the model. Will it impact model accuracy? We added one more window shift to the scale 1 and scale 2 respectively. It has slowed down our training by 10%. Surprisingly, it slightly decreased the model accuracy as shown in Table 6. Our hypothesis is that more shifts make the model harder to train when we do not need to rely on window shifts to increase receptive field. Window Shift W1 W2 W3 W4 W1 W2 W3 W4 W5 Fig. 6. Sparse window shift. The dark blue cells are voxels with points. The light blue cells are voxels without points. Left shows a grid of 8 × 8 BEV voxels partitioned into 4 non-empty sparse windows with window size of 4 × 4. 
After window shift, it results in 5 non-empty sparse windows as shown on the right. Table 6. Impact of adding more window shifts. More Window Shifts Vehicle 3D AP/L1 Pedestrian 3D AP/L1 7 79.36 82.91 3 79.17 82.36 B Qualitative Results Figure 7 visualizes ground truth boxes, detected boxes, and attention scores for layers selected from different scales for the 15th frame in scene 8907419590259234067 1960 000 1980 000 selected from the Waymo Open Dataset validation set. The selected layers are the stride 1, 2 afters multi-scale feature fusion, and stride 1, 2, 4, 16 from the main backbone. We use all foreground SWFormer 19 points as the query points. The predicted boxes almost overlap perfectly with the ground truth boxes. The attention score pattern shown in these subplots indicates that different information is captured in different layers and scales. In- terestingly, we have found that most of the attention scores are either 0 or 1 for foreground query points. We hope that these findings can inspire more research in the future. C Future Work: More Tasks Waymo Open Dataset [39] has recently added semantic segmentation labels for about 14% of the frames per scene for all of the 1150 scenes. We have extended the SWFormer detection network to perform joint semantic segmentation and detection. Figure 8 illustrates the joint detection and semantic segmentation network architecture. We concatenate the per-point feature from the voxel em- bedding net before per-voxel max pooling and its corresponding voxel feature from a selected scale after multi-scale feature fusion to predict the per-point se- mantic segmentation logits. Without much tuning, we have obtained reasonable semantic segmentation results as shown in Table 7 and Figure 9. We plan to further improve this model and extend it to more autonomous driving related tasks. 20 P. Sun et al. Stride 1 Fused Stride 2 Fused Stride 1 Stride 2 Stride 4 Stride 16 Fig. 7. Attention scores and model prediction visualization. Blue box: ground truth vehicle. Yellow box: vehicle prediction. Green box: ground truth pedestrian. Purple box: pedestrian detection. Points are colored with the bwr colormap (0: blue, 1: red), where red points mean attention scores close to 1. As red points are distributed differently in each subfigure, it is clear that different layers are attending to different locations. SWFormer 21 Semantic Segmentation Head Head 1 Head 2 Detection Head Sparse Partition 2-layer SWFormer Block 3-layer SWFormer Block 2-layer SWFormer Bloc k 3-layer SWFormer Bloc k 2-layer SWFormer Bloc k fuse5 Segmentation Voxel Diffusion 1-layer SWFormer Center Head fuse4 fuse3 fuse2 fuse1 Scale 2: /2 Scale 1: /1 Scale 3: /4 Scale 4: /16 Scale 5: /32 Sparse Partition Sparse Partition Sparse Partition Sparse Partition fuse1 Semantic Segmentation query Per-voxel SWFormer Feature per-point feature concat Fig. 8. Overview of the updated neural architecture for joint 3D detection and semantic segmentation. On top of Figure 1, it adds an extra segmentation head for the additional segmentation task. 22 P. Sun et al. Table 7. Joint detection and semantic segmentation results on Waymo Open Dataset validation set and test set. 
Class Name          Validation IoU   Test IoU
Bicycle                36.76            38.15
Bicyclist              51.43            51.77
Building               75.18            65.75
Bus                    65.45            39.50
Car                    75.05            72.29
Construction Cone      48.34            21.37
Curb                   55.54            48.46
Lane Marker            43.97            30.73
Motorcycle             56.68            58.37
Motorcyclist            1.48             0.57
Other Ground           34.34            37.52
Other Vehicle          23.95            25.43
Pedestrian             60.87            61.08
Pole                   55.50            51.65
Road                   78.46            68.06
Sidewalk               59.67            59.77
Sign                   53.70            43.60
Traffic Light          22.74            22.30
Tree Trunk             54.74            50.64
Truck                  48.73            55.86
Vegetation             79.78            68.08
Walkable               65.87            59.08
mIoU                   52.19            46.82

Fig. 9. Joint detection and semantic segmentation qualitative results. Green boxes: vehicle. Lavender boxes: pedestrian. Lavender points: building. Grey points: road. Orange points: sidewalk. Blue points: vehicle. Black points: pedestrian. Red points: pole/sign/tree trunk. Green points: vegetation.