LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection

Tong He*, Pei Sun, Zhaoqi Leng, Chenxi Liu, Dragomir Anguelov, Mingxing Tan*

*Waymo LLC. {simpleig,tanmingxing}@waymo.com

Abstract — We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds. Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector. This feature fusion strategy enables the model to better capture the shapes and poses of challenging objects, compared with learning from raw points directly. Our method conducts late-to-early feature fusion in a recurrent manner, achieved by applying window-based attention blocks to temporally calibrated and aligned sparse pillar tokens. Leveraging bird's eye view foreground pillar segmentation, we reduce the number of sparse history features that our model needs to fuse into its current frame by 10×. We also propose a stochastic-length FrameDrop training technique, which generalizes the model to variable frame lengths at inference for improved performance without retraining. We evaluate our method on the widely adopted Waymo Open Dataset and demonstrate improvements in 3D object detection over the baseline model, especially for the challenging category of large objects.

I. INTRODUCTION

The goal of LiDAR temporal fusion is to aggregate learned history information to improve point-cloud-based tasks. The history information can take implicit forms (e.g., latent embeddings), explicit forms (e.g., point clouds, 3D box tracklets), or a mixture of both, depending on the models and tasks at hand. Temporal fusion is critical for multiple driving-related tasks, such as 3D object detection, tracking, segmentation, and behavior prediction. Here we mainly study LiDAR-based fusion methods for 3D object detection, which is a crucial task for recognizing and localizing surrounding objects in modern autonomous driving systems. Point clouds of a single frame can only serve as a partial observation of the scene, lacking complete coverage of environment context and agent dynamics. This information bottleneck is caused by several factors such as object self-occlusion, occlusion by other objects, sensor field-of-view limitations, and data noise. Moreover, for moving objects, models with only single-frame data will struggle to understand their short-term states (velocities, accelerations) and long-term intentions (future trajectories). Tackling these issues demands effective ways of LiDAR temporal fusion, which can enable the model to understand scene / object attributes and dynamics over a wide time horizon.

The main challenge of temporal fusion is how to represent and aggregate the long-sequence information of history frames. See Figure 1a for a high-level illustration and comparison. Generally speaking, previous solutions can be classified into two types. One of the most widely used methods is early-to-early fusion based on point cloud stacking.

(a) Overview structures of the three temporal fusion approaches, where B denotes the backbone and H denotes the detection head.
(b) Performance comparisons on the Waymo Open Dataset: 3D AP for large objects is 49.7 for early-to-early, 49.1 for late-to-late, and 54.4 for late-to-early (ours).
Fig. 1: Comparisons of temporal fusion approaches.
Our late-to-early fusion approach achieves better detection quality (e.g., 54.4 3D AP for the challenging large objects) than previous early-to-early and late-to-late methods.

Multi-frame LiDAR points are directly stacked together as model inputs, resulting in better performance than a single frame of LiDAR points. However, the performance quickly saturates when more frames are simply stacked together [1] without careful modeling of the inter-frame relationships. Moreover, each frame needs to be repeatedly processed when it is stacked into different adjacent frames, greatly increasing computation cost. Fitting long sequences also greatly increases memory cost, reduces model efficiency, or even results in out-of-memory (OOM) issues. Ideally, a model should leverage what it has already learned from the data rather than simply stacking its raw sensory inputs. To overcome this issue, another family of fusion methods turns to late-to-late fusion so as to utilize the learned history embeddings. A representative method is ConvLSTM [1], which recurrently fuses latent embeddings between consecutive frames at deep layers of the model. This approach reduces memory usage and computation cost, but its results are usually inferior to early-to-early fusion, as shown in Figure 1b. We suspect that this is because the backbone only has access to single-frame data before late fusion happens. The task of understanding temporally fused deep features falls upon the detection heads, which usually consist of low-capacity multi-layer perceptron (MLP) layers. Consequently, most state-of-the-art LiDAR 3D object detectors (e.g., PVRCNN++ [2], [3], CenterPoint [4], SST [5], SWFormer [6], etc.) still rely on early-to-early fusion with point cloud stacking.

In this paper, we propose a new fusion method named LEF: Late-to-Early temporal Fusion. We argue that this fusion scheme can leverage learned history knowledge while its backbone does not suffer from single-frame data deficiency issues. Long-history LiDAR fusion is a fundamental building block for autonomous driving, and our work opens a promising direction towards achieving that goal. There are three main contributions in our paper:

• We propose a recurrent architecture that fuses late-stage sparse pillar features into the early stages of the next frame. To align the underlying static objects, we propose an inverse calibration and alignment module to fuse the history and current sparse sets of pillar features. As for moving objects, we leverage window-based attention layers, which can associate relevant features within the windows and thus connect pillar tokens that belong to the same object.

• While point stacking struggles to cache and preprocess huge point clouds as history length grows, we leverage a bird's eye view (BEV) foreground pillar segmentation module to achieve long-sequence fusion at a low constant cost. The number of sparse voxels that our model needs to fuse at each recurrent step can be reduced by over 10× via the foreground segmentation process.

• We also propose a stochastic-length FrameDrop training recipe. It exposes the model to an augmented, large motion space of pillar trajectories across time. Thus our recurrent model can capture objects of different speeds, and generalize to variable frame lengths during inference for improved performance.
The proposed late-to-early temporal fusion scheme leads to improved 3D detection results on the widely used Waymo Open Dataset (WOD) [7] and demonstrates large gains on challenging large objects. We also conduct extensive ablation studies on various design choices made in our method, providing several interesting insights.

II. RELATED WORK

3D Object Detection. LiDAR-based 3D object detection plays an essential role in autonomous driving. Early research efforts such as PointRCNN [8] usually operate on raw 3D point clouds through PointNet(++) [9]–[11], but they struggle to generalize to large-scale data, such as long-sequence fused LiDAR [7] with millions of points. Heavily relying on MLP-based backbones, these detectors were soon outperformed by models with more advanced architectures such as submanifold sparse convolution [12] or Transformers [13]–[15]. By voxelizing free-form point sets into regular 2D¹ or 3D voxels, LiDAR-based detectors [16]–[18] can leverage numerous advancements from 2D image object detection, and start to demonstrate promising 3D detection results. In particular, CenterPoint [4] utilizes sparse convolution layers and CenterNet-based detection heads [19] to predict 3D boxes. Some recent works, such as SST [20] and SWFormer [6], exploit Swin-Transformer [21] and push the detection performance to a new state of the art. Meanwhile, several methods [2], [3], [22]–[30] look into alternative LiDAR representations and strive towards a balance between detection efficiency and efficacy.

LiDAR Temporal Fusion. Compared with the rapid progress achieved on 3D detection backbones, approaches for LiDAR temporal fusion are less well studied. Point clouds of a single frame in WOD [7] already cause a huge computation burden (i.e., ∼200k points), let alone long history sequences. As briefly discussed in the introduction, LiDAR temporal fusion solutions can generally be classified into three types: early-to-early, late-to-late and late-to-early fusion. Early-to-early fusion is also referred to as point cloud stacking. It is most widely adopted in recent LiDAR object detectors (e.g., CenterPoint [4], RSN [22], SWFormer [6], etc.) due to its simple setup. Multi-frame point sets are merged together, and timestamp offsets w.r.t. the current frame are appended to the sensory signals of each 3D point to serve as markers indicating different frame sources. However, point stacking struggles to work on long sequences due to the cost of fusing, saving and jointly preprocessing millions of points. It is also possible to use a Transformer to early-fuse point clouds from different frames [31]. While early-to-early fusion simply stacks raw sensory inputs without carefully modeling inter-frame relationships and ignores knowledge learned from prior frames, late-to-late fusion tries to tackle these issues with ConvLSTM [1], [32]. It recurrently fuses sparse latent embeddings between deep layers of the backbone with improved efficiency over point stacking, but its results are often not as competitive as early-to-early fusion, presumably because the backbone can only utilize single-frame data until fusion happens at deep layers. 3D-MAN [33] may also be viewed as a form of late-to-late fusion, because its temporal fusion is done through various kinds of cross-attention between box proposals and features in the memory bank, both of which come after the backbone of its network. FaF [34] studied both early fusion and late fusion.
To the best of our knowledge, late-to-early fusion has not been explored before in LiDAR detectors. A similar fusion framework is studied in [35], but it targets camera-based detection and faces very different challenges from ours: we need to process sparsely distributed 3D data at wide ranges, which requires dedicated designs for sparse feature alignment and fusion, as well as new training recipes. Finally, we note that our review so far concentrates on single-stage trainable models that internalize the temporal fusion scheme. It is also possible to follow up the box predictions with a second-stage offline refinement, using the terminology from a recent exemplar of this two-stage approach, MPPNet [36]. MPPNet runs a pre-trained CenterPoint [4] on 4-frame stacked LiDAR point clouds to generate anchor boxes, which are then tracked and aggregated across long sequences. Specifically, latent embeddings or raw points within the box regions of one frame are cropped and intertwined with those extracted from other frames in order to refine the box states. The key differentiating factor of the two-stage approach is that the two stages / models are trained separately [36], suggesting that improvement inherently built into the first stage, like ours, is complementary to the second-stage innovation.

¹2D-shape voxels are often referred to as pillars.

III. METHOD

A. Problem Statement

We use {P_i}, i = 1, ..., T to represent a consecutive sequence of LiDAR point clouds, with P_i: {X_{i,j} ∈ R^3}, j = 1, ..., N_i. Our goal is to detect 3D object boxes {B_{i,m}}, m = 1, ..., M_i for each frame t using {P_i | i ≤ t}. Ideally the model should be capable of fusing history information F(P_1, ..., P_t) up to the current timestamp t, where F(·) denotes the fusion function. LiDAR temporal fusion is known to be an open challenge due to the sparse and wide-range spatial distribution of point clouds, let alone diverse object dynamics. Currently, early-to-early fusion (i.e., point stacking), P_{t−l} ∪ ... ∪ P_t, is the most widely used approach because it is easy to implement. However, due to memory constraints the sequence length is usually small, e.g., l ∈ {2, 3}. Moreover, the point clouds {X_{i,j}} of one frame have to be repeatedly processed (l+1) times when we conduct model inference on adjacent frames, causing a huge waste of computation. As for detection performance, whether directly stacking the raw sensory inputs without reusing learned history knowledge can lead to optimal results also remains questionable.

B. Recurrent Late-to-Early Fusion

To address the aforementioned issues, we propose a recurrent late-to-early temporal fusion strategy. As shown in Figure 2, the fusion pipeline works like a "Markov chain", which can accumulate history information from long sequences and reduce redundant computation. Thus, the fusion function F(·) can be defined iteratively as:

f_i = ψ( h( f_{i−1} ⊕ τ(t_i − t_{i−1}), ν({X_{i,j}}) ) )    (1)

where f_{i−1} denotes the history deep-layer voxel embeddings, τ(·) is a sinusoidal function for encoding the timestamp offset, ν(·) represents the VoxelNet [18] used to obtain pillar features from point clouds, h(·) is the backbone for recurrent fusion and multi-scale sparse pillar feature extraction, and ψ(·) is the foreground segmentation module.
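The recurrence in (1) can be rolled out over a frame sequence as sketched below. This is a minimal NumPy illustration of the data flow only: voxelize_pillars, backbone, foreground_segmentation and time_encoding are random or identity placeholders standing in for ν(·), h(·), ψ(·) and τ(·), not the actual SWFormer-based implementation, and realizing ⊕ by adding the timestamp encoding to the history embeddings is just one simple choice among several.

```python
import numpy as np

D = 128  # pillar embedding width used in the paper


def time_encoding(dt, dim=D):
    """Sinusoidal encoding tau(.) of the timestamp offset dt."""
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    ang = dt * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])


def voxelize_pillars(points):
    """Placeholder for VoxelNet nu(.): raw points -> sparse pillar features.
    Returns an array of shape (num_pillars, 2 + D): BEV coords + embeddings."""
    n = max(len(points) // 100, 1)
    return np.random.randn(n, 2 + D)


def backbone(fused_pillars):
    """Placeholder for the sparse backbone h(.) (e.g., SWFormer)."""
    return fused_pillars


def foreground_segmentation(pillars, keep_ratio=0.1):
    """Placeholder for psi(.): keep ~10% of pillars as foreground history f_i."""
    k = max(int(len(pillars) * keep_ratio), 1)
    scores = np.random.rand(len(pillars))  # would be predicted foreground logits
    return pillars[np.argsort(-scores)[:k]]


def recurrent_fusion(point_clouds, timestamps):
    """Roll out Eq. (1): f_i = psi(h(f_{i-1} (+) tau(t_i - t_{i-1}), nu(X_i)))."""
    f_prev, t_prev, deep = None, None, None
    for points, t in zip(point_clouds, timestamps):
        current = voxelize_pillars(points)                 # early-stage pillars
        if f_prev is None:
            fused = current
        else:
            history = f_prev.copy()
            history[:, 2:] += time_encoding(t - t_prev)    # one realization of (+)
            fused = np.concatenate([history, current], axis=0)
        deep = backbone(fused)                             # late-stage features
        f_prev, t_prev = foreground_segmentation(deep), t  # carry only foreground
    return deep  # late-stage features of the last frame go to the detection head
```

In the full model, only the boxes of the last frame are supervised during training and a stop gradient is applied between recurrent passes (Section III-E); neither detail is shown in this sketch.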
History features. In particular, we use the latent features of the segmented foreground pillars as f_{i−1} and pass them into the next timestamp. Without loss of generality, we use SWFormer [6] as our backbone and center-based detection heads [4] as examples in the following discussion where needed; the pipeline is plotted in Figure 2. The model works on sparse pillar tokens, so the segmentation outputs can be written as f_{i−1}: {V_{i−1,k} ∈ R^{2+d}}, k = 1, ..., K_{i−1}. The first two dimensions record the BEV coordinates of the pillars, and the rest are the extracted embeddings (i.e., d = 128), which contain rich scene and object-aware information. Moreover, compared with the raw point cloud size N_{i−1} (∼200k), the foreground pillar feature set size K_{i−1} (∼2k) is much smaller. Therefore, we are motivated to fuse these deep-layer features into the early stages of the next frame in order to efficiently reuse learned high-level knowledge for 3D detection, especially on challenging large objects.

Fusion location. To achieve recurrent late-to-early fusion, we fuse f_{i−1} with the VoxelNet [18] outputs ν({X_{i,j}}) ↦ {V'_{i,n} ∈ R^{2+d}}, n = 1, ..., N'_i before they are fed into the main backbone network. Instead of early fusion before the backbone, one may argue that an alternative is to conduct late fusion after the backbone, close to the network stage where f_{i−1} is extracted. Diagrams of these two fusion locations are plotted in Figure 1. We suspect that late fusion causes the backbone B to lose access to temporally aggregated LiDAR sequence information, so the low-capacity detection heads H will struggle to understand the fused features and to predict object poses and shapes. Ablation studies on early-to-early, late-to-late and our proposed late-to-early fusion are provided in Table IV and Section IV-C, which empirically validate the advantages of our approach.

C. Inverse Calibration and Alignment

While image sequences are naturally aligned across different frames by their shapes (height, width, channel), the sparse sets of pillar features {V_{i−1,k}}, {V'_{i,n}} are neither aligned nor of the same cardinality (i.e., K_{i−1} ≠ N'_i). Intuitively, one could convert the sparse features into dense BEV maps, {V_{i−1,k}} ↦ I_{i−1} ∈ R^{H×W×d} and {V'_{i,n}} ↦ I'_i ∈ R^{H×W×d}, and then align them. However, as Figure 2 shows, doing so directly without proper calibration can result in misalignment between the underlying objects of the scene, because the pillar features extracted by the backbones live in their corresponding local vehicle coordinates with poses g_{i−1} ∈ R^{4×4}, g_i ∈ R^{4×4}. To alleviate this misalignment issue, we need to calibrate the history BEV maps I_{i−1}:

I_{i−1} ∘ g_{i−1}^{−1} ∘ g_i ↦ Ĩ_{i−1}    (2)

where ∘ means applying a vehicle-coordinate transformation and Ĩ_{i−1} represents the calibrated BEV maps. In practice, however, if we apply forward calibration to I_{i−1} we might get more than one pillar falling into the same discrete coordinates within Ĩ_{i−1}. To address this issue, we conduct the inverse transformation from Ĩ_{i−1} to I_{i−1} and sample the history BEV features. We use zero padding to fill in the pillar features of empty samples and also for out-of-view locations, e.g., the red cross markers in Figure 2. The inversely calibrated history maps can now be aligned with the current maps by feature concatenation: Ĩ_{i−1} ⊕ I'_i ↦ J_i ∈ R^{H×W×2d}. Next, we apply an MLP on J_i for dimension reduction (i.e., 2d ↦ d) and obtain the temporally aligned pillar features J'_i. Note that not all coordinates within J'_i have valid features. We use the union BEV boolean mask O_i ∈ R^{H×W}, obtained from the current and calibrated history BEV features, to mark the valid coordinates of J'_i. Thus, we do not lose the data sparsity.
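The sketch below illustrates the inverse calibration of (2) and the subsequent alignment under two assumptions the paper does not spell out: that g_{i−1}, g_i are vehicle-to-world poses, and that nearest-neighbor sampling is used. Treat it as one plausible realization rather than the exact implementation; the 0.32 m pillar size comes from the implementation details in Section III-F.

```python
import numpy as np


def inverse_calibrate(I_prev, g_prev, g_cur, cell=0.32):
    """Resample the history BEV map I_prev (H, W, d) into the current vehicle
    frame by inverse warping: every target cell looks up where it was located
    in the previous frame, so no two source pillars collide in one target cell.
    g_prev, g_cur are assumed to be 4x4 vehicle-to-world poses; out-of-view and
    empty cells are zero-padded."""
    H, W, d = I_prev.shape
    extent_x, extent_y = H * cell, W * cell   # e.g., ~164 m x 164 m at 0.32 m cells
    tilde = np.zeros_like(I_prev)

    # Metric (x, y) centers of every target BEV cell, centered on the sensor.
    us, vs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = (us + 0.5) * cell - extent_x / 2.0
    y = (vs + 0.5) * cell - extent_y / 2.0
    pts_cur = np.stack([x, y, np.zeros_like(x), np.ones_like(x)], axis=-1)

    # Current vehicle frame -> world -> previous vehicle frame (Eq. 2).
    pts_prev = pts_cur.reshape(-1, 4) @ g_cur.T @ np.linalg.inv(g_prev).T
    u_prev = np.floor((pts_prev[:, 0] + extent_x / 2.0) / cell).astype(int).reshape(H, W)
    v_prev = np.floor((pts_prev[:, 1] + extent_y / 2.0) / cell).astype(int).reshape(H, W)

    valid = (u_prev >= 0) & (u_prev < H) & (v_prev >= 0) & (v_prev < W)
    tilde[valid] = I_prev[u_prev[valid], v_prev[valid]]
    return tilde, valid


def align(I_prev_calibrated, I_cur, mlp_weight):
    """Concatenate calibrated history and current maps into (H, W, 2d) and reduce
    back to d features; a single linear layer stands in for the MLP in the paper."""
    J = np.concatenate([I_prev_calibrated, I_cur], axis=-1)
    return J @ mlp_weight  # (H, W, 2d) @ (2d, d) -> (H, W, d)
```

The returned validity mask plays the role of (part of) the union boolean mask O_i, which keeps the subsequent processing sparse.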
Fig. 2: Detection pipeline with our proposed LEF. In each forward pass, the early-stage pillar encoding is aligned and fused with the history late-stage foreground pillar features f_{i−1}. The alignment is achieved by an inverse calibration and alignment process (Section III-C) that enables pillar features of the underlying static objects to be matched. To effectively associate moving-object features, we further use window-based attention blocks (Section III-D) to connect relevant pillars. Outputs from the attention fusion layers are then fed into the main backbone network (e.g., SWFormer [6]), followed by a foreground pillar segmentation layer and the final detection head [4] for 3D bounding box predictions.

D. Window-based Attention Fusion

Pillars of static objects are effectively aligned after the prior steps, but moving ones still face the misalignment issue. One solution is to apply flow estimation to further calibrate the history BEV features Ĩ_{i−1} before temporal alignment with I'_i, but that requires adding occupancy flow models, losses and feature coordinate transformations, which might greatly increase the computation overhead of the 3D object detector. Therefore, we propose to learn such associations implicitly from the data with window-based attention blocks. We sparsify the dense BEV feature map J'_i and its boolean mask O_i into a sparse set of pillar tokens {V''_{i,u}}, u = 1, ..., U_i. Usually U_i ≥ N'_i, because the cardinality U_i is the number of fused pillars after temporal alignment between the history and current features through the steps in Section III-C. While {V''_{i,u}} is used as the query tensor for the attention blocks, we can make different choices when determining the key and value tensors: using {V''_{i,u}} again, or the sparsified set of history pillar tokens from (2): Ĩ_{i−1} ↦ {Ṽ_{i−1,c}}, c = 1, ..., K̃_{i−1}. Most often, K̃_{i−1} ≤ K_{i−1} due to out-of-view truncation after vehicle-coordinate calibration. The resulting variants are self-, cross- and mix-attention. In self-attention, the key and value tensors are the same as the query. Cross-attention uses {Ṽ_{i−1,c}} as key and value, and mix-attention uses the union set of the prior two variants. We apply sinusoidal absolute positional encoding to inform the attention blocks of the sparse pillar coordinates within a window. Detailed ablation studies on the different attention designs are provided in Section IV-C. With window-based attention fusion, features of both static and moving pillars can now be associated and fused before being passed into the main backbone network.
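A toy version of the self-attention variant is sketched below: the aligned map is sparsified into pillar tokens, absolute positional encodings of the BEV coordinates are added, and scaled dot-product attention is run independently inside each 10 × 10 window. Learned query / key / value projections, multiple heads and the cross- / mix-attention variants are omitted, so this is only a structural illustration of the fusion step, not the trained block.

```python
import numpy as np


def sparsify(J, mask):
    """Turn the dense aligned map J (H, W, d) and its validity mask (H, W) into
    a sparse token set: (N, 2) BEV coordinates and (N, d) features."""
    coords = np.argwhere(mask)
    return coords, J[mask]


def positional_encoding(coords, dim):
    """Sinusoidal absolute encoding of (u, v) pillar coordinates; assumes that
    dim is divisible by 4 (it is 128 in the paper)."""
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 4) / (dim // 4)))
    u, v = coords[:, :1], coords[:, 1:]
    return np.concatenate(
        [np.sin(u * freqs), np.cos(u * freqs),
         np.sin(v * freqs), np.cos(v * freqs)], axis=-1)


def self_attention(x):
    """Single-head scaled dot-product self-attention over one window's tokens
    (identity Q/K/V projections, for illustration only)."""
    d = x.shape[-1]
    logits = x @ x.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def window_attention_fusion(J, mask, window=10):
    """Group the fused pillar tokens into 10x10 BEV windows and run
    self-attention independently within each window."""
    coords, feats = sparsify(J, mask)
    feats = feats + positional_encoding(coords, feats.shape[-1])
    win_id = coords // window                  # window index of every token
    out = np.zeros_like(feats)
    for wid in np.unique(win_id, axis=0):
        sel = np.all(win_id == wid, axis=1)
        out[sel] = self_attention(feats[sel])
    return coords, out
```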
E. Stochastic-Length FrameDrop

To enable robust training on long sequences, we randomly drop history frames from (P_1, ..., P_t) during each training iteration. In other words, we randomly sample S_i history frames, with S_i being a stochastic number at different training steps, and the sampled frames are not necessarily adjacent. In comparison, previous LiDAR temporal fusion methods usually fix S_i to a constant (e.g., 3 or 4) and sample consecutive frames. We apply a stop gradient between recurrent passes when fusing deep-layer history features into early layers of the next frame, without which long-sequence training of 3D object detectors can easily become intractable or run into OOM. During training, the model only predicts 3D boxes {B̂_{i,m}} in the last forward pass. Losses are enforced upon certain intermediate outputs (e.g., foreground pillar segmentation) and the final box parameter predictions (e.g., shapes and poses):

L = λ_1 L_seg + λ_2 L_center + L_box    (3)

where L is the total loss. L_seg is a focal loss for foreground segmentation. L_center is also based on focal loss but targets object-center heatmap estimation [4], [38]. L_box contains SmoothL1 losses for the regression of box azimuth, center offsets and sizes. A detailed explanation is in [6].

The training randomness introduced by LiDAR sequence sampling enables the model to be robust to various motion patterns of pillar trajectories across time. Thus our recurrent model can understand different object dynamics, and generalize to variable frame lengths during inference without retraining. More experiments and analysis are provided in Table VI and the ablation studies.

TABLE I: Overall performance comparisons on the Waymo Open Dataset. "Refine" means that the detector needs an additional step of box refinement via feature pooling and fusion from the box areas, which usually increases time cost and might not be end-to-end trainable. For fair comparisons we focus on single-stage detectors without (w/o) box refinement. All entries are 3D AP / APH.

Method              Refine   Test L1         Test L2         Validation L1   Validation L2
3D-MAN [33]         with     78.71 / 78.28   70.37 / 69.98   74.53 / 74.03   67.61 / 67.14
CenterPoint [4]     with     80.20 / 79.70   72.20 / 71.80   76.60 / 76.10   68.90 / 68.40
SST [5]             with     80.99 / 80.62   73.08 / 72.72   77.00 / 76.60   68.50 / 68.10
PVRCNN++ [2]        with     81.62 / 81.20   73.86 / 73.47   79.30 / 78.80   70.60 / 70.20
MPPNet [36]         with     84.27 / 83.88   77.29 / 76.91   82.74 / 82.28   75.41 / 74.96
CenterFormer [37]   with     84.70 / 84.40   78.10 / 77.70   78.80 / 78.30   74.30 / 73.80
PointPillars [16]   w/o      68.60 / 68.10   60.50 / 60.10   63.30 / 62.70   55.20 / 54.70
RSN [22]            w/o      80.70 / 80.30   71.90 / 71.60   78.40 / 78.10   69.50 / 69.10
SWFormer [6]        w/o      82.25 / 81.87   74.23 / 73.87   79.03 / 78.55   70.55 / 70.11
LEF (ours)          w/o      83.39 / 83.02   75.51 / 75.16   79.64 / 79.18   71.37 / 70.94

TABLE II: Detection results on challenging large objects (AP).

Method         L1 2D   L1 3D   L2 2D   L2 3D
RSN [22]       53.10   45.20   -       40.90
SWFormer [6]   58.33   49.74   53.45   45.23
LEF (ours)     62.63   54.35   57.42   49.34

F. Implementation Details

We conduct 3D object detection within a wide-range 164 × 164 meter (m) square zone centered on the top LiDAR sensor. Point clouds inside this region are voxelized into 2D pillars with a 0.32 m spatial resolution. The window attention blocks are based on 10 × 10 grouping sizes. The loss weights λ_1, λ_2 defined in (3) are 200 and 10, respectively. We use the AdamW [39], [40] optimizer with a batch size of 128 and 240k iterations for distributed training on 128 TPUv3. The training takes about 2 days. TPU memory usage is 5.4 GB on average and 7.4 GB at peak. The first 10k steps warm up the learning rate from 5.0e-4 to 1.0e-3, after which the learning rate follows a cosine annealing schedule to zero.
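For reference, the learning-rate schedule described above can be written as a small function; only the warmup endpoints are stated in the paper, so the linear form of the warmup is our assumption.

```python
import math


def learning_rate(step, warmup_steps=10_000, total_steps=240_000,
                  lr_init=5.0e-4, lr_peak=1.0e-3):
    """Warm up from lr_init to lr_peak over the first 10k steps, then cosine-
    anneal to zero by step 240k, matching the schedule described above."""
    if step < warmup_steps:
        return lr_init + (lr_peak - lr_init) * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 0.5 * lr_peak * (1.0 + math.cos(math.pi * progress))
```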
IV. EXPERIMENTS

In this section, we compare our model with other state-of-the-art methods and perform ablation studies on the impact of our designs on detection performance.

A. Dataset and Backbone

We choose the Waymo Open Dataset [7] over nuScenes [41] and KITTI [42] because WOD has large-scale and high-quality LiDAR data, which better simulates the settings for developing on-road fully autonomous vehicles. There are about 160k annotated training frames in WOD but only around 30k frames in nuScenes. As for per-frame point cloud densities, WOD has ∼200k points and nuScenes has ∼30k. Therefore WOD is widely used in recent LiDAR-based methods: PV-RCNN(++), SST, RSN, SWFormer and so on [2]–[4], [6], [20], [22], [24], [26], [33], [36]. WOD has 798 training sequences, 202 validation sequences and 150 test sequences, covering diverse driving scenarios and agent statuses. The LiDAR data collection frequency is 10 Hz. Each frame of point clouds consists of data gathered from five sensors: one long-range and four short-range LiDARs. For evaluation metrics, we adopt the officially recommended 3D AP / APH under two difficulty levels (L1, L2), which depend on the point densities of the ground-truth bounding boxes. APH is a weighted version of AP that takes heading angles (i.e., azimuth) into account.

We adopt the state-of-the-art SWFormer [6] as our detection backbone, and replace its original early-to-early LiDAR fusion with our proposed LEF. For fair comparisons, all training settings are kept the same as in [6].

B. Main Results and Comparisons

The overall vehicle detection results, alongside other competing methods, are in Table I. We compare against methods both with and without box refinement steps, although our model is a single-stage method without refinement and is generally more efficient than those with box refinement. Our method LEF surpasses the prior best single-stage model SWFormer by +1.3 3D APH on L2 test data (e.g., 75.16 vs. 73.87), demonstrating the strong overall performance of our approach.

Fig. 3: Qualitative comparisons. Box colors: green for ground truth, yellow for SWFormer, blue for LEF (ours). Errors of the baseline SWFormer (low 3D IoU, false positives, false negatives) are highlighted in dashed red regions.

Our method is particularly useful for detecting challenging large objects whose maximum dimension is beyond 7 meters: trucks, buses, construction vehicles, etc. We conduct a detailed analysis on the validation set in Table II. Our method LEF outperforms SWFormer by a +9.3% relative increase on L1 3D AP: 54.35 vs. 49.74. Hard cases such as large vehicles suffer from partial observation issues more often than small or medium size objects. Faithfully detecting these challenging cases requires LiDAR temporal fusion over long frame lengths in order to enlarge the sensory data coverage. Moreover, our late-to-early fusion scheme can reuse learned scene and object-aware latent features from prior frames, rather than simply stacking the point clouds as in RSN and SWFormer. Such high-level history knowledge enables the model to tackle challenging detection cases more easily, compared with solving them from scratch using stacked raw sensory inputs.

Qualitative results are visualized in Figure 3. Typical errors of SWFormer are highlighted in the red zones. Our results are aligned better
(i.e., have higher 3D IoU) with the ground truth boxes than SWFormer predictions, especially for challenging large objects. Moreover, our results contain fewer false negative and false positive predictions than SWFormer results.

We also measure the latency, FLOPs and parameter counts of different LiDAR 3D object detectors in Table III, following the same benchmark settings as [6]. PointPillars and SWFormer both use point stacking. The results demonstrate the efficiency advantages of our late-to-early recurrent fusion method.

TABLE III: Computation cost. For fair comparisons, we use 3-frame temporal fusion settings on WOD for measurement.

Method              Latency   FLOPs   Parameters
PointPillars [16]   93 ms     375G    6.4M
SWFormer [6]        47 ms     35G     4.4M
LEF (ours)          38 ms     29G     4.6M

TABLE IV: Ablation studies on different types of temporal fusion schemes. All methods are trained with SLF.

Fusion Strategy   L1 2D   L1 3D   L2 2D   L2 3D
Early-to-Early    58.33   49.74   53.45   45.23
Late-to-Late      58.74   48.83   53.67   44.32
Late-to-Early     61.46   53.13   56.37   48.28

TABLE V: Ablation studies on different object sizes (3D AP). The gains achieved by LEF increase as object size grows.

Method         L1 Large   L1 Medium   L1 Small   L2 Large   L2 Medium   L2 Small
RSN [22]       45.20      77.30       79.40      40.90      68.60       69.90
SWFormer [6]   49.74      79.11       82.36      45.23      70.59       74.04
LEF (ours)     54.35      79.62       82.46      49.34      71.32       74.15

C. Ablation Studies

Fusion strategy. We conduct apples-to-apples comparisons to study the effect of the early-to-early (E2E), late-to-late (L2L) and late-to-early (L2E) fusion strategies illustrated in Figure 1a. Specifically, we test all fusion variants with the same backbone and frame number (i.e., 3) to factor out the influence of model architectures and LiDAR sequence lengths. Results on validation-set large objects are in Table IV. Our L2E fusion surpasses the other two methods with 7.8% relative gains on L1 3D AP. Comparing E2E and L2L fusion, we observe that their 2D AP results are comparable, but E2E clearly outperforms L2L on 3D AP, indicating higher 3D object detection quality. These results validate our arguments about the benefits of late-to-early fusion. Compared with E2E fusion, L2E enables the model to reuse learned scene and object-aware knowledge from prior frames. Compared with L2L, the model capacity of L2E fusion is not constrained because its backbone has early access to the temporally aggregated sensory data.

Different object sizes. Besides the overall results and the hard-example analysis in Section IV-B, we are also interested in the impact of our method on different object sizes. Thus we divide the validation-set objects into large, medium and small. Typical large objects are buses and trucks; medium and small objects usually include sedans and pedestrians, respectively. Detailed results are in Table V. Although our method LEF achieves comparable results with the competing methods on small objects, we observe increasingly larger gains as object size grows. On L2 medium objects, LEF improves SWFormer by 0.73 AP, and the gain further grows to 4.11 AP on large objects. One possible explanation is that small objects suffer less from partial-view observation issues than large objects, and thus do not benefit as significantly from temporal fusion. From these results we believe that our method works robustly across different object sizes.

TABLE VI: Long frame history generalization studies (large objects, 3D AP). For each trained model, we evaluate its inference generalization ability to different frame (f) lengths without retraining.
Method           L1 3-f   L1 6-f   L1 9-f   L2 3-f   L2 6-f   L2 9-f
SWFormer [6]     46.23    38.76    OOM      41.93    35.09    OOM
LEF (w/o SLF)    51.18    51.44    50.84    46.58    46.91    46.28
LEF (with SLF)   53.13    53.96    54.35    48.28    48.99    49.34

TABLE VII: Inverse calibration and alignment (ICA) can improve detection AP across different object sizes.

ICA    Large 2D   Large 3D   Medium 2D   Medium 3D   Small 2D   Small 3D
w/o    60.85      51.34      92.72       78.30       85.92      80.59
with   62.63      54.35      93.02       79.62       87.40      82.46

Frame length generalization. Due to the memory constraints of the computing devices (GPU or TPU), 3D object detectors with LiDAR temporal fusion usually sample a fixed number of history frames (e.g., 2 or 3) during training. During inference, however, there are usually additional frames available to the model depending on the history lengths. For typical early-to-early fusion based multi-frame detectors (e.g., CenterPoint, SWFormer), if we want to test a trained model on different frame lengths, the training settings need to be modified and the model needs to be retrained. With stochastic-length FrameDrop (SLF), LEF can generalize to variable frame lengths without retraining; it can leverage additional frames and achieve increasingly improved results. Large-object 3D AP is shown in Table VI. In contrast, SWFormer and LEF without SLF cannot make the best use of long histories and may even face performance decreases. This is because long history frames can exhibit diverse motion patterns of temporally aggregated data, posing generalization difficulties for methods trained without SLF. Moreover, since SWFormer is based on point cloud stacking, it runs into OOM if we simply stack a long LiDAR sequence into millions of 3D points and use them as inputs. These observations indicate that stochastic-length FrameDrop and recurrent fusion are critical in generalizing our method LEF to variable frame lengths during inference.

Foreground pillar segmentation. To efficiently fuse history pillar features in a recurrent manner, we apply BEV foreground segmentation before passing the history latent pillar embeddings into the next frame. The number of history pillars that need to be recurrently fused can be reduced from ∼20k to ∼2k on average after removing a huge amount of uninformative background data. Therefore the computation burden of our late-to-early temporal fusion scheme can be greatly reduced and maintained at a relatively low constant cost.

Inverse calibration and alignment. Inverse calibration and alignment, as illustrated in Figure 2, is important for fusing the two sparse sets of pillar features from the prior and current frames. Features belonging to the same underlying static objects can be effectively aligned by this temporal alignment process. In Table VII we show that inverse calibration and alignment achieves consistent detection improvements across objects of different sizes, including trucks, sedans, pedestrians, and so on.

TABLE VIII: Variants of window-based attention blocks for recurrent temporal fusion. Based on these comparisons, we adopt self-attention as the default in other experiments.

Attention Type   L1 2D   L1 3D   L2 2D   L2 3D
Cross-Attn       51.69   42.35   47.06   38.36
Mix-Attn         61.68   52.94   56.46   48.06
Self-Attn        62.63   54.35   57.42   49.34

TABLE IX: The impact of window-based self-attention on objects of different speeds (3D AP).

Self-Attention   Static   Slow    Medium   Fast    Very Fast
without          60.55    63.46   74.58    53.07   75.47
with             66.62    69.27   79.62    62.46   82.14

Window-based attention fusion.
We apply window-based attention blocks on temporally aligned sparse pillar tokens to further fuse information from the history and current frames. As explained in Section III-D, we explore three different attention designs: self-, cross- and mix-attention. Detection AP on large objects of the WOD validation set is shown in Table VIII. For all variants, we use the sparse set of pillar tokens {V''_{i,u}} converted from the temporally aligned BEV feature map J'_i as the query tensor. In self-attention, query, key and value are based on the same tensor. In cross-attention, the key and value tensors are the sparse set of pillar tokens {Ṽ_{i−1,c}} converted from the calibrated history features Ĩ_{i−1}. Mix-attention uses the union set of the prior methods as key and value. We observe that self-attention consistently outperforms the other two attention variants, presumably because the history tokens live in a quite different latent space from the temporally aligned tokens. Attention between {Ṽ_{i−1,c}} and {V''_{i,u}} can therefore easily lead to intractable feature fusion and eventually hurt detection. Meanwhile, since J'_i has already merged information from the history Ĩ_{i−1} and the current I'_i, self-attention is competent to associate relevant pillar tokens and fulfill the fusion task.

Window-based attention fusion plays an important role in fusing the information from moving-object pillars. In Table IX, we present validation-set 3D AP comparisons with and without window-based self-attention fusion. We report subcategory metrics under different speed ranges: [0, 0.45), [0.45, 2.24), [2.24, 6.71), [6.71, 22.37), and [22.37, +∞) miles per hour for static, slow, medium, fast, and very fast objects, respectively. The metrics are averaged over objects of different sizes. We observe that attention fusion brings consistent detection gains across the different object speed ranges. In particular, the improvements achieved on high-speed objects are larger than those on low-speed objects: +9.4 (fast) vs. +6.1 (static) 3D AP. These comparisons empirically show that window-based self-attention fusion is critical for associating relevant pillars that belong to the same underlying objects, which is especially important for moving object detection.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we conduct an in-depth study on the temporal fusion aspect of 3D object detection from LiDAR sequences. We propose a late-to-early temporal feature fusion method that recurrently extracts sparse pillar features from both object-aware latent embeddings and raw LiDAR sensor inputs. To handle the alignment issues of static and moving objects, we propose inverse calibration and alignment as well as window-based attention fusion. We also apply foreground segmentation to obtain sparse pillar features from history for computation reduction. The resulting model, LEF, performs favorably against its base model SWFormer in both detection quality and efficiency. The improvement is especially significant on large objects, which require multiple LiDAR sweeps fused across space and time to achieve a high surface coverage rate.

As future work, we plan to extend our method to multi-modal sensor fusion with a focus on integrating camera and radar information. Recurrent late-to-early temporal fusion schemes like ours and BEVFormer [35] have been explored in very few papers.
To further demonstrate the effectiveness of this approach, it would be beneficial to test it on various backbone models and to extend its application beyond the scope of the 3D object detection task.

REFERENCES

[1] R. Huang, W. Zhang, A. Kundu, C. Pantofaru, D. A. Ross, T. Funkhouser, and A. Fathi, "An LSTM approach to temporal 3D object detection in LiDAR point clouds," in ECCV, 2020, pp. 266–282.
[2] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "PV-RCNN: Point-voxel feature set abstraction for 3D object detection," in CVPR, 2020, pp. 10529–10538.
[3] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, "PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection," arXiv, 2021.
[4] T. Yin, X. Zhou, and P. Krähenbühl, "Center-based 3D object detection and tracking," in CVPR, 2021, pp. 11784–11793.
[5] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3D object detector with sparse transformer," in CVPR, 2022, pp. 8458–8468.
[6] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, "SWFormer: Sparse window transformer for 3D object detection in point clouds," in ECCV, 2022.
[7] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., "Scalability in perception for autonomous driving: Waymo Open Dataset," in CVPR, 2020.
[8] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud," in CVPR, 2019, pp. 770–779.
[9] C. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," in CVPR, 2018, pp. 918–927.
[10] C. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in CVPR, 2017.
[11] C. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in NIPS, 2017.
[12] B. Graham and L. van der Maaten, "Submanifold sparse convolutional networks," arXiv preprint arXiv:1706.01307, 2017.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, vol. 30, 2017.
[14] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, "Point Transformer," in ICCV, 2021, pp. 16259–16268.
[15] J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu, "Voxel Transformer for 3D object detection," in ICCV, 2021.
[16] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in CVPR, 2019, pp. 12697–12705.
[17] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, 2018.
[18] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," in CVPR, 2018, pp. 4490–4499.
[19] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[20] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3D object detector with sparse transformer," in CVPR, 2022, pp. 8448–8458.
[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in ICCV, 2021, pp. 9992–10002.
[22] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D.
Anguelov, "RSN: Range sparse net for efficient, accurate LiDAR 3D object detection," in CVPR, 2021, pp. 5725–5734.
[23] G. P. Meyer, A. G. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "LaserNet: An efficient probabilistic 3D object detector for autonomous driving," in CVPR, 2019, pp. 12669–12678.
[24] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, Z. Chen, J. Shlens, and V. Vasudevan, "StarNet: Targeted computation for object detection in point clouds," arXiv preprint arXiv:1908.11069, 2019.
[25] Y. Wang, A. Fathi, A. Kundu, D. A. Ross, C. Pantofaru, T. Funkhouser, and J. Solomon, "Pillar-based object detection for autonomous driving," in ECCV, 2020, pp. 18–34.
[26] Y. Chai, P. Sun, J. Ngiam, W. Wang, B. Caine, V. Vasudevan, X. Zhang, and D. Anguelov, "To the point: Efficient 3D object detection in the range image with graph convolution kernels," in CVPR, 2021.
[27] L. Fan, X. Xiong, F. Wang, N. Wang, and Z. Zhang, "RangeDet: In defense of range view for LiDAR-based 3D object detection," in ICCV, 2021, pp. 2898–2907.
[28] Z. Li, F. Wang, and N. Wang, "LiDAR R-CNN: An efficient and universal 3D object detector," in CVPR, 2021, pp. 7546–7555.
[29] H. Sheng, S. Cai, Y. Liu, B. Deng, J. Huang, X.-S. Hua, and M.-J. Zhao, "Improving 3D object detection with channel-wise transformer," in ICCV, 2021, pp. 2743–2752.
[30] C. Liu, Z. Leng, P. Sun, S. Cheng, C. R. Qi, Y. Zhou, M. Tan, and D. Anguelov, "LidarNAS: Unifying and searching neural architectures for 3D point clouds," in ECCV, 2022, pp. 158–175.
[31] Z. Yuan, X. Song, L. Bai, Z. Wang, and W. Ouyang, "Temporal-channel transformer for 3D LiDAR-based video object detection for autonomous driving," TCSVT, vol. 32, no. 4, pp. 2068–2078, 2021.
[32] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang, "LiDAR-based online 3D video object detection with graph-based message passing and spatiotemporal transformer attention," in CVPR, 2020.
[33] Z. Yang, Y. Zhou, Z. Chen, and J. Ngiam, "3D-MAN: 3D multi-frame attention network for object detection," in CVPR, 2021.
[34] W. Luo, B. Yang, and R. Urtasun, "Fast and Furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net," in CVPR, 2018, pp. 3569–3577.
[35] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, "BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," arXiv preprint arXiv:2203.17270, 2022.
[36] X. Chen, S. Shi, B. Zhu, K. C. Cheung, H. Xu, and H. Li, "MPPNet: Multi-frame feature intertwining with proxy points for 3D temporal object detection," arXiv preprint arXiv:2205.05979, 2022.
[37] Z. Zhou, X. Zhao, Y. Wang, P. Wang, and H. Foroosh, "CenterFormer: Center-based transformer for 3D object detection," in ECCV, 2022.
[38] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[40] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[41] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," in CVPR, 2020.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.