LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection

Tong He*, Pei Sun, Zhaoqi Leng, Chenxi Liu, Dragomir Anguelov, Mingxing Tan*

Abstract—We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds. Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector. This feature fusion strategy enables the model to better capture the shapes and poses of challenging objects, compared with learning from raw points directly. Our method conducts late-to-early feature fusion in a recurrent manner, achieved by enforcing window-based attention blocks upon temporally calibrated and aligned sparse pillar tokens. Leveraging bird's eye view foreground pillar segmentation, we reduce the number of sparse history features that our model needs to fuse into its current frame by 10×. We also propose a stochastic-length FrameDrop training technique, which generalizes the model to variable frame lengths at inference for improved performance without retraining. We evaluate our method on the widely adopted Waymo Open Dataset and demonstrate improvements in 3D object detection over the baseline model, especially for the challenging category of large objects.

[Fig. 1(a): Overview structures of three temporal fusion approaches (early-to-early, late-to-late, and our late-to-early), where B denotes the backbone and H denotes the detection head.]

I. INTRODUCTION

The goal of LiDAR temporal fusion is to aggregate learned history information to improve point-cloud-based tasks. The history information could be of various implicit (e.g. latent embeddings), explicit (e.g.
point clouds, 3D box tracklets) representations, or a mixture of both, depending on the models and tasks at hand. Temporal fusion is critical for multiple driving-related tasks, such as 3D object detection, tracking, segmentation, and behavior prediction. Here we mainly study LiDAR-based fusion methods for 3D object detection, which is a crucial task for recognizing and localizing surrounding objects in modern autonomous driving systems. Point clouds of a single frame can only serve as a partial observation of the scene, lacking complete coverage of environment context and agent dynamics. This information bottleneck is caused by several factors such as object self-occlusion, occlusion by other objects, sensor field-of-view limitations, and data noise. Moreover, for moving objects, models with only single-frame data will struggle to understand their short-term states (velocities, accelerations) and long-term intentions (future trajectories). Tackling these issues demands effective ways of LiDAR temporal fusion, which can enable the model to understand scene/object attributes and dynamics over a wide time horizon.

The main challenge of temporal fusion is how to represent and aggregate the long-sequence information of history frames. See Figure 1a for a high-level illustration and comparison. Generally speaking, previous solutions can be classified into two types. One of the most widely used methods is early-to-early fusion based on point cloud stacking.

*Waymo LLC. {simpleig,tanmingxing}@waymo.com

[Fig. 1(b): Performance comparisons on Waymo Open Dataset, 3D AP for large objects in WOD: Early-to-Early 49.7, Late-to-Late 49.1, Late-to-Early 54.4.]

Fig. 1: Comparisons of temporal fusion approaches. Our late-to-early fusion approach achieves better detection quality (e.g. 54.4 3D AP for the challenging large objects) than previous early-to-early and late-to-late methods.

Multi-frame LiDAR points are directly stacked together as model inputs, resulting in better performance than a single frame of LiDAR points.
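As a concrete illustration of this stacking scheme, here is a minimal numpy sketch. Appending a per-point time-offset channel to mark each point's source frame is one common convention (described further in Section II); the function and parameter names are our own, not from the paper.

```python
import numpy as np

def stack_frames(frames, timestamps):
    """Early-to-early fusion: concatenate multi-frame point clouds.

    frames: list of (N_i, 3) xyz arrays, oldest first.
    timestamps: one float per frame; the last frame is "current".
    Each point receives an extra channel holding its frame's time
    offset w.r.t. the current frame, marking its source frame.
    """
    t_now = timestamps[-1]
    stacked = []
    for pts, t in zip(frames, timestamps):
        offset = np.full((pts.shape[0], 1), t - t_now, dtype=pts.dtype)
        stacked.append(np.concatenate([pts, offset], axis=1))
    # (sum of N_i, 4): every frame is re-processed whenever it is
    # re-stacked for an adjacent frame, which is the cost discussed below.
    return np.concatenate(stacked, axis=0)
```

Note how the total input size grows linearly with the number of stacked frames, which is the memory/compute bottleneck this paper sets out to avoid.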
However, the performance quickly saturates when more frames are simply stacked together [1] without careful modeling of the inter-frame relationships. Moreover, each frame needs to be repeatedly processed when it is stacked into different adjacent frames, greatly increasing computation cost. Fitting long sequences will also greatly increase memory cost, reduce model efficiency, or even result in out-of-memory (OOM) issues. Ideally, a model should leverage what it has already learned from the data, not simply stack its raw sensory inputs. To overcome this issue, another type of fusion method turns to late-to-late fusion so as to utilize the learned history embeddings. A representative method is ConvLSTM [1], which recurrently fuses latent embeddings between consecutive frames at deep layers of the model. This approach reduces memory usage and computation cost, but its results are usually inferior to early-to-early fusion, as shown in Figure 1b. We suspect that this is because the backbone only has access to single-frame data before late fusion happens. The task of understanding temporally fused deep features falls upon the detection heads, which usually consist of low-capacity multi-layer perceptron (MLP) layers. Consequently, most state-of-the-art LiDAR 3D object detectors (e.g. PVRCNN++ [2], [3], CenterPoint [4], SST [5], SWFormer [6], etc.) still rely on early-to-early fusion with point cloud stacking.

[arXiv:2309.16870v1 [cs.CV] 28 Sep 2023]

In this paper, we propose a new fusion method named LEF: Late-to-Early temporal Fusion. We argue that this fusion scheme can leverage learned history knowledge, while its backbone does not suffer from single-frame data deficiency issues. Long-history LiDAR fusion is a fundamental building block for autonomous driving, and our work opens a promising direction towards that goal. There are three main contributions in our paper:

• We propose a recurrent architecture that fuses late-stage sparse pillar features into early stages of the next frame. To align the underlying static objects, we propose an inverse calibration and alignment module to fuse history and current sparse sets of pillar features. As for moving objects, we leverage window-based attention layers, which can associate relevant features within the windows and thus connect pillar tokens that belong to the same object.

• While point stacking struggles to cache and preprocess huge point clouds as the history length grows, we leverage a bird's eye view (BEV) foreground pillar segmentation module to achieve long-sequence fusion at a low constant cost. The number of sparse voxels that our model needs to fuse at each recurrent step is reduced by over 10× via the foreground segmentation process.

• We also propose a stochastic-length FrameDrop training recipe. It exposes the model to an augmented, large motion space of pillar trajectories across time. Thus our recurrent model can capture objects of different speeds, and generalize to variable frame lengths during inference for improved performance.

The proposed late-to-early temporal fusion scheme leads to improved 3D detection results on the widely used Waymo Open Dataset (WOD) [7] and demonstrates large gains on challenging large objects. We also conduct extensive ablation studies on various design choices made in our method, providing several interesting insights.

II. RELATED WORK

3D Object Detection. LiDAR-based 3D object detection plays an essential role in autonomous driving. Early research efforts such as PointRCNN [8] usually operate on raw 3D point clouds through PointNet(++) [9]–[11]. But they struggle to generalize to large-scale data, such as long-sequence fused LiDAR [7] with millions of points. Heavily relying on MLP-based backbones, these detectors were soon outperformed by models with more advanced architectures like submanifold sparse convolution [12] or Transformers [13]–[15]. By voxelizing free-shape point sets into regular 2D¹ or 3D-shape voxels, LiDAR-based detectors [16]–[18] can leverage numerous advancements from image 2D object detection, and have started to demonstrate promising 3D detection results. Particularly, CenterPoint [4] utilizes sparse convolution layers and CenterNet-based detection heads [19] to predict 3D boxes. Some recent works, such as SST [20] and SWFormer [6], exploit Swin-Transformer [21] and push the detection performance to a new state of the art. Meanwhile, several methods [2], [3], [22]–[30] look into alternative LiDAR representations and strive towards a balance between detection efficiency and efficacy.

¹2D-shape voxels are often referred to as pillars.

LiDAR Temporal Fusion. Compared with the rapid progress achieved on 3D detection backbones, approaches to LiDAR temporal fusion are less well studied. Point clouds of a single frame in WOD [7] already cause a huge computation burden (i.e., ∼200k points), let alone long history sequences. As briefly discussed in the introduction, LiDAR temporal fusion solutions can generally be classified into three types: early-to-early, late-to-late and late-to-early fusion. Early-to-early fusion is also referred to as point cloud stacking. It is most widely adopted in recent LiDAR object detectors (e.g. CenterPoint [4], RSN [22], SWFormer [6], etc.) due to its simple setup. Multi-frame point sets are merged together. Timestamp offsets w.r.t. the current frame are appended to the sensory signals of each 3D point to serve as markers indicating different frame sources. However, point stacking struggles to work on long sequences due to the cost of fusing, saving and jointly preprocessing millions of points. It is also possible to use a Transformer to early-fuse point clouds from different frames [31]. While early-to-early fusion simply stacks raw sensory inputs without carefully modeling inter-frame relationships and ignores knowledge learned from prior frames, late-to-late fusion tries to tackle these issues via ConvLSTM [1], [32]. It recurrently fuses sparse latent embeddings between deep layers of the backbone with improved efficiency over point stacking, but the results are often not as competitive as early-to-early fusion. This is presumably because its backbone can only utilize single-frame data until fusion happens at deep layers. 3D-MAN [33] may also be viewed as a form of late-to-late fusion, because the temporal fusion in this method is done through various kinds of cross-attention between box proposals and features in the memory bank, which are both after the backbone of its network. FaF [34] studied both early fusion and late fusion. To the best of our knowledge, late-to-early fusion has not been explored before in LiDAR detectors. A similar fusion framework is studied in [35], but targeting camera-based detection. It faces very different challenges from ours: we need to process sparsely distributed 3D data at wide ranges, which requires dedicated designs for sparse feature alignment and fusion, as well as new training recipes.

Finally, we note that our review so far concentrates on single-stage trainable models that internalize the temporal fusion scheme. It is also possible to follow up the box predictions with a second-stage offline refinement, using the terminology from a recent exemplar of this two-stage approach, MPPNet [36]. MPPNet runs a pre-trained CenterPoint [4] on 4-frame stacked LiDAR point clouds to generate anchor boxes, which are then tracked and aggregated across long sequences. Specifically, latent embeddings or raw points within the box regions of one frame are cropped and intertwined with those extracted from other frames in order to refine the box states. The key differentiating factor of the two-stage approach is that the two stages / models are trained separately [36], suggesting that improvements inherently built into the first stage, like ours, are complementary to second-stage innovations.

III. METHOD

A. Problem Statement

We use {P_i}, i = 1,...,T to represent a consecutive sequence of LiDAR point clouds with P_i : {X_{i,j} ∈ R^3}, j = 1,...,N_i. Our goal is to detect 3D object boxes {B_{i,m}}, m = 1,...,M_i for each frame t using {P_i | i ⩽ t}. Ideally the model should be capable of fusing history information F(P_1,...,P_t) up to the current timestamp t, where F(·) denotes the fusion function. LiDAR temporal fusion is known to be an open challenge due to the sparse and wide-range spatial distribution of point clouds, let alone diverse object dynamics. Currently, early-to-early fusion (i.e., point stacking) P_{t−l} ∪ ... ∪ P_t is most widely used, as it is easy to implement. However, due to memory constraints the sequence length is usually small, e.g. l ∈ {2,3}. Moreover, the point clouds {X_{i,j}} of one frame have to be repeatedly processed (l+1) times when we conduct model inference on adjacent frames, causing a huge waste of computation. As for detection performance, whether directly stacking the raw sensory inputs without reusing learned history knowledge can lead to optimal results also remains questionable.

B. Recurrent Late-to-Early Fusion

To address the aforementioned issues, we propose a recurrent late-to-early temporal fusion strategy. As shown in Figure 2, the fusion pipeline works like a "Markov chain", which can accumulate history information from long sequences and reduce redundant computation. Thus, the fusion function F(·) can be iteratively defined as:

f_i = ψ(h(f_{i−1} ⊕ τ(t_i − t_{i−1}), ν({X_{i,j}})))    (1)

where f_{i−1} indicates the history deep-layer voxel embeddings, and τ(·) is a Sinusoidal function for encoding the timestamp offset. ν(·) represents the VoxelNet [18] used to obtain pillar features from point clouds. h(·) is the backbone for recurrent fusion and multi-scale sparse pillar feature extraction, and ψ(·) is the foreground segmentation module.

History features. In particular, we use the latent features of segmented foreground pillars as f_{i−1} and pass them into the next timestamp. Without loss of generality, we use SWFormer [6] as our backbone and center-based detection heads [4] as examples in the following discussion where needed. The diagram is plotted in Figure 2. The model works on sparse pillar tokens, and thus the segmentation outputs can be written as f_{i−1} : {V_{i−1,k} ∈ R^{2+d}}, k = 1,...,K_{i−1}. The first two dimensions record the BEV coordinates of the pillars and the rest are extracted embeddings (i.e., d = 128), which contain rich scene- and object-aware information. Moreover, compared with the raw point cloud size N_{i−1} (∼200k), the foreground pillar feature set size K_{i−1} (∼2k) is much smaller. Therefore, we are motivated to fuse these deep-layer features into early stages of the next frame in order to efficiently reuse learned high-level knowledge for 3D detection, especially on challenging large objects.

Fusion location. To achieve recurrent late-to-early fusion, we fuse f_{i−1} with the VoxelNet [18] outputs ν({X_{i,j}}) ↦ {V′_{i,n} ∈ R^{2+d}}, n = 1,...,N′_i before they are fed into the main backbone network. Instead of early fusion before the backbone, some may argue that an alternative is conducting late fusion after the backbone, which is close to the network stage where f_{i−1} is extracted. Diagrams of these two different fusion locations are plotted in Figure 1. We think that late fusion presumably causes the backbone B to lose access to temporally aggregated LiDAR sequence information, so that the low-capacity detection heads H struggle to understand the fused features and predict object poses and shapes. Ablation studies on early-to-early, late-to-late and our proposed late-to-early fusion methods are provided in Table IV and Section IV-C, which empirically prove the advantages of our approach.

C. Inverse Calibration and Alignment

While image sequences are naturally aligned across frames by their shapes (height, width, channel), the sparse sets of pillar features {V_{i−1,k}}, {V′_{i,n}} are neither aligned nor of the same cardinality (i.e., K_{i−1} ≠ N′_i). Intuitively, one could convert the sparse features into dense BEV maps {V_{i−1,k}} ↦ I_{i−1} ∈ R^{H×W×d}, {V′_{i,n}} ↦ I′_i ∈ R^{H×W×d} and then align them. However, as Figure 2 shows, directly doing so without proper calibration can result in misalignment between the underlying objects of the scene. This is because the pillar features extracted by the backbone are in their corresponding local vehicle coordinates with poses g_{i−1} ∈ R^{4×4}, g_i ∈ R^{4×4}. To alleviate this misalignment issue, we need to calibrate the history BEV maps I_{i−1}:

I_{i−1} ∘ g_{i−1}^{−1} ∘ g_i ↦ Ĩ_{i−1}    (2)

where ∘ means applying a vehicle coordinate transformation and Ĩ_{i−1} represents the calibrated BEV maps. In practice, however, if we apply forward calibration upon I_{i−1} we might get more than one pillar falling into the same discrete coordinates within Ĩ_{i−1}. To address this issue, we conduct the inverse transformation from Ĩ_{i−1} to I_{i−1} and sample the history BEV features. We use zero padding to fill in the pillar features of empty samples and of out-of-view locations (e.g. the red cross markers in Figure 2). The inversely calibrated history maps can now be aligned with the current maps by feature concatenation Ĩ_{i−1} ⊕ I′_i ↦ J_i ∈ R^{H×W×2d}. Next, we apply an MLP on J_i for dimension reduction (i.e., 2d ↦ d) and get the temporally aligned pillar features J′_i. Note that not all coordinates within J′_i have valid features. We use the union BEV boolean mask O_i ∈ R^{H×W}, obtained from the current and calibrated history BEV features, to mark the valid coordinates of J′_i. Thus, we do not lose data sparsity.

[Fig. 2: Detection pipeline with our proposed LEF (per frame: inverse calibration, alignment, time encoding, attention, backbone, segmentation MLP, detection head). In each forward pass, the early-stage pillar encoding is aligned and fused with the history late-stage foreground pillar features f_{i−1}. The alignment is achieved by an inverse calibration and alignment process (Section III-C) that enables pillar features of the underlying static objects to be matched. To effectively associate moving object features, we further use window-based attention blocks (Section III-D) to connect relevant pillars. Outputs from the attention fusion layers are then fed into the main backbone network (e.g. SWFormer [6]), followed by a foreground pillar segmentation layer and the final detection head [4] for 3D bounding box predictions.]

D. Window-based Attention Fusion

Pillars of static objects are effectively aligned after the prior steps, but moving ones still face the misalignment issue. One solution is to apply flow estimation to further calibrate the history BEV features Ĩ_{i−1} before temporal alignment with I′_i. But that requires adding additional occupancy flow models, losses and feature coordinate transformations, which might greatly increase the computation overhead of the 3D object detector. Therefore, we propose to learn such associations implicitly from the data via window-based attention blocks. We sparsify the dense BEV feature map J′_i and its boolean mask O_i into a sparse set of pillar tokens {V′′_{i,u}}, u = 1,...,U_i. Usually U_i ⩾ N′_i, because the cardinality U_i is the number of fused pillars after temporal alignment between the history and current features through the steps in Section III-C. While {V′′_{i,u}} is used as the query tensor for the attention blocks, we can make different choices for the key and value tensors: using {V′′_{i,u}} again, or the sparsified set of history pillar tokens from (2): Ĩ_{i−1} ↦ {Ṽ_{i−1,c}}, c = 1,...,K̃_{i−1}. Most often, K̃_{i−1} ⩽ K_{i−1} due to out-of-view truncation after vehicle coordinate calibration. The resulting variants are self- / cross- / mix-attention. In self-attention the key and value tensors are the same as the query. Cross-attention uses {Ṽ_{i−1,c}} as key and value, and mix-attention uses the union set of the prior two attention variants. We apply Sinusoidal-function-based absolute positional encoding to inform the attention blocks of the sparse pillar coordinates within a window. Detailed ablation studies on the different attention designs are provided in Section IV-C. With window-based attention fusion, features of both static and moving pillars can now be associated and fused before being passed into the main backbone network.
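The inverse calibration step of Section III-C (Eq. (2)) can be sketched as follows. This is a minimal numpy illustration, assuming an ego-centered BEV grid and 4×4 vehicle-to-world poses; the grid layout, cell-center convention and function names are our own simplifications, not the paper's implementation.

```python
import numpy as np

def inverse_calibrate(hist_bev, g_hist, g_cur, cell=0.32):
    """Warp a history BEV feature map into the current vehicle frame.

    hist_bev: (H, W, d) dense BEV features in the history frame.
    g_hist, g_cur: (4, 4) vehicle-to-world poses.
    For every cell of the output map we inverse-map its center into
    the history frame and sample that cell's features; cells falling
    outside the history view are zero-padded, so no target cell ever
    receives more than one source pillar (the forward-warp problem).
    """
    H, W, d = hist_bev.shape
    out = np.zeros_like(hist_bev)
    # Rigid transform: current vehicle frame -> world -> history frame.
    T = np.linalg.inv(g_hist) @ g_cur
    for r in range(H):
        for c in range(W):
            # Cell center in current-frame metric coords (grid centered on ego).
            x = (c - W / 2 + 0.5) * cell
            y = (r - H / 2 + 0.5) * cell
            xh, yh = (T @ np.array([x, y, 0.0, 1.0]))[:2]
            rh = int(np.floor(yh / cell + H / 2))
            ch = int(np.floor(xh / cell + W / 2))
            if 0 <= rh < H and 0 <= ch < W:
                out[r, c] = hist_bev[rh, ch]
    return out
```

After this warp, the calibrated map would be concatenated channel-wise with the current map and reduced by an MLP, as described above; a production version would vectorize the per-cell loop and use the real 0.32 m pillar resolution from Section III-F.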
E. Stochastic-Length FrameDrop

To enable robust training upon long sequences, we randomly drop history frames from (P_1,...,P_t) during each training iteration. In other words, we randomly sample S_i history frames, with S_i being a stochastic number at different training steps, and the sampled frames are not necessarily adjacent. In comparison, previous LiDAR temporal fusion methods usually fix S_i to a constant (e.g. 3 or 4) and sample consecutive frames. We apply stop-gradient between recurrent passes when fusing deep-layer history features into early layers of the next frame, without which long-sequence training of 3D object detectors can easily become intractable or run into OOM. During training, the model only predicts 3D boxes {B̂_{i,m}} in the last forward pass. Losses are

TABLE I: Overall performance comparisons on Waymo Open Dataset. "Refine" means that the detector needs an additional step of box refinement via feature pooling and fusion from the box areas, which usually increases time cost and might not be end-to-end trainable. For fair comparisons we focus on single-stage detectors without (w/o) box refinement.
Method | Refine | Test set L1 3D AP/APH | Test set L2 | Validation set L1 | Validation set L2
3D-MAN [33] | with | 78.71 / 78.28 | 70.37 / 69.98 | 74.53 / 74.03 | 67.61 / 67.14
CenterPoint [4] | with | 80.20 / 79.70 | 72.20 / 71.80 | 76.60 / 76.10 | 68.90 / 68.40
SST [5] | with | 80.99 / 80.62 | 73.08 / 72.72 | 77.00 / 76.60 | 68.50 / 68.10
PVRCNN++ [2] | with | 81.62 / 81.20 | 73.86 / 73.47 | 79.30 / 78.80 | 70.60 / 70.20
MPPNet [36] | with | 84.27 / 83.88 | 77.29 / 76.91 | 82.74 / 82.28 | 75.41 / 74.96
CenterFormer [37] | with | 84.70 / 84.40 | 78.10 / 77.70 | 78.80 / 78.30 | 74.30 / 73.80
PointPillars [16] | w/o | 68.60 / 68.10 | 60.50 / 60.10 | 63.30 / 62.70 | 55.20 / 54.70
RSN [22] | w/o | 80.70 / 80.30 | 71.90 / 71.60 | 78.40 / 78.10 | 69.50 / 69.10
SWFormer [6] | w/o | 82.25 / 81.87 | 74.23 / 73.87 | 79.03 / 78.55 | 70.55 / 70.11
LEF (ours) | w/o | 83.39 / 83.02 | 75.51 / 75.16 | 79.64 / 79.18 | 71.37 / 70.94

enforced upon certain intermediate outputs (e.g. foreground pillar segmentation) and the final box parameter predictions (e.g. shapes and poses):

L = λ_1 L_seg + λ_2 L_center + L_box    (3)

in which L is the total loss. L_seg is a focal loss for foreground segmentation. L_center is also based on focal loss, but for object-center heatmap estimation [4], [38]. L_box contains SmoothL1 losses for box azimuth, center offset and size regression. A detailed explanation is in [6].

The training randomness introduced by LiDAR sequence sampling enables the model to be robust to various motion patterns of pillar trajectories across time. Thus our recurrent model can understand different object dynamics, and generalize to variable frame lengths during inference without retraining. More experiments and analysis are provided in Table VI and the ablation studies.

F. Implementation Details

We conduct 3D object detection within a wide-range 164×164 meter (m) square zone centered on the top LiDAR sensor. Point clouds inside this region are voxelized into 2D pillars with 0.32 m spatial resolution. The window attention blocks use 10×10 grouping sizes. The loss weights λ_1, λ_2 defined in (3) are 200 and 10, respectively. We use the AdamW [39], [40] optimizer with a batch size of 128 and 240k iterations for distributed training on 128 TPUv3. The training takes about 2 days. TPU memory usage is 5.4 GB on average and 7.4 GB at peak. The first 10k steps warm up the learning rate from 5.0e-4 to 1.0e-3, after which the learning rate follows a cosine annealing schedule to zero.

IV. EXPERIMENTS

In this section, we compare our model with other state-of-the-art methods, and perform ablation studies on the impact of our designs on detection performance.

A. Dataset and Backbone

We choose the Waymo Open Dataset [7] over nuScenes [41] and KITTI [42] because WOD has large-scale, high-quality LiDAR data, which better simulates the settings for developing on-road fully autonomous vehicles. There are about 160k annotated training frames in WOD but only around 30k frames in nuScenes. As for per-frame point cloud densities, WOD is ∼200k points and nuScenes is ∼30k. Therefore WOD is widely used in recent LiDAR-based methods: PV-RCNN(++), SST, RSN, SWFormer and so on [2]–[4], [6], [20], [22], [24], [26], [33], [36]. WOD has 798 training sequences, 202 validation and 150 test sequences, covering diverse driving scenarios and agent statuses. The LiDAR data collection frequency is 10 Hz. Each frame of point clouds consists of data gathered from five sensors: one long-range and four short-range LiDARs. For evaluation metrics, we adopt the officially recommended 3D AP / APH under two difficulty levels (L1, L2), which depend on the point densities of the ground-truth bounding boxes. APH is a weighted variant of AP using heading angles (i.e., azimuth). We adopt the state-of-the-art SWFormer [6] as our detection backbone, and replace its original early-to-early LiDAR fusion with our proposed LEF. For fair comparisons, all training settings are kept the same as [6].

B. Main Results and Comparisons

The overall vehicle detection results alongside other competing methods are in Table I. We compare against methods both with and without box refinement steps, although our model is a single-stage method without refinement and generally more efficient than those with box refinement. Our method LEF surpasses the prior best single-stage model SWFormer by +1.3 3D APH on L2 test data (75.16 vs. 73.87), demonstrating the strong overall performance of our approach.

TABLE II: Detection results on challenging large objects.
Method | L1 2D | L1 3D | L2 2D | L2 3D
RSN [22] | 53.10 | 45.20 | - | 40.90
SWFormer [6] | 58.33 | 49.74 | 53.45 | 45.23
LEF (ours) | 62.63 | 54.35 | 57.42 | 49.34

TABLE III: Computation cost. For fair comparisons, we use 3-frame temporal fusion settings on WOD for measurement.
Method | Latency | Flops | Parameters
PointPillars [16] | 93 ms | 375G | 6.4M
SWFormer [6] | 47 ms | 35G | 4.4M
LEF (ours) | 38 ms | 29G | 4.6M

TABLE IV: Ablation studies on different types of temporal fusion schemes. All methods are trained with SLF.
Fusion Strategy | L1 2D | L1 3D | L2 2D | L2 3D
Early-to-Early | 58.33 | 49.74 | 53.45 | 45.23
Late-to-Late | 58.74 | 48.83 | 53.67 | 44.32
Late-to-Early | 61.46 | 53.13 | 56.37 | 48.28

TABLE V: Ablation studies on different object sizes. The 3D AP gains achieved by LEF increase as object size grows.
Method | L1 Large | L1 Medium | L1 Small | L2 Large | L2 Medium | L2 Small
RSN [22] | 45.20 | 77.30 | 79.40 | 40.90 | 68.60 | 69.90
SWFormer [6] | 49.74 | 79.11 | 82.36 | 45.23 | 70.59 | 74.04
LEF (ours) | 54.35 | 79.62 | 82.46 | 49.34 | 71.32 | 74.15

[Fig. 3: Qualitative comparisons. Box colors are explained in the legend: green = ground truth, yellow = SWFormer, blue = LEF (ours); low-3D-IoU, false positive and false negative errors of the baseline SWFormer are highlighted in dashed red regions.]

Our method is particularly useful for detecting challenging large objects whose maximum dimension is beyond 7 meters: trucks, buses, construction vehicles, etc.
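The object-size buckets used in the following analysis can be expressed as a tiny helper. The 7 m cut-off for large objects is stated above; the 2.5 m medium/small split is only an illustrative assumption of ours, as the paper does not specify it.

```python
def size_bucket(length, width, height):
    """Bucket an object by its maximum box dimension (meters).

    The 7.0 m threshold for "large" follows the paper; the 2.5 m
    medium/small split is a hypothetical value for illustration.
    """
    m = max(length, width, height)
    if m > 7.0:
        return "large"
    return "medium" if m > 2.5 else "small"
```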
We conduct a detailed analysis on the validation set in Table II. Our method LEF outperforms SWFormer by a +9.3% relative increase on L1 3D AP: 54.35 vs. 49.74. Hard cases such as large vehicles suffer from partial observation issues more often than small or medium size objects. Faithfully detecting these challenging cases requires LiDAR temporal fusion over long frame lengths in order to enlarge the sensory data coverage. Moreover, our late-to-early fusion scheme can reuse learned scene- and object-aware latent features from prior frames, rather than simply stacking the point clouds as in RSN and SWFormer. Such high-level history knowledge enables the model to more easily tackle challenging detection cases, compared with solving them from scratch using stacked raw sensory inputs.

Qualitative results are visualized in Figure 3. Typical errors of SWFormer are highlighted in the red zones. Our results are aligned better (i.e., have higher 3D IoU) with the ground truth boxes than the SWFormer predictions, especially for challenging large objects. Moreover, our results contain fewer false negative and false positive predictions than the SWFormer results. We also measure model latency, flops and parameter sizes of different LiDAR 3D object detectors in Table III, following the same benchmark settings as [6]. PointPillars and SWFormer both use point stacking. The results demonstrate the efficiency advantages of our late-to-early recurrent fusion method.

C. Ablation Studies

Fusion strategy. We conduct apples-to-apples comparisons to study the effect of the early-to-early (E2E), late-to-late (L2L) and late-to-early (L2E) fusion strategies as illustrated in Figure 1a. Specifically, we test all fusion variants with the same backbone and frame number (i.e., 3) to factor out the influence of model architectures and LiDAR sequence lengths. Results on validation set large objects are in Table IV. Our L2E fusion surpasses the other two methods with 7.8% relative gains on L1 3D AP. Comparing E2E and L2L fusion, we observe that their results on 2D AP are comparable, but E2E clearly outperforms L2L on 3D AP, indicating higher 3D object detection quality. These results validate our arguments about the benefits of late-to-early fusion. Compared with E2E fusion, L2E enables the model to reuse learned scene- and object-aware knowledge from prior frames. Compared with L2L, the model capacity of L2E fusion is not constrained, because its backbone has early access to the temporally aggregated sensory data.

Different object sizes. Besides the overall results and hard-example analysis in Section IV-B, we are also interested in the impact of our method on different object sizes. Thus we divide the validation set objects into large, medium and small. Typical large objects are buses and trucks; medium and small objects usually include sedans and pedestrians, respectively. Detailed results are in Table V. Although our method LEF achieves comparable results with the competing methods on small objects, we observe increasingly larger gains as object size grows. On L2 medium objects, LEF improves SWFormer by 0.73 AP, and the gain further grows to 4.11 AP on large objects. One possible explanation is that small objects suffer less from partial-view observation issues than large objects, and thus do not significantly benefit from temporal fusion. From these results we believe that our method works robustly across different object sizes.

TABLE VI: Long frame history generalization studies. For each trained model, we evaluate its inference generalization ability to different frame (f) lengths without retraining.

TABLE VIII: Variants of window-based attention blocks for recurrent temporal fusion. Based on the comparisons, we adopt self-attention as the default in other experiments.
L1 L2 L1 L2 Method Attention Type 3-f 6-f 9-f 3-f 6-f 9-f 2D 3D 2D 3D SWFormer[6] 46.23 38.76 OOM 41.93 35.09 OOM Cross-Attn 51.69 42.35 47.06 38.36 LEF(w/oSLF) 51.18 51.44 50.84 46.58 46.91 46.28 Mix-Attn 61.68 52.94 56.46 48.06 LEF(withSLF) 53.13 53.96 54.35 48.28 48.99 49.34 Self-Attn 62.63 54.35 57.42 49.34 TABLE VII: Inverse calibration and alignment (ICA) can improve detection AP across different object sizes. TABLEIX:Theimpactofwindow-basedself-attentionon different speed objects. Large Medium Small ICA 2D 3D 2D 3D 2D 3D Self-Attention Static Slow Medium Fast VeryFast w/o 60.85 51.34 92.72 78.30 85.92 80.59 without 60.55 63.46 74.58 53.07 75.47 with 62.63 54.35 93.02 79.62 87.40 82.46 with 66.62 69.27 79.62 62.46 82.14 Frame length generalization. Due to memory constraint temporal alignment process. In Table VII we show that in- of the computing devices, GPU or TPU, 3D object detectors versecalibrationandalignmentachievesconsistentdetection with LiDAR temporal fusion usually sample a fixed number improvement across different size objects, including truck, ofhistoryframes(e.g.2or3)duringtraining.However,dur- sedan, pedestrian, and so on. inginference,thereareusuallyadditionalframesavailableto Window-based Attention Fusion. We apply window- themodeldependingonthehistorylengths.Fortypicalearly- based attention blocks on temporally aligned sparse pil- to-earlyfusionbasedmulti-framedetectors(e.g.CenterPoint, lar tokens to further fuse information of the history and SWFormer), if we want to test a trained model on different current frames. As explained in Section III-D, we explore frame lengths, the training settings need to be modified three different attention designs: self / cross / mix-attention. and the model needs to be retrained. With stochastic-length Detection AP on large objects of WOD validation set are FrameDrop (SLF), LEF can generalize to variable frame shown in Table VIII. For all methods, we use the sparse lengths without retraining. 
It can leverage additional frames and achieve increasingly better results. Large object 3D AP is shown in Table VI. In contrast, SWFormer and LEF without SLF cannot make the best of long history and might even suffer performance decreases. This is because long history frames can exhibit diverse motion patterns of temporally aggregated data, posing generalization difficulties for methods trained without SLF. Moreover, since SWFormer is based on point cloud stacking, it runs out of memory (OOM) if we simply stack a long LiDAR sequence into millions of 3D points and use them as inputs. These observations indicate that stochastic-length FrameDrop and recurrent fusion are critical in generalizing our method LEF to variable frame lengths during inference.

Foreground pillar segmentation. To efficiently fuse history pillar features in a recurrent manner, we apply BEV foreground segmentation before passing history latent pillar embeddings into the next frame. The number of history pillars that need to be recurrently fused can be reduced from ∼20k to ∼2k on average after removing a huge amount of uninformative background data. Therefore the computation burden of our late-to-early temporal fusion scheme is greatly reduced and maintained at a relatively low, near-constant cost.
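The foreground gating described above can be sketched as a simple score threshold over sparse pillars. This is a minimal NumPy illustration under our own assumptions (the segmentation head, threshold value, and names are hypothetical), not the paper's exact implementation:

```python
import numpy as np

def select_foreground_pillars(pillar_feats, fg_scores, threshold=0.5):
    """Keep only pillars whose BEV foreground score exceeds a threshold.

    pillar_feats: (N, C) sparse pillar embeddings of the current frame.
    fg_scores:    (N,) per-pillar foreground probabilities predicted by
                  a BEV segmentation head.

    Returns the retained features and their indices. Only this much
    smaller set needs to be carried into the next frame's recurrent
    fusion step (e.g. roughly 2k of 20k pillars survive on average).
    """
    keep = fg_scores > threshold
    return pillar_feats[keep], np.nonzero(keep)[0]
```

The design choice is that background pillars contribute little to detection, so dropping them before recurrence keeps the per-frame fusion cost roughly constant regardless of history length.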
Inverse calibration and alignment. Inverse calibration and alignment, as illustrated in Figure 2, is important for fusing two sparse sets of pillar features between the prior and the current frames. Features belonging to the same underlying static objects can be effectively aligned after this temporal alignment process. In Table VII we show that inverse calibration and alignment achieves consistent detection improvements across objects of different sizes, including trucks, sedans, pedestrians, and so on.

Window-based attention fusion. We apply window-based attention blocks on temporally aligned sparse pillar tokens to further fuse information of the history and current frames. As explained in Section III-D, we explore three different attention designs: self / cross / mix-attention. Detection APs on large objects of the WOD validation set are shown in Table VIII. For all methods, we use the sparse set of pillar tokens {V′′_{i,u}} converted from the temporally aligned BEV feature map J′_i as the query tensor. In self-attention, query, key and value are based on the same tensor. In cross-attention, the key and value tensors are the sparse set of pillar tokens {Ṽ_{i−1,c}} converted from the calibrated history features Ĩ_{i−1}. Mix-attention uses the union of the two sets as key and value. We observe that self-attention consistently outperforms the other two attention variants. This is presumably because the history tokens exist in a quite different latent space from the temporally aligned tokens. Therefore attention between {Ṽ_{i−1,c}} and {V′′_{i,u}} may easily lead to intractable feature fusion and eventually hurt detection. Meanwhile, since J′_i has already merged information from the history Ĩ_{i−1} and the current I_i, self-attention is competent to associate relevant pillar tokens and fulfill the fusion task.

Window-based attention fusion also plays an important role in fusing the information from moving object pillars. In Table IX, we present validation set 3D AP comparisons between with and without window-based self-attention fusion. We report subcategory metrics under different speed ranges: [0, 0.45), [0.45, 2.24), [2.24, 6.71), [6.71, 22.37), and [22.37, +∞) miles per hour for static, slow, medium, fast, and very fast objects. The metrics are averaged over objects of different sizes. We observe that attention fusion brings consistent detection gains across different object speed ranges. In particular, the improvements achieved on high-speed objects are larger than those on low-speed objects: +9.4 (fast) vs. +6.1 (static) 3D AP gains. These comparisons empirically show that window-based self-attention fusion is critical in associating relevant pillars that belong to the same underlying objects, which is especially important for moving object detection.
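As a concrete illustration of this fusion step, here is a minimal NumPy sketch of self-attention applied independently within each BEV window of sparse pillar tokens. It is our own simplification: the learned query/key/value projections, multi-head structure, and the model's actual window partitioning are omitted, and all names are hypothetical:

```python
import numpy as np

def window_self_attention(tokens, window_ids):
    """Scaled dot-product self-attention, run separately per BEV window.

    tokens:     (N, C) sparse pillar tokens (already temporally fused
                into the aligned BEV feature map).
    window_ids: (N,) integer window assignment of each token.

    Query, key and value all come from the same token set
    (self-attention); learned projections are omitted for brevity.
    """
    out = np.empty_like(tokens)
    c = tokens.shape[1]
    for w in np.unique(window_ids):
        idx = np.nonzero(window_ids == w)[0]
        x = tokens[idx]                       # tokens inside one window
        logits = x @ x.T / np.sqrt(c)         # pairwise attention logits
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True)      # softmax rows
        out[idx] = attn @ x                   # value = the same tokens
    return out
```

Restricting attention to local windows keeps the cost linear in the number of sparse tokens rather than quadratic over the whole BEV grid.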
V. CONCLUSIONS AND FUTURE WORK

In this paper, we conduct an in-depth study on the temporal fusion aspect of 3D object detection from LiDAR sequences. We propose a late-to-early temporal feature fusion method that recurrently extracts sparse pillar features from both object-aware latent embeddings and raw LiDAR sensor inputs. To handle the alignment issues of static and moving objects, we propose inverse calibration and alignment as well as window-based attention fusion methods. We also apply foreground segmentation to obtain sparse pillar features from history frames for computation reduction. The resulting model, LEF, performs favorably against its base model SWFormer in both detection quality and efficiency. The improvement is especially significant on large objects, which require multiple LiDAR sweeps fused across space and time to achieve a high surface coverage rate.

As future work, we plan to extend our method to multi-modal sensor fusion, with a focus on integrating camera and radar information. Recurrent late-to-early temporal fusion schemes like ours and BEVFormer [35] have been explored in very few papers. To further demonstrate the effectiveness of this approach, it would be beneficial to test it on various backbone models and extend its application beyond the scope of the 3D object detection task.

REFERENCES

[1] R. Huang, W. Zhang, A. Kundu, C. Pantofaru, D. A. Ross, T. Funkhouser, and A. Fathi, "An lstm approach to temporal 3d object detection in lidar point clouds," in ECCV, 2020, pp. 266–282.
[2] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "Pv-rcnn: Point-voxel feature set abstraction for 3d object detection," in CVPR, 2020, pp. 10529–10538.
[3] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, "Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection," ArXiv, 2021.
[4] T. Yin, X. Zhou, and P. Krähenbühl, "Center-based 3d object detection and tracking," in CVPR, 2021, pp. 11784–11793.
[5] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3d object detector with sparse transformer," in CVPR, 2022, pp. 8458–8468.
[6] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, "Swformer: Sparse window transformer for 3d object detection in point clouds," in ECCV, 2022.
[7] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., "Scalability in perception for autonomous driving: Waymo open dataset," in CVPR, 2020.
[8] S. Shi, X. Wang, and H. Li, "Pointrcnn: 3d object proposal generation and detection from point cloud," in CVPR, 2019, pp. 770–779.
[9] C. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum pointnets for 3d object detection from rgb-d data," in CVPR, 2018, pp. 918–927.
[10] C. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in CVPR, 2017.
[11] C. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," in NIPS, 2017.
[12] B. Graham and L. van der Maaten, "Submanifold sparse convolutional networks," arXiv preprint arXiv:1706.01307, 2017.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, vol. 30, 2017.
[14] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, "Point transformer," in ICCV, 2021, pp. 16259–16268.
[15] J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu, "Voxel transformer for 3d object detection," in ICCV, 2021.
[16] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in CVPR, 2019, pp. 12697–12705.
[17] Y. Yan, Y. Mao, and B. Li, "Second: Sparsely embedded convolutional detection," Sensors (Basel, Switzerland), vol. 18, 2018.
[18] Y. Zhou and O. Tuzel, "Voxelnet: End-to-end learning for point cloud based 3d object detection," in CVPR, 2018, pp. 4490–4499.
[19] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," ArXiv, vol. abs/1904.07850, 2019.
[20] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3d object detector with sparse transformer," in CVPR, 2022, pp. 8448–8458.
[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in ICCV, 2021, pp. 9992–10002.
[22] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, "Rsn: Range sparse net for efficient, accurate lidar 3d object detection," in CVPR, 2021, pp. 5725–5734.
[23] G. P. Meyer, A. G. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "Lasernet: An efficient probabilistic 3d object detector for autonomous driving," in CVPR, 2019, pp. 12669–12678.
[24] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, Z. Chen, J. Shlens, and V. Vasudevan, "Starnet: Targeted computation for object detection in point clouds," ArXiv, vol. abs/1908.11069, 2019.
[25] Y. Wang, A. Fathi, A. Kundu, D. A. Ross, C. Pantofaru, T. Funkhouser, and J. Solomon, "Pillar-based object detection for autonomous driving," in ECCV, 2020, pp. 18–34.
[26] Y. Chai, P. Sun, J. Ngiam, W. Wang, B. Caine, V. Vasudevan, X. Zhang, and D. Anguelov, "To the point: Efficient 3d object detection in the range image with graph convolution kernels," in CVPR, 2021.
[27] L. Fan, X. Xiong, F. Wang, N. Wang, and Z. Zhang, "Rangedet: In defense of range view for lidar-based 3d object detection," in ICCV, 2021, pp. 2898–2907.
[28] Z. Li, F. Wang, and N. Wang, "Lidar r-cnn: An efficient and universal 3d object detector," in CVPR, 2021, pp. 7546–7555.
[29] H. Sheng, S. Cai, Y. Liu, B. Deng, J. Huang, X.-S. Hua, and M.-J. Zhao, "Improving 3d object detection with channel-wise transformer," in ICCV, 2021, pp. 2743–2752.
[30] C. Liu, Z. Leng, P. Sun, S. Cheng, C. R. Qi, Y. Zhou, M. Tan, and D. Anguelov, "Lidarnas: Unifying and searching neural architectures for 3d point clouds," in ECCV, 2022, pp. 158–175.
[31] Z. Yuan, X. Song, L. Bai, Z. Wang, and W. Ouyang, "Temporal-channel transformer for 3d lidar-based video object detection for autonomous driving," TCSVT, vol. 32, no. 4, pp. 2068–2078, 2021.
[32] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang, "Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention," in CVPR, 2020.
[33] Z. Yang, Y. Zhou, Z. Chen, and J. Ngiam, "3d-man: 3d multi-frame attention network for object detection," in CVPR, 2021.
[34] W. Luo, B. Yang, and R. Urtasun, "Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net," in CVPR, 2018, pp. 3569–3577.
[35] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, "Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," arXiv preprint arXiv:2203.17270, 2022.
[36] X. Chen, S. Shi, B. Zhu, K. C. Cheung, H. Xu, and H. Li, "Mppnet: Multi-frame feature intertwining with proxy points for 3d temporal object detection," arXiv preprint arXiv:2205.05979, 2022.
[37] Z. Zhou, X. Zhao, Y. Wang, P. Wang, and H. Foroosh, "Centerformer: Center-based transformer for 3d object detection," in ECCV, 2022.
[38] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[40] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[41] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in CVPR, 2020.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite," in CVPR, 2012.