LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection

Tong He*, Pei Sun, Zhaoqi Leng, Chenxi Liu, Dragomir Anguelov, Mingxing Tan*

Abstract—We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds. Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector. This feature fusion strategy enables the model to better capture the shapes and poses of challenging objects, compared with learning from raw points directly. Our method conducts late-to-early feature fusion in a recurrent manner, achieved by enforcing window-based attention blocks upon temporally calibrated and aligned sparse pillar tokens. Leveraging bird's eye view foreground pillar segmentation, we reduce the number of sparse history features that our model needs to fuse into its current frame by 10×. We also propose a stochastic-length FrameDrop training technique, which generalizes the model to variable frame lengths at inference for improved performance without retraining. We evaluate our method on the widely adopted Waymo Open Dataset and demonstrate improvements in 3D object detection over the baseline model, especially for the challenging category of large objects.

[Fig. 1(a): Overview structures of three temporal fusion approaches (early-to-early, late-to-late, and our late-to-early), where B denotes the backbone and H denotes the detection head.]

I. INTRODUCTION

The goal of LiDAR temporal fusion is to aggregate learned history information to improve point-cloud-based tasks. The history information could be of various implicit (e.g. latent embeddings), explicit (e.g.
point clouds, 3D box tracklets) representations, or a mixture of both, depending on the models and tasks at hand. Temporal fusion is critical for multiple driving-related tasks, such as 3D object detection, tracking, segmentation, and behavior prediction. Here we mainly study LiDAR-based fusion methods for 3D object detection, which is a crucial task for recognizing and localizing surrounding objects in modern autonomous driving systems. Point clouds of a single frame can only serve as a partial observation of the scene, lacking complete coverage of environment context and agent dynamics. This information bottleneck is caused by several factors such as object self-occlusion, occlusion by other objects, sensor field-of-view limitations, and data noise. Moreover, for moving objects, models with only single-frame data will struggle to understand their short-term states (velocities, accelerations) and long-term intentions (future trajectories). Tackling these issues demands effective ways of LiDAR temporal fusion, which can enable the model to understand scene/object attributes and dynamics over a wide time horizon.

The main challenge of temporal fusion is how to represent and aggregate the long-sequence information of history frames. See Figure 1a for a high-level illustration and comparison. Generally speaking, previous solutions can be classified into two types. One of the most widely used methods is early-to-early fusion based on point cloud stacking.

*Waymo LLC. {simpleig,tanmingxing}@waymo.com

[Fig. 1(b): Performance comparisons on Waymo Open Dataset, 3D AP for large objects in WOD: Early-to-Early 49.7, Late-to-Late 49.1, Late-to-Early 54.4.]

Fig. 1: Comparisons of temporal fusion approaches. Our late-to-early fusion approach achieves better detection quality (e.g. 54.4 3D AP for the challenging large objects) than previous early-to-early and late-to-late methods.

Multi-frame LiDAR points are directly stacked together as model inputs, resulting in better performance than a single frame of LiDAR points.
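As a concrete illustration of this stacking scheme, here is a minimal numpy sketch. Appending a per-point time-offset channel to mark each point's source frame is one common convention (described further in Section II); the function and parameter names are our own, not from the paper.

```python
import numpy as np

def stack_frames(frames, timestamps):
    """Early-to-early fusion: concatenate multi-frame point clouds.

    frames: list of (N_i, 3) xyz arrays, oldest first.
    timestamps: one float per frame; the last frame is "current".
    Each point receives an extra channel holding its frame's time
    offset w.r.t. the current frame, marking its source frame.
    """
    t_now = timestamps[-1]
    stacked = []
    for pts, t in zip(frames, timestamps):
        offset = np.full((pts.shape[0], 1), t - t_now, dtype=pts.dtype)
        stacked.append(np.concatenate([pts, offset], axis=1))
    # (sum of N_i, 4): every frame is re-processed whenever it is
    # re-stacked for an adjacent frame, which is the cost discussed below.
    return np.concatenate(stacked, axis=0)
```

Note how the total input size grows linearly with the number of stacked frames, which is the memory/compute bottleneck this paper sets out to avoid.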
However, the performance quickly saturates when more frames are simply stacked together [1] without careful modeling of the inter-frame relationships. Moreover, each frame needs to be repeatedly processed when it is stacked into different adjacent frames, greatly increasing computation cost. Fitting long sequences will also greatly increase memory cost, reduce model efficiency, or even result in out-of-memory (OOM) issues. Ideally, a model should leverage what it has already learned from the data, not simply stack its raw sensory inputs. To overcome this issue, another type of fusion method turns to late-to-late fusion so as to utilize the learned history embeddings. A representative method is ConvLSTM [1], which recurrently fuses latent embeddings between consecutive frames at deep layers of the model. This approach reduces memory usage and computation cost, but its results are usually inferior to early-to-early fusion, as shown in Figure 1b. We suspect that this is because the backbone only has access to single-frame data before late fusion happens. The task of understanding temporally fused deep features falls upon the detection heads, which usually consist of low-capacity multi-layer perceptron (MLP) layers. Consequently, most state-of-the-art LiDAR 3D object detectors (e.g. PVRCNN++ [2], [3], CenterPoint [4], SST [5], SWFormer [6], etc.) still rely on early-to-early fusion with point cloud stacking.

[arXiv:2309.16870v1 [cs.CV] 28 Sep 2023]

In this paper, we propose a new fusion method named LEF: Late-to-Early temporal Fusion. We argue that this fusion scheme can leverage learned history knowledge, while its backbone does not suffer from single-frame data deficiency issues. Long-history LiDAR fusion is a fundamental building block for autonomous driving, and our work opens a promising direction towards that goal. There are three main contributions in our paper:

• We propose a recurrent architecture that fuses late-stage sparse pillar features into early stages of the next frame. To align the underlying static objects, we propose an inverse calibration and alignment module to fuse history and current sparse sets of pillar features. As for moving objects, we leverage window-based attention layers, which can associate relevant features within the windows and thus connect pillar tokens that belong to the same object.

• While point stacking struggles to cache and preprocess huge point clouds as the history length grows, we leverage a bird's eye view (BEV) foreground pillar segmentation module to achieve long-sequence fusion at a low constant cost. The number of sparse voxels that our model needs to fuse at each recurrent step is reduced by over 10× via the foreground segmentation process.

• We also propose a stochastic-length FrameDrop training recipe. It exposes the model to an augmented, large motion space of pillar trajectories across time. Thus our recurrent model can capture objects of different speeds, and generalize to variable frame lengths during inference for improved performance.

The proposed late-to-early temporal fusion scheme leads to improved 3D detection results on the widely used Waymo Open Dataset (WOD) [7] and demonstrates large gains on challenging large objects. We also conduct extensive ablation studies on various design choices made in our method, providing several interesting insights.

II. RELATED WORK

3D Object Detection. LiDAR-based 3D object detection plays an essential role in autonomous driving. Early research efforts such as PointRCNN [8] usually operate on raw 3D point clouds through PointNet(++) [9]–[11]. But they struggle to generalize to large-scale data, such as long-sequence fused LiDAR [7] with millions of points. Heavily relying on MLP-based backbones, these detectors were soon outperformed by models with more advanced architectures like submanifold sparse convolution [12] or Transformers [13]–[15]. By voxelizing free-shape point sets into regular 2D¹ or 3D-shape voxels, LiDAR-based detectors [16]–[18] can leverage numerous advancements from image 2D object detection, and have started to demonstrate promising 3D detection results. Particularly, CenterPoint [4] utilizes sparse convolution layers and CenterNet-based detection heads [19] to predict 3D boxes. Some recent works, such as SST [20] and SWFormer [6], exploit Swin-Transformer [21] and push the detection performance to a new state of the art. Meanwhile, several methods [2], [3], [22]–[30] look into alternative LiDAR representations and strive towards a balance between detection efficiency and efficacy.

¹2D-shape voxels are often referred to as pillars.

LiDAR Temporal Fusion. Compared with the rapid progress achieved on 3D detection backbones, approaches to LiDAR temporal fusion are less well studied. Point clouds of a single frame in WOD [7] already cause a huge computation burden (i.e., ∼200k points), let alone long history sequences. As briefly discussed in the introduction, LiDAR temporal fusion solutions can generally be classified into three types: early-to-early, late-to-late and late-to-early fusion. Early-to-early fusion is also referred to as point cloud stacking. It is most widely adopted in recent LiDAR object detectors (e.g. CenterPoint [4], RSN [22], SWFormer [6], etc.) due to its simple setup. Multi-frame point sets are merged together. Timestamp offsets w.r.t. the current frame are appended to the sensory signals of each 3D point to serve as markers indicating different frame sources. However, point stacking struggles to work on long sequences due to the cost of fusing, saving and jointly preprocessing millions of points. It is also possible to use a Transformer to early-fuse point clouds from different frames [31]. While early-to-early fusion simply stacks raw sensory inputs without carefully modeling inter-frame relationships and ignores knowledge learned from prior frames, late-to-late fusion tries to tackle these issues via ConvLSTM [1], [32]. It recurrently fuses sparse latent embeddings between deep layers of the backbone with improved efficiency over point stacking, but the results are often not as competitive as early-to-early fusion. This is presumably because its backbone can only utilize single-frame data until fusion happens at deep layers. 3D-MAN [33] may also be viewed as a form of late-to-late fusion, because the temporal fusion in this method is done through various kinds of cross-attention between box proposals and features in the memory bank, which are both after the backbone of its network. FaF [34] studied both early fusion and late fusion. To the best of our knowledge, late-to-early fusion has not been explored before in LiDAR detectors. A similar fusion framework is studied in [35], but targeting camera-based detection. It faces very different challenges from ours: we need to process sparsely distributed 3D data at wide ranges, which requires dedicated designs for sparse feature alignment and fusion, as well as new training recipes.

Finally, we note that our review so far concentrates on single-stage trainable models that internalize the temporal fusion scheme. It is also possible to follow up the box predictions with a second-stage offline refinement, using the terminology from a recent exemplar of this two-stage approach, MPPNet [36]. MPPNet runs a pre-trained CenterPoint [4] on 4-frame stacked LiDAR point clouds to generate anchor boxes, which are then tracked and aggregated across long sequences. Specifically, latent embeddings or raw points within the box regions of one frame are cropped and intertwined with those extracted from other frames in order to refine the box states. The key differentiating factor of the two-stage approach is that the two stages / models are trained separately [36], suggesting that improvements inherently built into the first stage, like ours, are complementary to second-stage innovations.

III. METHOD

A. Problem Statement

We use {P_i}, i = 1,...,T to represent a consecutive sequence of LiDAR point clouds with P_i : {X_{i,j} ∈ R^3}, j = 1,...,N_i. Our goal is to detect 3D object boxes {B_{i,m}}, m = 1,...,M_i for each frame t using {P_i | i ⩽ t}. Ideally the model should be capable of fusing history information F(P_1,...,P_t) up to the current timestamp t, where F(·) denotes the fusion function. LiDAR temporal fusion is known to be an open challenge due to the sparse and wide-range spatial distribution of point clouds, let alone diverse object dynamics. Currently, early-to-early fusion (i.e., point stacking) P_{t−l} ∪ ... ∪ P_t is most widely used, as it is easy to implement. However, due to memory constraints the sequence length is usually small, e.g. l ∈ {2,3}. Moreover, the point clouds {X_{i,j}} of one frame have to be repeatedly processed (l+1) times when we conduct model inference on adjacent frames, causing a huge waste of computation. As for detection performance, whether directly stacking the raw sensory inputs without reusing learned history knowledge can lead to optimal results also remains questionable.

B. Recurrent Late-to-Early Fusion

To address the aforementioned issues, we propose a recurrent late-to-early temporal fusion strategy. As shown in Figure 2, the fusion pipeline works like a "Markov chain", which can accumulate history information from long sequences and reduce redundant computation. Thus, the fusion function F(·) can be iteratively defined as:

f_i = ψ(h(f_{i−1} ⊕ τ(t_i − t_{i−1}), ν({X_{i,j}})))    (1)

where f_{i−1} indicates the history deep-layer voxel embeddings, and τ(·) is a Sinusoidal function for encoding the timestamp offset. ν(·) represents the VoxelNet [18] used to obtain pillar features from point clouds. h(·) is the backbone for recurrent fusion and multi-scale sparse pillar feature extraction, and ψ(·) is the foreground segmentation module.

History features. In particular, we use the latent features of segmented foreground pillars as f_{i−1} and pass them into the next timestamp. Without loss of generality, we use SWFormer [6] as our backbone and center-based detection heads [4] as examples in the following discussion where needed. The diagram is plotted in Figure 2. The model works on sparse pillar tokens, and thus the segmentation outputs can be written as f_{i−1} : {V_{i−1,k} ∈ R^{2+d}}, k = 1,...,K_{i−1}. The first two dimensions record the BEV coordinates of the pillars and the rest are extracted embeddings (i.e., d = 128), which contain rich scene- and object-aware information. Moreover, compared with the raw point cloud size N_{i−1} (∼200k), the foreground pillar feature set size K_{i−1} (∼2k) is much smaller. Therefore, we are motivated to fuse these deep-layer features into early stages of the next frame in order to efficiently reuse learned high-level knowledge for 3D detection, especially on challenging large objects.

Fusion location. To achieve recurrent late-to-early fusion, we fuse f_{i−1} with the VoxelNet [18] outputs ν({X_{i,j}}) ↦ {V′_{i,n} ∈ R^{2+d}}, n = 1,...,N′_i before they are fed into the main backbone network. Instead of early fusion before the backbone, some may argue that an alternative is conducting late fusion after the backbone, which is close to the network stage where f_{i−1} is extracted. Diagrams of these two different fusion locations are plotted in Figure 1. We think that late fusion presumably causes the backbone B to lose access to temporally aggregated LiDAR sequence information, so that the low-capacity detection heads H struggle to understand the fused features and predict object poses and shapes. Ablation studies on early-to-early, late-to-late and our proposed late-to-early fusion methods are provided in Table IV and Section IV-C, which empirically prove the advantages of our approach.

C. Inverse Calibration and Alignment

While image sequences are naturally aligned across frames by their shapes (height, width, channel), the sparse sets of pillar features {V_{i−1,k}}, {V′_{i,n}} are neither aligned nor of the same cardinality (i.e., K_{i−1} ≠ N′_i). Intuitively, one could convert the sparse features into dense BEV maps {V_{i−1,k}} ↦ I_{i−1} ∈ R^{H×W×d}, {V′_{i,n}} ↦ I′_i ∈ R^{H×W×d} and then align them. However, as Figure 2 shows, directly doing so without proper calibration can result in misalignment between the underlying objects of the scene. This is because the pillar features extracted by the backbone are in their corresponding local vehicle coordinates with poses g_{i−1} ∈ R^{4×4}, g_i ∈ R^{4×4}. To alleviate this misalignment issue, we need to calibrate the history BEV maps I_{i−1}:

I_{i−1} ∘ g_{i−1}^{−1} ∘ g_i ↦ Ĩ_{i−1}    (2)

where ∘ means applying a vehicle coordinate transformation and Ĩ_{i−1} represents the calibrated BEV maps. In practice, however, if we apply forward calibration upon I_{i−1} we might get more than one pillar falling into the same discrete coordinates within Ĩ_{i−1}. To address this issue, we conduct the inverse transformation from Ĩ_{i−1} to I_{i−1} and sample the history BEV features. We use zero padding to fill in the pillar features of empty samples and of out-of-view locations (e.g. the red cross markers in Figure 2). The inversely calibrated history maps can now be aligned with the current maps by feature concatenation Ĩ_{i−1} ⊕ I′_i ↦ J_i ∈ R^{H×W×2d}. Next, we apply an MLP on J_i for dimension reduction (i.e., 2d ↦ d) and get the temporally aligned pillar features J′_i. Note that not all coordinates within J′_i have valid features. We use the union BEV boolean mask O_i ∈ R^{H×W}, obtained from the current and calibrated history BEV features, to mark the valid coordinates of J′_i. Thus, we do not lose data sparsity.

[Fig. 2: Detection pipeline with our proposed LEF (per frame: inverse calibration, alignment, time encoding, attention, backbone, segmentation MLP, detection head). In each forward pass, the early-stage pillar encoding is aligned and fused with the history late-stage foreground pillar features f_{i−1}. The alignment is achieved by an inverse calibration and alignment process (Section III-C) that enables pillar features of the underlying static objects to be matched. To effectively associate moving object features, we further use window-based attention blocks (Section III-D) to connect relevant pillars. Outputs from the attention fusion layers are then fed into the main backbone network (e.g. SWFormer [6]), followed by a foreground pillar segmentation layer and the final detection head [4] for 3D bounding box predictions.]

D. Window-based Attention Fusion

Pillars of static objects are effectively aligned after the prior steps, but moving ones still face the misalignment issue. One solution is to apply flow estimation to further calibrate the history BEV features Ĩ_{i−1} before temporal alignment with I′_i. But that requires adding additional occupancy flow models, losses and feature coordinate transformations, which might greatly increase the computation overhead of the 3D object detector. Therefore, we propose to learn such associations implicitly from the data via window-based attention blocks. We sparsify the dense BEV feature map J′_i and its boolean mask O_i into a sparse set of pillar tokens {V′′_{i,u}}, u = 1,...,U_i. Usually U_i ⩾ N′_i, because the cardinality U_i is the number of fused pillars after temporal alignment between the history and current features through the steps in Section III-C. While {V′′_{i,u}} is used as the query tensor for the attention blocks, we can make different choices for the key and value tensors: using {V′′_{i,u}} again, or the sparsified set of history pillar tokens from (2): Ĩ_{i−1} ↦ {Ṽ_{i−1,c}}, c = 1,...,K̃_{i−1}. Most often, K̃_{i−1} ⩽ K_{i−1} due to out-of-view truncation after vehicle coordinate calibration. The resulting variants are self- / cross- / mix-attention. In self-attention the key and value tensors are the same as the query. Cross-attention uses {Ṽ_{i−1,c}} as key and value, and mix-attention uses the union set of the prior two attention variants. We apply Sinusoidal-function-based absolute positional encoding to inform the attention blocks of the sparse pillar coordinates within a window. Detailed ablation studies on the different attention designs are provided in Section IV-C. With window-based attention fusion, features of both static and moving pillars can now be associated and fused before being passed into the main backbone network.
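The inverse calibration step of Section III-C (Eq. (2)) can be sketched as follows. This is a minimal numpy illustration, assuming an ego-centered BEV grid and 4×4 vehicle-to-world poses; the grid layout, cell-center convention and function names are our own simplifications, not the paper's implementation.

```python
import numpy as np

def inverse_calibrate(hist_bev, g_hist, g_cur, cell=0.32):
    """Warp a history BEV feature map into the current vehicle frame.

    hist_bev: (H, W, d) dense BEV features in the history frame.
    g_hist, g_cur: (4, 4) vehicle-to-world poses.
    For every cell of the output map we inverse-map its center into
    the history frame and sample that cell's features; cells falling
    outside the history view are zero-padded, so no target cell ever
    receives more than one source pillar (the forward-warp problem).
    """
    H, W, d = hist_bev.shape
    out = np.zeros_like(hist_bev)
    # Rigid transform: current vehicle frame -> world -> history frame.
    T = np.linalg.inv(g_hist) @ g_cur
    for r in range(H):
        for c in range(W):
            # Cell center in current-frame metric coords (grid centered on ego).
            x = (c - W / 2 + 0.5) * cell
            y = (r - H / 2 + 0.5) * cell
            xh, yh = (T @ np.array([x, y, 0.0, 1.0]))[:2]
            rh = int(np.floor(yh / cell + H / 2))
            ch = int(np.floor(xh / cell + W / 2))
            if 0 <= rh < H and 0 <= ch < W:
                out[r, c] = hist_bev[rh, ch]
    return out
```

After this warp, the calibrated map would be concatenated channel-wise with the current map and reduced by an MLP, as described above; a production version would vectorize the per-cell loop and use the real 0.32 m pillar resolution from Section III-F.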
E. Stochastic-Length FrameDrop

To enable robust training upon long sequences, we randomly drop history frames from (P_1,...,P_t) during each training iteration. In other words, we randomly sample S_i history frames, with S_i being a stochastic number at different training steps, and the sampled frames are not necessarily adjacent. In comparison, previous LiDAR temporal fusion methods usually fix S_i to a constant (e.g. 3 or 4) and sample consecutive frames. We apply stop-gradient between recurrent passes when fusing deep-layer history features into early layers of the next frame, without which long-sequence training of 3D object detectors can easily become intractable or run into OOM. During training, the model only predicts 3D boxes {B̂_{i,m}} in the last forward pass. Losses are

TABLE I: Overall performance comparisons on Waymo Open Dataset. "Refine" means that the detector needs an additional step of box refinement via feature pooling and fusion from the box areas, which usually increases time cost and might not be end-to-end trainable. For fair comparisons we focus on single-stage detectors without (w/o) box refinement.
Method | Refine | Test set L1 3D AP/APH | Test set L2 | Validation set L1 | Validation set L2
3D-MAN [33] | with | 78.71 / 78.28 | 70.37 / 69.98 | 74.53 / 74.03 | 67.61 / 67.14
CenterPoint [4] | with | 80.20 / 79.70 | 72.20 / 71.80 | 76.60 / 76.10 | 68.90 / 68.40
SST [5] | with | 80.99 / 80.62 | 73.08 / 72.72 | 77.00 / 76.60 | 68.50 / 68.10
PVRCNN++ [2] | with | 81.62 / 81.20 | 73.86 / 73.47 | 79.30 / 78.80 | 70.60 / 70.20
MPPNet [36] | with | 84.27 / 83.88 | 77.29 / 76.91 | 82.74 / 82.28 | 75.41 / 74.96
CenterFormer [37] | with | 84.70 / 84.40 | 78.10 / 77.70 | 78.80 / 78.30 | 74.30 / 73.80
PointPillars [16] | w/o | 68.60 / 68.10 | 60.50 / 60.10 | 63.30 / 62.70 | 55.20 / 54.70
RSN [22] | w/o | 80.70 / 80.30 | 71.90 / 71.60 | 78.40 / 78.10 | 69.50 / 69.10
SWFormer [6] | w/o | 82.25 / 81.87 | 74.23 / 73.87 | 79.03 / 78.55 | 70.55 / 70.11
LEF (ours) | w/o | 83.39 / 83.02 | 75.51 / 75.16 | 79.64 / 79.18 | 71.37 / 70.94

enforced upon certain intermediate outputs (e.g. foreground pillar segmentation) and the final box parameter predictions (e.g. shapes and poses):

L = λ_1 L_seg + λ_2 L_center + L_box    (3)

in which L is the total loss. L_seg is a focal loss for foreground segmentation. L_center is also based on focal loss, but for object-center heatmap estimation [4], [38]. L_box contains SmoothL1 losses for box azimuth, center offset and size regression. A detailed explanation is in [6].

The training randomness introduced by LiDAR sequence sampling enables the model to be robust to various motion patterns of pillar trajectories across time. Thus our recurrent model can understand different object dynamics, and generalize to variable frame lengths during inference without retraining. More experiments and analysis are provided in Table VI and the ablation studies.

F. Implementation Details

We conduct 3D object detection within a wide-range 164×164 meter (m) square zone centered on the top LiDAR sensor. Point clouds inside this region are voxelized into 2D pillars with 0.32 m spatial resolution. The window attention blocks use 10×10 grouping sizes. The loss weights λ_1, λ_2 defined in (3) are 200 and 10, respectively. We use the AdamW [39], [40] optimizer with a batch size of 128 and 240k iterations for distributed training on 128 TPUv3. The training takes about 2 days. TPU memory usage is 5.4 GB on average and 7.4 GB at peak. The first 10k steps warm up the learning rate from 5.0e-4 to 1.0e-3, after which the learning rate follows a cosine annealing schedule to zero.

IV. EXPERIMENTS

In this section, we compare our model with other state-of-the-art methods, and perform ablation studies on the impact of our designs on detection performance.

A. Dataset and Backbone

We choose the Waymo Open Dataset [7] over nuScenes [41] and KITTI [42] because WOD has large-scale, high-quality LiDAR data, which better simulates the settings for developing on-road fully autonomous vehicles. There are about 160k annotated training frames in WOD but only around 30k frames in nuScenes. As for per-frame point cloud densities, WOD is ∼200k points and nuScenes is ∼30k. Therefore WOD is widely used in recent LiDAR-based methods: PV-RCNN(++), SST, RSN, SWFormer and so on [2]–[4], [6], [20], [22], [24], [26], [33], [36]. WOD has 798 training sequences, 202 validation and 150 test sequences, covering diverse driving scenarios and agent statuses. The LiDAR data collection frequency is 10 Hz. Each frame of point clouds consists of data gathered from five sensors: one long-range and four short-range LiDARs. For evaluation metrics, we adopt the officially recommended 3D AP / APH under two difficulty levels (L1, L2), which depend on the point densities of the ground-truth bounding boxes. APH is a weighted variant of AP using heading angles (i.e., azimuth). We adopt the state-of-the-art SWFormer [6] as our detection backbone, and replace its original early-to-early LiDAR fusion with our proposed LEF. For fair comparisons, all training settings are kept the same as [6].

B. Main Results and Comparisons

The overall vehicle detection results alongside other competing methods are in Table I. We compare against methods both with and without box refinement steps, although our model is a single-stage method without refinement and generally more efficient than those with box refinement. Our method LEF surpasses the prior best single-stage model SWFormer by +1.3 3D APH on L2 test data (75.16 vs. 73.87), demonstrating the strong overall performance of our approach.

TABLE II: Detection results on challenging large objects.
Method | L1 2D | L1 3D | L2 2D | L2 3D
RSN [22] | 53.10 | 45.20 | - | 40.90
SWFormer [6] | 58.33 | 49.74 | 53.45 | 45.23
LEF (ours) | 62.63 | 54.35 | 57.42 | 49.34

TABLE III: Computation cost. For fair comparisons, we use 3-frame temporal fusion settings on WOD for measurement.
Method | Latency | Flops | Parameters
PointPillars [16] | 93 ms | 375G | 6.4M
SWFormer [6] | 47 ms | 35G | 4.4M
LEF (ours) | 38 ms | 29G | 4.6M

TABLE IV: Ablation studies on different types of temporal fusion schemes. All methods are trained with SLF.
Fusion Strategy | L1 2D | L1 3D | L2 2D | L2 3D
Early-to-Early | 58.33 | 49.74 | 53.45 | 45.23
Late-to-Late | 58.74 | 48.83 | 53.67 | 44.32
Late-to-Early | 61.46 | 53.13 | 56.37 | 48.28

TABLE V: Ablation studies on different object sizes. The 3D AP gains achieved by LEF increase as object size grows.
Method | L1 Large | L1 Medium | L1 Small | L2 Large | L2 Medium | L2 Small
RSN [22] | 45.20 | 77.30 | 79.40 | 40.90 | 68.60 | 69.90
SWFormer [6] | 49.74 | 79.11 | 82.36 | 45.23 | 70.59 | 74.04
LEF (ours) | 54.35 | 79.62 | 82.46 | 49.34 | 71.32 | 74.15

[Fig. 3: Qualitative comparisons. Box colors are explained in the legend: green = ground truth, yellow = SWFormer, blue = LEF (ours); low-3D-IoU, false positive and false negative errors of the baseline SWFormer are highlighted in dashed red regions.]

Our method is particularly useful for detecting challenging large objects whose maximum dimension is beyond 7 meters: trucks, buses, construction vehicles, etc.
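The object-size buckets used in the following analysis can be expressed as a tiny helper. The 7 m cut-off for large objects is stated above; the 2.5 m medium/small split is only an illustrative assumption of ours, as the paper does not specify it.

```python
def size_bucket(length, width, height):
    """Bucket an object by its maximum box dimension (meters).

    The 7.0 m threshold for "large" follows the paper; the 2.5 m
    medium/small split is a hypothetical value for illustration.
    """
    m = max(length, width, height)
    if m > 7.0:
        return "large"
    return "medium" if m > 2.5 else "small"
```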
We conduct a detailed analysis on the validation set in Table II. Our method LEF outperforms SWFormer by a +9.3% relative increase on L1 3D AP: 54.35 vs. 49.74. Hard cases such as large vehicles suffer from partial observation issues more often than small or medium size objects. Faithfully detecting these challenging cases requires LiDAR temporal fusion over long frame lengths in order to enlarge the sensory data coverage. Moreover, our late-to-early fusion scheme can reuse learned scene- and object-aware latent features from prior frames, rather than simply stacking the point clouds as in RSN and SWFormer. Such high-level history knowledge enables the model to more easily tackle challenging detection cases, compared with solving them from scratch using stacked raw sensory inputs.

Qualitative results are visualized in Figure 3. Typical errors of SWFormer are highlighted in the red zones. Our results are aligned better (i.e., have higher 3D IoU) with the ground truth boxes than the SWFormer predictions, especially for challenging large objects. Moreover, our results contain fewer false negative and false positive predictions than the SWFormer results. We also measure model latency, flops and parameter sizes of different LiDAR 3D object detectors in Table III, following the same benchmark settings as [6]. PointPillars and SWFormer both use point stacking. The results demonstrate the efficiency advantages of our late-to-early recurrent fusion method.

C. Ablation Studies

Fusion strategy. We conduct apples-to-apples comparisons to study the effect of the early-to-early (E2E), late-to-late (L2L) and late-to-early (L2E) fusion strategies as illustrated in Figure 1a. Specifically, we test all fusion variants with the same backbone and frame number (i.e., 3) to factor out the influence of model architectures and LiDAR sequence lengths. Results on validation set large objects are in Table IV. Our L2E fusion surpasses the other two methods with 7.8% relative gains on L1 3D AP. Comparing E2E and L2L fusion, we observe that their results on 2D AP are comparable, but E2E clearly outperforms L2L on 3D AP, indicating higher 3D object detection quality. These results validate our arguments about the benefits of late-to-early fusion. Compared with E2E fusion, L2E enables the model to reuse learned scene- and object-aware knowledge from prior frames. Compared with L2L, the model capacity of L2E fusion is not constrained, because its backbone has early access to the temporally aggregated sensory data.

Different object sizes. Besides the overall results and hard-example analysis in Section IV-B, we are also interested in the impact of our method on different object sizes. Thus we divide the validation set objects into large, medium and small. Typical large objects are buses and trucks; medium and small objects usually include sedans and pedestrians, respectively. Detailed results are in Table V. Although our method LEF achieves comparable results with the competing methods on small objects, we observe increasingly larger gains as object size grows. On L2 medium objects, LEF improves SWFormer by 0.73 AP, and the gain further grows to 4.11 AP on large objects. One possible explanation is that small objects suffer less from partial-view observation issues than large objects, and thus do not significantly benefit from temporal fusion. From these results we believe that our method works robustly across different object sizes.

TABLE VI: Long frame history generalization studies. For each trained model, we evaluate its inference generalization ability to different frame (f) lengths without retraining.

TABLE VIII: Variants of window-based attention blocks for recurrent temporal fusion. Based on the comparisons, we adopt self-attention as the default in other experiments.
L1 L2 L1 L2 Method Attention Type 3-f 6-f 9-f 3-f 6-f 9-f 2D 3D 2D 3D SWFormer[6] 46.23 38.76 OOM 41.93 35.09 OOM Cross-Attn 51.69 42.35 47.06 38.36 LEF(w/oSLF) 51.18 51.44 50.84 46.58 46.91 46.28 Mix-Attn 61.68 52.94 56.46 48.06 LEF(withSLF) 53.13 53.96 54.35 48.28 48.99 49.34 Self-Attn 62.63 54.35 57.42 49.34 TABLE VII: Inverse calibration and alignment (ICA) can improve detection AP across different object sizes. TABLEIX:Theimpactofwindow-basedself-attentionon different speed objects. Large Medium Small ICA 2D 3D 2D 3D 2D 3D Self-Attention Static Slow Medium Fast VeryFast w/o 60.85 51.34 92.72 78.30 85.92 80.59 without 60.55 63.46 74.58 53.07 75.47 with 62.63 54.35 93.02 79.62 87.40 82.46 with 66.62 69.27 79.62 62.46 82.14 Frame length generalization. Due to memory constraint temporal alignment process. In Table VII we show that in- of the computing devices, GPU or TPU, 3D object detectors versecalibrationandalignmentachievesconsistentdetection with LiDAR temporal fusion usually sample a fixed number improvement across different size objects, including truck, ofhistoryframes(e.g.2or3)duringtraining.However,dur- sedan, pedestrian, and so on. inginference,thereareusuallyadditionalframesavailableto Window-based Attention Fusion. We apply window- themodeldependingonthehistorylengths.Fortypicalearly- based attention blocks on temporally aligned sparse pil- to-earlyfusionbasedmulti-framedetectors(e.g.CenterPoint, lar tokens to further fuse information of the history and SWFormer), if we want to test a trained model on different current frames. As explained in Section III-D, we explore frame lengths, the training settings need to be modified three different attention designs: self / cross / mix-attention. and the model needs to be retrained. With stochastic-length Detection AP on large objects of WOD validation set are FrameDrop (SLF), LEF can generalize to variable frame shown in Table VIII. For all methods, we use the sparse lengths without retraining. 
It can leverage additional frames and achieve increasingly better results. Large object 3D AP is shown in Table VI. In contrast, SWFormer and LEF without SLF cannot make the best of long history and might even suffer performance decreases. This is because long history frames can exhibit diverse motion patterns of temporally aggregated data, posing generalization difficulties for methods trained without SLF. Moreover, since SWFormer is based on point cloud stacking, it runs out of memory (OOM) if we simply stack a long LiDAR sequence into millions of 3D points and use them as inputs. These observations indicate that stochastic-length FrameDrop and recurrent fusion are critical in generalizing our method LEF to variable frame lengths during inference.

Foreground pillar segmentation. To efficiently fuse history pillar features in a recurrent manner, we apply BEV foreground segmentation before passing history latent pillar embeddings into the next frame. The number of history pillars that need to be recurrently fused can be reduced from ∼20k to ∼2k on average after removing a huge amount of uninformative background data. Therefore the computation burden of our late-to-early temporal fusion scheme is greatly reduced and maintained at a relatively low, near-constant cost.
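The foreground gating described above can be sketched as a simple score threshold over sparse pillars. This is a minimal NumPy illustration under our own assumptions (the segmentation head, threshold value, and names are hypothetical), not the paper's exact implementation:

```python
import numpy as np

def select_foreground_pillars(pillar_feats, fg_scores, threshold=0.5):
    """Keep only pillars whose BEV foreground score exceeds a threshold.

    pillar_feats: (N, C) sparse pillar embeddings of the current frame.
    fg_scores:    (N,) per-pillar foreground probabilities predicted by
                  a BEV segmentation head.

    Returns the retained features and their indices. Only this much
    smaller set needs to be carried into the next frame's recurrent
    fusion step (e.g. roughly 2k of 20k pillars survive on average).
    """
    keep = fg_scores > threshold
    return pillar_feats[keep], np.nonzero(keep)[0]
```

The design choice is that background pillars contribute little to detection, so dropping them before recurrence keeps the per-frame fusion cost roughly constant regardless of history length.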
Inverse calibration and alignment. Inverse calibration and alignment, as illustrated in Figure 2, is important for fusing two sparse sets of pillar features between the prior and the current frames. Features belonging to the same underlying static objects can be effectively aligned after this temporal alignment process. In Table VII we show that inverse calibration and alignment achieves consistent detection improvements across objects of different sizes, including trucks, sedans, pedestrians, and so on.

Window-based attention fusion. We apply window-based attention blocks on temporally aligned sparse pillar tokens to further fuse information of the history and current frames. As explained in Section III-D, we explore three different attention designs: self / cross / mix-attention. Detection APs on large objects of the WOD validation set are shown in Table VIII. For all methods, we use the sparse set of pillar tokens {V′′_{i,u}} converted from the temporally aligned BEV feature map J′_i as the query tensor. In self-attention, query, key and value are based on the same tensor. In cross-attention, the key and value tensors are the sparse set of pillar tokens {Ṽ_{i−1,c}} converted from the calibrated history features Ĩ_{i−1}. Mix-attention uses the union of the two sets as key and value. We observe that self-attention consistently outperforms the other two attention variants. This is presumably because the history tokens exist in a quite different latent space from the temporally aligned tokens. Therefore attention between {Ṽ_{i−1,c}} and {V′′_{i,u}} may easily lead to intractable feature fusion and eventually hurt detection. Meanwhile, since J′_i has already merged information from the history Ĩ_{i−1} and the current I_i, self-attention is competent to associate relevant pillar tokens and fulfill the fusion task.

Window-based attention fusion also plays an important role in fusing the information from moving object pillars. In Table IX, we present validation set 3D AP comparisons between with and without window-based self-attention fusion. We report subcategory metrics under different speed ranges: [0, 0.45), [0.45, 2.24), [2.24, 6.71), [6.71, 22.37), and [22.37, +∞) miles per hour for static, slow, medium, fast, and very fast objects. The metrics are averaged over objects of different sizes. We observe that attention fusion brings consistent detection gains across different object speed ranges. In particular, the improvements achieved on high-speed objects are larger than those on low-speed objects: +9.4 (fast) vs. +6.1 (static) 3D AP gains. These comparisons empirically show that window-based self-attention fusion is critical in associating relevant pillars that belong to the same underlying objects, which is especially important for moving object detection.
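As a concrete illustration of this fusion step, here is a minimal NumPy sketch of self-attention applied independently within each BEV window of sparse pillar tokens. It is our own simplification: the learned query/key/value projections, multi-head structure, and the model's actual window partitioning are omitted, and all names are hypothetical:

```python
import numpy as np

def window_self_attention(tokens, window_ids):
    """Scaled dot-product self-attention, run separately per BEV window.

    tokens:     (N, C) sparse pillar tokens (already temporally fused
                into the aligned BEV feature map).
    window_ids: (N,) integer window assignment of each token.

    Query, key and value all come from the same token set
    (self-attention); learned projections are omitted for brevity.
    """
    out = np.empty_like(tokens)
    c = tokens.shape[1]
    for w in np.unique(window_ids):
        idx = np.nonzero(window_ids == w)[0]
        x = tokens[idx]                       # tokens inside one window
        logits = x @ x.T / np.sqrt(c)         # pairwise attention logits
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True)      # softmax rows
        out[idx] = attn @ x                   # value = the same tokens
    return out
```

Restricting attention to local windows keeps the cost linear in the number of sparse tokens rather than quadratic over the whole BEV grid.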
V. CONCLUSIONS AND FUTURE WORK

In this paper, we conduct an in-depth study on the temporal fusion aspect of 3D object detection from LiDAR sequences. We propose a late-to-early temporal feature fusion method that recurrently extracts sparse pillar features from both object-aware latent embeddings and raw LiDAR sensor inputs. To handle the alignment issues of static and moving objects, we propose inverse calibration and alignment as well as window-based attention fusion methods. We also apply foreground segmentation to obtain sparse pillar features from history frames for computation reduction. The resulting model, LEF, performs favorably against its base model SWFormer in both detection quality and efficiency. The improvement is especially significant on large objects, which require multiple LiDAR sweeps fused across space and time to achieve a high surface coverage rate.

As future work, we plan to extend our method to multi-modal sensor fusion, with a focus on integrating camera and radar information. Recurrent late-to-early temporal fusion schemes like ours and BEVFormer [35] have been explored in very few papers. To further demonstrate the effectiveness of this approach, it would be beneficial to test it on various backbone models and extend its application beyond the scope of the 3D object detection task.

REFERENCES

[1] R. Huang, W. Zhang, A. Kundu, C. Pantofaru, D. A. Ross, T. Funkhouser, and A. Fathi, "An lstm approach to temporal 3d object detection in lidar point clouds," in ECCV, 2020, pp. 266–282.
[2] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "Pv-rcnn: Point-voxel feature set abstraction for 3d object detection," in CVPR, 2020, pp. 10529–10538.
[3] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, "Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection," ArXiv, 2021.
[4] T. Yin, X. Zhou, and P. Krähenbühl, "Center-based 3d object detection and tracking," in CVPR, 2021, pp. 11784–11793.
[5] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3d object detector with sparse transformer," in CVPR, 2022, pp. 8458–8468.
[6] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, "Swformer: Sparse window transformer for 3d object detection in point clouds," in ECCV, 2022.
[7] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., "Scalability in perception for autonomous driving: Waymo open dataset," in CVPR, 2020.
[8] S. Shi, X. Wang, and H. Li, "Pointrcnn: 3d object proposal generation and detection from point cloud," in CVPR, 2019, pp. 770–779.
[9] C. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum pointnets for 3d object detection from rgb-d data," in CVPR, 2018, pp. 918–927.
[10] C. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in CVPR, 2017.
[11] C. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," in NIPS, 2017.
[12] B. Graham and L. van der Maaten, "Submanifold sparse convolutional networks," arXiv preprint arXiv:1706.01307, 2017.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, vol. 30, 2017.
[14] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, "Point transformer," in ICCV, 2021, pp. 16259–16268.
[15] J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu, "Voxel transformer for 3d object detection," in ICCV, 2021.
[16] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in CVPR, 2019, pp. 12697–12705.
[17] Y. Yan, Y. Mao, and B. Li, "Second: Sparsely embedded convolutional detection," Sensors (Basel, Switzerland), vol. 18, 2018.
[18] Y. Zhou and O. Tuzel, "Voxelnet: End-to-end learning for point cloud based 3d object detection," in CVPR, 2018, pp. 4490–4499.
[19] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," ArXiv, vol. abs/1904.07850, 2019.
[20] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3d object detector with sparse transformer," in CVPR, 2022, pp. 8448–8458.
[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in ICCV, 2021, pp. 9992–10002.
[22] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, "Rsn: Range sparse net for efficient, accurate lidar 3d object detection," in CVPR, 2021, pp. 5725–5734.
[23] G. P. Meyer, A. G. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "Lasernet: An efficient probabilistic 3d object detector for autonomous driving," in CVPR, 2019, pp. 12669–12678.
[24] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, Z. Chen, J. Shlens, and V. Vasudevan, "Starnet: Targeted computation for object detection in point clouds," ArXiv, vol. abs/1908.11069, 2019.
[25] Y. Wang, A. Fathi, A. Kundu, D. A. Ross, C. Pantofaru, T. Funkhouser, and J. Solomon, "Pillar-based object detection for autonomous driving," in ECCV, 2020, pp. 18–34.
[26] Y. Chai, P. Sun, J. Ngiam, W. Wang, B. Caine, V. Vasudevan, X. Zhang, and D. Anguelov, "To the point: Efficient 3d object detection in the range image with graph convolution kernels," in CVPR, 2021.
[27] L. Fan, X. Xiong, F. Wang, N. Wang, and Z. Zhang, "Rangedet: In defense of range view for lidar-based 3d object detection," in ICCV, 2021, pp. 2898–2907.
[28] Z. Li, F. Wang, and N. Wang, "Lidar r-cnn: An efficient and universal 3d object detector," in CVPR, 2021, pp. 7546–7555.
[29] H. Sheng, S. Cai, Y. Liu, B. Deng, J. Huang, X.-S. Hua, and M.-J. Zhao, "Improving 3d object detection with channel-wise transformer," in ICCV, 2021, pp. 2743–2752.
[30] C. Liu, Z. Leng, P. Sun, S. Cheng, C. R. Qi, Y. Zhou, M. Tan, and D. Anguelov, "Lidarnas: Unifying and searching neural architectures for 3d point clouds," in ECCV, 2022, pp. 158–175.
[31] Z. Yuan, X. Song, L. Bai, Z. Wang, and W. Ouyang, "Temporal-channel transformer for 3d lidar-based video object detection for autonomous driving," TCSVT, vol. 32, no. 4, pp. 2068–2078, 2021.
[32] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang, "Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention," in CVPR, 2020.
[33] Z. Yang, Y. Zhou, Z. Chen, and J. Ngiam, "3d-man: 3d multi-frame attention network for object detection," in CVPR, 2021.
[34] W. Luo, B. Yang, and R. Urtasun, "Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net," in CVPR, 2018, pp. 3569–3577.
[35] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, "Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," arXiv preprint arXiv:2203.17270, 2022.
[36] X. Chen, S. Shi, B. Zhu, K. C. Cheung, H. Xu, and H. Li, "Mppnet: Multi-frame feature intertwining with proxy points for 3d temporal object detection," arXiv preprint arXiv:2205.05979, 2022.
[37] Z. Zhou, X. Zhao, Y. Wang, P. Wang, and H. Foroosh, "Centerformer: Center-based transformer for 3d object detection," in ECCV, 2022.
[38] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[40] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[41] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in CVPR, 2020.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite," in CVPR, 2012.