3D-MAN: 3D Multi-frame Attention Network for Object Detection ZetongYang1∗ YinZhou2 ZhifengChen3 JiquanNgiam3 1TheChineseUniversityofHongKong 2WaymoLLC 3GoogleResearch,BrainTeam tomztyang@gmail.com yinzhou@waymo.com {zhifengc, jngiam}@google.com Abstract 3D object detection is an important module in au- tonomous driving and robotics. However, many existing methods focus on using single frames to perform 3D de- tection, and do not fully utilize information from multi- ple frames. In this paper, we present 3D-MAN: a 3D multi-frame attention network that effectively aggregates features from multiple perspectives and achieves state-of- the-art performance on Waymo Open Dataset. 3D-MAN firstusesanovelfastsingle-framedetectortoproducebox proposals. Theboxproposalsandtheircorrespondingfea- t-3 t-2 t-1 t ture maps are then stored in a memory bank. We design a Figure 1. Upper row: Potential detections given LiDAR from a multi-viewalignmentandaggregationmodule,usingatten- single frame demonstrating ambiguity between many reasonable tion networks, to extract and aggregate the temporal fea- predictions.Lowerrow:Aftermergingthepointsalignedacross4 turesstoredinthememorybank. Thiseffectivelycombines frames,thereismorecertaintyforthecorrectboxprediction. thefeaturescomingfromdifferentperspectivesofthescene. We demonstrate the effectiveness of our approach on the IoUthreshold 0.3 0.5 0.7 large-scalecomplexWaymoOpenDataset,achievingstate- AP(%) 94.72 88.97 63.27 of-the-art results compared to published single-frame and Table1.We varytheintersection-over-union(IoU)thresholdfor multi-framemethods. considering a predicted box correctly matched to a ground-truth box,andmeasuretheperformanceofthePointPillarsmodelonthe WaymoOpenDataset’svalidationset.AlowerIoUthresholdcor- 1.Introduction respondstoallowinglessaccurateboxestomatch.Thisshowsthat improvingtheboxlocalizationcouldsignificantlyimprovemodel 3D object detection is an important problem in com- performance. putervisionasitiswidelyusedinapplications,suchasau- tonomous driving and robotics. Autonomous driving plat- formsrequireprecise3Ddetectiontobuildanaccuraterep- In the autonomous driving scenario, as the vehicle pro- resentation of the world, which is in turn used in down- gresses, the sensors pick up multiple views of the world, streammodelsthatmakecriticaldrivingdecisions. making it possible to resolve the aforementioned localiza- LiDAR provides a high-resolution accurate 3D view of tion ambiguity. Multiple frames across time can provide theworld.However,atanypointoftime,theLiDARsensor different perspectives of an observed object instance. An collectsonlyasingleperspectiveofthescene.Itisoftenthe effectivemulti-framedetectionmethodshouldbeabletoex- casethattheLiDARpointsdetectedonanobservedobject tractrelevantfeaturesfromeachframeandaggregatethem, correspondtoonlyapartialviewofit. Detectingthesepar- soastoobtainarepresentationthatcombinesmultipleper- tiallyvisibleinstancesisanill-posedproblembecausethere spectives (Figure 1). Research in 3D multi-frame detec- existmultiplereasonablepredictions(shownasredandblue tion has been limited due to a lack of available datasets boxes in the upper row of Figure 1). These potential am- withwell-calibratedmulti-framedata. Fortunately,recently biguous scenarios can be a bottleneck for single-frame 3D released large-scale 3D sequence datasets (NuScenes [2], detectors(Table1). WaymoOpenDataset[23])havemadesuchdataavailable. ∗WorkdoneduringaninternshipatGoogleBrain. Astraight-forwardapproachtofusingmulti-framepoint 1 1202 raM 03 ]VC.sc[ 1v45061.3012:viXraModel Stationary(%) Slow(%) Medium(%) Fast(%) tures. Theaggregationstagethenmergesacrossframesfor 1-frame 60.01 66.64 65.02 71.90 eachtargetproposalindependently. Thiscanbeviewedas 4-frames 62.4 67.39 66.68 77.99 a form of factorization over the attention across proposals 8-frames 63.7 67.98 66.29 72.30 andframes. Table2.VelocitybreakdownsofvehicleAPmetricsforPointPil- We evaluate our model on large-scale Waymo Open lars models using point concatenation. For the 8-frame model, wefindthatitsbenefitscomefromslow-movingvehicles. Fast- Dataset [23]. Experimental results demonstrate that our movingobjectsnolongerbenefitfromalargenumberofframes methodoutperformspublishedstate-of-the-artsingle-frame sincetheLiDARpointsarenolongeralignedacrosstheframes. methods and multi-frame methods. Our primary contribu- tionsarelistedbelow. clouds is to use point concatenation, which simply com- KeyContributions. binespointsacrossdifferentframestogether[2]. Thecom- • Wepropose3D-MAN:a3Dmulti-frameattentionnet- binedpointcloudisthenusedasinputtoasingle-framede- work for object detection. We demonstrate that our tector.Thisapproachworkswellforstaticandslow-moving method achieves state-of-the-art performance on the objectssincethelimitedmovementimpliesthattheLiDAR Waymo Open Dataset [23] and provide thorough ab- points will be mostly aligned across the frames. However, lationstudies. whenobjectsarefast-movingorwhenlongertimehorizons areconsidered,thisapproachmaynotbeaseffectivesince • Weintroduceanoveltrainingstrategyforafastsingle- theLiDARpointsarenolongeraligned(Table2). frame detector method that uses max-pooling to per- As an alternative to point concatenation, Fast-and- formnon-maximumsuppressionandavariantofHun- furious [14] attempts to fuse information across frames by garianmatchingtocomputeadetectionloss. concatenating at a feature map level. However, this still • We design an efficient multi-view alignment and ag- runsintothesamechallengewithmisalignedfeaturemaps gregationmoduletoextractandaggregaterelevantfea- for fast-moving objects and longer time horizons. Recent tures from multiple frames in a memory bank. This approaches[10,34]proposeusingrecurrentlayerssuchas moduleproducesfeaturescontaininginformationfrom Conv-LSTM or Conv-GRU to aggregate the information multiple perspectives that perform well for classifica- across frames. It turns out that these recurrent approaches tionandboundingboxregression. areoftencomputationallyexpensive. 2.RelatedWork Our Approach. We propose 3D-MAN: a 3D multi-frame attention network that is able to extract relevant features frompastframesandaggregatethemeffectively. 3D-MAN 3DSingle-frameObjectDetection. Current3Dobjectde- hasthreecomponents: (i)afastsingle-framedetector,(ii)a tectors can be categorized into three approaches: voxel- memory bank, and (iii) a multi-view alignment and aggre- based methods, point-based methods, and their combina- gationmodule. tion. First,voxel-basedmethodstransformviavoxelization The fast single-frame detector (FSD) is an anchor-free asetofunorderedpointsintoafixed-size2Dfeaturemap, one-stage detector with a novel learning strategy. We onwhichconvolutionalneuralnetworks(CNN)canbeap- show that a max-pooling based non-maximum suppres- plied to generate detection results. Traditional approaches sion (NMS) algorithm together with a novel Hungarian- for voxel feature extraction rely on hand-crafted statistical matchingbasedlossisaneffectivemethodtogeneratehigh- quantities or binary encoding [25, 29], while recent works qualityproposalsatreal-timespeeds. Theseproposalsand show that machine-learned features demonstrate favorable thelastfeaturemapfromFSDarethenfedintoamemory performance [37, 12, 28, 36, 20, 26]. Second, point-based bank. The memory bank stores both predicted proposals methods[32,19,31,15,33]addressdetectionproblemsby and feature maps in previous frames so as to maintain dif- directlyextractingfeaturesbasedonthepointcloud, with- ferentperspectivesforeachinstanceacrossframes. out an explicit discretization step. Finally, recent works The stored proposals and features in the memory bank have combined methods from both voxel-based and point- arefinallyfusedtogetherthroughthemulti-viewalignment basedfeaturerepresentations[18]byusingthevoxel-based and aggregation module (MVAA), which produces fused methodstogenerateproposalsandthepoint-basedmethods multi-view features for target proposals that are used to torefinethem. regress bounding boxes for final predictions. MVAA has two stages: a multi-view alignment stage followed by a 2DMulti-frameObjectDetection. 2Dmulti-frameobject multi-view aggregation stage. The alignment stage works detectionhasbeenwidelyexploredcomparedto3Dcoun- on each stored frame independently; it uses target propos- terparts. 2D detection methods primarily focus on align- als as queries into a stored frame to extract relevant fea- ing objects in a target frame using motion and appearanceIdentity Backbone MP-NMS CA Frame t Proposals Feature CA Fast Single-frame Detector (FSD) CA … CA Predictions CA CA … Frame t-n Frame t-2 Frame t-1 Frame t Alignment Aggregation MemoryBank Multi-view Alignmentand Aggregation(MVAA) … … … … … Proposals Feature Reg Cls ProposalFeatureGeneration Figure 2. Framework for 3D-MAN: 3D multi-frame attention network. Given the point cloud for a target frame t, a fast single-frame detectorfirstgeneratesboxproposals. Theseproposals(boxparameters)withthefeaturemap(lastlayerofthebackbonenetwork)are inserted into a memory bank that stores proposals and features for the last n frames. We use a proposal feature generation module to extract proposal features for each stored frame. Each small rectangle box denotes a proposal and its associated features extracted in differentframes. Themulti-viewalignmentandaggregationmoduleperformsattentionacrossproposalfeaturesfromthememorybank, usingthetargetframeasqueriestoextractfeaturesforclassificationandregression. “MP-NMS”and“CA”representMaxPoolNMSand cross-attentionrespectively. Duringtraining, weuseclassificationandregressionlossesappliedtotheFSDproposals(L ), thefinal fsd outputsoftheMVAAnetwork(L ),andtheoutputsofthealignmentstage(L ,anauxiliarycross-viewloss). mvaa cv features from previous frames. Relational modules with tionallyexpensive. self-attentionlayers[9]areprevalentamongthesemethods [6, 4, 27, 5, 21]. They usually take as input a target frame 3.3D-MANFramework and multiple reference frames, from which proposals are generated per frame. Relation modules are applied to ag- The3D-MANframework(Figure2)consistsof3com- gregatetemporalfeaturesformorerobustobjectdetection. ponents: (i)afastsingle-framedetector(FSD)forproduc- Most approaches use self-attention across all proposals in ingproposalsgiveninputpointclouds,(ii)amemorybank all previous frames. In contrast, our method factorizes the tostorefeaturesfromdifferentframesand(iii)amulti-view attentionlayertofirstoperateindependentlyacrossframes alignmentandaggregationmodule(MVAA)forcombining (alignmentstage),andthenindependentlyacrossproposals informationacrossframestogeneratefinalpredictions. (aggregationstage). 3.1.FastSingle-frameDetector 3DMulti-frameObjectDetection. Astraight-forwardap- proachtomulti-framedetectionistoconcatenatethepoints Anchor-free Point Pillars. We base our single-frame de- from different frames together [2]. This has been demon- tector on the PointPillars architecture [12] with dynamic strated on the NuScenes dataset (improvement of 21.9% voxelization [36]. We start by dividing the 3D space to 28.8% mAP [2]), and we also observe improvements in into equally distributed pillars which are voxels of infinite our experiments (Table 2). However, as we increase the height. Eachpointinthepointcloudisassignedtoasingle number of frames concatenated, the improvement dimin- pillar. Each pillar is then featurized using a PointNet [17] ishes since the LiDAR points are less likely to be aligned producing a 2D feature representation for the entire scene, across longer time horizons (Table 2). Fast-and-furious whichissubsequentlyprocessedthroughaCNNbackbone. [14] side steps aligning the points by instead concatenat- Each location of the final layer of the network produces a ingtheintermediatefeaturesmaps. However,thisapproach predictionforaboundingboxrelativetothecorresponding maystillresultinmisalignmentacrossthefeaturemapsfor pillar center. We regress the location residuals, bounding fast-moving objects and longer time horizons. Recent ap- boxsizes, andorientation. Abinningapproachisusedfor proaches [10, 34] show further performance improvement predicting orientation which first classifies the orientation byapplyingConv-LSTMorConv-GRUtofusemulti-frame intoonebinfollowedbyregressionoftheresidualfromthe information.However,theuseofasinglememorystatethat correspondingbincenter[16,19]. getsupdatedcreatesapotentialbottleneck,andthehighres- Non-maximumsuppression(NMS)isoftenusedtopost- olution of the feature maps make these methods computa- processthedetectionsproducedbythelastlayerofthenet-work for redundancy removal. It first outputs the highest Bilinear scoringboxandthensuppressesalloverlappingboxeswith Interpolate thatbox,repeatingthisprocessuntilallboxesareprocessed. Average However, the sequential nature of this algorithm makes it slow to run in practice when there is a large number of predictions. We use a variant of NMS that leverages max- pooling to speed up this process. MaxPoolNMS [35] uses Bilinear themaxpoolingoperationtofindlocalpeaksontheobject- Interpolate Average ness score map. The local peaks are kept as predictions, whileallotherlocationsaresuppressed.Thisprocessisfast andhighlyparallelizable. Wefindthatthisapproachcanbe Figure 3. Illustration of rotated ROI feature extraction [13]. We upto6×faster1thanregularNMSwhendealingwithabout firstidentifykeypointsineachproposalboxandthenextractfea- 200kpredictions. tures using bilinear interpolation. Averaging pooling is further used to summarize each box into a single feature vector. Note HungarianMatching.MaxPoolNMSisusuallyperformed thatwhilethefiguredenoteskeypointsover3×2locations,we using the classification score as the ranking signal to indi- use7×7forvehiclesand3×3forpedestrians. catethatoneboxisbetterthananother.However,theclassi- ficationscoreisaproxymetric:ideally,wewanttohavethe wepost-processthematchestoreassignground-truthboxes highestscoringboxtobethebestlocalizedbox. Theideal thathavenooverlapwiththeirmatchedpredictionbox. We scoremapshouldhaveasinglepeakwhichcorrespondsto assigntheminsteadtotheirclosestpillarinthefeaturemap, the best localized box. We propose using the Hungarian whichmaynotbeoneretainedbyMaxPoolNMS.Thisen- matchingalgorithm[3,22]toproducesuchascoremap. courages the model to avoid invalid assignments and con- Given a set of bounding box predictions and a set of vergewell. ground-truthboxes,wecomputetheIoUscoreforeachpair of them. By applying the Hungarian matching algorithm 3.2.MemoryBank to this matrix2 of pair-wise scores, we can obtain a single match for each ground-truth box to a predicted box that Memory Bank. We use a memory bank to store the pro- maximizes the overall matching score. For each ground- posals and feature maps extracted by the FSD for the last truth box, we treat the matched predicted box as positive, n frames. When proposals and features from a new frame andallunmatchedboxesasnegative.Inthisway,themodel areaddedtothebank,thosefromtheoldestframearedis- isencouragedtopredictonlyonepositiveboxperground- carded. truth box such that the box predicted corresponds to the Proposal Feature Generation. To obtain features from highestIoU-scoringbox. multipleperspectives,weproposetogenerateproposalfea- Itturnsoutthattherearetwochallengeswhenusingthe tures for each stored frame in the memory bank as well as Hungarianmatchingalgorithm.First,theHungarianmatch- the target frame. We find that it is useful to use all stored ing algorithm is of order O(n3) and can be slow if there proposals regardless of which frame the proposal comes arealargenumberofpredictions. Therefore,wechooseto from to extract features from every stored frame. This al- perform the Hungarian matching based-loss only after the lowsthemodeltoincreaseitsrecallsinceanobjectmaybe MaxPoolNMS step. This ensures that only a few predic- missed by FSD in a single frame because of occlusion or tions remain, and enables the matching algorithm to com- partialobservation. pletequickly. Foreachproposal,weextractitsfeaturesusingarotated Second,themodelcanendupinabadlocalminimaby ROI feature extraction approach (Figure 3) [13]. Given a onlypredictingboxeswhicharefarawayfromanyground- proposal, we identify K ×K ×1 equally distributed key truthbox(e.g.,predictingboxesinlocationswherethereare pointswithrespecttotheproposalbox3.Foreachkeypoint, no points in the input point clouds). Consequently, these we compute a feature by bilinear interpolation of its value ground-truthboxesdonotoverlapatallwiththeirmatched inthefeaturemap. Finally, weuseaveragepoolingacross prediction boxes. As a result, the model does not get any alltheK×K×1keypointstoobtainasinglefeaturevector meaningful learning signals from these matches and is not fortheproposal. Itisworthnotingthatthisfeatureextrac- able to converge to a good solution. To address this issue, tionmethodcanbeperformedwithoutcorrectingtheentire LiDARpointcloudforego-motionoftheautonomousvehi- 1Execution time for regular NMS depends on the number of output boxesdesired,whileMaxPoolNMS’sspeedisinvariantthenumberofout- cle.Thisfacilitatesdeploymentinaproductionautonomous putboxes. drivingsystem. 2Inpractice,weadddummyboxestotheground-truthboxessothata one-to-onematchisalwaysproduced. 3Weuse7×7×1forvehiclesand3×3×1forpedestrians.tweenthenewproposalandotherobjectsinpreviousframes 𝑉"(NxC) thatcouldprovidecontextualinformation. MatMul We propose using a cross-attention network (Figure 4) Attention to learn how to relate the new frame proposals to those of Matrix (NxN) storedframes. Thisnetworkcouldpotentiallylearntoalign the proposal identities and also model interactions across SoftMax objects. Specifically, weapplyprojectionlayers toencode Add the new frame proposal features F as well as stored pro- MatMul SoftPlus t posal features F so as to compute projected queries F , s q MLP keysF kandvaluesF v.Theseareusedtocomputeanatten- 𝐹’(NxC) 𝐹((NxC) 𝐹&(NxC) tionmatrix. Wefurtherprovidetemporalandspatialinfor- Concat mationtotheattentionmatrixthroughencodingtherelative 𝑊’ 𝑊( 𝑊& ReB sido ux a ls Sub frameindexandboxresidualsbetweenallpairsofthequery andstoredboxes.Thecross-attentionnetworkisappliedbe- 𝐹"(NxC) 𝐹$(NxC) 𝐵"(Nx7) 𝐵$(Nx7) 𝑠 𝑡 tweenthetargetframeandeachstoredframeindependently Figure 4. Cross-attention network in the multi-view alignment withsharedparameters,generatingafeaturevectorforeach module. F sandB srepresentfeaturesandboxparametersofpro- targetproposal(V s)fromeachstoredframe. posals in a stored frame while F and B are those for the tar- t t get frame. We use s and t to denote the indices of the stored Cross-view Loss. The alignment stage of MVAA is de- frameandtargetframerespectively. N andC standforthenum- signed to extract features from each stored frame that are berofproposalsandchannels. “Boxresiduals”producesapair- mostrelevanttoeachtargetproposal. Toencouragetheex- wiseN×N×7tensorthatencodesthedifferencesinallpairsof tractedfeaturestobearelevantrepresentation, weemploy boxes,usingthesameapproachthatisusedtocomputeresiduals an auxiliary loss that encourages the extracted features to forground-truthboxesfromanchorboxes[12]. V istheoutput s contain sufficient information to predict the corresponding ofthecross-attentionnetwork,suchthateachinputtargetboxhas ground-truth bounding box associated with the target pro- oneassociatedoutputfeaturevectorwiththecorrespondingstored posal. Concretely, we add separate classification and re- frame. gressionheadsthatuseeachextractedfeaturevectorofthe alignmentstagetopredicttheboxresidualsbetweentarget Theproposalfeaturesgeneratedforthetargetframewill proposalanditscorrespondingground-truthbox. be used next in the MVAA module as the query features forthecross-attentionnetworks,whileproposalfeaturesfor Multi-view Aggregation. After the alignment module, storedframeswillbetreatedaskeysandvalues. each proposal in the target frame will have an associated feature for each stored frame. The multi-view aggrega- 3.3.Multi-viewAlignmentandAggregation tion layer (Figure 2, MVAA-Aggregation) is responsible These proposal features are then sent to the multi-view forcombiningthesefeaturesfromdifferentperspectivesto- alignmentandaggregationmodule(MVAA)tobeextracted gether to form a single feature for each proposal. Con- andaggregated. Thealignmentmoduleisappliedindepen- cretely, we use the new frame’s proposal features as the dentlyforeachstoredframe(attentionisacrossboxes,per- attentionqueryinputs,anditscorrespondingextractedfea- formed separately for each frame), while the aggregation turesinpreviousframesasthekeysandvalues. module is applied independently for each box (attention is Wenotethattheaggregationmodulecanenablethenet- acrosstime). Onecanviewthisasafactorizedformofat- work to be robust to newly appearing objects. If an object tention. appears for the first time in a new frame, the model can computeanattentionmatrixthatwillonlyfocusonthenew Multi-viewAlignment. Givenanewframe’sproposal,the frameandignorethepastframessincetheyarenotrelevant. multi-view alignment module is responsible for extracting its relevant information in each previous frame separately Box Prediction Head. After MVAA, we have an updated (Figure 2, MVAA-Alignment). To achieve this goal, the featureforeachproposalinthenewframe. Weregressob- alignment stage has to figure out how to relate the iden- jectnessscoresandboxparametersfromthisfeaturerepre- tities of the proposals in the new frame to those in the sentation.Fortheobjectnessscore,wefollow[18]andtreat stored frames. A naive approach could use nearest neigh- theIoUbetweenproposalsandtheircorrespondingground- bormatchingormaximumIoUoverlap. However,whenan truth bounding boxes as the classification target, with the instanceisfast-movingorclosetoanyotherinstance,there sigmoid cross-entropy loss. The box parameter targets are willoftenbeambiguityintheappropriateassignment. Fur- encoded as residuals [12, 37] and trained with a smooth- thermore,thenaiveapproachdoesnotlearninteractionsbe- L1loss. Thesameformulationsareusedforthecross-viewloss. 4.1.ImplementationDetails 3.4.Losses Hyperparameters. Giventheinputpointcloudinatarget frame,wefirstsetthedetectionrangeas[−76.8m,76.8m] We minimize the total loss consisting of a fast single- forxandyaxesand[−2m,4m]forthez-axis. Weequally frame detector (FSD) loss L fsd, a multi-view prediction splitthis3Drangeinto[512,512]pillarsamongxandyaxes loss L , and a cross-view loss L with equal loss mvaa cv respectively,followingPointPillars[12]. FortheMaxPool- weights. NMSappliedinFSD,weuseamax-poolingkernelsizeof [7,7]forvehiclesand[3,3]forpedestrians,withastrideof L =L +L +L (1) [1,1].AfterMaxPoolNMS,asetof128proposalsperframe total fsd mvaa cv arepassedtothememorybank. ThesameformulationfordetectionlossL det isusedin NetworkArchitectures. InourproposedFSD,weusethe thesethreelosses. ThisincludesaobjectnesslossL obj and samebackbonenetworkillustratedinPointPillars[12]. The aregressionlossL reg. channel dimension C of the last feature map and proposal features is 384. For encoding the frame index and rela- 1 (cid:88) 1 (cid:88) tive box residuals, we apply a 2-layer perceptron (MLP) L det = |C| L obj + |R| L reg (2) networks with C output channels for the first layer, and 1 i∈C i∈R output for the second layer. These are used in the cross- Crepresentsthesetoflocationswherewepredictanob- attention layers of the MVAA module. In the predic- jectnessscore. ForL ,thiscorrespondstotheremaining tion head, we first apply a 2-layer MLP network with C fsd pillars after MaxPoolNMS, while for L and L , this output channels to embed the aggregated multi-view fea- mvaa cv corresponds to the proposals after FSD (specifically, those tures. Theseembeddingsaretransformedwithtwopredic- thatremainafterMaxPoolNMS).Fortheobjectnesslossin tionbranchesforclassificationandregression. L mvaa and L cv, we use the IoU overlap between the pro- Training Parameters. Our network is trained end-to-end posalanditsassignedground-truthasthetarget. ForL fsd, using the ADAM [11] optimizer for a total number of 50 theoutputoftheHungarianmatchingisusedtodetermine epochs with an initial learning rate of 0.0016 and a batch positiveandnegativeassignmentsfortheobjectnessloss. sizeof32. Weapplyexponentialdecaytoannealthelearn- R represents the set of locations which are associated ingrate,startingat5epochsuntil45epochs. Duringtrain- with a ground-truth box. For all losses (L fsd, L mvaa, and ing,weapplyrandomflipandrandomrotationasouronly L cv), these are the matched boxes from Hungarian match- dataaugmentationmethods. ing. For the regression losses, we use a smooth-L1 loss Utilizingalargenumberofframes. Weenable3D-MAN as the supervision of regressing the x, y, z center location to exploit a large number of frames by combining it with residuals,andtheircorrespondingdimensions. Fororienta- point concatenation. Our best model uses 16 frames split tion,weuseabinningorientationloss[16,19]. Themodel into4windowsof4frames. Thepointcloudsineachwin- isexpectedtopredictananglebinfirst,followedbyaresid- dowareconcatenatedtogetherandusedasinputtotheFSD. ualfromthebincenter. Weuse12binsforL and1bin fsd Each window thus becomes an entry in the memory bank, forL andL . mvaa cv and the model is expected to produce predictions for only The cross-view identity loss L is computed across all cv the last frame. This utilizes point concatenation for when theoutputsofthemulti-viewalignmentstage,andaveraged movementissmallwithnearbyframesandMVAAforlarge acrossallinstances. movement across a longer time range. We provide further ablation studies with varying sizes of input frames in Sec- 4.Experiments tion4.3. We evaluate our method on Waymo Open Dataset [23], 4.2.MainResults a large scale 3D object detection dataset. There are a to- tal of 1150 sequences divided into 798 training, 202 vali- WaymoValidationSet.Wecompareourmethodwithpub- dation, and 150 testing examples. Each sequence consists lished state-of-the-art single-frame and multi-frame meth- of about 200 frames at a frame rate of 10 Hz, where each ods on the Waymo validation set on class Vehicle (Table frameincludesaLiDARpointcloudandlabeled3Dbound- 3) and class Pedestrian (Table 4). We first compare the ing boxes for vehicles, pedestrians, cyclists and signs. We performance between our model with and without multi- evaluateour modelandcompare itwithothermethods us- frame inputs (Table 3). When the model has access to 16 ingaverageprecision(AP)andAveragePrecisionWeighted stored frames, the overall 3D AP (LEVEL 1) is improved byHeading(APH). by5.50%onvehicleslabeledasLEVEL 1difficulty,illus-3DAP(IoU=0.7) 3DAPH(IoU=0.7) Difficulty Method Overall 0-30m 30-50m 50m-Inf Overall 0-30m 30-50m 50m-Inf StarNet[15] 55.11 80.48 48.61 27.74 54.64 79.92 48.10 27.29 PointPillars[12] 63.27 84.90 59.18 35.79 62.72 84.35 58.57 35.16 MVF[36] 62.93 86.30 60.02 36.02 - - - - AFDet[7] 63.69 87.38 62.19 29.27 - - - - RCD[1] 68.95 87.22 66.53 44.53 68.52 86.82 66.07 43.97 LEVEL1 PV-RCNN[18] 70.30 91.92 69.21 42.17 69.49 91.34 68.53 41.31 3D-MAN(Ours) 69.03 87.99 66.55 43.15 68.52 87.57 65.92 42.37 PointPillars∗[12] 65.41 85.58 61.51 39.51 64.88 85.02 60.95 38.91 ConvLSTM∗[10] 63.6 - - - - - - - 3D-MAN∗(Ours) 74.53 92.19 72.77 51.66 74.03 91.76 72.15 51.02 StarNet[15] 48.69 79.67 43.57 20.53 48.26 79.11 43.11 20.19 PointPillars[12] 55.18 83.61 53.01 26.73 54.69 83.08 52.46 26.24 PV-RCNN[18] 65.36 91.58 65.13 36.46 64.79 91.00 64.49 35.70 LEVEL2 3D-MAN(Ours) 60.16 87.10 59.27 32.69 59.71 86.68 58.71 32.08 PointPillars∗[12] 57.28 84.31 55.41 29.71 56.81 83.79 54.90 29.24 3D-MAN∗(Ours) 67.61 92.00 67.20 41.38 67.14 91.57 66.62 40.84 Table3.3DAPandAPHResultsonWaymoOpenDatasetvalidationsetforclassVehicle. ∗Methodsutilizemulti-framepointclouds fordetection. WereportPointPillars[12]basedonourownimplementation,withandwithoutpointconcatenation. Difficultylevelsare definedintheoriginaldataset[23]. LEVEL1 LEVEL2 Vehicle Pedestrians Method Method 3DAP 3DAPH 3DAP 3DAPH 3DAP 3DAPH 3DAP 3DAPH StarNet[15] 68.32 60.89 59.32 52.76 SECOND[28] 50.11 49.63 - - PointPillars[12] 68.88 56.57 59.98 49.14 StarNet[15] 63.51 63.03 67.78 60.10 MVF[36] 65.33 - - - PointPillars[12] 68.62 68.08 67.96 55.53 3D-MAN(Ours) 71.71 67.74 62.58 59.04 SA-SSD[8] 70.24 69.54 57.14 48.82 RCD[1] 71.97 71.59 - - Table4.3DAPandAPHResultsonWaymoOpenDatasetvali- 3D-MAN(Ours) 78.71 78.28 69.97 65.98 dationsetforclassPedestrian. Table5.3DAPandAPHResultsonWaymoOpenDatasettesting setforclassVehicleandPedestrainamongLEVEL 1difficultyob- jects.MetricbreakdownsforourmodelisavailableontheWaymo challengeleaderboard. tratingtheeffectivenessofourapproach. Mask Centeredness HungarianMatching 3D-MANoutperformsthecurrentbestpublishedmethod Ped.(%) 64.7 67.1 70.2 (PV-RCNN [18]) by 3.56% (30-50m range) and 9.49% Veh.(%) 44.5 63.7 64.8 (>50m range) AP (LEVEL 1) on vehicles. At these fur- Table6.Mini-validationAPcomparisonamongdifferentground- therranges,objectsareoftenpartiallyvisible,wherehaving truthassignmentstrategiesusingFSDforbothPedestrianandVe- more information from different perspectives could help. hicleclasses. These improvements show that our model is able to ef- fectivelycombinetheinformationacrossmultipleviewsto 4.3.AblationStudies generate more accurate 3D predictions. Moreover, com- pared to existing multi-frame models, 3D-MAN also out- Weconductallourablationstudiesonlyforthevehicle performs them by a large margin. Our method achieves class,andreportLEVEL 1difficultyresultsbasedonasub- a better 3D AP than the recently published Conv-LSTM set of the full validation set. We created a mini-validation method[10]by10.93%onvehicledetection.Forpedestrian set by uniformly sampling 10% of the full validation set. detection,3D-MANalsoachievesthebestperformance(Ta- This results in a dataset that allows us to experiment sig- ble4). nificantly faster. We note that there is a negligible perfor- mance gap between the mini-validation and full validation Waymo Testing Set. We also evaluate our model on set:forexample,ourbestmodelobtains74.3%onthemini- Waymo testing set through a test server submission. For validationsetversus74.5%onthefullvalidationset. vehicledetection(Table5),3D-MANachieves78.71%AP and 78.28 APH, outperforming RCD [1] by 6.74% and Hungarian Matching. We compare the performance of 6.69% respectively, which is currently the best published usingdifferentassignmentstrategiesinFSD,includingthe methodamongresultsgeneratedbyasinglemodel(notus- maskstrategy,centerednessstrategyandHungarianmatch- inganyensemblemethods). ing strategy (Table 6). The mask strategy [24, 30] as-Method Baseline Concat Relation MVAA Frames 1 4 7 10 13 16 AP(%) 68.2 70.5 70.1 72.5 AP(%) 68.2 72.5 73.4 73.5 73.8 74.3 Table7.Mini-validationAPcomparisononclassVehicleamong Table 9. Mini-validation AP comparison for different number of differentmulti-framefusionapproaches. inputframestothe3D-MANmodel. Allmodelsareexpectedto predictonlythelastframe. Modelswith7,10,13,and16frames SupervisionMethod None CorrespondenceLoss CVLoss useconcatenatedpoints(overwindowsof4frames)asinput,with AP(%) 70.7 71.4 72.5 differentamountofoverlapsbetweenadjacentwindows. Table 8. Mini-validation AP comparison on class Vehicle using Model Stationary(%) Slow(%) Medium(%) Fast(%) differentauxiliarylosseswiththeMVAAalignmentstage. 4-frames 69.5 68.6 67.1 78.3 16-frames 73.2 70.4 68.9 79.2 signs interior pillars of any valid object positive and all Table10.VelocitybreakdownsofvehicleAPmetricsfor3D-MAN otherpillarsnegative. However, thiscanleadtoadiscrep- withvaryingnumberofinputframes. ancybetweenclassificationscoreandlocalizationaccuracy. Our experiments show that this performs the least well. correspondence loss. The correspondence loss encourages Centeredness strategy [31, 24, 35] encourages pillars with elements of the attention matrix (of the alignment stage in closerdistancetotheinstancecentertohaveahigherclas- MVAA) to be close to 1 if the query proposal matches the sification score. However, the pillar in the center may not instanceofthecorrespondingstoredproposal,andzerooth- alwaysdrawthebestlocalizationpredictioninpointclouds: erwise. Wecomparetheseapproachesfortheauxiliaryloss theLiDARpointsoftenareonthesurfaceofthevehicleand (Table 8), and find that using the cross-view loss outper- not in the interior. We find centeredness to perform better forms having no auxiliary loss by 1.8% and using the cor- thanmask,butworsethanourproposedHungarianmatch- respondencelossby1.1%. ingapproach. FSDachievesthehighestAPwiththeHun- garianmatchingstrategy,whichvalidatesourapproach. Varying number of input frames. We further compare our model’s performance on different number of available Multi-frame Approaches. We compare our method to frames (Table 9). In order to draw a fair comparison be- othermulti-frameapproaches(Table7),includingthepoint tween models with 7 through 16 frames, we fix the com- concatenationapproachandaself-attentionapproachacross putation by using point concatenation over windows of 4 all previously detected boxes. Multi-frame models in the frames,withdifferentdegreesofoverlapsbetweenwindows comparisonhave4framesasinputandareexpectedtopre- (similar to strides in convolution windows). We find that dict bounding boxes for only the last frame. We also per- ourmodelsteadilyimprovesasithasaccesstomoreinput form ego-motion pose correction to map points from the framescorrespondingtolongertimehorizons. earlierframestotheposeofthelastframe. TheBaselinemodelisoursingle-frametwo-stagemodel, Velocity breakdowns. We also compare our model’s per- whichappliesFSDtogenerateproposalswithfeaturesand formanceacrossdifferentvelocitybreakdowns. Recallthat deploysboxpredictionheadwithanMLPnetworktorefine the baseline multi-frame PointPillars model performance theseproposals. Itachieves68.2%APonVehicleandpro- degrades when using 8-frames versus 4-frames (Table 2). videsabaselinetocomparethemulti-framemodelsagainst. Conversely,ourmodeldemonstratesanimprovementwhen In the Concat approach, points across all frames are weincreasethenumberofframesfrom4to16(Table10). combinedtogether,andthemergedpointcloudisusedasin- Thisshowsthatourapproachisabletobenefitfast-moving puttotheBaselinemodel. Thisimprovesuponthebaseline vehicles. by 2.3% AP. The Relation approach first extracts box pro- posals from multiple frames and then uses a self-attention 5.Conclusion networkonallpastproposalsdirectlytoproduceapredic- tion. ThisperformsbetterthantheBaselinebutworsethan In this paper, we present a novel 3D object detection theConcatmodel. method,3D-MAN,whichutilizesattentionnetworkstoex- Ourapproach(MVAA)performsthebest,outperforming tractandaggregatefeaturesacrossmultipleframes. Wein- the Concat approach and Relation approach by 2.0% and troduce a fast single-frame detector that utilizes a Hungar- 2.4%respectively. ianmatchingstrategytoaligntheobjectnessscorewiththe bestlocalizedbox. Weshowhowtheoutputsofthesingle- Cross-view loss. We find it useful to have an auxiliary framedetectorcanbeusedwithamemorybankandanovel cross-view loss to encourage the model to propagate rele- multi-view alignment and aggregation module to fuse the vantfeaturesinthealignmentstageofMVAA.Toevaluate information from multiple frames together. Our method is the effectiveness of the cross-view loss, we compare it to effectiveacrosslongtimehorizonsandobtainsstate-of-the- nothavinganauxiliarylossandalsoanalternativeauxiliary artperformanceonachallenginglargescaledataset.References [16] CharlesRuizhongtaiQi,WeiLiu,ChenxiaWu,HaoSu,and LeonidasJ.Guibas. Frustumpointnetsfor3dobjectdetec- [1] Alex Bewley, Pei Sun, Thomas Mensink, Dragomir tionfromRGB-Ddata. CVPR,2018. 3,6 Anguelov, and Cristian Sminchisescu. Range conditioned [17] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and dilatedconvolutionsforscaleinvariant3dobjectdetection. Leonidas J. Guibas. Pointnet: Deep learning on point sets InCoRL,2020. 7 for3dclassificationandsegmentation. InCVPR,2017. 3 [2] HolgerCaesar,VarunBankiti,AlexH.Lang,SourabhVora, [18] ShaoshuaiShi,ChaoxuGuo,LiJiang,ZheWang,Jianping VeniceErinLiong,QiangXu,AnushKrishnan,YuPan,Gi- Shi,XiaogangWang,andHongshengLi. PV-RCNN:point- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- voxel feature set abstraction for 3d object detection. In modaldatasetforautonomousdriving. CVPR,2020. 1, 2, CVPR,2020. 2,5,7 3 [19] ShaoshuaiShi,XiaogangWang,andHongshengLi. Pointr- [3] NicolasCarion,FranciscoMassa,GabrielSynnaeve,Nicolas cnn: 3dobjectproposalgenerationanddetectionfrompoint Usunier,AlexanderKirillov,andSergeyZagoruyko.End-to- cloud. InCVPR,2019. 2,3,6 endobjectdetectionwithtransformers. InAndreaVedaldi, [20] ShaoshuaiShi,ZheWang,XiaogangWang,andHongsheng HorstBischof,ThomasBrox,andJan-MichaelFrahm,edi- Li. Part-aˆ2net: 3dpart-awareandaggregationneuralnet- tors,ECCV,2020. 4 workforobjectdetectionfrompointcloud. arXivpreprint [4] Yihong Chen, Yue Cao, Han Hu, and Liwei Wang. Mem- arXiv:1907.03670,2019. 2 ory enhanced global-local aggregation for video object de- [21] MykhailoShvets,WeiLiu,andAlexanderC.Berg.Leverag- tection. InCVPR,2020. 3 inglong-rangetemporalrelationshipsbetweenproposalsfor [5] HanmingDeng,YangHua,TaoSong,ZongpuZhang,Zhen- videoobjectdetection. InICCV,2019. 3 gui Xue, Ruhui Ma, Neil Martin Robertson, and Haibing [22] Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. Guan. Object guided external memory network for video End-to-end people detection in crowded scenes. In CVPR, objectdetection. InICCV,2019. 3 2016. 4 [6] Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, [23] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Houqiang Li, and Tao Mei. Relation distillation networks Chouard,VijaysaiPatnaik,PaulTsui,JamesGuo,YinZhou, forvideoobjectdetection. InICCV,2019. 3 Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, [7] RunzhouGe,ZhuangzhuangDing,YihanHu,YuWang,Si- Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et- jiaChen, LiHuang, andYuanLi. Afdet: Anchorfreeone tinger,MaximKrivokon,AmyGao,AdityaJoshi,YuZhang, stage3dobjectdetection. CoRR,2020. 7 Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. [8] ChenhangHe,HuiZeng,JianqiangHuang,Xian-ShengHua, Scalability in perception for autonomous driving: Waymo andLeiZhang.Structureawaresingle-stage3dobjectdetec- opendataset. InCVPR,2020. 1,2,6,7 tionfrompointcloud. InCVPR,June2020. 7 [24] ZhiTian,ChunhuaShen,HaoChen,andTongHe. FCOS: [9] HanHu,JiayuanGu,ZhengZhang,JifengDai,andYichen fullyconvolutionalone-stageobjectdetection. ICCV,2019. Wei.Relationnetworksforobjectdetection.InCVPR,2018. 7,8 3 [25] DominicZengWangandIngmarPosner. Votingforvoting [10] RuiHuang,WanyueZhang,AbhijitKundu,CarolinePanto- inonlinepointcloudobjectdetection. InRobotics: Science faru, David A. Ross, Thomas A. Funkhouser, and Alireza andSystemsXI,2015. 2 Fathi. AnLSTMapproachtotemporal3dobjectdetection [26] YueWang, AlirezaFathi, AbhijitKundu, DavidRoss, Car- inlidarpointclouds. CoRR,2020. 2,3,7 oline Pantofaru, Thomas Funkhouser, and Justin Solomon. [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for Pillar-based object detection for autonomous driving. In stochasticoptimization. ICLR,2015. 6 ECCV,2020. 2 [12] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, [27] HaipingWu, YuntaoChen, NaiyanWang, andZhao-Xiang JiongYang,andOscarBeijbom. Pointpillars: Fastencoders Zhang. Sequencelevelsemanticsaggregationforvideoob- forobjectdetectionfrompointclouds. CVPR,2019. 2,3,5, jectdetection. InICCV,2019. 3 6,7 [28] YanYan,YuxingMao,andBoLi. Second:Sparselyembed- [13] MingLiang*,BinYang*,YunChen,RuiHu,andRaquelUr- dedconvolutionaldetection. Sensors,2018. 2,7 tasun.Multi-taskmulti-sensorfusionfor3dobjectdetection. [29] BinYang, WenjieLuo, andRaquelUrtasun. PIXOR:real- InCVPR,2019. 4 time3dobjectdetectionfrompointclouds. InCVPR,2018. [14] WenjieLuo,BinYang,andRaquelUrtasun. Fastandfuri- 2 ous:Realtimeend-to-end3ddetection,trackingandmotion [30] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen forecastingwithasingleconvolutionalnet. InCVPR,2018. Lin.Reppoints:Pointsetrepresentationforobjectdetection. 2,3 ICCV,2019. 7 [15] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, [31] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Point-based3dsinglestageobjectdetector,2020. 2,8 PatrickNguyen,ZhifengChen,JonathonShlens,andVijay [32] ZetongYang,YananSun,ShuLiu,XiaoyongShen,andJiaya Vasudevan. Starnet: Targetedcomputationforobjectdetec- Jia. IPOD: intensive point-based object detector for point tioninpointclouds. CoRR,2019. 2,7 cloud. CoRR,2018. 2[33] ZetongYang,YananSun,ShuLiu,XiaoyongShen,andJiaya Jia. STD:sparse-to-dense3dobjectdetectorforpointcloud. ICCV,2019. 2 [34] JunboYin,JianbingShen,ChenyeGuan,DingfuZhou,and RuigangYang. Lidar-basedonline3dvideoobjectdetection withgraph-basedmessagepassingandspatiotemporaltrans- formerattention. InCVPR,2020. 2,3 [35] XingyiZhou,DequanWang,andPhilippKra¨henbu¨hl. Ob- jectsaspoints. CoRR,2019. 4,8 [36] YinZhou, PeiSun, YuZhang, DragomirAnguelov, Jiyang Gao,TomOuyang,JamesGuo,JiquanNgiam,andVijayVa- sudevan. End-to-endmulti-viewfusionfor3dobjectdetec- tioninlidarpointclouds. InCoRL,2019. 2,3,7 [37] YinZhouandOncelTuzel. Voxelnet: End-to-endlearning forpointcloudbased3dobjectdetection. CVPR,2018. 2,5