4D-Net for Learned Multi-Modal Alignment

AJ Piergiovanni (Google Research), Vincent Casser (Waymo LLC), Michael S. Ryoo (Robotics at Google), Anelia Angelova (Google Research)

Abstract

We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines on the Waymo Open Dataset. 4D-Net is better able to use motion cues and dense image information to detect distant objects more successfully. We will open source the code.

Figure 1. 4D-Net effectively combines 3D sensing in time (PCiT) with RGB data also streamed in time, learning the connections between different sensors and their feature representations.

1. Introduction

Scene understanding is a long-standing research topic in computer vision. It is especially important to the autonomous driving domain, where a central point of interest is detecting pedestrians, vehicles, obstacles and potential hazards in the environment. While it was traditionally undertaken from a still 2D image, 3D sensing is widely available, and most modern vehicle platforms are equipped with both 3D LiDAR sensors and multiple cameras, producing 3D Point Clouds (PC) and RGB frames. Furthermore, autonomous vehicles obtain this information in time. Since all sensors are grounded spatially, their data collectively, when looked at in time, can be seen as a 4-dimensional entity. Reasoning across these sensors and time clearly offers opportunities to obtain a more accurate and holistic understanding, instead of the traditional scene understanding from a single 2D still-image or a single 3D Point Cloud.

While all this 4D sensor data is readily available onboard, very few approaches have utilized it. For example, the majority of methods targeting 3D object detection use a single 3D point cloud as an input [17], with numerous approaches proposed [25, 53, 37, 32, 40, 41, 56, 58, 57]. Only more recently has point cloud information been considered in time, with approaches typically accumulating several point clouds over a short time horizon [21, 20, 56, 32].

Furthermore, the sensors have complementary characteristics. The point cloud data alone may sometimes be insufficient, e.g., at far ranges where an object only reflects a handful of points, or for very small objects. More information is undoubtedly contained in the RGB data, especially when combined with the 3D Point Cloud inputs. Yet, relatively few works have attempted to combine these modalities [35, 51, 19, 24]. Notably, only 2 of the 26 submissions to the Waymo Open Dataset 3D detection challenge operated on both modalities [47]. No methods have attempted combining them when both are streamed in time. The questions of how to align these very different sensor modalities most effectively, as well as how to do so efficiently, have been major roadblocks.
To address these challenges, we propose 4D-Net, which combines Point Cloud information together with RGB camera data, both in time, in an efficient and learnable manner. We propose a novel learning technique for fusing information in 4D from both sensors, building and learning connections between feature representations from different modalities and levels of abstraction (Figure 1). Using our method, each modality is processed with a suitable architecture producing rich features, which are then aligned and fused at different levels by dynamic connection learning (Figure 2). We show that this is an effective and efficient way of processing 4D information from multiple sensors. 4D-Nets provide unique opportunities as they naturally learn to establish relations between these sensors' features, combining information at various learning stages. This is in contrast to previous late fusion work, which fuses already mature features that may have lost spatial information crucial to detecting objects in 3D.

Our results are evaluated on the Waymo Open Dataset [48], a challenging Autonomous Driving dataset and popular 3D detection benchmark. 4D-Net outperforms the state-of-the-art and is competitive in runtime. Importantly, being able to incorporate dense spatial information and information in time improves detection at far ranges and for small and hard to see objects. We present several insights into the respective significance of the different sensors and time horizons, and runtime/accuracy trade-offs.

Our contributions are: (1) the first 4D-Net for object detection which spans the 4 dimensions, incorporating both point clouds and images in time, (2) a novel learning method which learns to fuse multiple modalities in 4D, (3) a simple and effective sampling technique for 3D Point Clouds in time, and (4) a new state-of-the-art for 3D detection on the Waymo Open Dataset and a detailed analysis for unlocking further performance gains.

2. Related Work

Object Detection from RGB. The earliest detection approaches in the context of autonomous driving were primarily focused on camera-based object detection, often drawing heavily from the extensive body of 2D vision work [50, 14, 4, 3, 7, 33, 9, 45, 17, 13], with some more advanced works using deep learning features [2, 12, 44]. Detection with temporal features, i.e., integrating features across several neighboring frames [16, 54], and leveraging kinematic motion to improve detection consistency across time [6], have also been applied. However, looking at the images as videos and processing them with video CNNs is not common.
Object Detection from Point Cloud. Many approaches found it effective to apply well-established 2D detectors on a top-down (BEV) projection of the point cloud (AVOD [24], PIXOR [58], Complex-YOLO [42], HDNET [57]). This input representation can be advantageous because it makes it easier to separate objects, and object sizes remain constant across different ranges. However, it results in loss of occlusion information, does not effectively exploit the full 3D geometric information, and is inherently sparse. Certain techniques can be used to alleviate these drawbacks, e.g., learning a pseudo-image projection [25]. An alternative to top-down representations is operating directly on the range image, projecting the PC into perspective view. This representation is inherently dense, simplifies occlusion reasoning and has been used in various works (e.g., LaserNet [31], VeloFCN [27], [5]). Another line of work tries to exploit the complementary nature of both by jointly operating on multiple views [60, 53]. Instead of relying on view projections, some methods directly operate on the 3D voxelized point cloud [61, 26, 52]. Voxel resolution can greatly affect performance and is typically limited by computational constraints. To reduce compute, some rely on applying sparse 3D convolutions, such as Vote3Deep [15], Second [56] or PVRCNN [40]. Dynamic voxelization has been proposed in [60]. Other methods operate directly on the raw point cloud data, e.g., SPLATNet [46], StarNet [32] or PointRCNN [41].

Point Clouds in Time (PCiT). Integrating information from multiple point clouds has been proposed recently. StarNet [32] does not explicitly operate in time, but can use high-confidence predictions on previous frames as "temporal context" to seed object center sampling in the following frames. [21] extract features on individual frames and accumulate information in an LSTM over 4 frames for detection. [11] apply 4D ConvNets for spatial-temporal reasoning for AR/VR applications. [56, 20, 59] combine multiple point clouds in time by concatenating them and adding a channel representing their relative timestamps.

Point Clouds and RGB fusion. Acknowledging the merits of sensor fusion, researchers have attempted to combine LiDAR and camera sensing to improve performance [35, 10, 28]. Frustum PointNet [36] first performs image-based 2D detection, and then extrapolates the detection into a 3D frustum based on LiDAR data. Alternatively, one can project the point cloud into the camera view, in its simplest form creating an RGB-D input, although alternative depth representations may be used [19, 18]. However, this provides limited scalability, as each detection inference would only cover a very limited field of view. Conversely, the point cloud input can be enriched by adding color or semantic features [51, 43, 24]. This, however, comes at the expense of losing spatial density, one of the primary advantages of camera sensors. Several methods have instead applied modality-specific feature extractors, which are then fused downstream [24, 29, 10, 55, 28]. None of the above-mentioned approaches operate on either modality in time.

3. 4D-Net

4D-Net proposes an approach to utilize and fuse multi-sensor information, learning the feature representation from these sensors and their mutual combinations. An overview of the approach is shown in Figure 2. In 4D-Net, we consider the point clouds in time (i.e., a sequence of point clouds) and the RGB information, also in time (a sequence of images). We first describe how to handle the raw 3D point clouds and RGB input information streaming in time (Section 3.1) and then describe our main 4D-Net architectures which learn to combine information from both sensor modalities and across dimensions (Section 3.2). We further offer a multi-stream variant of the 4D-Net which achieves further performance improvements (Section 3.3).

Figure 2. 4D-Net Overview. RGB frames and Point Clouds in time are processed producing features, abstracting some dimensions. A connection search learns where and how to fuse the modalities' features together. We use 3D projection to align the PC and RGB features.

3.1. 3D Processing and Processing Data in Time

3.1.1 3D Processing

Our approach uses a learnable pre-processor for the point cloud data; it is applied to the 3D points and their features from the LiDAR response to create output features.
We chose to use PointPillars [25] to generate these features, but other 3D point 'featurising' approaches can be used. PointPillars converts a point cloud into a pseudo-image, which can then be processed by a standard 2D CNN. For clarity, in the derivations below we will be using a 3D X, Y, Z coordinate system, where the Z direction is forward (aligned with the car driving), Y is vertical pointing up and X is horizontal, i.e., we use a left-hand coordinate system (this is the default system used in the Waymo Open Dataset).

Given a point cloud P = {p}, where p is a 3D (x, y, z) point with an associated F-dimensional feature vector (e.g., intensity, elongation), the pseudo-image is created as follows. First, each point is processed by a linear layer, batch norm and ReLU, to obtain a featurized set of 3D points. The points are grouped into a set of pillars in the X, Z plane based on their 3D location and the distances between the points. This gives a point cloud representation with shape (P, N, F), where P is the number of pillars and N is the maximum number of points per pillar. Each of the P pillars is associated with an (x_0, y_0, z_0) location that is the pillar center. The idea is then to further 'featurize' the information in this (P, N, F) representation and then, using the original coordinates, to 'distribute' the features back along the X, Z plane and produce a pseudo-image [25], say of size (X, Z, C_P). Specifically, from (P, N, F), a feature of size (P, C_P) is obtained via learnable layers and pooling, to then get (X, Z, C_P). In effect, PointPillars produces an (X, Z, C_P) feature representation from an (X, Y, Z, F) input for a single PC.

Figure 3. The average point density per voxel illustrates how long-term temporal aggregation, combined with our subsampling strategy, leads to an increased point density, especially at far ranges.

3.1.2 Point Clouds in Time

Point clouds and the subsequent feature creation (e.g., as in Section 3.1.1) are computationally expensive and memory intensive operations. Given a sequence of T point clouds, creating T PointPillar "pseudo-images" and then using a 2D or 3D CNN to process all those frames would be prohibitively expensive [59], limiting its usefulness. Previous work [21] explored using sparse convolutions with LSTMs to handle point clouds in time, where a compressed feature representation is fed recursively to the next frame representations. Instead, we take a simpler approach similar to [8], which however preserves the original feature representation per 3D point, together with a sense of time.

First, the original feature representation is directly merged into the 3D point cloud, together with a feature indicating its timestamp. Specifically, we use the vehicle pose to remove the effect of ego-motion and align the point clouds. Next, we add a time indicator t to the feature of each point: p = [x, y, z, t]. Then, the PointPillar pseudo-image representation is created as before, which also results in a denser representation. While dynamic motion will obviously create a ghost/halo effect, it can in fact be a very useful signal for learning, and can be resolved by the time information. In some circumstances, it is only through motion that distant or poorly discernible objects can be detected.

Point Cloud Subsampling. Importantly, when accumulating points from multiple point clouds, the voxelization step converts all the points into a fixed-size representation based on the grid cell size. This results in a tensor with a fixed size that is padded to N, the maximum number of points. Thus the amount of subsequent compute remains the same, regardless of the number of PCs. If the points exceed N, only N points are randomly sampled (N = 128 throughout). This has the effect of subsampling the accumulated point cloud, but proportionally more points will be sampled in sparser areas than in dense ones. By densifying the point cloud in sparse regions and sparsifying it in dense regions, we distribute compute more efficiently and provide more signal for long-range detection through increased point density at far ranges (e.g., see the point density for 16 PCs in Figure 3). We find this representation to be very effective, resulting in significant improvements over using a single point cloud (as seen later in the ablation in Table 5).
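As a rough illustration of the accumulation and subsampling just described, here is a minimal NumPy sketch that removes ego-motion using the vehicle poses, appends a per-point time indicator t, and randomly subsamples each pillar to at most N points. The function names and the choice to carry only (x, y, z, t) are our own simplifications and are not taken from the released code.

```python
# Minimal sketch of point-cloud-in-time accumulation and per-pillar subsampling.
# Names (accumulate_point_clouds, subsample_pillar) are illustrative only; other
# per-point features (intensity, elongation) would be carried along the same way.
import numpy as np

def accumulate_point_clouds(clouds, poses, current_pose):
    """clouds: list of (M_i, 3) point arrays in their own frames; poses: list of
    4x4 vehicle poses; current_pose: 4x4 pose of the frame we predict boxes for."""
    merged = []
    for t, (pts, pose) in enumerate(zip(clouds, poses)):
        # Remove ego-motion: bring points into the current vehicle frame.
        to_current = np.linalg.inv(current_pose) @ pose
        homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        aligned = (homo @ to_current.T)[:, :3]
        # Append a time indicator t (0 = oldest frame, T-1 = current frame).
        time_col = np.full((len(pts), 1), float(t))
        merged.append(np.concatenate([aligned, time_col], axis=1))
    return np.concatenate(merged, axis=0)            # (sum_i M_i, 4) = [x, y, z, t]

def subsample_pillar(pillar_points, n_max=128, rng=np.random):
    """Randomly keep at most n_max points per pillar; zero-pad otherwise."""
    if len(pillar_points) > n_max:
        idx = rng.choice(len(pillar_points), n_max, replace=False)
        return pillar_points[idx]
    pad = np.zeros((n_max - len(pillar_points), pillar_points.shape[1]))
    return np.concatenate([pillar_points, pad], axis=0)   # (n_max, 4)
```

Because every pillar is capped at the same N, dense nearby regions are thinned while sparse far-away regions keep proportionally more of their points, which is where the long-range gains reported in Table 5 come from.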
3.1.3 RGB and RGB in Time

While point clouds have become the predominant input modality for 3D tasks, RGB information is very valuable, especially at larger distances (e.g., 30+ or 50+ meters) where objects garner fewer points.

Furthermore, images in time are also highly informative, and complementary to both a still image and PCiT. In fact, for challenging detection cases, motion can be a very powerful clue. While motion can be captured in 3D, a purely PC-based method might miss such signals simply because of the sensing sparsity.

RGB frames, unlike PCs, represent a dense feature containing color pixel information for everything in view. Here we take RGB frames as input and use video CNNs to produce the RGB feature maps. In the video setting, we take the T previous frames as input and predict the object bounding boxes in the final frame, the same as for point clouds in time. We process a sequence of RGB frames as a video input. Since runtime is of the essence, we use efficient video representations, Tiny Video Networks [34], for processing. More specifically, through a series of layers, some of which work in the spatial dimensions and some in the temporal one, a set of feature representations in the spatial dimensions is learned. As a result, a feature of shape (X, Y, C_V) is produced, abstracting away the time coordinate of the (X, Y, T, 3) input.
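To make the shape bookkeeping concrete, the following hedged TensorFlow sketch shows a video featurizer in this spirit: a spatial (1x3x3) and a temporal (3x1x1) convolution followed by pooling over the time axis, reducing a (T, X, Y, 3) clip to a single (X, Y, C_V) map. It follows the block pattern listed in Table 6 of the supplementary material, but it is a simplified stand-in, not the exact Tiny Video Network used in the paper.

```python
# Simplified stand-in for a spatiotemporal video featurizer; not the exact model.
import tensorflow as tf

def video_feature_block(frames, channels=32):
    """frames: (batch, T, H, W, 3) float tensor. Returns (batch, H/2, W/2, channels)."""
    # 1x3x3 convolution: spatial only, shared across frames.
    x = tf.keras.layers.Conv3D(channels, (1, 3, 3), padding='same', activation='relu')(frames)
    # 3x1x1 convolution: temporal only, mixing information across neighboring frames.
    x = tf.keras.layers.Conv3D(channels, (3, 1, 1), padding='same', activation='relu')(x)
    # Spatial downsampling, then averaging over the time axis to abstract it away.
    x = tf.keras.layers.AveragePooling3D(pool_size=(1, 2, 2))(x)
    return tf.reduce_mean(x, axis=1)      # (batch, H/2, W/2, channels), i.e. (X, Y, C_V)

# Example: a 16-frame 192x192 clip, as used by the video tower of the main 4D-Net.
clip = tf.zeros([1, 16, 192, 192, 3])
feat = video_feature_block(clip)          # -> (1, 96, 96, 32)
```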
3.2. 4D-Net: Fusing RGB in Time and PC in Time

To combine the RGB information (in time) into the 3D Point Cloud representation (also in time), there are two major considerations: 1) the two sensors need to be geometrically and spatially aligned, and 2) the fusion mechanisms for the features produced by these modalities should ideally be learned from the data.

Our 4D-Net entails both a projection fusion mechanism and a connectivity search to learn where and how to fuse features (Figure 2). Of note is that both representations have abstracted away some 'dimensions' in their features but still contain complementary information: the RGB representation has shape (X, Y, C_V), whereas the PCiT has (X, Z, C_P). Since our end goal is 3D object detection, we chose to fuse from RGB into the point cloud stream, but we note that these approaches could be used to fuse in the other direction as well. Section 3.2.1 and Section 3.2.2 provide the details.

3.2.1 3D Projection

To fuse the RGB into the point cloud, we need to (approximately) align the 3D points with 2D image points. To do this, we assume we have calibrated and synchronized sensors and can therefore define accurate projections. Note that the Waymo Open Dataset provides all calibration and synchronized LiDAR and camera data.

The PointPillar pseudo-image M has shape (X, Z, C_P) and is passed through a backbone network with a ResNet-like structure. After each residual block i, the network feature map M_i has shape (X_i^M, Z_i^M, C_i^M), where each location (in X and Z) corresponds to a pillar. Each pillar p also has an (x_0, y_0, z_0) coordinate representing its center based on the accumulated 3D points. This provides a 3D coordinate for each of the non-empty feature map locations.

The RGB network also uses a backbone to process the video input. Let us assume that after each block, the network produces a feature map R_i with shape (X_i^R, Y_i^R, C_i^R), which is a standard image CNN feature map.

Using projections, we can combine the RGB and point cloud data. Specifically, given a 4x4 homogeneous extrinsic camera matrix E (i.e., the camera location and orientation in the world) and a 4x4 homogeneous intrinsic camera matrix K (E and K are part of the dataset), we can project a 3D point p = (x, y, z, 1) to a 2D point as q = K·(E·p). For each point pillar location, we obtain the 2D point q, which provides an RGB feature for that point as R_i[q_x, q_y] (Footnote 1). This is concatenated to the pillar's feature, e.g.,

M_i[p_x, p_y] = [ M_i[p_x, p_y] | R_i[K·(E·p)] ],  p ∈ P.    (1)

Note that LiDAR data typically covers a full 360 degree surround view, while individual cameras typically have a quite limited horizontal field of view. To account for this, we only obtain RGB features for points which are captured by one of the cameras. For points outside of the image view, we concatenate a vector of zeros. This approach is easily applied to settings with multiple RGB cameras covering different viewpoints: the same RGB CNN is applied to each view, the projection is done per-camera, and the results are added together before concatenation.

Footnote 1: We tried a spatial crop around the point, but found it to be computationally expensive. The CNN's receptive field also provides spatial context.
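As a rough illustration of this projection step, the following NumPy sketch projects the pillar centers with the camera matrices, keeps only pillars that land inside the image, and gathers the corresponding RGB features (zeros otherwise), mirroring Eq. (1). The function names, the explicit perspective divide and the rounding to the nearest feature-map cell are our assumptions for the sketch rather than details taken from the released code.

```python
# Sketch of pillar-center projection and RGB feature gathering (cf. Eq. (1)).
import numpy as np

def project_pillars_to_image(centers, K, E):
    """centers: (P, 3) pillar centers (x0, y0, z0); K, E: 4x4 homogeneous intrinsic
    and extrinsic camera matrices. Returns (P, 2) pixel coordinates and a validity mask."""
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)   # (P, 4)
    cam = (E @ homo.T).T                    # points in the camera frame
    img = (K @ cam.T).T                     # homogeneous image coordinates
    q = img[:, :2] / img[:, 2:3]            # perspective divide -> (u, v)
    in_front = img[:, 2] > 0                # keep only points in front of the camera
    return q, in_front

def gather_rgb_features(rgb_map, q, valid):
    """rgb_map: (H, W, C_R) feature map of one RGB block; q: (P, 2) pixel coordinates
    already scaled to the feature-map resolution. Out-of-view pillars get zeros."""
    H, W, C = rgb_map.shape
    u = np.round(q[:, 0]).astype(int)
    v = np.round(q[:, 1]).astype(int)
    inside = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    feats = np.zeros((len(q), C), dtype=rgb_map.dtype)
    feats[inside] = rgb_map[v[inside], u[inside]]
    return feats      # concatenated to the pillar features downstream, as in Eq. (1)
```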
3.2.2 Connection Architecture Search in 4D

While the above projection will align the two sensors geometrically, it is not immediately obvious what information should be extracted from each, and how the sensor features should interact, for the main task of object detection.

To that end we propose to learn the connections and fusion of these via a light-weight differentiable one-shot architecture search. One-shot differentiable architecture search has been used for strengthening the learned features for image understanding [30] and for video [38, 39]. In our case, we are utilizing it for relating information in 4D, i.e., in 3D and in time, and also for connecting related features across different sensing modalities (the RGB-in-time and point cloud-in-time streams). Of note is that we learn the combination of feature representations at various levels of abstraction for both sensors (Figure 2).

In Figure 4, we illustrate how this architecture search works. Given a set of RGB feature maps {R_i | i ∈ [0, 1, ..., B]} (B being the total number of blocks/feature maps in the RGB network), we can compute the projection of each pillar into the 2D space and obtain a feature vector. This produces a set of feature vectors F = {f_i | i ∈ [0, 1, ..., B]}. We then have a learned weight w, which is a B-dimensional vector. We apply a softmax and then compute Σ_i w_i × f_i to obtain the final feature vector. Thus w learns the connection weights, i.e., which RGB layer to fuse into the PointPillars layer. This is done after each block in the PointPillars network, allowing many connections to be learned (Figure 2).

Figure 4. Connection architecture search: Each PointPillar feature is projected into the CNN feature map space based on the 3D coordinate of each pillar and the given camera matrices. The feature from the 2D location is extracted from each feature map. Learned static connection weights (grey boxes) then combine these and concatenate them with the pillar feature to create the pillar feature map used as input to the next layer. Dynamic connections, shown in green, are produced for each value and are controlled by the features from the point clouds; thus they determine how to fuse in the rest of the features generated by the model at various levels of abstraction. Please see Section 3.2.2.

Dynamic Connections. The above-mentioned mechanism is very powerful as it allows learning the relations between different levels of feature abstraction and different sources of features (e.g., RGB, PC) (Figure 4). Furthermore, as shown in later sections (Section 3.3), it allows for combining multiple computational towers seamlessly without any additional changes.

However, in the autonomous driving domain it is especially important to reliably detect objects at highly variable distances, with modern LiDAR sensors reaching several hundreds of meters of range [1]. This implies that further-away objects will appear smaller in the images, and the most valuable features for detecting them will be in earlier layers, compared to close-by objects. Based on this observation, we modified the connections to be dynamic, inspired by self-attention mechanisms. Specifically, instead of w being a learned weight, we replace w with a linear layer with B outputs, ω, which is applied to the PointPillar feature M_i[p_x, p_y] and generates weights over the B RGB feature maps. ω is followed by a softmax activation function. This allows the network to dynamically select which RGB block to fuse information from, e.g., taking a higher resolution feature from an early layer or a lower resolution feature from a later layer (Figure 4). Since this is done for each pillar individually, the network can learn how and where to select these features based on the input.

3.3. Multi-Stream 4D-Net

Multiple RGB streams. Building on the dynamic connection learning, we propose a Multi-Stream (MS) version of 4D-Net. While the 4D-Net itself already learns to combine the information from two streams, the sparse 3D PCiT and the camera input, we can have more than one RGB input stream (Figure 5). One advantage of the proposed (dynamic) connection learning between features of different modalities is that it is applicable to many input feature sources and is agnostic to where they originate from. For example, we can add a separate tower for processing high-resolution still images, or an additional video tower using a different backbone or a different temporal resolution. This enables a richer set of motion features to be learned and surfaced for combination with the PC features. Note that all of these are combined with the PC (in time) features by the same dynamic fusion proposed above, thus allowing the PCiT stream to dynamically select the RGB features to fuse from all streams. Similarly, it is also possible to introduce additional PC streams.
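The following hedged TensorFlow sketch illustrates the dynamic connection weighting that both the single- and multi-stream 4D-Nets rely on: a per-pillar linear layer predicts B softmax weights over the candidate RGB feature vectors, and their weighted sum is concatenated to the pillar feature. Layer and variable names are ours, and the sketch assumes the projected RGB features have already been mapped to a common channel width; the released code may differ.

```python
# Hedged sketch of dynamic connection weighting over B RGB feature sources.
import tensorflow as tf

class DynamicConnection(tf.keras.layers.Layer):
    """Fuses B candidate RGB feature vectors into each pillar feature."""
    def __init__(self, num_sources):
        super().__init__()
        # One weight per RGB block/stream, predicted per pillar from the PC feature.
        self.weight_layer = tf.keras.layers.Dense(num_sources)

    def call(self, pillar_feat, rgb_feats):
        # pillar_feat: (P, C_P) PointPillar features at this block.
        # rgb_feats:   (P, B, C_R) projected RGB features, one per source block,
        #              assumed to be mapped to a common channel width C_R.
        logits = self.weight_layer(pillar_feat)                 # (P, B)
        w = tf.nn.softmax(logits, axis=-1)                      # per-pillar dynamic weights
        fused = tf.reduce_sum(w[..., tf.newaxis] * rgb_feats, axis=1)   # (P, C_R)
        # The fused RGB feature is concatenated to the pillar feature, as in Eq. (1).
        return tf.concat([pillar_feat, fused], axis=-1)         # (P, C_P + C_R)
```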
Figure 5. Illustration of a multi-stream 4D-Net. It takes point clouds (in time) and still-image and video as input, computes features for them, and learns connection weights between the streams.

Multiple Resolutions. Empirically, we observe that adding an RGB stream benefits recognition of far-away objects the most. Distant objects appear smaller than close objects, suggesting that using higher resolution images will further improve recognition. Additionally, adding RGB inputs at two or more different resolutions will increase the diversity of features available for connection learning.

Thus, in the multi-stream setting, we combine inputs at different resolutions (see Figure 5 for a schematic). Our main Multi-Stream 4D-Net uses 1) one still-image tower at high image resolution (312x312), 2) one video tower at lower resolution (192x192) with 16 frames, and 3) the PCiT, which has aggregated 16 point clouds. The original 4D-Net has the latter two streams only, but uses 12 RGB frames. More streams and more resolutions are explored in the ablations, Section 4.2. Similar to our main 4D-Net, we use a lightweight Tiny Video Network [34], so that the multi-stream 4D-Net is efficient at inference.

Implementation Details. We train the model to minimize a standard cross-entropy loss for classification and an L2 regression loss for the residuals between the anchor boxes and the GT boxes. N = 128 throughout the paper and we use a 224x224x1 output grid. Further implementation and experimental details are included in the supp. material.
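As a rough sketch of this objective, the combined loss could look as follows. The 7-parameter box residual and the anchor matching mask are assumptions for illustration (the paper does not spell them out here); names are ours.

```python
# Hedged sketch of the training objective: classification cross-entropy plus an
# L2 loss on the box residuals of anchors matched to ground-truth boxes.
import tensorflow as tf

def detection_loss(cls_logits, cls_labels, box_pred, box_targets, positive_mask):
    """cls_logits: (A, num_classes); cls_labels: (A,) integer labels;
    box_pred/box_targets: (A, 7) residuals w.r.t. the anchors, e.g. (x, y, z, l, w, h, theta);
    positive_mask: (A,) 1.0 for anchors matched to a ground-truth box, else 0.0."""
    cls_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=cls_labels, logits=cls_logits))
    # L2 regression, counted only for anchors matched to a GT box.
    sq_err = tf.reduce_sum(tf.square(box_pred - box_targets), axis=-1)
    reg_loss = tf.reduce_sum(sq_err * positive_mask) / (tf.reduce_sum(positive_mask) + 1e-6)
    return cls_loss + reg_loss
```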
Multi-stream models alsoreportinferenceruntimes,splitintotimetorunthe4D- canalsoaffordtoreducetheresolutionofthevideostream Net (net) with PC accumulated in time and RGB streams, inputs and obtain equally powerful models. More multi- and the time for point cloud pre-processing (voxelization). streamvariantscanbeexplored. We observe very competitive runtimes, despite processing RGBResolutionandVideo. Inthissectionweexplore muchmoreinformationthanothermethods. Qualitativere- theimpactofRGBresolution(Table42).Weobservepoten- sultsareshowninFigure6andFigure7. tial improvements to the 4D-Net that were not used in the main method and can be leveraged in future work. Some 4.2.AblationStudies improvementsaregainedattheexpenseofruntimeasseen to the right of the table. As expected, better image reso- Thissectionpresentstheablationstudies. Wemakebest lution helps, particularly for detection of far-away objects, efforts to isolate confounding effects and test components e.g.,almost5and12APimprovementsforobjectsbeyond individually,e.g.,byremovingmultiplePCs,orRGB. 50mwhenresolutionisincreasedto512from224and192. Main Contributions and Fusion Approaches. Ta- Another interesting observation is that RGB video helps a ble 2 shows the main ablation experiments, investigating lot. Forexample,increasingresolutionfrom224to512im- key components of the 4D-Net. Starting from the main provesstill-imageperformanceby1%forcloseobjects,but 4D-Net approach (first line), in the top lines we evaluate keepingthesame224x224resolutionforavideoinputgets the approach when the key contributions are removed in- even higher performance, an improvement by 2.6%. Nat- dividually or jointly. As seen, dynamic learning on top of urally, higher resolution videos than the ones shown will 3D projections is the most beneficial and both are impor- improvebothmetricsatadditionallatencycost. tant. At the bottom of the table, we show performance of Leveragingmorepowerfulvideomethodsisalsopossi- theapproachwithdifferentmodalitiesenabledordisabled, ble,althoughnotparticularlyworthwhile,astheygainonly for direct comparison. We notice interesting phenomena: alittleinaccuracy, butathighcomputationalcost(seeTa- usingmultiplePCintimeisdefinitelyhelpful,butasingle ble 4, bottom). Specifically, we compare two of the most RGBimagecanboostuptheperformanceofbothasingle popularvideomodels: the3D-ResNet[49,23]andAssem- PCandmultiplePCsmuchmoresignificantly,withthepro- bleNet [38]. As seen, they provide more accurate results, posed projection and connection learning. Similarly, RGB especiallythepowerfulAssembleNet,butareveryslow. intimecanboostasinglePCvariantsignificantly,too.Thus Point Clouds in Time. Table 5 shows the effect of us- a combination of the two sensors, at least one of which is ingpointcloudsintime.Asexpected,multiplepointclouds in time is important. The best result comes from multiple improveperformancenotably.Using16PCs,about1.6sec- RGBsandPCintime,i.e.,spanningall4dimensions. The onds of history, seems to be optimal and also matches ob- supp.materialhasdetailsofthebaselinesusedinlieuofthe servationthatdensitysaturatesaround16frames(Figure3). proposedcomponents. Multi-Stream4D-NetVariants. Table3showsthere- 2TheAPislowerthanthemain4D-NetasweuseasinglePCandno sultsofusingavariousnumberofadditionalinputstreams dynamicconnections,toshowtheeffectsofimageorvideomodels. 
Method    Proj   Conn   Dyn    PC     RGB     AP L1   AP L2   AP 30m   AP 30-50m   AP 50m+
4D-Net    ✓      ✓      ✓      PC+T   RGB+T   73.6    70.6    80.7     74.3        56.8
          ✓      ✓             PC+T   RGB+T   73.1    70.1    80.5     73.9        56.2
          ✓                    PC+T   RGB+T   72.6    69.7    79.6     72.8        54.6
                               PC+T   RGB+T   62.5    58.9    70.1     57.8        42.5
          ✓      ✓             PC+T   RGB     72.6    69.7    79.8     73.6        55.8
          ✓      ✓             PC     RGB+T   65.4    63.9    77.5     67.4        48.5
          ✓      ✓             PC     RGB     64.3    63.0    74.9     65.1        47.2
          ✓                    PC     RGB     62.5    61.5    71.5     61.5        41.0
          ✓                    PC     RGB     56.7    53.6    66.2     52.5        37.5
                               PC+T           60.5    55.7    68.4     57.6        38.1
                               PC             55.7    52.8    65.0     51.3        35.4
Table 2. Ablation results for the 4D-Net. From the full 4D-Net variant, components are removed one at a time, to demonstrate their effect. Proj is the proposed projection method, Conn is the connectivity search and Dyn is the dynamic connection method. We also ablate with using a single PC or a single RGB input. PC+T denotes Point Clouds in Time, RGB+T is RGB frames in time. 16 PCiT and 16 RGB of 224x224 size are used (except the top two, with 12 RGB of 192x192). This is a single 4D-Net. See Table 3 for multi-stream 4D-Nets.

Method                                                             AP     30m    30-50m   50m+   Runtime
4D-Net (192x192 12f video)                                         73.6   80.7   74.3     56.8   142ms
4D-Net MS (192x192 16f video + 312x312 image)                      74.5   80.9   74.7     57.6   203ms
4D-Net MS-1 (192x192 16f video + 224x224 image)                    73.4   81.2   72.5     56.5   162ms
4D-Net MS-2 (224x224 16f video + 224x224 image)                    73.8   80.5   73.7     56.9   171ms
4D-Net MS-3 (128x128 16f video + 192x192 image + 312x312 image)    74.2   81.5   72.9     57.8   225ms
Table 3. Ablation results for Multi-Stream (MS) models. AP shown. The top portion shows the main 4D-Net and the Multi-Stream version from Table 1. MS-1 and MS-2 include an additional image stream but at different resolutions. MS-3 has two additional image streams. It shows one can significantly reduce the input video resolution while achieving top results. All video models use 16 frames except 4D-Net, which has 12. Voxel pre-processing is not included in the runtime, as in Table 1.

Image resolutions    AP     30m    30-50m   50m+   Time
192x192 1 fr.        60.8   73.6   60.7     40.4   82
224x224 1 fr.        64.3   74.9   65.1     47.2   97
312x312 1 fr.        67.3   75.7   66.4     49.5   142
512x512 1 fr.        68.2   75.9   67.5     52.4   297
192x192 12-fr.       64.2   75.2   65.3     46.2   109
224x224 16-fr.       65.4   77.5   67.4     48.5   115
224x224 3DRes        66.4   77.8   68.6     49.5   254
224x224 Assm         66.8   79.1   69.2     50.7   502
Table 4. Ablations for input image resolutions for a single-frame RGB tower and a single point cloud. AP (L1) shown. For comparison, models with 12-frame and 16-frame video input are given, as well as the stronger but much slower methods 3D ResNet [49] and AssembleNet [38], both with 32 frames. All are with a single point cloud, which reduces compute as well. Time is in ms.

Number of PC   AP     AP 30m   AP 30-50m   AP 50m+
1 PC           55.7   65.0     51.3        35.4
2 PC           56.3   66.1     52.5        36.4
4 PC           57.8   66.9     54.6        36.7
8 PC           59.4   67.6     56.4        37.8
16 PC          60.5   68.4     57.6        38.1
32 PC          60.3   67.4     56.4        38.4
Table 5. Ablations for Point Clouds (PC) in time. No RGB inputs are used. One PC (top row) is effectively the PointPillars model.

5. Conclusions and Future Work

We present 4D-Net, which proposes a new approach to combine underutilized RGB streams with Point-Cloud-in-time information. We demonstrate improved state-of-the-art performance and competitive inference runtimes, despite using 4D sensing and both modalities in time. Without loss of generality, the same approach can be extended to other streams of RGB images, e.g., the side cameras providing critical information for highly occluded objects, or to diverse learnable feature representations for PCs or images, or to other sensors. While this work is demonstrated for the challenging problem of aligning different sensors for autonomous driving, which span the 4D, the proposed approach can be used for various related modalities which capture different aspects of the same domain: aligning audio and video data, or text and imagery.
References

[1] Introducing the 5th-generation Waymo Driver: Informed by experience, designed for scale, engineered to tackle more environments. https://blog.waymo.com/2020/03/introducing-5th-generation-waymo-driver.html, 2020.
[2] Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke, Abhijit Ogale, and Dave Ferguson. Real-time pedestrian detection with deep network cascade. In British Machine Vision Conference, 2015.
[3] Rodrigo Benenson, Markus Mathias, Radu Timofte, and Luc Van Gool. Pedestrian detection at 100 frames per second. In CVPR, 2012.
[4] Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Ten years of pedestrian detection, what have we learned? In ECCV Workshop on Computer Vision for Road Scene Understanding Autonomous Driving, 2014.
[5] Alex Bewley, Pei Sun, Thomas Mensink, Dragomir Anguelov, and Cristian Sminchisescu. Range conditioned dilated convolutions for scale invariant 3d object detection. In CoRL, 2020.
[6] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3d object detection in monocular video. In European Conference on Computer Vision, pages 135–152. Springer, 2020.
[7] Garrick Brazil, Xi Yin, and Xiaoming Liu. Illuminating pedestrians via simultaneous detection and segmentation. In ICCV, 2017.
[8] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multi-modal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[9] Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, 2016.
[10] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
[11] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In CVPR, 2019.
[12] Xuangeng Chu, Anlin Zheng, Xiangyu Zhang, and Jian Sun. Detection in crowded scenes: One proposal, multiple predictions. In CVPR, 2020.
[13] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[14] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In CVPR, 2009.
[15] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3Deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355–1361. IEEE, 2017.
[16] Markus Enzweiler and Dariu M. Gavrila. A multilevel mixture-of-experts framework for pedestrian classification. In IEEE Transactions on Image Processing, 2011.
[17] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[18] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[19] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
[20] Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. What you see is what you get: Exploiting visibility for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11001–11009, 2020.
[21] Rui Huang, Wanyue Zhang, Abhijit Kundu, Caroline Pantofaru, David A. Ross, Thomas Funkhouser, and Alireza Fathi. An LSTM approach to temporal 3d object detection in lidar point clouds. In European Conference on Computer Vision, 2020.
[22] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. arXiv preprint arXiv:1506.02025, 2015.
[23] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Learning spatio-temporal features with 3d residual networks for action recognition. In ICCV ChaLearn Looking at People Workshop, 2017.
[24] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L. Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.
[25] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
[26] Bo Li. 3d fully convolutional network for vehicle detection in point cloud. In IROS, 2017.
[27] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3d lidar using fully convolutional network. In Proceedings of Robotics: Science and Systems (RSS), 2016.
[28] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-task multi-sensor fusion for 3d object detection. In CVPR, 2019.
[29] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 641–656, 2018.
[30] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
[31] Gregory P. Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K. Wellington. LaserNet: An efficient probabilistic 3d object detector for autonomous driving. In CVPR, 2019.
[32] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, Zhifeng Chen, Jonathon Shlens, and Vijay Vasudevan. StarNet: Targeted computation for object detection in point clouds. CoRR:1908.11069, 2019.
[33] Wanli Ouyang and Xiaogang Wang. A discriminative deep model for pedestrian detection with occlusion handling. In CVPR, 2012.
[34] AJ Piergiovanni, Anelia Angelova, and Michael S. Ryoo. Tiny video networks: Architecture search for efficient video models. In ICML Workshop on Automated Machine Learning (AutoML), 2020.
[35] Cristiano Premebida, Joao Carreira, Jorge Batista, and Urbano Nunes. Pedestrian detection combining RGB and dense lidar data. In IROS, 2014.
[36] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3d object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
[37] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
[38] Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, and Anelia Angelova. AssembleNet: Searching for multi-stream neural connectivity in video architectures. In ICLR, 2020.
[39] Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, and Anelia Angelova. AssembleNet++: Assembling modality representations via attention connections. In European Conference on Computer Vision, 2020.
[40] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-voxel feature set abstraction for 3d object detection. In CVPR, 2020.
[41] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pages 16–20, 2019.
[42] Martin Simon, Stefan Milz, Karl Amende, and Horst-Michael Gross. Complex-YOLO: Real-time 3d object detection on point clouds. In CVPR, 2018.
[43] Shuran Song and Jianxiong Xiao. Sliding shapes for 3d object detection in depth images. In European Conference on Computer Vision, pages 634–651. Springer, 2014.
[44] Xiaolin Song, Kaili Zhao, Wen-Sheng Chu, Honggang Zhang, and Jun Guo. Progressive refinement network for occluded pedestrian detection. In European Conference on Computer Vision, 2020.
[45] Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. End-to-end people detection in crowded scenes. In CVPR, 2016.
[46] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In CVPR, 2018.
[47] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
[48] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo Open Dataset. In CVPR, 2020.
[49] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450–6459, 2018.
[50] Paul Viola, Michael Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. In CVPR, 2003.
[51] Sourabh Vora, Alex H. Lang, Bassam Helou, and Oscar Beijbom. PointPainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4604–4612, 2020.
[52] Dominic Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, 2015.
[53] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Tom Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. In European Conference on Computer Vision, 2020.
[54] Jialian Wu, Chunluan Zhou, Ming Yang, Qian Zhang, Yuan Li, and Junsong Yuan. Temporal-context enhanced detection of heavily occluded pedestrians. In CVPR, 2020.
[55] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. PointFusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2018.
[56] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[57] Bin Yang, Ming Liang, and Raquel Urtasun. HDNET: Exploiting HD maps for 3d object detection. In Conference on Robot Learning (CoRL), 2018.
[58] Bin Yang, Wenjie Luo, and Raquel Urtasun. PIXOR: Real-time 3d object detection from point clouds. In CVPR, 2018.
[59] Zhishuai Zhang, Jiyang Gao, Junhua Mao, Yukai Liu, Dragomir Anguelov, and Congcong Li. STINet: Spatio-temporal-interactive network for pedestrian detection and trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11346–11355, 2020.
[60] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning (CoRL), 2019.
[61] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.

Block     Conv Size              Channels   Repeat   Output Size
Input     -                      -          -        16x224x224
Block 1   1x3x3 + 3x1x1 conv     32         1        8x112x112
Block 2   1x3x3 + 3x1x1 conv     64         1        4x56x56
Block 3   1x3x3 conv             128        4        2x28x28
Block 4   1x3x3 conv             256        4        2x14x14
Table 6. RGB Video Network structure used in 4D-Net. Note that the sizes are shown assuming 16 frames at 224x224 input size. For networks that used smaller inputs, the output sizes at each step would be smaller, following the same scaling. Average pooling was used after the convolution to reduce the spatial size. The first two blocks apply both spatial and temporal convolutions, in that order.

6. Implementation Details

The models were implemented in TensorFlow. We trained for 120,000 iterations using a batch size of 256, split across 8 devices. The learning rate was set to 0.0015, using a linear warmup for 6000 steps followed by a cosine decay schedule.

The anchor boxes had a size of [4.7, 2.1, 1.7], and 2 rotations (0 and 45 degrees) were used at each feature map location. For the PointPillar pseudo-image creation, we used a grid size of (224, 224, 1), an x-range of (-74.88, 74.88) and the same for the y-range. The z-range was (-5, 5). The max number of points per cell, N = 128. We used 10,000 pillars.

We applied data augmentation to the point clouds (random 3D rotations and flips). The camera matrices were also updated based on the augmentations so that the projections would still apply. No augmentation was used on the RGB streams.

The PointPillars network used a feature dimension of 64 for the input points. The created pseudo-image had 224x224 shape. This was followed by 3 convolutional blocks with 4, 6, and 6 repeats. Each block consisted of a convolution, batch norm and ReLU activation. This was followed by 3 deconvolutional layers which generate the predictions.

The RGB single-frame network is a standard ResNet-18. The video network is based on Tiny Video Networks. Specifically, we use a network that consists of 6 residual blocks; the structure is outlined in Table 6. Note that the first two blocks apply both spatial and temporal convolutions to the data (in that order).

We trained on the Waymo Open Dataset, which consists of 1950 segments that are each 20 seconds long (about 200 frames), a total of 390,000 frames. The LiDAR data is processed into point clouds grouped by timestamp and aligned with the RGB frames. In most experiments, we take sequences of 16 frames as input and predict 3D boxes for the last frame. We evaluate using the provided Waymo metrics library. Before NMS, we filter out boxes with probability less than 0.4, boxes larger than 30m in length and 5m in width, and boxes smaller than 0.5m in length and width.

7. Additional experimental results

Fusion Methods. In the main paper, we focused on projection as the main method to fuse RGB and point cloud data. Here, we also compare to several other methods. The results are shown in Table 7.

Basic RGB. Here we flatten the RGB image feature output into a 1-D tensor and concatenate it to the point cloud feature. This loses all spatial information and essentially puts the entire image into each point.

Spatial Avg Pooling. As another baseline, we apply average spatial pooling to the image-based features R_i, obtaining an F_i^R-dimensional feature vector. We then concatenate this to each feature in the PC-based features M_i, resulting in an (X_i^M, Z_i^M, F_i^M + F_i^R) feature map. This is then passed through the remaining CNN for classification. This provides the point cloud stream with some RGB information, but it has no spatial information.

Spatial Transformer. We also tried using a spatial transformer [22] to crop regions around each projected point to append to the PC feature. However, despite small improvements, we found this to be extremely slow due to taking many spatial crops with the transformer.

These baselines are compared with the proposed Projection method, which is in Table 2 of the main paper. The experiments are conducted in the same conditions as the Projection method in the main paper. The Basic PointPillars (also in Table 2 of the main paper) does not have an RGB input and is included for reference only.

Method                AP L1   AP L2   AP 30m   AP 30-50m   AP 50m+
Base PointPillars     55.7    52.8    65.0     51.3        35.4
RGB                   56.7    53.6    66.2     52.5        37.5
Spatial Avg           58.9    57.6    69.7     56.1        39.6
Spatial Transformer   59.5    58.2    70.0     54.7        43.1
Projection            64.3    63.0    74.9     65.1        47.2
Table 7. Comparison of different spatial fusion methods.
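For concreteness, here is a hedged TensorFlow sketch of the Spatial Avg Pooling baseline described above: the RGB feature map is globally average-pooled and the resulting vector is tiled over every pillar location, so no spatial correspondence is used. Names are illustrative only.

```python
# Sketch of the Spatial Avg Pooling fusion baseline (no spatial correspondence).
import tensorflow as tf

def spatial_avg_fusion(pc_feat, rgb_feat):
    """pc_feat: (B, X, Z, F_M) pillar feature map; rgb_feat: (B, H, W, F_R) RGB features.
    Returns (B, X, Z, F_M + F_R)."""
    pooled = tf.reduce_mean(rgb_feat, axis=[1, 2])                        # (B, F_R)
    x = tf.shape(pc_feat)[1]
    z = tf.shape(pc_feat)[2]
    # Broadcast the pooled RGB vector to every pillar location before concatenating.
    tiled = tf.tile(pooled[:, tf.newaxis, tf.newaxis, :], tf.stack([1, x, z, 1]))
    return tf.concat([pc_feat, tiled], axis=-1)
```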
8. Additional Visualizations

In Figure 8 we show more visualizations of the predictions of the 4D-Net on the Waymo Open Dataset.

Figure 8. 4D-Net predictions on a scene in the Waymo Open Dataset [47]. Individual instances are shown in green (here) or in different colors in the figures below. Red boxes indicate errors (dashed lines: FN, solid lines: FP). The front camera (central image) is the only one used in our work presently; the others are included for visualization purposes. Note that any misalignments in the camera view are due to projection, not inaccuracies in the predictions.

Figure 9. Failure case examples: In these two challenging cases, the central image (stream) is not sufficient and a vehicle is missed.