Range Conditioned Dilated Convolutions for Scale Invariant 3D Object Detection

Alex Bewley* (Google Research)    Pei Sun* (Waymo LLC)
Thomas Mensink (Google Research)    Dragomir Anguelov (Waymo LLC)    Cristian Sminchisescu (Google Research)

*Indicates equal contribution. Correspondence: {bewley, peis}@google.com
4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.

Abstract: This paper presents a novel 3D object detection framework that processes LiDAR data directly on its native representation: range images. Benefiting from the compactness of range images, 2D convolutions can efficiently process dense LiDAR data of a scene. To overcome scale sensitivity in this perspective view, a novel range-conditioned dilation (RCD) layer is proposed to dynamically adjust a continuous dilation rate as a function of the measured range. Furthermore, localized soft range gating combined with a 3D box-refinement stage improves robustness in occluded areas and produces overall more accurate bounding box predictions. On the public large-scale Waymo Open Dataset, our method sets a new baseline for range-based 3D detection, outperforming multiview and voxel-based methods over all ranges, with unparalleled performance at long range detection.

Keywords: Autonomous Driving, 3D Detection, Range Image.

Figure 1: Top: Reference RGB images of a scene, for illustration purposes only. Bottom: Range image showing the dynamic sampling of our proposed RCD layer at selected positions. Here the measured range at the yellow + is used to govern the scale of the local receptive field towards a geometrically consistent sample density at any range.

1 Introduction

One of the most exciting opportunities at the intersection of robotics and machine learning is autonomous driving, where the detection of objects is of critical importance. Catalyzed by the proliferation of high-density LiDAR sensors, algorithms for 3D detection from pointcloud data have gained significant attention in recent years. Several meta-architectures are used for addressing 3D object detection. Firstly, voxelization methods [1, 2, 3] typically bin sparse Cartesian coordinates into discrete voxels and later process them with subsequent 3D convolutions. While such approaches perform well in practice, they are impeded by the memory and computational demands of 3D convolutions in large scenes. A related meta-architecture projects the sparse points into a birds-eye view (BEV) [4, 5] in order to reduce the scene back to a 2D space while maintaining scale invariance, at the price of information lost through quantization. Other meta-architectures are built upon the PointNet framework [6, 7] but suffer from issues around point sparsity at long range and often require inefficient custom operations for defining point neighborhoods [8].

This paper focuses on a less explored alternative (in the context of 3D detection) using range images, which exploits the intrinsic 2.5D manifold structure [9, 10] of raw 3D point data in its native spherical coordinate form, enabling efficient and direct 3D object detection. This range image representation is characterised as 3D Cartesian points projected onto unique pixels in a 2D spherical image, where range is encoded as pixel values and the row and column indices correspond to inclination and azimuth angles respectively. Operating on range images enjoys the benefits of applying mature 2D convolutional architectures and is naturally efficient due to the intrinsically compact representation [11]. Crucially, it does not suffer from the issue of sparsity at long range.
However, known challenges for learning, such as scale variation and occlusion, need further consideration. This paper addresses these issues by proposing a novel convolutional layer with a scale-aware dilation rate for efficient reuse of filter weights at different scales. This directly leverages the measured distance in range images to compensate for the corresponding scale change. Combined with soft range-based gating, both scale and occlusion are appropriately handled within this framework. Occlusion is further addressed through the use of a second stage local box refinement module.

In this work, we present an efficient range image-based two-stage 3D object detector with the following key contributions. Firstly, a novel range conditioned dilated (RCD) convolutional operator is introduced that is capable of dynamically adjusting the local receptive field to provide a consistent scale relative to the convolutional kernel at any distance (see Figure 1). Second, a region convolutional neural network (RCNN) based second stage network is investigated in the context of range image-based 3D object detection. Finally, a new baseline is set for range image-based 3D object detection on a public dataset [12]. The introduced RCD based model performs especially well at long ranges (large distances), where voxel and sparse-point cloud based approaches suffer from point sparsity issues. We believe this is the first work to combine a range image-based network with an RCNN second stage for 3D object detection.

Figure 2: An overview of the range-conditioned dilation block detailing the inputs and outputs of the various modules (illustrated with white boxes). The sampler is solely responsible for the spatial processing of the input tensor, where the receptive field is driven by the input range image.

2 Related Work

2.1 3D LiDAR Detection

While many works combine color images with LiDAR data [13, 14, 15], here we restrict our review to works that only process 3D LiDAR data.

Rasterized and Voxel methods. A popular way to do 3D object detection is to first project the points to a birds-eye view (BEV) and construct a 2D multi-channel image. The image is then processed by a 2D CNN to produce either BEV or 3D boxes. The transformation process is usually hand-crafted; selected works include MV3D [4], PIXOR [16] and Complex-YOLO [17]. VoxelNet [3] divides the point cloud into a 3D voxel grid and uses a PointNet-like network [6] to learn an embedding of the points inside each voxel. PointPillars [5], a compute-efficient method that divides the point cloud into 3D pillars and then extracts features similarly to VoxelNet [3], is the most prolific detector in this category and serves as a comparison in our experiments.

Point based methods. Another paradigm is point based detection, which processes the raw point cloud with point cloud feature extraction methods like PointNet++ [7] or sparse convolution [18], and then regresses 3D boxes in either a downsampled BEV view or the 3D point view directly. Some representative works are PointRCNN [19], PV-RCNN [20], STD [21] and SA-SSD [22], which uses pointwise supervision for training a voxel based backbone. Our method is benchmarked against PV-RCNN, which is currently the top detector on the KITTI dataset and is also benchmarked on the Waymo Open Dataset [12].

Range image based methods. The range image is compact and does not suffer from the sparsity related issues that are the main challenge when developing 3D algorithms. This representation is under-explored because (a) range image based detectors require more data to train [11], and (b) generating high quality range images is non-trivial without knowing raw sensor information such as the laser scan pattern and the relative sensor position at each laser shot. Both of these are addressed by the Waymo Open Dataset [12]. The primary representative work is LaserNet [11], which benchmarks on a private dataset.
Range image based detection algorithms need to deal with scale variance (near range objects are larger) and occlusions.

2.2 Adaptive and Dilated Convolutions

The seminal work of Jaderberg et al. on transforming the input signal using spatial transformers [23] has led to several subsequent methods. For example, for dilated convolutions, the dilation rate could be learned per filter and layer [24], or related to the size of the 3D environment using RGBD as input [25]. Wang et al. [26] use depth as a constant weighting factor of the influence of neighboring pixels in convolutional layers and max-pooling layers. More generically, dynamic functions could be learned to generate the weights of the convolutional layer conditioned on its input values [27, 28, 29, 30]. Ding et al. [30] propose a dynamic filter framework which simulates a dilated convolution with integer shifts to the feature maps; these are then combined using weights from RGB image features. Here a fixed set of dilations is chosen a priori, similar to the atrous spatial pyramid pooling (ASPP) in DeepLab [31]. In contrast, our method takes advantage of the available range observations to rescale the dilation rate, enabling the reuse of kernel weights across multiple scales. This can be viewed as an extension of the 1D distance based input filtering of Beyer et al. [32] to 3D point clouds represented as a range image. Furthermore, our sampler is continuous and adapts over the course of training, mitigating the need to select a set of fixed dilation rates.

3 Range Condition to Learn Scale Invariant Features

This section describes the proposed range conditioned dilated convolutional block. This block can be placed anywhere in a larger network where a range image is available.

3.1 Range Conditioned Dilated Convolution

The RCD block accounts for the scale variation when observing objects at different distances by dynamically adjusting the spatial extent of a convolution via its dilation rate. The dilation is scaled using a continuous function of the measured range, requiring the replacement of the discrete pixel index lookup (used in regular kernels) with a form of spatial sampling that is simultaneously sparse, local and conditioned on the observed range. As the dilation rate is continuous as opposed to integer-valued, a specialization of the spatial transformer [23] is adapted to adjust the spatial scale of the sampling pattern. This sampling is applied densely to every pixel to create a form of deformable convolution [31, 33] that is conditioned on the input range. This section describes the RCD applied to the 2D range images relevant for this paper; however, without loss of generality the equivalent operations can be applied to inputs of other dimensionality. Figure 2 provides a high-level overview of the RCD block, with the individual components detailed below.

Pattern Transformer: takes in a range image R ∈ ℝ^{H×W} with height H and width W. Internally, parameters in the form of N sparse 2D points G ∈ ℝ^{N×2} are maintained to represent the relative spatial sampling pattern for the RCD convolutional kernel. The sampling pattern G is a set of learnable parameters, initialized to be a uniform grid with mean (0, 0). This relative sampling pattern G is shared for all pixel locations and then individually transformed using the input range image as follows:

S = σ(R, λ) · G + P,    (1)

where (via broadcasting¹) S ∈ ℝ^{H×W×N×2}, P ∈ ℝ^{H×W×2} represents pixel coordinates, and σ(R, λ) is the following trigonometric function applied to all range values r_i = R(i), i ∈ P:

σ(r_i, λ) = arctan(λ / r_i),    (2)

where λ ∈ ℝ⁺ is a learnable scalar parameter representing the nominal width of the cross-sectional area covered by the receptive field at any range r_i. The dilation multiplier σ(r_i, λ) is visualized for several values of λ in Figure 3.
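To make the pattern transformer concrete, the following minimal NumPy sketch (our own illustration, not the released implementation) computes the dilation multiplier of Equation 2 and the per-pixel sample coordinates of Equation 1. The radian-to-pixel conversion via an assumed angular resolution, and all constants in the example, are placeholders; in the actual layer λ, G and the downstream kernel weights are trained jointly.

```python
import numpy as np

def rcd_sample_coords(range_img, pattern, lam, rad_per_px=0.003):
    """Sketch of Eqs. (1)-(2): per-pixel sampling coordinates for the RCD kernel.

    range_img:  (H, W) measured range in meters.
    pattern:    (N, 2) relative offsets G (learnable; initialized as a uniform grid).
    lam:        nominal object width lambda in meters (learnable scalar).
    rad_per_px: assumed LiDAR angular resolution used to convert radians to
                pixels (see the Figure 3 caption); placeholder value.
    """
    H, W = range_img.shape
    # Eq. (2): continuous dilation multiplier, in radians.
    sigma = np.arctan2(lam, range_img)                      # (H, W)
    sigma_px = sigma / rad_per_px                           # radians -> pixels

    # Pixel-center coordinates P from Eq. (1).
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    P = np.stack([rows, cols], axis=-1).astype(np.float32)  # (H, W, 2)

    # Eq. (1): S = sigma(R, lambda) * G + P, broadcast over the N offsets.
    S = sigma_px[..., None, None] * pattern[None, None] + P[:, :, None, :]
    return S                                                # (H, W, N, 2)

# Example: an 8x8 uniform grid pattern, as used for initialization in the paper.
G = np.stack(np.meshgrid(np.linspace(-3.5, 3.5, 8),
                         np.linspace(-3.5, 3.5, 8)), -1).reshape(-1, 2)
R = np.full((64, 128), 20.0)          # toy range image: everything at 20 m
coords = rcd_sample_coords(R, G, lam=1.0)
print(coords.shape)                   # (64, 128, 64, 2)
```

The returned coordinates are the locations at which the sampler (described next) gathers features, so pixels observing distant surfaces automatically receive a tighter sampling pattern than nearby ones.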
Figure 3: The rate of the convolutional filter dilation as a function of both the distance from the sensor r (m) and the nominal object size λ in meters (curves shown for λ ∈ {0.2, 0.5, 1.0, 2.0, 5.0, 10.0}). Note the multiplier is in units of radians and should be adjusted to pixels using the appropriate LiDAR angular resolution.

Sampler: At the core of the RCD block is a sampler that is responsible for gathering spatial information from its input, analogous to dilated kernel sampling, ready for an inner product with kernel weights as in conventional convolutions. Given an input feature tensor X_s ∈ ℝ^{H×W×C}, input sampling is performed N times for all pixel locations to produce a resampled tensor X̂_s ∈ ℝ^{H×W×N×C}, where C is the dimensionality of the pixel-wise features in X_s. While any sampling kernel can be applied, bilinear sampling is chosen for its efficiency [27], and its spatial partial derivatives [23] allow loss gradients to flow back to the size and spatial parameters λ and G.

The coordinates contained in S may include samples extending beyond the width or height of the input range image. Samples extending above and below the image boundaries are clamped to the first and last row respectively. For samples beyond the left and right boundaries, horizontal angular wrapping is performed, taking advantage of the range image's 360° view.

In parallel to scaling the receptive field, the range is also used to weight the magnitude of the sampled features. This mechanism serves as a form of soft range gating (SRG) to prevent down-stream training difficulties caused by distractors from near-occlusion, where neighboring pixels have a substantial difference in range. The weighting for SRG follows a Gaussian distribution, where for a given pixel location i with corresponding sample location j ∈ S(i), a feature masking weight is computed as:

x̄_ij = x̂_ij N(r̂_j ; r_i, γ),    (3)

where N denotes the Gaussian probability density function, evaluated at the range of the integer spatial location r_i, with mean r̂_j = R(j) being the bilinearly interpolated range value at the sample location j. The variance γ is a learnable parameter that controls the length-scale of the soft range gate (initialized to 1 meter). This decreases the importance of distant points, which are likely to belong to a different object than the point at the center of the convolution.

¹Singular tensor shapes are omitted for brevity.

Figure 4: Two stage design: the input range images (size 64×2650×8) are first provided to the RCD-RPN, which generates a 3D box proposal corresponding to each point. High scoring proposals (red boxes) are sent to the second 3D RCNN stage for refinement.

Pointwise Convolutions: Also known as 1×1 convolutions [34], these are used in multiple parts of the RCD block; see Figure 2 for their position within the block. Firstly, two pointwise convolutions (PConv) partition the input tensor for the sampler input and for concatenation with its output. This arrangement is inspired by the two stream hypothesis [35], where the sampler with its input PConv and the pass-through PConv are responsible for spatial and recognition processes respectively. On a practical note, the PConv feeding into the spatial sampler projects the input into lower dimensional features, significantly reducing the computation and memory requirements (in all RCD experiments C = 3). The output of the sampler is reshaped to ℝ^{H×W×NC} and then concatenated along the last dimension with the result of the pass-through PConv. The final PConv completes the convolutional operation, as its input contains the dilated resampling of the input and is reshaped to pack the spatial samples into the channel dimension.
The weights in this final PConv are essentially the weights of the RCD convolution, making the final output equal to:

h_i = f(W_f^s x_i^s + W_f^p x_i^p),    (4)

where x_i^s, x_i^p are the flattened spatially sampled features and the pass-through features respectively at pixel location i, and f(·) represents Layer Normalization [37] followed by the exponential linear unit (ELU) non-linearity [36].

4 Two Stage Detection Network

The RCD block is embedded within the region proposal network (RPN) stage of a two stage detection network, see Figure 4. The first stage is an adaptation of the LaserNet [11] architecture with RCD blocks, containing both a foreground classification and a box regressor head. The second stage refines high-scoring box proposals, similar in spirit to [19, 38].

4.1 First stage: RCD-RPN

The first stage is a fully convolutional deep layer aggregation network architecture [39, 11], with RCD blocks replacing the initial convolutional layers at multiple scales (see supplementary). To achieve a larger receptive field and faster compute, we downsample more aggressively along the horizontal axis only, with a [1, ·] pooling kernel. This befits the typical structure of range images with few rows and many columns. The RCD-RPN network yields four outputs per point: a foreground classification score, two additional scores representing the probability that its top and left neighboring pixels lie on the same object, and a predicted 3D box represented by a 7D vector (x, y, z, h, w, l, θ).

The focal loss [40], L_f(p_i) = −α_i (1 − p_i)^γ log(p_i), is employed, where p_i = s_i if i is positive and p_i = 1 − s_i otherwise, to handle class imbalance (using α_i) and to focus on difficult boxes (using the focus parameter γ = 2), for the three classification scores:

L_f = (1 / |P|) Σ_i [L_f(p_i) + L_f(p_i^t) + L_f(p_i^l)],    (5)

where |P| is the number of valid points in the range image, and p_i, p_i^t, p_i^l are the scores for the foreground class and the top and left similarity classification respectively.

For regressing towards the 3D box parameters, the bin-loss (L_bin) [19] divides the distance to the center point (x, y, z) and the heading of the ground truth box into bins and performs bin classification first, followed by a regression within each bin (see [19] for more details). It has been shown that the bin-loss converges faster and achieves higher recall than a regression based loss. For each point i in the range image, we use its predicted 3D box b_i as follows:

L_b = (1 / N) Σ_i (1 / n_i) L_bin(b_i, b_j),    (6)

where N is the number of ground-truth boxes, the sum is over all pixels in the range image, and n_i is the number of points in the target ground-truth box b_j that contains point i.

4.2 Second stage: 3D RCNN

The second stage refines the initial box proposals by using an RCNN stage similar to [41, 19, 38]. In general a two stage design greatly improves box prediction accuracy and further mitigates the occlusion effect. For each box, the raw points are extracted, and the per point feature embeddings, predicted box parameters and semantic classification scores are reused from RCD-RPN. Each box is transformed into the canonical box frame [19] and divided into a fixed 3D grid (12×8×6). All the features in each grid cell are pooled to yield a single feature per grid cell. For the semantic features (points, classification score and RCD-RPN box parameters) average pooling is used, while the feature embeddings are max pooled. Then a 3D convolution layer, followed by downsampling and a fully connected layer, generates the final box parameters and classification score.

During training, it is important to sub-sample boxes from the first stage to have an efficient training pipeline. First, the range image is divided into a top and bottom half, and per column the highest scoring box proposal is kept. This coarse partitioning maintains spatial diversity for training the second stage.
The remaining proposals are then further reduced to 50 positive (with intersection over union IoU ≥ 0.5) and 50 negative (IoU < 0.5) proposals. During inference, the top 400 boxes are kept after running non-maximum suppression on the RCD-RPN box proposals.

The second stage 3D RCNN also contains a classification loss, which is averaged over all box proposals (M = 100 during training) selected from RCD-RPN:

L_cls = (1 / M) Σ_k L_ce(s_k, y_k),    (7)

where s_k denotes the refined classification score, y_k the ground-truth class for box k, and L_ce(·) the cross-entropy loss. For the refined bounding box parameters b_k a combination of losses is used, including a residual loss similar to [41] for the box centers (x, y, z) and box dimensions (w, h, l), while heading (θ) regression uses the bin loss [19]. These losses are averaged over all proposals:

L_reg = (1 / M) Σ_k L_box(b_j, b_k),    (8)

where the box regression loss L_box is a sum of a smooth L1 loss for the center positions and normalized length, width and height predictions, and a bin loss for the heading prediction.

4.3 Joint Training

Both the RPN and RCNN networks are trained jointly with the final objective:

L = L_f + L_b + L_cls + L_reg,    (9)

where the first two terms belong to the RCD-RPN and the last two to the 3D RCNN.

5 Experiments

We primarily benchmark on the Waymo Open Dataset (WOD) [12] as it released its raw data in range image format, while other datasets such as KITTI [42] or nuScenes [43] provide point clouds. Converting a point cloud back into a range image requires known laser angles and accurate point-wise timing to compensate for the relative vehicle pose when in motion. Simply projecting Cartesian points to their spherical counterparts [44] results in either significant pixel collisions or many holes, depending on the choice of range image resolution. For the KITTI LiDAR with 64 individual lasers, their unique vertical and inclination offsets are recovered with the Hough transform, and points are grouped to form the rows of the range image. Table 1 shows the BEV results on the KITTI detection test set; see Section E.2 in the Appendix for more details. As reported in [11], we also observe that small datasets are prone to overfitting for range image detectors, and we continue our evaluation using the larger WOD dataset.

WOD captures multiple major cities in the U.S., under a variety of weather conditions and across different times of the day. The dataset provides a total of 1000 sequences of 20s duration each, sampled at 10Hz, with a train/validation split of 798/202 sequences. The effective annotation radius for the 6M vehicles across these sequences is 75 meters. For our experiments, we evaluate both Average Precision (AP) and AP weighted by heading (APH) [12] in 3D and BEV for vehicles on the WOD validation set, and 3D detection metrics using the public evaluation server for the test set.

Method          Easy   Moderate  Hard
LaserNet        79.19  74.52     68.45
RCD (Ours)      82.26  75.83     69.91
RCD-FT (Ours)   85.37  82.61     77.80

Table 1: Comparison to official LaserNet BEV results on the KITTI Car test set. No test-time augmentation or additional training data is used for RCD. The RCD method is trained from scratch on KITTI, while RCD-FT is finetuned from WOD pretraining.
                 | 3D AP (IoU=0.7)           | 3D APH (IoU=0.7)          | BEV AP (IoU=0.7)          | BEV APH (IoU=0.7)
Method           | All   r≤30  r30–50 r≥50   | All   r≤30  r30–50 r≥50   | All   r≤30  r30–50 r≥50   | All   r≤30  r30–50 r≥50
P.Pillars [5]    | 56.62 81.0  51.8   27.9   | -     -     -      -      | 75.57 92.1  74.1   55.5   | -     -     -      -
DynVox [45]      | 59.29 84.9  56.1   31.1   | -     -     -      -      | 77.18 93.0  76.1   57.7   | -     -     -      -
MVF [45]         | 62.93 86.3  60.0   36.0   | -     -     -      -      | 80.40 93.6  79.2   63.1   | -     -     -      -
PV-RCNN [20]     | 70.30 91.9  69.2   42.2   | 69.69 91.3  68.5   41.3   | 82.96 97.4  83.0   65.0   | 82.06 96.7  80.0   63.2
LaserNet*        | 52.11 70.9  52.9   29.6   | 50.05 68.7  51.4   28.6   | 71.19 83.9  71.4   54.5   | 67.66 80.7  68.4   51.4
RCD (Ours)       | 69.59 87.2  67.8   46.1   | 69.16 86.8  67.4   45.5   | 83.35 93.5  82.3   67.9   | 82.64 93.0  81.6   66.6

Table 2: Comparison of methods for vehicle detection on the Waymo Open Dataset (WOD) validation set for 3D detection with 7DOF boxes. The best and second best results are highlighted in bold and underlined respectively in the original typeset table. (*) Our implementation of LaserNet [11]. Columns with r show the breakdown of metrics by range (in meters).

5.1 Baseline Methods

The full RCD model is compared to an equivalent two-stage detector baseline without SRG and with the RCD layers replaced by a fixed 7×7 dilated convolution (set to a dilation rate of 3, which is shown to have the best RPN performance). Furthermore, a selection of state-of-the-art methods from each mainstream category of 3D object detection algorithms are compared on WOD.

LaserNet [11]: LaserNet is a 2D CNN-based single shot 3D object detector operating on LiDAR range images. It showed improvements on a large private dataset. With no publicly available implementation for this method, we use a variant of our RPN sub-network with normal 2D convolutions, ResNet blocks and adaptive NMS, trained with the multi-modal box regression loss as described in [11].

PointPillars (P.Pillars) [5]: Another single stage detector, which utilizes PointNets [6] to encode a point cloud scene representation organized in vertical columns in the BEV. Metrics are taken from [45].

Multi-View Fusion (MVF) [45]: This method fuses Cartesian view features and spherical view features. It shows significant improvements on long range detection because of the spherical view features. We share the same findings, as our method is perspective only.

Point-Voxel (PV-RCNN) [20]: A recently proposed method combining PointNets [7] and a sparse convolution RPN backbone with a similar second stage RCNN refinement network.

5.2 Implementation details

For the RCD layer we always use N = 64 samples, initialized as an 8×8 grid. With the number of channels kept at 64 for the inputs and outputs of the RCD block, the entire block consumes 23K FLOPs per pixel, which is significantly lower than the 262K FLOPs of an equivalent 2D convolutional layer, primarily due to the PConv squeeze with only 3 filters. Our RPN network takes 74ms per frame compared to 301ms for our best-effort implementation of the RPN from PV-RCNN on a V100 GPU. All validation experiments use the Adam optimizer [46] for 350K iterations with batch size 8 (or 17.5 epochs) and a learning rate starting from 6e-3 with a cosine decay, trained end to end from scratch without any data augmentation. For submission to the WOD test server our RCD model is trained for a total of 1 million iterations.

5.3 Discussion of Results

Among the baseline methods, only LaserNet [11] uses the range imagery directly, which makes LaserNet our most direct comparison. The results of this comparison are shown in Table 2. Even without the RCNN, our RPN achieves 57.2 AP, a significant improvement over the LaserNet range image detector. This is mainly because of the way RCD and SRG handle scale variance and occlusion in the range image view.
As our method processes the range image in the perspective view, it has fundamentally different characteristics compared to voxel or BEV projection based methods. Compared with the voxel based methods reported on WOD, our method shows the greatest improvements at long range. This is due to the issue of voxel sparsity for distant objects, corroborating the findings of [45]. Additionally, RCD is better able to utilize contextual information in the range image to distinguish distant objects with few points. As a result, our method exhibits complementary performance to the state-of-the-art PV-RCNN method, which has the strongest performance at close range, where point density is highest for voxel based detection.

                 Level 1          Level 2
Method           AP     APH       AP     APH
Second [47]      50.11  49.63     42.88  42.48
P.Pillars [5]    54.94  54.47     48.61  48.18
StarNet [48]     61.68  61.23     55.17  54.76
SA-SSD [22]      70.24  69.54     61.79  61.17
RCD 1M (Ours)    71.97  71.59     65.06  64.70

Table 3: Comparison of single frame methods for 3D vehicle detection on the Waymo Open Dataset (WOD) test set. Values are provided by the test server and divided into two levels of difficulty, where Level 1 has at least five points per ground truth object and Level 2 can have as few as a single point.

Configurations (columns): (1) none of the listed components (standard convolution); (2) fixed dilation (rate=3); (3) ASPP; (4) RCD at start; (5) RCD at start, with SRG; (6) RCD at start, with SRG, RCD at multiple scales; (7) configuration (6) plus 3D RCNN; (8) configuration (2) plus 3D RCNN; (9) configuration (3) plus 3D RCNN.

          (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)
3D AP     52.3  53.8  45.8  54.9  55.0  57.2  69.6  63.6  64.4
3D APH    50.3  51.8  45.2  52.5  52.9  56.7  69.2  63.1  63.9

Table 4: Ablation study showing the progressive improvements with the proposed contributions. Measured 3D AP and APH are from the WOD validation sequences. RCD at multiple scales uses 4 blocks, with one at the start of the RPN and the others at downsampled resolutions (see supplementary).

WOD Leaderboard: Recently, an unlabelled test set was released to the public by Waymo for benchmarking 3D object detectors via an online evaluation service. Table 3 shows the performance of our best model trained for 1 million iterations compared to other published single LiDAR frame vehicle detectors on the leaderboard². The public leaderboard divides the detection results into two difficulty levels based on the number of points within the annotated boxes. Level 1 has at least five points per ground truth object and Level 2 boxes may have only a single point. Our range based RCD model significantly outperforms other methods for both difficulty levels on this public benchmark.

Ablation Study: The merits of the proposed RCD block are compared with standard dilated convolution. Furthermore, the proposed RCD layer relates to the exhaustive dilation combination of differently sized receptive kernels in ASPP [31]. The key difference is that ASPP merges multiple fixed dilated convolutions together, while RCD reuses the same set of filter weights over a continuum of dilations. We found that substituting our RCD for an ASPP block often resulted in overfitting or unstable training, with only a single variant producing comparable results, shown as the third column of results in Table 4. As ASPP uniformly applies multiple rates of dilation simultaneously, its large receptive field is susceptible to distracting objects in the periphery. Using the RCD with SRG at the start and throughout the backbone network achieves the best single stage performance, while ASPP exhibited the worst performance. The last three columns show the effect of the two-stage framework. Interestingly, the 3D RCNN second stage is able to substantially recover from the under-performing ASPP based RPN. The RCNN also capitalizes on the high-quality proposals from the RPN network endowed with multiple RCD blocks.
6 Conclusions

We introduce RCD, a method for dynamically adjusting the dilation rate for use with LiDAR range images for scale invariant 3D object detection. An improved system relying on our new RCD block representation, and based on the two stage RCNN method [19], is the top performing range image based detection method, over all ranges, on the Waymo Open Dataset [12] benchmark for BEV, and achieves competitive results in other tests. Specifically, our approach sets a new state-of-the-art for detecting vehicles at long distances, as it benefits from the dense nature of the range image. Two directions of future research include applying the proposed method to other common robotic depth sensors, like structured-light or time-of-flight cameras which also capture a 2D range image, and designing a hybrid PV-RCNN and RCD approach to efficiently obtain the best performance at all ranges.

²waymo.com/open/challenges/3d-detection/, results gathered on July 27th, 2020

References

[1] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In ICRA. IEEE, 2017.
[2] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, 2015.
[3] Y. Zhou and O. Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In CVPR, pages 4490–4499, 2018.
[4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, 2017.
[5] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, pages 12697–12705, 2019.
[6] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, pages 652–660, 2017.
[7] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[8] F. Engelmann, T. Kontogianni, and B. Leibe. Dilated point convolutions: On the receptive field size of point convolutions on 3D point clouds. In ICRA, 2020.
[9] A. Bewley and B. Upcroft. Advantages of exploiting projection structure for segmenting dense 3D point clouds. In Australian Conference on Robotics and Automation, 2013.
[10] B. Wu, A. Wan, X. Yue, and K. Keutzer. SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud. In ICRA, 2018.
[11] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington. LaserNet: An efficient probabilistic 3D object detector for autonomous driving. In CVPR, 2019.
[12] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In CVPR, 2020.
[13] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In CVPR, pages 918–927, 2018.
[14] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3D proposal generation and object detection from view aggregation. In IROS. IEEE, 2018.
[15] Z. Wang and K. Jia. Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection. In IROS. IEEE, 2019.
[16] B. Yang, W. Luo, and R. Urtasun. PIXOR: Real-time 3D object detection from point clouds. In CVPR, 2018.
[17] M. Simony, S. Milzy, K. Amendey, and H.-M. Gross. Complex-YOLO: An Euler-region-proposal for real-time 3D object detection on point clouds. In ECCV Workshops, 2018.
[18] B. Graham, M. Engelcke, and L. van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
[19] S. Shi, X. Wang, and H. Li. PointRCNN: 3D object proposal generation and detection from point cloud. In CVPR, pages 770–779, 2019.
[20] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In CVPR, pages 10529–10538, 2020.
[21] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia. STD: Sparse-to-dense 3D object detector for point cloud. In ICCV, pages 1951–1960, 2019.
[22] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang. Structure aware single-stage 3D object detection from point cloud. In CVPR, 2020.
[23] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NeurIPS, pages 2017–2025, 2015.
[24] Y. He, M. Keuper, B. Schiele, and M. Fritz. Learning dilation factors for semantic segmentation of street scenes. In German Conference on Pattern Recognition, pages 41–51. Springer, 2017.
[25] Y. Chen, T. Mensink, and E. Gavves. 3D neighborhood convolution: Learning depth-aware features for RGB-D and RGB semantic segmentation. In 3DV, pages 173–182. IEEE, 2019.
[26] W. Wang and U. Neumann. Depth-aware CNN for RGB-D segmentation. In ECCV, 2018.
[27] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
[28] F. Strub, M. Seurin, E. Perez, H. De Vries, J. Mary, P. Preux, A. Courville, and O. Pietquin. Visual reasoning with multi-hop feature modulation. In ECCV, pages 784–800, 2018.
[29] X. Zhu, H. Hu, S. Lin, and J. Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, pages 9308–9316, 2019.
[30] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo. Learning depth-guided convolutions for monocular 3D object detection. In CVPR, 2020.
[31] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 40(4), 2017.
[32] L. Beyer, A. Hermans, and B. Leibe. DROW: Real-time deep learning based wheelchair detection in 2D range data. RA-L, 2016.
[33] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[34] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[35] M. A. Goodale, A. D. Milner, et al. Separate visual pathways for perception and action. Trends in Neurosciences, 15, 1992.
[36] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[37] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[38] S. Shi, Z. Wang, X. Wang, and H. Li. Part-A² net: 3D part-aware and aggregation neural network for object detection from point cloud. PAMI, 2019.
[39] F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep layer aggregation. In CVPR, pages 2403–2412, 2018.
[40] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[41] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[42] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[43] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
[44] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss. RangeNet++: Fast and accurate LiDAR semantic segmentation. In IROS. IEEE, 2019.
[45] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan. End-to-end multi-view fusion for 3D object detection in LiDAR point clouds. In CoRL, 2019.
[46] D. Kingma and J. Ba.
Adam: A method for stochastic optimization. In ICLR, 2015.
[47] Y. Yan, Y. Mao, and B. Li. SECOND: Sparsely embedded convolutional detection. Sensors, 2018.
[48] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, et al. StarNet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019.
[49] M. A. Shand. Origins and mitigations of some automotive pulsed lidar artifacts. In B. Jalali and K.-i. Kitayama, editors, AI and Optical Data Sciences, volume 11299, pages 49–55. International Society for Optics and Photonics, SPIE, 2020.
[50] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[51] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

A Further Implementation Details

A.1 Input Range Image

The input range image format to the entire network follows what is provided by the Waymo Open Dataset (WOD) [12], with rows corresponding to LiDAR beam inclinations and columns corresponding to laser shot azimuth. The encoded channels of the range image pixels include: range, intensity, elongation [12, 49], inclination, azimuth, x, y, z. Most of these channels are defined clearly in the WOD [12]. The channels x, y, z are the Cartesian coordinates of each point in the vehicle frame.

A.2 RCD-RPN Backbone Network

As briefly described in the main paper, the backbone of the RCD-RPN is a 2D convolutional network inspired by the deep layer aggregation network [39] used in LaserNet [11]. The architecture is composed of ResNet convolutional bottleneck units [50] with multiple steps of horizontal downsampling, as shown in the upper blue section of Figure 5 (see the caption for the number of ResNet units per downsampled scale). The aggregator modules take in two feature maps at different resolutions, upsampling the lower resolution via transposed convolutions.

Beyond substituting an initial convolution applied to the LiDAR input, the RCD block can be applied anywhere within the convolutional network. For example, replacing a ResNet block with an RCD block is considered at Res1, Res2, and Res3 of our multi-resolution backbone. The range image used to condition the amount of dilation at each pixel is coarsely downsampled using max-pooling, with the intention to favour distant objects with fewer pixels.
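As a small illustration of this coarse range conditioning, the NumPy sketch below (our own, with assumed strides) max-pools the range image so that each downsampled pixel keeps the largest range in its window, biasing the conditioning toward distant objects that occupy fewer pixels.

```python
import numpy as np

def maxpool_range(range_img, stride_h=1, stride_w=2):
    """Coarsely downsample a range image with max-pooling.

    Keeping the maximum range per window favours distant returns, matching
    the intent described above for conditioning RCD blocks at downsampled
    scales. The strides here are illustrative assumptions; the backbone
    mainly downsamples along the horizontal axis.
    """
    H, W = range_img.shape
    H2, W2 = H // stride_h, W // stride_w
    r = range_img[:H2 * stride_h, :W2 * stride_w]
    r = r.reshape(H2, stride_h, W2, stride_w)
    return r.max(axis=(1, 3))

R = np.random.uniform(2.0, 75.0, size=(64, 2650))    # toy range image
R_half = maxpool_range(R, stride_h=1, stride_w=2)     # range for a /2 scale
print(R_half.shape)                                    # (64, 1325)
```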
[Figure 5 diagram, recoverable block labels only: Input (64x2650x8) feature extractor with Res 1, Res 2a/Res 2, Res 3a/Res 3 blocks and feature aggregators Agg 1, Agg 2, Agg 3; downsampling paths (/2), upsampling paths (x4), constant paths and a flatten step feed the RPN head (64x2650x8). The RCNN branch concatenates features per box, transforms them to the canonical box frame, builds a 12x8x6 3D grid, and applies 3D and strided 3D convolutions before the final prediction (7d + cls).]

Figure 5: Illustrative detail of our detection pipeline. RPN: The upper blue section shows our 2D CNN backbone, where the ResNet blocks and aggregation blocks contain either RCD layers or convolutional layers for the baseline. Number of bottleneck units in each block: (Res1: 5), (Res2, Res2a: 7), (Res3, Res3a: 9), (Agg1, Agg2, Agg3: 4). Strides in each block: (Res1: 2), (Res2, Res2a: 2), (Res3, Res3a: 2). The bottleneck is the one described in [50], with the bottleneck depth set to the bottleneck input channel size divided by 4. The output channel sizes from each layer are 64, 128, 256, corresponding to the block numbers. The final RPN head predicts box parameters and score. RCNN: The second stage operates independently per surviving box. The Cartesian point coordinates are combined with box parameters to transform range-image features into their box canonical frame.
For each box, point features are voxelized and followed by 3D convolutions and the final predictions as shown.

Figure 6: Movement of an 8×8 kernel sampling pattern over the course of training, with blue '+' showing the uniform grid used to initialize the offsets, and red '×' showing the pattern after training.

Figure 7: Nominal width free parameter λ over the course of training for the input RCD block. λ is used to model the nominal width (in meters) of an object at any range.

Method                 Dilation            AP     APH    R@50P
Standard Convolution   Fixed rate=1        52.25  50.28  59.22
                       Fixed rate=3        53.75  51.76  60.05
                       Fixed rate=5        53.61  51.3   59.93
                       Fixed rate=7        51.58  48.06  60.23
ASPP* [31]             (12, 24, 36, G)     45.8   45.2   -
Ours                   Range conditioned   54.87  52.47  61.03
Ours with SRG          Range conditioned   55.01  52.89  60.83

Table 5: Performance comparison for the first stage RPN with a single RCD or standard convolution applied to the input tensor with fixed dilation rates. Average precision (AP), average precision weighted by heading (APH) and recall at 50% precision (R@50P) are measured on the WOD validation set after 350K iterations of training. (*) ASPP, with 16 filter channels per dilation layer, peaked near 240K iterations, with the values shown in the table. RCD improves over using only standard convolutional layers, and adding soft range gating (SRG) improves AP and APH even further.

B Learnable Sampling Pattern Analysis

To further assess the effect of using a learnable, continuous dilated kernel sampling, Figure 6 shows the relative locations of the learnt spatial samples with their initial positions marked. Compared to the uniform sampling at initialization, inner sample points tend to increase the local concentration around the center, with the outer points showing negligible or slight spread away from the central location. This indicates that the RCD model attempts to over-sample in the central region with sparser sampling in the extremities, similar to the fovea in the human eye. Figure 7 plots the evolution of the learnable nominal width parameter λ from Equation 2 of the main paper.

Effect of Nominal Width: For assessing the behaviour of the nominal width, in all experiments we set it as a learnable parameter initialized to the value of 1m. Over the course of training, this parameter generally increases and then settles between 2 and 3m, which is comparable to the average dimensions of a vehicle.

Effect of SRG: We found that the addition of a range gate improved training stability, leading to faster convergence with slightly improved performance. Table 5 shows the benefits provided to the first stage detector compared to adding a standard square kernel with different dilation rates. For a fair comparison we substitute the RCD block with a standard 2D convolutional layer with a 7×7 kernel and 64 channels, matching the RCD output.

#Channels  Dilations        RPN-AP  RPN-APH  RCNN-AP  RCNN-APH  Peak@iteration
64         (2, 4, 8, G)     28.6    28.0     59.2     58.7      83K
64         (6, 12, 18, G)   27.1    26.4     58.4     58.0      62K
64         (12, 24, 36, G)  30.4    29.6     58.7     58.2      60K
16         (12, 24, 36, G)  45.8    45.2     64.4     63.9      240K
8          (12, 24, 36, G)  27.5    27.0     59.4     59.0      76K

Table 6: Parameter search for the ASPP [31] baseline, where all methods were trained for 350K iterations. Peak@iteration shows the checkpoint with the highest second stage RCNN performance on the WOD validation set. The low and sporadic RPN performance and the early peak iterations suggest that ASPP easily overfits and generally results in unstable training.

C Comparison to Fixed Dilation Rates

Table 5 expands the experiments in the main paper with a more exhaustive set of dilation rates.
Here we explore the effect of increasing the dilation rate of the kernel for a standard convolutional operation [31, 33] in the first region proposal network (RPN). We also compare to the atrous spatial pyramid pooling (ASPP) [31] framework by replacing our RCD layer with the ASPP module. Here we use fixed strides of 12, 24, 36 and global average features denoted by 'G'³.

³Following the implementation of [51] from github.com/rishizek/tensorflow-deeplab-v3

With dilation rates of 3 and 5, a small improvement over the standard convolution (with the default dilation rate of 1) is observed. However, further increasing the dilation results in a sharp drop in performance, as seen with the dilation rate of 7. This drop in performance could be due to insufficient sampling of small distant objects, or an increase in the amount of padding needed given that the height of the range image is limited to 64 pixels. ASPP with various dilations significantly under-performed compared to the standard dilated convolutions. Having fixed and overly large dilation rates increases the susceptibility of ASPP to be distracted by sporadic activations caused by high frequency detail. We tried varying the receptive field and the number of channels (parameters) to prevent overfitting, but found the training to be generally highly unstable; see Table 6. The RCD models, with their ability to appropriately adjust their rate of dilation for both near and far objects, lead to the best overall performance.

D Qualitative Results

Figures 8 and 10 provide a qualitative view of our detections on WOD validation sequences. In the top left of the third row of Figure 10 the RPN conservatively predicts a proposal in the occlusion shadow of a vehicle, which is later correctly removed by the second stage RCNN refinement. A failure case is shown in the first row, where the RCNN mistakenly classifies a proposal as background.

Figure 8: Example visualization of 3D alignment. The top row shows colour images of the Left, Front-Left, Front, Front-Right and Right views respectively. The middle row shows the corresponding range image for the scene. The bottom row shows the corresponding 3D point cloud with our detections in red against the ground truth boxes in yellow. Note the RGB images are only for illustration purposes.

E Comparison to LaserNet on KITTI

While the KITTI dataset is a common benchmark for 3D detection, it has been shown that its small size can lead to overfitting and poor generalization for range image based methods [11]. While we also observe similar issues, the KITTI dataset is used as an indication of how our method compares to the official results of LaserNet in the small dataset regime.

E.1 Range Images from Point Clouds

Despite the range image being a native representation for LiDAR, KITTI distributes data from the Velodyne LiDAR sensor as a Cartesian point cloud with an additional intensity channel. The straightforward conversion of Cartesian to spherical points results in poor quality range images due to the 64 lasers of the Velodyne not sharing a common origin. To overcome this issue, dense and compact range images are reconstructed as follows (a code sketch of this procedure is given below):

1. For each of the lasers in the Velodyne sensor, its inclination and height are determined using Hough voting in a discretized height and inclination space.
2. For each point in a point cloud, its corresponding laser id l and azimuth angle a are estimated by minimizing the reconstruction/quantization error.
3. The range image is then constructed as an L×A image, where L denotes the number of lasers (64) and A is the number of azimuth steps (set to 2048). The pixel value at location (l, a) reflects the observed range, i.e. r = √(x² + y² + z²), for point p = (x, y, z).

Figure 9: Illustration of the extracted range image (top) from a KITTI point cloud (left), with the projection of the point cloud onto the RGB image (middle right), a detail of the extracted range image, and (bottom right) an illustration of the visible points per annotated bounding box.
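The following NumPy sketch (our own simplification, not the authors' code) illustrates steps 2 and 3 above once per-laser inclinations and heights are available from step 1. The laser assignment is reduced to a nearest-inclination match rather than a full reconstruction-error minimization, and all parameters are placeholders.

```python
import numpy as np

def build_range_image(points, laser_inclinations, laser_heights, num_azimuth=2048):
    """Project a Velodyne-style point cloud onto an L x A range image.

    points:             (M, 3) Cartesian x, y, z in the sensor frame.
    laser_inclinations: (L,) per-laser inclination angles (radians), e.g. the
                        output of the Hough voting of step 1 (placeholders).
    laser_heights:      (L,) per-laser vertical offsets (meters).
    """
    x, y, z = points.T
    r_xy = np.sqrt(x**2 + y**2)
    rng = np.sqrt(x**2 + y**2 + z**2)

    # Step 2 (simplified): assign each point to the laser whose inclination
    # best explains it after removing that laser's height offset.
    incl = np.arctan2(z[:, None] - laser_heights[None, :], r_xy[:, None])
    laser_id = np.argmin(np.abs(incl - laser_inclinations[None, :]), axis=1)

    # Azimuth column index in [0, num_azimuth).
    azimuth = np.arctan2(y, x)
    col = ((azimuth + np.pi) / (2 * np.pi) * num_azimuth).astype(int) % num_azimuth

    # Step 3: keep the closest return when several points share a pixel.
    L = len(laser_inclinations)
    image = np.full((L, num_azimuth), np.inf)
    for l, c, d in zip(laser_id, col, rng):
        image[l, c] = min(image[l, c], d)
    image[np.isinf(image)] = 0.0        # empty pixels
    return image
```

The intensity channel can be filled in the same way, as described in the following paragraph.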
The laser intensity is similarly added as another channel. When multiple observations map to the same pixel location (l, a) in the range image, the closest point is kept. Figure 9 illustrates the compact reconstructed range image from KITTI Velodyne data.

E.2 KITTI Results

We report performance on the KITTI test set after a fixed number of 100K iterations, using flipping and horizontal pixel shifting as data augmentation. Since the KITTI data is only labelled in the 90° around the forward driving direction, only that part of the range image is fed into the network. Table 1 shows the performance of our method against the primary representative for range image detectors, LaserNet [11], on this small dataset. This table is extended with an RCD model pretrained on WOD and finetuned on KITTI for 100K iterations, denoted as RCD-FT. While these numbers are not intended to be directly compared to other published works, they serve to highlight the importance of large and diverse datasets for training range image detectors.

Figure 10: Birds-eye-view visualization of four different scenes from the WOD validation sequences: (a) proposals, (b) detections, (c) ground truth. Each row shows a different scene and each column from left to right shows: the output of the RPN in blue, the output of the RCNN in red, and the ground truth boxes in yellow respectively. All predictions are filtered with a minimum confidence of 0.5. Best viewed digitally with zoom.