Optimizing Anchor-based Detectors for Autonomous Driving Scenes

Xianzhi Du*        Wei-Chih Hung        Tsung-Yi Lin*
Google             Waymo                Google

*Work done while at Google.

Abstract

This paper summarizes model improvements and inference-time optimizations for popular anchor-based detectors in autonomous driving scenes. Based on the high-performing RCNN-RS and RetinaNet-RS detection frameworks designed for common detection scenes, we study a set of framework improvements to adapt the detectors to better detect small objects in crowded scenes. Then, we propose a model scaling strategy that scales input resolution and model size to achieve a better speed-accuracy trade-off curve. We evaluate our family of models on the real-time 2D detection track of the Waymo Open Dataset (WOD) [24]. Within the 70 ms/frame latency constraint on a V100 GPU, our largest Cascade RCNN-RS model achieves 76.9% AP/L1 and 70.1% AP/L2, attaining the new state of the art on WOD real-time 2D detection. Our fastest RetinaNet-RS model achieves 6.3 ms/frame while maintaining a reasonable detection precision at 50.7% AP/L1 and 42.9% AP/L2.

[Figure 1: WOD AP/L1 vs. V100 latency (ms/frame) for our models and the leaderboard entries LeapMotor_Det, YOLOR_P6_TRT, DIDI MapVision, Dereyly and YOLO_v5, with 10 ms/frame and 70 ms/frame reference lines.]
Figure 1. Result comparison on the WOD real-time 2D detection leaderboard. Given the 70 ms/frame latency constraint on a V100 GPU, our best model achieves the new state-of-the-art performance at 76.9% AP/L1. More details can be found in Section 3.

1. Introduction

Object bounding box detection in autonomous driving scenes is one of the most popular yet challenging tasks in computer vision. Unlike common object detection scenes, which cover a wide range of detection scenarios, object scales and object categories, autonomous driving scenes typically focus on street driving views where objects of interest are smaller in size and come from fewer categories. Typical objects of interest for autonomous driving scenes are cars, pedestrians, cyclists, motorists, street signs, etc.

The COCO [17] benchmark has been the de facto benchmark for evaluating the performance of object detectors since 2015. As a result, most of the popular object detectors [3-11, 15, 16, 18, 19, 21, 25] are tailored for COCO detection in model design, training recipe, post-processing methods, inference-time optimization, and model scaling strategy. In this work, we aim to optimize common object detectors for autonomous driving. We adopt the RCNN-RS and RetinaNet-RS [7] frameworks as our strong baselines and carefully study the effectiveness of tuning common COCO detection settings for autonomous driving scenes, both in model improvements and in inference-time optimizations. Next, we discover a better strategy for scaling models in input resolution and backbone size and propose a family of models that forms a better speed-accuracy trade-off curve.

Our family of models is evaluated on the real-time 2D detection track of the Waymo Open Dataset (WOD) [24]. We adopt the RCNN-RS models as our main detectors to push for the best accuracy and the RetinaNet-RS models to push for the fastest speed. Under the 70 ms/frame latency constraint on a V100 GPU, our Cascade RCNN-RS model achieves 76.9% AP/L1 and runs at 68.8 ms/frame, achieving the new state of the art on the real-time 2D detection leaderboard [1]. To further push for the highest accuracy and the fastest speed, our largest Cascade RCNN-RS model achieves 78.9% AP/L1 and runs at 103.9 ms/frame, and our smallest RetinaNet-RS achieves 6.3 ms/frame while maintaining a reasonable AP/L1 at 50.7%.
Model                   Lat (ms/frame)   AP/L1
Faster RCNN-RS          91.8             71.5
+ 3 cascaded heads      122.5            72.8 (+1.3)
+ L2-L6 features        135.1            74.0 (+1.2)
+ Lightweight heads     100.8 (-25%)     73.7
+ 512 proposals         91.7 (-9%)       73.7
+ 0.7 NMS threshold     91.7             74.9 (+1.2)
+ TensorRT & float16    43.3 (-53%)      74.9

Table 1. Ablation studies of the RCNN-RS model improvements and inference-time optimizations on WOD 2D detection. By applying all the changes to the Faster RCNN-RS baseline, the final model achieves +3.4% AP/L1 while being 53% faster.

2. Methodology

2.1. RCNN-RS improvements

2.1.1 Architectural improvements

We adopt the strong RCNN-RS [7] as our main detection framework. RCNN-RS provides improved object detection performance in common detection scenes by adopting modern training techniques and architectural improvements. The modern techniques include scale jittering augmentation, stochastic depth regularization [13], a longer training schedule and the SiLU activation [12, 20]. To further optimize the detection framework for autonomous driving scenes, we make the following changes.

Cascaded heads: The Cascade R-CNN framework [4] shows consistent accuracy improvements over the Faster R-CNN [21] baseline. In this work, we adopt the Cascade RCNN-RS (CRCNN-RS) as our main detection framework. Unlike COCO detection, where the best accuracy is achieved with two cascaded heads with higher IoUs [7], here we adopt three cascaded detection heads with increasing foreground IoU thresholds {0.5, 0.6, 0.7}.

Lightweight heads: We remove all convolutional layers in the detection heads and the RPN head and keep only the final fully connected layer for bounding box regression and classification. The lightweight head design significantly boosts model speed while achieving accuracy similar to the original head design, which consists of 4 convolutional layers and one fully connected layer.

L2-L6 feature pyramid: Detecting objects on a multiscale feature pyramid is crucial for good performance [15]. A common design choice for COCO detection is to construct an L3-L7 feature pyramid [5, 6, 8, 16, 26]. To better adapt the detector to localize and recognize objects at smaller scales, we add L2 features to the feature pyramid and remove the L7 features.

2.1.2 Inference-time optimizations

Inference-time framework designs: We feed 512 detection proposals instead of 1000 to the second stage of the CRCNN-RS detector for inference and training. The NMS threshold is increased from 0.5 to 0.7 for ROI generation and final detection generation.

Benchmarking improvements: We further adopt the NVIDIA TensorRT optimization with float16 precision to optimize model inference speed.
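Taken together, the changes in Sections 2.1.1 and 2.1.2 amount to a small configuration delta relative to a stock Cascade R-CNN setup. The sketch below collects them in one place as a plain Python dict; the key names are ours (loosely following config-driven detection codebases) and are not the exact fields of any released configuration.

```python
# Illustrative configuration delta for adapting a Cascade RCNN-RS detector
# to autonomous-driving scenes (Section 2.1). Key names are hypothetical;
# map them onto your detection framework's config schema.
CRCNN_RS_WOD_CONFIG = {
    # Architectural changes (Section 2.1.1).
    "cascade_iou_thresholds": [0.5, 0.6, 0.7],  # three cascaded heads
    "min_feature_level": 2,    # add L2: small objects dominate street scenes
    "max_feature_level": 6,    # drop L7: very large objects are rare
    "head_conv_layers": 0,     # lightweight heads: keep only the final FC layer
    "rpn_conv_layers": 0,      # lightweight RPN head

    # Inference-time changes (Section 2.1.2).
    "num_proposals": 512,      # second-stage proposals (down from 1000)
    "nms_iou_threshold": 0.7,  # for ROI and final detection generation (was 0.5)
}
```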
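The paper reports TensorRT with float16 precision but does not name the conversion path. For TensorFlow SavedModels, one common route is TF-TRT; below is a minimal sketch of that route, with hypothetical model paths, rather than the authors' exact export pipeline.

```python
# One possible TensorRT + float16 path for a TF2 SavedModel via TF-TRT.
# This is a sketch under the assumption that the model is exported as a
# SavedModel; the paper does not specify its conversion pipeline.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP16)  # float16 inference precision
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="/path/to/crcnn_rs_savedmodel",  # hypothetical path
    conversion_params=params)
converter.convert()                      # rewrite supported subgraphs as TRT ops
converter.save("/path/to/crcnn_rs_trt")  # serialized, GPU-ready SavedModel
```

Latency numbers such as those in Table 1 would then be measured on the converted model, e.g. by timing repeated single-image inference on a V100 after a few warm-up runs.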
2.2. RetinaNet-RS improvements

RetinaNet(-RS) [7, 16], a popular anchor-based one-stage object detector, shows competitive detection performance in the low-latency regime for common object detection. In this work, we adopt the RetinaNet-RS framework for our fastest models and apply the changes introduced in Section 2.1, except those specific to RCNN-RS. The applied changes include L2-L6 features, a larger NMS threshold, and TensorRT with float16 inference precision.

2.3. Model scaling on WOD

We explore model scaling on WOD along two axes: input resolution and backbone size. A better speed-accuracy trade-off curve is formed by selecting the best-performing models within a wide range of computational cost. To scale input resolution, we gradually increase the height of the input image from 384 to 1536 and the width from 640 to 2688. To scale the backbone, we adopt architectures at 5 different scales: ResNet-RS-18×0.25 [2] (RN18×0.25; ×0.25 denotes that the model's channel dimension is uniformly scaled to 0.25 of the original size) with 5.3M parameters; SpineNet-49×0.25 (SN49×0.25) with 5.6M parameters; SpineNet-49 (SN49) with 30.3M parameters; SpineNet-96 (SN96) with 37.6M parameters; and SpineNet-143 (SN143) with 49.7M parameters. Table 2 presents our models at all scales.

Backbone     Input res   Params (M)  FLOPs (B)  Lat (ms/frame)  AP/L1  AP/L2
RN18×0.25    640×1152    5.3         9.3        13.7            59.6   51.3
RN18×0.25    768×1408    5.3         11.4       16.6            62.1   53.9
SN49×0.25    640×1152    5.6         10.4       19.6            63.9   55.5
SN49×0.25    768×1408    5.6         13.0       22.5            66.1   58.1
SN49×0.25    896×1664    5.6         16.0       26.2            67.9   60.0
SN49×0.25    1024×1920   5.6         19.5       30.7            69.2   61.5
SN49         384×640     30.3        -          18.8            62.8   54.0
SN49         512×896     30.3        57         22.0            68.3   59.9
SN49         640×1152    30.3        79         29.7            71.4   63.4
SN49         768×1408    30.3        107        34.9            73.4   65.7
SN49         896×1664    30.3        141        43.3            74.9   67.5
SN49         1024×1920   30.3        180        57.2            75.7   68.6
SN49         1280×2176   30.3        244        68.8            76.9   70.1
SN49         1408×2432   30.3        315        93.8            77.3   70.4
SN49         1536×2688   30.3        372        97.1            77.7   70.9
SN96         1280×2176   37.6        381        76.5            78.0   71.2
SN143        1280×2176   49.7        587        103.9           78.9   72.3

Table 2. CRCNN-RS model performance on WOD. We study model scaling on WOD by scaling up input resolution and backbone size. All models adopt the CRCNN-RS framework.
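To make the grid search in Section 2.3 concrete, the sketch below selects the speed-accuracy Pareto frontier from a (backbone, input resolution) grid, using a handful of rows from Table 2 as data. This is a minimal sketch of the selection step only; the helper names are ours and do not come from any released code.

```python
# Select the speed-accuracy Pareto frontier from a scaling grid search.
# The (latency, AP/L1) numbers below are a subset of Table 2.
candidates = [
    # (backbone, input height, V100 ms/frame, AP/L1)
    ("RN18x0.25", 640,  13.7,  59.6),
    ("RN18x0.25", 768,  16.6,  62.1),
    ("SN49x0.25", 640,  19.6,  63.9),
    ("SN49x0.25", 1024, 30.7,  69.2),
    ("SN49",      512,  22.0,  68.3),
    ("SN49",      640,  29.7,  71.4),
    ("SN49",      1280, 68.8,  76.9),
    ("SN49",      1408, 93.8,  77.3),
    ("SN96",      1280, 76.5,  78.0),
    ("SN143",     1280, 103.9, 78.9),
]

def pareto_frontier(models):
    """Keep every model that no other model beats on both axes."""
    return [m for m in models
            if not any(o[2] <= m[2] and o[3] > m[3] for o in models)]

for backbone, height, lat, ap in pareto_frontier(candidates):
    print(f"{backbone}@{height}: {lat:5.1f} ms/frame, {ap:.1f} AP/L1")
```

On this subset, SN49×0.25 at height 1024 and SN49 at height 1408 fall off the frontier, matching the observation in Section 3.3 that beyond a point it pays more to grow the backbone than the input resolution.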
3. Experimental Results

3.1. WOD training and eval settings

We conduct experiments on the real-time 2D detection track of the popular Waymo Open Dataset [24]. WOD is a large-scale dataset for autonomous driving that consists of 798 training sequences and 202 validation sequences. Each sequence spans 20 seconds and is densely labeled at 10 frames per second with camera object detections and tracks.

We train all models on the WOD train split with synchronized batch normalization, SGD with momentum 0.9 and a batch size of 256 for 20,000 steps on TPUv3 devices [14]. We apply a cosine learning rate schedule with an initial learning rate of 0.32. A linear learning rate warm-up is applied for the first 1,000 steps. To obtain competitive results, we pretrain our models on the COCO [17] dataset following the training practices from [7]. Our main results for 2D bounding box detection are reported on the WOD test split.
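Written out, the schedule above looks as follows. This is a minimal sketch assuming the cosine decay runs over the remaining 19,000 steps and ends at zero; the paper states the peak rate, warm-up length and total steps, but not the final rate.

```python
# Sketch of the Section 3.1 learning-rate schedule: linear warm-up for the
# first 1,000 steps, then cosine decay from 0.32 over 20,000 total steps.
# The decay is assumed to reach zero at the final step.
import math

BASE_LR, WARMUP_STEPS, TOTAL_STEPS = 0.32, 1000, 20000

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS  # linear warm-up
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))  # cosine decay

# e.g. learning_rate(500) == 0.16; learning_rate(20000) == 0.0
```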
3.2. Improvements from architecture and inference-time optimization

In this section, we show the impact of the RCNN-RS architectural changes and inference-time model optimizations for autonomous driving scenes. The detailed ablation studies are shown in Table 1. For the architectural changes, starting from the Faster RCNN-RS model, adding two more cascaded heads increases AP/L1 by +1.3%. Introducing L2 features to the multiscale feature pyramid increases AP/L1 by another +1.2%. Removing the 3×3 convolutional layers in the RPN head and the detection heads speeds up the model by 25% while achieving similar accuracy. For the inference-time optimizations, reducing the number of proposals for the second stage from 1000 to 512 and increasing the NMS threshold from 0.5 to 0.7 further improves AP/L1 by +1.2% while being 9% faster. Finally, optimizing the model with TensorRT and changing the inference precision from float32 to float16 significantly reduces inference latency, by 53%.

3.3. Scaling input resolution vs. backbone size

We explore the effectiveness of scaling input resolution vs. scaling backbone size through a grid search over the scales described in Section 2.3 for CRCNN-RS models. The results are presented in Table 2 and Fig. 2. In Fig. 2, we show that within a reasonable input resolution range (heights from 512 to 1280), scaling up the input resolution of a larger backbone is more effective than pushing a smaller backbone to larger input resolutions. To further push for a higher accuracy or a faster speed, scaling backbone size while keeping the input resolution fixed becomes the more effective strategy.

[Figure 2: WOD AP/L1 vs. V100 latency (ms/frame) for the scaling grid; one curve per backbone, points labeled by input height from 384 to 1536.]
Figure 2. Model scaling on WOD 2D detection. We compare CRCNN-RS models adopting SN49, SN96, SN143, SN49×0.25 and RN18×0.25 backbones at various input resolutions. Numbers in this figure represent the height of the input image.

3.4. RCNN vs. RetinaNet in the low-latency regime

In this section, we compare the performance of RCNN-RS and RetinaNet-RS on WOD, applying the same training and benchmarking practices. The results are shown in Table 3. We compare the RetinaNet-RS models to the CRCNN-RS models adopting a RN18×0.25 backbone. As shown in Fig. 3, RetinaNet-RS underperforms CRCNN-RS by 4.1% AP/L1 when running at the same speed. On the other hand, benefiting from its one-stage framework design, RetinaNet-RS achieves the fastest speed at 6.3 ms/frame.

Input res   Lat (ms/frame)  AP/L1  AP/L2
512×896     6.3             50.7   42.9
640×1152    11.6            54.2   46.5
768×1408    13.6            55.5   48.0

Table 3. RetinaNet-RS performance on WOD. All models adopt a RN18×0.25 backbone.

[Figure 3: WOD AP/L1 vs. V100 latency (ms/frame) in the low-latency regime for CRCNN-RS and RetinaNet-RS, points labeled by input height from 512 to 768.]
Figure 3. RetinaNet-RS vs. CRCNN-RS in the low-latency regime. All models adopt a RN18×0.25 backbone. Numbers in this figure represent the height of the input image.

3.5. WOD real-time 2D detection results

We present our best performing models and compare them to the top-5 entries on the real-time 2D detection leaderboard [1] in Table 4 and Fig. 1. In particular, within the 70 ms/frame constraint, our CRCNN-SN49 model at 1280×2176 input resolution achieves 76.9% AP/L1 and 70.1% AP/L2 and runs at 68.8 ms/frame, outperforming the previous best models on the leaderboard. Our CRCNN-SN49 model at 640×1152 input resolution achieves 71.4% AP/L1 and 63.4% AP/L2 and runs at 29.7 ms/frame, achieving real-time object detection while attaining competitive detection precision. In the low-latency regime, our smallest RetinaNet-RS, adopting a RN18×0.25 backbone at 512×896 resolution, achieves 6.3 ms/frame while maintaining reasonable detection precision at 50.7% AP/L1 and 42.9% AP/L2.

Model                        Lat (ms)  AP/L1  AP/L2
CRCNN-RS-SN49@640            29.7      71.4   63.4
CRCNN-RS-SN49@896            43.3      74.9   67.5
CRCNN-RS-SN49@1024           57.2      75.7   68.6
CRCNN-RS-SN49@1280           68.8      76.9   70.1
CRCNN-RS-SN96@1280           76.5      78.0   71.2
LeapMotor_Det [27]           61.6      75.7   70.4
DIDI MapVision [28]          45.8      75.0   69.7
YOLOR_P6_TRT [29]            37.4      74.8   69.6
Dereyly self ensemble [22]   68.7      71.7   65.7
YOLO_v5 [23]                 38.1      70.3   64.1

Table 4. Result comparisons of our models against the top-5 models on the WOD real-time 2D detection leaderboard. We omit results using model ensembles or multi-scale testing.

4. Conclusion

In this work, we improve the strong two-stage RCNN-RS detector for autonomous driving scenarios through architectural changes and inference-time optimizations. We study the impact of scaling input resolution and model size on the task of WOD real-time 2D detection and propose a family of models covering a wide range of latencies. We hope this study will help the community better design detectors for autonomous driving and that the optimizations transfer to more detection frameworks and detection scenarios.

Acknowledgments: We would like to acknowledge Henrik Kretzschmar, Drago Anguelov and the Waymo research team for the support, and Barret Zoph, Jianwei Xie, Zongwei Zhou and Nimit Nigania for the helpful discussions.

References

[1] https://waymo.com/open/challenges/2021/real-time-2d-prediction/.
[2] Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting ResNets: Improved training and scaling strategies, 2021.
[3] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection, 2020.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, pages 6154-6162, 2018.
[5] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Yin Cui, Mingxing Tan, Quoc V. Le, and Xiaodan Song. Efficient scale-permuted backbone with learned resource distribution. In ECCV, 2020.
[6] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recognition and localization. In CVPR, pages 11589-11598, 2020.
[7] Xianzhi Du, Barret Zoph, Wei-Chih Hung, and Tsung-Yi Lin. Simple training strategies and model scaling for object detection, 2021.
[8] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
[9] R. Girshick. Fast R-CNN. In ICCV, pages 1440-1448, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580-587, 2014.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[12] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2020.
[13] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[14] Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017.
[15] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21-37, 2016.
[19] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows, 2021.
[20] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2017.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[22] Nikolay Sergievskiy. https://waymo.com/open/challenges/entry/?challenge=DETECTION_2D&emailId=2cd1c01b-4335&timestamp=1623340740857905.
[23] Nikolay Sergievskiy. https://waymo.com/open/challenges/entry/?challenge=DETECTION_2D&emailId=2cd1c01b-4335&timestamp=1623184567343638.
[24] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In CVPR, pages 2446-2454, 2020.
[25] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In CVPR, June 2020.
[26] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In CVPR, pages 10778-10787, 2020.
[27] Fenfen Wang, Qiankun Xie, Lindong Li, Yaonong Wang, and Hongtao Zhou. https://waymo.com/open/challenges/entry/?challenge=DETECTION_2D&emailId=6aee4b36-4301&timestamp=1623415379657313.
[28] Yueming Zhang, Xiaolin Song, Bing Bai, Tengfei Xing, Chao Liu, Xin Gao, Zhihui Wang, Haojin Liao, and Pengfei Xu. https://waymo.com/open/challenges/entry/?challenge=DETECTION_2D&emailId=9554504f-c966&timestamp=1623399193410618.
[29] Yueming Zhang, Xiaolin Song, Bing Bai, Tengfei Xing, Chao Liu, Xin Gao, Zhihui Wang, Haojin Liao, and Pengfei Xu. https://waymo.com/open/challenges/entry/?challenge=DETECTION_2D&emailId=9554504f-c966&timestamp=1623232203640130.