Revisiting 3D Object Detection From an Egocentric Perspective

Boyang Deng†∗  Charles R. Qi†  Mahyar Najibi†  Thomas Funkhouser‡  Yin Zhou†  Dragomir Anguelov†
†Waymo LLC  ‡Google Research

∗Correspondence to bydeng@waymo.com

35th Conference on Neural Information Processing Systems (NeurIPS 2021), virtual.

Abstract

3D object detection is a key module in safety-critical robotics applications such as autonomous driving. For such applications, we care the most about how the detections impact the ego-agent's behavior and safety (the egocentric perspective). Intuitively, we seek more accurate descriptions of object geometry when it is more likely to interfere with the ego-agent's motion trajectory. However, current detection metrics, based on box Intersection-over-Union (IoU), are object-centric and are not designed to capture the spatio-temporal relationship between objects and the ego-agent. To address this issue, we propose a new egocentric measure to evaluate 3D object detection: Support Distance Error (SDE). Our analysis based on SDE reveals that the egocentric detection quality is bounded by the coarse geometry of the bounding boxes. Given the insight that SDE can be improved by more accurate geometry descriptions, we propose to represent objects as amodal contours, specifically amodal star-shaped polygons, and devise a simple model, StarPoly, to predict such contours. Our experiments on the large-scale Waymo Open Dataset show that SDE better reflects the impact of detection quality on the ego-agent's safety compared to IoU, and the estimated contours from StarPoly consistently improve the egocentric detection quality over recent 3D object detectors.

1 Introduction

3D object detection is a key problem in robotics, including popular applications such as autonomous driving. Common evaluation metrics for this problem, e.g. mean Average Precision (mAP) based on box Intersection-over-Union (IoU), follow an object-centric approach, where errors on different objects are computed and aggregated without taking their spatiotemporal relationships with the ego-agent into account. While these metrics provide a good proxy for downstream performance in general scene understanding applications, they have limitations for egocentric applications, e.g. autonomous driving, where detections are used to assist navigation of the ego-agent. In these applications, detecting potential collisions on the ego-agent's trajectory is critical. Accordingly, evaluation metrics should focus more on the objects closer to the planned trajectory and on the parts/boundaries of those objects that are closer to the trajectory.

Recent works have introduced a few modifications to evaluation protocols to address these issues, e.g., breaking down the metrics into different distance buckets [53] or using learned planning models to reflect detection quality [34]. However, they are either very coarse [53] or rely on optimized neural networks [34], making it difficult to interpret and compare results in different settings. In this paper, we take a novel approach to 3D object detection from an egocentric perspective. We start from the first principle that the detection quality relevant to the ego-agent's planned trajectory, both at the current moment and in the future, has the most profound impact on the ability to facilitate navigation. This leads us to transform detection predictions into two types of distance estimates relative to the ego-agent's trajectory: lateral distance and longitudinal distance (Fig. 1). The errors on these two distances form our support distance error (SDE) concept, where the components can either be aggregated as the max distance estimation error or used independently, for different purposes.

Compared to IoU, SDE (as a shape metric) is conditioned on the spatio-temporal relationship between the object and the ego-agent.
Even a small mistake in detection near the ego-agent's planned trajectory can incur a high SDE (as in Fig. 2 left, object 3). Additionally, SDE can be extended to evaluate the impact of detections on the ego-agent's future plans (for cases where an object comes close to the planned trajectory later in time). This is not feasible for IoU, which is invariant to the ego-agent's position or trajectory (shown in Fig. 2).

Figure 1: Lateral distance and longitudinal distance. These two types of support distance measure how far an object's shape boundary is from the observer (ego-agent), both in the direction along the observer's velocity (longitudinal) and perpendicular to it (lateral).

Using SDE to analyze a state-of-the-art detector [44], we observe a significant error discrepancy between using a rectangular-shaped box approximation and the actual object's boundary, suggesting the need for a better representation to describe the fine-grained geometry of objects. To this end, we propose a simple lightweight refinement to box-based detectors named StarPoly. Based on a detection box, StarPoly predicts an amodal contour around the object, as a star-shaped polygon.

Moreover, we incorporate SDE into the standard average precision (AP) metric and derive an SDE-based AP (SDE-AP) for conveniently evaluating existing detectors. In order to make an even more egocentric AP metric, we further add inverse distance weighting to the examples, obtaining SDE-APD (D for distance weighted). With the proposed metrics, we observe different behaviors among several popular detectors [41, 44, 67, 20] compared to what IoU-AP would reveal. For example, PointPillars [20] excels on SDE-AP in the near range in spite of its less competitive overall performance. Finally, we show that StarPoly consistently improves upon the box representation of shape based on our egocentric metric, SDE-APD.

2 Related Work

3D Object Detection  Modern LiDAR-based 3D object detectors can be organized into three sub-categories based on the way they represent the input point cloud: voxelization-based detectors [55, 8, 21, 50, 19, 60, 47, 68, 58, 20, 64, 56], point-based methods [45, 63, 32, 38, 62, 46], and hybrid methods [67, 61, 5, 12, 44]. Besides input representation, aggregating points across frames [13, 65, 14, 41], using additional input modalities [19, 4, 39, 57, 25, 29, 48, 37], and multi-task training [27, 59, 30, 24] have also been studied to boost performance. Despite such progress in model design, the output representation and evaluation metrics have remained mostly unchanged.

Egocentric Computer Vision  Egocentric vision has been studied in various applications. To name a few, understanding human actions from egocentric cameras, including action/activity recognition [9, 35, 28, 49, 36, 51, 15, 52], action anticipation [43, 16], and human object interaction [26], has been widely studied. Egocentric hand detection/segmentation [23, 22, 1, 42] and pose estimation [54, 66, 31] are among other applications. Arguably, 3D detection for autonomous driving can be naturally viewed as another egocentric application where data is captured by sensors attached to the car. However, classic IoU-based evaluation metrics ignore the egocentric nature of this application.

3D Object Detection Metrics  Various extensions to the average precision (AP) metric have recently been proposed for the autonomous driving domain. nuScenes [2] consolidated mAP with more fine-grained error types. The Waymo Open Dataset [53] introduced mAP weighted by heading (mAPH) to reflect the importance of accurate heading prediction in motion forecasting.
[34] proposed to examine detection quality from the planner's perspective, by measuring the KL-divergence between future predictions conditioned on either noisy perception or ground truth. However, factors such as different planning algorithms or model training setups may cause this approach to yield inconsistent outcomes.

Boundary-based Segmentation Metrics  A different class of shape metrics on semantic segmentation masks evaluates the match quality of ground truth and predicted segmentation boundaries. Representative methods include Trimap IoU [3, 18], F-score [10, 33], Boundary IoU [6], etc. These methods operate in an object-centric manner and do not take temporal information into consideration.

3 An Egocentric Shape Metric: Support Distance Error

Understanding the quality of modern 3D object detectors from an egocentric perspective is an under-explored topic and is open for new egocentric shape measures. In this section, we first look at the limitations of the box Intersection-over-Union (IoU) measure, the de facto choice to evaluate detection quality in popular benchmarks [11, 53, 2], and then introduce our newly proposed egocentric shape metric: support distance error (SDE).

Figure 2: Illustration of IoUs and lateral distances in a real scene. We visualize the scene from a bird's eye view where Lidar points are gray, green boxes are ground truth boxes, and red boxes are detector boxes. Left: We show that IoU as an object-centric measure does not directly reflect the risk of collision (colored in blue); the high-risk mistake of object 3's box is not reflected by its high IoU. In contrast, while object 2 has a lower IoU, its box boundary is accurately estimated, thus the impact on ego-agent planning is limited. As shown, compared to IoU, SDE is more indicative of the perception quality's impact on driving. Right: We show how SDE changes when evaluated at a future time (colored in purple), reflecting how the current frame's perception quality influences decision making into the future. The detection box is transformed to a future frame based on the rigid motion between the ground truth boxes at T = 0s and T = 1s (which excludes the error introduced by motion prediction). While object 1 has low SDE_lat at T = 0s on the left, its error significantly increases at T = 1s, as the box cannot capture the fine-grained geometry at the object corner (see the zoomed-in view).

Table 1: Distributions of error measures in two types of collision detection cases. In "TP Collision", both the ground truth points and the prediction report a collision. In "FP/FN Collision", either the ground truth (FN) or the prediction (FP) reports a collision. While the distributions of IoU in TP and FP/FN are close, with an even higher mean IoU in FP/FN, SDEs among TP are clearly better than FP/FN, with an improvement of 30% in mean and 40% in median.

               TP Collision         FP/FN Collision
Measure        Mean     Median      Mean     Median
IoU ↑          0.903    0.912       0.904    0.903
SDE ↓          0.114    0.094       0.162    0.153

Figure 3: Correlations with collision detection accuracy (CDA). For each evaluation moment from the prediction time to 10s in the future, we compute the CDA, mean IoU (mIoU), and mean SDE (mSDE). We see from the curves that mIoU is not correlated with the accuracy drop, as IoUs don't vary with ego motion, while mSDE is inversely correlated with collision detection accuracy due to its egocentric nature.

Limitations of box-based IoU  IoU is an object-centric measure based on volumes (or areas). As illustrated in Fig. 2 (left), a prediction box with a relatively high IoU can still exhibit a high risk for an ego-agent (the protruding box can cause the planner to brake suddenly, which in turn could lead to a tailgating collision).

To understand such behavior at scale, we use collision detection as a "gold standard" to quantitatively reveal the limitations of IoU. We select all the collisions reported by either the ground truth or a state-of-the-art PV-RCNN detector [44] in the validation set of the Waymo Open Dataset [53]. A ground truth collision is defined as an event where the object shape (approximated by the aggregated object LiDAR points across all of its observations) overlaps with the extended ego-agent shape (approximated by a bounding box of the ego-agent, scaled up by 80%). Collisions are estimated using detector boxes as the object's shape. Table 1 presents the mean and median IoUs for true positive and false positive collision detections; the difference between them is minimal, indicating that IoU does not effectively reflect collision risk.
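For illustration, a minimal sketch of this collision test might look as follows. This is our own reading of the setup, not the authors' released code: we work in bird's eye view, scale the ego box by 80%, and declare a collision when any object shape point falls inside the scaled, rotated box.

```python
import numpy as np

def collides_with_extended_ego(object_points, ego_center, ego_heading,
                               ego_size, scale=1.8):
    """Overlap test between an object shape and the extended ego-agent box.

    object_points: (N, 2) BEV points approximating the object shape, e.g.
                   aggregated LiDAR points (ground truth collisions) or points
                   sampled on the detector box (estimated collisions).
    ego_center: (2,) ego box center; ego_heading: yaw angle in radians.
    ego_size: (length, width) of the ego bounding box; scale=1.8 realizes
              the "scaled up by 80%" extension described above.
    """
    c, s = np.cos(ego_heading), np.sin(ego_heading)
    # Rotate points into the ego box frame (x along heading, y lateral).
    local = (np.asarray(object_points) - ego_center) @ np.array([[c, -s],
                                                                 [s, c]])
    half = 0.5 * scale * np.asarray(ego_size, dtype=float)
    inside = (np.abs(local[:, 0]) <= half[0]) & (np.abs(local[:, 1]) <= half[1])
    return bool(inside.any())
```

Representing the detector box by sampled boundary points is a simplification; a full polygon intersection test would also catch the rare case where the box overlaps the ego box without any sampled point falling inside it.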
Support Distance Error (SDE)  In autonomous driving, one of the core uses of detection is to provide accurate object distance and shape estimates for motion planning (which has collision avoidance as one of its primary objectives). Instead of using box IoU, we can measure distances from the estimated shapes to the ego-agent's planned trajectory. Specifically, we propose two types of distance measurements (Fig. 1)²:

• Lateral distance to an object: the minimal distance from any point on the object boundary to the line in the ego-agent's heading direction. This distance is critical for the ego-agent to plan lateral trajectory maneuvers.

• Longitudinal distance to an object: the minimal distance from any point on the object boundary to the central line perpendicular to the ego-agent's heading direction. This distance is important to determine the speed profile and keep a safe distance from the objects in front.

We use the term support distances for these two distance types, as they "support" the decision making in trajectory planning, and name the error between the ground truth support distance and the one estimated from a detector's output the support distance error (SDE). We use SDE_lat to denote the lateral distance error and SDE_lon for the longitudinal error, and we define SDE as the maximum of the two. This formulation leads to two conceptual changes compared to IoU: we shift our focus from volume to boundary, and from object-centric to ego-centric.

This definition can also be extended to measure the impact of the detection quality on future collision risks. If we compute distances from the object boundary to the tangent lines at a future position (at time t) on the ego-agent's trajectory, we can compute SDE for different future time steps (denoted as SDE@t). This is equivalent to measuring how close the object is to a future location of the ego-agent.

To make the definition concrete, at time T = t, we assume the ego-agent's pose is e^(t) = (x^(t), θ^(t)), with x^(t) ∈ R³ as its center (e.g. the center of the ego-agent's bounding box) and θ^(t) as its heading direction (e.g. the clockwise rotation angle around the up-axis). We define the "lateral line" l_lat^(t) as the line crossing the ego-agent's center in the direction of its heading, and the "longitudinal line" l_lon^(t) as the line perpendicular to it. On the other hand, we assume we have an object o and its predicted boundary B(o), a set of points on the boundary. The lateral/longitudinal distance of o at the current frame (T = 0) is defined as:

    SD_α = SD_α(B(o), e^(0)) = min_{p ∈ B(o)} d(p, l_α^(0)),   α ∈ {lat, lon}   (1)

where d computes the point-to-line distance. If the line passes through the object boundary, the distance is 0. Assuming B_gt(o) is the object's ground truth boundary, the lateral/longitudinal support distance error is defined as:

    SDE_α = SD_α(B_gt(o), e^(0)) − SD_α(B(o), e^(0)),   α ∈ {lat, lon}   (2)

The sign of SDE has a physical meaning: positive errors mean the predicted boundary is protruding, while negative errors mean that a part of the object is not covered by the predicted boundary. For simplicity, we take the absolute values of SDE_lat and SDE_lon by default and formally define SDE = max(|SDE_lat|, |SDE_lon|), an aggregated value of both errors.

²For simplicity, we define the distances from the object boundary to the ego-agent trajectory (or the line perpendicular to it), instead of using the ego-agent shape, which varies across datasets and is typically not available in public datasets.
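These definitions translate directly into code. Below is a minimal NumPy sketch of Eqs. (1) and (2) in bird's eye view, together with the max-of-absolute-values aggregation; the function names and the 2D simplification are ours, not part of the released evaluation.

```python
import numpy as np

def support_distances(boundary, ego_center, ego_heading):
    """Eq. (1): lateral/longitudinal support distances of a boundary point set.

    boundary: (N, 2) boundary points in BEV world coordinates.
    ego_center: (2,) ego-agent center; ego_heading: heading angle in radians.
    Returns (SD_lat, SD_lon); a value near 0 means the corresponding line
    passes through the boundary.
    """
    u = np.array([np.cos(ego_heading), np.sin(ego_heading)])  # heading dir
    n = np.array([-u[1], u[0]])                               # lateral normal
    rel = np.asarray(boundary) - ego_center
    sd_lat = np.abs(rel @ n).min()  # min distance to the lateral line l_lat
    sd_lon = np.abs(rel @ u).min()  # min distance to the longitudinal line l_lon
    return sd_lat, sd_lon

def support_distance_error(boundary_gt, boundary_pred, ego_center, ego_heading):
    """Eq. (2) per axis, then SDE = max(|SDE_lat|, |SDE_lon|)."""
    sd_gt = support_distances(boundary_gt, ego_center, ego_heading)
    sd_pred = support_distances(boundary_pred, ego_center, ego_heading)
    sde_lat = sd_gt[0] - sd_pred[0]   # signed lateral error
    sde_lon = sd_gt[1] - sd_pred[1]   # signed longitudinal error
    return max(abs(sde_lat), abs(sde_lon))
```

With discrete boundary points the distance is only approximately zero when a line crosses the boundary; the released metric evaluates against aggregated Lidar points, as described in the implementation details below.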
Figure 4: Failure cases with large SDEs (≥ 0.3m). (a) and (b): The detector boxes are poorly aligned with the ground truth, either in orientation or size. (c) and (d): The detector boxes yield near-perfect IoUs with the ground truths but still incur high SDE. The convex visible contours (CVC) are derived from the input points within a detection at the current frame. Note that SDEs here are computed against temporally aggregated Lidar points (the GT Points) and IoUs are computed between detections and ground truth boxes.

To measure the impact of current frame detection quality on future plans, we define SDE@t, which computes the SDE of an object t seconds in the future. Given the ground truth rigid motion R(t) of the object from T = 0 to T = t, we can transform its predicted boundary at frame T = 0 to its future position. In this way, the error patterns of the boundary can be consistently propagated into a future frame (see Fig. 2 right for an example). The rigid motion can be derived between pairs of ground truth boxes of the object. We denote the transformed B(o) as B^(t)(o)′. Note that it is different from the object shape prediction at time T = t: we are still measuring the quality of the T = 0 prediction, but within a future egocentric context. The future support distance can be formally defined as:

    SD_α@t = SD_α(B^(t)(o)′, e^(t)) = min_{p ∈ B^(t)(o)′} d(p, l_α^(t)),   α ∈ {lat, lon}   (3)

Similarly, we define SDE_α@t as the difference in SD_α@t between the ground truth and the predicted boundary, where α ∈ {lat, lon}. Then SDE@t = max(|SDE_lat@t|, |SDE_lon@t|). We use SDE and SDE@0s interchangeably unless otherwise noted.

Metric implementation details  To faithfully compute the support distance, we aggregate object surface points (from Lidar) across all frames during which the object is observed (these cover different viewpoints of the object) as a surrogate shape for the ground truth. This allows us to effectively compute distances to the boundary without requiring costly object shape annotations/modeling. By default, we use the real driving trajectory. The same implementation applies when one would like to evaluate SDE on an arbitrary set of intended trajectories (from a planner or simulation).

Comparing SDE with IoU  In Fig. 2, we see that SDE_lat is a highly useful indicator of collision risk (for object 3). In Tab. 1, we show that the mean and median SDE are sensitive shape measures and are inversely correlated with the collision risk. Naturally, SDE@t increases with larger t, since the detections are based only on sensor data from the current frame T = 0. Fig. 3 shows how SDE@t and IoU change when we evaluate them at different time steps.

Note that both SDE and SDE@t are defined based on distances to the object boundary (usually the part closer to the ego-agent). Clearly, better detection quality and boundary representation will result in improved SDE metrics, which leads to the main idea of our next section.

4 Shape Representations and SDE

In this section, we use SDE to analyze detection quality in safety-critical scenarios and highlight the importance of the shape representation therein. We further propose a new amodal contour representation and a neural network model (StarPoly) for contour estimation and demonstrate that it produces significant SDE improvements.

Table 2: Comparing mean SDE (mSDE) of boxes, convex visible contours (CVC), and our StarPoly at different distance ranges. While lower than the mSDE of detector boxes, CVC's mSDE rises rapidly towards far ranges. Meanwhile, StarPoly is superior to both box and CVC in all ranges.

                        PV-RCNN                       Ground Truth
                        Box      CVC      StarPoly    Box      CVC      StarPoly
mSDE of [0m, 5m)        0.107    0.090    0.063       0.059    0.083    0.046
mSDE of [5m, 10m)       0.108    0.087    0.064       0.070    0.076    0.053
mSDE of [10m, 20m)      0.140    0.155    0.086       0.094    0.142    0.068
mSDE of [20m, 40m)      0.207    0.266    0.152       0.132    0.235    0.105

Figure 5: mSDE in [0m, 10m) at different time steps. CVC's mSDE significantly increases as the evaluation goes into the future. In contrast, StarPoly consistently outperforms the others.
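As a companion to Eq. (3), the sketch below (our own illustration, reusing support_distances from the earlier snippet) shows how a T = 0 prediction can be scored in a future egocentric context: the predicted boundary is carried forward by the object's ground truth rigid motion and evaluated against the ego pose at time t.

```python
import numpy as np

def transform_boundary(boundary, gt_pose_t0, gt_pose_t):
    """Carry a T=0 boundary to time t via the object's GT rigid motion R(t).

    gt_pose_t0, gt_pose_t: (center_xy, heading) of the object's ground truth
    box at T=0 and T=t; their relative transform defines R(t).
    """
    (c0, h0), (ct, ht) = gt_pose_t0, gt_pose_t
    dh = ht - h0
    c, s = np.cos(dh), np.sin(dh)
    rot = np.array([[c, -s], [s, c]])
    return (np.asarray(boundary) - c0) @ rot.T + ct

def sde_at_t(boundary_gt, boundary_pred, gt_pose_t0, gt_pose_t, ego_pose_t):
    """SDE@t: Eq. (3) on motion-compensated boundaries at the future ego pose."""
    b_gt = transform_boundary(boundary_gt, gt_pose_t0, gt_pose_t)
    b_pred = transform_boundary(boundary_pred, gt_pose_t0, gt_pose_t)
    ego_center, ego_heading = ego_pose_t
    sd_gt = support_distances(b_gt, ego_center, ego_heading)
    sd_pred = support_distances(b_pred, ego_center, ego_heading)
    return max(abs(g - p) for g, p in zip(sd_gt, sd_pred))
```

Since the same rigid motion is applied to both boundaries, any SDE@t degradation over t comes from the interaction between the frozen T = 0 shape error and the changing ego pose, not from motion prediction error.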
4.1 Qualitative Analysis of Bounding Box Failure Cases

To understand how detector boxes perform under the SDE measure, we select PV-RCNN [44], a top-performing single-frame point-cloud-based detector in popular autonomous driving benchmarks [11, 53], for our analysis. All analysis is based on the Waymo Open Dataset [53] validation set. Fig. 4 illustrates some representative failure cases among the detector boxes with high SDE.

We find that even when a box aligns reasonably well with the ground truth, it can still incur high SDE. By comparing the predicted detection against the point cloud inside the box, we notice that rectangular boxes typically do not tightly surround the object boundary. In particular, the discrepancy between box corners and the actual object boundary contributes a considerable amount of SDE. This observation inspires us to seek more effective representations of the fine-grained object geometry.

4.2 Convex Visible Contours

An intuitive solution to obtain a tighter object shape fit is to leverage the Lidar points. Specifically, one can extract all points within the detector box (after removing points on the ground) and compute their convex hull, as a convex visible contour (CVC). In contrast to the amodal object shape, CVC is computed only from the visible Lidar points at the current frame. Fig. 4 provides some visualizations.

Tab. 2 shows how CVC compares with bounding boxes in SDE. Considering that CVC is heavily dependent on the quality of the box it resides in, we also evaluate CVC directly based on the ground truth boxes, which can be seen as the upper bound for CVC (col. 6). We see that at near range, CVC can significantly improve SDE compared to the detector boxes (col. 2 vs 3). However, its effectiveness degrades at longer ranges (col. 2 vs 3), and its performance is inferior to that of ground truth boxes (col. 5 vs 6). We hypothesize that this is because CVC is vulnerable to occlusions, clutter, and object point cloud sparsity at longer ranges, which are ubiquitous phenomena in real world data. In Fig. 5, the analysis based on SDE@t confirms that CVC performs better than detector boxes at the current frame but generalizes poorly to longer time horizons. To improve on it, we need a representation that provides good coverage of both the visible and the occluded object parts.

4.3 Amodal Contour Estimation with StarPoly

We propose to refine box-based detection with amodal contours: polyline contours that cover the entire object shape (see Fig. 6 for an illustration). Our model, StarPoly, implements contours as star-shaped polygons and predicts amodal shapes via a neural network³. It can be employed to refine the predicted boxes of any off-the-shelf detector.

Input  The input to the StarPoly model is a normalized object point cloud. We crop the object point cloud from its (extended) detection box. The point cloud is canonicalized based on the center and the heading of the detection box, as well as scaled by a scaling factor, s, such that the length of the longest side of the predicted box becomes 1.

³Although there are previous works on shape reconstruction/completion [7, 30], they are often trained on synthetic data and are not directly applicable to real Lidar data. We leave more studies in designing the best contour estimation model to future work and evaluate StarPoly as a baseline towards better egocentric detection.

Parameterization  As shown in Fig. 6, the star-shaped polygon is defined by a center point, h, and a list of vertices on its boundary, (v_1, ..., v_n), where n is the total number of vertices, determining the shape resolution. We assume h is the center of the predicted box and sort (v_1, ..., v_n) in clockwise order so that connecting the vertices successively produces a polygon. We constrain v_i to have only 1 degree of freedom by defining v_i = c_i d_i, where (d_1, ..., d_n) is a list of unit vectors in predefined directions. Consequently, predicting a star-shaped polygon is equivalent to predicting (c_1, ..., c_n), for which we employ a PointNet [40] model (see the supplementary material for details); a minimal decoding sketch is given below.
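The following sketch shows how such a prediction can be decoded into a polygon. The direction sampling on a square follows the supplementary material; everything else, including names and the exact traversal order, is our own illustrative assumption.

```python
import numpy as np

def square_directions(n=256):
    """Unit vectors toward n points spaced uniformly along the boundary of a
    centered square, traversed clockwise (one plausible realization of the
    predefined directions (d_1, ..., d_n))."""
    t = np.linspace(0.0, 4.0, n, endpoint=False)   # 4 sides, 1 unit each
    side = np.floor(t).astype(int)
    f = (t - side)[:, None]                        # position within each side
    # Corners of a square with half-size 1, listed in clockwise order.
    corners = np.array([[-1., 1.], [1., 1.], [1., -1.], [-1., -1.], [-1., 1.]])
    pts = corners[side] * (1.0 - f) + corners[side + 1] * f
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def decode_star_polygon(c, box_center, box_heading, s, dirs):
    """Map predicted radii (c_1, ..., c_n) to vertices v_i = c_i * d_i in the
    canonical frame, then undo the canonicalization of the input point cloud.

    s is the scaling factor that made the longest box side 1 during input
    normalization; box_center/box_heading come from the detection box.
    """
    v = np.asarray(c)[:, None] * dirs              # vertices, canonical frame
    cos_h, sin_h = np.cos(box_heading), np.sin(box_heading)
    rot = np.array([[cos_h, -sin_h], [sin_h, cos_h]])
    return (v / s) @ rot.T + box_center            # back to world coordinates
```

Sampling the directions on a square rather than a circle matches the prior stated in the supplementary material that vehicles are approximately of rounded rectangular shape, so for box-like objects the vertices land roughly uniformly along the contour.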
Optimization  Since ground truth contours are not available in public datasets, directly training the regression of (c_1, ..., c_n) is infeasible. We resort to a surrogate objective for supervision. The objective combines three intuitive goals, namely coverage, accuracy, and tightness. The coverage loss encourages the prediction to encompass all ground truth object points (aggregated points in the object bounding box from all frames in which the object appears, with ground points removed). Moreover, as the input point cloud already reveals the part of the object boundary visible to the ego-agent, the accuracy loss requires the prediction to fit the visible boundary as tightly as possible. On the other hand, the tightness loss minimizes the area of the predicted contour. The combination of these three goals leads to the reconstruction of contours without requiring ground truth contour supervision. More formally, the coverage loss L_c, the accuracy loss L_a, the tightness loss L_t, and consequently the overall objective L for one ground truth point cloud X are defined as follows:

    L = (1/|X|) Σ_{x ∈ X} max( (x × v_r)/(v_l × v_r) + (v_l × x)/(v_l × v_r) − 1, 0 )   ⇒ encompass all object points, L_c
      + β (1/|B|) Σ_{x ∈ B} | (x × v_r)/(v_l × v_r) + (v_l × x)/(v_l × v_r) − 1 |       ⇒ fit tightly to visible boundaries, L_a   (4)
      + γ (1/n) Σ_i ||c_i||                                                             ⇒ minimize the area of contours, L_t

where x is a point from X, β and γ are weight parameters for L_a and L_t, and × denotes the cross product. Note that in L_c, l and r are selected so that d_l and d_r span a wedge shape containing x (as shown in Fig. 6). Intuitively, L_c computes the barycentric coordinates of x with regard to v_l and v_r within the triangle △hv_lv_r and encourages x to be on the same side as h with regard to the edge v_lv_r, the necessary and sufficient condition for x ∈ △hv_lv_r. Similarly, L_a forces the points on the visible boundary B to be on the predicted boundary as well. Meanwhile, L_t pulls all v_i towards h.

Figure 6: StarPoly formulation.

Results  In Tab. 2 we see that updating the bounding box output of PV-RCNN to StarPoly contours significantly improves the mean SDE under all distance buckets (e.g. at 0-5m, it improves from 10.7cm to 8.6cm, which is around 20% error reduction). Similar improvements also appear on the ground truth boxes (col. 4-6). In Fig. 5, we also show how StarPoly improves on SDE@t. At all time steps, StarPoly has lower SDE than both bounding boxes and visible contours, showing its advantage of getting the best of both worlds.

5 Egocentric Evaluation of 3D Object Detectors

In this section, we incorporate SDE into the standard average precision (AP) metric and evaluate various detectors and shape representations on the Waymo Open Dataset [53].

SDE-AP: Detection AP based on the SDE shape metric  To compare different detectors on their egocentric performance, we cannot just use the SDE measure, which does not consider false positive (FP) and false negative (FN) cases. Therefore, we propose to adapt the traditional IoU-based AP (IoU-AP) to an SDE-based one (SDE-AP). Specifically, we replace the classification criterion for true positives (TP) from an IoU threshold with an SDE-based one and use SDE = 20cm as the threshold (see the supplementary material for more on why we selected this number). In addition, we use an SDE-based criterion (instead of an IoU-based one) to match predictions and ground truth.
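A minimal sketch of this TP criterion, reusing support_distance_error from the earlier snippet, might look as follows; the data layout is our own assumption, and the matching logic in the released evaluation may differ.

```python
SDE_THRESHOLD = 0.20  # meters; the delta used throughout the paper

def classify_detections(matched_pairs, ego_pose, threshold=SDE_THRESHOLD):
    """Split detections into true and false positives by the SDE criterion.

    matched_pairs: list of (pred_boundary, gt_boundary) tuples, where
                   gt_boundary is None for unmatched predictions.
    ego_pose: (ego_center, ego_heading) at the evaluation time step.
    """
    ego_center, ego_heading = ego_pose
    tp, fp = [], []
    for pred, gt in matched_pairs:
        if gt is not None and support_distance_error(
                gt, pred, ego_center, ego_heading) < threshold:
            tp.append(pred)
        else:
            fp.append(pred)
    return tp, fp
```

Because the criterion depends on the ego pose, the same detection can flip between TP and FP at different time steps, which is exactly what the future variants SDE-AP@t and SDE-APD@t below exploit.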
SDE-APD: inverse distance weighted SDE-AP  Although SDE-AP is based on the egocentric SDE measure, it weights objects at various distances from the ego-agent trajectory equally. To design a more strongly egocentric measure, we further propose a variant of SDE-AP with inverse-distance weighting, termed SDE-APD (the suffix D means distance weighted). Specifically, for a given frame we have detections B = {b_i}, i = 1, ..., N and ground truth objects G = {g_j}, j = 1, ..., M. We denote the matched ground truth object for b_i as g(b_i) ∈ G. A prediction is counted as a true positive if SDE(b_i, g(b_i); e) < δ, where δ is the SDE threshold and e is the ego-agent pose. Then we define the set of true positive predictions as TP = {b_i | SDE(b_i, g(b_i); e) < δ} and the false positive predictions as FP = B − TP. The inverse distance weighted TP count (IDTP), FP count (IDFP), and ground truth count (IDG) for the frame are:

    IDTP = Σ_{b_i ∈ TP} 1/d_{g(b_i)}^β,   IDFP = Σ_{b_i ∈ FP} 1/d_{b_i}^β,   IDG = Σ_{g_i ∈ G} 1/d_{g_i}^β   (5)

where d is the Manhattan distance from the prediction shape center to the ego-agent center and β is a hyperparameter controlling how much we focus on the close-by objects (we set β = 3; see the supplementary material for more details).

The inverse distance weighted precision and recall are defined as IDTP/(IDTP + IDFP) and IDTP/IDG respectively; both remain within [0, 1]. The SDE-APD is the area under the PR curve. Similar to SDE, which is defined both for the current frame and for future frames (SDE@t), the SDE-AP and SDE-APD metrics also have future equivalents, SDE-AP@t and SDE-APD@t, that can evaluate the impact of current frame perception on future plans.

5.1 Comparing Different Detectors on SDE-AP and SDE-APD

In this subsection, we compare a few representative point-cloud-based 3D object detectors on the SDE-AP and SDE-APD metrics. We study several popular detectors: PointPillars [20], a light-weight and simple detector widely used as a baseline; PV-RCNN [44], a state-of-the-art detector with sophisticated feature encoding; MVF++ [67, 41] (an improved version of the multi-view fusion detector), a recent top-performing detector; and finally 5F-MVF++ [41], an extended version of MVF++ taking point clouds from 5 consecutive frames as input, the most powerful among all.

Table 3: SDE-APD and IoU-AP⁴ of different detectors.

Method         SDE-APD   IoU-AP
5F-MVF++       0.874     0.863
MVF++          0.834     0.814
PV-RCNN        0.808     0.797
PointPillars   0.817     0.720

⁴The IoU-AP is computed using euclidean distance matching and 2D IoU 0.7 as the threshold.

Figure 7: Distance breakdowns of IoU-AP and SDE-AP. SDE-AP better differentiates egocentric detection quality than IoU-AP, especially in the near ranges ([0m, 5m] and [5m, 10m]).

Fig. 7 shows the SDE-AP with distance breakdowns for all detectors, and Table 3 shows the egocentric SDE-APD metric. An interesting observation from Fig. 7 about IoU-AP is that, while the four detectors have fairly close IoU-APs at close ranges (e.g. [0m, 5m]), we see significant gaps among them at longer ranges (e.g. [20m, 40m]). Since there are more objects at longer ranges, those long-range buckets typically dominate the overall IoU-AP. In contrast, the SDE-AP is consistently more discriminative of the detectors, especially in the very short range of [0m, 5m]. We even see some change of rankings: PointPillars, with the lowest overall IoU-AP, outperforms PV-RCNN and MVF++ in close-range [0m, 5m] SDE-AP, suggesting it has a particularly strong short-range performance. This also implies that simply examining IoU-AP for selecting detectors can be sub-optimal, and our SDE-AP can provide an informative alternative perspective.
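To make the weighting in Eq. (5) concrete, here is a minimal sketch (our own illustration) of the inverse-distance-weighted counts and the resulting precision/recall for a single frame:

```python
import numpy as np

def sde_apd_counts(tp_dists, fp_dists, gt_dists, beta=3):
    """Inverse-distance-weighted counts of Eq. (5) for one frame.

    tp_dists: Manhattan distances d_{g(b_i)} of the matched GT centers to the
              ego center, one per true positive prediction.
    fp_dists: distances d_{b_i} of the false positive prediction centers.
    gt_dists: distances d_{g_i} of all ground truth object centers.
    """
    def w(d):
        d = np.asarray(d, dtype=float)
        return float(np.sum(1.0 / d**beta)) if d.size else 0.0

    idtp, idfp, idg = w(tp_dists), w(fp_dists), w(gt_dists)
    precision = idtp / (idtp + idfp) if idtp + idfp > 0 else 0.0
    recall = idtp / idg if idg > 0 else 0.0
    return precision, recall

# With beta = 3, a close-by FP (5m) hurts precision far more than a distant
# one (40m): ~0.11 vs ~0.98 for the same single TP at 10m.
p_near, _ = sde_apd_counts(tp_dists=[10.0], fp_dists=[5.0], gt_dists=[10.0])
p_far, _ = sde_apd_counts(tp_dists=[10.0], fp_dists=[40.0], gt_dists=[10.0])
assert p_near < p_far
```

SDE-APD then accumulates these weighted counts over detection score thresholds and takes the area under the resulting PR curve, exactly as in standard AP.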
5.2 Comparing Various Shape Representations

In this subsection, we evaluate how detector output representations affect the overall detection performance in terms of SDE-APD, evaluated at the current frame as well as into the future.

StarPoly implementation details  For the encoding neural network, we use the standard PointNet [40] architecture followed by a fully-connected layer to transform latent features to (c_1, ..., c_n). We use a resolution n = 256 for all following experiments. (d_1, ..., d_n) is uniformly sampled from the boundary of a square. During training, γ and β are both set to 0.1, which was determined by a grid search over the hyperparameters. Please refer to the supplementary material for more model details.

Results  In Fig. 8, we compare the egocentric performance of different representations using SDE-APD. StarPoly consistently improves the egocentric result quality across the different detectors. Interestingly, StarPoly largely closes the gap between the different detectors, reducing the difference between 5F-MVF++ [41] and PV-RCNN by a factor of 3. This implies that StarPoly's amodal contours can greatly compensate for the limitations of the initial detection boxes, especially of those with poorer quality. StarPoly also outperforms convex visible contours (CVC) across all detectors. In Fig. 9, we evaluate the performance of the different representations at future time steps. We observe that StarPoly remains superior to the detector boxes across all time steps, differentiating it from the convex visible contours, which decay catastrophically over time (shown in Fig. 5). Fig. 10 shows a scene with StarPoly amodal contours estimated for all vehicles. The zoomed-in figures reveal how amodal contours have more accurate geometry, and lower SDE, than both boxes and visible contours. However, they are not yet perfect, especially on the occluded object sides. Improving contour estimation even further is a promising direction for future work.

Figure 8: SDE-APD of detectors with different output representations (T = 0s).

Figure 9: SDE-APD@t of boxes and StarPoly based on different detectors.

Figure 10: Qualitative results. A scene from the validation set of the Waymo Open Dataset with StarPoly predictions shown as green contours. We also zoom in on the 4 vehicles closest to the ego-agent and compare StarPoly (green) with the predicted box (red) and CVC (blue). SDE_lat is reported under each zoom-in.

6 Conclusion

In this paper, we propose egocentric metrics for 3D object detection, measuring its quality in the current time step, but also its effects on the ego-agent's plans in future time steps. Through analysis, we have shown that our egocentric metrics provide a valuable signal for robotic motion planning applications, compared to the standard box intersection-over-union criterion. Our metrics reveal that the coarse geometry of bounding boxes limits the egocentric prediction quality. To address this, we have proposed using amodal contours as a replacement for bounding boxes and introduced StarPoly, a simple method to predict them without direct supervision. Extensive evaluation on the Waymo Open Dataset demonstrates that StarPoly consistently improves existing detectors with respect to our egocentric metrics.

Acknowledgements and Funding Transparency Statement  We thank Johnathan Bingham and Zoey Yang for the help on proofreading our drafts, and the anonymous reviewers for their constructive comments. All the authors are full-time employees of Alphabet Inc. and are fully funded by the subsidiaries of Alphabet Inc., Google LLC and Waymo LLC. All experiments are done using the resources provided by Waymo LLC.

References

[1] Sven Bambach, Stefan Lee, David J Crandall, and Chen Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In Proceedings of the IEEE International Conference on Computer Vision, pages 1949–1957, 2015.

[2] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP, 06 2016.

[4] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, 2017.

[5] Y. Chen, S. Liu, X. Shen, and J. Jia. Fast Point R-CNN. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9774–9783, 2019.

[6] Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C. Berg, and Alexander Kirillov. Boundary IoU: Improving object-centric image segmentation evaluation. In CVPR, 2021.

[7] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3D-encoder-predictor CNNs and shape synthesis. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.

[8] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In ICRA. IEEE, 2017.

[9] Alireza Fathi, Xiaofeng Ren, and James M Rehg. Learning to recognize objects in egocentric activities. In CVPR 2011, pages 3281–3288. IEEE, 2011.

[10] Gabriela Csurka, Diane Larlus, and Florent Perronnin. What is a good evaluation measure for semantic segmentation? In Proceedings of the British Machine Vision Conference. BMVA Press, 2013.

[11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.

[12] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3D object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[13] Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. What you see is what you get: Exploiting visibility for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[14] Rui Huang, Wanyue Zhang, Abhijit Kundu, Caroline Pantofaru, David A. Ross, Thomas A. Funkhouser, and Alireza Fathi. An LSTM approach to temporal 3D object detection in lidar point clouds. CoRR, 2020.

[15] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. EPIC-Fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5492–5501, 2019.

[16] Qiuhong Ke, Mario Fritz, and Bernt Schiele. Time-conditioned action anticipation in one shot. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9925–9934, 2019.

[17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

[19] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3D proposal generation and object detection from view aggregation. In IROS. IEEE, 2018.

[20] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, 2019.

[21] Bo Li. 3D fully convolutional network for vehicle detection in point cloud. arXiv preprint arXiv:1611.08069, 2016.

[22] Cheng Li and Kris M Kitani. Model recommendation with virtual probes for egocentric hand detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2624–2631, 2013.

[23] Cheng Li and Kris M Kitani.
Pixel-level hand detection in ego-centric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3570–3577, 2013.

[24] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun. Multi-task multi-sensor fusion for 3D object detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7337–7345, 2019.

[25] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3D object detection. In ECCV, pages 641–656, 2018.

[26] Miao Liu, Siyu Tang, Yin Li, and James M Rehg. Forecasting human-object interaction: joint prediction of motor attention and actions in first person video. In European Conference on Computer Vision, pages 704–721. Springer, 2020.

[27] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[28] Minghuang Ma, Haoqi Fan, and Kris M Kitani. Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1894–1903, 2016.

[29] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez. Sensor fusion for joint 3D object detection and semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1230–1237, 2019.

[30] Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S Davis, and Alireza Fathi. DOPS: Learning to detect 3D objects and predict their 3D shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11913–11922, 2020.

[31] Evonne Ng, Donglai Xiang, Hanbyul Joo, and Kristen Grauman. You2Me: Inferring body pose in egocentric video via first and second person interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9890–9900, 2020.

[32] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, Zhifeng Chen, Jonathon Shlens, and Vijay Vasudevan. StarNet: Targeted computation for object detection in point clouds. CoRR, 2019.

[33] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 724–732, 2016.

[34] Jonah Philion, Amlan Kar, and Sanja Fidler. Learning to evaluate perception models using planner-centric metrics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[35] Hamed Pirsiavash and Deva Ramanan. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2847–2854. IEEE, 2012.

[36] Rafael Possas, Sheila Pinto Caceres, and Fabio Ramos. Egocentric activity recognition on a budget. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5967–5976, 2018.

[37] Charles R Qi, Xinlei Chen, Or Litany, and Leonidas J Guibas. ImVoteNet: Boosting 3D object detection in point clouds with image votes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4404–4413, 2020.

[38] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep Hough voting for 3D object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 9277–9286, 2019.

[39] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum PointNets for 3D object detection from RGB-D data. In CVPR, 2018.

[40] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation.
Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.

[41] Charles R Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3D object detection from point cloud sequences. CVPR, 2021.

[42] Dandan Shan, Jiaqi Geng, Michelle Shu, and David F Fouhey. Understanding human hands in contact at internet scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9869–9878, 2020.

[43] Yang Shen, Bingbing Ni, Zefan Li, and Ning Zhuang. Egocentric activity prediction via event modulated attention. In Proceedings of the European Conference on Computer Vision (ECCV), pages 197–212, 2018.

[44] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529–10538, 2020.

[45] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. arXiv preprint arXiv:1812.04244, 2018.

[46] Weijing Shi and Ragunathan (Raj) Rajkumar. Point-GNN: Graph neural network for 3D object detection in a point cloud. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[47] Martin Simon, Stefan Milz, Karl Amende, and Horst-Michael Gross. Complex-YOLO: An Euler-region-proposal for real-time 3D object detection on point clouds. In ECCV, 2018.

[48] Vishwanath A Sindagi, Yin Zhou, and Oncel Tuzel. MVX-Net: Multimodal VoxelNet for 3D object detection. arXiv preprint arXiv:1904.01649, 2019.

[49] Suriya Singh, Chetan Arora, and C V Jawahar. First person action recognition using deep learned descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2620–2628, 2016.

[50] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In CVPR, 2016.

[51] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. LSTA: Long short-term attention for egocentric action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9954–9963, 2019.

[52] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. Gate-shift networks for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1102–1111, 2020.

[53] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.

[54] Denis Tome, Patrick Peluse, Lourdes Agapito, and Hernan Badino. xR-EgoPose: Egocentric 3D human pose from an HMD camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7728–7738, 2019.

[55] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. RSS, 1317, 2015.

[56] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Thomas Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. In ECCV, 2020.

[57] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. PointFusion: Deep sensor fusion for 3D bounding box estimation. In CVPR, pages 244–253, 2018.

[58] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.

[59] Bin Yang, Ming Liang, and Raquel Urtasun. HDNET: Exploiting HD maps for 3D object detection. In Conference on Robot Learning, pages 146–155. PMLR, 2018.

[60] Bin Yang, Wenjie Luo, and Raquel Urtasun. PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.

[61] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia.
STD: Sparse-to-dense 3D object detector for point cloud. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1951–1960, 2019.

[62] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3DSSD: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11040–11048, 2020.

[63] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. IPOD: Intensive point-based object detector for point cloud, 2018.

[64] M. Ye, S. Xu, and T. Cao. HVNet: Hybrid voxel network for lidar based 3D object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1628–1637, 2020.

[65] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang. Lidar-based online 3D video object detection with graph-based message passing and spatiotemporal transformer attention. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11492–11501, 2020.

[66] Ye Yuan and Kris Kitani. Ego-pose estimation and forecasting as real-time PD control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10082–10092, 2019.

[67] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3D object detection in lidar point clouds. In Conference on Robot Learning, pages 923–932, 2020.

[68] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In CVPR, 2018.

Revisiting 3D Object Detection From an Egocentric Perspective: Supplementary Material

This document provides supplementary content to the main paper. In Sec. A, we expand the discussion on comparing our egocentric metric with a recent planner-based 3D object detection metric. In Sec. B, we provide more details of the StarPoly architecture and training. Sec. C explains more about how we select hyperparameters for our egocentric metrics. Finally, Sec. D shows more visualization results.

A Discussion

Similar to our metric, a recent work (planner-centric metrics) [34] also follows an egocentric approach to evaluating 3D object detection. It measures the KL-divergence of the planner's prediction based on either the ground truth or the detection. However, we would like to highlight two differences between our SDE-based metrics (SDE-AP and SDE-APD) and the planner-centric metrics: stability and interpretability.

Stability. In the planner-centric metrics [34], a pre-trained planner is required for the evaluation. Consequently, the metric is highly dependent on the architectural choices of the planner and may vary drastically when switched to a different one. Moreover, as the proposed planner is learned from data, many factors in the training can significantly affect the evaluation outcome: 1) the metric depends on a stochastic gradient descent (SGD) optimization to train the planner, which may fall into a local minimum; 2) the metric depends on a training set of trajectories, which will vary depending on the shift of the data distribution. Furthermore, if the planner is trained on ground truth boxes, it may not reflect the preferences of a practical planner, which is usually optimized for a certain perception stack. In contrast, SDE-based metrics don't require any parametric models. Their evaluation is consistent and can be universally interpreted across different datasets or downstream applications.

Interpretability. Because the KL-divergence employed in [34] only conveys the correlation of two sets of distributions, the magnitude of the metric is difficult to interpret. To fully understand the detection errors, one has to investigate the types of failures made by the planner, which vary depending on the type of planner used. On the other hand, our proposed SDE directly measures the physical distance estimation error in meters.
For SDE-based AP metrics, an intuitive interpretation is the frequency of detections whose distance estimation error is within an empirically set threshold. Therefore, SDE-based metrics have a clear physical meaning, which translates the complex model predictions into safety-sensitive measurements.

B StarPoly Model Details

B.1 Architecture

Our StarPoly model takes as input the point clouds cropped from (extended) detection bounding boxes. We apply a padding of 30cm along the length and width dimensions for all detection boxes before the cropping. The point cloud is normalized before being fed into StarPoly, based on the center, dimensions, and heading of each bounding box. In addition, the point cloud is subsampled to 2048 points before being processed by StarPoly. We use a PointNet [40] to encode the point cloud into a latent feature vector of 1024-d. Then, we reduce the dimensions of the latent feature vector from 1024-d to 512-d with a fully-connected layer. At last, another fully-connected layer is employed to predict the n-d parameters of a star-shaped polygon, where n is the resolution of the star-shaped polygon (as stated in Sec. 4.3). We use n = 256 for all the experiments in the main paper. As for selecting (d_1, ..., d_n), we uniformly sample directions on the boundary of a square, inspired by the prior that the objects of interest in this paper, i.e. vehicles, are symmetric and are approximately of rounded square or rectangular shapes.

Figure 11: SDE-APD with various distance thresholds. Note that at a more stringent error threshold, e.g., 0.1m, SDE-APD clearly differentiates the detectors' box quality, where 5F-MVF++ keeps outperforming the others and PointPillars excels as well.

Figure 12: SDE-APD with various β. We change the inverse distance weighting degree, β, in the SDE-APD computation. Note that as we increase the degree, which shifts more focus to close objects, the SDE-APD of PointPillars [20] gradually catches up with and even surpasses PV-RCNN [44] at β = 3. This is coherent with our study of SDE-AP's distance breakdowns. We can therefore conclude that β is a knob in SDE-APD to control the level of egocentricity based on object distances.

B.2 Optimization

We train StarPoly on the training split of the large-scale Waymo Open Dataset [53]. Because StarPoly aims to refine the results of a detector, we first use a pre-trained detector to crop out point clouds as described in Sec. B.1. Then we optimize StarPoly independently using the prepared point clouds. For all the experiments in the paper, we use StarPoly trained on MVF++. We find that StarPoly can generalize to different detectors even if trained on only one detector. For the StarPoly optimization, we use the Adam optimizer [17] with β_1 = 0.9, β_2 = 0.99, and a learning rate of 0.001. For all experiments in the paper, we train StarPoly for 500,000 steps with a batch size of 64 and set γ = 0.1.

C Details about SDE and SDE-APD

C.1 Selection of metric hyperparameters

Selection of the SDE threshold  As defined in main paper Sec. 5, we classify true positive predictions by comparing the SDE with a threshold. We use a threshold of 20cm for all experiments in the paper. Unlike the IoU threshold, our threshold has a direct physical meaning in safety-critical scenarios, i.e. the amount of estimation error by perception that an autonomous vehicle can handle. Therefore, it can be selected according to real world use cases. In this paper, we select 20cm via analyzing the SDE of ground truth bounding boxes (as shown in main paper Table 2). We find their overall mean SDE to be 0.1m and therefore determine a relaxed value of 0.2m as the threshold. One can also use different thresholds for the evaluation, just as one can use different IoU thresholds for box classification. Figure 11 illustrates the comparison among detectors with varying thresholds, i.e., 0.1m, 0.2m, 0.3m. We can see that PointPillars [20] demonstrates stronger performance compared to PV-RCNN [44] when the threshold is set more stringently.
In addition, the effectiveness of using multi-frame information is more pronounced when the evaluation criterion becomes more rigorous.

Selection of β in the inverse distance weighting  To be more egocentric in our evaluation, we propose to extend the Average Precision (AP) computation by introducing inverse distance weighting. This strategy aims to automatically emphasize the objects close to the ego-agent's trajectory over those far away. As the number of objects grows roughly quadratically with the distance, setting β = 2 (inverse square) would put equal weight on all distances. Since we want to highlight the importance of close-by objects, we go a step further and set β = 3.

Fig. 12 shows the SDE-APD (evaluated at time step 0) with different choices of β. Setting β = 0 means all objects contribute equally to the AP metric, where we see the greatest gap between the best and the worst detectors (5F-MVF++ [41] vs. PointPillars [20]). As we increase β, i.e. making the overall AP metric more egocentric by weighting the close-by objects more heavily, we see that PointPillars (with great close-by accuracy) catches up with PV-RCNN [44] and MVF++. The general differences between detectors also become smaller, as they perform similarly well for objects close to the ego-agent's trajectory (the difference in the original IoU-AP is more related to their performance difference on far-away objects).

C.2 Importance of inverse distance weighting and SDE in SDE-APD

In SDE-APD, we introduce inverse distance weighting as a simple proxy for distance breakdowns. To investigate the impact of such weighting, we extend IoU-AP to IoU-APD with the same distance weighting as SDE-APD. The results are shown in Tab. 4. Note that while IoU-APD and IoU-AP produce the same ordering, SDE-APD is able to reveal a different ranking between PointPillars and PV-RCNN, from which we conclude that SDE plays the more important role.

Table 4: SDE-APD, IoU-APD, and IoU-AP of different detectors.

Method         SDE-APD   IoU-APD   IoU-AP
5F-MVF++       0.874     0.989     0.863
MVF++          0.834     0.981     0.814
PV-RCNN        0.808     0.972     0.797
PointPillars   0.817     0.966     0.720

C.3 Composition of lateral and longitudinal distance errors in SDE

In our default definition, SDE is the maximum of the lateral distance error and the longitudinal error. In Tab. 5 we investigate the composition of these two sub-errors of the SDE. Specifically, we employ the detection boxes predicted by PV-RCNN as the detection output and calculate the mean and median of all valid SDE_lat and SDE_lon values. Note that "valid" means that the object doesn't intersect the lateral line (for SDE_lat) or the longitudinal line (for SDE_lon) and that the box is matched with a ground truth object. We also compute the portion of SDEs that are equal to the lateral component, i.e. SDE_lat > SDE_lon, and the portion of SDEs that are equal to the longitudinal component. From the statistics, we find that the lateral and longitudinal components contribute almost equally to the final SDE.

Table 5: Composition of SDE from PV-RCNN's detection boxes.

Statistics      SDE_lat   SDE_lon
Mean (m)        0.17      0.17
Median (m)      0.12      0.11
Contribution    52%       48%

C.4 Distribution of signed SDE

In this work, we intend to bring attention to the idea of egocentric evaluation. We propose SDE without sign as a simple implementation of this idea with minimal hyperparameters required. It is straightforward to extend it to more complicated versions with the sign included. In Fig. 13, we plot the distribution of the signed SDE of detector boxes. It demonstrates that box predictions are generally oversized, i.e. have positive SDEs.
Based on specific requirements of an application, one can also use more fine-grained thresholds, e.g. different thresholds for positive and negative errors, and select the most suitable setup based on one's priorities.

Figure 13: Distribution of signed SDE. We show the distribution of max(SDE_lat, SDE_lon) of PV-RCNN's box detections, where positive means over-sized predictions and negative means under-sized. Box predictions have an oversizing bias.

Table 6: Distributions of error measures in two types of collision detection cases. In "TP Collision", both the ground truth points and the prediction report a collision. In "FP/FN Collision", either the ground truth (FN) or the prediction (FP) reports a collision. Here we use the aggregated point clouds to test collisions. The results align with Tab. 1.

               TP Collision         FP/FN Collision
Measure        Mean     Median      Mean     Median
IoU ↑          0.902    0.911       0.904    0.903
SDE ↓          0.114    0.095       0.161    0.153

C.5 Collision correlation of SDE and IoU based on contours

In Tab. 1, we use the ground truth box to test collisions, to align with the evaluation of IoU. In Tab. 6, we re-computed the table using the contours drawn from our aggregated ground truth points, which should be the most accurate shape accessible. The gap between IoU and SDE is almost the same as in the original Tab. 1, which uses boxes for the collision tests.

D Qualitative Results

In this section we provide additional qualitative analysis. Fig. 14 shows how our metrics evaluate predictions at a future time step. We compare different representations both at the current time frame and at a future time frame. Our metrics are egocentric in the sense that they take into account the relative positions of the objects to the agent's trajectory in both the current and future time steps. Clearly, our proposed representation, StarPoly, outperforms both the box and convex visible contour (CVC) representations at the future time step. Fig. 15 shows a case where the CVC fails to capture the full shape of the object due to its vulnerability to occlusions.

Figure 14: Qualitative results for evaluating predictions at a future time step. Top: predictions at time T = 0s. Bottom: evaluations at T = 8s. On the right of each row is the zoomed-in view where the prediction and the point cloud cropped by the ground truth bounding box are shown. SDE_lon is reported under the zoom-ins for each representation. A faraway object at T = 0s can become very close to the agent at a future time step (as shown for T = 8s). While the convex visible contour (CVC) may achieve comparable results to StarPoly at T = 0s, its performance drops considerably when evaluated at T = 8s. This is why StarPoly achieves better results than the box and CVC representations across different time steps.

Figure 15: Qualitative results showing the limitation of the convex visible contour (CVC). As depicted, due to the occlusion, CVC fails to cover the whole extent of the object. Note that we have visualized both the Lidar points from the current frame (in gray) and the aggregated points (in green), which are used to represent the true object shape.