LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar*, Jyh-Jing Hwang, Dragomir Anguelov
Waymo LLC
*Work done while at Waymo.

Fig. 1. Evaluating camera-only 3D detections when using the 3D Average Precision (3DAP) metric (left) and when using the proposed longitudinal error tolerant LET-3D-AP(L) metric (right). The figures depict the bipartite matching between the detections (green) and the ground truth objects (black). Regular 3DAP matching (left) is based on intersection over union (IoU) values and cannot match detections that suffer from longitudinal localization errors, even though such detections are reasonable and can provide useful signals to downstream modules. In contrast, the proposed LET-3D-AP(L), shown on the right, is more permissive: it shifts the predictions to mitigate the longitudinal localization errors. We show the shifted predictions in blue, which are used to compute the longitudinal error tolerant intersection over union (LET-IoU). To account for the longitudinal tolerance that is used, we propose the longitudinal affinity (LA) as a measure of how close the original prediction is to the ground truth in the longitudinal direction.

Abstract—The 3D Average Precision (3DAP) metric relies on the intersection over union between predictions and ground truth objects. However, camera-only detectors have limited depth accuracy, which may cause otherwise reasonable predictions that suffer from longitudinal localization errors to be treated as false positives. We therefore propose variants of the 3DAP metric that are more permissive with respect to depth estimation errors. Specifically, our novel longitudinal error tolerant metrics, LET-3D-AP and LET-3D-APL, allow longitudinal localization errors of the prediction boxes up to a given tolerance. To evaluate the proposed metrics, we also construct a new test set for the Waymo Open Dataset, tailored to camera-only 3D detection methods. Surprisingly, we find that state-of-the-art camera-based detectors can outperform popular LiDAR-based detectors under our new metrics beyond a 10% depth error tolerance, suggesting that existing camera-based detectors already have the potential to surpass LiDAR-based detectors in downstream applications. We believe the proposed metrics and the new benchmark dataset will facilitate advances in the field of camera-only 3D detection by providing more informative signals that better indicate system-level performance.

I. INTRODUCTION
Detecting objects in 3D space is a fundamental task in many robotics applications, including autonomous driving, unmanned aerial vehicles, robot navigation, and augmented reality. While LiDAR-based object detection [1]–[7] has been studied extensively in recent years with impressive results, reliable camera-based 3D object detection remains a challenging and active area of research [8]–[20].

According to common metrics, LiDAR-based detectors outperform camera-based detectors by a large margin, suggesting that camera-based object detectors fail to detect many objects. However, a careful evaluation of the failure cases reveals that many camera-based detectors identify objects reasonably well. The issue is rooted in these detectors often estimating depth poorly. Common metrics, such as the 3D Average Precision (3DAP), rely on the intersection over union (IoU) to associate the prediction boxes with the ground truth boxes. As a result, they may treat reasonable predictions that suffer from these longitudinal localization errors as false positives, leading to lower and less informative scores.

We therefore propose a variant of the 3DAP metric that is more permissive to depth errors. The proposed longitudinal error tolerant metrics, LET-3D-AP and LET-3D-APL, allow longitudinal localization errors up to a given tolerance. Specifically, we define the longitudinal error to be the object localization error along the line of sight between the camera and the ground truth object. The maximum longitudinal error that our metric tolerates is an adjustable percentage of the distance between the camera and the ground truth object.

For each prediction and ground truth pair with a tolerable longitudinal error, we correct the longitudinal error by shifting the prediction box along the line of sight between the camera and the center of the prediction box. We then use the resulting corrected box to compute the IoU, which we refer to as the longitudinal error tolerant IoU (LET-IoU). The precision and recall values are then computed by performing a bipartite matching with weights based on the longitudinal affinity and the LET-IoU values.

Finally, we define two longitudinal error tolerant metrics, LET-3D-AP and LET-3D-APL. First, LET-3D-AP is the average precision computed on the matching results of the proposed method. Note that this metric does not penalize any corrected errors and is therefore comparable to the original 3DAP metric, but with a more tolerant matching. In contrast, LET-3D-APL penalizes longitudinal localization errors by scaling the precision.

To evaluate the proposed metrics, we extend the existing Waymo Open Dataset [21] by (1) creating a new dedicated test set of 80 segments (80,000 images) with the LiDAR data redacted, and (2) providing camera-synchronized 3D box labels for the entire dataset, eliminating the camera/label synchronization gap. Waymo included these extensions as part of the Waymo Open Dataset 3D Camera-Only Detection Challenge. The challenge remains open for submissions and can be used to benchmark camera-only detection methods.

Fig. 2. Breakdown of the localization error. We decompose the 3D detection localization error into a lateral error and a longitudinal error. We find that the longitudinal error is more prominent in camera-only 3D detection. We therefore propose longitudinal error tolerant (LET) metrics that are more permissive with respect to the longitudinal localization error.

II. RELATED WORK

Following popular 2D detection benchmarks, such as PASCAL VOC [22] and COCO [23], most 3D object detection benchmarks rely on the 3DAP metric to evaluate detections based on the intersection over union (IoU), either in 3D or in a bird's eye view (BEV), with predefined IoU thresholds. For example, the KITTI dataset [24] and the Waymo Open Dataset [21] adopt 3D IoU as the main matching function. The Waymo Open Dataset proposes a heading accuracy weighted AP, referred to as APH, as the primary metric to penalize incorrect heading predictions. The nuScenes dataset [25] uses the center distance between predicted objects and ground truth objects as the true positive matching criterion and proposes a set of true positive metrics to quantify other errors, including localization, scale, and orientation.
While LiDAR-based detectors [1]–[7] remain the most popular methods for 3D object detection for autonomous driving, monocular camera-based 3D object detection has been gaining traction in recent years. Methods can be roughly categorized into two groups: 1) The first camp leverages mature perspective 2D detectors and equips the networks with 3D box attributes for 3D detection [8]–[14], [26]–[30]. An emerging direction is to leverage Transformer models for implicit depth encoding [17], [31]–[34]. 2) The second camp leverages mature LiDAR detectors and projects 2D perspective features into 3D in the form of point clouds [16], [20], [35]–[40], feature maps in bird's eye view (BEV) [15]–[17], [41]–[52], or voxel grids [18], [19], [53], [54]. These methods then predict the detections by applying 3D detection heads to the 3D/BEV feature maps. Accurate monocular depth estimation is essential for all of these methods. Since monocular depth estimation is intrinsically an ill-posed problem, however, it is difficult for camera-based methods to accurately estimate the depth of objects, resulting in longitudinal localization errors that lead to reduced 3DAP scores when compared to LiDAR-based methods.

As pointed out by Ma et al. [55], the performance of monocular camera 3D detection methods can be greatly improved when their longitudinal localization errors are mitigated by using ground truth depth or localization. This is why we propose longitudinal error tolerant (LET) metrics for evaluating camera-only 3D detection methods. While the generalized intersection over union (GIoU) [56] and its 3D variant [57] can also be used to match non-overlapping bounding boxes, they do not target errors in a specific direction, and shape and heading mismatches are not factored into the non-overlapping cases.

III. LONGITUDINAL ERROR TOLERANT 3D AP

The 3DAP metric relies on the IoU to match prediction boxes with ground truth boxes. Therefore, prediction boxes that have little or no overlap with the corresponding ground truth boxes will be treated as false positives. However, these predictions may still contribute valuable information to the decision making of an autonomous driving system. We therefore propose a metric that rewards such detections.

Our proposed metrics are inspired by the following objectives: First, given a model of an assumed localization error distribution, design a new matching criterion so that a predicted box may still match a target ground truth box even when the two do not match in terms of IoU. Second, design a new bipartite matching cost function that takes the localization, shape, and heading errors into account so that frame-level matching can be properly calculated. Last, design a penalty term to penalize detections that can only be matched with the ground truth when using the longitudinal error tolerant matching.

A. Localization Errors in Camera-Based 3D Detection

We assume a ground truth bounding box G with center g in R^3 and a predicted bounding box P with center p in R^3, where the origin [0, 0, 0] is the camera location, or a mean camera position in the case of a multi-camera setup. That is, all 3D location vectors are expressed in the camera frame. We define the localization error as the vector

    e_loc = p − g.    (1)

We decompose the localization error into two components:

- The longitudinal error e_lon is the component of the error along the line of sight from the sensor to the center of the ground truth box, giving e_lon = (e_loc · u_G) · u_G, where u_G = g / |g| is the unit vector of g.
- The lateral error e_lat is the distance between the predicted box P and the line of sight to the ground truth box G. It is defined as the shortest distance from the predicted box center to any point on the line of sight, leading to e_lat = e_loc − e_lon.

Figure 2 illustrates the error terms; a short code sketch of the decomposition is given below.
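To make the decomposition concrete, the following minimal sketch computes the longitudinal and lateral error components from a predicted and a ground truth box center using NumPy. It assumes the centers are expressed in the camera frame as defined above; the function name is ours and is not part of any released evaluation code.

```python
import numpy as np

def decompose_localization_error(p, g):
    """Split the localization error e_loc = p - g into longitudinal and
    lateral components with respect to the ground truth line of sight.

    p, g: 3D box centers (prediction, ground truth) in the camera frame.
    Returns (e_lon, e_lat) as 3D vectors, following Eq. (1) and the
    decomposition described above.
    """
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    e_loc = p - g                      # Eq. (1)
    u_g = g / np.linalg.norm(g)        # unit vector of the ground truth line of sight
    e_lon = np.dot(e_loc, u_g) * u_g   # projection onto the line of sight
    e_lat = e_loc - e_lon              # remainder is the lateral error
    return e_lon, e_lat

# Example: a prediction 5 m short of a ground truth 50 m straight ahead.
e_lon, e_lat = decompose_localization_error(p=[0.0, 0.0, 45.0], g=[0.0, 0.0, 50.0])
print(np.linalg.norm(e_lon), np.linalg.norm(e_lat))  # 5.0 0.0
```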
We observe that localization errors tend to have the following characteristics for camera-based 3D detectors. Localization errors are most pronounced along the line of sight because of imperfect depth estimation. We assume that the standard deviation of the longitudinal error, e_lon, is proportional to the distance between the sensor and the center of the ground truth bounding box, g. The lateral error, e_lat, is the result of an imperfect estimation of the object center on the camera image plane. Since the size of the bounding box projected onto the camera plane is inversely proportional to the range of a given ground truth G, the standard deviation of the center estimation error in pixels, σ(|e_cam|), is also inversely proportional to the range of the ground truth, that is, σ(|e_cam|) ∝ 1 / |g|. The lateral error in 3D space, however, is scaled by the range projection, that is, σ(|e_lat|) ≈ |g| · σ(|e_cam|). Therefore, the standard deviation of the lateral error is independent of the range of the ground truth object and can be treated as a constant. We therefore only introduce a tolerance for the longitudinal localization errors.

B. Longitudinal Affinity

We propose a scalar value, the longitudinal affinity a_l(p, g), to determine the scores for matching prediction boxes with ground truth boxes given a tolerance for the longitudinal error. Specifically, the longitudinal affinity, whose value lies in [0.0, 1.0], estimates how well the centers of a prediction box and a ground truth box align along the line of sight. Given the longitudinal error e_lon between a pair of prediction and ground truth, we define the longitudinal affinity based on the following hyperparameters:

- Longitudinal tolerance percentage T_l^p: The maximum tolerated longitudinal error is expressed as a percentage T_l^p of the range to the ground truth G. For example, T_l^p = 0.1 provides a 10% tolerance, so for a ground truth object that is 50 meters away (|g| = 50), the longitudinal tolerance is 5 meters.
- Minimum longitudinal tolerance in meters T_l^m: When a ground truth object is close to the sensor origin, the percentage-based tolerance results in a small matching region. This parameter controls the minimum absolute tolerance and thus mainly affects near-range objects.

Finally, we define the longitudinal affinity a_l(p, g) between a prediction box with center p and a ground truth box with center g as

    a_l(p, g) = 1 − min(|e_lon(p, g)| / T_l, 1.0),    (2)

where T_l = max(T_l^p · |g|, T_l^m).
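The following minimal sketch implements Eq. (2) under the stated assumptions, applying the tolerance to the norm of the longitudinal error with the percentage and minimum-meter hyperparameters. The default values of 10% and 0.5 m are illustrative choices for the sketch, not values prescribed by the metric, and the function name is ours.

```python
import numpy as np

def longitudinal_affinity(p, g, tol_pct=0.10, tol_min_m=0.5):
    """Longitudinal affinity a_l(p, g) as in Eq. (2).

    tol_pct:   longitudinal tolerance percentage T_l^p.
    tol_min_m: minimum longitudinal tolerance T_l^m in meters
               (0.5 m is an illustrative default, not prescribed by the metric).
    """
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    range_g = np.linalg.norm(g)
    u_g = g / range_g
    e_lon = np.dot(p - g, u_g) * u_g            # longitudinal error vector
    t_l = max(tol_pct * range_g, tol_min_m)     # T_l = max(T_l^p * |g|, T_l^m)
    return 1.0 - min(np.linalg.norm(e_lon) / t_l, 1.0)

# A prediction 3 m short of a ground truth 50 m away: tolerance is 5 m, affinity 0.4.
print(longitudinal_affinity([0.0, 0.0, 47.0], [0.0, 0.0, 50.0]))  # 0.4
```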
Fig. 3. Computing LET-IoU. Given a predicted object and a ground truth object to be matched, we move the predicted object along the line of sight to obtain the minimal distance to the ground truth center. We then compute the LET-IoU as the 3D IoU between the aligned predicted object and the ground truth object.

Fig. 4. An example of a matched detection using LET-IoU. The green box denotes the detection, the red box denotes the ground truth, and the blue box denotes the longitudinally aligned detection as described in Section III-C. We also show the connection between the matched prediction box and the aligned box using a purple connector.

C. LET-IoU: Longitudinal Error Tolerant IoU

The longitudinal affinity captures the longitudinal error of a prediction based only on the centers of the predicted bounding box and the ground truth bounding box. To determine whether a prediction can be associated with a ground truth, we also want to take the shape, size, and heading error into account. With the regular 3DAP, this is captured only by the 3D IoU. We therefore propose LET-IoU, where the IoU is calculated between the ground truth bounding box and the prediction bounding box after compensating for the longitudinal error. Specifically, we mitigate the longitudinal error of the prediction by aligning its center along the line of sight with the ground truth.

Given a ground truth with center g and a prediction with center p, the objective is to move the center of the prediction box along the line of sight so that the IoU between the moved prediction box and the ground truth box is maximized:

    p_aligned = argmax_{p_aligned} 3D-IoU(P_aligned, G),    (3)

where P_aligned is the prediction box P with its center updated to p_aligned. However, there is no closed-form solution to (3), and an exhaustive search along the line of sight is computationally expensive. We therefore approximate the objective by minimizing the distance between the moved prediction box center and the ground truth center, which compensates for the longitudinal error. In other words, the center of the aligned prediction is the projection of the ground truth center g onto the line of sight from the sensor to the prediction:

    p_aligned = (g · u_P) · u_P,    (4)

where u_P = p / |p| is the unit vector along the line of sight to the prediction center p. A detailed analysis of the quality of approximation (4) can be found in the supplementary material.

Finally, we obtain the LET-IoU by calculating the standard 3D IoU between the aligned prediction box P_aligned and the target ground truth box G:

    LET-IoU(P, G) = 3D-IoU(P_aligned, G).    (5)

See Figure 3 for an illustration. Note that, for simplicity, this function does not assume that the longitudinal error is within a given tolerance. In the final metric computation, LET-IoU is only calculated on pairs with positive longitudinal affinity, i.e., a_l > 0. Figure 4 shows an example in which the prediction (green) is aligned to the target ground truth (red) and the 3D IoU is computed between the aligned prediction (blue) and the ground truth, resulting in a LET-IoU of 0.7.
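A minimal sketch of the center alignment in Eq. (4) is shown below. Computing the full 3D IoU of oriented boxes requires a polygon intersection routine, which we leave as an injected dependency; the iou_3d callable and the dictionary box representation are placeholders of our own, not a specific library API. The sketch only shows how the prediction center is projected onto its own line of sight before the IoU call, as in Eq. (5).

```python
import numpy as np

def align_prediction_center(p, g):
    """Approximate solution of Eq. (3) via Eq. (4): project the ground truth
    center g onto the line of sight through the prediction center p."""
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    u_p = p / np.linalg.norm(p)        # unit vector toward the prediction center
    return np.dot(g, u_p) * u_p        # Eq. (4)

def let_iou(pred_box, gt_box, iou_3d):
    """LET-IoU as in Eq. (5). Boxes are dicts with 'center', 'size', 'heading';
    iou_3d is any routine that computes the 3D IoU of two such boxes."""
    aligned_center = align_prediction_center(pred_box["center"], gt_box["center"])
    aligned_box = dict(pred_box, center=aligned_center)
    return iou_3d(aligned_box, gt_box)
```

Passing the 3D IoU routine as an argument keeps the sketch independent of any particular box library while preserving the structure of the computation.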
D. Bipartite Matching with Longitudinal Error Tolerance

To calculate precision and recall values, detection metrics need to perform a bipartite matching between the prediction set and the ground truth set. Most bipartite matching algorithms involve computing an association weight matrix W in R^{N_P x N_G}, where N_P and N_G are the numbers of detections and ground truth objects, respectively. The objective is to maximize the summed weights over all matched pairs. We may take the longitudinal error, the shape error, and potentially the heading error into consideration.

The typical matching weight function in 3DAP computes the IoU between a prediction box and a ground truth box with an IoU threshold T_iou:

    W(i, j) = IoU(P(i), G(j))  if IoU(P(i), G(j)) > T_iou,  and 0 otherwise,    (6)

where P and G are the prediction box set and the ground truth box set for a single frame, respectively. In our setting, however, there are two affinity terms to be considered: the longitudinal affinity a_l and the LET-IoU. We therefore propose a bipartite matching weight that takes both terms into account:

    W(i, j) = a_l · LET-IoU  if a_l > 0 and LET-IoU > T_iou,  and 0 otherwise,    (7)

where we omit the function arguments (P(i), G(j)) of a_l and LET-IoU for brevity. This allows us to take the shape and heading error into account while prioritizing detections with higher longitudinal affinity.

After computing the matching weight matrix, we run a bipartite matching method (greedy or Hungarian) to compute the matching results, namely the matched pairs (true positives, TP), the unmatched predictions (false positives, FP), and the unmatched ground truths (false negatives, FN).
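The sketch below builds the weight matrix of Eq. (7) from precomputed per-pair affinity and LET-IoU values and resolves it with a simple greedy matcher, one of the two options mentioned above; a Hungarian solver such as scipy.optimize.linear_sum_assignment could be substituted. This is an illustrative sketch under our assumptions, not the released evaluation code, and the function names are ours.

```python
import numpy as np

def build_weight_matrix(affinity, let_iou, t_iou=0.5):
    """W(i, j) = a_l * LET-IoU (Eq. 7), gated on a_l > 0 and LET-IoU > T_iou.
    affinity, let_iou: (N_P, N_G) arrays of a_l and LET-IoU values per pair."""
    affinity = np.asarray(affinity, dtype=np.float64)
    let_iou = np.asarray(let_iou, dtype=np.float64)
    gate = (affinity > 0.0) & (let_iou > t_iou)
    return np.where(gate, affinity * let_iou, 0.0)

def greedy_match(weights):
    """Greedy bipartite matching: repeatedly take the highest remaining weight."""
    w = np.array(weights, dtype=np.float64, copy=True)
    matches = []                       # (prediction index, ground truth index) pairs
    while w.size and w.max() > 0.0:
        i, j = np.unravel_index(np.argmax(w), w.shape)
        matches.append((int(i), int(j)))
        w[i, :] = 0.0                  # each prediction matches at most one ground truth
        w[:, j] = 0.0                  # and vice versa
    return matches

# Toy example: two predictions competing for one ground truth.
W = build_weight_matrix(affinity=np.array([[0.9], [0.4]]),
                        let_iou=np.array([[0.8], [0.6]]))
print(greedy_match(W))  # [(0, 0)]
```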
E. LET-3D-AP and LET-3D-APL

a) LET-3D-AP: Average Precision with Longitudinal Error Tolerance. Once the matching results are finalized, each matched prediction is counted as a true positive (TP). A prediction without a matching ground truth is counted as a false positive (FP). If a ground truth is not matched with any prediction, it is counted as a false negative (FN). The precision p = |TP| / (|TP| + |FP|) and the recall r = |TP| / (|TP| + |FN|) can then be calculated. After computing the precision and recall values for detection subsets with different score cutoffs, a PR curve is obtained. Finally, LET-3D-AP is the average precision of the PR curve:

    LET-3D-AP = ∫_0^1 p(r) dr,

where p(r) is the precision value at recall r.

Here, we do not penalize depth errors when computing the PR curve, even though we may have had to shift some predictions to compensate for their depth errors. This provides a number comparable to 3DAP, but with a more tolerant matching criterion.

b) LET-3D-APL: Longitudinal Affinity Weighted LET-3D-AP. Here, we penalize predictions that do not overlap with any ground truth, that is, predictions that only match a ground truth bounding box owing to the above-mentioned center alignment. We penalize these predictions using the longitudinal affinity a_l(p, g) proposed above. To this end, we propose a weighted variant of the precision.

Although the number of true positives |TP| is used in both the precision and the recall computation, the two uses represent different quantities. In the precision calculation, |TP_P| is the number of matched predictions, while in the recall calculation, |TP_G| is the number of matched ground truths. We therefore rewrite the precision and recall as

    Precision = |TP_P| / (|TP_P| + |FP|),    (8)
    Recall = |TP_G| / (|TP_G| + |FN|),    (9)

to emphasize the difference between the prediction accumulator and the ground truth accumulator.

In the precision computation, we traverse the predictions to accumulate the FPs and TPs. An unmatched prediction contributes 1.0 to the FP accumulator. A matched prediction P with respect to a ground truth g, however, contributes a_l(p, g) to the TP accumulator and 1 − a_l(p, g) to the FP accumulator. Essentially, we redistribute part of the TP accumulator to the FP accumulator based on the longitudinal affinity. The soft TP and FP accumulators are then

    |TP_P| = Σ_{matched P} a_l(p, g),    (10)
    |FP| = Σ_{matched P} (1 − a_l(p, g)) + Σ_{unmatched P} 1.0.    (11)

Based on this definition of soft TP and FP, the soft precision can be computed as

    Prec_L = |TP_P| / (|TP_P| + |FP|) = ā_l · Precision,    (12)

where ā_l denotes the mean longitudinal affinity of the matched predictions. In other words, the precision is discounted by the multiplier ā_l, the average longitudinal affinity of all matched predictions that were treated as TPs. For the recall computation, we propose not to weight |TP_G|, since the ground truth does get matched and there is no reason to penalize the metric twice. Finally, LET-3D-APL is calculated from the longitudinal affinity weighted PR curve:

    LET-3D-APL = ∫_0^1 Prec_L(r) dr = ∫_0^1 ā_l · Prec(r) dr,    (13)

where Prec_L(r) is the longitudinal affinity weighted precision and Prec(r) is the precision at recall r. Note that the weight ā_l depends on the specific score cutoff of the PR point and therefore cannot be taken out of the integral.
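To illustrate how the soft accumulators in Eqs. (10)–(12) turn into a PR curve, the sketch below scores a list of matched and unmatched predictions sorted by detection score and integrates both the plain and the affinity-weighted precision over recall. It is a simplified, single-class illustration of the definitions above, using a basic rectangular integration rule, and is not the official Waymo Open Dataset evaluation code.

```python
def let_3d_ap_and_apl(pred_records, num_gt):
    """pred_records: list of (score, matched: bool, a_l) tuples, one per prediction.
    num_gt: number of ground truth objects. Returns (LET-3D-AP, LET-3D-APL)."""
    records = sorted(pred_records, key=lambda r: -r[0])  # descending score cutoff
    tp = fp = soft_tp = soft_fp = 0.0
    precisions, soft_precisions, recalls = [], [], []
    for _, matched, a_l in records:
        if matched:
            tp += 1.0
            soft_tp += a_l               # Eq. (10)
            soft_fp += 1.0 - a_l         # matched part of Eq. (11)
        else:
            fp += 1.0
            soft_fp += 1.0               # unmatched part of Eq. (11)
        precisions.append(tp / (tp + fp))
        soft_precisions.append(soft_tp / (soft_tp + soft_fp))   # Eq. (12)
        recalls.append(tp / num_gt)      # recall uses unweighted matches, Eq. (9)
    # Integrate precision over recall (rectangular rule).
    ap = apl = 0.0
    prev_r = 0.0
    for prec, soft_prec, r in zip(precisions, soft_precisions, recalls):
        ap += prec * (r - prev_r)
        apl += soft_prec * (r - prev_r)
        prev_r = r
    return ap, apl

# Example: three predictions against two ground truths.
print(let_3d_ap_and_apl([(0.9, True, 0.8), (0.7, False, 0.0), (0.5, True, 0.6)], num_gt=2))
```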
IV. CAMERA-PRIMARY 3D DETECTION DATASET

To evaluate the proposed LET metrics, we extend the existing Waymo Open Dataset [21]. Our contributions are two-fold: (1) we provide a dedicated test set of 80 segments with the LiDAR sensor data redacted, and (2) we significantly improve the camera/label synchronization throughout the dataset. Waymo conducted the Waymo Open Dataset 3D Camera-Only Detection Challenge (https://waymo.com/open/challenges/2022/3d-camera-only-detection) between April and May 2022, with LET-3D-APL being the primary metric for determining the winning submissions.

Our dedicated test set contains 80 segments in the same format as the original dataset release (20 s, 10 Hz, 5 cameras), for a total of 1,000 camera images per segment and 80,000 images overall. LiDAR sensor data was used during labeling to ensure high-quality 3D boxes, but was redacted before the release to ensure compliance with the challenge rule of basing predictions solely on cameras.

The original Waymo Open Dataset has a synchronization gap of [−6, 7] ms between camera and LiDAR sensor data [21]. Since the 3D object labels are based on LiDAR sensor data, this gap translates into a corresponding synchronization gap between the cameras and the LiDAR-based labels. Motion of the ego-vehicle and independent object motion can lead to misaligned labels with respect to the camera point of view.

To address this issue, we add a new field CAMERA_SYNCED_BOX, a variant of BOX, which is adjusted to eliminate the synchronization gap to the camera that perceives the object. The adjustment is based on solving an optimization problem that takes into account the camera rolling shutter, the motion of the autonomous vehicle, and the motion of the object of interest.

Fig. 5. CDFs of box shifts. We visualize the shift between BOX and CAMERA_SYNCED_BOX based on camera types (left) and select object types (right) on the validation set.

Figure 5 visualizes statistics of the box shift. The analysis suggests that, in practice, the shift highly depends on the camera type, with the side-facing cameras benefiting the most from the improved synchronization, as they are oriented orthogonally to the direction of motion and tend to observe close-by objects. In terms of object categories, the shift is the least pronounced for pedestrians and the most pronounced for vehicles. This is because pedestrians move more slowly, reducing the displacement caused by independent object motion, and are predominantly present in low-speed areas, reducing the displacement caused by ego-motion.

A. Comparison to Other Datasets

Table I shows a comparison of popular real-world AV datasets with an emphasis on camera data and synchronization properties. Two traditional datasets in particular, KITTI [24] and Cityscapes 3D [58], are restricted by the dataset size as well as a limited number of cameras, which span only a comparatively small field of view. Datasets such as Lyft [60] and nuScenes [25] offer multiple camera views but can experience a synchronization gap of roughly 5–10 ms, depending on the object location on the image plane. For nuScenes, it is also worth pointing out that the displacement between the laser and camera sensors is more significant than in other datasets, ranging from 0.56–0.93 m. This viewpoint discrepancy makes accurate camera visibility filtering more difficult.

Both Argoverse 2 Sensor [59] and the newly extended Waymo Open Dataset are large-scale datasets that offer good synchronization. A strength of the Argoverse 2 Sensor dataset is the availability of even more surround-view cameras, capturing a full 360 degrees. A strength of the Waymo Open Dataset lies in the synergies found in many additional label modalities, including camera-based bounding boxes, human keypoints, and video panoptic segmentation labels, as well as additional 3D label modalities.

TABLE I. Comparison of popular real-world AV datasets. We only count fully 3D-box-labeled frames and define a frame as the collection of all respective camera images at one instant. The original Waymo Open Dataset sync gap was [−6, 7] ms and is now effectively reduced to 0 ms. The sync gaps of the other datasets depend on the horizontal object position in the image plane and fall within the given range. *Cityscapes 3D is labeled on the stereo camera imagery directly, while our CAMERA_SYNCED_BOX field is the result of our optimization method, which undoes the label center sync gap considering both ego motion and independent object motion. **Some label types are only available for a subset of the data.

Dataset:             KITTI [24] | Cityscapes 3D [58] | Argoverse 2 [59] | Lyft [60] | nuScenes [25] | Extended WOD [21]
# Frames:            15K | 5K | 150K | 55K | 40K | 216K
# Cameras:           2 | 2 | 9 | 7 | 6 | 5
# Surround View:     1 | 1 | 7 | 6 | 6 | 5
Camera Shutter Type: Global | Rolling | Global | Global | Global | Rolling
Resolution [MP]:     0.7 | 2.1 | 3.1 | 1.3/2.1 | 1.4 | 2.5
Sync Gap [ms], Cam ↔ Label Center: [−12.5, 12.5] | 0* | [−1.39, 1.39] | [−9.7, 9.7] | [−4.9, 4.9] | 0*
Label Types**:       3D Bounding Box, 2D Camera Box, 3D Optical Flow | 3D Bounding Box, 2D Panoptic Seg | 3D Bounding Box, 3D Semantic Seg | 3D Bounding Box | 3D Bounding Box, 3D Semantic Seg, 2D Panoptic Seg | 3D Bounding Box, 2D Camera Box, 3D Human Keypoints, 2D Human Keypoints, 2D Video Panoptic Seg

V. EXPERIMENTAL RESULTS

In this section, we analyze the proposed LET metrics on both LiDAR-based and camera-only 3D detectors.

A. Evaluation Dataset and Baseline Methods

To verify the proposed LET metrics, we evaluate two LiDAR-based detectors and two camera-only 3D detectors trained on the Waymo Open Dataset [21] (WOD) and report metrics on the newly proposed test set.

SWFormer [61]: A state-of-the-art LiDAR-based 3D detector that builds upon the idea of the Swin Transformer [62] and operates on sparse voxels. It achieves 73.36 L2 vehicle mAPH on the original WOD test set.

PointPillars [3]: A well-established LiDAR-based 3D detector baseline that achieves 60.05 L2 vehicle mAPH on the original WOD test set.

BEVFormer [41]: A camera-only 3D detector that learns unified BEV representations with spatio-temporal transformers to tackle both 3D detection and segmentation tasks. The method achieves state-of-the-art performance on nuScenes [25] and WOD. We evaluate a specific variant of BEVFormer that won the 2022 Waymo Open Dataset 3D Camera-Only Detection Challenge.

MV-FCOS3D++ [30]: A camera-only 3D detector that adds 2D supervision and temporal information to a BEV-style detector and achieved 2nd place in the 2022 Waymo Open Dataset 3D Camera-Only Detection Challenge.
B. Comparisons of Longitudinal Tolerance Values

The main hyperparameter of the proposed metrics is the longitudinal tolerance percentage T_l^p. We show the performance of the LiDAR-based detectors (SWFormer, PointPillars) and the camera-only detectors (BEVFormer, MV-FCOS3D++) for different tolerance values in Figure 6 for the vehicle class.

As shown in Figure 6, a higher longitudinal error tolerance leads to higher LET-3D-AP and LET-3D-APL because more ground truth objects are matched with detections. Note that for the LiDAR-based detectors, LET-AP remains similar to the AP values for all longitudinal tolerance values, since the longitudinal errors of LiDAR-based detections are already small. The results also suggest that the proposed LET metrics do not introduce additional matching errors and can reliably evaluate LiDAR-based detectors as well.

To compute the LET metrics, a longitudinal tolerance value must be chosen, and the choice depends on the requirements of the downstream modules, e.g., tracking or behavior prediction, especially in an autonomous driving system. Users can also sweep the tolerance values to gain a better understanding of the error patterns.

Fig. 6. LET metrics with different longitudinal error tolerance values. We show LET-3D-AP and LET-3D-APL, as well as the original 3DAP, of the selected LiDAR-based detectors (SWFormer, PointPillars) and camera-based detectors (BEVFormer, MV-FCOS3D++) on the proposed new test set. Higher tolerances lead to higher scores since more detections are matched with ground truth objects. Surprisingly, despite BEVFormer having an absolute 30% lower 3DAP than PointPillars, it outperforms PointPillars on the LET-3D-AP(L) metrics past a 10% tolerance. This suggests that existing camera-based detectors have the potential to surpass LiDAR-based detectors, especially in downstream applications, provided that (minor) localization errors are tolerated. Also, LET-3D-APL provides a smoother transition across tolerance values and is thus more suitable for comparing camera-based 3D detectors.

C. LET-AP vs. AP

We compare the proposed LET-AP and LET-APL metrics with AP on the selected baselines and show the results in Figure 6. For the LiDAR-based detectors (SWFormer, PointPillars), LET-AP is mostly consistent with AP, which suggests that the matching results are similar in the presence of the longitudinal tolerance. For higher tolerance values, LET-3D-AP is slightly higher than 3DAP due to the small number of additional detections that are matched with ground truth objects under the larger error tolerance. Also, owing to the accuracy of LiDAR range estimates, the boxes are fairly accurate, resulting in a minimal difference between LET-AP and LET-APL, since the allowed longitudinal tolerance is not used by most matching pairs.

The LET-AP results of the camera-only detectors in Figure 6 exhibit a much larger gap to the AP baseline.
This shows that many detections can be properly matched to ground truth objects given a certain amount of longitudinal tolerance, making the resulting LET-AP(L) metrics better suited for evaluating these methods. Under a 10% tolerance, BEVFormer achieves higher metrics than PointPillars, indicating that BEVFormer may actually deliver better downstream performance, provided there is sufficient robustness to localization errors. Compared to the LiDAR-based detectors, the drop from LET-AP to LET-APL is larger, indicating larger longitudinal errors among the true positive examples.

D. Metrics of Different Breakdowns

We show detailed metrics on the proposed new test set in Table II. We report 3DAP, LET-3D-AP, LET-3D-APL, and the mean longitudinal affinity (mLA) for all class and range breakdowns, with a 10% longitudinal tolerance. Note that the 3DAP metrics are computed with IoU thresholds of 0.5, 0.3, and 0.3 for vehicles, pedestrians, and cyclists, respectively. As shown in Table II, LET-3D-AP values are consistently higher than 3DAP values, since more detections are matched with ground truth objects under the additional longitudinal tolerance. The difference is especially pronounced for small objects such as pedestrians, as well as for far-away objects. We also report the mean longitudinal affinity values to show the difference between LET-3D-AP and LET-3D-APL. We observe that methods with higher metrics not only have more valid detections but also better localization, as indicated by a higher mLA. We can also verify the assumption that depth errors are roughly proportional to range by checking the consistent mLA values across the range breakdowns. Our results suggest that, under the proposed metrics, some camera-only 3D detectors can already surpass the LiDAR-based detectors.

TABLE II. LET metrics of different breakdowns. The methods are evaluated on the new test set with 80 multi-camera sequences. We report 3DAP, LET-3D-AP, LET-3D-APL, and mean longitudinal affinity (mLA) for all class and range breakdowns, with a 10% longitudinal tolerance. For the LiDAR-based detectors, the proposed metrics show similar performance, since most of the longitudinal tolerance is not used for matching predictions and ground truths. For camera-based detectors, the results show that the proposed LET metrics are more suitable for evaluation, since predictions can be better matched with ground truths, especially for small objects such as pedestrians and for long-range detections, as indicated by the difference between 3DAP and LET-3D-AP(L).

Method / Metric:              Class: All | Vehicle | Pedestrian | Cyclist || Range (Vehicle): [0, 30) | [30, 50) | [50, ∞)
SWFormer     3DAP (%):        N/A | 89.0 | 86.7 | N/A || 96.0 | 89.5 | 78.7
SWFormer     LET-3D-AP (%):   N/A | 89.7 | 87.3 | N/A || 96.2 | 90.2 | 79.7
SWFormer     LET-3D-APL (%):  N/A | 86.3 | 85.4 | N/A || 91.6 | 87.9 | 78.1
SWFormer     mLA:             N/A | 0.962 | 0.978 | N/A || 0.952 | 0.975 | 0.980
PointPillars 3DAP (%):        N/A | 78.6 | 58.3 | N/A || 88.2 | 76.4 | 58.4
PointPillars LET-3D-AP (%):   N/A | 80.7 | 62.7 | N/A || 88.5 | 78.5 | 66.2
PointPillars LET-3D-APL (%):  N/A | 76.7 | 61.0 | N/A || 83.0 | 75.9 | 54.1
PointPillars mLA:             N/A | 0.950 | 0.973 | N/A || 0.938 | 0.967 | 0.817
BEVFormer    3DAP (%):        35.3 | 50.7 | 22.3 | 33.0 || 79.7 | 47.2 | 20.5
BEVFormer    LET-3D-AP (%):   70.7 | 82.9 | 71.1 | 58.1 || 89.7 | 80.6 | 50.1
BEVFormer    LET-3D-APL (%):  56.2 | 68.8 | 53.2 | 46.5 || 75.3 | 68.5 | 38.9
BEVFormer    mLA:             0.795 | 0.830 | 0.748 | 0.800 || 0.839 | 0.850 | 0.776
MV-FCOS3D++  3DAP (%):        30.3 | 45.6 | 16.7 | 28.5 || 79.2 | 37.5 | 13.0
MV-FCOS3D++  LET-3D-AP (%):   66.0 | 82.1 | 63.0 | 53.0 || 89.3 | 78.0 | 66.1
MV-FCOS3D++  LET-3D-APL (%):  51.1 | 66.9 | 45.0 | 41.3 || 74.5 | 63.6 | 52.0
MV-FCOS3D++  mLA:             0.774 | 0.815 | 0.714 | 0.779 || 0.834 | 0.815 | 0.787

VI. CONCLUSION

We proposed the LET-3D-AP(L) metrics for evaluating camera-only 3D detectors. The tolerance along the longitudinal axis allows detections to be associated with ground truth objects despite depth estimation errors. The results show that state-of-the-art camera-based detectors can outperform popular LiDAR-based detectors under our metrics, suggesting that existing camera-based methods have potential in real-world applications. We also construct a new test set for the Waymo Open Dataset, tailored to camera-only 3D detection methods. We hope the proposed metrics and dataset will help advance the field of camera-only 3D detection by providing a more meaningful indication of method performance.

REFERENCES

[1] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, "RSN: Range sparse net for efficient, accurate lidar 3D object detection," in CVPR, 2021.
[2] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "PV-RCNN: Point-voxel feature set abstraction for 3D object detection," in CVPR, 2020.
[3] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in CVPR, 2019.
[4] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "LaserNet: An efficient probabilistic 3D object detector for autonomous driving," in CVPR, 2019.
[5] C. R. Qi, O. Litany, K. He, and L. J. Guibas, "Deep Hough voting for 3D object detection in point clouds," in ICCV, 2019.
[6] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," in CVPR, 2018.
[7] V. A. Sindagi, Y. Zhou, and O. Tuzel, "MVX-Net: Multimodal VoxelNet for 3D object detection," in ICRA, 2019.
[8] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun, "Monocular 3D object detection for autonomous driving," in CVPR, 2016.
[9] G. Brazil and X. Liu, "M3D-RPN: Monocular 3D region proposal network for object detection," in ICCV, 2019.
[10] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, "3D bounding box estimation using deep learning and geometry," in CVPR, 2017.
[11] B. Xu and Z. Chen, "Multi-level fusion based 3D object detection from monocular images," in CVPR, 2018.
[12] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, "Disentangling monocular 3D object detection," in ICCV, 2019, pp. 1991–1999.
[13] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[14] T. Wang, X. Zhu, J. Pang, and D. Lin, "FCOS3D: Fully convolutional one-stage monocular 3D object detection," in ICCV, 2021.
[15] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, "Categorical depth distribution network for monocular 3D object detection," in CVPR, 2021.
[16] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving," in CVPR, 2019.
[17] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, "DETR3D: 3D object detection from multi-view images via 3D-to-2D queries," in CoRL, PMLR, 2022.
[18] T. Roddick, A. Kendall, and R. Cipolla, "Orthographic feature transform for monocular 3D object detection," arXiv preprint arXiv:1811.08188, 2018.
[19] D. Rukhovich, A. Vorontsova, and A. Konushin, "ImVoxelNet: Image to voxels projection for monocular and multi-view general-purpose 3D object detection," in WACV, 2022.
[20] J.-J. Hwang, H. Kretzschmar, J. Manela, S. Rafferty, N. Armstrong-Crews, T. Chen, and D. Anguelov, "CramNet: Camera-radar fusion with ray-constrained cross-attention for robust 3D object detection," in ECCV, 2022.
[21] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., "Scalability in perception for autonomous driving: Waymo Open Dataset," in CVPR, 2020.
[22] M. Everingham, S. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," IJCV, 2015.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[24] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[25] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," arXiv preprint arXiv:1903.11027, 2019.
[26] L. Liu, C. Wu, J. Lu, L. Xie, J. Zhou, and Q. Tian, "Reinforced axial refinement network for monocular 3D object detection," in ECCV, 2020.
[27] A. Simonelli, S. R. Bulo, L. Porzi, E. Ricci, and P. Kontschieder, "Towards generalization across depth for monocular 3D object detection," in ECCV, 2020.
[28] Y. Chen, L. Tai, K. Sun, and M. Li, "MonoPair: Monocular 3D object detection using pairwise spatial relationships," in CVPR, 2020.
[29] P. Li, H. Zhao, P. Liu, and F. Cao, "RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving," in ECCV, 2020.
[30] T. Wang, Q. Lian, C. Zhu, X. Zhu, and W. Zhang, "MV-FCOS3D++: Multi-view camera-only 4D object detection with pretrained monocular backbones," arXiv preprint arXiv:2207.12716, 2022.
[31] Y. Liu, T. Wang, X. Zhang, and J. Sun, "PETR: Position embedding transformation for multi-view 3D object detection," in ECCV, 2022.
[32] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu, "MonoDTR: Monocular 3D object detection with depth-aware transformer," in CVPR, 2022, pp. 4012–4021.
[33] R. Zhang, H. Qiu, T. Wang, X. Xu, Z. Guo, Y. Qiao, P. Gao, and H. Li, "MonoDETR: Depth-aware transformer for monocular 3D object detection," arXiv preprint arXiv:2203.13310, 2022.
[34] Y. Liu, J. Yan, F. Jia, S. Li, Q. Gao, T. Wang, X. Zhang, and J. Sun, "PETRv2: A unified framework for 3D perception from multi-camera images," arXiv preprint arXiv:2206.01256, 2022.
[35] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang, "Rethinking pseudo-LiDAR representation," in ECCV, 2020.
[36] Y. You, Y. Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-LiDAR++: Accurate depth for 3D object detection in autonomous driving," arXiv preprint arXiv:1906.06310, 2019.
[37] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, "Learning depth-guided convolutions for monocular 3D object detection," in CVPR Workshops, 2020.
[38] R. Qian, D. Garg, Y. Wang, Y. You, S. Belongie, B. Hariharan, M. Campbell, K. Q. Weinberger, and W.-L. Chao, "End-to-end pseudo-LiDAR for image-based 3D object detection," in CVPR, 2020, pp. 5881–5890.
[39] Y. Chen, S. Liu, X. Shen, and J. Jia, "DSGN: Deep stereo geometry network for 3D object detection," in CVPR, 2020, pp. 12536–12545.
[40] X. Guo, S. Shi, X. Wang, and H. Li, "LIGA-Stereo: Learning LiDAR geometry aware representations for stereo-based 3D detector," in ICCV, 2021, pp. 3153–3163.
[41] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, "BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," in ECCV, 2022.
[42] X. Weng and K. Kitani, "Monocular 3D object detection with pseudo-LiDAR point cloud," in ICCV Workshops, 2019.
[43] J. Huang, G. Huang, Z. Zhu, and D. Du, "BEVDet: High-performance multi-camera 3D object detection in bird-eye-view," arXiv preprint arXiv:2112.11790, 2021.
[44] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, "BEVDepth: Acquisition of reliable depth for multi-view 3D object detection," arXiv preprint arXiv:2206.10092, 2022.
[45] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, "M^2BEV: Multi-camera joint 3D detection and segmentation with unified birds-eye view representation," arXiv preprint arXiv:2204.05088, 2022.
[46] Y. Jiang, L. Zhang, Z. Miao, X. Zhu, J. Gao, W. Hu, and Y.-G. Jiang, "PolarFormer: Multi-camera 3D object detection with polar transformers," arXiv preprint arXiv:2206.15398, 2022.
[47] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han, "BEVFusion: Multi-task multi-sensor fusion with unified bird's-eye view representation," arXiv preprint arXiv:2205.13542, 2022.
[48] J. Huang and G. Huang, "BEVDet4D: Exploit temporal cues in multi-camera 3D object detection," arXiv preprint arXiv:2203.17054, 2022.
[49] Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, "BEVerse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving," arXiv preprint arXiv:2205.09743, 2022.
[50] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao, "Graph-DETR3D: Rethinking overlapping regions for multi-view 3D object detection," arXiv preprint arXiv:2204.11582, 2022.
[51] S. Chen, X. Wang, T. Cheng, Q. Zhang, C. Huang, and W. Liu, "Polar parametrization for vision-based surround-view 3D detection," arXiv preprint arXiv:2206.10965, 2022.
[52] W. Roh, G. Chang, S. Moon, G. Nam, C. Kim, Y. Kim, S. Kim, and J. Kim, "ORA3D: Overlap region aware multi-view 3D object detection," arXiv preprint arXiv:2207.00865, 2022.
[53] T. Wang, J. Pang, and D. Lin, "Monocular 3D object detection with depth from motion," in ECCV, 2022.
[54] J. Lu, Z. Zhou, X. Zhu, H. Xu, and L. Zhang, "Learning ego 3D representation as ray tracing," arXiv preprint arXiv:2206.04042, 2022.
[55] X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li, and W. Ouyang, "Delving into localization errors for monocular 3D object detection," in CVPR, 2021.
[56] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union," in CVPR, 2019.
[57] J. Xu, Y. Ma, S. He, and J. Zhu, "3D-GIoU: 3D generalized intersection over union for object detection in point cloud," Sensors, 2019.
[58] N. Gählert, N. Jourdan, M. Cordts, U. Franke, and J. Denzler, "Cityscapes 3D: Dataset and benchmark for 9 DoF vehicle detection," arXiv preprint arXiv:2006.07864, 2020.
[59] B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes et al., "Argoverse 2: Next generation datasets for self-driving perception and forecasting," in NeurIPS Datasets and Benchmarks Track (Round 2), 2021.
[60] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet, "Level 5 perception dataset 2020," https://level-5.global/level5/data/, 2019.
[61] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, "SWFormer: Sparse window transformer for 3D object detection in point clouds," in ECCV, 2022.
[62] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in CVPR, 2021.