Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking

Longlong Jing1, Ruichi Yu1, Henrik Kretzschmar1, Kang Li1, Charles R. Qi1, Hang Zhao1∗, Alper Ayvaci1, Xu Chen1, Dillon Cower1, Yingwei Li2, Yurong You3, Han Deng1, Congcong Li1, and Dragomir Anguelov1

1Waymo LLC, 2Johns Hopkins University, 3Cornell University

Abstract—Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. Approaches to monocular 3D perception, including detection and tracking, however, often yield inferior performance compared to LiDAR-based techniques. Through systematic analysis, we identified that per-object depth estimation accuracy is a major factor bounding the performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation. Our proposed fusion method achieves state-of-the-art per-object depth estimation performance on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that by simply replacing estimated depth with fusion-enhanced depth, we can achieve significant improvements in monocular 3D perception tasks, including detection and tracking.

Fig. 1. Illustration of the impact of object depth estimation for monocular 3D detection. The predicted box (Pred) and the ground truth (GT) box are visualized in a bird's-eye view (BEV). In the bottom-left image, we see that a depth discrepancy causes a localization error. In the bottom-right image, we see that our proposed Pseudo-LiDAR, RGB and Tracklet (PRT) fusion method can improve object depth estimation and thereby improve detection.

I. INTRODUCTION

Existing perception systems for autonomous vehicles mainly rely on expensive sensors such as LiDAR and radar [1], [2], [3], [4]. Owing to the low cost, low power consumption, and longer perception range of cameras, monocular image-based perception has been attracting great interest in recent years from both industry and the research community [5], [6], [7], [8], [9], [10], [11]. Such perception tasks tend to be challenging, however, and there is a large performance gap between monocular perception systems and LiDAR/radar-based systems [7], [9], [5], [12].

Common 3D monocular perception systems comprise two major modules: 3D object detection and 3D tracking1. The former requires learning the 3D location, box size, and rotation/orientation of an object, while the latter requires using appearance and motion cues to associate detections across frames. Across both tasks, it is not obvious which component of the system has the most crucial impact on performance.

To understand which component bounds the overall performance, we experimented with replacing each output of a state-of-the-art detection model with the ground truth and then evaluated the resulting changes in detection and tracking-by-detection performance. As shown in Fig. 2, among all the attributes, including rotation, size, depth, and amodal box center in the image, we find that only the per-object depth, i.e., the depth of the vehicle's 3D center, matters: performance improves significantly when per-object depth is perfect, but only marginally when the other signals are perfect. Based on this observation, we identified per-object depth estimation as a major bottleneck for monocular 3D detection and tracking-by-detection. We also conducted the same analysis with other state-of-the-art detectors, such as RTM3D [13] with the AB3D tracker [10], and the results suggest that depth being the key factor for improving monocular 3D detection and tracking is a general conclusion across models. In this paper, we focus on improving per-object depth estimation and demonstrate that by enhancing object depth alone we can significantly improve detection and tracking performance.
A major challenge in estimating object depth from a monocular image is to obtain a representation that encodes the transition from 2D information to 3D depth. Recent efforts (e.g., on 3D monocular detection) have mainly focused on either directly learning from the raw RGB image [9], [5], [16] or leveraging a pseudo-LiDAR representation lifted from a predicted dense depth map [7], [8], [17]. Intuitively, we believe that these two representations might be complementary for estimating per-object depth, and that learning from either one alone might be sub-optimal: the RGB image encodes the appearance, texture, and 2D geometry of an object explicitly, but contains no direct 3D information, and it is difficult to learn how to map RGB features to depth precisely without overfitting to irrelevant information. On the other hand, the pseudo-LiDAR representation directly models the 3D structure of an object via an estimated dense depth map, which makes it straightforward to learn per-object depth; however, the estimated dense depth map is often noisy (usually with at least 8% average relative error [6], [18], [19]). Inspired by previous methods that fuse different representations, such as RGB image features and optical flow for action recognition [20], we believe that fusing the complementary signals encoded in the two representations may help per-object depth estimation.

Furthermore, depth estimation from monocular images is fundamentally ill-posed, as a single 2D view of a scene can be explained by many plausible 3D scenes [21]. However, observing an object over time allows us to model the underlying temporal and motion consistency of the object, which can provide contextual information to better localize the object in 3D. Similar ideas have been explored in other tasks such as 2D video-based object detection [22], [23], [24].

Based on our intuitions and the analysis above, we propose a multi-level fusion framework that consists of two major components. The first component is the pseudo-LiDAR and RGB fusion (PR-Fusion), which enhances depth estimation from two complementary representations. The other component is the tracklet fusion (T-Fusion), which leverages temporal and motion consistency with compensated ego-motion. Our full model, Pseudo-LiDAR-RGB-Tracklet (PRT) fusion, fuses information from both 2D and 3D representations across multiple frames.

We conducted extensive experiments on the Waymo Open Dataset [25], the KITTI detection dataset [26], and the KITTI MOT dataset [26] to demonstrate the effectiveness of our method. We obtained state-of-the-art performance on the per-object depth estimation benchmark proposed by [27]. To further demonstrate the practical value of the enhanced per-object depth, we have also achieved significant improvements with 3D object detection and tracking models by simply replacing the per-object depth with our enhanced estimation.

Our contributions can be summarized as follows: 1) We conducted a systematic analysis identifying per-object depth estimation as a major performance bottleneck of current 3D monocular detection and tracking-by-detection methods. 2) We proposed a novel method that fuses pseudo-LiDAR and RGB information across the temporal domain to significantly enhance per-object depth estimation. 3) We demonstrated that with the enhanced depth, the performance of monocular 3D detection and tracking can be significantly improved.

∗Work done while at Waymo LLC.
1In this paper we follow the tracking-by-detection paradigm.

Fig. 2. Headroom analysis of the impact of each component of the monocular 3D detection and tracking system, with detection performance (AP, top bars) and tracking performance (MOTA, bottom bars). The experiment is done with the state-of-the-art monocular 3D detector CenterNet [14] (ranked 1st on the nuScenes dataset [15]) and the AB3D tracker [10]. We use "+GT" to indicate that we replaced the prediction with the ground truth. The analysis suggests that depth has the most significant impact on both detection and tracking performance.

II. RELATED WORK

Monocular 3D Object Detection and Tracking: Monocular 3D object detection models aim at directly regressing attributes such as rotation, size, and depth from RGB images [5], [9], [14], [16], [28], [29], [30] or pseudo-LiDAR [7], [17], [31], [8]. Recently, Wang et al. pointed out that image features are not suitable for the task; instead, they proposed converting image-based depth maps into pseudo-LiDAR representations to mimic LiDAR and obtained significantly better performance than previous image-based methods [7]. A few works attempted to leverage both the RGB image and the estimated depth map for monocular 3D object detection [32], [33]; none of them specifically focused on improving per-object depth estimation based on different features from a tracklet. Most existing tracking methods follow the tracking-by-detection scheme [14], [10], [34], [35], and the quality of the monocular 3D detections is the bottleneck for tracking performance. Among all of the output dimensions of the above tasks, we identified that per-object depth is the bottleneck, and we demonstrated that the performance of monocular 3D detection and tracking can be significantly improved with enhanced object depth.

Depth/Distance Estimation: Monocular image-based dense depth estimation has been studied for years [6], [36], [19], [18], [37], [38], [39], [40]. Different from the existing methods, our method focuses on per-object depth estimation, which naturally enables the novel tracklet-based fusion that fuses different types of features across multiple frames. Although the two tasks share some high-level concepts (depth estimation), it is non-trivial to adapt per-pixel dense depth estimation to a per-object task, and this has not been explored in the dense depth estimation community. On the other hand, per-object depth/distance estimation has started to draw attention recently. Following the same experimental setting as the most recent state of the art [27], our novel framework, which leverages both pseudo-LiDAR and RGB representations across multiple frames for per-object depth estimation, demonstrated superior performance.

Representation and Temporal Fusion: Many existing methods are based on representation fusion [41], [42], [43], [44], [45], [46] and temporal fusion [5], [6], [7], [8], [9], [10], [11]. Two-stream fusion [20], [47] for action recognition is a classic multi-modal approach that fuses RGB image features and optical flow [48]. In the temporal domain, many methods have been explored to improve sequence-based tasks such as video object detection and video segmentation [9], [22], [23], [24]. These methods inspired us that fusing different representations (e.g., pseudo-LiDAR and RGB) across multiple frames might be beneficial for improving per-object depth estimation.

Fig. 3. The overall framework of our proposed monocular per-object depth estimation method. The framework includes three stages: (1) per-frame 2D object detection to detect objects, (2) 2D tracking to associate objects of the same vehicle across the temporal domain, and (3) the proposed PRT-Fusion over the tracklets generated by the tracking method, with fusion from both the temporal domain and different representations (RGB and pseudo-LiDAR).

III. IMPROVING PER-OBJECT DEPTH ESTIMATION VIA MULTI-LEVEL FUSION

The overview of our proposed multi-level fusion framework for per-object depth estimation is shown in Fig. 3. We first conduct 2D object detection and track detections across frames to construct a tracklet for each object. We then construct pseudo-LiDAR representations of the objects across frames and RGB image features for the current frame. Ego-motion compensation is applied to all pseudo-LiDAR patches within each tracklet to transform them into the same coordinate system. Finally, the RGB image features for the current frame and the temporally fused pseudo-LiDAR features are fused to produce per-object depth.

A. Pseudo-LiDAR and RGB (PR) Fusion

For any RGB image I, a dense depth estimation network F_d predicts a depth map

d = F_d(I), (1)

where d has the same size as I and its value at location (u,v) is the depth of the corresponding pixel in the image. Each pixel of the depth map is then lifted into a point cloud using the camera model:

z = d(u,v),
x = (u − C_x) × z / f_x,   (2)
y = (v − C_y) × z / f_y,

where (f_x, f_y) are the horizontal and vertical focal lengths of the camera and (C_x, C_y) is the pixel location of the camera center [8], [7]. After this transformation, each pixel of the dense depth map d becomes three channels representing the absolute 3D location of the corresponding pixel in camera coordinates. Given the pseudo-LiDAR representation of image I, the pseudo-LiDAR patch P_t for object b_t at timestamp t can be cropped based on the 2D bounding box: P_t is the collection of pseudo-LiDAR points that fall within the box b_t.
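The lifting and cropping steps above can be sketched as follows. This is a minimal NumPy sketch; the function names and the (u_min, v_min, u_max, v_max) box format are our own assumptions, not part of any released code.

```python
import numpy as np

def lift_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Lift a dense depth map of shape (H, W) into an (H*W, 3) pseudo-LiDAR
    point cloud in camera coordinates, following z = d(u, v),
    x = (u - C_x) * z / f_x, y = (v - C_y) * z / f_y."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v: row (vertical) index, u: column index
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def crop_object_patch(depth, box, fx, fy, cx, cy):
    """Crop the pseudo-LiDAR points whose pixels fall inside the hypothetical
    2D bounding box (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box
    patch = depth[v_min:v_max, u_min:u_max]
    # Shift the principal point so local crop indices behave like
    # the original image coordinates.
    return lift_to_pseudo_lidar(patch, fx, fy, cx - u_min, cy - v_min)
```

Shifting the principal point by the crop offset keeps the lifted coordinates identical to those obtained by lifting the full image and then selecting the in-box points.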
The pseudo-LiDAR-based feature PL of object b_t can then be extracted from the patch P_t using a feature encoder F_p as

PL = F_p(P_t), (3)

where PL represents the pseudo-LiDAR representation of the object within the bounding box b_t. The extraction of the pseudo-LiDAR representation thus consists of three steps: (1) dense depth estimation for each camera image, (2) lifting the predicted dense depth into pseudo-LiDAR, and (3) pseudo-LiDAR feature extraction with a neural network.

Inspired by the two-stream fusion method for action recognition presented in [20], we propose PR-Fusion to leverage the complementary information encoded in the RGB and pseudo-LiDAR representations. Given an RGB image I of size H × W, compact features for the entire image can be extracted with a pre-trained convolutional neural network F_RGB. For any object with a 2D bounding box b, the RGB image features R of the bounding box are extracted with a pooling operation:

R = Pool(F_RGB(I), b). (4)

Finally, the PR-Fusion can be represented as

PR = G_PR(PL, R), (5)

where G_PR is a deep neural network that fuses the two features and PR is the fused feature.

B. Tracklet Fusion with Ego-motion Compensation

Predicting per-object depth directly from a single frame is challenging because a single object in a camera image can be explained by multiple plausible objects at different depths [21]. Inspired by temporal fusion methods for video-based tasks, we propose to fuse the object-level information across multiple frames to enforce temporal and motion consistency of the prediction.
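As a toy illustration of the two-stream design, the fusion of Eq. (5) can be sketched as below. The PointNet-style point encoder, the feature dimensions, and the MLP fusion head are our own illustrative assumptions (with random, untrained weights); the actual encoders in this work are CenterNet- and PatchNet-based.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical encoder F_p: a shared per-point MLP followed by max pooling,
# mapping an (N, 3) pseudo-LiDAR patch to a fixed-size feature vector.
W1, W2 = rng.normal(size=(3, 64)), rng.normal(size=(64, 128))

def encode_points(points):
    per_point = relu(relu(points @ W1) @ W2)   # (N, 128) per-point features
    return per_point.max(axis=0)               # (128,) order-invariant pooling

# Hypothetical fusion head G_PR: an MLP over the concatenation of the
# pseudo-LiDAR feature PL and the RGB feature R, predicting a scalar depth.
Wf1, Wf2 = rng.normal(size=(128 + 128, 64)), rng.normal(size=(64, 1))

def pr_fusion(pl_feat, rgb_feat):
    fused = relu(np.concatenate([pl_feat, rgb_feat]) @ Wf1)
    return float((fused @ Wf2)[0])             # per-object depth estimate

points = rng.normal(size=(100, 3)) + [0.0, 0.0, 20.0]  # patch roughly 20 m away
rgb_feat = rng.normal(size=128)                        # stand-in for Pool(F_RGB(I), b)
depth = pr_fusion(encode_points(points), rgb_feat)
```

Because of the max pooling, the point encoder is invariant to the ordering of the pseudo-LiDAR points within the patch.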
Given the 2D detection results, we first conduct 2D data association [34], [10] to construct tracklets for objects and then fuse the features of each tracklet within a temporal window.

A straightforward method is to directly fuse image features across frames, similar to [24]; however, we find that directly fusing RGB features from different frames can be suboptimal, because RGB features couple the dynamic motion of the camera and the motion of the objects together, which makes it hard to learn motion and temporal consistency from a 2D image sequence. We believe that to perform effective temporal fusion for depth estimation, the camera motion must be compensated so that the features from different frames are in the same coordinate system. Fortunately, the ego-motion of the camera can be easily compensated in 3D space with the pseudo-LiDAR representations. Thus, we propose a T-Fusion method with ego-motion compensation based on pseudo-LiDAR representations.

The input to our proposed T-Fusion comprises the pseudo-LiDAR patches of each object in different frames, P_t, P_{t−1}, ..., P_{t−n}, where P_t is in the 3D camera coordinate system of frame t. The ego-motion is represented using a 4 × 4 homogeneous matrix H based on the conventional six degrees of freedom: translation [γ_x, γ_y, γ_z] in meters and rotation [ρ_x, ρ_y, ρ_z] in radians. First, all pseudo-LiDAR patches from different frames are projected into the global coordinate system using the camera-coordinate-to-global-coordinate transformation matrix H. For the pseudo-LiDAR patch P_{t−j} at any timestamp, let H_{t−j} be its camera-coordinate-to-global-coordinate transformation matrix; the transformation is

P'_{t−j} = H_t^{−1} ∗ H_{t−j} ∗ P_{t−j}. (6)

After this coordinate transformation, the ego-motion of the self-driving car is compensated for, and the transformed P'_{t−j} is in the same coordinate system as P_t. The same transformation is applied to the pseudo-LiDAR patches from all timestamps to eliminate the impact of ego-motion on the locations of the pseudo-LiDAR points of each object.

Given any feature encoder F_p() for pseudo-LiDAR, the features for the different timestamps can be extracted as F_p(P'_t), F_p(P'_{t−1}), ..., F_p(P'_{t−n}), where the prime indicates that the pseudo-LiDAR patch has been ego-motion compensated. The fused features for a sequence of an object can then be modeled by a neural network encoder G_TF as

PL_{t−n→t} = G_TF(F_p(P'_t), F_p(P'_{t−1}), ..., F_p(P'_{t−n})), (7)

where PL_{t−n→t} is the fused tracklet feature from frame t−n to t.

C. Multi-level PRT-Fusion

PR-Fusion and T-Fusion aggregate features from two different domains, and it is natural to combine the two fusion methods for further performance improvements. Given a sequence of object boxes across time, b_t, b_{t−1}, ..., b_{t−n}, the RGB image features of each object b_i can be extracted using an image feature encoder F_RGB(), and its pseudo-LiDAR features can be extracted using the encoder F_p(). There are two steps in PRT-Fusion: given the object in the current frame and its previous frames, we first conduct T-Fusion with ego-motion compensation on the pseudo-LiDAR representations across multiple frames; we then fuse the result with the RGB features at the current frame t as

PRT = G_PR(PL_{t−n→t}, R_t), (8)

where PRT is the fused feature from frame t−n to t.

D. Implementation Details

RGB Feature Extraction. CenterNet and CenterTrack [9], [14] have recently achieved state-of-the-art performance on the monocular 3D detection task on the nuScenes dataset [15]. We followed their formulation and network architecture, with ResNet50 [49] as the backbone, to perform 2D detection.

Pseudo-LiDAR Feature Extraction. Recently, PatchNet [8] was proposed to significantly improve pseudo-LiDAR-based detection performance. We choose it as our backbone model to extract pseudo-LiDAR-based features, serving as both the baseline and the input to our method.

2D Tracking. To track 2D detections and form tracklets, we followed [34] and use a Kalman-filter-based tracker. It is worth noting that, since our paper mainly focuses on per-object depth estimation, we believe the performance of our fusion method can be further improved with more sophisticated tracking methods [50], [51], [52].

IV. EXPERIMENTS

In this section, we first benchmark our per-object depth estimation against prior work in Sec. IV-A. Next, we ablate the design choices in our depth estimation model in Sec. IV-B. Finally, we show applications of the improved per-object depth to 3D monocular detection and tracking in Sec. IV-C.

Datasets. We evaluate on multiple datasets with a focus on the vehicle class. Among them, the Waymo Open Dataset is a large-scale dataset for autonomous driving. It consists of 798 training sequences and 202 validation sequences, and each sequence contains around 200 frames. The KITTI Detection Dataset has 3,712 RGB images for training and 3,768 images for testing; we use the split from the prior work [27] on per-object depth estimation for a fair comparison. The KITTI MOT Dataset consists of 8,008 and 11,095 frames in the official training and testing splits. Since it provides sequence information for each frame, we use this dataset to demonstrate the effectiveness of the PRT fusion.

A. Benchmarking Per-object Depth Estimation

Metrics. Following the existing state-of-the-art per-object depth estimation benchmark proposed by [27], five standard metrics are used for evaluation: average relative error (Abs Rel), squared relative error (Sq Rel), root-mean-square error (RMSE), average log10 error (RMSE_log), and threshold accuracy (δ_i).

Results. Since our paper mainly focuses on per-object depth estimation, we compare the performance of our proposed fusion methods against the state-of-the-art models.
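For reference, these five metrics can be computed from arrays of predicted and ground-truth per-object depths roughly as follows. This is a sketch under our own assumptions; in particular, whether RMSE_log uses log10 or the natural logarithm should be checked against the benchmark code of [27] (log10 is assumed here).

```python
import numpy as np

def depth_metrics(pred, gt, delta_thresh=1.25):
    """Standard per-object depth estimation metrics.

    pred, gt: 1-D arrays of predicted / ground-truth object depths in meters.
    Returns (abs_rel, sq_rel, rmse, rmse_log, delta).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)        # Abs Rel
    sq_rel = np.mean((pred - gt) ** 2 / gt)          # Sq Rel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))        # RMSE
    rmse_log = np.sqrt(np.mean(                      # RMSE_log (log10 assumed)
        (np.log10(pred) - np.log10(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta = np.mean(ratio < delta_thresh)            # threshold accuracy δ < 1.25
    return abs_rel, sq_rel, rmse, rmse_log, delta
```

A perfect prediction yields zero for the four error metrics and 1.0 for δ.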
We compare with two types of methods: geometry-based methods, which predict the depth from the geometry of the boxes, including SVR [53], IPM [55], and DistNet [54]; and deep-feature-based methods, including the method proposed in [27] and our implementations of the state-of-the-art monocular 3D detection methods CenterNet [9] and PatchNet [8] (we simply use their backbones and change the final prediction from a 3D box to per-object depth only). To conduct a fair comparison with the existing benchmark in [27], we follow its experimental setting and use the ground-truth 2D boxes (to filter out the impact of 2D object detection) on all datasets. We conduct data association with the Kalman-filter-based tracker of [34]. The comparisons on the Waymo Open Dataset, the KITTI Detection Dataset, and the KITTI MOT Dataset are shown in Table I, Table II, and Table III.

TABLE I
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE WAYMO OPEN DATASET FOR VEHICLES FOLLOWING THE SETTING IN [27].

Method         δ<1.25↑   Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
SVR [53]       83.26%    14.79%     1.3254    6.9081   0.2282
DistNet [54]   88.50%    11.23%     0.8974    6.3903   0.1737
PatchNet [8]   92.81%    8.77%      0.6051    5.5485   0.1283
CenterNet [9]  95.47%    7.25%      0.6240    4.7506   0.1146
Ours (T)       96.45%    6.96%      0.4214    4.5941   0.1001
Ours (PR)      97.64%    5.74%      0.3188    3.8788   0.0863
Ours (PRT)     98.09%    5.47%      0.2858    3.7282   0.0802

TABLE II
PERFORMANCE COMPARISON WITH STATE-OF-THE-ARTS ON THE KITTI DETECTION DATASET FOR VEHICLES FOLLOWING THE SETTING IN [27].

Method           δ<1.25↑   Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
SVR [53]         34.50%    149.4%     47.748    18.970   1.4940
IPM [55]         70.10%    49.70%     1290.5    237.62   0.4510
Zhu et al. [27]  84.60%    15.00%     0.6180    3.9460   0.2040
DistNet [54]     93.26%    12.39%     0.4834    2.9539   0.2003
CenterNet [9]    95.33%    8.70%      0.4250    3.2433   0.1436
PatchNet [8]     95.52%    8.08%      0.2789    2.9048   0.1296
Ours (PR)        97.60%    6.89%      0.2340    2.5025   0.1181

TABLE III
PERFORMANCE COMPARISON WITH STATE-OF-THE-ARTS ON THE KITTI MOT DATASET FOR VEHICLES FOLLOWING THE SETTING IN [27].

Method         δ<1.25↑   Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
CenterNet [9]  92.17%    9.10%      0.9372    6.9596   0.1975
PatchNet [8]   93.41%    8.65%      0.4988    4.9081   0.1268
Ours (T)       93.43%    7.90%      0.3706    4.0703   0.1157
Ours (PR)      94.39%    7.69%      0.4430    4.8065   0.1205
Ours (PRT)     95.23%    7.13%      0.3382    3.9391   0.1076

We first demonstrate the effectiveness of the PR-Fusion on both the Waymo Open Dataset and the KITTI Detection Dataset: with access to two different representations, our method with PR-Fusion (Ours (PR)) significantly enhances the performance compared to either baseline model individually, which suggests that the two types of representations are indeed complementary and that fusing them yields the best performance. Regarding the T-Fusion with ego-motion compensation and the PRT-Fusion on the Waymo Open Dataset and the KITTI MOT Dataset, our proposed method (Ours (T)) achieved significantly better performance than the baseline method PatchNet [8]. Finally, when leveraging the PR-Fusion and T-Fusion together as PRT-Fusion, the performance is further improved (see Ours (PRT)). In summary, our proposed fusion methods show significant improvements over the baseline models and outperform the state-of-the-art methods on both the Waymo Open Dataset and the KITTI datasets.

B. Ablation Study for Per-object Depth Estimation

To better demonstrate and understand the effectiveness of each module of our proposed method, we conducted thorough ablation studies with different fusion strategies and different 2D box and tracking qualities. For the first three ablation studies, we report performance with predicted association to better understand how our method works in practice; in the fourth study, we conduct a headroom analysis to understand how our method would work given perfect association.

TABLE IV
THE PERFORMANCE COMPARISON OF DIFFERENT FUSION STRATEGIES ON THE WAYMO OPEN DATASET FOR VEHICLES.

Ablation Study with Predicted Association
Method                     Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
RGB_t                      7.25%      0.6240    4.7506   0.1146
T-Fusion: RGB_{t−1→t}      7.58%      0.6773    4.8902   0.1172
T-Fusion: RGB_{t−3→t}      7.60%      0.6748    4.9320   0.1179
PL_t                       8.77%      0.6051    5.5485   0.1283
T-Fusion: PL_{t−1→t}       6.96%      0.4214    4.5941   0.1001
T-Fusion: PL_{t−3→t}       7.16%      0.4389    4.6712   0.1022
PRT: RGB_t + PL_t          5.74%      0.3188    3.8788   0.0863
PRT: RGB_t + PL_{t−1→t}    5.47%      0.2858    3.7282   0.0802
PRT: RGB_t + PL_{t−3→t}    5.52%      0.2932    3.7661   0.0807

Ablation Study with Groundtruth Association
PRT: RGB_t + PL_t          5.74%      0.3188    3.8788   0.0863
PRT: RGB_t + PL_{t−1→t}    5.34%      0.2713    3.6278   0.0791
PRT: RGB_t + PL_{t−3→t}    5.29%      0.2668    3.5821   0.0783

1. Does tracklet fusion work for RGB features? We simply fused the features extracted from the RGB images of a tracklet (similar to our proposed T-Fusion, but without ego-motion compensation). The results are shown in the first group of Table IV, and no clear improvement is observed. One possible explanation is that the 3D information encoded in RGB images at different timestamps lies in different coordinate systems when the self-driving car is moving. It is non-trivial to decompose the camera ego-motion and the object motion, which makes it hard to learn motion and temporal consistency by simply fusing the image features.

2. How does tracklet fusion improve pseudo-LiDAR-based depth estimation? In the second group of Table IV, we show the performance of pseudo-LiDAR-based depth estimation with our proposed ego-motion-compensation-based T-Fusion. It is clear that the depth estimation performance improves significantly with T-Fusion, even with information from just one more frame. Adding more frames is helpful, but the additional improvement is marginal.

3. Does PRT fusion help? Given the improvements from PR-Fusion and T-Fusion, it is natural to ask whether their combination is helpful. The third group of Table IV shows the performance of PRT-Fusion with different numbers of frames. It is clear that the combination of both (PR-Fusion and T-Fusion) outperforms each one individually. However, due to noise introduced by data association, a longer tracklet does not necessarily yield better performance.

TABLE V
THE RESULTS OF MONOCULAR 3D DETECTION ON THE WAYMO OPEN DATASET FOR VEHICLES WITH DIFFERENT IOU THRESHOLDS (0.5–0.7).

                    AP3D                      APBEV
                    0.5      0.6      0.7     0.5      0.6      0.7
CenterNet [9]       24.32    14.36    6.06    28.20    18.71    11.52
+ Enhanced Depth    27.06    16.89    7.99    29.53    24.16    14.33
                    (+2.74)  (+2.53)  (+1.93) (+1.10)  (+5.45)  (+2.81)

TABLE VI
TRACKING RESULTS ON THE WAYMO OPEN DATASET FOR VEHICLES WITH THE METRICS IN [10].

Method                     SMOTA↑   AMOTA↑   AMOTP↑   MOTA↑   IDS↓
CenterNet [9] + AB3D [10]  50.19    14.06    27.00    39.37   228
+ Enhanced Depth           52.49    15.24    28.50    41.24   130
                           (+2.30)  (+1.18)  (+1.50)  (+1.87) (−98)

Fig. 4. Qualitative examples of per-object depth estimation and monocular 3D object detection. The green, red, and blue bounding boxes correspond to the ground truth (GT), the baseline depth estimation and detection (BL), and the result with the enhanced per-object depth from our proposed PRT-Fusion. Significantly better depth estimation and the resulting improvements in detection can be observed.

4. How is the performance affected by association noise?
We further study the impact of the association quality on our proposed method. The fourth group of Table IV shows the results with ground-truth association. We observe that with perfect association, the improvement is consistent for longer tracklets. This indicates that improving data association quality is a promising direction for fully leveraging the capability of the proposed fusion method.

C. Improving Monocular 3D Detection and Tracking with Enhanced Per-object Depth

In this subsection, we apply our per-object depth estimation to show that it can further improve the state-of-the-art monocular image-based 3D detector CenterNet [14] and the AB3D tracker [10] on the Waymo Open Dataset.

Quantitative results. For 3D object detection, we trained a CenterNet for monocular 3D detection and replaced only the depth of the detection results, while the other outputs, such as box size and rotation, were retained. For tracking, we follow the tracking-by-detection scheme and run the AB3D tracker [10] on the detection boxes with the enhanced per-object depth. The detection and tracking results are shown in Table V and Table VI. As expected, we observe that simply enhancing the depth yields significant improvements on both tasks. It is worth noting that the performance could be further improved by tuning the model specifically for detection and tracking, but that is out of the scope of this paper, since we focus on demonstrating the improvements that come purely from depth.

Qualitative results. We further visualize the predictions of the baseline detection model and our proposed PRT-Fusion to illustrate how the enhanced per-object depth improves monocular 3D detection and tracking. In Fig. 4, the first row shows the per-object depth improvements brought by our method, and the second row illustrates the bird's-eye view of the 3D detection results of both the baseline detector and the detector with our improved depth (only the depth is replaced). Fig. 5 shows that the tracking model makes fewer ID-switch errors with the depth predicted by our model. Clear improvements can be observed, and the improved depth is the key factor leading to the significant improvements in monocular 3D detection and tracking.

Fig. 5. Qualitative examples of monocular 3D tracking results. Due to the inaccurate depth estimation shown in (a), the 3D tracker wrongly associates detections across frames, which leads to ID switches. With the enhanced depth predicted by our proposed fusion model in (b), the tracker associates the detections correctly.

V. CONCLUSION

We demonstrated that per-object depth estimation is the performance bottleneck of monocular image-based 3D perception tasks, including detection and tracking. We proposed a multi-level fusion framework that fuses features from different representations across multiple frames. We first obtained state-of-the-art performance on per-object depth estimation, and then showed that by simply replacing the depth, significant improvements can be observed in the tasks above. This not only validates our findings and the effectiveness of the proposed method, but also indicates that improving per-object depth is a promising direction for enhancing detection and tracking. Future work can include end-to-end training of the proposed method.

VI. ACKNOWLEDGEMENT

We would like to thank Jiyang Gao for the helpful discussions about this work.

REFERENCES

[1] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[2] X. Liu, C. R. Qi, and L. J. Guibas, "Flownet3d: Learning scene flow in 3d point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 529–537.
[3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "Semantickitti: A dataset for semantic scene understanding of lidar sequences," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9297–9307.
[4] B. Yang, R. Guo, M. Liang, S. Casas, and R. Urtasun, "Radarnet: Exploiting radar for robust perception of dynamic objects," arXiv preprint arXiv:2007.14366, 2020.
[5] Y. Chen, L. Tai, K. Sun, and M. Li, "Monopair: Monocular 3d object detection using pairwise spatial relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12093–12102.
[6] I. Alhashim and P. Wonka, "High quality monocular depth estimation via transfer learning," arXiv preprint arXiv:1812.11941, 2018.
[7] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453.
[8] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang, "Rethinking pseudo-lidar representation," arXiv preprint arXiv:2008.04582, 2020.
[9] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[10] X. Weng, J. Wang, D. Held, and K. Kitani, "3d multi-object tracking: A baseline and new evaluation metrics," arXiv preprint arXiv:1907.03961, 2020.
[11] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Krahenbuhl, T. Darrell, and F. Yu, "Joint monocular 3d vehicle detection and tracking," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5390–5399.
[12] G. Wang, B. Tian, Y. Ai, T. Xu, L. Chen, and D. Cao, "Centernet3d: An anchor free object detector for autonomous driving," arXiv preprint arXiv:2007.07214, 2020.
[13] P. Li, H. Zhao, P. Liu, and F. Cao, "Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving," in ECCV, 2020.
[14] X. Zhou, V. Koltun, and P. Krähenbühl, "Tracking objects as points," in ECCV, 2020.
[15] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[16] Z. Liu, Z. Wu, and R. Tóth, "Smoke: Single-stage monocular 3d object detection via keypoint estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 996–997.
[17] J. M. U. Vianney, S. Aich, and B. Liu, "Refinedmpl: Refined monocular pseudolidar for 3d object detection in autonomous driving," arXiv preprint arXiv:1911.09712, 2019.
[22] Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13075–13085.
[23] M. Liu, M. Zhu, M. White, Y. Li, and D. Kalenichenko, "Looking fast and slow: Memory-guided mobile video object detection," arXiv preprint arXiv:1903.10172, 2019.
[24] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, "Flow-guided feature aggregation for video object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 408–417.
[25] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., "Scalability in perception for autonomous driving: Waymo open dataset," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[26] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[27] J. Zhu and Y. Fang, "Learning object-specific distance from a monocular image," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3839–3848.
[28] G. Brazil and X. Liu, "M3d-rpn: Monocular 3d region proposal network for object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9287–9296.
[29] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, "Disentangling monocular 3d object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1991–1999.
[30] E. Jörgensen, C. Zach, and F. Kahl, "Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss," arXiv preprint arXiv:1906.08070, 2019.
[31] X. Weng and K. Kitani, "Monocular 3d object detection with pseudo-lidar point cloud," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[32] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, "Learning depth-guided convolutions for monocular 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 1000–1001.
[33] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan, "Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6851–6860.
[34] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 3464–3468.
[35] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, "Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6499–6508.
[36] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[37] V. Guizilini, R. Hou, J. Li, R. Ambrus, and A. Gaidon, "Semantically-guided representation learning for self-supervised monocular depth," arXiv preprint arXiv:2002.12319, 2020.
[38] X. Wang, W. Yin, T. Kong, Y. Jiang, L. Li, and C. Shen, "Task-aware monocular depth estimation for 3d object detection," in AAAI, 2020, pp. 12257–12264.
[39] M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia, "On the uncertainty
ofself-supervisedmonoculardepthestimation,”inProceedingsofthe [18] C.Godard,O.MacAodha,andG.J.Brostow,“Unsupervisedmonocular IEEE/CVFConferenceonComputerVisionandPatternRecognition, depth estimation with left-right consistency,” in Proceedings of the 2020,pp.3227–3237. IEEEConferenceonComputerVisionandPatternRecognition,2017, [40] H.Fu,M.Gong,C.Wang,K.Batmanghelich,andD.Tao,“Deepordinal pp.270–279. regressionnetworkformonoculardepthestimation,”inProceedings [19] C.Godard,O.MacAodha,M.Firman,andG.J.Brostow,“Digging oftheIEEEConferenceonComputerVisionandPatternRecognition, intoself-supervisedmonoculardepthestimation,”inProceedingsof 2018,pp.2002–2011. theIEEEinternationalconferenceoncomputervision,2019,pp.3828– [41] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting: 3838. Sequential fusion for 3d object detection,” in Proceedings of the [20] K.SimonyanandA.Zisserman,“Two-streamconvolutionalnetworks IEEE/CVFConferenceonComputerVisionandPatternRecognition, foractionrecognitioninvideos,”inAdvancesinneuralinformation 2020,pp.4604–4612. processingsystems,2014,pp.568–576. [42] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task [21] C.R.Qi,W.Liu,C.Wu,H.Su,andL.J.Guibas,“Frustumpointnets multi-sensor fusion for 3d object detection,” in Proceedings of the for3dobjectdetectionfromrgb-ddata,”inProceedingsoftheIEEE IEEEConferenceonComputerVisionandPatternRecognition,2019, conference on computer vision and pattern recognition, 2018, pp. pp.7345–7353. 918–927. [43] M.Liang,B.Yang,S.Wang,andR.Urtasun,“Deepcontinuousfusion [22] S. Beery, G. Wu, V. Rathod, R. Votel, and J. 
Huang, “Context r- formulti-sensor3dobjectdetection,”inProceedingsoftheEuropean cnn:Longtermtemporalcontextforper-cameraobjectdetection,”in ConferenceonComputerVision(ECCV),2018,pp.641–656.[44] S.Fadadu,S.Pandey,D.Hegde,Y.Shi,F.-C.Chou,N.Djuric,and C.Vallespi-Gonzalez,“Multi-viewfusionofsensordataforimproved perception and prediction in autonomous driving,” arXiv preprint arXiv:2008.11901,2020. [45] F. Xiao, Y. J. Lee, K. Grauman, J. Malik, and C. Feichtenhofer, “Audiovisualslowfastnetworksforvideorecognition,”arXivpreprint arXiv:2001.08740,2020. [46] C.Hazirbas,L.Ma,C.Domokos,andD.Cremers,“Fusenet:Incorporat- ingdepthintosemanticsegmentationviafusion-basedcnnarchitecture,” inAsianconferenceoncomputervision. Springer,2016,pp.213–228. [47] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two- streamnetworkfusionforvideoactionrecognition,”inProceedings oftheIEEEconferenceoncomputervisionandpatternrecognition, 2016,pp.1933–1941. [48] A.Dosovitskiy,P.Fischer,E.Ilg,P.Hausser,C.Hazirbas,V.Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning opticalflowwithconvolutionalnetworks,”inProceedingsoftheIEEE internationalconferenceoncomputervision,2015,pp.2758–2766. [49] K.He,X.Zhang,S.Ren,andJ.Sun,“Deepresiduallearningforimage recognition,”inProceedingsoftheIEEEconferenceoncomputervision andpatternrecognition,2016,pp.770–778. [50] Y.Zhang,C.Wang,X.Wang,W.Zeng,andW.Liu,“Fairmot:Onthe fairnessofdetectionandre-identificationinmultipleobjecttracking,” arXivpreprintarXiv:2004.01888,2020. [51] W.-C.Hung,H.Kretzschmar,T.-Y.Lin,Y.Chai,R.Yu,M.-H.Yang,and D.Anguelov,“Soda:Multi-objecttrackingwithsoftdataassociation,” arXivpreprintarXiv:2008.07725,2020. [52] Z.Wang,L.Zheng,Y.Liu,Y.Li,andS.Wang,“Towardsreal-time multi-objecttracking,”arXivpreprintarXiv:1909.12605,2019. [53] F.Go¨kc¸e,G.U¨c¸oluk,E.S¸ahin,andS.Kalkan,“Vision-baseddetection anddistanceestimationofmicrounmannedaerialvehicles,”Sensors, vol.15,no.9,pp.23805–23846,2015. 
[54] M.A.Haseeb,J.Guan,D.Ristic´-Durrant,andA.Gra¨ser,“Disnet:A novelmethodfordistanceestimationfrommonocularcamera,”10th Planning,PerceptionandNavigationforIntelligentVehicles(PPNIV18), IROS,2018. [55] S.Tuohy,D.O’Cualain,E.Jones,andM.Glavin,“Distancedetermina- tionforanautomobileenvironmentusinginverseperspectivemapping inopencv,”2010.