Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking

Longlong Jing1, Ruichi Yu1, Henrik Kretzschmar1, Kang Li1, Charles R. Qi1, Hang Zhao1∗, Alper Ayvaci1, Xu Chen1, Dillon Cower1, Yingwei Li2, Yurong You3, Han Deng1, Congcong Li1, and Dragomir Anguelov1

1Waymo LLC, 2Johns Hopkins University, 3Cornell University

Abstract—Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. Approaches to monocular 3D perception, including detection and tracking, however, often yield inferior performance compared to LiDAR-based techniques. Through systematic analysis, we identified that per-object depth estimation accuracy is a major factor bounding the performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation. Our proposed fusion method achieves state-of-the-art per-object depth estimation performance on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that by simply replacing estimated depth with fusion-enhanced depth, we can achieve significant improvements in monocular 3D perception tasks, including detection and tracking.

Fig. 1. Illustration of the impact of object depth estimation for monocular 3D detection. The predicted box (Pred) and the ground truth (GT) box are visualized in a bird's-eye view (BEV). In the bottom-left image, we see that a depth discrepancy causes a localization error. In the bottom-right image, we see that our proposed Pseudo-LiDAR, RGB and Tracklet (PRT) fusion method can improve object depth estimation and thereby improve detection.

I. INTRODUCTION

Existing perception systems for autonomous vehicles mainly rely on expensive sensors such as LiDAR and radar [1], [2], [3], [4]. Owing to the low cost, low power consumption, and longer perception range of cameras, monocular image-based perception has been attracting great interest in recent years from both industry and the research community [5], [6], [7], [8], [9], [10], [11]. Such perception tasks tend to be challenging, however, and there is a large performance gap between monocular perception systems and LiDAR/radar-based systems [7], [9], [5], [12].

Common 3D monocular perception systems comprise two major modules: 3D object detection and 3D tracking1. The former requires learning the 3D location, box size, and rotation/orientation of an object, while the latter requires using appearance and motion cues to associate detections across frames. Across both tasks, it is not obvious which component of the system has the most crucial impact on performance.

To understand which component bounds the overall performance, we experimented with replacing each output of a state-of-the-art detection model with the ground truth and then evaluated the resulting changes in detection and tracking-by-detection performance. As shown in Fig. 2, among all the attributes, including rotation, size, depth, and amodal box center in the image, we find that only the per-object depth, i.e., the depth of the vehicle's 3D center, matters: performance improves significantly when per-object depth is perfect, but only marginally when the other signals are perfect. Based on this observation, we identified per-object depth estimation as a major bottleneck for monocular 3D detection and tracking-by-detection. We also conducted the same analysis with other state-of-the-art detectors, such as RTM3D [13] with the AB3D tracker [10], and the results suggest that depth being the key factor for improving monocular 3D detection and tracking is a general conclusion across models. In this paper, we focus on improving per-object depth estimation and demonstrate that by enhancing object depth alone we can significantly improve detection and tracking performance.
A major challenge in estimating object depth from a monocular image is to obtain a representation that encodes the transition from 2D information to 3D depth. Recent efforts (e.g., on 3D monocular detection) have mainly focused on either directly learning from the raw RGB image [9], [5], [16] or leveraging a pseudo-LiDAR representation lifted from a predicted dense depth map [7], [8], [17]. Intuitively, we believe that these two representations might be complementary for estimating per-object depth, and that learning from either one alone might be sub-optimal: the RGB image encodes the appearance, texture, and 2D geometry of an object explicitly, but contains no direct 3D information, and it is difficult to learn how to map RGB features to depth precisely without overfitting to irrelevant information. On the other hand, the pseudo-LiDAR representation directly models the 3D structure of an object via an estimated dense depth map, which makes it straightforward to learn per-object depth; however, the estimated dense depth map is often noisy (usually with at least 8% average relative error [6], [18], [19]). Inspired by previous methods that fuse different representations, such as RGB image features and optical flow for action recognition [20], we believe that fusing the complementary signals encoded in the two representations may help per-object depth estimation.

Furthermore, depth estimation from monocular images is fundamentally ill-posed, as a single 2D view of a scene can be explained by many plausible 3D scenes [21]. However, observing an object over time allows us to model the underlying temporal and motion consistency of the object, which can provide contextual information to better localize the object in 3D. Similar ideas have been explored in other tasks such as 2D video-based object detection [22], [23], [24].

Based on our intuitions and the analysis above, we propose a multi-level fusion framework that consists of two major components. The first component is the pseudo-LiDAR and RGB fusion (PR-Fusion), which enhances depth estimation from two complementary representations. The other component is the tracklet fusion (T-Fusion), which leverages temporal and motion consistency with compensated ego-motion. Our full model, Pseudo-LiDAR-RGB-Tracklet (PRT) fusion, fuses information from both 2D and 3D representations across multiple frames.

We conducted extensive experiments on the Waymo Open Dataset [25], the KITTI detection dataset [26], and the KITTI MOT dataset [26] to demonstrate the effectiveness of our method. We obtained state-of-the-art performance on the per-object depth estimation benchmark proposed by [27]. To further demonstrate the practical value of the enhanced per-object depth, we have also achieved significant improvements with 3D object detection and tracking models by simply replacing the per-object depth with our enhanced estimation.

Our contributions can be summarized as follows: 1) We conducted a systematic analysis identifying per-object depth estimation as a major performance bottleneck of current 3D monocular detection and tracking-by-detection methods. 2) We proposed a novel method that fuses pseudo-LiDAR and RGB information across the temporal domain to significantly enhance per-object depth estimation. 3) We demonstrated that with the enhanced depth, the performance of monocular 3D detection and tracking can be significantly improved.

∗Work done while at Waymo LLC.
1In this paper we follow the tracking-by-detection paradigm.

Fig. 2. Headroom analysis of the impact of each component of the monocular 3D detection and tracking system, with detection performance (AP, top bars) and tracking performance (MOTA, bottom bars). The experiment is done with the state-of-the-art monocular 3D detector CenterNet [14] (ranked 1st on the nuScenes dataset [15]) and the AB3D tracker [10]. We use "+GT" to indicate that we replaced the prediction with the ground truth. The analysis suggests that depth has the most significant impact on both detection and tracking performance.

II. RELATED WORK

Monocular 3D Object Detection and Tracking: Monocular 3D object detection models aim at directly regressing attributes such as rotation, size, and depth from RGB images [5], [9], [14], [16], [28], [29], [30] or pseudo-LiDAR [7], [17], [31], [8]. Recently, Wang et al. pointed out that image features are not suitable for the task; instead, they proposed converting image-based depth maps into pseudo-LiDAR representations to mimic LiDAR and obtained significantly better performance than previous image-based methods [7]. A few works attempted to leverage both the RGB image and the estimated depth map for monocular 3D object detection [32], [33]; none of them specifically focused on improving per-object depth estimation based on different features from a tracklet. Most existing tracking methods follow the tracking-by-detection scheme [14], [10], [34], [35], and the quality of the monocular 3D detections is the bottleneck for tracking performance. Among all of the output dimensions of the above tasks, we identified that per-object depth is the bottleneck, and we demonstrated that the performance of monocular 3D detection and tracking can be significantly improved with enhanced object depth.

Depth/Distance Estimation: Monocular image-based dense depth estimation has been studied for years [6], [36], [19], [18], [37], [38], [39], [40]. Different from the existing methods, our method focuses on per-object depth estimation, which naturally enables the novel tracklet-based fusion that fuses different types of features across multiple frames. Although the two tasks share some high-level concepts (depth estimation), it is non-trivial to adapt per-pixel dense depth estimation to a per-object task, and this has not been explored in the dense depth estimation community. On the other hand, per-object depth/distance estimation has started to draw attention recently. Following the same experimental setting as the most recent state of the art [27], our novel framework, which leverages both pseudo-LiDAR and RGB representations across multiple frames for per-object depth estimation, demonstrated superior performance.

Representation and Temporal Fusion: Many existing methods are based on representation fusion [41], [42], [43], [44], [45], [46] and temporal fusion [5], [6], [7], [8], [9], [10], [11]. Two-stream fusion [20], [47] for action recognition is a classic multi-modal approach that fuses RGB image features and optical flow [48]. In the temporal domain, many methods have been explored to improve sequence-based tasks such as video object detection and video segmentation [9], [22], [23], [24]. These methods inspired us that fusing different representations (e.g., pseudo-LiDAR and RGB) across multiple frames might be beneficial for improving per-object depth estimation.

Fig. 3. The overall framework of our proposed monocular per-object depth estimation method. The framework includes three stages: (1) per-frame 2D object detection to detect objects, (2) 2D tracking to associate objects of the same vehicle across the temporal domain, and (3) the proposed PRT-Fusion over the tracklets generated by the tracking method, with fusion from both the temporal domain and different representations (RGB and pseudo-LiDAR).

III. IMPROVING PER-OBJECT DEPTH ESTIMATION VIA MULTI-LEVEL FUSION

The overview of our proposed multi-level fusion framework for per-object depth estimation is shown in Fig. 3. We first conduct 2D object detection and track detections across frames to construct a tracklet for each object. We then construct pseudo-LiDAR representations of the objects across frames and RGB image features for the current frame. Ego-motion compensation is applied to all pseudo-LiDAR patches within each tracklet to transform them into the same coordinate system. Finally, the RGB image features for the current frame and the temporally fused pseudo-LiDAR features are fused to produce per-object depth.

A. Pseudo-LiDAR and RGB (PR) Fusion

For any RGB image I, a dense depth estimation network F_d predicts a depth map

d = F_d(I), (1)

where d has the same size as I and its value at location (u,v) is the depth of the corresponding pixel in the image. Each pixel of the depth map is then lifted into a point cloud using the camera model:

z = d(u,v),
x = (u − C_x) × z / f_x,   (2)
y = (v − C_y) × z / f_y,

where (f_x, f_y) are the horizontal and vertical focal lengths of the camera and (C_x, C_y) is the pixel location of the camera center [8], [7]. After this transformation, each pixel of the dense depth map d becomes three channels representing the absolute 3D location of the corresponding pixel in camera coordinates. Given the pseudo-LiDAR representation of image I, the pseudo-LiDAR patch P_t for object b_t at timestamp t can be cropped based on the 2D bounding box: P_t is the collection of pseudo-LiDAR points that fall within the box b_t.
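The lifting and cropping steps above can be sketched as follows. This is a minimal NumPy sketch; the function names and the (u_min, v_min, u_max, v_max) box format are our own assumptions, not part of any released code.

```python
import numpy as np

def lift_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Lift a dense depth map of shape (H, W) into an (H*W, 3) pseudo-LiDAR
    point cloud in camera coordinates, following z = d(u, v),
    x = (u - C_x) * z / f_x, y = (v - C_y) * z / f_y."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v: row (vertical) index, u: column index
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def crop_object_patch(depth, box, fx, fy, cx, cy):
    """Crop the pseudo-LiDAR points whose pixels fall inside the hypothetical
    2D bounding box (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box
    patch = depth[v_min:v_max, u_min:u_max]
    # Shift the principal point so local crop indices behave like
    # the original image coordinates.
    return lift_to_pseudo_lidar(patch, fx, fy, cx - u_min, cy - v_min)
```

Shifting the principal point by the crop offset keeps the lifted coordinates identical to those obtained by lifting the full image and then selecting the in-box points.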
The pseudo-LiDAR-based feature PL of object b_t can then be extracted from the patch P_t using a feature encoder F_p as

PL = F_p(P_t), (3)

where PL represents the pseudo-LiDAR representation of the object within the bounding box b_t. The extraction of the pseudo-LiDAR representation thus consists of three steps: (1) dense depth estimation for each camera image, (2) lifting the predicted dense depth into pseudo-LiDAR, and (3) pseudo-LiDAR feature extraction with a neural network.

Inspired by the two-stream fusion method for action recognition presented in [20], we propose PR-Fusion to leverage the complementary information encoded in the RGB and pseudo-LiDAR representations. Given an RGB image I of size H × W, compact features for the entire image can be extracted with a pre-trained convolutional neural network F_RGB. For any object with a 2D bounding box b, the RGB image features R of the bounding box are extracted with a pooling operation:

R = Pool(F_RGB(I), b). (4)

Finally, the PR-Fusion can be represented as

PR = G_PR(PL, R), (5)

where G_PR is a deep neural network that fuses the two features and PR is the fused feature.

B. Tracklet Fusion with Ego-motion Compensation

Predicting per-object depth directly from a single frame is challenging because a single object in a camera image can be explained by multiple plausible objects at different depths [21]. Inspired by temporal fusion methods for video-based tasks, we propose to fuse the object-level information across multiple frames to enforce temporal and motion consistency of the prediction.
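As a toy illustration of the two-stream design, the fusion of Eq. (5) can be sketched as below. The PointNet-style point encoder, the feature dimensions, and the MLP fusion head are our own illustrative assumptions (with random, untrained weights); the actual encoders in this work are CenterNet- and PatchNet-based.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical encoder F_p: a shared per-point MLP followed by max pooling,
# mapping an (N, 3) pseudo-LiDAR patch to a fixed-size feature vector.
W1, W2 = rng.normal(size=(3, 64)), rng.normal(size=(64, 128))

def encode_points(points):
    per_point = relu(relu(points @ W1) @ W2)   # (N, 128) per-point features
    return per_point.max(axis=0)               # (128,) order-invariant pooling

# Hypothetical fusion head G_PR: an MLP over the concatenation of the
# pseudo-LiDAR feature PL and the RGB feature R, predicting a scalar depth.
Wf1, Wf2 = rng.normal(size=(128 + 128, 64)), rng.normal(size=(64, 1))

def pr_fusion(pl_feat, rgb_feat):
    fused = relu(np.concatenate([pl_feat, rgb_feat]) @ Wf1)
    return float((fused @ Wf2)[0])             # per-object depth estimate

points = rng.normal(size=(100, 3)) + [0.0, 0.0, 20.0]  # patch roughly 20 m away
rgb_feat = rng.normal(size=128)                        # stand-in for Pool(F_RGB(I), b)
depth = pr_fusion(encode_points(points), rgb_feat)
```

Because of the max pooling, the point encoder is invariant to the ordering of the pseudo-LiDAR points within the patch.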
Given the 2D detection results, we first conduct 2D data association [34], [10] to construct tracklets for objects and then fuse the features of each tracklet within a temporal window.

A straightforward method is to directly fuse image features across frames, similar to [24]; however, we find that directly fusing RGB features from different frames can be suboptimal, because RGB features couple the dynamic motion of the camera and the motion of the objects together, which makes it hard to learn motion and temporal consistency from a 2D image sequence. We believe that to perform effective temporal fusion for depth estimation, the camera motion must be compensated so that the features from different frames are in the same coordinate system. Fortunately, the ego-motion of the camera can be easily compensated in 3D space with the pseudo-LiDAR representations. Thus, we propose a T-Fusion method with ego-motion compensation based on pseudo-LiDAR representations.

The input to our proposed T-Fusion comprises the pseudo-LiDAR patches of each object in different frames, P_t, P_{t−1}, ..., P_{t−n}, where P_t is in the 3D camera coordinate system of frame t. The ego-motion is represented using a 4 × 4 homogeneous matrix H based on the conventional six degrees of freedom: translation [γ_x, γ_y, γ_z] in meters and rotation [ρ_x, ρ_y, ρ_z] in radians. First, all pseudo-LiDAR patches from different frames are projected into the global coordinate system using the camera-coordinate-to-global-coordinate transformation matrix H. For the pseudo-LiDAR patch P_{t−j} at any timestamp, let H_{t−j} be its camera-coordinate-to-global-coordinate transformation matrix; the transformation is

P'_{t−j} = H_t^{−1} ∗ H_{t−j} ∗ P_{t−j}. (6)

After this coordinate transformation, the ego-motion of the self-driving car is compensated for, and the transformed P'_{t−j} is in the same coordinate system as P_t. The same transformation is applied to the pseudo-LiDAR patches from all timestamps to eliminate the impact of ego-motion on the locations of the pseudo-LiDAR points of each object.

Given any feature encoder F_p() for pseudo-LiDAR, the features for the different timestamps can be extracted as F_p(P'_t), F_p(P'_{t−1}), ..., F_p(P'_{t−n}), where the prime indicates that the pseudo-LiDAR patch has been ego-motion compensated. The fused features for a sequence of an object can then be modeled by a neural network encoder G_TF as

PL_{t−n→t} = G_TF(F_p(P'_t), F_p(P'_{t−1}), ..., F_p(P'_{t−n})), (7)

where PL_{t−n→t} is the fused tracklet feature from frame t−n to t.

C. Multi-level PRT-Fusion

PR-Fusion and T-Fusion aggregate features from two different domains, and it is natural to combine the two fusion methods for further performance improvements. Given a sequence of object boxes across time, b_t, b_{t−1}, ..., b_{t−n}, the RGB image features of each object b_i can be extracted using an image feature encoder F_RGB(), and its pseudo-LiDAR features can be extracted using the encoder F_p(). There are two steps in PRT-Fusion: given the object in the current frame and its previous frames, we first conduct T-Fusion with ego-motion compensation on the pseudo-LiDAR representations across multiple frames; we then fuse the result with the RGB features at the current frame t as

PRT = G_PR(PL_{t−n→t}, R_t), (8)

where PRT is the fused feature from frame t−n to t.

D. Implementation Details

RGB Feature Extraction. CenterNet and CenterTrack [9], [14] have recently achieved state-of-the-art performance on the monocular 3D detection task on the nuScenes dataset [15]. We followed their formulation and network architecture, with ResNet50 [49] as the backbone, to perform 2D detection.

Pseudo-LiDAR Feature Extraction. Recently, PatchNet [8] was proposed to significantly improve pseudo-LiDAR-based detection performance. We choose it as our backbone model to extract pseudo-LiDAR-based features, serving as both the baseline and the input to our method.

2D Tracking. To track 2D detections and form tracklets, we followed [34] and use a Kalman-filter-based tracker. It is worth noting that, since our paper mainly focuses on per-object depth estimation, we believe the performance of our fusion method can be further improved with more sophisticated tracking methods [50], [51], [52].

IV. EXPERIMENTS

In this section, we first benchmark our per-object depth estimation against prior work in Sec. IV-A. Next, we ablate the design choices in our depth estimation model in Sec. IV-B. Finally, we show applications of the improved per-object depth to 3D monocular detection and tracking in Sec. IV-C.

Datasets. We evaluate on multiple datasets with a focus on the vehicle class. Among them, the Waymo Open Dataset is a large-scale dataset for autonomous driving. It consists of 798 training sequences and 202 validation sequences, and each sequence contains around 200 frames. The KITTI Detection Dataset has 3,712 RGB images for training and 3,768 images for testing; we use the split from the prior work [27] on per-object depth estimation for a fair comparison. The KITTI MOT Dataset consists of 8,008 and 11,095 frames in the official training and testing splits. Since it provides sequence information for each frame, we use this dataset to demonstrate the effectiveness of the PRT fusion.

A. Benchmarking Per-object Depth Estimation

Metrics. Following the existing state-of-the-art per-object depth estimation benchmark proposed by [27], five standard metrics are used for evaluation: average relative error (Abs Rel), squared relative error (Sq Rel), root-mean-square error (RMSE), average log10 error (RMSE_log), and threshold accuracy (δ_i).

Results. Since our paper mainly focuses on per-object depth estimation, we compare the performance of our proposed fusion methods against the state-of-the-art models.
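For reference, these five metrics can be computed from arrays of predicted and ground-truth per-object depths roughly as follows. This is a sketch under our own assumptions; in particular, whether RMSE_log uses log10 or the natural logarithm should be checked against the benchmark code of [27] (log10 is assumed here).

```python
import numpy as np

def depth_metrics(pred, gt, delta_thresh=1.25):
    """Standard per-object depth estimation metrics.

    pred, gt: 1-D arrays of predicted / ground-truth object depths in meters.
    Returns (abs_rel, sq_rel, rmse, rmse_log, delta).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)        # Abs Rel
    sq_rel = np.mean((pred - gt) ** 2 / gt)          # Sq Rel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))        # RMSE
    rmse_log = np.sqrt(np.mean(                      # RMSE_log (log10 assumed)
        (np.log10(pred) - np.log10(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta = np.mean(ratio < delta_thresh)            # threshold accuracy δ < 1.25
    return abs_rel, sq_rel, rmse, rmse_log, delta
```

A perfect prediction yields zero for the four error metrics and 1.0 for δ.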
We compare with two types of methods: geometry-based methods, which predict the depth from the geometry of the boxes, including SVR [53], IPM [55], and DistNet [54]; and deep-feature-based methods, including the method proposed in [27] and our implementations of the state-of-the-art monocular 3D detection methods CenterNet [9] and PatchNet [8] (we simply use their backbones and change the final prediction from a 3D box to per-object depth only). To conduct a fair comparison with the existing benchmark in [27], we follow its experimental setting and use the ground-truth 2D boxes (to filter out the impact of 2D object detection) on all datasets. We conduct data association with the Kalman-filter-based tracker of [34]. The comparisons on the Waymo Open Dataset, the KITTI Detection Dataset, and the KITTI MOT Dataset are shown in Table I, Table II, and Table III.

TABLE I
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE WAYMO OPEN DATASET FOR VEHICLES FOLLOWING THE SETTING IN [27].

Method         δ<1.25↑   Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
SVR [53]       83.26%    14.79%     1.3254    6.9081   0.2282
DistNet [54]   88.50%    11.23%     0.8974    6.3903   0.1737
PatchNet [8]   92.81%    8.77%      0.6051    5.5485   0.1283
CenterNet [9]  95.47%    7.25%      0.6240    4.7506   0.1146
Ours (T)       96.45%    6.96%      0.4214    4.5941   0.1001
Ours (PR)      97.64%    5.74%      0.3188    3.8788   0.0863
Ours (PRT)     98.09%    5.47%      0.2858    3.7282   0.0802

TABLE II
PERFORMANCE COMPARISON WITH STATE-OF-THE-ARTS ON THE KITTI DETECTION DATASET FOR VEHICLES FOLLOWING THE SETTING IN [27].

Method           δ<1.25↑   Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
SVR [53]         34.50%    149.4%     47.748    18.970   1.4940
IPM [55]         70.10%    49.70%     1290.5    237.62   0.4510
Zhu et al. [27]  84.60%    15.00%     0.6180    3.9460   0.2040
DistNet [54]     93.26%    12.39%     0.4834    2.9539   0.2003
CenterNet [9]    95.33%    8.70%      0.4250    3.2433   0.1436
PatchNet [8]     95.52%    8.08%      0.2789    2.9048   0.1296
Ours (PR)        97.60%    6.89%      0.2340    2.5025   0.1181

TABLE III
PERFORMANCE COMPARISON WITH STATE-OF-THE-ARTS ON THE KITTI MOT DATASET FOR VEHICLES FOLLOWING THE SETTING IN [27].

Method         δ<1.25↑   Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
CenterNet [9]  92.17%    9.10%      0.9372    6.9596   0.1975
PatchNet [8]   93.41%    8.65%      0.4988    4.9081   0.1268
Ours (T)       93.43%    7.90%      0.3706    4.0703   0.1157
Ours (PR)      94.39%    7.69%      0.4430    4.8065   0.1205
Ours (PRT)     95.23%    7.13%      0.3382    3.9391   0.1076

We first demonstrate the effectiveness of the PR-Fusion on both the Waymo Open Dataset and the KITTI Detection Dataset: with access to two different representations, our method with PR-Fusion (Ours (PR)) significantly enhances the performance compared to either baseline model individually, which suggests that the two types of representations are indeed complementary and that fusing them yields the best performance. Regarding the T-Fusion with ego-motion compensation and the PRT-Fusion on the Waymo Open Dataset and the KITTI MOT Dataset, our proposed method (Ours (T)) achieved significantly better performance than the baseline method PatchNet [8]. Finally, when leveraging the PR-Fusion and T-Fusion together as PRT-Fusion, the performance is further improved (see Ours (PRT)). In summary, our proposed fusion methods show significant improvements over the baseline models and outperform the state-of-the-art methods on both the Waymo Open Dataset and the KITTI datasets.

B. Ablation Study for Per-object Depth Estimation

To better demonstrate and understand the effectiveness of each module of our proposed method, we conducted thorough ablation studies with different fusion strategies and different 2D box and tracking qualities. For the first three ablation studies, we report performance with predicted association to better understand how our method works in practice; in the fourth study, we conduct a headroom analysis to understand how our method would work given perfect association.

TABLE IV
THE PERFORMANCE COMPARISON OF DIFFERENT FUSION STRATEGIES ON THE WAYMO OPEN DATASET FOR VEHICLES.

Ablation Study with Predicted Association
Method                     Abs Rel↓   Sq Rel↓   RMSE↓    RMSE_log↓
RGB_t                      7.25%      0.6240    4.7506   0.1146
T-Fusion: RGB_{t−1→t}      7.58%      0.6773    4.8902   0.1172
T-Fusion: RGB_{t−3→t}      7.60%      0.6748    4.9320   0.1179
PL_t                       8.77%      0.6051    5.5485   0.1283
T-Fusion: PL_{t−1→t}       6.96%      0.4214    4.5941   0.1001
T-Fusion: PL_{t−3→t}       7.16%      0.4389    4.6712   0.1022
PRT: RGB_t + PL_t          5.74%      0.3188    3.8788   0.0863
PRT: RGB_t + PL_{t−1→t}    5.47%      0.2858    3.7282   0.0802
PRT: RGB_t + PL_{t−3→t}    5.52%      0.2932    3.7661   0.0807

Ablation Study with Groundtruth Association
PRT: RGB_t + PL_t          5.74%      0.3188    3.8788   0.0863
PRT: RGB_t + PL_{t−1→t}    5.34%      0.2713    3.6278   0.0791
PRT: RGB_t + PL_{t−3→t}    5.29%      0.2668    3.5821   0.0783

1. Does tracklet fusion work for RGB features? We simply fused the features extracted from the RGB images of a tracklet (similar to our proposed T-Fusion, but without ego-motion compensation). The results are shown in the first group of Table IV, and no clear improvement is observed. One possible explanation is that the 3D information encoded in RGB images at different timestamps lies in different coordinate systems when the self-driving car is moving. It is non-trivial to decompose the camera ego-motion and the object motion, which makes it hard to learn motion and temporal consistency by simply fusing the image features.

2. How does tracklet fusion improve pseudo-LiDAR-based depth estimation? In the second group of Table IV, we show the performance of pseudo-LiDAR-based depth estimation with our proposed ego-motion-compensation-based T-Fusion. It is clear that the depth estimation performance improves significantly with T-Fusion, even with information from just one more frame. Adding more frames is helpful, but the additional improvement is marginal.

3. Does PRT fusion help? Given the improvements from PR-Fusion and T-Fusion, it is natural to ask whether their combination is helpful. The third group of Table IV shows the performance of PRT-Fusion with different numbers of frames. It is clear that the combination of both (PR-Fusion and T-Fusion) outperforms each one individually. However, due to noise introduced by data association, a longer tracklet does not necessarily yield better performance.

TABLE V
THE RESULTS OF MONOCULAR 3D DETECTION ON THE WAYMO OPEN DATASET FOR VEHICLES WITH DIFFERENT IOU THRESHOLDS (0.5–0.7).

                    AP3D                      APBEV
                    0.5      0.6      0.7     0.5      0.6      0.7
CenterNet [9]       24.32    14.36    6.06    28.20    18.71    11.52
+ Enhanced Depth    27.06    16.89    7.99    29.53    24.16    14.33
                    (+2.74)  (+2.53)  (+1.93) (+1.10)  (+5.45)  (+2.81)

TABLE VI
TRACKING RESULTS ON THE WAYMO OPEN DATASET FOR VEHICLES WITH THE METRICS IN [10].

Method                     SMOTA↑   AMOTA↑   AMOTP↑   MOTA↑   IDS↓
CenterNet [9] + AB3D [10]  50.19    14.06    27.00    39.37   228
+ Enhanced Depth           52.49    15.24    28.50    41.24   130
                           (+2.30)  (+1.18)  (+1.50)  (+1.87) (−98)

Fig. 4. Qualitative examples of per-object depth estimation and monocular 3D object detection. The green, red, and blue bounding boxes correspond to the ground truth (GT), the baseline depth estimation and detection (BL), and the result with the enhanced per-object depth from our proposed PRT-Fusion. Significantly better depth estimation and the resulting improvements in detection can be observed.

4. How is the performance affected by association noise?
We further study the impact of the association quality on our proposed method. The fourth group of Table IV shows the results with ground-truth association. We observe that with perfect association, the improvement is consistent for longer tracklets. This indicates that improving data association quality is a promising direction for fully leveraging the capability of the proposed fusion method.

C. Improving Monocular 3D Detection and Tracking with Enhanced Per-object Depth

In this subsection, we apply our per-object depth estimation to show that it can further improve the state-of-the-art monocular image-based 3D detector CenterNet [14] and the AB3D tracker [10] on the Waymo Open Dataset.

Quantitative results. For 3D object detection, we trained a CenterNet for monocular 3D detection and replaced only the depth of the detection results, while the other outputs, such as box size and rotation, were retained. For tracking, we follow the tracking-by-detection scheme and run the AB3D tracker [10] on the detection boxes with the enhanced per-object depth. The detection and tracking results are shown in Table V and Table VI. As expected, we observe that simply enhancing the depth yields significant improvements on both tasks. It is worth noting that the performance could be further improved by tuning the model specifically for detection and tracking, but that is out of the scope of this paper, since we focus on demonstrating the improvements that come purely from depth.

Qualitative results. We further visualize the predictions of the baseline detection model and our proposed PRT-Fusion to illustrate how the enhanced per-object depth improves monocular 3D detection and tracking. In Fig. 4, the first row shows the per-object depth improvements brought by our method, and the second row illustrates the bird's-eye view of the 3D detection results of both the baseline detector and the detector with our improved depth (only the depth is replaced). Fig. 5 shows that the tracking model makes fewer ID-switch errors with the depth predicted by our model. Clear improvements can be observed, and the improved depth is the key factor leading to the significant improvements in monocular 3D detection and tracking.

Fig. 5. Qualitative examples of monocular 3D tracking results. Due to the inaccurate depth estimation shown in (a), the 3D tracker wrongly associates detections across frames, which leads to ID switches. With the enhanced depth predicted by our proposed fusion model in (b), the tracker associates the detections correctly.

V. CONCLUSION

We demonstrated that per-object depth estimation is the performance bottleneck of monocular image-based 3D perception tasks, including detection and tracking. We proposed a multi-level fusion framework that fuses features from different representations across multiple frames. We first obtained state-of-the-art performance on per-object depth estimation, and then showed that by simply replacing the depth, significant improvements can be observed in the tasks above. This not only validates our findings and the effectiveness of the proposed method, but also indicates that improving per-object depth is a promising direction for enhancing detection and tracking. Future work can include end-to-end training of the proposed method.

VI. ACKNOWLEDGEMENT

We would like to thank Jiyang Gao for the helpful discussions about this work.

REFERENCES

[1] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[2] X. Liu, C. R. Qi, and L. J. Guibas, "Flownet3d: Learning scene flow in 3d point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 529–537.
[3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "Semantickitti: A dataset for semantic scene understanding of lidar sequences," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9297–9307.
[4] B. Yang, R. Guo, M. Liang, S. Casas, and R. Urtasun, "Radarnet: Exploiting radar for robust perception of dynamic objects," arXiv preprint arXiv:2007.14366, 2020.
[5] Y. Chen, L. Tai, K. Sun, and M. Li, "Monopair: Monocular 3d object detection using pairwise spatial relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12093–12102.
[6] I. Alhashim and P. Wonka, "High quality monocular depth estimation via transfer learning," arXiv preprint arXiv:1812.11941, 2018.
[7] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453.
[8] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang, "Rethinking pseudo-lidar representation," arXiv preprint arXiv:2008.04582, 2020.
[9] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[10] X. Weng, J. Wang, D. Held, and K. Kitani, "3d multi-object tracking: A baseline and new evaluation metrics," arXiv preprint arXiv:1907.03961, 2020.
[11] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Krahenbuhl, T. Darrell, and F. Yu, "Joint monocular 3d vehicle detection and tracking," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5390–5399.
[12] G. Wang, B. Tian, Y. Ai, T. Xu, L. Chen, and D. Cao, "Centernet3d: An anchor free object detector for autonomous driving," arXiv preprint arXiv:2007.07214, 2020.
[13] P. Li, H. Zhao, P. Liu, and F. Cao, "Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving," in ECCV, 2020.
[14] X. Zhou, V. Koltun, and P. Krähenbühl, "Tracking objects as points," in ECCV, 2020.
[15] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[16] Z. Liu, Z. Wu, and R. Tóth, "Smoke: Single-stage monocular 3d object detection via keypoint estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 996–997.
[17] J. M. U. Vianney, S. Aich, and B. Liu, "Refinedmpl: Refined monocular pseudolidar for 3d object detection in autonomous driving," arXiv preprint arXiv:1911.09712, 2019.
[22] Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13075–13085.
[23] M. Liu, M. Zhu, M. White, Y. Li, and D. Kalenichenko, "Looking fast and slow: Memory-guided mobile video object detection," arXiv preprint arXiv:1903.10172, 2019.
[24] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, "Flow-guided feature aggregation for video object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 408–417.
[25] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., "Scalability in perception for autonomous driving: Waymo open dataset," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[26] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[27] J. Zhu and Y. Fang, "Learning object-specific distance from a monocular image," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3839–3848.
[28] G. Brazil and X. Liu, "M3d-rpn: Monocular 3d region proposal network for object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9287–9296.
[29] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, "Disentangling monocular 3d object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1991–1999.
[30] E. Jörgensen, C. Zach, and F. Kahl, "Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss," arXiv preprint arXiv:1906.08070, 2019.
[31] X. Weng and K. Kitani, "Monocular 3d object detection with pseudo-lidar point cloud," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[32] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, "Learning depth-guided convolutions for monocular 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 1000–1001.
[33] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan, "Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6851–6860.
[34] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 3464–3468.
[35] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, "Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6499–6508.
[36] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[37] V. Guizilini, R. Hou, J. Li, R. Ambrus, and A. Gaidon, "Semantically-guided representation learning for self-supervised monocular depth," arXiv preprint arXiv:2002.12319, 2020.
[38] X. Wang, W. Yin, T. Kong, Y. Jiang, L. Li, and C. Shen, "Task-aware monocular depth estimation for 3d object detection," in AAAI, 2020, pp. 12257–12264.
[39] M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia, "On the uncertainty
ofself-supervisedmonoculardepthestimation,”inProceedingsofthe [18] C.Godard,O.MacAodha,andG.J.Brostow,“Unsupervisedmonocular IEEE/CVFConferenceonComputerVisionandPatternRecognition, depth estimation with left-right consistency,” in Proceedings of the 2020,pp.3227–3237. IEEEConferenceonComputerVisionandPatternRecognition,2017, [40] H.Fu,M.Gong,C.Wang,K.Batmanghelich,andD.Tao,“Deepordinal pp.270–279. regressionnetworkformonoculardepthestimation,”inProceedings [19] C.Godard,O.MacAodha,M.Firman,andG.J.Brostow,“Digging oftheIEEEConferenceonComputerVisionandPatternRecognition, intoself-supervisedmonoculardepthestimation,”inProceedingsof 2018,pp.2002–2011. theIEEEinternationalconferenceoncomputervision,2019,pp.3828– [41] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting: 3838. Sequential fusion for 3d object detection,” in Proceedings of the [20] K.SimonyanandA.Zisserman,“Two-streamconvolutionalnetworks IEEE/CVFConferenceonComputerVisionandPatternRecognition, foractionrecognitioninvideos,”inAdvancesinneuralinformation 2020,pp.4604–4612. processingsystems,2014,pp.568–576. [42] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task [21] C.R.Qi,W.Liu,C.Wu,H.Su,andL.J.Guibas,“Frustumpointnets multi-sensor fusion for 3d object detection,” in Proceedings of the for3dobjectdetectionfromrgb-ddata,”inProceedingsoftheIEEE IEEEConferenceonComputerVisionandPatternRecognition,2019, conference on computer vision and pattern recognition, 2018, pp. pp.7345–7353. 918–927. [43] M.Liang,B.Yang,S.Wang,andR.Urtasun,“Deepcontinuousfusion [22] S. Beery, G. Wu, V. Rathod, R. Votel, and J. 
Huang, “Context r- formulti-sensor3dobjectdetection,”inProceedingsoftheEuropean cnn:Longtermtemporalcontextforper-cameraobjectdetection,”in ConferenceonComputerVision(ECCV),2018,pp.641–656.[44] S.Fadadu,S.Pandey,D.Hegde,Y.Shi,F.-C.Chou,N.Djuric,and C.Vallespi-Gonzalez,“Multi-viewfusionofsensordataforimproved perception and prediction in autonomous driving,” arXiv preprint arXiv:2008.11901,2020. [45] F. Xiao, Y. J. Lee, K. Grauman, J. Malik, and C. Feichtenhofer, “Audiovisualslowfastnetworksforvideorecognition,”arXivpreprint arXiv:2001.08740,2020. [46] C.Hazirbas,L.Ma,C.Domokos,andD.Cremers,“Fusenet:Incorporat- ingdepthintosemanticsegmentationviafusion-basedcnnarchitecture,” inAsianconferenceoncomputervision. Springer,2016,pp.213–228. [47] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two- streamnetworkfusionforvideoactionrecognition,”inProceedings oftheIEEEconferenceoncomputervisionandpatternrecognition, 2016,pp.1933–1941. [48] A.Dosovitskiy,P.Fischer,E.Ilg,P.Hausser,C.Hazirbas,V.Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning opticalflowwithconvolutionalnetworks,”inProceedingsoftheIEEE internationalconferenceoncomputervision,2015,pp.2758–2766. [49] K.He,X.Zhang,S.Ren,andJ.Sun,“Deepresiduallearningforimage recognition,”inProceedingsoftheIEEEconferenceoncomputervision andpatternrecognition,2016,pp.770–778. [50] Y.Zhang,C.Wang,X.Wang,W.Zeng,andW.Liu,“Fairmot:Onthe fairnessofdetectionandre-identificationinmultipleobjecttracking,” arXivpreprintarXiv:2004.01888,2020. [51] W.-C.Hung,H.Kretzschmar,T.-Y.Lin,Y.Chai,R.Yu,M.-H.Yang,and D.Anguelov,“Soda:Multi-objecttrackingwithsoftdataassociation,” arXivpreprintarXiv:2008.07725,2020. [52] Z.Wang,L.Zheng,Y.Liu,Y.Li,andS.Wang,“Towardsreal-time multi-objecttracking,”arXivpreprintarXiv:1909.12605,2019. [53] F.Go¨kc¸e,G.U¨c¸oluk,E.S¸ahin,andS.Kalkan,“Vision-baseddetection anddistanceestimationofmicrounmannedaerialvehicles,”Sensors, vol.15,no.9,pp.23805–23846,2015. 
[54] M.A.Haseeb,J.Guan,D.Ristic´-Durrant,andA.Gra¨ser,“Disnet:A novelmethodfordistanceestimationfrommonocularcamera,”10th Planning,PerceptionandNavigationforIntelligentVehicles(PPNIV18), IROS,2018. [55] S.Tuohy,D.O’Cualain,E.Jones,andM.Glavin,“Distancedetermina- tionforanautomobileenvironmentusinginverseperspectivemapping inopencv,”2010.