Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving JingxiaoZheng1 XinweiShi1 AlexanderGorban1 JunhuaMao1 YangSong1 CharlesR.Qi1 TingLiu2 ViseshChari1 AndreCornman1 YinZhou1 CongcongLi1 DragomirAnguelov1 1 WaymoLLC 2 GoogleResearch jingxiaozheng, xinweis, gorban, junhuamao, yangsong, rqi @waymo.com, { } liuti@google.com, visesh, cornman, yinzhou, congcongli, dragomir @waymo.com { } Abstract RGB-D Sensor Characteristics 3D human pose estimation (HPE) in autonomous vehi- ● Indoor. cles (AV) differs from other use cases in many factors, in- ● Dense depth. ● Same view point. cluding the 3D resolution and range of data, absence of ● Easy camera-depth dense depth maps, failure modes for LiDAR, relative loca- correspondence. tionbetweenthecameraandLiDAR,andahighbarfores- ● Short range. timationaccuracy. Datacollectedforotherusecases(such asvirtualreality,gaming,andanimation)maythereforenot beusableforAVapplications. Thisnecessitatesthecollec- Camera and LiDAR Sensors Characteristics tionandannotationofalargeamountof3DdataforHPE ● Outdoor. inAV,whichistime-consumingandexpensive. ● Sparse depth. ● Different view points. In this paper, we propose one of the first approaches to ● Difficult alleviate this problem in the AV setting. Specifically, we camera-depth propose a multi-modal approach which uses 2D labels on correspondence. ● Long range. RGBimagesasweaksupervisiontoperform3DHPE.The proposedmulti-modalarchitectureincorporatesLiDARand camera inputs with an auxiliary segmentation branch. On Figure1.DifferentcharacteristicsofRGB-DandCamera+LiDAR the Waymo Open Dataset [27], our approach achieves a sensors. Toprowexamplesarefromdataset[48];bottomrowex- 22% relative improvement over camera-only 2D HPE amplesarefromtheWaymoOpenDataset[27]. ∼ baseline,and 6%improvementoverLiDAR-onlymodel. ∼ Finally, careful ablation studies and parts based analysis 3D[6]. Secondly,sensorcharacteristicsandplacementsfor illustratetheadvantagesofeachofourcontributions. LiDARfollowdifferentlogiccomparedtootherdepthsen- sors like in games or VR [34,40]. Thirdly, requirements for accuracy, real-time prediction and generalization over 1.Introduction a wide variety of scenarios are also different. Animation, 3D Human Pose Estimation (3D HPE) for autonomous surveillance, games and VR have relatively lower bars for vehicles (AV) has received little attention in the academic accuracy compared to AV where HPE is a critical compo- community relative to other applications like animation, nentfortheperceptionmodule. games, virtualreality(VR),orsurveillance[43]despiteits Divingdeeperintothesensor,LiDARdiffersfromother centralroleinAV.Arguably,thiscouldbebecause3DHPE depth sensors in several ways. Figure 1 summarizes these in AV differs greatly from HPE in other scenarios. For differences and gives visual illustrations. Firstly, LiDAR one,AVrequiresHPEinoutdoorenvironmentsandin3D, haslongerrangeandlargerFOVthanRGB-Dsensors,and whichisnotthecaseforanimationorgameswhicharenot it is more suitable for outdoor scenes. Point clouds from outdoor[17,35]orsurveillancewhichisnotnecessarilyin LiDAR are sparser and sweep a wider range of the envi- 1202 ceD 22 ]VC.sc[ 1v14121.2112:viXraronment. Secondly, LiDARs and cameras may not be co- datasets, followed by ablation studies and performance located on AV platforms. Accurate registration is needed analysis(refertosupplementaryforadditionalresults). Fi- for correspondence between point clouds and image tex- nally,weconcludeinSection5withadiscussionofavenues tures. Finally, failure cases for LiDAR caused by reflec- forimprovementandfuturedirections. tivematerials,weatherconditions,anddustonsensorsdif- fer from other sensor failures due to the difference in the 2.RelatedWork physicsofsensingaswellasenvironmentalfactors. Inrecentyears,manymethodshavebeenintroducedfor Given the aforementioned differences and the evi- 3DHPE[43], althoughhardlyanyworkhasaddressedthe dence of 3D HPE models not generalizing across different AVscenario. MosttakeRGBorRGB-Dimagesasinputs, datasets[32,37,43,44]becauseofdatasetbias, weseethe andoperateinmonocular,multivieworvideosettings. need for developing approaches specific to AV that tackle Monocular 3D HPE approaches like Tome et al. [30] theproblemof3DHPE.Onestraightforwardwaytotackle take the simplest of inputs (monocular RGB images) and this problem would be to collect 3D human pose annota- predict3Dkeypointsusingamulti-stagemethod.Thisclas- tionsforalargeanddiversedatasetofLiDARpointclouds sical approach of “lifting” 3D keypoints from 2D images inAVscenariosliketheWaymoOpenDataset[27]. How- hasbeenrecentlydoneusingdeeplearning[18],andinthe ever, the ”in-the-wild” setting of 3D HPE for AV presents pastusingadatabaseof3Dskeletons[1,24,33].Recentcrit- serious challenges to annotating training data at this scale, icismsofthisapproachhavefocusedonover-relianceonthe intermsoftime,costandcoverageoflongtailscenarios. underlying 2D estimator, and of generalization problems In this paper, we propose an approach to use widely [2,43]. Extending this approach temporally [2,5,44,45] available and easier to get 2D human pose annotations alsohasbeenattempted,butstillunderperformsapproaches to drive 3D HPE in a weakly-supervised setting. While whichusedepthinformation(see[42],[43]table11). the weakly-supervised setting is not uncommon for 3D HPE [44], using LiDAR in the AV setting requires sepa- Depth based approaches also come in different flavors. rate consideration for the reasons mentioned thus far. Fig- Some, like Zimmermann et al., [48] use a VoxelNet based ure 2 shows the idea of the proposed method. While we method on RGB-D images with 3D labels. Others might use PointNet [22]-inspired architecture as the main point only use point clouds [29], add temporal consistency for- cloud processing network, we cannot fuse camera and Li- mulations [12], use a split and recombine approach [39], DARimageryatthelowerlevelslikeinothersettings[38] or generate large amounts of synthetic data followed by becauseofthesparsityofLiDAR.Weproposeacascadear- supervised learning strategy [15]. Semi-supervised ap- chitecture with a CNN-based camera network for 2D pose proaches[19,20,25,26]havealsobeenrecentlyattempted estimation. In addition, we add an auxiliary segmentation todealwiththelongtailand”in-the-wild”scenarios. branch in the point network to introduce stronger supervi- siontoeachpointviamulti-tasklearning. Thisgivesusan Weakly-supervised 3D Human Pose Estimation: Be- advantageinthe”in-the-wild”settings,asshownbythere- sides the above fully- and semi-supervised methods which sultsontheWaymoOpenDataset(Table1andTable3). In rely on at least a certain amount of 3D annotations, there the rest of the paper, we show that pose estimation perfor- are also weakly-supervised methods that use pure 2D an- mancebenefitsfromallthesedesigns. notations. Tripathi et al. [31] introduced a self-supervised Themaincontributionsofthispaperareasfollows: method with teacher-student strategy on RGB sequences. Chenetal.[4]introducedaweakly-supervisedmethodwith • We propose a multi-modal framework which fuses cycle GAN [47]-like structure on pure 2D labels. Other RGB camera images and LiDAR point clouds to ex- weakly-supervised methods include [3,14]. All the above ploitthetextureinformationandgeometryinformation methodsareRGB-based,anddonotinvolvetheuseofpoint for3DposeestimationinchallengingAVscenarios. clouds,whileourmethodutilizespointcloudstohelptoim- • We train 3D pose estimation models by weak super- provethepredictionaccuracy. visionfrompure2Dlabels,whichmakesthelabeling Fu¨rst et al. [8] proposed an end-to-end system for 3D stagemuchlessexpensive. detectionandHPEforRGBandLiDARinAVwithpure2D keypoint annotations. However, their work only includes • We introduce an auxiliary segmentation branch into evaluations for 2D HPE and projected 3D HPE, while our thepointnetworktoimprove3Dposeestimationper- approachisevaluatedwithreal3Dannotations. formanceviamulti-tasklearning. We review related work in Section 2, and follow it up Point Cloud-Based Approaches: Point cloud-based ap- with details about our approach in Section 3. Section 4 proaches differ from HPE on traditional depth sensors in discusses detailed experiments with results on two large theirabilitytohandlesparse3Ddata[11]. PointNet[22]isFigure2.Modeloverview:themodelisacascadeofcameranetworkandpointnetwork.Thecameranetworktakesthe2Dcameraimage as input and predicts the 2D keypoint heatmap. This 2D heatmap is augmented with the point cloud using modality fusion (Figure 3) andisfedintothepointnetwork. Theregressionbranchofthepointnetworkpredicts3Dkeypointcoordinatesasoutput. Theauxiliary segmentation branch generates pointwise predictions which are only used for training. The model is trained on pseudo 3D labels and pointwiselabelsgeneratedfrom2Dkeypointlabels(Figure4). a popular network for point cloud-based classification and AnoverviewoftheproposedapproachisshowninFig- segmentation,improvedwithhierarchicalstructuresin[23] ure 2. Our model is a cascade of a camera network and and utilized for 3D object detection on RGB-D [21], and a point network. The camera network takes a 2D image hand pose estimation [9,10]. Finally, Zhang et al. [41] as input and predicts a 2D keypoints heatmap [36]. This proposed a weakly supervised point cloud-based method heatmapisusedtoaugmentthepointcloudusingmodality for 3D human pose estimation. However, their method re- fusion and fed into the point network. Finally, the regres- quires3DannotationsandisonlyevaluatedinindoorRGB- sionbranchofthepointnetworkpredictsthe3Dcoordinates D datasets, while our method works on uncontrolled AV of K keypoints. An auxiliary segmentation branch gener- scenarioswithpure2Dannotations. atespointwisepredictionswhichareonlyusedfortraining. Themodelistrainedonpseudo3Dlabelsandpointwisela- 3.Method belsgeneratedfrom2Dlabels. 3.1.ProblemFormulation 3.2.ModalityFusionofLiDARandCamera The 3D pose estimation problem can be described as Weintroducea2Dcameranetworkwithmodalityfusion follows. For each human subject in consideration, there to transfer texture information from RGB images to point are two modalities of data available: the point cloud and clouds. Our camera network follows the architecture pro- a camera image of the person. The point cloud P = posed in [36] which consists of a downscale module and p 1, ,p i, ,p N RN ×d, consists of N LiDAR anupscalemodule. ThedownscalemoduleisaResNet-50 ··· ··· ∈ points from a single scan with d-dimensional features. In network and the upscale module consists of three decon- (cid:2) (cid:3) thiswork,d=3. ThecameraimageisanH W 3RGB volutional layers. A 1 1 convolutional layer with sig- image. Assuming we have the extrinsics an× d intr× insics of moid activation follows× the upscale module and produces theLiDARandcamera,foreachpointp i,its3Dworldco- theoutputheatmap. ThenetworktakesanRGBimagewith ordinatesx(3) inthepointcloudcoordinatesystemand2D sizeH W 3asinputandgeneratesakeypointheatmap i coordinatesx(2) intheimagecoordinatesystemareknown. H = × h × H0,W0 with size H W K, where K i { m,n }m=1,n=1 0 × 0 × Giventheseinputs,thegoalistopredict3Dcoordinatesof isthenumberofkeypoints. Eachpixelh intheheatmap m,n K posekeypoints {y k(3) }K k=1 ∈RK ×3ofthecorresponding is a K dimensional vector, indicating the likelihood of the person. Note that LiDAR point clouds are usually sparse correspondingimagepixelbelongingtoeachoftheK key- andlieonthesurfaceoftheobject,whilegroundtruthkey- points. points are defined inside the human body. Therefore, we The heatmap H is consequently sampled at points cor- cannot choose a subset of P as the 3D pose of the person responding to the 2D projections on the camera image of andapproach3DHPEinAVasaclassificationproblem. 3D LiDAR points, to generate camera features pcam as iFigure3. Modalityfusion: the2DheatmapfromthecameranetworkisfirstsmoothedbyGaussiankernel, thensampledby2Dpoint cloudprojectionsonthecameraimage. Thesampledheatmapslicesareconsideredascamerafeaturesandareconcatenatedwithpoint coordinatesofthepointcloudasaugmentedinputtothepointnetwork.SeeSec3.2fordetails. shown in Figure 3. The camera feature for point i is com- with higher-level camera features provides the point net- puted as pcam = h , which is a slice of H at lo- workbothlow-andhigh-levelpointcloudinformation. By i m(i),n(i) cation (m(i),n(i)). Here m(i) = round(W0x ) and introducingmodalityfusion,weachieve 6%relativeim- W 1i ∼ n(i) = round(H0x ),wherex(2) = (x ,x )arethe2D provement on the Waymo Open Dataset compared to the H 2i i 1i 2i LiDAR-onlybaseline(Table3inSection4). image coordinates of point i. In practice, we observe that heatmapsfromthecameranetworkareusuallyverypeaky, 3.3.AuxiliaryPointwiseSegmentationBranch which contains little information at locations not close to anykeypoints. Hence,weapplyGaussiansmoothingtoen- Our point network is the primary component of the large the receptive field at these locations [7], so the cor- proposed method, which directly generates 3D keypoint responding point can utilize the information from a larger prediction from augmented point clouds. The regression neighborhoodontheimage. branch predicts a 3K-dimensional output vector corre- Finally, camera features pcam are concatenated with the spondingtothe3DcoordinatesofK keypoints. i original point feature p i to generate the augmented point Even though rich camera information from the camera cloudPaug RN ×(d+K), whichservesastheinputofthe network is provided to the point network by modality fu- ∈ following point network. This augmentation directly in- sion, the model’s designated output is still a fixed set of corporates texture information from RGB images into the keypoints. Itisdifficultforaglobalregressionlosstoguide point cloud, which helps the LiDAR based point network the point network to effectively utilize the camera infor- withinformationusefulformoreaccuratekeypointpredic- mation for each point. Therefore, to provide more direct tions. Similar concatenation can be found in [48], where supervision to every individual point, we propose an aux- voxel representations are concatenated with heatmaps be- iliary segmentation branch after the feature encoder in the forefeedingintoaVoxelNet[46]. point network, inspired by the architecture of a segmenta- The proposed cascade modality fusion architecture tionPointNet[22].ForeachLiDARpoint,thesegmentation achieves improvements because heatmap predictions from branch predicts the pose keypoint it is closest to. In other thecameranetworkcarrycomplementarytexturerelatedse- words, the segmentation branch generates N K confi- × mantic cues that are not present in LiDAR point features. dencescoresforassigningN LiDARpointstoK posekey- Therefore, augmenting lower-level LiDAR point features points(apointwithhighscoremeansthatitisclosetotheFigure4. Pseudolabelgeneration: apseudo3Dkeypointlabel(redtriangle)iscomputedastheweighedaverageof3Dcoordinatesof neighboringpoints(bluetrianglesanddots)tothekeypointlabelin2Dspace(reddot). Similarly,togeneratepointwiselabels,positive labelsareassignedtoneighboringpoints(bluedots)ofagroundtruthkeypoint(reddot)in2Dspace(bestviewedincolor).SeeSec3.4.1 fordetails. corresponding keypoint). Here, the keypoint type for each 1. thepointcloudisdenseenoughsothatthereisatleast pointcorrespondstothetypeofitsnearestkeypoint. onepointintheneighborhoodofeachkeypointin2D This additional point-wise loss helps the point network space; to digest more information from the camera network. By 2. thehumansurfaceissmoothenoughsothatthedepth adding the auxiliary segmentation branch and loss, we doesnotrapidlychangeintheneighborhoodofakey- achieve 1.8%relativeimprovementontheWaymoOpen ∼ point; Datasetcomparedtothemodality-fusionarchitecturewith- outthesegmentationbranch(Table3inSection4). 3. pointcloudtocameraregistrationisreliable. Though point clouds will be downsampled to a fixed size 3.4.Weakly-SupervisedModelTraining before being fed into the point network, pseudo 3D labels Trainingtheproposedpointnetworkwithtwobranches are generated based on the point cloud before downsam- needs two sets of labels: For the main regression branch, pling.Therefore,theaboveassumptionsholdinmostcases. ground truth 3D keypoint coordinates are required; for the Also, since LiDAR and camera are usually attached to the segmentation branch, pointwise keypoint type labels are samerigidobject(thevehicle)andarefrequentlycalibrated, needed. Intheproposedmethod,weintroducealabelgen- itisreasonabletoassumethattheregistrationisreliable. erationmethodtoenablemodeltrainingonpure2Dlabels 3D Keypoint Coordinates Label Generation: Based on forbothtasks. ourassumption,foreachpointinthepointcloud,itsaccu- rate 2D projection on the camera image is known. There- fore,foragroundtruthkeypointin2Dcoordinates,wecan 3.4.1 LabelGeneration first find its neighboring points in 2D space. Then, based onourassumptions,thedepthsofthesepointswillbeclose AsstatedinSection3.1,weknowthe3Dcoordinatesofin- enough to the true depth of the keypoint. As Figure 4, we putpoints x(3) N ,theircorresponding2Dimagecoordi- { i }i=1 usetheaverage3Dcoordinatesoftheseneighboringpoints nates x(2) N ,and2Dgroundtruthkeypoints y(2) K . { i }i=1 { k }k=1 toapproximatethecoordinatesofthekeypoint, The correspondence is pre-computed by projecting 3D pointsontothecameraimagecoordinatesaccordingtothe N exp T x(2) y(2) 2 camera model. Since the projection is not a one-to-one y˜ k(3) = α ikx( i3), α ik = N e(cid:16) x− p k Ti x− (2) k yk (2 2(cid:17)) 2 mapping,directlyback-projecting2Dlabelsto3Dspaceis Xi=1 j=1 − k j − k k2 impossible. (cid:16) (1)(cid:17) P To generate 3D keypoint labels from the 2D labels and Here α ik weights the contribution of point i to the pseudo thepointcloud,wemakethefollowingassumptions: keypoint yˆ(3) based on their distances to the ground truth kkeypointy(2) in2Dspace,T isthetemperaturethatcontrols pedestrians. These pedestrians are labeled with 2D key- k thesoftmaxoperation. pointlabelsof13keypointtypes(nose,left/rightshoulders, In case the pseudo 3D labels are not accurate, we also left/right elbows, left/right wrists, left/right hips, left/right compute the reliability of the 3D approximation for each knees and left/right ankles) in the camera image. These keypoint as r = exp T min x(2) y(2) 2 , where samples are split into a training set with 155,182 pedestri- k − r i k i − k k2 ansandatestsetwith42,199pedestrians. Thetrainingset T risthetemperaturefac(cid:16)tor,toweightthelosseson(cid:17)different withpure2Dlabelsisusedtotraintheproposedmodel. keypointsduringtraining. PointwiseKeypointTypeLabelGeneration: Togenerate pointwisetypelabelsforthesegmentationtask,wesimply 3D Evaluation Data and Metrics: The Waymo Open assign all neighboring points of a keypoint in 2D space to Dataset serves as our 3D evaluation set. It is composed the corresponding keypoint type, shown in Figure 4. The of sensor data collected by Waymo cars under a variety of typelabell forpointiwithrespecttokeypointkisgener- ik conditions. It contains 1,950 segments of 20s each, with atedby sensor data including point clouds from LiDAR and RGB l = 1 if kx( i2) −y k(2) k2 ≤r, (2) imagescapturedbycameras. For3Devaluation,welabeled ik (0 otherwise 986 pedestrians with 3D keypoint coordinates of 13 key- point types (same as our internal dataset) on LiDAR point whereristheneighboringradiusforpositivesamples. clouds.Wearelookingtoreleasetheselabelsforevaluation Withthegeneratedpseudo3Dlabelstotrainthe3Dkey- onceobtainedrelatedapprovals. point model, we achieve 22% relative improvement on ∼ EvaluationresultsarereportedintheOKS(ObjectKey- theWaymoOpenDatasetcomparedtothebaselineofpre- point Similarity) accuracy (OKS/ACC) metric, which is dicting2Dkeypointswith2Dlabelsandliftingto3D(Ta- similar to the OKS/AP metric introduced in COCO key- ble3inSection4). pointchallenge[16](pleaserefertothesupplementaryma- 3.5.TrainingLosses terial for more details), and MPJPE (Mean Per Joint Posi- tionError)[13]in3Dcoordinates. Point Network: The training loss for the regression branchisaHuberlossL onthegeneratedpseudo3Dla- reg bels, weighted by the reliability r k. The loss for the seg- 2DEvaluationDataandMetrics: Thetestsetofourin- mentationbranchisacross-entropylossL segonthepseudo ternal dataset serves as the 2D evaluation set. Evaluation pointwise labels weighted by different positive/negative results are reported in the OKS/ACC metric in 2D coordi- sampleweights. Theoveralllossforthepointnetworkis nates,afterthe3Dpredictionsareprojectedto2Dspaceby thecorrespondinglidartocameraprojections. L=L +λL (3) reg seg whereλisusedtoweighttheauxiliarysegmentationloss. Labeling: For 2D/3D keypoint labeling on the Waymo Camera Network: Similar to [36], the camera network is Open Dataset and the Internal Dataset, we adopt a defini- trainedonamean-squared-errorlosswithgroundtruth2D tion of keypoints similar to the COCO Challenge. Each heatmap. Wetrainthecameranetworkindependently,then keypoint is labeled by multiple annotators, whose results freezeitduringpointnetworktraining. areaggregatedtodeterminethefinallabel. For2Dlabeling, Notethatweonlytrainandevaluateonvisiblekeypoints. weonlylabel2Dcoordinatesofkeypointsthatarevisiblein Duringtraining,keypointlossesareonlyappliedonvisible the camera image. For occluded keypoints, we label them keypoints,whichmeanswewillnotgeneratepseudolabels as invisible. 3D labeling is similar, where we only label for occluded keypoints. In Section 4, we show that even keypoints that are visible from the point clouds. Since we trainedonvisiblekeypointsonly, themodelisabletopre- paireachLiDARwithitsclosestcamerainlocation,theoc- dictreasonablekeypointsforoccludedbodyparts.Formore clusionstatusofkeypointsismostlyconsistentbetween2D detailsoftraininglosses,pleaserefertothesupplementary and3D. material. 4.Experiments Implementation Details: For the Waymo Open Dataset and the Internal Dataset, we resize all camera images to 4.1.DataandEvaluationMetrics 256 256,andrandomlysub-sampletheinputpointcloud × TrainingData: WecollectaninternaldatasetwithRGB to a fixed size of 256 points (we did not observe obvious imagesandLiDARpointcloudssimilartotheWaymoOpen performancegainforlargernumberofpoints). Pleaserefer Dataset [27]. It consists of a total number of 197,381 tothesupplementarymaterialformoretrainingdetails.WaymoOpenDataset InternalDataset results show that our method outperforms all baselines in Methods OKS@3D MPJPE OKS@2D thecorrespondingdatasets. Wealsohavethefollowingob- ↑ ↓ ↑ camera-only[36] 51.74% 13.90cm 78.19% servations. LiDAR-only 59.58% 10.80cm 77.53% Training on pseudo labels is effective. LiDAR-only multi-modal 63.14% 10.32cm 82.94% baselineandtheproposedmethodbothoutperformcamera- only baseline on 3D metrics on the Waymo Open Dataset. Table 1. Comparison of camera-only, LiDAR-only, and multi- Sincethecamera-onlybaselineisalsotrainedon2Dlabels modal models. As described in Section 4.1, OKS@3D stands forOKS/ACCin3Devaluation,OKS@2DstandsforOKS/ACC andutilizespointcloudstoliftthepredictionsto3Dspace, in2Devaluation,andMPJPEisanotherevaluationmetricin3D. theresultsindicatethatitismoreeffectivetodirectlytraina Thesemetricsareusedthroughouttheexperiments.Theproposed 3Dhumanposemodelonpseudolabelsgeneratedfrom2D multi-modalmodelachievesthebestresultsonbothdatasets. groundtruth. Cameraimageimproves3Dprediction. Theproposed method performs better than LiDAR-only baseline on 3D metrics, which demonstrates that the information from 2D camera images helps 3D pose estimation. Table 2 shows that theproposed method outperformsbaselines on almost allbodyparts. ComparedtotheLiDAR-onlybaseline, the margins are larger for difficult body parts like elbows or wrists, which shows that texture information from camera images is especially helpful for keypoints that are hard to localize. Point cloud improves 2D prediction. LiDAR-only (a)NoCamera(LiDAR-Only) (b)Inception48x48 baselinehascomparableperformancewiththecamera-only baseline for 2D pose estimation on 2D metrics on the In- ternalDataset. Theproposedmethodsurpassesthecamera- only baseline, even if the models are not directly trained for 2D pose estimation. It shows that the depth informa- tion from 3D LiDAR point clouds also improves 2D pose estimationperformance. Modality fusion benefits from both modalities. The proposedmethodachievesthebestperformanceonallmet- ricsforbothdatasets. ItprovesthatcameraimagesandLi- DARpointcloudsprovidecomplementaryinformation,and (c)Inception64x64 (d)ResNet50256x256 modality fusion combines these sources of information to Figure 5. 3D predictions with different camera image sizes and improvetheoverallperformance. cameranetworkbackbonesfromtheWaymoOpenDataset(best Figure6showssomequalitativeresultsoftheproposed viewedincolor). ResNet50with256x256imagesizepredictsthe method on the Waymo Open Dataset. In these examples, mostaccuratekeypoints. pedestrians are either occluded (6a), in an irregular pose (6c, 6i), or carrying a large object (6g, 6e). The proposed methodaccuratelypredictsthevisiblehumankeypointsand 4.2.PerformanceAnalysis provides reasonable guesses for the occluded keypoints. Figure6kisafailurecasewherethecameraimageisblurred To show the effectiveness of the proposed method, we because of the sensor motion. It causes an inaccurate pre- comparewiththefollowingmodels. diction of the left wrist. More qualitative results can be Camera-onlymodel: weusethesamecameranetwork foundinthesupplementary. [36]astheproposedmethodtopredict2Dkeypoints. Then 2D-to-3Dkeypointliftingisimplementedbythe2D-to-3D 4.3.AblationStudies pseudolabelgenerationmethodintroducedinSection3.4.1, 4.3.1 AblationStudyonModelArchitecture thesamewayaswegeneratetraininglabels. LiDAR-onlymodel:weusetheproposedpointnetwork We conduct ablation studies to further demonstrate the ef- to predict 3D keypoints without the modality fusion, i.e. fectiveness of our key designs: the auxiliary segmentation onlyuse3Dcoordinatesofthepointcloudsasfeatures. branchandthemodalityfusionwithcameranetwork. The Experimental results on two datasets are shown in Ta- resultsareshowninTable3,whereReg. Lossmeansusing ble 1. Table 2 further shows per-keypoint results. These regressionloss(theprimaryloss)totrainthepointnetwork,camera-only LiDAR-only multi-modal parts OKS@3D OKS@2D OKS@3D OKS@2D OKS@3D OKS@2D nose 24.50% 75.10% 23.83% 56.27% 29.74% 72.17% shoulder 65.41% 83.38% 77.04% 85.68% 76.93% 87.89% elbow 65.61% 82.63% 66.61% 78.72% 72.49% 84.82% wrist 45.99% 79.03% 30.37% 64.10% 46.97% 79.17% hip 57.69% 87.97% 79.42% 90.33% 74.76% 92.37% knee 65.40% 85.91% 77.48% 86.82% 78.04% 90.05% ankle 62.68% 84.17% 69.06% 85.63% 72.30% 88.72% overall 51.74% 78.19% 59.58% 77.53% 63.14% 82.94% Table 2. Per-keypoint comparison of camera-only, LiDAR-only, and multi-modal models. OKS@3D is on the Waymo Open Dataset and OKS@2D is on the Internal Dataset. Note that the per-keypoint OKS is computed on each keypoint separately (please refer to supplementaryfordetails).Theproposedmulti-modalmodelachievesthebestresultsonmostofthekeypointtypes. Configurations WaymoOpenDataset InternalDataset Reg.Loss Seg.Loss Camera OKS@3D MPJPE OKS@2D ↑ ↓ ↑ X 59.10% 10.93cm 77.52% X X 59.58% 10.80cm 77.53% X X 62.03% 10.53cm 82.51% X X X 63.14% 10.32cm 82.94% Table 3. Ablation studies on different model architectures. The best performance is achieved by using multi-modal architecture with auxiliarysegmentationloss. Seg.Lossmeansauxiliarysegmentationbranchbeingadded the choice of image size and backbone, addtional camera (see Section 3.5), and Camera means using modality fu- images generally bring considerable improvements on el- sionwithcamerafeatures. Theresultsshowthat,byadding bow, wrist, knee and ankle. This is because merely based key features to the model, the performance improves con- onsparseandnoisyLiDARpointclouds, accuratelylocal- sistentlyonalldatasets. Wealsoobservethatthesegmen- izing these limb keypoints is difficult. Additional texture tation branch and modality fusion provide complementary informationfromcameraimagesmakesthelocalizationrel- improvements. ativelyeasier. 2)Largerimagesizehasbetterperformance on most difficult keypoints like elbow and wrist. Surpris- ingly,itperformsslightlyworsethansmallerpatchsizeson 4.3.2 AblationStudyonCameraImageSizeandCam- otherkeypoints. eraNetworkBackbone Figure5showsvisualizationsof3Dkeypointpredictions To study the effectiveness of modality fusion, experiments on a pedestrian riding a scooter from the Waymo Open areconductedwithdifferentcameraimagesizesandcamera Dataset. It is a challenging case because of the objects networkbackboneswithresultsinTable5. HereInception (backpack,scooter)attachedtothepedestrianandtheirreg- 48x48 uses an Inception [28]-inspired convolutional net- ularpose. TheLiDAR-onlymodelfailstopredictaccurate work backbone with a 48x48 image size; Inception 64x64 keypointsinFigure5a.Byintroducingmodalityfusion,im- is similar to Inception 48x48 but with a 64x64 image size; provements are observed on keypoints that are difficult to ResNet50 256x256 is the ResNet50 backbone used in the localize from sparse point clouds like those on the limbs proposedmethodwitha256x256imagesize. Fromthere- (elbow,wrist,kneeandankle). Thecameranetworkusedin sults in Table 5, we observe that, even with smaller cam- the proposed method (ResNet50 on 256x256 images) pre- erapatchsizeandshallowerbackbone,themodelstillben- dictsthemostaccuratekeypoints(Figure5d). efits from the additional camera modality. This observa- tionisconsistentwithorwithouttheauxiliarysegmentation 5.Conclusions branch. Withlargercamerapatchsizeanddeeperbackbone network,theoverallperformanceisbetter. LiDARbased3DHPEinAVdiffersfromotherapplica- Wefurtherstudiedtheeffectofdifferentimagesizesand tions for a variety of reasons including 3D resolution and networkbackbonesonper-keypointpredictionerrorsinTa- range, absence of dense depth maps, and variation in test ble 4. These experiments are all with the proposed auxil- conditions. In this paper, we propose a multi-modal 3D iarysegmentationbranch. Theresultsshowthat1)Despite HPEmodelwith2Dweaksupervisionforautonomousdriv-(a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) Figure6. ResultsontheWaymoOpenDataset. 6b,6d,6h,6lare3Dpredictions. 6a,6c,6g,6kshowthecorresponding2Dprojections overlaidoncameraimages(3Dpredictionsmaynotbeshownunderthesameviewpointasthecameraimages. Bestviewedincolor). Moreresultscanbefoundinsupplementary. sEPJPMtniopyek-reP CameraNetwork Nocamera Inception48x48 Inception64x64 ResNet50256x256 all 0.1080 0.1026 0.1028 0.1032 elbow 0.1006 0.0940 0.0931 0.0891 wrist 0.1652 0.1501 0.1473 0.1320 hip 0.1081 0.1113 0.1113 0.1205 knee 0.0944 0.0896 0.0910 0.0925 ankle 0.1163 0.1100 0.1102 0.1107 nose 0.0814 0.0762 0.0760 0.0837 shoulder 0.0850 0.0814 0.0830 0.0872 Table4.Per-keypointperformancewithdifferentcameranetworksandimagesizesontheWaymoOpenDataset.ResNet50with256x256 imagesizeperformsthebestonchallengingkeypointslikeelbowandwristwithlargemargins,butslightlyworsethansmallerimagesizes onotherkeypointtypes. ing. The model leverages both RGB camera images and pose estimation in unconstrained scenarios. Instead of us- LiDARpointcloudstotacklethechallenges of3Dhuman ing expensive 3D labels, the proposed model is trained onConfig WaymoOpenDataset InternalDataset CameraNetwork Reg. Seg. OKS@3D MPJPE OKS@2D ↑ ↓ ↑ X 59.10% 10.93cm 77.52% NoCamera X X 59.58% 10.80cm 77.53% X 61.12% 10.51cm 78.72% Inception48x48 X X 62.22% 10.26cm 79.55% X 61.05% 10.46cm 78.95% Inception64x64 X X 62.52% 10.28cm 79.44% X 62.03% 10.53cm 82.51% ResNet50256x256 X X 63.14% 10.32cm 82.94% Table5.Ablationstudiesonthedifferentcameraimagesizesandcameranetworkbackbones.ResNet50with256x256imagesizeachieves thebestperformanceingeneral. pure2Dlabels. Anauxiliarysegmentationbranchisadded [10] LiuhaoGe,ZhouRen,andJunsongYuan. Point-to-pointre- tointroducestrongersupervisiontothepointnetwork. Re- gressionpointnetfor3dhandposeestimation. InEur.Conf. sultsontheWaymoOpenDataset(withevaluationlabelsto Comput.Vis.(ECCV),2018. 3 bereleased)andourinternaldataset,andadditionalablation [11] YulanGuo,HanyunWang,QingyongHu,HaoLiu,LiLiu, studiesshowingtheeffectivenessoftheproposedmethod. and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE Transactions on Pattern Analysis andMachineIntelligence,43(12):4338–4364,2021. 2 References [12] MirRayatImtiazHossainandJ.Little. Exploitingtemporal [1] Ijaz Akhter and Michael J. Black. Pose-conditioned joint information for 3d human pose estimation. In Eur. Conf. angle limits for 3d human pose reconstruction. In IEEE Comput.Vis.(ECCV),2018. 2 Conf. Comput. Vis. Pattern Recog. (CVPR), pages 1446– [13] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian 1455,2015. 2 Sminchisescu.Human3.6m:Largescaledatasetsandpredic- [2] AnuragArnab, CarlDoersch, andAndrewZisserman. Ex- tivemethodsfor3dhumansensinginnaturalenvironments. ploiting temporal context for 3d human pose estimation in IEEETransactionsonPatternAnalysisandMachineIntelli- thewild. IEEEConf.Comput.Vis.PatternRecog.(CVPR), gence,36(7):1325–1339,jul2014. 6 2019. 2 [14] MuhammedKocabas,SalihKaragoz,andEmreAkbas.Self- [3] C.ChenandD.Ramanan. 3dhumanposeestimation=2d supervisedlearningof3dhumanposeusingmulti-viewge- pose estimation + matching. In IEEE Conf. Comput. Vis. ometry. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), PatternRecog.(CVPR),pages5759–5767,2017. 2 pages1077–1086,2019. 2 [4] Ching-Hang Chen, Ambrish Tyagi, Amit Agrawal, Dy- [15] ShichaoLi,LeiKe,KevinPratama,Yu-WingTai,Chi-Keung lan Drover, M. V. Rohith, Stefan Stojanov, and James M. Tang,andKwang-TingCheng.Cascadeddeepmonocular3d Rehg. Unsupervised 3d pose estimation with geometric human pose estimation with evolutionary training data. In self-supervision. IEEE Conf. Comput. Vis. Pattern Recog. IEEEConf.Comput.Vis.PatternRecog.(CVPR),June2020. (CVPR),pages5707–5717,2019. 2 2 [5] YuCheng,BoYang,BoWang,andRobbyT.Tan.3dhuman [16] Tsung-YiLin,MichaelMaire,SergeBelongie,JamesHays, poseestimationusingspatio-temporalnetworkswithexplicit PietroPerona,DevaRamanan,PiotrDolla´r,andC.Lawrence occlusiontraining. InAAAI,2020. 2 Zitnick.Microsoftcoco:Commonobjectsincontext.InEur. [6] SrijanDas,SauravSharma,RuiDai,FrancoisBremond,and Conf.Comput.Vis.(ECCV),pages740–755,2014. 6 MoniqueThonnat.Vpn:Learningvideo-poseembeddingfor [17] JingyuanLiu,HongboFu,andChiew-LanTai. Posetween: activitiesofdailyliving,2020. 1 Pose-driven tween animation. In Proceedings of the 33rd [7] Yang Feiyu, Song Zhan, Xiao Zhenzhong, Mo Yaoyang, Annual ACM Symposium on User Interface Software and ChenYu,PanZhe,ZhangMin,ZhangYao,QianBeibei,and Technology,UIST’20,page791–804,NewYork,NY,USA, Jin Wu. Error compensation heatmap decoding for human 2020.AssociationforComputingMachinery. 1 poseestimation. IEEEAccess,9:114514–114522,2021. 4 [18] Julieta Martinez, Rayat Hossain, Javier Romero, and [8] Michael Fu¨rst, Shriya T. P. Gupta, Rene´ Schuster, Oliver JamesJ.Little.Asimpleyeteffectivebaselinefor3dhuman Wasenmu¨ller,andDidierStricker. HPERL:3dhumanpose poseestimation. InInt.Conf.Comput.Vis.(ICCV),2017. 2 estimationfromRGBandlidar.ComputingResearchRepos- [19] Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. itory,abs/2010.08221,2020. 2 Ordinaldepthsupervisionfor3Dhumanposeestimation. In [9] Liuhao Ge, Yujun Cai, Junwu Weng, and Junsong Yuan. IEEEConf.Comput.Vis.PatternRecog.(CVPR),2018. 2 Handpointnet: 3dhandposeestimationusingpointsets. In [20] DarioPavllo,ChristophFeichtenhofer,DavidGrangier,and IEEEConf.Comput.Vis.PatternRecog.(CVPR),2018. 3 MichaelAuli. 3dhumanposeestimationinvideowithtem-poral convolutions and semi-supervised training. In IEEE asingleimage. InIEEEConf.Comput.Vis.PatternRecog. Conf.Comput.Vis.PatternRecog.(CVPR),2019. 2 (CVPR),pages2369–2376,2014. 2 [21] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and [34] Chung-Yi Weng, Brian Curless, and Ira Kemelmacher- LeonidasJ.Guibas. Frustumpointnetsfor3dobjectdetec- Shlizerman. Photo wake-up: 3d character animation from tion from rgb-d data. In IEEE Conf. Comput. Vis. Pattern asinglephoto,2018. 1 Recog.(CVPR),June2018. 3 [35] NoraS.Willett,HijungValentinaShin,ZeyuJin,WilmotLi, [22] CharlesRQi,HaoSu,KaichunMo,andLeonidasJGuibas. andAdamFinkelstein. Pose2Pose:Poseselectionandtrans- Pointnet: Deep learning on point sets for 3d classification ferfor2Dcharacteranimation. In25thInternationalCon- andsegmentation. arXivpreprintarXiv:1612.00593,2016. ference on Intelligent User Interfaces (IUI 2020), page 12, 2,4 Mar.2020. 1 [23] CharlesRQi,LiYi,HaoSu,andLeonidasJGuibas. Point- [36] BinXiao,HaipingWu,andYichenWei.Simplebaselinesfor net++: Deephierarchicalfeaturelearningonpointsetsina humanposeestimationandtracking. InEur.Conf.Comput. metricspace. arXivpreprintarXiv:1706.02413,2017. 3 Vis.(ECCV),2018. 3,6,7 [24] VarunRamakrishna,TakeoKanade,andYaserSheikh. Re- [37] WeiYang,WanliOuyang,XiaolongWang,JimmyS.J.Ren, constructing 3d human pose from 2d image landmarks. HongshengLi,andXiaogangWang. 3dhumanposeestima- In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, tioninthewildbyadversariallearning.IEEEConf.Comput. YoichiSato,andCordeliaSchmid,editors,Eur.Conf.Com- Vis.PatternRecog.(CVPR),pages5255–5264,2018. 2 put.Vis.(ECCV),pages573–586,Berlin,Heidelberg,2012. SpringerBerlinHeidelberg. 2 [38] JiamingYingandXuZhao. Rgb-dfusionforpoint-cloud- based3dhumanposeestimation. InIEEEInt.Conf.Image [25] HelgeRhodin,MathieuSalzmann,andPascalFua.Unsuper- Process.(ICIP),pages3108–3112,2021. 2 visedgeometry-awarerepresentationlearningfor3dhuman poseestimation. InEur.Conf.Comput.Vis.(ECCV),2018. [39] AilingZeng,X.Sun,F.Huang,MinhaoLiu,QiangXu,and 2 StephenLin. Srnet: Improvinggeneralizationin3dhuman [26] H. Rhodin, Jo¨rg Spo¨rri, Isinsu Katircioglu, V. Constantin, poseestimationwithasplit-and-recombineapproach. Eur. F. Meyer, E. Mu¨ller, M. Salzmann, and P. Fua. Learn- Conf.Comput.Vis.(ECCV),abs/2007.09389,2020. 2 ing monocular 3d human pose estimation from multi-view [40] Haotian Zhang, Cristobal Sciutto, Maneesh Agrawala, and images. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), KayvonFatahalian. Vid2player: Controllablevideosprites pages8437–8446,2018. 2 thatbehaveandappearlikeprofessionaltennisplayers.ACM [27] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Trans.Graph.,40(3),may2021. 1 Chouard,VijaysaiPatnaik,PaulTsui,JamesGuo,YinZhou, [41] Z.Zhang, L.Hu, X.Deng, andS.Xia. Weaklysupervised Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, adversariallearningfor3dhumanposeestimationfrompoint Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et- clouds. IEEETransactionsonVisualizationandComputer tinger,MaximKrivokon,AmyGao,AdityaJoshi,YuZhang, Graphics,26(5):1851–1859,2020. 3 Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. [42] Zhe Zhang, Chunyu Wang, Weichao Qiu, Wenhu Qin, and Scalability in perception for autonomous driving: Waymo WenjunZeng. Adafuse: Adaptivemultiviewfusionforac- open dataset. In IEEE Conf. Comput. Vis. Pattern Recog. curate human pose estimation in the wild. Int. J. Comput. (CVPR),June2020. 1,2,6 Vis.(IJCV),pages1–16,2020. 2 [28] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, [43] Ce Zheng, Wenhan Wu, Taojiannan Yang, Sijie Zhu, D.Anguelov,D.Erhan,V.Vanhoucke,andA.Rabinovich. Chen Chen, Ruixu Liu, Ju Shen, Nasser Kehtarnavaz, and Goingdeeperwithconvolutions.InIEEEConf.Comput.Vis. MubarakShah.Deeplearning-basedhumanposeestimation: PatternRecog.(CVPR),pages1–9,2015. 8 Asurvey.ComputingResearchRepository,abs/2012.13392, [29] BugraTekin,PabloMarquez-Neila,MathieuSalzmann,and 2020. 1,2 Pascal Fua. Learning to fuse 2d and 3d image cues for [44] XingyiZhou,QixingHuang,XiaoSun,XiangyangXue,and monocularbodyposeestimation. InInt.Conf.Comput.Vis. YichenWei. Towards3dhumanposeestimationinthewild: (ICCV),pages3961–3970,102017. 2 A weakly-supervised approach. In Int. Conf. Comput. Vis. [30] Denis Tome, Chris Russell, and Lourdes Agapito. Lifting (ICCV),Oct2017. 2 from the deep: Convolutional 3d pose estimation from a single image. In IEEE Conf. Comput. Vis. Pattern Recog. [45] Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, Kon- (CVPR),July2017. 2 stantinos G. Derpanis, and Kostas Daniilidis. Sparseness meetsdeepness: 3dhumanposeestimationfrommonocular [31] Shashank Tripathi, Siddhant Ranade, Ambrish Tyagi, and video. In 2016 IEEE Conference on Computer Vision and Amit Agrawal. Posenet3d: Unsupervised 3d human shape PatternRecognition(CVPR),pages4966–4975,2016. 2 andposeestimation. 2020. 2 [32] BastianWandtandBodoRosenhahn.Repnet:Weaklysuper- [46] YinZhouandOncelTuzel. Voxelnet: End-to-endlearning visedtrainingofanadversarialreprojectionnetworkfor3d forpointcloudbased3dobjectdetection,2017. 4 humanposeestimation. InIEEEConf.Comput.Vis.Pattern [47] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Recog.(CVPR),June2019. 2 Efros. Unpaired image-to-image translation using cycle- [33] ChunyuWang,YizhouWang,ZhouchenLin,AlanL.Yuille, consistentadversarialnetworks. InInt.Conf.Comput.Vis. andWenGao. Robustestimationof3dhumanposesfrom (ICCV),2017. 2[48] ChristianZimmermann, TimWelschehold, ChristianDorn- hege,WolframBurgard,andThomasBrox. 3dhumanpose estimation in rgbd images for robotic task learning. In IEEEInternationalConferenceonRoboticsandAutomation (ICRA),2018. 1,2,4Supplementary Material for Multi-Modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving JingxiaoZheng1 XinweiShi1 AlexanderGorban1 JunhuaMao1 YangSong1 CharlesR.Qi1 TingLiu2 ViseshChari1 AndreCornman1 YinZhou1 CongcongLi1 DragomirAnguelov1 1 WaymoLLC 2 GoogleResearch jingxiaozheng, xinweis, gorban, junhuamao, yangsong, rqi @waymo.com, { } liuti@google.com, visesh, cornman, yinzhou, congcongli, dragomir @waymo.com { } 1.TrainingLossesforSection3.5 heatmapas Point Network: The training loss for the regression 1 H0,W0 K branchisaweightedHuberlossdefinedas L = v (h g )2 (4) cam H W K k i,j,k − i,j,k 0 0 i,j=1 k=1 1 K yˆ(3) y˜(3) X X L reg = K v kr kL Huber k s− k (1) where v k is the visibility label of keypoint k, and g i,j,k is k ! k X=1 the ground truth heatmap generated by Gaussian functions centeredat2Dgroundtruthkeypoints. Wetrainthecamera whereyˆ(3) isthe3Dpredictionfromthemodel,y˜(3) isthe k k networkindependently,thenfreezeitduringpointnetwork pseudo3DlabelbylabelgenerationinEquation(1)inthe training. mainpaper,v isthevisibilitylabelofkeypointk(0-1val- k ued), r is the label reliability (Section 3.4.1 in the main k 2.MetricsforSection4.1 paper), and s is the scaling factor of keypoint k. The k loss is only applied on visible keypoints with v being 1. 2.1.OKS/ACCMetric k Keypoints that need to be more accurately localized have This paper focuses on pose estimation instead of key- smallerscaless (thereforelargerweights)duringtraining. k point detection by assuming that the person has been suc- Thelossforthesegmentationbranchisaweightedcross- cessfully detected and there is exactly one estimated pose entropylossdefinedas foreachground-truthpose. Therefore,insteadofusingthe OKS/AP metric defined in COCO keypoint challenge [1], N K 1 L = v w l logp + weintroduceamodifiedOKS/ACCmetricforevaluation: seg K k { pos ik ik Xi=1k X=1 1 N w neg(1 −l ik)log(1 −p ik) } (2) ACCOKS=t = N 1 {OKS n ≥t } (5) n=1 X wherev isthevisibilitylabelofkeypointk,andw and k pos w are the weights balancing the positive and negative wheretisthethresholdonOKS,N isthetotalnumberof neg samples. w pos is usually much larger than w neg since for samplesinthetestset,andOKS n istheOKSofprediction eachkeypointtherearemuchmorepointswithnegativela- on sample n. In our experiments we averaged OKS/ACC belsthanpointswithpositivelabels. overtfrom0.5to0.95withastep-sizeof0.05. Theoveralllossforthepointnetworkis 2.2.Per-keypointOKS L=L reg+λL seg (3) Per-keypoint OKS is defined as OKS [1] for one key- pointtype. Forkeypointtypei, whereλisusedtoweightheauxiliarysegmentationloss. Camera Network: Similar to [2], the camera network is exp( d2/2s2k2)δ(v >0) OKS = − i i i (6) trainedonamean-squared-errorlosswithgroundtruth2D δ(v >0)+(cid:15) i 1 1202 ceD 22 ]VC.sc[ 1v14121.2112:viXrawhere d is the distance between ground truth and predic- tation branch refines the predictions for both LiDAR-only i tion,v isvisibilityofthegroundtruth,sistheobjectscale, and multi-modal architectures. Similar to the observations i k isaper-keypointconstant,and(cid:15)isasmallnumberhere from Table 3 in the main paper, by adding key features to i topreventzerodenominator. themodel,thepredictionaccuracyimprovesconsistently. Figure 2 gives additional qualitative results in cases of 3.ImplementationDetailsforSection4.1 occlusion. In each of these cases, we can see how adding camera information to LiDAR provides a big boost, espe- Thefollowingaresomedetailsontraining. Forthecam- cially for identifying the individual limbs of the subject in eranetwork,weresizeallinputimagesto256 256. The question. This is understandable, since with occlusion, it × outputheatmapsizeis64 64 13(13keypointsarepre- isoftendifficulttoisolateLiDARpointcloudsofaperson × × dicted). ThecameranetworkistrainedwithanAdamopti- from their immediate surrounding. Figure 3 shows more mizerandbatchsize32 32for40000iterations. Theini- qualitativeresults,highlightingdifferencesbetweenvarious × tiallearningrateis1 10 −4andisdecayedby0.1at20000 architectures. Notethatevenwhenthehumanisheavilyoc- × and30000iterations. Randomaugmentationisapplieddur- cluded (last row), our approach can get reasonable results ingtraining,soeachinputimageisrandomlyrotated,scaled fromthelearnedpriors. or flipped. The heatmap is furthered smoothed by a 7 7 × Gaussiankernelwithσ =3. References For the point network, we sub-sample the input point cloudtoafixedsizeof256points. Weonlyusethe3Dco- [1] Keypoint evaluation metrics used by coco. https:// cocodataset.org/#keypoints-eval. 1 ordinates of points as point feature, which is concatenated [2] BinXiao,HaipingWu,andYichenWei. Simplebaselinesfor with the 13-dimensional camera feature from the camera humanposeestimationandtracking. 2018. 1 network to perform modality fusion. We set λ = 0.1 for the segmentation task (Equation (3) in the main paper). Thenetworkistrainedfor100000iterations,withanSGD optimizer and batch size 128. The initial learning rate is 1 10 3 andisdecayedbycosinedecay. Duringtraining, − × inputpointcloudsarerotatedintheX-Yplanebyarandom anglein[0,2π)asdataaugmentation. 4.QualitativeResults Figures 1, 2 , and 3 show qualitative results from the Waymo Open Dataset. Figure 1 compares different model architectures corresponding to Table 3 in the main paper. Row 1 shows the input camera image and LiDAR point cloud. StartingfromRow2,Columns1and3showthe2D projections of 3D predictions overlaid on camera images, and Columns 2 and 4 show 3D predictions. Rows 2 to 5 correspondtorowsinTable3inthemainpaper,whichare LiDAR-onlymodelwithoutsegmentationbranch, LiDAR- only model with segmentation branch, multi-modal model without segmentation branch and multi-modal model with segmentationbranch,respectively. Due to the objects (e.g., backpack, scooter, bike) at- tachedto thepedestrianandthe poseofthe legs, the input LiDAR point cloud looks different from a regular pedes- trian, which poses challenges for LiDAR-only 3D HPE. From the results, we can see that it is difficult to predict accuratekeypoints(especiallylowerbodykeypoints)from LiDARpointcloudonly(Rows2and3). Byutilizingtex- ture information from the camera image, multi-modal ar- chitecturesshowmuchbetterperformance(Rows4and5) on all keypoints. On the other hand, comparing Rows 2, 4withRows3,5respectively,weseethataddingsegmen-Figure1.QualitativeresultsfromWaymoOpenDataset,comparingdifferentmodelarchitecturessimilartoTable3inthemainpaper.Row 1istheinputcameraimageandLiDARpointcloud. StartingfromRow2,Columns1and3showthe2Dprojectionsof3Dpredictions overlaidoncameraimages;Columns2and4show3Dpredictions. Row2to5correspondtoLiDAR-onlymodelwithoutsegmentation branch, LiDAR-only model with segmentation branch, multi-modal model without segmentation branch and multi-modal model with segmentationbranch,respectively. SimilartotheobservationsfromTable3inthemainpaper,byaddingkeyfeaturestothemodel,the predictionaccuracyimprovesconsistently.Bestviewedincolor.(a)LiDARonly (b)LiDAR+Segmentation (c)Multi-modalonly (d)Multi-modal+Segmentation Figure2.AdditionalqualitativeresultsontheWaymoOpenDataset,showingtheimprovementthatourapproachbringsoverLiDAR-only model. Thecolumnsineachrowshow: a)LiDAR-onlymodelwithoutsegmentationbranch; b)LiDAR-onlymodelwithsegmentation branch;c)multi-modalmodelwithoutsegmentationbranch;andd)multi-modalmodelwithsegmentationbranch. Notethatineachcase thereiseitherself-orotherformsofocclusionthatdeterioratesLiDARonlyresults. Whilesegmentationandcameraeachcanprovide someadditionalclue,combiningeverythingproducesthebestresult.Bestviewedincolor.(a)LiDARonly (b)LiDAR+Segmentation (c)Multi-modalonly (d)Multi-modal+Segmentation Figure3. MorequalitativeresultsfromtheWaymoOpenDataset,highlightingdifferencesbetweenvariousarchitectures. Notethateven whenthehumanisheavilyoccluded(lastrow),ourapproachcangetreasonableresultsfromthelearnedpriors(inthiscase,ridingabike). Bestviewedincolor.