HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving

Andrei Zanfir† Mihai Zanfir† Alexander Gorban‡ Jingwei Ji‡ Yin Zhou‡ Dragomir Anguelov‡ Cristian Sminchisescu†
†Google Research {andreiz, mihaiz, sminchisescu}@google.com
‡Waymo Research {gorban, jingweij, yinzhou, dragomir}@waymo.com

Abstract: Autonomous driving is an exciting new industry, posing important research questions. Within the perception module, 3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians. While hardware systems and sensors have dramatically improved over the decades – with cars potentially boasting complex LiDAR and vision systems and with a growing expansion of the available body of dedicated datasets for this newly available information – not much work has been done to harness these novel signals for the core problem of 3D human pose estimation. Our method, which we coin HUM3DIL (HUMan 3D from Images and LiDAR), efficiently makes use of these complementary signals in a semi-supervised fashion, and outperforms existing methods by a large margin. It is a fast and compact model for onboard deployment. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages. Quantitative experiments on the Waymo Open Dataset support these claims, where we achieve state-of-the-art results on the task of 3D pose estimation.

Keywords: autonomous driving, perception, human pose, key points, skeletal representation

1 Introduction

Robotic systems which operate in environments with humans are required to avoid collisions with people and benefit from analysing their actions and forecasting future behaviors. Human pose understanding is a well established research direction in computer vision with numerous industrial applications [1, 2, 3, 4]. In this work we focus on 3D human pose understanding for the autonomous vehicle (AV) industry and robotic applications in general.

Safety is the top priority for the AV industry. Many robotic platforms use sensors in different modalities (e.g.
cameras, LiDARs, radars, audio, etc.) to improve safety by analyzing more signals about the environment. Using RGB cameras coupled with LiDAR sensors could be considered a standard sensor suite for most robotic platforms. While many studies have shown impressive results for estimating human poses using RGB imagery, there is a paucity of methods which can effectively use both modalities [5].

Recent studies have made great headway into estimating 3D human pose in controlled environments [1], but many real-world and safety-critical robotic applications require estimating human poses in uncontrolled environments, where subjects may be captured under different levels of occlusion, from various perspectives and across arbitrary ranges. There are several approaches to estimate 3D human poses in uncontrolled environments [5], but they have insufficient accuracy to unlock their full potential, especially for AV applications.

Another limiting factor for robotic applications of available methods of 3D pose detection is their computational complexity. There are fast methods for detecting a 3D bounding box which contains a single person [6, 7]. Thus most robotic applications represent humans with 3D bounding boxes. While this crude representation of people allows such applications to meet basic safety requirements and avoid collisions, it is not sufficient for understanding complex human body gestures. There are methods which output feature-rich representations and estimate parameters of full body meshes [8], but they are relatively slow. Representing 3D human body pose as a sequence of locations of 3D key points inside the body could be considered a balanced trade-off between fast-to-compute boxes and slow full-body models. The main goal of this work is to provide a fast method for human pose estimation in uncontrolled environments which efficiently uses sensor modalities common in the AV industry and outputs representations rich enough to enable analysis of complex human behaviors.

One of the key factors enabling research and development of methods for human pose understanding is the availability of high-quality ground truth 3D data with human poses and sensor data.
There are a few ways to collect such data: marker or markerless camera-based motion capture systems (suitable for controlled indoor environments); IMU-based motion capture systems (suitable for both indoor and outdoor, but also controlled, environments) [1]; and manual human labeling (suitable for all environments, but expensive and error-prone). To the best of our knowledge, there is no large-scale outdoor dataset with human poses collected in an uncontrolled environment with ground truth 2D and 3D keypoints with RGB and LiDAR for a fully supervised training mode. Waymo recently released a version of their Waymo Open Dataset (WOD v1.3.2) with a large amount of camera (2D) keypoints and a small amount of laser (3D) keypoints (enough for fine-tuning and evaluation purposes), which is suitable for weakly and/or semi-supervised training modes. In this work we use the WOD v1.3.2 dataset to demonstrate that our method can reliably predict 3D human pose in uncontrolled and challenging AV scenarios, and we compare our approach with several state-of-the-art methods [9, 10, 11] after adapting them for multi-modal applications.

We propose HUM3DIL, a light-weight 3D human joints prediction network that leverages RGB information together with LiDAR points, in a novel fashion, by computing pixel-aligned [12] multi-modal features with the 3D positions of the LiDAR signal. These features are then used by subsequent Transformer-based refinement stages to produce the desired 3D joints. We train our model in a semi-supervised manner, to maximize the utility of both 2D annotations (less expensive to collect, available in larger volumes) and 3D labels (expensive to collect, accurate, but with limited coverage). Quantitative results on the Waymo Open Dataset indicate state-of-the-art performance. Being accurate, fast and lightweight, HUM3DIL can be deployed into onboard autonomous driving systems to provide real-time perception signals of human road users. We believe downstream tasks can greatly benefit from these fine-grained signals.
Related Work: There is a considerable amount of prior work in 3D human pose reconstruction, mostly focused on estimation from RGB images alone. There are two main classes of methods, the first of which is model-based [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] and relies on statistical human body models like SMPL [24] or GHUM [25]. These methods do not estimate 3D pose directly, but instead regress the parameters of a statistical body model, which has built-in anatomical and kinematic constraints. This leads to more natural predictions, with body shape usually estimated as well, even for poses not encountered during training. The second class of methods is skeleton-based [26, 27, 28, 22], where the 3D pose is represented by 3D joint positions and these are regressed or detected directly from the input. These methods have the advantage of usually being more accurate and faster, but they are not guaranteed to produce anatomically correct human skeletons (e.g. the left arm may be reconstructed with a different length than the right arm). Our proposed model falls in the latter category, as our goal is to estimate pedestrians as accurately as possible in real-time. Mixed approaches have recently started to emerge, e.g. [9], where 3D positions are inferred directly, but anatomically regularized through a statistical model.

There are works that use depth information, either separately or in combination with an RGB image, to reconstruct the 3D pose [29, 30, 31]. Approaches that utilize LiDAR information are hard to come by, mostly because of the lack of ground-truth 3D human pose paired with LiDAR data. There are a few datasets that, to different degrees, do provide 3D annotations. The PedX dataset [32] offers 14,000 3D automatic pedestrian annotations obtained using model fitting on different modalities, gathered from three different real-world scenes. The Waymo Open Dataset [33] has a similar amount of 3D annotations as [32], but it features many more different environments (2,030 real scenes, 7,650 different people) with high-quality 2D and 3D manual annotations.
Even with the existence of these datasets, the few works on 3D pose reconstruction published in this space mostly rely on weak supervision, by lifting 2D pose information into 3D. [34] trains on 2D ground-truth pose annotations and uses a reprojection loss for the 3D pose regression task. [11] creates pseudo ground-truth 3D joint positions from the projection of annotated 2D joints, by considering neighboring LiDAR points in the projection space. During training, they directly compare the predicted 3D positions against this pseudo ground-truth.

2 Methodology

2.1 Problem Formulation

We focus on the task of keypoint localization and formulate the problem as estimating the 3D locations of a set of keypoints Y ∈ R^{N_j×3} inside the human body, given the ground truth or predicted bounding box of a human as well as multi-modal inputs from camera and LiDAR sensors: an RGB camera image I ∈ [0,1]^{H×W×3} and a point cloud P ∈ R^{N_p×3}, consisting of N_p LiDAR points from a single scan.

Camera Model. For correct LiDAR-point-to-camera-image projections, we use a differentiable implementation of the Waymo Open Dataset [33] camera model, with the rolling shutter effect compensated. We denote the intrinsic information (e.g. lens distortion, shutter speed, focal length, focal point, etc.) as a vector K_i, and the extrinsic parameters (e.g. vehicle pose, camera pose, linear and rotation speed) as K_e. The complete camera information will be denoted as K = [K_e | K_i] ∈ R^{1×N_k}. The associated camera image projection operator will be denoted by Π(∗, K), where a 3D input will be correctly projected into the image space.

2.2 HUM3DIL

Our network, deemed HUM3DIL (see fig. 1), receives as input the RGB camera image I, the LiDAR point cloud P, the camera intrinsics K, the paired 2D and 3D bounding boxes, and outputs the predicted 3D human keypoints Y. Our goal is to use the image input to better exploit or disambiguate the structure present in the 3D point cloud, while at the same time using the 3D points to anchor imagery evidence. Because we have access to the ground-truth camera intrinsics, we can move between 3D space and 2D image space by projection, and vice-versa by back-projection.
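To illustrate the role of the projection operator Π(∗, K), the following is a minimal pinhole-projection sketch in NumPy; the actual WOD camera model additionally handles lens distortion, extrinsics and rolling-shutter compensation, and the intrinsics fx, fy, cx, cy below are placeholder values, not taken from the dataset.

```python
import numpy as np

def project_pinhole(points_3d, fx, fy, cx, cy):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates.

    A simplified pinhole stand-in for Pi(*, K); the real WOD model
    also compensates lens distortion and rolling shutter.
    """
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = fx * x / z + cx  # horizontal pixel coordinate
    v = fy * y / z + cy  # vertical pixel coordinate
    return np.stack([u, v], axis=-1)

# Toy points: one on the optical axis, one off-axis.
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.25, 1.0]])
uv = project_pinhole(pts, fx=1000.0, fy=1000.0, cx=640.0, cy=480.0)
```

A point on the optical axis projects exactly onto the principal point (cx, cy), which is a quick sanity check for any such implementation.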
Motivated by recent advancements in 3D pose estimation and point-cloud processing [36, 37, 9], we use a Transformer-based architecture [38] to process 3D and image-based structural information at the same time, in contrast to typical approaches which use disconnected PointNet 3D [39, 40] embeddings and/or image features.

Enriching LiDAR points with image evidence. We construct a depth image D ∈ R^{H×W×1} by projecting the LiDAR point cloud P into image space. We concatenate it with the raw RGB image, [I; D], and pass the derived tensor through a convolutional architecture. Thus, we simultaneously inform the convolutional layers of the regions of interest in the image, i.e. sparse locations on the silhouette of the person, and make depth information available from the start. Adding the depth map channel helps disambiguate the task of keypoint prediction in cases with heavy occlusions or in crowded scenes where multiple people are in the frame: the depth map channel will have nonzero values only for the person of interest. For the convolutional architecture, we employ a lightweight U-Net network [41] that outputs a dense feature map representation F ∈ R^{H×W×D_f}. This map is queried, based on the projection of the LiDAR points, by bilinear interpolation:

F_i = F[Π(P_i, K)] ∈ R^{1×D_f}    (1)

for each point P_i ∈ P. We thus obtain pixel-aligned image features for the LiDAR points.

LiDAR points embedding. Aside from the per-point depth-infused image features, we also process the initial 3D LiDAR points. We use Random Fourier Features (RFF) [42] to embed P in a higher-dimensional space, capturing high-frequency behaviour of the signal. We use a random Gaussian matrix B ∈ R^{3×D_p/2}, with each entry independently drawn from a normal distribution N(0, σ²). The transformed points will have the form P̃ = [cos(2πPB); sin(2πPB)] ∈ R^{N_p×D_p}.

Figure 1: Overview of our proposed HUM3DIL architecture.
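The two per-point operations above, bilinear sampling of pixel-aligned image features (eq. 1) and the Random Fourier embedding, can be sketched in NumPy as follows; the toy feature map and dimensions (D_f = 7, D_p = 16) are illustrative, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_embed(P, B):
    """Random Fourier Features: P (Np x 3), B (3 x Dp/2) -> (Np x Dp)."""
    proj = 2.0 * np.pi * P @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

def bilinear_sample(F, uv):
    """Sample a dense feature map F (H x W x Df) at continuous pixel
    locations uv (Np x 2), yielding pixel-aligned features (Np x Df)."""
    H, W, _ = F.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    return (F[v0, u0] * (1 - du) * (1 - dv) + F[v0, u1] * du * (1 - dv)
            + F[v1, u0] * (1 - du) * dv + F[v1, u1] * du * dv)

sigma = 10.0
P = rng.normal(size=(5, 3))                # toy LiDAR points
B = rng.normal(scale=sigma, size=(3, 8))   # Dp = 16
P_tilde = fourier_embed(P, B)
F = rng.normal(size=(4, 6, 7))             # toy feature map, H=4, W=6, Df=7
uv = np.array([[1.5, 2.5], [0.0, 0.0], [5.0, 3.0], [2.2, 1.1], [4.9, 0.4]])
feats = bilinear_sample(F, uv)             # pixel-aligned features, (5, 7)
```

By construction, each cos/sin pair of the embedding satisfies cos² + sin² = 1, and sampling at an integer pixel location reduces to a direct lookup into F.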
It estimates the 3D joint positions of a single person from a multi-modal input representation. We encode the LiDAR points P through a Random Fourier Embedding [35] to produce representations P̃. The LiDAR points are also used to compute a depth-map representation D, which is concatenated with the input RGB image I. We first use an image feature extractor (U-Net) that acts on the concatenation of D and I. The projected LiDAR points then read features from the produced map F. We construct a token sequence of size equal to the number of points. Each token initially carries the image features, camera intrinsics and Random Fourier Embeddings. L Transformer Encoder stages act on the tokens. We read the final N_j tokens and regress the 3D joints through an MLP.

Transforming the LiDAR points. In order to regress Y, we employ a Transformer architecture [43]. We define the i-th input token M_i ∈ R^{1×D}, with D = D_f + D_p + N_k, as:

M_i = [K, F_i, P̃_i],    (2)

the concatenation of the camera intrinsics, the per-point image feature and the per-point Fourier representation. We apply the Transformer on a fixed sequence of N_p tokens. In order to work with a variable number of tokens, we use a fixed maximum size for the token sequence, N_p^max. We shuffle and trim excess points, and pad with zeros if we have fewer points. The complete input sequence is M ∈ R^{N_p^max×D}. This sequence is first linearly embedded using a learnable matrix E ∈ R^{D×D_0}. Here, D_0 is the operating dimensionality of the Transformer architecture. We additionally concatenate learnable joint tokens, M^J ∈ R^{N_j×D_0}. Similar to [9], we use a cascaded block of L Transformer encoder layers, and collect the predicted 3D keypoints Ỹ from an MLP applied on the transformed joint tokens:

M^0 = [M^J; ME]    (3)
M^l = TL_l(M^{l−1})    (4)
Ỹ = MLP(M^{L−1}_{0...N_j})    (5)

Losses. Labeling 3D keypoints is significantly more expensive and slower than labeling 2D keypoints in uncontrolled real-world environments. As a result, we usually collect a dataset containing many more 2D annotations than 3D labels.
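The token assembly of eqs. (2)-(3), with its shuffle/trim/pad handling of a variable point count, can be sketched at the shape level as follows; the toy dimensions stand in for the paper's N_p^max = 1024 and D_0 = 256, and the joint tokens and embedding matrix E are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

Np_max, Nj = 8, 4          # max LiDAR tokens, number of joints
Nk, Df, Dp = 5, 7, 6       # camera-, image-, Fourier-feature dims
D = Nk + Df + Dp           # per-point token size before embedding
D0 = 16                    # Transformer operating dimensionality

def build_tokens(K, F_pts, P_tilde, E, joint_tokens):
    """Assemble the input sequence: shuffle and trim to Np_max points,
    zero-pad if fewer, form [K, F_i, P_tilde_i] per point, linearly
    embed with E, and prepend the learnable joint tokens."""
    n = F_pts.shape[0]
    idx = rng.permutation(n)[:Np_max]                         # shuffle + trim
    M = np.concatenate([np.tile(K, (len(idx), 1)),
                        F_pts[idx], P_tilde[idx]], axis=-1)   # (n', D)
    if M.shape[0] < Np_max:                                   # zero padding
        M = np.pad(M, ((0, Np_max - M.shape[0]), (0, 0)))
    return np.concatenate([joint_tokens, M @ E], axis=0)      # (Nj+Np_max, D0)

K = rng.normal(size=(Nk,))
F_pts = rng.normal(size=(6, Df))            # 6 points < Np_max, so padded
P_tilde = rng.normal(size=(6, Dp))
E = rng.normal(size=(D, D0))
joint_tokens = rng.normal(size=(Nj, D0))
M0 = build_tokens(K, F_pts, P_tilde, E, joint_tokens)
```

The resulting sequence M^0 has Nj + Np_max tokens; the zero-padded rows stay zero after the linear embedding, so they carry no signal into the encoder.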
To maximize the utility of all available labels (2D and 3D) and boost performance, we use a mixture of weakly and fully supervised losses to train our model. We denote the ground-truth 2D joint keypoints as y ∈ R^{N_j×2} and define the 2D reprojection and 3D reconstruction losses as:

L_2D = (1/N_j) Σ_{i=1}^{N_j} ||y_i − Π(Ỹ_i, K)||_2    (6)

L_3D = (1/N_j) Σ_{i=1}^{N_j} ||Y_i − Ỹ_i||_2    (7)

where ||∗||_2 is the ℓ2 vector norm, i.e. the more robust Euclidean distance between predictions. We add a small ε during training, as the function is not differentiable at 0. Our final loss is given by:

L = L_3D + λ L_2D    (8)

where λ is a scalar factor used to weight the two losses.

Semi-supervised support. In order to efficiently train with mixed 2D and 3D labels, we also set a training batch to contain a pre-defined fraction of 2D-to-3D annotations. This allows the network to not forget about 3D when the dataset is drastically biased towards 2D annotations. We show a study on the effect of the percentage of 3D samples and the loss balancing factor in figure 4, left.

3 Experimental Results

Datasets. Waymo Open Dataset v1.3.2 (WOD) [33] contains RGB and LiDAR range images capturing various road users. Recently, camera (2D) and laser (3D) keypoint annotations on a portion of the human subjects (pedestrians and cyclists) in WOD have been released, namely the Waymo Human KeyPoints dataset v1.3.2 (WHKP). We benchmark HUM3DIL and other baselines on the WHKP. As the official WOD/WHKP testing subset is hidden from the public, we randomly select 50% of subjects from the WOD validation subset as our validation split, and the remaining 50% fall into the testing split for benchmarking. In our experiments, we use the ground-truth camera and LiDAR bounding boxes during training and evaluation, for two reasons: a) disentangling the evaluation of keypoint localization from the object detection task; b) setting an easy-to-reproduce baseline for future research works.

subset       #subjects   total samples   w/ 2D keypoints   w/ 3D keypoints   w/ both
training     6999        149683          144866            9472              4655
validation   1651        28614           27382             2137              905

Table 1: Human KeyPoints in WOD v1.3.2.
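The mixed loss of eqs. (6)-(8), with the ε-stabilized Euclidean distance, can be sketched as below; the orthographic `ortho_project` is a toy stand-in for Π(∗, K), and all shapes and values are illustrative.

```python
import numpy as np

EPS = 1e-8  # keeps the sqrt differentiable at zero, as in eq. (6)-(7)

def robust_l2(a, b):
    """Per-joint Euclidean distance with a small eps, averaged over joints."""
    return np.mean(np.sqrt(np.sum((a - b) ** 2, axis=-1) + EPS))

def total_loss(Y_pred, Y_gt_3d, y_gt_2d, project, lam=1e-2):
    """L = L_3D + lambda * L_2D; `project` plays the role of Pi(*, K)."""
    loss_3d = robust_l2(Y_pred, Y_gt_3d)
    loss_2d = robust_l2(project(Y_pred), y_gt_2d)
    return loss_3d + lam * loss_2d

def ortho_project(Y):
    # Toy orthographic stand-in for the camera projection operator.
    return Y[:, :2]

Y_gt = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 2.0]])       # two joints, meters
Y_pred = Y_gt + np.array([[0.3, 0.0, 0.0], [0.0, 0.4, 0.0]])
loss = total_loss(Y_pred, Y_gt, ortho_project(Y_gt), ortho_project)
```

With in-plane errors of 0.3 m and 0.4 m, both L_3D and L_2D come out to 0.35 here, so with λ = 1e−2 the total is ≈ 0.3535, showing how the 2D term acts as a small corrective weight on top of the 3D term.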
Metrics. Where applicable, we report two metrics: the mean per-joint position error (MPJPE) between predicted and ground-truth 3D joints, and a similar one for the 2D case. As the ground truth is not available for every frame, or even for every joint, we use a visibility indicator v_i^j ∈ {0,1}. This signals whether we have a ground-truth annotation for a particular joint i of a particular testing sample j. The MPJPE over a particular dataset then has the value:

(1 / Σ_{i,j} v_i^j) Σ_{i,j} v_i^j ||Y_i^j − Ỹ_i^j||_2    (9)

Evaluation on WHKP. We train several methods on the WHKP training subset: ours (HUM3DIL); THUNDR [9], reimplemented following the original paper; THUNDR with an additional depth image as input; ContextPose [10], using publicly available code; and the multi-modal approach of [11], for which we report their number on a similar version of the dataset. These are all state-of-the-art methods in 3D pose estimation for single persons from an RGB image. We tried to make comparisons against pure RGB image methods as fair as possible, but there is a modeling gap that cannot be bridged: our architecture naturally exploits the LiDAR signal with ease. Also note that we have 1/5 to 1/14 of the number of parameters of competing methods. We report results on the test split in Table 2.

Implementation details. In all our experiments we use a U-Net backbone [41] with randomly initialized weights. The encoder/decoder convolutional filter sequences are [32, 64, 128, 256] and [256, 128, 64, 32], respectively. The backbone has 2,095,392 parameters. For the Transformer architecture, we use L = 4 stages, an embedding size of 256 and 8 heads for the MultiHeadAttention layer. We train the network for 50 epochs, with a batch size of 16, a base learning rate of 1e−4 and an exponential decay of 0.99. We set the maximum number of LiDAR points to 1024. Our Transformer architecture consists of 3,229,696 parameters, with a final MLP of 771 neurons. We validate σ = 10 and λ = 1e−2. The complete architecture has 5,325,859 parameters. All of our networks were trained on a single V100 GPU with 16 GB of memory. Our code is implemented in TensorFlow.
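The visibility-masked MPJPE of eq. (9) takes only a few lines of NumPy; the data below is a toy example in meters, not from the dataset.

```python
import numpy as np

def mpjpe(Y_gt, Y_pred, vis):
    """Visibility-masked MPJPE (eq. 9): mean Euclidean error over all
    (sample, joint) pairs that carry a ground-truth annotation.

    Y_gt, Y_pred: (Ns, Nj, 3) arrays; vis: (Ns, Nj) 0/1 indicators.
    """
    err = np.linalg.norm(Y_gt - Y_pred, axis=-1)   # (Ns, Nj) distances
    return (vis * err).sum() / vis.sum()

# Toy data: 2 samples, 3 joints; one joint of sample 1 is unlabeled.
Y_gt = np.zeros((2, 3, 3))
Y_pred = np.zeros((2, 3, 3))
Y_pred[0, 0, 0] = 0.06                              # a 6 cm error on one joint
vis = np.array([[1, 1, 0], [1, 1, 1]])
err = mpjpe(Y_gt, Y_pred, vis)
```

Only the five annotated joints enter the average, so the single 6 cm error yields 0.06 / 5 = 0.012 m here; unannotated joints neither help nor hurt the score.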
We test the network in inference mode on an Nvidia RTX 2080 GPU, with a batch of a single example. One pass is done in 8 milliseconds. The main performance/memory bottleneck resides in computing the attention matrix (which is ≈ 1000 × 1000) in the Transformer architecture.

Method                   MPJPE (cm) ↓   MPJPE 2D (pixels) ↓
ContextPose [10]         10.82          12.95
Multi-modal [11]*        10.32          N/A
THUNDR [9]               9.62           14.81
THUNDR [9] w/ depth      9.20           13.53
HUM3DIL (Ours)           6.72           8.33

Table 2: Performance of different 3D joint predictors on the WHKP [33] test split. Our full multi-modal approach is the best performer. (*) Note that [11] was evaluated on a different subset of WOD.

Figure 2: Qualitative predictions on the WHKP test subset. In each image, from left to right, we have: the input RGB image, the overlayed LiDAR points, and our 3D joint predictions. Our method achieves plausible reconstructions even in challenging settings: low lighting conditions, cluttered environments, extreme occlusions or partial views, and non-trivial poses. Note that in the case of partial views (e.g. second row from top, second column from left), our method outputs an anatomically plausible human prediction, even with an incomplete RGB signal.

Figure 3: Failure cases on the WHKP test subset. In each image, from left to right, we have: the input RGB image, the overlayed LiDAR points, and our 3D joint predictions. Most points of failure relate to: unusual person appearances or poor capture conditions, limited image support and extreme occlusions.

Figure 4: Visualizing network performance on the WHKP validation subset, with respect to semi-supervised choices, distance to LiDAR and keypoint occlusions. Left. We plot a 3D error surface, where on the X-axis we have the 2D loss weight λ, and on the Y-axis we have the percentage of 3D samples in a mixed-supervision training batch. The plot shows the importance of 2D samples, as the performance gradually drops when we under-utilize them. Bottom-right. We plot a 3D error curve, where on the X-axis we have the distance to the LiDAR point cloud. Errors were computed for 20 equally concentrated bins w.r.t. distance. The dotted points represent the centers of those bins.
Note how the error degrades gracefully when the target subject is either too near or too far away. Top-right. We plot a 3D error curve with respect to the number of visible joints. As expected, the method degrades when more parts of the target human are not visible (due to partial views or self-occlusions).

3.1 Ablation studies

In table 3, we ablate different high-level methodological choices in our proposed architecture and report results on the validation set of WHKP. First, we disable the weakly-supervised loss by setting λ = 0. We notice that the error increases, as expected, substantially in the 2D MPJPE, but also in the 3D MPJPE. This showcases the importance of using the available 2D training signal, even for inferring 3D joints. Next, we disable the Random Fourier Embedding for the 3D LiDAR points, by replacing it with an identity embedding. This has a fairly low impact on the performance. We also disable the depth input, leaving only the RGB image through the U-Net backbone. The performance drop is not as dramatic in this case, as LiDAR point positions are already available as tokens. However, this signals the fact that the feature processing done by the backbone is not redundant, as a gap in performance still exists. When we disable the RGB input, we get a ≈ 1.3 cm drop in performance. We also ablate with the PointNet [40] and PointNet++ [39] architectures, instead of a Transformer, and performance is worse. The best performing method has all the components activated, showing their complementary impact.

Method                      MPJPE (cm) ↓   MPJPE 2D (pixels) ↓
HUM3DIL w/ λ = 0            8.62           15.61
HUM3DIL w/ PointNet         8.16           9.96
HUM3DIL w/o RGB             8.06           11.41
HUM3DIL w/o depth           7.63           9.13
HUM3DIL w/ PointNet++       7.71           9.36
HUM3DIL w/o RFF             7.01           8.79
HUM3DIL full                6.72           8.33

Table 3: Performance of our method with different architectural choices. The model with full features – including multi-modality, semi-supervised training and the Transformer architecture – is the best performer.

4 Limitations

Failure cases. In figure 3, we randomly select and show six examples of results where the error exceeds 15 cm. Extreme occlusions and partial views are the most difficult cases to handle.
Please note that the results are still, generally, anatomically plausible. Performance decay. We also show the performance decay w.r.t. the distance to the LiDAR human point cloud (which also controls the sparsity of the LiDAR signal). As we can see from figure 4, bottom-right, our performance drops when the subject is too near or too far, but the method still produces reasonable results. We are seeing a ≈ 3 cm error gap between the best and worst conditions (as captured by the dataset). We also see a performance degradation (see figure 4, top-right) when the target subject is heavily occluded, due to partial views or self-occlusions. The error increases almost two-fold when going from full view to severely occluded. General applicability. For now, our network can only process each human instance by individually cropping it, so it cannot use information about multiple humans at once (e.g. [17, 44, 45]) to improve the error and processing speed. Also, we do not utilize temporal information (e.g. [46]), which could further stabilize predictions and improve errors under occlusions.

5 Conclusions

We have presented a novel deep neural network architecture tailored to the needs of modern autonomous driving vehicles: fast, lightweight and accurate, for the problem of 3D human pose estimation from color and 3D signals. Our novel architecture, deemed HUM3DIL, makes use of both RGB and LiDAR data, by gathering pixel-aligned multi-modal features that are then fed into a sequence of Transformer stages. The network is thus informed by multi-modal signals, which complement each other in achieving state-of-the-art performance. We also train in a semi-supervised regime, with limited annotated 3D data but with an abundance of 2D labels, almost 2 orders of magnitude more. This makes data collection and annotation easier, as 3D signals are non-trivial and tedious to annotate precisely. The performance of the network is supported by a quantitative evaluation on one of the largest relevant datasets in the literature, with methodical ablation studies. For future work, we will explore temporal consistencies between the predictions and include them in motion forecasting and analysis.
References

[1] J. Wang, S. Tan, X. Zhen, S. Xu, F. Zheng, Z. He, and L. Shao. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst., 210:103225, 2021.

[2] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, and M. Shah. Deep learning-based human pose estimation: A survey. Dec. 2020.

[3] Y. Chen, Y. Tian, and M. He. Monocular human pose estimation: A survey of deep learning-based methods. Computer Vision and Image Understanding, 192:102897, 2020. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2019.102897. URL https://www.sciencedirect.com/science/article/pii/S1077314219301778.

[4] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris. 3D human pose estimation: A review of the literature and analysis of covariates. Comput. Vis. Image Underst., 152:1–20, Nov. 2016.

[5] M. Fürst, S. T. P. Gupta, R. Schuster, O. Wasenmüller, and D. Stricker. HPERL: 3D human pose estimation from RGB and LiDAR. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7321–7327, 2021.

[6] K. Huang, B. Shi, X. Li, X. Li, S. Huang, and Y. Li. Multi-modal sensor fusion for auto driving perception: A survey. Feb. 2022.

[7] D. Fernandes, A. Silva, R. Névoa, C. Simões, D. Gonzalez, M. Guevara, P. Novais, J. Monteiro, and P. Melo-Pinto. Point-cloud based 3D object detection and classification methods for self-driving applications: A survey and taxonomy. Inf. Fusion, 68:161–191, 2021.

[8] Y. Tian, H. Zhang, Y. Liu, and L. Wang. Recovering 3D human mesh from monocular images: A survey. Mar. 2022.

[9] M. Zanfir, A. Zanfir, E. G. Bazavan, W. T. Freeman, R. Sukthankar, and C. Sminchisescu. THUNDR: Transformer-based 3D human reconstruction with markers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12971–12980, 2021.

[10] X. Ma, J. Su, C. Wang, H. Ci, and Y. Wang. Context modeling in 3D human pose estimation: A unified perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6238–6247, 2021.
[11] J. Zheng, X. Shi, A. Gorban, J. Mao, Y. Song, C. R. Qi, T. Liu, V. Chari, A. Cornman, Y. Zhou, et al. Multi-modal 3D human pose estimation with 2D weak supervision in autonomous driving. arXiv preprint arXiv:2112.12141, 2021.

[12] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–2314, 2019.

[13] A. Zanfir, E. Marinoiu, and C. Sminchisescu. Monocular 3D pose and shape estimation of multiple people in natural scenes – the importance of multiple scene constraints. In CVPR, 2018.

[14] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE International Conference on Computer Vision, pages 2252–2261, 2019.

[15] R. A. Guler and I. Kokkinos. HoloPose: Holistic 3D human reconstruction in-the-wild. In CVPR, pages 10884–10894, 2019.

[16] G. Moon and K. M. Lee. I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. In European Conference on Computer Vision (ECCV), 2020.

[17] W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis. Coherent reconstruction of multiple humans from a single image. In CVPR, pages 5579–5588, 2020.

[18] B. Biggs, S. Ehrhadt, H. Joo, B. Graham, A. Vedaldi, and D. Novotny. 3D multi-bodies: Fitting sets of plausible 3D human models to ambiguous image data. arXiv preprint arXiv:2011.00980, 2020.

[19] Y. Xu, S.-C. Zhu, and T. Tung. DenseRaC: Joint 3D pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE International Conference on Computer Vision, pages 7760–7770, 2019.

[20] T. Zhang, B. Huang, and Y. Wang. Object-occluded human shape and pose estimation from a single color image. In CVPR, pages 7376–7385, 2020.

[21] A. Arnab, C. Doersch, and A. Zisserman. Exploiting temporal context for 3D human pose estimation in the wild. In CVPR, pages 3395–3404, 2019.

[22] W. Zeng, W. Ouyang, P. Luo, W. Liu, and X. Wang.
3D human mesh regression with dense correspondence. In CVPR, pages 7054–7063, 2020.

[23] G. Georgakis, R. Li, S. Karanam, T. Chen, J. Košecká, and Z. Wu. Hierarchical kinematic human mesh recovery. In ECCV, pages 768–784. Springer, 2020.

[24] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. SIGGRAPH, 2015.

[25] H. Xu, E. G. Bazavan, A. Zanfir, B. Freeman, R. Sukthankar, and C. Sminchisescu. GHUM & GHUML: Generative 3D human shape and articulated pose models. CVPR, 2020.

[26] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.

[27] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545, 2018.

[28] U. Iqbal, P. Molchanov, and J. Kautz. Weakly-supervised 3D human pose learning via multi-view images in the wild. In CVPR, pages 5243–5252, 2020.

[29] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.

[30] G. Moon, J. Y. Chang, and K. M. Lee. V2V-PoseNet: Voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5079–5088, 2018.

[31] C. Zimmermann, T. Welschehold, C. Dornhege, W. Burgard, and T. Brox. 3D human pose estimation in RGBD images for robotic task learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1986–1992. IEEE, 2018.

[32] W. Kim, M. S. Ramanagopal, C. Barto, M.-Y. Yu, K. Rosaen, N. Goumas, R. Vasudevan, and M. Johnson-Roberson. PedX: Benchmark dataset for metric 3D pose estimation of pedestrians in complex urban intersections. Sept. 2018.

[33] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.

[34] M. Fürst, S. T. Gupta, R. Schuster, O. Wasenmüller, and D. Stricker. HPERL: 3D human pose estimation from RGB and LiDAR. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7321–7327. IEEE, 2021.

[35] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf.

[36] K. Lin, L. Wang, and Z. Liu. End-to-end human pose and mesh reconstruction with transformers. In CVPR, pages 1954–1963, 2021.

[37] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu. PCT: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.

[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[39] P. Ni, W. Zhang, X. Zhu, and Q. Cao. PointNet++ grasping: Learning an end-to-end spatial grasp generation algorithm from sparse point clouds. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3619–3625. IEEE, 2020.

[40] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[41] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[42] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017. URL https://arxiv.org/pdf/1706.03762.pdf.

[44] M. Fieraru, M. Zanfir, T. Szente, E. Bazavan, V. Olaru, and C. Sminchisescu. REMIPS: Physically consistent 3D reconstruction of multiple interacting people under weak supervision. Advances in Neural Information Processing Systems, 34:19385–19397, 2021.

[45] G. Moon, J. Y. Chang, and K. M. Lee. Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10133–10142, 2019.

[46] M. Kocabas, N. Athanasiou, and M. J. Black. VIBE: Video inference for human body pose and shape estimation. CVPR, 2020.