HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving

Andrei Zanfir† Mihai Zanfir† Alexander Gorban‡ Jingwei Ji‡ Yin Zhou‡ Dragomir Anguelov‡ Cristian Sminchisescu†
†Google Research {andreiz, mihaiz, sminchisescu}@google.com
‡Waymo Research {gorban, jingweij, yinzhou, dragomir}@waymo.com

Abstract: Autonomous driving is an exciting new industry, posing important research questions. Within the perception module, 3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians. While hardware systems and sensors have dramatically improved over the decades – with cars potentially boasting complex LiDAR and vision systems and with a growing expansion of the available body of dedicated datasets for this newly available information – not much work has been done to harness these novel signals for the core problem of 3D human pose estimation. Our method, which we coin HUM3DIL (HUMan 3D from Images and LiDAR), efficiently makes use of these complementary signals in a semi-supervised fashion, and outperforms existing methods by a large margin. It is a fast and compact model for onboard deployment. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages. Quantitative experiments on the Waymo Open Dataset support these claims, where we achieve state-of-the-art results on the task of 3D pose estimation.

Keywords: autonomous driving, perception, human pose, key points, skeletal representation

1 Introduction

Robotic systems which operate in environments with humans are required to avoid collisions with people and benefit from analysing their actions and forecasting future behaviors. Human pose understanding is a well established research direction in computer vision with numerous industrial applications [1, 2, 3, 4]. In this work we focus on 3D human pose understanding for the autonomous vehicle (AV) industry and robotic applications in general.

Safety is the top priority for the AV industry. Many robotic platforms use sensors in different modalities (e.g.
cameras, LiDARs, radars, audio, etc.) to improve safety by analyzing more signals about the environment. Using RGB cameras coupled with LiDAR sensors could be considered a standard sensor suite for most robotic platforms. While many studies have shown impressive results for estimating human poses using RGB imagery, there is a paucity of methods which can effectively use both modalities [5].

Recent studies have made great headway into estimating 3D human pose in controlled environments [1], but many real-world and safety-critical robotic applications require estimating human poses in uncontrolled environments, where subjects may be captured under different levels of occlusion, from various perspectives and across arbitrary ranges. There are several approaches to estimate 3D human poses in uncontrolled environments [5], but they have insufficient accuracy to unlock their full potential, especially for AV applications.

Another limiting factor for robotic applications of available methods of 3D pose detection is their computational complexity. There are fast methods for detecting a 3D bounding box which contains a single person [6, 7]. Thus most robotic applications represent humans with 3D bounding boxes. While this crude representation of people allows such applications to meet basic safety requirements and avoid collisions, it is not sufficient for understanding complex human body gestures. There are methods which output feature-rich representations and estimate parameters of full body meshes [8], but they are relatively slow. Representing 3D human body pose as a sequence of locations of 3D key points inside the body could be considered a balanced trade-off between fast-to-compute boxes and slow full-body models. The main goal of this work is to provide a fast method for human pose estimation in uncontrolled environments which efficiently uses sensor modalities common in the AV industry and outputs representations rich enough to enable analysis of complex human behaviors.

One of the key factors enabling research and development of methods for human pose understanding is the availability of high-quality ground truth 3D data with human poses and sensor data.
There are a few ways to collect such data: marker or markerless camera-based motion capture systems (suitable for controlled indoor environments); IMU-based motion capture systems (suitable for both indoor and outdoor, but also controlled, environments) [1]; and manual human labeling (suitable for all environments, but expensive and error-prone). To the best of our knowledge, there is no large-scale outdoor dataset with human poses collected in an uncontrolled environment with ground truth 2D and 3D keypoints with RGB and LiDAR for a fully supervised training mode. Waymo recently released a version of their Waymo Open Dataset (WOD v1.3.2) with a large amount of camera (2D) keypoints and a small amount of laser (3D) keypoints (enough for fine-tuning and evaluation purposes), which is suitable for weakly and/or semi-supervised training modes. In this work we use the WOD v1.3.2 dataset to demonstrate that our method can reliably predict 3D human pose in uncontrolled and challenging AV scenarios, and we compare our approach with several state-of-the-art methods [9, 10, 11] after adapting them for multi-modal applications.

We propose HUM3DIL, a light-weight 3D human joints prediction network that leverages RGB information together with LiDAR points, in a novel fashion, by computing pixel-aligned [12] multi-modal features with the 3D positions of the LiDAR signal. These features are then used by subsequent Transformer-based refinement stages to produce the desired 3D joints. We train our model in a semi-supervised manner, to maximize the utility of both 2D annotations (less expensive to collect, available in larger volumes) and 3D labels (expensive to collect, accurate, but with limited coverage). Quantitative results on the Waymo Open Dataset indicate state-of-the-art performance. Being accurate, fast and lightweight, HUM3DIL can be deployed into onboard autonomous driving systems to provide real-time perception signals of human road users. We believe downstream tasks can greatly benefit from these fine-grained signals.
Related Work: There is a considerable amount of prior work in 3D human pose reconstruction, mostly focused on estimation from RGB images alone. There are two main classes of methods, the first of which is model-based [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] and relies on statistical human body models like SMPL [24] or GHUM [25]. These methods do not estimate 3D pose directly, but instead regress the parameters of a statistical body model, which has built-in anatomical and kinematic constraints. This leads to more natural predictions, with body shape usually estimated as well, even for poses not encountered during training. The second class of methods is skeleton-based [26, 27, 28, 22], where the 3D pose is represented by 3D joint positions and these are regressed or detected directly from the input. These methods have the advantage of usually being more accurate and faster, but they are not guaranteed to produce anatomically correct human skeletons (e.g. the left arm may be reconstructed with a different length than the right arm). Our proposed model falls in the latter category, as our goal is to estimate pedestrians as accurately as possible in real-time. Mixed approaches have recently started to emerge, e.g. [9], where 3D positions are inferred directly, but anatomically regularized through a statistical model.

There are works that use depth information, either separately or in combination with an RGB image, to reconstruct the 3D pose [29, 30, 31]. Approaches that utilize LiDAR information are hard to come by, mostly because of the lack of ground-truth 3D human pose paired with LiDAR data. There are a few datasets that, to different degrees, do provide 3D annotations. The PedX dataset [32] offers 14,000 3D automatic pedestrian annotations obtained using model fitting on different modalities, gathered from three different real-world scenes. The Waymo Open Dataset [33] has a similar amount of 3D annotations as [32], but it features many more different environments (2,030 real scenes, 7,650 different people) with high-quality 2D and 3D manual annotations.
Even with the existence of these datasets, the few works on 3D pose reconstruction published in this space mostly rely on weak supervision, by lifting 2D pose information into 3D. [34] trains on 2D ground-truth pose annotations and uses a reprojection loss for the 3D pose regression task. [11] creates pseudo ground-truth 3D joint positions from the projection of annotated 2D joints, by considering neighboring LiDAR points in the projection space. During training, they directly compare the predicted 3D positions against this pseudo ground-truth.

2 Methodology

2.1 Problem Formulation

We focus on the task of keypoint localization and formulate the problem as estimating the 3D locations of a set of keypoints Y ∈ R^{N_j×3} inside the human body, given the ground truth or predicted bounding box of a human as well as multi-modal inputs from camera and LiDAR sensors: an RGB camera image I ∈ [0,1]^{H×W×3} and a point cloud P ∈ R^{N_p×3}, consisting of N_p LiDAR points from a single scan.

Camera Model. For correct LiDAR-point-to-camera-image projections, we use a differentiable implementation of the Waymo Open Dataset [33] camera model, with the rolling shutter effect compensated. We denote the intrinsic information (e.g. lens distortion, shutter speed, focal length, focal point, etc.) as a vector K_i, and the extrinsic parameters (e.g. vehicle pose, camera pose, linear and rotation speed) as K_e. The complete camera information will be denoted as K = [K_e | K_i] ∈ R^{1×N_k}. The associated camera image projection operator will be denoted by Π(∗, K), where a 3D input will be correctly projected into the image space.

2.2 HUM3DIL

Our network, deemed HUM3DIL (see fig. 1), receives as input the RGB camera image I, the LiDAR point cloud P, the camera intrinsics K, the paired 2D and 3D bounding boxes, and outputs the predicted 3D human keypoints Y. Our goal is to use the image input to better exploit or disambiguate the structure present in the 3D point cloud, while at the same time using the 3D points to anchor imagery evidence. Because we have access to the ground-truth camera intrinsics, we can move between 3D space and 2D image space by projection, and vice-versa by back-projection.
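To illustrate the role of the projection operator Π(∗, K), the following is a minimal pinhole-projection sketch in NumPy; the actual WOD camera model additionally handles lens distortion, extrinsics and rolling-shutter compensation, and the intrinsics fx, fy, cx, cy below are placeholder values, not taken from the dataset.

```python
import numpy as np

def project_pinhole(points_3d, fx, fy, cx, cy):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates.

    A simplified pinhole stand-in for Pi(*, K); the real WOD model
    also compensates lens distortion and rolling shutter.
    """
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = fx * x / z + cx  # horizontal pixel coordinate
    v = fy * y / z + cy  # vertical pixel coordinate
    return np.stack([u, v], axis=-1)

# Toy points: one on the optical axis, one off-axis.
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.25, 1.0]])
uv = project_pinhole(pts, fx=1000.0, fy=1000.0, cx=640.0, cy=480.0)
```

A point on the optical axis projects exactly onto the principal point (cx, cy), which is a quick sanity check for any such implementation.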
Motivated by recent advancements in 3D pose estimation and point-cloud processing [36, 37, 9], we use a Transformer-based architecture [38] to process 3D and image-based structural information at the same time, in contrast to typical approaches which use disconnected PointNet 3D [39, 40] embeddings and/or image features.

Enriching LiDAR points with image evidence. We construct a depth image D ∈ R^{H×W×1} by projecting the LiDAR point cloud P into image space. We concatenate it with the raw RGB image, [I; D], and pass the derived tensor through a convolutional architecture. Thus, we simultaneously inform the convolutional layers of the regions of interest in the image, i.e. sparse locations on the silhouette of the person, and make depth information available from the start. Adding the depth map channel helps disambiguate the task of keypoint prediction in cases with heavy occlusions or in crowded scenes where multiple people are in the frame: the depth map channel will have nonzero values only for the person of interest. For the convolutional architecture, we employ a lightweight U-Net network [41] that outputs a dense feature map representation F ∈ R^{H×W×D_f}. This map is queried, based on the projection of the LiDAR points, by bilinear interpolation:

F_i = F[Π(P_i, K)] ∈ R^{1×D_f}    (1)

for each point P_i ∈ P. We thus obtain pixel-aligned image features for the LiDAR points.

LiDAR points embedding. Aside from the per-point depth-infused image features, we also process the initial 3D LiDAR points. We use Random Fourier Features (RFF) [42] to embed P in a higher-dimensional space, capturing high-frequency behaviour of the signal. We use a random Gaussian matrix B ∈ R^{3×D_p/2}, with each entry independently drawn from a normal distribution N(0, σ²). The transformed points will have the form P̃ = [cos(2πPB); sin(2πPB)] ∈ R^{N_p×D_p}.

Figure 1: Overview of our proposed HUM3DIL architecture.
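The two per-point operations above, bilinear sampling of pixel-aligned image features (eq. 1) and the Random Fourier embedding, can be sketched in NumPy as follows; the toy feature map and dimensions (D_f = 7, D_p = 16) are illustrative, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_embed(P, B):
    """Random Fourier Features: P (Np x 3), B (3 x Dp/2) -> (Np x Dp)."""
    proj = 2.0 * np.pi * P @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

def bilinear_sample(F, uv):
    """Sample a dense feature map F (H x W x Df) at continuous pixel
    locations uv (Np x 2), yielding pixel-aligned features (Np x Df)."""
    H, W, _ = F.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    return (F[v0, u0] * (1 - du) * (1 - dv) + F[v0, u1] * du * (1 - dv)
            + F[v1, u0] * (1 - du) * dv + F[v1, u1] * du * dv)

sigma = 10.0
P = rng.normal(size=(5, 3))                # toy LiDAR points
B = rng.normal(scale=sigma, size=(3, 8))   # Dp = 16
P_tilde = fourier_embed(P, B)
F = rng.normal(size=(4, 6, 7))             # toy feature map, H=4, W=6, Df=7
uv = np.array([[1.5, 2.5], [0.0, 0.0], [5.0, 3.0], [2.2, 1.1], [4.9, 0.4]])
feats = bilinear_sample(F, uv)             # pixel-aligned features, (5, 7)
```

By construction, each cos/sin pair of the embedding satisfies cos² + sin² = 1, and sampling at an integer pixel location reduces to a direct lookup into F.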
It estimates the 3D joint positions of a single person from a multi-modal input representation. We encode the LiDAR points P through a Random Fourier Embedding [35] to produce representations P̃. The LiDAR points are also used to compute a depth-map representation D, which is concatenated with the input RGB image I. We first use an image feature extractor (U-Net) that acts on the concatenation of D and I. The projected LiDAR points then read features from the produced map F. We construct a token sequence of size equal to the number of points. Each token initially carries the image features, camera intrinsics and Random Fourier Embeddings. L Transformer Encoder stages act on the tokens. We read the final N_j tokens and regress the 3D joints through an MLP.

Transforming the LiDAR points. In order to regress Y, we employ a Transformer architecture [43]. We define the i-th input token M_i ∈ R^{1×D}, with D = D_f + D_p + N_k, as:

M_i = [K, F_i, P̃_i],    (2)

the concatenation of the camera intrinsics, the per-point image feature and the per-point Fourier representation. We apply the Transformer on a fixed sequence of N_p tokens. In order to work with a variable number of tokens, we use a fixed maximum size for the token sequence, N_p^max. We shuffle and trim excess points, and pad with zeros if we have fewer points. The complete input sequence is M ∈ R^{N_p^max×D}. This sequence is first linearly embedded using a learnable matrix E ∈ R^{D×D_0}. Here, D_0 is the operating dimensionality of the Transformer architecture. We additionally concatenate learnable joint tokens, M^J ∈ R^{N_j×D_0}. Similar to [9], we use a cascaded block of L Transformer encoder layers, and collect the predicted 3D keypoints Ỹ from an MLP applied on the transformed joint tokens:

M^0 = [M^J; ME]    (3)
M^l = TL_l(M^{l−1})    (4)
Ỹ = MLP(M^{L−1}_{0...N_j})    (5)

Losses. Labeling 3D keypoints is significantly more expensive and slower than labeling 2D keypoints in uncontrolled real-world environments. As a result, we usually collect a dataset containing many more 2D annotations than 3D labels.
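The token assembly of eqs. (2)-(3), with its shuffle/trim/pad handling of a variable point count, can be sketched at the shape level as follows; the toy dimensions stand in for the paper's N_p^max = 1024 and D_0 = 256, and the joint tokens and embedding matrix E are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

Np_max, Nj = 8, 4          # max LiDAR tokens, number of joints
Nk, Df, Dp = 5, 7, 6       # camera-, image-, Fourier-feature dims
D = Nk + Df + Dp           # per-point token size before embedding
D0 = 16                    # Transformer operating dimensionality

def build_tokens(K, F_pts, P_tilde, E, joint_tokens):
    """Assemble the input sequence: shuffle and trim to Np_max points,
    zero-pad if fewer, form [K, F_i, P_tilde_i] per point, linearly
    embed with E, and prepend the learnable joint tokens."""
    n = F_pts.shape[0]
    idx = rng.permutation(n)[:Np_max]                         # shuffle + trim
    M = np.concatenate([np.tile(K, (len(idx), 1)),
                        F_pts[idx], P_tilde[idx]], axis=-1)   # (n', D)
    if M.shape[0] < Np_max:                                   # zero padding
        M = np.pad(M, ((0, Np_max - M.shape[0]), (0, 0)))
    return np.concatenate([joint_tokens, M @ E], axis=0)      # (Nj+Np_max, D0)

K = rng.normal(size=(Nk,))
F_pts = rng.normal(size=(6, Df))            # 6 points < Np_max, so padded
P_tilde = rng.normal(size=(6, Dp))
E = rng.normal(size=(D, D0))
joint_tokens = rng.normal(size=(Nj, D0))
M0 = build_tokens(K, F_pts, P_tilde, E, joint_tokens)
```

The resulting sequence M^0 has Nj + Np_max tokens; the zero-padded rows stay zero after the linear embedding, so they carry no signal into the encoder.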
To maximize the utility of all available labels (2D and 3D) and boost performance, we use a mixture of weakly and fully supervised losses to train our model. We denote the ground-truth 2D joint keypoints as y ∈ R^{N_j×2} and define the 2D reprojection and 3D reconstruction losses as:

L_2D = (1/N_j) Σ_{i=1}^{N_j} ||y_i − Π(Ỹ_i, K)||_2    (6)

L_3D = (1/N_j) Σ_{i=1}^{N_j} ||Y_i − Ỹ_i||_2    (7)

where ||∗||_2 is the ℓ2 vector norm, i.e. the more robust Euclidean distance between predictions. We add a small ε during training, as the function is not differentiable at 0. Our final loss is given by:

L = L_3D + λ L_2D    (8)

where λ is a scalar factor used to weight the two losses.

Semi-supervised support. In order to efficiently train with mixed 2D and 3D labels, we also set a training batch to contain a pre-defined fraction of 2D-to-3D annotations. This allows the network to not forget about 3D when the dataset is drastically biased towards 2D annotations. We show a study on the effect of the percentage of 3D samples and the loss balancing factor in figure 4, left.

3 Experimental Results

Datasets. Waymo Open Dataset v1.3.2 (WOD) [33] contains RGB and LiDAR range images capturing various road users. Recently, camera (2D) and laser (3D) keypoint annotations on a portion of the human subjects (pedestrians and cyclists) in WOD have been released, namely the Waymo Human KeyPoints dataset v1.3.2 (WHKP). We benchmark HUM3DIL and other baselines on the WHKP. As the official WOD/WHKP testing subset is hidden from the public, we randomly select 50% of subjects from the WOD validation subset as our validation split, and the remaining 50% fall into the testing split for benchmarking. In our experiments, we use the ground-truth camera and LiDAR bounding boxes during training and evaluation, for two reasons: a) disentangling the evaluation of keypoint localization from the object detection task; b) setting an easy-to-reproduce baseline for future research works.

subset       #subjects   total samples   w/ 2D keypoints   w/ 3D keypoints   w/ both
training     6999        149683          144866            9472              4655
validation   1651        28614           27382             2137              905

Table 1: Human KeyPoints in WOD v1.3.2.
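The mixed loss of eqs. (6)-(8), with the ε-stabilized Euclidean distance, can be sketched as below; the orthographic `ortho_project` is a toy stand-in for Π(∗, K), and all shapes and values are illustrative.

```python
import numpy as np

EPS = 1e-8  # keeps the sqrt differentiable at zero, as in eq. (6)-(7)

def robust_l2(a, b):
    """Per-joint Euclidean distance with a small eps, averaged over joints."""
    return np.mean(np.sqrt(np.sum((a - b) ** 2, axis=-1) + EPS))

def total_loss(Y_pred, Y_gt_3d, y_gt_2d, project, lam=1e-2):
    """L = L_3D + lambda * L_2D; `project` plays the role of Pi(*, K)."""
    loss_3d = robust_l2(Y_pred, Y_gt_3d)
    loss_2d = robust_l2(project(Y_pred), y_gt_2d)
    return loss_3d + lam * loss_2d

def ortho_project(Y):
    # Toy orthographic stand-in for the camera projection operator.
    return Y[:, :2]

Y_gt = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 2.0]])       # two joints, meters
Y_pred = Y_gt + np.array([[0.3, 0.0, 0.0], [0.0, 0.4, 0.0]])
loss = total_loss(Y_pred, Y_gt, ortho_project(Y_gt), ortho_project)
```

With in-plane errors of 0.3 m and 0.4 m, both L_3D and L_2D come out to 0.35 here, so with λ = 1e−2 the total is ≈ 0.3535, showing how the 2D term acts as a small corrective weight on top of the 3D term.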
Metrics. Where applicable, we report two metrics: the mean per-joint position error (MPJPE) between predicted and ground-truth 3D joints, and a similar one for the 2D case. As the ground truth is not available for every frame, or even for every joint, we use a visibility indicator v_i^j ∈ {0,1}. This signals whether we have a ground-truth annotation for a particular joint i of a particular testing sample j. The MPJPE over a particular dataset then has the value:

(1 / Σ_{i,j} v_i^j) Σ_{i,j} v_i^j ||Y_i^j − Ỹ_i^j||_2    (9)

Evaluation on WHKP. We train several methods on the WHKP training subset: ours (HUM3DIL); THUNDR [9], reimplemented following the original paper; THUNDR with an additional depth image as input; ContextPose [10], using publicly available code; and the multi-modal approach of [11], for which we report their number on a similar version of the dataset. These are all state-of-the-art methods in 3D pose estimation for single persons from an RGB image. We tried to make comparisons against pure RGB image methods as fair as possible, but there is a modeling gap that cannot be bridged: our architecture naturally exploits the LiDAR signal with ease. Also note that we have 1/5 to 1/14 of the number of parameters of competing methods. We report results on the test split in Table 2.

Implementation details. In all our experiments we use a U-Net backbone [41] with randomly initialized weights. The encoder/decoder convolutional filter sequences are [32, 64, 128, 256] and [256, 128, 64, 32], respectively. The backbone has 2,095,392 parameters. For the Transformer architecture, we use L = 4 stages, an embedding size of 256 and 8 heads for the MultiHeadAttention layer. We train the network for 50 epochs, with a batch size of 16, a base learning rate of 1e−4 and an exponential decay of 0.99. We set the maximum number of LiDAR points to 1024. Our Transformer architecture consists of 3,229,696 parameters, with a final MLP of 771 neurons. We validate σ = 10 and λ = 1e−2. The complete architecture has 5,325,859 parameters. All of our networks were trained on a single V100 GPU with 16 GB of memory. Our code is implemented in TensorFlow.
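The visibility-masked MPJPE of eq. (9) takes only a few lines of NumPy; the data below is a toy example in meters, not from the dataset.

```python
import numpy as np

def mpjpe(Y_gt, Y_pred, vis):
    """Visibility-masked MPJPE (eq. 9): mean Euclidean error over all
    (sample, joint) pairs that carry a ground-truth annotation.

    Y_gt, Y_pred: (Ns, Nj, 3) arrays; vis: (Ns, Nj) 0/1 indicators.
    """
    err = np.linalg.norm(Y_gt - Y_pred, axis=-1)   # (Ns, Nj) distances
    return (vis * err).sum() / vis.sum()

# Toy data: 2 samples, 3 joints; one joint of sample 1 is unlabeled.
Y_gt = np.zeros((2, 3, 3))
Y_pred = np.zeros((2, 3, 3))
Y_pred[0, 0, 0] = 0.06                              # a 6 cm error on one joint
vis = np.array([[1, 1, 0], [1, 1, 1]])
err = mpjpe(Y_gt, Y_pred, vis)
```

Only the five annotated joints enter the average, so the single 6 cm error yields 0.06 / 5 = 0.012 m here; unannotated joints neither help nor hurt the score.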
We test the network in inference mode on an Nvidia RTX 2080 GPU, with a batch of a single example. One pass is done in 8 milliseconds. The main performance/memory bottleneck resides in computing the attention matrix (which is ≈ 1000 × 1000) in the Transformer architecture.

Method                   MPJPE (cm) ↓   MPJPE 2D (pixels) ↓
ContextPose [10]         10.82          12.95
Multi-modal [11]*        10.32          N/A
THUNDR [9]               9.62           14.81
THUNDR [9] w/ depth      9.20           13.53
HUM3DIL (Ours)           6.72           8.33

Table 2: Performance of different 3D joint predictors on the WHKP [33] test split. Our full multi-modal approach is the best performer. (*) Note that [11] was evaluated on a different subset of WOD.

Figure 2: Qualitative predictions on the WHKP test subset. In each image, from left to right, we have: the input RGB image, the overlayed LiDAR points, and our 3D joint predictions. Our method achieves plausible reconstructions even in challenging settings: low lighting conditions, cluttered environments, extreme occlusions or partial views, and non-trivial poses. Note that in the case of partial views (e.g. second row from top, second column from left), our method outputs an anatomically plausible human prediction, even with an incomplete RGB signal.

Figure 3: Failure cases on the WHKP test subset. In each image, from left to right, we have: the input RGB image, the overlayed LiDAR points, and our 3D joint predictions. Most points of failure relate to: unusual person appearances or poor capture conditions, limited image support and extreme occlusions.

Figure 4: Visualizing network performance on the WHKP validation subset, with respect to semi-supervised choices, distance to LiDAR and keypoint occlusions. Left. We plot a 3D error surface, where on the X-axis we have the 2D loss weight λ, and on the Y-axis we have the percentage of 3D samples in a mixed-supervision training batch. The plot shows the importance of 2D samples, as the performance gradually drops when we under-utilize them. Bottom-right. We plot a 3D error curve, where on the X-axis we have the distance to the LiDAR point cloud. Errors were computed for 20 equally concentrated bins w.r.t. distance. The dotted points represent the centers of those bins.
Note how the error degrades gracefully when the target subject is either too near or too far away. Top-right. We plot a 3D error curve with respect to the number of visible joints. As expected, the method degrades when more parts of the target human are not visible (due to partial views or self-occlusions).

3.1 Ablation studies

In table 3, we ablate different high-level methodological choices in our proposed architecture and report results on the validation set of WHKP. First, we disable the weakly-supervised loss by setting λ = 0. We notice that the error increases, as expected, substantially in the 2D MPJPE, but also in the 3D MPJPE. This showcases the importance of using the available 2D training signal, even for inferring 3D joints. Next, we disable the Random Fourier Embedding for the 3D LiDAR points, by replacing it with an identity embedding. This has a fairly low impact on the performance. We also disable the depth input, leaving only the RGB image through the U-Net backbone. The performance drop is not as dramatic in this case, as LiDAR point positions are already available as tokens. However, this signals the fact that the feature processing done by the backbone is not redundant, as a gap in performance still exists. When we disable the RGB input, we get a ≈ 1.3 cm drop in performance. We also ablate with the PointNet [40] and PointNet++ [39] architectures, instead of a Transformer, and performance is worse. The best performing method has all the components activated, showing their complementary impact.

Method                      MPJPE (cm) ↓   MPJPE 2D (pixels) ↓
HUM3DIL w/ λ = 0            8.62           15.61
HUM3DIL w/ PointNet         8.16           9.96
HUM3DIL w/o RGB             8.06           11.41
HUM3DIL w/o depth           7.63           9.13
HUM3DIL w/ PointNet++       7.71           9.36
HUM3DIL w/o RFF             7.01           8.79
HUM3DIL full                6.72           8.33

Table 3: Performance of our method with different architectural choices. The model with full features – including multi-modality, semi-supervised training and the Transformer architecture – is the best performer.

4 Limitations

Failure cases. In figure 3, we randomly select and show six examples of results where the error exceeds 15 cm. Extreme occlusions and partial views are the most difficult cases to handle.
Please note that the results are still, generally, anatomically plausible. Performance decay. We also show the performance decay w.r.t. the distance to the LiDAR human point cloud (which also controls the sparsity of the LiDAR signal). As we can see from figure 4, bottom-right, our performance drops when the subject is too near or too far, but the method still produces reasonable results. We are seeing a ≈ 3 cm error gap between the best and worst conditions (as captured by the dataset). We also see a performance degradation (see figure 4, top-right) when the target subject is heavily occluded, due to partial views or self-occlusions. The error increases almost two-fold when going from full view to severely occluded. General applicability. For now, our network can only process each human instance by individually cropping it, so it cannot use information about multiple humans at once (e.g. [17, 44, 45]) to improve the error and processing speed. Also, we do not utilize temporal information (e.g. [46]), which could further stabilize predictions and improve errors under occlusions.

5 Conclusions

We have presented a novel deep neural network architecture tailored to the needs of modern autonomous driving vehicles: fast, lightweight and accurate, for the problem of 3D human pose estimation from color and 3D signals. Our novel architecture, deemed HUM3DIL, makes use of both RGB and LiDAR data, by gathering pixel-aligned multi-modal features that are then fed into a sequence of Transformer stages. The network is thus informed by multi-modal signals, which complement each other in achieving state-of-the-art performance. We also train in a semi-supervised regime, with limited annotated 3D data but with an abundance of 2D labels, almost 2 orders of magnitude more. This makes data collection and annotation easier, as 3D signals are non-trivial and tedious to annotate precisely. The performance of the network is supported by a quantitative evaluation on one of the largest relevant datasets in the literature, with methodical ablation studies. For future work, we will explore temporal consistencies between the predictions and include them in motion forecasting and analysis.
References

[1] J. Wang, S. Tan, X. Zhen, S. Xu, F. Zheng, Z. He, and L. Shao. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst., 210:103225, 2021.

[2] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, and M. Shah. Deep learning-based human pose estimation: A survey. Dec. 2020.

[3] Y. Chen, Y. Tian, and M. He. Monocular human pose estimation: A survey of deep learning-based methods. Computer Vision and Image Understanding, 192:102897, 2020. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2019.102897. URL https://www.sciencedirect.com/science/article/pii/S1077314219301778.

[4] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris. 3D human pose estimation: A review of the literature and analysis of covariates. Comput. Vis. Image Underst., 152:1–20, Nov. 2016.

[5] M. Fürst, S. T. P. Gupta, R. Schuster, O. Wasenmüller, and D. Stricker. HPERL: 3D human pose estimation from RGB and LiDAR. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7321–7327, 2021.

[6] K. Huang, B. Shi, X. Li, X. Li, S. Huang, and Y. Li. Multi-modal sensor fusion for auto driving perception: A survey. Feb. 2022.

[7] D. Fernandes, A. Silva, R. Névoa, C. Simões, D. Gonzalez, M. Guevara, P. Novais, J. Monteiro, and P. Melo-Pinto. Point-cloud based 3D object detection and classification methods for self-driving applications: A survey and taxonomy. Inf. Fusion, 68:161–191, 2021.

[8] Y. Tian, H. Zhang, Y. Liu, and L. Wang. Recovering 3D human mesh from monocular images: A survey. Mar. 2022.

[9] M. Zanfir, A. Zanfir, E. G. Bazavan, W. T. Freeman, R. Sukthankar, and C. Sminchisescu. THUNDR: Transformer-based 3D human reconstruction with markers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12971–12980, 2021.

[10] X. Ma, J. Su, C. Wang, H. Ci, and Y. Wang. Context modeling in 3D human pose estimation: A unified perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6238–6247, 2021.
[11] J. Zheng, X. Shi, A. Gorban, J. Mao, Y. Song, C. R. Qi, T. Liu, V. Chari, A. Cornman, Y. Zhou, et al. Multi-modal 3D human pose estimation with 2D weak supervision in autonomous driving. arXiv preprint arXiv:2112.12141, 2021.

[12] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–2314, 2019.

[13] A. Zanfir, E. Marinoiu, and C. Sminchisescu. Monocular 3D pose and shape estimation of multiple people in natural scenes – the importance of multiple scene constraints. In CVPR, 2018.

[14] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE International Conference on Computer Vision, pages 2252–2261, 2019.

[15] R. A. Guler and I. Kokkinos. HoloPose: Holistic 3D human reconstruction in-the-wild. In CVPR, pages 10884–10894, 2019.

[16] G. Moon and K. M. Lee. I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. In European Conference on Computer Vision (ECCV), 2020.

[17] W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis. Coherent reconstruction of multiple humans from a single image. In CVPR, pages 5579–5588, 2020.

[18] B. Biggs, S. Ehrhadt, H. Joo, B. Graham, A. Vedaldi, and D. Novotny. 3D multi-bodies: Fitting sets of plausible 3D human models to ambiguous image data. arXiv preprint arXiv:2011.00980, 2020.

[19] Y. Xu, S.-C. Zhu, and T. Tung. DenseRaC: Joint 3D pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE International Conference on Computer Vision, pages 7760–7770, 2019.

[20] T. Zhang, B. Huang, and Y. Wang. Object-occluded human shape and pose estimation from a single color image. In CVPR, pages 7376–7385, 2020.

[21] A. Arnab, C. Doersch, and A. Zisserman. Exploiting temporal context for 3D human pose estimation in the wild. In CVPR, pages 3395–3404, 2019.

[22] W. Zeng, W. Ouyang, P. Luo, W. Liu, and X. Wang.
3D human mesh regression with dense correspondence. In CVPR, pages 7054–7063, 2020.

[23] G. Georgakis, R. Li, S. Karanam, T. Chen, J. Košecká, and Z. Wu. Hierarchical kinematic human mesh recovery. In ECCV, pages 768–784. Springer, 2020.

[24] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. SIGGRAPH, 2015.

[25] H. Xu, E. G. Bazavan, A. Zanfir, B. Freeman, R. Sukthankar, and C. Sminchisescu. GHUM & GHUML: Generative 3D human shape and articulated pose models. CVPR, 2020.

[26] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.

[27] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545, 2018.

[28] U. Iqbal, P. Molchanov, and J. Kautz. Weakly-supervised 3D human pose learning via multi-view images in the wild. In CVPR, pages 5243–5252, 2020.

[29] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.

[30] G. Moon, J. Y. Chang, and K. M. Lee. V2V-PoseNet: Voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5079–5088, 2018.

[31] C. Zimmermann, T. Welschehold, C. Dornhege, W. Burgard, and T. Brox. 3D human pose estimation in RGBD images for robotic task learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1986–1992. IEEE, 2018.

[32] W. Kim, M. S. Ramanagopal, C. Barto, M.-Y. Yu, K. Rosaen, N. Goumas, R. Vasudevan, and M. Johnson-Roberson. PedX: Benchmark dataset for metric 3D pose estimation of pedestrians in complex urban intersections. Sept. 2018.

[33] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.

[34] M. Fürst, S. T. Gupta, R. Schuster, O. Wasenmüller, and D. Stricker. HPERL: 3D human pose estimation from RGB and LiDAR. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7321–7327. IEEE, 2021.

[35] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf.

[36] K. Lin, L. Wang, and Z. Liu. End-to-end human pose and mesh reconstruction with transformers. In CVPR, pages 1954–1963, 2021.

[37] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu. PCT: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.

[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[39] P. Ni, W. Zhang, X. Zhu, and Q. Cao. PointNet++ grasping: Learning an end-to-end spatial grasp generation algorithm from sparse point clouds. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3619–3625. IEEE, 2020.

[40] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[41] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[42] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017. URL https://arxiv.org/pdf/1706.03762.pdf.

[44] M. Fieraru, M. Zanfir, T. Szente, E. Bazavan, V. Olaru, and C. Sminchisescu. REMIPS: Physically consistent 3D reconstruction of multiple interacting people under weak supervision. Advances in Neural Information Processing Systems, 34:19385–19397, 2021.

[45] G. Moon, J. Y. Chang, and K. M. Lee. Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10133–10142, 2019.

[46] M. Kocabas, N. Athanasiou, and M. J. Black. VIBE: Video inference for human body pose and shape estimation. CVPR, 2020.