3D Human Keypoints Estimation from Point Clouds in the Wild
without Human Labels
Zhenzhen Weng1*
Alexander S. Gorban2
Jingwei Ji2
Mahyar Najibi2
Yin Zhou2
Dragomir Anguelov2
1Stanford University
2Waymo
Abstract
Training a 3D human keypoint detector from point
clouds in a supervised manner requires large volumes of
high quality labels. While it is relatively easy to capture
large amounts of human point clouds, annotating 3D key-
points is expensive, subjective, error prone and especially
difficult for long-tail cases (pedestrians with rare poses,
scooterists, etc.).
In this work, we propose GC-KPL -
Geometry Consistency inspired Key Point Learning, an ap-
proach for learning 3D human joint locations from point
clouds without human labels. We achieve this by our novel
unsupervised loss formulations that account for the struc-
ture and movement of the human body. We show that by
training on a large training set from the Waymo Open Dataset [21] without any human-annotated keypoints, we are able
to achieve reasonable performance as compared to the fully
supervised approach. Further, the backbone benefits from
the unsupervised training and is useful in downstream few-
shot learning of keypoints, where fine-tuning on only 10 per-
cent of the labeled training data gives comparable perfor-
mance to fine-tuning on the entire set. We demonstrate that GC-KPL outperforms the SoTA by a large margin when trained on the entire dataset and efficiently leverages large volumes of unlabeled data.
1. Introduction
Estimation of human pose in 3D is an important prob-
lem in computer vision and it has a wide range of appli-
cations including AR/VR, AI-assisted healthcare, and au-
tonomous driving [4,29,32]. For autonomous systems, be-
ing able to perceive human poses from sensor data (e.g. Li-
DAR point clouds) is particularly essential to reason about
the surrounding environment and make safe maneuvers.
Despite the high level of interest in human pose estimation in the wild, only a few papers have approached outdoor 3D keypoint detection using point clouds. A main reason is that
*Work done as an intern at Waymo.
[Figure 1 diagram: an in-the-wild point cloud is fed to a transformer encoder and MLP that output 3D keypoints, with keypoint locations learned using unsupervised losses; for downstream fine-tuning, the pre-trained backbone is fine-tuned with a small amount of labeled point clouds.]
Figure 1. We present GC-KPL, a novel method for learning 3D human
keypoints from in-the-wild point clouds without any human labels. We
propose to learn keypoint locations using unsupervised losses that account
for the structure and movement of the human body. The backbone learns
useful semantics from unsupervised learning and can be used in down-
stream fine-tuning tasks to boost the performance of 3D keypoint estima-
tion.
training a pedestrian pose estimation model requires a large amount of high-quality in-the-wild data with ground truth labels. Annotating 3D human keypoints on point cloud data
is expensive, time consuming and error prone. Although
there are a few existing point cloud datasets with ground
truth human poses [11, 13, 21], they are limited in terms
of the quantity of the 3D annotations and diversity of the
data. Therefore, fully-supervised human keypoint detectors
trained on such datasets do not generalize well for long tail
cases. For this reason, previous approaches on pedestrian
3D keypoint estimation have mainly focused on utilizing 2D
weak supervision [4,32], which is easier to obtain, or leveraging signals from other modalities (e.g. RGB, depth) [29].
Nonetheless, there is a lot of useful information in the large
amount of unlabeled LiDAR data that previous works on
human pose estimation have not made an effort to utilize.
In this work, we propose a novel and effective method for
learning 3D human keypoints from in-the-wild point clouds
without using any manually labeled 3D keypoints. Our ap-
proach is built on top of the key observation that human
skeletons are roughly centered within approximately rigid
body parts and that the location and movement of the sur-
face points should explain the movement of the skeleton and
vice versa. To that end, we design novel unsupervised loss
terms for learning locations of the 3D keypoints/skeleton
within human point clouds which correspond to 3D loca-
tions of major joints of human body.
In the proposed method, we first train a transformer-
based regression model for predicting keypoints and a se-
mantic segmentation model for localizing body parts on synthetic data constructed from randomly posed SMPL hu-
man body model [15]. Then, we train on the entire Waymo
Open Dataset [21] without using any 3D ground-truth anno-
tation of human keypoints. Through unsupervised training,
keypoint predictions are refined and the backbone learns
useful information from a large amount of unannotated data.
In summary, we make the following contributions:
• We present GC-KPL, a method for learning human
3D keypoints from in-the-wild point clouds without any
manual keypoint annotations.
• Drawing insight from the structure and movement of
the human body, we propose three effective and novel
unsupervised losses for refining keypoints. We show
that the proposed losses are effective for unsupervised
keypoint learning on Waymo Open Dataset.
• Through downstream fine-tuning/few-shot experiments, we demonstrate that GC-KPL can be used as unsupervised representation learning for human point clouds, which opens up the possibility to utilize practically infinite amounts of sensor data to improve human pose understanding in autonomous driving.
2. Related Work
2.1. 3D Human Keypoint Estimation from Point Clouds
There have been a few works [19,31,34] about estimat-
ing 3D keypoints from clean and carefully-curated point
clouds [6], but 3D keypoint estimation from in-the-wild
point clouds is a much less studied problem. Due to the lack
of ground-truth 3D human pose annotations paired with Li-
DAR data, there have not been many works on 3D human keypoint estimation from LiDAR information. Among the
few point cloud datasets with 3D keypoint annotations, Li-
DARHuman26M [13] captures long-range human motions
with ground truth motion acquired by the IMU system and
pose information derived from SMPL models fitted into
point clouds. It is among the first few datasets which have
LiDAR point clouds synchronized with RGB images, but
the SMPL shape parameters are the same for all 13 subjects and it
does not feature in-the-wild pedestrians where there could
be much more background noise and occlusion. PedX [11]
offers 3D automatic pedestrian annotations obtained using
model fitting on different modalities, gathered effectively
from a single intersection with only 75 pedestrians (the sec-
ond intersection has only 218 frames, labels for the third
scene were not released). Waymo Open Dataset [21] has
more than 3,500 subjects from over 1,000 different in-the-
wild scenes with high-quality 2D and 3D manual annota-
tions. Despite the existence of these datasets, the few works
on 3D pose estimation from point clouds mostly rely on
weak supervision. HPERL model [4] trains on 2D ground-
truth pose annotations and uses a reprojection loss for the
3D pose regression task. Multi-modal model in [32] uses
2D labels on RGB images as weak supervision, and creates
pseudo ground-truth 3D joint positions from the projection
of annotated 2D joints. HUM3DIL [29] leverages RGB in-
formation with LiDAR points, by computing pixel-aligned
multi-modal features with the 3D positions of the LiDAR
signal. In contrast, our method does not use any RGB infor-
mation or weak supervision.
2.2. Unsupervised Keypoint Localization
There are a number of works that aim to recover 3D key-
points using self-supervised geometric reasoning [12, 22],
but they are limited to rigid objects. More recent unsuper-
vised methods work for articulated objects from monocu-
lar RGB data [9, 10, 18, 20, 24], multi-view data [16],
or point clouds [27], where authors suggest to condition
on the predicted keypoints and train a conditional genera-
tive model to supervise the keypoints through reconstruc-
tion losses. We propose a simpler pipeline where we apply
our novel unsupervised losses to the predicted keypoints di-
rectly and do not require additional models besides the key-
point predictor itself.
2.3. Self-supervised Learning for Point Clouds
Self-supervised representation learning has proven to be
remarkably useful in language [3, 17] and 2D vision tasks
[2,7]. As LiDAR sensors become more affordable and com-
mon, there has been an increasing amount of research inter-
est in self-supervised learning on 3D point clouds. Previous
works proposed to learn representations of object or scene
level point clouds through contrastive learning [8, 25, 30]
or reconstruction [23, 26, 28, 33], which is useful in down-
stream classification or segmentation tasks. In contrast, our
supervision signals come from the unique structure of the
human body and our learned backbone is particularly use-
ful in downstream human keypoint estimation tasks.
3. Method
In this section, we describe our complete training
pipeline which contains two stages.
In the first stage,
we initialize the model parameters on a synthetic dataset
(Sec. 3.1). The purpose of Stage I is to warm-up the model
with reasonable semantics. The second stage generalizes the model to real-world data. In this stage, we use
our unsupervised losses to refine the keypoint predictions
on in-the-wild point clouds (Sec. 3.2). An overview of our
pipeline is in Fig. 2.
3.1. Stage I: Initialization on Synthetic Data
In this stage, we initialize the model on a synthetic
dataset that is constructed by ray casting onto randomly
posed human mesh models (SMPL [15]). We describe de-
tails of synthetic data generation in Supplementary.
The goal of this stage is to train a model $f$ that takes a point cloud of a human $\mathbf{P} \in \mathbb{R}^{N \times 3}$ and outputs 3D locations of keypoints $\hat{\mathbf{Y}} \in \mathbb{R}^{(J+1) \times 3}$, as well as soft body part assignments (or part segmentation) $\hat{\mathbf{W}} \in \mathbb{R}^{N \times (J+1)}$ that contain the probability of each point $i$ belonging to body part $j \in [J]$ or the background.
$$\{\hat{\mathbf{Y}}, \hat{\mathbf{W}}\} = f(\mathbf{P}), \qquad \forall i \in [N], \ \sum_{j=1}^{J+1} \hat{\mathbf{W}}_{i,j} = 1 \tag{2}$$
Ground truth information about the part segmentation $\mathbf{W}$ and keypoint locations $\mathbf{Y}$ is readily available for synthetic data. Hence, we can train the model by directly supervising the predicted keypoints through an L2 loss,
$$\mathcal{L}_{kp} = \|\hat{\mathbf{Y}} - \mathbf{Y}\|_2 \tag{3}$$
and predicted segmentation through cross entropy loss,
$$\mathcal{L}_{seg} = -\sum_{i=1}^{N} \sum_{j=1}^{J+1} \mathbf{W}_{i,j} \log\big(\hat{\mathbf{W}}_{i,j}\big) \tag{4}$$
Overall, we minimize
$$\mathcal{L}_{syn} = \lambda_{kp} \mathcal{L}_{kp} + \lambda_{seg} \mathcal{L}_{seg} \tag{5}$$
Notably, in Sec. 4.6 we show that supervision in this stage is not required: ground truth $\mathbf{W}$ and $\mathbf{Y}$ can be replaced by surrogate ground truths to achieve comparable results.
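To make the Stage I objective concrete, the following is a minimal PyTorch-style sketch of Eq. (5); the tensor names, shapes, and the per-joint averaging are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stage1_loss(pred_kp, gt_kp, part_logits, gt_part, lambda_kp=0.5, lambda_seg=1.0):
    """Warm-up loss of Eq. (5): L2 keypoint loss plus cross-entropy segmentation loss.

    pred_kp:     (B, J+1, 3) predicted keypoint locations.
    gt_kp:       (B, J+1, 3) ground truth keypoints from the synthetic data.
    part_logits: (B, N, J+1) per-point body-part logits (softmax gives W_hat).
    gt_part:     (B, N) integer part labels in [0, J], background included.
    """
    # Eq. (3): L2 distance between predicted and ground truth keypoints.
    loss_kp = torch.norm(pred_kp - gt_kp, dim=-1).mean()
    # Eq. (4): cross entropy over the J+1 part classes for every point.
    loss_seg = F.cross_entropy(part_logits.transpose(1, 2), gt_part)
    return lambda_kp * loss_kp + lambda_seg * loss_seg
```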
3.2. Stage II: Self-Supervised Learning on In-the-
Wild Data
In this stage, we further refine the network using unsu-
pervised losses. The key insight behind the design of the
losses is that the human body is composed of limbs, each
of which is a rigid part. Therefore, points on a limb move
with the limb and should stay roughly at the same location
in each limb’s local coordinate system. To account for this,
we propose a flow loss that encourages the points to stay in the same location (up to rotation around the limb) within each limb's local cylindrical coordinates.
We start by formally defining the key ingredients in the
following formulations. In our setup, a human skeleton $L$ is composed of limbs, each of which connects two keypoints. A limb $l = (y_a, y_b) \in L$ is a line segment connecting the parent keypoint $y_a$ and the child keypoint $y_b$, and all surface points on this limb have segmentation label $a$.
All three proposed losses are in terms of surface points
in each predicted limb’s local coordinate system. There-
fore, we first convert all input points to each limb's local cylindrical coordinates and compute the radial and axial components. Specifically, we project point $p \in \mathbf{P}$ in the global frame onto the vector $\overrightarrow{\hat{y}_a \hat{y}_b}$, and calculate the norm of the projected vector
$$\mathbf{z}(p, \hat{l}) = \frac{(p - \hat{y}_a) \cdot (\hat{y}_b - \hat{y}_a)}{\|\hat{y}_b - \hat{y}_a\|_2} \tag{6}$$
and the distance between the point and $\overrightarrow{\hat{y}_a \hat{y}_b}$,
$$\mathbf{r}(p, \hat{l}) = \left\| p - \hat{y}_a - \mathbf{z}(p, \hat{l}) \, \frac{\hat{y}_b - \hat{y}_a}{\|\hat{y}_b - \hat{y}_a\|_2} \right\|_2 \tag{7}$$
For simplicity, we use $\mathbf{z}_{\hat{l}}(p)$ to denote $\mathbf{z}(p, \hat{l})$, and $\mathbf{r}_{\hat{l}}(p)$ to denote $\mathbf{r}(p, \hat{l})$ in the following.
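The limb-local coordinates of Eqs. (6) and (7) can be sketched as below; the function name, shapes, and the small epsilon are assumptions for illustration.

```python
import torch

def limb_local_coords(points, y_a, y_b, eps=1e-8):
    """Axial (z) and radial (r) coordinates of points w.r.t. the limb (y_a, y_b).

    points:   (N, 3) surface points in the global frame.
    y_a, y_b: (3,) parent and child keypoints of the limb.
    Returns z: (N,) signed length of the projection onto y_a -> y_b (Eq. 6),
            r: (N,) distance from each point to the limb axis (Eq. 7).
    """
    axis = y_b - y_a                          # limb direction
    axis_norm = torch.norm(axis) + eps
    z = (points - y_a) @ axis / axis_norm     # Eq. (6)
    # Remove the axial component and take the remaining distance, Eq. (7).
    r = torch.norm(points - y_a - z[:, None] * axis / axis_norm, dim=-1)
    return z, r
```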
Next, we describe the formulation of each loss function
in detail.
Flow Loss. The flow loss considers the predictions from two consecutive frames and encourages consistency of the radial and axial components of all points with respect to scene flow: limbs should move between frames in a way that keeps the radial and axial coordinates of all points constant.
Formally, we define the forward and backward flow losses
($\mathcal{L}_{ff}$ and $\mathcal{L}_{bf}$, respectively) for the limbs $\hat{l}^t = (\hat{y}^t_a, \hat{y}^t_b)$ and $\hat{l}^{t+1} = (\hat{y}^{t+1}_a, \hat{y}^{t+1}_b)$ formed by the predicted keypoints at timestamps $t$ and $t+1$.
$$\mathcal{L}_{ff} = \frac{1}{N} \sum_{i} \hat{\mathbf{W}}^{t}_{ia} \Big( \big| \mathbf{r}_{\hat{l}^{t+1}}(p_i^{t} + f_i^{t}) - \mathbf{r}_{\hat{l}^{t}}(p_i^{t}) \big| + \big| \mathbf{z}_{\hat{l}^{t+1}}(p_i^{t} + f_i^{t}) - \mathbf{z}_{\hat{l}^{t}}(p_i^{t}) \big| \Big) \tag{8}$$
$$\mathcal{L}_{bf} = \frac{1}{N} \sum_{i} \hat{\mathbf{W}}^{t+1}_{ia} \Big( \big| \mathbf{r}_{\hat{l}^{t}}(p_i^{t+1} + b_i^{t+1}) - \mathbf{r}_{\hat{l}^{t+1}}(p_i^{t+1}) \big| + \big| \mathbf{z}_{\hat{l}^{t}}(p_i^{t+1} + b_i^{t+1}) - \mathbf{z}_{\hat{l}^{t+1}}(p_i^{t+1}) \big| \Big) \tag{9}$$
$f^t$ is the forward flow for each point $p^t \in \mathbf{P}^t$ and $b^{t+1}$ is the backward flow for each point $p^{t+1} \in \mathbf{P}^{t+1}$. We use Neural Scene Flow Prior [14] to estimate flow between two consecutive frames of points. The overall flow loss for frame $t$ is
$$\mathcal{L}_{flow} = \frac{1}{|L|} \sum_{\hat{l}^t} \frac{\mathcal{L}_{ff} + \mathcal{L}_{bf}}{2} \tag{10}$$
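A hedged sketch of the forward half of Eqs. (8)-(10) for a single limb follows; `coords_fn` stands for a helper like `limb_local_coords` above, and the flow `flow_t` is assumed to come from an off-the-shelf estimator such as Neural Scene Flow Prior. The backward term and the average over limbs in Eq. (10) follow the same pattern.

```python
import torch

def forward_flow_loss(points_t, flow_t, w_parent_t, limb_t, limb_t1, coords_fn):
    """Forward flow term of Eq. (8) for one limb.

    points_t:   (N, 3) points of frame t.
    flow_t:     (N, 3) estimated forward scene flow for frame t.
    w_parent_t: (N,) soft assignment of each point to this limb's parent joint.
    limb_t, limb_t1: (y_a, y_b) endpoint pairs of the limb predicted at t and t+1.
    coords_fn:  callable(points, y_a, y_b) -> (z, r), e.g. limb_local_coords above.
    """
    z_t, r_t = coords_fn(points_t, *limb_t)
    # Warp the frame-t points forward with the flow and express them in the
    # local frame of the limb predicted at t+1.
    z_t1, r_t1 = coords_fn(points_t + flow_t, *limb_t1)
    # Radial and axial coordinates should stay constant under the motion.
    per_point = (r_t1 - r_t).abs() + (z_t1 - z_t).abs()
    return (w_parent_t * per_point).sum() / points_t.shape[0]
```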
[Figure 2 diagram: Stage I feeds synthetic data through a transformer encoder with MLP heads for 3D keypoints and part segmentation; Stage II feeds real data through the same backbone, trained with the unsupervised losses. (a) Flow loss: after moving, points stay in the same place (despite rotation around the axis) within each limb's local cylindrical coordinate system. (b) Points-to-limb loss: minimize the points-to-limb distance to encourage the limb to stay within the body. (c) Symmetry loss: points are symmetrical around the limb, i.e. points with similar height z have similar radius r.]
Figure 2. Overview of our method. In Stage I, we warm-up the keypoint predictor and body part segmentation predictor on a small synthetic dataset. Then,
in Stage II we refine the 3D keypoint predictions on a large in-the-wild dataset with unsupervised losses. The main losses are depicted on the bottom.
By design, the flow loss value is the same if the radial and
axial values for all points in a local coordinate system are
the same in consecutive frames. This would happen if a limb in both frames is shifted in its orthogonal direction by the same amount. Theoretically, it is un-
likely to happen for all limbs, but empirically we observe
that with flow loss alone the skeleton would move out of the
point cloud. Therefore, we need additional losses to make
the keypoints stay within the body.
Points-to-Limb Loss. For a predicted limb $\hat{l} = (\hat{y}_a, \hat{y}_b)$,
we want the points on this limb to be close to it. Hence, we
introduce a points-to-limb (p2l) loss
$$\mathcal{L}_{p2l}^{\hat{l}} = \frac{1}{N} \sum_{i} \hat{\mathbf{W}}_{ia} \, d(p_i, \hat{l}) \tag{11}$$
where $d$ is the Euclidean distance between a point and a line segment. We average over all limbs to get the overall points-to-limb loss,
$$\mathcal{L}_{p2l} = \frac{1}{|L|} \sum_{\hat{l}} \mathcal{L}_{p2l}^{\hat{l}} \tag{12}$$
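The distance $d$ in Eq. (11) is a point-to-segment distance; a small sketch (with assumed shapes) is given below, where clamping keeps the projection on the segment rather than on the infinite line.

```python
import torch

def points_to_limb_loss(points, y_a, y_b, w_parent):
    """Eq. (11): weighted mean Euclidean distance from points to the limb segment.

    points:   (N, 3) surface points.
    y_a, y_b: (3,) limb endpoints.
    w_parent: (N,) soft assignment of each point to this limb's parent joint.
    """
    axis = y_b - y_a
    # Fraction along the segment of each point's projection, clamped to [0, 1]
    # so that distances are measured to the segment, not the infinite line.
    t = ((points - y_a) @ axis / (axis @ axis + 1e-8)).clamp(0.0, 1.0)
    closest = y_a + t[:, None] * axis
    dist = torch.norm(points - closest, dim=-1)
    return (w_parent * dist).sum() / points.shape[0]
```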
Symmetry Loss. The symmetry loss encourages the predicted limb $\hat{l}$ to be in a position such that all points around this limb are roughly symmetrical around it. That is to say,
points with similar axial coordinates $\mathbf{z}_{\hat{l}}$ should have similar radial values $\mathbf{r}_{\hat{l}}$. To that end, we introduce the symmetry loss,
$$\mathcal{L}_{sym}^{\hat{l}} = \frac{1}{N} \sum_{i} \hat{\mathbf{W}}_{ia} \big( \mathbf{r}_{\hat{l}}(p_i) - \bar{\mathbf{r}}_{\hat{l}}(p_i) \big)^2 \tag{13}$$
where $\bar{\mathbf{r}}_{\hat{l}}(p_i)$ is the weighted mean of the radial values of points with axial coordinates similar to that of $p_i$,
$$\bar{\mathbf{r}}_{\hat{l}}(p_i) = \frac{\sum_{j} K_h\big(\mathbf{z}_{\hat{l}}(p_i), \mathbf{z}_{\hat{l}}(p_j)\big) \, (\hat{\mathbf{W}}_{i*} \cdot \hat{\mathbf{W}}_{j*}) \, \mathbf{r}_{\hat{l}}(p_j)}{\sum_{j} K_h\big(\mathbf{z}_{\hat{l}}(p_i), \mathbf{z}_{\hat{l}}(p_j)\big) \, (\hat{\mathbf{W}}_{i*} \cdot \hat{\mathbf{W}}_{j*})} \tag{14}$$
$K_h$ is a Gaussian kernel with bandwidth $h$, i.e. $K_h(x, y) = e^{-\left(\frac{x-y}{h}\right)^2}$. $\hat{\mathbf{W}}_{i*} \in \mathbb{R}^{J+1}$ is the $i$-th row of $\hat{\mathbf{W}}$, and the dot product $\hat{\mathbf{W}}_{i*} \cdot \hat{\mathbf{W}}_{j*}$ measures the similarity of the part assignments of points $i$ and $j$, as we want $\bar{\mathbf{r}}_{\hat{l}}(p_i)$ to be calculated using points from the same part as point $i$.
The overall symmetry loss averages over all limbs,
$$\mathcal{L}_{sym} = \frac{1}{|L|} \sum_{l \in L} \mathcal{L}_{sym}^{l} \tag{15}$$
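A sketch of Eqs. (13)-(14) for one limb follows, assuming the axial and radial coordinates have already been computed with a helper like the one above; shapes and the small epsilon are illustrative.

```python
import torch

def symmetry_loss(z, r, w_soft, parent_idx, h=0.1):
    """Eqs. (13)-(14) for one limb.

    z, r:       (N,) axial and radial coordinates of the points in this limb's
                local frame (e.g. computed with limb_local_coords above).
    w_soft:     (N, J+1) soft part assignments W_hat.
    parent_idx: index a of this limb's parent joint.
    h:          Gaussian kernel bandwidth.
    """
    # Pairwise Gaussian kernel on axial coordinates and part-assignment
    # similarity, i.e. the weights inside Eq. (14).
    k = torch.exp(-((z[:, None] - z[None, :]) / h) ** 2)
    sim = w_soft @ w_soft.T
    weights = k * sim
    # Kernel-weighted mean radius around each point, Eq. (14).
    r_bar = (weights * r[None, :]).sum(dim=1) / (weights.sum(dim=1) + 1e-8)
    # Eq. (13): weighted squared deviation of each radius from the local mean.
    return (w_soft[:, parent_idx] * (r - r_bar) ** 2).sum() / z.shape[0]
```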
Joint-to-Part Loss. In addition, we encourage each joint
to be close to the center of the points on that part using a
joint-to-part loss,
$$\mathcal{L}_{j2p}^{j} = \left\lVert \hat{y}_j - \frac{\sum_i \hat{\mathbf{W}}_{ij} \, p_i}{\sum_i \hat{\mathbf{W}}_{ij}} \right\rVert_2 \tag{16}$$
We average over all joints to get the overall joint-to-part loss,
$$\mathcal{L}_{j2p} = \frac{1}{J} \sum_{j} \mathcal{L}_{j2p}^{j} \tag{17}$$
Note that although the ground truth joint locations are not at the centers of the points on the corresponding parts, keeping this loss is essential for making the unsupervised training more robust.
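Eqs. (16)-(17) reduce to pulling each joint toward the soft centroid of its assigned points; a minimal sketch, with assumed shapes, is shown below.

```python
import torch

def joint_to_part_loss(pred_kp, points, w_soft, eps=1e-8):
    """Eqs. (16)-(17): distance from each joint to the weighted centroid of its points.

    pred_kp: (J, 3) predicted joint locations (background column excluded).
    points:  (N, 3) surface points.
    w_soft:  (N, J) soft assignment of points to the J joints.
    """
    # Soft centroid of the points belonging to each joint, Eq. (16).
    centroids = (w_soft.T @ points) / (w_soft.sum(dim=0)[:, None] + eps)  # (J, 3)
    # Eq. (17): average joint-to-centroid distance.
    return torch.norm(pred_kp - centroids, dim=-1).mean()
```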
In practice, jointly optimizing $\hat{\mathbf{W}}$ and $\hat{\mathbf{Y}}$ in Stage II leads to unstable training curves. Hence, we use the pre-trained
Figure 3. Effect of the unsupervised losses on a perturbed skeleton (left: perturbed skeleton; middle: optimized skeleton; right: optimized vs. ground truth skeleton).
segmentation branch from Stage I to infer segmentation labels on all of the training samples at the beginning of Stage II, and $\hat{\mathbf{W}}$ is the one-hot encoding of the predicted segmentation labels.
Segmentation Loss. Lastly, we notice that keeping the
segmentation loss at this stage further regularizes the back-
bone and leads to better quantitative performance. We use
the inferred segmentation $\hat{\mathbf{W}}$ as the surrogate ground truth and minimize the cross entropy as in Eq. (4).
Training objective. The overall training objective dur-
ing Stage II is to minimize
$$\mathcal{L} = \lambda_{flow} \mathcal{L}_{flow} + \lambda_{p2l} \mathcal{L}_{p2l} + \lambda_{sym} \mathcal{L}_{sym} + \lambda_{j2p} \mathcal{L}_{j2p} + \lambda_{seg} \mathcal{L}_{seg} \tag{18}$$
To illustrate the effect of the three unsupervised losses ($\mathcal{L}_{flow}$, $\mathcal{L}_{p2l}$ and $\mathcal{L}_{sym}$), we show the result of applying these losses to a perturbed ground truth skeleton (Fig. 3). As shown, the proposed unsupervised losses effectively move the perturbed skeleton to locations that are closer to the ground truth.
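The overall Stage II objective of Eq. (18) is then just a weighted sum of the terms above; the sketch below assumes the weights reported in Sec. 4.1 and scalar loss values produced by sketches like the ones earlier in this section.

```python
def stage2_loss(terms, weights=None):
    """Eq. (18): weighted sum of the Stage II loss terms.

    terms: dict with keys 'flow', 'p2l', 'sym', 'j2p', 'seg' mapping to scalar losses.
    """
    if weights is None:
        # Assumed default weights, taken from the values reported in Sec. 4.1.
        weights = {'flow': 0.02, 'p2l': 0.01, 'sym': 0.5, 'j2p': 2.0, 'seg': 0.5}
    return sum(weights[name] * value for name, value in terms.items())
```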
4. Experiments
4.1. Implementation Details
The predictor model $f$ consists of a transformer backbone with fully connected layers for predicting joints and segmentation, respectively. We use the same transformer backbone as in HUM3DIL [29]. A fully connected layer is applied to the output of the transformer to regress the predicted $\hat{\mathbf{W}}$ and $\hat{\mathbf{Y}}$, respectively. There are 352,787 train-
able parameters in total. We set the maximum number of
input LiDAR points to 1024, and zero-pad or downsample point clouds with fewer or more points, respectively. The
flow is obtained using a self-supervised test-time optimiza-
tion method [14]. The network is trained on 4 TPUs. We
train Stage I for 200 epochs and Stage II for 75 epochs, both
with batch size 32, base learning rate of 1e−4, and exponen-
tial decay 0.9. Stages I and II each finish in about 6 hours.
The loss weights in Eq. (5) are λkp = 0.5 and λseg = 1.
The loss weights in Eq. (18) are λflow = 0.02, λp2l = 0.01,
λsym = 0.5, λj2p = 2, and λseg = 0.5. The kernel bandwidth in Eq. (14) is 0.1.
4.2. Dataset and Metrics
We construct a synthetic dataset with 1,000 sequences of
16-frame raycasted point clouds for Stage I training. Each
sequence starts with the same standing pose and ends in
a random pose. We find that data augmentation is essen-
tial in Stage I training. To simulate real-world noisy back-
ground and occlusion, we apply various data augmentations
to the synthetic data, including random downsampling, random masking, adding ground clusters, adding background clusters, adding a second person, adding per-point noise, and scaling the person. We include examples of augmented synthetic data in
Fig. 4.
Figure 4. Data augmentations applied to the synthetic point clouds (colored by ground truth segmentation labels): add ground, scale, add garbage background, add second person, downsample, random crop, drop a part, and add random noise. Ground truth skeletons are shown in purple. Background points are in blue.
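For illustration, two of these augmentations are sketched below in NumPy; the parameter ranges and the background class id are assumptions, not necessarily the values used in the actual pipeline.

```python
import numpy as np

def random_downsample(points, labels, keep_ratio=0.5, rng=np.random):
    """Randomly keep a subset of the points (and their part labels)."""
    n_keep = max(1, int(keep_ratio * len(points)))
    idx = rng.choice(len(points), size=n_keep, replace=False)
    return points[idx], labels[idx]

def add_ground_cluster(points, labels, n_extra=100, rng=np.random):
    """Append a flat cluster of background points near the person's feet."""
    z_min = points[:, 2].min()
    ground = np.stack([
        rng.uniform(points[:, 0].min() - 0.5, points[:, 0].max() + 0.5, n_extra),
        rng.uniform(points[:, 1].min() - 0.5, points[:, 1].max() + 0.5, n_extra),
        rng.normal(z_min, 0.02, n_extra),      # thin slab at foot height
    ], axis=1)
    background_label = labels.max() + 1        # assumed background class id
    return (np.concatenate([points, ground]),
            np.concatenate([labels, np.full(n_extra, background_label)]))
```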
In Stage II, we train on the entire Waymo Open Dataset (WOD) training set (with around 200,000 unlabeled sam-
ples). As the official WOD testing subset is hidden from
the public, we randomly choose 50% of the validation set as
the validation split, and the rest as the test split for bench-
marking. We report average Mean Per Joint Position Error
(MPJPE) on the test set at the end of each stage. Formally, for a single sample, let $\hat{Y} \in \mathbb{R}^{J \times 3}$ be the predicted keypoints, $Y \in \mathbb{R}^{J \times 3}$ the ground truth keypoints, and $v \in \{0,1\}^J$ the visibility indicator annotated per keypoint.
$$\text{MPJPE}(Y, \hat{Y}) = \frac{1}{\sum_{j} v_j} \sum_{j \in [J]} v_j \, \|y_j - \hat{y}_j\|_2 \tag{19}$$
Note that in this stage, we do Hungarian matching between
the predicted and annotated keypoints per frame, and then
report MPJPE on matched keypoints. We report matched
MPJPE because the method is intended for scenarios where
correspondence between keypoints in the unlabeled training
data and downstream data is unknown.
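A sketch of the matched MPJPE used for evaluation: predictions and annotations are paired with Hungarian matching (scipy's `linear_sum_assignment`) before applying Eq. (19) to the visible joints; the shapes and single-sample interface are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_mpjpe(pred, gt, visibility):
    """Eq. (19) after Hungarian matching of predicted and ground truth keypoints.

    pred, gt:   (J, 3) predicted and ground truth keypoints.
    visibility: (J,) 0/1 indicator of annotated keypoints.
    """
    # Cost matrix of pairwise distances between predictions and ground truth.
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)   # (J, J)
    row, col = linear_sum_assignment(cost)
    errors = np.linalg.norm(pred[row] - gt[col], axis=-1)
    vis = visibility[col].astype(bool)
    return errors[vis].sum() / max(vis.sum(), 1)
```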
Figure 5. Visualizations of predictions on WOD at the end of Stage I and Stage II (columns: ground truth, prediction after Stage I, prediction after Stage II). Points are colored by predicted segmentation labels. Ground truth keypoints are in green and predicted keypoints and skeletons are in red.
4.3. Results
In this section we perform a quantitative evaluation of GC-KPL at the end of Stages I and II in Tab. 2. Qualitative results are in Fig. 5. As shown, after the first stage, where we train on a synthetic dataset constructed from posed body models with carefully chosen data augmentations, we are able to predict reasonable human keypoints on in-the-wild point clouds. In the second stage, our novel unsupervised losses further refine the predicted keypoints.
4.4. Downstream Task: Few-shot 3D Keypoint Learning
In this experiment, we show that the backbone of our model benefits from unsupervised training on a large amount of unlabeled data and can be useful for downstream fine-tuning tasks. We start from our pre-trained backbone after Stage II and fine-tune with annotated training samples from WOD by minimizing the mean per joint error. We include few-shot experiments where we fine-tune with an extremely small amount of data (10% and 1% of the training set), to represent challenging scenarios where there is a limited amount of annotated data.
We include the LiDAR-only version of HUM3DIL (a
state-of-the-art model on WOD) [29] as a strong baseline.
The quantitative results (Tab. 1) suggest that our back-
bone learns useful information from the unlabeled in-the-
wild data and enables a significant performance boost on
the downstream tasks. Compared to a randomly initialized backbone as used in HUM3DIL, our backbone leads to a decrease of over 2 cm in MPJPE in downstream fine-tuning experiments, which is a significant improvement for the 3D human keypoint estimation task.
We visualize the predicted keypoints under different data regimes in Fig. 6. As shown, models fine-tuned from our backbone are able to capture fine details on the arms and overall produce more accurate results than HUM3DIL.
To the best of our knowledge, there is no previous work on completely unsupervised human keypoint estimation from point clouds. We additionally experiment
with using a readout layer on top of the features learned by a
state-of-the-art point cloud SSL method 3D-OAE [30], but
the MPJPE is 15 cm (compared to 10.10 cm from GC-KPL).
Hence we consider the baselines we adopt here strong and
complete. In Sec. 4.6, we further challenge our method by
comparing to the domain adaptation setup and demonstrate
that the performance of GC-KPL is still superior.
4.5. Domain adaptation
The configuration where we use ground truth labels in Stage I and unsupervised training in Stage II can be seen as a domain adaptation (DA) technique. Thus it is useful to compare the proposed method with a commonly used domain adaptation method. We train the same backbone model us-
ing a mix of real and synthetic data and a gradient reversal
layer (aka DA loss) [5] to help the network to learn domain
invariant keypoint features. Results in Tab. 3 demonstrate
that GC-KPL yields superior accuracy compared with the
DA method (MPJPE 10.1 vs 11.35 cm).
4.6. Ablations
Effect of using GT bounding boxes in pre-processing.
We cropped human point clouds from the entire scene by
including only points within GT bounding boxes. We also
conducted experiments where we train with detected bound-
ing boxes from raw LiDAR scans using a SoTA 3D detector.
Results suggest that GC-KPL is robust to noise in 3D detec-
tion, as there were no noticeable changes in metrics.
Effect of synthetic dataset size. In our method, Stage I serves as a model initialization step, where we show that training on a small synthetic dataset (16,000 samples) with properly chosen data augmentations is sufficient for the model to learn useful semantics. We further investigate the effect of synthetic dataset size during Stage I. We experiment with larger dataset sizes (160,000 and 1,600,000 samples) and observe that the effect of increasing the synthetic dataset size on MPJPE_matched at the end of Stage I is insignificant: it decreased from 17.7 cm to 17.6 cm. The lack of notable improvement for larger dataset sizes is likely due to the limited variability of generated poses in the synthetic data (see Supplementary for details).
Figure 6. Predicted keypoints from fine-tuning with different amounts of annotated data: (a) fine-tuned on 100% of the training set, (b) on 10%, and (c) on 1%, comparing HUM3DIL, ours, and the ground truth. The points are colored by segmentation labels predicted by our model. Predicted keypoints are shown in red.
| Method | Backbone | Stage I supervised | 1% training set MPJPE cm (gain) | 10% training set MPJPE cm (gain) | 100% training set MPJPE cm (gain) |
|---|---|---|---|---|---|
| HUM3DIL [29] | Randomly initialized | | 19.57 | 16.36 | 12.21 |
| GC-KPL | Pre-trained on synthetic only | ✔ | 18.52 (-1.05) | 15.10 (-1.26) | 11.27 (-0.94) |
| | Pre-trained on 5,000 WOD-train | ✔ | 17.87 (-1.70) | 14.51 (-1.85) | 10.73 (-1.48) |
| | Pre-trained on 200,000 WOD-train | | 17.80 (-1.77) | 14.30 (-2.06) | 10.60 (-1.61) |
| | Pre-trained on 200,000 WOD-train | ✔ | 17.20 (-2.37) | 13.40 (-2.96) | 10.10 (-2.11) |

Table 1. Downstream fine-tuning results. Check marks in "Stage I supervised" mean that we use ground truth part labels in Stage I; otherwise we use KMeans labels.
| Training data | MPJPE_matched (↓) |
|---|---|
| Synthetic only | 17.70 |
| 5,000 WOD-train | 14.64 |
| 200,000 WOD-train | 13.92 |

Table 2. Unsupervised learning (Stage II) results.
| Domain distribution | DA loss | MPJPE (↓) |
|---|---|---|
| 100% real | | 12.21 |
| 50/50% real/synthetic | | 12.08 |
| 50/50% real/synthetic | ✔ | 11.35 |

Table 3. Unsupervised domain adaptation results evaluated on the WOD validation set.
Effect of using ground truths on synthetic data. While
our described pipeline does not use any kind of manual la-
bels, we do use ground truth segmentation and keypoints on the synthetic dataset in Stage I because they are readily avail-
able. Here we further experiment with a variation where we
do not use any kind of ground truths in Stage I (first row in
Tab. 4). Instead, we use KMeans clusters and cluster centers
as surrogate ground truths for model initialization, similar
to [1]. Note that we are able to establish correspondence
between KMeans clusters from different samples due to the
fact that in our data generation process, each synthetic se-
quence starts with the same starting standing pose. Hence,
we can run KMeans clustering on the starting pose that is shared among all sequences, and for subsequent samples within each sequence, we do Hungarian matching using inter-cluster Chamfer distance to establish correspondence between clusters from consecutive frames.
| No. | Exp. | Lkp (I) | Lseg (I) | MPJPE_matched (I) | Lj2p | Lseg | Lsym | Lp2l | Lflow | MPJPE_matched (II) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Effect of using KMeans labels in Stage I | ✔ | ✔ | 19.2 | ✔ | ✔ | ✔ | ✔ | ✔ | 14.5 |
| 2 | Effect of Lkp in Stage I | | ✔ | N/A | ✔ | ✔ | ✔ | ✔ | ✔ | 14.2 |
| 3 | Effect of warmup losses in Stage II | | | | | ✔ | ✔ | ✔ | ✔ | 15.0 |
| 4 | | | | | ✔ | | ✔ | ✔ | ✔ | 14.2 |
| 5 | | | | | | | ✔ | ✔ | ✔ | 15.2 |
| 6 | Effect of unsupervised losses in Stage II | | | | | | | ✔ | ✔ | 30.1 |
| 7 | | | | | | | ✔ | | ✔ | 15.6 |
| 8 | | | | | | | ✔ | ✔ | | 25.7 |
| 9 | | | | | ✔ | ✔ | | ✔ | ✔ | 14.3 |
| 10 | | | | | ✔ | ✔ | ✔ | | ✔ | 14.9 |
| 11 | | | | | ✔ | ✔ | ✔ | ✔ | | 14.4 |
| 12 | | | | | ✔ | ✔ | | | | 14.9 |
| | Full model (GC-KPL) | ✔ | ✔ | 17.7 | ✔ | ✔ | ✔ | ✔ | ✔ | 13.9 |

Table 4. Ablation studies on the effect of individual loss terms in our method. Experiments 3 through 12 use both losses in Stage I. The full model uses GT labels for Stage I.
We observe that although initializing with surrogate ground truths leads to
slightly inferior performance in Stage I, after training with
the losses in Stage II the drop in performance is less visible.
Overall, downstream fine-tuning performance is compara-
ble to our best model (10.6/14.3/17.8 vs. 10.1/13.4/17.2 cm
when fine-tuned on 100%/10%/1% of the data, see Tab. 1).
This experiment suggests that the method does not require any kind of ground truths, even during the initialization stage.
Effect of Losses. In this section we further investigate
the effect of each component in our pipeline (Tab. 4). First,
we note that Lseg in Stage I is essential because we need an
initialized segmentation model to get the body part assign-
ment for each point in order to calculate the losses in Stage
II. Therefore, we only experiment with a variation of Stage
I training without Lkp, and we observe that Lkp is useful
in warming up the backbone for later stages. Next, we take
the backbone from Stage I (trained with both Lkp and Lseg),
and study the effect of individual losses in Stage II. Experi-
ments No. 3/4/5 show that it is helpful to include Lj2p and
Lseg while having all other three unsupervised losses. In ex-
periments 6/7/8 we take out Lj2p and Lseg, and investigate
the effect of the individual unsupervised losses. As shown, the training becomes rather unstable if we further eliminate any
of the three losses. We observe qualitatively that the metric
worsens drastically because the limbs quickly move out of
the human body. Experiments No. 3/4/5 suggest that Lj2p
and Lseg are useful regularizers that make sure the limbs
stay within the body, and the unsupervised losses further
improve the performance by refining the keypoint location.
4.7. Limitations and Future Work
The task of keypoint localization can be considered a dual problem of semantic segmentation. In this work we
use a simple segmentation network based on the same archi-
tecture as our keypoint estimation model. Using a superior
segmentation model could lead to further improvements.
The proposed flow loss depends on the quality of the estimated flow of LiDAR points. In this work we used a simple but reasonable method, Neural Scene Flow Prior [14], to estimate flow between two frames of LiDAR points. The quality of the unsupervised keypoint estimation could be improved by using a more advanced flow estimator tailored for point clouds on human body surfaces.
Lastly, we use a part of the HUM3DIL [29] model which
takes only LiDAR point cloud as input. The full HUM3DIL
model was designed for multi-modal inputs and attains bet-
ter performance. Thus, another interesting direction is to
leverage multi-modal inputs.
5. Conclusion
In this work, we approached the problem of 3D human pose estimation from point clouds in the wild and introduced a method (GC-KPL) for learning 3D human keypoints from point clouds without using any manual 3D keypoint annotations. We showed that the proposed novel losses are effective for unsupervised keypoint learning on the Waymo Open Dataset. Through downstream experiments we demonstrated that GC-KPL can additionally serve as a self-supervised representation learning method for large quantities of in-the-wild human point clouds. In addition, GC-KPL compares favorably with a commonly used domain adaptation technique. The few-shot experiments empirically verified that using only 10% of the available 3D keypoint annotations, the fine-tuned model reaches performance comparable to the state-of-the-art model trained on the entire dataset. These results open up exciting possibilities to utilize the massive amounts of sensor data in autonomous driving to improve pedestrian 3D keypoint estimation.
References
[1] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and
Matthijs Douze. Deep clustering for unsupervised learning
of visual features. In European Conference on Computer Vi-
sion, 2018. 7
[2] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Ar-
mand Joulin. Unsupervised pre-training of image features
on non-curated data. In Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pages 2959–2968,
2019. 2
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova.
Bert:
Pre-training of deep bidirectional
transformers for language understanding.
arXiv preprint
arXiv:1810.04805, 2018. 2
[4] Michael Fürst, Shriya TP Gupta, René Schuster, Oliver Wasenmüller, and Didier Stricker. Hperl: 3d human pose es-
timation from rgb and lidar. In 2020 25th International Con-
ference on Pattern Recognition (ICPR), pages 7321–7327.
IEEE, 2021. 1, 2
[5] Yaroslav Ganin and Victor Lempitsky.
Unsupervised do-
main adaptation by backpropagation. In Francis Bach and
David Blei, editors, Proceedings of the 32nd International
Conference on Machine Learning, volume 37 of Proceed-
ings of Machine Learning Research, pages 1180–1189, Lille,
France, 07–09 Jul 2015. PMLR. 6
[6] Albert Haque, Boya Peng, Zelun Luo, Alexandre Alahi, Ser-
ena Yeung, and Li Fei-Fei. Towards viewpoint invariant 3d
human pose estimation. In European conference on com-
puter vision, pages 160–177. Springer, 2016. 2
[7] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
Girshick. Momentum contrast for unsupervised visual rep-
resentation learning. In Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition, pages
9729–9738, 2020. 2
[8] Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu.
Spatio-temporal self-supervised representation learning for
3d point clouds.
In Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pages 6535–6545,
2021. 2
[9] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea
Vedaldi. Unsupervised learning of object landmarks through
conditional image generation. Advances in neural informa-
tion processing systems, 31, 2018. 2
[10] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea
Vedaldi. Self-supervised learning of interpretable keypoints
from unlabelled videos. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 8787–8797, 2020. 2
[11] Wonhui Kim, Manikandasriram Srinivasan Ramanagopal,
Charles Barto, Ming-Yuan Yu, Karl Rosaen, Nick Goumas,
Ram Vasudevan, and Matthew Johnson-Roberson.
Pedx:
Benchmark dataset for metric 3-d pose estimation of pedes-
trians in complex urban intersections. IEEE Robotics and
Automation Letters, 4(2):1940–1947, 2019. 1, 2
[12] Jiaxin Li and Gim Hee Lee. Usip: Unsupervised stable inter-
est point detection from 3d point clouds. In Proceedings of
the IEEE/CVF international conference on computer vision,
pages 361–370, 2019. 2
[13] Jialian Li, Jingyi Zhang, Zhiyong Wang, Siqi Shen, Chenglu
Wen, Yuexin Ma, Lan Xu, Jingyi Yu, and Cheng Wang. Li-
darcap: Long-range marker-less 3d human motion capture
with lidar point clouds. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 20502–20512, 2022. 1, 2
[14] Xueqian Li, Jhony Kaesemodel Pontes, and Simon Lucey.
Neural scene flow prior.
Advances in Neural Information
Processing Systems, 34:7838–7851, 2021. 3, 5, 8
[15] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard
Pons-Moll, and Michael J Black. Smpl: A skinned multi-
person linear model. ACM transactions on graphics (TOG),
34(6):1–16, 2015. 2, 3
[16] Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya
Harada, and Orazio Gallo. Watch it move: Unsupervised
discovery of 3d joints for re-posing of articulated objects.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 3677–3687, 2022. 2
[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
Amodei, Ilya Sutskever, et al. Language models are unsu-
pervised multitask learners. OpenAI blog, 1(8):9, 2019. 2
[18] Luca Schmidtke, Athanasios Vlontzos, Simon Ellershaw,
Anna Lukens, Tomoki Arichi, and Bernhard Kainz. Unsu-
pervised human pose estimation through transforming shape
templates.
In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 2484–
2494, 2021. 2
[19] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp,
Mark Finocchio, Richard Moore, Alex Kipman, and Andrew
Blake. Real-time human pose recognition in parts from sin-
gle depth images. In CVPR 2011, pages 1297–1304. IEEE,
2011. 2
[20] Jennifer J Sun, Serim Ryou, Roni H Goldshmid, Bran-
don Weissbourd, John O Dabiri, David J Anderson, Ann
Kennedy, Yisong Yue, and Pietro Perona. Self-supervised
keypoint discovery in behavioral videos. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 2171–2180, 2022. 2
[21] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,
Yuning Chai, Benjamin Caine, et al. Scalability in perception
for autonomous driving: Waymo open dataset. In Proceed-
ings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 2446–2454, 2020. 1, 2
[22] Supasorn Suwajanakorn, Noah Snavely, Jonathan J Tomp-
son, and Mohammad Norouzi. Discovery of latent 3d key-
points via end-to-end geometric reasoning. Advances in neu-
ral information processing systems, 31, 2018. 2
[23] Hanchen Wang, Qi Liu, Xiangyu Yue, Joan Lasenby, and
Matt J Kusner. Unsupervised point cloud pre-training via oc-
clusion completion. In Proceedings of the IEEE/CVF inter-
national conference on computer vision, pages 9782–9792,
2021. 2
[24] Yuefan Wu, Zeyuan Chen, Shaowei Liu, Zhongzheng Ren,
and Shenlong Wang. Casa: Category-agnostic skeletal ani-
mal reconstruction. arXiv preprint arXiv:2211.03568, 2022.
2
[25] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas
Guibas, and Or Litany.
Pointcontrast: Unsupervised pre-
training for 3d point cloud understanding. In European con-
ference on computer vision, pages 574–591. Springer, 2020.
2
[26] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Fold-
ingnet: Point cloud auto-encoder via deep grid deformation.
In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 206–215, 2018. 2
[27] Yang You, Wenhai Liu, Yanjie Ze, Yong-Lu Li, Weiming
Wang, and Cewu Lu. Ukpgan: A general self-supervised
keypoint detector. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
17042–17051, 2022. 2
[28] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie
Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud
transformers with masked point modeling. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 19313–19322, 2022. 2
[29] Andrei Zanfir, Mihai Zanfir, Alex Gorban, Jingwei Ji,
Yin Zhou, Dragomir Anguelov, and Cristian Sminchisescu.
Hum3dil: Semi-supervised multi-modal 3d human pose esti-
mation for autonomous driving. In 6th Annual Conference
on Robot Learning, 2022. 1, 2, 5, 6, 7, 8
[30] Zaiwei Zhang, Rohit Girdhar, Armand Joulin, and Ishan
Misra.
Self-supervised pretraining of 3d features on any
point-cloud. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 10252–10263, 2021.
2, 6
[31] Zihao Zhang, Lei Hu, Xiaoming Deng, and Shihong Xia.
Weakly supervised adversarial learning for 3d human pose
estimation from point clouds. IEEE transactions on visual-
ization and computer graphics, 26(5):1851–1859, 2020. 2
[32] Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua
Mao, Yang Song, Charles R Qi, Ting Liu, Visesh Chari, An-
dre Cornman, Yin Zhou, et al. Multi-modal 3d human pose
estimation with 2d weak supervision in autonomous driving.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 4478–4487, 2022. 1,
2
[33] Junsheng Zhou, Xin Wen, Yu-Shen Liu, Yi Fang, and
Zhizhong Han.
Self-supervised point cloud representa-
tion learning with occlusion auto-encoder.
arXiv preprint
arXiv:2203.14084, 2022. 2
[34] Yufan Zhou, Haiwei Dong, and Abdulmotaleb El Saddik.
Learning to estimate 3d human pose from point cloud. IEEE
Sensors Journal, 20(20):12334–12342, 2020. 2
Supplementary Material for
Unsupervised Learning of 3D Human Keypoints from Point Clouds in the Wild
Zhenzhen Weng1*
Alexander S. Gorban2
Jingwei Ji2
Mahyar Najibi2
Yin Zhou2
Dragomir Anguelov2
1Stanford University
2Waymo
1. Synthetic Data Generation
In our described Stage I, we initialize the model on a
synthetic dataset that is constructed by ray casting onto ran-
domly posed human mesh models (SMPL [1]). Here we
elaborate on the synthetic data generation process. We gen-
erate 1,000 16-frame sequences. Each sequence has a ran-
dom SMPL body shape, and starts with the same standing
pose and ends in a random pose. The poses in the middle of
the sequence are linearly interpolated between the starting
and ending poses.
The ending pose was created by adding random noise to
the rotation angles of each joint in the standing pose. To
create realistic pedestrian poses, we add up to 60 degrees
of random noise to the shoulder and elbow joint angles, and
up to 30 degrees to the thigh and knee joints, and up to 5
degrees of noise to all other joints.
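For illustration, the ending-pose sampling could look like the sketch below; the joint index sets and the degree-valued pose representation are assumptions, not the exact SMPL joint indices or parameterization used.

```python
import numpy as np

def random_ending_pose(standing_pose, shoulder_elbow, thigh_knee, rng=np.random):
    """Add bounded random noise (in degrees) to the standing pose's joint angles.

    standing_pose:  (J, 3) per-joint rotation angles of the standing pose, in degrees.
    shoulder_elbow: list of joint indices for shoulders and elbows (placeholder).
    thigh_knee:     list of joint indices for thighs and knees (placeholder).
    """
    noise = rng.uniform(-5.0, 5.0, standing_pose.shape)          # up to 5 deg everywhere
    noise[shoulder_elbow] = rng.uniform(-60.0, 60.0, (len(shoulder_elbow), 3))
    noise[thigh_knee] = rng.uniform(-30.0, 30.0, (len(thigh_knee), 3))
    return standing_pose + noise
```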
To simulate LiDAR point clouds, we place the human
meshes at a distance of 6 to 17 meters from a ray caster and
keep the faces that intersect with the rays. As in [2], we
use 2650 vertical scans (with 360 degree coverage), and 64
LiDAR beams. We do not consider rolling shutter and other
LiDAR artifacts for simplicity.
We construct 2-frame samples by taking consecutive
frames from each sequence, and the same data augmenta-
tion is applied to both frames in each sample.
2. Additional Qualitative Results
In Fig. 1, we include additional qualitative results from the model fine-tuned on 100% of the training data. We show typical failure cases on WOD in Fig. 2, which are caused by occlusion (left and middle column) and incorrect segmentation of the point cloud (right column).
Figure 1. Additional qualitative results. The points are colored by predicted segmentation labels. Ground truth keypoints are in green and predicted keypoints and skeletons are in red.
There is an animated visualization in the attachment. It
demonstrates the effect of our unsupervised losses (Lflow,
Lp2l and Lsym). We perturb the ground truth keypoints by
adding random noise (Gaussian noise with 0 mean and 6
cm standard deviation) to each keypoint. Then, we min-
imize these three losses with respect to the keypoint locations. We minimize with the Adam optimizer with learning rate 1e-3 for 100 iterations. The weights for the loss terms are λflow = 0.2, λp2l = 0.1, and λsym = 5. As shown, as a result of the optimization process, the keypoints move toward the unperturbed locations over time.
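The optimization in this visualization can be sketched as follows; `unsupervised_losses` is a placeholder for the lambda-weighted combination of the flow, points-to-limb, and symmetry terms, and the interface is an assumption.

```python
import torch

def refine_perturbed_skeleton(gt_keypoints, points, unsupervised_losses,
                              noise_std=0.06, lr=1e-3, iters=100):
    """Perturb ground truth keypoints and recover them by minimizing the losses.

    gt_keypoints: (J, 3) ground truth keypoints in meters.
    points:       (N, 3) the observed point cloud.
    unsupervised_losses: callable(keypoints, points) -> scalar loss (placeholder
                         for the weighted flow + points-to-limb + symmetry terms).
    """
    # Perturb with zero-mean Gaussian noise (6 cm standard deviation).
    keypoints = (gt_keypoints +
                 noise_std * torch.randn_like(gt_keypoints)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([keypoints], lr=lr)
    for _ in range(iters):
        optimizer.zero_grad()
        loss = unsupervised_losses(keypoints, points)
        loss.backward()
        optimizer.step()
    return keypoints.detach()
```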
Figure 2. Failure cases.
References
[1] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015. 1
[2] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,
Yuning Chai, Benjamin Caine, et al. Scalability in perception
for autonomous driving: Waymo open dataset. In Proceedings
of the IEEE/CVF conference on computer vision and pattern
recognition, pages 2446–2454, 2020. 1