3D Human Keypoints Estimation from Point Clouds in the Wild without Human Labels

Zhenzhen Weng 1*, Alexander S. Gorban 2, Jingwei Ji 2, Mahyar Najibi 2, Yin Zhou 2, Dragomir Anguelov 2
1 Stanford University, 2 Waymo

Abstract

Training a 3D human keypoint detector from point clouds in a supervised manner requires large volumes of high-quality labels. While it is relatively easy to capture large amounts of human point clouds, annotating 3D keypoints is expensive, subjective, error-prone and especially difficult for long-tail cases (pedestrians with rare poses, scooterists, etc.). In this work, we propose GC-KPL - Geometry Consistency inspired Key Point Learning, an approach for learning 3D human joint locations from point clouds without human labels. We achieve this with novel unsupervised loss formulations that account for the structure and movement of the human body. We show that by training on a large training set from Waymo Open Dataset [21] without any human-annotated keypoints, we are able to achieve reasonable performance as compared to the fully supervised approach. Further, the backbone benefits from the unsupervised training and is useful in downstream few-shot learning of keypoints, where fine-tuning on only 10 percent of the labeled training data gives comparable performance to fine-tuning on the entire set. We demonstrate that GC-KPL outperforms the SoTA by a large margin when trained on the entire dataset and efficiently leverages large volumes of unlabeled data.

1. Introduction

Estimation of human pose in 3D is an important problem in computer vision and it has a wide range of applications including AR/VR, AI-assisted healthcare, and autonomous driving [4, 29, 32]. For autonomous systems, being able to perceive human poses from sensor data (e.g. LiDAR point clouds) is particularly essential to reason about the surrounding environment and make safe maneuvers.
Despite the high level of interest in human pose estimation in the wild, only a few papers have approached outdoor 3D keypoint detection using point clouds. A main reason is that training a pedestrian pose estimation model requires a large amount of high-quality in-the-wild data with ground truth labels. Annotating 3D human keypoints on point cloud data is expensive, time-consuming and error-prone. Although there are a few existing point cloud datasets with ground truth human poses [11, 13, 21], they are limited in terms of the quantity of the 3D annotations and the diversity of the data. Therefore, fully-supervised human keypoint detectors trained on such datasets do not generalize well to long-tail cases. For this reason, previous approaches to pedestrian 3D keypoint estimation have mainly focused on utilizing 2D weak supervision [4, 32], which is easier to obtain, or leveraging signals from other modalities (e.g. RGB, depth) [29]. Nonetheless, there is a lot of useful information in the large amount of unlabeled LiDAR data that previous works on human pose estimation have not made an effort to utilize.

* Work done as an intern at Waymo.

[Figure 1 diagram labels: In-the-wild Point Cloud; Transformer Encoder; MLP; 3D keypoints; Keypoint locations learned with Unsupervised Losses; Pre-trained Backbone; Downstream Fine-tuning; Small amount of labeled point clouds.]
Figure 1. We present GC-KPL, a novel method for learning 3D human keypoints from in-the-wild point clouds without any human labels. We propose to learn keypoint locations using unsupervised losses that account for the structure and movement of the human body. The backbone learns useful semantics from unsupervised learning and can be used in downstream fine-tuning tasks to boost the performance of 3D keypoint estimation.
In this work, we propose a novel and effective method for learning 3D human keypoints from in-the-wild point clouds without using any manually labeled 3D keypoints. Our approach is built on the key observation that human skeletons are roughly centered within approximately rigid body parts and that the location and movement of the surface points should explain the movement of the skeleton and vice versa. To that end, we design novel unsupervised loss terms for learning the locations of the 3D keypoints/skeleton within human point clouds, which correspond to the 3D locations of the major joints of the human body.

In the proposed method, we first train a transformer-based regression model for predicting keypoints and a semantic segmentation model for localizing body parts on a synthetic dataset constructed from a randomly posed SMPL human body model [15]. Then, we train on the entire Waymo Open Dataset [21] without using any 3D ground-truth annotation of human keypoints. Through unsupervised training, keypoint predictions are refined and the backbone learns useful information from a large amount of unannotated data.

In summary, we make the following contributions:

• We present GC-KPL, a method for learning human 3D keypoints for in-the-wild point clouds without any manual keypoint annotations.
• Drawing insight from the structure and movement of the human body, we propose three effective and novel unsupervised losses for refining keypoints. We show that the proposed losses are effective for unsupervised keypoint learning on Waymo Open Dataset.
• Through downstream fine-tuning/few-shot experiments, we demonstrate that GC-KPL can be used as unsupervised representation learning for human point clouds, which opens up the possibility of utilizing a practically infinite amount of sensor data to improve human pose understanding in autonomous driving.

2. Related Work

2.1.
3D Human Keypoint Estimation from Point Clouds

There have been a few works [19, 31, 34] on estimating 3D keypoints from clean and carefully curated point clouds [6], but 3D keypoint estimation from in-the-wild point clouds is a much less studied problem. Due to the lack of ground-truth 3D human pose annotations paired with LiDAR data, there have not been many works on 3D human keypoint estimation from LiDAR information. Among the few point cloud datasets with 3D keypoint annotations, LiDARHuman26M [13] captures long-range human motions with ground truth motion acquired by an IMU system and pose information derived from SMPL models fitted to point clouds. It is among the first few datasets which have LiDAR point clouds synchronized with RGB images, but the SMPL shape parameters are the same for all 13 subjects and it does not feature in-the-wild pedestrians, where there could be much more background noise and occlusion. PedX [11] offers automatic 3D pedestrian annotations obtained using model fitting on different modalities, gathered effectively from a single intersection with only 75 pedestrians (the second intersection has only 218 frames, and labels for the third scene were not released). Waymo Open Dataset [21] has more than 3,500 subjects from over 1,000 different in-the-wild scenes with high-quality 2D and 3D manual annotations. Despite the existence of these datasets, the few works on 3D pose estimation from point clouds mostly rely on weak supervision. The HPERL model [4] trains on 2D ground-truth pose annotations and uses a reprojection loss for the 3D pose regression task. The multi-modal model in [32] uses 2D labels on RGB images as weak supervision, and creates pseudo ground-truth 3D joint positions from the projection of annotated 2D joints. HUM3DIL [29] leverages RGB information with LiDAR points, by computing pixel-aligned multi-modal features with the 3D positions of the LiDAR signal.
In contrast, our method does not use any RGB information or weak supervision.

2.2. Unsupervised Keypoint Localization

There are a number of works that aim to recover 3D keypoints using self-supervised geometric reasoning [12, 22], but they are limited to rigid objects. More recent unsupervised methods work for articulated objects from monocular RGB data [9, 10, 18, 20, 24], multi-view data [16], or point clouds [27], where the authors propose to condition on the predicted keypoints and train a conditional generative model to supervise the keypoints through reconstruction losses. We propose a simpler pipeline where we apply our novel unsupervised losses to the predicted keypoints directly and do not require additional models besides the keypoint predictor itself.

2.3. Self-supervised Learning for Point Clouds

Self-supervised representation learning has proven to be remarkably useful in language [3, 17] and 2D vision tasks [2, 7]. As LiDAR sensors become more affordable and common, there has been an increasing amount of research interest in self-supervised learning on 3D point clouds. Previous works proposed to learn representations of object- or scene-level point clouds through contrastive learning [8, 25, 30] or reconstruction [23, 26, 28, 33], which is useful in downstream classification or segmentation tasks. In contrast, our supervision signals come from the unique structure of the human body and our learned backbone is particularly useful in downstream human keypoint estimation tasks.
3. Method

In this section, we describe our complete training pipeline, which contains two stages. In the first stage, we initialize the model parameters on a synthetic dataset (Sec. 3.1). The purpose of Stage I is to warm up the model with reasonable semantics. The second stage generalizes the model to real-world data. In this stage, we use our unsupervised losses to refine the keypoint predictions on in-the-wild point clouds (Sec. 3.2). An overview of our pipeline is in Fig. 2.

3.1. Stage I: Initialization on Synthetic Data

In this stage, we initialize the model on a synthetic dataset that is constructed by ray casting onto randomly posed human mesh models (SMPL [15]). We describe details of synthetic data generation in the Supplementary. The goal of this stage is to train a model $f$ that takes a point cloud of a human $P \in \mathbb{R}^{N \times 3}$ and outputs 3D locations of keypoints $\hat{Y} \in \mathbb{R}^{(J+1) \times 3}$, as well as soft body part assignments (or part segmentation) $\hat{W} \in \mathbb{R}^{N \times (J+1)}$ that contain the probability of each point $i$ belonging to body part $j \in [J]$ or the background.

$$\{\hat{Y}, \hat{W}\} = f(P) \tag{1}$$

$$\forall i \in [N], \quad \sum_{j=1}^{J+1} \hat{W}_{i,j} = 1 \tag{2}$$

Ground truth information about part segmentation $W$ and keypoint locations $Y$ is readily available for synthetic data. Hence, we can train the model by directly supervising the predicted keypoints through an L2 loss,

$$\mathcal{L}_{kp} = \| \hat{Y} - Y \|_2 \tag{3}$$

and the predicted segmentation through a cross entropy loss,

$$\mathcal{L}_{seg} = - \sum_{i=1}^{N} \sum_{j=1}^{J+1} W_{i,j} \log(\hat{W}_{i,j}) \tag{4}$$

Overall, we minimize

$$\mathcal{L}_{syn} = \lambda_{kp} \mathcal{L}_{kp} + \lambda_{seg} \mathcal{L}_{seg} \tag{5}$$

Notably, in Sec. 4.6 we show that supervision in this stage is not required - ground truth $W$ and $Y$ can be replaced by surrogate ground truths to achieve comparable results.

3.2.
Stage II: Self-Supervised Learning on In-the-Wild Data

In this stage, we further refine the network using unsupervised losses. The key insight behind the design of the losses is that the human body is composed of limbs, each of which is a rigid part. Therefore, points on a limb move with the limb and should stay roughly at the same location in each limb's local coordinate system. To account for this, we propose a flow loss that encourages the points to stay in the same location (despite rotation around the limb) within each limb's local cylindrical coordinate system.

We start by formally defining the key ingredients of the following formulations. In our setup, a human skeleton $L$ is composed of limbs, each of which connects two keypoints. A limb $l = (y_a, y_b) \in L$ is a line segment connecting the parent keypoint $y_a$ and the child keypoint $y_b$ on this limb, and all surface points on this limb have segmentation label $a$.

All three proposed losses are expressed in terms of surface points in each predicted limb's local coordinate system. Therefore, we first convert all input points to each limb's local cylindrical coordinate system and compute the radial and axial coordinates. Specifically, we project a point $p \in P$ in global coordinates onto the vector $\overrightarrow{\hat{y}_a \hat{y}_b}$ and calculate the length of the projection,

$$z(p, \hat{l}) = \frac{(p - \hat{y}_a) \cdot (\hat{y}_b - \hat{y}_a)}{\| \hat{y}_b - \hat{y}_a \|_2} \tag{6}$$

and the distance between the point and $\overrightarrow{\hat{y}_a \hat{y}_b}$,

$$r(p, \hat{l}) = \left\| p - \hat{y}_a - z(p, \hat{l}) \, \frac{\hat{y}_b - \hat{y}_a}{\| \hat{y}_b - \hat{y}_a \|_2} \right\|_2 \tag{7}$$

For simplicity, we use $z_{\hat{l}}(p)$ to represent $z(p, \hat{l})$, and $r_{\hat{l}}(p)$ to represent $r(p, \hat{l})$ in the following. Next, we describe the formulation of each loss function in detail.

Flow Loss.
Flow loss considers the predictions from two consecutive frames and encourages consistency of the radial and axial components of all points with respect to scene flow - limbs should move between frames in a way that keeps the radial and axial coordinates of all points constant. Formally, we define the forward and backward flow losses ($\mathcal{L}_{ff}$ and $\mathcal{L}_{bf}$ respectively) for limbs $\hat{l}^t = (\hat{y}^t_a, \hat{y}^t_b)$ and $\hat{l}^{t+1} = (\hat{y}^{t+1}_a, \hat{y}^{t+1}_b)$ built from the predicted keypoints at timestamps $t$ and $t+1$.

$$\mathcal{L}_{ff} = \frac{1}{N} \sum_i \hat{W}^t_{ia} \cdot \left( \left| r_{\hat{l}^{t+1}}(p^t_i + f^t_i) - r_{\hat{l}^t}(p^t_i) \right| + \left| z_{\hat{l}^{t+1}}(p^t_i + f^t_i) - z_{\hat{l}^t}(p^t_i) \right| \right) \tag{8}$$

$$\mathcal{L}_{bf} = \frac{1}{N} \sum_i \hat{W}^{t+1}_{ia} \cdot \left( \left| r_{\hat{l}^t}(p^{t+1}_i + b^{t+1}_i) - r_{\hat{l}^{t+1}}(p^{t+1}_i) \right| + \left| z_{\hat{l}^t}(p^{t+1}_i + b^{t+1}_i) - z_{\hat{l}^{t+1}}(p^{t+1}_i) \right| \right) \tag{9}$$

$f^t$ is the forward flow for each point $p^t \in P^t$ and $b^{t+1}$ is the backward flow for each point $p^{t+1} \in P^{t+1}$. We use Neural Scene Flow Prior [14] to estimate flow for two consecutive frames of points. The overall flow loss for frame $t$ is

$$\mathcal{L}_{flow} = \frac{1}{|L|} \sum_{\hat{l}^t} \frac{\mathcal{L}_{ff} + \mathcal{L}_{bf}}{2} \tag{10}$$
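To make the limb-local coordinates and the flow loss concrete, the following is a minimal NumPy sketch for a single limb. It is an illustration under stated assumptions, not the authors' implementation: the function names (`limb_cylindrical_coords`, `directional_flow_loss`, `limb_flow_loss`) and the small `eps` stabilizer are hypothetical, the radial coordinate is taken as the distance to the infinite line through the limb, and the flow vectors are assumed to be given (the paper obtains them from Neural Scene Flow Prior [14]).

```python
import numpy as np


def limb_cylindrical_coords(points, y_a, y_b, eps=1e-8):
    """Axial (z) and radial (r) coordinates of points w.r.t. limb (y_a, y_b).

    points: (N, 3) array; y_a, y_b: (3,) parent and child keypoints.
    eps is an assumed stabilizer for degenerate (zero-length) limbs.
    """
    axis = y_b - y_a
    unit = axis / (np.linalg.norm(axis) + eps)
    rel = points - y_a                                     # (N, 3)
    z = rel @ unit                                         # signed projection length
    r = np.linalg.norm(rel - z[:, None] * unit, axis=1)    # distance to limb axis
    return z, r


def directional_flow_loss(points, flow, weights, limb_src, limb_dst):
    """One direction of the flow loss (forward or backward) for one limb.

    points:  (N, 3) points in the source frame
    flow:    (N, 3) estimated flow from the source frame to the other frame
    weights: (N,) soft assignment of each point to this limb
    limb_src, limb_dst: (y_a, y_b) tuples for the source and destination frames
    """
    z0, r0 = limb_cylindrical_coords(points, *limb_src)
    z1, r1 = limb_cylindrical_coords(points + flow, *limb_dst)
    # Mean over all N points matches the 1/N normalization.
    return float(np.mean(weights * (np.abs(r1 - r0) + np.abs(z1 - z0))))


def limb_flow_loss(pts_t, fwd_flow, w_t, pts_t1, bwd_flow, w_t1, limb_t, limb_t1):
    """Average of forward and backward terms for one limb."""
    ff = directional_flow_loss(pts_t, fwd_flow, w_t, limb_t, limb_t1)
    bf = directional_flow_loss(pts_t1, bwd_flow, w_t1, limb_t1, limb_t)
    return 0.5 * (ff + bf)
```

Averaging `limb_flow_loss` over all limbs would give the overall flow loss. Note that under this sketch a rigid motion that translates the limb and rotates the points about the limb axis leaves both coordinates unchanged, so the loss is zero, which is the behavior the text describes.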
[Figure 2 diagram labels: Stage I: Initialization on Synthetic Data (synthetic data; Transformer Encoder; MLP heads for 3D keypoints and part segmentation). Stage II: Unsupervised Learning on In-the-Wild Data (real data; Transformer Encoder; MLP heads; Unsupervised Losses). Unsupervised losses: (a) Flow loss: after moving, points stay in the same place (despite rotation around the axis) within each limb's local cylindrical coordinate system. (b) Points-to-limb loss: minimize points-to-limb distance to encourage the limb to stay within the body. (c) Symmetry loss: points are symmetrical around the limb (i.e. points with similar height z have similar radius r).]
Figure 2. Overview of our method. In Stage I, we warm up the keypoint predictor and body part segmentation predictor on a small synthetic dataset. Then, in Stage II we refine the 3D keypoint predictions on a large in-the-wild dataset with unsupervised losses. The main losses are depicted at the bottom.

By design, the flow loss value is unchanged whenever the radial and axial values of all points in a limb's local coordinate system are the same in consecutive frames. This would happen if the limb in both frames is shifted in its respective orthogonal direction by the same amount. Theoretically, this is unlikely to happen for all limbs, but empirically we observe that with the flow loss alone the skeleton moves out of the point cloud. Therefore, we need additional losses to make the keypoints stay within the body.

Points-to-Limb Loss. For a predicted limb $\hat{l} = (\hat{y}_a, \hat{y}_b)$, we want the points on this limb to be close to it. Hence, we introduce a points-to-limb (p2l) loss

$$\mathcal{L}^{\hat{l}}_{p2l} = \frac{1}{N} \sum_i \hat{W}_{ia} \, d(p_i, \hat{l}) \tag{11}$$

where $d$ is the Euclidean distance function between a point and a line segment. We average over all limbs to get the overall points-to-limb loss,

$$\mathcal{L}_{p2l} = \frac{1}{|L|} \sum_{\hat{l}} \mathcal{L}^{\hat{l}}_{p2l} \tag{12}$$

Symmetry Loss.
Symmetry loss encourages the predicted limb $\hat{l}$ to be in a position such that all points around this limb are roughly symmetrical around it. That is to say, points with similar axial coordinates $z_{\hat{l}}$ should have similar radial values $r_{\hat{l}}$. To that end, we introduce the symmetry loss,

$$\mathcal{L}^{\hat{l}}_{sym} = \frac{1}{N} \sum_i \hat{W}_{ia} \left( r_{\hat{l}}(p_i) - \bar{r}_{\hat{l}}(p_i) \right)^2 \tag{13}$$

where $\bar{r}_{\hat{l}}(p_i)$ is the weighted mean of the radial values of points with axial coordinates similar to that of $p_i$,

$$\bar{r}_{\hat{l}}(p_i) = \frac{\sum_j K_h(z_{\hat{l}}(p_i), z_{\hat{l}}(p_j)) \, (\hat{W}_{i*} \cdot \hat{W}_{j*}) \, r_{\hat{l}}(p_j)}{\sum_j K_h(z_{\hat{l}}(p_i), z_{\hat{l}}(p_j)) \, (\hat{W}_{i*} \cdot \hat{W}_{j*})} \tag{14}$$

$K_h$ is a Gaussian kernel with bandwidth $h$, i.e. $K_h(x, y) = e^{-\left(\frac{x - y}{h}\right)^2}$. $\hat{W}_{i*} \in \mathbb{R}^{J+1}$ is the $i$-th row of $\hat{W}$, and the dot product $\hat{W}_{i*} \cdot \hat{W}_{j*}$ measures the similarity of the part assignments of points $i$ and $j$, as we want the value of $\bar{r}_{\hat{l}}(p_i)$ to be calculated using the points from the same part as point $i$. The overall symmetry loss is the average over all limbs,

$$\mathcal{L}_{sym} = \frac{1}{|L|} \sum_{\hat{l} \in L} \mathcal{L}^{\hat{l}}_{sym} \tag{15}$$

Joint-to-Part Loss. In addition, we encourage each joint to be close to the center of the points on that part using a joint-to-part loss.

$$\mathcal{L}^{j}_{j2p} = \left\| \hat{y}_j - \frac{\sum_i \hat{W}_{ij} \, p_i}{\sum_i \hat{W}_{ij}} \right\|_2 \tag{16}$$

We average over all joints to get the overall joint-to-part loss.

$$\mathcal{L}_{j2p} = \frac{1}{J} \sum_j \mathcal{L}^{j}_{j2p} \tag{17}$$

Note that although the ground truth locations of joints are not at the center of the points on the corresponding part, keeping this loss is essential in making the unsupervised training more robust.

In practice, jointly optimizing $\hat{W}$ and $\hat{Y}$ in Stage II leads to unstable training curves. Hence, we use the pre-trained
segmentation branch from Stage I to run segmentation inference and obtain the segmentation labels for all of the training samples at the beginning of Stage II, and $\hat{W}$ is the one-hot encoding of the predicted segmentation labels.

Segmentation Loss. Lastly, we notice that keeping the segmentation loss at this stage further regularizes the backbone and leads to better quantitative performance. We use the inferred segmentation $\hat{W}$ as the surrogate ground truth and minimize the cross entropy as in Eq. (4).

Training objective. The overall training objective during Stage II is to minimize

$$\mathcal{L} = \lambda_{flow} \mathcal{L}_{flow} + \lambda_{p2l} \mathcal{L}_{p2l} + \lambda_{sym} \mathcal{L}_{sym} + \lambda_{j2p} \mathcal{L}_{j2p} + \lambda_{seg} \mathcal{L}_{seg} \tag{18}$$

To illustrate the effect of the three unsupervised losses ($\mathcal{L}_{flow}$, $\mathcal{L}_{p2l}$ and $\mathcal{L}_{sym}$), we show the result of applying these losses to a perturbed ground truth skeleton (Fig. 3). As shown, the proposed unsupervised losses effectively move the perturbed skeleton to locations that are closer to the ground truth.

[Figure 3 panels: Perturbed skeleton; Optimized skeleton; Optimized vs. Ground truth skeleton.]
Figure 3. Effect of unsupervised losses on perturbed skeleton.

4. Experiments

4.1. Implementation Details

The predictor model $f$ consists of a transformer backbone with fully connected layers for predicting joints and segmentation respectively. We use the same transformer backbone as in HUM3DIL [29]. A fully connected layer is applied to the output of the transformer head to regress the predicted $\hat{W}$ and $\hat{Y}$ respectively. There are 352,787 trainable parameters in total. We set the maximum number of input LiDAR points to 1024, and zero-pad or downsample point clouds that have fewer or more points, respectively. The flow is obtained using a self-supervised test-time optimization method [14]. The network is trained on 4 TPUs.
We train Stage I for 200 epochs and Stage II for 75 epochs, both with batch size 32, a base learning rate of $10^{-4}$, and exponential decay 0.9. Stage I and Stage II each finish in about 6 hours. The loss weights in Eq. (5) are $\lambda_{kp} = 0.5$ and $\lambda_{seg} = 1$. The loss weights in Eq. (18) are $\lambda_{flow} = 0.02$, $\lambda_{p2l} = 0.01$, $\lambda_{sym} = 0.5$, $\lambda_{j2p} = 2$, and $\lambda_{seg} = 0.5$. The kernel bandwidth $h$ in Eq. (14) is 0.1.

4.2. Dataset and Metrics

We construct a synthetic dataset with 1,000 sequences of 16-frame raycasted point clouds for Stage I training. Each sequence starts with the same standing pose and ends in a random pose. We find that data augmentation is essential in Stage I training. To simulate real-world noisy background and occlusion, we apply various data augmentations to the synthetic data, including random downsampling, random masking, adding ground clusters, adding background clusters, adding a second person, adding noise to each point, and scaling the person. We include examples of augmented synthetic data in Fig. 4.

[Figure 4 panels: Add ground; Scale; Add garbage background; Add second person; Downsample; Random crop; Drop a part; Add random noise.]
Figure 4. Data augmentations applied to the synthetic point clouds (colored by ground truth segmentation labels). Ground truth skeletons are shown in purple. Background points are in blue.

In Stage II, we train on the entire Waymo Open Dataset (WOD) training set (with around 200,000 unlabeled samples). As the official WOD testing subset is hidden from the public, we randomly choose 50% of the validation set as the validation split, and the rest as the test split for benchmarking. We report the average Mean Per Joint Position Error (MPJPE) on the test set at the end of each stage.
Formally, for a single sample, let $\hat{Y} \in \mathbb{R}^{J \times 3}$ be the predicted keypoints, $Y \in \mathbb{R}^{J \times 3}$ the ground truth keypoints, and $v \in \{0, 1\}^J$ the visibility indicator annotated per keypoint.

$$\text{MPJPE}(Y, \hat{Y}) = \frac{1}{\sum_j v_j} \sum_{j \in [J]} v_j \, \| y_j - \hat{y}_j \|_2 \tag{19}$$

Note that at this stage, we perform Hungarian matching between the predicted and annotated keypoints per frame, and then report MPJPE on the matched keypoints. We report matched MPJPE because the method is intended for scenarios where the correspondence between keypoints in the unlabeled training data and the downstream data is unknown.
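As an illustration of the metric, the sketch below computes the visibility-weighted MPJPE and a matched variant using SciPy's `linear_sum_assignment` as the Hungarian solver. The details of the matching are assumptions made for this example rather than something the text specifies: only visible ground-truth keypoints are matched, the assignment cost is the Euclidean distance, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mpjpe(y_true, y_pred, visible):
    """Visibility-weighted mean per joint position error.

    y_true, y_pred: (J, 3) arrays with known joint correspondence.
    visible: (J,) array of 0/1 visibility flags for the ground truth.
    """
    v = np.asarray(visible).astype(bool)
    if not v.any():
        return float("nan")  # no visible keypoint to evaluate
    err = np.linalg.norm(y_true - y_pred, axis=1)
    return float(err[v].mean())


def matched_mpjpe(y_true, y_pred, visible):
    """MPJPE after Hungarian matching of predictions to visible GT joints.

    Assumption of this sketch: each visible ground-truth keypoint is
    assigned to a distinct predicted keypoint so that the total
    Euclidean distance is minimal.
    """
    v = np.asarray(visible).astype(bool)
    if not v.any():
        return float("nan")
    gt = y_true[v]                                               # (V, 3)
    cost = np.linalg.norm(gt[:, None, :] - y_pred[None, :, :], axis=2)  # (V, J)
    rows, cols = linear_sum_assignment(cost)                     # optimal assignment
    return float(cost[rows, cols].mean())
```

With predictions that are correct up to a permutation of the joint order, `matched_mpjpe` stays small while `mpjpe` becomes large, which is why the matched variant suits a setting where keypoint identities are not known in advance.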
[Figure 5 panels: Ground truth; Pred. after Stage I; Pred. after Stage II.]
Figure 5. Visualizations of predictions on WOD at the end of Stage I and Stage II. Points are colored by predicted segmentation labels. Ground truth keypoints are in green and predicted keypoints and skeletons are in red.

4.3. Results

In this section we perform a quantitative evaluation of GC-KPL at the end of Stage I and Stage II (Tab. 2). Qualitative results are in Fig. 5. As shown, after the first stage, where we train on a synthetic dataset constructed from posed body models with carefully chosen data augmentations, we are able to predict reasonable human keypoints on in-the-wild point clouds. In the second stage, our novel unsupervised losses further refine the predicted keypoints.

4.4. Downstream Task: Few-shot 3D Keypoint Learning

In this experiment, we show that the backbone of our model benefits from unsupervised training on a large amount of unlabeled data, and can be useful for downstream fine-tuning tasks. We start from our pre-trained backbone after Stage II, and fine-tune with annotated training samples from WOD by minimizing the mean per joint error. We include few-shot experiments where we fine-tune with an extremely small amount of data (10% and 1% of the training set), to represent challenging scenarios where there is a limited amount of annotated data. We include the LiDAR-only version of HUM3DIL (a state-of-the-art model on WOD) [29] as a strong baseline.

The quantitative results (Tab. 1) suggest that our backbone learns useful information from the unlabeled in-the-wild data and enables a significant performance boost on the downstream tasks. Compared to a randomly initialized backbone as used in HUM3DIL, our backbone leads to a decrease of over 2 cm in MPJPE in downstream fine-tuning experiments, which is a significant improvement for the 3D human keypoint estimation task. We visualize the predicted keypoints under different data regimes in Fig. 6.
As shown, models fine-tuned from our backbone are able to capture fine details on the arms and overall produce more accurate results than HUM3DIL.

To the best of our knowledge, there are no previous works on completely unsupervised human keypoint estimation from point clouds. We additionally experiment with using a readout layer on top of the features learned by a state-of-the-art point cloud SSL method, 3D-OAE [30], but the MPJPE is 15 cm (compared to 10.10 cm from GC-KPL). Hence we consider the baselines we adopt here strong and complete. In Sec. 4.5, we further challenge our method by comparing it to a domain adaptation setup and demonstrate that the performance of GC-KPL is still superior.

4.5. Domain adaptation

The configuration in which we use ground truth labels in Stage I and unsupervised training in Stage II can be seen as a domain adaptation (DA) technique. Thus it is useful to compare the proposed method with a commonly used domain adaptation method. We train the same backbone model using a mix of real and synthetic data and a gradient reversal layer (aka DA loss) [5] to help the network learn domain-invariant keypoint features. Results in Tab. 3 demonstrate that GC-KPL yields superior accuracy compared with the DA method (MPJPE 10.1 vs 11.35 cm).

4.6. Ablations

Effect of using GT bounding boxes in pre-processing. We crop human point clouds from the entire scene by including only points within GT bounding boxes. We also conducted experiments where we train with bounding boxes detected from raw LiDAR scans using a SoTA 3D detector. Results suggest that GC-KPL is robust to noise in 3D detection, as there were no noticeable changes in metrics.

Effect of synthetic dataset size. In our method, Stage I serves as a model initialization step, where we show that training on a small synthetic dataset (16,000 samples) with properly chosen data augmentations suffices for the model to learn useful semantics.
We further investigate the effect of synthetic dataset size during Stage I. We experiment with larger dataset sizes (160,000 and 1,600,000 samples) and observe that the effect of increasing the synthetic dataset size on $\text{MPJPE}_{matched}$ at the end of Stage I is insignificant - it decreased from 17.7 cm to 17.6 cm. The lack of a notable improvement for larger dataset sizes is likely due to the limited variability of generated poses in the synthetic data (see Supplemental for details).

[Figure 6 panels: (a) Fine-tune on 100% training set; (b) Fine-tune on 10% training set; (c) Fine-tune on 1% training set. Columns: Ground truth; HUM3DIL; Ours.]
Figure 6. Predicted keypoints from fine-tuning with different amounts of annotated data. The points are colored by segmentation labels predicted by our model. Predicted keypoints are shown in red.

Method       | Backbone                         | Stage I supervised | 1% training set MPJPE cm (gain) | 10% training set MPJPE cm (gain) | 100% training set MPJPE cm (gain)
HUM3DIL [29] | Randomly initialized             |                    | 19.57                           | 16.36                            | 12.21
GC-KPL       | Pre-trained on synthetic only    | ✔                  | 18.52 (-1.05)                   | 15.10 (-1.26)                    | 11.27 (-0.94)
GC-KPL       | Pre-trained on 5,000 WOD-train   | ✔                  | 17.87 (-1.70)                   | 14.51 (-1.85)                    | 10.73 (-1.48)
GC-KPL       | Pre-trained on 200,000 WOD-train |                    | 17.80 (-1.77)                   | 14.30 (-2.06)                    | 10.60 (-1.61)
GC-KPL       | Pre-trained on 200,000 WOD-train | ✔                  | 17.20 (-2.37)                   | 13.40 (-2.96)                    | 10.10 (-2.11)

Table 1. Downstream fine-tuning results. Check marks in "Stage I supervised" mean that we use ground truth part labels in Stage I; otherwise we use KMeans labels.

Training data     | MPJPE matched, cm (↓)
Synthetic only    | 17.70
5,000 WOD-train   | 14.64
200,000 WOD-train | 13.92

Table 2. Unsupervised learning (Stage II) results.

Domain distribution  | DA loss | MPJPE, cm (↓)
100% real            |         | 12.21
50/50% real/synthetic |         | 12.08
50/50% real/synthetic | ✔       | 11.35

Table 3. Unsupervised domain adaptation results evaluated on WOD validation set.

Effect of using ground truths on synthetic data. While our described pipeline does not use any kind of manual labels, we do use ground truth segmentation and keypoints on the synthetic dataset in Stage I because they are readily available. Here we further experiment with a variation where we do not use any kind of ground truth in Stage I (first row in Tab. 4). Instead, we use KMeans clusters and cluster centers as surrogate ground truths for model initialization, similar to [1].
Note that we are able to establish correspondence between KMeans clusters from different samples because, in our data generation process, each synthetic sequence starts with the same standing pose. Hence, we can run KMeans clustering on the starting pose that is shared among all sequences, and for subsequent samples within each sequence, we do Hungarian matching using
inter-cluster Chamfer distance to establish correspondence between clusters from consecutive frames. We observe that although initializing with surrogate ground truths leads to slightly inferior performance in Stage I, after training with the losses in Stage II the drop in performance is less visible. Overall, downstream fine-tuning performance is comparable to our best model (10.6/14.3/17.8 vs. 10.1/13.4/17.2 cm when fine-tuned on 100%/10%/1% of the data, see Tab. 1). This experiment suggests that our method does not require any kind of ground truth, even during the initialization stage.

Columns: No. | Exp. | Stage I: $\mathcal{L}_{kp}$, $\mathcal{L}_{seg}$, MPJPE matched | Stage II: $\mathcal{L}_{j2p}$, $\mathcal{L}_{seg}$, $\mathcal{L}_{sym}$, $\mathcal{L}_{p2l}$, $\mathcal{L}_{flow}$, MPJPE matched
1  | Effect of using KMeans labels in Stage I | ✔ ✔ 19.2 | ✔ ✔ ✔ ✔ ✔ 14.5
2  | Effect of $\mathcal{L}_{kp}$ in Stage I | ✔ N/A | ✔ ✔ ✔ ✔ ✔ 14.2
3  | Effect of warmup losses in Stage II | | ✔ ✔ ✔ ✔ 15.0
4  | | | ✔ ✔ ✔ ✔ 14.2
5  | | | ✔ ✔ ✔ 15.2
6  | Effect of unsupervised losses in Stage II | | ✔ ✔ 30.1
7  | | | ✔ ✔ 15.6
8  | | | ✔ ✔ 25.7
9  | | | ✔ ✔ ✔ ✔ 14.3
10 | | | ✔ ✔ ✔ ✔ 14.9
11 | | | ✔ ✔ ✔ ✔ 14.4
12 | | | ✔ ✔ 14.9
   | Full model (GC-KPL) | ✔ ✔ 17.7 | ✔ ✔ ✔ ✔ ✔ 13.9

Table 4. Ablation studies on the effect of individual loss terms in our method. Experiments 3 through 12 use both losses in Stage I. The full model uses GT labels for Stage I.

Effect of Losses. In this section we further investigate the effect of each component in our pipeline (Tab. 4). First, we note that $\mathcal{L}_{seg}$ in Stage I is essential because we need an initialized segmentation model to get the body part assignment for each point in order to calculate the losses in Stage II. Therefore, we only experiment with a variation of Stage I training without $\mathcal{L}_{kp}$, and we observe that $\mathcal{L}_{kp}$ is useful in warming up the backbone for later stages.
Next, we take the backbone from Stage I (trained with both $\mathcal{L}_{kp}$ and $\mathcal{L}_{seg}$), and study the effect of individual losses in Stage II. Experiments No. 3/4/5 show that it is helpful to include $\mathcal{L}_{j2p}$ and $\mathcal{L}_{seg}$ while keeping all three unsupervised losses. In experiments 6/7/8 we take out $\mathcal{L}_{j2p}$ and $\mathcal{L}_{seg}$, and investigate the effect of individual unsupervised losses. As shown, the training becomes rather unstable if we further eliminate any of the three losses. We observe qualitatively that the metric worsens drastically because the limbs quickly move out of the human body. Experiments No. 3/4/5 suggest that $\mathcal{L}_{j2p}$ and $\mathcal{L}_{seg}$ are useful regularizers that make sure the limbs stay within the body, and the unsupervised losses further improve the performance by refining the keypoint locations.

4.7. Limitations and Future Work

The task of keypoint localization could be considered a dual problem to semantic segmentation. In this work we use a simple segmentation network based on the same architecture as our keypoint estimation model. Using a superior segmentation model could lead to further improvements.

The proposed flow loss depends on the quality of the estimated flow of LiDAR points. In this work we used a simple but reasonable method, Neural Scene Flow Prior [14], to estimate flow between two frames of LiDAR points. The quality of the unsupervised keypoint estimation could be improved by using a more advanced flow estimator tailored for point clouds on human body surfaces.

Lastly, we use a part of the HUM3DIL [29] model which takes only the LiDAR point cloud as input. The full HUM3DIL model was designed for multi-modal inputs and attains better performance. Thus, another interesting direction is to leverage multi-modal inputs.

5.
Conclusion  In this work, we approached the problem of 3D hu- man pose estimation using points clouds in-the-wild, in- troduced a method (GC-KPL) for learning 3D human key- points from point clouds without using any manual 3D keypoint annotations.   We shown that the proposed novel losses are effective for unsupervised keypoint learning on Waymo Open Dataset.   Through downstream experiments we demonstrated that GC-KPL can additionally serve as a self-supervised representation method to learn from large quantity of in-the-wild human point clouds.   In addition, GC-KPL compares favorably with a commonly used do- main adaptation technique. The few-shot experiments em- pirically verified that using only 10 %   of available 3D key- point annotation the fine-tuned model reached comparable performance to the state-of-the-art model training on the en- tire dataset. These results opens up exciting possibility to utilize massive amount of sensor data in autonomous driv- ing to improve pedestrian 3D keypoint estimation.
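Returning to the KMeans-label ablation (experiment No. 1 in Tab. 4), the text mentions using an inter-cluster Chamfer distance to establish correspondence between clusters from consecutive frames. The following is a minimal sketch of that idea, not the implementation used in this work. The function names `chamfer_distance` and `match_clusters` are hypothetical, the symmetric mean form of the Chamfer distance is one common variant, and the nearest-cluster assignment is an assumption, since the matching rule is not specified here.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def match_clusters(prev_clusters, next_clusters) -> np.ndarray:
    """For each cluster in the previous frame, return the index of the
    closest cluster (by Chamfer distance) in the next frame.

    Assumption: a simple nearest-cluster rule; it does not enforce a
    one-to-one assignment.
    """
    cost = np.array(
        [[chamfer_distance(p, q) for q in next_clusters] for p in prev_clusters]
    )
    return cost.argmin(axis=1)
```

Given two lists of per-cluster point arrays, `match_clusters(prev, nxt)[i]` is the index in `nxt` that would inherit the label of cluster `i` in `prev`. If a strict one-to-one matching is needed, the same cost matrix could be passed to an assignment solver instead of `argmin`.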
References

[1] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision, 2018.
[2] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2959–2968, 2019.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[4] Michael Fürst, Shriya TP Gupta, René Schuster, Oliver Wasenmüller, and Didier Stricker. Hperl: 3d human pose estimation from rgb and lidar. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7321–7327. IEEE, 2021.
[5] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
[6] Albert Haque, Boya Peng, Zelun Luo, Alexandre Alahi, Serena Yeung, and Li Fei-Fei. Towards viewpoint invariant 3d human pose estimation. In European conference on computer vision, pages 160–177. Springer, 2016.
[7] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[8] Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu. Spatio-temporal self-supervised representation learning for 3d point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6535–6545, 2021.
[9] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks through conditional image generation. Advances in neural information processing systems, 31, 2018.
[10] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Self-supervised learning of interpretable keypoints from unlabelled videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8787–8797, 2020.
[11] Wonhui Kim, Manikandasriram Srinivasan Ramanagopal, Charles Barto, Ming-Yuan Yu, Karl Rosaen, Nick Goumas, Ram Vasudevan, and Matthew Johnson-Roberson. Pedx: Benchmark dataset for metric 3-d pose estimation of pedestrians in complex urban intersections. IEEE Robotics and Automation Letters, 4(2):1940–1947, 2019.
[12] Jiaxin Li and Gim Hee Lee. Usip: Unsupervised stable interest point detection from 3d point clouds. In Proceedings of the IEEE/CVF international conference on computer vision, pages 361–370, 2019.
[13] Jialian Li, Jingyi Zhang, Zhiyong Wang, Siqi Shen, Chenglu Wen, Yuexin Ma, Lan Xu, Jingyi Yu, and Cheng Wang. Lidarcap: Long-range marker-less 3d human motion capture with lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20502–20512, 2022.
[14] Xueqian Li, Jhony Kaesemodel Pontes, and Simon Lucey. Neural scene flow prior. Advances in Neural Information Processing Systems, 34:7838–7851, 2021.
[15] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
[16] Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya Harada, and Orazio Gallo. Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3677–3687, 2022.
[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[18] Luca Schmidtke, Athanasios Vlontzos, Simon Ellershaw, Anna Lukens, Tomoki Arichi, and Bernhard Kainz. Unsupervised human pose estimation through transforming shape templates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2484–2494, 2021.
[19] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-time human pose recognition in parts from single depth images. In CVPR 2011, pages 1297–1304. IEEE, 2011.
[20] Jennifer J Sun, Serim Ryou, Roni H Goldshmid, Brandon Weissbourd, John O Dabiri, David J Anderson, Ann Kennedy, Yisong Yue, and Pietro Perona. Self-supervised keypoint discovery in behavioral videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2171–2180, 2022.
[21] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
[22] Supasorn Suwajanakorn, Noah Snavely, Jonathan J Tompson, and Mohammad Norouzi. Discovery of latent 3d keypoints via end-to-end geometric reasoning. Advances in neural information processing systems, 31, 2018.
[23] Hanchen Wang, Qi Liu, Xiangyu Yue, Joan Lasenby, and Matt J Kusner. Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9782–9792, 2021.
[24] Yuefan Wu, Zeyuan Chen, Shaowei Liu, Zhongzheng Ren, and Shenlong Wang. Casa: Category-agnostic skeletal animal reconstruction. arXiv preprint arXiv:2211.03568, 2022.
[25] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In European conference on computer vision, pages 574–591. Springer, 2020.
[26] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 206–215, 2018.
[27] Yang You, Wenhai Liu, Yanjie Ze, Yong-Lu Li, Weiming Wang, and Cewu Lu. Ukpgan: A general self-supervised keypoint detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17042–17051, 2022.
[28] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
[29] Andrei Zanfir, Mihai Zanfir, Alex Gorban, Jingwei Ji, Yin Zhou, Dragomir Anguelov, and Cristian Sminchisescu. Hum3dil: Semi-supervised multi-modal 3d human pose estimation for autonomous driving. In 6th Annual Conference on Robot Learning, 2022.
[30] Zaiwei Zhang, Rohit Girdhar, Armand Joulin, and Ishan Misra. Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10252–10263, 2021.
[31] Zihao Zhang, Lei Hu, Xiaoming Deng, and Shihong Xia. Weakly supervised adversarial learning for 3d human pose estimation from point clouds. IEEE transactions on visualization and computer graphics, 26(5):1851–1859, 2020.
[32] Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song, Charles R Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou, et al. Multi-modal 3d human pose estimation with 2d weak supervision in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4478–4487, 2022.
[33] Junsheng Zhou, Xin Wen, Yu-Shen Liu, Yi Fang, and Zhizhong Han. Self-supervised point cloud representation learning with occlusion auto-encoder. arXiv preprint arXiv:2203.14084, 2022.
[34] Yufan Zhou, Haiwei Dong, and Abdulmotaleb El Saddik. Learning to estimate 3d human pose from point cloud. IEEE Sensors Journal, 20(20):12334–12342, 2020.
Supplementary Material for Unsupervised Learning of 3D Human Keypoints from Point Clouds in the Wild

Zhenzhen Weng 1 *   Alexander S. Gorban 2   Jingwei Ji 2   Mahyar Najibi 2   Yin Zhou 2   Dragomir Anguelov 2
1 Stanford University   2 Waymo

1. Synthetic Data Generation

In Stage I, we initialize the model on a synthetic dataset that is constructed by ray casting onto randomly posed human mesh models (SMPL [1]). Here we elaborate on the synthetic data generation process. We generate 1,000 16-frame sequences. Each sequence has a random SMPL body shape, starts with the same standing pose, and ends in a random pose. The poses in the middle of the sequence are linearly interpolated between the starting and ending poses. The ending pose is created by adding random noise to the rotation angles of each joint in the standing pose. To create realistic pedestrian poses, we add up to 60 degrees of random noise to the shoulder and elbow joint angles, up to 30 degrees to the thigh and knee joints, and up to 5 degrees to all other joints.

To simulate LiDAR point clouds, we place the human meshes at a distance of 6 to 17 meters from a ray caster and keep the faces that intersect with the rays. As in [2], we use 2650 vertical scans (with 360 degree coverage) and 64 LiDAR beams. For simplicity, we do not consider rolling shutter and other LiDAR artifacts.

We construct 2-frame samples by taking consecutive frames from each sequence, and the same data augmentation is applied to both frames in each sample.

2. Additional Qualitative Results

In Fig. 1, we include additional qualitative results from the model fine-tuned on 100% of the training data. We show typical failure cases on WOD in Fig. 2, which are caused by occlusion (left and middle columns) and incorrect segmentation of the point cloud (right column). There is an animated visualization in the attachment.
It demonstrates the effect of our unsupervised losses (L_flow, L_p2l and L_sym). We perturb the ground truth keypoints by adding random noise (Gaussian noise with 0 mean and 6 cm standard deviation) to each keypoint. Then, we minimize these three losses with respect to the keypoint locations. We minimize with the Adam optimizer with a learning rate of 1e-3 for 100 iterations. The weights for the loss terms are λ_flow = 0.2, λ_p2l = 0.1, λ_sym = 5. As shown, as a result of the optimization process the keypoints move back toward their unperturbed locations over time.

Figure 1. Additional qualitative results. The points are colored by predicted segmentation labels. Ground truth keypoints are in green and predicted keypoints and skeletons are in red.

Figure 2. Failure cases.

* Work done as an intern at Waymo.

References

[1] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
[2] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.

arXiv:2306.04745v1 [cs.CV] 7 Jun 2023
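To make the pose sampling described in Section 1 (Synthetic Data Generation) more concrete, here is a minimal sketch of how one 16-frame sequence and its 2-frame samples might be produced. It covers only the joint-angle part and omits body shape sampling, ray casting and data augmentation. Several details are assumptions not stated in the text: the standing pose is represented as all-zero rotation angles, the noise is drawn uniformly within the stated bounds, each joint has three rotation angles, and the joint index lists (`ARM_JOINTS`, `LEG_JOINTS`) are placeholders for the shoulder/elbow and thigh/knee joints in a 24-joint skeleton.

```python
import numpy as np

NUM_FRAMES = 16   # frames per sequence, as stated in the text
NUM_JOINTS = 24   # assumption: 24-joint skeleton
# Placeholder joint indices (hypothetical, for illustration only).
ARM_JOINTS = [16, 17, 18, 19]   # shoulders and elbows
LEG_JOINTS = [1, 2, 4, 5]       # thighs and knees


def sample_pose_sequence(rng: np.random.Generator) -> np.ndarray:
    """Return joint angles in degrees with shape (NUM_FRAMES, NUM_JOINTS, 3)."""
    # Per-joint noise bound: 60 deg for arms, 30 deg for legs, 5 deg otherwise.
    max_noise = np.full(NUM_JOINTS, 5.0)
    max_noise[ARM_JOINTS] = 60.0
    max_noise[LEG_JOINTS] = 30.0

    # Assumption: the standing pose corresponds to zero rotation on every joint.
    standing = np.zeros((NUM_JOINTS, 3))
    # Assumption: uniform noise in [-bound, bound] for each rotation angle.
    noise = rng.uniform(-1.0, 1.0, size=(NUM_JOINTS, 3)) * max_noise[:, None]
    ending = standing + noise

    # Linear interpolation from the standing pose to the ending pose.
    t = np.linspace(0.0, 1.0, NUM_FRAMES)[:, None, None]
    return (1.0 - t) * standing + t * ending


def two_frame_samples(sequence: np.ndarray):
    """Pair each frame with the next one to form 2-frame samples."""
    return [(sequence[i], sequence[i + 1]) for i in range(len(sequence) - 1)]
```

Calling `sample_pose_sequence(np.random.default_rng(0))` yields one sequence, and `two_frame_samples` turns it into 15 consecutive-frame pairs. Repeating this 1,000 times would match the number of sequences reported above.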