Tracking Human-like Natural Motion Using Deep Recurrent Neural Networks

Youngbin Park, Sungphill Moon and Il Hong Suh

Abstract: The Kinect skeleton tracker achieves considerable human body tracking performance in a convenient and low-cost manner. However, the tracker often captures unnatural human poses, such as discontinuous and vibrated motions, when self-occlusions occur. A majority of approaches tackle this problem by using multiple Kinect sensors in a workspace. The measurements from the different sensors are then combined in a Kalman filter framework, or sensor fusion is formulated as an optimization problem. However, these methods usually require heuristics to measure the reliability of the measurements observed from each Kinect sensor. In this paper, we developed a method to improve the Kinect skeleton using a single Kinect sensor, in which a supervised learning technique was employed to correct unnatural tracking motions. Specifically, deep recurrent neural networks were used to improve the joint positions and velocities of the Kinect skeleton, and three methods were proposed to integrate the refined positions and velocities for further enhancement. Moreover, we suggested a novel measure to evaluate the naturalness of captured motions. We evaluated the proposed approach by comparison with the ground truth obtained using a commercial optical marker-based motion capture system.

I. INTRODUCTION

The second version of the Microsoft Kinect, Kinect for Windows v2 (Kinect v2), was released and made available to researchers in 2014. This new generation of Kinect sensor offers a higher resolution and a wider field of view than the original Kinect technology. Further, in terms of depth, Kinect v2 is based on the time-of-flight principle, whereas the previous version of Kinect used structured light to reconstruct the third dimension. This difference has led to a considerable improvement in the accuracy of depth sensing.

To enable the use of Kinect sensors by developers and researchers, the official Microsoft SDKs (Software Development Kits) 1.0 and 2.0 are freely available for Kinect v1 and v2, respectively. These SDKs provide a set of functions, notably including a human body skeleton tracker. Owing to the enhanced depth sensor, tracking accuracy has improved in Kinect v2. Therefore, in this work we developed our skeleton tracking system based on Kinect v2.

Although Kinect v2 provides better tracking results than Kinect v1, it often captures unnatural skeleton poses, such as discontinuous and vibrated motions, in the presence of self-occlusion, a problem common to most vision-based sensing systems. A simple way to solve this problem is to use multiple cameras in the workspace. For instance, if the view of a body part is blocked from one camera, it might be possible to obtain a view of that body part from another camera. Appropriately combining the data obtained from multiple Kinect sensors can then achieve more accurate tracking than a single sensor. A majority of approaches integrate the measurements from different sensors in a Kalman filter framework or formulate sensor fusion as an optimization problem. However, these methods require a way to estimate the confidence of each measurement so that multiple observations can be combined according to their confidence levels. This usually leads to heuristic measures of the reliability of the measurements.

In this paper, we developed a method to improve the Kinect skeleton using a single Kinect sensor, in which a supervised learning technique was employed to correct unnatural tracking motions. Specifically, deep recurrent neural networks were used to improve the joint positions and velocities of the Kinect skeleton data, and three methods were proposed to integrate the refined joint positions and velocities for further enhancement. Consequently, the proposed method removes jitter and promotes temporal continuity.
Moreover, we suggested a novel measure to evaluate the naturalness of captured motions.

The remainder of the paper is organized as follows. Section II surveys the current literature on improving the Kinect skeleton. Section III briefly describes how to improve the joint positions and velocities of Kinect skeleton data using deep recurrent neural networks. In Section IV, three methods are proposed to integrate the enhanced positions and velocities. A novel measure to evaluate the naturalness of captured motions is given in Section V. Section VI presents our experimental setup and an evaluation of the performance of the proposed model. Finally, we present our conclusions in Section VII.

arXiv:1604.04528v1 [cs.CV] 15 Apr 2016

II. RELATED WORKS

Skeleton tracking algorithms can be classified into single-view based models [10], [11], [12] and multi-view based models [13], [14]. Shotton et al. [1] proposed a new method to predict the 3D positions of body joints from a single depth image. In their method, an intermediate representation of body parts was designed to map the pose estimation problem onto a per-pixel classification problem. A very large and highly varied training data set is employed so that the random forest classifier can estimate body parts invariantly to pose, body shape, clothing, etc. Finally, confidence-scored 3D proposals of several body joints are generated by re-projecting the classification results into the 3D world and finding local modes. As a result, this approach can quickly and accurately predict the 3D positions of body joints. The skeleton trackers in both the first and second versions of the Kinect SDK are based on this algorithm. However, a 3D body pose estimated from a single view frequently has problems in determining the positions of joints during self-occluding motions. Consequently, the Kinect skeleton tracker tends to capture discontinuous movements or unwanted vibration.

Therefore, approaches that utilize multiple views have recently begun to receive significant attention. For example, Zhang et al. [15] fused individual depth images into a joint point cloud and used an efficient particle filtering approach for pose estimation. Likewise, Liu et al. [16] presented a markerless motion capture approach for multi-view video that reconstructs the skeletal motion and detailed surface geometries of two closely interacting people.

The approach presented in this paper differs from the methods used in the studies described above. Specifically, our goal was not to develop a method that estimates the 3D positions of body joints directly from raw depth images or RGB images, but rather to investigate how to generate more human-like natural motion by improving the estimated Kinect v2 skeleton. Indeed, there have been relatively few studies that determine the skeleton pose by enhancing Kinect skeleton tracking.

Masse et al. [17] presented a framework that obtains the 3D positions of body joints from multiple Kinect sensors and then inputs the measured skeletons into a gated Kalman filter. In their method, the gated Kalman filter rejects a skeleton pose if the measurement residual, referred to as the innovation, is higher than the gating threshold. This is done in order to discard faulty sensor readings and retain correct measurements. For quantitative evaluation, a commercial motion capture system is used to obtain the ground truth. However, the processing step that rejects measurements is quite simple and relies entirely on the innovation. This can often lead to ineffective measurement fusion.

Yeung et al. [18] developed a method that synthesizes skeletons with duplex Kinect sensors that capture human motion from different views. In their study, each joint had two measurements reported by the two cameras. The major technical difficulty is how to evaluate the reliability of the two values at each joint and how to resolve any inconsistencies.
To address this problem, they developed a measure to estimate the confidence in the 3D positions obtained using the Kinect skeleton tracker. Specifically, the distance between a joint i and its closest joint j estimated from Kinect A, and the distance between the corresponding joint i and its closest joint k estimated from Kinect B, are computed. If the distance between i and j is smaller than the distance between i and k, the joint i obtained from Kinect A is considered an unreliable estimate; otherwise, the joint i obtained from Kinect B is considered the misleading joint. This reliability was computed before the data fusion procedure was executed. The data fusion procedure was formulated as a mathematical optimization problem in which the objective is to reduce the sum of differences between each estimated joint position and the corresponding more reliable position, with the bone lengths given as equality constraints.

Both studies described above differ from our approach in two respects. First, a single Kinect sensor was used in our method. Second, we formulate our problem as a supervised learning task instead of employing simple Kalman filtering or formulating a mathematical optimization problem. In terms of these two aspects, an approach similar to our method has not been proposed.

III. IMPROVING POSITION AND VELOCITY OF KINECT SKELETON USING DEEP RECURRENT NEURAL NETWORK

The first part of our method improves the joint positions and velocities of the Kinect skeleton using supervised learning. The inputs for the supervised learning are sequences of 3D positions or velocities obtained by the Kinect skeleton tracker, and the targets are sequences of skeleton poses captured using a commercial optical marker-based motion capture system.
In our method, deep recurrent neural networks are employed to solve the regression problem: two deep recurrent neural networks are trained separately for refining the positions and the velocities of body joints. In this section, we briefly describe deep recurrent neural networks and present the details of how the networks are trained.

A. Deep Recurrent Neural Network

A recurrent neural network (RNN) [2] is a neural network that simulates a discrete-time dynamical system and is a powerful model for sequential data. A conventional RNN is constructed by defining the transition function and the output function as

$$h_t = \phi_h\left(W^{T} h_{t-1} + U^{T} x_t\right) \tag{1}$$

$$y_t = \phi_o\left(V^{T} h_t\right), \tag{2}$$

where $\phi_h$, $\phi_o$, $x_t$, $y_t$ and $h_t$ are, respectively, the state transition function, the output function, the input, the output and the hidden state, and $W$, $U$ and $V$ are the transition, input and output matrices, in that order. It is usual to use a nonlinear function such as a logistic sigmoid function or a hyperbolic tangent function for $\phi_h$.

Deep learning is built on the hypothesis that a deep, hierarchical model can be exponentially more efficient at representing some functions than a shallow one [3]. Several theoretical results and empirical evidence support this hypothesis [5], [4], [6]. RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. However, a potential weakness of RNNs is that they lack hierarchical processing of the input in space. From this perspective, deep recurrent neural networks have recently gained significant attention from many researchers. Just as feedforward deep neural networks have multiple nonlinear layers between input and output, a recurrent network can be considered a deep recurrent neural network (DRNN) if it has more than one hidden layer. We can now consider two schemes of DRNNs.
One has L hidden layers with a temporal connection only at the l-th layer, and the other has L hidden layers with full temporal connections (called a stacked RNN). Based on an empirical evaluation on our datasets, we have chosen the former scheme. The l-th hidden activation at time t, $h^{l}_{t}$, is defined as

$$h^{l}_{t} = \phi_l\left(W_l^{T} h_{t-1} + U_l^{T} \phi_{l-1}\left(U_{l-1}^{T} \cdots \phi_1\left(U_1^{T} x_t\right)\right)\right) \tag{3}$$

where $W_l^{T}$ and $U_l^{T}$ represent the fully connected weight matrices for the recurrent connection and for the l-th layer, respectively, and $h_{t-1}$ is the activation of the recurrent layer at the previous time step.

Because skeleton tracking is an inherently dynamic process, it seems natural to consider DRNNs as a model for supervised learning. As most researchers do when first training DRNNs, we considered two of the most popular deep learning techniques, dropout and Rectified Linear Units (ReLU) [7]. We used ReLU as the nonlinear activation function for all units in the hidden layers. Unfortunately, dropout does not work well with RNNs, unlike with feedforward deep neural networks. Although we carefully applied dropout to DRNNs on our datasets in the way proposed by [8], we found that dropout leads to divergence. The values of the output units are computed by linear activation.

An alternative for modeling sequences is Long Short-Term Memory (LSTM) [9]. LSTM is a variant of the RNN that performs better on problems with long-term dependencies, because it was designed to address the vanishing and exploding gradient problems of conventional RNNs. We trained a single-layer LSTM and compared its performance with that of a single-layer RNN with ReLU activation for the hidden units. On our test dataset, however, LSTM achieved lower performance and took longer to train. Hence, we did not employ LSTM for supervised learning.

B. Details in Training Two DRNNs

In the following, we refer to the two DRNNs for improving the joint positions and velocities of the skeleton as pDRNN and vDRNN, respectively.
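
The forward pass implied by Equation (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (which was written in MATLAB): the function names, the random initialization, and the `rec_layer` parameter selecting which hidden layer carries the temporal connection are all hypothetical, and any hidden layers above the recurrent one are assumed to be plain feedforward ReLU layers. Hidden units use ReLU and output units are linear, as described above.

```python
import random


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec_t(m, x):
    """Return m^T x, with m stored as len(x) rows of equal length."""
    return [sum(m[i][j] * x[i] for i in range(len(x))) for j in range(len(m[0]))]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights (hypothetical initialization)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_drnn(n_in, n_hidden, n_layers, n_out, rec_layer, seed=0):
    """Random weights; rec_layer is the 0-based index of the recurrent layer."""
    rng = random.Random(seed)
    sizes = [n_in] + [n_hidden] * n_layers
    return {
        "U": [rand_matrix(sizes[k], sizes[k + 1], rng) for k in range(n_layers)],
        "W": rand_matrix(n_hidden, n_hidden, rng),  # recurrent weights
        "V": rand_matrix(n_hidden, n_out, rng),     # output weights
        "rec_layer": rec_layer,
    }


def drnn_forward(params, xs):
    """Map a sequence of input vectors to a sequence of output vectors."""
    h_prev = [0.0] * len(params["W"])
    ys = []
    for x in xs:
        a = x
        for k, u in enumerate(params["U"]):
            pre = matvec_t(u, a)
            if k == params["rec_layer"]:
                # Temporal connection: add W^T h_{t-1} at this layer only.
                rec = matvec_t(params["W"], h_prev)
                pre = [p + r for p, r in zip(pre, rec)]
            a = relu(pre)
            if k == params["rec_layer"]:
                h_prev = a
        ys.append(matvec_t(params["V"], a))  # linear output units
    return ys
```

With `init_drnn(48, 256, 3, 48, rec_layer=2)` the sketch would have 48 input and output units and three hidden layers of 256 units; placing the recurrence at the last hidden layer is an arbitrary choice made here for illustration only.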
pDRNN and vDRNN each have five layers, of which three are hidden and two are the input and output layers. The size of each hidden layer is 256. The number of units in the input and output layers is 48, because the number of joints to be refined is 16 and each joint is composed of x, y and z coordinates. Kinect v2 supports 25 joints, of which 16 are used in our method: spinebase, spinemid, neck, shoulderleft, elbowleft, wristleft, shoulderright, elbowright, wristright, hipleft, kneeleft, ankleleft, hipright, kneeright, ankleright, and spineshoulder. Among the 25 joints, some, such as thumbleft and thumbright, are tracked very unstably, and some are not supported by the motion capture system. Accordingly, head, handleft, handright, handtipleft, thumbleft, handtipright, thumbright, footleft and footright were excluded in our method.

The temporal length of the training data for pDRNN is 7. In the training phase, the absolute joint positions of the Kinect skeleton, denoted by $z$, are transformed to positions relative to their parent joints. The root joint is spinemid, and only the root joint is represented by its absolute position. The joints tracked using the motion capture system are transformed in the same way. Hence, the output of pDRNN consists of relative joint positions, except for the spinemid joint. The output is transformed back to absolute positions, denoted by $\tilde{z}$. We do not represent body joints using relative angles because the skeleton poses produced by the Kinect sensor vary with the orientation between the performer and the Kinect sensor. We thus need to preserve angle information in our representation.

Fig. 1. Schematic representation of soft-KNN.

The temporal length of the training data for vDRNN is 20. The training data for vDRNN are the velocities of the improved skeleton poses, defined by $v_t = \tilde{z}_t - \tilde{z}_{t-1}$. We denote the input and output of vDRNN as $v$ and $\tilde{v}$, respectively.
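
The data preparation described above (positions relative to parent joints with spinemid as the root, frame-to-frame velocities, and fixed-length training sequences) can be sketched as follows. The paper does not list the parent of each joint, so the `PARENT` table is a hypothetical tree rooted at spinemid, and all function names are illustrative assumptions.

```python
# Hypothetical parent table rooted at spinemid (an assumption, not from the paper).
PARENT = {
    "spinemid": None,
    "spinebase": "spinemid",
    "spineshoulder": "spinemid",
    "neck": "spineshoulder",
    "shoulderleft": "spineshoulder",
    "elbowleft": "shoulderleft",
    "wristleft": "elbowleft",
    "shoulderright": "spineshoulder",
    "elbowright": "shoulderright",
    "wristright": "elbowright",
    "hipleft": "spinebase",
    "kneeleft": "hipleft",
    "ankleleft": "kneeleft",
    "hipright": "spinebase",
    "kneeright": "hipright",
    "ankleright": "kneeright",
}


def to_relative(pose):
    """Express every joint relative to its parent; the root stays absolute."""
    rel = {}
    for joint, parent in PARENT.items():
        if parent is None:
            rel[joint] = tuple(pose[joint])
        else:
            rel[joint] = tuple(c - p for c, p in zip(pose[joint], pose[parent]))
    return rel


def to_absolute(rel):
    """Invert to_relative by accumulating offsets from the root outwards."""
    absolute = {}

    def resolve(joint):
        if joint not in absolute:
            parent = PARENT[joint]
            if parent is None:
                absolute[joint] = tuple(rel[joint])
            else:
                base = resolve(parent)
                absolute[joint] = tuple(r + b for r, b in zip(rel[joint], base))
        return absolute[joint]

    for joint in PARENT:
        resolve(joint)
    return absolute


def flatten(pose):
    """Concatenate x, y, z of the 16 joints into one 48-dimensional vector."""
    return [c for joint in PARENT for c in pose[joint]]


def velocities(frames):
    """Frame-to-frame differences v_t = z_t - z_{t-1} of flat pose vectors."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(frames, frames[1:])]


def windows(seq, length):
    """All contiguous subsequences of the given temporal length (stride 1)."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]
```

Under these assumptions, `windows(frames, 7)` would yield pDRNN training sequences and `windows(velocities(frames), 20)` would yield vDRNN training sequences.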
The L-BFGS optimization algorithm is used to train the two networks from random initialization, and the sum of squared errors is used as the objective function.

IV. THREE METHODS FOR INTEGRATING IMPROVED POSITION AND VELOCITY OF SKELETONS

pDRNN trained on a large amount of training data can already refine an inaccurate Kinect skeleton. However, further improvement can be expected from integrating pDRNN and vDRNN. We therefore propose three methods to combine the outputs produced by pDRNN and vDRNN.

The first method uses K-Nearest Neighbor (KNN). KNN is an instance-based method for classification and regression. In both cases, the target value of an unknown input is determined according to the values of its K nearest training data. Although the scheme works well, it is sensitive to the value of K. Thus, we vary the value of K automatically, and we call this variant of KNN soft-KNN (sKNN) in the following. The second method is based on Kalman filtering. The Kalman filter is an algorithm that estimates the true state at time t by observing a series of measurements over time. Specifically, the Kalman filter predicts and corrects the estimate based on measurement and process models. The outputs of pDRNN and vDRNN are used for the measurement and process models, respectively. The last method combines sKNN and Kalman filtering. The details are described in Section IV-C.

A. Integrating based on Soft-KNN

Figure 1 shows a schematic diagram of soft-KNN. Let $S = \{(\tilde{z}_1, z^{M}_1), \ldots, (\tilde{z}_N, z^{M}_N)\}$ be a set of N input-output training points, where $\tilde{z}$ is the skeleton pose refined by pDRNN and $z^{M}$ is the corresponding set of body joints captured by the motion capture system. For a novel pattern $\tilde{z}_t$ at time t, the proposed soft-KNN regression computes the mean of the target values of its $\tilde{K}$ nearest neighbors.
The j-th component of the skeleton pose generated by soft-KNN is defined by

$$\hat{z}^{j}_{t} = \frac{1}{\tilde{K}} \sum_{\substack{i \in N_K(\tilde{z}_t) \\ p_j\left(\left|v^{j}_{i,t} - \tilde{v}^{j}_{t}\right|\right) > \theta}} z^{j,M}_{i} \tag{4}$$

where the set $N_K(\tilde{z}_t)$ contains the indices of the K nearest neighbors of $\tilde{z}_t$. The number of nearest neighbors in the summation is reduced to $\tilde{K}$, which is determined by $p_j(|v^{j}_{i,t} - \tilde{v}^{j}_{t}|)$. Here $\tilde{v}^{j}_{t}$ is the j-th component of the velocity generated by vDRNN, and $v^{j}_{i,t}$ is the velocity of the j-th component of the i-th training sample, defined by

$$v^{j}_{i,t} = \tilde{z}^{j}_{i} - \hat{z}^{j}_{t-1} \tag{5}$$

where $\hat{z}^{j}_{t-1}$ is the j-th component of the skeleton pose obtained by soft-KNN regression at time $t-1$. It should be noted that $\hat{z}^{j}_{t-1}$ is used for computing the velocity instead of $\tilde{z}^{j}_{t-1}$. This is because $\hat{z}^{j}_{t-1}$ is assumed to be closer to the true joint position than $\tilde{z}^{j}_{t-1}$, and is thus more appropriate for calculating the current velocity of the j-th component of the i-th sample. $\hat{z}^{j}_{0}$ is set to $\tilde{z}^{j}_{0}$. We assume that the initial body pose improved by pDRNN is very close to the skeleton tracked by the motion capture system, because in our experiment the initial pose of the performer is restricted to standing facing the Kinect sensor.

The two conditions on the summation in Equation (4) indicate that if the j-th component of the velocity of the i-th sample is far from the j-th component of the improved current velocity, the j-th component of the sample is excluded from the summation even though the i-th training sample is among the K nearest neighbors. The probability distribution $p_j$ for the j-th component in Equation (4) is a zero-mean Gaussian and is estimated during the training phase. Its mean and variance are estimated by computing $|v^{j,M}_{i} - \tilde{v}^{j}_{i}|$ over the entire validation dataset. Here, $v^{j,M}_{i}$ denotes the true velocity computed from the motion capture data. In this work, K is set to 300 and $\theta$ is 0.05.

B. Integrating based on Kalman Filtering

In the Kalman filter framework, the dynamics and the measurements are modeled by the following discrete-time state-space model:

$$x_t = F_t x_{t-1} + G_t v_t + w_t \tag{6}$$

$$z_t = H_t x_t + u_t \tag{7}$$

where $x$, $z$, $v$, $F$, $G$ and $H$ are the state vector, measurement vector, input control vector, state transition matrix, input transition matrix, and measurement matrix, respectively. It is assumed that $w$ is the process noise vector, which has zero mean and covariance matrix $Q = E\{ww^{T}\}$, and $u$ is the measurement noise vector, which also has zero mean, with covariance matrix $R = E\{uu^{T}\}$. In this work, since we consider uncorrelated noise, $Q$ and $R$ are diagonal matrices. In our experiment, $F$, $G$ and $H$ were set to the identity matrix, hence the prediction model becomes $x_t = x_{t-1} + v_t$. $Q$ and $R$ were determined using the validation dataset. The state $x_t$ that we should estimate is the true skeleton pose, and its dimension is 48, as mentioned earlier. Our contribution is to replace the measurement vector $z_t$ with the improved body joints $\tilde{z}_t$, and the input control vector $v_t$ with the enhanced velocities $\tilde{v}_t$. Therefore, the j-th diagonal entries of $R$ and $Q$ are determined by computing $(z^{j,M}_{i} - \tilde{z}^{j}_{i})$ and $(v^{j,M}_{i} - \tilde{v}^{j}_{i})$, respectively. In our methods, $x_0$ was set to $\tilde{z}_0$.

C. Integrating based on combination of Soft-KNN and Kalman Filtering

The last method combines the soft-KNN and Kalman filtering methods described above. We refer to this method as sKNNkF in the following. Figure 2 shows a schematic diagram of sKNNkF.

Fig. 2. Schematic representation of sKNNkF.

Let $S^{+} = \{(x_1, x^{M}_1), \ldots, (x_N, x^{M}_N)\}$ be a set of N input-output training points, where $x$ is estimated by Kalman filtering and $x^{M}$ is the corresponding skeleton pose captured by the motion capture system. For a novel pattern $x_t$ at time t, the soft-KNN regression computes the mean of the target values of its $\tilde{K}$ nearest neighbors.
The j-th component of the skeleton pose generated by soft-KNN is defined by

$$\hat{z}^{+,j}_{t} = \frac{1}{\tilde{K}} \sum_{\substack{i \in N_K(x_t) \\ p^{+}_j\left(\left|v^{+,j}_{i,t} - \tilde{v}^{+,j}_{t}\right|\right) > \theta^{+}}} x^{j,M}_{i} \tag{8}$$

Here, $v^{+,j}_{i,t}$ is the velocity of the j-th component of the i-th training sample, defined by

$$v^{+,j}_{i,t} = x^{j}_{i} - \hat{z}^{+,j}_{t-1} \tag{9}$$

The probability distribution $p^{+}_j$ for the j-th component in Equation (8) is a zero-mean Gaussian and is estimated during the training phase. Its mean and variance are estimated by computing $|v^{j,M}_{i} - \tilde{v}^{+,j}_{i}|$ over the entire validation dataset. In this work, K is set to 300 and $\theta^{+}$ is 0.05. Here, $\tilde{v}^{+,j}_{i}$ denotes the improved velocity obtained using another deep recurrent neural network, which we call vDRNN+. The input training data for vDRNN+ are the velocities of the skeleton poses estimated in the Kalman filtering step, defined by $v^{+}_t = x_t - x_{t-1}$. We denote the output of vDRNN+ as $\tilde{v}^{+}$. The network has the same structure as vDRNN, and the temporal length of its training data is also the same.

V. A NOVEL MEASURE FOR EVALUATING HUMAN-LIKE NATURAL MOVEMENT

As mentioned earlier, our goal is to propose a skeleton tracking method in which the captured body joint trajectories are human-like natural movements. The most popular measure for evaluating the quality of a tracked skeleton pose is the average position error (APE). If the APE of a sequence of 3D positions is less than 1 mm, the estimated trajectory can be considered human-like movement. In practice, this condition is extremely difficult to meet. Moreover, we found that if the APEs of two skeleton trajectories are 3 cm and 4 cm, respectively, we cannot be confident which of the two is the better movement. Consider two tracked trajectories. The former is a joint trajectory that has a large number of small vibrated motions. In contrast, the latter consists of natural movements, but the orientation of the tracked body center is slightly different from that of the ground truth body center. In this case, the APE of the latter is often larger than that of the former.
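
The situation described above can be made concrete with a toy example. The sketch below assumes that APE is the mean Euclidean distance between tracked and ground-truth 3D positions (the paper does not spell out the formula), and it uses a constant positional offset as a simplified stand-in for the orientation mismatch. All names and numbers are illustrative.

```python
import math


def ape(tracked, truth):
    """Average position error, assumed here to be the mean Euclidean distance."""
    dists = [math.dist(p, q) for p, q in zip(tracked, truth)]
    return sum(dists) / len(dists)


# Ground truth: one joint moving smoothly along x at 1 cm per frame (units: m).
truth = [(0.01 * t, 0.0, 2.0) for t in range(50)]

# Trajectory A: follows the truth but vibrates by +/- 1 cm in y on every frame.
jittery = [(x, y + (0.01 if t % 2 == 0 else -0.01), z)
           for t, (x, y, z) in enumerate(truth)]

# Trajectory B: perfectly smooth but shifted by a constant 3 cm in x.
shifted = [(x + 0.03, y, z) for (x, y, z) in truth]

ape_jittery = ape(jittery, truth)
ape_shifted = ape(shifted, truth)
```

In this toy setting the vibrating trajectory scores the lower (better) APE even though the shifted trajectory is the smoother of the two, which is the kind of case APE alone cannot distinguish.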
Therefore, a novel measure for assessing human-like natural movement is required. Flash and Hogan have proposed that the human motor system minimizes jerk [19]. Jerk is the third derivative of the position trajectory. Accordingly, some researchers have developed human motion prediction techniques based on the minimum jerk model [20], [21]. However, the minimum jerk assumption fails if the human decides to change the course of the trajectory while performing an activity. We also found that the jerks of some actions, such as kicking or punching, are not low. Hence, we define the jerk error (JE) of the j-th component of the tracked skeleton at time t as

$$JE = \left|j^{j}_{t} - j^{j,M}_{t}\right| \tag{10}$$

where $j^{j}_{t}$ is the jerk of the tracked trajectory and $j^{j,M}_{t}$ is the jerk of the trajectory captured by the motion capture system. We argue that the average jerk error (AJE) can evaluate the naturalness of captured motions in terms of vibrated and discontinuous motions. However, AJE alone cannot evaluate the quality of tracking appropriately. Consider an extreme case: if one activity is standing and the other is sitting, the jerks of the two activities are identical, so the jerk error is zero even though the poses differ. Hence, in our experiment, we consider APE as well as AJE.

VI. EXPERIMENTS

A. Experimental Setup

We implemented the algorithm proposed in this paper using MATLAB and the Microsoft Kinect SDK 2.0 on Windows 8. All experimental tests were run on a PC with an Intel Core i5 1.8 GHz processor and 4 GB RAM. The Microsoft Kinect SDK 2.0 can extract skeleton data at approximately 30 frames per second (fps). For supervised learning and evaluation, we employed an OptiTrack motion capture system to provide a set of ground truth trajectories. The Kinect sensor and the motion capture system tracked skeleton poses simultaneously while recording the capture time, so that we could construct sets of input and target data pairs. The Kinect sensor and the motion capture system were extrinsically calibrated using a least-squares solution. We collected training, validation and test datasets.
The training and validation datasets are composed of free movements that a human can perform. The validation dataset was employed to decide the structure of the DRNNs, such as the number of layers, the number of hidden neurons and the temporal length of the training data. The variances of the Gaussian distributions used in soft-KNN and the covariance matrices $R$ and $Q$ used in Kalman filtering were also determined using the validation dataset. The numbers of frames in the training and validation datasets are 45,179 and 6,483, respectively. As mentioned earlier, the temporal lengths of the training data for pDRNN, vDRNN and vDRNN+ are 7, 20 and 20, respectively. To construct the dataset for training the deep recurrent neural networks, we sampled sequences of data with a temporal stride of 1.

The test dataset consists of 11 activity classes: Crossing arms, Crossing arms and legs, Crossing legs, Bowing from the waist, Punching, Running, Crossing legs on the chair, Sitting on the chair, Spinning, Walking around and Kicking. Some activities, such as Crossing arms and legs and Sitting on the chair, contain a large number of severe self-occlusion poses, while Running and Bowing from the waist include a small number of self-occlusion poses. Each activity class was repeated ten times. Every activity starts with a standing pose and then repeats a certain action, such as Crossing arms or Punching, several times. An activity is composed of approximately 150 to 250 frames. The total number of frames in the test dataset is 20,508. Every activity except Spinning and Walking around was performed facing the Kinect sensor. For Spinning and Walking around, the minimum and maximum orientations relative to the Kinect sensor were -90° and 90°, respectively. We did not allow the Kinect sensor to look at the performer's back because the Kinect skeleton tracker cannot distinguish front from back. The average distance from the Kinect sensor to the human was about 3 m, and the height of the Kinect above the ground plane was 130 cm.

B. Experimental Results

We have implemented three skeleton tracking techniques: (1) sKNN (integrating pDRNN and vDRNN based on soft-KNN), (2) kF (integrating pDRNN and vDRNN in the Kalman filter framework) and (3) sKNNkF (integrating pDRNN, vDRNN and vDRNN+ based on the combination of soft-KNN and Kalman filtering). We have additionally implemented six skeleton tracking techniques for the sake of comparison: (1) Kinect Skeleton, (2) pDRNN (skeleton tracking using pDRNN), (3) sKNN-pDRNN (sKNN without pDRNN), (4) sKNN-vDRNN (sKNN without vDRNN), (5) naïve-sKNN (sKNN using $\tilde{z}^{j}_{t-1}$ instead of $\hat{z}^{j}_{t-1}$ in Equation (5)) and (6) kF-pDRNN (kF without pDRNN).

sKNN-pDRNN chooses the K nearest neighbors of $z_t$ instead of $\tilde{z}_t$, and a dataset $S^{-} = \{(z_1, z^{M}_1), \ldots, (z_N, z^{M}_N)\}$ is used for training. sKNN-vDRNN reduces the number of nearest neighbors from K to $\tilde{K}$ using $p_j(|v^{j}_{i,t} - v^{j}_{t}|)$ instead of $p_j(|v^{j}_{i,t} - \tilde{v}^{j}_{t}|)$; $p_j(|v^{j}_{i,t} - v^{j}_{t}|)$ is estimated by computing $|v^{j,M}_{i} - v^{j}_{i}|$. kF-pDRNN employs $v_t$ for the control input, and $Q$ is determined by computing $(v^{j,M}_{i} - v^{j}_{i})$. We did not implement kF-vDRNN since it produces results identical to those of pDRNN. In kF-vDRNN, $v_t = z_t - z_{t-1}$, where $z_t$ denotes the measurement (the pDRNN output), and the prediction model becomes $x_t = x_{t-1} + v_t$. If $x_0$ is set to $z_0$, then $x_1$ becomes equal to $z_1$. In this way, $x_t = z_t$ for all times t.

Fig. 3. The average position error (APE) for each activity class and on average, for Kinect Skeleton, pDRNN, sKNN-vDRNN, naïve sKNN, sKNN-pDRNN, kF-pDRNN, sKNN, kF and sKNNkF.

Fig. 4. The average jerk error (AJE) for each activity class and on average, for Kinect Skeleton, pDRNN, sKNN-vDRNN, naïve sKNN, sKNN-pDRNN, kF-pDRNN, sKNN, kF and sKNNkF.
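
The argument that kF-vDRNN reduces to its measurements can be checked numerically. The sketch below is a minimal per-component Kalman filter for the identity model used in this paper (F, G and H equal to the identity, diagonal Q and R); the function name, the initial state variance `p0` and the example numbers are hypothetical. When the control input is the difference of consecutive measurements and the initial state equals the first measurement, the prediction already coincides with the measurement, the innovation is zero, and the filter returns the measurements unchanged (up to rounding).

```python
def kalman_filter(z, v, q, r, x0, p0=1.0):
    """Per-component Kalman filter with F = G = H = I and diagonal Q, R.

    z: measurements z_0..z_T, v: control inputs v_1..v_T,
    q, r: diagonals of Q and R, x0: initial state,
    p0: initial state variance (hypothetical choice).
    """
    x = list(x0)
    p = [p0] * len(x0)
    out = [list(x)]
    for z_t, v_t in zip(z[1:], v):
        for j in range(len(x)):
            x_pred = x[j] + v_t[j]            # prediction: x_t = x_{t-1} + v_t
            p_pred = p[j] + q[j]
            gain = p_pred / (p_pred + r[j])   # Kalman gain
            x[j] = x_pred + gain * (z_t[j] - x_pred)  # correction by innovation
            p[j] = (1.0 - gain) * p_pred
        out.append(list(x))
    return out


# Measurements for a two-dimensional toy state.
z = [[0.0, 1.0], [0.3, 1.1], [0.5, 0.9], [0.4, 1.4]]
# kF-vDRNN case: the control input is the difference of the measurements.
v = [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(z, z[1:])]
x = kalman_filter(z, v, q=[0.01, 0.02], r=[0.1, 0.2], x0=z[0])
```

Here `x` agrees with `z` frame by frame, independently of the chosen Q and R. With a control input that carries independent information, as when the vDRNN output is used, the innovation is generally nonzero and the estimate differs from the measurements.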

Figure 3 shows the average position error (APE). The APE of the Kinect skeleton is 0.058, and pDRNN decreases the APE to 0.026, which is a considerable reduction. The APEs of kF, sKNN and sKNNkF are 0.0297, 0.0427 and 0.0454, respectively. There is a small increase in the APE of kF compared with that of pDRNN. In contrast, the APE of sKNN is considerably larger than that of kF. This seems to be because KNN estimates the current pose from a simple combination of the nearest training samples. sKNNkF performs slightly worse than sKNN, and its APE is the highest among the three proposed methods. It is observed that the APE accumulates through sKNN and kF. For sKNN-pDRNN and kF-pDRNN, the APEs are even higher than the APE of the Kinect skeleton. We can conclude that pDRNN plays an important role in reducing the APE and that additional procedures after the regression using pDRNN increase the APE.

Figure 4 shows the average jerk error (AJE). The AJEs of sKNN, kF and sKNNkF are the best (0.0016, 0.0016 and 0.0011, respectively). kF-pDRNN achieves an AJE similar to those of the three proposed methods, but, as shown in Figure 3, the APE of kF-pDRNN is worse than that of the Kinect skeleton. We can conclude that vDRNN plays an important role in reducing the AJE. However, although naïve-sKNN integrates both pDRNN and vDRNN, its reduction of the AJE is small.

According to Figure 4, the AJEs of the Kinect skeleton and pDRNN are similar, at 0.0063 and 0.006, respectively. It seems that pDRNN has little influence on improving vibrated and discontinuous movement. To investigate the effect of pDRNN on tracking human-like natural motion, we construct an average jerk error histogram. First, we divide the entire range of the jerk error into ten bins, and then the AJE for each bin is computed. For example, the last bin sums the jerk errors greater than 0.27, and the sum is divided by M, where M = (the dimension of the skeleton pose) * (the total number of frames in the test dataset).
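
The jerk error of Equation (10) and the binned average described above can be sketched as follows. The paper does not state how jerk is discretized or where the bin edges lie, so two assumptions are made here: jerk is approximated by a third-order backward finite difference in per-frame units, and the ten bins consist of nine equal-width bins covering 0 to 0.27 plus one bin for errors at or above 0.27. Function names are illustrative.

```python
def jerk(series):
    """Third-order backward finite difference of a 1D trajectory (assumed form)."""
    return [series[t] - 3 * series[t - 1] + 3 * series[t - 2] - series[t - 3]
            for t in range(3, len(series))]


def jerk_errors(tracked, truth):
    """Absolute jerk difference per frame, following Equation (10)."""
    return [abs(a - b) for a, b in zip(jerk(tracked), jerk(truth))]


def aje(errors):
    """Average jerk error over all collected jerk errors."""
    return sum(errors) / len(errors)


def aje_histogram(errors, n_bins=10, last_edge=0.27):
    """Sum the jerk errors falling into each bin and divide every sum by M.

    M is the total number of jerk errors (components times frames), so the
    bin values add up to the overall AJE.
    """
    width = last_edge / (n_bins - 1)
    sums = [0.0] * n_bins
    for e in errors:
        idx = min(int(e / width), n_bins - 1)  # last bin collects e >= last_edge
        sums[idx] += e
    m = len(errors)
    return [s / m for s in sums]
```

Because each bin holds a partial sum divided by the same M, comparing two methods bin by bin shows which magnitudes of jerk error contribute to their overall AJE.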
It should be noted that jerk errors falling into the first bin usually correspond to very small vibrated motions, which can be hard to recognize.

Fig. 5. The average jerk error histogram for Kinect skeleton and pDRNN.

Fig. 6. The average jerk error histogram for Kinect skeleton, pDRNN, kF-pDRNN, sKNN, kF and sKNNkF.

Fig. 7. The skeleton poses tracked by Kinect skeleton tracker, motion capture system and sKNNkF.

The AJE histogram for the Kinect skeleton and pDRNN is shown in Figure 5. It is observed that the AJE of pDRNN is larger than that of the Kinect skeleton in the first three bins, while pDRNN achieves a smaller error than the Kinect skeleton in the rest. This implies that pDRNN reduces moderate and large jerk errors. In our experiments, we found that pDRNN cannot remove small vibrated movements but is effective at alleviating severe discontinuity.

Figure 6 shows the average jerk error histogram for the Kinect skeleton, pDRNN, kF-pDRNN, sKNN, kF and sKNNkF. It is observed that sKNN performs better than kF in bins 1 to 4, while kF performs better than sKNN in bins 5 to 10. sKNNkF achieves good performance over the entire range of the histogram. kF-pDRNN does not seem to be effective at reducing moderate and large jerk errors. The three proposed methods can reduce unnatural movements ranging from small vibration to severe discontinuity.

For qualitative comparison, the tracking results produced by the Kinect skeleton tracker, the motion capture system and sKNNkF are displayed in Figure 7. The seven images in each row were chosen during the Crossing arms and legs, Crossing legs on the chair, Spinning, Walking around, Crossing arms, Punching and Crossing legs behaviors, respectively. The three images in each column were generated by the Kinect skeleton tracker, the motion capture system and sKNNkF, respectively.
The Kinect skeleton tracker produces significant errors. In the pose taken during the Crossing legs on the chair activity, the legs are not crossed, and the poses selected during the Spinning and Walking around activities differ considerably from the ground truth, whereas the skeletons generated by sKNNkF look similar to the ground truth and appear to reflect the natural movement of the performer.

VII. CONCLUSIONS

The goal of this paper was to propose a method that improves the Kinect skeleton, which becomes vibrated and discontinuous when self-occlusion occurs, toward human-like natural motion. To this end, we first employed deep recurrent neural networks to reduce the position and velocity errors of the skeleton poses. Then, we proposed three methods to integrate the enhanced joint positions and velocities. Moreover, we suggested a novel measure to evaluate the naturalness of captured motions. We evaluated the proposed methods against the ground truth acquired from a commercial motion capture system and compared the results with those of the Kinect skeleton and of several variants of our methods. The three proposed approaches performed considerably better than the Kinect skeleton tracker, and the proposed integration methods yielded further improvement over refining the Kinect skeleton data using only pDRNN.

REFERENCES

[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," International Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[2] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by backpropagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[3] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[4] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in ICML, 2013.
[5] N. Le Roux and Y. Bengio, "Deep belief networks are compact universal approximators," Neural Computation, vol. 22, no. 8, pp. 2192-2207, 2010.
[6] O. Delalleau and Y. Bengio, "Shallow vs. deep sum-product networks," in NIPS, 2011.
[7] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012.
[8] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent Neural Network Regularization," http://arxiv.org/abs/1409.2329v5.
[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[10] S. Park and M. Trivedi, "Understanding human interactions with track and body synergies (TBS) captured from multiple views," Comput. Vis. Image Understand., vol. 111, no. 1, pp. 2-20, 2008.
[11] J. Ziegler, K. Nickel, and R. Stiefelhagen, "Tracking of the articulated upper body on multi-view stereo image sequences," in Proc. Comput. Vis. Pattern Recognit., 2006.
[12] M. Hofmann and D. Gavrila, "Multi-view 3D Human Pose Estimation in Complex Environment," International Journal of Computer Vision, pp. 1-22, 2011.
[13] A. Baak, M. Muller, G. Bharaj, H.-P. Seidel, and C. Theobalt, "A data-driven approach for real-time full body pose reconstruction from a depth camera," in ICCV, pp. 1092-1099, Nov. 2011.
[14] Q. Zhang, X. Song, X. Shao, R. Shibasaki, H. Zhao, "Unsupervised skeleton extraction and motion capture from 3D deformable matching," Neurocomputing, Elsevier, pp. 170-182, 2013.
[15] L. Zhang, J. Sturm, D. Cremers, and D. Lee, "Real-Time Human Motion Tracking using Multiple Depth Cameras," Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012.
[16] Y. Liu, J. Gall, C. Stoll, Q. Dai, H.-P. Seidel, and C. Theobalt, "Markerless Motion Capture of Multiple Characters Using Multi-view Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2720-2735, 2013.
[17] J.-T. Masse, F. Lerasle, M. Devy, A. Monin, O.
Lefebvre, and S. Mas, ”Human Motion Capture Using Data Fusion of Multiple Skeleton Data,” ACIVS, volume 8192 of Lecture Notes in Computer Science, pp. 126-137. Springer, 2013. [18] K.Y. Yeung, T.H. Kwok, C.L. Wang, ”Improved Skeleton Tracking by Duplex Kinects: A Practical Approach for Real-Time Applications,” Journal of Computing and Information Science in Engineering, vol. 13, no. 4, pp. 1-10, 2013. [19] T. Flash and N. Hogan, The coordination of arm movements: An experimentally confirmed mathematical model, The Journal of Neu- roscience, vol. 5, no. 7, pp. 1688-1703, 1985. [20] A. Thobbi, Y. Gu, and W. Sheng, ”Using human motion estimation for human-robot cooperative manipulation,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011. [21] B. Corteville, E. Aertbelien, H. Bruyninckx, J. De Schutter, and H. Van Brussel, Human-inspired robot assistant for fast point-to- point movements, in IEEE International Conference on Robotics and Automation, 2007. Tracking Human-like Natural Motion Using Deep Recurrent Neural Networks Youngbin Park, Sungphill Moon and Il Hong Suh Abstract— Kinect skeleton tracker is able to achieve con- siderable human body tracking performance in convenient and a low-cost manner. However, The tracker often captures unnatural human poses such as discontinuous and vibrated motions when self-occlusions occur. A majority of approaches tackle this problem by using multiple Kinect sensors in a workspace. Combination of the measurements from different sensors is then conducted in Kalman filter framework or optimization problem is formulated for sensor fusion. However, these methods usually require heuristics to measure reliability of measurements observed from each Kinect sensor. In this paper, we developed a method to improve Kinect skeleton using single Kinect sensor, in which supervised learning technique was employed to correct unnatural tracking motions. 
Specifically, deep recurrent neural networks were used for improving joint positions and velocities of Kinect skeleton, and three methods were proposed to integrate the refined positions and velocities for further enhancement. Moreover, we suggested a novel measure to evaluate naturalness of captured motions. We evaluated the proposed approach by comparison with the ground truth obtained using a commercial optical maker-based motion capture system. I. INTRODUCTION The second version of the device, the Microsoft Kinect for Window v2(Kinect v2), was released and made available to researchers in 2014. This new generation of Kinect sensor offers a higher resolution and a wider field of view compared to the original Kinect technology. Further, in terms of depth, Kinect v2 is based on time-of-flight principle, whereas the previous version of Kinect utilized structured light to reconstruct the third dimension. This difference has led a considerable improvement in the accuracy of depth sensing. To enable the use of Kinect sensors for developers and researchers, the official Microsoft SDKs (Software Develop- ment Kits) 1.0 and 2.0 are freely available for Kinect v1 and v2, respectively. These SDKs provide a set of functions, especially including human body skeleton tracker. Due to the enhanced depth sensor, tracking accuracy has been improved in Kinect V2. Therefore, in this work we developed our skeleton tracking system based on Kinect v2. Although Kinect v2 provides better tracking results com- paring to Kinect v1, it often captures unnatural skeleton poses such as discontinuous and vibrated motions in the presence of self-occlusion, which is common among most vision-based sensing systems. A simple way to solve this problem is to use multiple cameras in the workspace. For instance, if a view of a body part is blocked from one camera, it might be possible to obtain a view of the body part from another camera. 
Subsequently, appropriately combining data obtained from multiple Kinect sensors can be used to achieve more accurate tracking compared with a single sensor. A majority of approaches integrate the measurements from different sensors in Kalman filter framework or for- mulate optimization problem for sensor fusion. However, these methods require way to estimate the confidence of each measurement for combining multiple observations based on the confidence level. This usually leads heuristic measure to evaluate reliability of the measurements. In this paper, we developed a method to improve Kinect skeleton using single Kinect sensor, in which supervised learning technique was employed to correct unnatural track- ing motions. Specifically, deep recurrent neural networks were used for improving joint positions and velocities of Kinect skeleton data, and three methods were proposed to integrate the refined joint positions and velocities for further enhancement. Consequently, the proposed method removes jitters and promotes temporal continuity. Moreover, we suggested a novel measure to evaluate naturalness of captured motions. The remainder of the paper is organized as follows. Section 2 provides a survey of the current literatures related to the topic of improvement of Kinect skeleton. Section 3 briefly describes how to improve joint positions and velocities of Kinect skeleton data using deep recurrent neural network. In Section 4, three methods are proposed to inte- grate the enhanced position and velocity. A novel measure to evaluate naturalness of captured motions is given in Section 5. Section 6 presents our experimental setup and evaluation of the performance of the proposed model. Finally, we present out conclusions in Section 7. II. RELATED WORKS Skeleton tracking algorithms can be classified into single- view based models [10], [11], [12] and multi-view based model [13], [14]. Shotton el al. 
[1] proposed a new method to predict 3D positions of body joints from a single depth image. In their method, an intermediate representation of body parts was designed to map the pose estimation problem onto a per-pixel classification problem. An extensively large and highly varied training data set is employed for the random forest classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally, confidence-scored 3D proposals of several body joints are generated by re- projecting the classification results to the 3D world and finding local modes. As a result, this approach can quickly and accurately predict the 3D positions of body joints. The skeleton trackers in both the first and second versions of the Kinect SDK are based on this algorithm. However, the 3D arXiv:1604.04528v1 [cs.CV] 15 Apr 2016 body pose that is estimated using a single view frequently has problems of determining positions of joints during self- occlusion motions. Consequently, Kinect skeleton tracker has problems of capturing discontinuous movements or unwanted vibration. Therefore, approaches that utilize multiple views have recently begun to receive significant attention. For example, Zhang el al. [15] fused individual depth images to a joint point cloud and used an efficient particle filtering approach for pose estimation. Likewise, Liu el al. [16] presented a markerless motion capture approach for multi-view video that reconstructs the skeletal motion and detailed surface geometries of two closely interacting people. The approach presented in this paper differs from the methods used by studies described above. Specifically, our goal was not to develop a method that estimates 3D positions of body joint directly from raw depth images or RGB images, but rather to investigate how to generate more human-like natural motion by improving the estimated Kinect v2 skeleton. 
Indeed, there have been relatively few studies to determine skeleton pose by enhancing Kinect skeleton tracking. Masse el al. [17] presented a framework that obtains 3D positions of body joints from multiple Kinect sensors and then inputs the measured skeletons into a Gated Kalman Filter. In their method, the gated Kalman Filter rejects skeleton poses if the measurement residual referred to as innovation is lower than the gating threshold. This is done in order to discard faulty sensor readings and retain correct measurements. For quantitative evaluation, commercial motion capture system is used to get access to the ground truth. However, the processing step to reject measurement is quite simple and entirely relies on innovation. This might be often possible to lead ineffective measurement fusion. Yeung el al. [18] developed a method synthesizing skele- tons with duplex Kinect sensors that capture human mo- tion in different views. In their study, each joint had two measurements reported by two cameras. The major tech- nical difficulty comes from how to evaluate the reliability of the two values at each joint, and how to resolve any inconsistencies. To address this problem, they developed a measure to estimate confidence on the 3D positions obtained using the Kinect skeleton tracker. Specifically, the distances between a joint i and the closest joint j estimated from Kinect A and the distance between corresponding joint i and the closest joint k estimated from Kinect B are computed, then if the distance between i and j is smaller than the distance between i and k the joint i obtained from Kinect A is considered as unreliable estimation otherwise, the joint i obtained from Kinect B is considered as the mis-leading joint. This reliability was computed in advance before data fusion procedure based on mathematical optimization was executed. 
Data fusion procedure was formulated under the mathematical optimization problem, in which objective is to reduce sum of differences between the estimated joint position and the corresponding more reliable position, and the bone-lengths are given as equality constraints. Both studies described above are different to our approach in following two reason: First, single Kinect sensor was used in our method. Second, We formulate our problem as supervised learning task instead of employing simple Kalman filtering or formulating mathematical optimization problem. In terms of these two aspects, an approach similar to our method has not been proposed. III. IMPROVING POSITION AND VELOCITY OF KINECT SKELETON USING DEEP RECURRENT NEURAL NETWORK First part of our method is to improve joint position and velocity of Kinect skeleton using supervised learning. The inputs for the supervised learning are sequences of 3D position or velocity obtained by Kinect skeleton tracker and the targets are sequences of skeleton pose captured using commercial optical maker-based motion capture system. In our method, deep recurrent neural network is employed to solve the regression problem, in which two deep recurrent neural networks are trained separately for refining positions and velocities of body joints. In this Section, we will briefly describe deep recurrent neural network and present the detail of how to train the networks. A. Deep Recurrent Neural Network A recurrent neural network (RNN) [2] is a neural network that simulates a discrete-time dynamical system and are a powerful model for sequential data. A conventional RNN is constructed by defining the transition function and the output function as ht = φh WT ht−1 + UT xt  (1) yt = φo VT ht  , (2) where φh, φo, xt, yt and ht are respectively a state transition function, an output function, an input, an output, a hidden state, and W, U and V are the transition, input and output matrices, in that order. 
It is usual to use a nonlinear function such as a logistic sigmoid function or a hyperbolic tangent function for φh. Deep learning is built based on a hypothesis that a deep, hierarchical model can be exponentially more efficient at representing some functions than a shallow one [3]. Several theoretical results and empirical evidences support this hy- pothesis [5], [4], [6]. RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. However, the potential weakness for RNNs is that RNNs lack hierarchical processing of the input in space. From this perspective view, deep recurrent neural networks has recently gained significant attention to many researchers. As with feedforward deep neural networks have multiple nonlinear layers between input and output, a recurrent network can be considered as a deep recurrent neural network (DRNNs) if the network has more than one hidden layers. We can now consider two schemes of DRNNs. One has L hidden layer with temporal connection only at the l-th layer and the other has L hidden layer with full temporal connec- tions (called stacked RNN). Based on empirical evaluation on our datasets, we have chosen the former scheme. The l-th hidden activation at time t, hl t, is defined as hl t = φl  WlTht−1+UlTφl−1  Ul−1T . . . φ1  U1Txt  (3) where WlT and UlT represent the fully connected weight matrices for the recurrent connection and for the l-th layer, respectively. Because skeleton tracking is an inherently dynamic pro- cess, it seems natural to consider DRNNs as a model for supervised learning. As with most researcher, for the first time we train DRNNs, we considered two most popular deep learning techniques, Dropout and Rectified Linear Units (ReLU) [7]. We used a Rectified Linear Unit (ReLU) as nonlinear activation function for all units in hidden layers. However, unfortunately, dropout does not work well with RNNs unlikely feedforward deep neural networks. 
Although we carefully applied dropout to DRNNs with our datasets according to the way proposed by [8], we found that dropout leads to divergence. The values of output units are computed by linear activation. An alternative for modeling sequences is Long Short-Term Memory (LSTM) [9]. LSTM is a variants of the RNN that perform better on problems with long term dependencies because LSTM has been designed to address the vanishing and exploding gradient problems of conventional RNNs. We trained single layer LSTM and compared performance to single layer RNNs with ReLU activation function for hidden units. In our test dataset, however, LSTM achieved lower performance and took longer time to train. Hence, we did not employ LSTM for supervised learning. B. Details in Training Two DRNNs In the following, we will refer two DRNNs for improv- ing joint position and velocity of skeleton to pDRNN and vDRNN, respectively. pDRNN and vDRNN are five layers, where three layers are hidden and two layers are input and output, respectively. The size of each hidden layer is 256. The number of units in input and output layer 48 because the number of joints to be refined is 16 and each joint is composed of x, y and z coordinates. Kinect v2 supports 25 joints and 16 joints used in our method, which are as follows: spinebase, spinemid, neck, shoulderleft, elbowleft, wristleft, shoulderright, elbowright, wristright, hipleft, kneeleft, ankleleft, footleft, hipright, kneeright, ank- leright, footright, and spineshoulder. Among 25 joints, some joints, such as thumbleft and thumbright are tracked very unstable and some joints are not supported by the motion capture system. head, handleft, handright, handtipleft, thum- bleft, handtipright, thumbright, footleft and footright were excluded in our method. Temporal lengths of training data for pDRNN is 7. 
In training phase, absolute joint positions of Kinect skeleton, it is denoted by z, is transformed to relative positions with respect to parent joints. The root joint is spinemid and only root joint is represented by absolute position. The joints Fig. 1. Schematic representation of soft-KNN. tracked using motion capture system are transformed in the same way. Hence, the output of pDRNN is relative joint positions except spinemid joint. The output is transformed to absolute positions and it is denoted by ˜z. We do not represent body joint using relative angle because skeleton poses produced by Kinect sensor vary along with the change of the orientation between performer and Kinect sensor. We thus need to preserve angle information in our representation. Temporal lengths of training data for vDRNN is 20. The training data for vDRNN are the velocities of the improved skeleton poses, which is defined by vt = ˜zt−˜zt−1. We denote the input and output for vDRNN as v and ˜v, respectively. The L-BFGS optimization algorithm is used to train two networks from random initialization and sum-of-squared errors is used for objective functions. IV. THREE METHODS FOR INTEGRATING IMPROVED POSITION AND VELOCITY OF SKELETONS pDRNN trained based on a large amount of training data can already refine inaccurate Kinect skeleton. However, higher improvement can be expected by integrating pDRNN and vDRNN. In this sense, we propose three methods to combine the outputs produced by pDRNN and vDRNN. First method we have developed is to use K-Nearest Neighbor (KNN). KNN is an instance-based method for classification and regression. In both case, the target value of unknown input is determined according to the values of its K nearest training data. Although the scheme works well it is sensitive to the number of K. Thus, we varies the value of K automatically and we will call the variant of KNN as soft- KNN (sKNN) in the following. Second method is based on Kalman filtering. 
Kalman filter is an algorithm that assumes the true state at time t by observing a series of measurements over time. Specifically, Kalman filter predicts and corrects the estimate based on measurement and process models. The outputs of pDRNN and vDRNN are used for the measurement and process model, respectively. The last method is to combine sKNN and Kalman Filtering. The details will be described in Section 4.3. A. Integrating based on Soft-KNN Figure 1 shows schematic diagram of soft-KNN. Let S = { (˜z1, zM 1 ),. . ., (˜zN, zM N ) } be a set of N input-output training points, where ˜z is refined skeleton pose by pDRNN and zM is corresponding body joints captured from motion capture system. For a novel pattern ˜zt at time t, the proposed soft- KNN regression computes the mean of the target values of its ˜K-nearest neighbors. The j-th component of the skeleton pose generated by soft-KNN is defined by ˆzj t = 1 ˜K X i∈NK(˜zt) pj(|vj i,t−˜vj t |)>θ zj,M i (4) where set NK(˜zt) contains the indices of K-nearest neigh- bors of ˜zt. The number of nearest neighbors for summation is reduced to ˜K. ˜K is determined by pj(|vj i,t −˜vj t |). ˜vj t is the j-th component of the velocity generated by vDRNN. vj i,t is velocity of the j-th component of the i-th training data, which is defined by vj i,t = ˜zj i −ˆzj t−1 (5) where ˆzj t−1 is the j-th component of the skeleton pose obtained by soft-KNN regression at time t −1. It should be noted that ˆzj t−1 is used for computing velocity instead of ˜zj t−1. This is because ˆzj t−1 is assumed to be closer to the true joint position than ˜zj t−1 it is thus appropriate for calculating current velocity of the j-th component of the i-th sample. ˆzj 0 is set to ˜zj 0. We assume that the the initial body pose improved by pDRNN is very close to the skeleton tracked by motion capture system because in our experiment the initial pose of the performer is restricted to standing toward Kinect sensor. 
Two conditions for summation in Equation (4) indicate that if the j-th component of velocity of the i-th sample is far from the j-th component of improved current velocity, although the i-th training sample is included in K-nearest neighbors, the j-th component of the sample is excluded for summation. The probability distribution for the j-th component in Equation (4) is zero mean Gaussian and is estimated during training phase. Mean and variance are estimated by computing |vj,M i −˜vj i | on all validation dataset. Here, vj,M i denotes true velocity computed using motion capture data. In this work, K is set to 300 and θ is 0.05. B. Integrating based on Kalman Filtering In Kalman filter framework, the dynamics and the mea- surements are modeled by the following discrete-time state- space model: xt = Ftxt−1 + Gtvt + wt (6) zt = Htxt + ut. (7) where x, z, v, F, G and H are the state vector, measurement vector, input control vector, state transition matrix, input transition matrix, and measurement matrix, respectively. It is assumed that w is the process noise vector, which has has zero mean with a covariance matrix Q = E{wwT }, and u is the measurement noise vector that also has zero mean with a covariance matrix R = E{uuT }. In this work, since we consider an uncorrelated covariance matrix, Q and R become diagonal matrices. In our experiment, F, G and H Fig. 2. Schematic representation of sKNNkF. was set to identity matrix hence prediction model becomes xt = xt−1+vt. Q and R were determined by using validation dataset. The state, xt, we should estimate is true skeleton pose and the dimension is 48 as mentioned earlier. Our contribution is to replace the measurement vector, zt, with the improved body joints, ˜z, and the input control vector, vt, with the enhanced velocities, ˜vt. Therefore, the j-th row and j-th column of R and Q are determined by computing (zj,M i −˜zj i ) and (vj,M i −˜vj i ), respectively. In our methods, x0 was set to ˜z0. C. 
Integrating based on combination of Soft-KNN and Kalman Filtering The last method is to combine soft-KNN and Kalman Filtering methods described above. We will refer the method to sKNNkF in the following. Figure 2 shows schematic diagram of sKNNkF. Let S+ = { (x1, xM 1 ),. . ., (xN, xM N ) } be a set of N input-output training points, where x is estimated by Kalman filtering and xM is corresponding skeleton pose captured from motion capture system. For a novel pattern xt at time t, the soft-KNN regression computes the mean of the target values of its ˜K-nearest neighbors. The j-th component of the skeleton pose generated by soft-KNN is defined by ˆz+,j t = 1 ˜K X i∈NK(xt) p+ j (|v+,j i,t −˜v+,j t |)>θ+ xj,M i (8) Here, v+,j i,t is velocity of the j-th component of the i-th training data, which is defined by v+,j i,t = xj i −ˆz+,j t−1 (9) The probability distribution for the j-th component in Equation (8) is zero mean Gaussian and is estimated during training phase. The mean and variance are estimated by computing |vj,M i −˜v+,j i | on all validation dataset. In this work, K is set to 300 and θ+ is 0.05. Here, ˜v+,j i denotes the improved velocity obtained using another deep recurrent neural network. We call the network as vDRNN+. Input training data for vDRNN+ is velocity of estimated skeleton pose in Kalman filtering step, which is defined by v+ t = xt− xt−1. We denote the output for vDRNN+ as ˜v+. The network has identical structure with vDRNN and the temporal length of training data is also same. V. A NOVEL MEASURE FOR EVALUATING HUMAN-LIKE NATURAL MOVEMENT As mentioned earlier, our goal is to propose a skeleton tracking method, in which captured body joint trajectories should be human-like natural movement. Most popular mea- sure to evaluate quality of tracked skeleton pose is average position error (APE). If APE of a sequence of 3D positions is less than 1mm, the estimated trajectory can be considered as human-like movement. 
In fact, this condition extremely difficult to meet. However, we found that if APEs of two skeleton trajectories are 3cm and 4cm, respectively, in that case we cannot be confident that which is better movement. Suppose that two tracked trajectories. The former is a joint trajectory that has a large number of small vibrated motions. In contrast, the latter trajectory consists of natural movements but the orientation of the tracked body center is little bit different to that of the ground truth body center. In this case, APE of the latter is often larger than that of the former. Therefore, an investigation for a novel measure to assess human-like natural movement is required. Flash and Hogan have proposed that the human motor system minimizes jerk [19]. Jerk is the 3rd derivative of the position trajectory. In this sense, some researchers have developed human motion prediction techniques based on the minimum jerk model [20], [21]. However, the minimum jerk model assumption fails if the human decides to change the course of the trajectory during performing activity. We also found jerks of some actions such as, kicking or punching are not low. Hence, we define jerk error (JE) of j-th component of tracked skeleton at time t as JE = |jj t −jj,M t | (10) where jj,M t is jerk of the trajectory captured by motion capture system. We argue that average jerk error (AJE) can evaluate naturalness of captured motions in terms of vibrated and discontinuous motions. However, AJE only cannot eval- uate the quality of tracking appropriately. Suppose that an extreme case. If one activity is standing and the other is sitting the jerks of two activities are identical. Hence, in our experiment, we consider APE as well as AJE. VI. EXPERIMENTS A. Experimental Setup We implemented the algorithm proposed in this paper using MATLAB and the Microsoft Kinect SDK 2.0 on Window 8 OS. All experimental tests were run on a PC with an Intel Core i5 1.8GHz processor and 4GB RAM. 
The Microsoft Kinect SDK 2.0 can extract skeleton data at approximately 30 frames per second (fps). For supervised learning and evaluation, we employed an OptiTrack motion capture system to provide a set of ground truth trajectories. Kinect sensor and motion capture system tracked skeleton poses simultaneously with recoding capturing time hence we can construct sets of input and target data pairs. Kinect sensor and the motion capture system extrinsically calibrated using least-squares solution. We collected training, validation and test dataset. The training and validation dataset is composed of free move- ments human can do. Validation dataset was employed to decide structure of DRNNs such as the number of layers, the number of hidden neuron size and the temporal length of training data. The variances of Gaussian distributions used in soft-KNN and the covariance matrices R and Q used in Kalman filtering were also determined using Validation dataset. The numbers of frames in training and validation dataset are 45,179 and 6,483. As mentioned earlier, the temporal lengths of training data for pDRNN, vDRNN and vDRNN+ is 7, 20 and 20, respectively. To construct dataset to train deep recurrent neural networks, we sampled sets of sequence of data with a temporal stride 1. Test dataset consists of 11 types of activity classes such as Crossing arms, Crossing arms and legs, Crossing legs, Bowing from the waist, Punching, Running, Crossing legs on the chair, Sitting on the chair, spinning, walking around and kicking. Some activities such as, Crossing arms and legs, Sitting on the chair consist of a large amount of severe self-occlusion poses while Running and Bowing from the waist include a small number of self-occlusion poses. Each activity class was repeated ten times. Every activity start with standing pose and then repeat a certain activities such as, Crossing arms Punching several times. An activity is composed of approximately 150∼250 frames. 
The total number of frames in the test dataset is 20,508. Every activity except Spinning and Walking around was performed facing the Kinect sensor. For Spinning and Walking around, the minimum and maximum orientations relative to the Kinect sensor were -90° and 90°, respectively. We did not allow the Kinect sensor to see the performer’s back because the Kinect skeleton tracker cannot distinguish front from back. The average distance from the Kinect sensor to the performer was about 3 m, and the Kinect was mounted 130 cm above the ground plane.

B. Experimental Results

We implemented three skeleton tracking techniques: (1) sKNN (integrating pDRNN and vDRNN based on soft-KNN), (2) kF (integrating pDRNN and vDRNN in a Kalman filter framework) and (3) sKNNkF (integrating pDRNN, vDRNN and vDRNN+ based on a combination of soft-KNN and Kalman filtering). For comparison, we additionally implemented six skeleton tracking techniques: (1) Kinect Skeleton, (2) pDRNN (skeleton tracking using pDRNN), (3) sKNN-pDRNN (sKNN without pDRNN), (4) sKNN-vDRNN (sKNN without vDRNN), (5) naïve-sKNN (sKNN using $\tilde{z}^j_{t-1}$ instead of $\hat{z}^j_{t-1}$ in Equation (5)) and (6) kF-pDRNN (kF without pDRNN).

sKNN-pDRNN chooses the K nearest neighbors of $z_t$ instead of $\tilde{z}_t$, and the dataset $S^{-} = \{(z_1, z^M_1), \ldots, (z_N, z^M_N)\}$ is used for training. sKNN-vDRNN reduces the number of nearest neighbors from $K$ to $\tilde{K}$ using $p^j(|v^j_{i,t} - v^j_t|)$ instead of $p^j(|v^j_{i,t} - \tilde{v}^j_t|)$, where $p^j(|v^j_{i,t} - v^j_t|)$ is estimated by computing $|v^{j,M}_i - v^j_i|$. kF-pDRNN employs $v_t$ as the control input, and Q is determined by computing $(v^{j,M}_i - v^j_i)$. We did not implement kF-vDRNN because it produces results identical to those of pDRNN. In kF-vDRNN, $v_t = z_t - z_{t-1}$ and the prediction model becomes $x_t = x_{t-1} + v_t$. If $x_0$ is set to $z_0$, then $x_1 = z_0 + (z_1 - z_0) = z_1$, and by induction $x_t = z_t$ for all $t$.

Fig. 3. The average position error (APE) for each of the 11 test activities and their average. Methods shown: Kinect Skeleton, pDRNN, sKNN-vDRNN, naïve-sKNN, sKNN-pDRNN, kF-pDRNN, sKNN, kF and sKNNkF.

Fig. 4. The average jerk error (AJE) for the same activities and methods.

Figure 3 shows the average position error (APE). The APE of the Kinect skeleton is 0.058, and pDRNN reduces it to 0.026, a considerable reduction. The APEs of kF, sKNN and sKNNkF are 0.0297, 0.0427 and 0.0454, respectively. The APE of kF is only slightly higher than that of pDRNN. In contrast, the APE of sKNN is noticeably larger than that of kF, which seems to be because sKNN estimates the current pose from a simple combination of the nearest training samples. sKNNkF performs slightly worse than sKNN, and its APE is the highest among the three proposed methods, which suggests that position error accumulates through sKNN and kF. For sKNN-pDRNN and kF-pDRNN, the APEs are even higher than that of the Kinect skeleton. We conclude that pDRNN plays an important role in reducing APE and that the additional procedures applied after the pDRNN regression increase APE.

Figure 4 shows the average jerk error (AJE). sKNN, kF and sKNNkF achieve the lowest AJEs (0.0016, 0.0016 and 0.0011, respectively). kF-pDRNN achieves an AJE similar to those of the three proposed methods, but, as shown in Figure 3, its APE is worse than that of the Kinect skeleton. We conclude that vDRNN plays an important role in reducing AJE. However, although naïve-sKNN integrates both pDRNN and vDRNN, its reduction in AJE is small.
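The argument given earlier for omitting kF-vDRNN can be checked numerically. The sketch below is only an illustration under assumptions that are not taken from the paper: it uses a textbook Kalman filter with identity state and measurement models, diagonal covariances, and hypothetical noise values `q` and `r`. Because the control input is the finite difference of the measurements themselves, the prediction already equals the measurement, the innovation vanishes, and the filter returns its input unchanged.

```python
import numpy as np

def kf_without_vdrnn(z, q=1e-3, r=1e-2):
    """Kalman filter whose control input is v_t = z_t - z_{t-1}.

    z: array of shape (frames, pose_dim) holding the measurements.
    q, r: hypothetical process and measurement noise variances.
    """
    z = np.asarray(z, dtype=float)
    x = z[0].copy()            # x_0 = z_0
    p = np.zeros_like(x)       # per-dimension state variance, assumed 0 at t = 0
    out = [x.copy()]
    for t in range(1, len(z)):
        v = z[t] - z[t - 1]                # control input from differenced measurements
        x_pred = x + v                     # prediction: x_t = x_{t-1} + v_t
        p_pred = p + q
        k = p_pred / (p_pred + r)          # Kalman gain, one scalar per dimension
        x = x_pred + k * (z[t] - x_pred)   # innovation is zero up to rounding
        p = (1.0 - k) * p_pred
        out.append(x.copy())
    return np.stack(out)
```

Whatever values are chosen for `q` and `r`, the output matches the input sequence up to floating-point rounding, which is why such a variant would add nothing over the sequence it filters.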
According to Figure 4, the AJEs of the Kinect skeleton and pDRNN are similar, at 0.0063 and 0.006, respectively. It therefore seems that pDRNN has little influence on vibrated and discontinuous movement. To investigate the effect of pDRNN on tracking human-like natural motion, we constructed an average jerk error histogram. First, we divide the entire range of jerk error into ten bins, and then the AJE for each bin is computed. For example, the last bin sums all jerk errors greater than 0.27, and the sum is divided by M, where M = (the dimension of the skeleton pose) × (the total number of frames in the test dataset). Note that jerk errors falling into the first bin usually correspond to very small vibrations and are therefore hard to perceive.

The AJE histogram for the Kinect skeleton and pDRNN is shown in Figure 5. The AJE of pDRNN is larger than that of the Kinect skeleton in the first three bins, while pDRNN achieves a smaller error than the Kinect skeleton in the remaining bins. This implies that pDRNN reduces moderate and large jerk errors. In our experiments, we found that pDRNN cannot remove small vibrations but is effective at alleviating severe discontinuity.

Fig. 5. The average jerk error histogram for Kinect skeleton and pDRNN.

Fig. 6. The average jerk error histogram for Kinect skeleton, pDRNN, kF-pDRNN, sKNN, kF and sKNNkF.

Fig. 7. The skeleton poses tracked by Kinect skeleton tracker, motion capture system and sKNNkF.

Figure 6 shows the average jerk error histogram for the Kinect skeleton, pDRNN, kF-pDRNN, sKNN, kF and sKNNkF. sKNN performs better than kF in bins 1 to 4, while kF performs better than sKNN in bins 5 to 10. sKNNkF achieves good performance over the entire range of the histogram.
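The histogram construction described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `aje_histogram` is hypothetical, the per-frame, per-dimension jerk errors are assumed to be precomputed and non-negative, and the bins are assumed to have equal width (0.03), which is inferred from the statement that ten bins are used and the last one starts at 0.27.

```python
import numpy as np

def aje_histogram(jerk_error, n_bins=10, last_edge=0.27):
    """Per-bin contribution to the average jerk error (AJE).

    jerk_error: array of shape (frames, pose_dim) with non-negative errors.
    Returns n_bins values whose sum equals the overall mean jerk error.
    """
    e = np.asarray(jerk_error, dtype=float).ravel()
    if e.size == 0:
        raise ValueError("jerk_error must not be empty")
    m = e.size  # M = (pose dimension) * (number of frames)
    # Assumed equal-width edges: 0, 0.03, ..., 0.27 for the default arguments.
    edges = np.linspace(0.0, last_edge, n_bins)
    # Bin index 0..n_bins-1; values at or above last_edge go to the last bin.
    idx = np.digitize(e, edges[1:])
    sums = np.bincount(idx, weights=e, minlength=n_bins)
    return sums / m
```

Because every bin is divided by the same M, the bin values add up to the overall AJE, so the histogram shows how much of the total error comes from small, moderate and large jerk errors.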
kF-pDRNN does not seem effective at reducing moderate and large jerk errors. The three proposed methods reduce unnatural movements ranging from small vibrations to severe discontinuity.

For qualitative comparison, the tracking results produced by the Kinect skeleton tracker, the motion capture system and sKNNkF are displayed in Figure 7. The seven images in each row were chosen during the Crossing arms and legs, Crossing legs on the chair, Spinning, Walking around, Crossing arms, Punching and Crossing legs activities, respectively. The three images in each column were generated by the Kinect skeleton tracker, the motion capture system and sKNNkF, respectively. The Kinect skeleton tracker produced significant errors. The pose chosen during the Crossing legs on the chair activity does not show crossed legs, and the poses selected during the Spinning and Walking around activities are quite different from the ground truth, whereas the skeletons generated by sKNNkF look similar to the ground truth and seem to reflect the natural movement of the performer.

VII. CONCLUSIONS

The goal of this paper was to propose a method that improves the Kinect skeleton, which becomes vibrated and discontinuous when self-occlusion occurs, so that it follows human-like natural motion. To this end, we first employed deep recurrent neural networks to reduce the position and velocity errors of the skeleton poses. Then, we proposed three methods to integrate the enhanced joint positions and velocities. Moreover, we suggested a novel measure to evaluate the naturalness of captured motions. We evaluated the proposed methods against the ground truth acquired from a commercial motion capture system and compared the results with those of the Kinect skeleton and of several variants of our methods. The three proposed approaches performed considerably better than the Kinect skeleton tracker, and the proposed integration methods led to further improvement over refining the Kinect skeleton data using only pDRNN.

REFERENCES
[1] J. Shotton, A.
Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," International Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[2] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by backpropagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[3] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[4] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in ICML, 2013.
[5] N. Le Roux and Y. Bengio, "Deep belief networks are compact universal approximators," Neural Computation, vol. 22, no. 8, pp. 2192-2207, 2010.
[6] O. Delalleau and Y. Bengio, "Shallow vs. deep sum-product networks," in NIPS, 2011.
[7] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012.
[8] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent Neural Network Regularization," http://arxiv.org/abs/1409.2329v5.
[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[10] S. Park and M. Trivedi, "Understanding human interactions with track and body synergies (TBS) captured from multiple views," Comput. Vis. Image Understand., vol. 111, no. 1, pp. 2-20, 2008.
[11] J. Ziegler, K. Nickel, and R. Stiefelhagen, "Tracking of the articulated upper body on multi-view stereo image sequences," in Proc. Comput. Vis. Pattern Recognit., 2006.
[12] M. Hofmann and D. Gavrila, "Multi-view 3D Human Pose Estimation in Complex Environment," International Journal of Computer Vision, pp. 1-22, 2011.
[13] A. Baak, M. Muller, G. Bharaj, H.-P. Seidel, and C. Theobalt, "A data-driven approach for real-time full body pose reconstruction from a depth camera," in ICCV, pp. 1092-1099, Nov. 2011.
[14] Q. Zhang, X. Song, X. Shao, R. Shibasaki, H.
Zhao, "Unsupervised skeleton extraction and motion capture from 3D deformable matching," Neurocomputing, Elsevier, pp. 170-182, 2013.
[15] L. Zhang, J. Sturm, D. Cremers, and D. Lee, "Real-Time Human Motion Tracking using Multiple Depth Cameras," Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012.
[16] Y. Liu, J. Gall, C. Stoll, Q. Dai, H.-P. Seidel, and C. Theobalt, "Markerless Motion Capture of Multiple Characters Using Multi-view Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2720-2735, 2013.
[17] J.-T. Masse, F. Lerasle, M. Devy, A. Monin, O. Lefebvre, and S. Mas, "Human Motion Capture Using Data Fusion of Multiple Skeleton Data," ACIVS, volume 8192 of Lecture Notes in Computer Science, pp. 126-137, Springer, 2013.
[18] K.Y. Yeung, T.H. Kwok, and C.L. Wang, "Improved Skeleton Tracking by Duplex Kinects: A Practical Approach for Real-Time Applications," Journal of Computing and Information Science in Engineering, vol. 13, no. 4, pp. 1-10, 2013.
[19] T. Flash and N. Hogan, "The coordination of arm movements: An experimentally confirmed mathematical model," The Journal of Neuroscience, vol. 5, no. 7, pp. 1688-1703, 1985.
[20] A. Thobbi, Y. Gu, and W. Sheng, "Using human motion estimation for human-robot cooperative manipulation," IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.
[21] B. Corteville, E. Aertbelien, H. Bruyninckx, J. De Schutter, and H. Van Brussel, "Human-inspired robot assistant for fast point-to-point movements," in IEEE International Conference on Robotics and Automation, 2007.