Parallel Dynamics Computation using Prefix Sum Operations Yajue Yang Yuanqing Wu Jia Pan Abstract— We propose a new parallel framework for fast computation of inverse and forward dynamics of articulated robots based on prefix sums (scans). We re-investigate the well- known recursive Newton-Euler formulation of robot dynamics and show that the forward-backward propagation process for robot inverse dynamics is equivalent to two scan operations on certain semigroups. We show that the state-of-the-art forward dynamics algorithms may almost completely be cast into a sequence of scan operations, with unscannable parts clearly identified. This suggests a serial-parallel hybrid approach for systems with a moderate number of links. We implement our scan based algorithms on Nvidia CUDA platform with per- formance compared with multithreading CPU-based recursive algorithms; a significant level of acceleration is demonstrated. I. INTRODUCTION Recent developments in humanoid robot online motion planning/replanning [1], [2], [3], [4], [5] and model-based control optimization and learning [6], [7] have raised new computational challenges on classical articulated robot dy- namics algorithms: hundreds to thousands of groups of in- verse/forward dynamics and their derivatives under different states and control inputs must be evaluated within several milliseconds. Fortunately, such computation task is inher- ently akin to parallel computation on multiple levels, and therefore may take substantially less time than a sequential implementation. At the outset, data-independent parallelism from different states and inputs suggests a parallel computa- tion on the group level. Within individual groups, the data- dependent parallelism in inverse and forward dynamics is well known to the robotics community, thanks to decades of extensive research on adapting recursive sequential al- gorithms [8], [9], [10] to parallel algorithms for multi- processor systems [11], [12], [13], [14], [15]. Such algo- rithms are mainly intended for complex simulations with a large number (e.g., ≥1000) of links that are unusual for robotic applications. A realistic scenario for dynamics computation in articu- lated robotics may involve simultaneously solving a large number of groups (> 1000) of inverse/forward dynamics problem for a robot with a moderate number (10 ∼100) of links [4], [5], [6], [7], [16], [17]. It is arguably more suitable for a hybrid CPU/GPU implementation in light of the recent popularization of GPGPU technology [18]. Several researchers [5], [19] have identified part of the inverse dynamics as a prefix sum operation, also known as scan [20], [21], which is a useful and easy to implement building block for many parallel algorithms and is well Yajue Yang and Jia Pan are with the Department of Mechanical and Biomedical Engineering, the City University of Hong Kong; Yuanqing Wu is with the University of Bologna, Italy supported by GPU platforms like NVidia CUDA [22]. These early work does not exploit the Lie group structure within the inverse dynamics, and thus needs to compute a large set of temporary parameters in parallel to prepare the data before applying a parallel prefix sum. In fact, articulated robot kinematics and dynamics can be considered a spatial propagation problem [9], [10], [23], [24], which naturally leads to linear recursions equivalent to scan operations [20]. This implies that the inverse/forward dynamics problem may be mostly (if not entirely) modeled as a sequence of scan operations, thereby applying state-of-the-art parallel computing technology to the classic dynamics computation problem without reinventing any wheels [25]. In this paper, we will show that the parallel scan perspec- tive on robot dynamics computation can indeed be carried to a further extent: apart from the initialization and some inter- mediate variables that may be computed in parallel, both the inverse dynamics and the forward dynamics problem may be reformulated into a sequence of scan operations. The paper is organized as follows. In Section II, we give a brief review of the recursive Newton-Euler formulation for robot dynamics, following the Lie group approach formally stated in [10]. Our main contribution is presented in Section III and IV, where we formulate the recursive inverse and forward dynamics problem into a sequence of scan operations. We show that the scan operator can be considered as the binary operation of certain semigroups. Detailed algorithms of implementations are then described. In Section V, we present some experiment results of a hybrid CPU/GPU implementation of our parallel scan based algorithm and make some reasonable compar- ison with (multithreading) CPU based implementations. In particular, our method achieves a maximal 500x speedup on inverse dynamics computation, and a maximal 100x speedup on forward dynamics. Without loss of generality, we shall only consider single open chain robot in this paper. II. LIE GROUP FORMULATION OF ROBOT DYNAMICS A. Inverse dynamics A comprehensive summary of state-of-the-art robot inverse and forward dynamics algorithms is given by [24]. The inverse dynamics problem of deriving necessary joint torques for generating a specified robot motion is well-known to be a forward-backward two-phase propagation process [10], [23], [26]. In the forward propagation phase, the motion of each joint (with twist axis Si and joint variables qi, ˙qi and ¨qi) propagates from the base link toward the end-effector of a n-link robot, resulting in linear recursion of link velocities arXiv:1609.04493v1 [cs.RO] 15 Sep 2016 SE(3) Euclidean group of R3 se(3) Lie algebra of SE(3) fi,i+1 coordinate transformation from link frame i + 1 to i Mi initial coordinate transformation of fi−1,i Ad(·) Adjoint transformation ad(·) adjoint map Si ith joint twist S diag(S1, . . . , Sn) qi ith joint variable q (q1, . . . , qn)T Vi spatial velocity of ith link V column(V1, . . . , Vn) ˙Vi spatial acceleration of ith link ˙V column( ˙V1, . . . , ˙Vn) Fi reaction force of ith joint F column(F1, . . . , Fn) τi ith joint torque τ (τ1, . . . , τn)T Ji ith link inertia J diag(J1, . . . , Jn) ˆJi ith articulated body inertia ˆJ diag( ˆJ1, . . . , ˆJn) M(q) joint space inertia at q Rn×n nth-order matrix space GLR(n) nth-order general linear group ⊕ binary associative operator [xi] array [x1, . . . , xn] TABLE I NOMENCLATURE and accelerations for i = 1, . . . , n:        fi−1,i = MieSiqi Vi = Adf −1 i−1,i(Vi−1) + Si ˙qi ˙Vi = Si¨qi + Adf −1 i−1,i( ˙Vi−1) −adSi ˙qi Adf −1 i−1,i(Vi−1). (1) Here, we follow closely the notation of [10], [26], [27]. In the back propagation phase, the reaction force applied by each joint to the succeeding link is successively computed using Newton-Euler equation, which leads to a second set of linear recursion for joint reaction forces and joint torques for i = n, . . . , 1: ( Fi = AdT f −1 i,i+1(Fi+1) + Ji ˙Vi −adT Vi(JiVi) τi = ST i Fi. (2) The base link velocity and acceleration V0, ˙V0 and external force Fn+1 acting on the end link are provided to initiate the recursion. A sequential recursion of (1) and (2) leads to the O(n) (for n links) recursive Newton-Euler inverse dynamics algorithm. Here, emphasis should be made on the coordinate invariant Lie group description of robot kinematics and dynamics [26], [27]. We will show in Section III that the scan operands and operators defined by (1) and (2) may be considered as the elements and binary operator of certain matrix semigroup that is closely related to SE(3) (special Euclidean group of R3). This makes it possible to adapt our parallel framework to using other mathematical representations of SE(3), such as dual quaternions [28] or other Clifford algebras [29]. For the rest of the paper, we shall denote the ID function by: τ = ID(q, ˙q, ¨q, V0, ˙V0, Fn+1). (3) B. Forward dynamics The robot forward dynamics (FD) problem refers to the derivation of joint acceleration ¨q = (¨q1, . . . , ¨qn)T as a function of the states (qT , ˙qT )T , applied joint torques τ, base link velocity and acceleration V0, ˙V0, and external force Fn+1: ¨q = FD(q, ˙q, τ, V0, ˙V0, Fn+1) (4) It may be numerically integrated to compute joint position and velocity trajectories. The joint torques τ is linearly related to the joint accelerations ¨q via the joint space inertia (JSI) M(q) with a bias term τ bias accounting for Coriolis, centrifugal and external forces: ( τ =M(q)¨q + τ bias τ bias := ID(q, ˙q, 0, V0, ˙V0, Fn+1) (5) Since both the joint accelerations ¨qi’s and joint reaction forces Fi’s are unknown, the original FD problem obstructs a propagation formulation and cannot be directly solved by parallel scan operations. However, ¨q may be derived from directly inverting the JSI, ¨q = M −1(q)ˆτ := M −1(q)(τ −τ bias) (6) leading to the O(n3) JSI inversion algorithm (JSIIA) [24, ch. 6]. Depending on the methods for computing JSI, JSII may be computationally appealing for moderate number of links (n ≤200). However, it does not take advantage of the structure of JSI inherited from robot chain topology. This was emphasized in a series of papers on analytical factorization and inverse of JSI [9], [11], [17], [23], [30]. The goal of the factorization is to represent the inverse of JSI as the product of block diagonal and block upper/lower triangular matrices, or a sequence of linear recursions that leads to the computationally efficient propagation methods [24, ch. 7]. The articulated-body inertia algorithm (ABIA) proposed by Featherstone [8] is a well-known propagation method that involves a nonlinear recursion for deriving the equivalent inertias of the sub-system rooted at link i (see [27] for the notations) for i = n, . . . , 1: ˆJi = Ji + AdT f −1 i,i+1 ˆJi+1 Adf −1 i,i+1 − AdT f −1 i,i+1 ˆJi+1Si+1ST i+1 ˆJi+1 Adf −1 i,i+1 ST i+1 ˆJi+1Si+1 (7) which is essentially a recursive elimination of the unknown joint reaction forces Fi’s using virtual work principle. The availability of ABI allows us to derive link accelerations from joint torques τi’s with a backward-forward two-phase propagation process [24], which may be summarized as follows [27]: Backward recursion (for i = n, . . . , 1)      ˆzi = Yi,i+1ˆzi+1 + Πi,i+1ˆτi+1 ci = ˆτi −ST i ˆzi ˆci = Ω−1 i ci := (ST i ˆJiSi)−1ci Forward recursion (for i = 1, . . . , n) ( λi = Y T i−1,iλi−1 + Siˆci ¨qi = ˆci −ΠT i−1,iλi−1 (8) where Yi,i+1 := AdT f −1 i,i+1 I − ˆJi+1Si+1ST i+1 ST i+1 ˆJi+1Si+1 ! and Πi,i+1 := AdT f −1 i,i+1 ˆJi+1Si+1 ST i+1 ˆJi+1Si+1 . The connection of ABI to square factorization of JSI is emphasized in [9]. All recursions involved in ABIA except that of (7) are linear and may eventually be formulated as scan operations. However, eq. (7) is a well known nonlinear recursion that does not speedup beyond a constant factor by parallel algorithms [31], [32], and therefore should not conform to a parallel scan operation. The joint reaction forces may also be eliminated by defining appropriate (not necessarily orthogonal) complements for joint constraint force, leading to the constraint force algorithm (CFA) [11]; the inverse of JSI is directly factorized into a product of block diagonal, block upper/lower bi-diagonal and block tri- diagonal matrices. Block cyclic-reduction algorithms [33], [34] may be applied to solve a block tri-diagonal system efficiently. III. PREFIX-SUM RE-FORMULATION OF ROBOT DYNAMICS A. A brief review of scan operation The all-prefix-sums (scan) operation [20] takes a binary associative operator ⊕, and an array of n elements [a0, a1, . . . , an−1] (9) and returns the array [x0, x1, . . . , xn−1] :=[a0, (a0 ⊕a1), . . . , (a0 ⊕a1 ⊕· · · ⊕an−1)] (10) It is obvious that the scan operation implies the following recursion: xi = xi−1 ⊕ai (11) Generalizations of the standard scan operation and their applicatinos are discussed in [20]. B. Synchronous forward scan of link velocities and acceler- ations Our following development is inspired by Zhang et al.’s work [19] on scanning successive multiplications of link ro- tation transformations. Its practical application in humanoid motion planning is considered in [5]. More generally, we may rewrite the second and third equation of (1) in a single linear recursion:   ˙Vi Vi 1  =   Adf −1 i−1,i −adSi ˙qi Adf −1 i−1,i Si¨qi 0 Adf −1 i−1,i Si ˙qi 0 0 1   | {z } Ai ∈GLR(13)   ˙Vi−1 Vi−1 1   A0 =   I 0 ˙V0 0 I V0 0 0 1   (12) leading to a synchronous scan of both link velocities Vi’s and accelerations ˙Vi’s. Emphasis should be made on the Lie group structure of the operands: it can be shown that the operands are isomorphic to (f −1 i−1,i, Si¨qi, Si ˙qi) ∈SE(3) × se(3)2, a Lie group with binary operation ⊕defined by: (g, ξ1, ξ2) ⊕(g′, ξ′ 1, ξ′ 2) :=(gg′, Adg(ξ′ 1) + ξ1 −adξ2(Adg(ξ′ 2)), Adg(ξ′ 2) + ξ2) (13) The identity element of this Lie group is (I, 0, 0), and the inverse of (g, ξ1, ξ2) ∈SE(3) × se(3)2 is given by: (g, ξ1, ξ2)−1 = (g−1, −Adg−1(ξ1), −Adg−1(ξ2)) (14) We may take advantage of this isomorphism by first scan- ning the isomorphic operands (f −1 i−1,i, Si¨qi, Si ˙qi)’s and then performing a parallel isomorphism to recover Vi’s and ˙Vi’s. C. Backward scan of bias force and joint torques for inverse dynamics After scanning the link velocities and accelerations, the bias term ˆFi := Ji ˙Vi −adT Vi(JiVi) in the first equation of (2) may be pre-computed in parallel before a second scan computes the linear recursion for Fi’s. Alternatively, ˆFi is quadratic in Vi and linear in ˙Vi, and may be syn- chronously computed along with the scan of (12). More- over, it can be proved that ˆFi involves only 9 out of the 21 quadratic terms of Vi = (vT i , wT i )T , namely wijwik for 1 ≤j ≤k ≤3 and wi = (wi1, wi2, wi3)T , and wi × vi ∈R3. For convenience, we define Qi = ((wi × vi)T , w2 i1, wi1wi2, wi1wi3, w2 i2, wi2wi3, w2 i3)T ∈R9. A syn- chronous scan may be expressed as follows:   ˙Vi Qi Vi ˆFi 1   =   ∗ 0 ∗ 0 ∗ 0 ∗ ∗ 0 ∗ 0 0 ∗ 0 ∗ ∗ ∗ ∗ 0 ∗ 0 0 0 0 1   | {z } Ai ∈R28×28   ˙Vi−1 Qi−1 Vi−1 ˆFi−1 1   F0 = 0 (15) where only the block pattern of the scan operand Ai is shown. A similar simplification of the operand by iso- morphism may be further investigated. After scanning the forward propagation (15), we proceed with the scan of the backward propagation (2) using the following linear recursion:   Fi τi+1 1  =   AdT f −1 i,i+1 0 ˆFi ST i+1 0 0 0 0 1   | {z } Ai ∈R8×8   Fi+1 τi+2 1   F0 = τn+1 = τn+2 = 0 (16) Therefore, we have shown that the ID problem may be completely solved by two scanning operations. D. Accelerating JSIIA using scan operations Following the discussion about (6) in Section II-B, we may accelerate the JSIIA by parallelizing data-independent com- putation of the columns M·,j(q) of the JSI M(q) (see [24, Ch. 6] for more details) using our parallel ID algorithm: M·,j(q) =M(q)δ·,j = ID(q, ˙q, δ·,j, V0, ˙V0, Fn+1) −ID(q, ˙q, 0, V0, ˙V0, Fn+1) = ID(q, 0, δ·,j, 0, 0, 0) δ·,j :=(δ1,j, · · · , δn,j)T j = 1, · · · , n (17) where δi,j denotes the Kronecker delta function. The joint accelerations ¨q may then be evaluated from (6) using state- of-the-art parallel linear system solvers, such as parallel Cholesky decomposition [35]. Consequently, a total of n+1 parallel IDs (including the one for computing τ bias) are computed in parallel for the JSIIA. E. Accelerating ABIA using scan operations The details of square factorization of JSI and its analytical inversion may be found in [27]. Following our discussion in Section II-B, the two-phase propagation summarized in (8) may be reformulated into a backforward scan:   ˆzi ˆci+1 1  =   Yi,i+1 0 Πi,i+1ˆτi+1 −Ω−1 i+1ST i+1 0 Ωi+1ˆτi+1 0 0 1   | {z } Ai ∈R8×8   ˆzi+1 ˆci+2 1   ˆzn+1 = ˆcn+1 = ˆcn+2 = 0 (18) followed by a forward scan:   λi ¨qi 1  =   Y T i−1,i 0 Siˆci −ΠT i−1,i 0 ˆci 0 0 1   | {z } Ai ∈R8×8   λi−1 ¨qi−1 1   λ0 = ¨q0 = 0 (19) respectively. Therefore, the complete ABIA comprises two scan operations with operands Ai’s given in (15),(16) for the computation of bias torques τ bias i ’s, a nonlinear recursion (7) for the computation of the ABI, and then finally two scan operations with operands Ai’s given in (18),(19) for the computation of joint accelerations ¨qi’s. We may also combine the two backward scans pertaining to (16) and (18) into a single backward scan, with the corresponding linear recursion given by:   Fi−1 ˆτi ˆzi ˆci+1 1   =   AdT f −1 i−1,i 0 0 0 ˆFi−1 −ST i 0 0 0 τ in i 0 Πi,i+1 Yi,i+1 0 0 0 Ωi+1 −Ω−1 i+1ST i+1 0 0 0 0 0 0 1   | {z } Ai ∈R15×15 ×   Fi ˆτi+1 ˆzi+1 ˆci+2 1   (20) thereby reducing the hybrid ABIA to one nonlinear recursion to be executed on the CPU, and a sequence of three parallel scans to be executed on the GPU. IV. IMPLEMENTATION So far, we have presented a complete parallel framework for utilizing parallel scans to accelerate articulated robot ID and FD computation. In comparison to the classic recursive algorithms [27], we have eliminated a large portion of inter- mediate variables by assembling multiple linear recursions to a minimal number of large dimension linear recursions (two for ID, and three for ABIA). However, such mathematically compact form does not necessarily lead to an optimal data structure for implementation on a particular hardware plat- form. In this section, we shall materialize the corresponding parallel algorithms on a hybrid CPU-GPU platform, and illustrate how hardware constraints may weigh in to tailor our algorithms with the flexibility of composing/decomposing linear recursions. We adopt Nvidia CUDA as our software to make our implementation may be easily reproducible for further study and comparison. A. Parallel Inverse Dynamics Algorithm In Section II-A, we proposed a forward-backward double scan parallelization of the recursive ID algorithm [10], where the forward scan computes all link velocities and accelera- tions (along with the bias force) and the backward scan other for computing bias forces. In the actual GPU implementation, we split the scan for velocities and accelerations into two successive scans so that the scan data may by fitted into the GPU’s highspeed shared memory and local registers. The computation of bias forces ˆFi’s becomes perfectly parallel after the velocity and acceleration scans, may simply be computed on n GPU threads, with n being the number of links. The details of our parallel ID algorithm are illustrated in Algorithm 1. Algorithm 1 Parallel Inverse Dynamics CALCINVDYN([Mi], [qi], [ ˙qi], [¨qi]) Parallel Compute: Compute adjacent transformations in parallel. 1: [fi−1,i] ←CALCTRANSFORM([Mi], [Si], [qi]) Forward Scan: Compute body velocities. 2: [Vi] ←INCLUSIVEVELSCAN([fi−1,i], [Si], [ ˙qi]) Forward Scan: Compute body accelerations. 3: [ ˙Vi] ←INCLUSIVEACCSCAN([fi−1,i], [Si], [¨qi], [Vi]) Backward Scan: Compute body forces. 4: [Fi] ←INCLUSIVEFORCESCAN([fi−1,i], [Vi], [ ˙Vi]) Parallel Compute: Compute joint torques. 5: [τi] ←CALCTORQUE([Si], [Fi]) B. Forward Dynamics Algorithm As shown in Section III, the parallelization of FD algo- rithms such as JSIIA and ABIA relies on recognizing the parallel ID algorithm as a common subroutine to compute the bias torques and/or the inertia matrix. Although, neither FD algorithms are completely scannable, the JSIIA may be alternatively accelerated by parallelly solving a positive definite linear system of equations. The ABIA, on the other hand, relies on CPU recursion to compute the ABI, and may be accelerated by careful scheduling of the hybrid CPU/GPU system. 1) Parallel JSIIA: Given a robot with n links, JSIIA calls the inverse dynamics routine n+1 times with different inputs. The first call computes the bias torques τ bias i ’s by setting zero-acceleration input, i.e., ¨qi = 0, i = 1, . . . , n. The other n inverse dynamics calls compute the n columns of the JSI M(q) by setting ˙q = 0 and ¨q = δ·,j, j = 1, . . . , n. All n + 1 inverse dynamics are data independent, and thus may be solved simultaneously on GPUs. Besides, the additional matrix operations involved in the JSIIA can also be computed in parallel on GPUs for further acceleration. The details of our parallel JSIIA are illustrated in Algorithm 2. Algorithm 2 Parallel JSIIA CALCFWDDYN JSIIA([τ in i ], [qi], [ ˙qi]) Parallel Compute: Compute the bias torques and the joint space inertia matrix. 1: ([τ bias i ], M·,1,...,n) ←(CALCINVDYN([qi], [ ˙qi], [0]), CALCINVDYN([qi], [0], [δi,1]), ..., CALCINVDYN([qi], [0], [δi,n]) Parallel Compute: Compute differential torques. 2: [τ diff i ] ←CALCDIFFTORQUES([τ in i ], [τ bias i ]) Parallel Compute: Compute the inverse of the joint space inertia matrix. 3: M −1 ←CALCMATRIXINV(M) Parallel Compute: Compute joint accelerations. 4: [¨qout i ] ←CALCACC(M −1, [τ diff i ]) 2) Hybrid ABIA Algorithm: In comparison to JSIIA, ABIA only leverages the inverse dynamics routine once for bias torque computation. The ABI ˆJi’s is recursively computed using (7) on the CPU and sent back to the GPUs for subsequent forward dynamics computation. The high latency of sending data from CPU main memory to the GPU video memory is partially hidden through asynchronous computation of ABIs and bias torques. All remaining com- ponents can be performed in parallel on GPUs. This includes the computation for several intermediate variables, which are either perfectly parallel (line 4, 6, 8 in Algorithm 3) or scannable (line 5, 7 in Algorithm 3). The complete descrip- tion of the hybrid ABIA algorithm is shown in Algorithm 3. Algorithm 3 Hybrid ABIA Algorithm CALCFWDDYN ABIA([τini], [qi], [ ˙qi]) Asynchronous Compute: 1: Compute bias torques. [τ bias i ] ←CALCINVDYN([qi], [ ˙qi], [0]) 2: Compute differential torques. [τ diff i ] ←CALCDIFFTORQUES([τ in i ], [τ bias i ]) 3: Compute articulated body inertia on the CPU. [ ˆJi] ←CALCABI([Ji], [fi−1,i], [Si]) Stream Parallel Compute: Parallel compute intermediate variables on the GPU. 4: ([Ωi], [Πi,i+1], [Yi,i+1]) ←CALCINTERINTVAR([ ˆJi], [fi−1,i], [Si]) Backward Scan: Compute intermediate variables [ˆzi]. 5: [ˆzi] ←INCLUSIVEZHATSCAN([Yi,i+1], [Πi,i+1], [τ diff i ]) Parallel Compute: Compute intermediate variables [ˆci]. 6: [ˆci] ←CALCCHAT([ˆzi], [Si], [τ diff i ]) Forward Scan: Compute intermediate variables [(λi)]. 7: [λi] ←INCLUSIVELAMBDASCAN([Yi,i+1], [Si], [ˆci]) Parallel Compute: Compute joint accelerations. 8: [¨qout i ] ←CALCACC([λi], [ˆci], [Πi,i+1]) V. EXPERIMENTS In this section, we evaluate the performance of our GPU- based parallel dynamics algorithms by comparing their run- ning time to that of their multithreading CPU counterparts on two experiments. The first experiment investigates the speedup and scalability with respect to the number of links of a single robot. The second one is used to demonstrate the efficacy of our methods on a large group of robots with moderate number of links, which poses as a bottleneck in many applications such as model-based control optimization. The robot configuration parameters such as twists are initial- ized randomly before the dynamics computation in all our experiments. Each experiment will be repeated 1000 times with ran- domized joint inputs, and the average computation time is then reported. All running times are recorded on a desktop workstation with an 8-core Genuine Intel i7-6700 CPU and 15.6 GB memory. We implemented our GPU-based parallel dynamics algorithms using CUDA on a Tesla K40c GPU with a 11520 MB video memory and 288GB/sec memory bandwidth. CPU multithreading is used to parallelize the large number of independent dynamics computations in the second set of experiments, where the number of threads is equal to the number of independent dynamics computations. A. Experiments with Different Link Numbers We first compare the GPU and CPU’s performance of computing a single inverse dynamics for robots with different number of links, and the result is shown in Figure 1(a). We can observe that when the number of links increases, the serial CPU’s computation time increases linearly, while the GPU’s time cost roughly increases at the log(n) rate. Next, we compare the time cost of a single forward dynamics call for robots with different number of links. We investigate four different implementations of the forward dynamics computation, including GPU parallel JSIIA, CPU- GPU hybrid ABIA, serial JSIIA and serial ABIA. According to the result shown in Figure 1(b), the serial JSIIA has the worst performance and is dominated by other three methods. The GPU-based parallel JSIIA is the fastest for robots with moderate number of links, but it is outperformed by hybrid ABIA and the serial ABIA when the link the number is more than 100 links or 190 links respectively. This is because the JSIIA involves the inversion of inertia matrix. For robots with a large number of links, the inertia matrix inversion is more computationally expensive than any components in the ABIA algorithm, and may involve expensive data exchange between global memory and local memory. Our hybrid ABIA algorithm will be the most efficient for robots with many links, thanks to the asynchronous data exchange which hides the latency of sending data from CPUs to GPUs. B. Experiments with Different Group Numbers We now compare the performance while performing a large number of independent dynamics computations be- tween GPU-based parallel algorithms and multithreading CPU algorithms. In the CPU implementation, we allocate one thread for each independent dynamics computation. The comparison result for inverse dynamics implementa- tions are shown in Figure 2(a) and Figure 2(b) for robots with 10 links and 100 links respectively. We see that the running time of the CPU inverse dynamics increases linearly with the group number. In contrast, the computation time of the GPU inverse dynamics increases much slower. Due to the limited memory and necessary thread management of the GPU, the GPU inverse dynamics does not reach the ideal O(log(n)) time complexity. Nevertheless, it provides a satisfactory ∼100x speedup when the group number is 1000. Next, four different approaches for computing the forward dynamics are compared, namely GPU-based parallel JSIIA Link Number 10 1 10 2 10 3 10 4 10 5 Computation Time (ms) 10 -1 10 0 10 1 10 2 10 3 10 4 Parallel ID Serial ID (a) Comparison of ID Link Number 10 1 10 2 10 3 Computation Time (ms) 10 -2 10 0 10 2 10 4 10 6 Hybrid ABI Parallel JSI Serial ABI Serial JSI (b) Comparison of FD Fig. 1. Computation time comparison of different algorithms on a single inverse / forward dynamics call for robots with different number of links and ABIA, CPU multithreading JSIIA and ABIA, and the results are shown in Figure 3. From Figure 3(a), we can observe that the GPU-based JSIIA is always the most effi- cient one. When the link number is 200, the result is quite different as shown in Figure 3(b): the hybrid ABIA always outperforms the other three methods, and the multithreading JSIIA is the slowest approach. This phenomenon is consistent with the result in Figure 1(b) where the GPU-based JSIIA and the hybrid ABIA have the shortest running time when the link number is 10 and 200 respectively. The reason is because when the link number is small (e.g., 10), the inertia matrix is small enough to be fit in the high-speed memory of the GPUs for high performance; and we can further accelerate the dynamics computation by parallelizing the matrix inversions in independent dynamics calls. When the link number is large (e.g., 200), the inertia matrix becomes so large that the running time for a single matrix inversion is significantly more expensive than other components of the dynamics computation. Even worse, due to the limited GPU memories, we have to perform these time-consuming matrix Group Number 10 0 10 1 10 2 10 3 Computation Time (ms) 10 -1 10 0 10 1 10 2 10 3 Parallel ID for 10 Links Serial ID for 10 Links (a) Comparison for robots with 10 links Group Number 10 0 10 1 10 2 10 3 Computation Time (ms) 10 -1 10 0 10 1 10 2 10 3 10 4 Parallel ID for 100 Links Serial ID for 100 Links (b) Comparison for robots with 100 links Fig. 2. Computation time comparison of different algorithms on many independent inverse dynamics call inversions in serial and can not leverage GPUs’ parallel mechanism. VI. CONCLUSION AND FUTURE WORK In this paper, we presented a parallel framework for fast computation of inverse and forward dynamics of articulated robots using parallel scan operations, and reported details of its implementation on a hybrid CPU-GPU platform. Our contributions are summarized as follows: 1) We reformulated the recursive Newton-Euler inverse dynamics algorithm into a forward-backward two-phase parallel scan operation, which is based on the utilization of several generalizations of the standard scan operation. 2) We reformulated the joint space inertia inversion algo- rithm (JSIIA) into n + 1 data-independent scan oper- ations along with a positive definite matrix inversion operation. 3) We reformulated the complete articulated-body iner- tia algorithm (ABIA) into a nonlinear recursion and a forward-backward-forward three-phase parallel scan operation. Group Number 10 0 10 1 10 2 10 3 Computation Time (ms) 10 -1 10 0 10 1 10 2 10 3 10 4 Hybrid ABI for 10 Links Parallel JSI for 10 Links Serial ABI for 10 Links Serial JSI for 10 Links (a) Comparison for robots with 10 links Group Number 10 0 10 1 10 2 Computation Time (ms) 10 1 10 2 10 3 10 4 10 5 Hybrid ABI for 200 Links Parallel JSI for 200 Links Serial ABI for 200 Links Serial JSI for 200 Links (b) Comparison for robots with 200 links Fig. 3. Computation time comparison of different algorithms on many independent forward dynamics call 4) We implemented the aforesaid algorithms on a hybrid CPU-GPU platform with extensive use of Nvidia CUDA standard libraries. This makes our experiment results easily reproducible for further study and comparison. Besides, since our parallel framework is based on a coordinate-free Lie group / semigroup formulation and is not representation-specific, the corresponding parallel algorithms apply equally well to other representations than matrices, such as dual quaternions. Our future work shall involve expansion of our parallel framework to include parallel scan of gradient and Hessian of inverse dynamics [2], and other parallel FD algorithms such as CFA [11], and eventually an automatic scheduler for hardware-specific optimal performance. ACKNOWLEDGMENTS This work is partially supported by NVidia Corp. and Hong Kong GRF 17204115. Yuanqing Wu is supported by the PRIN 2012 grant No. 20124SMZ88 and the MANET FP7-PEOPLE-ITN grant No. 607643. REFERENCES [1] K. Yamane and Y. Nakamura, “Dynamics filter-concept and im- plementation of online motion generator for human figures,” IEEE transactions on robotics and automation, vol. 19, no. 3, pp. 421–432, 2003. [2] S.-H. Lee, J. Kim, F. C. Park, M. Kim, and J. E. Bobrow, “Newton- type algorithms for dynamics-based robot movement optimization,” IEEE Transactions on robotics, vol. 21, no. 4, pp. 657–667, 2005. [3] M. Chitchian, A. S. van Amesfoort, A. Simonetto, T. Keviczky, and H. J. Sips, “Particle filters on multi-core processors,” Dept. Comput. Sci., Delft Univ. Technology, Delft, The Netherlands, Tech. Rep. PDS- 2012-001, 2012. [4] S. Lengagne, J. Vaillant, E. Yoshida, and A. Kheddar, “Generation of whole-body optimal dynamic multi-contact motions,” The Interna- tional Journal of Robotics Research, vol. 32, no. 9-10, pp. 1104–1119, 2013. [5] B. Chretien, A. Escande, and A. Kheddar, “GPU robot motion planning using semi-infinite nonlinear programming,” IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1–1, 2016. [6] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033. [7] T. Erez and E. Todorov, “Trajectory optimization for domains with contacts using inverse dynamics,” in IEEE/RSJ International Confer- ence on Intelligent Robots and Systems, 2012, pp. 4914–4919. [8] R. Featherstone, “The calculation of robot dynamics using articulated- body inertias,” The International Journal of Robotics Research, vol. 2, no. 1, pp. 13–30, 1983. [9] G. Rodriguez, A. Jain, and K. Kreutz-Delgado, “A spatial operator algebra for manipulator modeling and control,” The International Journal of Robotics Research, vol. 10, no. 4, pp. 371–381, 1991. [10] F. C. Park, J. E. Bobrow, and S. R. Ploen, “A Lie group formulation of robot dynamics,” The International Journal of Robotics Research, vol. 14, no. 6, pp. 609–618, 1995. [11] A. Fijany, I. Sharf, and G. M. D’Eleuterio, “Parallel O(log n) al- gorithms for computation of manipulator forward dynamics,” IEEE Transactions on Robotics and Automation, vol. 11, no. 3, pp. 389– 400, 1995. [12] R. Featherstone, “A divide-and-conquer articulated-body algorithm for parallel O(log(n)) calculation of rigid-body dynamics. part 1: Basic algorithm,” The International Journal of Robotics Research, vol. 18, no. 9, pp. 867–875, 1999. [13] ——, “A divide-and-conquer articulated-body algorithm for paral- lel O(log(n)) calculation of rigid-body dynamics. part 2: Trees, loops, and accuracy,” The International Journal of Robotics Research, vol. 18, no. 9, pp. 876–892, 1999. [14] K. S. Anderson and S. Duan, “Highly parallelizable low-order dy- namics simulation algorithm for multi-rigid-body systems,” Journal of Guidance, Control, and Dynamics, vol. 23, no. 2, pp. 355–364, 2000. [15] K. Yamane and Y. Nakamura, “Automatic scheduling for parallel forward dynamics computation of open kinematic chains.” in Robotics: Science and Systems. Citeseer, 2007, pp. 193–199. [16] J. J. Laflin, K. S. Anderson, and M. Hans, “Enhancing the performance of the DCA when forming and solving the equations of motion for multibody systems,” in Multibody Dynamics. Springer, 2016, pp. 19–31. [17] A. Fijany and R. Featherstone, “A new factorization of the mass matrix for optimal serial and parallel calculation of multibody dynamics,” Multibody System Dynamics, vol. 29, no. 2, pp. 169–187, 2013. [18] J. Nickolls and W. J. Dally, “The GPU computing era,” IEEE Micro, vol. 30, no. 2, pp. 56–69, March 2010. [19] J. J. Zhang, Y. F. Lu, and B. Wang, “A nonrecursive newton- euler formulation for the parallel computation of manipulator inverse dynamics,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 28, no. 3, pp. 467–471, Aug 1998. [20] G. E. Blelloch, “Prefix sums and their applications,” 1990. [21] M. Harris, S. Sengupta, and J. D. Owens, “Parallel prefix sum (scan) with CUDA,” GPU gems, vol. 3, no. 39, pp. 851–876, 2007. [22] N. Bell and J. Hoberock, “Thrust: A productivity-oriented library for CUDA,” GPU computing gems Jade edition, vol. 2, pp. 359–371, 2011. [23] G. Rodriguez, “Kalman filtering, smoothing, and recursive robot arm forward and inverse dynamics,” IEEE Journal on Robotics and Automation, vol. 3, no. 6, pp. 624–639, 1987. [24] R. Featherstone, Rigid body dynamics algorithms. Springer, 2014. [25] D. Negrut, R. Serban, H. Mazhar, and T. Heyn, “Parallel computing in multibody system dynamics: why, when, and how,” Journal of Computational and Nonlinear Dynamics, vol. 9, no. 4, p. 041007, 2014. [26] R. M. Murray, Z. Li, S. S. Sastry, and S. S. Sastry, A mathematical introduction to robotic manipulation. CRC press, 1994. [27] S. R. Ploen and F. C. Park, “Coordinate-invariant algorithms for robot dynamics,” IEEE Transactions on Robotics and Automation, vol. 15, no. 6, pp. 1130–1135, 1999. [28] J. Dooley and J. M. McCarthy, “Spatial rigid body dynamics using dual quaternion components,” in Robotics and Automation, 1991. Proceedings., 1991 IEEE International Conference on. IEEE, 1991, pp. 90–95. [29] J. Selig and E. Bayro-Corrochano, “Rigid body dynamics using clifford algebra,” Advances in applied Clifford algebras, vol. 20, no. 1, pp. 141–154, 2010. [30] A. Jain, “Unified formulation of dynamics for serial rigid multibody systems,” Journal of Guidance, Control, and Dynamics, vol. 14, no. 3, pp. 531–542, 1991. [31] J. Mikloˇsko and V. E. Kotov, “Complexity of parallel algorithms,” in Algorithms, software and hardware of parallel computers. Springer, 1984, pp. 45–63. [32] L. Hyafil and H.-T. Kung, “The complexity of parallel evaluation of linear recurrences,” Journal of the ACM (JACM), vol. 24, no. 3, pp. 513–521, 1977. [33] D. Heller, “Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems,” SIAM Journal on Numerical Analysis, vol. 13, no. 4, pp. 484–496, 1976. [34] R. A. Sweet, “A cyclic reduction algorithm for solving block tridi- agonal systems of arbitrary dimension,” SIAM Journal on Numerical Analysis, vol. 14, no. 4, pp. 706–720, 1977. [35] V. Volkov and J. Demmel, “Lu, QR and Cholesky factorizations using vector capabilities of GPUs,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-49, May, pp. 2008– 49, 2008.