A Novel GPU-based Parallel Implementation Scheme and Performance Analysis of Robot Forward Dynamics Algorithms Yajue Yang, Yuanqing Wu and Jia Pan Abstract— We propose a novel unifying scheme for parallel implementation of articulated robot dynamics algorithms. It is based on a unified Lie group notation for deriving the equations of motion of articulated robots, where various well- known forward algorithms differ only by their joint inertia matrix inversion strategies. This new scheme leads to a unified abstraction of state-of-the-art forward dynamics algorithms into combinations of block bi-diagonal and/or block tri-diagonal systems, which may be efficiently solved by parallel all-prefix- sum operations (scan) and parallel odd-even elimination (OEE) respectively. We implement the proposed scheme on a Nvidia CUDA GPU platform for the comparative study of three algo- rithms, namely the hybrid articulated-body inertia algorithm (ABIA), the parallel joint space inertia inversion algorithm (JSIIA) and the constrained force algorithm (CFA), and the performances are analyzed. I. INTRODUCTION In recent years, the robotics community has witnessed increasingly wider applications of CPU-GPU heterogeneous computation for its remarkable number crunching and also task/data parallelism capabilities [1]. In particular, the task of solving substantial groups of inverse/forward dynamics problems for articulated robots and humanoid robots with a moderate number of links [2]–[6] may now be com- pletely run on GPU platforms instead of traditional multi- processor systems. This, in turn, opens up new research on the classic inverse/forward dynamics algorithms. It should be emphasized that the aim of such research is not to develop essentially new algorithms under the new GPU setting, but to determine the compatibility with the GPU platform of state-of-the-art robot dynamics algorithms [7]–[11], some of which are already adapted to parallelized multi-processor systems. In our recent study [12], we have identified two major problems when developing and comparing robot forward dynamics algorithms on the GPU platform. First, different algorithms often come with different mathematical and no- tation conventions, which makes their adaptation to the GPU platform laborious and error prone. Therefore, it is preferable to select a concise and intuitive notation, to which all for- ward dynamics algorithms can be easily translated. Second, the earlier parallel adaptation of robot forward dynamics algorithms often relies on ad hoc parallel implementation, Yajue Yang and Jia Pan are with the Department of Mechanical and Biomedical Engineering, City University of Hong Kong, Hong Kong. Yuanqing Wu is with the Department of Industrial Engineering (DIN) University of Bologna, Italy. which makes it difficult to conduct an efficient perfor- mance comparison between different algorithms. Therefore, it is preferable to have a consistent implementation scheme of different algorithms using some well-developed generic building blocks to avoid reinventing the wheels [13]. In this paper, we propose a novel GPU-based parallel implementation scheme and performance analysis of robot forward dynamics algorithms. First, we advocate for the use of a de facto coordinate-free Lie group notation for robot dynamics algorithms [11]. The process of translating different forward dynamics algorithms to this notation turns out to be enlightening rather than cumbersome. Second, we focus on the parallel implementation of two building blocks, namely, block bi-diagonal and block tri-diagonal systems, which are common to various robot inverse and forward dynamics algorithms. The former may be efficiently solved by the all-prefix-sum (scan) operation and the latter may be efficiently solved by the odd-even elimination (OEE) algorithm. Using the proposed scheme, we compare the running performance of three representative robot forward dynamics algorithms on a Nvidia CUDA GPU platform: the hybrid articulated-body inertia algorithm (ABIA) based on Featherstone’s pioneering work [7], the parallel joint space inertia inversion algorithm (JSIIA), and the constrained force algorithm (CFA) proposed by Fijany [9]. A. Brief review of robot forward dynamic algorithms Fast computation of robot motion trajectories with given state and inputs is of vital importance to not only motion simulation of robotic systems with a large number of bod- ies [14], but also online trajectory planning [2], [3], [15]–[17] and model-based control optimization [4], [18]. The main challenge to efficiently solving such forward dynamics (FD) problem originates in the recursive propagation of motions and forces along serial kinematic chains and inverting joint space inertia (JSI) matrix with a high dimension. Parallel forward dynamics algorithms have long been developed and implemented on multi-processor systems to meet such demands. The CFA proposed by Fijany et al. is claimed to be the first truly parallel FD algorithm that has O(log n) time complexity on O(n) processors [19] (n being the number of links in the manipulator). Featherstone later presented a recursive, divide-and-conquer algorithm (DCA) based on the articulated body inertia (ABI) theory [10]. These are often compared with sequential-parallel hybrid O(n) algorithms arXiv:1609.06779v1 [cs.RO] 21 Sep 2016 such as by Anderson and Duan [20]. B. Related work The following are some highly related work to our paper. [3], [21] have utilized the all-prefix-sums (scan) operation to parallelize the propagation of rotation matrices when computing articulated robot inverse dynamics on GPU. Part of our contribution in this paper is to systematize the use of scan operations. Yamane et al. proposed a comparative study of various robot forward dynamics algorithms, which focuses on comparing the representation of joint geometry and applicability to robots with branches and loops [15]. Our work focus instead on the building blocks of different algorithms and the different variable elimination strategies that lead to different factorization of the inverse of the joint space inertia. C. Organization of the paper This paper is organized as follows. In section II, we first briefly review the recursive formulation of robot dy- namics using the matrix Lie group notation. Observing that some inverse/forward dynamics algorithms derived from this formulation contain a set of bi-diagonal systems, we then introduce the parallel scan operation to accelerate these algorithms. Since not all equations can be transformed into the scan operation, section III turns to exploring the CFA, a fully parallel forward dynamics algorithm, which involves solving a tri-diagonal system. Section IV then illustrates the tailored parallel OEE algorithm for solving the special tri- diagonal system in the CFA. In section V, we implement the hybrid/parallel forward dynamics algorithms on a CPU/GPU platform and conduct several experiments to test their per- formances in different applications. II. ROBOT DYNAMICS ALGORITHMS INVOLVING BLOCK BI-DIAGONAL SYSTEMS Throughout this paper, we shall use the Lie group nota- tion [22], [23] and follow the global matrix representation in [11] for the equations of motion of rigid bodies and articulated robots. We use SE(3) and se(3) to denote the special Euclidean group and its Lie algebra, and se(3)∗to denote the dual of se(3). Both se(3) and se(3)∗are identified with the 6-D vector space R6 in a natural way. A. Recursive formulation of robot dynamics Recall that the Newton-Euler equation for a single rigid body is given by [23]: J ˙V −ad∗ V (JV ) = F (1) where the velocity twist V ∈se(3) and applied external wrench F ∈se(3)∗; the inertia tensor J denotes a 6 × 6 symmetric positive-definite matrix; the dual adjoint map ad∗ V is identified with the 6 × 6 matrix adT V . It can be considered as a Euler-Poincar`e equation defined on SE(3) [24]. Consider without loss of generality a loopless and branch- less (serial) articulated robot with n links. The recursive formulation of dynamics can be derived from (1) and sum- marized by the following global matrix representation [11]: (I −Γ)V = S ˙q + const. (2a) (I −Γ) ˙V = S¨q + adS ˙q(ΓV ) + const. (2b) (I −Γ)T F = J ˙V + adT V (JV ) + const. (2c) τ = ST F (2d) where for example V = (V T 1 , . . . , V T n )T and τ = (τ1, . . . , τn)T denote the global velocity and joint torque vectors. The constant terms in (2) refer to the velocity and acceleration of the base link or the force applied at the end-effector. We emphasize that the block lower bi-diagonal matrix (I −Γ) (see [11] for more details) represents the forward propagation (from the base to the end-effector) of link velocities Vi’s and accelerations ˙Vi’s, and dually the block upper bi-diagonal matrix (I −Γ)T represents the backward propagation of external forces Fi’s (from the end- effector to the base). B. Solving bi-diagonal systems using parallel scan opera- tions Equation (2a) – (2d) form a set of linear algebraic equa- tions which can be used to solve both the inverse and forward dynamics problem. On the one hand, the inverse dynamics algorithms to find the applied torques given the motion of robots directly involve the solution of these bi-diagonal systems, and can therefore be accelerated by utilizing state-of-the-art bi-diagonal system solvers. On the other hand, similar bi-diagonal systems also appear in several forward dynamics algorithms. Such bi-diagonal systems can be efficiently solved with the parallel scan algorithm as we proposed in [12]. A scan operation takes a binary associative operator ⊕ with an identity, and an ordered set [a0, a1, . . . , an−1] of n elements, and returns the vector [a0, (a0 ⊕a1), . . . , (a0 ⊕ a1 ⊕· · · ⊕an−1)] [25]. It can be extended to the following linear recursion: xi = ( c0 if i = 0 bi−1xi−1 + ci if 0 < i < n (3) with the corresponding bi-diagonal system:   1 −b0 1 ... ... −bn−1 1     x0 ... xn  =   c0 ... cn   (4) by setting ai to be the homogeneous matrix a0 =  1 c0 0 1  ai =  bi−1 ci 0 1  , 0 < i < n (5) and ai−1 ⊕ai is defined as the matrix multiplication aiai−1. More further, a bi-diagonal system such as in Equation (2a)– (2c) can be easily transformed into scan compatible forms [12], [25]. An O(log n)-complexity parallel implementation of scan operation (running on n processors) can be achieved based on the balanced binary tree structure [25]. C. Scan-based forward dynamics algorithms The equation of motion in joint coordinates of an articu- lated robot can be derived by eliminating V, ˙V and F from Equation (2a)–(2d) [22]: M(q)¨q + C(q, ˙q) ˙q + N(q, ˙q) = τ (6) where M(q) = ST (I −Γ)−T J(I −Γ)−1S denotes the joint space inertia (JSI). Therefore, one can simply solve the unknown joint acceleration ¨q by inverting the JSI: ¨q = M(q)−1(τ −(C(q, ˙q) ˙q + N(q, ˙q) | {z } τ bias )) (7) Note that the bias torque τ bias in Equation (6) can simply be evaluated using an inverse dynamics solver by setting ¨q = 0 [26]. We shall denote τ −τ bias by τ δ for the rest of the paper. Equation (7) leads to the joint space inertia inversion algo- rithms (JSIIA) (or inertia matrix methods [26]) for solving the forward dynamics. In particular, we can compute each column of M(q) by solving an inverse dynamics problem by setting the corresponding ¨qi to 1 while setting all ˙qi and all ¨qj(j ̸= i) to 0, thereby utilizing the bi-diagonal solver based on scan operations. Inversion of M(q) leads to a time complexity of O(n3) [26], [27]. The inertia matrix methods are usually outperformed by the O(n) propagation methods [26]. One of the first ex- amples is the so-called articulated body inertia algorithm (ABIA) proposed by Featherstone [7], which hinges on the concept of articulated-body inertia (ABI) that represents the equivalent inertia of a sub-system disassembled from an entire robot [11], [26]. It involves the following square factorization of the inverse of M(q) [11]: M(q)−1 = [I −ST Y Π]T (ST ˆJS)−1[I −ST Y Π] (8) where ˆJ is the block diagonal matrix formed by the ABI of consecutive sub-systems, and can be evaluated by a nonlinear recursion [11]; Y = (I −XT )−1 and X is a 6n × 6n bi- diagonal matrix which implies that we can apply the parallel scan algorithm to efficiently calculate M −1 by solving the bi-diagonal system with (I −XT ) or its transpose being the coefficient matrix [12]. III. FORWARD DYNAMICS ALGORITHMS INVOLVING BLOCK TRI-DIAGONAL SYSTEMS The ABIA is inherently limited by the nonlinear recursion for the derivation of ABI, which can not be accelerated beyond a constant factor by parallel algorithms [28], [29]. This limitation is arguably due to the computation of the ABI matrix ˆJ based on an alternate elimination of Fi’s and ¨qi’s in Equation (2a)–(2c) [26]. On the contrary, another strategy used in the so-called constraint force algorithm (CFA) [9] calculates explicitly the global interbody force vector F by eliminating the global acceleration vector ¨q, leading to a Schur complement representation of M(q)−1. A. An elimination strategy with explicit forces Using the Lie group notation of [11], we summarize a general elimination approach compatible with CFA as follows. First, we reduce the equations (2a)–(2d) to: (I −Γ) ˙V = S¨q (9a) (I −Γ)T F = J ˙V (9b) τ δ = ST F (9c) by canceling out the bias torque related terms. Next, (9a) and (9b) lead to: S¨q = (I −Γ)J−1(I −Γ)T F (10) The elimination of ¨q is then achieved by defining a proper projection matrix W ∈R6n×5n such that: W T S¨q = W T (I −Γ)J−1(I −Γ)T F = 0 (11) for any ¨q (and therefore W T S = 0) and that Equation (11) and (9c) should fully determine F, i.e. the co-efficient matrix H ∈R6n×6n defined as follows must be non-singular: W T (I −Γ)J−1(I −Γ)T ST  | {z } H F = τ δ (12) Once F is fully determined, ¨q can be derived by pre- multiplying ST on both sides of Equation (10): ST S¨q = ¨q = ST (I −Γ)J−1(I −Γ)T F (13) where we assume ST S = I without loss of generality. B. The Constraint Force Algorithm The choice of the projection matrix W is not unique and may eventually lead to different factorization of the inverse of the JSI [6]. Here, we shall focus on the original CFA convention where W = diag(W1, . . . , Wn) and Wi ∈R6×5 is a conveniently chosen basis matrix for the constraint force of the ith joint. This results in a decomposition of F into active forces along S and constraint forces Fc along W [9]: F = Sτ δ + WFc (14) effectively replacing (9c). With this choice of W, H can be proved to invertible. Substitute (14) into Equation (11) leads to: AFc = −Bτ δ A = W T (I −Γ)J−1(I −Γ)T W B = W T (I −Γ)J−1(I −Γ)T S (15) where A must be invertible since the following matrix is the product of non-singular matrices: H W S =  A B 0 I  (16) Substitute (15) back into (10): ¨q = ST (I −Γ)J−1(I −ΓT )(S | {z } C +W (−A−1)B)τ δ | {z } Fc = (C + BT (−A−1)B) | {z } M(q)−1 τ δ (17) The above equation gives rise to a new factorization of M(q)−1 in the form of a Schur complement [9]. The advantage of this algorithm is that it involves no nonlinear recursion but instead a block tri-diagonal system defined by the block tri-diagonal matrix A. As we shall see in the next section, block tri-diagonal systems, like bi-diagonal systems, can also be efficiently solved by parallel algorithms. C. The Parallel CFA Implementation Steps In summary, the CFA first calculates the bias torques using the inverse dynamics algorithm and then subtract them from the input torques. The matrix M(q)−1 can be factored as M(q)−1 = C −BT A−1B as shown in (17). Calculations of A, B, and C are mutually independent, and only involve constant joint inertia, twists and adjoint transformations, which have already been figured out in the inverse dynamics computation. By exploiting the parallelism of many core GPUs, we can compute A, B and C in parallel. After solving the tri-diagonal system (15) for constraint forces, the desired joint acceleration vector can be figured out through a final parallel block matrix multiplication ¨q = Cτ + BT Fc. The detailed implementation is illustrated in the Algorithm 1. In summary, the two building blocks, the bi-diagonal system and the tri-diagonal system, of the unifying scheme for parallel implementation of forward dynamics algorithms are illustrated in Fig. 1. Algorithm 1: CALCFWDDYN CFA([τ in i ], [qi], [ ˙qi]) /* Compute the bias torques. */ 1 [τ bias i ] ←CALCINVDYN([qi], [ ˙qi], 0) /* Compute the differential torques in parallel. */ 2 [τ δ] ←CALCDIFFTORQUES([τ in i ], [τ bias i ]) /* Parallel computations. */ 3 begin 4 A ←CALCA([Ji], [Adi], [Wi]) 5 B ←CALCB([Ji], [Adi], [Wi], [Si]) 6 C ←CALCC([Ji], [Adi], [Si]) 7 [Ri] ←−Bτ 8 end /* Solve a block tridiagonal (BTD) system for constraint forces. */ 9 [Fi,c] ←SOLVBTDSYS(A, [Ri]) /* Compute joint accelerations. */ 10 [¨qout i ] ←CALCACC([τ δ], [Fi,c], B, C) IV. PARALLEL SOLVER OF BLOCK TRIDIAGONAL SYSTEM Compared to other parts in (17) which only involve matrix multiplication, solving the block tri-diagonal system AFc = −Bτ is more time consuming and thus its performance will have a significant impact on the efficiency of the CFA im- plementation, especially for systems with a large number of links. Fortunately, the tri-diagonal system can also be solved using parallel techniques, and two well-known solutions are the cyclic reduction algorithm (CRA) and the the recursive doubling algorithm (RDA) [30], [31]. The parallel RDA is similar to the parallel scan operation that has been used for accelerating the inverse dynamics and parts of the forward dynamics in our previous work [12]. However, it needs to assume that all the upper blocks in the tri-diagonal system are nonsingular, which does not hold for the CFA. As a result, we choose to use a variant of CRA called the odd- even elimination algorithm (OEE) that does not need such assumption [32]. The same choice was also made by Fijany in [19]. Intuitively, given a tri-diagonal system with n diagonal blocks, the OEE algorithm iteratively eliminates the off- diagonal blocks for ⌈log2 n⌉times (where ⌈⌉is the ceil- ing function) until the original tri-diagonal matrix becomes block-diagonal so that every equation in the system can be solved independently. Figure 1(d) illustrates the diagonaliza- tion process for a linear system with 8 diagonal blocks, which involves 3 eliminating-updating iterations. In step j (1 ≤ j ≤⌈log2 8⌉), the non-zero off-diagonal blocks (colored by green) on the i-th row are eliminated by subtracting the blocks on the (i + 2j−1)-th row or the (i −2j−1)-th row multiplied by suitable coefficient matrices. The elimination process will update all the diagonal blocks and some off- (a) (b) (c) (d) Fig. 1. (a) Inverse dynamics; (b) ABIA; (c) CFA; (d) diagonalization of an 8 × 8 block tri-diagonal matrix in the OEE algorithm. diagonal blocks (colored by yellow) as shown in Figure 1(d). These updated off-diagonal blocks will be annihilated in the next step in the same manner. Every eliminating or updating calculation only involve the blocks from the last step and thus can be executed in parallel. The A matrix in the CFA is a block tri-diagonal system with the form as follows: A =   D1 U1 U T 1 D2 U2 0 ... 0 U T n−2 Dn−1 Un−1 U T n−1 Dn   (18) This tri-diagonal system has two special properties: First, its diagonal blocks are all symmetric and invertible; second, all the lower off-diagonal blocks are the transpose of the corresponding upper off-diagonal blocks. Thus, we only need to update the upper off-diagonal blocks (as shown by the red front blocks in Figure 1(d)). Given a linear system of AFc = −Bτ, we can solve it by updating the blocks in the left side and right side in an iterative manner. For the j-th step, the update rule is as follows: D(j) i = D(j−1) i −E(j) i (U (j−1) i )T −K(j) i U (j−1) i−2j−1 U (j) i = −E(j) i U (j−1) i+2j−1 R(j) i = R(j−1) i −E(j) i X(j−1) i+2j−1 −K(j) i R(j−1) i−2j−1 (19) where D(j) i , U (j) i are the i-th diagonal and off-diagonal blocks in the matrix A and R(j) i is the i-th block in Bτ. E(j) i = U (j−1) i (D(j−1) i+2j−1)−1 is the coefficient matrix used to eliminate U (j−1) i and K(j) i = (U (j−1) i−2j−1)T (D(j−1) i−2j−1)−1 is the coefficient matrix used to eliminate (U (j−1) i )T . By taking the advantage of fact that Di is symmetric, we can compute the coefficient matrix E(j) i by solving a linear system D(j−1) i+2j−1(E(j) i )T = (U (j−1) i )T , which is cheaper and more stable than taking the inverse of D(j−1) i+2j−1 directly. The other coefficient matrix K(j) i can be figured out in the same manner. As shown in Figure 1(d), E(j) i and D(j) i only need to be calculated for i = 1 to n −2j−1 because the last 2j−1 upper off-diagonal blocks are removed in the j-th step. A more detailed description of this GPU-based parallel OEE algorithm is shown in Algorithm 2. Algorithm 2: OEESOLVBTDSYS([D(0) i ], [U (0) i ], [R(0) i ]) 1 for j ←1 to ⌈log2 n⌉do // Compute coefficients in parallel for i = 1 to n −2j−1. 2 begin 3 (E(j) i )T ←SOLVSYS([D(j−1) i+2j−1], [(U (j−1) i )T ]) 4 (K(j) i )T ←SOLVSYS([D(j−1) i ], [U (j−1) i ]) 5 end // Update Di and Ri in parallel for i = 1 to n −2j−1. 6 begin 7 D(j) i ←D(j−1) i −E(j) i (U (j−1) i )T 8 D(j) i+2j−1 ←D(j) i+2j−1 −K(j) i U (j−1) i 9 L(j) i ←E(j) i R(j−1) i+2j−1 10 R(j) i+2j−1 ←R(j−1) i+2j−1 −K(j) i R(j−1) i 11 R(j) i ←R(j) i −L(j) i 12 end // Update Ui in parallel where i = 1 to n −2j. 13 U (j) i ←−E(j) i U (j−1) i+2j−1 14 end // Compute all constraint forces in parallel where j = ⌈log2 n⌉. 15 begin 16 [Fi,c] ←SOLVSYS([D(j) i ], [R(j) i ]) 17 end V. EXPERIMENT In this section, we evaluate the performance of the three parallel dynamics algorithms discussed above by comparing their running time over two experiments. We first investigate the scalability of each algorithm with respect to increasing number of links of a single robot. We next demonstrate the efficacy of each method on a large group of robots with a moderate number of links, which poses as a bottleneck in many applications such as model-based control optimization. The robot configuration parameters such as twists are initial- ized randomly before the dynamics computation in all our experiments. Each experiment will be repeated one thousand times with randomized joint inputs, after which the average computation time is reported. All experiments are conducted on a desktop workstation with an 8-core Genuine Intel i7-6700 CPU and 15.6 GB memory. We implemented our GPU-based parallel dynamics algorithms using CUDA on a Tesla K40c GPU with a 11520 MB video memory and 288GB/sec memory bandwidth. A. Experiments with Different Link Numbers We compare the time cost of a single forward dynamics call for robots with a different number of links. We investi- gate the performance of three different forward dynamics algorithm: the GPU-based parallel JSIIA, the CPU-GPU hybrid ABIA (ABI iteratively computed on CPU; the rest executed on GPUs), and the GPU-based parallel CFA. As illustrated in Figure 2, even though the CFA algorithm is computationally more expensive for robots with a small num- ber of links, it eventually outperforms ABIA and JSIIA when the link number becomes substantially large. This is because the CFA introduces some extra operations to achieve the full parallelism. These extra operations result in a relatively high overhead for the situation with a moderate number of links, as explained in [19]. As the system scale increases, such overhead gets mitigated by the high acceleration rate of the other part of the CFA dynamics algorithm. Consequently, the CFA’s time cost grows slower as the number of links increases. In comparison, the inversion operation of the joint space inertia matrix (for JSIIA) and the computation of the articulated-body inertia via non-linear recursion (for ABIA) pose as the major bottlenecks for the corresponding parallel forward dynamics when link number is large. B. Experiments with Different Group Numbers We now compare the performances of conducting a large number of independent dynamics computations. Figure 3(a) and Figure 3(b) illustrate the performance of each algorithm for robots with ten links and two hundred links, respectively. For the ten-link robot, the computational efficiency of JSIIA Link Number 10 1 10 2 10 3 Computation Time (ms) 10 -1 10 0 10 1 10 2 10 3 10 4 Hybrid ABI Parallel JSI Parallel CF Fig. 2. Computation time comparison of different parallel forward dynamics algorithms for robots with different number of links and CFA is similar when the number of groups is larger than one hundred; both are faster than ABIA. However, for the two-hundred-link robot, CFA outperforms the other two approaches, with JSIIA having the worst performance. The performance degradation of JSIIA is due to its matrix inversion operation, whose cost increases cubically with the number of links. In conclusion, CFA demonstrates the best performance for robots with substantial number of links, while JSIIA is the most suitable for small scale systems. Due to the limited computational efficiency of sequential recursive operations and overhead for data transfer between CPU and GPU, the performance of ABIA is always dominated by JSIIA and CFA. VI. CONCLUSION AND FUTURE WORK We have proposed a parallel implementation scheme for various articulated robot forward dynamics algorithms. We address various representative algorithms, such as ABIA, JSIIA and CFA using a unified Lie group notation. We focus on a class of algorithms that admit a direct factorization of the inverse of JSI, and show that different variable elimination strategies lead to different factorization. We have identified two common building blocks of a large class (if not all) of forward dynamics algorithms for branchless and loopless articulated robots, namely block bi-diagonal and block tri-diagonal systems. They may be efficiently solved by parallel all-prefix-sum operations (scan) and parallel odd-even elimination (OEE) respectively. We implement the proposed scheme on a Nvidia CUDA GPU platform for the comparative study of JSIIA, ABIA and CFA. In summary, the CFA provides the best performance when the number of links increases, which conforms to the theoretical analysis presented in [19]. The JSIIA is the fastest Group Number 10 0 10 1 10 2 10 3 Computation Time (ms) 10 -1 10 0 10 1 10 2 10 3 Hybrid ABI for 10 Links Parallel JSI for 10 Links Parallel CF for 10 Links (a) Comparison for robots with 10 links Group Number 10 0 10 1 10 2 Computation Time (ms) 10 0 10 1 10 2 10 3 Hybrid ABI for 200 Links Parallel JSI for 200 Links Parallel CF for 200 Links (b) Comparison for robots with 200 links Fig. 3. Computation time comparison of different algorithms on many independent forward dynamics calls algorithm for robots with a moderate number of links, while the hybrid ABIA does not show a high-level competency in all the experiments we conducted. Our future work includes investigation of general choices of the projection matrix W as in (11) that lead to factoriza- tion of the inverse of JSI into efficiently solvable building blocks. This is eventually an optimization problem for speed and numerical accuracy. REFERENCES [1] J. Nickolls and W. J. Dally, “The GPU computing era,” Micro, IEEE, vol. 30, no. 2, pp. 56–69, 2010. [2] S. Lengagne, J. Vaillant, E. Yoshida, and A. Kheddar, “Generation of whole-body optimal dynamic multi-contact motions,” The Interna- tional Journal of Robotics Research, vol. 32, no. 9-10, pp. 1104–1119, 2013. [3] B. Chretien, A. Escande, and A. Kheddar, “GPU robot motion planning using semi-infinite nonlinear programming,” 2016. [4] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033. [5] J. J. Laflin, K. S. Anderson, and M. Hans, “Enhancing the performance of the DCA when forming and solving the equations of motion for multibody systems,” in Multibody Dynamics. Springer, 2016, pp. 19–31. [6] A. Fijany and R. Featherstone, “A new factorization of the mass matrix for optimal serial and parallel calculation of multibody dynamics,” Multibody System Dynamics, vol. 29, no. 2, pp. 169–187, 2013. [7] R. Featherstone, “The calculation of robot dynamics using articulated- body inertias,” The International Journal of Robotics Research, vol. 2, no. 1, pp. 13–30, 1983. [8] G. Rodriguez, A. Jain, and K. Kreutz-Delgado, “A spatial operator algebra for manipulator modeling and control,” The International Journal of Robotics Research, vol. 10, no. 4, pp. 371–381, 1991. [9] A. Fijany, “Parallel O(log N) algorithms for rigid multi-body dynam- ics,” JPL Engineering Memorandum, EM, pp. 343–92, 1992. [10] R. Featherstone, “A divide-and-conquer articulated-body algorithm for parallel O(log n) calculation of rigid-body dynamics. part 1: Basic algorithm,” The International Journal of Robotics Research, vol. 18, no. 9, pp. 867–875, 1999. [11] S. R. Ploen and F. C. Park, “Coordinate-invariant algorithms for robot dynamics,” IEEE Transactions on Robotics and Automation, vol. 15, no. 6, pp. 1130–1135, 1999. [12] Y. Yang, Y. Wu, and J. Pan, “Parallel dynamics computation using prefix sum operations,” CoRR, vol. abs/1609.04493, 2016. [Online]. Available: http://arxiv.org/abs/1609.04493 [13] D. Negrut, A. Tasora, H. Mazhar, T. Heyn, and P. Hahn, “Leveraging parallel computing in multibody dynamics,” Multibody System Dynam- ics, vol. 27, no. 1, pp. 95–117, 2012. [14] S. Redon, N. Galoppo, and M. C. Lin, “Adaptive dynamics of articulated bodies,” in ACM Transactions on Graphics (TOG), vol. 24, no. 3. ACM, 2005, pp. 936–945. [15] K. Yamane and Y. Nakamura, “Dynamics filter-concept and im- plementation of online motion generator for human figures,” IEEE transactions on robotics and automation, vol. 19, no. 3, pp. 421–432, 2003. [16] S.-H. Lee, J. Kim, F. C. Park, M. Kim, and J. E. Bobrow, “Newton- type algorithms for dynamics-based robot movement optimization,” IEEE Transactions on robotics, vol. 21, no. 4, pp. 657–667, 2005. [17] M. Chitchian, A. S. van Amesfoort, A. Simonetto, T. Keviczky, and H. J. Sips. (2012) Particle filters on multi-core processors. Dept. Comput. Sci., Delft Univ. Technology, Delft, The Netherlands, Tech. Rep. PDS-2012-001. [Online]. Available: http://www.pds.ewi.tudelft. nl/fileadmin/pds/reports/2012/PDS-2012-001.pdf [18] T. Erez and E. Todorov, “Trajectory optimization for domains with contacts using inverse dynamics,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 4914– 4919. [19] A. Fijany, I. Sharf, and G. M. D’Eleuterio, “Parallel O(log n) al- gorithms for computation of manipulator forward dynamics,” IEEE Transactions on Robotics and Automation, vol. 11, no. 3, pp. 389– 400, 1995. [20] K. S. Anderson and S. Duan, “A hybrid parallelizable low-order algorithm for dynamics of multi-rigid-body systems: Part I, chain systems,” Mathematical and computer modelling, vol. 30, no. 9, pp. 193–215, 1999. [21] J. J. Zhang, Y. F. Lu, and B. Wang, “A nonrecursive Newton- Euler formulation for the parallel computation of manipulator inverse dynamics,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 28, no. 3, pp. 467–471, 1998. [22] R. M. Murray, Z. Li, S. S. Sastry, and S. S. Sastry, A mathematical introduction to robotic manipulation. CRC press, 1994. [23] F. C. Park, J. E. Bobrow, and S. R. Ploen, “A Lie group formulation of robot dynamics,” The International Journal of Robotics Research, vol. 14, no. 6, pp. 609–618, 1995. [24] J. E. Marsden and T. Ratiu, Introduction to mechanics and symmetry: a basic exposition of classical mechanical systems. Springer Science & Business Media, 2013, vol. 17. [25] G. E. Blelloch, “Prefix sums and their applications,” 1990. [26] R. Featherstone, Rigid body dynamics algorithms. Springer, 2014. [27] B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo, Robotics: mod- elling, planning and control. Springer Science & Business Media, 2010. [28] L. Hyafil and H.-T. Kung, “The complexity of parallel evaluation of linear recurrences,” Journal of the ACM (JACM), vol. 24, no. 3, pp. 513–521, 1977. [29] J. Mikloˇsko and V. E. Kotov, “Complexity of parallel algorithms,” in Algorithms, software and hardware of parallel computers. Springer, 1984, pp. 45–63. [30] S. Cohen, “Cyclic reduction,” 1994. [Online]. Available: http: //robotics.stanford.edu/∼scohen/cr.pdf [31] S. K. Seal, “An accelerated recursive doubling algorithm for block tridiagonal systems,” in Parallel and Distributed Processing Sympo- sium, 2014 IEEE 28th International. IEEE, 2014, pp. 1019–1028. [32] C. Levit, “Parallel solution of pentadiagonal systems using general- ized odd-even elimination,” in Proceedings of the 1989 ACM/IEEE conference on Supercomputing. ACM, 1989, pp. 333–336.