GPU-Based Path Integral Control with Learned Dynamics

Grady Williams, Autonomous Control and Decision Systems Laboratory, Georgia Institute of Technology, gradyrw@gatech.edu
Eric Rombokas, Mechanical Engineering, University of Washington, eric@rombokas.com
Tom Daniel, Department of Biology, University of Washington, danielt@uw.edu

Abstract

We present an algorithm which combines recent advances in model-based path integral control with machine learning approaches to learning forward dynamics models. We take advantage of the parallel computing power of a GPU to quickly draw a massive number of samples from a learned probabilistic dynamics model, which we use to approximate the path integral form of the optimal control. The resulting algorithm runs in a receding-horizon fashion in real time, and is subject to no restrictive assumptions about costs, constraints, or dynamics. A simple change to the path integral control formulation allows the algorithm to take model uncertainty into account during planning, and we demonstrate its performance on a quadrotor navigation task. In addition to this novel adaptation of path integral control, this is the first time that a receding-horizon implementation of iterative path integral control has been run on a real system.

1 Introduction

General motion planning is one of the core problems in robotics, and there have been recent successes in this field through the use of model-based optimal control. Some recent examples are [4] and [9] for legged robot locomotion, and [6] for aggressive control of autonomous quadrotors. Despite these successes, a major drawback of this approach is the reliance on accurate forward or inverse dynamics models, which can be difficult or impossible to develop through physical analysis. A promising method of dealing with this problem is to learn a dynamics model through interaction with the environment instead of relying on an expert-provided, physics-based model. An important issue in this approach is that the learned dynamics only accurately describe a small region of the state space, where the training examples lie. Without taking any extra precautions, the controller assumes that the model accurately describes the dynamics for the entire space. This effect, termed model bias in [3], is potentially catastrophic and must be dealt with for consistent and safe performance.

Various approaches have been successful in dealing with model bias and have achieved good performance on simple tasks. The most well known of these is perhaps PILCO [3], which demonstrated remarkable learning performance on a cart-double-pole swing-up task while learning the dynamics as a Gaussian process. Another recent algorithm is due to Kuindersma et al. [5], who consider risk control for stochastic optimization and demonstrate good performance on stabilization tasks. In contrast to these works, we consider control for tasks which require more general cost descriptions, such as obstacle navigation.

1.1 Contributions

We present an algorithm and computational framework for combining model-based receding-horizon path integral control with machine learning algorithms that are capable of learning probabilistic forward dynamics models. Specifically, we modify the recently developed receding-horizon formulation of Policy Improvement with Path Integrals (PI2-RH) [8] to take into account model uncertainty, and then use Locally Weighted Projection Regression (LWPR) [11] to learn a probabilistic dynamics model of a nano-quadrotor.
The new algorithm performs unprecedented massive sampling of the probabilistic dynamics, yet runs in real time due to its implementation on a graphics processing unit (GPU). We report the algorithm's performance controlling the nano-quadrotor on an obstacle avoidance navigation task. Additionally, this is the first implementation of receding-horizon PI2 on a real machine.

2 Learning forward models with LWPR

Many machine learning algorithms are capable of acting as black-box predictors for dynamical systems (see [7] for an overview); however, only a subset are able to give probabilistic predictions at a cheap computational cost. LWPR is an algorithm based on a locally weighted, incremental version of partial least squares, and under the classic probabilistic interpretation of weighted least squares it is able to give uncertainty estimates along with a mean prediction. The algorithm learns a set of local linear models and equips each model with a center point in the state space and a distance metric. These values are used to compute predictions and uncertainties as a weighted average of all the models. Let $y_i(x)$, $\sigma_i(x)$, and $c_i$ be the mean, standard deviation, and center of the $i$th local model, and let $D_i$ be its distance metric. If there are $L$ local models, then the LWPR mean $y(x)$ and variance $\sigma(x)^2$ predictions are given by:

$$y(x) = \sum_{i=1}^{L} w_i y_i, \qquad \sigma(x)^2 = \sum_{j=1}^{L} w_j \left( (y_j - y)^2 + \sigma_j^2 \right), \qquad \text{where } w_j = \frac{e^{-\frac{1}{2}(x - c_j)^T D_j (x - c_j)}}{\sum_{i=1}^{L} e^{-\frac{1}{2}(x - c_i)^T D_i (x - c_i)}} \quad (1)$$

The key advantage of LWPR is that the computational complexity of updating the local models and making predictions is linear in the dimension of the state space, as opposed to scaling with the number of training inputs like many other algorithms (e.g., Gaussian processes).

3 Control

In this section we briefly review Policy Improvement with Path Integrals (PI2) [10] and its receding-horizon implementation (PI2-RH) [8]. We then propose a modification to the algorithm which allows it to take into account model uncertainty.

3.1 Receding-horizon path integral control

Consider a stochastic dynamical system of the form $dx = f(x, t)\,dt + G(x)u(t)\,dt + B(x)\,dw$. This system has non-linear passive dynamics, but is linear in the control and noise terms. We further assume that the noise is control dependent and that $BB^T = \lambda G R^{-1} G^T$ for some $\lambda > 0$. Now assume that we are given a cost function of the form $J(x) = \phi(T) + \int_0^T \left( q(x, t) + u(t)^T R u(t) \right) dt$. Under these assumptions the stochastic Hamilton-Jacobi-Bellman equation for $V(x) = \min_u \mathbb{E}[J(x)]$ can be related to a path integral (see [10] for the derivation), and the optimal controls take the form:

$$u_{t_i} = \int P(\tau_i)\, u_L(\tau_i)\, d\tau_i \qquad \text{and} \qquad P(\tau_i) = \frac{\exp\left(-\tfrac{1}{\lambda} \tilde{S}(\tau_i)\right)}{\int \exp\left(-\tfrac{1}{\lambda} \tilde{S}(\tau_i)\right) d\tau_i} \quad (2)$$

where the cost-to-go function is $\tilde{S}(\tau_i) = S(\tau_i) + \sum_{j=i}^{N-1} \left( u_{t_j}^T G_c^T H^{-1} G_c u_{t_j}\, dt + u_{t_j}^T G_c^T H^{-1} B\, dw_{t_j} \right)$, and the portion of the cost-to-go depending on the passive dynamics is $S(\tau_i) = \phi_{t_N} + \sum_{j=i}^{N-1} q(x, t)\, dt$, with $H = G_c R^{-1} G_c^T$ and $u_L(\tau_i) = R^{-1} G_c^T (G_c R^{-1} G_c^T)^{-1} B_c(x)\, dw$. In these equations $\tau = \{x_0, x_1, x_2, \ldots, x_{N-1}\}$ denotes a discretized trajectory, $\tau_i$ is the remaining piece of the trajectory at the $i$th timestep, $N$ is the number of timesteps in a trajectory (i.e., the number of discrete steps between $0$ and $T$), and $dt$ is the size of a single step. In the PI2 algorithm the optimal controls are computed by sampling from the stochastic dynamics in order to approximate (2).
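As a concrete illustration of this sampling-based approximation, the minimal sketch below updates a control sequence by exponentially weighting sampled rollouts by their costs-to-go, as in (2). The function and array names are our own, not taken from the authors' GPU implementation, and the costs-to-go are assumed to have been evaluated elsewhere.

```python
import numpy as np

def path_integral_update(controls, noise, costs, lam):
    """One sampling-based approximation of the optimal control in (2).

    controls : (N, m) current control sequence over the horizon
    noise    : (K, N, m) control perturbations used to generate K sampled rollouts
    costs    : (K, N) cost-to-go of each rollout evaluated from every timestep i
    lam      : the temperature parameter lambda in the exponential weighting
    """
    N = controls.shape[0]
    updated = np.zeros_like(controls)
    for i in range(N):
        s = costs[:, i]
        # Subtract the minimum for numerical stability; it cancels in the normalization.
        w = np.exp(-(s - s.min()) / lam)
        w /= w.sum()
        # The weight-normalized average of the sampled perturbations approximates
        # the integral over trajectories in (2).
        updated[i] = controls[i] + w @ noise[:, i, :]
    return updated
```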
It is usually necessary to perform this approximation in an iterative fashion: start with an initial guess, sample possible trajectories from the dynamics (termed "rollouts"), and compute an approximation to (2). The resulting approximation will be an improvement over the initial controls but may not be sufficient; if this is the case, the process is repeated until the algorithm converges to a solution, and once the solution is computed the control is executed in an open-loop fashion.

The receding-horizon implementation of PI2 modifies this approach by constantly re-planning while simultaneously executing. Instead of iterating to convergence as in traditional PI2, only a small, pre-defined number of iterations are performed, and then only a single timestep of the resulting trajectory is executed. The algorithm then receives state feedback and warm-starts its optimization for the next timestep with the un-executed portion of the previous timestep's control plan. This modification transforms PI2 into an implicit feedback controller, and allows it to perform difficult navigation tasks with a simulated quadrotor, which the original PI2 algorithm has difficulty with [8].

3.2 Accounting for model uncertainty

In order to combat model bias we modify PI2 to take into account the uncertainty of dynamics predictions from a learned probabilistic model. The role of the passive dynamics in evaluating rollouts is in the $S(\tau_i)$ portion of the cost-to-go function. In all previous model-based applications of PI2, $S(\tau_i)$ is evaluated based on a known passive dynamics model $f(x, u)$. In order to evaluate $S(\tau_i)$ under the uncertain dynamics, it makes sense to evaluate $\mathbb{E}[S(\tau_i) \mid F(x, u)]$, the expectation of the cost-to-go with respect to the uncertain dynamics $F(x, u)$, instead of computing a deterministic evaluation of $S(\tau_i)$ based on the mean prediction. This expectation is analytically intractable, but we can compute a crude Monte-Carlo approximation by taking multiple draws from the probability distribution for each rollout. A similar method for evaluating candidate control plans is considered by Bagnell and Schneider in [2]; however, we employ this strategy on a much larger scale. Under this approach a single rollout becomes a set of samples from a conditional probability distribution (termed sub-rollouts), and the cost-to-go of a rollout is evaluated as the average of the sub-rollouts.
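To make the sub-rollout evaluation concrete, the sketch below shows one way the Monte-Carlo approximation of $\mathbb{E}[S(\tau_i) \mid F(x, u)]$ could be computed for a single rollout. The one-step `predict` function stands in for a probabilistic model such as the LWPR prediction of (1), returning a mean and standard deviation for the next state; all names and the serial loop structure are illustrative assumptions, whereas the authors' implementation evaluates many rollouts in parallel on the GPU.

```python
import numpy as np

def expected_cost_to_go(x0, controls, predict, state_cost, terminal_cost,
                        M, dt, rng=None):
    """Monte-Carlo estimate of E[S(tau) | F(x, u)] for one rollout (Section 3.2).

    x0            : initial state (NumPy array)
    controls      : (N, m) control sequence for this rollout
    predict       : function (x, u) -> (mean, std), a probabilistic one-step model
    state_cost    : function x -> q(x), the instantaneous state cost
    terminal_cost : function x -> phi(x)
    M             : number of sub-rollouts drawn from the probabilistic model
    """
    if rng is None:
        rng = np.random.default_rng()
    N = controls.shape[0]
    total = 0.0
    for _ in range(M):
        x = x0.copy()
        S = 0.0
        for i in range(N):
            S += state_cost(x) * dt
            mean, std = predict(x, controls[i])
            # Draw the next state from the model's predictive distribution instead
            # of propagating only the mean prediction.
            x = rng.normal(mean, std)
        S += terminal_cost(x)
        total += S
    # The cost-to-go of the rollout is the average over its sub-rollouts.
    return total / M
```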
4 System

We implemented the modified algorithm on a Crazyflie [1] nano-quadrotor. The Crazyflie weighs 19 grams and measures 9 centimeters from rotor to rotor. An onboard radio communicates with the ground station, and onboard sensors and processors compute Euler angles and Euler angle rates and run a PID controller to achieve commanded attitude rate targets. We used a Vicon motion capture system to calculate quadrotor position and velocity. The algorithm runs in real time on the ground station computer, which had an Nvidia GTX 780 Ti with 2880 CUDA cores. This computer was able to perform an iteration using only the mean prediction for 1000 50-step rollouts in about 6.5 milliseconds, and an iteration for 1000 rollouts with 32 sub-rollouts each in about 18 milliseconds.

4.1 Models

We used a standard quadrotor model based on rigid body dynamics to run PI2-RH in order to compare it with the LWPR-based algorithm. Controls took the form of desired attitude rates $(\phi, \theta, \psi)$ and desired total thrust $F$, where the desired rates affect the actual rates according to the equation $\dot{r} = 25(r_{\text{desired}} - r_{\text{actual}})$.

We used flight examples taken from flights with PI2-RH and the analytic model to train 3 LWPR models with inputs $\phi, \theta, \psi, F$, with each output being one of the three Cartesian accelerations $(\ddot{x}, \ddot{y}, \ddot{z})$. We then replaced the rigid-body formulas for those derivatives with the LWPR models to obtain the learned probabilistic dynamics model. The learned models consisted of 15, 15, and 24 locally linear models for $\ddot{x}$, $\ddot{y}$, and $\ddot{z}$ respectively. In general the learned models were superior to the analytic model; the propagation errors over 1 second of both the analytic and LWPR models for a representative trial run of the quadrotor are given in Table 1.

                        Analytic Dynamics       LWPR Dynamics
Component               x      y      z         x      y      z
Position Error (m)      .38    .63    .52       .27    .37    .20
Velocity Error (m/s)    1.01   1.71   1.25      .78    1.05   .36

Table 1: Average error of Euler-integration simulation after 1 second over a typical quadrotor run.

5 Experiments

In our experiments the quadrotor attempted to navigate between three points while avoiding three obstacles placed in the flight arena. The quadrotor performed 4 laps around these points during a trial, and performed 5 trials in total for each parameter setting. We tested the algorithm using the rigid body model and the LWPR models with 1 (just the mean prediction), 4, 8, 16, and 32 sub-rollouts. For the instantaneous state cost we used:

$$q(x) = (x - W_x)^2 + (y - W_y)^2 + 10(z - W_z)^2 + \tfrac{1}{5}(\phi^2 + \theta^2 + \psi^2) + \tfrac{1}{10}(\dot{x}^2 + \dot{y}^2 + \dot{z}^2) + 100 \sum_{i=1}^{3} \exp\left(-10(d_{x,i}^2 + d_{y,i}^2)\right) + 10C$$

Here $W_x, W_y, W_z$ denote the components of the point that the quadrotor is moving towards; once the quadrotor comes within a quarter meter of the point, the goal is changed to another point. The algorithm is not given advance knowledge that this switch will occur. $d_{x,i}$ is the x distance between the quadrotor and the $i$th obstacle, and similarly for $d_{y,i}$. The variable $C$ is an indicator variable determining whether a crash has occurred or not.

The terminal cost was set to 0, and the cost term $u_t^T G_c^T H^{-1} G_c u_t\,dt + u_t^T G_c^T H^{-1} B\,dw_t$ is considered 0 as well. This is because the term $u_t^T G_c^T H^{-1} G_c u_t\,dt$ is state independent, so the corresponding term in the denominator can be pulled out of the integral and cancels with the term in the numerator. The second term $u_t^T G_c^T H^{-1} B\,dw_t$ is not 0, but since control costs are very small it is close to zero, and computational effort can be saved by not computing it without suffering adverse effects.

We ran the algorithm with the maximal settings that allowed real-time computation under the given parameters. This meant that the analytic model and the LWPR models with M = 1 and M = 4 were allowed to perform two iterations of optimization per timestep, while all other settings could only perform 1 iteration in real time. Every trial used a time horizon of 1 second and 50 timesteps per second. The parameters for each setting are given in Table 2.

                              RBD Model   M = 1   M = 4   M = 8   M = 16   M = 32
Rollouts                      1000        1000    1000    1000    970      950
Optimizations per Timestep    2           2       2       1       1        1

Table 2: Number of rollouts and optimization iterations per timestep possible for each sub-rollout setting.
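For concreteness, the instantaneous state cost $q(x)$ above can be written in code as the following sketch. The state layout, obstacle list, and crash flag are illustrative placeholders rather than details taken from the authors' implementation.

```python
import numpy as np

def instantaneous_cost(state, waypoint, obstacles, crashed):
    """Sketch of the instantaneous state cost q(x) used in the navigation experiments.

    state     : (x, y, z, phi, theta, psi, xdot, ydot, zdot)
    waypoint  : (Wx, Wy, Wz) current goal point
    obstacles : list of (ox, oy) obstacle positions in the horizontal plane
    crashed   : indicator (0 or 1) of whether a crash has occurred
    """
    x, y, z, phi, theta, psi, xd, yd, zd = state
    Wx, Wy, Wz = waypoint
    cost = (x - Wx) ** 2 + (y - Wy) ** 2 + 10.0 * (z - Wz) ** 2
    cost += (phi ** 2 + theta ** 2 + psi ** 2) / 5.0
    cost += (xd ** 2 + yd ** 2 + zd ** 2) / 10.0
    # Obstacle penalty: large near an obstacle, decaying quickly with distance.
    for ox, oy in obstacles:
        cost += 100.0 * np.exp(-10.0 * ((x - ox) ** 2 + (y - oy) ** 2))
    # Crash indicator term.
    cost += 10.0 * crashed
    return cost
```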
5.1 Results

The quadrotor was able to consistently complete the navigation task for all of the settings except M = 1, where only the mean prediction of LWPR was used. For this setting the quadrotor was overly aggressive and unable to consistently stay in the flight volume while avoiding obstacles: out of 7 total trials the quadrotor successfully completed the task 4 times, flew into an obstacle once, and flew out of the flight arena twice. We report the data from the 4 completed trials. Interestingly, when the quadrotor does complete the task for M = 1, the scores are very good. In fact, the lowest average total cost belongs to the successfully completed trajectories with M = 1. The issue is not poor performance; rather, it is overly aggressive and risky performance which comes from not accounting for bias in the learned model.

Figure 1: Trajectories of all the trials for M = 1 (left), M = 4 (center), and M = 32 (right). The trajectories are overlaid with a contour plot of the portion of the cost due to obstacles.

Sub-Rollouts      Avg. Time   Avg. Total Cost   Avg. Cost/Sec.   Avg. of 8 Closest Passes (m)
Analytic Model    37.05       6943.27           186.93           .23
M = 1             35.39       5733.55           162.01           .19
M = 4             34.91       5868.72           168.22           .30
M = 8             47.14       7273.80           154.27           .29
M = 16            49.40       7491.60           152.0            .32
M = 32            52.13       8523.40           162.99           .33

Table 3: Performance metrics for all of the parameter settings. Note that the Cost/Sec. is the cost over the optimization horizon.

Our method of taking the model uncertainty into account significantly improves the overall performance and consistency of the vehicle: using even a small number of sub-rollouts (M = 4) allows the vehicle to safely complete the task, and it outperforms the analytic model in every metric. Using a greater number of sub-rollouts results in the vehicle choosing more deliberate plans, which have a lower cost over the 1-second planning horizon and stay further away from the obstacles. The time to completion, total cost, cost over the time horizon, and the average distance of the 8 closest passes to obstacles are given in Table 3. We take the average of the 8 closest passes because there are 8 times that the quadrotor has to cross between (or go very far around) the obstacles during a trial.

5.2 Implicit variance minimization

The average variance of the LWPR predictions is given in Table 4. There is a clear trend of decreasing variance in the predictions as the number of sub-rollouts increases. This data, in conjunction with the more deliberate actions taken by the parameter settings with higher numbers of sub-rollouts, is evidence that the algorithm chooses safer, more certain actions when it has more knowledge about the effects of uncertainty on the system.

Number of Sub-Rollouts   $\ddot{x}$   $\ddot{y}$   $\ddot{z}$
M = 1                    .27          .39          1.39
M = 4                    .25          .38          1.32
M = 8                    .22          .34          1.18
M = 16                   .21          .33          1.17
M = 32                   .20          .32          1.14

Table 4: Average variance of acceleration predictions.

It is important to note that this variance minimization is happening without any variance term in the cost function. Minimization is happening implicitly because of the dangerous effects of executing uncertain trajectories in an obstacle-filled environment.

6 Discussion

We have developed and tested an algorithm which successfully combines a modified version of PI2-RH with LWPR. This is the first time that PI2-RH has been implemented on a real system, and also the first time that it has been combined with a learned dynamics model. Our results show that the PI2-RH algorithm successfully completes a difficult navigation task using an analytic model, and is outperformed by the new algorithm which uses a learned model and takes into account model bias.
The naive implementation of the learned model, where just the mean prediction is used, results in unreliable and dangerous performance. The new algorithm also minimizes the variance of its trajectories when obstacles are present, despite there being no variance term in the cost function. PI2-RH makes no restrictive assumptions about the form of the system dynamics or cost function, so given an accurate model it can theoretically perform a very wide class of tasks. Used in conjunction with model learning, this creates the potential for autonomous agents to learn complex tasks completely from scratch in a data-efficient manner.

Acknowledgments

This research was supported by ONR grant No. N000141010952 for K. Morgansen and T. Daniel and the Komen Endowed Chair (TLD). We would also like to thank the Veterans Affairs (VA) Rehabilitation Research and Development Center of Excellence for Limb Loss Prevention and Prosthetic Engineering, who lent us the use of their Motion Analysis Laboratory.

References

[1] Bitcraze. http://www.bitcraze.se/. [Online].
[2] J. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the International Conference on Robotics and Automation 2001. IEEE, 2001.
[3] M. Deisenroth and C. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.
[4] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal. STOMP: Stochastic trajectory optimization for motion planning. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 4569-4574. IEEE, 2011.
[5] S. Kuindersma, R. Grupen, and A. Barto. Variable risk control via stochastic optimization. International Journal of Robotics Research, 2013.
[6] D. Mellinger and V. Kumar. Minimum snap trajectory generation and control for quadrotors. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 2520-2525. IEEE, 2011.
[7] D. Nguyen-Tuong and J. Peters. Model learning for robot control: a survey. Cognitive Processing, 12(4):319-340, 2011.
[8] E. Rombokas, G. Williams, and T. Daniel. Receding horizon trajectory optimization on a GPU. Systems, Man, and Cybernetics, Part B, currently under review.
[9] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4906-4913. IEEE, 2012.
[10] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137-3181, 2010.
[11] S. Vijayakumar and S. Schaal. Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 1079-1086.