Real-Time Stochastic Optimal Control for Multi-agent Quadrotor Systems

Vicenç Gómez¹, Sep Thijssen², Andrew Symington³, Stephen Hailes⁴, Hilbert J. Kappen²
¹Universitat Pompeu Fabra, Barcelona, Spain, vicen.gomez@upf.edu
²Radboud University Nijmegen, the Netherlands, {s.thijssen,b.kappen}@donders.ru.nl
³University of California Los Angeles, USA, andrew.c.symington@gmail.com
⁴University College London, United Kingdom, s.hailes@cs.ucl.ac.uk

Abstract

This paper presents a novel method for controlling teams of unmanned aerial vehicles using Stochastic Optimal Control (SOC) theory. The approach consists of a centralized high-level planner that computes optimal state trajectories as velocity sequences, and a platform-specific low-level controller which ensures that these velocity sequences are met. The planning task is expressed as a centralized path-integral control problem, for which optimal control computation corresponds to a probabilistic inference problem that can be solved by efficient sampling methods. Through simulation we show that our SOC approach (a) has significant benefits compared to deterministic control and other SOC methods in multimodal problems with noise-dependent optimal solutions, (b) is capable of controlling a large number of platforms in real-time, and (c) yields collective emergent behaviour in the form of flight formations. Finally, we show that our approach works for real platforms, by controlling a team of three quadrotors in outdoor conditions.

1 Introduction

The recent surge in autonomous Unmanned Aerial Vehicle (UAV) research has been driven by the ease with which platforms can now be acquired, evolving legislation that regulates their use, and the broad range of applications enabled by both individual platforms and cooperative swarms. Example applications include automated delivery systems, monitoring and surveillance, target tracking, disaster management and navigation in areas inaccessible to humans. Quadrotors are a natural choice for an experimental platform, as they provide a safe, highly-agile and inexpensive means by which to evaluate UAV controllers. Figure 1 shows a 3D model of one such quadrotor, the Ascending Technologies Pelican. Quadrotors have non-linear dynamics and are naturally unstable, making control a non-trivial problem.

Figure 1: Control hierarchy: the path-integral controller (1) calculates target velocities/heights for each quadrotor. These are converted to roll, pitch, throttle and yaw rates by a platform-specific Velocity Height PID controller (2). This control is in turn passed to the platform's flight control system (3), and converted to relative motor speed changes.

Stochastic optimal control (SOC) provides a promising theoretical framework for achieving autonomous control of quadrotor systems. In contrast to deterministic control, SOC directly captures the uncertainty typically present in noisy environments and leads to solutions that qualitatively depend on the level of uncertainty (Kappen 2005). However, with the exception of the simple Linear Quadratic Gaussian case, for which a closed-form solution exists, solving the SOC problem requires solving the Hamilton-Jacobi-Bellman (HJB) equations. These equations are generally intractable, and so the SOC problem remains an open challenge.
In such a complex setting, a hierarchical approach is usually taken: the control problem is reduced to following a state trajectory (or a set of waypoints) designed by hand or computed offline using trajectory planning algorithms (Kendoul 2012). While the planning step typically involves a low-dimensional state representation, the control methods use a detailed, complex state representation of the UAV. Examples of control methods for trajectory tracking are the Proportional Integral Derivative controller or the Linear-Quadratic regulator.

A generic class of SOC problems was introduced in Kappen (2005) and Todorov (2006), for which the controls and the cost function are restricted in a way that makes the HJB equation linear and therefore more efficiently solvable. This class of problems is known as path integral (PI) control, linearly-solvable controlled diffusions or Kullback-Leibler control, and it has led to successful robotic applications, e.g. (Kinjo, Uchibe, and Doya 2013; Rombokas et al. 2012; Theodorou, Buchli, and Schaal 2010). A particularly interesting feature of this class of problems is that the computation of optimal control is an inference problem with a solution given in terms of the passive dynamics. In a multi-agent system, where the agents follow independent passive dynamics, such a feature can be exploited using approximate inference methods such as variational approximations or belief propagation (Kappen, Gómez, and Opper 2012; Van Den Broek, Wiegerinck, and Kappen 2008).

In this paper, we show how PI control can be used for solving motion planning tasks on a team of quadrotors in real time. We combine periodic re-planning with a receding horizon, similarly to model predictive control, with efficient importance sampling. At a high level, each quadrotor is modelled as a point mass that follows simple double integrator dynamics. Low-level control is achieved using a standard Proportional Integral Derivative (PID) velocity controller that interacts with a real or simulated flight control system. With this strategy we can scale PI control to ten units in simulation. Although in principle there are no further limits to experiments with actual platforms, our first results with real quadrotors only include three units. To the best of our knowledge, this is the first real-time implementation of PI control on an actual multi-agent system. In the next section we describe related work. We introduce our approach in Section 3. Results are shown on three different scenarios in Section 4. Finally, Section 5 concludes this paper.

2 Related Work on UAV Planning and Control

There is a large and growing body of literature related to this topic. In this section, we highlight the papers most closely related to the presented approach. A recent survey of control methods for general UAVs can be found in Kendoul (2012).

Stochastic optimal control is mostly used for UAV control in its simplest form, assuming a linear model perturbed by additive Gaussian noise and subject to quadratic costs (LQG), e.g. (How et al. 2008). While LQG can successfully perform simple actions like hovering, executing more complex actions requires considering additional corrections for aerodynamic effects such as induced power or blade flapping (Hoffmann et al. 2011). These approaches are mainly designed for accurate trajectory control and assume a given desired state trajectory that the controller transforms into motor commands.
Model Predictive Control (MPC) has been used to optimize trajectories in multi-agent UAV systems (Shim, Kim, and Sastry 2003). MPC employs a model of the UAV and solves an optimal control problem at time t and state x(t) over a future horizon of a fixed number of time-steps. The first optimal move u*(t) is then applied and the rest of the optimal sequence is discarded. The process is repeated again at time t + 1. A quadratic cost function is typically used, but other more complex functions exist.

MPC has mostly been used in indoor scenarios, where high-precision motion capture systems are available. For instance, in Kushleyev et al. (2013) the authors generate smooth trajectories through known 3-D environments satisfying specifications on intermediate waypoints, and show remarkable success controlling a team of 20 quadrotors. Trajectory optimization is translated to a relaxation of a mixed integer quadratic programming problem with additional constraints for collision avoidance, which can be solved efficiently in real-time. Examples that follow a similar methodology can be found in Turpin, Michael, and Kumar (2012) and Augugliaro, Schoellig, and D'Andrea (2012). Similarly to our approach, these methods use a simplified model of the dynamics, either using the 3-D position and yaw angle (Kushleyev et al. 2013; Turpin, Michael, and Kumar 2012) or the positions and velocities as in Augugliaro, Schoellig, and D'Andrea (2012). However, these approaches are inherently deterministic and express the optimal control problem as a quadratic problem. In our case, we solve an inference problem by sampling and we do not require intermediate trajectory waypoints.

In outdoor conditions, motion capture is difficult and the Global Positioning System (GPS) is used instead. Existing control approaches are typically based on either Reynolds flocking (Bürkle, Segor, and Kollmann 2011; Hauert et al. 2011; Vásárhelyi et al. 2014; Reynolds 1987) or flight formation (Guerrero and Lozano 2012; Yu et al. 2013). In Reynolds flocking, each agent is considered a point mass that obeys simple and distributed rules: separate from neighbors, align with the average heading of neighbors and steer towards the neighborhood centroid to keep cohesion. Flight formation control is typically modeled using graphs, where every node is an agent that can exchange information with all or several agents. Velocity and/or position coordination is usually achieved using consensus algorithms.

The work in Quintero, Collins, and Hespanha (2013) shares many similarities with our approach. The authors derive a stochastic optimal control formulation of the flocking problem for fixed-wing UAVs. They take a leader-follower strategy, where the leader follows an arbitrary (predefined) policy that is learned offline, and define the immediate cost as a function of the distance and heading with respect to the leader. Their method is demonstrated outdoors with 3 fixed-wing UAVs in a distributed sensing task. As in this paper, they formulate a SOC problem and perform MPC. However, in our case we do not restrict ourselves to a leader-follower setup, and we consider a more general class of SOC problems which can include coordination and cooperation problems.

Planning approaches. Within the planning community, Bernardini, Fox, and Long (2014) consider search and tracking tasks, similar to one of our scenarios.
Their approach differs from ours: they formulate a planning problem that uses search patterns that must be selected and sequenced to maximise the probability of rediscovering the target. Albore et al. (2015) and Chanel, Teichteil-Königsbuch, and Lesire (2013) consider a different problem: dynamic data acquisition and environmental knowledge optimisation. Both techniques use some form of replanning. While Albore et al. (2015) use a Markov Random Field framework to represent knowledge about the uncertain map and its quality, Chanel, Teichteil-Königsbuch, and Lesire (2013) rely on partially-observable MDPs. All these works consider a single-UAV scenario, and low-level control is either neglected or deferred to a PID or waypoint controller.

Recent Progress in Path-Integral Control. There has been significant progress in PI control, both theoretically and in applications. Most existing methods use parametrized policies to overcome the main limitations (see Section 3.1). Examples can be found in Theodorou, Buchli, and Schaal (2010), Stulp and Sigaud (2012) and Gómez et al. (2014). In these methods, the optimal control solution is restricted by the class of parametrized policies and, more importantly, it is computed offline. In Rawlik, Toussaint, and Vijayakumar (2013), the authors propose to approximate the transformed cost-to-go function using linear operators in a reproducing kernel Hilbert space. Such an approach requires an analytical form of the PI embedding, which is difficult to obtain in general. In Horowitz, Damle, and Burdick (2014), a low-rank tensor representation is used to represent the model dynamics, allowing PI control to scale up to a 12-dimensional system. More recently, the issue of state-dependence of the optimal control has been addressed (Thijssen and Kappen 2015), where a parametrized state-dependent feedback controller is derived for the PI control class.

Finally, model predictive PI control has recently been proposed for controlling a nano-quadrotor in indoor settings in an obstacle avoidance task (Williams, Rombokas, and Daniel 2014). In contrast to our approach, their method is not hierarchical and uses naive sampling, which makes it less sample efficient. Additionally, the control cost term is neglected, which can have important implications in complex tasks involving noise. The approach presented here scales well to several UAVs in outdoor conditions and is illustrated in tasks beyond obstacle avoidance navigation.

3 Path-Integral Control for Multi-UAV Planning

We first briefly review PI control theory. This is followed by a description of the proposed method used to achieve motion planning of multi-agent UAV systems using PI control.

3.1 Path-Integral Control

We consider continuous-time stochastic control problems, where the dynamics and cost are respectively linear and quadratic in the control input, but arbitrary in the state. More precisely, consider the following stochastic differential equation of the state vector x ∈ R^n under controls u ∈ R^m:

$$dx = f(x)\,dt + G(x)(u\,dt + d\xi), \qquad (1)$$

where ξ is m-dimensional Wiener noise with covariance Σ_u ∈ R^{m×m}, and f(x) ∈ R^n and G(x) ∈ R^{n×m} are arbitrary functions; f is the drift of the uncontrolled dynamics (including gravity, Coriolis and centripetal forces), and G describes the effect of the control u on the state vector x. A realization τ = x_{0:dt:T} of the above equation is called a (random) path.
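A path of equation (1) can be generated with a plain Euler-Maruyama discretization. The following minimal sketch (function names and signatures are our own, not from the paper) also stores the noise realizations, which the estimator of equations (3, 4) below reuses:

```python
import numpy as np

def sample_path(x0, u_seq, f, G, dt, sigma_u, rng):
    """Euler-Maruyama rollout of Eq. (1): dx = f(x) dt + G(x)(u dt + dxi),
    with dxi ~ N(0, Sigma_u dt). Names and signatures are illustrative."""
    K, m = u_seq.shape                          # K steps of an m-dim control
    L = np.linalg.cholesky(sigma_u)             # Sigma_u = L @ L.T
    x, path, noise = x0.copy(), [x0.copy()], []
    for k in range(K):
        dxi = np.sqrt(dt) * (L @ rng.standard_normal(m))
        x = x + f(x) * dt + G(x) @ (u_seq[k] * dt + dxi)
        path.append(x.copy())
        noise.append(dxi)                       # kept for the weights of Eq. (3)
    return np.array(path), np.array(noise)
```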
In order to describe a control problem, we define the cost that is attributed to a path (the cost-to-go) as

$$S(\tau|x_0, u) = r_T(x_T) + \sum_{t=0:dt:T-dt} \Big( r_t(x_t) + \tfrac{1}{2} u_t^\top R u_t \Big)\, dt, \qquad (2)$$

where r_T(x_T) and r_t(x_t) are arbitrary state cost terms at end and intermediate times, respectively, and R is the control cost matrix. The general stochastic optimal control problem is to minimize the expected cost-to-go with respect to the control:

$$u^* = \arg\min_u \mathbb{E}[S(\tau|x_0, u)].$$

In general, such a minimization leads to the Hamilton-Jacobi-Bellman (HJB) equations, which are non-linear, second-order partial differential equations. However, under the relation Σ_u = λR^{-1} between the control cost and the noise covariance, the resulting equation is linear in the exponentially transformed cost-to-go function. The solution is given by the Feynman-Kac formula, which expresses the optimal control in terms of a path integral; this can be interpreted as taking an expectation under the optimal path distribution (Kappen 2005):

$$p^*(\tau|x_0) \propto p(\tau|x_0, u) \exp(-S(\tau|x_0, u)/\lambda), \qquad (3)$$
$$\langle u^*_t(x_0) \rangle = \langle u_t + (\xi_{t+dt} - \xi_t)/dt \rangle, \qquad (4)$$

where p(τ|x_0, u) denotes the probability of a (sub-optimal) path under equation (1) and ⟨·⟩ denotes expectation over paths distributed according to p*.

The constraint Σ_u = λR^{-1} forces control and noise to act in the same dimensions, but in an inverse relation. Thus, for fixed λ, the larger the noise, the cheaper the control, and vice versa. The parameter λ acts as a temperature: higher values of λ result in optimal solutions that are closer to the uncontrolled process.

Equation (4) permits the optimal control to be calculated by probabilistic inference methods, e.g., Monte Carlo. An interesting fact is that equations (3, 4) hold for all controls u. In particular, u can be chosen to reduce the variance of the Monte Carlo computation of ⟨u*_t(x_0)⟩, which amounts to importance sampling. This technique can drastically improve the sampling efficiency, which is crucial in high-dimensional systems. Despite this improvement, direct application of PI control to real systems is limited because it is not clear how to choose a proper importance sampling distribution. Furthermore, note that equation (4) yields the optimal control for all times t averaged over states. The result is therefore an open-loop controller that neglects the state-dependence of the control beyond the initial state.

3.2 Multi-UAV Planning

The proposed architecture is composed of two main levels. At the most abstract level, the UAV is modeled as a 2D point-mass system that follows double integrator dynamics.

Algorithm 1 PI control for UAV motion planning
1: function PICONTROLLER(N, H, dt, r_t(·), Σ_u, u_{t:dt:t+H})
2:   for k = 1, ..., N do
3:     Sample paths τ_k = {x_{t:dt:t+H}}_k with Eq. (5)
4:   end for
5:   Compute S_k = S(τ_k|x_0, u) with Eq. (2)
6:   Store the noise realizations {ξ_{t:dt:t+H}}_k
7:   Compute the weights: w_k = e^{-S_k/λ} / Σ_l e^{-S_l/λ}
8:   for s = t : dt : t + H do
9:     u*_s = u_s + (1/dt) Σ_k w_k ({ξ_{s+dt}}_k − {ξ_s}_k)
10:  end for
11:  Return next desired velocity v_{t+dt} = v_t + u*_t dt, and u*_{t:dt:t+H} for importance sampling at t + dt
12: end function

At the low level, we use a detailed second-order model that we learn from real flight data (De Nardi and Holland 2008). We use model predictive control combined with importance sampling. There are two main benefits of using the proposed approach: first, since the state is continuously updated, the controller does not suffer from the problems caused by using an open-loop controller. Second, the control policy is not restricted by any parametrization.
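For concreteness, here is a minimal numpy sketch of one receding-horizon step of Algorithm 1, assuming the decoupled double-integrator dynamics of Eq. (5) below and isotropic noise. All names, shapes and the scalar control cost are illustrative assumptions, not the authors' code:

```python
import numpy as np

def pi_controller(x0, u, r_state, r_final, dt, lam, sigma_u, N, rng):
    """One receding-horizon step of Algorithm 1 (illustrative sketch).

    x0      : (M, 4) agent states [p_E, p_N, v_E, v_N], Eq. (5)
    u       : (K, M, 2) importance-control sequence over horizon H = K*dt
    r_state : callable, (N, M, 4) rollout states -> (N,) immediate cost
    r_final : callable, (N, M, 4) final states   -> (N,) end cost
    sigma_u : scalar noise variance, Sigma_u = sigma_u * I
    """
    K = u.shape[0]
    x = np.repeat(x0[None], N, axis=0)                 # (N, M, 4) rollouts
    S = np.zeros(N)                                    # path costs, Eq. (2)
    xi = np.sqrt(sigma_u * dt) * rng.standard_normal((N, K) + u.shape[1:])
    R = lam / sigma_u                                  # from Sigma_u = lam * R^-1
    for k in range(K):
        a = u[k] + xi[:, k] / dt                       # noisy acceleration
        x[..., :2] += x[..., 2:] * dt                  # double integrator,
        x[..., 2:] += a * dt                           # Eq. (5)
        S += (r_state(x) + 0.5 * R * (u[k] ** 2).sum()) * dt
    S += r_final(x)
    w = np.exp(-(S - S.min()) / lam)                   # line 7 (stabilized)
    w /= w.sum()
    ess = 1.0 / (w ** 2).sum()                         # convergence diagnostic
    u_star = u + np.einsum('n,nkma->kma', w, xi) / dt  # line 9
    v_next = x0[:, 2:] + u_star[0] * dt                # line 11
    return u_star, v_next, ess
```

In an MPC loop, u_star shifted by one step would seed the importance controls of the next call, as line 11 of Algorithm 1 suggests.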
The two-level approach permits transmitting control signals from the high-level PI controller to the low-level control system at a relatively low frequency (we use 15 Hz in this work). Consequently, the PI controller has more time available for sampling a large number of trajectories, which is critical to obtain good estimates of the control. The choice of 2D in the presented method is not a fundamental limitation, as long as double-integrator dynamics is used. The control hierarchy introduces additional model mismatch. However, as we show in the results later, this mismatch is not critical for obtaining good performance in real conditions.

Ignoring height, the state vector x is thus composed of the East-North (EN) positions and EN velocities of each agent i = 1, ..., M as x^i = [p^i, v^i]^⊤, where p^i, v^i ∈ R². Similarly, the control u consists of EN accelerations u^i ∈ R². Equation (1) decouples between the agents and takes the linear form

$$dx^i = A x^i dt + B(u^i dt + d\xi^i), \quad A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad B = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \qquad (5)$$

Notice that although the dynamics is decoupled and linear, the state cost r_t(x_t) in equation (2) can be an arbitrary function of all UAV states. As a result, the optimal control will in general be a non-linear function that couples all the states, and thus hard to compute. Given the current joint optimal action u*_t and velocity v_t, the expected velocity at the next time t′ is calculated as v_{t′} = v_t + (t′ − t)u*_t and passed to the low-level controller. The final algorithm optionally keeps an importance-control sequence u_{t:dt:t+H} that is incrementally updated. We summarize the high-level controller in Algorithm 1. The importance-control sequence u_{t:dt:t+H} is initialized using prior knowledge, or with zeros otherwise. Noise is dimension-independent, i.e. Σ_u = σ²_u I. To measure sampling convergence, we define the Effective Sample Size (ESS) as

$$\mathrm{ESS} := 1 \Big/ \sum_{k=1}^{N} w_k^2,$$

which is a quantity between 1 and N. Values of ESS close to one indicate an estimate dominated by just one sample and a poor estimate of the optimal control, whereas an ESS close to N indicates near-perfect sampling, which occurs when the importance control equals the optimal control.

3.3 Low Level Control

The target velocity v = [v_E, v_N]^⊤ is passed along with a height p̂_U to a Velocity-Height controller. This controller uses the current state estimate of the real quadrotor y = [p_E p_N p_U φ θ ψ u v w p q r]^⊤, where (p_E, p_N, p_U) and (φ, θ, ψ) denote navigation-frame position and orientation, and (u, v, w) and (p, q, r) denote body-frame linear and angular velocities, respectively. It is composed of four independent PID controllers, for roll φ̂, pitch θ̂, throttle γ̂ and yaw rate r̂, that send commands to the flight control system (FCS) to achieve v.

Figure 2: The flight control system (FCS) is comprised of two control loops: one for stabilization and the other for pose control. A low-level controller interacts with the FCS over a serial interface to stream measurements and issue control.

Figure 2 shows the details of the FCS. The control loop runs at 1 kHz, fusing triaxial gyroscope, accelerometer and magnetometer measurements. The accelerometer and magnetometer measurements are used to determine a reference global orientation, which is in turn used to track the gyroscope bias. The difference between the desired and actual angular rates is converted to motor speeds using the model in Mahony, Kumar, and Corke (2012). An outer pose control loop calculates the desired angular rates based on the desired state.
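The Velocity-Height controller above admits a compact sketch: four independent PID loops map velocity and height errors to attitude and throttle commands. Gains, signs and frame conventions here are illustrative assumptions, not the platform's firmware:

```python
import numpy as np

class PID:
    """Textbook PID loop; the gains used below are illustrative placeholders."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err, dt):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def velocity_height_step(v_cmd, h_cmd, psi_cmd, est, pids, dt):
    """Map (v_E, v_N) and height targets to roll, pitch, throttle and
    yaw-rate commands with four independent PIDs. Signs and frame
    conventions are assumptions, not the platform's actual firmware."""
    vE_err, vN_err = v_cmd[0] - est['vE'], v_cmd[1] - est['vN']
    c, s = np.cos(est['psi']), np.sin(est['psi'])     # heading psi from North
    fwd_err = s * vE_err + c * vN_err                 # error along body-forward
    side_err = c * vE_err - s * vN_err                # error along body-right
    pitch = -pids['pitch'].step(fwd_err, dt)          # nose down to speed up
    roll = pids['roll'].step(side_err, dt)            # roll right to go right
    throttle = pids['thr'].step(h_cmd - est['pU'], dt)
    yaw_rate = pids['yaw'].step(psi_cmd - est['psi'], dt)
    return roll, pitch, throttle, yaw_rate
```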
Orientation estimates are obtained from the inner control loop, while position and velocity are obtained by fusing GPS navigation fixes with barometric pressure (BAR) based altitude measurements. The radio transmitter (marked TX in the diagram) allows the operator to switch quickly between autonomous and manual control of a platform. There is also an acoustic alarm on the platform itself, which warns the operator when the GPS signal is lost or the battery is getting low. If the battery reaches a critical level or communication with the transmitter is lost, the platform can be configured to land immediately or, alternatively, to fly back and land at its take-off point.

3.4 Simulator Platform

We have developed an open-source framework called CRATES¹. The framework is an implementation of QRSim (De Nardi 2013; Symington et al. 2014) in Gazebo, which uses the Robot Operating System (ROS) for high-level control. It permits high-level controllers to be platform-agnostic. It is similar to the Hector Quadrotor project (Meyer et al. 2012), with a formalized notion of a hardware abstraction layer.

¹CRATES stands for 'Cognitive Robotics Architecture for Tightly-Coupled Experiments and Simulation'. Available at https://bitbucket.org/vicengomez/crates

The CRATES simulator propagates the quadrotor state forward in time based on a second-order model (De Nardi and Holland 2008). The equations were learned from real flight data and verified by expert domain knowledge. In addition to platform dynamics, CRATES also simulates various noise-perturbed sensors, wind shear and turbulence. Orientation and barometric altitude errors follow zero-mean Ornstein-Uhlenbeck processes, while GPS error is modeled at the pseudo-range level using trace data available from the International GPS Service. In accordance with the Military Specification MIL-F-8785C, wind shear is modeled as a function of altitude, while turbulence is modeled as a discrete implementation of the Dryden model. CRATES also provides support for generating terrain from satellite images and light detection and ranging (LIDAR) technology, and for reporting collisions between platforms and terrain.

4 Results

We now analyze the performance of the proposed approach in three different tasks. We first show that, in the presence of control noise, PI control is preferable over other approaches. For clarity, this scenario is presented for one agent only. We then consider two tasks involving several units: a flight formation task and a pursuit-evasion task.

We compare the PI control method described in Section 3.2 with iterative linear-quadratic Gaussian (iLQG) control (Todorov and Li 2005). iLQG is a state-of-the-art method based on differential dynamic programming that iteratively computes local linear-quadratic approximations to the finite-horizon problem. A key difference between iLQG and PI control is that the linear-quadratic approximation is certainty equivalent. Consequently, iLQG yields a noise-independent solution.

4.1 Scenario I: Drunken Quadrotor

This scenario is inspired by Kappen (2005) and highlights the benefits of SOC in a quadrotor task. The Drunken Quadrotor is a finite-horizon task in which a quadrotor has to reach a target while avoiding a building and a wall (Figure 3).

Figure 3: Drunken Quadrotor: a red target has to be reached while avoiding obstacles. (Left) the shortest route is the optimal solution in the absence of noise. (Right) with control noise, the optimal solution is to fly around the building.
There are two possible routes: a shorter one that passes through a small gap between the wall and the building, and a longer one that goes around the building. Unlike SOC, the deterministic optimal solution does not depend on the noise level and will always take the shorter route. However, with added noise, the risk of collision increases, and thus the optimal noisy control is to take the longer route. This task could alternatively be addressed using other planning methods, such as the one proposed by Ono, Williams, and Blackmore (2013), which allows for specification of the user's acceptable level of risk using chance constraints. Here we focus on comparing deterministic and stochastic optimal control for motion planning. The amount of noise thus determines whether the optimal solution is to go through the risky path or the longer, safer path.

The state cost in this problem consists of hard constraints that assign infinite cost when either the wall or the building is hit. PI control deals with collisions by killing particles that hit the obstacles during Monte Carlo sampling. For iLQG, the local approximations require a twice-differentiable cost function. We resolved this issue by adding a smooth obstacle-proximity penalty to the cost function. Although iLQG computes linear feedback, we tried to improve it with an MPC scheme, similar to the one used for PI control. Unfortunately, this leads to numerical instabilities in this task, since the system disturbances tend to move the reference trajectory through a building when moving from one time step to the next.

For MPC with PI control we use a receding horizon of three seconds and perform re-planning at a frequency of 15 Hz with N = 2000 sample paths. Both methods are initialized with u_t = 0 for all t. iLQG requires approximately 10³ iterations to converge with a learning rate of 0.5%.

Figure 3 (left) shows an example of a real trajectory computed for a low control noise level, σ²_u = 10⁻³. To be able to obtain such a trajectory, we deactivate sensor uncertainties in the accelerometer, gyroscope, orientation and altimeter. External noise is thus limited to aerodynamic turbulence only. In this case, both the iLQG and PI solutions correspond to the shortest path, i.e. going through the gap between the wall and the building. Figure 3 (right) illustrates the solutions obtained for a larger noise level, σ²_u = 1. While the optimal reference trajectory obtained by iLQG does not change, which results in a collision once the real noisy controller is executed (left path), the PI control solution avoids the building and takes the longer route (right path). Note that iLQG can find both solutions depending on initialization. However, it will always choose the shortest route, regardless of nearby obstacles.

Figure 4: Drunken Quadrotor with wind, for different wind velocities and fixed control noise σ²_u = 0.5. (Left) cost of the obtained solutions and (right) percentage of crashes using iLQG and PI.

Also note that the PI-controlled unit takes a longer route to reach the target. The reason is that the control cost R is set quite high in order to reach a good ESS. Alternatively, if R is decreased, the optimal solution could reach the target sooner, but at the cost of a decreased ESS.
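The particle-killing rule above has a direct implementation: a rollout that hits an obstacle receives infinite cost and therefore zero weight in the path-integral average. A minimal sketch (array shapes and names are illustrative):

```python
import numpy as np

def collision_weights(S, hit, lam):
    """PI weights under hard obstacle constraints: a rollout that hits
    the wall or the building gets infinite cost, hence zero weight.

    S   : (N,) sampled path costs from Eq. (2)
    hit : (N,) boolean, True if the rollout collided
    """
    S = np.where(hit, np.inf, S)
    if hit.all():                            # every particle died: no estimate,
        return None                          # the caller should re-sample
    w = np.exp(-(S - S[~hit].min()) / lam)   # stabilized softmax; exp(-inf) = 0
    return w / w.sum()
```

Killing many particles collapses the ESS, which is one reason the control cost R is kept high in this task.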
This trade-off between the control cost and the effective sample size, which is inherent in PI control, can be resolved by incorporating feedback control in the importance sampling, as presented in Thijssen and Kappen (2015).

We also consider more realistic conditions where the noise is not limited to act in the control. Figure 4 shows results in the presence of wind and sensor uncertainty. The left panel shows how the wind affects the quality of the solution, resulting in an increase of the variance and the cost for stronger wind. In all our tests, iLQG is not able to bring the quadrotor to the other side. The right panel shows the percentage of crashes using both methods. Crashes occur often using iLQG control and only occasionally using PI control. With stronger wind, the iLQG-controlled unit occasionally does not even reach the corridor (the unit did not reach the other side, but did not crash either); this explains the difference in percentages. We conclude that for multi-modal tasks (tasks where multiple solution trajectories exist), the proposed method is preferable to iLQG.

4.2 Scenario II: Holding Pattern

The second scenario addresses the problem of coordinating agents to hold their position near a point of interest while keeping a safe range of velocities and avoiding crashing into each other. Such a problem arises, for instance, when multiple aircraft need to land at the same location and simultaneous landing is not possible. The resulting flight formation has been used frequently in the literature (Vásárhelyi et al. 2014; How et al. 2008; Yu et al. 2013; Franchi et al. 2012), but always with prior specification of the trajectories. We show how this formation is obtained as the optimal solution of a SOC problem. Consider the following state cost (omitting time indexes):

$$r_{\mathrm{HP}}(x) = \sum_{i=1}^{M} \Big( \exp(v^i - v_{\max}) + \exp(v_{\min} - v^i) + \exp(\|p^i\|_2 - d) + \sum_{j>i}^{M} C_{\mathrm{hit}} / \|p^i - p^j\|_2 \Big), \qquad (6)$$

where v_max and v_min denote the maximum and minimum velocities, respectively, d sets the penalty for deviation from the origin, and C_hit is the penalty for collision risk of two agents; ‖·‖₂ denotes the ℓ2 norm.

Figure 5: Holding pattern in the CRATES simulator. Ten units coordinate their flight in a circular formation. In this example, N = 10⁴ samples, control noise is σ²_u = 0.1 and horizon H = 1 s. Cost parameters are v_min = 1, v_max = 3, C_hit = 20 and d = 7. Environmental noise and sensing uncertainties are modeled using realistic parameter values.

The optimal solution for this problem is a circular flying pattern in which the units fly equidistantly from each other. The value of the parameter d determines the radius, and the average velocities of the agents are determined by v_min and v_max. Since the solution is symmetric with respect to the direction of rotation (clockwise or anti-clockwise), a choice is made and the symmetry is broken only when the control is executed. Figure 5 shows a snapshot of a simulation after the flight formation has been reached for a particular choice of parameter values². Since we use an uninformed initial control trajectory, there is a transient period during which the agents organize to reach the optimal configuration. The coordinated circular pattern is obtained regardless of the initial positions. This behavior is robust and is obtained for a large range of parameter values.

²Supplementary video material is available at http://www.mbfys.ru.nl/staff/v.gomez/uav.html
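For reference, here is a direct transcription of the reconstructed Eq. (6); reading v^i as the agent's speed ‖v^i‖₂ is our assumption, as are the names:

```python
import numpy as np

def r_hp(p, v, v_min, v_max, d, c_hit):
    """Holding-pattern state cost, Eq. (6) as reconstructed above.
    p, v : (M, 2) agent positions and velocities; v_i is read as the
    speed ||v_i||_2, which is our assumption."""
    speed = np.linalg.norm(v, axis=1)
    radius = np.linalg.norm(p, axis=1)               # distance to the origin
    cost = (np.exp(speed - v_max) + np.exp(v_min - speed)
            + np.exp(radius - d)).sum()
    dist = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    i, j = np.triu_indices(len(p), k=1)              # all pairs with j > i
    return cost + (c_hit / dist[i, j]).sum()
```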
Figure 6: Holding pattern: (a) evolution of the state cost for different numbers of samples, N = 10, 10², 10³; (b) scaling of the method with the number of agents; and, for different control noise levels, (c) comparison between iLQG and PI control (ratios > 1 indicate better performance of PI over iLQG) and (d) Effective Sample Sizes. Error bars correspond to ten different random realizations.

Figure 6(a) shows immediate costs at different times. The cost always decreases from the starting configuration until the formation is reached. This value depends on several parameters; we report its dependence on the number N of sample paths. For large N, the variances are small and the cost attains small values at convergence. Conversely, for small N, there is larger variance and the obtained dynamical configuration is less optimal (typically, the distances between the agents are not all the same). During the formation of the pattern the controls are more expensive. For this particular task, full convergence of the path integrals is not required, and the formation can be achieved with a very small N.

Figure 6(b) illustrates how the method scales as the number of agents increases. We report averages of the mean costs over 20 time-steps after one minute of flight. We varied M while fixing the rest of the parameters (the distance d was set equal to the number of agents, in meters). The small variance of the cost indicates that a stable formation is reached in all cases. As expected, larger values of N lead to configurations with smaller state cost. For more than ten UAVs, the simulator starts to have problems in this task, and occasional crashes may occur before the formation is reached due to limited sample sizes. This limitation can be addressed, for example, by using more processing power and parallelization, and is left for future work.

We also compared our approach with iLQG in this scenario. Figure 6(c) shows the ratio of cost differences after convergence of both solutions. Both use MPC, with a horizon of 2 s and an update frequency of 15 Hz. Values above 1 indicate that PI control consistently outperforms iLQG in this problem. Before convergence, we also found, as in the previous task, that iLQG resulted in occasional crashes while PI control did not. The Effective Sample Size is shown in Figure 6(d). We observe that higher control noise levels result in better exploration and thus better controls. We can thus conclude that the proposed methodology is feasible for coordinating a large team of quadrotors.

For this task, we also performed experiments with the real platforms. Figure 7 shows real trajectories obtained in outdoor conditions (see also the video that accompanies this paper for an experiment with three platforms). Despite the presence of significant noise, the circular behavior was also obtained.

Figure 7: Resulting trajectories of a Holding Pattern experiment using two platforms in outdoor conditions.
In the real experiments, we used a Core i7 laptop with 8 GB RAM as the base station, which ran its own ROS messaging core and forwarded messages to and from the platforms over an IEEE 802.11 2.4 GHz network. For safety reasons, the quadrotors were flown at different altitudes.

4.3 Scenario III: Cat and Mouse

The final scenario that we consider is the cat and mouse scenario. In this task, a team of M quadrotors (the cats) has to catch (get close to) another quadrotor (the mouse). The mouse has autonomous dynamics: it tries to escape the cats by moving at a velocity inversely proportional to the distance to the cats. More precisely, let p_mouse denote the 2D position of the mouse; the velocity command for the mouse is computed (omitting time indexes) as

$$v_{\mathrm{mouse}} = v^{\max}_{\mathrm{mouse}} \frac{v}{\|v\|_2}, \quad \text{where} \quad v = \sum_{i=1}^{M} \frac{p^i - p_{\mathrm{mouse}}}{\|p^i - p_{\mathrm{mouse}}\|_2}.$$

The parameter v^max_mouse determines the maximum velocity of the mouse. As the state cost function, we use equation (6) with an additional penalty term that depends on the sum of the distances to the mouse:

$$r_{\mathrm{CM}}(x) = r_{\mathrm{HP}}(x) + \sum_{i=1}^{M} \|p^i - p_{\mathrm{mouse}}\|_2.$$

This scenario leads to several interesting dynamical states. For example, for a sufficiently large value of M, the mouse always gets caught (if its initial position is not close to the boundary, determined by d). The optimal control for the cats consists of surrounding the mouse while preventing collisions. Once the mouse is surrounded, the cats keep rotating around it, as in the previous scenario, but with the origin replaced by the mouse position. Figure 8 (top-right) illustrates this behavior. The additional video shows examples of other complex behaviors obtained for different parameter settings.

Figure 8: Cat and mouse scenario: (Top-left) four cats and one mouse. (Top-right) for a horizon time of H = 2 seconds, the four cats surround the mouse indefinitely and keep rotating around it. (Bottom-left) for a horizon time of H = 1 second, the four cats chase the mouse, but (bottom-right) the mouse manages to escape. With these settings, the multi-agent system alternates between these two dynamical states. The number of sample paths is N = 10⁴ and the noise level is σ²_u = 0.5. Other parameter values are d = 30, v_min = 1, v_max = 4 and v^max_mouse = 3.

The types of solution we observe are different for other parameter values. For example, for M = 2 or a small time horizon, e.g. H = 1, the dynamical state in which the cats rotate around the mouse is not stable, and the mouse escapes. This is displayed in Figure 8 (bottom panels) and better illustrated in the video provided as supplementary material. We emphasize that these different behaviors are observed under large uncertainty in the form of sensor noise and wind.

5 Conclusions

This paper presents a centralized, real-time stochastic optimal control algorithm for coordinating the actions of multiple autonomous vehicles in order to minimize a global cost function. The high-level control task is expressed as a path-integral control problem that can be solved using efficient sampling methods, and real-time control is possible via the use of re-planning and model predictive control. To the best of our knowledge, this is the first real-time implementation of path-integral control on an actual multi-agent system. We have shown in a simple scenario (Drunken Quadrotor) that the proposed methodology is more convenient than other approaches, such as deterministic control or iLQG, for planning trajectories.
In more complex scenarios, such as the Holding Pattern and the Cat and Mouse, the proposed methodology is also preferable and allows for real-time control. We observe multiple and complex group behaviors emerging from the specified cost function. Our experimental framework, CRATES, has been a key development that permitted a smooth transition from theory to the real quadrotor platforms, with literally no modification of the underlying control code. This gives evidence that the model mismatch caused by the use of a control hierarchy is not critical in normal outdoor conditions.

Our current research is addressing the following aspects:

Large-scale parallel sampling: the presented method can be easily parallelized, for instance using graphics processing units, as in Williams, Rombokas, and Daniel (2014). Although the tasks considered in this work did not require more than 10⁴ samples, we expect that this improvement will significantly increase the number of application domains and the system size.

Distributed control: we are exploring different distributed formulations that take better advantage of the factorized representation of the state cost. Note that the cost functions considered in this work only require pairwise couplings of the agents (to prevent collisions). However, full observability of the joint space is still required, which is not available in a fully distributed approach.

References

[Albore et al.] Albore, A.; Peyrard, N.; Sabbadin, R.; and Teichteil-Königsbuch, F. 2015. An online replanning approach for crop fields mapping with autonomous UAVs. In International Conference on Automated Planning and Scheduling (ICAPS).

[Augugliaro, Schoellig, and D'Andrea] Augugliaro, F.; Schoellig, A.; and D'Andrea, R. 2012. Generation of collision-free trajectories for a quadrocopter fleet: A sequential convex programming approach. In Intelligent Robots and Systems (IROS), 1917–1922.

[Bernardini, Fox, and Long] Bernardini, S.; Fox, M.; and Long, D. 2014. Planning the behaviour of low-cost quadcopters for surveillance missions. In International Conference on Automated Planning and Scheduling (ICAPS).

[Bürkle, Segor, and Kollmann] Bürkle, A.; Segor, F.; and Kollmann, M. 2011. Towards autonomous micro UAV swarms. J. Intell. Robot. Syst. 61(1-4):339–353.

[Chanel, Teichteil-Königsbuch, and Lesire] Chanel, C. C.; Teichteil-Königsbuch, F.; and Lesire, C. 2013. Multi-target detection and recognition by UAVs using online POMDPs. In International Conference on Automated Planning and Scheduling (ICAPS).

[De Nardi and Holland] De Nardi, R., and Holland, O. 2008. Coevolutionary modelling of a miniature rotorcraft. In 10th International Conference on Intelligent Autonomous Systems (IAS10), 364–373.

[De Nardi] De Nardi, R. 2013. The QRSim Quadrotors Simulator. Technical Report RN/13/08, University College London.

[Franchi et al.] Franchi, A.; Masone, C.; Grabe, V.; Ryll, M.; Bülthoff, H. H.; and Giordano, P. R. 2012. Modeling and control of UAV bearing-formations with bilateral high-level steering. Int. J. Robot. Res. 0278364912462493.

[Gómez et al.] Gómez, V.; Kappen, H. J.; Peters, J.; and Neumann, G. 2014. Policy search for path integral control. In European Conf. on Machine Learning & Knowledge Discovery in Databases, 482–497.

[Guerrero and Lozano] Guerrero, J., and Lozano, R. 2012. Flight Formation Control. John Wiley & Sons.

[Hauert et al.] Hauert, S.; Leven, S.; Varga, M.; Ruini, F.; Cangelosi, A.; Zufferey, J.-C.; and Floreano, D. 2011.
Reynolds flocking in reality with fixed-wing robots: Communication range vs. maximum turning rate. In Intelligent Robots and Systems (IROS), 5015–5020.

[Hoffmann et al.] Hoffmann, G. M.; Huang, H.; Waslander, S. L.; and Tomlin, C. J. 2011. Precision flight control for a multi-vehicle quadrotor helicopter testbed. Control. Eng. Pract. 19(9):1023–1036.

[Horowitz, Damle, and Burdick] Horowitz, M. B.; Damle, A.; and Burdick, J. W. 2014. Linear Hamilton Jacobi Bellman equations in high dimensions. arXiv preprint arXiv:1404.1089.

[How et al.] How, J.; Bethke, B.; Frank, A.; Dale, D.; and Vian, J. 2008. Real-time indoor autonomous vehicle test environment. IEEE Contr. Syst. Mag. 28(2):51–64.

[Kappen, Gómez, and Opper] Kappen, H. J.; Gómez, V.; and Opper, M. 2012. Optimal control as a graphical model inference problem. Mach. Learn. 87:159–182.

[Kappen] Kappen, H. J. 2005. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment 2005(11):P11011.

[Kendoul] Kendoul, F. 2012. Survey of advances in guidance, navigation, and control of unmanned rotorcraft systems. J. Field Robot. 29(2):315–378.

[Kinjo, Uchibe, and Doya] Kinjo, K.; Uchibe, E.; and Doya, K. 2013. Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task. Front. Neurorobot. 7:1–13.

[Kushleyev et al.] Kushleyev, A.; Mellinger, D.; Powers, C.; and Kumar, V. 2013. Towards a swarm of agile micro quadrotors. Auton. Robot. 35(4):287–300.

[Mahony, Kumar, and Corke] Mahony, R.; Kumar, V.; and Corke, P. 2012. Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor. IEEE Robotics & Automation Magazine 20–32.

[Meyer et al.] Meyer, J.; Sendobry, A.; Kohlbrecher, S.; and Klingauf, U. 2012. Comprehensive Simulation of Quadrotor UAVs Using ROS and Gazebo. Lecture Notes in Computer Science 7628:400–411.

[Ono, Williams, and Blackmore] Ono, M.; Williams, B. C.; and Blackmore, L. 2013. Probabilistic planning for continuous dynamic systems under bounded risk. J. Artif. Int. Res. 46(1):511–577.

[Quintero, Collins, and Hespanha] Quintero, S.; Collins, G.; and Hespanha, J. 2013. Flocking with fixed-wing UAVs for distributed sensing: A stochastic optimal control approach. In American Control Conference (ACC), 2025–2031.

[Rawlik, Toussaint, and Vijayakumar] Rawlik, K.; Toussaint, M.; and Vijayakumar, S. 2013. Path integral control by reproducing kernel Hilbert space embedding. In Twenty-Third International Joint Conference on Artificial Intelligence, 1628–1634. AAAI Press.

[Reynolds] Reynolds, C. W. 1987. Flocks, herds and schools: A distributed behavioral model. SIGGRAPH Comput. Graph. 21(4):25–34.

[Rombokas et al.] Rombokas, E.; Theodorou, E.; Malhotra, M.; Todorov, E.; and Matsuoka, Y. 2012. Tendon-driven control of biomechanical and robotic systems: A path integral reinforcement learning approach. In International Conference on Robotics and Automation, 208–214.

[Shim, Kim, and Sastry] Shim, D. H.; Kim, H. J.; and Sastry, S. 2003. Decentralized nonlinear model predictive control of multiple flying robots. In IEEE Conference on Decision and Control (CDC), volume 4, 3621–3626.

[Stulp and Sigaud] Stulp, F., and Sigaud, O. 2012. Path integral policy improvement with covariance matrix adaptation. In International Conference on Machine Learning (ICML).

[Symington et al.] Symington, A.; De Nardi, R.; Julier, S.; and Hailes, S. 2014. Simulating quadrotor UAVs in outdoor scenarios.
In Intelligent Robots and Systems (IROS), 3382–3388.

[Theodorou, Buchli, and Schaal] Theodorou, E.; Buchli, J.; and Schaal, S. 2010. A generalized path integral control approach to reinforcement learning. J. Mach. Learn. Res. 11:3137–3181.

[Thijssen and Kappen] Thijssen, S., and Kappen, H. J. 2015. Path integral control and state-dependent feedback. Phys. Rev. E 91:032104.

[Todorov and Li] Todorov, E., and Li, W. 2005. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference (ACC), 300–306, vol. 1. IEEE.

[Todorov] Todorov, E. 2006. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), 1369–1376.

[Turpin, Michael, and Kumar] Turpin, M.; Michael, N.; and Kumar, V. 2012. Decentralized formation control with variable shapes for aerial robots. In International Conference on Robotics and Automation (ICRA), 23–30.

[Van Den Broek, Wiegerinck, and Kappen] Van Den Broek, B.; Wiegerinck, W.; and Kappen, H. J. 2008. Graphical model inference in optimal control of stochastic multi-agent systems. J. Artif. Intell. Res. 32:95–122.

[Vásárhelyi et al.] Vásárhelyi, G.; Virágh, C.; Somorjai, G.; Tarcai, N.; Szorenyi, T.; Nepusz, T.; and Vicsek, T. 2014. Outdoor flocking and formation flight with autonomous aerial robots. In Intelligent Robots and Systems (IROS), 3866–3873.

[Williams, Rombokas, and Daniel] Williams, G.; Rombokas, E.; and Daniel, T. 2014. GPU based path integral control with learned dynamics. In Autonomously Learning Robots - NIPS Workshop.

[Yu et al.] Yu, B.; Dong, X.; Shi, Z.; and Zhong, Y. 2013. Formation control for quadrotor swarm systems: Algorithms and experiments. In Chinese Control Conference (CCC), 7099–7104.
1 Introduction The recent surge in autonomous Unmanned Aerial Vehicle (UAV) research has been driven by the ease with which plat- forms can now be acquired, evolving legislation that reg- ulates their use, and the broad range of applications en- abled by both individual platforms and cooperative swarms. Example applications include automated delivery systems, monitoring and surveillance, target tracking, disaster man- agement and navigation in areas inaccessible to humans. Quadrotors are a natural choice for an experimental plat- form, as they provide a safe, highly-agile and inexpensive means by which to evaluate UAV controllers. Figure 1 shows a 3D model of one such quadrotor, the Ascending Technolo- gies Pelican. Quadrotors have non-linear dynamics and are naturally unstable, making control a non-trivial problem. Stochastic optimal control (SOC) provides a promising theoretical framework for achieving autonomous control of quadrotor systems. In contrast to deterministic control, SOC directly captures the uncertainty typically present in noisy environments and leads to solutions that qualitatively de- pend on the level of uncertainty (Kappen 2005). However, with the exception of the simple Linear Quadratic Gaussian Copyright c⃝2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Figure 1: Control hierarchy: The path-integral controller (1) calculates target velocities/heights for each quadrotor. These are converted to roll, pitch, throttle and yaw rates by a platform-specific Velocity Height PID controller (2). This control is in turn passed to the platform’s flight control sys- tem (3), and converted to relative motor speed changes. case, for which a closed form solution exists, solving the SOC problem requires solving the Hamilton Jacobi Bellman (HJB) equations. These equations are generally intractable, and so the SOC problem remains an open challenge. In such a complex setting, a hierarchical approach is usu- ally taken and the control problem is reduced to follow a state-trajectory (or a set of way points) designed by hand or computed offline using trajectory planning algorithms (Kendoul 2012). While the planning step typically involves a low-dimensional state representation, the control methods use a detailed complex state representation of the UAV. Ex- amples of control methods for trajectory tracking are the Proportional Integral Derivative or the Linear-Quadratic reg- ulator. arXiv:1502.04548v6 [eess.SY] 12 May 2020 A generic class of SOC problems was introduced in Kap- pen; Todorov (2005; 2006) for which the controls and the cost function are restricted in a way that makes the HJB equation linear and therefore more efficiently solvable. This class of problems is known as path integral (PI) control, linearly-solvable controlled diffusions or Kullback-Leibler control, and it has lead to successful robotic applications, e.g. (Kinjo, Uchibe, and Doya 2013; Rombokas et al. 2012; Theodorou, Buchli, and Schaal 2010). A particularly inter- esting feature of this class of problems is that the compu- tation of optimal control is an inference problem with a so- lution given in terms of the passive dynamics. In a multi- agent system, where the agents follow independent passive dynamics, such a feature can be exploited using approxi- mate inference methods such as variational approximations or belief propagation (Kappen, G´omez, and Opper 2012; Van Den Broek, Wiegerinck, and Kappen 2008). 
In this paper, we show how PI control can be used for solving motion planning tasks on a team of quadrotors in real time. We combine periodic re-planning with receding horizon, similarly to model predictive control, with effi- cient importance sampling. At a high level, each quadro- tor is modelled as a point mass that follows simple dou- ble integrator dynamics. Low-level control is achieved us- ing a standard Proportional Integral Derivative (PID) veloc- ity controller that interacts with a real or simulated flight control system. With this strategy we can scale PI control to ten units in simulation. Although in principle there are no further limits to experiments with actual platforms, our first results with real quadrotors only include three units. To the best of our knowledge this has been the first real-time im- plementation of PI control on an actual multi-agent system. In the next section we describe related work. We introduce our approach in Section 3 Results are shown on three dif- ferent scenarios in Section 4 Finally, Section 5 concludes this paper. 2 Related Work on UAV Planning and Control There is a large and growing body of literature related to this topic. In this section, we highlight some of the most related papers to the presented approach. An recent sur- vey of control methods for general UAVs can be found in Kendoul (2012). Stochastic optimal control is mostly used for UAV con- trol in its simplest form, assuming a linear model perturbed by additive Gaussian noise and subject to quadratic costs (LQG), e.g. (How et al. 2008). While LQG can successfully perform simple actions like hovering, executing more com- plex actions requires considering additional corrections for aerodynamic effects such as induced power or blade flap- ping (Hoffmann et al. 2011). These approaches are mainly designed for accurate trajectory control and assume a given desired state trajectory that the controller transforms into motor commands. Model Predictive Control (MPC) has been used optimize trajectories in multi-agent UAV systems (Shim, Kim, and Sastry 2003). MPC employs a model of the UAV and solves an optimal control problem at time t and state x(t) over a future horizon of a fixed number of time-steps. The first op- timal move u∗(t) is then applied and the rest of the optimal sequence is discarded. The process is repeated again at time t + 1. A quadratic cost function is typically used, but other more complex functions exist. MPC has mostly been used in indoor scenarios, where high-precision motion capture systems are available. For in- stance, in Kushleyev et al. (2013) authors generate smooth trajectories through known 3-D environments satisfying specifications on intermediate waypoints and show remark- able success controlling a team of 20 quadrotors. Trajectory optimization is translated to a relaxation of a mixed inte- ger quadratic program problem with additional constraints for collision avoidance, that can be solved efficiently in real- time. Examples that follow a similar methodology can be found in Turpin, Michael, and Kumar; Augugliaro, Schoel- lig, and D’Andrea (2012; 2012). Similarly to our approach, these methods use a simplified model of dynamics, either us- ing the 3-D position and yaw angle Kushleyev et al.; Turpin, Michael, and Kumar (2013; 2012) or the position and ve- locities as in Augugliaro, Schoellig, and D’Andrea (2012). However, these approaches are inherently deterministic and express the optimal control problem as a quadratic problem. 
In our case, we solve an inference problem by sampling and we do not require intermediate trajectory waypoints. In outdoor conditions, motion capture is difficult and Global Positioning System (GPS) is used instead. Existing control approaches are typically either based on Reynolds flocking (B¨urkle, Segor, and Kollmann 2011; Hauert et al. 2011; V´as´arhelyi et al. 2014; Reynolds 1987) or flight for- mation (Guerrero and Lozano 2012; Yu et al. 2013). In Reynolds flocking, each agent is considered a point mass that obeys simple and distributed rules: separate from neigh- bors, align with the average heading of neighbors and steer towards neighborhood centroid to keep cohesion. Flight for- mation control is typically modeled using graphs, where ev- ery node is an agent that can exchange information with all or several agents. Velocity and/or position coordination is usually achieved using consensus algorithms. The work in Quintero, Collins, and Hespanha (2013) shares many similarities with our approach. Authors derive a stochastic optimal control formulation of the flocking prob- lem for fixed-wings UAVs. They take a leader-follower strat- egy, where the leader follows an arbitrary (predefined) pol- icy that is learned offline and define the immediate cost as a function of the distance and heading with respect to the leader. Their method is demonstrated outdoors with 3 fixed- wing UAVs in a distributed sensing task. As in this paper, they formulate a SOC problem and perform MPC. However, in our case we do not restrict to a leader-follower setup and consider a more general class of SOC problems which can include coordination and cooperation problems. Planning approaches Within the planning community, Bernardini, Fox, and Long (2014) consider search and track- ing tasks, similar to one of our scenarios. Their approach is different to ours, they formulate a planning problem that uses used search patterns that must be selected and se- quenced to maximise the probability of rediscovering the target. Albore et al. (2015) and Chanel, Teichteil-Knigsbuch, and Lesire (2013) consider a different problem: dynamic data acquisition and environmental knowledge optimisation. Both techniques use some form of replanning. While Al- bore et al. (2015) uses a Markov Random Field framework to represent knowledge about the uncertain map and its qual- ity, Chanel, Teichteil-Knigsbuch, and Lesire (2013) rely on partially-observable MDPs. All these works consider a sin- gle UAV scenario and low-level control is either neglected or deferred to a PID or waypoint controller. Recent Progress in Path-Integral Control There has been significant progress in PI control, both theoretically and in applications. Most of existing methods use parametrized policies to overcome the main limitations (see Section 3.1). Examples can be found in Theodorou, Buchli, and Schaal; Stulp and Sigaud; G´omez et al. (2010; 2012; 2014). In these methods, the optimal control solution is restricted by the class of parametrized policies and, more importantly, it is computed offline. In Rawlik, Toussaint, and Vijayaku- mar (2013), authors propose to approximate the transformed cost-to-go function using linear operators in a reproducing kernel Hilbert space. Such an approach requires an analyt- ical form of the PI embedding, which is difficult to obtain in general. In Horowitz, Damle, and Burdick (2014), a low- rank tensor representation is used to represent the model dy- namics, allowing to scale PI control up to a 12-dimensional system. 
More recently, the issue of the state-dependence of the optimal control has been addressed in Thijssen and Kappen (2015), where a parametrized state-dependent feedback controller is derived for the PI control class.

Finally, model predictive PI control has recently been proposed for controlling a nano-quadrotor in an indoor obstacle avoidance task (Williams, Rombokas, and Daniel 2014). In contrast to our approach, their method is not hierarchical and uses naive sampling, which makes it less sample efficient. Additionally, the control cost term is neglected, which can have important implications in complex tasks involving noise. The approach presented here scales well to several UAVs in outdoor conditions and is illustrated in tasks beyond obstacle avoidance navigation.

3 Path-Integral Control for Multi-UAV planning
We first briefly review PI control theory. This is followed by a description of the proposed method used to achieve motion planning of multi-agent UAV systems using PI control.

3.1 Path-Integral Control
We consider continuous-time stochastic control problems where the dynamics and cost are, respectively, linear and quadratic in the control input, but arbitrary in the state. More precisely, consider the following stochastic differential equation for the state vector x ∈ R^n under controls u ∈ R^m:

dx = f(x)dt + G(x)(u dt + dξ),  (1)

where ξ is m-dimensional Wiener noise with covariance Σ_u ∈ R^{m×m}, and f(x) ∈ R^n and G(x) ∈ R^{n×m} are arbitrary functions; f is the drift of the uncontrolled dynamics (including gravity, Coriolis and centripetal forces), and G describes the effect of the control u on the state vector x. A realization τ = x_{0:dt:T} of the above equation is called a (random) path. To describe a control problem, we define the cost attributed to a path (the cost-to-go) as

S(τ|x_0, u) = r_T(x_T) + Σ_{t=0:dt:T−dt} ( r_t(x_t) + ½ u_t⊤ R u_t ) dt,  (2)

where r_T(x_T) and r_t(x_t) are arbitrary state cost terms at the end time and at intermediate times, respectively, and R is the control cost matrix. The general stochastic optimal control problem is to minimize the expected cost-to-go with respect to the control:

u∗ = argmin_u E[S(τ|x_0, u)].

In general, such a minimization leads to the Hamilton-Jacobi-Bellman (HJB) equations, which are non-linear, second-order partial differential equations. However, under the relation Σ_u = λR^{−1} between the control cost and the noise covariance, the resulting equation is linear in the exponentially transformed cost-to-go function. The solution is given by the Feynman-Kac formula, which expresses the optimal control in terms of a path integral, interpretable as an expectation under the optimal path distribution (Kappen 2005):

p∗(τ|x_0) ∝ p(τ|x_0, u) exp(−S(τ|x_0, u)/λ),  (3)
⟨u∗_t(x_0)⟩ = ⟨u_t + (ξ_{t+dt} − ξ_t)/dt⟩,  (4)

where p(τ|x_0, u) denotes the probability of a (sub-optimal) path under equation (1) and ⟨·⟩ denotes expectation over paths distributed according to p∗.

The constraint Σ_u = λR^{−1} forces control and noise to act in the same dimensions, but in an inverse relation: for fixed λ, the larger the noise, the cheaper the control, and vice versa. The parameter λ acts as a temperature: higher values of λ result in optimal solutions that are closer to the uncontrolled process. Equation (4) permits the optimal control to be computed by probabilistic inference methods, e.g., Monte Carlo. Interestingly, equations (3, 4) hold for all controls u. In particular, u can be chosen to reduce the variance of the Monte Carlo estimate of ⟨u∗_t(x_0)⟩, which amounts to importance sampling. This technique can drastically improve sampling efficiency, which is crucial in high-dimensional systems. Despite this improvement, direct application of PI control to real systems is limited, because it is not clear how to choose a proper importance sampling distribution. Furthermore, note that equation (4) yields the optimal control for all times t averaged over states. The result is therefore an open-loop controller that neglects the state-dependence of the control beyond the initial state.
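To make the estimator of equations (3)-(4) concrete, the following minimal Python sketch performs the naive Monte Carlo computation on a 1D toy problem with f = 0, G = 1 and R = 1; the quadratic end cost and all parameter values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def pi_control_estimate(x0, u, dt=0.02, T=1.0, lam=1.0, N=5000, seed=0):
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    sigma = np.sqrt(lam)               # Sigma_u = lam * R^{-1}, with R = 1
    x = np.full(N, float(x0))
    S = np.zeros(N)                    # cost-to-go, eq. (2), with r_t = 0
    xi0 = np.zeros(N)
    for s in range(steps):
        dxi = sigma * np.sqrt(dt) * rng.standard_normal(N)  # Wiener increments
        if s == 0:
            xi0 = dxi                  # first noise increment, used in eq. (4)
        x += u[s] * dt + dxi           # eq. (1) with f = 0, G = 1
        S += 0.5 * u[s] ** 2 * dt      # control-cost term of eq. (2)
    S += (x - 1.0) ** 2                # end cost r_T: reach x = 1
    w = np.exp(-(S - S.min()) / lam)   # eq. (3); the shift cancels on normalizing
    w /= w.sum()
    return u[0] + np.sum(w * xi0) / dt # eq. (4) at the initial time

u_plan = np.zeros(50)                  # zero sampling control (passive dynamics)
print(pi_control_estimate(x0=0.0, u=u_plan))
```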
3.2 Multi-UAV planning
The proposed architecture is composed of two main levels. At the most abstract level, each UAV is modeled as a 2D point-mass system that follows double-integrator dynamics. At the low level, we use a detailed second-order model that we learn from real flight data (De Nardi and Holland 2008). We use model predictive control combined with importance sampling. The proposed approach has two main benefits: first, since the state is continuously updated, the controller does not suffer from the problems of an open-loop controller; second, the control policy is not restricted by any parametrization.

The two-level approach permits control signals to be transmitted from the high-level PI controller to the low-level control system at relatively low frequencies (we use 15 Hz in this work). Consequently, the PI controller has more time available for sampling a large number of trajectories, which is critical to obtain good estimates of the control. The choice of 2D in the presented method is not a fundamental limitation, as long as double-integrator dynamics is used. The control hierarchy introduces additional model mismatch. However, as we show in the results, this mismatch is not critical for obtaining good performance in real conditions.

Ignoring height, the state vector x is thus composed of the East-North (EN) positions and EN velocities of each agent i = 1, ..., M as x_i = [p_i, v_i]⊤, where p_i, v_i ∈ R². Similarly, the control u consists of EN accelerations u_i ∈ R². Equation (1) decouples between the agents and takes the linear form

dx_i = A x_i dt + B(u_i dt + dξ_i),  A = [0 1; 0 0],  B = [0; 1].  (5)

Notice that although the dynamics is decoupled and linear, the state cost r_t(x_t) in equation (2) can be an arbitrary function of all UAV states. As a result, the optimal control will in general be a non-linear function that couples all the states, and thus hard to compute. Given the current joint optimal action u∗_t and velocity v_t, the expected velocity at the next time t′ is calculated as v_{t′} = v_t + (t′ − t)u∗_t and passed to the low-level controller. The final algorithm optionally keeps an importance-control sequence u_{t:dt:t+H} that is incrementally updated. We summarize the high-level controller in Algorithm 1.

Algorithm 1 PI control for UAV motion planning
1:  function PICONTROLLER(N, H, dt, r_t(·), Σ_u, u_{t:dt:t+H})
2:    for k = 1, ..., N do
3:      Sample paths τ_k = {x_{t:dt:t+H}}_k with Eq. (5)
4:    end for
5:    Compute S_k = S(τ_k|x_0, u) with Eq. (2)
6:    Store the noise realizations {ξ_{t:dt:t+H}}_k
7:    Compute the weights: w_k = e^{−S_k/λ} / Σ_l e^{−S_l/λ}
8:    for s = t : dt : t + H do
9:      u∗_s = u_s + (1/dt) Σ_k w_k ({ξ_{s+dt}}_k − {ξ_s}_k)
10:   end for
11:   Return the next desired velocity v_{t+dt} = v_t + u∗_t dt, and u∗_{t:dt:t+H} for importance sampling at t + dt
12: end function
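A compact Python transcription of Algorithm 1 for the decoupled dynamics of eq. (5) might look as follows; the joint state-cost callback r_t and all parameter values are illustrative assumptions, and the control-cost term of eq. (2) is omitted for brevity:

```python
import numpy as np

def pi_controller(p, v, u_imp, r_t, dt=1.0/15, lam=0.1, N=1000, seed=0):
    """One re-planning step of Algorithm 1 for M agents in 2D.
    p, v: (M, 2) positions/velocities; u_imp: (steps, M, 2) importance control."""
    rng = np.random.default_rng(seed)
    M, steps = p.shape[0], u_imp.shape[0]
    sigma_u = np.sqrt(lam)                      # Sigma_u = lam * R^{-1}, R = I
    P = np.broadcast_to(p, (N, M, 2)).copy()    # N sampled joint paths
    V = np.broadcast_to(v, (N, M, 2)).copy()
    noise = sigma_u * np.sqrt(dt) * rng.standard_normal((N, steps, M, 2))
    S = np.zeros(N)
    for s in range(steps):                      # roll out eq. (5), lines 2-4
        P += V * dt
        V += u_imp[s] * dt + noise[:, s]
        S += r_t(P, V) * dt                     # state-cost part of line 5
    w = np.exp(-(S - S.min()) / lam)            # line 7 (shift cancels)
    w /= w.sum()
    u_star = u_imp + np.einsum('k,ksmd->smd', w, noise) / dt   # line 9
    return v + u_star[0] * dt, u_star           # line 11: v_{t+dt}, warm start

# toy usage: two agents softly attracted to the origin, horizon of 30 steps
r_t = lambda P, V: np.linalg.norm(P, axis=-1).sum(axis=1)
p0, v0 = np.array([[1.0, 0.0], [0.0, 1.0]]), np.zeros((2, 2))
v_next, u_warm = pi_controller(p0, v0, np.zeros((30, 2, 2)), r_t)
```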
The importance-control sequence u_{t:dt:t+H} is initialized using prior knowledge, or with zeros otherwise. Noise is dimension-independent, i.e., Σ_u = σ_u² I_d. To measure sampling convergence, we define the Effective Sample Size (ESS) as ESS := 1/Σ_{k=1}^{N} w_k², which is a quantity between 1 and N. Values of the ESS close to one indicate an estimate dominated by just one sample, and hence a poor estimate of the optimal control, whereas an ESS close to N indicates near-perfect sampling, which occurs when the importance control equals the optimal control function.

3.3 Low-Level Control
The target velocity v = [v_E, v_N]⊤ is passed, along with a height p̂_U, to a Velocity-Height controller. This controller uses the current state estimate of the real quadrotor, y = [p_E, p_N, p_U, φ, θ, ψ, u, v, w, p, q, r]⊤, where (p_E, p_N, p_U) and (φ, θ, ψ) denote navigation-frame position and orientation, and (u, v, w) and (p, q, r) denote body-frame linear and angular velocities, respectively. It is composed of four independent PID controllers, for roll φ̂, pitch θ̂, throttle γ̂ and yaw rate r̂, that send commands to the flight control system (FCS) to achieve v. Figure 2 shows the details of the FCS.

Figure 2: The flight control system (FCS) is comprised of two control loops: one for stabilization and the other for pose control. A low-level controller interacts with the FCS over a serial interface to stream measurements and issue control.

The control loop runs at 1 kHz, fusing triaxial gyroscope, accelerometer and magnetometer measurements. The accelerometer and magnetometer measurements are used to determine a reference global orientation, which is in turn used to track the gyroscope bias. The differences between the desired and actual angular rates are converted to motor speeds using the model in Mahony, Kumar, and Corke (2012). An outer pose control loop calculates the desired angular rates based on the desired state. Orientation is obtained from the inner control loop, while position and velocity are obtained by fusing GPS navigation fixes with barometric pressure (BAR) based altitude measurements. The radio transmitter (marked TX in the diagram) allows the operator to switch quickly between autonomous and manual control of a platform. There is also an acoustic alarm on the platform itself, which warns the operator when the GPS signal is lost or the battery is getting low. If the battery reaches a critical level, or communication with the transmitter is lost, the platform can be configured to land immediately or, alternatively, to fly back and land at its take-off point.

Figure 3: Drunken Quadrotor: a red target has to be reached while avoiding obstacles. (Left) the shortest route is the optimal solution in the absence of noise. (Right) with control noise, the optimal solution is to fly around the building.

3.4 Simulator Platform
We have developed an open-source framework called CRATES¹. The framework is an implementation of QRSim (De Nardi 2013; Symington et al. 2014) in Gazebo, which uses the Robot Operating System (ROS) for high-level control. It permits high-level controllers to be platform-agnostic, and it is similar to the Hector Quadrotor project (Meyer et al. 2012), with a formalized notion of a hardware abstraction layer.

¹ CRATES stands for 'Cognitive Robotics Architecture for Tightly-Coupled Experiments and Simulation'. Available at https://bitbucket.org/vicengomez/crates

The CRATES simulator propagates the quadrotor state forward in time based on a second-order model (De Nardi and Holland 2008). The equations were learned from real flight data and verified by expert domain knowledge. In addition to platform dynamics, CRATES also simulates various noise-perturbed sensors, wind shear and turbulence. Orientation and barometric altitude errors follow zero-mean Ornstein-Uhlenbeck processes, while GPS error is modeled at the pseudo-range level using trace data available from the International GPS Service. In accordance with the Military Specification MIL-F-8785C, wind shear is modeled as a function of altitude, while turbulence is modeled as a discrete implementation of the Dryden model. CRATES also provides support for generating terrain from satellite images and light detection and ranging (LIDAR) technology, and for reporting collisions between platforms and terrain.
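As a sketch of how such a zero-mean Ornstein-Uhlenbeck sensor error can be generated, the following Euler-Maruyama discretization can be used; the rate and scale parameters are illustrative assumptions, not CRATES values:

```python
import numpy as np

def ou_noise(steps, dt=0.001, theta=0.5, sigma=0.05, seed=0):
    """Euler-Maruyama discretization of d(eps) = -theta*eps*dt + sigma*dW."""
    rng = np.random.default_rng(seed)
    eps = np.zeros(steps)
    for k in range(steps - 1):
        eps[k + 1] = (eps[k] - theta * eps[k] * dt
                      + sigma * np.sqrt(dt) * rng.standard_normal())
    return eps

altitude_error = ou_noise(10_000)   # e.g., additive barometric-altitude error
```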
4 Results
We now analyze the performance of the proposed approach in three different tasks. We first show that, in the presence of control noise, PI control is preferable to other approaches; for clarity, this scenario is presented for one agent only. We then consider two tasks involving several units: a flight formation task and a pursuit-evasion task.

We compare the PI control method described in Section 3.2 with iterative linear-quadratic Gaussian (iLQG) control (Todorov and Li 2005). iLQG is a state-of-the-art method based on differential dynamic programming that iteratively computes local linear-quadratic approximations to the finite-horizon problem. A key difference between iLQG and PI control is that the linear-quadratic approximation is certainty-equivalent. Consequently, iLQG yields a noise-independent solution.

4.1 Scenario I: Drunken Quadrotor
This scenario is inspired by Kappen (2005) and highlights the benefits of SOC in a quadrotor task. The Drunken Quadrotor is a finite-horizon task in which a quadrotor has to reach a target while avoiding a building and a wall (Figure 3). There are two possible routes: a shorter one that passes through a small gap between the wall and the building, and a longer one that goes around the building. Unlike SOC, the deterministic optimal solution does not depend on the noise level and will always take the shorter route. With added noise, however, the risk of collision increases, and the optimal stochastic control is to take the longer route.

This task could alternatively be addressed using other planning methods, such as the one proposed by Ono, Williams, and Blackmore (2013), which allows the user's acceptable level of risk to be specified using chance constraints. Here we focus on comparing deterministic and stochastic optimal control for motion planning. The amount of noise thus determines whether the optimal solution is to go through the risky gap or along the longer, safer path.

The state cost in this problem consists of hard constraints that assign infinite cost when either the wall or the building is hit. PI control deals with collisions by killing particles that hit the obstacles during Monte Carlo sampling. For iLQG, the local approximations require a twice-differentiable cost function; we resolved this issue by adding a smooth obstacle-proximity penalty to the cost function. Although iLQG computes linear feedback, we tried to improve it with an MPC scheme, as for PI control. Unfortunately, this leads to numerical instabilities in this task, since the system disturbances tend to move the reference trajectory through a building when moving from one time step to the next.
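In terms of the weights of Algorithm 1, this particle-killing amounts to assigning infinite cost, and hence zero weight, to any sampled path that collides. A minimal sketch, assuming collision flags produced during the rollout and at least one collision-free sample:

```python
import numpy as np

def path_weights(S, collided, lam=0.1):
    """S: (N,) path costs; collided: (N,) booleans from the rollout."""
    S = np.where(collided, np.inf, S)   # hard constraint: infinite cost
    S_min = S[np.isfinite(S)].min()     # requires >= 1 surviving sample
    w = np.exp(-(S - S_min) / lam)      # killed paths get weight exp(-inf) = 0
    return w / w.sum()
```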
For MPC with PI control, we use a receding horizon of three seconds and perform re-planning at a frequency of 15 Hz with N = 2000 sample paths. Both methods are initialized with u_t = 0, ∀t. iLQG requires approximately 10³ iterations to converge with a learning rate of 0.5%.

Figure 3 (left) shows an example of a real trajectory computed for a low control noise level, σ_u² = 10⁻³. To obtain such a trajectory, we deactivate the sensor uncertainties in the accelerometer, gyroscope, orientation and altimeter; external noise is thus limited to aerodynamic turbulence only. In this case, both the iLQG and PI solutions correspond to the shortest path, i.e., through the gap between the wall and the building. Figure 3 (right) illustrates the solutions obtained for a larger noise level, σ_u² = 1. While the optimal reference trajectory obtained by iLQG does not change, which results in a collision once the real noisy controller is executed (left path), the PI control solution avoids the building and takes the longer route (right path). Note that iLQG can find both solutions depending on initialization; however, it will always choose the shortest route, regardless of nearby obstacles. Also note that the PI-controlled unit takes a longer route to reach the target. The reason is that the control cost R is set quite high in order to reach a good ESS. Alternatively, if R is decreased, the optimal solution could reach the target sooner, but at the cost of a decreased ESS. This trade-off, which is inherent in PI control, can be resolved by incorporating feedback control in the importance sampling, as presented in Thijssen and Kappen (2015).

Figure 4: Drunken Quadrotor with wind: for different wind velocities and fixed control noise σ_u² = 0.5, (Left) cost of the obtained solutions and (Right) percentage of crashes using iLQG and PI.

We also consider more realistic conditions, with noise not limited to the control. Figure 4 shows results in the presence of wind and sensor uncertainty. The left panel shows how the wind affects the quality of the solution, with stronger wind increasing both the variance and the cost. In all our tests, iLQG is not able to bring the quadrotor to the other side. The right panel shows the percentage of crashes for both methods: crashes occur often using iLQG control and only occasionally using PI control. With stronger wind, the iLQG-controlled unit occasionally does not even reach the corridor (the unit did not reach the other side, but did not crash either), which explains the difference in percentages. We conclude that for multi-modal tasks (tasks where multiple solution trajectories exist), the proposed method is preferable to iLQG.

4.2 Scenario II: Holding Pattern
The second scenario addresses the problem of coordinating agents to hold their position near a point of interest while keeping a safe range of velocities and avoiding crashing into each other. Such a problem arises, for instance, when multiple aircraft need to land at the same location and simultaneous landing is not possible. The resulting flight formation has been used frequently in the literature (Vásárhelyi et al. 2014; How et al. 2008; Yu et al. 2013; Franchi et al. 2012), but always with prior specification of the trajectories.
We show how this formation is obtained as the optimal solution of a SOC problem. Consider the following state cost (omitting time indexes):

r_HP(x) = Σ_{i=1}^{M} [ exp(v_i − v_max) + exp(v_min − v_i) + exp(‖p_i‖₂ − d) + Σ_{j>i}^{M} C_hit/‖p_i − p_j‖₂ ],  (6)

where v_max and v_min denote the maximum and minimum velocities, respectively, d penalizes deviation from the origin, and C_hit is the penalty for the collision risk of two agents. ‖·‖₂ denotes the ℓ-2 norm.

Figure 5: Holding pattern in the CRATES simulator. Ten units coordinate their flight in a circular formation. In this example, N = 10⁴ samples, control noise is σ_u² = 0.1 and the horizon is H = 1 s. Cost parameters are v_min = 1, v_max = 3, C_hit = 20 and d = 7. Environmental noise and sensing uncertainties are modeled using realistic parameter values.

The optimal solution for this problem is a circular flying pattern in which units fly equidistantly from each other. The parameter d determines the radius, and the average velocities of the agents are determined by v_min and v_max. Since the solution is symmetric with respect to the direction of rotation (clockwise or anti-clockwise), a choice is made and the symmetry is broken only when the control is executed. Figure 5 shows a snapshot of a simulation after the flight formation has been reached for a particular choice of parameter values². Since we use an uninformed initial control trajectory, there is a transient period during which the agents organize to reach the optimal configuration. The coordinated circular pattern is obtained regardless of the initial positions. This behavior is robust and is obtained for a large range of parameter values.

² Supplementary video material is available at http://www.mbfys.ru.nl/staff/v.gomez/uav.html

Figure 6: Holding pattern: (a) evolution of the state cost for different numbers of samples N = 10, 10², 10³; (b) scaling of the method with the number of agents; for different control noise levels, (c) comparison between iLQG and PI control (ratios > 1 indicate better performance of PI over iLQG) and (d) Effective Sample Sizes. Error bars correspond to ten different random realizations.

Figure 6(a) shows immediate costs at different times. The cost always decreases from the starting configuration until the formation is reached. Its value depends on several parameters; we report its dependence on the number N of sample paths. For large N, the variances are small and the cost attains small values at convergence. Conversely, for small N, there is larger variance and the obtained dynamical configuration is less optimal (typically, the distances between the agents are not equal). During the formation of the pattern, the controls are more expensive. For this particular task, full convergence of the path integrals is not required, and the formation can be achieved with a very small N.

Figure 6(b) illustrates how the method scales as the number of agents increases. We report averages of the mean costs over 20 time-steps after one minute of flight. We varied M while fixing the rest of the parameters, except the distance d, which was set equal to the number of agents in meters. The small variance of the cost indicates that a stable formation is reached in all cases. As expected, larger values of N lead to configurations with smaller state cost. For more than ten UAVs, the simulator starts to have problems in this task, and occasional crashes may occur before the formation is reached due to limited sample sizes. This limitation could be addressed, for example, with more processing power and parallelization, and is left for future work.
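A minimal sketch of the holding-pattern cost of eq. (6), evaluated for a single joint state as it would be inside a rollout; parameter defaults follow the Figure 5 caption, and reading the position term as a soft wall at radius d is an assumption of this sketch:

```python
import numpy as np

def r_hp(P, V, v_min=1.0, v_max=3.0, d=7.0, c_hit=20.0):
    """Holding-pattern cost of eq. (6) for one joint state.
    P, V: (M, 2) agent positions and velocities; returns a scalar."""
    speed = np.linalg.norm(V, axis=-1)                       # agent speeds
    cost = np.sum(np.exp(speed - v_max) + np.exp(v_min - speed))
    cost += np.sum(np.exp(np.linalg.norm(P, axis=-1) - d))   # soft wall at radius d
    M = len(P)
    for i in range(M):                                       # pairwise terms, j > i
        for j in range(i + 1, M):
            cost += c_hit / np.linalg.norm(P[i] - P[j])
    return cost
```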
We also compared our approach with iLQG in this scenario. Figure 6(c) shows the ratio of the costs of both solutions after convergence. Both use MPC, with a horizon of 2 s and an update frequency of 15 Hz. Values above 1 indicate that PI control consistently outperforms iLQG in this problem. Before convergence, we also found, as in the previous task, that iLQG resulted in occasional crashes while PI control did not. The Effective Sample Size is shown in Figure 6(d): higher control noise levels result in better exploration and thus better controls. We conclude that the proposed methodology is feasible for coordinating a large team of quadrotors.

For this task, we also performed experiments with the real platforms. Figure 7 shows real trajectories obtained in outdoor conditions (see also the video that accompanies this paper for an experiment with three platforms). Despite the presence of significant noise, the circular behavior was again obtained. In the real experiments, we used a Core i7 laptop with 8 GB RAM as base station, which ran its own ROS messaging core and forwarded messages to and from the platforms over an IEEE 802.11 2.4 GHz network. For safety reasons, the quadrotors were flown at different altitudes.

Figure 7: Resulting trajectories (meters East vs. meters North) of a Holding Pattern experiment using two platforms (UAV0, UAV1) in outdoor conditions.

4.3 Scenario III: Cat and Mouse
The final scenario we consider is the cat and mouse scenario. In this task, a team of M quadrotors (the cats) has to catch (get close to) another quadrotor (the mouse). The mouse has autonomous dynamics: it tries to escape the cats by moving at a velocity inversely proportional to its distance to the cats. More precisely, let p_mouse denote the 2D position of the mouse; the velocity command for the mouse is computed (omitting time indexes) as

v_mouse = v_mouse^max · v/‖v‖₂,  where  v = Σ_{i=1}^{M} (p_i − p_mouse)/‖p_i − p_mouse‖₂.

The parameter v_mouse^max determines the maximum velocity of the mouse. As the state cost function, we use equation (6) with an additional penalty term that depends on the sum of the distances to the mouse:

r_CM(x) = r_HP(x) + Σ_{i=1}^{M} ‖p_i − p_mouse‖₂.

Figure 8: Cat and mouse scenario: (Top-left) four cats and one mouse. (Top-right) for horizon time H = 2 s, the four cats surround the mouse forever and keep rotating around it. (Bottom-left) for horizon time H = 1 s, the four cats chase the mouse, but (bottom-right) the mouse manages to escape. With these settings, the multi-agent system alternates between these two dynamical states. The number of sample paths is N = 10⁴ and the noise level is σ_u² = 0.5. Other parameter values are d = 30, v_min = 1, v_max = 4 and v_mouse^max = 3.

This scenario leads to several interesting dynamical states. For example, for a sufficiently large value of M, the mouse always gets caught (if its initial position is not close to the boundary determined by d). The optimal control for the cats consists in surrounding the mouse to prevent collision. Once the mouse is surrounded, the cats keep rotating around it, as in the previous scenario, but with the origin replaced by the mouse position.
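A minimal sketch of the mouse rule and the pursuit cost above, reusing the r_hp sketch from Scenario II; all names and defaults are illustrative, and the direction of v follows the equation as printed:

```python
import numpy as np

def mouse_velocity(P, p_mouse, v_max_mouse=3.0):
    """Mouse rule above: sum of unit vectors between mouse and cats,
    rescaled to the mouse's maximum speed. P: (M, 2) cat positions."""
    diffs = P - p_mouse                                   # p_i - p_mouse
    v = np.sum(diffs / np.linalg.norm(diffs, axis=-1, keepdims=True), axis=0)
    return v_max_mouse * v / np.linalg.norm(v)

def r_cm(P, V, p_mouse, **hp_kwargs):
    """Cat-and-mouse cost: r_HP plus the summed distances to the mouse."""
    return r_hp(P, V, **hp_kwargs) + np.linalg.norm(P - p_mouse, axis=-1).sum()
```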
The additional video shows examples of other complex behaviors obtained for different parameter settings. Figure 8 (top-right) illustrates the surrounding behavior. The types of solution we observe differ for other parameter values. For example, for M = 2 or a small time horizon, e.g. H = 1, the dynamical state in which the cats rotate around the mouse is not stable, and the mouse escapes. This is displayed in Figure 8 (bottom panels) and better illustrated in the video provided as supplementary material. We emphasize that these different behaviors are observed under large uncertainty in the form of sensor noise and wind.

5 Conclusions
This paper presents a centralized, real-time stochastic optimal control algorithm for coordinating the actions of multiple autonomous vehicles so as to minimize a global cost function. The high-level control task is expressed as a Path-Integral control problem that can be solved using efficient sampling methods, and real-time control is possible via re-planning and model predictive control. To the best of our knowledge, this is the first real-time implementation of Path-Integral control on an actual multi-agent system.

We have shown in a simple scenario (Drunken Quadrotor) that the proposed methodology is preferable to other approaches, such as deterministic control or iLQG, for planning trajectories. In more complex scenarios, such as the Holding Pattern and the Cat and Mouse, the proposed methodology is also preferable and allows for real-time control. We observe multiple, complex group behaviors emerging from the specified cost function. Our experimental framework, CRATES, has been a key development that permitted a smooth transition from theory to the real quadrotor platforms, with literally no modification of the underlying control code. This gives evidence that the model mismatch caused by the use of a control hierarchy is not critical in normal outdoor conditions.

Our current research addresses the following aspects:

Large-scale parallel sampling: the presented method can be easily parallelized, for instance using graphics processing units, as in Williams, Rombokas, and Daniel (2014). Although the tasks considered in this work did not require more than 10⁴ samples, we expect that this improvement will significantly increase the number of application domains and the system size.

Distributed control: we are exploring different distributed formulations that take better advantage of the factorized representation of the state cost. Note that the cost functions considered in this work only require pairwise couplings of the agents (to prevent collisions). However, full observability of the joint space is still required, which is not available in a fully distributed approach.

References
[Albore et al.] Albore, A.; Peyrard, N.; Sabbadin, R.; and Teichteil-Königsbuch, F. 2015. An online replanning approach for crop fields mapping with autonomous UAVs. In International Conference on Automated Planning and Scheduling (ICAPS).
[Augugliaro, Schoellig, and D'Andrea] Augugliaro, F.; Schoellig, A.; and D'Andrea, R. 2012. Generation of collision-free trajectories for a quadrocopter fleet: A sequential convex programming approach. In Intelligent Robots and Systems (IROS), 1917–1922.
[Bernardini, Fox, and Long] Bernardini, S.; Fox, M.; and Long, D. 2014. Planning the behaviour of low-cost quadcopters for surveillance missions. In International Conference on Automated Planning and Scheduling (ICAPS).
[Bürkle, Segor, and Kollmann] Bürkle, A.; Segor, F.; and Kollmann, M. 2011. Towards autonomous micro UAV swarms. J. Intell. Robot. Syst. 61(1-4):339–353.
[Chanel, Teichteil-Königsbuch, and Lesire] Chanel, C. C.; Teichteil-Königsbuch, F.; and Lesire, C. 2013. Multi-target detection and recognition by UAVs using online POMDPs. In International Conference on Automated Planning and Scheduling (ICAPS).
[De Nardi and Holland] De Nardi, R., and Holland, O. 2008. Coevolutionary modelling of a miniature rotorcraft. In 10th International Conference on Intelligent Autonomous Systems (IAS10), 364–373.
[De Nardi] De Nardi, R. 2013. The QRSim Quadrotors Simulator. Technical Report RN/13/08, University College London.
[Franchi et al.] Franchi, A.; Masone, C.; Grabe, V.; Ryll, M.; Bülthoff, H. H.; and Giordano, P. R. 2012. Modeling and control of UAV bearing-formations with bilateral high-level steering. Int. J. Robot. Res. 0278364912462493.
[Gómez et al.] Gómez, V.; Kappen, H. J.; Peters, J.; and Neumann, G. 2014. Policy search for path integral control. In European Conf. on Machine Learning & Knowledge Discovery in Databases, 482–497.
[Guerrero and Lozano] Guerrero, J., and Lozano, R. 2012. Flight Formation Control. John Wiley & Sons.
[Hauert et al.] Hauert, S.; Leven, S.; Varga, M.; Ruini, F.; Cangelosi, A.; Zufferey, J.-C.; and Floreano, D. 2011. Reynolds flocking in reality with fixed-wing robots: Communication range vs. maximum turning rate. In Intelligent Robots and Systems (IROS), 5015–5020.
[Hoffmann et al.] Hoffmann, G. M.; Huang, H.; Waslander, S. L.; and Tomlin, C. J. 2011. Precision flight control for a multi-vehicle quadrotor helicopter testbed. Control. Eng. Pract. 19(9):1023–1036.
[Horowitz, Damle, and Burdick] Horowitz, M. B.; Damle, A.; and Burdick, J. W. 2014. Linear Hamilton Jacobi Bellman equations in high dimensions. arXiv preprint arXiv:1404.1089.
[How et al.] How, J.; Bethke, B.; Frank, A.; Dale, D.; and Vian, J. 2008. Real-time indoor autonomous vehicle test environment. IEEE Contr. Syst. Mag. 28(2):51–64.
[Kappen, Gómez, and Opper] Kappen, H. J.; Gómez, V.; and Opper, M. 2012. Optimal control as a graphical model inference problem. Mach. Learn. 87:159–182.
[Kappen] Kappen, H. J. 2005. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment 2005(11):P11011.
[Kendoul] Kendoul, F. 2012. Survey of advances in guidance, navigation, and control of unmanned rotorcraft systems. J. Field Robot. 29(2):315–378.
[Kinjo, Uchibe, and Doya] Kinjo, K.; Uchibe, E.; and Doya, K. 2013. Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task. Front. Neurorobot. 7:1–13.
[Kushleyev et al.] Kushleyev, A.; Mellinger, D.; Powers, C.; and Kumar, V. 2013. Towards a swarm of agile micro quadrotors. Auton. Robot. 35(4):287–300.
[Mahony, Kumar, and Corke] Mahony, R.; Kumar, V.; and Corke, P. 2012. Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor. IEEE Robotics & Automation Magazine 20–32.
[Meyer et al.] Meyer, J.; Sendobry, A.; Kohlbrecher, S.; and Klingauf, U. 2012. Comprehensive simulation of quadrotor UAVs using ROS and Gazebo. Lecture Notes in Computer Science 7628:400–411.
[Ono, Williams, and Blackmore] Ono, M.; Williams, B. C.; and Blackmore, L. 2013. Probabilistic planning for continuous dynamic systems under bounded risk. J. Artif. Int. Res. 46(1):511–577.
[Quintero, Collins, and Hespanha] Quintero, S.; Collins, G.; and Hespanha, J. 2013. Flocking with fixed-wing UAVs for distributed sensing: A stochastic optimal control approach. In American Control Conference (ACC), 2025–2031.
[Rawlik, Toussaint, and Vijayakumar] Rawlik, K.; Toussaint, M.; and Vijayakumar, S. 2013. Path integral control by reproducing kernel Hilbert space embedding. In Twenty-Third International Joint Conference on Artificial Intelligence, 1628–1634. AAAI Press.
[Reynolds] Reynolds, C. W. 1987. Flocks, herds and schools: A distributed behavioral model. SIGGRAPH Comput. Graph. 21(4):25–34.
[Rombokas et al.] Rombokas, E.; Theodorou, E.; Malhotra, M.; Todorov, E.; and Matsuoka, Y. 2012. Tendon-driven control of biomechanical and robotic systems: A path integral reinforcement learning approach. In International Conference on Robotics and Automation, 208–214.
[Shim, Kim, and Sastry] Shim, D. H.; Kim, H. J.; and Sastry, S. 2003. Decentralized nonlinear model predictive control of multiple flying robots. In IEEE Conference on Decision and Control (CDC), volume 4, 3621–3626.
[Stulp and Sigaud] Stulp, F., and Sigaud, O. 2012. Path integral policy improvement with covariance matrix adaptation. In International Conference on Machine Learning (ICML).
[Symington et al.] Symington, A.; De Nardi, R.; Julier, S.; and Hailes, S. 2014. Simulating quadrotor UAVs in outdoor scenarios. In Intelligent Robots and Systems (IROS), 3382–3388.
[Theodorou, Buchli, and Schaal] Theodorou, E.; Buchli, J.; and Schaal, S. 2010. A generalized path integral control approach to reinforcement learning. J. Mach. Learn. Res. 11:3137–3181.
[Thijssen and Kappen] Thijssen, S., and Kappen, H. J. 2015. Path integral control and state-dependent feedback. Phys. Rev. E 91:032104.
[Todorov and Li] Todorov, E., and Li, W. 2005. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference (ACC), 300–306 vol. 1. IEEE.
[Todorov] Todorov, E. 2006. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), 1369–1376.
[Turpin, Michael, and Kumar] Turpin, M.; Michael, N.; and Kumar, V. 2012. Decentralized formation control with variable shapes for aerial robots. In International Conference on Robotics and Automation (ICRA), 23–30.
[Van Den Broek, Wiegerinck, and Kappen] Van Den Broek, B.; Wiegerinck, W.; and Kappen, H. J. 2008. Graphical model inference in optimal control of stochastic multi-agent systems. J. Artif. Intell. Res. 32:95–122.
[Vásárhelyi et al.] Vásárhelyi, G.; Virágh, C.; Somorjai, G.; Tarcai, N.; Szorenyi, T.; Nepusz, T.; and Vicsek, T. 2014. Outdoor flocking and formation flight with autonomous aerial robots. In Intelligent Robots and Systems (IROS), 3866–3873.
[Williams, Rombokas, and Daniel] Williams, G.; Rombokas, E.; and Daniel, T. 2014. GPU based path integral control with learned dynamics. In Autonomously Learning Robots - NIPS Workshop.
[Yu et al.] Yu, B.; Dong, X.; Shi, Z.; and Zhong, Y. 2013. Formation control for quadrotor swarm systems: Algorithms and experiments. In Chinese Control Conference (CCC), 7099–7104.