Time-Varying Formation Controllers for Unmanned Aerial Vehicles Using Deep Reinforcement Learning

Ronny Conde*, José Ramón Llata*, Carlos Torre-Ferrero*

* Electronics Technology, Systems and Automation Engineering Department, University of Cantabria, 39005, Santander, Spain (Corresponding author: ronny.conde@gmail.com)

Abstract: We consider the problem of designing scalable and portable controllers for unmanned aerial vehicles (UAVs) to reach time-varying formations as quickly as possible. This brief confirms that deep reinforcement learning can be used in a multi-agent fashion to drive UAVs to reach any formation while taking optimality and portability into account. We use a deep neural network to estimate how good a state is, so the agent can choose its actions accordingly. The system is tested with different non-high-dimensional sensory inputs without any change in the neural network architecture, algorithm or hyperparameters, just with additional training.

Keywords: Reinforcement learning control, multi-vehicle systems, flying robots, autonomous vehicles, decentralized control and systems, deep neural networks, deep reinforcement learning, time-varying formation control, unmanned aerial vehicles (UAVs).

1. INTRODUCTION

Recently, formation control of unmanned aerial vehicles (UAVs) has attracted many researchers due to its broad range of potential applications, including search and rescue missions (Waharte et al., 2009) and wide-area surveillance (Nigam et al., 2012). Considering these examples, it seems clear that widespread formation control systems will have a positive impact on society, from improving public security to saving lives after natural disasters.

For a UAV formation controller to become widespread, the following requirements, among many others, should be met:

• Time-varying formations: It should be able to work with time-varying formations, the most general case, to empower users to come up with new applications. Time-varying formations are those in which relative separations and bearings change over time.

• Portability: It should generalize to any formation and work with different sensory inputs without requiring complex design processes. This requirement will help potential users deploy the system quickly and easily.

• Scalability: It should function properly with an arbitrarily large number of UAVs, which may be needed in some applications.

• Optimality: It should drive the UAVs to reach the formation as quickly as possible. Just guaranteeing stability may not be enough for some applications. Moreover, we want the UAVs to do their best when the desired formation is not feasible because of UAV dynamic limitations, to help users design applications quickly, without checking formation feasibility.

Although many formation control strategies have been proposed, we have not found any approach that covers all these requirements. A decentralized approach is needed for scalability. According to (Zhang and Mehrjerdi, 2013), it is less affected by computational and communication bottlenecks. However, designing decentralized systems is difficult. We need global behaviors to emerge from individual ones without a central unit monitoring the accomplishment of the mission and sharing that information with team members. It is so hard that most decentralized approaches study fixed formations. Only a few results have been obtained for time-varying formations, the most general case. These good results were achieved using consensus theory.
Consensus theory can be used to analyze stability, obtaining necessary and sufficient conditions, but it is not evident how to guarantee optimality in the way we have defined it. Another issue with this approach is that complex design processes may be necessary when changing the sensory inputs or the required formation.

Theoretically, a multi-agent system trained with classical reinforcement learning (RL) would meet most of the desired properties. It would be scalable because of the decentralized approach. The optimality requirement would be satisfied because RL is about agents learning optimal policies. However, it would be only partially portable. Even though the system could easily generalize to different time-varying formations, it would require complex design processes to come up with appropriate hand-crafted features when the sensory input changes.

In 2013, Mnih et al. presented a deep learning model able to learn control policies directly from sensory input, without relying on hand-crafted features, using reinforcement learning (Mnih et al., 2013; Mnih et al., 2015). Thanks to this work, it now seems clear that all desired requirements could be met by using this approach: deep reinforcement learning.

1.1 Specific Limitations

For the sake of simplicity and to reduce computational needs, we do not consider obstacle avoidance, different communication topologies, delays, noisy sensors, complex UAV models or high-dimensional sensory inputs. All these elements will be part of future research.

1.2 Main Contributions

This paper supports that deep reinforcement learning is a feasible approach to develop scalable, optimal and portable time-varying formation controllers for UAVs.

• This brief confirms that deep reinforcement learning can be used to train individual UAVs to behave optimally in any time-varying formation, covering scalability and optimality.

• Furthermore, it supports that a deep reinforcement learning agent can be trained to drive UAVs to reach different time-varying formations in a plug-and-play fashion. This point partially covers portability.

• This work substantiates that generalizing to different sensory inputs can be done without complex design processes, just with additional training, at least for non-high-dimensional sensory inputs. With this, our definition of portability is covered.

• This paper demonstrates that the approach also works with formations that are unfeasible because of UAV dynamic limitations.

2. PROBLEM DEFINITION

We developed a simulation environment to test the solution, including different formation sets and sensory inputs.

2.1 Simulation Environment

Training/Test Mode: During training, each simulated UAV receives a reward signal depending on how close it is to its desired position in the formation, according to the following equation:

$r = -\sqrt{(x - x_g)^2 + (y - y_g)^2}$   (1)

where $r$ is the reward signal, $(x, y)$ the position of the UAV and $(x_g, y_g)$ its desired position in the formation.
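To make (1) concrete, the following minimal Python sketch evaluates it for a single UAV, assuming the negative-Euclidean-distance form reconstructed above; the function name and argument layout are illustrative choices, not code from the paper.

```python
import math

def formation_reward(position, goal):
    """Per-step reward of Eq. (1): the negative Euclidean distance
    between the UAV position (x, y) and its desired slot (x_g, y_g)."""
    x, y = position
    x_g, y_g = goal
    return -math.sqrt((x - x_g) ** 2 + (y - y_g) ** 2)

# Example: a UAV 3 m east and 4 m north of its slot receives a reward of -5.
print(formation_reward((3.0, 4.0), (0.0, 0.0)))  # -5.0
```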
UAV Model: The simulator models rotary-wing UAVs, such as quadrotors or vertical take-off and landing (VTOL) aircraft. Since the trajectory dynamics has much larger time constants than the attitude dynamics, the formation control can be implemented with an inner/outer loop structure. The inner loop controls the attitude, and the outer loop is used to drive the UAVs to the desired positions (Dong et al., 2015; Karimoddini et al., 2013). Because our agent controls the outer loop, each UAV can be modeled as a point-mass system described by the following double integrator (Dong et al., 2015):

$\dot{x}(t) = v_x(t)$,  $\dot{y}(t) = v_y(t)$,  $\dot{v}_x(t) = u_x(t)$,  $\dot{v}_y(t) = u_y(t)$   (2)

where $(x, y)$ denotes the position of the UAV, $(v_x, v_y)$ its velocity and $(u_x, u_y)$ the control inputs. A minimal simulation sketch is given at the end of Section 2.

Two-Dimensional Environment: For the sake of simplicity, all quadrotors move in the XY plane.

Labelled Homogeneous Robots: All UAVs are homogeneous but not interchangeable. Each UAV knows its desired position in the formation at every time step.

Discrete Control Commands: The controller chooses actions from a discrete set $\mathcal{A} = \{\rightarrow, \leftarrow, \uparrow, \downarrow, \square\}$ at each time step, where “→” represents positive acceleration on the x axis, “←” negative acceleration on the x axis, “↑” positive acceleration on the y axis, “↓” negative acceleration on the y axis and “□” no acceleration.

Formation Training Set, $\mathcal{F}_{training}$: The simulation environment provides different time-varying formations for the training of our controller.

Formation Test Set, $\mathcal{F}_{test}$: The formations included in this set are different from the ones included in $\mathcal{F}_{training}$ because we want to test whether the approach generalizes to any time-varying formation. We have also included non-feasible formations to prove that the system can handle them.

Sensory Inputs: Two sensory inputs are provided to check whether the solution satisfies the portability requirement:

• Precise localization system. Each UAV knows its position and a velocity estimate at each time step.

• Distance to four landmarks. Each UAV knows its distance to four landmarks and a velocity estimate at each time step.

For the sake of simplicity, and to reduce computational needs, we work with non-high-dimensional sensory inputs. High-dimensional ones would require a model with higher capacity.

Episodic Environment: Every four hundred time steps the simulation is reset, placing each simulated UAV in a new start position and choosing a new formation from the appropriate set.

2.2 Checking Requirements

The following checklist has been used to verify the requirements:

• Optimality. The controller should try to maximize the cumulative reward on each episode, reaching the formation as quickly as possible and keeping it afterwards.

• Portability. The controller should be trained with formations from $\mathcal{F}_{training}$ and work properly with formations from $\mathcal{F}_{test}$ if the sensory input remains the same. Furthermore, the system should work in the same way with another sensory input just with additional training.

• Scalability. A different controller instance should be loaded on each UAV. No central unit is allowed.

• Time-varying and non-feasible formations were included in $\mathcal{F}_{test}$.
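The following Python sketch illustrates the simulated UAV of Section 2.1: the double integrator of (2) integrated under the discrete action set $\mathcal{A}$. The acceleration magnitude, the integration step and the class and constant names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Discrete action set A = {right, left, up, down, no-op}, mapped to signed
# accelerations (u_x, u_y). ACCEL and DT are assumed values for illustration.
ACTIONS = {0: (1.0, 0.0), 1: (-1.0, 0.0), 2: (0.0, 1.0), 3: (0.0, -1.0), 4: (0.0, 0.0)}
ACCEL = 1.0           # acceleration magnitude (assumed)
DT = 0.1              # integration time step (assumed)
EPISODE_LENGTH = 400  # the simulation resets every 400 time steps

class PointMassUAV:
    """Outer-loop UAV model of Eq. (2): a planar double integrator."""

    def __init__(self, x, y):
        self.pos = np.array([x, y], dtype=float)
        self.vel = np.zeros(2)

    def step(self, action):
        """Apply one discrete command and integrate one time step (simple Euler update)."""
        u = ACCEL * np.asarray(ACTIONS[action])
        self.vel += u * DT          # dv/dt = u
        self.pos += self.vel * DT   # dx/dt = v
        return self.pos.copy(), self.vel.copy()
```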
3. DEEP REINFORCEMENT LEARNING AS A FEASIBLE APPROACH

RL is the area of machine learning concerned with how software agents ought to take actions based on experience. During training, the agent receives a reward signal at each time step, used to define its goal, and learns how to maximize the total reward it receives.

3.1 Training Cycle

During training, the following cycle is repeated until the episode ends:

• The agent combines the UAV sensory input with the information about its role in the formation into a state $s$. The state is defined as the information taken into account by the agent to choose the next action.

• The agent picks an action $a$, from the available action set $\mathcal{A}$, given the current state $s$.

• After executing the selected action, the agent ends up in a new state, $s'$, and receives the appropriate reward, $r$. $s'$ is obtained by combining the new sensory data with the formation specification.

Fig. 1. Training cycle, repeated until the episode ends.

3.2 Discount Factor and Stochasticity

RL takes stochasticity into account. Ending up in a new state $s'$ is partly random and given by the transition function $T(s, a, s') = P(s' \mid s, a)$, the probability of ending up in $s'$ given the prior state and the chosen action. As a result, we may not receive the same rewards by choosing the same actions. For this reason, we use discounted rewards to calculate the expected total reward until the end of the episode:

$R_t = \sum_{i=t}^{n} \gamma^{(i-t)} r_i$   (3)

where $t$ is the current time step, $n$ is the length of the episode, and $\gamma$ is a real number between 0 and 1, called the discount factor, that reflects how much the agent takes future rewards into account.

3.3 Q-values and Optimal Behavior

A Q-state represents the commitment of executing an action from a state. E.g., $\{(s, \leftarrow), (s, \uparrow), (s, \rightarrow), (s, \downarrow), (s, \square)\}$ is the list of Q-states associated with a hypothetical state $s$. A value is assigned to each state-action pair, the Q-value or $Q(s, a)$, to estimate how good choosing an action from a state is. $Q(s, a)$ maps each Q-state to the expected total discounted reward if the agent executes action $a$ from state $s$ and behaves optimally afterwards. E.g., $Q(s, \leftarrow)$ returns the expected total discounted reward, $\mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots]$, after choosing action “←” from state $s$ and behaving optimally afterwards, selecting the actions with the highest Q-value. Once all Q-values are calculated, the optimal behavior can be easily extracted: the agent just needs to choose the action with the highest associated Q-value at each time step.

3.4 Deep Reinforcement Learning

The problem implies a continuous state space, but how can an infinite number of Q-states be stored? A lookup table cannot be used. Besides, the agent needs to be able to estimate Q-values of state-action pairs it has not visited before. Classical approaches use linear combinations of hand-crafted features to approximate $Q(s, a)$. However, coming up with good hand-crafted features is a tough problem because $Q(s, a)$ usually has strong non-linearities. This approach does not meet our portability requirement because changes in the sensory input would require a complex design process to come up with new hand-crafted features. In this work, a different approach to estimating Q-values is used: deep neural networks. Deep neural networks can learn useful features by themselves if enough data is provided. Moreover, according to (Krizhevsky et al., 2012), they usually learn better representations than hand-crafted features. As a result, the agent can work with different sensory inputs just with additional training.

3.5 Epsilon-Greedy

Before diving into how to train the neural network, the exploratory strategy during training needs to be considered. If the agent always chooses the greedy action, the one with the highest Q-value, it could get stuck in suboptimal strategies. To avoid this problem, epsilon-greedy is used as the exploratory strategy: a random action is chosen with a small probability $\varepsilon$, picking the greedy one otherwise.
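A minimal sketch of the epsilon-greedy rule described above, assuming the Q-values for the five actions of the current state have already been estimated; the function and argument names are illustrative, and the default of 0.5 matches the training value reported in Section 4.2.

```python
import random

def epsilon_greedy(q_values, epsilon=0.5):
    """Return the index of the chosen action: a random action with
    probability epsilon, otherwise the greedy (highest-Q) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example with the five actions of A; action 2 is greedy here.
print(epsilon_greedy([0.1, -0.3, 0.7, 0.0, 0.2], epsilon=0.0))  # 2
```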
3.6 Deep Neural Network Training

Since the $Q(s, a)$ are real values, the neural network can be trained as in a regression problem. The Bellman optimality equation is used to come up with an appropriate loss function:

$Q(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') \mid s, a \,\big]$   (4)

The loss function is derived from (4):

$L = \tfrac{1}{2}\big(r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a)\big)^2$   (5)

However, applying it naively has several problems, including:

• Learning from consecutive samples is inefficient, due to the strong correlations between them.

• Since the current parameters determine the next data sample, the learning process could be unstable.

To solve these problems, we use a technique called Experience Replay (Lin, 1993). The last N transitions of the agent are stored in a dataset called the Replay Memory and, at each time step, samples of transitions are drawn at random from the Replay Memory to train the network. This solution stabilizes the whole training process. A minimal sketch of this mechanism is given at the end of Section 4.

3.7 How the Approach Solves the Problem

Scalability. A different agent instance is in charge of each UAV.

Optimality. RL is about software agents learning how to behave optimally. Optimality is defined by a reward strategy, and the agent learns how to behave optimally with training.

Portability. Thanks to the rich state space, our agent generalizes to any time-varying formation without any additional process. Because a deep neural network is used to approximate Q-values, our agent works with different sensory inputs just with additional training.

4. SYSTEM SPECIFICATION

The same neural network architecture and hyperparameters were used across all scenarios, as needed for portability.

4.1 Neural Network Architecture

In addition to the input layer, which depends on the state representation, the model has three fully-connected hidden layers, with 128, 64 and 32 rectified linear units (ReLU) respectively. The output layer is a fully-connected layer with a single output, the approximation of $Q(s, a)$. We used Keras (Chollet, 2015) on top of Theano (Team et al., 2016) to build and train the deep neural network. We selected these platforms because of Keras' focus on fast experimentation, Theano's ability to run on GPU and the availability of a Python API, our development language. We chose RMSProp as the optimizer and “Uniform” as the weight initialization strategy. Specific details are presented below.

4.2 Hyperparameters

Epsilon-greedy module: $\varepsilon$ was equal to 0.5 during training.

Replay Memory Size: It stores the 10] most recent transitions.

Replay Memory Start Size: The agent followed a random policy for 10^ transitions to populate the replay memory before learning starts.

RMSProp Parameters: 5·10−] was used as the learning rate, 10−a as epsilon (a small value added for numerical stability) and 0.9 as rho (the gradient moving-average decay factor).

Discount Factor: $\gamma$ was set to 0.95.

Minibatch Size: We used 16 training samples to compute each SGD update.

4.3 Reward Clipping and Normalization

The reward function (1) offered by the simulation environment lacks a lower limit. Two normalization steps have been applied to avoid large errors:

1. The reward has been increased by 1. After this operation, the new upper limit is 1.

2. The reward has been clipped to [-1, +1].
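As a rough illustration of the Experience Replay mechanism of Section 3.6, the sketch below stores transitions in a bounded buffer and draws random minibatches; the capacity shown, the transition layout and all names are placeholders, not values or code from the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded buffer of (s, a, r, s_next, done) transitions (Lin, 1993)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=16):
        """Random minibatch; random sampling breaks the correlation
        between consecutive transitions."""
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=100_000)  # capacity chosen for illustration only
```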
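The network of Section 4.1 can be sketched with the Keras functional API as below. The paper ran Keras on top of Theano; this sketch is written against a current Keras/TensorFlow API instead, and the way the action enters the input (here, a one-hot vector appended to the state) is our assumption, since the paper only states that the output is a single estimate of $Q(s, a)$. The unspecified learning rate and optimizer epsilon are left at their defaults.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(state_dim, n_actions=5):
    """Three fully-connected ReLU hidden layers (128, 64, 32) and a single
    linear output approximating Q(s, a), trained with RMSProp on a squared
    TD-error target in the spirit of Eq. (5)."""
    # State plus one-hot action (assumed encoding, not stated in the paper).
    inputs = keras.Input(shape=(state_dim + n_actions,))
    x = layers.Dense(128, activation="relu", kernel_initializer="random_uniform")(inputs)
    x = layers.Dense(64, activation="relu", kernel_initializer="random_uniform")(x)
    x = layers.Dense(32, activation="relu", kernel_initializer="random_uniform")(x)
    q_value = layers.Dense(1, activation="linear")(x)

    model = keras.Model(inputs, q_value)
    model.compile(optimizer=keras.optimizers.RMSprop(rho=0.9),  # rho as in Section 4.2
                  loss="mse")                                   # squared TD error (up to scaling)
    return model
```

With a single-output network like this, the greedy action is obtained by evaluating the model once per action in $\mathcal{A}$ and taking the maximum of the five predicted Q-values.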
5. MAIN EVALUATION

5.1 Generalization to Different Non-High-Dimensional Sensory Inputs

The agent was trained using both sensory inputs provided by the simulation environment: the precise localization system and the distance to four landmarks. The total clipped reward collected per episode was used as the primary metric to evaluate the progress of the agent. This metric tends to be very noisy, especially when $\varepsilon$ is set to a relatively high value, 0.5. A Savitzky-Golay filter has been used in Fig. 2 and Fig. 3 to smooth the data and mitigate this problem.

The upper limit of the metric in Fig. 2 and Fig. 3 is 400 because the maximum normalized reward is 1.0 per time step and the simulation environment resets every four hundred time steps. However, the following points make it impossible to reach the upper limit during training:

• $\varepsilon$ has been set to 0.5, a relatively high value. It means that, during training, a random action, instead of the greedy one, is chosen with a probability of 0.5 at every time step.

• The actions are chosen from a discrete set $\mathcal{A} = \{\rightarrow, \leftarrow, \uparrow, \downarrow, \square\}$, instead of using a continuous action space.

• A small neural network has been used to limit complexity and computational needs, so the capacity of the model is relatively small.

However, despite all these limitations, Fig. 2 and Fig. 3 show steady learning processes in both scenarios. The same neural network architecture, hyperparameters and algorithm have been used in both situations, including the learning rate and the epsilon-greedy values. Fig. 2 and Fig. 3 display similar results, with the exception that the second scenario, distance to four landmarks, requires more training to reach similar cumulative rewards: 1500 episodes instead of 1000.

Fig. 2. Total clipped reward collected per episode using a precise localization system as sensory input.

These results support the idea that deep reinforcement learning can be used to build agents able to generalize to different sensory inputs with just additional training.

Fig. 3. Total reward collected per episode using the distance to four landmarks as sensory input.

5.2 Generalization to Different Time-Varying Formations

The agent was trained with formations drawn at random from $\mathcal{F}_{training}$. In this section we want to find out how the agent behaves with formations from $\mathcal{F}_{test}$, formations it has not seen before. To be more specific, Fig. 4 shows the trajectories of five UAVs following an eight-figure pattern while keeping a phase separation of $2\pi/5$ for 400 time steps, the episode length. This formation is similar to the one used in (Dong et al., 2015). Each simulated UAV was controlled by an instance of the trained agent.

Fig. 4. Trajectories of five UAVs trying to follow an eight-figure pattern while keeping a phase separation of $2\pi/5$.

Although all formations included in $\mathcal{F}_{test}$ are, strictly speaking, not feasible because of the discrete action space, Fig. 5 shows how the system handles a more obviously non-feasible formation. It would be unfeasible even with a continuous action space because of the acute angles of the pattern. Again, each simulated UAV was controlled by an instance of the trained agent.

Fig. 5. Trajectories of five UAVs trying to reach a non-feasible time-varying formation while keeping a phase separation of $2\pi/5$.

6. RELATED WORK

The related work is analyzed from two different perspectives: UAV formation control and deep reinforcement learning.

6.1 UAV Formation Control
Centralized vs. Decentralized Architecture: The decentralized approach was highly influenced by (Zhang and Mehrjerdi, 2013). This survey reported that centralized systems do not scale well as the formation size increases because of communication bottlenecks and the lack of use of the computational resources available on each vehicle. This was stated to be true even when the most advanced optimization solvers are used.

Time-Varying Formations: Most decentralized approaches to UAV formation control study fixed formations. Few successful approaches have been reported for time-varying formations, the most general case. These good results were achieved using consensus theory, as in (Dong et al., 2015) or (Rui et al., 2015). Consensus theory can be used to analyze stability, but it is not evident how it can be used to guarantee optimality in the way we have defined it. Another problem with this approach is that complex design processes may be necessary when changing the sensory inputs or the required formation.

6.2 Deep Reinforcement Learning

Combining RL algorithms with nonlinear function approximators, like neural networks, could cause the training process to diverge, so the vast amount of work in RL used to focus on linear function approximators (Tsitsiklis and Van Roy, 1997). The main problem when using RL with neural networks is that RL agents incrementally update their parameters while they observe a stream of experience, breaking the independent and identically distributed assumption of many stochastic gradient-based algorithms. However, in 2013, Mnih et al. presented a deep learning model able to learn control policies directly from sensory input (Mnih et al., 2013; Mnih et al., 2015). They did it using experience replay (Lin, 1993), which addresses the stability problem by mixing more and less recent experience when updating the neural network weights. Thanks to this work, it seems clear that this approach can be used to meet portability because hand-crafted features are not needed anymore. Moreover, according to (Krizhevsky et al., 2012), deep neural networks can usually learn better representations than hand-crafted features if enough data is provided.

7. CONCLUSIONS

This paper confirmed the feasibility of using deep reinforcement learning to develop scalable, optimal and portable time-varying formation controllers for UAVs. The scalability requirement was met because an agent was trained to control individual UAVs and a different instance was installed on each vehicle. Optimality was also considered: the reward strategy was used to define what we meant by optimal, and the agent learned how to behave optimally with training. Besides, we showed that our system generalizes to different time-varying formations and confirmed that it is able to work with different non-high-dimensional sensory inputs just with additional training, thanks to a deep neural network used to approximate $Q(s, a)$.

We did not consider important aspects of UAV formation controllers, including obstacle avoidance, complex UAV models and high-dimensional sensory inputs. These topics will be part of future research.

REFERENCES

Lin, L.J., 1993. Reinforcement learning for robots using neural networks (No. CMU-CS-93-103). Carnegie Mellon University, Pittsburgh, PA, School of Computer Science.

Tsitsiklis, J.N. and Van Roy, B., 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), pp.674-690.

Waharte, S., Trigoni, N.
and Julier, S., 2009, June. Coordinated search with a swarm of UAVs. In 2009 6th IEEE Annual Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks Workshops (pp. 1-3). IEEE.

Nigam, N., Bieniawski, S., Kroo, I. and Vian, J., 2012. Control of multiple UAVs for persistent surveillance: algorithm and flight test results. IEEE Transactions on Control Systems Technology, 20(5), pp.1236-1251.

Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Zhang, Y. and Mehrjerdi, H., 2013, May. A survey on multiple unmanned vehicles formation control and coordination: normal and fault situations. In Unmanned Aircraft Systems (ICUAS), 2013 International Conference on (pp. 1087-1096). IEEE.

Karimoddini, A., Lin, H., Chen, B.M. and Lee, T.H., 2013. Hybrid three-dimensional formation control for unmanned helicopters. Automatica, 49(2), pp.424-433.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Rui, W., Xiwang, D., Qingdong, L., Qilun, Z. and Zhang, R., 2015, July. Adaptive time-varying formation control for high-order LTI multi-agent systems. In Control Conference (CCC), 2015 34th Chinese (pp. 6998-7003). IEEE.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529-533.

Dong, X., Yu, B., Shi, Z. and Zhong, Y., 2015. Time-varying formation control for unmanned aerial vehicles: theories and applications. IEEE Transactions on Control Systems Technology, 23(1), pp.340-348.

Chollet, F., 2015. Keras. https://github.com/fchollet/keras.

Team, T.T.D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A. and Belopolsky, A., 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.