Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving

Xi Xiong, Jianqiang Wang, Fang Zhang, Keqiang Li
State Key Laboratory of Automotive Safety and Energy, Tsinghua University

Abstract

With the development of state-of-the-art deep reinforcement learning, we can efficiently tackle continuous control problems. However, deep reinforcement learning for continuous control is based on historical data and can make unpredictable decisions in unfamiliar scenarios. Combining deep reinforcement learning with safety based control achieves good performance for self-driving and collision avoidance. In this paper, we use the Deep Deterministic Policy Gradient (DDPG) algorithm to implement autonomous driving without vehicles around. The vehicle learns the driving policy in a stable and familiar environment, which is efficient and reliable. We then use the artificial potential field to design a collision avoidance algorithm for scenarios with surrounding vehicles. A path tracking method is also taken into consideration. The combination of deep reinforcement learning and safety based control performs well in most scenarios.

1. Introduction

There are two major paradigms for autonomous driving: the learning method and the control method. With the success of deep learning and reinforcement learning, more and more researchers have focused on learning methods for autonomous driving. The combination of deep learning and reinforcement learning can tackle problems with high dimensional inputs. The DQN network (Mnih et al., 2015) can play Atari games at human level, but DQN is not efficient for problems with high dimensional action spaces. The combination of a Q-network with an actor-critic structure performs well in the continuous control field. The DDPG algorithm (Lillicrap et al., 2015) is an actor-critic, model-free algorithm based on the deterministic policy gradient that operates over continuous action spaces. This algorithm can learn policies end-to-end: from low-dimensional inputs or raw pixel inputs to final actions. In the DDPG algorithm, the positive reward is the velocity projected along the track direction, since we want the vehicle to run along the track as fast as possible; the negative reward is the penalty for collision. However, this method cannot perform well without sufficient training. The policy makes unpredictable decisions in unfamiliar scenarios, which is the shortcoming of data based methods. In addition, collision avoidance is a basic requirement when designing a control strategy for autonomous driving. We also want the vehicle to run on the road with a higher safety level, including driving along the central road track and keeping safe distances from surrounding vehicles.

The artificial potential field method is widely used for collision avoidance in the field of robot path planning. We combine the ideas of the artificial potential field (Khatib, 1986) with deep reinforcement learning for autonomous driving to make full use of both merits. Path tracking is also important for the autonomous driving strategy because we assume that driving far away from the center of the road carries high risk; we can use the path tracking method to reach a relatively safe state.
2. Background

2.1. Deep Reinforcement Learning

In the framework of reinforcement learning, the agent interacts with the environment. At each discrete time step $t$, the agent takes an action $a_t$; the environment then changes its previous state $s_t$ to $s_{t+1}$, and the agent receives a reward $r_t$. The goal of reinforcement learning is to maximize the discounted accumulated reward $R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)$. The action-value function expresses the expectation of $R_t$, $Q(s_t, a_t) = \mathbb{E}\left[R_t \mid s_t, a_t\right]$. We define the optimal action-value function as the maximum expected return achievable by any policy $\pi$,

$Q^*(s_t, a_t) = \max_\pi \mathbb{E}\left[ R_t \mid s_t = s,\, a_t = a,\, \pi \right]$.   (1)

The optimal value function obeys the Bellman equation,

$Q^*(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \mid s_t, a_t \right]$.   (2)

In the DQN algorithm, it is common to use a neural network to approximate $Q^*(s, a)$. We denote the approximator as $Q(s, a; \theta) \approx Q^*(s, a)$, with parameters $\theta$. The Q-network can be trained by minimizing the loss function,

$L_i(\theta_i) = \mathbb{E}\left[ \left( r_{i+1} + \gamma \max_{a_{i+1}} Q(s_{i+1}, a_{i+1}; \theta_{i-1}) - Q(s_i, a_i; \theta_i) \right)^2 \right]$.   (3)

The DQN algorithm works well in high dimensional state spaces but is not effective in continuous action spaces, because the maximization over $a_t$ at every time step is too slow to be practical with nontrivial action spaces. We therefore use the Deep Deterministic Policy Gradient algorithm to solve continuous control problems.

Control policies can be divided into stochastic and deterministic policies (Sutton and Barto, 2012). A stochastic policy $\pi_\theta(a \mid s) = P(a \mid s)$ represents a probability distribution over actions, and we denote the discounted state distribution as $\rho^\pi(s)$. The performance objective can be expressed as an expectation,

$J(\pi_\theta) = \int_S \rho^\pi(s) \int_A \pi_\theta(a \mid s)\, Q^\pi(s, a)\, \mathrm{d}a\, \mathrm{d}s$.   (4)

The essence of policy gradient algorithms is to adjust the parameters of the policy in the direction of the performance gradient $\nabla_\theta J(\pi_\theta)$. The policy gradient theorem (Sutton et al., 1999) can be expressed as

$\nabla_\theta J(\pi_\theta) = \int_S \rho^\pi(s) \int_A \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\left[ Q^\pi(s, a)\, \nabla_\theta \ln \pi_\theta(a \mid s) \right]$.   (5)

For continuous control problems, we assume the policy to be deterministic. We use $\mu_\theta$ to denote the mapping from the state space to the action space, namely $a = \mu_\theta(s)$. By analogy with the stochastic case, we define the performance objective as

$J(\mu_\theta) = \int_S \rho^\mu(s)\, Q^\mu(s, \mu_\theta(s))\, \mathrm{d}s$.   (6)

We also use a policy gradient for the deterministic policy. If $\nabla_\theta \mu_\theta(s)$ and $\nabla_a Q^\mu(s, a)$ both exist, the gradient can be expressed as

$\nabla_\theta J(\mu_\theta) = \int_S \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\mu}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)} \right]$.   (7)
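To make Eqs. (3) and (7) concrete, the following is a minimal sketch of the resulting critic and actor updates, assuming PyTorch. The `actor`, `critic`, optimizers and sampled batch are hypothetical placeholders rather than the exact networks used later in this paper, and the target networks that DDPG adds for stability are omitted for brevity.

```python
import torch
import torch.nn as nn

def ddpg_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s_next = batch  # tensors sampled from a replay buffer (placeholder)

    # Critic: regress Q(s, a) toward the bootstrapped target r + gamma * Q(s', mu(s')), as in Eq. (3).
    with torch.no_grad():
        target = r + gamma * critic(s_next, actor(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient (Eq. 7), i.e. ascend Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```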
2.2. Safety Based Control

When considering the safety of the vehicle, avoiding collisions and driving along the track are the most important issues, especially the former. The artificial potential field method is widely used for robot path planning. Its goal is to make the robot move from the initial position to the target position in a desired manner while avoiding collision. There are two types of potential field in the domain of robot path planning: the attractive potential field and the repulsive potential field. The attractive part represents the energy needed to reach the target position, and the repulsive part represents the potential risk of collision,

$U_{art}(x) = U_{att}(x) + U_{rep}(x)$,   (8)

where $U_{art}(x)$ is the artificial potential field, $U_{att}(x)$ is the attractive potential field, and $U_{rep}(x)$ is the repulsive potential field. The potential forces are the negative gradients of the respective potential fields,

$F_{att} = -\nabla U_{att}$,   (9)

$F_{rep} = -\nabla U_{rep}$.   (10)

When we consider multiple targets and obstacles (Figure 1), the attractive and repulsive potential forces are vectors, and the total force is the vector sum of all attractive and repulsive forces.

Figure 1: Multiple targets and obstacles for the artificial potential field. The attractive forces $F_{att1}$ and $F_{att2}$ produced by Target 1 and Target 2 are vectors, and $F_{att}$ is their vector sum. Accordingly, $F_{rep}$ is the vector sum of $F_{rep1}$ and $F_{rep2}$ produced by Obstacle 1 and Obstacle 2.

3. Combining Deep Reinforcement Learning and Safety Based Control

3.1. Methodology

In the field of cognitive science, there are two major learning paradigms: empiricism and speculation. Empiricism is a way of learning from historical experience. Speculation is the way of logical thinking, which means taking measures by reasoning. The thinking process of humans contains both empiricism and speculation, which interact with each other. The deep reinforcement learning method is like learning from past experience. The safety based control, which contains the artificial potential field method and path tracking, is like speculation and logical reasoning. Deep reinforcement learning is efficient and works well in a relatively stable and familiar environment, but it is difficult for this method to cover all scenarios. We therefore combine deep reinforcement learning and safety based control.

3.2. Algorithm

We tackle the problem using perception sensor data, including vehicle speed, vehicle position on the road track and opponent vehicle distances (Loiacono et al., 2013). The input data can be divided into two parts: the opponent distance features are used for collision avoidance, and the other parameters are used for deep reinforcement learning and path tracking. Each of the three methods produces its own acceleration and steering commands. We then balance the weights of these three action outputs,

$\delta = \alpha\,\delta_l + \beta\,\delta_f + \gamma\,\delta_p$,   (11)

$a = \alpha\,a_l + \beta\,a_f + \gamma\,a_p$,   (12)

s.t. $\alpha + \beta + \gamma = 1$,   (13)

where $\delta$, $\delta_l$, $\delta_f$ and $\delta_p$ respectively represent the final steering action, the learning policy steering action, the potential field steering action and the path tracking steering action; $a$, $a_l$, $a_f$ and $a_p$ respectively represent the final acceleration action, the learning policy acceleration action, the potential field acceleration action and the path tracking acceleration action; and $\alpha$, $\beta$ and $\gamma$ are the weight parameters of the three methods.
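As a concrete illustration of Eqs. (11)-(13), the sketch below blends the commands of the three controllers. It is a minimal sketch assuming Python: the three (steering, acceleration) pairs are supplied as placeholders, and the default weights are the ones used in the experiments of Section 4.

```python
# Minimal sketch of the weighted blend in Eqs. (11)-(13).
# ddpg_cmd, apf_cmd and track_cmd are (steering, acceleration) pairs produced
# by the learning policy, the artificial potential field and the path tracker.
def blend_commands(ddpg_cmd, apf_cmd, track_cmd, alpha=0.4, beta=0.3, gamma=0.3):
    assert abs(alpha + beta + gamma - 1.0) < 1e-6           # Eq. (13)
    steer = alpha * ddpg_cmd[0] + beta * apf_cmd[0] + gamma * track_cmd[0]  # Eq. (11)
    accel = alpha * ddpg_cmd[1] + beta * apf_cmd[1] + gamma * track_cmd[1]  # Eq. (12)
    return steer, accel
```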
3.2.1. Deep Reinforcement Learning

First, we train the vehicle without opponents. The positive reward at each step is the velocity of the car projected along the track direction; we do not need to set a negative reward. The structure of DDPG is shown in Figure 2. The policy network generates actions and the value function network approximates the optimal Q-values. The input states are data from the vehicle speed sensors, current engine speed, track sensors, wheel speed, track position and vehicle angle. After several hours of training, we can use the policy to implement actions ($\delta_l$ and $a_l$) without opponents, and the policy is also applicable to other tracks because of the stable and familiar environment.

Figure 2: Deep Deterministic Policy Gradient architecture. The policy network implements the acceleration and steering commands. The input state parameters are partial sensor data. The action and state pairs are then used in the critic network, which performs Q-learning. The actor-critic architecture updates the policy in the direction of the performance gradient $\nabla_{\theta^\mu} J$.

Using the actor-critic structure, we update the critic by minimizing the loss,

$L(\theta^Q) = \mathbb{E}\left[ \left( r(s_i, a_i) + \gamma\, Q(s_{i+1}, \mu(s_{i+1}); \theta^Q) - Q(s_i, a_i; \theta^Q) \right)^2 \right]$,   (14)

and we update the actor policy using the deterministic policy gradient,

$\nabla_{\theta^\mu} J = \mathbb{E}\left[ \nabla_a Q(s, a; \theta^Q)\big|_{a=\mu(s)}\, \nabla_{\theta^\mu} \mu(s; \theta^\mu) \right]$.   (15)

3.2.2. Artificial Potential Field Method

For the artificial potential field, we only consider the repulsive potential field, which is used for collision avoidance. As shown in Figure 3, in the coordinate system of the ego vehicle, the component of $F_{rep}$ projected along the x-axis corresponds to the steering command and the component projected along the y-axis corresponds to the acceleration command. We assume the forces are continuous and depend only on the distances,

$F_{rep\_x} = \sum_i \frac{1}{d_i^{\lambda}} \cos\theta_i$,   (16)

$F_{rep\_y} = \sum_i \frac{1}{d_i^{\lambda}} \sin\theta_i$,   (17)

where $\theta_i$ is the angle of obstacle $i$ in the ego vehicle coordinate system, $d_i$ is the distance of obstacle $i$ from the ego vehicle, and $\lambda$ is a power exponent to be determined.

Figure 3: Repulsive potential field forces in the ego vehicle coordinate system. The forces projected along the x-axis of the ego vehicle correspond to the steering command; the forces projected along the y-axis correspond to the acceleration command.

The output actions are proportional to the potential field forces,

$\delta_f = k_{fx}\, F_{rep\_x}$,   (18)

$a_f = k_{fy}\, F_{rep\_y}$,   (19)

where $k_{fx}$ and $k_{fy}$ are the proportional coefficients of $F_{rep\_x}$ and $F_{rep\_y}$ respectively.

3.2.3. Path Tracking

For the path tracking function, we want the vehicle to drive along the central track of the road. The goal of path tracking is to minimize the angle between the car heading and the direction of the track axis and to shorten the distance between the vehicle centroid and the central road track (Kapania and Gerdes, 2015). Equation (20) gives the steering command in terms of the tracking error and the heading error. We also set the acceleration command $a_p$ according to the steering command: the basic rule is to decrease the vehicle speed when the steering command is large,

$\delta_p = \kappa_1 \psi + \kappa_2 e$,   (20)

where $\psi$ is the angle between the car heading and the direction of the track axis, $e$ is the distance between the vehicle centroid and the central road track, and $\kappa_1$ and $\kappa_2$ are their respective coefficients.

Figure 4: Diagram of the tracking error $e$ and heading error $\psi$. The goal of path tracking is to minimize the tracking error and heading error so that the vehicle keeps driving along the track.
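To make Eqs. (16)-(20) concrete, the following is a minimal sketch of the two safety based controllers, assuming Python. The sensor interface (opponent angles and distances in the ego frame, heading error $\psi$ and tracking error $e$), the default coefficients taken from Table 1, and the simple speed reduction rule for $a_p$ are assumptions, since the paper states the acceleration rule only qualitatively.

```python
import math

# opponents: list of (theta_i, d_i) pairs in the ego vehicle frame (assumed interface).
def apf_commands(opponents, k_fx=20.0, k_fy=10.0, lam=1.5):
    f_x = sum(math.cos(theta) / d**lam for theta, d in opponents)  # Eq. (16)
    f_y = sum(math.sin(theta) / d**lam for theta, d in opponents)  # Eq. (17)
    return k_fx * f_x, k_fy * f_y                                  # Eqs. (18)-(19)

def path_tracking_commands(psi, e, kappa1=3.18, kappa2=2.0, base_accel=0.5):
    steer = kappa1 * psi + kappa2 * e                  # Eq. (20)
    # Assumed rule: reduce acceleration as the steering command grows.
    accel = base_accel * (1.0 - min(abs(steer), 1.0))
    return steer, accel
```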
4. Experiments

4.1. Experiments Setup

We use the TORCS platform (Loiacono et al., 2013) to implement our autonomous driving algorithm. First, we train the policy network without opponents on a GPU. The actor network consists of two hidden layers with 400 and 300 units respectively; the final output layer is a tanh layer that implements the steering and acceleration commands. The policy network learning rate is $10^{-4}$. The critic network also consists of two hidden layers with 400 and 300 units, with a learning rate of $10^{-3}$. The discount factor is 0.99 and the training minibatch size is 64. The input states for the actor-critic architecture are the focus sensors, track sensors, vehicle speed, engine speed, wheel speed, track position and vehicle angle. The output actions are the steering and acceleration commands.

After training, we combine the learning policy actions with the safety based control actions. The parameters for the safety based control are shown in Table 1. The input states for the repulsive potential field are the 36 opponent distances. The path tracking method uses the angle error and track position to calculate its actions.

Table 1: Parameters for the artificial potential field and path tracking

Symbol | Value | Symbol | Value
$k_{fx}$ | 20 | $\kappa_2$ | 2
$k_{fy}$ | 10 | $\alpha$ | 0.4
$\lambda$ | 1.5 | $\beta$ | 0.3
$\kappa_1$ | 3.18 | $\gamma$ | 0.3
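For reference, a minimal sketch of the actor and critic networks with the layer sizes and learning rates listed above, assuming PyTorch. The hidden layer activations and the point at which the action enters the critic are assumptions, since only the 400/300 hidden units, the tanh output layer and the learning rates are specified.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # steering and acceleration in [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),  # scalar Q-value
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Optimizers with the learning rates reported above:
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```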
4.2. Results

We first train the driving policy using the DDPG algorithm without opponents on a GPU. The average Q-value of the actor-critic structure increases gradually (Figure 5). During the training process, we divide the reward by 150 to limit the one-step reward to [0, 2]. After approximately 13 hours of training, the average Q-value reaches approximately 110. We then use the policy for autonomous driving, and the vehicle performs well with the trained policy network.

Figure 5: Average Q-value over training steps. After 13 hours the average Q-value reaches about 110; we then use the policy network for autonomous driving without vehicles around.

We then combine the DDPG policy network with the safety based control. The weight coefficients for DDPG, path tracking and the artificial potential field (APF) method are 0.4, 0.3 and 0.3 respectively. Figures 6(a)-6(d) show four typical scenarios using the combined algorithm, and Figures 6(e) and 6(f) show the corresponding steering and acceleration commands produced by DDPG, path tracking and APF. In Figure 6(a), the vehicle runs along the curve; the DDPG algorithm outputs the major steering command, and the APF steering and acceleration commands are zero because no surrounding vehicle is detected. Figures 6(b) and 6(c) show one opponent nearby, and the APF outputs corresponding commands; the opponent distance in Figure 6(c) is shorter than in Figure 6(b), so the APF commands play the major part in Figure 6(c). Figure 6(d) shows the vehicle running along the curve with two vehicles around; the ego vehicle is far from the track center, so the path tracking steering command is larger than those of the other two methods.

Figure 6: Typical driving scenarios and the corresponding commands. A positive steering command represents turning left and vice versa; a negative acceleration command represents braking. The blue race car with the red box is the ego vehicle. (a) Driving along the curve with no opponent around. (b) One opponent vehicle in front. (c) One opponent at the bottom left. (d) Two opponents around while driving along the curve. (e) The steering commands in the four typical scenarios using DDPG, path tracking and APF. (f) The acceleration commands in the four typical scenarios using DDPG, path tracking and APF.

5. Conclusion

In this paper, we combine deep reinforcement learning and safety based control, including the artificial potential field and path tracking, for autonomous driving. We first use the DDPG algorithm to obtain the driving policy from partial state inputs, and then combine the policy network with the safety based control to avoid collisions and drive along the track. Experiments show that the three algorithms coordinate well in the TORCS environment.

6. References

Sutton, R. S., Barto, A. G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 2012.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Mnih, V., Badia, A. P., Mirza, M., et al. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
Silver, D., Huang, A., Maddison, C. J., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484-489.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Silver, D., Lever, G., Heess, N., et al. Deterministic policy gradient algorithms. In ICML, 2014.
Sutton, R. S., McAllester, D. A., Singh, S. P., et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999: 1057-1063.
Nair, A., Srinivasan, P., Blackwell, S., et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
Khatib, O. Real-time obstacle avoidance for manipulators and mobile robots. The International Journal of Robotics Research, 1986, 5(1): 90-98.
Hausknecht, M., Stone, P. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143, 2015.
Chen, C., Seff, A., Kornhauser, A., et al. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, 2015: 2722-2730.
Loiacono, D., Cardamone, L., Lanzi, P. L. Simulated car racing championship: Competition software manual. arXiv preprint arXiv:1304.1672, 2013.
Kapania, N. R., Gerdes, J. C. Path tracking of highly dynamic autonomous vehicle trajectories via iterative learning control. In 2015 American Control Conference (ACC), IEEE, 2015: 2753-2758.
Ge, S. S., Cui, Y. J. Dynamic motion planning for mobile robots using potential field method. Autonomous Robots, 2002, 13(3): 207-222.
Vadakkepat, P., Tan, K. C., Ming-Liang, W. Evolutionary artificial potential fields and their application in real time robot path planning. In Proceedings of the 2000 Congress on Evolutionary Computation, IEEE, 2000, 1: 256-263.