Autonomous Ramp Merge Maneuver Based on Reinforcement Learning with Continuous Action Space

Pin Wang*, Ching-Yao Chan
California PATH, University of California, Berkeley, Bldg. 454, Richmond Field Station, Richmond, CA 94804, USA
pin_wang@berkeley.edu, cychan@berkeley.edu

Abstract – Ramp merging is a critical maneuver for road safety and traffic efficiency. Most automated driving systems currently developed by automobile manufacturers and suppliers are limited to restricted-access freeways, and extending the automated mode to ramp merging zones presents substantial challenges. One challenge is that the automated vehicle must incorporate a future objective (e.g., a successful and smooth merge) and optimize a long-term reward that is affected by subsequent actions when executing the current action. Furthermore, the merging process involves interaction between the merging vehicle and its surrounding vehicles, whose behavior may be cooperative or adversarial, leading to distinct merging countermeasures that are crucial to completing the merge successfully. In place of conventional rule-based approaches, we propose to apply a reinforcement learning algorithm to the automated vehicle agent to find an optimal driving policy by maximizing the long-term reward in an interactive driving environment. Most importantly, in contrast to most reinforcement learning applications in which the action space is treated as discrete, our approach treats both the action space and the state space as continuous without incurring additional computational costs. Our unique contribution is the design of the Q-function approximation, which is structured as a quadratic function whose coefficients are estimated by simple but effective neural networks. The results obtained through the implementation of our training platform demonstrate that the vehicle agent is able to learn a safe, smooth, and timely merging policy, indicating the effectiveness and practicality of our approach.

Keywords: Autonomous driving, Ramp merging, Reinforcement learning, Continuous action.

I. INTRODUCTION

Automated vehicles have the potential to reduce traffic accidents and improve traffic efficiency. A number of automakers, high-tech companies, and research agencies are dedicating their efforts to implementing and demonstrating partially or highly automated features in modern vehicles, such as the AI-enabled computational platforms for autonomous driving from NVIDIA [1], the Autopilot from Tesla [2], and the 'Drive Me' project by Volvo [3]. Fully autonomous vehicles, e.g., the Google self-driving car (Waymo) [4], are also being tested and may be deployed in the near future. Different levels of automated functions designed for freeways or expressways are well developed and some of them are being or will soon be introduced to the market, such as Level 2 functions (e.g., adaptive cruise control plus lane keeping) offered by various automakers; one example is the Super Cruise by General Motors [5]. However, the implementation of autonomous on-ramp merging still presents considerable challenges. One major challenge is that the intelligent vehicle agent should take long-term impacts into consideration when it decides on its current control action (the "long term" in this study is defined as the completion of a merge process, while at any point along the merging maneuver there is a "current" action).
In other words, the actions such as accelerating, decelerating, or steering that the ego vehicle takes at the current moment may affect the success or failure of the merge mission. Another challenge is that the merging maneuver depends not only on the merging vehicle's own dynamic state but also on its surrounding vehicles, whose actions may be cooperative (e.g., decelerating or changing lanes to yield to the merging vehicle) or adversarial (e.g., speeding up to deter the merging vehicle). The merging process can be handled with relative ease in most cases by experienced human drivers, but the algorithms for automated execution of the merge maneuver in a consistently smooth, safe, and reliable manner can become complex.

Most previous studies solve the merging problem by assuming specific rules. For example, Marinescu et al. [6] proposed a slot-based merging algorithm by defining a slot's occupancy status (e.g., free or occupied) based on the mainline vehicles' speed, position, and acceleration or deceleration behavior. Chen et al. [7] applied gap acceptance theory and defined driving rules to model the decision-making process of on-ramp merge behavior on urban expressways. These rule-based models are conceptually comprehensible but pragmatically vulnerable due to their inability to adapt to unforeseen situations in the real world.

Reinforcement learning, a machine learning approach in which an agent trains itself continually through trial and error [8], has the potential to allow the vehicle agent to learn how to drive in different or previously unencountered situations by building up its pattern recognition capabilities through training. Reinforcement learning differs from standard supervised learning techniques, which need ground truth as input and output pairs. A reinforcement learning agent learns from past experience and tries to capture the best possible knowledge to find an optimal action given its current state, with the goal of maximizing a long-term reward, which is the cumulative effect of the current action on future states. In our study, we apply a reinforcement learning algorithm to the autonomous driving agent to find an optimal merging policy.

In a typical reinforcement learning problem, the state space and action space are often treated as discrete, which simplifies the learning process to a finite tabular setting. However, in reality, the vehicle's states and actions (i.e., vehicle dynamics) are continuous. Discretizing them results in an extremely large unordered set of state/action pairs and renders the solution suboptimal. Therefore, finding ways to treat both the state space and the action space as continuous is of primary importance, which forms one cornerstone of our research thesis.

The rest of the paper is organized as follows. A literature review of related works is given in the next section, followed by the description of our proposed reinforcement learning algorithm. Then, the training procedure implemented on a simulation platform and the results are presented. Finally, concluding remarks and discussions are given in the closing section.

II. LITERATURE REVIEW

The application of reinforcement learning has seen significant progress in the field of artificial intelligence in the past decade. Narasimhan et al. [9] employed reinforcement learning for language understanding of text-based games. Li et al.
[10] proposed a hybrid reinforcement learning approach to deal with customer relationship management problems in a company, in order to find optimal actions (e.g., sending a catalog, a coupon, or a greeting card) for its customers. Google DeepMind [11] applied deep reinforcement learning techniques to develop an artificial agent that plays classic Atari games; the trained agent outperforms a professional human player by directly learning game policies from high-dimensional image inputs.

In recent years, reinforcement learning has been applied to traffic and vehicle control problems. Some studies applied reinforcement learning to ramp metering control to improve traffic efficiency. Fares et al. [12] designed a density control agent based on reinforcement learning to control the vehicles entering the highway from on-ramps. In their study, they defined the state space as a three-dimensional space and the action space as a two-action space (i.e., red and green). Yang et al. [13] used basic Q-learning to increase the capacity at a highway-ramp weaving section. The state space was composed of upstream and downstream volumes, and the action space was represented by discrete ramp metering rates.

Other studies use reinforcement learning for automated vehicle control. Ngai et al. [14] proposed a reinforcement learning multiple-goal framework to solve the overtaking problem of automated vehicles. They used a quantization method to convert the continuous sensor state space and action space into discrete spaces. The vehicle can accomplish the overtaking task, though it cannot always turn to the desired direction accurately due to the discrete steering angles. Yu et al. [15] investigated the use of reinforcement learning to control a simulated car through a browser-based car simulator. They decreased the action space from 9 actions to 3 actions (e.g., faster, faster-plus-left, faster-plus-right) and tested two reward functions. The simulated car can learn turning operations in relatively large sections without going off-road; however, it faces challenges in obstacle avoidance. In these studies, the authors use discrete actions to represent real-world action spaces that are fundamentally continuous. Discretizing the action space can simplify the problem and may lead to fast convergence, but it can also result in suboptimal and unrealistic vehicle performance.

Some attempts have been made to use continuous action spaces. Sallab et al. [16] formulated two main reinforcement learning categories, a discrete action category and a continuous action category, for a lane-keeping assistance study. They tested and compared the performance of the two algorithms with an open-source car simulator (TORCS), and the results showed that the discrete action space made steering abrupt while the continuous action space gave better performance with smooth control. Shalev-Shwartz et al. [17] applied reinforcement learning to optimize long-term driving strategies (e.g., a double merging scenario), where they decomposed the problem into a learnable part and a non-learnable part. The learnable part maps the state into a future trajectory, which enables comfortable driving, while the other part is designed as hard constraints that guarantee driving safety. The proposed framework is plausible, but the authors have not conducted reproducible experiments.
We believe it is challenging but crucial to treat the control action space as continuous. In our work, we design a unique format of Q-function approximator to obtain the optimal merging policy without increasing computational cost. We describe our approach in the next section.

III. METHODOLOGY

In this section, we provide an in-depth explanation of the methodology, including the concept of reinforcement learning, the state space, the action space, the reward function, and the neural-network-based Q-function approximator.

A. Reinforcement Learning

In a reinforcement learning problem, an agent interacts with an environment that is typically formalized as a Markov Decision Process (MDP). The agent takes the environment observation as state s ∈ S and chooses an action a ∈ A based on s. After the action is executed, it observes the reward r ~ R(s, a) and the next state s' ∈ S. The expected discounted cumulative return G is calculated as in (1) based on the rewards received starting from state s and thereafter following the policy π. The goal of the reinforcement learning agent is to find an optimal policy π* that maps states into actions.

G = E[ Σ_t γ^(t−1) r_t ]    (1)

where γ is a discount factor, γ ∈ (0, 1).

Model-based and model-free approaches are the two main categories for solving a reinforcement learning problem. For the ramp merging problem, it is difficult to prescribe an accurate model of the environment with a state transition matrix. Therefore, we resort to Q-learning, a model-free approach, to find an optimal driving policy. A Q-function is used to evaluate the long-term return G(s, a) based on the current and next-step information (s, a, r, s'), instead of waiting until the end of the episode to gather the discounted cumulative reward. Q(s, a) is called the state-action value, among which the highest one, Q*(s, a*), indicates that the action a* is an optimal action in state s. By iteratively updating the estimated Q-values with the observed reward r and next state s' as follows, an optimal policy can be learned.

Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )    (2)

where α is the learning rate. An optimal policy π* is better than or equal to all other policies (π* ≥ π, ∀π), in which all states reach the optimal action values (Q(s, a) = Q*(s, a*)).

Note that the above update only applies to discrete states and actions, which makes it impractical in our case where both the state space (driving environment) and the action space (vehicle control) are continuous. An alternative is to use a neural network as the Q-function approximator. The Q-value for a given state s and a chosen action a is estimated by the Q-network with weights θ, expressed as Q(s, a, θ). The Q-network can be updated by stochastic gradient descent. However, if we directly feed the states and actions into the neural network without explicitly or implicitly providing it with some prior knowledge, it may have a hard time learning the driving policy. For this reason, we design the Q-function approximator as a quadratic function to ensure that there is always a globally optimal action for a given state at any moment. The coefficients of the quadratic function are learned by concise neural networks. To set up the learning graph, we first define the state space, action space, and reward function, and then formulate the Q-function approximator. These are described in the following sections.
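For illustration, a minimal Python sketch of the tabular update in Eq. (2) is given below. The table sizes, example values, and the sample transition are hypothetical placeholders; the sketch mainly serves to show why the tabular form does not carry over to our continuous state and action spaces.

import numpy as np

# Minimal tabular Q-learning update, Eq. (2). It requires enumerable states
# and actions, whereas the merging state and action spaces are continuous.
# The table sizes and the sample transition below are hypothetical placeholders.
N_STATES, N_ACTIONS = 1000, 7        # hypothetical discretization sizes
GAMMA, ALPHA = 0.95, 0.001           # example discount factor and learning rate

Q = np.zeros((N_STATES, N_ACTIONS))  # tabular action-value estimates

def q_update(s, a, r, s_next):
    """Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])

# Example transition with placeholder indices, not real vehicle states.
q_update(s=0, a=3, r=-1.2, s_next=1)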
B. State Space

In a typical on-ramp merging scenario, the ego vehicle (i.e., the merging vehicle) needs to know not only its own dynamic state but also the states of its surrounding vehicles (SVs) to make a rational decision on when and how to merge onto the highway. In other words, the ego vehicle's state is related to the SVs' states, which makes the driving environment a non-MDP. The real-world environment is rarely an MDP, but many situations can be approximated as an MDP in one way or another. In our case, the ego vehicle's own state is independent of its historical kinematic information given its current state (which is Markovian), while the SVs' states are not, from the view of the ego vehicle, mainly due to the unpredictable nature of their next states (which makes the environment a non-MDP). The historical dynamics of highway vehicles may give a hint about how they will probably behave in the near future, and this can be learned by an LSTM (Long Short-Term Memory) based model as we proposed in our earlier work [18], but the most critical information for the ego vehicle to select an optimal action is their current states. Besides, thanks to advanced sensing technologies in positioning, communication, and computing, we can capture the vehicles' states nearly instantaneously (within tens to hundreds of milliseconds) and simultaneously transmit them to the agent control module for perception, recognition, and action decision. In this sense, we currently simplify the real-world driving environment into an MDP.

The merging procedure can be partitioned into three phases. First, find an appropriate gap; to do this, the ego vehicle needs to estimate the arrival time at the merging section of its own and of the other vehicles on the highway. Second, execute the merging maneuver; the ego vehicle needs to adjust its action to merge safely and smoothly into the selected gap, and this is what the vehicle agent needs to be trained for. Third, after completing the merge, the ego vehicle should perform proper car-following actions as vehicles on the highway usually do. In the overall process, the dynamics of the gap-front vehicle (the vehicle directly ahead of the ego vehicle) and the gap-back vehicle (the vehicle directly behind) are critical for the ego vehicle to learn the optimal merging policy. Thereby, the state space is defined to include the dynamics of the ego vehicle, the gap-front vehicle, and the gap-back vehicle. Additionally, we add another element, the highway speed limit, to constrain the vehicle's speed within a reasonable range. The continuous state space is therefore defined as

s = ( v_ev, p_ev, v_gfv, p_gfv, v_gbv, p_gbv )

where v_ev and p_ev are the speed and position of the ego vehicle, v_gfv and p_gfv are the speed and position of the gap-front vehicle, and v_gbv and p_gbv are the speed and position of the gap-back vehicle.
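As a minimal illustration, the continuous state above can be packed into a single vector as follows. The field names are our own shorthand, and the optional speed-limit entry reflects the additional element mentioned in the text (65 mi/h ≈ 29.06 m/s); drop it if only the six-dimensional state of the equation above is used.

import numpy as np

# Continuous merging state (Sec. III.B): ego, gap-front and gap-back vehicle
# speeds and positions. Field names are illustrative shorthand.
def make_state(v_ev, p_ev, v_gfv, p_gfv, v_gbv, p_gbv, speed_limit=29.06):
    """Pack the observation into the state vector s.

    speed_limit defaults to 65 mi/h expressed in m/s (about 29.06); it is an
    optional seventh element and can be omitted to match the six-dimensional
    definition above.
    """
    return np.array([v_ev, p_ev, v_gfv, p_gfv, v_gbv, p_gbv, speed_limit],
                    dtype=np.float32)

# Example: ego at 18 m/s, 40 m along the ramp; gap vehicles on the highway.
s = make_state(v_ev=18.0, p_ev=40.0, v_gfv=27.0, p_gfv=95.0, v_gbv=25.0, p_gbv=55.0)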
C. Action Space

Typically, vehicle control refers to longitudinal control (e.g., acceleration or deceleration) and lateral control (e.g., steering). In the on-ramp merge scenario, we suppose the merging vehicle travels along the centerline of the lane from the ramp to the highway, and such geometry information is available from the embedded digital map for the ego vehicle to follow. In other words, for the purpose of demonstrating the reinforcement learning concept in this paper, we do not include the lateral steering control and only model the longitudinal acceleration as the control action.

Based on vehicle dynamics, the acceleration of a vehicle cannot take arbitrary values in reality. Therefore, we limit the acceleration to the range [−4.5 m/s², 2.5 m/s²] based on the literature on vehicle dynamics [19], and allow the acceleration to be any real value within this range, which is different from other studies in which the acceleration space was divided into subsets or a sequence of discrete values. It is worth mentioning that an output action from the learning algorithm generally takes effect on the agent only for a relatively small time interval when the data update frequency is high (e.g., 10 Hz), leading to a minuscule or unobservable effect of that action. To overcome this, in evaluating the vehicle dynamics we keep the action the same for a few steps (e.g., the next k steps) to let it manifest its impact, and then update it based on newly observed information. In other words, the action is recalculated every k steps, while the state is updated at every time step.

D. Reward Function

After the reinforcement learning agent takes an action in a given state, its impact on the environment is fed back as an immediate reward; that is, the immediate reward measures the effect of an action in a given state. In our on-ramp merging problem, this effect is reflected by the smoothness, safeness, and promptness of the merging maneuver. Smoothness represents the comfort of the merging maneuver and is measured by the absolute value of the acceleration: the higher the absolute value of the acceleration, the larger the penalty imposed on the agent. Safeness is estimated by the distance to the surrounding vehicles: the closer the ego vehicle is to its surrounding vehicles, the larger the penalty it gets. Promptness is assessed by the time the ego vehicle takes to complete the merging process. This effect cannot be measured within a single time interval since merging is a sequential process, so we resort to the current vehicle speed to account for the contribution of promptness in the immediate reward. Consequently, the composition of the immediate reward is expressed in Equations (3)-(6).

R(s, a) = R_1(acceleration) + R_2(distance) + R_3(speed)    (3)
R_1(acceleration) = f_1 * abs(acceleration)    (4)
R_2(distance) = f_2 * g_2(distance)    (5)
R_3(speed) = f_3 * speed    (6)

where f_1, f_2, and f_3 are factors accounting for each part of the reward. It should be stressed that safeness is relatively more important than smoothness and timeliness in daily driving, hence we put more emphasis on the distance-related reward. This reward is split into two parts, the reward from the distance to the gap-front vehicle and the reward from the distance to the gap-back vehicle. Equation (5) is further specified as

R_2(distance) = f_21 * g_21(dis_gfv) + f_22 * g_22(dis_gbv)    (7)

where g_21 and g_22 are functions of the distance to the gap-front vehicle (dis_gfv) and the distance to the gap-back vehicle (dis_gbv), and f_21 and f_22 are the corresponding factors. Note that when the ego vehicle is far from or has passed the merging zone, dis_gfv and dis_gbv are not particularly important to the ego vehicle; therefore, f_21 and f_22 are set to zero when the ego vehicle is relatively far from the merging zone. The factor f_1 for the acceleration in the immediate reward function is relatively straightforward and can be assigned as a constant. The speed factor f_3 depends on how fast a merging behavior is considered appropriate and acceptable, and can be designated as a polygonal function that penalizes speed values that are too low or too high.
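The following is a hedged sketch of the immediate reward in Eqs. (3)-(7). The numeric factor values and the shaping functions g_21, g_22, and the polygonal speed factor are placeholders of our own choosing; the text specifies their roles as penalties but not their exact forms.

# Hedged sketch of the immediate reward, Eqs. (3)-(7). All numeric factors and
# shaping functions below are illustrative placeholders.
F1 = -0.1                      # acceleration penalty factor, Eq. (4) (placeholder)
F21, F22 = -1.0, -0.5          # gap-front / gap-back distance factors, Eq. (7) (placeholders)

def g_distance(gap_m):
    """Placeholder distance shaping: penalty magnitude grows as the gap shrinks."""
    return 1.0 / max(gap_m, 1.0)

def f_speed(speed, v_low=15.0, v_high=29.0):
    """Placeholder piecewise ('polygonal') speed factor f_3: negative outside a
    reasonable speed band, zero inside it."""
    if speed < v_low:
        return -0.05 * (v_low - speed)
    if speed > v_high:
        return -0.05 * (speed - v_high)
    return 0.0

def immediate_reward(accel, dis_gfv, dis_gbv, speed, near_merge_zone=True):
    r_accel = F1 * abs(accel)                                    # Eq. (4)
    if near_merge_zone:                                          # Eqs. (5), (7)
        r_dist = F21 * g_distance(dis_gfv) + F22 * g_distance(dis_gbv)
    else:
        r_dist = 0.0   # f_21, f_22 set to zero far from the merging zone
    r_speed = f_speed(speed) * speed                             # Eq. (6)
    return r_accel + r_dist + r_speed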
E. Q-function Approximator

The quadratic form of the Q-function approximator is specified as follows:

Q(s, a) = A(s) * (B(s) − a)² + C(s)    (8)

where A, B, and C are trainable coefficients, each produced by a neural network structure with the environment state as input. An illustration is shown in Fig. 1.

Fig. 1. Graph of the Q-function approximator.

Two computation graphs are embedded in this form of the Q-function approximator: one for obtaining the optimal action in a given state, and one for calculating the Q-value of a given state and action. In the optimal-action graph, the optimal action is obtained as a* = B(s), where B(s) is learned from the current state s. In the Q-value graph, the Q-value is calculated from the coefficients A(s), B(s), C(s) and the action a, where the coefficients are constructed by neural networks with the state s as input. A is kept negative through an activation function in its neural network, and B has the same structure as in the optimal-action graph.

In the learning process, the Q-network is updated with the following loss function:

Loss = Σ_{i=1}^{M} ( r_i + γ max_{a'} Q(s'_i, a', θ) − Q(s_i, a_i, θ) )²    (9)

where r + γ max_{a'} Q(s', a', θ) is called the target Q-value and Q(s, a, θ) is called the predicted Q-value in this paper, and θ is the set of Q-network parameters.

When the agent is trained with Equation (9), stability issues and correlations in the observed sequence affect the learning performance. Experience replay and a second Q-network are effective techniques to alleviate these problems [20]. For experience replay in our research, a mini-batch of M training samples (s, a, r, s') is selected from a replay memory and fed into the learning graph. For each sample tuple (s, a, r, s'), s is taken as input to the neural networks of A, B, and C to obtain their values, and a is also input to the Q-function approximator to obtain the predicted Q-value Q_p. The calculation of the target Q-value Q_t combines the optimal-action calculation and the Q-value calculation: it first calculates the optimal action a' from the next state s' using the optimal-action graph, and then calculates the Q-value Q(s', a') using the Q-value graph with the next state s' and the optimal action a' as inputs. To break the correlations, a second Q-network, called the target Q-network, which has the same structure as but different parameter values (θ⁻) from the original Q-network (θ), called the prediction Q-network, is used to calculate the target Q-values. The loss, expressed as the summed error between the predicted Q-values Q_p(s, a, θ) and the target Q-values Q_t(s, a, θ⁻), is rewritten as follows:

Loss = Σ_{i=1}^{M} ( r_i + γ max_{a'} Q(s'_i, a', θ⁻) − Q(s_i, a_i, θ) )²    (10)

where θ⁻ denotes the parameters of the target Q-network and θ denotes the parameters of the prediction Q-network. In the learning process, θ is updated at every time step while θ⁻ is updated periodically.
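A minimal sketch of the quadratic Q-function approximator of Eq. (8) is given below. PyTorch, the hidden-layer size, and the choice of softplus to keep A(s) negative are implementation assumptions of this sketch; the text only states that A, B, and C are produced by small two-layer networks with the state as input and that A is negative.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the quadratic Q-function approximator of Eq. (8):
#   Q(s, a) = A(s) * (B(s) - a)^2 + C(s),  with A(s) < 0,
# so that for a given state the optimal action is a* = B(s).
class QuadraticQ(nn.Module):
    def __init__(self, state_dim=6, hidden=64):   # state_dim=7 if the speed limit is included
        super().__init__()
        def head():
            return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.A = head()   # curvature coefficient (forced negative in forward)
        self.B = head()   # optimal action for the given state
        self.C = head()   # state-dependent value offset

    def forward(self, s, a):
        """Q-value graph: Q(s, a) for a batch of states s and actions a."""
        A = -F.softplus(self.A(s))   # activation keeps A(s) strictly negative
        return A * (self.B(s) - a) ** 2 + self.C(s)

    def optimal_action(self, s):
        """Optimal-action graph: a* = B(s), the maximizer of the quadratic in a."""
        return self.B(s)

With a target copy of this network, the loss of Eq. (10) is the squared difference between r + γ Q_target(s', optimal_action(s')) and Q(s, a), summed over a mini-batch and minimized by stochastic gradient descent.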
A step-by-step learning procedure is shown in Fig. 2.

Fig. 2. Reinforcement learning procedure.

IV. SIMULATION AND RESULTS

A. Simulation Settings

We train our reinforcement learning agent in simulated ramp merging scenarios where the ramp is a 3.5 m wide lane and the main highway is a two-way, four-lane highway with a lane width of 3.75 m. The highway speed limit is 65 mi/h. The highway traffic is composed of randomly emerging vehicles with random initial speeds at the entrance of the highway section, and a highway vehicle performs car-following behavior when it is close to its leading vehicle. More importantly, the highway vehicles can yield to or surpass the ego vehicle when the ego vehicle is about to merge onto the highway, representing real-world cooperative or adversarial situations. On the ramp, there is always one ramp vehicle (i.e., the ego vehicle) travelling towards the highway. After a ramp vehicle completes its merging task, another ramp vehicle departs at the beginning of the ramp and becomes the new ego vehicle.

The ego vehicle is assumed to be equipped with a suite of sensors including lidar, radar, camera, a digital map, DGPS (Differential Global Positioning System), and an IMU (Inertial Measurement Unit), and can gather the dynamic information of itself and its surrounding vehicles within a vicinity of 150 m; the measurements are also assumed to be accurate enough to meet our requirements. These assumptions are far from realistic situations where the observation range may be partially occluded or shortened and the measurements imprecise or inaccurate. Within the scope of this paper, we keep these assumptions, but in future work the sensing capabilities of the ego vehicle can be adjusted to represent various scenarios and measurement conditions.

B. Training

The training procedure is summarized in Table 1.

TABLE 1. Training Procedure.

In our study, we design each of the networks for A, B, and C as a two-layer neural network. The total number of training steps N is set to 1,600,000, during which around 8,000 ramp vehicles perform the ramp merging maneuver. The data update interval dt is set to 0.1 s. The action update step k is set to 4. The size of the replay mini-batch M is set to 32. The target Q parameter update step p is set to 500. The discount factor γ in the calculation of Q_t is set to 0.95. The learning rate α in the backpropagation is set to 0.001.
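As a rough illustration of the training procedure described above, the following Python sketch assembles a loop from the stated hyperparameters (experience replay with mini-batch M, the action held for k steps, target-network parameters synchronized every p steps, loss of Eq. (10)). The toy environment, the exploration noise, and the replay memory size are placeholders of our own choosing, and the QuadraticQ class is reused from the sketch in Section III.E.

import random
from collections import deque
import torch

# Hedged sketch of the training loop of Sec. IV.B; the environment and the
# exploration scheme below are placeholders, not the paper's implementation.
N_STEPS, K, BATCH, TARGET_SYNC, GAMMA, LR = 1_600_000, 4, 32, 500, 0.95, 1e-3

class ToyMergeEnv:
    """Placeholder environment: random 6-dim states, penalty-style rewards."""
    def reset(self):
        return torch.randn(1, 6)
    def step(self, accel):
        return torch.randn(1, 6), -abs(accel), random.random() < 0.01

def train():
    env = ToyMergeEnv()
    q_net, q_target = QuadraticQ(), QuadraticQ()
    q_target.load_state_dict(q_net.state_dict())
    opt = torch.optim.SGD(q_net.parameters(), lr=LR)   # alpha = 0.001
    replay = deque(maxlen=100_000)                     # replay size is an assumption
    s, a = env.reset(), 0.0
    for step in range(N_STEPS):
        if step % K == 0:                              # hold the action for k steps
            with torch.no_grad():
                a = float(q_net.optimal_action(s)) + random.gauss(0, 0.5)  # placeholder noise
            a = max(-4.5, min(2.5, a))                 # acceleration limits (Sec. III.C)
        s_next, r, done = env.step(a)                  # state updates every dt = 0.1 s
        replay.append((s, torch.tensor([[a]]), torch.tensor([[r]]), s_next))
        if len(replay) >= BATCH:
            sb, ab, rb, sn = map(torch.cat, zip(*random.sample(replay, BATCH)))
            with torch.no_grad():                      # target Q-value, Eq. (10)
                q_t = rb + GAMMA * q_target(sn, q_target.optimal_action(sn))
            loss = ((q_t - q_net(sb, ab)) ** 2).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
        if step % TARGET_SYNC == 0:                    # periodic target-network update
            q_target.load_state_dict(q_net.state_dict())
        s = env.reset() if done else s_next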
C. Results

The loss calculated from Q_p and Q_t is plotted against the training steps in Fig. 3. To save memory, loss values are stored every 5 steps. The curve shows a clear decaying and converging trend despite a few spikes along the way. Some spikes are to be expected, since in daily driving one may encounter extreme situations where an unusual action such as hard braking is required.

We also accumulated the immediate rewards of each ramp vehicle over a complete merging task. Fig. 4 shows this total reward (named the single total reward) for all 8,000 vehicles in the simulation. Specifically, each point on the curve is the cumulative sum of the immediate rewards that a ramp vehicle obtains at each time step during its merging process. Note that the values are always negative, since the immediate rewards are defined as penalties whose values are always negative by our definition.

Fig. 3. Training loss curve.

Fig. 4. Curve of single total rewards of ramp vehicles.

Recall that the total reward is composed of four parts: the reward from the distance to the gap-front vehicle, the reward from the distance to the gap-back vehicle, the reward from the acceleration, and the reward from the speed. The four corresponding curves are plotted against the training steps in Fig. 5.

Fig. 5. Individual rewards from the total reward.

From Fig. 5 we can see that the reward curves of the distance to the gap-front vehicle (single_total_reward_dis_to_front) and the vehicle acceleration (single_total_reward_acce) show apparent convergence. In these two graphs, the rewards rise from large negative values to relatively small values and show a steady trend, similar to the single total reward curve. In contrast, the reward curves of the distance to the gap-back vehicle (single_total_reward_dis_to_back) and the speed (single_total_reward_speed) show a higher level of fluctuation. The explanation for the single_total_reward_dis_to_back curve is that we put greater emphasis on front safety in our design, so the vehicle agent learns to keep a relatively large distance to the preceding vehicle while compromising the distance to the gap-back vehicle. Another reason is that the distance to the gap-back vehicle is not entirely controlled by the ramp merging vehicle, as it is also affected by the actions of the gap-back vehicle. The fluctuation of the speed reward curve is more intuitive to understand, because the speed is adjusted to accommodate the smoothness and safety objectives and it has the least weight compared with the other three parts of the reward function.

V. CONCLUSION AND DISCUSSION

In this work, we adopted a reinforcement learning approach for developing an on-ramp merge driving policy. Our key contribution is that we treat the state space and action space as continuous, as in the real-world situation, in order to learn a practical automated control policy. The reward function is designed based on the intuitive concerns of human drivers in a merging situation, where safeness, smoothness, and promptness are the primary attributes reflecting the success of a merge maneuver. It is formulated with the vehicle acceleration, speed, and distances to the gap vehicles, which are all explicit variables that can directly measure the performance of the merging maneuver. Another contribution of our work is the unique format of the proposed Q-function approximator, which guarantees the existence of an optimal action in a given state without complicating the structure of the neural networks. The training results show that, as training progresses, the automated vehicle agent is able to learn to merge onto the highway safely, smoothly, and in a timely manner, which indicates the validity of our methodology. There is still room to further improve the performance of the learning agent.
One direction is to fine-tune the reinforcement learning model by trying different hyperparameters, for example the structure of the neural networks, the update frequency of the target Q-network weights, etc. Besides, proper state feature engineering and different reward function compositions are additional promising directions worth investigating in the future.

REFERENCES

1. http://www.nvidia.com/object/drive-px.html
2. https://www.tesla.com/autopilot
3. http://www.volvocars.com/intl/about/our-innovation-brands/intellisafe/autonomous-driving/drive-me
4. https://www.google.com/selfdrivingcar/
5. http://www.detroitnews.com/story/business/autos/general-motors/2017/04/10/gms-super-cruise-debut-fall-cadillac-ct/100287748/
6. Marinescu, D., Čurn, J., Bouroche, M., Cahill, V. On-ramp traffic merging using cooperative intelligent vehicles: a slot-based approach. 15th International IEEE Conference on Intelligent Transportation Systems, Alaska, USA, 2012.
7. Chen, X., Jin, M., Chan, C., Mao, Y., Gong, W. Decision-making analysis during urban expressway ramp merge for autonomous vehicle. 96th Annual Meeting of the Transportation Research Board, Washington, D.C., 2017.
8. Lin, L.-J. Reinforcement learning for robots using neural networks. Carnegie Mellon University, USA, 1993.
9. Narasimhan, K., Kulkarni, T., Barzilay, R. Language understanding for text-based games using deep reinforcement learning. arXiv:1506.08941v2, 2015.
10. Li, X., Li, L., Gao, J., He, X., Chen, J., Deng, L., He, J. Recurrent reinforcement learning: A hybrid approach. arXiv:1509.03044v2, 2015.
11. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
12. Fares, A., Gomaa, W. Freeway ramp-metering control based on reinforcement learning. 11th IEEE International Conference on Control & Automation (ICCA), Taiwan, 2014.
13. Yang, H., Rakha, H. Reinforcement learning ramp metering control for weaving sections in a connected vehicle environment. 96th Annual Meeting of the Transportation Research Board, Washington, D.C., 2017.
14. Ngai, D., Yung, N. Automated vehicle overtaking based on a multiple-goal reinforcement learning framework. Proceedings of the 2007 IEEE Intelligent Transportation Systems Conference, USA, 2007.
15. Yu, A., Palefsky-Smith, R., Bedi, R. Deep reinforcement learning for simulated autonomous vehicle control. Stanford University, 2016.
16. Sallab, A., Abdou, M., Perot, E., Yogamani, S. End-to-end deep reinforcement learning for lane keeping assistance. 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016.
17. Shalev-Shwartz, S., Shammah, S., Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv:1610.03295v1, 2016.
18. Wang, P., Chan, C. Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge. IEEE 20th International Conference on Intelligent Transportation Systems, Yokohama, Japan, 2017.
19. Bokare, P.S., Maurya, A.K. Acceleration-deceleration behaviour of various vehicle types. World Conference on Transport Research, Shanghai, 2016, pp. 4737-4753.
20. Adam, S., Buşoniu, L., Babuška, R. Experience replay for real-time reinforcement learning control.
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), Vol. 42, No. 2, pp. 201-212, 2012.