Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios

Yiren Lu1, Justin Fu1, George Tucker2, Xinlei Pan1, Eli Bronstein1, Rebecca Roelofs2, Benjamin Sapp1, Brandyn White1, Aleksandra Faust2, Shimon Whiteson1, Dragomir Anguelov1, Sergey Levine2,3

1Waymo Research. 2Google Research, Brain Team. 3UC Berkeley. Point of contact: maxlu@waymo.com. Webpage: waymo.com/research/imitation-is-not-enough-robustifying-imitation-with-reinforcement-learning

Abstract—Imitation learning (IL) is a simple and powerful way to use high-quality human driving data, which can be collected at scale, to produce human-like behavior. However, policies based on imitation learning alone often fail to sufficiently account for safety and reliability concerns. In this paper, we show how imitation learning combined with reinforcement learning using simple rewards can substantially improve the safety and reliability of driving policies over those learned from imitation alone. In particular, we train a policy on over 100k miles of urban driving data, and measure its effectiveness in test scenarios grouped by different levels of collision likelihood. Our analysis shows that while imitation can perform well in low-difficulty scenarios that are well-covered by the demonstration data, our proposed approach significantly improves robustness on the most challenging scenarios (over 38% reduction in failures). To our knowledge, this is the first application of a combined imitation and reinforcement learning approach in autonomous driving that utilizes large amounts of real-world human driving data.

Fig. 1: The demonstration-reward trade-off. As the amount of data for a particular scenario decreases, reward signals become more important for learning. We show a few visual examples representing scenarios with different frequencies.

I. INTRODUCTION

Building an autonomous driving system that is deployable at scale presents many difficulties. First and foremost is the challenge of handling the numerous rare and challenging edge cases that occur in real-world driving. To this end, imitation-learning-based approaches have been proposed that allow the performance of the method to scale with the amount of data available [1], [2], [3]. While situations that are well represented in the demonstration data are likely to be handled correctly by such a policy, more unusual or dangerous situations that occur only rarely in the data might cause the imitation policy, which has not been explicitly instructed on what constitutes a risky or inappropriate response, to respond unpredictably. The problem is compounded by complex interactions, where human expert driving data in similar scenarios may be scarce and sub-optimal [4].

Reinforcement learning (RL) has the potential to resolve this by leveraging explicit reward functions that tell the policy what constitutes safe or unsafe outcomes (e.g., collisions). Furthermore, because RL methods train in closed loop, RL policies can establish causal relationships between observations, actions, and outcomes. This yields policies that are (1) less vulnerable to the covariate shift and spurious correlations commonly seen in open-loop IL [5], [6], and (2) aware of safety considerations that are encoded in their reward function but only implicit in the demonstrations.

However, relying on RL alone, e.g., [7], [8], [9], is also problematic because it heavily depends on reward design, which is an open challenge in autonomous driving [10]. Without accounting for imitation fidelity, driving policies trained with RL may be technically safe but unnatural, and may have a hard time making forward progress in situations that demand human-like driving behavior to coordinate with other agents and follow driving conventions. IL and RL offer complementary strengths: IL increases realism and eases the reward design burden, while RL improves safety and robustness, especially in rare and challenging scenarios in the absence of abundant data (Fig. 1).
In this paper we focus on the driving scenarios that are most likely to exhibit safety and reliability concerns, leveraging the difficulty estimation from [11]. Our proposed method, BC-SAC, combines IL and RL with a simple reward function, and trains on difficult driving scenarios. Difficulty is estimated via a classifier that predicts the likelihood of a collision or near-miss when a segment is re-simulated with a pre-trained planning policy. Our proposed reward function enforces the safety of the agent, while natural driving behaviors are implicitly learned with IL. The training data comes from a subset of real-world human driving data (over 100k miles of real-world urban driving data) [11]. We demonstrate that this approach substantially improves the safety and reliability of the learned policies over imitation alone without compromising on human-like behavior, showing 38% and 40% improvements over pure IL and RL baselines.

The main contributions of our work are: (1) We conduct the first large-scale application of a combined IL and RL approach in autonomous driving utilizing large amounts of real-world urban human driving data (over 100k miles) and a simple reward function. (2) We systematically evaluate its performance and baseline performance by slicing the dataset by difficulty, demonstrating that combining IL and RL improves the safety and reliability of policies over those learned from imitation alone (over 38% reduction in safety events on the most difficult bucket).

II. RELATED WORK

Learning-based approaches in autonomous driving. We briefly summarize key properties of different learning-based algorithms for planning in Table I. IL was among the earliest and most popular learning-based approaches adopted for deriving driving policies [1], [2], [24], [25], [26], [27]. Controllable models trained with either IL [3], [28] or RL [8] allow the user to specify high-level commands in the form of goals or control signals (e.g., left, right, straight) to combine higher-level route planning with low-level control.

Two drawbacks of IL methods are: (1) open-loop IL (such as the widely used behavioral cloning approach [12], [14], [13], [29], [30]) suffers from covariate shift [5], which can be addressed with closed-loop training [15], [16]; and (2) IL methods lack explicit knowledge of what constitutes good driving, such as collision avoidance. RL methods have been proposed that allow the policy to learn from explicit reward signals with closed-loop training, and have been applied to tasks such as lane keeping [31], intersection traversal [32], and lane changing [33]. While these works show the efficacy of RL on specific scenarios, our work analyzes both the large-scale, aggregate performance and the challenging, safety-critical edge cases that make autonomous driving difficult to deploy in a real-world system.

RL and other closed-loop methods for autonomous driving typically use simulation for training. There are a number of such public environments, which vary in how realistic they are, in particular in what drives the simulated agents (e.g., expert-following/log playback [34], [35], [36], [37], the intelligent driver model (IDM) [38], other rule-based systems [39], or ML-based agents [38], [40]) and in whether scenarios are procedurally generated (e.g., [39], [41], [40]) or initialized from real-world driving scenes [42], [36], [34], [37]. In our experiments, we develop and evaluate in closed loop on real-world data with other agents following logs.

Combining IL and RL. Methods such as DQfD [22], DDPGfD [43], and DAPG [44] have shown that IL can help RL overcome exploration challenges in domains with known sparse rewards. Offline RL approaches, such as TD3+BC [21] and CQL [20], combine RL objectives with IL ones to regularize Q-learning updates and avoid overestimating out-of-distribution values. Our goal is not to propose a novel algorithmic combination of IL and RL, but rather to leverage this general approach to address challenges in autonomous driving at scale.

Addressing challenging and safety-critical scenarios for autonomous vehicles. [4] learns policies that address long-tail scenarios in autonomous driving by using an ensemble of IL planners combined with model-predictive control. Another approach to improving safety is to augment a learned planner with a rule-based fallback layer that guarantees safety [45], [25]. Our work differs from these approaches in that we directly incorporate safety awareness into the model learning process through a reward. Our method is also compatible with a fallback layer if needed, although this is potential future work. Another way to improve the robustness of policies is to increase the frequency of negative examples during training. [46] collects failure data that covers various ways an unmanned aerial vehicle can crash, and the combined negative and positive data helps to train more robust policies. [11] investigates the use of curriculum training to improve performance on challenging edge cases. While we also increase the exposure of the policy to challenging scenarios during training, we extend these findings by showing how RL yields outsized improvements on the hardest scenarios.
TABLE I: A comparison of different learning-based approaches to robotic control and autonomous driving.

Method                    | Offline Demo    | Closed-loop | Rewards | Example Methods
Behavior Cloning (BC)     | Expert Demos    | No          | No      | MultiPath [12], PRECOG [13], Trajectron++ [14]
Adversarial Imitation/IRL | Expert Demos    | Yes         | No      | IRL [15], GAIL [16], MGAIL [17]
RL                        | No              | Yes         | Yes     | DQN [18], SAC [19]
Offline RL                | Behavioral Data | No          | Yes     | CQL [20], TD3+BC [21]
"Imitative" RL            | Expert Demos    | Yes         | Yes     | DQfD [22], DAPG [23], BC-SAC (ours)

III. BACKGROUND

A. Markov Decision Processes (MDPs)

In this work, we cast the autonomous driving policy learning problem as a Markov decision process. Following standard formalism, we define an MDP as a tuple {S, A, T, R, γ, ρ_0}. S and A denote the state and action spaces, respectively. T denotes the transition model, R the reward function, γ the discount factor, and ρ_0 the initial state distribution. The objective is to find a policy π, a (stochastic) mapping from S to A, that maximizes the expected discounted sum of rewards, π* = arg max_π E_{T,π,ρ_0}[∑_{t=0}^∞ γ^t R(s_t, a_t)].

B. Imitation Learning (IL)

IL constructs an optimal policy by mimicking an expert. We assume an expert (an optimal policy), denoted as π_β, produces a dataset of trajectories D = {s_0, a_0, ..., s_N, a_N} through interaction with the environment. The learner's goal is to train a policy π that imitates π_β. In practice, we only observe the expert states, so we estimate expert actions using inverse dynamics. For example, behavioral cloning (BC) trains the policy via a log-likelihood objective, E_{s,a∼D}[log π(a|s)]. Alternatively, closed-loop approaches include inverse RL (IRL) [15] and adversarial IL (GAIL [16], MGAIL [17]), which instead aim to match the occupancy measure or state-action visitation distribution between the policy and the expert directly, rather than indirectly through the conditional action distribution. In principle, this can resolve the covariate shift issue that affects open-loop imitation [5].

C. Reinforcement Learning (RL)

RL aims to learn an optimal policy through an iterative, online trial-and-error process. In this work we use off-policy, value-based RL algorithms such as Q-learning. These methods aim to learn the state-action value function, defined as the expected future return when starting from a particular state and action:

Q^π(s,a) = E_{T,π,ρ_0}[∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a].

In this work, we use an actor-critic method for training continuous control policies. Typical actor-critic methods alternate between training a critic Q to minimize the Bellman error and an actor π to maximize the value function. We use the entropy-regularized updates of Soft Actor-Critic (SAC) [19]:

min_Q E_{s,a,s′∼π}[(Q(s,a) − Q̂(s,a,s′))²],   (1)

max_π E_{s,a∼π}[Q(s,a) + H(π(·|s))],   (2)

where

Q̂(s,a,s′) = r(s,a) + γ E_{a′∼π}[Q̄(s′,a′) − log π(a′|s′)]   (3)

and Q̄ denotes a target network that is a copy of the critic through which gradients do not pass.
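For concreteness, the following is a minimal numerical sketch (not the implementation used in this work) of the soft target Q̂ in Eq. 3 and the squared Bellman error in Eq. 1 for a small batch; the critic, target critic, policy, and data are random placeholders.

```python
import numpy as np

# Minimal sketch of the soft critic target (Eq. 3) and Bellman error (Eq. 1).
# The critic, target critic, and policy below are random stand-ins; only the
# target computation itself follows the equations above.
rng = np.random.default_rng(0)
state_dim, action_dim, batch, gamma = 8, 2, 4, 0.92

W_q = rng.normal(size=(state_dim + action_dim,))  # placeholder linear critic Q
W_q_target = W_q.copy()                           # frozen copy standing in for Q-bar

def q_target(s, a):
    return np.concatenate([s, a], axis=-1) @ W_q_target

def sample_policy(s):
    """Placeholder stochastic policy: unit Gaussian actions and their log-probs."""
    a = rng.normal(size=(s.shape[0], action_dim))
    log_prob = -0.5 * np.sum(a**2 + np.log(2 * np.pi), axis=-1)
    return a, log_prob

s, a = rng.normal(size=(batch, state_dim)), rng.normal(size=(batch, action_dim))
s_next, r = rng.normal(size=(batch, state_dim)), rng.normal(size=(batch,))

a_next, log_prob_next = sample_policy(s_next)
# Eq. 3: soft Bellman target with entropy bonus, using the target critic.
q_hat = r + gamma * (q_target(s_next, a_next) - log_prob_next)
# Eq. 1: the critic is regressed onto q_hat with a squared error.
bellman_error = np.mean((np.concatenate([s, a], axis=-1) @ W_q - q_hat) ** 2)
print(bellman_error)
```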
IV. LEARNING TO DRIVE WITH RL-AUGMENTED BC

We wish to design an approach that benefits from the complementary strengths of IL and RL. Imitation provides an abundant source of learning signal without the need for reward design, and RL addresses the weaknesses of IL in rare and challenging scenarios where data is scarce. Following this intuition, we formulate an objective that utilizes the learning signal from demonstrations where data is abundant and the reward signal where data is scarce. Specifically, we utilize a weighted mixture of the IL and RL objectives:

max_π E_{T,π,ρ_0}[∑_{t=0}^∞ γ^t R(s_t, a_t)] + λ E_{s,a∼D}[log π(a|s)].   (4)

A. Behavior Cloned Soft Actor-Critic (BC-SAC)

While in principle a variety of RL methods could be combined with IL to optimize Eq. 4, a convenient choice for efficient training is to use actor-critic algorithms, in which case the policy can be optimized with respect to Eq. 4 simply by adding the imitation learning objective to the expected value of the Q-function (i.e., the critic), similarly to DAPG [23] or TD3+BC [21]. Building on the widely used SAC framework, which further adds an entropy regularization objective to the actor, we obtain our full actor objective:

E_{s,a∼π}[Q(s,a) + H(π(·|s))] + λ E_{s,a∼D}[log π(a|s)].

The critic update remains the same as in SAC, outlined in Eq. 1. With the appropriate setting of λ, this objective encourages the policy to mimic the expert data when it is within the data distribution D. However, in out-of-distribution states the policy primarily relies on the reward to learn. Fig. 2 visualizes this concept.

Fig. 2: Influence of the different objectives. For in-distribution states, both the IL and RL objectives provide learning signal. For out-of-distribution states, the RL objective dominates.
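For illustration, one way the combined actor objective could be written is sketched below; this is not the paper's implementation. The networks, batch sizes, and il_weight (playing the role of λ) are placeholder assumptions, and the tanh squashing of the policy described in Sec. IV-D is omitted for brevity.

```python
import torch
import torch.nn as nn

# Minimal sketch of the BC-SAC actor objective: the SAC term
# E[Q(s, a_pi) + H(pi(.|s))] plus the weighted BC term
# lambda * E[log pi(a_expert | s_expert)].
state_dim, action_dim, il_weight = 8, 2, 1.0

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def policy_dist(states):
    # Diagonal Gaussian policy head (the paper additionally applies a tanh squash).
    mean, log_std = actor(states).chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

def actor_loss(rollout_states, demo_states, demo_actions):
    # SAC term: reparameterized actions from the current policy, scored by the
    # critic, plus the policy entropy.
    dist = policy_dist(rollout_states)
    actions = dist.rsample()
    sac_term = critic(torch.cat([rollout_states, actions], dim=-1)).mean() + dist.entropy().sum(-1).mean()
    # BC term: log-likelihood of expert actions under the current policy.
    bc_term = policy_dist(demo_states).log_prob(demo_actions).sum(-1).mean()
    return -(sac_term + il_weight * bc_term)  # minimize the negative objective

loss = actor_loss(torch.randn(16, state_dim), torch.randn(16, state_dim), torch.randn(16, action_dim))
loss.backward()  # an optimizer step on the actor parameters would follow
```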
B. Reward Function

While designing a reward function to capture "good" driving behavior is an open challenge [10], we can side-step this issue by relying on the imitation learning loss to primarily guide the policy, while the simple reward function only needs to encode safety constraints. To this end, we use a combination of collision and off-road distances as our reward signal. The collision reward is

R_collision = min(d_collision − d_coffset, 0),   (5)

where d_collision is the Euclidean distance in meters between the closest points of the ego vehicle and the nearest bounding box of other vehicles, and d_coffset (default 1.0) is an offset added to encourage the vehicle to keep a distance from nearby objects. The off-road reward is

R_off-road = clip(−d_to-edge − d_ooffset, −2.0, 0.0),   (6)

where d_to-edge is the distance in meters from the vehicle to the nearest road edge (negative being on-road, positive being off-road), and d_ooffset (default 1.0) is an offset to encourage the vehicle to keep a distance from the road edge. We combine the rewards additively, such that R = R_collision + R_off-road.
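For concreteness, a minimal sketch of this shaped reward with the default offsets is shown below; the example distances are hypothetical.

```python
import numpy as np

# Minimal sketch of the shaped safety reward in Eqs. 5-6, using the default
# offsets from the text; the example distances below are hypothetical.
def safety_reward(d_collision, d_to_edge, d_coffset=1.0, d_ooffset=1.0):
    """d_collision: distance (m) to the nearest other vehicle's bounding box.
    d_to_edge: distance (m) to the nearest road edge (negative = on-road)."""
    r_collision = min(d_collision - d_coffset, 0.0)           # Eq. 5
    r_off_road = np.clip(-d_to_edge - d_ooffset, -2.0, 0.0)   # Eq. 6
    return r_collision + r_off_road                           # R = R_collision + R_off-road

print(safety_reward(d_collision=4.0, d_to_edge=-3.0))  # comfortably clear: 0.0
print(safety_reward(d_collision=0.4, d_to_edge=-0.5))  # close to a vehicle and the edge: -1.1
```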
C. Forward and Inverse Vehicle Dynamics Models

We update the vehicle's state using the kinematic bicycle dynamics model [47], which computes the vehicle's next pose (x, y, θ) given a steering and acceleration action a = (a_steer, a_accel). In order to obtain expert actions for imitation learning, we use an inverse dynamics model to solve for the actions that would have achieved the same states as the logged trajectories in our dataset. These expert actions are found by minimizing the MSE of the corners' (x, y) positions between the inferred state T(s_t, a_t) and the ground-truth next state s_{t+1}.
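For illustration, the sketch below shows one common discrete-time form of the kinematic bicycle step and a brute-force version of the corner-matching inverse dynamics; the wheelbase, bounding-box size, timestep, and candidate action grid are assumptions rather than the parameters used in this work.

```python
import numpy as np

def bicycle_step(x, y, theta, v, a_steer, a_accel, wheelbase=2.8, dt=1.0 / 15.0):
    """Advance pose (x, y, theta) and speed v by one 15 Hz step (semi-implicit Euler)."""
    v_next = v + a_accel * dt
    x_next = x + v_next * np.cos(theta) * dt
    y_next = y + v_next * np.sin(theta) * dt
    theta_next = theta + v_next / wheelbase * np.tan(a_steer) * dt
    return x_next, y_next, theta_next, v_next

def corner_points(x, y, theta, length=4.8, width=2.1):
    """(x, y) positions of the vehicle's four bounding-box corners."""
    offsets = np.array([[length, width], [length, -width], [-length, width], [-length, -width]]) / 2.0
    rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    return offsets @ rot.T + np.array([x, y])

def fit_expert_action(state, logged_next_pose, candidates):
    """Pick the candidate (a_steer, a_accel) whose predicted corners best match the log."""
    target = corner_points(*logged_next_pose)
    def corner_mse(action):
        nx, ny, nth, _ = bicycle_step(*state, *action)
        return float(np.mean((corner_points(nx, ny, nth) - target) ** 2))
    return min(candidates, key=corner_mse)

candidates = [(s, a) for s in np.linspace(-0.3, 0.3, 13) for a in np.linspace(-4.0, 4.0, 17)]
state = (0.0, 0.0, 0.0, 10.0)                            # x, y, heading, speed
logged_next_pose = bicycle_step(*state, 0.05, 1.0)[:3]   # pretend this pose came from the logs
print(tuple(round(v, 2) for v in fit_expert_action(state, logged_next_pose, candidates)))
# recovers approximately (0.05, 1.0)
```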
D. Model Architecture

We use a dual actor-critic architecture similar to TD3 and SAC [48], [19]: the main components are an actor network π(a|s), a double Q-critic network Q(s,a), and a target double Q-critic network Q̄(s,a). Each network has a separate Transformer observation encoder described in [49] that encodes features including all vehicle states, road-graph points, traffic light signals, and route goals. The actor network outputs a tanh-squashed diagonal Gaussian distribution parameterized by a mean µ and variance σ.

E. Training on Difficult Examples

The performance of learning-based methods strongly depends on the training data distribution, especially in safety-critical settings with long-tail distributions [50], [51], [45], [52], [53]. Autonomous driving falls in this category: most scenarios are mundane, but a sizable minority of scenarios have critical safety concerns. Following [11], which demonstrated that training on more difficult examples results in better performance than using unbiased training distributions, we explore how the training distribution affects the method's performance.

V. EXPERIMENTS

A. Experimental Setup

Datasets. We use a dataset (denoted All) consisting of over 100k miles of expert driving trajectories, split into 10-second segments, collected from a fleet of vehicles operating in San Francisco (SF) [11]. We divide these segments into 6.4 million for training and 10k for testing. Trajectories from the same vehicle operating on the same day are stored in the same partition to avoid train-test leakage. The trajectories, which are sampled at 15 Hz, contain features describing the autonomous vehicle (AV) state and the state of the environment as measured by the AV's perception system. We use the difficulty model described by [11] as a proxy for measuring the rarity of events, since it is difficult to directly construct a scenario-level out-of-distribution estimator, and challenging scenarios are generally less frequent. Given a run segment, the difficulty model predicts whether the segment will result in a collision or near-miss when re-simulated with an internal AV planner. We trained the difficulty model in a supervised manner using a cross-entropy loss on a dataset consisting of 5.6k positive examples and 80k negative examples with binary human labels. We create the Top1, Top10, and Top50 subsets by selecting the top 1% (40k train, 1.2k test), 10% (400k train, 19k test), and 50% (2 million train, 66k test) percentiles of difficulty model scores from a chronologically separate dataset of 4 million segments, respectively.

Simulation. As mentioned in Sec. IV-C, vehicle dynamics are modeled using a 2D bicycle dynamics model. The behavior of other vehicles and pedestrians in the scene is replayed from the logs (log playback), similarly to [34], [35], [36]. While this means that the other agents are non-reactive, it ensures that their behavior is human-like, and the inclusion of imitative losses discourages the learned policy from deviating too far from the logs, which would cause the log-playback agents to become unrealistic. We also use short segments of 10 s to mitigate pose divergence.

Baselines. We compare our method to both open-loop (BC [1]) and closed-loop (MGAIL [17]) imitative methods. The latter takes advantage of closed-loop training and the differentiability of the simulator dynamics. For completeness, we also include a SAC baseline to represent an RL-only approach.

Metrics. We evaluate agents using two metrics:
1) Failure Rate: Percentage of the run segments that have at least one Collision or Off-road event at any timestep. Collision is true if the bounding box of the ego vehicle intersects with a bounding box of another object. Off-road is true if the bounding box of the ego vehicle deviates from the drivable surface according to the map.
2) Route Progress Ratio: Ratio of the distance traveled along the route by the policy compared to the expert demonstration. We project the ego vehicle's state onto the route and compute the total length from the start of the route.
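As an illustration, the two metrics could be computed from per-segment rollout records roughly as follows; the record fields are hypothetical stand-ins for the simulator's outputs.

```python
import numpy as np

# Small sketch of the two evaluation metrics above, computed from per-segment
# rollout records. The record fields (collision/offroad flags, progress values)
# are hypothetical stand-ins for the simulator's outputs.
def failure_rate(segments):
    """Percentage of segments with at least one collision or off-road event."""
    failures = [any(step["collision"] or step["offroad"] for step in seg["steps"]) for seg in segments]
    return 100.0 * np.mean(failures)

def route_progress_ratio(segments):
    """Mean ratio of policy route progress to expert route progress, in percent."""
    ratios = [seg["policy_progress_m"] / seg["expert_progress_m"] for seg in segments]
    return 100.0 * np.mean(ratios)

segments = [
    {"steps": [{"collision": False, "offroad": False}] * 150,
     "policy_progress_m": 92.0, "expert_progress_m": 100.0},
    {"steps": [{"collision": False, "offroad": False}] * 149 + [{"collision": True, "offroad": False}],
     "policy_progress_m": 70.0, "expert_progress_m": 100.0},
]
print(failure_rate(segments), route_progress_ratio(segments))  # 50.0 81.0
```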
B. Results

We evaluate the baseline methods (BC, MGAIL, SAC) and our method (BC-SAC) trained on several subsets of the training dataset (All, Top10, and Top1), and evaluate against subsets of the evaluation set (Top1, Top10, Top50, All) in Table II. All configurations are evaluated with three random seeds, reporting mean and standard deviation. Previously, [11] showed that training MGAIL on Top10 yields performance similar to training on All. Similarly, we find that all methods perform best when trained on Top10. Notably, BC trained on Top1 performs significantly worse compared to training on All or Top10, which reflects the fact that imitation learning methods rely on large amounts of data to implicitly infer driving preferences. In contrast, BC-SAC performs robustly when trained on Top1. Given that all methods perform best when trained on Top10, we focus on that setting in the following subsections.

TABLE II: Failure rates (lower is better) and progress ratios (higher is better) of BC-SAC and baselines on different training/evaluation subsets.

Method | Training | Top1 (%)  | Top10 (%) | Top50 (%) | All (%)   | Route Progress Ratio, All (%)
BC     | All      | 9.74±0.49 | 6.72±0.47 | 5.14±0.39 | 4.35±0.27 | 99.00±0.39
MGAIL  | All      | 7.28±0.98 | 4.22±0.77 | 3.40±0.97 | 2.48±0.29 | 99.55±1.91
SAC    | All      | 5.29±0.66 | 4.64±1.08 | 4.12±0.74 | 6.66±0.44 | 77.82±8.21
BC-SAC | All      | 3.72±0.62 | 2.88±0.23 | 2.64±0.21 | 3.35±0.31 | 95.26±8.64
BC     | Top10    | 5.79±0.82 | 3.45±0.72 | 2.71±0.57 | 3.64±0.31 | 98.06±0.18
MGAIL  | Top10    | 4.21±0.95 | 2.57±0.52 | 2.20±0.52 | 2.45±0.35 | 96.57±1.19
SAC    | Top10    | 4.33±0.47 | 4.11±0.63 | 3.66±0.47 | 5.60±0.86 | 71.05±2.47
BC-SAC | Top10    | 2.59±0.31 | 2.01±0.29 | 1.76±0.20 | 2.81±0.26 | 87.63±0.58
BC     | Top1     | 7.66±1.13 | 7.84±0.92 | 6.63±0.78 | 6.85±0.65 | 94.10±1.00
MGAIL  | Top1     | 4.24±0.95 | 3.16±0.43 | 2.74±0.46 | 3.79±0.46 | 93.10±11.72
SAC    | Top1     | 4.15±0.31 | 3.87±0.12 | 3.46±0.16 | 5.98±1.03 | 75.63±2.19
BC-SAC | Top1     | 3.61±0.87 | 2.96±1.11 | 2.69±0.87 | 3.38±0.48 | 75.00±17.21

Fig. 3: Failure rates on the most challenging evaluation sets: Top1 and Top10 (lower is better, with training on All and Top10). BC-SAC consistently achieves the lowest error rates.

Fig. 4: Failure rates of BC, MGAIL, and BC-SAC across scenarios of varying difficulty levels (50%-100%, lower is better). While all methods perform worse as the evaluation dataset becomes more challenging, BC-SAC always performs best and shows the least degradation.

BC-SAC comparison to imitation methods (BC, MGAIL) in the challenging scenarios. Figure 4 compares BC-SAC against BC and MGAIL across the evaluation dataset slices according to difficulty level. BC-SAC achieves better performance overall, especially in the more challenging slices where the performance of both BC and MGAIL substantially degrades. Additionally, BC-SAC has the lowest variance in performance across scenarios of varying difficulty (σ = 0.37) vs. BC (σ = 1.29) and MGAIL (σ = 0.78).

BC-SAC comparison to RL-only training (SAC). In all configurations, BC-SAC outperforms SAC in terms of safety metrics (Table II), likely because BC-SAC also utilizes learning signal from a large amount of demonstrations. SAC generates actions that deviate significantly from the demonstrations, with more boundary action values yielding unnatural (more swerves) and uncomfortable (abrupt acceleration) driving behavior (Figure 5). With a BC loss, BC-SAC generates an action distribution similar to the logs.

Fig. 5: Marginal action distributions, SAC/BC-SAC (orange) vs. logs (blue): (a) SAC acceleration, (b) SAC tire angles, (c) BC-SAC acceleration, (d) BC-SAC tire angles.

Reward shaping and RL/IL weights. We conduct a set of ablation studies to answer how the form of the reward function and the weights on the RL and imitation components influence final performance. We use a smaller dataset constructed by sampling 10% of the Top10 data and compare: (1) our full reward vs. a discrete binary reward (Fig. 6 Right), (2) off-road and collision reward term weights (Fig. 6 Left), (3) off-road and collision offset parameters (Fig. 7), and (4) the weight on the RL and IL terms in the objective (Fig. 8). The results indicate that the proposed shaped reward improves overall performance over the simpler sparse reward with an appropriate choice of reward parameters, and that a balance between the imitation and RL terms leads to the best performance.

Fig. 6: Left: Off-road / collision weights. The off-road weight and collision weight add up to 2.0; the x-axis is the collision weight. A balanced choice of off-road and collision weights leads to the best performance. Right: Dense vs. binary rewards. The binary reward is defined as −1 when a safety event happens and 0 otherwise. Dense rewards lead to fewer safety events.

Fig. 7: Off-road offset d_ooffset and collision offset d_coffset ablations. A small amount of offset improves overall performance.

Fig. 8: Left: Imitation weights (log scale) vs. failure rates. Right: Progress reward weights (log scale) vs. policy evaluation performance: safety event rate and route progress ratio.

Progress-safety balance. While our work focuses on safety-critical scenarios, in Fig. 8 Right we show that introducing a small amount of a progress reward leads to significantly more progress without major regressions in safety metrics. However, large progress rewards lead to degradation in performance.

In-depth failure analysis. Table III presents a detailed analysis of failure modes on a set of 80 sampled scenarios from the Top1 and Top10 buckets. We categorize failures into 6 broad buckets. CLIP (clipping): small collisions that occur when a vehicle collides with an object on the side while moving. OFF (off-road): failures where the agent drives off the road. LAN (bad lane): the agent encroaches into another lane, either the wrong lane or a bad merge, which results in a collision. COLL (collision): collisions where the planning agent is at fault and drives into another vehicle. RED (red light): red light violations that result in collisions. Finally, DIV (log divergence): collisions where a sim agent collides with the planning agent due to divergence from the logs.

TABLE III: Failure frequency categorizations, per type, incurred by BC-SAC and MGAIL on a small sample set (N=80). BC-SAC generally has fewer direct collisions and off-road events, but has a greater frequency of being hit by other objects.

Method | CLIP | COLL | OFF | RED | LAN | DIV
BC-SAC |   8  |   7  |  2  |  1  |  7  |  15
MGAIL  |  16  |   8  |  8  |  2  |  0  |   6

Overall, MGAIL tends to have more clipping collisions and off-road events. Fig. 9 shows two of the cases where RL improves over IL. We hypothesize that our method improves in these cases because MGAIL, as an imitation method, lacks an explicit penalty for collisions, and thus is not sensitive to small collisions during otherwise realistic behavior. On the other hand, the collisions encountered by BC-SAC tend to be cases where the collision is not directly the result of the AV planner's action, but the planner diverges from the logs in a way such that it is hit by other vehicles. Because BC-SAC is also not explicitly rewarded for following traffic rules (though it inherits this behavior via imitation), we also see a small number of failures due to that.

Fig. 9: Visualizations of a few scenarios where BC-SAC improves over imitation (MGAIL) and RL-only (SAC). The cyan car is controlled. Example 1: MGAIL collides with a pedestrian exiting a double-parked car, while BC-SAC leaves enough clearance. Example 2: MGAIL does not provide sufficient clearance and collides with the oncoming vehicle. Example 3: SAC slows down in an intersection, resulting in a rear collision; BC-SAC maintains a proper speed profile through the intersection without a collision.
VI. CONCLUSIONS

We presented a method for robust autonomous driving in challenging driving scenarios that combines imitation learning with RL (BC-SAC), paired with a simple safety reward and trained on large datasets of real-world driving. Overall, the method significantly improves safety and reliability in challenging scenarios, resulting in a more than 38% reduction in safety events on the most difficult scenarios compared to IL-only and RL-only baselines. Our extensive experiments examined the roles of the training datasets, reward shaping, and the IL/RL objective terms. BC-SAC inherits implicit human-like driving behaviors from imitation, while RL acts as a fail-safe for handling out-of-distribution safety scenarios. Similarly to the IL-only setting, training on the top 10% of the most challenging scenarios yields the most robust performance in the combined IL and RL setting. While this work mainly focused on optimizing safety-related rewards, a natural extension is to incorporate other factors into the objective, such as progress, traffic rule adherence, and passenger comfort. Besides the reward function, this approach does not account for unexpected behavior of other agents in response to out-of-distribution actions on the part of the ego vehicle, and it still requires heuristically choosing the trade-off between the IL and RL objectives. A promising future work direction would be to enable reactive sim agents for training and evaluation and to extend the approach to enforce safety as an explicit constraint, perhaps in combination with methodology to mitigate distributional shift.
REFERENCES

[1] D. A. Pomerleau, "Alvinn: An autonomous land vehicle in a neural network," Advances in Neural Information Processing Systems, vol. 1, 1988.
[2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[3] F. Codevilla, M. Müller, A. López, V. Koltun, and A. Dosovitskiy, "End-to-end driving via conditional imitation learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–9.
[4] W. Zhou, Z. Cao, N. Deng, X. Liu, K. Jiang, and D. Yang, "Long-tail prediction uncertainty aware trajectory planning for self-driving vehicles," arXiv preprint arXiv:2207.00788, 2022.
[5] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011, pp. 627–635.
[6] P. De Haan, D. Jayaraman, and S. Levine, "Causal confusion in imitation learning," Advances in Neural Information Processing Systems, vol. 32, 2019.
[7] X. Pan, Y. You, Z. Wang, and C. Lu, "Virtual to real reinforcement learning for autonomous driving," arXiv preprint arXiv:1704.03952, 2017.
[8] X. Liang, T. Wang, L. Yang, and E. Xing, "CIRL: Controllable imitative reinforcement learning for vision-based self-driving," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 584–599.
[9] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, "End-to-end urban driving by imitating a reinforcement learning coach," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 15222–15232.
[10] W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone, "Reward (mis)design for autonomous driving," arXiv preprint arXiv:2104.13906, 2021.
[11] E. Bronstein, S. Srinivasan, S. Paul, A. Sinha, M. O'Kelly, P. Nikdel, and S. Whiteson, "Embedding synthetic off-policy experience for autonomous driving via zero-shot curricula," in 6th Annual Conference on Robot Learning, 2022.
[12] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, "MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction," arXiv preprint arXiv:1910.05449, 2019.
[13] N. Rhinehart, R. McAllister, K. M. Kitani, and S. Levine, "PRECOG: Prediction conditioned on goals in visual multi-agent settings," CoRR, vol. abs/1905.01296, 2019.
[14] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, "Trajectron++: Multi-agent generative trajectory forecasting with heterogeneous data for control," arXiv preprint arXiv:2001.03093, 2020.
[15] A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 663–670.
[16] J. Ho and S. Ermon, "Generative adversarial imitation learning," Advances in Neural Information Processing Systems, vol. 29, 2016.
[17] N. Baram, O. Anschel, and S. Mannor, "Model-based adversarial imitation learning," arXiv preprint arXiv:1612.02179, 2016.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[19] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," CoRR, vol. abs/1801.01290, 2018.
[20] A. Kumar, A. Zhou, G. Tucker, and S. Levine, "Conservative Q-learning for offline reinforcement learning," Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191, 2020.
[21] S. Fujimoto and S. S. Gu, "A minimalist approach to offline reinforcement learning," Advances in Neural Information Processing Systems, vol. 34, pp. 20132–20145, 2021.
[22] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., "Deep Q-learning from demonstrations," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[23] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," in Robotics: Science and Systems, 2018.
[24] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, "End-to-end urban driving by imitating a reinforcement learning coach," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15222–15232.
[25] M. Vitelli, Y. Chang, Y. Ye, A. Ferreira, M. Wołczyk, B. Osiński, M. Niendorf, H. Grimmett, Q. Huang, and A. Jain, "SafetyNet: Safe planning for real-world self-driving vehicles using machine-learned policies," in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 897–904.
[26] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, "Wayformer: Motion forecasting via simple & efficient attention networks," arXiv preprint arXiv:2207.05844, 2022.
[27] V. Lioutas, A. Scibior, and F. Wood, "TITRATED: Learned human driving behavior without infractions via amortized inference," Transactions on Machine Learning Research, 2022.
[28] N. Rhinehart, R. McAllister, and S. Levine, "Deep imitative models for flexible inference, planning, and control," arXiv preprint arXiv:1810.06544, 2018.
[29] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun, "Learning lane graph representations for motion forecasting," arXiv preprint arXiv:2007.13732, 2020.
[30] J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, D. Weiss, B. Sapp, Z. Chen, and J. Shlens, "Scene Transformer: A unified multi-task model for behavior prediction and planning," CoRR, vol. abs/2106.08417, 2021.
[31] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah, "Learning to drive in a day," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8248–8254.
[32] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, "Navigating occluded intersections with autonomous vehicles using deep reinforcement learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2034–2039.
[33] P. Wang, C.-Y. Chan, and A. de La Fortelle, "A reinforcement learning based approach for automated lane change maneuvers," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1379–1384.
[34] E. Vinitsky, N. Lichtlé, X. Yang, B. Amos, and J. Foerster, "Nocturne: A scalable driving benchmark for bringing multi-agent learning one step closer to the real world," arXiv preprint arXiv:2206.09889, 2022.
[35] P. Kothari, C. Perone, L. Bergamini, A. Alahi, and P. Ondruska, "DriverGym: Democratising reinforcement learning for autonomous driving," arXiv preprint arXiv:2111.06889, 2021.
[36] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, "MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[37] V. Lioutas, J. W. Lavington, J. Sefas, M. Niedoba, Y. Liu, B. Zwartsenberg, S. Dabiri, F. Wood, and A. Scibior, "Critic sequential Monte Carlo," arXiv preprint arXiv:2205.15460, 2022.
[38] H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari, "nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles," arXiv preprint arXiv:2106.11810, 2021.
[39] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Conference on Robot Learning. PMLR, 2017, pp. 1–16.
[40] K. Ramamohanarao, H. Xie, L. Kulik, S. Karunasekera, E. Tanin, R. Zhang, and E. B. Khunayn, "SMARTS: Scalable microscopic adaptive road traffic simulator," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 8, no. 2, pp. 1–22, 2016.
[41] E. Leurent, "An environment for autonomous driving decision-making," https://github.com/eleurent/highway-env, 2018.
[42] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kummerle, H. Konigshof, C. Stiller, A. de La Fortelle et al., "INTERACTION dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps," arXiv preprint arXiv:1910.03088, 2019.
[43] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards," arXiv preprint arXiv:1707.08817, 2017.
[44] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," arXiv preprint arXiv:1709.10087, 2017.
[45] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," arXiv preprint arXiv:1610.03295, 2016.
[46] D. Gandhi, L. Pinto, and A. Gupta, "Learning to fly by crashing," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 3948–3955.
[47] R. Rajamani, Vehicle Dynamics and Control. Springer Science & Business Media, 2011.
[48] S. Fujimoto, H. Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.
[49] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, "Perceiver: General perception with iterative attention," in International Conference on Machine Learning. PMLR, 2021, pp. 4651–4664.
[50] J. Frank, S. Mannor, and D. Precup, "Reinforcement learning in the presence of rare events," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 336–343.
[51] N. Kalra and S. M. Paddock, Driving to Safety: How Many Miles of Driving Would It Take to Demonstrate Autonomous Vehicle Reliability? RAND Corporation, 2016.
[52] S. Paul, K. Chatzilygeroudis, K. Ciosek, J.-B. Mouret, M. Osborne, and S. Whiteson, "Alternating optimisation and quadrature for robust control," in AAAI Conference on Artificial Intelligence, 2018.
[53] S. Paul, M. A. Osborne, and S. Whiteson, "Fingerprint policy optimisation for robust reinforcement learning," in International Conference on Machine Learning, 2019.
[54] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," in International Conference on Machine Learning. PMLR, 2018, pp. 1407–1416.
[55] E. Bronstein, M. Palatucci, D. Notz, B. White, A. Kuefler, Y. Lu, S. Paul, P. Nikdel, P. Mougin, H. Chen et al., "Hierarchical model-based imitation learning for planning in autonomous driving," arXiv preprint arXiv:2210.09539, 2022.

APPENDIX

A. IL + RL Distributed Actor-Learner Training Architecture

Fig. 10: IL + RL distributed actor-learner training architecture. We extend the distributed IMPALA architecture [54] with additional demo rollout workers and a demo replay buffer, which produce rollout transitions in the same format as the actor workers. The learner worker samples from both the rollout replay buffer and the demo replay buffer to perform training updates in an off-policy manner.

B. Additional Details on Model Architectures and Hyper-parameter Settings

We use a dual actor-critic architecture similar to TD3 and SAC [48], [19]: each of the main components, the actor network π(a|s), the double Q-critic network Q(s,a), and the target double Q-critic network Q̄(s,a), has a separate Transformer observation encoder described in [55], and the encoder embedding is fed to a (256, 256) fully connected head. The actor network outputs a tanh-squashed diagonal Gaussian distribution parameterized by a mean µ and variance σ.

We train the BC-SAC algorithm with the following hyper-parameters: the actor learning rate is 1e-4, the critic learning rate is 1e-4, the imitation learning rate is 5e-5, the batch size is 64, and the reward discount factor is 0.92. The sample-to-insert ratio for replay is 8, which is the average number of times the learner should sample each item in the replay buffer during the item's entire lifetime. In practice, instead of performing a combined gradient step on both the IL and RL objectives, we alternate the training steps between IL and RL with different update frequencies: for every 8 RL updates, we perform one IL update. The hyper-parameters are found by grid search.

For SAC, we use the same network design and hyper-parameters as in BC-SAC, except that it does not perform the IL step.

For BC, we discretize the 2D action space (steering, acceleration) into 31 × 7 = 217 actions with the same underlying dynamics model. We use a similar network design for BC as in BC-SAC's actor network, with a Softmax prediction head representing the probabilities of the discrete actions. We use the cross-entropy loss with a learning rate of 1e-4 and a batch size of 256 for training.

For MGAIL, we follow the network design and hyper-parameter settings presented in [55].
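As an illustration of the alternating schedule described above, a minimal sketch is shown below; the update functions and replay buffers are hypothetical placeholders rather than the distributed implementation of Fig. 10.

```python
# Minimal sketch of the alternating update schedule (8 RL updates per IL update).
RL_UPDATES_PER_IL_UPDATE = 8

def train(num_steps, rollout_buffer, demo_buffer, rl_update, il_update, batch_size=64):
    for step in range(num_steps):
        # RL step: critic + actor update on transitions from the rollout workers.
        rl_update(rollout_buffer.sample(batch_size))
        if (step + 1) % RL_UPDATES_PER_IL_UPDATE == 0:
            # IL step: behavioral-cloning update on demonstration transitions.
            il_update(demo_buffer.sample(batch_size))

class _StubBuffer:
    def sample(self, batch_size):
        return [None] * batch_size  # stand-in for a batch of transitions

counts = {"rl": 0, "il": 0}
train(32, _StubBuffer(), _StubBuffer(),
      rl_update=lambda batch: counts.__setitem__("rl", counts["rl"] + 1),
      il_update=lambda batch: counts.__setitem__("il", counts["il"] + 1))
print(counts)  # {'rl': 32, 'il': 4}
```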