Symphony: Learning Realistic and Diverse Agents for Autonomous Driving Simulation

Maximilian Igl^1, Daewoo Kim^1, Alex Kuefler^1, Paul Mougin^1, Punit Shah^1, Kyriacos Shiarlis^1, Dragomir Anguelov^2, Mark Palatucci^2, Brandyn White^2, Shimon Whiteson^2

^1 Contributing author (listed alphabetically). ^2 Senior author (listed alphabetically). All authors from Waymo Research. Contact: shimonw@waymo.com

Abstract—Simulation is a crucial tool for accelerating the development of autonomous vehicles. Making simulation realistic requires models of the human road users who interact with such cars. Such models can be obtained by applying learning from demonstration (LfD) to trajectories observed by cars already on the road. However, existing LfD methods are typically insufficient, yielding policies that frequently collide or drive off the road. To address this problem, we propose Symphony, which greatly improves realism by combining conventional policies with a parallel beam search. The beam search refines these policies on the fly by pruning branches that are unfavourably evaluated by a discriminator. However, it can also harm diversity, i.e., how well the agents cover the entire distribution of realistic behaviour, as pruning can encourage mode collapse. Symphony addresses this issue with a hierarchical approach, factoring agent behaviour into goal generation and goal conditioning. The use of such goals ensures that agent diversity neither disappears during adversarial training nor is pruned away by the beam search. Experiments on both proprietary and open Waymo datasets confirm that Symphony agents learn more realistic and diverse behaviour than several baselines.

I. INTRODUCTION

Simulation is a crucial tool for accelerating the development of autonomous driving software because it can generate adversarial interactions for training autonomous driving policies, play out counterfactual scenarios of interest, and estimate safety-critical metrics. In this way, simulation reduces reliance on real-world data, which can be expensive and/or dangerous to collect. As autonomous vehicles share public roads with human drivers, cyclists, and pedestrians, the underlying simulation tools require realistic models of these human road users.

Such models can be obtained by applying learning from demonstration (LfD) [1], [2] to example trajectories of human road use collected using sensors (e.g., cameras and LIDAR) already mounted on cars on the road. Such demonstrations are ideally suited to learning the realistic road use behaviour needed for autonomous driving simulation.

However, existing LfD methods such as behavioural cloning [3], [4] and generative adversarial imitation learning [5] are typically insufficient for producing realistic models of human road users. Despite minimising their supervised or adversarial losses, the resulting policies frequently collide with other road users or drive off the road.

To address this problem, we propose Symphony, a new approach to LfD for autonomous driving simulation that can greatly improve the realism of the learned behaviour. A key idea behind Symphony is to combine conventional policies, represented as neural networks, with a parallel beam search that refines these policies on the fly. As simulations are rolled out, Symphony prunes branches that are unfavourably evaluated by a discriminator trained to distinguish agent behaviour from that in the data. Because the beam search is parallelised, promising branches are repeatedly forked, focusing computation on the most realistic rollouts. In addition, since the tree search is also performed during training, the pruning mechanism drives the agent towards more realistic states that increasingly challenge the discriminator. The results of each tree search can then be distilled back into the policy itself, yielding an adversarial algorithm.

However, simply learning realistic agents is not enough. They must also be diverse, i.e., cover the entire distribution of realistic behaviour, in order to enable a full evaluation of autonomous driving software. Unfortunately, while the use of beam search improves realism, it tends to harm diversity: repeated pruning can encourage mode collapse, where only the easiest to simulate modes are represented. To address this issue, Symphony takes a hierarchical approach, factoring agent behaviour into goal generation and goal conditioning. For the former, we train a generative model that proposes goals in the form of routes, which capture high-level intent. For the latter, we train goal-conditional policies that modulate their behaviour based on a goal provided as input. Generating and conditioning on diverse goals ensures that agent diversity neither disappears during adversarial training nor is pruned away by the beam search.

We evaluate Symphony agents with extensive experiments on run segments from the Waymo Open Motion Dataset [6] and a proprietary Waymo dataset consisting of demonstration trajectories and their corresponding contexts, created by applying Waymo's perception tools to sensor data collected by Waymo vehicles driving on public roads. We report performance on several realism and diversity metrics, including a novel diversity metric called curvature Jensen-Shannon divergence that indicates how well the high-level agent behaviour matches the empirical distribution. Our results confirm that combining beam search with hierarchy yields more realistic and diverse behaviour than several baselines.
II. BACKGROUND

A. Problem Setting

The sequential process that generates the demonstration behaviour is multi-agent and general sum and can be modelled as a Markov game [7] G = {S, A, P, {r_i}_{i=1}^N, ν, γ}. N is the number of agents; S is the state space; A is the action space of a given agent (the same for all agents), such that the joint action and observation spaces are A^N and Z^N; {r_i}_{i=1}^N is the set of reward functions (one for each agent); ν is the initial state distribution; and γ is a discount factor. The dynamics are described by a transition probability function P(s'|s, a), where a ∈ A^N is a joint action; s ∈ S is also shown in bold because it factors similarly to a. The agents' actions are determined by a joint policy π(a_t|s_t) that is agent-wise factored, i.e., π(a_t|s_t) = ∏_{i=1}^N π^i(a^i_t|s_t).

To avoid superfluous formalism, we do not explicitly consider partial observability. However, this is easily modelled by masking certain state features in the encoders upon whose output the agents condition their actions.

Because we have access to a simulator, the transition function is considered known. However, since we are learning from demonstration, the agents' reward functions are unknown. Furthermore, we do not even have access to sample rewards. Instead, we can merely observe the agents' behaviour, yielding a dataset D_E = {τ_k}_{k=1}^{K_E} of K_E trajectories, where τ_k = {(s_{k,1}, a_{k,1}), (s_{k,2}, a_{k,2}), ..., (s_{k,T}, a_{k,T})} is a trajectory generated by the 'expert' joint policy π_E. The goal of LfD is to set parameters θ such that a policy π_θ matches π_E in some sense.

In our case, the data consists of LIDAR and camera readings recorded from an ego vehicle on public roads. It is partitioned into run segments and preprocessed by a perception system, yielding the dataset D_E. The states s_{k,t} and discrete actions a_{k,t} in each run segment are fit greedily to approximate the logged trajectory. Each state is a tuple containing three kinds of features. The first is static scene features s^SS_k, such as locations of lanes and sidewalks, including a roadgraph, a set of interconnected lane regions with ancestor, descendant, and neighbour relationships that describes how agents can move, change lanes, and turn. The second is dynamic scene features s^DS_{k,t}, such as traffic light states. The third is features describing the position, velocity, and orientation of the N agents. Together, these yield the tuple s_{k,t} = {s^SS_k, s^DS_{k,t}, s^1_{k,t}, ..., s^N_{k,t}}. By convention, s^1_{k,t} is the state of the ego vehicle. Since agents can enter or leave the ego agent's field of view during a run segment, we zero-pad states for missing agents to maintain fixed dimensionality.

To avoid the need to learn an explicit model of initial conditions, we couple each simulation to a reference trajectory. Each agent i is initialised to the corresponding s^i_{k,1}, after which it can either be a playback agent that blindly replays the behaviour in the reference trajectory or an interactive agent that responds dynamically to the unfolding simulation using a policy π^i_θ learned from demonstration. During a Symphony simulation, the state of an interactive agent is determined by sampling actions from its policy π^i_θ and propagating it through P. Given a reference trajectory τ_k, this yields a new simulated trajectory τ' in which s^SS_k and s^DS_{k,t} remain as they were in the reference trajectory but the agent states s^i_{k,t} of the N_I interactive agents are altered.
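To make the state representation concrete, the following is a minimal sketch, not the authors' code, of the per-timestep state tuple described above, with zero-padded agent slots; all class and field names, the agent cap, and the feature dimensions are illustrative assumptions.

```python
# Sketch of the per-timestep state tuple: static scene features, dynamic
# scene features, and one feature vector per agent, zero-padded to a fixed
# number of agent slots. Names and dimensions are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

N_MAX_AGENTS = 256      # assumed cap on agent slots (matches the dataset filter)
AGENT_FEATURE_DIM = 8   # e.g. x, y, heading, velocity, bounding box, ...

@dataclass
class State:
    static_scene: np.ndarray    # s_k^SS: roadgraph points, lanes, sidewalks
    dynamic_scene: np.ndarray   # s_{k,t}^DS: e.g. traffic-light states
    agents: np.ndarray          # [N_MAX_AGENTS, AGENT_FEATURE_DIM], row 0 = ego
    agent_valid: np.ndarray     # [N_MAX_AGENTS] mask marking zero-padded slots

def pad_agents(agent_features: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero-pad per-agent features to a fixed number of slots."""
    n = agent_features.shape[0]
    padded = np.zeros((N_MAX_AGENTS, AGENT_FEATURE_DIM), dtype=np.float32)
    padded[:n] = agent_features
    valid = np.zeros(N_MAX_AGENTS, dtype=bool)
    valid[:n] = True
    return padded, valid
```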
B. Behavioural Cloning

The simplest approach to LfD is behavioural cloning (BC) [3], [4], which solves a supervised learning problem: D_E is interpreted as a labelled training set and π_θ is trained to predict a_t given s_t. If we take a maximum likelihood approach, then we can optimise θ as follows:

    \max_\theta \sum_{k=1}^{K_E} \sum_{t=1}^{T} \log \pi_\theta(a_{k,t} \mid s_{k,t}).    (1)

This approach is simple but limited. As it optimises only the conditional policy probabilities, it does not ensure that the underlying distributions of states visited by π_E and π_θ match. Consequently, it suffers from covariate shift [8], in which generalisation errors compound, leading π_θ to states far from those visited by π_E.
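As a concrete reading of eq. (1), the sketch below computes the BC log-likelihood of the logged discrete actions for one run segment. It is a NumPy forward pass only, with the network and optimiser omitted; all shapes and the function name are assumptions, not the paper's code.

```python
import numpy as np

def bc_log_likelihood(logits: np.ndarray, actions: np.ndarray) -> float:
    """logits: [T, A] unnormalised policy scores over the discretised action
    space (e.g. 7 x 21 = 147 bins) for one run segment.
    actions: [T] integer indices of the logged (expert) actions."""
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    # sum_t log pi_theta(a_t | s_t); training maximises this, as in eq. (1)
    return float(log_probs[np.arange(len(actions)), actions].sum())

# Example: 50 timesteps, 147 discrete actions.
rng = np.random.default_rng(0)
print(bc_log_likelihood(rng.normal(size=(50, 147)), rng.integers(0, 147, size=50)))
```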
C. Generative Adversarial Imitation Learning

BC is a strictly offline method because it does not require interaction with an environment or simulator. Given D_E, it simply estimates π_θ offline using supervised learning. By contrast, most LfD methods are interactive, repeatedly executing π_θ in the environment and using the resulting trajectories to estimate a gradient with respect to θ.

Interactive methods include inverse reinforcement learning [9]–[12] and adversarial methods such as generative adversarial imitation learning (GAIL) [5]. GAIL borrows ideas from GANs [13] and employs a discriminator that is trained to distinguish between states and actions generated by the agents and those observed in D_E. The discriminator is then used as a cost (i.e., negative reward) function by the agents, yielding increasingly log-like behaviour. In our multi-agent setting, the GAIL objective can be written as:

    \min_\theta \max_\phi \; \mathbb{E}_{s \sim d_\theta}[\log D_\phi(s)] + \mathbb{E}_{s^E \sim D_E}[\log(1 - D_\phi(s^E))],    (2)

where d_θ is the distribution over states induced by π_θ and D_E is here treated as an empirical distribution over states. Although the agents who generated D_E are not cooperative (as modelled by the different reward functions {r_i}_{i=1}^N in G), the learned agents controlled by π_θ are cooperative because they all aim to minimise the same discriminator D_φ, i.e., they share the goal of realistically imitating π_E.

Differentiating through d_θ is typically not possible because P is unknown. Hence, updating θ requires using a score-function gradient estimator [14], which suffers from high variance. However, in our setting, P is both known and differentiable, so we can employ model-based GAIL (MGAIL) [15], which exploits end-to-end differentiability by directly propagating gradients from D_φ to π_θ through P.
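The sketch below spells out the two sides of eq. (2) under the convention used here, where D_φ scores policy-generated states and therefore acts as a cost for the policy. It is a hedged NumPy illustration, not the paper's implementation; the function names are our own and the model parameterisation and optimisation are omitted.

```python
import numpy as np

def discriminator_objective(d_policy: np.ndarray, d_expert: np.ndarray) -> float:
    """d_policy: D_phi(s) in (0, 1) for states sampled from d_theta.
    d_expert: D_phi(s^E) in (0, 1) for states sampled from D_E.
    The discriminator ascends this value (the max_phi side of eq. 2)."""
    eps = 1e-8
    return float(np.mean(np.log(d_policy + eps)) + np.mean(np.log(1.0 - d_expert + eps)))

def policy_cost(d_policy: np.ndarray) -> float:
    """Cost the policy minimises: E_{s~d_theta}[log D_phi(s)] (the min_theta side).
    With MGAIL, gradients of this cost flow back through the known dynamics P."""
    return float(np.mean(np.log(d_policy + 1e-8)))
```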
III. SYMPHONY

Symphony is a new approach to LfD for autonomous driving simulation that builds upon a base method such as BC or MGAIL by adding a parallel beam search to improve realism and a hierarchical policy to ensure diversity. Figure 1 gives an overview of the training process for the case where the base method is BC. Training proceeds by sampling a batch of run segments from a training set D_tr ⊂ D_E and using them as reference trajectories to perform new rollouts with the current policy. At the start of each rollout, a goal generating policy proposes a goal, based on initial conditions, that remains fixed for the rollout and is input to the goal-conditional policy that proposes actions. These actions are used to generate nodes in a parallel beam search (see Figure 2), which periodically prunes away branches deemed unfavourable by a discriminator and copies the rest to maintain a fixed-width beam.

Fig. 1: Interactive agent training when Symphony is based on BC (replay agents not shown).

Fig. 2: Beam search.

Unlike in model-predictive control [16] or reinforcement learning methods that employ online tree search, e.g., [17], [18], in which an agent 'imagines' various futures before selecting a single action, each rollout in Symphony is executed directly in the simulator. However, because simulations happen in parallel, promising branches can be duplicated during execution to replace unpromising ones, focusing computation on the most realistic rollouts.

Finally, we use the resulting rollouts to compute losses and update the goal generating policy, the goal-conditional policy, and the discriminator. During inference at test time, the process is the same except that run segments are sampled from a test set D_te and no parameters are updated.

In the rest of this section, we provide more details about the parallel beam search, hierarchical policy, network architectures, and learning rules.

A. Parallel Beam Search

For each reference trajectory in the batch, we first sample S actions from the joint policy, yielding S branches that roll out in parallel. We then call the discriminator at each simulation step to score each of the N_I interactive agents in each branch, yielding a tensor of dimension [B, N_I, T_p, S], where B is the batch size and T_p is the number of time steps between pruning/resampling. After every T_p simulation steps, we aggregate the discriminator scores across the time and interactive agent dimensions, yielding a tensor of shape [B, S] containing a score for each sample in the batch. We aggregate by maximising across time and summing across agents. We then rank samples by aggregate score and prune away the top half (i.e., the least realistic). We tile the remaining samples such that S remains constant throughout the simulation. We use T_p = 10, i.e., pruning and resampling occurs every 2 seconds of simulation. During training, we use S = 4 but during inference S = 16.

Pruning based on aggregate scores means that the simulation at a given timestep can be subtly influenced by future events, i.e., actions are pruned away because they lead to unrealistic future states, and those states include observations of playback agents. In other problem settings, such as behaviour prediction [19]–[21], such leakage would be problematic because information about the future would not be available at inference time (the whole point is to predict the future given only the past). However, in our setting the reference trajectory is available even at inference time, i.e., the goal is simply to generate realistic and diverse simulations given the reference trajectory. While leakage can in principle yield useful hints, it can also be misleading, as any leaked hints can become obsolete when interactive agents diverge from the reference trajectory. In practice, as we show in Section V-C, refining simulation on the fly through beam search can drastically improve realism but tends to harm diversity: repeated pruning can encourage mode collapse, where only the easiest to simulate modes are represented. Next we discuss a hierarchical approach to remedy this issue.
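A minimal sketch of one pruning/resampling step follows, under the assumptions stated in the comments (in particular, that higher aggregate discriminator scores mean less realistic branches, per the eq. (2) convention). The bookkeeping that actually copies simulator state between branches is only indicated by the returned indices; this is not the paper's implementation.

```python
import numpy as np

def prune_and_tile(scores: np.ndarray) -> np.ndarray:
    """scores: [B, N_I, T_p, S] discriminator scores since the last pruning.
    Returns keep_idx of shape [B, S]: for each batch element, the branch index
    whose simulator state should occupy each of the S slots going forward."""
    B, _, _, S = scores.shape
    agg = scores.max(axis=2).sum(axis=1)   # [B, S]: max over time, sum over agents
    order = np.argsort(agg, axis=1)        # ascending: most realistic branches first
    survivors = order[:, : S // 2]         # keep the better half, prune the rest
    keep_idx = np.tile(survivors, (1, 2))  # duplicate survivors to restore width S
    return keep_idx

# Example: S = 4 during training, so each surviving branch is copied once;
# at inference S = 16, so 8 survivors are each duplicated.
scores = np.random.rand(2, 2, 10, 4)       # B=2, N_I=2, T_p=10, S=4
print(prune_and_tile(scores))
```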
B. Hierarchical Policy

To mitigate mode collapse, we employ hierarchical agent policies. At the beginning of each rollout, a high-level goal generating policy h_ψ(g|s_1) proposes a goal g, based on an initial state s_1, that remains fixed throughout the rollouts and is provided to both the low-level goal-conditional policy π_θ(a|s, g) and the discriminator D_φ(s|g). The goal generating policy is trained to match the distribution of goals in the training data:

    \max_\psi \; \mathbb{E}_{(s_1^E, g^E) \sim D_E} \log h_\psi(g^E \mid s_1^E).    (3)

Because the same goal is used for all rollouts within the search tree, it cannot be biased by the discriminator.

We use routes, represented as sequences of roadgraph lane segments, as goals because they capture high-level intent and are a primary source of multi-modality. A feasible set of routes is generated by following all roadgraph branches, beginning at the lane segment corresponding to the agent's initial state. From this set, routes with minimal displacement error from the observed trajectory are used as ground truth to train h_ψ(g|s_1) and as input to π_θ(a|s, g) during training. Hence, h_ψ(g|s_1) and π_θ(a|s, g) can be seen as learned versions of the router and planner, respectively, in a conventional control stack.

C. Architecture and Learning

For each interactive agent, objects (such as other cars, pedestrians and cyclists), as well as static and dynamic features, are all encoded individually using MLPs, followed by max-pooling across inputs of the same type. The resulting type-specific embeddings are, together with an encoding of features of the interactive agent, concatenated and provided to the policy head as input. Spatial information such as the locations or velocities of other objects is normalised with respect to the agent before being passed into the network. Furthermore, roads and lanes are represented as a set of points. In large scenes, only the nearest 16 objects and 1K static and dynamic features are included.

For BC, the goal-conditional policy head maps the concatenated embeddings to a 7×21 action space of discretised accelerations and steering angles. For MGAIL, we use a continuous action space specifying x-y displacement to facilitate end-to-end differentiation. The goal generating policy maps to softmax logits for each feasible route, up to a limit of 200 routes. The goal generating and goal-conditional policies use separate encoders. The discriminator uses a similar but simpler encoding by max-pooling across all objects and point features within 20 metres. We train both policies and the discriminator simultaneously. We train the goal generating policy using eq. (3) and the discriminator using:

    \max_\phi \; \mathbb{E}_{s \sim d_{TS}}[\log D_\phi(s)] + \mathbb{E}_{s^E \sim D_E}[\log(1 - D_\phi(s^E))],

where d_TS is generated by the tree search with S = 4. We train the goal-conditional policy using either BC or MGAIL. In the case of BC, we increase its robustness to covariate shift by training not only on expert data, but also on additional data sampled from d_TS, i.e., the beam search is distilled back into the goal-conditional policy, yielding an adversarial method even without the use of MGAIL. Each training batch contains 16 run segments of 10 seconds each, for which actions are recomputed every 0.2 seconds.
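The type-wise encoder described at the start of this subsection can be sketched as follows; single-layer projections stand in for the MLPs, and all dimensions and weight names are illustrative assumptions rather than the paper's architecture details.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode_scene(objects, static_pts, dynamic_pts, ego_features, weights):
    """objects: [n_obj, d_o], static_pts: [n_s, d_s], dynamic_pts: [n_d, d_d],
    ego_features: [d_e]; weights: dict of per-type projection matrices.
    Inputs are assumed already normalised relative to the interactive agent."""
    obj_emb = relu(objects @ weights["obj"]).max(axis=0)        # pool over objects
    static_emb = relu(static_pts @ weights["static"]).max(axis=0)
    dynamic_emb = relu(dynamic_pts @ weights["dynamic"]).max(axis=0)
    ego_emb = relu(ego_features @ weights["ego"])
    return np.concatenate([obj_emb, static_emb, dynamic_emb, ego_emb])

# Example with random weights: 16 nearest objects, 1K static points.
rng = np.random.default_rng(0)
w = {"obj": rng.normal(size=(8, 32)), "static": rng.normal(size=(4, 32)),
     "dynamic": rng.normal(size=(4, 32)), "ego": rng.normal(size=(8, 32))}
emb = encode_scene(rng.normal(size=(16, 8)), rng.normal(size=(1000, 4)),
                   rng.normal(size=(20, 4)), rng.normal(size=8), w)
print(emb.shape)  # (128,)
```

Max-pooling over each set keeps the encoding invariant to the ordering and count of nearby objects and roadgraph points, which is why the nearest-16-objects truncation above does not change the interface of the policy head.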
IV. RELATED WORK

A. Coping with Covariate Shift

One way to address covariate shift in BC is to add actuator noise when demonstrations are performed, forcing the demonstrator to label a wider range of states [22]. However, this requires intervening when demonstrations are collected, which is not possible in our setting. Another solution is DAgger [8], where the demonstrator labels the states visited by the agent as it learns, but this also requires access to the demonstrator, which is not available in our setting. Adversarial methods such as GAIL and MGAIL avoid covariate shift by repeatedly trying policies in the environment and minimising a divergence between the resulting trajectories and the demonstrations. When the environment is not available, methods that match state distributions can retain BC's strictly offline nature while minimising covariate shift [23]. However, in our setting, remaining strictly offline is not necessary, as we have access to a high quality simulator that is itself the target environment for the learned agents.

B. Combining Planning and Learning

When a model of the environment dynamics P(s'|s, a) is available, deliberative planning can help to predict the value of different actions. Model-based reinforcement learning typically uses planning during training to reduce the variance of value estimates. When the model is differentiable, this can also be exploited, as in MGAIL [15], [24], to reduce the variance of gradient estimates [25]–[27]. By contrast, online planning typically uses tree search to refine policies on the fly during inference by focusing computation on the most relevant states [18], [28]. By distilling the results of the tree search back into the policy, online planning also serves as an extended policy improvement operation [29]–[34]. Recently, sequential decision making problems have been reformulated as auto-regressive models using transformer architectures [35]. Most related to Symphony is the Trajectory Transformer [36], which is fully differentiable and uses beam search, but without a discriminator.

C. Autonomous Driving Applications

As early as 1989, ALVINN [3], a neural network trained with BC, autonomously controlled a vehicle on public roads. More recently, deep learning has been used to train autonomous driving software end-to-end with BC [37], and perturbation-based augmentations have been used to mitigate covariate shift [38]. As simulation emerges as a crucial tool in autonomous driving, interest is turning to how to populate simulators with realistic agents. ViBe [39] learns such models from CCTV data collected at intersections, using GAIL but without the tree search or hierarchical components of Symphony. SimNet [40] produces such models using only BC but uses GANs instead of reference trajectories to generate initial simulation conditions. TrafficSim [41] also uses hierarchical control like Symphony, but with a latent variable model and without a tree search for online refinement. AdvSim [42] is similar but generates adversarial perturbations to challenge the full autonomous driving stack. Like Symphony, SMARTS [43] considers the realism and diversity of agents in a driving simulator, but employs only reinforcement learning, not LfD, to learn such agents. nuPlan [44] is a planning benchmark that uses a set of reference trajectories but does not simulate agent observations, feeding in the observations from the reference trajectory even if they diverge from the simulation.
Off-road Time The curvature JSD is then the Jensen-Shannon divergence isthepercentoftimethataninteractiveagentspendsoffthe betweenthedistributionofaveragecurvaturesofthebranch- road. ADE is the average displacement error between each ing regions visited by the policy and reference trajectories. joint reference trajectory and the corresponding trajectory These distributions are approximated with histograms with generated in simulation: bins of width 0.01 in the range [−1,1], yielding 201 bins. ADE= 1 (cid:88)Kte(cid:104)(cid:88)(cid:88)T δ(si ,s(cid:48)i )(cid:105) , C. Results K N T k,t k,t te I k=1 i∈I t=1 WecompareBCandMGAILasis,withhierarchy(BC+H, where I is the set of indices of the interactive agents, s(cid:48)i MGAIL+H), with tree search (BC+TS, MGAIL+TS), and k,t and si are the states of the ith interactive agents in the kth with both (BC+TS+H, MGAIL+TS+H). We train each k,t method for 70K update steps with run segments sampled simulated and reference trajectories respectively, and δ is a uniformly from K and save checkpoints every 2K steps. Euclidean distance function. tr We then select the checkpoint with the lowest sum of We also consider two diversity metrics. MinSADE is the collision rate and off-road time on a validation set of 200 minimumscene-levelaveragedisplacementerror[41],which run segments and test it using all of K . For each run extends ADE to measure diversity instead of just realism. te segment in K , we generate 16 rollouts and report the During evaluation, the simulator populates a set R by te k average (or minimum for minSADE). We average all results simulating m trajectories for each reference trajectory τ in k over five independent seeds per method. For each metric, D and then computes minSADE as follows: te we indicate the best performing BC and MGAIL methods minSADE= 1 (cid:88)Kte min (cid:104)(cid:88)(cid:88)T δ(si ,s(cid:48)i )(cid:105) , in bold. For collision rate and off-road time, we also report K teN IT k=1τ(cid:48)∈Rk i∈I t=1 k,t k,t v ba ul tu ce os mfo pr utp inla gyi tn hg eseba mck etrt ih ce sl fo og rs thw eit ah go eu nt tsin tt he ar tac wti ov ue ldag he an vt es When m = 1, minSADE reduces to ADE; when m > 1, been interactive. These values are slightly positive due to, minimising minSADE requires populating R with diverse e.g., perception errors on objects far from the ego vehicle or k but realistic trajectories. Our experiments use m=16. Low the use of bounding boxes instead of contours. minSADE therefore implies good coverage of behaviour Table I shows our main results, comparing all methods modes. However, it does not imply actually matching the across all metrics on both datasets. Comparing BC methods empirical distribution of behaviours, e.g., low-probability first,itisclearthattreesearchdramaticallyimprovesrealism modes may be over-represented. Curvature JSD is a novel (especially with respect to collision rate and off-road time) diversitymetricthataimstomeasurehowwellthehigh-level but reduces diversity due to mode collapse. This loss is behaviour matches the empirical distribution. It is computed detected by curvature JSD, which measures distribution using the roadgraph features in sSS. Multiple lane regions matching, but not by minSADE, which only requires cov- k that share a common ancestor are called branching regions erage.However,theadditionofhierarchyimprovesdiversityTABLE I: Proprietary and Waymo Open Motion Dataset results and standard errors. 
C. Results

We compare BC and MGAIL as is, with hierarchy (BC+H, MGAIL+H), with tree search (BC+TS, MGAIL+TS), and with both (BC+TS+H, MGAIL+TS+H). We train each method for 70K update steps with run segments sampled uniformly from K_tr and save checkpoints every 2K steps. We then select the checkpoint with the lowest sum of collision rate and off-road time on a validation set of 200 run segments and test it using all of K_te. For each run segment in K_te, we generate 16 rollouts and report the average (or minimum for minSADE). We average all results over five independent seeds per method. For each metric, we indicate the best performing BC and MGAIL methods in bold. For collision rate and off-road time, we also report values for playing back the logs without interactive agents but computing these metrics for the agents that would have been interactive. These values are slightly positive due to, e.g., perception errors on objects far from the ego vehicle or the use of bounding boxes instead of contours.

Table I shows our main results, comparing all methods across all metrics on both datasets. Comparing BC methods first, it is clear that tree search dramatically improves realism (especially with respect to collision rate and off-road time) but reduces diversity due to mode collapse. This loss is detected by curvature JSD, which measures distribution matching, but not by minSADE, which only requires coverage. However, the addition of hierarchy improves diversity in nearly all cases. In particular, hierarchy is crucial for addressing the mode collapse from tree search. BC+TS+H is the only BC method that gets the best of both worlds, with strong performance on both realism and diversity metrics.

TABLE I: Proprietary and Waymo Open Motion Dataset results and standard errors.

Proprietary Dataset:

Method        Collision rate (%)  Off-road time (%)  ADE (m)       MinSADE (m)   Curvature JSD (×10^-3)
Playback      0.99                0.40               -             -             -
BC            16.65 ± 0.41        2.16 ± 0.08        5.80 ± 0.07   2.16 ± 0.04   2.82 ± 0.26
BC+H          17.25 ± 1.07        1.69 ± 0.12        5.18 ± 0.09   2.01 ± 0.03   1.38 ± 0.11
BC+TS         1.84 ± 0.13         0.35 ± 0.03        4.83 ± 0.09   2.07 ± 0.04   5.28 ± 1.14
BC+TS+H       1.80 ± 0.07         0.34 ± 0.01        4.30 ± 0.05   1.96 ± 0.04   1.24 ± 0.14
MGAIL         5.34 ± 0.32         0.83 ± 0.08        7.32 ± 0.52   3.95 ± 0.42   1.88 ± 0.18
MGAIL+H       4.16 ± 0.18         0.76 ± 0.03        4.52 ± 0.17   2.48 ± 0.09   1.55 ± 0.17
MGAIL+TS      2.97 ± 0.16         0.72 ± 0.20        6.83 ± 0.48   3.80 ± 0.39   4.14 ± 1.08
MGAIL+TS+H    2.40 ± 0.19         0.70 ± 0.06        4.69 ± 0.15   2.73 ± 0.12   2.35 ± 0.53

Waymo Open Motion Dataset:

Method        Collision rate (%)  Off-road time (%)  ADE (m)       MinSADE (m)   Curvature JSD (×10^-3)
Playback      3.04                0.96               -             -             -
BC            24.65 ± 0.32        2.75 ± 0.12        4.81 ± 0.10   1.76 ± 0.04   1.32 ± 0.28
BC+H          23.27 ± 0.38        2.40 ± 0.05        4.38 ± 0.04   1.67 ± 0.03   3.50 ± 1.25
BC+TS         4.94 ± 0.65         1.23 ± 0.04        4.20 ± 0.15   1.82 ± 0.07   5.84 ± 1.02
BC+TS+H       4.86 ± 0.24         1.30 ± 0.06        3.70 ± 0.09   1.66 ± 0.04   2.82 ± 1.13
MGAIL         9.48 ± 0.91         1.62 ± 0.15        5.70 ± 0.44   3.13 ± 0.26   4.14 ± 1.36
MGAIL+H       7.39 ± 0.37         1.65 ± 0.12        3.82 ± 0.12   2.15 ± 0.08   3.88 ± 0.80
MGAIL+TS      4.36 ± 0.21         1.22 ± 0.02        4.26 ± 0.19   2.51 ± 0.11   5.42 ± 1.26
MGAIL+TS+H    4.89 ± 0.38         1.65 ± 0.25        3.86 ± 0.17   2.22 ± 0.13   2.80 ± 0.60

TABLE II: Proprietary dataset results and standard errors with 20 second rollouts and 8 interactive agents.

Longer Rollouts (20 seconds):

Method        Collision rate (%)  Off-road time (%)  ADE (m)       MinSADE (m)   Curvature JSD (×10^-3)
Playback      2.38                0.47               -             -             -
BC            25.56 ± 0.69        2.99 ± 0.09        11.60 ± 0.21  4.08 ± 0.05   2.61 ± 0.20
BC+H          30.33 ± 0.56        3.14 ± 0.22        9.83 ± 0.18   3.83 ± 0.05   4.17 ± 0.37
BC+TS         6.05 ± 0.31         0.72 ± 0.10        8.92 ± 0.30   3.97 ± 0.11   4.90 ± 0.59
BC+TS+H       7.76 ± 0.12         0.66 ± 0.02        7.74 ± 0.15   3.72 ± 0.09   2.69 ± 0.22
MGAIL         14.15 ± 0.83        1.03 ± 0.19        14.18 ± 0.97  9.34 ± 0.45   6.79 ± 1.70
MGAIL+H       14.73 ± 1.93        1.72 ± 0.24        10.52 ± 1.17  6.58 ± 0.60   3.09 ± 0.62
MGAIL+TS      10.52 ± 0.94        0.99 ± 0.03        15.76 ± 1.17  11.80 ± 1.22  5.93 ± 0.71
MGAIL+TS+H    7.81 ± 0.78         0.88 ± 0.03        9.11 ± 0.73   6.39 ± 0.64   2.66 ± 0.59

More Interactive Agents (N_I = 8):

Method        Collision rate (%)  Off-road time (%)  ADE (m)       MinSADE (m)   Curvature JSD (×10^-3)
Playback      2.37                0.22               -             -             -
BC            17.46 ± 0.79        1.06 ± 0.15        6.40 ± 0.51   1.86 ± 0.14   1.49 ± 0.25
BC+H          18.60 ± 0.81        0.95 ± 0.06        5.85 ± 0.08   1.83 ± 0.08   1.73 ± 0.36
BC+TS         4.17 ± 0.23         0.38 ± 0.02        6.22 ± 0.38   2.10 ± 0.20   2.76 ± 0.50
BC+TS+H       5.18 ± 0.24         0.36 ± 0.03        5.68 ± 0.16   2.03 ± 0.05   0.96 ± 0.12
MGAIL         11.08 ± 1.29        0.47 ± 0.02        12.30 ± 0.62  6.55 ± 0.65   4.14 ± 1.06
MGAIL+H       6.79 ± 0.11         0.37 ± 0.03        5.74 ± 0.34   2.80 ± 0.26   1.61 ± 0.40
MGAIL+TS      9.50 ± 0.93         0.51 ± 0.06        7.46 ± 0.79   3.66 ± 0.53   2.65 ± 0.41
MGAIL+TS+H    5.80 ± 0.47         0.36 ± 0.02        5.47 ± 0.07   2.69 ± 0.04   1.24 ± 0.15

Figure 3 shows the histograms used to compute the curvature JSD values in Table I for three BC methods, with learned policies in orange and the logged reference trajectories in blue. While BC matches the distributions well, adding tree search leads to under-representation of positive curvature, i.e., right turns, a deficiency repaired with hierarchical policies.

Fig. 3: Histograms for distribution of curvature metrics. (a) Proprietary Dataset. (b) Waymo Open Motion Dataset.

Turning now to MGAIL methods, similar trends emerge. Tree search improves realism, though the effect is less dramatic as MGAIL is already adversarial even without tree search. Similarly, while tree search also increases curvature JSD in MGAIL, the effect is smaller. This is also to be expected given that MGAIL is already adversarial and can thus experience mode collapse even without tree search. On both datasets, the addition of hierarchy substantially improves MGAIL's diversity metrics. Overall, the best BC methods perform better than the best MGAIL methods on nearly all metrics, though the differences are modest.
To see if we can maintain realism for longer time horizons, we repeat our experiments on the proprietary dataset with the rollout length doubled to 20s in both training and testing. The "Longer Rollouts" half of Table II shows the results. As expected, all methods obtain higher values on nearly all metrics in this more challenging setup. However, the relative performance of the methods remains similar to that shown in Table I. Tree search methods perform much better with respect to realism than those without. While longer rollouts give more time to accumulate error, tree search repeatedly prunes problematic rollouts, greatly mitigating this effect. BC+TS has worse curvature JSD than BC, but the use of hierarchy prevents mode collapse, enabling BC+TS+H to approach the best of both worlds. MGAIL shows less diversity loss from tree search than BC (only moderately worse minSADE) but also sees much better diversity when hierarchy is used.

To assess whether we can maintain realism when more agents are replaced, we repeat our experiments on the proprietary dataset with eight interactive agents (N_I = 8) in both training and testing. Two agents are selected as before and an additional six are selected that are nearest to the ego vehicle and whose distance traveled in the reference trajectory exceeds a threshold. Again, all methods obtain higher values on most metrics, as with the longer rollouts discussed above. Relative performance remains similar, with tree search improving realism but harming diversity and hierarchy improving diversity. In this case, MGAIL sees no loss of diversity from tree search, as any mode collapse already happens in MGAIL training, but it still sees substantial diversity improvements when hierarchy is used.

VI. CONCLUSIONS & FUTURE WORK

This paper presented Symphony, which learns realistic and diverse simulated agents and performs parallel multi-agent simulations with them. Symphony is data driven and combines hierarchical policies with a parallel beam search. Experiments on both open and proprietary Waymo data confirmed that Symphony learns more realistic and diverse behaviour than a number of baselines. Future work will investigate alternative pruning rules to shape simulation to various ends, augmenting goals to model driver persona, and developing additional diversity metrics that capture distributional realism in, e.g., agents' aggregate pass/yield behaviour.
REFERENCES

[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Rob. Auton. Syst., vol. 57, no. 5, pp. 469–483, May 2009.
[2] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, "Imitation learning: A survey of learning methods," ACM Comput. Surv., vol. 50, no. 2, pp. 1–35, Apr. 2017.
[3] D. A. Pomerleau, "ALVINN: An autonomous land vehicle in a neural network," in Advances in Neural Information Processing Systems 1. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., Dec. 1989, pp. 305–313.
[4] D. Michie, M. Bain, and J. Hayes-Michie, "Cognitive models from subcognitive skills," IEE Control Engineering Series, vol. 44, pp. 71–99, 1990.
[5] J. Ho and S. Ermon, "Generative adversarial imitation learning," June 2016.
[6] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. Qi, Y. Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V. Vasudevan, A. McCauley, J. Shlens, and D. Anguelov, "Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset," CoRR, vol. abs/2104.10133, 2021. [Online]. Available: https://arxiv.org/abs/2104.10133
[7] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 157–163.
[8] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
[9] A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," in Proceedings of the Seventeenth International Conference on Machine Learning, ser. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., June 2000, pp. 663–670.
[10] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Twenty-first International Conference on Machine Learning – ICML '04. New York, NY, USA: ACM Press, 2004.
[11] D. Ramachandran and E. Amir, "Bayesian inverse reinforcement learning," in IJCAI, vol. 7, 2007, pp. 2586–2591.
[12] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in AAAI, vol. 8, 2008, pp. 1433–1438.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[14] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
[15] N. Baram, O. Anschel, I. Caspi, and S. Mannor, "End-to-end differentiable adversarial imitation learning," in International Conference on Machine Learning. PMLR, 2017, pp. 390–399.
[16] C. E. Garcia, D. M. Prett, and M. Morari, "Model predictive control: Theory and practice—a survey," Automatica, vol. 25, no. 3, pp. 335–348, 1989.
[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[18] G. Tesauro and G. R. Galperin, "On-line policy improvement using Monte-Carlo search," in Proceedings of the 9th International Conference on Neural Information Processing Systems, 1996, pp. 1068–1074.
[19] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, "DESIRE: Distant future prediction in dynamic scenes with interacting agents," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.
[20] S. Casas, W. Luo, and R. Urtasun, "IntentNet: Learning to predict intention from raw sensor data," in Conference on Robot Learning. PMLR, 2018, pp. 947–956.
[21] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, "MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction," CoRR, vol. abs/1910.05449, 2019. [Online]. Available: http://arxiv.org/abs/1910.05449
[22] M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg, "DART: Noise injection for robust imitation learning," in Conference on Robot Learning. PMLR, 2017, pp. 143–156.
[23] D. Jarrett, I. Bica, and M. van der Schaar, "Strictly batch imitation learning by energy-based distribution matching," arXiv preprint arXiv:2006.14154, 2020.
[24] N. Baram, O. Anschel, and S. Mannor, "Model-based adversarial imitation learning," arXiv preprint arXiv:1612.02179, 2016.
[25] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, "Value iteration networks," Feb. 2016.
[26] L. Lee, E. Parisotto, D. S. Chaplot, E. Xing, and R. Salakhutdinov, "Gated path planning networks," 2018.
[27] G. Farquhar, T. Rocktäschel, M. Igl, and S. Whiteson, "TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning," 2018.
[28] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[29] T. Anthony, Z. Tian, and D. Barber, "Thinking fast and slow with deep learning and tree search," arXiv preprint arXiv:1705.08439, 2017.
[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[31] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[32] N. Brown, A. Bakhtin, A. Lerer, and Q. Gong, "Combining deep reinforcement learning and search for imperfect-information games," arXiv preprint arXiv:2007.13544, 2020.
[33] J. B. Hamrick, A. L. Friesen, F. Behbahani, A. Guez, F. Viola, S. Witherspoon, T. Anthony, L. Buesing, P. Veličković, and T. Weber, "On the role of planning in model-based deep reinforcement learning," 2020.
[34] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al., "Mastering Atari, Go, chess and shogi by planning with a learned model," Nature, vol. 588, no. 7839, pp. 604–609, Dec. 2020. [Online]. Available: http://dx.doi.org/10.1038/s41586-020-03051-4
[35] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, "Decision transformer: Reinforcement learning via sequence modeling," 2021.
[36] M. Janner, Q. Li, and S. Levine, "Reinforcement learning as one big sequence modeling problem," 2021.
[37] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[38] M. Bansal, A. Krizhevsky, and A. S. Ogale, "ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst," CoRR, vol. abs/1812.03079, 2018. [Online]. Available: http://arxiv.org/abs/1812.03079
[39] F. Behbahani, K. Shiarlis, X. Chen, V. Kurin, S. Kasewa, C. Stirbu, J. Gomes, S. Paul, F. A. Oliehoek, J. Messias, and S. Whiteson, "Learning from demonstration in the wild," Nov. 2018.
[40] L. Bergamini, Y. Ye, O. Scheel, L. Chen, C. Hu, L. Del Pero, B. Osinski, H. Grimmett, and P. Ondruska, "SimNet: Learning reactive self-driving simulations from real-world observations," arXiv preprint arXiv:2105.12332, 2021.
[41] S. Suo, S. Regalado, S. Casas, and R. Urtasun, "TrafficSim: Learning to simulate realistic multi-agent behaviors," 2021.
[42] J. Wang, A. Pun, J. Tu, S. Manivasagam, A. Sadat, S. Casas, M. Ren, and R. Urtasun, "AdvSim: Generating safety-critical scenarios for self-driving vehicles," CoRR, vol. abs/2101.06549, 2021. [Online]. Available: https://arxiv.org/abs/2101.06549
[43] M. Zhou, J. Luo, J. Villela, Y. Yang, D. Rusu, J. Miao, W. Zhang, M. Alban, I. Fadakar, Z. Chen, et al., "SMARTS: Scalable multi-agent reinforcement learning training school for autonomous driving," arXiv preprint arXiv:2010.09776, 2020.
[44] H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. M. Wolff, A. H. Lang, L. Fletcher, O. Beijbom, and S. Omari, "nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles," CoRR, vol. abs/2106.11810, 2021. [Online]. Available: https://arxiv.org/abs/2106.11810