The Waymo Open Sim Agents Challenge

Nico Montali, John Lambert, Paul Mougin, Alex Kuefler, Nicholas Rhinehart, Michelle Li, Cole Gulino, Tristan Emrich, Zoey Yang, Shimon Whiteson, Brandyn White, Dragomir Anguelov
Waymo LLC

Abstract

Simulation with realistic, interactive agents represents a key task for autonomous vehicle software development. In this work, we introduce the Waymo Open Sim Agents Challenge (WOSAC). WOSAC is the first public challenge to tackle this task and to propose corresponding metrics. The goal of the challenge is to stimulate the design of realistic simulators that can be used to evaluate and train a behavior model for autonomous driving. We outline our evaluation methodology, present results for a number of different baseline simulation agent methods, and analyze several submissions to the 2023 competition, which ran from March 16, 2023 to May 23, 2023. The WOSAC evaluation server remains open for submissions, and we discuss open problems for the task.

1 Introduction

Simulation environments allow cheap and fast evaluation of autonomous driving behavior systems, while also reducing the need to deploy potentially risky software releases to physical systems. While generation of synthetic sensor data was an early goal [19, 43] of simulation, use cases have evolved as perception systems have matured. Today, one of the most promising use cases for simulation is system safety validation via statistical model checking [1, 16] with Monte Carlo trials involving realistically modeled traffic participants, i.e., simulation agents.

Figure 1: WOSAC models the simulation problem as simulation of mid-level object representations, rather than as sensor simulation.

Simulation agents are controlled objects that perform realistic behaviors in a virtual world. In this challenge, in order to reduce the computational burden and complexity of simulation, we focus on simulating agent behavior as captured by the outputs of a perception system, e.g., mid-level object representations [2, 79] such as object trajectories, rather than simulating the underlying sensor data [13, 37, 62, 76] (see Figure 1).

Table 1: A comparison of three autonomous-vehicle behavior related tasks which involve generation of a desired future sequence of physical states: trajectory forecasting, planning, and simulation. Note that observations $o_t \in \mathcal{O}$ include simulated agent and environment properties.

Task | Multiple object categories | Outputs | Vehicle kinematic constraints | System evaluation | System objectives
Multi-Agent Trajectory Forecasting | ✓ | $(x_t, y_t, \theta_t, v^x_t, v^y_t)_{t=1}^{T}$ | ✗ | Open-loop | Kinematic accuracy and mode covering
AV Motion Planning | ✗ | $(x_t, y_t, \theta_t)_{t=1}^{T}$ or controls | ✓ | Closed-loop | Safety, comfort, progress
Agent and Environment Simulation | ✓ | $\{o_t\}_{t=1}^{T}$, $o_t \in \mathcal{O}$ | ✗ | Closed-loop | Distributional realism

A requirement for modeling realistic behavior in simulation is the ability of sim agents to respond to arbitrary behavior of the autonomous vehicle (AV). "Pose divergence" or "simulation drift" [3] is defined as the deviation between the AV's behavior in driving logs and its behavior during simulation, which may manifest as differing position, heading, speed, acceleration, and more. Directly replaying the logged behavior of all other objects in the scene [32, 34, 35] under arbitrary AV planning may have limited realism because of this pose divergence.
Such log-playback agents tend to heavily overestimate the aggressiveness of real actors, as they are unwilling to deviate from their planned route under any circumstances. On the other hand, rule-based agents that follow heuristics such as the Intelligent Driver Model (IDM) [65] are overly accommodating and reactive. We seek to evaluate and encourage the development of sim agents that lie in the middle ground, adhering to a definition of realism that implies matching the full distribution of human behavior.

To the best of our knowledge, there is to date no existing benchmark for the evaluation of simulation agents. Benchmarks have spurred notable innovation in other areas related to autonomous driving research, especially perception [6, 10, 24, 58], motion forecasting [6, 10, 22, 72, 78], and motion planning [19]. We believe a standardized benchmark can likewise spur dramatic improvements in simulation agent development. Among these benchmarks, those focused on motion forecasting are perhaps most similar to simulation, but all involve open-loop evaluation, which is clearly deficient compared to our closed-loop evaluation. Furthermore, we introduce realism metrics which are suitable for evaluating long-term futures. Relevant datasets such as the Waymo Open Motion Dataset (WOMD) [22] already contain real-world agent behavior examples, and we build on top of WOMD to build WOSAC.

In this challenge, we focus on a subset of the possible perception outputs; e.g., traffic light states and vehicle attributes are not modeled, and we leave them for future work. The challenges our benchmark raises are unique, and meaningful progress on it would amount to progress on one of the hard open problems in self-driving. We have a number of open questions: Are there benefits to scene-centric, rather than agent-centric, simulation methods? What is the most useful generative modeling framework for the task? What degree of motion planning is needed for agent policies, and how far can marginal motion prediction take us? How can simulation methods be made more efficient? How can we design a benchmark and enforce various simulator properties? During the first iteration of the WOSAC challenge, user submissions have helped us answer a subset of these questions; for example, we observed that most methods found it most expedient to build upon state-of-the-art marginal motion prediction methods, i.e., operating in an agent-centric manner.

In this work, we describe in detail the Waymo Open Sim Agents Challenge (WOSAC) with the goal of stimulating interest in traffic simulation and world modeling. Our contributions are as follows:

• An evaluation framework for autoregressive traffic agents based on the approximate negative log likelihood they assign to logged data.
• An evaluation platform with an online leaderboard, open for submissions at https://waymo.com/open/challenges/2023/sim-agents/.
• An empirical evaluation and analysis of various baseline methods, as well as several external submissions.

2 Related Work

Multi-Agent Traffic Simulation. Simulators have been used to train and evaluate autonomous driving planners for several decades, dating back to ALVINN [43]. While simulators such as CARLA [19], SUMO [33], and Flow [73] provide only a heuristic driving policy for sim agents, they have still enabled progress in the AV motion planning domain [11, 12, 15]. Other recent simulators such as Nocturne [68] use a simplified world representation that consists of a fixed roadgraph and moving agent boxes.
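The heuristic driving policies exposed by such simulators are typified by car-following rules like the IDM referenced above. As a point of reference only, the following is a minimal sketch of IDM longitudinal control, assuming a single lead vehicle and illustrative parameter values rather than the defaults of any particular simulator:

```python
import math

def idm_acceleration(
    v: float,        # ego speed [m/s]
    v_lead: float,   # lead-vehicle speed [m/s]
    gap: float,      # bumper-to-bumper gap to the lead vehicle [m]
    v_desired: float = 15.0,  # desired free-flow speed [m/s] (illustrative)
    a_max: float = 1.5,       # maximum acceleration [m/s^2]
    b_comf: float = 2.0,      # comfortable deceleration [m/s^2]
    s0: float = 2.0,          # minimum standstill gap [m]
    t_headway: float = 1.5,   # desired time headway [s]
    delta: float = 4.0,       # free-flow acceleration exponent
) -> float:
    """Intelligent Driver Model: longitudinal acceleration from gap and relative speed."""
    dv = v - v_lead  # closing speed (positive when approaching the lead vehicle)
    # Desired dynamic gap: standstill gap + headway term + braking-interaction term.
    s_star = s0 + max(0.0, v * t_headway + v * dv / (2.0 * math.sqrt(a_max * b_comf)))
    # Free-road acceleration term minus interaction (braking) term.
    return a_max * (1.0 - (v / v_desired) ** delta - (s_star / max(gap, 1e-3)) ** 2)
```

Policies of this form are reactive and collision-avoidant by construction, which is exactly why, as noted in the Introduction, they tend to be overly accommodating relative to real drivers.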
Table 2: Existing evaluation methods for simulation agents. Entries are ordered chronologically by arXiv timestamp. Column metrics (left to right): ADE or minADE; Offroad Rate; Collision Rate; Instance-Level Distribution Matching; Dataset-Level Distribution Matching; Spatial Coverage or Diversity; Goal Progress or Completion. There is limited consensus in the literature regarding how multi-agent simulation should be evaluated.

ConvSocialPool [17] ✓ ✓
Trajectron [29] ✓ ✓
PRECOG [48] ✓ ✓
BARK [4] ✓ ✓ ✓
SMARTS [82] ✓ ✓
TrafficSim [60] ✓ ✓ ✓ ✓
SimNet [3] ✓ ✓
Symphony [28] ✓ ✓ ✓ ✓
Nocturne [68] ✓ ✓ ✓
BITS [74] ✓ ✓ ✓ ✓
InterSim [59] ✓ ✓ ✓
MetaDrive [34] ✓ ✓ ✓
TrafficBots [79] ✓ ✓ ✓
WOSAC (Ours) ✓ ✓ ✓

Simulation agent modeling is closely related to the problem of trajectory forecasting, as a sim agent could execute a set of trajectory predictions as its plan [4, 59]. However, as trajectory prediction methods are traditionally trained open-loop, they have limited capability to recover from the out-of-domain situations encountered during closed-loop simulation [51]. In addition, few forecasting methods produce consistent joint future samples at the scene level [36, 48]. Sim agent modeling is also related to planning, as each sim agent could execute a replica of a planner independently [4]. However, these three tasks differ dramatically in objectives, outputs, and constraints (see Table 1).

Learned Sim Agents. Learned sim agents in the literature differ widely in their assumptions around policy coordination, dynamics model constraints, observability, and input modalities. While coordinated scene-centric agent behavior is studied in the open-loop motion forecasting domain [8, 9, 57], to the best of our knowledge, TrafficSim [60] is the only closed-loop, learned sim agent work to use a joint, scene-centric actor policy; all others operate in a decentralized manner without coordination [5, 28, 74], i.e., each agent in the scene is independently controlled by a replica of the same model using agent-centric inference. BITS and TrafficBots [74, 79] use a unicycle dynamics model and Nocturne [68] uses a bicycle dynamics model, whereas most others do not specify any such constraint; some, such as Nocturne [68], additionally enforce partial observability constraints. Other methods differ in the type of input, whether rasterized [3, 74] or provided in a vector format [28, 79]. Some works focus specifically on generating challenging scenarios [47], and others aim for user-based controllability [81]. Some are trained via pure imitation learning [3], while others include closed-loop adversarial losses [28, 60] or multi-agent RL [4, 34, 82] in order to learn to recover from mistakes [51]. Several works, including InterSim, use a goal-conditioned problem formulation [4, 28, 59, 74, 79, 82], while others do not [3].

Evaluation of Generative Models. Distribution matching has become a common way to evaluate generative models [18, 20, 27, 30, 31, 45, 46, 49, 52, 77], most prominently through the Fréchet Inception Distance (FID) [26]. Previous evaluation methods such as the Inception Score (IS) [53] reason over the entropy of conditional and unconditional distributions, but are not applicable in our case due to the multi-modality of the simulation problem. The FID improves on the Inception Score by using statistics of real-world samples, measuring the difference between the generated distribution and a data distribution (in the simulation domain, the logged distribution).
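Concretely, FID reduces each sample set to the mean and covariance of its feature embeddings and computes the Fréchet (2-Wasserstein) distance between the two implied Gaussians. The following is a minimal numpy/scipy sketch, assuming feature embeddings have already been computed by some fixed encoder:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to real and generated feature sets.

    real_feats, gen_feats: arrays of shape (num_samples, feature_dim).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; discard small imaginary
    # components introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```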
However, FID has limited sensitivity per example due to the aggregation of statistics over entire test datasets into a single mean and covariance.

Evaluating Multi-Agent Simulation. There is limited consensus in the literature regarding how multi-agent simulation should be evaluated (see Table 2), and no mainstream benchmark exists. Given the importance of safety, almost all existing sim agent works measure some form of collision rate [3, 28, 59, 60, 68, 74], and some multi-object joint trajectory forecasting methods also measure it via trajectory overlap [36]. However, collision rate can be artificially driven to zero by static policies, and thus cannot measure realism. Quantitative evaluation of realism requires comparison with logged data. Such evaluation methods vary widely, from distribution matching of vehicle dynamics [28, 74], to comparison of offroad rates [28, 60, 74], spatial coverage and diversity [60, 74], and progress towards a goal [59, 68]. However, goals are not observable and are thus difficult to extract reliably. Requiring direct reconstruction of logged data through metrics such as Average Displacement Error (ADE) has also been proposed [3, 68], but has limited effectiveness because there are generally multiple realistic actions the AV or sim agents could take at any given moment. To overcome this limitation, one option is to allow the user to provide multiple possible trajectories per sim agent, as in TrafficSim [60], which uses a minimum average displacement error (minADE) over 15 simulations.

Recently, generative-model-based evaluation has become more popular in the simulation domain, primarily through distribution matching metrics. Symphony [28] uses Jensen-Shannon distances over trajectory curvature. NeuralNDE [75] compares distributions of vehicle speed and inter-vehicle distance, whereas BITS [74] utilizes Wasserstein distances on agent scene occupancy using multiple rollouts per scene, along with Wasserstein distances between simulated and logged speed and jerk, two kinematic features which can encapsulate passenger comfort. The latter are computed as a distribution-to-distribution comparison at the dataset level; however, this type of metric has shown limited sensitivity in our experiments.

Likelihood metrics. An alternative distribution matching framework is to measure point-to-distribution distances. [17] and [29] introduce a metric defined as the average negative log likelihood (NLL) of the ground-truth trajectory, as determined by a kernel density estimate (KDE) [42, 50] over output samples at the same prediction timestep. This metric has found some adoption [54, 64], and we primarily build on it in our work. We note that likelihood-based generative models of simulation such as PRECOG [48] and MFP [63] directly produce likelihoods, meaning that a KDE over sampled trajectories is not needed to estimate likelihoods for such model classes. Concurrent work [79] also measures the NLL of the ground-truth scene under 6 rollouts.
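To make the point-to-distribution idea concrete, the following is a minimal sketch of such a per-timestep KDE-based NLL: fit a kernel density estimate over simulated agent positions at each timestep and evaluate the negative log likelihood of the logged position under it. The Gaussian kernel, the 2-D position features, and scipy's default bandwidth are illustrative assumptions, not the exact configuration of the challenge metrics:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_nll(simulated_xy: np.ndarray, logged_xy: np.ndarray) -> float:
    """Average NLL of logged positions under KDEs fit to simulated rollouts.

    simulated_xy: (num_rollouts, num_timesteps, 2) positions from sampled rollouts.
    logged_xy:    (num_timesteps, 2) positions from the driving log.
    """
    nlls = []
    for t in range(logged_xy.shape[0]):
        # gaussian_kde expects samples with shape (dims, num_samples);
        # it needs more (non-degenerate) rollout samples than dimensions to fit.
        kde = gaussian_kde(simulated_xy[:, t, :].T)
        # Log density of the logged position at the same timestep.
        log_p = kde.logpdf(logged_xy[t, :].reshape(2, 1))[0]
        nlls.append(-log_p)
    return float(np.mean(nlls))
```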
3 Traffic Simulation as Conditional Generative Modeling

Our goal is to encourage the design of traffic simulators by defining a data-driven evaluation framework and instantiating it with publicly accessible data. We focus on simulating agent behavior in a setting in which an offboard perception system is treated as fixed and given.

Problem formulation. We formulate driving as a Hidden Markov Model $\mathcal{H} = \langle \mathcal{S}, \mathcal{O}, p(o_t \mid s_t), p(s_t \mid s_{t-1}) \rangle$, where $\mathcal{S}$ denotes the set of unobservable true world states, $\mathcal{O}$ denotes the set of observations, $p(o_t \mid s_t)$ denotes the sampleable emission distribution, and $p(s_t \mid s_{t-1})$ denotes the hidden Markovian state dynamics: the probability of the hidden state transitioning from $s_{t-1}$ at timestep $t-1$ to $s_t$ at timestep $t$. Each $o_t \in \mathcal{O}$ can be partitioned into AV- and environment-centric components that vary in time: $o_t = [o^{\mathrm{AV}}_t, o^{\mathrm{env}}_t]$. The component $o^{\mathrm{env}}_t$ can in general contain a rich set of features, but for the purpose of our challenge it contains solely the poses of the non-AV agents. We denote the true observation dynamics as $p_{\mathrm{world}}(o_t \mid s_{t-1}) \doteq \mathbb{E}_{p(s_t \mid s_{t-1})}\left[p(o_t \mid s_t)\right]$.

The task. The task is to build a "world model" qworld(ot|oc