Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula

Eli Bronstein∗, Sirish Srinivasan∗, Supratik Paul∗, Aman Sinha, Matthew O’Kelly, Payam Nikdel, Shimon Whiteson
Waymo, LLC
{ebronstein, sirishs, supratikpaul, thisisaman, mokelly, payamn, shimonw}@waymo.com

∗Denotes equal contribution.
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.

Abstract: ML-based motion planning is a promising approach to produce agents that exhibit complex behaviors, and automatically adapt to novel environments. In the context of autonomous driving, it is common to treat all available training data equally. However, this approach produces agents that do not perform robustly in safety-critical settings, an issue that cannot be addressed by simply adding more data to the training set—we show that an agent trained using only a 10% subset of the data performs just as well as an agent trained on the entire dataset. We present a method to predict the inherent difficulty of a driving situation given data collected from a fleet of autonomous vehicles deployed on public roads. We then demonstrate that this difficulty score can be used in a zero-shot transfer to generate curricula for an imitation-learning based planning agent. Compared to training on the entire unbiased training dataset, we show that prioritizing difficult driving scenarios both reduces collisions by 15% and increases route adherence by 14% in closed-loop evaluation, all while using only 10% of the training data.

Keywords: Imitation Learning, Curriculum Learning, Autonomous Driving

1 Introduction

Autonomous vehicles (AV) typically rely on optimization-based motion planning and control methods. These techniques involve bespoke components specific to the deployment region and AV hardware, and require copious hand-tuning to adapt to new environments. An alternative approach is to apply machine learning (ML) to the lifetimes of experience that AV fleets can collect within days or weeks. A paradigm shift to ML-based planning could automate the adaptation of behaviors to new areas, improve planning latency, and increase the impact of hardware acceleration.

For example, imitation learning (IL) can utilize the large tranches of expert demonstrations collected by the regular operations of AV fleets to produce policies that perform well in common scenarios, without the need to specify a reward function. However, both the distribution from which experiences are sampled and the policy used to generate the demonstrations can critically affect the IL policy’s performance [1]. The training data distribution is especially important when learning methods are applied to problems characterized by long tail examples (c.f. [2, 3, 4, 5, 6]). In the case of autonomous driving, the vast majority of observed scenarios are simple enough to be navigated without any negative safety outcomes. A visual inspection of a random subset of our data suggests that half of it consists of scenarios with the AV as the only road user in motion, while another quarter contains other moving road users, but not necessarily close enough to the AV to affect its behavior. As a result, IL policies may not be robust in safety-critical and long tail situations. Reinforcement learning (RL) can be used to explicitly penalize poor behavior, but due to the complex nature of driving [7], it is difficult to design a reward function for AVs that aligns with human expectations.
Even if reward signals were provided for safety-critical events or traffic law violations (e.g., collisions, running a red light), they would be extremely sparse since such events are quite rare. Furthermore, exploration to collect more long tail data is challenging due to safety concerns [8].

Despite these issues, most learning-based robotics and AV applications use the naive strategy of creating training datasets from all available demonstrations. In the context of AVs, since most driving situations are simple, this strategy is both inefficient and unlikely to generate a policy that is robust to difficult scenarios. A common solution is to upsample challenging examples, either by increasing their sampling probability by a predetermined factor [9] or with curriculum learning [10], i.e., dynamically updating the sampling probability during training based on the agent’s performance. However, both of these approaches include significant hurdles. Upsampling requires that we know which examples are part of the long tail a priori, as in standard classification problems where the class-label imbalance can inform a sampling strategy. In IL and RL, no such labels are available. As such, curriculum learning is more suitable since it uses the agent’s current performance to identify hard examples. However, standard approaches to curriculum learning are specific to the agent being trained; they do not, for example, utilize data collected by deployed AVs running other planners, which can provide more general, policy-agnostic insights into the long tail of driving.

Figure 1: (A) The fleet collects experiences with a variety of policies in multiple operational design domains. (B) The fleet data is sharded into run segments. (C) Fleet data is used to learn an embedding that maps a run segment to a vector space based on similarity. (D) Run segments are selected for counterfactual simulations and human triage; the outcome of this process is a labeled set of difficulty scores. (E) An MLP is trained to regress from embeddings to the difficulty labels.

In this paper, we propose an approach (summarized in Figure 1) that addresses the challenges of upsampling and curriculum learning applied to an AV setting. Developing a road-ready AV generally involves both collecting real-world data with an expert, which can be a combination of human drivers and thoroughly evaluated AV planners; and evaluating new development planners, which are regularly simulated on the data collected by the expert to identify potential failure modes, generating a large counterfactual dataset. Our method uses this readily available data to train a difficulty model that scores the inherent difficulty of a given scenario by predicting the probability of collisions and near-misses in simulation.
This difficulty model provides several key benefits: 1) it is computationally less expensive to predict a driving situation’s difficulty than to simulate it for a policy being trained; 2) the model learns the inherent, policy-agnostic difficulty of a scenario because it is trained on multiple development planners in different geographic regions; and 3) the model predicts a continuous score that can be used to identify scenarios within an arbitrary difficulty range, rather than obtaining a few counterfactual failures. We show that a zero-shot transfer of this model can identify long-tail examples that are difficult for a new IL-based planning agent, without any fine tuning. This allows us to upsample difficult training examples without expensive evaluation of the agent during training. Though we train the planning agent using IL as a case study, our approach can be applied to any ML-based planning approach.

The main contributions of this paper are:

1. We train a model to predict which driving scenarios are difficult for development planners and show that it can zero-shot transfer to the task of finding challenging scenarios on which to train an ML-based planning agent. This generalization suggests that the model can predict the inherent difficulty of a driving situation.
2. We show that training an ML-based planning agent on unbiased driving data leads to poor performance on difficult examples since easy driving scenarios dominate rarer, harder cases.
3. We show that using our difficulty model to upsample more challenging scenarios reduces collisions by 15% and increases route adherence by 14% on the unbiased test set. This suggests that there are significant diminishing returns in adding common scenarios to the training dataset.

2 Related Work

The application of RL to the task of autonomous driving has received significant attention in recent years [11]; proposed methods span the gamut of methodologies and the AV stack itself. RL has been used to address a variety of problems including end-to-end motion planning, behavior generation, reward design, and even behavior prediction. In this work, we focus on imitation learning techniques [12], which avoid direct specification of a reward function, and rely instead on expert demonstrations. As a result, they can capture subtle human preferences and demonstrate impressive performance on a variety of robotics tasks. However, despite many attempts [13, 14, 15], IL and RL techniques still struggle with the long tail present in the driving task [4].

Like this work, Brys et al. [16] and Suay et al. [17] consider how to leverage potentially suboptimal demonstrations to improve the efficiency and robustness of learning. Unlike these works, we use offline methods to learn a model of each scenario’s difficulty and bias the distribution that IL is performed on. This approach is similar to baselines [18, 19] inspired by Peters and Schaal [9]; however, unlike these works, our setting does not provide a reward signal for the proposed demonstrations. Instead, we use offline, off-policy simulations to learn a foundation model with which we can efficiently approximate a scenario’s difficulty, which would have otherwise required expensive counterfactual simulations during training. Our approach sidesteps the inefficiencies of performing rollouts of the learnt policy on the entire dataset since inference using the difficulty model is computationally much cheaper than simulation. Similar techniques have also been proposed by Brown et al.
[20]; however, they focus largely on situations with severely suboptimal demonstrations where the reward is specified. Similar problems have also been identified in offline RL [21]. Interestingly, Kumar et al. [22] identify the tight relationship between imitation learning and offline RL, noting the theoretical advantage of incorporating reward information in settings like autonomous driving which must avoid rare catastrophic failures. Our experiments provide empirical support for this insight.

Curriculum learning (CL) [10] is also closely related to this work. While not originally classified as such, methods like automatic domain randomization, prioritized experience replay [23], and AlphaGo’s self-play [24, 25] have led to superhuman game-playing agents and breakthroughs in sim2real transfer [23, 26, 27]. CL methods solve for surrogate objectives rather than directly optimizing the final performance of the learner. They control which transitions are sampled, the behavior of other agents in an environment, the generation of initial states, or even the reward function. CL methods are also characterized by whether they are used on- or off-policy. For example, Uesato et al. [28] exploit low quality policies to obtain failures in an on-policy RL setting. Finding hard examples in the training data using this approach requires repeatedly generating rollouts for each expert trajectory in the dataset. Such an approach is computationally infeasible when operating at scale since the training datasets can have hundreds of thousands of real-world driving miles. Similar approaches known as hard-negative mining have been used in supervised learning settings [29]; like Uesato et al. [28] they evaluate the difficulty of examples online.

Instead, we consider variants of CL that exploit off-policy data. As in the on-policy case, the key problem is to determine which data is interesting. Off-policy compatible methods are also generally surrogate-based. For example, they can select for diversity [30], moderate difficulty [27], surprise [23], or learning progress [31]. Our approach is most similar to Akkaya et al. [27], but instead of performing expensive agent evaluation during training, we use off-policy data both to train a foundation model [32], which encodes experiences, and to classify the difficulty of an interaction. We also utilize large-scale real-world data and demonstrate that simpler curricula are effective.

3 Background

Model-based Generative Adversarial Imitation Learning: Behavior cloning (BC) [33, 13] is a naive imitation learning method that applies supervised learning to match the expert’s conditional action distribution: arg max_θ E_{s,a∼π_E}[log π_θ(a|s)]. BC policies may suffer from covariate shift, resulting in quadratic worst-case error with respect to the time horizon [34]. To address this issue, generative adversarial imitation learning (GAIL) [35] formulates IL as an adversarial game between the policy π_θ and the discriminator D_ω. The discriminator is trained to classify whether a given trajectory was sampled from π_θ (labeled 0) or from the expert demonstration (labeled 1), and the policy is trained to generate trajectories that are indistinguishable from demonstrations:

arg max_θ arg min_ω  E_{s,a∼π_θ}[log D_ω(s, a)] + E_{s,a∼π_E}[log(1 − D_ω(s, a))].

GAIL minimizes the gap in the joint distributions of states and actions p(s, a) between the policy and the expert, resulting in linear error with respect to the time horizon [36].
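To make the adversarial objective above concrete, the following is a minimal PyTorch-style sketch of one discriminator update and one policy update under the labeling convention used here (expert = 1, policy = 0). The module and variable names are illustrative assumptions rather than the paper's implementation, and the direct backpropagation through the policy samples in the second step is only valid when those samples reach the discriminator through a differentiable rollout, which is the model-based setting discussed next; standard GAIL instead estimates this term with policy gradients.

```python
import torch
import torch.nn.functional as F

def adversarial_il_step(policy_sa, expert_sa, discriminator, d_opt, p_opt):
    """One GAIL-style update. `policy_sa` are (state, action) features produced by a
    differentiable rollout of the policy; `expert_sa` come from logged demonstrations.
    `discriminator` maps a batch of (state, action) features to probabilities in (0, 1)."""
    # Discriminator step: label expert samples 1 and policy samples 0.
    d_expert = discriminator(expert_sa)
    d_policy = discriminator(policy_sa.detach())
    d_loss = F.binary_cross_entropy(d_expert, torch.ones_like(d_expert)) + \
             F.binary_cross_entropy(d_policy, torch.zeros_like(d_policy))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Policy step: make policy samples look expert-like, i.e., maximize E[log D(s, a)].
    p_loss = -torch.log(discriminator(policy_sa) + 1e-8).mean()
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()  # p_opt holds only policy params
    return d_loss.item(), p_loss.item()
```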
However, GAIL relies on high variance policy gradient estimates because it uses an unknown dynamics model, making its objective function non-differentiable. In contrast, model-based GAIL (MGAIL) [37] uses differentiable dynamics in combination with the reparameterization trick [38] to reduce the variance of the policy gradient estimates.

4 Method

A key challenge in commercial AV development is to design an AV planner that can safely and efficiently navigate real-world settings while aligning with human expectations. At any given time, there may exist multiple development planners under evaluation. Iteratively improving an AV planner typically involves the following three steps. 1) Data Collection: Real-world data is collected by a fleet of vehicles in the operational area. 2) Data-Driven Simulation: A development planner is tested in simulation by having it control the data-collecting ego vehicle in a run segment, or a short snippet of recorded driving data. 3) Evaluation and Improvement: The development planner is evaluated on key metrics based on these simulations, with potential issues identified and addressed.

As mentioned in Section 1, we consider an ML-based approach to developing an AV planner from the ground up. One option is to use imitation learning to train a planning agent. Given an initial dataset of logged expert driving, a naive approach is to train the agent on the entire dataset. However, this means that challenging long tail segments are used only a few times during training, yielding a planning agent that has difficulty negotiating similar situations [2, 5, 6]. Thus, to improve our agent, we require a method to upsample these rare segments.

4.1 Difficulty Model

The key idea behind our method is to use the real-world run segments replayed in simulation with development planners to learn a difficulty model that predicts the difficulty of a logged segment, i.e., whether a development planner is likely to have a poor safety outcome in simulation. We train the difficulty model on simulations of multiple development planners in different geographic areas, so it can be seen as marginalizing over a diverse distribution of development planners. This makes the model more likely to be able to identify the inherent difficulty of a segment. In turn, this facilitates the zero-shot transfer from training on data from development planners to inferring difficulty for a substantially different planning agent. Intuitively, segments that development planners find difficult are likely to be difficult for the planning agent as well. Specifically, we use the difficulty model’s scores to inform our upsampling strategy for training the planning agent.

The evaluation process for development planners typically involves large-scale simulations, with potentially problematic behaviors flagged for engineers to address. We train the difficulty model to predict collisions and near-misses attributable to the development planner, as opposed to other road users. This data is generated in the normal course of the AV planner development cycle, so no new training data is needed. Since we want to marginalize out the idiosyncrasies of individual development planners, we model a simulation’s safety outcome y ∈ {0, 1} (1 if a collision or near-miss occurred, 0 otherwise) as a function of the logged run segment alone. The input to our model is a learned segment embedding from a separately trained model.
Given a logged run segment, we collect static features (e.g., road/lane layouts, crosswalks, stop signs), dynamic features (e.g., positions and orientations of other road users over time), and kinematic information about the data-collecting ego vehicle. We use these features to generate two top-down images of the segment: one of the ego vehicle’s trajectory, and another of the static features and other road users’ trajectories. We encode each image into a dense d-dimensional embedding vector (as in [39]) using a CNN and contrastively train a classifier (e.g., [40]) with cross-entropy loss to determine if two images are from the same run segment (see Figure 1c). Our difficulty model is an MLP that learns a function f : R^d → [0, 1] mapping the embedding to the simulated safety outcome y. We trained this model using cross-entropy loss on a dataset of 5.6k positive and 80k negative examples. The number of negative examples was downsampled by multiple orders of magnitude since the prevalence of simulated collisions and near-misses is extremely low. The model produces uncalibrated scores by design, as trying to calibrate it to the extremely small unbiased prevalence rate of positive examples is numerically unstable.

4.2 Sampling Strategies

Given the long-tail nature of the difficulty scores (see Figure 2), it is natural to upsample difficult segments during training. A standard solution for upsampling in classification problems is to create separate datasets for each class, and then generate a batch by sampling a specified proportion from each dataset. Since this requires discretized classes, it cannot be applied to our real-valued difficulty scores. Moreover, due to the large training dataset size, it is not scalable to upsample individual segments: the entire dataset cannot fit in memory, and random access to individual examples from disk is incompatible with distributed file sharding of data. Instead, we partition the dataset into ten equally sized buckets, each corresponding to a decile of the data by difficulty scores, with up/downsampling achieved by assigning different sampling probabilities to each bucket. This enables us to efficiently generate batches on the fly (e.g., sampling a weighted batch of k run segments requires minimal overhead over the k constant-time accesses to the head pointers of each bucket). This decile-based bucketing also ensures that our method is agnostic to the model scores, which are uncalibrated.

We consider two training variants: 1) a fixed weighting scheme for each bucket, held constant throughout training, and 2) a schedule of weights for each bucket that changes as training progresses. Specifically, we use the following three sampling strategies. “Highest-10%” trains the agent only on the highest scoring bucket (i.e., on the segments with the highest 10% difficulty scores). “Uniform-10%” upsamples difficult segments by setting each bucket’s sampling weight to the range of difficulty scores in that bucket (in the limit of infinite buckets, this approaches a uniform distribution over the difficulty scores). “Geometric-schedule-10%” implements a geometric progression of weights, with each bucket weighted equally at the beginning of training and weighted proportional to its average difficulty score at the end of training (see Appendix 8.2 for further details). In Section 5.4 we compare the performance of these training variants against several baselines. Our variants are trained on only a 10% sample of available data.
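As a concrete illustration of the bucket-level sampling just described, the sketch below partitions segments into decile buckets by difficulty score and draws weighted batches from them. It is a simplified in-memory stand-in, with assumed names and toy data, for the sharded, streaming pipeline described above.

```python
import random
import numpy as np

def make_decile_buckets(segments, scores, n_buckets=10):
    """Partition segments into equally sized buckets by difficulty-score decile,
    lowest decile first."""
    order = np.argsort(scores)
    return [[segments[i] for i in part] for part in np.array_split(order, n_buckets)]

def sample_batch(buckets, bucket_weights, batch_size):
    """Pick a bucket in proportion to its weight, then a segment uniformly at random
    within that bucket; repeat until the batch is full."""
    probs = np.asarray(bucket_weights, dtype=float)
    probs = probs / probs.sum()
    picks = np.random.choice(len(buckets), size=batch_size, p=probs)
    return [random.choice(buckets[k]) for k in picks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segments = [f"segment_{i}" for i in range(1000)]  # stand-ins for run segments
    scores = rng.random(1000)                         # stand-ins for difficulty scores
    buckets = make_decile_buckets(segments, scores)
    highest_10 = [0.0] * 9 + [1.0]                    # "Highest-10%": hardest decile only
    batch = sample_batch(buckets, highest_10, batch_size=64)
```

The other strategies only change the bucket weights: “Uniform-10%” uses each bucket’s score range (Appendix 8.3), and “Geometric-schedule-10%” interpolates the weights over training (Appendix 8.2).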
5 Experiments

To prevent information leakage between the difficulty model and the planning agent, the former is trained on a dataset collected more than six months prior to the dataset for the latter. The training dataset for the planning agent consists of over 14k hours of driving logged by a fleet of vehicles. We split the data into 10 second run segments, resulting in over 5 million training segments. We also create two test sets, chronologically separate from the training set to prevent train-test leakage. The first, unbiased test set is composed of 20k segments sampled uniformly from logged data. The second set consists of 10k segments with difficulty scores in the top one percentile of the training data’s score distribution. The distributions of the difficulty model scores for the train set and unbiased test set both have long tails (see Figure 2): scores above 0.85 account for only around 0.5% of the dataset.

As described in Section 4.2, we split the training dataset into 10 equal sized buckets based on the difficulty score deciles. 200 run segments are further split from each training bucket to obtain validation buckets for model selection. To highlight the effect of our training schemes on performance on segments of varying difficulty, we use the same bucketing approach for the unbiased test set as for the training set. We report the full, unbiased test set results by aggregating over all buckets.

5.1 Baselines

We report three baselines for comparison, which differ in their training data: “Baseline-all” is trained on the full dataset, “Baseline-10%” is trained on a uniformly randomly sampled 10% of the full dataset, and “Baseline-lowest-10%” is trained only on the bucket with the lowest difficulty scores.

Figure 2: Distribution of the difficulty model scores for (a) the train dataset and (b) the unbiased test dataset. The ten alternating shaded backgrounds indicate the thresholds of the decile buckets.

5.2 Training Details

We use the planning agent described in Bronstein et al. [41], which employs a stochastic continuous action policy conditioned on a goal route and is trained using a combination of MGAIL and BC. See Appendix 8.7 for additional details. We train 10 random seeds of each agent variant and baseline for 200k steps. After the initial 100k training steps, we evaluate each agent on the validation set at intervals of 10k steps. We select the agent checkpoint with the lowest sum of collision and off-road driving rates, and evaluate it on the held-out test set. Since the learnt policy is stochastic, we report the average performance of 16 independent rollouts for each test run segment.

5.3 Metrics

We assess the planning agent’s performance using the following metrics; the first three are binary (1 if the event of interest occurred in the segment, 0 otherwise):

1. Route Failure: the agent deviates from the goal “road route” at the start of the segment, which includes all lanes in the road containing the goal lane-specific route.
2. Collision: the agent’s bounding box intersects with another road user’s bounding box.
3. Off-road: the agent’s bounding box exits the drivable road area.
4. Route Progress ratio: ratio of the distance traveled along the route by the agent and the expert.

We also report the overall failure rate as the union of the first three metrics; a segment is considered a failure if any of the binary metrics is nonzero.
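Because the overall failure rate is simply the union of the three binary events, per-segment aggregation over the 16 rollouts reduces to a few lines; the sketch below uses assumed field names purely for illustration.

```python
from statistics import mean

def is_failure(rollout):
    """A rollout of a segment counts as a failure if any binary event occurred.
    `rollout` is a dict with illustrative keys."""
    return rollout["route_failure"] or rollout["collision"] or rollout["off_road"]

def aggregate_segment(rollouts):
    """Average the metrics of the independent rollouts of one segment (16 in our setup)."""
    return {
        "failure_rate": mean(float(is_failure(r)) for r in rollouts),
        "route_progress_ratio": mean(r["route_progress_ratio"] for r in rollouts),
    }
```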
When comparing the performance of different agents, we prioritize this failure rate due to the safety-critical nature of driving, while also considering the route progress ratio to ensure the agents are making efficient forward progress.

5.4 Results

We present the performance of our training variants and baselines on the full, unbiased test set in Table 1. Each variant’s action policy is conditioned on the expert’s initial goal route, which is held constant throughout the segment.

Table 1: Evaluation of agents and baselines on the full unbiased test set (mean ± standard error of each metric across 10 seeds). For all metrics except route progress, lower is better.

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline-all | 1.38±0.13 | 1.46±0.09 | 0.73±0.07 | 81.21±0.39 | 3.33±0.20 |
| Baseline-10% | 1.34±0.06 | 1.50±0.09 | 0.67±0.06 | 81.12±0.37 | 3.28±0.13 |
| Baseline-lowest-10% | 1.14±0.05 | 4.15±0.11 | 0.98±0.10 | 81.88±0.41 | 5.91±0.13 |
| Highest-10% | 1.33±0.06 | 1.23±0.09 | 0.74±0.02 | 77.95±1.33 | 3.10±0.10 |
| Uniform-10% | 1.35±0.09 | 1.17±0.08 | 0.75±0.07 | 80.67±0.73 | 3.07±0.17 |
| Geometric-schedule-10% | 1.19±0.07 | 1.25±0.04 | 0.74±0.10 | 80.48±0.36 | 2.92±0.11 |

We observe no significant difference between the performance of Baseline-10% and Baseline-all, demonstrating that simply increasing the training dataset size does not necessarily lead to better performance. Also, Baseline-lowest-10% has the worst performance for the collision and off-road metrics. This suggests that the easiest segments are not representative of the entire test set distribution and do not contain enough useful information to learn from. However, Baseline-lowest-10% achieves the lowest route failure rate. We believe this is because the least difficult training bucket is primarily composed of segments in which it is simple to follow the route, such as one-lane roads with no other road users and minimal interaction. This could cause the Baseline-lowest-10% agent to overfit to the route features and follow the route well at the expense of safety.

All three of our upsampling variants achieve significantly lower collision rates, and comparable off-road and route failure rates to the baselines (with the exception of Baseline-lowest-10%’s route failure rate). This key result demonstrates that segments with high predicted difficulty contain the majority of useful information needed for good aggregate performance. Geometric-schedule-10% has the largest improvement over the baselines, with a significantly lower collision rate, comparable route failure and off-road rates, and a minimal decrease in the route progress ratio. This highlights the advantage of observing the whole spectrum of data at the start of training, and progressively increasing the proportion of difficult segments to emphasize more useful demonstrations.

To get a more nuanced view of each variant’s performance, we compare the variants to Baseline-10% for each of the test buckets. Figure 3 shows the performance for the lowest (0-10%), low/mid (30-40%), highest (90-100%), and long tail (99-100%) test buckets. See Figures 5 and 6 in the Appendix for metrics for all the test buckets. Not only does each agent’s collision rate correlate with the difficulty score, but so do the route failure and off-road rates, with the exception of Highest-10%’s off-road rate. This shows that segments that were challenging for development planners are also likely to be challenging for our planning agent, which enables the zero-shot transfer of the difficulty model.
It also demonstrates that although the difficulty model was only trained to predict collisions and near-misses, its predicted score describes a broader notion of difficulty, as measured by other key planning metrics.

On the highest and long tail buckets, Highest-10% and Uniform-10% achieve much lower collision rates and overall failure rates than Geometric-schedule-10% and the baseline. This shows that upsampling difficult segments results in better overall performance on those segments, not just on metrics that are highly correlated with the difficulty label (i.e., collisions and near-misses). This is encouraging, since it suggests that the training labels for the difficulty model do not need to fully define expert driving behavior in order for the resulting planning agent to exhibit improved performance across multiple metrics. However, Highest-10% and Uniform-10% perform comparably to, or worse than, the baseline on the lowest and low/mid buckets across all metrics, with especially poor performance on the route failure and off-road metrics. Thus, extreme upsampling of difficult segments sacrifices performance at the other end of the spectrum, since the easiest segments become too rare in the training data. Geometric-schedule-10% addresses this issue by upsampling difficult segments while maintaining sufficiently broad coverage over the difficulty distribution. While it does not achieve equally low collision rates as the Highest-10% and Uniform-10% variants on the highest and long tail buckets, it outperforms the baseline on collisions and performs well on the lowest and low/mid buckets, yielding the best overall performance.

Figure 3: Metrics for the Highest-10%, Uniform-10%, Geometric-schedule-10%, and Baseline-10% variants on multiple decile test buckets according to the difficulty score: (a) overall failure rate (%), (b) route failure rate (%), (c) collision rate (%), (d) off-road rate (%). For each metric, each variant’s performance is shown for the lowest (0-10%), low/mid (30-40%), highest (90-100%), and long tail (99-100%) test buckets.

6 Limitations

While our difficulty model successfully identified challenging segments, it was only trained to predict collisions and near-misses, which are just one indication of difficulty. There are other labels that would be helpful for a more comprehensive difficulty model, such as traffic law violations, route progress, and discomfort caused to both the ego vehicle’s passengers and other road users. Moreover, the difficulty model could be improved by incorporating the severity of the negative safety outcome into the training labels. Furthermore, as noted in Section 1, a large proportion of the available data consists of situations with very few other road users in the scene. The difficulty model could be replaced with a heuristics-based approach of pruning such scenarios, though the viability of doing so is difficult to gauge a priori.

In terms of evaluation metrics, we focused primarily on safety metrics, since these are of paramount importance for real-world deployment. However, we have not considered other facets of driving like comfort and reliability, which can also significantly affect the viability of ML-based planners.
Finally, while we have demonstrated that our method of upsampling long tail segments leads to better performance, we have done so only for an agent trained using MGAIL. Quantifying the performance gains with other learning methods remains a topic for future work.

7 Conclusion

We showed that the naive strategy of training on an unbiased driving dataset is suboptimal due to the large fraction of data that does not provide additional useful experience. By utilizing readily available data collected while evaluating development planners in simulation, we trained a model to identify difficult segments with poor safety outcomes. We then applied this model in a zero-shot manner to develop training curricula that upsample difficult examples. Planning agents trained with these curricula outperform the naive strategy in aggregate and are more robust in challenging, long tail scenarios. However, overly aggressive upsampling produces policies that do not handle simpler situations well. We conclude that sampling strategies that prioritize difficult segments but also include easier ones are likely to achieve the best overall performance.

We have also shown that training on the full dataset does not yield any significant benefit over training on only 10% of the data sampled uniformly at random, demonstrating that simply adding more unbiased data to the training set does not necessarily improve performance. This suggests that we can use our difficulty model to reduce the cost of AV system development in two areas: targeting active data collection when operating a fleet of vehicles and selective retention of large-scale sensor logs. Namely, since the difficulty model can predict which driving scenarios are likely to be challenging for new planning agents, we could identify geographic “hotspots” where these scenarios occur and use these locations to inform our data collection process. Furthermore, since biasing the planning agent’s training dataset toward difficult segments leads to better results with only a fraction of the available data, we could use the difficulty model scores to reduce the amount of stored data without sacrificing downstream performance.

Acknowledgments

We thank Ben Sapp, Eugene Ie, Jonathan Bingham, and Ryan Polkowski for their helpful comments, and Ury Zhilinsky for his support with experiments and infrastructure.

References

[1] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
[2] J. Frank, S. Mannor, and D. Precup. Reinforcement learning in the presence of rare events. In Proceedings of the 25th international conference on Machine learning, pages 336–343, 2008.
[3] N. Kalra and S. M. Paddock. Driving to Safety: How Many Miles of Driving Would It Take to Demonstrate Autonomous Vehicle Reliability? RAND Corporation, 2016.
[4] S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
[5] S. Paul, K. Chatzilygeroudis, K. Ciosek, J.-B. Mouret, M. Osborne, and S. Whiteson. Alternating optimisation and quadrature for robust control. In AAAI Conference on Artificial Intelligence, 2018.
[6] S. Paul, M. A. Osborne, and S. Whiteson. Fingerprint policy optimisation for robust reinforcement learning. In International Conference on Machine Learning, 2019.
[7] J. De Freitas, A. Censi, B. W. Smith, L. Di Lillo, S. E. Anthony, and E. Frazzoli.
From driverless dilemmas to more practical commonsense tests for automated vehicles. Proceedings of the national academy of sciences, 118(11), 2021. [8] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement learn- ing, pages 45–73. Springer, 2012. [9] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007. [10] R. Portelas, C. Colas, L. Weng, K. Hofmann, and P.-Y. Oudeyer. Automatic curriculum learn- ing for deep rl: A short survey. arXiv preprint arXiv:2003.04664, 2020. [11] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. P´erez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 2021. [12] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Comput. Surv., 50(2), apr 2017. doi:10.1145/3054912. URL https://doi. org/10.1145/3054912. [13] D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988. URL https://proceedings.neurips.cc/paper/1988/file/ 812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf. [14] M. Bojarski et al. End to end learning for self-driving cars. CoRR, 2016. URL http:// arxiv.org/abs/1604.07316. [15] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Infor- mation Processing Systems, pages 4565–4573, 2016. [16] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Now´e. Reinforcement learning from demonstration through shaping. In Twenty-fourth international joint conference on artificial intelligence, 2015. [17] H. B. Suay, T. Brys, M. E. Taylor, and S. Chernova. Learning from demonstration for shaping through inverse reinforcement learning. In AAMAS, pages 429–437, 2016. 9 [18] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34, 2021. [19] P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mor- datch, and J. Tompson. Implicit behavioral cloning. In Conference on Robot Learning, pages 158–168. PMLR, 2022. [20] D. Brown, W. Goo, P. Nagarajan, and S. Niekum. Extrapolating beyond suboptimal demon- strations via inverse reinforcement learning from observations. In International conference on machine learning, pages 783–792. PMLR, 2019. [21] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020. URL https://arxiv.org/abs/2005.01643. [22] A. Kumar, J. Hong, A. Singh, and S. Levine. When should we prefer offline reinforcement learning over behavioral cloning? arXiv preprint arXiv:2204.05618, 2022. [23] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In ICLR (Poster), 2016. [24] A. L. Samuel. Some studies in machine learning using the game of checkers. ii—recent progress. IBM Journal of research and development, 11(6):601–617, 1967. [25] G. Tesauro. Td-gammon, a self-teaching backgammon program, achieves master-level play. Neural computation, 6(2):215–219, 1994. [26] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. 
Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. [27] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019. [28] J. Uesato, A. Kumar, C. Szepesvari, T. Erez, A. Ruderman, K. Anderson, K. D. Dvijotham, N. Heess, and P. Kohli. Rigorous agent evaluation: An adversarial approach to uncover catas- trophic failures. In International Conference on Learning Representations, 2018. [29] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 761–769, 2016. [30] M. Fang, T. Zhou, Y. Du, L. Han, and Z. Zhang. Curriculum-guided hindsight experience replay. Advances in neural information processing systems, 32, 2019. [31] C. Colas, P. Fournier, M. Chetouani, O. Sigaud, and P.-Y. Oudeyer. Curious: intrinsically motivated modular multi-goal reinforcement learning. In International conference on machine learning, pages 1331–1340. PMLR, 2019. [32] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. [33] D. Michie, M. Bain, and J. Hayes-Miches. Cognitive models from subcognitive skills. IEE control engineering series, 44:71–99, 1990. [34] S. Ross et al. A reduction of imitation learning and structured prediction to no-regret online learning. In AI Stats, 2011. [35] J. Ho and S. Ermon. Generative adversarial imitation learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips. cc/paper/2016/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf. 10 [36] G. Swamy, S. Choudhury, J. A. Bagnell, and Z. S. Wu. Of moments and matching: A game- theoretic framework for closing the imitation gap, 2021. [37] N. Baram, O. Anschel, I. Caspi, and S. Mannor. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pages 390–399. PMLR, 2017. [38] M. Xu et al. Variance reduction properties of the reparameterization trick. In AI Stats, 2019. [39] M. Chidambaram, Y. Yang, D. Cer, S. Yuan, Y.-H. Sung, B. Strope, and R. Kurzweil. Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv preprint arXiv:1810.12836, 2018. [40] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. [41] E. Bronstein, M. Palatucci, D. Notz, B. White, A. Kuefler, Y. Lu, S. Paul, P. Nikdel, P. Mougin, H. Chen, J. Fu, A. Abrams, P. Shah, E. Racah, B. Frenkel, S. Whiteson, and D. Anguelov. Hier- archical model-based imitation learning for planning in autonomous driving. In 2022 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 8652–8659. IEEE, 2022. [42] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust opti- mization with f-divergences. Advances in neural information processing systems, 29, 2016. 
[43] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou. Multimodal motion prediction with stacked transformers, 2021.
[44] J. Mercat et al. Multi-head attention for multi-modal joint vehicle motion forecasting. In ICRA, 2020.
[45] J. Lee et al. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[46] A. Jaegle et al. Perceiver: General perception with iterative attention. In ICML, 2021.
[47] F. Torabi, G. Warnell, and P. Stone. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018.
[48] C. Zhang, R. Guo, W. Zeng, Y. Xiong, B. Dai, R. Hu, M. Ren, and R. Urtasun. Rethinking closed-loop training for autonomous driving. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, editors, Computer Vision – ECCV 2022, pages 264–282, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19842-7.

8 Appendix

8.1 Difficulty Bucket Statistics

The summary statistics of the difficulty scores in each bucket of the training and test datasets are presented in Tables 2 and 3, respectively.

8.2 Geometric Schedule

The Geometric-schedule-10% variant uses a schedule that starts training by weighting each bucket equally (i.e., the unbiased dataset), and ends by weighting each bucket proportional to the average difficulty score of the segments it contains. Specifically, at step t, the sampling weight for bucket k is q_k = (q_k^i − q_k^f) α^t + q_k^f, where q_k^i and q_k^f are the initial and final weights for bucket k, and α is the common ratio of the geometric progression. The sample weights for all the buckets are then normalized to sum to 1 to acquire sampling probabilities. For the Geometric-schedule-10% variant, we set α = 0.999975 and q_k^i = 1 for each bucket, and the final weights q_k^f are given by the “Mean” row in Table 2. This progression is visualized in Figure 4.

Figure 4: Sampling schedule of each bucket for the Geometric-schedule-10% variant.

Table 2: Summary statistics of the difficulty scores in each training data bucket.

| | 0-10% | 10-20% | 20-30% | 30-40% | 40-50% | 50-60% | 60-70% | 70-80% | 80-90% | 90-100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Min | 0.001 | 0.019 | 0.031 | 0.046 | 0.066 | 0.094 | 0.133 | 0.189 | 0.270 | 0.407 |
| Mean | 0.013 | 0.025 | 0.038 | 0.056 | 0.079 | 0.112 | 0.159 | 0.227 | 0.331 | 0.573 |
| Max | 0.019 | 0.031 | 0.046 | 0.066 | 0.094 | 0.133 | 0.189 | 0.270 | 0.407 | 0.939 |

Table 3: Summary statistics of the difficulty scores in each test data bucket.

| | 0-10% | 10-20% | 20-30% | 30-40% | 40-50% | 50-60% | 60-70% | 70-80% | 80-90% | 90-100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Min | 0.001 | 0.016 | 0.026 | 0.040 | 0.059 | 0.085 | 0.122 | 0.176 | 0.258 | 0.396 |
| Mean | 0.011 | 0.021 | 0.033 | 0.049 | 0.071 | 0.103 | 0.147 | 0.214 | 0.320 | 0.562 |
| Max | 0.016 | 0.026 | 0.040 | 0.059 | 0.085 | 0.122 | 0.176 | 0.258 | 0.396 | 0.939 |

8.3 Uniform Variant

The Uniform-10% variant sets the sampling weight of each bucket to be proportional to the range of difficulty scores of each bucket. The score ranges for the 10 training buckets are [0.0180, 0.0126, 0.0150, 0.0199, 0.0276, 0.0392, 0.0557, 0.0814, 0.1368, 0.5324].

8.4 Metrics by Bucket

In Figures 5 and 6 we present the performance of each training variant and baseline for each bucket. The clear upward trend in collision rate with the increasing bucket scores demonstrates that collisions are highly correlated with the difficulty scores. We observe a similar, but not quite as strong, correlation in the route failure rate and off-road rate as well.
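As a numeric check of the schedule in Section 8.2, the snippet below evaluates q_k(t) with the reported α = 0.999975, initial weights of 1, and the bucket means from Table 2, then normalizes to sampling probabilities. This is a small sketch, not the training code.

```python
import numpy as np

q_final = np.array([0.013, 0.025, 0.038, 0.056, 0.079,
                    0.112, 0.159, 0.227, 0.331, 0.573])  # "Mean" row of Table 2
q_init = np.ones_like(q_final)  # every bucket weighted equally at the start
alpha = 0.999975                # common ratio of the geometric progression

def sampling_probs(t):
    """Per-bucket sampling probabilities at training step t (lowest decile first)."""
    q = (q_init - q_final) * alpha ** t + q_final
    return q / q.sum()

for t in (0, 100_000, 200_000):
    print(t, np.round(sampling_probs(t), 3))
# At t = 0 every bucket is sampled with probability 0.1; by 200k steps the hardest
# bucket is sampled roughly a third of the time, matching the trend in Figure 4.
```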
8.5 Adaptive Importance Sampling Variants

We conducted a series of experiments that perform adaptive importance sampling, wherein we upsample certain buckets based on the agent’s performance during training, but then we correct for this upsampling via a likelihood ratio. Importance sampling measures the expectation of a statistic over a distribution P using a different distribution Q. In particular, E_P[f(X)] = E_Q[f(X)w(X)], where w(x) := p(x)/q(x) is the likelihood ratio for density functions p and q from distributions P and Q, respectively. Our nominal distribution P is the natural distribution of run segments. Due to the infrastructure challenges surrounding large training datasets mentioned in Section 4.2, we implemented reweighting on the level of buckets rather than on the level of individual run segments. This allows our method to easily scale to arbitrarily large datasets since it depends only on the number of buckets, not on the dataset size. In this setting, the nominal density is p_i := 1/N. We constructed the sampling distribution Q as follows: every K training steps, we collected the average policy loss per bucket (L̄_P)_i over the preceding window of K steps. Since our losses can be positive or negative, we set the sampling weights q_i ∝ exp(γ · (L̄_P)_i), where γ is the inverse temperature parameter. We also dedicated a small constant probability mass ϵ to all buckets that were not sampled during the last K iterations, which ensures a nonzero probability of sampling a run segment from any given bucket. During training, we multiplied the loss for a segment from bucket i by the ratio w_i := 1/(N · q_i). To evaluate the effect of different degrees of importance reweighting, we also considered w_i^β for different values of β ∈ [0, 1]. Algorithm 1 describes this procedure in the context of training the planning agent. We note that this approach is similar to Prioritized Experience Replay (PER) [23], but adapted to our setting with priority weights assigned over a discrete set of buckets.

We expanded on this approach using Distributionally Robust Optimization (DRO) [42], which introduces an additional loss weighting term with hyperparameter ρ ∈ [0, 1]. Larger values of ρ allow for a greater deviation of the loss weights from the importance sampling weights that would be needed to exactly account for the non-uniform sampling. We show the dataset sampling probabilities q_i for the PER variant in Figure 7 with two settings of the inverse temperature parameter γ: “PER (γ = 0.1, β = 1)-10%” and “PER (γ = 1, β = 1)-10%”. While γ = 0.1 results in a distribution that is close to uniform, γ = 1 quickly produces a heavily skewed distribution that samples from the most difficult bucket at least 75% of the time. In both cases, the sampling probability of each bucket is directly related to its difficulty scores: the higher a bucket’s difficulty scores, the more frequently it is sampled. This clearly demonstrates that a run segment’s difficulty score is a strong predictor of how challenging it will be for a planning agent to navigate successfully.

We present our results for PER and DRO in Table 4 with different values of γ, β, and ρ. For these experiments, we use 10% of the available training data, and we set K = 1000 and ϵ = 3.125 × 10⁻⁴. We observe that certain settings of PER and DRO achieve the lowest route failure and off-road rates.
They also result in the best collision rate, overall failure rate, and route progress ratio, though other non-adaptive variants achieve comparable results that are within the confidence bounds. This suggests that adaptive importance sampling is a promising curriculum learning approach that can provide comparable or better results to fixed sampling strategies without the need for hand-tuning custom sampling weights and schedules.

Figure 5: Metrics by test decile bucket for each of the training variants: (a) route failure rate (%), (b) collision rate (%), (c) off-road rate (%).

Figure 6: Metrics by test decile bucket for each of the training variants: (a) overall failure rate (%), (b) progress rate (%).

Algorithm 1 MGAIL curriculum training with bucket-wise adaptive importance sampling
Input: datasets {D_i}_{i=1}^N, sampling period K, step size η, inverse temperature parameter γ, IS weight exponent β, budget T, minibatch size B
1: Initialize sampling probabilities q_i = 1/N and loss buffers H_i = ∅ for i = 1, ..., N
2: for t = 1 to T do
3:   if t ≡ 0 mod K then
4:     Compute the per-dataset mean policy loss (L̄_P)_i from each buffer H_i
5:     Compute dataset sampling weights q_i ← exp(γ · (L̄_P)_i)
6:     Normalize dataset sampling probabilities q_i ← q_i / Σ_{j=1}^N q_j
7:     Reset H_i = ∅ for i = 1, ..., N
8:   end if
9:   Sample B dataset indices {b_i}_{i=1}^B ∼ Q, where Q has probability mass function q
10:  Sample a minibatch {x_i}_{i=1}^B, where x_i ∼ U[D_{b_i}] i.i.d.  ▷ Example x_i is from dataset D_{b_i}
11:  Compute per-example policy losses L_P(x_i) and discriminator losses L_D(x_i)
12:  Add stopgrad(L_P(x_i)) to H_{b_i}
13:  Compute IS weights w_i ← 1/(N · q_i)
14:  Compute the weighted policy loss l_P ← (1/B) Σ_{i=1}^B w_{b_i}^β L_P(x_i)
15:  Compute the weighted discriminator loss l_D ← (1/B) Σ_{i=1}^B w_{b_i}^β L_D(x_i)
16:  Update policy weights θ ← θ + η · ∇_θ l_P
17:  Update discriminator weights ω ← ω + η · ∇_ω l_D
18: end for

Figure 7: Dataset sampling probabilities throughout training for the PER adaptive importance sampling variant for two different values of the inverse temperature parameter γ: (a) γ = 0.1 and (b) γ = 1.

8.6 Other Variants

We implemented two additional variants, which differ from the other variants in their training dataset sizes and sampling strategies. Their performance is shown in Table 4.

1. The “Highest-1%” variant is an agent trained on only the top 1% of training examples ordered by difficulty score. As such, it is not strictly comparable to the other variants which use 10% of the data.
Our results show that while it achieves a lower collision rate than the baselines, its performance on all other metrics is worse. This demonstrates that extreme upsampling strategies on the most difficult examples, combined with using significantly less training data, can lead to worse overall performance. However, the fact that its collision rate is lower than that of the baselines suggests that upsampling difficult segments has a strong positive effect on metrics that are highly correlated with the difficulty score. It is possible that incorporating other metrics into our definition of “difficulty” for the difficulty model could improve this variant’s performance on those metrics as well.

2. We also trained the “Highest-10% + Lowest-10%” variant on the combination of the most difficult bucket and the least difficult bucket. This variant achieves among the best performance overall, matching that of the Geometric-schedule-10% variant. By incorporating the least difficult bucket, it addresses the shortcomings of Highest-10%, which has high failure rates on segments in the 0-40% range of difficulty scores. However, this variant uses 20% of the available data, twice as much as the other variants.

Table 4: Evaluation of agent variants and baselines on the full unbiased test set (mean ± standard error of each metric across 10 seeds, unless noted otherwise). For all metrics except route progress, lower is better.

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline-all | 1.38±0.13 | 1.46±0.09 | 0.73±0.07 | 81.21±0.39 | 3.33±0.20 |
| Baseline-10% | 1.34±0.06 | 1.50±0.09 | 0.67±0.06 | 81.12±0.37 | 3.28±0.13 |
| Baseline-lowest-10% | 1.14±0.05 | 4.15±0.11 | 0.98±0.10 | 81.88±0.41 | 5.91±0.13 |
| Highest-10% | 1.33±0.06 | 1.23±0.09 | 0.74±0.02 | 77.95±1.33 | 3.10±0.10 |
| Uniform-10% | 1.35±0.09 | 1.17±0.08 | 0.75±0.07 | 80.67±0.73 | 3.07±0.17 |
| Highest-1% | 1.53±0.08 | 1.39±0.06 | 0.99±0.11 | 79.35±1.18 | 3.66±0.13 |
| Highest-10% + Lowest-10% | 1.18±0.06 | 1.28±0.07 | 0.65±0.06 | 79.97±0.81 | 2.94±0.12 |
| Geometric-schedule-10% | 1.19±0.07 | 1.25±0.04 | 0.74±0.10 | 80.48±0.36 | 2.92±0.11 |
| PER (γ = 0.1, β = 0)-10% (6 seeds) | 1.37±0.06 | 1.32±0.02 | 0.55±0.07 | 81.91±0.58 | 2.99±0.05 |
| PER (γ = 0.1, β = 0.5)-10% (7 seeds) | 1.28±0.09 | 1.39±0.11 | 0.73±0.11 | 82.89±0.68 | 3.15±0.20 |
| PER (γ = 0.1, β = 1)-10% | 1.31±0.09 | 1.63±0.09 | 0.51±0.05 | 80.34±0.47 | 3.28±0.14 |
| PER (γ = 1, β = 0)-10% | 1.28±0.08 | 1.06±0.03 | 0.71±0.08 | 81.16±0.88 | 2.88±0.08 |
| PER (γ = 1, β = 0.5)-10% | 1.21±0.06 | 1.47±0.10 | 0.90±0.12 | 82.60±0.69 | 3.36±0.20 |
| PER (γ = 1, β = 1)-10% | 1.20±0.06 | 1.99±0.07 | 1.06±0.21 | 82.34±0.64 | 3.93±0.25 |
| DRO (γ = 0.1, β = 0, ρ = 0.25)-10% (4 seeds) | 1.23±0.10 | 1.34±0.13 | 0.85±0.12 | 80.01±0.52 | 3.27±0.18 |
| DRO (γ = 0.1, β = 1, ρ = 0.05)-10% (8 seeds) | 1.19±0.09 | 1.16±0.04 | 0.69±0.07 | 80.86±0.65 | 2.83±0.10 |
| DRO (γ = 0.1, β = 1, ρ = 0.25)-10% (9 seeds) | 1.24±0.03 | 1.49±0.08 | 0.70±0.03 | 81.23±0.48 | 3.25±0.08 |
| DRO (γ = 0.1, β = 1, ρ = 1)-10% (4 seeds) | 1.02±0.04 | 1.65±0.09 | 0.87±0.21 | 78.28±0.81 | 3.43±0.18 |
| DRO (γ = 1, β = 0, ρ = 0.25)-10% (4 seeds) | 1.27±0.15 | 1.31±0.07 | 0.58±0.06 | 79.43±1.02 | 3.04±0.24 |
| DRO (γ = 1, β = 0.5, ρ = 0.25)-10% (7 seeds) | 1.33±0.05 | 1.18±0.08 | 0.73±0.06 | 80.72±1.22 | 3.02±0.06 |
| DRO (γ = 1, β = 1, ρ = 0.05)-10% (7 seeds) | 1.20±0.02 | 1.58±0.05 | 0.72±0.12 | 81.72±0.46 | 3.31±0.14 |
| DRO (γ = 1, β = 1, ρ = 0.25)-10% (6 seeds) | 1.18±0.09 | 1.65±0.10 | 1.12±0.29 | 82.22±0.91 | 3.70±0.37 |
| DRO (γ = 1, β = 1, ρ = 1)-10% (5 seeds) | 1.28±0.14 | 2.33±0.19 | 1.95±0.20 | 77.97±1.84 | 5.14±0.20 |

8.7 Planning Agent Details

We use the same hierarchical planning agent described in Bronstein et al.
[41]; additional details can be found in that work. This planning agent consists of a high-level route-generation policy and a low-level action policy trained using MGAIL. The high-level policy uses an A* search to produce multiple lane-specific routes through a pre-mapped roadgraph and selects the lowest-cost route. We can either evaluate the low-level policy in a standalone fashion by conditioning it on a given route, or the high-level and low-level policies together by allowing the agent to choose its own goal routes given a destination.

The low-level action policy and discriminator use stacked transformer-based observation models [43, 44] to encode the goal route, the AV’s state, other vehicles’ states, roadgraph points, and traffic light signals. Similar to Set-Transformer [45] and Perceiver [46], this observation encoder uses learned latent queries and a stack of cross-attention blocks, one for each group of features. A delta actions model is used for the AV’s dynamics, where the action a is the offset from the current state s: s′ = s + a. The policy head predicts the parameters (weights, means, and covariances) of a Gaussian Mixture Model (GMM) with 8 Gaussians, used to parameterize the delta actions.

We trained the action policy and discriminator using a combination of MGAIL and behavior cloning (BC). The total policy loss is given by λ_P L_P + λ_BC L_BC, where L_P = −E_{s∼π_θ}[log D_ω(s)] is the MGAIL policy loss, L_BC = −E_{s,a∼π_E}[log π_θ(a|s)] is the BC loss, and λ_P and λ_BC are hyperparameters. The MGAIL discriminator loss is L_D = E_{s∼π_θ}[log D_ω(s)] + E_{s∼π_E}[log(1 − D_ω(s))]. The discriminator is only conditioned on the state s, as in [47]. During backpropagation, only the policy parameters θ are updated for L_P and L_BC, and only the discriminator parameters ω are updated for L_D.

8.8 Evaluation With Interactive Agents

A potential concern is that the planning agent is trained and evaluated with other agents replaying their logged trajectories. This may result in unrealistic behavior if the planning agent behaves in a significantly different way than the logged AV and other agents don’t react realistically. It is possible that a planning agent trained in this way would not perform well when deployed in the real world, in which other road users influence and interact with the AV. To determine whether this is an issue, we evaluated our planning agent alongside interactive agents controlling a subset of other vehicles in the scene. The interactive agent policy, which was trained separately, has the same model architecture, dynamics model, and training loss function as our planning agent. The main difference is that the interactive agent is not goal-conditioned because its task is to drive in a realistic manner and not necessarily reach a specific destination.

For each bucket in the test dataset, we constructed a subset in which each segment has at least 8 other vehicles that could be controlled by an interactive agent. Note that this is a more challenging dataset because segments with more road users tend to be more difficult. Starting from an equal number of segments per bucket and discarding segments with an insufficient number of other vehicles, the number of segments remaining in each bucket in order of increasing difficulty accounted for 0.81%, 1.59%, 2.47%, 3.66%, 5.76%, 8.46%, 11.67%, 16.97%, 22.97%, and 25.64% of the total. We evaluated the Baseline-10% and Uniform-10% planning agent variants on this dataset by using the same initial conditions and goal route as the original test dataset.
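Returning to the training losses of Section 8.7, the sketch below shows how the policy and discriminator losses combine; the tensor names and shapes are assumptions, and the transformer observation encoder and GMM policy head are not reproduced here.

```python
import torch

def mgail_bc_losses(expert_logprob, d_policy_states, d_expert_states,
                    lambda_p=1.0, lambda_bc=1.0, eps=1e-8):
    """Form the combined policy loss and the discriminator loss from Section 8.7.
    expert_logprob:   log pi_theta(a|s) on expert (s, a) pairs, shape [B]
    d_policy_states:  D_omega(s) in (0, 1) on states visited by the policy, shape [B]
    d_expert_states:  D_omega(s) in (0, 1) on expert states, shape [B]"""
    loss_p = -torch.log(d_policy_states + eps).mean()   # L_P = -E_{pi_theta}[log D(s)]
    loss_bc = -expert_logprob.mean()                    # L_BC = -E_{pi_E}[log pi_theta(a|s)]
    policy_loss = lambda_p * loss_p + lambda_bc * loss_bc
    # L_D = E_{pi_theta}[log D(s)] + E_{pi_E}[log(1 - D(s))]
    disc_loss = torch.log(d_policy_states + eps).mean() + \
                torch.log(1.0 - d_expert_states + eps).mean()
    # Separate optimizers ensure only theta is updated with policy_loss and only
    # omega with disc_loss, as described in the text.
    return policy_loss, disc_loss
```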
Table 5 demonstrates that the route failure rate decreases for all variants when using interactive agents, and the collision, off-road, and overall failure rates either decrease or remain the same. Additional investigation is needed to determine why the route progress ratio increases for the Baseline-10% variant but decreases for the Uniform-10% variant. These results indicate that our training procedure for the planning agent allows it to perform better in a more realistic simulated environment, not worse. In fact, we expect the agent to have even better performance when evaluated with interactive agents on the original data distribution (i.e., without the requirement that at least 8 vehicles are available for replacement with interactive agents), which would be inherently easier than the subset with interactive agents. While training the planning agent with interactive agents may result in performance gains, this approach is orthogonal to the curriculum learning framework we present and can be easily combined with it. In fact, concurrent work by Zhang et al. [48] investigates this idea, finding that targeted training on more challenging closed-loop scenarios results in more robust agents while requiring less data.

Table 5: Evaluation of agent variants and baselines without interactive agents on the original test set vs. with interactive agents on a subset where at least 8 vehicles are available for replacement. For all metrics except route progress, lower is better.

Without interactive agents (original test set):

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline-10% | 1.34±0.06 | 1.50±0.09 | 0.67±0.06 | 81.12±0.37 | 3.28±0.13 |
| Uniform-10% | 1.35±0.09 | 1.17±0.08 | 0.75±0.07 | 80.67±0.73 | 3.07±0.17 |

With interactive agents (subset with at least 8 replaceable vehicles):

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline-10% | 1.03±0.05 | 1.47±0.10 | 0.67±0.03 | 84.58±0.22 | 3.05±0.13 |
| Uniform-10% | 0.78±0.07 | 1.16±0.06 | 0.29±0.04 | 75.5±0.61 | 2.13±0.11 |