Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula

Eli Bronstein∗  Sirish Srinivasan∗  Supratik Paul∗  Aman Sinha  Matthew O'Kelly  Payam Nikdel  Shimon Whiteson
Waymo, LLC
{ebronstein, sirishs, supratikpaul, thisisaman, mokelly, payamn, shimonw}@waymo.com

Abstract: ML-based motion planning is a promising approach to produce agents that exhibit complex behaviors, and automatically adapt to novel environments. In the context of autonomous driving, it is common to treat all available training data equally. However, this approach produces agents that do not perform robustly in safety-critical settings, an issue that cannot be addressed by simply adding more data to the training set—we show that an agent trained using only a 10% subset of the data performs just as well as an agent trained on the entire dataset. We present a method to predict the inherent difficulty of a driving situation given data collected from a fleet of autonomous vehicles deployed on public roads. We then demonstrate that this difficulty score can be used in a zero-shot transfer to generate curricula for an imitation-learning based planning agent. Compared to training on the entire unbiased training dataset, we show that prioritizing difficult driving scenarios both reduces collisions by 15% and increases route adherence by 14% in closed-loop evaluation, all while using only 10% of the training data.

Keywords: Imitation Learning, Curriculum Learning, Autonomous Driving

1 Introduction

Autonomous vehicles (AVs) typically rely on optimization-based motion planning and control methods. These techniques involve bespoke components specific to the deployment region and AV hardware, and require copious hand-tuning to adapt to new environments. An alternative approach is to apply machine learning (ML) to the lifetimes of experience that AV fleets can collect within days or weeks. A paradigm shift to ML-based planning could automate the adaptation of behaviors to new areas, improve planning latency, and increase the impact of hardware acceleration.

For example, imitation learning (IL) can utilize the large tranches of expert demonstrations collected by the regular operations of AV fleets to produce policies that perform well in common scenarios, without the need to specify a reward function. However, both the distribution from which experiences are sampled and the policy used to generate the demonstrations can critically affect the IL policy's performance [1]. The training data distribution is especially important when learning methods are applied to problems characterized by long tail examples (c.f. [2, 3, 4, 5, 6]). In the case of autonomous driving, the vast majority of observed scenarios are simple enough to be navigated without any negative safety outcomes. A visual inspection of a random subset of our data suggests that half of it consists of scenarios with the AV as the only road user in motion, while another quarter contains other moving road users, but not necessarily close enough to the AV to affect its behavior. As a result, IL policies may not be robust in safety-critical and long tail situations.

Reinforcement learning (RL) can be used to explicitly penalize poor behavior, but due to the complex nature of driving [7], it is difficult to design a reward function for AVs that aligns with human expectations. Even if reward signals were provided for safety-critical events or traffic law violations (e.g., collisions, running a red light), they would be extremely sparse since such events are quite rare. Furthermore, exploration to collect more long tail data is challenging due to safety concerns [8].

Despite these issues, most learning-based robotics and AV applications use the naive strategy of creating training datasets from all available demonstrations.
In the context of AVs, since most driving situations are simple, this strategy is both inefficient and unlikely to generate a policy that is robust to difficult scenarios. A common solution is to upsample challenging examples, either by increasing their sampling probability by a predetermined factor [9] or with curriculum learning [10], i.e., dynamically updating the sampling probability during training based on the agent's performance.

∗Denotes equal contribution.
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.

Figure 1: (A) The fleet collects experiences with a variety of policies in multiple operational design domains. (B) The fleet data is sharded into run segments. (C) Fleet data is used to learn an embedding that maps a run segment to a vector space based on similarity. (D) Run segments are selected for counterfactual simulations and human triage; the outcome of this process is a labeled set of difficulty scores. (E) An MLP is trained to regress from embeddings to the difficulty labels. [Overview diagram: vehicle platforms and deployment regions feed fleet data, which is sharded into run segments, embedded by a run-segment similarity model, replayed in counterfactual simulations with development planners, triaged, and used to train the difficulty model (a multi-layer perceptron).]

However, both of these approaches include significant hurdles. Upsampling requires that we know which examples are part of the long tail a priori, as in standard classification problems where the class-label imbalance can inform a sampling strategy. In IL and RL, no such labels are available. As such, curriculum learning is more suitable since it uses the agent's current performance to identify hard examples. However, standard approaches to curriculum learning are specific to the agent being trained; they do not, for example, utilize data collected by deployed AVs running other planners, which can provide more general, policy-agnostic insights into the long tail of driving.

In this paper, we propose an approach (summarized in Figure 1) that addresses the challenges of upsampling and curriculum learning applied to an AV setting. Developing a road-ready AV generally involves both collecting real-world data with an expert, which can be a combination of human drivers and thoroughly evaluated AV planners; and evaluating new development planners, which are regularly simulated on the data collected by the expert to identify potential failure modes, generating a large counterfactual dataset. Our method uses this readily available data to train a difficulty model that scores the inherent difficulty of a given scenario by predicting the probability of collisions and near-misses in simulation. This difficulty model provides several key benefits: 1) it is computationally less expensive to predict a driving situation's difficulty than to simulate it for a policy being trained; 2) the model learns the inherent, policy-agnostic difficulty of a scenario because it is trained on multiple development planners in different geographic regions; and 3) the model predicts a continuous score that can be used to identify scenarios within an arbitrary difficulty range, rather than obtaining a few counterfactual failures.

We show that a zero-shot transfer of this model can identify long-tail examples that are difficult for a new IL-based planning agent—without any fine-tuning. This allows us to upsample difficult training examples without expensive evaluation of the agent during training. Though we train the planning agent using IL as a case study, our approach can be applied to any ML-based planning approach.
The main contributions of this paper are:
1. We train a model to predict which driving scenarios are difficult for development planners and show that it can zero-shot transfer to the task of finding challenging scenarios on which to train an ML-based planning agent. This generalization suggests that the model can predict the inherent difficulty of a driving situation.
2. We show that training an ML-based planning agent on unbiased driving data leads to poor performance on difficult examples since easy driving scenarios dominate rarer, harder cases.
3. We show that using our difficulty model to upsample more challenging scenarios reduces collisions by 15% and increases route adherence by 14% on the unbiased test set. This suggests that there are significant diminishing returns in adding common scenarios to the training dataset.

2 Related Work

The application of RL to the task of autonomous driving has received significant attention in recent years [11]; proposed methods span the gamut of methodologies and the AV stack itself. RL has been used to address a variety of problems including end-to-end motion planning, behavior generation, reward design, and even behavior prediction. In this work, we focus on imitation learning techniques [12], which avoid direct specification of a reward function, and rely instead on expert demonstrations. As a result, they can capture subtle human preferences and demonstrate impressive performance on a variety of robotics tasks. However, despite many attempts [13, 14, 15], IL and RL techniques still struggle with the long tail present in the driving task [4].

Like this work, Brys et al. [16] and Suay et al. [17] consider how to leverage potentially suboptimal demonstrations to improve the efficiency and robustness of learning. Unlike these works, we use offline methods to learn a model of each scenario's difficulty and bias the distribution that IL is performed on. This approach is similar to baselines [18, 19] inspired by Peters and Schaal [9]; however, unlike these works, our setting does not provide a reward signal for the proposed demonstrations. Instead, we use offline, off-policy simulations to learn a foundation model with which we can efficiently approximate a scenario's difficulty, which would have otherwise required expensive counterfactual simulations during training. Our approach sidesteps the inefficiencies of performing rollouts of the learnt policy on the entire dataset since inference using the difficulty model is computationally much cheaper than simulation. Similar techniques have also been proposed by Brown et al. [20]; however, they focus largely on situations with severely suboptimal demonstrations where the reward is specified. Similar problems have also been identified in offline RL [21]. Interestingly, Kumar et al. [22] identify the tight relationship between imitation learning and offline RL, noting the theoretical advantage of incorporating reward information in settings like autonomous driving which must avoid rare catastrophic failures. Our experiments provide empirical support for this insight.

Curriculum learning (CL) [10] is also closely related to this work. While not originally classified as such, methods like automatic domain randomization, prioritized experience replay [23], and AlphaGo's self-play [24, 25] have led to superhuman game-playing agents and breakthroughs in sim2real transfer [23, 26, 27]. CL methods solve for surrogate objectives rather than directly optimizing the final performance of the learner. They control which transitions are sampled, the behavior of other agents in an environment, the generation of initial states, or even the reward function. CL methods are also characterized by whether they are used on- or off-policy. For example, Uesato et al. [28] exploit low quality policies to obtain failures in an on-policy RL setting.
Finding hard examples in the training data using this approach requires repeatedly generating rollouts for each expert trajectory in the dataset. Such an approach is computationally infeasible when operating at scale since the training datasets can have hundreds of thousands of real-world driving miles. Similar approaches known as hard-negative mining have been used in supervised learning settings [29]; like Uesato et al. [28], they evaluate the difficulty of examples online.

Instead, we consider variants of CL that exploit off-policy data. As in the on-policy case, the key problem is to determine which data is interesting. Off-policy compatible methods are also generally surrogate-based. For example, they can select for diversity [30], moderate difficulty [27], surprise [23], or learning progress [31]. Our approach is most similar to Akkaya et al. [27], but instead of performing expensive agent evaluation during training, we use off-policy data both to train a foundation model [32], which encodes experiences, and to classify the difficulty of an interaction. We also utilize large-scale real-world data and demonstrate that simpler curricula are effective.

3 Background

Model-based Generative Adversarial Imitation Learning: Behavior cloning (BC) [33, 13] is a naive imitation learning method that applies supervised learning to match the expert's conditional action distribution: $\arg\max_\theta \, \mathbb{E}_{s,a\sim\pi_E}[\log \pi_\theta(a|s)]$. BC policies may suffer from covariate shift, resulting in quadratic worst-case error with respect to the time horizon [34]. To address this issue, generative adversarial imitation learning (GAIL) [35] formulates IL as an adversarial game between the policy $\pi_\theta$ and the discriminator $D_\omega$. The discriminator is trained to classify whether a given trajectory was sampled from $\pi_\theta$ (labeled 0) or from the expert demonstration (labeled 1), and the policy is trained to generate trajectories that are indistinguishable from demonstrations:
$$\arg\max_\theta \arg\min_\omega \; \mathbb{E}_{s,a\sim\pi_\theta}[\log D_\omega(s,a)] + \mathbb{E}_{s,a\sim\pi_E}[\log(1 - D_\omega(s,a))].$$
GAIL minimizes the gap in the joint distributions of states and actions $p(s,a)$ between the policy and the expert, resulting in linear error with respect to the time horizon [36]. However, GAIL relies on high variance policy gradient estimates because it uses an unknown dynamics model, making its objective function non-differentiable. In contrast, model-based GAIL (MGAIL) [37] uses differentiable dynamics in combination with the reparameterization trick [38] to reduce the variance of the policy gradient estimates.
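To make the adversarial objective above concrete, here is a minimal PyTorch sketch of one GAIL-style update, following the labeling convention above (expert labeled 1, policy labeled 0). The network sizes, tensor shapes, and the state-action input are illustrative assumptions rather than this paper's implementation; in particular, MGAIL replaces the high-variance policy-gradient step with backpropagation through a differentiable dynamics model, which is omitted here.

```python
# Minimal GAIL-style update sketch (illustrative; not the paper's implementation).
# Expert (s, a) pairs are labeled 1 and policy samples 0, matching the objective above.
import torch
import torch.nn as nn

obs_dim, act_dim = 64, 2  # assumed dimensions

discriminator = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, 1)
)  # outputs a logit for D_omega(s, a)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()


def discriminator_step(expert_sa: torch.Tensor, policy_sa: torch.Tensor) -> float:
    """One discriminator update: push D toward 1 on expert data and 0 on policy rollouts."""
    logits_expert = discriminator(expert_sa)
    logits_policy = discriminator(policy_sa.detach())
    loss = bce(logits_expert, torch.ones_like(logits_expert)) + \
           bce(logits_policy, torch.zeros_like(logits_policy))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return loss.item()


def policy_surrogate_loss(policy_sa: torch.Tensor) -> torch.Tensor:
    """Adversarial policy loss: the policy is rewarded when D mistakes its samples for expert data.
    GAIL optimizes this with a policy-gradient estimator; MGAIL instead differentiates through a
    learned dynamics model to reduce variance."""
    return -torch.log(torch.sigmoid(discriminator(policy_sa)) + 1e-8).mean()
```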
4 Method

A key challenge in commercial AV development is to design an AV planner that can safely and efficiently navigate real-world settings while aligning with human expectations. At any given time, there may exist multiple development planners under evaluation. Iteratively improving an AV planner typically involves the following three steps. 1) Data Collection: Real-world data is collected by a fleet of vehicles in the operational area. 2) Data-Driven Simulation: A development planner is tested in simulation by having it control the data-collecting ego vehicle in a run segment, or a short snippet of recorded driving data. 3) Evaluation and Improvement: The development planner is evaluated on key metrics based on these simulations, with potential issues identified and addressed.

As mentioned in Section 1, we consider an ML-based approach to developing an AV planner from the ground up. One option is to use imitation learning to train a planning agent. Given an initial dataset of logged expert driving, a naive approach is to train the agent on the entire dataset. However, this means that challenging long tail segments are used only a few times during training, yielding a planning agent that has difficulty negotiating similar situations [2, 5, 6]. Thus, to improve our agent, we require a method to upsample these rare segments.

4.1 Difficulty Model

The key idea behind our method is to use the real-world run segments replayed in simulation with development planners to learn a difficulty model that predicts the difficulty of a logged segment, i.e., whether a development planner is likely to have a poor safety outcome in simulation. We train the difficulty model on simulations of multiple development planners in different geographic areas, so it can be seen as marginalizing over a diverse distribution of development planners. This makes the model more likely to be able to identify the inherent difficulty of a segment. In turn, this facilitates the zero-shot transfer from training on data from development planners to inferring difficulty for a substantially different planning agent. Intuitively, segments that development planners find difficult are likely to be difficult for the planning agent as well. Specifically, we use the difficulty model's scores to inform our upsampling strategy for training the planning agent.

The evaluation process for development planners typically involves large-scale simulations, with potentially problematic behaviors flagged for engineers to address. We train the difficulty model to predict collisions and near-misses attributable to the development planner, as opposed to other road users. This data is generated in the normal course of the AV planner development cycle, so no new training data is needed.

Since we want to marginalize out the idiosyncrasies of individual development planners, we model a simulation's safety outcome $y \in \{0, 1\}$ (1 if a collision or near-miss occurred, 0 otherwise) as a function of the logged run segment alone. The input to our model is a learned segment embedding from a separately trained model. Given a logged run segment, we collect static features (e.g., road/lane layouts, crosswalks, stop signs), dynamic features (e.g., positions and orientations of other road users over time), and kinematic information about the data-collecting ego vehicle. We use these features to generate two top-down images of the segment: one of the ego vehicle's trajectory, and another of the static features and other road users' trajectories. We encode each image into a dense d-dimensional embedding vector (as in [39]) using a CNN and contrastively train a classifier (e.g., [40]) with cross-entropy loss to determine if two images are from the same run segment (see Figure 1c). Our difficulty model is an MLP that learns a function $f: \mathbb{R}^d \to [0, 1]$ mapping the embedding to the simulated safety outcome $y$. We trained this model using cross-entropy loss on a dataset of 5.6k positive and 80k negative examples. The number of negative examples was downsampled by multiple orders of magnitude since the prevalence of simulated collisions and near-misses is extremely low. The model produces uncalibrated scores by design, as trying to calibrate it to the extremely small unbiased prevalence rate of positive examples is numerically unstable.
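For concreteness, the following is a minimal sketch of the difficulty model described above: an MLP head $f: \mathbb{R}^d \to [0, 1]$ trained with cross-entropy on the binary simulated safety outcome, taking a precomputed run-segment embedding as input. The embedding dimensionality, hidden sizes, and training loop are assumptions for illustration; the contrastively trained CNN embedding model is treated here as a given feature extractor.

```python
# Illustrative sketch of the difficulty head in Section 4.1 (assumed sizes and training loop).
import torch
import torch.nn as nn

EMBED_DIM = 256  # assumed embedding dimensionality d


class DifficultyModel(nn.Module):
    """MLP f: R^d -> [0, 1] mapping a run-segment embedding to P(collision or near-miss)."""

    def __init__(self, d: int = EMBED_DIM, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit; sigmoid applied at inference time
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.net(embedding).squeeze(-1)


model = DifficultyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # cross-entropy on the binary safety outcome y


def train_step(embeddings: torch.Tensor, labels: torch.Tensor) -> float:
    """embeddings: [B, d] segment embeddings; labels: [B], 1 = collision/near-miss (negatives
    are heavily downsampled before reaching this point, as described above)."""
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()


def difficulty_score(embedding: torch.Tensor) -> torch.Tensor:
    """Uncalibrated score in [0, 1]; only its ordering is used for bucketing (Section 4.2)."""
    with torch.no_grad():
        return torch.sigmoid(model(embedding))
```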
4.2 Sampling Strategies

Given the long-tail nature of the difficulty scores (see Figure 2), it is natural to upsample difficult segments during training. A standard solution for upsampling in classification problems is to create separate datasets for each class, and then generate a batch by sampling a specified proportion from each dataset. Since this requires discretized classes, it cannot be applied to our real-valued difficulty scores. Moreover, due to the large training dataset size it is not scalable to upsample individual segments: the entire dataset cannot fit in memory and random access to individual examples from disk is incompatible with distributed file sharding of data. Instead, we partition the dataset into ten equally sized buckets, each corresponding to a decile of the data by difficulty scores, with up/downsampling achieved by assigning different sampling probabilities to each bucket. This enables us to efficiently generate batches on the fly (e.g., sampling a weighted batch of k run segments requires minimal overhead over the k constant-time accesses to the head pointers of each bucket). This decile-based bucketing also ensures that our method is agnostic to the model scores, which are uncalibrated.

We consider two training variants: 1) a fixed weighting scheme for each bucket, held constant throughout training, and 2) a schedule of weights for each bucket that changes as training progresses. Specifically, we use the following three sampling strategies. "Highest-10%" trains the agent only on the highest scoring bucket (i.e., on the segments with the highest 10% difficulty scores). "Uniform-10%" upsamples difficult segments by setting each bucket's sampling weight to the range of difficulty scores in that bucket (in the limit of infinite buckets, this approaches a uniform distribution over the difficulty scores). "Geometric-schedule-10%" implements a geometric progression of weights with each bucket weighted equally at the beginning of training and weighted proportional to its average difficulty score at the end of training (see Appendix 8.2 for further details). In Section 5.4 we compare the performance of these training variants against several baselines. Our variants are trained on only a 10% sample of available data.
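The sketch below illustrates the three sampling strategies just described at the level of per-bucket probabilities, including the geometric schedule whose exact form is given in Appendix 8.2 (the example final weights are the bucket mean scores from Table 2). The bucket iterators and batch-assembly code are illustrative assumptions; the production pipeline only needs the per-bucket probabilities, since each bucket is read sequentially from its own shards.

```python
# Illustrative bucket-weighting sketch for the sampling strategies in Section 4.2 (assumed values).
import numpy as np

NUM_BUCKETS = 10  # decile buckets by difficulty score


def highest_10_weights() -> np.ndarray:
    """Train only on the most difficult decile."""
    w = np.zeros(NUM_BUCKETS)
    w[-1] = 1.0
    return w


def uniform_10_weights(score_ranges: np.ndarray) -> np.ndarray:
    """Weight each bucket by the range of difficulty scores it spans (approaches a uniform
    distribution over scores as the number of buckets grows)."""
    return score_ranges / score_ranges.sum()


def geometric_schedule_weights(step: int, q_init: np.ndarray, q_final: np.ndarray,
                               alpha: float = 0.999975) -> np.ndarray:
    """q_k(t) = (q_k^i - q_k^f) * alpha^t + q_k^f, normalized to a distribution (Appendix 8.2)."""
    q = (q_init - q_final) * alpha ** step + q_final
    return q / q.sum()


def sample_batch(bucket_probs: np.ndarray, bucket_iterators: list, batch_size: int,
                 rng: np.random.Generator) -> list:
    """Draw a weighted batch by picking a bucket per example and reading its next segment."""
    bucket_ids = rng.choice(NUM_BUCKETS, size=batch_size, p=bucket_probs)
    return [next(bucket_iterators[b]) for b in bucket_ids]


# Example: the schedule starts uniform and ends proportional to each bucket's mean difficulty.
q_init = np.ones(NUM_BUCKETS)
q_final = np.array([0.013, 0.025, 0.038, 0.056, 0.079, 0.112, 0.159, 0.227, 0.331, 0.573])
print(geometric_schedule_weights(0, q_init, q_final))        # uniform at the start of training
print(geometric_schedule_weights(200_000, q_init, q_final))  # skewed toward difficult buckets
```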
5 Experiments

To prevent information leakage between the difficulty model and the planning agent, the former is trained on a dataset collected more than six months prior to the dataset for the latter. The training dataset for the planning agent consists of over 14k hours of driving logged by a fleet of vehicles. We split the data into 10 second run segments, resulting in over 5 million training segments. We also create two test sets, chronologically separate from the training set to prevent train-test leakage. The first unbiased test set is composed of 20k segments sampled uniformly from logged data. The second set consists of 10k segments with difficulty scores in the top one percentile of the training data's score distribution. The distributions of the difficulty model scores for the train set and unbiased test set both have long tails (see Figure 2): scores above 0.85 account for only around 0.5% of the dataset.

As described in Section 4.2, we split the training dataset into 10 equal sized buckets based on the difficulty score deciles. 200 run segments are further split from each training bucket to obtain validation buckets for model selection. To highlight the effect of our training schemes on performance on segments of varying difficulty, we use the same bucketing approach for the unbiased test set as for the training set. We report the full, unbiased test set results by aggregating over all buckets.

5.1 Baselines

We report three baselines for comparison, which differ in their training data: "Baseline-all" is trained on the full dataset, "Baseline-10%" is trained on a uniformly randomly sampled 10% of the full dataset, and "Baseline-lowest-10%" is trained only on the bucket with the lowest difficulty scores.

Figure 2: Distribution of the difficulty model scores for the train and test datasets. The ten alternating shaded backgrounds indicate the thresholds of the decile buckets. (a) Train Dataset; (b) Unbiased Test Dataset. [Axes: Model Score vs. Density.]

5.2 Training Details

We use the planning agent described in Bronstein et al. [41], which employs a stochastic continuous action policy conditioned on a goal route and is trained using a combination of MGAIL and BC. See Appendix 8.7 for additional details. We train 10 random seeds of each agent variant and baseline for 200k steps. After the initial 100k training steps, we evaluate each agent on the validation set at intervals of 10k steps. We select the agent checkpoint with the lowest sum of collision and off-road driving rates, and evaluate it on the held-out test set. Since the learnt policy is stochastic, we report the average performance of 16 independent rollouts for each test run segment.

5.3 Metrics

We assess the planning agent's performance using the following binary metrics (1 if the event of interest occurred in the segment, 0 otherwise):
1. Route Failure: the agent deviates from the goal "road route" at the start of the segment, which includes all lanes in the road containing the goal lane-specific route.
2. Collision: the agent's bounding box intersects with another road user's bounding box.
3. Off-road: the agent's bounding box exits the drivable road area.
4. Route Progress ratio: ratio of the distance traveled along the route by the agent and the expert.
We also report the overall failure rate as the union of the first three metrics; a segment is considered a failure if any of the binary metrics is nonzero. When comparing the performance of different agents, we prioritize this failure rate due to the safety-critical nature of driving, while also considering the route progress ratio to ensure the agents are making efficient forward progress.
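As an illustration of how these metrics are aggregated, the sketch below marks a segment as a failure if any of the three binary events occurred and reports each metric as a mean ± standard error across seeds, matching the format of the tables that follow. The data structures are assumptions; the actual evaluation pipeline is not described at the code level in this paper.

```python
# Illustrative metric aggregation for Section 5.3 (assumed data structures).
import numpy as np


def segment_metrics(route_failure: bool, collision: bool, offroad: bool,
                    progress_ratio: float) -> dict:
    """Per-rollout metrics; 'failure' is the union of the three binary events."""
    return {
        "route_failure": float(route_failure),
        "collision": float(collision),
        "offroad": float(offroad),
        "progress_ratio": progress_ratio,
        "failure": float(route_failure or collision or offroad),
    }


def aggregate_over_seeds(per_seed_rates: np.ndarray) -> tuple:
    """per_seed_rates: [num_seeds] values, each already averaged over test segments and the
    16 rollouts per segment. Returns (mean, standard error) across seeds."""
    mean = per_seed_rates.mean()
    stderr = per_seed_rates.std(ddof=1) / np.sqrt(len(per_seed_rates))
    return mean, stderr


# Example: a hypothetical collision rate (%) for 10 seeds, reported as mean ± standard error.
rates = np.array([1.4, 1.5, 1.3, 1.6, 1.5, 1.4, 1.6, 1.5, 1.4, 1.5])
print(aggregate_over_seeds(rates))
```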
5.4 Results

We present the performance of our training variants and baselines on the full, unbiased test set in Table 1. Each variant's action policy is conditioned on the expert's initial goal route, which is held constant throughout the segment.

Table 1: Evaluation of agents and baselines on the full unbiased test set (mean ± standard error of each metric across 10 seeds). For all metrics except route progress, lower is better.

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
|---|---|---|---|---|---|
| Baseline-all | 1.38±0.13 | 1.46±0.09 | 0.73±0.07 | 81.21±0.39 | 3.33±0.20 |
| Baseline-10% | 1.34±0.06 | 1.50±0.09 | 0.67±0.06 | 81.12±0.37 | 3.28±0.13 |
| Baseline-lowest-10% | 1.14±0.05 | 4.15±0.11 | 0.98±0.10 | 81.88±0.41 | 5.91±0.13 |
| Highest-10% | 1.33±0.06 | 1.23±0.09 | 0.74±0.02 | 77.95±1.33 | 3.10±0.10 |
| Uniform-10% | 1.35±0.09 | 1.17±0.08 | 0.75±0.07 | 80.67±0.73 | 3.07±0.17 |
| Geometric-schedule-10% | 1.19±0.07 | 1.25±0.04 | 0.74±0.10 | 80.48±0.36 | 2.92±0.11 |

We observe no significant difference between the performance of Baseline-10% and Baseline-all, demonstrating that simply increasing the training dataset size does not necessarily lead to better performance. Also, Baseline-lowest-10% has the worst performance for the collision and off-road metrics. This suggests that the easiest segments are not representative of the entire test set distribution and do not contain enough useful information to learn from. However, Baseline-lowest-10% achieves the lowest route failure rate. We believe this is because the least difficult training bucket is primarily composed of segments in which it is simple to follow the route, such as one-lane roads with no other road users and minimal interaction. This could cause the Baseline-lowest-10% agent to overfit to the route features and follow the route well at the expense of safety.

All three of our upsampling variants achieve significantly lower collision rates, and comparable off-road and route failure rates to the baselines (with the exception of Baseline-lowest-10%'s route failure rate). This key result demonstrates that segments with high predicted difficulty contain the majority of useful information needed for good aggregate performance. Geometric-schedule-10% has the largest improvement over the baselines, with a significantly lower collision rate, comparable route failure and off-road rates, and a minimal decrease in the route progress ratio. This highlights the advantage of observing the whole spectrum of data at the start of training, and progressively increasing the proportion of difficult segments to emphasize more useful demonstrations.

To get a more nuanced view of each variant's performance, we compare the variants to Baseline-10% for each of the test buckets. Figure 3 shows the performance for the lowest (0-10%), low/mid (30-40%), highest (90-100%), and long tail (99-100%) test buckets. See Figures 5 and 6 in the Appendix for metrics for all the test buckets.

Not only does each agent's collision rate correlate with the difficulty score, but so do the route failure and off-road rates, with the exception of Highest-10%'s off-road rate. This shows that segments that were challenging for development planners are also likely to be challenging for our planning agent, which enables the zero-shot transfer of the difficulty model. It also demonstrates that although the difficulty model was only trained to predict collisions and near-misses, its predicted score describes a broader notion of difficulty, as measured by other key planning metrics.

On the highest and long tail buckets, Highest-10% and Uniform-10% achieve much lower collision rates and overall failure rates than Geometric-schedule-10% and the baseline. This shows that upsampling difficult segments results in better overall performance on those segments, not just on metrics that are highly correlated with the difficulty label (i.e., collisions and near-misses). This is encouraging, since it suggests that the training labels for the difficulty model do not need to fully define expert driving behavior in order for the resulting planning agent to exhibit improved performance across multiple metrics. However, Highest-10% and Uniform-10% perform comparably to, or worse than, the baseline on the lowest and low/mid buckets across all metrics, with especially poor performance on the route failure and off-road metrics. Thus, extreme upsampling of difficult segments sacrifices performance at the other end of the spectrum, since the easiest segments become too rare in the training data. Geometric-schedule-10% addresses this issue by upsampling difficult segments while maintaining sufficiently broad coverage over the difficulty distribution. While it does not achieve equally low collision rates as the Highest-10% and Uniform-10% variants on the highest and long tail buckets, it outperforms the baseline on collisions and performs well on the lowest and low/mid buckets, yielding the best overall performance.

6 Limitations

While our difficulty model successfully identified challenging segments, it was only trained to predict collisions and near-misses, which are just one indication of difficulty. There are other labels that would be helpful for a more comprehensive difficulty model, such as traffic law violations, route progress, and discomfort caused to both the ego vehicle's passengers and other road users. Moreover, the difficulty model could be improved by incorporating the severity of the negative safety outcome into the training labels. Furthermore, as noted in Section 1, a large proportion of the available data consists of situations with very few other road users in the scene. The difficulty model could be replaced with a heuristics-based approach of pruning such scenarios, though the viability of doing so is difficult to gauge a priori.

In terms of evaluation metrics, we focused primarily on safety metrics, since these are of paramount importance for real-world deployment. However, we have not considered other facets of driving like comfort and reliability, which can also significantly affect the viability of ML-based planners.
Figure 3: Metrics for the Baseline-10%, Highest-10%, Uniform-10%, and Geometric-schedule-10% variants on multiple decile test buckets according to the difficulty score. For each metric, each variant's performance is shown for the lowest (0-10%), low/mid (30-40%), highest (90-100%), and long tail (99-100%) test buckets. (a) Overall failure rate (%); (b) Route failure rate (%); (c) Collision rate (%); (d) Off-road rate (%).

Finally, while we have demonstrated that our method of upsampling long tail segments leads to better performance, we have done so only for an agent trained using MGAIL. Quantifying the performance gains with other learning methods remains a topic for future work.

7 Conclusion

We showed that the naive strategy of training on an unbiased driving dataset is suboptimal due to the large fraction of data that does not provide additional useful experience. By utilizing readily available data collected while evaluating development planners in simulation, we trained a model to identify difficult segments with poor safety outcomes. We then applied this model in a zero-shot manner to develop training curricula that upsample difficult examples. Planning agents trained with these curricula outperform the naive strategy in aggregate and are more robust in challenging, long tail scenarios. However, overly aggressive upsampling produces policies that do not handle simpler situations well. We conclude that sampling strategies that prioritize difficult segments but also include easier ones are likely to achieve the best overall performance.

We have also shown that training on the full dataset does not yield any significant benefit over training on only 10% of the data sampled uniformly at random, demonstrating that simply adding more unbiased data to the training set does not necessarily improve performance. This suggests that we can use our difficulty model to reduce the cost of AV system development in two areas: targeting active data collection when operating a fleet of vehicles and selective retention of large-scale sensor logs. Namely, since the difficulty model can predict which driving scenarios are likely to be challenging for new planning agents, we could identify geographic "hotspots" where these scenarios occur and use these locations to inform our data collection process. Furthermore, since biasing the planning agent's training dataset toward difficult segments leads to better results with only a fraction of the available data, we could use the difficulty model scores to reduce the amount of stored data without sacrificing downstream performance.

Acknowledgments

We thank Ben Sapp, Eugene Ie, Jonathan Bingham, and Ryan Polkowski for their helpful comments, and Ury Zhilinsky for his support with experiments and infrastructure.

References

[1] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
[2] J. Frank, S. Mannor, and D. Precup. Reinforcement learning in the presence of rare events. In Proceedings of the 25th International Conference on Machine Learning, pages 336–343, 2008.
[3] N. Kalra and S. M. Paddock. Driving to Safety: How Many Miles of Driving Would It Take to Demonstrate Autonomous Vehicle Reliability? RAND Corporation, 2016.
[4] S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
[5] S. Paul, K. Chatzilygeroudis, K. Ciosek, J.-B. Mouret, M. Osborne, and S. Whiteson.
Alternating optimisation and quadrature for robust control. In AAAI Conference on Artificial Intelligence, 2018.
[6] S. Paul, M. A. Osborne, and S. Whiteson. Fingerprint policy optimisation for robust reinforcement learning. In International Conference on Machine Learning, 2019.
[7] J. De Freitas, A. Censi, B. W. Smith, L. Di Lillo, S. E. Anthony, and E. Frazzoli. From driverless dilemmas to more practical commonsense tests for automated vehicles. Proceedings of the National Academy of Sciences, 118(11), 2021.
[8] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.
[9] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pages 745–750, 2007.
[10] R. Portelas, C. Colas, L. Weng, K. Hofmann, and P.-Y. Oudeyer. Automatic curriculum learning for deep RL: A short survey. arXiv preprint arXiv:2003.04664, 2020.
[11] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 2021.
[12] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Comput. Surv., 50(2), April 2017. doi:10.1145/3054912. URL https://doi.org/10.1145/3054912.
[13] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988. URL https://proceedings.neurips.cc/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf.
[14] M. Bojarski et al. End to end learning for self-driving cars. CoRR, 2016. URL http://arxiv.org/abs/1604.07316.
[15] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
[16] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowé. Reinforcement learning from demonstration through shaping. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[17] H. B. Suay, T. Brys, M. E. Taylor, and S. Chernova. Learning from demonstration for shaping through inverse reinforcement learning. In AAMAS, pages 429–437, 2016.
[18] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 2021.
[19] P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson. Implicit behavioral cloning. In Conference on Robot Learning, pages 158–168. PMLR, 2022.
[20] D. Brown, W. Goo, P. Nagarajan, and S. Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pages 783–792. PMLR, 2019.
[21] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020. URL https://arxiv.org/abs/2005.01643.
[22] A. Kumar, J. Hong, A. Singh, and S. Levine. When should we prefer offline reinforcement learning over behavioral cloning? arXiv preprint arXiv:2204.05618, 2022.
[23] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In ICLR (Poster), 2016.
[24] A. L. Samuel. Some studies in machine learning using the game of checkers. II—Recent progress. IBM Journal of Research and Development, 11(6):601–617, 1967.
[25] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
[26] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[27] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[28] J. Uesato, A. Kumar, C. Szepesvari, T. Erez, A. Ruderman, K. Anderson, K. D. Dvijotham, N. Heess, and P. Kohli. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. In International Conference on Learning Representations, 2018.
[29] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
[30] M. Fang, T. Zhou, Y. Du, L. Han, and Z. Zhang. Curriculum-guided hindsight experience replay. Advances in Neural Information Processing Systems, 32, 2019.
[31] C. Colas, P. Fournier, M. Chetouani, O. Sigaud, and P.-Y. Oudeyer. CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, pages 1331–1340. PMLR, 2019.
[32] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[33] D. Michie, M. Bain, and J. Hayes-Michie. Cognitive models from subcognitive skills. IEE Control Engineering Series, 44:71–99, 1990.
[34] S. Ross et al. A reduction of imitation learning and structured prediction to no-regret online learning. In AIStats, 2011.
[35] J. Ho and S. Ermon. Generative adversarial imitation learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf.
[36] G. Swamy, S. Choudhury, J. A. Bagnell, and Z. S. Wu. Of moments and matching: A game-theoretic framework for closing the imitation gap, 2021.
[37] N. Baram, O. Anschel, I. Caspi, and S. Mannor. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pages 390–399. PMLR, 2017.
[38] M. Xu et al. Variance reduction properties of the reparameterization trick. In AIStats, 2019.
[39] M. Chidambaram, Y. Yang, D. Cer, S. Yuan, Y.-H. Sung, B. Strope, and R. Kurzweil. Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv preprint arXiv:1810.12836, 2018.
[40] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[41] E. Bronstein, M. Palatucci, D. Notz, B. White, A. Kuefler, Y. Lu, S. Paul, P. Nikdel, P. Mougin, H. Chen, J. Fu, A. Abrams, P. Shah, E. Racah, B. Frenkel, S. Whiteson, and D. Anguelov. Hierarchical model-based imitation learning for planning in autonomous driving. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8652–8659. IEEE, 2022.
[42] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in Neural Information Processing Systems, 29, 2016.
[43] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou. Multimodal motion prediction with stacked transformers, 2021.
[44] J. Mercat et al. Multi-head attention for multi-modal joint vehicle motion forecasting. In ICRA, 2020.
[45] J. Lee et al. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[46] A. Jaegle et al. Perceiver: General perception with iterative attention. In ICML, 2021.
[47] F. Torabi, G. Warnell, and P. Stone. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018.
[48] C. Zhang, R. Guo, W. Zeng, Y. Xiong, B. Dai, R. Hu, M. Ren, and R. Urtasun. Rethinking closed-loop training for autonomous driving. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, editors, Computer Vision – ECCV 2022, pages 264–282, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19842-7.

8 Appendix

8.1 Difficulty Bucket Statistics

The summary statistics of the difficulty scores in each bucket of the training and test datasets are presented in Tables 2 and 3, respectively.

8.2 Geometric Schedule

The Geometric-schedule-10% uses a schedule that starts training by weighting each bucket equally (i.e., the unbiased dataset), and ends by weighting each bucket proportional to the average difficulty score of the segments it contains. Specifically, at step $t$, the sampling weight for bucket $k$ is $q_k = (q_k^i - q_k^f)\alpha^t + q_k^f$, where $q_k^i$ and $q_k^f$ are the initial and final weights for bucket $k$, and $\alpha$ is the common ratio of the geometric progression. The sample weights for all the buckets are then normalized to sum to 1 to acquire sampling probabilities. For the Geometric-schedule-10% variant, we set $\alpha = 0.999975$ and $q_k^i = 1$ for each bucket, and $\{q_k^f\}_{k=1}^{10}$ is given by the "Mean" row in Table 2. This progression is visualized in Figure 4.

Figure 4: Sampling schedule of each bucket for the Geometric-schedule-10% variant. [Axes: Step (0–200,000) vs. Sampling Probability; one curve per decile bucket, 0-10% through 90-100%.]

Table 2: Summary statistics of the difficulty scores in each training data bucket.

| | 0-10% | 10-20% | 20-30% | 30-40% | 40-50% | 50-60% | 60-70% | 70-80% | 80-90% | 90-100% |
|---|---|---|---|---|---|---|---|---|---|---|
| Min | 0.001 | 0.019 | 0.031 | 0.046 | 0.066 | 0.094 | 0.133 | 0.189 | 0.270 | 0.407 |
| Mean | 0.013 | 0.025 | 0.038 | 0.056 | 0.079 | 0.112 | 0.159 | 0.227 | 0.331 | 0.573 |
| Max | 0.019 | 0.031 | 0.046 | 0.066 | 0.094 | 0.133 | 0.189 | 0.270 | 0.407 | 0.939 |

Table 3: Summary statistics of the difficulty scores in each test data bucket.

| | 0-10% | 10-20% | 20-30% | 30-40% | 40-50% | 50-60% | 60-70% | 70-80% | 80-90% | 90-100% |
|---|---|---|---|---|---|---|---|---|---|---|
| Min | 0.001 | 0.016 | 0.026 | 0.040 | 0.059 | 0.085 | 0.122 | 0.176 | 0.258 | 0.396 |
| Mean | 0.011 | 0.021 | 0.033 | 0.049 | 0.071 | 0.103 | 0.147 | 0.214 | 0.320 | 0.562 |
| Max | 0.016 | 0.026 | 0.040 | 0.059 | 0.085 | 0.122 | 0.176 | 0.258 | 0.396 | 0.939 |

8.3 Uniform Variant

The Uniform-10% variant sets the sampling weight of each bucket to be proportional to the range of difficulty scores of each bucket. The score ranges for the 10 training buckets are [0.0180, 0.0126, 0.0150, 0.0199, 0.0276, 0.0392, 0.0557, 0.0814, 0.1368, 0.5324].

8.4 Metrics by Bucket

In Figures 5 and 6 we present the performance of each training variant and baseline for each bucket. The clear upward trend in collision rate with the increasing bucket scores demonstrates that collisions are highly correlated with the difficulty scores. We observe a similar, but not quite as strong, correlation in the route failure rate and off-road rate as well.

8.5 Adaptive Importance Sampling Variants

We conducted a series of experiments that perform adaptive importance sampling, wherein we upsample certain buckets based on the agent's performance during training, but then we correct for this upsampling via a likelihood ratio. Importance sampling measures the expectation of a statistic over a distribution P using a different distribution Q.
In particular, $\mathbb{E}_P[f(X)] = \mathbb{E}_Q[f(X)w(X)]$, where $w(x) := p(x)/q(x)$ is the likelihood ratio for density functions $p$ and $q$ from distributions $P$ and $Q$, respectively.

Our nominal distribution P is the natural distribution of run segments. Due to the infrastructure challenges surrounding large training datasets mentioned in Section 4.2, we implemented reweighting on the level of buckets rather than on the level of individual run segments. This allows our method to easily scale to arbitrarily large datasets since it depends only on the number of buckets, not on the dataset size. In this setting, the nominal density is $p_i := 1/N$. We constructed the sampling distribution Q as follows: every K training steps, we collected the average policy loss per bucket $(\bar{L}_P)_i$ over the preceding window of K steps. Since our losses can be positive or negative, we set the sampling weights $q_i \propto \exp(\gamma \cdot (\bar{L}_P)_i)$, where $\gamma$ is the inverse temperature parameter. We also dedicated a small constant probability mass $\epsilon$ to all buckets that were not sampled during the last K iterations, which ensures a nonzero probability of sampling a run segment from any given bucket. During training, we multiplied the loss for a segment from bucket $i$ by the ratio $w_i := 1/(N \cdot q_i)$. To evaluate the effect of different degrees of importance reweighting, we also considered $w_i^\beta$ for different values of $\beta \in [0, 1]$. Algorithm 1 describes this procedure in the context of training the planning agent. We note that this approach is similar to Prioritized Experience Replay (PER) [23], but adapted to our setting with priority weights assigned over a discrete set of buckets.

We expanded on this approach using Distributionally Robust Optimization (DRO) [42], which introduces an additional loss weighting term with hyperparameter $\rho \in [0, 1]$. Larger values of $\rho$ allow for a greater deviation of the loss weights from the importance sampling weights that would be needed to exactly account for the non-uniform sampling.

We show the dataset sampling probabilities $q_i$ for the PER variant in Figure 7 with two settings of the inverse temperature parameter $\gamma$: "PER (γ = 0.1, β = 1)-10%" and "PER (γ = 1, β = 1)-10%". While γ = 0.1 results in a distribution that is close to uniform, γ = 1 quickly produces a heavily skewed distribution that samples from the most difficult bucket at least 75% of the time. In both cases, the sampling probability of each bucket is directly related to its difficulty scores: the higher a bucket's difficulty scores, the more frequently it is sampled. This clearly demonstrates that a run segment's difficulty score is a strong predictor for how challenging it will be for a planning agent to navigate successfully.

We present our results for PER and DRO in Table 4 with different values of γ, β, and ρ. For these experiments, we use 10% of the available training data, and we set K = 1000 and $\epsilon = 3.125 \times 10^{-4}$. We observe that certain settings of PER and DRO achieve the lowest route failure and off-road rates. They also result in the best collision rate, overall failure rate, and route progress ratio, though other non-adaptive variants achieve comparable results that are within the confidence bounds. This suggests that adaptive importance sampling is a promising curriculum learning approach that can provide comparable or better results to fixed sampling strategies without the need for hand-tuning custom sampling weights and schedules.
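The sketch below mirrors the bucket-level reweighting used by the PER variant (see Algorithm 1 below): every K steps, bucket sampling weights are recomputed as exp(γ · mean policy loss), a small floor ε keeps every bucket reachable, and each example's loss is scaled by the importance weight (1/(N · q_i))^β. The class structure and the way the ε floor is applied are illustrative assumptions rather than the exact implementation.

```python
# Illustrative bucket-level adaptive importance sampling (cf. Algorithm 1); values are assumptions.
import numpy as np

NUM_BUCKETS = 10


class AdaptiveBucketSampler:
    """Every `period` steps, reweight buckets by exp(gamma * mean policy loss) and apply an
    importance-sampling correction w_i = 1 / (N * q_i), optionally annealed by beta."""

    def __init__(self, gamma: float = 1.0, beta: float = 1.0, eps: float = 3.125e-4,
                 period: int = 1000, seed: int = 0):
        self.gamma, self.beta, self.eps, self.period = gamma, beta, eps, period
        self.q = np.full(NUM_BUCKETS, 1.0 / NUM_BUCKETS)  # start from the uniform distribution
        self.loss_buffers = [[] for _ in range(NUM_BUCKETS)]
        self.rng = np.random.default_rng(seed)

    def sample_bucket(self) -> int:
        return int(self.rng.choice(NUM_BUCKETS, p=self.q))

    def is_weight(self, bucket: int) -> float:
        """Likelihood-ratio correction relative to the nominal uniform bucket density 1/N."""
        return (1.0 / (NUM_BUCKETS * self.q[bucket])) ** self.beta

    def record_loss(self, bucket: int, policy_loss: float) -> None:
        self.loss_buffers[bucket].append(policy_loss)

    def maybe_update(self, step: int) -> None:
        if step % self.period != 0:
            return
        mean_loss = np.array([np.mean(b) if b else -np.inf for b in self.loss_buffers])
        weights = np.where(np.isfinite(mean_loss), np.exp(self.gamma * mean_loss), 0.0)
        weights = np.maximum(weights, self.eps)  # small floor for buckets not sampled recently
        self.q = weights / weights.sum()
        self.loss_buffers = [[] for _ in range(NUM_BUCKETS)]
```

In this sketch, β = 0 disables the correction entirely (pure prioritized sampling), while β = 1 fully compensates for the skewed sampling; this corresponds to the β sweep reported in Table 4.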
Figure 5: Metrics by test decile bucket for each of the training variants (Highest-10%, Uniform-10%, Geometric-schedule-10%, Baseline-10%). (a) Route failure rate (%); (b) Collision rate (%); (c) Off-road rate (%). [x-axis: test buckets 0-10 through 90-100 and 99-100.]

Figure 6: Metrics by test decile bucket for each of the training variants (Highest-10%, Uniform-10%, Geometric-schedule-10%, Baseline-10%). (a) Overall failure rate (%); (b) Progress rate (%). [x-axis: test buckets 0-10 through 90-100 and 99-100.]

Algorithm 1: MGAIL curriculum training with bucket-wise adaptive importance sampling
Input: datasets $\{D_i\}_{i=1}^N$, sampling period $K$, step size $\eta$, inverse temperature parameter $\gamma$, IS weight exponent $\beta$, budget $T$, minibatch size $B$
1: Initialize sampling probabilities $q_i = \frac{1}{N}$, loss buffers $H_i = \emptyset$ for $i = 1, \ldots, N$
2: for $t = 1$ to $T$ do
3:   if $t \equiv 0 \bmod K$ then
4:     Compute per-dataset mean policy loss $(\bar{L}_P)_i$ from each buffer $H_i$
5:     Compute dataset sampling weights $q_i \leftarrow \exp(\gamma \cdot (\bar{L}_P)_i)$
6:     Normalize dataset sampling probabilities $q_i \leftarrow q_i / \sum_{j=1}^N q_j$
7:     Reset $H_i = \emptyset$ for $i = 1, \ldots, N$
8:   end if
9:   Sample $B$ dataset indices $\{b_i\}_{i=1}^B \sim Q$, where $Q$ has probability mass function $q$
10:  Sample minibatch $\{x_i\}_{i=1}^B$, where $x_i \overset{\text{i.i.d.}}{\sim} U[D_{b_i}]$    ▷ Example $x_i$ is from dataset $D_{b_i}$
11:  Compute per-example policy losses $L_P(x_i)$ and discriminator losses $L_D(x_i)$
12:  Add stopgrad($L_P(x_i)$) to $H_{b_i}$
13:  Compute IS weights $w_i \leftarrow 1/(N \cdot q_i)$
14:  Compute weighted policy loss $l_P \leftarrow \frac{1}{B} \sum_{i=1}^B w_{b_i}^\beta L_P(x_i)$
15:  Compute weighted discriminator loss $l_D \leftarrow \frac{1}{B} \sum_{i=1}^B w_{b_i}^\beta L_D(x_i)$
16:  Update policy weights $\theta \leftarrow \theta + \eta \cdot \nabla_\theta l_P$
17:  Update discriminator weights $\omega \leftarrow \omega + \eta \cdot \nabla_\omega l_D$
18: end for

Figure 7: Dataset sampling probabilities throughout training for the PER adaptive importance sampling variant for two different values of the inverse temperature parameter γ. (a) γ = 0.1; (b) γ = 1. [Axes: Steps (0–200,000) vs. Sampling Probability; one curve per decile bucket.]

8.6 Other Variants

We implemented two additional variants, which differ from the other variants in their training dataset sizes and sampling strategies. Their performance is shown in Table 4.

1. The "Highest-1%" variant is an agent trained on only the top 1% of training examples ordered by difficulty score. As such, it is not strictly comparable to the other variants which use 10% of the data. Our results show that while it achieves a lower collision rate than the baselines, its performance on all other metrics is worse. This demonstrates that extreme upsampling strategies on the most difficult examples, combined with using significantly less training data, can lead to worse overall performance. However, the fact that its collision rate is lower than that of the baselines suggests that upsampling difficult segments has a strong positive effect on metrics that are highly correlated with the difficulty score. It is possible that incorporating other metrics into our definition of "difficulty" for the difficulty model could improve this variant's performance on those metrics as well.
2. We also trained the "Highest-10% + Lowest-10%" variant on the combination of the most difficult bucket and the least difficult bucket. This variant achieves among the best performance overall, matching that of the Geometric-schedule-10% variant. By incorporating the least difficult bucket, it addresses the shortcomings of Highest-10%, which has high failure rates on segments in the 0-40% range of difficulty scores. However, this variant uses 20% of the available data, twice as much as the other variants.

Table 4: Evaluation of agent variants and baselines on the full unbiased test set (mean ± standard error of each metric across 10 seeds, unless noted otherwise). For all metrics except route progress, lower is better.

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
|---|---|---|---|---|---|
| Baseline-all | 1.38±0.13 | 1.46±0.09 | 0.73±0.07 | 81.21±0.39 | 3.33±0.20 |
| Baseline-10% | 1.34±0.06 | 1.50±0.09 | 0.67±0.06 | 81.12±0.37 | 3.28±0.13 |
| Baseline-lowest-10% | 1.14±0.05 | 4.15±0.11 | 0.98±0.10 | 81.88±0.41 | 5.91±0.13 |
| Highest-10% | 1.33±0.06 | 1.23±0.09 | 0.74±0.02 | 77.95±1.33 | 3.10±0.10 |
| Uniform-10% | 1.35±0.09 | 1.17±0.08 | 0.75±0.07 | 80.67±0.73 | 3.07±0.17 |
| Highest-1% | 1.53±0.08 | 1.39±0.06 | 0.99±0.11 | 79.35±1.18 | 3.66±0.13 |
| Highest-10% + Lowest-10% | 1.18±0.06 | 1.28±0.07 | 0.65±0.06 | 79.97±0.81 | 2.94±0.12 |
| Geometric-schedule-10% | 1.19±0.07 | 1.25±0.04 | 0.74±0.10 | 80.48±0.36 | 2.92±0.11 |
| PER (γ=0.1, β=0)-10% (6 seeds) | 1.37±0.06 | 1.32±0.02 | 0.55±0.07 | 81.91±0.58 | 2.99±0.05 |
| PER (γ=0.1, β=0.5)-10% (7 seeds) | 1.28±0.09 | 1.39±0.11 | 0.73±0.11 | 82.89±0.68 | 3.15±0.20 |
| PER (γ=0.1, β=1)-10% | 1.31±0.09 | 1.63±0.09 | 0.51±0.05 | 80.34±0.47 | 3.28±0.14 |
| PER (γ=1, β=0)-10% | 1.28±0.08 | 1.06±0.03 | 0.71±0.08 | 81.16±0.88 | 2.88±0.08 |
| PER (γ=1, β=0.5)-10% | 1.21±0.06 | 1.47±0.10 | 0.90±0.12 | 82.60±0.69 | 3.36±0.20 |
| PER (γ=1, β=1)-10% | 1.20±0.06 | 1.99±0.07 | 1.06±0.21 | 82.34±0.64 | 3.93±0.25 |
| DRO (γ=0.1, β=0, ρ=0.25)-10% (4 seeds) | 1.23±0.10 | 1.34±0.13 | 0.85±0.12 | 80.01±0.52 | 3.27±0.18 |
| DRO (γ=0.1, β=1, ρ=0.05)-10% (8 seeds) | 1.19±0.09 | 1.16±0.04 | 0.69±0.07 | 80.86±0.65 | 2.83±0.10 |
| DRO (γ=0.1, β=1, ρ=0.25)-10% (9 seeds) | 1.24±0.03 | 1.49±0.08 | 0.70±0.03 | 81.23±0.48 | 3.25±0.08 |
| DRO (γ=0.1, β=1, ρ=1)-10% (4 seeds) | 1.02±0.04 | 1.65±0.09 | 0.87±0.21 | 78.28±0.81 | 3.43±0.18 |
| DRO (γ=1, β=0, ρ=0.25)-10% (4 seeds) | 1.27±0.15 | 1.31±0.07 | 0.58±0.06 | 79.43±1.02 | 3.04±0.24 |
| DRO (γ=1, β=0.5, ρ=0.25)-10% (7 seeds) | 1.33±0.05 | 1.18±0.08 | 0.73±0.06 | 80.72±1.22 | 3.02±0.06 |
| DRO (γ=1, β=1, ρ=0.05)-10% (7 seeds) | 1.20±0.02 | 1.58±0.05 | 0.72±0.12 | 81.72±0.46 | 3.31±0.14 |
| DRO (γ=1, β=1, ρ=0.25)-10% (6 seeds) | 1.18±0.09 | 1.65±0.10 | 1.12±0.29 | 82.22±0.91 | 3.70±0.37 |
| DRO (γ=1, β=1, ρ=1)-10% (5 seeds) | 1.28±0.14 | 2.33±0.19 | 1.95±0.20 | 77.97±1.84 | 5.14±0.20 |

8.7 Planning Agent Details

We use the same hierarchical planning agent described in Bronstein et al. [41]; additional details can be found in that work. This planning agent consists of a high-level route-generation policy and a low-level action policy trained using MGAIL. The high-level policy uses an A* search to produce multiple lane-specific routes through a pre-mapped roadgraph and selects the lowest-cost route. We can either evaluate the low-level policy in a standalone fashion by conditioning it on a given route, or the high-level and low-level policies together by allowing the agent to choose its own goal routes given a destination. The low-level action policy and discriminator use stacked transformer-based observation models [43, 44] to encode the goal route, AV's state, other vehicles' states, roadgraph points, and traffic light signals. Similar to Set-Transformer [45] and Perceiver [46], this observation encoder uses learned latent queries and a stack of cross-attention blocks, one for each group of features. A delta actions model is used for the AV's dynamics, where the action $a$ is the offset from the current state $s$: $s' = s + a$.
The policy head predicts the parameters (weights, means, and covariances) of a Gaussian Mixture Model (GMM) with 8 Gaussians, used to parameterize the delta actions. We trained the action policy and discriminator using a combination of MGAIL and behavior cloning (BC). The total policy loss is given by $\lambda_P L_P + \lambda_{BC} L_{BC}$, where $L_P = -\mathbb{E}_{s \sim \pi_\theta}[\log D_\omega(s)]$ is the MGAIL policy loss, $L_{BC} = -\mathbb{E}_{s,a \sim \pi_E}[\log \pi_\theta(a|s)]$ is the BC loss, and $\lambda_P$ and $\lambda_{BC}$ are hyperparameters. The MGAIL discriminator loss is $L_D = \mathbb{E}_{s \sim \pi_\theta}[\log D_\omega(s)] + \mathbb{E}_{s \sim \pi_E}[\log(1 - D_\omega(s))]$. The discriminator is only conditioned on the state $s$ as in [47]. During backpropagation, only the policy parameters $\theta$ are updated for $L_P$ and $L_{BC}$, and only the discriminator parameters $\omega$ are updated for $L_D$.
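As a concrete illustration of the policy head and loss combination described above, the sketch below builds an 8-component GMM over delta actions with torch.distributions and combines an adversarial term with the BC log-likelihood using weights λ_P and λ_BC. Feature dimensions, loss weights, and the split between rollout states and expert states are assumptions for illustration; the transformer observation encoder and the differentiable-dynamics backpropagation of MGAIL are not shown.

```python
# Illustrative sketch of the GMM delta-action head and combined MGAIL + BC policy loss
# from Appendix 8.7 (dimensions and loss weights are assumptions).
import torch
import torch.nn as nn
import torch.distributions as D

FEAT_DIM, ACT_DIM, NUM_MODES = 256, 2, 8  # assumed sizes; 8 Gaussians as stated above


class GMMPolicyHead(nn.Module):
    """Predicts mixture weights, means, and (diagonal) covariances of the delta-action GMM."""

    def __init__(self):
        super().__init__()
        self.logits = nn.Linear(FEAT_DIM, NUM_MODES)
        self.means = nn.Linear(FEAT_DIM, NUM_MODES * ACT_DIM)
        self.log_stds = nn.Linear(FEAT_DIM, NUM_MODES * ACT_DIM)

    def forward(self, features: torch.Tensor) -> D.MixtureSameFamily:
        b = features.shape[0]
        mix = D.Categorical(logits=self.logits(features))
        comp = D.Independent(
            D.Normal(self.means(features).view(b, NUM_MODES, ACT_DIM),
                     self.log_stds(features).view(b, NUM_MODES, ACT_DIM).exp()), 1)
        return D.MixtureSameFamily(mix, comp)


def combined_policy_loss(policy_dist: D.MixtureSameFamily, expert_delta: torch.Tensor,
                         disc_logits_on_policy: torch.Tensor,
                         lambda_p: float = 1.0, lambda_bc: float = 1.0) -> torch.Tensor:
    """lambda_P * L_P + lambda_BC * L_BC. policy_dist is the delta-action GMM conditioned on
    logged expert states (for the BC term); disc_logits_on_policy are state-only discriminator
    logits on states visited by the policy rollout (for the adversarial term)."""
    l_p = -torch.log(torch.sigmoid(disc_logits_on_policy) + 1e-8).mean()
    l_bc = -policy_dist.log_prob(expert_delta).mean()
    return lambda_p * l_p + lambda_bc * l_bc


# Delta-action dynamics: the next state is the current state plus the sampled offset.
def apply_delta_action(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    return state + action
```

Only the policy parameters would receive gradients from this combined loss; the discriminator update, which uses L_D on its own parameters, is omitted here for brevity.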
8.8 Evaluation With Interactive Agents

A potential concern is that the planning agent is trained and evaluated with other agents replaying their logged trajectories. This may result in unrealistic behavior if the planning agent behaves in a significantly different way than the logged AV and other agents don't react realistically. It is possible that a planning agent trained in this way would not perform well when deployed in the real world, in which other road users influence and interact with the AV. To determine whether this is an issue, we evaluated our planning agent alongside interactive agents controlling a subset of other vehicles in the scene. The interactive agent policy, which was trained separately, has the same model architecture, dynamics model, and training loss function as our planning agent. The main difference is that the interactive agent is not goal-conditioned because its task is to drive in a realistic manner and not necessarily reach a specific destination.

For each bucket in the test dataset, we constructed a subset in which each segment has at least 8 other vehicles that could be controlled by an interactive agent. Note that this is a more challenging dataset because segments with more road users tend to be more difficult. Starting from an equal number of segments per bucket and discarding segments with an insufficient number of other vehicles, the number of segments remaining in each bucket in order of increasing difficulty accounted for 0.81%, 1.59%, 2.47%, 3.66%, 5.76%, 8.46%, 11.67%, 16.97%, 22.97%, and 25.64% of the total. We evaluated the Baseline-10% and Uniform-10% planning agent variants on this dataset by using the same initial conditions and goal route as the original test dataset. Table 5 demonstrates that the route failure rate decreases for all variants when using interactive agents, and the collision, off-road, and overall failure rates either decrease or remain the same. Additional investigation is needed to determine why the route progress ratio increases for the Baseline-10% variant but decreases for the Uniform-10% variant. These results indicate that our training procedure for the planning agent allows it to perform better in a more realistic simulated environment, not worse. In fact, we expect the agent to have even better performance when evaluated with interactive agents on the original data distribution (i.e., without the requirement that at least 8 vehicles are available for replacement with interactive agents), which would be inherently easier than the subset with interactive agents. While training the planning agent with interactive agents may result in performance gains, this approach is orthogonal to the curriculum learning framework we present and can be easily combined with it. In fact, concurrent work by Zhang et al. [48] investigates this idea, finding that targeted training on more challenging closed-loop scenarios results in more robust agents while requiring less data.

Table 5: Evaluation of agent variants and baselines without interactive agents on the original test set vs. with interactive agents on a subset where at least 8 vehicles are available for replacement. For all metrics except route progress, lower is better.

Without Interactive Agents:

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
|---|---|---|---|---|---|
| Baseline-10% | 1.34±0.06 | 1.50±0.09 | 0.67±0.06 | 81.12±0.37 | 3.28±0.13 |
| Uniform-10% | 1.35±0.09 | 1.17±0.08 | 0.75±0.07 | 80.67±0.73 | 3.07±0.17 |

With Interactive Agents:

| Agent Variant | Route Failure rate (%) | Collision rate (%) | Off-road rate (%) | Route Progress ratio (%) | Failure rate (%) |
|---|---|---|---|---|---|
| Baseline-10% | 1.03±0.05 | 1.47±0.10 | 0.67±0.03 | 84.58±0.22 | 3.05±0.13 |
| Uniform-10% | 0.78±0.07 | 1.16±0.06 | 0.29±0.04 | 75.5±0.61 | 2.13±0.11 |