The Waymo Open Sim Agents Challenge

Nico Montali, John Lambert, Paul Mougin, Alex Kuefler, Nicholas Rhinehart, Michelle Li, Cole Gulino, Tristan Emrich, Zoey Yang, Shimon Whiteson, Brandyn White, Dragomir Anguelov
Waymo LLC

Abstract

Simulation with realistic, interactive agents represents a key task for autonomous vehicle software development. In this work, we introduce the Waymo Open Sim Agents Challenge (WOSAC). WOSAC is the first public challenge to tackle this task and propose corresponding metrics. The goal of the challenge is to stimulate the design of realistic simulators that can be used to evaluate and train a behavior model for autonomous driving. We outline our evaluation methodology, present results for a number of different baseline simulation agent methods, and analyze several submissions to the 2023 competition which ran from March 16, 2023 to May 23, 2023. The WOSAC evaluation server remains open for submissions and we discuss open problems for the task.

1 Introduction

Simulation environments allow cheap and fast evaluation of autonomous driving behavior systems, while also reducing the need to deploy potentially risky software releases to physical systems. While generation of synthetic sensor data was an early goal [19, 43] of simulation, use cases have evolved as perception systems have matured. Today, one of the most promising use cases for simulation is system safety validation via statistical model checking [1, 16] with Monte Carlo trials involving realistically modeled traffic participants, i.e., simulation agents.

Simulation agents are controlled objects that perform realistic behaviors in a virtual world. In this challenge, in order to reduce the computational burden and complexity of simulation, we focus on simulating agent behavior as captured by the outputs of a perception system, e.g., mid-level object representations [2, 79] such as object trajectories, rather than simulating the underlying sensor data [13, 37, 62, 76] (see Figure 1).
Figure 1: WOSAC models the simulation problem as simulation of mid-level object representations, rather than as sensor simulation.

A requirement for modeling realistic behavior in simulation is the ability for sim agents to respond to arbitrary behavior of the autonomous vehicle (AV). “Pose divergence” or “simulation drift” [3] is defined as the deviation between the AV’s behavior in driving logs and its behavior during simulation, which may be represented through differing position, heading, speed, acceleration, and more. Directly replaying logged behavior of all other objects in the scene [32, 34, 35] under arbitrary AV planning may have limited realism because of this pose divergence. Such log-playback agents tend to heavily overestimate the aggressiveness of real actors, as they are unwilling to deviate from their planned route under any circumstances. On the other hand, rule-based agents that follow heuristics such as the Intelligent Driver Model (IDM) [65] are overly accommodating and reactive. We seek to evaluate and encourage the development of sim agents that lie in the middle ground, adhering to a definition of realism that implies matching the full distribution of human behavior.

37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.

Table 1: A comparison of three autonomous-vehicle behavior related tasks which involve generation of a desired future sequence of physical states: trajectory forecasting, planning, and simulation. Note that observations $o_t \in \mathcal{O}$ include simulated agent and environment properties.
Task | Multiple Object Categories | System Outputs | Vehicle Kinematic Constraints | System Evaluation | Objectives
Multi-Agent Trajectory Forecasting | ✓ | $\big((x_t, y_t, \theta_t, v^x_t, v^y_t)\big)_{t=1}^{T}$ | ✗ | Open-loop | Kinematic accuracy and mode covering
AV Motion Planning | ✗ | $\big((x_t, y_t, \theta_t)\big)_{t=1}^{T}$ or controls | ✓ | Closed-loop | Safety, comfort, progress
Agent and Environment Simulation | ✓ | $(o_t)_{t=1}^{T}$; $o_t \in \mathcal{O}$ | ✗ | Closed-loop | Distributional realism
To the best of our knowledge, to date there is no existing benchmark for evaluation of simulation agents. Benchmarks have spurred notable innovation in other areas related to autonomous driving research, especially for perception [6, 10, 24, 58], motion forecasting [6, 10, 22, 72, 78], and motion planning [19]. We believe a standardized benchmark can likewise spur dramatic improvements for simulation agent development. Among these benchmarks, those focused on motion forecasting are perhaps most similar to simulation, but all involve open-loop evaluation, which is clearly deficient compared to our closed-loop evaluation. Furthermore, we introduce realism metrics which are suitable for evaluating long-term futures. Relevant datasets such as the Waymo Open Motion Dataset (WOMD) [22] exist today that contain real-world agent behavior examples, and we build WOSAC on top of WOMD. In this challenge, we focus on a subset of the possible perception outputs; for example, traffic light states and vehicle attributes are not modeled, and we leave these for future work.

The challenges our benchmark raises are unique, and if we can make real progress on it, we can show that we’ve solved one of the hard problems in self-driving. We have a number of open questions: Are there benefits to scene-centric, rather than agent-centric, simulation methods? What is the most useful generative modeling framework for the task? What degree of motion planning is needed for agent policies, and how far can marginal motion prediction take us? How can simulation methods be made more efficient? How can we design a benchmark and enforce various simulator properties?

During our first iteration of the WOSAC challenge, user submissions have helped us answer a subset of these questions; for example, we observed that most methods found it most expedient to build upon state-of-the-art marginal motion prediction methods, i.e., operating in an agent-centric manner.

In this work, we describe in detail the Waymo Open Sim Agents Challenge (WOSAC) with the goal of stimulating interest in traffic simulation and world modeling. Our contributions are as follows:
• An evaluation framework for autoregressive traffic agents based on the approximate negative log likelihood they assign to logged data.
• An evaluation platform, an online leaderboard, available for submission at https://waymo.com/open/challenges/2023/sim-agents/.
• An empirical evaluation and analysis of various baseline methods, as well as several external submissions.

2 Related Work

Multi-Agent Traffic Simulation. Simulators have been used to train and evaluate autonomous driving planners for several decades, dating back to ALVINN [43]. While simulators such as CARLA [19], SUMO [33], and Flow [73] provide only a heuristic driving policy for sim agents, they have still enabled progress in the AV motion planning domain [11, 12, 15]. Other recent simulators such as Nocturne [68] use a simplified world representation that consists of a fixed roadgraph and moving agent boxes.

Table 2: Existing evaluation methods for simulation agents. Entries are ordered chronologically by arXiv timestamp. There is limited consensus in the literature regarding how multi-agent simulation should be evaluated.
Columns: Evaluation Protocol | ADE or minADE | Offroad Rate | Collision Rate | Instance-Level Distribution Matching | Dataset-Level Distribution Matching | Spatial Coverage or Diversity | Goal Progress or Completion
Rows (checkmarks are listed in extracted order; their column positions are not recoverable here):
ConvSocialPool [17]: ✓ ✓
Trajectron [29]: ✓ ✓
PRECOG [48]: ✓ ✓
BARK [4]: ✓ ✓ ✓
SMARTS [82]: ✓ ✓
TrafficSim [60]: ✓ ✓ ✓ ✓
SimNet [3]: ✓ ✓
Symphony [28]: ✓ ✓ ✓ ✓
Nocturne [68]: ✓ ✓ ✓
BITS [74]: ✓ ✓ ✓ ✓
InterSim [59]: ✓ ✓ ✓
MetaDrive [34]: ✓ ✓ ✓
TrafficBots [79]: ✓ ✓ ✓
WOSAC (Ours): ✓ ✓ ✓

Simulation agent modeling is closely related to the problem of trajectory forecasting, as a sim agent could execute a set of trajectory predictions as its plan [4, 59]. However, as trajectory prediction methods are traditionally trained in open-loop, they have limited capability to recover from out-of-domain predictions encountered during closed-loop simulation [51]. In addition, few forecasting methods produce consistent joint future samples at the scene level [36, 48]. Sim agent modeling is also related to planning, as each sim agent could execute a replica of a planner independently [4]. However, each of these three tasks differs dramatically in objectives, outputs, and constraints (see Table 1).
Learned Sim Agents. Learned sim agents in the literature differ widely in assumptions around policy coordination, dynamics model constraints, observability, as well as input modalities. While coordinated scene-centric agent behavior is studied in the open-loop motion forecasting domain [8, 9, 57], to the best of our knowledge, TrafficSim [60] is the only closed-loop, learned sim agent work to use a joint, scene-centric actor policy; all others operate in a decentralized manner without coordination [5, 28, 74], i.e., each agent in the scene is independently controlled by replicas of the same model using agent-centric inference. BITS and TrafficBots [74, 79] use a unicycle dynamics model and Nocturne uses a bicycle dynamics model [68], whereas most others do not specify any such constraint; others enforce partial observability constraints, such as Nocturne [68]. Other methods differ in the type of input, whether rasterized [3, 74] or provided in a vector format [28, 79]. Some works focus specifically on generating challenging scenarios [47], and others aim for user-based controllability [81]. Some are trained via pure imitation learning [3], while others include closed-loop adversarial losses [28, 60], or multi-agent RL [4, 34, 82] in order to learn to recover from their mistakes [51]. Some works such as InterSim [4, 28, 59, 74, 79, 82] use a goal-conditioned problem formulation, while others do not [3].

Evaluation of Generative Models. Distribution matching has become a common way to evaluate generative models [18, 20, 27, 30, 31, 45, 46, 49, 52, 77], through the Fréchet Inception Distance (FID) [26]. Previous evaluation methods such as the Inception Score (IS) [53] reason over the entropy of conditional and unconditional distributions, but are not applicable in our case due to the multi-modality of the simulation problem. The FID improves on the Inception Score by using statistics of real-world samples, measuring the difference between the generated distributions and a data distribution (in the simulation domain, the logged distribution). However, FID has limited sensitivity per example due to aggregation of statistics over entire test datasets into a single mean and covariance.
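To make the aggregation behind FID-style metrics concrete, the following is a minimal sketch of the Fréchet distance between Gaussians fitted to two sets of feature vectors. It is an illustration under stated assumptions, not the metric used in WOSAC: the function name `frechet_distance` and the arrays `logged`, `shifted`, and `perturbed` are hypothetical, the features are synthetic, and no Inception network is involved.

```python
import numpy as np


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n, d) sample sets."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))
    # tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_x - mu_y) ** 2))
    cov_term = float(np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Hypothetical 2-D features standing in for logged and simulated statistics.
rng = np.random.default_rng(0)
logged = rng.normal(size=(2000, 2))
shifted = logged + np.array([3.0, 0.0])   # every sample moved by the same offset
perturbed = logged.copy()
perturbed[0] += 1.0                       # only a single sample changed

print(frechet_distance(logged, shifted))
print(frechet_distance(logged, perturbed))
```

Because each dataset is reduced to one mean and one covariance, changing a single example (as in `perturbed`) moves the score only marginally, whereas a dataset-wide shift is reflected directly in the mean term. This is one way to see the limited per-example sensitivity noted above.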
Evaluating Multi-Agent Simulation. There is limited consensus in the literature regarding how multi-agent simulation should be evaluated (see Table 2), and no mainstream benchmark exists. Given the importance of safety, almost all existing sim agent works measure some form of collision rate [3, 28, 59, 60, 68, 74], and some multi-object joint trajectory forecasting methods also measure it via trajectory overlap [36]. However, collision rate can be artificially driven to zero by static policies, and thus cannot measure realism. Quantitative evaluation of realism requires comparison with logged data. Such evaluation methods vary widely, from distribution matching of vehicle dynamics [28, 74], to comparison of offroad rates [28, 60, 74], spatial coverage and diversity [60, 74], and progress to goal [59, 68]. However, as goals are not observable, they are difficult to extract reliably. Requiring direct reconstruction of logged data through metrics such as Average Displacement Error (ADE) has also been proposed [3, 68], but has limited effectiveness because there are generally multiple realistic actions the AV or sim agents could take at any given moment. To overcome this limitation, one option is to allow the user to provide multiple possible trajectories per sim agent, as in TrafficSim, which uses a minimum average displacement error (minADE) over 15 simulations [60].

Recently, generative-model based evaluation has become more popular in the simulation domain, primarily through distribution matching metrics. Symphony [28] uses Jensen-Shannon distances over trajectory curvature. NeuralNDE [75] compares distributions of vehicle speed and inter-vehicle distance, whereas BITS [74] utilizes Wasserstein distances on agent scene occupancy using multiple rollouts per scene, along with Wasserstein distances between simulated and logged speed and jerk, two kinematic features which can encapsulate passenger comfort. The latter are computed as a distribution-to-distribution comparison on a dataset level; however, this type of metric has shown limited sensitivity in our experiments.

Likelihood metrics. An alternative distribution matching framework is to measure point-to-distribution distances.
[17] and [29] introduce a metric defined as the average negative log likelihood (NLL) of the ground truth trajectory, as determined by a kernel density estimate (KDE) [42, 50] over output samples at the same prediction timestep. This metric has found some adoption [54, 64], and we primarily build off of this metric in our work. We note that likelihood-based generative models of simulation such as PRECOG [48] and MFP [63] directly produce likelihoods, meaning that the use of a KDE on sampled trajectories to estimate likelihoods is not needed for such model classes. Concurrent work [79] also measures the NLL of the GT scene under 6 rollouts.

3 Traffic Simulation as Conditional Generative Modeling

Our goal is to encourage the design of traffic simulators by defining a data-driven evaluation framework and instantiating it with publicly accessible data. We focus on simulating agent behavior in a setting in which an offboard perception system is treated as fixed and given.

Problem formulation. We formulate driving as a Hidden Markov Model $\mathcal{H} = \big(\mathcal{S}, \mathcal{O}, p(o_t \mid s_t), p(s_t \mid s_{t-1})\big)$, where $\mathcal{S}$ denotes the set of unobservable true world states, $\mathcal{O}$ denotes the set of observations, $p(o_t \mid s_t)$ denotes the sampleable emission distribution, and $p(s_t \mid s_{t-1})$ denotes the hidden Markovian state dynamics: the probability of the hidden state transitioning from $s_{t-1}$ at timestep $t-1$ to $s_t$ at time $t$. Each $o_t \in \mathcal{O}$ can be partitioned into AV- and environment-centric components that vary in time: $o_t = [o_t^{\text{AV}}, o_t^{\text{env}}]$. $O_t^{\text{env}}$ can in general contain a rich set of features, but for the purpose of our challenge, it contains solely the poses of the non-AV agents. We denote the true observation dynamics as $p^{\text{world}}(o_t \mid s_{t-1}) \doteq \mathbb{E}_{p(s_t \mid s_{t-1})}\, p(o_t \mid s_t)$.

The task. The task to build a “world model” qworld(o |oc ) of pworld(o |s ), . t