The Waymo Open Sim Agents Challenge
Nico Montali, John Lambert, Paul Mougin, Alex Kuefler, Nicholas Rhinehart,
Michelle Li, Cole Gulino, Tristan Emrich, Zoey Yang, Shimon Whiteson,
Brandyn White, Dragomir Anguelov
Waymo LLC
Abstract
Simulation with realistic, interactive agents represents a key task for autonomous
vehicle software development. In this work, we introduce the Waymo Open Sim
Agents Challenge (WOSAC). WOSAC is the first public challenge to tackle this
task and propose corresponding metrics. The goal of the challenge is to stimulate
the design of realistic simulators that can be used to evaluate and train a behavior
model for autonomous driving. We outline our evaluation methodology, present
results for a number of different baseline simulation agent methods, and analyze
several submissions to the 2023 competition, which ran from March 16, 2023 to
May 23, 2023. The WOSAC evaluation server remains open for submissions and
we discuss open problems for the task.
1 Introduction
Simulation environments allow cheap and fast evaluation of autonomous driving behavior systems,
while also reducing the need to deploy potentially risky software releases to physical systems. While
generation of synthetic sensor data was an early goal [19, 43] of simulation, use cases have evolved
as perception systems have matured. Today, one of the most promising use cases for simulation
is system safety validation via statistical model checking [1, 16] with Monte Carlo trials involving
realistically modeled traffic participants, i.e., simulation agents.
Simulation agents are controlled objects that perform realistic behaviors in a virtual world. In this
challenge, in order to reduce the computational burden and complexity of simulation, we focus on
simulating agent behavior as captured by the outputs of a perception system, e.g., mid-level object
representations [2, 79] such as object trajectories, rather than simulating the underlying sensor
data [13, 37, 62, 76] (see Figure 1).
Figure 1: WOSAC models the simulation problem as simulation of mid-level object representations,
rather than as sensor simulation.
A requirement for modeling realistic behavior in simulation is the ability for sim agents to respond
to arbitrary behavior of the autonomous vehicle (AV). “Pose divergence” or “simulation drift” [3] is
defined as the deviation between the AV’s behavior in driving logs and its behavior during
simulation, which may be represented through differing position, heading, speed, acceleration, and
more. Directly replaying logged behavior of all other objects in the scene [32, 34, 35]
37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
arXiv:2305.12032v4 [cs.CV] 11 Dec 2023
Table 1: A comparison of three autonomous-vehicle behavior related tasks which involve generation
of a desired future sequence of physical states: trajectory forecasting, planning, and simulation. Note
that observations $o_t \in O$ include simulated agent and environment properties.
TASK | MULTIPLE OBJECT CATEGORIES | OUTPUTS | VEHICLE KINEMATIC CONSTRAINTS | SYSTEM EVALUATION | SYSTEM OBJECTIVES
Multi-Agent Trajectory Forecasting | ✓ | $(x_t, y_t, \theta_t, v^x_t, v^y_t)_{t=1}^{T}$ | ✗ | OPEN-LOOP | KINEMATIC ACCURACY AND MODE COVERING
AV Motion Planning | ✗ | $(x_t, y_t, \theta_t)_{t=1}^{T}$ or controls | ✓ | CLOSED-LOOP | SAFETY, COMFORT, AND PROGRESS
Agent and Environment Simulation | ✓ | $(o_t)_{t=1}^{T}$; $o_t \in O$ | ✗ | CLOSED-LOOP | DISTRIBUTIONAL REALISM
under arbitrary AV planning may have limited realism because of this pose divergence. Such log-
playback agents tend to heavily overestimate the aggressiveness of real actors, as they are unwilling
to deviate from their planned route under any circumstances. On the other hand, rule-based agents
that follow heuristics such as the Intelligent Driver Model (IDM) [65] are overly accommodating
and reactive. We seek to evaluate and encourage the development of sim agents that lie in the middle
ground, adhering to a definition of realism that implies matching the full distribution of human
behavior.
To the best of our knowledge, to date there is no existing benchmark for evaluation of simulation
agents. Benchmarks have spurred notable innovation in other areas related to autonomous driving
research, especially for perception [6, 10, 24, 58], motion forecasting [6, 10, 22, 72, 78], and motion
planning [19]. We believe a standardized benchmark can likewise spur dramatic improvements for
simulation agent development. Among these benchmarks, those focused on motion forecasting are
perhaps most similar to simulation, but all involve open-loop evaluation, which is clearly deficient
compared to our closed-loop evaluation. Furthermore, we introduce realism metrics which are
suitable for evaluating long-term futures. Relevant datasets such as the Waymo Open Motion Dataset
(WOMD) [22] exist today that contain real-world agent behavior examples, and we build WOSAC
on top of WOMD. In this challenge, we focus on a subset of the possible perception outputs;
e.g., traffic light states and vehicle attributes are not modeled, and we leave them for future work.
The challenges our benchmark raises are unique, and if we can make real progress on it, we can show
that we have solved one of the hard problems in self-driving. We have a number of open questions:
Are there benefits to scene-centric, rather than agent-centric, simulation methods? What is the most
useful generative modeling framework for the task? What degree of motion planning is needed for
agent policies, and how far can marginal motion prediction take us? How can simulation methods be
made more efficient? How can we design a benchmark and enforce various simulator properties?
During our first iteration of the WOSAC challenge, user submissions have helped us answer a subset
of these questions; for example, we observed that most methods found it most expedient to build
upon state-of-the-art marginal motion prediction methods, i.e., operating in an agent-centric manner.
In this work, we describe in detail the Waymo Open Sim Agents Challenge (WOSAC) with the goal
of stimulating interest in traffic simulation and world modeling. Our contributions are as follows:
• An evaluation framework for autoregressive traffic agents based on the approximate negative
log likelihood they assign to logged data.
• An evaluation platform, an online leaderboard, available for submission at
https://waymo.com/open/challenges/2023/sim-agents/.
• An empirical evaluation and analysis of various baseline methods, as well as several external
submissions.
2 Related Work
Multi-Agent Traffic Simulation. Simulators have been used to train and evaluate autonomous driving
planners for several decades, dating back to ALVINN [43]. While simulators such as CARLA [19],
SUMO [33], and Flow [73] provide only a heuristic driving policy for sim agents, they have still
enabled progress in the AV motion planning domain [11, 12, 15]. Other recent simulators such as
Nocturne [68] use a simplified world representation that consists of a fixed roadgraph and moving
agent boxes.
Table 2: Existing evaluation methods for simulation agents. Entries are ordered chronologically by
arXiv timestamp. There is limited consensus in the literature regarding how multi-agent simulation
should be evaluated.
Columns: Evaluation Protocol; ADE or minADE; Offroad Rate; Collision Rate; Instance-Level
Distribution Matching; Dataset-Level Distribution Matching; Spatial Coverage or Diversity;
Goal Progress or Completion
ConvSocialPool[17] ✓ ✓
Trajectron[29] ✓ ✓
PRECOG[48] ✓ ✓
BARK[4] ✓ ✓ ✓
SMARTS[82] ✓ ✓
TrafficSim[60] ✓ ✓ ✓ ✓
SimNet[3] ✓ ✓
Symphony[28] ✓ ✓ ✓ ✓
Nocturne[68] ✓ ✓ ✓
BITS[74] ✓ ✓ ✓ ✓
InterSim[59] ✓ ✓ ✓
MetaDrive[34] ✓ ✓ ✓
TrafficBots[79] ✓ ✓ ✓
WOSAC(Ours) ✓ ✓ ✓
Simulation agent modeling is closely related to the problem of trajectory forecasting, as a sim agent
could execute a set of trajectory predictions as its plan [4, 59]. However, as trajectory prediction
methods are traditionally trained in open-loop, they have limited capability to recover from out-of-
domain predictions encountered during closed-loop simulation [51]. In addition, few forecasting
methods produce consistent joint future samples at the scene level [36, 48]. Sim agent modeling is
also related to planning, as each sim agent could execute a replica of a planner independently [4].
However, each of these three tasks differs dramatically in objectives, outputs, and constraints (see
Table 1).
Learned Sim Agents. Learned sim agents in the literature differ widely in assumptions around
policy coordination, dynamics model constraints, observability, as well as input modalities. While
coordinated scene-centric agent behavior is studied in the open-loop motion forecasting domain
[8, 9, 57], to the best of our knowledge, TrafficSim [60] is the only closed-loop, learned sim agent
work to use a joint, scene-centric actor policy; all others operate in a decentralized manner without
coordination [5, 28, 74], i.e., each agent in the scene is independently controlled by replicas of the
same model using agent-centric inference. BITS and TrafficBots [74, 79] use a unicycle dynamics
model and Nocturne uses a bicycle dynamics model [68], whereas most others do not specify any such
constraint; others enforce partial observability constraints, such as Nocturne [68]. Other methods
differ in the type of input, whether rasterized [3, 74] or provided in a vector format [28, 79]. Some
works focus specifically on generating challenging scenarios [47], and others aim for user-based
controllability [81]. Some are trained via pure imitation learning [3], while others include closed-loop
adversarial losses [28, 60] or multi-agent RL [4, 34, 82] in order to learn to recover from their
mistakes [51]. Some works such as InterSim [4, 28, 59, 74, 79, 82] use a goal-conditioned problem
formulation, while others do not [3].
Evaluation of Generative Models. Distribution matching has become a common way to evaluate
generative models [18, 20, 27, 30, 31, 45, 46, 49, 52, 77], through the Fréchet Inception Distance
(FID) [26]. Previous evaluation methods such as the Inception Score (IS) [53] reason over the entropy
of conditional and unconditional distributions, but are not applicable in our case due to the multi-
modality of the simulation problem. The FID improves on the Inception Score by using statistics of
real-world samples, measuring the difference between the generated distributions and a data
distribution (in the simulation domain, the logged distribution). However, FID has limited sensitivity
per example due to aggregation of statistics over entire test datasets into a single mean and covariance.
Evaluating Multi-Agent Simulation. There is limited consensus in the literature regarding how
multi-agent simulation should be evaluated (see Table 2), and no mainstream existing benchmark
exists. Given the importance of safety, almost all existing sim agent works measure some form
of collision rate [3, 28, 59, 60, 68, 74], and some multi-object joint trajectory forecasting methods
also measure it via trajectory overlap [36]. However, collision rate can be artificially driven to zero
by static policies, and thus cannot measure realism. Quantitative evaluation of realism requires
comparison with logged data. Such evaluation methods vary widely, from distribution matching of
vehicle dynamics [28, 74], to comparison of offroad rates [28, 60, 74], spatial coverage and diversity
[60, 74], and progress to goal [59, 68]. However, as goals are not observable, they are difficult
to extract reliably. Requiring direct reconstruction of logged data through metrics such as Average
Displacement Error (ADE) has also been proposed [3, 68], but has limited effectiveness because
there are generally multiple realistic actions the AV or sim agents could take at any given moment.
To overcome this limitation, one option is to allow the user to provide multiple possible trajectories
per sim agent, as in TrafficSim [60], which uses a minimum average displacement error (minADE)
over 15 simulations.
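As a minimal sketch of the two displacement metrics mentioned above, the following computes ADE for a single rollout and minADE over several candidate rollouts of the same agent. The function names and the 2-D `(x, y)` trajectory format are assumptions made for illustration, not the challenge's reference implementation.

```python
import math

def ade(pred, gt):
    """Average displacement error between two equal-length (x, y) trajectories."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def min_ade(preds, gt):
    """Minimum ADE over several candidate trajectories for the same agent."""
    return min(ade(p, gt) for p in preds)

# Toy example (hypothetical data): the logged agent drives straight.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
turning = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

single_rollout_error = ade(turning, gt)
best_of_two_error = min_ade([turning, straight], gt)
```

Here a single realistic-but-different rollout (`turning`) is penalized by ADE, while minADE is satisfied as soon as any one candidate matches the log, which is why it is better suited to multi-modal futures.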
Recently, generative-model based evaluation has become more popular in the simulation domain,
primarily through distribution matching metrics. Symphony [28] uses Jensen-Shannon distances
over trajectory curvature. NeuralNDE [75] compares distributions of vehicle speed and inter-vehicle
distance, whereas BITS [74] utilizes Wasserstein distances on agent scene occupancy using multiple
rollouts per scene, along with Wasserstein distances between simulated and logged speed and jerk,
two kinematic features which can encapsulate passenger comfort. The latter are computed as a
distribution-to-distribution comparison on a dataset level; however, this type of metric has shown
limited sensitivity in our experiments.
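To make the distribution-to-distribution comparison concrete, here is a small sketch of the 1-D Wasserstein-1 distance between two equal-size sets of samples (for instance, simulated and logged speeds). For equal-size empirical distributions this reduces to the mean absolute difference of the sorted samples. The function name and the toy data are assumptions; this is not the implementation used by any of the cited works.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample counts")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Hypothetical speeds (m/s): simulated agents are uniformly 1 m/s faster.
logged_speeds = [0.0, 1.0, 2.0]
simulated_speeds = [3.0, 2.0, 1.0]
distance = wasserstein_1d(logged_speeds, simulated_speeds)
```

Because the samples are pooled before comparison, such a metric cannot tell which scenario a discrepancy came from, which is consistent with the limited per-example sensitivity noted above.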
Likelihood metrics. An alternative distribution matching framework is to measure point-to-
distribution distances. [17] and [29] introduce a metric defined as the average negative log likelihood
(NLL) of the ground truth trajectory, as determined by a kernel density estimate (KDE) [42, 50] over
output samples at the same prediction timestep. This metric has found some adoption [54, 64], and
we primarily build off of this metric in our work. We note that likelihood-based generative models
of simulation such as PRECOG [48] and MFP [63] directly produce likelihoods, meaning that the
use of a KDE on sampled trajectories to estimate likelihoods is not needed for such model classes.
Concurrent work [79] also measures the NLL of the GT scene under 6 rollouts.
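The point-to-distribution idea can be sketched as follows: fit a Gaussian KDE to the model's samples at one timestep and score the logged value under it. This is a 1-D illustration with a fixed, hand-picked bandwidth (an assumption; the cited works may choose kernels and bandwidths differently), and `kde_nll` is a hypothetical helper name.

```python
import math

def kde_nll(samples, observed, bandwidth=0.5):
    """NLL of `observed` under a Gaussian KDE fit to 1-D `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    density = norm * sum(
        math.exp(-0.5 * ((observed - s) / bandwidth) ** 2) for s in samples
    )
    return -math.log(density)

# Hypothetical sampled positions at one timestep, and two candidate logged values.
samples = [4.8, 5.0, 5.1, 5.3]
nll_near = kde_nll(samples, 5.0)  # logged value inside the sampled cluster
nll_far = kde_nll(samples, 7.0)   # logged value far from every sample
```

A logged value close to the samples receives a lower NLL than one far from all of them, so the metric rewards models whose samples cover what actually happened.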
3 Traffic Simulation as Conditional Generative Modeling
Our goal is to encourage the design of traffic simulators by defining a data-driven evaluation
framework and instantiating it with publicly accessible data. We focus on simulating agent behavior
in a setting in which an offboard perception system is treated as fixed and given.
Problem formulation. We formulate driving as a Hidden Markov Model
$H = (S, O, p(o_t \mid s_t), p(s_t \mid s_{t-1}))$, where $S$ denotes the set of unobservable true
world states, $O$ denotes the set of observations, $p(o_t \mid s_t)$ denotes the sampleable emission
distribution, and $p(s_t \mid s_{t-1})$ denotes the hidden Markovian state dynamics: the probability of
the hidden state transitioning from $s_{t-1}$ at timestep $t-1$ to $s_t$ at time $t$. Each $o_t \in O$
can be partitioned into AV- and environment-centric components that vary in time:
$o_t = [o^{\text{AV}}_t, o^{\text{env}}_t]$. $o^{\text{env}}_t$ can in general contain a rich set of
features, but for the purpose of our challenge, it contains solely the poses of the non-AV agents. We
denote the true observation dynamics as
$p^{\text{world}}(o_t \mid s_{t-1}) \doteq \mathbb{E}_{p(s_t \mid s_{t-1})}\, p(o_t \mid s_t)$.
The task. The task is to build a “world model” $q^{\text{world}}(o_t \mid o^c_{<t})$ of
$p^{\text{world}}(o_t \mid s_{t-1})$, where
$o^c_{<t} \doteq [o^{\text{map}}, o^{\text{signals}}, o_{-H-1}, \ldots, o_{t-1}]$, i.e., it denotes a
context of a static map observation, traffic signal observations, and the observation history, with
history length $H$.
Task constraints:
1. $q^{\text{world}}$ must be autoregressive for $T$ steps, i.e., sim agent models must adhere to a 10 Hz
resampling procedure, re-observing the updated scene and consuming their previous outputs.
2. $q^{\text{world}}$ must factorize according to Eq. 1:
$$q^{\text{world}}(o_t \mid o^c_{<t}) = \pi(o^{\text{AV}}_t \mid o^c_{<t})\, q(o^{\text{env}}_t \mid o^c_{<t}), \tag{1}$$
where $q(o^{\text{env}}_t \mid o^c_{<t})$ is a traffic simulator, and $\pi(o^{\text{AV}}_t \mid o^c_{<t})$
is an AV policy¹. Any submission that does not satisfy both of these properties will not be considered
on WOSAC leaderboards, as determined by challenge submission reports. Requiring them to be
generative enables sampling from an arbitrary traffic simulator-AV policy pair. These two properties
imply the probabilistic graphical model shown in Fig. 2, modified from Fig. 9 of [48]. Algorithms 1
and 2 illustrate valid and invalid submissions.
¹ We call this a policy because it is similar to the typical formulation of a policy in a decision process over
actions, although not equivalent, because it is defined over next observations rather than current actions. It can
be made equivalent to a standard policy $\pi(a^{\text{AV}}_{t-1} \mid o^c_{<t})$ by defining the AV’s action
space $A$ to be equivalent to its component of the observation space, and defining an action-dependent world
model $q^{\text{world}}(o_t \mid o^c_{<t}, a^{\text{AV}}_{t-1}) \doteq \delta(o^{\text{AV}}_t = a^{\text{AV}}_{t-1})\, q(o^{\text{env}}_t \mid o^c_{<t})$,
where $\delta$ denotes the Dirac delta function.
[Figure 2: Bayes net over the nodes $o^{\text{AV}}_1, o^{\text{AV}}_2, \cdots, o^{\text{AV}}_{T-1}, o^{\text{AV}}_T$; $o^{\text{map}}$; and $o^{\text{env}}_1, o^{\text{env}}_2, \cdots, o^{\text{env}}_{T-1}, o^{\text{env}}_T$.]
Figure 2: Graphical model of required factorization as a Bayes net: the two distributions from Eq. 1
are autoregressively interleaved: one represents the AV’s “policy” $\pi(o^{\text{AV}}_t \mid o^c_{<t})$, and
another represents the environmental dynamics $q(o^{\text{env}}_t \mid o^c_{<t})$; the graphical model
represents $T$ steps of applying these two distributions. Thick outgoing arrows denote passing inputs
from all parent nodes to all children.
Much of the challenge of modeling $p^{\text{world}}$ lies in the fact that in many situations
$s_{t-1} \in S$, $p^{\text{world}}$ assigns density to multiple outcomes due to uncertainty from agents in
the scene, which means that both $\pi(o^{\text{AV}}_t \mid o^c_{<t})$ and $q(o^{\text{env}}_t \mid o^c_{<t})$
often must contain multiple modes in order to perform well. We evaluate distribution-matching of
$p^{\text{world}}$ relative to a dataset of logged outcomes. The required factorization into an AV
observation-space policy and environment observation dynamics,
$q^{\text{world}}(o_t \mid o^c_{<t}) = \pi(o^{\text{AV}}_t \mid o^c_{<t})\, q(o^{\text{env}}_t \mid o^c_{<t})$,
is fairly flexible. We are agnostic to their particular structures. One noteworthy choice for the
environment observation dynamics is a “multi-agent” factorization, in which
$q(o^{\text{env}}_t \mid o^c_{<t}) = \prod_{a=1}^{A} \pi_a(o^{\text{env},a}_t \mid o^c_{<t})$, i.e., the
environment observation dynamics factorizes into a sequence of $A$ observation-space policies, and the
environment observation itself is partitioned into $A$ different components, one for each agent:
$o^{\text{env}}_t = [o^{\text{env},1}_t, \ldots, o^{\text{env},A}_t]$.
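Algorithm 1 Valid: Factorized, Closed-Loop, Agent-Centric Simulation
Input: Map $o^{\text{map}}$ and traffic signals $o^{\text{signals}}$. Initial actor states
$o_{-H-1:0} = \{o_{-H-1}, \cdots, o_0\}$, where each
$o^{\text{env}}_t = \{o^{\text{env},1}_t, \ldots, o^{\text{env},A}_t\}$ for the $A$ actors in the scene.
Output: Simulated observations $o_{1:T} = \{o_1, o_2, \cdots, o_T\}$ for $T$ simulation timesteps.
1: for $t = 1, \ldots, T$ do    ▷ Simulate for requested number of timesteps
2:     $o^{\text{AV}}_t \leftarrow \pi_{\text{AV}}(o_{<t}; o^{\text{map}}, o^{\text{signals}})$
3:     for $a = 1, \ldots, A$ do    ▷ Produce next state for each actor at each timestep
4:         $o^{\text{env},a}_t \leftarrow \pi_a(o_{<t}; o^{\text{map}}, o^{\text{signals}})$
5:     $o_t = \{o^{\text{env},a}_t : \forall a \in 1 \ldots A\} \cup \{o^{\text{AV}}_t\}$
6: return $o_{1:T} = \{o_1, o_2, \cdots, o_T\}$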
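Algorithm 2 Invalid: Factorized, Open-Loop, Agent-Centric Simulation
Input: Map $o^{\text{map}}$ and traffic signals $o^{\text{signals}}$. Initial actor states
$o_{-H-1:0} = \{o_{-H-1}, \cdots, o_0\}$, where each
$o^{\text{env}}_t = \{o^{\text{env},1}_t, \ldots, o^{\text{env},A}_t\}$ for the $A$ actors in the scene.
Output: Simulated observations $o_{1:T} = \{o_1, o_2, \cdots, o_T\}$ for $T$ simulation timesteps.
1: $o^{\text{AV}}_{1:T} \leftarrow \pi_{\text{AV}}(o_{<1}; o^{\text{map}}, o^{\text{signals}})$
2: for $a = 1, \ldots, A$ do    ▷ Produce states at all future timesteps for each actor
3:     $o^{\text{env},a}_{1:T} \leftarrow \pi_a(o_{<1}; o^{\text{map}}, o^{\text{signals}})$
4: return $o_{1:T} = \{o^{\text{env},a}_{1:T} : \forall a \in 1 \ldots A\} \cup \{o^{\text{AV}}_{1:T}\}$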
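The difference between the two algorithms can be illustrated with a toy, runnable sketch. Everything here is an assumption made for illustration: states are 1-D positions, the scene is a list of `{"av": ..., "env": [...]}` dictionaries, and `stopped_av`, `follower`, and their `_plan` counterparts are hypothetical policies. In the closed-loop rollout every policy re-observes the scene at each step; in the open-loop rollout each policy commits to all future steps from the history alone.

```python
def closed_loop_rollout(history, av_policy, agent_policies, num_steps):
    """Algorithm 1 style: each policy sees all observations before step t."""
    scene = list(history)
    for _ in range(num_steps):
        av_next = av_policy(scene)
        env_next = [policy(scene) for policy in agent_policies]
        scene.append({"av": av_next, "env": env_next})
    return scene[len(history):]

def open_loop_rollout(history, av_plan, agent_plans, num_steps):
    """Algorithm 2 style: each policy plans all steps from the history only."""
    av_traj = av_plan(history, num_steps)
    env_trajs = [plan(history, num_steps) for plan in agent_plans]
    return [{"av": av_traj[t], "env": [traj[t] for traj in env_trajs]}
            for t in range(num_steps)]

def stopped_av(scene):
    return scene[-1]["av"]  # the AV holds its last position

def follower(scene):
    own, lead = scene[-1]["env"][0], scene[-1]["av"]
    return min(own + 2.0, lead - 2.0)  # advance, but keep a 2 m gap to the AV

def stopped_av_plan(history, n):
    return [history[-1]["av"]] * n

def follower_plan(history, n):
    start = history[-1]["env"][0]
    return [start + 2.0 * (k + 1) for k in range(n)]  # ignores the AV entirely

history = [{"av": 10.0, "env": [0.0]}]
closed = closed_loop_rollout(history, stopped_av, [follower], 6)
opened = open_loop_rollout(history, stopped_av_plan, [follower_plan], 6)
```

In this toy scene the closed-loop follower stops at position 8.0 behind the stationary AV, while the open-loop follower continues to 12.0 and passes through the AV at 10.0, which is the kind of unreactive behavior the closed-loop requirement is meant to rule out.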
4 Benchmark Overview
4.1 Dataset
For WOSAC, we use the test data from the v1.2.0 release of the Waymo Open Motion Dataset
(WOMD) [22]. We treat WOMD as a set $D$ of scenarios where each scenario is a history-future
pair $(o_{-H-1:0}, o_{\geq 1})$. This dataset offers a large quantity of high-fidelity object behaviors
and shapes produced by a state-of-the-art offboard perception system. We use WOMD’s 9.1 second
10 Hz sequences (comprising $H = 11$ observations from 1.1 seconds of history and 80 observations
from 8 seconds of future data), which contain object tracks at 10 Hz and map data for the area covered
by the sequence. Across the dataset splits, there are 486,995 scenarios in train, 44,097 in validation,
and 44,920 in test. These 9.1 second windows have been sampled with varying overlap from 103,354
mined segments of 20 second duration. Up to 128 agents (one of which must represent the AV) must
be simulated in each scenario for the 8 second future (comprising 80 steps of simulation).
Agent Definition. We require simulation of all agents that have valid measurements at time $t = 0$,
i.e., the last step of logged initial conditions before simulation begins. Because the test split data
is sequestered, users will not have access to objects that appear after the time of handover, and so
could not be expected to simulate them. We require simulation of all three WOMD object
types (vehicles, cyclists, and pedestrians). Objects’ dimensions stay fixed as per the last step of
history (while they do change in the original data).
Submission. We do not enforce any motion model (also because we have multiple agent types), which
means users need to directly report x/y/z centroid coordinates and heading of the objects’ boxes
(which could be generated directly or through an appropriate motion model). See the Appendix for
additional information on the submission format.
By allowing users to produce the simulations themselves, we reduce the burden on the user by
avoiding the need to submit containerized software for an evaluation server.
4.2 Evaluation
Agents should generate realistic driving scenarios stochastically. We define “realistic agents” as those
that match the actual distribution of scenarios observed during real-world driving. Unfortunately, we
do not know the analytic form of the distribution, but we do have samples from it: the examples that
make up WOMD. We therefore evaluate submissions using the approximate negative log likelihood
(NLL) of real-world samples under the distribution induced by the agents.
The NLL we wish to minimize is given by:
$$\text{NLL}^* = -\frac{1}{|D|} \sum_{i=1}^{|D|} \log q^{\text{world}}(o_{\geq 1,i} \mid o_{<1,i}) \tag{2}$$
However, there are two problems with trying to minimize Equation 2 exactly in our problem setting.
First, $o_{\geq 1}$ is high-dimensional. Instead of trying to parameterize the entire ground truth
scenario and compute its NLL under a simulated distribution, we therefore parameterize scenarios with
a smaller number of component metrics (see Section 4.2.1) and aggregate them together into a composite
NLL metric (see Section 4.2.2). Second, agents may support sampling but not pointwise likelihood
estimation [41]. In fact, we only require challenge entrants to submit samples from their agents, and
therefore have no way of knowing the exact likelihood of logged scenarios under different agent
submissions. To avoid this problem, we standardize the NLL computation by fitting histograms to the
32 submitted samples of agent futures, and compute NLLs under the categorical distribution induced
by normalizing the histograms.
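The histogram-based estimate can be sketched as below for one scalar measurement (for example, the speed of one agent at one timestep). The bin range, the number of bins, and the additive smoothing constant are assumptions chosen for this illustration; the text above does not specify them, and `histogram_nll` is a hypothetical helper name.

```python
import math

def histogram_nll(samples, observed, lo, hi, num_bins, smoothing=1e-3):
    """NLL of `observed` under a normalized histogram of `samples`."""
    def bin_index(value):
        clipped = min(max(value, lo), hi)
        return min(int((clipped - lo) / (hi - lo) * num_bins), num_bins - 1)

    counts = [smoothing] * num_bins  # additive smoothing avoids log(0)
    for s in samples:
        counts[bin_index(s)] += 1.0
    prob = counts[bin_index(observed)] / sum(counts)
    return -math.log(prob)

# 32 hypothetical simulated speeds (m/s) for one agent at one timestep.
sim_speeds = [5.0 + 0.1 * k for k in range(32)]
nll_likely = histogram_nll(sim_speeds, 6.5, lo=0.0, hi=20.0, num_bins=10)
nll_unlikely = histogram_nll(sim_speeds, 15.0, lo=0.0, hi=20.0, num_bins=10)
```

A logged value that falls in a well-populated bin gets a low NLL, while one in an empty bin is penalized heavily but finitely thanks to the smoothing term.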
4.2.1 Component Metrics
Breaking $\text{NLL}^*$ into component metrics has a few benefits. It mitigates the curse of
dimensionality described in Section 4.2. It also adds more interpretability to the evaluation, allowing
researchers to trade off between different types of errors.
[Figure 3: two rows of renderings. Column titles: Simulation Input; Log Playback (Logged Oracle); Wayformer (Diverse Sample); Constant Velocity.]
Figure 3: Visualizations of simulation results on two separate WOMD scenes (top, bottom). Results
for various baseline methods are shown on WOMD’s validation split, in 2D. ‘Simulation input’
represents the context history $o_{-H-1:0}$, whereas all other columns visualize both
$(o_{-H-1:0}, o_{\geq 1})$. Two scenes are represented: one where the AV completes the execution of a
left turn (top row) and another where the AV remains stopped at a red traffic signal (bottom row). Each
rendering in the second and third columns depicts the entire duration of the scene. Trajectories of
environment sim agents are drawn in a green-blue gradient, and trajectories of the AV agent are drawn
in a red-yellow gradient (each as a sequence of circles in a temporal color gradient).
Time Series NLL: Given the time series nature of simulation data, two choices emerge for how to
treat samples over multiple timesteps for a given object for a given run segment: to treat them as
time-independent or time-dependent samples. In the latter case, users would be expected to not only
reconstruct the general behaviors present in the logged data in one rollout, but also recreate those
behaviors over the exact same time intervals. To allow more flexibility in agent behavior, we use
the former formulation when computing NLLs, defining each component metric $m$ as an average
(in log-space) over the time-axis, masked by validity $v_t$:
$m = \exp\left(-\frac{1}{\sum_t \mathbb{1}\{v_t\}} \sum_t \mathbb{1}\{v_t\}\, \text{NLL}_t\right)$.
However, we note that as a result, a logged oracle will not achieve likelihoods of 1.0, whereas in the
latter formulation a logged oracle would.
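The validity-masked aggregation over the time axis amounts to a geometric mean of the per-step likelihoods over valid steps. A minimal sketch, assuming the per-step NLLs have already been computed (the function name and inputs are hypothetical):

```python
import math

def component_metric(nll_per_step, valid_per_step):
    """exp of the negative, validity-masked mean NLL over the time axis."""
    num_valid = sum(1 for v in valid_per_step if v)
    if num_valid == 0:
        raise ValueError("no valid timesteps to average over")
    total = sum(nll for nll, v in zip(nll_per_step, valid_per_step) if v)
    return math.exp(-total / num_valid)

# Per-step likelihoods 0.5 and 0.125 are valid; the third step is masked out.
nlls = [math.log(2.0), math.log(8.0), 100.0]
valid = [True, True, False]
m = component_metric(nlls, valid)  # geometric mean of 0.5 and 0.125
```

Masked steps do not affect the result, so an object that is only observed for part of the logged future is scored only on the steps where a logged measurement exists.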
Definitions. We compute NLLs over 9 measurements: kinematic metrics (linear speed, linear
acceleration, angular speed, angular acceleration magnitude), object interaction metrics (distance to
nearest object, collisions, time-to-collision), and map-based metrics (distance to road edge, and road
departures). Please refer to Section A.6 of the Appendix for a complete description and additional
implementation details.
4.2.2 Composite Metric
After obtaining component metrics for each measurement, we aggregate them into a single composite
metric $M^K$ for evaluating submissions:
$$M^K = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} w_j\, m^K_{i,j}, \qquad \sum_{j=1}^{M} w_j = 1 \tag{3}$$
where $N$ is the number of scenarios and $M = 9$ is the number of component metrics. The component
metrics $m$ and composite metric $M^K$ are also parameterized by a number of samples $K = 32$. The
value $m_{i,j}$ represents the likelihood for the $j$th metric on the $i$th example. The composite
metric is simply a convex combination (i.e., weighted average) over the component metrics, where the
weight $w_j$ for the $j$th metric is set manually. In the interest of promoting safety, the weights for
the collision and road departure NLLs are set to be 2× larger than the weights for the other component
metrics.
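The weighting scheme can be sketched as follows. This illustration follows the textual description of a convex combination: raw weights of 2 for the collision and road departure metrics and 1 for the rest are normalized to sum to one, each scenario is scored, and scores are averaged over scenarios. The metric key names are hypothetical, and the exact normalization constant used by the official evaluation is not reproduced here.

```python
METRIC_NAMES = [
    "linear_speed", "linear_accel", "angular_speed", "angular_accel",
    "dist_to_object", "collision", "ttc", "dist_to_road_edge", "offroad",
]
SAFETY_METRICS = {"collision", "offroad"}  # weighted 2x, per the text above

def metric_weights():
    """Normalized weights: 2/11 for safety metrics, 1/11 for the others."""
    raw = {n: (2.0 if n in SAFETY_METRICS else 1.0) for n in METRIC_NAMES}
    total = sum(raw.values())
    return {n: w / total for n, w in raw.items()}

def composite_metric(per_scenario_likelihoods):
    """Average over scenarios of the weighted mean of component likelihoods."""
    weights = metric_weights()
    scores = [sum(weights[n] * scenario[n] for n in METRIC_NAMES)
              for scenario in per_scenario_likelihoods]
    return sum(scores) / len(scores)

perfect = {n: 1.0 for n in METRIC_NAMES}
colliding = dict(perfect, collision=0.0)  # everything perfect except collisions
score_perfect = composite_metric([perfect])
score_colliding = composite_metric([colliding])
```

Under these assumed weights, zeroing out the collision likelihood alone removes 2/11 of the score, twice the penalty of zeroing out any single kinematic metric.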
5 Experimental Results
In Figure 4 and Table 3, we present quantitative results for a handful of methods. We describe each
method in more detail in the sections below.
[Figure 4: bar chart. Vertical axis: WOSAC Composite Metric (0.0 to 0.7). Bars: Random (Gaussian) 0.155; Constant Velocity 0.287; Constant Velocity (+Gaussian Noise) 0.324; SBTA-ADIA 0.420; CAD 0.531; Joint Multipath++ 0.533; Wayformer (Identical Samples) 0.575; MTR+++ 0.608; MVTE 0.645; Logged Oracle 0.722.]
Figure 4: Challenge composite metric results of various baselines on the test split of WOMD.
5.1 Baselines
Random Agent: An agent that produces random trajectories $\{(x_t, y_t, \theta_t)\}_{t=1}^{T}$, for
$T = 80$, with $x, y, \theta \sim \mathcal{N}(\mu, \sigma^2)$, with $\mu = 1.0$ and $\sigma = 0.1$, in
the AV’s coordinate frame.
Constant Velocity Agent: An agent that extrapolates the trajectory using the last heading and speed
recorded in the provided context/history. If no two-step difference can be computed based on the
valid measurements (e.g., the object appeared only at the final step of context), we set a zero speed for
such agents.
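A minimal sketch of such a constant-velocity extrapolation is given below. It assumes a simplified 2-D history of `(x, y)` positions in which an invalid step is marked with `None`, estimates velocity from the last two history steps, and falls back to zero speed when that difference is unavailable. The function name and data format are assumptions, not the baseline's actual code.

```python
def constant_velocity_rollout(history, num_steps=80, dt=0.1):
    """Extrapolate (x, y) positions at constant velocity for `num_steps` steps.

    `history` is a list of (x, y) tuples, with None marking invalid steps;
    the last entry must be valid.
    """
    x0, y0 = history[-1]
    if len(history) >= 2 and history[-2] is not None:
        xp, yp = history[-2]
        vx, vy = (x0 - xp) / dt, (y0 - yp) / dt
    else:
        vx = vy = 0.0  # no two-step difference available: zero speed
    return [(x0 + vx * dt * k, y0 + vy * dt * k) for k in range(1, num_steps + 1)]

moving = constant_velocity_rollout([(0.0, 0.0), (1.0, 0.0)])  # 10 m/s along x
just_appeared = constant_velocity_rollout([None, (3.0, 4.0)])  # stays in place
```

With 80 steps at 10 Hz, the moving agent ends 80 m beyond its last observed position, regardless of whatever lies in its path.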
Wayformer (Identical Samples) Agent: An agent that produces a hybrid of open-loop and
closed-loop data using a Wayformer [40] motion prediction model, by executing model inference
autoregressively at 2 Hz. The agents execute the policy forward for 5 simulation steps, and then
replan. Results with a 10 Hz replan rate instead are also shown in Table 3, and an ablation on the
replan rate is provided in the Appendix. The maximum-likelihood trajectory for each agent is
identically repeated 32 times to produce 32 samples. Each agent is executed by the same policy in an
agent-centric frame, batched together for inference, thus complying with the required factorization.
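The relationship between replan rate and open-loop segment length can be sketched as follows: at a 10 Hz simulation rate, replanning at 2 Hz means executing 5 steps of each plan before observing the scene again. The sketch uses a 1-D state and a hypothetical `plan_fn(states, horizon)` callable; it illustrates the scheduling only and is not the Wayformer baseline itself.

```python
def hybrid_rollout(initial_state, plan_fn, num_steps=80, sim_hz=10, replan_hz=2):
    """Roll out `num_steps` states, replanning every sim_hz / replan_hz steps."""
    steps_per_plan = max(1, round(sim_hz / replan_hz))
    states = [initial_state]
    num_plans = 0
    while len(states) - 1 < num_steps:
        remaining = num_steps - (len(states) - 1)
        plan = plan_fn(states, steps_per_plan)  # executed open-loop until the next replan
        states.extend(plan[:remaining])
        num_plans += 1
    return states[1:], num_plans

def drift_right(states, horizon):
    """Toy planner (hypothetical): move +1.0 per step from the latest state."""
    return [states[-1] + 1.0 * (k + 1) for k in range(horizon)]

rollout, plans_2hz = hybrid_rollout(0.0, drift_right, replan_hz=2)
_, plans_10hz = hybrid_rollout(0.0, drift_right, replan_hz=10)
_, plans_half_hz = hybrid_rollout(0.0, drift_right, replan_hz=0.5)
```

Over an 80-step rollout this gives 80 planner calls at 10 Hz, 16 at 2 Hz, and 4 at 0.5 Hz, so lower replan rates mean longer stretches in which an agent cannot react to the others.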
Wayformer (Diverse Samples) Agent: An agent that also utilizes Wayformer [40]-generated
trajectories, but samples diverse agent plans from $K$ possible trajectories according to their
likelihood, instead of selecting the maximum-likelihood choice.
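The difference between the two Wayformer baselines comes down to how one trajectory is chosen from the model's candidate modes for each of the 32 rollouts. A small sketch, assuming the model exposes a list of candidate trajectories with associated likelihoods (the mode labels, likelihood values, and function names below are hypothetical):

```python
import random

def identical_samples(trajectories, likelihoods, num_samples=32):
    """Repeat the maximum-likelihood trajectory `num_samples` times."""
    best = max(range(len(trajectories)), key=lambda i: likelihoods[i])
    return [trajectories[best]] * num_samples

def diverse_samples(trajectories, likelihoods, num_samples=32, seed=0):
    """Draw trajectories in proportion to their likelihoods."""
    rng = random.Random(seed)
    return rng.choices(trajectories, weights=likelihoods, k=num_samples)

modes = ["go_straight", "turn_left", "yield"]  # stand-ins for full trajectories
probs = [0.2, 0.5, 0.3]
same = identical_samples(modes, probs)
varied = diverse_samples(modes, probs)
```

The identical variant always returns the single most likely mode, whereas the diverse variant can place some of its 32 rollouts on less likely modes, which matters whenever the logged future is not the most likely one.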
Logged Oracle: An agent that directly copies trajectories from the WOMD test split, with 32 repetitions.
5.2 External Submissions
MultiVerse Transformer for Agent simulation (MVTA) [71]: A method inspired by MTR [55, 56]
that is trained and executed in closed-loop. MVTA uses a ‘receding horizon’ policy with a GMM
head, and consumes vector inputs.
MVTE: An enhanced version of MVTA [71] that samples an MVTA model from a pool of model
variants to increase simulation diversity across rollouts.
MTR+++ [44]: A hybrid open-loop/closed-loop method with a 0.5 Hz replanning rate that is inspired
by MTR [55, 56] and searches for the densest subgraph in a graph of non-colliding future trajectories.
For a description of other evaluated external submissions, please refer to Section A.4 of the Appendix.
5.3 Learnings from the 2023 Challenge
During the course of our 2023 WOSAC Challenge (March 16, 2023 to May 23, 2023), associated
with the CVPR 2023 Workshop on Autonomous Driving, we received 24 test set submissions, and
16 validation set submissions, from 10 teams. We continue to receive submission queries to our
evaluation server for our standing leaderboard as new teams submit new methods.
Trends. We observed several trends among submissions. First, the challenge champion, MVTA/MVTE
[71], was the only method to utilize and benefit from closed-loop training. Other methods that were
trained in open-loop, such as MTR+++ [44] or our Wayformer-derived [40] baseline, found it
necessary to operate at slower replan rates to obtain high composite metric results (see Table 3).
Second, almost all submissions used Transformer-based methods [67], except for Joint MultiPath++,
which used LSTM and MCG blocks [66]. Third, all methods built primarily on top of existing motion
prediction works, rather than upon existing motion planning works or sim agent methods from the
literature. Only one method, MVTA/MVTE [71], incorporated aspects of an existing sim agent
work, TrafficSim [60], as well as motion planning techniques, implementing a receding horizon
planning policy. Thus, fourth, we observed the benefit of incorporating planning-based methods
into a motion prediction framework. Fifth, most methods (excluding Joint MultiPath++ [70]) built
upon the 2022 CVPR Waymo Open Motion Prediction challenge champion, MTR [55, 56], likely
due to the open-source availability of its codebase and SOTA performance. Finally, all submissions
operated in an agent-centric coordinate frame, rather than jointly sampling from a scene representation
simultaneously.
Table 3: Per-component metric results on the test split of WOMD, representing likelihoods. Methods
are ranked by composite metric on the V1 Leaderboard, rather than the previous V0 Leaderboard.
Numbers within 1% of the best are in bold (excluding ‘logged oracle’). * indicates a method that was
received after May 23, 2023, which marked the close of the CVPR 2023 competition.
AGENT POLICY | LINEAR SPEED (↑) | LINEAR ACCEL. (↑) | ANG. SPEED (↑) | ANG. ACCEL. (↑) | DIST. TO OBJ. (↑) | COLLISION (↑) | TTC (↑) | DIST. TO ROAD EDGE (↑) | OFFROAD (↑) | COMPOSITE METRIC (↑) | ADE (↓) | MINADE (↓)
RANDOM AGENT | 0.002 | 0.044 | 0.074 | 0.120 | 0.000 | 0.000 | 0.734 | 0.178 | 0.287 | 0.155 | 50.739 | 50.706
CONSTANT VELOCITY | 0.074 | 0.058 | 0.019 | 0.035 | 0.208 | 0.345 | 0.737 | 0.454 | 0.455 | 0.287 | 7.923 | 7.923
CONSTANT VELOCITY (+ GAUSSIAN NOISE) | 0.157 | 0.119 | 0.019 | 0.035 | 0.247 | 0.411 | 0.775 | 0.502 | 0.463 | 0.324 | 7.594 | 7.237
WAYFORMER (IDENTICAL SAMPLES, 10 HZ REPLAN) [40] | 0.202 | 0.144 | 0.248 | 0.312 | 0.192 | 0.449 | 0.766 | 0.379 | 0.305 | 0.338 | 6.823 | 6.823
SBTA-ADIA [38] | 0.317 | 0.174 | 0.478 | 0.463 | 0.265 | 0.337 | 0.770 | 0.557 | 0.483 | 0.420 | 4.777 | 3.611
WAYFORMER (DIVERSE SAMPLES, 10 HZ REPLAN) [40] | 0.233 | 0.212 | 0.345 | 0.330 | 0.241 | 0.635 | 0.797 | 0.424 | 0.413 | 0.421 | 6.866 | 5.761
CAD [14] | 0.349 | 0.253 | 0.432 | 0.310 | 0.332 | 0.568 | 0.789 | 0.637 | 0.834 | 0.531 | 3.334 | 2.308
JOINT-MULTIPATH++* [70] | 0.434 | 0.230 | 0.515 | 0.452 | 0.345 | 0.567 | 0.812 | 0.639 | 0.682 | 0.533 | 5.293 | 2.049
WAYFORMER (IDENTICAL SAMPLES, 2 HZ REPLAN) [40] | 0.331 | 0.098 | 0.413 | 0.406 | 0.297 | 0.870 | 0.782 | 0.592 | 0.866 | 0.575 | 2.498 | 2.498
MTR+++ [44] | 0.414 | 0.107 | 0.484 | 0.436 | 0.347 | 0.861 | 0.797 | 0.654 | 0.895 | 0.608 | 2.125 | 1.679
MVTA [71] | 0.439 | 0.220 | 0.533 | 0.480 | 0.374 | 0.875 | 0.829 | 0.654 | 0.893 | 0.636 | 3.925 | 1.866
MVTE [71] | 0.445 | 0.222 | 0.535 | 0.481 | 0.383 | 0.893 | 0.832 | 0.664 | 0.908 | 0.645 | 3.859 | 1.674
LOGGED ORACLE | 0.561 | 0.330 | 0.563 | 0.489 | 0.485 | 1.000 | 0.881 | 0.713 | 1.000 | 0.722 | 0.000 | 0.000
Likelihood Metrics Reward Diversity. We found that our likelihood-based metrics reward models
that produce diverse futures. For example, generating 32 diverse rollouts per scene with a Wayformer
model performs 11% better on our evaluation metrics than a Wayformer model that produces 32
identical rollouts per scene (see Figure 4).
Collision Minimization as an Algorithmic Objective. Several methods designed algorithmic
components to determine futures with a minimal number of collisions, e.g., MTR+++ [44], which used
clique-finding in an undirected graph of collision-free future trajectories, and CAD [14], which used
rejection sampling on open-loop futures that created collisions. This objective aligns with human
preference, but as close calls and collisions do occur in real driving data distributions, optimizing
for this objective could be seen as trimming the tail of the distribution. Distracted drivers
do exist in everyday real-world driving, and in certain scenarios one would expect a low-quality
planner to perform poorly and produce collisions with sim agents, so such cases should be taken into
consideration for generating realistic simulations. This suggests a limitation of the WOMD [22],
which has few examples from the tail distribution of real driving, and efforts to upsample collision
data could prove useful. In addition, open-loop methods such as CAD [14] that prune collisions after
the fact could prune collisions caused by the AV rather than by the sim agents, yielding a misleadingly
optimistic view of the AV’s performance.
Composite Metric vs. ADE and minADE: We see that among submissions to the test set, rankings
by ADE and minADE and ranking by our NLL composite metric disagree. However, methods with
lower minADE do tend to achieve higher composite scores; ADE does not exhibit such a trend.
Composite Metric Results. The ordinal ranking shown in Figure 4 and Table 3 indicates that learned,
stochastic sim agents outperform not only heuristic baselines but also learned, deterministic sim
agents. We consider a composite metric score of 0.722 as a practical upper bound on submissions,
because it involves access to test data via an oracle.
Component Metric Results. In Table 3, we provide a breakdown of the composite metric into
component metric results. As expected, the ‘logged oracle’ baseline achieves the highest likelihood in
each of the 9 component metrics. The top performing method, MVTE [71], scored highest on all but
one component metric (linear acceleration likelihood), where CAD [14] outperformed MVTE by 12%
(likelihood of 0.253 vs. 0.222). Surprisingly, MVTE [71] has angular acceleration likelihoods within
a percentage point of the ‘logged oracle’ (0.481 vs. 0.489). The gap between the top performing
learned method (MVTE) and ‘logged oracle’ in both collision likelihood (0.893 vs. 1.000) and
distance-to-nearest-object likelihood (0.383 vs. 0.485) indicates significant room for improvement in
future work on interactive metrics.
5.4 Qualitative Results
In Figure 3, we provide a qualitative comparison of various baselines on two WOMD scenarios.
The results indicate that the complexity of behaviors within intersections far exceeds the capability
of simple heuristics to predict. Collisions are evident from the constant velocity baselines in both
examples. Additional qualitative examples from other sim agent methods are shown in Section A.5
of the Appendix.
6 Discussion
Limitations. For our 2023 Challenge, we manually verified the validity of each submission
according to factorization and closed-loop requirements discussed in each team’s report, and we
observed that the technical rules were subtle. Several of the submissions that used open-loop or hybrid
open-loop/closed-loop methods may have limited applicability for some simulation applications.
Even if we had instituted a benchmark based on Docker-containerized software submissions instead
of uploading output trajectory submissions, enforcing our requirements algorithmically and
automatically would still be challenging. Although many properties of function calls to Dockerized
software can be measured, e.g., latency, as long as any arbitrary state is maintained by the user, the
system could not enforce all details of the closed-loop nature of the function call. As a result,
user-submitted simulation agent software would have to adhere to strict stateless input and output data
APIs. The ability to do so would assist in removing ambiguity regarding whether methods that prune
collisions post-hoc qualify as closed-loop.
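To make the idea of a stateless interface concrete, the sketch below shows one possible shape for such an API. It is an illustration under our own assumptions, not an interface defined by WOSAC: the names `AgentState`, `StepFn`, `run_closed_loop`, and `hold_position` are hypothetical, and nothing in the sketch can prevent a submission from keeping hidden state; it only shows that the harness, rather than the agent, owns the history.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class AgentState:
    """Hypothetical minimal per-agent state; not the WOSAC proto schema."""
    x: float
    y: float
    heading: float


# A scene history is an immutable tuple of per-step {agent_id: state} maps.
History = Tuple[Dict[int, AgentState], ...]
# A stateless step function sees only the history and an explicit seed.
StepFn = Callable[[History, int], Dict[int, AgentState]]


def run_closed_loop(step_fn: StepFn, history: History, num_steps: int, seed: int) -> History:
    """Calls step_fn once per step, feeding back everything simulated so far."""
    for t in range(num_steps):
        next_states = step_fn(history, seed + t)
        history = history + (next_states,)
    return history


def hold_position(history: History, seed: int) -> Dict[int, AgentState]:
    """Toy policy: every agent stays where it was at the latest step."""
    return dict(history[-1])


initial = ({0: AgentState(0.0, 0.0, 0.0), 1: AgentState(5.0, 0.0, 3.14)},)
final = run_closed_loop(hold_position, initial, num_steps=80, seed=0)
```

Because the harness re-supplies the full history at every call, a post-hoc pruning step would have to be expressed inside a single step call, where it could only affect the next state.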
If a user provides containerized simulator submissions, one approach to encourage adherence to our
requirements and to further incentivize closed-loop behavior would be to provide and interact with
an AV policy that the user does not control. In our benchmark, the user was allowed to control the
AV, albeit through an independent policy; the ability to evaluate simulator submissions on separate,
held-out AV motion planning policies and on new scenarios would allow further valuable analysis.
Future Work Object insertion and deletion are important aspects of the simulation problem, yet
we intentionally introduced an assumption of no object insertion or deletion in order to reduce the
complexity of the first iteration of the WOSAC challenge for users. Motion planners trained or
evaluated in a simulator must have the capability to exercise caution regarding areas of occlusion
from which new objects may emerge at any timestep. In a future iteration of the challenge, we
plan to introduce realism metrics that reward properly modeled object insertion and deletion, e.g.,
distributional metrics on the number of vehicles appearing or disappearing at each frame, or the
distance of simulated objects from the autonomous vehicle. The data distribution in the WOMD
dataset already includes such object insertion and deletion.
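As an illustration of the kind of per-frame statistic such a metric could be built on, the following sketch counts objects that appear or disappear at each step from a per-object validity mask. The mask layout (`valid[i][t]`) and the function name are assumptions made for this example; they do not describe an existing WOSAC metric.

```python
from typing import List, Tuple


def insertion_deletion_counts(valid: List[List[bool]]) -> Tuple[List[int], List[int]]:
    """Counts objects appearing and disappearing at each step.

    valid[i][t] is True when object i is observed at step t (assumed layout).
    Index 0 of both outputs is always 0 because there is no earlier step.
    """
    num_steps = len(valid[0]) if valid else 0
    appeared = [0] * num_steps
    disappeared = [0] * num_steps
    for track in valid:
        for t in range(1, num_steps):
            if track[t] and not track[t - 1]:
                appeared[t] += 1
            elif track[t - 1] and not track[t]:
                disappeared[t] += 1
    return appeared, disappeared


# Object 0 leaves the scene at step 2; object 1 enters at step 1.
example_valid = [[True, True, False], [False, True, True]]
appeared, disappeared = insertion_deletion_counts(example_valid)
```

Histograms of these counts over logged and simulated scenes could then be compared in the same way as the existing component metrics.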
Furthermore, we intentionally introduced an assumption of time-invariant object dimensions in
our first iteration of WOSAC to simplify the modeling challenge for users. Time-variant object
dimensions can be considered as a type of vehicle attribute, and object dimensions do actually change
in the underlying data distribution provided in the WOMD dataset. We hope to include time-variant
object dimension prediction as an aspect of the benchmark in future iterations.
As discussed in Section 5.3, given the prevalence of collision minimization algorithmic components
among submissions, one may presume that collisions are not heavily represented in WOMD [22]
data, or that our metrics are limited in some way. Another approach would be to “fatten the tails” of the
evaluation data distribution by generating synthetic, challenging initial conditions [3, 21, 23, 47, 61,
69], or by mining more close calls and collisions from real driving data.
Conclusion. In this work, we have introduced a new challenge for evaluation of simulation agents,
explaining the rationale for the different criteria we require. We invite the research community to
continue to participate.
Acknowledgments and Disclosure of Funding
No third-party funding was received in direct support of this work. We thank Ben Sapp for his helpful
feedback in preparing the challenge. We would like to thank Mustafa Mustafa, Kratarth Goel, and Rami
Al-Rfou for offering consultation, models, and infrastructure that accelerated our work. We thank
Alexander Gorban for his assistance in developing and maintaining the evaluation server. All the
authors are employees of Waymo LLC.
References
[1] GulAghaandKarlPalmskog. Asurveyofstatisticalmodelchecking. ACMTransactionsonModeling
andComputerSimulation(TOMACS),28(1):1–39,2018.
[2] MayankBansal,AlexKrizhevsky,andAbhijitS.Ogale. ChauffeurNet:Learningtodrivebyimitatingthe
bestandsynthesizingtheworst. InRobotics:ScienceandSystemsXV,2019.
[3] LucaBergamini,YaweiYe,OliverScheel,LongChen,ChihHu,LucaDelPero,Błaz˙ejOsin´ski,Hugo
Grimmet, and Peter Ondruska. SimNet: Learning reactive self-driving simulations from real-world
observations. InICRA,2021.
[4] JulianBernhard,KlemensEsterle,PatrickHart,andTobiasKessler. BARK:Openbehaviorbenchmarking
inmulti-agentenvironments. InIROS,2020.
[5] RaunakPBhattacharyya, DerekJPhillips, BlakeWulfe, JeremyMorton, AlexKuefler, andMykelJ
Kochenderfer. Multi-agentimitationlearningfordrivingsimulation. InIROS,2018.
[6] HolgerCaesar,VarunBankiti,AlexHLang,SourabhVora,VeniceErinLiong,QiangXu,AnushKrishnan,
YuPan,GiancarloBaldan,andOscarBeijbom. nuScenes:Amultimodaldatasetforautonomousdriving.
InCVPR,pages11621–11631,2020.
[7] HolgerCaesar,JurajKabzan,KokSeangTan,WhyeKitFong,EricWolff,AlexLang,LukeFletcher,
OscarBeijbom,andSammyOmari. Nuplan:Aclosed-loopml-basedplanningbenchmarkforautonomous
vehicles. InCVPRADP3workshop,2021.
[8] SergioCasas,ColeGulino,RenjieLiao,andRaquelUrtasun. SpAGNN:Spatially-awaregraphneural
networksforrelationalbehaviorforecastingfromsensordata. InICRA,pages9491–9497,2020.
[9] SergioCasas, ColeGulino, SimonSuo, KatieLuo, RenjieLiao, andRaquelUrtasun. Implicitlatent
variablemodelforscene-consistentmotionforecasting. InECCV,2020.
[10] Ming-FangChang, JohnLambert, PatsornSangkloy, JagjeetSingh, SlawomirBak, AndrewHartnett,
De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and
forecastingwithrichmaps. InCVPR,June2019.
[11] DianChen,BradyZhou,VladlenKoltun,andPhilippKrähenbühl. Learningbycheating. InConference
onRobotLearning,pages66–75.PMLR,2020.
[12] DianChen,VladlenKoltun,andPhilippKrähenbühl. Learningtodrivefromaworldonrails. InICCV,
pages15590–15599,2021.
[13] YunChen,FriedaRong,ShivamDuggal,ShenlongWang,XinchenYan,SivabalanManivasagam,Shangjie
Xue,ErsinYumer,andRaquelUrtasun. Geosim:Realisticvideosimulationviageometry-awarecomposi-
tionforself-driving. InCVPR,pages7230–7240,June2021.
[14] Hsu-kuangChiuandStephenF.Smith. Collisionavoidancedetour: Asolutionfor2023waymoopen
datasetchallenge-simagents. Technicalreport,CarnegieMellonUniversity,2023.
[15] FelipeCodevilla,MatthiasMüller,AntonioLópez,VladlenKoltun,andAlexeyDosovitskiy. End-to-end
drivingviaconditionalimitationlearning. InICRA,pages4693–4700.IEEE,2018.
[16] AnthonyCorso,RobertMoss,MarkKoren,RitchieLee,andMykelKochenderfer. Asurveyofalgorithms
forblack-boxsafetyvalidationofcyber-physicalsystems. JournalofArtificialIntelligenceResearch,72:
377–428,2021.
[17] NachiketDeoandMohanM.Trivedi. Convolutionalsocialpoolingforvehicletrajectoryprediction. In
CVPRWorkshops.ComputerVisionFoundation/IEEEComputerSociety,2018.
[18] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS,
2021.
[19] AlexeyDosovitskiy,GermanRos,FelipeCodevilla,AntonioLopez,andVladlenKoltun. CARLA:An
openurbandrivingsimulator. InCoRL,2017.
[20] PatrickEsser,RobinRombach,andBjornOmmer.Tamingtransformersforhigh-resolutionimagesynthesis.
InCVPR,2021.
[21] Ethan Pronovost, Kai Wang, and Nick Roy. Generating driving scenes with diffusion. In ICRA Workshop on
Scalable Autonomous Driving, June 2023.
[22] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning
Chai,BenSapp,CharlesR.Qi,YinZhou,ZoeyYang,AurélienChouard,PeiSun,JiquanNgiam,Vijay
Vasudevan,AlexanderMcCauley,JonathonShlens,andDragomirAnguelov.Largescaleinteractivemotion
forecastingforautonomousdriving:Thewaymoopenmotiondataset. InICCV,2021.
[23] LanFeng,QuanyiLi,ZhenghaoPeng,ShuhanTan,andBoleiZhou. Trafficgen: Learningtogenerate
diverseandrealistictrafficscenarios. InICRA,2023.
[24] AndreasGeiger,PhilipLenz,andRaquelUrtasun. Arewereadyforautonomousdriving?thekittivision
benchmarksuite. InCVPR,2012.
[25] E.G.Gilbert, D.W.Johnson, andS.S.Keerthi. Afastprocedureforcomputingthedistancebetween
complexobjectsinthree-dimensionalspace. IEEEJournalonRoboticsandAutomation,4(2):193–203,
1988.
[26] MartinHeusel,HubertRamsauer,ThomasUnterthiner,BernhardNessler,andSeppHochreiter. Gans
trainedbyatwotime-scaleupdateruleconvergetoalocalnashequilibrium. InProceedingsofthe31st
InternationalConferenceonNeuralInformationProcessingSystems,2017.
[27] JonathanHo,AjayJain,andPieterAbbeel. Denoisingdiffusionprobabilisticmodels. InNeurIPS,2020.
[28] Maximilian Igl, Daewoo Kim, Alex Kuefler, Paul Mougin, Punit Shah, Kyriacos Shiarlis, Dragomir
Anguelov,MarkPalatucci,BrandynWhite,andShimonWhiteson. Symphony: Learningrealisticand
diverseagentsforautonomousdrivingsimulation. InICRA,2022.
[29] BorisIvanovicandMarcoPavone. Thetrajectron: Probabilisticmulti-agenttrajectorymodelingwith
dynamicspatiotemporalgraphs. InICCV,October2019.
[30] TeroKarras,SamuliLaine,andTimoAila. Astyle-basedgeneratorarchitectureforgenerativeadversarial
networks. InCVPR,2019.
[31] TeroKarras,SamuliLaine,MiikaAittala,JanneHellsten,JaakkoLehtinen,andTimoAila. Analyzingand
improvingtheimagequalityofstylegan. InCVPR,2020.
[32] ParthKothari, ChristianPerone, LucaBergamini, AlexandreAlahi, andPeterOndruska. Drivergym:
Democratisingreinforcementlearningforautonomousdriving. arXivpreprintarXiv:2111.06889,2021.
[33] Daniel Krajzewicz, Georg Hertkorn, Christian Rössel, and Peter Wagner. SUMO (Simulation of Urban
MObility) - an open-source traffic simulation. In Proceedings of the 4th Middle East Symposium on Simulation
and Modelling (MESM2002), pages 183–187, 2002.
[34] Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. Metadrive:
Composingdiversedrivingscenariosforgeneralizablereinforcementlearning. TPAMI,2022.
[35] YirenLu,JustinFu,GeorgeTucker,XinleiPan,EliBronstein,BeccaRoelofs,BenjaminSapp,Brandyn
White,AleksandraFaust,ShimonWhiteson,etal. Imitationisnotenough:Robustifyingimitationwith
reinforcementlearningforchallengingdrivingscenarios. InIROS,2023.
[36] WenjieLuo,CheolPark,AndreCornman,BenjaminSapp,andDragomirAnguelov. Jfp: Jointfuture
predictionwithinteractivemulti-agentmodelingforautonomousdriving. InCoRL,2023.
[37] SivabalanManivasagam,ShenlongWang,KelvinWong,WenyuanZeng,MikitaSazanovich,ShuhanTan,
BinYang,Wei-ChiuMa,andRaquelUrtasun. Lidarsim:Realisticlidarsimulationbyleveragingthereal
world. InCVPR,June2020.
[38] Xiaoyu Mo, Haochen Liu, Zhiyu Huang, and Chen Lv. Simulating behaviors of traffic agents for
autonomous driving via interactive autoregression. Technical report, Nanyang Technological University,
2023.
[39] Xiaoyu Mo, Yang Xing, Haochen Liu, and Chen Lv. Map-adaptive multimodal trajectory prediction using
hierarchical graph neural networks. IEEE Robotics and Automation Letters, 8(6):3685–3692, 2023.
[40] NigamaaNayakanti,RamiAl-Rfou,AurickZhou,KratarthGoel,KhaledSRefaat,andBenjaminSapp.
Wayformer:Motionforecastingviasimple&efficientattentionnetworks. InICRA,2023.
[41] SebastianNowozin,BotondCseke,andRyotaTomioka. f-gan:Traininggenerativeneuralsamplersusing
variationaldivergenceminimization. Advancesinneuralinformationprocessingsystems,29,2016.
[42] EmanuelParzen. Onestimationofaprobabilitydensityfunctionandmode. Theannalsofmathematical
statistics,33(3):1065–1076,1962.
[43] DeanA.Pomerleau. Alvinn: Anautonomouslandvehicleinaneuralnetwork. InAdvancesinNeural
InformationProcessingSystems,1988.
[44] ChengQian,DiXiu,andMinghaoTian. Asimpleyeteffectivemethodforsimulatingrealisticmulti-agent
behaviors. Technicalreport,2023.
[45] AdityaRamesh,MikhailPavlov,GabrielGoh,ScottGray,ChelseaVoss,AlecRadford,MarkChen,and
IlyaSutskever. Zero-shottext-to-imagegeneration. InICML,2021.
[46] AdityaRamesh,PrafullaDhariwal,AlexNichol,CaseyChu,andMarkChen.Hierarchicaltext-conditional
imagegenerationwithcliplatents. arXivpreprintarXiv:2204.06125,1(2):3,2022.
[47] DavisRempe,JonahPhilion,LeonidasJGuibas,SanjaFidler,andOrLitany. Generatingusefulaccident-
pronedrivingscenariosviaalearnedtrafficprior. InCVPR,June2022.
[48] NicholasRhinehart,RowanMcAllister,KrisKitani,andSergeyLevine. Precog:Predictionconditionedon
goalsinvisualmulti-agentsettings. InICCV,October2019.
[49] RobinRombach,AndreasBlattmann,DominikLorenz,PatrickEsser,andBjörnOmmer. High-resolution
imagesynthesiswithlatentdiffusionmodels. InCVPR,2022.
[50] MurrayRosenblatt. Remarksonsomenonparametricestimatesofadensityfunction. Theannalsof
mathematicalstatistics,pages832–837,1956.
[51] StephaneRoss,GeoffreyGordon,andDrewBagnell. Areductionofimitationlearningandstructured
predictiontono-regretonlinelearning. InProceedingsoftheFourteenthInternationalConferenceon
ArtificialIntelligenceandStatistics,2011.
[52] ChitwanSaharia,WilliamChan,SaurabhSaxena,LalaLi,JayWhang,EmilyDenton,SeyedKamyarSeyed
Ghasemipour,RaphaelGontijo-Lopes,BurcuKaragolAyan,TimSalimans,JonathanHo,DavidJ.Fleet,
andMohammadNorouzi. Photorealistictext-to-imagediffusionmodelswithdeeplanguageunderstanding.
InNeurIPS,2022.
[53] TimSalimans,IanGoodfellow,WojciechZaremba,VickiCheung,AlecRadford,andXiChen. Improved
techniquesfortraininggans. InProceedingsofthe30thInternationalConferenceonNeuralInformation
ProcessingSystems,2016.
[54] TimSalzmann,BorisIvanovic,PunarjayChakravarty,andMarcoPavone. Trajectron++:Dynamically-
feasibletrajectoryforecastingwithheterogeneousdata. InECCV,2020.
[55] ShaoshuaiShi,LiJiang,DengxinDai,andBerntSchiele. Mtr-a:1stplacesolutionfor2022waymoopen
datasetchallenge–motionprediction,2022.
[56] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention
localizationandlocalmovementrefinement. InAdvancesinNeuralInformationProcessingSystems,2022.
[57] DiJiaSu,BertrandDouillard,RamiAl-Rfou,CheolhoPark,andBenjaminSapp.Narrowingthecoordinate-
framegapinbehaviorpredictionmodels: Distillationforefficientandaccuratescene-centricmotion
forecasting. arXiv:2206.03970,2022.
[58] PeiSun,HenrikKretzschmar,XerxesDotiwalla,AurelienChouard,VijaysaiPatnaik,PaulTsui,James
Guo,YinZhou,YuningChai,BenjaminCaine,etal. Scalabilityinperceptionforautonomousdriving:
Waymoopendataset. InCVPR,2020.
[59] QiaoSun,XinHuang,BrianWilliams,andHangZhao. InterSim:Interactivetrafficsimulationviaexplicit
relationmodeling. InIROS,2022.
[60] Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. TrafficSim: Learning to simulate
realistic multi-agent behaviors. In CVPR, 2021.
[61] ShuhanTan,KelvinWong,ShenlongWang,SivabalanManivasagam,MengyeRen,andRaquelUrtasun.
Scenegen:Learningtogeneraterealistictrafficscenes. InCVPR,June2021.
[62] MatthewTancik,VincentCasser,XinchenYan,SabeekPradhan,BenMildenhall,PratulP.Srinivasan,
JonathanT.Barron,andHenrikKretzschmar. Block-nerf:Scalablelargesceneneuralviewsynthesis. In
CVPR,2022.
[63] CharlieTangandRussRSalakhutdinov. Multiplefuturesprediction. InNeurIPS,2019.
[64] Luca Anthony Thiede and Pratik Prabhanjan Brahma. Analyzing the variety loss in the context of
probabilistictrajectoryprediction. InICCV,October2019.
[65] MartinTreiber,AnsgarHennecke,andDirkHelbing. Congestedtrafficstatesinempiricalobservations
andmicroscopicsimulations. PhysicalreviewE,62(2):1805,2000.
[66] BalakrishnanVaradarajan, AhmedHefny, AvikalpSrivastava, KhaledS.Refaat, NigamaaNayakanti,
AndreCornman,KanChen,BertrandDouillard,ChiPangLam,DragomirAnguelov,andBenjaminSapp.
Multipath++:Efficientinformationfusionandtrajectoryaggregationforbehaviorprediction. InICRA,
2022.
[67] AshishVaswani,NoamShazeer,NikiParmar,JakobUszkoreit,LlionJones,AidanNGomez,Łukasz
Kaiser,andIlliaPolosukhin. Attentionisallyouneed. InAdvancesinNeuralInformationProcessing
Systems,2017.
[68] EugeneVinitsky, NathanLichtlé, XiaomengYang, BrandonAmos, andJakobFoerster. Nocturne: a
scalabledrivingbenchmarkforbringingmulti-agentlearningonestepclosertotherealworld. InNeurIPS
DatasetsandBenchmarksTrack,2022.
[69] JingkangWang,AvaPun,JamesTu,SivabalanManivasagam,AbbasSadat,SergioCasas,MengyeRen,
andRaquelUrtasun. Advsim: Generatingsafety-criticalscenariosforself-drivingvehicles. InCVPR,
2021.
[70] WenxiWangandHaotianZhen. Joint-multipath++forsimulationagents. Technicalreport,2023.
[71] YuWang,TiebiaoZhao,andFanYi.Multiversetransformer:1stplacesolutionforwaymoopensimagents
challenge2023. Technicalreport,Pegasus,2023.
[72] BenjaminWilson,WilliamQi,TanmayAgarwal,JohnLambert,JagjeetSingh,SiddheshKhandelwal,
BowenPan,RatneshKumar,AndrewHartnett,JhonyKaesemodelPontes,DevaRamanan,PeterCarr,
andJamesHays. Argoverse2: Nextgenerationdatasetsforself-drivingperceptionandforecasting. In
ProceedingsoftheNeuralInformationProcessingSystemsTrackonDatasetsandBenchmarks(NeurIPS
DatasetsandBenchmarks2021),2021.
[73] CathyWu,AboudyKreidieh,KanaadParvate,EugeneVinitsky,andAlexandreMBayen. Flow:Architec-
tureandbenchmarkingforreinforcementlearningintrafficcontrol. arXivpreprintarXiv:1710.05465,10,
2017.
[74] DanfeiXu,YuxiaoChen,BorisIvanovic,andMarcoPavone. Bits:Bi-levelimitationfortrafficsimulation.
InICRA,2023.
[75] XintaoYan,ZhengxiaZou,ShuoFeng,HaojieZhu,HaoweiSun,andHenryXLiu. Learningnaturalistic
drivingenvironmentwithstatisticalrealism. NatureCommunications,14(1):2037,2023.
[76] ZhenpeiYang,YuningChai,DragomirAnguelov,YinZhou,PeiSun,DumitruErhan,SeanRafferty,and
HenrikKretzschmar. Surfelgan:Synthesizingrealisticsensordataforautonomousdriving. InCVPR,June
2020.
[77] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan,
AlexanderKu,YinfeiYang,BurcuKaragolAyan,BenHutchinson,WeiHan,ZaranaParekh,XinLi,Han
Zhang,JasonBaldridge,andYonghuiWu. Scalingautoregressivemodelsforcontent-richtext-to-image
generation. TransactionsonMachineLearningResearch,2022.
[78] WeiZhan,LitingSun,DiWang,HaojieShi,AubreyClausse,MaximilianNaumann,JuliusKümmerle,
HendrikKönigshof,ChristophStiller,ArnauddeLaFortelle,andMasayoshiTomizuka. INTERACTION
Dataset:AnINTERnational,AdversarialandCooperativemoTIONDatasetinInteractiveDrivingScenarios
withSemanticMaps. arXiv:1910.03088[cs,eess],2019.
[79] Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, and Luc Van Gool. TrafficBots: Towards world
models for autonomous driving simulation and motion prediction. In ICRA, 2023.
[80] ZiyuanZhong,DavisRempe,YuxiaoChen,BorisIvanovic,YulongCao,DanfeiXu,MarcoPavone,and
BaishakhiRay. Language-guidedtrafficsimulationviascene-leveldiffusion. InCoRL,2023.
[81] ZiyuanZhong,DavisRempe,DanfeiXu,YuxiaoChen,SushantVeer,TongChe,BaishakhiRay,and
MarcoPavone. Guidedconditionaldiffusionforcontrollabletrafficsimulation. InICRA,2023.
[82] Ming Zhou, Jun Luo, Julian Villella, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery
Alban, Iman Fadakar, Zheng Chen, Chongxi Huang, Ying Wen, Kimia Hassanzadeh, Daniel Graves,
Zhengbang Zhu, Yihan Ni, Nhat Nguyen, Mohamed Elsayed, Haitham Ammar, Alexander Cowen-Rivers,
Sanjeevan Ahilan, Zheng Tian, Daniel Palenicek, Kasra Rezaee, Peyman Yadmellat, Kun Shao, Dong
Chen, Baokuan Zhang, Hongbo Zhang, Jianye Hao, Wulong Liu, and Jun Wang. SMARTS: An open-source
scalable multi-agent RL training school for autonomous driving. In CoRL, 2020.
A Appendix
In this Appendix, we provide ablations investigating the impact of the replan rate of Wayformer-derived
[40] sim agent baselines on the test split, as well as additional learnings from the 2023 CVPR
competition. We also include additional descriptions of methods from external challenge submissions,
corresponding qualitative results for such methods, and implementation details of each of the 9
component metrics we use. Finally, we describe the exact details of the dataset splits we use and give
leaderboard submission instructions.
Benchmark Versioning In December 2023, we improved the accuracy of the collision and offroad
likelihood calculation, which improved most collision likelihood scores, offroad likelihood scores, and
composite metric results. This paper describes the updated scores (the V1 version of the benchmark),
rather than the previous V0 scores presented at the Workshop on Autonomous Driving at CVPR 2023.
Both versions of the leaderboards are available online (V1 Leaderboard, V0 Leaderboard).
A.1 Replanning Rate Ablation Results
As shown in Table 3 of the main paper and as discussed in Section 5, we perform an ablation of the
impact of replanning rate on composite metric performance for open-loop trained models on the
WOMD test set. We show that a more frequent replan rate negatively impacts the Wayformer-based [40]
agent, regardless of whether multiple diverse rollouts per scene are sampled or 32 identical rollouts
per scene are produced. For the identical-sample producing agent, we see a relative performance drop
of 41.2% (composite scores of 0.575 vs. 0.338) when transitioning from replanning at 2 Hz to 10 Hz.
In Figure 5, we visualize several additional data points from a replanning interval of 100 ms to 1100
ms for the same identical-sample producing agent.
[Figure 5 plots; x-axis of every panel: Replanning Interval/Period (ms).
(a) WOSAC Composite Metric vs. replan rate (series: Wayformer, MVTE, MTR+++).
(b) Average Displacement Error (ADE) vs. replan rate (series: Wayformer, MVTE, MTR+++).
(c) Component likelihood metrics vs. replan rate (series: composite metric; linear speed, linear
acceleration, angular speed, angular acceleration, distance to nearest object, collision indication,
time to collision, distance to road edge, and offroad indication likelihoods).]
Figure 5: Results at various replanning rates on the WOSAC test set for a Wayformer baseline
(producing identical samples for the 32 rollouts). As the replanning rate increases from 1 Hz, then to
2 Hz, and then to 10 Hz, we observe a smooth degradation in performance.
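The relationship between replanning interval and rollout construction can be sketched as follows. This is a toy illustration under our own assumptions, not the Wayformer baseline: `toy_predict` is a hypothetical stand-in for an open-loop trajectory predictor, the state is a single scalar, and the simulation is assumed to run 80 steps at 10 Hz, so that replanning every 5 steps corresponds to 2 Hz. The helper `relative_drop` reproduces the 41.2% figure quoted in Section A.1 from the two composite scores.

```python
from typing import Callable, List


def rollout_with_replanning(
    predict: Callable[[float, int], List[float]],
    state: float,
    total_steps: int,
    steps_per_plan: int,
) -> List[float]:
    """Builds a rollout by executing steps_per_plan steps of each new plan."""
    trajectory: List[float] = []
    while len(trajectory) < total_steps:
        plan = predict(state, steps_per_plan)
        remaining = total_steps - len(trajectory)
        trajectory.extend(plan[:remaining])
        state = trajectory[-1]  # the next plan starts from the simulated state
    return trajectory


def relative_drop(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`."""
    return (before - after) / before


calls: List[float] = []


def toy_predict(state: float, horizon: int) -> List[float]:
    """Hypothetical predictor: moves one unit per step from the given state."""
    calls.append(state)
    return [state + 1.0 * (k + 1) for k in range(horizon)]


# 80 steps at 10 Hz; replanning every 5 steps (500 ms) corresponds to 2 Hz.
trajectory = rollout_with_replanning(toy_predict, 0.0, total_steps=80, steps_per_plan=5)
drop = relative_drop(0.575, 0.338)
```

With `steps_per_plan=1` the same loop replans at 10 Hz, which is the setting in which the open-loop trained model is most frequently conditioned on its own simulated states.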
A.2 Additional Learnings from the 2023 CVPR Competition
Generative Modeling While many families of generative models exist, most challenge participants
restricted their modeling to a narrow class of such models, namely GMMs. To our knowledge, for the
2023 CVPR challenge, no user submitted sim agent behavior generated by normalizing flow, GAN,
or variational autoencoder (VAE) models, or by denoising diffusion models, although such diffusion
techniques have recently become more popular in the simulation agent literature [80, 81]. We expect
this may change in the future as more entrants participate in the WOSAC challenge.
Further Quantitative Analysis We note that the four top-performing methods were executed fully
in closed-loop (MVTE [71], MVTA) or in a hybrid fashion (MTR+++ [44], Wayformer [40]),
outperforming the two open-loop submissions, JointMultiPath++ [70] and CAD [14], as shown in
Table 3 of the main paper. The gap between non-open-loop methods and open-loop methods is
especially clear in the collision likelihood metric.
Among all baselines we evaluated, the constant velocity baseline is weakest when it comes to
angular-based likelihoods. For example, it achieves close to zero likelihood on both angular speed (0.02)
and angular acceleration (0.04), as opposed to the MVTE method, which achieves 0.54 and 0.38
likelihoods on the same two component metrics, respectively. This result is intuitive, as our constant
velocity model does not account for any yaw rate.
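The following sketch illustrates why a constant velocity model yields a degenerate angular distribution. It is an assumed minimal form of such a baseline (position extrapolated linearly, heading held fixed), not the exact implementation evaluated in the paper; the function names and the 0.1 s step are choices made for this example.

```python
from typing import List, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading)


def constant_velocity_rollout(
    x: float, y: float, heading: float, vx: float, vy: float,
    num_steps: int, dt: float = 0.1,
) -> List[Pose]:
    """Extrapolates position linearly and keeps the heading unchanged."""
    return [(x + vx * dt * k, y + vy * dt * k, heading) for k in range(1, num_steps + 1)]


def yaw_rates(poses: List[Pose], dt: float = 0.1) -> List[float]:
    """1-step heading differences divided by the step duration."""
    return [(b[2] - a[2]) / dt for a, b in zip(poses, poses[1:])]


poses = constant_velocity_rollout(0.0, 0.0, 0.3, 2.0, 0.0, num_steps=80)
rates = yaw_rates(poses)
```

Every simulated yaw rate is exactly zero, so all simulated probability mass for angular speed and angular acceleration falls into a single histogram bin, and any logged turning behavior receives only the smoothing mass.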
A.3 Additional Comparisons with Other Benchmarks for Autonomous Driving Behavior
In Table 4, we compare our WOSAC benchmark with other benchmarks used for evaluation of
behavior models for autonomous driving.
BENCHMARK NAME                   TASK
Argoverse [10]                   Trajectory Forecasting
INTERPRET (INTERACTION) [78]     Trajectory Forecasting
Argoverse 2 [72]                 Trajectory Forecasting
nuScenes [6]                     Trajectory Forecasting
WOMD [22]                        Trajectory Forecasting
CARLA [19]                       Motion Planning
nuPlan [7]                       Motion Planning
WOSAC (Ours)                     Multi-Agent Simulation
Table 4: Existing benchmarks for evaluation of behavior models for autonomous driving.
A.4 Additional Information about Methods from External Challenge Submissions
MultiVerse Transformer for Agent simulation (MVTA) [71]: A method inspired by TrafficSim [60]
that is trained and executed in closed-loop and adheres to WOSAC’s factorization and autoregressive
requirements. It uses a ‘receding horizon’ policy (i.e., predicting 1 sec. of future motion but using
only the next 100 ms). Inspired by MTR [55, 56], MVTA places a Gaussian Mixture Model (GMM)
head on top of a transformer-based encoder and decoder (employing the same encoder/decoder layers
as implemented in MTR), consuming vector inputs. Rather than utilizing a fixed-length history
as context, MVTA uses a variable-length history to potentially use all of the past data. The input
agent encoding contains the agent history motion state (i.e., position, object size, heading angle, and
velocity) and a one-hot category mask of each agent. The prediction heads include a regression head
that outputs 5 GMM parameters $(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$, along with the velocity
$(v_x, v_y)$ and heading $(\sin(\theta), \cos(\theta))$ predictions for a timestep, and a classification
head that outputs probability $p$. Both heads take the query content features (num_query × hidden
feature dimension) as input.
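To illustrate how the five regression outputs and the classification probability define a sampling distribution over positions, the sketch below draws from a mixture of correlated bivariate Gaussians and recovers a heading from its $(\sin(\theta), \cos(\theta))$ encoding. It is a generic GMM sampler written for this example, not the MVTA implementation; all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float, float, float]  # (mu_x, mu_y, sigma_x, sigma_y, rho)


def sample_bivariate_gaussian(
    mu_x: float, mu_y: float, sigma_x: float, sigma_y: float, rho: float,
    rng: random.Random,
) -> Tuple[float, float]:
    """Draws (x, y) with the given means, standard deviations, and correlation."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = mu_x + sigma_x * z1
    y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x, y


def sample_gmm(
    components: Sequence[Component], probs: Sequence[float], rng: random.Random,
) -> Tuple[int, Tuple[float, float]]:
    """Picks a mixture component by probability, then samples a position from it."""
    k = rng.choices(range(len(components)), weights=probs, k=1)[0]
    return k, sample_bivariate_gaussian(*components[k], rng)


def heading_from_sin_cos(sin_theta: float, cos_theta: float) -> float:
    """Recovers the heading angle from its (sin, cos) encoding."""
    return math.atan2(sin_theta, cos_theta)


rng = random.Random(0)
draws: List[Tuple[float, float]] = [
    sample_bivariate_gaussian(1.0, -2.0, 2.0, 0.5, 0.8, rng) for _ in range(20000)
]
```

Predicting $(\sin(\theta), \cos(\theta))$ instead of $\theta$ directly avoids the discontinuity at the wrap-around point, and `atan2` maps the pair back to a single angle.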
MVTE: An enhanced version of MVTA [71] wherein 3 variants of MVTA are trained and randomly
selected to generate each of the 32 simulations, increasing simulation diversity across rollouts.
MTR+++ [44]: A hybrid method with a 0.5 Hz replanning rate and a 2 second prediction horizon.
We note that MTR+++ does not fully adhere to WOSAC’s closed-loop requirement, as it does not
replan at a 10 Hz rate. MTR+++ also does not adhere to the policy factorization requirements, as
world vs. AV policies are not separated. Inspired by MTR [55, 56], the method addresses two key
limitations of MTR: inaccurate heading predictions and excessive collisions incurred by marginal
predictions alone. To overcome the first issue, the authors estimate headings from x/y trajectories.
Second, in order to minimize collisions, the authors consider $K = 6$ trajectories predicted per agent
by MTR, and prune the exponential number of futures in a greedy fashion. As brute-force exhaustive
search over the $6^N$ combinations is computationally infeasible, MTR+++ searches for the densest
subgraph in a graph of non-colliding future trajectories.
First, a $6N \times 6N$ distance matrix $D$ is constructed, where entry $D_{6m+i-1,\,6n+j-1}$ indicates the
minimum L2 distance between the $i$th-highest trajectory of agent $m$ and the $j$th-highest one of agent
$n$. Second, the distance matrix is binarized by evaluating which distances correspond to collisions
according to object extents. Finally, a clique-finding heuristic method finds a dense subgraph of size
$N$. An ensemble of 32 fine-tuned MTR models is employed to create the 32 rollouts, each model
producing a single rollout.
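The three steps above can be sketched as follows. This is a simplified illustration, not the MTR+++ implementation: collisions are approximated with a single distance threshold (`collision_radius`) instead of object extents, the matrix uses zero-based indices `k * m + i`, and the selection step is a plain greedy pass in agent order that we substitute for the authors’ clique-finding heuristic. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def min_l2_distance(a: Trajectory, b: Trajectory) -> float:
    """Minimum distance between two trajectories at matching timesteps."""
    return min(math.hypot(pa[0] - pb[0], pa[1] - pb[1]) for pa, pb in zip(a, b))


def build_collision_matrix(
    candidates: Sequence[Sequence[Trajectory]], collision_radius: float,
) -> List[List[bool]]:
    """Binarized (N*K) x (N*K) matrix; True where two agents' candidates collide.

    candidates[m][i] is the i-th ranked trajectory of agent m (zero-based).
    """
    n, k = len(candidates), len(candidates[0])
    collides = [[False] * (n * k) for _ in range(n * k)]
    for m in range(n):
        for i in range(k):
            for o in range(m + 1, n):
                for j in range(k):
                    hit = min_l2_distance(candidates[m][i], candidates[o][j]) < collision_radius
                    collides[k * m + i][k * o + j] = hit
                    collides[k * o + j][k * m + i] = hit
    return collides


def greedy_select(collides: List[List[bool]], n: int, k: int) -> List[int]:
    """Picks, per agent, the best-ranked candidate compatible with earlier picks."""
    chosen: List[int] = []
    for m in range(n):
        pick = k * m  # fall back to the top-ranked candidate if none is compatible
        for i in range(k):
            node = k * m + i
            if not any(collides[node][c] for c in chosen):
                pick = node
                break
        chosen.append(pick)
    return [node % k for node in chosen]


# Two agents with K = 2 candidates each; the top-ranked pair meets at (1, 0).
toy_candidates = [
    [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)]],
    [[(2.0, 0.0), (1.0, 0.0), (0.0, 0.0)], [(0.0, 10.0), (1.0, 10.0), (2.0, 10.0)]],
]
toy_matrix = build_collision_matrix(toy_candidates, collision_radius=1.0)
selection = greedy_select(toy_matrix, n=2, k=2)
```

A greedy pass of this kind visits at most $N \cdot K$ candidates, in contrast to the $K^N$ joint combinations of an exhaustive search, at the cost of depending on the order in which agents are processed.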
Collision Avoidance Detour (CAD) [14]: An open-loop method that builds upon an existing motion
forecasting method, MTR [55, 56], to produce marginal trajectory predictions, and resamples the
entire future if future agent collisions are anticipated, until a maximum number of trials is exhausted.
While CAD adheres to the factorization requirement, it does not adhere to WOSAC’s closed-loop
requirement. Factorization of world vs. AV policies is accomplished by using different checkpoints
of an MTR motion prediction model for the two agent groups, and motion of non-evaluated agents is
simulated using a constant velocity model.
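The resampling strategy described for CAD amounts to rejection sampling with a trial budget, which can be sketched as below. This is a schematic of the general idea only; `sample_scene` and `has_collision` are hypothetical callables standing in for the MTR-based sampler and the collision check, whose details are given in the authors’ report [14].

```python
import random
from typing import Callable, TypeVar

Scene = TypeVar("Scene")


def sample_until_collision_free(
    sample_scene: Callable[[random.Random], Scene],
    has_collision: Callable[[Scene], bool],
    max_trials: int,
    rng: random.Random,
) -> Scene:
    """Resamples the whole future until it is collision-free or trials run out.

    Returns the last sample drawn, which may still contain a collision when
    max_trials is exhausted.
    """
    scene = sample_scene(rng)
    for _ in range(max_trials - 1):
        if not has_collision(scene):
            break
        scene = sample_scene(rng)
    return scene


def make_scripted_sampler(outcomes):
    """Toy sampler that returns the given outcomes in order (ignores rng)."""
    iterator = iter(outcomes)
    return lambda rng: next(iterator)


result_ok = sample_until_collision_free(
    make_scripted_sampler(["collide", "collide", "ok"]),
    lambda scene: scene == "collide", max_trials=5, rng=random.Random(0),
)
result_exhausted = sample_until_collision_free(
    make_scripted_sampler(["collide", "collide", "ok"]),
    lambda scene: scene == "collide", max_trials=2, rng=random.Random(0),
)
```

Because the whole future is accepted or rejected at once, no agent reacts to another agent’s simulated state during the rollout, which is why a procedure of this form is open-loop.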
Joint-Multipath++ [70]: An open-loop, scene-centric method that builds off of MultiPath++ [66],
producing in a single model pass 32 rollouts, each representing an entire length-80 trajectory.
JointMultiPath++ does not factorize AV vs. world policies, and thus does not fully adhere to
WOSAC’s policy factorization requirements. Agent history information (positions and headings
of all agents) is transformed into the AV’s coordinate frame, while closest lane information is
selected for each agent. In its encoder, JointMultiPath++ concatenates the output of 2 LSTMs and
one Multi-Context Gating (MCG) [66] block to form per-agent embeddings; one LSTM is used to
encode agent history, another LSTM is used to encode per-step differences in agent history, and an
MCG block is used to encode agent history with corresponding timesteps. Subsequent MCG blocks
fuse per-agent information with road network (polyline) embeddings. In its decoder, a series of MLP
blocks transform $n$ per-agent embeddings to rollouts represented as $n \times 32 \times 80 \times 3$ output tensors.²
SBTA-ADIA (Mo et al. [38]): Builds upon an existing motion prediction method [39]. This hierarchical
method splits the problem into a first phase of multi-agent goal prediction, based on a GNN scene
encoder, and a simple planning policy which tries to accomplish the goal in closed-loop.
A.5 Additional Qualitative Results from External Challenge Submissions
In Figures 6 and 7, we provide additional qualitative examples of simulation results from external
challenge submissions.
[Figure 6 image; columns: Simulation Input, Log Playback (Logged Oracle), MTR+++ [44], MVTE [71].]
Figure 6: Two-dimensional visualization of simulation results on WOMD’s test split. MVTE exhibits
a collision and MTR+++ produces a near-miss. ‘Simulation input’ represents the context history
$o_{-H-1:0}$, whereas all other columns visualize both $(o_{-H-1:0}, o_{\geq 1})$. One possible future for a single
scene is represented, selected from 32 submitted rollouts, where the AV remains stopped at a red traffic
signal. Each rendering in columns 2, 3 and 4 depicts the entire duration of the scene. Trajectories
of environment sim agents are drawn in a green-blue gradient (each as a sequence of circles in a
temporal color gradient). The AV agent is drawn in orange.
² Code available at https://github.com/wangwenxi-handsome/Joint-Multipathpp.
[Figure 7 image; columns: Simulation Input, Log Playback (Logged Oracle), MTR+++ [44], MVTE [71].]
Figure 7: Two-dimensional visualization of simulation results on a single scene from WOMD’s test
split using various baseline methods. Four possible futures for this single scene are represented
(one per row), selected from 32 submitted rollouts. ‘Simulation input’ represents the context history
$o_{-H-1:0}$, whereas all other columns visualize both $(o_{-H-1:0}, o_{\geq 1})$. Each rendering in the second,
third, and fourth columns depicts the entire duration of the scene. Trajectories of environment sim
agents are drawn in a green-blue gradient, and trajectories of the AV agent are drawn in a red-yellow
gradient (each as a sequence of circles in a temporal color gradient). The AV agent is drawn in
orange.
A.6 Component Metrics Implementation Details
1. Linear Speed: Unsigned magnitude of the first derivative,
$\|\mathbf{v}_t\| = \frac{\|\mathbf{x}_{t+1} - \mathbf{x}_t\|_2}{\Delta t}$, where $\mathbf{x}_t = [x_t, y_t, z_t]$.
Linear speed is computed in 3D as the 1-step difference between 3D trajectory
points. We employ speed, rather than velocity, as velocity can either be defined w.r.t. the
ego-agent’s heading, or w.r.t. a global coordinate system, where velocity directions may be
city-specific, based on orientation of roads w.r.t. North. Although speed cannot capture objects
moving in reverse, this behavior is rare, and we omit it for the sake of simplicity.
2. Linear Acceleration Magnitude: Signed magnitude of the second derivative, computed in 3D as
the 1-step difference between speeds of objects:
$\frac{\|\mathbf{v}_{t+1}\| - \|\mathbf{v}_t\|}{\Delta t}$.
3. Angular Speed: Signed first derivative $\omega = \frac{d(\theta_{t+1}, \theta_t)}{\Delta t}$, computed as the 1-step difference in
heading, where $d(\cdot)$ represents the minimal angular difference between two angles on the unit
circle, i.e., $d(\cdot)$ is a distance metric on SO(2) computed as
$\min\{|\theta_{t+1} - \theta_t|,\ 2\pi - |\theta_{t+1} - \theta_t|\}$ with $\theta_t, \theta_{t+1} \in [0, 2\pi)$.
4. Angular Acceleration Magnitude: Second derivative, computed as the 1-step difference in
angular speed $\omega$, as $\frac{d(\omega_{t+1}, \omega_t)}{\Delta t}$.
5. Distance to nearest object: Signed distance (in meters) to the nearest object in the scene.
We use the Minkowski difference of box polygons, according to a simplified version of the
Gilbert–Johnson–Keerthi (GJK) distance algorithm [25].
6. Collisions: Count indicating objects that collided, at any point in time, with any other object,
i.e., when the signed distance to the nearest object, as described above, achieves a negative
value.
7. Time-to-collision (TTC): Time (in seconds) before the object collides with the object it
is following (if one exists), assuming constant speeds. An object is defined as exhibiting
object-following (tailgating) behavior based on alignment conditions derived from heading
and lateral distance.
8. Distance to road edge: Signed distance (in meters) to the nearest road edge in the scene.
9. Road departures: Boolean value indicating whether the object went off the road, at any point
in time [60].
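The kinematic features in items 1 to 3 can be sketched in a few lines. This is an illustrative re-implementation written for this appendix under the definitions above, not the released evaluation code; the function names are hypothetical, and the angular difference follows the unsigned $\min\{\cdot\}$ formula exactly as written in item 3.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def linear_speeds(points: Sequence[Point3D], dt: float) -> List[float]:
    """Item 1: norm of the 1-step difference between 3D points, divided by dt."""
    return [math.dist(p1, p0) / dt for p0, p1 in zip(points, points[1:])]


def linear_accelerations(speeds: Sequence[float], dt: float) -> List[float]:
    """Item 2: signed 1-step difference between consecutive speeds."""
    return [(s1 - s0) / dt for s0, s1 in zip(speeds, speeds[1:])]


def angular_distance(a: float, b: float) -> float:
    """Minimal difference between two angles on the unit circle (radians)."""
    diff = abs(a - b) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)


def angular_speeds(headings: Sequence[float], dt: float) -> List[float]:
    """Item 3: 1-step angular distance in heading, divided by dt."""
    return [angular_distance(h1, h0) / dt for h0, h1 in zip(headings, headings[1:])]


track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (9.0, 12.0, 0.0)]
speeds = linear_speeds(track, dt=0.1)
accels = linear_accelerations(speeds, dt=0.1)
wrap = angular_distance(0.1, 2.0 * math.pi - 0.1)
```

The wrap-around handling matters for headings near 0 and $2\pi$: the pair above differs by 0.2 rad on the circle, whereas a naive subtraction would report almost a full turn.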
To prevent undefined scores from histogram bins with zero support, we employ Laplace smoothing
with a pseudocount of 0.1.
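A minimal sketch of Laplace smoothing on a histogram is given below, assuming equal-width bins over a fixed range and clipping of out-of-range values to the edge bins; the bin ranges, bin counts, and function names are placeholders for this example rather than the settings used in the released evaluation code.

```python
import math
from typing import List, Sequence


def bin_index(value: float, lo: float, hi: float, num_bins: int) -> int:
    """Index of the equal-width bin containing value, clipped to the valid range."""
    width = (hi - lo) / num_bins
    idx = int((value - lo) // width)
    return min(max(idx, 0), num_bins - 1)


def smoothed_bin_probabilities(
    samples: Sequence[float], lo: float, hi: float, num_bins: int,
    pseudocount: float = 0.1,
) -> List[float]:
    """Histogram probabilities with the pseudocount added to every bin."""
    counts = [pseudocount] * num_bins
    for s in samples:
        counts[bin_index(s, lo, hi, num_bins)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def log_likelihood(value: float, probs: Sequence[float], lo: float, hi: float) -> float:
    """Log-probability of the bin that the given value falls into."""
    return math.log(probs[bin_index(value, lo, hi, len(probs))])


# 32 simulated samples that all land in the first of two bins.
sim_samples = [0.5] * 32
probs = smoothed_bin_probabilities(sim_samples, lo=0.0, hi=2.0, num_bins=2)
logged_ll = log_likelihood(1.5, probs, lo=0.0, hi=2.0)
```

Without the pseudocount, the second bin would have probability zero and the log-likelihood of a logged value of 1.5 would be undefined; with it, the value is finite but strongly penalized.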
Inserted and Deleted Object Handling In order to prevent object insertion/deletion bias between
the logged and simulated data distributions during evaluation, we discard any newly spawned objects
(appearing after the history interval) in the logged test set when computing the logged data distribution.
The data distribution in the WOMD dataset already includes such object insertion and deletion.
A.6.1 Evaluation Source Code References
In this section, we provide pointers to our specific implementations of the 9 metrics discussed in
Section 4.2.1 of the main paper:
• Kinematic-based features: Linear speed, linear acceleration magnitude, angular speed, and
angular acceleration magnitude (metrics 1, 2, 3, 4): [Code]
• Interaction-based features: Distance to nearest object, collisions, and TTC (metrics 5, 6, 7):
[Code] and modified GJK algorithm implementation [Code]
• Map-based features: Distance to road edge and road departures (metrics 8, 9): [Code]
An implementation of our time-series based NLL computation can be found here: [Code].
A.6.2 Evaluation Code License and Dependencies
The WOMD [22] dataset itself is licensed under a non-commercial license
(www.waymo.com/open/terms) and the evaluation code for our Waymo Open Sim
Agents Challenge (WOSAC) is released under a BSD+limited patent license. See
https://github.com/waymo-research/waymo-open-dataset/blob/master/src/waymo_open_dataset/wdl_limited/sim_agents_metrics/PATENTS and
https://github.com/waymo-research/waymo-open-dataset/blob/master/src/waymo_open_dataset/wdl_limited/sim_agents_metrics/LICENSE.
Dependencies used include NumPy (numpy), the Waymo Open Dataset repository
(waymo-open-dataset-tf-2-11-0==1.5.2), TensorFlow (tensorflow), TensorFlow
Probability (tensorflow_probability), Matplotlib (matplotlib), TQDM (tqdm),
Protocol Buffers (google.protobuf), and Python standard library imports (os, tarfile,
dataclasses).
A.7 Additional Information about WOMD Splits Used
We exclude 401 run segments from evaluation because their object counts differ between the
Scenario proto and tf.Example formats; the discrepancy arises from object count truncation in
fixed-shape tf.Example tensors, which have exactly 128 object slots. We also exclude from evaluation 9 test
run segments which are missing maps (however, maps are present for each scenario present in the
validation set).³
Table 5: Statistics of WOMD dataset [22] splits used.
DATASET SPLIT                 VALIDATION    TESTING
ALL SCENARIO COUNTS           44097         44920
EVALUATED SCENARIO COUNTS     43696         44520
A.8 Submission Format
Submissions must be uploaded as serialized SimAgentsChallengeSubmission protocol
buffer data⁴ (“protos”). Each ScenarioRollouts proto within the submission must contain
32 8-second rollouts of simulation data from one scenario. A validation or test set submission may be
submitted to the evaluation server.
We provide a Jupyter notebook tutorial with additional instructions and examples on how to generate
a submission for a dataset split. We recommend storing multiple ScenarioRollouts protos in each
binary proto file (i.e., in each SimAgentsChallengeSubmission file) to avoid creating a
tar.gz file with tens of thousands of files; use of 100 to 150 such shards is recommended. Please
refer to the tutorial notebook for the naming convention of these files. Submission data should be
compressed as a single .tar.gz archive and uploaded as a single file.
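The sharding and packaging steps can be sketched with the Python standard library as follows. This is only an outline of the file handling: the byte payloads are placeholders for serialized SimAgentsChallengeSubmission protos, and the shard file names used here are hypothetical (the required naming convention is the one given in the tutorial notebook).

```python
import io
import os
import tarfile
import tempfile
from typing import Dict, List, Sequence


def shard_items(items: Sequence[str], num_shards: int) -> List[List[str]]:
    """Distributes items round-robin across num_shards lists."""
    return [list(items[i::num_shards]) for i in range(num_shards)]


def write_archive(shard_payloads: Dict[str, bytes], archive_path: str) -> None:
    """Writes each in-memory payload as one file inside a single .tar.gz."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name, payload in shard_payloads.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


# Placeholder scenario IDs; a real submission holds one ScenarioRollouts per scenario.
scenario_ids = [f"scenario_{i}" for i in range(1000)]
shards = shard_items(scenario_ids, num_shards=100)

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "submission.tar.gz")
    # Hypothetical file names; each payload stands in for a serialized proto.
    payloads = {
        f"shard-{i:05d}.binproto": "\n".join(shard).encode("utf-8")
        for i, shard in enumerate(shards)
    }
    write_archive(payloads, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        archived_names = tar.getnames()
```

Grouping scenarios into on the order of 100 shards keeps the archive to a manageable number of files while still allowing shards to be written in parallel.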
³ WOMD download instructions available at https://waymo.com/intl/en_us/open/download.
⁴ https://protobuf.dev/