MotionLM: Multi-Agent Motion Forecasting as Language Modeling

Ari Seff   Brian Cera   Dian Chen*   Mason Ng   Aurick Zhou   Nigamaa Nayakanti   Khaled S. Refaat   Rami Al-Rfou   Benjamin Sapp

Waymo

Abstract

Reliable forecasting of the future behavior of road agents is a critical component to safe planning in autonomous vehicles. Here, we represent continuous trajectories as sequences of discrete motion tokens and cast multi-agent motion prediction as a language modeling task over this domain. Our model, MotionLM, provides several advantages: First, it does not require anchors or explicit latent variable optimization to learn multimodal distributions. Instead, we leverage a single standard language modeling objective, maximizing the average log probability over sequence tokens. Second, our approach bypasses post-hoc interaction heuristics where individual agent trajectory generation is conducted prior to interactive scoring. Instead, MotionLM produces joint distributions over interactive agent futures in a single autoregressive decoding process. In addition, the model's sequential factorization enables temporally causal conditional rollouts. The proposed approach establishes new state-of-the-art performance for multi-agent motion prediction on the Waymo Open Motion Dataset, ranking 1st on the interactive challenge leaderboard.

*Work done during an internship at Waymo. Contact: {aseff, bensapp}@waymo.com

1. Introduction

Modern sequence models often employ a next-token prediction objective that incorporates minimal domain-specific assumptions. For example, autoregressive language models [3, 10] are pre-trained to maximize the probability of the next observed subword conditioned on the previous text; there is no predefined notion of parsing or syntax built in. This approach has found success in continuous domains as well, such as audio [2] and image generation [49]. Leveraging the flexibility of arbitrary categorical distributions, the above works represent continuous data with a set of discrete tokens, reminiscent of language model vocabularies.

Figure 1. Our model autoregressively generates sequences of discrete motion tokens for a set of agents to produce consistent interactive trajectory forecasts.

In driving scenarios, road users may be likened to participants in a constant dialogue, continuously exchanging a dynamic series of actions and reactions mirroring the fluidity of communication. Navigating this rich web of interactions requires the ability to anticipate the likely maneuvers and responses of the involved actors. Just as today's language models can capture sophisticated distributions over conversations, can we leverage similar sequence models to forecast the behavior of road agents?

A common simplification to modeling the full future world state has been to decompose the joint distribution of agent behavior into independent per-agent marginal distributions. Although there has been much progress on this task [8, 47, 12, 25, 31, 5, 6, 21], marginal predictions are insufficient as inputs to a planning system; they do not represent the future dependencies between the actions of different agents, leading to inconsistent scene-level forecasting.

Of the existing joint prediction approaches, some apply a separation between marginal trajectory generation and interactive scoring [40, 42, 29].
For example, Luo et al. [29] initially produce a small set of marginal trajectories for each agent independently, before assigning a learned potential to each inter-agent trajectory pair through a belief propagation algorithm. Sun et al. [42] use a manual heuristic to tag agents as either influencers or reactors, and then pair marginal and conditional predictions to form joint predictions.

We also note that because these approaches do not explicitly model temporal dependencies within trajectories, their conditional forecasts may be more susceptible to spurious correlations, leading to less realistic reaction predictions. For example, these models can capture the correlation between a lead agent decelerating and a trailing agent decelerating, but may fail to infer which one is likely causing the other to slow down. In contrast, previous joint models employing an autoregressive factorization, e.g., [36, 43, 39], do respect future temporal dependencies. These models have generally relied on explicit latent variables for diversity, optimized via either an evidence lower bound or normalizing flow.

In this work, we combine trajectory generation and interaction modeling in a single, temporally causal, decoding process over discrete motion tokens (Fig. 1), leveraging a simple training objective inspired by autoregressive language models. Our model, MotionLM, is trained to directly maximize the log probability of these token sequences among interacting agents. At inference time, joint trajectories are produced step-by-step, where interacting agents sample tokens simultaneously, attend to one another, and repeat. In contrast to previous approaches which manually enforce trajectory multimodality during training, our model is entirely latent variable and anchor-free, with multimodality emerging solely as a characteristic of sampling. MotionLM may be applied to several downstream behavior prediction tasks, including marginal, joint, and conditional predictions.

This work makes the following contributions:
1. We cast multi-agent motion forecasting as a language modeling task, introducing a temporally causal decoder over discrete motion tokens trained with a causal language modeling loss.

2. We pair sampling from our model with a simple rollout aggregation scheme that facilitates weighted mode identification for joint trajectories, establishing new state-of-the-art performance on the Waymo Open Motion Dataset interaction prediction challenge (6% improvement in the ranking joint mAP metric).

3. We perform extensive ablations of our approach as well as analysis of its temporally causal conditional predictions, which are largely unsupported by current joint forecasting models.

2. Related work

Marginal trajectory prediction. Behavior predictors are often evaluated on their predictions for individual agents, e.g., in recent motion forecasting benchmarks [14, 9, 4, 51, 37]. Previous methods process the rasterized scene with CNNs [8, 5, 12, 17]; more recent works represent scenes with points and polylines and process them with GNNs [6, 25, 47, 22] or transformers [31, 40, 20]. To handle the multimodality of future trajectories, some models manually enforce diversity via predefined anchors [8, 5] or intention points [40, 52, 28]. Other works learn diverse modes with latent variable modeling, e.g., [24]. While these works produce multimodal future trajectories of individual agents, they only capture the marginal distributions of the possible agent futures and do not model the interactions among agents.

Interactive trajectory prediction. Interactive behavior predictors model the joint distribution of agents' futures. This task has been far less studied than marginal motion prediction. For example, the Waymo Open Motion Dataset (WOMD) [14] challenge leaderboard currently has 71 published entries for marginal prediction compared to only 14 for interaction prediction. Ngiam et al. [32] model the distribution of future trajectories with a transformer-based mixture model outputting joint modes. To avoid the exponential blow-up from a full joint model, Luo et al. [29] model pairwise joint distributions. Tolstaya et al. [44], Song et al. [41], and Sun et al. [42] consider conditional predictions by exposing the future trajectory of one agent when predicting for another agent. Shi et al. [40] derive joint probabilities by simply multiplying marginal trajectory probabilities, essentially treating agents as independent, which may limit accuracy. Cui et al. [11], Casas et al. [7], and Girgis et al. [15] reduce the full-fledged joint distribution using global latent variables. Unlike our autoregressive factorization, the above models typically follow "one-shot" (parallel across time) factorizations and do not explicitly model temporally causal interactions.

Autoregressive trajectory prediction. Autoregressive behavior predictors generate trajectories at intervals to produce scene-consistent multi-agent trajectories. Rhinehart et al. [36], Tang and Salakhutdinov [43], Amirloo et al. [1], Salzmann et al. [39], and Yuan et al. [50] predict multi-agent future trajectories using latent variable models. Lu et al. [27] explore autoregressively outputting keyframes via mixtures of Gaussians prior to filling in the remaining states. In [18], an adversarial objective is combined with parallel beam search to learn multi-agent rollouts. Unlike most autoregressive trajectory predictors, our method does not rely on latent variables or beam search and generates multimodal joint trajectories by directly sampling from a learned distribution of discrete motion token sequences.

Figure 2. MotionLM architecture. We first encode heterogeneous scene features relative to each modeled agent (left) as scene embeddings of shape [R, N, ·, H]. Here, R refers to the number of rollouts, N refers to the number of (jointly modeled) agents, and H is the dimensionality of each embedding. We repeat the embeddings R times in the batch dimension for parallel sampling during inference. Next, a trajectory decoder autoregressively rolls out T discrete motion tokens for multiple agents in a temporally causal manner (center). Finally, representative modes of the rollouts may be recovered via a simple aggregation utilizing k-means clustering initialized with non-maximum suppression (right).

Discrete sequence modeling in continuous domains. When generating sequences in continuous domains, one effective approach is to discretize the output space and predict categorical distributions at each step.
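As a minimal illustration of this general recipe (a sketch, not any particular paper's implementation), the snippet below uniformly quantizes a bounded continuous signal into token ids and back; the value range and bin count are placeholder assumptions.

```python
import numpy as np

# Illustrative uniform discretization for sequence modeling in a continuous
# domain. The value range and bin count are placeholder assumptions.
V_MIN, V_MAX, NUM_BINS = -1.0, 1.0, 256

def tokenize(x: np.ndarray) -> np.ndarray:
    """Map continuous values in [V_MIN, V_MAX] to integer token ids."""
    x = np.clip(x, V_MIN, V_MAX)
    return np.round((x - V_MIN) / (V_MAX - V_MIN) * (NUM_BINS - 1)).astype(np.int64)

def detokenize(ids: np.ndarray) -> np.ndarray:
    """Map token ids back to their quantized continuous values."""
    return V_MIN + ids.astype(np.float64) / (NUM_BINS - 1) * (V_MAX - V_MIN)

signal = np.sin(np.linspace(0, 2 * np.pi, 16))  # a toy continuous sequence
tokens = tokenize(signal)                       # discrete "vocabulary" view
recon = detokenize(tokens)                      # reconstruction from tokens
print(np.abs(signal - recon).max())             # quantization error is bounded
```

A sequence model then outputs a categorical distribution over these token ids at each step, exactly as a language model does over subwords.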
For example, in image generation, van den Oord et al. [45] sequentially predict the uniformly discretized pixel values for each channel and found this to perform better than outputting continuous values directly. Multiple works on generating images from text, such as [35] and [49], use a two-stage process with a learned tokenizer to map images to discrete tokens and an autoregressive model to predict the discrete tokens given the text prompt. For audio generation, WaveNet [46] applies a µ-law transformation before discretizing. Borsos et al. [2] learn a hierarchical tokenizer/detokenizer, with the main transformer sequence model operating on the intermediate discrete tokens. When generating polygonal meshes, Nash et al. [30] uniformly quantize the coordinates of each vertex. In MotionLM, we employ a simple uniform quantization of axis-aligned deltas between consecutive waypoints of agent trajectories.

3. MotionLM

We aim to model a distribution over multi-agent interactions in a general manner that can be applied to distinct downstream tasks, including marginal, joint, and conditional forecasting. This requires an expressive generative framework capable of capturing the substantial multimodality in driving scenarios. In addition, we take care here to preserve temporal dependencies; i.e., inference in our model follows a directed acyclic graph with the parents of every node residing earlier in time and children residing later (Section 3.3, Fig. 4). This enables conditional forecasts that more closely resemble causal interventions [34] by eliminating certain spurious correlations that can otherwise result from disobeying temporal causality². We observe that joint models that do not preserve temporal dependencies may have a limited ability to predict realistic agent reactions – a key use in planning (Section 4.6). To this end, we leverage an autoregressive factorization of our future decoder, where agents' motion tokens are conditionally dependent on all previously sampled tokens and trajectories are rolled out sequentially (Fig. 2).

Let S represent the input data for a given scenario. This may include context such as roadgraph elements, traffic light states, as well as features describing road agents (e.g., vehicles, cyclists, and pedestrians) and their recent histories, all provided at the current timestep t = 0. Our task is to generate predictions for joint agent states Y_t ≐ {y_t^1, y_t^2, ..., y_t^N} for N agents of interest at future timesteps t = 1, ..., T. Rather than complete states, these future state targets are typically two-dimensional waypoints (i.e., (x, y) coordinates), with T waypoints forming the full ground truth trajectory for an individual agent.

²We make no claims that our model is capable of directly modeling causal relationships (due to the theoretical limits of purely observational data and unobserved confounders). Here, we solely take care to avoid breaking temporal causality.

3.1. Joint probabilistic rollouts

In our modeling framework, we sample a predicted action for each target agent at each future timestep. These actions are formulated as discrete motion tokens from a finite vocabulary, as described later in Section 3.2.2. Let a_t^n represent the target action (derived from the ground truth waypoints) for the nth agent at time t, with A_t ≐ {a_t^1, a_t^2, ..., a_t^N} representing the set of target actions for all agents at time t.

Factorization. We factorize the distribution over joint future action sequences as a product of conditionals:

$$p_\theta(A_1, A_2, \ldots, A_T \mid S) = \prod_{t=1}^{T} p_\theta(A_t \mid A_{<t}, S)$$

Our model is subject to the same theoretical limitations as general imitation learning frameworks (e.g., compounding error [38] and self-delusions due to unobserved confounders [33]). However, we find that, in practice, these do not prevent strong performance on forecasting tasks.

3.2. Model implementation

Our model consists of two main networks: an encoder which processes initial scene elements, followed by a trajectory decoder which performs both cross-attention to the scene encodings and self-attention along agent motion tokens, following a transformer architecture [48].

3.2.1 Scene encoder

… 99% of the WOMD dataset.

Verlet-wrapped action space. Once the above delta action space has the Verlet wrapper applied, we only require 13 bins for each coordinate. This results in a total of 13² = 169 discrete motion tokens that the model can select from, with the Cartesian product over the two coordinates comprising the final vocabulary.
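To make the pieces above concrete (the 13 × 13 Cartesian-product vocabulary and the timestep-by-timestep factorization), here is a toy, self-contained rollout sketch. The bin range, the random stand-in for the learned decoder, and all names are illustrative assumptions; the Verlet wrapping that makes 13 bins suffice is omitted.

```python
import numpy as np

# Toy sketch of a 13 x 13 delta-token vocabulary and a joint autoregressive
# rollout following p(A_1..A_T | S) = prod_t p(A_t | A_<t, S). The bin range,
# the dummy "decoder", and all names are illustrative assumptions.
NUM_BINS = 13
DELTA_BINS = np.linspace(-2.0, 2.0, NUM_BINS)  # placeholder (dx, dy) bin centers

def token_to_delta(token_id: int):
    """Decode a token id in [0, 169) into a (dx, dy) pair of bin centers."""
    ix, iy = divmod(token_id, NUM_BINS)
    return DELTA_BINS[ix], DELTA_BINS[iy]

def dummy_decoder(history, rng):
    """Stand-in for the learned conditional over the 169 tokens; a real model
    would attend to the scene S and to all previously sampled tokens."""
    return rng.normal(size=NUM_BINS * NUM_BINS)  # random logits

def joint_rollout(num_agents: int, horizon: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    positions = np.zeros((num_agents, horizon + 1, 2))  # (x, y) waypoints
    history = []  # all sampled tokens so far, i.e., A_<t
    for t in range(1, horizon + 1):
        step_tokens = []
        for n in range(num_agents):  # agents sample "simultaneously": within a
            logits = dummy_decoder(history, rng)  # step, only A_<t is visible
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            token = int(rng.choice(NUM_BINS * NUM_BINS, p=probs))
            dx, dy = token_to_delta(token)
            positions[n, t] = positions[n, t - 1] + np.array([dx, dy])
            step_tokens.append(token)
        history.append(step_tokens)  # becomes visible at the next timestep
    return positions

print(joint_rollout(num_agents=2, horizon=16).shape)  # (2, 17, 2)
```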
Sequence lengths. For 8-second futures, the model outputs 16 motion tokens for each agent (note that WOMD evaluates predictions at 2 Hz). For the two-agent interactive split, our flattened agent-time token sequences (Section 3.2.2) have length 2 × 16 = 32.

B. Implementation details

B.1. Scene encoder

We follow the design of the early fusion network proposed by [31] as the scene encoding backbone of our model. The following hyperparameters are used:

• Number of layers: 4
• Hidden size: 256
• Feed-forward network intermediate size: 1024
• Number of attention heads: 4
• Number of latent queries: 92
• Activation: ReLU

B.2. Trajectory decoder

To autoregressively decode motion token sequences, we utilize a causal transformer decoder that takes in the motion tokens as queries and the scene encodings as context. We use the following model hyperparameters:

• Number of layers: 4
• Hidden size: 256
• Feed-forward network intermediate size: 1024
• Number of attention heads: 4
• Activation: ReLU

Figure 8. Masked causal attention between two agents during training. We flatten the agent and time axes, leading to an NT × NT attention mask. The agents may attend to each other's previous motion tokens (solid squares) but no future tokens (empty squares).
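To make the flattened attention pattern of Figure 8 concrete, the sketch below builds a boolean NT × NT mask in which a token may attend to any agent's tokens from strictly earlier timesteps. The token ordering and the handling of same-timestep positions here are assumptions for illustration, not the paper's exact convention.

```python
import numpy as np

# Sketch of the flattened agent-time causal mask from Figure 8. Tokens are
# ordered agent-major (index = n * T + t), but the rule below depends only on
# each token's timestep, so any consistent flattening behaves the same.
def agent_time_causal_mask(num_agents: int, horizon: int) -> np.ndarray:
    idx = np.arange(num_agents * horizon)
    t = idx % horizon                        # timestep of each flattened token
    mask = t[None, :] < t[:, None]           # attend to strictly earlier steps
    mask |= np.eye(num_agents * horizon, dtype=bool)  # plus one's own position
    return mask                              # True = attention allowed

mask = agent_time_causal_mask(num_agents=2, horizon=4)  # an 8 x 8 (NT x NT) grid
print(mask.astype(int))
```

With N = 2 and T = 4, each row shows that a token at timestep t can see both agents' tokens at timesteps 0, ..., t-1 but no future tokens, mirroring the solid and empty squares of Figure 8.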
B.3. Optimization

We train our model to maximize the likelihood of the ground truth motion token sequences via teacher forcing. We use the following training hyperparameters:

• Number of training steps: 600,000
• Batch size: 256
• Learning rate schedule: Linear decay
• Initial learning rate: 0.0006
• Final learning rate: 0.0
• Optimizer: AdamW
• Weight decay: 0.6

B.4. Inference

We found nucleus sampling [16], commonly used with language models, to be helpful for improving sample quality while maintaining diversity. Here we set the top-p parameter to 0.95.
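For reference, a generic top-p sampling step over a single token's logits might look like the following (a sketch of the standard technique from [16], not the paper's implementation):

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, top_p: float = 0.95, rng=None) -> int:
    """Sample a token id from the smallest set of most likely tokens whose
    cumulative probability exceeds top_p (nucleus sampling)."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over the vocabulary
    order = np.argsort(probs)[::-1]                # most likely tokens first
    cdf = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cdf, top_p)) + 1  # minimal nucleus size
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()         # renormalize within nucleus
    return int(rng.choice(keep, p=kept))

rng = np.random.default_rng(0)
logits = rng.normal(size=169)  # e.g., one decoding step over 169 motion tokens
print(nucleus_sample(logits, top_p=0.95, rng=rng))
```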
C. Metrics descriptions

C.1. WOMD metrics

All metrics for the two WOMD [14] benchmarks are evaluated at three time steps (3, 5, and 8 seconds) and are averaged over all object types to obtain the final value. For joint metrics, a scene is attributed to an object class (vehicle, pedestrian, or cyclist) according to the least common type of agent that is present in that interaction, with cyclist being the rarest object class and vehicle being the most common. Up to 6 trajectories are produced by the models for each target agent in each scene, which are then used for metric evaluation.

mAP & Soft mAP. mAP measures precision of prediction likelihoods and is calculated by first bucketing ground truth futures of objects into eight discrete classes of intent: straight, straight-left, straight-right, left, right, left u-turn, right u-turn, and stationary.

For marginal predictions, a prediction trajectory is considered a "miss" if it exceeds a lateral or longitudinal error threshold at a specified timestep T. Similarly, for joint predictions, a prediction is considered a "miss" if none of the k joint predictions contains trajectories for all predicted objects within a given lateral and longitudinal error threshold, with respect to the ground truth trajectories for each agent. Trajectory predictions classified as a miss are labeled as a false positive. In the event of multiple predictions matching the ground truth (i.e., not classified as misses), consistent with object detection mAP metrics, only one true positive is allowed for each scene, assigned to the highest confidence prediction. All other predictions for the object are assigned a false positive.

To compute the mAP metric, bucket entries are sorted and a P/R curve is computed for each bucket; averaging precision values over various likelihood thresholds for all intent buckets results in the final mAP value. Soft mAP differs only in that additional matching predictions (other than the most likely match) are ignored instead of being assigned a false positive, and so are not penalized in the metric computation.

Miss rate. Using the same definition of a "miss" described above for either marginal or joint predictions, miss rate measures the fraction of scenarios that fail to generate any predictions within the lateral and longitudinal error thresholds, relative to the ground truth future.

minADE & minFDE. minADE measures the Euclidean distance error averaged over all timesteps for the closest prediction, relative to ground truth. In contrast, minFDE considers only the distance error at the final timestep. For joint predictions, minADE and minFDE are calculated as the average value over both agents.

C.2. Prediction overlap

As described in [29], the WOMD [14] overlap metric only considers overlap between predictions and ground truth. Here we use a prediction overlap metric to assess scene-level consistency for joint models. Our implementation is similar to [29], except we follow the convention of the WOMD challenge of only requiring models to generate (x, y) waypoints; headings are inferred as in [14]. If the bounding boxes of two predicted agents collide at any timestep in a scene, that counts as an overlap/collision for that scene. The final prediction overlap rate is calculated as the sum of per-scene overlaps, averaged across the dataset.
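As a concrete reference for the joint minADE/minFDE definitions above, here is a small sketch; the array shapes and the function name are assumptions.

```python
import numpy as np

# Joint minADE/minFDE as defined above: for each of K joint prediction modes,
# average the displacement error over both agents (and over time for ADE),
# then take the minimum over the K modes. Shapes are illustrative assumptions:
# preds has shape [K, N, T, 2] and gt has shape [N, T, 2].
def joint_min_ade_fde(preds: np.ndarray, gt: np.ndarray):
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # [K, N, T] step errors
    min_ade = dists.mean(axis=(1, 2)).min()            # avg agents+time, min over K
    min_fde = dists[..., -1].mean(axis=1).min()        # final step, avg agents
    return float(min_ade), float(min_fde)

rng = np.random.default_rng(0)
gt = rng.normal(size=(2, 16, 2))                         # 2 agents, 16 waypoints
preds = gt[None] + 0.1 * rng.normal(size=(6, 2, 16, 2))  # 6 joint modes
print(joint_min_ade_fde(preds, gt))
```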
D. Additional evaluation

Ablations. Tables 5 and 6 display joint prediction performance across varying interactive attention frequencies and numbers of rollouts, respectively. In addition to the ensembled model performance, single replica performance is evaluated. Standard deviations are computed for each metric over 8 independently trained replicas.

             Ensemble                                 Single Replica
Freq. (Hz)   minADE(↓)  minFDE(↓)  MR(↓)   mAP(↑)    minADE(↓)       minFDE(↓)       MR(↓)           mAP(↑)
0.125        0.9120     2.0634     0.4222  0.2007    1.0681 (0.011)  2.4783 (0.025)  0.5112 (0.007)  0.1558 (0.007)
0.25         0.9083     2.0466     0.4241  0.1983    1.0630 (0.009)  2.4510 (0.025)  0.5094 (0.006)  0.1551 (0.006)
0.5          0.8931     2.0073     0.4173  0.2077    1.0512 (0.009)  2.4263 (0.022)  0.5039 (0.006)  0.1588 (0.004)
1            0.8842     1.9898     0.4117  0.2040    1.0419 (0.014)  2.4062 (0.032)  0.5005 (0.008)  0.1639 (0.005)
2            0.8831     1.9825     0.4092  0.2150    1.0345 (0.012)  2.3886 (0.031)  0.4943 (0.006)  0.1687 (0.004)

Table 5. Joint prediction performance across varying interactive attention frequencies on the WOMD interactive validation set. Displayed are scene-level joint evaluation metrics. For the single replica metrics, we include the standard deviation (across 8 replicas) in parentheses.

             Ensemble                                 Single Replica
# Rollouts   minADE(↓)  minFDE(↓)  MR(↓)   mAP(↑)    minADE(↓)       minFDE(↓)       MR(↓)           mAP(↑)
1            1.0534     2.3526     0.5370  0.1524    1.9827 (0.018)  4.7958 (0.054)  0.8182 (0.003)  0.0578 (0.004)
2            0.9952     2.2172     0.4921  0.1721    1.6142 (0.011)  3.8479 (0.032)  0.7410 (0.003)  0.0827 (0.004)
4            0.9449     2.1100     0.4561  0.1869    1.3655 (0.012)  3.2060 (0.035)  0.6671 (0.003)  0.1083 (0.003)
8            0.9158     2.0495     0.4339  0.1934    1.2039 (0.013)  2.7848 (0.035)  0.5994 (0.004)  0.1324 (0.003)
16           0.9010     2.0163     0.4196  0.2024    1.1254 (0.012)  2.5893 (0.031)  0.5555 (0.005)  0.1457 (0.003)
32           0.8940     2.0041     0.4141  0.2065    1.0837 (0.013)  2.4945 (0.035)  0.5272 (0.005)  0.1538 (0.004)
64           0.8881     1.9888     0.4095  0.2051    1.0585 (0.012)  2.4411 (0.033)  0.5114 (0.005)  0.1585 (0.004)
128          0.8851     1.9893     0.4103  0.2074    1.0456 (0.012)  2.4131 (0.033)  0.5020 (0.006)  0.1625 (0.004)
256          0.8856     1.9893     0.4078  0.2137    1.0385 (0.012)  2.3984 (0.031)  0.4972 (0.007)  0.1663 (0.005)
512          0.8831     1.9825     0.4092  0.2150    1.0345 (0.012)  2.3886 (0.031)  0.4943 (0.006)  0.1687 (0.004)

Table 6. Joint prediction performance across varying numbers of rollouts per replica on the WOMD interactive validation set. Displayed are scene-level joint evaluation metrics. For the single replica metrics, we include the standard deviation (across 8 replicas) in parentheses.

Scaling analysis. Table 7 displays the performance of different model sizes on the WOMD interactive split, all trained with the same optimization hyperparameters. We vary the number of layers, hidden size, and number of attention heads in the encoder and decoder proportionally. Due to external constraints, in this study we only train a single replica for each parameter count. We observe that the model with 27M parameters overfits while the 300K model underfits. Both the 1M and 9M models perform decently. In this paper, our main results use 9M-parameter replicas.

Parameter count   Miss Rate (↓)   mAP (↑)
300K              0.6047          0.1054
1M                0.5037          0.1713
9M                0.4972          0.1663
27M               0.6072          0.1376

Table 7. Joint prediction performance across varying model sizes on the WOMD interactive validation set. Displayed are scene-level joint mAP and miss rate for 256 rollouts for a single model replica (except for 9M, which displays the mean performance of 8 replicas).

Latency analysis. Table 8 provides inference latency on the latest generation of GPUs across different numbers of rollouts. These were measured for a single-replica joint model rolling out two agents.

# Rollouts   Latency (ms)
16           19.9 (0.19)
32           27.5 (0.25)
64           43.8 (0.26)
128          75.8 (0.23)
256          137.7 (0.19)

Table 8. Inference latency on the current generation of GPUs for different numbers of rollouts of the joint model. We display the mean and standard deviation (in parentheses) of the latency measurements for each setting.

E. Visualizations

In the supplementary zip file, we have included GIF animations of the model's greatest-probability predictions in various scenes. Each example below displays the associated scene ID, which is also contained in the corresponding GIF filename. We describe the examples here.

E.1. Marginal vs. Joint

• Scene ID: 286a65c777726df3
  Marginal: The turning vehicle and crossing cyclist collide.
  Joint: The vehicle yields to the cyclist before turning.

• Scene ID: 440bbf422d08f4c0
  Marginal: The turning vehicle collides with the crossing vehicle in the middle of the intersection.
  Joint: The turning vehicle yields and collision is avoided.

• Scene ID: 38899bce1e306fb1
  Marginal: The lane-changing vehicle gets rear-ended by the vehicle in the adjacent lane.
  Joint: The adjacent vehicle slows down to allow the lane-changing vehicle to complete the maneuver.

• Scene ID: 2ea76e74b5025ec7
  Marginal: The cyclist crosses in front of the vehicle, leading to a collision.
  Joint: The cyclist waits for the vehicle to proceed before turning.

• Scene ID: 55b5fe989aa4644b
  Marginal: The cyclist lane changes in front of the adjacent vehicle, leading to collision.
  Joint: The cyclist remains in their lane for the duration of the scene, avoiding collision.

E.2. Marginal vs. Conditional

"Conditional" here refers to temporally causal conditioning as described in the main text.

• Scene ID: 5ebba77f351358e2
  Marginal: The pedestrian crosses the street as a vehicle is turning, leading to a collision.
  Conditional: When conditioning on the vehicle's turning trajectory as a query, the pedestrian is instead predicted to remain stationary.

• Scene ID: d557eee96705c822
  Marginal: The modeled vehicle collides with the lead vehicle.
  Conditional: When conditioning on the lead vehicle's query trajectory, which remains stationary for a bit, the modeled vehicle instead comes to an appropriate stop.

• Scene ID: 9410e72c551f0aec
  Marginal: The modeled vehicle takes the turn slowly, unaware of the lead turning vehicle's progress.
  Conditional: When conditioning on the query vehicle's turn progress, the modeled agent likewise makes more progress.

• Scene ID: c204982298bda1a1
  Marginal: The modeled vehicle proceeds slowly, unaware of the merging vehicle's progress.
  Conditional: When conditioning on the query vehicle's merge progress, the modeled agent accelerates behind it.

E.3. Temporally Causal vs. Acausal Conditioning

• Scene ID: 4f39d4eb35a4c07c
  Joint prediction: The two modeled vehicles maintain speed for the duration of the scene.
  Conditioning on trailing agent:
  - Temporally causal: The lead vehicle is indifferent to the query trailing vehicle decelerating to a stop, proceeding along at a constant speed.
  - Acausal: The lead vehicle is "influenced" by the query vehicle decelerating; it likewise comes to a stop. Intuitively, this is an incorrect direction of influence that the acausal model has learned.
  Conditioning on lead agent:
  - Temporally causal: When conditioning on the query lead vehicle decelerating to a stop, the modeled trailing vehicle is likewise predicted to stop.
  - Acausal: In this case, the acausal conditional prediction is similar to the temporally causal conditional. The trailing vehicle is predicted to stop behind the query lead vehicle.