MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion

Chiyu "Max" Jiang*   Andre Cornman*   Cheolho Park   Ben Sapp   Yin Zhou   Dragomir Anguelov
* equal contribution
Waymo LLC

arXiv:2306.03083v1 [cs.RO] 5 Jun 2023

Figure 1. MotionDiffuser is a learned representation for the distribution of multi-agent trajectories based on diffusion models. During inference, samples from the predicted joint future distribution are first drawn i.i.d. from a random normal distribution (leftmost column), and gradually denoised using a learned denoiser into the final predictions (rightmost column). Diffusion allows us to learn a diverse, multimodal distribution over joint outputs (top right). Furthermore, guidance in the form of a differentiable cost function can be applied at inference time to obtain results satisfying additional priors and constraints (bottom right).

Abstract

We present MotionDiffuser, a diffusion-based representation for the joint distribution of future trajectories over multiple agents. Such a representation has several key advantages: first, our model learns a highly multimodal distribution that captures diverse future outcomes. Second, the simple predictor design requires only a single L2 loss training objective, and does not depend on trajectory anchors. Third, our model is capable of learning the joint distribution for the motion of multiple agents in a permutation-invariant manner. Furthermore, we utilize a compressed trajectory representation via PCA, which improves model performance and allows for efficient computation of the exact sample log probability. Subsequently, we propose a general constrained sampling framework that enables controlled trajectory sampling based on differentiable cost functions. This strategy enables a host of applications such as enforcing rules and physical priors, or creating tailored simulation scenarios. MotionDiffuser can be combined with existing backbone architectures to achieve top motion forecasting results. We obtain state-of-the-art results for multi-agent motion prediction on the Waymo Open Motion Dataset.

1. Introduction

Motion prediction is a central yet challenging problem for autonomous vehicles to safely navigate under uncertainty. In the autonomous driving setting, motion prediction refers to predicting the future trajectories of modeled agents, conditioned on the histories of the modeled agents, context agents, the road graph and traffic light signals.

Several key challenges arise in the motion prediction problem. First, motion prediction is probabilistic and multimodal in nature, and it is important to faithfully predict an unbiased distribution of possible futures. Second, motion prediction requires jointly reasoning about the future distribution for a set of agents that may interact with each other in each such future. Naively predicting and sampling from the marginal distribution of trajectories for each agent independently leads to unrealistic and often conflicting outcomes. Last but not least, while it is challenging to constrain or bias the predictions of conventional regression-based trajectory models, guided sampling of the trajectories is often required. For example, it may be useful to enforce rules or physical priors for creating tailored simulation scenarios. This requires the ability to enforce constraints over the future time steps, or to enforce a specified behavior for one or more agents among a set of agents.
Figure 2. Overview of multi-agent motion prediction using diffusion models. The input scene containing agent history, traffic lights and road graph is encoded via a transformer encoder into a set of condition tokens C. During training, a random set of noises is sampled i.i.d. from a normal distribution and added to the ground truth (GT) trajectory. The denoiser, while attending to the condition tokens, predicts the denoised trajectories corresponding to each agent. The entire model can be trained end-to-end using a simple L2 loss between the predicted denoised trajectory and the GT trajectory. During inference, a population of trajectories for each agent can first be sampled from pure noise at the highest noise level σ_max, and iteratively denoised by the denoiser to produce a plausible distribution of future trajectories. An optional constraint in the form of an arbitrary differentiable loss function can be injected into the denoising process to enforce constraints.

In light of these challenges, we present MotionDiffuser, a denoising diffusion model-based representation for the joint distribution of future trajectories for a set of agents (see Fig. 2). MotionDiffuser leverages a conditional denoising diffusion model. Denoising diffusion models [16, 23, 33, 43, 44] (henceforth, diffusion models) are a class of generative models that learn a denoising function based on noisy data and sample from a learned data distribution by iteratively refining a noisy sample starting from pure Gaussian noise (see Fig. 1). Diffusion models have recently gained immense popularity due to their simplicity, strong capacity to represent complex, high-dimensional and multimodal distributions, ability to solve inverse problems [4, 6, 24, 44], and effectiveness across multiple problem domains, including image generation [36, 37, 39], video generation [15, 18, 49] and 3D shape generation [35].

Building on top of conditional diffusion models as a basis for trajectory generation, we propose several unique design improvements for the multi-agent motion prediction problem. First, we propose a cross-attention-based permutation-invariant denoiser architecture for learning the motion distribution of a set of agents regardless of their ordering. Second, we propose a general and flexible framework for performing controlled and guided trajectory sampling based on arbitrary differentiable cost functions of the trajectories, which enables several interesting applications such as rules and controls on the trajectories, trajectory in-painting and creating tailored simulation scenarios. Finally, we propose several enhancements to the representation, including PCA-based latent trajectory diffusion and improved trajectory sample clustering to further boost the performance of our model.

In summary, the main contributions of this work are:

• A novel permutation-invariant, multi-agent joint motion distribution representation using conditional diffusion models.

• A general and flexible framework for performing controlled and guided trajectory sampling based on arbitrary differentiable cost functions of the trajectories, with a range of novel applications.

• Several significant enhancements to the representation, including a PCA-based latent trajectory diffusion formulation and an improved trajectory sample clustering algorithm to further boost model performance.

2. Related Work

Denoising diffusion models. Denoising diffusion models [16, 33], methodologically closely related to the class of score-based generative models [23, 43, 44], have recently emerged as a powerful class of generative models that demonstrate high sample quality across a wide range of application domains, including image generation [36, 37, 39], video generation [15, 18, 49] and 3D shape generation [35]. We are among the first to use diffusion models for predicting the joint motion of agents.
Scene- shown to be effective at solving inverse problems such as Transformer[32]outputsafixedsetofjointmotionpredic- image in-painting, colorization and sparse-view computed tionsforalltheagentsinthescene. M2I[45],WIMP[25], tomography by using a controllable sampling process [4– PIP[42],andCBP[47]proposeaconditionalmodelwhere 6,22,24,43,44]. Concurrentwork[53]exploresdiffusion themotionsoftheotheragentsarepredictedbygivenmo- modelingforcontrollabletrafficgeneration,whichwecom- tionsofthecontrolledagents. paretoinSec. 3.4. Indiffusionmodels,thegenerationpro- There is a set of literature using probabilistic graphical cesscanbeconditionedoninformationnotavailableduring models. DSDNet [51] and MFP [46] use fully connected training. The inverse problem can be posed as sampling graphs. JFP [27] supports static graphs such as fully con- fromtheposteriorp(x;y)basedonalearnedunconditional nectedgraphsandautonomousvehiclecenteredgraphs,and distributionp(x),whereyisanobservationoftheeventx. dynamic graphs where the edges are constructed between WedeferfurthertechnicaldetailstoSec. 3.4. theinteractingagents. RAIN[26]learnsthedynamicgraph oftheinteractionthroughseparateRLtraining. Motionprediction Therearetwomaincategoriesofap- 3.Method proaches for motion prediction: supervised learning and generative learning. Supervised learning trains a model 3.1.DiffusionModelPreliminaries with logged trajectories with supervised losses such as L2 Preliminaries Diffusion models [23] provide a learned loss. One of the challenges is to model inherent multi- parameterization of the probability distribution p (x) modal behavior of the agents. For this, MultiPath [40] θ through learnable parameters θ. Denote this probability usesstaticanchors,andMultiPath++[48],Wayformer[31], densityfunction,convolvedwithaGaussiankernelofstan- SceneTransformer[32]uselearnedanchors,andDenseTNT dard deviation σ to be p (x,σ). Instead of directly learn- [13] uses goal-based predictions. Home [9] and GoHome θ inganormalizedprobabilitydensityfunctionp (x)where [10] predict future occupancy heatmaps, and then decode θ thenormalizationconstantisgenerallyintractable[19],dif- trajectories from the samples. MP3 [2] and NMP [50] fusion models learn the score function of the distribution: learn the cost function evaluator of trajectories, and then logp (x;σ)atarangeofnoiselevelsσ. the output trajectories are heuristically enumerated. Many ∇x θ Giventhescorefunction logp (x;σ),onecansam- of these approaches use ensembles for further diversified ∇x θ plefromthedistributionbydenoisinganoisesample. Sam- predictions. Thenextsectioncoversgenerativeapproaches. ples can be drawn from the underlying distribution x 0 ∼ p (x)viathefollowingdynamics: θ Generative models for motion prediction Various re- 0 cent works have modeled the motion prediction task as x =x(T)+ σ˙(t)σ(t) logp (x(t);σ(t))dt 0 x θ − ∇ a conditional probability inference problem of the form ZT p(s;c) using generative models, where s denote the fu- where x(T) (0,σ2 I) (1) ∼N max ture trajectories of one or more agents, and c denote the wherevarianceσ(t)isamonotonic,deterministicfunction context or observation. HP-GAN [1] learns a probability ofanauxiliaryparameteroftimet. Following[23],weuse density function(PDF) of futurehuman poses conditioned thelinearnoisescheduleσ(t)=t. Theinitialnoisesample on previous poses using an improved Wasserstein Gener- issampledi.i.d. fromaunitGaussianscaledtothehighest ativeAdversarialNetwork(GAN).ConditionalVariational standarddeviationσ(T)=σ . 
Sampling. We follow the ODE dynamics in Eqn. 1 when sampling the predictions. We utilize Heun's 2nd-order method for solving the corresponding ODE, using the default parameters and 32 sampling steps.
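The deterministic sampler described above can be written compactly. The sketch below follows the probability-flow ODE of Eqn. 1 with the linear schedule σ(t) = t and a Heun (2nd-order) corrector, using Eqn. 2 to convert denoiser outputs into scores; the geometric step discretization and the σ_min cutoff are assumptions rather than the paper's exact schedule.

```python
import numpy as np

def heun_sample(denoiser, context, shape, rng, sigma_max=80.0, sigma_min=0.002, n_steps=32):
    """Deterministic Heun sampler for the probability-flow ODE (Eqn. 1).

    With sigma(t) = t, the drift is dx/dsigma = -sigma * score = (x - D(x; sigma)) / sigma.
    """
    # Geometric discretization of noise levels from sigma_max down to ~0 (an assumption).
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, n_steps), 0.0)
    x = rng.standard_normal(shape) * sigma_max            # x(T) ~ N(0, sigma_max^2 I)
    for i in range(n_steps):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        d_cur = (x - denoiser(x, s_cur, context)) / s_cur  # drift at the current noise level
        x_euler = x + (s_next - s_cur) * d_cur             # Euler step
        if s_next > 0:                                     # Heun (trapezoidal) correction
            d_next = (x_euler - denoiser(x_euler, s_next, context)) / s_next
            x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)
        else:
            x = x_euler
    return x
```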
3.2. Diffusion Model for Multi-Agent Trajectories

One of the main contributions of this work is a framework for modeling the joint distribution of multi-agent trajectories using diffusion models. Denote the future trajectory of agent i as s_i ∈ R^{N_t × N_f}, where N_t is the number of future time steps and N_f is the number of features per time step, such as longitudinal and lateral positions, heading direction, etc. Denote c_i ∈ R^{···} as the learned ego-centric context encoding of the scene, including the road graph, traffic lights, histories of modeled and context agents, as well as interactions within these scene elements, centered around agent i. For generality, c could be of arbitrary dimensions, either as a single condition vector or as a set of context tokens. Denote the set of agent future trajectories as S ∈ R^{N_a × N_t × N_f} and the set of ego-centric context encodings as C ∈ R^{N_a × ···}, where |S| = |C| = N_a is the number of modeled agents. We append each agent's position and heading (relative to the ego vehicle) to its corresponding context vectors.

Conditional diffusion models. In this work, we are interested in the conditional setting of learning p_θ(x; c), where x denotes the future trajectories of a set of agents and c is the scene context. A simple modification is to augment both the denoiser D(x; c, σ) and the score function ∇_x log p(x; c, σ) with the condition c. Given a dataset χ_c augmented by conditions, χ_c = {(x_1, c_1), ..., (x_{N_d}, c_{N_d})}, the conditional denoiser can be learned by a conditional denoising score matching objective that minimizes the following:

  E_{x,c∼χ_c} E_{σ∼q(σ)} E_{ϵ∼N(0,σ²I)} || D_θ(x + ϵ; c, σ) − x ||²_2   (4)

which leads to the learned conditional score function:

  ∇_x log p_θ(x; c, σ) = (D_θ(x; c, σ) − x) / σ²   (5)

Preconditioning and training. Directly training the model with the denoising score matching objective (Eqn. 4) has various drawbacks. First, the input to the denoiser has non-unit variance: Var(x + ϵ) = Var(x) + Var(ϵ) = σ_data² + σ², σ ∈ [0, σ_max]. Second, at small noise levels σ, it is much easier for the model to predict the residual noise than to predict the clean signal. Following [23], we adopt a preconditioned form of the denoiser:

  D_θ(x; c, σ) = c_skip(σ) x + c_out(σ) F_θ(c_in(σ) x; c, c_noise(σ))   (6)

F_θ is the neural network to train; c_skip, c_in, c_out and c_noise respectively scale the skip connection to the noisy x, the input to the network, the output from the network, and the noise input σ to the network. We do not additionally scale c since it is the output of an encoder network, assumed to have modulated scales.

Denote the j-th permutation of the two sets as S^j, C^j, sharing a consistent ordering of the agents. We seek to model the set probability distribution of agent trajectories using diffusion models: p(S^j; C^j). Since the agent ordering in the scene is arbitrary, learning a permutation-invariant set probability distribution is essential, i.e.,

  p(S; C) = p(S^j; C^j),  ∀ j ∈ [1, N_a!]   (7)

To learn a permutation-invariant set probability distribution, we seek to learn a permutation-equivariant denoiser, i.e., when the order of the agents at the denoiser input permutes, the denoiser output follows the same permutation:

  D(S^j; C^j, σ) = D^j(S; C, σ),  ∀ j ∈ [1, N_a!]   (8)

Another major consideration for the denoiser architecture is the ability to effectively attend to the condition tensor c and the noise level σ. Both of these motivations prompt us to utilize the transformer as the main denoiser architecture. We utilize the scene encoder architecture from the state-of-the-art Wayformer [31] model to encode scene elements such as the road graph, agent histories and traffic light states into a set of latent embeddings. The denoiser takes as input the GT trajectory corresponding to each agent, perturbed with a random noise level σ ∼ q(σ), and the noise level σ. During the denoising process, the noisy input undergoes repeated blocks of self-attention between the agents and cross-attention to the set of context tokens per agent, and finally the results are projected to the same feature dimensionality as the inputs. Since we do not apply positional encoding along the agent dimension, transformers naturally preserve the equivariance among the tokens (agents), leading to the permutation-equivariance of the denoiser model. See Fig. 3 for a more detailed design of the transformer-based denoiser architecture.

Figure 3. Network architecture for the set denoiser D_θ(S; C, σ). The noisy trajectories s_1 ··· s_{N_a}, corresponding to agents 1 ··· N_a, are first concatenated with a random-Fourier-encoded noise level σ, before going through repeated blocks of self-attention among the set of trajectories and cross-attention with respect to the condition tokens c_1 ··· c_{N_c}. The self-attention allows the diffusion model to learn a joint distribution across the agents, and the cross-attention allows the model to learn a more accurate scene-conditional distribution. Note that each agent cross-attends to its own condition tokens from the agent-centric scene encoding (not shown for simplicity). The [learnable components] are marked with brackets.
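A minimal PyTorch sketch of one denoiser block matching the description above: self-attention across the agents' noisy trajectory tokens followed by cross-attention to each agent's own condition tokens. The hidden size, head count, normalization placement and the way σ is injected are illustrative assumptions (the paper conditions on σ via c_noise and random Fourier features).

```python
import torch
import torch.nn as nn

class SetDenoiserBlock(nn.Module):
    """One block of the permutation-equivariant set denoiser (a sketch).

    Self-attention runs across the N_a agent tokens (no positional encoding along the
    agent axis, so the block is permutation-equivariant); cross-attention lets each
    agent attend to its own agent-centric condition tokens.
    """
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, agent_tokens, cond_tokens):
        # agent_tokens: (B, N_a, d_model) noisy per-agent trajectory embeddings.
        # cond_tokens:  (B * N_a, N_c, d_model) per-agent scene condition tokens.
        B, Na, D = agent_tokens.shape
        h = agent_tokens
        n1 = self.norm1(h)
        h = h + self.self_attn(n1, n1, n1)[0]                 # joint reasoning across agents
        q = self.norm2(h).reshape(B * Na, 1, D)               # each agent queries only its
        h = h + self.cross_attn(q, cond_tokens, cond_tokens)[0].reshape(B, Na, D)  # own context
        return h + self.mlp(self.norm3(h))
```

Stacking a few such blocks and projecting back to the trajectory (or PCA) dimensionality yields a denoiser that satisfies the equivariance property in Eqn. 8.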
3.3. Exact Log Probability Inference

With our model, we can infer the exact log probability of the generated samples with the following method. First, the change of log density over time follows a second differential equation, called the instantaneous change of variables formula [3]:

  ∂ log p(x(t)) / ∂t = −Tr( ∂f / ∂x(t) ),  where f = ∂x / ∂t   (9)

In the diffusion model, the flow function f follows:

  f(x(t), t) = ∂x(t) / ∂t = −σ̇(t) σ(t) ∇_x log p(x(t); σ(t))   (10)

The log probability of the sample can be calculated by integrating over time as below:

  log p(x(0)) = log p(x(T)) − ∫_T^0 Tr( ∂f / ∂x(t) ) dt   (11)

The computation of the trace of the Jacobian takes O(n²), where n is the dimensionality of x. When we use PCA as in Sec. 3.5, n is much smaller than the dimensionality of the original data. We can also use Hutchinson's trace estimator as in FFJORD [12], which takes O(n).

The log probability can be used for filtering higher-probability predictions. In Fig. 4, for example, higher-probability samples plotted with lighter colors are more likely.

Figure 4. Inferred exact log probability of 64 sampled trajectories per agent. Higher-probability samples are plotted with lighter colors. The orange agent represents the AV (autonomous vehicle).
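The trace term in Eqn. 11 need not be formed explicitly. Below is a hedged PyTorch sketch of Hutchinson's estimator for Tr(∂f/∂x) at a single state; integrating it along the ODE (and the number and type of probe vectors) is an implementation choice in the spirit of FFJORD [12], not a detail specified by the paper.

```python
import torch

def hutchinson_divergence(f, x, n_probes=4):
    """Estimate Tr(df/dx) at x via Hutchinson's estimator: E_v[ v^T (df/dx) v ].

    f: callable mapping a (D,) tensor to a (D,) tensor, e.g. the probability-flow
       drift of Eqn. 10 evaluated at a fixed time t.
    x: (D,) point at which to estimate the divergence.
    """
    x = x.detach().requires_grad_(True)
    y = f(x)
    est = 0.0
    for _ in range(n_probes):
        # Rademacher probe vector with entries in {-1, +1}.
        v = (torch.rand_like(x) < 0.5).to(x.dtype) * 2 - 1
        # v^T (df/dx) is the vector-Jacobian product of y with v; dot with v again.
        vjp = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)[0]
        est = est + torch.dot(vjp, v)
    return est / n_probes
```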
3.4. Constraining Trajectory Samples

Constrained trajectory sampling has a range of applications. One situation where controllability of the sampled trajectories is required is injecting physical rules and constraints. For example, agent trajectories should avoid collisions with static objects and other road users. Another application is trajectory in-painting: solving the inverse problem of completing the trajectory prediction given one or more control points. This is a useful tool for creating custom traffic scenarios for autonomous vehicle development and simulation.

More formally, we seek to sample from the joint conditional distribution p(S; C) · q(S; C), where p(S; C) is the learned future distribution for trajectories and q(S; C) is a secondary distribution representing the constraint manifold for S. The score of this joint distribution is ∇_S log( p(S; C) q(S; C) ) = ∇_S log p(S; C) + ∇_S log q(S; C). In order to sample from this joint distribution, we need the joint score function at all noise levels σ:

  ∇_S log p(S; C, σ) + ∇_S log q(S; C, σ)   (12)

The first term directly corresponds to the conditional score function in Eqn. 5. The second term accounts for gradient guidance based on the constraint, which resembles classifier-based guidance [17] in class-conditional image generation tasks, where a specialty neural network is trained to estimate this guidance term under a range of noise levels. We refer to this as the constraint gradient score. However, since our goal is to approximate the constraint gradient score with an arbitrary differentiable cost function of the trajectory, how is this a function of the noise parameter σ?

The key insight is to exploit the duality between any intermediate noisy trajectory S and the denoised trajectory at that noise level, D(S; C, σ). While S is clearly off the data manifold and not a physical trajectory, D(S; C, σ) usually closely resembles a physical trajectory on the data manifold, since it is trained to regress for the ground truth (Eqn. 4), even at a high σ value. The denoised event and the noisy event converge in the limit σ → 0. In this light, we approximate the constraint gradient score as:

  ∇_S log q(S; C, σ) ≈ −λ ∂/∂S L( D(S; C, σ) )   (13)

where L : R^{N_a × N_t × N_f} → R is an arbitrary cost function for the set of sampled trajectories, and λ is a hyperparameter controlling the weight of this constraint.

In this work, we introduce two simple cost functions for trajectory control: an attractor and a repeller. Attractors encourage the predicted trajectory at certain timesteps to arrive at certain locations. Repellers discourage interacting agents from getting too close to each other and mitigate collisions. We define the costs as:

Attractor cost

  L_attract( D(S; C, σ) ) = Σ | ( D(S; C, σ) − S_target ) ⊙ M_target | / ( Σ | M_target | + eps )   (14)

where S_target ∈ R^{N_a × N_t × N_f} is the target location tensor, M_target is a binary mask tensor indicating which target locations in S to enforce, ⊙ denotes the elementwise product, and eps denotes an infinitesimal value to prevent underflow.

Repeller cost

  A = max( ( 1 − ∆( D(S; C, σ) ) / r ) ⊙ ( 1 − I ), 0 )   (15)

  L_repell( D(S) ) = Σ A / ( Σ (A > 0) + eps )   (16)

where A is the per-time-step repeller cost. We denote the pairwise L2 distance function between all pairs of denoised agents at all time steps as ∆( D(S; C, σ) ) ∈ R^{N_a × N_a × N_t}, the identity tensor broadcast to all N_t time steps as I ∈ R^{N_a × N_a × N_t}, and the repeller radius as r.

Constraint score thresholding. To further increase the stability of the constrained sampling process, we propose a simple and effective strategy: constraint score thresholding (ST). From Eqn. 2, we make the observation that:

  σ ∇_x log p(x; σ) = ( D(x; σ) − x ) / σ = ϵ,  ϵ ∼ N(0, I)   (17)

Therefore, we adjust the constraint score in Eqn. 13 via an elementwise clipping function:

  ∇_S log q(S; C, σ) := clip( σ ∇_S log q(S; C, σ), ±1 ) / σ   (18)

We ablate this design choice in Table 2.
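A minimal PyTorch sketch tying together Eqns. 13, 14 and 17–18: the attractor cost is evaluated on the denoised trajectories, its gradient with respect to the noisy sample is added to the learned score, and the constraint score is clipped. Autograd, the λ value and the callable names are assumptions standing in for the paper's denoiser and training setup.

```python
import torch

def attractor_cost(denoised, target, mask, eps=1e-6):
    """Eqn. 14: masked mean absolute deviation between D(S;C,sigma) and S_target."""
    return (torch.abs(denoised - target) * mask).sum() / (mask.sum() + eps)

def guided_score(denoiser, S, sigma, context, target, mask, lam=1.0):
    """Learned score (Eqn. 5) plus the clipped constraint gradient score (Eqns. 13, 17-18)."""
    S = S.detach().requires_grad_(True)
    denoised = denoiser(S, sigma, context)
    score = (denoised - S) / sigma**2                       # Eqn. 5
    cost = attractor_cost(denoised, target, mask)           # differentiable cost L
    (grad_cost,) = torch.autograd.grad(cost, S)             # d L(D(S;C,sigma)) / d S
    constraint_score = -lam * grad_cost                     # Eqn. 13 (descend the cost)
    # Score thresholding (ST): clip sigma * constraint score elementwise to [-1, 1].
    constraint_score = torch.clamp(sigma * constraint_score, -1.0, 1.0) / sigma
    return (score + constraint_score).detach()              # joint score of Eqn. 12
```

The repeller cost of Eqns. 15–16 can be dropped in as an alternative `cost` without changing the rest of the guidance loop.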
3.5. Trajectory Representation Enhancements

Sample clustering. While MotionDiffuser learns an entire distribution of possible joint future trajectories from which we can draw an arbitrary number of samples, it is often necessary to extract a more limited number of representative modes from the output distribution. The Interaction Prediction challenge in the Waymo Open Motion Dataset, for instance, computes metrics based on a set of 6 predicted joint futures across modeled agents. Thus, we need to generate a representative set from the larger set of sampled trajectories. To this end, we follow the trajectory aggregation method defined in [48], which performs iterative greedy clustering to maximize the probability of trajectory samples falling within a fixed distance threshold of an output cluster. We refer readers to [48] for details on the clustering algorithm. In the joint agent prediction setting, we modify the clustering algorithm such that, for each joint prediction sample, we maximize the probability that all agent predictions fall within a distance threshold of an output cluster.

PCA latent diffusion. Inspired by the recent success of latent diffusion models [37] for image generation, we utilize a compressed representation for trajectories using Principal Component Analysis (PCA). PCA is particularly suitable for representing trajectories, as trajectories are temporally and geometrically smooth in nature and can be represented by a very small set of components. Our analysis shows that a mere 3 components (for trajectories with 80 × 2 degrees of freedom) account for 99.7% of all explained variance, though we use 10 components for a more accurate reconstruction. The PCA representation has multiple benefits, including faster inference, better success with controlled trajectories, and, perhaps most importantly, better accuracy and performance (see ablation studies in Sec. 5).

First, as many ground truth trajectories include missing timesteps (due to occlusion or the agent leaving the scene), we use linear interpolation / extrapolation to fill in the missing steps in each trajectory. We uniformly sample a large population of N_s = 10^5 agent trajectories, where each trajectory s_i ∈ R^{N_t N_f}, i ∈ [1, N_s], is first centered around the agent's current location, rotated such that the agent's heading points in the +y direction, and flattened into a single vector. Denote this random subset of agent trajectories as S' ∈ R^{N_s × N_t N_f}. We compute its corresponding principal component matrix (with whitening) as W_pca ∈ R^{N_p × (N_t N_f)}, where N_p is the number of principal components to use, and its mean as s̄' ∈ R^{N_t N_f}. We obtain the PCA and inverse PCA transformations for each trajectory s_i as:

  ŝ_i = (s_i − s̄) W_pca^T  ⇔  s_i = ŝ_i (W_pca^T)^{-1} + s̄   (19)

With the new representation, we have agent trajectories of Eqn. 7 in PCA space as S ∈ R^{N_a × N_p}.

4. Experiments and Results

4.1. PCA Mode Analysis

To motivate our use of PCA as a simple and accurate compressed trajectory representation, we analyze the principal components computed from N_s = 10^5 randomly selected trajectories from the Waymo Open Dataset training split. Fig. 5a shows the average reconstruction error per waypoint using increasing numbers of principal components. When keeping only the first 10 principal components, the average reconstruction error is 0.06 meters, which is significantly lower than the average prediction error achieved by state-of-the-art methods. This motivates PCA as an effective compression strategy, without the need for more complex strategies like the autoencoders in [37].

We visualize the top-10 principal components in Fig. 5b. The higher-order principal components are increasingly similar and deviate only slightly from the dataset mean. These components represent high-frequency trajectory information that is irrelevant for modeling, and may also be a result of perception noise.

Figure 5. Analysis of the PCA representation for agent trajectories. (a) PCA trajectory reconstruction error (m) versus the number of PCA components (up to 25 components). (b) Visualization of the top-10 PCA components for trajectories. The higher modes, representing higher frequencies, are increasingly similar and have a small impact on the final trajectory.

Table 1. WOMD Interactive Split: we report scene-level joint metric numbers averaged over all object types and over t = 3, 5, 8 seconds. Metrics minSADE, minSFDE, SMissRate and mAP are from the benchmark [7]. Overlap is defined in [27].

Split  Method                     Overlap (↓)  minSADE (↓)  minSFDE (↓)  SMissRate (↓)  mAP (↑)
Test   LSTM baseline [7]          -            1.91         5.03         0.78           0.05
Test   HeatIRm4 [30]              -            1.42         3.26         0.72           0.08
Test   SceneTransformer (J) [32]  -            0.98         2.19         0.49           0.12
Test   M2I [45]                   -            1.35         2.83         0.55           0.12
Test   DenseTNT [13]              -            1.14         2.49         0.54           0.16
Test   MultiPath++ [48]           -            1.00         2.33         0.54           0.17
Test   JFP [27]                   -            0.88         1.99         0.42           0.21
Test   MotionDiffuser (Ours)      -            0.86         1.95         0.43           0.20
Val    SceneTransformer (M) [32]  0.091        1.12         2.60         0.54           0.09
Val    SceneTransformer (J) [32]  0.046        0.97         2.17         0.49           0.12
Val    MultiPath++ [48]           0.064        1.00         2.33         0.54           0.18
Val    JFP [27]                   0.030        0.87         1.96         0.42           0.20
Val    Wayformer [31]             0.061        0.99         2.30         0.47           0.16
Val    MotionDiffuser (Ours)      0.036        0.86         1.92         0.42           0.19
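A NumPy sketch of the PCA trajectory compression in Eqn. 19, fitting N_p whitened components on flattened, agent-centered trajectories. The function names, the SVD-based fit and the exact whitening normalization are illustrative assumptions.

```python
import numpy as np

def fit_pca(trajs, n_components=10):
    """Fit whitened principal components on flattened trajectories of shape (Ns, Nt*Nf)."""
    mean = trajs.mean(axis=0)
    centered = trajs - mean
    # SVD of the centered data; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scale = s[:n_components] / np.sqrt(len(trajs))      # approximate per-component std (whitening)
    w = vt[:n_components] / scale[:, None]              # W_pca with whitening folded in
    w_inv = vt[:n_components] * scale[:, None]          # used to invert the whitened projection
    return mean, w, w_inv

def to_pca(s, mean, w):
    """s_hat = (s - s_mean) W_pca^T  (forward transform of Eqn. 19)."""
    return (s - mean) @ w.T

def from_pca(s_hat, mean, w_inv):
    """s = s_hat (W_pca^T)^{-1} + s_mean  (inverse transform of Eqn. 19, restricted to N_p modes)."""
    return s_hat @ w_inv + mean
```

Diffusion then runs directly in the N_p-dimensional latent space, and samples are mapped back to trajectories with `from_pca`.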
4.2. Multi-Agent Motion Prediction

To evaluate MotionDiffuser's performance in the multi-agent prediction setting, we assess our method on the Waymo Open Motion Dataset Interactive split, which contains pairs of agents in highly interactive and diverse scenarios [7].

In Table 1, we report the main metrics for the Interactive split, as defined in [7]. minSADE measures the displacement between the ground-truth future agent trajectories and the closest joint prediction (out of 6 joint predictions), averaged over the future time horizon and over the pair of interacting agents. minSFDE measures the minimum joint displacement error at the endpoint of the time horizon. SMissRate measures the recall of the joint predictions, with distance thresholds defined as a function of agent speed and future timestep. Finally, mAP measures the joint mean average precision based on agent action types, such as left-turn and u-turn. The reported metrics are averaged over future time horizons (3s, 5s and 8s) and over agent types (vehicles, pedestrians and cyclists).

Additionally, we report results for the Overlap metric [27] by measuring the overlap rate of the most likely joint prediction, which captures the consistency of model predictions, as consistent joint predictions should not collide.

Our model achieves state-of-the-art results, as shown in Table 1. While MotionDiffuser and Wayformer [31] use the same backbone, our method performs significantly better across all metrics due to the strength of the diffusion head. Compared to JFP [27] on the test split, we demonstrate an improvement with respect to the minSADE and minSFDE metrics. For mAP and Overlap, our method performs slightly worse than JFP, but outperforms all other methods.

Table 2. Quantitative validation of controllable trajectory synthesis. We enforce the attractor or repeller constraints of Sec. 3.4.

                                 Realism (↓)                       Constraint Effectiveness
Method            minSADE   meanSADE   Overlap      minSFDE (↓)   meanSFDE (↓)   SR2m (↑)   SR5m (↑)
No Constraint     1.261     3.239      0.059        2.609         8.731          0.059      0.316
Attractor (to GT final point)
Optimization      4.563     5.385      0.054        0.010         0.074          1.000      1.000
CTG [53]          1.18      1.947      0.057        0.515         0.838          0.921      0.957
Ours (-ST)        1.094     2.083      0.042        0.627         1.078          0.913      0.949
Ours              0.533     2.194      0.040        0.007         0.747          0.952      0.994
Repeller (between the pair of agents)
Ours              1.359     3.229      0.008        2.875         8.888          0.063      0.317

4.3. Controllable Trajectory Synthesis

We experimentally validate the effectiveness of our controllable trajectory synthesis approach. In particular, we validate the attractor and repeller designs proposed in Sec. 3.4. We continue these experiments using the Interactive split from the Waymo Open Motion Dataset. In the experiments for both the attractor and the repeller, we use the same baseline diffusion model trained in Sec. 4.2. We randomly sample 64 trajectories from the predicted distribution. We report our results in Table 2. We measure min/mean ADE/FDE and overlap metrics, following Sec. 4.2. The mean metrics compute the mean quantity over the 64 predictions.

For the attractor experiment, we constrain the last point of all predicted trajectories to be close to the last point in the ground truth data. Therefore, min/mean SADE serves as a proxy for the realism of the predictions and how closely they stay to the data manifold. For baselines, we compare to two approaches: "Optimization" directly samples the trajectories from our diffusion model, followed by a postprocessing step via the Adam optimizer to enforce the constraints; "CTG" is a reimplementation of the sampling method in concurrent work [53] that performs an inner optimization loop to enforce constraints on the denoised samples during every step of the diffusion process. See Table 2 for detailed results.
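For concreteness, here is a small NumPy sketch of the scene-level joint minSADE / minSFDE described in Sec. 4.2 above, computed over K joint predictions for the interacting agent pair. The array shapes and the averaging order are assumptions that mirror the textual definitions rather than the official benchmark code.

```python
import numpy as np

def joint_min_sade_sfde(pred, gt):
    """pred: (K, Na, Nt, 2) joint predictions; gt: (Na, Nt, 2) ground-truth trajectories.

    minSADE: displacement averaged over agents and time, minimized over the K joint predictions.
    minSFDE: displacement at the final time step, averaged over agents, minimized over K.
    """
    dist = np.linalg.norm(pred - gt[None], axis=-1)   # (K, Na, Nt) per-waypoint displacement
    sade = dist.mean(axis=(1, 2))                     # average over agents and time, per sample
    sfde = dist[..., -1].mean(axis=1)                 # final-step error per joint sample
    return sade.min(), sfde.min()
```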
Although trajectory optimization after the sampling process has the strongest effect in enforcing constraints, it results in unrealistic trajectories. With our method we achieve a high level of effectiveness in enforcing the constraints, second only to the optimization baseline, while maintaining a high degree of realism. Additionally, we show qualitative comparisons of the optimized trajectories in Fig. 6.

For the repeller experiment, we add the repeller constraint (radius 5 m) between all pairs of modeled agents. We were able to decrease the overlap between joint predictions by an order of magnitude, demonstrating its effectiveness in repelling the modeled agents from each other.

Figure 6. Qualitative results for controllable trajectory synthesis (single-agent and multi-agent constraints; columns: No Constraint, Optimization, CTG, Ours). We apply an attractor-based constraint (marked as ×) on the last point of the trajectory. Without any constraint at inference time, the initial prediction distributions from MotionDiffuser ("No Constraint") are plausible yet dispersed. While test-time optimization of the predicted trajectories is effective at enforcing the constraints on model outputs, it deviates significantly from the data manifold, resulting in unrealistic outputs. Our method produces realistic and well-constrained results.

5. Ablation Studies

We validate the effectiveness of our proposed Score Thresholding (ST) approach in Table 2, with Ours (-ST) denoting the removal of this technique, which results in significantly worse constraint satisfaction.

Furthermore, we ablate critical components of the MotionDiffuser architecture in Table 3. We find that using the uncompressed trajectory representation, Ours (-PCA), degrades performance significantly. Ablating the self-attention layers in the denoiser architecture, Ours (-SelfAttention), while keeping the cross-attention layers (to allow for conditioning on the scene context and noise level), shows that attention between the modeled agents' noisy future trajectories is important for generating consistent joint predictions. Note that MotionDiffuser's performance in Table 3 is slightly worse than in Table 1 due to a reduced Wayformer encoder backbone size.

Table 3. Ablations on the WOMD Interactive Validation Split. We ablate components of the denoiser architecture and the PCA-compressed trajectory representation.

Method                  minSADE (↓)   minSFDE (↓)   SMissRate (↓)
Ours (-PCA)             1.03          2.29          0.53
Ours (-Transformer)     0.93          2.08          0.47
Ours (-SelfAttention)   0.91          2.07          0.46
MotionDiffuser (Ours)   0.88          1.97          0.43

6. Conclusion and Discussions

In this work, we introduced MotionDiffuser, a novel diffusion-model-based multi-agent motion prediction framework that allows us to learn a diverse, multimodal joint future distribution for multiple agents. We propose a novel transformer-based set denoiser architecture that is permutation invariant across agents. Furthermore, we propose a general and flexible constrained sampling framework, and demonstrate the effectiveness of two simple and useful constraints: the attractor and the repeller. We demonstrate state-of-the-art multi-agent motion prediction results and the effectiveness of our approach on the Waymo Open Motion Dataset.

Future work includes applying the diffusion-based generative modeling technique to other topics of interest in autonomous vehicles, such as planning and scene generation.

Acknowledgements. We thank Wenjie Luo for helping with the overlap metrics code, Ari Seff for helping with
Additionally, replacing multi-agent NMS, Rami Al-RFou, Charles Qi and Carlton the Transformer architecture with a simple MLP Ours(- Downeyforhelpfuldiscussions,JoaoMessiasforreviewing Transformer) reduces performance. We also ablate the themanuscript,andanonymousreviewers.References opendatasetmotionpredictionchallenge1stplacesolution. CoRR,abs/2106.14160,2021. 3,7 [1] Emad Barsoum, John Kender, and Zicheng Liu. Hp-gan: [14] TianpeiGu,GuangyiChen,JunlongLi,ChunzeLin,Yong- Probabilistic3dhumanmotionpredictionviagan. InPro- ming Rao, Jie Zhou, and Jiwen Lu. Stochastic trajectory ceedingsoftheIEEEconferenceoncomputervisionandpat- predictionviamotionindeterminacydiffusion. InProceed- ternrecognitionworkshops,pages1418–1427,2018. 3 ingsoftheIEEE/CVFConferenceonComputerVisionand [2] Sergio Casas, Abbas Sadat, and Raquel Urtasun. Mp3: A PatternRecognition,pages17113–17122,2022. 3 unified model to map, perceive, predict and plan. In Pro- [15] Jonathan Ho, WilliamChan, ChitwanSaharia, JayWhang, ceedingsoftheIEEE/CVFConferenceonComputerVision Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben andPatternRecognition,pages14403–14412,2021. 3 Poole, Mohammad Norouzi, David J Fleet, et al. Imagen [3] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and video:Highdefinitionvideogenerationwithdiffusionmod- David K Duvenaud. Neural ordinary differential equa- els. arXivpreprintarXiv:2210.02303,2022. 2 tions. Advances in neural information processing systems, [16] JonathanHo,AjayJain,andPieterAbbeel. Denoisingdiffu- 31,2018. 5 sionprobabilisticmodels. AdvancesinNeuralInformation [4] JooyoungChoi,SungwonKim,YonghyunJeong,Youngjune ProcessingSystems,33:6840–6851,2020. 2 Gwon, and Sungroh Yoon. Ilvr: Conditioning method for [17] Jonathan Ho and Tim Salimans. Classifier-free diffusion denoising diffusion probabilistic models. arXiv preprint guidance. arXivpreprintarXiv:2207.12598,2022. 5 arXiv:2108.02938,2021. 2,3 [18] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William [5] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Chan, Mohammad Norouzi, and David J Fleet. Video dif- Jong Chul Ye. Improving diffusion models for inverse fusionmodels. arXivpreprintarXiv:2204.03458,2022. 2 problems using manifold constraints. arXiv preprint [19] Aapo Hyva¨rinen and Peter Dayan. Estimation of non- arXiv:2206.00941,2022. normalized statistical models by score matching. Journal [6] Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. ofMachineLearningResearch,6(4),2005. 3 Come-closer-diffuse-faster: Accelerating conditional diffu- [20] Boris Ivanovic, Karen Leung, Edward Schmerling, and sionmodelsforinverseproblemsthroughstochasticcontrac- MarcoPavone. Multimodaldeepgenerativemodelsfortra- tion. InProceedingsoftheIEEE/CVFConferenceonCom- jectoryprediction:Aconditionalvariationalautoencoderap- puterVisionandPatternRecognition, pages12413–12422, proach. IEEE Robotics and Automation Letters, 6(2):295– 2022. 2,3 302,2020. 3 [7] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi [21] Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Liu, HangZhao, SabeekPradhan, YuningChai, BenSapp, Levine.Planningwithdiffusionforflexiblebehaviorsynthe- Charles Qi, Yin Zhou, Zoey Yang, Aurelien Chouard, sis.InInternationalConferenceonMachineLearning,2022. Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander Mc- 3 Cauley, Jonathon Shlens, and Dragomir Anguelov. 
Large [22] ZahraKadkhodaieandEeroSimoncelli.Stochasticsolutions scale interactive motion forecasting for autonomous driv- forlinearinverseproblemsusingthepriorimplicitinade- ing: The waymo open motion dataset. arXiv preprint noiser.AdvancesinNeuralInformationProcessingSystems, arXiv:2104.10133,2021. 7 34:13242–13254,2021. 3 [8] Samuel G Fadel, Sebastian Mair, Ricardo da Silva Torres, [23] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. andUlfBrefeld.Contextualmovementmodelsbasedonnor- Elucidating the design space of diffusion-based generative malizingflows.AStAAdvancesinStatisticalAnalysis,pages models. arXivpreprintarXiv:2206.00364,2022. 2,3,4 1–22,2021. 3 [24] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming [9] ThomasGilles,StefanoSabatini,DzmitryTsishkou,Bogdan Song.Denoisingdiffusionrestorationmodels.arXivpreprint Stanciulescu,andFabienMoutarde. Home:Heatmapoutput arXiv:2201.11793,2022. 2,3 for future motion estimation. In 2021 IEEE International [25] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew IntelligentTransportationSystemsConference(ITSC),pages Hartnett, and Deva Ramanan. What-if motion prediction 500–507.IEEE,2021. 3 forautonomousdriving. arXivpreprintarXiv:2008.10587, [10] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bog- 2020. 3 dan Stanciulescu, and Fabien Moutarde. Gohome: Graph- [26] Jiachen Li, Fan Yang, Hengbo Ma, Srikanth Malla, oriented heatmap output for future motion estimation. In Masayoshi Tomizuka, and Chiho Choi. Rain: Reinforced 2022InternationalConferenceonRoboticsandAutomation hybrid attention inference network for motion forecasting. (ICRA),pages9107–9114.IEEE,2022. 3 In Proceedings of the IEEE/CVF International Conference [11] Sebastian Gomez-Gonzalez, Sergey Prokudin, Bernhard onComputerVision,pages16096–16106,2021. 3 Scho¨lkopf, and Jan Peters. Real time trajectory prediction [27] WenjieLuo,CheolhoPark,AndreCornman,BenjaminSapp, using deep conditional generative models. IEEE Robotics and Dragomir Anguelov. Jfp: Joint future prediction with andAutomationLetters,5(2):970–976,2020. 3 interactivemulti-agentmodelingforautonomousdriving. In [12] Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Conf.OnRobotLearning,2022. 3,7 Sutskever, and David Duvenaud. Ffjord: Free-form con- [28] Yecheng Jason Ma, Jeevana Priya Inala, Dinesh Jayara- tinuousdynamicsforscalablereversiblegenerativemodels. man, and Osbert Bastani. Diverse sampling for normal- arXivpreprintarXiv:1810.01367,2018. 5 izing flow based trajectory forecasting. arXiv preprint [13] Junru Gu, Qiao Sun, and Hang Zhao. Densetnt: Waymo arXiv:2011.15084,7(8),2020. 3[29] WeiMao,MiaomiaoLiu,andMathieuSalzmann. Generat- informed trajectory prediction for autonomous driving. In ing smooth pose sequences for diverse human motion pre- EuropeanConferenceonComputerVision,pages598–614. diction.InProceedingsoftheIEEE/CVFInternationalCon- Springer,2020. 3 ferenceonComputerVision,pages13309–13318,2021. 3 [43] Jiaming Song, Chenlin Meng, and Stefano Ermon. [30] XiaoyuMo,ZhiyuHuang,andChenLv. Multi-modalinter- Denoising diffusion implicit models. arXiv preprint activeagenttrajectorypredictionusingheterogeneousedge- arXiv:2010.02502,2020. 2,3 enhanced graph attention network. In Workshop on Au- [44] YangSong,JaschaSohl-Dickstein,DiederikPKingma,Ab- tonomousDriving,CVPR,volume6,page7,2021. 7 hishekKumar,StefanoErmon,andBenPoole. Score-based [31] NigamaaNayakanti,RamiAl-Rfou,AurickZhou,Kratarth generative modeling through stochastic differential equa- Goel, Khaled S Refaat, and Benjamin Sapp. 
Wayformer: tions. arXivpreprintarXiv:2011.13456,2020. 2,3 Motionforecastingviasimple&efficientattentionnetworks. [45] Qiao Sun, Xin Huang, Junru Gu, Brian C Williams, arXivpreprintarXiv:2207.05844,2022. 3,4,7 and Hang Zhao. M2i: From factored marginal trajec- [32] Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zheng- tory prediction to interactive prediction. arXiv preprint dongZhang,Hao-TienLewisChiang,JeffreyLing,Rebecca arXiv:2202.11884,2022. 3,7 Roelofs,AlexBewley,ChenxiLiu,AshishVenugopal,David [46] Charlie Tang and Russ R Salakhutdinov. Multiple futures Weiss,BenjaminSapp,ZhifengChen,andJonathonShlens. prediction. InNeurIPS.2019. 3 Scenetransformer: Aunifiedmulti-taskmodelforbehavior [47] Ekaterina I. Tolstaya, Reza Mahjourian, Carlton Downey, predictionandplanning. CoRR,abs/2106.08417,2021. 3,7 Balakrishnan Varadarajan, Benjamin Sapp, and Dragomir [33] Alexander Quinn Nichol and Prafulla Dhariwal. Improved Anguelov. Identifying driver interactions via conditional denoising diffusion probabilistic models. In International behavior prediction. In IEEE International Conference on ConferenceonMachineLearning,pages8162–8171.PMLR, RoboticsandAutomation,ICRA2021,Xi’an,China,May30 2021. 2 -June5,2021,pages3473–3479.IEEE,2021. 3 [34] Geunseob Oh and Huei Peng. Cvae-h: Conditionaliz- [48] Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivas- ing variational autoencoders via hypernetworks and trajec- tava,KhaledSRefaat,NigamaaNayakanti,AndreCornman, tory forecasting for autonomous driving. arXiv preprint Kan Chen, Bertrand Douillard, Chi Pang Lam, Dragomir arXiv:2201.09874,2022. 3 Anguelov, etal. Multipath++: Efficientinformationfusion [35] BenPoole,AjayJain,JonathanTBarron,andBenMilden- and trajectory aggregation for behavior prediction. arXiv hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprintarXiv:2111.14973,2021. 3,6,7 preprintarXiv:2209.14988,2022. 2 [49] RuihanYang,PrakharSrivastava,andStephanMandt. Dif- [36] AdityaRamesh,PrafullaDhariwal,AlexNichol,CaseyChu, fusion probabilistic modeling for video generation. arXiv and Mark Chen. Hierarchical text-conditional image gen- preprintarXiv:2203.09481,2022. 2 erationwithcliplatents. arXivpreprintarXiv:2204.06125, [50] WenyuanZeng,WenjieLuo,SimonSuo,AbbasSadat,Bin 2022. 2 Yang, Sergio Casas, and Raquel Urtasun. End-to-end in- [37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, terpretable neural motion planner. In Proceedings of the Patrick Esser, and Bjo¨rn Ommer. High-resolution image IEEE/CVF Conference on Computer Vision and Pattern synthesis with latent diffusion models. In Proceedings of Recognition,pages8660–8669,2019. 3 theIEEE/CVFConferenceonComputerVisionandPattern [51] WenyuanZeng,ShenlongWang,RenjieLiao,YunChen,Bin Recognition,pages10684–10695,2022. 2,6 Yang, and Raquel Urtasun. Dsdnet: Deep structured self- [38] SaeedSaadatnejad,AliRasekh,MohammadrezaMofayezi, driving network. In European conference on computer vi- YasaminMedghalchi,SaraRajabzadeh,TaylorMordan,and sion,pages156–172.Springer,2020. 3 Alexandre Alahi. A generic diffusion-based approach for [52] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou 3d human pose prediction in the wild. arXiv preprint Hong, XinyingGuo, LeiYang, andZiweiLiu. Motiondif- arXiv:2210.05669,2022. 3 fuse: Text-driven human motion generation with diffusion [39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala model. arXivpreprintarXiv:2208.15001,2022. 
3 Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed [53] Ziyuan Zhong, Davis Rempe, Danfei Xu, Yuxiao Chen, Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, SushantVeer,TongChe,BaishakhiRay,andMarcoPavone. Rapha Gontijo Lopes, et al. Photorealistic text-to-image Guidedconditionaldiffusionforcontrollabletrafficsimula- diffusionmodelswithdeeplanguageunderstanding. arXiv tion. arXivpreprintarXiv:2210.17366,2022. 3,7 preprintarXiv:2205.11487,2022. 2 [40] BenjaminSapp,YuningChai,MayankBansal,andDragomir Anguelov. Multipath: Multipleprobabilisticanchortrajec- tory hypotheses for behavior prediction. In Conference on RobotLearning,pages86–99.PMLR,2020. 3 [41] ChristophScho¨llerandAloisKnoll. Flomo: Tractablemo- tion prediction with normalizing flows. In 2021 IEEE/RSJ InternationalConferenceonIntelligentRobotsandSystems (IROS),pages7977–7984.IEEE,2021. 3 [42] HaoranSong,WenchaoDing,YuxuanChen,ShaojieShen, Michael Yu Wang, and Qifeng Chen. Pip: Planning-MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion Chiyu“Max”Jiang AndreCornman CheolhoPark ∗ ∗ BenSapp YinZhou DragomirAnguelov equal contribution ∗ WaymoLLC 1.AdditionalVisualizations 1 3202 nuJ 5 ]OR.sc[ 1v38030.6032:viXra2.ImplementationDetails MotionDiffuseristrainedontheWaymoOpenMotionDatasetusing32TPUshardsfor2 106trainingsteps. Weusethe ∗ ADAMWoptimizer[? ] withweightdecaycoefficientof0.03. Thelearningrateissetto5 10 −4,with104 warmupsteps ∗ and linear learning rate decay. MotionDiffuser uses the Wayformer [? ] encoder backbone, with 128 latent embeddings, eachwithhiddensizeof256. BecausetheWayformerencoderisagentcentric,weappendeachagent’spositionandheading (relativetotheegovehicle)toitscorrespondingcontextvectors. Ourtransformerdenoiserarchitectureuses4layersofself-attentionandcross-attentionblocks. Eachattentionlayerhasa hiddensizeof256andanintermediatesizeof1024. ReLUactivationisusedinalltransformerlayers. Weembedthenoise levelusing128randomfourierfeatures. We can flexibly denoise N random noise vectors during training and inference. We use N = 128 during training and N =256duringinference(beforeapplyingclustering). 3.NetworkPreconditioning Wefollowthenetworkpreconditioningframeworkfrom[? ],whichdefinesthedenoiserD θ as: D (x;c,σ)=c (σ)x+c (σ)F (c (σ)x;c,c (σ)) (1) θ skip out θ in noise c (σ)scalesthenetworkinput,suchthatthetraininginputstoF haveunitvariance. in θ c (σ)=1/ σ2+σ2 (2) in data c (σ)modulatestheskipconnectionandisdefinedas: q skip c (σ)=σ2 /(σ2+σ2 ) (3) skip data data c (σ)modulatesthenetworkoutputandisdefinedas: out c (σ)=σ σ / σ2+σ2 (4) out · data data Finallyc (σ)scalesthenoiselevel,andisdefinedas: q noise 1 c (σ)= lnσ (5) noise 4 Forallourexperiments,wesetσ =0.5. data 4.InferenceLatency We report our model’s inference latency over a varying number of sampling steps T in Table 1. We use a single V100 GPU,withbatchsizeof1. Method Latency(ms) minSADE( ) minSFDE( ) SMissRate( ) ↓ ↓ ↓ Ours(T =8) 101.0 0.91 2.06 0.47 Ours(T =16) 203.7 0.88 1.96 0.44 Ours(T =32) 408.5 0.88 1.97 0.43 Table1.Modelinferencelatencyvs.qualityforWOMDInteractiveValidationSplit.
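The network preconditioning coefficients defined in the supplementary section above (with σ_data = 0.5) can be written directly as a small helper. This is a sketch of that section's Eqns. 1–5; the inner network call F_theta and its argument order are left abstract and are assumptions for illustration.

```python
import numpy as np

SIGMA_DATA = 0.5  # value used for all experiments in the supplementary material

def precondition(F_theta, x, cond, sigma):
    """D_theta(x; c, sigma) = c_skip(sigma) * x + c_out(sigma) * F_theta(c_in(sigma) * x; c, c_noise(sigma))."""
    c_in = 1.0 / np.sqrt(sigma**2 + SIGMA_DATA**2)          # unit-variance network input
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)      # skip connection to the noisy x
    c_out = sigma * SIGMA_DATA / np.sqrt(sigma**2 + SIGMA_DATA**2)  # output scaling
    c_noise = 0.25 * np.log(sigma)                           # noise-level conditioning
    return c_skip * x + c_out * F_theta(c_in * x, cond, c_noise)
```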