Wayformer: Motion Forecasting via Simple & Efficient Attention Networks

Nigamaa Nayakanti* (nigamaa@waymo.com), Rami Al-Rfou* (rmyeid@waymo.com), Aurick Zhou (aurickz@waymo.com), Kratarth Goel (kratarth@waymo.com), Khaled S. Refaat (krefaat@waymo.com), Benjamin Sapp (bensapp@waymo.com)
*Equal contribution.

Abstract: Motion forecasting for autonomous driving is a challenging task because complex driving scenarios result in a heterogeneous mix of static and dynamic inputs. It is an open problem how best to represent and fuse information about road geometry, lane connectivity, time-varying traffic light state, and the history of a dynamic set of agents and their interactions into an effective encoding. To model this diverse set of input features, many approaches propose to design an equally complex system with a diverse set of modality-specific modules. This results in systems that are difficult to scale, extend, or tune in rigorous ways to trade off quality and efficiency.

In this paper, we present Wayformer, a family of attention-based architectures for motion forecasting that are simple and homogeneous. Wayformer offers a compact model description consisting of an attention-based scene encoder and a decoder. In the scene encoder we study the choice of early, late and hierarchical fusion of input modalities. For each fusion type we explore strategies to trade off efficiency and quality via factorized attention or latent query attention. We show that early fusion, despite its simplicity of construction, is not only modality agnostic but also achieves state-of-the-art results on both the Waymo Open Motion Dataset (WOMD) and Argoverse leaderboards, demonstrating the effectiveness of our design philosophy.

Keywords: Motion Forecasting, Trajectory Prediction, Autonomous Driving, Transformer, Robotics, Learning

1 Introduction

In this work, we focus on the general task of future behavior prediction of agents (pedestrians, vehicles, cyclists) in real-world driving environments. This is an essential task for safe and comfortable human-robot interactions, enabling high-impact robotics applications like autonomous driving.

The modeling needed for such scene understanding is challenging for many reasons. For one, the output is highly unstructured and multimodal: a person driving a vehicle could carry out one of many underlying intents unknown to an observer, and representing a distribution over diverse and disjoint possible futures is required. A second challenge is that the input consists of a heterogeneous mix of modalities, including agents' past physical state, static road information (e.g. location of lanes and their connectivity), and time-varying traffic light information.

Figure 1: The Wayformer architecture as a pair of encoder/decoder Transformer networks. This model takes multimodal scene data as input and produces a multimodal distribution of trajectories.

Many previous efforts address how to model the multimodal output [1, 2, 3, 4, 5, 6], and develop hand-engineered architectures to fuse different input types, each requiring their own preprocessing (e.g., image rasterization [7, 2, 8]). Here, we focus on the multimodality of the input space, and develop a simple yet effective modality-agnostic framework that avoids complex and heterogeneous architectures and leads to a simpler architecture parameterization. This compact description of a family of architectures results in a simpler design space and allows us to more directly and effectively control for trade-offs in model quality and latency by tuning model computation and capacity.
To keep complexity under control without sacrificing quality or efficiency, we need to find general modeling primitives that can handle multimodal features existing in temporal and spatial dimensions concurrently. Recently, several approaches proposed Transformer networks as the networks of choice for motion forecasting problems [9, 10, 11, 12, 13]. While these approaches offer simplified model architectures, they still require domain expertise and excessive modality-specific tuning. [14] proposed a stack of cross-attention layers sequentially processing one modality at a time. The order in which to process each modality is left to the designer, and enumerating all possibilities is combinatorially prohibitive. [3] proposed using separate encoders for each modality, where the type of network and its capacity is open for tuning on a per-modality basis. The modalities' embeddings are then flattened and a single vector is fed to the predictor. While these approaches allow for many degrees of freedom, they increase the search space significantly. Without efficient network architecture search or significant human input and hand engineering, the chosen models will likely be sub-optimal, given that only a limited portion of the modeling options has been explored.

Our experiments suggest the domain of motion forecasting conforms to Occam's Razor: we show state-of-the-art results with the simplest design choices and minimal domain-specific assumptions, in stark contrast to previous work. When tested in simulation and on real AVs, these Wayformer models showed good understanding of the scene.

Our contributions can be summarized as follows:

• We design a family of models with two basic primitives: a self-attention encoder, where we fuse one or more modalities across temporal and spatial dimensions, and a cross-attention decoder, where we attend to driving scene elements to produce a diverse set of trajectories.
• We study three variations of the scene encoder that differ in how and when different input modalities are fused.
• To keep our proposed models within practical real-time constraints of motion forecasting, we study two common techniques to speed up self-attention: factorized attention and latent query attention.
• We achieve state-of-the-art results on both WOMD and Argoverse challenges.

2 Multimodal Scene Understanding

Driving scenarios consist of multimodal data, such as road information, traffic light state, agent history, and agent interactions. In this section we detail the representation of these modalities in our setup. For readability, we define the following symbols: A denotes the number of modeled ego-agents, T denotes the number of past and current time steps being considered in the history, and D_m denotes the feature size of modality m. For a modality m, we might have a 4th dimension S_m representing a "set of contextual objects" (i.e. representations of other road users) for each modeled agent.

Agent History: contains a sequence of past agent states along with the current state, [A, T, 1, D_h]. For each time step t ∈ T, we consider features that define the state of the agent, e.g. x, y, velocity, acceleration, bounding box, and so on. We include a context dimension S_h = 1 for homogeneity.

Agent Interactions: The interaction tensor [A, T, S_i, D_i] represents the relationship between agents. For each modeled agent a ∈ A, a fixed number of the closest context agents c ∈ S_i around the modeled agent are considered. These context agents represent the agents which influence the behavior of the modeled agent. The features in D_i represent the physical state of each context agent (as in D_h above), but transformed into the frame of reference of the ego-agent.

Roadgraph: The roadgraph [A, 1, S_r, D_r] contains road features around the agent.
Following [2], we represent roadgraph segments as polylines, approximating the road shape with collections of line segments specified by their endpoints and annotated with type information. We use the S_r roadgraph segments closest to the modeled agent. Note that there is no time dimension for the road features, but we include a time dimension of 1 for homogeneity with the other modalities.

(a) Late Fusion (b) Early Fusion (c) Hierarchical Fusion
Figure 2: Wayformer scene encoder fusing multimodal inputs at different stages. Late fusion dedicates an attention encoder per modality, while early fusion processes all inputs within one cross-modal encoder. Finally, hierarchical fusion combines both approaches.

Traffic Light State: For each agent a ∈ A, the traffic light information [A, T, S_tls, D_tls] contains the states of the traffic signals that are closest to that agent. Each traffic signal point tls ∈ S_tls has features D_tls describing the position and confidence of the signal.

3 Wayformer

We design the family of Wayformer models to consist of two main components: a Scene Encoder and a Decoder. The scene encoder is mainly composed of one or more attention encoders that summarize the driving scene. The decoder is a stack of one or more standard Transformer cross-attention blocks, in which learned initial queries are fed in and then cross-attended with the scene encoding to produce trajectories. Figure 1 shows the Wayformer model processing multimodal inputs to produce the scene encoding. This scene encoding serves as the context for the decoder to generate k possible trajectories covering the multimodality of the output space.

Frame of Reference: As our model is trained to produce futures for a single agent, we transform the scene into an ego-centric frame of reference by centering and rotating the scene's spatial features around the ego-agent's position and heading at the current time step.

Projection Layers: Different input modalities may not share the same number of features, so we project them to a common dimension D before concatenating all modalities along the temporal and spatial dimensions [S, T]. We found the simple transformation Projection(x_i) = relu(W x_i + b), where x_i ∈ R^{D_m}, b ∈ R^D, and W ∈ R^{D×D_m}, to be sufficient. Concretely, given an input of shape [A, T, S_m, D_m] we project its last dimension, producing a tensor of size [A, T, S_m, D].

Positional Embeddings: Self-attention is naturally permutation equivariant; therefore, we may think of attention encoders as set encoders rather than sequence encoders. However, for modalities where the data does follow a specific ordering, for example agent state across different time steps, it is beneficial to break permutation equivariance and utilize the sequence information. This is commonly done through positional embeddings. For simplicity, we add learned positional embeddings for all modalities. As not all modalities are ordered, the learned positional embeddings are initially set to zero, letting the model learn whether it is necessary to utilize the ordering within a modality.
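To make the per-modality projection and the zero-initialized learned positional embeddings concrete, here is a minimal NumPy sketch. All shapes, initializations, and the final flatten-and-concatenate step (mirroring the early-fusion input) are illustrative assumptions of ours, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_projection(d_in, d_out):
    # Per-modality learned projection: Projection(x) = relu(W x + b),
    # applied to the last (feature) dimension only.
    W = rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
    b = np.zeros(d_out)
    return lambda x: np.maximum(x @ W + b, 0.0)

D = 64                                              # common hidden size (hypothetical)
# Example modality tensors [A, T, S_m, D_m]; sizes here are illustrative only.
history      = rng.normal(size=(8, 11,   1, 12))    # agent history
interactions = rng.normal(size=(8, 11,  16, 12))    # closest context agents
roadgraph    = rng.normal(size=(8,  1, 128,  7))    # polyline segments (T = 1)
traffic      = rng.normal(size=(8, 11,   4,  5))    # nearest traffic signals

tokens = []
for x in (history, interactions, roadgraph, traffic):
    A, T, S, D_m = x.shape
    proj = make_projection(D_m, D)(x)               # [A, T, S, D]
    # Zero-initialized learned positional embedding, one vector per (t, s) slot;
    # starting at zero lets the model ignore ordering for unordered modalities.
    pos = np.zeros((1, T, S, D))
    tokens.append((proj + pos).reshape(A, T * S, D))

scene_tokens = np.concatenate(tokens, axis=1)       # [A, sum_m T*S_m, D]
print(scene_tokens.shape)                           # (8, 359, 64)
```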
3.1 Fusion

Once projections and positional embeddings are applied to the different modalities, the scene encoder combines the information from all modalities to generate a representation of the environment. Concretely, we aim to learn a scene representation Z = Encoder({m_0, m_1, ..., m_k}), where m_i ∈ R^{A×(T×S_m)×D}, Z ∈ R^{A×L×D}, and L is a hyperparameter. However, the diversity of input sources makes this integration a non-trivial task. Modalities might not be represented at the same abstraction level or scale (pixels vs. objects); therefore, some modalities might require more computation than others. Splitting compute and parameter count among modalities is application specific and non-trivial to hand-engineer. We attempt to simplify the process by proposing three levels of fusion: {Late, Early, Hierarchical}.

Late Fusion: This is the most common approach used by motion forecasting models, where each modality has its own dedicated encoder (see Figure 2). We set the width of these encoders to be equal to avoid introducing extra projection layers on their outputs. Moreover, we share the same depth across all encoders to narrow down the exploration space to a manageable scope. Transfer of information across modalities is allowed only in the cross-attention layers of the trajectory decoder.

(a) Encoder Blocks (b) Encoders
Figure 3: A summary of encoder architectures considered for Wayformer. (a) provides an overview of different encoder blocks and (b) explains how these blocks are arranged to construct the encoder.

Early Fusion: Instead of dedicating a self-attention encoder to each modality, early fusion reduces modality-specific parameters to only the projection layers (see Figure 2). In this paradigm, the scene encoder consists of a single self-attention encoder (the "cross-modal encoder"), giving the network maximum flexibility in assigning importance across modalities with minimal inductive bias.

Hierarchical Fusion: As a compromise between the two previous extremes, capacity is split between modality-specific self-attention encoders and the cross-modal encoder in a hierarchical fashion. As in late fusion, width and depth are shared across the attention encoders and the cross-modal encoder. This effectively splits the depth of the scene encoder between modality-specific encoders and the cross-modal encoder (Figure 2).

3.2 Attention

Transformer networks do not scale well to large multidimensional sequences due to two factors: (a) self-attention is quadratic in the input sequence length, and (b) position-wise feed-forward networks are expensive sub-networks. In the following sections, we discuss speedups to the Transformer networks that help us scale more effectively.

Multi-Axis Attention: This refers to the default Transformer setting, which applies self-attention across both spatial and temporal dimensions simultaneously (see Figure 3b); we expect it to be the most expensive computationally. The computational complexity of early, late and hierarchical fusion with multi-axis attention is O(S_m^2 × T^2).

Factorized Attention: The computational complexity of self-attention is quadratic in the input sequence length. This becomes more pronounced in multidimensional sequences, since each extra dimension increases the size of the input by a multiplicative factor. For example, some input modalities have both temporal and spatial dimensions, so the compute cost scales as O(S_m^2 × T^2). To alleviate this, we consider factorized attention [15, 16] along the two dimensions. This exploits the multidimensional structure of input sequences by applying self-attention over each dimension individually, which reduces the cost of the self-attention sub-network from O(S_m^2 × T^2) to O(S_m^2) + O(T^2). Note that the linear term still tends to dominate if Σ_m S_m × T ≪ 12 × D [17].

While factorized attention has the potential to reduce computation compared to multi-axis attention, it introduces complexity in deciding the order in which self-attention is applied to each dimension. In our work, we compare two paradigms of factorized attention (see Figure 3b):

• Sequential Attention: an N-layer encoder consists of N/2 temporal encoder blocks followed by another N/2 spatial encoder blocks.
• Interleaved Attention: an N-layer encoder consists of temporal and spatial encoder blocks alternating N/2 times.

Latent Query Attention: Another approach to address the computational costs of large input sequences is to use latent queries [18, 19] in the first encoder block, where the input x ∈ R^{A×L_in×D} is mapped to a latent space z ∈ R^{A×L_out×D}. These latents are processed further by a series of encoder blocks that take in and return arrays in this latent space (see Figure 3a). This gives us full freedom to set the latent space resolution, reducing the computational costs of both the self-attention component and the position-wise feed-forward network of each block. We set the reduction value R = L_out / L_in to be a percentage of the input sequence length. The reduction factor R is kept constant across all attention encoders in late and hierarchical fusion.
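As a rough illustration of the two speedups above, the sketch below contrasts axis-factorized self-attention over an [A, T, S, D] grid with a latent-query block that shrinks the sequence from L_in to L_out tokens. It uses single-head attention, no feed-forward or residual layers, randomly initialized "learned" latents, and made-up sizes; it is a simplified sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(q, k, v):
    # Plain single-head scaled dot-product attention over the second-to-last axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def self_attention_over_axis(x, axis):
    # Factorized ("axis") attention: tokens attend along only one axis (time or space),
    # so the quadratic cost is O(T^2) or O(S^2) instead of O((T*S)^2).
    x = np.moveaxis(x, axis, -2)            # bring the attended axis next to features
    y = attend(x, x, x)
    return np.moveaxis(y, -2, axis)

def latent_query_block(x, n_latents):
    # Latent-query attention: L_out learned queries cross-attend to the L_in inputs,
    # shrinking the sequence that later self-attention blocks operate on.
    A, L_in, D = x.shape
    z = rng.normal(scale=0.02, size=(1, n_latents, D))   # learned latents (placeholder init)
    z = np.broadcast_to(z, (A, n_latents, D))
    return attend(z, x, x)                                # [A, n_latents, D]

A, T, S, D = 8, 11, 64, 32                  # illustrative sizes only
x = rng.normal(size=(A, T, S, D))

# Sequential factorization: temporal blocks first, then spatial blocks.
h = self_attention_over_axis(x, axis=1)     # attend over time:  [A, T, S, D]
h = self_attention_over_axis(h, axis=2)     # attend over space: [A, T, S, D]

# Latent-query reduction on the flattened sequence (e.g. R = 0.25 of T*S tokens).
flat = x.reshape(A, T * S, D)
z = latent_query_block(flat, n_latents=int(0.25 * T * S))
print(h.shape, z.shape)                     # (8, 11, 64, 32) (8, 176, 32)
```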
3.3 Trajectory Decoding

As our focus is on how to integrate information from different modalities in the encoder, we simply follow the training and output format of [2, 3], where the Wayformer predictor outputs a mixture of Gaussians to represent the possible trajectories an agent may take. To generate predictions, we use a Transformer decoder which is fed a set of k learned initial queries (S_i ∈ R^h)_{i=1}^k and cross-attends them with the scene embeddings from the encoder in order to generate embeddings for each component in the output mixture of Gaussians.

Given the embedding Y_i for a particular component of the mixture, we estimate the mixture likelihood with a linear projection layer that produces the unnormalized log-likelihood of the component. To generate the trajectory, we project Y_i using another linear layer to output 4 time series, T_i = {µ_x^t, µ_y^t, log σ_x^t, log σ_y^t}_{t=1}^T, corresponding to the means and log-standard deviations of the predicted Gaussian at each time step.

During training, we follow [2, 3] in decomposing the loss into separate classification and regression losses. Given k predicted Gaussians (T_i)_{i=1}^k, let î denote the index of the Gaussian with mean closest to the ground truth trajectory G. We train the mixture likelihoods on the log-likelihood of selecting the index î, and the Gaussian T_î to maximize the log-probability of the ground truth trajectory:

    max  log Pr(î | Y) + log Pr(G | T_î),        (1)

where the first term is the classification loss and the second term is the regression loss.
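A minimal sketch of this hard-assignment mixture loss is given below. It assumes isotropic per-step Gaussians over (x, y) and mean per-step displacement for picking the closest mode; the names, shapes, and these choices are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

def wayformer_style_loss(logits, mu, log_sigma, gt):
    """Classification + regression loss for a k-component trajectory mixture.

    logits:    [k]        unnormalized mixture log-likelihoods
    mu:        [k, T, 2]  predicted (x, y) means per mode and time step
    log_sigma: [k, T, 2]  predicted log standard deviations
    gt:        [T, 2]     ground-truth trajectory
    """
    # Pick the mode whose mean trajectory is closest to the ground truth (hard assignment).
    dists = np.linalg.norm(mu - gt[None], axis=-1).mean(axis=-1)      # [k] mean L2 per mode
    i_hat = int(np.argmin(dists))

    # Classification term: log-likelihood of selecting mode i_hat.
    log_probs = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
    cls_loss = -log_probs[i_hat]

    # Regression term: Gaussian negative log-likelihood of gt under mode i_hat,
    # with independent x / y components at every time step.
    sigma = np.exp(log_sigma[i_hat])
    nll = 0.5 * ((gt - mu[i_hat]) / sigma) ** 2 + log_sigma[i_hat] + 0.5 * np.log(2 * np.pi)
    reg_loss = nll.sum()

    return cls_loss + reg_loss

# Toy example: k = 6 modes, T = 5 future steps.
rng = np.random.default_rng(0)
loss = wayformer_style_loss(rng.normal(size=6), rng.normal(size=(6, 5, 2)),
                            rng.normal(size=(6, 5, 2)) * 0.1, rng.normal(size=(5, 2)))
print(loss)
```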
3.4 Trajectory Aggregation

If the predictor outputs a GMM with many modes, it can be difficult to reason about a mixture with so many components, and the benchmark metrics often restrict the number of trajectories being considered. During evaluation, we thus apply trajectory aggregation following [3] in order to reduce the number of modes being considered while still preserving the diversity of the original output mixture. We refer the reader to Appendix C and [3] for details of the aggregation scheme.

4 Experimental Setup

4.1 Datasets

Waymo Open Motion Dataset (WOMD) consists of 1.1M examples time-windowed from 103K 20s scenarios derived from real-world driving in urban and suburban environments. Each example consists of 1 second of history state and 8 seconds of future, which we resample at 5 Hz. The object-agent state contains attributes such as position, agent dimensions, velocity and acceleration vectors, orientation, angular velocity, and turn signal state. The long (8s) time horizon in this dataset tests the model's ability to capture a large field of view and scale to a large output space of trajectories.

Argoverse Dataset consists of 333K scenarios containing trajectory histories, context agents, and lane centerline inputs for motion prediction. The trajectories are sampled at 10 Hz, with 2 seconds of history and a 3-second future prediction horizon.

4.2 Training Details and Hyperparameters

We compare models using competition-specific metrics associated with these datasets (see Appendix E). For all metrics, we consider only the top k = 6 most likely modes output by our model (after trajectory aggregation) and use only the mean of each mode.

For all experiments, we train models using the AdamW optimizer [20] with an initial learning rate of 2e-4, linearly decayed to 0 over 1M steps. We train models using 16 TPUv3 cores each, with a batch size of 16 per core, resulting in a total batch size of 256 examples per step.

To vary the capacity of the models, we consider hidden sizes among {64, 128, 256} and depths among {1, 2, 4} layers. We fix the intermediate size in the feed-forward network of the Transformer block to be either 2 or 4 times the hidden size.

For our architecture study in Sections 5.1-5.3, each predictor outputs a mixture of Gaussians with m = 6 components, with no trajectory aggregation. For our benchmark results in Section 5.4, each predictor outputs a mixture of Gaussians with m = 64 components, and we prune the mixture components using the trajectory aggregation scheme described in Section 3.4. For experiments with latent queries, we experiment with reducing the original input resolution to 0.25, 0.5, 0.75 and 0.9 times the original sequence length. We include a full description of hyperparameters in Appendix B.

Figure 4: minADE of different fusion models with multi-axis attention. (a) Efficiency: minADE vs. latency (ms), with the Pareto frontier over early, hierarchical and late fusion. (b) Capacity: minADE vs. number of parameters.

5 Results

In this section, we present experiments that demonstrate the trade-offs of combining different fusion strategies with vanilla self-attention (multi-axis) and more optimized methods such as factorized attention and learned queries. In our ablation studies (Sections 5.1-5.3), we trained models with varying capacities (0.3M-20M parameters) for 1M steps on WOMD. We report their inference latency on a current-generation GPU, capacity, and minADE as a proxy of quality.

5.1 Multi-Axis Attention

In these experiments, we train Wayformer models with early, hierarchical and late fusion (Section 3.1) in combination with multi-axis attention. In Figure 4a, we show that for models with low latency (x ≤ 16 ms), late fusion represents an optimal choice. These models are computationally cheap since there is no interaction between modalities during the scene encoding step. Adding the cross-modal encoder for hierarchical models unlocks further quality gains for models in the range 16 ms < x < 32 ms. Finally, we can see that early fusion can match hierarchical fusion at higher computational cost (x > 32 ms). We then study model quality as a function of capacity, as measured by the number of trainable parameters (Figure 4b). Small models perform best with early fusion, but as model capacity increases, sensitivity to the choice of fusion decreases dramatically.

5.2 Factorized Attention

To reduce the computational budget of our models, we train models with factorized attention instead of jointly attending to the spatial and temporal dimensions together. When combining different modalities for the cross-modal encoder, we first tile the roadgraph modality to the same temporal dimension as the other modalities, then concatenate modalities along the spatial dimension. After the scene encoder, we pool the encodings over the time dimension before feeding them to the predictor.
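The tiling and pooling just described can be sketched as follows. Shapes are illustrative assumptions, and the pooling is shown as a simple mean, which is our guess rather than a detail given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A, T, D = 8, 11, 64                        # illustrative sizes only

# Per-modality token grids [A, T, S_m, D] after projection; roadgraph has T = 1.
history      = rng.normal(size=(A, T,   1, D))
interactions = rng.normal(size=(A, T,  16, D))
roadgraph    = rng.normal(size=(A, 1, 128, D))
traffic      = rng.normal(size=(A, T,   4, D))

# Tile the roadgraph to the common temporal dimension, then concatenate along space.
roadgraph_tiled = np.repeat(roadgraph, T, axis=1)                     # [A, T, 128, D]
grid = np.concatenate([history, interactions, roadgraph_tiled, traffic], axis=2)
print(grid.shape)                          # (8, 11, 149, 64): input to factorized attention

# ... factorized (time / space) self-attention blocks would run here ...

# After the scene encoder, pool over time before the trajectory decoder
# (mean pooling is an assumption; the paper does not specify the pooling operator).
scene_encoding = grid.mean(axis=1)         # [A, 149, D]
print(scene_encoding.shape)
```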
(a) Early Fusion. (b) Late Fusion. (c) Hierarchical Fusion.
Figure 5: minADE vs. latency (ms) for interleaved, sequential and multi-axis attention under each fusion type. Factorized attention improves quality, but only speeds up late fusion models.

We study two types of factorized attention: sequential and interleaved (Figure 5). First, we observe that sequential and interleaved factorized attention perform similarly across all types of fusion. Second, we are surprised to see quality gains from applying factorized attention to the early and late fusion cases (Figures 5a, 5b). Finally, we observe latency improvements only for late fusion models (Figure 5b), since tiling the roadgraph to the common temporal dimension in the cross-modal encoder used in early and hierarchical fusion significantly increases the token count.

5.3 Latent Queries

In this study, we train models with multi-axis latent query encoders with varying levels of input sequence length reduction in the first layer, as shown in Figure 6. The number of latent queries is calculated as a percentage of the input size of the Transformer network, with 0.0 indicating the baseline models (multi-axis attention with no latent queries, as presented in Figure 4).

(a) Early Fusion. (b) Late Fusion. (c) Hierarchical Fusion.
Figure 6: minADE vs. latency (ms) for reduction factors of 0.0, 0.25, 0.5, 0.75 and 0.9 under each fusion type. Latent queries reduce models' latency without significant degradation to the quality.

Figure 6 shows the results of applying latent queries, which speed up all fusion models by 2x-16x with minimal to no quality regression. Early and hierarchical fusion still produce the best quality results, showing the importance of the cross-modal interaction stage.

5.4 Benchmark Results

We validate our learnings by comparing Wayformer models to competitive models on popular motion forecasting benchmarks. We choose early fusion models since they match the quality of the hierarchical models without the increased complexity of implementation. Moreover, as models' capacity increases they are less sensitive to the choice of fusion (see Figure 4b). We use latent queries since they speed up models without noticeable quality regression and, in some models, we combine them with factorized attention (see Appendix A) since that improves quality further. We further apply ensembling, a standard practice for producing SOTA results for leaderboard submissions. Full hyperparameters for the Wayformer models reported on benchmarks are given in Appendix D.

When ensembling for WOMD, the model has a single shared encoder but uses N = 3 separate Transformer decoders. To merge predictions over the ensemble, we simply combine all mixture components from each predictor to get a total of N × 64 modes, and renormalize the mixture probabilities. We then apply our trajectory aggregation scheme (Section 3.4) to the combined mixture distribution to reduce the number of output modes to the desired count k = 6.

In Table 1, we present results on the Waymo Open Motion Dataset and the Argoverse Dataset. We use the standard metrics of each dataset for its respective evaluation (see Appendix E). For the Waymo Open Motion Dataset, both Wayformer early fusion models outperform other models across all metrics; early fusion of input modalities results in better overall metrics independent of the attention structure (multi-axis or factorized attention).

For the Argoverse leaderboard, we train 15 replicas, each with its own encoder and N = 10 Transformer decoders. To merge predictions over the N decoders we follow the aggregation scheme in Section 3.4 to produce k = 6 modes for each model. We then ensemble the 15 replicas following the same aggregation scheme (Section 3.4) to reduce N × 6 modes to k = 6.
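A minimal sketch of the WOMD ensembling step described above simply concatenates the per-decoder mixtures and renormalizes before aggregation. Variable names and the uniform weighting across the N decoders are our assumptions.

```python
import numpy as np

def merge_ensemble(probs_per_decoder, trajs_per_decoder):
    """Merge mixture predictions from N decoders into one larger mixture.

    probs_per_decoder: [N, M]        normalized mode probabilities per decoder
    trajs_per_decoder: [N, M, T, 2]  mean (x, y) trajectories per mode
    Returns ([N*M] probabilities summing to 1, [N*M, T, 2] trajectories).
    """
    n = len(probs_per_decoder)
    # Concatenate all components; weighting each decoder equally is an assumption.
    probs = np.concatenate(probs_per_decoder, axis=0) / n
    trajs = np.concatenate(trajs_per_decoder, axis=0)
    return probs / probs.sum(), trajs

# Toy example: N = 3 decoders, M = 64 modes, T = 40 future steps (illustrative).
rng = np.random.default_rng(0)
raw = rng.random((3, 64))
probs, trajs = merge_ensemble(raw / raw.sum(axis=1, keepdims=True),
                              rng.normal(size=(3, 64, 40, 2)))
print(probs.shape, probs.sum(), trajs.shape)   # (192,) ~1.0 (192, 40, 2)
# The final k = 6 modes would then be selected by the aggregation scheme of Section 3.4.
```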
Model                                  | Waymo Open Motion Dataset                       | Argoverse Dataset
                                       | minFDE↓  minADE↓  MR↓     Overlap↓  mAP*↑       | Brier-minFDE*↓  minFDE↓  MR↓     minADE↓
SceneTransformer [11]                  | 1.212    0.612    0.156   0.147     0.279       | 1.8868          1.2321   0.1255  0.8026
DenseTNT [21]                          | 1.551    1.039    0.157   0.178     0.328       | 1.9759          1.2858   0.1285  0.8817
MultiPath [2]                          | 2.040    0.880    0.345   0.166     0.409       | -               -        -       -
MultiPath++ [3]                        | 1.158    0.556    0.134   0.131     0.409       | 1.7932          1.2144   0.1324  0.7897
LaneConv                               | -        -        -       -         -           | 2.0539          1.3622   0.1600  0.8703
LaneRCNN [22]                          | -        -        -       -         -           | 2.1470          1.4526   0.1232  0.9038
mmTransformer [14]                     | -        -        -       -         -           | 2.0328          1.3383   0.1540  0.8346
TNT [23]                               | -        -        -       -         -           | 2.1401          1.4457   0.1300  0.9400
DCMS [24]                              | -        -        -       -         -           | 1.7564          1.1350   0.1094  0.7659
Wayformer Early Fusion, LQ+Multi-Axis  | 1.128    0.545    0.123   0.127     0.419       | 1.7408          1.1615   0.1186  0.7675
Wayformer Early Fusion, LQ+Factorized  | 1.126    0.545    0.123   0.127     0.412       | 1.7451          1.1625   0.1192  0.7672

Table 1: Wayformer models and select SOTA baselines on the Waymo Open Motion Dataset 2021 and Argoverse 2021. * denotes the metric used for leaderboard ranking. LQ denotes latent query.

6 Related Work

Motion prediction architectures: Increasing interest in self-driving applications and the availability of benchmarks [25, 26, 27] has allowed motion prediction models to flourish. Successful modeling techniques fuse multimodal inputs that represent different static, dynamic, social and temporal aspects of the scene. One class of models draws heavily from the computer vision literature, rendering inputs as a multichannel rasterized top-down image [4, 2, 28, 29, 7, 23]. In this approach, relationships between scene elements are rendered in the top-down orthographic plane and modeled via spatio-temporal convolutional networks. However, the localized structure of convolutions, while well suited to processing image inputs, is not effective at capturing long-range spatio-temporal relationships. A popular alternative is to use an entity-centric approach, where agent state history is typically encoded via sequence modeling techniques like RNNs [10, 30, 31, 32] or temporal convolutions [33]. Road elements are approximated with basic primitives (e.g. piecewise linear segments) which encode pose and semantic information. Modeling relationships between entities is often presented as an information aggregation process, and models employ pooling [23, 34, 31, 35, 10, 28], soft attention [10, 23] or graph neural networks [36, 33, 30]. Like our proposed method, several recent models use Transformers [37], which are a popular state-of-the-art choice for sequence modeling in NLP [38, 39], and have shown promise in core computer vision tasks such as detection [40, 41, 42], tracking [43] and classification [41, 44].

Iterative cross-attention: A recent approach to encoding multimodal data is to sequentially process one modality at a time [14, 19, 9]. [14] ingests the scene in the order {agent history, nearby agents, map}; they argue that it is computationally expensive to perform self-attention over multiple modalities at once. [9] pre-encodes the agent history and contextual agents through self-attention and cross-attends to the map with agent encodings as queries. The order of self-attention and cross-attention relies heavily on the designer's intuition and has, to our knowledge, not been ablated before.

Factorized attention: Flattening high-dimensional data leads to long sequences, which makes self-attention computationally prohibitive. [16] proposed limiting each attention operation to a single axis to alleviate the computational costs and applied this technique to autoregressive generative modeling of images. Similarly, [15] factorize the spatial and temporal dimensions of the video input when constructing their self-attention based classifier.
This axis-based attention, which gets applied in an interleaved fashion across layers, has been adopted in Transformer-based motion forecasting models [9] and graph neural network approaches [12]. The order of applying attention over the {temporal, social/spatial} dimensions has been studied with two common patterns: (a) temporal first [31, 35, 45], and (b) social/spatial first [46, 47]. In Section 3.2, we study a sequential mode and contrast it with an interleaved mode, in which we interleave the dimensions of attention similar to [9].

Multimodal encoding: [13] argued that attending to temporal and spatial dimensions independently leads to loss of information. Moreover, allowing all inputs to self-attend to each other early in the encoding process reduces complexity and the need to handcraft architectures that address the scaling of computation for Transformers as the input sequence length grows [48]. However, self-attention is known to be computationally expensive for large inputs [49], and recently there has been huge interest in approaches improving its scalability. For a complete discussion of previous works, we refer the reader to the comprehensive survey [50]. One compelling approach is to use learned latent queries to decouple the number of query vectors of a Transformer encoder from the original input sequence length [18]. This allows us to set the resolution of the Transformer output to arbitrary scales independent of the input, and to flexibly tune model computational costs. This approach is appealing since it does not assume any structure in the input and has proven effective in fusing multimodal inputs [48]. We take inspiration from such frameworks and present a study of their benefits when applied to the task of motion forecasting in the self-driving domain.

7 Limitations

The scope of the current study is subject to the following limitations: (1) Ego-centric modeling is subject to repeated computation in dense scenes. This can be alleviated by encoding the scene only once in a global frame of reference. (2) Our system input is a sparse, abstract state description of the world, which fails to capture some important nuances in highly interactive scenes, e.g., visual cues from pedestrians or fine-grained contour or wheel-angle information for vehicles. Learning perception and prediction end-to-end could unlock improvements. (3) We model the distribution over possible futures independently per agent, and temporally conditionally independent for each agent given intent. These simplifying assumptions allow for efficient computation but fail to fully describe combinatorially many futures. Multi-agent, temporally causal models could show further benefits in interactive situations.

Acknowledgments

We thank Balakrishnan Varadarajan for help on ensembling strategies, and Dragomir Anguelov and Eugene Ie for their helpful feedback on the paper.

References

[1] N. Rhinehart, K. Kitani, and P. Vernaza. R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, 2018.
[2] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In CoRL, 2019.
[3] B. Varadarajan, A. S. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov, and B. Sapp. MultiPath++: Efficient information fusion and trajectory aggregation for behavior prediction. In ICRA, 2021.
[4] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA), pages 2090-2096. IEEE, 2019.
[5] C. Tang and R. R. Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
[6] J. Liang, L. Jiang, K. Murphy, T. Yu, and A. Hauptmann. The garden of forking paths: Towards multi-future trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10508-10518, 2020.
[7] S. Casas, W. Luo, and R. Urtasun. IntentNet: Learning to predict intention from raw sensor data. In Conf. on Robot Learning, 2018.
[8] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun. End-to-end interpretable neural motion planner. In CVPR, 2019.
[9] R. Girgis, F. Golemo, F. Codevilla, J. A. D'Souza, S. E. Kahou, F. Heide, and C. J. Pal. Latent variable nested set transformers & autobots. CoRR, abs/2104.00563, 2021. URL https://arxiv.org/abs/2104.00563.
[10] J. P. Mercat, T. Gilles, N. E. Zoghby, G. Sandou, D. Beauvois, and G. P. Gil. Multi-head attention for multi-modal joint vehicle motion forecasting. 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9638-9644, 2020.
[11] J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, D. Weiss, B. Sapp, Z. Chen, and J. Shlens. Scene Transformer: A unified multi-task model for behavior prediction and planning. CoRR, abs/2106.08417, 2021. URL https://arxiv.org/abs/2106.08417.
[12] C. Yu, X. Ma, J. Ren, H. Zhao, and S. Yi. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In ECCV, 2020.
[13] Y. Yuan, X. Weng, Y. Ou, and K. Kitani. AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting. ArXiv, abs/2103.14023, 2021.
[14] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou. Multimodal motion prediction with stacked transformers. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7573-7582, 2021.
[15] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836-6846, October 2021.
[16] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
[17] J. Kaplan, S. McCandlish, T. J. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. ArXiv, abs/2001.08361, 2020.
[18] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[19] A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, and J. Carreira. Perceiver: General perception with iterative attention. In ICML, 2021.
[20] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[21] J. Gu, C. Sun, and H. Zhao. DenseTNT: End-to-end trajectory prediction from dense goal sets. CoRR, abs/2108.09640, 2021. URL https://arxiv.org/abs/2108.09640.
[22] W. Zeng, M. Liang, R. Liao, and R. Urtasun. LaneRCNN: Distributed representations for graph-centric motion forecasting. CoRR, abs/2101.06653, 2021. URL https://arxiv.org/abs/2101.06653.
[23] H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen, Y. Shen, Y. Chai, C. Schmid, et al. TNT: Target-driven trajectory prediction. arXiv preprint arXiv:2008.08294, 2020.
[24] M. Ye, J. Xu, X. Xu, T. Cao, and Q. Chen. DCMS: Motion forecasting with dual consistency and multi-pseudo-target supervision, 2022.
[25] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, and J. Hays. Argoverse: 3D tracking and forecasting with rich maps. CoRR, abs/1911.02620, 2019. URL http://arxiv.org/abs/1911.02620.
[26] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska. One thousand and one hours: Self-driving motion prediction dataset. CoRR, abs/2006.14480, 2020. URL https://arxiv.org/abs/2006.14480.
[27] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V. Vasudevan, A. McCauley, J. Shlens, and D. Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. CoRR, abs/2104.10133, 2021. URL https://arxiv.org/abs/2104.10133.
[28] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. K. Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. CoRR, abs/1704.04394, 2017. URL http://arxiv.org/abs/1704.04394.
[29] J. Hong, B. Sapp, and J. Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. CoRR, abs/1906.08945, 2019. URL http://arxiv.org/abs/1906.08945.
[30] S. Khandelwal, W. Qi, J. Singh, A. Hartnett, and D. Ramanan. What-if motion prediction for autonomous driving. ArXiv, 2020.
[31] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961-971, 2016.
[32] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In ECCV, 2019.
[33] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun. Learning lane graph representations for motion forecasting. arXiv preprint arXiv:2007.13732, 2020.
[34] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In CVPR, 2020.
[35] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[36] S. Casas, C. Gulino, R. Liao, and R. Urtasun. SpAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In IEEE Intl. Conf. on Robotics and Automation. IEEE, 2020.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
[38] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[39] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
[40] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision - ECCV 2020, pages 213-229, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58452-8.
[41] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le. Attention augmented convolutional networks. CoRR, abs/1904.09925, 2019. URL http://arxiv.org/abs/1904.09925.
[42] A. Srinivas, T. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani. Bottleneck transformers for visual recognition. CoRR, abs/2101.11605, 2021. URL https://arxiv.org/abs/2101.11605.
[43] W. Hung, H. Kretzschmar, T. Lin, Y. Chai, R. Yu, M. Yang, and D. Anguelov. SODA: Multi-object tracking with soft data association. CoRR, abs/2008.07725, 2020. URL https://arxiv.org/abs/2008.07725.
[44] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019. URL http://arxiv.org/abs/1906.05909.
[45] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. D. Reid, S. H. Rezatofighi, and S. Savarese. Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks. In NeurIPS, 2019.
[46] Y. Huang, H. Bi, Z. Li, T. Mao, and Z. Wang. STGAT: Modeling spatial-temporal interactions for human trajectory prediction. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6271-6280, 2019.
[47] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone. Trajectron++: Multi-agent generative trajectory forecasting with heterogeneous data for control. CoRR, abs/2001.03093, 2020. URL http://arxiv.org/abs/2001.03093.
[48] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, A. Brock, E. Shelhamer, O. J. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira. Perceiver IO: A general architecture for structured inputs & outputs. ArXiv, abs/2107.14795, 2021.
[49] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler. Long Range Arena: A benchmark for efficient transformers. ArXiv, abs/2011.04006, 2021.
[50] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler. Efficient transformers: A survey. ArXiv, abs/2009.06732, 2020.

Appendix

A Factorized Latent Query Attention

Figure 7a shows the implementation of factorized latent query attention encoder blocks and Figure 7b shows how they are used in constructing the encoders. Specifically, in factorized attention (sequential or interleaved), the first temporal encoder block and the first spatial encoder block in Figure 3 are replaced with a temporal latent query encoder block and a spatial latent query encoder block, respectively.

(a) Encoder Blocks (b) Encoders
Figure 7: A summary of encoder architectures considered for Wayformer. (a) provides an overview of different encoder blocks and (b) explains how these blocks are arranged to construct the encoder.

B Hyperparameters

Hyperparameter        | Values
Hidden size           | {128, 256, 512}
Intermediate size     | {2x, 4x} hidden size
Num encoder layers    | [2, 16]
Num decoder layers    | [2, 16]
Latent query ratio    | {0.25, 0.5, 0.75, 1.0}
Number GMM modes      | 64
Optimizer             | AdamW
Initial learning rate | 2e-4
Training steps        | 1,000,000
Learning rate decay   | linear
Batch size            | 256

Table 2: Model and training hyperparameters across all ablation experiments done on WOMD.

Hyperparameter                                          | WOMD | Argoverse
Max num history timesteps (including current timestep)  | 11   | 20
Max num roadgraph feats                                 | 512  | 1024
Max num context agents                                  | 64   | 64
Max num traffic lights                                  | 32   | 32

Table 3: Hyperparameters for generating WOMD and Argoverse input features. Fixed for all experiments.

C Trajectory Aggregation Details

Given a distance threshold D, the trajectory aggregation scheme first attempts to select the fewest centroid modes such that all output modes are within a distance D of the nearest centroid. The aggregation algorithm iteratively selects centroid modes by greedily selecting the output mode that covers the maximum total likelihood out of the uncovered modes, and proceeds until all output modes have been covered.

After initializing these k centroid modes, the aggregation algorithm then proceeds into a refinement stage and runs another iterative procedure similar to k-means clustering, starting from the initial centroid modes. In each iteration, each centroid mode becomes the weighted average of all output modes assigned to it, and then output modes are reassigned to the new closest centroid mode.
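A minimal sketch of this two-stage aggregation is shown below. The inter-mode distance (maximum per-step displacement between mean trajectories), the probability-weighted averaging of raw (x, y) points, and the omission of the final cap to k = 6 trajectories are our simplifying assumptions rather than details specified above.

```python
import numpy as np

def aggregate_modes(probs, trajs, dist_threshold, refine_iters=3):
    """Greedy centroid selection followed by k-means-style refinement.

    probs: [M]        mode probabilities
    trajs: [M, T, 2]  mean trajectories per mode
    """
    def dist(a, b):
        # Max per-step displacement between trajectory a and each trajectory in b.
        return np.linalg.norm(a - b, axis=-1).max(axis=-1)

    # Stage 1: greedily pick centroids until every mode is within dist_threshold of one.
    centroids = []
    uncovered = np.ones(len(probs), dtype=bool)
    while uncovered.any():
        best, best_cover = None, -1.0
        for i in range(len(probs)):
            covered = dist(trajs[i], trajs) <= dist_threshold
            cover = probs[covered & uncovered].sum()
            if cover > best_cover:
                best, best_cover = i, cover
        centroids.append(trajs[best].copy())
        uncovered &= dist(trajs[best], trajs) > dist_threshold
    centroids = np.stack(centroids)                                   # [k, T, 2]

    # Stage 2: k-means-style refinement with probability-weighted averages.
    for _ in range(refine_iters):
        assign = np.array([np.argmin(dist(t, centroids)) for t in trajs])
        for c in range(len(centroids)):
            w = probs * (assign == c)
            if w.sum() > 0:
                centroids[c] = (w[:, None, None] * trajs).sum(axis=0) / w.sum()
        centroid_probs = np.array([probs[assign == c].sum() for c in range(len(centroids))])
    return centroid_probs, centroids

rng = np.random.default_rng(0)
p = rng.random(64); p /= p.sum()
probs_out, trajs_out = aggregate_modes(p, rng.normal(size=(64, 80, 2)), dist_threshold=2.3)
print(len(trajs_out), probs_out.sum())
```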
D SOTA Wayformer Details

We describe the hyperparameters used for the WOMD and Argoverse benchmark results in Tables 4 and 5, respectively.

Hyperparameter                          | Multi-axis Latent Query | Factorized Latent Query
Hidden size                             | 256                     | 256
Intermediate size                       | 1024                    | 1024
Num encoder layers                      | 2                       | 4
Num decoder layers                      | 8                       | 4
Latent queries                          | 192                     | 4 time latents, 192 spatial latents
Number GMM modes                        | 64                      | 64
Ensemble size                           | 3                       | 3
Optimizer                               | AdamW                   | AdamW
Initial learning rate                   | 2e-4                    | 2e-4
Learning rate decay                     | linear                  | linear
Training steps                          | 1,200,000               | 1,000,000
Batch size                              | 256                     | 256
Aggregation initial distance threshold  | 2.3                     | 2.3
Aggregation refinement iterations       | 3                       | 3
Aggregation max num trajectories        | 6                       | 6

Table 4: Model and training hyperparameters for benchmark experiments on the Waymo Open Motion 2021 Dataset.

Hyperparameter                          | Multi-axis Latent Query | Factorized Latent Query
Encoder hidden size                     | 128                     | 256
Encoder intermediate size               | 512                     | 1536
Decoder hidden size                     | 128                     | 128
Decoder intermediate size               | 512                     | 768
Num encoder layers                      | 4                       | 4
Num decoder layers                      | 6                       | 6
Latent queries                          | 1024                    | 6 time latents, 192 spatial latents
Number GMM modes                        | 6                       | 6
Ensemble size                           | 10                      | 10
Optimizer                               | AdamW                   | AdamW
Initial learning rate                   | 2e-4                    | 2e-4
Learning rate decay                     | linear                  | linear
Training steps                          | 1,000,000               | 1,000,000
Batch size                              | 4                       | 4
Aggregation initial distance threshold  | 2.9                     | 2.9
Aggregation refinement iterations       | 5                       | 5
Aggregation max num trajectories        | 6                       | 6

Table 5: Model and training hyperparameters for benchmark experiments on the Argoverse 2021 Dataset.

E Metrics

We compare models using competition-specific metrics associated with these datasets. For all metrics, we consider only the top k = 6 most likely modes output by our model (after trajectory aggregation) and use only the mean of each mode. Specifically, we report the following metrics taken from the standard evaluation procedure of the dataset being used.

minDE_k^t (Minimum Distance Error): Considers the top-k most likely trajectories output by the model, and computes the minimum distance to the ground truth trajectory at time step t.

MR^t (Miss Rate): For each predicted trajectory, we compute whether it is sufficiently close to the predicted agent's ground truth trajectory at time t. Miss rate is the proportion of predicted agents for which none of the predicted trajectories are sufficiently close to the ground truth. We defer the details of how a trajectory is determined to be sufficiently close to the WOMD metrics definition [27].

minADE_k (Minimum Average Distance Error): Similar to minDE_k^t, but the distance is calculated as an average over all time steps.

mAP: For each set of predicted trajectories, we have at most one positive: the one closest to the ground truth and within distance τ of the ground truth. The other predicted trajectories are reported as misses. From this, we can compute precision and recall at various thresholds. Following the WOMD metrics definition [27], the agents' future trajectories are partitioned into behavior buckets, and an area under the precision-recall curve is computed using the possible true positives and false positives per agent, giving us an Average Precision per behavior bucket. The total mAP value is a mean over the APs for each behavior bucket.

Overlap^t: The fraction of time steps of the most likely trajectory prediction for which the prediction overlaps with the corresponding time step of the real future trajectory of another agent.

minFDE (Minimum Final Displacement Error): The L2 distance between the endpoint of the best forecasted trajectory and the ground truth.

Brier-minFDE: Defined as the sum of minFDE and the Brier score (1 - p)^2, where p is the probability of the best-predicted trajectory.
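For concreteness, here is a minimal sketch of the distance-based metrics above. It uses a single fixed endpoint threshold for the miss decision, which is a simplification; the benchmark definitions [27] scale the threshold with horizon and agent speed.

```python
import numpy as np

def distance_metrics(pred, gt, miss_threshold=2.0):
    """pred: [k, T, 2] top-k mean trajectories; gt: [T, 2] ground truth.

    Returns (minADE, minFDE, miss); `miss` uses one fixed endpoint threshold
    as a simplification of the benchmark definitions.
    """
    dists = np.linalg.norm(pred - gt[None], axis=-1)    # [k, T] per-step displacement
    min_ade = dists.mean(axis=1).min()                  # best average displacement
    min_fde = dists[:, -1].min()                        # best final displacement
    miss = bool(min_fde > miss_threshold)               # no mode close enough at the end
    return min_ade, min_fde, miss

rng = np.random.default_rng(0)
print(distance_metrics(rng.normal(size=(6, 80, 2)), rng.normal(size=(80, 2))))
```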
F Qualitative Wins

In this section, we present some examples of Wayformer (WF) predictions on WOMD scenes in comparison with the MultiPath++ (MP++) model [3]. In all the following examples: (a) hue indicates the time horizon (0s - 8s), while transparency indicates probability; (b) rectangles indicate vehicles, and squares indicate pedestrians or cyclists.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 8: This scenario represents a multi-lane road with a parking lot on the left side. Here, we see that WF's predictions for several vehicles are safer and more road-following than those of MP++. For example: (a) Vehicle A is seen merging onto the road coming out of a parking lot. MP++'s predictions are completely off-road while WF's predictions follow the rules of the road. (b) Vehicles B, C, and D's predictions overlap with each other for MP++, predicting collisions with each other, but WF correctly predicts that D yields for the vehicle before it, C yields for D, and B yields for C. (c) MP++'s predictions for vehicle E navigating the parking lot go through already parked vehicles, while WF understands the interactions better and produces predictions which are not colliding.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 9: This scenario represents a T intersection. Here we see (a) a cyclist B making a left turn. MP++'s predictions are off-road and go beyond the available road, while WF's predictions follow the rules of the road and present multiple speed profiles for the same action of taking a left turn. (b) We also see better predictions for a pedestrian (pedestrian A), where MP++ predicts that the pedestrian is going to walk onto the road into oncoming traffic, while WF's predictions are constrained to the sidewalk. (c) In addition, we also notice that WF predicts safer futures for vehicles C, D and E in comparison with MP++.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 10: This scenario represents a vehicle (agent A) turning into a parking structure. MP++'s predictions discount the presence of other parked vehicles, and some predictions are made through the parked agents. WF models these interactions better and only predicts trajectories that do not collide with other parked entities.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 11: This scenario represents a busy 4-way intersection. First we discuss the WF improvements for pedestrian trajectory predictions. MP++ predicts pedestrian A as walking into the oncoming vehicle, demonstrating that it fails to model this spatial interaction. WF demonstrates how the same pedestrian crosses in front of this stopped vehicle and continues to walk on the crosswalk on the opposite side of the road. Pedestrians B and C in the lower left corner of the image show similar behavior: MP++ predicts them bumping into cars parked right next to them and walking onto the road surface toward oncoming traffic, whereas WF predicts consistent along-road trajectories for these pedestrians. We now observe the predictions for a vehicle (agent D) in this scene. MP++ predicts the trajectories of this vehicle to collide both with the static car in front of it and with the pedestrian passing in front of that car. WF models all these spatial interactions well and predicts trajectories for this car that wait behind the car in front of it and do not nudge into the pedestrian crossing in front.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 12: This scenario represents a complex 4-way intersection with lots of cars passing through. Similar to Figure 11, we see MP++ predicting trajectories for vehicles (agents A, B, C and D) in the scene that collide with the cars in front of them. WF demonstrates very sophisticated behavior. For agent A, it is able to estimate that the car parked in front of agent A is a double-parked vehicle and there is space on the road next to it, so it predicts trajectories that nudge around it. For B, C and D it is able to carefully model the rules of the road and allow either oncoming traffic (in the case of agent B) or cross traffic (in the case of agents C and D) to take precedence, and predicts yielding trajectories for them.
(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 13: In this scenario we observe that MP++ is not able to model the future of the vehicle (agent A) entering the parking lane and outputs a multimodal, equally likely future for this agent. WF understands the roadgraph interaction much better and outputs trajectories with high likelihood that agent A is entering the parking lane.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 14: We see agents A, B, and C waiting behind a stationary vehicle. WF predicts agent A will nudge around the stationary vehicle to make progress, while MP++ predicts the agents will proceed through the stationary vehicle. Additionally, MP++ predicts agent D could proceed off the road, while WF predicts it will follow the road behind agent C.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 15: Multiple pedestrians, including agent B, are crossing the road, and both MP++ and WF predict car A wants to make a left turn through that crosswalk. WF predicts car A will start to turn, then wait as the pedestrians cross, while MP++ predicts that car A will proceed through the crosswalk even as the pedestrians are crossing.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 16: This shows a busy intersection, with both WF and MP++ predicting that vehicles in the left-right road (i.e. agent D) are either proceeding straight or turning left. However, MP++ predicts agent A will try to make a left turn directly into the flow of traffic, including through other left-turning cars, while WF predicts agent A will wait. Additionally, MP++ predicts agent B will try to proceed through the vehicle waiting in front of it, while WF instead predicts it either remaining stationary or nudging into the adjacent lane. Furthermore, WF also predicts agent D could potentially make a U-turn that goes through the corner of the sidewalk near agent C (highlighted by the red arrow).

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 17: This scenario represents a 4-way intersection. (a) At the intersection, vehicles A, B, C and D are all stopped due to the signal. WF takes this into account and predicts yielding behavior: vehicle D yields for the light, vehicle C yields for B, vehicle B yields for A, and vehicle A yields for the vehicle in front. But MP++'s predictions for the same agents go through the intersection (vehicle D), and for vehicles A, B and C they pass through the vehicles in front. (b) We see similar behavior on the other side of the intersection, where vehicle E's WF predictions are yielding and its MP++ predictions pass through the vehicles in front.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 18: This scenario represents a T intersection with narrow roads and parked cars. In this highly interactive scene, we observe that (a) a parked vehicle (vehicle A) is trying to merge into traffic. WF predicts nudging around the already parked cars and merging into traffic, while MP++'s predictions pass through the parked cars in front of A. (b) In addition, we also see that for vehicle B, WF predicts nudging around the vehicle in front, while MP++'s predictions go through the car in front. (c) For vehicles C, D and E, WF predicts yielding behavior (C yielding for B, D yielding for the car in front, and E yielding for D), while MP++'s predictions go through the vehicles in front.

(a) MultiPath++ (MP++) (b) Wayformer (WF)
Figure 19: This scenario represents a very busy 4-way intersection with clusters of pedestrians (A, B). Both clusters are pedestrians crossing the signal from either side of the road. We observe that MP++'s predictions are more spread out, some of them going through already stopped vehicles (vehicles C and D) at the intersection. WF, however, accounts for the presence of other vehicles and produces predictions which do not cross through them. We also see that WF's predictions for vehicle C yield to pedestrians while MP++'s predictions do not.