MoST: Multi-modality Scene Tokenization for Motion Prediction

Norman Mu*, Jingwei Ji*, Zhenpei Yang*, Nate Harada*, Haotian Tang*, Kan Chen*, Charles R. Qi, Runzhou Ge, Kratarth Goel, Zoey Yang, Scott Ettinger, Rami Al-Rfou, Dragomir Anguelov, Yin Zhou†
Waymo LLC
*Equal contribution   †Corresponding author

Abstract

Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from a lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world, while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented the Waymo Open Motion Dataset with camera embeddings. Experiments on the Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state of the art.

Figure 1. Overview of the proposed motion prediction paradigm. It fuses symbolic perception output and our multi-modality scene tokens. While the symbolic representation offers a convenient world abstraction, the multi-modality scene tokens link behavior models directly to sensor observations via token embeddings.

1. Introduction

In order to safely and effectively operate in complex environments, autonomous systems must model the behavior of nearby agents. These motion prediction models now often rely on symbolic perception outputs such as 3D bounding box tracks to represent agent states, rather than directly processing sensor inputs. These representations reduce input dimensionality, facilitating computationally efficient model training. Additionally, since inputs such as 3D boxes are easily rearranged and manipulated, it is possible to construct many hypothetical scenarios, leading to efficient simulation and testing. Yet in order to continue improving the accuracy and robustness of behavior models, it may be necessary to feed the models higher-fidelity sensor features. For instance, pedestrian pose and gaze offer richer cues than mere bounding boxes for motion prediction. Moreover, many scene elements like lane markings cannot be well represented by boxes. Furthermore, scene context (e.g., road surface conditions, hazardous locations) is difficult to characterize with symbolic representations. Manually crafting representations for diverse concepts demands considerable engineering effort in implementation, training, and evaluation. Instead, we want the behavior model to directly access the raw sensor data and determine what to encode and how to encode it.

Deep learning models' performance generally improves when we replace hand-crafted features, designed to encode inductive bias according to expert domain knowledge, with directly observed features, as long as we scale compute and data accordingly.
But learning to predict complex patterns such as agent behavior directly from very high-dimensional sensor inputs (e.g., many high-resolution LiDAR and camera sensors all operating at high frequency) is an extremely challenging learning problem. It requires learning to organize many hundreds of thousands of points and pixels across time into meaningful representations. Moreover, the intermediate representations of fully end-to-end systems are far more difficult to validate and inspect.

Rather than choosing strictly between the two approaches, we instead propose combining existing symbolic representations with learned tokens encoding scene information. We first decompose the scene into a compact set of disjoint elements representing ground regions, perception-detected agents and open-set objects, based on ground plane fitting and connected component analysis. We then leverage large pre-trained 2D image models and 3D point cloud models to encode these scene elements into "tokens". The 2D image models are trained on Internet-scale data and show impressive capabilities in understanding the open visual world. These tokens encapsulate relevant information for reasoning about the environment, such as object semantics, object geometry as well as the scene context. We compactly represent multi-modality information about ground, agents and open-set objects with a few hundred tokens, which we later feed to a Wayformer-like network [48] alongside tokens encoding agent position and velocity, road graph, and traffic signals. All tokens are processed via a linear projection into the same dimension and self-attention layers.

To evaluate our method, we introduce camera embeddings to the Waymo Open Motion Dataset (WOMD) [16]. With LiDAR points [10] and camera embeddings, WOMD has become a large-scale multi-modal dataset for motion prediction. On WOMD, our model, which combines learned and symbolic scene tokens, brings a 6.6% relative improvement on soft mAP and a 10.3% relative improvement on minADE. While we obtain the strongest results with the recently released image backbone from [36], other pre-trained image models [51, 55] also yield considerable gains. We further analyze the performance of our trajectory prediction model under challenging scenarios. Notably, we discover that even in the presence of imperfect symbolic perception outputs and incomplete road graph information, our model maintains exceptional robustness and accuracy.

Our contributions are three-fold:
• We have augmented WOMD into a large-scale multi-modal dataset to support research in end-to-end learning. Camera embeddings are released to the community.
• We have conducted a thorough study of modeling ideas of varying complexity to demonstrate the value of those sensory inputs in motion prediction.
• We have proposed a novel method, MoST, which effectively leverages the multi-modality data and leads to significant performance improvement.

2. Related Works

Motion Prediction for Autonomous Driving: The increasing interest in autonomous driving has led to a significant focus on motion prediction [8, 22, 48, 59, 60, 65]. Early methods [1, 4, 6, 7, 13, 20, 27, 47, 52] rasterize the input scene into a 2D image, followed by processing with convolutional neural networks (CNNs). However, as a result of the inherent lossiness of the rasterization process, contemporary research has shifted its focus towards representing road elements, such as object bounding boxes, road graphs, and traffic light signals, as discrete graph nodes [19]. These elements are then directly processed using graph neural networks (GNNs) [5, 21, 35, 39]. Another stream of research also employs this discrete set representation for scene elements but processes them using recurrent neural networks [2, 25, 47, 58, 63, 65] rather than GNNs. Thanks to the rapid advancement of transformer-based architectures in natural language processing and computer vision, the latest state-of-the-art motion predictors also extensively incorporate the attention mechanism [32, 48, 49, 59, 60]. More recently, the community has also started to study interactive behavior prediction, which jointly models the future motion of multiple objects [45, 61, 62, 64].

End-to-end Autonomous Driving: The concept of end-to-end learning-based autonomous driving systems started in the late 1980s [53]. Since then, researchers have developed differentiable modules that connect perception and behavior [14, 18, 23, 28, 40, 44, 73], behavior and planning [24, 34, 41, 56, 61], or span from perception to planning [6, 29, 57, 71]. Building on the inspiration from [38, 72], Hu et al.
introduced UniAD [30], which leverages transformer queries and a shared BEV feature map to facilitate end-to-end learning of perception, prediction, and planning [11, 31, 33]. More recently, there has been a growing interest in achieving end-to-end motion planning using large language models (LLMs) [46, 67, 69].

Challenges of Existing Methods: While substantial advancements have been achieved on standard motion prediction benchmarks [3, 9, 10, 17, 68], the deployment of existing behavior models in real-world scenarios remains challenging. Many motion prediction models heavily rely on pre-processed, symbolic data from perception modules [19, 54, 74, 75], and therefore are vulnerable to potential failures. Moreover, the manually engineered interface greatly restricts the flexibility and scalability of the models in handling long-tail and novel categories of objects. In contrast, end-to-end learning [11, 26, 30, 31, 33] from raw sensors, while overcoming some limitations, encounters challenges in interpretability and in scaling up batch size due to computational constraints.

Figure 2. Overview of the proposed Multi-modality Scene Tokenization. Our method takes as input multi-view camera images and a full scene point cloud. We leverage a pre-trained image foundation model to obtain descriptive feature maps and decompose the scene into disjoint elements via clustering. Based on the sensor calibration information between camera and LiDAR, we obtain point-wise image features. From the scene decomposition, we assign each point a token/cluster id and derive box information for each element. Finally, we extract one feature embedding for each scene element.

3. Multi-modality Scene Tokenization

We propose a novel method, MoST (Multi-modality Scene Tokenization), to enrich the information fed to Transformer-based motion prediction models by efficiently combining existing symbolic representations with scene tokens that encode multi-modality sensor information. In this section, we focus on how we obtain these scene tokens, each represented by a scene element feature enriched with semantic and geometric knowledge extracted from both image and LiDAR data. Figure 2 shows an overview of MoST.

3.1. Image Encoding and Point-Pixel Association

We start by extracting image feature maps for each camera and subsequently associating these features with the corresponding 3D LiDAR points using sensor calibration information. At each time step, we have a set of images {I_k ∈ R^{H_k×W_k×3}}_k captured by a total number of K cameras, where H_k and W_k represent the image dimensions. Additionally, we have a LiDAR point cloud P_xyz ∈ R^{N_pts×3}, with N_pts denoting the number of points. Using a pre-trained 2D image encoder E_img, we obtain a feature map for each image, denoted as {V_k ∈ R^{H'_k×W'_k×D}}_k. Subsequently, we leverage camera and LiDAR calibrations to establish a mapping between 3D LiDAR points and their corresponding 2D coordinates on the image feature map of size H'_k × W'_k. This mapping associates each 3D point with the corresponding image feature vector. As a result, we obtain image features for all N_pts 3D points, represented as F_pts ∈ R^{N_pts×D}. Note that for points projecting outside of any image plane, we set their image features to zeros and mark them as invalid.

To harness a wider range of knowledge, we utilize large pre-trained image models trained on a diverse collection of datasets and tasks, capturing a richer understanding of the real world. We experiment with several image encoder candidates: SAM ViT-H [36], VQ-GAN [15], CLIP [55] and DINO v2 [51]. Different from the others, VQ-GAN uses a codebook to build the feature map. To derive V_k from VQ-GAN, we bottom-crop and partition each input image into multiple 256×256 patches. Subsequently, we extract 256 tokens from each patch and convert them into a 16×16 feature map by querying the codebook. Finally, these partial feature maps are stacked together according to their original spatial locations to produce V_k.
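As a concrete illustration of the point-pixel association above, the following is a minimal NumPy sketch for a single camera, assuming a simple pinhole model with known extrinsics and intrinsics. The function name, the nearest-neighbour feature lookup, and the camera model are illustrative assumptions, not the paper's actual implementation (WOMD cameras, for instance, additionally model rolling shutter).

```python
import numpy as np

def associate_points_to_features(points_xyz, feature_map, world_to_cam, intrinsics, image_hw):
    """Gather one image feature vector per LiDAR point; returns F_pts and a validity mask.

    points_xyz:   (N, 3) LiDAR points in the vehicle/world frame.
    feature_map:  (H', W', D) feature map from a pre-trained image encoder.
    world_to_cam: (4, 4) extrinsic transform; intrinsics: (3, 3) pinhole matrix.
    image_hw:     (H, W) size of the original image the intrinsics refer to.
    """
    n = points_xyz.shape[0]
    h, w = image_hw
    fh, fw, d = feature_map.shape

    # Transform points into the camera frame.
    pts_h = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)        # (N, 4)
    cam = (world_to_cam @ pts_h.T).T[:, :3]                              # (N, 3)
    in_front = cam[:, 2] > 0.1                                           # keep points in front of the camera

    # Pinhole projection to pixel coordinates.
    uvw = (intrinsics @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)                   # (N, 2) pixel coordinates
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    valid = in_front & in_image

    # Scale pixel coordinates to the feature-map resolution and gather (nearest neighbour).
    fu = np.clip((uv[:, 0] * fw / w).astype(int), 0, fw - 1)
    fv = np.clip((uv[:, 1] * fh / h).astype(int), 0, fh - 1)
    f_pts = np.zeros((n, d), dtype=feature_map.dtype)
    f_pts[valid] = feature_map[fv[valid], fu[valid]]                     # invalid points keep zero features
    return f_pts, valid
```

With multiple cameras, one would run this per camera and keep the feature of any camera whose projection is valid, which matches the paper's convention of zeroing out points that fall outside every image plane.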
3.2. Scene Decomposition

Next, our approach groups the full scene LiDAR point cloud into three element types: ground, agents, and open-set objects (see the illustration in Figure 3). We use the term "scene element" to denote the union of these three element types. We denote the numbers of elements of each type by N^gnd_elem, N^agent_elem and N^open-set_elem respectively, and we define N_elem = N^gnd_elem + N^agent_elem + N^open-set_elem as the total number of scene elements.
• Ground elements: These are segmented blocks of the ground surface, obtained through either a dedicated ground point segmentation model or a simple RANSAC algorithm. Since the ground occupies a large area, we divide it into disjoint 10m×10m tiles, following [42].
• Agent elements: These correspond to the points within the bounding boxes of agents, detected by established perception pipelines for a pre-defined set of categories.
• Open-set object elements: These capture the remaining objects not included in the agent categories. Examples include novel categories of traffic participants and obstacles beyond the training data, and long-tail instances that a perception model suppresses due to low confidence. We extract these elements by first removing ground and agent elements from the scene point cloud and then using connected component analysis to group points into instances (a code sketch of this decomposition follows the figure caption below).

Figure 3. Visualization of scene decomposition. We decompose a scene into agent elements, open-set elements and ground elements. We also visualize the perception bounding boxes for agents.
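The sketch below is a rough Python rendering of the decomposition step, with a height threshold standing in for ground-plane fitting or a segmentation model, and DBSCAN standing in for connected component analysis. All thresholds, budgets and helper names are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def decompose_scene(points_xyz, agent_boxes, ground_z_tol=0.3, tile=10.0, cluster_eps=0.5):
    """Assign each point to a ground / agent / open-set element; returns per-point token ids.

    points_xyz:  (N, 3) full-scene LiDAR points.
    agent_boxes: list of (center, size, heading) tuples from the perception pipeline.
    Returns token_id (N,) with -1 for unassigned points, plus the number of elements found.
    """
    n = points_xyz.shape[0]
    token_id = np.full(n, -1, dtype=int)

    # 1) Ground: a simple height threshold stands in for plane fitting here.
    ground = points_xyz[:, 2] < ground_z_tol
    # Split the ground into 10 m x 10 m tiles; each non-empty tile is one ground element.
    tile_idx = np.floor(points_xyz[ground, :2] / tile).astype(int)
    _, tile_token = np.unique(tile_idx, axis=0, return_inverse=True)
    token_id[ground] = tile_token
    next_token = tile_token.max() + 1 if ground.any() else 0

    # 2) Agents: points inside each perception box form one element (BEV check; z is ignored here).
    remaining = ~ground
    for center, size, heading in agent_boxes:
        rel = points_xyz[:, :2] - np.asarray(center)[:2]
        c, s = np.cos(-heading), np.sin(-heading)        # rotate points into the box frame
        local = np.stack([c * rel[:, 0] - s * rel[:, 1], s * rel[:, 0] + c * rel[:, 1]], axis=1)
        inside = remaining & (np.abs(local[:, 0]) < size[0] / 2) & (np.abs(local[:, 1]) < size[1] / 2)
        token_id[inside] = next_token
        remaining &= ~inside
        next_token += 1

    # 3) Open-set objects: connected components over the leftover points (DBSCAN as a stand-in).
    if remaining.any():
        labels = DBSCAN(eps=cluster_eps, min_samples=5).fit_predict(points_xyz[remaining])
        keep = labels >= 0                               # label -1 is DBSCAN noise, left unassigned
        ids = np.where(remaining)[0][keep]
        token_id[ids] = labels[keep] + next_token
        next_token += labels.max() + 1 if keep.any() else 0
    return token_id, next_token
```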
Per-point Token ID: Based on the scene decomposition of each LiDAR frame, we can assign a unique token id to each LiDAR point. Points within the same scene element share one token id. With the point-pixel association, we can scatter each scene token id to a set of camera pixels and/or locations on the image feature maps. As a result, we can obtain features from both LiDAR and camera for each scene element. Based on the point-wise token id, we can pool per-point image features into three sets of cluster-wise embedding vectors, i.e., F^gnd_img ∈ R^{N^gnd_elem×D}, F^agent_img ∈ R^{N^agent_elem×D}, and F^open-set_img ∈ R^{N^open-set_elem×D}.

Scene Element Boxes: We propose to encode each scene element with a combination of image features, coarse-grained geometry features, and fine-grained geometry features. Here we describe how we construct scene element boxes B to represent the coarse-grained geometry. For agent elements, coarse-grained geometry features are derived from perception pipelines, capturing information about agent positions, sizes, and headings. For open-set object elements, we compute the tightest bounding boxes covering the point clusters; these bounding boxes are also represented by box centers, box sizes and headings. For ground elements, we have divided the ground into fixed-size tiles and simply use the tile center coordinates as position information. These box representations will be further encoded with an MLP and combined with the image features and fine-grained features. We dive into this combination in Sec. 3.3.

3.3. Scene Element Feature Extraction

We finally extract scene element features with a neural network module. Multi-frame information is first compressed in an efficient way, then fed into this feature extraction module, which generates a single feature vector for each scene element. The feature extraction module is connected with the downstream Transformer-based motion prediction models, formulating an end-to-end trainable paradigm.

3.3.1 Efficient Multi-frame Data Representation

While we have compiled valuable information for each element within a single-frame scene (LiDAR points, per-point image features, and a bounding box), collecting this data across multiple frames leads to a large increase in memory usage. Considering that self-attention layers have quadratic complexity with respect to the number of tokens, naively concatenating tokens across all history frames would also lead to significantly increased memory usage. We propose an efficient data representation that reduces the amount of data sent to the model with the following three ingredients.

Open-set element tracking: We compress the representation of open-set elements by associating open-set elements across frames using a simple Kalman filter. For each open-set element, we only store its box information for all T frames with a tensor of shape (N^open-set_elem × T × 7), and we apply average pooling across the T frames of its image features, resulting in an image feature tensor of shape (N^open-set_elem × 1 × D).

Ground-element aggregation: Instead of decomposing the ground into tiles for each frame, we apply the decomposition after combining the ground points from all frames.

Cross-frame LiDAR downsampling: Directly storing LiDAR points for all frames is computationally prohibitive. Simply downsampling LiDAR points in each frame still suffers from high redundancy over static parts of the scene. Therefore, we employ different downsampling schemes for ground elements (which are always static) and open-set/agent elements (which could be dynamic). For ground elements, we first merge the ground points across all frames and then uniformly subsample to a fixed number N^gnd_pts. For open-set and agent elements, we subsample them to the fixed numbers N^open-set_pts and N^agent_pts respectively. The final number of LiDAR points is N_pts = N^gnd_pts + N^agent_pts + N^open-set_pts. We also create a tensor P_ind ∈ R^{N_pts×2} that stores the frame id and scene-element id for each point. Note that in this representation, the number of points from each frame is variable, which is more efficient than storing a fixed number of points for all frames with padding.
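Below is a minimal NumPy sketch of this multi-frame packing: points are subsampled per element type to fixed budgets and stored together with a (frame id, token id) index tensor, so the number of points kept per frame can vary. The budgets and dictionary field names are assumptions chosen for illustration, not the paper's exact values.

```python
import numpy as np

def pack_multiframe_points(frames, n_gnd=8192, n_agent=4096, n_open=4096, seed=0):
    """Pack T frames of decomposed points into flat P_xyz / P_ind tensors.

    frames: list (length T) of dicts with keys 'xyz' (N_t, 3), 'token_id' (N_t,), 'type' (N_t,),
            where type is one of {'gnd', 'agent', 'open'}; N_t may differ per frame.
    """
    rng = np.random.default_rng(seed)
    budget = {'gnd': n_gnd, 'agent': n_agent, 'open': n_open}
    xyz_all, ind_all = [], []

    # Ground points are effectively merged across frames (they are static), while agent and
    # open-set points are subsampled per type across all frames to their fixed budgets.
    for etype, quota in budget.items():
        xyz = np.concatenate([f['xyz'][f['type'] == etype] for f in frames], axis=0)
        tok = np.concatenate([f['token_id'][f['type'] == etype] for f in frames], axis=0)
        frm = np.concatenate([np.full((f['type'] == etype).sum(), t) for t, f in enumerate(frames)])
        if xyz.shape[0] > quota:                          # uniform subsampling to the fixed budget
            sel = rng.choice(xyz.shape[0], quota, replace=False)
            xyz, tok, frm = xyz[sel], tok[sel], frm[sel]
        xyz_all.append(xyz)
        ind_all.append(np.stack([frm, tok], axis=1))      # (n, 2): frame id, scene-element (token) id

    p_xyz = np.concatenate(xyz_all, axis=0)               # (N_pts, 3)
    p_ind = np.concatenate(ind_all, axis=0)               # (N_pts, 2)
    return p_xyz, p_ind
```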
3.3.2 Network Architecture

Table 1. Details of the WOMD camera embeddings. Sensors: 8 cameras (front, front left, front right, side left, side right, rear left, rear right, rear) and LiDAR. Temporal: 1.0 s, 11 frames.
| | ViT-VQGAN [70] | SAM ViT-H [36] |
| Pre-trained dataset | WebLI [12] | SA-1B [36] |
| Format | Token & embedding | Embedding |

Figure 4. Scene element feature extraction. The scene-element feature is derived from a spatial-temporal module that fuses together the image feature, the geometry feature and a temporal embedding. The image feature contains pooled features from a large pre-trained image encoder and characterizes the appearance and semantic attributes of the scene element. The geometry feature, on the other hand, characterizes the spatial location as well as the detailed geometry. Temporal information is injected through a learned temporal embedding.

With the efficient multi-frame data representation, the inputs to the scene element feature extraction module are summarized as follows:
• F_pts ∈ R^{N_pts×D}: point-wise image embeddings derived for all the LiDAR points across T frames.
• B ∈ R^{N_elem×T×7}: bounding boxes of the different scene elements across time, where N_elem = N^gnd_elem + N^agent_elem + N^open-set_elem. For ground elements, we only encode the tile center, leaving the remaining four attributes as zeros.
• P = {P_xyz, P_ind}, where P_xyz ∈ R^{N_pts×3} collects the multi-frame LiDAR points, and P_ind ∈ R^{N_pts×2} stores the frame id and token id for each point respectively.

For each tracked element across T time steps, our network (shown in Figure 4) processes the previously listed multi-modality information into one embedding per scene element, denoted as F_elem.

As shown in the top branch of Figure 4, the network leverages P_ind to group the point-wise image embeddings F_pts according to the token id and frame id, which results in the image feature tensor for all scene elements across frames, F_img. In the bottom branch of Figure 4, we derive the geometry information F_geo by encoding two pieces of information, i.e., fine-grained geometry information from the point clouds P_xyz and coarse-grained shape information from the 3D boxes B. The fine-grained geometry is encoded by first mapping point xyz coordinates into a higher dimensional space and grouping the high-dimensional features according to the token id and frame id. The coarse-grained shape encoding is derived by projecting box attributes to the same high dimensional space. Formally, the geometry feature of the i-th scene element, F^i_geo, is defined as

    F^i_geo = pool_by_index(MLP_f(P_xyz), P_ind)[i, :, :] + MLP_c(B)[i, :, :]    (1)

where i is the token id and the function pool_by_index pools point-wise features based on token id and frame id.

Spatial-temporal Fusion: Our spatial-temporal fusion module (Figure 4, right) takes as input the image feature F_img, the geometry feature F_geo, and a trainable temporal embedding f_temporal ∈ R^{T×D} that corresponds to the T frames. It produces a temporally aggregated feature F_elem for all scene elements. Under the hood, the spatial-temporal fusion module adds up the input features and the temporal embedding, then conducts axial attention across the temporal and element axes of the tensor, followed by a final average pooling across the temporal axis, as listed below:

    F_elem ← F_img + F_geo + f_temporal ∈ R^{N_elem×T×D}
    F_elem ← AttnAlongAxis(F_elem, axis = time)
    F_elem ← AttnAlongAxis(F_elem, axis = scene element)
    F_elem ← mean(F_elem, axis = time)

The final F_elem uses a single vector to describe each scene element. This tensor can be fed as additional input to the scene encoding module of transformer-based motion prediction models, such as the recently published [48, 60].
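To make the feature extraction concrete, here is a small NumPy sketch of Eq. (1) and the spatial-temporal fusion. The MLPs are passed in as generic callables, and the axial attention is reduced to projection-free, single-head dot-product attention, so this is a simplified stand-in for the actual Transformer blocks rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_by_index(point_feat, p_ind, n_elem, n_frames):
    """Average point features into a (N_elem, T, D) tensor using the (frame id, token id) pairs in p_ind."""
    d = point_feat.shape[1]
    p_ind = p_ind.astype(int)
    frm, tok = p_ind[:, 0], p_ind[:, 1]
    out = np.zeros((n_elem, n_frames, d))
    cnt = np.zeros((n_elem, n_frames, 1))
    np.add.at(out, (tok, frm), point_feat)
    np.add.at(cnt, (tok, frm), 1.0)
    return out / np.maximum(cnt, 1.0)

def attn_along_axis(x, axis):
    """Plain dot-product self-attention along one axis of a (N_elem, T, D) tensor.
    The real module uses full Transformer blocks; learned projections are omitted for brevity."""
    x = np.moveaxis(x, axis, -2)                                # (..., L, D)
    w = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1]), axis=-1)
    return np.moveaxis(w @ x, -2, axis)

def extract_scene_element_features(f_pts, p_xyz, p_ind, boxes, mlp_f, mlp_c, f_temporal):
    """Sketch of Figure 4 / Eq. (1): fuse image and geometry features into one vector per element.
    mlp_f / mlp_c stand in for the point and box MLPs (any callables mapping (..., D_in) -> (..., D))."""
    n_elem, n_frames = boxes.shape[:2]
    f_img = pool_by_index(f_pts, p_ind, n_elem, n_frames)                        # (N_elem, T, D)
    f_geo = pool_by_index(mlp_f(p_xyz), p_ind, n_elem, n_frames) + mlp_c(boxes)  # Eq. (1)
    f = f_img + f_geo + f_temporal[None]          # add the learned temporal embedding (T, D)
    f = attn_along_axis(f, axis=1)                # attention across time
    f = attn_along_axis(f, axis=0)                # attention across scene elements
    return f.mean(axis=1)                         # (N_elem, D): one vector per scene element
```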
4. Experiments

4.1. The Release of WOMD Camera Embeddings

To advance research in sensor-based motion prediction, we have augmented the Waymo Open Motion Dataset (WOMD) [16] with camera embeddings. WOMD contains the standard perception output, e.g., tracks of bounding boxes, road graph, and traffic signals, and now it also includes synchronized LiDAR points [16] and camera embeddings. Given one scenario, a motion prediction model is required to reason about 1 second of history data and generate predictions for the future 8 seconds at 5 Hz. Our LiDAR reaches up to 75 meters along the radius and the cameras provide multi-view imagery of the environment. WOMD characterizes each perception-detected object using a 3D bounding box (3D center point, heading, length, width, and height) and the object's velocity vector. The road graph is provided as a set of polylines and polygons with semantic types. WOMD is divided into training, validation and testing subsets according to the ratio 70%, 15%, 15%. In this paper, we report results on the validation set. We reserve the test set for future community benchmarking.

Due to data storage issues and the risk of leakage of sensitive information (e.g., human faces, car plate numbers, etc.), we will not release the raw camera images. Instead, the released multi-modality dataset comes in two formats:
• ViT-VQGAN Tokens and Embeddings: We apply a pre-trained ViT-VQGAN [70] to extract tokens and embeddings for each camera image. The number of tokens per camera is 512, where each token corresponds to a 32-dimensional embedding in the quantized codebook.
• SAM ViT-H Embeddings: We apply a pre-trained SAM ViT-H [36] model to extract dense embeddings for each camera image. We release the per-scene-element embedding vectors, each being 256-dimensional.
In the released dataset, we have 1 LiDAR and 8 cameras (front, front-left, front-right, side-left, side-right, rear-left, rear-right, rear). Please see the details in Table 1.

Task and Metrics: Based on the augmented WOMD, we investigate the standard marginal motion prediction task, where a model is required to generate the 6 most likely future trajectories for each of the agents independently of the other agents' futures. We report results for various methods under commonly adopted metrics, namely minADE, minFDE, miss rate, mAP and soft-mAP [16]. For fair comparison, we only compare results based on single model prediction. A simplified sketch of the distance-based metrics is given below.
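For reference, this is a simplified Python sketch of the distance-based metrics for a single agent. The official WOMD evaluation uses per-timestep, velocity-dependent miss thresholds and computes mAP/soft-mAP over confidence-ranked predictions, which this sketch omits; the fixed threshold below is an illustrative assumption.

```python
import numpy as np

def marginal_metrics(pred, gt, miss_threshold=2.0):
    """Simplified minADE / minFDE / miss rate for one agent.

    pred: (K, T, 2) the K=6 predicted future trajectories; gt: (T, 2) ground truth.
    """
    dists = np.linalg.norm(pred - gt[None], axis=-1)        # (K, T) per-step displacement error
    ade = dists.mean(axis=1)                                 # average displacement error per mode
    fde = dists[:, -1]                                       # final displacement error per mode
    return {
        "minADE": ade.min(),
        "minFDE": fde.min(),
        "miss": float(fde.min() > miss_threshold),           # 1 if no mode gets close enough at the end
    }
```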
4.2. Experimental Results

Our MoST is a general paradigm applicable to most transformer-based motion prediction architectures. Without loss of generality, we adopt a state-of-the-art architecture, Wayformer [48], as our motion prediction backbone and augment it with our new design by fusing multi-modality tokens. In the following sections, we use Wayformer as the baseline and show the performance improvement brought by MoST. Please refer to the appendix for implementation details.

4.2.1 Baseline Comparison

In Table 2, we evaluate the proposed approach and compare it with recently published models, i.e., MTR [60], Wayformer [48], MultiPath++ [66], MotionCNN [37], MotionLM [59], and SceneTransformer [50]. Specifically, we study our approach in two settings: 1) using LiDAR + camera tokens with a single decoder and 2) using camera tokens with 3 decoders. During inference, we fit a Gaussian mixture model by merging predictions from the decoder(s) and draw 2048 samples, which are finally aggregated into 6 trajectories through K-means [48] (a minimal sketch of this aggregation step follows Table 2). In both settings, the introduction of sensory tokens leads to a clear performance gain over the corresponding Wayformer baselines based on our re-implementation. Moreover, our approach achieves state-of-the-art performance across various metrics.

Figure 5 illustrates two comparisons between our MoST and the baseline model. The upper example shows that, with tokenized sensor information, our MoST rules out the possibility that a vehicle runs onto walls after a U-turn. The lower example shows a prediction that a cyclist may cross the street, which is safety-critical for the autonomous vehicle so it can take precaution regarding this behavior.

Table 2. Performance comparison on the WOMD validation set. MoST leads to a significant performance gain over the Wayformer baselines and achieves state-of-the-art results in all compared metrics. MoST-SAM_H-{6,64}: our method using the SAM ViT-H feature and predicting based on 6 or 64 queries. MoST-VQGAN-64: our method using the VQGAN feature with 64 queries. For methods with multiple decoders, results are based on ensembling of predictions. MotionLM* is based on contacting the authors for their 1-decoder results, which were not reported in the original publication.
| Method | Reference | Sensor | #Decoders | minADE↓ | minFDE↓ | MissRate↓ | mAP↑ | soft-mAP↑ |
| MotionCNN [37] | CVPRW 2021 | - | - | 0.7383 | 1.4957 | 0.2072 | 0.2123 | - |
| MultiPath++ [66] | ICRA 2022 | - | - | 0.978 | 2.305 | 0.440 | - | - |
| SceneTransformer [50] | ICLR 2022 | - | - | 0.9700 | 2.0700 | 0.1867 | 0.2433 | - |
| MTR [60] | NeurIPS 2022 | - | - | 0.6046 | 1.2251 | 0.1366 | 0.4164 | - |
| Wayformer [48] | ICRA 2023 | - | 3 | 0.5512 | 1.1602 | 0.1208 | 0.4099 | 0.4247 |
| MotionLM* [59] | ICCV 2023 | - | 1 | 0.5702 | 1.1653 | 0.1327 | 0.3902 | 0.4063 |
| Wayformer | Reproduced | - | 1 | 0.5830 | 1.2314 | 0.1347 | 0.3995 | 0.4110 |
| MoST-SAM_H-6 | Ours | C+L | 1 | 0.5228 | 1.0764 | 0.1303 | 0.4040 | 0.4207 |
| MoST-SAM_H-64 | Ours | C+L | 1 | 0.5487 | 1.1355 | 0.1238 | 0.4230 | 0.4380 |
| Wayformer | Reproduced | - | 3 | 0.5494 | 1.1386 | 0.1190 | 0.4052 | 0.4239 |
| MoST-VQGAN-64 | Ours | C | 3 | 0.5391 | 1.1099 | 0.1172 | 0.4201 | 0.4396 |
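The sampling-and-clustering aggregation used at inference time (see also Appendix C) can be sketched as follows, assuming each mode is a diagonal Gaussian over the full trajectory. The exact parameterization of the Wayformer-style mixture head may differ; this is an illustrative sketch, not the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_predictions(means, sigmas, weights, n_samples=2048, n_out=6, seed=0):
    """Sample a trajectory Gaussian mixture, then K-means the samples into n_out trajectories.

    means:   (K, T, 2) mode trajectories merged from all decoder(s).
    sigmas:  (K, T, 2) per-step standard deviations (diagonal Gaussian per mode assumed here).
    weights: (K,) mixture weights summing to 1.
    Returns (n_out, T, 2) trajectories and probabilities from cluster occupancy.
    """
    rng = np.random.default_rng(seed)
    k, t, _ = means.shape

    # 1) Draw trajectory samples from the mixture.
    comp = rng.choice(k, size=n_samples, p=weights)
    samples = means[comp] + sigmas[comp] * rng.standard_normal((n_samples, t, 2))

    # 2) Cluster the samples (flattened over time) into the final trajectories.
    km = KMeans(n_clusters=n_out, n_init=10, random_state=seed).fit(samples.reshape(n_samples, -1))
    trajs = km.cluster_centers_.reshape(n_out, t, 2)
    probs = np.bincount(km.labels_, minlength=n_out) / n_samples
    return trajs, probs
```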
Figure 5. Qualitative comparison. The agent boxes are colored by their types: gray for vehicle, red for pedestrian, and cyan for cyclist. The predicted trajectories are ordered temporally from green to blue. For each modeled agent, the models predict 6 trajectory candidates, whose confidence scores are illustrated by transparency: the more confident, the more visible. The ground truth trajectory is shown as red dots. In the upper example, MoST rules out the possibility that a vehicle runs onto a wall after a U-turn; in the lower example, MoST correctly predicts that a cyclist could suddenly cross the street.

4.2.2 Ablation Study

We find that applying MoST to only the current frame can also lead to significant improvement over the baseline. For efficient experimentation, we perform the ablation studies by employing single-frame MoST and the SAM ViT-H feature.

Effects of Different Pre-trained Image Encoders: To investigate different choices of the image encoder for our model, we have conducted experiments comparing the performance of image feature encoders from various pre-trained models: SAM ViT-H [36], CLIP [55], DINO v2 [51], and VQ-GAN [15]. As shown in Table 3, the SAM ViT-H encoder yields the highest performance across all behavior prediction metrics. We hypothesize that this performance advantage likely stems from SAM's strong capability to extract comprehensive and spatially faithful feature maps, as it is trained with a large and diverse dataset for the dense understanding task of image segmentation. The other large pre-trained image models also demonstrate notable capability, outperforming the Wayformer baseline on mAP and soft mAP, albeit inferior to SAM.

Table 3. Ablation study of different image features. All these image features improve the motion prediction performance, while we observe that SAM ViT-H [36] leads to the most improvement. We use single-frame multi-modal features for these studies.
| Image Encoder | minADE↓ | mAP↑ | soft-mAP↑ |
| DINO-v2 [51] | 0.5597 | 0.4154 | 0.4285 |
| CLIP [55] | 0.5590 | 0.4138 | 0.4272 |
| VQ-GAN [15] | 0.5670 | 0.4058 | 0.4192 |
| SAM ViT-H [36] | 0.5483 | 0.4162 | 0.4321 |

Ablation on Input Modality: To understand how different input modalities affect the final model performance, we conduct ablation experiments and summarize the results in Table 5, where we remove either the image features or the LiDAR features of our single-frame model. We can see that image features and LiDAR features are both beneficial, and combining both modalities leads to the biggest improvement.

Table 5. Ablation study on different input modalities. The first row corresponds to the setting of Wayformer [48]. The second row adds camera image features. The last row further adds LiDAR. We can see that using both camera and LiDAR yields the best results.
| Perception | Camera | LiDAR | minADE↓ | mAP↑ | soft-mAP↑ |
| ✓ | ✗ | ✗ | 0.5830 | 0.3995 | 0.4110 |
| ✓ | ✓ | ✗ | 0.5483 | 0.4118 | 0.4265 |
| ✓ | ✗ | ✓ | 0.5486 | 0.4040 | 0.4212 |
| ✓ | ✓ | ✓ | 0.5483 | 0.4162 | 0.4321 |

Ablation on Scene Element: To gain deeper insights into the contribution of each type of scene element, we conduct ablation studies in Table 4 by removing specific element types and evaluating the impact on behavior prediction metrics. Combining all types of scene elements leads to the best soft-mAP metric. Note that only associating image features with agents gives the smallest improvement. We hypothesize the reasons to be two-fold: 1) in most cases, the agent box is sufficient to characterize the motion of the object; 2) there are only a handful of agents in the scene, so very few image features are included in the model.

Table 4. Ablation study on how different scene elements affect the performance. The first four rows show that all types of scene elements bring benefits to the model. The last row shows that aggregating scene elements across frames considerably improves the soft-mAP, though it leads to a slight regression of minADE.
| Open-set | Agent | Ground | M-frame | minADE↓ | soft-mAP↑ |
| ✗ | ✓ | ✓ | ✗ | 0.5654 | 0.4112 |
| ✓ | ✗ | ✓ | ✗ | 0.5520 | 0.4273 |
| ✓ | ✓ | ✗ | ✗ | 0.5514 | 0.4241 |
| ✓ | ✓ | ✓ | ✗ | 0.5483 | 0.4321 |
| ✓ | ✓ | ✓ | ✓ | 0.5487 | 0.4380 |
Alternative Scene Tokenizer: We use SAM ViT-H in this experiment. We design another baseline tokenizer, denoted as Image-grid token, which tokenizes each image feature map into 16×16 = 256 image embeddings by subsampling 4× along the column and row axes. The features from all camera images are flattened and concatenated to form scene tokens. In Table 6 we can see that the Image-grid tokenizer also leads to improvement compared to the Wayformer baseline, though it is inferior to our cluster-based sparse tokenizer, which utilizes the point cloud to derive accurate depth information and leverages the intrinsic scene sparsity to obtain compact tokens.

Table 6. Comparison of scene tokenization strategies with single-frame sensor data. Both tokenization strategies lead to improvement over the vanilla Wayformer [48], which does not use sensor data.
| Sensory Token | minADE↓ | mAP↑ | soft-mAP↑ |
| None (Wayformer) | 0.5830 | 0.3995 | 0.4110 |
| Image-grid Token (Ours) | 0.5495 | 0.4109 | 0.4261 |
| Scene Cluster Token (Ours) | 0.5483 | 0.4162 | 0.4321 |

4.2.3 Evaluation on Challenging Scenarios

While the improvements shown above demonstrate overall gains across all driving scenarios, we are also interested in investigating the performance gain in the most challenging cases. Here we present how our model performs in challenging scenarios, specifically on (a) a mined set of hard scenarios, (b) situations where perception failures happen, and (c) situations where the roadgraph is inaccurate.

Mined Hard Scenarios: To assess the effectiveness of our method in complex situations, we have curated a set of hard scenarios. We conduct a per-scenario evaluation throughout the entire validation set, identifying the 1000 scenarios where each model performs worst in terms of minADE across the vehicle, pedestrian, and cyclist categories, for the baseline and for MoST-SAM_H-64 respectively. In this way, we ensure the mining is symmetric and fair for both methods. We then combine these 6000 scenarios, resulting in 4024 unique scenarios, forming our curated challenging evaluation dataset. As shown in Table 7, MoST demonstrates more pronounced relative improvements in mAP and soft-mAP, i.e., 13.1% and 12.4% respectively, compared to the baseline in these hardest scenarios, confirming its enhanced robustness and resilience in complex situations. We also find that improving minADE in these hard scenarios is still a challenge.

Table 7. Evaluation on hard scenarios. We curate a set of hard scenarios based on the performance of MoST and Wayformer on them. MoST consistently shows improved performance.
| Method | minADE↓ | mAP↑ | soft-mAP↑ |
| Wayformer [48] | 0.9002 | 0.2312 | 0.2382 |
| MoST-SAM_H-64 | 0.8720 | 0.2615 | 0.2677 |

Perception Failure: Most motion prediction algorithms [48, 59] assume accurate perception object boxes as inputs. It is critical to understand how such a system will perform when this assumption breaks due to various reasons, such as long-tail and novel categories of objects beyond training supervision, occlusion, long range, etc. Thus, we propose to additionally evaluate our method against the baseline method in the case of perception failure. Concretely, we simulate the perception failure of not detecting certain object boxes by randomly removing agents according to a fixed ratio of agents in the scene. The boxes dropped out are consistent for MoST-SAM_H-64 and the baseline. As shown in Table 8, MoST shows robustness against perception failures: even when the failure rate rises to 50%, our model still performs on par with the Wayformer baseline (soft mAP 0.4121).

Roadgraph Failure: Motion prediction models often exhibit a strong reliance on roadgraphs, leading to potential vulnerabilities in situations where the roadgraph is incomplete or inaccurate. Our proposed model, MoST, tackles this issue by incorporating multi-modality scene tokens as additional inputs, thereby enhancing its robustness against roadgraph failures. We demonstrate this advantage by simulating various levels of roadgraph errors, similar to the aforementioned perception failure simulation. Specifically, we evaluate MoST-SAM_H-64 under scenarios with 10%, 30%, and 50% missing roadgraph segments in the validation set. Notably, as showcased in Table 8, even with 30% of the roadgraph missing, MoST performs on par with baseline models that assume perfect roadgraph information.

Table 8. Evaluation on simulated perception and roadgraph failures. We vary the ratio of miss-detected agent boxes and miss-detected roadgraph segments in scenes as 10%, 30% and 50%, respectively. With multi-modal features, MoST performs on par with baselines even with 50% perception or 30% roadgraph failure.
| Failure type | Failure rate | minADE↓ | soft-mAP↑ |
| None | 0% | 0.5515 | 0.4396 |
| Perception | 10% | 0.5560 | 0.4302 |
| Perception | 30% | 0.5625 | 0.4235 |
| Perception | 50% | 0.5712 | 0.4164 |
| Roadgraph | 10% | 0.5647 | 0.4217 |
| Roadgraph | 30% | 0.6020 | 0.4010 |
| Roadgraph | 50% | 0.6707 | 0.3499 |

5. Conclusions

To promote sensor-based motion prediction research, we have enhanced WOMD with camera embeddings, making it a large-scale multi-modal dataset for benchmarking. To efficiently integrate multi-modal sensor signals into motion prediction, we propose a method that represents the multi-frame scenes as a set of scene elements and leverages large pre-trained image encoders and 3D point cloud networks to encode rich semantic and geometric information for each element. We demonstrate that our approach leads to significant improvements in the motion prediction task.
References

[1] Yuriy Biktairov, Maxim Stebelev, Irina Rudenko, Oleh Shliazhko, and Boris Yangel. PRANK: Motion prediction based on ranking. NeurIPS, 2020.
[2] Thibault Buhet, Emilie Wirbel, Andrei Bursuc, and Xavier Perrotton. PLOP: Probabilistic polynomial objects trajectory planning for autonomous driving. arXiv:2003.08744, 2020.
[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. CVPR, 2020.
[4] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. CoRL, 2018.
[5] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. SpAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. ICRA, 2020.
[6] Sergio Casas, Abbas Sadat, and Raquel Urtasun. MP3: A unified model to map, perceive, predict and plan. CVPR, 2021.
[7] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRL, 2019.
[8] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRL, 2019.
[9] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. CVPR, 2019.
[10] Kan Chen, Runzhou Ge, Hang Qiu, Rami Al-Rfou, Charles R. Qi, Xuanyu Zhou, Zoey Yang, Scott Ettinger, Pei Sun, Zhaoqi Leng, et al. WOMD-LiDAR: Raw sensor dataset benchmark for motion forecasting. arXiv:2304.03834, 2023.
[11] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. arXiv:2306.16927, 2023.
[12] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv:2209.06794, 2022.
[13] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. ICRA, 2019.
[14] Nemanja Djuric, Henggang Cui, Zhaoen Su, Shangxuan Wu, Huahua Wang, Fang-Chieh Chou, Luisa San Martin, Song Feng, Rui Hu, Yang Xu, et al. MultiXNet: Multiclass multistage multimodal motion prediction. IEEE Intelligent Vehicles Symposium (IV), 2021.
[15] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. CVPR, 2021.
[16] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. ICCV, 2021.
[17] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. ICCV, 2021.
[18] Sudeep Fadadu, Shreyash Pandey, Darshan Hegde, Yi Shi, Fang-Chieh Chou, Nemanja Djuric, and Carlos Vallespi-Gonzalez. Multi-view fusion of sensor data for improved perception and prediction in autonomous driving. WACV, 2022.
[19] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. CVPR, 2020.
[20] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. HOME: Heatmap output for future motion estimation. ITSC, 2021.
[21] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. GOHOME: Graph-oriented heatmap output for future motion estimation. ICRA, 2022.
[22] Junru Gu, Chen Sun, and Hang Zhao. DenseTNT: End-to-end trajectory prediction from dense goal sets. ICCV, 2021.
[23] Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, and Hang Zhao. ViP3D: End-to-end visual trajectory prediction via 3D agent queries. CVPR, 2023.
[24] Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. arXiv:2310.08710, 2023.
[25] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. CVPR, 2018.
[26] Steffen Hagedorn, Marcel Hallgarten, Martin Stoll, and Alexandru Condurache. Rethinking integration of prediction and planning in deep learning-based automated driving systems: A review. arXiv:2308.05731, 2023.
[27] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. CVPR, 2019.
[28] Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, and Alex Kendall. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras. ICCV, 2021.
[29] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. ECCV, 2022.
[30] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. CVPR, 2023.
[31] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. ICCV, 2023.
[32] Xiaosong Jia, Penghao Wu, Li Chen, Yu Liu, Hongyang Li, and Junchi Yan. HDGT: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding. IEEE TPAMI, 2023.
[33] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. CVPR, 2023.
[34] Alexey Kamenev, Lirui Wang, Ollin Boer Bohan, Ishwar Kulkarni, Bilal Kartal, Artem Molchanov, Stan Birchfield, David Nistér, and Nikolai Smolyanskiy. PredictionNet: Real-time joint probabilistic traffic prediction for planning, control, and simulation. ICRA, 2022.
[35] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, and Deva Ramanan. What-if motion prediction for autonomous driving. arXiv:2008.10587, 2020.
[36] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. arXiv:2304.02643, 2023.
[37] Stepan Konev, Kirill Brodt, and Artsiom Sanakoyeu. MotionCNN: A strong baseline for motion prediction in autonomous driving. CVPRW, 2021.
[38] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. ECCV, 2022.
[39] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. ECCV, 2020.
[40] Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. PnPNet: End-to-end perception and prediction with tracking in the loop. CVPR, 2020.
[41] Jerry Liu, Wenyuan Zeng, Raquel Urtasun, and Ersin Yumer. Deep structured reactive planning. ICRA, 2021.
[42] Minghua Liu, Yin Zhou, Charles R. Qi, Boqing Gong, Hao Su, and Dragomir Anguelov. LESS: Label-efficient semantic segmentation for LiDAR point clouds. ECCV, 2022.
[43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2017.
[44] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. CVPR, 2018.
[45] Wenjie Luo, Cheol Park, Andre Cornman, Benjamin Sapp, and Dragomir Anguelov. JFP: Joint future prediction with interactive multi-agent modeling for autonomous driving. CoRL, 2023.
Wayformer: 2022. 2 Motionforecastingviasimple&efficientattentionnetworks. [35] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew In2023IEEEInternationalConferenceonRoboticsandAu- Hartnett, and Deva Ramanan. What-if motion prediction tomation(ICRA),pages2980–2987.IEEE,2023. 2,5,6,7, forautonomousdriving. arXivpreprintarXiv:2008.10587, 8,13 2020. 2 [49] Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zheng- [36] AlexanderKirillov,EricMintun,NikhilaRavi,HanziMao, dongZhang,Hao-TienLewisChiang,JeffreyLing,Rebecca ChloeRolland,LauraGustafson,TeteXiao,SpencerWhite- Roelofs,AlexBewley,ChenxiLiu,AshishVenugopal,etal. head,AlexanderCBerg,Wan-YenLo,etal. Segmentany- Scenetransformer:Aunifiedarchitectureforpredictingmul- thing. arXivpreprintarXiv:2304.02643,2023. 2,3,5,6,7, tiple agent trajectories. arXiv preprint arXiv:2106.08417, 13 2021. 2 10[50] Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zheng- IEEE/CVF Conference on Computer Vision and Pattern dongZhang,Hao-TienLewisChiang,JeffreyLing,Rebecca Recognition,pages6543–6552,2022. 2 Roelofs,AlexBewley,ChenxiLiu,AshishVenugopal,David [63] Charlie Tang and Russ R Salakhutdinov. Multiple futures Weiss,BenSapp,ZhifengChen,andJonathonShlens.Scene prediction. Advancesinneuralinformationprocessingsys- transformer: A unified architecture for predicting multiple tems,32,2019. 2 agenttrajectories. InICLR,2022. 6 [64] Ekaterina Tolstaya, Reza Mahjourian, Carlton Downey, [51] Maxime Oquab, Timothe´e Darcet, The´o Moutakanni, Huy Balakrishnan Vadarajan, Benjamin Sapp, and Dragomir Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Anguelov. Identifyingdriverinteractionsviaconditionalbe- DanielHaziza,FranciscoMassa,AlaaeldinEl-Nouby,etal. haviorprediction.In2021IEEEInternationalConferenceon Dinov2:Learningrobustvisualfeatureswithoutsupervision. RoboticsandAutomation(ICRA),pages3473–3479.IEEE, arXivpreprintarXiv:2304.07193,2023. 2,3,7 2021. 2 [52] Seong Hyeon Park, Gyubok Lee, Jimin Seo, Manoj Bhat, [65] Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Sri- Minseok Kang, Jonathan Francis, Ashwin Jadhav, Paul Pu vastava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Liang, and Louis-Philippe Morency. Diverse and admissi- Cornman, Kan Chen, Bertrand Douillard, Chi-Pang Lam, bletrajectoryforecastingthroughmultimodalcontextunder- DragomirAnguelov,andBenjaminSapp. Multipath++: Ef- standing. InECCV,pages282–298.Springer,2020. 2 ficientinformationfusionandtrajectoryaggregationforbe- [53] DeanAPomerleau. Alvinn:Anautonomouslandvehiclein haviorprediction. CoRR,abs/2111.14973,2021. 2 aneuralnetwork.Advancesinneuralinformationprocessing [66] Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivas- systems,1,1988. 2 tava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Corn- [54] CharlesRQi,YinZhou,MahyarNajibi,PeiSun,KhoaVo, man, Kan Chen, Bertrand Douillard, Chi Pang Lam, BoyangDeng,andDragomirAnguelov. Offboard3dobject Dragomir Anguelov, and Benjamin Sapp. Multipath++: detectionfrompointcloudsequences. InCVPR,2021. 2 Efficient information fusion and trajectory aggregation for [55] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya behavior prediction. In 2022 International Conference on Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Robotics and Automation (ICRA), pages 7814–7821, 2022. AmandaAskell,PamelaMishkin,JackClark,etal. Learn- 6 ingtransferablevisualmodelsfromnaturallanguagesuper- [67] PengqinWang,MeixinZhu,HongliangLu,HuiZhong,Xi- vision. InICML,2021. 2,3,7 andaChen,ShaojieShen,XuesongWang,andYinhaiWang. 
[56] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. ICCV, 2019.
[57] Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. ECCV, 2020.
[58] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. ECCV, 2020.
[59] Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S. Refaat, Rami Al-Rfou, and Benjamin Sapp. MotionLM: Multi-agent motion forecasting as language modeling. ICCV, 2023.
[60] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. NeurIPS, 2022.
[61] Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, and Qifeng Chen. PiP: Planning-informed trajectory prediction for autonomous driving. ECCV, 2020.
[62] Qiao Sun, Xin Huang, Junru Gu, Brian C. Williams, and Hang Zhao. M2I: From factored marginal trajectory prediction to interactive prediction. CVPR, 2022.
[63] Charlie Tang and Russ R. Salakhutdinov. Multiple futures prediction. NeurIPS, 2019.
[64] Ekaterina Tolstaya, Reza Mahjourian, Carlton Downey, Balakrishnan Vadarajan, Benjamin Sapp, and Dragomir Anguelov. Identifying driver interactions via conditional behavior prediction. ICRA, 2021.
[65] Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi-Pang Lam, Dragomir Anguelov, and Benjamin Sapp. MultiPath++: Efficient information fusion and trajectory aggregation for behavior prediction. arXiv:2111.14973, 2021.
[66] Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi-Pang Lam, Dragomir Anguelov, and Benjamin Sapp. MultiPath++: Efficient information fusion and trajectory aggregation for behavior prediction. ICRA, 2022.
[67] Pengqin Wang, Meixin Zhu, Hongliang Lu, Hui Zhong, Xianda Chen, Shaojie Shen, Xuesong Wang, and Yinhai Wang. BEVGPT: Generative pre-trained large model for autonomous driving prediction, decision-making, and planning. arXiv:2310.10357, 2023.
[68] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv:2301.00493, 2023.
[69] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kenneth K. Y. Wong, Zhenguo Li, and Hengshuang Zhao. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. arXiv:2310.01412, 2023.
[70] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. arXiv:2110.04627, 2021.
[71] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. CVPR, 2019.
[72] Tianyuan Zhang, Xuanyao Chen, Yue Wang, Yilun Wang, and Hang Zhao. MUTR3D: A multi-camera tracking framework via 3D-to-2D queries. CVPR, 2022.
[73] Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, and Jiwen Lu. BEVerse: Unified perception and prediction in bird's-eye-view for vision-centric autonomous driving. arXiv:2205.09743, 2022.
[74] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. CVPR, 2018.
[75] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3D object detection in LiDAR point clouds. CoRL, 2020.
Figure 6. Additional qualitative comparison between MoST and the Wayformer [48] baseline. The agent boxes are colored by their types: gray for vehicle, red for pedestrian, and cyan for cyclist. The predicted trajectories are ordered temporally from green (+0s) to blue (+8.0s). For each modeled agent, the models predict 6 trajectory candidates, whose confidence scores are illustrated by transparency: the more confident, the more visible. The ground truth trajectory is shown as red dots. Note that the vehicle indicated by the red arrow is entering a plaza which has no map coverage. Since our model has access to rich visual signals, it correctly predicts the vehicle's possible trajectories, which include following the arrow and turning right. Wayformer, on the other hand, completely missed this possibility due to the lack of road graph information in that region.

Figure 7. Examples of reconstructed driving images from ViT-VQGAN codes. We show 3 cameras at 3 consecutive timestamps. We are able to decode high quality images from the VQGAN codes.

Appendix

A. Additional Qualitative Results

An additional qualitative comparison can be found in Figure 6. In this scenario, the model is asked to predict the future trajectory of a vehicle entering a plaza which is not mapped by the road graph. Our model, with access to visual information, correctly predicts several trajectories following the arrow painted on the ground and turning right.

Figure 8. Examples of SAM features. The first row shows camera images and the second row illustrates the SAM feature maps visualized by PCA reduction from 256 to 3 dimensions.

B. WOMD Camera Embeddings

VQGAN Embedding: To extract the VQGAN embedding for an image, we first resize the image to a shape of 256×512. Then we horizontally split the image into two patches and apply the pre-trained ViT-VQGAN [70] model on each patch respectively. Each patch contains 16×16 tokens, so each camera image can be represented as 512 tokens. The codebook size is 8192. A small sketch of this token layout is given at the end of this appendix.

SAM-H Embedding: For each camera we extract a SAM ViT-H [36] embedding of size 64×64×256. Compared to VQGAN embeddings, SAM features are less spatially compressed due to the high-resolution feature map. A visualization of the SAM embedding can be found in Figure 8. We release the SAM features pooled per scene element.

C. Implementation Details

Model Detail: We use N^agent_elem = 128, N^open-set_elem = 384, N^gnd_elem = 256, and N_pts = 65536 in our experiments. We use sensor data from the past 10 frames, which correspond to the 1 second history, and the current frame (i.e., T = 11). Following Wayformer [48], we train our model to output K modes for the Gaussian mixture, where we experiment with K = {6, 64}. During inference, we draw 2048 samples from the predicted Gaussian mixture distribution and use K-means clustering to aggregate those 2048 samples into 6 final trajectory predictions.

Training Detail: For all experiments, we train our model using AdamW [43] on 64 Google Cloud TPU v4 cores (https://cloud.google.com/tpu) with a global batch size of 512. We use a cosine learning rate schedule, where the learning rate is initialized to 3×10^-4 and ramps up to 6×10^-4 after 1,000 steps. The training finishes after 500,000 steps.

Notations: Please refer to Table 9 for a summary of the notations used in the main paper.

Table 9. Descriptions of the variables used in the main paper.
| Variable | Description | Tensor shape |
| N_pts | The total number of LiDAR points after downsampling. | 1 |
| N_elem | The total number of scene elements. | 1 |
| T | The total number of frames. | 1 |
| D | The feature dimension. | 1 |
| P_xyz | The aggregated LiDAR points from all frames after downsampling. | N_pts×3 |
| P_ind | The scene element index and frame index for each LiDAR point. | N_pts×2 |
| F_pts | The per-point image feature. | N_pts×D |
| B | The box attributes, including box center, box size, and box heading. | N_elem×T×7 |
| F_img | The per-scene-element image feature. | N_elem×T×D |
| F_geo | The per-scene-element geometry feature. | N_elem×T×D |
| f_temporal | The learnable temporal embedding. | 1×T×D |
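As referenced in Appendix B, the following is a small sketch of how the released ViT-VQGAN tokens for one camera can be turned back into a coarse feature map. The assumption that the left image patch is stored before the right one is illustrative and not specified by the release format.

```python
import numpy as np

def vqgan_tokens_to_feature_map(tokens, codebook):
    """Rebuild a per-camera feature map from released ViT-VQGAN tokens (layout as in Appendix B).

    tokens:   (512,) integer token ids for one camera: two 256x256 patches of a 256x512 resize,
              16x16 tokens each, assumed to be stored left patch first.
    codebook: (8192, 32) quantized codebook embeddings.
    Returns a (16, 32, 32) feature map covering the full (resized) image.
    """
    left, right = tokens[:256].reshape(16, 16), tokens[256:].reshape(16, 16)
    grid = np.concatenate([left, right], axis=1)           # (16, 32) token grid over the image
    return codebook[grid]                                   # codebook lookup -> (16, 32, 32)
```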