MULTIPATH++: EFFICIENT INFORMATION FUSION AND TRAJECTORY AGGREGATION FOR BEHAVIOR PREDICTION BalakrishnanVaradarajan AhmedHefny AvikalpSrivastava KhaledS.Refaat NigamaaNayakanti AndreCornman KanChen BertrandDouillard ChiPangLam DragomirAnguelov BenjaminSapp∗ ABSTRACT Predictingthefuturebehaviorofroadusersisoneofthemostchallengingandimportantproblems in autonomous driving. Applying deep learning to this problem requires fusing heterogeneous worldstateintheformofrichperceptionsignalsandmapinformation,andinferringhighlymulti- modaldistributionsoverpossiblefutures. Inthispaper,wepresentMultiPath++,afutureprediction modelthatachievesstate-of-the-artperformanceonpopularbenchmarks. MultiPath++improvesthe MultiPatharchitecture[45]byrevisitingmanydesignchoices. Thefirstkeydesigndifferenceisa departurefromdenseimage-basedencodingoftheinputworldstateinfavorofasparseencodingof heterogeneoussceneelements: MultiPath++consumescompactandefficientpolylinestodescribe roadfeatures,andrawagentstateinformationdirectly(e.g.,position,velocity,acceleration). We proposeacontext-awarefusionoftheseelementsanddevelopareusablemulti-contextgatingfusion component. Second, we reconsider the choice of pre-defined, static anchors, and develop a way to learn latent anchor embeddings end-to-end in the model. Lastly, we explore ensembling and output aggregation techniques—common in other ML domains—and find effective variants for our probabilistic multimodal output representation. We perform an extensive ablation on these design choices, and show that our proposed model achieves state-of-the-art performance on the ArgoverseMotionForecastingCompetition[12]andtheWaymoOpenDatasetMotionPrediction Challenge[18]. 1 Introduction Modelingandpredictingthefuturebehaviorofhumanagentsisafundamentalprobleminmanyreal-worldrobotics domains. Forexample,accuratelyforecastingthefuturestateofothervehicles,cyclistsandpedestriansiscriticalfor safe,comfortable,andhuman-likeautonomousdriving. However,behaviorpredictioninanautonomousvehicle(AV) drivingsettingposesanumberofuniquemodelingchallenges: 1)Multimodaloutputspace: Theproblemisinherentlystochastic;itisimpossibletotrulyknowthefuturestateof theenvironment. Thisisexacerbatedbythefactthatotheragents’intentionsarenotobservable,andleadstoahighly multimodaldistributionoverpossibleoutcomes(e.g.,acarcouldturnleftorrightatanintersection). Effectivemodels mustbeabletorepresentsucharichoutputspacewithhighprecisionandrecallmatchingtheunderlyingdistribution. 2)Heterogenous,interrelatedinputspace:Thedrivingenvironmentrepresentationcancontainahighlyheterogeneous mixofstaticanddynamicinputs: roadnetworkinformation(lanegeometryandconnectivity,stoplines,crosswalks), trafficlightstateinformation,andmotionhistoryofagents. Drivingsituationsareoftenhighlyinteractive,andcan involvemanyagentsatonce(e.g.negotiatinga4-waystopwithcrosswalks). Thisrequirescarefulmodelingchoices,as explicitlymodelingjointfuturedistributionsovermultipleagentsisexponentialinthenumberofagents. Effective modelsmustcapturenotonlytheinteractionsbetweentheagentsinthescene,butalsotherelationshipsbetweenthe roadelementsandthebehaviorofagentsgiventheroadcontext. ∗Contact:{balakrishnanv, bensapp}@waymo.com.Waymo,1600AmphitheatrePkwy,MountainView,California,USA. 1202 ceD 22 ]VC.sc[ 3v37941.1112:viXraThenovelchallengesandhighimpactofthisproblemhavenaturallygarneredmuchinterestinrecentyears. Therehas beenarichbodyofworkonhowtomodelagents’futures,theirinteractions,andtheenvironment. However,there is little consensus to date on the best modeling choices for each component, and in popular benchmark challenge datasets[6,12,18,53],thereisasurprisinglydiversesetofsolutionstothisproblem;fordetailsseeSection2and Table2. The MultiPath framework [45] addresses the multimodal output space challenge above by modeling the highly multimodaloutputdistributionsviaaGaussianMixtureModel. Ithandlesacommonissueofmodecollapseduring learningbyusingstatictrajectoryanchors,anexternalinputtothemodel. Thispracticalsolutiongivespractitionersa straightforwardwayofensuringdiversityandanextralevelofmodelercontrolviathedesignofsuchanchors. The choiceofaGMMrepresentationprovedtobeanextremelypopular,appearinginmanyworks—seeTable2,“Trajectory Distribution”,where“Weightedset”isaspecialcaseofGMMswhereonlymeansandmixtureweightsaremodeled. TheMultiPathinputrepresentationandbackbonedrawsheavilyuponthecomputervisionliterature. Byrasterizingall worldstateinatop-downorthographicview,MultiPathandothers[10,14,27,30,33,37,46,52]leveragepowerful, establishedCNNarchitectureslikeResNet[25],whichoffersolutionstotheheterogeneousinterrelatedinputspace: the heterogeneousworldstateismappedtoacommonpixelformat,andinteractionsoccurvialocalinformationsharingvia convolutionoperations. Whileconvenientandestablished,therearedownsidestosuchrasterization: (1)Thereisan uneasytrade-offbetweenresolutionofthespatialgrid,fieldofview,andcomputerequirements. (2)Rasterizingisa formofmanualfeatureengineering,andsomefeaturesmaybeinherentlydifficulttorepresentinsuchaframework(e.g. radialvelocity). (3)Itisdifficulttocapturelongrangeinteractionsviaconvolutionswithsmallreceptivefields. (4) Theinformationcontentisspatiallyverysparse,makingadenserepresentationapotentiallycomputationallywasteful choice. Inthispaper,weintroduceMultiPath++,whichbuildsuponMultiPath,takingitsoutputGMMrepresentationand conceptofanchors,butreconsideringhowtorepresentandcombinehighlyheterogeneousworldstateinputsandmodel interactionsbetweenstateelements. MultiPath++introducesanumberofkeyupgrades: • Weeschewtherasterization-and-CNNapproachinfavorofmodelingsparseworldstateobjectsmoredirectly fromtheircompactstatedescription. Werepresentroadelementsaspolylines,agenthistoryasasequenceof physicalstateencodedwithRNNs,andagentinteractionsasRNNsoverthestateofneighborsrelativetoeach ego-agent. Thesechoicesavoidlossyrasterizationinfavorofraw,continuousstate,andresultincompute complexitythatscaleswiththenumberofsceneelementsratherthanthesizeofaspatialgrid. Long-range dependenciesareeffectivelyandefficientlymodeledinourrepresentation. • Capturingrelationshipsbetweenroadelementsandagentsiscritical,andwefindthatencodingeachelement independentlydoesnotperformaswellasmodelingtheirinteractions(e.g.,whentheroadencodingisaware ofrelevantagents,andviceversa). Toaddressthis,weproposeanovelformofcontextawarenesswecall multi-contextgating(MCG),inwhichsetsofelementshaveaccesstoasummarycontextvectoruponwhich encodings are conditioned. MCG is implemented as a generic neural network component that is applied throughoutourmodel. MCGcanbeviewedasanefficientformofcross-attention,whoseefficiency/quality trade-offdependsonthesizeofthecontextvector. • Wealsoexploreimprovementsintrajectorymodeling,comparingrepresentationsbasedonkinematiccontrols, and/orpolynomialsasafunctionofcontinuousfuturetime. Wefurtherdemonstrateawaytolearnlatent representationsofanchorsandshowtheyoutperformtheoriginalstaticanchorsofMultiPath,whilesimplifying modelcreationtoasingle-stepprocess. • Finally,wefindsignificantadditionalgainsonpublicbenchmarksbyapplyingensemblingtechniquestoour models. Unlikemodelswithstaticanchors,thelatentanchorsofaMultiPath++ensemblearenotindirect correspondence. Furthermore,alotofpopularbehaviorpredictionbenchmarkshaveintroducedmetricssuch asmiss-rate(MR)andmeanAveragePrecision(mAP),whichrequiretheabilitytomodeldiverseoutcomes withfewtrajectoriesanddifferfrompuretrajectorydistanceerrorcapturingtheaverageagentbehavior. With theaboveinmind,weformulatetheproblemofensemblingtheresultsofseveralmodelsasoneofgreedy iterativeclustering,whichmaximizesaprobabilisticobjectiveusingthepopularExpectationMaximization algorithm[3]. AsofNovember1,2021,MultiPath++ranks1stontheWaymoOpenMotionDatasetleaderboard2,4thontheArgoverse MotionForecasting. Competition3WeofferMultiPath++asareferencesetofdesignchoices,empiricallyvalidatedvia ablationstudies,thatcanbeadopted,furtherstudiedandextendedbythebehaviormodelingcommunity. 2https://waymo.com/open/challenges/2021/motion-prediction/ 3https://eval.ai/web/challenges/challenge-page/454/leaderboard/1279 2Method RoadEnc. MotionEnc. Interactions Decoder Output TrajectoryDistribution Jean[34] – LSTM attention LSTM states GMM TNT[54] polyline polyline maxpool,attention MLP states Weightedset LaneGCN[32] GNN 1Dconv GNN MLP states Weightedset WIMP[28] polyline LSTM GNN+attention LSTM states GMM VectorNet[20] polyline polyline maxpool,attention MLP states Singletraj. SceneTransformer[35] polyline attention attention attention states Weightedset GOHOME[21] GNN 1Dconv+GRU GNN MLP states heatmap MP3[11] raster raster conv conv costfunction Weightedsamples CoverNet[37] raster raster conv lookup states GMMw/dynamicanchors DESIRE[30] raster GRU spatialpooling GRU states Samples RoadRules[27] raster raster conv LSTM states GMM SocialLSTM[1] – LSTM spatialpooling LSTM states Samples SocialGan[24] – LSTM maxpool LSTM states Samples MFP[46] raster GRU RNNs+attention GRU states Samples MANTRA[33] raster GRU – GRU states Samples PRANK[2] raster raster conv lookup states Weightedset IntentNet[10] raster raster conv conv states Singletraj. SpaGNN[9] raster raster GNN MLP state Singletraj. Multimodal[14] raster raster conv conv states Weightedset PLOP[5] raster LSTM conv MLP statepoly GMM Precog[41] raster GRU multi-agentsim. GRU motion Samples R2P2[40] raster GRU – GRU motion Samples HYU_ACE[36] raster LSTM attn LSTM motion Samples Trajectron++[44] raster LSTM RNNs+attention GRU controls GMM DKM[13] raster raster conv conv controls Weightedset MultiPath[45] raster raster conv MLP states GMMw/staticanchors MultiPath++ polyline LSTM RNNs+maxpool MLP controlpoly GMM Table 1: A survey of recent work in behavior prediction, categorized by choice of road encoding, motion history encoding,agentinteractionencoding,trajectorydecoding,intrinsicoutputrepresentation,anddistributionoverfuture trajectories. 2 RelatedWork Wefocusonarchitecturaldesignchoicesforbehaviorpredictionindrivingenvironments—whatrepresentationstouse toencoderoadinformation,agentmotion,agentinteractions,outputtrajectories,andoutputdistributions. Table2isa summaryofpastwork,whichwegooverherewithadditionalcontext. Forroadencoding,thereisadichotomyofrepresentations.Therasterapproachencodestheworldasastackofimages, fromatop-downorthographic(or“bird’s-eye”)view. Rasterizingtheworldstatehasthebenefitofsimplicity—all thevarioustypesofinputinformation(roadconfiguration,agentstatehistory,spatialrelationships)areunifiedvia renderingasamulti-channelimage,enablingonetoleveragepowerfuloff-the-shelfConvolutionalNeuralNetwork (CNN)techniques. However,thisone-size-fits-allapproachhassignificantdownsides: difficultyinmodelinglong-range interactions, constrained field of view, and difficulty in representing continuous physical states. As an alternative, the polyline approach describes curves (e.g., lanes, crosswalks, boundaries) as piecewise linear segments. This is a significantly more compact form due to the sparse nature of road networks. Previous works typically process a set-of-polylinesdescriptionoftheworldinaper-agent,agent-centriccoordinatesystem. LaneGCN[32]standsapartby treatingroadlanesasnodesinagraphneuralnetwork,leveragingroadnetworkconnectivitystructure. Tomodelmotionhistory,onepopularchoiceistoencodethesequenceofpastobservedstatesviaarecurrentnet (GRU,LSTM)ortemporal(1D)convolution. Asanalternative,intherasterframework,thestatesequenceistypically renderedasastackofbinarymaskimagesdepictingagentorientedboundingboxes,orrenderedinthesameimage, withthecorrespondingtimeinformationrenderedseparately[39]. Tomodelagentinteractions,onemustdealwithadynamicsetofneighboringagentsaroundeachmodeledagent. Thisistypicallydonebyaggregatingneighbormotionhistorywithapermutation-invariantsetoperator: poolingor softattention. Notably,Precog[41]jointlyrollsoutagentpoliciesinastep-wisesimulation. Rasterapproachesrely onconvolutionoverthe2Dspatialgridtoimplicitlycaptureinteractions;long-terminteractionsaredependentonthe networkreceptivefields. 3Agenttrajectorydecodingchoicesaresimilartochoicesforencodingmotionhistory,withtheexceptionofmethods thatdolookuponafixedorlearnedtrajectorydatabase[2,37]. Themostpopularoutputtrajectoryrepresentationisasequenceofstates(orstatedifferences).Afewworks[40,42] instead model Newton’s laws of motion in a discrete time-step aggregation capturing Verlet integration. Other works[13,44]explicitlymodelcontrolswhichparameterizeakinematically-feasiblemodelforvehiclesandbicycles. Withanyoftheserepresentations,thespacetimetrajectorycanbeintrinsicallyrepresentedasasequenceofsample points or a continuous polynomial representation [5]. In our experimental results, we explore the effectiveness of statesandkinematiccontrols,withandwithoutanunderlyingpolynomialbasis. Notablyuniqueare(1)HOME[22] andGOHOME[21]whichfirstpredictaheatmap,andthendecodetrajectoriesaftersampling,and(2)MP3[11]and NMP[52]whichlearnacostfunctionevaluatoroftrajectories,andthetrajectoriesareenumeratedheuristicallyrather thangeneratedbyalearnedmodel. Nearly all work assumes an independent, per-agent output space, in which agent interactions cannot be explicitly captured. A few works are notable in describing joint interactions as output, either in an asymmetric [28, 47] or symmetricway[18,35,41]. Thechoiceofoutputtrajectorydistributionhasramificationsondownstreamapplications. Anintrinsicpropertyof thedrivingsettingisthatavehicleorapedestriancanfollowoneofadiversesetofpossibletrajectories. Itisthus essentialtocapturethemultimodalnatureoftheproblem. GaussianMixtureModels(GMMs)areapopularchoicefor thispurposeduetotheircompactparameterizedform;modecollapseisaddressedthroughtrainingtricks[14,50]orthe useoftrajectoryanchors[45]. Otherapproachesmodeladiscretedistributionoverasetoftrajectories(learnedorfixed apriori)[2,14,32,54],orviaacollectionoftrajectorysamplesdrawnfromalatentdistributionanddecodedbythe model[1,30,33,40,41]. 3 ModelArchitecture Figure1depictstheproposedMultiPath++modelarchitecture,whichonahighlevelissimilartothatofMultiPath[45]; the model consists of an encoding step and a predictor head which conditions on anchors and outputs a Gaussian MixtureModel(GMM)[3]distributionforthepossibleagentpositionateachfuturetimestep. MultiPathusedacommon,top-downimagebasedrepresentationforallinputmodalities(e.g.,agents’trackedstate, roadnetworkinformation),andaCNNencoder. Incontrast,MultiPath++hasencodersprocessingeachinputmodality and converting it to a compact and sparse representation; the different modality encodings are later fused using a multi-contextgating(MCG)mechanism. Input Encoder Output Agent state Agent history history encoder Classification Trajectory head likelihoods MCG AV state Interaction Interaction predictor history encoder MCG encoder Regression Trajectory means head and covariances Roadgraph Learned anchor Roadgraph MCG encoder embeddings polylines Concat Predictor History MCG encoder Gaussian mixture model Concat Polyline encoder Figure1: MultiPath++ModelArchitecture. MCGdenotesMulti-ContextGating,describedinSection3. Blocksinred highlightportionsofthemodelwithlearnedparameters. DottedinputstotheMCGdenotescontextfeatures. Eachof theencoderMCGoutputsaggregatedembeddings(oneperagent)asshownbydottedarrows. Ontheotherhand,the predictorMCGoutputsoneembeddingpertrajectoryperagent 3.1 InputRepresentation MultiPath++makespredictionsbasedonthefollowinginputmodalities: • Agent state history: a state sequence describing the agent trajectory for a fixed number of past steps. In theWaymoOpenMotiondataset[18],thisstateinformationincludesposition,velocity,3Dboundingbox size,headingangleandobjecttype;forArgoverse[12]onlypositioninformationisprovided. Thestateis 4Figure 2: Left: Context gating (CG) block diagram. Right: 3 CG blocks stacked together, with running-average skip-connections(shownascomponentslabeled“µ”). SeeSection3.2fordetails. transformedtoanagent-centriccoordinatesystem,suchthatthemostrecentagentposeislocatedattheorigin andheadingeast. 4 • Roadnetwork: Roadnetworkelementssuchaslanelines,crosswalks,andstoplinesareoftenrepresentedas parametriccurveslikeclothoids[52],whichcanbesampledtoproducepointcollectionsthatareeasilystored inmulti-dimensionalarrayformat,asisdoneinmanypublicdatasets[12,18]. Wefurthersummarizethis informationbyapproximatingpointsequencesforeachroadelementasasetofpiecewiselinearsegments,or polylines,similarto[20,26,32]. • Agentinteractions: Foreachmodeledagent,weconsiderallneighboringagents. Foreachneighboringagent, weextractfeaturesinthemodeledagent’scoordinateframe,suchasrelativeorientation,distance,historyand speed. • AV-relativefeatures:Similartotheinteractionfeatures,weextractfeaturesoftheautonomousvehicle/sensing vehicle(AV)relativetoeachotheragent. WemodeltheAVseparatelyfromtheotheragents. Wehypothesize thisisahelpfuldistinctionforthemodelbecause: (a)TheAVisthecenterofsensors’fieldofview. Tracking errorsduetodistanceandocclusionarerelativetothiscenter. (b)ThebehavioroftheAVcanbeunlikethe otherroadusers,whichtoagoodapproximationcanbeassumedtoallbehumans. Detailsonhowthesefeaturesareencodedandfusedaredescribednext. Thesestepscomprisethe“Encoder”blockof Figure1,whoseoutputisanencodingperagent,ineachagent’scoordinateframe. 3.2 MultiContextGatingforfusingmodalities Inthissectionwefocusonhowtocombinethedifferentinputmodalityencodingsinaneffectiveway.Otherworksusea commonrasterizedformat[45,52],asimpleconcatenationofencodings[30,41,44],oremployattention[20,32,35,46]. Weproposeanefficientmechanismforfusinginformationwetermmulti-contextgating(MCG),anduseMCGblocks throughouttheMultiPath++architecture. Givenasetofelementss andaninputcontextvectorc,aCGblockassignsanoutputs(cid:48) toeachelementin 1:N 1:N the set, and computes an output context vector c(cid:48). The output does not depend on the ordering of input elements. Mathematically,letCG(·,·)bethefunctionimplementedbytheCGblock,andπbeanypermutationoperationona sequenceofnelements. ThefollowingequationsholdforCG: (s(cid:48) ,c(cid:48)) =CG(s ,c) 1:n 1:n (1) (s∗ ,c∗) =CG(π(s ),c) 1:n 1:n whichimplythatwehave c∗ =c(cid:48) (permutation-invariance) s∗ =π(s(cid:48) ) (permutation-equivariance). 1:n 1:n ThesizeofthesetncanvaryacrosscallstoCG(·,·). CG’ssetfunctionproperties—permutationinvariance/equivarianceandabilitytoprocessarbitrarilysizedsets—are naturally motivated by the need to encode a variable, unordered set of road network elements and agent relation- ships. A number of set functions have been proposed in the literature such as DeepSets [51], PointNet [38] and SetTransformers[29]. 4SincetheexplicitheadingismissinginArgoversedata,weusethelasttwotimestepstogetthecurrentorientation. 5Figure3: Acomparisonoftheelementrelationshipgraphforcross-attentionandCG.Incross-attention,eachelement s aggregatesinformationfromc . InCG,c summarizedwithasinglecontextvectorc. i 1:m 1:m AsingleCGblockisimplementedvia ˜s = MLP(s ) (2) i i c˜ = MLP(c) (3) s(cid:48) = ˜s (cid:12)c˜ (4) i i c(cid:48) = Pool(s(cid:48) ), (5) 1:n where (cid:12) denotes element-wise product and Pool is a permutation-invariant pooling layer such as max or average pooling. TheseoperationsareillustratedinFigure2. Intheabsenceofaninputcontext,wesimplysetc˜toanall-ones vector in the first context gating block. Note that both s(cid:48) and c(cid:48) depend on all inputs. It can be shown that c(cid:48) is i permutation-invariantw.r.ttheinputembeddings. Itcanalsobeshownthats(cid:48) arepermutation-equivariant. i WestackmultipleCGblocksbyincorporatingrunning-averageskip-connections,asisdoneresidualnetworks[25]: ¯sk = 1 (cid:80)k sj (6) 1:n k j=1 1:n ¯ck = 1 (cid:80)k cj (7) k j=1 (cid:0) sk+1,ck+1(cid:1) =CG(cid:0) ¯sk ,¯ck(cid:1) . (8) 1:n 1:n Wedenotesuchmulti-layerCGblocksasMCG (·,·)forastackofN CGblocks. N Comparisonwithattention. AttentionisapopularmechanismindomainssuchasNLP[48]andcomputervision[15, 17],inwhichtheencodingforeachelementofasetisupdatedviaacombinationofencodingsofallotherelements. Forasetofsizen,thisintrinsicallyrequiresO(n2)operations. Inmodelsofhumanbehaviorindrivingscenarios,self attentionhasbeenemployedtoupdateencodingsfor,e.g.,roadlanes,byattendingtoneighboringlanes,ortoupdate encodingsperagentbasedontheotheragentsinthescene. Crossattentionhasalsobeenusedtoconditiononeinput type(e.g.agentencodings)onanother(e.g.roadlanes)[20,32,35]. Withoutlossofgenerality,iftherearenagentsand mroadelements,thiscrossattentionscalesasO(nm)toaggregateroadinformationforeachagent. CGcanbeviewedasanapproximationtocross-attention. Ratherthaneachofnelementsattendingtoallmelements of the latter set, CG summarizes the latter set with the single context vector c, as shown in Figure 3. Thus the dimensionalityofcneedstobegreatenoughtocapturealltheusefulinformationcontainedintheoriginalmencodings. Ifthedimensionalityofelementsisd,andthedimensionalityofcisdc,thenifdc =md,CGcanbereducedtosome formofcross-attentionbysettingc=Concat({c }m ). Whendc M;whereM is fixedasarequirementforthetask(andisusedinbenchmarkmetricscalculations). LetΨdenotetheunionofthepredictionsfromallheads Ψ={(µ ,Σ ,q ),...,(µ ,Σ ,q )}, (12) 1 1 1 M(cid:48) M(cid:48) M(cid:48) whereM(cid:48) =L·E,andthemodelikelihoodsparedividedbythenumberofheadsE sothattheysumupto1. Then weposetheensemblecombinationtaskasoneofconvertingΨtoamorecompactGMMΨ¯ withM modes: Ψ¯ =(cid:0) (µ¯ ,Σ¯ ,q¯ ),...,(µ¯ ,Σ¯ ,q¯ )(cid:1) , (13) 1 1 1 M M M whilerequiringthatΨ¯ bestapproximatesΨ. Inthissectionwedescribetheaggregationalgorithmweuse. Theoretical motivationsandderivationcanbefoundinAppendixA. WefindfitΨ¯ toΨusinganiterativeclusteringalgorithm,likeExpectation-Maximization[3],butwithhardassignment ofclustermembership. Thissettinglendsitselftoefficientimplementationinacomputegraph,andallowsustotrain thisstepend-to-endasafinallayerinourdeepnetwork. WestartbyselectingM clustercentroidsfromµ inagreedyfashion. Theselectioncriteriaistomaximizethe 1:M(cid:48) probabilitythatacentroidsampledfromΨlieswithinτ distancefromatleastoneselectedcentroid: M(cid:48) (cid:88) µ¯ =argmax q max I((cid:107)µ −µ¯(cid:107) ≤τ) (14) 1:M i i 2 µ¯1:M i=1 µ¯∈µ¯1:M Thisisacriterionthatexplicitlyoptimizestrajectorydiversity,whichisafitformetricssuchasmissrate,mAPand minADE,asdefinedin [12,18]. Othercriteriacouldalsobeuseddependingonthemetricofinterest. Itisinteresting torelatethiscriteriatotheensemblingandsamplingmethodemployedbyGOHOME[21]. Inthatwork,theyoutputan intermediatespatialheatmaprepresentation,whichisamenabletoensembleaggregation. Thentheygreedilysample end-pointsinasimilarfashion. Sincejointlyoptimizingµ¯ ishard,weselecteachµ greedilyfori=1,...,M accordingto 1:M i M(cid:48) (cid:88) µ¯ =argmax q max I((cid:107)µ −µ¯(cid:107) ≤τ) (15) i i i 2 µ¯i i=1 µ¯∈µ¯1:i 9whichdiffersinthattheouterargmaxisdoneiterativelyoverµ¯ ratherthanjointlyµ¯ . i 1:M Startingwiththeselectedcentroids,WeiterativelyupdatetheparametersofΨ¯ usinganexpectation-maximization- style[16]algorithm,whereeachiterationconsistsofthefollowingupdates M(cid:48) q¯ ←(cid:88) q p(h|µ ;Ψ¯) (16) h i i i=1 M(cid:48) µ¯ ← 1 (cid:88) q p(h|µ ;Ψ¯)x (17) h q¯ i i h i=1 M(cid:48) Σ¯ ← 1 (cid:88) q p(h|µ ;Ψ¯)(cid:2) Σ +(µ −µ¯ )(µ −µ¯ )T(cid:3) , (18) h q¯ i i i i h i h h i=1 wherep(h|x;Ψ¯)istheposteriorprobabilitythatagivensamplexissampledfromthehthcomponentofthemixture modelspecifiedbyΨ¯,whichcanbecomputedas q¯ N(x−µ¯ ,Σ¯ ) p(h|x;Ψ¯)= h h h (19) (cid:80)M q¯ N(x−µ¯ ,Σ¯ ) k=1 k k k 5 Experiments 5.1 Datasets TheWaymoOpenMotionDataset(WOMD)[18]consistsof1.1Mexamplestime-windowedfrom103K20sscenarios. Thedatasetisderivedfromreal-worlddrivinginurbanandsuburbanenvironments. Eachexamplefortrainingand inferenceconsistsof1secondofhistorystateand8secondsoffuture,whichweresampleat5Hz. Theobject-agent state contains attributes such as position, agent dimensions, velocity and acceleration vectors, orientation, angular velocity,andturnsignalstate. Thelong(8s)timehorizoninthisdatasetteststhemodel’sabilitytocapturealargefield ofviewandscaletoanoutputspaceoftrajectories,whichintheorygrowsexponentiallywithtime. TheArgoversedataset[12]consistsof333Kscenarioscontainingtrajectoryhistories,contextagents,andlanecenterline inputsformotionprediction. Thetrajectoriesaresampledat10Hz,with2secondsofpasthistoryanda3-secondfuture predictionhorizon. 5.2 Metrics Wecomparemodelsusingcompetitionspecificmetricsassociatedwithvariousdatasets5, Specifically,wereportthefollowingmetrics. minDEt (Minimum Distance Error): The minimum distance, over the top k most-likely trajectories, between a k predictedtrajectoryandthegroundtruthtrajectoryattimet. minADE (MinimumAverageDistanceError): SimilartominDEt,butthedistanceiscalculatedasanaverageover k k alltimesteps. MRt@d(MissRate): MeasurestherateatwhichminFDEt exceedsdmeters. NotethatWOMDleaderboardusesa k k differentdefinition[18]. mAP:Foreachsetofpredictedtrajectories,wehaveatmostonepositive-theoneclosesttothegroundtruthand whichiswithinτ distancefromthegroundtruth. Theotherpredictedtrajectoriesarereportedasmisses. Fromthis,we cancomputeprecisionandrecallatvariousthresholds. FollowingWOMDmetricsdefinition[18]theagentsfuture trajectoriesarepartitionedintobehaviorbuckets,andanareaundertheprecision-recallcurveiscomputedusingthe possibletruepositiveandfalsepositivesperagent,givingusAveragePrecisionperbehaviorbucket. ThetotalmAP valueisameanovertheAP’sforeachbehaviorbucket. Overlap rate: The fraction of times the most likely trajectory prediction of any agent overlaps with a real future trajectoryofanotheragent(see[18]fordetails). 5Foreachdataset,wereporttheresultsofourmodelagainstpublishedresultsofpubliclyavailablemodels. 10Argoverseleaderboard(k =6,d=2m,t=3s) Rank† brier-minDE minFDE MR minADE LaneGCN[32] 50 2.059 1.364 0.163 0.868 DenseTNT[23] 23 1.976 1.282 0.126 0.882 HOME+GOHOME[21] 10 1.860 1.292 0.085 0.890 TPCN++[49] 5 1.796 1.168 0.116 0.780 MultiPath++(ours) 4 1.793 1.214 0.132 0.790 QCraftBlueTeam 1 1.757 1.214 0.114 0.801 Table 2: Comparison with select state-of-the-art methods on Argoverse leaderboard. k is the num- ber of trajectories. d is the maximum distance for no miss, and t is the trajectory time duration. † Rank on the public leaderboard https://eval.ai/web/challenges/challenge-page/454/leaderboard/ 1279asofNovember12,2021,whichissortedbybrier-minDE. TRI: (Turning Radius Infeasibility) We compute the turning radius along the predicted trajectories using two approaches: onethatusesthepredictedyawoutputfromthemodel(TRI-h),andtheotherthatdoesn’trequireyaw predictionsandinsteadusesthecircumradiusconstitutingthreeconsecutivewaypoints(TRI-c). Iftheradiusisless thanacertainthresholdτ,itistreatedasaviolation. Wesetthisthresholdastheapproximateminimumturningradius thresholdforamidsizesedan,τ = 3.5m. Notethatamodelthatsimplypredictsaconstantheadingcanachievea TRI-hrateofzero,hencewealsocomputeinconsistenciesbetweenturningradiussuggestedbythecoordinatesandthe predictedheadings(TRI-hc). TRI-hcinconsistencyistruewhenthedifferenceinheadingbasedoncircumradiusfrom waypointsandpredictedheadingsisgreaterthan0.05radiansatanytimestepinatrajectory. 5.3 MultiPathbaseline AsourworkevolvedfromMultiPath,weincludeareferenceMultiPathmodelwheretheinputandbackbonearefaithful totheoriginalpaper[45]forapointofcomparison,withafewminordifferences. Specifically,weuseatop-down renderingofthesceneasbefore,butnowemployasplatrendering[55]approachforrasterization,inwhichwesample pointsuniformlyfromsceneelementsanddoanorthographicprojection. Thisisasimpler,sparseformofrendering, whichdoesn’temployanti-aliasing,butisefficientandstraightforwardtoimplementinTensorFlowandrunaspartof themodelcomputegraphonhardwareaccelerators(GPU/TPU). Asintheoriginalpaper,weuseagridof400×400cells,withgridcellphysicaldimensionof0.2m×0.2m,thusa totalfield-of-viewof80mcenteredaroundtheAVsensingvehicleinWOMD,withaResNet18backbone[25]. Weuse 128staticanchorsobtainedviak-means,whicharesharedamongallagenttypes(vehicles,pedestrians,cyslists)for simplicity. Figure10illustratesthismodel’sinputsandarchitecture. 5.4 Externalbenchmarkresults OnArgoverse,MultiPath++achievestop-5performanceonmostmetrics(Table2). Ourtechniqueisranked1stonall metricsonWaymoOpenMotionDataset[18](Table3). ThetestedmodelisbasedonthebestconfigurationinTable4,wheretheoutputsfrommultipleensembleheadsare aggregatedasdescribedinSection4. OnWOMD,wealsoseethattheoriginalMultiPathmodel,evenwiththerefinementoflearnedanchorsandensembling, is outperformed by more recent methods. It is interesting to note that MultiPath is the best performing top-down scene-centricmodelemployingaCNN;everyknownmethodwhichoutranksitusessparserepresentations. 5.5 QualitativeExamples Figure4showsexamplesofMultipath++onWOMDscenes. Figure5showsexamplesofMultipath++onArgoverse scenes. TheseexamplesshowtheabilityofMultiPath++tohandledifferentroadlayoutsandagentinteractions. 5.6 AblationStudy Inthissectionweevaluateourdesignchoicesthroughanablationstudy. Table4summarizesablationresults. Inthe followingsubsectionswediscusshowourarchitecturechoicesaffectthemodelperformance. 11(a) (b) (c) (d) (e) (f) (g) Figure4: ExamplesofMultiPath++predictionsfor8secondsinWOMDscenes. Hueindicatestimehorizonwhile transparencyindicatespredictedprobability. Rectanglesindicatevehicleswhilesmallsquaresindicatepedestrians. (a): Afour-wayintersectioninvolvingmultipleinteractions. Forexample,carAispredictedtoyieldforcarB.(b&c): Narrowroadinteraction. CarAispredictedtoyieldforcarBandthennudgearoundtheparkingcar,designatedbythe arrow. (d&e): InteractionbetweentwovehiclesatanintersectionwherecanAispredictedtoyieldforcarBormakea right-turnbehind. AftercarBpasses,carAcanmakealeftturn. Also,notethebimodalpredictionofthepedestrian thatislocatedatthecorner. (fandg): Predictionsinaparkinglotandatypicalroadgraph. 12WaymoOpenMotionPrediction(k =6,t=8s) Rank† minDE minADE MR Overlap mAP MultiPath[45] 11 2.04 0.880 0.345 0.166 0.409 SceneTransformer[35] 7 1.212 0.612 0.156 0.147 0.279 DenseTNT[23] 5 1.551 1.039 0.157 0.178 0.328 MultiPath++ 1 1.158 0.556 0.134 0.131 0.409 Table 3: Comparison with published state-of-the-art methods on WOMD public leader- board. k is the number of trajectories and t is the trajectory time horizon. † Rank on the public leaderboard https://waymo.com/open/challenges/2021/motion-prediction/ as ofNovember12,2021,whichissortedbymAP. (a) (b) (c) (d) (e) (f) Figure 5: Examples of MultiPath++ predictions for 8 seconds in Argoverse scenes, showing the ability to follow differentlanegeometries. 135.6.1 SetFunctions RecallthatMultiPath++usestwotypesofsetfunctions. Invariantsetfunctionsareusedtoencodeasetofelements(e.g. agents,roadgraphsegments)intoasinglefeaturevector. Equivariantsetfunctionsareusedtoconvertthesetoflearned anchors,togetherwiththeencodedfeaturevectorasacontext,intoacorrespondingsetoftrajectorieswithlikelihoods. Weusemulti-contextgatingtorepresentbothtypesoffunctions. Weexperimentedwithotherrepresentationsofset functions: • MLP+MaxPool: Inthisexperiment,wereplacethemulti-contextgating(MCG)roadnetworkencoderwitha MLP+MaxPoolappliedonpointsratherthanpolylines,inspiredbyPointNet[38]. Weusea5layerdeepMLP andRELUactivations. • EquivariantDeepSet[51]: Theequivariantsetfunctionisrepresentedasaseriesofblocks,eachinvolvingan element-wisetransformationfollowedbypoolingtocomputethecontext. UnlikeMCG,itdoesnotusegating (pointwisemultiplication)betweensetelementsandthecontextvector. Instead,alineartransformationofthe contextisaddedtoeachelement. WeuseaDeepSetof5blocksinthepredictor. • Transformers[29]: Wereplacethegatingmechanism(element-wisemultiplication)onpolylineswithself- attention. Fordecoding,weusedcrossattentionwherethequeriesarethelearnedembeddingsandthekeys arethevariousencoderfeatures. 5.6.2 Trajectoryrepresentation AsmentionedinSection3.6,weexperimentwithpredictingpolynomialcoefficientsforthetrajectory,aswellpredicting kinematic control signals (acceleration and heading change rate). We found that polynomial representations hurt performance,countertoconclusionsmadeinPLOP[4],wheretheydemonstratedimprovementsoverthethenstateof theartonPRECOG[43]andnuScenes[7]usingpolynomialstorepresentoutputtrajectories. Furthermore,inthePLOP datasets,weneedtopredict4sintothefuturewhichismuchshorterthanourpredictionhorizonof10s. Forsuchshort futures,polynomialrepresentationsaremoresuitable. Inourcase,wedonotseemuchgainsfromusingthepolynomial representation,possiblyduetothelargerdatasetsizeandlonger-termpredictionhorizon. Thecontrols-basedoutputworksbetterindistancemetricsthanapolynomialrepresentations, whichsuggestsitis amorebeneficialanddomain-specificformofinductivebias. Overall,ourresultssuggestthatthesimplesequence of raw coordinates trajectory representation works best for distance-based metrics. However, these unconstrained representationshaveanon-trivialrateofkinematicinfeasibility(TRI-xmetricsinTable4). Kinematicfeasibilityand consistencybetweenheadingsandpositionsiscrucialinpracticewhensuchbehaviormodelsareusedforplanningand controlsofareal-worldrobot,anissuethatisnotcapturedbypublicbenchmarkmetrics. 5.6.3 Ensembling Weexploreensembling,producinganover-completesetoftrajectoriesthatisthensummarizedusingtheaggregation proposedinSection4,aswellastheircombination. ThenumberofensemblesisdenotedbyE andthenumberof trajecctoriesperensembleisdenotedbyL. FinallyweaggregatetheE·LtrajectoriestoM =6whichistherequired numberoftrajectoriesfortheWOMDsubmission. 5.6.4 Anchorrepresentation Weexplorelearnedandkmeansbasedanchorrepresentation. 5.7 Discussion First,weremarkthatMultiPath++isasignificantimprovementoveritspredecessorMultiPath,asseeninTables3 and4. Asdiscussedinthispaper,theydifferinmanydesigndimensions,theprimarybeingthechangefromadense top-downrasterrepresentationtoasparse,element-basedrepresentationwithagent-centriccoordinatesystems. Other designchoicesarevalidatedinisolationinthefollowingdiscussion. WefindthatMLP+MaxPoolperformstheworstamongallsetfunctionvariantsasexpectedduetolimitedcapacity. DeepSetisabletooutperformMLP+MaxPool. AlsoincreasingthedepthoftheMCGgivesconsistentlybetterresults owingtoeffectiveincreaseincapacityandflowofinformationacrossskipconnections. Wegetthebestperformanceby increasingthedepthoftheMCGto5layers. Wefindthatlearninganchors(“Learnedanchors”)ismoreeffectivethanusingasetofanchorsobtainedapriorivia k-means.ThisrunscountertotheoriginalfindingsintheMultiPathpaper[45]thatanchor-freemodelssufferfrommode 14minDE minADE MR AUC TRI-h(%) TRI-c(%) TRI-hc(%) OriginalMultiPath 4.752 1.796 0.749 – – – – SetFunction MLP+MaxPool 2.693 1.107 0.528 0.367 – – – DeepSet 2.562 1.055 0.5 0.368 – – – Transformer 2.479 1.042 0.479 0.3687 – – – 1MCGblock 2.764 1.15 0.55 0.312 – – – 5stackedMCGblocks(cid:63) 2.305 0.978 0.44 0.393 – – – TrajectoryRepresentation Polynomial 2.537 1.041 0.501 0.368 n/a 1.92 n/a Control 2.319 0.987 0.449 0.386 0.00 1.22 0.00 Rawcoordinates 2.305 0.978 0.44 0.393 n/a 1.08 n/a Rawcoordinatesw/heading(cid:63) 2.311 0.978 0.443 0.395 4.10 1.04 9.92 Ensembling E =1,L=6 2.333 0.982 0.410 0.240 – – – E =5,L=6 2.18 0.948 0.395 0.297 – – – E =1,L=64 2.487 1.057 0.473 0.367 – – – E =5,L=64(cid:63) 2.305 0.978 0.44 0.393 – – – Anchors Statick-meansanchors 2.99 1.22 0.578 0.324 – – – Learnedanchors(cid:63) 2.305 0.978 0.44 0.393 – – – Table4: ModelablationonWOMDvalidatationset. Metricsareparameterizedbyk =6,t=8s,andd=2m. (cid:63)denotesthereferenceconfiguration: roadencoding,statehistoryencodingandinteractionencodingasdescribedin Section3. “n/a”denotesamodelthatdoesnotpredictheading. Figure6: Qualitativeresultsshowingtheeffectofcontrol-basedtrajectoryrepresentation.Toprow:MultiPath++withstate-based trajectories.Bottomrow:MultiPath++withcontrol-basedtrajectories. collapse. Thedifferencecouldpossiblybeduetothericherandmorestructuredinputs,improvedmodelarchitecture, andlargerbatchsizesinMultiPath++. Weleavemoredetailedablationsonthisissuebetweenthetwoapproaches tofuturework. tWecomparethebaselineofdirectlyoutputtingasingleheadwith6trajectories(E = 1,L = 6), to training 5 ensemble heads (E = 5,L = 6). We see that ensembling significantly improves most metrics, and particularly minDE, for which this combination is best. We also train a model with a single head that outputs 64 trajectories,followedbyouraggregationmethodthatreducesthemto6(E = 1,L = 64). Comparedtoourinitial baseline,thismodelsignificantlyimprovesMRandAUC thatrequirediversepredictions,butregressestheaverage trajectory distance metrics minDE, and even minADE a little bit. This suggests that the different metrics pose differentsolutionrequirements. Asexpected,ouraggregationcriterioniswellsuitedtopreservingdiversity,while straight-upensemblingisbetteratcapturingtheaveragedistribution. Finally,ourexperiment(E =5,L=64)with moreensembleheadsandmorepredictionsperensemblecombinesthestrengthsofbothtechniques,obtainingastrictly superiorperformanceinallmetricscomparedtothebaseline. 15Figure7: Exampleofimproveddiversityduetoaggregation. Figure8: Illustrationofweedingoutunrealistictrajectoriesafteraggregation. Figure9: Illustrationofimprovedlanediversity: Themodelisabletopredicttheagentgoingtomultiplelanesafter aggregation. 165.8 Conclusion Weproposedanovelbehaviorpredictionsystem,MultiPath++,bycarefullyconsideringchoicesforinputrepresentation andencoding, fusingencodings, andrepresentingtheoutputdistribution. Wedemonstratedstate-of-the-artresults onpopularbenchmarksforbehaviorprediction. Furthermore,wesurveyedexistingmethods,analyzedourapproach empirically,andprovidedpracticalinsightsfortheresearchcommunity. Inparticular,weshowedtheimportanceof sparseencoding,efficientfusionmethods,control-basedmethods,andlearnedanchors. Finally,weprovidedapractical guideforvarioustricksusedfortrainingandinferencetoimproverobustness,increasediversity,handlemissingdata, andensurefastconvergenceduringtraining. References [1] Alexandre"Alahi,KratarthGoel,VigneshRamanathan,AlexandreRobicquet,LiFei-Fei,andSilvioSavarese. SocialLSTM: HumanTrajectoryPredictioninCrowdedSpaces. InCVPR,2016. [2] YuriyBiktairov,MaximStebelev,IrinaRudenko,OlehShliazhko,andBorisYangel. Prank: motionpredictionbasedon ranking. arXivpreprintarXiv:2010.12007,2020. [3] ChristopherMBishop. Patternrecognitionandmachinelearning. springer,2006. [4] ThibaultBuhet,É.Wirbel,andXavierPerrotton. Plop:Probabilisticpolynomialobjectstrajectoryplanningforautonomous driving. ArXiv,abs/2003.08744,2020. [5] ThibaultBuhet,EmilieWirbel,andXavierPerrotton.Plop:Probabilisticpolynomialobjectstrajectoryplanningforautonomous driving. arXivpreprintarXiv:2003.08744,2020. [6] HolgerCaesar,VarunBankiti,AlexH.Lang,SourabhVora,VeniceErinLiong,QiangXu,AnushKrishnan,YuPan,Giancarlo Baldan,andOscarBeijbom. nuscenes:Amultimodaldatasetforautonomousdriving. arXivpreprintarXiv:1903.11027,2019. [7] HolgerCaesar,VarunBankiti,AlexH.Lang,SourabhVora,VeniceErinLiong,QiangXu,AnushKrishnan,YuPan,Giancarlo Baldan,andOscarBeijbom. nuscenes:Amultimodaldatasetforautonomousdriving,2020. [8] NicolasCarion,FranciscoMassa,GabrielSynnaeve,NicolasUsunier,AlexanderKirillov,andSergeyZagoruyko. End-to-end objectdetectionwithtransformers.InAndreaVedaldi,HorstBischof,ThomasBrox,andJan-MichaelFrahm,editors,Computer Vision–ECCV2020,pages213–229,Cham,2020.SpringerInternationalPublishing. [9] SergioCasas,ColeGulino,RenjieLiao,andRaquelUrtasun. Spagnn:Spatially-awaregraphneuralnetworksforrelational behaviorforecastingfromsensordata. InIEEEIntl.Conf.onRoboticsandAutomation.IEEE,2020. [10] SergioCasas,WenjieLuo,andRaquelUrtasun. Intentnet: Learningtopredictintentionfromrawsensordata. InConf.on RobotLearning,2018. [11] SergioCasas,AbbasSadat,andRaquelUrtasun. Mp3:Aunifiedmodeltomap,perceive,predictandplan. InProceedingsof theIEEE/CVFConferenceonComputerVisionandPatternRecognition,pages14403–14412,2021. [12] Ming-FangChang,JohnLambert,PatsornSangkloy,JagjeetSingh,SlawomirBak,AndrewHartnett,DeWang,PeterCarr, SimonLucey,DevaRamanan,etal. Argoverse:3dtrackingandforecastingwithrichmaps. InCVPR,2019. [13] HenggangCui,ThiNguyen,Fang-ChiehChou,Tsung-HanLin,JeffSchneider,DavidBradley,andNemanjaDjuric. Deep kinematicmodelsforkinematicallyfeasiblevehicletrajectorypredictions. InIEEEIntl.Conf.onRoboticsandAutomation, pages10563–10569.IEEE,2020. [14] HenggangCui,VladanRadosavljevic,Fang-ChiehChou,Tsung-HanLin,ThiNguyen,Tzu-KuoHuang,JeffSchneider,and NemanjaDjuric. Multimodaltrajectorypredictionsforautonomousdrivingusingdeepconvolutionalnetworks. InIEEEIntl. Conf.onRoboticsandAutomation,2019. [15] ZihangDai,HanxiaoLiu,QuocVLe,andMingxingTan. Coatnet:Marryingconvolutionandattentionforalldatasizes. arXiv preprintarXiv:2106.04803,2021. [16] A.P.Dempster,N.M.Laird,andD.B.Rubin. Maximumlikelihoodfromincompletedataviatheemalgorithm. Journalofthe RoyalStatisticalSociety.SeriesB(Methodological),39(1):1–38,1977. [17] AlexeyDosovitskiy,LucasBeyer,AlexanderKolesnikov,DirkWeissenborn,XiaohuaZhai,ThomasUnterthiner,Mostafa Dehghani,MatthiasMinderer,GeorgHeigold,SylvainGelly,etal. Animageisworth16x16words:Transformersforimage recognitionatscale. InInternationalConferenceonLearningRepresentations,2020. [18] ScottEttinger,ShuyangCheng,BenjaminCaine,ChenxiLiu,HangZhao,SabeekPradhan,YuningChai,BenSapp,CharlesQi, YinZhou,ZoeyYang,AurelienChouard,PeiSun,JiquanNgiam,VijayVasudevan,AlexanderMcCauley,JonathonShlens, andDragomirAnguelov. Largescaleinteractivemotionforecastingforautonomousdriving:Thewaymoopenmotiondataset, 2021. [19] JeromeHFriedman. Theelementsofstatisticallearning:Datamining,inference,andprediction. springeropen,2017. [20] JiyangGao,ChenSun,HangZhao,YiShen,DragomirAnguelov,CongcongLi,andCordeliaSchmid. VectorNet:Encoding hdmapsandagentdynamicsfromvectorizedrepresentation. InCVPR,2020. 17[21] ThomasGilles,StefanoSabatini,DzmitryTsishkou,BogdanStanciulescu,andFabienMoutarde. Gohome:Graph-oriented heatmapoutputforfuturemotionestimation. arXivpreprintarXiv:2109.01827,2021. [22] ThomasGilles,StefanoSabatini,DzmitryTsishkou,BogdanStanciulescu,andFabienMoutarde. HOME:heatmapoutputfor futuremotionestimation. CoRR,abs/2105.10968,2021. [23] JunruGu,QiaoSun,andHangZhao. Densetnt:Waymoopendatasetmotionpredictionchallenge1stplacesolution. CoRR, abs/2106.14160,2021. [24] AgrimGupta,JustinJohnson,LiFei-Fei,SilvioSavarese,andAlexandreAlahi. SocialGAN:Sociallyacceptabletrajectories withgenerativeadversarialnetworks. InCVPR,2018. [25] KaimingHe,XiangyuZhang,ShaoqingRen,andJianSun. Deepresiduallearningforimagerecognition. InCVPR,2016. [26] NamdarHomayounfar,Wei-ChiuMa,ShrinidhiKowshikaLakshmikanth,andRaquelUrtasun. Hierarchicalrecurrentattention networksforstructuredonlinemaps. InProceedingsoftheIEEEConferenceonComputerVisionandPatternRecognition, pages3417–3426,2018. [27] JoeyHong,BenjaminSapp,andJamesPhilbin. Rulesoftheroad:Predictingdrivingbehaviorwithaconvolutionalmodelof semanticinteractions. InCVPR,2019. [28] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, and Deva Ramanan. What-if motion prediction for autonomousdriving. ArXiv,2020. [29] JuhoLee,YoonhoLee,JungtaekKim,AdamR.Kosiorek,SeungjinChoi,andYeeWhyeTeh.Settransformer:Aframeworkfor attention-basedpermutation-invariantneuralnetworks. InKamalikaChaudhuriandRuslanSalakhutdinov,editors,Proceedings ofthe36thInternationalConferenceonMachineLearning, ICML2019, 9-15June2019, LongBeach, California, USA, volume97ofProceedingsofMachineLearningResearch,pages3744–3753.PMLR,2019. [30] NamhoonLee,WongunChoi,PaulVernaza,ChristopherBChoy,PhilipHSTorr,andManmohanChandraker. DESIRE: Distantfuturepredictionindynamicsceneswithinteractingagents. InCVPR,2017. [31] JunweiLiang,LuJiang,KevinMurphy,TingYu,andAlexanderHauptmann.Thegardenofforkingpaths:Towardsmulti-future trajectoryprediction. InCVPR,2020. [32] MingLiang,BinYang,RuiHu,YunChen,RenjieLiao,SongFeng,andRaquelUrtasun. Learninglanegraphrepresentations formotionforecasting. arXivpreprintarXiv:2007.13732,2020. [33] FrancescoMarchetti,FedericoBecattini,LorenzoSeidenari,andAlbertoDelBimbo. Mantra:Memoryaugmentednetworks formultipletrajectoryprediction. InProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecognition, pages7143–7152,2020. [34] JeanMercat,ThomasGilles,NicoleZoghby,GuillaumeSandou,DominiqueBeauvois,andGuillermoGil.Multi-headattention forjointmulti-modalvehiclemotionforecasting. InIEEEIntl.Conf.onRoboticsandAutomation,2020. [35] JiquanNgiam,BenjaminCaine,VijayVasudevan,ZhengdongZhang,Hao-TienLewisChiang,JeffreyLing,RebeccaRoelofs, AlexBewley,ChenxiLiu,AshishVenugopal,DavidWeiss,BenjaminSapp,ZhifengChen,andJonathonShlens. Scene transformer:Aunifiedmulti-taskmodelforbehaviorpredictionandplanning. CoRR,abs/2106.08417,2021. [36] SeongHyeonPark,GyubokLee,ManojBhat,JiminSeo,MinseokKang,JonathanFrancis,AshwinRJadhav,PaulPuLiang, andLouis-PhilippeMorency. Diverseandadmissibletrajectoryforecastingthroughmultimodalcontextunderstanding. arXiv preprintarXiv:2003.03212,2020. [37] TungPhan-Minh,ElenaCorinaGrigore,FreddyABoulton,OscarBeijbom,andEricMWolff.CoverNet:Multimodalbehavior predictionusingtrajectorysets. arXiv:1911.10298,2019. [38] CharlesRQi,HaoSu,KaichunMo,andLeonidasJGuibas. Pointnet:Deeplearningonpointsetsfor3dclassificationand segmentation. InCVPR,2017. [39] KhaledS.Refaat,KaiDing,NataliaPonomareva,andStéphaneRoss. Agentprioritizationforautonomousnavigation. 2019 IEEE/RSJInternationalConferenceonIntelligentRobotsandSystems(IROS),pages2060–2067,2019. [40] NicholasRhinehart,KrisKitani,andPaulVernaza.R2P2:Areparameterizedpushforwardpolicyfordiverse,precisegenerative pathforecasting. InECCV,2018. [41] NicholasRhinehart,RowanMcAllister,KrisKitani,andSergeyLevine. PRECOG:Predictionconditionedongoalsinvisual multi-agentsettings. InIntl.Conf.onComputerVision,2019. [42] NicholasRhinehart,RowanMcAllister,KrisKitani,andSergeyLevine. Precog:Predictionconditionedongoalsinvisual multi-agentsettings. InECCV,2019. [43] NicholasRhinehart,RowanMcAllister,KrisM.Kitani,andSergeyLevine. PRECOG:predictionconditionedongoalsin visualmulti-agentsettings. CoRR,abs/1905.01296,2019. [44] TimSalzmann,BorisIvanovic,PunarjayChakravarty,andMarcoPavone. Trajectron++:Multi-agentgenerativetrajectory forecastingwithheterogeneousdataforcontrol. arXivpreprintarXiv:2001.03093,2020. [45] BenjaminSapp,YuningChai,MayankBansal,andDragomirAnguelov. MultiPATH:Multipleprobabilisticanchortrajectory hypothesesforbehaviorprediction. InConf.onRobotLearning,2019. 18[46] YichuanCharlieTangandRuslanSalakhutdinov. Multiplefuturesprediction. InProceedingsofthe33rdInternational ConferenceonNeuralInformationProcessingSystems,RedHook,NY,USA,2019.CurranAssociatesInc. [47] EkaterinaI.Tolstaya,RezaMahjourian,CarltonDowney,BalakrishnanVaradarajan,BenjaminSapp,andDragomirAnguelov. Identifying driver interactions via conditional behavior prediction. In IEEE International Conference on Robotics and Automation,ICRA2021,Xi’an,China,May30-June5,2021,pages3473–3479.IEEE,2021. [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attentionisallyouneed. InNeurIPS,2017. [49] MaoshengYe,TongyiCao,andQifengChen. Tpcn:Temporalpointcloudnetworksformotionforecasting. InProceedingsof theIEEE/CVFConferenceonComputerVisionandPatternRecognition,pages11318–11327,2021. [50] YeYuanandKrisKitani. Diversetrajectoryforecastingwithdeterminantalpointprocesses. ICLR,2020. [51] ManzilZaheer,SatwikKottur,SiamakRavanbakhsh,BarnabasPoczos,RussRSalakhutdinov,andAlexanderJSmola. Deep sets. InI.Guyon,U.V.Luxburg,S.Bengio,H.Wallach,R.Fergus,S.Vishwanathan,andR.Garnett,editors,Advancesin NeuralInformationProcessingSystems,volume30.CurranAssociates,Inc.,2017. [52] WenyuanZeng,WenjieLuo,SimonSuo,AbbasSadat,BinYang,SergioCasas,andRaquelUrtasun. End-to-endinterpretable neuralmotionplanner. InCVPR,2019. [53] WeiZhan,LitingSun,DiWang,HaojieShi,AubreyClausse,MaximilianNaumann,JuliusKümmerle,HendrikKönigshof, ChristophStiller,ArnauddeLaFortelle,andMasayoshiTomizuka. INTERACTIONDataset:AnINTERnational,Adversarial andCooperativemoTIONDatasetinInteractiveDrivingScenarioswithSemanticMaps. arXiv:1910.03088,2019. [54] HangZhao,JiyangGao,TianLan,ChenSun,BenjaminSapp,BalakrishnanVaradarajan,YueShen,YiShen,YuningChai, CordeliaSchmid,etal. Tnt:Target-driventrajectoryprediction. arXivpreprintarXiv:2008.08294,2020. [55] MatthiasZwicker,HanspeterPfister,JeroenVanBaar,andMarkusGross. Surfacesplatting. InProceedingsofthe28thannual conferenceonComputergraphicsandinteractivetechniques,pages371–378,2001. A DetailsandDerivationofAggregationAlgorithm Byhavinganovercompletetrajectoryrepresentationthatislateraggregatedintoafixedsmallnumberoftrajectories, weattempttoaddresstwokindsofuncertaintiesinthedata: • Aleatoricuncertainty: Thisisanaturalvariationinthedata. Forexampleanagentcaneithertakealeftorright turnorchangelanes,etcgiventhesamecontextinformation. Thislevelofambiguitycannotberesolvedby increasingthemodelcapacity,butratherthemodelneedstopredictcalibratedprobabilitiesfortheseoutcomes. Despitethetheoreticalpossibilityofmodelingthesevariationsusingasmallnumberofoutputtrajectories directly,thereareseveralchallengesinlearning. Someexamplesincludemodecollapseandfailuretomodel thesevariationsduetolimitedmodelcapacity. Trainingthemodeltoproduceanovercompleterepresentation forces the model to output a diverse distribution of trajectories and could make it more resistant to mode collapse. Followingthisupwithgreedyiterativetrajectoryaggregationenhancesdiversityinthefinaloutput. • Epistemicuncertainty: Thisisthevariationacrossmodeloutputs,whichtypicallyindicatesthemodel’sfailure tocapturecertainaspectsofthesceneorinputfeatures. Suchvariationscouldoccurifsomemodelsarepoorly trainedorhaven’tseenaparticularsliceofthedata. Bydoingmodelensembling,weattempttoreducethis uncertainty. Foreaseofexposition,weassumeeachtotrajectorytobecomposedofasingletimepoint;thesamecomputationsare appliedtoeachtimestepinafuturesequence. TheoutputΨisaGaussianmixturemodel(GMM)distributionwithM(cid:48) modesonthefutureposition: M(cid:48) (cid:88) p(x,y;Ψ)= q N(x,y;µ ,Σ ) (20) j j j j=1 WeformulatetheaggregationasobtaininganM-modeGMMΨ¯ whichminimizestheKL-divergenceD (Ψ||Ψ¯). KL This is equivalent to maximizing the expected log likelihood p(x,y;Ψ¯) of a sample point (x,y) drawn from the overcompletedistributionΨ: (cid:90) f(Ψ¯)= p(x;Ψ)logp(x;Ψ¯)dx (21) x Assumingtheovercompletedistributionapproximatestherealdistribution, thisisroughlyequivalenttofittingthe compactdistributiontotherealdata,butwiththeaddedbenefitsdescribedabove.Directlymaximizing(21)isintractable. 19HenceweattempttoemployanExpectation-Maximization-likealgorithmtoobtainalocalmaximum. Thedifference intheobjectivefunctionbetweenanoldandnewvaluemaybewrittenas (cid:90) (cid:20) p(x;Ψ¯(cid:48))(cid:21) f(Ψ¯(cid:48))−f(Ψ¯)= p(x;Ψ)log dx (22) p(x;Ψ¯) x Denotingthehiddenvariablehtobeamixtureinthecompactrepresentation,wemaywrite: log(cid:104) p(x;Ψ¯(cid:48))(cid:105) (23) p(x;Ψ¯) = (cid:80) p(h|x;Ψ¯)log(cid:104) p(x;Ψ¯(cid:48))(cid:105) (24) h p(x;Ψ¯) = (cid:80) p(h|x;Ψ¯)log(cid:104) p(h,x;Ψ¯(cid:48))p(h|x;Ψ¯)(cid:105) (25) h p(h|x;Ψ¯(cid:48))p(h,x;Ψ¯) = (cid:80) p(h|x;Ψ¯)log(cid:104) p(h,x;Ψ¯(cid:48))(cid:105) +(cid:80) p(h|x;Ψ¯)log(cid:104) p(h|x;Ψ¯)(cid:105) (26) h p(h,x;Ψ¯) h p(h|x;Ψ¯(cid:48)) = (cid:80) p(h|x;Ψ¯)log(cid:104) p(h,x;Ψ¯(cid:48))(cid:105) +D (p(h|x;Ψ¯)(cid:107)p(h|x;Ψ¯(cid:48))) (27) h p(h,x;Ψ¯) KL ≥ (cid:80) p(h|x;Ψ¯)log(cid:104) p(h,x;Ψ¯(cid:48))(cid:105) (28) h p(h,x;Ψ¯) Thus f(Ψ¯(cid:48))−f(Ψ¯)≥(cid:90) p(x;Ψ)(cid:88) p(h|x;Ψ¯)log(cid:20) p(h,x;Ψ¯(cid:48))(cid:21) dx (29) p(h,x;Ψ¯) x h The right hand side is called the Q function in the EM algorithm. Maximizing the Q function with respect to Ψ¯(cid:48) ensuresthatthelikelihoodincreasesatleastasmuchwhenweupdatetheparameterstoΨ¯(cid:48). Notingthatp(h,x;Ψ¯(cid:48))= p(h;Ψ¯(cid:48))p(x;Ψ¯(cid:48)|h) = q¯(cid:48)N(x−µ¯(cid:48),Σ¯(cid:48)) and factoring out the terms independent of Ψ¯(cid:48), we find the update that h h h maximizesthelowerboundtobe (cid:90) Ψ¯ ←argmax p(x;Ψ)(cid:88) p(h|x;Ψ¯)log(cid:2) q¯(cid:48)N(x−µ¯(cid:48),Σ¯(cid:48))(cid:3) dx (30) h h h Ψ¯(cid:48) x h M(cid:48) (cid:90) Ψ¯ ←argmax(cid:88) q p(x;µ ,Σ )(cid:88) p(h|x;Ψ¯)log(cid:2) q¯(cid:48)N(x−µ¯(cid:48),Σ¯(cid:48))(cid:3) dx (31) i i i h h h Ψ¯(cid:48) i=1 x h ThesecondequationfollowsfromthefactthattheovercompletedistributionΨisamixtureofM(cid:48) Gaussians. The updatescanbesolvedasfollows. M(cid:48) (cid:90) q¯(cid:48) ← (cid:88) q p(x;µ ,Σ )p(h|x;Ψ¯)dx h i i i i=1 x µ¯(cid:48) ← 1 (cid:88)M(cid:48) q (cid:90) xp(x;µ ,Σ )p(h|x;Ψ¯)dx h q¯ i i i h i=1 x Σ¯(cid:48) ← 1 (cid:88)M(cid:48) q (cid:90) xp(x;µ ,Σ )(x−µ¯(cid:48))(x−µ¯(cid:48))Tp(h|x;Ψ¯)dx h q¯ i i i h h h i=1 x wherep(h|x;Ψ¯)istheposteriorprobabilitythatagivensamplexissampledfromthehthcomponentofthemixture modelspecifiedbyΨ¯ (hereweusethepreviousestimateforΨ¯). Thiscanbecomputedas: q¯ N(x;µ¯ ,Σ¯ ) p(h|x;Ψ¯)= h h h (32) (cid:80)M q¯ N(x;µ¯ ,Σ¯ ) k=1 k k k NoticetheresemblencewithstandardGMM,exceptwherep(x;µ ,Σ )isadiracdeltafunctioninthestandardsetting i i (since the input data in standard GMM is a set of points instead of a distribution). Unlike standard GMM, these 20expectations (integrations) in the above EM updates are hard to compute in closed form. Instead we employ the approximationforanyfunctiong(x) E (cid:2) g(x)p(h|x;Ψ¯)(cid:3) x∼N(x;µ,Σ) ≈E (cid:2) g(x)p(h|µ;Ψ¯)(cid:3) x∼N(x;µ,Σ) =p(h|µ;Ψ¯)E [g(x)]. x∼N(x;µ,Σ) In other words, we assume that the posterior probability of any output cluster only depends on the mean of the overcompleteclustercentroidinsidetheexpectation. Thisapproximationisreasonablesincemostsamplesdrawnfrom thedistributionwouldbeconcentratedaroundthemean. Furthermoreasweincreasethenumberofclustercentroidsin theovercompleterepresentation,thevariancewithineachovercompleteclustercentroidbecomessmalleryieldingmore focusaroundthemean. Thesetofupdatescannowbesolvedinclosedformasfollows: M(cid:48) q¯ ← (cid:88) q p(h|µ ;Ψ¯) h i i i=1 M(cid:48) µ¯ ← 1 (cid:88) q p(h|µ ;Ψ¯)x h q¯ i i h i=1 M(cid:48) Σ¯ ← 1 (cid:88) q p(h|µ ;Ψ¯)(cid:2) Σ +(µ −µ¯ )(µ −µ¯ )T(cid:3) h q¯ i i i i h i h h i=1 SinceEMisalocaloptimizationmethod,carefulinitializationofGMMparametersisimportant. Ourinitialization criterionofGMMcentroidsistomaximizetheprobabilitythatfuturepointlieswithinτ distancefromatleastone centroid: M(cid:48) (cid:88) µ¯ =argmax q max I((cid:107)µ −µ¯(cid:107) ≤τ) (33) 1:M i i 2 µ¯1:M i=1 µ¯∈µ¯1:M Unfortunately, directly optimizing (33) is NP-hard. So instead, we select an M-sized subset of µ in a greedy 1:M(cid:48) fashiontomaximize(33)6. 6Notethatthissubsetselectionproblemissubmodular,whichmeansthatagreedymethodisguranteedtoachieveatleast1−1/e oftheoptimalsubsetvalue. 21B Multipathbaselinesystemdiagram Figure10visualizesthesplatrenderingrasterizationofinputpoints,andthebackboneandoutputarchitecture. See Section5.3fordetails. GMM output Road Network Spatial feature map N agents K trajectories CNN backbone Agent-centric (ResNet18) patch T time steps network Agents Differentiable cropping and rotation Traffic Lights Figure10: OriginalMultiPatharchitecture,withdetailsdescribedinSection5.3. 22