MULTIPATH++: EFFICIENT INFORMATION FUSION AND TRAJECTORY AGGREGATION FOR BEHAVIOR PREDICTION
Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi Pang Lam, Dragomir Anguelov, Benjamin Sapp∗
ABSTRACT
Predicting the future behavior of road users is one of the most challenging and important problems in autonomous driving. Applying deep learning to this problem requires fusing heterogeneous world state in the form of rich perception signals and map information, and inferring highly multimodal distributions over possible futures. In this paper, we present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks. MultiPath++ improves the MultiPath architecture [45] by revisiting many design choices. The first key design difference is a departure from dense image-based encoding of the input world state in favor of a sparse encoding of heterogeneous scene elements: MultiPath++ consumes compact and efficient polylines to describe road features, and raw agent state information directly (e.g., position, velocity, acceleration). We propose a context-aware fusion of these elements and develop a reusable multi-context gating fusion component. Second, we reconsider the choice of pre-defined, static anchors, and develop a way to learn latent anchor embeddings end-to-end in the model. Lastly, we explore ensembling and output aggregation techniques, common in other ML domains, and find effective variants for our probabilistic multimodal output representation. We perform an extensive ablation on these design choices, and show that our proposed model achieves state-of-the-art performance on the Argoverse Motion Forecasting Competition [12] and the Waymo Open Dataset Motion Prediction Challenge [18].
1 Introduction
Modeling and predicting the future behavior of human agents is a fundamental problem in many real-world robotics domains. For example, accurately forecasting the future state of other vehicles, cyclists and pedestrians is critical for safe, comfortable, and human-like autonomous driving. However, behavior prediction in an autonomous vehicle (AV) driving setting poses a number of unique modeling challenges:

1) Multimodal output space: The problem is inherently stochastic; it is impossible to truly know the future state of the environment. This is exacerbated by the fact that other agents' intentions are not observable, and leads to a highly multimodal distribution over possible outcomes (e.g., a car could turn left or right at an intersection). Effective models must be able to represent such a rich output space with high precision and recall, matching the underlying distribution.

2) Heterogeneous, interrelated input space: The driving environment representation can contain a highly heterogeneous mix of static and dynamic inputs: road network information (lane geometry and connectivity, stop lines, crosswalks), traffic light state information, and motion history of agents. Driving situations are often highly interactive, and can involve many agents at once (e.g. negotiating a 4-way stop with crosswalks). This requires careful modeling choices, as explicitly modeling joint future distributions over multiple agents is exponential in the number of agents. Effective models must capture not only the interactions between the agents in the scene, but also the relationships between the road elements and the behavior of agents given the road context.
∗Contact: {balakrishnanv, bensapp}@waymo.com. Waymo, 1600 Amphitheatre Pkwy, Mountain View, California, USA.
The novel challenges and high impact of this problem have naturally garnered much interest in recent years. There has been a rich body of work on how to model agents' futures, their interactions, and the environment. However, there is little consensus to date on the best modeling choices for each component, and in popular benchmark challenge datasets [6, 12, 18, 53], there is a surprisingly diverse set of solutions to this problem; for details see Section 2 and Table 1.
The MultiPath framework [45] addresses the multimodal output space challenge above by modeling the highly multimodal output distributions via a Gaussian Mixture Model. It handles a common issue of mode collapse during learning by using static trajectory anchors, an external input to the model. This practical solution gives practitioners a straightforward way of ensuring diversity and an extra level of modeler control via the design of such anchors. The choice of a GMM representation proved to be extremely popular, appearing in many works; see Table 1, "Trajectory Distribution", where "Weighted set" is a special case of GMMs in which only means and mixture weights are modeled.

The MultiPath input representation and backbone draw heavily upon the computer vision literature. By rasterizing all world state in a top-down orthographic view, MultiPath and others [10, 14, 27, 30, 33, 37, 46, 52] leverage powerful, established CNN architectures like ResNet [25], which offer solutions to the heterogeneous, interrelated input space: the heterogeneous world state is mapped to a common pixel format, and interactions occur via local information sharing via convolution operations. While convenient and established, there are downsides to such rasterization: (1) There is an uneasy trade-off between the resolution of the spatial grid, field of view, and compute requirements. (2) Rasterizing is a form of manual feature engineering, and some features may be inherently difficult to represent in such a framework (e.g. radial velocity). (3) It is difficult to capture long-range interactions via convolutions with small receptive fields. (4) The information content is spatially very sparse, making a dense representation a potentially computationally wasteful choice.
In this paper, we introduce MultiPath++, which builds upon MultiPath, taking its output GMM representation and concept of anchors, but reconsidering how to represent and combine highly heterogeneous world state inputs and model interactions between state elements. MultiPath++ introduces a number of key upgrades:

• We eschew the rasterization-and-CNN approach in favor of modeling sparse world state objects more directly from their compact state description. We represent road elements as polylines, agent history as a sequence of physical state encoded with RNNs, and agent interactions as RNNs over the state of neighbors relative to each ego-agent. These choices avoid lossy rasterization in favor of raw, continuous state, and result in compute complexity that scales with the number of scene elements rather than the size of a spatial grid. Long-range dependencies are effectively and efficiently modeled in our representation.

• Capturing relationships between road elements and agents is critical, and we find that encoding each element independently does not perform as well as modeling their interactions (e.g., when the road encoding is aware of relevant agents, and vice versa). To address this, we propose a novel form of context awareness we call multi-context gating (MCG), in which sets of elements have access to a summary context vector upon which encodings are conditioned. MCG is implemented as a generic neural network component that is applied throughout our model. MCG can be viewed as an efficient form of cross-attention, whose efficiency/quality trade-off depends on the size of the context vector.

• We also explore improvements in trajectory modeling, comparing representations based on kinematic controls and/or polynomials as a function of continuous future time. We further demonstrate a way to learn latent representations of anchors and show they outperform the original static anchors of MultiPath, while simplifying model creation to a single-step process.

• Finally, we find significant additional gains on public benchmarks by applying ensembling techniques to our models. Unlike models with static anchors, the latent anchors of a MultiPath++ ensemble are not in direct correspondence. Furthermore, many popular behavior prediction benchmarks have introduced metrics such as miss rate (MR) and mean Average Precision (mAP), which require the ability to model diverse outcomes with few trajectories and differ from pure trajectory distance error capturing the average agent behavior. With the above in mind, we formulate the problem of ensembling the results of several models as one of greedy iterative clustering, which maximizes a probabilistic objective using the popular Expectation-Maximization algorithm [3].

As of November 1, 2021, MultiPath++ ranks 1st on the Waymo Open Motion Dataset leaderboard² and 4th on the Argoverse Motion Forecasting Competition³. We offer MultiPath++ as a reference set of design choices, empirically validated via ablation studies, that can be adopted, further studied and extended by the behavior modeling community.

²https://waymo.com/open/challenges/2021/motion-prediction/
³https://eval.ai/web/challenges/challenge-page/454/leaderboard/1279
Method | Road Enc. | Motion Enc. | Interactions | Decoder | Output | Trajectory Distribution
Jean [34] | – | LSTM | attention | LSTM | states | GMM
TNT [54] | polyline | polyline | maxpool, attention | MLP | states | Weighted set
LaneGCN [32] | GNN | 1D conv | GNN | MLP | states | Weighted set
WIMP [28] | polyline | LSTM | GNN + attention | LSTM | states | GMM
VectorNet [20] | polyline | polyline | maxpool, attention | MLP | states | Single traj.
SceneTransformer [35] | polyline | attention | attention | attention | states | Weighted set
GOHOME [21] | GNN | 1D conv + GRU | GNN | MLP | states | heatmap
MP3 [11] | raster | raster | conv | conv | cost function | Weighted samples
CoverNet [37] | raster | raster | conv | lookup | states | GMM w/ dynamic anchors
DESIRE [30] | raster | GRU | spatial pooling | GRU | states | Samples
RoadRules [27] | raster | raster | conv | LSTM | states | GMM
Social LSTM [1] | – | LSTM | spatial pooling | LSTM | states | Samples
Social GAN [24] | – | LSTM | maxpool | LSTM | states | Samples
MFP [46] | raster | GRU | RNNs + attention | GRU | states | Samples
MANTRA [33] | raster | GRU | – | GRU | states | Samples
PRANK [2] | raster | raster | conv | lookup | states | Weighted set
IntentNet [10] | raster | raster | conv | conv | states | Single traj.
SpaGNN [9] | raster | raster | GNN | MLP | state | Single traj.
Multimodal [14] | raster | raster | conv | conv | states | Weighted set
PLOP [5] | raster | LSTM | conv | MLP | state poly | GMM
Precog [41] | raster | GRU | multi-agent sim. | GRU | motion | Samples
R2P2 [40] | raster | GRU | – | GRU | motion | Samples
HYU_ACE [36] | raster | LSTM | attn | LSTM | motion | Samples
Trajectron++ [44] | raster | LSTM | RNNs + attention | GRU | controls | GMM
DKM [13] | raster | raster | conv | conv | controls | Weighted set
MultiPath [45] | raster | raster | conv | MLP | states | GMM w/ static anchors
MultiPath++ | polyline | LSTM | RNNs + maxpool | MLP | control poly | GMM

Table 1: A survey of recent work in behavior prediction, categorized by choice of road encoding, motion history encoding, agent interaction encoding, trajectory decoding, intrinsic output representation, and distribution over future trajectories.
2 Related Work

We focus on architectural design choices for behavior prediction in driving environments: what representations to use to encode road information, agent motion, agent interactions, output trajectories, and output distributions. Table 1 is a summary of past work, which we go over here with additional context.
For road encoding, there is a dichotomy of representations. The raster approach encodes the world as a stack of images, from a top-down orthographic (or "bird's-eye") view. Rasterizing the world state has the benefit of simplicity: all the various types of input information (road configuration, agent state history, spatial relationships) are unified via rendering as a multi-channel image, enabling one to leverage powerful off-the-shelf Convolutional Neural Network (CNN) techniques. However, this one-size-fits-all approach has significant downsides: difficulty in modeling long-range interactions, constrained field of view, and difficulty in representing continuous physical states. As an alternative, the polyline approach describes curves (e.g., lanes, crosswalks, boundaries) as piecewise linear segments. This is a significantly more compact form due to the sparse nature of road networks. Previous works typically process a set-of-polylines description of the world in a per-agent, agent-centric coordinate system. LaneGCN [32] stands apart by treating road lanes as nodes in a graph neural network, leveraging road network connectivity structure.

To model motion history, one popular choice is to encode the sequence of past observed states via a recurrent net (GRU, LSTM) or temporal (1D) convolution. As an alternative, in the raster framework, the state sequence is typically rendered as a stack of binary mask images depicting agent oriented bounding boxes, or rendered in the same image, with the corresponding time information rendered separately [39].
To model agent interactions, one must deal with a dynamic set of neighboring agents around each modeled agent. This is typically done by aggregating neighbor motion history with a permutation-invariant set operator: pooling or soft attention. Notably, Precog [41] jointly rolls out agent policies in a step-wise simulation. Raster approaches rely on convolution over the 2D spatial grid to implicitly capture interactions; long-term interactions are dependent on the network receptive fields.
Agent trajectory decoding choices are similar to choices for encoding motion history, with the exception of methods that do lookup on a fixed or learned trajectory database [2, 37].

The most popular output trajectory representation is a sequence of states (or state differences). A few works [40, 42] instead model Newton's laws of motion in a discrete time-step aggregation capturing Verlet integration. Other works [13, 44] explicitly model controls which parameterize a kinematically-feasible model for vehicles and bicycles. With any of these representations, the space-time trajectory can be intrinsically represented as a sequence of sample points or a continuous polynomial representation [5]. In our experimental results, we explore the effectiveness of states and kinematic controls, with and without an underlying polynomial basis. Notably unique are (1) HOME [22] and GOHOME [21], which first predict a heatmap, and then decode trajectories after sampling, and (2) MP3 [11] and NMP [52], which learn a cost function evaluator of trajectories, where the trajectories are enumerated heuristically rather than generated by a learned model.

Nearly all work assumes an independent, per-agent output space, in which agent interactions cannot be explicitly captured. A few works are notable in describing joint interactions as output, either in an asymmetric [28, 47] or symmetric way [18, 35, 41].
The choice of output trajectory distribution has ramifications on downstream applications. An intrinsic property of the driving setting is that a vehicle or a pedestrian can follow one of a diverse set of possible trajectories. It is thus essential to capture the multimodal nature of the problem. Gaussian Mixture Models (GMMs) are a popular choice for this purpose due to their compact parameterized form; mode collapse is addressed through training tricks [14, 50] or the use of trajectory anchors [45]. Other approaches model a discrete distribution over a set of trajectories (learned or fixed a priori) [2, 14, 32, 54], or via a collection of trajectory samples drawn from a latent distribution and decoded by the model [1, 30, 33, 40, 41].
3 Model Architecture

Figure 1 depicts the proposed MultiPath++ model architecture, which at a high level is similar to that of MultiPath [45]; the model consists of an encoding step and a predictor head which conditions on anchors and outputs a Gaussian Mixture Model (GMM) [3] distribution over the possible agent position at each future timestep.

MultiPath used a common, top-down image-based representation for all input modalities (e.g., agents' tracked state, road network information), and a CNN encoder. In contrast, MultiPath++ has encoders processing each input modality and converting it to a compact and sparse representation; the different modality encodings are later fused using a multi-context gating (MCG) mechanism.
Figure 1: MultiPath++ Model Architecture. MCG denotes Multi-Context Gating, described in Section 3. Blocks in red highlight portions of the model with learned parameters. Dotted inputs to the MCG denote context features. Each of the encoder MCGs outputs aggregated embeddings (one per agent), as shown by dotted arrows. The predictor MCG, on the other hand, outputs one embedding per trajectory per agent.
3.1 Input Representation

MultiPath++ makes predictions based on the following input modalities:
• Agent state history: a state sequence describing the agent trajectory for a fixed number of past steps. In the Waymo Open Motion dataset [18], this state information includes position, velocity, 3D bounding box size, heading angle and object type; for Argoverse [12] only position information is provided. The state is transformed to an agent-centric coordinate system, such that the most recent agent pose is located at the origin and heading east.⁴

Figure 2: Left: Context gating (CG) block diagram. Right: 3 CG blocks stacked together, with running-average skip-connections (shown as components labeled "µ"). See Section 3.2 for details.
• Road network: Road network elements such as lane lines, crosswalks, and stop lines are often represented as parametric curves like clothoids [52], which can be sampled to produce point collections that are easily stored in multi-dimensional array format, as is done in many public datasets [12, 18]. We further summarize this information by approximating the point sequences for each road element as a set of piecewise linear segments, or polylines, similar to [20, 26, 32].

• Agent interactions: For each modeled agent, we consider all neighboring agents. For each neighboring agent, we extract features in the modeled agent's coordinate frame, such as relative orientation, distance, history and speed.
• AV-relative features: Similar to the interaction features, we extract features of the autonomous vehicle / sensing vehicle (AV) relative to each other agent. We model the AV separately from the other agents. We hypothesize this is a helpful distinction for the model because: (a) The AV is the center of the sensors' field of view; tracking errors due to distance and occlusion are relative to this center. (b) The behavior of the AV can be unlike the other road users, which to a good approximation can be assumed to all be humans.

Details on how these features are encoded and fused are described next. These steps comprise the "Encoder" block of Figure 1, whose output is an encoding per agent, in each agent's coordinate frame.
3.2 Multi-Context Gating for fusing modalities

In this section we focus on how to combine the different input modality encodings in an effective way. Other works use a common rasterized format [45, 52], a simple concatenation of encodings [30, 41, 44], or employ attention [20, 32, 35, 46]. We propose an efficient mechanism for fusing information we term multi-context gating (MCG), and use MCG blocks throughout the MultiPath++ architecture.
Given a set of elements $s_{1:n}$ and an input context vector $c$, a CG block assigns an output $s'_{1:n}$ to each element in the set, and computes an output context vector $c'$. The output does not depend on the ordering of the input elements. Mathematically, let $\mathrm{CG}(\cdot,\cdot)$ be the function implemented by the CG block, and $\pi$ be any permutation operation on a sequence of $n$ elements. The following equations hold for CG:

$$(s'_{1:n},\, c') = \mathrm{CG}(s_{1:n}, c), \qquad (s^*_{1:n},\, c^*) = \mathrm{CG}(\pi(s_{1:n}), c) \tag{1}$$

which imply that we have

$$c^* = c' \quad \text{(permutation-invariance)}, \qquad s^*_{1:n} = \pi(s'_{1:n}) \quad \text{(permutation-equivariance)}.$$

The size of the set $n$ can vary across calls to $\mathrm{CG}(\cdot,\cdot)$.
CG's set function properties (permutation invariance/equivariance and the ability to process arbitrarily sized sets) are naturally motivated by the need to encode a variable, unordered set of road network elements and agent relationships. A number of set functions have been proposed in the literature, such as DeepSets [51], PointNet [38] and Set Transformers [29].

⁴Since the explicit heading is missing in Argoverse data, we use the last two timesteps to get the current orientation.
Figure 3: A comparison of the element relationship graph for cross-attention and CG. In cross-attention, each element $s_i$ aggregates information from $c_{1:m}$. In CG, $c_{1:m}$ is summarized with a single context vector $c$.
A single CG block is implemented via

$$\tilde{s}_i = \mathrm{MLP}(s_i) \tag{2}$$
$$\tilde{c} = \mathrm{MLP}(c) \tag{3}$$
$$s'_i = \tilde{s}_i \odot \tilde{c} \tag{4}$$
$$c' = \mathrm{Pool}(s'_{1:n}), \tag{5}$$

where $\odot$ denotes element-wise product and Pool is a permutation-invariant pooling layer such as max or average pooling. These operations are illustrated in Figure 2. In the absence of an input context, we simply set $\tilde{c}$ to an all-ones vector in the first context gating block. Note that both $s'_i$ and $c'$ depend on all inputs. It can be shown that $c'$ is permutation-invariant w.r.t. the input embeddings. It can also be shown that the $s'_i$ are permutation-equivariant.
We stack multiple CG blocks by incorporating running-average skip-connections, as is done in residual networks [25]:

$$\bar{s}^k_{1:n} = \frac{1}{k}\sum_{j=1}^{k} s^j_{1:n} \tag{6}$$
$$\bar{c}^k = \frac{1}{k}\sum_{j=1}^{k} c^j \tag{7}$$
$$\left(s^{k+1}_{1:n},\, c^{k+1}\right) = \mathrm{CG}\left(\bar{s}^k_{1:n},\, \bar{c}^k\right). \tag{8}$$

We denote such multi-layer CG blocks as $\mathrm{MCG}_N(\cdot,\cdot)$ for a stack of $N$ CG blocks.
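For concreteness, the following is a minimal NumPy sketch of a CG block and an MCG stack following Equations (2)-(8). The single-layer ReLU MLPs, max pooling, and all dimensions are illustrative assumptions rather than the configuration used in our experiments; the assertions check the permutation properties stated above.

```python
import numpy as np

def mlp(x, w, b):
    # One-layer ReLU MLP; stands in for the MLPs of Eqs. (2)-(3).
    return np.maximum(x @ w + b, 0.0)

def cg_block(s, c, p):
    # s: (n, d) set elements; c: (d,) context vector.
    s_tilde = mlp(s, p["ws"], p["bs"])   # Eq. (2)
    c_tilde = mlp(c, p["wc"], p["bc"])   # Eq. (3)
    s_out = s_tilde * c_tilde            # Eq. (4): element-wise gating, broadcast over the set
    c_out = s_out.max(axis=0)            # Eq. (5): permutation-invariant max pooling
    return s_out, c_out

def mcg(s, c, params_list):
    # Stack of CG blocks with running-average skip connections, Eqs. (6)-(8).
    s_hist, c_hist = [s], [c]
    for p in params_list:
        s, c = cg_block(np.mean(s_hist, axis=0), np.mean(c_hist, axis=0), p)
        s_hist.append(s)
        c_hist.append(c)
    return s, c

rng = np.random.default_rng(0)
n, d, depth = 5, 8, 3
params_list = [dict(ws=0.1 * rng.normal(size=(d, d)), bs=np.zeros(d),
                    wc=0.1 * rng.normal(size=(d, d)), bc=np.zeros(d)) for _ in range(depth)]
s, c = rng.normal(size=(n, d)), rng.normal(size=d)

s_out, c_out = mcg(s, c, params_list)
perm = rng.permutation(n)
s_perm, c_perm = mcg(s[perm], c, params_list)
assert np.allclose(c_out, c_perm)        # permutation invariance of the context
assert np.allclose(s_out[perm], s_perm)  # permutation equivariance of the elements
```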
Comparison with attention. Attention is a popular mechanism in domains such as NLP [48] and computer vision [15, 17], in which the encoding for each element of a set is updated via a combination of the encodings of all other elements. For a set of size $n$, this intrinsically requires $O(n^2)$ operations. In models of human behavior in driving scenarios, self-attention has been employed to update encodings for, e.g., road lanes, by attending to neighboring lanes, or to update encodings per agent based on the other agents in the scene. Cross-attention has also been used to condition one input type (e.g. agent encodings) on another (e.g. road lanes) [20, 32, 35]. Without loss of generality, if there are $n$ agents and $m$ road elements, this cross-attention scales as $O(nm)$ to aggregate road information for each agent.

CG can be viewed as an approximation to cross-attention. Rather than each of $n$ elements attending to all $m$ elements of the latter set, CG summarizes the latter set with the single context vector $c$, as shown in Figure 3. Thus the dimensionality of $c$ needs to be great enough to capture all the useful information contained in the original $m$ encodings. If the dimensionality of the elements is $d$, and the dimensionality of $c$ is $d_c$, then if $d_c = md$, CG can be reduced to some form of cross-attention by setting $c = \mathrm{Concat}(\{c_j\}_{j=1}^{m})$. When $d_c < md$, we are trading off the representational power of full cross-attention against computational efficiency.
3.3 Encoders

In this section we detail the specific encoders shown in Figure 1.
Agent history encoding. The agent history encoding is obtained by concatenating the output of three sources:

1. An LSTM on the history features from $H$ timesteps ago to the present time: $(x_t, y_t)_{t=-H:0}$.
2. An LSTM on the difference in the history features $(x_t - x_{t-1},\, y_t - y_{t-1})_{t=-H+1:0}$.
3. MCG blocks applied to the set of history elements. Each element in the set consists of a historical position and time offset in seconds relative to the present time. The context input here is an all-ones vector with an identity context MLP. Additionally, we also encode the history frame id as a one-hot vector to further disambiguate the history steps.

We denote the final embedding, which concatenates these three state history encodings, as $\phi^{(\mathrm{state})}$.
Agent interaction encoding. For each modeled agent, we build an interaction encoding by considering each neighboring agent $\nu$'s past state observations: $\{x^{\nu}_{-H}, \ldots, x^{\nu}_{-1}, x^{\nu}_{0}\}$. We transform $\nu$'s state into the modeled agent's coordinate frame, and embed it with an LSTM to obtain an embedding $\phi^{(\mathrm{interaction})}_{\nu}$. Note this is similar to the ego-agent history embedding, but instead applied to the relative coordinates of another agent.

By doing this for $n$ neighboring agents we obtain a set of interaction embeddings $\phi^{(\mathrm{interaction})}_{i=1:n}$. We fuse neighbor information with stacked MCG blocks as follows

$$\left(\phi'^{(\mathrm{interaction})}_{1:n},\, \phi'^{(\mathrm{state})}\right) = \mathrm{MCG}_N\left(\phi^{(\mathrm{interaction})}_{1:n},\, \left[\phi^{(\mathrm{state})},\, \phi^{(\mathrm{interaction})}_{\mathrm{AV}}\right]\right) \tag{9}$$

where the second argument is the input context vector to MCG, which in this case is a concatenation of the modeled agent's history embedding and the AV's interaction embedding. In this way we emphasize the AV's representation as a unique entity in the context for all interactions; see Section 3.1 for motivation.
Road network encoding. We use the polyline road element representation discussed in Section 3.1 as input. Each line segment is parameterized by its start point, end point and the road element semantic type $\Upsilon$ (e.g., Crosswalk, SolidDoubleYellow, etc.). For each agent of interest, we transform the closest $P = 128$ polylines into their frame of reference and call the transformed segment $p = (a, b)$. Let $r$ be the closest point from the agent to the segment, and $a^{\perp}$ be the unit tangent vector at $a$ on the original curve. Then we represent the agent's spatial relationship to the segment via the vector $[\|r\|_2,\; r/\|r\|_2,\; (b-a)/\|b-a\|_2,\; \|b-a\|_2,\; \|b-r\|_2,\; a^{\perp},\; \mathrm{OneHotEncoding}(\Upsilon)]$. These feature vectors are each processed with a shared MLP, resulting in a set of agent-specific embeddings per road segment, which we denote by $\phi^{(\mathrm{road})}_{1:P}$. We then fuse road element embeddings with the agent history embedding $\phi^{(\mathrm{state})}$ using stacked MCG blocks

$$\left(\phi'^{(\mathrm{road})}_{1:P},\, \phi'^{(\mathrm{state})}\right) = \mathrm{MCG}_N\left(\phi^{(\mathrm{road})}_{1:P},\, \phi^{(\mathrm{state})}\right) \tag{10}$$

and thus enrich the road embeddings with dynamic state information.
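The per-segment feature vector above can be sketched as follows. This is a minimal illustration, with two simplifying assumptions we flag explicitly: the curve tangent $a^{\perp}$ is approximated by the segment direction (the model has access to the original curve), and the semantic type vocabulary is a small made-up set.

```python
import numpy as np

ROAD_TYPES = ["Crosswalk", "SolidDoubleYellow", "LaneCenter"]  # illustrative vocabulary

def closest_point_on_segment(a, b, p):
    # Project p onto segment ab and clamp to the segment.
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return a + t * ab

def segment_features(a, b, road_type):
    # Endpoints a, b are already in the agent frame, so the agent sits at the origin.
    r = closest_point_on_segment(a, b, np.zeros(2))
    r_norm = np.linalg.norm(r)
    ab_norm = np.linalg.norm(b - a)
    direction = (b - a) / ab_norm
    one_hot = np.eye(len(ROAD_TYPES))[ROAD_TYPES.index(road_type)]
    return np.concatenate([
        [r_norm],                                   # ||r||_2
        r / r_norm if r_norm > 0 else np.zeros(2),  # r / ||r||_2
        direction,                                  # (b - a) / ||b - a||_2
        [ab_norm],                                  # ||b - a||_2
        [np.linalg.norm(b - r)],                    # ||b - r||_2
        direction,                                  # a_perp, approximated here by the segment direction
        one_hot,                                    # OneHotEncoding(type)
    ])

print(segment_features(np.array([2.0, 1.0]), np.array([6.0, 1.0]), "LaneCenter"))
```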
3.4 Output representation

MultiPath++ predicts a distribution over future behavior parameterized as a Gaussian Mixture Model (GMM), as is done in MultiPath [45] and other works [4, 34, 37]. For efficient long-term prediction, the distribution is conditionally independent across time steps given a mixture component, thus each mode at each timestep is represented as a Gaussian distribution over $(x, y)$ with a mean $\mu^t \in \mathbb{R}^2$ and covariance $\Sigma^t \in \mathbb{R}^{2\times 2}$. The $M$ mode likelihoods $p_{1:M}$ are tied over time. MAP inference per mode is equivalent to taking the sequence $\mu^{1:T}$ as state waypoints defining a possible future trajectory for the agent. The full output distribution is

$$p(s) = \sum_{i=1}^{M} p_i \prod_{t=1}^{T} \mathcal{N}\left(s^t - \mu^t_i,\, \Sigma^t_i\right) \tag{11}$$

where $s = s^{1:T}$ represents a trajectory, $s^t \in \mathbb{R}^2$.

The classification head of Figure 1 predicts the $p_i$ as a softmax distribution over mixture components. The regression head outputs the parameters of the Gaussians $\mu$ and $\Sigma$ for $M$ modes and $T$ timesteps.
Training objective. We follow the original MultiPath approach and maximize the likelihood of the groundtruth trajectory under our model's predicted distribution. We make a hard-assignment labeling of a "correct" mixture component by choosing the one with the smallest Euclidean distance to the groundtruth trajectory.

The average log loss over the entire training set is optimized using Adam. We use an initial learning rate of 0.0002 and a batch size of 64, with a decay rate of 0.5 every 2 epochs. The final model is chosen after training for 800,000 steps.
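A minimal sketch of this objective for a single example is given below, combining the log-softmax classification term for the hard-assigned mode with the time-factorized Gaussian log-likelihood of Eq. (11). The function names, shapes, and the use of plain NumPy (rather than the differentiable training implementation) are illustrative assumptions.

```python
import numpy as np

def gaussian_log_pdf_2d(x, mu, cov):
    # Log-density of a 2D Gaussian at x.
    diff = x - mu
    return -0.5 * (diff @ np.linalg.inv(cov) @ diff
                   + np.log(np.linalg.det(cov)) + 2.0 * np.log(2.0 * np.pi))

def multipath_nll(gt, mus, covs, logits):
    # gt: (T, 2) ground truth; mus: (M, T, 2); covs: (M, T, 2, 2); logits: (M,).
    # Hard-assign the "correct" mode: smallest average Euclidean distance to the ground truth.
    dists = np.linalg.norm(mus - gt[None], axis=-1).mean(axis=-1)
    m = int(np.argmin(dists))
    # Classification term: log-softmax likelihood of the chosen mode.
    log_p = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    # Regression term: time-factorized Gaussian log-likelihood of the chosen mode, as in Eq. (11).
    reg = sum(gaussian_log_pdf_2d(gt[t], mus[m, t], covs[m, t]) for t in range(len(gt)))
    return -(log_p[m] + reg)

rng = np.random.default_rng(0)
M, T = 6, 8
print(multipath_nll(rng.normal(size=(T, 2)), rng.normal(size=(M, T, 2)),
                    np.tile(np.eye(2), (M, T, 1, 1)), rng.normal(size=M)))
```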
3.5 Prediction architecture with learned anchor embeddings

The goal of the Predictor module (Figure 1) is to predict the parameters of the GMM described in Section 3.4, namely $M$ trajectories, with likelihoods and uncertainties around each waypoint.

In applications related to future prediction, capturing the highly uncertain and multimodal set of outcomes is a key challenge and the focus of much work [31, 37, 40, 45, 50]. One of MultiPath's key innovations was to use a static set of anchor trajectories as pre-defined modes that applied to all scenes. One major downside to this is that most modes are not a good fit to any particular scene, thus requiring a large number of modes to be considered, with most obtaining a low likelihood and getting discarded. Another downside is the added complexity and effort stemming from a 2-phase learning process (first estimating the modes from data, then training the network).

In this work, we learn anchor embeddings as part of the overall model training. We interpret these embeddings $e_{1:M}$ as anchors in latent space, and construct our architecture to have a one-to-one correspondence between these embeddings and the output trajectory modes of our GMM. The vectors $e_{1:M}$ are trainable model parameters that are independent of the input. This has connections to Detection Transformers (DETR) [8], which propose a way to learn anchors rather than hand-design them for object detection. This is also similar in spirit to MANTRA [33], a trajectory prediction network, which has an explicit learned memory network consisting of a database of embeddings that can be retrieved and decoded into trajectories.

We concatenate the embeddings $\phi'^{(\mathrm{interaction})}$, $\phi'^{(\mathrm{road})}$ and $\phi'^{(\mathrm{state})}$ obtained from the output of the respective MCG blocks to obtain a fixed-length feature vector $\phi^{(\mathrm{combined})}$ for each modeled agent. We then use this as context in stacked MCG blocks that operate on the set of anchor embeddings $e_{1:M}$, with a final MLP that predicts all parameters of the output GMM:

$$(\mu, \sigma_x, \sigma_y, \rho_{xy}, \log q)_{1:M} = \mathrm{MLP}\left(\mathrm{MCG}\left(e_{1:M},\, \phi^{(\mathrm{combined})}\right)\right),$$

where $\Sigma$ is formed from $(\sigma_x, \sigma_y, \rho_{xy})$.
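The shape of this computation can be sketched as follows. This is only an illustration under heavy simplification: a single gating step and a single linear decode stand in for the stacked MCG blocks and final MLP, and the horizon, widths, and exp/tanh squashing of the Gaussian parameters are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, T = 6, 16, 8  # modes, embedding width, horizon (illustrative sizes)

# Trainable, input-independent latent anchors e_{1:M} (random initialization here).
anchors = 0.1 * rng.normal(size=(M, d))

def predictor_head(phi_combined, anchors, w_gate, w_out):
    # One CG-style step: gate every anchor embedding with the per-agent context
    # phi_combined, then decode all GMM parameters with a shared linear map.
    gated = anchors * np.maximum(phi_combined @ w_gate, 0.0)
    out = gated @ w_out                                     # (M, 5T + 1)
    mu = out[:, : 2 * T].reshape(M, T, 2)                   # waypoint means
    sigma = np.exp(out[:, 2 * T : 4 * T]).reshape(M, T, 2)  # sigma_x, sigma_y > 0
    rho = np.tanh(out[:, 4 * T : 5 * T])                    # correlation in (-1, 1)
    log_q = out[:, -1] - np.log(np.exp(out[:, -1]).sum())   # mode log-likelihoods
    return mu, sigma, rho, log_q

w_gate = 0.1 * rng.normal(size=(d, d))
w_out = 0.1 * rng.normal(size=(d, 5 * T + 1))
mu, sigma, rho, log_q = predictor_head(rng.normal(size=d), anchors, w_gate, w_out)
print(mu.shape, np.exp(log_q).sum())  # (6, 8, 2), mode probabilities sum to ~1.0
```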
3.6 Internal Trajectory Representation

We model the future position and heading of agents, along with agent-relative longitudinal and lateral Gaussian uncertainties. We parameterize the trajectory using $x(t)$, $y(t)$, $\theta(t)$, $\sigma_{\mathrm{lng}}(t)$, $\sigma_{\mathrm{lat}}(t)$: position, heading, and standard deviations for longitudinal and lateral uncertainty.

The most popular approach in the literature is to directly predict a sequence of such states at a uniform time-discretization. Here we also consider two non-mutually exclusive variants.

1. We can represent functions over time as polynomials, which add an inductive bias that ensures a smooth trajectory. It gives us a compact, interpretable representation of each predicted signal.

2. Instead of directly predicting $\{x(t), y(t), \theta(t)\}$, we can predict the underlying kinematic control signals, which can then be integrated to evaluate the output state. In this work, we experiment with predicting the acceleration $a(t)$ and heading change rate $\dot\theta(t)$ and integrating them to recover the trajectory as follows:

$$
\begin{aligned}
v(t) &= v(0) + \int_0^t a(\tau)\, d\tau \\
\theta(t) &= \theta(0) + \int_0^t \dot\theta(\tau)\, d\tau \\
x(t) &= x(0) + \int_0^t v(\tau)\cos(\theta(\tau))\, d\tau \\
y(t) &= y(0) + \int_0^t v(\tau)\sin(\theta(\tau))\, d\tau.
\end{aligned} \tag{12}
$$
These representations add inductive bias encouraging natural and realistic trajectories that are based on realistic kinematics and consistent with the current state of the predicted agent. For the polynomial representation, it is also possible to specify a soft constraint by regularizing the polynomial's constant term, which determines the shift of the predicted signal from its current value.

Algorithm 1 demonstrates the conversion from control signals to output positions. Note that this operation is differentiable, permitting end-to-end optimization. It is a numerical approximation of Equation 12 with additional technical considerations: (1) When computing the next position $(x(t), y(t))$, we use midpoint approximations of the speed and heading, $\tilde\theta(t) \equiv \theta(t - \delta_t) + \dot\theta(t - \delta_t)\cdot(\delta_t/2)$. (2) Given vehicle dimensions, we cap the heading change rate to match a predetermined maximum feasible curvature. (3) These equations are applied to the rear axle of the vehicle rather than the center position. We use the rear-end position of the vehicle as an approximation of the rear-axle position.

Note that Algorithm 1 can be viewed as a special type of recurrent network, without learned parameters. This decoding stage then mirrors other works which use a learned RNN (LSTM or GRU cells) to decode an embedding vector into a trajectory [27, 28, 34, 44, 46]. In our case, the recurrent network state consists of $x(t)$, $y(t)$, $v(t)$ and $\theta(t)$, and the input consists of $\dot\theta(t)$ and $a(t)$. Encoding an inductive bias derived from kinematic modeling spares the network the need to explicitly learn these properties and makes the predicted state compact. This promotes data efficiency and generalization power, but can be more sensitive to perception errors in the current state estimate.
Data: $a(t)$ and $\dot\theta(t)$ at $t = 0, 0.1, 0.2, \ldots, 10.0$; $x(0)$, $y(0)$, $\theta(0)$, $v(0)$
Result: $x(t)$, $y(t)$ and $\theta(t)$ at $t = 0.1, 0.2, \ldots, 10.0$
$\delta_t = 0.1$
for $t = 0.1, 0.2, \ldots, 10.0$ do
  $\tilde v \leftarrow v(t - \delta_t) + a(t)\cdot(\delta_t/2)$
  $\dot\theta_{\mathrm{cap}} \leftarrow \mathrm{CapCurvature}(\dot\theta(t - \delta_t))$
  $\tilde\theta \leftarrow \theta(t - \delta_t) + \dot\theta_{\mathrm{cap}}\cdot(\delta_t/2)$
  $x(t) \leftarrow x(t - \delta_t) + \tilde v \cos(\tilde\theta)\,\delta_t$
  $y(t) \leftarrow y(t - \delta_t) + \tilde v \sin(\tilde\theta)\,\delta_t$
  $\theta(t) \leftarrow \theta(t - \delta_t) + \dot\theta_{\mathrm{cap}}\,\delta_t$
  $v(t) \leftarrow v(t - \delta_t) + a(t)\,\delta_t$
end
Algorithm 1: Integrating control signals to positions.
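A NumPy version of Algorithm 1 is sketched below. The speed-proportional curvature cap ($|\dot\theta| \le v \cdot \kappa_{\max}$) is one plausible reading of CapCurvature rather than the exact implementation, and the rear-axle handling is omitted.

```python
import numpy as np

def integrate_controls(a, theta_dot, x0, y0, theta0, v0, dt=0.1, max_curvature=1.0 / 3.5):
    # Numerical integration of Algorithm 1 with midpoint speed/heading approximations.
    x, y, theta, v = [x0], [y0], [theta0], [v0]
    for t in range(1, len(a)):
        v_mid = v[-1] + a[t] * dt / 2.0
        # Cap the heading change rate so the implied curvature |theta_dot / v| stays
        # below the feasible maximum (one plausible reading of CapCurvature).
        cap = abs(v_mid) * max_curvature
        td = float(np.clip(theta_dot[t - 1], -cap, cap))
        theta_mid = theta[-1] + td * dt / 2.0
        x.append(x[-1] + v_mid * np.cos(theta_mid) * dt)
        y.append(y[-1] + v_mid * np.sin(theta_mid) * dt)
        theta.append(theta[-1] + td * dt)
        v.append(v[-1] + a[t] * dt)
    return np.array(x), np.array(y), np.array(theta)

# Constant speed with a constant turn rate traces a circular arc.
n = 101  # 10 s at dt = 0.1 s
x, y, theta = integrate_controls(np.zeros(n), np.full(n, 0.2), 0.0, 0.0, 0.0, 5.0)
print(round(theta[-1], 3))  # ~2.0 rad after 10 s of 0.2 rad/s
```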
4 Ensembling predictor heads via bootstrap aggregation

Ensembling is a powerful and popular technique in many machine learning applications. For example, ensembling is a critical technique for getting the best performance on ImageNet [25]. By combining multiple models which are to some degree complementary, we can enjoy the benefits of a higher capacity model with lower statistical variance.

We specifically apply bootstrap aggregation (bagging) [19] to our predictor heads by training $E$ such heads together. To encourage the models to learn complementary information, the weights of the $E$ heads are initialized randomly, and an example is used to update the weights of each head with a 50% probability.
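One way to realize this per-head, 50%-probability example routing is sketched below; the Bernoulli mask and per-head loss normalization are our assumptions about one reasonable implementation, not the exact training code.

```python
import numpy as np

rng = np.random.default_rng(0)
E, batch = 5, 8

# Each head sees each example with probability 0.5; a head's loss is then
# averaged only over the examples routed to it.
mask = rng.random((E, batch)) < 0.5
per_example_loss = rng.random((E, batch))  # stand-in for each head's per-example loss
per_head_loss = (per_example_loss * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
print(mask.astype(int), per_head_loss.mean())
```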
Unlike scalar regression or classification, it is not obvious how to combine output from different heads in our case: each is a Gaussian Mixture Model, with no correspondence of mixture components across ensemble heads. Furthermore, we consider allowing each predictor head to predict a richer output distribution with more modes $L > M$, where $M$ is fixed as a requirement for the task (and is used in benchmark metrics calculations).

Let $\Psi$ denote the union of the predictions from all heads

$$\Psi = \{(\mu_1, \Sigma_1, q_1), \ldots, (\mu_{M'}, \Sigma_{M'}, q_{M'})\}, \tag{13}$$

where $M' = L \cdot E$, and the mode likelihoods $q$ are divided by the number of heads $E$ so that they sum up to 1. Then we pose the ensemble combination task as one of converting $\Psi$ to a more compact GMM $\bar\Psi$ with $M$ modes:

$$\bar\Psi = \left((\bar\mu_1, \bar\Sigma_1, \bar q_1), \ldots, (\bar\mu_M, \bar\Sigma_M, \bar q_M)\right), \tag{14}$$

while requiring that $\bar\Psi$ best approximates $\Psi$. In this section we describe the aggregation algorithm we use. Theoretical motivations and derivation can be found in Appendix A.

We fit $\bar\Psi$ to $\Psi$ using an iterative clustering algorithm, like Expectation-Maximization [3], but with hard assignment of cluster membership. This setting lends itself to efficient implementation in a compute graph, and allows us to train this step end-to-end as a final layer in our deep network.
We start by selecting $M$ cluster centroids from $\mu_{1:M'}$ in a greedy fashion. The selection criterion is to maximize the probability that a centroid sampled from $\Psi$ lies within $\tau$ distance from at least one selected centroid:

$$\bar\mu_{1:M} = \operatorname*{argmax}_{\bar\mu_{1:M}} \sum_{i=1}^{M'} q_i \max_{\bar\mu \in \bar\mu_{1:M}} \mathbb{I}\left(\|\mu_i - \bar\mu\|_2 \le \tau\right) \tag{15}$$

This is a criterion that explicitly optimizes trajectory diversity, which is a fit for metrics such as miss rate, mAP and minADE, as defined in [12, 18]. Other criteria could also be used depending on the metric of interest. It is interesting to relate this criterion to the ensembling and sampling method employed by GOHOME [21]. In that work, they output an intermediate spatial heatmap representation, which is amenable to ensemble aggregation. Then they greedily sample end-points in a similar fashion.

Since jointly optimizing $\bar\mu_{1:M}$ is hard, we select each $\bar\mu_i$ greedily for $i = 1, \ldots, M$ according to

$$\bar\mu_i = \operatorname*{argmax}_{\bar\mu_i} \sum_{j=1}^{M'} q_j \max_{\bar\mu \in \bar\mu_{1:i}} \mathbb{I}\left(\|\mu_j - \bar\mu\|_2 \le \tau\right) \tag{16}$$

which differs in that the outer argmax is done iteratively over $\bar\mu_i$ rather than jointly over $\bar\mu_{1:M}$.
Starting with the selected centroids, we iteratively update the parameters of $\bar\Psi$ using an expectation-maximization-style [16] algorithm, where each iteration consists of the following updates:

$$\bar q_h \leftarrow \sum_{i=1}^{M'} q_i\, p(h \mid \mu_i; \bar\Psi) \tag{17}$$

$$\bar\mu_h \leftarrow \frac{1}{\bar q_h} \sum_{i=1}^{M'} q_i\, p(h \mid \mu_i; \bar\Psi)\, \mu_i \tag{18}$$

$$\bar\Sigma_h \leftarrow \frac{1}{\bar q_h} \sum_{i=1}^{M'} q_i\, p(h \mid \mu_i; \bar\Psi)\left[\Sigma_i + (\mu_i - \bar\mu_h)(\mu_i - \bar\mu_h)^T\right], \tag{19}$$

where $p(h \mid x; \bar\Psi)$ is the posterior probability that a given sample $x$ is sampled from the $h$-th component of the mixture model specified by $\bar\Psi$, which can be computed as

$$p(h \mid x; \bar\Psi) = \frac{\bar q_h\, \mathcal{N}(x - \bar\mu_h, \bar\Sigma_h)}{\sum_{k=1}^{M} \bar q_k\, \mathcal{N}(x - \bar\mu_k, \bar\Sigma_k)}. \tag{20}$$
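The greedy selection of Eq. (16) and the EM refinement of Eqs. (17)-(20) can be sketched in NumPy as follows. The loop-based implementation, iteration count, and toy data are illustrative; the version trained end-to-end in a compute graph would be vectorized.

```python
import numpy as np

def greedy_select(mus, qs, M, tau):
    # Greedy centroid selection (Eq. 16): at each step, add the candidate mean that
    # maximizes the probability mass lying within tau of some selected centroid.
    selected = []
    for _ in range(M):
        best_j, best_score = 0, -1.0
        for j in range(len(mus)):
            cand = np.array(selected + [mus[j]])
            d = np.linalg.norm(mus[:, None] - cand[None], axis=-1)  # (M', |cand|)
            score = (qs * (d.min(axis=1) <= tau)).sum()
            if score > best_score:
                best_j, best_score = j, score
        selected.append(mus[best_j])
    return np.array(selected)

def em_aggregate(mus, covs, qs, mu_bar, n_iters=10):
    # EM refinement (Eqs. 17-20), treating the over-complete means as weighted samples.
    M = len(mu_bar)
    q_bar = np.full(M, 1.0 / M)
    cov_bar = np.tile(np.eye(2), (M, 1, 1))
    for _ in range(n_iters):
        # E-step: responsibilities p(h | mu_i; Psi_bar), Eq. (20).
        log_resp = np.empty((len(mus), M))
        for h in range(M):
            diff = mus - mu_bar[h]
            quad = np.einsum("id,de,ie->i", diff, np.linalg.inv(cov_bar[h]), diff)
            log_resp[:, h] = np.log(q_bar[h]) - 0.5 * (quad + np.log(np.linalg.det(cov_bar[h])))
        resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: Eqs. (17)-(19).
        w = qs[:, None] * resp                    # q_i * p(h | mu_i; Psi_bar)
        q_bar = np.maximum(w.sum(axis=0), 1e-12)  # Eq. (17)
        mu_bar = (w[..., None] * mus[:, None]).sum(axis=0) / q_bar[:, None]  # Eq. (18)
        for h in range(M):
            diff = mus - mu_bar[h]
            outer = np.einsum("i,id,ie->de", w[:, h], diff, diff)
            cov_bar[h] = (np.einsum("i,ide->de", w[:, h], covs) + outer) / q_bar[h]  # Eq. (19)
    return mu_bar, cov_bar, q_bar

# Two well-separated groups of over-complete modes collapse to two centroids.
rng = np.random.default_rng(0)
mus = np.concatenate([rng.normal([0, 0], 0.3, (15, 2)), rng.normal([5, 5], 0.3, (15, 2))])
covs = np.tile(np.eye(2) * 0.1, (30, 1, 1))
qs = np.full(30, 1.0 / 30)
mu_bar, cov_bar, q_bar = em_aggregate(mus, covs, qs, greedy_select(mus, qs, M=2, tau=1.0))
print(np.round(mu_bar, 2), np.round(q_bar, 2))  # centroids near (0,0) and (5,5), weights ~0.5
```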
5 Experiments

5.1 Datasets

The Waymo Open Motion Dataset (WOMD) [18] consists of 1.1M examples time-windowed from 103K 20s scenarios. The dataset is derived from real-world driving in urban and suburban environments. Each example for training and inference consists of 1 second of history state and 8 seconds of future, which we resample at 5 Hz. The object-agent state contains attributes such as position, agent dimensions, velocity and acceleration vectors, orientation, angular velocity, and turn signal state. The long (8s) time horizon in this dataset tests the model's ability to capture a large field of view and scale to an output space of trajectories which, in theory, grows exponentially with time.

The Argoverse dataset [12] consists of 333K scenarios containing trajectory histories, context agents, and lane centerline inputs for motion prediction. The trajectories are sampled at 10 Hz, with 2 seconds of past history and a 3-second future prediction horizon.
5.2 Metrics

We compare models using the competition-specific metrics associated with the various datasets⁵. Specifically, we report the following metrics.

minDE$^t_k$ (Minimum Distance Error): The minimum distance, over the top $k$ most-likely trajectories, between a predicted trajectory and the ground truth trajectory at time $t$.

minADE$_k$ (Minimum Average Distance Error): Similar to minDE$^t_k$, but the distance is calculated as an average over all timesteps.

MR$^t_k$@$d$ (Miss Rate): Measures the rate at which minDE$^t_k$ exceeds $d$ meters. Note that the WOMD leaderboard uses a different definition [18].

mAP: For each set of predicted trajectories, we have at most one positive: the one closest to the ground truth and within $\tau$ distance of it. The other predicted trajectories are counted as misses. From this, we can compute precision and recall at various thresholds. Following the WOMD metrics definition [18], the agents' future trajectories are partitioned into behavior buckets, and an area under the precision-recall curve is computed using the possible true positives and false positives per agent, giving us an Average Precision per behavior bucket. The total mAP value is a mean over the APs for each behavior bucket.

Overlap rate: The fraction of times the most likely trajectory prediction of any agent overlaps with a real future trajectory of another agent (see [18] for details).

⁵For each dataset, we report the results of our model against published results of publicly available models.
Argoverse leaderboard ($k = 6$, $d = 2$m, $t = 3$s)

Method | Rank† | brier-minFDE | minFDE | MR | minADE
LaneGCN [32] | 50 | 2.059 | 1.364 | 0.163 | 0.868
DenseTNT [23] | 23 | 1.976 | 1.282 | 0.126 | 0.882
HOME+GOHOME [21] | 10 | 1.860 | 1.292 | 0.085 | 0.890
TPCN++ [49] | 5 | 1.796 | 1.168 | 0.116 | 0.780
MultiPath++ (ours) | 4 | 1.793 | 1.214 | 0.132 | 0.790
QCraft Blue Team | 1 | 1.757 | 1.214 | 0.114 | 0.801

Table 2: Comparison with select state-of-the-art methods on the Argoverse leaderboard. $k$ is the number of trajectories, $d$ is the maximum distance for no miss, and $t$ is the trajectory time duration. †Rank on the public leaderboard https://eval.ai/web/challenges/challenge-page/454/leaderboard/1279 as of November 12, 2021, which is sorted by brier-minFDE.
TRI (Turning Radius Infeasibility): We compute the turning radius along the predicted trajectories using two approaches: one that uses the predicted yaw output from the model (TRI-h), and one that doesn't require yaw predictions and instead uses the circumradius of three consecutive waypoints (TRI-c). If the radius is less than a certain threshold $\tau$, it is treated as a violation. We set this threshold to the approximate minimum turning radius of a midsize sedan, $\tau = 3.5$m. Note that a model that simply predicts a constant heading can achieve a TRI-h rate of zero, hence we also compute inconsistencies between the turning radius suggested by the coordinates and the predicted headings (TRI-hc). A TRI-hc inconsistency occurs when the difference between the heading based on the circumradius from waypoints and the predicted heading is greater than 0.05 radians at any timestep in a trajectory.
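The TRI-c check follows from the circumradius formula $R = abc / (4K)$ for each triple of consecutive waypoints, as in this sketch; the trajectories and threshold usage here are illustrative.

```python
import numpy as np

def circumradius(p1, p2, p3):
    # Radius of the circle through three waypoints; R = abc / (4 * area).
    a, b, c = (np.linalg.norm(p2 - p3), np.linalg.norm(p1 - p3), np.linalg.norm(p1 - p2))
    cross = abs((p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0]))
    return np.inf if cross < 1e-9 else a * b * c / (2.0 * cross)  # area = cross / 2

def tri_c_violation(traj, tau=3.5):
    # TRI-c: flag the trajectory if any three consecutive waypoints imply a
    # turning radius below the minimum feasible radius tau (midsize sedan).
    return any(circumradius(traj[i], traj[i + 1], traj[i + 2]) < tau
               for i in range(len(traj) - 2))

angles = np.linspace(0.0, 0.5, 10)
arc = np.stack([np.cos(angles), np.sin(angles)], axis=1)
print(tri_c_violation(20.0 * arc))  # False: a gentle 20 m-radius arc is feasible
print(tri_c_violation(2.0 * arc))   # True: a 2 m-radius arc violates tau = 3.5 m
```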
5.3 MultiPath baseline

As our work evolved from MultiPath, we include a reference MultiPath model where the input and backbone are faithful to the original paper [45] for a point of comparison, with a few minor differences. Specifically, we use a top-down rendering of the scene as before, but now employ a splat rendering [55] approach for rasterization, in which we sample points uniformly from scene elements and do an orthographic projection. This is a simpler, sparser form of rendering, which doesn't employ anti-aliasing, but is efficient and straightforward to implement in TensorFlow and run as part of the model compute graph on hardware accelerators (GPU/TPU).

As in the original paper, we use a grid of 400×400 cells, with a grid cell physical dimension of 0.2m × 0.2m, thus a total field of view of 80m centered around the AV sensing vehicle in WOMD, with a ResNet18 backbone [25]. We use 128 static anchors obtained via k-means, which are shared among all agent types (vehicles, pedestrians, cyclists) for simplicity. Figure 10 illustrates this model's inputs and architecture.
5.4 External benchmark results

On Argoverse, MultiPath++ achieves top-5 performance on most metrics (Table 2). Our technique is ranked 1st on all metrics on the Waymo Open Motion Dataset [18] (Table 3). The tested model is based on the best configuration in Table 4, where the outputs from multiple ensemble heads are aggregated as described in Section 4.

On WOMD, we also see that the original MultiPath model, even with the refinement of learned anchors and ensembling, is outperformed by more recent methods. It is interesting to note that MultiPath is the best performing top-down scene-centric model employing a CNN; every known method which outranks it uses sparse representations.
5.5 Qualitative Examples

Figure 4 shows examples of MultiPath++ on WOMD scenes. Figure 5 shows examples of MultiPath++ on Argoverse scenes. These examples show the ability of MultiPath++ to handle different road layouts and agent interactions.

5.6 Ablation Study

In this section we evaluate our design choices through an ablation study. Table 4 summarizes the ablation results. In the following subsections we discuss how our architecture choices affect model performance.
Figure 4: Examples of MultiPath++ predictions for 8 seconds in WOMD scenes. Hue indicates time horizon while transparency indicates predicted probability. Rectangles indicate vehicles while small squares indicate pedestrians. (a): A four-way intersection involving multiple interactions; for example, car A is predicted to yield for car B. (b & c): Narrow road interaction; car A is predicted to yield for car B and then nudge around the parked car, designated by the arrow. (d & e): Interaction between two vehicles at an intersection, where car A is predicted to yield for car B or make a right turn behind it; after car B passes, car A can make a left turn. Also note the bimodal prediction of the pedestrian located at the corner. (f & g): Predictions in a parking lot and a typical roadgraph.
Waymo Open Motion Prediction ($k = 6$, $t = 8$s)

Method | Rank† | minDE | minADE | MR | Overlap | mAP
MultiPath [45] | 11 | 2.04 | 0.880 | 0.345 | 0.166 | 0.409
SceneTransformer [35] | 7 | 1.212 | 0.612 | 0.156 | 0.147 | 0.279
DenseTNT [23] | 5 | 1.551 | 1.039 | 0.157 | 0.178 | 0.328
MultiPath++ | 1 | 1.158 | 0.556 | 0.134 | 0.131 | 0.409

Table 3: Comparison with published state-of-the-art methods on the WOMD public leaderboard. $k$ is the number of trajectories and $t$ is the trajectory time horizon. †Rank on the public leaderboard https://waymo.com/open/challenges/2021/motion-prediction/ as of November 12, 2021, which is sorted by mAP.
Figure 5: Examples of MultiPath++ predictions for 8 seconds in Argoverse scenes, showing the ability to follow different lane geometries.
5.6.1 Set Functions

Recall that MultiPath++ uses two types of set functions. Invariant set functions are used to encode a set of elements (e.g. agents, roadgraph segments) into a single feature vector. Equivariant set functions are used to convert the set of learned anchors, together with the encoded feature vector as a context, into a corresponding set of trajectories with likelihoods. We use multi-context gating to represent both types of functions. We experimented with other representations of set functions:

• MLP+MaxPool: In this experiment, we replace the multi-context gating (MCG) road network encoder with an MLP+MaxPool applied on points rather than polylines, inspired by PointNet [38]. We use a 5-layer MLP with ReLU activations.

• Equivariant DeepSet [51]: The equivariant set function is represented as a series of blocks, each involving an element-wise transformation followed by pooling to compute the context. Unlike MCG, it does not use gating (pointwise multiplication) between set elements and the context vector. Instead, a linear transformation of the context is added to each element. We use a DeepSet of 5 blocks in the predictor.

• Transformers [29]: We replace the gating mechanism (element-wise multiplication) on polylines with self-attention. For decoding, we use cross-attention where the queries are the learned embeddings and the keys are the various encoder features.
5.6.2 Trajectory representation

As mentioned in Section 3.6, we experiment with predicting polynomial coefficients for the trajectory, as well as predicting kinematic control signals (acceleration and heading change rate). We found that polynomial representations hurt performance, counter to conclusions made in PLOP [4], which demonstrated improvements over the then state of the art on PRECOG [43] and nuScenes [7] using polynomials to represent output trajectories. Furthermore, the PLOP datasets only require predicting 4s into the future, which is much shorter than our prediction horizon of 10s; for such short futures, polynomial representations are more suitable. In our case, we do not see much gain from using the polynomial representation, possibly due to the larger dataset size and longer-term prediction horizon.

The controls-based output works better on distance metrics than a polynomial representation, which suggests it is a more beneficial and domain-specific form of inductive bias. Overall, our results suggest that the simple sequence-of-raw-coordinates trajectory representation works best for distance-based metrics. However, these unconstrained representations have a non-trivial rate of kinematic infeasibility (TRI-x metrics in Table 4). Kinematic feasibility and consistency between headings and positions is crucial in practice when such behavior models are used for planning and control of a real-world robot, an issue that is not captured by public benchmark metrics.
5.6.3 Ensembling

We explore ensembling, producing an over-complete set of trajectories that is then summarized using the aggregation proposed in Section 4, as well as their combination. The number of ensemble heads is denoted by $E$ and the number of trajectories per head is denoted by $L$. Finally, we aggregate the $E \cdot L$ trajectories down to $M = 6$, which is the required number of trajectories for the WOMD submission.
5.6.4 Anchor representation

We explore learned and k-means-based anchor representations.
5.7 Discussion

First, we remark that MultiPath++ is a significant improvement over its predecessor MultiPath, as seen in Tables 3 and 4. As discussed in this paper, they differ in many design dimensions, the primary one being the change from a dense top-down raster representation to a sparse, element-based representation with agent-centric coordinate systems. Other design choices are validated in isolation in the following discussion.

We find that MLP+MaxPool performs the worst among all set function variants, as expected due to its limited capacity. DeepSet is able to outperform MLP+MaxPool. Also, increasing the depth of the MCG gives consistently better results, owing to the effective increase in capacity and the flow of information across skip connections. We get the best performance by increasing the depth of the MCG to 5 layers.
We find that learning anchors ("Learned anchors") is more effective than using a set of anchors obtained a priori via k-means. This runs counter to the original findings in the MultiPath paper [45] that anchor-free models suffer from mode
Model | minDE | minADE | MR | AUC | TRI-h (%) | TRI-c (%) | TRI-hc (%)
Original MultiPath | 4.752 | 1.796 | 0.749 | – | – | – | –

Set Function
MLP+MaxPool | 2.693 | 1.107 | 0.528 | 0.367 | – | – | –
DeepSet | 2.562 | 1.055 | 0.5 | 0.368 | – | – | –
Transformer | 2.479 | 1.042 | 0.479 | 0.3687 | – | – | –
1 MCG block | 2.764 | 1.15 | 0.55 | 0.312 | – | – | –
5 stacked MCG blocks⋆ | 2.305 | 0.978 | 0.44 | 0.393 | – | – | –

Trajectory Representation
Polynomial | 2.537 | 1.041 | 0.501 | 0.368 | n/a | 1.92 | n/a
Control | 2.319 | 0.987 | 0.449 | 0.386 | 0.00 | 1.22 | 0.00
Raw coordinates | 2.305 | 0.978 | 0.44 | 0.393 | n/a | 1.08 | n/a
Raw coordinates w/ heading⋆ | 2.311 | 0.978 | 0.443 | 0.395 | 4.10 | 1.04 | 9.92

Ensembling
E = 1, L = 6 | 2.333 | 0.982 | 0.410 | 0.240 | – | – | –
E = 5, L = 6 | 2.18 | 0.948 | 0.395 | 0.297 | – | – | –
E = 1, L = 64 | 2.487 | 1.057 | 0.473 | 0.367 | – | – | –
E = 5, L = 64⋆ | 2.305 | 0.978 | 0.44 | 0.393 | – | – | –

Anchors
Static k-means anchors | 2.99 | 1.22 | 0.578 | 0.324 | – | – | –
Learned anchors⋆ | 2.305 | 0.978 | 0.44 | 0.393 | – | – | –

Table 4: Model ablation on the WOMD validation set. Metrics are parameterized by $k = 6$, $t = 8$s, and $d = 2$m. ⋆ denotes the reference configuration: road encoding, state history encoding and interaction encoding as described in Section 3. "n/a" denotes a model that does not predict heading.
Figure 6: Qualitative results showing the effect of control-based trajectory representation. Top row: MultiPath++ with state-based trajectories. Bottom row: MultiPath++ with control-based trajectories.
collapse. The difference could possibly be due to the richer and more structured inputs, improved model architecture, and larger batch sizes in MultiPath++. We leave more detailed ablations on this issue between the two approaches to future work.

We compare the baseline of directly outputting a single head with 6 trajectories ($E = 1$, $L = 6$) to training 5 ensemble heads ($E = 5$, $L = 6$). We see that ensembling significantly improves most metrics, particularly minDE, for which this combination is best. We also train a model with a single head that outputs 64 trajectories, followed by our aggregation method that reduces them to 6 ($E = 1$, $L = 64$). Compared to our initial baseline, this model significantly improves MR and AUC, which require diverse predictions, but regresses the average trajectory distance metric minDE, and even minADE a little. This suggests that the different metrics pose different solution requirements. As expected, our aggregation criterion is well suited to preserving diversity, while straight-up ensembling is better at capturing the average distribution. Finally, our experiment with more ensemble heads and more predictions per head ($E = 5$, $L = 64$) combines the strengths of both techniques, obtaining strictly superior performance on all metrics compared to the baseline.
Figure 7: Example of improved diversity due to aggregation.

Figure 8: Illustration of weeding out unrealistic trajectories after aggregation.

Figure 9: Illustration of improved lane diversity: the model is able to predict the agent going to multiple lanes after aggregation.
5.8 Conclusion

We proposed a novel behavior prediction system, MultiPath++, by carefully considering choices for input representation and encoding, fusing encodings, and representing the output distribution. We demonstrated state-of-the-art results on popular benchmarks for behavior prediction. Furthermore, we surveyed existing methods, analyzed our approach empirically, and provided practical insights for the research community. In particular, we showed the importance of sparse encoding, efficient fusion methods, control-based methods, and learned anchors. Finally, we provided a practical guide to various tricks used for training and inference to improve robustness, increase diversity, handle missing data, and ensure fast convergence during training.
References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[2] Yuriy Biktairov, Maxim Stebelev, Irina Rudenko, Oleh Shliazhko, and Boris Yangel. PRANK: Motion prediction based on ranking. arXiv preprint arXiv:2010.12007, 2020.
[3] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
[4] Thibault Buhet, É. Wirbel, and Xavier Perrotton. PLOP: Probabilistic polynomial objects trajectory planning for autonomous driving. ArXiv, abs/2003.08744, 2020.
[5] Thibault Buhet, Emilie Wirbel, and Xavier Perrotton. PLOP: Probabilistic polynomial objects trajectory planning for autonomous driving. arXiv preprint arXiv:2003.08744, 2020.
[6] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
[7] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving, 2020.
[8] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing.
[9] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. SpaGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In IEEE Intl. Conf. on Robotics and Automation. IEEE, 2020.
[10] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. In Conf. on Robot Learning, 2018.
[11] Sergio Casas, Abbas Sadat, and Raquel Urtasun. MP3: A unified model to map, perceive, predict and plan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14403–14412, 2021.
[12] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In CVPR, 2019.
[13] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, and Nemanja Djuric. Deep kinematic models for kinematically feasible vehicle trajectory predictions. In IEEE Intl. Conf. on Robotics and Automation, pages 10563–10569. IEEE, 2020.
[14] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In IEEE Intl. Conf. on Robotics and Automation, 2019.
[15] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803, 2021.
[16] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
[18] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou, Zoey Yang, Aurelien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset, 2021.
[19] Jerome H Friedman. The elements of statistical learning: Data mining, inference, and prediction. Springer, 2017.
[20] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In CVPR, 2020.
[21] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. GOHOME: Graph-oriented heatmap output for future motion estimation. arXiv preprint arXiv:2109.01827, 2021.
[22] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. HOME: Heatmap output for future motion estimation. CoRR, abs/2105.10968, 2021.
[23] Junru Gu, Qiao Sun, and Hang Zhao. DenseTNT: Waymo Open Dataset motion prediction challenge 1st place solution. CoRR, abs/2106.14160, 2021.
[24] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[26] Namdar Homayounfar, Wei-Chiu Ma, Shrinidhi Kowshika Lakshmikanth, and Raquel Urtasun. Hierarchical recurrent attention networks for structured online maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3417–3426, 2018.
[27] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[28] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, and Deva Ramanan. What-if motion prediction for autonomous driving. ArXiv, 2020.
[29] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 3744–3753. PMLR, 2019.
[30] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip H S Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In CVPR, 2017.
[31] Junwei Liang, Lu Jiang, Kevin Murphy, Ting Yu, and Alexander Hauptmann. The garden of forking paths: Towards multi-future trajectory prediction. In CVPR, 2020.
[32] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. arXiv preprint arXiv:2007.13732, 2020.
[33] Francesco Marchetti, Federico Becattini, Lorenzo Seidenari, and Alberto Del Bimbo. MANTRA: Memory augmented networks for multiple trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7143–7152, 2020.
[34] Jean Mercat, Thomas Gilles, Nicole Zoghby, Guillaume Sandou, Dominique Beauvois, and Guillermo Gil. Multi-head attention for joint multi-modal vehicle motion forecasting. In IEEE Intl. Conf. on Robotics and Automation, 2020.
[35] Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, David Weiss, Benjamin Sapp, Zhifeng Chen, and Jonathon Shlens. Scene Transformer: A unified multi-task model for behavior prediction and planning. CoRR, abs/2106.08417, 2021.
[36] Seong Hyeon Park, Gyubok Lee, Manoj Bhat, Jimin Seo, Minseok Kang, Jonathan Francis, Ashwin R Jadhav, Paul Pu Liang, and Louis-Philippe Morency. Diverse and admissible trajectory forecasting through multimodal context understanding. arXiv preprint arXiv:2003.03212, 2020.
[37] Tung Phan-Minh, Elena Corina Grigore, Freddy A Boulton, Oscar Beijbom, and Eric M Wolff. CoverNet: Multimodal behavior prediction using trajectory sets. arXiv:1911.10298, 2019.
[38] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[39] Khaled S. Refaat, Kai Ding, Natalia Ponomareva, and Stéphane Ross. Agent prioritization for autonomous navigation. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2060–2067, 2019.
[40] Nicholas Rhinehart, Kris Kitani, and Paul Vernaza. R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, 2018.
[41] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In Intl. Conf. on Computer Vision, 2019.
[42] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In ECCV, 2019.
[43] Nicholas Rhinehart, Rowan McAllister, Kris M. Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. CoRR, abs/1905.01296, 2019.
[44] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Multi-agent generative trajectory forecasting with heterogeneous data for control. arXiv preprint arXiv:2001.03093, 2020.
[45] Benjamin Sapp, Yuning Chai, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In Conf. on Robot Learning, 2019.
[46] Yichuan Charlie Tang and Ruslan Salakhutdinov. Multiple futures prediction. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
[47] Ekaterina I. Tolstaya, Reza Mahjourian, Carlton Downey, Balakrishnan Varadarajan, Benjamin Sapp, and Dragomir Anguelov. Identifying driver interactions via conditional behavior prediction. In IEEE International Conference on Robotics and Automation, ICRA 2021, Xi'an, China, May 30 – June 5, 2021, pages 3473–3479. IEEE, 2021.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[49] Maosheng Ye, Tongyi Cao, and Qifeng Chen. TPCN: Temporal point cloud networks for motion forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11318–11327, 2021.
[50] Ye Yuan and Kris Kitani. Diverse trajectory forecasting with determinantal point processes. ICLR, 2020.
[51] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[52] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In CVPR, 2019.
[53] Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Naumann, Julius Kümmerle, Hendrik Königshof, Christoph Stiller, Arnaud de La Fortelle, and Masayoshi Tomizuka. INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps. arXiv:1910.03088, 2019.
[54] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, et al. TNT: Target-driven trajectory prediction. arXiv preprint arXiv:2008.08294, 2020.
[55] Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. Surface splatting. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 371–378, 2001.
A Details and Derivation of Aggregation Algorithm
By having an overcomplete trajectory representation that is later aggregated into a fixed, small number of trajectories, we attempt to address two kinds of uncertainty in the data:
• Aleatoric uncertainty: This is the natural variation in the data. For example, given the same context information, an agent can take a left turn, take a right turn, or change lanes. This level of ambiguity cannot be resolved by increasing model capacity; rather, the model must predict calibrated probabilities for these outcomes. Although it is theoretically possible to model these variations directly with a small number of output trajectories, learning such a model is difficult in practice; mode collapse and insufficient capacity to represent all variations are two common failure modes. Training the model to produce an overcomplete representation forces it to output a diverse distribution of trajectories and can make it more resistant to mode collapse. Following this with greedy iterative trajectory aggregation preserves diversity in the final output.
• Epistemic uncertainty: This is the variation across model outputs, which typically indicates a model's failure to capture certain aspects of the scene or input features. Such variation can occur when some models are poorly trained or have not seen a particular slice of the data. By ensembling models, we attempt to reduce this uncertainty.
For ease of exposition, we assume each trajectory to be composed of a single time point; the same computations are applied independently to each time step of a future sequence. The output $\Psi$ is a Gaussian mixture model (GMM) distribution with $M'$ modes over the future position:

$$p(x, y; \Psi) = \sum_{j=1}^{M'} q_j \, \mathcal{N}(x, y; \mu_j, \Sigma_j) \tag{20}$$
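As a concrete reference for the representation in (20), the following minimal sketch evaluates the overcomplete mixture density at a single 2D point. The helper name and parameter layout (`weights`, `means`, `covs`) are illustrative assumptions, not names from any released implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(point, weights, means, covs):
    """Evaluate p(x, y; Psi) of Eq. (20) at a single 2D point.

    weights: (M',) mixture weights q_j, summing to 1.
    means:   (M', 2) component means mu_j.
    covs:    (M', 2, 2) component covariances Sigma_j.
    """
    return sum(
        w * multivariate_normal.pdf(point, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covs)
    )
```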
We formulate the aggregation as obtaining an $M$-mode GMM $\bar\Psi$ which minimizes the KL divergence $D_{KL}(\Psi \,\|\, \bar\Psi)$. This is equivalent to maximizing the expected log-likelihood $p(x, y; \bar\Psi)$ of a sample point $(x, y)$ drawn from the overcomplete distribution $\Psi$:

$$f(\bar\Psi) = \int_x p(x; \Psi) \log p(x; \bar\Psi) \, dx \tag{21}$$

Assuming the overcomplete distribution approximates the real distribution, this is roughly equivalent to fitting the compact distribution to the real data, but with the added benefits described above. Directly maximizing (21) is intractable.
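Although (21) cannot be maximized directly, it is an expectation under $\Psi$ and can be estimated by sampling, which is useful as a sanity check when comparing candidate compact GMMs. The sketch below is one such Monte Carlo estimator, assuming each GMM is passed as a (weights, means, covs) tuple; none of these names come from the paper's code.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_density(xs, weights, means, covs):
    """Vectorized log p(x; GMM) for an (n, 2) array of points."""
    comp = np.stack([
        np.log(w) + multivariate_normal.logpdf(xs, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return logsumexp(comp, axis=0)

def objective_estimate(compact, overcomplete, n=10_000, seed=0):
    """Monte Carlo estimate of Eq. (21): E_{x~Psi}[log p(x; Psi_bar)]."""
    rng = np.random.default_rng(seed)
    w, mu, cov = overcomplete
    comps = rng.choice(len(w), size=n, p=w)  # pick mixture components
    xs = np.array([rng.multivariate_normal(mu[c], cov[c]) for c in comps])
    return float(np.mean(gmm_log_density(xs, *compact)))
```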
Hence we employ an Expectation-Maximization-like algorithm to obtain a local maximum. The difference in the objective between an old value $\bar\Psi$ and a new value $\bar\Psi'$ may be written as

$$f(\bar\Psi') - f(\bar\Psi) = \int_x p(x; \Psi) \log\!\left[\frac{p(x; \bar\Psi')}{p(x; \bar\Psi)}\right] dx \tag{22}$$
Denoting by $h$ the hidden variable indicating a mixture component of the compact representation, we may write:

$$\log\!\left[\frac{p(x;\bar\Psi')}{p(x;\bar\Psi)}\right] \tag{23}$$

$$= \sum_h p(h|x;\bar\Psi) \log\!\left[\frac{p(x;\bar\Psi')}{p(x;\bar\Psi)}\right] \tag{24}$$

$$= \sum_h p(h|x;\bar\Psi) \log\!\left[\frac{p(h,x;\bar\Psi')\, p(h|x;\bar\Psi)}{p(h|x;\bar\Psi')\, p(h,x;\bar\Psi)}\right] \tag{25}$$

$$= \sum_h p(h|x;\bar\Psi) \log\!\left[\frac{p(h,x;\bar\Psi')}{p(h,x;\bar\Psi)}\right] + \sum_h p(h|x;\bar\Psi) \log\!\left[\frac{p(h|x;\bar\Psi)}{p(h|x;\bar\Psi')}\right] \tag{26}$$

$$= \sum_h p(h|x;\bar\Psi) \log\!\left[\frac{p(h,x;\bar\Psi')}{p(h,x;\bar\Psi)}\right] + D_{KL}\big(p(h|x;\bar\Psi) \,\|\, p(h|x;\bar\Psi')\big) \tag{27}$$

$$\geq \sum_h p(h|x;\bar\Psi) \log\!\left[\frac{p(h,x;\bar\Psi')}{p(h,x;\bar\Psi)}\right] \tag{28}$$

where the final inequality follows from the non-negativity of the KL divergence.
Thus

$$f(\bar\Psi') - f(\bar\Psi) \geq \int_x p(x;\Psi) \sum_h p(h|x;\bar\Psi) \log\!\left[\frac{p(h,x;\bar\Psi')}{p(h,x;\bar\Psi)}\right] dx \tag{29}$$
The right-hand side is called the Q function in the EM algorithm. Maximizing the Q function with respect to $\bar\Psi'$ ensures that the likelihood increases at least as much as the bound when we update the parameters to $\bar\Psi'$. Noting that $p(h,x;\bar\Psi') = p(h;\bar\Psi')\, p(x|h;\bar\Psi') = \bar q'_h\, \mathcal{N}(x - \bar\mu'_h, \bar\Sigma'_h)$ and factoring out the terms independent of $\bar\Psi'$, we find the update that maximizes the lower bound to be

$$\bar\Psi \leftarrow \arg\max_{\bar\Psi'} \int_x p(x;\Psi) \sum_h p(h|x;\bar\Psi) \log\!\big[\bar q'_h\, \mathcal{N}(x - \bar\mu'_h, \bar\Sigma'_h)\big]\, dx \tag{30}$$

$$\bar\Psi \leftarrow \arg\max_{\bar\Psi'} \sum_{i=1}^{M'} q_i \int_x p(x;\mu_i,\Sigma_i) \sum_h p(h|x;\bar\Psi) \log\!\big[\bar q'_h\, \mathcal{N}(x - \bar\mu'_h, \bar\Sigma'_h)\big]\, dx \tag{31}$$
The second equation follows from the fact that the overcomplete distribution $\Psi$ is a mixture of $M'$ Gaussians. The updates can be solved as follows:

$$\bar q'_h \leftarrow \sum_{i=1}^{M'} q_i \int_x p(x;\mu_i,\Sigma_i)\, p(h|x;\bar\Psi)\, dx$$

$$\bar\mu'_h \leftarrow \frac{1}{\bar q'_h} \sum_{i=1}^{M'} q_i \int_x x\, p(x;\mu_i,\Sigma_i)\, p(h|x;\bar\Psi)\, dx$$

$$\bar\Sigma'_h \leftarrow \frac{1}{\bar q'_h} \sum_{i=1}^{M'} q_i \int_x p(x;\mu_i,\Sigma_i)\, (x - \bar\mu'_h)(x - \bar\mu'_h)^T\, p(h|x;\bar\Psi)\, dx$$
where $p(h|x;\bar\Psi)$ is the posterior probability that a given sample $x$ is drawn from the $h$-th component of the mixture model specified by $\bar\Psi$ (here we use the previous estimate of $\bar\Psi$). This can be computed as:

$$p(h|x;\bar\Psi) = \frac{\bar q_h\, \mathcal{N}(x; \bar\mu_h, \bar\Sigma_h)}{\sum_{k=1}^{M} \bar q_k\, \mathcal{N}(x; \bar\mu_k, \bar\Sigma_k)} \tag{32}$$
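In code, (32) is a standard responsibility computation. A minimal sketch, with illustrative parameter names (`bar_weights`, `bar_means`, `bar_covs`) for the current compact GMM estimate:

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, bar_weights, bar_means, bar_covs):
    """Posterior p(h | x; Psi_bar) of Eq. (32) over the M compact modes."""
    lik = np.array([
        q * multivariate_normal.pdf(x, mean=mu, cov=cov)
        for q, mu, cov in zip(bar_weights, bar_means, bar_covs)
    ])
    return lik / lik.sum()
```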
Notice the resemblance to standard GMM fitting, except that here $p(x;\mu_i,\Sigma_i)$ plays the role of what is a Dirac delta function in the standard setting (since the input to standard GMM fitting is a set of points rather than a distribution). Unlike standard GMM fitting, the expectations (integrations) in the above EM updates are hard to compute in closed form. Instead, we employ the following approximation for any function $g(x)$:

$$\mathbb{E}_{x\sim\mathcal{N}(x;\mu,\Sigma)}\big[g(x)\, p(h|x;\bar\Psi)\big] \approx \mathbb{E}_{x\sim\mathcal{N}(x;\mu,\Sigma)}\big[g(x)\, p(h|\mu;\bar\Psi)\big] = p(h|\mu;\bar\Psi)\, \mathbb{E}_{x\sim\mathcal{N}(x;\mu,\Sigma)}\big[g(x)\big].$$

In other words, we assume that, inside the expectation, the posterior probability of any output cluster depends only on the mean of the overcomplete cluster. This approximation is reasonable since most samples drawn from the distribution are concentrated around the mean. Furthermore, as we increase the number of cluster centroids in the overcomplete representation, the variance within each overcomplete cluster becomes smaller, concentrating samples further around the mean. The set of updates can now be solved in closed form as follows:
$$\bar q_h \leftarrow \sum_{i=1}^{M'} q_i\, p(h|\mu_i;\bar\Psi)$$

$$\bar\mu_h \leftarrow \frac{1}{\bar q_h} \sum_{i=1}^{M'} q_i\, p(h|\mu_i;\bar\Psi)\, \mu_i$$

$$\bar\Sigma_h \leftarrow \frac{1}{\bar q_h} \sum_{i=1}^{M'} q_i\, p(h|\mu_i;\bar\Psi)\, \big[\Sigma_i + (\mu_i - \bar\mu_h)(\mu_i - \bar\mu_h)^T\big]$$
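Putting the mean-only posterior approximation and the closed-form updates together, one EM iteration reduces to weighted-mean and weighted-covariance computations over the overcomplete means. Below is a minimal sketch of a single iteration, assuming 2D positions and the illustrative array layout used earlier; it is not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(q, mu, sigma, bar_q, bar_mu, bar_sigma):
    """One closed-form EM update of the compact M-mode GMM.

    q: (M',) weights, mu: (M', 2) means, sigma: (M', 2, 2) covariances
    of the overcomplete GMM; bar_* hold the current compact estimate.
    Uses the approximation p(h | x) ~ p(h | mu_i) from the text.
    """
    M = len(bar_q)
    # lik[i, h] = N(mu_i; bar_mu_h, bar_sigma_h), evaluated in batch.
    lik = np.stack([
        multivariate_normal.pdf(mu, mean=bar_mu[h], cov=bar_sigma[h])
        for h in range(M)
    ], axis=1)
    post = bar_q * lik                       # unnormalized p(h | mu_i)
    post /= post.sum(axis=1, keepdims=True)
    r = q[:, None] * post                    # r[i, h] = q_i p(h | mu_i)

    new_q = r.sum(axis=0)                                  # (M,)
    new_mu = (r.T @ mu) / new_q[:, None]                   # (M, 2)
    new_sigma = np.empty_like(bar_sigma)
    for h in range(M):
        d = mu - new_mu[h]                                 # (M', 2)
        outer = d[:, :, None] * d[:, None, :]              # (M', 2, 2)
        new_sigma[h] = (r[:, h, None, None] * (sigma + outer)).sum(0) / new_q[h]
    return new_q, new_mu, new_sigma
```

In practice this step would be iterated until the mixture parameters converge, starting from the greedy initialization described next.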
Since EM is a local optimization method, careful initialization of the GMM parameters is important. Our initialization criterion for the GMM centroids is to maximize the probability that the future point lies within a distance $\tau$ of at least one centroid:

$$\bar\mu_{1:M} = \arg\max_{\bar\mu_{1:M}} \sum_{i=1}^{M'} q_i \max_{\bar\mu \in \bar\mu_{1:M}} \mathbb{I}\big(\|\mu_i - \bar\mu\|_2 \leq \tau\big) \tag{33}$$

Unfortunately, directly optimizing (33) is NP-hard. So instead, we select an $M$-sized subset of $\mu_{1:M'}$ in a greedy fashion to maximize (33), as sketched below. Note that this subset selection problem is submodular, which means that the greedy method is guaranteed to achieve at least $1 - 1/e$ of the optimal subset value.
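A minimal sketch of the greedy selection follows. We read the text as restricting candidate centroids to the overcomplete means $\mu_{1:M'}$; the helper name and signature are illustrative assumptions.

```python
import numpy as np

def greedy_init(q, mu, M, tau):
    """Greedily pick M of the M' overcomplete means as initial centroids,
    maximizing the tau-coverage objective of Eq. (33).

    q: (M',) mixture weights, mu: (M', 2) means, tau: coverage radius.
    """
    dist = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)
    covers = dist <= tau              # covers[i, j]: candidate j covers mu_i
    covered = np.zeros(len(q), dtype=bool)
    chosen = []
    for _ in range(M):
        # Marginal gain of candidate j = probability mass it newly covers
        # (if all mass is already covered, remaining picks are arbitrary).
        gains = (covers & ~covered[:, None]).astype(float).T @ q
        j = int(np.argmax(gains))
        chosen.append(j)
        covered |= covers[:, j]
    return mu[chosen]
```

Each greedy step adds the candidate covering the most not-yet-covered probability mass, which is exactly the marginal gain of the submodular objective (33).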
B MultiPath baseline system diagram

Figure 10 visualizes the splat rendering rasterization of input points, and the backbone and output architecture. See Section 5.3 for details.
[Figure: the diagram depicts the road network, agents, and traffic lights rendered into a spatial feature map, a CNN backbone (ResNet18), differentiable cropping and rotation into agent-centric patches, and a patch network producing a GMM output of K trajectories over T time steps for N agents.]

Figure 10: Original MultiPath architecture, with details described in Section 5.3.