TNT: Target-driveN Trajectory Prediction

Hang Zhao1∗, Jiyang Gao1∗, Tian Lan1, Chen Sun2, Benjamin Sapp1, Balakrishnan Varadarajan1, Yue Shen1, Yi Shen1, Yuning Chai1, Cordelia Schmid2, Congcong Li1, Dragomir Anguelov1
1 Waymo LLC, 2 Google Research

Abstract: Predicting the future behavior of moving agents is essential for real-world applications. It is challenging as the intent of the agent and the corresponding behavior are unknown and intrinsically multimodal. Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states. This leads to our target-driven trajectory prediction (TNT) framework. TNT has three stages which are trained end-to-end. It first predicts an agent’s potential target states T steps into the future, by encoding its interactions with the environment and the other agents. TNT then generates trajectory state sequences conditioned on targets. A final stage estimates trajectory likelihoods, and a final compact set of trajectory predictions is selected. This is in contrast to previous work which models agent intents as latent variables, and relies on test-time sampling to generate diverse trajectories. We benchmark TNT on trajectory prediction of vehicles and pedestrians, where we outperform the state of the art on Argoverse Forecasting, INTERACTION, Stanford Drone and an in-house Pedestrian-at-Intersection dataset.

Keywords: Trajectory prediction, multimodal prediction.

1 Introduction

Predicting the future states of moving agents in a real-world environment is an important and fundamental problem in robotics. For example, in the setting of autonomous driving on public roads, it is essential to have an accurate understanding of where the other vehicles and pedestrians will likely be in the future, in order for an autonomous vehicle to take safe and effective actions.

A key challenge to future prediction is the high degree of uncertainty, in large part due to not knowing the intents and latent characteristics of the other agents. For example, a vehicle commonly has a multimodal distribution of futures: it could turn, go straight, slow down, speed up, etc.
Depending on other scene elements, it could pass, yield, change lanes, or pull into a driveway. This challenge has garnered a lot of interest in the past few years. One approach to model the high degree of multimodality is to employ flexible implicit distributions from which samples can be drawn: conditional variational autoencoders (CVAEs) [1], generative adversarial networks (GANs) [2], and single-step policy roll-out methods [3]. Despite their competitive performance, the use of latent variables to model intents prevents them from being interpreted, and often requires test-time sampling to evaluate probabilistic queries (e.g., “how likely is the agent to turn left?”). Furthermore, considerable effort has gone into addressing mode collapse in such models, in the machine learning community at large [4] and specifically for self-driving cars [5, 6].

To address these limitations, we make the observation that for our task (e.g. vehicle and pedestrian trajectory prediction), the uncertainties over a moderately long future can be mostly captured by the prediction of possible targets of the agents. These targets are not only grounded in physical entities that are interpretable (e.g. location), but also correlate well with intent (e.g. a lane change or a right turn). We conjecture that the space of targets can be discretized in a scene, allowing a deterministic model to generate diverse targets in parallel, and later refined to be more accurate.

These observations lead to our proposed target-driven trajectory prediction framework, named TNT. We first cast the future prediction problem into predicting a distribution over discretized target states, and then formulate a probabilistic model in which trajectory estimation and likelihood are conditioned on such targets. The resulting framework has three stages that are trained end-to-end: (1) target prediction estimates a distribution over candidate targets given scene context; (2) target-conditioned motion estimation predicts trajectory state sequences per target; and (3) scoring and selection estimates the likelihood of each predicted trajectory, taking into account the context of all other predicted trajectories. We obtain a final set of compact diverse predictions by ranking the likelihoods and suppressing redundant trajectories. An illustration of our three-stage model when applied to vehicle trajectory prediction is shown in Figure 1. Although our model is end-to-end trained, its three-stage layout, with interpretable outputs at each stage, closely follows the typical processing steps in traditional robotics motion forecasting and planning systems [7, 8], thus making it easy to incorporate domain knowledge during deployment.

∗ Equal contribution. Correspond to {hangz, jiyanggao}@waymo.com.

Figure 1: Illustration of the TNT framework when applied to the vehicle future trajectory prediction task. TNT consists of three stages: (a) target prediction, which proposes a set of plausible targets (stars) among all candidates (diamonds); (b) target-conditioned motion estimation, which estimates a trajectory (distribution) towards each selected target; (c) scoring and selection, which ranks trajectory hypotheses and selects a final set of trajectory predictions with likelihood scores. [Panels: Target Prediction; Motion Estimation; Trajectory Scoring & Selection. Legend: lane centerline, target candidate, predicted target, predicted trajectory, unselected trajectory.]

We demonstrate the effectiveness of TNT on multiple challenging trajectory prediction benchmarks. In the driving domain we evaluate on the Argoverse Forecasting dataset [9] and INTERACTION dataset [10]; for pedestrians, the Stanford Drone dataset [11] and an in-house Pedestrian-at-Intersection dataset. We achieve state-of-the-art performance on all benchmarks.

2 Related Work

Trajectory prediction has received much attention recently, especially in autonomous driving [9, 10, 12–14], in social interaction prediction [15–18] and sports [19, 20]. One of the key challenges is to model multimodal future distributions. A popular approach is to model the future modes implicitly as latent variables [1–3, 21–25], which aims at capturing the underlying intents of the agents.
For example, DESIRE [1] used a conditional VAE [26] while PRECOG [3] used flow-based generative models [27]; Social GAN [2] proposed an adversarial discriminator to predict realistic futures; Hong et al. [22] modeled the motion patterns with a latent Gaussian mixture model. However, the use of non-interpretable latent variables makes it challenging to incorporate expert knowledge into the system. Furthermore, these models require stochastic sampling from the latent space to obtain implicit distributions at run time. These properties make them less suitable for practical deployment.

Alternatively, some approaches attempted to decompose the trajectory prediction task into subtasks, with the hope that each subtask is more manageable to solve and provides interpretable intermediate results. For example, Ziebart et al. [28] proposed planning-based prediction for pedestrians: they first estimated a Bayesian posterior distribution of destinations, and then used inverse reinforcement learning (IRL) to plan the trajectories. Rehder et al. [29] introduced the notion of goals, which are defined as short-term destinations, and decomposed the problem into goal distribution estimation and goal-directed planning. The goals were defined as a mixture of Gaussian latent variables. Their follow-up work [30] then demonstrated that the whole framework can be jointly trained via IRL. Concurrent to our work, Mangalam et al. [31] proposed to generate endpoints to guide the full trajectory generation. Unlike TNT, their method still relies on latent variables in a CVAE to model the underlying modes of the endpoints.

Most related to TNT are works that discretize the output space as intents [32] or with anchors [33, 34]. IntentNet [32] manually defined several common motion categories for self-driving vehicles, such as left turn and lane changes, and learned a separate motion predictor for each intent. This manual categorization is task and dataset dependent, and may be too coarse to capture intra-category multimodality. More recently, MultiPath [33] and CoverNet [34] chose to quantize the trajectories into anchors, where the trajectory prediction task is reformulated into anchor selection and offset regression.
The anchors are either pre-clustered into a fixed set a priori [33] or obtained dynamically based on kinematic heuristics [34]. Unlike anchor trajectories, the targets in TNT are much lower dimensional and can be easily discretized via uniform sampling or based on expert knowledge (e.g. HD maps). Hence, they can be estimated more reliably. Despite their simplicity, we demonstrate that the targets are informative enough to capture most of the uncertainty in predicting future state, and our target-driven framework outperforms the anchor-based methods.

3 Formulation

Given a sequence of observed states for a single agent $\mathbf{s}_P = [s_{-T'+1}, s_{-T'+2}, \ldots, s_0]$, our goal is to predict its future states $\mathbf{s}_F = [s_1, s_2, \ldots, s_T]$ up to some fixed time step $T$. Naturally, the agent interacts with an environment consisting of other agents and scene elements for context: $\mathbf{c}_P = [c_{-T'+1}, c_{-T'+2}, \ldots, c_0]$. We denote $\mathbf{x} = (\mathbf{s}_P, \mathbf{c}_P)$ for brevity, thus the overall probabilistic distribution we want to capture is $p(\mathbf{s}_F \mid \mathbf{x})$.

In practice, $p(\mathbf{s}_F \mid \mathbf{x})$ can be highly multimodal. For example, a vehicle approaching an intersection could turn left, go straight or change lanes. Intuitively, the uncertainty of future states can be decomposed into two parts: the target or intent uncertainty, such as the decision between turning left and right; and the control uncertainty, such as the fine-grained motion required to perform a turn. We can therefore decompose the probabilistic distribution accordingly by conditioning on targets and then marginalizing over them:

$$p(\mathbf{s}_F \mid \mathbf{x}) = \int_{\tau \in \mathcal{T}(\mathbf{c}_P)} p(\tau \mid \mathbf{x})\, p(\mathbf{s}_F \mid \tau, \mathbf{x})\, d\tau, \tag{1}$$

where $\mathcal{T}(\mathbf{c}_P)$ represents the space of plausible targets depending on the observed context $\mathbf{c}_P$.

Under this formulation, our main insight is that, for applications such as trajectory prediction, by properly designing the target space $\mathcal{T}(\mathbf{c}_P)$ (e.g. target locations), the target distribution $p(\tau \mid \mathbf{x})$ can well capture the intent uncertainty. Once the target is decided, we further demonstrate that the control uncertainty (e.g. trajectories) can be reliably modeled by simple, unimodal distributions.
We approximate the target space $\mathcal{T}(\mathbf{c}_P)$ by a set of discrete locations, turning the estimation of $p(\tau \mid \mathbf{x})$ primarily into a classification task. Compared with latent variational models, our model offers better interpretability in the form of explicit target distributions, and can naturally incorporate expert knowledge (such as road topology) when designing the target space $\mathcal{T}(\mathbf{c}_P)$.

Our overall framework has three conceptual stages. The first stage is target prediction, whose goal is to model the intent uncertainty with a discrete set of target states $\mathcal{T}$ based on the observed context $\mathbf{x}$, and which outputs the target distribution $p(\tau \mid \mathbf{x})$. The second stage is target-conditioned motion estimation, which models the possible future motions from the initial state to the target with a unimodal distribution. The first two stages give rise to the following probabilistic predictions: $p(\mathbf{s}_F \mid \mathbf{x}) = \sum_{\tau \in \mathcal{T}(\mathbf{c}_P)} p(\tau \mid \mathbf{x})\, p(\mathbf{s}_F \mid \tau, \mathbf{x})$.

Many downstream applications, such as real-time behavior prediction, require a small set of representative future predictions rather than the full distribution of all possible futures. Our final stage, scoring and selection, is tailored for this purpose. We learn a scoring function $\phi(\mathbf{s}_F)$ over all representative predictions, and select a final diversified set of predictions.

4 Target-driveN Trajectory Prediction

This section describes our proposed TNT framework in detail. We focus on the task of future trajectory prediction for moving road agents, where both the states and the targets are represented by their physical locations $(x_t, y_t)$. We begin the section by describing how the context information is encoded efficiently. We then present details on how the proposed three stages are adapted to the task. An overview of the TNT model architecture is shown in Figure 2.

Figure 2: TNT model overview. Scene context is first encoded as the model’s inputs. Then follows the core three stages of TNT: (a) target prediction, which proposes an initial set of M targets; (b) target-conditioned motion estimation, which estimates a trajectory for each target; (c) scoring and selection, which ranks trajectory hypotheses and outputs a final set of K predicted trajectories. [Diagram blocks: HD Map & Agents; Vectorization; Context Encoding (VectorNet); Target Prediction (N target candidates to M targets); Motion Estimation (M trajectories); Trajectory Scoring (sort and select K trajectories).]

4.1 Scene context encoding

Modeling scene context is a first step in trajectory prediction so as to capture agent-road and agent-agent interactions. TNT can use any suitable context encoder: when the HD map is available, we use a state-of-the-art hierarchical graph neural network, VectorNet [35], to encode the context. Specifically, polylines are used to abstract the HD map elements $\mathbf{c}_P$ (lanes, traffic signs) and agent trajectories $\mathbf{s}_P$; a subgraph network is applied to encode each polyline, which contains a variable number of vectors; then a global graph is used to model the interactions between polylines. The output is a global context feature $\mathbf{x}$ for each modeled agent. If scene context is only available in the form of top-down imagery, a ConvNet is used as the context encoder.

4.2 Target prediction

In our formulation, targets $\tau$ are defined as the locations $(x, y)$ an agent is likely to be at a fixed time horizon $T$. In the first target prediction stage, we aim to provide a distribution of future targets of an agent $p(\mathcal{T} \mid \mathbf{x})$. We model the potential future targets via a set of $N$ discrete, quantized locations with continuous offsets: $\mathcal{T} = \{\tau^n\} = \{(x^n, y^n) + (\Delta x^n, \Delta y^n)\}_{n=1}^{N}$. The distribution over targets can then be modeled via a discrete-continuous factorization:

$$p(\tau^n \mid \mathbf{x}) = \pi(\tau^n \mid \mathbf{x}) \cdot \mathcal{N}(\Delta x^n \mid \nu_x^n(\mathbf{x})) \cdot \mathcal{N}(\Delta y^n \mid \nu_y^n(\mathbf{x})), \tag{2}$$

where $\pi(\tau^n \mid \mathbf{x}) = \exp f(\tau^n, \mathbf{x}) / \sum_{\tau'} \exp f(\tau', \mathbf{x})$ is a discrete distribution over location choices $(x^n, y^n)$. The terms $\mathcal{N}(\cdot \mid \nu(\cdot))$ denote a generalized normal distribution, where we choose Huber as the distance function. We denote the mean as $\nu(\cdot)$ and assume unit variance.
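As a rough illustration of the discrete part of Eq. (2), the sketch below turns per-candidate scores into the distribution $\pi$ with a softmax, keeps the M most likely candidates, and shifts each by its predicted mean offset. It is a minimal sketch under stated assumptions rather than the authors' implementation: the function names `softmax` and `top_m_targets` are hypothetical, and the logits and offsets are made-up numbers standing in for the outputs of the learned functions $f$ and $\nu$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_m_targets(candidates, logits, offsets, m):
    """Return the m most likely refined targets as (probability, (x, y)).

    candidates: list of (x, y) candidate locations
    logits:     one score f(tau, x) per candidate (assumed given)
    offsets:    one predicted mean offset (dx, dy) per candidate (assumed given)
    """
    probs = softmax(logits)
    # Sort candidate indices by probability, most likely first.
    order = sorted(range(len(candidates)), key=lambda n: probs[n], reverse=True)
    picked = []
    for n in order[:m]:
        (x, y), (dx, dy) = candidates[n], offsets[n]
        picked.append((probs[n], (x + dx, y + dy)))
    return picked


# Toy example with three candidates; every number here is made up.
candidates = [(0.0, 10.0), (5.0, 10.0), (10.0, 0.0)]
logits = [2.0, 0.5, 1.0]
offsets = [(0.2, -0.1), (0.0, 0.0), (-0.3, 0.4)]
targets = top_m_targets(candidates, logits, offsets, m=2)
for prob, (x, y) in targets:
    print(f"p={prob:.3f} target=({x:.1f}, {y:.1f})")
```

Because every candidate is scored independently of the others before the softmax, all N candidates can be evaluated in parallel, which is what lets a deterministic model propose diverse targets without sampling.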
The trainable functions $f(\cdot)$ and $\nu(\cdot)$ are implemented with a 2-layer multilayer perceptron (MLP), with target coordinates $(x^n, y^n)$ and the scene context feature $\mathbf{x}$ as inputs. They predict a discrete distribution over target locations and their most likely offsets. The loss function for training this stage is given by

$$\mathcal{L}_{S1} = \mathcal{L}_{cls}(\pi, u) + \mathcal{L}_{offset}(\nu_x, \nu_y, \Delta x^u, \Delta y^u), \tag{3}$$

where $\mathcal{L}_{cls}$ is cross entropy and $\mathcal{L}_{offset}$ is the Huber loss; $u$ is the target closest to the ground truth location, and $\Delta x^u, \Delta y^u$ are the spatial offsets of $u$ from the ground truth.

The choice of the discrete target space is flexible across different applications, as illustrated in Figure 3. In the vehicle trajectory prediction problem, we uniformly sample points on lane centerlines from the HD map and use them as target candidates (marked as yellow spades), with the assumption that vehicles never depart far away from lanes; for pedestrians, we generate a virtual grid around the agent and use the grid points as target candidates. For each target candidate, the TNT target predictor produces a tuple of $(\pi, \Delta x, \Delta y)$; the regressed targets are marked as orange stars. Compared to direct regression, the most prominent advantage of modeling the future as a discrete set of targets is that it does not suffer from mode averaging, which is the major factor that hampers multimodal predictions.

Figure 3: TNT supports flexible choices of targets. Vehicle target candidate points are sampled from the lane centerlines. Pedestrian target candidate points are sampled from a virtual grid centered on the pedestrian. [Panels: (a) Map-based targets; (b) Grid targets. Legend: lane centerline, target candidate, predicted target.]

In practice, we over-sample a large number of target candidates as input to this stage, e.g. N = 1000, to increase the coverage of the potential future locations; and then keep a smaller number of them as output, e.g. top M = 50, for further processing, as a good choice of M helps to balance between target recall and model efficiency.

4.3 Target-conditioned motion estimation

In the second stage, we model the likelihood of a trajectory given a target as $p(\mathbf{s}_F \mid \tau, \mathbf{x}) = \prod_{t=1}^{T} p(s_t \mid \tau, \mathbf{x})$, again with a generalized normal distribution. This makes two assumptions.
First, future time steps are conditionally independent, which makes our model computationally efficient by avoiding sequential predictions, as is done in [21, 31, 33, 34]. Second, we make the strong but reasonable assumption that the distribution of the trajectories is unimodal (normal) given the target. This is certainly true for short time horizons; for longer time horizons, one could iterate between (intermediate) target prediction and motion estimation so that the assumption still holds.

This stage is implemented with a 2-layer MLP. It takes the context feature $\mathbf{x}$ and a target location $\tau$ as input, and outputs one most likely future trajectory $[\hat{s}_1, \ldots, \hat{s}_T]$ per target. Since it is conditioned on the predicted targets from the first stage, to enable a smooth learning process, we apply a teacher forcing technique [36] at training time by feeding the ground truth location $(x^u, y^u)$ as target. The loss term for this stage is the distance between the predicted states $\hat{s}_t$ and the ground truth $s_t$:

$$\mathcal{L}_{S2} = \sum_{t=1}^{T} \mathcal{L}_{reg}(\hat{s}_t, s_t), \tag{4}$$

where $\mathcal{L}_{reg}$ is implemented as Huber loss over per-step coordinate offsets.

4.4 Trajectory scoring and selection

Our final stage estimates the likelihood of full future trajectories $\mathbf{s}_F$. This differs from the second stage, which decomposes over time steps and targets, and from the first stage, which only has knowledge of targets but not full trajectories: for example, a target might be estimated to have high likelihood, but a full trajectory to reach that target might not.

We use a maximum entropy model to score all the M trajectories from the second stage:

$$\phi(\mathbf{s}_F \mid \mathbf{x}) = \frac{\exp(g(\mathbf{s}_F, \mathbf{x}))}{\sum_{m=1}^{M} \exp(g(\mathbf{s}_F^m, \mathbf{x}))}, \tag{5}$$

where $g(\cdot)$ is modeled as a 2-layer MLP. The loss term for training this stage is the cross entropy between the predicted scores and ground truth scores,

$$\mathcal{L}_{S3} = \mathcal{L}_{CE}(\phi(\mathbf{s}_F \mid \mathbf{x}), \psi(\mathbf{s}_F)), \tag{6}$$

where the ground truth score of each predicted trajectory is defined by its distance to the ground truth trajectory, $\psi(\mathbf{s}_F) = \exp(-D(\mathbf{s}, \mathbf{s}_{GT})/\alpha) / \sum_{\mathbf{s}'} \exp(-D(\mathbf{s}', \mathbf{s}_{GT})/\alpha)$, where $D(\cdot)$ is in meters and $\alpha$ is the temperature. The distance metric is defined as $D(\mathbf{s}^i, \mathbf{s}^j) = \max(\|s_1^i - s_1^j\|_2^2, \ldots, \|s_t^i - s_t^j\|_2^2)$.
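To make the scoring target of Eq. (6) concrete, the following sketch computes the distance $D$, the soft ground truth scores $\psi$, and the cross entropy against a set of predicted scores. It is an illustrative sketch only: the function names are hypothetical, trajectories are assumed to be lists of (x, y) tuples of equal length, and the predicted scores `phi` are made-up numbers rather than the output of a trained $g$.

```python
import math


def max_sq_distance(traj_a, traj_b):
    """D(s^i, s^j): largest per-step squared L2 distance between two trajectories."""
    return max((xa - xb) ** 2 + (ya - yb) ** 2
               for (xa, ya), (xb, yb) in zip(traj_a, traj_b))


def score_targets(predicted, ground_truth, alpha=0.01):
    """psi: softmax of -D / alpha over all predicted trajectories."""
    neg = [-max_sq_distance(p, ground_truth) / alpha for p in predicted]
    peak = max(neg)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in neg]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(phi, psi, eps=1e-12):
    """L_CE(phi, psi) = -sum_m psi_m * log(phi_m); eps guards against log(0)."""
    return -sum(q * math.log(p + eps) for p, q in zip(phi, psi))


# Toy example: one prediction close to the ground truth, one far away.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted = [
    [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2)],
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
]
psi = score_targets(predicted, ground_truth)
phi = [0.9, 0.1]  # made-up predicted scores
loss = cross_entropy(phi, psi)
print(psi, loss)
```

With a small temperature such as the default used here, $\psi$ is close to one-hot on the trajectory nearest to the ground truth, so the cross entropy behaves much like a classification loss over the M hypotheses.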
To obtain the final small set of K predicted trajectories from the scored M trajectories, we implement a trajectory selection algorithm to reject near-duplicate trajectories. We first sort the trajectories according to their score in descending order, and then pick them greedily; if one trajectory is distant enough from all the selected trajectories, we select it as well, otherwise we exclude it. The distance metric used here is the same as for the scoring process. This process is inspired by the non-maximum suppression algorithm commonly used for computer vision problems, such as object detection.

4.5 Training and inference details

The above TNT formulation yields fully supervised end-to-end training, with a total loss function

$$\mathcal{L} = \lambda_1 \mathcal{L}_{S1} + \lambda_2 \mathcal{L}_{S2} + \lambda_3 \mathcal{L}_{S3}, \tag{7}$$

where $\lambda_1, \lambda_2, \lambda_3$ are chosen to balance the training process.

At inference time, TNT works as follows: (1) encode context; (2) sample N target candidates as input to the target predictor, and take the top M targets as estimated by $\pi(\tau \mid \mathbf{x})$; (3) take the MAP trajectory for each of the M targets from the motion estimation model $p(\mathbf{s}_F \mid \tau, \mathbf{x})$; (4) score the M trajectories by $\phi(\mathbf{s}_F \mid \tau, \mathbf{x})$, and select a final set of K trajectories.

5 Experiments

5.1 Datasets

Argoverse forecasting dataset [9] provides trajectory histories, context agents and lane centerlines for future trajectory prediction. There are 333K 5-second long sequences in the dataset. The trajectories are sampled at 10 Hz, with (0, 2] seconds for observation and (2, 5] seconds for future prediction.

INTERACTION dataset [10] focuses on vehicle behavior prediction in highly interactive driving scenarios. It provides 4 different categories of interactive driving scenarios: roundabout (10479 vehicles), un-signalized intersection (14867 vehicles), signalized intersection (10933 vehicles), and merging and lane changing (3775 vehicles).

In-house Pedestrian-at-Intersection dataset (PAID) is an in-house pedestrian dataset collected around crosswalks and intersections. There are around 77K unique pedestrians for training and 12K unique pedestrians for test. The trajectories are sampled at 10 Hz; a 1-second history trajectory is used to predict a 3-second future.
Map features include crosswalks, lane boundaries and stop/yield signs.

Stanford Drone dataset (SDD) [11] is a video dataset with top-down recordings of college campus scenes, collected by drones. The RGB video frames provide context similar to road maps in other datasets. We follow the practice of other literature [2, 16, 37], focusing on pedestrian trajectories only: frames are sampled at 2.5 Hz, 2 seconds of history (5 frames) are used as model input, and 4.8 seconds (12 frames) are the future to be predicted.

5.2 Implementation Details

Context encoding. Following VectorNet [35], we convert the map elements and trajectories into a set of polylines and vectors. Each vector is represented as $[p_s, p_e, f, id_p]$, where $p_s$ and $p_e$ are the start and end point of the vector, $f$ is a feature vector, which can contain feature types like lane state, and $id_p$ is the index of the polyline that the vector belongs to. We normalize the vector coordinates to be centered around the location of the target agent at the last observed time step. After vectorization, VectorNet is used to encode the context of the modeled agent, and its output feature is consumed by TNT. One exception is the Stanford Drone Dataset, which does not offer map data; we therefore use a standard ResNet-50 [38] ConvNet to encode the birds-eye-view imagery for context encoding.

Target candidate sampling. For vehicle trajectory prediction, we sample points as the target candidates from lane centerlines (Argoverse dataset) or lane boundaries (INTERACTION dataset). At least one point is sampled every meter. For pedestrian trajectory prediction, as pedestrians have much larger moving flexibility, we build a rectangular 2D grid (e.g. 10 m × 10 m) around the agent, and the center of each cell (e.g. 1 m × 1 m) is a target candidate.

Model details. The model architectures of all three stages of TNT are 2-layer MLPs, with the number of hidden units set to 64. We set the temperature $\alpha$ in $\psi(\mathbf{s}_F)$ to be 0.01. The loss weights are $\lambda_1 = 0.1$, $\lambda_2 = 1.0$, $\lambda_3 = 0.1$. TNT is trained end-to-end, for approximately 50 epochs with an Adam optimizer [39]. The learning rate is set to 0.001, and the batch size is 128.

Table 1: Performance breakdown after each stage on the Argoverse validation set.
Stage                                minFDE            minADE            Miss Rate @ 2m
                                     M=50     K=6      M=50     K=6      M=50     K=6
S1: Target Prediction                0.533    1.629    -        -        0.027    0.216
S1+S2: Motion Estimation             0.534    1.632    0.488    0.877    0.027    0.216
S1+S2+S3: Traj Scoring & Selection   -        1.292    -        0.728    -        0.093

Metrics. We adopt the widely used Average Displacement Error (ADE) and Final Displacement Error (FDE). To evaluate the ADE and FDE for a set of K predicted trajectories, we use $\mathrm{minADE}_K$ and $\mathrm{minFDE}_K$. The displacements are all measured in meters, except for the Stanford Drone dataset where they are in pixels. On Argoverse, we also report miss rate (MR), which measures the ratio of scenarios where none of the predictions are within 2 meters of the ground truth according to FDE.

5.3 Ablation study

Performance breakdown by stage. We discuss the efficacy of each stage of TNT by tracing the performance on the Argoverse dataset, shown in Table 1. We can see that S1 achieves good target recall as indicated by minFDE and Miss Rate at M = 50; S2 further generates trajectories as evaluated by the minADE metric. The minFDE values of S1 and S2 are almost the same, which confirms that the conditional motion estimation is able to generate trajectories ending at the conditioned targets. Finally, S3 narrows down the number of predictions to K = 6 without much loss compared to M = 50.

Target candidate sampling. The target candidate sampling density has an impact on TNT’s performance, as shown in Table 2 on Argoverse and Table 3 on PAID respectively. For vehicles in Argoverse, we sample targets from lanes, measured as target spacing along the polyline. For pedestrians in PAID, as they have more freedom of movement, we empirically find that grid targets perform much better than map-based targets, and report only grid target results. We observe that denser targets lead to better performance before the saturating point.

Table 2: Comparison of map target sampling density on Argoverse dataset.

target spacing    minFDE_6    minADE_6
target / 5.0 m    1.55        0.79
target / 2.0 m    1.31        0.73
target / 1.0 m    1.29        0.72
target / 0.5 m    1.29        0.72

Table 3: Comparison of grid target sampling density on PAID.

grid size    minFDE_6    minADE_6
2.0 m        0.41        0.22
1.0 m        0.33        0.19
0.5 m        0.32        0.18
0.2 m        0.32        0.18

Target regression.
The comparison between with and without target offset regression in S1 is shown in Table 4. We can see that with regression the performance improves by 0.16 m, which shows the necessity of position refinement from the original target coordinates.

Table 4: Ablation on target offset regression.

                    minFDE_50 @ S1
w/ target reg       0.53
w/o target reg      0.69

Motion estimation methods. For S2 motion estimation, we compare our unimodal Huber regressor with a CVAE regressor which generates multimodal predictions. For CVAE, we vary the number of sampled trajectories between 1 and 10. The results are shown in Table 5. As expected, the two perform similarly with only 1 trajectory. However, even when we increase the number of CVAE sampled trajectories by 10×, it only marginally improves on the minADE metric. This supports our assumption in S2 that the agent motion is unimodal given a target.

Table 5: Ablation on motion estimation methods.

          #traj / target    minADE_6 @ S3
Huber     1                 0.73
CVAE      1                 0.73
CVAE      10                0.71

5.4 Comparison with state-of-the-art

Vehicle prediction benchmarks. For trajectory prediction on vehicles, we compare TNT with the state-of-the-art methods on Argoverse and INTERACTION benchmarks. To make fair comparisons, we re-implement MultiPath [33] and DESIRE [1] by replacing their ConvNet context encoders with VectorNet [35]. As shown in Table 6 and Table 7, TNT outperforms all other methods by large margins and achieves the state-of-the-art performance. Visualizations of the TNT predictions on Argoverse can be found in Figure 4. We further submit our single model TNT results to the Argoverse leaderboard. As shown in the bottom rows in Table 6, TNT performs on par or better than the Argoverse Challenge 2020 winner (its details were undisclosed).

Table 6: Comparison with state-of-the-art methods on Argoverse validation and test set. DESIRE and MultiPath are reimplemented with VectorNet context encoder. Our single model result performs on par or better than the Argoverse Challenge winner on the test set.

                           subset        minFDE_6    minADE_6    Miss Rate @ 2m
DESIRE [1]                 validation    1.77        0.92        0.18
MultiPath [33]             validation    1.68        0.80        0.14
TNT (Ours)                 validation    1.29        0.73        0.09
Jean (Challenge winner)    test          1.42        0.97        0.13
TNT (Ours)                 test          1.54        0.94        0.13
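The minADE, minFDE and miss rate columns reported in the comparison tables can be computed along the following lines. The paper does not spell out the formulas, so this sketch assumes the commonly used definitions (the minimum over the K predictions of the average or final displacement) and assumes that a scenario counts as a miss when its minFDE is strictly greater than the threshold; the function names and the toy trajectories are hypothetical.

```python
import math


def displacement(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def min_ade(predictions, ground_truth):
    """Smallest average per-step displacement over the K predictions."""
    return min(
        sum(displacement(p, g) for p, g in zip(traj, ground_truth)) / len(ground_truth)
        for traj in predictions
    )


def min_fde(predictions, ground_truth):
    """Smallest final-step displacement over the K predictions."""
    return min(displacement(traj[-1], ground_truth[-1]) for traj in predictions)


def miss_rate(scenarios, threshold=2.0):
    """Fraction of (predictions, ground_truth) scenarios whose minFDE exceeds threshold."""
    misses = sum(1 for preds, gt in scenarios if min_fde(preds, gt) > threshold)
    return misses / len(scenarios)


# Toy example with made-up trajectories of three steps each.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds_a = [
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],
    [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)],
]
preds_b = [[(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]]

ade_a = min_ade(preds_a, gt)
fde_a = min_fde(preds_a, gt)
mr = miss_rate([(preds_a, gt), (preds_b, gt)])
print(ade_a, fde_a, mr)
```

Note that minADE and minFDE may be attained by different predictions within the same scenario, since each minimum is taken independently.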
Table 7: Model performance on INTERACTION validation set.

                  minFDE_6    minADE_6
DESIRE [1]        0.88        0.32
MultiPath [33]    0.99        0.30
TNT (Ours)        0.67        0.21

Table 8: Comparison with state-of-the-art methods on PAID.

                  minFDE_3    minADE_3
DESIRE [1]        0.59        0.29
MultiPath [33]    0.43        0.23
TNT (Ours)        0.32        0.18

Pedestrian prediction benchmarks. For trajectory prediction on pedestrians, we compare TNT with state-of-the-art methods on the in-house Pedestrian-At-Intersection Dataset (PAID) and Stanford Drone Dataset (SDD). On the PAID, we sample targets from a grid of range 20 m × 20 m with a grid size of 0.5 m. We enhance DESIRE and MultiPath with VectorNet for context encoding.

On the SDD, since no map data is provided, we crop an image patch with resolution of 800 × 800 around the agent of interest, and use a ResNet-50 to extract context features. We use a grid of range 300 × 300 with a grid size of 6 as targets. As shown in Table 8 and Table 9, TNT outperforms all previous methods and achieves the state-of-the-art performance on both datasets.

Table 9: Comparison with state-of-the-art methods on SDD. Units are pixels.

                    minFDE_5    minADE_5
Social LSTM [16]    56.97       31.19
Social GAN [2]      41.44       27.25
DESIRE [1]          34.05       19.25
SoPhie [37]         29.38       16.27
PECNet [31]         25.98       12.79
TNT (Ours)          21.16       12.23

6 Conclusion

We have presented a novel framework TNT for multimodal trajectory prediction. It consists of three interpretable stages: target prediction, target-conditioned motion estimation, and trajectory scoring. TNT achieves the state-of-the-art performance on four challenging real-world prediction datasets. As future work we plan to extend our framework to long term future prediction by iteratively predicting intermediate targets and trajectories.

Acknowledgments

We would like to thank Anca Dragan for helpful comments.

References

[1] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In CVPR, 2017.
[2] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
[3] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine.
PRECOG: Prediction conditioned on goals in visual multi-agent settings. In ICCV, 2019.
[4] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[5] N. Rhinehart, K. Kitani, and P. Vernaza. R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, 2018.
[6] Y. Yuan and K. Kitani. Diverse trajectory forecasting with determinantal point processes. In ICLR, 2020.
[7] T. Tsubouchi and S. Arimoto. Behavior of a mobile robot navigated by an "iterated forecast and planning" scheme in the presence of multiple moving obstacles. In ICRA, 1994.
[8] A. Broadhurst, S. Baker, and T. Kanade. A prediction and planning framework for road safety analysis, obstacle avoidance and driver information. CMU-RI-TR-04-11, 2004.
[9] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In CVPR, 2019.
[10] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kümmerle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka. INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps. arXiv:1910.03088, 2019.
[11] A. Robicquet, A. Alahi, A. Sadeghian, B. Anenberg, J. Doherty, E. Wu, and S. Savarese. Forecasting social navigation in crowded complex scenes. arXiv:1601.00998, 2016.
[12] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein. The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In ITSC, 2018.
[13] J. Colyar and H. John. US highway 101 dataset. FHWA-HRT-07-030, 2007.
[14] L. Fang, Q. Jiang, J. Shi, and B. Zhou. TPNet: Trajectory proposal network for motion prediction. In CVPR, 2020.
[15] K. M. Kitani, D.-A. Huang, and W.-C. Ma. Activity forecasting. In Group and Crowd Behavior for Computer Vision. 2017.
[16] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[17] D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical Review E, 51(5):4282, 1995.
[18] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani. Forecasting interactive dynamics of pedestrians with fictitious play. In CVPR, 2017.
[19] S. Zheng, Y. Yue, and J. Hobbs. Generating long-term trajectories using deep hierarchical networks. In NeurIPS, 2016.
[20] E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey. Generative multi-agent behavioral cloning. arXiv:1803.07612, 2018.
[21] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In ICRA, 2019.
[22] J. Hong, B. Sapp, and J. Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[23] R. A. Yeh, A. G. Schwing, J. Huang, and K. Murphy. Diverse generation for multi-agent sports games. In CVPR, 2019.
[24] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy. Stochastic prediction of multi-agent interactions from partial observations. In ICLR, 2019.
[25] C. Tang and R. R. Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
[26] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
[27] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.
[28] B. D. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, J. A. Bagnell, M. Hebert, A. K. Dey, and S. Srinivasa. Planning-based prediction for pedestrians. In IROS, 2009.
[29] E. Rehder and H. Kloeden. Goal-directed pedestrian prediction. In ICCV Workshops, 2015.
[30] E. Rehder, F. Wirth, M. Lauer, and C. Stiller. Pedestrian prediction by planning using deep neural networks. In ICRA, 2018.
[31] K. Mangalam, H. Girase, S. Agarwal, K.-H. Lee, E. Adeli, J. Malik, and A. Gaidon. It is not the journey but the destination: Endpoint conditioned trajectory prediction. arXiv:2004.02025, 2020.
[32] S. Casas, W. Luo, and R. Urtasun. IntentNet: Learning to predict intention from raw sensor data. In CoRL, 2018.
[33] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov.
MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In CoRL, 2019.
[34] T. Phan-Minh, E. C. Grigore, F. A. Boulton, O. Beijbom, and E. M. Wolff. CoverNet: Multimodal behavior prediction using trajectory sets. arXiv:1911.10298, 2019.
[35] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In CVPR, 2020.
[36] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1989.
[37] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese. SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. In CVPR, 2019.
[38] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[39] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

Figure 4: Qualitative results on the Argoverse validation set. Lane centerlines are shown in grey, the agent’s past trajectory in blue, and the ground truth future trajectory in light blue. (Left) Top predicted targets, where darker color corresponds to higher scores. (Middle) Trajectory regression conditioned on the targets. (Right) Predicted trajectories after scoring and selection. The examples show TNT predicting a diverse set of vehicle behaviors, among them turning, changing lanes, going straight at different speeds, etc.
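The greedy, non-maximum-suppression style selection of Section 4.4, whose output the right-hand panels of Figure 4 visualize, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the rejection threshold `min_distance` is a made-up parameter, since the paper does not report the value it uses. The distance is the same max-over-steps squared distance used for scoring, so the threshold is expressed in the same units as $D$.

```python
def max_sq_distance(traj_a, traj_b):
    """D(s^i, s^j): largest per-step squared L2 distance between two trajectories."""
    return max((xa - xb) ** 2 + (ya - yb) ** 2
               for (xa, ya), (xb, yb) in zip(traj_a, traj_b))


def select_trajectories(trajectories, scores, k, min_distance):
    """Greedily keep up to k high-scoring trajectories that are mutually distant.

    Returns the indices of the selected trajectories, best score first.
    min_distance is a hypothetical threshold, in the same units as D.
    """
    # Visit trajectories from highest to lowest score.
    order = sorted(range(len(trajectories)), key=lambda m: scores[m], reverse=True)
    selected = []
    for m in order:
        far_enough = all(
            max_sq_distance(trajectories[m], trajectories[j]) >= min_distance
            for j in selected
        )
        if far_enough:
            selected.append(m)
        if len(selected) == k:
            break
    return selected


# Toy example: trajectory 1 nearly duplicates trajectory 0 and is rejected.
trajectories = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.1)],
    [(0.0, 0.0), (1.0, 5.0)],
]
scores = [0.5, 0.4, 0.1]
kept = select_trajectories(trajectories, scores, k=2, min_distance=1.0)
print(kept)
```

A larger threshold yields a more diverse but possibly lower-scoring final set, while a threshold of zero reduces the procedure to plain top-K selection by score.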