KEMP: Keyframe-Based Hierarchical End-to-End Deep Model for Long-Term Trajectory Prediction

Qiujing Lu*,1,2,†, Weiqiao Han*,1,3,†, Jeffrey Ling1, Minfa Wang1, Haoyu Chen1, Balakrishnan Varadarajan1, Paul Covington1
1Waymo, 2UCLA, 3MIT

* Equal contribution.
† Work done during internship at Waymo. Corresponding to weiqiaoh@mit.edu, qiujiing@ucla.edu
This article solely reflects the opinions and conclusions of its authors and not Waymo or any other Waymo entity.

Abstract: Predicting future trajectories of road agents is a critical task for autonomous driving. Recent goal-based trajectory prediction methods, such as DenseTNT and PECNet [1, 2], have shown good performance on prediction tasks on public datasets. However, they usually require complicated goal-selection algorithms and optimization. In this work, we propose KEMP, a hierarchical end-to-end deep learning framework for trajectory prediction. At the core of our framework is keyframe-based trajectory prediction, where keyframes are representative states that trace out the general direction of the trajectory. KEMP first predicts keyframes conditioned on the road context, and then fills in intermediate states conditioned on the keyframes and the road context. Under our general framework, goal-conditioned methods are special cases in which the number of keyframes equals one. Unlike goal-conditioned methods, our keyframe predictor is learned automatically and does not require hand-crafted goal-selection algorithms. We evaluate our model on public benchmarks, and it ranked 1st on the Waymo Open Motion Dataset Leaderboard (as of September 1, 2021).

I. INTRODUCTION

In order for robots to navigate safely in stochastic environments with multiple surrounding moving agents, predicting future trajectories of surrounding agents is a critical task. In the setting of autonomous driving, the road scene is highly complex, consisting of not only static objects, such as traffic lights and road fences, but also dynamic objects, such as vehicles, pedestrians and cyclists; any vehicle could choose to go straight and pass the intersection, or stop before the intersection and wait for the pedestrians to pass, or make turns. Predicting future trajectories of agents in the scene enables several downstream tasks, such as risk assessment of planned trajectories [3] and safe trajectory planning for autonomous vehicles with theoretical guarantees [4, 5].

Fig. 1. Agents in an intersection scenario. Top: Agents with ground truth future trajectories colored in magenta and the agent for the prediction task represented by a cyan box. Bottom: 6 predicted trajectories colored in blue and keyframes annotated with yellow stars.

Due to the dynamic, stochastic, and interactive nature of the environment, predicting future trajectories of agents based on past observations and the traffic scene is quite challenging. Traditional methods use hand-crafted features and manually-designed logic and models to predict trajectories [6, 7, 8], but they require a great deal of manual work and are brittle to edge cases. On the other hand, modern deep learning methods have successfully demonstrated the ability to scale with larger datasets, including work on graph neural networks [9], long short-term memory (LSTM) [10], generative adversarial networks (GAN) [11], variational autoencoders (VAE) [12, 13], flows [14, 15], or transformers [16] to predict trajectories.

Recently, anchor-based and goal-conditioned methods [17, 18, 1, 14] have received much attention as they directly consider the intention of agents and are more interpretable. However, when making long-term predictions (for example, in the Waymo Open Motion Dataset [19], where one needs to predict 8 seconds into the future based on 1 second of past trajectories), only modeling a single high-level goal or intent may not be enough.
For one thing, the goal prediction for long trajectories may not be accurate, and for another, trajectories can vary significantly between the fixed starting and goal points. To address this problem, we draw ideas from the long-term motion planning and hierarchical reinforcement learning literature [20, 21, 22, 23], where in order for the robot to reach a goal far away or accomplish a complex task, a high level model generates subgoals that are easier for the robot to reach, and a low level model generates control inputs that enable the robot to navigate between two subgoals.

In this paper, we propose a hierarchical end-to-end deep learning framework for autonomous driving trajectory prediction: Keyframe MultiPath (KEMP). At the core of our framework is keyframe-based trajectory prediction. In this framework, the model first predicts several keyframes, which are representative states in the trajectory that trace out the general direction of the trajectory, conditioned on the road context. The model then fills in the gaps between keyframes by predicting intermediate states conditioned on the keyframes and the road context. To the best of our knowledge, this is the first time that keyframe-based hierarchical prediction is applied to trajectory prediction for autonomous vehicles.

Our framework is in some sense a generalization of goal-conditioned trajectory prediction models. In particular, goal-conditioned trajectory prediction models, such as TNT [18], DenseTNT [1], and PECNet [2], can be viewed as special cases of keyframe-based trajectory prediction models where the number of keyframes equals 1, but unlike these models, we allow the model to learn to predict keyframes instead of manually selecting goals. Other trajectory prediction models that predict trajectories in one shot without conditioning on the final goal can be viewed as special cases of keyframe-based trajectory prediction models where the number of keyframes equals 0. Our model is not only more general than previous methods but also simpler, as keyframe prediction is learned automatically. Finally, our model achieves state-of-the-art performance in autonomous driving trajectory prediction tasks, ranking 1st on the Waymo Open Dataset Motion Prediction Leaderboard (as of September 1, 2021).

II. RELATED WORK

Latent-variable-sampling-based trajectory prediction. A popular approach for trajectory prediction is sampling from latent variables. DESIRE [12] generates trajectory samples via a conditional VAE-based RNN encoder-decoder. R2P2 [15] and PRECOG [14] use flows to predict agent futures. Social GAN [11] uses recurrent generative adversarial networks to predict future trajectories. These methods require stochastic sampling from latent distributions to produce implicit trajectories. The latent variables are not fully interpretable and hence do not work in combination with external prior knowledge.

Intention-based trajectory prediction. IntentNet [24] predicts intentions of drivers to guide trajectory prediction. It classifies intentions into 8 classes, including keep lane, turn left, turn right, and so on. The method requires a great deal of manual engineering and might miss special cases on large datasets. MultiPath [17] first predicts intents as a set of anchors, then fixes the anchors and learns to predict the residual with respect to the anchors.

Goal-conditioned trajectory prediction. Goal-conditioned trajectory prediction models are a promising way to develop interpretable autonomous vehicle systems. PECNet [2] predicts the goal as a latent variable and predicts the trajectory conditioned on this latent variable. TNT [18] and DenseTNT [1] predict a set of targets directly and then predict trajectories conditioned on the targets. Compared to latent-variable-sampling-based methods and intention-based methods, goal-conditioned methods are more interpretable, because the predicted goal is part of the trajectory instead of a latent variable. Our method can be viewed as a generalization of this line of work, where we predict not only the goal but also other keyframes in the trajectory. Unlike DenseTNT, in which there is a complicated goal-selection algorithm, our method automatically learns to predict keyframes without any hand-crafted engineering.

III. METHOD

Our method consists of three steps. First, we extract the features from the scene and encode them as context using multiple encoders. Second, we predict keyframes of the output trajectory using a keyframe predictor conditioned on the context. Third, we predict intermediate states conditioned on keyframes and the context using the whole trajectory predictor. The whole model is trained end-to-end.

Fig. 2. KEMP architecture. Context features are extracted from scenario inputs with multiple agent historical tracks by multiple encoders. They are then sent to our hierarchical decoders for generating predicted trajectories. The decoder consists of two parts: the keyframe decoder for the generation of keyframe locations and the whole trajectory decoder for producing the final whole trajectory based on the previously decoded keyframe locations and context embeddings. In the predictor, we can feed in a control signal $g(\tilde{X}_{jt} - X_i)$ as a function of the distance to the subgoal.

A. Context Encoding

To encode the road context, previous work uses rasterized encoding methods [25, 17, 26, 27, 28]. This approach renders trajectories of moving agents and road context information as bird's-eye-view images and encodes them with CNNs. Recently, vector-based representations, which represent the road and agents as polylines, have been more effective in capturing the structural features of high-definition maps [29, 18, 1]. We adopt this vector-based sparse encoding representation. Our context encoding mostly follows the methods in MultiPath++ [30].

We use a deep neural network consisting of multiple copies of multi-layer perceptrons (MLPs) and max pooling layers to extract geometric features from road polylines and their connections with each agent. We use PointNet [31, 32] to encode features from 2D points around each agent. Each agent's raw state, including past positions, velocity, and heading, is encoded using an MLP. Interactions between agents are captured by encoding relative positions and speeds between pairs of agents using an MLP and max pooling. All these features are mixed by going through several Multi-Context Gating (MCG) encoders, an efficient mechanism for fusing information [30]. In the end we concatenate all outputs from the MCG encoders and get the context embedding $c$.

B. Keyframe-Based Hierarchical Trajectory Prediction

In the prediction part, given the context $c$, the goal is to predict $N$ trajectories $\ell_1, \dots, \ell_N$. We follow the formulation in MultiPath [17]. Each trajectory $\ell$ is the union of $T$ states, $\ell = \{X_1, \dots, X_T\}$, and each state $X_i$ is the tuple $(\mu_i, \Sigma_i)$, where $\mu_i$ is the expectation of the $(x, y)$ position of the agent at time $i$, and $\Sigma_i$ is the covariance matrix of the position prediction at time $i$.

In our proposed method, the keyframe predictor predicts several keyframes, which are defined as representative states in the trajectory that trace out the general direction of the trajectory, conditioned on the context $c$. In this paper we focus on evenly spaced keyframes. More precisely, suppose $T = kt$, where $T$ is the total number of time steps for the prediction task and $k$, $t$ are two positive integers. Then the keyframe predictor predicts $k$ keyframes $\tilde{X}_t, \tilde{X}_{2t}, \dots, \tilde{X}_{kt}$ conditioned on the context $c$.

We model the keyframes using a joint distribution

$$\tilde{X}_t, \tilde{X}_{2t}, \dots, \tilde{X}_{kt} \sim p(x_t, x_{2t}, \dots, x_{kt} \mid c).$$

We can either use an autoregressive formulation

$$\tilde{X}_{(i+1)t} \sim p(x_{(i+1)t} \mid c, x_t, x_{2t}, \dots, x_{it}), \quad i = 0, \dots, k-1,$$

or assume conditional independence between the keyframes

$$\tilde{X}_{(i+1)t} \sim p(x_{(i+1)t} \mid c), \quad i = 0, \dots, k-1.$$

In the former, an autoregressive predictor can be implemented with an LSTM, and in the latter, a non-autoregressive predictor can be implemented with a single MLP over all time steps.

Given the $k$ keyframes $\tilde{X}_t, \tilde{X}_{2t}, \dots, \tilde{X}_{kt}$, we consider two ways to generate final trajectories.

1) Interpolation Model: For any interval $[\tilde{X}_{it}, \tilde{X}_{(i+1)t}]$, the whole trajectory predictor predicts the states inside the interval, $X_{it+1}, \dots, X_{(i+1)t-1}$, conditioned on $\tilde{X}_{it}$ and $\tilde{X}_{(i+1)t}$, as well as the context $c$. This gives us a complete trajectory

$$X_1, \dots, X_{t-1}, \tilde{X}_t, X_{t+1}, \dots, X_{2t-1}, \tilde{X}_{2t}, X_{2t+1}, \dots, \tilde{X}_{kt}.$$

The predictors predict $N$ trajectories $\ell_1, \dots, \ell_N$. We assign a probability to each trajectory

$$p_i = p(\ell_i \mid c) = \frac{\exp f(\ell_i \mid c)}{\sum_j \exp f(\ell_j \mid c)},$$

where $f(\ell \mid c)$ is implemented by a deep neural network. Therefore, our prediction is a mixture of Gaussian distributions. We impose the negative log-likelihood loss on the predicted trajectory

$$L_{\text{traj}}(\theta) = -\sum_{j=1}^{N} \mathbb{I}(j = r) \left[ \log p(\ell_j \mid c; \theta) + \sum_{i=1}^{T} \log \mathcal{N}(\bar{\mu}_i \mid \mu_i, \Sigma_i; \theta) \right],$$

where $\{\bar{\mu}_1, \dots, \bar{\mu}_T\}$ represents the ground truth trajectory, $\mu_i$ and $\Sigma_i$ are the predicted mean and covariance of trajectory $j$ at time $i$, and $\theta$ represents the parameters to be learned, which are all the weights inside the predictor models implemented by deep neural networks, including the whole trajectory predictor and the probability predictor. $r$ denotes the index of the trajectory closest to the ground truth measured by the $\ell_2$ distance.

2) Separable Model: In the interpolation model, the keyframes in the final trajectory are predicted by the keyframe predictor. The whole trajectory predictor does not predict keyframes. In the separable model, the whole trajectory predictor predicts the intermediate states $X_{it+1}, \dots, X_{(i+1)t}$, including the keyframes, for any interval $[\tilde{X}_{it}, \tilde{X}_{(i+1)t}]$. This gives us a complete trajectory $X_1, \dots, X_T$ predicted by the whole trajectory predictor. As an aside, when generating intermediate states, the whole trajectory predictor could condition on, in addition to the keyframes, some other manually defined control signals, such as a function of the distance to the subgoal $g(\tilde{X}_{jt} - X_i)$; in practice we use $g$ as the identity function.

As in the interpolation model, we impose the negative log-likelihood loss $L_{\text{traj}}(\theta)$ on the trajectories predicted by the whole trajectory predictor. Different from the interpolation model, we also impose the consistency loss on the keyframes predicted by the keyframe predictor and the keyframes predicted by the whole trajectory predictor

$$L_{\text{cons}} = \sum_{i=1}^{k} \| X_{it} - \tilde{X}_{it} \|_2^2.$$

In addition, we impose the negative log-likelihood loss on the keyframes

$$L_{\text{key}}(\theta) = -\sum_{j=1}^{N} \mathbb{I}(j = r) \sum_{i=1}^{k} \log \mathcal{N}(\bar{\mu}_{it} \mid \mu_{it}, \Sigma_{it}; \theta).$$

The total loss function is a weighted sum of the losses above

$$L = L_{\text{traj}} + \alpha L_{\text{cons}} + \beta L_{\text{key}},$$

where $\alpha$ and $\beta$ are weights.

IV. EXPERIMENTS

A. Datasets

We evaluate our method on two large-scale real world datasets, the Argoverse Forecasting Dataset and the Waymo Open Motion Dataset.

Argoverse Forecasting Dataset: The Argoverse Forecasting Dataset [33] includes 324,557 five-second tracked scenarios (2 s for the past and 3 s for the prediction) collected from 1006 driving hours across both Miami and Pittsburgh. Each motion sequence contains the 2D bird's-eye-view centroid of each tracked object sampled at 10 Hz. It covers diverse scenarios such as vehicles at intersections, taking turns, changing lanes, and dense traffic. Only one challenging vehicle trajectory is selected as the focus of the forecasting task in each scenario.

Waymo Open Motion Dataset: The Waymo Open Motion Dataset (WOMD) [34] is by far the largest interactive motion dataset with multiple types of agents: vehicles, pedestrians and cyclists. It consists of 104,000 run segments with over 1,750 km of roadways and 7.64 million unique agent tracks, each of 20 seconds duration and sampled at 10 Hz. Each segment is further broken into 9-second windows (1 s for the past and 8 s of future data) with 5 seconds overlap.

B. Metrics

Given one historical motion, K predictions are output from a model to compare with the ground truth motion. We used both standard metrics and dataset-specific ranking metrics to evaluate our model's performance. The $L_2$ distance between a predicted trajectory and the corresponding ground truth is widely used to quantify the displacement error. For the multiple predictions setting, the minimum average displacement error (minADE) among all predictions is computed for performance comparison among models. Similarly, minFDE is computed as the minimal $L_2$ distance among the predicted trajectories and ground truth at the last time step (endpoint). Besides these two standard metrics, Miss Rate (MR) is additionally evaluated, which is the number of scenarios where none of the predicted trajectories are within a certain distance of the ground truth according to the endpoint error, divided by the total number of predictions. For WOMD, mean Average Precision (mAP) is designed to measure the precision-recall performance of the future predictions as a normalized average over different types of behavior; we use mAP as the primary model metric.

C. Baseline Algorithms

As our keyframe-based model can be viewed as a generalization of goal-conditioned models, we contrast its performance with the currently top ranked goal-conditioned model, DenseTNT, as well as other strong baseline models.

D. Implementation Details

Multiple future predictions: To obtain more diverse candidate trajectories, our model predicts m trajectories, more than the required number of predictions. To make the predictions more robust, we train n models independently and ensemble their predictions. At inference time, among all nm candidate trajectories, the top K trajectories are selected using a non-maximum suppression (NMS) algorithm, where top ranked trajectories (with highest estimated likelihood) are selected greedily while their nearby trajectories are rejected.

Model variants: The keyframe predictor and the whole trajectory predictor both output sequences of states. They can either predict the sequence in one shot, or predict the sequence iteratively in an autoregressive fashion. For both cases, we can use an MLP as the predictor (either one-shot or autoregressive), or an LSTM for the autoregressive case.

Training details: Our model is trained with a batch size of 256 on the WOMD training dataset and a batch size of 128 on Argoverse. We set loss weights $\alpha = 10$, $\beta = 1$ for all models. The network is trained with the Adam optimizer with learning rate $3 \times 10^{-4}$, with an exponential decay of 0.5 every 200k steps for WOMD and 100k steps for Argoverse. Both models are trained on TPU custom hardware accelerators [35] and converged in 3 days on WOMD and 2 days on the Argoverse Dataset.

V. RESULTS

A. Results on benchmarks

Benchmarks on both datasets are listed in Tables I and V.

Waymo Open Motion Dataset: This is a more challenging dataset compared to the Argoverse Dataset due to the longer prediction duration and more complex scenarios. The models are ranked by mAP. Our best models are:

1) KEMP-I-LSTM in Table I: An interpolation model, where the keyframe predictor is implemented by an LSTM, and the whole trajectory predictor is implemented by an MLP. The number of keyframes is 4.
2) KEMP-I-MLP in Table I: An interpolation model, where the keyframe predictor and the whole trajectory predictor are implemented by MLPs. The number of keyframes is 4.
3) KEMP-S in Table I: A separable model, where the keyframe predictor and the whole trajectory predictor are implemented by LSTMs. The number of keyframes is 4.

TABLE I: MODEL PERFORMANCE ON WAYMO OPEN DATASET (LEADERBOARD)

| Model | minADE ↓ | minFDE ↓ | MR ↓ | mAP ↑ | mAP (3s) ↑ | mAP (5s) ↑ | mAP (8s) ↑ |
|---|---|---|---|---|---|---|---|
| DenseTNT (5th) [1] | 1.0387 | 1.5514 | 0.1779 | 0.3281 | 0.4059 | 0.3195 | 0.2589 |
| TVN (4th) | 0.7498 | 1.5840 | 0.1833 | 0.3341 | 0.3888 | 0.3284 | 0.2852 |
| Scene-Transformer (M+NMS) (3rd) | 0.6784 | 1.3762 | 0.1977 | 0.3370 | 0.3984 | 0.3317 | 0.2809 |
| Kraken-NMS (2nd) | 0.7407 | 1.5786 | 0.2074 | 0.3561 | 0.4339 | 0.3591 | 0.2754 |
| MultiPath++ (1st) | 0.5749 | 1.2117 | 0.1475 | 0.3952 | 0.4710 | 0.4024 | 0.3123 |
| KEMP-I-LSTM (ours) | 0.5733 | 1.2088 | 0.1453 | 0.3977 | 0.4729 | 0.4042 | 0.3160 |
| KEMP-I-MLP (ours) | 0.5723 | 1.2048 | 0.1450 | 0.3968 | 0.4683 | 0.4080 | 0.3141 |
| KEMP-S (ours) | 0.5714 | 1.1986 | 0.1453 | 0.3942 | 0.4729 | 0.4018 | 0.3080 |

The first five rows in Table I show the top 5 methods on the Waymo Open Motion Dataset Leaderboard as of September 1st, 2021. KEMP-I-LSTM outperforms baseline models in all metrics. Additionally, the higher values we achieved in the breakdown of mAP at 3 seconds, 5 seconds and 8 seconds indicate the effectiveness of keyframes as guidance for the whole trajectory and demonstrate that our model is able to predict trajectories more accurately in the long-term task. KEMP-S is better than KEMP-I-LSTM in terms of minADE and minFDE, though it has lower mAP.

Fig. 3. Samples from the WOMD dataset. The agent to be predicted is shown in cyan with its ground truth trajectory shown in magenta. 6 predicted trajectories are shown in blue with yellow stars annotating the keyframes. We compare KEMP-I-LSTM with MultiPath. 1st and 3rd rows: MultiPath; 2nd and 4th rows: KEMP-I-LSTM.

Figure 3 shows qualitative results in different scenarios from the WOMD validation set. We look at the four examples in the top two rows in detail. In the first case, both models are able to predict diverse modes (turning left, going straight), while the baseline model fails to predict turning right, which is the agent's actual behavior in the next 8 seconds. In the second case, both models have predicted the correct intent of turning left, but our model has a more natural prediction with keyframes closely aligned to the ground truth. In the third case, our model is able to propose more diverse and reasonable possibilities in the future without missing the mode that the agent actually follows. In the fourth case, although both models have diverse predictions spanning the roadgraph space, our model has more reasonable predictions. Compared to the baseline model, KEMP is able to produce more accurate predictions and recall more diverse modes. We believe these good properties are brought by the design of the keyframe architecture. By focusing on the keypoints first to ease the burden of predicting intermediate points, patterns between the trajectory and the environment are more easily learned.

Argoverse Forecasting Dataset: Table V shows several popular methods on the Argoverse Dataset Leaderboard, including TNT, DenseTNT, LaneRCNN, and PRIME. Our model achieves lower minADE and minFDE compared to DenseTNT. In general, we achieve the lowest minADE and second-lowest minFDE among all baseline models, which indicates that KEMP is able to produce realistic trajectories that are very close to the ground truth. However, as the trajectories in Argoverse are fairly short (3 seconds of future), our keyframe model does not have a significant advantage over other models.

TABLE V: MODEL PERFORMANCE ON ARGOVERSE DATASET (LEADERBOARD)

| Model | minADE ↓ | minFDE ↓ | MR ↓ |
|---|---|---|---|
| TNT [18] | 0.94 | 1.54 | 13.3% |
| LaneRCNN [9] | 0.90 | 1.45 | 12.3% |
| SenseTime AP | 0.87 | 1.36 | 12.0% |
| Poly | 0.87 | 1.47 | 12.0% |
| PRIME [36] | 1.22 | 1.56 | 11.5% |
| DenseTNT [1] | 0.94 | 1.49 | 10.5% |
| KEMP-I-LSTM | 0.85 | 1.38 | 12.9% |

B. Ablation studies

First, we compare KEMP against non-keyframe models on the validation set of the Waymo Open Dataset. As shown in Table II, KEMP-I-MLP has the highest mAP on the validation set, though it also has the highest minADE and minFDE. We also run LSTM and MLP models without keyframe prediction as baselines, and observe that their mAP is more than 1% lower, in relative terms, than that of the KEMP models. This suggests that keyframes have a positive effect on model quality.

TABLE II: ABLATION STUDY ON WAYMO OPEN DATASET (VALIDATION SET)

| Model | minADE ↓ | minFDE ↓ | MR ↓ | mAP ↑ | mAP (3s) ↑ | mAP (5s) ↑ | mAP (8s) ↑ |
|---|---|---|---|---|---|---|---|
| KEMP-I-LSTM | 0.5718 | 1.2061 | 0.1470 | 0.3881 | 0.4735 | 0.3904 | 0.3004 |
| KEMP-I-MLP | 0.5758 | 1.2164 | 0.1487 | 0.3922 | 0.4780 | 0.3995 | 0.2991 |
| LSTM (no keyframes) | 0.5724 | 1.2099 | 0.1482 | 0.3837 | 0.4676 | 0.3879 | 0.2955 |
| MLP (no keyframes) | 0.5736 | 1.2157 | 0.1493 | 0.3828 | 0.4656 | 0.3892 | 0.2935 |

Second, we ablate losses from the KEMP-S model in Table III. The fluctuation of mAP among different models is minor: less than 0.3%. The reason might be that the consistency loss and the keyframe loss are complementary given the whole trajectory loss, so getting rid of either does not affect the model much. Even after removing both losses, the keyframe predictor can still learn certain features or latent keyframes, because the whole trajectory predictor predicts segments conditioned on the output of the keyframe predictor.

TABLE III: ABLATION STUDY ON WAYMO OPEN DATASET (VALIDATION SET)

| Model | minADE ↓ | minFDE ↓ | MR ↓ | mAP ↑ | mAP (3s) ↑ | mAP (5s) ↑ | mAP (8s) ↑ |
|---|---|---|---|---|---|---|---|
| KEMP-S | 0.5691 | 1.1993 | 0.1458 | 0.3940 | 0.4791 | 0.3959 | 0.3071 |
| KEMP-S without $L_{\text{cons}}$ loss | 0.5698 | 1.2021 | 0.1476 | 0.3949 | 0.4785 | 0.4019 | 0.3043 |
| KEMP-S without $L_{\text{key}}$ loss | 0.5710 | 1.2074 | 0.1467 | 0.3955 | 0.4783 | 0.4009 | 0.3074 |
| KEMP-S without $L_{\text{cons}}$ and $L_{\text{key}}$ losses | 0.5723 | 1.2103 | 0.1484 | 0.3942 | 0.4801 | 0.4018 | 0.3008 |

Finally, we vary the number of keyframes in the KEMP-S model as shown in Table IV. Note that our keyframes are equally spaced, so when the number of keyframes is 1, the model becomes goal-conditioned and hence resembles TNT. We observe that 2 keyframes attain the best recall, as the minADE, minFDE, and MR metrics are best. However, the mAP metrics are generally better with 8 keyframes. This indicates that there is a tradeoff between fewer keyframes, which may increase diversity at the cost of precision, and more keyframes, which provide finer granularity and hence better precision. With too many keyframes, the model may not be taking advantage of the hierarchical structure of the trajectory prediction problem: when we go up to 40 keyframes, for example, metrics become worse. Therefore, depending on whether we care more about precision or recall, the keyframe number can be tuned accordingly.

TABLE IV: EFFECT OF NUMBER OF KEYFRAMES ON WOMD (VALIDATION SET)

| Number of keyframes | minADE ↓ | minFDE ↓ | MR ↓ | mAP ↑ | mAP (3s) ↑ | mAP (5s) ↑ | mAP (8s) ↑ |
|---|---|---|---|---|---|---|---|
| 0 | 0.5724 | 1.2099 | 0.1482 | 0.3837 | 0.4676 | 0.3879 | 0.2955 |
| 1 | 0.5720 | 1.2117 | 0.1466 | 0.3945 | 0.4781 | 0.4019 | 0.3034 |
| 2 | 0.5678 | 1.1993 | 0.1454 | 0.3881 | 0.4726 | 0.3921 | 0.2995 |
| 4 | 0.5691 | 1.1993 | 0.1458 | 0.3940 | 0.4791 | 0.3959 | 0.3071 |
| 8 | 0.5715 | 1.2082 | 0.1490 | 0.3963 | 0.4800 | 0.3977 | 0.3110 |
| 16 | 0.5735 | 1.2136 | 0.1501 | 0.3894 | 0.4742 | 0.3927 | 0.3012 |
| 40 | 0.5737 | 1.2164 | 0.1487 | 0.3915 | 0.4780 | 0.3959 | 0.3006 |

VI. CONCLUSION AND FUTURE WORK

In this paper, we proposed a keyframe-based hierarchical end-to-end deep model for long-term trajectory prediction. Our framework generalizes goal-based trajectory prediction methods. Our predictors are automatically learned and do not require hand-crafted algorithms. Our model achieved state-of-the-art performance on the Waymo Open Motion Dataset. Future work could try more complicated structures for the keyframe predictor and the whole trajectory predictor for better performance. Another important direction could be a different definition of keyframes. Currently in our model the keyframes are evenly-spaced states. One could try unevenly-spaced states as keyframes.

REFERENCES

[1] J. Gu, C. Sun, and H. Zhao, “DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15303–15312.
[2] K. Mangalam, H. Girase, S. Agarwal, K.-H. Lee, E. Adeli, J. Malik, and A. Gaidon, “It is not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction,” in European Conference on Computer Vision. Springer, 2020, pp. 759–776.
[3] A. Wang, X. Huang, A. Jasour, and B. Williams, “Fast Risk Assessment for Autonomous Vehicles Using Learned Models of Agent Futures,” Robotics: Science and Systems (RSS), 2020.
[4] T. Lew, R. Bonalli, and M. Pavone, “Chance-constrained Sequential Convex Programming for Robust Trajectory Optimization,” in 2020 European Control Conference (ECC). IEEE, 2020, pp. 1871–1878.
[5] A. Jasour, W. Han, and B. Williams, “Convex Risk Bounded Continuous-Time Trajectory Planning in Uncertain Nonconvex Environments,” Robotics: Science and Systems (RSS), 2021.
[6] N. Deo, A. Rangesh, and M. M. Trivedi, “How would surround vehicles move? A Unified Framework for Maneuver Classification and Motion Prediction,” IEEE Transactions on Intelligent Vehicles, vol. 3, no. 2, pp. 129–140, 2018.
[7] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, “Who are you with and where are you going?” in 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011. IEEE Computer Society, 2011, pp. 1345–1352.
[8] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani, “Forecasting Interactive Dynamics of Pedestrians with Fictitious Play,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 774–782.
[9] W. Zeng, M. Liang, R. Liao, and R. Urtasun, “LaneRCNN: Distributed Representations for Graph-Centric Motion Forecasting,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 532–539.
[10] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
[11] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2255–2264.
[12] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.
[13] Y. Yuan and K. M. Kitani, “Diverse Trajectory Forecasting with Determinantal Point Processes,” in International Conference on Learning Representations, 2019.
[14] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “PRECOG: Prediction Conditioned on Goals in Visual Multi-Agent Settings,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2821–2830.
[15] N. Rhinehart, K. M. Kitani, and P. Vernaza, “R2P2: A ReparameteRized Pushforward Policy for Diverse, Precise Generative Path Forecasting,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 772–788.
[16] J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H.-T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal et al., “Scene Transformer: A unified multi-task model for behavior prediction and planning,” arXiv preprint arXiv:2106.08417, 2021.
[17] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction,” in Conference on Robot Learning. PMLR, 2020, pp. 86–99.
[18] H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen, Y. Shen, Y. Chai, C. Schmid et al., “TNT: Target-driveN Trajectory Prediction,” in Conference on Robot Learning. PMLR, 2021, pp. 895–904.
[19] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou et al., “Large Scale Interactive Motion Forecasting for Autonomous Driving: The WAYMO OPEN MOTION DATASET,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9710–9719.
[20] A. Wang, T. Kurutach, K. Liu, P. Abbeel, and A. Tamar, “Learning Robotic Manipulation through Visual Planning and Acting,” in Robotics: Science and Systems, 2019.
[21] S. Nair and C. Finn, “Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation,” in International Conference on Learning Representations, 2019.
[22] T. Kurutach, A. Tamar, G. Yang, S. J. Russell, and P. Abbeel, “Learning Plannable Representations with Causal InfoGAN,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[23] O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient Hierarchical Reinforcement Learning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[24] S. Casas, W. Luo, and R. Urtasun, “IntentNet: Learning to Predict Intention from Raw Sensor Data,” in Conference on Robot Learning. PMLR, 2018, pp. 947–956.
[25] M. Bansal, A. Krizhevsky, and A. Ogale, “ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst,” arXiv preprint arXiv:1812.03079, 2018.
[26] F.-C. Chou, T.-H. Lin, H. Cui, V. Radosavljevic, T. Nguyen, T.-K. Huang, M. Niedoba, J. Schneider, and N. Djuric, “Predicting Motion of Vulnerable Road Users using High-Definition Maps and Efficient ConvNets,” in 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020, pp. 1655–1662.
[27] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, “Multimodal Trajectory Predictions for Autonomous Driving using Deep Convolutional Networks,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2090–2096.
[28] J. Hong, B. Sapp, and J. Philbin, “Rules of the Road: Predicting Driving Behavior with a Convolutional model of Semantic Interactions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8454–8462.
[29] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, “VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11525–11533.
[30] B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov et al., “Multipath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction,” arXiv preprint arXiv:2111.14973, 2021.
[31] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
[32] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[33] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3D Tracking and Forecasting with Rich Maps,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8748–8757.
[34] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in Perception for Autonomous Driving: Waymo Open Dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[35] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter Performance Analysis of a Tensor Processing Unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[36] H. Song, D. Luan, W. Ding, M. Y. Wang, and Q. Chen, “Learning to Predict Vehicle Trajectories with Model-based Planning,” in Conference on Robot Learning. PMLR, 2022, pp. 1035–1045.

arXiv:2205.04624v1 [cs.CV] 10 May 2022
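As a rough illustration of how the separable-model objective in Section III-B fits together (the trajectory loss, the consistency loss and the keyframe loss, weighted by α and β), the following pure-Python sketch evaluates the total loss for a single example. It is not the authors' implementation. The function names, the nested-list data layout, the choice to measure the consistency term on the predicted means of the selected mode only, and the use of summed per-step L2 distance to pick the closest mode r are all assumptions made for illustration.

```python
import math


def gaussian_logpdf_2d(x, mu, cov):
    """Log density of a 2D Gaussian N(x | mu, cov), cov given as ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Mahalanobis term via the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * maha


def log_softmax(scores):
    """Numerically stable log of p_j = exp(f_j) / sum_m exp(f_m)."""
    top = max(scores)
    lse = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - lse for s in scores]


def closest_mode(traj_means, gt):
    """Index r of the predicted trajectory closest to the ground truth.

    Assumption: closeness is the sum of per-step L2 distances.
    """
    def total_dist(traj):
        return sum(math.dist(p, q) for p, q in zip(traj, gt))
    return min(range(len(traj_means)), key=lambda j: total_dist(traj_means[j]))


def kemp_separable_loss(scores, traj_means, traj_covs, key_means, key_covs,
                        gt, t, alpha=10.0, beta=1.0):
    """Total loss L = L_traj + alpha * L_cons + beta * L_key for one example.

    scores[j]        : raw score f(l_j | c) of trajectory j (N trajectories)
    traj_means[j][i] : whole-trajectory predictor mean at step i + 1 (T steps)
    traj_covs[j][i]  : matching 2x2 covariance
    key_means[j][i]  : keyframe predictor mean for step (i + 1) * t (k keyframes)
    key_covs[j][i]   : matching 2x2 covariance
    gt[i]            : ground-truth (x, y) at step i + 1
    """
    T = len(gt)
    k = T // t  # the text assumes T = k * t
    r = closest_mode(traj_means, gt)  # only the closest mode is penalised
    key_steps = [(i + 1) * t - 1 for i in range(k)]  # zero-based t, 2t, ..., kt

    l_traj = -(log_softmax(scores)[r] + sum(
        gaussian_logpdf_2d(gt[i], traj_means[r][i], traj_covs[r][i])
        for i in range(T)))
    # Assumption: consistency is measured on the means of the selected mode only.
    l_cons = sum(math.dist(traj_means[r][s], key_means[r][i]) ** 2
                 for i, s in enumerate(key_steps))
    l_key = -sum(gaussian_logpdf_2d(gt[s], key_means[r][i], key_covs[r][i])
                 for i, s in enumerate(key_steps))
    return l_traj + alpha * l_cons + beta * l_key


# Toy example: N = 2 modes, T = 4 steps, keyframes every t = 2 steps.
IDENTITY = ((1.0, 0.0), (0.0, 1.0))
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
traj_means = [gt, [(x, y + 1.0) for x, y in gt]]
traj_covs = [[IDENTITY] * 4, [IDENTITY] * 4]
key_means = [[(2.0, 0.0), (4.0, 0.0)], [(2.0, 1.0), (4.0, 1.0)]]
key_covs = [[IDENTITY] * 2, [IDENTITY] * 2]
loss = kemp_separable_loss([0.0, 0.0], traj_means, traj_covs,
                           key_means, key_covs, gt, t=2)
print(round(loss, 4))
```

In the toy example the first mode matches the ground truth exactly, so the consistency term vanishes and only the Gaussian normalization constants and the log-probability of the selected mode remain. The default weights mirror the α = 10, β = 1 setting reported in the training details.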