Narrowing the coordinate-frame gap in behavior prediction models: Distillation for efficient and accurate scene-centric motion forecasting

DiJia (Andy) Su¹, Bertrand Douillard², Rami Al-Rfou², Cheolho Park², Benjamin Sapp²

¹Princeton University, ²Waymo LLC

Abstract—Behavior prediction models have proliferated in recent years, especially in the popular real-world robotics application of autonomous driving, where representing the distribution over possible futures of moving agents is essential for safe and comfortable motion planning. In these models, the choice of coordinate frames to represent inputs and outputs has crucial trade-offs which broadly fall into one of two categories. Agent-centric models transform inputs and perform inference in agent-centric coordinates. These models are intrinsically invariant to translation and rotation between scene elements, are best-performing on public leaderboards, but scale quadratically with the number of agents and scene elements. Scene-centric models use a fixed coordinate system to process all agents. This gives them the advantage of sharing representations among all agents, offering efficient amortized inference computation which scales linearly with the number of agents. However, these models have to learn invariance to translation and rotation between scene elements, and typically underperform agent-centric models. In this work, we develop knowledge distillation techniques between probabilistic motion forecasting models, and apply these techniques to close the gap in performance between agent-centric and scene-centric models. This improves scene-centric model performance by 13.2% on the public Argoverse benchmark, 7.8% on the Waymo Open Dataset, and up to 9.4% on a large In-House dataset. These improved scene-centric models rank highly in public leaderboards and are up to 15 times more efficient than their agent-centric teacher counterparts in busy scenes.

Fig. 1. Approach overview. On the left, the teacher (agent-centric) model is repeatedly and independently applied to each agent in the scene, with all model inputs and outputs represented in each agent's ego-centric coordinate frame. On the right, the student (scene-centric) model is applied to the whole scene once, without requiring repeated computations per agent. While faster, a scene-centric formulation tends to be less accurate, since it also has to understand and model the per-agent invariance that is otherwise built into the agent-centric approach. To get the computational efficiency of a scene-centric approach and yet benefit from the accuracy of an agent-centric approach, we propose a knowledge distillation approach that uses the predicted trajectories of the agent-centric (teacher) model to train a scene-centric (student) model.

I. INTRODUCTION AND RELATED WORK

Predicting the future behavior of multiple vehicle, cyclist, and pedestrian agents in real-world driving scenes is a difficult but essential task for safe and comfortable motion planning for autonomous vehicles. This task is typically referred to as "motion forecasting" or "behavior prediction". It is challenging for a number of reasons. (1) The world state is heterogeneous, consisting of static and dynamic road network elements and dynamic agent state observations. (2) The outcomes depend heavily on multi-agent interactions. (3) The output distribution over possible futures is inherently uncertain and highly multi-modal due to latent agent intents. How to represent the input world state, interactions, and output distributions are all open questions and active areas of research.

In the last few years, there has been a proliferation of behavior prediction systems which address these modeling challenges, fueled by both the compelling promise of the autonomous vehicle industry and public benchmarks to compare methods [1]–[4]. One of the most interesting design choices, and the focus of this paper, is that of the coordinate frames used to represent input and output data. There are two distinct common choices, each with advantages and disadvantages.
Agent-centric models represent inputs and internal state in agent-centric coordinates, and perform inference reasoning in this frame³. The coordinates of road elements (e.g., lanes, crosswalks) and other agents' states are described relative to the agent's pose, thus the representation is inherently invariant to the global position and orientation of the agent. This can be considered a form of feature pre-processing that allows models to specialize to an agent's point of view, and in practice it results in state-of-the-art performance on public benchmarks [5]–[9]. A key downside, however, becomes apparent when modeling many agents in a scene: each agent is modeled independently, thus computation is typically linear in the number of agents, and quadratic when modeling interactions [10]–[16]: for a scene with n agents and m road elements, the computation scales as O(n(n+m)). This is not an issue for public benchmarks, which require modeling fewer than ten agents at once [1]–[4], but it is a computational bottleneck for busy real-world urban environments consisting of hundreds of agents.

³Without loss of generality, an agent-centric frame transforms world coordinates so that the origin is set to the ego-agent's center, and rotated so that the agent's heading direction is the unit vector (x, y) = (1, 0).
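To make the per-agent frame of footnote 3 concrete, the following minimal Python/NumPy sketch (ours, for illustration only; array shapes and names are assumptions, not the paper's implementation) applies the world-to-agent transform to a set of scene points. An agent-centric model repeats this preprocessing once per agent, which is the source of the O(n(n+m)) input cost discussed above.

```python
import numpy as np

def to_agent_frame(points, agent_xy, agent_heading):
    """Transform world-frame 2D points into one agent's frame.

    Per footnote 3: translate so the agent's center becomes the
    origin, then rotate so the agent's heading maps to (1, 0).

    points:        (P, 2) world-frame (x, y) coordinates.
    agent_xy:      (2,) agent center in the world frame.
    agent_heading: heading angle in radians in the world frame.
    """
    c, s = np.cos(-agent_heading), np.sin(-agent_heading)
    rot = np.array([[c, -s], [s, c]])     # rotation by -heading
    return (points - agent_xy) @ rot.T

# An agent-centric model repeats this for all n agent tracks and m road
# elements, once per agent, hence O(n(n+m)) transformed inputs.
scene_points = np.random.randn(100, 2)    # toy stand-in scene elements
agents = np.random.randn(8, 3)            # toy (x, y, heading) per agent
per_agent_inputs = [to_agent_frame(scene_points, a[:2], a[2]) for a in agents]
```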
Scene-centric models, on the other hand, do the bulk of world state encoding in a shared, fixed frame for all agents⁴. Models that operate in this frame are typically "top-down" or "bird's-eye-view" representations which discretize the world into spatial grid cells and apply a convolutional neural net (CNN) backbone to encode the scene [17]–[23], although non-raster scene-centric approaches also exist [24]–[26]. After such processing, the prediction head of these models decodes trajectories in the agent frame after a global-to-local transformation. A salient advantage of this formulation is that computation is primarily a function of the spatial grid resolution and field of view, rather than the number of agents: a spatial grid of H × W cells scales as O(HW + n), where the first term is processed with a CNN and dominates the second term in practical settings; see fig. 2 for quantification. The downsides of this formulation are (1) loss of information when discretizing the world state into a raster format, (2) difficulty in modeling long-range interactions with CNNs, and (3) the model must either learn rotation/translation invariance or learn to perform a global-to-local transformation for each agent when decoding.

⁴Without loss of generality, this can be an arbitrary conceptual center of the scene elements.

In brief, agent-centric models outperform scene-centric models, as borne out in public leaderboards, and likely explainable by the shortcomings described above. However, scene-centric models are compelling due to amortized sub-linear scaling with respect to the number of agents in the scene, which is particularly relevant in dense urban environments. In this paper, we propose a novel knowledge distillation method to narrow the gap in performance between these two different modeling approaches.

Knowledge distillation [27] is a popular and effective machine learning technique in domains like computer vision and natural language processing for transferring knowledge from a large model (the "teacher model") to a smaller one (the "student model"). The knowledge transfer mechanism, originally proposed for classification tasks, replaces training data groundtruth ("hard labels") with predictions from the teacher model ("soft labels"). The intuition is that these soft labels form a more information-rich, smooth target space for the student model to learn from than the original data [28], [29].

Distillation has been extended beyond classification to sequence prediction tasks like Neural Machine Translation [30]. To our knowledge, however, distillation has never been applied to the domain of behavior prediction / motion forecasting. Although behavior prediction can be considered a sequence problem, a key difference is that we wish the predicted future distributions to cover the entire space of outcomes accurately, in contrast to the typical NLP task, which aims to generate a single realistic output. Hence transferring knowledge between a teacher and a student for motion forecasting is an open problem. Furthermore, for motion forecasting where the future is represented as a distribution of trajectories covering intent modes (the approach we adopt here, and the most common one, e.g. [20], [31]–[33]), trajectory and mode diversity is crucial. One key challenge then is that distillation could be detrimental to diversity; this was investigated in [34] in the NLP domain.

In this work, we develop and empirically validate a variety of distillation approaches for behavior prediction. We then apply these techniques by setting the teacher to be a high-performance agent-centric model, and transfer knowledge to an efficient student scene-centric model. In doing so we significantly improve the performance of our scene-centric model while maintaining its computational efficiency benefits.

Contributions. The contributions of this paper are as follows:
• We systematically analyze latency and quality of agent-centric and scene-centric model approaches in a common framework. This supports our characterization of the coordinate-frame modeling choices in an empirically rigorous way.
• We are the first to develop and apply knowledge distillation techniques to the popular field of behavior prediction modeling.
• Applying our best distillation approach gives a remarkable boost in performance to our efficient scene-centric models on several large autonomous vehicle future prediction datasets. Compared to the non-distilled student model baseline, distillation improves performance by 13.2% on the Argoverse dataset, 7.8% on the Waymo Open Motion dataset, and up to 9.4% on key metrics of our In-House dataset.

II. BACKGROUND

Definition of the prediction problem. Let x be the observations of all agents in the scene (in the form of past trajectories) and additional contextual information (such as lane semantics and traffic light states), t be the discrete time step, and s_t be the state of an agent at time t. The future trajectory s = [s_1, ..., s_T] is the sequence of states of the agent up to time T. We assume our model predicts K trajectories, where each trajectory is a sequence of predicted states s^k = [s^k_1, ..., s^k_T].

For both agent-centric and scene-centric approaches, we consider the class of models whose output is a Gaussian distribution around each predicted trajectory:

φ(s^k_t | x) = N(s^k_t | μ^k_t(x), Σ^k_t(x))     (1)

where μ^k_t is the mean and Σ^k_t is the covariance of the Normal distribution. The mean and the covariance are learnt parameters. The mean represents the mode of the distribution, which is the most likely state at time t.

We also model a probabilistic distribution over the predicted trajectories, which can be interpreted as the "confidence" in each predicted trajectory: π(s^k | x) = e^{f_k(x)} / Σ_i e^{f_i(x)}, where f_k(x): R^{d(x)} → R is parameterized by a neural network.

Thus, combining the two elements above, we obtain a Gaussian Mixture Model (GMM) distribution:

p(s | x) = Σ_{k=1}^{K} π(s^k | x) Π_{t=1}^{T} φ(s_t | s^k, x)     (2)

This makes the simplifying assumption that time steps are conditionally independent given a history of world state, allowing us to use an efficient feed-forward neural network. A typical number of output trajectories is on the order of K = 10. This type of output representation is a fairly popular approach, as in [20], [31], [35], [36].
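The following minimal NumPy sketch (ours; it assumes diagonal covariances and toy shapes, which the paper does not specify) evaluates the GMM output distribution of eqns. (1)–(2) for a single agent.

```python
import numpy as np

def gmm_log_likelihood(s, mu, log_sigma, logits):
    """log p(s|x) of eqn (2) for one agent, with diagonal covariances.

    s:         (T, 2) trajectory to evaluate.
    mu:        (K, T, 2) predicted means mu_t^k(x).
    log_sigma: (K, T, 2) log std devs (diagonal Sigma_t^k(x)).
    logits:    (K,) mode scores f_k(x); pi = softmax(logits) is the
               confidence over trajectories.
    """
    log_pi = logits - np.logaddexp.reduce(logits)      # log softmax
    z = (s[None] - mu) / np.exp(log_sigma)
    # Per-mode log prod_t N(s_t | mu_t^k, Sigma_t^k): sum over t and dims.
    log_phi = (-0.5 * z ** 2 - log_sigma
               - 0.5 * np.log(2 * np.pi)).sum(axis=(1, 2))
    return np.logaddexp.reduce(log_pi + log_phi)       # log sum_k pi_k phi_k

K, T = 10, 80                                          # e.g. K = 10 modes
print(gmm_log_likelihood(np.random.randn(T, 2), np.random.randn(K, T, 2),
                         np.zeros((K, T, 2)), np.random.randn(K)))
```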
A. Teacher model: agent-centric model (ACM)

We use an agent-centric coordinate frame model (ACM) to serve as the teacher. The agent-centric model encodes, processes, and reasons about the world from each individual agent's point of view. This representation requires a transformation of all scene information from the global coordinate frame into the agent's frame. Because of this, with the agent-centric approach, inference time and memory requirements increase with the number of agents.

Our ACM architecture is inspired by some of the best-performing design choices in the literature. It consumes the following four types of input: road graph information, traffic light information, motion history (i.e., agent state history), and agent interactions. For the road graph information, the ACM uses polylines to encode the road elements from a 3D high-definition map with an MLP (multi-layer perceptron), similar to [6], [12], [25]. For traffic light information, the ACM uses a separate LSTM as the encoder. For the motion history, the ACM uses an LSTM to handle the sequence of past observations, and the last iteration of the hidden state is used as the history embedding, as in [10], [16], [37], [38], to name a few. For agent interactions, we use an LSTM to encode the neighbors' motion history in an agent-centric frame, and aggregate all neighbors' information via max-pooling to arrive at a single interaction embedding. This is a simple form of fully-connected neighbor interaction modeling; other works have explicitly used GNNs [14], [25] and/or attention or max-pooling [6], [11], [13], [16]. Finally, these four encodings are concatenated together to create an embedding for each agent in the agent-centric coordinate frame. This final embedding is converted into a GMM (eqn. 2) using an MLP-based decoder.
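As a concrete illustration of the interaction aggregation and the final concatenation just described, here is a hedged NumPy sketch; the encoder outputs are stand-in random vectors and the embedding width D = 64 is our assumption, since the real model uses MLP and LSTM encoders.

```python
import numpy as np

D = 64                                     # assumed embedding width

def aggregate_interactions(neighbor_embeddings):
    """Max-pool per-neighbor embeddings into a single interaction
    vector, the ACM's fully-connected interaction aggregation.
    neighbor_embeddings: (num_neighbors, D)."""
    return neighbor_embeddings.max(axis=0)

# Stand-ins for the four ACM encoders (the real model uses an MLP over
# road polylines and LSTMs for traffic lights, motion history, and
# neighbors; random vectors here only illustrate the data flow).
road_graph_enc    = np.random.randn(D)
traffic_light_enc = np.random.randn(D)
history_enc       = np.random.randn(D)
interaction_enc   = aggregate_interactions(np.random.randn(12, D))

# The four encodings are concatenated into the per-agent embedding that
# the MLP decoder maps to GMM parameters (eqn. 2).
agent_embedding = np.concatenate(
    [road_graph_enc, traffic_light_enc, history_enc, interaction_enc])
assert agent_embedding.shape == (4 * D,)
```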
B. Student model: scene-centric model (SCM)

For the student we use a scene-centric coordinate frame model (SCM). In our SCM architecture, the input data is represented in a global coordinate frame that is shared across all agents. As mentioned above, one of the benefits of this formulation is that the scene can be processed as a whole, resulting in efficient inference which is invariant to the number of agents.

The SCM consumes three types of inputs. It takes in road information represented as points augmented with semantic attributes, agent information in the form of points sampled from each agent's oriented box, and traffic light information, also represented as points augmented with semantic attributes. The SCM encodes all these input points with a PointPillars encoder [39] followed by a 2D convolutional backbone [40]. A final per-agent embedding is extracted by cropping a patch out of the feature map at the location that maps to the current location of the agent in the scene, as in [20]. Note that even though we end up with a per-agent embedding, all of the upstream processing is done for the full scene at once. The final per-agent embedding is transformed into a GMM (eqn. 2) using an MLP-based decoder, as for the ACM.

Fig. 2 provides inference speed comparisons between the SCM and the ACM. As shown in the figure, the inference speed difference gets progressively larger as the number of agents in the scene increases, showing that the ACM does not scale well.

Fig. 2. Inference speed comparison between ACM (teacher) and SCM (student).

Despite SCMs' fast inference speed, we observe that they underperform ACMs in general. We see this trend in public leaderboards, where agent-centric models tend to dominate (see sec. IV). We also see it directly when comparing the ACM and SCM architectures described in this section. To get the best of both worlds (fast inference speed and good prediction accuracy), we now discuss using knowledge distillation from a slower but accurate teacher (ACM) to improve a faster but less accurate student (SCM).

Learning Objective. Let the training data be of the form {x^m, ŝ^m}_{m=1}^{M}, with ŝ^m the groundtruth trajectory, and let π(s^k|x), μ^k_t(x), Σ^k_t(x) be the outputs of a deep neural network parameterized by θ. For both the ACM and SCM, we train to maximize the log-likelihood of recorded driving trajectories, following [20]:

L_base(θ) = − Σ_{m=1}^{M} Σ_{k=1}^{K} 1(k = k̂^m) ( log π(s^k | x^m; θ) + Σ_{t=1}^{T} log N(ŝ^m_t | μ^k_t, Σ^k_t; x^m, θ) )     (3)

where 1(·) is the indicator function and k̂^m is the index of the predicted trajectory closest to the groundtruth ŝ^m in terms of the L2 distance. The first term in the loss function fits the likelihood of each k-th predicted trajectory (by making the closest-to-groundtruth predicted trajectory the most probable one), and the second term is simply a time-sequence extension of standard GMM likelihood fitting [41]. The advantage of training the network according to eqn. 3 is that it avoids the need to perform an expectation-maximization procedure and avoids the intractability of directly fitting the GMM likelihood.
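A minimal NumPy sketch of eqn. (3) for a single training example follows (ours; diagonal covariances are assumed). It makes the hard assignment k̂^m to the closest predicted trajectory explicit.

```python
import numpy as np

def base_loss(gt, mu, log_sigma, logits):
    """Eqn (3) for a single example, with diagonal covariances.

    gt:  (T, 2) groundtruth trajectory s_hat^m.
    mu:  (K, T, 2) means; log_sigma: (K, T, 2); logits: (K,).
    """
    # k_hat: index of the predicted trajectory closest to groundtruth (L2).
    k_hat = np.argmin(((mu - gt[None]) ** 2).sum(axis=(1, 2)))
    log_pi = logits - np.logaddexp.reduce(logits)
    z = (gt - mu[k_hat]) / np.exp(log_sigma[k_hat])
    log_gauss = (-0.5 * z ** 2 - log_sigma[k_hat]
                 - 0.5 * np.log(2 * np.pi)).sum()
    return -(log_pi[k_hat] + log_gauss)    # negative log-likelihood

K, T = 10, 80
print(base_loss(np.random.randn(T, 2), np.random.randn(K, T, 2),
                np.zeros((K, T, 2)), np.random.randn(K)))
```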
III. DISTILLATION METHODS

In this section, we describe the distillation techniques we developed for trajectory-based behavior prediction⁵. While we use these methods to distill from ACMs to SCMs in this work, they can be applied to any trajectory-based behavior prediction models. An overview of our approach is provided in Figure 1.

⁵Note that an alternative representation for future behavior, probabilistic occupancy grids (or "heatmaps"), could be considered. Possibly simpler distillation approaches for this representation could be developed. However, heatmap representations are significantly less common in the literature and, more importantly, public benchmarks and their metrics specifically require trajectory-based representations.

To facilitate the presentation, we use the following notation for the teacher model. We denote ϑ as the teacher network parameters, ξ^k as the k-th predicted trajectory output from the teacher, and Π(ξ^k|x) as the trajectory likelihood distribution from the teacher (analogous to π(s^k|x) from the student). Lastly, we denote H(·,·) as the cross-entropy function and D_KL(·||·) as the KL-divergence function. Our distillation approaches are as follows.

A. Trajectory Set Distillation

In this distillation approach, we train our student model to match the full trajectory set output by the teacher. Recall that the full output representation of our models is a GMM; ignoring the covariances and taking the mode of each component gives us our trajectory set. The weights over this trajectory set are given by π.

The distillation loss has two parts. For the first part, we use the teacher's predicted trajectories (all K of them) as multiple pseudo-groundtruth trajectories for training the student. Here, we want the k-th teacher trajectory to be maximally likely under the learned distribution for the student's corresponding k-th mode. For the second part, we impose a cross-entropy loss to encourage the student's trajectory mode distribution π to match the teacher's mode distribution Π:

L_distill(θ) = − Σ_{m=1}^{M} Σ_{k=1}^{K} ( −H( π(s^k|x^m; θ), Π(ξ^k|x^m; ϑ) ) + Σ_{t=1}^{T} log N( ξ^k_t | μ^k_t, Σ^k_t; x^m, θ ) )     (4)

The full loss function is formed by adding L_distill on top of the original loss L_base (eqn. 3) as follows:

L(θ) = λ L_base(θ) + L_distill(θ)     (5)

Note that L_distill does not have the term 1(k = k̂^m), in contrast to L_base. This is because for L_distill we match all K predicted trajectories to the teacher's predicted trajectories, while for L_base we only optimize over one trajectory (the real observed future groundtruth). One added benefit of this distillation formulation is that training includes additional information in the form of additional soft labels to learn from, that is, an additional K−1 predicted trajectories with the associated distribution over them.

One implementation detail to note for this approach is that it imposes correspondence between each of the K teacher trajectories and the K student trajectories, which constrains the set of possible solutions for the student by removing equivalent solutions under permutation.

We use the hyper-parameter λ to optionally disable the base loss for a certain number of warm-up steps, which pre-trains the model with the distillation loss only. In our experiments we cross-validate whether we (i) set λ = 1 for all training (i.e., no pre-training), or (ii) set λ = 1(step ≥ total_steps/4), i.e., pre-train with the distillation loss for the first 25% of the total training iterations.

B. Trajectory Sample Distillation

As an alternative to using multiple trajectories as pseudo-groundtruth, as described above, we sample a single trajectory from the teacher's distribution to serve as the groundtruth for the student: ξ^m_sampled ∼ Π(ξ^k | x^m, ϑ). We call this sampled teacher trajectory the proxy groundtruth label. Then, we directly optimize over this proxy groundtruth (instead of the true groundtruth label). Mathematically, this is expressed as follows:

L_sampled(θ) = − Σ_{m=1}^{M} Σ_{k=1}^{K} 1(k = k̂^m_sampled) ( log π(s^k | x^m; θ) + Σ_{t=1}^{T} log N( ξ^m_sampled,t | μ^k_t, Σ^k_t; x^m, θ ) )     (6)

where k̂^m_sampled is the index of the predicted trajectory closest to the proxy groundtruth. In expectation, over infinite samples, this loss is equivalent to requiring the student's weighted trajectory set to match the teacher's. While this formulation stands out for its simplicity (its form is the same as that of L_base(θ)), it does not, however, encourage the full GMM distribution of the teacher and student to match; this is addressed next.
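The sketch below illustrates the Trajectory Set Distillation loss of eqn. (4) for one example (ours; diagonal covariances are assumed, and the cross-entropy is computed with the teacher distribution as the target, the standard distillation direction).

```python
import numpy as np

def distill_set_loss(xi, teacher_logits, mu, log_sigma, logits):
    """Eqn (4) for one example: every teacher trajectory xi^k is a
    pseudo-groundtruth for the student's matching mode k, plus a
    cross-entropy aligning the student's pi with the teacher's Pi.

    xi: (K, T, 2) teacher trajectory set; teacher_logits: (K,).
    mu, log_sigma: (K, T, 2) student GMM params; logits: (K,).
    """
    log_pi = logits - np.logaddexp.reduce(logits)
    Pi = np.exp(teacher_logits - np.logaddexp.reduce(teacher_logits))
    cross_entropy = -(Pi * log_pi).sum()             # H(pi, Pi) term
    z = (xi - mu) / np.exp(log_sigma)
    log_gauss = (-0.5 * z ** 2 - log_sigma
                 - 0.5 * np.log(2 * np.pi)).sum()    # sum over k, t, dims
    return cross_entropy - log_gauss

# Full objective per eqn (5); lambda gates the base loss so that
# warm-up can train with the distillation loss alone:
#   loss = lam * base_loss(...) + distill_set_loss(...)
```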
C. Trajectory Distribution Distillation

In this last formulation, the loss directly encourages the student's full GMM output to match the teacher's GMM. As in Trajectory Set Distillation, we force correspondence between the teacher's and student's k-th trajectories (for all k) to avoid permutation ambiguity in the solution space of the student. To match distributions, we use a cross-entropy loss between the discrete mode distributions of the student and teacher (π and Π), and KL-divergence for each Gaussian distribution in the trajectory sequences (N_t for the student, N̄_t for the teacher):

L(θ) = L_base(θ) + Σ_{m=1}^{M} Σ_{k=1}^{K} ( H(π, Π) + Σ_{t=1}^{T} D_KL( N_t || N̄_t ) )     (7)
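Below is a hedged NumPy sketch of the term added to L_base in eqn. (7): the mode cross-entropy plus the closed-form KL-divergence between matched student and teacher Gaussians (diagonal covariances assumed, as elsewhere in our sketches).

```python
import numpy as np

def gaussian_kl_diag(mu_s, log_sig_s, mu_t, log_sig_t):
    """Closed-form KL(N_t_student || N_t_teacher) for diagonal
    Gaussians, summed over state dimensions."""
    var_s, var_t = np.exp(2 * log_sig_s), np.exp(2 * log_sig_t)
    return (log_sig_t - log_sig_s
            + (var_s + (mu_s - mu_t) ** 2) / (2 * var_t) - 0.5).sum()

def distill_distr_term(mu, log_sigma, logits, t_mu, t_log_sigma, t_logits):
    """The added term of eqn (7) for one example: mode cross-entropy
    H(pi, Pi) plus per-step Gaussian KL, with the student's mode k
    matched to the teacher's mode k.
    Student: mu/log_sigma (K, T, 2), logits (K,); teacher: t_*."""
    log_pi = logits - np.logaddexp.reduce(logits)
    Pi = np.exp(t_logits - np.logaddexp.reduce(t_logits))
    ce = -(Pi * log_pi).sum()                        # teacher as target
    kl = sum(gaussian_kl_diag(mu[k], log_sigma[k], t_mu[k], t_log_sigma[k])
             for k in range(len(logits)))
    return ce + kl
```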
IV. EXPERIMENTS

We ran our experiments on the WOMD, Argoverse, and In-House datasets. The results are shown in Tables I, II, and III. The best numbers across an entire table are highlighted in bold. The best methods among the SCM variants are marked in blue, and the second best in orange. In Tables I and II, the first section of rows shows the results of the top-ranked models in the corresponding public leaderboards.

For both the teacher and student models, we train end-to-end using the Adam optimizer with a learning rate of 5 × 10⁻⁴. We use gradient clipping with a threshold of 10 to prevent gradient explosion. We train all models for 1M training steps, and we submit to each leaderboard the best model based on its performance on the validation set. After cross-validation we set λ = 1(step ≥ total_steps/4) for WOMD and λ = 1 for the In-House and Argoverse datasets. The teacher implementation uses off-the-shelf components such as polylines and LSTMs, as described in sec. II-A. The student models use an EfficientDet-d2 backbone, a 200×200 PointPillars grid with a 2-meter resolution, and a PointPillars embedding size of 64.

Metrics. We follow the Argoverse benchmark and use the following metrics for evaluation: minimum average displacement error (minADE), minimum final displacement error (minFDE), and miss rate (MR). Besides these, there are additional metrics provided for each dataset: mAP (mean Average Precision) for WOMD, brier-minFDE (which scales the minFDE with the prediction probability) for Argoverse, and wADE (probability-weighted Average Displacement Error) for the In-House dataset. Where a choice of K is required to define the top K trajectories used to evaluate a metric (for example minADE (K = k) on Argoverse), we use k = 6.
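For concreteness, a minimal NumPy sketch of the displacement metrics for a single agent follows (ours; the 2 m miss threshold follows the Argoverse convention and is an assumption here, as is the confidence-sorted ordering of the predictions).

```python
import numpy as np

def min_ade_fde_mr(pred, gt, k=6, miss_threshold=2.0):
    """minADE, minFDE and a per-agent miss indicator for one agent.

    pred: (K, T, 2) predicted trajectories, assumed sorted by
          confidence so that pred[:k] are the top-k modes.
    gt:   (T, 2) groundtruth trajectory.
    The dataset-level miss rate (MR) is the miss indicator averaged
    over all evaluated agents.
    """
    dists = np.linalg.norm(pred[:k] - gt[None], axis=-1)   # (k, T)
    min_ade = dists.mean(axis=1).min()                     # best avg disp.
    min_fde = dists[:, -1].min()                           # best final disp.
    return min_ade, min_fde, float(min_fde > miss_threshold)
```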
Datasets. The Waymo Open Motion Dataset (WOMD) [1] is an open-source dataset for behavior prediction from Waymo. It contains 570 hours of data over 1750 km of driving distance, with more than 100,000 scenes that are on average about 20 seconds long. Our results on this dataset are summarized in Table I, and we report the rank in terms of mAP.

TABLE I: Model distillation performance on the WOMD test set.

method | coord. frame | rank | mAP (↑) | minADE (↓) | minFDE (↓) | MissRate (↓) | avg. improvement
Scene-Transformer [24] | scene | 4th | 0.337 | 0.678 | 1.376 | 0.198 | –
Multipath++ | agent | 1st | 0.401 | 0.569 | 1.194 | 0.143 | –
ACM (teacher) | agent | – | 0.329 | 0.676 | 1.488 | 0.178 | –
SCM baseline | scene | – | 0.322 | 0.757 | 1.691 | 0.205 | –
SCM + Distill Set | scene | 3rd | 0.349 | 0.710 | 1.569 | 0.186 | 7.8%
SCM + Distill Sample | scene | – | 0.320 | 0.742 | 1.643 | 0.194 | 4.2%
SCM + Distill Distr. | scene | – | 0.330 | 0.758 | 1.681 | 0.199 | 1.5%

The Argoverse Motion Forecasting Competition is an open-source trajectory prediction dataset with more than 300,000 curated scenarios [2], with each sequence containing one target vehicle for prediction. Our results on this dataset are summarized in Table II, and ranks are reported in terms of minADE.

TABLE II: Model distillation performance on the Argoverse test set.

method | coord. frame | rank | brier-minFDE (↓) | minFDE (↓) | MR (↓) | minADE (↓) | avg. improvement
LaneRCNN [26] | scene | 58th | 2.147 | 1.453 | 0.123 | 0.904 | –
LGN [42] | scene | 35th | 2.059 | 1.364 | 0.163 | 0.868 | –
mmTransformer [43] | agent | 15th | 2.033 | 1.338 | 0.154 | 0.844 | –
TPCN [44] | agent | 6th | 1.929 | 1.244 | 0.133 | 0.815 | –
poly | agent | 1st | 1.793 | 1.214 | 0.132 | 0.790 | –
ACM (teacher) | agent | – | 1.906 | 1.280 | 0.147 | 0.816 | –
SCM baseline | scene | – | 2.206 | 1.588 | 0.225 | 0.931 | –
SCM + Distill Set | scene | – | 2.052 | 1.416 | 0.180 | 0.868 | 11.1%
SCM + Distill Sample | scene | 17th | 2.017 | 1.383 | 0.173 | 0.853 | 13.2%
SCM + Distill Distr. | scene | – | 2.345 | 1.723 | 0.254 | 0.980 | -8.2%

The In-House Dataset is a large-scale real-world dataset of driving scenes in various urban and suburban environments within the US. It is collected by vehicles equipped with an industry-grade sensor and perception stack, and it provides detailed logs of tracked objects. Results on this dataset are a valuable addition to the public benchmark results due to its large size and quality: over 13 million training samples with richer HD map and state information. Our distillation results on this dataset are summarized in Table III.

TABLE III: Model distillation performance on the In-House test set.

method | coord. frame | wADE (↓) | minADE (↓) | minFDE (↓) | MissRate (↓) | avg. improvement
ACM (teacher) | agent | 1.200 | 0.524 | 1.145 | 0.335 | –
SCM baseline | scene | 1.270 | 0.558 | 1.532 | 0.357 | –
SCM + Distill Set | scene | 1.190 | 0.545 | 1.526 | 0.324 | 4.6%
SCM + Distill Sample | scene | 1.270 | 0.567 | 1.602 | 0.345 | -0.7%
SCM + Distill Distr. | scene | 1.220 | 0.608 | 1.789 | 0.358 | -5.5%

A. Discussion

Some clear trends emerge from our results. Across datasets, distillation improves the student model's performance significantly: from 4.6% to 13.2% average relative improvement across metrics. The Trajectory Set distillation method worked the best across datasets. Interestingly, Trajectory Sample distillation worked better only on Argoverse. The major difference with Trajectory Sample distillation is that it trains with a single trajectory groundtruth rather than trying to learn a trajectory set or full distribution like the other methods. The Argoverse dataset also stands out from the other datasets in that it is smaller, has a shorter prediction horizon, and has less diverse driving behavior [1]. Lastly, Distribution Distillation did not work as well as the other distillation methods across all 3 datasets. We hypothesize that this form of distillation task was too constrained: matching GMM distributions via KL-divergence is more difficult to achieve than simply maximizing the likelihood of pseudo-groundtruth (as in Trajectory Set and Trajectory Sample distillation).

Another trend from our experiments is that agent-centric models outperform the scene-centric models, in our own implementations as well as in related works. This was our original motivation for this work, and the results presented here provide further justification for investigating distillation approaches. However, there are still improvements to be made in efficient, scene-centric models, since the gap has not been fully closed by our distillation techniques.

Lastly, we want to highlight that our models are competitive on public leaderboards in an absolute sense. On WOMD, our best distilled model⁶ is ranked 3rd and, to our knowledge, is the best performing scene-centric (and thus efficient) model. On the Argoverse leaderboard, our best distilled model ranks 17th, where other known, popular scene-centric models are ranked 35th and 58th.

⁶Our best distillation model is named MPG-Distil(pretrain) on the WOMD public leaderboard.

Illustrations of the improvements provided by the distilled models on a variety of urban driving scenarios are shown in Figure 3.

Fig. 3. Illustration of improvements provided by the distilled models on the WOMD. For each example, the sub-figure on the left shows predictions for the SCM baseline, while the sub-figure on the right shows the predictions for our SCM + Distill Set. The purple markers show the groundtruth, while the red markers show the trajectory of the Autonomous Driving Vehicle (ADV). The predicted trajectories are shown in blue (the darker the blue, the higher the confidence). In different scenarios (parking lots, various types of traffic intersections) we see that the non-distilled baseline misses the groundtruth trajectory and predicts a left or right turn while the groundtruth goes straight, or vice versa. In contrast, after applying the distillation techniques proposed in this paper, the model more accurately predicts the groundtruth.

V. CONCLUSIONS

In this paper, we develop novel knowledge distillation techniques to bridge the coordinate-frame gap in behavior prediction models. We use an agent-centric model as a teacher to improve the accuracy of an otherwise more efficient scene-centric model. Our method improves the performance of the scene-centric model by 13.2% on the public Argoverse benchmark, 7.8% on the public Waymo Open Dataset, and up to 9.4% on a large In-House dataset. The resulting improved scene-centric models are also 15 times faster than their agent-centric teacher counterparts in busy urban scenes.

REFERENCES

[1] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou, et al., "Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9710–9719.
[2] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., "Argoverse: 3D tracking and forecasting with rich maps," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8748–8757.
[3] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kummerle, H. Konigshof, C. Stiller, A. de La Fortelle, et al., "Interaction dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps," arXiv preprint arXiv:1910.03088, 2019.
[4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," 2020.
[5] C. Tang and R. R. Salakhutdinov, "Multiple futures prediction," Advances in Neural Information Processing Systems, vol. 32, pp. 15424–15434, 2019.
[6] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, "VectorNet: Encoding HD maps and agent dynamics from vectorized representation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11525–11533.
[7] J. Hong, B. Sapp, and J. Philbin, "Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8454–8462.
[8] J. Mercat, T. Gilles, N. El Zoghby, G. Sandou, D. Beauvois, and G. P. Gil, "Multi-head attention for multi-modal joint vehicle motion forecasting," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 9638–9644.
[9] N. Rhinehart, K. M. Kitani, and P. Vernaza, "R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 772–788.
[10] J. Mercat, T. Gilles, N. Zoghby, G. Sandou, D. Beauvois, and G. Gil, "Multi-head attention for joint multi-modal vehicle motion forecasting," in IEEE Intl. Conf. on Robotics and Automation, 2020.
[11] H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen, Y. Shen, Y. Chai, C. Schmid, et al., "TNT: Target-driven trajectory prediction," arXiv preprint arXiv:2008.08294, 2020.
[12] S. Khandelwal, W. Qi, J. Singh, A. Hartnett, and D. Ramanan, "What-if motion prediction for autonomous driving," ArXiv, 2020.
[13] C. Tang and R. R. Salakhutdinov, "Multiple futures prediction," in NeurIPS, 2019.
[14] S. Casas, C. Gulino, R. Liao, and R. Urtasun, "SpAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data," in IEEE Intl. Conf. on Robotics and Automation. IEEE, 2020.
[15] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, "PRECOG: Prediction conditioned on goals in visual multi-agent settings," in Intl. Conf. on Computer Vision, 2019.
[16] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, "Trajectron++: Multi-agent generative trajectory forecasting with heterogeneous data for control," arXiv preprint arXiv:2001.03093, 2020.
[17] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, "DESIRE: Distant future prediction in dynamic scenes with interacting agents," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.
[18] M. Bansal, A. Krizhevsky, and A. Ogale, "ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst," arXiv preprint arXiv:1812.03079, 2018.
[19] S. Casas, W. Luo, and R. Urtasun, "IntentNet: Learning to predict intention from raw sensor data," in Conference on Robot Learning. PMLR, 2018, pp. 947–956.
[20] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, "MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction," in Conference on Robot Learning, 2019.
[21] T. Phan-Minh, E. C. Grigore, F. A. Boulton, O. Beijbom, and E. M. Wolff, "CoverNet: Multimodal behavior prediction using trajectory sets," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14074–14083.
[22] Y. Yuan and K. M. Kitani, "Diverse trajectory forecasting with determinantal point processes," in International Conference on Learning Representations, 2019.
[23] T. Buhet, E. Wirbel, A. Bursuc, and X. Perrotton, "PLOP: Probabilistic polynomial objects trajectory planning for autonomous driving," 2020.
[24] J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H.-T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, et al., "Scene Transformer: A unified multi-task model for behavior prediction and planning," arXiv preprint arXiv:2106.08417, 2021.
[25] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun, "Learning lane graph representations for motion forecasting," arXiv preprint arXiv:2007.13732, 2020.
[26] W. Zeng, M. Liang, R. Liao, and R. Urtasun, "LaneRCNN: Distributed representations for graph-centric motion forecasting," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2021.
[27] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," stat, vol. 1050, p. 9, 2015.
[28] M. Phuong and C. Lampert, "Towards understanding knowledge distillation," in International Conference on Machine Learning. PMLR, 2019, pp. 5142–5151.
[29] X. Cheng, Z. Rao, Y. Chen, and Q. Zhang, "Explaining knowledge distillation by quantifying the knowledge," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12925–12935.
[30] Y. Kim and A. M. Rush, "Sequence-level knowledge distillation," in EMNLP, 2016.
[31] C. Tang and R. R. Salakhutdinov, "Multiple futures prediction," in NeurIPS, 2019.
[32] T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu, "Multi-agent tensor fusion for contextual trajectory prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12126–12134.
[33] T. Phan-Minh, E. C. Grigore, F. A. Boulton, O. Beijbom, and E. M. Wolff, "CoverNet: Multimodal behavior prediction using trajectory sets," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14074–14083.
[34] C. Zhou, G. Neubig, and J. Gu, "Understanding knowledge distillation in non-autoregressive machine translation," ArXiv, vol. abs/1911.02727, 2020.
[35] J. Hong, B. Sapp, and J. Philbin, "Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions," in CVPR, 2019.
[36] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, "Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data," in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII. Springer, 2020, pp. 683–700.
[37] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in CVPR, 2018.
[38] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in CVPR, 2016.
[39] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[40] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
[41] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
[42] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun, "Learning lane graph representations for motion forecasting," in European Conference on Computer Vision. Springer, 2020, pp. 541–556.
[43] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, "Multimodal motion prediction with stacked transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7577–7586.
[44] M. Ye, T. Cao, and Q. Chen, "TPCN: Temporal point cloud networks for motion forecasting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11318–11327.