Large Scale Interactive Motion Forecasting for Autonomous Driving: The WAYMO OPEN MOTION DATASET

Scott Ettinger1, Shuyang Cheng1, Benjamin Caine2, Chenxi Liu1, Hang Zhao1, Sabeek Pradhan1, Yuning Chai1, Ben Sapp1, Charles Qi1, Yin Zhou1, Zoey Yang1, Aurélien Chouard1, Pei Sun1, Jiquan Ngiam2, Vijay Vasudevan2, Alexander McCauley1, Jonathon Shlens2, Dragomir Anguelov1
1 Waymo LLC, 2 Google Brain

Abstract

As autonomous driving systems mature, motion forecasting has received increasing attention as a critical requirement for planning. Of particular importance are interactive situations such as merges, unprotected turns, etc., where predicting individual object motion is not sufficient. Joint predictions of multiple objects are required for effective route planning. There has been a critical need for high-quality motion data that is rich in both interactions and annotation to develop motion planning models. In this work, we introduce the most diverse interactive motion dataset to our knowledge, and provide specific labels for interacting objects suitable for developing joint prediction models. With over 100,000 scenes, each 20 seconds long at 10 Hz, our new dataset contains more than 570 hours of unique data over 1750 km of roadways. It was collected by mining for interesting interactions between vehicles, pedestrians, and cyclists across six cities within the United States. We use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes for each road agent, and provide corresponding high definition 3D maps for each scene. Furthermore, we introduce a new set of metrics that provides a comprehensive evaluation of both single agent and joint agent interaction motion forecasting models. Finally, we provide strong baseline models for individual-agent prediction and joint prediction. We hope that this new large-scale interactive motion dataset will provide new opportunities for advancing motion forecasting models.

Figure 1: Examples of interactions between agents in a scene in the WAYMO OPEN MOTION DATASET. Each example highlights how predicting the joint behavior of agents aids in predicting likely future scenarios. Solid and dashed lines indicate the road graph and associated lanes. Each numeral indicates a unique agent in the scene. (a) A vehicle waits for a pedestrian to fully cross the crosswalk before commencing a turn. (b) A vehicle accelerates onto the street only after the incoming vehicle turns.

1. Introduction

Motion forecasting has received increasing attention as a critical requirement for planning in autonomous driving systems [8, 14, 39, 35, 28, 33]. Due to the complexity of scenes that autonomous systems need to safely handle, predicting object motion in the scene is a difficult task, suitable for machine learning models. Building effective motion forecasting models requires large amounts of high quality real world data. Creating a dataset for motion forecasting is complicated by the fact that the distribution of real world data is highly imbalanced [4, 18, 31, 37]; in the common case, vehicles drive straight at a constant velocity. In order to develop effective models, a dataset must contain and measure performance on a wide range of behaviors and trajectory shapes for different object types that an autonomous system will encounter in operation.

We argue that critical situations (e.g., merges, lane changes, and unprotected turns) require the joint prediction of a set of multiple interacting objects, not just a single object. An example of a pedestrian and vehicle interacting is illustrated in Figure 1a, where a vehicle waits for a pedestrian to fully cross the street before turning. In Figure 1b, the orange vehicle accelerates into the street only after ensuring that the incoming blue vehicle's intention is to decelerate and turn off of the street. Most existing datasets have focused on single agent representation, but there has been considerably less work on interaction modeling at a large scale, which motivates this work.

The goal of this work is to provide a large scale, diverse dataset with specific annotations for interacting objects to promote the development of models that jointly predict interactive behaviors. In addition, we aim to supply object behaviors over a wide range of road geometries, and thus provide a large set of annotated interactions over a diverse set of locations. To generate such a set, we develop criteria for mining interactive behavior over a large corpus of driving data. We explicitly annotate groups of interacting objects in both training and validation/test data to enable development of models that jointly predict the motion of multiple agents as well as individual prediction models.

We aim to provide high quality object tracking data to reduce uncertainty due to perception noise. The cost of hand labeling a dataset of the required size is prohibitive. Instead, we use a state-of-the-art automatic labeling system [26] to provide high quality detection and tracking data of objects in the scenes. In contrast with many datasets which provide tracking from on-board autonomous systems, the off-board automatic labeling system provides higher accuracy as it is not constrained to run in real time. These high quality tracks allow us to focus on understanding the complexity of object behavior, rather than on dealing with perception noise.

Evaluation of interactive prediction models requires metrics formulated for joint predictions, as motivated by recent work [32, 6, 33, 28]. In Section 4, we discuss existing work on generalizing metrics to the joint prediction case. We also propose a novel mean Average Precision (mAP) metric to capture the performance of models across different object types, prediction time scales, and trajectory shape buckets (e.g., u-turns, left turns). This method is inspired by metrics used in the object detection literature and overcomes limitations in currently adopted metrics.

We name our large-scale interactive motion dataset the WAYMO OPEN MOTION DATASET. It will be made publicly available to the research community, and we hope it will provide new directions and opportunities in developing motion forecasting models. We summarize the contributions of our work as follows:
• We release a large-scale dataset for motion forecasting research with specifically labeled interactive behaviors. The data is derived from high quality perception output across a large array of diverse scenes with rich annotations from multiple cities.
• We provide novel metrics for motion prediction analysis along with challenging benchmarks for both the marginal and joint prediction cases.

                          Lyft     NuSc    Argo     Inter    Ours
  #unique tracks          53.4m§   4.3k    11.7m‡   40k      7.64m
  Avg track length        1.8s§    -       2.48s‡   19.8s∗   7.04s††
  Time horizon            5s       6s      3s       3s       8s
  #segments               170k     1k      324k     -        104k
  Segment duration        25s      20s     5s       -        20s
  Total time              1118h    5.5h    320h     16.5h∗   574h
  Unique roadways         10km     -       290km    -        1750km††
  Sampling rate           10Hz     2Hz     10Hz     10Hz     10Hz
  #cities covered         1        2       2        6∗       6
  #object types           3        1†      1‡       1        3
  Boxes                   2D       3D      None     2D       3D
  3D maps                 -        -       ✓        -        ✓
  Offline perception      -        -       -        ✓        ✓
  Interactions            -        -       -        ✓        ✓
  Traffic signal states   ✓        -       -        -        ✓

Table 1: Comparison of popular behavior prediction and motion forecasting datasets. Specifically, we compare Lyft Level 5 [19], NuScenes [4], Argoverse [9], Interactions [38], and our dataset across multiple dimensions. #object types measures the number of types of objects for which motion trajectories are predicted. A dash "-" indicates that data is not available or not applicable. § Lyft Level 5 number of unique tracks and average track length are determined through private correspondence. † nuScenes [4] provides annotations for 23 object types (stationary vehicles are removed), but only vehicles are predicted. ‡ Argoverse [9] provides annotations for 15 object types (Appendix B), but only vehicles are predicted. The number of unique tracks is determined through private correspondence. The average track length is estimated from data. ∗ Interactions [38] gathered data from 4 countries including 6 cities (the latter statistic was collected through personal communication), and the entire dataset is not divided into segments. The average track length is estimated from data. †† Our average track length is computed on the 20s segments of the training split. Our total unique roadway distance is calculated by hashing our autonomous vehicle poses as UTM coordinates into 25 meter voxels and counting the number of non-zero voxels.

2. Related Work

Motion forecasting datasets. Several existing public datasets have been developed with the primary goal of motion forecasting in real-world urban driving environments, compared in Table 1. The datasets vary in size measured in number of scenes, total time, total miles, number of tracked objects, and number of distinct time segments. While Lyft Level 5 [19] has the most hours of data and NuScenes [4] has a rich object taxonomy, they were not collected to capture a wide diversity of complex and interactive driving scenarios. Argoverse [9] was collected for interesting behaviors by biasing sampling towards certain observed behaviors (e.g., lane changes, turns) and road features (e.g., intersections). The INTERACTION dataset [38] manually selected a small set of specific driving locations (e.g., roundabouts) and times of day (e.g., rush hour) to obtain a dataset with high interaction complexity. We explain our own methodology for collecting interactions in Section 3.1.

Another salient dataset attribute is the time horizon for prediction. Our dataset's forecasting horizon is 8 seconds into the future, considerably longer than others (3 or 5 seconds), as we believe that long term forecasting is necessary for safe and human-like planning, and is intrinsically more difficult. Finally, most datasets are auto-labeled with industry-grade, onboard 3D perception stacks, employing LiDARs, cameras, and/or radar, and provided as-is with noisy state estimates and tracking errors. One exception is the INTERACTION dataset [38], which collects data from drone footage that is then post-processed offline with detection, tracking, and track smoothing. We also put considerable effort into creating high quality state estimates and 3D tracks by employing an offboard 3D detection and tracking pipeline, as discussed in Section 3.3.

We consider perception datasets (e.g., KITTI [15], Waymo Open Dataset [31]) outside of the scope of this discussion as they do not contain enough motion data to build sufficiently complex models. We also note that there are a host of other motion forecasting datasets which, while popular, are orders of magnitude smaller, have O(10) unique locations, and/or are not focused on the driving environment, for example the Stanford Drone Dataset [29], NGSIM [10], ETH [24], UCY [21], and Town Center [2].

3. Dataset

The dataset provides high quality object tracks generated using an offboard perception system (described in Section 3.3) along with both static and dynamic map features to provide context for the road environment. Object track states are sampled at 10 Hz. Each state includes the object's bounding box (3D center point, heading, length, width, and height) and the object's velocity vector. Due to sensor range or occlusion, measurements of an object's state may not exist at some time steps. A valid flag is provided to indicate which time steps have valid measurements. Map data is provided as a set of polylines and polygons created from curves sampled at a resolution of 0.5 meters. Static map feature types include lane centers, lane boundary lines, road edges, stop signs, crosswalks, and speed bumps. Traffic signal states and the lanes they control are included. In addition to the geometry data, map features also contain additional data specific to each feature type, e.g., lane boundaries have a field indicating whether they are a broken white boundary, a double yellow boundary, etc.

Starting with 20 second segments that are specifically mined for interactions as described in Section 3.1, we create 9.1 second (91 steps at 10 Hz) scenes, splitting the data into a 70% training, 15% validation, and 15% test set.
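The scene format described above (fixed-rate per-agent state tensors plus per-step validity flags) lends itself to a simple array representation. The following is an illustrative sketch of how such data might be handled, assuming hypothetical field layouts and function names; it is not the dataset's released schema or tooling.

```python
import numpy as np

# Illustrative constants for the scene format described above; the layout
# (9 state channels, velocity in channels 7:9) is a hypothetical choice.
NUM_STEPS = 91   # 9.1 s at 10 Hz
STATE_DIM = 9    # x, y, z, heading, length, width, height, vx, vy

def average_speed(states: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Mean speed per agent, computed over valid time steps only.

    states: [num_agents, num_steps, STATE_DIM] float array.
    valid:  [num_agents, num_steps] boolean mask (False where the object
            was occluded or out of sensor range).
    """
    speeds = np.linalg.norm(states[..., 7:9], axis=-1)   # [A, T]
    counts = np.maximum(valid.sum(axis=-1), 1)           # avoid divide-by-zero
    return np.where(valid, speeds, 0.0).sum(axis=-1) / counts

# Toy example: one agent, valid for the first 3 of 5 steps; the two
# trailing vx values are garbage and must be masked out.
states = np.zeros((1, 5, STATE_DIM))
states[0, :, 7] = [2.0, 4.0, 6.0, 99.0, 99.0]
valid = np.array([[True, True, True, False, False]])
print(average_speed(states, valid))  # -> [4.]
```

Masking with the valid flag, rather than treating missing steps as zeros, matters for exactly the statistics discussed in this section (speed profiles, trajectory shapes).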
We derive two versions of the validation and test sets, which we refer to as the standard and interactive versions. The standard validation and test sets provide up to 8 objects to predict in each scene. Selection is biased to require objects that do not follow a constant velocity model or straight paths. The interactive versions of the validation and test sets focus on the interactive portion of the segment and require only the 2 mined interactive objects to be predicted. The original 20 second segments are also provided for research requiring longer time frames.

Jointly consistent multi-agent forecasting. Most existing models output independent future distributions per object in a scene, e.g., [1, 3, 7, 5, 8, 12, 11, 14, 17, 20, 22, 25, 39]. This is encouraged by the popular metrics, which only measure quality on a per-object level, and by datasets that only require predicting one agent per scene. An important note is that these methods do model interactions between objects to achieve better performance, but explicitly modeling joint futures is much less common. There are a few exceptions which model jointly consistent futures: PRECOG [28] and MFP [33] employ models which roll out trajectory samples timestep by timestep, where each agent's next-step sample conditions on all other agents' current and past steps. In contrast, ILVM [6] (also used by TrafficSim [32]) samples from a latent variable from which multiple steps of future joint samples for all agents are decoded, without explicit conditioning on each step of the rollout. These works all measure a stricter version of distance error metrics, reporting the per-agent error of the best joint configuration. It is important to note that none of the datasets in Table 1 provide such joint metrics in their release, in contrast to our WAYMO OPEN MOTION DATASET.

3.1. Mining for interesting scenarios

We mine for interesting scenarios by first hand-crafting semantic predicates involving agents' relationships, e.g., "agent A changed lanes at time t" and "agents A and B crossed paths with a time gap t and relative heading difference θ". These predicates can be composed to retrieve more complex queries in an efficient SQL and relational database framework on an overall data corpus orders of magnitude larger than the resulting curated WAYMO OPEN MOTION DATASET.

With this framework, we specifically mined for the following pairwise interaction scenarios: merges, lane changes, unprotected turns, intersection left turns, intersection right turns, pedestrian-vehicle interactions, cyclist-vehicle interactions, interactions with close proximity, and interactions with high accelerations. The pair of interacting objects is annotated within the dataset in each scenario, and the interaction happens close to the 10 s mark of the 20 s clip.

Figure 2: Our dataset contains many agents including pedestrians and cyclists. Top: 46% of scenes have more than 32 agents, and 11% of scenes have more than 64 agents. Bottom: In the standard validation set, 33.5% of scenes require at least one pedestrian to be predicted, and 10.4% of scenes require at least one cyclist to be predicted. (Top panel: percent of scenes vs. number of agents; bottom panel: fraction of validation scenes vs. predicted agents per scene, split by vehicles, pedestrians, and cyclists.)

Figure 3: Agents selected to be predicted have diverse trajectories. Left: Ground truth trajectory of each predicted agent in a frame of reference where all agents start at the origin with heading pointing along the positive X axis (pointing up). Right: Distribution of maximum speeds achieved by all of the agents along their 9 second trajectory. Plots depict variety in trajectory shapes and speed profiles.

3.2. Dataset statistics
In contrast with many existing datasets that provide a limited number of agents per scene or agent types, we provide more diverse scenes in terms of the number and types of agents, reflecting many complicated real world driving scenarios like city driving and busy intersections. We show the distribution of the number of agents per scene (Figure 2, top). All scenes have at least one vehicle, 57% of scenes have at least one pedestrian (with 20% having four or more), and 16% of scenes have at least one cyclist.

In addition to accurately predicting the motion of other vehicles, to drive safely an autonomous vehicle must also accurately predict the motion of other road agents like pedestrians and cyclists. To support this, our dataset contains rich interactions between vehicles, pedestrians, and cyclists, and users of this dataset must be able to accurately predict the trajectories of all three classes, which is not the case in previous datasets [9, 4, 38]. We show the frequency of scenes in which we ask the model to predict each class in the validation set (Figure 2, bottom). Notably, 38.3% of scenes in the validation set require the model to predict more than one type of agent (e.g., a vehicle and a pedestrian or cyclist), and 4.9% of scenes require a model to predict trajectories for all three classes. Finally, in the interactive validation set, where we task the model with predicting the joint future trajectories of two interacting agents, 77.5% of scenes involve two interacting vehicles, 14.9% of scenes involve a vehicle interacting with a pedestrian, and 7.6% of scenes involve a vehicle interacting with a cyclist.

Finally, a motion forecasting dataset should contain diverse scenarios, trajectories, and agent interactions. Table 1 shows that we gather data across a large range of roadways. Figure 3 visualizes the future ground-truth trajectories and maximum speeds of agents we task the models with predicting. These agents represent a wide range of trajectory shapes, speeds, and behaviors, which we believe accurately captures the many different behavioral modes for each class.

3.3. Offboard perception system

Modern motion forecasting systems require a large amount of training data to imitate human maneuvers in complex real-world scenarios. Recently released datasets for motion forecasting [9, 18, 4] are orders of magnitude larger than popular 3D perception datasets [4, 19, 31, 15]. However, manually annotating datasets at such large scales not only incurs exorbitant cost but also takes a tremendous amount of time [26, 36]. Constrained by the high cost, most existing motion forecasting datasets [9, 18] directly employ onboard perception output as ground truth for trajectory prediction. But limited by the onboard perception system performance, such annotated 3D object tracks may have a high degree of state estimation error, lack temporal kinematic consistency, or under-/over-segment tracks.

In this work, we aim to alleviate the perception quality bottleneck in existing motion datasets captured by autonomous vehicles and propose using the recently introduced offboard algorithms [26, 36] to automatically generate high-quality motion labels, allowing motion forecasting algorithms to focus on the subtle dynamics and interactions of agents instead of overcoming the noise generated by a constrained, onboard perception system. Compared to its onboard counterpart, offboard perception has two major advantages: 1) it can afford much more powerful models running on ample computational resources; and 2) it can maximally aggregate complementary information from different views by exploiting the full point cloud sequence, including both history and future. Thanks to those advantages, the offboard perception system has shown superior perception accuracy compared to onboard detectors [26], and we have further validated its quality in Section 5.3.

The offboard perception system [26] employed contains three steps: (1) a 3D object detector generates object proposals from each LiDAR frame; (2) a multi-object tracker links detected objects throughout the LiDAR sequence; (3) for each object, an object-centric refinement network processes the tracked object boxes and their point clouds across all frames in the track, and outputs temporally consistent and accurate 3D bounding boxes of the object in each frame.

4. Metrics

To measure the accuracy of motion predictions we use a suite of five metrics, which we extend to handle joint predictions over multiple agents as proposed by a few related works [33, 6, 28]. Several common metrics report a minimum error within a trajectory set; when generalized, the joint metric analog takes the minimum over the best joint configuration of trajectories from a group of agents. We report standard trajectory-set distance error metrics minADE, minFDE, and Miss Rate (MR), with a custom definition of a match explained below. We also report overlap rate (OR) to measure the frequency of predicted tracks' extents overlapping with others'. Finally, inspired by the detection literature, we propose an Average Precision (AP) metric according to the defined MR to measure the precision and recall performance of models across different confidence values.

minADE. Let ŝ = {ŝ_{a,t}} denote the ground truth over agents a = 1…A and future time steps t = 1…T, and let s^k = {s^k_{a,t}} denote the k-th of K joint predictions; the individual object prediction task is the special case where each joint prediction contains only a single agent, A = 1. The minimum Average Displacement Error computes the L2 norm between ŝ and the closest joint prediction:

    minADE = (1 / (T·A)) · min_k Σ_a Σ_t ‖ŝ_{a,t} − s^k_{a,t}‖_2.

minFDE. The minimum Final Displacement Error is equivalent to evaluating the minADE at the single final time step T:

    minFDE = (1 / A) · min_k Σ_a ‖ŝ_{a,T} − s^k_{a,T}‖_2.

Overlap rate (OR). The overlap rate is computed by taking the highest confidence joint prediction from each multi-modal joint prediction. If any of the A agents in the jointly predicted trajectories overlaps at any time with any other object that was visible at the prediction time step (compared at each time step up to T), or with any of the other jointly predicted trajectories, it is counted as a single overlap. The overlap rate is computed as the total number of overlaps divided by the total number of multi-modal joint predictions. The overlap is calculated using box intersection, with box extents taken as the current time step's estimates, and heading inferred from consecutive waypoint position differences. See the supplementary material for details.

Miss rate (MR). A binary match/miss indicator function IsMatch(ŝ_{a,t}, s^k_{a,t}) is assigned to each sample waypoint at a time t. The average over the dataset gives the miss rate at that time step. Our dataset asks models to predict an 8 second trajectory for agents with varying speed profiles. Therefore, a single distance threshold to determine IsMatch is insufficient: we want a stricter criterion for slower-moving and closer-in-time predictions, and also different criteria for lateral deviation (e.g., wrong lane) versus longitudinal deviation (e.g., wrong speed profile). For a particular joint configuration, a miss is assigned for time t if any of the trajectories fails to match its ground truth trajectory:

    MR_t = min_k ∨_a ¬IsMatch(ŝ_{a,t}, s^k_{a,t}).

We implement IsMatch with separate lateral and longitudinal thresholds, which scale as a clamped linear function of future time and velocity. See the supplementary material for details.

Mean average precision (mAP). The Average Precision computes the area under the precision-recall curve by applying confidence score thresholds c_k across a validation set, and using the definition of Miss Rate above to define true positives, false positives, etc. Consistent with object detection mAP metrics [23], only one true positive is allowed for each object and is assigned to the highest confidence prediction.
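The joint generalization of these minimum-over-K metrics can be sketched directly from the definitions above. The following is an illustrative reimplementation, not the official evaluation code; in particular, a single fixed distance threshold stands in for the speed- and time-scaled lateral/longitudinal IsMatch described in the text.

```python
import numpy as np

def joint_min_ade(gt, preds):
    """minADE over K joint hypotheses.

    gt:    [A, T, 2] ground truth for A jointly predicted agents.
    preds: [K, A, T, 2] K joint trajectory hypotheses.
    Returns (1 / (T*A)) * min_k sum_a sum_t ||gt_{a,t} - preds[k]_{a,t}||_2.
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)            # [K, A, T]
    return dists.sum(axis=(1, 2)).min() / (gt.shape[0] * gt.shape[1])

def joint_miss(gt, preds, threshold=2.0):
    """Simplified joint miss indicator at the final step T.

    A hypothesis k matches only if *every* agent's endpoint lies within
    `threshold` meters; a miss is declared when no hypothesis matches.
    (A single fixed threshold is used here for illustration.)
    """
    end_dists = np.linalg.norm(preds[:, :, -1] - gt[None, :, -1], axis=-1)  # [K, A]
    return not np.any(np.all(end_dists <= threshold, axis=1))

# Two agents with straight-line ground truth; hypothesis 0 is exact,
# hypothesis 1 shifts agent 1 by 5 m.
T = 10
gt = np.stack([np.stack([np.arange(T, dtype=float), np.zeros(T)], -1),
               np.stack([np.zeros(T), np.arange(T, dtype=float)], -1)])
preds = np.stack([gt, gt + np.array([[[0.0, 0.0]], [[5.0, 0.0]]])])
print(joint_min_ade(gt, preds))  # -> 0.0 (hypothesis 0 is exact)
print(joint_miss(gt, preds))     # -> False
```

Note the defining property of the joint versions: the minimum is taken over whole joint configurations, so one hypothesis must be simultaneously good for all A agents, rather than mixing the best per-agent trajectories from different hypotheses.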
We then account for imbalanced data by reporting mean AP (mAP) over different semantic trajectory motion types. Further inspired by the object detection literature [13], we seek an overall metric balanced over semantic buckets, some of which may be much more infrequent (e.g., u-turns), so we report the mean AP over different driving behaviors. The final mAP metric averages over eight different ground truth trajectory shapes: straight, straight-left, straight-right, left, right, left u-turn, right u-turn, and stationary.

Formally, for each sample, a model makes K possibly joint predictions S_k, k ∈ 1…K. Each S_k contains a scalar confidence c_k and a trajectory s^k = {s^k_{a,t}} for t = 1…T future time steps and a = 1…A agents. Similarly, the ground truth is denoted ŝ = {ŝ_{a,t}}. The individual object prediction task becomes a special case of this formulation where each joint prediction contains only a single agent, A = 1.

  Set          Model        rg  ts  hi   Vehicle               Pedestrian            Cyclist
                                         minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑
  Standard     Const.Vel.               11.0     0.95  0.02   1.55     0.60  0.07   4.17     0.82  0.02
  Validation   LSTM                     2.63     0.67  0.07   0.73     0.22  0.15   1.86     0.60  0.07
               LSTM         ✓           1.67     0.40  0.16   0.74     0.18  0.18   1.50     0.40  0.12
               LSTM             ✓       1.54     0.32  0.19   0.66     0.14  0.23   1.36     0.31  0.17
               LSTM         ✓   ✓       1.36     0.26  0.22   0.63     0.14  0.23   1.29     0.30  0.18
               LSTM         ✓       ✓   1.52     0.31  0.18   0.65     0.15  0.20   1.34     0.33  0.15
               LSTM         ✓   ✓   ✓   1.34     0.25  0.23   0.63     0.13  0.23   1.26     0.29  0.21
  Standard     Const.Vel.               11.0     0.95  0.02   1.58     0.60  0.06   4.12     0.83  0.03
  Test         LSTM         ✓   ✓   ✓   1.34     0.24  0.24   0.64     0.13  0.22   1.29     0.28  0.20

Table 2: Marginal metrics on the standard validation and test set. All metrics are computed at 8 s. rg stands for road graph information, ts stands for traffic signal state information, and hi stands for high-order interactions between agents' features. The constant velocity baseline employs K = 1 predicted trajectories; all other models employ K = 6.

  Set          Model        rg  ts  hi   Vehicle               Pedestrian            Cyclist
                                         minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑
  Interactive  Const.Vel.               10.3     0.98  0.00   3.62     1.00  0.00   6.35     1.00  0.00
  Validation   LSTM                     4.16     0.88  0.01   2.45     0.93  0.02   4.00     0.98  0.00
               LSTM         ✓           2.89     0.75  0.06   2.22     0.93  0.01   3.75     0.94  0.01
               LSTM             ✓       2.94     0.75  0.04   2.39     0.86  0.06   3.30     0.88  0.02
               LSTM         ✓   ✓       2.45     0.66  0.06   2.22     0.86  0.03   3.02     0.83  0.03
               LSTM         ✓       ✓   2.92     0.75  0.04   2.69     0.93  0.10   3.24     0.89  0.01
               LSTM         ✓   ✓   ✓   2.42     0.66  0.08   2.73     1.00  0.00   3.16     0.83  0.01
  Interactive  Const.Vel.               10.3     0.98  0.01   4.56     1.00  0.00   6.21     1.00  0.00
  Test         LSTM         ✓   ✓   ✓   2.46     0.67  0.08   2.47     0.89  0.00   2.96     0.89  0.01

Table 3: Joint metrics on the interactive validation and test set. See Table 2 for abbreviations and details. Note that these metrics indicate that the interactive split is systematically more challenging.

5. Experiments

In this section, we evaluate various baseline models on the WAYMO OPEN MOTION DATASET to investigate the importance of rich map annotations (e.g., 3D road graph, traffic signal states), interaction context, and joint modeling (Section 5.1). We then compare the standard validation and interactive validation datasets on conditional behavior prediction metrics to show that the interactive validation dataset is both more challenging and more interactive (Section 5.2). Furthermore, we show that our offboard perception system achieves an accuracy and perception noise reduction similar to human labels (Section 5.3). Finally, to provide insight on the performance measurement of motion prediction tasks, we empirically analyze minADE vs. mAP on their ability to reflect the quality of confidence score calibration (Section 5.4).

5.1. Baseline model performances

In this section, we evaluate several baseline models on the proposed dataset. First, we consider a constant velocity model in which we assume the agent will maintain its velocity at the current timestamp for all future steps. Second, we consider a family of deep-learned models using various encoders, with a base architecture of an LSTM to encode a 1-second history of observed state [16, 1]; this includes agents' positions, velocities, and 3D bounding boxes. In order to measure the importance of particular additional features, we selectively provide additional information:
• Road graph (rg): Encode the 3D map information with polylines following [14].
• Traffic signals (ts): Encode the traffic signal states with an LSTM encoder as an additional feature.
• High-order interactions (hi): Model the high-order interactions between agents with a global interaction graph following [14].
In experiments, combinations of these encodings are concatenated together to create a per-agent embedding in agent-centered coordinates. We decode K = 6 trajectories for output using another MLP with a min-of-K loss [12, 34]. See the supplementary material for details.

In Tables 2 and 3, we report the marginal metrics on the standard validation/test set and the joint metrics on the interactive validation/test set, respectively. Specifically, minADE, miss rate, and mAP at 8 s are chosen as the representative metrics, and we break them down across the 3 object types. The constant velocity model performs quite poorly, e.g., achieving double digit minADE on vehicles. This shows that our dataset contains nontrivial trajectories.

We then investigate the importance of encoding 3D map information, traffic signal states, and high-order interactions between agents. Intuitively, they should all benefit motion forecasting, and this is indeed supported by the experimental results. For example, on the standard validation set (Table 2) for vehicle trajectory prediction, minADE improves from 2.63 to 1.34 and mAP improves from 0.07 to 0.23 when incrementally adding more information in this order. The same trend holds for pedestrians and cyclists as well.

We only evaluate joint metrics on the interactive sets. Since making joint predictions is a relatively new practice, there are no mature, established baselines. In Table 3, we reuse the models trained to make K marginal predictions; but when evaluating on the 2 interactive agents, we select the top K among the K² possibilities based on the product of predicted probabilities, as described in [6]. The overall low performance in Table 3 can be attributed to at least 3 factors: the higher difficulty level of the mined interactive agents; the requirement to make good predictions for both agents as dictated by the joint version of the metrics; and the fact that the predictions are post-hoc manipulations rather than the result of true joint training.

We have argued the importance of jointly predicting interactive behaviors. In Table 4 we provide a direct comparison between a base LSTM (without rg, ts, or hi) trained to make marginal or joint predictions for the 2 interactive agents. In converting the marginal model to making joint predictions, the neural features for the 2 interactive agents are concatenated with each other to provide the minimal necessary context; the sum of their individual distances to the ground truth (while matching the pairs of trajectories jointly) is used for training; and the confidence scores are jointly predicted for each pair of trajectories to ensure consistency. When evaluated on the interactive set using joint metrics, this joint model performs favorably against its marginal counterpart. We hope this preliminary experiment can motivate further development of joint models on our dataset, especially the interactive set.

              Vehicle minADE↓        Vehicle mAP↑
  Model       3s     5s     8s       3s     5s     8s
  Marginal    0.65   1.66   4.16     0.08   0.07   0.01
  Joint       0.65   1.59   3.81     0.10   0.06   0.03

Table 4: Joint modeling is advantageous on interactive agents. Numbers are from the interactive validation set.

5.2. Quantifying interactivity

Following [35], we use Conditional Behavior Prediction (CBP) to quantify the interactivity in our dataset. [35] introduces a model that can produce either unconditional predictions or predictions conditioned on a "query trajectory" for one of the agents in the scene. If two agents are not interacting, then one's actions have no effect on the other, so knowledge of that agent's future should not change predictions for the other agent. Thus, [35] defines the degree of influence agent A has on agent B as the KL divergence between the unconditional predictions for B and the predictions for B conditioned on A's ground truth future trajectory.

We apply this framework to our interactive and standard validation datasets, computing the KL divergence between unconditional and conditional predictions for every query agent/target agent pair in the dataset. We find that the KL divergences are much larger in the interactive validation dataset than in the standard validation dataset. In particular, 73% of agent pairs in the interactive dataset have KL divergences greater than 10, and 45% have KL divergences greater than 50; in the standard dataset, these numbers are 48% and 28%, respectively. Figure 4 presents a full histogram of the KL divergences between unconditional and conditional predictions for each agent pair. Conditioning on a query agent's future trajectory makes little difference in the standard validation dataset but a large difference in the interactive validation dataset, providing evidence that the interactive dataset contains more cases where multiple agents are interacting with and influencing each other. For details on the CBP model, see the supplementary material.

Figure 4: The interactive split sees much larger improvements from conditional prediction. Each element in the histogram is one pair of query agent/target agent, and the x axis shows the KL divergence between the unconditional predictions on the target agent and the predictions for the target agent conditioned on the query agent's ground truth future. Note that both plots are normalized to the total number of agent pairs.

Figure 5: Distance error statistics of vehicle bounding boxes. We compare three sets of vehicle bounding boxes with the Waymo Open Dataset (WOD) ground truth boxes on the 5 selected run segments from the val set. (Per-panel statistics: recall 99.29% / 93.50% / 87.31%; mean DE 0.1849 / 0.1958 / 0.2738; std DE 0.2342 / 0.2721 / 0.3800.)
The statistics include the histogram of distance errors (capped at 0.8 m), the box recall (using a 3D IoU threshold of 0.03), the mean distance error, and the standard deviation (std) of the distance error. Only boxes with at least one point inside are considered. Note that the DE from different boxes are not directly comparable, as the recalls are different.

5.3. Analysis of perception data quality

In this section, we study the quality of our offboard perception system and compare it with two alternatives: human labels and baseline detector boxes. Following [26], we conduct a study on the same five validation set run segments from the Waymo Open Dataset (WOD), re-labeled by three additional independent human labelers. With the duplicate human labels, we can analyze the human label consistency to understand the "background noise" in label accuracy. Instead of comparing detection results in average precision [26], we evaluate the box distance errors (DE) in meters by comparing to the original WOD ground truth boxes.

Figure 5 shows that offboard perception achieves an accuracy and distance error distribution similar to human labels. We also show the distance errors of boxes obtained from a baseline detector (Multi-View Fusion [40]) with a Kalman filter-based tracker (the same tracker used in the offboard perception). Using the baseline (onboard) detector leads to a significantly higher mean distance error; this increased perception noise indicates a higher lower bound on the minADE that a behavior model can achieve.

5.4. Comparing mAP with minADE

While minADE is widely adopted for performance measurement in motion forecasting tasks [9, 8, 14, 39], it fails to measure the quality of confidence score calibration in the trajectory prediction. In contrast, the mAP metric described in Section 4 provides a measurement of the quality of the confidence score calibration by design. In this section, we perform an analysis of minADE vs. mAP with increasing numbers of predictions at different time steps to show that minADE does not provide a full picture of the model performance, while mAP provides more insight.
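A minimal sketch (toy data, not the benchmark evaluation code) makes the minADE behavior concrete: since minADE takes the minimum over the K hypotheses, adding any candidate can only lower it or leave it unchanged, with no penalty for poorly calibrated confidences.

```python
import math

def min_ade(pred_trajs, gt_traj):
    """minADE: smallest average displacement error over a set of
    predicted trajectories (each a list of (x, y) waypoints)."""
    def ade(traj):
        return sum(math.dist(p, g) for p, g in zip(traj, gt_traj)) / len(gt_traj)
    return min(ade(t) for t in pred_trajs)

# Toy example: ground truth and three candidate predictions.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # offset by 1 m everywhere
    [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)],   # offset by 0.5 m
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],   # exact match
]

# minADE is monotone non-increasing in the number of predictions K.
for k in range(1, len(preds) + 1):
    print(k, min_ade(preds[:k], gt))
# prints: 1 1.0 / 2 0.5 / 3 0.0
```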
As shown in Figure 6, minADE artificially improves as the number of predictions increases, while the mAP value peaks at 3 predictions for 3 s and 5 s, and at 6 predictions for 8 s. The minADE scores may improve so long as any of the predictions are good, regardless of their confidence scores. In contrast, mAP penalizes high-confidence false positive predictions and does not continue to improve with the number of predictions. Precision-recall curves for these experiments are shown in the supplementary material.

Figure 6: Comparison of minADE and mAP across increasing numbers of predictions. Using the best LSTM baseline model in Section 5.1, the minADE for vehicles at 3 s, 5 s, and 8 s (top) artificially improves as one allows for increasing numbers of predictions K. Conversely, the mAP (bottom) saturates as the model must produce high quality confidence estimates in addition to accurate trajectories.

6. Discussion

In this work we release the WAYMO OPEN MOTION DATASET, a large-scale motion forecasting dataset containing data mined for interactive behaviors across a diverse set of road geometries from multiple cities. The data comes with rich 3D object state and HD map information. Object tracks are generated with a state-of-the-art offboard automatic labeling system which is significantly higher fidelity than typical onboard 3D perception stacks. For evaluation we outline a set of metrics for both per-agent and joint trajectory predictions, including a novel mAP metric to measure precision-recall performance in a balanced way across semantic driving behavior buckets. We provide baseline models for both individual and interactive prediction tasks, which we hope provides great opportunities for advancing motion forecasting research.

Acknowledgements

We thank Paul Hempstead, David Margines, Dietmar Ebner, Peter Pawlowski, Balakrishnan Varadarajan, Avikalp Srivastava, Zhifeng Chen, and Rebecca Roelofs for their comments and suggestions.
Additionally, we thank the larger Google Brain team and Waymo Research teams for their support.

References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[2] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR 2011, pages 3457–3464. IEEE, 2011.
[3] Thibault Buhet, Emilie Wirbel, and Xavier Perrotton. PLOP: Probabilistic polynomial objects trajectory planning for autonomous driving. arXiv preprint arXiv:2003.08744, 2020.
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[5] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. SpAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9491–9497. IEEE, 2020.
[6] Sergio Casas, Cole Gulino, Simon Suo, Katie Luo, Renjie Liao, and Raquel Urtasun. Implicit latent variable model for scene-consistent motion forecasting. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020.
[7] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. In Conference on Robot Learning, pages 947–956. PMLR, 2018.
[8] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
[9] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.
[10] Benjamin Coifman and Lizhe Li. A critical evaluation of the next generation simulation (NGSIM) vehicle trajectory dataset. Transportation Research Part B: Methodological, 105:362–377, 2017.
[11] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, and Nemanja Djuric. Deep kinematic models for kinematically feasible vehicle trajectory predictions. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 10563–10569. IEEE, 2020.
[12] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA), pages 2090–2096. IEEE, 2019.
[13] M. Everingham, L. Gool, C. K. Williams, J. Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303–338, 2009.
[14] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In CVPR, 2020.
[15] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[16] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[17] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[18] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. arXiv preprint arXiv:2006.14480, 2020.
[19] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft Level 5 perception dataset 2020. https://level5.lyft.com/dataset/, 2019.
[20] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017.
[21] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer Graphics Forum, volume 26, pages 655–664. Wiley Online Library, 2007.
[22] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. arXiv preprint arXiv:2007.13732, 2020.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[24] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision, pages 261–268. IEEE, 2009.
[25] Tung Phan-Minh, Elena Corina Grigore, Freddy A. Boulton, Oscar Beijbom, and Eric M. Wolff. CoverNet: Multimodal behavior prediction using trajectory sets. arXiv:1911.10298, 2019.
[26] Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3D object detection from point cloud sequences, 2021.
[27] Nicholas Rhinehart, Kris M. Kitani, and Paul Vernaza. R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 772–788, 2018.
[28] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2821–2830, 2019.
[29] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In European Conference on Computer Vision, pages 549–565, 2016.
[30] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. arXiv preprint arXiv:2001.03093, 2020.
[31] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
[32] Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. TrafficSim: Learning to simulate realistic multi-agent behaviors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[33] Charlie Tang and Russ R. Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
[34] Luca Anthony Thiede and Pratik Prabhanjan Brahma. Analyzing the variety loss in the context of probabilistic trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9954–9963, 2019.
[35] Ekaterina Tolstaya, Reza Mahjourian, Carlton Downey, Balakrishnan Vadarajan, Benjamin Sapp, and Dragomir Anguelov. Identifying driver interactions via conditional behavior prediction. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.
[36] Bin Yang, Min Bai, Ming Liang, Wenyuan Zeng, and Raquel Urtasun. Auto4D: Learning to label 4D objects from sequential point clouds, 2021.
[37] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2636–2645, 2020.
[38] Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Naumann, Julius Kümmerle, Hendrik Königshof, Christoph Stiller, Arnaud de La Fortelle, and Masayoshi Tomizuka. INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps. arXiv:1910.03088 [cs, eess], 2019.
[39] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, et al. TNT: Target-driven trajectory prediction. arXiv preprint arXiv:2008.08294, 2020.
[40] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3D object detection in lidar point clouds. In Conference on Robot Learning, pages 923–932, 2020.

A. Motion Forecasting Metrics

Distance error metrics are the most commonly used to compare methods, capturing how closely a predicted trajectory (a discrete-time sequence of states) matches a future object track under Euclidean distance. The most common is Average Displacement Error (ADE) [1, 24]. Because the future is inherently stochastic and multi-modal, most models output a (weighted) set of trajectory hypotheses, and then a minimal error over the set (of constrained size) is reported (i.e., minADE [9]). For methods that provide explicit or implicit future probability distributions, the likelihood of the ground truth future trajectory can be used as a metric [8, 30, 27, 28]. Framing the problem instead as one of detection of future locations, Argoverse [9] employs Miss Rate within 2 meters as its primary metric, which has the benefit of being tolerant to outliers. A number of metrics including minADE have been extended for use with jointly predicted agent trajectories [6].
B. Dataset Splits

The dataset provides 6 different splits of the original set of 20 second scenarios. The scenarios are first split into training, validation, and test sets. This is done by hashing a string containing the date of the data capture and the unique ID of the vehicle used to capture the data. The hashed values are split into mutually exclusive 70% training, 15% validation, and 15% testing subsets of the 20 second scenarios. From these 3 subsets we generate examples by extracting 9.1 second windows from the longer 20 second scenarios.
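The date-and-vehicle hashing scheme can be sketched as follows. The key fields (capture date plus vehicle ID) come from the text above, but the exact key format and hash function are not specified there, so both are assumptions in this illustration:

```python
import hashlib

def split_for(capture_date: str, vehicle_id: str) -> str:
    """Deterministically assign a scenario to train/val/test by hashing a
    key built from the capture date and vehicle ID (key format assumed)."""
    key = f"{capture_date}:{vehicle_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    if bucket < 70:
        return "train"   # 70%
    elif bucket < 85:
        return "val"     # 15%
    return "test"        # 15%

print(split_for("2019-03-01", "veh_042"))
```

Hashing on capture metadata rather than on individual windows keeps all examples from one drive in the same subset, which avoids near-duplicate leakage between train and test.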
Each 9.1 second window contains 91 time steps at 10 Hz: 10 history samples, 1 sample at the current time, and 80 future steps. We extract 5 different sets of windowed examples from the respective 20 second splits: training, validation, testing, validation interactive, and testing interactive. The training set contains 9.1 second windows starting at times {0, 2, 4, 5, 6, 8, 10} seconds within the 20 second scenarios. The validation and testing sets contain 9.1 second windows starting at times {0, 5, 10} seconds. The validation interactive and testing interactive sets contain 9 second windows starting at times {4, 5, 6} seconds to focus on the interactive portion of the scenario. The 5 windowed sets are included in the published dataset along with the full 20 second training set. Each of the windowed sets contains a list of objects in the scene to be predicted. The training, validation, and testing sets contain up to 8 objects per scenario, chosen to include at least 2 objects of each type if available. Selection is biased to include objects that do not follow a constant velocity model or straight paths. For the validation interactive and testing interactive sets, only the mined interactive agent pair objects are included in the list of objects to predict. In addition, each object to predict has a difficulty level based on how easily it is predicted by an LSTM extrapolation model.

C. Metrics Details

Overlap rate (OR) details. A binary indicator is assigned to each sample alerting of self-overlapping; the average over the dataset gives the overlap rate. We only consider the highest scoring joint prediction p̃ here. Our metric counts an overlap with the following criteria: given the joint predicted trajectories of A agents, an overlap is counted if the rotated bounding box of any of the A agents overlaps with any other visible object at any time step within the prediction interval T. Note that agents not visible at prediction time (due to their later appearance) are not considered for potential overlaps. Consider G_t = \{\tilde{s}_{a,t} \,\forall a,\; g_{b,t} \,\forall b \in 1 \ldots B\}, where \tilde{s}_{a,t} are waypoints from p̃ at time t, and g_{b,t} are ground truth waypoints from B nearby environmental agents. The single overlap indicator is defined as:

\mu_{OR}(e) = \sum_{t} \sum_{a} \sum_{s' \in G_t \setminus \tilde{s}_{a,t}} 1[\mathrm{IOU}(b(\tilde{s}_{a,t}), b(s')) > 0]   (1)

where b(\cdot) is a function to derive a 5-DOF (x, y, width, length, and heading) bounding box from a waypoint. The ground truth bounding box is used for an environmental agent. For a predicted waypoint \tilde{s}_{a,t}, we derive the heading from the derivative to the previous waypoint and use the ground truth bounding box sizes. \mathrm{IOU}(\cdot) computes the intersection-over-union between two 5-DOF boxes.

Miss rate (MR) details. The indicator function f(\cdot) in (1) is defined as follows:

f(\cdot) = 1[x_a^k > \lambda^{lon}] \vee 1[y_a^k > \lambda^{lat}], \quad [x_a^k, y_a^k] := (\hat{s}_a - s_a^k) \cdot R_a   (2)

where R_a is a 2D rotation matrix defined by the heading of agent a at timestamp 0, and \lambda^{lon} and \lambda^{lat} are longitudinal and lateral thresholds. Since agents can have different speeds at time 0, we scale these thresholds by their speed so that we do not over-penalize faster agents: \lambda^{lon} = \lambda_0^{lon} \gamma(v_x) and \lambda^{lat} = \lambda_0^{lat} \gamma(v_y), where \gamma(v) = \max(0, \min(1, (v - \upsilon_L)/(\upsilon_H - \upsilon_L)))/2 + 0.5. We set \upsilon_H to 11 m/s and \upsilon_L to 1.4 m/s. The thresholds, which depend on the prediction horizon T, are as follows:

                 \lambda^{lat}   \lambda^{lon}
T = 3 seconds        1               2
T = 5 seconds        1.8             3.6
T = 8 seconds        3               6
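The speed-dependent threshold scaling above can be sketched as follows. The constants (v_H, v_L, and the base thresholds) come from the text; the helper names are ours, and we read the threshold comparison as applying to displacement magnitudes, which is an assumption:

```python
# Sketch of the speed-scaled miss-rate thresholds (helper names are ours).
V_H, V_L = 11.0, 1.4          # m/s, from the text
BASE = {3: (1.0, 2.0), 5: (1.8, 3.6), 8: (3.0, 6.0)}  # T -> (lat, lon)

def gamma(v: float) -> float:
    """Scale factor in [0.5, 1.0] that grows linearly with speed."""
    return max(0.0, min(1.0, (v - V_L) / (V_H - V_L))) / 2 + 0.5

def is_miss(dx_lon: float, dy_lat: float, horizon_s: int,
            vx: float, vy: float) -> bool:
    """A waypoint is a miss if its longitudinal or lateral displacement
    (in the agent's frame at t=0) exceeds the speed-scaled threshold."""
    lat0, lon0 = BASE[horizon_s]
    return abs(dx_lon) > lon0 * gamma(vx) or abs(dy_lat) > lat0 * gamma(vy)

print(gamma(0.0))    # slow agent  -> 0.5 (halved thresholds)
print(gamma(11.0))   # fast agent  -> 1.0 (full thresholds)
print(is_miss(1.5, 0.2, 3, vx=0.0, vy=0.0))  # lon threshold 2*0.5=1.0 -> True
```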
D. Overlap Metric

We use a marginal overlap-based metric with the simple baseline models to quantify the difficulty and interactivity in our dataset. We consider a trajectory for an agent to contain an overlap if, at any time point, the agent bounding box overlaps with a ground-truth box at that time. The overlap rate is the number of agents whose trajectories have overlaps divided by the total number of predicted agents.

We compute the overlap rate for the constant velocity model and compare the performance between the regular split and the interactive split of the dataset. For the constant velocity model, we found that 38.4% of predicted vehicles in the regular split, and 44.2% of predicted vehicles in the interactive split, have trajectories that overlap with a ground truth box (Table 5). This shows that the interactive split is more challenging, and suggests more interactions between agents in that split.

Table 5: The interactive split of the data has more overlaps per scene.

                          Overlap Rate (Val. set)
Split         Model         Vehicle  Pedestrian  Cyclist
Regular       Const. Vel.   38.4%    29.8%       22.3%
              LSTM          27.9%    22.9%       22.1%
Interactive   Const. Vel.   44.2%    30.6%       27.0%
              LSTM          36.3%    32.3%       25.6%

Despite the interactive set only requiring predictions for two agents instead of up to eight agents for the regular dataset, the split contains more scenes where a constant velocity model or an LSTM model (neither of which models other agents) produces at least one overlap. Statistics are reported on the validation set for both dataset splits. The marginal-based overlap metric is used for both splits so that the rates can be compared across the splits. The constant velocity model only predicts a single trajectory per agent. For the LSTM model, the highest scoring trajectory for each agent is used.

E. Conditional Model Details

The model we use for conditional behavior prediction is based on the baseline model we describe in Section 5.1. Figure 7 provides an overview diagram of the proposed model. We use the LSTM encoder and all three enhancements (roadgraph encoding with polylines, traffic signal states encoded in an LSTM, and modeling high-order interactions with a global interaction graph). To make this model suitable for conditional predictions, we add an early fusion conditional encoder similar to [35]. Just like [35], we train the model to do both conditional and unconditional prediction by passing in a randomly selected query agent's ground truth future trajectory as the conditional query input in 95% of training samples, while providing no conditional query in the other 5%. We generate 6 predictions per agent and evaluate the KL divergence over the full 8 second future trajectory.

F. Videos

The included videos show visualizations of some sample scenarios from the dataset, including those in Figure 1a and Figure 1b.
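The marginal overlap rate described in Appendix D can be sketched as below. This is our own illustrative helper code, and it uses axis-aligned IoU for brevity; the metric in the text uses rotated 5-DOF boxes:

```python
# Sketch of the marginal overlap rate: the fraction of predicted agents
# whose trajectory overlaps any ground-truth box at some time step.
def iou(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def overlap_rate(pred_boxes_per_agent, gt_boxes_per_step):
    """pred_boxes_per_agent: {agent_id: [box per time step]};
    gt_boxes_per_step: list of ground-truth boxes at each time step."""
    def has_overlap(traj):
        return any(iou(box, gt) > 0
                   for t, box in enumerate(traj)
                   for gt in gt_boxes_per_step[t])
    trajs = list(pred_boxes_per_agent.values())
    return sum(has_overlap(traj) for traj in trajs) / len(trajs)

preds = {"a": [(0, 0, 1, 1), (2, 0, 3, 1)],   # overlaps a gt box at t=1
         "b": [(5, 5, 6, 6), (7, 5, 8, 6)]}   # never overlaps
gts = [[(10, 10, 11, 11)], [(2.5, 0, 3.5, 1)]]
print(overlap_rate(preds, gts))  # -> 0.5
```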
Figure 7: Diagram of baseline architecture. An illustration of the baseline architecture employed for the family of learned models, with a base LSTM encoder for agent states. The three detachable components are a roadgraph polyline encoder [14], a traffic state LSTM encoder, and a high-order interactions encoder following [14]. The trajectories are predicted through an MLP with a min-of-k loss.

Figure 8: Precision versus recall curves for increasing numbers of predictions (K) for the polyline model at 3 seconds for vehicles across trajectory shape buckets (Stationary, Straight, Straight-Left, Straight-Right, Left-Turn, Right-Turn, Left-U-Turn) for the standard validation dataset. Recall increases with K but AUC decreases.
Figure 9: Precision versus recall curves for increasing numbers of predictions (K) for the polyline model at 5 seconds for vehicles across trajectory shape buckets for the standard validation dataset.

Figure 10: Precision versus recall curves for increasing numbers of predictions (K) for the polyline model at 8 seconds for vehicles across trajectory shape buckets for the standard validation dataset.