Large Scale Interactive Motion Forecasting for Autonomous Driving: The WAYMO OPEN MOTION DATASET

Scott Ettinger1, Shuyang Cheng1, Benjamin Caine2, Chenxi Liu1, Hang Zhao1, Sabeek Pradhan1, Yuning Chai1, Ben Sapp1, Charles Qi1, Yin Zhou1, Zoey Yang1, Aurélien Chouard1, Pei Sun1, Jiquan Ngiam2, Vijay Vasudevan2, Alexander McCauley1, Jonathon Shlens2, Dragomir Anguelov1
1 Waymo LLC, 2 Google Brain

Abstract

As autonomous driving systems mature, motion forecasting has received increasing attention as a critical requirement for planning. Of particular importance are interactive situations such as merges, unprotected turns, etc., where predicting individual object motion is not sufficient. Joint predictions of multiple objects are required for effective route planning. There has been a critical need for high-quality motion data that is rich in both interactions and annotation to develop motion planning models. In this work, we introduce the most diverse interactive motion dataset to our knowledge, and provide specific labels for interacting objects suitable for developing joint prediction models. With over 100,000 scenes, each 20 seconds long at 10 Hz, our new dataset contains more than 570 hours of unique data over 1750 km of roadways. It was collected by mining for interesting interactions between vehicles, pedestrians, and cyclists across six cities within the United States. We use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes for each road agent, and provide corresponding high definition 3D maps for each scene. Furthermore, we introduce a new set of metrics that provides a comprehensive evaluation of both single agent and joint agent interaction motion forecasting models. Finally, we provide strong baseline models for individual-agent prediction and joint prediction. We hope that this new large-scale interactive motion dataset will provide new opportunities for advancing motion forecasting models.

Figure 1: Examples of interactions between agents in a scene in the WAYMO OPEN MOTION DATASET. Each example highlights how predicting the joint behavior of agents aids in predicting likely future scenarios. Solid and dashed lines indicate the road graph and associated lanes. Each numeral indicates a unique agent in the scene. (a) A vehicle waits for a pedestrian to fully cross the crosswalk before commencing a turn. (b) A vehicle accelerates onto the street only after the incoming vehicle turns.

1. Introduction

Motion forecasting has received increasing attention as a critical requirement for planning in autonomous driving systems [8, 14, 39, 35, 28, 33]. Due to the complexity of scenes that autonomous systems need to safely handle, predicting object motion in the scene is a difficult task, suitable for machine learning models. Building effective motion forecasting models requires large amounts of high quality real world data. Creating a dataset for motion forecasting is complicated by the fact that the distribution of real world data is highly imbalanced [4, 18, 31, 37]; in the common case, vehicles drive straight at a constant velocity. In order to develop effective models, a dataset must contain and measure performance on a wide range of behaviors and trajectory shapes for different object types that an autonomous system will encounter in operation.

We argue that critical situations (e.g., merges, lane changes, and unprotected turns) require the joint prediction of a set of multiple interacting objects, not just a single object. An example of a pedestrian and vehicle interacting is illustrated in Figure 1a, where a vehicle waits for a pedestrian to fully cross the street before turning. In Figure 1b, the orange vehicle accelerates into the street only after ensuring that the incoming blue vehicle's intention is to decelerate and turn off of the street. Most existing datasets have focused on single agent representation, but there has been considerably less work on interaction modeling at a large scale, which motivates this work.

The goal of this work is to provide a large scale, diverse dataset with specific annotations for interacting objects to promote the development of models that jointly predict interactive behaviors. In addition, we aim to supply object behaviors over a wide range of road geometries, and thus provide a large set of annotated interactions over a diverse set of locations. To generate such a set, we develop criteria for mining interactive behavior over a large corpus of driving data. We explicitly annotate groups of interacting objects in both training and validation/test data to enable development of models that jointly predict the motion of multiple agents as well as individual prediction models.

We aim to provide high quality object tracking data to reduce uncertainty due to perception noise. The cost of hand labeling a dataset of the required size is prohibitive. Instead, we use a state-of-the-art automatic labeling system [26] to provide high quality detection and tracking data of objects in the scenes. In contrast with many datasets which provide tracking from on-board autonomous systems, the off-board automatic labeling system provides higher accuracy as it is not constrained to run in real time. These high quality tracks allow us to focus on understanding the complexity of object behavior, rather than on dealing with perception noise.

Evaluation of interactive prediction models requires metrics formulated for joint predictions, as motivated by recent work [32, 6, 33, 28]. In Section 4, we discuss existing work on generalizing metrics to the joint prediction case. We also propose a novel mean Average Precision (mAP) metric to capture the performance of models across different object types, prediction time scales, and trajectory shape buckets (e.g., u-turns, left turns). This method is inspired by metrics used in the object detection literature and overcomes limitations in currently adopted metrics.

We name our large-scale interactive motion dataset the WAYMO OPEN MOTION DATASET. It will be made publicly available to the research community, and we hope it will provide new directions and opportunities in developing motion forecasting models. We summarize the contributions of our work as follows:
• We release a large-scale dataset for motion forecasting research with specifically labeled interactive behaviors. The data is derived from high quality perception output across a large array of diverse scenes with rich annotations from multiple cities.
• We provide novel metrics for motion prediction analysis along with challenging benchmarks for both the marginal and joint prediction cases.

                          Lyft     NuSc    Argo     Inter    Ours
  #unique tracks          53.4m§   4.3k    11.7m‡   40k      7.64m
  Avg track length        1.8s§    -       2.48s‡   19.8s∗   7.04s††
  Time horizon            5s       6s      3s       3s       8s
  #segments               170k     1k      324k     -        104k
  Segment duration        25s      20s     5s       -        20s
  Total time              1118h    5.5h    320h     16.5h∗   574h
  Unique roadways         10km     -       290km    -        1750km††
  Sampling rate           10Hz     2Hz     10Hz     10Hz     10Hz
  #cities covered         1        2       2        6∗       6
  #object types           3        1†      1‡       1        3
  Boxes                   2D       3D      None     2D       3D
  3D maps                 -        -       ✓        -        ✓
  Offline perception      -        -       -        ✓        ✓
  Interactions            -        -       -        ✓        ✓
  Traffic signal states   ✓        -       -        -        ✓

Table 1: Comparison of popular behavior prediction and motion forecasting datasets. Specifically, we compare Lyft Level 5 [19], NuScenes [4], Argoverse [9], Interactions [38], and our dataset across multiple dimensions. #object types measures the number of types of objects for which motion trajectories are predicted. A dash "-" indicates that data is not available or not applicable. § Lyft Level 5 number of unique tracks and average track length are determined through private correspondence. † nuScenes [4] provides annotations for 23 object types (stationary vehicles are removed), but only vehicles are predicted. ‡ Argoverse [9] provides annotations for 15 object types (Appendix B), but only vehicles are predicted. The number of unique tracks is determined through private correspondence. The average track length is estimated from data. ∗ Interactions [38] gathered data from 4 countries including 6 cities (the latter statistic was collected through personal communication), and the entire dataset is not divided into segments. The average track length is estimated from data. †† Our average track length is computed on the 20s segments of the training split. Our total unique roadway distance is calculated by hashing our autonomous vehicle poses as UTM coordinates into 25 meter voxels and counting the number of non-zero voxels.

2. Related Work

Motion forecasting datasets. Several existing public datasets have been developed with the primary goal of motion forecasting in real-world urban driving environments, compared in Table 1. The datasets vary in size measured in number of scenes, total time, total miles, number of tracked objects, and number of distinct time segments. While Lyft Level 5 [19] has the most hours of data and NuScenes [4] has a rich object taxonomy, they were not collected to capture a wide diversity of complex and interactive driving scenarios. Argoverse [9] was collected for interesting behaviors by biasing sampling towards certain observed behaviors (e.g., lane changes, turns) and road features (e.g., intersections). The INTERACTION dataset [38] manually selected a small set of specific driving locations (e.g., roundabouts) and times of day (e.g., rush hour) to obtain a dataset with high interaction complexity. We explain our own methodology for collecting interactions in Section 3.1.

Another salient dataset attribute is the time horizon for prediction. Our dataset's forecasting horizon is 8 seconds into the future, considerably longer than others (3 or 5 seconds), as we believe that long term forecasting is necessary for safe and human-like planning, and is intrinsically more difficult. Finally, most datasets are auto-labeled with industry-grade, onboard 3D perception stacks, employing LiDARs, cameras, and/or radar, and provided as-is with noisy state estimates and tracking errors. One exception is the INTERACTION dataset [38], which collects data from drone footage that is then post-processed offline with detection, tracking, and track smoothing. We also put considerable effort into creating high quality state estimates and 3D tracks by employing an offboard 3D detection and tracking pipeline, as discussed in Section 3.3.

We consider perception datasets (e.g., KITTI [15], Waymo Open Dataset [31]) outside of the scope of this discussion as they do not contain enough motion data to build sufficiently complex models. We also note that there are a host of other motion forecasting datasets which, while popular, are orders of magnitude smaller, have O(10) unique locations, and/or are not focused on the driving environment, for example the Stanford Drone Dataset [29], NGSIM [10], ETH [24], UCY [21], and Town Center [2].

3. Dataset

The dataset provides high quality object tracks generated using an offboard perception system (described in Section 3.3) along with both static and dynamic map features to provide context for the road environment. Object track states are sampled at 10 Hz. Each state includes the object's bounding box (3D center point, heading, length, width, and height) and the object's velocity vector. Due to sensor range or occlusion, measurements of an object's state may not exist at some time steps. A valid flag is provided to indicate which time steps have valid measurements. Map data is provided as a set of polylines and polygons created from curves sampled at a resolution of 0.5 meters. Static map feature types include lane centers, lane boundary lines, road edges, stop signs, crosswalks, and speed bumps. Traffic signal states and the lanes they control are included. In addition to the geometry data, map features also contain additional data specific to each feature type, e.g., lane boundaries have a field indicating whether they are a broken white boundary, a double yellow boundary, etc.

Starting with 20 second segments that are specifically mined for interactions as described in Section 3.1, we create 9.1 second (91 steps at 10 Hz) scenes, splitting the data into a 70% training, 15% validation, and 15% test set.
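The scene format described above (fixed-rate per-agent state tensors plus per-step validity flags) lends itself to a simple array representation. The following is an illustrative sketch of how such data might be handled, assuming hypothetical field layouts and function names; it is not the dataset's released schema or tooling.

```python
import numpy as np

# Illustrative constants for the scene format described above; the layout
# (9 state channels, velocity in channels 7:9) is a hypothetical choice.
NUM_STEPS = 91   # 9.1 s at 10 Hz
STATE_DIM = 9    # x, y, z, heading, length, width, height, vx, vy

def average_speed(states: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Mean speed per agent, computed over valid time steps only.

    states: [num_agents, num_steps, STATE_DIM] float array.
    valid:  [num_agents, num_steps] boolean mask (False where the object
            was occluded or out of sensor range).
    """
    speeds = np.linalg.norm(states[..., 7:9], axis=-1)   # [A, T]
    counts = np.maximum(valid.sum(axis=-1), 1)           # avoid divide-by-zero
    return np.where(valid, speeds, 0.0).sum(axis=-1) / counts

# Toy example: one agent, valid for the first 3 of 5 steps; the two
# trailing vx values are garbage and must be masked out.
states = np.zeros((1, 5, STATE_DIM))
states[0, :, 7] = [2.0, 4.0, 6.0, 99.0, 99.0]
valid = np.array([[True, True, True, False, False]])
print(average_speed(states, valid))  # -> [4.]
```

Masking with the valid flag, rather than treating missing steps as zeros, matters for exactly the statistics discussed in this section (speed profiles, trajectory shapes).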
We derive two versions of the validation and test sets, which we refer to as the standard and interactive versions. The standard validation and test sets provide up to 8 objects to predict in each scene. Selection is biased to require objects that do not follow a constant velocity model or straight paths. The interactive versions of the validation and test sets focus on the interactive portion of the segment and require only the 2 mined interactive objects to be predicted. The original 20 second segments are also provided for research requiring longer time frames.

Jointly consistent multi-agent forecasting. Most existing models output independent future distributions per object in a scene, e.g., [1, 3, 7, 5, 8, 12, 11, 14, 17, 20, 22, 25, 39]. This is encouraged by the popular metrics, which only measure quality on a per-object level, and by datasets that only require predicting one agent per scene. An important note is that these methods do model interactions between objects to achieve better performance, but explicitly modeling joint futures is much less common. There are a few exceptions which model jointly consistent futures: PRECOG [28] and MFP [33] employ models which roll out trajectory samples timestep by timestep, where each agent's next-step sample conditions on all other agents' current and past steps. In contrast, ILVM [6] (also used by TrafficSim [32]) samples from a latent variable from which multiple steps of future joint samples for all agents are decoded, without explicit conditioning on each step of the rollout. These works all measure a stricter version of distance error metrics, reporting the per-agent error of the best joint configuration. It is important to note that none of the datasets in Table 1 provide such joint metrics in their release, in contrast to our WAYMO OPEN MOTION DATASET.

3.1. Mining for interesting scenarios

We mine for interesting scenarios by first hand-crafting semantic predicates involving agents' relationships, e.g., "agent A changed lanes at time t" and "agents A and B crossed paths with a time gap t and relative heading difference θ". These predicates can be composed to retrieve more complex queries in an efficient SQL and relational database framework on an overall data corpus orders of magnitude larger than the resulting curated WAYMO OPEN MOTION DATASET.

With this framework, we specifically mined for the following pairwise interaction scenarios: merges, lane changes, unprotected turns, intersection left turns, intersection right turns, pedestrian-vehicle interactions, cyclist-vehicle interactions, interactions with close proximity, and interactions with high accelerations. The pair of interacting objects is annotated within the dataset in each scenario, and the interaction happens close to the 10 s mark of the 20 s clip.

Figure 2: Our dataset contains many agents including pedestrians and cyclists. Top: 46% of scenes have more than 32 agents, and 11% of scenes have more than 64 agents. Bottom: In the standard validation set, 33.5% of scenes require at least one pedestrian to be predicted, and 10.4% of scenes require at least one cyclist to be predicted. (Top panel: percent of scenes vs. number of agents; bottom panel: fraction of validation scenes vs. predicted agents per scene, split by vehicles, pedestrians, and cyclists.)

Figure 3: Agents selected to be predicted have diverse trajectories. Left: Ground truth trajectory of each predicted agent in a frame of reference where all agents start at the origin with heading pointing along the positive X axis (pointing up). Right: Distribution of maximum speeds achieved by all of the agents along their 9 second trajectory. Plots depict variety in trajectory shapes and speed profiles.

3.2. Dataset statistics
In contrast with many existing datasets that provide a limited number of agents per scene or agent types, we provide more diverse scenes in terms of the number and types of agents, reflecting many complicated real world driving scenarios like city driving and busy intersections. We show the distribution of the number of agents per scene (Figure 2, top). All scenes have at least one vehicle, 57% of scenes have at least one pedestrian (with 20% having four or more), and 16% of scenes have at least one cyclist.

In addition to accurately predicting the motion of other vehicles, to drive safely an autonomous vehicle must also accurately predict the motion of other road agents like pedestrians and cyclists. To support this, our dataset contains rich interactions between vehicles, pedestrians, and cyclists, and users of this dataset must be able to accurately predict the trajectories of all three classes, which is not the case in previous datasets [9, 4, 38]. We show the frequency of scenes in which we ask the model to predict each class in the validation set (Figure 2, bottom). Notably, 38.3% of scenes in the validation set require the model to predict more than one type of agent (e.g., a vehicle and a pedestrian or cyclist), and 4.9% of scenes require a model to predict trajectories for all three classes. Finally, in the interactive validation set, where we task the model with predicting the joint future trajectories of two interacting agents, 77.5% of scenes involve two interacting vehicles, 14.9% of scenes involve a vehicle interacting with a pedestrian, and 7.6% of scenes involve a vehicle interacting with a cyclist.

Finally, a motion forecasting dataset should contain diverse scenarios, trajectories, and agent interactions. Table 1 shows that we gather data across a large range of roadways. Figure 3 visualizes the future ground-truth trajectories and maximum speeds of agents we task the models with predicting. These agents represent a wide range of trajectory shapes, speeds, and behaviors, which we believe accurately captures the many different behavioral modes for each class.

3.3. Offboard perception system

Modern motion forecasting systems require a large amount of training data to imitate human maneuvers in complex real-world scenarios. Recently released datasets for motion forecasting [9, 18, 4] are orders of magnitude larger than popular 3D perception datasets [4, 19, 31, 15]. However, manually annotating datasets at such large scales not only incurs exorbitant cost but also takes a tremendous amount of time [26, 36]. Constrained by the high cost, most existing motion forecasting datasets [9, 18] directly employ onboard perception output as ground truth for trajectory prediction. But limited by the onboard perception system performance, such annotated 3D object tracks may have a high degree of state estimation error, lack temporal kinematic consistency, or under-/over-segment tracks.

In this work, we aim to alleviate the perception quality bottleneck in existing motion datasets captured by autonomous vehicles and propose using the recently introduced offboard algorithms [26, 36] to automatically generate high-quality motion labels, allowing motion forecasting algorithms to focus on the subtle dynamics and interactions of agents instead of overcoming the noise generated by a constrained, onboard perception system. Compared to its onboard counterpart, offboard perception has two major advantages: 1) it can afford much more powerful models running on ample computational resources; and 2) it can maximally aggregate complementary information from different views by exploiting the full point cloud sequence, including both history and future. Thanks to those advantages, the offboard perception system has shown superior perception accuracy compared to onboard detectors [26], and we have further validated its quality in Section 5.3.

The offboard perception system [26] employed contains three steps: (1) a 3D object detector generates object proposals from each LiDAR frame; (2) a multi-object tracker links detected objects throughout the LiDAR sequence; (3) for each object, an object-centric refinement network processes the tracked object boxes and their point clouds across all frames in the track, and outputs temporally consistent and accurate 3D bounding boxes of the object in each frame.

4. Metrics

To measure the accuracy of motion predictions we use a suite of five metrics, which we extend to handle joint predictions over multiple agents as proposed by a few related works [33, 6, 28]. Several common metrics report a minimum error within a trajectory set; when generalized, the joint metric analog takes the minimum over the best joint configuration of trajectories from a group of agents. We report standard trajectory-set distance error metrics minADE, minFDE, and Miss Rate (MR), with a custom definition of a match explained below. We also report overlap rate (OR) to measure the frequency of predicted tracks' extents overlapping with others'. Finally, inspired by the detection literature, we propose an Average Precision (AP) metric according to the defined MR to measure the precision and recall performance of models across different confidence values.

minADE. Let ŝ = {ŝ_{a,t}} denote the ground truth over agents a = 1…A and future time steps t = 1…T, and let s^k = {s^k_{a,t}} denote the k-th of K joint predictions; the individual object prediction task is the special case where each joint prediction contains only a single agent, A = 1. The minimum Average Displacement Error computes the L2 norm between ŝ and the closest joint prediction:

    minADE = (1 / (T·A)) · min_k Σ_a Σ_t ‖ŝ_{a,t} − s^k_{a,t}‖_2.

minFDE. The minimum Final Displacement Error is equivalent to evaluating the minADE at the single final time step T:

    minFDE = (1 / A) · min_k Σ_a ‖ŝ_{a,T} − s^k_{a,T}‖_2.

Overlap rate (OR). The overlap rate is computed by taking the highest confidence joint prediction from each multi-modal joint prediction. If any of the A agents in the jointly predicted trajectories overlaps at any time with any other object that was visible at the prediction time step (compared at each time step up to T), or with any of the other jointly predicted trajectories, it is counted as a single overlap. The overlap rate is computed as the total number of overlaps divided by the total number of multi-modal joint predictions. The overlap is calculated using box intersection, with box extents taken as the current time step's estimates, and heading inferred from consecutive waypoint position differences. See the supplementary material for details.

Miss rate (MR). A binary match/miss indicator function IsMatch(ŝ_{a,t}, s^k_{a,t}) is assigned to each sample waypoint at a time t. The average over the dataset gives the miss rate at that time step. Our dataset asks models to predict an 8 second trajectory for agents with varying speed profiles. Therefore, a single distance threshold to determine IsMatch is insufficient: we want a stricter criterion for slower-moving and closer-in-time predictions, and also different criteria for lateral deviation (e.g., wrong lane) versus longitudinal deviation (e.g., wrong speed profile). For a particular joint configuration, a miss is assigned for time t if any of the trajectories fails to match its ground truth trajectory:

    MR_t = min_k ∨_a ¬IsMatch(ŝ_{a,t}, s^k_{a,t}).

We implement IsMatch with separate lateral and longitudinal thresholds, which scale as a clamped linear function of future time and velocity. See the supplementary material for details.

Mean average precision (mAP). The Average Precision computes the area under the precision-recall curve by applying confidence score thresholds c_k across a validation set, and using the definition of Miss Rate above to define true positives, false positives, etc. Consistent with object detection mAP metrics [23], only one true positive is allowed for each object and is assigned to the highest confidence prediction.
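The joint generalization of these minimum-over-K metrics can be sketched directly from the definitions above. The following is an illustrative reimplementation, not the official evaluation code; in particular, a single fixed distance threshold stands in for the speed- and time-scaled lateral/longitudinal IsMatch described in the text.

```python
import numpy as np

def joint_min_ade(gt, preds):
    """minADE over K joint hypotheses.

    gt:    [A, T, 2] ground truth for A jointly predicted agents.
    preds: [K, A, T, 2] K joint trajectory hypotheses.
    Returns (1 / (T*A)) * min_k sum_a sum_t ||gt_{a,t} - preds[k]_{a,t}||_2.
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)            # [K, A, T]
    return dists.sum(axis=(1, 2)).min() / (gt.shape[0] * gt.shape[1])

def joint_miss(gt, preds, threshold=2.0):
    """Simplified joint miss indicator at the final step T.

    A hypothesis k matches only if *every* agent's endpoint lies within
    `threshold` meters; a miss is declared when no hypothesis matches.
    (A single fixed threshold is used here for illustration.)
    """
    end_dists = np.linalg.norm(preds[:, :, -1] - gt[None, :, -1], axis=-1)  # [K, A]
    return not np.any(np.all(end_dists <= threshold, axis=1))

# Two agents with straight-line ground truth; hypothesis 0 is exact,
# hypothesis 1 shifts agent 1 by 5 m.
T = 10
gt = np.stack([np.stack([np.arange(T, dtype=float), np.zeros(T)], -1),
               np.stack([np.zeros(T), np.arange(T, dtype=float)], -1)])
preds = np.stack([gt, gt + np.array([[[0.0, 0.0]], [[5.0, 0.0]]])])
print(joint_min_ade(gt, preds))  # -> 0.0 (hypothesis 0 is exact)
print(joint_miss(gt, preds))     # -> False
```

Note the defining property of the joint versions: the minimum is taken over whole joint configurations, so one hypothesis must be simultaneously good for all A agents, rather than mixing the best per-agent trajectories from different hypotheses.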
We then account for imbalanced data by reporting mean AP (mAP) over different semantic trajectory motion types. Further inspired by the object detection literature [13], we seek an overall metric balanced over semantic buckets, some of which may be much more infrequent (e.g., u-turns), so we report the mean AP over different driving behaviors. The final mAP metric averages over eight different ground truth trajectory shapes: straight, straight-left, straight-right, left, right, left u-turn, right u-turn, and stationary.

Formally, for each sample, a model makes K possibly joint predictions S_k, k ∈ 1…K. Each S_k contains a scalar confidence c_k and a trajectory s^k = {s^k_{a,t}} for t = 1…T future time steps and a = 1…A agents. Similarly, the ground truth is denoted ŝ = {ŝ_{a,t}}. The individual object prediction task becomes a special case of this formulation where each joint prediction contains only a single agent, A = 1.

  Set          Model        rg  ts  hi   Vehicle               Pedestrian            Cyclist
                                         minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑
  Standard     Const.Vel.               11.0     0.95  0.02   1.55     0.60  0.07   4.17     0.82  0.02
  Validation   LSTM                     2.63     0.67  0.07   0.73     0.22  0.15   1.86     0.60  0.07
               LSTM         ✓           1.67     0.40  0.16   0.74     0.18  0.18   1.50     0.40  0.12
               LSTM             ✓       1.54     0.32  0.19   0.66     0.14  0.23   1.36     0.31  0.17
               LSTM         ✓   ✓       1.36     0.26  0.22   0.63     0.14  0.23   1.29     0.30  0.18
               LSTM         ✓       ✓   1.52     0.31  0.18   0.65     0.15  0.20   1.34     0.33  0.15
               LSTM         ✓   ✓   ✓   1.34     0.25  0.23   0.63     0.13  0.23   1.26     0.29  0.21
  Standard     Const.Vel.               11.0     0.95  0.02   1.58     0.60  0.06   4.12     0.83  0.03
  Test         LSTM         ✓   ✓   ✓   1.34     0.24  0.24   0.64     0.13  0.22   1.29     0.28  0.20

Table 2: Marginal metrics on the standard validation and test set. All metrics are computed at 8 s. rg stands for road graph information, ts stands for traffic signal state information, and hi stands for high-order interactions between agents' features. The constant velocity baseline employs K = 1 predicted trajectories; all other models employ K = 6.

  Set          Model        rg  ts  hi   Vehicle               Pedestrian            Cyclist
                                         minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑   minADE↓  MR↓   mAP↑
  Interactive  Const.Vel.               10.3     0.98  0.00   3.62     1.00  0.00   6.35     1.00  0.00
  Validation   LSTM                     4.16     0.88  0.01   2.45     0.93  0.02   4.00     0.98  0.00
               LSTM         ✓           2.89     0.75  0.06   2.22     0.93  0.01   3.75     0.94  0.01
               LSTM             ✓       2.94     0.75  0.04   2.39     0.86  0.06   3.30     0.88  0.02
               LSTM         ✓   ✓       2.45     0.66  0.06   2.22     0.86  0.03   3.02     0.83  0.03
               LSTM         ✓       ✓   2.92     0.75  0.04   2.69     0.93  0.10   3.24     0.89  0.01
               LSTM         ✓   ✓   ✓   2.42     0.66  0.08   2.73     1.00  0.00   3.16     0.83  0.01
  Interactive  Const.Vel.               10.3     0.98  0.01   4.56     1.00  0.00   6.21     1.00  0.00
  Test         LSTM         ✓   ✓   ✓   2.46     0.67  0.08   2.47     0.89  0.00   2.96     0.89  0.01

Table 3: Joint metrics on the interactive validation and test set. See Table 2 for abbreviations and details. Note that these metrics indicate that the interactive split is systematically more challenging.

5. Experiments

In this section, we evaluate various baseline models on the WAYMO OPEN MOTION DATASET to investigate the importance of rich map annotations (e.g., 3D road graph, traffic signal states), interaction context, and joint modeling (Section 5.1). We then compare the standard validation and interactive validation datasets on conditional behavior prediction metrics to show that the interactive validation dataset is both more challenging and more interactive (Section 5.2). Furthermore, we show that our offboard perception system achieves an accuracy and perception noise reduction similar to human labels (Section 5.3). Finally, to provide insight on the performance measurement of motion prediction tasks, we empirically analyze minADE vs. mAP on their ability to reflect the quality of confidence score calibration (Section 5.4).

5.1. Baseline model performances

In this section, we evaluate several baseline models on the proposed dataset. First, we consider a constant velocity model in which we assume the agent will maintain its velocity at the current timestamp for all future steps. Second, we consider a family of deep-learned models using various encoders, with a base architecture of an LSTM to encode a 1-second history of observed state [16, 1]; this includes agents' positions, velocities, and 3D bounding boxes. In order to measure the importance of particular additional features, we selectively provide additional information:
• Road graph (rg): Encode the 3D map information with polylines following [14].
• Traffic signals (ts): Encode the traffic signal states with an LSTM encoder as an additional feature.
• High-order interactions (hi): Model the high-order interactions between agents with a global interaction graph following [14].
In experiments, combinations of these encodings are concatenated together to create a per-agent embedding in agent-centered coordinates. We decode K = 6 trajectories for output using another MLP with a min-of-K loss [12, 34]. See the supplementary material for details.

In Tables 2 and 3, we report the marginal metrics on the standard validation/test set and the joint metrics on the interactive validation/test set, respectively. Specifically, minADE, miss rate, and mAP at 8 s are chosen as the representative metrics, and we break them down across the 3 object types. The constant velocity model performs quite poorly, e.g., achieving double digit minADE on vehicles. This shows that our dataset contains nontrivial trajectories.

We then investigate the importance of encoding 3D map information, traffic signal states, and high-order interactions between agents. Intuitively, they should all benefit motion forecasting, and this is indeed supported by the experimental results. For example, on the standard validation set (Table 2) for vehicle trajectory prediction, minADE improves from 2.63 to 1.34 and mAP improves from 0.07 to 0.23 when incrementally adding more information in this order. The same trend holds for pedestrians and cyclists as well.

We only evaluate joint metrics on the interactive sets. Since making joint predictions is a relatively new practice, there are no mature, established baselines. In Table 3, we reuse the models trained to make K marginal predictions; but when evaluating on the 2 interactive agents, we select the top K among the K² possibilities based on the product of predicted probabilities, as described in [6]. The overall low performance in Table 3 can be attributed to at least 3 factors: the higher difficulty level of the mined interactive agents; the requirement to make good predictions for both agents as dictated by the joint version of the metrics; and the fact that the predictions are post-hoc manipulations rather than the result of true joint training.

We have argued the importance of jointly predicting interactive behaviors. In Table 4 we provide a direct comparison between a base LSTM (without rg, ts, or hi) trained to make marginal or joint predictions for the 2 interactive agents. In converting the marginal model to making joint predictions, the neural features for the 2 interactive agents are concatenated with each other to provide the minimal necessary context; the sum of their individual distances to the ground truth (while matching the pairs of trajectories jointly) is used for training; and the confidence scores are jointly predicted for each pair of trajectories to ensure consistency. When evaluated on the interactive set using joint metrics, this joint model performs favorably against its marginal counterpart. We hope this preliminary experiment can motivate further development of joint models on our dataset, especially the interactive set.

              Vehicle minADE↓        Vehicle mAP↑
  Model       3s     5s     8s       3s     5s     8s
  Marginal    0.65   1.66   4.16     0.08   0.07   0.01
  Joint       0.65   1.59   3.81     0.10   0.06   0.03

Table 4: Joint modeling is advantageous on interactive agents. Numbers are from the interactive validation set.

5.2. Quantifying interactivity

Following [35], we use Conditional Behavior Prediction (CBP) to quantify the interactivity in our dataset. [35] introduces a model that can produce either unconditional predictions or predictions conditioned on a "query trajectory" for one of the agents in the scene. If two agents are not interacting, then one's actions have no effect on the other, so knowledge of that agent's future should not change predictions for the other agent. Thus, [35] defines the degree of influence agent A has on agent B as the KL divergence between the unconditional predictions for B and the predictions for B conditioned on A's ground truth future trajectory.

We apply this framework to our interactive and standard validation datasets, computing the KL divergence between unconditional and conditional predictions for every query agent/target agent pair in the dataset. We find that the KL divergences are much larger in the interactive validation dataset than in the standard validation dataset. In particular, 73% of agent pairs in the interactive dataset have KL divergences greater than 10, and 45% have KL divergences greater than 50; in the standard dataset, these numbers are 48% and 28%, respectively. Figure 4 presents a full histogram of the KL divergences between unconditional and conditional predictions for each agent pair. Conditioning on a query agent's future trajectory makes little difference in the standard validation dataset but a large difference in the interactive validation dataset, providing evidence that the interactive dataset contains more cases where multiple agents are interacting with and influencing each other. For details on the CBP model, see the supplementary material.

Figure 4: The interactive split sees much larger improvements from conditional prediction. Each element in the histogram is one pair of query agent/target agent, and the x axis shows the KL divergence between the unconditional predictions on the target agent and the predictions for the target agent conditioned on the query agent's ground truth future. Note that both plots are normalized to the total number of agent pairs.

Figure 5: Distance error statistics of vehicle bounding boxes. We compare three sets of vehicle bounding boxes with the Waymo Open Dataset (WOD) ground truth boxes on the 5 selected run segments from the val set. (Per-panel statistics: recall 99.29% / 93.50% / 87.31%; mean DE 0.1849 / 0.1958 / 0.2738; std DE 0.2342 / 0.2721 / 0.3800.)
The statistics include the histogram of distance errors (capped at 0.8 m), the box recall (using a 3D IoU threshold of 0.03), the mean distance error, and the standard deviation (std) of the distance error. Only boxes with at least one point inside are considered. Note that the DE from different boxes are not directly comparable, as the recalls are different.

5.3. Analysis of perception data quality

In this section, we study the quality of our offboard perception system and compare it with two alternatives: human labels and baseline detector boxes. Following [26], we conduct a study on the same five validation set run segments from the Waymo Open Dataset (WOD), re-labeled by three additional independent human labelers. With the duplicate human labels, we can analyze the human label consistency to understand the "background noise" in label accuracy. Instead of comparing detection results in average precision [26], we evaluate the box distance errors (DE) in meters by comparing to the original WOD ground truth boxes.

Figure 5 shows that offboard perception achieves an accuracy and distance error distribution similar to human labels. We also show the distance errors of boxes obtained from a baseline detector (Multi-View Fusion [40]) with a Kalman filter-based tracker (the same tracker used in the offboard perception). Using the baseline (onboard) detector leads to a significantly higher mean distance error; this increased perception noise indicates a higher lower bound on the minADE that a behavior model can achieve.

5.4. Comparing mAP with minADE

While minADE is widely adopted for performance measurement in motion forecasting tasks [9, 8, 14, 39], it fails to measure the quality of confidence score calibration in the trajectory prediction. In contrast, the mAP metric described in Section 4 provides a measurement of the quality of the confidence score calibration by design. In this section, we perform an analysis of minADE vs. mAP with increasing numbers of predictions at different time steps to show that minADE does not provide a full picture of the model performance, while mAP provides more insight.
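A minimal sketch (toy data, not the benchmark evaluation code) makes the minADE behavior concrete: since minADE takes the minimum over the K hypotheses, adding any candidate can only lower it or leave it unchanged, with no penalty for poorly calibrated confidences.

```python
import math

def min_ade(pred_trajs, gt_traj):
    """minADE: smallest average displacement error over a set of
    predicted trajectories (each a list of (x, y) waypoints)."""
    def ade(traj):
        return sum(math.dist(p, g) for p, g in zip(traj, gt_traj)) / len(gt_traj)
    return min(ade(t) for t in pred_trajs)

# Toy example: ground truth and three candidate predictions.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # offset by 1 m everywhere
    [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)],   # offset by 0.5 m
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],   # exact match
]

# minADE is monotone non-increasing in the number of predictions K.
for k in range(1, len(preds) + 1):
    print(k, min_ade(preds[:k], gt))
# prints: 1 1.0 / 2 0.5 / 3 0.0
```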
As shown in Figure 6, minADE artificially improves as the number of predictions increases, while the mAP value peaks at 3 predictions for 3 s and 5 s, and at 6 predictions for 8 s. The minADE scores may improve so long as any of the predictions are good, regardless of their confidence scores. In contrast, mAP penalizes high-confidence false positive predictions and does not continue to improve with the number of predictions. Precision-recall curves for these experiments are shown in the supplementary material.

Figure 6: Comparison of minADE and mAP across increasing numbers of predictions. Using the best LSTM baseline model in Section 5.1, the minADE for vehicles at 3 s, 5 s, and 8 s (top) artificially improves as one allows for increasing numbers of predictions K. Conversely, the mAP (bottom) saturates as the model must produce high quality confidence estimates in addition to accurate trajectories.

6. Discussion

In this work we release the WAYMO OPEN MOTION DATASET, a large-scale motion forecasting dataset containing data mined for interactive behaviors across a diverse set of road geometries from multiple cities. The data comes with rich 3D object state and HD map information. Object tracks are generated with a state-of-the-art offboard automatic labeling system which is significantly higher fidelity than typical onboard 3D perception stacks. For evaluation we outline a set of metrics for both per-agent and joint trajectory predictions, including a novel mAP metric to measure precision-recall performance in a balanced way across semantic driving behavior buckets. We provide baseline models for both individual and interactive prediction tasks, which we hope provides great opportunities for advancing motion forecasting research.

Acknowledgements

We thank Paul Hempstead, David Margines, Dietmar Ebner, Peter Pawlowski, Balakrishnan Varadarajan, Avikalp Srivastava, Zhifeng Chen, and Rebecca Roelofs for their comments and suggestions.
Additionally, we thank the larger Google Brain team and Waymo Research teams for their support.

References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[2] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR 2011, pages 3457–3464. IEEE, 2011.
[3] Thibault Buhet, Emilie Wirbel, and Xavier Perrotton. PLOP: Probabilistic polynomial objects trajectory planning for autonomous driving. arXiv preprint arXiv:2003.08744, 2020.
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[5] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. SpAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9491–9497. IEEE, 2020.
[6] Sergio Casas, Cole Gulino, Simon Suo, Katie Luo, Renjie Liao, and Raquel Urtasun. Implicit latent variable model for scene-consistent motion forecasting. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020.
[7] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. In Conference on Robot Learning, pages 947–956. PMLR, 2018.
[8] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
[9] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.
[10] Benjamin Coifman and Lizhe Li. A critical evaluation of the next generation simulation (NGSIM) vehicle trajectory dataset. Transportation Research Part B: Methodological, 105:362–377, 2017.
[11] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, and Nemanja Djuric. Deep kinematic models for kinematically feasible vehicle trajectory predictions. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 10563–10569. IEEE, 2020.
[12] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA), pages 2090–2096. IEEE, 2019.
[13] M. Everingham, L. Gool, C. K. Williams, J. Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303–338, 2009.
[14] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In CVPR, 2020.
[15] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[16] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[17] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[18] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. arXiv preprint arXiv:2006.14480, 2020.
[19] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft Level 5 perception dataset 2020. https://level5.lyft.com/dataset/, 2019.
[20] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017.
[21] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer Graphics Forum, volume 26, pages 655–664. Wiley Online Library, 2007.
[22] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. arXiv preprint arXiv:2007.13732, 2020.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[24] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision, pages 261–268. IEEE, 2009.
[25] Tung Phan-Minh, Elena Corina Grigore, Freddy A. Boulton, Oscar Beijbom, and Eric M. Wolff. CoverNet: Multimodal behavior prediction using trajectory sets. arXiv:1911.10298, 2019.
[26] Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3D object detection from point cloud sequences, 2021.
[27] Nicholas Rhinehart, Kris M. Kitani, and Paul Vernaza. R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 772–788, 2018.
[28] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2821–2830, 2019.
[29] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In European Conference on Computer Vision, pages 549–565, 2016.
[30] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. arXiv preprint arXiv:2001.03093, 2020.
[31] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
[32] Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. TrafficSim: Learning to simulate realistic multi-agent behaviors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[33] Charlie Tang and Russ R. Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
[34] Luca Anthony Thiede and Pratik Prabhanjan Brahma. Analyzing the variety loss in the context of probabilistic trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9954–9963, 2019.
[35] Ekaterina Tolstaya, Reza Mahjourian, Carlton Downey, Balakrishnan Vadarajan, Benjamin Sapp, and Dragomir Anguelov. Identifying driver interactions via conditional behavior prediction. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.
[36] Bin Yang, Min Bai, Ming Liang, Wenyuan Zeng, and Raquel Urtasun. Auto4D: Learning to label 4D objects from sequential point clouds, 2021.
[37] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2636–2645, 2020.
[38] Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Naumann, Julius Kümmerle, Hendrik Königshof, Christoph Stiller, Arnaud de La Fortelle, and Masayoshi Tomizuka. INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps. arXiv:1910.03088 [cs, eess], 2019.
[39] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, et al. TNT: Target-driven trajectory prediction. arXiv preprint arXiv:2008.08294, 2020.
[40] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3D object detection in lidar point clouds. In Conference on Robot Learning, pages 923–932, 2020.

A. Motion Forecasting Metrics

Distance error metrics are the most commonly used to compare methods, capturing how closely a predicted trajectory (a discrete-time sequence of states) matches a future object track under Euclidean distance. The most common is Average Displacement Error (ADE) [1, 24]. Because the future is inherently stochastic and multi-modal, most models output a (weighted) set of trajectory hypotheses, and then a minimal error over the set (of constrained size) is reported (i.e., minADE [9]). For methods that provide explicit or implicit future probability distributions, the likelihood of the ground truth future trajectory can be used as a metric [8, 30, 27, 28]. Framing the problem instead as one of detection of future locations, Argoverse [9] employs Miss Rate within 2 meters as its primary metric, which has the benefit of being tolerant to outliers. A number of metrics including minADE have been extended for use with jointly predicted agent trajectories [6].
B. Dataset Splits

The dataset provides 6 different splits of the original set of 20 second scenarios. The scenarios are first split into training, validation, and test sets. This is done by hashing a string containing the date of the data capture and the unique ID of the vehicle used to capture the data. The hashed values are split into mutually exclusive 70% training, 15% validation, and 15% testing subsets of the 20 second scenarios. From these 3 subsets we generate examples by extracting 9.1 second windows from the longer 20 second scenarios.
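The date-and-vehicle hashing scheme can be sketched as follows. The key fields (capture date plus vehicle ID) come from the text above, but the exact key format and hash function are not specified there, so both are assumptions in this illustration:

```python
import hashlib

def split_for(capture_date: str, vehicle_id: str) -> str:
    """Deterministically assign a scenario to train/val/test by hashing a
    key built from the capture date and vehicle ID (key format assumed)."""
    key = f"{capture_date}:{vehicle_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    if bucket < 70:
        return "train"   # 70%
    elif bucket < 85:
        return "val"     # 15%
    return "test"        # 15%

print(split_for("2019-03-01", "veh_042"))
```

Hashing on capture metadata rather than on individual windows keeps all examples from one drive in the same subset, which avoids near-duplicate leakage between train and test.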
Each 9.1 second window contains 91 time steps at 10 Hz: 10 history samples, 1 sample at the current time, and 80 future steps. We extract 5 different sets of windowed examples from the respective 20 second splits: training, validation, testing, validation interactive, and testing interactive. The training set contains 9.1 second windows starting at times {0, 2, 4, 5, 6, 8, 10} seconds within the 20 second scenarios. The validation and testing sets contain 9.1 second windows starting at times {0, 5, 10} seconds. The validation interactive and testing interactive sets contain 9 second windows starting at times {4, 5, 6} seconds to focus on the interactive portion of the scenario. The 5 windowed sets are included in the published dataset along with the full 20 second training set. Each of the windowed sets contains a list of objects in the scene to be predicted. The training, validation, and testing sets contain up to 8 objects per scenario, chosen to include at least 2 objects of each type if available. Selection is biased to include objects that do not follow a constant velocity model or straight paths. For the validation interactive and testing interactive sets, only the mined interactive agent pair objects are included in the list of objects to predict. In addition, each object to predict has a difficulty level based on how easily it is predicted by an LSTM extrapolation model.

C. Metrics Details

Overlap rate (OR) details. A binary indicator is assigned to each sample alerting of self-overlapping; the average over the dataset gives the overlap rate. We only consider the highest scoring joint prediction p̃ here. Our metric counts an overlap with the following criteria: given the joint predicted trajectories of A agents, an overlap is counted if the rotated bounding box of any of the A agents overlaps with any other visible object at any time step within the prediction interval T. Note that agents not visible at prediction time (due to their later appearance) are not considered for potential overlaps. Consider G_t = \{\tilde{s}_{a,t} \,\forall a,\; g_{b,t} \,\forall b \in 1 \ldots B\}, where \tilde{s}_{a,t} are waypoints from p̃ at time t, and g_{b,t} are ground truth waypoints from B nearby environmental agents. The single overlap indicator is defined as:

\mu_{OR}(e) = \sum_{t} \sum_{a} \sum_{s' \in G_t \setminus \tilde{s}_{a,t}} 1[\mathrm{IOU}(b(\tilde{s}_{a,t}), b(s')) > 0]   (1)

where b(\cdot) is a function to derive a 5-DOF (x, y, width, length, and heading) bounding box from a waypoint. The ground truth bounding box is used for an environmental agent. For a predicted waypoint \tilde{s}_{a,t}, we derive the heading from the derivative to the previous waypoint and use the ground truth bounding box sizes. \mathrm{IOU}(\cdot) computes the intersection-over-union between two 5-DOF boxes.

Miss rate (MR) details. The indicator function f(\cdot) in (1) is defined as follows:

f(\cdot) = 1[x_a^k > \lambda^{lon}] \vee 1[y_a^k > \lambda^{lat}], \quad [x_a^k, y_a^k] := (\hat{s}_a - s_a^k) \cdot R_a   (2)

where R_a is a 2D rotation matrix defined by the heading of agent a at timestamp 0, and \lambda^{lon} and \lambda^{lat} are longitudinal and lateral thresholds. Since agents can have different speeds at time 0, we scale these thresholds by their speed so that we do not over-penalize faster agents: \lambda^{lon} = \lambda_0^{lon} \gamma(v_x) and \lambda^{lat} = \lambda_0^{lat} \gamma(v_y), where \gamma(v) = \max(0, \min(1, (v - \upsilon_L)/(\upsilon_H - \upsilon_L)))/2 + 0.5. We set \upsilon_H to 11 m/s and \upsilon_L to 1.4 m/s. The thresholds, which depend on the prediction horizon T, are as follows:

                 \lambda^{lat}   \lambda^{lon}
T = 3 seconds        1               2
T = 5 seconds        1.8             3.6
T = 8 seconds        3               6
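The speed-dependent threshold scaling above can be sketched as follows. The constants (v_H, v_L, and the base thresholds) come from the text; the helper names are ours, and we read the threshold comparison as applying to displacement magnitudes, which is an assumption:

```python
# Sketch of the speed-scaled miss-rate thresholds (helper names are ours).
V_H, V_L = 11.0, 1.4          # m/s, from the text
BASE = {3: (1.0, 2.0), 5: (1.8, 3.6), 8: (3.0, 6.0)}  # T -> (lat, lon)

def gamma(v: float) -> float:
    """Scale factor in [0.5, 1.0] that grows linearly with speed."""
    return max(0.0, min(1.0, (v - V_L) / (V_H - V_L))) / 2 + 0.5

def is_miss(dx_lon: float, dy_lat: float, horizon_s: int,
            vx: float, vy: float) -> bool:
    """A waypoint is a miss if its longitudinal or lateral displacement
    (in the agent's frame at t=0) exceeds the speed-scaled threshold."""
    lat0, lon0 = BASE[horizon_s]
    return abs(dx_lon) > lon0 * gamma(vx) or abs(dy_lat) > lat0 * gamma(vy)

print(gamma(0.0))    # slow agent  -> 0.5 (halved thresholds)
print(gamma(11.0))   # fast agent  -> 1.0 (full thresholds)
print(is_miss(1.5, 0.2, 3, vx=0.0, vy=0.0))  # lon threshold 2*0.5=1.0 -> True
```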
D. Overlap Metric

We use a marginal overlap-based metric with the simple baseline models to quantify the difficulty and interactivity in our dataset. We consider a trajectory for an agent to contain an overlap if, at any time point, the agent bounding box overlaps with a ground-truth box at that time. The overlap rate is the number of agents whose trajectories have overlaps divided by the total number of predicted agents.

We compute the overlap rate for the constant velocity model and compare the performance between the regular split and the interactive split of the dataset. For the constant velocity model, we found that 38.4% of predicted vehicles in the regular split, and 44.2% of predicted vehicles in the interactive split, have trajectories that overlap with a ground truth box (Table 5). This shows that the interactive split is more challenging, and suggests more interactions between agents in that split.

Table 5: The interactive split of the data has more overlaps per scene.

                          Overlap Rate (Val. set)
Split         Model         Vehicle  Pedestrian  Cyclist
Regular       Const. Vel.   38.4%    29.8%       22.3%
              LSTM          27.9%    22.9%       22.1%
Interactive   Const. Vel.   44.2%    30.6%       27.0%
              LSTM          36.3%    32.3%       25.6%

Despite the interactive set only requiring predictions for two agents instead of up to eight agents for the regular dataset, the split contains more scenes where a constant velocity model or an LSTM model (neither of which models other agents) produces at least one overlap. Statistics are reported on the validation set for both dataset splits. The marginal-based overlap metric is used for both splits so that the rates can be compared across the splits. The constant velocity model only predicts a single trajectory per agent. For the LSTM model, the highest scoring trajectory for each agent is used.

E. Conditional Model Details

The model we use for conditional behavior prediction is based on the baseline model we describe in Section 5.1. Figure 7 provides an overview diagram of the proposed model. We use the LSTM encoder and all three enhancements (roadgraph encoding with polylines, traffic signal states encoded in an LSTM, and modeling high-order interactions with a global interaction graph). To make this model suitable for conditional predictions, we add an early fusion conditional encoder similar to [35]. Just like [35], we train the model to do both conditional and unconditional prediction by passing in a randomly selected query agent's ground truth future trajectory as the conditional query input in 95% of training samples, while providing no conditional query in the other 5%. We generate 6 predictions per agent and evaluate the KL divergence over the full 8 second future trajectory.

F. Videos

The included videos show visualizations of some sample scenarios from the dataset, including those in Figure 1a and Figure 1b.
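The marginal overlap rate described in Appendix D can be sketched as below. This is our own illustrative helper code, and it uses axis-aligned IoU for brevity; the metric in the text uses rotated 5-DOF boxes:

```python
# Sketch of the marginal overlap rate: the fraction of predicted agents
# whose trajectory overlaps any ground-truth box at some time step.
def iou(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def overlap_rate(pred_boxes_per_agent, gt_boxes_per_step):
    """pred_boxes_per_agent: {agent_id: [box per time step]};
    gt_boxes_per_step: list of ground-truth boxes at each time step."""
    def has_overlap(traj):
        return any(iou(box, gt) > 0
                   for t, box in enumerate(traj)
                   for gt in gt_boxes_per_step[t])
    trajs = list(pred_boxes_per_agent.values())
    return sum(has_overlap(traj) for traj in trajs) / len(trajs)

preds = {"a": [(0, 0, 1, 1), (2, 0, 3, 1)],   # overlaps a gt box at t=1
         "b": [(5, 5, 6, 6), (7, 5, 8, 6)]}   # never overlaps
gts = [[(10, 10, 11, 11)], [(2.5, 0, 3.5, 1)]]
print(overlap_rate(preds, gts))  # -> 0.5
```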
Figure 7: Diagram of baseline architecture. An illustration of the baseline architecture employed for the family of learned models, with a base LSTM encoder for agent states. The three detachable components are a roadgraph polyline encoder [14], a traffic state LSTM encoder, and a high-order interactions encoder following [14]. The trajectories are predicted through an MLP with a min-of-k loss.

Figure 8: Precision versus recall curves for increasing numbers of predictions (K) for the polyline model at 3 seconds for vehicles across trajectory shape buckets (Stationary, Straight, Straight-Left, Straight-Right, Left-Turn, Right-Turn, Left-U-Turn) for the standard validation dataset. Recall increases with K but AUC decreases.
Figure 9: Precision versus recall curves for increasing numbers of predictions (K) for the polyline model at 5 seconds for vehicles across trajectory shape buckets for the standard validation dataset.

Figure 10: Precision versus recall curves for increasing numbers of predictions (K) for the polyline model at 8 seconds for vehicles across trajectory shape buckets for the standard validation dataset.