WOMD-LiDAR: Raw Sensor Dataset Benchmark for
Motion Forecasting
Kan Chen, Runzhou Ge, Hang Qiu, Rami Al-Rfou, Charles Qi, Xuanyu Zhou, Zoey Yang, Scott Ettinger,
Pei Sun, Zhaoqi Leng, Mustafa Baniodeh, Ivan Bogun, Weiyue Wang, Mingxing Tan, Dragomir Anguelov
arXiv:2304.03834v2 [cs.CV] 18 Feb 2024

Abstract—Widely adopted motion forecasting datasets substitute the observed sensory inputs with higher-level abstractions such as 3D boxes and polylines. These sparse shapes are inferred by annotating the original scenes with perception systems’ predictions. Such intermediate representations tie the quality of the motion forecasting models to the performance of computer vision models. Moreover, the human-designed explicit interfaces between perception and motion forecasting typically pass only a subset of the semantic information present in the original sensory input. To study the effect of these modular approaches, design new paradigms that mitigate these limitations, and accelerate the development of end-to-end motion forecasting models, we augment the Waymo Open Motion Dataset (WOMD) with large-scale, high-quality, diverse LiDAR data for the motion forecasting task.
The new augmented dataset (WOMD-LiDAR)1 consists of over 100,000 scenes, each spanning 20 seconds, with well-synchronized and calibrated high-quality LiDAR point clouds captured across a range of urban and suburban geographies. Compared to the Waymo Open Dataset (WOD), the WOMD-LiDAR dataset contains 100× more scenes. Furthermore, we integrate the LiDAR data into the motion forecasting model training and provide a strong baseline. Experiments show that the LiDAR data brings improvement in the motion forecasting task. We hope that WOMD-LiDAR will provide new opportunities for boosting end-to-end motion forecasting models.

*This work was done at Waymo LLC
1 https://waymo.com/open/data/motion/

(a) Sophisticated interactions with (left) and without LiDAR (right).
(b) Predicted trajectories with (left) and without LiDAR data (right).
Fig. 1: Human-interpretable labels from the perception system provide limited information at the scene level and the object level. In sophisticated scenes with interaction between multiple objects, raw sensor data provides rich information and helps improve the motion forecasting performance. Legends in the figure: Yellow and blue (highlighted) trajectories are predictions for different agents. Red dotted lines are agents’ ground truth trajectories.

I. INTRODUCTION
Motion forecasting plays an important role for planning in autonomous driving systems and has received increasing attention in the research community [13], [18], [38], [45], [50], [44]. The prohibitively expensive storage requirements for publishing raw sensor data for driving scenes limited the major motion forecasting datasets [17], [47], [9], [26], [49]. They instead release abstract representations, such as 3D boxes from pre-trained perception models (for objects) and polylines (for maps), to represent the driving scenes.
The absence of the raw sensor data leads to the following limitations: 1) Motion forecasting relies on a lossy representation of the driving scenes (Fig. 1). The human-designed interfaces lack the specificity required by the motion forecasting task. For example, the taxonomy of the agent types in the Waymo Open Motion Dataset (WOMD) [17] is limited to only three types: vehicle, pedestrian, cyclist. In practice, we interact with agents who might be hard to fit into this taxonomy, such as pedestrians on scooters or motorcyclists. Moreover, the fidelity of the input features is limited to 3D boxes that hide many important details such as pedestrian postures and gaze directions. 2) Coverage of the driving scene representation is centered around where the perception system detects objects. The detection task becomes a bottleneck for transferring information to motion forecasting and planning when we are not sure whether an object exists, especially in the first moments of an object surfacing. We hope for a more graceful, error-robust transmission of information between the systems. 3) Training perception models to match these intermediate representations might evolve them into overly complicated systems that get evaluated on subtasks that are not well correlated with overall system quality.

                    INTERACTION  Woven Planet  Shifts  Argoverse 2  nuScenes  WOMD-LiDAR
Has LiDAR Data                                                      ✓         ✓
# Segments          -            170k          600k    250k         1k        104k
Segment Duration    -            25s           10s     11s          20s       20s
Total Time          16.5h        1118h         1667h   763h         5.5h      574h
Unique Roadways     2km          10km          -       2220km       -         1750km
Sampling Rate       10Hz         10Hz          5Hz     10Hz         2Hz       10Hz
# Cities            6            1             6       6            2         6
3D Maps             ✓ ✓ ✓ (three of the six datasets; column positions not recoverable from the extracted text)
Dataset Size†       -            22GB          120GB   58GB         48GB      2.29TB*
TABLE I: Comparison of the popular behavior prediction and motion forecasting datasets. We compare our WOMD-LiDAR with INTERACTION [49], Woven Planet [26], Shifts [33], Argoverse 2 [47], nuScenes [9]. “-” indicates that the data is not available or not applicable. †The sizes are cited from [47]. *WOMD-LiDAR dataset size is after ∼8× compression.

The goal of this work is to provide a large-scale, diverse raw sensor dataset for the motion forecasting task. We aim to augment WOMD [17] with LiDAR data in a format similar to that of WOD [42] for the motion forecasting task, with 100× more scenes than those available in WOD [42]. To the best of our knowledge, it is the largest publicly available LiDAR
dataset across perception or motion forecasting tasks (Table I). To overcome the huge data storage problem and make the dataset user-friendly for academic research, we adopt state-of-the-art LiDAR compression technology [51]. It reduces the LiDAR data size by ∼8×, bringing the final WOMD-LiDAR data to around 2.3 TB.
To demonstrate the usefulness of the new LiDAR data, we propose a novel and simple motion forecasting baseline, which leverages raw LiDAR data to boost prediction accuracy. Instead of jointly training the perception and prediction networks, which demands a huge memory footprint, we take a two-stage approach: we first apply a perception model [43] to extract embedding features from LiDAR data. Then, during training, we feed these embeddings to a motion forecasting model, WayFormer [35]. We evaluate the model with the same metrics as WOMD [17]. Experiments show that, with LiDAR data, the WayFormer model achieves a 2% mAP increase for both vehicle and pedestrian prediction. This indicates that WOMD-LiDAR brings useful information and can further improve motion forecasting models’ performance.
The WOMD-LiDAR data has been made publicly available to the research community, and we hope it will provide new directions and opportunities in developing end-to-end motion forecasting models. Additionally, WOMD-LiDAR opens the door for new research on detection and tracking with a very large number of 3D boxes and tracks.
We summarize the contributions of our work as follows:
• We release the largest scale LiDAR dataset for motion forecasting with high quality raw sensor data across a wide spectrum of diverse scenes.
• We provide a baseline that boosts the motion forecasting performance using the raw data, demonstrating the efficacy of the sensor inputs.
• We design an encoding scheme that utilizes intermediate perception representations as a feature extraction utility for motion forecasting models.

II. RELATED WORK
Motion forecasting datasets. There has been an increasing number of motion forecasting datasets released [17], [26], [25], [9], [47], [49], [39], [15], [36], [30], [4], [8], [6]. Table I compares several of the most relevant motion forecasting datasets, which target real-world urban driving environments. The Woven Planet prediction dataset [26] processed raw data through their perception system with over 1000 hours of logs for the traffic agents. nuScenes [9] is an autonomous driving dataset that supports detection, tracking, prediction and localization. But both of these [26], [9] did not explicitly collect or upsample diverse, complex or interactive driving scenarios. Argoverse [14], [47] mined for vehicles in various scenarios (e.g., intersections, dense traffic). The INTERACTION dataset [49] collects some interactive scenarios (e.g., roundabouts, ramp merging). The Shifts [33] dataset targets vehicle motion prediction and has the longest duration. However, many of these long-duration datasets [26], [49], [33] lack LiDAR data, blocking the exploration of end-to-end motion forecasting. nuPlan [10], an ego vehicle’s planning dataset, released only a subset of the LiDAR sequences. Compared with other autonomous driving perception datasets [42], [9], [19], [3] that provide LiDAR frames, WOMD-LiDAR is significantly larger in terms of the total time, number of scenes and object interactions.
Motion forecasting modeling. A popular approach is to render each input frame as a rasterized top-down image where each channel represents different scene elements [13], [16], [29], [23], [12], [50]. Another method is to encode agent state history using temporal modeling techniques like RNN [34], [28], [2], [38] or temporal convolution [31]. In these two methods, relationships between entities are aggregated through pooling [50], [48], [2], [21], [29], [34], soft attention [34], [50] and graph neural networks [11], [28], [31]. Recently, some works [35], [41] explore the Transformer [46] encoder-decoder structure for multimodal motion prediction. We choose WayFormer [35] as our motion forecasting baseline: it is a state-of-the-art model, which can flexibly integrate features from our new LiDAR modality.
LiDAR data compression. Releasing the LiDAR data for our dataset presents a data storage challenge: without LiDAR compression techniques, the raw sensor data of WOMD-LiDAR exceeds 20 TB. As valuable as the data is, the size is inconvenient for fast distribution in the research community. Fortunately, in recent years, there has been growing interest in LiDAR point cloud compression techniques. For example, one major stream of work, octree-based methods, which represent and compress quantized point clouds [7], [40], has been released as a point cloud compression standard [20]. More recently, neural network based octree methods have been proposed, such as Octsqueeze [27], MuSCLE [5] and VoxelContextNet [37]. Alternatively, LiDAR point clouds can be stored as range images. A family of image-based compression methods has been adapted for the task. For example, traditional methods such as JPEG, PNG and TIFF have been applied to compressing range images [1], [24]. Recently, RIDDLE [51] extends such methods by applying a deep neural network and delta encoding to compress range images. We adopt the delta encoder of RIDDLE [51] and reduce the raw sensor data by ∼8×.

III. DATASET
In this section, we describe the WOMD-LiDAR dataset statistics, the LiDAR data format, and the compression technique used to reduce the storage footprint.

A. Dataset Statistics
To evaluate motion forecasting models, we leverage existing labels gathered from WOMD [17]. We follow the WOMD dataset format, and extract 9 second scenarios containing LiDAR data. WOMD-LiDAR is split into 70% training, 15% validation, and 15% test sets, using the same run segments as WOMD. For training a motion forecasting model, it is sufficient to only use the past and current timestamps’ LiDAR data, while the future timestamps are used as ground truth to calculate loss and metrics. We only release the first 1 second of LiDAR data for each scene, which reduces the size of the raw LiDAR data by 87.9%. However, it still requires ∼20 TB of storage. We further apply a LiDAR compression method to reduce its size (Section III-C).
Datasets comparison: Compared with WOD [42], one of the largest datasets for the perception task, WOMD-LiDAR contains 100× more scenes and 80× more total hours. nuScenes [9] is currently the only other LiDAR dataset suitable for the motion forecasting task. WOMD-LiDAR is significantly larger than nuScenes, with 104k (100×) segments and 574 hours (100×) of total time (see Table I).

B. LiDAR Data Format
LiDAR data is encoded in WOMD-LiDAR as range images $\in \mathbb{R}^{h \times w \times 6}$. Following the format of WOD [42], the first two returns of each LiDAR pulse are provided. Range images are collected from five LiDAR sensors. For the top LiDAR, h = 64, w = 2650. For other sensors, h = 116, w = 150. Each pixel in the range images includes the following:
• Range (scalar): The distance between the origin of the LiDAR sensor frame and the LiDAR point.
• Intensity (scalar): A measurement describing the return strength of the laser pulse that produces the LiDAR point, which is partially based on the reflectivity of the object struck by the laser pulse.
• Elongation (scalar): The elongation of the laser pulse beyond its normal width.
• Vehicle pose ($\in \mathbb{R}^3$): The pose of the vehicle when the LiDAR point is captured.
The range image format is necessary to exploit efficient compression schemes to reduce storage requirements (Section III-C). Fig. 2 shows the different features that constitute the range images through mono-chromatic images, one for each feature. We provide a tutorial2 to show how to decompress range images and convert them into the features above.

Fig. 2: Visualization of a range image from the top LiDAR sensor in WOMD-LiDAR. The three rows show range, (normalized) intensity, and (normalized) elongation from the first LiDAR return (second return omitted for brevity). We crop the range images to only show the front 180°.

2 https://bit.ly/tutorial-womd-lidar

C. LiDAR Data Compression
Storing raw sensory data is prohibitively expensive. Therefore, we apply the delta encoding compressor proposed in [51]. We use a non-deep-learning version of the algorithm for fast compression and decompression. This compression is lossless under a pre-specified quantization precision. Therefore, we do not expect it to impact end-to-end learning.
The basic idea of the algorithm is to use a previous pixel value in the range image to predict the next valid pixel (the closest valid one on its right in the spatial domain). Instead of storing the absolute pixel values, we store the residuals between the predictions and the original pixel values. Since the residuals have a more concentrated distribution (especially on quantized range images) with lower entropy, they are compressed to a much smaller size with varint coding followed by zlib compression.
In our implementation, we quantize the range image channels with the following precision: range 0.005m, intensity 0.01m, elongation 0.01m, pose translation 0.0001m, pose rotation 0.001 radians. We leverage the default varint coding from the publicly available Google Protobuf implementation (for uint and bool fields). We will release our compression algorithm together with the dataset.

IV. MOTION FORECASTING MODEL WITH LIDAR
To validate the effectiveness of WOMD-LiDAR, we train a WayFormer [35] model using LiDAR embeddings as a baseline. We describe the details of the motion forecasting model and the LiDAR encoder (Fig. 3) in this section.

A. Motion Forecasting Model
We extend the WayFormer [35] model to incorporate raw LiDAR data. It adopts a transformer based scene encoder which is flexible to plug in features from various modalities. The transformer fuses features from agent history states,
traffic light signals, agent interaction states and road graph features. We add an additional LiDAR modality fed to the scene encoder. The features of the LiDAR modality are generated from a SWFormer [43] extractor and a LiDAR encoder. During training, we freeze the SWFormer feature extractor and update only the LiDAR encoder’s model parameters. After applying the scene encoder to fuse multi-modal features, the output embeddings are fed to the trajectory decoder to produce the final predicted trajectories.

Fig. 3: Model structures of LiDAR encoder (left) and motion forecasting model (right). To encode LiDAR data, we adopt a pre-trained SWFormer [43] model and extract the embedding features (which can be decoded to produce detection results). Those features (in the light yellow box) from different scales are concatenated and fed to a WayFormer [35] model as a new modality feature for the motion forecasting task.

B. LiDAR Encoding Scheme
We adopt a pre-trained SWFormer [43] to extract LiDAR embeddings. The SWFormer is trained on WOD [42] for the 3D object detection task. The SWFormer adopts sparse partition operators and transformer based layers to encode LiDAR data from different scales. We extract the embedding features which are used to produce detection results in the detection heads as the input to the scene encoder of the WayFormer model. These features effectively encode rich information of objects and context environment from noisy LiDAR points. To provide context agent information, we lower the detection confidence threshold to produce more but less reliable detected objects. This increases the recall of the detection results but decreases the precision. In addition to the embedding features, we also pad more features:
• Detected box coordinates: We append the detected boxes’ center coordinates to emphasize the potential detected objects’ positions.
• Detected box size: The height, width, length of the boxes provide hints of objects from different categories.
• Foreground probability from the segmentation head: This helps reduce the noise from detection results.
The output tensor E from SWFormer with padded features is an $N \times T \times C$ tensor, where N is the number of detected boxes, T is the number of input frames, and C is the feature size. To adapt E to be compatible as input for the scene encoder of WayFormer, we flatten the first two dimensions as the token dimension. A one-layer Axial Transformer [22] is applied as a LiDAR encoder to project the output tensor E to a fixed M-token tensor $E' \in \mathbb{R}^{M \times C'}$ with the same feature size as other modalities.

V. EXPERIMENTS
A. Experiment Setup
LiDAR Feature Extractor. We train the SWFormer [43] on WOD [42] as the LiDAR feature extractor. We set the batch size to 4, training for 80,000 steps on 64 V3 TPUs. The IOU thresholds for vehicles and pedestrians are 0.7 and 0.5 respectively. In the original SWFormer inference stage, the boxes are filtered if the predicted confidence is less than 0.5. To extract feature embeddings, we need more context information and high recall of the detection results. Thus, we lower the box confidence threshold τ to 0.1 (see ablation study in Section V-D). The extracted embeddings are 128D vectors. With box coordinates (x, y, z), box size (width, length, height) and foreground probability, the final LiDAR features are 135D vectors (C = 135) fed to the scene encoder of the WayFormer [35]. We set the maximum number of detected boxes in each frame to 140 (N ≤ 140). If there are more than 140 detected objects, we discard the detected objects with low box confidence scores. We set the number of output tokens of the LiDAR encoder to M = 10 before sending the embeddings to the scene encoder.
Motion Forecasting Model. We use a batch size of 16 and train the WayFormer model for 1.2M steps on 16 V3 TPUs. We project all modalities to the same feature size of 256D (C′ = 256), then utilize cross-attention with latent queries to reduce the number of tokens to 192. The scene encoder has 2 transformer layers. WayFormer encodes the history states of 1 second (10 steps at 10Hz) and predicts K = 6 trajectories for each agent’s future 8 seconds.

B. Metrics
Given an input sample, a motion forecasting model predicts K trajectories for N agents in the scene for the future T steps, $x^k = \{x^k_{i,t}\}_{i=1:N,\,t=1:T}$. We denote the corresponding ground truth trajectories as $y = \{y_{i,t}\}_{i=1:N,\,t=1:T}$. We inherit the WOMD motion forecasting challenge metrics [17].
minADE. The minimum Average Displacement Error calculates the $\ell_2$ distance between the ground truth and the predicted trajectory which is closest to it across all time steps:
\[ \mathrm{minADE} = \min_k \frac{1}{NT} \sum_i \sum_t \lVert x^k_{i,t} - y_{i,t} \rVert_2 \tag{1} \]
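As a minimal sketch of Eq. (1), the following pure-Python function computes minADE for a toy scene. The function name `min_ade`, the nested-sequence layout (K hypotheses, each holding N agent trajectories of T (x, y) points) and the toy numbers are assumptions made for illustration and are not part of the WOMD tooling. The sketch follows Eq. (1) as written, taking the minimum over k after averaging over all agents and steps; the official WOMD metric implementation [17] may aggregate differently.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]        # T points for one agent
Hypothesis = Sequence[Trajectory]   # one trajectory per agent (N agents)


def min_ade(predictions: Sequence[Hypothesis], ground_truth: Hypothesis) -> float:
    """Eq. (1): minimum over k of the mean L2 displacement over agents and steps."""
    n = len(ground_truth)
    t = len(ground_truth[0])
    best = math.inf
    for hypothesis in predictions:                      # loop over k = 1..K
        total = 0.0
        for pred_traj, gt_traj in zip(hypothesis, ground_truth):
            for (px, py), (gx, gy) in zip(pred_traj, gt_traj):
                total += math.hypot(px - gx, py - gy)   # ||x - y||_2
        best = min(best, total / (n * t))
    return best


# Toy scene (hypothetical numbers): N = 1 agent, T = 2 steps, K = 2 hypotheses.
ground_truth = [[(0.0, 0.0), (1.0, 0.0)]]
predictions = [
    [[(0.0, 1.0), (1.0, 1.0)]],   # k = 0: 1.0 m off at both steps
    [[(0.0, 0.0), (1.0, 0.5)]],   # k = 1: 0.0 m and 0.5 m off
]
print(min_ade(predictions, ground_truth))  # 0.25
```

The second hypothesis has an average displacement of 0.25 m against 1.0 m for the first, so the minimum over k selects it.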
Set                  Model               Vehicle                  Pedestrian               Cyclist
                                         minADE ↓  MR ↓  mAP ↑    minADE ↓  MR ↓  mAP ↑    minADE ↓  MR ↓  mAP ↑
Standard Validation  LSTM [17]           1.34      0.25  0.23     0.63      0.13  0.23     1.26      0.29  0.21
Standard Validation  Wayformer [35]      1.10      0.18  0.35     0.54      0.11  0.35     1.08      0.22  0.29
Standard Validation  Wayformer + LiDAR   1.09      0.17  0.37     0.54      0.10  0.37     1.06      0.21  0.28
TABLE II: Marginal metrics on the standard validation set. All metrics computed at 8s. We compare baseline WayFormer [35] and WayFormer trained with LiDAR data on the WOMD-LiDAR standard motion forecasting track.
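The effect of adding LiDAR can be read off Table II by simple subtraction. The sketch below copies the WayFormer and WayFormer + LiDAR rows from the table and prints the per-category change; the dictionary layout, the helper name `deltas` and the variable names are illustrative assumptions.

```python
# Rows copied from Table II: (minADE, MR, mAP) per category, all at 8s.
wayformer = {
    "vehicle": (1.10, 0.18, 0.35),
    "pedestrian": (0.54, 0.11, 0.35),
    "cyclist": (1.08, 0.22, 0.29),
}
wayformer_lidar = {
    "vehicle": (1.09, 0.17, 0.37),
    "pedestrian": (0.54, 0.10, 0.37),
    "cyclist": (1.06, 0.21, 0.28),
}


def deltas(base, new):
    """Per-category (new - base) for each metric, rounded to the table's precision."""
    return {
        cat: tuple(round(n - b, 2) for b, n in zip(base[cat], new[cat]))
        for cat in base
    }


change = deltas(wayformer, wayformer_lidar)
for category, (d_ade, d_mr, d_map) in change.items():
    print(f"{category:<10} minADE {d_ade:+.2f}  MR {d_mr:+.2f}  mAP {d_map:+.2f}")
```

Negative changes in minADE and MR and positive changes in mAP are improvements. With these rows, MR drops by 0.01 in every category, mAP rises by 0.02 for vehicles and pedestrians, and the cyclist mAP is the only metric that regresses.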
Miss Rate (MR). MR measures whether the closest predicted trajectory $\min_k x^k_{i,t}$ matches the ground truth $y_{i,t}$. The MR at time step t is calculated as:
\[ \mathrm{MR}_t = \min_k \bigvee_i \neg \mathrm{IsMatch}(x^k_{i,t}, y_{i,t}) \tag{2} \]
More details of the function IsMatch implementation can be found in the WOMD dataset [17].
Mean Average Precision (mAP). mAP is similar to the one for the object detection task [32]. It computes the area under the precision-recall curve by varying the confidence threshold for the predicted trajectories. The criterion for judging whether a trajectory is a true positive, false positive, etc. is consistent with the MR definition in Eq. 2. For each object, only the trajectory with the highest confidence is used to calculate the mAP for the corresponding true positive.

C. Baseline Model Performance
We evaluate our baseline model on the WOMD-LiDAR validation set. The results are shown in Table II. With LiDAR features, our model performs better than WayFormer for vehicles, pedestrians and cyclists on the Miss Rate (MR) metric, with a 0.01 decrease in each category. This indicates LiDAR information provides location hints for WayFormer. For the minADE metric, the results are roughly the same. WayFormer with LiDAR inputs also achieves a 2% absolute increase in mAP (0.35 to 0.37) for the vehicle and pedestrian categories. This is because LiDAR features provide more information about the object locations, shapes and interactions with other objects. They help the WayFormer model understand the scene and predict more accurate trajectories. For cyclists, there is a minor regression in mAP. It is likely due to the fact that the LiDAR points are noisy in this category and we may need a better encoding method to extract useful information.

D. Ablation Study
In the following experiments, we report the average minADE, MR and mAP across vehicle, pedestrian and cyclist categories at 3s, 5s and 8s on the validation set.
Threshold of SWFormer to extract embeddings. As described in Sec. V-A, we lower the SWFormer threshold τ to get high recall of detected boxes so that we could get more context information in the scene. We sweep the threshold of SWFormer from 0.0 to 0.5 (default value of SWFormer), and generate different training datasets extracted from WOMD-LiDAR and evaluate the corresponding performance of the baseline model. When the threshold τ is lower, the number of predicted boxes from SWFormer becomes larger. This brings more context information for the motion forecasting model while it also brings more noise in the inputs. As shown in Table III, the WayFormer’s performance is not very sensitive to τ. When τ = 0.1, the WayFormer with LiDAR inputs achieves the best performance. When τ further increases, the number of detected boxes becomes smaller and may result in loss of useful information.

Threshold τ   minADE ↓   MR ↓     mAP ↑
0.0           0.5692     0.1401   0.4005
0.1           0.5553     0.1292   0.4191
0.3           0.5623     0.1399   0.4102
0.5           0.5675     0.1410   0.4087
TABLE III: Experiment results of sweeping SWFormer threshold τ to extract embeddings. The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s, and 8s.

Model                    minADE ↓   MR ↓     mAP ↑
No boxes coordinates     0.5852     0.1594   0.3947
No boxes sizes           0.5773     0.1476   0.4008
No foreground prob.      0.5601     0.1331   0.4110
Wayformer with LiDAR     0.5553     0.1292   0.4191
TABLE IV: Experiment results of masking out additional features in LiDAR encoding (Sec. IV-B). The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s, and 8s.

Different embedding features. There are three additional features (Sec. IV-B) included in the embedding output from the LiDAR encoder: detected box coordinates, size and foreground probability. We mask out each feature and check the WayFormer model performance in Table IV. The experiments show that without box coordinates, the minADE, MR, mAP regress by 0.0299, 0.0302, 0.0244 respectively. This indicates that aside from the SWFormer embedding features, the box coordinates play an important role in motion forecasting. Compared to masking out box coordinates, masking out box sizes has a smaller regression, with minADE, MR increased by 0.022 and 0.0184 and mAP decreased by
0.0183. Foreground probability also contributes slightly to the overall performance, with regression in the minADE, MR, mAP of 0.0048, 0.0039, 0.0081 respectively.
WayFormer modeling. We study the WayFormer hyper-parameters in motion forecasting. Specifically, we conduct experiments to investigate the impact of the number of tokens and layers of the scene encoder. This is because the scene encoder provides encoded embeddings for the trajectory decoder in the prediction stage. The embedding quality plays an important role for the motion forecasting task. As shown in Table V, the number of embedding tokens impacts quality more than the number of scene encoder transformer layers. When we increase the token size from 16 to 192 (the default WayFormer setting), the minADE and MR decrease from 0.6011 to 0.5553 and 0.1702 to 0.1292, respectively, and mAP increases from 0.3811 to 0.4191. This indicates that when the token size increases, more information will be encoded in the embeddings for motion prediction.
We also vary the number of transformer blocks from 1 to 3 (Table V). The performance of the WayFormer model first improves (# layers increases from 1 to 2) and then regresses (# layers increases from 2 to 3). Thus, we set the optimal value of # layers of the scene encoder as 2.

# tokens of embeddings    minADE ↓   MR ↓     mAP ↑
16                        0.6011     0.1702   0.3811
32                        0.5888     0.1610   0.3907
64                        0.5797     0.1503   0.3998
192                       0.5553     0.1292   0.4191

# layers of transformer   minADE ↓   MR ↓     mAP ↑
1                         0.5711     0.1440   0.3991
2                         0.5553     0.1292   0.4191
3                         0.5561     0.1325   0.4112
TABLE V: Experiment results of scene encoder’s #tokens and #transformer layers. The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s, and 8s.

E. Qualitative Results
We visualize the WayFormer prediction results on WOMD-LiDAR to check the quality of motion forecasting. Please check the supplementary video for more visualization results.
Visualization of WayFormer prediction results. We visualize some prediction results and conduct analysis on the prediction quality. As shown in Fig. 4, with LiDAR inputs, the WayFormer model avoids collisions with vehicles, pedestrians and cyclists. Specifically, in Fig. 4(a), with LiDAR information, the predicted trajectories avoid crashing into parked cars. In Fig. 4(b), the predicted trajectories of cyclists avoid crashing into cars. We observe more reasonable predicted trajectories, matching the improved performance in Table II.

Fig. 4: Visualization of prediction result comparison between WayFormer [35] (sub-figures on the left) and WayFormer with LiDAR inputs (sub-figures on the right). Fig (a): With LiDAR information the predicted trajectories avoid crashing into parked cars. Fig (b): The predicted trajectories of cyclists avoid crashing into cars. Legends in the figure: Yellow and blue trajectories are predictions for different agents, while blue trajectories are highlighted ones. Red dotted lines are labeled ground truth trajectories for agents in the scene.

VI. CONCLUSION AND FUTURE DIRECTIONS
Conclusion. In this work, we augment WOMD with the largest scale LiDAR dataset in the community, containing LiDAR point clouds for more than 100,000 scenes. To resolve the huge data storage requirements, we adopt state-of-the-art LiDAR data compression technology and successfully reduce the dataset size to less than 2.5 TB. To evaluate the suitability of LiDAR to the motion forecasting task, we provide a WayFormer baseline trained with LiDAR. Experiments show that LiDAR data brings improvement in the motion forecasting task.
Limitations and future work. 1) In this work, we only trained WayFormer and WayFormer + LiDAR models. We will investigate end-to-end models that can directly encode LiDAR point clouds with the motion forecasting task in mind. 2) The SWFormer detector, which serves as the point cloud encoder in our model, can only represent object-level information. We will look into approaches that can leverage scene-level information and that are not sensitive to the detection prediction thresholds. 3) Another interesting direction is to explore methods that solely depend on the sensor data to avoid the dependency on a human-defined object interface.

REFERENCES
[1] Jae-Kyun Ahn, Kyu-Yul Lee, Jae-Young Sim, and Chang-Su Kim. Large-scale 3d point cloud compression using adaptive radial distance prediction in hybrid coordinate domains. IEEE Journal of Selected Topics in Signal Processing, 2014.
[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV, 2019.
[4] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
[5] Sourav Biswas, Jerry Liu, Kelvin Wong, Shenlong Wang, and Raquel Urtasun. Muscle: Multi sweep compression of lidar using deep entropy models. NeurIPS, 2020.
[6] Julian Bock, Robert Krajewski, Tobias Moers, Steffen Runde, Lennart Vater, and Lutz Eckstein. The ind dataset: A drone dataset of naturalistic road user trajectories at german intersections. In Intelligent Vehicles Symposium (IV), 2020.
[7] Mario Botsch, Andreas Wiratanaya, and Leif Kobbelt. Efficient high quality rendering of point sampled geometry. Rendering Techniques, 2002.
[8] Antonia Breuer, Jan-Aike Termöhlen, Silviu Homoceanu, and Tim Fingscheidt. opendd: A large-scale roundabout drone dataset. In International Conference on Intelligent Transportation Systems (ITSC), 2020.
[9] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
[10] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021.
[11] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In ICRA, 2020.
[12] Sergio Casas, Wenjie Luo, and Raquel Urtasun. Intentnet: Learning to predict intention from raw sensor data. In CoRL, 2018.
[13] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRL, 2019.
[14] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019.
[15] Benjamin Coifman and Lizhe Li. A critical evaluation of the next gen-
Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[24] Hamidreza Houshiar and Andreas Nüchter. 3d point cloud compression using conventional image compression for efficient data transmission. In International Conference on Information, Communication and Automation Technologies (ICAT), 2015.
[25] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In CoRL, 2021.
[26] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. https://www.woven-planet.global/en/data/prediction-dataset, 2020.
[27] Lila Huang, Shenlong Wang, Kelvin Wong, Jerry Liu, and Raquel Urtasun. Octsqueeze: Octree-structured entropy model for lidar compression. In CVPR, 2020.
[28] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, and Deva Ramanan. What-if motion prediction for autonomous driving. arXiv preprint arXiv:2008.10587, 2020.
[29] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In CVPR, 2017.
[30] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer graphics forum. Wiley Online Library, 2007.
[31] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. In ECCV, 2020.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[33] Andrey Malinin, Neil Band, German Chesnokov, Yarin Gal, Mark JF Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. arXiv preprint arXiv:2107.07455, 2021.
[34] Jean Mercat, Thomas Gilles, Nicole El Zoghby, Guillaume Sandou, Dominique Beauvois, and Guillermo Pita Gil. Multi-head attention for multi-modal joint vehicle motion forecasting. In ICRA, 2020.
[35] Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. arXiv preprint arXiv:2207.05844, 2022.
[36] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[37] Zizheng Que, Guo Lu, and Dong Xu. Voxelcontext-net: An octree based framework for point cloud compression. In CVPR, 2021.
[38] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey
eration simulation (ngsim) vehicle trajectory dataset. Transportation Levine. Precog:Predictionconditionedongoalsinvisualmulti-agent
ResearchPartB:Methodological,2017. settings. InCVPR,2019.
[16] HenggangCui,VladanRadosavljevic,Fang-ChiehChou,Tsung-Han [39] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio
Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Savarese. Learning social etiquette: Human trajectory understanding
Djuric. Multimodal trajectory predictions for autonomous driving incrowdedscenes. InECCV,2016.
usingdeepconvolutionalnetworks. InICRA,2019. [40] Ruwen Schnabel and Reinhard Klein. Octree-based point-cloud
[17] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang compression. PBG@SIGGRAPH,2006.
Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin [41] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion
Zhou,etal.Largescaleinteractivemotionforecastingforautonomous transformer with global intention localization and local movement
driving:Thewaymoopenmotiondataset. InICCV,2021. refinement. InNeurIPS,2022.
[18] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, [42] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard,
Congcong Li, and Cordelia Schmid. Vectornet: Encoding hd maps Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai,
andagentdynamicsfromvectorizedrepresentation. InCVPR,2020. Benjamin Caine, et al. Scalability in perception for autonomous
[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. driving:Waymoopendataset. InCVPR,2020.
Visionmeetsrobotics:Thekittidataset. TheInternationalJournalof [43] PeiSun,MingxingTan,WeiyueWang,ChenxiLiu,FeiXia,Zhaoqi
RoboticsResearch,32(11):1231–1237,2013. Leng,andDragomirAnguelov.Swformer:Sparsewindowtransformer
[20] D Graziosi, O Nakagami, S Kuma, A Zaghetto, T Suzuki, and for3dobjectdetectioninpointclouds. InECCV,2022.
A Tabatabai. An overview of ongoing point cloud compression [44] CharlieTangandRussRSalakhutdinov. Multiplefuturesprediction.
standardizationactivities:Video-based(v-pcc)andgeometry-based(g-
NeurIPS,2019.
[45] EkaterinaTolstaya,RezaMahjourian,CarltonDowney,Balakrishnan
pcc). APSIPA Transactions on Signal and Information Processing,
Vadarajan,BenjaminSapp,andDragomirAnguelov.Identifyingdriver
2020.
[21] AgrimGupta,JustinJohnson,LiFei-Fei,SilvioSavarese,andAlexan- interactionsviaconditionalbehaviorprediction. InICRA,2021.
[46] AshishVaswani,NoamShazeer,NikiParmar,JakobUszkoreit,Llion
dreAlahi. Socialgan:Sociallyacceptabletrajectorieswithgenerative
Jones,AidanNGomez,ŁukaszKaiser,andIlliaPolosukhin.Attention
adversarialnetworks. InCVPR,2018.
[22] JonathanHo,NalKalchbrenner,DirkWeissenborn,andTimSalimans. isallyouneed. NeurIPS,2017.
[47] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert,
Axial attention in multidimensional transformers. arXiv preprint
Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar,
arXiv:1912.12180,2019.
[23] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, PeterCarr,andJamesHays. Argoverse2:Nextgenerationdatasetsforself- APPENDIX
drivingperceptionandforecasting.InNeurIPSTrackonDatasetsand
Benchmarks,2021.
VII. SUPPLEMENTARYDATASETDETAILS
[48] Maosheng Ye, Jiamiao Xu, Xunnong Xu, Tongyi Cao, and Qifeng
Datasetcomparison.Weprovidemoredetailsofthedataset
Chen. Dcms: Motion forecasting with dual consistency and multi-
pseudo-targetsupervision. arXivpreprintarXiv:2204.05859,2022. comparison in Table VI. In Table I of the main paper,
[49] WeiZhan,LitingSun,DiWang,HaojieShi,AubreyClausse,Maximil- “SamplingRate”isthedatacollectionrateinHz.“3DMaps”
ianNaumann,JuliusKummerle,HendrikKonigshof,ChristophStiller,
indicates whether the dataset provided the 3D Map infor-
Arnaud de La Fortelle, et al. Interaction dataset: An international,
adversarialandcooperativemotiondatasetininteractivedrivingsce- mation. “Dataset Size” entries were collected by Argoverse
narioswithsemanticmaps. arXivpreprintarXiv:1910.03088,2019. 2 [47]. Combining Table VI and Table I of the main paper,
[50] HangZhao,JiyangGao,TianLan,ChenSun,BenSapp,Balakrishnan
Varadarajan,YueShen,YiShen,YuningChai,CordeliaSchmid,etal. we provide the complete comparison between our WOMD-
Tnt:Target-driventrajectoryprediction. InCoRL,2021. LiDAR and other datasets.
[51] Xuanyu Zhou, Charles R Qi, Yin Zhou, and Dragomir Anguelov.
Riddle:Lidardatacompressionwithrangeimagedeepdeltaencoding. Supplementary Details of WOMD-LiDAR. Map data is
InCVPR,2022. encoded as a set of polylines and polygons created from
curves sampled at a resolution of 0.5 meters following [17],
[18]. Traffic signal states are also provided along with other
static map feature types (e.g., lane boundary lines, road edges
and stop signs). We followed [17] to mine the interesting
scenarios in our WOMD-LiDAR.
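As a rough illustration of what sampling a map curve at a fixed 0.5 meter resolution could look like, the sketch below resamples a 2D polyline at constant arc-length spacing. The function name `resample_polyline` and the input format are hypothetical and chosen for illustration only; the actual map processing pipeline is not described here, and real map features may carry additional attributes such as height or feature type.

```python
import math


def resample_polyline(points, spacing=0.5):
    """Resample a 2D polyline at a fixed arc-length spacing (in meters).

    `points` is a list of (x, y) tuples. The first point is always kept;
    the last input point is kept only if it falls on a sampling step.
    """
    if len(points) < 2:
        return list(points)
    out = [points[0]]
    carried = 0.0  # distance travelled since the last emitted sample
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0.0:
            continue  # skip duplicate vertices
        # Distance along this segment at which the next sample falls.
        d = spacing - carried
        while d <= seg + 1e-9:
            t = d / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += spacing
        # Remaining length after the last sample on this segment.
        carried = seg - (d - spacing)
    return out


# A 2 m straight lane boundary sampled every 0.5 m gives 5 points.
lane = resample_polyline([(0.0, 0.0), (2.0, 0.0)], spacing=0.5)
print(len(lane))
```

Carrying the leftover distance across segments keeps the spacing uniform along the whole curve rather than restarting at every vertex.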
VIII. VISUALIZATION
Scenario Videos with LiDAR. In Fig. 5, we provide more
visualizations of scenarios that show not only the bounding boxes
of the agents of interest, but also the released high-quality,
well-calibrated LiDAR data.
We also provide some simulated scenes with both LiDAR
and labeled boxes on WOMD-LiDAR. They are provided
as mov files in the supplementary materials. Each video
clip contains 11 frames in slow motion, with LiDAR data
visualized together with the boxes of agents. The clips are
limited to 11 frames because we only release the first
11 frames’ LiDAR data in WOMD-LiDAR.
WayFormer + LiDAR prediction visualization. We provide
more visualization results in Fig. 6. In these examples, the
predictions of our WayFormer [35] + LiDAR model tend to avoid
collisions with other agents (vehicles, pedestrians and cyclists)
in the motion forecasting task. This is consistent with the
improved performance in Table II.
IX. EXPERIMENTS
Ablation Study of LiDAR Encoder. We provide an ablation
study of the LiDAR encoder described in Section 4.2 of the
submission. Specifically, we study the number of output
tokens M and the number of transformer layers. Experiment
results are shown in Table VII.
Table VII shows that as M increases, the final performance
of WayFormer first improves and then degrades. We therefore
set M to 10 in our experiments. The LiDAR encoder, by
contrast, is not very sensitive to the number of transformer
layers; there is a slight regression when the number of layers
increases. To achieve the best performance and fast training,
we set the number of layers to 1 in our experiments.
Fig. 5: Scenario visualizations with LiDAR. Best viewed in color; zoom in for more details.
INTERACTION Woven Planet Shifts Argoverse 2 nuScenes WOMD-LiDAR
Offboard Perception ✓ ✓
Mined for Interestingness - - - ✓ - ✓
Traffic Signal States ✓ ✓ ✓
TABLE VI: Comparison of the popular behavior prediction and motion forecasting datasets. “-” indicates that the data is not
available or not applicable. “Offboard perception” is checked if the labels were auto-labeled by offboard perception which
can generate high-quality labels. “Mined for Interestingness” is checked if the dataset mined interesting interactions after
the data collection. “Traffic Signal States” is checked if the dataset provided traffic light states.
# output tokens (M)   minADE ↓   MR ↓     mAP ↑
5                     0.5700     0.1501   0.3999
10                    0.5553     0.1292   0.4191
20                    0.5594     0.1313   0.4102

# layers              minADE ↓   MR ↓     mAP ↑
1                     0.5553     0.1292   0.4191
2                     0.5613     0.1392   0.3998
3                     0.5610     0.1398   0.4001
TABLE VII: Experiment results of the number of output tokens M and the number of transformer layers in the LiDAR
encoder. The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s,
and 8s.
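Table VII reports minADE and MR. As a rough illustration of what these two numbers measure, the sketch below computes a minimum average displacement error and a miss indicator for a single agent from K candidate trajectories. The function names, the fixed 2 m final-step threshold, and the toy trajectories are assumptions made for illustration; the official benchmark metrics define their own miss thresholds and averaging over categories and horizons, which this sketch does not reproduce.

```python
import math


def ade(pred, gt):
    """Average L2 displacement between one predicted and one true trajectory."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def min_ade(preds, gt):
    """Smallest average displacement error over the K candidate trajectories."""
    return min(ade(p, gt) for p in preds)


def is_miss(preds, gt, threshold=2.0):
    """True if no candidate ends within `threshold` meters of the true endpoint.

    The fixed threshold is a simplifying assumption for this sketch.
    """
    return all(math.dist(p[-1], gt[-1]) > threshold for p in preds)


# Toy example: ground truth moves along the x axis; two candidates are offset in y.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
preds = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # 1 m lateral offset
    [(1.0, 3.0), (2.0, 3.0), (3.0, 3.0)],  # 3 m lateral offset
]
print(min_ade(preds, gt))  # 1.0
print(is_miss(preds, gt))  # False, the first candidate ends 1 m away
```

A miss rate over a dataset would then be the fraction of agents for which `is_miss` returns True.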
Fig. 6: Visualization of prediction result comparison between WayFormer [35] (sub-figures on the left) and WayFormer with
LiDAR inputs (sub-figures on the right). Legend: yellow and blue trajectories are predictions for different
agents, with the blue trajectories being the highlighted ones. Red dotted lines are labeled ground truth trajectories for agents in the
scene. More visualization results are available in the supplementary material. Best viewed in color; zoom in for more
details.