WOMD-LiDAR: Raw Sensor Dataset Benchmark for Motion Forecasting

Kan Chen, Runzhou Ge, Hang Qiu, Rami Al-Rfou, Charles Qi, Xuanyu Zhou, Zoey Yang, Scott Ettinger, Pei Sun, Zhaoqi Leng, Mustafa Baniodeh, Ivan Bogun, Weiyue Wang, Mingxing Tan, Dragomir Anguelov

Abstract: Widely adopted motion forecasting datasets substitute the observed sensory inputs with higher-level abstractions such as 3D boxes and polylines. These sparse shapes are inferred by annotating the original scenes with perception systems' predictions. Such intermediate representations tie the quality of the motion forecasting models to the performance of computer vision models. Moreover, the human-designed explicit interfaces between perception and motion forecasting typically pass only a subset of the semantic information present in the original sensory input. To study the effect of these modular approaches, design new paradigms that mitigate these limitations, and accelerate the development of end-to-end motion forecasting models, we augment the Waymo Open Motion Dataset (WOMD) with large-scale, high-quality, diverse LiDAR data for the motion forecasting task. The new augmented dataset (WOMD-LiDAR)¹ consists of over 100,000 scenes, each spanning 20 seconds, with well-synchronized and calibrated high-quality LiDAR point clouds captured across a range of urban and suburban geographies. Compared to the Waymo Open Dataset (WOD), the WOMD-LiDAR dataset contains 100× more scenes. Furthermore, we integrate the LiDAR data into motion forecasting model training and provide a strong baseline. Experiments show that the LiDAR data improves performance on the motion forecasting task. We hope that WOMD-LiDAR will provide new opportunities for boosting end-to-end motion forecasting models.

I. INTRODUCTION

Motion forecasting plays an important role in planning for autonomous driving systems and has received increasing attention in the research community [13], [18], [38], [45], [50], [44].
The prohibitively expensive storage requirements for publishing raw sensor data for driving scenes have limited the major motion forecasting datasets [17], [47], [9], [26], [49]. They instead release abstract representations, such as 3D boxes from pre-trained perception models (for objects) and polylines (for maps), to represent the driving scenes.

(a) Sophisticated interactions with (left) and without LiDAR (right). (b) Predicted trajectories with (left) and without LiDAR data (right).
Fig. 1: Human-interpretable labels from the perception system provide limited information at the scene level and the object level. In sophisticated scenes with interaction between multiple objects, raw sensor data provides rich information and helps improve the motion forecasting performance. Legends in the figure: Yellow and blue (highlighted) trajectories are predictions for different agents. Red dotted lines are agents' ground truth trajectories.

* This work was done at Waymo LLC.
¹ https://waymo.com/open/data/motion/

The absence of the raw sensor data leads to the following limitations: 1) Motion forecasting relies on a lossy representation of the driving scenes (Fig. 1). The human-designed interfaces lack the specificity required by the motion forecasting task. For example, the taxonomy of the agent types in the Waymo Open Motion Dataset (WOMD) [17] is limited to only three types: vehicle, pedestrian, cyclist. In practice, we interact with agents who might be hard to fit into this taxonomy, such as pedestrians on scooters or motorcyclists. Moreover, the fidelity of the input features is limited to 3D boxes, which hide many important details such as pedestrian postures and gaze directions. 2) Coverage of the driving scene representation is centered around where the perception system detects objects.
The detection task becomes a bottleneck for transferring information to motion forecasting and planning when we are not sure whether an object exists, especially in the first moments of an object surfacing. We hope for a more graceful, error-robust transmission of information between the systems. 3) Training perception models to match these intermediate representations might evolve them into overly complicated systems that are evaluated on subtasks that are not well correlated with overall system quality.

The goal of this work is to provide a large-scale, diverse raw sensor dataset for the motion forecasting task. We aim to augment WOMD [17] with LiDAR data in a format similar to that of WOD [42] for the motion forecasting task, with 100× more scenes than those available in WOD [42]. To the best of our knowledge, it is the largest publicly available LiDAR dataset across perception or motion forecasting tasks (Table I).

arXiv:2304.03834v2 [cs.CV] 18 Feb 2024

                  INTERACTION  Woven Planet  Shifts  Argoverse 2  nuScenes  WOMD-LiDAR
# Segments        -            170k          600k    250k         1k        104k
Segment Duration  -            25s           10s     11s          20s       20s
Total Time        16.5h        1118h         1667h   763h         5.5h      574h
Unique Roadways   2km          10km          -       2220km       -         1750km
Sampling Rate     10Hz         10Hz          5Hz     10Hz         2Hz       10Hz
# Cities          6            1             6       6            2         6
Dataset Size†     -            22GB          120GB   58GB         48GB      2.29TB*
Has LiDAR Data: ✓ ✓ (the columns carrying the check marks are not preserved in this copy)
3D Maps: ✓ ✓ ✓ (the columns carrying the check marks are not preserved in this copy)

TABLE I: Comparison of the popular behavior prediction and motion forecasting datasets. We compare our WOMD-LiDAR with INTERACTION [49], Woven Planet [26], Shifts [33], Argoverse 2 [47], nuScenes [9]. "-" indicates that the data is not available or not applicable. †The sizes are cited from [47]. *WOMD-LiDAR dataset size is after ∼8× compression.

To overcome the huge data storage problem and make the dataset user-friendly for academic research, we adopt state-of-the-art LiDAR compression technology [51]. It reduces the size of the LiDAR dataset by ∼8×, bringing the final WOMD-LiDAR data to around 2.3 TB.

To demonstrate the usefulness of the new LiDAR data, we propose a novel and simple motion forecasting baseline, which leverages raw LiDAR data to boost prediction accuracy. Instead of jointly training the perception and prediction networks, which demands a huge memory footprint, we take a two-stage approach: we first apply a perception model [43] to extract embedding features from LiDAR data. Then, during training, we feed these embeddings to a motion forecasting model, WayFormer [35]. We evaluate the model with the same metrics as WOMD [17]. Experiments show that, with LiDAR data, the WayFormer model gains 2% mAP for both Vehicle and Pedestrian prediction. This indicates that WOMD-LiDAR brings useful information and can further improve motion forecasting models' performance.

The WOMD-LiDAR data has been made publicly available to the research community, and we hope it will provide new directions and opportunities in developing end-to-end motion forecasting models. Additionally, WOMD-LiDAR opens the door for new research on detection and tracking with a very large number of 3D boxes and tracks.

We summarize the contributions of our work as follows:
• We release the largest-scale LiDAR dataset for motion forecasting, with high-quality raw sensor data across a wide spectrum of diverse scenes.
• We provide a baseline that boosts motion forecasting performance using the raw data, demonstrating the efficacy of the sensor inputs.
• We design an encoding scheme that utilizes intermediate perception representations as a feature extraction utility for motion forecasting models.

II. RELATED WORK

Motion forecasting datasets. An increasing number of motion forecasting datasets have been released [17], [26], [25], [9], [47], [49], [39], [15], [36], [30], [4], [8], [6]. Table I compares the most relevant motion forecasting datasets that target real-world urban driving environments.
The Woven Planet prediction dataset [26] processed raw data through their perception system, with over 1000 hours of logs for the traffic agents. nuScenes [9] is an autonomous driving dataset that supports detection, tracking, prediction and localization. However, neither of these [26], [9] explicitly collected or upsampled diverse, complex or interactive driving scenarios. Argoverse [14], [47] mined for vehicles in various scenarios (e.g., intersections, dense traffic). The INTERACTION dataset [49] collects some interactive scenarios (e.g., roundabouts, ramp merging). The Shifts [33] dataset targets vehicle motion prediction and has the longest duration. However, many of these long-duration datasets [26], [49], [33] lack LiDAR data, blocking the exploration of end-to-end motion forecasting. nuPlan [10], an ego vehicle planning dataset, released only a subset of the LiDAR sequences. Compared with other autonomous driving perception datasets [42], [9], [19], [3] that provide LiDAR frames, WOMD-LiDAR is significantly larger in terms of total time, number of scenes and object interactions.

Motion forecasting modeling. A popular approach is to render each input frame as a rasterized top-down image where each channel represents different scene elements [13], [16], [29], [23], [12], [50]. Another method is to encode agent state history using temporal modeling techniques such as RNNs [34], [28], [2], [38] or temporal convolution [31]. In these two methods, relationships between entities are aggregated through pooling [50], [48], [2], [21], [29], [34], soft attention [34], [50] and graph neural networks [11], [28], [31]. Recently, some works [35], [41] explore the Transformer [46] encoder-decoder structure for multimodal motion prediction. We choose WayFormer [35] as our motion forecasting baseline: it is a state-of-the-art model that can flexibly integrate features from our new LiDAR modality.

LiDAR data compression.
Releasing the LiDAR data for our dataset presents a data storage challenge: without LiDAR compression techniques, the raw sensor data of WOMD-LiDAR exceeds 20 TB. As valuable as the data is, this size is inconvenient for fast distribution in the research community. Fortunately, in recent years there has been growing interest in LiDAR point cloud compression techniques. For example, one major stream of work, octree-based methods, which represent and compress quantized point clouds [7], [40], has been released as a point cloud compression standard [20]. More recently, neural network based octree compression methods have been proposed, such as Octsqueeze [27], MuSCLE [5] and VoxelContextNet [37]. Alternatively, LiDAR point clouds can be stored as range images, and a family of image-based compression methods has been adapted for the task. For example, traditional methods such as JPEG, PNG and TIFF have been applied to compressing range images [1], [24]. Recently, RIDDLE [51] extended such methods by applying a deep neural network and delta encoding to compress range images. We adopt the delta encoder of RIDDLE [51] and reduce the raw sensor data by ∼8×.

Fig. 2: Visualization of a range image from the top LiDAR sensor in WOMD-LiDAR. The three rows show range, (normalized) intensity, and (normalized) elongation from the first LiDAR return (second return omitted for brevity). We crop the range images to show only the front 180°.

III. DATASET

In this section, we describe the WOMD-LiDAR dataset statistics, the LiDAR data format, and the compression technique used to reduce the storage footprint.

A. Dataset Statistics

To evaluate motion forecasting models, we leverage existing labels gathered from WOMD [17]. We follow the WOMD dataset format and extract 9-second scenarios containing LiDAR data. WOMD-LiDAR is split into 70% training, 15% validation, and 15% test sets with the same run segments as in WOMD.
For training a motion forecasting model, it is sufficient to use only the LiDAR data of the past and current timestamps, while the future timestamps are used as ground truth to calculate loss and metrics. We therefore release only the first 1 second of LiDAR data for each scene. This reduces the size of the raw LiDAR data by 87.9%. However, it still requires ∼20 TB of storage. We further apply a LiDAR compression method to reduce its size (Section III-C).

Datasets comparison: Compared with WOD [42], one of the largest datasets for the perception task, WOMD-LiDAR contains 100× more scenes and 80× the total hours. nuScenes [9] is currently the only other LiDAR dataset suitable for the motion forecasting task. WOMD-LiDAR is significantly larger than nuScenes, with 104k (100×) segments and 574 hours (100×) of total time (see Table I).

B. LiDAR Data Format

LiDAR data is encoded in WOMD-LiDAR as range images $\in \mathbb{R}^{h \times w \times 6}$. Following the format of WOD [42], the first two returns of each LiDAR pulse are provided. Range images are collected from five LiDAR sensors. For the top LiDAR, h = 64, w = 2650. For the other sensors, h = 116, w = 150. Each pixel in the range images includes the following:
• Range (scalar): The distance between the origin of the LiDAR sensor frame and the LiDAR point.
• Intensity (scalar): A measurement of the return strength of the laser pulse that produces the LiDAR point, which is partially based on the reflectivity of the object struck by the laser pulse.
• Elongation (scalar): The elongation of the laser pulse beyond its normal width.
• Vehicle pose ($\in \mathbb{R}^3$): The pose of the vehicle when the LiDAR point is captured.
The range image format is necessary to exploit efficient compression schemes that reduce storage requirements (Section III-C). Fig. 2 shows the different features that constitute the range images as monochromatic images, one for each feature. We provide a tutorial² that shows how to decompress range images and convert them into the features above.

C. LiDAR Data Compression

Storing raw sensory data is prohibitively expensive. Therefore, we apply the delta encoding compressor proposed in [51]. We use a non-deep-learning version of the algorithm for fast compression and decompression. This compression is lossless under a pre-specified quantization precision. Therefore, we do not expect it to impact end-to-end learning.

The basic idea of the algorithm is to use a previous pixel value in the range image to predict the next valid pixel (the closest valid one on its right in the spatial domain). Instead of storing the absolute pixel values, we store the residuals between the predictions and the original pixel values. Since the residuals have a more concentrated distribution (especially on quantized range images) with lower entropy, they are compressed to a much smaller size with varint coding followed by zlib compression.

In our implementation, we quantize the range image channels with the following precision: range 0.005m, intensity 0.01m, elongation 0.01m, pose translation 0.0001m, pose rotation 0.001 radians. We leverage the default varint coding from the publicly available Google Protobuf implementation (for uint and bool fields). We will release our compression algorithm together with the dataset.

IV. MOTION FORECASTING MODEL WITH LIDAR

To validate the effectiveness of WOMD-LiDAR, we train a WayFormer [35] model using LiDAR embeddings as a baseline. We describe the details of the motion forecasting model and the LiDAR encoder (Fig. 3) in this section.

A. Motion Forecasting Model

We extend the WayFormer [35] model to incorporate raw LiDAR data. It adopts a transformer-based scene encoder into which features from various modalities can be flexibly plugged.
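
Returning briefly to the compression scheme of Section III-C, the delta-encoding idea (quantize, predict each valid pixel from the previous valid one, store residuals, then varint-code and zlib-compress) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released compressor: all function names are hypothetical, a zigzag mapping is assumed so that signed residuals can be varint-coded, each row starts from a predictor of 0, and the validity mask is carried separately rather than compressed.

```python
import zlib


def zigzag(n):
    """Map signed ints to unsigned ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u):
    """Inverse of zigzag."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2


def varint_encode(values):
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        out.append(v)
    return bytes(out)


def varint_decode(data):
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:          # continuation bit: more bytes follow
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values


def compress_row(row, precision):
    """Delta-encode one range-image row; None marks an invalid pixel (assumption)."""
    mask = [x is not None for x in row]
    prev, symbols = 0, []
    for x in row:
        if x is None:
            continue
        q = round(x / precision)          # quantize to the chosen precision
        symbols.append(zigzag(q - prev))  # residual w.r.t. previous valid pixel
        prev = q
    return mask, zlib.compress(varint_encode(symbols))


def decompress_row(mask, blob, precision):
    residuals = iter(unzigzag(u) for u in varint_decode(zlib.decompress(blob)))
    out, prev = [], 0
    for valid in mask:
        if valid:
            prev += next(residuals)
            out.append(prev * precision)
        else:
            out.append(None)
    return out


row = [10.000, 10.005, None, 10.015, 10.010]   # ranges in meters
mask, blob = compress_row(row, precision=0.005)
restored = decompress_row(mask, blob, precision=0.005)
```

Quantizing before differencing is what makes the round trip exact on the quantized grid: every valid pixel is recovered to within half a quantization step, which matches the "lossless under a pre-specified quantization precision" property described in Section III-C.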
The transformer fuses features from agent history states, traffic light signals, agent interaction states and road graph features. We add an additional LiDAR modality that is fed to the scene encoder. The features of the LiDAR modality are generated by a SWFormer [43] extractor and a LiDAR encoder. During training, we freeze the SWFormer feature extractor and update only the LiDAR encoder's model parameters. After the scene encoder fuses the multi-modal features, the output embeddings are fed to the trajectory decoder to produce the final predicted trajectories.

² https://bit.ly/tutorial-womd-lidar

[Figure 3 diagram. Left, SWFormer Feature Extractor: LiDAR of WOMD passes through SWFormer blocks with sparse partition (SP) at scales 1 (/1), 2 (/2), ..., 5 (/32); the per-scale embeddings are fused, yield detection outputs, and form the concatenated embedding $E \in \mathbb{R}^{N \times T \times C}$; the LiDAR Encoder maps it to $E' \in \mathbb{R}^{M \times C'}$. Right, Motion Forecasting Model: projections of Agent History, Traffic Light, Agent Interaction, Road Graph and LiDAR from SWFormer feed the Scene Encoder; Learned Seeds S1, S2, S3 and the Trajectory Decoder produce trajectories T1, T2, T3.]

Fig. 3: Model structures of the LiDAR encoder (left) and the motion forecasting model (right). To encode LiDAR data, we adopt a pre-trained SWFormer [43] model and extract the embedding features (which can be decoded to produce detection results). Those features (in the light yellow box) from different scales are concatenated and fed to a WayFormer [35] model as a new modality feature for the motion forecasting task.

B. LiDAR Encoding Scheme

We adopt a pre-trained SWFormer [43] to extract LiDAR embeddings. The SWFormer is trained on WOD [42] for the 3D object detection task. It adopts sparse partition operators and transformer-based layers to encode LiDAR data at different scales. As the input to the scene encoder of the WayFormer model, we extract the embedding features that the detection heads use to produce detection results.
These features effectively encode rich information about objects and the surrounding context from noisy LiDAR points. To provide context agent information, we lower the detection confidence threshold to produce more, but less reliable, detected objects. This increases the recall of the detection results but decreases the precision. In addition to the embedding features, we also pad additional features:
• Detected box coordinates: We append the detected boxes' center coordinates to emphasize the positions of potential detected objects.
• Detected box size: The height, width and length of the boxes provide hints about objects from different categories.
• Foreground probability from the segmentation head: This helps reduce the noise from detection results.
The output tensor E from SWFormer with padded features is an N × T × C tensor, where N is the number of detected boxes, T is the number of input frames, and C is the feature size. To make E compatible as input for the scene encoder of WayFormer, we flatten the first two dimensions into the token dimension. A one-layer Axial Transformer [22] is applied as a LiDAR encoder to project the output tensor E to a fixed M-token tensor $E' \in \mathbb{R}^{M \times C'}$ with the same feature size as the other modalities.

V. EXPERIMENTS

A. Experiment Setup

LiDAR Feature Extractor. We train the SWFormer [43] on WOD [42] as the LiDAR feature extractor. We use a batch size of 4 and train for 80,000 steps on 64 V3 TPUs. The IoU thresholds for vehicles and pedestrians are 0.7 and 0.5, respectively. In the original SWFormer inference stage, boxes are filtered out if the predicted confidence is less than 0.5. To extract feature embeddings, we need more context information and high recall of the detection results. Thus, we lower the box confidence threshold τ to 0.1 (see the ablation study in Section V-D). The extracted embeddings are 128D vectors.
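
The feature assembly described so far (a lowered confidence threshold, a 128D embedding padded with box center, box size and foreground probability, and flattening the first two dimensions of E into tokens) can be sketched as follows. This is a pure-Python illustration under assumptions: the `Detection` container, the function names, the zero-padding of empty box slots, and the `max_boxes` parameter are hypothetical and are not taken from the SWFormer or WayFormer code.

```python
from dataclasses import dataclass
from typing import List, Sequence

EMBED_DIM = 128                    # detector embedding size stated in the text
FEAT_DIM = EMBED_DIM + 3 + 3 + 1   # + box center, box size, foreground probability


@dataclass
class Detection:                   # hypothetical container, for illustration only
    embedding: Sequence[float]     # EMBED_DIM values from the detector
    center: Sequence[float]        # (x, y, z)
    size: Sequence[float]          # (width, length, height)
    foreground_prob: float
    confidence: float


def frame_features(dets: List[Detection], tau: float = 0.1, max_boxes: int = 4):
    """Keep boxes with confidence >= tau, most confident first, as N x C rows."""
    kept = sorted((d for d in dets if d.confidence >= tau),
                  key=lambda d: d.confidence, reverse=True)[:max_boxes]
    rows = [[*d.embedding, *d.center, *d.size, d.foreground_prob] for d in kept]
    # Assumption: empty box slots are zero-padded so every frame has N rows.
    rows += [[0.0] * FEAT_DIM for _ in range(max_boxes - len(rows))]
    return rows


def flatten_tokens(frames):
    """frames holds T frames of N x C rows; returns (N*T) x C tokens.

    Tokens are ordered box-major, i.e. as E[n][t] for an N x T x C tensor E.
    """
    n_boxes, n_frames = len(frames[0]), len(frames)
    return [frames[t][n] for n in range(n_boxes) for t in range(n_frames)]


def make_det(confidence):
    return Detection([0.5] * EMBED_DIM, (1.0, 2.0, 0.0), (2.0, 4.5, 1.6), 0.9, confidence)


frames = [
    frame_features([make_det(0.05), make_det(0.3), make_det(0.9)], tau=0.1, max_boxes=3),
    frame_features([make_det(0.6)], tau=0.1, max_boxes=3),
]
tokens = flatten_tokens(frames)    # 3 boxes x 2 frames -> 6 tokens of FEAT_DIM values
```

With τ = 0.1 the 0.05-confidence box in the first frame is dropped while the 0.3-confidence box survives; at the default τ = 0.5 only the 0.9-confidence box would remain, which illustrates the recall and precision trade-off discussed above.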
With box coordinates (x, y, z), box size (width, length, height) and foreground probability, the final LiDAR features are 135D vectors (C = 135) fed to the scene encoder of the WayFormer [35]. We set the maximum number of detected boxes in each frame to 140 (N ≤ 140). If there are more than 140 detected objects, we discard the detected objects with the lowest box confidence scores. We set the number of output tokens of the LiDAR encoder to M = 10 before sending the embeddings to the scene encoder.

Motion Forecasting Model. We use a batch size of 16 and train the WayFormer model for 1.2M steps on 16 V3 TPUs. We project all modalities to the same feature size of 256D (C′ = 256), then utilize cross-attention with latent queries to reduce the number of tokens to 192. The scene encoder has 2 transformer layers. WayFormer encodes the history states of 1 second (10 steps at 10Hz) and predicts K = 6 trajectories for each agent's future 8 seconds.

B. Metrics

Given an input sample, a motion forecasting model predicts K trajectories for N agents in the scene for the future T steps, $x^k = \{x^k_{i,t}\}_{i=1:N,\,t=1:T}$. We denote the corresponding ground truth trajectories as $y = \{y_{i,t}\}_{i=1:N,\,t=1:T}$. We inherit the WOMD motion forecasting challenge metrics [17].

minADE. The minimum Average Displacement Error calculates the ℓ2 distance between the ground truth and the predicted trajectory closest to it, averaged across all time steps:

$$\mathrm{minADE} = \min_k \frac{1}{NT} \sum_i \sum_t \lVert x^k_{i,t} - y_{i,t} \rVert_2 \tag{1}$$

Set: Standard Validation
                   Vehicle                Pedestrian             Cyclist
Model              minADE ↓  MR ↓  mAP ↑  minADE ↓  MR ↓  mAP ↑  minADE ↓  MR ↓  mAP ↑
LSTM [17]          1.34      0.25  0.23   0.63      0.13  0.23   1.26      0.29  0.21
Wayformer [35]     1.10      0.18  0.35   0.54      0.11  0.35   1.08      0.22  0.29
Wayformer + LiDAR  1.09      0.17  0.37   0.54      0.10  0.37   1.06      0.21  0.28

TABLE II: Marginal metrics on the standard validation set. All metrics are computed at 8s. We compare the baseline WayFormer [35] and WayFormer trained with LiDAR data on the WOMD-LiDAR standard motion forecasting track.

Miss Rate (MR).
MR measures whether the closest predicted trajectory $\min_k x^k_{i,t}$ matches the ground truth $y_{i,t}$. The MR at time step t is calculated as:

$$\mathrm{MR}_t = \min_k \bigvee_i \lnot \mathrm{IsMatch}(x^k_{i,t}, y_{i,t}) \tag{2}$$

More details of the IsMatch implementation can be found in WOMD [17].

Mean Average Precision (mAP). mAP is similar to the one used for the object detection task [32]. It computes the area under the precision-recall curve obtained by varying the confidence threshold for the predicted trajectories. The criteria for judging whether a trajectory is a true positive, false positive, etc. are consistent with the MR definition in Eq. 2. For each object, only the trajectory with the highest confidence is used to calculate the mAP for the corresponding true positive.

C. Baseline Model Performance

We evaluate our baseline model on the WOMD-LiDAR validation set. The results are shown in Table II. With LiDAR features, our model performs better than WayFormer for vehicles, pedestrians and cyclists on the Miss Rate (MR) metric, with a 0.01 decrease in each category. This indicates that LiDAR information provides location hints for WayFormer. For the minADE metric, the results are roughly the same. WayFormer with LiDAR inputs also achieves a 2% increase in mAP for the vehicle and pedestrian categories (0.35 to 0.37 in both). This is because LiDAR features provide more information about object locations, shapes and interactions with other objects. They help the WayFormer model understand the scene and predict more accurate trajectories. For cyclists, there is a minor regression in mAP (0.29 to 0.28). This is likely because the LiDAR points are noisy in this category, and we may need a better encoding method to extract useful information.

D. Ablation Study

In the following experiments, we report the average minADE, MR and mAP across the vehicle, pedestrian and cyclist categories at 3s, 5s and 8s on the validation set.

Threshold of SWFormer to extract embeddings. As described in Sec.
V-A, we lower the SWFormer threshold τ to get high recall of detected boxes so that we obtain more context information in the scene. We sweep the threshold of SWFormer from 0.0 to 0.5 (the default value of SWFormer), generate different training datasets extracted from WOMD-LiDAR, and evaluate the corresponding performance of the baseline model. When the threshold τ is lower, the number of predicted boxes from SWFormer becomes larger. This brings more context information to the motion forecasting model, but it also brings more noise into the inputs. As shown in Table III, WayFormer's performance is not very sensitive to τ. When τ = 0.1, WayFormer with LiDAR inputs achieves the best performance. When τ increases further, the number of detected boxes becomes smaller, which may result in a loss of useful information.

Threshold τ  minADE ↓  MR ↓    mAP ↑
0.0          0.5692    0.1401  0.4005
0.1          0.5553    0.1292  0.4191
0.3          0.5623    0.1399  0.4102
0.5          0.5675    0.1410  0.4087

TABLE III: Experiment results of sweeping the SWFormer threshold τ to extract embeddings. The metrics are evaluated on the WOMD-LiDAR validation set, averaged across categories and over results at 3s, 5s, and 8s.

Model                 minADE ↓  MR ↓    mAP ↑
No box coordinates    0.5852    0.1594  0.3947
No box sizes          0.5773    0.1476  0.4008
No foreground prob.   0.5601    0.1331  0.4110
Wayformer with LiDAR  0.5553    0.1292  0.4191

TABLE IV: Experiment results of masking out additional features in the LiDAR encoding (Sec. IV-B). The metrics are evaluated on the WOMD-LiDAR validation set, averaged across categories and over results at 3s, 5s, and 8s.

Different embedding features. There are three additional features (Sec. IV-B) included in the embedding output from the LiDAR encoder: detected box coordinates, box size and foreground probability. We mask out each feature in turn and check the WayFormer model performance in Table IV. The experiments show that without box coordinates, the minADE, MR and mAP regress by 0.0299, 0.0302 and 0.0244, respectively.
This indicates that, aside from the SWFormer embedding features, the box coordinates play an important role in motion forecasting. Compared to masking out box coordinates, masking out box sizes causes a smaller regression, with minADE and MR increased by 0.022 and 0.0184 and mAP decreased by 0.0183. Foreground probability also contributes slightly to the overall performance: masking it out regresses minADE, MR and mAP by 0.0048, 0.0039 and 0.0081, respectively.

Fig. 4: Visualization of prediction result comparison between WayFormer [35] (sub-figures on the left) and WayFormer with LiDAR inputs (sub-figures on the right). Fig (a): With LiDAR information, the predicted trajectories avoid crashing into parked cars. Fig (b): The predicted trajectories of cyclists avoid crashing into cars. Legends in the figure: Yellow and blue trajectories are predictions for different agents, with the blue trajectories being the highlighted ones. Red dotted lines are labeled ground truth trajectories for agents in the scene.

# tokens of embeddings   minADE ↓  MR ↓    mAP ↑
16                       0.6011    0.1702  0.3811
32                       0.5888    0.1610  0.3907
64                       0.5797    0.1503  0.3998
192                      0.5553    0.1292  0.4191

# layers of transformer  minADE ↓  MR ↓    mAP ↑
1                        0.5711    0.1440  0.3991
2                        0.5553    0.1292  0.4191
3                        0.5561    0.1325  0.4112

TABLE V: Experiment results for the scene encoder's number of tokens and number of transformer layers. The metrics are evaluated on the WOMD-LiDAR validation set, averaged across categories and over results at 3s, 5s, and 8s.

WayFormer modeling. We study the WayFormer hyperparameters in motion forecasting. Specifically, we conduct experiments to investigate the impact of the number of tokens and layers of the scene encoder, because the scene encoder provides the encoded embeddings for the trajectory decoder in the prediction stage. The embedding quality plays an important role in the motion forecasting task. As shown in Table V, the number of embedding tokens impacts quality more than the number of scene encoder transformer layers.
When we increase the number of tokens from 16 to 192 (the default WayFormer setting), the minADE and MR decrease from 0.6011 to 0.5553 and from 0.1702 to 0.1292, respectively, and mAP increases from 0.3811 to 0.4191. This indicates that with more tokens, more information is encoded in the embeddings for motion prediction. We also vary the number of transformer layers from 1 to 3 (Table V). The performance of the WayFormer model first improves (from 1 to 2 layers) and then regresses (from 2 to 3 layers). Thus, we set the number of scene encoder layers to 2.

E. Qualitative Results

We visualize the WayFormer prediction results on WOMD-LiDAR to check the quality of motion forecasting. Please check the supplementary video for more visualization results.

Visualization of WayFormer prediction results. We visualize some prediction results and analyze the prediction quality. As shown in Fig. 4, with LiDAR inputs, the WayFormer model predicts trajectories that avoid collisions with vehicles, pedestrians and cyclists. Specifically, in Fig 4(a), with LiDAR information, the predicted trajectories avoid crashing into parked cars. In Fig 4(b), the predicted trajectories of cyclists avoid crashing into cars. We observe more reasonable predicted trajectories, matching the improved performance in Table II.

VI. CONCLUSION AND FUTURE DIRECTIONS

Conclusion. In this work, we augment WOMD with the largest-scale LiDAR dataset in the community, containing LiDAR point clouds for more than 100,000 scenes. To address the huge data storage requirements, we adopt state-of-the-art LiDAR data compression technology and reduce the dataset size to less than 2.5 TB. To evaluate the suitability of LiDAR for the motion forecasting task, we provide a WayFormer baseline trained with LiDAR. Experiments show that LiDAR data improves performance on the motion forecasting task.

Limitations and future work.
1) In this work, we only trained WayFormer and WayFormer + LiDAR models. We will investigate end-to-end models that can directly encode LiDAR point clouds with motion forecasting task in mind. 2) The SWFormer detector, which serves as the point cloud encoder in our model, can only represents object-level infor- mation. We will look into some approaches that can leverage scene-level information, that are not sensitive to the detection prediction thresholds. 3) Another interesting direction is to explore methods that solely depends on the sensor data to avoid the dependency on human-defined object interface. REFERENCES [1] Jae-Kyun Ahn, Kyu-Yul Lee, Jae-Young Sim, and Chang-Su Kim. Large-scale 3d point cloud compression using adaptive radial distance prediction in hybrid coordinate domains. IEEE Journal of Selected Topics in Signal Processing, 2014. [2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In CVPR, 2016. [3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV, 2019. [4] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011. [5] Sourav Biswas, Jerry Liu, Kelvin Wong, Shenlong Wang, and Raquel Urtasun. Muscle: Multi sweep compression of lidar using deep entropy models. NeurIPS, 2020. [6] Julian Bock, Robert Krajewski, Tobias Moers, Steffen Runde, Lennart Vater, and Lutz Eckstein. The ind dataset: A drone dataset of naturalistic road user trajectories at german intersections. In Intelligent Vehicles Symposium (IV), 2020. [7] Mario Botsch, Andreas Wiratanaya, and Leif Kobbelt. Efficient high quality rendering of point sampled geometry. Rendering Techniques, 2002. [8] Antonia Breuer, Jan-Aike Term¨ohlen, Silviu Homoceanu, and Tim Fingscheidt. 
opendd: A large-scale roundabout drone dataset. In In- ternational Conference on Intelligent Transportation Systems (ITSC), 2020. [9] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020. [10] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021. [11] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. Spagnn: Spatially-aware graph neural networks for relational behavior forecast- ing from sensor data. In ICRA, 2020. [12] Sergio Casas, Wenjie Luo, and Raquel Urtasun. Intentnet: Learning to predict intention from raw sensor data. In CoRL, 2018. [13] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypothe- ses for behavior prediction. CoRL, 2019. [14] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019. [15] Benjamin Coifman and Lizhe Li. A critical evaluation of the next gen- eration simulation (ngsim) vehicle trajectory dataset. Transportation Research Part B: Methodological, 2017. [16] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In ICRA, 2019. [17] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In ICCV, 2021. 
[18] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In CVPR, 2020.
[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[20] D Graziosi, O Nakagami, S Kuma, A Zaghetto, T Suzuki, and A Tabatabai. An overview of ongoing point cloud compression standardization activities: Video-based (v-pcc) and geometry-based (g-pcc). APSIPA Transactions on Signal and Information Processing, 2020.
[21] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
[22] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
[23] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[24] Hamidreza Houshiar and Andreas Nüchter. 3d point cloud compression using conventional image compression for efficient data transmission. In International Conference on Information, Communication and Automation Technologies (ICAT), 2015.
[25] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In CoRL, 2021.
[26] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. https://www.woven-planet.global/en/data/prediction-dataset, 2020.
[27] Lila Huang, Shenlong Wang, Kelvin Wong, Jerry Liu, and Raquel Urtasun. Octsqueeze: Octree-structured entropy model for lidar compression. In CVPR, 2020.
[28] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, and Deva Ramanan. What-if motion prediction for autonomous driving. arXiv preprint arXiv:2008.10587, 2020.
[29] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In CVPR, 2017.
[30] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer graphics forum. Wiley Online Library, 2007.
[31] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. In ECCV, 2020.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[33] Andrey Malinin, Neil Band, German Chesnokov, Yarin Gal, Mark JF Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. arXiv preprint arXiv:2107.07455, 2021.
[34] Jean Mercat, Thomas Gilles, Nicole El Zoghby, Guillaume Sandou, Dominique Beauvois, and Guillermo Pita Gil. Multi-head attention for multi-modal joint vehicle motion forecasting. In ICRA, 2020.
[35] Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. arXiv preprint arXiv:2207.05844, 2022.
[36] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[37] Zizheng Que, Guo Lu, and Dong Xu. Voxelcontext-net: An octree based framework for point cloud compression. In CVPR, 2021.
[38] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. Precog: Prediction conditioned on goals in visual multi-agent settings. In CVPR, 2019.
[39] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.
[40] Ruwen Schnabel and Reinhard Klein. Octree-based point-cloud compression. In PBG@SIGGRAPH, 2006.
[41] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. In NeurIPS, 2022.
[42] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
[43] Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In ECCV, 2022.
[44] Charlie Tang and Russ R Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
[45] Ekaterina Tolstaya, Reza Mahjourian, Carlton Downey, Balakrishnan Varadarajan, Benjamin Sapp, and Dragomir Anguelov. Identifying driver interactions via conditional behavior prediction. In ICRA, 2021.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[47] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In NeurIPS Track on Datasets and Benchmarks, 2021.
[48] Maosheng Ye, Jiamiao Xu, Xunnong Xu, Tongyi Cao, and Qifeng Chen. Dcms: Motion forecasting with dual consistency and multi-pseudo-target supervision. arXiv preprint arXiv:2204.05859, 2022.
[49] Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Naumann, Julius Kummerle, Hendrik Konigshof, Christoph Stiller, Arnaud de La Fortelle, et al. Interaction dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps. arXiv preprint arXiv:1910.03088, 2019.
[50] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Ben Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, et al. Tnt: Target-driven trajectory prediction. In CoRL, 2021.
[51] Xuanyu Zhou, Charles R Qi, Yin Zhou, and Dragomir Anguelov. Riddle: Lidar data compression with range image deep delta encoding. In CVPR, 2022.

APPENDIX

VII. SUPPLEMENTARY DATASET DETAILS

Dataset comparison. Table VI provides further details of the dataset comparison. In Table I of the main paper, “Sampling Rate” is the data collection rate in Hz, and “3D Maps” indicates whether the dataset provides 3D map information. The “Dataset Size” entries were collected by Argoverse 2 [47]. Together, Table VI and Table I of the main paper give the complete comparison between WOMD-LiDAR and the other datasets.

Supplementary details of WOMD-LiDAR. Following [17], [18], map data is encoded as a set of polylines and polygons created from curves sampled at a resolution of 0.5 meters. Traffic signal states are also provided, along with other static map feature types (e.g., lane boundary lines, road edges, and stop signs). We followed [17] to mine the interesting scenarios in WOMD-LiDAR.

VIII. VISUALIZATION

Scenario videos with LiDAR. Fig. 5 visualizes additional scenarios, showing not only the bounding boxes of the agents of interest but also the released high-quality, well-calibrated LiDAR data. We also provide some simulated scenes with both LiDAR and labeled boxes on WOMD-LiDAR. They are provided as mov files in the supplementary materials.
Each video clip contains 11 frames in slow motion, with the LiDAR data visualized together with the agent boxes, because we release LiDAR data only for the first 11 frames in WOMD-LiDAR.

WayFormer + LiDAR prediction visualization. Fig. 6 provides additional visualization results. In these examples, our WayFormer [35] + LiDAR model tends to predict trajectories that avoid collisions with other agents (vehicles, pedestrians, and cyclists). This is consistent with the improved performance in Table II.

IX. EXPERIMENTS

Ablation study of the LiDAR encoder. We provide an ablation study of the LiDAR encoder described in Section 4.2 of the main paper. Specifically, we study the number of output tokens M and the number of transformer layers. The results are shown in Table VII. As M increases, the final performance first improves and then degrades: M = 10 gives the best minADE (0.5553), MR (0.1292), and mAP (0.4191), so we set M = 10 in our experiments. In contrast, the LiDAR encoder is less sensitive to the number of transformer layers; there is only a slight regression as the number of layers increases (e.g., minADE rises from 0.5553 with one layer to 0.5613 with two). To achieve the best performance and fast training, we use a single layer in our experiments.

Fig. 5: Scenario visualizations with LiDAR. Best viewed in color and zoomed in for details.

Datasets (columns): INTERACTION | Woven Planet | Shifts | Argoverse 2 | nuScenes | WOMD-LiDAR
Offboard Perception: ✓ ✓
Mined for Interestingness: - | - | - | ✓ | - | ✓
Traffic Signal States: ✓ ✓ ✓

TABLE VI: Comparison of popular behavior prediction and motion forecasting datasets. “-” indicates that the data is not available or not applicable. “Offboard Perception” is checked if the labels were auto-labeled by offboard perception, which can generate high-quality labels. “Mined for Interestingness” is checked if the dataset was mined for interesting interactions after data collection. “Traffic Signal States” is checked if the dataset provides traffic light states.
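As an illustration of the map encoding described in Section VII, where curves are sampled at a resolution of 0.5 meters, the following is a minimal sketch of resampling a polyline at a fixed arc-length spacing. It is not the WOMD-LiDAR data pipeline: the function name `resample_polyline`, the use of linear interpolation between vertices, and the decision to always keep the last vertex are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def resample_polyline(points: List[Point], spacing: float = 0.5) -> List[Point]:
    """Resample a 3D polyline at uniform arc-length spacing (hypothetical helper).

    The first vertex is always kept. The last vertex is appended if it is not
    already a sample, so the final gap may be shorter than `spacing`.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    if len(points) < 2:
        return list(points)

    resampled = [points[0]]
    next_s = spacing  # arc length at which the next sample is due
    walked = 0.0      # arc length at the start of the current segment
    for p0, p1 in zip(points, points[1:]):
        seg_len = math.dist(p0, p1)
        # Emit every sample whose arc length falls inside this segment.
        while seg_len > 0 and next_s <= walked + seg_len:
            t = (next_s - walked) / seg_len
            resampled.append(tuple(a + t * (b - a) for a, b in zip(p0, p1)))
            next_s += spacing
        walked += seg_len
    if resampled[-1] != points[-1]:
        resampled.append(points[-1])
    return resampled


# Example: a 2 m straight lane segment sampled every 0.5 m yields 5 points.
lane = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(len(resample_polyline(lane, spacing=0.5)))  # 5
```

Sampling by arc length rather than by vertex index keeps the point density roughly constant along curved and straight map features alike, which is the property a fixed sampling resolution is meant to provide.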
# output tokens (M) | minADE ↓ | MR ↓ | mAP ↑
5 | 0.5700 | 0.1501 | 0.3999
10 | 0.5553 | 0.1292 | 0.4191
20 | 0.5594 | 0.1313 | 0.4102

# layers | minADE ↓ | MR ↓ | mAP ↑
1 | 0.5553 | 0.1292 | 0.4191
2 | 0.5613 | 0.1392 | 0.3998
3 | 0.5610 | 0.1398 | 0.4001

TABLE VII: Ablation results for the number of output tokens M and the number of transformer layers in the LiDAR encoder. The metrics are evaluated on the WOMD-LiDAR validation set and averaged across categories and over the results at 3s, 5s, and 8s.

Fig. 6: Comparison of prediction results between WayFormer [35] (sub-figures on the left) and WayFormer with LiDAR inputs (sub-figures on the right), panels (a)-(h). Legend: yellow and blue trajectories are predictions for different agents, with the blue trajectories highlighted. Red dotted lines are the labeled ground-truth trajectories of agents in the scene. More visualization results are available in the supplementary material. Best viewed in color and zoomed in for details.
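The choice of M and of the number of transformer layers discussed in Section IX can be read off Table VII with a short script. The sketch below copies the numbers from Table VII; the ranking rule (highest mAP, ties broken by lower minADE) is an assumption for illustration, since the paper does not state an exact selection criterion, and the names `Metrics` and `best_setting` are hypothetical.

```python
from typing import Dict, NamedTuple


class Metrics(NamedTuple):
    min_ade: float    # lower is better
    miss_rate: float  # lower is better
    map_score: float  # higher is better


# Values copied from Table VII (WOMD-LiDAR validation set).
TOKENS_ABLATION: Dict[int, Metrics] = {
    5: Metrics(0.5700, 0.1501, 0.3999),
    10: Metrics(0.5553, 0.1292, 0.4191),
    20: Metrics(0.5594, 0.1313, 0.4102),
}
LAYERS_ABLATION: Dict[int, Metrics] = {
    1: Metrics(0.5553, 0.1292, 0.4191),
    2: Metrics(0.5613, 0.1392, 0.3998),
    3: Metrics(0.5610, 0.1398, 0.4001),
}


def best_setting(results: Dict[int, Metrics]) -> int:
    """Pick the setting with the highest mAP, breaking ties by lower minADE.

    This ranking rule is an illustrative assumption, not the paper's stated rule.
    """
    return max(results, key=lambda k: (results[k].map_score, -results[k].min_ade))


print(best_setting(TOKENS_ABLATION))  # 10
print(best_setting(LAYERS_ABLATION))  # 1
```

Under this rule the script returns M = 10 and a single transformer layer, matching the settings used in the experiments. In Table VII these two settings are also best on minADE and MR, so ranking by either of those metrics would give the same result.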