WOMD-LiDAR: Raw Sensor Dataset Benchmark for Motion Forecasting

Kan Chen, Runzhou Ge, Hang Qiu, Rami Al-Rfou, Charles Qi, Xuanyu Zhou, Zoey Yang, Scott Ettinger, Pei Sun, Zhaoqi Leng, Mustafa Baniodeh, Ivan Bogun, Weiyue Wang, Mingxing Tan, Dragomir Anguelov

*This work was done in Waymo LLC
arXiv:2304.03834v2 [cs.CV] 18 Feb 2024

Abstract: Widely adopted motion forecasting datasets substitute the observed sensory inputs with higher-level abstractions such as 3D boxes and polylines. These sparse shapes are inferred by annotating the original scenes with perception systems' predictions. Such intermediate representations tie the quality of the motion forecasting models to the performance of computer vision models. Moreover, the human-designed explicit interfaces between perception and motion forecasting typically pass only a subset of the semantic information present in the original sensory input. To study the effect of these modular approaches, design new paradigms that mitigate these limitations, and accelerate the development of end-to-end motion forecasting models, we augment the Waymo Open Motion Dataset (WOMD) with large-scale, high-quality, diverse LiDAR data for the motion forecasting task. The new augmented dataset (WOMD-LiDAR)¹ consists of over 100,000 scenes that each span 20 seconds, with well-synchronized and calibrated high-quality LiDAR point clouds captured across a range of urban and suburban geographies. Compared to the Waymo Open Dataset (WOD), the WOMD-LiDAR dataset contains 100× more scenes. Furthermore, we integrate the LiDAR data into the motion forecasting model training and provide a strong baseline. Experiments show that the LiDAR data brings improvement in the motion forecasting task. We hope that WOMD-LiDAR will provide new opportunities for boosting end-to-end motion forecasting models.

¹ https://waymo.com/open/data/motion/

[Figure 1 panels]
(a) Sophisticated interactions with (left) and without LiDAR (right).
(b) Predicted trajectories with (left) and without LiDAR data (right).
Fig. 1: Human-interpretable labels from the perception system provide limited information at the scene level and the object level. In sophisticated scenes with interaction between multiple objects, raw sensor data provides rich information and helps improve the motion forecasting performance. Legends in the figure: Yellow and blue (highlighted) trajectories are predictions for different agents. Red dotted lines are agents' ground truth trajectories.

I. INTRODUCTION

Motion forecasting plays an important role for planning in autonomous driving systems and has received increasing attention in the research community [13], [18], [38], [45], [50], [44]. The prohibitively expensive storage requirements for publishing raw sensor data for driving scenes have limited the major motion forecasting datasets [17], [47], [9], [26], [49]. They instead release abstract representations, such as 3D boxes from pre-trained perception models (for objects) and polylines (for maps), to represent the driving scenes.

The absence of the raw sensor data leads to the following limitations: 1) Motion forecasting relies on a lossy representation of the driving scenes (Fig. 1). The human-designed interfaces lack the specificity required by the motion forecasting task. For example, the taxonomy of the agent types in the Waymo Open Motion Dataset (WOMD) [17] is limited to only three types: vehicle, pedestrian, cyclist. In practice, we interact with agents who might be hard to fit into this taxonomy, such as pedestrians on scooters or motorcyclists. Moreover, the fidelity of the input features is limited to 3D boxes, which hide many important details such as pedestrian postures and gaze directions. 2) Coverage of the driving scene representation is centered around where the perception system detects objects. The detection task becomes a bottleneck for transferring information to motion forecasting and planning when we are not sure whether an object exists, especially in the first moments after an object surfaces. We hope for a more graceful, error-robust transmission of information between the systems. 3) Training perception models to match these intermediate representations might evolve them into overly complicated systems that get evaluated on subtasks that are not well correlated with overall system quality.

The goal of this work is to provide a large-scale, diverse raw sensor dataset for the motion forecasting task. We aim to augment WOMD [17] with LiDAR data in a format similar to that of WOD [42] for the motion forecasting task, with 100× more scenes than those available in WOD [42]. To the best of our knowledge, it is the largest publicly available LiDAR dataset across perception or motion forecasting tasks (Table I). To overcome the huge data storage problem and make the dataset user-friendly for academic research, we adopt state-of-the-art LiDAR compression technology [51]. It reduces the LiDAR dataset by ∼8×, resulting in a final WOMD-LiDAR dataset of around 2.3 TB.

Datasets, in column order: INTERACTION, Woven Planet, Shifts, Argoverse 2, nuScenes, WOMD-LiDAR
Has LiDAR Data: ✓ ✓
# Segments: -, 170k, 600k, 250k, 1k, 104k
Segment Duration: -, 25s, 10s, 11s, 20s, 20s
Total Time: 16.5h, 1118h, 1667h, 763h, 5.5h, 574h
Unique Roadways: 2km, 10km, -, 2220km, -, 1750km
Sampling Rate: 10Hz, 10Hz, 5Hz, 10Hz, 2Hz, 10Hz
# Cities: 6, 1, 6, 6, 2, 6
3D Maps: ✓ ✓ ✓
Dataset Size†: -, 22GB, 120GB, 58GB, 48GB, 2.29TB*
TABLE I: Comparison of the popular behavior prediction and motion forecasting datasets. We compare our WOMD-LiDAR with INTERACTION [49], Woven Planet [26], Shifts [33], Argoverse 2 [47], nuScenes [9]. "-" indicates that the data is not available or not applicable. †The sizes are cited from [47]. *WOMD-LiDAR dataset size is after ∼8× compression.

To demonstrate the usefulness of the new LiDAR data, we propose a novel and simple motion forecasting baseline, which leverages raw LiDAR data to boost prediction accuracy. Instead of jointly training the perception and prediction networks, which demands a huge memory footprint, we take a two-stage approach: we first apply a perception model [43] to extract embedding features from LiDAR data. Then, during training, we feed these embeddings to a motion forecasting model, WayFormer [35]. We evaluate the model with the same metrics as WOMD [17]. Experiments show that, with LiDAR data, the WayFormer model has a 2% mAP increase for both Vehicle and Pedestrian prediction. This indicates that WOMD-LiDAR brings useful information and can further improve motion forecasting models' performance.

The WOMD-LiDAR data has been made publicly available to the research community, and we hope it will provide new directions and opportunities in developing end-to-end motion forecasting models. Additionally, WOMD-LiDAR opens the door for new research on detection and tracking with a very large amount of 3D boxes and tracks.

We summarize the contributions of our work as follows:
• We release the largest scale LiDAR dataset for motion forecasting with high quality raw sensor data across a wide spectrum of diverse scenes.
• We provide a baseline that boosts the motion forecasting performance using the raw data, demonstrating the efficacy of the sensor inputs.
• We design an encoding scheme that utilizes intermediate perception representations as a feature extraction utility for motion forecasting models.

II. RELATED WORK

Motion forecasting datasets. An increasing number of motion forecasting datasets have been released [17], [26], [25], [9], [47], [49], [39], [15], [36], [30], [4], [8], [6]. Table I compares the most relevant motion forecasting datasets that target real-world urban driving environments. The Woven Planet prediction dataset [26] processed raw data through their perception system with over 1000 hours of logs for the traffic agents. nuScenes [9] is an autonomous driving dataset that supports detection, tracking, prediction and localization. However, neither of these [26], [9] explicitly collected or upsampled diverse, complex or interactive driving scenarios. Argoverse [14], [47] mined for vehicles in various scenarios (e.g., intersections, dense traffic). The INTERACTION dataset [49] collects some interactive scenarios (e.g., roundabouts, ramp merging). The Shifts [33] dataset targets vehicle motion prediction and has the longest duration. However, many of these long-duration datasets [26], [49], [33] lack LiDAR data, blocking the exploration of end-to-end motion forecasting. nuPlan [10], an ego vehicle's planning dataset, released only a subset of the LiDAR sequences. Compared with other autonomous driving perception datasets [42], [9], [19], [3] that provide LiDAR frames, WOMD-LiDAR is significantly larger in terms of the total time, number of scenes and object interactions.

Motion forecasting modeling. A popular approach is to render each input frame as a rasterized top-down image where each channel represents different scene elements [13], [16], [29], [23], [12], [50]. Another method is to encode agent state history using temporal modeling techniques like RNN [34], [28], [2], [38] or temporal convolution [31]. In these two methods, relationships between entities are aggregated through pooling [50], [48], [2], [21], [29], [34], soft attention [34], [50] and graph neural networks [11], [28], [31]. Recently, some works [35], [41] explore the Transformer [46] encoder-decoder structure for multimodal motion prediction. We choose WayFormer [35] as our motion forecasting baseline: it is a state-of-the-art model, which can flexibly integrate features from our new LiDAR modality.

LiDAR data compression. Releasing the LiDAR data for our dataset presents a data storage challenge: without LiDAR compression techniques, the raw sensor data of WOMD-LiDAR exceeds 20 TB. As valuable as the data is, the size is inconvenient for fast distribution in the research community. Fortunately, in recent years, there has been growing interest in LiDAR point cloud compression techniques. For example, one major stream of work, octree-based methods, which represent and compress quantized point clouds [7], [40], has been released as a point cloud compression standard [20]. More recently, neural-network-based octree compression methods have been proposed, such as Octsqueeze [27], MuSCLE [5] and VoxelContextNet [37]. Alternatively, LiDAR point clouds can be stored as range images. A family of image-based compression methods has been adapted for the task. For example, traditional methods such as JPEG, PNG and TIFF have been applied to compressing range images [1], [24]. Recently, RIDDLE [51] extends such methods by applying a deep neural network and delta encoding to compress range images. We adopt the delta encoder of RIDDLE [51] and reduce the raw sensor data by ∼8×.

III. DATASET

In this section, we describe the WOMD-LiDAR dataset statistics, the LiDAR data format, and the compression technique used to reduce the storage footprint.

A. Dataset Statistics

To evaluate motion forecasting models, we leverage existing labels gathered from WOMD [17]. We follow the WOMD dataset format, and extract 9 second scenarios containing LiDAR data. WOMD-LiDAR is split into a 70% training, 15% validation, and 15% test set with the same run segments as in WOMD. For training a motion forecasting model, it is sufficient to use only the LiDAR data of the past and current timestamps, while the future timestamps are used as ground truth to calculate loss and metrics. We therefore release only the first 1 second of LiDAR data for each scene. This reduces the size of the raw LiDAR data by 87.9%. However, the data still requires ∼20 TB of storage. We further apply a LiDAR compression method to reduce its size (Section III-C).

Datasets comparison: Compared with WOD [42], one of the largest datasets for the perception task, WOMD-LiDAR contains 100× more scenes and 80× the total hours. nuScenes [9] is currently the only other LiDAR dataset suitable for the motion forecasting task. WOMD-LiDAR is significantly larger than nuScenes, with 104k (100×) segments and 574 hours (100×) of total time (see Table I).

B. LiDAR Data Format

LiDAR data is encoded in WOMD-LiDAR as range images ∈ R^{h×w×6}. Following the format of WOD [42], the first two returns of each LiDAR pulse are provided. Range images are collected from five LiDAR sensors. For the top LiDAR, h = 64, w = 2650. For the other sensors, h = 116, w = 150. Each pixel in the range images includes the following:
• Range (scalar): The distance between the origin of the LiDAR sensor frame and the LiDAR point.
• Intensity (scalar): A measurement describing the return strength of the laser pulse that produces the LiDAR point, which is partially based on the reflectivity of the object struck by the laser pulse.
• Elongation (scalar): The elongation of the laser pulse beyond its normal width.
• Vehicle pose (∈ R^3): The pose of the vehicle when the LiDAR point is captured.

The range image format is necessary to exploit efficient compression schemes to reduce storage requirements (Section III-C). Fig. 2 shows the different features that constitute the range images through monochromatic images, one for each feature. We provide a tutorial² to show how to decompress range images and convert them into the features above.

² https://bit.ly/tutorial-womd-lidar

Fig. 2: Visualization of a range image from the top LiDAR sensor in WOMD-LiDAR. The three rows show range, (normalized) intensity, and (normalized) elongation from the first LiDAR return (second return omitted for brevity). We crop the range images to only show the front 180°.

C. LiDAR Data Compression

Storing raw sensory data is prohibitively expensive. Therefore, we apply the delta encoding compressor proposed in [51]. We use a non-deep-learning version of the algorithm for fast compression and decompression. This compression is lossless under a pre-specified quantization precision. Therefore, we do not expect it to impact end-to-end learning.

The basic idea of the algorithm is to use a previous pixel value in the range image to predict the next valid pixel (the closest valid one on its right in the spatial domain). Instead of storing the absolute pixel values, we store the residuals between the predictions and the original pixel values. Since the residuals have a more concentrated distribution (especially on quantized range images) with lower entropy, they are compressed to a much smaller size with varint coding followed by zlib compression.

In our implementation, we quantize the range image channels with the following precision: range 0.005m, intensity 0.01m, elongation 0.01m, pose translation 0.0001m, pose rotation 0.001 radians. We leverage the default varint coding from the publicly available Google Protobuf implementation (for uint and bool fields). We will release our compression algorithm together with the dataset.

IV. MOTION FORECASTING MODEL WITH LIDAR

To validate the effectiveness of WOMD-LiDAR, we train a WayFormer [35] model using LiDAR embeddings as a baseline. We describe the details of the motion forecasting model and the LiDAR encoder (Fig. 3) in this section.

A. Motion Forecasting Model

We extend the WayFormer [35] model to incorporate raw LiDAR data. It adopts a transformer-based scene encoder into which features from various modalities can be flexibly plugged. The transformer fuses features from agent history states,

[Figure 3 diagram. Left: SWFormer feature extractor applied to the LiDAR of WOMD (sparse partition (SP) and SWFormer blocks at Scale 1: /1, Scale 2: /2, ..., Scale 5: /32; detection outputs; concatenated embedding E ∈ R^{N×T×C}). Right: motion forecasting model (LiDAR encoder producing E′ ∈ R^{M×C′}; projections of the LiDAR from SWFormer, agent history, traffic light, agent interaction and road graph inputs; scene encoder; learned seeds; trajectory decoder).]
Fig. 3: Model structures of LiDAR encoder (left) and motion forecasting model (right). To encode LiDAR data, we adopt a pre-trained SWFormer [43] model and extract the embedding features (which can be decoded to produce detection results).
Those features (in the light yellow box) from different scales are concatenated and fed to a WayFormer [35] model as a new modality feature for the motion forecasting task.

traffic light signals, agent interaction states and road graph features. We add an additional LiDAR modality that is fed to the scene encoder. The features of the LiDAR modality are generated by a SWFormer [43] extractor and a LiDAR encoder. During training, we freeze the gradients of the SWFormer feature extractor and update only the LiDAR encoder's model parameters. After applying the scene encoder to fuse multi-modal features, the output embeddings are fed to the trajectory decoder to produce the final predicted trajectories.

B. LiDAR Encoding Scheme

We adopt a pre-trained SWFormer [43] to extract LiDAR embeddings. The SWFormer is trained on WOD [42] for the 3D object detection task. The SWFormer adopts sparse partition operators and transformer-based layers to encode LiDAR data at different scales. We extract the embedding features that are used to produce detection results in the detection heads and use them as the input to the scene encoder of the WayFormer model. These features effectively encode rich information about objects and the context environment from noisy LiDAR points. To provide context agent information, we lower the detection confidence threshold to produce more, but less reliable, detected objects. This increases the recall of the detection results but decreases the precision. In addition to the embedding features, we also pad more features:
• Detected box coordinates: We append the center coordinates of the detected boxes to emphasize the positions of the potential detected objects.
• Detected box size: The height, width and length of the boxes provide hints about objects from different categories.
• Foreground probability from the segmentation head: This helps reduce the noise from detection results.

The output tensor E from SWFormer with padded features is an N × T × C tensor, where N is the number of detected boxes, T is the number of input frames, and C is the feature size. To make E compatible as input for the scene encoder of WayFormer, we flatten the first two dimensions into the token dimension. A one-layer Axial Transformer [22] is applied as a LiDAR encoder to project the output tensor E to a fixed M-token tensor $E' \in \mathbb{R}^{M \times C'}$ with the same feature size as the other modalities.

V. EXPERIMENTS

A. Experiment Setup

LiDAR Feature Extractor. We train the SWFormer [43] on WOD [42] as the LiDAR feature extractor. We set the batch size to 4 and train for 80,000 steps on 64 V3 TPUs. The IOU thresholds for vehicles and pedestrians are 0.7 and 0.5 respectively. In the original SWFormer inference stage, boxes are filtered out if the predicted confidence is less than 0.5. To extract feature embeddings, we need more context information and high recall of the detection results. Thus, we lower the box confidence threshold τ to 0.1 (see ablation study in Section V-D). The extracted embeddings are 128D vectors. With box coordinates (x, y, z), box size (width, length, height) and foreground probability, the final LiDAR features are 135D vectors (C = 135) fed to the scene encoder of the WayFormer [35]. We set the maximum number of detected boxes in each frame to 140 (N ≤ 140). If there are more than 140 detected objects, we discard the detected objects with low box confidence scores. We set the number of output tokens of the LiDAR encoder to M = 10 before sending the embeddings to the scene encoder.

Motion Forecasting Model. We use a batch size of 16 and train the WayFormer model for 1.2M steps on 16 V3 TPUs. We project all modalities to the same feature size of 256D (C′ = 256), then utilize cross-attention with latent queries to reduce the number of tokens to 192. The scene encoder has 2 transformer layers. WayFormer encodes the history states of 1 second (10 steps at 10Hz) and predicts K = 6 trajectories for each agent's future 8 seconds.

B. Metrics

Given an input sample, a motion forecasting model predicts K trajectories for N agents in the scene for the future T steps, $x^k = \{x^k_{i,t}\}_{i=1:N,\,t=1:T}$. We denote the corresponding ground truth trajectories as $y = \{y_{i,t}\}_{i=1:N,\,t=1:T}$. We inherit the WOMD motion forecasting challenge metrics [17].

minADE. The minimum Average Displacement Error calculates the $\ell_2$ distance between the ground truth and the predicted trajectory that is closest to it across all time steps:

\[ \mathrm{minADE} = \min_k \frac{1}{NT} \sum_i \sum_t \left\| x^k_{i,t} - y_{i,t} \right\|_2 \tag{1} \]

Set                  Model               Vehicle                    Pedestrian                 Cyclist
                                         minADE ↓  MR ↓   mAP ↑     minADE ↓  MR ↓   mAP ↑     minADE ↓  MR ↓   mAP ↑
Standard Validation  LSTM [17]           1.34      0.25   0.23      0.63      0.13   0.23      1.26      0.29   0.21
Standard Validation  Wayformer [35]      1.10      0.18   0.35      0.54      0.11   0.35      1.08      0.22   0.29
Standard Validation  Wayformer + LiDAR   1.09      0.17   0.37      0.54      0.10   0.37      1.06      0.21   0.28
TABLE II: Marginal metrics on the standard validation set. All metrics computed at 8s. We compare baseline WayFormer [35] and WayFormer trained with LiDAR data on the WOMD-LiDAR standard motion forecasting track.

Threshold τ   minADE ↓   MR ↓     mAP ↑
0.0           0.5692     0.1401   0.4005
0.1           0.5553     0.1292   0.4191
0.3           0.5623     0.1399   0.4102
0.5           0.5675     0.1410   0.4087
TABLE III: Experiment results of sweeping SWFormer threshold τ to extract embeddings. The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s, and 8s.

Miss Rate (MR). MR measures whether the closest predicted trajectory $\min_k x^k_{i,t}$ matches the ground truth $y_{i,t}$. The MR at time step t is calculated as:

\[ \mathrm{MR}_t = \min_k \bigvee_i \neg\,\mathrm{IsMatch}\left(x^k_{i,t},\, y_{i,t}\right) \tag{2} \]

More details of the implementation of the function IsMatch can be found in the WOMD dataset [17].

Mean Average Precision (mAP). mAP is similar to the one used for the object detection task [32].
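The minADE and miss-rate definitions in Eqs. (1) and (2) can be illustrated with a small sketch. This is not the official WOMD metrics implementation: the function names `min_ade` and `miss_rate_at` are hypothetical, positions are assumed to be 2D, and `is_match` is a simplified fixed-radius stand-in (the `radius` parameter is an assumption), whereas the actual IsMatch function is the one defined for WOMD [17].

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]          # T future positions of one agent
JointSample = Sequence[Trajectory]    # one trajectory per agent (N agents)


def l2(p: Point, q: Point) -> float:
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def min_ade(preds: Sequence[JointSample], gt: JointSample) -> float:
    """Eq. (1): mean L2 error over agents and steps, minimized over k."""
    n, t = len(gt), len(gt[0])
    ades = []
    for pred in preds:  # loop over the K hypotheses
        total = sum(l2(pred[i][s], gt[i][s]) for i in range(n) for s in range(t))
        ades.append(total / (n * t))
    return min(ades)


def miss_rate_at(preds, gt, step, radius=2.0) -> int:
    """Eq. (2) at one time step, using a simplified IsMatch (assumption)."""
    def is_match(p, q):
        # Hypothetical stand-in: the real IsMatch is defined for WOMD [17].
        return l2(p, q) <= radius

    misses = []
    for pred in preds:
        # Logical OR over agents of "not matched" for hypothesis k.
        any_agent_missed = any(
            not is_match(pred[i][step], gt[i][step]) for i in range(len(gt))
        )
        misses.append(1 if any_agent_missed else 0)
    return min(misses)  # 0 if at least one hypothesis matches every agent


# Toy example: N = 2 agents, T = 2 steps, K = 2 hypotheses.
gt = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
preds = [
    [[(0.0, 3.0), (1.0, 3.0)], [(0.0, 4.0), (1.0, 4.0)]],  # k = 0: every point off by 3
    [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 1.0)]],  # k = 1: a single error of 1
]
print(min_ade(preds, gt))
print(miss_rate_at(preds, gt, step=1))
```

In this toy case the second hypothesis has a single displacement error of 1 over four points, so it determines the minimum, and it also matches both agents at the last step under the assumed 2.0 radius.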
The mAP metric computes the area under the precision-recall curve obtained by varying the confidence threshold for the predicted trajectories. The criteria for judging whether a trajectory is a true positive, false positive, etc. are consistent with the MR definition in Eq. 2. For each object, only the trajectory with the highest confidence is used to calculate the mAP for the corresponding true positive.

C. Baseline Model Performance

We evaluate our baseline model on the WOMD-LiDAR validation set. The results are shown in Table II. With LiDAR features, our model performs better than WayFormer for vehicles, pedestrians and cyclists on the Miss Rate (MR) metric, with a 0.01 decrease in each category. This indicates that LiDAR information provides location hints for WayFormer. For the minADE metric, the results are roughly the same. WayFormer with LiDAR inputs also achieves a 2% increase in mAP (0.35 to 0.37) for the vehicle and pedestrian categories. This is because LiDAR features provide more information about the object locations, shapes and interactions with other objects. They help the WayFormer model understand the scene and predict more accurate trajectories. For cyclists, there is a minor regression in mAP. This is likely because the LiDAR points are noisy in this category, and we may need a better encoding method to extract useful information.

D. Ablation Study

In the following experiments, we report the average minADE, MR and mAP across vehicle, pedestrian and cyclist categories at 3s, 5s and 8s on the validation set.

Threshold of SWFormer to extract embeddings. As described in Sec. V-A, we lower the SWFormer threshold τ to get high recall of detected boxes so that we can get more context information in the scene. We sweep the threshold of SWFormer from 0.0 to 0.5 (the default value of SWFormer), generate different training datasets extracted from WOMD-LiDAR, and evaluate the corresponding performance of the baseline model. When the threshold τ is lower, the number of predicted boxes from SWFormer becomes larger. This brings more context information for the motion forecasting model, but it also brings more noise into the inputs. As shown in Table III, WayFormer's performance is not very sensitive to τ. When τ = 0.1, the WayFormer with LiDAR inputs achieves the best performance. When τ increases further, the number of detected boxes becomes smaller, which may result in a loss of useful information.

Different embedding features. There are three additional features (Sec. IV-B) included in the embedding output from the LiDAR encoder: detected box coordinates, size and foreground probability. We mask out each feature and check the WayFormer model performance in Table IV. The experiments show that without box coordinates, the minADE, MR and mAP regress by 0.0299, 0.0302 and 0.0244 respectively. This indicates that, aside from the SWFormer embedding features, the box coordinates play an important role in motion forecasting. Compared to masking out box coordinates, masking out box sizes causes a smaller regression, with minADE and MR increased by 0.022 and 0.0184 and mAP decreased by 0.0183. Foreground probability also contributes slightly to the overall performance, with regressions in minADE, MR and mAP of 0.0048, 0.0039 and 0.0081 respectively.

Model                   minADE ↓   MR ↓     mAP ↑
No boxes coordinates    0.5852     0.1594   0.3947
No boxes sizes          0.5773     0.1476   0.4008
No foreground prob.     0.5601     0.1331   0.4110
Wayformer with LiDAR    0.5553     0.1292   0.4191
TABLE IV: Experiment results of masking out additional features in LiDAR encoding (Sec. IV-B). The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s, and 8s.

WayFormer modeling. We study the WayFormer hyperparameters in motion forecasting. Specifically, we conduct experiments to investigate the impact of the number of tokens and layers of the scene encoder. This is because the scene encoder provides encoded embeddings for the trajectory decoder in the prediction stage. The embedding quality plays an important role in the motion forecasting task. As shown in Table V, the number of embedding tokens impacts quality more than the number of scene encoder transformer layers. When we increase the token size from 16 to 192 (the default WayFormer setting), the minADE and MR decrease from 0.6011 to 0.5553 and from 0.1702 to 0.1292, respectively, and mAP increases from 0.3811 to 0.4191. This indicates that when the token size increases, more information is encoded in the embeddings for motion prediction.

We also vary the number of transformer blocks from 1 to 3 (Table V). The performance of the WayFormer model first improves (# layers increases from 1 to 2) and then regresses (# layers increases from 2 to 3). Thus, we set the number of layers of the scene encoder to 2.

# tokens of embeddings    minADE ↓   MR ↓     mAP ↑
16                        0.6011     0.1702   0.3811
32                        0.5888     0.1610   0.3907
64                        0.5797     0.1503   0.3998
192                       0.5553     0.1292   0.4191

# layers of transformer   minADE ↓   MR ↓     mAP ↑
1                         0.5711     0.1440   0.3991
2                         0.5553     0.1292   0.4191
3                         0.5561     0.1325   0.4112
TABLE V: Experiment results of scene encoder's #tokens and #transformer layers. The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s, and 8s.

E. Qualitative Results

We visualize the WayFormer prediction results on WOMD-LiDAR to check the quality of the motion forecasting. Please check the supplementary video for more visualization results.

Visualization of WayFormer prediction results. We visualize some prediction results and analyze the prediction quality. As shown in Fig. 4, with LiDAR inputs, the WayFormer model avoids collisions with vehicles, pedestrians and cyclists. Specifically, in Fig. 4(a), with LiDAR information, the predicted trajectories avoid crashing into parked cars. In Fig. 4(b), the predicted trajectories of cyclists avoid crashing into cars. We observe more reasonable predicted trajectories, matching the improved performance in Table II.

[Figure 4 panels: (a), (b)]
Fig. 4: Visualization of prediction result comparison between WayFormer [35] (sub-figures on the left) and WayFormer with LiDAR inputs (sub-figures on the right). Fig (a): With LiDAR information the predicted trajectories avoid crashing into parked cars. Fig (b): The predicted trajectories of cyclists avoid crashing into cars. Legends in the figure: Yellow and blue trajectories are predictions for different agents, while blue trajectories are highlighted ones. Red dotted lines are labeled ground truth trajectories for agents in the scene.

VI. CONCLUSION AND FUTURE DIRECTIONS

Conclusion. In this work, we augment WOMD with the largest scale LiDAR dataset in the community, containing LiDAR point clouds for more than 100,000 scenes. To resolve the huge data storage requirements, we adopt state-of-the-art LiDAR data compression technology and successfully reduce the dataset size to less than 2.5 TB. To evaluate the suitability of LiDAR for the motion forecasting task, we provide a WayFormer baseline trained with LiDAR. Experiments show that LiDAR data brings improvement in the motion forecasting task.

Limitations and future work. 1) In this work, we only trained WayFormer and WayFormer + LiDAR models. We will investigate end-to-end models that can directly encode LiDAR point clouds with the motion forecasting task in mind. 2) The SWFormer detector, which serves as the point cloud encoder in our model, can only represent object-level information. We will look into approaches that can leverage scene-level information and that are not sensitive to the detection prediction thresholds. 3) Another interesting direction is to explore methods that depend solely on the sensor data, avoiding the dependency on a human-defined object interface.

REFERENCES

[1] Jae-Kyun Ahn, Kyu-Yul Lee, Jae-Young Sim, and Chang-Su Kim. Large-scale 3d point cloud compression using adaptive radial distance prediction in hybrid coordinate domains. IEEE Journal of Selected Topics in Signal Processing, 2014.
[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV, 2019.
[4] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
[5] Sourav Biswas, Jerry Liu, Kelvin Wong, Shenlong Wang, and Raquel Urtasun. Muscle: Multi sweep compression of lidar using deep entropy models. NeurIPS, 2020.
[6] Julian Bock, Robert Krajewski, Tobias Moers, Steffen Runde, Lennart Vater, and Lutz Eckstein. The ind dataset: A drone dataset of naturalistic road user trajectories at german intersections. In Intelligent Vehicles Symposium (IV), 2020.
[7] Mario Botsch, Andreas Wiratanaya, and Leif Kobbelt. Efficient high quality rendering of point sampled geometry. Rendering Techniques, 2002.
[8] Antonia Breuer, Jan-Aike Termöhlen, Silviu Homoceanu, and Tim Fingscheidt. opendd: A large-scale roundabout drone dataset. In International Conference on Intelligent Transportation Systems (ITSC), 2020.
[9] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
[10] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021.
[11] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In ICRA, 2020.
[12] Sergio Casas, Wenjie Luo, and Raquel Urtasun. Intentnet: Learning to predict intention from raw sensor data. In CoRL, 2018.
[13] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRL, 2019.
[14] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019.
[15] Benjamin Coifman and Lizhe Li. A critical evaluation of the next generation simulation (ngsim) vehicle trajectory dataset. Transportation Research Part B: Methodological, 2017.
[16] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In ICRA, 2019.
[17] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In ICCV, 2021.
[18] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In CVPR, 2020.
[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[20] D Graziosi, O Nakagami, S Kuma, A Zaghetto, T Suzuki, and A Tabatabai. An overview of ongoing point cloud compression standardization activities: Video-based (v-pcc) and geometry-based (g-pcc). APSIPA Transactions on Signal and Information Processing, 2020.
[21] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
[22] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. …
[23] … Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[24] Hamidreza Houshiar and Andreas Nüchter. 3d point cloud compression using conventional image compression for efficient data transmission. In International Conference on Information, Communication and Automation Technologies (ICAT), 2015.
[25] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In CoRL, 2021.
[26] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. https://www.woven-planet.global/en/data/prediction-dataset, 2020.
[27] Lila Huang, Shenlong Wang, Kelvin Wong, Jerry Liu, and Raquel Urtasun. Octsqueeze: Octree-structured entropy model for lidar compression. In CVPR, 2020.
[28] Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, and Deva Ramanan. What-if motion prediction for autonomous driving. arXiv preprint arXiv:2008.10587, 2020.
[29] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In CVPR, 2017.
[30] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer graphics forum. Wiley Online Library, 2007.
[31] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. In ECCV, 2020.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[33] Andrey Malinin, Neil Band, German Chesnokov, Yarin Gal, Mark JF Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. arXiv preprint arXiv:2107.07455, 2021.
[34] Jean Mercat, Thomas Gilles, Nicole El Zoghby, Guillaume Sandou, Dominique Beauvois, and Guillermo Pita Gil. Multi-head attention for multi-modal joint vehicle motion forecasting. In ICRA, 2020.
[35] Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. arXiv preprint arXiv:2207.05844, 2022.
[36] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[37] Zizheng Que, Guo Lu, and Dong Xu. Voxelcontext-net: An octree based framework for point cloud compression. In CVPR, 2021.
[38] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. Precog: Prediction conditioned on goals in visual multi-agent settings. In CVPR, 2019.
[39] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.
[40] Ruwen Schnabel and Reinhard Klein. Octree-based point-cloud compression. PBG@SIGGRAPH, 2006.
[41] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. In NeurIPS, 2022.
[42] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
[43] Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In ECCV, 2022.
[44] Charlie Tang and Russ R Salakhutdinov. Multiple futures prediction. NeurIPS, 2019.
[45] Ekaterina Tolstaya, Reza Mahjourian, Carlton Downey, Balakrishnan Vadarajan, Benjamin Sapp, and Dragomir Anguelov. Identifying driver interactions via conditional behavior prediction. In ICRA, 2021.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
NeurIPS,2017. [47] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Axial attention in multidimensional transformers. arXiv preprint Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, arXiv:1912.12180,2019. [23] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, PeterCarr,andJamesHays. Argoverse2:Nextgenerationdatasetsforself- APPENDIX drivingperceptionandforecasting.InNeurIPSTrackonDatasetsand Benchmarks,2021. VII. SUPPLEMENTARYDATASETDETAILS [48] Maosheng Ye, Jiamiao Xu, Xunnong Xu, Tongyi Cao, and Qifeng Datasetcomparison.Weprovidemoredetailsofthedataset Chen. Dcms: Motion forecasting with dual consistency and multi- pseudo-targetsupervision. arXivpreprintarXiv:2204.05859,2022. comparison in Table VI. In Table I of the main paper, [49] WeiZhan,LitingSun,DiWang,HaojieShi,AubreyClausse,Maximil- “SamplingRate”isthedatacollectionrateinHz.“3DMaps” ianNaumann,JuliusKummerle,HendrikKonigshof,ChristophStiller, indicates whether the dataset provided the 3D Map infor- Arnaud de La Fortelle, et al. Interaction dataset: An international, adversarialandcooperativemotiondatasetininteractivedrivingsce- mation. “Dataset Size” entries were collected by Argoverse narioswithsemanticmaps. arXivpreprintarXiv:1910.03088,2019. 2 [47]. Combining Table VI and Table I of the main paper, [50] HangZhao,JiyangGao,TianLan,ChenSun,BenSapp,Balakrishnan Varadarajan,YueShen,YiShen,YuningChai,CordeliaSchmid,etal. we provide the complete comparison between our WOMD- Tnt:Target-driventrajectoryprediction. InCoRL,2021. LiDAR and other datasets. [51] Xuanyu Zhou, Charles R Qi, Yin Zhou, and Dragomir Anguelov. Riddle:Lidardatacompressionwithrangeimagedeepdeltaencoding. Supplementary Details of WOMD-LiDAR. Map data is InCVPR,2022. encoded as a set of polylines and polygons created from curves sampled at a resolution of 0.5 meters following [17], [18]. 
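As a rough illustration of what sampling a map curve at a fixed 0.5 meter resolution can look like, the sketch below resamples a 2D polyline at uniform arc-length spacing using linear interpolation. The function name `resample_polyline`, the 2D tuple representation, and the choice to always keep the final endpoint are assumptions made for this example; they are not taken from the WOMD-LiDAR release or from [17], [18].

```python
import math


def resample_polyline(points, spacing=0.5):
    """Resample a 2D polyline at uniform arc-length spacing (in meters).

    `points` is a list of (x, y) tuples. This is a hypothetical helper for
    illustration only; the dataset's own map sampling may differ.
    """
    if len(points) < 2:
        return list(points)

    resampled = [points[0]]
    dist_to_next = spacing  # arc length left before the next sample is due

    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already consumed along the current segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len  # seg_len > 0 whenever this loop body runs
            resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        dist_to_next -= seg_len - pos

    # Keep the original endpoint so the polyline still spans the full curve.
    if resampled[-1] != points[-1]:
        resampled.append(points[-1])
    return resampled


if __name__ == "__main__":
    lane = [(0.0, 0.0), (2.0, 0.0)]
    print(resample_polyline(lane, spacing=0.5))
```

With a straight 2 meter segment as input, the samples fall every 0.5 meters along the segment, including both endpoints. The last gap can be shorter than the spacing when the curve length is not a multiple of it.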
Traffic signal states are also provided along with other static map feature types (e.g., lane boundary lines, road edges and stop signs). We followed [17] to mine the interesting scenarios in our WOMD-LiDAR.

VIII. VISUALIZATION

Scenario Videos with LiDAR. In Fig. 5, we provide more visualizations of scenarios, showing not only the bounding boxes of the agents of interest but also the released high-quality, well-calibrated LiDAR data. We also provide some simulated scenes with both LiDAR and labeled boxes on WOMD-LiDAR as mov files in the supplementary materials. Each video clip contains 11 frames in slow motion, with the LiDAR data visualized together with the boxes of agents, because we only release the LiDAR data of the first 11 frames in WOMD-LiDAR.

WayFormer + LiDAR prediction visualization. We provide more visualization results in Fig. 6. In these results, our WayFormer [35] + LiDAR model tends to avoid collisions with other agents (vehicles, pedestrians and cyclists) in the motion forecasting task. This is consistent with the improved performance in Table II.

IX. EXPERIMENTS

Ablation Study of LiDAR Encoder. We provide the ablation study of the LiDAR encoder described in Section 4.2 of the submission. Specifically, we study the number of output tokens M and the number of transformer layers. Experiment results are shown in Table VII. As M increases, the final performance of WayFormer first improves and then degrades: among the tested values (5, 10 and 20), M = 10 gives the best minADE, MR and mAP, so we set M to 10 in our experiments. On the other hand, the LiDAR encoder is not very sensitive to the number of transformer layers; there is only a slight regression when the number of layers increases. To achieve the best performance and fast training speed, we set the number of layers to 1 in our experiments.

Fig. 5: Scenario visualizations with LiDAR. Better viewed in color and zoom in for more details.
INTERACTION    Woven Planet    Shifts    Argoverse 2    nuScenes    WOMD-LiDAR
Offboard Perception          ✓ ✓
Mined for Interestingness    - - - ✓ - ✓
Traffic Signal States        ✓ ✓ ✓

TABLE VI: Comparison of the popular behavior prediction and motion forecasting datasets. “-” indicates that the data is not available or not applicable. “Offboard perception” is checked if the labels were auto-labeled by offboard perception which can generate high-quality labels. “Mined for Interestingness” is checked if the dataset mined interesting interactions after the data collection. “Traffic Signal States” is checked if the dataset provided traffic light states.

# output tokens (M)    minADE ↓    MR ↓      mAP ↑
5                      0.5700      0.1501    0.3999
10                     0.5553      0.1292    0.4191
20                     0.5594      0.1313    0.4102

# layers               minADE ↓    MR ↓      mAP ↑
1                      0.5553      0.1292    0.4191
2                      0.5613      0.1392    0.3998
3                      0.5610      0.1398    0.4001

TABLE VII: Experiment results of the number of output tokens M and the number of transformer layers in the LiDAR encoder. The metrics are evaluated on WOMD-LiDAR validation set, averaged across categories, and over results at 3s, 5s, and 8s.

(a) (b) (c) (d) (e) (f) (g) (h)

Fig. 6: Visualization of prediction result comparison between WayFormer [35] (sub-figures on the left) and WayFormer with LiDAR inputs (sub-figures on the right). Legends in the figure: Yellow and blue trajectories are predictions for different agents, while blue trajectories are highlighted ones. Red dotted lines are labeled ground truth trajectories for agents in the scene. More visualization results are available in the supplementary material. Better viewed in color and zoom in for more details.
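To make the minADE and MR columns of Table VII more concrete, the following is a minimal sketch of how these two quantities are commonly computed for a single agent with several predicted trajectories. It is a simplification under stated assumptions: the function names are hypothetical, trajectories are plain lists of (x, y) points, and a miss is decided by a fixed final-displacement radius (the `threshold` parameter, chosen here only for illustration). The official WOMD metrics [17] may use more detailed matching rules, so this should not be read as the benchmark's evaluation code.

```python
import math


def min_ade(predictions, ground_truth):
    """Minimum average displacement error over K predicted trajectories.

    `predictions` is a list of K trajectories; each trajectory and
    `ground_truth` are equal-length lists of (x, y) points.
    """
    def ade(trajectory):
        total = sum(math.dist(p, g) for p, g in zip(trajectory, ground_truth))
        return total / len(ground_truth)

    return min(ade(trajectory) for trajectory in predictions)


def is_miss(predictions, ground_truth, threshold=2.0):
    """Simplified miss test (assumption): no prediction ends within
    `threshold` meters of the ground-truth final position."""
    final_gt = ground_truth[-1]
    return all(math.dist(traj[-1], final_gt) > threshold for traj in predictions)


def miss_rate(samples, threshold=2.0):
    """Fraction of (predictions, ground_truth) pairs that count as misses."""
    misses = sum(is_miss(preds, gt, threshold) for preds, gt in samples)
    return misses / len(samples)


if __name__ == "__main__":
    gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    preds = [
        [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset
        [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)],  # only the last point is off
    ]
    print(min_ade(preds, gt))
    print(miss_rate([(preds, gt)], threshold=0.4))
```

Lower values are better for both quantities, which matches the ↓ markers in Table VII. mAP is omitted here because it depends on confidence scores and trajectory bucketing that this sketch does not model.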