Scalable Scene Flow from Point Clouds in the Real World
Philipp Jund ∗,1, Chris Sweeney ∗,2, Nichola Abdo 2, Zhifeng Chen 1, Jonathon Shlens 1
1 Google Brain, 2 Waymo
{abdon,cjsweeney}@waymo.com
∗ Denotes equal contributions.

Abstract— Autonomous vehicles operate in highly dynamic environments necessitating an accurate assessment of which aspects of a scene are moving and where they are moving to. A popular approach to 3D motion estimation, termed scene flow, is to employ 3D point cloud data from consecutive LiDAR scans, although such approaches have been limited by the small size of real-world, annotated LiDAR data. In this work, we introduce a new large-scale dataset for scene flow estimation derived from corresponding tracked 3D objects, which is ∼1,000× larger than previous real-world datasets in terms of the number of annotated frames. We demonstrate how previous works were bounded based on the amount of real LiDAR data available, suggesting that larger datasets are required to achieve state-of-the-art predictive performance. Furthermore, we show how previous heuristics for operating on point clouds such as down-sampling heavily degrade performance, motivating a new class of models that are tractable on the full point cloud. To address this issue, we introduce the FastFlow3D architecture which provides real time inference on the full point cloud. Additionally, we design human-interpretable metrics that better capture real world aspects by accounting for ego-motion and providing breakdowns per object type. We hope that this dataset may provide new opportunities for developing real world scene flow systems.

I. INTRODUCTION

Motion is a prominent cue that enables humans to navigate complex environments [1]. Likewise, understanding and predicting the 3D motion field of a scene – termed the scene flow – provides an important signal to enable autonomous vehicles (AVs) to understand and navigate highly dynamic environments [2]. Accurate scene flow prediction enables an AV to identify potential obstacles, estimate the trajectories of objects [3], [4], and aid downstream tasks such as detection, segmentation and tracking [5], [6].

Recently, approaches that learn models to estimate scene flow from LiDAR have demonstrated the potential for LiDAR-based motion estimation, outperforming camera-based methods [7], [8], [9]. Such models take two consecutive point clouds as input and estimate the scene flow as a set of 3D vectors, which transform the points from the first point cloud to best match the second point cloud. One of the most prominent benefits of this approach is that it avoids the additional burden of estimating the depth of sensor readings as is required in camera-based approaches [6]. Unfortunately, for LiDAR based data, ground truth motion vectors are ill-defined and not tenable because no correspondence exists between LiDAR returns from subsequent time points. Instead, one must rely on semi-supervised methods that employ auxiliary information to make strong inferences about the motion signal in order to bootstrap annotation labels [10], [11]. Such an approach suffers from the fact that motion annotations are extremely limited (e.g. 400 frames in [10], [11]) and often rely on pretraining a model based on synthetic data [12] which exhibit distinct noise and sensor properties from real data. Furthermore, previous datasets cover a smaller area, e.g., the KITTI scene flow dataset covers 1/5th the area of our proposed dataset.
This smaller coverage allows for different subsampling tradeoffs and has inspired a class of models that are not able to tractably scale training and inference beyond ∼10K points [7], [8], [9], [13], [14], making such models impractical in real world AV scenes which often contain 100K - 1000K points.

In this work, we address these shortcomings by introducing a large scale dataset for scene flow geared towards AVs. We derive per-point labels for motion estimation by bootstrapping from tracked objects densely annotated in a scene from a recently released large scale AV dataset [15]. The resulting scene flow dataset contains 198K frames of motion estimation annotations. This amounts to a training set roughly ∼1,000× larger than the largest, commonly used real world dataset (200 frames) for scene flow [10], [11]. By working with a large scale dataset for scene flow, we identify several indications that the problem is quite distinct from current approaches:
• Learned models for scene flow are heavily bounded by the amount of data.
• Heuristics for operating on point clouds (e.g. down-sampling) heavily degrade predictive performance. This observation motivates the development of a new class of models tractable on a full point cloud scene.
• Previous evaluation metrics ignore notable systematic biases across classes of objects (e.g. predicting pedestrian versus vehicle motion).

We discuss each of these points in turn as we investigate working with this new dataset. Recognizing the limitations of previous works, we develop a new baseline model architecture, FastFlow3D, that is tractable on the complete point cloud with the ability to run in real time (i.e. < 100 ms) on an AV. Figure 1 shows scene flow predictions from FastFlow3D, trained on our scene flow dataset. Finally, we identify and characterize an under-appreciated problem in semi-supervised learning based on the ability to predict the motion of unlabeled objects. We suspect the degree to which the fields of semi-supervised learning attack this problem may have strong implications for the real-world application of scene flow in AVs. We hope that the resulting dataset presented in this paper may open the opportunity for qualitatively new forms of learned scene flow models.

Fig. 1: LiDAR scene flow estimation for autonomous vehicles. Left: Overlay of two consecutive point clouds (green and blue, respectively) sampled at 10 Hz from the Waymo Open Dataset [15]. White boxes are tracked 3D bounding boxes for human annotated vehicles and pedestrians. Middle: Predicted scene flow for each point colored by direction, and brightened by speed based on overlaid frames†. Right: Two qualitative examples of bootstrapped annotations.
† Please note that we predict 3D flow, but color the direction of flow with respect to the x-y plane for the visualization.

II. RELATED WORK

A. Benchmarks for scene flow estimation

Early datasets focused on the related problems of inferring depth from a single image [16] or stereo pairs of images [17], [18]. Previous datasets for estimating optical flow were small and largely based on synthetic imagery [19], [20], [21], [22]. Subsequent datasets focused on 2D motion estimation in movies or sequences of images [23]. The KITTI Scene Flow dataset represented a huge step forward, providing non-synthetic imagery paired with accurate ground truth estimates. However, it contained only 200 scenes for training and involved preprocessing steps that alter real-world characteristics [10].
FlyingThings3D offered a modern large-scale synthetic dataset comprising ∼20K frames of high resolution data from which scene flow may be bootstrapped [12]. Internal datasets by [24], [25] are constructed similarly to ours, but are not publicly available and do not offer a detailed description. Recently, [26] created two scene flow datasets in a similar fashion, subsampling 2,691 and 1,513 training scenes from the Argoverse [27] and nuScenes [28] datasets, respectively. However, even without subsampling, larger datasets in terms of scenes and number of points are needed to train more accurate scene flow models as shown in [26] and Figure 4.

B. Datasets for tracking in AVs

Recently, there have been several works introducing large-scale datasets for autonomous vehicle applications [11], [27], [28], [29], [15]. While these datasets do not directly provide scene flow labels, they provide vehicle localization data, as well as raw LiDAR data and bounding box annotations for perceived tracklets. These recent datasets offer an opportunity to propose a methodology to construct point-wise flow annotations from such data (Section III-B).

We extend the Waymo Open Dataset (WOD) to construct a large-scale scene flow benchmark for dense point clouds [15]. We select the Waymo Open Dataset because the bounding box annotations are at a higher acquisition frame rate (10 Hz) than competing datasets (e.g. 2 Hz in [28]) and contain ∼5× the number of returns per LiDAR frame (Table 1, [15]). In addition, the Waymo Open Dataset also provides ∼10× more scenes and annotated LiDAR frames than Argoverse [27]. Recently, [29] released a large-scale dataset with 1,000+ hours of driving data. However, their tracked object annotations are not human-annotated but based on the results of the onboard perception system.

C. Models for learning scene flow

There is a rich literature of building learned models for scene flow using end-to-end learned architectures [30], [31], [8], [7], [14], [13], [32], [25] as well as hybrid architectures [33], [34], [35]. We discuss these in Section V in conjunction with building a scalable baseline model that operates in real time. Recently, Lee et al. presented an approach for predicting pillar-level flow [25]. Whereas our model leverages a similar pillar-based architecture, we tackle the full scope of the scene flow problem and predict point-level flow while being tractable enough for real-time applications.

Moreover, many previous works train models on synthetic datasets like FlyingThings3D [12] and evaluate and/or fine-tune on KITTI Scene Flow [11], [10]. Typically, these models are limited in their ability to leverage synthetic data in training. This observation is in line with the robotics literature and highlights the challenges of generalization from the simulated to the real world [36], [37], [38], [39].

III. CONSTRUCTING A SCENE FLOW DATASET

In this section, we present an approach for generating scene flow annotations bootstrapped from existing labeled datasets. We first formalize the scene flow problem definition. We then detail our method for computing per-point flow vectors by leveraging the motion of 3D object label boxes. We emphasize that many details abound in the assumptions behind such annotations, how to calculate various transformations in the track labels, as well as how to handle important edge cases.
A. Problem definition

We consider the problem of estimating 3D scene flow in settings where the scene at time ti is represented as a point cloud Pi as measured by a LiDAR sensor mounted on the AV. Specifically, we define scene flow as the collection of 3D motion vectors f := (vx, vy, vz)⊤ for each point in the scene, where vd is the velocity in direction d in m/s. Following the scene flow literature, we predict flow given two consecutive point clouds of the scene, P−1 and P0. The scene flow encodes the motion between the previous and current time steps, t−1 and t0, respectively. We predict the scene flow at the current time step, P0, in order to make the predictions practical for real time operation.

B. From tracked boxes to flow annotations

Obtaining ground truth scene flow from standard real-world LiDAR data is a challenging task. One challenge is the lack of point-wise correspondences between subsequent LiDAR frames. Manual annotation is too expensive and humans must contend with ambiguity due to changes in viewpoint and partial occlusions. Therefore, we focus on a scalable automated approach bootstrapped from existing labeled, tracked objects in LiDAR data sequences.

The annotation procedure is straightforward. We assume that labeled objects are rigid and calculate point velocities using a secant line approximation. For each point p0 at time t0, we compute the flow annotation as f = (p0 − p−1) / ∆t, where ∆t = t0 − t−1, p−1 = T∆ p0 is the corresponding point at t−1, and T∆ is a homogeneous transformation inferred from the track labels of the object to which the point belongs (if there is no label at t−1, the flow is annotated as invalid). This captures how a moving object may have varying per-point flow magnitudes and directions. Though our rigidity assumption does not necessarily apply to non-rigid objects (e.g. pedestrians), the high frame rate (10 Hz) minimizes non-rigid deformations between adjacent frames.

In order to calculate the transformation T∆, we compensate for the ego motion of the AV because this leads to superior predictive performance since a learned model does not need to additionally infer the AV's motion (most AVs are equipped with an IMU/GPS system to provide such information). Furthermore, compensating for ego motion improves the interpretability of the evaluation metrics (Section IV) since the predictions are now independent of the AV motion.

We use this approach to compute the flow vectors for all points in P0 belonging to labeled objects. Points outside the labeled objects are assigned a flow of 0 m/s. This stationary assumption works well in practice, but has a notable gap when considering unlabeled moving objects in the scene. See Section VI-C for an in-depth analysis of how our model can generalize to unlabeled moving objects.

In this work, we apply this methodology to the Waymo Open Dataset [15]. The dataset offers a large scale with diverse LiDAR scenes where objects have been manually and accurately annotated with 3D boxes at 10 Hz. Finally, the accurate AV pose information permits compensating for ego motion. We note that the method for scene flow annotation is general and may be used to estimate 3D flow vectors from the label box poses available in other datasets [28], [29], [15], [40].
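To make the annotation procedure concrete, the following is a minimal NumPy sketch of the secant-line computation described above for a single rigid, tracked object. It assumes the two box poses have already been expressed in the AV frame at t0 (i.e. ego motion compensated); the function and argument names are illustrative and do not correspond to the released tooling.

```python
import numpy as np

def bootstrap_flow(points_t0, T_obj_t0, T_obj_tm1_in_t0, dt=0.1):
    """Per-point flow labels for one rigid, tracked object.

    points_t0:       (N, 3) points at t0 inside the object's 3D label box, in the AV frame at t0.
    T_obj_t0:        (4, 4) pose of the label box at t0, in the AV frame at t0.
    T_obj_tm1_in_t0: (4, 4) pose of the label box at t-1, already transformed into the AV frame
                     at t0 (ego motion compensated).
    dt:              time between frames, 0.1 s at a 10 Hz acquisition rate.

    Returns (N, 3) flow vectors in m/s.
    """
    # Rigid transform mapping a point observed at t0 back to its position at t-1.
    T_delta = T_obj_tm1_in_t0 @ np.linalg.inv(T_obj_t0)

    # Apply the homogeneous transform to every point.
    homog = np.concatenate([points_t0, np.ones((len(points_t0), 1))], axis=1)
    points_tm1 = (homog @ T_delta.T)[:, :3]

    # Secant-line velocity estimate: displacement over the frame interval.
    return (points_t0 - points_tm1) / dt
```

Points that fall outside every label box receive a flow of 0 m/s, and points on objects with no label at t−1 are marked invalid, as described above.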
IV. EVALUATION METRICS FOR SCENE FLOW

Two common metrics used for 3D scene flow are the mean L2 error of pointwise flow and the percentage of predictions with L2 error below a given threshold [7], [24]. In this work, we additionally propose modifications to improve the interpretability of the results.

Breakdown by object type. Objects within the AV scene (e.g. vehicles, pedestrians) have different speed distributions dictated by the object class (Section VI-A). This becomes especially apparent after accounting for ego motion. Reporting a single error ignores these systematic differences. In practice, we find it more meaningful to report all prediction performances delineated by the object label.

Binary classification formulation. One important practical application of predicting scene flow is enabling an AV to distinguish between moving and stationary parts of the scene. In that spirit, we formulate a second set of metrics that represent a "lower bar" which captures a useful rudimentary signal. We employ this metric exclusively for the more difficult task of semi-supervised learning (Section VI-C) where learning is more challenging. In particular, we assign a binary label to each reflection as either moving or stationary based on a threshold, |f| ≥ fmin. Accordingly, we compute precision and recall metrics for these binary labels across an entire scene. Selecting a threshold, fmin, is not straightforward as there is an ambiguous range between very slow and stationary objects. For simplicity, we select a conservative threshold of fmin = 0.5 m/s (1.1 mph) to assure that things labeled as moving are actually moving.
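A minimal sketch of the metrics above, assuming integer per-point class ids and a validity mask for points without a t−1 label; the function name and label encoding are assumptions for illustration only.

```python
import numpy as np

def scene_flow_metrics(pred, gt, labels, valid, f_min=0.5):
    """Per-class mean L2 error, threshold accuracies, and the binary
    moving/stationary precision and recall described in Section IV.

    pred, gt: (N, 3) predicted and annotated flow in m/s.
    labels:   (N,) integer object class per point.
    valid:    (N,) boolean mask; invalid points are excluded.
    """
    err = np.linalg.norm(pred - gt, axis=1)
    per_class = {}
    for cls in np.unique(labels[valid]):
        m = valid & (labels == cls)
        per_class[int(cls)] = {
            'mean_l2': float(err[m].mean()),
            'acc_0.1': float((err[m] <= 0.1).mean()),
            'acc_1.0': float((err[m] <= 1.0).mean()),
        }
    # Binary moving/stationary formulation with the conservative 0.5 m/s threshold.
    gt_moving = np.linalg.norm(gt[valid], axis=1) >= f_min
    pred_moving = np.linalg.norm(pred[valid], axis=1) >= f_min
    tp = np.sum(pred_moving & gt_moving)
    precision = tp / max(pred_moving.sum(), 1)
    recall = tp / max(gt_moving.sum(), 1)
    return per_class, float(precision), float(recall)
```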
V. FASTFLOW3D: A SCALABLE BASELINE MODEL

The average scene from the Waymo Open Dataset consists of 177K points (Table II), even though most models [7], [8], [9], [13], [14] were designed to train with 8,192 points (16,384 points in [14]). This design choice favors algorithms that scale poorly to O(100K) regimes. For instance, many methods require preprocessing techniques such as nearest neighbor lookup. Even with efficient implementations [46], [47], increasing fractions of inference time are dedicated to preprocessing instead of the core inference operation. For this reason, we propose a new model that exhibits favorable scaling properties and may operate on O(100K) points in a real time system. We name this model FastFlow3D (FF3D). In particular, we exploit the fact that LiDAR point clouds are dense, relatively flat along the z dimension, but cover a large area along the x and y dimensions. The proposed model is composed of three parts: a scene encoder, a decoder fusing contextual information from both frames, and a subsequent decoder to obtain point-wise flow (Figure 2).

Fig. 2: Diagram of the FastFlow3D model. FastFlow3D consists of 3 stages employing a PointNet encoder with dynamic voxelization [41], [42], a convolutional autoencoder [43], [44] with weights shared across two frames, and a shared MLP to regress an embedding on to a point-wise motion prediction. For details, see Section V and the Appendix in [45].

FastFlow3D operates on two successive point clouds where the first cloud has been transformed into the coordinate frame of the second. The target annotations are correspondingly provided in the coordinate frame of the second frame. The result of these transformations is to remove apparent motion due to the movement of the AV (Section III-B). We train the resulting model with the average L2 loss between the final prediction for each LiDAR return and the corresponding ground truth flow annotation [8], [7], [9].

The encoder computes embeddings at different spatial resolutions for both point clouds. The encoder is a variant of PointPillars [44] and offers a great trade-off in terms of latency and accuracy by aggregating points within fixed vertical columns (i.e. "pillars") followed by a 2D convolutional network to decrease the spatial resolution. Each pillar is parameterized through its center coordinate (cx, cy, cz). We compute the offset from the pillar center to the points in the pillar (∆x, ∆y, ∆z), and append the pillar center and laser features (l0, l1), resulting in an 8D encoding (cx, cy, cz, ∆x, ∆y, ∆z, l0, l1). Additionally, we employ dynamic voxelization [42], computing a linear transformation and aggregating all points within a pillar instead of subsampling points. Furthermore, we find that summing the featurized points in the pillar outperforms the max-pooling operation used in previous works [44], [42].

One can draw an analogy of our pillar-based point featurization to more computationally expensive sampling techniques used by previous works [7], [8]. Instead of choosing representative sampled points based on expensive farthest point sampling and computing features relative to these points, we use a fixed grid to sample the points and compute features relative to each pillar in the grid. The pillar-based representation allows our network to cover a larger area with an increased density of points.

The decoder is a 2D convolutional U-Net [43]. First, we concatenate the embeddings of both encoders at each spatial resolution. Subsequently, we use a 2D convolution to obtain contextual information at the different resolutions. These context embeddings are used as the skip connections for the U-Net, which progressively merges context from consecutive resolutions. To decrease latency, we introduce bottleneck convolutions and replace deconvolution operations (i.e. transposed convolutions) with bilinear upsampling [48]. The resulting feature map of the U-Net decoder represents a grid-structured flow embedding. To obtain point-wise flow, we introduce the unpillar operation, which for each point retrieves the corresponding flow embedding grid cell, concatenates the point feature, and uses a multilayer perceptron to compute the flow vector.
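The following is a minimal NumPy sketch of the pillar featurization with dynamic voxelization and of the unpillar step described above, for a single frame. The 170 m × 170 m, 512 × 512 grid follows the values reported in the appendix; a single linear layer with ReLU stands in for the shared PointNet featurizer, and the `mlp` argument is a placeholder callable, so this is an illustration of the data flow rather than the authors' implementation.

```python
import numpy as np

def pillarize(points, laser_features, W, grid=512, grid_range=85.0):
    """Dynamic voxelization: featurize every point and scatter-sum into pillars.

    points:         (N, 3) xyz in the AV frame at t0.
    laser_features: (N, 2) laser features (l0, l1).
    W:              (8, C) weights of the shared linear point featurizer.
    Returns the (grid, grid, C) pillar feature map, the per-point features, and pillar indices.
    """
    pillar_size = 2 * grid_range / grid                      # ~0.33 m for a 170 m x 170 m grid
    ij = np.floor((points[:, :2] + grid_range) / pillar_size).astype(int)
    ij = np.clip(ij, 0, grid - 1)
    centers_xy = (ij + 0.5) * pillar_size - grid_range
    centers = np.concatenate([centers_xy, np.zeros((len(points), 1))], axis=1)  # cz taken as 0 here
    # 8D per-point encoding: pillar center, offset to the center, and laser features.
    encoding = np.concatenate([centers, points - centers, laser_features], axis=1)
    point_feats = np.maximum(encoding @ W, 0.0)              # shared linear layer + ReLU
    pillar_map = np.zeros((grid, grid, point_feats.shape[1]))
    np.add.at(pillar_map, (ij[:, 0], ij[:, 1]), point_feats)  # sum pooling over each pillar
    return pillar_map, point_feats, ij

def unpillar(flow_embedding, point_feats, ij, mlp):
    """Gather each point's flow embedding cell, concatenate its point feature, regress flow."""
    gathered = flow_embedding[ij[:, 0], ij[:, 1]]
    return mlp(np.concatenate([gathered, point_feats], axis=1))
```

Because every point is scattered into its pillar (rather than a fixed number of points being sampled per pillar), no points are dropped, which is the property that lets the model consume the full point cloud.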
As proof of concept, we showcase how the resulting architecture achieves favorable scaling behavior up to and beyond the number of laser returns in the Waymo Open Dataset (Table I). Note that we measure performance up to 1M points in order to accommodate multi-frame perception models which operate on point clouds from multiple time frames concatenated together [49]1. As mentioned earlier, previously proposed baseline models rely on nearest neighbor search for pre-processing, and even with an efficient implementation [47], [46] result in poor scaling behavior (see Section VI-B for details), making it prohibitively expensive to train and run these models on large, realistic datasets like the Waymo Open Dataset2. In contrast, our baseline model exhibits nearly linear growth with a small constant. Furthermore, the typical LiDAR scan rate is 10 Hz (i.e. a period of 100 ms), and the latency of operating on 1M points is such that predictions may finish within the period of the scan as is required for real-time operation.

1 Many unpublished efforts employ multiple frames as detailed at https://waymo.com/open/challenges
2 In Section VI-B we demonstrate that downsampling the point cloud severely degrades predictive performance, further motivating architectures that can natively operate on the entire point cloud in real time.

                    32K     100K    255K    1000K
HPLFlowNet [9]      431.1   1194.5  OOM     OOM
FlowNet3D [7]       205.2   520.7   1116.4  3819.0
FastFlow3D (ours)    49.3    51.9    63.1    98.1
TABLE I: Inference latency for varying point cloud sizes. All numbers report latency in ms on an NVIDIA Tesla P100 with batch size = 1. The timings for HPLFlowNet [9] differ from reported results as we include the required preprocessing on the raw point clouds. OOM indicates out of memory.
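A sketch of a timing harness consistent with the protocol used for Table I and detailed in the appendix (90 timed forward passes after 10 warm-up runs, batch size 1, preprocessing included). The `infer_fn` argument and the random point clouds are placeholders; measured numbers also depend on the accelerator (a Tesla P100 in Table I) and on synchronizing any asynchronous device execution.

```python
import time
import numpy as np

def measure_latency_ms(infer_fn, num_points, warmup=10, runs=90):
    """Average wall-clock latency in ms of infer_fn on two point clouds of num_points each,
    including any preprocessing infer_fn performs on the raw points."""
    cloud_t0 = np.random.uniform(-85.0, 85.0, size=(num_points, 3))
    cloud_tm1 = np.random.uniform(-85.0, 85.0, size=(num_points, 3))
    for _ in range(warmup):
        infer_fn(cloud_tm1, cloud_t0)
    start = time.perf_counter()
    for _ in range(runs):
        infer_fn(cloud_tm1, cloud_t0)
    return 1000.0 * (time.perf_counter() - start) / runs

# for n in (32_000, 100_000, 255_000, 1_000_000):
#     print(n, measure_latency_ms(my_model_infer, n))   # my_model_infer is hypothetical
```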
VI. RESULTS

We first present results describing the generated scene flow dataset and discuss how it compares to established baselines for scene flow in the literature (Section VI-A). In the process, we discuss dataset statistics and how this affects our selection of evaluation metrics. Next, in Section VI-B we present the FastFlow3D baseline architecture trained on the resulting dataset. We showcase with this model the necessity of training with the full density of point cloud returns as well as the complete dataset. These results highlight deficiencies in previous approaches which employed too few data or employed sub-sampled points for real-time inference. Finally, in Section VI-C we discuss an extension to this work in which we examine the generalization power of the model and highlight an open challenge in the application of self-supervised and semi-supervised learning techniques.

A. A large-scale dataset for scene flow

The Waymo Open Dataset provides an accurate source of tracked 3D objects and an opportunity for deriving a large-scale scene flow dataset across a diverse and rich domain [15]. As previously discussed, scene flow ground truth does not exist in real-world point cloud datasets based on standard time-of-flight LiDAR because no correspondences exist between points from subsequent frames.

To generate a reasonable set of scene flow labels, we leveraged the human annotated tracked 3D objects from the Waymo Open Dataset [15]. Following the methodology in Section III-B, we derived a supervised label (vx, vy, vz) for each point in the scene across time. Figure 1 (right) highlights some qualitative examples of the resulting annotation of scene flow using this methodology. In the selected frames, we highlight the diversity of the scene and difficulty of the resulting bootstrapped annotations. Namely, we observe the challenges of working with real LiDAR data including the noise inherent in the sensor reading, the prevalence of occlusions and variation in object speed. All of these qualities result in a challenging predictive task.

The dataset comprises 800 and 200 scenes, termed run segments, for training and validation, respectively. Each run segment is 20 seconds recorded at 10 Hz [15]. Hence, the training and validation splits contain 158,081 and 39,987 frames.3 The total dataset comprises 24.3B and 6.1B LiDAR returns in each split, respectively. Table II indicates that the resulting dataset is orders of magnitude larger than the standard KITTI scene flow dataset [11], [10] and even surpasses the large-scale synthetic dataset FlyingThings3D [12] often used for pretraining.
3 Please see the Appendix in [45] for more details on downloading and accessing this new dataset.

                   KITTI       FlyingThings3D   Ours
Data               LiDAR       Synth.           LiDAR
Label              Semi-Sup.   Truth            Super.
Scenes             22          –                1150
# LiDAR Frames     200 ‡       28K              198K
Avg Points/Frame   208K        220K †           177K
TABLE II: Comparison of popular datasets for scene flow estimation. [11] is computed through a semi-supervised procedure [10]. [12] is computed from a depth map based on a geometric procedure [7]. # LiDAR frames counts annotated LiDAR frames for training and validation. ‡ indicates that only 400 frames of the KITTI dataset were annotated for scene flow (200 available for training). † indicates the average number of points with distance from the camera ≤ 35.

Figure 3 provides a summary of the scene flow constructed from the Waymo Open Dataset. Across 7,029,178 objects labeled across all frames,4 we find that ∼64.8% of the points within pedestrians, cyclists and vehicles are stationary. This summary statistic belies a large amount of systematic variability across object class. For instance, the majority of points within vehicles (68.0%) are parked and stationary, whereas the majority of points within pedestrians (73.7%) and cyclists (84.7%) are actively moving. The motion signature of each class of labeled object becomes even more distinct when examining the distribution of moving objects (Figure 3, bottom). Note that the average speed of moving points corresponding to pedestrians (1.3 m/s or 2.9 mph), cyclists (3.8 m/s or 8.5 mph) and vehicles (5.6 m/s or 12.5 mph) vary significantly. This variability of motion across object types emphasizes our selection of evaluation metrics that consider the prediction of each class separately.
4 A single instance of an object may be tracked across N frames. We count a single instance as N labeled objects.

Fig. 3: Distribution of moving and stationary LiDAR points. Statistics computed from the training set split. Top: Distribution of moving and stationary points across all frames (raw counts in parentheses): vehicles 32.0% moving (843.5M) / 68.0% stationary (1,790.0M); pedestrians 73.7% moving (146.9M) / 26.3% stationary (52.4M); cyclists 84.7% moving (7.0M) / 15.2% stationary (1.6M). We consider points with a flow magnitude below 0.1 m/s to be stationary. Bottom: Distribution of speeds for moving points.
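A short sketch of how the Figure 3 style breakdown can be computed from the bootstrapped annotations, using the 0.1 m/s stationary threshold stated in the caption; the integer class encoding is an assumption for illustration.

```python
import numpy as np

def class_motion_statistics(flow, labels, class_names, stationary_thresh=0.1):
    """Per class: fraction of moving points and mean speed of the moving points.

    flow:        (N, 3) bootstrapped flow annotations in m/s.
    labels:      (N,) integer class id per point.
    class_names: dict mapping class id -> name, e.g. {1: 'vehicle', 2: 'pedestrian', 4: 'cyclist'}.
    """
    speed = np.linalg.norm(flow, axis=1)
    stats = {}
    for cls, name in class_names.items():
        m = labels == cls
        moving = speed[m] >= stationary_thresh      # below 0.1 m/s counts as stationary
        stats[name] = {
            'fraction_moving': float(moving.mean()) if m.any() else 0.0,
            'mean_moving_speed': float(speed[m][moving].mean()) if moving.any() else 0.0,
        }
    return stats
```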
B. A scalable model baseline for scene flow

We train the FastFlow3D architecture on the scene flow data. Briefly, the architecture consists of 3 stages employing established techniques: (1) a PointNet encoder with dynamic voxelization [42], [41], (2) a convolutional autoencoder with skip connections [43] in which the first half of the architecture [44] consists of shared weights across two frames, and (3) a shared MLP to regress an embedding on to a point-wise motion prediction.

The resulting model contains 5.23M parameters, a vast majority of which reside in the convolution architecture (4.21M). A small number of parameters (544) are dedicated to featurizing each point cloud point [41] as well as performing the final regression on to the motion flow (4,483). These latter sets of parameters are purposefully small in order to effectively constrain computational cost because they are applied across all N points in a point cloud.

We evaluate the resulting model on the cross-validated split using the aforementioned metrics across an array of experimental studies to justify the motivation for this dataset as well as demonstrate the difficulty of the prediction task.

We first approach the question of what the appropriate dataset size is given the prediction task. Figure 4 provides an ablation study in which we systematically subsample the number of run segments employed for training.5 We observe that predictive performance improves significantly as the model is trained on increasing numbers of run segments. We find that cyclists trace out a curve quite distinct from pedestrians and vehicles, possibly indicative of the small number of cyclists in a scene (Figure 3). Secondly, we observe that the cross validated accuracy is far from saturating behavior when approximating the amount of data available in the KITTI scene flow dataset [11], [10] (Figure 4, stars). We observe that even with the complete dataset, our metrics do not appear to exhibit asymptotic behavior, indicating that models trained on the Waymo Open Dataset may still be data bound. This result parallels detection performance reported in the original results (Table 10 in [15]).
5 We subsample the number of run segments and not frames because subsequent frames within a single run segment are heavily correlated.

Fig. 4: Accuracy of scene flow estimation is bounded by the amount of data. Each point corresponds to the cross validated accuracy of a model trained on increasing amounts of data (see text). The y-axis reports the fraction of LiDAR returns contained within moving objects whose motion vector is estimated within 0.1 m/s (top) or 1.0 m/s (bottom) L2 error. Higher numbers are better. The star indicates a model trained on the number of run segments in [11], [10].

We next investigate how scene flow prediction is affected by the density of the point cloud scene. This question is important because many baseline models purposefully operate on a smaller number of points (Table I) and by necessity must heavily sub-sample the number of points in order to perform inference in real time. For stationary objects, we observe minimal detriment in performance (data not shown). This result is not surprising given that the vast majority of LiDAR returns arise from stationary, background objects (e.g. buildings, roads). However, we do observe that training on sparse versions of the original point cloud severely degrades predictive performance on moving objects (Figure 5). Notably, moving pedestrian and vehicle performance appear to be saturating, indicating that if additional LiDAR returns were available, they would have minimal additional benefit in terms of predictive performance.

Fig. 5: Accuracy of scene flow estimation requires the full density of the point cloud scene. Each point corresponds to the cross validated accuracy of a model trained on an increasing density of point cloud points. The y-axis reports the fraction of LiDAR returns contained within moving vehicles, pedestrians and cyclists whose motion vector is correctly estimated within 0.1 m/s (top) and 1.0 m/s (bottom) L2 error.
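One plausible way to construct the subsets for the two ablations above is sketched below; the paper does not specify the exact sampling mechanics (e.g. whether subsets are nested), so the functions and their behavior are assumptions for illustration.

```python
import numpy as np

def subsample_run_segments(run_segment_ids, fraction, seed=0):
    """Training subset for the data ablation (Figure 4): sample whole run segments rather
    than individual frames, since frames within a segment are heavily correlated."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(run_segment_ids)
    keep = rng.choice(unique_ids, size=max(1, int(fraction * len(unique_ids))), replace=False)
    return np.isin(run_segment_ids, keep)          # boolean mask over frames

def subsample_points(num_points, fraction, seed=0):
    """Randomly thin one frame's point cloud for the density ablation (Figure 5)."""
    rng = np.random.default_rng(seed)
    return rng.random(num_points) < fraction       # boolean mask over points
```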
In addition to decreasing point density, previous works also filter out the numerous returns from the ground in order to limit the number of points to predict [8], [7], [9]. Such a technique has a side benefit of bridging the domain gap between FlyingThings3D and KITTI Scene Flow, which differ in the inclusion of such points. We performed an ablation experiment to parallel this heuristic by training and evaluating with our annotations but removing points with a crude threshold of 0.2 m above ground. When removing ground points, we found that the mean L2 error increased by 159% and 31% for points in moving and stationary objects, respectively. We take these results to indicate that the inclusion of ground points provides a useful signal for predicting scene flow.

Taken together, these results provide post-hoc justification for building an architecture which may be tractably trained on all point cloud returns instead of one that only trains on a sample of the returns.

Finally, we report our results on the complete dataset and identify systematic differences across object class and whether or not an object is moving (Table III). Producing baseline comparisons for previous nearest neighbor based models is prohibitively expensive due to their poor scaling behavior.6 We hope to motivate a new class of real-time scene flow models that are capable of training on our dataset.
6 We did try experiments involving cropping and downsampling the point clouds to make comparison to the baselines feasible. However, these modifications distorted the points too much to serve as practical input data.

error metric   vehicle (all / moving / stationary)   pedestrian (all / moving / stationary)   cyclist (all / moving / stationary)   background
mean (m/s)     0.18 / 0.54 / 0.05                    0.25 / 0.32 / 0.10                       0.51 / 0.57 / 0.10                    0.07
mean (mph)     0.40 / 1.21 / 0.11                    0.55 / 0.72 / 0.22                       1.14 / 1.28 / 0.22                    0.16
≤ 0.1 m/s      70.0% / 11.6% / 90.2%                 33.0% / 14.0% / 71.4%                    13.4% / 4.8% / 78.0%                  95.7%
≤ 1.0 m/s      97.7% / 92.8% / 99.4%                 96.7% / 95.4% / 99.4%                    89.5% / 88.2% / 99.6%                 96.7%
TABLE III: Performance of the baseline on scene flow in the large-scale dataset. Mean pointwise L2 error (top) and percentage of points with error below 0.1 m/s and 1.0 m/s (bottom). Most errors are ≤ 1.0 m/s. Additionally, we investigate the error for stationary and moving points, where a point is coarsely considered moving if the flow vector magnitude is ≥ 0.5 m/s.

Table III indicates that moving vehicle points have a mean L2 error of 0.54 m/s, corresponding to 10% of the average speed of moving vehicles (5.6 m/s). Likewise, the mean L2 errors of moving pedestrian and cyclist points are 0.32 m/s and 0.57 m/s, corresponding to 25% and 15% of the mean speed of each object class, respectively. Hence, the ability to predict vehicle motion is better, in relative terms, than for pedestrians and cyclists. We suspect that these imbalances are largely due to imbalances in the number of training examples for each label and the average speed of these objects. For instance, the vast majority of points are marked as background and hence have a target of zero motion. Because the background points are dominant, we likewise observe the error to be smallest.

The mean L2 error is averaged over many points, making it unclear if this statistic may be dominated by outlier events. To address this issue, we show the percentage of points in the Waymo Open Dataset evaluation set with L2 errors below 0.1 m/s and 1.0 m/s. We observe that the vast majority of the errors are below 1.0 m/s (2.2 mph) in magnitude, indicating a rather regular distribution to the residuals. For example, 92.8% of moving and 99.4% of stationary vehicle points have an error below 1.0 m/s. In the next section, we also investigate how the prediction accuracy for classes like pedestrians and cyclists can be cast as a discrete task distinguishing moving and stationary points.
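As a worked check, the relative errors quoted above follow from dividing the Table III means for moving points by the mean moving speeds reported in Section VI-A:

0.54 m/s / 5.6 m/s ≈ 0.10 (vehicles),  0.32 m/s / 1.3 m/s ≈ 0.25 (pedestrians),  0.57 m/s / 3.8 m/s ≈ 0.15 (cyclists).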
C. Generalizing to unlabeled moving objects

Our supervised method for generating flow ground truth relies on every moving object having an accompanying tracked box. Without a tracked box, we effectively assume the points on an object are stationary. Though this assumption holds for the vast majority of points, there is still a wide range of moving objects that our algorithm assumes to be stationary. For deployment on a safety critical system, it is important to capture motion for these objects (e.g. a stroller, an opening car door, etc.). Even though the labeled data does not capture such objects, we find qualitatively that a trained model does capture some motion in these objects (Figure 6).

Fig. 6: Generalizing to unlabeled moving objects. Three examples, each with the bootstrapped annotation (left) and model prediction (right) (color code from Figure 1). Left example: despite missing flow annotation for the middle of a bus, our model can generalize well. Middle example: the model generalizes to an unlabeled object (a moving shopping cart). Right example: a failure of generalization, as motion is incorrectly predicted for the ground and parts of the tree.

We next ask the degree to which a model trained on such data predicts the motion of unlabeled moving objects. To answer this question, we construct several experiments by artificially removing labeled objects from the scene and measuring the ability of the model (in terms of the point-wise mean L2 error) to predict motion in spite of this disadvantage. Additionally, we coarsely label points as moving if their annotated speed (flow vector magnitude) is ≥ 0.5 m/s (fmin) and query the model to quantify the precision and recall for moving classification. This latter measurement of detecting moving objects is particularly important for guiding planning in an AV [50], [51], [52].

       method       L2 error (m/s): all / moving / stationary   prec   recall
cyc    supervised   0.51 / 0.57 / 0.10                          1.00   0.95
       stationary   1.13 / 1.24 / 0.06                          1.00   0.67
       ignored      0.83 / 0.93 / 0.06                          1.00   0.78
ped    supervised   0.25 / 0.32 / 0.10                          1.00   0.91
       stationary   0.90 / 1.30 / 0.10                          0.97   0.02
       ignored      0.88 / 1.25 / 0.10                          0.99   0.07
TABLE IV: Generalization of motion estimation. Approximating generalization for moving objects by artificially excluding a class from training, by either treating all its points as having zero flow (stationary) or as having no target label (ignored). We report the mean pointwise L2 error and the precision and recall for moving point classification.

Table IV reports results for selectively ablating the labels for pedestrians and cyclists. We ablate the labels in two ways: (1) Stationary treats points of ablated objects as background with no motion; (2) Ignored treats points of ablated objects as having no target label. We observe that fixing all points as stationary results in a model with near perfect precision. However, the recall suffers enormously, particularly for pedestrians. Our results imply that unlabeled points predicted to be moving are almost perfectly correct (i.e. minimal false positives); however, the recall is quite poor as many moving points are not identified (i.e. a large number of false negatives). We find that treating the unlabeled points as ignored improves the performance slightly, indicating that even moderate information known about potential moving objects may alleviate challenges in recall.
Notably, we ob- serve a large discrepancy in recall between the ablation experiments for cyclists and pedestrians. We posit that this discrepancy is likely due to the much larger amount of pedestrian labels in the Waymo Open Dataset. Removing the entire class of pedestrian labels removes much more of the ground truth labels for moving objects. Although our model has some capacity to generalize to unlabeled moving object points, this capacity is limited. Ignoring labeled points does mitigate the error rate for cyclists and pedestrians, however such an approach can result in other systematic errors. For instance, in earlier experiments, ignoring stationary labels for background points (i.e. no motion) results in a large increase in mean L2 error in background points from 0.03 m/s to 0.40 m/s. Hence, such heuristics are only partial solutions to this problem and new ideas are warranted for approaching this dataset. We suspect that many opportunities exist for applying semi-supervised learning techniques for generalizing to unlabeled objects and leave this to future work [53], [54]. VII. DISCUSSION In this work we presented a new large-scale scene flow dataset measured from LiDAR in autonomous vehicles. Specifically, by leveraging the supervised tracking labels from the Waymo Open Dataset, we bootstrapped a motion vector annotation for each LiDAR return. The resulting dataset is ∼1000× larger than previous real world scene flow datasets. We also propose a series of metrics for evaluating the resulting scene flow with breakdowns based on criteria that are relevant for deploying in the real world. Finally, we demonstrated a scalable baseline model trained on this dataset that achieves reasonable predictive perfor- mance and may be deployed for real time operation. Inter- estingly, our setup opens opportunities for self-supervised and semi-supervised methods [53], [54], [26]. We hope that this dataset may provide a useful baseline for exploring such techniques and developing generic methods for scene flow estimation in AV’s in the future. ACKNOWLEDGEMENTS We thank Vijay Vasudevan, Benjamin Caine, Jiquan Ngiam, Brandon Yang, Pei Sun, Yuning Chai, Charles Qi, Dragomir Anguelov, Congcong Li, Jiyang Gao, James Guo, and Yin Zhou for their comments and suggestions. Additionally, we thank the larger Google Brain and Waymo Perception teams for their support. REFERENCES [1] D. A. Forsyth et al., Computer vision: a modern approach. Prentice Hall Professional Technical Reference, 2002. [2] S. Thrun et al., “Stanley: The robot that won the darpa grand challenge,” Journal of field Robotics, vol. 23, pp. 661–692, 2006. [3] S. Casas et al., “Intentnet: Learning to predict intention from raw sensor data,” in CoRL, 2018, pp. 947–956. [4] Y. Chai et al., “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” in CoRL, 2019. [5] W. Luo et al., “Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net,” in CVPR, 2018, pp. 3569–3577. [6] R. Mahjourian et al., “Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints,” in CVPR, 2018. [7] X. Liu et al., “Flownet3d: Learning scene flow in 3d point clouds,” in CVPR, 2019, pp. 529–537. [8] W. Wu et al., “Pointpwc-net: A coarse-to-fine network for supervised and self-supervised scene flow estimation on 3d point clouds,” arXiv preprint arXiv:1911.12408, 2019. [9] X. 
Gu et al., “Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds,” in CVPR, 2019. [10] M. Menze et al., “Object scene flow for autonomous vehicles,” in CVPR, 2015, pp. 3061–3070. [11] A. Geiger et al., “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012, pp. 3354–3361. [12] N. Mayer et al., “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in CVPR, 2016. [13] Z. Wang et al., “Flownet3d++: Geometric losses for deep scene flow estimation,” in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 91–98. [14] X. Liu et al., “Meteornet: Deep learning on dynamic 3d point cloud sequences,” in CVPR, 2019, pp. 9246–9255. [15] P. Sun et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in CVPR, 2020, pp. 2446–2454. [16] A. Saxena et al., “Learning depth from single monocular images,” in NeurIPS, 2006, pp. 1161–1168. [17] D. Scharstein et al., “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” IJCV, vol. 47, pp. 7–42, 2002. [18] D. Pfeiffer et al., “Exploiting the power of stereo confidences,” in CVPR, 2013, pp. 297–304. [19] S. Baker et al., “A database and evaluation methodology for optical flow,” IJCV, vol. 92, no. 1, pp. 1–31, 2011. [20] D. Kondermann et al., “On performance analysis of optical flow algorithms,” in Outdoor and Large-Scale Real-World Scene Analysis. Springer, 2012, pp. 329–355. [21] S. Morales et al., “Ground truth evaluation of stereo algorithms for real world applications,” in ACCV. Springer, 2010, pp. 152–162. [22] L. Ladick`y et al., “Joint optimization for object class segmentation and dense stereo reconstruction,” IJCV, vol. 100, pp. 122–133, 2012. [23] D. J. Butler et al., “A naturalistic open source movie for optical flow evaluation,” in ECCV. Springer, 2012, pp. 611–625. [24] S. Wang et al., “Deep parametric continuous convolutional neural networks,” in CVPR, 2018, pp. 2589–2597. [25] K.-H. Lee et al., “Pillarflow: End-to-end birds-eye-view flow estima- tion for autonomous driving,” IROS, 2020. [26] J. Pontes et al., “Scene flow from point clouds with or without learning,” International Conference on 3D Vision, 2020. [27] M.-F. Chang et al., “Argoverse: 3d tracking and forecasting with rich maps,” in CVPR, 2019, pp. 8748–8757. [28] H. Caesar et al., “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020, pp. 11 621–11 631. [29] J. Houston et al., “One thousand and one hours: Self-driving motion prediction dataset,” arXiv preprint arXiv:2006.14480, 2020. [30] A. Behl et al., “Pointflownet: Learning representations for rigid motion estimation from point clouds,” in CVPR, 2019, pp. 7962–7971. [31] H. Fan et al., “Pointrnn: Point recurrent neural network for moving point cloud processing,” arXiv preprint arXiv:1910.08287, 2019. [32] P. Wu et al., “Motionnet: Joint perception and motion prediction for autonomous driving based on bird’s eye view maps,” in CVPR, 2020. [33] A. Dewan et al., “Rigid scene flow for 3d lidar scans,” in IROS, 2016. [34] A. Ushani et al., “A learning approach for real-time temporal scene flow estimation from lidar data,” in ICRA, 2017, pp. 5666–5673. [35] A. K. Ushani et al., “Feature learning for scene flow estimation from lidar,” in CoRL, 2018, pp. 283–292. [36] K. Bousmalis et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in ICRA, 2018. [37] A. 
Saxena et al., “Robotic grasping of novel objects using vision,” The Int’l Journal of Robotics Research, vol. 27, pp. 157–173, 2008. [38] U. Viereck et al., “Learning a visuomotor controller for real world robotic grasping using simulated depth images,” arXiv preprint arXiv:1706.04652, 2017. [39] M. Gualtieri et al., “High precision grasp pose detection in dense clutter,” in IROS, 2016, pp. 598–605. [40] F. Yu et al., “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” in CVPR, 2020, pp. 2636–2645. [41] C. R. Qi et al., “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, 2017, pp. 652–660. [42] Y. Zhou et al., “End-to-end multi-view fusion for 3d object detection in lidar point clouds,” in CoRL, 2019. [43] O. Ronneberger et al., “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241. [44] A. H. Lang et al., “Pointpillars: Fast encoders for object detection from point clouds,” arXiv preprint arXiv:1812.05784, 2018. [45] P. Jund et al., “Scalable scene flow from point clouds in the real world,” arXiv preprint arXiv:2103.01306, 2021. [46] K. Zhou et al., “Real-time kd-tree construction on graphics hardware,” ACM Transactions on Graphics (TOG), vol. 27, no. 5, pp. 1–11, 2008. [47] Y. Chen et al., “Fast neighbor search by using revised kd tree,” Information Sciences, vol. 472, pp. 145–162, 2019. [48] A. Odena et al., “Deconvolution and checkerboard artifacts,” Distill, 2016. [49] Z. Ding et al., “1st place solution for waymo open dataset challenge–3d detection and domain adaptation,” arXiv preprint arXiv:2006.15505, 2020. [50] M. McNaughton et al., “Motion planning for autonomous driving with a conformal spatiotemporal lattice,” in ICRA, 2011, pp. 4889–4895. [51] K. Chu et al., “Local path planning for off-road autonomous driving with avoidance of static obstacles,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1599–1616, 2012. [52] D. Dolgov et al., “Practical search techniques in path planning for autonomous driving,” in AAAI, vol. 1001, 2008, pp. 18–80. [53] G. Papandreou et al., “Weakly-and semi-supervised learning of a deep convolutional net for semantic image segmentation,” in ICCV, 2015. [54] L.-C. Chen et al., “Leveraging semi-supervised learning in video sequences for urban scene segmentation.” in ECCV, 2020. [55] A. Filatov, A. Rykov, and V. Murashkin, “Any motion detector: Learning class-agnostic scene dynamics from a sequence of lidar point clouds,” arXiv preprint arXiv:2004.11647, 2020. [56] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,” arXiv preprint arXiv:1412.6980, 2014. [57] J. Shen, P. Nguyen, Y. Wu, Z. Chen, M. X. Chen, Y. Jia, A. Kannan, T. Sainath, Y. Cao, C.-C. Chiu, et al., “Lingvo: a modular and scal- able framework for sequence-to-sequence modeling,” arXiv preprint arXiv:1902.08295, 2019. [58] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, et al., “Starnet: Targeted computation for object detection in point clouds,” arXiv preprint arXiv:1908.11069, 2019. [59] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, 2010, pp. 249–256. APPENDIX BOOTSTRAPPING GROUND TRUTH ANNOTATIONS In this section, we discuss in detail several practical considerations for the method for computing scene flow annotations (Section III-B). 
One challenge in this context is the lack of correspondence between the observed points in P−1 and P0. In our work, we choose to make flow predictions (and thus compute annotations) for the points at the current time step, P0. As opposed to doing so for P−1, we believe that explicitly assigning flow predictions to the points in the most recent frame is advantageous to an AV that needs to reason about and react to the environment in real time. Additionally, the motion between P−1 and P0 is a reasonable approximation for the flow at t0 when considering a high LiDAR acquisition frame rate and assuming a constant velocity between consecutive frames. Calculating the transformation for the motion of an object. The goal of this section is to describe the calculation of T∆used to transform a point p0 to its corresponding position at t-1 based on the motion of the labeled object to which it belongs. We leverage the 3D label boxes to circumvent the point-wise correspondence problem between P0 and P−1 and estimate the position of the points belonging to an object in P0 as they would have been observed at t-1. We compute the flow vector for each point at t0 using its displacement over the duration of ∆t = t0 −t-1. Let point clouds P−1 and P0 be represented in the reference frame of the AV at their corresponding time steps. We identify the set of objects O0 at t0 based on the annotated 3D boxes of the corresponding scene. We express the pose of an object o in the AV frame as a homogeneous transformation matrix T consisting of 3D translation and rotational components derived from the pose of tracked objects. For each object o ∈O0, we first use its pose T−1 relative to the AV at t-1 and compensate for ego motion to compute its pose T∗ −1 at t-1 but with respect to the AV frame at t0. This is straightforward given knowledge of the poses of the AV at the corresponding time steps in the dataset, e.g. from a localization system. Accordingly, we compute the rigid body transform T∆ used to transform points belonging to object o at time t0 to their corresponding position at t-1, i.e. T∆:= T∗ −1·T−1 0 . Rigid body assumption. Our approach for scene flow annotation assumes the 3D label boxes correspond to rigid bodies, allowing us to compute the point-wise correspondences between two frames. Although this is a common assumption in the literature (especially for labeled vehicles [10]), this does not necessarily apply to non-rigid objects such as pedestrians. However, we found this to be a reasonable approximation in our work on the Waymo Open Dataset for two reasons. First, we derive our annotations from frames measured at high frequency (i.e. 10 Hz) such that object deformations are minimal between adjacent frames. Second, the number of observed points on objects like pedestrians is typically small making any deviations from a rigid assumption to be of statistically minimal consequence. Objects with no matching previous frame labels. In some cases, an object o ∈Ot0 with a label box at t0 will not have a corresponding label at t-1, e.g. the object is first observable at t0. Without information about the motion of the object between t-1 and t0, we choose to annotate its points as having invalid flow. While we can still use them to encode the scene and extract features during model training, this annotation allows us to exclude them from model weight updates and scene flow evaluation metrics. Background points. Since typically most of the world is stationary (e.g. 
buildings, ground, vegetation), it is important to reflect this in the dataset. Having compensated for ego motion, we assign zero motion for all unlabeled points in the scene, and additionally annotate them as belonging to the “background” class. Although this holds for the vast majority of unlabeled points, there will always exist rare moving objects in the scene that were not manually annotated with label boxes (e.g. animals). In the absence of label boxes, points of such objects will receive a stationary annotation by default. Nonetheless, we recognize the importance of enabling a model to predict motion on unlabeled objects, as it is crucial for an AV to safely react to rare, moving objects. In Section VI-C, we highlight this challenge and discuss opportunities for employing this dataset as a benchmark for semi-supervised and self-supervised learning. Coordinate frame of reference. As opposed to most other works [11], [10], we account for ego motion in our scene flow annotations. Not only does this better reflect the fact that most of the world is stationary, but it also improves the interpretability of flow annotations, predictions, and evaluation metrics. In addition to compensating for ego motion when computing flow annotations at t0, we also transform P−1, the scene at t-1, to the reference frame of the AV at t0 when learning and inferring scene flow. We argue that this is more realistic for AV applications in which ego motion is available from IMU/GPS sensors [2]. Furthermore, having a consistent coordinate frame for both input frames lessens the burden on a model to correspond moving objects between frames [55] as explored in Appendix . MODEL ARCHITECTURE AND TRAINING DETAILS Figure 2 provides an overview of FastFlow3D. We discuss each section of the architecture in turn and provide additional parameters in Table VI. The model architecture contains in total 5,233,571 pa- rameters. The vast majority of the parameters (4,212,736) reside in the standard convolution architecture [44]. An additional large set of parameters (1,015,808) reside in later layers that perform upsampling with a skip connection [48]. Finally, a small number of parameters (544) are dedicated to featurizing each point cloud point [41] as well as performing the final regression on to the motion flow (4483). Note that both of these latter sets of parameters are purposefully small because they are applied to all N points in the LiDAR point cloud. FastFlow3D uses a top-down U-Net to process the pil- larized features. Consequently, the model can only predict flow for points inside the pillar grid. Points outside the x-y dimensions of the grid or outside the z dimension bounds for the pillars are marked as invalid and receive no predictions. To extend the scope of the grid, one can either make the pillar size larger or increase the size of the pillar grid. In our work, we use a 170 × 170 m grid (centered at the AV) represented by 512 × 512 pillars (∼0.33 × 0.33 m pillars). For the z dimension, we consider the valid pillar range to be from −3 m to +3 m. The model was trained for 19 epochs on the Waymo Open Dataset training set using the Adam optimizer [56]. The model was written in Lingvo [57] and forked from the open-source repository version of PointPillars 3D object detection [44], [58] 7. The training set contains a label imbalance, vastly over-representing background stationary points. 
In early experiments, we explored a hyper-parameter to artificially downweight background points and found that weighing down the L2 loss by a factor of 0.1 provided good performance. COMPENSATING FOR EGO MOTION In Section III-B, we argue that compensating for ego motion in the scene flow annotations improves the in- terpretability of flow predictions and highlights important patterns and biases in the dataset, e.g. slow vs fast objects. When training our proposed FastFlow3D model, we also compensate for ego motion by transforming both LiDAR frames to the reference frame of the AV at t0, the time step at which we predict flow. This is convenient in practice given that ego motion information is easily available from the localization module of an AV. We hypothesize that this lessens the burden on the model, because the model does not have to implicitly learn to compensate for the motion of the AV. We validate this hypothesis in a preliminary experiment where we compare the performance of the model reported in Section VI to a model trained on the same dataset but without compensating for ego motion in the input point clouds. Consequently, this model has to implicitly learn how to compensate for ego motion. Table V shows the mean L2 error for two such models. We observe that mean L2 error increases substantially when ego motion is not compensated for across all object types and across moving and stationary objects. This is also consistent with previous works [55]. We also ran a similar experiment where the model consumes non ego motion compensated point clouds, but instead subtracts ego motion from the predicted flow during training and evaluation. We found slightly better performance for moving 7 https://github.com/tensorflow/lingvo/ objects for this setup, but the performance is still far short of the performance achieved when compensating for ego motion directly in the input. Further research is needed to effectively learn a model that can implicitly account for ego motion. MEASUREMENTS OF LATENCY In this section we provide additional details for how the latency numbers for Table I were calculated. All calcula- tions were performed on a standard NVIDIA Tesla P100 GPU with a batch size of 1. The latency is averaged over 90 forward passes, excluding 10 warm up runs. Latency for the baseline models, HPLFlowNet [9] and FlowNet3D [7], [13] included any preprocessing necessary to perform inference. For HLPFlowNet and FlowNet3D, we used the implementations provided by the authors and did not alter hyperparameters. Note that this is in favor of these models, as they were tuned for point clouds covering a much smaller area compared to the Waymo Open Dataset. DATASET FORMAT FOR ANNOTATIONS In order to access the data, please go to http://www.waymo.com/open and click on Access Waymo Open Dataset, which requires a user to sign in with Google and accept the Waymo Open Dataset license terms. After logged in, please visit https: //pantheon.corp.google.com/storage/ browser/waymo_open_dataset_scene_flow to download the labels. We extend the Waymo Open Dataset to include the scene flow labels for the training and validation datasets splits. For each LiDAR, we add a new range image through the field range image flow compressed in the message dataset.proto:RangeImage. The range image is a 3D tensor of shape [H, W, 4] where H and W are the height and width of the LiDAR scan. 
For the LiDAR returns at point (i, j), we provide annotations in the range image where [i, j, 0:3] corresponds to the estimated velocity components for the return along x, y and z axes, respectively. Finally, the value stored in the range image at [i, j, 3] contains an integer class label. ego motion vehicle pedestrian cyclist background compensated for all moving stationary all moving stationary all moving stationary yes 0.18 0.54 0.05 0.25 0.32 0.10 0.51 0.57 0.10 0.07 no 0.36 1.16 0.08 0.49 0.63 0.17 1.21 1.34 0.14 0.07 TABLE V: Accounting for ego motion in the input significantly improves performance. All reported values correspond to the mean L2 error in m/s. The first column refers to whether or not we transform the point cloud at t-1 to the reference frame of the AV at t0. For both models, we evaluate the error in predicting scene flow annotations as described in Section III-B, i.e. with ego motion compensated for. Meta-Arch Name Input(s) Operation Kernel Stride BN? Output Size Depth # Param Architecture A N pts MLP – – Yes N pts 64 544 B A Snap-To-Grid – – – 512 × 512 64 0 C B Convolution 3 × 3 2 Yes 256 × 256 64 37120 D C Convolution 3 × 3 1 Yes 256 × 256 64 37120 E D Convolution 3 × 3 1 Yes 256 × 256 64 37120 F E Convolution 3 × 3 1 Yes 256 × 256 64 37120 G F Convolution 3 × 3 2 Yes 128 × 128 128 74240 H G Convolution 3 × 3 1 Yes 128 × 128 128 147968 I H Convolution 3 × 3 1 Yes 128 × 128 128 147968 J I Convolution 3 × 3 1 Yes 128 × 128 128 147968 K J Convolution 3 × 3 1 Yes 128 × 128 128 147968 L K Convolution 3 × 3 1 Yes 128 × 128 128 147968 M L Convolution 3 × 3 2 Yes 64 × 64 256 295936 N M Convolution 3 × 3 1 Yes 64 × 64 256 590848 O N Convolution 3 × 3 1 Yes 64 × 64 256 590848 P O Convolution 3 × 3 1 Yes 64 × 64 256 590848 Q P Convolution 3 × 3 1 Yes 64 × 64 256 590848 R Q Convolution 3 × 3 1 Yes 64 × 64 256 590848 S R∗, L∗, 128, 128 Upsample-Skip – – No 128 × 128 128 540672 T S, F∗, 128, 64 Upsample-Skip – – No 256 × 256 128 311296 U T, B∗, 64, 64 Upsample-Skip – – No 512 × 512 64 126976 V U Convolution 3 × 3 1 No 512 × 512 64 36864 W V Ungrid – – – N pts 64 0 X W, B Concatenate – – – N pts 128 0 Y X Linear – – No N pts 32 4384 Z Y Linear – – No N pts 3 99 Upsample-Skip (α, β, d, db) U1 α Convolution 1 × 1 1 No hα × wα db U2 U1 Bilinear Interp. – 1 2 No hβ × wβ db U3 β Convolution 1 × 1 1 No hβ × wβ db U4 U2, U3 Concatenate – – No hβ × wβ 2db U5 U4 Convolution 3 × 3 1 No hβ × wβ d U6 U5 Convolution 3 × 3 1 No hβ × wβ d Padding mode SAME Optimizer Adam [56] (lr = 1e−6, β1 = 0.9, β2 = 0.999) Batch size 64 Weight initialization Xavier-Glorot [59] Weight decay None TABLE VI: FastFlow3D architecture and training details. The network receives as input N 3D points with additional point cloud features and outputs N 3D points corresponding to the predicted 3D motion vector of each point. This output corresponds to the output of layer Z. All layers employ a ReLU nonlinearity except for layers S −Z which employ no nonlinearity. The shape of tensor x is denoted hx ×wx. The Upsample-Skip layer receives two tensors α and β and a scalar depth d as input and outputs tensor U6. Inputs with ∗denote the concatenation of the respective layers of the weight-sharing encoders.
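For reference, below is a sketch of the Upsample-Skip block described in Table VI, written with Keras layers: a 1 × 1 bottleneck convolution, bilinear 2× upsampling, a 1 × 1 convolution on the skip input, concatenation, and two 3 × 3 convolutions, all with SAME padding and, per the table caption, no nonlinearity. The original model was implemented in Lingvo, so this is an illustrative reconstruction rather than the authors' code.

```python
import tensorflow as tf

def upsample_skip(alpha, beta, d, db):
    """Upsample-Skip block from Table VI.

    alpha: lower-resolution feature map (h_alpha x w_alpha).
    beta:  higher-resolution skip feature map (h_beta x w_beta), with h_beta = 2 * h_alpha.
    d, db: output depth and bottleneck depth.
    """
    u1 = tf.keras.layers.Conv2D(db, 1, padding='same')(alpha)                       # U1: 1x1 bottleneck
    u2 = tf.keras.layers.UpSampling2D(size=2, interpolation='bilinear')(u1)          # U2: bilinear x2
    u3 = tf.keras.layers.Conv2D(db, 1, padding='same')(beta)                         # U3: 1x1 on skip input
    u4 = tf.keras.layers.Concatenate(axis=-1)([u2, u3])                              # U4: concat to depth 2*db
    u5 = tf.keras.layers.Conv2D(d, 3, padding='same')(u4)                            # U5: 3x3 convolution
    return tf.keras.layers.Conv2D(d, 3, padding='same')(u5)                          # U6: 3x3 convolution
```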