Scalable Scene Flow from Point Clouds in the Real World

Philipp Jund*,1, Chris Sweeney*,2, Nichola Abdo2, Zhifeng Chen1, Jonathon Shlens1
1 Google Brain, 2 Waymo
{abdon,cjsweeney}@waymo.com

Abstract—Autonomous vehicles operate in highly dynamic environments necessitating an accurate assessment of which aspects of a scene are moving and where they are moving to. A popular approach to 3D motion estimation, termed scene flow, is to employ 3D point cloud data from consecutive LiDAR scans, although such approaches have been limited by the small size of real-world, annotated LiDAR data. In this work, we introduce a new large-scale dataset for scene flow estimation derived from corresponding tracked 3D objects, which is ~1,000x larger than previous real-world datasets in terms of the number of annotated frames. We demonstrate how previous works were bounded based on the amount of real LiDAR data available, suggesting that larger datasets are required to achieve state-of-the-art predictive performance. Furthermore, we show how previous heuristics for operating on point clouds such as down-sampling heavily degrade performance, motivating a new class of models that are tractable on the full point cloud. To address this issue, we introduce the FastFlow3D architecture which provides real time inference on the full point cloud. Additionally, we design human-interpretable metrics that better capture real world aspects by accounting for ego-motion and providing breakdowns per object type. We hope that this dataset may provide new opportunities for developing real world scene flow systems.

I. INTRODUCTION

Motion is a prominent cue that enables humans to navigate complex environments [1]. Likewise, understanding and predicting the 3D motion field of a scene -- termed the scene flow -- provides an important signal to enable autonomous vehicles (AVs) to understand and navigate highly dynamic environments [2]. Accurate scene flow prediction enables an AV to identify potential obstacles, estimate the trajectories of objects [3], [4], and aid downstream tasks such as detection, segmentation and tracking [5], [6].

Recently, approaches that learn models to estimate scene flow from LiDAR have demonstrated the potential for LiDAR-based motion estimation, outperforming camera-based methods [7], [8], [9]. Such models take two consecutive point clouds as input and estimate the scene flow as a set of 3D vectors, which transform the points from the first point cloud to best match the second point cloud. One of the most prominent benefits of this approach is that it avoids the additional burden of estimating the depth of sensor readings as is required in camera-based approaches [6]. Unfortunately, for LiDAR based data, ground truth motion vectors are ill-defined and not tenable because no correspondence exists between LiDAR returns from subsequent time points. Instead, one must rely on semi-supervised methods that employ auxiliary information to make strong inferences about the motion signal in order to bootstrap annotation labels [10], [11]. Such an approach suffers from the fact that motion annotations are extremely limited (e.g. 400 frames in [10], [11]) and often rely on pretraining a model based on synthetic data [12] which exhibits distinct noise and sensor properties from real data. Furthermore, previous datasets cover a smaller area, e.g., the KITTI scene flow dataset covers 1/5th the area of our proposed dataset. This allows for different subsampling tradeoffs and inspired a class of models that are not able to tractably scale training and inference beyond ~10K points [7], [8], [9], [13], [14], making the usage of such models impractical in real world AV scenes which often contain 100K - 1000K points.

In this work, we address these shortcomings of the field by introducing a large scale dataset for scene flow geared towards AVs. We derive per-point labels for motion estimation by bootstrapping from tracked objects densely annotated in a scene from a recently released large scale AV dataset [15]. The resulting scene flow dataset contains 198K frames of motion estimation annotations. This amounts to a roughly ~1,000x larger training set than the largest, commonly used real world dataset (200 frames) for scene flow [10], [11]. By working with a large scale dataset for scene flow, we identify several indications that the problem is quite distinct from current approaches:

- Learned models for scene flow are heavily bounded by the amount of data.
- Heuristics for operating on point clouds (e.g. down-sampling) heavily degrade predictive performance. This observation motivates the development of a new class of models tractable on a full point cloud scene.
- Previous evaluation metrics ignore notable systematic biases across classes of objects (e.g. predicting pedestrian versus vehicle motion).

We discuss each of these points in turn as we investigate working with this new dataset. Recognizing the limitations of previous works, we develop a new baseline model architecture, FastFlow3D, that is tractable on the complete point cloud with the ability to run in real time (i.e. < 100 ms) on an AV. Figure 1 shows scene flow predictions from FastFlow3D, trained on our scene flow dataset. Finally, we identify and characterize an under-appreciated problem in semi-supervised learning based on the ability to predict the motion of unlabeled objects. We suspect the degree to which the fields of semi-supervised learning attack this problem may have strong implications for the real-world application of scene flow in AVs. We hope that the resulting dataset presented in this paper may open the opportunity for qualitatively new forms of learned scene flow models.

* Denotes equal contributions.

Fig. 1: LiDAR scene flow estimation for autonomous vehicles. Left: Overlay of two consecutive point clouds (green and blue, respectively) sampled at 10 Hz from the Waymo Open Dataset [15]. White boxes are tracked 3D bounding boxes for human annotated vehicles and pedestrians. Middle: Predicted scene flow for each point colored by direction, and brightened by speed based on overlaid frames†. Right: Two qualitative examples of bootstrapped annotations.

† Please note that we predict 3D flow, but color the direction of flow with respect to the x-y plane for the visualization.

II. RELATED WORK

A. Benchmarks for scene flow estimation

Early datasets focused on the related problems of inferring depth from a single image [16] or stereo pairs of images [17], [18]. Previous datasets for estimating optical flow were small and largely based on synthetic imagery [19], [20], [21], [22]. Subsequent datasets focused on 2D motion estimation in movies or sequences of images [23]. The KITTI Scene Flow dataset represented a huge step forward, providing non-synthetic imagery paired with accurate ground truth estimates. However, it contained only 200 scenes for training and involved preprocessing steps that alter real-world characteristics [10]. FlyingThings3D offered a modern large-scale synthetic dataset comprising ~20K frames of high resolution data from which scene flow may be bootstrapped [12]. Internal datasets by [24], [25] are constructed similarly to ours, but are not publicly available and do not offer a detailed description. Recently, [26] created two scene flow datasets in a similar fashion, subsampling 2,691 and 1,513 training scenes from the Argoverse [27] and nuScenes [28] datasets, respectively. However, even without subsampling, larger datasets in terms of scenes and number of points are needed to train more accurate scene flow models, as shown in [26] and Figure 4.

B. Datasets for tracking in AVs

Recently, there have been several works introducing large-scale datasets for autonomous vehicle applications [11], [27], [28], [29], [15]. While these datasets do not directly provide scene flow labels, they provide vehicle localization data, as well as raw LiDAR data and bounding box annotations for perceived tracklets. These recent datasets offer an opportunity to propose a methodology to construct point-wise flow annotations from such data (Section III-B).

We extend the Waymo Open Dataset (WOD) to construct a large-scale scene flow benchmark for dense point clouds [15]. We select the Waymo Open Dataset because the bounding box annotations are at a higher acquisition frame rate (10 Hz) than competing datasets (e.g. 2 Hz in [28]) and contain ~5x the number of returns per LiDAR frame (Table 1, [15]). In addition, the Waymo Open Dataset also provides ~10x more scenes and annotated LiDAR frames than Argoverse [27]. Recently, [29] released a large-scale dataset with 1,000+ hours of driving data. However, their tracked object annotations are not human-annotated but based on the results of the onboard perception system.
C. Models for learning scene flow

There is a rich literature of building learned models for scene flow using end-to-end learned architectures [30], [31], [8], [7], [14], [13], [32], [25] as well as hybrid architectures [33], [34], [35]. We discuss these in Section V in conjunction with building a scalable baseline model that operates in real time. Recently, Lee et al. presented an approach for predicting pillar-level flow [25]. Whereas our model leverages a similar pillar-based architecture, we tackle the full scope of the scene flow problem and predict point-level flow while being tractable enough for real-time applications.

Moreover, many previous works train models on synthetic datasets like FlyingThings3D [12] and evaluate and/or fine-tune on KITTI Scene Flow [11], [10]. Typically, these models are limited in their ability to leverage synthetic data in training. This observation is in line with the robotics literature and highlights the challenges of generalization from the simulated to the real world [36], [37], [38], [39].

III. CONSTRUCTING A SCENE FLOW DATASET

In this section, we present an approach for generating scene flow annotations bootstrapped from existing labeled datasets. We first formalize the scene flow problem definition. We then detail our method for computing per-point flow vectors by leveraging the motion of 3D object label boxes. We emphasize that many details abound in the assumptions behind such annotations, how to calculate various transformations in the track labels, as well as how to handle important edge cases. In this work, we apply this methodology on the Waymo Open Dataset [15]. The dataset offers a large scale with diverse LiDAR scenes where objects have been manually and accurately annotated with 3D boxes at 10 Hz. Finally, the accurate AV pose information permits compensating for ego motion. We note that the method for scene flow annotation is general and may be used to estimate 3D flow vectors given the label box poses available in other datasets [28], [29], [15], [40].

A. Problem definition

We consider the problem of estimating 3D scene flow in settings where the scene at time t_i is represented as a point cloud P_i as measured by a LiDAR sensor mounted on the AV. Specifically, we define scene flow as the collection of 3D motion vectors f := (v_x, v_y, v_z)^T for each point in the scene, where v_d is the velocity in the d direction in m/s. Following the scene flow literature, we predict flow given two consecutive point clouds of the scene, P_{-1} and P_0. The scene flow encodes the motion between the previous and current time steps, t_{-1} and t_0, respectively. We predict the scene flow at the current time step, P_0, in order to make the predictions practical for real time operation.

B. From tracked boxes to flow annotations

Obtaining ground truth scene flow from standard real-world LiDAR data is a challenging task. One challenge is the lack of point-wise correspondences between subsequent LiDAR frames. Manual annotation is too expensive and humans must contend with ambiguity due to changes in viewpoint and partial occlusions. Therefore, we focus on a scalable automated approach bootstrapped from existing labeled, tracked objects in LiDAR data sequences.

The annotation procedure is straightforward. We assume that labeled objects are rigid and calculate point velocities using a secant line approximation. For each point p_0 at time t_0, we compute the flow annotation as f = 1/Δt (p_0 − p_{-1}), where Δt = t_0 − t_{-1}, p_{-1} = T_Δ p_0 is the corresponding point at t_{-1}, and T_Δ is a homogeneous transformation inferred from the track labels of the object to which the point belongs (if there is no label at t_{-1}, the flow is annotated as invalid). This captures how a moving object may have varying per-point flow magnitudes and directions. Though our rigidity assumption does not necessarily apply to non-rigid objects (e.g. pedestrians), the high frame rate (10 Hz) minimizes non-rigid deformations between adjacent frames.

In order to calculate the transformation T_Δ, we compensate for the ego motion of the AV because this leads to superior predictive performance since a learned model does not need to additionally infer the AV's motion (most AVs are equipped with an IMU/GPS system to provide such information). Furthermore, compensating for ego motion improves the interpretability of the evaluation metrics (Section IV) since the predictions are now independent of the AV motion.

We use this approach to compute the flow vectors for all points in P_0 belonging to labeled objects. Points outside the labeled objects are assigned a flow of 0 m/s. This stationary assumption works well in practice, but has a notable gap when considering unlabeled moving objects in the scene. See Section VI-C for an in-depth analysis of how our model can generalize to unlabeled moving objects.
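As a concrete illustration of this annotation procedure, the sketch below computes the secant-line flow for the points of a single labeled object, assuming the label box poses are given as 4x4 homogeneous transforms and that the t_{-1} pose has already been ego-motion compensated into the AV frame at t_0 (see the Appendix and the sketch there for how T*_{-1} is obtained). Function and variable names are illustrative, not taken from the released tooling.

```python
import numpy as np

def annotate_object_flow(points_t0, box_pose_t0, box_pose_tm1_in_t0, dt=0.1):
    """Bootstrapped flow for the points of one labeled, rigid object (Section III-B).

    points_t0:           (N, 3) object points observed at t_0, in the AV frame at t_0.
    box_pose_t0:         (4, 4) homogeneous pose T_0 of the label box at t_0.
    box_pose_tm1_in_t0:  (4, 4) ego-motion-compensated pose T*_{-1} of the same box at
                         t_{-1}, expressed in the AV frame at t_0, or None if the object
                         has no label at t_{-1}.
    dt:                  time between frames in seconds (10 Hz scans -> 0.1 s).

    Returns (flow, valid): flow is (N, 3) in m/s, valid marks points with usable labels.
    """
    n = points_t0.shape[0]
    if box_pose_tm1_in_t0 is None:
        # No corresponding label at t_{-1}: annotate the flow as invalid.
        return np.zeros((n, 3)), np.zeros(n, dtype=bool)

    # Rigid transform taking points at t_0 back to their position at t_{-1}:
    # T_delta := T*_{-1} @ T_0^{-1}.
    t_delta = box_pose_tm1_in_t0 @ np.linalg.inv(box_pose_t0)

    # Apply T_delta to homogeneous points to obtain the corresponding points at t_{-1}.
    points_h = np.concatenate([points_t0, np.ones((n, 1))], axis=1)  # (N, 4)
    points_tm1 = (points_h @ t_delta.T)[:, :3]

    # Secant-line approximation of the per-point velocity: f = (p_0 - p_{-1}) / dt.
    flow = (points_t0 - points_tm1) / dt
    return flow, np.ones(n, dtype=bool)
```

Points that fall outside every label box would simply receive a zero-flow, "background" annotation, matching the stationary assumption described above.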
IV. EVALUATION METRICS FOR SCENE FLOW

Two common metrics used for 3D scene flow are the mean L2 error of pointwise flow and the percentage of predictions with L2 error below a given threshold [7], [24]. In this work, we additionally propose modifications to improve the interpretability of the results.

Breakdown by object type. Objects within the AV scene (e.g. vehicles, pedestrians) have different speed distributions dictated by the object class (Section VI-A). This becomes especially apparent after accounting for ego motion. Reporting a single error ignores these systematic differences. In practice, we find it more meaningful to report all prediction performances delineated by the object label.

Binary classification formulation. One important practical application of predicting scene flow is enabling an AV to distinguish between moving and stationary parts of the scene. In that spirit, we formulate a second set of metrics that represent a "lower bar" which captures a useful rudimentary signal. We employ this metric exclusively for the more difficult task of semi-supervised learning (Section VI-C) where learning is more challenging. In particular, we assign a binary label to each reflection as either moving or stationary based on a threshold, |f| ≥ f_min. Accordingly, we compute precision and recall metrics for these binary labels across an entire scene. Selecting a threshold, f_min, is not straightforward as there is an ambiguous range between very slow and stationary objects. For simplicity, we select a conservative threshold of f_min = 0.5 m/s (1.1 mph) to assure that things labeled as moving are actually moving.
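The two metric families above reduce to a few lines of array arithmetic. The following is a minimal sketch (with hypothetical function names) of a per-class mean L2 breakdown together with the moving/stationary precision and recall at f_min = 0.5 m/s:

```python
import numpy as np

F_MIN = 0.5  # m/s; the conservative moving/stationary threshold from Section IV.

def scene_flow_metrics(pred_flow, gt_flow, class_ids, valid):
    """Per-class mean L2 error plus moving-point precision/recall.

    pred_flow, gt_flow: (N, 3) arrays in m/s; class_ids: (N,) integer object labels;
    valid: (N,) boolean mask excluding points with invalid annotations.
    """
    err = np.linalg.norm(pred_flow - gt_flow, axis=1)

    # Breakdown by object type: one mean L2 error per class label.
    per_class = {int(c): float(err[valid & (class_ids == c)].mean())
                 for c in np.unique(class_ids[valid])}

    # Binary classification formulation: a point is "moving" if its speed >= F_MIN.
    gt_moving = np.linalg.norm(gt_flow, axis=1) >= F_MIN
    pred_moving = np.linalg.norm(pred_flow, axis=1) >= F_MIN
    tp = np.sum(valid & gt_moving & pred_moving)
    precision = tp / max(np.sum(valid & pred_moving), 1)
    recall = tp / max(np.sum(valid & gt_moving), 1)
    return per_class, precision, recall
```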
V. FASTFLOW3D: A SCALABLE BASELINE MODEL

The average scene from the Waymo Open Dataset consists of 177K points (Table II), even though most models [7], [8], [9], [13], [14] were designed to train with 8,192 points (16,384 points in [14]). This design choice favors algorithms that scale poorly to O(100K) regimes. For instance, many methods require preprocessing techniques such as nearest neighbor lookup. Even with efficient implementations [46], [47], increasing fractions of inference time are dedicated to preprocessing instead of the core inference operation.

For this reason, we propose a new model that exhibits favorable scaling properties and may operate on O(100K) points in a real time system. We name this model FastFlow3D (FF3D). In particular, we exploit the fact that LiDAR point clouds are dense, relatively flat along the z dimension, but cover a large area along the x and y dimensions.

                    32K     100K    255K    1000K
HPLFlowNet [9]      431.1   1194.5  OOM     OOM
FlowNet3D [7]       205.2   520.7   1116.4  3819.0
FastFlow3D (ours)   49.3    51.9    63.1    98.1

TABLE I: Inference latency for varying point cloud sizes. All numbers report latency in ms on an NVIDIA Tesla P100 with batch size = 1. The timings for HPLFlowNet [9] differ from reported results as we include the required preprocessing on the raw point clouds. OOM indicates out of memory.

Fig. 2: Diagram of FastFlow3D model. FastFlow3D consists of 3 stages employing a PointNet encoder with dynamic voxelization [41], [42], a convolutional autoencoder [43], [44] with weights shared across the two frames, and a shared MLP to regress an embedding on to a point-wise motion prediction. For details, see Section V and Appendix in [45].

The proposed model is composed of three parts: a scene encoder, a decoder fusing contextual information from both frames, and a subsequent decoder to obtain point-wise flow (Figure 2).

FastFlow3D operates on two successive point clouds where the first cloud has been transformed into the coordinate frame of the second. The target annotations are correspondingly provided in the coordinate frame of the second frame. The result of these transformations is to remove apparent motion due to the movement of the AV (Section III-B). We train the resulting model with the average L2 loss between the final prediction for each LiDAR return and the corresponding ground truth flow annotation [8], [7], [9].

The encoder computes embeddings at different spatial resolutions for both point clouds. The encoder is a variant of PointPillars [44] and offers a great trade-off in terms of latency and accuracy by aggregating points within fixed vertical columns (i.e. "pillars") followed by a 2D convolutional network to decrease the spatial resolution. Each pillar center is parameterized through its center coordinate (c_x, c_y, c_z). We compute the offset from the pillar center to the points in the pillar (Δ_x, Δ_y, Δ_z), and append the pillar center and laser features (l_0, l_1), resulting in an 8D encoding (c_x, c_y, c_z, Δ_x, Δ_y, Δ_z, l_0, l_1). Additionally, we employ dynamic voxelization [42], computing a linear transformation and aggregating all points within a pillar instead of sub-sampling points. Furthermore, we find that summing the featurized points in the pillar outperforms the max-pooling operation used in previous works [44], [42].

One can draw an analogy of our pillar-based point featurization to more computationally expensive sampling techniques used by previous works [7], [8]. Instead of choosing representative sampled points based on expensive farthest point sampling and computing features relative to these points, we use a fixed grid to sample the points and compute features relative to each pillar in the grid. The pillar-based representation allows our net to cover a larger area with an increased density of points.
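The sketch below illustrates the 8D point encoding and the scatter-sum aggregation of dynamic voxelization in plain NumPy. The grid extent and resolution follow the 170 m x 170 m, 512 x 512 configuration reported in the appendix; the learned linear layer that would normally featurize each point before the sum is omitted, and the pillar z-center is treated as zero for simplicity. Names are illustrative rather than taken from the released code.

```python
import numpy as np

def pillarize(points, laser_features, grid_min=-85.0, grid_max=85.0, num_pillars=512):
    """Pillar featurization with dynamic voxelization (no point sub-sampling).

    points: (N, 3) xyz in the AV frame; laser_features: (N, 2), e.g. two per-return
    LiDAR features (l_0, l_1).
    """
    pillar_size = (grid_max - grid_min) / num_pillars
    ix = np.clip(((points[:, 0] - grid_min) // pillar_size).astype(int), 0, num_pillars - 1)
    iy = np.clip(((points[:, 1] - grid_min) // pillar_size).astype(int), 0, num_pillars - 1)

    # Pillar centers (c_x, c_y, c_z) and per-point offsets (dx, dy, dz).
    centers = np.stack([grid_min + (ix + 0.5) * pillar_size,
                        grid_min + (iy + 0.5) * pillar_size,
                        np.zeros(points.shape[0])], axis=1)
    offsets = points - centers

    # 8D encoding per point: (c_x, c_y, c_z, dx, dy, dz, l_0, l_1).
    point_features = np.concatenate([centers, offsets, laser_features], axis=1)

    # Dynamic voxelization: keep every point and sum all (featurized) points that fall
    # into the same pillar; the actual model applies a learned transform before this sum.
    grid = np.zeros((num_pillars, num_pillars, point_features.shape[1]))
    np.add.at(grid, (ix, iy), point_features)
    return point_features, grid, (ix, iy)
```

The sum (rather than max) pooling over each pillar mirrors the design choice described above.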
The decoder is a 2D convolutional U-Net [43]. First, we concatenate the embeddings of both encoders at each spatial resolution. Subsequently, we use a 2D convolution to obtain contextual information at the different resolutions. These context embeddings are used as the skip connections for the U-Net, which progressively merges context from consecutive resolutions. To decrease latency, we introduce bottleneck convolutions and replace deconvolution operations (i.e. transposed convolutions) with bilinear upsampling [48]. The resulting feature map of the U-Net decoder represents a grid-structured flow embedding.

To obtain point-wise flow, we introduce the unpillar operation, which for each point retrieves the corresponding flow embedding grid cell, concatenates the point feature, and uses a multilayer perceptron to compute the flow vector.
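A minimal sketch of the unpillar step is shown below; the `mlp` argument stands in for the shared multilayer perceptron and the other names are illustrative:

```python
import numpy as np

def unpillar(flow_embedding_grid, point_features, pillar_indices, mlp):
    """Gather each point's grid-cell embedding, concatenate its point feature,
    and regress a 3D flow vector with a shared MLP.

    flow_embedding_grid: (H, W, D) grid-structured output of the U-Net decoder.
    point_features:      (N, F) per-point features from the encoder.
    pillar_indices:      tuple (ix, iy) of (N,) arrays mapping each point to its pillar.
    mlp:                 callable mapping (N, D + F) -> (N, 3).
    """
    ix, iy = pillar_indices
    gathered = flow_embedding_grid[ix, iy]                         # (N, D) cell embeddings
    per_point = np.concatenate([gathered, point_features], axis=1) # (N, D + F)
    return mlp(per_point)                                          # (N, 3) flow in m/s
```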
As proof of concept, we showcase how the resulting architecture achieves favorable scaling behavior up to and beyond the number of laser returns in the Waymo Open Dataset (Table I). Note that we measure performance up to 1M points in order to accommodate multi-frame perception models which operate on point clouds from multiple time frames concatenated together [49]1. As mentioned earlier, previously proposed baseline models rely on nearest neighbor search for pre-processing, and even with an efficient implementation [47], [46] result in poor scaling behavior (see Section VI-B for details), making it prohibitively expensive to train and run these models on large, realistic datasets like the Waymo Open Dataset2. In contrast, our baseline model exhibits nearly linear growth with a small constant. Furthermore, the typical period of a LiDAR scan is 10 Hz (i.e. 100 ms) and the latency of operating on 1M points is such that predictions may finish within the period of the scan as is required for real-time operation.

1 Many unpublished efforts employ multiple frames as detailed at https://waymo.com/open/challenges
2 In Section VI-B we demonstrate that downsampling the point cloud severely degrades predictive performance, further motivating architectures that can natively operate on the entire point cloud in real time.

VI. RESULTS

We first present results describing the generated scene flow dataset and discuss how it compares to established baselines for scene flow in the literature (Section VI-A). In the process, we discuss dataset statistics and how this affects our selection of evaluation metrics. Next, in Section VI-B we present the FastFlow3D baseline architecture trained on the resulting dataset. We showcase with this model the necessity of training with the full density of point cloud returns as well as the complete dataset. These results highlight deficiencies in previous approaches which employed too few data or employed sub-sampled points for real-time inference. Finally, in Section VI-C we discuss an extension to this work in which we examine the generalization power of the model and highlight an open challenge in the application of self-supervised and semi-supervised learning techniques.

A. A large-scale dataset for scene flow

The Waymo Open Dataset provides an accurate source of tracked 3D objects and an opportunity for deriving a large-scale scene flow dataset across a diverse and rich domain [15]. As previously discussed, scene flow ground truth does not exist in real-world point cloud datasets based on standard time-of-flight LiDAR because no correspondences exist between points from subsequent frames.

To generate a reasonable set of scene flow labels, we leveraged the human annotated tracked 3D objects from the Waymo Open Dataset [15]. Following the methodology in Section III-B, we derived a supervised label (v_x, v_y, v_z) for each point in the scene across time. Figure 1 (right) highlights some qualitative examples of the resulting annotation of scene flow using this methodology. In the selected frames, we highlight the diversity of the scene and difficulty of the resulting bootstrapped annotations. Namely, we observe the challenges of working with real LiDAR data including the noise inherent in the sensor reading, the prevalence of occlusions and variation in object speed. All of these qualities result in a challenging predictive task.

The dataset comprises 800 and 200 scenes, termed run segments, for training and validation, respectively. Each run segment is 20 seconds recorded at 10 Hz [15]. Hence, the training and validation splits contain 158,081 and 39,987 frames.3 The total dataset comprises 24.3B and 6.1B LiDAR returns in each split, respectively. Table II indicates that the resulting dataset is orders of magnitude larger than the standard KITTI scene flow dataset [11], [10] and even surpasses the large-scale synthetic dataset FlyingThings3D [12] often used for pretraining.

                      KITTI       FlyingThings3D   Ours
Data                  LiDAR       Synth.           LiDAR
Label                 Semi-Sup.   Truth            Super.
Scenes                22          --               1150
#LiDAR Frames         200‡        28K              198K
Avg Points/Frame      208K        220K†            177K

TABLE II: Comparison of popular datasets for scene flow estimation. [11] is computed through a semi-supervised procedure [10]. [12] is computed from a depth map based on a geometric procedure [7]. #LiDAR frames counts annotated LiDAR frames for training and validation.

‡ indicates that only 400 frames of the KITTI dataset were annotated for scene flow (200 available for training). † indicates the average number of points with distance from the camera ≤ 35.

Figure 3 provides a summary of the scene flow constructed from the Waymo Open Dataset. Across 7,029,178 objects labeled across all frames4, we find that ~64.8% of the points within pedestrians, cyclists and vehicles are stationary. This summary statistic belies a large amount of systematic variability across object class. For instance, the majority of points within vehicles (68.0%) are parked and stationary, whereas the majority of points within pedestrians (73.7%) and cyclists (84.7%) are actively moving. The motion signature of each class of labeled object becomes even more distinct when examining the distribution of moving objects (Figure 3, bottom). Note that the average speeds of moving points corresponding to pedestrians (1.3 m/s or 2.9 mph), cyclists (3.8 m/s or 8.5 mph) and vehicles (5.6 m/s or 12.5 mph) vary significantly. This variability of motion across object types emphasizes our selection of evaluation metrics that consider the prediction of each class separately.

              moving            stationary
vehicles      32.0% (843.5M)    68.0% (1,790.0M)
pedestrians   73.7% (146.9M)    26.3% (52.4M)
cyclists      84.7% (7.0M)      15.2% (1.6M)

Fig. 3: Distribution of moving and stationary LiDAR points. Statistics computed from the training set split. Top: Distribution of moving and stationary points across all frames (raw counts in parentheses). We consider points with a flow magnitude below 0.1 m/s to be stationary. Bottom: Distribution of speeds for moving points (per-class histograms of speed in m/s and mph for cyclists, pedestrians and vehicles).

Fig. 4: Accuracy of scene flow estimation is bounded by the amount of data. Each point corresponds to the cross validated accuracy of a model trained on increasing amounts of data (see text). Y-axis reports the fraction of LiDAR returns contained within moving objects whose motion vector is estimated within 0.1 m/s (top) or 1.0 m/s (bottom) L2 error. Higher numbers are better. The star indicates a model trained on the number of run segments in [11], [10].

3 Please see the Appendix in [45] for more details on downloading and accessing this new dataset.
4 A single instance of an object may be tracked across N frames. We count a single instance as N labeled objects.
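For reference, the class-conditional statistics summarized in Figure 3 can be reproduced directly from the bootstrapped annotations with a few lines of code; the sketch below uses the 0.1 m/s stationary threshold from the Figure 3 caption and hypothetical variable names:

```python
import numpy as np

STATIONARY_THRESHOLD = 0.1  # m/s, as in the Figure 3 caption.

def moving_statistics(gt_flow, class_ids, class_of_interest):
    """Fraction of moving points and mean moving speed for one object class."""
    speed = np.linalg.norm(gt_flow, axis=1)
    in_class = class_ids == class_of_interest
    moving = in_class & (speed >= STATIONARY_THRESHOLD)
    frac_moving = moving.sum() / max(in_class.sum(), 1)
    mean_moving_speed = float(speed[moving].mean()) if moving.any() else 0.0
    return frac_moving, mean_moving_speed
```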
B. A scalable model baseline for scene flow

We train the FastFlow3D architecture on the scene flow data. Briefly, the architecture consists of 3 stages employing established techniques: (1) a PointNet encoder with dynamic voxelization [42], [41], (2) a convolutional autoencoder with skip connections [43] in which the first half of the architecture [44] consists of shared weights across two frames, and (3) a shared MLP to regress an embedding on to a point-wise motion prediction.

The resulting model contains 5.23M parameters, a vast majority of which reside in the convolution architecture (4.21M). A small number of parameters (544) are dedicated to featurizing each point cloud point [41] as well as performing the final regression on to the motion flow (4,483). These latter sets of parameters are purposefully small in order to effectively constrain computational cost because they are applied across all N points in a point cloud.

We evaluate the resulting model on the cross-validated split using the aforementioned metrics across an array of experimental studies to justify the motivation for this dataset as well as demonstrate the difficulty of the prediction task.

We first approach the question of what the appropriate dataset size is given the prediction task. Figure 4 provides an ablation study in which we systematically subsample the number of run segments employed for training5. We observe that predictive performance improves significantly as the model is trained on increasing numbers of run segments. We find that cyclists trace out a curve quite distinct from pedestrians and vehicles, possibly indicative of the small number of cyclists in a scene (Figure 3). Secondly, we observe that the cross validated accuracy is far from saturating behavior when approximating the amount of data available in the KITTI scene flow dataset [11], [10] (Figure 4, stars). We observe that even with the complete dataset, our metrics do not appear to exhibit asymptotic behavior, indicating that models trained on the Waymo Open Dataset may still be data bound. This result parallels detection performance reported in the original results (Table 10 in [15]).

We next investigate how scene flow prediction is affected by the density of the point cloud scene. This question is important because many baseline models purposefully operate on a smaller number of points (Table I) and by necessity must heavily sub-sample the number of points in order to perform inference in real time. In stationary objects, we observe minimal detriment in performance (data not shown). This result is not surprising given that the vast majority of LiDAR returns arise from stationary, background objects (e.g. buildings, roads). However, we do observe that training on sparse versions of the original point cloud severely degrades the predictive performance of moving objects (Figure 5). Notably, moving pedestrian and vehicle performance appear to be saturating, indicating that if additional LiDAR returns were available, they would have minimal additional benefit in terms of predictive performance.

Fig. 5: Accuracy of scene flow estimation requires the full density of the point cloud scene. Each point corresponds to the cross validated accuracy of a model trained on an increasing density of point cloud points. Y-axis reports the fraction of LiDAR returns contained within moving vehicles, pedestrians and cyclists whose motion vector is correctly estimated within 0.1 m/s (top) and 1.0 m/s (bottom) L2 error.

In addition to decreasing point density, previous works also filter out the numerous returns from the ground in order to limit the number of points to predict [8], [7], [9]. Such a technique has a side benefit of bridging the domain gap between FlyingThings3D and KITTI Scene Flow, which differ in the inclusion of such points. We performed an ablation experiment to parallel this heuristic by training and evaluating with our annotations but removing points with a crude threshold of 0.2 m above ground. When removing ground points, we found that the mean L2 error increased by 159% and 31% for points in moving and stationary objects, respectively. We take these results to indicate that the inclusion of ground points provides a useful signal for predicting scene flow. Taken together, these results provide post-hoc justification for building an architecture which may be tractably trained on all point cloud returns instead of one that only trains on a sample of the returns.

Finally, we report our results on the complete dataset and identify systematic differences across object class and whether or not an object is moving (Table III). Producing baseline comparisons for previous nearest neighbor based models is prohibitively expensive due to their poor scaling behavior.6 We hope to motivate a new class of real time scene flow models that are capable of training on our dataset.

Table III indicates that moving vehicle points have a mean L2 error of 0.54 m/s, corresponding to 10% of the average speed of moving vehicles (5.6 m/s). Likewise, the mean L2 errors of moving pedestrian and cyclist points are 0.32 m/s and 0.57 m/s, corresponding to 25% and 15% of the mean speed of each object class, respectively. Hence, the ability to predict vehicle speed is better than for pedestrians and cyclists. We suspect that these imbalances are largely due to imbalances in the number of training examples for each label and the average speed of these objects. For instance, the vast majority of points are marked as background and hence have a target of zero motion. Because the background points are dominant, we likewise observe the error to be smallest.

                   vehicle                       pedestrian                    cyclist                       background
error metric       all     moving  stationary    all     moving  stationary   all     moving  stationary
mean (m/s)         0.18    0.54    0.05          0.25    0.32    0.10         0.51    0.57    0.10          0.07
mean (mph)         0.40    1.21    0.11          0.55    0.72    0.22         1.14    1.28    0.22          0.16
≤ 0.1 m/s          70.0%   11.6%   90.2%         33.0%   14.0%   71.4%        13.4%   4.8%    78.0%         95.7%
≤ 1.0 m/s          97.7%   92.8%   99.4%         96.7%   95.4%   99.4%        89.5%   88.2%   99.6%         96.7%

TABLE III: Performance of baseline on scene flow in large-scale dataset. Mean pointwise L2 error (top) and percentage of points with error below 0.1 m/s and 1.0 m/s (bottom). Most errors are ≤ 1.0 m/s. Additionally, we investigate the error for stationary and moving points, where a point is coarsely considered moving if the flow vector magnitude is ≥ 0.5 m/s.

The mean L2 error is averaged over many points, making it unclear if this statistic may be dominated by outlier events. To address this issue, we show the percentage of points in the Waymo Open Dataset evaluation set with L2 errors below 0.1 m/s and 1.0 m/s. We observe that the vast majority of the errors are below 1.0 m/s (2.2 mph) in magnitude, indicating a rather regular distribution to the residuals. For example, the residuals of 92.8% and 99.8% of moving and stationary vehicle points have an error below 1.0 m/s. In the next section, we also investigate how the prediction accuracy for classes like pedestrians and cyclists can be cast as a discrete task distinguishing moving and stationary points.

5 We subsample the number of run segments and not frames because subsequent frames within a single run segment are heavily correlated.
6 We did try experiments involving cropping and downsampling the point clouds to make comparison to the baselines feasible. However, these modifications distorted the points too much to serve as practical input data.
C. Generalizing to unlabeled moving objects

Our supervised method for generating flow ground truth relies on every moving object having an accompanying tracked box. Without a tracked box, we effectively assume the points on an object are stationary. Though this assumption holds for the vast majority of points, there are still a wide range of moving objects that our algorithm assumes to be stationary. For deployment on a safety critical system, it is important to capture motion for these objects (e.g. stroller, opening car doors, etc.). Even though the labeled data does not capture such objects, we find qualitatively that a trained model does capture some motion in these objects (Figure 6).

Fig. 6: Generalizing to unlabeled moving objects. Three examples, each with the bootstrapped annotation (left) and model prediction (right) (color code from Figure 1). Left example: Despite missing flow annotation for the middle of a bus, our model can generalize well. Middle example: The model generalizes to an unlabeled object (moving shopping cart). Right example: failures of generalization as motion is incorrectly predicted for the ground and parts of the tree.

We next ask the degree to which a model trained on such data predicts the motion of unlabeled moving objects. To answer this question, we construct several experiments by artificially removing labeled objects from the scene and measuring the ability of the model (in terms of the point-wise mean L2 error) to predict motion in spite of this disadvantage. Additionally, we coarsely label points as moving if their annotated speed (flow vector magnitude) is ≥ 0.5 m/s (f_min) and query the model to quantify the precision and recall for moving classification. This latter measurement of detecting moving objects is particularly important for guiding planning in an AV [50], [51], [52].

Table IV reports results for selectively ablating the labels for pedestrians and cyclists. We ablate the labels in two ways: (1) Stationary treats points of ablated objects as background with no motion, (2) Ignored treats points of ablated objects as having no target label. We observe that fixing all points as stationary results in a model with near perfect precision. However, the recall suffers enormously, particularly for pedestrians. Our results imply that unlabeled points predicted to be moving are almost perfectly correct (i.e. minimal false positives), however the recall is quite poor as many moving points are not identified (i.e. a large number of false negatives). We find that treating the unlabeled points as ignored improves the performance slightly, indicating that even moderate information known about potential moving objects may alleviate challenges in recall. Notably, we observe a large discrepancy in recall between the ablation experiments for cyclists and pedestrians. We posit that this discrepancy is likely due to the much larger amount of pedestrian labels in the Waymo Open Dataset. Removing the entire class of pedestrian labels removes much more of the ground truth labels for moving objects.

                       L2 error (m/s)                 moving classification
      method           all     moving  stationary    prec    recall
cyc   supervised       0.51    0.57    0.10          1.00    0.95
      stationary       1.13    1.24    0.06          1.00    0.67
      ignored          0.83    0.93    0.06          1.00    0.78
ped   supervised       0.25    0.32    0.10          1.00    0.91
      stationary       0.90    1.30    0.10          0.97    0.02
      ignored          0.88    1.25    0.10          0.99    0.07

TABLE IV: Generalization of motion estimation. Approximating generalization for moving objects by artificially excluding a class from training, by either treating all its points as having zero flow (stationary) or as having no target label (ignored). We report the mean pointwise L2 error and the precision and recall for moving point classification.

Although our model has some capacity to generalize to unlabeled moving object points, this capacity is limited.
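The two ablation protocols compared in Table IV amount to a small change in how the training loss treats the ablated class. The sketch below, with hypothetical names and assuming a simple per-point L2 objective, shows the distinction: the stationary protocol rewrites the targets to zero flow, while the ignored protocol zeroes the loss weight so the points do not contribute to weight updates.

```python
import numpy as np

def ablated_l2_loss(pred_flow, gt_flow, class_ids, ablated_class, mode):
    """Per-point L2 loss under the two label-ablation protocols of Table IV."""
    target = gt_flow.copy()
    weight = np.ones(len(gt_flow))
    ablated = class_ids == ablated_class
    if mode == "stationary":
        target[ablated] = 0.0   # treat ablated points as motionless background
    elif mode == "ignored":
        weight[ablated] = 0.0   # no target label: exclude from weight updates
    err = np.linalg.norm(pred_flow - target, axis=1)
    return float((weight * err).sum() / max(weight.sum(), 1.0))
```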
Ignoring labeled points does mitigate the error rate for cyclists and pedestrians, however such an approach can result in other systematic errors. For instance, in earlier experiments, ignoring stationary labels for background points (i.e. no motion) results in a large increase in mean L2 error in background points from 0.03 m/s to 0.40 m/s. Hence, such heuristics are only partial solutions to this problem and new ideas are warranted for approaching this dataset. We suspect that many opportunities exist for applying semi-supervised learning techniques for generalizing to unlabeled objects and leave this to future work [53], [54].

VII. DISCUSSION

In this work we presented a new large-scale scene flow dataset measured from LiDAR in autonomous vehicles. Specifically, by leveraging the supervised tracking labels from the Waymo Open Dataset, we bootstrapped a motion vector annotation for each LiDAR return. The resulting dataset is ~1000x larger than previous real world scene flow datasets. We also propose a series of metrics for evaluating the resulting scene flow with breakdowns based on criteria that are relevant for deploying in the real world.

Finally, we demonstrated a scalable baseline model trained on this dataset that achieves reasonable predictive performance and may be deployed for real time operation. Interestingly, our setup opens opportunities for self-supervised and semi-supervised methods [53], [54], [26]. We hope that this dataset may provide a useful baseline for exploring such techniques and developing generic methods for scene flow estimation in AVs in the future.

ACKNOWLEDGEMENTS

We thank Vijay Vasudevan, Benjamin Caine, Jiquan Ngiam, Brandon Yang, Pei Sun, Yuning Chai, Charles Qi, Dragomir Anguelov, Congcong Li, Jiyang Gao, James Guo, and Yin Zhou for their comments and suggestions. Additionally, we thank the larger Google Brain and Waymo Perception teams for their support.

REFERENCES

[1] D. A. Forsyth et al., Computer vision: a modern approach. Prentice Hall Professional Technical Reference, 2002.
[2] S. Thrun et al., "Stanley: The robot that won the darpa grand challenge," Journal of Field Robotics, vol. 23, pp. 661-692, 2006.
[3] S. Casas et al., "Intentnet: Learning to predict intention from raw sensor data," in CoRL, 2018, pp. 947-956.
[4] Y. Chai et al., "Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction," in CoRL, 2019.
[5] W. Luo et al., "Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net," in CVPR, 2018, pp. 3569-3577.
[6] R. Mahjourian et al., "Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints," in CVPR, 2018.
[7] X. Liu et al., "Flownet3d: Learning scene flow in 3d point clouds," in CVPR, 2019, pp. 529-537.
[8] W. Wu et al., "Pointpwc-net: A coarse-to-fine network for supervised and self-supervised scene flow estimation on 3d point clouds," arXiv preprint arXiv:1911.12408, 2019.
[9] X. Gu et al., "Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds," in CVPR, 2019.
[10] M. Menze et al., "Object scene flow for autonomous vehicles," in CVPR, 2015, pp. 3061-3070.
[11] A. Geiger et al., "Are we ready for autonomous driving? the kitti vision benchmark suite," in CVPR, 2012, pp. 3354-3361.
[12] N. Mayer et al., "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in CVPR, 2016.
[13] Z. Wang et al., "Flownet3d++: Geometric losses for deep scene flow estimation," in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 91-98.
[14] X. Liu et al., "Meteornet: Deep learning on dynamic 3d point cloud sequences," in CVPR, 2019, pp. 9246-9255.
[15] P. Sun et al., "Scalability in perception for autonomous driving: Waymo open dataset," in CVPR, 2020, pp. 2446-2454.
[16] A. Saxena et al., "Learning depth from single monocular images," in NeurIPS, 2006, pp. 1161-1168.
[17] D. Scharstein et al., "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," IJCV, vol. 47, pp. 7-42, 2002.
[18] D. Pfeiffer et al., "Exploiting the power of stereo confidences," in CVPR, 2013, pp. 297-304.
[19] S. Baker et al., "A database and evaluation methodology for optical flow," IJCV, vol. 92, no. 1, pp. 1-31, 2011.
[20] D. Kondermann et al., "On performance analysis of optical flow algorithms," in Outdoor and Large-Scale Real-World Scene Analysis. Springer, 2012, pp. 329-355.
[21] S. Morales et al., "Ground truth evaluation of stereo algorithms for real world applications," in ACCV. Springer, 2010, pp. 152-162.
[22] L. Ladicky et al., "Joint optimization for object class segmentation and dense stereo reconstruction," IJCV, vol. 100, pp. 122-133, 2012.
[23] D. J. Butler et al., "A naturalistic open source movie for optical flow evaluation," in ECCV. Springer, 2012, pp. 611-625.
[24] S. Wang et al., "Deep parametric continuous convolutional neural networks," in CVPR, 2018, pp. 2589-2597.
[25] K.-H. Lee et al., "Pillarflow: End-to-end birds-eye-view flow estimation for autonomous driving," IROS, 2020.
[26] J. Pontes et al., "Scene flow from point clouds with or without learning," International Conference on 3D Vision, 2020.
[27] M.-F. Chang et al., "Argoverse: 3d tracking and forecasting with rich maps," in CVPR, 2019, pp. 8748-8757.
[28] H. Caesar et al., "nuscenes: A multimodal dataset for autonomous driving," in CVPR, 2020, pp. 11621-11631.
[29] J. Houston et al., "One thousand and one hours: Self-driving motion prediction dataset," arXiv preprint arXiv:2006.14480, 2020.
[30] A. Behl et al., "Pointflownet: Learning representations for rigid motion estimation from point clouds," in CVPR, 2019, pp. 7962-7971.
[31] H. Fan et al., "Pointrnn: Point recurrent neural network for moving point cloud processing," arXiv preprint arXiv:1910.08287, 2019.
[32] P. Wu et al., "Motionnet: Joint perception and motion prediction for autonomous driving based on bird's eye view maps," in CVPR, 2020.
[33] A. Dewan et al., "Rigid scene flow for 3d lidar scans," in IROS, 2016.
[34] A. Ushani et al., "A learning approach for real-time temporal scene flow estimation from lidar data," in ICRA, 2017, pp. 5666-5673.
[35] A. K. Ushani et al., "Feature learning for scene flow estimation from lidar," in CoRL, 2018, pp. 283-292.
[36] K. Bousmalis et al., "Using simulation and domain adaptation to improve efficiency of deep robotic grasping," in ICRA, 2018.
[37] A. Saxena et al., "Robotic grasping of novel objects using vision," The Int'l Journal of Robotics Research, vol. 27, pp. 157-173, 2008.
[38] U. Viereck et al., "Learning a visuomotor controller for real world robotic grasping using simulated depth images," arXiv preprint arXiv:1706.04652, 2017.
[39] M. Gualtieri et al., "High precision grasp pose detection in dense clutter," in IROS, 2016, pp. 598-605.
[40] F. Yu et al., "Bdd100k: A diverse driving dataset for heterogeneous multitask learning," in CVPR, 2020, pp. 2636-2645.
[41] C. R. Qi et al., "Pointnet: Deep learning on point sets for 3d classification and segmentation," in CVPR, 2017, pp. 652-660.
[42] Y. Zhou et al., "End-to-end multi-view fusion for 3d object detection in lidar point clouds," in CoRL, 2019.
[43] O. Ronneberger et al., "U-net: Convolutional networks for biomedical image segmentation," in MICCAI. Springer, 2015, pp. 234-241.
[44] A. H. Lang et al., "Pointpillars: Fast encoders for object detection from point clouds," arXiv preprint arXiv:1812.05784, 2018.
[45] P. Jund et al., "Scalable scene flow from point clouds in the real world," arXiv preprint arXiv:2103.01306, 2021.
[46] K. Zhou et al., "Real-time kd-tree construction on graphics hardware," ACM Transactions on Graphics (TOG), vol. 27, no. 5, pp. 1-11, 2008.
[47] Y. Chen et al., "Fast neighbor search by using revised kd tree," Information Sciences, vol. 472, pp. 145-162, 2019.
[48] A. Odena et al., "Deconvolution and checkerboard artifacts," Distill, 2016.
[49] Z. Ding et al., "1st place solution for waymo open dataset challenge -- 3d detection and domain adaptation," arXiv preprint arXiv:2006.15505, 2020.
[50] M. McNaughton et al., "Motion planning for autonomous driving with a conformal spatiotemporal lattice," in ICRA, 2011, pp. 4889-4895.
[51] K. Chu et al., "Local path planning for off-road autonomous driving with avoidance of static obstacles," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1599-1616, 2012.
[52] D. Dolgov et al., "Practical search techniques in path planning for autonomous driving," in AAAI, vol. 1001, 2008, pp. 18-80.
[53] G. Papandreou et al., "Weakly- and semi-supervised learning of a deep convolutional net for semantic image segmentation," in ICCV, 2015.
[54] L.-C. Chen et al., "Leveraging semi-supervised learning in video sequences for urban scene segmentation," in ECCV, 2020.
[55] A. Filatov, A. Rykov, and V. Murashkin, "Any motion detector: Learning class-agnostic scene dynamics from a sequence of lidar point clouds," arXiv preprint arXiv:2004.11647, 2020.
[56] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[57] J. Shen, P. Nguyen, Y. Wu, Z. Chen, M. X. Chen, Y. Jia, A. Kannan, T. Sainath, Y. Cao, C.-C. Chiu, et al., "Lingvo: a modular and scalable framework for sequence-to-sequence modeling," arXiv preprint arXiv:1902.08295, 2019.
[58] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, et al., "Starnet: Targeted computation for object detection in point clouds," arXiv preprint arXiv:1908.11069, 2019.
[59] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, 2010, pp. 249-256.
APPENDIX

BOOTSTRAPPING GROUND TRUTH ANNOTATIONS

In this section, we discuss in detail several practical considerations for the method for computing scene flow annotations (Section III-B). One challenge in this context is the lack of correspondence between the observed points in P_{-1} and P_0. In our work, we choose to make flow predictions (and thus compute annotations) for the points at the current time step, P_0. As opposed to doing so for P_{-1}, we believe that explicitly assigning flow predictions to the points in the most recent frame is advantageous to an AV that needs to reason about and react to the environment in real time. Additionally, the motion between P_{-1} and P_0 is a reasonable approximation for the flow at t_0 when considering a high LiDAR acquisition frame rate and assuming a constant velocity between consecutive frames.

Calculating the transformation for the motion of an object. The goal of this section is to describe the calculation of T_Δ used to transform a point p_0 to its corresponding position at t_{-1} based on the motion of the labeled object to which it belongs. We leverage the 3D label boxes to circumvent the point-wise correspondence problem between P_0 and P_{-1} and estimate the position of the points belonging to an object in P_0 as they would have been observed at t_{-1}. We compute the flow vector for each point at t_0 using its displacement over the duration of Δt = t_0 − t_{-1}. Let point clouds P_{-1} and P_0 be represented in the reference frame of the AV at their corresponding time steps. We identify the set of objects O_0 at t_0 based on the annotated 3D boxes of the corresponding scene. We express the pose of an object o in the AV frame as a homogeneous transformation matrix T consisting of 3D translational and rotational components derived from the pose of tracked objects. For each object o ∈ O_0, we first use its pose T_{-1} relative to the AV at t_{-1} and compensate for ego motion to compute its pose T*_{-1} at t_{-1} but with respect to the AV frame at t_0. This is straightforward given knowledge of the poses of the AV at the corresponding time steps in the dataset, e.g. from a localization system. Accordingly, we compute the rigid body transform T_Δ used to transform points belonging to object o at time t_0 to their corresponding position at t_{-1}, i.e. T_Δ := T*_{-1} · T_0^{-1}.
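The following is a minimal sketch of this computation, assuming all poses are 4x4 homogeneous transforms and that the AV poses are expressed in a common global frame (the exact pose conventions of the dataset may differ); names are illustrative:

```python
import numpy as np

def compute_t_delta(box_pose_t0, box_pose_tm1, av_pose_t0, av_pose_tm1):
    """Compute T_delta for one labeled object, compensating for ego motion.

    box_pose_t0, box_pose_tm1: 4x4 poses T_0 and T_{-1} of the label box, each
        expressed in the AV frame at its own time step.
    av_pose_t0, av_pose_tm1:   4x4 poses of the AV in a common (e.g. global) frame
        at t_0 and t_{-1}, e.g. from the localization system.
    """
    # Ego-motion compensation: re-express the box pose at t_{-1} in the AV frame at t_0,
    #   T*_{-1} = (AV_{t0})^{-1} @ AV_{t-1} @ T_{-1}.
    t_star_tm1 = np.linalg.inv(av_pose_t0) @ av_pose_tm1 @ box_pose_tm1

    # Rigid transform taking object points from their t_0 position to their t_{-1} position:
    #   T_delta := T*_{-1} @ T_0^{-1}.
    return t_star_tm1 @ np.linalg.inv(box_pose_t0)
```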
Rigid body assumption. Our approach for scene flow annotation assumes the 3D label boxes correspond to rigid bodies, allowing us to compute the point-wise correspondences between two frames. Although this is a common assumption in the literature (especially for labeled vehicles [10]), this does not necessarily apply to non-rigid objects such as pedestrians. However, we found this to be a reasonable approximation in our work on the Waymo Open Dataset for two reasons. First, we derive our annotations from frames measured at high frequency (i.e. 10 Hz) such that object deformations are minimal between adjacent frames. Second, the number of observed points on objects like pedestrians is typically small, making any deviations from a rigid assumption of statistically minimal consequence.

Objects with no matching previous frame labels. In some cases, an object o ∈ O_0 with a label box at t_0 will not have a corresponding label at t_{-1}, e.g. the object is first observable at t_0. Without information about the motion of the object between t_{-1} and t_0, we choose to annotate its points as having invalid flow. While we can still use them to encode the scene and extract features during model training, this annotation allows us to exclude them from model weight updates and scene flow evaluation metrics.

Background points. Since typically most of the world is stationary (e.g. buildings, ground, vegetation), it is important to reflect this in the dataset. Having compensated for ego motion, we assign zero motion for all unlabeled points in the scene, and additionally annotate them as belonging to the "background" class. Although this holds for the vast majority of unlabeled points, there will always exist rare moving objects in the scene that were not manually annotated with label boxes (e.g. animals). In the absence of label boxes, points of such objects will receive a stationary annotation by default. Nonetheless, we recognize the importance of enabling a model to predict motion on unlabeled objects, as it is crucial for an AV to safely react to rare, moving objects. In Section VI-C, we highlight this challenge and discuss opportunities for employing this dataset as a benchmark for semi-supervised and self-supervised learning.

Coordinate frame of reference. As opposed to most other works [11], [10], we account for ego motion in our scene flow annotations. Not only does this better reflect the fact that most of the world is stationary, but it also improves the interpretability of flow annotations, predictions, and evaluation metrics. In addition to compensating for ego motion when computing flow annotations at t_0, we also transform P_{-1}, the scene at t_{-1}, to the reference frame of the AV at t_0 when learning and inferring scene flow. We argue that this is more realistic for AV applications in which ego motion is available from IMU/GPS sensors [2]. Furthermore, having a consistent coordinate frame for both input frames lessens the burden on a model to correspond moving objects between frames [55], as explored in the Appendix.

MODEL ARCHITECTURE AND TRAINING DETAILS

Figure 2 provides an overview of FastFlow3D. We discuss each section of the architecture in turn and provide additional parameters in Table VI.

The model architecture contains in total 5,233,571 parameters. The vast majority of the parameters (4,212,736) reside in the standard convolution architecture [44]. An additional large set of parameters (1,015,808) reside in later layers that perform upsampling with a skip connection [48]. Finally, a small number of parameters (544) are dedicated to featurizing each point cloud point [41] as well as performing the final regression on to the motion flow (4,483). Note that both of these latter sets of parameters are purposefully small because they are applied to all N points in the LiDAR point cloud.

FastFlow3D uses a top-down U-Net to process the pillarized features. Consequently, the model can only predict flow for points inside the pillar grid. Points outside the x-y dimensions of the grid or outside the z dimension bounds for the pillars are marked as invalid and receive no predictions. To extend the scope of the grid, one can either make the pillar size larger or increase the size of the pillar grid. In our work, we use a 170x170 m grid (centered at the AV) represented by 512x512 pillars (~0.33x0.33 m pillars). For the z dimension, we consider the valid pillar range to be from -3 m to +3 m.

The model was trained for 19 epochs on the Waymo Open Dataset training set using the Adam optimizer [56]. The model was written in Lingvo [57] and forked from the open-source repository version of PointPillars 3D object detection [44], [58]7. The training set contains a label imbalance, vastly over-representing background stationary points. In early experiments, we explored a hyper-parameter to artificially downweight background points and found that weighing down the L2 loss by a factor of 0.1 provided good performance.

7 https://github.com/tensorflow/lingvo/
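For convenience, the grid and training settings reported in this section (and in Table VI) can be collected in a single configuration object. The sketch below is a hypothetical container, not the structure used in the released code, with the learning-rate and batch-size values taken from Table VI:

```python
from dataclasses import dataclass

@dataclass
class FastFlow3DConfig:
    """Illustrative summary of the reported grid and training hyper-parameters."""
    grid_extent_m: float = 170.0          # 170 m x 170 m grid centered at the AV
    num_pillars: int = 512                # 512 x 512 pillars (~0.33 m x 0.33 m each)
    pillar_z_range_m: tuple = (-3.0, 3.0) # valid z range for pillars
    epochs: int = 19
    optimizer: str = "adam"               # lr = 1e-6, beta1 = 0.9, beta2 = 0.999 (Table VI)
    batch_size: int = 64
    background_loss_weight: float = 0.1   # downweight dominant stationary background points
```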
COMPENSATING FOR EGO MOTION

In Section III-B, we argue that compensating for ego motion in the scene flow annotations improves the interpretability of flow predictions and highlights important patterns and biases in the dataset, e.g. slow vs fast objects. When training our proposed FastFlow3D model, we also compensate for ego motion by transforming both LiDAR frames to the reference frame of the AV at t_0, the time step at which we predict flow. This is convenient in practice given that ego motion information is easily available from the localization module of an AV. We hypothesize that this lessens the burden on the model, because the model does not have to implicitly learn to compensate for the motion of the AV. We validate this hypothesis in a preliminary experiment where we compare the performance of the model reported in Section VI to a model trained on the same dataset but without compensating for ego motion in the input point clouds. Consequently, this model has to implicitly learn how to compensate for ego motion. Table V shows the mean L2 error for two such models. We observe that the mean L2 error increases substantially when ego motion is not compensated for, across all object types and across moving and stationary objects. This is also consistent with previous works [55]. We also ran a similar experiment where the model consumes non ego motion compensated point clouds, but instead subtracts ego motion from the predicted flow during training and evaluation. We found slightly better performance for moving objects for this setup, but the performance is still far short of the performance achieved when compensating for ego motion directly in the input. Further research is needed to effectively learn a model that can implicitly account for ego motion.

MEASUREMENTS OF LATENCY

In this section we provide additional details for how the latency numbers for Table I were calculated. All calculations were performed on a standard NVIDIA Tesla P100 GPU with a batch size of 1. The latency is averaged over 90 forward passes, excluding 10 warm up runs. Latency for the baseline models, HPLFlowNet [9] and FlowNet3D [7], [13], included any preprocessing necessary to perform inference. For HPLFlowNet and FlowNet3D, we used the implementations provided by the authors and did not alter hyperparameters. Note that this is in favor of these models, as they were tuned for point clouds covering a much smaller area compared to the Waymo Open Dataset.

DATASET FORMAT FOR ANNOTATIONS

In order to access the data, please go to http://www.waymo.com/open and click on Access Waymo Open Dataset, which requires a user to sign in with Google and accept the Waymo Open Dataset license terms. After logging in, please visit https://pantheon.corp.google.com/storage/browser/waymo_open_dataset_scene_flow to download the labels.

We extend the Waymo Open Dataset to include the scene flow labels for the training and validation dataset splits. For each LiDAR, we add a new range image through the field range_image_flow_compressed in the message dataset.proto:RangeImage. The range image is a 3D tensor of shape [H, W, 4] where H and W are the height and width of the LiDAR scan. For the LiDAR return at point (i, j), we provide annotations in the range image where [i, j, 0:3] corresponds to the estimated velocity components for the return along the x, y and z axes, respectively. Finally, the value stored in the range image at [i, j, 3] contains an integer class label.

ego motion        vehicle                       pedestrian                    cyclist                       background
compensated       all     moving  stationary    all     moving  stationary    all     moving  stationary
yes               0.18    0.54    0.05          0.25    0.32    0.10          0.51    0.57    0.10          0.07
no                0.36    1.16    0.08          0.49    0.63    0.17          1.21    1.34    0.14          0.07

TABLE V: Accounting for ego motion in the input significantly improves performance. All reported values correspond to the mean L2 error in m/s. The first column refers to whether or not we transform the point cloud at t_{-1} to the reference frame of the AV at t_0. For both models, we evaluate the error in predicting scene flow annotations as described in Section III-B, i.e.
with ego motion compensated for.Meta-Arch Name Input(s) Operation Kernel Stride BN? Output Size Depth # Param Architecture A N pts MLP – – Yes N pts 64 544 B A Snap-To-Grid – – – 512×512 64 0 C B Convolution 3×3 2 Yes 256×256 64 37120 D C Convolution 3×3 1 Yes 256×256 64 37120 E D Convolution 3×3 1 Yes 256×256 64 37120 F E Convolution 3×3 1 Yes 256×256 64 37120 G F Convolution 3×3 2 Yes 128×128 128 74240 H G Convolution 3×3 1 Yes 128×128 128 147968 I H Convolution 3×3 1 Yes 128×128 128 147968 J I Convolution 3×3 1 Yes 128×128 128 147968 K J Convolution 3×3 1 Yes 128×128 128 147968 L K Convolution 3×3 1 Yes 128×128 128 147968 M L Convolution 3×3 2 Yes 64×64 256 295936 N M Convolution 3×3 1 Yes 64×64 256 590848 O N Convolution 3×3 1 Yes 64×64 256 590848 P O Convolution 3×3 1 Yes 64×64 256 590848 Q P Convolution 3×3 1 Yes 64×64 256 590848 R Q Convolution 3×3 1 Yes 64×64 256 590848 S R∗, L∗, 128, 128 Upsample-Skip – – No 128×128 128 540672 T S, F∗, 128, 64 Upsample-Skip – – No 256×256 128 311296 U T, B∗, 64, 64 Upsample-Skip – – No 512×512 64 126976 V U Convolution 3×3 1 No 512×512 64 36864 W V Ungrid – – – N pts 64 0 X W, B Concatenate – – – N pts 128 0 Y X Linear – – No N pts 32 4384 Z Y Linear – – No N pts 3 99 Upsample-Skip (α,β,d,d ) b U1 α Convolution 1×1 1 No h ×w d α α b U2 U1 Bilinear Interp. – 1 No h ×w d 2 β β b U3 β Convolution 1×1 1 No h ×w d β β b U4 U2, U3 Concatenate – – No h ×w 2d β β b U5 U4 Convolution 3×3 1 No h ×w d β β U6 U5 Convolution 3×3 1 No h ×w d β β Padding mode SAME Optimizer Adam [56] (lr=1e−6, β =0.9, β =0.999) 1 2 Batch size 64 Weight initialization Xavier-Glorot [59] Weight decay None TABLE VI: FastFlow3Darchitecture and training details. The network receives as input N 3D points with additional point cloud features and outputs N 3D points corresponding to the predicted 3D motion vector of each point. This output corresponds to the output of layer Z. All layers employ a ReLU nonlinearity except for layers S −Z which employ no nonlinearity. The shape of tensor x is denoted h ×w . The Upsample-Skip layer receives two tensors α and β and a scalar x x depthdasinputandoutputstensorU6.Inputswith∗ denotetheconcatenationoftherespectivelayersoftheweight-sharing encoders.