Occupancy Flow Fields for Motion Forecasting in Autonomous Driving

Reza Mahjourian∗1, Jinkyu Kim∗2, Yuning Chai1, Mingxing Tan3, Ben Sapp1, Dragomir Anguelov1

1 Waymo. rezama@waymo.com
2 Korea University. Work done while at Waymo.
3 Google Brain.
∗ Equal contribution.

arXiv:2203.03875v1 [cs.RO] 8 Mar 2022

Abstract: We propose Occupancy Flow Fields, a new representation for motion forecasting of multiple agents, an important task in autonomous driving. Our representation is a spatio-temporal grid with each grid cell containing both the probability of the cell being occupied by any agent, and a two-dimensional flow vector representing the direction and magnitude of the motion in that cell. Our method successfully mitigates shortcomings of the two most commonly-used representations for motion forecasting: trajectory sets and occupancy grids. Although occupancy grids efficiently represent the probabilistic location of many agents jointly, they do not capture agent motion and lose the agent identities. To this end, we propose a deep learning architecture that generates Occupancy Flow Fields with the help of a new flow trace loss that establishes consistency between the occupancy and flow predictions. We demonstrate the effectiveness of our approach using three metrics on occupancy prediction, motion estimation, and agent ID recovery. In addition, we introduce the problem of predicting speculative agents, which are currently-occluded agents that may appear in the future through dis-occlusion or by entering the field of view. We report experimental results on a large in-house autonomous driving dataset and the public INTERACTION dataset, and show that our model outperforms state-of-the-art models.

I. INTRODUCTION

In this work, we tackle the problem of predicting the future location and motion of other vehicles and pedestrians from the view of an autonomous vehicle (AV). To represent future locations, we adopt the notion of occupancy grids [2] from the robotics literature. An occupancy grid $O_t$, at a particular timestep t, can be represented by a single-channel gray-scale image with dimensions h×w where each pixel corresponds to a particular grid cell in the map, and the pixel value represents the probability that any part of any agent occupies that grid cell. To represent future motion, we draw inspiration from optical flow. A flow field $F_t$ at time t can be represented by a two-channel image with dimensions h×w where each pixel holds a two-dimensional motion vector (∆x, ∆y) for the corresponding grid cell. Fig. 1 illustrates sample occupancy and flow predictions vs. ground truth on a sample scene.

[Figure 1 panels: Predicted Occupancy, Predicted Flow, Predicted Occupancy + Flow, Ground Truth]
Fig. 1: (Best viewed in color) Our model predicts occupancy (top-left) and flow (top-right) at discrete timesteps in the future. Combining occupancy and flow produces a rich occupancy representation that captures the direction and magnitude of motion as well (bottom-left). Notice the agent near the bottom and the two prominent predictions corresponding to staying in the roundabout and exiting. The ground truth (bottom-right) contains the color wheel that maps motion direction and magnitude to different colors. The gray traces in the bottom images show the past state of the agents. All views are at t+1.5s. The sample scene is from the Interaction dataset [1].

Motion forecasting is an essential component of planning in a multi-agent environment, and of particular interest for autonomous driving. Modeling the uncertain future as a distribution over a compact set of trajectories per agent is a very popular choice [1], [3], [4], [5], [6], [7]. On the other hand, occupancy grid-based methods [8], [9], [10], [11] provide some significant advantages: the non-parametric output captures a richer class of future distributions, incorporates shape and identity uncertainty, and models the joint probability of the existence of any agent in a spatio-temporal cell (rather than independent marginal probabilities per agent). While these observations make occupancy grids an attractive choice, occupancy grid methods have the disadvantage that agent identity is lost, and there is no obvious way to extract motion from the grids (which are a snapshot of a time interval), making it unclear how to interpolate at finer time granularity, and impossible to know the velocity of the agents.

This motivates Occupancy Flow Fields, which extend standard occupancy grids with flow fields. By augmenting the output with flow estimates, we are able to trace occupancy from far-future grid locations back to current time locations by following the sequence of predicted flow vectors. This gives us a way to recover the most-likely agent identity for any future grid cell.

Another advantage of producing flow predictions is that it allows an occupancy model to capture future behavior with fewer “key frames”, since flow predictions can be used to warp/morph occupancy at any continuous point in time. Since our flow formulation captures multiple travel directions for each agent, the morphing process will lead to a conservative expansion of occupancy from the last known occupancy of agents. Therefore, the morphed occupancy can be safely used by a planning algorithm that minimizes co-location with predicted occupancy.

Given the current state-of-the-art in real-time perception, the track quality available to a motion forecasting system can be limited. The system may lose sight of tracked agents because of occlusions or increased distance. More importantly, new agents may appear through dis-occlusion or by otherwise entering the AV’s field of view. Reasoning about the location and velocity of these so-called speculative agents is critical for safe and effective autonomous driving. Trajectory prediction models in the literature are formulated only for agents that have already been detected and tracked, and cannot handle agents that may come out of occluded areas.

In summary, our contributions are as follows:

Occupancy Flow Fields: We introduce a model that predicts both occupancy and flow in a spatio-temporal grid. This representation allows us to predict a non-parametric distribution of future occupancy as well as velocity of agents.

Flow-Traced Occupancy: We use the chain of flow predictions over multiple timesteps to trace predicted occupancies at any future timestep all the way back to the current observations. Adding a loss term based on traced occupancy predictions forces the model to establish consistency between flow and occupancy predictions, leading to significant improvements in metrics. At inference time, tracing flow predictions allows us to recover the identity of the agent predicted to occupy any grid cell.

Speculative Agents: We introduce the problem of predicting occupancy and flow for speculative agents, an important task that to our knowledge is unexplored in the behavior modeling literature.

II. RELATED WORK

Motion Forecasting via Trajectories. In this motion forecasting representation, each modeled agent’s future distribution is described by a set of trajectories, which are each a time sequence of state estimates. This set may be predicted in a discriminative, feed-forward manner, via imitation learning [8], [12], [4], [13], [14], [15], [7], [16], [5], [6]. These models commonly predict trajectory likelihoods and sometimes Gaussian uncertainty parameters as well, giving rise to a full parametric probability distribution as output [17], [4], [3], [18]. The trajectory representation has several disadvantages mentioned in the introduction.

Motion Forecasting via Occupancy Grids. Relatively few motion forecasting models employ an occupancy grid representation. ChauffeurNet [8] trains occupancy in a multi-task network to improve trajectory planning performance. DRF [11] predicts a sequence of occupancy residuals inspired by auto-regressive sequential prediction. Rules of the Road [10] compares trajectory methods to occupancy grids by proposing a dynamic program to decode likely trajectories from occupancy under a simple motion model. Finally, contemporaneous with this work, MP3 [9] proposes a concept similar to Flow Fields, termed Motion Fields: it predicts a set of forward motion vectors and associated probabilities per grid cell. MP3 employs occupancy flow in the context of a planning task and does not offer direct analysis of the quality and performance of its motion forecasting method. In this work we explore occupancy flow in detail and develop a class of metrics for directly evaluating it.

III. METHOD

In this section, we define the occupancy flow problem, describe our model architecture and losses, and elaborate on how we trace flow predictions over time to establish consistency between flow and occupancy predictions.

A. Representation

As introduced in Sec. I, the problem is to predict an occupancy flow field at future time t, given observations from the recent state of the agents. Following the common convention in AV motion datasets [19], [1], [5], we abstract the agents as two-dimensional rectangles in bird’s-eye view (BEV), characterized by position, orientation, width, height, velocity, etc. A detection and tracking pipeline extracts the sparse agent states over a number of timesteps from raw sensor readings. A sequence of observations is split into past and future segments. Our model receives the past agent states as sparse inputs and predicts dense future occupancy. Ground-truth occupancy is generated by rendering the bird’s-eye view rectangles for the detected agents at each timestep.

B. Inputs

We abstract the problem inputs to be sparse environment and agent states as estimated by any detection and tracking system as follows:

1. Past agent states: Each agent at time t is represented as a tuple $(p_t, \theta_t, w_t, l_t, v_t, a_t)$, where $p_t = (x_t, y_t)$ denotes the agent’s center position, $\theta_t$ denotes the orientation, $(w_t, l_t)$ denotes the box width and length, $v_t$ denotes a two-dimensional velocity vector, and $a_t$ denotes a two-dimensional acceleration vector. The model receives $A_t$, the state of all agents at time t, for $t \in \{T_{\text{input}}, \ldots, 0\}$.

2. Road structure: To receive information about the structure of the road lanes and other traffic objects, the model is given a set of points sampled uniformly from the line segments and curves representing the road elements. Each sampled point is represented by a tuple (p, u) where p = (x, y) denotes position and u denotes the type of the underlying road element, which can be one of the following: crosswalk, speed bump, stop/yield sign, road edge boundary, parking line, dotted line, solid single/double line, and solid double yellow line.

3. State of traffic lights: The model is also given the state of traffic lights for each lane at each input timestep. The traffic light state of each traffic-controlled lane at time t is represented by a tuple $(p_t, s_t)$ where $p_t = (x_t, y_t)$ is the position of a point placed at the end of the traffic-controlled lane, and $s_t$ is the light state, which is one of {red, yellow, green, unknown}.

C. Occupancy Flow Prediction

Occupancy Flow Fields can be represented as two quantities: an occupancy grid $O_t$ and a flow field $F_t$, both with spatial dimensions h×w. Each cell in the grid corresponds to a particular BEV grid cell in the map. Each cell (x, y) in the occupancy grid $O_t$ contains a value in the range [0, 1] representing the probability that any part of any agent box overlaps with that grid cell at time t. Each cell (x, y) in the flow field $F_t$ contains a two-dimensional vector (∆x, ∆y) that specifies the motion of any agent whose box occupies that grid cell at time t. Vehicle and pedestrian behavior differ significantly, so we output separate occupancy and flow predictions for different agent classes $\mathcal{K}$. More specifically, we predict occupancy grids $O_t = (O^V_t, O^P_t)$ and flow fields $F_t = (F^V_t, F^P_t)$ for vehicles and pedestrians, $\forall t \in \{1, \ldots, T_{\text{pred}}\}$.

Flow Formulation: We model the motion of agents with backward flow (see Fig. 2). Ground-truth flow vectors between times t and t−1 are placed in the grid at time t and point to the original position of that grid cell at time t−1. More specifically, flow ground truth is constructed as $\tilde{F}_t(x, y) = (x, y)_{t-1} - (x, y)_t$, where $(x, y)_{t-1}$ denotes the coordinates at time t−1 of the same agent part that occupies (x, y) at t. The magnitude of the flow vectors is in grid cell (think pixel) units.

[Figure 2 panels: Predicted Forward Flow (t−1 to t), Predicted Backward Flow (t to t−1)]
Fig. 2: Hypothetical flow predictions for a single agent are illustrated in a single forward flow field (left) and a single backward flow field (right). Each flow vector is stored in the grid cell (yellow square) at its base. With forward flow, each grid cell predicts its future location, while with backward flow each grid cell predicts its past location. Note that the agent boundaries drawn at time t are just illustration aids and not predicted by the model. Unlike forward flow, a single backward flow field can represent multiple next destinations for any current occupancy, making it more effective for motion forecasting. Note that backward flow vectors are meaningful and predicted on every currently-unoccupied cell in the map (not shown), but forward flow is only meaningful on current occupancies. Moreover, since each backward flow vector pulls occupancy from a single source, predicted backward flow for multiple agents is also inherently free of collisions, which indicate inconsistent predictions and are undesired in motion forecasting systems.

Note that backward flow still models the forward motion of agents; it just represents where each grid cell comes from in the previous timestep rather than representing where each grid cell moves to in the next timestep. Therefore backward flow can capture multiple futures for individual agents using a single flow field per timestep. On the other hand, capturing multiple futures with forward flow requires predicting multiple flow vectors per cell and their associated probabilities, which would increase latency, memory requirement, and complexity of the model.

D. Speculative Occupancy Flow Prediction

The problem described in Sec. III-C is to predict future occupancy and flow of the agents that have been observed at any of the past timesteps. The speculative prediction model has the same inputs and the same output representation as the main model. However, the task is to predict occupancy and motion of agents that have not been observed in the past, yet appear in the future, e.g., a vehicle appearing on the edges of the model’s field-of-view, or a pedestrian exiting a sensor-occluded region. For this problem, we train the same model with alternative labels that reflect occupancy and flow of agents known to exist in the future but missing in past timesteps.

E. Model

Fig. 3 shows the overall architecture of our model, which consists of an encoder and a decoder:

Encoder: The first stage receives all three types of input points and processes them with a PointPillars-inspired encoder [20], [21]. The traffic light and road points are placed directly into the grid. The agent states $A_t$ at each input timestep $t \in \{T_{\text{input}}, \ldots, 0\}$ are encoded by uniformly sampling a fixed-size grid of points from the interior of each agent’s BEV box and placing those points with associated agent state attributes (Sec. III-B, including a one-hot encoding of time t) into the grid (visualized in Fig. 3). Each pillar outputs an embedding for all the points contained in it.

Decoder: The second stage receives the per-pillar embeddings as input, and produces per-grid-cell occupancy and flow predictions. The decoder network is based on EfficientDet [22]: it employs EfficientNet as the backbone to process the per-pillar embeddings into feature maps $(P_2, \ldots, P_7)$, where $P_i$ is downsampled by $2^i$ from inputs. These multi-scale features are then fused in a bidirectional manner using a BiFPN network. Then, the highest-resolution feature map $P_2$ is used to regress occupancy and flow predictions for all agent classes $\mathcal{K}$ over all $T_{\text{pred}}$ timesteps. More specifically, the decoder outputs a vector of size $|\mathcal{K}| \times T_{\text{pred}} \times 3$ for each grid cell, in order to simultaneously predict occupancy ($|\mathcal{K}| \times T_{\text{pred}}$ channels) and flow ($|\mathcal{K}| \times T_{\text{pred}} \times 2$ channels).

[Figure 3 diagram labels: Input (Past Agent State Points, Road Structure Points, Traffic Light Points); Per-Pillar Encoder (MLP, Max Pool, Pillar Embedding); Per-Pillar Embeddings; Occupancy + Flow Decoder Network; Output $[h, w, |\mathcal{K}| \cdot T_{\text{pred}} \cdot 3]$]
Fig. 3: Our model architecture consists of a PointPillars-inspired encoder [20], [21] and a decoder based on EfficientDet [22].

F. Losses

The model is trained with supervised occupancy and flow field losses, and a novel self-supervised Flow Trace loss. The occupancy loss is a binary logistic cross-entropy per grid cell between the predicted occupancy $O_t$ and the ground truth occupancy $\tilde{O}_t$, aggregated over all spatio-temporal cells as

$$L_O = \sum_{t=1}^{T_{\text{pred}}} \sum_{x=0}^{w-1} \sum_{y=0}^{h-1} H\big(O_t(x, y), \tilde{O}_t(x, y)\big) \tag{1}$$

where H denotes the cross-entropy function.

The flow loss is an L1-norm regression loss with respect to the ground-truth flow $\tilde{F}_t$, weighted by $\tilde{O}_t$ as

$$L_F = \sum_{t=1}^{T_{\text{pred}}} \sum_{x=0}^{w-1} \sum_{y=0}^{h-1} \big\lVert F_t(x, y) - \tilde{F}_t(x, y) \big\rVert_1 \, \tilde{O}_t(x, y). \tag{2}$$

1) Flow Trace Loss: As discussed in Sec. III-C, the backward flow vectors at time t point to the previous location of corresponding grid cells at time t−1. Consider the current occupancy grid for vehicles $O_0$. Note that $O_0$ is not a prediction, but can be constructed directly from inputs. We can warp the current occupancy $O_0$ according to the first flow prediction $F_1$ for t=1 to obtain a grid of all possible future occupancies of the current agents at t=1. We recursively apply this warping process as

$$W_t = F_t \circ W_{t-1} \tag{3}$$

where $W_t$ denotes the flow-warped occupancy at t and $W_0 = O_0$. Therefore, computing $W_T$ for the final timestep applies a chain of all flow predictions $F_t, \forall t \in 1 \ldots T$, to compute all potential future locations for the current occupancies at $O_0$. Fig. 4 visualizes this process in a sample scene. Note that we are using backward flow fields to roll out the current occupancies forward in time. But, thanks to backward flow vectors, there are never any overlaps/conflicts on the origin of cells in $W_t$. Moreover, backward flow vectors never move current occupancies out of the grid.

The flow warping process does not use any occupancy predictions. However, we multiply each $W_t$ with its corresponding occupancy prediction $O_t$, and require that the result matches the ground truth occupancy $\tilde{O}_t$ using the loss term

$$L_W = \sum_{t=1}^{T_{\text{pred}}} \sum_{x=0}^{w-1} \sum_{y=0}^{h-1} H\big(W_t(x, y)\, O_t(x, y), \tilde{O}_t(x, y)\big). \tag{4}$$

The final loss is defined as

$$L = \sum_{\mathcal{K}} \frac{1}{h\, w\, T_{\text{pred}}} \big(\lambda_O L_O + \lambda_F L_F + \lambda_W L_W\big) \tag{5}$$

where $\lambda_O, \lambda_F, \lambda_W$ are coefficients, and $\mathcal{K}$ contains all agent classes, i.e., vehicles and pedestrians in our case.

G. Recovering Agent IDs Using Flow Traces

The flow traces discussed as a loss in Sec. III-F.1 can also be used at inference time to assign agent IDs to predicted occupancies. If we augment the current occupancy grid $O_0$ with the ID of the origin agent for every grid cell, the warping process spreads the agent ID across the flow vectors as well. In this setup, $W_t$ contains flow-warped occupancies with per-cell ID attributes, from which we can directly read the agent ID. This ID points to the agent that could occupy this grid cell at time t according to the flow predictions $F_1, \ldots, F_t$. Construction of flow traces, and thereby the ID recovery process, is very fast, with time complexity $O(h \cdot w \cdot T_{\text{pred}})$.

IV. EXPERIMENTS

A. Datasets

Crowds Dataset. This dataset is a revision of the Waymo Open Motion Dataset [19] focused on crowded scenes. It contains 10.5 million training and 2.8 million test examples spanning over 500 hours of real-world driving in several urban areas across the US. Dynamic scene entities are computed from LiDAR and camera data similar to existing works in the literature [23], [24]. All scenarios contain at least 20 dynamic agents. This dataset contains sensor readings at 5Hz.

[Figure 4 panels: rows Current Vehicles / Flow-warped Occupancy and Current Pedestrians / Flow-warped Occupancy; columns t = 0 (current), t = 1 (0.6s), t = 4 (2.4s), t = 7 (4.2s), t = 10 (6.0s)]
Fig. 4: Flow Traces. Top: Current vehicles occupancy $\tilde{O}^V_0$ followed by recursive construction of flow-warped occupancies $W^V_t$. Each agent ID has been mapped to a different color. The black box shows the ground-truth location of the AV. At each timestep, flow predictions are used to expand the potential reachable region for each agent, independently of the occupancy likelihoods.
Since the backward flow field can pull any occupied grid cell to multiple future locations, each successive application of the warping process likely increases the reachable region. For example, notice how the vehicle highlighted by the red arrow is predicted to either go straight, or perform a U-turn. The flow-warped occupancies are used in loss functions at training time. At inference time, they can be used to recover agent identities for predicted occupancies. Bottom: The same process applied to the pedestrians in the same scene. Note that the predicted flow vectors have been inverted to make it easier to study the predicted motions. Weproducepredictionsforsixsecondsintothefuture,given agent class K using the Soft IoU metric as one second of observations. (cid:80) OK·O˜K Interaction [1]. Interaction is a publicly-available dataset Soft-IoU(OK,O˜K)= x,y t t (6) t t (cid:80) OK+O˜K−OK·O˜K with sensor readings at 10Hz. We use 432k examples for x,y t t t t training and 108k examples for test. We produce predictions where arguments (x,y) and have been omitted for brevity. for three seconds into the future, given half a second of observations. Flow Metrics: The following metrics measure the accuracy offlowpredictions:EPEcomputesthemeanEnd-PointError (cid:13) (cid:13) B. Training Setup L2 distance (cid:13)FK(x,y)−F˜K(x,y)(cid:13) where O˜K(x,y) (cid:54)= 0. (cid:13) t t (cid:13) t 2 We produce predictions for 30 future timesteps using ob- ID Recall measures the percentage of correctly-recalled IDs servations from the past 5 timesteps (T input =−4). For both for each ground-truth occupancy grid O˜ tK as datasets, each prediction timestep aggregates occupancy and (cid:80) 1[ID(WK)=ID(O˜K)].1[O˜K (cid:54)=0] flow from 3 timesteps in the dataset. Therefore T pred = 10 x,y t t t (7) capturesfutureoccupancyoverall30timesteps.Theencoder (cid:80) 1(cid:2) O˜K (cid:54)=0(cid:3) x,y t uses80×80pillars,eachmappingtoa1m×1mareaofthe world. 
The occupancy and flow outputs have a resolution of where 1[] denotes the indicator function. h×w = 400×400 cells, covering the same 80m×80m Combined Metrics: These metrics require both flow and area. Loss coefficients are set to λ =λ =1000,λ =1 O W F occupancy predictions to be accurate: Flow-Traced (FT) to roughly balance the magnitude of different losses. The AUC measures AUC(WKOK,O˜K). Flow-Traced (FT) IoU t t t modelistrainedfromscratchusingtheAdamoptimizerwith measures Soft-IoU(WKOK,O˜K). t t t a learning rate of 0.02 and batch size of 4. D. Results C. Metrics Wereportresultsforapplyingourmethodtothreeseparate OccupancyMetrics:Weemployevaluationmetricsusedfor occupancy flow predictions tasks: 1) on the Crowds dataset, binary segmentation [25]: Area under the Curve (AUC) and 2) for speculative objects in the Crowds dataset, and 3) on SoftIntersectionoverUnion(Soft-IoU)[26].AUCcomputes the Interaction dataset. AUC(OK,O˜K)foragentclassK(vehicle/pedestrian)usinga Fig. 5 compares occupancy and flow metrics from two t t linearly-spacedsetofthresholdsin[0,1]tocomputepairsof models trained with and without the flow-trace loss on precision and recall values and estimate the area under the the Crowds dataset. For metrics that require the chain of PR-curve. Soft-IoU measures the area of overlap for each flow predictions to be correct, i.e., FT AUC, FT IoU,AUC (Ours) FT AUC (Ours no trace loss) FT AUC (Ours w/ trace loss) Predicted Time (s) Predicted Time (s) Predicted Time (s) Predicted Time (s) Predicted Time (s) Predicted Time (s) Fig. 5: Occupancy and flow metrics separately for vehicles and pedestrians on the Crowds dataset. The plots compare occupancy and ID recall metrics from two models trained with and without the flow-trace loss with MP3 [9]. For vehicles, we also compare the models to occupancy grids generated from MultiPath [4], a trajectory prediction model. 
Ourswithouttraceloss Ourswithtraceloss Time Occupancy Flow ID Flow-TracedOccupancy Occupancy Flow ID Flow-TracedOccupancy (sec) AUC IoU EPE Recall AUC IoU AUC IoU EPE Recall AUC IoU 0.3 0.938 0.802 0.439 0.899 0.920 0.796 0.939 0.802 0.458 0.950 0.941 0.817 0.6 0.898 0.720 0.456 0.841 0.876 0.708 0.899 0.723 0.536 0.915 0.896 0.734 0.9 0.853 0.636 0.528 0.800 0.825 0.626 0.860 0.641 0.632 0.886 0.851 0.652 1.2 0.796 0.551 0.591 0.762 0.764 0.544 0.801 0.557 0.736 0.859 0.796 0.563 1.5 0.717 0.461 0.667 0.724 0.687 0.457 0.730 0.472 0.797 0.825 0.723 0.476 1.8 0.651 0.389 0.722 0.698 0.620 0.388 0.659 0.396 0.820 0.807 0.652 0.401 2.1 0.576 0.324 0.811 0.683 0.544 0.323 0.578 0.324 0.870 0.791 0.572 0.328 2.4 0.509 0.271 0.887 0.663 0.475 0.270 0.509 0.268 0.893 0.775 0.503 0.272 2.8 0.432 0.229 0.935 0.649 0.399 0.228 0.429 0.229 0.935 0.766 0.425 0.234 3.0 0.366 0.187 0.947 0.635 0.338 0.188 0.369 0.190 0.995 0.751 0.365 0.196 TABLE I: Vehicle occupancy and flow metrics over time from our two models on the Interaction dataset. the same level as the simpler occupancy metrics. We notice that for pedestrians the flow-traced metrics can even surpass the simple metrics. Since multiplying W can only lower t the predicted occupancy values O , we conjecture that the t improvement must be happening through W wiping out t some predicted occupancies that are improbable according to flow predictions, and thereby improving the match with the ground truth. We compare our occupancy prediction method with a MP3 K = 1, t + 0.6s MP3 K = 3, t + 0.6s Ours K = 1, t + 0.6s state-of-the-art trajectory prediction model, MultiPath [4], Fig.6:PredictionsbyMP3(withK =1,K =3flowfields) by converting its predictions to occupancy grids. 
For a andourmodel(K =1)onasamplescene.UnlikeMP3,our consistent comparison, we train the MultiPath backbone and backward forward flow fields have the capacity to disperse decoder on the same feature maps obtained from our sparse predicted occupancy in noisy datasets. encoder. We convert the top 6 most-likely trajectories with likelihoodsandGaussianuncertaintiestooccupancygridsby and ID recall, using the flow-trace loss leads to signifi- rasterizingthepredictedorientedagentboxesandconvolving cant improvements. Note that the flow-traced occupancy them with the two-dimensional Gaussian predicted for that metrics tend to be lower than regular occupancy metrics, timestep, weighted by the associated trajectory likelihood. as they compare W O against ground truth and we have As Fig. 5 shows, our rich non-parametric occupancy grid t t W O ≤ O . In other words, the best the trace process representation outperforms the trajectory model. t t t in W can do is to reach all predicted occupancies and Fig. 5 also compares our models with MP3 [9], which t retain their intensities. We observe that when training with predicts a set of K forward flow fields and associated the trace loss, the flow-traced occupancy metrics can reach probabilities. Our single backward flow representation per-VehicleFlow VehicleOccupancyFlow Pedestrian Flow Pedestrian Occupancy Flow (1.5s) (0.6s) (1.5s) (5.4s) (1.5s) (0.6s) (1.5s) (5.4s) s nt e g A ar ul g e R s nt e g A e v ati ul c e p S s nt e g A ar ul g e R s nt e g A e v ati ul c e p S Fig. 7: Regular and speculative predictions on two sample scenes (top two and bottom two rows) from the Crowds dataset. The left four columns display predictions for vehicles, and the right four columns show pedestrians. For each scene, a single flow prediction (F ) and three combined flow and occupancy predictions (F .O ) are shown. Gray boxes show the t t t recent state of input agents and the clouds visualize predicted occupancy and flow. 
Regular occupancy is predicted on the path of moving agents. Speculative occupancy is predicted in regions which might contain agents currently hidden from the AV (black box near bottom). 0.10 Vehicles 0.0200 Vehicles 6 Vehicles Pedestrians Pedestrians Pedestrians 0.0175 5 0.08 0.0150 4 0.06 0.0125 0.0100 3 0.04 0.0075 2 0.02 0.0050 1 0.0025 0.00 0.0000 0 0.6 1.2 1.8 2.4 3.0 3.6 4.2 4.8 5.4 6.0 0.6 1.2 1.8 2.4 3.0 3.6 4.2 4.8 5.4 6.0 0.6 1.2 1.8 2.4 3.0 3.6 4.2 4.8 5.4 6.0 Predicted Time (s) Predicted Time (s) Predicted Time (s) Fig. 8: Occupancy (AUC, Soft-IOU) and flow (EPE) metrics for the speculative model on the Crowds dataset. 00 .. 89 00 .. 78 I F Fo T TU I I o o(O U Uu ( (r O Os) u ur rs s n wo / tt rr aa cc ee ll oo ss ss )) 01 .. 90 E EP PE E ( (O Ou ur rs s n wo / tt rr aa cc ee ll oo ss ss )) 00 .. 99 05 I ID D R Re ec ca al ll l ( (O Ou ur rs s n wo / tt rr aa cc ee ll oo ss ss )) 0.6 0.8 0.85 0.7 0.6 0.5 0.7 0.80 00 .. 45 A FTU C AU (O Cu (r Os) urs no trace loss) 00 .. 34 00 .. 56 00 .. 77 05 FT AUC (Ours w/ trace loss) 0.2 0.65 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3.0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3.0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3.0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3.0 Predicted Time (s) Predicted Time (s) Predicted Time (s) Predicted Time (s) Fig. 9: Occupancy and flow metrics on the Interaction dataset, which contains only vehicles. The plots compare occupancy, flow, and ID recall metrics from two models trained with and without the flow-trace loss. Training the model with the flow-trace loss leads to significant improvements. formsfavorablyagainstMP3despitehavingamorecompact is less compute- and memory-efficient. For K = 3, each representation. We match MP3’s performance with K = 3 timestep needs 9 output channels compared to just 2 in our for pedestrians with just one flow field. For vehicles, we method.WehadtotrainseparateMP3modelsforeachagent outperform it even with K = 3. 
MP3’s flow representation class to fit into the memory—giving it an advantages sinceour models predict all agent classes. cooperative motion dataset in interactive driving scenarios with se- MP3 does not directly supervised occupancy, and its manticmaps,”arXiv:1910.03088,2019. [2] S. Thrun and A. Bu¨cken, “Integrating grid-based and topological forward flow fields have a limited capacity in spreading maps for mobile robot navigation,” in Proceedings of the National occupancy in uncertain situations. With K = 3 fields, each ConferenceonArtificialIntelligence,1996,pp.944–951. currently-occupiedpixelcanmovetoatmostthreelocations [3] T.Buhet,E.Wirbel,A.Bursuc,andX.Perrotton,“Plop:Probabilistic polynomial objects trajectory prediction for autonomous driving,” in inthenexttimestepandtouchatmost12pixelswithbilinear CoRL,2021,pp.329–338. smoothing. In noisy datasets where the motion of objects [4] Y.Chai,B.Sapp,M.Bansal,andD.Anguelov,“Multipath:Multiple can be uncertain, MP3 struggles to cover all possible future probabilistic anchor trajectory hypotheses for behavior prediction,” CoRL,2019. locations of agents, and can produce predictions with agents [5] M.-F.Chang,J.Lambert,P.Sangkloy,J.Singh,S.Bak,A.Hartnett, disintegrating into disjoint pixels. Fig. 6 demonstrates this D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., “Argoverse: 3d behavior.Moreover,ouroccupancyrepresentationallowsfor trackingandforecastingwithrichmaps,”inCVPR,2019,pp.8748– 8757. modelingoccupancyofspeculativeagents,averyimportant [6] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. class of agents for AVs, which is not possible with MP3. Huang,J.Schneider,andN.Djuric,“Multimodaltrajectorypredictions Fig. 8 shows occupancy and flow metrics for the spec- forautonomousdrivingusingdeepconvolutionalnetworks,”inICRA, 2019,pp.2090–2096. ulative occupancy prediction task. The metrics are worse [7] T.Phan-Minh,E.C.Grigore,F.A.Boulton,O.Beijbom,andE.M. 
in absolute values, since speculative prediction is a harder problem. Note that unlike the main problem, speculative occupancy metrics improve over time, since speculative objects are harder to predict in the near future than in the far future. Most scenes have no agents disoccluding in the near future. However, it is possible to anticipate potential disocclusions, e.g., around corners, with the forward motion of the AV or other agents. Fig. 7 visualizes regular and speculative occupancy flow predictions on two sample scenes.

Fig. 9 compares occupancy and flow metrics on the Interaction dataset from two models trained with and without the flow trace loss. We have included the metric values on this dataset in Table I as well. Again, we see significant improvements in metrics that depend on the chain of flow predictions. Regular occupancy metrics slightly improve under the flow trace loss as well. However, the one-step flow accuracy metric regresses for the model trained with the trace loss, especially for early timesteps. Since object detection and pose estimation models are often run independently from frame to frame, AV datasets typically include very noisy readings. Often even bounding box extents fluctuate between subsequent frames, which leads to noisy occupancy and flow labels in the ground truth. The trace loss incentivizes the model to smooth out these fluctuations and learn the true behavior of the agents to be able to capture their long-term motion patterns, rather than trying to model the noise and temporal inaccuracies in the detection pipeline.

V. CONCLUSIONS

In this paper we proposed a motion forecasting model that predicts both occupancy and flow on a spatio-temporal grid,
allowing us to predict not only the probabilistic location, but also the extents, motion, velocity and identity of agents (whether observed or speculative) in the future. We also showed that our method for warping current occupancies based on flow predictions can improve our motion forecasting metrics. Future work can explore gains from predicting speculative occupancies in an AV planning application.
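As a rough illustration of the flow-based warping of occupancies mentioned in the conclusions, the following pure-Python sketch moves an occupancy grid from one timestep to the next by bilinear sampling. It assumes that flow is stored as a backward displacement, i.e., the vector in each cell at time t points to the location that cell's content came from at time t-1. The function names `bilinear_sample` and `warp_occupancy`, the nested-list grid layout, and the zero padding outside the grid are illustrative assumptions and not necessarily the implementation used in the experiments.

```python
import math


def bilinear_sample(grid, x, y):
    """Bilinearly interpolate `grid` (a list of rows) at continuous (x, y).

    Locations outside the grid contribute zero (an assumed padding choice).
    """
    h, w = len(grid), len(grid[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    value = 0.0
    # Accumulate the four neighbouring cells with their bilinear weights.
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < w and 0 <= yi < h:
                value += wx * wy * grid[yi][xi]
    return value


def warp_occupancy(prev_occupancy, backward_flow):
    """Warp the occupancy grid at t-1 into an occupancy estimate at t.

    backward_flow[y][x] = (dx, dy) is assumed to point from cell (x, y)
    at time t to its source location (x + dx, y + dy) at time t-1.
    """
    h, w = len(prev_occupancy), len(prev_occupancy[0])
    return [
        [
            bilinear_sample(
                prev_occupancy,
                x + backward_flow[y][x][0],
                y + backward_flow[y][x][1],
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


# Hypothetical 3x5 scene: one fully occupied cell at (x=1, y=1).
occupancy = [[0.0] * 5 for _ in range(3)]
occupancy[1][1] = 1.0

# Every cell claims its content came from 1.5 cells to the left.
flow = [[(-1.5, 0.0)] * 5 for _ in range(3)]
warped = warp_occupancy(occupancy, flow)
```

With a sub-pixel displacement such as the one in this usage example, the occupancy of the single source cell is shared between two neighbouring destination cells, which is how bilinear sampling lets flow spread probability mass rather than move whole pixels.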