Particle-Based Score Estimation for State Space Model Learning in Autonomous Driving Angad Singh ∗ Waymo Research singhangad@waymo.com Omar Makhlouf ∗ Waymo Research makhlouf@waymo.com Maximilian Igl Waymo Research migl@waymo.com Joao Messias Waymo Research messiasj@waymo.com Arnaud Doucet DeepMind arnauddoucet@deepmind.com Shimon Whiteson Waymo Research shimonw@waymo.com Abstract: Multi-object state estimation is a fundamental problem for robotic ap- plications where a robot must interact with other moving objects. Typically, other objects’ relevant state features are not directly observable, and must instead be inferred from observations. Particle filtering can perform such inference given approximate transition and observation models. However, these models are often unknown a priori, yielding a difficult parameter estimation problem since obser- vations jointly carry transition and observation noise. In this work, we consider learning maximum-likelihood parameters using particle methods. Recent meth- ods addressing this problem typically differentiate through time in a particle filter, which requires workarounds to the non-differentiable resampling step, that yield biased or high variance gradient estimates. By contrast, we exploit Fisher’s iden- tity to obtain a particle-based approximation of the score function (the gradient of the log likelihood) that yields a low variance estimate while only requiring step- wise differentiation through the transition and observation models. We apply our method to real data collected from autonomous vehicles (AVs) and show that it learns better models than existing techniques and is more stable in training, yield- ing an effective smoother for tracking the trajectories of vehicles around an AV. Keywords: Autonomous Driving, Particle Filtering, Self-supervised Learning 1 Introduction Multi-object state estimation is a fundamental problem in settings where a robot must interact with other moving objects, since their state is directly relevant for decision making. Typically, other objects’ relevant state features are not directly observable. Instead, the robot must infer them from a stream of observations it receives via a perception system. For example, an autonomous vehicle (AV) selects actions based on the state of nearby road users. However, such road users are only partially observed, owing to limited field of view, occlusions, and imperfections in the AV’s sensors and perception systems. Such partial observability negatively affects many downstream tasks in a robot’s behavioural stack that depend on observations, e.g., action planning. Addressing partial observability requires sequential state estimation, to which Bayesian filtering of- fers a generic probabilistic approach. In particular, sequential Monte Carlo methods, also known as particle filtering, have been successfully applied to state estimation in many robotics applications [1]. However, Bayesian filters require models that reasonably approximate the transition and observation models of a state-space model (SSM). In some special cases, these models can be derived analyti- cally from first principles, e.g., when the physical dynamics are well understood, or by modeling a sensor’s physical characteristics. In many real-world applications, however, these models cannot be specified analytically. For example, the transition model may encode complicated motion dynamics and environmental physics. In multi-agent settings, other agents’ behaviour must also be modelled. 
* These authors contributed equally to this work. 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand. arXiv:2212.06968v1 [cs.RO] 14 Dec 2022 Modelling observations is also difficult. Modern perception systems often involve multiple stages and combine information from multiple sensors, making observation models practically impossible to specify by hand. By contrast, collecting observations from a robotic system is relatively easy and cheap. We are interested, therefore, in algorithms that can leverage such observations to learn tran- sition and observation models in a self-supervised fashion, and yield an effective particle smoother. Learned transition and observation models can also be independently useful for other applications, such as the evaluation of AVs by simulating realistic observations. In this work, we propose Particle Filtering-Based Score Estimation using Fisher’s Identity (PF- SEFI), a method for jointly learning maximum-likelihood parameters of both the transition and observation models, that is applicable to a wide class of SSMs. Unlike many recently proposed methods [2, 3, 4, 5, 6, 7], our approach avoids differentiable approximations of the resampling step. We achieve this by revisiting a methodology originally proposed in statistics [8, 9] that relies on a particle approximation of the score, i.e., the gradient of the log likelihood of observation sequences, obtained through Fisher’s identity. This only requires differentiating through the transition and observation models. Unfortunately, a direct particle approximation of this identity provides a high variance estimate of the score. While [8] propose an alternative low variance estimate, it admits a O ( N 2 ) cost, where N is the number of particles. Furthermore, these methods compute and store the gradient of the marginal log-likelihood with respect to model parameters for each particle. This requires computing Jacobian matrices, which are slow to compute using automatic differentiation tools such as TensorFlow and PyTorch [10, 11] which rely on Jacobian-vector products. This makes these methods impractical for large models. By contrast, PF-SEFI is a simple scalable O ( N ) variant with only negligible bias. PF-SEFI marginalises over particles before computing gradients, allowing automatic differentiation tools to make use of efficient Jacobian-vector product operations, making it significantly faster and allowing us to scale to larger models. To the best of our knowledge, previous particle methods estimating the score have been limited to SSMs with few parameters, whereas we apply PF-SEFI to neural network models with thousands of parameters. We apply PF-SEFI to jointly learn transition and observation models for tracking multiple objects around an AV, using a large set of noisy trajectories, containing almost 10 hours of road-user tra- jectories observed by an AV. We show that PF-SEFI learns an SSM that yields an effective object tracker as measured by average displacement and yaw errors. We compare the learned observation model to one trained through supervised learning on a dataset of manually labelled trajectories, and show that PF-SEFI yields a better model (as measured by log-likelihood on ground-truth labels) even though it requires no labels for training. Finally, we compare PF-SEFI to a number of existing particle methods for jointly learning transition and observation models and show that it learns better models and is more stable to train. 
2 Related Work Particle filters are widely used for state estimation in non-linear non-Gaussian SSMs where no closed form solution is available; see e.g., [12] for a survey. The original bootstrap particle filter [13] sam- ples at each time step using the transition density particles that are then reweighted according to their conditional likelihood, which measures their “fitness” w.r.t. to the available observation. Parti- cles with low weights are then eliminated while particles with high weights are replicated to focus computational efforts into regions of high probability mass. Compared to many newer methods, such as the auxiliary particle filter [14], the bootstrap particle filter only requires sampling from the transition density, not its evaluation at arbitrary values, which is not possible for the compositional transition density used in this work. In most practical applications, the SSM has unknown parameters that must be estimated together with the latent state posterior (see [9] for a review). Simply extending the latent space to include the unknown parameters suffers from insufficient parameter space exploration [15]. While particle filters can estimate consistently the likelihood for fixed model parameters, a core challenge is that the such estimated likelihood function is discontinuous in the model parameters due to the resampling step, hence complicating its optimization; see e.g. [6, Figure 1] for an illustration. Instead, the score vector can be computed using Fisher’s identity [8]. However, as shown in [8], performance degrades quickly for longer sequences if a standard particle filter is used, due to the path degeneracy problem: repeated resampling of particles and their ancestors will leave few or even 2 just one remaining ancestor path for earlier timesteps, resulting in unbiased, but very high variance estimates. Methods for overcoming this limitation exist [8, 16, 17], but with requirements making them unsuitable in this work. Poyiadjis et al. [8] store gradients separately for each particle, making this approach infeasible for all but the smallest neural networks. ́ Scibior and Wood [17] propose an improved implementation with lower memory requirements by smartly using automatic differen- tiation. However, their approach still requires storing a computation graph whose size scales with O ( N 2 ) as the transition density for each particle pair must be evaluated during the forward pass. Both previous methods’ computational complexity also scales quadratically with the number of par- ticles, N , which is problematic for costly gradient backpropagation through large neural networks. Lastly, Olsson and Westerborn [16] require evaluation of the transition density for arbitrary values, which our compositional transition model does not allow. Instead, in this work, we show that fixed- lag smoothing [18, 19] is a viable alternative to compute the score function of large neural network models in the context of extended object tracking. There is extensive literature on combining particle filters with learning complex models such as neural networks [2, 3, 4, 5, 6, 20, 21, 22, 23, 24]. In contrast to our work, they make use of a learned, data-dependent proposal distribution. However, for parameter estimation, they rely on dif- ferentiation of an estimated lower bound (ELBO). Due to the non-differentiable resampling step, this gradient estimation has either extremely high variance or is biased if the high variance terms are simply dropped, as in [2, 3, 4]. 
As we show in Section 5, this degrades performance noticeably. A second line of work proposes soft resampling [5, 20, 21], which interpolates between regular and uniform sampling, thereby allowing a trade-off between the variance reduction obtained through resampling and the bias introduced by ignoring the non-differentiable component of resampling. Lastly, Corenflos et al. [6] make the resampling step differentiable by using entropy-regularized optimal transport, which also induces bias and an $O(N^2)$ cost.

Extended object tracking [25] considers how to track objects which, in contrast to "small" objects [26], generate multiple sensor measurements per timestep. Unlike in our work, transition and measurement models are assumed to be known or to depend on only a few learnable parameters. Similar to our work, the measurement model proposed in [27] assumes measurement sources lying on a rectangular shape. However, our model is more flexible, for example, allowing non-zero probability on all four sides simultaneously.

3 State-Space Models and Particle Filtering

3.1 State-Space Models

An SSM is a partially observed discrete-time Markov process with initial density $x_0 \sim \mu(\cdot)$, transition density $x_t \mid x_{t-1} \sim f_\theta(\cdot \mid x_{t-1})$, and observation density $y_t \mid x_t \sim g_\theta(\cdot \mid x_t)$, where $x_t$ is the latent state at time $t$ and $y_t$ the corresponding observation. The joint density of $x_{0:T}, y_{0:T}$ satisfies
$$p_\theta(x_{0:T}, y_{0:T}) = \mu(x_0)\, g_\theta(y_0 \mid x_0) \prod_{t=1}^{T} f_\theta(x_t \mid x_{t-1})\, g_\theta(y_t \mid x_t). \quad (1)$$
Given this model, we are typically interested in inferring the states from the data by computing the filtering and one-step-ahead prediction distributions, $\{p(x_t \mid y_{0:t})\}_{t \in 0,\dots,T}$ and $\{p(x_{t+1} \mid y_{0:t})\}_{t \in 0,\dots,T-1}$ respectively, and more generally the joint distributions $\{p(x_{0:t} \mid y_{0:t})\}_{t \in 0,\dots,T}$ satisfying
$$p_\theta(x_{0:t} \mid y_{0:t}) = \frac{p_\theta(x_{0:t}, y_{0:t})}{p_\theta(y_{0:t})}, \qquad p_\theta(y_{0:T}) = \int p_\theta(x_{0:T}, y_{0:T})\, \mathrm{d}x_{0:T}. \quad (2)$$
Additionally, to estimate parameters, we would also like to compute the marginal log likelihood
$$\ell_T(\theta) = \log p_\theta(y_{0:T}) = \log p_\theta(y_0) + \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{0:t-1}), \quad (3)$$
where $p_\theta(y_0) = \int g_\theta(y_0 \mid x_0)\, \mu(x_0)\, \mathrm{d}x_0$ and $p_\theta(y_t \mid y_{0:t-1}) = \int g_\theta(y_t \mid x_t)\, p_\theta(x_t \mid y_{0:t-1})\, \mathrm{d}x_t$ for $t \geq 1$. For non-linear non-Gaussian SSMs, these posterior distributions and the corresponding marginal likelihood cannot be computed in closed form.

3.2 Particle Filtering

Particle methods provide non-parametric and consistent approximations of these quantities. They rely on a combination of importance sampling and resampling steps applied to a set of $N$ weighted particles $(x^i_t, w^i_t)$, where $x^i_t$ denotes the value of the $i$th particle at time $t$ and $w^i_t$ is the corresponding weight satisfying $\sum_{i=1}^{N} w^i_t = 1$. We focus on the bootstrap particle filter, shown in Algorithm 1, which samples particles according to the transition density. Let $k \sim \mathrm{Cat}(\alpha_1, \dots, \alpha_N)$ denote the categorical distribution with $N$ categories, where the probability of $k$ taking the $i$th category is $\alpha_i$.

Algorithm 1 Bootstrap Particle Filter
Sample $x^i_0 \overset{\text{i.i.d.}}{\sim} \mu(\cdot)$ for $i \in [N]$ and set $\hat{\ell}_0(\theta) \leftarrow \log\big(\tfrac{1}{N} \sum_{i=1}^{N} g_\theta(y_0 \mid x^i_0)\big)$.
For $t = 1, \dots, T$:
1. Compute weights $w^i_{t-1} \propto g_\theta(y_{t-1} \mid x^i_{t-1})$ with $\sum_{i=1}^{N} w^i_{t-1} = 1$.
2. Sample $a^i_{t-1} \sim \mathrm{Cat}(w^1_{t-1}, \dots, w^N_{t-1})$, then $x^i_t \sim f_\theta(\cdot \mid x^{a^i_{t-1}}_{t-1})$ for $i \in [N]$.
3. Set $x^i_{0:t} \leftarrow (x^{a^i_{t-1}}_{0:t-1}, x^i_t)$ for $i \in [N]$ and $\hat{\ell}_t(\theta) \leftarrow \hat{\ell}_{t-1}(\theta) + \log\big(\tfrac{1}{N} \sum_{i=1}^{N} g_\theta(y_t \mid x^i_t)\big)$.
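For concreteness, the following is a minimal NumPy sketch of Algorithm 1. The callables `sample_init`, `sample_transition`, and `obs_logpdf` stand in for $\mu$, $f_\theta$, and $g_\theta$; these names, and the data layout, are assumptions of this illustration rather than a description of our implementation.

```python
import numpy as np

def log_mean_exp(v):
    m = v.max()
    return m + np.log(np.mean(np.exp(v - m)))

def bootstrap_particle_filter(ys, sample_init, sample_transition, obs_logpdf,
                              n_particles, rng):
    """Minimal sketch of Algorithm 1 (bootstrap particle filter).

    ys[t] is the observation y_t; sample_init(n, rng) draws from mu;
    sample_transition(x_prev, rng) draws from f_theta(.|x_prev) row-wise;
    obs_logpdf(y, x) returns log g_theta(y|x) for every particle.
    Returns particles, ancestor indices, filtering weights, and the
    log-likelihood estimate.
    """
    x = sample_init(n_particles, rng)                       # x_0^i ~ mu
    log_g = obs_logpdf(ys[0], x)
    loglik = log_mean_exp(log_g)                            # hat{ell}_0
    particles, ancestors, weights = [x], [], []
    for t in range(1, len(ys)):
        w = np.exp(log_g - log_g.max())
        w /= w.sum()                                        # weights w_{t-1}^i
        weights.append(w)
        a = rng.choice(n_particles, size=n_particles, p=w)  # ancestors a_{t-1}^i
        x = sample_transition(x[a], rng)                    # x_t^i ~ f_theta(.|x_{t-1}^a)
        log_g = obs_logpdf(ys[t], x)
        loglik += log_mean_exp(log_g)                       # accumulate hat{ell}_t
        particles.append(x)
        ancestors.append(a)
    w = np.exp(log_g - log_g.max())
    weights.append(w / w.sum())                             # final weights w_T^i
    return particles, ancestors, weights, loglik
```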
At any time $t$, Algorithm 1 produces particle approximations
$$\hat{p}_\theta(x_{0:t} \mid y_{0:t}) = \sum_{i=1}^{N} w^i_t\, \delta_{x^i_{0:t}}(x_{0:t}), \qquad \hat{\ell}_t(\theta) = \sum_{s=0}^{t} \log\Big(\frac{1}{N} \sum_{i=1}^{N} g_\theta(y_s \mid x^i_s)\Big), \quad (4)$$
of $p_\theta(x_{0:t} \mid y_{0:t})$ and $\ell_t(\theta) = \log p_\theta(y_{0:t})$, where $\delta_\alpha$ is the Dirac delta distribution centred at $\alpha$. Step 2 resamples, discarding particles with small weights while replicating those with large weights before evolving according to the transition density. This focuses computational effort on the "promising" regions of the state space. Unfortunately, resampling involves sampling $N$ discrete random variables at each time step and as such produces estimates of the log likelihood that are not differentiable w.r.t. $\theta$, as illustrated in [6, Figure 1]. While the resulting estimates are consistent as $N \to \infty$ for any fixed time $t$ [28], this does not guarantee good practical performance. Fortunately, under regularity conditions, the approximation error of the estimate $\hat{p}_\theta(x_t \mid y_{0:t})$, and more generally of $\hat{p}_\theta(x_{t-L+1:t} \mid y_{0:t})$ for a fixed lag $L \geq 1$ as well as of $\log p_\theta(y_{0:t})/t$, does not increase with $t$ for fixed $N$. However, this is not the case for the joint smoothing approximation: because of successive resampling, $\hat{p}_\theta(x_{0:L} \mid y_{0:t})$ is eventually approximated by a single unique particle for large enough $t$, a phenomenon known as path degeneracy; see e.g. [12, Section 4.3].

4 Score Estimation using Particle Methods

Given an SSM (1) and a dataset of observations $y_{0:T}$, we want to estimate the parameters $\theta$ by maximising the marginal log likelihood (3) via gradient ascent. However, the gradient of the marginal log likelihood, i.e., the score function, is intractable. As explained in Section 2, automatic differentiation through the filter is difficult due to the non-differentiable resampling step.

4.1 Score Function Using Fisher's Identity

We instead leverage Fisher's identity [8] for the score, which completely side-steps the non-differentiability problem. This identity shows that
$$\nabla_\theta \ell_T(\theta) = \int \nabla_\theta \log p_\theta(x_{0:T}, y_{0:T})\, p_\theta(x_{0:T} \mid y_{0:T})\, \mathrm{d}x_{0:T}, \quad (5)$$
i.e., the score is the expectation of $\nabla_\theta \log p_\theta(x_{0:T}, y_{0:T})$ under the joint smoothing distribution $p_\theta(x_{0:T} \mid y_{0:T})$. Plugging in (1), we have $\nabla_\theta \log p_\theta(x_{0:T}, y_{0:T}) = \sum_{t=0}^{T} \nabla_\theta \log g_\theta(y_t \mid x_t) + \sum_{t=1}^{T} \nabla_\theta \log f_\theta(x_t \mid x_{t-1})$ since $\mu$ does not depend on $\theta$, so the score function simplifies to
$$\nabla_\theta \ell_T(\theta) = \sum_{t=0}^{T} \int \nabla_\theta \log g_\theta(y_t \mid x_t)\, p_\theta(x_t \mid y_{0:T})\, \mathrm{d}x_t + \sum_{t=1}^{T} \int \nabla_\theta \log f_\theta(x_t \mid x_{t-1})\, p_\theta(x_{t-1:t} \mid y_{0:T})\, \mathrm{d}x_{t-1:t}. \quad (6)$$

4.2 Particle Score Approximation

The identity (6) shows that we can estimate the score simply by plugging particle approximations of the marginal smoothing distributions $p_\theta(x_{t-1:t} \mid y_{0:T})$ into (6). This makes differentiating through time superfluous and thereby renders differentiable approximations of resampling unnecessary. However, as discussed in Section 3.2, naive particle approximations of the smoothing distribution's marginals, $p_\theta(x_t \mid y_{0:T})$ and $p_\theta(x_{t-1:t} \mid y_{0:T})$, suffer from path degeneracy. To bypass this problem, [8, 17] propose an $O(N^2)$ method inspired by dynamic programming. We propose here a simpler and computationally cheaper method that relies on the following fixed-lag approximation of the fixed-interval smoothing distribution, which states that for $L \geq 1$ large enough,
$$p_\theta(x_{t-1:t} \mid y_{0:T}) \approx p_\theta(x_{t-1:t} \mid y_{0:\min\{t+L,\,T\}}). \quad (7)$$
This approximation simply assumes that observations after time $t + L$ do not bring further information about the states $x_{t-1}, x_t$. This is satisfied for most models and the resulting approximation error decreases geometrically fast with $L$ [19]. The benefit of this approximation is that the particle approximation of $p_\theta(x_{t-1:t} \mid y_{0:\min\{t+L,\,T\}})$ does not suffer from path degeneracy and is a simple byproduct of the bootstrap particle filtering of Algorithm 1; e.g., for $t + L < T$ we consider the particle approximation $\hat{p}_\theta(x_{0:t+L} \mid y_{0:t+L}) = \sum_{i=1}^{N} w^i_{t+L}\, \delta_{x^i_{0:t+L}}(x_{0:t+L})$ obtained at time $t + L$ and use its corresponding marginals in $(x_{t-1}, x_t)$ and $x_t$ to integrate respectively $\nabla_\theta \log f_\theta(x_t \mid x_{t-1})$ and $\nabla_\theta \log g_\theta(y_t \mid x_t)$. For $t + L \geq T$, we just consider the marginals in $(x_{t-1}, x_t)$ and $x_t$ of $\hat{p}_\theta(x_{0:T} \mid y_{0:T})$. So finally, we consider the estimate
$$\widehat{\nabla}_\theta \ell_T(\theta) = \sum_{t=0}^{T} \int \nabla_\theta \log g_\theta(y_t \mid x_t)\, \hat{p}_\theta(x_t \mid y_{0:\min\{t+L,\,T\}})\, \mathrm{d}x_t + \sum_{t=1}^{T} \int \nabla_\theta \log f_\theta(x_t \mid x_{t-1})\, \hat{p}_\theta(x_{t-1:t} \mid y_{0:\min\{t+L,\,T\}})\, \mathrm{d}x_{t-1:t}. \quad (8)$$
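To make this construction concrete, the sketch below assembles the estimate (8) from the output of a bootstrap particle filter: for each $t$, the particles at time $s = \min\{t+L, T\}$ are traced back through their ancestor indices to time $t$, and their filtering weights, which are held fixed so that no gradient flows through them, multiply the corresponding log densities. Differentiating the returned scalar once w.r.t. $\theta$ with an automatic differentiation framework then recovers (8) without storing per-particle gradients. The callables `log_g` and `log_f` and the data layout are assumptions of this illustration, not a description of our implementation.

```python
import numpy as np

def fixed_lag_score_surrogate(particles, ancestors, weights, log_g, log_f, lag):
    """Scalar surrogate whose theta-gradient is the fixed-lag score estimate (8).

    particles[t] : (N, d_x) particles x_t^i from the bootstrap filter
    ancestors[t] : (N,) indices a_t^i drawn when sampling x_{t+1}^i
    weights[t]   : (N,) normalised filtering weights w_t^i (constants w.r.t. theta)
    log_g(t, x)          -> (N,) per-particle log g_theta(y_t | x_t)
    log_f(t, x_prev, x)  -> (N,) per-particle log f_theta(x_t | x_{t-1})
    """
    T = len(particles) - 1
    objective = 0.0
    for t in range(T + 1):
        s = min(t + lag, T)
        idx = np.arange(len(weights[s]))
        for u in range(s, t, -1):            # trace time-s particles back to time t
            idx = ancestors[u - 1][idx]
        x_t = particles[t][idx]
        objective = objective + weights[s] @ log_g(t, x_t)
        if t >= 1:
            x_prev = particles[t - 1][ancestors[t - 1][idx]]
            objective = objective + weights[s] @ log_f(t, x_prev, x_t)
    return objective
```

For the compositional transition models of Section 4.3 below, `log_f` would be replaced by the policy log density of the action stored for each sampled transition, as in (11).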
4.3 Score Estimation with Deterministic, Differentiable, Injective Motion Models

We have described a generic method to approximate the score using particle filtering techniques. For many applications, however, the transition density function $f_\theta(x_t \mid x_{t-1})$ is the composition of a policy, $\pi_\theta(a_t \mid x_{t-1})$, which characterises the action distribution conditioned on the state, and a potentially complex but deterministic, differentiable, and injective motion model, $\tau: \mathbb{R}^{n_x} \times \mathbb{R}^{n_a} \to \mathbb{R}^{n_x}$ where $n_a < n_x$, which characterises kinematic constraints such that $x_t = \tau(x_{t-1}, a_t) = \bar{\tau}_{x_{t-1}}(a_t)$. Under such a composition, the transition density on the induced manifold $\mathcal{M}_{x_{t-1}} = \{\bar{\tau}_{x_{t-1}}(a_t) : a_t \in \mathbb{R}^{n_a}\}$ is obtained by marginalising out the latent action variable, i.e.,
$$f_\theta(x_t \mid x_{t-1}) = \mathbb{I}(x_t \in \mathcal{M}_{x_{t-1}}) \int \delta\big(x_t - \bar{\tau}_{x_{t-1}}(a_t)\big)\, \pi_\theta(a_t \mid x_{t-1})\, \mathrm{d}a_t. \quad (9)$$
It is easy to sample from this density, but it is analytically intractable if the motion model is only available through a complex simulator or if it is not invertible. This precludes the use of sophisticated proposal distributions within the particle filter. Additionally, even if it were known, one cannot use the $O(N^2)$ smoothing-type algorithms developed in [8, 16], as the density is concentrated on a low-dimensional manifold [29]. This setting is common in mobile robotics, in which controllers factor into policies that select actions and motion models that determine the next state. Indeed, this is precisely the case in our application (see Section 5). Learning the corresponding SSM reduces to learning the parameters $\theta$ of the policy, $\pi_\theta(a_t \mid x_{t-1})$, and of the observation model, $g_\theta(y_t \mid x_t)$. Thankfully, even if the explicit form of the motion model is unknown, we can still compute $\nabla_\theta \log f_\theta(x_t \mid x_{t-1})$ as required by the score estimate (8).

Lemma 4.1. For any $x \in \mathbb{R}^{n_x}$, let $\bar{\tau}_x : \mathbb{R}^{n_a} \to \mathbb{R}^{n_x}$ where $n_a < n_x$ be a smooth and injective mapping. Then, for any fixed $x_{t-1}$ and $x_t \in \mathcal{M}_{x_{t-1}}$, the gradient of the transition log density, i.e., $\nabla_\theta \log f_\theta(x_t \mid x_{t-1})$, reduces to the gradient of the policy log density, i.e., $\nabla_\theta \log \pi_\theta(a_t \mid x_{t-1})$, where $a_t$ is the unique action that takes $x_{t-1}$ to $x_t$.

Proof. For $x_{t-1}$ and $x_t \in \mathcal{M}_{x_{t-1}}$, we denote by $J[\bar{\tau}_{x_{t-1}}](\bar{\tau}^{-1}_{x_{t-1}}(x_t)) \in \mathbb{R}^{n_x \times n_a}$ the rectangular Jacobian matrix and write $a_t = \bar{\tau}^{-1}_{x_{t-1}}(x_t)$, i.e., this is the unique action such that $\bar{\tau}_{x_{t-1}}(a_t) = x_t$. By a standard result from differential geometry [30, 31], the transition density (9) satisfies
$$f_\theta(x_t \mid x_{t-1}) = \pi_\theta(a_t \mid x_{t-1}) \left| \det\!\big( J[\bar{\tau}_{x_{t-1}}]^{\mathsf{T}}(a_t)\, J[\bar{\tau}_{x_{t-1}}](a_t) \big) \right|^{-1/2} \mathbb{I}(x_t \in \mathcal{M}_{x_{t-1}}). \quad (10)$$
It follows directly that $\nabla_\theta \log f_\theta(x_t \mid x_{t-1}) = \nabla_\theta \log \pi_\theta(a_t \mid x_{t-1})$. □

Indeed, for the marginals $\hat{p}_\theta(x_{t-1:t} \mid y_{0:\min\{t+L,\,T\}})$ we can store the actions corresponding to the transitions $x_{t-1} \to x_t$ during filtering, and it follows that for the class of SSMs described above, the score estimate reduces to
$$\widehat{\nabla}_\theta \ell_T(\theta) = \sum_{t=0}^{T} \int \nabla_\theta \log g_\theta(y_t \mid x_t)\, \hat{p}_\theta(x_t \mid y_{0:\min\{t+L,\,T\}})\, \mathrm{d}x_t + \sum_{t=1}^{T} \int \nabla_\theta \log \pi_\theta(a_t \mid x_{t-1})\, \hat{p}_\theta(x_{t-1:t} \mid y_{0:\min\{t+L,\,T\}})\, \mathrm{d}x_{t-1:t}, \quad (11)$$
where we use Lemma 4.1 to replace the gradient of the transition log density with the gradient of the policy log density in (8), and where $a_t$ is the action sampled to go from $x_{t-1}$ to $x_t$.

5 Experiments

Figure 1: Observations from real and synthetic data. (a) Observations from the real data. (b) Observations from the synthetic data. The AV (orange) observes a set of 2D points (blue) forming a polygon around the road users. The true bounding boxes of each road user (green) were manually labelled in the real data, and pre-determined while constructing the synthetic data.

Problem Setting. Our experiments focus on the problem of state estimation of observed road users (in particular other vehicles) from the viewpoint of an AV, which involves the estimation of 2D poses from an observed sequence of 2D convex polygons in a "bird's eye view" (BEV) constructed from LiDAR point clouds at each time step. For these experiments, we assume that the size of the observed objects, the pose of the AV, and the association of observations with their corresponding objects are known a priori. Some observations (and their corresponding states) are shown in Figure 1a. Here, the observation model must learn to describe the likelihood of 2D points around the periphery of the observed road user (see [25] for a review of such models), while the transition model must learn to describe driving behaviour. We use a feed-forward neural network to parameterise our observation model, providing it with range, bearing, and relative bearing from the viewpoint of the corresponding AV as features (Appendix A), and we factor our transition model into a deterministic and differentiable motion model based on Ackermann dynamics [32] (Appendix B.1) and a policy parameterised by another feed-forward neural network (Appendix B.2).

Baselines, Datasets, and Metrics. We compare the quality of the models learned using PF-SEFI (our method), DPF-SGR [17], PFNET [5], and differentiating through a vanilla PF (ignoring the bias introduced by resampling). We evaluate our method (and the baselines) on real data collected from an AV in an urban environment, equipped with LiDARs, cameras, and radar sensors. All sensors were used to associate LiDAR points with their corresponding objects, and the observations shown in Figure 1a were obtained via a convex hull computation of the associated LiDAR points.
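As an aside on how such observations can be constructed, the short sketch below forms a convex-polygon observation from the BEV LiDAR points associated with a single object. It assumes SciPy's convex hull routine; the function and variable names are illustrative only.

```python
import numpy as np
from scipy.spatial import ConvexHull

def polygon_observation(points_xy):
    """Convex-polygon observation from the (M, 2) BEV LiDAR points
    associated with one object; returns the hull vertices in CCW order."""
    hull = ConvexHull(points_xy)
    return points_xy[hull.vertices]

# Example: a synthetic cluster of returns around one vehicle.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=[10.0, 3.0], scale=[2.0, 0.8], size=(40, 2))
print(polygon_observation(cluster))
```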
In addition to using real data, we generate two synthetic datasets (with 25 and 50 step trajectories), using a hand-crafted policy, and an observation model trained using supervised learning on manually labelled trajectories (see Appendix A and B for more details). Example observations are shown in Figure 1b. Unlike with real data, where the true models are unknown, synthetic datasets allow us 6 (a) MLL of test synthetic data with 25 step trajectories. (b) MLL of test synthetic data with 50 step trajectories. (c) MLL of test real data with 60 step trajectories. Figure 2: Marginal Log Likelihood (MLL) on synthetic and real test data for models trained using PF-SEFI (us), DPF-SGR, PFNET, and PF, plotted against the corresponding training steps. For synthetic data we also show the MLL of the true models. to compare the learned models against a known ground truth. We measure the quality of learned models using the following metrics: • Marginal Log Likelihood (MLL) : The marginal log likelihood ` T ( θ ) given by filtering obser- vations y 0: T using the learned models. • Average Displacement Error (ADE) and Average Yaw Error (AYE) : The average error in the positions and yaws respectively of the smoothed state estimates E θ ( x 0: T | y 0: T ) against the true poses, ̄ x 0: T . For the synthetic data, the true poses are sampled while generating the data; for the real data, the true poses are obtained by humans manually labelling object trajectories from videos. These measure the quality of the learned models for the purposes of state estimation. • Average Observation True Log Likelihood (AOTLL) : The average log likelihood of ob- servations conditioned on the true states under the learned observation model, i.e., 1 T +1 ∑ T t =0 log g θ ( y t | ̄ x t ) . This measures the quality of the learned observation model. • Average Policy True Log Likelihood (APTLL) : The average log likelihood of true actions, ̄ a 1: T , (only available for experiments with synthetic data since it is not possible to man- ually label latent actions) conditioned on the true states under the learned policy, i.e., 1 T ∑ T t =1 log π θ ( ̄ a t | ̄ x t − 1 ) . This measures the quality of the learned policy. Results. Figure 2 shows the progress of the learned models by tracking MLL of held out test data for each of the three datasets (synthetic data with 25 steps, synthetic data with 50 steps, and real data with 60 steps), and for each of the four methods (PF-SEFI, DPF-SGR, PFNET, and PF). Table 1 sum- marise the performance of the learned models at convergence. We pick the best hyper-parameters, smoothing lag L for PF-SEFI, and trade-off parameter α for PFNET, in each of the experiments. Appendix C includes analysis of the training sensitivity of each of the hyper-parameters (see Fig- ures 5, 6, and 7). We find that PF-SEFI improves with increasing L up to a point, past which it is insensitive to the choice of L (see Figure 6, Appendix C). In our experiments with synthetic data with 25 steps (Figure 2a and Experiment A in Table 1), we observe a clear gap in performance of PF-SEFI and DPF-SGR relative to PFNET and PF. The im- provements over PF are likely due to the bias in PF’s score estimates due to the non-differentiable resampling step, while the improvements over PFNET are likely due to adverse effects of not re- sampling with the correct distribution at each time step. 
While PF-SEFI and DPF-SGR perform similarly on this dataset, the difference is stark in the case of synthetic data with 50 steps (Figure 2b and Experiment B in Table 1). PF-SEFI is invariant to the length of the trajectories used, converging stably; however, all other methods struggle to learn useful models. We postulate that since each of the baselines, in one way or another, differentiates through all time steps of the filter, the variance in their score estimates is too high for good learning through gradient ascent.¹

¹ The authors of DPF-SGR [17] recommend the use of stop gradients not only for particle weights after resampling, i.e., $\tilde{v}^i_t = \bar{v}^{a^i_t} / \perp \bar{v}^{a^i_t}$ [17, Algorithm 1], but also, in the case of bootstrap particle filters, while computing the likelihood ratio $v^i_t = \tilde{v}^i_{t-1}\, p_\theta(x^i_t, y_t \mid x^{a^i_{t-1}}) / \perp q_\theta(x^i_t \mid x^{a^i_{t-1}})$ before resampling, and while sampling from $x^i_t \sim q_\theta(\cdot \mid x^{a^i_{t-1}})$ [17, Section 4.1]. While these additional stop gradients significantly reduce variance, our experiments with them yielded extremely poor overall performance (even with synthetic data with 25 steps). The results we report here thus make use of stop-gradients only for particle weights after resampling.

Table 1: Metrics computed on held out test data comparing PF-SEFI (us) against baselines. We ran experiments using 3 different datasets: (A) synthetic data with 25 steps, (B) synthetic data with 50 steps, and (C) real data with 60 steps. For experiments (A) and (B), we also compare against the performance of the true models. For experiment (C), we compare against the supervised observation model trained using manually labelled trajectories. For MLL, AOTLL, and APTLL, higher values imply better models, while for ADE and AYE, lower values imply better models.

Exp.  Method         MLL              AOTLL            AP TLL removed? (see below)
Exp.  Method         MLL              AOTLL            APTLL            ADE (m)          AYE (rad)
A     TRUE           -3.161 ± 0.003   -2.128           2.674            0.090 ± 0.001    0.014 ± 0.000
      PF-SEFI (us)   -3.147 ± 0.004   -2.285 ± 0.028   2.661 ± 0.014    0.186 ± 0.021    0.016 ± 0.000
      DPF-SGR        -3.159 ± 0.004   -2.265 ± 0.010   2.594 ± 0.027    0.165 ± 0.008    0.014 ± 0.000
      PFNET          -3.225 ± 0.004   -2.487 ± 0.026   2.621 ± 0.013    0.264 ± 0.019    0.017 ± 0.000
      PF             -3.229 ± 0.006   -2.484 ± 0.026   2.576 ± 0.021    0.245 ± 0.021    0.017 ± 0.000
B     TRUE           -3.145 ± 0.002   -2.165           2.693            0.088 ± 0.001    0.012 ± 0.000
      PF-SEFI (us)   -3.141 ± 0.005   -2.283 ± 0.015   2.505 ± 0.042    0.165 ± 0.013    0.014 ± 0.000
      DPF-SGR        -3.966 ± 0.050   -2.636 ± 0.031   0.811 ± 0.130    2.828 ± 0.415    0.142 ± 0.016
      PFNET          -4.169 ± 0.046   -2.901 ± 0.039   0.539 ± 0.077    2.809 ± 0.176    0.148 ± 0.008
      PF             -4.118 ± 0.038   -2.841 ± 0.025   0.681 ± 0.122    2.502 ± 0.042    0.137 ± 0.007
C     SUPERVISED     N/A              -2.224 ± 0.006   N/A              N/A              N/A
      PF-SEFI (us)   -2.447 ± 0.029   -1.973 ± 0.029   N/A              0.275 ± 0.011    0.034 ± 0.006
      DPF-SGR        -3.297 ± 0.287   -2.236 ± 0.218   N/A              0.643 ± 0.177    0.081 ± 0.477
      PFNET          -4.019 ± 0.098   -2.752 ± 0.079   N/A              0.746 ± 0.091    1.015 ± 0.159
      PF             -3.848 ± 0.045   -2.639 ± 0.140   N/A              0.701 ± 0.109    1.082 ± 0.364

The results of our experiments with real data with 60 steps (Figure 2c and Experiment C in Table 1) are consistent with Experiment B (i.e., with experiments on synthetic data with 50 steps) and show that PF-SEFI is able to learn useful models. The learned observation model using PF-SEFI performs even better than the model that was trained offline through supervision with manually labelled data (see AOTLL in Table 1 for Experiment C). We also find that sampling from the learned model produces observations that are qualitatively similar to the real data (Appendix D).
While the supervised model is trained only on the subset of the observations that are labeled (labelling only a subset is common in practical applications due to the cost of labelling), PF-SEFI, by contrast, can leverage all observations in a self-supervised fashion. Moreover, we speculate that the labels contain noise and that the labelling distribution is biased towards observations that are easy to label. Both limitations hinder supervised learning. 6 Discussion, Limitations, and Future Work In this work, we proposed an efficient particle-based method for estimating the score function to learn a wide class of SSMs in a self-supervised way. Compared to previous particle methods that estimate the score, PF-SEFI is more computationally efficient, allowing us to scale to learning mod- els with many parameters. Unlike alternative methods, PF-SEFI is applicable to SSMs where the transition distribution is concentrated on a low-dimensional manifold, allowing us to apply it to a real-world AV object tracking problem. We showed empirically that our method learns better models and is more stable in training than methods that use automatic differentiation to estimate the score, and that we can learn an observation model that outperforms one trained using supervised learning. While this solution is ideal for our problem, it does have a number of limitations. Most notably, it is restricted to maximising the marginal log-likelihood of the data, while differentiating through the filter allows for arbitrary differentiable loss functions. Furthermore, our method is not suitable for estimating the parameters of a proposal distribution. Beyond these algorithmic limitations, in our application, the models that we used were not very expressive. For the observation model, we did not model important phenomena that affect partial observability such as occlusions and we restricted our states and observations to 2D. For the policy, we used a simplified policy with only basic features that are insufficient for controlling an agent in simulation. Furthermore, the policy and the motion model are both specific to vehicles, and currently exclude other road users such as pedestrians. In future work, we aim to scale up our problem setting, by making both models more expressive, and to estimate more state dimensions, such as full 3D poses and sizes of objects. We also believe that learning policies as components of an SSM to explicitly account for observation noise is, in practice, critical for learning good driving behaviour from demonstrations. Such policies could be used as models for predicting the behaviour of other road-users, or to control agents in simulation, and the method we proposed in this work offers an ideal starting point to explore this. 8 Acknowledgments We thank Drago Anguelov, Charles Qi and Congcong Li for their helpful feedback on a draft of the paper. We also thank the reviewers for taking the time to give detailed and useful feedback on our initial submission. Finally, we thank the Waymo Research team and DeepMind for supporting this project. References [1] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics . MIT Press, 2005. [2] T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Auto-encoding sequential Monte Carlo. In International Conference on Learning Representations , 2018. [3] C. J. Maddison, J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. Teh. Filtering variational objectives. Advances in Neural Information Processing Systems , 30, 2017. [4] C. Naesseth, S. 
Linderman, R. Ranganath, and D. Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics , pages 968–977, 2018. [5] P. Karkus, D. Hsu, and W. S. Lee. Particle filter networks with application to visual localization. In Conference on Robot Learning , pages 169–178, 2018. [6] A. Corenflos, J. Thornton, G. Deligiannidis, and A. Doucet. Differentiable particle filtering via entropy-regularized optimal transport. In International Conference on Machine Learning , pages 2100–2111, 2021. [7] J. Lai, J. Domke, and D. Sheldon. Variational marginal particle filters. In International Con- ference on Artificial Intelligence and Statistics , pages 875–895, 2022. [8] G. Poyiadjis, A. Doucet, and S. S. Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika , 98(1):65–80, 2011. [9] N. Kantas, A. Doucet, S. S. Singh, J. Maciejowski, and N. Chopin. On particle methods for parameter estimation in state-space models. Statistical science , 30(3):328–351, 2015. [10] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Joze- fowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Man ́ e, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Van- houcke, V. Vasudevan, F. Vi ́ egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/ . Software available from tensorflow.org. [11] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Te- jani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips. cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf . [12] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering , 12(656-704):3, 2009. [13] N. J. Gordon, D. J. Salmond, and A. F. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing) , 140(2):107– 113, 1993. [14] M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association , 94(446):590–599, 1999. [15] E. L. Ionides, C. Bret ́ o, and A. A. King. Inference for nonlinear dynamical systems. Proceed- ings of the National Academy of Sciences , 103(49):18438–18443, 2006. 9 [16] J. Olsson and J. Westerborn. Efficient particle-based online smoothing in general hidden markov models: the PaRIS algorithm. Bernoulli , 23(3):1951–1996, 2017. [17] A. ́ Scibior and F. Wood. Differentiable particle filtering without modifying the forward pass. arXiv preprint arXiv:2106.10314 , 2021. [18] G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space model. In A. Doucet, N. De Freitas, and N. Gordon, editors, Sequential Monte Carlo Methods in Practice , pages 177–195. Springer, 2001. [19] J. Olsson, O. Capp ́ e, R. Douc, and E. Moulines. 
Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli , 14(1):155– 179, 2008. [20] X. Ma, P. Karkus, D. Hsu, and W. S. Lee. Particle filter recurrent neural networks. In AAAI Conference on Artificial Intelligence , 2020. [21] X. Ma, P. Karkus, D. Hsu, W. S. Lee, and N. Ye. Discriminative particle filter reinforcement learning for complex partial observations. In International Conference on Learning Represen- tations , 2020. [22] R. Jonschkowski, D. Rastogi, and O. Brock. Differentiable particle filters: End-to-end learning with algorithmic priors. In Proceedings of Robotics: Science and Systems , 2018. [23] M. Zhu, K. Murphy, and R. Jonschkowski. Towards differentiable resampling. arXiv preprint arXiv:2004.11938 , 2020. [24] A. Kloss, G. Martius, and J. Bohg. How to train your differentiable filter. Autonomous Robots , 45(4):561–578, 2021. [25] K. Granstrom, M. Baum, and S. Reuter. Extended object tracking: Introduction, overview and applications. arXiv preprint arXiv:1604.00970 , 2016. [26] R. S. S. Blackman, Popoli. Design and Analysis of Modern Tracking Systems . Artech House, Boston, 1999. [27] K. Granstr ̈ om, S. Reuter, D. Meissner, and A. Scheel. A multiple model PHD approach to tracking of cars under an assumed rectangular shape. In 17th International Conference on Information Fusion (FUSION) . IEEE, 2014. [28] P. Del Moral. Feynman-Kac Formulae . Springer, 2004. [29] F. Lindsten, P. Bunch, S. S. Singh, and T. B. Sch ̈ on. Particle ancestor sampling for near- degenerate or intractable state transition models. arXiv preprint arXiv:1505.06356 , 2015. [30] S. G. Krantz and H. R. Parks. Geometric Integration Theory . Springer Science & Business Media, 2008. [31] A. L. Caterini, G. Loaiza-Ganem, G. Pleiss, and J. P. Cunningham. Rectangular flows for manifold learning. Advances in Neural Information Processing Systems , 34, 2021. [32] K. M. Lynch and F. C. Park. Modern Robotics . Cambridge University Press, 2017. 10 (a) Input features provided to the observation model’s neural network. (b) Output representation of an observed point y i t us- ing the triplet ( e i t , α i t ( e i t ) , β i t ( e i t ) ) . Figure 3: Input and output representations for the observation model. A Observation Model For the problem setting described in Section 5, our observation model is concerned with the distribu- tion of 2D points around the peripheries of observed road users. Such models are reviewed in [25]. Let the i th point from an observation y t be y i t . We assumed independence across different points, i.e. g θ ( y t | x t ) = ∏ i g θ ( y i t | x t ) . (12) In order to conveniently sample y i t or to measure its likelihood, we represented each point via a triplet, ( e i t , α i t ( e i t ) , β i t ( e i t )) , where e i t ∈ { 0 , 1 , 2 , 3 } is a categorical variable encoding the edge of the road user’s bounding box, α i t ( e i t ) is the corresponding parallel offset, and β i t ( e i t ) is the corresponding perpendicular offset (see Figure 3b). 
Such a triplet uniquely determines $y^i_t$, and we used the following generative model to sample such points:
$$e^i_t \sim \mathrm{Categorical}\big(\phi^0_\theta(x_t), \phi^1_\theta(x_t), \phi^2_\theta(x_t), \phi^3_\theta(x_t)\big),$$
$$\text{if } e^i_t = 0:\quad \alpha^i_t \sim \mathrm{Uniform}(0, w), \quad \beta^i_t \sim \mathrm{Laplace}\big(\mu = \phi^4_\theta(x_t),\, b = \phi^5_\theta(x_t)\big),$$
$$\text{if } e^i_t = 1:\quad \alpha^i_t \sim \mathrm{Uniform}(0, l), \quad \beta^i_t \sim \mathrm{Laplace}\big(\mu = \phi^6_\theta(x_t),\, b = \phi^7_\theta(x_t)\big),$$
$$\text{if } e^i_t = 2:\quad \alpha^i_t \sim \mathrm{Uniform}(0, w), \quad \beta^i_t \sim \mathrm{Laplace}\big(\mu = \phi^8_\theta(x_t),\, b = \phi^9_\theta(x_t)\big),$$
$$\text{if } e^i_t = 3:\quad \alpha^i_t \sim \mathrm{Uniform}(0, l), \quad \beta^i_t \sim \mathrm{Laplace}\big(\mu = \phi^{10}_\theta(x_t),\, b = \phi^{11}_\theta(x_t)\big),$$
where $l$ and $w$ are the length and the width of the road user's bounding box respectively, and $\phi^{0:11}_\theta(x_t)$ are the outputs of a neural network with weights $\theta$. While the mapping $(e^i_t, \alpha^i_t(e^i_t), \beta^i_t(e^i_t)) \to y^i_t$ is unique, the reverse mapping has four representations depending on the choice of $e^i_t$, i.e., $e^i_t$ is a latent variable. The likelihood of a point, $g_\theta(y^i_t \mid x_t)$, was therefore obtained by
$$g_\theta(y^i_t \mid x_t) = \sum_{e^i_t = 0}^{3} \phi^{e^i_t}_\theta(x_t)\, p_\theta\big(\alpha^i_t(e^i_t), \beta^i_t(e^i_t) \mid e^i_t, x_t\big), \quad (13)$$
where we project $y^i_t \to (e^i_t, \alpha^i_t(e^i_t), \beta^i_t(e^i_t))$ for each $e^i_t \in \{0, 1, 2, 3\}$ and marginalise over it. We use a feed-forward neural network with 4 hidden layers, each with 16 Tanh units, which outputs the 12 parameters $\phi^{0:11}_\theta(x_t)$. We provide the network with 5 input features (range, bearing, relative bearing, length, and width) of the road user's bounding box as measured from the observing AV's viewpoint (see Figure 3a).

For each of the experiments in Section 5 (with synthetic and real data), we used the same observation model design. Moreover, we trained an observation model using supervision from a dataset of manually labelled trajectories, $\bar{x}_{0:T}$, by maximising the AOTLL. This model was then used to generate the synthetic data and also as a baseline for the experiments with real data.

B Transition Model

As described in Section 4, the class of SSMs that we are concerned with involves a transition function, $f_\theta(x_t \mid x_{t-1})$, that factorises into a policy which produces actions, $\pi_\theta(a_t \mid x_{t-1})$, and a deterministic, differentiable, and injective motion model, such that $x_t = \tau(x_{t-1}, a_t)$. In this section, we provide more details on the motion model and on the policies used for the experiments described in Section 5.

B.1 Motion Model

All experiments used a state space that consists of the 2D pose $(x, y, \theta)$ of an observed road user, in addition to its instantaneous linear speed $v$ and curvature $\kappa$. Moreover, the action space used consists of linear acceleration $a$ and pinch $p$, i.e., the instantaneous rate of change of curvature. This choice of actions, i.e., acceleration and pinch, naturally maps to the controls exercised by road users (vehicles in particular), i.e., to gas and rate of change of steering respectively. The calculations below compute the next state, $(x(t), y(t), \theta(t), v(t), \kappa(t))$, from the previous state $(x_0, y_0, \theta_0, v_0, \kappa_0)$ under the influence of constant actions $(a, p)$ for $t \in [0, \Delta t]$. Clearly,
$$\dot{\kappa}(t) = p \;\Rightarrow\; \kappa(t) = \kappa_0 + pt, \quad (14)$$
and
$$\dot{v}(t) = a \;\Rightarrow\; v(t) = v_0 + at. \quad (15)$$
Since
$$\dot{\theta}(t) = v(t)\, \kappa(t), \quad (16)$$
we have
$$\dot{\theta}(t) = v_0 \kappa_0 + (v_0 p + a \kappa_0)\, t + a p\, t^2, \quad (17)$$
$$\Rightarrow\; \theta(t) = \theta_0 + v_0 \kappa_0\, t + (v_0 p + a \kappa_0)\, \frac{t^2}{2} + a p\, \frac{t^3}{3}. \quad (18)$$
Finally, using
$$\dot{x}(t) = v(t) \cos\theta(t), \qquad \dot{y}(t) = v(t) \sin\theta(t), \quad (19)$$
we have
$$x(t) = x_0 + \int_0^t v(s) \cos\theta(s)\, \mathrm{d}s, \quad (20)$$
and
$$y(t) = y_0 + \int_0^t v(s) \sin\theta(s)\, \mathrm{d}s \quad (21)$$
$$\phantom{y(t)} = y_0 + \int_0^t v(s) \cos\!\Big(\frac{\pi}{2} - \theta(s)\Big)\, \mathrm{d}s. \quad (22)$$
To make these integrals analytically tractable, we drop the cubic term in $\theta(t)$, i.e., $a p t^3/3$, and use the integral $\int (a + bs) \cos(c + ds + es^2)\, \mathrm{d}s$ with appropriate coefficients to compute $x(t)$ and $y(t)$ (the closed-form integral was obtained using Wolfram Alpha). This approximation is justified as, for the experiments described in Section 5, we use a small $\Delta t$ of approximately 0.33 seconds.
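The following is a minimal NumPy sketch of this motion model. It keeps the cubic term in $\theta(t)$ and evaluates the position integrals by simple numerical quadrature rather than the closed-form integral used above; the default $\Delta t$ and the quadrature resolution are assumptions of this illustration.

```python
import numpy as np

def motion_step(state, action, dt=0.33, n_quad=20):
    """Advance (x, y, theta, v, kappa) under constant acceleration a and
    pinch p for dt seconds (sketch of the Appendix B.1 motion model)."""
    x0, y0, th0, v0, k0 = state
    a, p = action
    s = np.linspace(0.0, dt, n_quad)                  # quadrature nodes on [0, dt]
    v_s = v0 + a * s                                  # speed, as in (15)
    th_s = (th0 + v0 * k0 * s                         # heading, as in (18), cubic term kept
            + (v0 * p + a * k0) * s**2 / 2.0
            + a * p * s**3 / 3.0)
    x = x0 + np.trapz(v_s * np.cos(th_s), s)          # numerical version of (20)
    y = y0 + np.trapz(v_s * np.sin(th_s), s)          # numerical version of (21)
    return np.array([x, y, th_s[-1], v_s[-1], k0 + p * dt])

# Example: a vehicle at 10 m/s, gently accelerating while tightening its turn.
print(motion_step(state=(0.0, 0.0, 0.0, 10.0, 0.02), action=(0.5, 0.001)))
```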
B.2 Policy

For synthetic data, we designed a simple state-dependent stochastic policy that modulated its mean acceleration and pinch as a function of speed. The mapping from state to actions is described in Figure 4. The policy then had 2 learnable parameters: the standard deviations of its acceleration and pinch. The advantage of having only 2 learnable parameters is that it allowed us to easily verify if a method was converging to the correct values or not.

On real data, we used a more expressive policy which produces a multivariate Gaussian distribution over acceleration and pinch conditioned on its inputs. The policy's inputs are the instantaneous speed and curvature of an object, which are then fed to a 3-layer feedforward neural network, with 2 hidden layers with 32 ReLU units, outputting 5 parameters: the means of acceleration and pinch, and the 3 elements of the lower triangular matrix representation of the covariance of the two. This gives a total of 1,317 learnable parameters.

Figure 4: State to action mapping of the simple policy. The solid line represents the mean action taken at a given speed, while the shaded regions represent one standard deviation of Gaussian noise around that action. At a high level, the policy accelerates at low speeds and decelerates at higher speeds. The policy also applies less pinch at higher speeds.

Figure 5: Marginal Log Likelihood (MLL) of synthetic test data for models trained using PF-SEFI with different values for the smoothing lag (L) plotted against the corresponding training steps. (a) MLL of test synthetic data with 25 step trajectories. (b) MLL of test synthetic data with 50 step trajectories. (c) MLL of test real data with 60 step trajectories using 2048 training particles.

C Training Setup

In this section, we present our training setup for the experiments described in Section 5. We show sample trajectories from each of the datasets (synthetic and real), and provide the set of hyper-parameters that were used for each of the experiments. All experiments (including the ones used for picking the best set of hyper-parameters) were repeated 10 times to obtain the median and interquartile ranges that are shown in Figures 2, 5, and 7, and in Table 1. All experiments used the default settings of the Adam optimiser in TensorFlow with a learning rate of 0.01. We found that smaller learning rates yield similar results, but require proportionately longer training time, while larger learning rates cause instability.

C.1 Training on Synthetic Data

Sample trajectories from the generated synthetic data are shown in Figure 8.
These samples were generated using a hand-crafted policy (see Section B.2), the motion model derived in Section B.1, and an observation model trained with supervised learning (see Section A). We generated two datasets (with 25 and 50 steps respectively), each containing 10 scenes with 100 objects each for training, and 2 scenes also with 100 objects each for evaluation. For every train step, we used all 100 objects from a single randomly sampled training scene, while for every evaluation step, we used all 100 objects from both evaluation scenes. Table 2 tabulates the set of hyper-parameters that were used for the experiments discussed in Section 5. Figures 5a and 5b show the effect of different values for the smoothing lag hyper-parameter ( L ) for PF-SEFI on synthetic data, while 5c shows the effect of L on real data. We observed that for 13 Figure 6: Maximum MLL of test real data with 60 step trajectories using PF-SEFI sweeping over different values for the smoothing lag ( L ) using different numbers of particles for training. (a) MLL of test synthetic data with 25 step trajectories using PFNET with different values for the trade- off parameter ( α ). (b) MLL of test synthetic data with 50 step trajectories using PFNET with different values for the trade- off parameter ( α ). (c) MLL of test real data with 60 step trajectories using PFNET with different values for the trade-off pa- rameter ( α ). Figure 7: Marginal Log Likelihood (MLL) of synthetic and real test data for models trained using PFNET with different values for the trade-off parameter ( α ) plotted against the corresponding training steps. synthetic data, the performance started to plateau at L = 8 , and hence picked L = 8 for the final experiments. On real data, performance continues to improve at higher values of L up to around L = 14 and we also note that variance in training is high when L is too small. Moreover, Figures 7a and 7b show the effect of different values for the trade-off parameter ( α ) for PFNET. While learning failed on synthetic data with 50 steps, we observed that α = 0 . 8 marginally outperformed α = 1 . 0 and significantly outperformed α = 0 . 6 . C.2 Training on Real Data For training on real data, we used a dataset of real-world road-user trajectories observed by an AV. Many of the frames in this set were also labelled manually by human-labelers, allowing us to Table 2: Hyper-parameters used for experiments A (synthetic data with 25 step trajectories) and B (synthetic data with 50 step trajectories). Smoothing lag ( L ) is only relevant for PF-SEFI, and the trade-off parameter ( α ) is only relevant for PFNET. Hyper-Parameter Value Learning Rate 0.01 Number of Epochs 100 Smoothing Lag ( L ) for PF-SEFI 8 Trade-off Parameter ( α ) for PFNET 0.8 Number of Particles for Training 1024 Number of Particles for Evaluation 4096 14 (a) (b) Figure 8: Examples of synthetic trajectories. The figure shows 3 snapshots of the state of multiple objects, each occurring 10 steps apart from each other. The green boxes are the true states of the objects at different timesteps, while the blue polygons are the observations that the AV (orange) makes. The green dots are the true ( x, y ) coordinates of the objects at all timesteps. compute relevant tracking metrics such as ADE, AYE, and AOTLL under the different models that we learned. All trajectories were approximately 20 seconds long, where each step corresponded to 0.33 seconds in real time, giving 60 steps in discrete time. 
The dataset consisted of a training set with 1502 trajectories, and a test set containing 404 trajectories. Figure 9 shows some example trajectories. We ran hyper-parameter sweeps over learning rates, the smoothing lag ( L ) for PF-SEFI, the trade- off parameter ( α ) for PFNET, and the number of training particles. The final results are using the best settings of these parameters which we list in Table 3, though we note that results were quite insensitive to most of these settings. Figure 6 shows the effect of L on the maximum achieved test MLL on real data for different numbers of training particles. As can be seen, PF-SEFI does significantly better as L increases from 0 to 10, highlighting the added benefit of smoothing. On the other hand, it is quite insensitive to the smoothing lag L beyond this point, except that the variance appears to increase when L gets too large. We also find that PF-SEFI consistently improves with more particles. 15 (a) The AV in a crowded multi- lane road surrounded by mostly stationary objects. (b) The AV decelerating, yielding for other objects at an intersection. Figure 9: Examples of real-data trajectories. The figure shows 3 snapshots of the state of multiple objects, each occurring 5 seconds apart from each other. The green boxes are the labelled bounding boxes of objects at different timesteps, while the blue polygons are the observations that the AV (orange) makes. The dots represent the labelled ( x, y ) coordinates of the objects over time. Figure 10: Sampled observations on the same labelled states from the real data that were shown in Figure 9. Notice the qualitative similarity in the observations in this figure relative to Figure 9. 16 Table 3: Hyper-parameters used for experiment C (real data with 60 step trajectories). Smoothing lag ( L ) is only relevant for PF-SEFI, and the trade-off parameter ( α ) is only relevant for PFNET. Hyper-Parameter Value Learning Rate 0.01 Global Grad Norm Clipping 0.5 Number of Epochs 15 Smoothing Lag ( L ) for PF-SEFI 15 Trade-off Parameter ( α ) for PFNET 0.8 Number of Particles for Training 4096 Number of Particles for Evaluation 4096 (a) MLL of test real data after training models using 30 step sequences. (b) MLL of test real data after training models using 15 step sequences. Figure 11: Marginal Log Likelihood (MLL) of real test data for models trained on shorter sequences plotted against the corresponding training steps. When training on real data, all methods were subject to very large gradients at times, especially earlier on when failing to track objects is much more likely, leading to very high losses and cor- responding gradients. In order to stabilise training, we clipped the maximum global norm of all gradients to 0.5. This is particularly necessary for the methods that require differentiating through the filter. D Sampled Observations from Learned Observation Model Figure 10 shows sampled observations from the observation model learned on real data using PF- SEFI. These observations were sampled using the checkpoint that produced the highest MLL, and for the same (labelled) states that were shown in Figure 9. The qualitative similarity between the real and sampled observations indicates the efficacy of our method for learning generative models that can be used to sample observations in closed-loop simulation. 
E Training on Shorter Sequences In the case of synthetic data, we note that some methods such as DPF-SGR performed poorly when trained on 50 step sequences (Figure 2b), however performed much better when trained on shorter 25 step sequences (Figure 2a). We conducted similar experiments on real data to see if a similar improvement can be attained. Figure 11 shows the results when training DPF-SGR and other base- lines on 30 and 15 step sequences of real data (instead of 60). At 30 steps (Figure 11a), we find that all 3 baselines still fail to learn good models, while PF-SEFI performs almost as well as on length 60 sequences. At 15 steps (Figure 11b), however, the baselines do actually perform much better, though the final performance for all methods is worse than PF-SEFI on 60 steps. 17 Table 4: Metrics computed on held out synthetic test data comparing PF-SEFI (us) against baselines DPF-SGR, PFNET, and vanilla PF, on two experiments - (A) learning from synthetic data with 25 steps, and (B) learning from synthetic data with 50 steps. The Smoothing ADE and AYE reported here are the same as in Table 1. They are contrasted with the Filtering ADE and AYE, which measure the performance of the learned models in the online setting of state estimation. Exp. Method Smoothing ADE (m) Filtering ADE (m) Smoothing AYE (rad) Filtering AYE (rad) A TRUE 0.090 ± 0.001 0.144 ± 0.001 0.014 ± 0.000 0.034 ± 0.000 PF-SEFI (us) 0.186 ± 0.021 0.233 ± 0.009 0.016 ± 0.000 0.039 ± 0.000 DPF-SGR 0.165 ± 0.008 0.222 ± 0.008 0.014 ± 0.000 0.035 ± 0.001 PFNET 0.264 ± 0.019 0.333 ± 0.030 0.017 ± 0.000 0.047 ± 0.001 PF 0.245 ± 0.021 0.345 ± 0.024 0.017 ± 0.000 0.047 ± 0.001 B TRUE 0.088 ± 0.001 0.154 ± 0.001 0.012 ± 0.000 0.038 ± 0.000 PF-SEFI (us) 0.165 ± 0.013 0.239 ± 0.007 0.014 ± 0.000 0.043 ± 0.000 DPF-SGR 2.828 ± 0.415 2.888 ± 0.232 0.142 ± 0.016 0.189 ± 0.008 PFNET 2.809 ± 0.176 3.133 ± 0.108 0.148 ± 0.008 0.213 ± 0.009 PF 2.502 ± 0.042 3.207 ± 0.106 0.137 ± 0.007 0.212 ± 0.008 This highlights an advantage of PF-SEFI, which is that it is relatively invariant to the length of sequences that it is trained on. Depending on the problem setting, there is usually a minimum sequence length required to obtain enough information to learn the correct models. If that sequence length is longer than the maximum sequence length for which a method such as DPF-SGR is stable to train, then one must sacrifice model quality for stable learning by cutting the sequences to shorter subsequences or by subsampling the sequences, throwing away some of the observations. In this case, shortening the sequences to 15 steps allowed for reasonable (though suboptimal) models to be learned using the baseline methods. In other problems it may well be the case that even more trimming and/or subsampling would be needed. F Performance on the Filtering Task In Section 5, we considered metrics such as ADE and AYE that pertain to the task of state estimation in the offline setting (known as smoothing), i.e., where we assume access to the entire sequence of observations, i.e. y 1: T . In this section, we additionally consider the online setting (known as filtering) where the task is state estimation at each time step t , with observations from time step 0 up to time step t , i.e. p ( x t | y 0: t ) . This setting is relevant for the use of our learned models on-board the AV. Table 4 reports ADE and AYE using both the smoothing and filtering distribution for the offline and online task of state estimation respectively. 
The patterns are unchanged relative to the ones observed in Section 5 and in Table 1. However, these results highlight the applicability of the learned in both settings. G Training with Higher Dimensional Observations In Section 5, we reported experiments with 32 dimensional observations (16 2D points). In this section, we report additional experiments with even higher dimensional observations (32 and 64 2D points, i.e., 64 and 128 dimensional observations respectively) on the synthetic dataset with 25 steps, trained using PF-SEFI. In Table 5 we summarise the empirical findings of these experiments. In each case, the learned models match the performance of the true models as measured by MLL. These results suggest that PF-SEFI scales well with higher dimensional observations. H Effect of a Noisy AV State on Learning In Section 5, we assume that the AV state is known precisely. In practice, we expect there to be some minimal errors in state estimation. To test our sensitivity to the same, we ran additional experiments with the 25 steps synthetic dataset by injecting Gaussian noise (with a standard deviation of 0.5m in x and y , and 0.05rad in θ ) in the AV’s 2D pose. We retrained our models using PF-SEFI in the 18 Table 5: MLL computed on held out synthetic test data comparing PF-SEFI (us) against the true models on synthetic dataset generated with 32 (A), 64 (D), and 128 (E) dimensional observations. Exp. Method MLL A TRUE -3.161 ± 0.003 PF-SEFI (us) -3.152 ± 0.006 D TRUE -3.171 ± 0.003 PF-SEFI (us) -3.163 ± 0.007 E TRUE -3.188 ± 0.003 PF-SEFI (us) -3.177 ± 0.006 Figure 12: Marginal Log Likelihood (MLL) of synthetic test data using models trained on synthetic data with noisy AV states. presence of such noise, and observe no change in MLL at convergence over the held out test data (see Figure 12), nor in training stability. 19