Variational Gaussian Process State-Space Models Roger Frigola, Yutian Chen and Carl E. Rasmussen Department of Engineering University of Cambridge {rf342,yc373,cer54}@cam.ac.uk Abstract State-space models have been successfully used for more than fifty years in differ- ent areas of science and engineering. We present a procedure for efficient varia- tional Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possi- bility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach com- bining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series. 1 Introduction State-space models (SSMs) are a widely used class of models that have found success in applications as diverse as robotics, ecology, finance and neuroscience (see, e.g., Brown et al. [3]). State-space models generalize other popular time series models such as linear and nonlinear auto-regressive models: (N)ARX, (N)ARMA, (G)ARCH, etc. [21]. In this article we focus on Bayesian learning of nonparametric nonlinear state-space models. In particular, we use sparse Gaussian processes (GPs) [19] as a convenient method to encode general assumptions about the dynamical system such as continuity or smoothness. In contrast to conven- tional parametric methods, we allow the user to easily trade off model capacity and computation time. Moreover, we present a variational training procedure that allows very complex models to be learned without risk of overfitting. Our variational formulation leads to a tractable approximate posterior over nonlinear dynamical systems. This approximate posterior can be used to compute fast probabilistic predictions of future trajectories of the dynamical system. The computational complexity of our learning approach is linear in the length of the time series. This is possible thanks to the use of variational sparse GPs [22] which lead to a smoothing problem for the latent state trajectory in a simpler auxiliary dynamical system. Smoothing in this auxiliary system can be carried out with any conventional technique (e.g. sequential Monte Carlo). In addition, we present a stochastic variational inference procedure [10] to accelerate learning for long time series and we also present an online learning scheme. This work is useful in situations where: 1) it is important to know how uncertain future predictions are, 2) there is not enough knowledge about the underlying nonlinear dynamical system to create a principled parametric model, and 3) it is necessary to have an explicit model that can be used to simulate the dynamical system into the future. These conditions arise often in engineering and finance. For instance, consider an autonomous aircraft adapting its flight control when carrying a large external load of unknown weight and aerodynamic characteristics. A model of the nonlinear dynamics of the new system can be very useful in order to automatically adapt the control strategy. When few data points are available, there is high uncertainty about the dynamics. In this situation, 1 arXiv:1406.4905v2 [cs.LG] 3 Nov 2014 a model that quantifies its uncertainty can be used to synthesize control laws that avoid the risks of overconfidence. The problem of learning flexible models of nonlinear dynamical systems has been tackled from multiple perspectives. Ghahramani and Roweis [9] presented a maximum likelihood approach to learn nonlinear SSMs based on radial basis functions. This work was later extended by using a parameterized Gaussian process point of view and developing tailored filtering algorithms [6, 7, 23]. Approximate Bayesian learning has also been developed for parameterized nonlinear SSMs [5, 24]. Wang et al. [25] modeled the nonlinear functions in SSMs using Gaussian processes (GP-SSMs) and found a MAP estimate of the latent variables and hyperparameters. Their approach preserved the nonparametric properties of Gaussian processes. Despite using MAP learning over state trajectories, overfitting was not an issue since it was applied in a dimensionality reduction context where the latent space of the SSM was much smaller than the observation space. In a similar vein, [4, 12] presented a hierarchical Gaussian process model that could model linear dynamics and nonlinear mappings from latent states to observations. More recently, Frigola et al. [8] learned GP-SSMs in a fully Bayesian manner by employing particle MCMC methods to sample from the smoothing distribution. However, their approach led to predictions with a computational cost proportional to the length of the time series. In the rest of this article, we present an approach to variational Bayesian learning of flexible non- linear state-space models which leads to a simple representation of the posterior over nonlinear dynamical systems and results in predictions having a low computational complexity. 2 Gaussian Process State-Space Models We consider discrete-time nonlinear state-space models built with deterministic functions and addi- tive noise xt+1 = f(xt) + vt, (1a) yt = g(xt) + et. (1b) The dynamics of the system are defined by the state transition function f(xt) and independent additive noise vt (process noise). The states xt ∈RD are latent variables such that all future variables are conditionally independent on the past given the present state. Observations yt ∈RE are linked to the state via another deterministic function g(xt) and independent additive noise et (observation noise). State-space models are stochastic dynamical processes that are useful to model time series y ≜{y1, ..., yT }. The deterministic functions in (1) can also take external known inputs (such as control signals) as an argument but, for conciseness, we will omit those in our notation. A traditional approach to learn f and g is to restrict them to a family of parametric functions. This is particularly appropriate when the dynamical system is very well understood, e.g. orbital mechanics of a spacecraft. However, in many applications, it is difficult to specify a class of parametric models that can provide both the ability to model complex functions and resistance to overfitting thanks to an easy to specify prior or regularizer. Gaussian processes do have these properties: they can represent functions of arbitrary complexity and provide a straightforward way to specify assumptions about those unknown functions, e.g. smoothness. In the light of this, it is natural to place Gaussian process priors over both f and g [25]. However, the extreme flexibility of the two Gaussian processes leads to severe nonidentifiability and strong correlations between the posteriors of the two unknown functions. In the rest of this paper we will focus on a model with a GP prior over the transition function and a parametric likelihood. However, our variational formulation can also be applied to the double GP case (see supplementary material). A probabilistic state-space model with a Gaussian process prior over the transition function and a parametric likelihood is specified by f(x) ∼GP mf(x), kf(x, x′)  , (2a) xt | ft ∼N(xt | ft, Q), (2b) x0 ∼p(x0) (2c) yt | xt ∼p(yt | xt, θy), (2d) where we have used ft ≜f(xt−1). Since f(x) ∈RD, we use the convention that the covariance function kf returns a D ×D matrix. We group all hyperparameters into θ ≜{θf, θy, Q}. Note that 2 0 time states 0 time 0 time 0 time Figure 1: State trajectories from four 2-state nonlinear dynamical systems sampled from a GP-SSM prior with fixed hyperparameters. The same prior generates systems with qualitatively different behaviors, e.g. the leftmost panel shows behavior similar to that of a non-oscillatory linear system whereas the rightmost panel appears to have arisen from a limit cycle in a nonlinear system. we are not restricting the likelihood (2d) to any particular form. The joint distribution of a GP-SSM is p(y, x, f) = p(x0) T Y t=1 p(yt|xt)p(xt|ft)p(ft|f1:t−1, x0:t−1), (3) where we use the convention f1:0 = ∅and omit the conditioning on θ in the notation. The GP on the transition function induces a distribution over the latent function values with the form of a GP predictive: p(ft|f1:t−1, x0:t−1) = N mf(xt−1) + Kt−1,0:t−2K−1 0:t−2,0:t−2(f1:t−1 −mf(x0:t−2)), Kt−1,t−1 −Kt−1,0:t−2K−1 0:t−2,0:t−2K⊤ t−1,0:t−2  , (4) where the subindices of the kernel matrices indicate the arguments to the covariance function nec- essary to build each matrix, e.g. Kt−1,0:t−2 = [kf(xt−1, x0) . . . kf(xt−1, xt−2)]. When t = 1, the distribution is that of a GP marginal p(f1|x0) = N(mf(x0), kf(x0, x0)). Equation (3) provides a sequential procedure to sample state trajectories and observations. GP- SSMs are doubly stochastic models in the sense that one could, at least notionally, first sample a state transition dynamics function from eq. (2a) and then, conditioned on that function, sample the state trajectory and observations. GP-SSMs are a very rich prior over nonlinear dynamical systems. In Fig. 1 we illustrate this concept by showing state trajectories sampled from a GP-SSM with fixed hyperparameters. The dynamical systems associated with each of these trajectories are qualitatively very different from each other. For instance, the leftmost panel shows the dynamics of an almost linear non-oscillatory system whereas the rightmost panel corresponds to a limit cycle in a nonlinear system. Our goal in this paper is to use this prior over dynamical systems and obtain a tractable approximation to the posterior over dynamical systems given the data. 3 Variational Inference in GP-SSMs Since the GP-SSM is a nonparametric model, in order to define a posterior distribution over f(x) and make probabilistic predictions it is necessary to first find the smoothing distribution p(x0:T |y1:T ). Frigola et al. [8] obtained samples from the smoothing distribution that could be used to define a predictive density via Monte Carlo integration. This approach is expensive since it requires averag- ing over L state trajectory samples of length T. In this section we present an alternative approach that aims to find a tractable distribution over the state transition function that is independent of the length of the time series. We achieve this by using variational sparse GP techniques [22]. 3.1 Augmenting the Model with Inducing Variables As a first step to perform variational inference in a GP-SSM, we augment the model with M inducing points u ≜{ui}M i=1. Those inducing points are jointly Gaussian with the latent function values. In the case of a GP-SSM, the joint probability density becomes p(y, x, f, u) = p(x, f|u) p(u) T Y t=1 p(yt|xt), (5) 3 where p(u) = N(u | 0, Ku,u) (6a) p(x, f|u) = p(x0) T Y t=1 p(ft|f1:t−1, x0:t−1, u)p(xt|ft), (6b) T Y t=1 p(ft|f1:t−1, x0:t−1, u) = N f1:T | K0:T −1,uK−1 u,uu, K0:T −1−K0:T −1,uK−1 u,uK⊤ 0:T −1,u  . (6c) Kernel matrices relating to the inducing points depend on a set of inducing inputs {zi}M i=1 in such a way that Ku,u is an MD × MD matrix formed with blocks kf(zi, zj) having size D × D. For brevity, we use a zero mean function and we omit conditioning on the inducing inputs in the notation. 3.2 Evidence Lower Bound of an Augmented GP-SSM Variational inference [1] is a popular method for approximate Bayesian inference based on making assumptions about the posterior over latent variables that lead to a tractable lower bound on the evidence of the model (sometimes referred to as ELBO). Maximizing this lower bound is equivalent to minimizing the Kullback-Leibler divergence between the approximate posterior and the exact one. Following standard variational inference methodology, [1] we obtain the evidence lower bound of a GP-SSM augmented with inducing points log p(y|θ) ≥ Z x,f,u q(x, f, u) log p(u)p(x0) QT t=1 p(ft|f1:t−1, x0:t−1, u)p(yt|xt)p(xt|ft) q(x, f, u) . (7) In order to achieve tractability, we use a variational distribution that factorizes as q(x, f, u) = q(u)q(x) T Y t=1 p(ft|f1:t−1, x0:t−1, u), (8) where q(u) and q(x) can take any form but the terms relating to f are taken to match those of the prior (3). As a consequence, the difficult p(ft|...) terms inside the log cancel out and lead to the following lower bound L(q(u), q(x),θ) = −KL(q(u)∥p(u)) + H(q(x)) + Z x q(x) log p(x0) + T X t=1  Z x,u q(x)q(u) Z ft p(ft|xt−1, u) log p(xt|ft) | {z } Φ(xt,xt−1,u) + Z x q(x) log p(yt|xt)  (9) where KL denotes the Kullback-Leibler divergence and H the entropy. The integral with respect to ft can be solved analytically: Φ(xt, xt−1, u) = −1 2tr(Q−1Bt−1) + log N(xt|At−1u, Q) where At−1 = Kt−1,uK−1 u,u, and Bt−1 = Kt−1,t−1 −Kt−1,uK−1 u,uKu,t−1. As in other variational sparse GP methods, the choice of variational distribution (8) gives the abil- ity to precisely learn the latent function at the locations of the inducing inputs. Away from those locations, the posterior takes the form of the prior conditioned on the inducing variables. By in- creasing the number of inducing variables, the ELBO can only become tighter [22]. This offers a straightforward trade-off between model capacity and computation cost without increasing the risk of overfitting. 3.3 Optimal Variational Distribution for u The optimal distribution of q(u) can be found by setting to zero the functional derivative of the evidence lower bound with respect to q(u) q∗(u) ∝p(u) T Y t=1 exp{⟨log N(xt|At−1u, Q)⟩q(x)}, (10) 4 where ⟨·⟩q(x) denotes an expectation with respect to q(x). The optimal variational distribution q∗(u) is, conveniently, a multivariate Gaussian distribution. If, for simplicity of notation, we restrict ourselves to D = 1 the natural parameters of the optimal distribution are η1 = Q−1 T X t=1 ⟨AT t−1xt⟩q(xt,xt−1), η2 = −1 2 K−1 uu + Q−1 T X t=1 ⟨AT t−1At−1⟩q(xt−1) ! . (11) The mean and covariance matrix of q∗(u), denoted as µ and Σ respectively, can be computed as µ = Ση1 and Σ = (−2η2)−1. Note that the optimal q(u) depends on the sufficient statistics Ψ1 = PT t=1⟨KT t−1,uxt⟩q(xt,xt−1) and Ψ2 = PT t=1⟨KT t−1,uKt−1,u⟩q(xt−1). 3.4 Optimal Variational Distribution for x In an analogous way as for q∗(u), we can obtain the optimal form of q(x) q∗(x) ∝p(x0) T Y t=1 p(yt|xt) exp{−1 2tr Q−1(Bt−1 + At−1ΣAT t−1)  } N(xt|At−1µ, Q), (12) where, in the second equation, we have used q(u) = N(u|µ, Σ). The optimal distribution q∗(x) is equivalent to the smoothing distribution of an auxiliary parametric state-space model. The auxiliary model is simpler than the original one in (3) since the latent states factorize with a Markovian structure. Equation (12) can be interpreted as a nonlinear state-space model with a Gaussian state transition density, N(xt|At−1µ, Q), and a likelihood augmented with an additional term: exp{−1 2tr Q−1(Bt−1 + At−1ΣAT t−1)  }. Smoothing in nonlinear Markovian state-space models is a standard problem in the context of time series modeling. There are various existing strategies to find the smoothing distribution which could be used depending on the characteristics of each particular problem [20]. For instance, in a mildly nonlinear system with Gaussian noise, an extended Kalman smoother can have very good perfor- mance. On the other hand, problems with severe nonlinearities and/or non-Gaussian likelihoods can lead to heavily multimodal smoothing distributions that are better represented using particle meth- ods. We note that the application of sequential Monte Carlo (SMC) is particularly straightforward in the present auxiliary model. 3.5 Optimizing the Evidence Lower Bound Algorithm 1 presents a procedure to maximize the evidence lower bound by alternatively sampling from the smoothing distribution and taking steps both in θ and in the natural parameters of q∗(u). We propose a hybrid variational-sampling approach whereby approximate samples from q∗(x) are obtained with a sequential Monte Carlo smoother. However, as discussed in section 3.4, depending on the characteristics of the dynamical system, other smoothing methods could be more appropriate [20]. As an alternative to smoothing on the auxiliary dynamical system in (12), one could force a q(x) from a particular family of distributions and optimise the evidence lower bound with respect to its variational parameters. For instance, we could posit a Gaussian q(x) with a sparsity pattern in the covariance matrix assuming zero covariance between non-neighboring states and maximize the ELBO with respect to the variational parameters. We use stochastic gradient descent [10] to maximize the ELBO (where we have plugged in the optimal q∗(u) [22]) by using its gradient with respect to the hyperparameters. Both quantities are stochastic in our hybrid approach due to variance introduced by the sampling of q∗(x). In fact, vanilla sequential Monte Carlo methods will result in biased estimators of the gradient and the parameters of q∗(u). However, in our experiments this has not been an issue. Techniques such as particle MCMC would be a viable alternative to conventional sequential Monte Carlo [13]. 5 Algorithm 1 Variational learning of GP-SSMs with particle smoothing. Batch mode (i.e. non-SVI) is the particular case where the mini-batch is the whole dataset. Require: Observations y1:T . Initial values for θ, η1 and η2. Schedules for ρ and λ. i = 1. repeat yτ:τ ′ ←SAMPLEMINIBATCH(y1:T ) {xτ:τ ′}L l=1 ←GETSAMPLESOPTIMALQX(yτ:τ ′, θ, η1, η2) sample from eq. (12) ∇θL ←GETTHETAGRADIENT({xτ:τ ′}L l=1, θ) supp. material η∗ 1, η∗ 2 ←GETOPTIMALQU({xτ:τ ′}L l=1, θ) eq. (11) or (14) η1 ←η1 + ρi(η∗ 1 −η1) η2 ←η2 + ρi(η∗ 2 −η2) θ ←θ + λi∇θL i ←i + 1 until ELBO convergence 3.6 Making Predictions One of the most appealing properties of our variational approach to learning GP-SSMs is that the approximate predictive distribution of the state transition function can be cheaply computed p(f∗|x∗, y) = Z x,u p(f∗|x∗, x, u) p(x|u, y) p(u|y) ≈ Z x,u p(f∗|x∗, u) p(x|u, y) q(u) = Z u p(f∗|x∗, u) q(u) = N(f∗|A∗µ, B∗+ A∗ΣA⊤ ∗). (13) The derivation in eq. (13) contains two approximations: 1) predictions at new test points are con- sidered to depend only on the inducing variables, and 2) the posterior distribution over u is approx- imated by a variational distribution. After pre-computations, the cost of each prediction is O(M) for the mean and O(M 2) for the variance. This contrasts with the O(TL) and O(T 2L) complexity of approaches based on sampling from the smoothing distribution where p(f∗|x∗, y) = R x p(f∗|x∗, x) p(x|y) is approximated with L samples from p(x|y) [8]. The variational approach condenses the learning of the latent function on the inducing points u and does not explicitly need the smoothing distribution p(x|y) to make predictions. 4 Stochastic Variational Inference Stochastic variational inference (SVI) [10] can be readily applied using our evidence lower bound. When the observed time series is long, it can be expensive to compute q∗(u) or the gradient of L with respect to the hyperparameters and inducing inputs. Since both q∗(u) and ∂L ∂θ/z1:M depend linearly on q(x) via sufficient statistics that contain a summation over all elements in the state trajectory, we can obtain unbiased estimates of these sufficient statistics by using one or multiple segments of the sequence that are sampled uniformly at random. However, obtaining q(x) also requires a time complexity of O(T). Yet, in practice, q(x) can be approximated by running the smoothing algorithm locally around those segments. This can be justified by the fact that in a time series context, the smoothing distribution at a particular time is not largely affected by measurements that are far into the past or the future [20]. The natural parameters of q∗(u) can be estimated by using a portion of the time series of length S η1 = Q−1 T S τ ′ X t=τ ⟨AT t−1xt⟩q(xt,xt−1), η2 = −1 2  K−1 uu + Q−1 T S τ ′ X t=τ ⟨AT t−1At−1⟩q(xt−1)  . (14) 5 Online Learning Our variational approach to learn GP-SSMs also leads naturally to an online learning implementa- tion. This is of particular interest in the context of dynamical systems as it is often the case that data arrives in a sequential manner, e.g. a robot learning the dynamics of different objects by interacting 6 Table 1: Experimental evaluation of 1D nonlinear system. Unless otherwise stated, training times are reported for a dataset with T = 500 and test times are given for a test set with 105 data points. All pre-computations independent on test data are performed before timing the “test time”. Predictive log likelihoods are the average over the full test set. * our PMCMC code did not use fast updates- downdates of the Cholesky factors during training. This does not affect test times. Test RMSE log p(xtest t+1|xtest t , ytr 0:T ) Train time Test time Variational GP-SSM 1.15 -1.61 2.14 min 0.14 s Var. GP-SSM (SVI, T = 104) 1.07 -1.47 4.12 min 0.14 s PMCMC GP-SSM [8] 1.12 -1.57 547 min* 421 s GP-NARX [17] 1.46 -1.90 0.22 min 3.85 s GP-NARX + FITC [17, 18] 1.47 -1.90 0.17 min 0.23 s Linear (N4SID, [16]) 2.35 -2.30 0.01 min 0.11 s with them. Online learning in a Bayesian setting consists in sequential application of Bayes rule whereby the posterior after observing data up to time t becomes the prior at time t + 1 [2, 15]. In our case, this involves replacing the prior p(u) = N(u|0, Ku,u) by the approximate posterior N(u|µ, Σ) obtained in the previous step. The expressions for the update of the natural parameters of q∗(u) with a new mini batch yτ:τ ′ are η′ 1 = η1 + Q−1 τ ′ X t=τ ⟨AT t−1xt⟩q(xt,xt−1), η′ 2 = η2 −1 2Q−1 τ ′ X t=τ ⟨AT t−1At−1⟩q(xt−1). (15) 6 Experiments The goal of this section is to showcase the ability of variational GP-SSMs to perform approximate Bayesian learning of nonlinear dynamical systems. In particular, we want to demonstrate: 1) the ability to learn the inherent nonlinear dynamics of a system, 2) the application in cases where the latent states have higher dimensionality than the observations, and 3) the use of non-Gaussian like- lihoods. 6.1 1D Nonlinear System We apply our variational learning procedure presented above to the one-dimensional nonlinear sys- tem described by p(xt+1|xt) = N(f(xt), 1) and p(yt|xt) = N(xt, 1) where the transition function is xt + 1 if x < 4 and −4xt + 21 if x ≥4. Its pronounced kink makes it challenging to learn. Our goal is to find a posterior distribution over this function using a GP-SSM with Mat´ern covariance function. To solve the expectations with respect to the approximate smoothing distribution q(x) we use a bootstrap particle fixed-lag smoother with 1000 particles and a lag of 10. In Table 1, we compare our method (Variational GP-SSM) against the PMCMC sampling proce- dure from [8] taking 100 samples and 10 burn in samples. As in [8], the sampling exhibited very good mixing with 20 particles. We also compare to an auto-regressive model based on Gaussian process regression [17] of order 5 with Mat´ern ARD covariance function with and without FITC approximation. Finally, we use a linear subspace identification method (N4SID, [16]) as a base- line for comparison. The PMCMC training offers the best test performance from all methods using 500 training points at the cost of substantial train and test time. However, if more data is available (T = 104) the stochastic variational inference procedure can be very attractive since it improves test performance while having a test time that is independent of the training set size. The reported SVI performance has been obtained with mini-batches of 100 time-steps. 6.2 Neural Spike Train Recordings We now turn to the use of SSMs to learn a simple model of neural activity in rats’ hippocampus. We use data in neuron cluster 1 (the most active) from experiment ec013.717 in [14]. In some regions of the time series, the action potential spikes show a clear pattern where periods of rapid spiking are followed by periods of very little spiking. We wish to model this behaviour as an autonomous nonlinear dynamical system (i.e. one not driven by external inputs). Many parametric models of nonlinear neuron dynamics have been proposed [11] but our goal here is to learn a model from data 7 940 940.5 941 10 20 30 40 time [s] spike counts 940 940.5 941 0 time [s] states 0 0.5 1 10 20 30 40 prediction time [s] spike counts 0 0.5 1 0 prediction time [s] states Figure 2: From left to right: 1) part of the observed spike count data, 2) sample from the corre- sponding smoothing distribution, 3) predictive distribution of spike counts obtained by simulating the posterior dynamical from an initial state, and 4) corresponding latent states. 0 x(1) x(2) x(1) x(2) 0 x(1) x(2) 0 x(1) x(2) 0 Figure 3: Contour plots of the state transition function x(2) t+1 = f(x(1) t , x(2) t ), and trajectories in state space. Left: mean posterior function and trajectory from smoothing distribution. Other three panels: transition functions sampled from the posterior and trajectories simulated conditioned on the corresponding sample. Those simulated trajectories start inside the limit cycle and are naturally attracted towards it. Note how function samples are very similar in the region of the limit cycle. without using any biological insight. We use a GP-SSM with a structure such that it is the discrete- time analog of a second order nonlinear ordinary differential equation: two states one of which is the derivative of the other. The observations are spike counts in temporal bins of 0.01 second width. We use a Poisson likelihood relating the spike counts to the second latent state yt|xt ∼ Poisson(exp(αx(2) t + β)). We find a posterior distribution for the state transition function using our variational GP-SSM ap- proach. Smoothing is done with a fixed-lag particle smoother and training until convergence takes approximately 50 iterations of Algorithm 1. Figure 2 shows a part of the raw data together with an approximate sample from the smoothing distribution during the same time interval. In addition, we show the distribution over predictions made by chaining 1-step-ahead predictions. To make those predictions we have switched off process noise (Q = 0) to show more clearly the effect of uncer- tainty in the state transition function. Note how the frequency of roughly 6 Hz present in the data is well captured. Figure 3 shows how the limit cycle corresponding to a nonlinear dynamical system has been captured (see caption for details). 7 Discussion and Future Work We have derived a tractable variational formulation to learn GP-SSMs: an important class of mod- els of nonlinear dynamical systems that is particularly suited to applications where a principled parametric model of the dynamics is not available. Our approach makes it possible to learn very ex- pressive models without risk of overfitting. In contrast to previous approaches [4, 12, 25], we have demonstrated the ability to learn a nonlinear state transition function in a latent space of greater dimensionality than the observation space. More crucially, our approach yields a tractable posterior over nonlinear systems that, as opposed to those based on sampling from the smoothing distribution [8], results in a computation time for the predictions that does not depend on the length of the time series. Given the interesting capabilities of variational GP-SSMs, we believe that future work is warranted. In particular, we want to focus on structured variational distributions q(x) that could eliminate the need to solve the smoothing problem in the auxiliary dynamical system at the cost of having more variational parameters to optimize. On a more theoretical side, we would like to better characterize GP-SSM priors in terms of their dynamical system properties: stability, equilibria, limit cycles, etc. 8 References [1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. [2] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems 26, pages 1727–1735. Curran Associates, Inc., 2013. [3] Emery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, and Matthew A. Wilson. A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. The Journal of Neuroscience, 18(18):7411–7425, 1998. [4] Andreas C. Damianou, Michalis Titsias, and Neil D. Lawrence. Variational Gaussian process dynamical systems. In Advances in Neural Information Processing Systems 24, pages 2510–2518. 2011. [5] J. Daunizeau, K.J. Friston, and S.J. Kiebel. Variational Bayesian identification and prediction of stochastic nonlinear dynamic causal models. Physica D: Nonlinear Phenomena, 238(21):2089 – 2118, 2009. [6] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2618–2626. 2012. [7] M. P. Deisenroth, R. D. Turner, M. F. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865 –1871, 2012. [8] Roger Frigola, Fredrik Lindsten, Thomas B. Sch¨on, and Carl E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS) 26. 2013. [9] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Ad- vances in Neural Information Processing Systems (NIPS) 11. MIT Press, 1999. [10] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013. [11] Eugene M Izhikevich. Neural excitability, spiking and bursting. International Journal of Bifurcation and Chaos, 10(06):1171–1266, 2000. [12] Neil D. Lawrence and Andrew J. Moore. The hierarchical Gaussian process latent variable model. In Zoubin Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning (ICML), 2007. [13] Fredrik Lindsten and Thomas B. Sch¨on. Backward simulation methods for Monte Carlo statistical infer- ence. Foundations and Trends in Machine Learning, 6(1):1–143, 2013. [14] K. Mizuseki, A. Sirota, E. Pastalkova, K. Diba, and G. Buzski. Multiple single unit recordings from different rat hippocampal and entorhinal regions while the animals were performing multiple behavioral tasks. CRCNS.org. http://dx.doi.org/10.6080/K09G5JRZ, 2013. [15] Manfred Opper. A bayesian approach to on-line learning. In David Saad, editor, On-Line Learning in Neural Networks. Cambridge University Press, 1998. [16] Van Overschee P. and De Moor B. Subspace Identification for Linear Systems, Theory, Implementation, Applications. Kluwer Academic Publishers, 1996. [17] J. Qui˜nonero Candela, A Girard, J. Larsen, and C.E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In Acoustics, Speech, and Signal Pro- cessing, 2003. Proceedings. (ICASSP ’03). 2003 IEEE International Conference on, volume 2, pages II–701–4 vol.2, April 2003. [18] J. Qui˜nonero-Candela and C.E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005. [19] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [20] S. S¨arkk¨a. Bayesian Filtering and Smoothing. Cambridge University Press, 2013. [21] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 3rd edition, 2011. [22] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009. [23] R. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-space inference and learning with Gaussian processes. In Yee Whye Teh and Mike Titterington, editors, 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W&CP, pages 868–875, Chia Laguna, Sardinia, Italy, 2010. [24] Harri Valpola and Juha Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11):2647–2692, 2002. [25] J.M. Wang, D.J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In Advances in Neural Information Processing Systems (NIPS) 18, pages 1441–1448. MIT Press, Cambridge, MA, 2006. 9 Supplementary Material A Remark on p(f|x) Note that T Y t=1 p(ft|f1:t−1, x0:t−1, u) ̸= p(f1:T |x0:T −1, u). (16) The right hand side is equivalent to Gaussian process regression where x0:T −1 are inputs and x1:T are outputs p(f1:T |x0:T −1, u) =N K0:T −1,{u,0:T −1}(K{u,0:T −1} + ΣQ)−1  u x0:T −1  , K0:T −1 −K0:T −1,{u,0:T −1}(K{u,0:T −1} + ΣQ)−1K⊤ 0:T −1,{u,0:T −1}  (17) where ΣQ = blockdiag(0, I ⊗Q) captures process noise. However, the left hand side of (16) is the product of terms that are equivalent to Gaussian process prediction with noiseless observations p(ft|f1:t−1, x0:t−1, u) =N Kt−1,{u,0:t−2}K−1 {u,0:t−2}  u f1:t−1  , Kt−1 −Kt−1,{u,0:t−2}K−1 {u,0:t−2}K⊤ t−1,{u,0:t−2}  . (18) The product of these terms can be succinctly represented by the following Gaussian T Y t=1 p(ft|f1:t−1, x0:t−1, u) =N f1:T | K0:T −1,uK−1 u,uu, K0:T −1 −K0:T −1,uK−1 u,uK⊤ 0:T −1,u  . (19) Therefore Z ft T Y t=1 p(ft|f1:t−1, x0:t−1, u) =N ft | Kt−1,uK−1 u,uu, Kt−1 −Kt−1,uK−1 u,uK⊤ t−1,u  . (20) B Relationship of Variational Approximation with Other Models B.1 Markovian Model with Heteroscedastic Noise Although not strictly a GP-SSM, it is also interesting to consider a model where the state transitions are independent of each other given the inducing variables p(f1:T , x0:T |u) = p(x0) T Y t=1 N ft|At−1u, Bt−1  p(xt|ft). (21) This model can be interpreted as a parametric model where At−1u is a deterministic transition function and Bt−1 provides the description of an heteroscedastic process noise. Note that this process noise is independent between any two time steps. If we look for an evidence lower bound with using an analogous procedure to that of section 3 and the following variational distribution q(x, f, u) = q(u)q(x) T Y t=1 N ft|At−1u, Bt−1  , (22) the lower bound becomes the same as the one in equation (9). 10 B.2 Bayesian RBF Model The Radial Basis Function in this model uses a deterministic state transition function of the form f(x) = A(x) u which leads to p(f1:T , x0:T |u) = p(x0) T Y t=1 N ft|At−1 u, 0  p(xt|ft). (23) As opposed to the model in B.1, this transition dynamics is not heteroscedastic. It is fully determin- istic and parameterized by u and z. From a GP perspective, this model is analogous to the Subset of Regressors sparse GP [18]. This model has the undesirable characteristic that the predictive vari- ance shrinks to zero away from the inducing inputs when exactly the opposite behaviour would be desirable. B.3 Double GP Model A state space model having a Gaussian process prior over the state transition function and the emis- sion/observation function can be represented by f(x) ∼GP mf(x), kf(x, x′)  , (24a) g(x) ∼GP mg(x), kg(x, x′)  , (24b) x0 ∼p(x0) (24c) xt | ft ∼N(xt | ft, Q), (24d) yt | gt ∼N(yt | gt, R), (24e) where we have used ft ≜f(xt−1) and gt ≜g(xt). If the transition GP is augmented with inducing variables u and the emission GP is augmented with v, we obtain the following joint distribution of the model p(y, x, f, u, g, v) = p(g|x, v) p(x, f|u) p(u) p(v) T Y t=1 p(yt|gt), (25) where p(x, f|u) is the same as in the model presented in the paper and p(g|x, v) is straightforward since it is conditioned on all the states. We use the following variational distribution over latent variables q(x, f, u, g, v) = q(u)q(v)q(x)p(g|x, v) T Y t=1 p(ft|f1:t−1, x0:t−1, u). (26) Terms with latent variables inside kernel matrices cancel inside the log log p(y|θ) ≥ Z x,f,u,g,v q(x, f, u, g, v) log p(u)p(v)p(x0) QT t=1 p(yt|gt)p(xt|ft) q(u)q(v)q(x) = −KL(q(u)∥p(u)) −KL(q(v)∥p(v)) + H(q(x)) + Z x q(x) log p(x0) + T X t=1  Z x,u q(x)q(u) Z ft p(ft|xt−1, u) log p(xt|ft) + Z x,v q(x)q(v) Z gt p(gt|xt, v) log p(yt|gt)  . The optimal distribution q∗(u) is the same as in eq. (10) and the optimal variational distribution of the emission inducing variables is a Gaussian distribution q∗(v) ∝p(v) T Y t=1 exp{⟨log N(yt|Ct v, R)⟩q(xt)} (27) 11 where Ct = Kt,vK−1 v,v, Dt = Kt,t −Kt,vK−1 v,vKv,t. The optimal variational distribution of the state trajectory is q∗(x) ∝p(x0) T Y t=1 exp{−1 2tr(Q−1(Bt−1 + At−1ΣAt−1 T )) −1 2tr(R−1(Dt + CtΛCt T ))} N(xt|At−1µ, Q) N(yt|Ctν, R), (28) where we have used q(v) = N(ν, Λ). C Optimization of Hyperparameters We optimize the hyperparameters and variational parameters (z1:M) with gradient ascent. The gra- dient w.r.t. θ is computed as ∂L ∂θ =  ∂ ∂θ log p(u)  q(u) +  ∂ ∂θ log p(x0)  q(x0) + T X t=1   −1 2 ∂ ∂θ tr(Q−1(Bt−1 + At−1ΣAT t−1))  q(xt−1) +  ∂ ∂θ log N(xt|At−1µ, Q)  q(xt,xt−1) +  ∂ ∂θ log p(yt|xt)  q(xt) ) . (29) The gradient with respect to z1:M is similar with θ replaced by z1:M. In this expression µ and Σ can be replaced by their optimal settings dependent on the sufficient statistics Ψ1 and Ψ2. D Plots from Experiments Section −10 −5 0 5 −10 −5 0 5 xt−1 ft & u Figure 4: Posterior distribution over latent state transition function (green: ground truth, blue: pos- terior mean, red: mean ±1 standard deviation). 12 Variational Gaussian Process State-Space Models Roger Frigola, Yutian Chen and Carl E. Rasmussen Department of Engineering University of Cambridge {rf342,yc373,cer54}@cam.ac.uk Abstract State-space models have been successfully used for more than fifty years in differ- ent areas of science and engineering. We present a procedure for efficient varia- tional Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possi- bility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach com- bining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series. 1 Introduction State-space models (SSMs) are a widely used class of models that have found success in applications as diverse as robotics, ecology, finance and neuroscience (see, e.g., Brown et al. [3]). State-space models generalize other popular time series models such as linear and nonlinear auto-regressive models: (N)ARX, (N)ARMA, (G)ARCH, etc. [21]. In this article we focus on Bayesian learning of nonparametric nonlinear state-space models. In particular, we use sparse Gaussian processes (GPs) [19] as a convenient method to encode general assumptions about the dynamical system such as continuity or smoothness. In contrast to conven- tional parametric methods, we allow the user to easily trade off model capacity and computation time. Moreover, we present a variational training procedure that allows very complex models to be learned without risk of overfitting. Our variational formulation leads to a tractable approximate posterior over nonlinear dynamical systems. This approximate posterior can be used to compute fast probabilistic predictions of future trajectories of the dynamical system. The computational complexity of our learning approach is linear in the length of the time series. This is possible thanks to the use of variational sparse GPs [22] which lead to a smoothing problem for the latent state trajectory in a simpler auxiliary dynamical system. Smoothing in this auxiliary system can be carried out with any conventional technique (e.g. sequential Monte Carlo). In addition, we present a stochastic variational inference procedure [10] to accelerate learning for long time series and we also present an online learning scheme. This work is useful in situations where: 1) it is important to know how uncertain future predictions are, 2) there is not enough knowledge about the underlying nonlinear dynamical system to create a principled parametric model, and 3) it is necessary to have an explicit model that can be used to simulate the dynamical system into the future. These conditions arise often in engineering and finance. For instance, consider an autonomous aircraft adapting its flight control when carrying a large external load of unknown weight and aerodynamic characteristics. A model of the nonlinear dynamics of the new system can be very useful in order to automatically adapt the control strategy. When few data points are available, there is high uncertainty about the dynamics. In this situation, 1 arXiv:1406.4905v2 [cs.LG] 3 Nov 2014 a model that quantifies its uncertainty can be used to synthesize control laws that avoid the risks of overconfidence. The problem of learning flexible models of nonlinear dynamical systems has been tackled from multiple perspectives. Ghahramani and Roweis [9] presented a maximum likelihood approach to learn nonlinear SSMs based on radial basis functions. This work was later extended by using a parameterized Gaussian process point of view and developing tailored filtering algorithms [6, 7, 23]. Approximate Bayesian learning has also been developed for parameterized nonlinear SSMs [5, 24]. Wang et al. [25] modeled the nonlinear functions in SSMs using Gaussian processes (GP-SSMs) and found a MAP estimate of the latent variables and hyperparameters. Their approach preserved the nonparametric properties of Gaussian processes. Despite using MAP learning over state trajectories, overfitting was not an issue since it was applied in a dimensionality reduction context where the latent space of the SSM was much smaller than the observation space. In a similar vein, [4, 12] presented a hierarchical Gaussian process model that could model linear dynamics and nonlinear mappings from latent states to observations. More recently, Frigola et al. [8] learned GP-SSMs in a fully Bayesian manner by employing particle MCMC methods to sample from the smoothing distribution. However, their approach led to predictions with a computational cost proportional to the length of the time series. In the rest of this article, we present an approach to variational Bayesian learning of flexible non- linear state-space models which leads to a simple representation of the posterior over nonlinear dynamical systems and results in predictions having a low computational complexity. 2 Gaussian Process State-Space Models We consider discrete-time nonlinear state-space models built with deterministic functions and addi- tive noise xt+1 = f(xt) + vt, (1a) yt = g(xt) + et. (1b) The dynamics of the system are defined by the state transition function f(xt) and independent additive noise vt (process noise). The states xt ∈RD are latent variables such that all future variables are conditionally independent on the past given the present state. Observations yt ∈RE are linked to the state via another deterministic function g(xt) and independent additive noise et (observation noise). State-space models are stochastic dynamical processes that are useful to model time series y ≜{y1, ..., yT }. The deterministic functions in (1) can also take external known inputs (such as control signals) as an argument but, for conciseness, we will omit those in our notation. A traditional approach to learn f and g is to restrict them to a family of parametric functions. This is particularly appropriate when the dynamical system is very well understood, e.g. orbital mechanics of a spacecraft. However, in many applications, it is difficult to specify a class of parametric models that can provide both the ability to model complex functions and resistance to overfitting thanks to an easy to specify prior or regularizer. Gaussian processes do have these properties: they can represent functions of arbitrary complexity and provide a straightforward way to specify assumptions about those unknown functions, e.g. smoothness. In the light of this, it is natural to place Gaussian process priors over both f and g [25]. However, the extreme flexibility of the two Gaussian processes leads to severe nonidentifiability and strong correlations between the posteriors of the two unknown functions. In the rest of this paper we will focus on a model with a GP prior over the transition function and a parametric likelihood. However, our variational formulation can also be applied to the double GP case (see supplementary material). A probabilistic state-space model with a Gaussian process prior over the transition function and a parametric likelihood is specified by f(x) ∼GP mf(x), kf(x, x′)  , (2a) xt | ft ∼N(xt | ft, Q), (2b) x0 ∼p(x0) (2c) yt | xt ∼p(yt | xt, θy), (2d) where we have used ft ≜f(xt−1). Since f(x) ∈RD, we use the convention that the covariance function kf returns a D ×D matrix. We group all hyperparameters into θ ≜{θf, θy, Q}. Note that 2 0 time states 0 time 0 time 0 time Figure 1: State trajectories from four 2-state nonlinear dynamical systems sampled from a GP-SSM prior with fixed hyperparameters. The same prior generates systems with qualitatively different behaviors, e.g. the leftmost panel shows behavior similar to that of a non-oscillatory linear system whereas the rightmost panel appears to have arisen from a limit cycle in a nonlinear system. we are not restricting the likelihood (2d) to any particular form. The joint distribution of a GP-SSM is p(y, x, f) = p(x0) T Y t=1 p(yt|xt)p(xt|ft)p(ft|f1:t−1, x0:t−1), (3) where we use the convention f1:0 = ∅and omit the conditioning on θ in the notation. The GP on the transition function induces a distribution over the latent function values with the form of a GP predictive: p(ft|f1:t−1, x0:t−1) = N mf(xt−1) + Kt−1,0:t−2K−1 0:t−2,0:t−2(f1:t−1 −mf(x0:t−2)), Kt−1,t−1 −Kt−1,0:t−2K−1 0:t−2,0:t−2K⊤ t−1,0:t−2  , (4) where the subindices of the kernel matrices indicate the arguments to the covariance function nec- essary to build each matrix, e.g. Kt−1,0:t−2 = [kf(xt−1, x0) . . . kf(xt−1, xt−2)]. When t = 1, the distribution is that of a GP marginal p(f1|x0) = N(mf(x0), kf(x0, x0)). Equation (3) provides a sequential procedure to sample state trajectories and observations. GP- SSMs are doubly stochastic models in the sense that one could, at least notionally, first sample a state transition dynamics function from eq. (2a) and then, conditioned on that function, sample the state trajectory and observations. GP-SSMs are a very rich prior over nonlinear dynamical systems. In Fig. 1 we illustrate this concept by showing state trajectories sampled from a GP-SSM with fixed hyperparameters. The dynamical systems associated with each of these trajectories are qualitatively very different from each other. For instance, the leftmost panel shows the dynamics of an almost linear non-oscillatory system whereas the rightmost panel corresponds to a limit cycle in a nonlinear system. Our goal in this paper is to use this prior over dynamical systems and obtain a tractable approximation to the posterior over dynamical systems given the data. 3 Variational Inference in GP-SSMs Since the GP-SSM is a nonparametric model, in order to define a posterior distribution over f(x) and make probabilistic predictions it is necessary to first find the smoothing distribution p(x0:T |y1:T ). Frigola et al. [8] obtained samples from the smoothing distribution that could be used to define a predictive density via Monte Carlo integration. This approach is expensive since it requires averag- ing over L state trajectory samples of length T. In this section we present an alternative approach that aims to find a tractable distribution over the state transition function that is independent of the length of the time series. We achieve this by using variational sparse GP techniques [22]. 3.1 Augmenting the Model with Inducing Variables As a first step to perform variational inference in a GP-SSM, we augment the model with M inducing points u ≜{ui}M i=1. Those inducing points are jointly Gaussian with the latent function values. In the case of a GP-SSM, the joint probability density becomes p(y, x, f, u) = p(x, f|u) p(u) T Y t=1 p(yt|xt), (5) 3 where p(u) = N(u | 0, Ku,u) (6a) p(x, f|u) = p(x0) T Y t=1 p(ft|f1:t−1, x0:t−1, u)p(xt|ft), (6b) T Y t=1 p(ft|f1:t−1, x0:t−1, u) = N f1:T | K0:T −1,uK−1 u,uu, K0:T −1−K0:T −1,uK−1 u,uK⊤ 0:T −1,u  . (6c) Kernel matrices relating to the inducing points depend on a set of inducing inputs {zi}M i=1 in such a way that Ku,u is an MD × MD matrix formed with blocks kf(zi, zj) having size D × D. For brevity, we use a zero mean function and we omit conditioning on the inducing inputs in the notation. 3.2 Evidence Lower Bound of an Augmented GP-SSM Variational inference [1] is a popular method for approximate Bayesian inference based on making assumptions about the posterior over latent variables that lead to a tractable lower bound on the evidence of the model (sometimes referred to as ELBO). Maximizing this lower bound is equivalent to minimizing the Kullback-Leibler divergence between the approximate posterior and the exact one. Following standard variational inference methodology, [1] we obtain the evidence lower bound of a GP-SSM augmented with inducing points log p(y|θ) ≥ Z x,f,u q(x, f, u) log p(u)p(x0) QT t=1 p(ft|f1:t−1, x0:t−1, u)p(yt|xt)p(xt|ft) q(x, f, u) . (7) In order to achieve tractability, we use a variational distribution that factorizes as q(x, f, u) = q(u)q(x) T Y t=1 p(ft|f1:t−1, x0:t−1, u), (8) where q(u) and q(x) can take any form but the terms relating to f are taken to match those of the prior (3). As a consequence, the difficult p(ft|...) terms inside the log cancel out and lead to the following lower bound L(q(u), q(x),θ) = −KL(q(u)∥p(u)) + H(q(x)) + Z x q(x) log p(x0) + T X t=1  Z x,u q(x)q(u) Z ft p(ft|xt−1, u) log p(xt|ft) | {z } Φ(xt,xt−1,u) + Z x q(x) log p(yt|xt)  (9) where KL denotes the Kullback-Leibler divergence and H the entropy. The integral with respect to ft can be solved analytically: Φ(xt, xt−1, u) = −1 2tr(Q−1Bt−1) + log N(xt|At−1u, Q) where At−1 = Kt−1,uK−1 u,u, and Bt−1 = Kt−1,t−1 −Kt−1,uK−1 u,uKu,t−1. As in other variational sparse GP methods, the choice of variational distribution (8) gives the abil- ity to precisely learn the latent function at the locations of the inducing inputs. Away from those locations, the posterior takes the form of the prior conditioned on the inducing variables. By in- creasing the number of inducing variables, the ELBO can only become tighter [22]. This offers a straightforward trade-off between model capacity and computation cost without increasing the risk of overfitting. 3.3 Optimal Variational Distribution for u The optimal distribution of q(u) can be found by setting to zero the functional derivative of the evidence lower bound with respect to q(u) q∗(u) ∝p(u) T Y t=1 exp{⟨log N(xt|At−1u, Q)⟩q(x)}, (10) 4 where ⟨·⟩q(x) denotes an expectation with respect to q(x). The optimal variational distribution q∗(u) is, conveniently, a multivariate Gaussian distribution. If, for simplicity of notation, we restrict ourselves to D = 1 the natural parameters of the optimal distribution are η1 = Q−1 T X t=1 ⟨AT t−1xt⟩q(xt,xt−1), η2 = −1 2 K−1 uu + Q−1 T X t=1 ⟨AT t−1At−1⟩q(xt−1) ! . (11) The mean and covariance matrix of q∗(u), denoted as µ and Σ respectively, can be computed as µ = Ση1 and Σ = (−2η2)−1. Note that the optimal q(u) depends on the sufficient statistics Ψ1 = PT t=1⟨KT t−1,uxt⟩q(xt,xt−1) and Ψ2 = PT t=1⟨KT t−1,uKt−1,u⟩q(xt−1). 3.4 Optimal Variational Distribution for x In an analogous way as for q∗(u), we can obtain the optimal form of q(x) q∗(x) ∝p(x0) T Y t=1 p(yt|xt) exp{−1 2tr Q−1(Bt−1 + At−1ΣAT t−1)  } N(xt|At−1µ, Q), (12) where, in the second equation, we have used q(u) = N(u|µ, Σ). The optimal distribution q∗(x) is equivalent to the smoothing distribution of an auxiliary parametric state-space model. The auxiliary model is simpler than the original one in (3) since the latent states factorize with a Markovian structure. Equation (12) can be interpreted as a nonlinear state-space model with a Gaussian state transition density, N(xt|At−1µ, Q), and a likelihood augmented with an additional term: exp{−1 2tr Q−1(Bt−1 + At−1ΣAT t−1)  }. Smoothing in nonlinear Markovian state-space models is a standard problem in the context of time series modeling. There are various existing strategies to find the smoothing distribution which could be used depending on the characteristics of each particular problem [20]. For instance, in a mildly nonlinear system with Gaussian noise, an extended Kalman smoother can have very good perfor- mance. On the other hand, problems with severe nonlinearities and/or non-Gaussian likelihoods can lead to heavily multimodal smoothing distributions that are better represented using particle meth- ods. We note that the application of sequential Monte Carlo (SMC) is particularly straightforward in the present auxiliary model. 3.5 Optimizing the Evidence Lower Bound Algorithm 1 presents a procedure to maximize the evidence lower bound by alternatively sampling from the smoothing distribution and taking steps both in θ and in the natural parameters of q∗(u). We propose a hybrid variational-sampling approach whereby approximate samples from q∗(x) are obtained with a sequential Monte Carlo smoother. However, as discussed in section 3.4, depending on the characteristics of the dynamical system, other smoothing methods could be more appropriate [20]. As an alternative to smoothing on the auxiliary dynamical system in (12), one could force a q(x) from a particular family of distributions and optimise the evidence lower bound with respect to its variational parameters. For instance, we could posit a Gaussian q(x) with a sparsity pattern in the covariance matrix assuming zero covariance between non-neighboring states and maximize the ELBO with respect to the variational parameters. We use stochastic gradient descent [10] to maximize the ELBO (where we have plugged in the optimal q∗(u) [22]) by using its gradient with respect to the hyperparameters. Both quantities are stochastic in our hybrid approach due to variance introduced by the sampling of q∗(x). In fact, vanilla sequential Monte Carlo methods will result in biased estimators of the gradient and the parameters of q∗(u). However, in our experiments this has not been an issue. Techniques such as particle MCMC would be a viable alternative to conventional sequential Monte Carlo [13]. 5 Algorithm 1 Variational learning of GP-SSMs with particle smoothing. Batch mode (i.e. non-SVI) is the particular case where the mini-batch is the whole dataset. Require: Observations y1:T . Initial values for θ, η1 and η2. Schedules for ρ and λ. i = 1. repeat yτ:τ ′ ←SAMPLEMINIBATCH(y1:T ) {xτ:τ ′}L l=1 ←GETSAMPLESOPTIMALQX(yτ:τ ′, θ, η1, η2) sample from eq. (12) ∇θL ←GETTHETAGRADIENT({xτ:τ ′}L l=1, θ) supp. material η∗ 1, η∗ 2 ←GETOPTIMALQU({xτ:τ ′}L l=1, θ) eq. (11) or (14) η1 ←η1 + ρi(η∗ 1 −η1) η2 ←η2 + ρi(η∗ 2 −η2) θ ←θ + λi∇θL i ←i + 1 until ELBO convergence 3.6 Making Predictions One of the most appealing properties of our variational approach to learning GP-SSMs is that the approximate predictive distribution of the state transition function can be cheaply computed p(f∗|x∗, y) = Z x,u p(f∗|x∗, x, u) p(x|u, y) p(u|y) ≈ Z x,u p(f∗|x∗, u) p(x|u, y) q(u) = Z u p(f∗|x∗, u) q(u) = N(f∗|A∗µ, B∗+ A∗ΣA⊤ ∗). (13) The derivation in eq. (13) contains two approximations: 1) predictions at new test points are con- sidered to depend only on the inducing variables, and 2) the posterior distribution over u is approx- imated by a variational distribution. After pre-computations, the cost of each prediction is O(M) for the mean and O(M 2) for the variance. This contrasts with the O(TL) and O(T 2L) complexity of approaches based on sampling from the smoothing distribution where p(f∗|x∗, y) = R x p(f∗|x∗, x) p(x|y) is approximated with L samples from p(x|y) [8]. The variational approach condenses the learning of the latent function on the inducing points u and does not explicitly need the smoothing distribution p(x|y) to make predictions. 4 Stochastic Variational Inference Stochastic variational inference (SVI) [10] can be readily applied using our evidence lower bound. When the observed time series is long, it can be expensive to compute q∗(u) or the gradient of L with respect to the hyperparameters and inducing inputs. Since both q∗(u) and ∂L ∂θ/z1:M depend linearly on q(x) via sufficient statistics that contain a summation over all elements in the state trajectory, we can obtain unbiased estimates of these sufficient statistics by using one or multiple segments of the sequence that are sampled uniformly at random. However, obtaining q(x) also requires a time complexity of O(T). Yet, in practice, q(x) can be approximated by running the smoothing algorithm locally around those segments. This can be justified by the fact that in a time series context, the smoothing distribution at a particular time is not largely affected by measurements that are far into the past or the future [20]. The natural parameters of q∗(u) can be estimated by using a portion of the time series of length S η1 = Q−1 T S τ ′ X t=τ ⟨AT t−1xt⟩q(xt,xt−1), η2 = −1 2  K−1 uu + Q−1 T S τ ′ X t=τ ⟨AT t−1At−1⟩q(xt−1)  . (14) 5 Online Learning Our variational approach to learn GP-SSMs also leads naturally to an online learning implementa- tion. This is of particular interest in the context of dynamical systems as it is often the case that data arrives in a sequential manner, e.g. a robot learning the dynamics of different objects by interacting 6 Table 1: Experimental evaluation of 1D nonlinear system. Unless otherwise stated, training times are reported for a dataset with T = 500 and test times are given for a test set with 105 data points. All pre-computations independent on test data are performed before timing the “test time”. Predictive log likelihoods are the average over the full test set. * our PMCMC code did not use fast updates- downdates of the Cholesky factors during training. This does not affect test times. Test RMSE log p(xtest t+1|xtest t , ytr 0:T ) Train time Test time Variational GP-SSM 1.15 -1.61 2.14 min 0.14 s Var. GP-SSM (SVI, T = 104) 1.07 -1.47 4.12 min 0.14 s PMCMC GP-SSM [8] 1.12 -1.57 547 min* 421 s GP-NARX [17] 1.46 -1.90 0.22 min 3.85 s GP-NARX + FITC [17, 18] 1.47 -1.90 0.17 min 0.23 s Linear (N4SID, [16]) 2.35 -2.30 0.01 min 0.11 s with them. Online learning in a Bayesian setting consists in sequential application of Bayes rule whereby the posterior after observing data up to time t becomes the prior at time t + 1 [2, 15]. In our case, this involves replacing the prior p(u) = N(u|0, Ku,u) by the approximate posterior N(u|µ, Σ) obtained in the previous step. The expressions for the update of the natural parameters of q∗(u) with a new mini batch yτ:τ ′ are η′ 1 = η1 + Q−1 τ ′ X t=τ ⟨AT t−1xt⟩q(xt,xt−1), η′ 2 = η2 −1 2Q−1 τ ′ X t=τ ⟨AT t−1At−1⟩q(xt−1). (15) 6 Experiments The goal of this section is to showcase the ability of variational GP-SSMs to perform approximate Bayesian learning of nonlinear dynamical systems. In particular, we want to demonstrate: 1) the ability to learn the inherent nonlinear dynamics of a system, 2) the application in cases where the latent states have higher dimensionality than the observations, and 3) the use of non-Gaussian like- lihoods. 6.1 1D Nonlinear System We apply our variational learning procedure presented above to the one-dimensional nonlinear sys- tem described by p(xt+1|xt) = N(f(xt), 1) and p(yt|xt) = N(xt, 1) where the transition function is xt + 1 if x < 4 and −4xt + 21 if x ≥4. Its pronounced kink makes it challenging to learn. Our goal is to find a posterior distribution over this function using a GP-SSM with Mat´ern covariance function. To solve the expectations with respect to the approximate smoothing distribution q(x) we use a bootstrap particle fixed-lag smoother with 1000 particles and a lag of 10. In Table 1, we compare our method (Variational GP-SSM) against the PMCMC sampling proce- dure from [8] taking 100 samples and 10 burn in samples. As in [8], the sampling exhibited very good mixing with 20 particles. We also compare to an auto-regressive model based on Gaussian process regression [17] of order 5 with Mat´ern ARD covariance function with and without FITC approximation. Finally, we use a linear subspace identification method (N4SID, [16]) as a base- line for comparison. The PMCMC training offers the best test performance from all methods using 500 training points at the cost of substantial train and test time. However, if more data is available (T = 104) the stochastic variational inference procedure can be very attractive since it improves test performance while having a test time that is independent of the training set size. The reported SVI performance has been obtained with mini-batches of 100 time-steps. 6.2 Neural Spike Train Recordings We now turn to the use of SSMs to learn a simple model of neural activity in rats’ hippocampus. We use data in neuron cluster 1 (the most active) from experiment ec013.717 in [14]. In some regions of the time series, the action potential spikes show a clear pattern where periods of rapid spiking are followed by periods of very little spiking. We wish to model this behaviour as an autonomous nonlinear dynamical system (i.e. one not driven by external inputs). Many parametric models of nonlinear neuron dynamics have been proposed [11] but our goal here is to learn a model from data 7 940 940.5 941 10 20 30 40 time [s] spike counts 940 940.5 941 0 time [s] states 0 0.5 1 10 20 30 40 prediction time [s] spike counts 0 0.5 1 0 prediction time [s] states Figure 2: From left to right: 1) part of the observed spike count data, 2) sample from the corre- sponding smoothing distribution, 3) predictive distribution of spike counts obtained by simulating the posterior dynamical from an initial state, and 4) corresponding latent states. 0 x(1) x(2) x(1) x(2) 0 x(1) x(2) 0 x(1) x(2) 0 Figure 3: Contour plots of the state transition function x(2) t+1 = f(x(1) t , x(2) t ), and trajectories in state space. Left: mean posterior function and trajectory from smoothing distribution. Other three panels: transition functions sampled from the posterior and trajectories simulated conditioned on the corresponding sample. Those simulated trajectories start inside the limit cycle and are naturally attracted towards it. Note how function samples are very similar in the region of the limit cycle. without using any biological insight. We use a GP-SSM with a structure such that it is the discrete- time analog of a second order nonlinear ordinary differential equation: two states one of which is the derivative of the other. The observations are spike counts in temporal bins of 0.01 second width. We use a Poisson likelihood relating the spike counts to the second latent state yt|xt ∼ Poisson(exp(αx(2) t + β)). We find a posterior distribution for the state transition function using our variational GP-SSM ap- proach. Smoothing is done with a fixed-lag particle smoother and training until convergence takes approximately 50 iterations of Algorithm 1. Figure 2 shows a part of the raw data together with an approximate sample from the smoothing distribution during the same time interval. In addition, we show the distribution over predictions made by chaining 1-step-ahead predictions. To make those predictions we have switched off process noise (Q = 0) to show more clearly the effect of uncer- tainty in the state transition function. Note how the frequency of roughly 6 Hz present in the data is well captured. Figure 3 shows how the limit cycle corresponding to a nonlinear dynamical system has been captured (see caption for details). 7 Discussion and Future Work We have derived a tractable variational formulation to learn GP-SSMs: an important class of mod- els of nonlinear dynamical systems that is particularly suited to applications where a principled parametric model of the dynamics is not available. Our approach makes it possible to learn very ex- pressive models without risk of overfitting. In contrast to previous approaches [4, 12, 25], we have demonstrated the ability to learn a nonlinear state transition function in a latent space of greater dimensionality than the observation space. More crucially, our approach yields a tractable posterior over nonlinear systems that, as opposed to those based on sampling from the smoothing distribution [8], results in a computation time for the predictions that does not depend on the length of the time series. Given the interesting capabilities of variational GP-SSMs, we believe that future work is warranted. In particular, we want to focus on structured variational distributions q(x) that could eliminate the need to solve the smoothing problem in the auxiliary dynamical system at the cost of having more variational parameters to optimize. On a more theoretical side, we would like to better characterize GP-SSM priors in terms of their dynamical system properties: stability, equilibria, limit cycles, etc. 8 References [1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. [2] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems 26, pages 1727–1735. Curran Associates, Inc., 2013. [3] Emery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, and Matthew A. Wilson. A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. The Journal of Neuroscience, 18(18):7411–7425, 1998. [4] Andreas C. Damianou, Michalis Titsias, and Neil D. Lawrence. Variational Gaussian process dynamical systems. In Advances in Neural Information Processing Systems 24, pages 2510–2518. 2011. [5] J. Daunizeau, K.J. Friston, and S.J. Kiebel. Variational Bayesian identification and prediction of stochastic nonlinear dynamic causal models. Physica D: Nonlinear Phenomena, 238(21):2089 – 2118, 2009. [6] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2618–2626. 2012. [7] M. P. Deisenroth, R. D. Turner, M. F. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865 –1871, 2012. [8] Roger Frigola, Fredrik Lindsten, Thomas B. Sch¨on, and Carl E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS) 26. 2013. [9] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Ad- vances in Neural Information Processing Systems (NIPS) 11. MIT Press, 1999. [10] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013. [11] Eugene M Izhikevich. Neural excitability, spiking and bursting. International Journal of Bifurcation and Chaos, 10(06):1171–1266, 2000. [12] Neil D. Lawrence and Andrew J. Moore. The hierarchical Gaussian process latent variable model. In Zoubin Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning (ICML), 2007. [13] Fredrik Lindsten and Thomas B. Sch¨on. Backward simulation methods for Monte Carlo statistical infer- ence. Foundations and Trends in Machine Learning, 6(1):1–143, 2013. [14] K. Mizuseki, A. Sirota, E. Pastalkova, K. Diba, and G. Buzski. Multiple single unit recordings from different rat hippocampal and entorhinal regions while the animals were performing multiple behavioral tasks. CRCNS.org. http://dx.doi.org/10.6080/K09G5JRZ, 2013. [15] Manfred Opper. A bayesian approach to on-line learning. In David Saad, editor, On-Line Learning in Neural Networks. Cambridge University Press, 1998. [16] Van Overschee P. and De Moor B. Subspace Identification for Linear Systems, Theory, Implementation, Applications. Kluwer Academic Publishers, 1996. [17] J. Qui˜nonero Candela, A Girard, J. Larsen, and C.E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In Acoustics, Speech, and Signal Pro- cessing, 2003. Proceedings. (ICASSP ’03). 2003 IEEE International Conference on, volume 2, pages II–701–4 vol.2, April 2003. [18] J. Qui˜nonero-Candela and C.E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005. [19] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [20] S. S¨arkk¨a. Bayesian Filtering and Smoothing. Cambridge University Press, 2013. [21] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 3rd edition, 2011. [22] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009. [23] R. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-space inference and learning with Gaussian processes. In Yee Whye Teh and Mike Titterington, editors, 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W&CP, pages 868–875, Chia Laguna, Sardinia, Italy, 2010. [24] Harri Valpola and Juha Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11):2647–2692, 2002. [25] J.M. Wang, D.J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In Advances in Neural Information Processing Systems (NIPS) 18, pages 1441–1448. MIT Press, Cambridge, MA, 2006. 9 Supplementary Material A Remark on p(f|x) Note that T Y t=1 p(ft|f1:t−1, x0:t−1, u) ̸= p(f1:T |x0:T −1, u). (16) The right hand side is equivalent to Gaussian process regression where x0:T −1 are inputs and x1:T are outputs p(f1:T |x0:T −1, u) =N K0:T −1,{u,0:T −1}(K{u,0:T −1} + ΣQ)−1  u x0:T −1  , K0:T −1 −K0:T −1,{u,0:T −1}(K{u,0:T −1} + ΣQ)−1K⊤ 0:T −1,{u,0:T −1}  (17) where ΣQ = blockdiag(0, I ⊗Q) captures process noise. However, the left hand side of (16) is the product of terms that are equivalent to Gaussian process prediction with noiseless observations p(ft|f1:t−1, x0:t−1, u) =N Kt−1,{u,0:t−2}K−1 {u,0:t−2}  u f1:t−1  , Kt−1 −Kt−1,{u,0:t−2}K−1 {u,0:t−2}K⊤ t−1,{u,0:t−2}  . (18) The product of these terms can be succinctly represented by the following Gaussian T Y t=1 p(ft|f1:t−1, x0:t−1, u) =N f1:T | K0:T −1,uK−1 u,uu, K0:T −1 −K0:T −1,uK−1 u,uK⊤ 0:T −1,u  . (19) Therefore Z ft T Y t=1 p(ft|f1:t−1, x0:t−1, u) =N ft | Kt−1,uK−1 u,uu, Kt−1 −Kt−1,uK−1 u,uK⊤ t−1,u  . (20) B Relationship of Variational Approximation with Other Models B.1 Markovian Model with Heteroscedastic Noise Although not strictly a GP-SSM, it is also interesting to consider a model where the state transitions are independent of each other given the inducing variables p(f1:T , x0:T |u) = p(x0) T Y t=1 N ft|At−1u, Bt−1  p(xt|ft). (21) This model can be interpreted as a parametric model where At−1u is a deterministic transition function and Bt−1 provides the description of an heteroscedastic process noise. Note that this process noise is independent between any two time steps. If we look for an evidence lower bound with using an analogous procedure to that of section 3 and the following variational distribution q(x, f, u) = q(u)q(x) T Y t=1 N ft|At−1u, Bt−1  , (22) the lower bound becomes the same as the one in equation (9). 10 B.2 Bayesian RBF Model The Radial Basis Function in this model uses a deterministic state transition function of the form f(x) = A(x) u which leads to p(f1:T , x0:T |u) = p(x0) T Y t=1 N ft|At−1 u, 0  p(xt|ft). (23) As opposed to the model in B.1, this transition dynamics is not heteroscedastic. It is fully determin- istic and parameterized by u and z. From a GP perspective, this model is analogous to the Subset of Regressors sparse GP [18]. This model has the undesirable characteristic that the predictive vari- ance shrinks to zero away from the inducing inputs when exactly the opposite behaviour would be desirable. B.3 Double GP Model A state space model having a Gaussian process prior over the state transition function and the emis- sion/observation function can be represented by f(x) ∼GP mf(x), kf(x, x′)  , (24a) g(x) ∼GP mg(x), kg(x, x′)  , (24b) x0 ∼p(x0) (24c) xt | ft ∼N(xt | ft, Q), (24d) yt | gt ∼N(yt | gt, R), (24e) where we have used ft ≜f(xt−1) and gt ≜g(xt). If the transition GP is augmented with inducing variables u and the emission GP is augmented with v, we obtain the following joint distribution of the model p(y, x, f, u, g, v) = p(g|x, v) p(x, f|u) p(u) p(v) T Y t=1 p(yt|gt), (25) where p(x, f|u) is the same as in the model presented in the paper and p(g|x, v) is straightforward since it is conditioned on all the states. We use the following variational distribution over latent variables q(x, f, u, g, v) = q(u)q(v)q(x)p(g|x, v) T Y t=1 p(ft|f1:t−1, x0:t−1, u). (26) Terms with latent variables inside kernel matrices cancel inside the log log p(y|θ) ≥ Z x,f,u,g,v q(x, f, u, g, v) log p(u)p(v)p(x0) QT t=1 p(yt|gt)p(xt|ft) q(u)q(v)q(x) = −KL(q(u)∥p(u)) −KL(q(v)∥p(v)) + H(q(x)) + Z x q(x) log p(x0) + T X t=1  Z x,u q(x)q(u) Z ft p(ft|xt−1, u) log p(xt|ft) + Z x,v q(x)q(v) Z gt p(gt|xt, v) log p(yt|gt)  . The optimal distribution q∗(u) is the same as in eq. (10) and the optimal variational distribution of the emission inducing variables is a Gaussian distribution q∗(v) ∝p(v) T Y t=1 exp{⟨log N(yt|Ct v, R)⟩q(xt)} (27) 11 where Ct = Kt,vK−1 v,v, Dt = Kt,t −Kt,vK−1 v,vKv,t. The optimal variational distribution of the state trajectory is q∗(x) ∝p(x0) T Y t=1 exp{−1 2tr(Q−1(Bt−1 + At−1ΣAt−1 T )) −1 2tr(R−1(Dt + CtΛCt T ))} N(xt|At−1µ, Q) N(yt|Ctν, R), (28) where we have used q(v) = N(ν, Λ). C Optimization of Hyperparameters We optimize the hyperparameters and variational parameters (z1:M) with gradient ascent. The gra- dient w.r.t. θ is computed as ∂L ∂θ =  ∂ ∂θ log p(u)  q(u) +  ∂ ∂θ log p(x0)  q(x0) + T X t=1   −1 2 ∂ ∂θ tr(Q−1(Bt−1 + At−1ΣAT t−1))  q(xt−1) +  ∂ ∂θ log N(xt|At−1µ, Q)  q(xt,xt−1) +  ∂ ∂θ log p(yt|xt)  q(xt) ) . (29) The gradient with respect to z1:M is similar with θ replaced by z1:M. In this expression µ and Σ can be replaced by their optimal settings dependent on the sufficient statistics Ψ1 and Ψ2. D Plots from Experiments Section −10 −5 0 5 −10 −5 0 5 xt−1 ft & u Figure 4: Posterior distribution over latent state transition function (green: ground truth, blue: pos- terior mean, red: mean ±1 standard deviation). 12