Robust Filtering and Smoothing with Gaussian Processes

Marc Peter Deisenroth^{a,b,*}, Ryan Turner^{b}, Marco F. Huber^{c}, Uwe D. Hanebeck^{d}, Carl Edward Rasmussen^{b,e}

Abstract—We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by non-parametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. In this article, we present a principled algorithm for robust analytic smoothing in GP dynamic systems, which are increasingly used in robotics and control. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.

Index Terms—Nonlinear systems, Bayesian inference, smoothing, Gaussian processes, machine learning

I. INTRODUCTION

Filtering and smoothing in the context of dynamic systems refers to a Bayesian methodology for computing posterior distributions of the latent state based on a history of noisy measurements. This kind of methodology can be found, e.g., in navigation, control engineering, robotics, and machine learning [1]–[4]. Solutions to filtering [1]–[5] and smoothing [6]–[9] in linear dynamic systems are well known, and numerous approximations for nonlinear systems have been proposed, for both filtering [10]–[15] and smoothing [16]–[19].

In this article, we focus on Gaussian filtering and smoothing in Gaussian process (GP) dynamic systems. GPs are a robust non-parametric method for approximating unknown functions by a posterior distribution over them [20], [21].
Although GPs have been around for decades, they only recently became computationally interesting for applications in robotics, control, and machine learning [22]–[26].

The contribution of this article is the derivation of a novel, principled, and robust Rauch-Tung-Striebel (RTS) smoother for GP dynamic systems, which we call the GP-RTSS. The GP-RTSS computes a Gaussian approximation to the smoothing distribution in closed form. The posterior filtering and smoothing distributions can be computed without linearization [10] or (small) sampling approximations of densities [11].

We provide numerical evidence that the GP-RTSS is more robust than state-of-the-art nonlinear Gaussian filtering and smoothing algorithms including the extended Kalman filter (EKF) [10], the unscented Kalman filter (UKF) [11], the cubature Kalman filter (CKF) [15], the GP-UKF [12], and their corresponding RTS smoothers. Robustness refers to the ability of an inferred distribution to explain the "true" state/measurement.

* Corresponding author
a University of Washington, Seattle, WA, USA
b University of Cambridge, UK
c AGT Group (R&D) GmbH, Darmstadt, Germany
d Karlsruhe Institute of Technology, Germany
e Max Planck Institute for Biological Cybernetics, Tübingen, Germany

The paper is structured as follows: In Secs. I-A–I-B, we introduce the problem setup and necessary background on Gaussian smoothing and GP dynamic systems. In Sec. II, we briefly introduce Gaussian process regression, discuss the expressiveness of a GP, and explain how to train GPs. Sec. III details our proposed method (GP-RTSS) for smoothing in GP dynamic systems. In Sec. IV, we provide experimental evidence of the robustness of the GP-RTSS. Sec. V concludes the paper with a discussion.

A.
Problem Formulation and Notation

In this article, we consider discrete-time stochastic systems

$x_t = f(x_{t-1}) + w_t$,  (1)
$z_t = g(x_t) + v_t$,  (2)

where $x_t \in \mathbb{R}^D$ is the state, $z_t \in \mathbb{R}^E$ is the measurement at time step $t$, $w_t \sim \mathcal{N}(0, \Sigma_w)$ is Gaussian system noise, $v_t \sim \mathcal{N}(0, \Sigma_v)$ is Gaussian measurement noise, $f$ is the transition function (or system function), and $g$ is the measurement function. The discrete time steps $t$ run from 0 to $T$. The initial state $x_0$ of the time series is distributed according to a Gaussian prior distribution $p(x_0) = \mathcal{N}(\mu_0^x, \Sigma_0^x)$.

The purpose of filtering and smoothing is to find approximations to the posterior distributions $p(x_t \mid z_{1:\tau})$, where $1{:}\tau$ in a subindex abbreviates $1, \dots, \tau$, with $\tau = t$ during filtering and $\tau = T$ during smoothing. In this article, we consider Gaussian approximations $p(x_t \mid z_{1:\tau}) \approx \mathcal{N}(x_t \mid \mu^x_{t|\tau}, \Sigma^x_{t|\tau})$ of the latent state posterior distributions. We use the shorthand notation $a^d_{b|c}$, where $a = \mu$ denotes the mean $\mu$ and $a = \Sigma$ denotes the covariance, $b$ denotes the time step under consideration, $c$ denotes the time step up to which we consider measurements, and $d \in \{x, z\}$ denotes either the latent space ($x$) or the observed space ($z$).

B. Gaussian RTS Smoothing

Given the filtering distributions $p(x_t \mid z_{1:t}) = \mathcal{N}(x_t \mid \mu^x_{t|t}, \Sigma^x_{t|t})$, $t = 1, \dots, T$, a sufficient condition for Gaussian smoothing is the computation of Gaussian approximations of the joint distributions $p(x_{t-1}, x_t \mid z_{1:t-1})$, $t = 1, \dots, T$ [19]. In Gaussian smoothers, the standard smoothing distribution for the dynamic system in Eqs.
(1)–(2) is always

$p(x_{t-1} \mid z_{1:T}) = \mathcal{N}(x_{t-1} \mid \mu^x_{t-1|T}, \Sigma^x_{t-1|T})$,  (3)

where

$\mu^x_{t-1|T} = \mu^x_{t-1|t-1} + J_{t-1}(\mu^x_{t|T} - \mu^x_{t|t})$,  (4)
$\Sigma^x_{t-1|T} = \Sigma^x_{t-1|t-1} + J_{t-1}(\Sigma^x_{t|T} - \Sigma^x_{t|t})\, J_{t-1}^\top$,  (5)
$J_{t-1} := \Sigma^x_{t-1,t|t-1} (\Sigma^x_{t|t-1})^{-1}$, $\quad t = T, \dots, 1$.  (6)

Depending on the methodology of computing this joint distribution, we can directly derive arbitrary RTS smoothing algorithms, including the URTSS [16], the EKS [1], [10], the CKS [19], a smoothing extension to the CKF [15], or the GP-URTSS, a smoothing extension to the GP-UKF [12]. The individual smoothers (URTSS, EKS, CKS, GP-based smoothers, etc.) simply differ in the way of computing/estimating the means and covariances required in Eqs. (4)–(6) [19].

arXiv:1203.4345v1 [cs.SY] 20 Mar 2012

To derive the GP-URTSS, we closely follow the derivation of the URTSS [16]. The GP-URTSS is a novel smoother, but its derivation is relatively straightforward and therefore not detailed in this article. Instead, we detail the derivation of the GP-RTSS, a robust Rauch-Tung-Striebel smoother for GP dynamic systems, which is based on the analytic computation of the means and (cross-)covariances in Eqs. (4)–(6).

In GP dynamic systems, the transition function $f$ and the measurement function $g$ in Eqs. (1)–(2) are modeled by Gaussian processes. This setup is becoming more relevant in practical applications such as robotics and control, where it can be difficult to find an accurate parametric form of $f$ and $g$, respectively [25], [27]. Given the increasing use of GP models in robotics and control, the robustness of Bayesian state estimation is important.

II. GAUSSIAN PROCESSES

In the standard GP regression model, we assume that the data $\mathcal{D} := \{X := [x_1, \dots, x_n]^\top, y := [y_1, \dots, y_n]^\top\}$ have been generated according to $y_i = h(x_i) + \varepsilon_i$, where $h : \mathbb{R}^D \to \mathbb{R}$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ is independent (measurement) noise.
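The generic backward pass of Eqs. (4)–(6) can be sketched in a few lines; the following minimal NumPy version (function and variable names are ours, not the paper's) takes the filter moments, the one-step-ahead moments, and the cross-covariances as inputs, which is all any of the RTS variants above differ in:

```python
import numpy as np

def rts_backward(mu_f, P_f, mu_p, P_p, C):
    """Generic Gaussian RTS backward pass, Eqs. (4)-(6).

    mu_f[t], P_f[t]: filter moments mu^x_{t|t}, Sigma^x_{t|t}
    mu_p[t], P_p[t]: one-step-ahead moments mu^x_{t|t-1}, Sigma^x_{t|t-1}
    C[t]:            cross-covariance Sigma^x_{t-1,t|t-1}
    """
    mu_s, P_s = mu_f.copy(), P_f.copy()          # at t = T smoothing equals filtering
    for t in range(len(mu_f) - 1, 0, -1):
        J = C[t] @ np.linalg.inv(P_p[t])                       # Eq. (6)
        mu_s[t - 1] = mu_f[t - 1] + J @ (mu_s[t] - mu_p[t])    # Eq. (4)
        P_s[t - 1] = P_f[t - 1] + J @ (P_s[t] - P_p[t]) @ J.T  # Eq. (5)
    return mu_s, P_s

# sanity check: with zero cross-covariance, smoothing changes nothing
T = 4
mu_f = np.arange(T, dtype=float).reshape(T, 1)
P_f = np.ones((T, 1, 1))
mu_s, P_s = rts_backward(mu_f, P_f, np.zeros((T, 1)), np.ones((T, 1, 1)),
                         np.zeros((T, 1, 1)))
```

The sanity check reflects Eq. (6): if $x_{t-1}$ and $x_t$ do not covary, $J_{t-1} = 0$ and no information flows backward.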
GPs consider $h$ a random function and infer a posterior distribution over $h$ from data. The posterior is used to make predictions about function values $h(x_*)$ for arbitrary inputs $x_* \in \mathbb{R}^D$. Similar to a Gaussian distribution, which is fully specified by a mean vector and a covariance matrix, a GP is fully specified by a mean function $m_h(\cdot)$ and a covariance function

$k_h(x, x') := \mathrm{E}_h[(h(x) - m_h(x))(h(x') - m_h(x'))]$  (7)
$= \mathrm{cov}_h[h(x), h(x')] \in \mathbb{R}$, $\quad x, x' \in \mathbb{R}^D$,  (8)

which specifies the covariance between any two function values. Here, $\mathrm{E}_h$ denotes the expectation with respect to the function $h$. The covariance function $k_h(\cdot, \cdot)$ is also called a kernel.

Unless stated otherwise, we consider a prior mean function $m_h \equiv 0$ and use the squared exponential (SE) covariance function with automatic relevance determination,

$k_{SE}(x_p, x_q) := \alpha^2 \exp\!\left(-\tfrac{1}{2}(x_p - x_q)^\top \Lambda^{-1}(x_p - x_q)\right)$  (9)

for $x_p, x_q \in \mathbb{R}^D$, plus a noise covariance function $k_{noise} := \delta_{pq}\sigma_\varepsilon^2$, such that $k_h = k_{SE} + k_{noise}$. Here, $\delta_{pq}$ denotes the Kronecker symbol, which is unity when $p = q$ and zero otherwise, resulting in i.i.d. measurement noise. In Eq. (9), $\Lambda = \mathrm{diag}([\ell_1^2, \dots, \ell_D^2])$ is a diagonal matrix of squared characteristic length-scales $\ell_i$, $i = 1, \dots, D$, and $\alpha^2$ is the signal variance of the latent function $h$. By using the SE covariance function from Eq. (9), we assume that the latent function $h$ is smooth and stationary. Smoothness and stationarity are easier to justify than a fixed parametric form of the underlying function.

A. Expressiveness of the Model

Although the SE covariance function and the zero prior mean function are common defaults, they retain a great deal of expressiveness.
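The SE covariance function of Eq. (9) is simple to implement; the sketch below (our own helper, not from the paper) builds the kernel matrix for two sets of inputs, with one length-scale per input dimension as in ARD:

```python
import numpy as np

def k_se(Xp, Xq, alpha2, lengthscales):
    """SE covariance with automatic relevance determination, Eq. (9).
    Xp: (n, D), Xq: (m, D); returns the (n, m) kernel matrix."""
    lam_inv = 1.0 / np.asarray(lengthscales) ** 2   # diagonal of Lambda^{-1}
    d = Xp[:, None, :] - Xq[None, :, :]             # pairwise differences x_p - x_q
    quad = np.sum(d ** 2 * lam_inv, axis=-1)        # (x_p-x_q)^T Lambda^{-1} (x_p-x_q)
    return alpha2 * np.exp(-0.5 * quad)

X = np.array([[0.0], [1.0]])
K = k_se(X, X, alpha2=2.0, lengthscales=[1.0])
```

On the diagonal the kernel returns the signal variance $\alpha^2$; off the diagonal it decays with the (length-scale-weighted) squared distance.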
Inspired by [20], [28], we demonstrate this expressiveness and show the correspondence of our GP model to a universal function approximator: Consider a function

$h(x) = \sum_{i \in \mathbb{Z}} \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \gamma_n \exp\!\left(-\frac{(x - (i + \frac{n}{N}))^2}{\lambda^2}\right)$,  (10)

where $\gamma_n \sim \mathcal{N}(0, 1)$, $n = 1, \dots, N$. Note that in the limit $h(x)$ is represented by infinitely many Gaussian-shaped basis functions along the real axis with variance $\lambda^2$ and prior (Gaussian) random weights $\gamma_n$, for $x \in \mathbb{R}$ and for all $i \in \mathbb{Z}$. The model in Eq. (10) is considered a universal function approximator. Writing the sums in Eq. (10) as an integral over the real axis $\mathbb{R}$, we obtain

$h(x) = \sum_{i \in \mathbb{Z}} \int_i^{i+1} \gamma(s) \exp\!\left(-\frac{(x-s)^2}{\lambda^2}\right) ds = \int_{-\infty}^{\infty} \gamma(s) \exp\!\left(-\frac{(x-s)^2}{\lambda^2}\right) ds = (\gamma * K)(x)$,  (11)

where $\gamma(s) \sim \mathcal{N}(0, 1)$ is a white-noise process and $K$ is a Gaussian convolution kernel. The function values of $h$ are jointly normal, which follows from the convolution $\gamma * K$. We now analyze the mean function and the covariance function of $h$, which fully specify the distribution of $h$. The only random variables are the weights $\gamma(s)$. Computing the expected function of this model (prior mean function) requires averaging over $\gamma(s)$ and yields

$\mathrm{E}_\gamma[h(x)] = \int h(x)\, p(\gamma(s))\, d\gamma(s)$  (12)
$\overset{(11)}{=} \int \exp\!\left(-\frac{(x-s)^2}{\lambda^2}\right) \int \gamma(s)\, p(\gamma(s))\, d\gamma(s)\, ds = 0$,  (13)

since $\mathrm{E}_\gamma[\gamma(s)] = 0$. Hence, the mean function of $h$ equals zero everywhere. Let us now find the covariance function. Since the mean function equals zero, for any $x, x' \in \mathbb{R}$ we obtain

$\mathrm{cov}_\gamma[h(x), h(x')] = \int h(x)\, h(x')\, p(\gamma(s))\, d\gamma(s) = \int \exp\!\left(-\frac{(x-s)^2}{\lambda^2}\right) \exp\!\left(-\frac{(x'-s)^2}{\lambda^2}\right) \int \gamma(s)^2\, p(\gamma(s))\, d\gamma(s)\, ds$,  (14)

where we used the definition of $h$ in Eq. (11).
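The Gaussian integral in Eq. (14) can be checked numerically. The short script below (our own sketch, with arbitrary values for $\lambda$, $x$, $x'$) evaluates the integral by a fine Riemann sum and compares it with the SE form obtained by completing the square, whose normalizer works out to $\alpha^2 = \lambda\sqrt{\pi/2}$:

```python
import numpy as np

lam, x, xp = 0.7, 0.3, -0.4
s = np.linspace(-30.0, 30.0, 200001)
ds = s[1] - s[0]

# integrand of Eq. (14), using var[gamma(s)] = 1
integrand = np.exp(-((x - s) ** 2 + (xp - s) ** 2) / lam ** 2)
numeric = integrand.sum() * ds

# completing the square gives an SE covariance with alpha^2 = lam * sqrt(pi/2)
alpha2 = lam * np.sqrt(np.pi / 2.0)
analytic = alpha2 * np.exp(-((x - xp) ** 2) / (2.0 * lam ** 2))
```

The two values agree to high precision, confirming that the basis-function model induces exactly an SE covariance in $x$ and $x'$.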
Using $\mathrm{var}_\gamma[\gamma(s)] = 1$ and completing the squares yields

$\mathrm{cov}_\gamma[h(x), h(x')] = \int \exp\!\left(-\frac{2\big(s - \frac{x + x'}{2}\big)^2 + \frac{(x - x')^2}{2}}{\lambda^2}\right) ds = \alpha^2 \exp\!\left(-\frac{(x - x')^2}{2\lambda^2}\right)$  (15)

for a suitable $\alpha^2$. From Eqs. (13) and (15), we see that the mean function and the covariance function of the universal function approximator in Eq. (10) correspond to the GP model assumptions we made earlier: a prior mean function $m_h \equiv 0$ and the SE covariance function in Eq. (9) for a one-dimensional input space. Hence, the considered GP prior implicitly assumes latent functions $h$ that can be described by the universal function approximator in Eq. (11). Examples of covariance functions that encode different model assumptions are given in [21].

B. Training via Evidence Maximization

For $E$ target dimensions, we train $E$ GPs assuming that the target dimensions are independent at a deterministically given test input (if the test input is uncertain, the target dimensions covary): After observing a data set $\mathcal{D}$, for each (training) target dimension, we learn the $D + 1$ hyper-parameters of the covariance function and the noise variance of the data using evidence maximization [20], [21]: Collecting all $(D + 2)E$ hyper-parameters in the vector $\theta$, evidence maximization yields a point estimate $\hat{\theta} \in \arg\max_\theta \log p(y \mid X, \theta)$. Evidence maximization automatically trades off data fit with function complexity and avoids overfitting [21].

From here onward, we consider the GP dynamic system setup, where two GP models have been trained using evidence maximization: $GP_f$, which models the mapping $x_{t-1} \mapsto x_t$, $\mathbb{R}^D \to \mathbb{R}^D$, see Eq. (1), and $GP_g$, which models the mapping $x_t \mapsto z_t$, $\mathbb{R}^D \to \mathbb{R}^E$, see Eq. (2). To keep the notation uncluttered, we do not explicitly condition on the hyper-parameters $\hat{\theta}$ and the training data $\mathcal{D}$ in the following.

III.
ROBUST SMOOTHING IN GAUSSIAN PROCESS DYNAMIC SYSTEMS

Analytic moment-based filtering in GP dynamic systems has been proposed in [13], where the filter distribution is given by

$p(x_t \mid z_{1:t}) = \mathcal{N}(x_t \mid \mu^x_{t|t}, \Sigma^x_{t|t})$,  (16)
$\mu^x_{t|t} = \mu^x_{t|t-1} + \Sigma^{xz}_{t|t-1}(\Sigma^z_{t|t-1})^{-1}(z_t - \mu^z_{t|t-1})$,  (17)
$\Sigma^x_{t|t} = \Sigma^x_{t|t-1} - \Sigma^{xz}_{t|t-1}(\Sigma^z_{t|t-1})^{-1}\Sigma^{zx}_{t|t-1}$,  (18)

for $t = 1, \dots, T$. Here, we extend these filtering results to analytic moment-based smoothing, where we explicitly take nonlinearities into account (no linearization required) while propagating full Gaussian densities (no sigma/cubature-point representation required) through nonlinear GP models.

In the following, we detail our novel RTS smoothing approach for GP dynamic systems. We fit our smoother in the standard frame of Eqs. (4)–(6). For this, we compute the means and covariances of the Gaussian approximation

$\mathcal{N}\!\left(\begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu^x_{t-1|t-1} \\ \mu^x_{t|t-1} \end{bmatrix}, \begin{bmatrix} \Sigma^x_{t-1|t-1} & \Sigma^x_{t-1,t|t-1} \\ \Sigma^x_{t,t-1|t-1} & \Sigma^x_{t|t-1} \end{bmatrix}\right)$  (19)

to the joint $p(x_{t-1}, x_t \mid z_{1:t-1})$, after which the smoother is fully determined [19]. Our approximation does not involve sampling, linearization, or numerical integration. Instead, we present closed-form expressions of a deterministic Gaussian approximation of the joint distribution in Eq. (19).

In our case, the mapping $x_{t-1} \mapsto x_t$ is not known, but instead it is distributed according to $GP_f$, a distribution over system functions. For robust filtering and smoothing, we therefore need to take the GP (model) uncertainty into account by Bayesian averaging according to the GP distribution [13], [29]. The marginal $p(x_{t-1} \mid z_{1:t-1}) = \mathcal{N}(\mu^x_{t-1|t-1}, \Sigma^x_{t-1|t-1})$ is known from filtering [13]. In Sec. III-A, we compute the mean and covariance of the second marginal $p(x_t \mid z_{1:t-1})$, and then in Sec.
III-B the cross-covariance terms $\Sigma^x_{t-1,t|t-1} = \mathrm{cov}[x_{t-1}, x_t \mid z_{1:t-1}]$.

A. Marginal Distribution

1) Marginal Mean: Using the system Eq. (1) and integrating over all three sources of uncertainty (the system noise, the state $x_{t-1}$, and the system function itself), we apply the law of total expectation and obtain the marginal mean

$\mu^x_{t|t-1} = \mathrm{E}_{x_{t-1}}\big[\mathrm{E}_f[f(x_{t-1}) \mid x_{t-1}] \mid z_{1:t-1}\big]$.  (20)

The expectations in Eq. (20) are taken with respect to the posterior GP distribution $p(f)$ and the filter distribution $p(x_{t-1} \mid z_{1:t-1}) = \mathcal{N}(\mu^x_{t-1|t-1}, \Sigma^x_{t-1|t-1})$ at time step $t - 1$. Eq. (20) can be rewritten as $\mu^x_{t|t-1} = \mathrm{E}_{x_{t-1}}[m_f(x_{t-1}) \mid z_{1:t-1}]$, where $m_f(x_{t-1}) := \mathrm{E}_f[f(x_{t-1}) \mid x_{t-1}]$ is the posterior mean function of $GP_f$. Writing $m_f$ as a finite sum over the SE kernels centered at all $n$ training inputs [21], the predicted mean for each target dimension $a = 1, \dots, D$ is

$(\mu^x_{t|t-1})_a = \int m_{f_a}(x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1} = \sum_{i=1}^{n} \beta^x_{a_i} \int k_{f_a}(x_{t-1}, x_i)\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}$,  (21)

where $p(x_{t-1} \mid z_{1:t-1}) = \mathcal{N}(x_{t-1} \mid \mu^x_{t-1|t-1}, \Sigma^x_{t-1|t-1})$ is the filter distribution at time $t - 1$. Moreover, $x_i$, $i = 1, \dots, n$, are the training inputs of $GP_f$, $k_{f_a}$ is the covariance function of $GP_f$ for the $a$-th target dimension (GP hyper-parameters are not shared across dimensions), and $\beta^x_a := (K_{f_a} + \sigma^2_{w_a} I)^{-1} y_a \in \mathbb{R}^n$. For dimension $a$, $K_{f_a}$ denotes the kernel matrix (Gram matrix), where $K_{f_a,ij} = k_{f_a}(x_i, x_j)$, $i, j = 1, \dots, n$. Moreover, $y_a$ are the training targets, and $\sigma^2_{w_a}$ is the learned system noise variance. The vector $\beta^x_a$ has been pulled out of the integration since it is independent of $x_{t-1}$. Note that $x_{t-1}$ serves as a test input from the perspective of the GP regression model. For the SE covariance function in Eq.
(9), the integral in Eq. (21) can be computed analytically (other tractable choices are covariance functions containing combinations of squared exponentials, trigonometric functions, and polynomials). The marginal mean is given as

$(\mu^x_{t|t-1})_a = (\beta^x_a)^\top q^x_a$,  (22)

where we defined

$q^x_{a_i} := \alpha^2_{f_a}\, |\Sigma^x_{t-1|t-1}\Lambda_a^{-1} + I|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i - \mu^x_{t-1|t-1})^\top S^{-1}(x_i - \mu^x_{t-1|t-1})\right)$,  (23)
$S := \Sigma^x_{t-1|t-1} + \Lambda_a$,  (24)

$i = 1, \dots, n$, being the solution to the integral in Eq. (21). Here, $\alpha^2_{f_a}$ is the signal variance of the $a$-th target dimension of $GP_f$, a learned hyper-parameter of the SE covariance function, see Eq. (9).

2) Marginal Covariance Matrix: We now explicitly compute the entries of the corresponding covariance $\Sigma^x_{t|t-1}$. Using the law of total covariance, we obtain for $a, b = 1, \dots, D$

$(\Sigma^x_{t|t-1})_{ab} = \mathrm{cov}_{x_{t-1},f,w}[x^{(a)}_t, x^{(b)}_t \mid z_{1:t-1}]$  (25)
$= \mathrm{E}_{x_{t-1}}\big[\mathrm{cov}_{f,w}[f_a(x_{t-1}) + w_a, f_b(x_{t-1}) + w_b \mid x_{t-1}] \mid z_{1:t-1}\big] + \mathrm{cov}_{x_{t-1}}\big[\mathrm{E}_{f_a}[f_a(x_{t-1}) \mid x_{t-1}], \mathrm{E}_{f_b}[f_b(x_{t-1}) \mid x_{t-1}] \mid z_{1:t-1}\big]$,

where we exploited in the last term that the system noise $w$ has mean zero. Note that Eq. (25) is the sum of the covariance of (conditional) expected values and the expectation of a (conditional) covariance. We analyze these terms in the following. The covariance of the expectations in Eq. (25) is

$\int m_{f_a}(x_{t-1})\, m_{f_b}(x_{t-1})\, p(x_{t-1})\, dx_{t-1} - (\mu^x_{t|t-1})_a (\mu^x_{t|t-1})_b$,  (26)

where we used that $\mathrm{E}_f[f(x_{t-1}) \mid x_{t-1}] = m_f(x_{t-1})$. With $\beta^x_a = (K_{f_a} + \sigma^2_{w_a} I)^{-1} y_a$ and $m_{f_a}(x_{t-1}) = k_{f_a}(X, x_{t-1})^\top \beta^x_a$, we obtain

$\mathrm{cov}_{x_{t-1}}[m_{f_a}(x_{t-1}), m_{f_b}(x_{t-1}) \mid z_{1:t-1}] = (\beta^x_a)^\top Q\, \beta^x_b - (\mu^x_{t|t-1})_a (\mu^x_{t|t-1})_b$
(27)

Following [30], the entries of $Q \in \mathbb{R}^{n \times n}$ are given as

$Q_{ij} = \frac{k_{f_a}(x_i, \mu^x_{t-1|t-1})\, k_{f_b}(x_j, \mu^x_{t-1|t-1})}{\sqrt{|R|}} \exp\!\left(\tfrac{1}{2} z_{ij}^\top R^{-1} \Sigma^x_{t-1|t-1} z_{ij}\right) = \exp(n^2_{ij})/\sqrt{|R|}$,  (28)
$n^2_{ij} = \log(\alpha^2_{f_a}) + \log(\alpha^2_{f_b}) - \tfrac{1}{2}\big(\zeta_i^\top \Lambda_a^{-1} \zeta_i + \zeta_j^\top \Lambda_b^{-1} \zeta_j - z_{ij}^\top R^{-1} \Sigma^x_{t-1|t-1} z_{ij}\big)$,

where we defined $R := \Sigma^x_{t-1|t-1}(\Lambda_a^{-1} + \Lambda_b^{-1}) + I$, $\zeta_i := x_i - \mu^x_{t-1|t-1}$, and $z_{ij} := \Lambda_a^{-1}\zeta_i + \Lambda_b^{-1}\zeta_j$.

The expected covariance in Eq. (25) is given as

$\mathrm{E}_{x_{t-1}}\big[\mathrm{cov}_f[f_a(x_{t-1}), f_b(x_{t-1}) \mid x_{t-1}] \mid z_{1:t-1}\big] + \delta_{ab}\,\sigma^2_{w_a}$,  (29)

since the noise covariance matrix $\Sigma_w$ is diagonal. Following our GP training assumption that different target dimensions do not covary if the input is deterministically given, Eq. (29) is only non-zero if $a = b$, i.e., Eq. (29) plays a role only for the diagonal entries of $\Sigma^x_{t|t-1}$. For these diagonal entries ($a = b$), the expected covariance in Eq. (29) is

$\alpha^2_{f_a} - \mathrm{tr}\big((K_{f_a} + \sigma^2_{w_a} I)^{-1} Q\big) + \sigma^2_{w_a}$.  (30)

Hence, the desired marginal covariance matrix in Eq. (25) is

$(\Sigma^x_{t|t-1})_{ab} = \begin{cases} \text{Eq. (27)} + \text{Eq. (30)} & \text{if } a = b, \\ \text{Eq. (27)} & \text{otherwise.} \end{cases}$  (31)

We have now solved for the marginal distribution $p(x_t \mid z_{1:t-1})$ in Eq. (19). Since the approximate Gaussian filter distribution $p(x_{t-1} \mid z_{1:t-1}) = \mathcal{N}(\mu^x_{t-1|t-1}, \Sigma^x_{t-1|t-1})$ is also known, it remains to compute the cross-covariance $\Sigma^x_{t-1,t|t-1}$ to fully determine the Gaussian approximation in Eq. (19).

B. Cross-Covariance

By the definition of a covariance and the system Eq. (1), the missing cross-covariance matrix $\Sigma^x_{t-1,t|t-1}$ in Eq. (19) is

$\Sigma^x_{t-1,t|t-1} = \mathrm{E}_{x_{t-1},f,w_t}\big[x_{t-1}(f(x_{t-1}) + w_t)^\top \mid z_{1:t-1}\big] - \mu^x_{t-1|t-1}(\mu^x_{t|t-1})^\top$,  (32)

where $\mu^x_{t-1|t-1}$ is the mean of the filter update at time step $t - 1$ and $\mu^x_{t|t-1}$ is the mean of the time update, see Eq. (20).
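The moment expressions above translate directly into code. The following minimal NumPy sketch (function and variable names are ours, and only the SE-kernel case of Eqs. (22)–(24) and (28) is covered) computes the vector $q^x_a$ and the matrix $Q$ for one pair of target dimensions:

```python
import numpy as np

def q_vector(X, alpha2, Lam, mu, Sigma):
    """q_i of Eq. (23): expectation of the SE kernel k(x_i, .) under N(mu, Sigma)."""
    D = X.shape[1]
    det = np.linalg.det(Sigma @ np.linalg.inv(Lam) + np.eye(D))
    diff = X - mu                                       # rows x_i - mu
    Sinv = np.linalg.inv(Sigma + Lam)                   # S of Eq. (24)
    quad = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
    return alpha2 / np.sqrt(det) * np.exp(-0.5 * quad)

def Q_matrix(X, a2a, a2b, Lam_a, Lam_b, mu, Sigma):
    """Q_ij of Eq. (28) for the SE kernels of target dimensions a and b."""
    n, D = X.shape
    La_inv, Lb_inv = np.linalg.inv(Lam_a), np.linalg.inv(Lam_b)
    R = Sigma @ (La_inv + Lb_inv) + np.eye(D)
    RS = np.linalg.solve(R, Sigma)                      # R^{-1} Sigma
    zeta = X - mu                                       # zeta_i of Eq. (28)
    qa = np.einsum('ij,jk,ik->i', zeta, La_inv, zeta)
    qb = np.einsum('ij,jk,ik->i', zeta, Lb_inv, zeta)
    z = (zeta @ La_inv)[:, None, :] + (zeta @ Lb_inv)[None, :, :]  # z_ij
    quad = np.einsum('ijk,kl,ijl->ij', z, RS, z)
    n2 = np.log(a2a) + np.log(a2b) - 0.5 * (qa[:, None] + qb[None, :] - quad)
    return np.exp(n2) / np.sqrt(np.linalg.det(R))

# sanity check at a deterministic input (Sigma = 0): q_i reduces to k(x_i, mu),
# and the covariance-of-means term of Eq. (27) vanishes
X = np.array([[0.0], [1.0], [2.0]])
beta = np.array([0.5, -0.2, 0.1])
mu, Sigma = np.array([0.3]), np.zeros((1, 1))
q = q_vector(X, 1.0, np.eye(1), mu, Sigma)
Q = Q_matrix(X, 1.0, 1.0, np.eye(1), np.eye(1), mu, Sigma)
mean_term = beta @ q                          # Eq. (22), one target dimension
cov_term = beta @ Q @ beta - mean_term ** 2   # Eq. (27) with a = b
```

The sanity check mirrors the derivation: with no input uncertainty, the predicted mean is the ordinary GP posterior mean at $\mu$, so the covariance contributed by the uncertain input is zero.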
Note that we explicitly average out the model uncertainty about $f$. Using the law of total expectation, we obtain

$\Sigma^x_{t-1,t|t-1} = \mathrm{E}_{x_{t-1}}\big[x_{t-1}\, \mathrm{E}_{f,w_t}[f(x_{t-1}) + w_t \mid x_{t-1}]^\top \mid z_{1:t-1}\big] - \mu^x_{t-1|t-1}(\mu^x_{t|t-1})^\top$  (33)
$= \mathrm{E}_{x_{t-1}}\big[x_{t-1}\, m_f(x_{t-1})^\top \mid z_{1:t-1}\big] - \mu^x_{t-1|t-1}(\mu^x_{t|t-1})^\top$,  (34)

where we used the fact that $\mathrm{E}_{f,w_t}[f(x_{t-1}) + w_t \mid x_{t-1}] = m_f(x_{t-1})$ is the mean function of $GP_f$, which models the mapping $x_{t-1} \mapsto x_t$, evaluated at $x_{t-1}$. We thus obtain

$\Sigma^x_{t-1,t|t-1} = \int x_{t-1}\, m_f(x_{t-1})^\top\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1} - \mu^x_{t-1|t-1}(\mu^x_{t|t-1})^\top$.  (35)

Writing $m_f(x_{t-1})$ as a finite sum over kernels [21] and moving the integration into this sum, the integration in Eq. (35) turns into

$\int x_{t-1}\, m_{f_a}(x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1} = \sum_{i=1}^{n} \beta^x_{a_i} \int x_{t-1}\, k_{f_a}(x_{t-1}, x_i)\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}$

for each state dimension $a = 1, \dots, D$. With the SE covariance function $k_{SE}$ defined in Eq. (9), we compute the integral analytically and obtain

$\int x_{t-1}\, m_{f_a}(x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1} = \sum_{i=1}^{n} \beta^x_{a_i} \int x_{t-1}\, c_3\, \mathcal{N}(x_{t-1} \mid x_i, \Lambda_a)\, \mathcal{N}(x_{t-1} \mid \mu^x_{t-1|t-1}, \Sigma^x_{t-1|t-1})\, dx_{t-1}$,  (36)

where we defined $c_3 := \alpha^2_{f_a}(2\pi)^{D/2}\sqrt{|\Lambda_a|}$, such that $k_{f_a}(x_{t-1}, x_i) = c_3\, \mathcal{N}(x_{t-1} \mid x_i, \Lambda_a)$. In the definition of $c_3$, $\alpha^2_{f_a}$ is the hyper-parameter of $GP_f$ responsible for the variance of the latent function in dimension $a$. Using the definition of $S$ in Eq. (24), the product of the two Gaussians in Eq.
(36) results in a new (unnormalized) Gaussian $c_4^{-1}\,\mathcal{N}(x_{t-1} \mid \psi_i, \Psi)$ with

$c_4^{-1} = (2\pi)^{-D/2}\, |\Lambda_a + \Sigma^x_{t-1|t-1}|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i - \mu^x_{t-1|t-1})^\top S^{-1}(x_i - \mu^x_{t-1|t-1})\right)$,
$\Psi = \big(\Lambda_a^{-1} + (\Sigma^x_{t-1|t-1})^{-1}\big)^{-1}$,
$\psi_i = \Psi\big(\Lambda_a^{-1} x_i + (\Sigma^x_{t-1|t-1})^{-1}\mu^x_{t-1|t-1}\big)$.

Pulling all constants outside the integral in Eq. (36), the integral determines the expected value of the product of the two Gaussians, $\psi_i$. For $a = 1, \dots, D$, we obtain $\mathrm{E}[x_{t-1} x_{t_a} \mid z_{1:t-1}] = \sum_{i=1}^{n} c_3 c_4^{-1} \beta^x_{a_i} \psi_i$. Using $c_3 c_4^{-1} = q^x_{a_i}$, see Eq. (23), and some matrix identities, we finally obtain

$\sum_{i=1}^{n} \beta^x_{a_i} q^x_{a_i}\, \Sigma^x_{t-1|t-1}\big(\Sigma^x_{t-1|t-1} + \Lambda_a\big)^{-1}(x_i - \mu^x_{t-1|t-1})$  (37)

for $\mathrm{cov}_{x_{t-1},f,w_t}[x_{t-1}, x_{t_a} \mid z_{1:t-1}]$, and the joint covariance matrix of $p(x_{t-1}, x_t \mid z_{1:t-1})$ and, hence, the full Gaussian approximation in Eq. (19) is completely determined.

With the mean and the covariance of the joint distribution $p(x_{t-1}, x_t \mid z_{1:t-1})$ given by Eqs. (22), (31), (37), and the filter step, all necessary components are provided to compute the smoothing distribution $p(x_t \mid z_{1:T})$ analytically [19].

IV. SIMULATIONS

In the following, we present results analyzing the robustness of state-of-the-art nonlinear filters (Sec. IV-A) and the performances of the corresponding smoothers (Sec. IV-B).

A. Filter Robustness

We consider the nonlinear stochastic dynamic system

$x_t = \frac{x_{t-1}}{2} + \frac{25\, x_{t-1}}{1 + x_{t-1}^2} + w_t$, $\quad w_t \sim \mathcal{N}(0, \sigma^2_w = 0.2^2)$,  (38)
$z_t = 5 \sin(x_t) + v_t$, $\quad v_t \sim \mathcal{N}(0, \sigma^2_v = 0.2^2)$,  (39)

which is a modified version of the model used in [18], [31]. The system is modified in two ways: First, Eq. (38) does not contain a purely time-dependent term in the system, which would not allow for learning stationary transition dynamics.
Second, we substituted a sinusoidal measurement function for the originally quadratic measurement function used by [18] and [31]. The sinusoidal measurement function increases the difficulty in computing the marginal distribution $p(z_t \mid z_{1:t-1})$ if the time update distribution $p(x_t \mid z_{1:t-1})$ is fairly uncertain: While the quadratic measurement function can only lead to bimodal distributions (assuming a Gaussian input distribution), the sinusoidal measurement function in Eq. (39) can lead to an arbitrary number of modes for a broad input distribution.

The prior variance was set to $\sigma^2_0 = 0.5^2$, i.e., the initial uncertainty was fairly high. The system and measurement noises (see Eqs. (38)–(39)) were relatively small considering the amplitudes of the system function and the measurement function. For the numerical analysis, a linear grid of mean values $(\mu^x_0)_i$, $i = 1, \dots, 100$, was defined in the interval $[-3, 3]$. Then, a single latent (initial) state $x^{(i)}_0$ was sampled from $p(x^{(i)}_0) = \mathcal{N}((\mu^x_0)_i, \sigma^2_0)$, $i = 1, \dots, 100$.

For the dynamic system in Eqs. (38)–(39), we analyzed the robustness in a single filter step of the EKF, the UKF, the CKF, an SIR PF (sequential importance resampling particle filter) with 200 particles, the GP-UKF, and the GP-ADF against the ground truth, closely approximated by the Gibbs-filter [19]. Compared to the evaluation of longer trajectories, evaluating a single filter step makes it easier to analyze the robustness of individual filtering algorithms.

Tab. I summarizes the expected performances (RMSE: root-mean-square error, MAE: mean-absolute error, NLL: negative log-likelihood) of the EKF, the UKF, the CKF, the GP-UKF, the GP-ADF, the Gibbs-filter, and the SIR PF for estimating the latent state $x$. The results in the table are based on averages over 1,000 test runs and 100 randomly sampled start states per test run (see experimental setup).
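The experimental setup just described can be reproduced with a few lines of NumPy; the sketch below is our own (the random seed and the use of a single transition/measurement step per start state follow the text, everything else is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_w, sigma_v, sigma_0 = 0.2, 0.2, 0.5

def step(x):
    """Transition dynamics, Eq. (38)."""
    return x / 2.0 + 25.0 * x / (1.0 + x ** 2) + sigma_w * rng.standard_normal()

def measure(x):
    """Sinusoidal measurement, Eq. (39)."""
    return 5.0 * np.sin(x) + sigma_v * rng.standard_normal()

# linear grid of 100 prior means on [-3, 3]; one sampled start state per mean
mu_grid = np.linspace(-3.0, 3.0, 100)
x0 = mu_grid + sigma_0 * rng.standard_normal(100)
x1 = np.array([step(x) for x in x0])    # single filter step, as in the text
z1 = np.array([measure(x) for x in x1])
```

Because the measurement amplitude is 5 and the noise is small, every observation lies close to $[-5, 5]$, while the latent state after one transition can be an order of magnitude larger; this mismatch is what makes the single filter step so revealing.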
The table also reports the 95% standard error of the expected performances. The ⋆ indicates a method developed in this paper. Tab. I indicates that the GP-ADF is the most robust filter and statistically significantly outperforms all filters but the sampling-based Gibbs-filter and the SIR PF. The green color highlights a near-optimal Gaussian filter (Gibbs-filter) and the near-optimal particle filter. Amongst all other filters, the GP-ADF is the Gaussian filter closest to the computationally expensive Gibbs-filter [19]. Note that the SIR PF is not a Gaussian filter and is able to express multi-modality in distributions. Therefore, its performance is typically better than that of Gaussian filters. The difference between the SIR PF and a near-optimal Gaussian filter, the Gibbs-filter, is expressed in Tab. I. The performance difference essentially depicts how much we lose by using a Gaussian filter instead of a particle filter. The NLL values for the SIR PF are obtained by moment-matching the particles.

The poor performance of the EKF is due to linearization errors. The filters based on small sample approximations of densities (UKF, GP-UKF, CKF) suffer from the degeneracy of these approximations, which is illustrated in Fig. 1. Note that the CKF uses a smaller set of cubature points than the UKF to determine predictive distributions, which makes the CKF statistically even less robust than the UKF.

B. Smoother Robustness

We consider a pendulum tracking example taken from [13]. We evaluate the performances of the following filters and smoothers: the EKF/EKS, the UKF/URTSS, the GP-UKF/GP-URTSS, the CKF/CKS, the Gibbs-filter/smoother, and the GP-ADF/GP-RTSS. The pendulum has mass $m = 1\,\mathrm{kg}$ and length $l = 1\,\mathrm{m}$. The state $x = [\dot{\varphi}, \varphi]^\top$ of the pendulum is given by the angle $\varphi$ (measured anti-clockwise from hanging down) and the angular velocity $\dot{\varphi}$. The pendulum can exert a constrained torque $u \in [-5, 5]\,\mathrm{Nm}$.
We assumed a frictionless system, such that the transition function $f$ is

$f(x_t, u_t) = \int_t^{t+\Delta t} \begin{bmatrix} \big(u(\tau) - 0.5\, m l g \sin \varphi(\tau)\big)\big/\big(0.25\, m l^2 + I\big) \\ \dot{\varphi}(\tau) \end{bmatrix} d\tau$,  (40)

where $I$ is the moment of inertia and $g$ the acceleration of gravity. Then, the successor state

$x_{t+1} = x_{t+\Delta t} = f(x_t, u_t) + w_t$  (41)

was computed using an ODE solver for Eq. (40) with a zero-order hold control signal $u(\tau)$. In Eq. (41), we set $\Sigma_w = \mathrm{diag}([0.5^2, 0.1^2])$. In our experiment, the torque was sampled randomly according to $u \sim \mathcal{U}[-5, 5]\,\mathrm{Nm}$ and implemented using a zero-order-hold controller. Every time increment $\Delta t = 0.2\,\mathrm{s}$, the state was measured according to

$z_t = \arctan\!\left(\frac{-1 - l \sin(\varphi_t)}{0.5 - l \cos(\varphi_t)}\right) + v_t$, $\quad \sigma^2_v = 0.05^2$.  (42)

TABLE I
Average filter performances (RMSE, MAE, NLL) with standard errors (95% confidence interval) and p-values testing the hypothesis that the other filters are better than the GP-ADF using a one-sided t-test.

Filter            | RMSE_x (p-value)              | MAE_x (p-value)               | NLL_x (p-value)
EKF [10]          | 3.62 ± 0.212 (p = 4.1×10^-2)  | 2.36 ± 0.176 (p = 0.38)       | 3.05×10^3 ± 3.02×10^2 (p < 10^-4)
UKF [11]          | 10.5 ± 1.08 (p < 10^-4)       | 8.58 ± 0.915 (p < 10^-4)      | 25.6 ± 3.39 (p < 10^-4)
CKF [15]          | 9.24 ± 1.13 (p = 2.8×10^-4)   | 7.31 ± 0.941 (p = 4.2×10^-4)  | 2.22×10^2 ± 17.5 (p < 10^-4)
GP-UKF [12]       | 5.36 ± 0.461 (p = 7.9×10^-4)  | 3.84 ± 0.352 (p = 3.3×10^-3)  | 6.02 ± 0.497 (p < 10^-4)
GP-ADF [13]       | 2.85 ± 0.174                  | 2.17 ± 0.151                  | 1.97 ± 6.55×10^-2
Gibbs-filter [19] | 2.82 ± 0.171 (p = 0.54)       | 2.12 ± 0.148 (p = 0.56)       | 1.96 ± 6.62×10^-2 (p = 0.55)
SIR PF            | 1.57 ± 7.66×10^-2 (p = 1.0)   | 0.36 ± 2.28×10^-2 (p = 1.0)   | 1.03 ± 7.30×10^-2 (p = 1.0)

[Fig. 1: Degeneracy of the unscented transformation (UT) underlying the UKF. (a) UKF time update $p(x_1 \mid \emptyset)$, which misses out substantial probability mass of the true predictive distribution. (b) UKF determination of $p(z_1 \mid \emptyset)$, which is too sensitive and cannot explain the actual measurement $z_1$ (black dot, left sub-figure). Input distributions to the UT are the Gaussians in the sub-figures at the bottom of each panel. The functions the UT is applied to are shown in the top right sub-figures, i.e., the transition mapping, Eq. (38), in Panel (a), and the measurement mapping, Eq. (39), in Panel (b). Sigma points are marked by red dots. The predictive distributions are shown in the left sub-figures of each panel. The true predictive distributions are the shaded areas; the UT predictive distributions are the solid Gaussians. The predictive distribution of the time update in Panel (a) equals the input distribution at the bottom of Panel (b).]

Note that the scalar measurement Eq. (42) solely depends on the angle. Thus, the full distribution of the latent state $x$ had to be reconstructed using the cross-correlation information between the angle and the angular velocity.

Trajectories of length $T = 6\,\mathrm{s} = 30$ time steps were started from a state sampled from the prior $p(x_0) = \mathcal{N}(\mu_0, \Sigma_0)$ with $\mu_0 = [0, 0]^\top$ and $\Sigma_0 = \mathrm{diag}([0.01^2, (\pi/16)^2])$. For each trajectory, GP models $GP_f$ and $GP_g$ are learned based on randomly generated data, using either 250 or 20 data points.

Tab. II reports the expected values of the NLL_x-measure for the EKF/EKS, the UKF/URTSS, the GP-UKF/GP-URTSS, the GP-ADF/GP-RTSS, and the CKF/CKS when tracking the pendulum over a horizon of $6\,\mathrm{s}$, averaged over 1,000 runs. As in the example in Sec.
IV-A, the NLL_x-measure emphasizes the robustness of our proposed method: The GP-RTSS is the only method that consistently reduced the negative log-likelihood value compared to the corresponding filtering algorithm. Increasing NLL_x-values (red color in Tab. II) occur when the filter distribution cannot explain the latent state/measurement, an example of which is given in Fig. 1(b). Even with only 20 training points, the GP-ADF/GP-RTSS outperform the commonly used EKF/EKS, UKF/URTSS, and CKF/CKS. We experimented with even smaller signal-to-noise ratios. The GP-RTSS remains robust, while the other smoothers become unstable.

V. DISCUSSION AND CONCLUSION

In this paper, we presented the GP-RTSS, an analytic Rauch-Tung-Striebel smoother for GP dynamic systems, where GPs with SE covariance functions are practical implementations of universal function approximators. We showed that the GP-RTSS is more robust to nonlinearities than state-of-the-art smoothers. There are two main reasons for this: First, the GP-RTSS relies neither on linearization (EKS) nor on density approximations (URTSS/CKS) to compute an optimal Gaussian approximation of the predictive distribution when mapping a Gaussian distribution through a nonlinear function. This property avoids incoherent estimates of the filtering and smoothing distributions, as discussed in Sec. IV-A. Second, GPs allow for more robust "system identification" than standard methods since they coherently represent uncertainties about the system and measurement functions at locations that have not been encountered in the data collection phase. The GP-RTSS is a robust smoother since it accounts for model uncertainties in a principled Bayesian way.

After training the GPs, which can be performed offline, the computational complexity of the GP-RTSS (including filtering) is $\mathcal{O}(T(E^3 + n^2(D^3 + E^3)))$ for a time series of length $T$.
Here, n is the size of the GP training sets, and D and E are the dimensions of the state and the measurements, respectively. The computational complexity is due to the inversion of the D- and E-dimensional covariance matrices, and to the computation of the matrix Q ∈ R^{n×n} in Eq. (28), required for each entry of a D- and E-dimensional covariance matrix. The computational complexity scales linearly with the number of time steps. The computational demand of classical Gaussian smoothers, such as the URTSS and the EKS, is O(T(D^3 + E^3)). Although not reported here, we verified the computational complexity experimentally.

TABLE II
EXPECTED FILTERING AND SMOOTHING PERFORMANCES (PENDULUM TRACKING) WITH 95% CONFIDENCE INTERVALS.

Filters     EKF [10]          UKF [11]     CKF [15]     GP-UKF_250 [12]  GP-ADF_250 [13]  GP-ADF_20 [13]
E[NLL_x]    1.6×10^2 ± 29.1   6.0 ± 3.02   28.5 ± 9.83  4.4 ± 1.32       1.44 ± 0.117     6.63 ± 0.149

Smoothers   EKS [10]          URTSS [16]   CKS [19]     GP-URTSS*_250    GP-RTSS*_250     GP-RTSS*_20
E[NLL_x]    3.3×10^2 ± 60.5   17.2 ± 10.0  72.0 ± 25.1  10.3 ± 3.85      1.04 ± 0.204     6.57 ± 0.148

Approximating the online computations of the GP-RTSS by numerical integration or grids scales poorly with increasing dimension. These problems already appear in the histogram filter [3]. By explicitly providing equations for the solution of the involved integrals, we show that numerical integration is not necessary and that the GP-RTSS is a practical approach to filtering in GP dynamic systems. Although the GP-RTSS is computationally more involved than the URTSS, the EKS, and the CKS, this does not necessarily imply that smoothing with the GP-RTSS is slower: function evaluations, which are heavily used by the EKS/CKS/URTSS, are not necessary in the GP-RTSS (after training). In the pendulum example, repeatedly calling the ODE solver caused the EKS/CKS/URTSS to be slower than the GP-RTSS (with 250 training points) by a factor of two.
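The NLL_x-measure underlying Tab. II is the negative log-likelihood of the true latent state under the Gaussian filter/smoother distribution N(μ, Σ). The following is a minimal sketch of how such a measure can be evaluated, not the authors' code; the function name `gaussian_nll` and the example numbers are illustrative assumptions.

```python
import numpy as np

def gaussian_nll(x_true, mean, cov):
    """Negative log-likelihood of the true latent state x_true under a
    Gaussian state estimate N(mean, cov). Low values indicate that the
    estimate explains the true state well."""
    d = x_true.size
    diff = x_true - mean
    # Solve a linear system instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# A well-calibrated estimate: the true state lies within the uncertainty.
x = np.array([0.1, -0.2])
nll_ok = gaussian_nll(x, mean=np.array([0.0, 0.0]), cov=0.1 * np.eye(2))

# An overconfident estimate that cannot explain the true state, the
# failure mode illustrated in Fig. 1(b): the NLL blows up.
nll_bad = gaussian_nll(x, mean=np.array([2.0, 2.0]), cov=1e-4 * np.eye(2))
assert nll_bad > nll_ok
```

Averaging such per-step values over trajectories and runs yields the E[NLL_x] entries reported in Tab. II; unlike a pure error measure, the NLL penalizes both inaccurate means and overconfident covariances.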
The increasing use of GPs for model learning in robotics and control will eventually require principled smoothing methods for GP models. To the best of our knowledge, the proposed GP-RTSS is the most principled GP smoother since all computations can be performed analytically exactly, i.e., without function linearization or sigma/cubature-point representations of densities, while exactly integrating out the model uncertainty induced by the GP distribution. Code will be made publicly available at http://mloss.org.

ACKNOWLEDGEMENTS

This work was partially supported by ONR MURI grant N00014-09-1-1052, by Intel Labs, and by DataPath, Inc.

REFERENCES

[1] B. D. O. Anderson and J. B. Moore, Optimal Filtering. Dover Publications, 2005.
[2] K. J. Åström, Introduction to Stochastic Control Theory. Dover Publications, Inc., 2006.
[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, 2005.
[4] S. T. Roweis and Z. Ghahramani, Kalman Filtering and Neural Networks. Wiley, 2001, ch. Learning Nonlinear Dynamical Systems using the EM Algorithm, pp. 175–220.
[5] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME—Journal of Basic Engineering, vol. 82, pp. 35–45, 1960.
[6] H. E. Rauch, F. Tung, and C. T. Striebel, "Maximum Likelihood Estimates of Linear Dynamical Systems," AIAA Journal, vol. 3, pp. 1445–1450, 1965.
[7] S. Roweis and Z. Ghahramani, "A Unifying Review of Linear Gaussian Models," Neural Computation, vol. 11, no. 2, pp. 305–345, 1999.
[8] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor Graphs and the Sum-Product Algorithm," IEEE Transactions on Information Theory, vol. 47, pp. 498–519, 2001.
[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[10] P. S. Maybeck, Stochastic Models, Estimation, and Control. Academic Press, Inc., 1979, vol. 141.
[11] S. J. Julier and J. K.
Uhlmann, "Unscented Filtering and Nonlinear Estimation," Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, 2004.
[12] J. Ko and D. Fox, "GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models," Autonomous Robots, vol. 27, no. 1, pp. 75–90, 2009.
[13] M. P. Deisenroth, M. F. Huber, and U. D. Hanebeck, "Analytic Moment-based Gaussian Process Filtering," in International Conference on Machine Learning, 2009, pp. 225–232.
[14] U. D. Hanebeck, "Optimal Filtering of Nonlinear Systems Based on Pseudo Gaussian Densities," in Symposium on System Identification, 2003, pp. 331–336.
[15] I. Arasaratnam and S. Haykin, "Cubature Kalman Filters," IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1254–1269, 2009.
[16] S. Särkkä, "Unscented Rauch-Tung-Striebel Smoother," IEEE Transactions on Automatic Control, vol. 53, no. 3, pp. 845–849, 2008.
[17] S. J. Godsill, A. Doucet, and M. West, "Monte Carlo Smoothing for Nonlinear Time Series," Journal of the American Statistical Association, vol. 99, no. 465, pp. 438–449, 2004.
[18] G. Kitagawa, "Monte Carlo Filter and Smoother for Non-Gaussian Nonlinear State Space Models," Journal of Computational and Graphical Statistics, vol. 5, no. 1, pp. 1–25, 1996.
[19] M. P. Deisenroth and H. Ohlsson, "A General Perspective on Gaussian Filtering and Smoothing: Explaining Current and Deriving New Algorithms," in American Control Conference, 2011.
[20] D. J. C. MacKay, "Introduction to Gaussian Processes," in Neural Networks and Machine Learning. Springer, 1998, vol. 168, pp. 133–165.
[21] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[22] D. Nguyen-Tuong, M. Seeger, and J. Peters, "Local Gaussian Process Regression for Real Time Online Model Learning," in Advances in Neural Information Processing Systems, 2009, pp. 1193–1200.
[23] R. Murray-Smith, D. Sbarbaro, C. E. Rasmussen, and A.
Girard, "Adaptive, Cautious, Predictive Control with Gaussian Process Priors," in Symposium on System Identification, 2003.
[24] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and B. Likar, "Predictive Control with Gaussian Process Models," in IEEE Region 8 Eurocon 2003: Computer as a Tool, 2003, pp. 352–356.
[25] M. P. Deisenroth, C. E. Rasmussen, and D. Fox, "Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning," in Robotics: Science & Systems, 2011.
[26] M. P. Deisenroth and C. E. Rasmussen, "PILCO: A Model-Based and Data-Efficient Approach to Policy Search," in International Conference on Machine Learning, 2011.
[27] C. G. Atkeson and J. C. Santamaría, "A Comparison of Direct and Model-Based Reinforcement Learning," in International Conference on Robotics and Automation, 1997.
[28] J. Kern, "Bayesian Process-Convolution Approaches to Specifying Spatial Dependence Structure," Ph.D. dissertation, Institute of Statistics and Decision Sciences, Duke University, 2000.
[29] J. Quiñonero-Candela, A. Girard, J. Larsen, and C. E. Rasmussen, "Propagation of Uncertainty in Bayesian Kernel Models—Application to Multiple-Step Ahead Forecasting," in International Conference on Acoustics, Speech and Signal Processing, 2003, pp. 701–704.
[30] M. P. Deisenroth, Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing, 2010, vol. 9.
[31] A. Doucet, S. J. Godsill, and C. Andrieu, "On Sequential Monte Carlo Sampling Methods for Bayesian Filtering," Statistics and Computing, vol. 10, pp. 197–208, 2000.