Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes

Roger Frigola and Carl Edward Rasmussen

The authors are with the Department of Engineering, University of Cambridge, UK, {rf342,cer54}@cam.ac.uk. All the results presented in this paper can be reproduced without any tuning by using the code for GP-FNARX available from the first author's website: http://mlg.eng.cam.ac.uk/roger

Abstract— We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make GP-FNARX a good candidate for applications in robotics and adaptive control.

I. INTRODUCTION

System identification consists of building mathematical models of dynamical systems based on observed input and output signals. Those signals are often contaminated by noise and do not necessarily explore the whole operating range of the system. Yet, system identification methods are required to find a model that faithfully explains the observed data and has graceful generalization capabilities.

In this paper we are particularly concerned with identifying models of nonlinear systems when only a limited amount of data is available. Data could be scarce due to the high cost of experimentation, the vast operating range of the system, the presence of closed-loop control or many other reasons. As a consequence, we will require a model that can accurately describe the system wherever data is present but which can also report its own ignorance in regions where no data can support its claims. This naturally leads to the use of a Bayesian framework for the identification of the system's dynamics [1].

A very common and general representation of a dynamical system is in terms of a state space model. The future behavior of the system is fully described by its current state $z$ and external input $u$. In statistical terms, future states are conditionally independent of past states and inputs given the current ones:

$$p(z_{t+1} \mid z_t, z_{t-1}, \ldots, u_t, u_{t-1}, \ldots) = p(z_{t+1} \mid z_t, u_t). \qquad (1)$$

A system with this characteristic is said to have the Markov property.

The state of a system is often not perfectly observed. An observation model is usually defined as a probability density of an observation $y_t$ conditioned on the state at the same time step: $p(y_t \mid z_t)$. In the particular case where a nonlinear state space model has additive independent Gaussian noise in both the state transition and the observation, the model can be presented as

$$z_{t+1} = f(z_t, u_t) + \delta_t, \quad \delta_t \sim \mathcal{N}(0, Q), \qquad (2)$$
$$y_t = g(z_t) + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, R). \qquad (3)$$

For time-invariant systems, $f$ and $g$ are fixed deterministic functions, although both the state transition (2) and the observation model (3) are stochastic difference equations.
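To make the generative reading of equations (2)-(3) concrete, the following minimal sketch simulates a hypothetical one-dimensional nonlinear state space model with additive Gaussian process and observation noise. It is written in Python for illustration only; the particular choices of $f$ and $g$ are invented and not taken from the paper.

```python
import numpy as np

def simulate_ssm(T, Q=0.01, R=0.1, seed=0):
    """Simulate z_{t+1} = f(z_t, u_t) + delta_t and y_t = g(z_t) + eps_t (eqs. 2-3)
    for an illustrative choice of f and g."""
    rng = np.random.default_rng(seed)
    u = np.sin(0.1 * np.arange(T))       # arbitrary exogenous input signal
    z = np.zeros(T)
    y = np.zeros(T)
    for t in range(T):
        y[t] = np.tanh(z[t]) + rng.normal(0.0, np.sqrt(R))   # observation model g, noise variance R
        if t + 1 < T:
            z[t + 1] = 0.9 * z[t] + u[t] + rng.normal(0.0, np.sqrt(Q))  # transition f, noise variance Q
    return u, z, y

u, z, y = simulate_ssm(500)
```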
In statistical terms, the system identification task can be described as finding the state transition probability conditioned on the observed inputs and outputs,

$$p(z_{t+1} \mid z_t, u_t, u_{1:N}, y_{1:N}), \quad t > N. \qquad (4)$$

The fact that the states are not observed makes this distribution exceedingly difficult to compute. In particular, if one wants to obtain a model that is able to report its own uncertainty (i.e. a Bayesian model), there exist no closed-form solutions even for the simplest linear-Gaussian state space models [2], [3].

A popular alternative to state space models can be found in autoregressive exogenous (ARX) models [4]. These models represent the system based only on observable quantities: the inputs and the outputs. A nonlinear ARX (NARX) model with Gaussian innovations provides an estimate of the next output based on a finite number of previous inputs and outputs:

$$y_t = f(y_{t-1}, y_{t-2}, \ldots, u_{t-1}, \ldots) + \delta_t, \quad \delta_t \sim \mathcal{N}(0, Q). \qquad (5)$$

In contrast with state space models, there are no latent states and the model does not have the Markov property. In practical terms, this representation based only on observables makes it possible to sidestep many of the problems associated with system identification when latent states are present.

An important drawback of autoregressive models is that they do not account for observation noise. If we consider equation (5) in a generative manner by assuming a known $f$ and sequentially generating new values of $y_t, y_{t+1}, \ldots$, it is apparent that all randomness injected through the innovation $\delta$ has an influence on the future trajectory of the system. In other words, the innovation can be considered a form of process noise. The model does not accommodate any kind of observation noise such as the one described in equation (3) for the case of state space models.

A common way to deal with observation noise is to pre-process the measured input and output data to remove as much noise as possible before identifying the NARX model [4]. For instance, low-pass filtering the signals can remove a significant component of their noise. However, for optimal performance, the filter needs to be tuned to the particular characteristics of the noise.

We propose an automated approach which simultaneously performs filter tuning and learns a Bayesian model of the system's dynamics. Noise modeling is performed in a completely data-driven fashion by optimizing the marginal likelihood of the probabilistic regression model.

In section II we compare our approach to related work. Section III presents an overview of the field of Bayesian system identification together with a brief introduction to Gaussian processes. Then, in section IV, we describe our approach to system identification, integrating data pre-processing with learning the system's dynamics. We show experimental results in section V and, finally, in section VI we make concluding remarks.

II. RELATED WORK

Kocijan et al. [5] presented a NARX model where the nonlinearity is captured using Gaussian process regression. More recently, Gutjahr et al. [6] have proposed the use of FITC in order to make computation more efficient in this kind of model. Our model follows this approach but also considers that the input/output signals may be contaminated by noise, which presents an errors-in-variables regression problem.
We mitigate the errors-in-variables problem by incorporating into system identification the signal pre-processing step that is usually carried out manually [4], [7]. For instance, a portion of the noise can be removed by low-pass filtering the signals in the time domain.

More sophisticated approaches combining smoothing and system identification for state space models have the potential to give better results if one is ready to accept their additional complexity [8], [9]. In contrast, our method should be seen as a pragmatic approach that attempts to offer a better solution than standard NARX models with a simple procedure and a limited amount of computational effort. Thus, it is particularly well suited to large datasets.

III. BACKGROUND THEORY

A. Bayesian System Identification

The goal of Bayesian system identification is to obtain models of dynamical systems that quantify the degree of uncertainty in the system's dynamics [1]. Modeling uncertainty can be particularly useful in applications such as robotics or adaptive control, where decisions about future control actions will typically depend on the confidence that the agent has about the system dynamics.

Bayesian system identification relies on the use of Bayesian statistics and hence interprets probabilities as degrees of belief. Such an approach to probabilistic modeling has been very successful in the fields of machine learning and artificial intelligence [10]. We will borrow heavily from advances in those fields to model uncertainty in a principled manner. In particular, we will solve the nonlinear regression task arising from a NARX model with Bayesian nonlinear regression implemented with Gaussian processes.

B. Gaussian Process Regression

Gaussian processes are stochastic processes that have proven very successful at the task of nonlinear regression. Their unbounded flexibility makes them ideal whenever it is hard to specify a parametric form for an unknown nonlinear function.

Formally, a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution [11]. We say that a random function $f(x) \in \mathbb{R}$, with $x \in \mathbb{R}^n$, follows a Gaussian process and write it as

$$f(x) \sim \mathcal{GP}(m(x), k(x, x')) \qquad (6)$$

when all values of $f$ at any locations $x$ are jointly normally distributed. $m(x)$ and $k(x, x')$ represent the mean function and the covariance function respectively. Together, they completely specify the Gaussian process and can be parametrized by a, typically small, set of hyper-parameters. Those hyper-parameters define the kind of functions that one expects to see. For instance, they control the degree of smoothness or periodicity but do not constrain the function to have any predefined shape.

In Gaussian process regression, a likelihood function $p(y \mid f)$ relates the vector of real-valued observations $y = (y_1, \ldots, y_N)$ with the latent function $f$. In the particular case where observations are contaminated with additive i.i.d. Gaussian noise,

$$p(y_i \mid f_i) = \mathcal{N}(y_i \mid f_i, \sigma^2), \qquad (7)$$

there exists a convenient closed-form solution for the posterior distribution over functions, i.e. the distribution over the unknown function $f(x)$ after incorporating all the observed data. The posterior distribution of the function at any new location $x_*$ is described by

$$p(f_* \mid x_*, y, X) = \mathcal{N}(\bar{f}_*, \mathrm{cov}(f_*)) \qquad (8)$$

where $X = \{x_1, \ldots, x_N\}$ is the set of regressor vectors.
In the common case where $m(x) = 0$ we obtain

$$\bar{f}_* = K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1} y, \qquad (9)$$
$$\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1} K(X, X_*), \qquad (10)$$

where the covariance matrices $K$ are constructed by applying the covariance function to all elements in each of the sets that they take as arguments:

$$K(A, B)_{ij} = k(a_i, b_j). \qquad (11)$$

In Figure 1 we show an example of Gaussian process regression. Given a set of seven noisy data points, the goal is to infer the latent function that generated them. We do not assume any parametric form for this latent function. Our assumption is that the latent function was a draw from a Gaussian process with zero mean and a squared exponential covariance function [11]. To obtain the posterior distribution over the latent function we use equations (9) and (10). Note how the confidence region widens in areas that are far from the data points.

[Fig. 1. Gaussian process regression. Black dots represent the data. The red line is the mean of the Gaussian process prediction and the area shaded in red covers the 95% confidence region. The three black lines are random functions drawn from the posterior that could have generated the data.]

C. Model Selection for Gaussian Processes

In the Gaussian process framework, model selection refers to using data to select the functional form of the mean and covariance functions as well as selecting the values of their hyper-parameters. In addition, the likelihood function may also have some hyper-parameters, such as the noise variance $\sigma^2$ in equation (7). For convenience we include the likelihood hyper-parameters in the vector $\theta$.

For Gaussian process models it is very attractive, from a computational and theoretical point of view, to use Bayesian model selection procedures. In particular, maximization of the marginal likelihood of the Gaussian process model has proven to be very effective [11]. The objective is to find

$$\theta_{\mathrm{Opt}} = \arg\max_{\theta}\; p(y \mid X, \theta), \qquad (12)$$

where

$$p(y \mid X, \theta) = \int p(y \mid f, \theta)\, p(f \mid X, \theta)\, \mathrm{d}f \qquad (13)$$

is the marginal likelihood. The first term inside the integral is the likelihood function and the second term is a normal distribution that can be obtained directly from the mean and covariance functions that specify the Gaussian process. For the particular case of Gaussian processes with zero mean and a Gaussian likelihood, such as in eq. (7), we obtain

$$p(y \mid X, \theta) = \mathcal{N}(0, K_{\theta}(X, X) + \sigma^2 I). \qquad (14)$$

To avoid underflow errors, it is common to work with the logarithm of the marginal likelihood rather than the marginal likelihood itself. Since the logarithm is strictly monotonically increasing, this transformation does not change the location of the optimum. Taking logarithms we obtain

$$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^{\top} K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log 2\pi, \qquad (15)$$

where $K = K_{\theta}(X, X) + \sigma^2 I$. For optimization, it is useful to have the derivative of the marginal likelihood with respect to any of the parameters of interest:

$$\frac{\partial}{\partial \theta_j} \log p(y \mid X(\omega), \theta) = \tfrac{1}{2}\, y^{\top} K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y - \tfrac{1}{2}\, \mathrm{tr}\!\left(K^{-1} \frac{\partial K}{\partial \theta_j}\right). \qquad (16)$$

The marginal likelihood embodies what is usually described as the Bayesian Occam's razor: an automatic balancing of model fit and model complexity [11]. In contrast with maximum likelihood estimation, the marginal likelihood is found by integrating over the latent values of the function $f(x)$.
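As an illustration of equations (9), (10) and (15), the following Python/NumPy sketch computes the GP posterior at test inputs and the log marginal likelihood for given hyper-parameters. It assumes a squared exponential covariance and inputs stored as rows of 2-D arrays; it is a minimal sketch, not the authors' published code.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sq_exp(A, B, ell, sf2):
    """Squared exponential covariance k(a, b) = sf2 * exp(-|a - b|^2 / (2 ell^2))."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior_and_lml(X, y, Xs, ell, sf2, sn2):
    """Posterior mean/covariance (eqs. 9-10) and log marginal likelihood (eq. 15).

    X: (N, D) training regressors, y: (N,) targets, Xs: (N*, D) test inputs."""
    N = X.shape[0]
    K = sq_exp(X, X, ell, sf2) + sn2 * np.eye(N)      # K_theta(X, X) + sigma^2 I
    L = cho_factor(K, lower=True)
    alpha = cho_solve(L, y)                            # K^{-1} y
    Ks = sq_exp(Xs, X, ell, sf2)                       # K(X*, X)
    mean = Ks @ alpha                                  # eq. (9)
    cov = sq_exp(Xs, Xs, ell, sf2) - Ks @ cho_solve(L, Ks.T)   # eq. (10)
    lml = (-0.5 * y @ alpha
           - np.sum(np.log(np.diag(L[0])))             # -1/2 log|K| via the Cholesky factor
           - 0.5 * N * np.log(2 * np.pi))              # eq. (15)
    return mean, cov, lml
```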
The computational complexity of evaluating the marginal likelihood in eq. (14) is $\mathcal{O}(N^3)$ due to the inversion of the covariance matrix in the density of a multivariate Gaussian distribution. This cubic complexity effectively limits the approach to values of $N$ no greater than a few thousand. However, there is a mitigating factor: once $K^{-1}$ has been computed for eq. (15), each derivative of the marginal likelihood with respect to the hyper-parameters has a computational complexity of $\mathcal{O}(N^2)$. This is a particular feature of the marginal likelihood that is useful for efficient gradient-based optimization.

D. Sparse Gaussian Process Regression

Several approaches have been proposed to accelerate Gaussian processes for very large data sets. Those approaches aim at reducing the particularly disadvantageous $\mathcal{O}(N^3)$ cost of training the hyper-parameters, but also target a reduction of the $\mathcal{O}(N)$ cost necessary to evaluate the mean of a posterior GP (eq. 9) and the $\mathcal{O}(N^2)$ cost of computing its posterior covariance (eq. 10). A common strategy is to introduce a new set of $M$ latent variables $\bar{f}$ at input locations $\bar{X}$, where $M < N$ [12]. The intuition behind these methods is that the new set of latent variables incorporates much of the information present in the original, large dataset.

A particularly effective sparse GP method named FITC, originally SPGP, was proposed by Snelson and Ghahramani [13]. Due to space constraints we avoid a complete re-derivation of the FITC method and simply state its predictive distribution given a dataset $\{y, X\}$ and the locations of the inducing inputs $\bar{X}$:

$$p(f_* \mid x_*, y, X, \bar{X}) = \mathcal{N}(\mu_*, \sigma^2_*), \qquad (17)$$

where

$$\mu_* = K_*^{\top} Q_M^{-1} K_{MN} (\Lambda + \sigma^2 I)^{-1} y, \qquad (18)$$
$$\sigma^2_* = K_{**} - K_*^{\top} (K_M^{-1} - Q_M^{-1}) K_*, \qquad (19)$$

and

$$Q_M = K_M + K_{MN} (\Lambda + \sigma^2 I)^{-1} K_{NM}, \qquad (20)$$
$$\Lambda = \mathrm{diag}(\lambda), \quad \lambda_n = K_{nn} - K_n^{\top} K_M^{-1} K_n, \qquad (21)$$

where the matrices $K$ are constructed using the covariance function as shown in eq. (11) and we follow the convention $K_{AB} = K(X_A, X_B)$ and $K_A = K(X_A, X_A)$.

IV. GP-FNARX MODEL

A. Overview

In this section we describe the GP-FNARX model for nonlinear system identification, based on a nonlinear autoregressive exogenous (NARX) model with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). The approach integrates pre-processing (e.g. filtering) with system identification into a fully automated procedure that yields a Bayesian nonparametric model.

The key contribution of this article is the integration of the data pre-processing step with the optimization of the sparse GP hyper-parameters. Training is performed by simultaneously maximizing the marginal likelihood of the probabilistic regression model with respect to the pre-processing parameters and the GP hyper-parameters.

Autoregressive models with exogenous inputs attempt to predict future outputs by considering them a function of past inputs and outputs of the system:

$$y_t = f(y_{t-1}, y_{t-2}, \ldots, u_{t-1}, \ldots) + \delta_t. \qquad (22)$$

The identification problem is then posed as a regression task. In other words, we want to infer the function $y = f(x)$ from a finite number of examples $\{y_i, x_i\}$, where $x_i = (y_{i-1}, y_{i-2}, \ldots, u_{i-1}, \ldots)$. In our approach, we place a GP prior on the function $f(x)$ and use sparse GP regression techniques to infer it from observed data. If the input and output signals of the dynamical system are noisy, we face an errors-in-variables problem, since the regressors $x$ are noisy.
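For concreteness, here is a small sketch of how the lagged regressors $x_i = (y_{i-1}, \ldots, y_{i-n_y}, u_{i-1}, \ldots, u_{i-n_u})$ of eq. (22) can be assembled into a design matrix. This is a generic construction written in Python for illustration, not code taken from the paper, and it assumes single-channel input and output signals.

```python
import numpy as np

def narx_regressors(y, u, ny, nu):
    """Build NARX targets and lagged regressors from output y and input u.

    Row i of X contains (y[i-1], ..., y[i-ny], u[i-1], ..., u[i-nu]) and the
    corresponding target is y[i], as in eq. (22)."""
    lag = max(ny, nu)
    rows = []
    for i in range(lag, len(y)):
        rows.append(np.concatenate([y[i - ny:i][::-1], u[i - nu:i][::-1]]))
    return np.array(rows), y[lag:]

# e.g. X, t = narx_regressors(y, u, ny=10, nu=10)   # order 10, as used later in Sec. V
```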
Noise in the regressors makes the regression problem particularly hard. This is one of the reasons why input signals are normally pre-processed before trying to identify a model of the system dynamics. For instance, one can carefully low-pass filter the signals to remove high-frequency noise irrelevant to the identification task at hand.

Since we are looking for a method that avoids having a human in the loop, we take a data-based approach which consists of parameterizing the data pre-processing stage and optimizing its parameters jointly with the hyper-parameters of the Gaussian process regression. We will consider any data pre-processing function applied to the input and output signals,

$$(\hat{y}, \hat{u}) = h(y, u, \omega), \qquad (23)$$

where the pre-processed signals vary smoothly with respect to a vector of pre-processing parameters $\omega$. This smoothness condition is imposed in order to obtain a probabilistic model with a differentiable marginal likelihood. We can rephrase the autoregressive model in terms of the pre-processed regressors:

$$y_t = f(\hat{y}_{t-1}, \hat{y}_{t-2}, \ldots, \hat{u}_{t-1}, \ldots). \qquad (24)$$

Note that the left-hand side term is not pre-processed.

B. Optimization of the Marginal Likelihood

In section III-C we described how the marginal likelihood provides a very powerful metric for model selection due to its ability to automatically trade off model fit and model complexity in a principled manner. Our goal here is to maximize the marginal likelihood of the Gaussian process regression model with respect to the signal pre-processing parameters and also, simultaneously, with respect to the hyper-parameters of the GP. For convenience, we introduce $\psi = (\omega, \theta)$, grouping the two kinds of parameters.

We employ hill-climbing optimization to maximize the marginal likelihood (or, equivalently, its logarithm). For notational simplicity, we group all the pre-processed regressors into a matrix $\hat{X} = \hat{X}(y, u, \omega)$ so that the log marginal likelihood becomes

$$\log p(y \mid \hat{X}(\omega), \theta). \qquad (25)$$

Its derivatives with respect to the GP hyper-parameters,

$$\frac{\partial}{\partial \theta_j} \log p(y \mid \hat{X}(\omega), \theta), \qquad (26)$$

are straightforward since we typically choose differentiable covariance functions. However, the derivatives with respect to any of the pre-processing parameters,

$$\frac{\partial}{\partial \omega_k} \log p(y \mid \hat{X}(\omega), \theta), \qquad (27)$$

can be more difficult to compute. From eq. (16) it is apparent that $\partial K / \partial \omega_k$ needs to be computed. We can write the derivative of a single element of the covariance matrix as

$$\frac{\partial K_{ij}}{\partial \omega_k} = \frac{\partial k(\hat{x}_i, \hat{x}_j)}{\partial \omega_k} = \frac{\partial k}{\partial \hat{x}_i}\frac{\partial \hat{x}_i}{\partial \omega_k} + \frac{\partial k}{\partial \hat{x}_j}\frac{\partial \hat{x}_j}{\partial \omega_k}, \qquad (28)$$

where the derivatives with respect to the regressors are straightforward to compute when using smooth covariance functions. However, the derivatives of the regressors with respect to the pre-processing parameters may be hard to compute. In any case, if the pre-processing function is smooth, the derivatives $\partial \hat{x} / \partial \omega_k$ can be approximated numerically by finite differences at the cost of one extra evaluation of the pre-processing function per dimension of $\omega$.

C. Sparse GPs for Computational Speed

Computing the marginal likelihood for datasets with more than a few thousand points becomes computationally expensive. For larger datasets we adopt a pragmatic strategy whereby the marginal likelihood is maximized by employing only a subset of the data.
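The following sketch illustrates the joint objective of Section IV-B evaluated on a subset of the data: the pre-processing function $h$ is taken to be a zero-phase second-order Butterworth low-pass filter (the choice used later in Section V), the regressors are rebuilt from the filtered signals, and the negative log marginal likelihood is minimized over both the cut-off frequency and the GP hyper-parameters with a generic derivative-free optimizer. It is an illustrative Python/SciPy sketch rather than the authors' implementation; `narx_regressors` and `gp_posterior_and_lml` refer to the earlier sketches, and the parameterization of $\psi$ is an assumption made here for convenience.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from scipy.optimize import minimize

def preprocess(y, u, omega):
    """h(y, u, omega): zero-phase 2nd-order Butterworth low-pass filter (eq. 23).
    omega is the normalized cut-off frequency in (0, 1)."""
    b, a = butter(2, omega)
    return filtfilt(b, a, y), filtfilt(b, a, u)

def neg_log_marglik(psi, y, u, ny=10, nu=10, subset=512):
    """psi = (logit cut-off, log ell, log sf2, log sn2); eq. (25) on a data subset."""
    omega = 1.0 / (1.0 + np.exp(-psi[0]))        # keep the cut-off frequency in (0, 1)
    ell, sf2, sn2 = np.exp(psi[1:])
    y_f, u_f = preprocess(y, u, omega)
    X, _ = narx_regressors(y_f, u_f, ny, nu)     # regressors built from filtered signals
    t = y[max(ny, nu):]                          # eq. (24): the targets stay unfiltered
    idx = np.random.default_rng(0).choice(len(t), min(subset, len(t)), replace=False)
    _, _, lml = gp_posterior_and_lml(X[idx], t[idx], X[:1], ell, sf2, sn2)  # dummy test point
    return -lml

# psi_opt = minimize(neg_log_marglik, x0=np.zeros(4), args=(y, u),
#                    method="Nelder-Mead").x     # finite-difference gradients also work
```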
Once the GP hyper-parameters and the data pre-processing parameters have been set, we use a FITC sparse Gaussian process [12] to obtain a model of the system dynamics that uses the whole dataset to make predictions. Equation (17) shows the expression for FITC predictions. After all possible pre-computations, the computational complexity is $\mathcal{O}(M)$ for the mean of the predictive distribution and $\mathcal{O}(M^2)$ for its variance. The computational efficiency comes from having condensed the original dataset of $N$ points into a smaller set of $M$ inducing points.

[Fig. 3. Root mean squared prediction error (RMSE [V]) on the test set as a function of the signal-to-noise ratio (SNR [dB]) in the output signal, for the Silverbox benchmark and the Wiener-Hammerstein benchmark. The mean computation time and standard deviation for each method are displayed in the legend (in seconds). Silverbox: GP-FNARX (SoD) 23 ± 2, GP-FNARX (FITC) 23 ± 2, GP-NARX (SoD) 14 ± 1, GP-NARX (FITC) 14 ± 1, wavenet nlarx 5 ± 1, sigmoidnet nlarx 83 ± 9, treepartition nlarx 7 ± 0, wavenet nlarx (filt) 6 ± 2, sigmoidnet nlarx (filt) 74 ± 11, treepartition nlarx (filt) 7 ± 0. Wiener-Hammerstein: GP-FNARX (SoD) 25 ± 2, GP-FNARX (FITC) 25 ± 2, GP-NARX (SoD) 16 ± 1, GP-NARX (FITC) 16 ± 1, wavenet nlarx 7 ± 3, sigmoidnet nlarx 85 ± 12, treepartition nlarx 8 ± 0, wavenet nlarx (filt) 5 ± 1, sigmoidnet nlarx (filt) 85 ± 8, treepartition nlarx (filt) 8 ± 0. Experiments are repeated 10 times; the marker is positioned at the median value and error bars indicate the 10-90 percentile interval.]

Inputs: $I = \{$output signals $y = (y_1, \ldots, y_N)$, input signals $u = (u_1, \ldots, u_N)$, model order vector $\eta\}$
1. $\psi_0 \leftarrow$ InitialGuess($I$)
2. $\psi_{\mathrm{Opt}} \leftarrow$ MaximizeMargLikelihood($\psi_0$, $I$)
3. model $\leftarrow$ ComputeFITCPredictor($\psi_{\mathrm{Opt}}$, $I$)

Fig. 2. High-level pseudo-code for the GP-FNARX algorithm.

D. Summary of the Proposed Methodology

In Fig. 2 we present a high-level overview of the GP-FNARX algorithm.

The first step consists of providing an initial guess for the unknown hyper-parameters. We have found that, in the absence of any problem-specific knowledge that could guide the initial guess, a successful data-based heuristic is to run a few steps of optimization of the marginal likelihood with respect to the GP hyper-parameters followed by a few steps of optimization with respect to the pre-processing parameters. By running these two steps on a subset of the data, the algorithm can rapidly home in on the right order of magnitude for the unknown parameters.

The second step consists of a straightforward joint optimization of the GP hyper-parameters and the pre-processing parameters. This step can be performed either on a subset of the data (SoD) with a conventional GP or on the full dataset with a FITC sparse GP [12]. The FITC approach is theoretically more sound, but very good and fast results can also be obtained by optimizing the marginal likelihood computed on a subset of the data.

In the third and final step, a large part of the FITC predictor equations (18, 19) is pre-computed in order to obtain a model that is able to provide predictive means in $\mathcal{O}(M)$ and predictive variances in $\mathcal{O}(M^2)$.
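To make the pre-computation in this third step concrete, here is a minimal NumPy sketch of the FITC predictive equations (18)-(21): the $M \times M$ quantities are computed once, after which each predictive mean costs $\mathcal{O}(M)$ and each predictive variance $\mathcal{O}(M^2)$. It reuses the `sq_exp` covariance from the earlier sketch and is an illustration under those assumptions, not the authors' code.

```python
import numpy as np

def fitc_precompute(X, y, Xbar, ell, sf2, sn2):
    """Pre-compute the FITC quantities needed by eqs. (18)-(19).

    X: (N, D) training regressors, y: (N,) targets, Xbar: (M, D) inducing inputs."""
    KM = sq_exp(Xbar, Xbar, ell, sf2) + 1e-8 * np.eye(len(Xbar))   # K_M with jitter
    KMN = sq_exp(Xbar, X, ell, sf2)                                 # K_MN
    KM_inv = np.linalg.inv(KM)
    lam = sf2 - np.sum(KMN * (KM_inv @ KMN), axis=0)   # eq. (21): K_nn - k_n^T K_M^{-1} k_n
    gamma = lam + sn2                                   # diagonal of (Lambda + sigma^2 I)
    QM = KM + (KMN / gamma) @ KMN.T                     # eq. (20)
    QM_inv = np.linalg.inv(QM)
    alpha = QM_inv @ (KMN @ (y / gamma))                # reused by eq. (18)
    B = KM_inv - QM_inv                                 # reused by eq. (19)
    return alpha, B

def fitc_predict(xs, Xbar, alpha, B, ell, sf2):
    """Predictive mean, O(M), and variance, O(M^2), for one test point (eqs. 18-19)."""
    ks = sq_exp(Xbar, xs[None, :], ell, sf2)[:, 0]      # K_* = k(Xbar, x_*)
    mean = ks @ alpha
    var = sf2 - ks @ (B @ ks)                           # K_** = sf2 for the SE covariance
    return mean, var
```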
As an aside, we want to highlight that the choice of orders of the autoregressive model is not critical to the performance of the algorithm provided that two conditions are met: i) the order is chosen to be higher than the optimal order, and ii) the automatic relevance determination (ARD) covariance function is chosen for the GP. This is due to the fact that the Bayesian Occam's razor embodied by the marginal likelihood is able to automatically disable regressors that are irrelevant to the task at hand. In our experiments we verified that adding hundreds of regressors on a problem of low order did not cause any overfitting and only represented a computation time penalty.

V. EXPERIMENTAL EVALUATION

In this section we present an experimental evaluation of the proposed system identification method. We have used data from two nonlinear system identification benchmarks based on electric circuits: i) the Silverbox benchmark originating from the 2004 IFAC Symposium on Nonlinear Control Systems (NOLCOS), and ii) the SYSID09 Wiener-Hammerstein system identification benchmark [14], which has been the object of a special section in the November 2012 issue of the "Control Engineering Practice" journal [15].

Both datasets have more than $10^5$ data points and are corrupted by a very small amount of noise. For instance, the authors of the Wiener-Hammerstein dataset estimate its signal-to-noise ratio to be 70 dB. Since we are attempting to demonstrate the ability of our method to cope with measurement noise, we inject different amounts of synthetic i.i.d. Gaussian noise into the output signals to make the identification task more challenging. Having original signals with little noise is convenient for testing the quality of the resulting models.

Very good algorithms have been tailored to the specific benchmarks at hand. However, since our goal is to provide an automatic method for system identification, we have considered that the best comparison would be against other off-the-shelf alternatives. We have compared our model with several NARX models available in the Matlab System Identification toolbox [7]. Following this spirit, we have avoided any tuning of our method to the particular benchmarks presented in this section. For instance, by using knowledge about the underlying system of the Silverbox benchmark, we obtained better performance by adding cubic regressors to our NARX model. However, we have not used those custom regressors when reporting the performance of our model.

Regarding the pre-processing step, we have chosen a simple zero-phase second-order Butterworth low-pass filter. In this case, the filter parameter $\omega$ represents the cut-off frequency of the filter.

In Figure 3 we have plotted the prediction errors on both benchmarks for a number of different models and signal-to-(synthetic-)noise ratios. We have tested the following models:

• GP-FNARX (SoD): the model presented in this paper using a subset of 512 randomly chosen points (subset of data approximation, SoD [11]).
• GP-FNARX (FITC): the model presented in this paper using a FITC sparse GP [12] with M = 512 inducing points chosen at random.
• GP-NARX (SoD): same as GP-FNARX (SoD) but with no pre-filtering of the signals.
• GP-NARX (FITC): same as GP-FNARX (FITC) but with no pre-filtering of the signals.
• * nlarx: 3 different NARX models implemented in the Matlab System Identification toolbox [7]. Default options. No pre-filtering of the signals.
• * nlarx (filt): same as * nlarx but using the pre-filtering parameters taken from GP-FNARX (FITC) (the computation time for obtaining those parameters is not included in the reported figure).

All models have order 10 for the inputs and the outputs.

We observe that the GP-FNARX method with a FITC sparse GP provides the lowest error for all noise levels in both benchmarks. Overall, these results allow us to be optimistic with regard to the prediction capabilities of GP-FNARX models and validate the use of the marginal likelihood as a criterion for model selection in the context of automated pre-processing of the data.

The legend of Figure 3 reports the computation times of the experiments when performed on a machine with a 2008 Intel Core i7-920 processor at 2.67 GHz. Although the training time of the GP-FNARX model is higher than that of the wavenet and treepartition models, it is significantly faster than sigmoidnet yet provides a lower prediction error. GP models have a one-degree-of-freedom knob which can be used to trade off accuracy against speed: the number of data points used for SoD or the number of inducing points in FITC. Increasing the number of points is not equivalent to increasing the number of parameters in a parametric model and does not imply any risk of overfitting [11].

VI. CONCLUSIONS

We have presented GP-FNARX, an automated approach to the identification of nonlinear systems based on autoregressive models implemented with Gaussian processes. Our approach combines the data pre-processing step with the identification task per se. By maximizing the marginal likelihood of the probabilistic regression model we obtain a fitting procedure that naturally balances model fit and model complexity.

The Gaussian process model resulting from the identification belongs to the family of Bayesian nonparametric models. As such, it does not rely on a rigid parametric functional form and has the capability to adapt to arbitrarily complex data. Moreover, the models can report the uncertainty present in their own predictions. For instance, if the model is used in an operating region which is far from the training data, the model will report higher uncertainty. This feature is useful in applications such as robotics or adaptive control, where different control strategies may be appropriate depending on the confidence in the predictions.

One of the reasons why Bayesian system identification is not widely used is the lack of methods with a competitive computational cost. We have demonstrated our approach on time series having more than $10^5$ data points and have shown that it presents a realistic alternative to parametric nonlinear function approximators. We obtain good performance at a comparable cost and add the ability to quantify the uncertainty in the model. In addition, our method is easy to apply due to the automation of the pre-processing stage.

REFERENCES

[1] V. Peterka, "Bayesian system identification," Automatica, vol. 17, no. 1, pp. 41-53, 1981.
[2] D. Barber and S. Chiappa, "Unified inference for variational Bayesian linear Gaussian state-space models," in Advances in Neural Information Processing Systems (NIPS), vol. 20, 2006.
[3] A. Wills, T. B. Schön, F. Lindsten, and B. Ninness, "Estimation of linear systems using a Gibbs sampler," in Proceedings of the 16th IFAC Symposium on System Identification (SYSID), 2012.
[4] L. Ljung, System Identification: Theory for the User, 2nd ed. Prentice Hall, 1999.
[5] J. Kocijan, A. Girard, B. Banko, and R. Murray-Smith, "Dynamic systems identification with Gaussian processes," Mathematical and Computer Modelling of Dynamical Systems, vol. 11, no. 4, pp. 411-424, 2005.
[6] T. Gutjahr, H. Ulmer, and C. Ament, "Sparse Gaussian processes with uncertain inputs for multi-step ahead prediction," in Proceedings of the 16th IFAC Symposium on System Identification (SYSID), vol. 16, 2012, pp. 107-112.
[7] L. Ljung, System Identification Toolbox User's Guide (R2012b). The MathWorks, 2012.
[8] R. Turner, M. P. Deisenroth, and C. E. Rasmussen, "State-space inference and learning with Gaussian processes," in 13th International Conference on Artificial Intelligence and Statistics, ser. W&CP, Y. W. Teh and M. Titterington, Eds., vol. 9, Chia Laguna, Sardinia, Italy, May 13-15, 2010, pp. 868-875.
[9] T. B. Schön, A. Wills, and B. Ninness, "System identification of nonlinear state-space models," Automatica, vol. 47, no. 1, pp. 39-49, 2011.
[10] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[11] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[12] J. Quiñonero-Candela and C. E. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," Journal of Machine Learning Research, vol. 6, pp. 1939-1959, 2005.
[13] E. Snelson and Z. Ghahramani, "Sparse Gaussian processes using pseudo-inputs," in Advances in Neural Information Processing Systems (NIPS), Y. Weiss, B. Schölkopf, and J. Platt, Eds., Cambridge, MA, 2006, pp. 1257-1264.
[14] J. Schoukens, J. Suykens, and L. Ljung, "Wiener-Hammerstein benchmark," in Symposium on System Identification (SYSID) Special Session, vol. 15, 2009.
[15] H. Hjalmarsson, C. R. Rojas, and D. E. Rivera, "System identification: A Wiener-Hammerstein benchmark," Control Engineering Practice, vol. 20, no. 11, pp. 1095-1096, 2012.