Integrated Pre-Processing for Bayesian Nonlinear System Identification
with Gaussian Processes
Roger Frigola and Carl Edward Rasmussen

The authors are with the Department of Engineering, University of Cambridge, UK, {rf342,cer54}@cam.ac.uk. All the results presented in this paper can be reproduced without any tuning by using the code for GP-FNARX available from the first author's website: http://mlg.eng.cam.ac.uk/roger
Abstract— We introduce GP-FNARX: a new model for non-
linear system identification based on a nonlinear autoregressive
exogenous model (NARX) with filtered regressors (F) where the
nonlinear regression problem is tackled using sparse Gaussian
processes (GP). We integrate data pre-processing with system
identification into a fully automated procedure that goes from
raw data to an identified model. Both pre-processing param-
eters and GP hyper-parameters are tuned by maximizing the
marginal likelihood of the probabilistic model. We obtain a
Bayesian model of the system’s dynamics which is able to
report its uncertainty in regions where the data is scarce.
The automated approach, the modeling of uncertainty and its
relatively low computational cost make GP-FNARX a good
candidate for applications in robotics and adaptive control.
I. INTRODUCTION
System identification consists in building mathematical
models of dynamical systems based on observed input and
output signals. Those signals are often contaminated by noise
and do not necessarily explore the whole operating range of
the system. Yet, system identification methods are required
to find a model that faithfully explains the observed data and
has graceful generalization capabilities.
In this paper we will be particularly concerned with
identifying models of nonlinear systems when only a limited
amount of data are available. Data could be scarce due to
the high cost of experimentation, vast operating range of
the system, presence of closed loop control or many other
reasons. As a consequence, we will require a model that can
accurately describe the system wherever data is present but
which can also report its own ignorance in regions where no
data can support its claims. This naturally leads to the use of
a Bayesian framework for the identification of the system’s
dynamics [1].
A very common and general representation of a dynamical
system is in terms of a state space model. The future
behavior of the system is fully described by its current state
z and external input u. In statistical terms, future states are
conditionally independent of past states and inputs given the
current ones:

p(z_{t+1} \mid z_t, z_{t-1}, \ldots, u_t, u_{t-1}, \ldots) = p(z_{t+1} \mid z_t, u_t).    (1)
A system with this characteristic is said to have the Markov
property.
The state of a system is often not perfectly observed. An observation model is usually defined as a probability density of an observation y_t conditioned on the state at the same time step: p(y_t | z_t). In the particular case where a nonlinear state space model has additive independent Gaussian noise in both the state transition and the observation, the model can be presented as
z_{t+1} = f(z_t, u_t) + \delta_t,   \delta_t \sim \mathcal{N}(0, Q)    (2)
y_t = g(z_t) + \epsilon_t,   \epsilon_t \sim \mathcal{N}(0, R).    (3)
For time-invariant systems, f and g are fixed deterministic functions, although both the state transition (2) and the observation model (3) are stochastic difference equations.
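As a concrete illustration of equations (2)-(3), a minimal simulation of a hypothetical scalar system could be written as follows; the particular choices of f, g and the noise variances are ours, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
Q, R = 0.01, 0.1                 # process and observation noise variances
T = 200

# Arbitrary illustrative choices for f and g (not taken from the paper).
f = lambda z, u: 0.9 * z + 0.5 * np.tanh(u)      # state transition
g = lambda z: z + 0.2 * z**2                     # observation map

u = rng.standard_normal(T)       # exogenous input
z = np.zeros(T + 1)
y = np.zeros(T)
for t in range(T):
    y[t] = g(z[t]) + np.sqrt(R) * rng.standard_normal()            # eq. (3)
    z[t + 1] = f(z[t], u[t]) + np.sqrt(Q) * rng.standard_normal()  # eq. (2)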
In statistical terms, the system identification task can be
described as finding the state transition probability condi-
tioned on the observed inputs and outputs
p(z_{t+1} \mid z_t, u_t, u_{1:N}, y_{1:N}),   t > N.    (4)
The fact that the states are not observed makes this dis-
tribution exceedingly difficult to compute. In particular, if
one wants to obtain a model that is able to report its own
uncertainty (i.e. a Bayesian model), there exist no closed
form solutions even for the simplest linear-Gaussian state
space models [2], [3].
A popular alternative to state space models can be found in
autoregressive exogenous (ARX) models [4]. These models
represent the system based only on observable quantities: the
inputs and the outputs. A nonlinear ARX (NARX) model
with Gaussian innovations provides an estimate of the next
output based on a finite amount of previous inputs and
outputs
y_t = f(y_{t-1}, y_{t-2}, \ldots, u_{t-1}, \ldots) + \delta_t,   \delta_t \sim \mathcal{N}(0, Q).    (5)
In contrast with state space models, there are no latent
states and the model does not have the Markov property. In
practical terms, this representation based only on observables
makes it possible to sidestep many of the problems associated
with system identification when latent states are present.
An important drawback of autoregressive models is that
they do not account for observation noise. If we consider equation (5) in a generative manner, assuming a known f and sequentially generating new values y_t, y_{t+1}, \ldots, it is apparent that all randomness injected through the innovation \delta influences the future trajectory of the system. In other words, the innovation can be considered a form of process noise. The model does not accommodate any kind of observation noise such as that described in equation (3) for state space models.
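This contrast can be made explicit with a short generative rollout of equation (5): the innovation is fed back through the lagged outputs and therefore acts as process noise, and no separate measurement-noise term appears. The nonlinearity f below is an arbitrary placeholder.

import numpy as np

rng = np.random.default_rng(1)
Q = 0.01                         # innovation (process noise) variance
T = 200

# Arbitrary illustrative nonlinearity; the paper places a GP prior on f instead.
f = lambda y_prev, u_prev: 0.8 * y_prev + np.sin(u_prev)

u = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    # The innovation enters the trajectory and is fed back through y[t - 1];
    # there is no observation-noise term analogous to eq. (3).
    y[t] = f(y[t - 1], u[t - 1]) + np.sqrt(Q) * rng.standard_normal()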
A common way to deal with observation noise is to pre-
process the measured input and output data to remove as
much noise as possible before identifying the NARX model
[4]. For instance, low-pass filtering the signals can remove
a significant component of their noise. However, for optimal
performance, the filter needs to be tuned to the particular
characteristics of the noise.
We propose an automated approach which simultaneously
performs filter tuning and learns a Bayesian model of
the system’s dynamics. Noise modeling is performed in a
completely data-driven fashion by optimizing the marginal
likelihood of the probabilistic regression model.
In section II we compare our approach to related work.
Section III presents an overview of the field of Bayesian
system identification together with a brief introduction
to Gaussian processes. Then, in section IV, we describe
our approach to system identification integrating data pre-
processing with learning the system’s dynamics. We show
experimental results in section V and, finally, in section VI
we make concluding remarks.
II. RELATED WORK
Kocijan et al. [5] presented a NARX model where the
nonlinearity is captured using Gaussian process regression.
More recently, Gutjahr et al. [6] have proposed the use of
FITC to make computation more efficient in this kind of
model. Our model follows this approach but also
considers that the input/output signals may be contaminated
by noise which presents an errors-in-variables regression
problem.
We mitigate the errors-in-variables problem by incorporat-
ing into system identification the signal pre-processing step
that is usually carried out manually [4], [7]. For instance, a
portion of the noise can be removed by low-pass filtering the
signals in the time domain.
More sophisticated approaches combining smoothing and
system identification for state space models have the po-
tential to give better results if one is ready to accept their
additional complexity [8], [9]. In contrast, our method should be seen as a pragmatic approach that offers a better solution than standard NARX models with a simple procedure and limited computational effort.
Thus, it is particularly well suited to large datasets.
III. BACKGROUND THEORY
A. Bayesian System Identification
The goal of Bayesian system identification is to obtain
models of dynamical systems that quantify the degree of un-
certainty in the system’s dynamics [1]. Modeling uncertainty
can be particularly useful in applications such as robotics or
adaptive control where decisions about future control actions
will typically depend on the confidence that the agent has
about the system dynamics.
Bayesian system identification relies on the use of
Bayesian statistics and hence interprets probabilities as de-
grees of belief. Such an approach to probabilistic modeling
has been very successful in the fields of machine learning
and artificial intelligence [10]. We will borrow heavily from
advances in those fields to model uncertainty in a principled
manner. In particular, we will solve the nonlinear regression
task from a NARX model with Bayesian nonlinear regression
implemented with Gaussian processes.
B. Gaussian Process Regression
Gaussian processes are stochastic processes that have
proven very successful at the task of nonlinear regression.
Their unbounded flexibility makes them ideal whenever it is
hard to specify a parametric form for an unknown nonlinear
function.
Formally, a Gaussian process is a collection of random
variables, any finite number of which have a joint Gaussian
distribution [11]. We say that a random function f(x) ∈R,
with x ∈Rn, follows a Gaussian process and write it as
f(x) \sim \mathcal{GP}(m(x), k(x, x'))    (6)
when all values of f at any locations x are jointly normally
distributed. m(x) and k(x, x′) represent the mean function
and the covariance function respectively. Together, they com-
pletely specify the Gaussian process and can be parametrized
by a, typically small, set of hyper-parameters. Those hyper-
parameters define the kind of functions that one expects to
see. For instance, they control the degree of smoothness or
periodicity but do not constrain the function to have any
predefined shape.
In Gaussian process regression, a likelihood function
p(y|f) relates the vector of real-valued observations y =
(y1, ..., yN) with the latent function f. In the particular case
where observations are contaminated with additive Gaussian
i.i.d. noise
p(y_i \mid f_i) = \mathcal{N}(y_i \mid f_i, \sigma^2),    (7)
there exists a convenient closed-form solution for the poste-
rior distribution over functions, i.e. the distribution over the
unknown function f(x) after incorporating all the observed
data. The posterior distribution of the function at any new location x_* is described by

p(f_* \mid x_*, y, X) = \mathcal{N}(\bar{f}_*, \mathrm{cov}(f_*))    (8)
where X = {x1, ..., xN} is the set of regressor vectors. In
the common case where m(x) = 0 we obtain
\bar{f}_* = K(X_*, X)[K(X, X) + \sigma^2 I]^{-1} y,    (9)
\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)[K(X, X) + \sigma^2 I]^{-1} K(X, X_*).    (10)
where the covariance matrices K are constructed by applying
the covariance function to all elements in each of the sets
that they take as arguments:
K(A, B)_{ij} = k(a_i, b_j).    (11)
In Figure 1 we show an example of Gaussian process
regression. Given a set of seven noisy data points, the goal
is to infer the latent function that generated them. We do
not assume any parametric form for this latent function. Our
assumption is that the latent function was a draw from a
Gaussian process with zero mean and a squared exponential covariance function [11]. To obtain the posterior distribution over the latent function we use equations (9) and (10). Note how the confidence region widens in areas that are far from the data points.

Fig. 1. Gaussian process regression. Black dots represent the data. The red line is the mean of the Gaussian process prediction and the area shaded in red covers the 95% confidence region. The three black lines are random functions drawn from the posterior that could have generated the data.
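A Figure-1-style example can be reproduced directly from equations (9)-(11). The squared exponential covariance, the toy latent function and the noise level below are illustrative choices, not necessarily the settings used for the figure.

import numpy as np

def k_se(A, B, ell=1.0, sf=1.0):
    """Squared exponential covariance; eq. (11) applied element-wise."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=7)                    # seven noisy data points, as in Fig. 1
y = np.sin(X) + 0.1 * rng.standard_normal(7)     # illustrative latent function plus noise
Xs = np.linspace(0, 5, 200)                      # test locations x_*
sigma2 = 0.1**2

K = k_se(X, X) + sigma2 * np.eye(len(X))
Ks = k_se(Xs, X)                                 # K(X_*, X)
alpha = np.linalg.solve(K, y)

mean = Ks @ alpha                                          # eq. (9)
cov = k_se(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)         # eq. (10)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
lower, upper = mean - 1.96 * std, mean + 1.96 * std        # 95% confidence region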
C. Model Selection for Gaussian Processes
In the Gaussian process framework, model selection refers
to using data to select the functional form for the mean and
the covariance functions as well as selecting the values for
their hyper-parameters. In addition, the likelihood function
may also have some hyper-parameters, such as the noise
variance σ2 in equation (7). For convenience we include the
likelihood hyper-parameters into the vector θ.
For Gaussian process models, it is very attractive, from a
computational and theoretical point of view, to use Bayesian
model selection procedures. In particular, maximization of
the marginal likelihood of the Gaussian process model has
proven to be very effective [11]. The objective is to find
\theta_{\mathrm{Opt}} = \arg\max_{\theta} \; p(y \mid X, \theta)    (12)
where
p(y \mid X, \theta) = \int p(y \mid f, \theta)\, p(f \mid X, \theta)\, df    (13)
is the marginal likelihood. The first term inside the integral
is the likelihood function and the second term is a normal
distribution that can be obtained directly from the mean and
covariance functions that specify the Gaussian process. For
the particular case of Gaussian processes with zero mean and
a Gaussian likelihood, such as in eq. (7), we obtain:
p(y \mid X, \theta) = \mathcal{N}(0, K_\theta(X, X) + \sigma^2 I).    (14)
To avoid underflow errors, it is common to work with
the logarithm of the marginal likelihood rather than the
marginal likelihood itself. Since the logarithm function is
strictly monotonically increasing, this transformation does
not change the location of the optimum. If we take logarithms
on both sides of the equation we obtain
\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log 2\pi    (15)

where K = K_\theta(X, X) + \sigma^2 I.
For optimization, it will be useful to have the derivative of
the marginal likelihood with respect to any of the parameters
of interest:
\frac{\partial}{\partial \theta_j} \log p(y \mid X(\omega), \theta) = \tfrac{1}{2}\, y^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y - \tfrac{1}{2}\, \mathrm{tr}\!\left(K^{-1} \frac{\partial K}{\partial \theta_j}\right).    (16)
The marginal likelihood embodies what is usually de-
scribed as Bayesian Occam’s razor: an automatic balancing
of model fit and model complexity [11]. In contrast with
maximum likelihood estimation, the marginal likelihood is
found by integrating over the latent values of the function
f(x).
The computational complexity of the marginal likelihood
in eq. (14) is O(N 3) due to the inverted covariance matrix
in the density of a multivariate Gaussian distribution. This
cubic complexity effectively limits the approach to values of N no greater than a few thousand. However, there is a
mitigating factor: once K−1 has been computed for eq. (15),
each derivative of the marginal likelihood with respect to the
hyper-parameters has a computational complexity of O(N 2).
This is a particular feature of the marginal likelihood that is
useful for efficient gradient based optimization.
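For a single hyper-parameter (here the log length-scale of a squared exponential covariance), equations (15) and (16) can be implemented directly; the Cholesky factorization of K is the O(N^3) step and is reused for every subsequent derivative. This is a minimal sketch with placeholder kernel choices.

import numpy as np

def log_marglik_and_grad(X, y, ell, sf, sigma):
    """Log marginal likelihood (eq. 15) and its derivative w.r.t. log ell (eq. 16)."""
    N = len(y)
    d2 = (X[:, None] - X[None, :]) ** 2
    Kf = sf**2 * np.exp(-0.5 * d2 / ell**2)          # squared exponential covariance
    K = Kf + sigma**2 * np.eye(N)

    L = np.linalg.cholesky(K)                        # O(N^3), computed once
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # K^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2 * np.pi)   # eq. (15)

    # Derivative of K w.r.t. log(ell); further hyper-parameters reuse K^{-1} at O(N^2) each.
    dK = Kf * (d2 / ell**2)
    Kinv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(N)))
    grad = 0.5 * alpha @ dK @ alpha - 0.5 * np.sum(Kinv * dK)    # eq. (16); trace term in O(N^2)
    return lml, grad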
D. Sparse Gaussian Process Regression
Several approaches have been proposed to accelerate Gaussian processes for very large data sets. Those approaches aim at reducing the particularly disadvantageous O(N^3) cost of training the hyper-parameters, but also target a reduction of the O(N) cost of evaluating the mean of a posterior GP (eq. 9) and of the O(N^2) cost of computing its posterior covariance (eq. 10).
A common strategy is to introduce a new set of M latent
variables ¯f at input locations ¯X where M < N [12]. The
intuition behind these methods is that the new set of latent
variables incorporates much of the information present in the
original, large, dataset.
A particularly effective sparse GP method named FITC, or
originally SPGP, was proposed by Snelson and Ghahramani
[13]. Due to space constraints we avoid a complete re-
derivation of the FITC method and simply state its predictive
distribution given a dataset {y, X} and the locations of the
latent inputs ¯X:
p(f_* \mid x_*, y, X, \bar{X}) = \mathcal{N}(\mu_*, \sigma_*^2),    (17)

where

\mu_* = K_*^\top Q_M^{-1} K_{MN} (\Lambda + \sigma^2 I)^{-1} y    (18)
\sigma_*^2 = K_{**} - K_*^\top (K_M^{-1} - Q_M^{-1}) K_*    (19)

and

Q_M = K_M + K_{MN} (\Lambda + \sigma^2 I)^{-1} K_{NM}    (20)
\Lambda = \mathrm{diag}(\lambda),   \lambda_n = K_{nn} - K_n^\top K_M^{-1} K_n,    (21)
where the matrices K are constructed using the covariance function as shown in eq. (11), following the convention K_{AB} = K(X_A, X_B) and K_A = K(X_A, X_A).
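Transcribed literally (and without the Cholesky-based numerics one would use in practice), equations (18)-(21) for scalar inputs can be written as follows; the covariance function and its hyper-parameters are placeholders.

import numpy as np

def k_se(A, B, ell=1.0, sf=1.0):
    d2 = (np.atleast_1d(A)[:, None] - np.atleast_1d(B)[None, :]) ** 2
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def fitc_predict(x_star, X, y, Xbar, sigma2, **kern):
    """FITC predictive mean and variance, eqs. (18)-(21)."""
    KM = k_se(Xbar, Xbar, **kern)                     # K_M
    KMN = k_se(Xbar, X, **kern)                       # K_MN
    Kstar = k_se(Xbar, x_star, **kern)                # K_*
    KM_inv = np.linalg.inv(KM)

    # Diagonal correction Lambda, eq. (21) (computed densely here for clarity).
    lam = k_se(X, X, **kern).diagonal() - np.einsum('mn,mk,kn->n', KMN, KM_inv, KMN)
    A_inv = 1.0 / (lam + sigma2)                      # (Lambda + sigma^2 I)^{-1}, diagonal

    QM = KM + (KMN * A_inv) @ KMN.T                   # eq. (20)
    QM_inv = np.linalg.inv(QM)

    mu = Kstar.T @ (QM_inv @ (KMN @ (A_inv * y)))     # eq. (18)
    var = (k_se(x_star, x_star, **kern).diagonal()
           - np.einsum('mi,mk,ki->i', Kstar, KM_inv - QM_inv, Kstar))   # eq. (19)
    return mu, var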
IV. GP-FNARX MODEL
A. Overview
In this section, we describe the GP-FNARX model for
nonlinear system identification based on a nonlinear autore-
gressive exogenous (NARX) model with filtered regressors
(F) where the nonlinear regression problem is tackled using
sparse Gaussian processes (GP). The approach integrates pre-
processing (e.g. filtering) with system identification into a
fully automated procedure that yields a Bayesian nonpara-
metric model.
The key contribution of this article is the integration of the
data pre-processing step with the optimization of the sparse
GP hyper-parameters. Training is performed by simultane-
ously maximizing the marginal likelihood of the probabilistic
regression model with respect to the pre-processing param-
eters and the GP hyper-parameters.
Autoregressive models with exogenous inputs attempt to
predict future outputs by considering them a function of past
inputs and outputs to the system:
y_t = f(y_{t-1}, y_{t-2}, \ldots, u_{t-1}, \ldots) + \delta_t.    (22)
The identification problem is then posed as a regression task.
In other words, we want to infer the function y = f(x)
from a finite number of examples {yi, xi}, where xi =
(yi−1, yi−2, ..., ui−1, ...). In our approach, we place a GP
prior on the function f(x) and use sparse GP regression
techniques to infer it from observed data.
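For concreteness, the regressor vectors x_i described above can be stacked into a design matrix as sketched below; the function name and the separate output and input orders n_y and n_u are our own notational choices.

import numpy as np

def build_narx_regressors(y, u, ny, nu):
    """Stack x_t = (y_{t-1}, ..., y_{t-ny}, u_{t-1}, ..., u_{t-nu}) with targets y_t."""
    T = len(y)
    lag = max(ny, nu)
    X = np.column_stack(
        [y[lag - i:T - i] for i in range(1, ny + 1)] +
        [u[lag - j:T - j] for j in range(1, nu + 1)]
    )
    return X, y[lag:]            # regressor matrix and corresponding targets

With order 10 for both the inputs and the outputs, as used in Section V, one would call build_narx_regressors(y, u, 10, 10).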
If the input and output signals to the dynamical system
are noisy, we face an errors-in-variables problem since the
regressors x are noisy. Noise in the regressors makes the
regression problem particularly hard. This is one of the
reasons why input signals are normally pre-processed before
trying to identify a model of the system dynamics. For
instance, one can carefully low-pass filter the signals to
remove high-frequency noise irrelevant to the identification
task at hand. Since we are looking for a method that avoids
having a human in the loop, we take a data-based approach
which consists in parameterizing the data pre-processing
stage and optimizing its parameters jointly with the hyper-
parameters of the Gaussian process regression.
We will consider any data pre-processing function applied
to the input and output signals
(\hat{y}, \hat{u}) = h(y, u, \omega)    (23)
where the pre-processed signals vary smoothly with respect
to a vector of pre-processing parameters ω. This smoothness
condition is imposed in order to obtain a probabilistic model
with a differentiable marginal likelihood.
We can rephrase the autoregressive model in terms of the
pre-processed regressors:
y_t = f(\hat{y}_{t-1}, \hat{y}_{t-2}, \ldots, \hat{u}_{t-1}, \ldots).    (24)
Note that the left hand side term is not pre-processed.
B. Optimization of the Marginal Likelihood
In section III-C we have described how the marginal
likelihood provides a very powerful metric to perform model
selection due to its ability to automatically trade off model fit
and model complexity in a principled manner. Our goal here
will be to maximize the marginal likelihood of the Gaussian
process regression model with respect to the signal pre-processing parameters and also, simultaneously, with respect
to the hyper-parameters of the GP. For convenience, we
introduce ψ = (ω, θ) grouping the two kinds of parameters.
We will employ hill-climbing optimization to maximize
the marginal likelihood (or, equivalently, its logarithm). For
notational simplicity, we group all the pre-processed regres-
sors into a matrix ˆX = ˆX(y, u, ω) so that the log marginal
likelihood becomes
\log p(y \mid \hat{X}(\omega), \theta).    (25)
Its derivatives with respect to the GP hyper-parameters

\frac{\partial}{\partial \theta_j} \log p(y \mid \hat{X}(\omega), \theta)    (26)
are straightforward since we typically choose differentiable
covariance functions. However, the derivatives with respect
to any of the pre-processing parameters
\frac{\partial}{\partial \omega_k} \log p(y \mid \hat{X}(\omega), \theta)    (27)
can be more difficult to compute. From eq. (16) it is apparent that ∂K/∂ω_k needs to be computed. We can write the derivative of a single element of the covariance matrix as

\frac{\partial K_{ij}}{\partial \omega_k} = \frac{\partial k(\hat{x}_i, \hat{x}_j)}{\partial \omega_k} = \frac{\partial k}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial \omega_k} + \frac{\partial k}{\partial \hat{x}_j} \frac{\partial \hat{x}_j}{\partial \omega_k}    (28)
where the derivatives with respect to the regressors are
straightforward to compute when using smooth covariance
functions. However, the derivatives of the regressors with
respect to the pre-processing parameters may be hard to com-
pute. In any case, if the pre-processing function is smooth, the derivatives ∂x̂/∂ω_k can be approximated numerically by finite differences at the cost of one extra evaluation of the pre-processing function per dimension of ω.
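The resulting objective and the mixed analytic/finite-difference strategy can be organized along the following lines. Here preprocess stands for any smooth h(y, u, ω) of equation (23), log_marglik for the GP marginal likelihood of equation (25) and build_narx_regressors for the regressor construction sketched earlier; all three are hypothetical helpers rather than functions from the released code.

import numpy as np

def neg_log_marglik_joint(psi, y, u, ny, nu, n_omega, preprocess, log_marglik):
    """Objective over psi = (omega, theta): filter, build regressors, score (eq. 25)."""
    omega, theta = psi[:n_omega], psi[n_omega:]
    y_f, u_f = preprocess(y, u, omega)                      # eq. (23)
    X_hat, _ = build_narx_regressors(y_f, u_f, ny, nu)      # filtered regressors, eq. (24)
    _, targets = build_narx_regressors(y, u, ny, nu)        # left-hand side is not pre-processed
    return -log_marglik(targets, X_hat, theta)

def grad_fd_omega(psi, n_omega, obj, eps=1e-4):
    """Finite-difference derivatives w.r.t. the pre-processing parameters only:
    one extra objective evaluation per dimension of omega."""
    g = np.zeros(n_omega)
    f0 = obj(psi)
    for k in range(n_omega):
        psi_k = psi.copy()
        psi_k[k] += eps
        g[k] = (obj(psi_k) - f0) / eps
    return g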
C. Sparse GPs for Computational Speed
Computing the marginal likelihood for datasets with more
than a few thousand points becomes computationally ex-
pensive. For larger datasets we adopt a pragmatic strategy
whereby the marginal likelihood is maximized by employing
only a subset of the data.
Once the GP hyper-parameters and the data pre-processing
parameters have been set, we use FITC sparse Gaussian
processes [12] to obtain a model of the system dynamics
that uses the whole dataset to make predictions. Equation
(17) shows the expression for FITC predictions. After all possible pre-computations, the computational complexity is O(M) for the mean of the predictive distribution and O(M^2) for its variance. The computational efficiency comes from
having condensed the original dataset with N points into a
smaller set of M inducing points.
Fig. 3. Root mean squared prediction error on the test set as a function of the signal to noise ratio in the output signal, for the Silverbox benchmark and the Wiener-Hammerstein benchmark (RMSE [V] versus SNR [dB]). The mean computation time and standard deviation for each method are displayed in the legend (in seconds). Experiments are repeated 10 times, the marker is positioned at the median value and error-bars indicate the 10-90 percentile interval.

Silverbox benchmark legend (method, computation time in seconds): GP-FNARX (SoD), 23±2; GP-FNARX (FITC), 23±2; GP-NARX (SoD), 14±1; GP-NARX (FITC), 14±1; wavenet nlarx, 5±1; sigmoidnet nlarx, 83±9; treepartition nlarx, 7±0; wavenet nlarx (filt), 6±2; sigmoidnet nlarx (filt), 74±11; treepartition nlarx (filt), 7±0.

Wiener-Hammerstein benchmark legend (method, computation time in seconds): GP-FNARX (SoD), 25±2; GP-FNARX (FITC), 25±2; GP-NARX (SoD), 16±1; GP-NARX (FITC), 16±1; wavenet nlarx, 7±3; sigmoidnet nlarx, 85±12; treepartition nlarx, 8±0; wavenet nlarx (filt), 5±1; sigmoidnet nlarx (filt), 85±8; treepartition nlarx (filt), 8±0.
Inputs: I = {output signals y = (y_1, ..., y_N), input signals u = (u_1, ..., u_N), model order vector η}
  1. ψ_0 ← INITIALGUESS(I)
  2. ψ_Opt ← MAXIMIZEMARGLIKELIHOOD(ψ_0, I)
  3. model ← COMPUTEFITCPREDICTOR(ψ_Opt, I)

Fig. 2. High-level pseudo-code for the GP-FNARX algorithm.
D. Summary of the Proposed Methodology
In Fig. 2 we present a high-level overview of the GP-
FNARX algorithm. The first step consists in providing an
initial guess for the unknown hyper-parameters. We have
found that, in the absence of any problem-specific knowledge
that could guide the initial guess, a successful data-based
heuristic is to run a few steps of optimization of the marginal
likelihood with respect to the GP hyper-parameters followed
by a few steps of optimization with respect to the pre-
processing parameters. By running these two steps on a subset of the data, the algorithm can rapidly home in on the right order of magnitude for the unknown parameters.
The second step consists in a straightforward joint opti-
mization of the GP hyper-parameters and the pre-processing
parameters. This step can be performed either on a subset of the data (SoD) with a conventional GP or on the full dataset with a FITC sparse GP [12]. The FITC approach
is theoretically more sound but very good and fast results
can also be obtained by optimizing the marginal likelihood
computed on a subset of the data.
In the third and final step, a large part of the FITC
predictor equations (18,19) is pre-computed in order to
obtain a model that provides predictive means in O(M) and predictive variances in O(M^2).
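Putting the three steps of Fig. 2 together, a high-level sketch of the procedure could look as follows. Every helper (preprocess, log_marglik, fitc_fit, build_narx_regressors, neg_log_marglik_joint) is a stand-in for the operations described above, and the constant initial guess replaces the alternating-optimization heuristic purely for brevity.

import numpy as np
from scipy.optimize import minimize

def gp_fnarx(y, u, ny, nu, n_omega, n_theta, preprocess, log_marglik, fitc_fit):
    """High-level sketch of the GP-FNARX algorithm of Fig. 2 (helpers are hypothetical)."""
    # 1. Initial guess for psi = (omega, theta); the paper alternates a few
    #    optimization steps over theta and omega on a data subset instead.
    psi0 = np.zeros(n_omega + n_theta)

    # 2. Joint maximization of the marginal likelihood over psi; for large datasets
    #    this would be evaluated on a subset of the data (SoD) or with a FITC
    #    sparse GP (subsetting omitted here for brevity).
    obj = lambda psi: neg_log_marglik_joint(psi, y, u, ny, nu, n_omega,
                                            preprocess, log_marglik)
    psi_opt = minimize(obj, psi0, method='L-BFGS-B').x

    # 3. Pre-compute the FITC predictor (eqs. 18-19) on the full pre-processed dataset
    #    so that predictions cost O(M) for the mean and O(M^2) for the variance.
    omega, theta = psi_opt[:n_omega], psi_opt[n_omega:]
    y_f, u_f = preprocess(y, u, omega)
    X_hat, _ = build_narx_regressors(y_f, u_f, ny, nu)
    _, targets = build_narx_regressors(y, u, ny, nu)
    return fitc_fit(X_hat, targets, theta)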
As an aside, we want to highlight that the choice of orders
of the autoregressive model is not critical to the performance
of the algorithm provided that two conditions are met: i)
the order is chosen to be higher than the optimal order and
ii) the automatic relevance determination (ARD) covariance
function is chosen for the GP. This is due to the fact
that the Bayesian Occam’s razor embodied by the marginal
likelihood is able to automatically disable regressors that
are irrelevant to the task at hand. In our experiments we
verified that adding hundreds of regressors to a problem of low order did not cause any overfitting and only incurred a computation time penalty.
V. EXPERIMENTAL EVALUATION
In this section we present an experimental evaluation of the
proposed system identification method. We have used data
from two nonlinear system identification benchmarks based
on electric circuits: i) the Silverbox benchmark originating
from the 2004 IFAC Symposium on Nonlinear Control Sys-
tems (NOLCOS) and ii) the SYSID09 Wiener-Hammerstein
system identification benchmark [14] which has been the
object of a special section in the November 2012 issue of
the “Control Engineering Practice” journal [15].
Both datasets have more than 10^5 data points and are corrupted by a very small amount of noise. For instance, the authors of the Wiener-Hammerstein dataset estimate its signal to noise ratio to be 70 dB. Since we are attempting to demonstrate the ability of our method to cope with measurement noise, we will inject different amounts of synthetic i.i.d. Gaussian noise into the output signals to make the identification task more challenging. Having original signals with little noise is convenient for testing the quality of the resulting models.
Very good algorithms have been tailored to the specific
benchmarks at hand. However, since our goal is to provide
an automatic method for system identification, we have
considered that the best comparison would be against other
off-the-shelf alternatives. We have compared our model against several NARX models available in the Matlab System Identification toolbox [7].
In this spirit, we have avoided any tuning of
our method to the particular benchmarks presented in this
section. For instance, by using knowledge about the under-
lying system of the Silverbox benchmark, we obtained better
performance by adding cubic regressors into our NARX
model. However, we have not used those custom regressors
when reporting the performance of our model.
Regarding the pre-processing step, we have chosen a
simple zero-phase second order Butterworth low-pass filter.
In this case, the filter parameter ω represents the cut-off
frequency of the filter.
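A pre-processing function of this form could be implemented with SciPy as sketched below. Parameterizing the normalized cut-off through an exponential is our own choice to keep the marginal likelihood smooth in ω; the released code may use a different parameterization.

import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(y, u, omega):
    """Zero-phase second-order Butterworth low-pass filter applied to both signals.
    omega[0] parameterizes the cut-off as a fraction of the Nyquist frequency."""
    wc = float(np.clip(np.exp(omega[0]), 1e-4, 0.99))   # normalized cut-off in (0, 1)
    b, a = butter(2, wc)                                # second-order low-pass filter
    return filtfilt(b, a, y), filtfilt(b, a, u)         # zero-phase (forward-backward) filtering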
In Figure 3 we have plotted the prediction errors on both
benchmarks for a number of different models and signal-to-
(synthetic)-noise ratios. We have tested the following models:
• GP-FNARX (SoD): the model presented in this paper using
a subset of 512 randomly chosen points (subset of data
approximation, SoD [11]).
• GP-FNARX (FITC): the model presented in this paper
using a FITC sparse GP [12] with M = 512 inducing
points chosen at random.
• GP-NARX (SoD): same as GP-FNARX (SoD) but with no
pre-filtering of the signals.
• GP-NARX (FITC): same as GP-FNARX (FITC) but with no
pre-filtering of the signals.
• * nlarx: 3 different NARX models implemented in
the Matlab System Identification toolbox [7]. Default
options. No pre-filtering of the signals.
• * nlarx (filt): same as * nlarx but using the pre-filtering parameters taken from GP-FNARX (FITC) (the time for computing those parameters is not included in the reported figure).
All models have order 10 for the inputs and the outputs.
We observe that the GP-FNARX method with a FITC
sparse GP provides the lowest error for all noise levels
in both benchmarks. Overall, these results allow us to be
optimistic with regard to the prediction capabilities of
GP-FNARX models and validate the use of the marginal
likelihood as a criterion for model selection in the context
of automated pre-processing of the data.
The legend of Figure 3 reports the computation times
of the experiments when performed on a machine with a
2008 Intel Core i7-920 processor at 2.67 GHz. Although the training time of the GP-FNARX model is higher than that of the wavenet and treepartition models, it is significantly faster than sigmoidnet and provides a lower prediction error.
GP models have a one-degree-of-freedom knob that can be used to trade off accuracy against speed: the number of data
points used for SoD or the number of inducing points in
FITC. Increasing the number of points is not equivalent to
increasing the number of parameters in a parametric model
and does not imply any risk of overfitting [11].
VI. CONCLUSIONS
We have presented GP-FNARX, an automated approach
for the identification of nonlinear systems based on au-
toregressive models implemented with Gaussian processes.
Our approach combines the data pre-processing step with
the identification task per se. By maximizing the marginal
likelihood of the probabilistic regression model we obtain a
fitting procedure that naturally balances model fit and model
complexity.
The Gaussian process model resulting from the identifica-
tion belongs to the family of Bayesian nonparametric mod-
els. As such, it does not rely on a rigid parametric functional
form and has the capability to adapt to arbitrarily complex
data. Moreover, the models can report the uncertainty present
in their own predictions. For instance, if the model is used in
an operating region which is far from the training data, the
model will report higher uncertainty. This feature is useful
in applications such as robotics or adaptive control where
different control strategies may be appropriate depending on
the confidence in the predictions.
One of the reasons why Bayesian system identification is
not widely used is the lack of methods with a competitive
computational cost. We have demonstrated our approach on time series having more than 10^5 data points and we have shown how it presents a realistic alternative to parametric nonlinear
function approximators. We obtain good performance with
a comparable cost and we add the ability to quantify the
uncertainty in the model. In addition, our method is easy to
apply due to the automation of the pre-processing stage.
REFERENCES
[1] V. Peterka, “Bayesian system identification,” Automatica, vol. 17,
no. 1, pp. 41 – 53, 1981.
[2] D. Barber and S. Chiappa, “Unified inference for variational Bayesian
linear Gaussian state-space models,” in Advances in Neural Informa-
tion Processing Systems (NIPS), vol. 20, 2006.
[3] A. Wills, T. B. Schön, F. Lindsten, and B. Ninness, “Estimation of
linear systems using a Gibbs sampler,” in Proceedings of the 16th
IFAC Symposium on System Identification (SYSID), 2012.
[4] L. Ljung, System Identification: Theory for the User, 2nd ed. Prentice
Hall, 1999.
[5] J. Kocijan, A. Girard, B. Banko, and R. Murray-Smith, “Dynamic
systems identification with Gaussian processes,” Mathematical and
Computer Modelling of Dynamical Systems, vol. 11, no. 4, pp. 411–
424, 2005.
[6] T. Gutjahr, H. Ulmer, and C. Ament, “Sparse Gaussian processes with
uncertain inputs for multi-step ahead prediction,” in Proceedings of
the 16th IFAC Symposium on System Identification (SYSID), vol. 16,
2012, pp. 107–112.
[7] L. Ljung, System Identification Toolbox™ User’s Guide (R2012b).
The MathWorks, 2012.
[8] R. Turner, M. P. Deisenroth, and C. E. Rasmussen, “State-space
inference and learning with Gaussian processes,” in 13th International
Conference on Artificial Intelligence and Statistics, ser. W&CP, Y. W.
Teh and M. Titterington, Eds., vol. 9, Chia Laguna, Sardinia, Italy,
May 13–15 2010, pp. 868–875.
[9] T. B. Schön, A. Wills, and B. Ninness, “System identification of
nonlinear state-space models,” Automatica, vol. 47, no. 1, pp. 39 –
49, 2011.
[10] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[11] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[12] J. Quiñonero-Candela and C. E. Rasmussen, “A unifying view of
sparse approximate Gaussian process regression,” Journal of Machine
Learning Research, vol. 6, pp. 1939–1959, 2005.
[13] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using
pseudo-inputs,” in Advances in Neural Information Processing Systems
(NIPS), Y. Weiss, B. Schölkopf, and J. Platt, Eds., Cambridge, MA,
2006, pp. 1257–1264.
[14] J. Schoukens, J. Suykens, and L. Ljung, “Wiener-Hammerstein bench-
mark,” in Symposium on System Identification (SYSID) Special Ses-
sion, vol. 15, 2009.
[15] H. Hjalmarsson, C. R. Rojas, and D. E. Rivera, “System identification:
A Wiener-Hammerstein benchmark,” Control Engineering Practice,
vol. 20, no. 11, pp. 1095 – 1096, 2012.