Local Gaussian Regression Franziska Meier 1 Philipp Hennig 2 Stefan Schaal 1 , 2 1 Computational Learning and Motor Control Lab, University of Southern California, Los Angeles 2 Max-Planck-Institute for Intelligent Systems, T ̈ ubingen, Germany Abstract Locally weighted regression was created as a nonpara- metric learning method that is computationally effi- cient, can learn from very large amounts of data and add data incrementally. An interesting feature of lo- cally weighted regression is that it can work with spa- tially varying length scales, a beneficial property, for instance, in control problems. However, it does not provide a generative model for function values and requires training and test data to be generated identi- cally, independently. Gaussian (process) regression, on the other hand, provides a fully generative model with- out significant formal requirements on the distribution of training data, but has much higher computational cost and usually works with one global scale per in- put dimension. Using a localising function basis and approximate inference techniques, we take Gaussian (process) regression to increasingly localised properties and toward the same computational complexity class as locally weighted regression. 1 Introduction Besides expressivity and sample efficiency, computa- tional cost is a crucial design criterion for machine learning algorithms in real-time settings, such as con- trol problems. An example is the problem of building a model for robot dynamics: The sensors in a robot’s limbs can produce thousands of datapoints per second, quickly amassing a local coverage of the input domain. In such settings, fast local learning and generaliza- tion can be more important than a globally optimized model. A learning method should rapidly produce a good model from the large number N of datapoints, using a comparably small number M of parameters. Locally weighted regression (LWR) [ 1 ] makes use of the popular and well-studied idea of local learning [ 2 ] to address the task of compressing large amounts of data into a small number of parameters. In the spirit of a Taylor expansion, the idea of LWR is that simple models with few parameters may locally be precise, while it may be difficult to find good nonlinear features to capture the entire function globally – lots of good local models may form a good global one. The key to LWRs low computational cost (linear, O ( N M ) ) is that each local model is trained indepen- dently. The resulting speed has made LWR popular in robot learning. The downside is that LWR requires several tuning parameters, whose optimal values can be highly data dependent. This is at least partly a result of the strongly localized training strategy, which does not allow models to ‘coordinate’, or to benefit from other local models in their vicinity. Here, we explore a probabilistic alternative to LWR that alleviates the need for parameter tuning, but re- tains potential for fast training. An initial candidate could be the mixture of experts model (ME) [ 3 ]. In- deed, it has been argued [ 1 ], that LWR can be thought of as a mixture of experts model in which experts are trained independently of each other. The advantage of ME is that it comes with a full generative model [ 4 , 5 ] allowing for principled learning of parameters and expert allocations, while LWR does not (see Figure 1). However, in this work we emulate LWRs assumption that the data is a continous function as opposed to a mixture of linear models, for instance. When the underlying data is not assumed to be generated by a mixture model, utilizing ME to fit the data makes inference (unneccesarily) complicated. Hence, we are interested in a probabilistic model that captures the best of both worlds without making the mixture as- sumption: A generative model that has the ability to localize computations to speed up learning. Our proposed solution is local Gaussian regression (LGR), a linear regression model that explicitly encodes localisation. Starting from the well-known probabilis- tic, Gaussian formulation of least-squares generalized linear regression, we first re-interpret the well known radial basis feature functions as localisers of constant feature functions. This allows us to introduce more expressive features, capturing the idea of local models within the Gaussian regression framework. In its exact form, this model has the cubic cost in M typical of arXiv:1402.0645v1 [cs.LG] 4 Feb 2014 Local Gaussian Regression N M K training set localizers local models prediction y x n z x n f x n m φ x n mk η x n m w mk α mk φ ∗ mk f ∗ m y ∗ η ∗ m z ∗ λ m mixture of experts N M K training set localizers local models prediction y x n f x n m φ x n mk ξ x n mk η x n m w mk α mk φ ∗ mk f ∗ m y ∗ ξ ∗ mk η ∗ m λ m local Gaussian regression N M K training set localizers local models prediction y x n ̃ y x n m η x n m ̃ ξ x n mk ξ x n mk w mk ξ ∗ mk η ∗ m ̃ y ∗ m y ∗ λ m locally weighted regression Figure 1: Generative model for mixture of experts (ME) (left), local Gaussian regression (LGR) (middle), and factor graph for locally weighted regression (LWR) (right). LWR assumes a fixed contribution of each local model to each observation. So a V-structure only exists within each m -plate, making inference cheaper. In ME, contributions of experts to each observations are normalized, coupling all the experts and making it necessary to update responsibilities in the e -step in a global manner. In local Gaussian regression, the contribution of each model to observations is uncertain (similar to ME), so the model is more densely connected than LWR. However, LWR is not a generative model for the data: It treats training and test data in different ways. Gaussian models, arising because observations induce correlation between all local models in the posterior. To decouple the local models, we propose a variational approximation that gives essentially linear cost in the number of models M . The core of this work revolves around fitting the model parameters using maximum likelihood (Section 3.1). As a final note, we show how the novel feature function representation allows us to readily extend our probabilistic model to a local non- parametric formulation (Section 4). Previous work on probabilistic formulations of local regression [ 6 , 7 ] has been focussed on bottom-up con- structions, trying to find generative models for one local model at a time. To our knowledge, this is the first top-down approach, starting from a globally optimal training procedure, to find approximations giving a localized regression algorithm similar in spirit to LWR. 2 Background Both LWR and Gaussian regression have been studied extensively before, so we only give brief introductions here. Generalized linear regression maps weights w ∈ R F to the nonlinear function f ∶ R D _ R via F feature functions φ i ( x ) ∶ R D _ R : f ( x ) = F Q i = 1 φ i ( x ) w i = φ ⊺ w . (1) Using the feature matrix Φ ∈ R N × F whose elements are Φ nf = φ f ( x n ) , the function values at N locations x n ∈ R D , subsumed in the matrix X ∈ R N × D , are f ( X ) = Φ w . Locally weighted regression (LWR) trains M lo- cal regression models. We will assume each local model has K local feature functions ξ mk ( x ) , so that the m -th model’s prediction at x is f m ( x ) = K Q k = 1 ξ mk ( x ) w mk = ξ m ( x ) w m (2) K = 2 and ξ m 1 ( x ) = 1 , ξ m 2 ( x ) = ( x − c m ) gives a linear model around c m . The models are localized by a non- negative, symmetric and integrable weighting η m ( x ) , typically the radial basis function η m ( x ) = exp − ( x − c m ) 2 2 λ 2 m , or, for x ∈ R D , (3) η m ( x ) = exp − 1 2 ( x − c m ) Λ − 1 ( x − c m ) ⊺ . (4) with center c m and length scale λ or positive definite metric Λ. In LWR, each local model is trained inde- pendently of the others, by minimizing a quadratic Franziska Meier 1 , Philipp Hennig 2 , Stefan Schaal 1 , 2 loss L ( w ) = N Q n = 1 M Q m = 1 η m ( x n )( y n − ξ m ( x n ) w m ) 2 (5) = N Q n = 1 M Q m = 1 » η m ( x n ) y n − » η m ( x n ) ξ m ( x n ) w m 2 over the N observations ( y n , x n ) . At test time, local predictions (2) are combined into a joint prediction at input x as a normalised weighted sum f ( x ) = ∑ m η m ( x ) f m ( x ) ∑ m ′ η m ′ ( x ) = Q m,k φ mk ( x ) w mk (6) with the features φ mk ( x ) = η m ( x ) ξ mk ( x ) ∑ m ′ η m ′ ( x ) . (7) An important observation is that the objective (5) can- not be interpreted as least-squares estimation of the lin- ear model Φ from Eq. (6). Eq. (5) is effectively training M linear models with φ mk ( x n ) = » η m ( x n ) ξ mk ( x n ) on M separate datasets y m ( x n ) = » η m ( x n ) y n , but that model differs from the one in Equation (6). Thus, LWR can not be cast probabilistically as one genera- tive model for training and test data simultaneously. The factor graph in Figure 1, right, shows this broken symmetry. Training LWR is linear in N and M , and cubic in K , since it involves a least-squares problem in the K weights w m . But this low cost profile also means ab- sence of mixing terms between the M models, thus no ‘coordination’ between the local models. One prob- lem this causes is that model-specific hyperparameters are not sufficiently identified and tend to over-fit. To pick one example among the hyperparameters: when learning the length scale parameters λ m , there is no sense of global improvement. Each local model focusses only on fitting data points within a λ m -neighborhood around c m . So the optimal choice is actually λ _ 0, which gives a useless regression model. To counteract this behaviour, implementations of LWR use regularisa- tion of λ , but of course this introduces free parameters, whose best values are not obvious. Most works in LWR address this problem via leave-one-out cross-validation, which works well if the data is sparse. But when the data is dense, like in robotics, the kernel tends to shrink to only ‘activate’ points close by the left-out data point. Gaussian regression [ 8 , § 2] is a probabilistic frame- work for inference on the weights w given the features φ . Using a Gaussian prior p ( w ) = N ( w ; μ 0 , Σ 0 ) and a Gaussian likelihood p ( y S φ , w ) = N ( y ; φ ⊺ w , β − 1 I ) , the posterior distribution is itself Gaussian: p ( w S y , φ ) = N ( w ; μ N , Σ N ) with (8) μ N = ( Σ − 1 0 + β Φ ⊺ Φ ) − 1 ( β Φ ⊺ y − Σ − 1 0 μ 0 ) (9) Σ N = ( Σ − 1 0 + β Φ ⊺ Φ ) − 1 (10) The mean of this Gaussian posterior is identical with the Σ-regularised least-squares estimator for w , so it could alternatively be derived as the minimiser of the loss function L ( w ) = w ⊺ Σ − 1 w + N Q n = 1 ( y n − φ ( x n ) w ) 2 . (11) But the probabilistic interpretation of Equation (8) has additional value over (11) because it is a gener- ative model for all (training and test) data points y , which can be used to learn hyperparameters of the fea- ture functions. The prediction for f ( x ∗ ) with features φ ( x ∗ ) = ∶ φ ∗ is also Gaussian, with p ( f ( x ∗ ) S y , φ ) = N ( f ( x ∗ ) ; φ ∗ μ N , φ ∗ Σ N φ ⊺ ∗ ) (12) As is widely known, this framework can be extended to the nonparametric case by a limit which replaces all in- ner products φ ( x 1 ) Σ 0 φ ( x 2 ) ⊺ with a Mercer (positive semi-definite) kernel function k ( x 1 , x 2 ) . The corre- sponding priors are Gaussian processes. This generali- sation will only play a role in Section 4 of this paper, but the direct connection between Gaussian regression and the elegant theory of Gaussian processes is often cited in favour of this framework. Its main downside, relative to LWR, is computational cost: Calculating the posterior (12) requires solving the least-squares problem for all F parameters w jointly, by inverting the Gram matrix ( Σ − 1 0 + β Φ ⊺ Φ ) . In the general case, this inversion requires Ø ( F 3 ) operations. The point of the construction in the following sections is to use approximations to lower the computation cost of this operation such that the resulting algorithm is compa- rable in cost to LWR, while retaining the probabilistic interpretation, and the modelling robustness of the full Gaussian model. The Gaussian regression framework puts virtually no limit on the choice of basis functions φ , even allow- ing discontinuous and unbounded functions, but the radial basis function (RBF, aka. Gaussian, square- exponential) features from Equation (3), φ i ( x ) = η i ( x ) , (for F = M ) enjoy particular popularity for various rea- sons, including algebraic conveniences and the fact that their associated reproducing kernel Hilbert space lies dense in the space of continuous functions [ 9 ]. A downside of this covariance function is that it uses one global length scale for the entire input domain. There are some special kernels of locally varying regularity [ 10 ], and mixture descriptions offer a more discretely varying model class [ 11 ]. Both, however, require com- putationally demanding training. Local Gaussian Regression 0 0 . 2 0 . 4 0 . 6 0 . 8 1 − 2 − 1 0 1 2 0 0 . 2 0 . 4 0 . 6 0 . 8 1 − 2 − 1 0 1 2 (a) 0 0 . 2 0 . 4 0 . 6 0 . 8 1 − 2 − 1 0 1 2 0 0 . 2 0 . 4 0 . 6 0 . 8 1 − 2 − 1 0 1 2 (b) 0 0 . 2 0 . 4 0 . 6 0 . 8 1 − 2 − 1 0 1 2 0 0 . 2 0 . 4 0 . 6 0 . 8 1 − 2 − 1 0 1 2 (c) Figure 2: Noisy data drawn from sine function learned by (a) LWR (b) exact and (c) approximate local Gaussian regression. In LWR, local models do not know of each other and thus fit tangential lines around the center of each local model. In local Gaussian regression local models are correlated. They work together to fit the function. In the approximate version (c) this correlation is reduced, but not completely gone. 3 Local Parametric Gaussian Regression In Gaussian regression with RBF features as de- scribed above, without changing anything, the features φ m ( x ) = η m ( x ) can be interpreted as M constant func- tion models ξ m ( x ) = 1, localised by the RBF function, φ ( x ) = [ ξ 1 ( x ) η 1 ( x ) , . . . , ξ M ( x ) η M ( x )] . This represen- tation extends to more elaborate local models ξ ( x ) . For example ξ ( x ) = x gives a local weighted regres- sion with linear local models. Extending to M local models consisting of K parameters each, feature func- tion φ n mk combines the k th component of the local model ξ km ( x n ) , localised by the m -th weighting func- tion η m ( x ) φ n mk ∶ = φ mk ( x n ) = η m ( x n ) ξ km ( x n ) . (13) For these features, treating mk as the index set of a vector with M K elements, the results from Equations (8) and (12) apply, giving a localised linear Gaussian regression algorithm. The choice of local parametric model is essentially arbitrary. But regular features are an obvious choice. A scalar polynomial model of order K − 1 arises from ξ km ( x n ) = ( x n − c m ) k − 1 . Local linear regression in a K - dimensional input space takes the form ξ km ( x n ) = x nk − c m . We will adopt this latter model in the remainder, because it is also the most widely used model in LWR. An illustration of LWR regression and the proposed local Gaussian regression is given in Figure 2. Since it will become necessary to prune out unnec- essary parts of the model, we adopt the classic idea of automatic relevance determination [ 12 , 13 ] using a factorizing prior p ( w S A ) = M M m = 1 N ( w m ; 0 , A − 1 m ) with (14) A m = diag ( α m 1 , . . . , α mK ) . (15) So every component k of each local model m has its own precision, and can thus be pruned out by setting α mk _ ∞ . Section 3.1 assumes a fixed number M of lo- cal models with fixed centers c m . Thus the parameters are θ = { β, { α mk } , { λ mk }} . We propose an approxi- mation for estimating these parameters. Section 3.2 then describes placing the local models incrementally to adapt M and c m . 3.1 Learning in Local Bayesian Linear Regression Exact inference in Gaussian regression with localised feature functions comes with the cubic cost of its non- local origins. However, because of the localised feature functions, correlation between far away local models is approximately 0, hence inference is approximately inde- pendent between local models. In this section we aim to make use of this “almost independence” to derive a localised approximate inference scheme of low cost, similar in spirit to LWR. To arrive at this localised learning algorithm we first introduce a latent variable f nm for each local model m and data point x n , simi- lar to probabilistic backfitting [ 14 ]. Intuitively, the f form approximate targets, one for each local model, against which the local regression parameters fit (see also Figure 1, middle). This modified model motivates a factorizing variational bound constructed in Section 3.1.1, rendering the the Franziska Meier 1 , Philipp Hennig 2 , Stefan Schaal 1 , 2 local models computationally independent, which al- lows for fast approximate inference in the local Gaus- sian model. Hyperparameters will be learnt by ap- proximate maximum likelihood (3.1.2), i.e. iterating between constructing a bound q ( z S θ ) on the posterior over variables z given current parameter estimates θ and optimising q with respect to θ . 3.1.1 Variational Bound The complete data likelihood of the modified model is p ( y , f , w S Φ , θ ) = N M n = 1 p ( y n S f n , β y ) (16) N M n = 1 M M m = 1 p ( f nm S φ n m w m , β f m ) M M m = 1 p ( w m S A m ) (c.f. Figure 1, left). Our Gaussian model involves the latent variables w and f , the precision β and the model parameters λ m , c m . We treat w and f as probabilis- tic variables and estimate θ = { β, λ , c } . On w , f , we construct a variational bound q ( w , f ) imposing fac- torisation q ( w , f ) = q ( w ) q ( f ) . The variational free energy is a lower bound on the log evidence for the observations y : log p ( y S θ ) ≥ S q ( w , f ) log p ( y , w , f S θ ) q ( w , f ) . (17) This bound is maximized by the q ( w , f ) minimizing the relative entropy D KL [ q ( w , f )Y p ( w , f S y , θ )] , the distri- bution for which log q ( w ) = E f [ log p ( y S f , w ) p ( w , f )] and log q ( f ) = E w [ log p ( y S f , w ) p ( w , f )] . It is rela- tively straightforward to show (e.g. [ 15 ]) that these distributions are Gaussian in both w and f .The ap- proximation on w is log q ( w ) = E f [ log p ( f n S φ ( x n ) , w ) + log p ( w S A )] = log M M m = 1 N ( w m ; μ w m , Σ w m ) (18) where Σ w m = β f m N Q n = 1 φ n m φ n m T + A m − 1 ∈ R K × K (19) μ w m = β f m Σ w m N Q n = 1 φ n m E [ f nm ] ∈ R K × 1 (20) The posterior update equations for the weights are local: each of the local models updates its parameters independently. This comes at the cost of having to update the belief over the variables f nm , which achieves a coupling between the local models. The Gaussian variational bound on f is log q ( f n ) = E w [ log p ( y n S f n , β y ) + log p ( f n S φ n m , w )] = N ( f n ; μ f n , Σ f ) (21) where Σ f = B − 1 − B − 1 1 ( β − 1 y + 1 T B − 1 1 ) − 1 1 T B − 1 = B − 1 − B − 1 11 T B − 1 β − 1 y + 1 T B − 1 1 (22) μ f n = M Q m = 1 E w w T m φ ( x n ) + (23) 1 β − 1 y + 1 ⊺ B − 1 1 B − 1 1 y n − M Q m = 1 E [ w m ] T φ n m where μ f n ∈ R M and using B = diag ( β f 1 , . . . , β f M ) . These updates can be performed in O ( M K ) . Inspect- ing Equations (23) and (20), one can see that the optimal assignment of all μ w actually amounts to a joint linear problem of size M K × M K , which could also be solved as a linear program. However, this would come at additional computational cost. 3.1.2 Optimizing Hyperparameters We set the model parameters θ = { β y , { β f m , λ m } M m = 1 , { α mk }} to maximize the ex- pected complete log likelihood under the variational bound, E f , w [ log p ( y , f , w S Φ , θ )] = (24) E f , w N Q n = 1 log N y n ; M Q m = 1 f nm , β − 1 y + M Q m = 1 log N ( f nm ; w T m φ n m , β − 1 f m ) + M Q m = 1 log N ( w m ; 0 , A − 1 m )¡ Setting the gradient of this expression to zero leads to the following update equations for the variances β − 1 y = 1 N N Q n = 1 ( y n − 1 μ f n ) 2 + 1 T Σ f 1 (25) β − 1 f m = 1 N N Q n = 1 ( μ f nm − μ w m φ n m ) 2 + φ n m T Σ w m φ n m + σ 2 f m (26) α − 1 mk = μ 2 w mk + Σ w,kk (27) The gradient with respect to the scales of each local model is completely localized ∂ E f , w [ log p ( y , f , w S Φ , θ )] ∂ log λ mk = ∂ E f , w ∑ N n = 1 N ( f nm ; w T m φ m ( x n ) , β − 1 f m ) ∂λ mk (28) We use gradient ascent to optimize the length scales λ mk . All necessary equations are of low cost and, with Local Gaussian Regression the exception of the variance 1 β y , all hyper-parameter updates are solved independently for each local model, similar to LWR. In contrast to LWR, however, these local updates do not cause a shrinking in the length scales: In LWR, both inputs and outputs are weighted by the localizing function, thus reducing the length scale improves the fit. The localization of Equation (13) only affects the influence of regression model m , but the targets still need to be fit accordingly. Shrinking of local models only happens if it actually improves the fit against the unweighted targets f nm . 3.1.3 Prediction Prediction at a test point x ∗ arise from marginalizing over both w and f , using S N ( y ∗ ; 1 T f ∗ , β − 1 y ) N ( f ∗ ; W T φ ( x ∗ ) , B − 1 ) d f ∗ = N ( y ∗ ; Q m w T m φ ∗ m , β − 1 y + 1 T B − 1 1 ) (29) and S N ( y ∗ ; w T φ ∗ , β − 1 y + 1 T B − 1 1 ) N ( w ; μ w , Σ w ) d w = N y ∗ ; M Q m μ T w m φ ∗ m , σ 2 ( x ∗ ) (30) where σ 2 ( x ∗ ) = β − 1 y + M Q m = 1 β − 1 f m + M Q m = 1 φ ∗ m T Σ w m φ ∗ m (31) which is linear in M and K . 3.2 Incremental Adding of Local Models Up to here, the number M and locations c of local models were fixed. An extension analogous to the in- cremental learning of the relevance vector machine [ 16 ] can be used to iteratively add local models at new, greedily selected locations c M + 1 . The resulting algo- rithm starts with one local model, and per iteration adds one local model in the variational step. Con- versely, existing local models for which all components α mk _ ∞ are pruned out. This works well in practice, with the caveat that the number local models M can grow fast initially before the pruning becomes effective. Thus we check for each selected location c M + 1 whether any of the existing local models c 1 ∶ M produces a localiz- ing weight η m ( c M + 1 ) ≥ w gen , where w gen is a parameter between 0 and 1 and regulates how many parameters are added. An overview of the final algorithm is given in 1. Algorithm 1 Incremental LGR Require: { x n , y n } N n = 1 ~~ initialize first local model 1: M = 1; c 1 = x 1 ; C = { c 1 } ~~ iterate over all data points 2: for n = 2 . . . N do 3: if η m ( x n ) < w gen , ∀ m = 1 , . . . , M then 4: c m ^ x n 5: C ^ C ∪ { c m } , M = M + 1 6: end if ~~ E -Step: equations (19) , (20) , (22) , (23) 7: { μ w m , Σ w m , μ f m , Σ f m } m ^ 8: e-step ({ β y , β f m , { α mk , λ mk } k } m ) ~~ M -Step: equations (25) , (26) , (27) , (28) 9: { β y , β f m , { α mk , λ mk } k } m ^ 10: m-step ({ μ w m , Σ w m , μ f m , Σ f m } m ) ~~ pruning 11: for all m do 12: if α mk > 1 e 3 ∀ k = 1 , . . . , K then 13: M ^ M − 1; C ^ C ∖ c m 14: end if 15: end for 16: end for 4 Extension to Finitely Many Local Nonparametric Models An interesting component of local Gaussian regres- sion is that it easily extends to a model with finitely many local nonparametric, Gaussian process models. Marginalising out the weights w m in Eq. (17) gives the marginal p ( y , f S X, θ ) = N ( y ; 1 ⊺ f , β − 1 y I N ) M M m = 1 N ( f m ; 0 , C m ) (32) where p ( f m ) = N ( f m ; 0 , C m ) is a Gaussian pro- cess prior over function f m with a (degenerate) co- variance function C m = β − 1 f m I N + φ T m A − 1 m φ m . Re- placing the finitely many local features ξ m with in- finitely many features results in the local nonpara- metric model with covariance function κ m ( x, x ′ ) = β − 1 f m δ ij + η m ( x ) ˆ κ m ( x, x ′ ) η m ( x ′ ) . The exact posterior over f x ∶ = f ( x ) is p ( f x S y ) = N ( f x ; μ x , Σ xx ) (33) with Σ mr xx ′ = cov ( f m ( x ) , f r ( x ′ )) (34) = δ mr k m xx ′ − k m xX ( M Q s k s XX + β − 1 y I ) − 1 k r Xx ′ (35) μ m x = k m xX ( M Q s k s XX + β − 1 y I ) − 1 y (36) Franziska Meier 1 , Philipp Hennig 2 , Stefan Schaal 1 , 2 − 1 0 1 − 1 0 1 0 1 (a) − 2 − 1 0 1 2 − 1 0 1 (b) − 2 − 1 0 1 2 − 1 0 1 (c) − 2 − 1 0 1 2 − 1 0 1 (d) Figure 3: (a) 2D cross function, local models learnt by (b) LWPR, (c) LGR and (d) ME . Table 1: Predictive performance on Cross Function nMSE w/o LSL opt w gen # of LMs nMSE with LSL opt w gen # of LMs LWPR 0.234 0.2 13 0.0365 0.1 45.5 GLR 0 . 069 1.0 23.8 0 . 0137 0.9 23.3 ME 0.169 0.5 21.2 0.0313 0.4 62.2 So computing the posterior covariance Σ mr xx ′ requires one inversion of an N × N matrix. It remains to be seen to what extend variational bounds on the parametric forms presented in the previous sections can be trans- ported to a local nonparametric Gaussian regression framework. 5 Experiments We evaluate and compare to mixture of experts and locally weighted projection regression (LWPR) – an extension of LWR suitable for regression in high di- mensional space [ 17 ] – on two data sets. The first is data drawn from the “cross function” in Figure 3, often used to demonstrate locally weighted learning. For the second comparison we learn inverse dynamics of a SARCOS anthropomorphic robot arm with seven degrees of freedom [8]. In order to compare to mixture of experts we assume lin- ear expert models. A prior of the form N ( w m ; 0 , A − 1 m ) on each experts regression parameters and normalized gaussian kernels for the mixture components are used in our implementation to make it as comparable as possible. To compute the posterior in the e -step a mean field approximation is employed. In both experiments, LWPR performed multiple cy- cles through the data sets. Both the local Gaussian regression and the mixture of experts implementation are executed based on Algorithm 1 and are allowed an additional 1000 iterations to reach convergence. 5.1 Data from the ‘Cross Function’ We used 2 , 000 uniformly distributed training inputs, with zero mean Gaussian noise of variance ( 0 . 2 ) 2 . The test set is a regular grid of 1641 edges without noise and is used to evaluate how well the underlying function is captured. The initial length scale was set to λ m = 0 . 3 , ∀ m , and we ran each method with and without lengthscale learning (LSL) for w gen = 0 . 1 , 0 . 2 , . . . , 1 . 0. All results presented here are results averaged over 5 randomly seeded runs. Table 1 presents the top per- formance achieved by each method, including what the optimal setting for w gen was and how many local models were used. In both settings (with or without LSL) LGR outperforms both LWPR and ME in terms of accuracy as well as number of local models used. To understand the role of parameter w gen better, we also summarize the normalized mean squared error as a function of parameter w gen for all 3 methods in Figure 4 with (right) and without (left) LSL. The key message of this graph is that the parameter w gen does not affect the performance of LGR greatly. While accuracy slightly improves with increasing w gen it is not dramatic. Thus for LGR w gen can be thought of as a trade-off parameter, for smaller w gen the algo- rithm has to consider less potential local models, for a slight performance decrease. For LWPR and ME this relationship to w gen is not that clear. Furthermore, although LGR has to consider more local models for larger w gen (up to 2000 for w gen = 1), we only see a slight increase in the number of local models in the final model, indicating that the pruning mechanism works very well. Local Gaussian Regression 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1 0 50 100 150 200 w gen M 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1 0 50 100 150 200 w gen 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1 0 0 . 2 0 . 4 0 . 6 w gen normalized MSE train LWPR test LWPR train ME test ME train LGR test LGR 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1 0 0 . 2 0 . 4 0 . 6 w gen Figure 4: normalized mean squared error (nMSE) on 2D cross data without (left) and with (right) lengthscale learning. Table 2: Predictive Performance on SARCOS task LWPR LGR ME nMSE (MSE) # of LM nMSE (MSE) # of LM nMSE (MSE) # of LM J1 0 . 045 ( 19 . 160 ) 419 0 . 015 ( 6 . 282 ) 366 0.071 (30.125) 249 J2 0 . 039 ( 8 . 938 ) 493 0 . 0195 ( 4 . 391 ) 361 0.0871 (19.564) 257 J3 0 . 034 ( 3 . 420 ) 483 0 . 014 ( 1 . 382 ) 359 0.074 (7.319) 243 J4 0 . 024 4 . 552 384 0 . 016 ( 3 . 118 ) 348 0.039 (7.348) 239 J5 0 . 064 ( 0 . 060 ) 514 0 . 035 ( 0 . 0336 ) 354 0.058 (0.0552) 438 J6 0 . 075 ( 0 . 221 ) 519 0 . 027 ( 0 . 0786 ) 359 0.124 (0.363) 234 J7 0 . 03 ( 0 . 203 ) 405 0 . 023 ( 0 . 1578 ) 358 0.0235 (0.1589) 383 Finally, we show representative results of the shape of learned local models for LWPR, LGR and ME In Figure 3, nicely illustrating the key difference between the three methods: In LWPR local models don’t know of each other and thus aim to find the best linear approximation to the function. In both ME and LGR, the local models know of each other and collaborate to fit the function. 5.2 Inverse Dynamics Learning Task The SARCOS data contains 44 , 484 training data points and 4 , 449 test data points. The 21 input variables represent joint positions, velocities and accelerations for the 7 joints. The task is to predict the 7 joint torques. In Table 2 we show the predictive perfor- mance of LWPR, ME and LGR when trained with lengthscale learning and with w gen = 0 . 3. LGR outper- forms LWPR and ME in terms of accuracy for almost all joints.However, the true advantage of LGR lies in the fact the number of hand tuned parameters is reduced to setting the learning rate for the gradient descent updates and setting the parameter w gen . 6 Conclusion We have taken a top-down approach to developing a probabilistic localised regression algorithm: We start with the generative model of Gaussian generalized lin- ear regression, which amounts to a fully connected graph and thus has cubic inference cost. To break down the computational cost of inference, we first in- troduce the idea of localised feature functions as local models, which can be extended to local nonparametric models. In as second step, we argue that because of the localisation these local models are approximately inde- pendent . We exploit that fact through a variational approximation that reduced computational complexity to local computations. Empirical evaluation suggests that LGR successfully addresses the problem of fitting hyperparameters inherent in locally weighted regres- sion. A final step left for future work is to re-formulate our algorithm into an incremental version that can deal with a continuous stream of incoming data. Franziska Meier 1 , Philipp Hennig 2 , Stefan Schaal 1 , 2 References [1] Stefan Schaal and Christopher G Atkeson. Construc- tive incremental learning from only local information. Neural Computation , 10(8):2047–2084, 1998. [2] L ́ eon Bottou and Vladimir Vapnik. Local learning algorithms. Neural computation , 4(6):888–900, 1992. [3] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation , 3(1):79–87, 1991. [4] Steve Waterhouse, David MacKay, and Tony Robinson. Bayesian methods for mixtures of experts. Advances in neural information processing systems , pages 351–357, 1996. [5] Lauren Hannah, David M Blei, and Warren B Powell. Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research , 12:1923–1953, 2011. [6] Jo-Anne Ting, Mrinal Kalakrishnan, Sethu Vijayaku- mar, and Stefan Schaal. Bayesian Kernel Shaping for Learning Control. In Neural information processing systems , 2008. [7] Narayanan U Edakunni, Stefan Schaal, and Sethu Vijayakumar. Kernel carpentry for online regression using randomly varying coefficient model. In Proceed- ings of the international joint conference on artificial intelligence (IJCAI) , 2007. [8] C.E. Rasmussen and C.K.I. Williams. Gaussian Pro- cesses for Machine Learning . MIT Press, 2006. [9] C.A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research , 7:2651–2667, 2006. [10] M. N Gibbs. Bayesian Gaussian processes for re- gression and classification . PhD thesis, University of Cambridge, 1997. [11] C.E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in neural information processing systems . MIT Press, 2002. [12] R.M. Neal. Bayesian learning for neural networks . Springer Verlag, 1996. [13] M.E. Tipping. Sparse Bayesian learning and the rele- vance vector machine. The Journal of Machine Learn- ing Research , 1:211–244, 2001. [14] Aaron D’Souza, Sethu Vijayakumar, and Stefan Schaal. The bayesian backfitting relevance vector machine. In Proceedings of the twenty-first international conference on Machine learning , page 31. ACM, 2004. [15] M.K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence & Statistics (AISTATS) , volume 12, 2009. [16] Joaquin Quinonero-Candela and Ole Winther. Incre- mental gaussian processes. In Advances in neural in- formation processing systems , pages 1001–1008, 2002. [17] Sethu Vijayakumar and Stefan Schaal. Locally weighted projection regression: An o (n) algorithm for incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000) , vol- ume 1, pages 288–293, 2000.