BayesOpt: A Library for Bayesian Optimization with Robotics Applications

Ruben Martinez-Cantin
Centro Universitario de la Defensa, Zaragoza
Zaragoza, 50090, Spain
Email: rmcantin@unizar.es

Abstract—The purpose of this paper is twofold. On the one hand, we present a general framework for Bayesian optimization and we compare it with related fields in active learning and Bayesian numerical analysis. On the other hand, Bayesian optimization and related problems (bandits, sequential experimental design) are highly dependent on the surrogate model that is selected. However, there is no clear standard in the literature. Thus, we present a fast and flexible toolbox that allows one to test and combine different models and criteria with little effort. It includes most of the state-of-the-art contributions, algorithms and models. Its speed also removes part of the stigma that Bayesian optimization methods are only good for "expensive functions". The software is free and it can be used on many operating systems and from different computer languages.

I. INTRODUCTION

Many problems in engineering, computer science, robotics, etc., require finding the extremum of a real-valued function. In many cases, those functions do not have a closed-form expression, they might be multimodal (where some of the local extrema might have a bad outcome compared to the global extremum), or their evaluation might be costly.

Global optimization is a special case of non-convex optimization where we want to find the global extremum of a real-valued function, the target function. The search is done by pointwise evaluation of the target function. The objective of a global optimization algorithm is to find the sequence of points

\[
x_n \in A \subset \mathbb{R}^m, \qquad n = 1, 2, \ldots \tag{1}
\]

which converges to the point $x^*$, that is, the extremum of the target function, when $n$ is large. The algorithm should be able to find that sequence at least for all functions from a given family.

As explained in [26], this search procedure is a sequential decision making problem, where the point at step $n+1$ is chosen by a decision $d_n$ that considers all previous data:

\[
x_{n+1} = d_n(x_{1:n}, y_{1:n}) \tag{2}
\]

where $y_i = f(x_i) + \epsilon_i$. For simplicity, many works assume $\epsilon_i = 0$, that is, function evaluations are deterministic. However, we can easily extend the description to include stochastic functions (e.g., homoscedastic noise $\epsilon_i \sim \mathcal{N}(0, \sigma)$).

The search method is the sequence of decisions $d = d_0, \ldots, d_{n-1}$, which leads to the final decision $x_n = x_n(d)$. In most applications, the objective is to optimize the response of the final decision. Then, the criterion relies on the optimality error or optimality gap, which can be expressed as

\[
\delta_n(f, d) = f(x_n) - f(x^*) \tag{3}
\]

In other applications, the objective may require converging to $x^*$ in the input space. Then, we can use, for example, the Euclidean distance error

\[
\delta_n(f, d) = \|x_n - x^*\|_2 \tag{4}
\]

Equations (3) and (4) can also be interpreted as variants of the loss function for the decision at each step. Thus, the optimal decision is the one that minimizes the loss function:

\[
d_n = \arg\min_d \, \delta_n(f, d) \tag{5}
\]

This requires full knowledge of the function $f$, which is unavailable. Instead, let us assume that the target function $f = f(x)$ belongs to a family of functions $f \in F$, e.g., continuous functions on $\mathbb{R}^m$. Let us also assume that the function can be represented as a sample from a probability distribution over functions, $f \sim P(f)$.
Then, the best-response case analysis for the search process is defined as the decision that optimizes the expectation of the loss function:

\[
d_n^{BR} = \arg\min_d \mathbb{E}_{P(f)}\left[\delta_n(f, d)\right] = \arg\min_d \int_F \delta_n(f, d)\, dP(f) \tag{6}
\]

where $P$ is a prior distribution over functions.

However, we can improve on equation (6) by considering that, at decision $d_n$, we have already observed the actual response of the function at $n-1$ points, $\{x_{1:n-1}, y_{1:n-1}\}$. Thus, the prior information about the function can be updated with the observations and the Bayes rule:

\[
P(f \mid x_{1:n-1}, y_{1:n-1}) = \frac{P(x_{1:n-1}, y_{1:n-1} \mid f)\, P(f)}{P(x_{1:n-1}, y_{1:n-1})} \tag{7}
\]

In fact, we can rewrite the equation to represent the updates sequentially:

\[
P(f \mid x_{1:i}, y_{1:i}) = \frac{P(x_i, y_i \mid f)\, P(f \mid x_{1:i-1}, y_{1:i-1})}{P(x_i, y_i)}, \qquad \forall i = 1, \ldots, n-1. \tag{8}
\]

Thus, equation (6) can be rewritten as

\[
d_n^{BO} = \arg\min_d \mathbb{E}_{P(f \mid x_{1:n-1}, y_{1:n-1})}\left[\delta_n(f, d)\right] = \arg\min_d \int_F \delta_n(f, d)\, dP(f \mid x_{1:n-1}, y_{1:n-1}) \tag{9}
\]

Equation (9) is the root of Bayesian optimization, where the Bayesian part comes from the fact that we compute the expectation with respect to the posterior distribution, also called the belief, over functions. Therefore, Bayesian optimization is a memory-based optimization algorithm.

As commented before, although most of the theory of Bayesian optimization deals with deterministic functions, we also consider stochastic functions; that is, we assume there might be a random error in the function output. In fact, evaluations can produce different outputs if repeated. In that case, the target function is the expected output. Furthermore, in a recent paper, [8] showed that, even for deterministic functions, it is better to assume a certain error in the observations. The main reason is that, in practice, there might be some mismodelling errors which can lead to instability of the recursion if neglected.

II. RELATED FIELDS

Equation (9) is, in fact, much more general than Bayesian optimization. There are many fields of research that combine optimal decision making over the expectation of an unknown function which is recursively learned.

A. Active and interactive learning

Active learning considers the situation where data points can be selected for labeling during the training process. The rationale behind this is that data points might be expensive to label, bad data points might introduce an important bias, etc. The analogy with Bayesian optimization is clear. In fact, one can rethink Bayesian optimization as an active learning problem where the unknown parameter is the extremum of the function, i.e., $\Theta = x^*$. In this setup, equation (3) corresponds to the prediction bias and equation (4) corresponds to the parameter variance. In fact, the first applications of Bayesian optimization in fields like robotics or computer graphics design are all related to problems of active learning, like [24], or interactive learning [3]. See Section VI for a detailed description.

B. Experimental design

The field of Bayesian optimization shares important pieces of knowledge with the sequential experimental design field, especially the Design and Analysis of Computer Experiments (DACE), pioneered by [31] and beautifully reviewed in [32].
The main difference resides in the optimum point $x^*$, which in the case of DACE is no longer the extremum of the function $f$, but the point that provides the most information to reconstruct the target function $f$. The reader should also note the analogy with active learning, where the target function $f$ corresponds to the regression function or classifier that the system is learning, while sequentially querying the most informative point $x^*$.

For sequential experimental design, equation (3) is replaced by a different loss function. For example, if we consider that the function can be represented in a (potentially infinite) parametric form $f(x) = g(\psi, x)$, then we can define the loss function in terms of the parameters:

A-optimality (error minimization):
\[
\delta^{AO}(f, d) = (\psi(f) - \hat{\psi})^T \Sigma_\psi (\psi(f) - \hat{\psi}) \tag{10}
\]

D-optimality (entropy minimization):
\[
\delta^{DO}(f, d) = H(\psi \mid f, x_n) \tag{11}
\]

I-optimality (prediction error minimization):
\[
\delta^{IO}(f, d) = (f - \hat{f})^T \Sigma_f (f - \hat{f}) \tag{12}
\]

There are related criteria (E-optimality, C-optimality, etc.) which can easily be mapped to a loss function. For a complete description of the different criteria, see [1]. Thus, the only difference between code for experimental design and code for Bayesian optimization is the criterion used.

C. Reinforcement learning

The field of reinforcement learning [17] is also closely connected to Bayesian optimization.

1) Multi-armed bandits: The simplest known reinforcement learning problem is the multi-armed bandit problem [36]. In the bandit problem, the response function is a reward or cost function, for example, the cost of manufacturing a product in a certain way. The algorithm must find the lowest production cost, considering that changing the process may result in a more expensive manufacturing cost. The main difference with the optimization problem is that intermediate evaluations of the function may incur extra cost. Thus, the performance measure is no longer the optimality error. Instead, equation (3) is called the instantaneous regret $r_n$ in the reinforcement learning literature. The target of a bandit algorithm is to minimize the accumulated regret $R_N = \sum_{n=1}^{N} r_n$ or the average regret $R_N / N$. Thus, (3) can be replaced by

\[
\delta^{SB}(f, d) = \frac{1}{N}\sum_{n=1}^{N} \left[ f(x_n(d)) - f(x^*) \right] \tag{13}
\]

In this setup, an algorithm with no regret, that is, $\lim_{N \to \infty} R_N / N = 0$, is guaranteed to converge to the optimum. In fact, as cleverly found by [36], the average regret $R_N / N$ can be used to provide convergence rates of the optimization algorithm.

Typically, in the multi-armed bandit setup, each input has an associated probability distribution that delivers the reward or cost values. Thus, several evaluations of the same input result in different outcomes. The analogous case would be the optimization of a stochastic function. Finally, Bull [5] suggests $\epsilon$-greedy, a classical bandit algorithm, to improve the convergence of Bayesian optimization.

2) Partially-observable Markov Decision Processes (POMDPs): As pointed out in [39], Bayesian optimization, as a sequential decision making problem, has a direct connection to POMDPs. In fact, bandits can be modeled as a single-state reinforcement learning problem, for example, a POMDP with immediate reward. An independent connection between Bayesian optimization and POMDPs appears through a specific methodology to solve POMDPs based on policy search. In policy search, it is assumed that the policy space can be expressed as a parametric function.
In an abuse of notation, let us name the policy parameters $\pi$. The parametrization inherently encodes the dynamics of the world (priors, transitions, etc.). Then, the reinforcement learning problem becomes a static optimization problem:

\[
\pi^* = \arg\max_\pi \mathbb{E}\left(R(\pi)\right) \tag{14}
\]

which is equivalent to (6) after replacing max by min and the reward function by the loss function.

D. Bayesian numerical analysis

The seminal works by [6] and [27] introduce the field of Bayesian numerical analysis. The main idea is to solve a complex analysis problem (for example, interpolation, regression, integral evaluation, etc.) by following a simple methodology:
1) Put a prior on the family of functions (e.g., continuous) on the working domain.
2) Compute the response of the function $f$ at a set of sample points $x_1, x_2, \ldots, x_n$.
3) Compute the posterior.
4) Solve the original problem using the Bayes rule.

[6] presents a set of beautiful examples where the previous algorithm turns out to be a powerful machinery. [27] goes one step further and formalizes the previous methodology in a single, albeit very general, model, showing its applicability to several analysis problems, including optimization. In fact, all the previous fields, including Bayesian optimization itself, can be seen as particular cases of Bayesian numerical analysis. The models presented in this paper are partially based on the formulation of [27].

III. BAYESIAN OPTIMIZATION GENERAL MODEL

In order to simplify the description, we are going to use a special case of the Bayesian optimization model defined previously, which corresponds to the most common application. In subsequent sections we will introduce some generalizations for different applications.

Without loss of generality, consider the problem of finding the minimum of an unknown real-valued function $f : \mathbb{X} \to \mathbb{R}$, where $\mathbb{X}$ is a compact space, $\mathbb{X} \subset \mathbb{R}^d$, $d \geq 1$. Let $P(f)$ be a prior distribution over functions represented as a stochastic process, for example, a Gaussian process $\xi(\cdot)$, with inputs $x \in \mathbb{X}$ and an associated kernel or covariance function $k(\cdot, \cdot)$. Let us also assume that the target function is a sample of the stochastic process, $f \sim \xi(\cdot)$.

In order to find the minimum, the algorithm has a maximum budget of $N$ evaluations of the target function $f$. The purpose of the algorithm is to find optimal decisions that provide a better performance at the end, according to equation (9).

One advantage of using Gaussian processes as prior distributions over functions is that new observations of the target function $(x_i, y_i)$ can easily be used to update the distribution over functions. Furthermore, the posterior distribution is also a Gaussian process $\xi_i = [\xi(\cdot) \mid x_{1:i}, y_{1:i}]$. Therefore, the posterior can be used as an informative prior for the next iteration in a recursive algorithm.

In a more general setting, many authors have suggested modifying the standard zero-mean Gaussian process with different variations that include semi-parametric models [12, 10, 16, 27], hyperpriors on the parameters [25, 4, 11], Student-t processes [9, 31, 43], etc. We use a generalized linear model of the form

\[
f(x) = \phi(x)^T \mathbf{w} + \epsilon(x) \tag{15}
\]

where

\[
\epsilon(x) \sim \mathcal{NP}\left(0, \sigma_s^2 \left(K(\theta) + \sigma_n^2 I\right)\right) \tag{16}
\]

The term $\mathcal{NP}$ denotes a non-parametric process, which can refer to a Gaussian process $\mathcal{GP}$ or a Student-t process $\mathcal{TP}$. In both cases, $\sigma_n^2$ is the observation noise variance, sometimes called the nugget, and it is problem specific.
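To make the role of these terms concrete, consider the standard zero-mean Gaussian process special case (no parametric mean, known hyperparameters). Conditioning on the data $\{x_{1:n}, y_{1:n}\}$ then gives the familiar closed-form predictive distribution of the noise-free function value at a query point $x$. This is standard Gaussian process regression (see, e.g., [30]) and is shown here only as a reference, with the nugget $\sigma_n^2$ entering through the regularized kernel matrix:

\[
\begin{aligned}
\mu_n(x) &= \mathbf{k}(x)^T \left(K(\theta) + \sigma_n^2 I\right)^{-1} \mathbf{y}_{1:n}, \\
s_n^2(x) &= \sigma_s^2 \left( k(x, x) - \mathbf{k}(x)^T \left(K(\theta) + \sigma_n^2 I\right)^{-1} \mathbf{k}(x) \right),
\end{aligned}
\]

where $\mathbf{k}(x) = [k(x, x_1), \ldots, k(x, x_n)]^T$. In the full model of equations (15) and (16), the mean parameters $\mathbf{w}$ and the signal variance $\sigma_s^2$ are additionally estimated or integrated out, as described in the next section.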
Many authors decide to fix the nugget to $\sigma_n^2 = 0$ when the function $f(x)$ is deterministic, for example, a computer simulation. However, as cleverly pointed out in [8], there might be more reasons to include this term apart from the observation noise, for example, to account for model inaccuracies.

This model has been presented in different ways depending on the field where it was used:
• As a generalized linear model $\phi(x)^T \mathbf{w}$ with heteroscedastic perturbation $\epsilon(x)$.
• As a nonparametric process of the form $\mathcal{NP}\left(\phi(x)^T \mathbf{w}, \sigma_s^2 (K(\theta) + \sigma_n^2 I)\right)$.
• As a semiparametric model $f(x) = f_{par}(x) + f_{nonpar}(x) = \phi(x)^T \mathbf{w} + \mathcal{NP}(\cdot)$.

IV. COMPUTING THE HYPERPARAMETERS

Based on the model defined previously, we have to consider four sets of hyperparameters: the kernel hyperparameters $\theta$, the mean function parameters $\mathbf{w}$, the signal variance $\sigma_s^2$ and the noise variance $\sigma_n^2$.

A fully Bayesian approach does not admit a closed-form solution for most kernel functions. In the Bayesian optimization literature, the standard and most effective approach is to rely on an empirical Bayes approach, where some parameters are approximated by a point value (e.g., $\theta$) or directly fixed a priori (e.g., $\sigma_n^2$), while other parameters are actually computed in closed form (e.g., $(\mathbf{w}, \sigma_s^2)$).

Indeed, we can assign conjugate hyperpriors to some parameters of the model, that is, $\mathbf{w}$ and $\sigma_s^2$, and still obtain a closed-form posterior. For example, let us assume that the prior can be decomposed as $p(\mathbf{w}, \sigma_s^2) = p(\mathbf{w} \mid \sigma_s^2)\, p(\sigma_s^2)$, for instance as a normal inverse-gamma, $p(\mathbf{w}, \sigma_s^2) \sim \mathcal{N}(\mathbf{w}_0, \sigma_s^2 \mathbf{W})\, \mathcal{IG}(\alpha, \beta)$; a normal scaled-inverse-chi-squared, $p(\mathbf{w}, \sigma_s^2) \sim \mathcal{N}(\mathbf{w}_0, \sigma_s^2 \mathbf{W})\, \chi^{-2}(\nu, \sigma_0^2)$; or a Jeffreys prior, $p(\mathbf{w}, \sigma_s^2) \propto 1 \cdot \sigma_s^{-2}$.

In fact, the first and second options are different parametrizations of the same model, and the Jeffreys prior can be considered as the limit case. The Jeffreys prior [14] is an uninformative prior which is invariant to reparametrization; for example, we can exchange the signal variance $\sigma_s^2$ for the signal precision $\lambda_s$ and the prior is the same. However, it can perform poorly for multidimensional parameters, for example, if there are many mean basis functions $\phi_i(x)$.

A. Learning the kernel parameters

For Bayesian optimization, learning the kernel parameters cannot be trivially implemented. As [40] proved, updating the value of $\theta$ during the optimization process introduces bias which might result in the optimization getting stuck in a local minimum. To avoid this problem, we learn the parameters $\theta$ during a preliminary stage, with a small number of data points, and then freeze them or update them infrequently.

We consider that the hyperprior of the kernel hyperparameters $\theta$ (if available) is independent of the other variables. Depending on the model, the likelihood function will be a multivariate Gaussian distribution or a multivariate t distribution. Based on [32, 30], we consider the following algorithms to learn the kernel hyperparameters:
• Cross-validation: In this case, we try to maximize the average predicted log probability using the leave-one-out (LOO) strategy [30].
• Maximum likelihood: For any of the models presented, one approach to learn the hyperparameters is to maximize the likelihood of all the parameters $\mathbf{w}$, $\sigma_s^2$ and $\theta$. In this case, the likelihood function is a multivariate Gaussian distribution.
As presented in [32], we can obtain a better estimate if we adjust the number of degrees of freedom.
• Posterior maximum likelihood: In this case, the likelihood function is modified to consider the posterior estimate of $(\mathbf{w}, \sigma_s^2)$ based on the different cases defined in Section IV. The resulting function will be a multivariate Gaussian or t distribution, depending on the kind of prior used for $\sigma_s^2$.
• Maximum a posteriori: We can modify any of the previous algorithms by adding a prior distribution $p(\theta)$. As commented in [5, 40], applying maximum likelihood naively while doing optimization might be problematic. [42] suggest adding bounds to restrict the exploration of an overfitted or underfitted width parameter. However, this solution is restricted to using a box-bounded optimizer to learn $\theta$ and does not generalize to include extra information (if available) about the parameters. Adding a prior distribution is a more general and elegant solution.
• Sampling strategies: The applicability of those techniques might seem to go against the philosophy of Bayesian optimization, where we aim to reduce the number of data points. Nevertheless, some efforts have been made in this area, such as those presented in [35] or [18].

V. BAYESOPT TOOLBOX

We present a highly efficient Bayesian optimization toolbox. It is implemented in C++ and it is compatible with Windows, Mac OS and Linux. It also provides interfaces to C, Python and Matlab/Octave.

Fig. 1. Execution time (in seconds) for the BayesOpt library compared to an equivalent naive Matlab implementation based on GPML. Clearly, the use of C/C++ for the core library, combined with some code optimizations specially intended for iterative optimization, reduces the total computational cost. The target function was trivial, thus the plot can be considered purely the cost of the optimization code.

The library was influenced by the GPML toolbox by [29] and NLOPT by [15]. The toolbox is highly configurable. The user can select among different kernels, mean functions, models and optimization criteria.

A. Advantages of the library design

First, it is fast. The execution time of the library is shown in Figure 1, running on a standard laptop and a single process. The optimization problem was trivial, so the computational cost plotted can be considered purely that of the optimization process.

One of the most critical components is the computation of the inverse of the kernel matrix. We have compared different numerical tricks, like the incremental computation of the inverse matrix using blockwise inversion. We found that the Cholesky decomposition method outperforms any other method in terms of performance and numerical stability. Besides, it guarantees the numerical stability of the Gram matrix. Furthermore, we can exploit two properties of the method:
1) Points arrive one at a time. Thus, we can do incremental computations of the matrices and vectors. For example, at each iteration, we know that only $n$ new elements will appear in the correlation matrix (the correlations of the new point with each of the existing points); the rest of the matrix remains invariant. Thus, instead of computing the whole Cholesky decomposition, which is $O(n^3)$, we just add the new row of elements to the triangular matrix, which is $O(n^2)$.
2) Multiple queries from the same model. Computing the optimal decision requires evaluating many queries. However, all those queries rely on the same previous data.
Thus, we can factor out (precompute) all the operations that are independent of the query.

Second, it is flexible. The design relies on inheritance and polymorphism for all the components of the optimization process. It is very easy to create a new kernel, surrogate or criterion model by inheriting from the abstract model or from one of the existing models. Thus, the newly created model is fully integrated in the library. It also inherits most of the existing linear algebra optimizations and safety checks. This is specially important in the parts of the code that have been carefully designed to optimize performance or to guarantee a correct implementation by design. For example, the initial set of points is selected using Latin hypercube sampling. The kernel hyperparameters are not updated after every new data point, since that is known to introduce bias [5, 40].

Fig. 2. Inheritance diagram for the criteria class.

The library includes different kernel functions (Matern, Gaussian, isotropic, automatic relevance determination, etc.), different mean functions (constant, linear, radial, etc.), different surrogate models depending on the parameter hyperpriors (Gaussian process, Student-t process, etc.), and different criteria (expected improvement (EI), lower confidence bound (LCB), probability of improvement (POI), etc.). See Figure 2 for the inheritance tree of the implemented criteria.

One of the advantages of having such a flexible design is that it is easy to combine different options. For example, we have implemented some metacriteria algorithms, like GP-Hedge [11], which can be used to find the most relevant criterion online. Another metacriterion is the linear combination of multiple criteria, which can be used to implement an optimization criterion with movement penalties [22].

The library can be downloaded from https://bitbucket.org/rmcantin/bayesopt/downloads and the online documentation can be found at http://rmcantin.bitbucket.org/html/

The library internally uses NLOPT for the inner optimization loops (optimizing the criterion, learning the kernel hyperparameters, etc.) [15]. The interface is very similar, so NLOPT can also be used for comparison and benchmarking.

B. Compatibility

The toolbox has been designed to be highly compatible with many platforms and setups. It has already been tested and compiled on different operating systems (Windows, Debian/Ubuntu, Mac OS, etc.) and with different compilers (Visual Studio, gcc, clang, mingw, etc.). The core of the library is written in C++98 for compatibility with older compilers. It also provides interfaces and demos for C, Python and Matlab/Octave.

C. Demos

The code includes many demos for different languages, models (continuous, discrete) and test functions (Ackley, Michalewicz, Rosenbrock, etc.). It even includes some specific demos, such as multiprocessing computation or a simple computer vision application (image binarization).

D. Using the BayesOpt API

The API is compatible with several languages and programming paradigms. In general, the usage of the toolbox can be summarized in three steps:
1) Define the function to optimize.
2) Set or modify the parameters and models of the optimization process.
3) Run the optimizer.

Here we show a brief summary of the different ways to use the library:

1) C/C++ callback usage: This interface is the most standard approach; it can also be used as an interface or wrapper for other languages such as Fortran, Ada, etc.
The function to optimize must agree with the following template:

    double my_function(unsigned int n, const double *x,
                       double *gradient, void *func_data);

Then we call the optimizer for continuous or discrete optimization, passing the target function as a pointer. (The gradient argument has been included for future compatibility; in the current implementation, it is not used.)

2) C++ inheritance usage: This is the most straightforward and complete method to use the library. The object to be optimized must inherit from the bayesopt::ContinuousModel or bayesopt::DiscreteModel classes. For example:

    class MyOptimization: public bayesopt::ContinuousModel
    {
     public:
      MyOptimization(bopt_params param):
        ContinuousModel(param) {}

      double evaluateSample(const ublas::vector<double> &query)
      {
        // My function here
      };

      bool checkReachability(const ublas::vector<double> &query)
      {
        // My constraints here
      };

      int optimize(ublas::vector<double> &result);
    };

Then, we just need to override one of the virtual functions called evaluateSample, which can be called with C arrays and uBlas vectors. Since there are no pure virtual functions, you can just redefine your preferred interface. You can also override the checkReachability function to include nonlinear constraints. Again, the parameters are defined in the bopt_params struct.

3) Python callback/inheritance usage: Both interfaces are analogous to the C/C++ interfaces. In this case, the parameters are defined as a Python dictionary.

4) Matlab/Octave callback usage: Matlab/Octave only supports callbacks. The parameters are defined as a Matlab struct analogous to the parameters struct in the C/C++ interface.

E. BayesOpt parameters

The parameters are defined in the bopt_params struct. The easiest way to set the parameters is to use

    bopt_params parameters = initialize_parameters_to_default();

and then modify the necessary fields. The details of each parameter can be found in the included documentation. For example:

    par.kernel.name = "kSum(kSEISO,kConst)";  // Sum of kernels
    par.kernel.hp_mean = { 0.5, 0.5 };
    par.kernel.hp_std = { 1, 1 };             // Hyperprior on kernel
    par.kernel.n_hp = 2;
    // Surrogate hyperpriors (w, sigma_s)
    par.surr_name = S_STUDENT_T_PROCESS_JEFFREYS;
    // We combine expected improvement, lower confidence bound,
    // probability of improvement and Thompson sampling
    par.crit_name = "cHedge(cEI,cLCB,cPOI,cThompsonSampling)";
    par.crit_params = { 1, 1, 0.01 };         // Each criterion has
    par.n_crit_params = 3;                    // its own parameters
    par.l_type = L_ML;                        // Learning type
    par.n_iterations = 200;                   // Number of iterations
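Putting the inheritance interface and the parameter struct together, a complete minimal program might look like the sketch below. This is only an illustrative sketch assembled from the snippets above, not code taken verbatim from the library: the header name is an assumption, and the exact base-class constructor should be checked against the included demos and documentation (depending on the library version, it may also require the problem dimension or a bounding box).

    #include <boost/numeric/ublas/vector.hpp>
    #include "bayesoptcont.hpp"   // assumed header name; check the demos

    namespace ublas = boost::numeric::ublas;

    // Toy target: a quadratic bowl with its minimum at 0.3 in every dimension.
    class QuadraticExample: public bayesopt::ContinuousModel
    {
     public:
      QuadraticExample(bopt_params param):
        ContinuousModel(param) {}     // may also need dimension/bounds in practice

      double evaluateSample(const ublas::vector<double> &query)
      {
        double sum = 0.0;
        for (size_t i = 0; i < query.size(); ++i)
          sum += (query(i) - 0.3) * (query(i) - 0.3);
        return sum;
      }

      bool checkReachability(const ublas::vector<double> &query)
      { return true; }                // no additional constraints
    };

    int main()
    {
      bopt_params par = initialize_parameters_to_default();
      par.n_iterations = 100;         // evaluation budget

      QuadraticExample opt(par);
      ublas::vector<double> best(2);  // two-dimensional problem
      opt.optimize(best);             // 'best' holds the located minimum
      return 0;
    }

The callback interfaces in C, Python and Matlab/Octave follow the same three steps: define the target, set the parameters, and run the optimizer.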
VI. ROBOTICS APPLICATIONS

Bayesian optimization has already been applied to many robotics and related problems. It was first introduced in the field by [21] and, independently, by [24]. The work of Lizotte et al. [21] used Bayesian optimization for robot gait optimization, whereas Martinez-Cantin et al. [24, 25] introduced a policy search algorithm for robot planning that assumes expensive reward functions (see Figure 3). Based on related work from [32], the surrogate model in [24] included hyperpriors on the model parameters. This work was the original motivation to build the library. The idea of learning a controller and a planner was later combined and extended in [4] in a hierarchical fashion. Although it was originally intended for video games and simulation, its applicability to robotic platforms is direct.

As commented before, this topic is highly related to the work on sequential experimental design relying on Gaussian processes. In that field, one of the major breakthroughs was the discovery of mutual information as an efficient criterion for submodular optimization. Originally intended for sensor placement [19], it was later extended to robot planning (moving sensors) [33, 34]. In fact, sensor placement or selection was also addressed in a pure Bayesian optimization methodology, using a more classical RMS error measure, by Garnett et al. [7]. A related problem of robots as moving sensors was presented in [22], which introduced the cost of robot movement as an additional criterion to trade off exploration and exploitation.

Gait optimization in a more complex setup (a snake robot) was further analyzed in [37], where the expected improvement algorithm is extended with non-controllable variables as in [43]. Recently, the same authors have extended the work to deal with multiobjective functions [38].

Manipulation has also benefited from the capabilities of Bayesian optimization. Kroemer et al. [20] presented a reinforcement learning algorithm for grasping where Bayesian optimization is used as a proxy between the reward function of the grasp and the grasping parameters. The reward function is computed directly by executing the grasp and performing a set of movements to test the stability of the grip. For the inner loop of the Bayesian optimization, the authors used a gradient-based strategy. This kind of strategy has some advantages in reinforcement learning for robots [28]; however, it provides suboptimal performance due to local minima. Instead of relying on arbitrary movements to test the grasp reward, some authors have suggested a metric based on a simulated environment where the wrench space of the contact points can be computed [2]. This idea was explored in the Bayesian optimization setup in [23]. Later, it was tested in [41], which uses the library presented here.

Fig. 4. Bayesian optimization for grasping. The reward of the grasp is computed by the volume of the wrench space generated by the contact points. This provides a metric of the stability of the grasp, assuming that the friction of the fingers can be predicted.

Finally, as presented in a recent work [13], the parameters (either continuous, discrete or categorical) of any algorithm can be optimized using Bayesian optimization.

VII. CONCLUSION

We have presented a common framework for Bayesian nonlinear optimization, sequential experimental design, bandits and, in general, hyperparameter optimization for many applications. The framework has been implemented in an efficient and easy-to-use library compatible with many operating systems and computer languages. It includes most of the state-of-the-art contributions to the field, both in terms of core algorithms and application-oriented modifications.
As the implementation includes metamodels and metacriteria, the library is able to generalize to many situations with a small contribution from the user. On the other hand, for advanced users, it allows full customization to improve the performance in dedicated applications.

Fig. 3. This simulation shows three stages of the robot exploring an environment. The simulation includes landmarks that the robot does not know a priori. As soon as the robot observes these landmarks, it incorporates them into its model of the world. The robot continuously plans and replans so as to minimize the uncertainty in its pose and in the location of the known landmarks. The figure also shows the robot's limited field of view and the paths that it plans to follow at the three simulation stages.

REFERENCES

[1] Anthony C. Atkinson, Alexander N. Donev, and Randall D. Tobias. Optimum Experimental Design with SAS. Oxford University Press, 2007.
[2] Tim Baier and Jianwei Zhang. Learning to grasp everyday objects using reinforcement-learning with automatic value cut-off. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), 2007.
[3] Eric Brochu, Nando de Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, 2007.
[4] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. eprint arXiv:1012.2599, arXiv.org, December 2010.
[5] Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.
[6] Persi Diaconis. Bayesian numerical analysis. In S.S. Gupta and J.O. Berger, editors, Statistical Decision Theory and Related Topics IV, volume 1, pages 163–175, 1988.
[7] Roman Garnett, Michael A. Osborne, and Stephen J. Roberts. Bayesian optimization for sensor set selection. In Information Processing in Sensor Networks (IPSN 2010), 2010.
[8] Robert B. Gramacy and Herbert K.H. Lee. Cases for the nugget in modeling computer experiments. Statistics and Computing, 22:713–722, 2012. doi: 10.1007/s11222-010-9224-x. URL http://dx.doi.org/10.1007/s11222-010-9224-x.
[9] Robert B. Gramacy and Nicholas G. Polson. Particle learning of Gaussian process models for sequential design and optimization. Business, 20(1):18, 2009. URL http://arxiv.org/abs/0909.5262.
[10] Mark S. Handcock and Michael L. Stein. A Bayesian analysis of Kriging. Technometrics, 35(4):403–410, 1993.
[11] Matthew Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), 2011.
[12] D. Huang, T. T. Allen, W. I. Notz, and N. Zheng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3):441–466, 2006.
[13] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of Learning and Intelligent Optimization (LION-5), pages 507–523, 2011.
[14] Harold Jeffreys.
An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 186(1007):453–461, 1946.
[15] Steven G. Johnson. The NLopt nonlinear-optimization package, 2013. URL http://ab-initio.mit.edu/nlopt.
[16] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
[17] Leslie P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[18] Jack P.C. Kleijnen, Wim van Beers, and Inneke Van Nieuwenhuyse. Expected improvement in efficient global optimization through bootstrapped kriging. Journal of Global Optimization, 54(1):59–73, 2012.
[19] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235–284, 2008.
[20] Oliver Kroemer, Renaud Detry, Justus Piater, and Jan Peters. Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116, 2010. URL http://robot-learning.de/pmwiki/uploads/Publications/KroemerJRAS 6636[0].pdf.
[21] Daniel Lizotte, Tao Wang, Michael Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), pages 944–949, 2007.
[22] Roman Marchant and Fabio Ramos. Bayesian optimisation for intelligent environmental monitoring. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2242–2249, 2012.
[23] Ruben Martinez-Cantin. Efficient computation of stable grasps through Bayesian optimization. Technical report, ISR-Vislab, 2012.
[24] Ruben Martinez-Cantin, Nando de Freitas, Arnaud Doucet, and Jose A. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems, 2007.
[25] Ruben Martinez-Cantin, Nando de Freitas, Eric Brochu, Jose A. Castellanos, and Arnaud Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots - Special Issue on Robot Learning, Part B, 27(3):93–103, 2009.
[26] Jonas Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4:347–365, 1994.
[27] Anthony O'Hagan. Some Bayesian numerical analysis. Bayesian Statistics, 4:345–363, 1992.
[28] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE International Conference on Intelligent Robotics Systems (IROS 2006), 2006. URL http://www-clmc.usc.edu/publications/P/peters-IROS2006.pdf.
[29] Carl E. Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, 2010.
[30] Carl E. Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Massachusetts, 2006.
[31] Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
[32] Thomas J. Santner, Brian J. Williams, and William I. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
[33] Amarjeet Singh, Andreas Krause, Carlos Guestrin, and William Kaiser. Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research (JAIR), 34:707–755, 2009.
[34] Amarjeet Singh, Andreas Krause, and William Kaiser. Nonmyopic adaptive informative path planning for multiple robots. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2009.
[35] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian optimization of machine learning algorithms. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2960–2968, 2012.
[36] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. International Conference on Machine Learning (ICML), 2010.
[37] Matthew Tesch, Jeff Schneider, and Howie Choset. Adapting control policies for expensive systems to changing environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), September 2011.
[38] Matthew Tesch, Jeff Schneider, and Howie Choset. Expensive multiobjective optimization and validation with a robotics application. In Neural Information Processing Systems: Workshop on Bayesian Optimization & Decision Making, December 2012.
[39] Marc Toussaint. The Bayesian search game. In Yossi Borenstein and Alberto Moraglio, editors, Theory and Principled Methods for Designing Metaheuristics. Springer, in press.
[40] Emmanuel Vazquez and Julien Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088–3095, 2010.
[41] Filipe Veiga and Alexandre Bernardino. Active tactile exploration for grasping. In ICRA Workshop on Autonomous Learning, 2013.
[42] Ziyu Wang, Masrour Zoghi, David Matheson, Frank Hutter, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.
[43] Brian J. Williams, Thomas J. Santner, and William I. Notz. Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica, 10(4):1133–1152, 2000.