Limbo: A Fast and Flexible Library for Bayesian Optimization

Antoine Cully1, Konstantinos Chatzilygeroudis2,3,4, Federico Allocati2,3,4 and Jean-Baptiste Mouret2,3,4
1 Personal Robotics Lab, Imperial College London, London, UK
2 Inria Nancy - Grand Est, Villers-lès-Nancy, France
3 CNRS, Loria, UMR 7503, Vandœuvre-lès-Nancy, France
4 Université de Lorraine, Loria, UMR 7503, Vandœuvre-lès-Nancy, France
Correspondence to: jean-baptiste.mouret@inria.fr

Preprint – November 23, 2016

Limbo is an open-source C++11 library for Bayesian optimization which is designed to be both highly flexible and very fast. It can be used to optimize functions for which the gradient is unknown, evaluations are expensive, and runtime cost matters (e.g., on embedded systems or robots). Benchmarks on standard functions show that Limbo is about 2 times faster than BayesOpt (another C++ library) for a similar accuracy.

Introduction

Non-linear optimization problems are pervasive in machine learning. Bayesian Optimization (BO) is designed for the most challenging ones: when the gradient is unknown, evaluating a solution is costly, and evaluations are noisy. This is, for instance, the case when we want to find optimal parameters for a machine learning algorithm [Snoek et al., 2012], because testing a set of parameters can take hours, and because of the stochastic nature of many machine learning algorithms. Besides parameter tuning, Bayesian optimization recently attracted a lot of interest for direct policy search in robot learning [Lizotte et al., 2007, Wilson et al., 2014, Calandra et al., 2016] and online adaptation; for example, it was recently used to allow a legged robot to learn a new gait after mechanical damage in about 10-15 trials (2 minutes) [Cully et al., 2015].

At its core, Bayesian optimization builds a probabilistic model of the function to be optimized (the reward/performance/cost function) using the samples that have already been evaluated [Shahriari et al., 2016]; usually, this model is a Gaussian process [Williams and Rasmussen, 2006]. To select the next sample to be evaluated, Bayesian optimization optimizes an acquisition function which leverages the model to predict the most promising areas of the search space. Typically, this acquisition function is high in areas not yet explored by the algorithm (i.e., with a high uncertainty) and in those where the model predicts high-performing solutions. As a result, Bayesian optimization handles the exploration/exploitation trade-off by selecting samples that combine a good predicted value and a high uncertainty.

In spite of its strong mathematical foundations [Mockus, 2012], Bayesian optimization is more a template than a fully-specified algorithm. For any Bayesian optimization algorithm, the following components need to be chosen: (1) an initialization function (e.g., random sampling), (2) a model (e.g., a Gaussian process, which itself needs a kernel function and a mean function), (3) an acquisition function (e.g., Upper Confidence Bound, Expected Improvement, see Shahriari et al. 2016), (4) a global, non-linear optimizer for the acquisition function (e.g., CMA-ES, Hansen and Ostermeier 2001, or DIRECT, Jones et al. 1993), and (5) a non-linear optimizer to learn the hyper-parameters of the models (if the user chooses to learn them).
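As a concrete illustration of this trade-off (an example chosen here for clarity, not a formula prescribed by the paper), the Upper Confidence Bound (UCB) acquisition function mentioned above scores a candidate x by adding a multiple of the model's predictive uncertainty to its predicted value:

\[
\alpha_{\mathrm{UCB}}(\mathbf{x}) = \mu(\mathbf{x}) + \kappa\,\sigma(\mathbf{x}),
\]

where \(\mu(\mathbf{x})\) and \(\sigma(\mathbf{x})\) are the mean and standard deviation of the Gaussian process posterior at \(\mathbf{x}\), and \(\kappa \ge 0\) is a user-chosen constant: large values of \(\kappa\) favor exploration (sampling where uncertainty is high), while small values favor exploitation (sampling where the predicted value is high).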
Specific applications often require a specific choice of components, and most research articles focus on the introduction of a single component (e.g., a novel acquisition function or a novel kernel for Gaussian processes). This almost infinite number of variants of Bayesian optimization calls for flexible libraries in which components can easily be substituted with alternative ones (user-defined or predefined). In many applications, the run-time cost is negligible compared to the evaluation of a potential solution, but this is not the case in online adaptation for robots (e.g., Cully et al. 2015), in which the algorithm has to run on small embedded platforms (e.g., a cell phone), or in interactive applications [Brochu et al., 2010], in which the algorithm needs to quickly react to the inputs. To our knowledge, no open-source library combines a high-performance implementation of Bayesian optimization with the high flexibility that is needed for developing and deploying novel variants.

The Limbo library

Limbo (LIbrary for Model-based Bayesian Optimization) is an open-source (GPL-compatible CeCILL license) C++11 library which provides a modern implementation of Bayesian optimization that is both flexible and high-performing. It does not depend on any proprietary software (the main dependencies are Boost and Eigen3). The code is standard-compliant but it is currently mostly developed for GNU/Linux and Mac OS X with both the GCC and Clang compilers. The library is distributed via a GitHub repository (http://github.com/resibots/limbo), in which bugs and further developments are handled by the community of developers and users. Extensive documentation (http://www.resibots.eu/limbo) is available and contains guides, examples, and tutorials. New contributors can rely on a full API reference, while their developments are checked via a continuous integration platform (automatic unit-testing routines). Limbo was instrumental in several of our robotics projects (e.g., Cully et al. 2015), but it has also been used successfully in other fields.

[Figure 1: Comparison of the accuracy (difference with the optimal value) and the wall-clock time for Limbo and BayesOpt [Martinez-Cantin, 2014] – a state-of-the-art C++ library for Bayesian optimization – on common test functions (Branin, Ellipsoid, GoldsteinPrice, Hartmann3, Hartmann6, Rastrigin, SixHumpCamel, Sphere; see http://www.sfu.ca/~ssurjano/optimization.html). Two configurations are tested: with and without optimization of the hyper-parameters of the Gaussian process. Each experiment has been replicated 250 times. The median of the data is pictured with a thick dot, while the box represents the first and third quartiles. The most extreme data points are delimited by the whiskers and the outliers are individually depicted as smaller circles. Limbo is configured to reproduce the default parameters of BayesOpt.]
The implementation of Limbo follows a template-based, policy-based design [Alexandrescu, 2001], which allows it to be highly flexible without paying the cost induced by classic object-oriented designs [Driesen and Hölzle, 1996] (cost of virtual functions). In practice, changing one of the components of the algorithms in Limbo (e.g., changing the acquisition function) usually requires changing only a template definition in the source code. This design makes it possible for users to rapidly experiment and test new ideas while remaining as fast as specialized code.

According to the benchmarks we performed (Figure 1), Limbo finds solutions with the same level of quality as BayesOpt [Martinez-Cantin, 2014], within a significantly lower amount of time: for the same accuracy (less than 2 x 10^-3 between the optimized solutions found by Limbo and BayesOpt), Limbo is between 1.47 and 1.76 times faster (median values) than BayesOpt when the hyper-parameters are not optimized, and between 2.05 and 2.54 times faster when they are.

Using Limbo

The policy-based design of Limbo allows users to define or adapt variants of Bayesian optimization with very little change in the code. The definition of the optimized function is achieved by creating a functor (an arbitrary object with an operator() function) that takes a vector as input and outputs the resulting vector (Limbo can support multi-objective optimization); this object also defines the input and output dimensions of the problem (dim_in, dim_out). For example, to maximize the function my_fun(x) = -\sum_{i=1}^{2} x_i^2 \sin(2 x_i):

struct my_fun {
    static constexpr size_t dim_in = 2;
    static constexpr size_t dim_out = 1;

    Eigen::VectorXd operator()(const Eigen::VectorXd& x) const
    {
        // f(x) = -sum_i x_i^2 * sin(2 * x_i)
        double res = -(x.array().square() * (2 * x.array()).sin()).sum();
        return limbo::tools::make_vector(res);
    }
};

Optimizing my_fun with the default parameters only requires instantiating a BOptimizer object and calling the "optimize" method:

limbo::bayes_opt::BOptimizer<Params> opt;
opt.optimize(my_fun());

where Params is a structure that defines all the parameters in a static way, for instance:

struct Params {
    // default parameters for the acquisition function 'gpucb'
    struct acqui_gpucb : public limbo::defaults::acqui_gpucb {};

    // custom parameters for the optimizer
    struct bayes_opt_boptimizer : public limbo::defaults::bayes_opt_boptimizer {
        BO_PARAM(double, noise, 0.001);
    };
    // ...
};

While default functors are provided, most of the components of BOptimizer can be replaced to allow researchers to investigate new variants. For example, changing the kernel function from the Squared Exponential kernel (the default) to another type of kernel (here the Matern 5/2) and using the UCB acquisition function is achieved as follows:

// define the templates
using Kernel_t = limbo::kernel::MaternFiveHalves<Params>;
using Mean_t = limbo::mean::Data<Params>;
using GP_t = limbo::model::GP<Params, Kernel_t, Mean_t>;
using Acqui_t = limbo::acqui::UCB<Params, GP_t>;

// instantiate a custom optimizer
limbo::bayes_opt::BOptimizer<Params,
                             limbo::modelfun<GP_t>,
                             limbo::acquifun<Acqui_t>> opt;
// run it
opt.optimize(my_fun());
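After optimize() returns, the best point found so far can be queried from the optimizer. The fragment below (a minimal sketch that continues the example above; the accessor names best_observation() and best_sample() follow the Limbo tutorials and should be checked against the installed version) prints the result:

#include <iostream>

// ... after opt.optimize(my_fun()) has returned:
// best_observation() is the best function value found,
// best_sample() the corresponding input vector
// (accessor names assumed from the Limbo tutorials)
std::cout << "best value: " << opt.best_observation()
          << " at x = " << opt.best_sample().transpose() << std::endl;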
In addition to the many kernel, mean, and acquisition functions that are implemented, Limbo provides several tools for the internal optimization of the acquisition function and the hyper-parameters. For example, a wrapper around the NLOpt library (which provides many local, global, gradient-based, and gradient-free algorithms) allows Limbo to be used with a large variety of internal optimization algorithms. Moreover, several "restarts" of these internal optimization processes can be performed in parallel to avoid local optima with a minimal computational cost, and several internal optimizations can be chained in order to take advantage of the global aspects of some algorithms and the local properties of others.

Acknowledgements

This project is funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Project: ResiBots, grant agreement No 637972).

References

Andrei Alexandrescu. Modern C++ design: generic programming and design patterns applied. Addison-Wesley, 2001.

Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1-2):5–23, 2016.

Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.

Karel Driesen and Urs Hölzle. The direct cost of virtual function calls in C++. In ACM SIGPLAN Notices, volume 31, pages 306–323. ACM, 1996.

Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.

Donald R Jones, Cary D Perttunen, and Bruce E Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993.

Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.

Ruben Martinez-Cantin. BayesOpt: a Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15:3915–3919, 2014.

Jonas Mockus. Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media, 2012.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. Journal of Machine Learning Research, 15(1):253–282, 2014.