Context-Specific Validation of Data-Driven Models Somil Bansal*, Shromona Ghosh*, Alberto Sangiovanni-Vincentelli, Sanjit A. Seshia, Claire J. Tomlin Abstract —With an increasing use of data-driven models to control robotic systems, it has become important to develop a methodology for validating such models before they can be deployed to design a controller for the actual system. Specifically, it must be ensured that the controller designed for a learned model would perform as expected on the actual physical system. We propose a context-specific validation framework to quantify the quality of a learned model based on a distance measure between the closed-loop actual system and the learned model. We then propose an active sampling scheme to compute a probabilistic upper bound on this distance in a sample-efficient manner. The proposed framework validates the learned model against only those behaviors of the system that are relevant for the purpose for which we intend to use this model, and does not require any a priori knowledge of the system dynamics. Several simulations illustrate the practicality of the proposed framework for validating the models of real-world systems, and consequently, for controller synthesis. I. I NTRODUCTION Recently, research in robotics and control theory has been focusing on developing complex autonomous systems, such as robotic manipulators, autonomous cars, surgical robots, etc. Synthesizing controllers for such complex systems is a challenging but an important problem. Moreover, we need to ensure that the synthesized controller behaves as intended on the actual system. This is particularly important for safety- critical systems, where the actions of an autonomous agent may affect human lives. This motivates us to develop formal techniques to reason about the performance of synthesized controllers, before deploying them in the real world. This is often done by identifying an open-loop model and comparing it with the actual system; a controller is then designed on the model, which is tried out in simulation, and subsequently, tested on the actual system. Although the models obtained and tested in this fashion are more general and robust, this scheme can be prohibitive for complex autonomous systems, especially when the functional form of the dynamics is unknown. Moreover, we often care about controlling the system on a specific set of tasks, where a coarse dynamics model is often sufficient for control purposes. In such cases, an abstraction (or model) of a system is obtained and formally validated before being used for synthesizing controllers. Re- cently, there has been an increased interest in using machine learning (ML) and AI for learning an abstraction directly based on the data collected on the system [7, 8]. Thus, it has become even more important to develop validation schemes for such abstractions, and is the main focus of this paper. * These authors contributed equally to this work. All authors are with the Department of Electrical Engineering and Computer Sciences, Uni- versity of California, Berkeley. { somil, shromona.ghosh, alberto, sseshia, tomlin } @eecs.berkeley.edu. Broadly, validation refers to formally quantifying the qual- ity/ability of a model to mimic the (dynamic) behavior of the actual system. Several methods have been proposed in literature to validate an abstraction (or model), M , of the actual system, S , utilizing the notion of simulation metric , where the quality of an abstraction is quantified via metrics that specify how close the trajectories of S and M are. This approach has led to the study of exact [3, 6] as well as approximate abstractions [18] for a variety of discrete and continuous systems in both deterministic [21, 19] and stochastic settings [10, 16, 20, 24]. Although several meth- ods have been proposed to compute the simulation metric (see [1] and references therein); their contributions present some limitations; in that; 1) they validate the open-loop behavior of the model against that of the system, which is unnecessary and may lead to quite conservative bounds on the quality of M as an abstraction of S for the controller synthesis, as one may be able to control the system well with only a very approximate model of the system, 2) restrictive assumptions on the dynamics of the systems are required, and 3) a computational procedure to determine the approximate simulation metric is given only for certain classes of systems. More recently, [2] proposed a randomized approach to compute simulation metric that relies on the formulation of the problem as a semi-infinite chance-constrained optimization and computing its solution using the so-called “sampling and discarding approach” [13, 12]. The proposed approach does not require any a priori assumption on the system dynamics, and requires only to be able to run multiple executions of both S and M . Although, routed with strong theoretical guarantees, the simulation metric is computed through a random sampling- based approach, which often suffers from a poor sample complexity. Moreover, the validation is performed on the open- loop model; therefore, the bounds obtained on the simulation metric can still be quite conservative. To overcome these limitations, we propose CuSTOM (Context-Specific validaTiOn framework for data-driven Mod- els), a novel framework to quantify the quality of an ab- straction for the controller synthesis purposes. Here, context refers to a specific (potentially continuous) set of control tasks for which we want to synthesize a controller. As opposed to the open-loop validation considered in the above mentioned works, we consider the validation of closed-loop systems, where the controller has been designed using M for the particular task. This allows us to validate only those behaviors of the model that are relevant for the underlying task rather than across all system behaviors. Moreover, we propose to use a context-specific distance measure , which considers the context while computing the (closed-loop) distance between M and S instead of simply comparing how close their trajec- arXiv:1802.04929v2 [cs.SY] 26 Mar 2018 tories are. Hence, the context guides both the controller syn- thesis procedure ( context-specific controller ) and the distance measure ( context-specific distance measure ) in our validation framework. This subtlety of CuSTOM proves to be very important for validating data-driven models where obtaining a good open-loop abstraction across the entire state-space is extremely challenging; thus, such models will hardly pass a typical simulation metric-based validation test, even when the abstraction is good enough for synthesizing a controller for the actual system. CuSTOM is based on active sampling and does not require any a priori knowledge of the system dynamics. Since each execution on the real system is expensive, we aim to compute the distance between the system and its abstraction with as few samples as possible. To that end, we propose to use a Bayesian optimization (BO)-based active sampling method to compute this distance, which takes all past executions into account and suggests the next execution that is most likely to maximize the distance function. This direct maxima-seeking behavior based on the observed distances ensures that our approach is scalable and highly data-efficient compared to random sampling-based approaches, such as [2]. Overall, the proposed framework provides a practical and scalable scheme to validate data-driven models for controller synthesis. II. P ROBLEM S TATEMENT Let S and M denote the true system and its discrete-time model (also referred to as abstraction hereon) respectively, with state space X and control space U . We assume that M has the same input and output space as S but is often simpler and, in general, learned from data. Let ξ S ( t ; x 0 , u ( · )) denote the system state at time t starting at state x 0 and applying the control sequence u ( · ) . ξ M is similarly defined. Given S and its model M , our goal is to validate M , i.e., we would like to quantify how useful M is for designing controllers for S to complete tasks in the taskspace, G . More specifically, if we design a controller π ( · ) to complete a task g ∈ G using the model, will it perform equally well on the true system? Here, we are interested in controlling the system over a discrete-time horizon of H ; therefore, each task can be thought of as a control objective defined over the horizon, H . We also assume that the taskspace G can be parameterized by p ∈ P , i.e., for each p ∈ P , we have a different instance of a task g p ∈ G . For example, if we are interested in regulating the system to a desired state over a horizon of H , then p can represent the desired state, and P can represent the set of all states to which we might want the system to regulate to. Typically, the quality of a model is quantified by comparing the open-loop trajectories of S and M . In particular, a distance measure d ( ξ M , ξ S ) is defined between the trajectories and its maximum is computed over all possible pairs of trajectories, i.e., d ∗ = max u ( · ) ∈U H ,x 0 ∈X 0 d ( ξ M ( · ; x 0 , u ( · )) , ξ S ( · ; x 0 , u ( · ))) . (1) d ∗ in (1) is also called simulation metric between S and M . Intuitively, if d ∗ is small, then the trajectories of the actual system and its abstraction are close to each other, and we can quantify M as a “good” or validated model of S . However, if d ∗ is large, we explicitly have an initial state and a control sequence for which the evolved pair of trajectories will be very different, and we say that M doesn’t approximate S well. However, the notion of distance in (1) is too general; in that, it measures the distance across all possible control laws without explicitly considering the tasks of interest. Often, the goal of obtaining an abstraction is to synthesize a controller for the system and hope that it will work well on the actual system. The open-loop distance measure in (1) does not capture this intended use of an abstraction. Moreover, it has been shown in [8] that a “good” model for controller synthesis can be very different from a model that mimics the open-loop behavior of the true system well. Thus, a small open-loop distance measure in (1) is not even necessary for a model to be a good abstraction of the system. This motivates us to consider a context-specific validation framework, which measures the closed-loop distance between S and M . The context here is defined by the taskspace, G , which guides two key elements in our framework: context- specific controller and context-specific distance measure . In contrast to open-loop control, the context-specific controller refers to a controller designed for the specific task instance using M , and is represented as π ( x t ; g p , M ) . For example, π can be designed by associating a cost function with each task (typical in robotics) and solving an optimal control problem. This mimics the intended use of an abstraction, i.e., to design a controller for the actual system for the tasks at hand. We succinctly represent π ( x t ; g p , M ) , as π p to make the dependence on p explicit. While it is obvious that π p differs based on the task (and context in general), we claim that the distance measure used to validate an abstraction should also depend on the context at hand. Often, the entire trajectory of the system is not of interest for the task at hand; in such cases, one should use a distance measure that considers only the relevant part(s) of the trajectories. For example, for regulation tasks, one may care only about how close the final state of the model is to the desired state, and an appropriate distance measure should only compare the final state of M and S rather than computing the distance between the full trajectory. Thus, we propose to validate an abstraction based on the maximum context-specific distance, d ∗ G ( M π , S π ) , between the closed- loop S and M , d ∗ G ( M π , S π ) = max p ∈P ,x 0 ∈X 0 d G ( ξ M ( · ; x 0 , π p ) , ξ S ( · ; x 0 , π p )) , (2) where M π and S π denote the closed-loop abstraction and system respectively. Thus, we aim to validate a system model given a model, a control scheme and a control horizon. For simplicity, we denote d G ( ξ M ( · ; x 0 , π p ) , ξ S ( · ; x 0 , π p )) as d G ( p ) here on to make the dependence on p explicit. Definition 1 (Validation): We say a model M for a system S is validated for a context G , if the context-specific distance d ∗ G is below a pre-defined threshold τ , i.e., d ∗ G ≤ τ . Intuitively, τ quantifies the maximum acceptable difference between a system and its abstraction. Our goal in this paper is to check whether M validates S or not, for which we need to compute d ∗ G . Since S is unknown, the shape of the objective function, d G ( p ) , in (2) is unknown. The distance is thus evaluated empirically for each p ∈ P , which is often expensive as it involves conducting experiments on the real system. Thus, the goal is to solve the optimization problem in (2) with as few evaluations as possible. In this paper, we do so via a BO-based active sampling method. III. R UNNING E XAMPLE We now introduce a canonical control example, the inverted pendulum, that we will use throughout the paper to illustrate our approach. It has two states, x = ( θ, ̇ θ ) , the angle of rotation, θ ; and, angular velocity, ̇ θ . The dynamics of the system, S , are given by, ml 2 3 ̈ θ ( t ) = mgl 2 sin ( θ ( t )) + u ( t ) (3) where m is the mass of the pendulum, g is the acceleration due to gravity, l is the length of the rod. In (3), θ is the angle of the pendulum measured counter-clockwise from the upward vertical and u is the external torque applied. The state space and control space are, X = [ − π, π ) × R and U = R respectively. We would like to design a state-feedback controller, π ( θ, ̇ θ ) which attempts to stabilize the pendulum at a desired angle, i.e., x f inal = ( θ f inal , 0) , starting at the initial configuration x init = ( π, 0) , i.e., the bottom-most position. For simulation purposes, we assume that the actual dynam- ics model in (3) is unknown, and learn a simple linear model M from data using least-squares. The G is instantiated by the desired angle we would like to regulate to, θ f inal . Hence, we have a 1D P = [ − π, π ) . For each p ∈ P , we design a controller π for the abstract model M . We use finite horizon Iterative Linear Quadratic Regulator (iLQR) to design this controller, where iLQR minimizes a quadratic cost function penalizing the deviation from the desired state, x f inal . We use the l 1 distance between the final states of the closed- loop M π and S π as our distance measure, i.e., d G ( p ) = ‖ ξ S ( H ; x init , π p ) − ξ M ( H ; x init , π p ) ‖ 1 (4) Note that other distance measures can very well be used for the validation purposes, but the measure in (4) might be a more meaningful distance measure for a regulation task compared to the distance between the trajectories of the two systems, since we, often, only care about the final state of the system in this scenario, rather than the taken trajectory. We will discuss the effect of the choice of the distance measure further in Section V-A. IV. B ACKGROUND A. Random Sampling If we do not make any assumptions on S and M , then the associated distance function, d G ( · ) , can be arbitrary. Random sampling-based approaches have been explored for solving (2) in such scenarios. Consider an equivalent form of (2), d ∗ G ( M π , S π ) = min h ∈ R h , s.t ∀ p ∈P d G ( p ) ≤ h (5) (5) is a semi-infinite convex optimization problem where the objective and constraints are convex in the optimization variable, h . For simplicity, we augment P with the space of initial states, X 0 , and consider initial states as a parameter in (5). We now discuss two random sampling-based approaches to solve (5). 1) Scenario-based optimization (SC): Scenario-based opti- mization has been introduced in [11] and [13] to solve (5) by approximating it as a finite convex optimization problem. In particular, N parameters p 1 , . . . , p N ∈ P are randomly sampled and the optimization problem in (5) is solved for these parameters, i.e., we compute ˆ d G SC ( M π , S π ) := max i ∈ 1 ,...,N d G ( p i ) . (6) It is shown in [13] that given any  ∈ (0 , 1) and β ∈ (0 , 1) , the following statement holds with probability no smaller than (1 − β ) Pr ( p ∈ P : d G ( p ) > ˆ d G SC ) <  if N ≥ O ( 1  log( 1 β )) . Intuitively, the above statement provides a probabilistic bound on the volume of the parameter space where the distance is greater than ˆ d G SC . Simply put, as N increases, the volume of P where the distance is greater than ˆ d G SC decreases at the rate of 1 /N . It is also interesting to see that this rate is independent of the dimension of P . However, the scenario- based approach does not provide any information on the gap between our current estimate ˆ d G SC and the global optimum d ∗ G , which can be arbitrary in general. In contrast, our approach provides an explicit bound on this gap, as well as its rate of convergence to zero, as shown in Sec. V-B. 2) Sampling and discarding approach (SD): Another ap- proach to solving (5) is to reformulate it as a chance- constrained optimization problem, min h ∈ R h , s.t Pr ( p ∈ P : d G ( p ) > h ) ≤ , (7) and solve using sampling-based methods [12]. In particular, N parameters p 1 , . . . , p N ∈ P are sampled randomly and an algorithm A is used to discard k of these parameters to compute ˆ d G SD ( M π , S π ) := max i ∈{ 1 ,...,N }−A ( p 1 ,...,p N ) d G ( p i ) . (8) The formulation in (7) has also been adopted in [2] to compute distance between a system and it model. While different algorithm A can be chosen, for the purposes of computing maximum distance between models, we borrow A proposed in [2], where the authors discard the parameters corresponding to k maximum distances. Similar to the scenario-based optimization, it can be shown that given any  ∈ (0 , 1) and β ∈ (0 , 1) , the following statement holds with probability no smaller than (1 − β ) [12] Pr ( p ∈ P : d G ( p ) > ˆ d G SD ) <  if k − 1 ∑ i =0 ( N i )  i (1 −  ) N − i ≤ β. This probability can be interpreted along similar lines as in scenario-based optimization. In addition, sampling and discarding approach provides probabilistic bounds on the gap from the optimal h in (7) (for details we refer the interested readers to [12]). However, this approach also does not provide any information about the gap from the true maximum d ∗ G . B. Active sampling 1) Gaussian Process (GP): Since we do not know the dynamics of the actual system, the dependence of d G on the parameters p ∈ P is unknown a priori . We use a GP to approximate d G in the domain P . The following introduction of GPs is based on [22]. GPs are non-parametric regression method from machine learning, where the goal is to find an approximation of the nonlinear function d G : P → R from a parameter p ∈ P to the function value d G . This is done by considering the function values d G ( p ) to be random variables, such that any finite number of them have a joint Gaussian distribution. For GPs, we define a prior mean function and a covariance function (or kernel), k ( p, p ′ ) which defines the covariance between the function values d G ( p ) and d G ( p ′ ) . The prior mean in our case is assumed to be zero without loss of generality. In general, the choice of kernel function is problem-dependent and encodes assumptions about the unknown function such as smoothness. In our experimental section, we use the 3 / 2 Mat` ern kernel where the hyper-parameters are optimized by maximizing the marginal likelihood [22]. The GP framework can be used to predict the distribution of the performance function d G ( p ) at an arbitrary input p ∈ P based on the past observations, D = { p i , d G ( p i ) } n i =1 . Con- ditioned on D , the mean and variance of the prediction are μ ( p ) = kK − 1 J ; σ 2 ( p ) = k ( p, p ) − kK − 1 k T , (9) where K is the kernel matrix with K ij = k ( p i , p j ) , k = [ k ( p 1 , p ) , . . . , k ( p n , p )] and J = [ d G ( p 1 ) , . . . , d G ( p n )] . Thus, the GP provides both the expected value of the performance function at any arbitrary point p as well as a notion of the uncertainty of this estimate. 2) Bayesian Optimization (BO): In this work we use BO in order to find the maximum of the unknown function d G ( · ) . BO is particularly suitable for the scenarios where evaluating the unknown function is expensive, which fits our problem in Sec. II. At each iteration, BO uses the past observations D to model the objective function, and uses this model to determine informative sample locations. A common model used in BO for the underlying objective, and the one that we consider, are Gaussian processes (see Sec. IV-B1). Using the mean and variance predictions of the GP from (9), BO computes the next sample location by optimizing the so-called acquisition func- tion, α ( · ) . Different acquisition functions are used in literature to trade off between exploration and exploitation during the optimization process. The acquisition function that we use here is the GP-Upper Confidence Bound (GP-UCB) [23], where the next evaluation is given by p ∗ = argmax p ∈P α ( p ) , α ( p ) = μ ( p ) + β 1 / 2 σ ( p ) , (10) where β determines the confidence interval and is typically chosen as β n = 2 . Intuitively, at each iteration (10) selects the next parameter for which the upper confidence bound of the GP is maximal. Repeatedly evaluating the true function d G at samples given by (10) improves the GP model and decreases uncertainty at candidate locations for the maximum, such that the global maximum is found eventually. V. A CTIVE S AMPLING FOR D ISTANCE C OMPUTATION We now present CuSTOM, an active sampling-based frame- work to compute the context-specific distance between a system and its model. A. Approach Our approach is summarized in Algorithm 1. We start with a given model M that we want to validate for all possible tasks in the taskspace G , parameterized by p ∈ P . In CuSTOM, we model the unknown function d G as a GP. Given an initial D (initialized randomly if empty), we initialize our GP (Line 3). Next we compute the p ∗ that maximizes the corresponding acquisition function α ( p ) (Line 5). This parameter, p := p ∗ is used to design a controller π ( x t ; g p , M ) for the model M (Line 6). This controller should be the same as the one that the system designer would synthesize for the actual system based on M . Typically, this is done by solving an optimal control/reinforcement learning problem that minimizes some task-specific cost function J ( x , u ) subject to M . We then apply the controller in a closed-loop fashion to both the abstract model, M , and the actual system, S ; and the corresponding state-input trajectories are recorded (Lines 7 and 8). These trajectories are used to compute the distance d G ( p ) (Line 9). The GP is updated based on the collected data sample { p, d G ( p ) } (Line 10) and the entire process is repeated until the p ∗ corresponding to the maximum distance is found. Intuitively, CuSTOM directly learns the shape of the dis- tance function d G ( p ) as a function of p . However, instead of learning the global shape of this function through random queries, it analyzes the performance of all the past evaluations and by optimizing the acquisition function, generates the next query that is the most likely candidate for the maxima of the distance function. This direct maxima-seeking behavior based on the observed distance ensures that CuSTOM is highly data- efficient. Equivalently, in the space of all tasks of interest, we efficiently and directly search for the task for which the abstract model performs “most differently” compared to the actual system. For our running example, thus, we are directly seeking the angle which is hardest to balance the pendulum at, using a controller designed based on the learned model. Algorithm 1: CuSTOM algorithm 1 D ←− if available: { p, d G ( p ) } ; otherwise, initialized randomly 2 Prior ←− if available: Prior of the GP hyperparameters; otherwise, uniform 3 Initialize GP with D 4 while optimize do 5 Find p ∗ = argmax p α ( p ) ; p ←− p ∗ 6 Design π ( x t ; g p , M ) for the task g p 7 Apply π on M and record ξ M ( · ; x 0 , π ( · )) 8 Apply π on S and record ξ S ( · ; x 0 , π ( · )) 9 Evaluate d G ( p ) based on the defined distance measure 10 Update GP and D with { p, d G ( p ) } Note that the mean of the GP in our algorithm can be initialized with a prior distance function if some information is known about it. This generally leads to a faster convergence. When no information is known about the distance function a priori , the initial tasks are queried randomly and the corre- sponding distances are used to initialize the GP. Finally, note that CuSTOM can also be used when the distance measure is stochastic, which for example is the case when the actual system or the model is stochastic. In this case, CuSTOM will maximize the expected distance. There are two natural questions to ask at this point: 1) Is the context-specific validation framework in (2) any better than the open-loop validation framework in (1)? 2) Is active sampling approach more sample efficient than scenario-based or sampling and discarding approaches? Example 1: We now provide some insights into above ques- tions using our running example in Sec. III. To demonstrate the utility of a context-specific validation framework, we consider the following four scenarios 1) Open-loop control, distance between trajecto- ries (OLDT): The distance is defined by (1), and we measure the open-loop distance between the system and the model. We solve for, d ∗ = max u ∈U H ,p ∈P ‖ ξ M ( · ; x init , u ) − ξ S ( · ; x init , u ) ‖ ∞ 2) Context-specific controller, distance between trajecto- ries (CCDT): Here, the context is used only to design the controller π while the distance is defined between the trajectories using infinity norm. We solve for, d ∗ = max p ∈P ‖ ξ M ( · ; x init , π p ) − ξ S ( · ; x init , π p ) ‖ ∞ 3) Open-loop control, Context-specific distance met- ric (OLCD): Here the context is considered only to define d G , i.e, the distance measure in (4), but an open- loop controller is used. The distance between the two models is given by, d ∗ = max u ∈U H ,p ∈P ‖ ξ M ( H ; x init , u ) − ξ S ( H ; x init , u ) ‖ 1 Fig. 1. Inverted Pendulum. The x -axis and y -axis show the parameter space, θ f inal ∈ [ − π, π ] , and the distance between M and S respectively. The distance profiles of the open-loop frameworks (shown in red and green) are independent of θ f inal and compute the distance across all finite horizon control laws. Using a context-specific controller (shown in blue and orange), reduces the maximum distance significantly, indicating that the learned model may not mimic the actual system everywhere, but their closed-loop behavior is very similar for the tasks of interest. 4) Context-specific validation (CuSTOM): Here we use our proposed approach, i.e., the context is considered while designing the controller, π , as well as in the distance measure, d G , and solve for d ∗ = max p ∈P ‖ ξ M ( H ; x init , π p ) − ξ S ( H ; x init , π p ) ‖ 1 In Fig. 1 we plot the distance function for each scenario as a function of the task parameter, p := θ f inal ∈ [ − π, π ] . The dis- tance profiles for OLDT (shown in green) and OLCD (shown in red) are independent of θ f inal , as the control law is open- loop and does not consider the task explicitly. Moreover, the distance for any given θ f inal is higher than that of CCDT (shown in orange) or CuSTOM (shown in blue), be- cause we are comparing the abstraction and the system for all possible finite horizon control laws. Although OLDT/OLCD give the maximum distance between the trajectories of M and S , they convey no information about how useful M is for designing controllers to regulate the inverted pendulum to the desired angle. This illustrates the utility of comparing the closed-loop models M π and S π rather than the open-loop models. This is an important observation: since M is learned for synthesizing π , we need to analyze how well π behaves on the real system S , and not necessarily all possible control laws. Indeed, we see that CCDT/CuSTOM has a much lower distance profile compared to OLDT and OLCD, implying that the learned model may not necessarily have the same open- loop trajectories as that of the actual system, but their closed- loop trajectories are relatively similar. Furthermore, a good distance measure should accurately capture the intent of the system designer. For example, ex- pecting the trajectories of the closed-loop systems to be close maybe too strict for regulation purposes. Instead, one might be interested in how close the final states of M π and S π are. A small distance will then imply that we can expect the actual system to get close to the desired angle, given that the model can be regulated to the desired angle. Indeed, comparing CuSTOM and CCDT indicates that the trajectories Fig. 2. Inverted Pendulum. The error between the closed-loop trajectories of M π and S π as horizon progresses. The controller is designed using M for θ f inal = 0 . 04 . Note that even though the controller leads to different undershooting behavior on M π and S π , they finally reach the same configuration. Thus, the choice of the distance measure affects the distance between M π and S π , and an appropriate measure should be chosen that captures the intent of the designer. of M π and S π do not align exactly, but reach roughly the same final state. Fig. 2 illustrates the difference in the trajectories of M π and S π for θ f inal = 0 . 04 , corresponding to the maximum of CCDT. The controller designed on the abstract model leads to a larger undershoot on the actual system but leads to the same final angle, resulting in a larger distance as per CCDT, but a smaller distance as per CuSTOM. In general, the choice of the distance measure is subjective, but the validation threshold should be chosen keeping in mind the chosen distance measure. For example, it will be too strict to pick the same validation threshold for CCDT and CuSTOM in this case. Example 2: We next compare the effectiveness of the three methods, scenario-based optimization (SC), sampling and dis- carding approach (SD) and active sampling using BO (CuS- TOM), to solve (2). Fig. 3 compares the error in the maximum distance estimation across the three methods as a function of the number of parameter samples. The first five samples used to initialize all the three methods are the same. It takes CuSTOM (the Blue curve) only three samples to reach within the 1% of the true distance, as opposed to SC (the Orange curve) which is able to reach within 2 . 5% of the true distance only after 60 samples, which supports our claim that modeling d G as a GP and using BO converges to the true distance faster. Further, we notice that SD (the Green curve) does not reach close to the true distance even after 100 samples. This is not surprising as SD considers a chance constrained formulation and discards k > 0 (in our case k = 5 ) maximum samples. The converged GP for CuSTOM and the uncertainty asso- ciated with it is shown in Fig. 4. The true distance function is shown in red (computed by gridding the parameter space). While the GP does not capture the entire distance function within the 95% confidence interval (the red line is not com- pletely contained in the grey uncertainty region), it captures the region around maximum very well, which is sufficient for BO. In fact, this characteristic of BO precisely makes it more sample efficient compared to some of the other approaches. Additionally, the system designer can use the GP itself Fig. 3. Inverted Pendulum. The x -axis and y -axis show the number of samples and the percentage error in the distance estimation respectively. While the error for BO (shown in Blue) converges to 0 very quickly (in 20 samples), scenario-based optimization (SC) (shown in Orange) is not able to find the true maximum even after 100 samples. Both techniques outperform sampling and discarding (SD) approach (shown in Green), which does not seem close to the true distance even after 100 samples. in several different ways to get important insights about the learned model. First, it tells the designer that the maximum distance which is 0 . 3 radians occurs when we want to regulate the system to θ f inal = 2 radians. In particular, we can expect the actual system angle to be in the range [1 . 7 , 2 . 3] radians, if we use the controller designed on M 1 . This distance should be compared with the predefined tolerance threshold to validate the model, which is also the main focus of the paper. In particular, if a distance of 0 . 3 radians is within the tolerance the designer is willing to accept, we can conclude that the learned linear model is an acceptable model for the inverted pendulum for controller synthesis. However, if it is beyond the acceptable tolerance, we can precisely tell the designer for what regulation tasks, the model fails. Note that this is different from saying that the linear system is sufficient for regulating the inverted pendulum, since we do not analyze how good the controller is at completing the task. It merely suggests that the controllers designed for regulation purposes behave similarly with the model, M , and actual system, S . In that aspect our validation framework is a tool in the overall control analysis procedure (see Remark 1 for further discussion). Second, it indicates that the distance profile is asymmetric. This reflects a potential bias in the training data while learning M since the inverted pendulum is symmetric around the bottom-most position. Furthermore, M π performs relatively poorly around π/ 2 and − π/ 2 , which suggests that M needs to be refined further by adding more samples from this region in the training data. Thus, using the GP, we can obtain some tangible information about the regions in which the model can be refined further to improve the closed-loop performance on the actual system, something that random sampling-based approaches are unable to provide. Remark 1: In general, there are two, perhaps equally, im- portant aspects to a context-specific validation: • Aspect-1: How well a model can perform the tasks in G ? • Aspect-2: How well a model can mimic the actual system 1 This assumes that we can regulate the model to the desired angle. Fig. 4. Inverted Pendulum. The x -axis and y -axis are the parameter space and the distance function d G ( p ) respectively. The true distance function is shown in red, the mean of the GP is shown in blue and the shaded grey region is the associated uncertainty. It suffices for the GP to capture the true distance function around the maximum of the true distance function, and not necessarily over the entire parameter space. behavior on these tasks? Aspect-1 checks if model itself can perform the tasks to a satisfactory level, i.e., verify (or test) the model can perform the task. Aspect-2 checks if we can expect the same perfor- mance to transfer to the actual system. Thus, to make sure that the designed controller performs well on the desired tasks, it is important to test and validate a model along both aspects. In this paper, we are concerned with efficiently validating the model along Aspect-2; nevertheless, a model should be first tested along Aspect-1, and only then along Aspect-2. In fact, if we are not able to verify the model along Aspect-1, validating it along Aspect-2 provides very limited information. Several methods, such as simulation-based falsification [17, 4], have been proposed in literature to test a model against a high level specification along Aspect-1. These methods can easily be used along with CuSTOM to complete the validation process. It is also important to note that verification or testing of models is purely simulation-based and does not require any real execution on the system. Consequently, it is often much “cheaper” compared to validating a model along Aspect-2. Finally, these two aspects can also be combined in a single distance measure. For instance, for regulation tasks, we can validate an abstraction using the following distance measure d ∗ = max p ∈P ‖ ξ S ( H ; x init , π p ) − p ‖ 1 . The above distance measure directly validates a model based on the performance of the synthesized controller on the actual system for the task it was designed for. CuSTOM can then be similarly used to validate a model using this distance measure. However, validating a model in two separate stages helps in identifying the concrete reasons for failing the validation, which can be instrumental in improving (or refining) the model further. B. Theoretical Analysis We now discuss the rate of convergence of CuSTOM. Global optimization is a difficult problem without any assump- tions on the objective function d G . The main complicating factor is the uncertainty over the extent of the variations of d G . For example, d G could be a characteristic function, which is equal to 1 at p m and 0 elsewhere, and none of the methods we mention here can optimize this function without exhaustively searching through every point in P . The way a large number of global optimization methods address this problem is by imposing some prior assumption on how fast the objective function d G can vary. A common assumption made in the optimization literature is Lipschitz continuity of d G in p [15], which essentially limits the rate of change of d G . However, a single Lipschitz constant for the entire function might be too conservative [9] for the exploration-exploitation trade-off required for a data-efficient global optimization. One way to relax these hard Lipschitz constraints is by imposing a GP prior on the function, as we do in CuSTOM. Instead of restricting the function from oscillating too fast, a GP prior requires those fast oscillations to have low probability. Under a GP prior, it can be shown that CuSTOM converges to d ∗ G at a sub-linear rate. Proposition 1: Assume d G ∼ GP (0 , k ( p, p ′ )) with a Mat ́ ern 3/2 kernel. Select  ∈ (0 , 1) and sample p n ∈ P by maximizing (10). Then, Pr ( d ∗ G > ˆ d G BO ( n ) + r n ∀ n ≥ 1) <  (11) where r n = C (√ |P| n 3 3+ |P| ( |P| +1) ) for some constant C , |P| is the dimension of P , n is the number of iterations of BO, d ∗ G is the true maximum distance, and ˆ d G BO ( n ) is the maximum returned by BO after n iterations. Intuitively, Proposition 1 quantifies how far our current best guess, ˆ d G BO , is from the true maximum d ∗ G . Furthermore, it claims that this gap between our current best guess and the true maximum vanishes quickly as the number of samples increases. Note that the guarantees provided by Proposition 1 are much stronger than the that by the scenario-based or sampling and discarding approaches; not only does it quantify how far the true maximum is from our current estimate but also provides a rate of convergence. Corollary 1: If there exists n > 0 such that ˆ d G BO ( n ) < τ − r n , then with probability atleast 1 −  we have d ∗ G < τ and the model M for a system S has been validated . Corollary 1 is a direct consequence of Proposition 1 and provides a probabilistic validation certificate for M . The proof for Proposition 1 and Corollary 1 can be found in the appendix. Remark 2: Even though Proposition 1 is specific to Mat ́ ern32 kernel, the convergence rates for different kernels can be similarly obtained (see the proof of Proposition 1 for more details). A natural question to ask is when can we use a GP prior on the underlying distance function? Even though there exist for- mal conditions in literature for when a GP prior can be used for a function, they are often hard to verify beforehand, especially when the function is unknown a priori [22]. Nevertheless, GP priors have been successfully used in machine learning and control theory to efficiently model and control unknown systems, primarily because the flexibility they provide for capturing different underlying functions by an appropriate choice of kernel. In particular, depending on what is known about the distance measure d G beforehand, an appropriate kernel can be used to increase the likelihood of capturing the true distance function with a GP. For example, if the true system, S , and the abstract system, M , satisfy the standard results of existence and uniqueness of state trajectory [14]; and the controller, π , designed to achieve a task, g p ∈ G , is Lipschitz bounded in the parameter p ; then the underlying closed-loop systems, S π and M π are continuous in parameters p . Any continuous distance measure, d G defined on S π and M π is then continuous and deterministic in p . In such a case, a Mat ́ ern Kernel (with ν = 1 . 5 ) can be used, which only assumes the continuity of the underlying function [22]. Similarly, if d G is also differentiable, alternative kernels can be used to capture it efficiently under a GP prior. VI. C ASE S TUDIES We now apply CuSTOM on a series of systems and learned models. As baselines, we compare CuSTOM to ran- dom sampling-based approaches: the scenario-based and the sampling and discarding-based approaches. A. Dubins Car For our first simulation, we vary the taskspace, G , i.e., “context”, and show how our technique can be used across different contexts. We consider the task of controlling, a three dimensional non-linear Dubins car to follow a desired trajectory. The dynamics of the system are given by, ̇ y = v cos φ, ̇ z = v sin φ, ̇ φ = ω (12) where x := ( y, z, φ ) is the state of the system; with position, ( y, z ) , and heading, φ ; and control inputs are the velocity v and turn rate ω . We would like S to follow the desired trajectory, ˆ ξ , over the horizon, H = 100 , starting from the initial state, x init = (0 , 0 , 0) , ˆ ξ ( t ; x init , p ) =   t · a 0 H ∑ 10 j =1 a j sin( 2 πj · y ( t ) a 0 ) 0   , t ∈ [0 , · · · , H ] Thus, ˆ ξ ( · ) is a sum of ten sine waves in the yz plane. The task parameters are, a 0 ∈ [0 , 2] , the desired final y position ; and { a 1 , · · · , a 10 } ∈ [0 , 1] 10 , the set of amplitudes corresponding to the different sine components. The combined parameter space is 11D, P = [0 , 2] × [0 , 1] 10 , and every instance p ∈ P generates a different trajectory, ˆ ξ ( p ) . We assume that the dynamics in (12) are unknown, and learn a linear dynamics model, M , from random data gathered by simulating the system. For each trajectory tracking task, g p ∈ G , we use a quadratic cost penalizing the distance between the current state, x ( t ) , and the desired state, ˆ ξ ( t ) , as the cost function to design a LQR controller, π p . The context-specific Fig. 5. Dubins Car: The x -axis and y -axis represents the number of samples and maximum distance among the samples, respectively. CuSTOM converges to its maximum distance much faster than random sampling-based approaches. Moreover, it converges to a higher distance, indicating that SC and SD underestimate the true distance in this case. distance measure, d G , is the l 2 distance between the trajectories produced by S π and M π , i.e., d G = 1 H ‖ ξ S ( · ; x init , π p ) − ξ M ( · ; x init , π p ) ‖ 2 , but other distance measures can very well be used. We ran the simulation for 10 different trials. In each trial, we generate new training data and learn a linear model from it. We compare the maximum distance found by CuSTOM, scenario optimization (SC) and sampling and discarding (SD). We implement CuSTOM as a python package (will be published after acceptance). For BO, we use the python package GPy- Opt [5]. We do not compute the true distance, as exploring the parameter space, P , becomes a near impossible task for large dimensions. We plot the medians of the distance across the trials for each technique in Fig. 5. We see that for a given number of samples, the maximum distance found by CuSTOM is higher than that of SC and SD, indicating that the random sampling- based approaches underestimate the true distance between the system and the model in this case. In fact, if the validation threshold is set somewhere between 2 to 3 metres, SC and SD will provide a false validation certificate for the learned model, which can be problematic for the safety-critical systems. Furthermore, CuSTOM converges to its maximum distance in 100 samples, while SC and SD are not able to find this even after 500 samples. The distance found by CuSTOM has a number of implications to the control designer. It tells that the designer can expect a maximum average deviation of 3 . 5 metres from the trajectory obtained by utilizing M for designing π , and if it is greater than the tolerance of error he is willing to accept then he should refine the model. Moreover, our framework provides the designer the trajectory which leads to this error, which he can use to refine the model further. B. 2-DoF Robotic Arm This simulation is inspired from the manipulation tasks in robotics. We consider a two degree-of-freedom (DoF) robotic arm, which is operating on a 2D table. The goal is to regulate the end-effector of the arm to a desired target position on the Fig. 6. Robotic Arm: the sample complexity for different approaches to reach within a desired percentage of the true distance. CuSTOM is significantly more sample-efficient compared to the random sampling-based approaches. table starting from the initial extended arm configuration, i.e., zero joint angles. For simulating the actual system, we use the MuJoCo [25] physics engine. The state of the system, x , is given by the two joint angles, ( θ 1 , θ 2 ) , the corresponding angular velocities, ( ̇ θ 1 , ̇ θ 2 ) , and the position of the end-effector, ( y, z ) . The control inputs are the torques for the two motors at the joints. We assume that we know the state and control of the system, but we do not have access to a dynamics model of the system. Instead, we learn a dynamics model using the data from the system and use CuSTOM to quantify the quality of the learned model for the desired regulation tasks. To learn the dynamics model, we use a feedforward neural network (FNN) with 2 hidden layers, 8 nodes per layer and ReLU activation function. The NN was trained in a supervised fashion with the state- control pair as its input and the next state as the output. The G is parameterized by the desired target position of the end-effector, ( y f inal , z f inal ) , i.e., P is different positions on the 2D table. The l 2 distance between the final position of the actual arm and its model is used as the distance measure, d G . A quadratic cost penalizing the distance between the end-effector position, ( y ( t ) , z ( t )) , and the desired position, ( y f inal , z f inal ) , is used as the cost function to design the LQR controller. We ran the simulation for 10 different trials, where the training data, and hence the learned model, was different for each trial. For each trial we also computed the true distance between the system and the learned model by an exhaustive search over P , i.e., across all possible desired positions on the table. The median sample complexity across these trials for different approaches is plotted in Fig. 6. As evident from the figure, CuSTOM significantly outperforms all other baselines and reaches within 2% of the true distance in less than 50 executions. Our framework can also be employed to compare different abstractions of a system. To illustrate this, we train three dif- ferent dynamics models for each of the 10 trials above: a FNN model as before, a GP model and a linear model. CuSTOM was then used to validate each of these models. The results for a particular trial are shown in Fig. 7. As evident from the figure, CuSTOM is able to find the true distance (dashed lines) within a few samples across all models. Moreover, based on Fig. 7. Robotic Arm: Obtained distances for the different abstractions (a Gaussian process (GP), a feed-forward neural network (FNN) and a linear model (LTI)) of a 2-DoF robotic arm. CuSTOM can also be used to compare different abstractions of a system, based on the obtained closed-loop distance bound. In this case, a GP best mimics the closed-loop behavior of the arm. the observed distances it is clear that the GP model performs best, in the sense that the controller designed on the GP model takes the actual system closest to the desired position, when deployed on the actual system. In this case, the linear model is the worst abstraction of the real system. In general, the quality of a data-driven abstraction depends on the training data, training process, function approximator, etc. However, such considerations are beyond the scope of this work. Here, we aim to provide a certificate for an abstraction given the abstraction. It is also important to note that CuSTOM can similarly be used to compare different controllers. C. Linear Systems The main objective of this simulation is to demonstrate the effect of dimensionality of the task space, G , on the sample complexity of CuSTOM and the baselines, SC and SD. For this purpose, we consider linear systems, ̇ x = Ax + Bu , with state-space dimension ranging from 1D to 10D. The parameters of the system matrices ( A, B ) have been chosen in random for each system. Corresponding to each system, we consider a linear abstraction whose ( A, B ) matrices have also been chosen in random. For each linear system, we are interested in regulating the system state from a desired initial state, x init to a desired goal state, x f inal . Hence, our G is parameterized by P which contains both the initial state and the goal state, p = ( x init , x f inal ) ∈ P . Thus, our P is 2D for a 1D linear system, 4D for a 2D linear system, and so on. A linear feedback controller is used to achieve the desired regulation, whose parameters are also chosen in random. In particular, the control applied at state x is given by u ( x ) = K ( x − x f inal ) , where K is the feedback matrix and x f inal is the desired final state. In practice, the abstraction as well as the controller are designed carefully based on the data and the task at hand, but in this simulation, our focus is not on the model design but on the relative sample complexity. The l 1 distance between the final state of the system and the abstraction is used as the distance measure for validation. For comparison purposes, we also compute the true distance between the model and the abstraction. This can be done in the Fig. 8. Effect of dimensionality: The sample complexity for different ap- proaches to reach within 5% of the true distance vs. the dimension of the task space. The high sample complexity of random sampling-based approaches renders them impractical for validating abstractions of the real- world autonomous systems. closed form for the linear systems, as the distance function, d G ( p ) , is also linear in the task parameters, p ∈ P . The median results for the number of samples necessary to reach within the 5% of the true distance are shown in Figure 8. Even though the number of required samples increases with the dimension of the task space for all the three methods, the increment is very modest for CuSTOM. This is not surprising because of the sub-linear dependence of sample complexity on the task space dimension for CuSTOM (see Proposition 1). However, the sample complexity increases drastically for the two baselines with the dimension of the task space, indicating their impracticality to be used for validating abstractions of real-world systems. VII. C ONCLUSION Validating data-driven models before they are deployed to synthesize a controller for the actual system is a challenging but an important problem. In this paper, we present a context- specific validation framework for data-driven models that only validates those behaviors of the model that are relevant for the tasks at hand. The proposed framework is based on active-sampling and does not require any a priori knowledge of the system dynamics. The context-specific nature of the framework along with the active sampling ensures that the validation can be performed in a sample efficient manner, and have the potential to be of practical use for real-world systems. There are several interesting future directions that emerge out of this work. First, it would be interesting to apply the proposed framework on more complex robotic systems. An- other interesting direction will be to explore how BO compares with other active sampling methods, e.g., CMA-ES, Simulated Annealing, etc. Performing a thorough theoretical analysis of the proposed framework is another promising direction. Finally, it will be interesting to use the computed distance bound for the safety analysis and verification of synthesized controllers. A CKNOWLEDGMENTS This research is supported by NSF under the CPS Frontiers VehiCal project (1545126), by the UC-Philippine-California Advanced Research Institute under project IIID-2016-005, by the ONR MURI Embedded Humans (N00014-16-1-2206), by the Army Research Laboratory 2 and was accomplished under Cooperative Agreement Number W911NF-17-2-0196; and in part by Toyota under the iCyPhy center. A PPENDIX A. Proof of Proposition 1 We denote by R n = ∑ n i =1 d ∗ G − d BO G ( i ) , the cumulative regret from n iterations of BO, where d BO G ( i ) is the i -th sample returned by BO. If d G ∼ GP (0 , k ( p, p ′ )) with k ( p, p ′ ) being a Mat ́ ern kernel, then it can be shown that (see Theorem 2 in [23]) for any  ∈ (0 , 1) Pr ( R n ≤ O ( √ |P| · nγ n ) ∀ n ≥ 1) ≥ 1 −  (13) where |P| is the dimension of the P , n is the number of iterations of BO and γ n is the maximum information gain after n rounds. γ n for a Mat ́ ern32 kernel is given by (see Theorem 5 in [23]), γ n = O ( n |P| ( |P| +1) (3+ |P| ( |P| +1)) (log n )) . We know that, ˆ d G BO ( n ) ≥ d BO G ( i ) , ∀ i ∈ { 1 , . . . , n } since ˆ d G BO ( n ) is the maximum among all the n iterations of BO. Hence, we have, R n = ∑ n i =1 d ∗ G − d BO G ( i ) ≥ ∑ n i =1 d ∗ G − ˆ d G BO ( n ) = n · ( d ∗ G − ˆ d G BO ( n )) . (13) can thus be written as, Pr ( n · ( d ∗ G − ˆ d G BO ( n )) ≤ O ( √ |P| · nγ n ) ∀ n ≥ 1) ≥ 1 −  Pr ( d ∗ G > ˆ d G BO ( n ) + O ( √ |P| · γ n /n ) ∀ n ≥ 1) <  (14) Ignoring the polylog factors and using γ n for the Mat ́ ern 3 / 2 kernel, we have, Pr ( d ∗ G > ˆ d G BO ( n ) + r n ∀ n ≥ 1) <  (15) where r n = C (√ |P| n 3 3+ |P| ( |P| +1) ) for some constant C . Note that the above proof can also be used to obtain the convergence rates for different kernels by substituting the appropriate γ n . γ n for commonly used kernels have been derived in Theorem 5 in [23]. B. Proof of Corollary 1 Let there exist a n > 0 such that ˆ d G BO ( n ) < τ − r n where τ is the validation threshold and r n is defined as above. Rewriting (15) as, Pr ( d ∗ G < ˆ d G BO ( n ) + r n ∀ n ≥ 1) ≥ 1 −  (16) 2 The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the ocial policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on. we have, Pr ( d ∗ G < τ ) ≥ 1 −  (17) and we can conclude that the model M is validated for system S . R EFERENCES [1] A. Abate. Approximation metrics based on probabilistic bisimulations for general state-space Markov processes: a survey. Electronic Notes in Theoretical Computer Science , 297:3–25, 2013. [2] A. Abate and M. Prandini. Approximate abstractions of stochastic systems: A randomized method. In Conference on Decision and Control and European Control Confer- ence , pages 4861–4866, 2011. [3] R. Alur, T. A. Henzinger, G. Lafferriere, and G. J. Pap- pas. Discrete abstractions of hybrid systems. Proceedings of the IEEE , 88(7):971–984, 2000. [4] Y. Annpureddy, C. Liu, G. Fainekos, and S. Sankara- narayanan. S-taliro: A tool for temporal logic falsification for hybrid systems. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems , pages 254–257. Springer, 2011. [5] The GPyOpt authors. GPyOpt: A Bayesian optimization framework in python. http://github.com/SheffieldML/ GPyOpt, 2016. [6] C. Baier, J. Katoen, and K. G. Larsen. Principles of model checking . MIT press, 2008. [7] S. Bansal, A. K. Akametalu, F. J. Jiang, F. Laine, and C. J. Tomlin. Learning quadrotor dynamics using neural network for flight control. In Conference on Decision and Control , 2016. [8] S. Bansal, R. Calandra, T. Xiao, S. Levine, and C. J. Tomlin. Goal-driven dynamics learning via Bayesian optimization. Conference on Decision and Control , 2017. [9] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesv ́ ari. X- armed bandits. Journal of Machine Learning Research , 12(May):1655–1695, 2011. [10] M. L. Bujorianu, J. Lygeros, and M. C. Bujorianu. Bisimulation for general stochastic hybrid systems. In International Workshop on Hybrid Systems: Computation and Control , pages 198–214. Springer, 2005. [11] G. C. Calafiore and M. C. Campi. The scenario approach to robust control design. IEEE Transactions on Automatic Control , 51(5):742–753, 2006. [12] M. C. Campi and S. Garatti. A sampling-and-discarding approach to chance-constrained optimization: feasibility and optimality. Journal of Optimization Theory and Applications , 148(2):257–280, 2011. [13] M. C. Campi, S. Garatti, and M. Prandini. The scenario approach for systems and control design. Annual Reviews in Control , 33(2):149–157, 2009. [14] E. A. Coddington and N. Levinson. Theory of ordinary differential equations . Tata McGraw-Hill Education, 1955. [15] N. De Freitas, A. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deter- ministic observations. arXiv preprint arXiv:1206.6457 , 2012. [16] J. Desharnais, A. Edalat, and P. Panangaden. Bisimu- lation for labelled Markov processes. Information and Computation , 179(2):163–193, 2002. [17] A. Donz ́ e. Breach, a toolbox for verification and pa- rameter synthesis of hybrid systems. In International Conference on Computer Aided Verification , pages 167– 170. Springer, 2010. [18] A. Girard and G. J. Pappas. Approximate bisimulation: A bridge between computer science and control theory. European Journal of Control , 17(5-6):568–578, 2011. [19] A. Girard, G. Pola, and P. Tabuada. Approximately bisimilar symbolic models for incrementally stable switched systems. IEEE Transactions on Automatic Control , 55(1):116–126, 2010. [20] K. G. Larsen and A. Skou. Bisimulation through prob- abilistic testing. Information and computation , 94(1):1– 28, 1991. [21] G. Pola, A. Girard, and P. Tabuada. Approximately bisimilar symbolic models for nonlinear control systems. Automatica , 44(10):2508–2516, 2008. [22] C. E. Rasmussen and CKI. Williams. Gaussian Processes for Machine Learning . Adaptive Computation and Ma- chine Learning. MIT Press, 2005. [23] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Confer- ence on Machine Learning , pages 1015–1022, 2010. [24] S. Strubbe and A. Van Der Schaft. Bisimulation for com- municating piecewise deterministic Markov processes (cpdps). In International Workshop on Hybrid Systems: Computation and Control , pages 623–639. Springer, 2005. [25] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Conference on In- telligent Robots and Systems , pages 5026–5033. IEEE, 2012.