Importance sampling-based approximate optimal planning and control

Jie Fu

March 18, 2021

Jie Fu is with the Robotics Engineering Program, Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA 01609, US. jfu2@wpi.edu

Abstract

In this paper, we propose a sampling-based planning and optimal control method for nonlinear systems under non-differentiable constraints. Motivated by the need for scalable planning algorithms, we consider the optimal motion plan to be a feedback controller that can be approximated by a weighted sum of given basis functions. Given this approximate optimal control formulation, our main contribution is to introduce importance sampling, specifically the model-reference adaptive search algorithm, to iteratively compute the optimal weight parameters, i.e., the weights corresponding to the optimal policy function approximation for the chosen bases. The key idea is to perform the search by iteratively estimating a parametrized distribution that converges to a Dirac delta concentrated on the globally optimal weights. Then, using this direct policy search, we incorporate trajectory-based verification to ensure that, for a class of nonlinear systems, the obtained policy is not only optimal but also robust to bounded disturbances. The correctness and efficiency of the methods are demonstrated through numerical experiments, including a linear system with a nonlinear cost function and motion planning for a Dubins car.

0.1 Introduction

This paper presents an importance sampling-based approximate optimal planning and control algorithm. Optimal motion planning in deterministic and continuous systems is computationally NP-complete [1], except for linear time-invariant systems. For nonlinear systems, there is a vast literature on approximate solutions and algorithms. In optimal planning, the common approximation scheme is discretization-based: by discretizing the state and input spaces, optimal planning is performed by solving the shortest path problem in the discrete transition system obtained from abstracting the continuous dynamics, using heuristic-based search or dynamic programming. Compared to discretization-based methods, sampling-based graph search methods, including the Probabilistic RoadMap (PRM) [2], RRT [3], and RRT* [4], are more applicable to high-dimensional systems. While RRT has no guarantee on the optimality of the path [4], RRT* computes an optimal path asymptotically, provided the cost functional is Lipschitz continuous. However, such Lipschitz conditions may not be satisfied for some cost functions under specific performance considerations.

The key idea in the proposed sampling-based planning method builds on a unification of importance sampling and approximate optimal control [5, 6]. In approximate optimal control, the objective is to approximate both the value function, i.e., the optimal cost-to-go, and the optimal feedback policy function by weighted sums of known basis functions. As a consequence, the search space is changed from the infinite trajectory or policy space to a continuous space of weight vectors, given that each weight vector corresponds to a unique feedback controller. Instead of solving the approximate optimal control problem by training actor and critic neural networks (NNs) with trajectory data [7, 8], we propose a sampling-based method that samples weight vectors of a policy function approximation and searches for the optimal one.
This method employs Model Reference Adaptive Search (MRAS) [9], a probabilistically complete global optimization algorithm, to search for the optimal weight vector that parametrizes the approximate optimal feedback policy. The fundamental idea is to treat the weight vector as a random variable over a parameterized distribution, where the optimal weight vector corresponds to a Dirac delta function, which is the target distribution. The MRAS algorithm iteratively estimates the parameter that has the minimum Kullback-Leibler divergence with respect to an intermediate reference model, which assigns higher probability mass to the weight vectors of controllers with improved performance over the previous iteration. Meanwhile, a set of weight vectors is sampled from the parameterized distribution, and the performance of their corresponding policies is evaluated via simulation-based policy evaluation. Under mild conditions, the parameterized distribution converges, with probability one, to the target distribution that concentrates on the optimal weight vector with respect to the given basis functions.

MRAS resembles another adaptive search algorithm, the cross-entropy (CE) method, but provides faster and stronger convergence guarantees because it is less sensitive to its input parameters [9, 10]. Previously, the CE method has been introduced for motion planning [11, 12] based on sampling in the trajectory space. The central idea is to construct a probability distribution over the set of feasible paths and to perform the search for an optimal trajectory using CE. The parameters to be estimated are either a sequence of motion primitives or a set of via-points for interpolation-based trajectory planning. Different from these methods, ours is the first to use importance sampling to estimate a parameterization of the optimal policy function approximation for continuous nonlinear systems. Since the algorithm performs direct policy search, we are able to enforce robustness and stability conditions to ensure the computed policy is both robust and approximately optimal, provided these conditions can be evaluated efficiently.

To conclude, the contributions of this paper are the following: First, we introduce a planning algorithm based on a novel integration of model reference adaptive search and approximate optimal control. Second, based on contraction theory, we introduce a modification to the planning method that directly generates stabilizing and robust feedback controllers in the presence of bounded disturbances. Last but not least, through illustrative examples, we demonstrate the effectiveness and efficiency of the proposed methods and share our view on interesting future research along this direction.

0.2 Problem formulation

Notation: The inner product between two vectors $w, v \in \mathbb{R}^n$ is denoted $w^\intercal v$ or $\langle w, v \rangle$. Given a positive semi-definite matrix $P$, the $P$-norm of a vector is denoted $\|x\|_P = \sqrt{x^\intercal P x}$; we write $\|x\|$ when $P$ is the identity matrix. $I_{\{E\}}$ is the indicator function, i.e., $I_{\{E\}} = 1$ if event $E$ holds and $0$ otherwise. For a real $\alpha \in \mathbb{R}$, $\lceil \alpha \rceil$ is the smallest integer greater than or equal to $\alpha$.

0.2.1 System model

We consider continuous-time nonlinear systems of the form
\[
\Sigma: \quad \dot{x}(t) = f(x(t), u(t)), \quad x(t) \in X, \; u(t) \in U, \tag{1}
\]
where $x \in X$ is the state, $u \in U$ is the control input, $x_0 \in X$ is the initial state, and $f(x, u)$ is a vector field. We assume that $X$ and $U$ are compact.
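For concreteness, the model (1) and its closed-loop simulation, which the later sampling-based policy evaluation relies on, can be represented as in the following minimal Python sketch. The Dubins-car vector field, the forward-Euler integrator, and the step size `dt` are illustrative assumptions rather than part of the formulation.

```python
import numpy as np

def dubins_dynamics(x, u):
    """Vector field f(x, u) for a Dubins car: state x = (px, py, theta),
    input u = (v, omega) (linear and angular velocity). Used here only as
    an illustrative instance of the model (1)."""
    px, py, theta = x
    v, omega = u
    return np.array([v * np.cos(theta), v * np.sin(theta), omega])

def simulate(f, policy, x0, T, dt=0.01):
    """Roll out the closed-loop system x_dot = f(x, policy(x)) with forward
    Euler integration (an assumed, simple integrator) over horizon T.
    Returns the state and input trajectories sampled every dt."""
    steps = int(T / dt)
    xs, us = [np.asarray(x0, dtype=float)], []
    for _ in range(steps):
        u = policy(xs[-1])
        xs.append(xs[-1] + dt * f(xs[-1], u))
        us.append(u)
    return np.array(xs), np.array(us)
```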
A feedback controller $u: X \rightarrow U$ maps the current state to a control input. The objective is to find a feedback controller $u^*$ that minimizes a finite-horizon cost function for the nonlinear system:
\[
\begin{aligned}
\min_{u}\; & J(x_0, u) = \int_0^T \ell(x(t), u(t))\, dt + g(x(T), u(T)) \\
\text{subject to:}\; & \dot{x}(t) = f(x(t), u(t)), \quad x(t) \in X, \; u(t) \in U, \; x(0) = x_0,
\end{aligned} \tag{2}
\]
where $T$ is the stopping time, $\ell: X \times U \rightarrow \mathbb{R}^+$ defines the running cost incurred when the state trajectory traverses $x$ and the control input $u$ is applied, and $g: X \times U \rightarrow \mathbb{R}^+$ defines the terminal cost. As an example, a running cost can be the quadratic cost $\ell(x, u) = \|x\|_R + \|u\|_Q$ for some positive semi-definite matrices $Q$ and $R$, and a terminal cost can be $g(x, u) = \|x - x_f\|_R$, where $x_f$ is a goal state. We denote the set of feedback policies by $\Pi$.

For infinite-horizon optimal control, the optimal policy is independent of time, and a feedback controller suffices as a minimizing argument of (2) (see Ref. [13]). For finite-horizon optimal control, the optimal policy is time-dependent. However, for simplicity, in this paper we only consider time-invariant feedback policies and assume the time horizon $T$ is long enough that the time constraint can be ignored.

0.2.2 Preliminary: Model reference adaptive search

The MRAS algorithm, introduced in [9], aims to solve the following problem:
\[
z^* \in \arg\max_{z \in Z} H(z), \quad z \in \mathbb{R}^n,
\]
where $Z$ is the solution space and $H: \mathbb{R}^n \rightarrow \mathbb{R}$ is a deterministic function that is bounded from below. It is assumed that the optimization problem has a unique solution, i.e., $z^* \in Z$ and for all $z \neq z^*$, $H(z) < H(z^*)$.

The following regularity conditions need to be met for the applicability of MRAS.

Assumption 1. For any given constant $\xi < H(z^*)$, the set $\{z \mid H(z) \ge \xi\} \cap Z$ has a strictly positive Lebesgue or discrete measure.

This condition ensures that any neighborhood of the optimal solution $z^*$ has a positive probability of being sampled.

Assumption 2. For any constant $\delta > 0$, $\sup_{z \in A_\delta} H(z) < H(z^*)$, where $A_\delta := \{z \mid \|z - z^*\| \ge \delta\} \cap Z$, and we define the supremum over the empty set to be $-\infty$.

The MRAS algorithm consists of the following steps:

• Select a sequence of reference distributions $\{g_k(\cdot)\}$ with desired convergence properties. Specifically, the sequence $\{g_k(\cdot)\}$ will converge to a distribution that concentrates only on the optimal solution.

• Select a parametrized family of distributions $f(\cdot, \theta)$ over $Z$ with parameter $\theta \in \Theta$.

• Optimize the parameters $\{\theta_k\}$ iteratively by minimizing the following KL divergence between $f(\cdot, \theta_k)$ and $g_k(\cdot)$:
\[
d(g_k, f(\cdot, \theta)) := \int_Z \ln\frac{g_k(z)}{f(z, \theta)}\, g_k(z)\, \nu(dz),
\]
where $\nu(\cdot)$ is the Lebesgue measure defined over $Z$. The sample distributions $\{f(\cdot, \theta_k)\}$ can be viewed as compact approximations of the reference distributions and will converge to an approximate optimal solution as $\{g_k(\cdot)\}$ converges, provided certain properties of $\{g_k(\cdot)\}$ are retained in $f(\cdot, \theta_k)$.

Note that the reference distributions $\{g_k(\cdot)\}$ are unknown beforehand, as the optimal solution is unknown. Thus, the MRAS algorithm employs estimation of distribution algorithms [14] to estimate a reference distribution that guides the search. To make the paper self-contained, we cover the details of MRAS in the development of the planning algorithm.

0.3 Approximate optimal motion planning using MRAS

In this section, we present an algorithm that uses MRAS in a distinguished way for approximate optimal feedback motion planning.
0.3.1 Policy function approximation

The policy function approximation $\bar{u}: X \rightarrow U$ is a weighted sum of basis functions,
\[
\bar{u}(x) = \sum_{i=1}^N w_i \phi_i(x),
\]
where $\phi_i: X \rightarrow \mathbb{R}$, $i = 1, \ldots, N$, are basis functions and the coefficients $w_i$, $i = 1, \ldots, N$, are the weight parameters. An example of a basis is the polynomial basis $\phi = [1, x, x^2, x^3, \ldots, x^N]$ for a one-dimensional system. A commonly used class of basis functions is the radial basis function (RBF). It can be constructed by determining a set of centers $c_1, \ldots, c_N \in X$ and then defining one RBF
\[
\phi_i(x) = \exp\!\left(-\frac{\|x - c_i\|^2}{2 \sigma^2}\right)
\]
for each center $c_i$, where $\sigma$ is a pre-defined parameter. In vector form, a policy function approximation is represented by $\bar{u} = \langle w, \phi \rangle$, where $\phi = [\phi_1, \ldots, \phi_N]^\intercal$ and $w = [w_1, \ldots, w_N]^\intercal$. We let the domain of the weight vector be $W$ and denote by $\Pi_\phi = \{\langle w, \phi \rangle \mid w \in W, \langle w, \phi \rangle \in \Pi\}$ the set of all policies that can be generated by linear combinations of the pre-defined basis functions. In the following, unless specifically mentioned, the vector of basis functions is $\phi$.

Clearly, for any weight vector $w$, $J(x_0, \langle w, \phi \rangle) \ge \min_{u \in \Pi} J(x_0, u)$. Thus, we aim to solve $\min_w J(x_0, \langle w, \phi \rangle)$ so as to minimize the error in the optimal cost introduced by the policy function approximation.

Definition 1 (Approximate optimal feedback policy). Given a basis vector $\phi$, a weight vector $w^*$ with respect to $\phi$ is optimal if and only if $\langle w^*, \phi \rangle \in \Pi_\phi$ and, for all $w \in W$ such that $\langle w, \phi \rangle \in \Pi_\phi$, $J(x_0, \langle w^*, \phi \rangle) \le J(x_0, \langle w, \phi \rangle)$. The approximate optimal feedback policy is $\bar{u}^* = \langle w^*, \phi \rangle$.

By requiring $J(x_0, \langle w^*, \phi \rangle) \le J(x_0, \langle w, \phi \rangle)$, it can be shown that the optimal weight vector $w^*$ minimizes the difference between the cost achievable with policies in $\Pi_\phi$ and the cost under the globally optimal policy. For clarity of notation, we write $J(x_0, \langle w, \phi \rangle)$ as $J(x_0; w)$, as $\phi$ is a fixed basis vector throughout the development of the proposed method.

Clearly, if the actual optimal policy $u^*$ can be represented by a linear combination of the selected basis functions, then we obtain the optimal policy by computing the optimal weight vector, i.e., $u^* = \langle w^*, \phi \rangle$.

Remark: Here, we assume a feedback policy can be represented by $\langle w, \phi \rangle$ for some weight vector $w \in W$. When the basis functions are continuous, such a feedback policy is a continuous function of the state. However, this requirement is hard to satisfy for many physical systems due to, for example, input saturation. When a feasible controller is discontinuous, we can still approximate it with a continuous function and then project the continuous function onto the set $\Pi$ of applicable controllers.

Using function approximation, we aim to solve the optimal feedback planning problem in (2) approximately by finding the optimal weight vector with respect to a pre-defined basis vector. The main algorithm is presented next.

0.3.2 Integrating MRAS in approximate optimal planning

In this section, we present an adaptive search-based algorithm to compute the approximate optimal feedback policy. The algorithm is "near" anytime, meaning that it returns a feasible solution after a small number of samples. If more time is permitted, it quickly converges to the globally optimal solution that corresponds to the approximate optimal feedback policy. The algorithm is probabilistically complete under the regularity conditions of MRAS.
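Before detailing the algorithm, the following sketch makes the parameterization of Section 0.3.1 concrete: it builds RBF features over user-supplied centers and evaluates the policy $\bar{u}(x) = \langle w, \phi(x) \rangle$. The particular centers, the value of $\sigma$, and the clipping to the input set are assumptions made for illustration only.

```python
import numpy as np

def make_rbf_basis(centers, sigma=1.0):
    """Return phi(x) = [phi_1(x), ..., phi_N(x)] with Gaussian RBFs
    phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2)). Centers and sigma are
    design choices supplied by the user."""
    centers = np.atleast_2d(centers)
    def phi(x):
        d2 = np.sum((centers - np.asarray(x)) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return phi

def make_policy(w, phi, u_min=None, u_max=None):
    """Policy approximation u(x) = <w, phi(x)>; optionally clipped to the
    input set U = [u_min, u_max] (a simple projection onto Pi, assumed)."""
    def policy(x):
        u = np.dot(w, phi(x))
        if u_min is not None or u_max is not None:
            u = np.clip(u, u_min, u_max)
        return u
    return policy
```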
We start by viewing the weight vector as a random variable $w$ governed by a multivariate Gaussian distribution over the compact support $W$. The distribution is parameterized by $\theta = (\mu, \Sigma)$, where $\mu$ is an $N$-dimensional mean vector and $\Sigma$ is the $N \times N$ covariance matrix; recall that $N$ is the number of basis functions. The optimal weight vector $w^*$ can be represented by a target distribution $p_{\mathrm{goal}}$ that is a Dirac delta, i.e., $p_{\mathrm{goal}}(w^*) = \infty$ and $p_{\mathrm{goal}}(w) = 0$ for $w \neq w^*$. The Dirac delta is the limit case of a multivariate Gaussian distribution with vanishing covariance. Thus, the target distribution can be approximated arbitrarily closely by a multivariate Gaussian with some realization of the parameter $\theta$. Recall that the probability density of a multivariate Gaussian distribution is
\[
p(w; \theta) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\!\left(-\frac{1}{2}(w - \mu)^\intercal \Sigma^{-1} (w - \mu)\right), \quad \theta = (\mu, \Sigma), \; \forall w \in W,
\]
where $N$ is the dimension of the weight vector $w \in W$ and $|\Sigma|$ is the determinant of $\Sigma$.

Now we are ready to present the main algorithm, called Sampling-based Approximate-Optimal Planning (SAOP), which consists of the following steps.

1) Initialization: The initial distribution is selected to be $p(\cdot, \theta_0)$ for some $\theta_0 = (\mu_0, \Sigma_0) \in \Theta$ that generates samples achieving a good coverage of the sample space $W$; for example, $\mu_0 = 0 \in \mathbb{R}^N$ and $\Sigma_0 = I \in \mathbb{R}^{N \times N}$, the identity matrix. The following parameters are used in the algorithm: $\rho \in (0, 1]$ specifying the quantile, an improvement parameter $\varepsilon \in \mathbb{R}^+$, a sample increment percentage $\alpha$, an initial sample size $N_1$, and a smoothing coefficient $\lambda \in [0, 1)$. Let $k = 1$.

2) Sampling-based policy evaluation: At each iteration $k$, generate a set of $N_k$ samples $W_k \subseteq W$ from the current distribution $p(\cdot, \theta_k)$. For each $w \in W_k$, we evaluate by simulation the cost $J(x_0; w)$ from the initial state $x_0$ under the feedback policy $u(x) = \langle w, \phi(x) \rangle$ with the system model in (1). The cost $J(x_0; w)$ is well defined because the system is deterministic and its trajectory is unique.

3) Policy improvement with elite samples: Next, the set $\{J(x_0; w) \mid w \in W_k\}$ is ordered from largest (worst) to smallest (best) among the given samples:
\[
J_{k,(0)} \ge \ldots \ge J_{k,(N_k)}.
\]
We denote by $\kappa$ the estimated $(1 - \rho)$-quantile of the cost $J(\cdot; w)$, i.e., $\kappa = J_{k, \lceil (1 - \rho) N_k \rceil}$. The following cases are distinguished.

• If $k = 1$, we introduce a threshold $\gamma = \kappa$.

• If $k \neq 1$, the following cases are further distinguished:

– If $\kappa \le \gamma - \varepsilon$, i.e., the estimated $(1 - \rho)$-quantile of the cost has been reduced by the amount $\varepsilon$ since the last iteration, then let $\gamma = \kappa$, let $N_{k+1} = N_k$, and continue to step 4).

– Otherwise ($\kappa > \gamma - \varepsilon$), we find the largest $\rho'$, if it exists, such that the estimated $(1 - \rho')$-quantile of the cost, $\kappa' = J_{k, \lceil (1 - \rho') N_k \rceil}$, satisfies $\kappa' \le \gamma - \varepsilon$. Then let $\gamma = \kappa'$, let $\rho = \rho'$, and continue to step 4). If no such $\rho'$ exists, then the threshold $\gamma$ is not updated, but the sample size is increased to $N_{k+1} = \lceil (1 + \alpha) N_k \rceil$; let $\theta_{k+1} = \theta_k$, $k = k + 1$, and continue to step 2).

4) Parameter (policy) update: We update the parameter $\theta_{k+1}$ for iteration $k + 1$. First, we define the set of elite samples $E = \{w \in W_k \mid J(x_0; w) \le \gamma\}$. The parameter update of $\theta$ is designed to assign higher probability to the elite samples.
To achieve this, each elite sample $w \in E$ is associated with a weight, such that a higher weight is given to a weight vector with a lower cost and a lower probability under the current distribution. The next parameter $\theta_{k+1}$ is selected to maximize the weighted sum of the probabilities of the elite samples. To this end, we update the parameter as follows:
\[
\theta_{k+1}^* = \arg\max_{\theta \in \Theta} \, \mathbb{E}_{\theta_k}\!\left[ \frac{S(J(x_0, w))^k}{p(w, \theta_k)} \, I_{\{J(x_0, w) \le \gamma\}} \, \ln p(w, \theta) \right],
\]
where $\mathbb{E}_{\theta}(\nu)$ is the expected value of a random variable $\nu$ under the distribution $p(\cdot, \theta)$, and $S: \mathbb{R} \rightarrow \mathbb{R}^+$ is a strictly decreasing, positive function (possible choices are $S(x) = \exp(-x)$, or $S(x) = \frac{1}{x}$ if $x$ is strictly positive). The quantity $S(J(x_0; w))^k / p(w, \theta_k)$ is the weight for the sample $w$.

Assumption 3. The optimal parameter $\theta^*_{k+1}$ is an interior point of $\Theta$ for all $k$.

Lemma 1 (based on Theorem 1 of [9]). Under Assumptions 1, 2, and 3 and the compactness of $W$, with probability one,
\[
\lim_{k \rightarrow \infty} \mu_k = w^*, \quad \text{and} \quad \lim_{k \rightarrow \infty} \Sigma_k = 0_{N \times N},
\]
where $w^*$ is the optimal weight vector and $0_{N \times N}$ is the $N$-by-$N$ zero matrix.

Note that since $\Sigma_k$ converges to the zero matrix in the limit, the stopping criterion below is justified. Building on the convergence result of MRAS, the proposed sampling-based planner ensures convergence to a Dirac delta function concentrated on the optimum. In practice, the parameter update is performed using the expectation-maximization (EM) algorithm.

EM-based parameter update/policy improvement: Since our choice of probability distribution is the multivariate Gaussian, the parameter $\theta_{k+1}^* = (\mu, \Sigma)$ is computed as follows:
\[
\mu = \frac{\mathbb{E}_{\theta_k}\!\left[ [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}}\, w \right]}{\mathbb{E}_{\theta_k}\!\left[ [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}} \right]}
\approx \frac{\sum_{w \in W_k} [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}}\, w}{\sum_{w \in W_k} [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}}}, \tag{3}
\]
and
\[
\Sigma = \frac{\mathbb{E}_{\theta_k}\!\left[ [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}}\, (w - \mu)(w - \mu)^\intercal \right]}{\mathbb{E}_{\theta_k}\!\left[ [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}} \right]}
\approx \frac{\sum_{w \in W_k} [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}}\, (w - \mu)(w - \mu)^\intercal}{\sum_{w \in W_k} [S(J(x_0, w))^k / p(w, \theta_k)]\, I_{\{w \in E\}}}, \tag{4}
\]
where we approximate $\mathbb{E}_{\theta_k}(h(w))$ by its estimate $\frac{1}{N_k} \sum_{w \in W_k} h(w)$ for $w \sim p(\cdot, \theta_k)$; the factor $\frac{1}{N_k}$ cancels because it is shared by the numerator and the denominator.

Smoothing: Due to the limited sample size, a greedy maximization in the parameter update can be premature if too few samples are used. To ensure convergence to the global optimal solution, a smoothing update is needed. To this end, we select the parameter for the next iteration to be
\[
\theta_{k+1} \leftarrow \lambda \theta_k + (1 - \lambda) \theta_{k+1}^*,
\]
where $\lambda \in [0, 1)$ is the smoothing parameter. Let $k = k + 1$. We then check whether the iteration can be terminated based on a given stopping criterion. If the stopping criterion is met, we output the latest $\theta_k$; otherwise, we continue updating $\theta$ by returning to step 2).

Stopping criterion: Since the probability distribution converges to a degenerate one that concentrates on the optimal weight vector, we stop the iteration when the covariance matrix $\Sigma_k$ becomes near-singular, as justified by the convergence condition in Lemma 1.

To conclude, the proposed algorithm using MRAS is probabilistically complete and converges to the global optimal solution. If the assumptions are not met, the algorithm converges to a local optimum.
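A minimal sketch of one SAOP iteration is given below. It assumes a Gaussian sampler over $W$, a user-supplied cost oracle $J(x_0; w)$ evaluated by forward simulation, and the shape function $S(x) = \exp(-x)$; the elite-set selection and the smoothed Gaussian re-fit follow equations (3)-(4), while the adaptive quantile and sample-size logic of step 3) is simplified.

```python
import numpy as np
from scipy.stats import multivariate_normal

def saop_iteration(mu, Sigma, cost_fn, k, n_samples=100, rho=0.1, lam=0.5):
    """One simplified SAOP/MRAS iteration: sample weight vectors from the
    current Gaussian, keep the elite samples whose cost is below the
    rho-quantile, and re-fit the Gaussian with the importance weights
    S(J)^k / p(w; theta_k), S(x) = exp(-x), as in equations (3)-(4)."""
    # 2) Sampling-based policy evaluation: J(x0; w) via forward simulation.
    W = np.random.multivariate_normal(mu, Sigma, size=n_samples)
    J = np.array([cost_fn(w) for w in W])

    # 3) Elite samples: the best rho-fraction of the current batch.
    gamma = np.quantile(J, rho)
    We, Je = W[J <= gamma], J[J <= gamma]

    # 4) Importance weights; J is shifted by its minimum for numerical
    #    stability (the constant factor cancels after normalization).
    p = multivariate_normal(mu, Sigma, allow_singular=True).pdf(We)
    v = np.exp(-k * (Je - Je.min())) / np.maximum(p, 1e-300)
    v = v / v.sum()

    # EM-style update (3)-(4), followed by the smoothing step; the small
    # diagonal jitter is an assumed regularization, not part of the paper.
    mu_new = v @ We
    diff = We - mu_new
    Sigma_new = (v[:, None] * diff).T @ diff + 1e-8 * np.eye(len(mu))
    return lam * mu + (1 - lam) * mu_new, lam * Sigma + (1 - lam) * Sigma_new, gamma
```

In practice this routine would be iterated, with the stopping test checking whether $\Sigma_k$ has become near-singular, as suggested by Lemma 1.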
0.3.3 Robust control using trajectory verification in sampling

One major advantage of being able to search directly within a continuous control policy space is that one can enforce stability conditions so that the search is restricted to the space of stable and robust policies. In this subsection, we use contraction theory to derive conditions that must be satisfied by the weight vectors to ensure stability and robustness under bounded disturbances.

Definition 2. [15] Given the equation of the closed-loop system $\dot{x} = f(x, t)$, a region of the state space is called a contraction region if the Jacobian $\frac{\partial f}{\partial x}$ is uniformly negative definite in that region, that is,
\[
\exists \beta > 0, \; \forall x, \; \forall t > 0, \quad \frac{1}{2}\left(\frac{\partial f}{\partial x}^{\!\intercal} M + \dot{M} + M \frac{\partial f}{\partial x}\right) \preceq -\beta M,
\]
where $M(t)$ is a positive definite matrix for all $t \ge 0$.

Theorem 1. [15] Given the system model $\dot{x} = f(x, t)$, any trajectory which starts in a ball of constant radius with respect to the matrix $M$, centered about a given trajectory and contained at all times in a contraction region with respect to $M$, remains in that ball and converges exponentially to this trajectory. Furthermore, global exponential convergence to the given trajectory is guaranteed if the whole state space is a contraction region.

Theorem 1 provides a necessary and sufficient condition for exponential convergence of an autonomous system. Under bounded disturbances, the key idea is to incorporate contraction analysis into the planning algorithm so that it searches for a weight vector $w$ that is not only optimal for the nominal system but also ensures that the closed-loop actual system under the controller $u = w^\intercal \phi$ has contracting dynamics within a tube around the nominal trajectory. Using a proof similar to that in [16], we can show that for systems with contracting dynamics, the actual trajectory under bounded disturbances is uniformly ultimately bounded along the nominal trajectory.

Lemma 2. Consider a closed-loop system $\dot{x} = f(x) + \omega(t)$, where $\omega(t)$ is a disturbance with $\max_t \|\omega(t)\| \le \rho_{\max}$. Let a state trajectory $x(t)$ be in the contraction region $X_\ell$ for all time $t \ge t_0$. Then for any time $t \ge t_0$, the deviation between $x(t)$ and the nominal trajectory $\bar{x}(t)$, whose dynamics are given by $\dot{\bar{x}} = f(\bar{x})$, satisfies
\[
\|x(t) - \bar{x}(t)\|_M^2 \le \frac{2 \ell \rho_{\max}}{\beta}\left(1 - e^{-\beta t}\right).
\]
In other words, the error is uniformly ultimately bounded with ultimate bound $\frac{2 \ell \rho_{\max}}{\beta}$.

Proof. Pick the Lyapunov function $V = (x - \bar{x})^\intercal M (x - \bar{x})$, whose time derivative is
\[
\begin{aligned}
\dot{V} &= (x - \bar{x})^\intercal M \left(f(x) + \omega - f(\bar{x})\right) + \left(f(x) + \omega - f(\bar{x})\right)^\intercal M (x - \bar{x}) \\
&= (x - \bar{x})^\intercal M \left(f(x) - f(\bar{x})\right) + \left(f(x) - f(\bar{x})\right)^\intercal M (x - \bar{x}) + 2 (x - \bar{x})^\intercal M \omega \quad (M \text{ is symmetric}) \\
&= (x - \bar{x})^\intercal \left(\frac{\partial f}{\partial x}^{\!\intercal}\Big|_{\tilde{x}} M + M \frac{\partial f}{\partial x}\Big|_{\tilde{x}}\right)(x - \bar{x}) + 2 (x - \bar{x})^\intercal M \omega,
\end{aligned}
\]
where we use the property $f(x) - f(\bar{x}) = \frac{\partial f}{\partial x}\big|_{\tilde{x}} (x - \bar{x})$ for some $\tilde{x}$ on the segment between $\bar{x}$ and $x$. Since the trajectory stays within the contraction region, the condition $\frac{\partial f}{\partial x}^{\!\intercal}\big|_{\tilde{x}} M + M \frac{\partial f}{\partial x}\big|_{\tilde{x}} \preceq -2 \beta M$ holds, and we have
\[
\dot{V} \le -(x - \bar{x})^\intercal \beta M (x - \bar{x}) + 2 (x - \bar{x})^\intercal M \omega.
\]
Meanwhile, $\|x(t) - \bar{x}(t)\|_M \le \ell$ as the trajectory $x(t)$ stays within the region of contraction; bounding the disturbance term with $\|\omega\| \le \rho_{\max}$ yields $2 (x - \bar{x})^\intercal M \omega \le 2 \rho_{\max} \ell$, and hence
\[
\dot{V} \le -(x - \bar{x})^\intercal (\beta M)(x - \bar{x}) + 2 \rho_{\max} \ell = -\beta V + 2 \rho_{\max} \ell.
\]
Since $\dot{V} \le -\beta V + 2 \rho_{\max} \ell$ and $x(0) = \bar{x}(0)$, we obtain $V(t) \le \frac{2 \ell \rho_{\max}}{\beta}(1 - e^{-\beta t})$, and therefore
\[
\|x - \bar{x}\|_M^2 \le \frac{2 \ell \rho_{\max}}{\beta}\left(1 - e^{-\beta t}\right).
\]

Thus, to search for optimal and robust policies, we modify the algorithm by introducing the following step.

Contraction verification step: Suppose the closed-loop system is subject to bounded disturbances. The objective is to ensure that the trajectory is contracting within the time-varying tube $\{x \mid \|x - \bar{x}\|_M \le \ell\}$ for all $t$, where $\bar{x}$ is the nominal state trajectory. The following condition translates the contraction condition into a verifiable condition for the closed-loop system: choose a positive constant $\beta$ and a positive definite, symmetric, constant matrix $M = [m_{ij}]_{i,j = 1, \ldots, n}$, and verify whether, at each time step along the nominal trajectory $\bar{x}(t)$ of the closed-loop system under the control $u(t) = w^\intercal \phi(x(t))$, the following condition holds:
\[
\max_{x: \|x - \bar{x}\|_M \le \ell} g_{ij}(x) \le -\beta m_{ij}, \quad \forall i = 1, \ldots, n, \; \forall j = 1, \ldots, n, \tag{5}
\]
where $g_{ij}$ is the $(i, j)$-th component of the matrix $\frac{\partial f}{\partial x}^{\!\intercal} M + M \frac{\partial f}{\partial x}$. We verify this condition numerically at discrete time steps instead of in continuous time. Further, if the function $g_{ij}(x)$ is semi-continuous, then according to the Extreme Value Theorem this condition can be verified by evaluating $g_{ij}(x)$ at all critical points where $\frac{d g_{ij}(x)}{dx} = 0$ and on the boundary of the set $\{x \mid \|x - \bar{x}\|_M \le \ell\}$.

The modification to the planning algorithm is made in step 3): if the controller $u = \langle w, \phi \rangle$ of an elite sample $w$ does not meet the condition, then $w$ is rejected from the set of elite samples. Alternatively, one can do so implicitly by associating $w$ with a very large cost. However, since the condition is only sufficient and not necessary once the matrix $M$ and the constants $\beta$ and $\ell$ are fixed and $M$ is chosen to be constant, the obtained robust controller may not be optimal among all robust controllers in $\Pi_\phi$. A topic for future work is to extend joint planning and control to an adaptive bound $\beta$, an adaptive $\ell$, and a uniformly positive definite, time-varying matrix $M(x, t)$.

0.4 Examples

In this section, we use two examples to illustrate the correctness and efficiency of the proposed method. The simulation experiments are implemented in MATLAB on a desktop with an Intel Xeon E5 CPU and 16 GB of RAM.

0.4.1 Feedback gain search for linear systems

To illustrate the correctness and sampling efficiency of the planning algorithm, we consider optimal control of a linear time-invariant (LTI) system with a non-quadratic cost. For this class of optimal control problems, since there is no admissible heuristic, one cannot use planning algorithms that rely on a heuristic function. Moreover, the optimal controller is nonlinear given the non-quadratic cost.

Consider the LTI system $\dot{x} = A x + B u$ with
\[
A = \begin{bmatrix} -1 & 1 \\ 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 1 \end{bmatrix},
\]
where $x \in X = \mathbb{R}^2$ and $u \in \mathbb{R}$. The initial state is $x_0 = [5, 5]^\intercal$. The cost functional is
\[
J(x_0, u) = \int_0^T \left( \|x\|^2 + \|u\|^2 + 0.5 \|x\|^4 + 0.8 \|x\|^6 \right) dt + \|x(T)\|^2.
\]
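To illustrate the simulation-based policy evaluation step on this example, the following sketch rolls out the closed-loop LTI system under a candidate feedback policy and accumulates the non-quadratic cost. The forward-Euler integrator, the step size `dt`, and the horizon `T` are assumptions for illustration; this is not the paper's MATLAB implementation.

```python
import numpy as np

A = np.array([[-1.0, 1.0], [0.0, 0.0]])
B = np.array([0.0, 1.0])

def lti_cost(policy, x0=np.array([5.0, 5.0]), T=10.0, dt=0.01):
    """Evaluate J(x0; w) for the LTI example by forward simulation:
    running cost ||x||^2 + ||u||^2 + 0.5||x||^4 + 0.8||x||^6 and
    terminal cost ||x(T)||^2 (horizon T and step dt are assumed)."""
    x, J = x0.astype(float), 0.0
    for _ in range(int(T / dt)):
        u = float(policy(x))
        nx = np.linalg.norm(x)
        J += (nx**2 + u**2 + 0.5 * nx**4 + 0.8 * nx**6) * dt
        x = x + dt * (A @ x + B * u)
    return J + np.linalg.norm(x)**2
```

Plugged into the SAOP iteration sketched earlier, the cost oracle would simply be `cost_fn = lambda w: lti_cost(lambda x: w @ phi(x))` with the polynomial basis defined below.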
For a non-quadratic cost functional, the optimal controller is no longer linear and cannot be computed by LQR unless the running cost can be written in sum-of-squares form. Thus, we consider an approximate feedback controller with basis vector $\phi = [x_1, x_2, x_1^2, x_2^2, x_1^3, x_2^3]^\intercal$. Suppose the magnitude of the external disturbance is bounded by $\rho_{\max} = 0.5$. The following parameters are used in the stability verification: $\beta = 2$; at any time $t$, for all $x$ such that $\|x(t) - \bar{x}(t)\| \le \ell$, the controller ensures $\|x(t) - \bar{x}(t)\| \le \frac{2 \ell \rho_{\max}}{\beta}(1 - e^{-\beta t})$, because $\frac{2 \ell \rho_{\max}}{\beta} = 0.5\, \ell \le \ell$. With this choice for the stability analysis, the constraint is
\[
\frac{\partial f}{\partial x} + \frac{\partial f}{\partial x}^{\!\intercal} = \begin{bmatrix} -2 & 3 w_5 x_1^2 + 2 w_3 x_1 + w_1 + 1 \\ \text{Sym.} & 6 w_6 x_2^2 + 4 w_4 x_2 + 2 w_2 \end{bmatrix} \preceq \begin{bmatrix} -2 & 0 \\ 0 & -2 \end{bmatrix}.
\]
In this case, if we select $w_3, w_4, w_5, w_6$ nonpositive, $w_1 \le -1$, and $w_2 \le -1$, then the closed-loop system, which is a nonlinear polynomial system, becomes globally contracting.

Figure 1: Convergence of the SAOP algorithm on the LTI system with a non-quadratic cost functional. (a) The cost function versus the number of samples. (b) The mean of the multivariate Gaussian ($\mu_1, \ldots, \mu_6$) versus the number of samples. (c) The nominal and actual state trajectories of the closed-loop system under bounded disturbances ($\rho_{\max} = 0.5$) with the feedback controller computed by SAOP.

Figures 1a and 1b show the convergence of SAOP in one simulation in terms of the cost and the mean of the multivariate Gaussian over iterations. The following parameters are used: initial sample size $N_1 = 50$, improvement parameter $\varepsilon = 0.1$, quantile percentage $\rho = 0.1$, smoothing parameter $\lambda = 0.5$, and sample increment parameter $\alpha = 0.1$. The algorithm converges after 38 iterations with 3301 samples to the mean
\[
\bar{w}^* = [-1.0629, \; -2.7517, \; 0, \; -1.7939, \; -0.0987, \; -2.1474]^\intercal
\]
and a covariance matrix with norm $3.3401 \times 10^{-4}$. Each iteration took less than 10 seconds. The approximate optimal cost under the feedback controller $u = \langle \bar{w}^*, \phi \rangle$ is 3863.3. Figure 1c shows the state trajectory of the closed-loop system with bounded disturbances. Over 25 independent runs of SAOP, the mean of $J(x_0; \bar{w}_i^*)$, $i = 1, \ldots, 25$, is 3903.3 and the standard deviation is 104.1683, 2.6% of the approximate optimal cost. Note that if we only use a linear feedback $u = K x$, the optimal cost is $1.0943 \times 10^4$, which is about three times the optimal cost that can be achieved with a nonlinear controller.

0.4.2 Example: approximate optimal planning for a Dubins car

Consider the Dubins car dynamics
\[
\dot{x} = u \cos\theta, \quad \dot{y} = u \sin\theta, \quad \dot{\theta} = v,
\]
where $\vec{x} = (x, y, \theta) \in \mathbb{R}^2 \times S^1$ is the state (coordinates and heading angle with respect to the $x$-axis), and $u$ and $v$ are the control variables, i.e., the linear and angular velocities. The system is kinematically constrained by its positive minimum turning radius $r$, which implies the bound $|v| \le \frac{1}{r}$. Without loss of generality, we assume $|v| \le 5$ and $|u| \le 10$ as the input constraints. The control objective is to reach the goal $x_f = 20$, $y_f = 20$ while avoiding static obstacles. The cost function is $J = \int_0^T \ell(x, u)\, dt + g(x, u)$, where $T = 100$, the running cost is $\ell(x, u) = 0.1 \times (\|x\| + \|u\|)$,
and the terminal cost is $g(x(T), u(T)) = 1000 \times \|(x(T), y(T)) - (x_f, y_f)\|$. The initial state is $\vec{x}_0 = 0$. In simulation, we consider the robot to have reached the target if $\|(x, y) - (x_f, y_f)\| \le \varepsilon$ for $\varepsilon \in [0, 1]$; we use $\varepsilon = 0.5$.

We select RBFs as basis functions and define $\phi_{\mathrm{rbf}} = [\phi_1, \ldots, \phi_N]^\intercal$ for $N$ center points. In the experiment, the center points include 1) a uniform grid in the $x$-$y$ coordinates with step sizes $\delta x = 5$ and $\delta y = 5$, and 2) the vertices of the obstacle. We also include the linear basis functions $\phi_{\mathrm{linear}} = [(x - x_f), (y - y_f), \theta]^\intercal$. The basis vector is $\phi = [\phi_{\mathrm{rbf}}^\intercal, \phi_{\mathrm{linear}}^\intercal]^\intercal$. We consider a bounded domain $-5 \le x \le 30$, $-5 \le y \le 30$, and $\theta \in [0, 2\pi]$, and thus the total number of basis functions is 80. The control input is $\vec{u} = [u, v]^\intercal$, where $u = w_u^\intercal \phi$ and $v = w_v^\intercal \phi$. The total number of weight parameters is twice the number of basis functions, in this case 160.

The following parameters are used: initial sample size $N_1 = 100$, improvement parameter $\varepsilon = 0.1$, smoothing parameter $\lambda = 0.5$, sample increment percentage $\alpha = 0.1$, and $\rho = 0.1$.

Figure 2: Convergence of the planning algorithm on the Dubins car. (a) The planned trajectory under the feedback policy $\langle \mu, \phi \rangle$ computed using the mean of the multivariate Gaussian over iterations (from the lightest to the darkest). (b) The total cost evaluated at the mean of the multivariate Gaussian over iterations.

In Fig. 2a we show the trajectory computed using the estimated mean of the multivariate Gaussian distribution over iterations, from the lightest (the first iteration) to the darkest (the last iteration, when the stopping criterion is met). The optimal trajectory is the darkest line. In Fig. 2b we show the cost computed using the mean of the multivariate Gaussian over iterations. SAOP converges after 22 iterations with 2200 samples, and the optimal cost is 697.29. Each iteration took about 20 to 30 seconds. However, the algorithm generates a collision-free path after only 5 iterations. Due to input saturation, the algorithm is only ensured to converge to a local optimum. However, in 24 independent runs, all runs converged to a local optimum close to the global one, as shown in the histogram in Fig. 3. Our current work is to implement trajectory-based contraction analysis using time-varying matrices $M(x, t)$ and an adaptive bound $\beta$, which are needed for the nonlinear Dubins car dynamics.

Figure 3: The frequency distribution of the optimal costs over 24 independent runs.

0.5 Conclusion

In this paper, an importance sampling-based approximate optimal planning and control method is developed. In the control-theoretic formulation of optimal motion planning, the planning algorithm performs direct policy computation using simulation-based adaptive search for an optimal weight vector corresponding to an approximate optimal feedback policy. Each iteration of the algorithm runs in time linear in the number of samples and in the time horizon of the simulated runs. However, it is hard to quantify the number of iterations required for MRAS to converge. One direction of future work is to incorporate multiple-distribution importance sampling to achieve faster and better convergence.
Based on contraction analysis of the closed-loop system, we show that, by modifying the sampling-based policy evaluation step in the algorithm, the proposed planning algorithm can be used for joint planning and robust control for a class of nonlinear systems under bounded disturbances. In a future extension of this work, we are interested in extending this algorithm to stochastic optimal control.

Bibliography

[1] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.

[2] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Transactions on Robotics and Automation, vol. 12, no. 4, pp. 566–580, 1996.

[3] S. M. LaValle, "Rapidly-exploring random trees: A new tool for path planning," Iowa State University, Tech. Rep., 1998.

[4] S. Karaman and E. Frazzoli, "Sampling-based algorithms for optimal motion planning," The International Journal of Robotics Research, vol. 30, no. 7, pp. 846–894, 2011.

[5] D. P. Bertsekas, "Approximate policy iteration: A survey and some new methods," Journal of Control Theory and Applications, vol. 9, no. 3, pp. 310–335, 2011.

[6] ——, Dynamic Programming and Optimal Control, 3rd ed., vol. II. Belmont, MA: Athena Scientific, 2011.

[7] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779–791, 2005.

[8] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.

[9] J. Hu, M. C. Fu, and S. I. Marcus, "A model reference adaptive search method for global optimization," Operations Research, vol. 55, no. 3, pp. 549–568, 2007.

[10] T. Homem-de-Mello, "A study on the cross-entropy method for rare-event probability estimation," INFORMS Journal on Computing, vol. 19, no. 3, pp. 381–394, 2007.

[11] M. Kobilarov, "Cross-entropy randomized motion planning," in Robotics: Science and Systems VII, p. 153, 2012.

[12] S. C. Livingston, E. M. Wolff, and R. M. Murray, "Cross-entropy temporal logic motion planning," in Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control. ACM, 2015, pp. 269–278.

[13] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Belmont, MA: Athena Scientific, 1995.

[14] M. Hauschild and M. Pelikan, "An introduction and survey of estimation of distribution algorithms," Swarm and Evolutionary Computation, vol. 1, no. 3, pp. 111–128, 2011.

[15] W. Lohmiller and J.-J. E. Slotine, "On contraction analysis for non-linear systems," Automatica, vol. 34, no. 6, pp. 683–696, 1998.

[16] X. Liu, Y. Shi, and D. Constantinescu, "Robust constrained model predictive control using contraction theory," in IEEE Conference on Decision and Control, Dec. 2014, pp. 3536–3541.