Feedback Motion Planning Under Non-Gaussian Uncertainty and Non-Convex State Constraints

Mohammadhussein Rafieisakhaei, Amirhossein Tamjidi, Suman Chakravorty, P. R. Kumar
Texas A&M University

Abstract— Planning under process and measurement uncertainties is a challenging problem. In its most general form, it can be modeled as a Partially Observed Markov Decision Process (POMDP). However, POMDPs are generally difficult to solve when the underlying spaces are continuous, particularly when beliefs are non-Gaussian, and the difficulty is further exacerbated when there are also non-convex constraints on the states. Existing algorithms for such challenging POMDPs are expensive in terms of computation and memory. In this paper, we provide a feedback policy in non-Gaussian belief space by solving a convex program for common non-linear observation models. The solution involves a Receding Horizon Control strategy that uses particle filters for the non-Gaussian belief representation. We also develop a way of capturing non-convex constraints in the state space and adapt the optimization to incorporate such constraints. A key advantage of this method is that it does not introduce additional variables into the optimization problem, and it is therefore more scalable than existing approaches to constrained problems in belief space. We demonstrate the performance of the method on several scenarios.

I. INTRODUCTION

Planning under uncertainty is a challenging problem for systems with partial observability. Many problems in robotics fall into this category, where there is an inherent uncertainty in the measurements obtained from sensors in addition to the uncertainty in the robot's own motion. The main challenge is that the controller's knowledge about the true state of the system is limited to the conditional probability distribution of the state, called the belief. The belief is a sufficient statistic [1] that encodes all the information available to the controller, including the history of actions and observations. The space of all possible beliefs, called the belief space, is very high dimensional in practical problems, and the problem consequently suffers from the curse of dimensionality [2], [3]. This is one of the reasons why it is extremely difficult to solve Partially Observable Markov Decision Processes (POMDPs) [4], [5], [2], which provide the general framework for planning under uncertainty. They also suffer from the curse of history [6] due to the exponential growth of decision choices, caused by the dependency of future decisions on previous ones.

Major point-based POMDP solvers such as PBVI, HSVI, Perseus, and SARSOP (cf. the survey in [7]) consider finite state, observation, and action spaces and develop a decision tree that can exactly solve the POMDP problem for the initial belief state. Recent point-based solvers such as MCVI [8], [9] allow continuous state spaces. However, in point-based solvers, the time complexity of the algorithms generally grows exponentially with the number of (sampled) states and the time horizon [10], [11]. Further, they guarantee optimality of their solution only for the particular initial belief state. This means that if there is a deviation from the planned trajectory during execution, it becomes impractical to re-plan and compensate for the errors accumulated due to the stochastic nature of the system. Therefore, these methods are not suitable for control strategies such as Receding Horizon Control (RHC) [12]–[14] that require a fast re-planning algorithm.
The Feedback-based Information RoadMap (FIRM) [15], [16] is a general framework for overcoming the curse of history that attempts to solve an MDP in the sampled belief space. The graph-based solution of FIRM introduced an elegant method for solving POMDPs with continuous underlying spaces. However, attention is restricted to Gaussian belief spaces [17]–[20], which can be insufficient in some problems.

Another issue is that, in the planning stage, observations are random variables, and so therefore are future beliefs, which is a major challenge for the computation of the solution. One approach samples the observations, generates the corresponding beliefs, and performs a Monte Carlo analysis [21]. Unfortunately, this has a heavy computational burden, which limits its usage. Another approach restricts attention to the maximum likelihood observation [13] and propagates the corresponding belief using the filtering equations (either linear Gaussian or non-Gaussian). This method is computationally better than the former approach; however, it still requires belief updates using the Bayesian filtering equations. In all these methods, based on a predicted observation, the belief state is propagated, a new observation is predicted, and the control policy is thereby determined.

In this paper, we propose a different approach to this problem. Essentially, we propose taking samples from the initial belief and propagating them via a noiseless system model. Then, we use the observation model's properties to guide the samples towards regions where the predicted observation error is reduced. What this means is that we attempt to control the system towards a subset of the state space where we predict that the dispersion of the ensemble of observation particles is reduced, and therefore the beliefs are more informative.

Mathematically, we provide a general framework to control a system under motion and sensing uncertainties using an RHC strategy. We define a basic convex planning problem that can be used to obtain an optimized trajectory for a robot. Then, we utilize knowledge from filtering to define a general cost function that incorporates the state-dependent linearized observation model, so as to follow a trajectory along which the system gains more information about its state from the environmental features, in a trade-off with the control effort along the path. We use a particle representation of the belief, and our belief space can be non-Gaussian. The convex formulation of the problem enables us to avoid Dynamic Programming [22] over the belief space, which is usually highly costly (in fact, intractable in the exact sense). In addition, in case the agent deviates from its planned path, the convexity of the planning problem allows us to stop execution and re-plan to obtain a better path towards the goal region. The execution of actions stops whenever the probability of reaching the goal region is higher than some predefined threshold. Moreover, we develop a method to adapt the optimization problem such that the non-convex constraints on the feasible state space are respected softly. In particular, we propose a special form of penalty function to incorporate the obstacles into the optimization cost, which enables us to respect the constraints without adding any new variables to the optimization problem.
Therefore, the optimization remains scalable to larger numbers of samples and longer time horizons, as opposed to sampling-based chance-constrained methods such as [23], [24]. Finally, we run the algorithm and show simulation results to support our work.

II. SYSTEM REPRESENTATION

In this section, we specify the ingredients required for the problem.

State space definition: For simplicity, we denote random variables with upper case letters and their values with lower case letters. Throughout the paper, the vectors $x \in \mathbb{X} \subset \mathbb{R}^{n_x}$, $u \in \mathbb{U} \subset \mathbb{R}^{n_u}$, and $z \in \mathbb{Z} \subset \mathbb{R}^{n_z}$ denote the state of the system, the action, and the observation, respectively.

System equations: The dynamics of the system and observations are as follows:
$$x_{t+1} = f(x_t, u_t, \omega_t), \quad (1a)$$
$$z_t = h(x_t) + \nu_t, \quad (1b)$$
where $\{\omega_t\}$ and $\{\nu_t\}$ are two zero-mean, independent, identically distributed (iid) random sequences that are mutually independent. In addition, $f : \mathbb{X} \times \mathbb{U} \times \mathbb{R}^{n_x} \to \mathbb{X}$ denotes the process dynamics (motion model), and $h : \mathbb{X} \to \mathbb{Z}$ denotes the observation model.

Belief: The belief or information state at time step $t$ is a function $b_t : \mathbb{X} \times \mathbb{Z}^t \times \mathbb{U}^t \to \mathbb{R}$, where $b_t(x, z_{0:t}, u_{0:t-1}, b_0) := p_{X_t | Z_{0:t}; U_{0:t-1}}(x \,|\, z_{0:t}; u_{0:t-1}; b_0)$, which is the posterior distribution of $x_t$ given the data history up to time $t$ and the initial belief $b_0$. The space of all possible belief states, the belief space, is denoted by $\mathbb{B}$. For simplicity of notation, we will denote the belief state by $b_t(x)$ or $b_t$ throughout this paper [1], [25]–[27].

Particle representation of belief: We consider a non-Gaussian model of the belief space and approximate the belief state $b_t$ at time step $t$ by $N$ particles in the state space $\{x^i_t\}_{i=1}^{N}$ with importance weights $\{w^i_t\}_{i=1}^{N}$ [27]–[29] as $b_t(x) \approx \sum_{i=1}^{N} w^i_t \delta(x - x^i_t)$, where $\delta(x)$ denotes the Dirac delta mass located at $x$.

III. FEEDBACK POLICY CALCULATION

In this section, we describe the method we use for calculating the feedback policy.

A. The General Problem

Optimization problem: Given an initial belief state $b_{t'}$ at time $t'$ and a goal state $x_g$, solve the following optimization problem:
$$\min_{u_{t':t'+K-1}} \; \sum_{t=t'}^{t'+K-1} \mathbb{E}[c(b_t, u_t)]$$
$$\text{s.t.} \quad b_{t+1} = \tau(b_t, u_t, z_{t+1}), \quad (2a)$$
$$x_{t+1} = f(x_t, u_t, \omega_t), \quad (2b)$$
$$z_t = h(x_t) + \nu_t, \quad (2c)$$
$$f_j(x) > 0, \; j \in [1, n_c], \quad (2d)$$
$$g(b_{t'+K}, x_g) = 0, \quad (2e)$$
where $c(\cdot, \cdot) : \mathbb{B} \times \mathbb{U} \to \mathbb{R}$ is the one-step cost function, $\tau : \mathbb{B} \times \mathbb{U} \times \mathbb{Z} \to \mathbb{B}$ denotes the belief dynamics, $f_j(x) > 0$ are the $n_c$ inequality constraints, and $g : \mathbb{B} \times \mathbb{X} \to \mathbb{R}$ denotes the terminal constraint.

RHC Strategy: The control loop is shown in Fig. 1. The RHC policy function $\tilde{\pi} : \mathbb{B} \to \mathbb{U}$ generates an action $u_t = \tilde{\pi}(b_t)$, which is the first element of the sequence of actions generated in problem (2). Once $u_t$ is executed, the state of the system transitions from $x_t$ to a new state $x_{t+1}$, and the sensors perceive a new measurement $z_{t+1}$. Given the updated data history, the estimator updates the belief as $b_{t+1} = \tau(b_t, u_t, z_{t+1})$. The new belief state is fed into the controller and the loop closes.

Fig. 1. The RHC control loop.

Stopping criterion: The execution of actions stops when we reach a belief state $b$ satisfying
$$\mathbb{P}(b, r, x_g) := \int_{x \in B_r(x_g)} b(x)\,dx \;\geq\; \breve{w}_{th}, \quad (3)$$
where $B_r(x_g) := \{x \in \mathbb{X} \text{ s.t. } \|x - x_g\|_2 < r, \; r > 0\}$ is an $r$-ball (in the $L_2$ norm) centered at $x_g$, and $0 < \breve{w}_{th} < 1$ is an arbitrary threshold [14].

Now that we have defined our problem, we outline our proposed approach to solve it.
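To make the control loop concrete, the following is a minimal Python sketch of the RHC loop of Fig. 1 together with the particle approximation of the stopping criterion (3). The helpers solve_inner_problem, execute, and update_belief are hypothetical placeholders for the inner-loop planner, the plant and sensors, and the particle filter implementing the belief dynamics tau, respectively.

```python
import numpy as np

def reach_probability(particles, weights, x_goal, r):
    """Particle approximation of the stopping criterion (3):
    P(b, r, x_g) ~= total weight of particles inside the r-ball around x_g."""
    inside = np.linalg.norm(particles - x_goal, axis=1) < r
    return np.sum(weights[inside])

def rhc_loop(b0, x_goal, K, r, w_th, solve_inner_problem, execute, update_belief):
    """RHC loop of Fig. 1: plan, apply the first action, filter, and re-plan
    until the belief mass near the goal exceeds the threshold w_th."""
    particles, weights = b0
    while reach_probability(particles, weights, x_goal, r) < w_th:
        u_seq = solve_inner_problem(particles, weights, x_goal, K)  # problem (2)
        u_t = u_seq[0]                                              # RHC: first action only
        z_next = execute(u_t)                                       # act, then sense
        particles, weights = update_belief(particles, weights, u_t, z_next)  # tau(b, u, z)
    return particles, weights
```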
B. Defining the Cost Function

In this subsection, we define the cost function used in our optimization problem.

Maximum-A-Posteriori estimate: In our optimization problem, we plan for the Maximum-A-Posteriori (MAP) state estimate $x^{MAP}_t = \arg\max_{x \in \mathbb{X}} b_t(x)$.

Quadratic cost: We define the one-step cost function $c(\cdot, \cdot)$ in the optimization problem as
$$c(b_t, u_t) = \mathbb{E}_{b_t}\big[(x_t - x^{MAP}_t)^T W^x_t (x_t - x^{MAP}_t)\big] + u_t^T V^u_t u_t, \quad (4)$$
where $W^x_t \succ 0$ and $V^u_t \succ 0$ are weight matrices. Their relative magnitudes capture the trade-off between the control effort and the cost of uncertainty in the system. However, unlike in the usual Linear Quadratic Regulator (LQR) controller, the weight matrices can be state and time dependent. In this paper, we assume $V^u_t$ to be constant and design $W^x_t$.

C. Linearizing System Equations

Linearized process model: Given a nominal trajectory of states and controls $\{x^p_t\}_{t=t'}^{t'+K}$ and $\{u^p_t\}_{t=t'}^{t'+K-1}$, we linearize $f(\cdot, \cdot, \cdot)$ about the nominal trajectory as follows:
$$x_{t+1} = f(x^p_t, u^p_t, 0) + A_t(x_t - x^p_t) + B_t(u_t - u^p_t) + G_t \omega_t = f^p_t + A_t x_t + B_t u_t + G_t \omega_t, \quad (5)$$
where $f^p_t := f(x^p_t, u^p_t, 0) - A_t x^p_t - B_t u^p_t$, and $A_t = \partial f(x, u, 0)/\partial x|_{x^p_t, u^p_t}$, $B_t = \partial f(x, u, 0)/\partial u|_{x^p_t, u^p_t}$, and $G_t = \partial f(x, u, \omega)/\partial \omega|_{x^p_t, u^p_t}$ are the corresponding Jacobian matrices.

Linearized observation model: We linearize $h(\cdot)$ about the MAP state $x^{MAP}_t$ as follows:
$$z_t = h(x^{MAP}_t) + H(x^{MAP}_t)(x_t - x^{MAP}_t) + \nu_t, \quad (6)$$
where $H(x^{MAP}_t) = \partial h(x)/\partial x|_{x^{MAP}_t}$. Thus, $H(x^{MAP}_t)$ is not a constant matrix; rather, it is a function of $x^{MAP}_t$, which in turn is a function of the control variables in the optimization problem.

D. Incorporating Filtering in the Cost

Following the dual control concept in [1], we use the time- and trajectory-dependent matrix $W^x_t := W(x^{MAP}_t) = H(x^{MAP}_t)^T R(x^{MAP}_t) H(x^{MAP}_t)$ as our weight matrix, where $R : \mathbb{X} \to \mathbb{R}^{n_z \times n_z}$ is a suitable weighting matrix, to be defined later. With this choice, it can be shown that the first term in (4) is equivalent to $\mathbb{E}[(z_t - z^{x^{MAP}}_t)^T R(x^{MAP}_t)(z_t - z^{x^{MAP}}_t)]$, which is the expected weighted innovation. By choosing this cost, we drop the filtering equation in (2), which reduces the complexity of our solution. In particular, if $R = I_{n_z}$, where $I_{n_z}$ is the $n_z$-dimensional identity matrix, the first term in (4) is $\mathrm{trace}(\mathrm{Cov}[z_t - z^{x^{MAP}}_t])$, where $z_t - z^{x^{MAP}}_t$ is the predicted deviation of the observation at time step $t$ from its nominal value. Conceptually, therefore, the minimum of this cost is attained over state trajectories along which the dispersion of the ensemble of observation trajectories is reduced in the covariance sense; the minimization seeks to reduce the uncertainty in the predicted observation, or equivalently in the estimate. Later, we explain that $R$ can be designed suitably to define a more tractable optimization problem.

E. Terminal Constraint

To reach a specific goal state while minimizing the uncertainty along the trajectory (localization), we set the terminal constraint in equation (2) as $x^{MAP}_{t'+K} = x_g$.

IV. PARTICLE REPRESENTATION AND THE COST

Calculating the expectation in (4) is intractable in general. Therefore, we use the particle representation to approximate this expectation and overcome the intractability in solving the problem. The overall cost in (4) can then be written as
$$\sum_{t=t'+1}^{t'+K} \Big[ \frac{1}{N} \sum_{i=1}^{N} (x^i_t - x^{MAP}_t)^T W(x^{MAP}_t)(x^i_t - x^{MAP}_t) + u_{t-1}^T V^u_t u_{t-1} \Big], \quad (7)$$
where $\{x^i_t\}_{i=1}^{N}$ is the set of particles obtained through particle filtering after taking action $u_{t-1}$ and perceiving $z_t$.
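For illustration, the following is a minimal sketch, not the paper's implementation, of how the weight matrix $W(x) = H(x)^T R(x) H(x)$ and the particle form of the one-step cost (4) could be computed for a single range-based measurement (Example 1 in Section IV-B below). It assumes the state is the planar position and that a single landmark is observed; the function names are hypothetical.

```python
import numpy as np

def range_obs_jacobian(x, landmark):
    """Jacobian H(x) of a range measurement h(x) = ||x - L||_2 (Example 1)."""
    d = x - landmark
    return (d / np.linalg.norm(d)).reshape(1, -1)        # 1 x n_x row vector

def weight_matrix(x_map, landmark):
    """Dual-control weight W(x) = H(x)^T R(x) H(x), with the design
    R(x) = ||x - L||_2^2 suggested for range-based measurements."""
    H = range_obs_jacobian(x_map, landmark)
    R = np.array([[np.sum((x_map - landmark) ** 2)]])    # scalar weighting, 1 x 1
    return H.T @ R @ H                                   # n_x x n_x

def one_step_cost(particles, weights, x_map, u, landmark, Vu):
    """Particle approximation of the one-step cost (4) with W_t^x = W(x_map)."""
    W = weight_matrix(x_map, landmark)
    dev = particles - x_map                              # N x n_x deviations
    state_cost = np.sum(weights * np.einsum('ij,jk,ik->i', dev, W, dev))
    return state_cost + u @ Vu @ u
```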
A. Predicting the Evolution of the Particles

We use the linear model (5) to predict the evolution of the particles in (7). Using the noiseless equations and given an initial set of particles $\{x^i_{t'}\}_{i=1}^{N}$ at time step $t'$, we can iteratively write
$$x^i_{t'+t+1} - x^{MAP}_{t'+t+1} = \tilde{A}_{t':t'+t}\,(x^i_{t'} - x^{MAP}_{t'}),$$
where $\tilde{A}_{t_1:t_2} := \prod_{\tau=t_1}^{t_2} A_\tau = A_{t_2} A_{t_2-1} \cdots A_{t_1}$ for $t_1 \leq t_2$; otherwise (i.e., for $t_1 > t_2$), $\tilde{A}_{t_1:t_2} := I_{n_x}$.

Simplified cost function: Let us define the vector $c_t := (c^{1T}_t, c^{2T}_t, \cdots, c^{NT}_t)^T \in \mathbb{R}^{N n_x}$, where $c^i_t := \frac{1}{N} \tilde{A}_{t':t-1}(x^i_{t'} - x^{MAP}_{t'}) \in \mathbb{R}^{n_x}$ for $1 \leq i \leq N$. Moreover, define the matrix $\bar{W}(x^{MAP}_t) := \mathrm{BlockDiag}(W(x^{MAP}_t))$, a block-diagonal matrix with $N$ equal diagonal blocks of $W(x^{MAP}_t)$. Thus, the simplified cost is
$$\sum_{t=t'+1}^{t'+K} \big[\, c_t^T \bar{W}(x^{MAP}_t)\, c_t + u_{t-1}^T V^u_t u_{t-1} \,\big], \quad (8)$$
where $c_t$ is a constant vector at time step $t$, defined as above. Moreover, $x^{MAP}_t = \tilde{A}_{t':t-1}\, x^{MAP}_{t'} + \sum_{s=t'}^{t-1} \tilde{A}_{s+1:t-1}(B_s u_s + f^p_s)$.

B. Convexity of the Cost

In this subsection, we go through the steps necessary to design the convex problem whose solution is the optimized trajectory over the convex feasible space.

Lemma 1: Given a smooth differentiable function $h(x) : \mathbb{X} \to \mathbb{R}$, define $l : \mathbb{X} \to \mathbb{R}$ as
$$l(x) := \sqrt{R(x)} \sum_{i=1}^{n_x} d_i \frac{\partial h(x)}{\partial x_i},$$
where $x = (x_1, \cdots, x_{n_x})^T \in \mathbb{X}$ and $d = (d_1, \cdots, d_{n_x})^T \in \mathbb{R}^{n_x}$ is an arbitrary vector. If $l$ is convex or concave in $x$, then $g : \mathbb{X} \to \mathbb{R}_{\geq 0}$, where $g(x) := d^T H(x)^T R(x) H(x) d$, is a convex function of $x$, where $H(x)$ is the Jacobian of $h$. The key step in proving this lemma is to show that $g(x) = (l(x))^2$.

Let us consider the two most common observation models in the literature and design $R$ for them.

Example 1: Range-based measurements: Let $h(x) = \|x - L\|_2$, where $\|\cdot\|_2$ denotes the Euclidean norm and $L$ represents a landmark location. Using $R(x) = \|x - L\|_2^2$, $g(x)$ of Lemma 1 becomes convex in $x$.

Example 2: Bearing-based measurements: Given the state vector $x = [x, y, \theta]^T$ and the landmark $L = [L_x, L_y]^T$, using $R(x) = (x - L_x)^2 + (y - L_y)^2$, $g(x)$ of Lemma 1 becomes convex in $x$.

The results of Lemma 1 can easily be extended to the case of multiple observations. For our purposes, we design $R$ to be a positive and increasing function of the distance from the landmarks, so that it helps maintain the convexity of the problem as required by Lemma 1. Therefore, as the state gets farther from the landmarks, the distance is penalized more heavily, and the corresponding observation bundle is penalized more as well.
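As a sketch of how the simplified cost (8) might be evaluated for a candidate control sequence, the code below propagates the control-independent particle deviations through the chained linearized dynamics and re-propagates only the MAP trajectory. It reuses the hypothetical weight_matrix helper from the previous sketch, and the linearization matrices A_list, B_list, f_list (the $A_t$, $B_t$, and $f^p_t$ terms of (5)) are assumed to be given.

```python
import numpy as np

def simplified_cost(u_seq, particles0, x_map0, A_list, B_list, f_list,
                    landmark, Vu, weight_matrix):
    """Evaluate the simplified cost (8) for a candidate control sequence.
    Particle deviations are propagated noiselessly through the chained
    linearized dynamics, so only the MAP trajectory depends on the controls."""
    K = len(u_seq)
    dev = (particles0 - x_map0) / len(particles0)   # c_{t'}^i, including the 1/N factor
    x_map = x_map0.copy()
    cost = 0.0
    for t in range(K):
        # one noiseless step of (5): x_{t+1} = A_t x_t + B_t u_t + f^p_t
        x_map = A_list[t] @ x_map + B_list[t] @ u_seq[t] + f_list[t]
        dev = dev @ A_list[t].T                      # multiply each deviation by A_t
        W = weight_matrix(x_map, landmark)           # W(x^MAP_t) = H^T R H
        cost += np.sum(np.einsum('ij,jk,ik->i', dev, W, dev))   # c_t^T Wbar c_t
        cost += u_seq[t] @ Vu @ u_seq[t]
    return cost
```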
V. NON-CONVEX CONSTRAINTS ON THE STATE

In this section, we provide our solution for handling non-convex constraints on the state. In particular, we consider the presence of non-convex obstacles, which make the feasible state space non-convex. We define a special type of penalty function to softly incorporate the non-convex constraints into the optimization cost function.

Constraints approximated by ellipsoids: Consider the $n_c$ non-convex constraints $f_i(x) > 0$ for $i = 1, \cdots, n_c$. Given $x = (x_1, \cdots, x_{n_x})^T \in \mathbb{X}$, the compact volume constrained by $f_i(x) \leq 0$ can be covered by a finite number, $n_b$, of ellipsoids of different sizes:
$$f'_i(x) := \sum_{j=1}^{n_x} -\alpha_{ij}(x_j - c'_{ij})^2, \quad i = 1, \cdots, n_b,$$
where $c'_i := (c'_{i1}, c'_{i2}, \cdots, c'_{in_x})^T \in \mathbb{R}^{n_x}$ and $\alpha_i := (\alpha_{i1}, \alpha_{i2}, \cdots, \alpha_{in_x})^T \in \mathbb{R}^{n_x}$ are the parameters of these ellipsoids. The new set of constraints $\{f'_i(x) > 0, i = 1, \cdots, n_b\}$ is at least as conservative as $\{f_i(x) > 0, i = 1, \cdots, n_c\}$. Moreover, the matrices $C := (c'_1, c'_2, \cdots, c'_{n_b}) \in \mathbb{R}^{n_x \times n_b}$ and $A := (\alpha_1, \alpha_2, \cdots, \alpha_{n_b}) \in \mathbb{R}^{n_x \times n_b}$ contain all the parameters needed to capture the environment's obstacles.

Obstacle Penalty Function: We define an Obstacle Penalty Function (OPF) as follows:
$$f^b(x) := \max_{1 \leq i \leq n_b} \Big\{ M \, e^{\sum_{j=1}^{n_x} -\alpha_{ij}(x_j - c'_{ij})^2} \Big\}, \quad (9)$$
where $M > 0$ is a large penalizing number. By adding this OPF to the optimization cost, the optimization will seek to minimize this term as well. The OPF can be designed to be nearly zero (of the order of $10^{-40}$ or so) except in the vicinity of the area enclosed by the ellipsoid, where it gains value sharply. In particular, $\alpha_i$ and $M$ can be chosen such that the function takes suitably small values within a desired margin of safety outside the banned areas. A nice property of this penalty function is that it is continuous and infinitely differentiable everywhere, which is useful in gradient descent methods. Therefore, if we purely use gradient descent methods, initialized with a trajectory that does not lie in a local minimum of the OPF, this function acts as a barrier and prevents the trajectory from entering the infeasible states.

RHC inner-loop optimization problem: Given the initial belief state $b_{t'}(x) = (1/N) \sum_{i=1}^{N} \delta(x - x^i_{t'})$ at time $t'$, a goal state $x_g$, and obstacle parameters $(A, C)$, solve the following optimization problem:
$$\min_{u_{t':t'+K-1}} \; \sum_{t=t'+1}^{t'+K} \big[\, c_t^T \bar{W}(x^{MAP}_t)\, c_t + u_{t-1}^T V^u_t u_{t-1} + \beta f^b(x^{MAP}_t) \,\big] \quad \text{s.t.} \quad x^{MAP}_{t'+K} = x_g, \quad (10)$$
where $\beta$ is set to zero for solving the convex problem and to one for solving the problem with obstacles. All other parameters and functions are as defined before.

The overall planning algorithm is given in Algorithm 1.

Algorithm 1: Planning Algorithm
Input: Initial belief state $b_{t'}$, goal state $x_g$, planning horizon $K$, belief dynamics $\tau$, obstacle parameters $(A, C)$
1 while $\mathbb{P}(b_t, r, x_g) \leq \breve{w}_{th}$ do
2   Solve problem (10);
3   $u_t \leftarrow \tilde{\pi}(b_t)$;
4   Execute $u_t$, perceive $z_t$;
5   $b_{t+1}(x) \leftarrow \tau(b_t(x), u_t, z_t)$;
6 end
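For concreteness, here is a minimal sketch of the OPF in (9) and of the penalized objective of problem (10). The callables convex_cost and map_trajectory are hypothetical stand-ins for the cost (8) and the MAP trajectory prediction of Section IV, and the ellipsoid parameters are assumed to be supplied as the columns of the matrices A (alphas) and C (centers) defined above.

```python
import numpy as np

def obstacle_penalty(x, alphas, centers, M=1e6):
    """Obstacle Penalty Function (9): the largest of the smooth ellipsoidal bumps
    M * exp(-sum_j alpha_ij * (x_j - c'_ij)^2) over the n_b ellipsoids."""
    # alphas, centers: n_x x n_b arrays, one column per ellipsoid
    exponents = -np.sum(alphas * (x[:, None] - centers) ** 2, axis=0)
    return M * np.max(np.exp(exponents))

def penalized_objective(u_seq, beta, convex_cost, map_trajectory, alphas, centers):
    """Objective of problem (10): the convex cost (8) plus the OPF evaluated
    along the predicted MAP trajectory, switched on by beta."""
    cost = convex_cost(u_seq)                        # c_t^T Wbar c_t + u^T V u terms
    if beta:
        for x_map in map_trajectory(u_seq):          # x^MAP_t for t = t'+1, ..., t'+K
            cost += beta * obstacle_penalty(x_map, alphas, centers)
    return cost
```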
VI. SIMULATIONS AND EXAMPLES

In this section, we show some applications of our method. We perform all our simulations in MATLAB 2015b on a 2.90 GHz dual-core Core i7 machine with 8 GB of RAM. First, we perform a comparison test against an example from the literature and analyze the solutions of the two algorithms with various parameters. Then, we introduce a scenario that consists of guiding a robot between two walls. Our last experiment is a simulation in which a robot navigates a complex scenario in a house, with several features to localize against and a goal to reach.

A. Comparison Test in a Convex Scenario

In this experiment, we consider the light-dark example introduced in [14] and compare our results with the algorithm presented there. Since we did not have access to the authors' code, we implemented the method of [14] in MATLAB to the best of our ability. Note that in this scenario, we assume that there is no obstacle in the environment. It is important to note that the two methods are essentially different; however, we solve the same problem for the same system and the same initial and final states. As a consequence, there is no direct mapping between the two methods, so the optimization tuning parameters differ and have different meanings. The state, observation, and action spaces are 2-dimensional continuous spaces. The process model is linear with $A = B = I_2$, and the observation model is linear with a non-linear observation covariance, modeled as $R(x) = \mathrm{diag}(1/(2x_1 + 1), 1/(2x_1 + 1))$, where $x_1 > 0$ is the first element of the state. Therefore, as the robot moves towards larger values of $x_1$, it can localize better with less noisy observations. This is shown in Fig. 2 with a lighter background on the right side. One can verify that the problem is convex in both methods (with different shapes of cost functions). Figure 2 shows the optimized trajectory at time 0 with 1000 particles and a time horizon of 20. Moreover, to avoid control saturation, we add a constraint bounding the control input's magnitude at each step to 3.16. The initial distribution is a mixture of two Gaussians with equal variances of 0.0625 and means at (1.75, 0) and (2, 0.5). The solid line shows the result for our problem with $V^u_t = 0.065$. It should be noted that, in our simulations, changing $V^u_t$ does not produce unexpected behavior in the trajectory. Rather, by increasing the value of $V^u_t$, the agent acts more conservatively in terms of the consumed control energy.

Sensitivity of the solution to the number of particles: We increase the number of particles from 50 to 1000, 10000, and 100000 and analyze the optimization size and the time required for the optimization. In our method, increasing the number of particles does not increase the size of the optimization vector, nor does it introduce additional constraints. Therefore, as shown in Table I, the required optimization time does not increase significantly. In [14], however, the size of the optimization vector depends on the number of particles; specifically, it is equal to $(K n_u + N)$, whereas in our method it is only $K n_u$. Moreover, in their method, the addition of one particle adds $K$ new inequality constraints to the optimization problem, whereas in our method there is no such constraint and the number of optimization constraints is independent of the number of samples. The results of Table I show that our method is scalable with the number of particles. As indicated in the table, for $N = 10000$ and $N = 100000$ we could not perform the optimization for the method in [14] because the required memory could not be allocated.

Sensitivity of the solution to the time horizon: Lastly, we perform the optimization for lookahead time horizons $K = 10, 20, 50$, and $100$ and report the required time in Table I. Once again, since the number of optimization variables is $K n_u$ (which is $2K$) and no constraints are added as the time horizon grows, the optimization time does not explode in our method. In [14], by contrast, increasing the time horizon increases the solution time significantly. The results in Table I show that our method is scalable with long time horizons as well. However, for $K = 50$ and $K = 100$, we could not perform the optimization for the method of [14] because the required memory could not be allocated.

TABLE I
THE RESULTS OF OUR COMPARATIVE SIMULATIONS FOR SEVERAL TIME HORIZONS AND PARTICLE NUMBERS IN THE CONVEX LIGHT-DARK SCENARIO.

Time horizon (K)                   |     20 |     20 |      20 |        20 |     10 |     20 |     50 |    100
Number of Particles (N)            |    100 |   1000 |   10000 |    100000 |   1000 |   1000 |   1000 |   1000
Our Method:
  Time (s)                         |   0.24 |   0.33 |    1.11 |     10.37 |   0.16 |   0.33 |   2.30 |   9.22
  # of Iterations                  |    288 |    288 |     288 |       288 |    170 |    288 |   1013 |   2215
  Function Tolerance               |  2e-03 |  2e-03 |   2e-03 |     2e-03 |  2e-03 |  2e-03 |  2e-03 |  2e-03
  Constraint Tolerance             | 5.551e-16 | 8.882e-16 | 5.551e-16 | 2.331e-15 | 1.110e-15 | 8.882e-16 | 3.839e-11 | 1.883e-11
  # of Opt. Vars.† (K nu)          |     40 |     40 |      40 |        40 |     20 |     40 |    100 |    200
  # of Opt. Constrs.† (1)          |      1 |      1 |       1 |         1 |      1 |      1 |      1 |      1
Method of [14]:
  Time (s)                         |   49.0 | 311.32 |       * |         * |  80.22 | 311.32 |      * |      *
  # of Iterations                  |  40000 |  40000 |       - |         - |  40000 |  40000 |      - |      -
  Function Tolerance               |  2e-02 |  2e-02 |       - |         - |  2e-02 |  2e-02 |      - |      -
  Constraint Tolerance             | 3.509e-04 | 5.853e-04 | -   |       - | 5.671e-04 | 5.853e-04 | -  |      -
  Required Memory (GB)             |      - |      - |    15.0 |    1490.7 |      - |      - |   37.6 |   76.0
  # of Opt. Vars.† (K nu + N)      |    140 |   1040 |  10,040 |   100,040 |   1020 |   1040 |   1100 |   1200
  # of Opt. Constrs.† (N(K-1)+1)   |   1901 | 19,001 | 190,001 | 1,900,001 |   9001 | 19,001 | 49,001 | 99,001
*: Unable to allocate enough memory to solve the problem.
†: '# of Opt. Vars.' specifies the number of optimization variables, and '# of Opt. Constrs.' specifies the number of constraints in the optimization problem.

Fig. 2. Light-dark example. Lighter states on the right indicate less observation noise. The solid blue and red dotted lines show the results of our method and of our implementation of [14], respectively.

Fig. 3. Robot within two walls. The OPF is visualized within the walls. The green and red lines show the results of the optimization without and with the walls, respectively.

B. Robot within Two Walls; Visualization of the OPF

In this section, we simulate a case where there are non-convex constraints in the state space.
Figure 3 depicts the results for a case where the system starts with a distribution about its initial state and wants to reach the goal state while minimizing the localization error and spending little energy. The green and red lines show the solution of the convex problem without walls and the solution of the problem with the walls added, respectively. As can be seen, there are three information sources, shown as lighter spots in Fig. 3. The observation model is range-based, as described in Example 1. To obtain the green trajectory, the convex optimization problem (initialized with an arbitrary solution) is solved. Then, the green trajectory (which is not feasible for the case with walls) is used as the initial trajectory for the optimization with the OPF, yielding the red trajectory, which also avoids the walls.

C. Complex Scenario: Robot in a House

Figure 4 depicts the results for two cases in which the objective is similar to the previous example. In the first case, the robot is placed in a room and wants to reach a room on the other side of the house. Given an initial trajectory, shown by red dots, the optimization produces an optimized trajectory that seeks out the information sources in each room, while the penalty functions keep the robot away from the obstacles. In this case, the lookahead time horizon is set to K = 100. In the second case, the start and the goal of the robot lie in the same room, and therefore the optimization can solve the problem starting from any arbitrary trajectory within that room, such as the straight line.

Fig. 4. A holonomic system in a complex scenario. Solid lines show the optimized trajectories and dotted lines the initial trajectories for two different scenarios. The longer trajectory involves obstacles, the other does not. The dots around the start points show the initial particles. Landmarks are marked as stars, and information is coded by color (lighter means more information). The lookahead time horizon is 100 for the longer trajectory and 30 for the other.

VII. CONCLUSION

In this paper, we proposed a method for controlling a stochastic system from a starting configuration in the state space to a goal region. Our method consists of a receding horizon control strategy, where planning occurs in the belief space rather than in the underlying state space.
Our solution consists of solving a convex program for the unconstrained state space. Hence, it can be solved quickly and in an online manner, so that the robot can re-plan in case of deviations from the planned trajectory due to the stochastic and noisy nature of the system. Moreover, we proposed a method for incorporating the non-convex constraints into the optimization problem without adding new variables, which enables us to maintain the scalability of our solution for large numbers of particles and longer lookahead horizons. In future work, we will extend this method to richer process models and to cases in which the underlying assumptions remain reliable over longer distances.

ACKNOWLEDGMENT

This material is based upon work partially supported by AFOSR Contract No. FA9550-13-1-0008, the U.S. Army Research Office under Contract No. W911NF-15-1-0279, and the NSF under Contract Nos. CNS-1302182 and Science & Technology Center Grant CCF-0939370.

REFERENCES

[1] P. R. Kumar and P. P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. Englewood Cliffs, NJ: Prentice-Hall, 1986.
[2] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, vol. 101, pp. 99–134, 1998.
[3] R. Zhou and E. A. Hansen, "An improved grid-based approximation algorithm for POMDPs," IJCAI, 2001.
[4] K. Astrom, "Optimal control of Markov decision processes with incomplete state estimation," Journal of Mathematical Analysis and Applications, vol. 10, pp. 174–205, 1965.
[5] R. D. Smallwood and E. J. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, vol. 21, no. 5, pp. 1071–1088, 1973.
[6] J. Pineau, G. Gordon, and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," in International Joint Conference on Artificial Intelligence, 2003, pp. 1025–1032.
[7] G. Shani, J. Pineau, and R. Kaplow, "A survey of point-based POMDP solvers," Autonomous Agents and Multi-Agent Systems, vol. 27, pp. 1–51, 2013.
[8] H. Bai, D. Hsu, W. Lee, and V. Ngo, "Monte Carlo value iteration for continuous-state POMDPs," Algorithmic Foundations of Robotics IX, pp. 175–191.
[9] Z. Lim, W. Lee, and D. Hsu, "Monte Carlo value iteration with macro-actions," Advances in Neural Information Processing Systems, pp. 1287–1295.
[10] C. Papadimitriou and J. N. Tsitsiklis, "The complexity of Markov decision processes," Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987.
[11] O. Madani, S. Hanks, and A. Condon, "On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems," in Proceedings of the Sixteenth Conference on Artificial Intelligence (AAAI), 1999, pp. 541–548.
[12] R. Platt, L. Kaelbling, T. Lozano-Perez, and R. Tedrake, "Efficient planning in non-Gaussian belief spaces and its application to robot grasping," International Symposium on Robotics Research, 2011.
[13] R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez, "Belief space planning assuming maximum likelihood observations," in Robotics: Science and Systems (RSS), 2010.
[14] R. Platt, "Convex receding horizon control in non-Gaussian belief space," in International Workshop on the Algorithmic Foundations of Robotics, 2012.
[15] A. Agha-mohammadi, S. Chakravorty, and N. Amato, "FIRM: Feedback controller-based Information-state RoadMap, a framework for motion planning under uncertainty," in International Conference on Intelligent Robots and Systems (IROS), 2011.
[16] A. Agha-mohammadi, S. Chakravorty, and N. Amato, "FIRM: Sampling-based feedback motion planning under motion uncertainty and imperfect measurements," International Journal of Robotics Research, no. 2, 2014.
[17] S. Prentice and N. Roy, "The belief roadmap: Efficient planning in linear POMDPs by factoring the covariance," 2008.
[18] A. Bry and N. Roy, "Rapidly-exploring random belief trees for motion planning under uncertainty," in IEEE International Conference on Robotics and Automation, 2011.
[19] M. P. Vitus and C. J. Tomlin, "Closed-loop belief space planning for linear, Gaussian systems," in ICRA, 2011, pp. 2152–2159.
[20] J. Van Den Berg, S. Patil, and R. Alterovitz, "Motion planning under uncertainty using iterative local optimization in belief space," The International Journal of Robotics Research, vol. 31, no. 11, pp. 1263–1278, 2012.
[21] A. Beskos, O. Papaspiliopoulos, and G. Roberts, "Monte Carlo maximum likelihood estimation for discretely observed diffusion processes," The Annals of Statistics, pp. 223–245, 2009.
[22] M. Sniedovich, "Dijkstra's algorithm revisited: the dynamic programming connexion," Control and Cybernetics, vol. 35, no. 3, pp. 599–620, 2006.
[23] R. M. Van Slyke and R. Wets, "L-shaped linear programs with applications to optimal control and stochastic programming," SIAM Journal on Applied Mathematics, vol. 17, no. 4, pp. 638–663, 1969.
[24] L. Blackmore, M. Ono, A. Bektassov, and B. C. Williams, "A probabilistic particle-control approximation of chance-constrained stochastic predictive control," IEEE Transactions on Robotics, vol. 26, no. 3, pp. 502–517, 2010.
[25] E. J. Sondik, "The optimal control of partially observable Markov processes," PhD thesis, Stanford University, 1971.
[26] D. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed. Athena Scientific, 2007.
[27] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2005.
[28] D. Crisan and A. Doucet, "A survey of convergence results on particle filtering methods for practitioners," IEEE Transactions on Signal Processing, vol. 50, no. 3, 2002.
[29] A. Doucet, J. de Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. New York: Springer, 2001.