arXiv:1709.00546v1 [cs.RO] 2 Sep 2017

Autonomous Waypoint Generation with Safety Guarantees: On-Line Motion Planning in Unknown Environments

Sanjeev Sharma
Graduate Student of Computing Science, University of Alberta, Canada. Email: sanjeev1@ualberta.ca
(The author did this work in 2013, when he was a graduate student in the Computing Science department at the University of Alberta.)

Abstract—On-line motion planning in unknown environments is a challenging problem, as it requires (i) ensuring collision avoidance and (ii) minimizing the motion time, while continuously predicting where to go next. Previous approaches to on-line motion planning assume that a rough map of the environment is available, thereby simplifying the problem. This paper presents a reactive on-line motion planner, Robust Autonomous Waypoint generation (RAW), for mobile robots navigating in unknown and unstructured environments. RAW generates a locally maximal ellipsoid around the robot, using semi-definite programming, such that the surrounding obstacles lie outside the ellipsoid. A reinforcement learning agent then generates a local waypoint in the robot's field of view, inside the ellipsoid. The robot navigates to the waypoint and the process iterates until it reaches the goal. By following the waypoints, the robot navigates through a sequence of overlapping ellipsoids and avoids collision. The robot's safety is guaranteed theoretically, and the claims are validated through rigorous numerical experiments in four different experimental setups. Near-optimality is shown empirically by comparing RAW trajectories with the globally optimal trajectories.

I. INTRODUCTION

Autonomous vehicles and robots will soon help, or perhaps replace, humans in tasks that are time consuming, hard or unsafe. Such tasks include autonomous driving, search and rescue, patrolling, and engaging with enemy forces, often in unseen, hostile and unstructured environments. A common challenge underlying these tasks is efficient on-line motion planning: deciding where to go next, while ensuring collision avoidance and minimizing the motion time. Thus, the trajectory generator for such vehicles and robots should be able to (i) minimize the motion time, (ii) decide where to go, and (iii) ensure collision avoidance.

This paper presents an on-line reactive motion planner, Robust Autonomous Waypoint generation (RAW), for mobile robots navigating in unknown and unstructured environments. RAW generates a locally maximal ellipsoid around the robot, using semi-definite programming, separating the robot from the surrounding obstacles. A reinforcement learning agent then decides where to go and generates a waypoint inside the ellipsoid. The robot then navigates towards the waypoint, remaining inside the ellipsoid. By following the waypoints to the goal, the robot navigates through a sequence of (overlapping) ellipsoids: a virtual, obstacle-free tunnel. RAW assumes that the robot has a limited field of view and has access to its coordinates or its position relative to the goal.

On-line motion planning for mobile robots has been a focus of the robotics community for the past three decades. Previous approaches can be broadly categorized as: (i) assuming that a complete map of the environment is available; (ii) using some known structure in the environment; or (iii) assuming that a global path or a global guidance function is available (as may be derived from a potential function or through a graph search). Fox et al. [4] introduced the dynamic window approach, which searches for admissible velocities in a given time frame to avoid collisions in known environments.
Fiorini and Shiller [3] proposed velocity obstacles for motion planning in known environments. Velocity-obstacle approaches require careful tuning of the time-step parameter to avoid poor performance. Shiller et al. [15, 16] introduced an on-line planner that generates a sequence of intermediate goals in the environment and produces a trajectory to the goal passing through the intermediate goals. However, the approach assumes that a full map of the environment is available. Sampling-based approaches, like RRT [8], have been successfully demonstrated in many robotics applications. However, RRT-based methods are in general limited to known environments.

Quinlan and Khatib [11] proposed the elastic band for optimizing an initial collision-free configuration-space path to the goal. The path is adjusted (and smoothed) on-line for sensor-based obstacle avoidance. However, the method assumes that deviations from the global path are small. If the deviations are large, the approach may fail, requiring the global path to be recomputed using the updated information (see also [9] and [1] for similar works). The assumption that such a global path to the goal is available does not hold in this paper.

Local path-set approaches were used by many finalists of the DARPA Urban Challenge, successfully demonstrating their application to motion planning in unknown environments. Kuwata et al. [7] presented non-fixed random path-sets using RRT. A similar work is the fixed path-set presented by Knepper et al. [5] (and references therein). Path-set methods rely on a global guidance function to guide the search towards the goal, and constructing such a function often requires a map of the environment. In the DARPA Urban Challenge, the structure of the problem (following a road) was used as a heuristic to constrain the paths to end parallel to the road. Such a heuristic, or a navigation function, may not be available in unstructured and unknown environments.

Minguez and Montano [10] introduced the ego-kinodynamic space (E-KS), which builds a local space for generating trajectories and reactively avoiding collisions. E-KS takes the robot's stopping distance into account and guarantees safety within a certain time interval. However, (i) it uses potential fields to guide the robot to the goal, and potential fields assume that a map of the environment is available; and (ii) with a limited field of view, safety may not be guaranteed at future time-steps: a trajectory along which the braking constraint is not yet active may lead to collision with an obstacle not observed in the current view. A similar approach is anytime re-planning using a graph search in a local high-dimensional lattice-space [6, 21]. The robot executes a global path obtained using the graph search, and re-planning is performed upon receiving updated information. The changes in the environment are assumed to be small, or the global plan may fail, requiring the global search to be re-performed.

II. THE PROPOSED ALGORITHM: RAW

Having discussed the underlying simplifications of the previous approaches to on-line motion planning, this section presents the proposed approach for on-line motion planning in unknown and unstructured environments. The previous work [14] presented an autonomous waypoint generation strategy (AWGS) for on-line path planning in unknown and unstructured environments.
In AWGS, a reinforcement learning [17] (RL) agent analyzes the local surroundings of the robot and then generates a waypoint in its field of view (FOV). The robot then navigates to the waypoint and the process iterates until it reaches the goal. AWGS builds a novel representation that makes the RL agent's policy environment-independent: the policy is learned in one environment and can be reused in novel environments without relearning. This makes AWGS suitable for mobile robot navigation in unknown environments. However, AWGS assumes that the robot can execute arbitrary motion commands to avoid collisions, and therefore disregards the safety concern. A robot with kinodynamic constraints, such as a minimum turn-radius constraint, cannot move arbitrarily to avoid collisions. Thus, the robot's safety is not guaranteed in AWGS. To ensure safety of the robot, a waypoint generated at the current time-step, say t_current, must ensure safety at any time t > t_current.

In AWGS, the agent learns its policy using a reward function that encourages it to reach the goal quickly, while penalizing collisions with the obstacles. Designing a reward function that balances safety and optimality is a challenging problem [18, 19]. Too large a penalty for collisions may make the RL agent too conservative, and it may never navigate through a tight space even if a collision-free path exists. Too low a penalty may result in collisions, as the RL agent will try to reach the goal quickly for a higher reward, ignoring the collisions. Furthermore, when faced with the robot's motion constraints, the RL agent may fail to recognize a waypoint that may result in a collision in the future. To ensure the robot's safety, the problem can be formulated as a constrained Markov decision process [19] (C-MDP). However, solving a C-MDP is computationally very expensive, requiring a solution to a mixed-integer linear program or an exhaustive look-ahead search with branch and bound; both are limited to a small, finite (discrete) state-space and require a full map of the environment.

Figure 1: RAW uses a semi-definite program (the SDP filter) to discard potentially dangerous waypoint locations, guaranteeing the robot's safety in unknown environments. The modification to AWGS is indicated using a red box. [Block diagram: local information feeds the feature space construction; the SDP filter removes infeasible waypoints before the RL agent, with its reward function, passes a waypoint to the motion planner on the robot.]

RAW addresses the safety issue by forming a locally maximal ellipsoid around the robot, using semi-definite programming (SDP), and discarding the waypoint locations that lie outside the ellipsoid. The surrounding obstacles are constrained to lie outside the ellipsoid. Any two consecutive ellipsoids overlap such that the robot lies in their region of intersection. Thus, if the robot's initial configuration is feasible, i.e., separated from the surrounding obstacles by an ellipsoid, then the robot's trajectory is guaranteed to be collision-free. As shown in Figure 1, RAW uses the AWGS architecture for waypoint generation, but removes the potentially dangerous waypoint locations (SDP filter). The ellipsoid effectively filters out the infeasible actions using the SDP.
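To make the filtering step concrete, the following is a minimal sketch of how the SDP filter restricts the RL agent's action set. It is written in Python with hypothetical names (the paper's implementation is in MATLAB), and it assumes that the slack values λ_j from the SDP of section IV-B are already available:

import numpy as np

def select_waypoint(q_values, slacks):
    # q_values: (N,) array of Q(s, a_j), one entry per grid-point j.
    # slacks:   (N,) array of the SDP slacks lambda_j; a grid-point with
    #           lambda_j >= 1 lies outside the ellipsoid (section IV-B).
    feasible = slacks < 1.0                         # the SDP filter
    if not np.any(feasible):
        raise RuntimeError("no feasible waypoint: SDP left no actions")
    masked = np.where(feasible, q_values, -np.inf)  # discard unsafe actions
    return int(np.argmax(masked))                   # greedy over feasible set

The RL agent's greedy selection is thus unchanged; the SDP filter only shrinks the set of actions it may choose from.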
The contribution of this paper is an on-line reactive motion planner for mobile robot navigation in unknown and unstructured environments that (i) guarantees collision avoidance with unforeseeable obstacles, assuming that the robot's initial configuration is feasible, and (ii) produces trajectories that are not far from the optimal trajectories. In RAW, the RL agent takes a locally feasible, optimal action at each planning cycle. However, in unknown environments no theoretical guarantee can be provided on path optimality. Thus, RAW trajectories are compared with the optimal trajectories and with those generated by RRT, to show that the sequence of locally optimal actions generates reasonably good trajectories (when the environment has no dead-ends, as discussed later). The next section discusses AWGS; Section IV describes the robot model and waypoint generation under the robot's motion constraints; Section IV-B describes the SDP for filtering the waypoints to ensure the robot's safety; Section V discusses the empirical results; and Section VI concludes the paper.

III. NOTATION, ASSUMPTIONS AND BACKGROUND

It is assumed that: (i) the robot's field of view (FOV) is limited; (ii) obstacles in the FOV are static and fully characterized; and (iii) the robot has access to its coordinates z_t, at time-step t, and to the coordinates of the goal z_g. The robot's configuration (position and orientation) at time t is represented as C_t. Obstacles in the FOV are represented using a point cloud, O_t, of m obstacles at time t; v^T represents the transpose of a vector v; and S^2_{++} represents the space of all 2 × 2 positive definite symmetric matrices.

Figure 2: (a) shows the grid-points as red dots and (b) shows the potential map; red regions represent higher (i.e., dangerous) grid-point values.

A. Reinforcement Learning: Markov Decision Process (MDP)

In an MDP, the state-action value function Q^π(s_t, a_t) for a policy π is the expected return of taking action a_t in state s_t and then following π. The probability of taking an action a_t in s_t is π(s_t, a_t). The RL agent's task is to learn a policy π that maximizes the expected sum of discounted future rewards from any s_t, with discount factor γ ∈ [0, 1). By taking action a_t in s_t, the agent makes a transition to state s_{t+1} and receives a reward r_t(s_t, a_t, s_{t+1}). Q^π is approximated using a linear function approximation architecture: Q(s_t, a_t) = ⟨φ(s_t, a_t), w⟩, where φ(s_t, a_t) ∈ R^k is the state-action feature vector for (s_t, a_t) and w ∈ R^k is learned from samples.

B. Autonomous Waypoint Generation Strategy (AWGS)

In AWGS, an RL agent analyzes the robot's FOV. The local region in the FOV is represented by a potential map, which requires discretizing the FOV. The FOV is defined by r ∈ (0, R_FOV] and θ ∈ [−θ_FOV, θ_FOV]; {R_FOV, θ_FOV} control the robot's view. r and θ are discretized in steps of dr and dθ. The total number of discrete grid-points, N, in the robot's FOV is ((2θ_FOV/dθ) + 1) · (R_FOV/dr). One of these grid-points is then selected as a waypoint by the RL agent. The obstacles in the FOV are represented as a point cloud O_t at time-step t. A potential map V ∈ [0, 1]^N is then computed; the i-th element of V is the potential of the i-th grid-point. Figure 2 shows the grid-points and the corresponding potential map; a short sketch of this discretization follows.
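As an illustration only (the function name and the robot-frame convention are assumptions, not the paper's code), the polar FOV grid can be generated as follows. With the parameters used later in section V, {R_FOV, dr, θ_FOV, dθ} = {5 m, 0.2 m, 60°, 1°}, this gives N = 121 × 25 = 3025 grid-points:

import numpy as np

def fov_grid(r_fov=5.0, dr=0.2, theta_fov=60.0, dtheta=1.0):
    # Discretize r in (0, R_FOV] and theta in [-theta_FOV, theta_FOV];
    # N = ((2*theta_FOV/dtheta) + 1) * (R_FOV/dr) grid-points, returned
    # as (x, y) in the robot's frame (robot at origin, heading = +x).
    radii = np.arange(dr, r_fov + 1e-9, dr)
    angles = np.deg2rad(np.arange(-theta_fov, theta_fov + 1e-9, dtheta))
    r, th = np.meshgrid(radii, angles)
    return np.stack([(r * np.cos(th)).ravel(),
                     (r * np.sin(th)).ravel()], axis=1)   # shape (N, 2)

grid = fov_grid()
assert grid.shape == (121 * 25, 2)   # N = 3025 for the default parameters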
Next, a feature space is constructed for the RL agent, using the potential map and three geometric parameters for each of the grid-points in the FOV. The first two geometric parameters for a grid-point measure the progress towards the goal if that grid-point is selected as the waypoint. The third parameter is ζ_j = 1 if a straight-line path to the j-th grid-point collides with an obstacle, and ζ_j = 0 otherwise. The robot's current position represents the agent's state, while an action corresponds to selecting one of the grid-points as the waypoint. Thus, at each time-step, the RL agent has N possible actions. Once the waypoint is generated, the robot follows a local trajectory to the waypoint for time ∆T. The RL agent learns a policy using a reward function that penalizes the agent for generating waypoints in obstructed regions of the FOV (when ζ_j = 1) or close to the boundary of obstacles (as measured by the grid-point's value in the potential map). For selecting the j-th grid-point as a waypoint, it receives the reward:

reward = −10^3 ζ_j − α_1 V_j + max{500 α_2, −5},

where α_1 is a constant that controls the penalty for defining waypoints close to obstacles, and α_2 = 1 if the waypoint is defined at the goal and −1 otherwise. The max{·} term thus returns either +500 or −5, encouraging the agent to reach the goal quickly.

Figure 3: (a) A simple car-like robot; and (b) a situation where the robot gets close to the wall and crosses it to reach the goal at the top-right corner, i.e., the safety of the robot is not guaranteed in AWGS (the direction in which the robot is navigating is blocked; the robot gets arbitrarily close to the wall and the collision becomes inevitable).

IV. MOTION PLANNING: CAR-LIKE ROBOT

This paper considers a simple (Reeds and Shepp's) car-like robot, shown in Figure 3a. The configuration of the robot in space at time t is C_t = (x_t, y_t, θ_t): its position (z_t) and its orientation with respect to the x-axis at time t. The distance between the front and rear axle, L, is 1 m, and the minimum turning radius is also 1 m. The coordinates of the four corners of the robot's body (a rectangle), according to its current configuration C_t, are represented using a matrix Ξ(C_t) ∈ R^{4×2}. The kinematic model of this robot is:

(ẋ, ẏ, θ̇) = (cos θ, sin θ, 0) u + (0, 0, 1) v,    (1)

where u ∈ {−1, +1} describes the linear velocity and v ∈ [−1, +1] describes the angular velocity. The magnitude of the linear velocity is always 1 m/s. Given any two configurations (x_1, y_1, θ_1) and (x_2, y_2, θ_2) in space, a time-optimal trajectory connecting them can be computed efficiently [20]. A minimal simulation sketch of this model follows.
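The following is a minimal sketch of the kinematic model (1) under simple Euler integration; it is illustrative only and is not the paper's trajectory generator, which uses the time-optimal trajectories of [20]:

import numpy as np

def step(config, u, v, dt=0.01):
    # One Euler step of model (1). config = (x, y, theta);
    # u in {-1, +1}: forward/reverse at 1 m/s;
    # v in [-1, +1]: angular velocity (minimum turning radius 1 m).
    x, y, theta = config
    return (x + dt * u * np.cos(theta),
            y + dt * u * np.sin(theta),
            theta + dt * v)

# Example: drive forward along the tightest left turn for 1 s.
c = (0.0, 0.0, 0.0)
for _ in range(100):
    c = step(c, u=+1, v=+1)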
A. Safety Issues: Waypoint Generation

This section addresses the challenges AWGS faces under the robot's motion constraints in unknown environments. Figure 3b shows the robot navigating through a sequence of corridors to reach the goal at (50, 50). Due to limited visibility, as the robot comes too close to the corner at around (10, 50), collision becomes inevitable. The agent did not take the unforeseeable obstacles into account and generated waypoints that led the robot too close to the obstacles. As all the possible waypoint locations result in collision, the RL agent generates a waypoint that crosses the obstacles. The waypoint generation process thus does not account for possible future collisions, disregarding safety. RAW addresses this issue by filtering out the potentially dangerous waypoint locations using the SDP, thereby limiting the actions available to the RL agent. The RL agent thus selects waypoints that guarantee collision avoidance at future time-steps.

B. Convex Semi-Definite Programming (SDP): Filtering

This section discusses the SDP for ellipsoid generation, which filters out the potentially dangerous waypoint locations. The ellipsoid is generated such that the robot lies inside the ellipsoid, the obstacles lie outside, and the goal may lie inside the ellipsoid. Also, as discussed earlier, the RL agent selects one of the N grid-points in the FOV as a waypoint; thus, the ellipsoid tries to incorporate as many grid-points as possible inside it.

Let the robot's configuration be C_t. Let Ξ(C_t)_k, k = {1, ..., 4}, represent the (x, y) coordinates of the k-th corner of the robot's body. The ellipsoid Ψ_t, parametrized by P_t ∈ S^2_{++}, q_t ∈ R², r_t ∈ R at time t, is represented as Ψ_t = {x ∈ R² | x^T P_t x + q_t^T x + r_t ≤ 0}. Let z_i be the location of the i-th, i = {1, ..., m}, point obstacle in the cloud O_t. Let γ_j ∈ R², j = {1, ..., N}, be the location of the j-th grid-point in the robot's FOV. Let |P_t| denote the determinant of P_t. The SDP for ellipsoid formation and filtering is:

minimize over P_t, q_t, r_t, λ, ν:
    ν + log(|P_t|^{−1}) + Σ_{j=1}^{N} λ_j
subject to:
    P_t ⪰ I;
    Ξ(C_t)_k P_t (Ξ(C_t)_k)^T + Ξ(C_t)_k q_t + r_t ≤ −1,   k = {1, ..., 4};
    z_i^T P_t z_i + q_t^T z_i + r_t ≥ 1,   i = {1, ..., m};
    γ_j^T P_t γ_j + q_t^T γ_j + r_t ≤ −1 + λ_j,   j = {1, ..., N};
    z_g^T P_t z_g + q_t^T z_g + r_t ≤ −1 + ν.

λ_j and ν are slack variables, allowing soft constraints for including the j-th grid-point and the goal, respectively, inside the ellipsoid. If λ_j ≥ 1, then the j-th grid-point is marked as an infeasible waypoint location. The constraint P_t ⪰ I, where I ∈ S^2_{++} is the identity matrix, is a positive definite constraint on the matrix P_t. The objective term log(|P_t|^{−1}) minimizes the volume of the ellipsoid. This keeps the volume of the ellipsoid in check, as the inclusion of the slack variables (λ_j and ν) in the objective results in an expansion of the ellipsoid (as much as possible, constrained by the surrounding obstacles). The hard constraints for the robot and the surrounding obstacles separate the robot from the obstacles. Minimizing ν is equivalent to minimizing the distance between the goal and the boundary of the ellipsoid. Minimizing Σ_j λ_j results in the inclusion of as many grid-points inside the ellipsoid as possible. A solver-level sketch of this program is given below; the next section theoretically guarantees the robot's safety.
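The sketch below states this SDP in Python with cvxpy. The paper solves the program in MATLAB with CVX [2]; the variable names, the solver choice, and the non-negativity of the slacks are assumptions of this sketch:

import cvxpy as cp
import numpy as np

def sdp_filter(corners, obstacles, grid, goal):
    # corners:   (4, 2) robot body corners Xi(C_t)
    # obstacles: (m, 2) point-obstacle cloud O_t
    # grid:      (N, 2) FOV grid-points gamma_j
    # goal:      (2,)   goal coordinates z_g
    # Returns (P, q, r) of Psi_t = {x : x'Px + q'x + r <= 0} and the
    # boolean mask of infeasible grid-points (lambda_j >= 1).
    N = grid.shape[0]
    P = cp.Variable((2, 2), symmetric=True)
    q = cp.Variable(2)
    r = cp.Variable()
    lam = cp.Variable(N, nonneg=True)
    nu = cp.Variable(nonneg=True)
    quad = lambda x: cp.quad_form(x, P) + q @ x + r
    cons = [P >> np.eye(2)]                                # P >= I
    cons += [quad(c) <= -1 for c in corners]               # robot inside
    cons += [quad(z) >= 1 for z in obstacles]              # obstacles outside
    cons += [quad(g) <= -1 + lam[j] for j, g in enumerate(grid)]
    cons += [quad(goal) <= -1 + nu]                        # goal, softly inside
    objective = cp.Minimize(nu - cp.log_det(P) + cp.sum(lam))
    cp.Problem(objective, cons).solve(solver=cp.SCS)
    return P.value, q.value, r.value, lam.value >= 1

Note that −log det P_t equals log(|P_t|^{−1}), so the objective matches the program above term by term.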
C. Safety Guarantee

The proof of the safety of the robot follows from the sequence of lemmas and a theorem in this section.

DEFINITION: The region of intersection, Γ_t^{t+1}, between two ellipsoids Ψ_t(P_t, q_t, r_t) and Ψ_{t+1}(P_{t+1}, q_{t+1}, r_{t+1}), formed at times t and t + 1 respectively, is:

Γ_t^{t+1} = {x ∈ R² | x^T P_t x + q_t^T x + r_t ≤ 0, x^T P_{t+1} x + q_{t+1}^T x + r_{t+1} ≤ 0}.

LEMMA-1: Let C_t be the robot's configuration at time t, such that Ψ_t separates Ξ(C_t) from the surrounding obstacle cloud O_t. The robot's new configuration C_{t+1} (and hence Ξ(C_{t+1})), after it follows the trajectory to the waypoint for time ∆T, remains inside Ψ_t.

Proof: The waypoint is always inside the ellipsoid Ψ_t. The trajectory to the waypoint, however, may have some segment outside Ψ_t. However, the robot follows the trajectory only for time ∆T, and there always exists ∆T > 0 such that the robot remains inside Ψ_t even when some segment of the trajectory is outside Ψ_t. In RAW, auto-tuning of the maximum limit on ∆T¹ is provided by Ψ_t throughout navigation.

¹This maximum limit may be computed analytically for convex-shaped robots, or using a quadratic program for general-shaped robots, and is discussed elsewhere [13].

LEMMA-2: Assume that the SDP is feasible at time t and returns Ψ_t. Then there exists ∆T > 0 such that the robot remains inside Ψ_t while navigating along the trajectory for time ∆T.

Proof: If ∆T = 0, the robot has to stop immediately; otherwise it would navigate outside the current ellipsoid, violating Lemma-1. To show that RAW guarantees collision avoidance with unforeseeable obstacles, it is essential to show that the robot always remains inside the current ellipsoid. Thus, it suffices to show that there always exists ∆T > 0 such that C_{t+1} and Ξ(C_{t+1}) are inside Ψ_t. Let F : R × R³ → R³ be a function such that C_{t+1} = F(∆T, C_t), i.e., the robot's new configuration, after it navigates from configuration C_t for time ∆T, is given by some (non-linear) mapping F. Thus, to prove that there exists ∆T ≠ 0, it suffices to show that:

lim_{∆T → 0^+} Ξ(F(∆T, C_t)) ∈ Ψ_t,

where 0^+ indicates the positive side of 0. This implies that, in the worst case, the robot can take infinitesimally small steps and remain inside Ψ_t. Consider a case where Ψ_t is such that the robot's body touches the {−1} level-set² of Ψ_t. The SDP involves the constraints Ξ(C_t)_k P_t (Ξ(C_t)_k)^T + Ξ(C_t)_k q_t + r_t ≤ −1, k = 1, ..., 4; thus the robot's body can touch the {−1} level-set of Ψ_t. Assume that the robot's trajectory requires it to cross the {−1} level-set (for example, moving outside Ψ_t). The robot can move until one of the corners of its body (which is a rectangle) touches the {0} level-set of Ψ_t. Thus, a non-zero ∆T implies a non-zero distance between the {0} and {−1} level-sets. It therefore suffices to show that the distance between the 0 and −1 level-sets of Ψ_t is non-zero. If this distance is zero, then ∆T will be 0: the robot already touches the −1 level-set and therefore cannot move further along any trajectory requiring it to move towards the higher level-sets.

²An α level-set of f : R^n → R is defined as {z ∈ R^n | f(z) ≤ α}.

Proving that the distance between the level-sets is non-zero is tedious in the non-canonical form. Thus, the ellipsoid is transformed into its canonical position using some Euclidean transformation T_E (rotation and translation). Furthermore, with some affine transformation T_A, the canonical ellipsoid can be transformed into a circle. It then suffices to show that: (A) first and foremost, the resulting circle (the 0 level-set) has a non-zero radius R_1; and (B) the radius of the circle corresponding to the −1 level-set is {R_2 | R_2 ≥ 0, R_1 − R_2 > 0}. (If R_1 = 0, then the {0} level-set shrinks to a point; thus the {−1} level-set also shrinks to a point, implying R_1 − R_2 = 0.)

To avoid clutter, the time-stamp t is dropped: Ψ_t is represented as Ψ, parameterized by P ∈ S^2_{++}, q ∈ R², r ∈ R. The boundary of the ellipsoid is given by (x ∈ R²):

x^T P x + q^T x + r = 0,    (2)
where P and q are of the form:

P = [ u_11  u_2 ; u_2  u_22 ],   q = [ b_1 ; b_2 ],   u_11, u_22 > 0;  u_2, b_1, b_2 ∈ R.

Thus, (2) can be re-written as:

[x; 1]^T Q [x; 1] = 0,   where   Q = [ u_11  u_2  b_1/2 ; u_2  u_22  b_2/2 ; b_1/2  b_2/2  r ].

Let b_1/2 = u_3 and b_2/2 = u_4. There exists a Euclidean transformation T_E that transforms Q to Q̂, where Q̂ is a diagonal matrix satisfying [T_E [x; 1]]^T Q̂ [T_E [x; 1]] = 0. Furthermore, there exists an affine transformation (rotation, translation and shear) T_A that transforms Q̂ to Q̄, where:

Q̄ = diag(1, 1, Λ);   [T_A T_E [x; 1]]^T Q̄ [T_A T_E [x; 1]] = 0    (3)

for some Λ ∈ R. The above equation represents a circle in the new coordinate system, where −Λ is the square of the radius of the circle. These transformations amount to pre-multiplying Q by T_E and T_A: Q̄ = T_A T_E Q; they are equivalent to some set of elementary row transformations applied to the matrix Q. There exist many such transformations³. One such sequence of row transformations is:

R_3 ← R_3 − R_1 (u_3/u_11);   R_2 ← R_2 − R_1 (u_2/u_11)
R_3 ← R_3 − R_2 (u_4 − u_2 u_3/u_11)/S(u_11);   R_1 ← R_1 − R_2 (u_2/S(u_11))
R_2 ← R_2 − R_3 [ (u_4 − u_3 u_2/u_11) / (r − u_3^2/u_11 − (u_4 − u_3 u_2/u_11)^2 S(u_11)^{−1}) ]
R_1 ← R_1 − R_3 [ (u_3 − (u_4 − u_3 u_2/u_11) u_2 S(u_11)^{−1}) / (r − u_3^2/u_11 − (u_4 − u_3 u_2/u_11)^2 S(u_11)^{−1}) ]
R_1 ← R_1/u_11;   R_2 ← R_2/S(u_11)

Here, R_i ← R_i − R_j β means that the components of the j-th row are multiplied by β and then subtracted component-wise from the i-th row, and S(u_11) = u_22 − u_2 u_11^{−1} u_2 is the Schur complement of u_11 in the matrix P. Applying these transformations to Q gives the matrix Q̄ of (3) with the following Λ:

Λ = r − u_3^2/u_11 − (u_4 − u_2 u_3/u_11)^2 S(u_11)^{−1} = −R_1^2.    (4)

³The transformations are nothing but converting Q to a reduced row-echelon form.

The matrix P is positive definite if and only if u_11 > 0 and S(u_11) > 0. Thus, as P ∈ S^2_{++}:

S(u_11) > 0 and u_11 > 0.    (5)

In the limiting cases (S(u_11) → 0^+ and u_11 → 0^+), it is trivial to show that:

lim_{S(u_11)→0^+} R_1 = ∞ and lim_{u_11→0^+} R_1 = ∞ (as S(u_11) > 0).

For the limiting case, i.e., when R_1 approaches ∞, there is nothing to prove: R_1 = ∞ means that there are no surrounding obstacles and the ellipsoid is unbounded. Thus, it remains to show that R_1 > 0, or equivalently, Λ < 0. Combining (4) and (5), to prove Λ < 0 it remains to show that r < 0. The SDP in RAW constrains the robot to lie inside the ellipsoid: the four corner locations Ξ(C_t) on the robot's body at time t are constrained to lie inside Ψ_t. Thus, the robot's position z_t = (x_t, y_t), being a convex combination of the corners, also lies inside the ellipsoid:

z_t^T P_t z_t + q_t^T z_t + r_t ≤ −1.

Note that this SDP can be solved in the robot's local coordinate frame, where the robot's current position is the origin and its current navigation direction is the x-axis. In fact, RAW solves the SDP in the robot's local coordinate frame for stability. Thus, (x_t, y_t) = (0, 0) for all t. Substituting this in the above inequality and dropping the time-stamp gives:

0^T P 0 + q^T 0 + r ≤ −1  ⟹  r ≤ −1.    (6)

This completes the proof for (A). (The SDP can also be solved in the global coordinate system; that would require expanding the squared terms and then using a more complicated relation between the variables instead of r < 0.) To prove (B), applying the identical row transformations for the {−1} level-set gives:

R_2 = sqrt( −(r + 1) + u_3^2/u_11 + (u_4 − u_2 u_3/u_11)^2 / S(u_11) ).    (7)

Combining (7) with (5) and (6) gives R_2 ≥ 0. Since R_1 > 0 and R_2 ≥ 0, to show R_1 − R_2 > 0 it suffices to show R_1^2 − R_2^2 > 0. From (4) and (7), R_1^2 − R_2^2 = 1. This completes the proof.
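As a quick numerical sanity check (illustrative only, not part of the paper), the closed forms (4) and (7) can be evaluated for any P ⪰ I and r ≤ −1 to confirm R_1^2 − R_2^2 = 1:

import numpy as np

def level_set_radii(P, q, r):
    # Radii of the 0 and -1 level-sets via (4) and (7);
    # P = [[u11, u2], [u2, u22]], q = (2*u3, 2*u4).
    u11, u2, u22 = P[0, 0], P[0, 1], P[1, 1]
    u3, u4 = q[0] / 2.0, q[1] / 2.0
    s = u22 - u2 ** 2 / u11                  # Schur complement S(u11)
    lam = r - u3 ** 2 / u11 - (u4 - u2 * u3 / u11) ** 2 / s   # Lambda, eq. (4)
    r1 = np.sqrt(-lam)                       # R1: radius of the 0 level-set
    r2 = np.sqrt(-(r + 1) + u3 ** 2 / u11 + (u4 - u2 * u3 / u11) ** 2 / s)  # (7)
    return r1, r2

rng = np.random.default_rng(0)
a = rng.normal(size=(2, 2))
P = a @ a.T + np.eye(2)                      # P >= I, as the SDP requires
q = rng.normal(size=2)
r1, r2 = level_set_radii(P, q, r=-1.5)       # r <= -1, by (6)
assert abs(r1 ** 2 - r2 ** 2 - 1.0) < 1e-9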
LEMMA-3: The new ellipsoid Ψ_{t+1} is formed such that Ξ(C_{t+1}) ∈ Γ_t^{t+1}. The robot thus navigates through a sequence of overlapping ellipsoids.

Proof: From Lemmas 1 and 2, we have Ξ(C_{t+1}) ∈ Ψ_t. Let O_{t+1} be the point cloud of obstacles in the robot's current FOV. This point cloud is guaranteed to be separable from Ξ(C_{t+1}) by an ellipsoid; otherwise, these obstacles would have been discovered in the robot's FOV at time t. Also, ∆T is always selected such that the robot does not reach the waypoint (except when the waypoint coincides with the goal). This ensures that the robot never reaches the boundary of the FOV, even if the waypoint is generated at the boundary of the FOV. Thus, an obstacle cannot suddenly appear in front of the robot, i.e., at zero distance from the robot. Hence, there exists an ellipsoid Ψ_{t+1} such that Ξ(C_{t+1}) ∈ Ψ_{t+1} and Ψ_{t+1} is separated from O_{t+1}. Furthermore, as Ξ(C_{t+1}) ∈ Ψ_t, we have Ξ(C_{t+1}) ∈ Γ_t^{t+1}.

THEOREM: Let C_0 be such that Ξ(C_0) ∈ Ψ_0, i.e., the robot starts from a feasible configuration. Then Ψ_t exists for all t = 1, ..., ∞, and Ξ(C_t) ∈ Γ_t^{t+1} for all t = 0, ..., ∞. The robot thus navigates through a sequence of overlapping ellipsoids and is guaranteed to avoid collision throughout the navigation task.

Proof: Follows by combining Lemmas 1, 2 and 3.

The complete algorithm is listed in Algorithm 1 and is self-explanatory given the discussions in the previous sections.

Algorithm 1 RAW
- initialize parameters: R_FOV, θ_FOV, dr, dθ, ε; t = 0
- set a maximum limit for ∆T
- initialize the policy π, learned in the AWGS architecture
- initialize the robot's position, orientation and goal: z_0, θ_0, z_g
- compute N (the number of grid-points in the FOV)
while (||z_t − z_g||_2 > ε)
    O_t ← getPointCloudObstaclesInFOV
    computeFeaturesForEachGridPointInFOV
    solveSDP
    for j = {1, ..., N}
        if (λ_j ≥ 1) mark the j-th grid-point as infeasible endif
    endfor
    generateWaypointUsingPolicy π
    computeTrajectoryToWaypoint
    followTrajectoryForTime ∆T
    t ← t + 1; update z_t, θ_t
endwhile
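For concreteness, Algorithm 1 maps onto the short Python skeleton below. All component interfaces are hypothetical stand-ins for the pieces described in sections III and IV (e.g., the SDP filter of section IV-B); they are passed in as callables so the loop itself is self-contained:

import numpy as np

def raw_loop(config, z_goal, components, eps=0.5):
    # A compact rendering of Algorithm 1. `components` bundles the
    # callables (all hypothetical interfaces): sense -> O_t, featurize,
    # corners -> Xi(C_t), sdp_filter, policy (RL agent), plan, follow.
    sense, featurize, corners, sdp_filter, policy, plan, follow = components
    z = np.asarray(config[:2], dtype=float)
    while np.linalg.norm(z - z_goal) > eps:
        obstacles = sense(config)                        # point cloud O_t
        grid, features = featurize(config, z_goal, obstacles)
        _, _, _, bad = sdp_filter(corners(config), obstacles, grid, z_goal)
        j = policy(features, bad)                        # waypoint via pi + filter
        config = follow(plan(config, grid[j]))           # for at most Delta T
        z = np.asarray(config[:2], dtype=float)
    return config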
V. EXPERIMENTS: SIMULATION RESULTS

This section empirically demonstrates that RAW guarantees safety in unknown environments cluttered with both convex and arbitrarily shaped obstacles. To show that RAW generates close-to-optimal trajectories, RAW trajectories are compared with the globally optimal and RRT trajectories, showing that the actions selected by the RL agent lead to acceptable trajectories. RAW uses the same parameters as in previous work [14]: {R_FOV, dr, θ_FOV, dθ} = {5 m, 0.2 m, 60°, 1°}, with α_1 = 200 in the reward function. The robot has dimensions 1 × 1 m². RAW is implemented in MATLAB, running on a 64-bit Windows 7 notebook with a 2.2 GHz Core i7 and 8 GB RAM. RRT was run 10 times for each start-goal configuration in each of the environmental setups, with a default maximum of 10^4 RRT iterations per run. RAW follows a trajectory to the waypoint for time ∆T, which can be at most 1 s, and then a new waypoint is generated.

A. Performance Evaluation: Convex Obstacles

These experiments empirically show that RAW generates safe trajectories, and they measure the optimality of RAW trajectories by comparing the trajectory lengths with the optimal and RRT trajectories. The environment has 7 convex obstacles (shown in Figure 5). The planners were run in 15 different start-goal configurations in this environment. RRT trajectories were averaged over 10 trials for each configuration. Figure 4 compares the algorithms' performance: RAW trajectories are shorter than both the average and minimum-length (Min-RRT) RRT trajectories. RAW trajectories are longer than the optimal trajectories; the maximum ratio of the length of RAW and optimal trajectories is 1.24, i.e., RAW trajectories are at most 24% longer than the optimal trajectories. Figure 5 shows sample RAW trajectories for two start-goal configurations. It can be seen that the trajectories maintain a safe distance from the obstacles, taking the robot's dimensions into account.

Figure 4 (trajectory length in m per environment; RRT, RAW, Optimal, Min-RRT): RAW trajectories are shorter than both the average and minimum-length (Min-RRT) RRT trajectories in all configurations, and at most 24% longer than the optimal trajectories. Environments are sorted by the average RRT performance.

Figure 5: Sample RAW trajectories for two start-goal configurations. The robot's FOV is also shown.

B. Performance Evaluation: Arbitrary Obstacles

In this experiment the planners are compared in an environment with arbitrarily shaped obstacles (Figure 7), in 15 different start-goal configurations; Figure 6 shows the numerical results. These environments pose a great challenge for RAW, as it has to generate waypoints appropriately to ensure the safety of the robot while also minimizing the motion time. RAW trajectories are longer than the optimal trajectories; the maximum ratio of the length of RAW and optimal trajectories is 1.19, i.e., RAW trajectories are at most 19% longer than the optimal trajectories. Furthermore, RAW trajectories are shorter than both the average and minimum-length RRT trajectories. Thus RAW generates acceptable trajectories when planning among arbitrarily shaped obstacles in unknown environments. Figure 7 shows sample RAW trajectories for two start-goal configurations. RAW safely avoided the non-convex obstacles and reached the goal in unknown environments.

Figure 6 (trajectory length in m per environment): RAW trajectories are shorter than both the average and minimum-length RRT (Min-RRT) trajectories, and at most 19% longer than the optimal trajectories. Environments are sorted by the average RRT performance.

Figure 7: Sample RAW trajectories for two start-goal configurations, shown in blue and red. The robot's FOV is also shown.
C. Performance Evaluation: Corridors

These experiments empirically show that RAW successfully navigates a structured environment with corridors (shown in Figure 9; the corridors effectively provide a navigation direction), in 50 different start-goal configurations. RRT used 25000 iterations in each run. Figure 8 shows that both the average and minimum-length RRT trajectories are longer than the RAW trajectories. As expected, RAW trajectories are longer than the optimal trajectories; the maximum ratio of the length of RAW and optimal trajectories is 1.17, i.e., RAW trajectories are at most 17% longer than the optimal trajectories. Figure 9 shows sample RAW trajectories. In Figure 9b, with the goal at the top-right corner, RAW first visited the blocked region of the second corridor and then turned back to avoid collision. This explains why RAW trajectories are 10-18 m longer than the optimal trajectories (in Figure 8) in 7 of the 50 configurations. Also, AWGS collided in the blocked region for the start-goal configuration shown in Figure 9b, as discussed earlier in Section IV-A, while RAW succeeded in avoiding the collision. It should also be noted that one can make the dead-end in Figure 9 arbitrarily deep, resulting in possibly longer RAW trajectories, because the robot has no prior information about the path being blocked. However, this is true for any planner using limited information, not just for RAW.

To show the robustness of RAW for collision avoidance, it is compared with AWGS, tested in the same 50 start-goal configurations. Navigation is considered successful if the robot reaches the goal without colliding with any obstacle. RAW was successful in all 50 start-goal configurations, while AWGS was successful in only 34. RAW successfully avoids collisions by filtering out the potentially dangerous waypoint locations (with the SDP) before a collision becomes inevitable, while AWGS lacks such an ability.

Figure 8 (trajectory length in m per environment): RAW trajectories are shorter than both the minimum-length (Min-RRT) and average RRT trajectories, and at most 17% longer than the optimal trajectories. Environments are sorted by the average RRT performance.

Figure 9: Sample RAW trajectories for two start-goal configurations: (a) RAW avoids the dead-ends; and (b) RAW moves towards the blocked region and then turns back to avoid collision in the dead-end.

D. Performance Evaluation: Circular Obstacles

These experiments measure the optimality of RAW trajectories among 48 circular obstacles. The planners were tested in 50 different start-goal configurations; RRT used the same parameters as in the previous experiment. Figure 10 shows that RAW trajectories are shorter than both the average and minimum-length RRT trajectories. RAW trajectories are longer than the optimal trajectories; the maximum ratio of the length of RAW and optimal trajectories is 1.15, i.e., RAW trajectories are at most 15% longer than the optimal trajectories. Figure 11 shows sample RAW trajectories for two start-goal configurations.

E. Computation Time

This section discusses the average computation time of RAW, running in MATLAB on a 2.2 GHz Core i7 notebook computer.
The SDP in RAW is solved using the non-commercial package CVX [2]. RAW was run in the 50 different start-goal configurations of the previous section, in the 105 × 105 m² environment shown in Figure 11. The average computation time per step of the algorithm is 98.36 ± 8.68 ms. Thus, RAW can re-plan at ≈ 10 Hz.

Figure 10 (trajectory length in m per environment): RAW trajectories are (i) shorter than both the average and minimum-length (Min-RRT) RRT trajectories and (ii) at most 15% longer than the optimum. Environments are sorted by the average RRT performance.

Figure 11: Sample RAW trajectories for two different start-goal configurations.

VI. CONCLUSION AND DISCUSSION

This paper presented an algorithm for on-line reactive motion planning in unknown and unstructured environments that (i) ensures collision avoidance with unforeseeable obstacles, and (ii) produces trajectories that are not far from the optimal trajectories. RAW was tested in four different experimental setups, testing its performance among all kinds of obstacles; the theoretical claims were thus verified empirically. Also, RAW succeeded in cases where AWGS failed, showing significant safety improvements over AWGS.

RAW trajectories were at most 15-24% longer than the optimal trajectories. Also, RAW produced shorter trajectories than RRT. This shows that the sequence of locally optimal actions (as taken by the RL agent) generates reasonably acceptable trajectories in unknown environments. However, such a bound (15-24%) cannot be guaranteed in general. For example, as discussed earlier, one can make the dead-end section in Figure 9 arbitrarily deep, resulting in longer RAW trajectories. However, this is true for any planner using limited information, not just for RAW. At least in environments with no dead-ends, similar results are expected.

Note that if the environment has dead-ends, RAW is not guaranteed to converge to the goal. This is because of the assumption that the robot has no information beyond the FOV. Also, no map is built on-line, and thus the robot may first avoid a dead-end but then come back to it (resulting in an oscillation) when navigating in a corridor-type environment. However, this problem of local oscillation is different from the problem of local minima generally observed in potential field methods. If on-line map building were allowed, the robot could successfully avoid such oscillations. In the potential field methods, by contrast, at least saddle-points are unavoidable even when perfect geometric data of the environment is available [12].

The robot model assumed in this paper was a simple car-like robot moving at a constant speed of 1 m/s. It was shown that the distance between the level-sets is finite (R_1^2 − R_2^2 = 1). This ensured safety, as the robot can instantaneously change the direction of its velocity. For a general robot model, one would ideally like this distance to equal the minimum stopping distance of the robot. This would require modifying the SDP and introducing additional constraints. This extension of the SDP was beyond the scope of this paper and is therefore left as future work.

ACKNOWLEDGMENT

The author gratefully acknowledges the valuable feedback from Matthew E. Taylor.

REFERENCES

[1] O. Brock and O. Khatib. Elastic strips: A framework for motion generation in human environments. International Journal of Robotics Research, 21(12), 2002.
[2] CVX Research, Inc. CVX: MATLAB software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx, September 2012.
[3] P. Fiorini and Z. Shiller. Motion planning in dynamic environments using velocity obstacles. International Journal of Robotics Research, 17(7), 1998.
[4] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine, 4(1), 1997.
[5] R. A. Knepper and M. T. Mason. Realtime informed path sampling for motion planning search. In International Symposium on Robotics Research, 2011.
[6] A. Kushleyev and M. Likhachev. Time-bounded lattice for efficient path planning in dynamic environments. In IEEE International Conference on Robotics and Automation, 2009.
[7] Y. Kuwata, G. Fiore, J. E. Teo, E. Frazzoli, and J. P. How. Motion planning for urban driving using RRT. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2008.
[8] S. M. LaValle and J. J. Kuffner. Randomized kinodynamic planning. International Journal of Robotics Research, 20(5):378-400, 2001.
[9] Y. Li and J. Xiao. On-line planning of nonholonomic trajectories in crowded and geometrically unknown environments. In IEEE International Conference on Robotics and Automation, 2009.
[10] J. Minguez and L. Montano. The ego-kinodynamic space: Collision avoidance for any shape mobile robots with kinematic and dynamic constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003.
[11] S. Quinlan and O. Khatib. Elastic bands: Connecting path planning and control. In IEEE International Conference on Robotics and Automation, 1993.
[12] E. Rimon and D. E. Koditschek. Exact robot navigation using artificial potential functions. IEEE Transactions on Robotics and Automation, 8(5), 1992.
[13] S. Sharma. QCQP-tunneling: Ellipsoidal constrained agent navigation. In IASTED International Conference on Robotics, 2011.
[14] S. Sharma and M. E. Taylor. Autonomous waypoint generation strategy for on-line navigation in unknown environments. In IROS Workshop on Robot Motion Planning: On-line, Reactive and in Real-time, 2012.
[15] Z. Shiller and S. Sharma. High speed on-line motion planning in cluttered environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[16] Z. Shiller, S. Sharma, I. Stern, and A. Stern. On-line obstacle avoidance at high speeds. International Journal of Robotics Research, 32:1030-1047, 2013.
[17] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[18] A. Undurti and J. P. How. An online algorithm for constrained POMDPs. In IEEE International Conference on Robotics and Automation, 2010.
[19] A. Undurti and J. P. How. Decentralized risk sharing in teams of unmanned vehicles. In IEEE International Symposium on Safety, Security and Rescue Robotics, 2011.
[20] H. Wang, Y. Chen, and P. Soueres. An efficient geometric algorithm to compute time-optimal trajectories for a car-like robot. In IEEE Conference on Decision and Control, 2007.
[21] H. Zhang, J. Butzke, and M. Likhachev. Combining global and local planning with guarantees on completeness. In IEEE International Conference on Robotics and Automation, 2012.