Temporal Logic Motion Control using Actor-Critic Methods – Technical Report –*

Xu Chu Ding†, Jing Wang‡, Morteza Lahijanian‡, Ioannis Ch. Paschalidis‡, and Calin A. Belta‡

* Research partially supported by the NSF under grant EFRI-0735974, by the DOE under grant DE-FG52-06NA27490, by the ODDR&E MURI10 program under grant N00014-10-1-0952, and by ONR MURI under grant N00014-09-1051.
† Xu Chu Ding is with the Embedded Systems and Networks group, United Technologies Research Center, East Hartford, CT 06108 (dingx@utrc.utc.com).
‡ Jing Wang, Morteza Lahijanian, Ioannis Ch. Paschalidis, and Calin A. Belta are with the Division of Systems Eng., Dept. of Mechanical Eng., Dept. of Electrical & Computer Eng., and Dept. of Mechanical Eng., Boston University, Boston, MA 02215 ({wangjing, morteza, yannisp, cbelta}@bu.edu), respectively.

Abstract— In this paper, we consider the problem of deploying a robot from a specification given as a temporal logic statement about some properties satisfied by the regions of a large, partitioned environment. We assume that the robot has noisy sensors and actuators and model its motion through the regions of the environment as a Markov Decision Process (MDP). The robot control problem becomes finding the control policy maximizing the probability of satisfying the temporal logic task on the MDP. For a large environment, obtaining transition probabilities for each state-action pair, as well as solving the necessary optimization problem for the optimal policy, are usually not computationally feasible. To address these issues, we propose an approximate dynamic programming framework based on a least-squares temporal difference learning method of the actor-critic type. This framework operates on sample paths of the robot and optimizes a randomized control policy with respect to a small set of parameters. The transition probabilities are obtained only when needed. Hardware-in-the-loop simulations confirm that convergence of the parameters translates to an approximately optimal policy.

Index Terms— Motion planning, Markov Decision Processes, dynamic programming, actor-critic methods.

I. INTRODUCTION

One major goal in robot motion planning and control is to specify a mission task in an expressive and high-level language and convert the task automatically to a control strategy for the robot. The robot is subject to mechanical constraints, actuation and measurement noise, and limited communication and sensing capabilities. The challenge in this area is the development of a computationally efficient framework accommodating both the robot constraints and the uncertainty of the environment, while allowing for a large spectrum of task specifications.

In recent years, temporal logics such as Linear Temporal Logic (LTL) and Computation Tree Logic (CTL) have been promoted as formal task specification languages for robotic applications [1]–[6]. They are appealing due to their high expressivity and closeness to human language. Moreover, several existing formal verification [7], [8] and synthesis [8] tools can be adapted to generate motion plans and provably correct control strategies for the robots.

In this paper, we assume that the robot model in the environment is described by a (finite) Markov Decision Process (MDP). In this model, the robot can precisely determine its current state, and by applying an action (corresponding to a motion primitive) enabled at each state, it triggers a transition to an adjacent state with a fixed probability.
We are interested in controlling the MDP robot model such that it maximizes the probability of satisfying a temporal logic formula over a set of properties satisfied at the states of the MDP. By adapting existing probabilistic model checking [8]–[10] and synthesis [11], [12] algorithms, we recently developed such computational frameworks for formulas of LTL [13] and a fragment of probabilistic CTL [14].

With the above approaches, an optimal control policy can be generated to maximize the probability of satisfaction, given that the transition probabilities are known for each state-action pair of the MDP; these probabilities can be computed by using a Monte-Carlo method and repeated forward simulations. However, it is often not feasible for realistic robotic applications to obtain the transition probabilities for each state-action pair, even if an accurate model or a simulator of the robot in the environment is available. Moreover, the problem size is even larger when considering temporal logic specifications. For example, in order to find an optimal policy for an MDP satisfying an LTL formula, one needs to solve a dynamic programming problem on the product between the original MDP and a Rabin automaton representing the formula. As such, an exact solution can be computationally prohibitive for realistic settings.

In this paper, we show that approximate dynamic programming [15] can be effectively used to address the above limitations. For large dynamic programming problems, an approximately optimal solution can be provided using actor-critic algorithms [16]. In particular, actor-critic algorithms with Least Squares Temporal Difference (LSTD) learning have recently been shown to be a powerful tool for solving large problems [17], [18]. This paper extends [19], in which we proposed an actor-critic method for maximal reachability probability (MRP) problems, i.e., maximizing the probability of reaching a set of states, to a computational framework that finds a control policy such that the probability of its paths satisfying an arbitrary LTL formula is locally optimal over a set of parameters. This set of parameters is designed to be tailored to this class of approximate dynamic programming problems.

Our proposed algorithm produces a randomized policy, which gives a probability distribution over the enabled actions at a state. Our method requires transition probabilities to be generated only along sample paths, and is therefore particularly suitable for robotic applications. To the best of our knowledge, this is the first work combining temporal logic formal synthesis with actor-critic type methods. We illustrate the algorithms with hardware-in-the-loop simulations using an accurate simulator of our Robotic InDoor Environment (RIDE) platform [20].

Fig. 1. Robotic InDoor Environment (RIDE) platform. Left: An iCreate mobile platform moving autonomously through the corridors and intersections of an indoor-like environment. Right: The partial schematics of the environment. The black blocks represent walls, and the grey and white regions are intersections and corridors, respectively. The labels inside a region represent the observations associated with regions, such as Un (unsafe regions) and Ri (risky regions).

Notation: We use bold letters to denote sequences and vectors. Vectors are assumed to be column vectors. The transpose of a vector x is denoted by xT. ∥·∥ stands for the Euclidean norm. |S| denotes the cardinality of a set S.
II. PROBLEM FORMULATION AND APPROACH

We consider a robot moving in an environment partitioned into regions, such as the Robotic InDoor Environment (RIDE) (see Fig. 1). Each region in the environment is associated with a set of observations. Observations can be, e.g., Un for unsafe regions, or Up for a region where the robot can upload data. We assume that the robot can detect its current region. Moreover, the robot is programmed with a set of motion primitives allowing it to move from a region to an adjacent region. To capture noise in actuation and sensing, we make the natural assumption that, at a given region, a motion primitive designed to take the robot to a specific adjacent region may take the robot to a different adjacent region. Such a robot model naturally leads to a labeled Markov Decision Process (MDP), which is defined below.

Definition II.1 (Labeled Markov Decision Process). A labeled Markov decision process (MDP) is a tuple M = (Q, q0, U, A, P, Π, h), where (i) Q = {1, . . . , n} is a finite set of states; (ii) q0 ∈ Q is the initial state; (iii) U is a finite set of actions; (iv) A : Q → 2U maps a state q ∈ Q to the set of actions enabled at q; (v) P : Q × U × Q → [0, 1] is the transition probability function such that for all q ∈ Q, ∑_{q′∈Q} P(q, u, q′) = 1 if u ∈ A(q), and P(q, u, q′) = 0 for all q′ ∈ Q if u ∉ A(q); (vi) Π is a set of observations; (vii) h : Q → 2Π is the observation map.

Each state of the MDP M modeling the robot in the environment corresponds to an ordered set of regions in the environment, while the actions label the motion primitives that can be applied at a region. For example, a state of M may be labeled I1-C1, which means that the robot is currently at region C1, coming from region I1. Each ordered set of regions corresponds to a recent history of the robot trajectory, and is needed to ensure the Markov property (more details on such MDP abstractions of the robot in the environment can be found in, e.g., [14]). The transition probability function P can be obtained through extensive simulations of the robot in the environment. We assume that there exists an accurate simulator that is capable of generating (computing) the transition probabilities P(q, u, ·) for each state-action pair q ∈ Q and u ∈ A(q). More details on the construction of the MDP model for a robot in the RIDE platform are included in Sec. IV.

If the exact transition probabilities are not known, M can be seen as a labeled non-deterministic transition system (NTS) MN = (Q, q0, U, A, PN, Π, h), where P in M is replaced by PN : Q × U × Q → {0, 1}, and PN(q, u, q′) = 1 indicates a possible transition from q to q′ when applying an enabled action u ∈ A(q); if PN(q, u, q′) = 0, then the transition from q to q′ is not possible under u.

A path on M is a sequence of states q = q0q1 . . . such that for all k ≥ 0, there exists uk ∈ A(qk) such that P(qk, uk, qk+1) > 0. Along a path q = q0q1 . . ., qk is said to be the state at time k. The trajectory of the robot in the environment is represented by a path q on M (which corresponds to a sequence of regions in the environment). A path q = q0q1 . . . generates a sequence of observations h(q) := o0o1 . . ., where ok = h(qk) for all k ≥ 0. We call o = h(q) the word generated by q.

Definition II.2 (Policy). A control policy for an MDP M is an infinite sequence M = µ0µ1 . . ., where µk : Q × U → [0, 1] is such that ∑_{u∈A(q)} µk(q, u) = 1 for all k ≥ 0. Namely, at time k, µk(q, ·) is a discrete probability distribution over A(q). If µ = µk for all k ≥ 0, then M = µµ . . . is called a stationary policy. If for all k ≥ 0, µk(q, u) = 1 for some u, then M is deterministic; otherwise, M is randomized. Given a policy M, we can then generate a set of paths on M by applying uk with probability µk(qk, uk) at state qk for all times k.
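For concreteness, the labeled MDP of Def. II.1 and a randomized policy as in Def. II.2 can be represented directly as simple data structures. The following Python sketch is purely illustrative: the dictionary-based representation and the sample_path helper are our own assumptions, not part of the framework described in this paper.

import random
from dataclasses import dataclass

@dataclass
class LabeledMDP:
    """Labeled MDP M = (Q, q0, U, A, P, Pi, h) of Def. II.1."""
    states: list   # Q
    q0: object     # initial state
    enabled: dict  # A: state -> list of enabled actions
    P: dict        # (q, u) -> {q': P(q, u, q')}
    h: dict        # observation map: state -> set of observations

def sample_path(mdp, policy, length):
    """Simulate a finite path prefix under a randomized stationary policy;
    policy(q) returns a dict {u: mu(q, u)} over the actions enabled at q."""
    q, path = mdp.q0, [mdp.q0]
    for _ in range(length):
        mu = policy(q)
        u = random.choices(list(mu), weights=list(mu.values()))[0]      # u ~ mu(q, .)
        succ = mdp.P[(q, u)]
        q = random.choices(list(succ), weights=list(succ.values()))[0]  # q' ~ P(q, u, .)
        path.append(q)
    return path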
We require the trajectory of the robot in the environment to satisfy a rich task specification given as a Linear Temporal Logic (LTL) formula (see, e.g., [7], [8]) over a set of observations Π. An LTL formula over Π is evaluated over an (infinite) sequence o = o0o1 . . . (e.g., a word generated by a path on M), where ok ⊆ Π for all k ≥ 0. We write o ⊨ φ if the word o satisfies the LTL formula φ, and we say q satisfies φ if h(q) ⊨ φ. Roughly, φ can be constructed from a set of observations Π, Boolean operators ¬ (negation), ∨ (disjunction), ∧ (conjunction), → (implication), and temporal operators X (next), U (until), F (eventually), and G (always). A variety of robotic tasks can be easily translated to LTL formulas. For example, the following complex task command in natural language: "Gather data at locations Da infinitely often. Only reach a risky region Ri if valuable data at VD can be gathered, and always avoid unsafe regions (Un)," can be translated to the LTL formula:

φ := G F Da ∧ G (Ri → VD) ∧ G ¬Un.

In this paper, we consider the following problem.

Problem II.3. Given a labeled MDP M = (Q, q0, U, A, P, Π, h) modeling the motion of a robot in a partitioned environment and a mission task specified as an LTL formula φ over Π, find a control policy that maximizes the probability of its paths satisfying φ.

The probability that paths generated under a policy M satisfy an LTL formula φ is well defined with a suitable measure over the set of all paths generated by M [8]. In [13], we proposed a computational framework to solve Prob. II.3 by adapting methods from the area of probabilistic model checking [8]–[10]. However, this framework relies on the fact that the transition probabilities are known for all state-action pairs. These transition probabilities are typically not available for robotic applications and are computationally expensive to compute. Moreover, even if the transition probabilities are obtained for each state-action pair, this method still requires solving a linear program on the product of the MDP and the automaton representing the formula, which can be very large (thousands or even millions of states). In this case an approximate method might be more desirable. For these reasons, we instead focus on the following problem.

Problem II.4. Given a labeled NTS MN = (Q, q0, U, A, PN, Π, h) modeling a robot in a partitioned environment, a mission task specified as an LTL formula φ over Π, and an accurate simulator to compute the transition probabilities P(q, u, ·) given a state-action pair (q, u), find a control policy that approximately maximizes the probability of its paths satisfying φ.

In many robotic applications, the NTS model MN = (Q, q0, U, A, PN, Π, h) can be quickly constructed for the robot in the environment. Our approach to Prob. II.4 can be summarized as follows. First, we translate the problem to a maximal reachability probability (MRP) problem using MN and φ (Sec. III-A). We then use an actor-critic framework to find a randomized policy giving an approximate solution to the MRP problem (Sec. III-B). The randomized policy is constructed to be a function of a small set of parameters, and we find a policy that is locally optimal with respect to these parameters.
The construction of a class of policies suitable for MRP problems that does not use the transition probabilities is explained in Sec. III-C. The algorithmic framework presented in this paper is summarized in Sec. III-D.

III. CONTROL SYNTHESIS

A. Formulation of the MRP Problem

The formulation of the MRP problem is based on [8]–[10], [13], with modifications where needed to use the NTS MN instead of M. We start by converting the LTL formula φ over Π to a so-called deterministic Rabin automaton, which is defined as follows.

Definition III.1 (Deterministic Rabin Automaton). A deterministic Rabin automaton (DRA) is a tuple R = (S, s0, Σ, δ, F), where (i) S is a finite set of states; (ii) s0 ∈ S is the initial state; (iii) Σ is a set of inputs (alphabet); (iv) δ : S × Σ → S is the transition function; (v) F = {(L(1), K(1)), . . . , (L(M), K(M))} is a set of pairs of sets of states such that L(i), K(i) ⊆ S for all i = 1, . . . , M.

A run of a Rabin automaton R, denoted by r = s0s1 . . ., is an infinite sequence of states of R such that for each k ≥ 0, sk+1 = δ(sk, α) for some α ∈ Σ. A run r is accepting if there exists a pair (L, K) ∈ F such that r intersects with L finitely many times and with K infinitely many times. For any LTL formula φ over Π, one can construct a DRA (which we denote by Rφ) with input alphabet Σ = 2Π accepting all and only the words over Π that satisfy φ (see [21]).

We then obtain an MDP as the product of a labeled MDP M and a DRA Rφ, which captures all paths of M satisfying φ. Note that such a product MDP can only be constructed from an MDP and a deterministic automaton; this is why we require a DRA instead of, e.g., a (generally non-deterministic) Büchi automaton (see [8]).

Definition III.2 (Product MDP). The product MDP M × Rφ between a labeled MDP M = (Q, q0, U, A, P, Π, h) and a DRA Rφ = (S, s0, 2Π, δ, F) is an MDP P = (SP, sP0, UP, AP, PP, Π, hP), where (i) SP = Q × S is the set of states; (ii) sP0 = (q0, s0) is the initial state; (iii) UP = U is the set of actions inherited from M; (iv) AP is also inherited from M, with AP((q, s)) := A(q); (v) PP gives the transition probabilities:

PP((q, s), u, (q′, s′)) = P(q, u, q′) if s′ = δ(s, h(q)), and PP((q, s), u, (q′, s′)) = 0 otherwise.

Note that hP is not used in the product MDP. Moreover, P is associated with pairs of accepting states (similar to a DRA) FP := {(LP(1), KP(1)), . . . , (LP(M), KP(M))}, where LP(i) = Q × L(i) and KP(i) = Q × K(i) for i = 1, . . . , M.

The product MDP is constructed in such a way that, given a path (q0, s0)(q1, s1) . . . on P, the corresponding path q0q1 . . . on M satisfies φ if and only if there exists a pair (LP, KP) ∈ FP satisfying the Rabin acceptance condition, i.e., the set KP is visited infinitely often and the set LP is visited only finitely often.

We can make a very similar product between a labeled NTS MN = (Q, q0, U, A, PN, Π, h) and Rφ. This product is also an NTS, which we denote by PN = (SP, sP0, UP, AP, PNP, Π, hP) := MN × Rφ, associated with the accepting sets FP. The definition (and the acceptance condition) of PN is exactly the same as for the product MDP. The only difference between PN and P is in PNP, which is either 0 or 1 for every state-action-state tuple.
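To make Def. III.2 concrete, the product transition function can be assembled directly from M and Rφ. Below is a minimal Python sketch under the dictionary-based MDP representation sketched earlier; the DRA object with fields states, s0, pairs and a transition method delta(s, obs) is a hypothetical interface assumed for illustration, not an API from this paper or from a specific tool.

def build_product(mdp, dra):
    """Product MDP P = M x R_phi of Def. III.2: states are pairs (q, s),
    and the DRA component moves on the observation set h(q)."""
    P_prod, enabled = {}, {}
    for (q, u), succ in mdp.P.items():
        for s in dra.states:
            s_next = dra.delta(s, frozenset(mdp.h[q]))   # s' = delta(s, h(q))
            enabled.setdefault((q, s), set()).add(u)
            # P_P((q, s), u, (q', s')) = P(q, u, q') if s' = delta(s, h(q)), else 0
            P_prod[((q, s), u)] = {(q2, s_next): p for q2, p in succ.items()}
    initial = (mdp.q0, dra.s0)
    # Accepting pairs: L_P(i) = Q x L(i), K_P(i) = Q x K(i)
    accepting = [({(q, s) for q in mdp.states for s in L},
                  {(q, s) for q in mdp.states for s in K}) for (L, K) in dra.pairs]
    return P_prod, enabled, initial, accepting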
From the product P, or equivalently PN, we can proceed to construct the MRP problem. To do so, it is necessary to compute the so-called accepting maximal end components (AMECs). An end component is a subset of an MDP (consisting of a subset of states and a subset of enabled actions at each state) such that, for each pair of states (i, j) in the component, there is a sequence of actions under which i can be reached from j with positive probability, and states outside the component cannot be reached. An AMEC of P is the largest end component containing at least one state in KP and no state in LP, for a pair (KP, LP) ∈ FP. A procedure to obtain all AMECs of an MDP is outlined in [8]. This procedure is intended to be used on the product MDP P, but it can be used without modification to find all AMECs associated with P when PN is used instead of P. This is because the information needed to construct the AMECs is the set of all possible state transitions at each state, and this information is already contained in PN.

If we denote by S⋆P the union of all states in all AMECs associated with P, it has been shown in probabilistic model checking (see, e.g., [8]) that the maximal probability of satisfying the LTL formula is given by the maximal probability of reaching the set S⋆P from the initial state sP0. The desired optimal policy can then be obtained as the policy maximizing this probability. If the transition probabilities are available for each state-action pair, then this MRP problem can be solved by a linear program (see [8], [22]). The resulting optimal policy is deterministic and stationary (i.e., M = µµ . . .) on the product MDP P. To implement this policy on M, it is necessary to use the DRA as a feedback automaton to keep track of the current state sP on P, and apply the action u for which µ(sP, u) = 1 (since µ is deterministic).

Remark III.3. It is only necessary to find the optimal policy for states not in the set S⋆P. This is because, by construction, there exists a policy inside any AMEC that almost surely satisfies the LTL formula φ by reaching a state in KP infinitely often. This policy can be obtained by simply choosing an action (among the subset of actions retained by the AMEC) at each state randomly, i.e., a trivial randomized stationary policy exists that almost surely satisfies φ.

B. LSTD Actor-Critic Method

We now describe how relevant results from [19] can be applied to solve Prob. II.4. An approximate dynamic programming algorithm of the actor-critic type was presented in [19], which obtains a randomized stationary policy (RSP) (see Def. II.2) M = µθµθ . . ., where µθ(q, u) is a function of the state-action pair (q, u) and of θ ∈ Rn, a vector of parameters. For notational convenience, we denote an RSP µθµθ . . . simply by µθ. In this subsection we assume the RSP µθ(q, u) to be given; we describe how to design a suitable RSP in Sec. III-C.

Given an RSP µθ, actor-critic algorithms can be applied to optimize the parameter vector θ via policy gradient estimation. The basic idea is to use stochastic learning techniques to find a θ that locally optimizes a cost function. In particular, the algorithm presented in [19] is targeted at Stochastic Shortest Path (SSP) problems commonly studied in the literature (see, e.g., [22]). Given an MDP M = (Q, q0, U, A, P, Π, h), a termination state q⋆ ∈ Q, and a function g(q, u) defining the one-step cost of applying action u at state q, the expected total cost is defined as:

ᾱ(θ) = lim_{N→∞} E{ ∑_{k=0}^{N−1} g(qk, uk) },   (1)

where (qk, uk) is the state-action pair at time k along a path under the RSP µθ. The SSP problem is formulated as the problem of finding θ⋆ minimizing (1). Note that, in general, we assume q⋆ to be cost-free and absorbing, i.e., g(q⋆, u) = 0 and P(q⋆, u, q⋆) = 1 for all u ∈ A(q⋆); under these conditions, the expected total cost (1) is finite.
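Because the algorithm works with sample paths, the expected total cost (1) for a fixed RSP can be estimated by straightforward Monte Carlo simulation. The sketch below is illustrative only; the simulator interface, the episode cap, and the cost function handle are assumptions on our part rather than the paper's implementation.

import random

def estimate_total_cost(mdp, policy, g, q_star, episodes=1000, max_steps=10000):
    """Monte Carlo estimate of the expected total cost (1) under an RSP.
    policy(q) returns {u: mu_theta(q, u)}; g(q, u) is the one-step cost;
    q_star is the cost-free, absorbing termination state."""
    total = 0.0
    for _ in range(episodes):
        q, cost = mdp.q0, 0.0
        for _ in range(max_steps):          # cap to truncate very long episodes
            if q == q_star:
                break
            mu = policy(q)
            u = random.choices(list(mu), weights=list(mu.values()))[0]
            cost += g(q, u)
            succ = mdp.P[(q, u)]
            q = random.choices(list(succ), weights=list(succ.values()))[0]
        total += cost
    return total / episodes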
We note that an MRP problem as described in Sec. III-A can be immediately converted to an SSP problem.

Definition III.4 (Conversion from MRP to SSP). Given the product MDP P = (SP, sP0, UP, AP, PP, FP) and a set of states S⋆P ⊆ SP, the problem of maximizing the probability of reaching S⋆P can be converted to an SSP problem by defining a new MDP P̃ = (S̃P, s̃P0, ŨP, ÃP, P̃P, gP), where (i) S̃P = (SP \ S⋆P) ∪ {s⋆P}, where s⋆P is a "dummy" terminal state; (ii) s̃P0 = sP0 (without loss of generality, we exclude the trivial case where sP0 ∈ S⋆P); (iii) ŨP = UP; (iv) ÃP(sP) = AP(sP) for all sP ∈ SP, and for the dummy state we set ÃP(s⋆P) = ŨP; (v) the transition probabilities are redefined as follows. We first define S̄⋆P as the set of states of P that cannot reach S⋆P under any policy. We then define, for all sP ∈ SP \ (S⋆P ∪ S̄⋆P) and u ∈ ŨP:

P̃P(sP, u, s′P) = ∑_{s″P∈S⋆P} PP(sP, u, s″P), if s′P = s⋆P;
P̃P(sP, u, s′P) = PP(sP, u, s′P), if s′P ∈ SP \ S⋆P.

Moreover, we set P̃P(s⋆P, u, s⋆P) = 1 for all u ∈ ŨP, and P̃P(sP, u, sP0) = 1 for all sP ∈ S̄⋆P and u ∈ ŨP; (vi) for all sP ∈ S̃P and u ∈ ŨP, we define the one-step cost gP(sP, u) = 1 if sP ∈ S̄⋆P, and gP(sP, u) = 0 otherwise.

We have shown in [19] that the policy minimizing (1) for the SSP problem with MDP P̃ and termination state s⋆P is a policy maximizing the probability of reaching the set S⋆P on P, i.e., a solution to the MRP problem formulated in Sec. III-A.

The SSP problem can also be constructed from the NTS PN. In this case we obtain an NTS P̃N = (S̃P, s̃P0, ŨP, ÃP, P̃NP, gP), using exactly the same construction as in Def. III.4, except for the definition of P̃NP. The transition function P̃NP(sP, u, s′P) is instead defined, for all sP ∈ SP \ (S⋆P ∪ S̄⋆P) and u ∈ ŨP, as:

P̃NP(sP, u, s′P) = max_{s″P∈S⋆P} PNP(sP, u, s″P), if s′P = s⋆P;
P̃NP(sP, u, s′P) = PNP(sP, u, s′P), if s′P ∈ SP \ S⋆P.

Moreover, we set P̃NP(s⋆P, u, s⋆P) = 1 for all u ∈ ŨP, and P̃NP(sP, u, sP0) = 1 for all sP ∈ S̄⋆P and u ∈ ŨP.

Once the SSP problem is constructed, the algorithm presented in [19] is an iterative procedure that obtains a policy locally minimizing the cost function (1) by simulating sample paths on P̃. Each sample path on P̃ starts at sP0 and ends when the termination state s⋆P is reached. Since the transition probabilities are needed only along the sample paths, we do not require the MDP P̃, but only P̃N.

An actor-critic algorithm operates in the following way: the critic observes the state and the one-step cost from the MDP and uses the observed information to update the critic parameters; the critic parameters are then used to update the policy; the actor generates an action based on the policy and applies the action to the MDP. The algorithm stops when the gradient of ᾱ(θ) is small enough (i.e., θ is locally optimal). The actor-critic update mechanism is shown in Fig. 2.

Fig. 2. Diagram illustrating an actor-critic algorithm.

We summarize the actor-critic update algorithm in Alg. 1, and we note that it does not depend on the form of the RSP µθ. The vectors zk ∈ Rn, bk ∈ Rn, rk ∈ Rn and the matrix Ak ∈ Rn×n are updated during each critic update, while, simultaneously, the vector θk ∈ Rn is updated during each actor update. Both the critic and the actor updates depend on

ψθ(x, u) := ∇θ ln µθ(x, u),   (2)

the gradient of the logarithm of µθ(x, u), which is used to estimate the gradient ∇ᾱ(θ).
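As an illustration of (2), consider a simplified policy that is a softmax directly over the enabled actions, µθ(x, u) ∝ exp(θT f(x, u)) for some feature vector f(x, u); the RSP of Sec. III-C is of a related exponential form, but defined over t-step action sequences. For this simplified softmax family, ψθ(x, u) = f(x, u) − ∑_{u′} µθ(x, u′) f(x, u′). The Python sketch below is an assumption-laden illustration of this closed form, not the paper's implementation.

import numpy as np

def softmax_policy(theta, features, actions):
    """mu_theta(x, u) proportional to exp(theta^T f(x, u));
    features[u] holds the feature vector f(x, u) of each enabled action u."""
    scores = np.array([theta @ features[u] for u in actions])
    w = np.exp(scores - scores.max())        # subtract max for numerical stability
    return w / w.sum()

def psi(theta, features, actions, u):
    """psi_theta(x, u) = grad_theta ln mu_theta(x, u)
                       = f(x, u) - sum_u' mu_theta(x, u') f(x, u')."""
    mu = softmax_policy(theta, features, actions)
    expected_f = sum(p * features[a] for p, a in zip(mu, actions))
    return features[u] - expected_f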
Lastly, the sequence {γk} controls the critic step-size, while {βk} and Γ(rk) control the actor step-size. We note that all step-size parameters are positive, and their effect on the convergence rate is discussed in [19]. The critic update in Alg. 1 is of the LSTD type, which has been shown to be superior to other approximate dynamic programming methods in terms of convergence rate [18]. More details on this algorithm can be found in [19].

Algorithm 1 LSTD actor-critic algorithm for SSP problems
Input: The NTS P̃N = (S̃P, s̃P0, ŨP, ÃP, P̃NP, gP) with the terminal state s⋆P, the RSP µθ, and a computation tool to obtain P̃P(sP, u, ·) for a given state-action pair (sP, u).
1: Initialization: Set all entries in z0, A0, b0 and r0 to zeros. Let θ0 take some initial value. Set the initial state x0 := s̃P0. Obtain action u0 using the RSP µθ0.
2: repeat
3: Compute the transition probabilities P̃P(xk, uk, ·).
4: Obtain the simulated subsequent state xk+1 using the transition probabilities P̃P(xk, uk, ·). If xk = s⋆P, set xk+1 := x0.
5: Obtain action uk+1 using the RSP µθk.
6: Critic Update:
zk+1 = λzk + ψθk(xk, uk),
bk+1 = bk + γk(g(xk, uk)zk − bk),
Ak+1 = Ak + γk(zk(ψθk(xk+1, uk+1)T − ψθk(xk, uk)T) − Ak),
rk+1 = −Ak⁻¹bk.
7: Actor Update:
θk+1 = θk − βkΓ(rk)(rkTψθk(xk+1, uk+1))ψθk(xk+1, uk+1).
8: until ∥∇ᾱ(θk)∥ ≤ ϵ for some given ϵ.

C. Designing an RSP

In this section we describe a class of randomized policies that is suitable for use in Alg. 1 for MRP problems and does not require the transition probabilities. We propose a family of RSPs that perform a "t-step look-ahead". This class of policies considers all possible sequences of actions over t steps and assigns a probability to each action sequence.

To simplify notation, for a pair of states i, j ∈ S̃P, we write i t→ j if there is a positive probability of reaching j from i in t steps. This can be quickly verified from P̃NP, without the transition probabilities. At a state i ∈ S̃P, we denote an action sequence from i with t-step look-ahead as e = u1u2 . . . ut, where uk ∈ ÃP(j) for some j such that i k→ j, for all k = 1, . . . , t. We denote the set of all such action sequences from state i by E(i). Given e ∈ E(i), we write P̃NP(i, e, j) = 1 if there is a positive probability of reaching j from i with the action sequence e. This can also be obtained recursively from P̃NP(i, u, ·).

For each pair of states i, j ∈ S̃P, we define d(i, j) as the minimum number of steps needed to reach j from i (again, this can be obtained quickly from P̃NP without the transition probabilities). We write j ∈ N(i) if and only if d(i, j) ≤ rN, where rN is a fixed integer given a priori. If j ∈ N(i), then we say j is in the neighborhood of i, and rN represents the radius of the neighborhood around each state.

For each state i ∈ S̃P, we define the safety score safe(i) as the ratio of the neighboring states of i not in S̄⋆P over all neighboring states of i. Recall that S̄⋆P is the set of states with zero probability of reaching the goal states S⋆P. More precisely, we define:

safe(i) := (∑_{j∈N(i)} I(j)) / |N(i)|,   (3)

where I(·) is an indicator function such that I(j) = 1 if and only if j ∈ S̃P \ S̄⋆P, and I(j) = 0 otherwise. A higher safety score for the current state implies that the robot is less likely to reach S̄⋆P in the near future. Furthermore, we define the progress score of a state i ∈ S̃P as progress(i) := min_{j∈S⋆P} d(i, j), which is the minimum number of transitions from i to any goal state.
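The quantities d(i, j), N(i), safe(i), and progress(i) only require reachability information, so they can be computed by breadth-first search on the transition structure of P̃N. The following Python sketch is illustrative; the representation succ[i][u] (the set of possible successors of i under action u, i.e., the support of P̃NP(i, u, ·)) is an assumption, not the paper's data structure.

from collections import deque

def bfs_distances(succ, source):
    """Minimum number of transitions d(source, j) to every reachable state j,
    ignoring probabilities; succ[i][u] is the set of one-step successors."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for targets in succ[i].values():
            for j in targets:
                if j not in dist:
                    dist[j] = dist[i] + 1
                    queue.append(j)
    return dist

def safety_and_progress(succ, states, dead, goal, r_N):
    """safe(i): fraction of the states within distance r_N of i that lie outside
    the set S-bar*_P ('dead'); progress(i): minimum number of transitions to S*_P."""
    safe, progress = {}, {}
    for i in states:
        dist = bfs_distances(succ, i)
        neighborhood = [j for j, d in dist.items() if d <= r_N]   # N(i)
        safe[i] = sum(1 for j in neighborhood if j not in dead) / len(neighborhood)
        progress[i] = min((dist[j] for j in goal if j in dist), default=float('inf'))
    return safe, progress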
We can now present the definition of our RSP. Let θ := [θ1, θ2]T. We define:

a(θ, i, e) = exp( θ1 ∑_{j∈N(i)} safe(j) P̃NP(i, e, j) + θ2 ∑_{j∈N(i)} (progress(j) − progress(i)) P̃NP(i, e, j) ),   (4)

where exp is the exponential function. Note that a(θ, i, e) combines the safety scores of the states that can be reached by applying the action sequence e and the improvement in progress score over the current state when applying e, weighted by θ1 and θ2, respectively. We assign the probability of picking the action sequence e at i proportionally to the combined score a(θ, i, e). Hence, the probability of picking action sequence e at state i is defined as:

µ̃θ(i, e) = a(θ, i, e) / ∑_{e′∈E(i)} a(θ, i, e′).   (5)

Note that, if the action sequence e = u1u2 . . . ut is picked, only the first action u1 is applied. Hence, at state i, the probability of applying an action u ∈ ÃP(i) can be derived from (5):

µθ(i, u) = ∑_{e∈E(i) | e=u u2...ut} µ̃θ(i, e),   (6)

which completes the definition of the RSP.

D. Overall Algorithm

We now connect all the pieces together and present the overall algorithm providing a solution to Prob. II.4.

Proposition III.5. Alg. 2 returns in finite time with a θ⋆ locally maximizing the probability that the RSP µθ satisfies the LTL formula φ.

Proof. In [19], we have shown that the actor-critic algorithm used in this paper returns in finite time with a locally optimal θ⋆ such that ∥∇ᾱ(θ⋆)∥ ≤ ϵ for a given ϵ. We have shown throughout the paper that the optimal policy maximizing the probability of reaching S⋆P on P is a policy maximizing the probability of satisfying φ. We also showed that the SSP problem, as well as the RSP µθ, can be constructed without the transition probabilities, using only MN. Therefore, Alg. 2 produces an RSP maximizing the probability of satisfying φ with respect to θ, up to a threshold ϵ.

Algorithm 2 Overall algorithm providing a solution to Prob. II.4
Input: A labeled NTS MN = (Q, q0, U, A, PN, Π, h) modeling a robot in a partitioned environment, an LTL formula φ over Π, and a simulator to compute P(q, u, ·) given a state-action pair (q, u).
1: Translate the LTL formula φ to a DRA Rφ.
2: Generate the product NTS PN = MN × Rφ.
3: Find the union S⋆P of all AMECs associated with PN.
4: Convert the MRP problem to an SSP problem and generate P̃N.
5: Obtain the RSP µθ using P̃N.
6: Execute Alg. 1 with P̃N and µθ as inputs, until ∥∇ᾱ(θ⋆)∥ ≤ ϵ for a θ⋆ and a given ϵ.
Output: The RSP µθ and θ⋆ locally maximizing the probability of satisfying φ with respect to θ, up to a threshold ϵ.

IV. HARDWARE-IN-THE-LOOP SIMULATION

We test the algorithms proposed in this paper through hardware-in-the-loop simulations for the RIDE environment (shown in Fig. 1). The transition probabilities are computed by an accurate simulator of RIDE as needed. We apply both LTL control synthesis methods, linear programming (exact solution) and actor-critic (approximate solution), and compare the results.

A. Environment

In this case study, we consider an environment whose topology is shown in Fig. 3. This environment is made of square blocks forming 164 corridors and 84 intersections. The corridors (C1, C2, . . . , C164), shown as white regions in Fig. 3, are of three different lengths: one, two, and three units. The three-unit corridors are used to build corners in the environment. The intersections (I1, I2, . . . , I84) are of two types, three-way and four-way, and are shown as grey blocks in Fig. 3. The black regions in this figure represent the walls of the environment. Note that there is always a corridor between two intersections.
There are five properties of interest (observations) associated with the regions of the environment. These properties are: VD = ValuableData (regions containing valuable data to be collected), RD = RegularData (regions containing regular data to be collected), Up = Upload (regions where data can be uploaded), Ri = Risky (regions that could pose a threat to the robot), and Un = Unsafe (regions that are unsafe for the robot).

Fig. 3. Schematic representation of the environment with 84 intersections and 164 corridors. The black blocks represent walls, and the grey and white regions are intersections and corridors, respectively. There are five properties of interest in the regions, indicated with VD = ValuableData, RD = RegularData, Up = Upload, Ri = Risky, and Un = Unsafe. The initial position of the robot is shown with a blue disk and the upload region is indicated with a red star.

B. Construction of the MDP model

The robot is equipped with a set of feedback control primitives (actions): FollowRoad, GoRight, GoLeft, and GoStraight. The controller FollowRoad is only available (enabled) in the corridors. At four-way intersections, the available controllers are GoRight, GoLeft, and GoStraight. At three-way intersections, depending on the shape of the intersection, two of the four controllers are available. Due to the presence of noise in the actuators and sensors, however, the resulting motion may be different from the one intended. Thus, the outcome of each control primitive is characterized probabilistically.

To create an MDP model of the robot in RIDE, we define each state of the MDP as a pair of adjacent regions (a corridor and an intersection). For instance, the pairs C1-I2 and I3-C4 are two states of the MDP. Through this pairing of regions, it was shown that the Markov property (i.e., the result of an action at a state depends only on the current state) can be achieved [14]. The resulting MDP has 608 states. The set of actions available at a state is the set of controllers available at the last region corresponding to the state. For example, in state C1-I2 only the actions available at region I2 are allowed. Each state of the MDP whose second region satisfies an observation in Π is mapped to that observation.

To obtain the transition probabilities, we use an accurate simulator (see Fig. 4) incorporating the motion and sensing of an iRobot Create platform equipped with a Hokuyo URG-04LX laser range finder, an APSX RW-210 RFID reader, and an MSI Wind U100-420US netbook (the robot is shown in Fig. 1) in RIDE. Specifically, it emulates experimentally measured response times, sensing and control errors, and noise levels and distributions in the laser scanner readings. More details on the software implementation of the simulator can be found in [14].

Fig. 4. Simulation snapshots. The white disk represents the robot and the different circles around it indicate different "zones" in which different controllers are activated. The yellow dots represent the laser readings used to define the target angle. (a) The robot centers itself on a stretch of corridor by using FollowRoad; (b) The robot applies GoRight in an intersection; (c) The robot applies GoLeft.

We perform a total of 1,000 simulations for each action available at each MDP state.
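The transition probabilities P(q, u, ·) are then estimated from these runs by relative frequencies. A minimal sketch of this Monte-Carlo estimation is shown below; it is illustrative only, and simulate_one_step(q, u) is a hypothetical stand-in for the RIDE simulator interface.

from collections import Counter

def estimate_transition_probabilities(states, enabled, simulate_one_step, runs=1000):
    """Estimate P(q, u, .) by relative frequencies over repeated simulator runs;
    simulate_one_step(q, u) returns the state reached after one noisy execution
    of motion primitive u from state q."""
    P = {}
    for q in states:
        for u in enabled[q]:
            counts = Counter(simulate_one_step(q, u) for _ in range(runs))
            P[(q, u)] = {q_next: n / runs for q_next, n in counts.items()}
    return P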
C. Task specification and results

We consider the following mission task.

Specification: Reach a location with ValuableData (VD) or RegularData (RD), and then reach Upload (Up). Do not reach Risky (Ri) regions unless a location with ValuableData (VD) is eventually reached. Always avoid Unsafe (Un) regions until Upload (Up) is reached (and the mission is completed).

The above task specification can be translated to the LTL formula:

φ := F Up ∧ (¬Un U Up) ∧ G (Ri → F VD) ∧ G (VD ∨ RD → X F Up).   (7)

The initial position of the robot is shown as a blue circle in Fig. 3, with the orientation towards the neighboring intersection. We used the computational frameworks described in this paper to find the control strategy maximizing the probability of satisfying the specification. The DRA has 17 states, which results in a product MDP with 10,336 states. By applying both methods, linear programming (exact solution) and actor-critic (approximate solution), we found that the maximum probabilities of satisfying the specification were 92% and 75%, respectively. The graph of the convergence of the actor-critic solution is shown in Fig. 5. The parameters for this example are λ = 0.9 and the initial θ = [5, −0.5]T. The look-ahead window for the RSP is t = 2.

Fig. 5. The optimal solution (the maximal probability of satisfying the specification) is shown with a dashed line, and the solid line represents the exact reachability probability of the RSP as a function of the number of iterations of the proposed algorithm (horizontal axis: Iteration Number; vertical axis: Reachability Probability).

It should be emphasized that we only compute the transition probabilities along the sample paths. Thus, when Alg. 2 completed (at iteration 1100), at most 1100 state-action pairs had their transition probabilities computed. In comparison, in order to compute the exact solution, around 30,000 transition probabilities of state-action pairs must be computed.

V. CONCLUSIONS

We presented a framework that brings together an approximate dynamic programming computational method of the actor-critic type with formal control synthesis for Markov Decision Processes (MDPs) from temporal logic specifications. We show that this approach is particularly suitable for problems where the transition probabilities of the MDP are difficult or computationally expensive to compute, as is the case in many robotic applications. We show that this approach effectively finds an approximately optimal policy within a class of randomized stationary policies maximizing the probability of satisfying the temporal logic formula. Future directions include extending these results to multi-robot teams, examining how to choose an appropriate look-ahead window when designing the RSP, and applying the results to more realistic problem settings with MDPs containing possibly millions of states.

REFERENCES

[1] H. Kress-Gazit, G. Fainekos, and G. J. Pappas, "Where's Waldo? Sensor-based temporal logic motion planning," in IEEE Int. Conf. on Robotics and Automation, Rome, Italy, 2007, pp. 3116–3121.
[2] S. Karaman and E. Frazzoli, "Sampling-based motion planning with deterministic µ-calculus specifications," in IEEE Conf. on Decision and Control, Shanghai, China, 2009, pp. 2222–2229.
[3] S. G. Loizou and K. J. Kyriakopoulos, "Automatic synthesis of multiagent motion tasks based on LTL specifications," in 43rd IEEE Conference on Decision and Control, December 2004.
[4] M. M. Quottrup, T. Bak, and R. Izadi-Zamanabadi, "Multi-robot motion planning: A timed automata approach," in IEEE Int. Conf. on Robotics and Automation, New Orleans, LA, Apr. 2004, pp. 4417–4422.
[5] T. Wongpiromsarn, U. Topcu, and R. M. Murray, "Receding horizon temporal logic planning for dynamical systems," in IEEE Conference on Decision and Control, Shanghai, China, 2009.
[6] A. Bhatia, L. Kavraki, and M. Vardi, "Sampling-based motion planning with temporal goals," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2010, pp. 2689–2696.
[7] E. M. Clarke, D. Peled, and O. Grumberg, Model Checking. MIT Press, 1999.
[8] C. Baier, J.-P. Katoen, and K. G. Larsen, Principles of Model Checking. MIT Press, 2008.
[9] L. De Alfaro, "Formal verification of probabilistic systems," Ph.D. dissertation, Stanford University, 1997.
[10] M. Vardi, "Probabilistic linear-time model checking: An overview of the automata-theoretic approach," Formal Methods for Real-Time and Probabilistic Systems, pp. 265–276, 1999.
[11] C. Courcoubetis and M. Yannakakis, "Markov decision processes and regular events," IEEE Transactions on Automatic Control, vol. 43, no. 10, pp. 1399–1418, 1998.
[12] C. Baier, M. Größer, M. Leucker, B. Bollig, and F. Ciesinski, "Controller synthesis for probabilistic systems," in Proceedings of IFIP TCS 2004, 2004.
[13] X. C. Ding, S. L. Smith, C. Belta, and D. Rus, "LTL planning for groups of robots control in uncertain environments with probabilistic satisfaction guarantees," in 18th IFAC World Congress, 2011.
[14] M. Lahijanian, J. Wasniewski, S. B. Andersson, and C. Belta, "Motion planning and control from temporal logic specifications with probabilistic satisfaction guarantees," in IEEE Int. Conf. on Robotics and Automation, Anchorage, AK, 2010, pp. 3227–3232.
[15] J. Si, Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE Press, 2004, vol. 2.
[16] A. Barto, R. Sutton, and C. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, 1983.
[17] I. Paschalidis, K. Li, and R. Estanjini, "An actor-critic method using least squares temporal difference learning," in Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 28th Chinese Control Conference (CDC/CCC 2009), 2009, pp. 2564–2569.
[18] V. Konda and J. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2004.
[19] R. Estanjini, X. Ding, M. Lahijanian, J. Wang, C. Belta, and I. Paschalidis, "Least squares temporal difference actor-critic methods with applications to robot motion control," in IEEE Conference on Decision and Control (CDC), Orlando, FL, December 2011.
[20] "Robotic indoor environment." [Online]. Available: www.hyness.bu.edu/ride
[21] E. Grädel, W. Thomas, and T. Wilke, Automata, Logics, and Infinite Games: A Guide to Current Research, ser. Lecture Notes in Computer Science. Springer, 2002, vol. 2500.
[22] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, 1994.