MDP Optimal Control under Temporal Logic Constraints - Technical Report - Xu Chu Ding Stephen L. Smith Calin Belta Daniela Rus Abstract— In this paper, we develop a method to automati- cally generate a control policy for a dynamical system modeled as a Markov Decision Process (MDP). The control specification is given as a Linear Temporal Logic (LTL) formula over a set of propositions defined on the states of the MDP. We synthesize a control policy such that the MDP satisfies the given specification almost surely, if such a policy exists. In addition, we designate an “optimizing proposition” to be repeatedly satisfied, and we formulate a novel optimization criterion in terms of minimizing the expected cost in between satisfactions of this proposition. We propose a sufficient condition for a policy to be optimal, and develop a dynamic programming algorithm that synthesizes a policy that is optimal under some conditions, and sub-optimal otherwise. This problem is motivated by robotic applications requiring persistent tasks, such as environmental monitoring or data gathering, to be performed. I. INTRODUCTION In this paper, we consider the problem of controlling a (fi- nite state) Markov Decision Process (MDP). Such models are widely used in various areas including engineering, biology, and economics. In particular, in recent results, they have been successfully used to model and control autonomous robots subject to uncertainty in their sensing and actuation, such as for ground robots [1], unmanned aircraft [2] and surgical steering needles [3]. Several authors [4]–[7] have proposed using temporal log- ics, such as Linear Temporal Logic (LTL) and Computation Tree Logic (CTL) [8], as specification languages for control systems. Such logics are appealing because they have well defined syntax and semantics, which can be easily used to specify complex behavior. In particular, in LTL, it is possible to specify persistent tasks, e.g., “Visit regions A, then B, and then C, infinitely often. Never enter B unless coming directly from D.” In addition, off-the-shelf model checking algorithms [8] and temporal logic game strategies [9] can be used to verify the correctness of system trajectories and to synthesize provably correct control strategies. The existing works focusing on LTL assume that a finite model of the system is available and the current state can be precisely determined. If the control model is deterministic (i.e., at each state, an available control enables exactly one transition), control strategies from specifications given as This work was supported in part by ONR-MURI N00014-09-1051, ARO W911NF-09-1-0088, AFOSR YIP FA9550-09-1-020, and NSF CNS- 0834260. X. C. Ding and C. Belta are with Department of Mechanical En- gineering, Boston University, Boston, MA 02215, USA (email: {xcding; cbelta}@bu.edu). S. L. Smith is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo ON, N2L 3G1 Canada (email: stephen.smith@uwaterloo.ca). D. Rus is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (email: rus@csail.mit.edu). LTL formulas can be found through simple adaptations of off-the-shelf model checking algorithms [10]. 
If the control is non-deterministic (an available control at a state enables one of several transitions, and their probabilities are not known), the control problem from an LTL specification can be mapped to the solution of a B¨uchi or GR(1) game if the specification is restricted to a fragment of LTL [4], [11]. If the probabilities of the enabled transitions at each state are known, the control problem reduces to finding a control policy for an MDP such that a probabilistic temporal logic formula is satisfied [12]. By adapting methods from probabilistic model-checking [12]–[14], we have recently developed frameworks for de- riving MDP control policies from LTL formulas [15], which is related to a number of other approaches [16], [17] that address the problem of synthesizing control policies for MDPs subject to LTL satisfaction constraints. In all of the above approaches, a control policy is designed to maximize the probability of satisfying a given LTL formula. However, no attempt has been made so far to optimize the long-term behavior of the system, while enforcing LTL satisfaction guarantees. Such an objective is often critical in many applications, such as surveillance, persistent monitoring, and pickup delivery tasks, where a robot must repeatedly visit some areas in an environment and the time in between revisits should be minimized. As far as we know, this paper is the first attempt to compute an optimal control policy for a dynamical sys- tem modeled as an MDP, while satisfying temporal logic constraints. This work begins to bridge the gap between our prior work on MDP control policies maximizing the probability of satisfying an LTL formula [15] and optimal path planning under LTL constraints [18]. We consider LTL formulas defined over a set of propositions assigned to the states of the MDP. We synthesize a control policy such that the MDP satisfies the specification almost surely, if such a policy exists. In addition, we minimize the expected cost between satisfying instances of an “optimizing proposition.” The main contribution of this paper is two-fold. First, we formulate the above MDP optimization problem in terms of minimizing the average cost per cycle, where a cycle is defined between successive satisfaction of the optimizing proposition. We present a novel connection between this problem and the well-known average cost per stage problem. Second, we incorporate the LTL constraints and obtain a sufficient condition for a policy to be optimal. We present a dynamic programming algorithm that under some (heavy) restrictions synthesizes an optimal control policy, and a sub- optimal policy otherwise. arXiv:1103.4342v2 [cs.RO] 23 Mar 2011 The organization of this paper is as follows. In Sec. II we provide some preliminaries. We formulate the problem in Sec. III and we formalize the connection between the average cost per cycle and the average cost per stage problem in Sec. IV. In Sec. V, we provide a method for incorporating LTL constraints. We present a case study illustrating our framework in Sec. VI and we conclude in Sec. VII II. PRELIMINARIES A. Linear Temporal Logic We employ Linear Temporal Logic (LTL) to describe MDP control specifications. A detailed description of the syntax and semantics of LTL is beyond the scope of this paper and can be found in [8], [13]. 
Roughly, an LTL formula is built up from a set of atomic propositions Π, standard Boolean operators ¬ (negation), ∨(disjunction), ∧ (conjunction), and temporal operators ⃝(next), U (until), 3 (eventually), 2 (always). The semantics of LTL formulas are given over infinite words in 2Π. A word satisfies an LTL formula φ if φ is true at the first position of the word; 2φ means that φ is true at all positions of the word; 3φ means that φ eventually becomes true in the word; φ1 Uφ2 means that φ1 has to hold at least until φ2 is true. More expressivity can be achieved by combining the above temporal and Boolean operators (more examples are given later). An LTL formula can be represented by a deterministic Rabin automaton, which is defined as follows. Definition II.1 (Deterministic Rabin Automaton). A de- terministic Rabin automaton (DRA) is a tuple R = (Q, Σ, δ, q0, F), where (i) Q is a finite set of states; (ii) Σ is a set of inputs (alphabet); (iii) δ : Q × Σ →Q is the transition function; (iv) q0 ∈Q is the initial state; and (v) F = {(L(1), K(1)), . . . , (L(M), K(M))} is a set of pairs of sets of states such that L(i), K(i) ⊆Q for all i = 1, . . . , M. A run of a Rabin automaton R, denoted by rR = q0q1 . . ., is an infinite sequence of states in R such that for each i ≥0, qi+1 ∈δ(qi, α) for some α ∈Σ. A run rR is accepting if there exists a pair (L, K) ∈F such that 1) there exists n ≥0, such that for all m ≥n, we have qm /∈L, and 2) there exist infinitely many indices k where qk ∈K. This acceptance conditions means that rR is accepting if for a pair (L, K) ∈F, rR intersects with L finitely many times and K infinitely many times. Note that for a given pair (L, K), L can be an empty set, but K is not empty. For any LTL formula φ over Π, one can construct a DRA with input alphabet Σ = 2Π accepting all and only words over Π that satisfy φ (see [19]). We refer readers to [20] and references therein for algorithms and to freely available implementations, such as [21], to translate a LTL formula over Π to a corresponding DRA. B. Markov Decision Process and probability measure Definition II.2 (Labeled Markov Decision Process). A la- beled Markov decision process (MDP) is a tuple M = (S, U, P, s0, Π, L, g), where S = {1, . . . , n} is a finite set of states; U is a finite set of controls (actions) (with slight abuse of notations we also define the function U(i), where i ∈S and U(i) ⊆U to represent the available controls at state i); P : S × U × S →[0, 1] is the transition probability function such that for all i ∈S, P j∈S P(i, u, j) = 1 if u ∈U(i), and P(i, u, j) = 0 if u /∈U(i); s0 ∈S is the initial state; Π is a set of atomic propositions; L : S →2Π is a labeling function and g : S × U →R+ is such that g(i, u) is the expected (non-negative) cost when control u ∈U(i) is taken at state i. We define a control function µ : S →U such that µ(i) ∈ U(i) for all i ∈S. A infinite sequence of control functions M = {µ0, µ1, . . .} is called a policy. One can use a policy to resolve all non-deterministic choices in an MDP by applying the action µk(sk) at state sk. Given an initial state s0, an infinite sequence rM M = s0s1 . . . on M generated under a policy M is called a path on M if P(sk, µk(sk), sk+1) > 0 for all k. The index k of a path is called stage. If µk = µ for all k, then we call it a stationary policy and we denote it simply as µ. A stationary policy µ induces a Markov chain where its set of states is S and the transition probability from state i to j is P(i, µ(i), j). 
We define PathsM M and FPathsM M as the set of all infinite and finite paths of M under a policy M, respectively. We can then define a probability measure over the set PathsM M. For a path rM M = s0s1 . . . smsm+1 . . . ∈PathsM M, the prefix of length m of rM M is the finite subsequence s0s1 . . . sm. Let PathsM M(s0s1 . . . sm) denote the set of all paths in PathsM M with the prefix s0s1 . . . sm. (Note that s0s1 . . . sm is a finite path in FPathsM M.) Then, the probability measure PrM M on the smallest σ-algebra over PathsM M containing PathsM M(s0s1 . . . sm) for all s0s1 . . . sm ∈FPathsM M is the unique measure satisfying: PrM M{PathsM M(s0s1 . . . sm)} = Y 0≤k 0 for all i, u, and there exists at least one proper policy, the stochastic shortest path problem for Sπ admits an optimal stationary policy as a solution [22]. Hence, for each cycle k, the cycle cost can be minimized if a stationary policy is used for the cycle. Therefore, a policy which is stationary in between cycles is optimal over M, which is in turn, optimal if M = µ⋆. The proof is complete. Unfortunately, it is not clear how to find the optimal policy from (23) except by searching through all policies in M˜µ. This exhaustive search is not feasible for reasonably large problems. Instead, we would like equations in the form of (9) and (10), so that the optimizations can be carried out independently at each state. To circumvent this difficulty, we need to express the gain- bias pair (Jµ, hµ), which is equal to (Js ˜µ, hs ˜µ), in terms of µ. From (7), we have Jµ = P˜µJµ, Jµ + hµ = g˜µ + P˜µhµ. By manipulating the above equations, we can now show that Jµ and hµ can be expressed in terms of µ (analogous to (7)) instead of ˜µ via the following proposition: Proposition IV.10. We have Jµ = PµJµ, Jµ + hµ = gµ + Pµhµ + −→ P µJµ. (24) Moreover, we have (I −−→ P µ)hµ + vµ = Pµvµ, (25) for some vector vµ. Proof. We start from (7): Jµ = P˜µJµ, Jµ + hµ = g˜µ + P˜µhµ. (26) For the first equation in (26), using (17), we have Jµ = P˜µJµ Jµ = (I −−→ P µ)−1←− P µJµ (I −−→ P µ)Jµ = ←− P µJµ Jµ −−→ P µJµ = ←− P µJµ Jµ = (−→ P µ + ←− P µ)Jµ Jµ = PµJµ. For the second equation in (26), using (17) and (19), we have Jµ + hµ = g˜µ + P˜µhµ Jµ + hµ = (I −−→ P µ)−1(gµ + ←− P µhµ) (I −−→ P µ)(Jµ + hµ) = gµ + ←− P µhµ Jµ −−→ P µJµ + hµ −−→ P µhµ = gµ + ←− P µhµ Jµ + hµ −−→ P µJµ = gµ + (−→ P µ + ←− P µ)hµ Jµ + hµ = gµ + Pµhµ + −→ P µJµ. Thus, (26) can be expressed in terms of µ as: Jµ = PµJµ, Jµ + hµ = gµ + Pµhµ + −→ P µJµ. To compute Jµ and hµ, we need an extra equation similar to (8). Using (8), we have hµ + vµ = P˜µvµ hµ + vµ = (I −−→ P µ)−1←− P µvµ (I −−→ P µ)hµ + vµ = Pµvµ, which completes the proof. From Prop. IV.10, similar to the ACPS problem, (Jµ, hµ, vµ) can be solved together by a linear system of 3n equations and 3n unknowns. The insight gained when simplifying Jµ and hµ in terms of µ motivate us to propose the following optimality condition for an optimal policy. Proposition IV.11. The policy µ⋆with gain-bias pair (λ1, h) satisfying λ + h(i) = min u∈U(i)  g(i, u) + n X j=1 P(i, u, j)h(j) + λ n X j=m+1 P(i, u, j)  , (27) for all i = 1, . . . , n, is the optimal policy minimizing (3) over all policies in M. Proof. The optimality condition (27) can be written as: λ1 + h ≤gµ + Pµh + −→ P µλ1, (28) for all stationary policies µ. Note that, given a, b ∈Rn and a ≤b, if A is an n × n matrix with all non-negative entries, then Aa ≤Ab. Moreover, given c ∈Rn, we have a + c ≤b + c. 
s0 s1 1 .2 .8 .3 .7 1 u1 u1 u2 u2 1 u3 a π {a, π} u1 1 Sπ = {s2} s2 s3 (a) q0 q2 q1 q3 q4 {∅, {π}} {∅, {π}} {∅, {π}} {∅, {a}} {a} {a} {a} {π, {a, π}} {a, π} {a, π} {∅, {a}} {a, π} L = ∅ K = {q2, q3} {π, {a, π}} (b) 1 .8 .2 1 1 u2 1 .7 .3 1 1 1 .8 .2 1 u2 u1 u1 u1 u3 u2 1 .8 .2 1 1 .3 1 .7 s1, q1 s2, q0 s3, q2 s0, q3 s1, q3 s2, q3 s3, q3 u1 u3 u1 u2 u2 u1 u1 s0, q1 s1, q1 s0, q2 s2, q2 s2, q1 s3, q1 s3, q2 s1, q2 s0, q0 u1 u1 u3 u2 s0, q4 s1, q4 s2, q4 s3, q4 (c) Fig. 2. The construction of the product MDP between a labeled MDP and a DRA. In this example, the set of atomic proposition is {a, π}. (a): A labeled MDP where the label on top of a state denotes the atomic propositions assigned to the state. The number on top of an arrow pointing from a state s to s′ is the probability P(s, u, s′) associated with a control u ∈U(s). The set of states marked by ovals is Sπ. (b): The DRA Rφ corresponding to the LTL formula φ = 23π ∧23a. In this example, there is one set of accepting states F = {(L, K)} where L = ∅and K = {q2, q3} (marked by double-strokes). Thus, accepting runs of this DRA must visit q2 or q3 (or both) infinitely often. (c): The product MDP P = M × Rφ where states of KP are marked by double-strokes and states of SPπ are marked by ovals. The states with dashed borders are unreachable, and they are removed from SP. From (28) we have λ1 + h ≤ gµ + Pµh + −→ P µλ1 λ1 −−→ P µλ1 + h ≤ gµ + Pµh λ1 −−→ P µλ1 + h ≤ gµ + (←− P µ + −→ P µ)h λ1 −−→ P µλ1 + h −−→ P µh ≤ gµ + ←− P µh (I −−→ P µ)(λ1 + h) ≤ gµ + ←− P µh (29) If µ is proper, then −→ P µ is a transient matrix (see proof of Prop. IV.5), and all of its eigenvalues are strictly inside the unit circle. Therefore, we have (I −−→ P µ)−1 = I + −→ P µ + −→ P 2 µ + . . . . Therefore, since all entries of −→ P µ are non-negative, all entries of (I −−→ P µ)−1 are non-negative. Thus, continuing from (29), we have (I −−→ P µ)(λ1 + h) ≤ gµ + ←− P µh λ1 + h ≤ (I −−→ P µ)−1(gµ + ←− P µh) λ1 + h ≤ g˜µ + P˜µh for all proper policies µ and all ˜µ ∈M˜µ. Hence, ˜µ⋆satisfies (23) and µ⋆is optimal over all proper policies. Using Prop. IV.9, the proof is complete. We will present an algorithm that uses Prop. IV.11 to find the optimal policy in the next section. Note that, unlike (23), (27) can be checked for any policy µ by finding the minimum for all states i = 1, . . . , n, which is significantly easier than searching over all proper policies. V. SYNTHESIZING THE OPTIMAL POLICY UNDER LTL CONSTRAINTS In this section we outline an approach for Prob. III.1. We aim for a computational framework, which in combination with results of [15] produces a policy that both maximizes the satisfaction probability and optimizes the long-term per- formance of the system, using results from Sec. IV. A. Automata-theoretical approach to LTL control synthesis Our approach proceeds by converting the LTL formula φ to a DRA as defined in Def. II.1. We denote the resulting DRA as Rφ = (Q, 2Π, δ, q0, F) with F = {(L(1), K(1)), . . . , (L(M), K(M))} where L(i), K(i) ⊆Q for all i = 1, . . . , M. We now obtain an MDP as the product of a labeled MDP M and a DRA Rφ, which captures all paths of M satisfying φ. Definition V.1 (Product MDP). 
The product MDP M × Rφ between a labeled MDP M = (S, U, P, s0, Π, L, g) and a DRA Rφ = (Q, 2Π, δ, q0, F) is obtained from a tuple P = (SP, U, PP, sP0, FP, SPπ, gP), where (i) SP = S × Q is a set of states; (ii) U is a set of controls inherited from M and we define UP((s, q)) = U(s); (iii) PP gives the transition probabilities: PP((s, q), u, (s′, q′))= ( P(s, u, s′) if q′ = δ(q, L(s)) 0 otherwise; (iv) sP0 = (s0, q0) is the initial state; (v) FP = {(LP(1), KP(1)), . . . , (LP(M), KP(M))} where LP(i) = S × L(i), KP(i) = S × K(i), for i = 1, . . . , M; (vi) SPπ is the set of states in SP for which proposition π is satisfied. Thus, SPπ = Sπ × Q; (vii) gP((s, q), u) = g(s, u) for all (s, q) ∈SP; Note that some states of SP may be unreachable and therefore have no control available. After removing those states (via a simple graph search), P is a valid MDP and is the desired product MDP. With a slight abuse of notation we still denote the product MDP as P and always assume that unreachable states are removed. An example of a product MDP between a labeled MDP and a DRA corresponding to the LTL formula φ = 23π ∧23a is shown in Fig. 2. There is an one-to-one correspondence between a path s0s1, . . . on M and a path (s0, q0)(s1, q1) . . . on P. More- over, from the definition of gP, the costs along these two paths are the same. The product MDP is constructed so that, given a path (s0, q0)(s1, q1) . . ., the corresponding path s0s1 . . . on M generates a word satisfying φ if and only if, there exists (LP, KP) ∈FP such that the set KP is visited infinitely often and LP finitely often. A similar one-to-one correspondence exists for policies: Definition V.2 (Inducing a policy from P). Given policy MP = {µP 0 , µP 1 , . . .} on P, where µP k ((s, q)) ∈UP((s, q)), it induces policy M = {µ0, µ1, . . .} on M by setting µk(sk) = µP k ((sk, qk)) for all k. We denote MP|M as the policy induced by MP, and we use the same notation for a set of policies. An induced policy can be implemented on M by simply keeping track of its current state on P. Note that a stationary policy on P induces a non-stationary policy on M. From the one-to-one correspondence between paths and the equiv- alence of their costs, the expected cost in (3) from initial state s0 under MP|M is equal to the expected cost from initial state (s0, q0) under MP. For each pair of states (LP, KP) ∈FP, we can obtain a set of accepting maximal end components (AMEC): Definition V.3 (Accepting Maximal End Components). Given (LP, KP) ∈FP, an end component C is a communi- cating MDP (SC, UC, PC, KC, SCπ, gC) such that SC ⊆SP, UC ⊆UP, UC(i) ⊆U(i) for all i ∈SC, KC = SC ∩KP, SCπ = SC ∩SPπ, and gC(i, u) = gP(i, u) if i ∈SC, u ∈UC(i). If P(i, u, j) > 0 for any i ∈SC and u ∈UC(i), then j ∈SC, in which case PC(i, u, j) = P(i, u, j). An accepting maximal end components (AMEC) is the largest such end component such that KC ̸= ∅and SC ∩LP = ∅. Note that, an AMEC always contains at least one state in KP and no state in LP. Moreover, it is “absorbing” in the sense that the state does not leave an AMEC once entered. In the example shown in Fig. 2, there exists only one AMEC corresponding to (LP, KP), which is the only pair of states in FP, and the states of this AMEC are shown in Fig. 3. s1, q1 s2, q0 s3, q2 s0, q3 s1, q3 s2, q3 s3, q3 s0, q1 s1, q1 s0, q2 s2, q2 s2, q1 s3, q1 s3, q2 s1, q2 s0, q0 s0, q4 s1, q4 s2, q4 s3, q4 Fig. 3. The states of the only AMEC corresponding to the product MDP in Fig. 2. A procedure to obtain all AMECs of an MDP was pro- vided in [13]. 
From probabilistic model checking, a policy M = MP|M almost surely satisfies φ (i.e., M ∈Mφ) if and only if, under policy MP, there exists AMEC C such that the probability of reaching SC from initial state (s0, q0) is 1 (in this case, we call C a reachable AMEC). In [15], such an optimal policy was found by dynamic programming or solving a linear program. For states inside C, since C itself is a communicating MDP, a policy (not unique) can be easily constructed such that a state in KC is infinitely often visited, satisfying the LTL specification. B. Optimizing the long-term performance of the MDP For a control policy designed to satisfy an LTL formula, the system behavior outside an AMEC is transient, while the behavior inside an AMEC is long-term. The policies obtained in [15]–[17] essentially disregard the behavior inside an AMEC, because, from the verification point of view, the behavior inside an AMEC is for the most part irrelevant, as long as a state in KP is visited infinitely often. We now aim to optimize the long-term behavior of the MDP with respect to the ACPC cost function, while enforcing the satisfaction constraint. Since each AMEC is a communicating MDP, we can use results in Sec. IV-B to help obtaining a solution. Our approach consists of the following steps: (i) Convert formula φ to a DRA Rφ and obtain the product MDP P between M and Rφ; (ii) Obtain the set of reachable AMECs, denoted as A; (iii) For each C ∈A: Find a stationary policy µ⋆ →C(i), defined for i ∈S \ SC, that reaches SC with probability 1 (µ⋆ →C is guaranteed to exist and obtained as in [15]); Find a stationary policy µ⋆ ⟳C(i), defined for i ∈SC minimizing (3) for MDP C and set SCπ while satisfying the LTL constraint; Define µ⋆ C to be: µ⋆ C =  µ⋆ →C(i) if i /∈SC µ⋆ ⟳C(i) if i ∈SC , (30) and denote the ACPC of µ⋆ ⟳C as λC; (iv) We find the solution to Prob. III.1 by: J⋆(s0) = min C∈A λC, (31) and the optimal policy is µ⋆ C⋆|M, where C⋆is the AMEC attaining the minimum in (31). We now provide the sufficient conditions for a policy µ⋆ ⟳C to be optimal. Moreover, if an optimal policy µ⋆ ⟳C can be obtained for each C, we show that the above procedure indeed gives the optimal solution to Prob. III.1. Proposition V.4. For each C ∈A, let µ⋆ C to be constructed as in (30), where µ⋆ ⟳C is a stationary policy satisfying two optimality conditions: (i) its ACPC gain-bias pair is (λC1, h), where λC + h(i) = min u∈UC(i)  gC(i, u) + X j∈SC P(i, u, j)h(j) + λC X j /∈SCπ P(i, u, j)  , (32) for all i ∈SC, and (ii) there exists a state of KC in each recurrent class of µ⋆ ⟳C. Then the optimal cost for Prob. III.1 is J⋆(s0) = minC∈A λC, and the optimal policy is µ⋆ C⋆|M, where C⋆is the AMEC attaining this minimum. Proof. Given C ∈A, define a set of policies MC, such that for each policy in MC: from initial state (s0, q0), (i) SC is reached with probability 1, (ii) S\SC is not visited thereafter, and (iii) KC is visited infinitely often. We see that, by the definition of AMECs, a policy almost surely satisfying φ belongs to MC|M for a C ∈A. Thus, Mφ = ∪C∈AMC|M Since µ⋆ C(i) = µ⋆ →C(i) if i /∈SC, the state reaches SC with probability 1 and in a finite number of stages. We denote the probability that j ∈SC is the first state visited in SC when C is reached from initial state sP0 as ePC(j, µ⋆ →C, sP0). Since the ACPC for the finite path from the initial state to a state j ∈SC is 0 as the cycle index goes to ∞, the ACPC from initial state sP0 under policy µ⋆ C is J(s0) = X j∈SC ePC(j, µ⋆ →C, sP0)Jµ⋆ ⟳C(j). 
(33) Since C is communicating, the optimal cost is the same for all states of SC (and thus it does not matter which state in SC is first visited when SC is reached). We have J(s0) = X j∈SC ePC(j, µ⋆ →C, (s0, q0))λC = λC. (34) Applying Prop. IV.11, we see that µ⋆ ⟳C satisfies the opti- mality condition for MDP C with respect to set SCπ. Since there exists a state of KC is in each recurrent class of µ⋆ ⟳C, a state in KC is visited infinitely often and it satisfies the LTL constraint. Therefore, µ⋆ C as constructed in (30) is optimal over MC and µ⋆ C|M is optimal over MC|M (due to equiv- alence of expected costs between MP and MP|M). Since Mφ = ∪C∈AMC|M, we have that J⋆(s0) = minC∈A λC and the policy corresponding to C⋆attaining this minimum is the optimal policy. We can relax the optimality conditions for µ⋆ ⟳C in Prop. V.4 and require that there exist a state i ∈KC in one recurrent class of µ⋆ ⟳C. For such a policy, we can construct a policy such that it has one recurrent class containing state i, with the same ACPC cost at each state. This construction is identical to a similar procedure for ACPS problems when the MDP is communicating (see [22, p. 203]). We can then use (30) to obtain the optimal policy µ⋆ C for C. We now present an algorithm (see Alg. 1) that iteratively updates the policy in an attempt to find one that satisfies the optimality conditions given in Prop. V.4, for a given C ∈ A. Note that Alg. 1 is similar in nature to policy iteration algorithms for ACPS problems. Proposition V.5. Given C, Alg. 1 terminates in a finite number of iterations. If it returns policy µ⟳C with “optimal”, then µ⟳C satisfies the optimality conditions in Prop. V.4. If C is unichain (i.e., each stationary policy of C contains one recurrent class), then Alg. 1 is guaranteed to return the optimal policy µ⋆ ⟳C. Proof. If C is unichain, then since it is also communicating, µ⋆ ⟳C contains a single recurrent class (and no transient state). In this case, since KC is not empty, states in KC are recurrent and the LTL constraint is always satisfied at step 7 and 9 of Alg. 1. The rest of the proof (for the general case and Algorithm 1 : Policy iteration algorithm for ACPC Input: C = (SC, UC, PC, KC, SCπ, gC) Output: Policy µ⟳C 1: Initialize µ0 to a proper policy containing KC in its re- current classes (such a policy can always be constructed since C is communicating) 2: repeat 3: Given µk, compute Jµk and hµk with (24) and (25) 4: Compute for all i ∈SC: ¯U(i) = arg min u∈UC(i) X j∈SC P(i, u, j)Jµk(j) (35) 5: if µk(i) ∈¯U(i) for all i ∈SC then 6: Compute, for all i ∈SC: ¯ M(i) = arg min u∈¯U(i)  gC(i, u)+ X j∈SC P(i, u, j)hµk(j) + X j /∈SCπ P(i, u, j)Jµk(j)  (36) 7: Find µk+1 such that µk+1(i) ∈¯ M(i) for all i ∈SC, and contains a state of KC in its recurrent classes. If one does not exist. Return: µk with “not optimal” 8: else 9: Find µk+1 such that µk+1(i) ∈¯U(i) for all i ∈SC, and contains a state of KC in its recurrent classes. If one does not exist, Return: µk with “not optimal” 10: end if 11: Set k ←k + 1 12: until µk with gain-bias pair satisfying (32) and Return: µk with “optimal” not assuming C to be unichain) is similar to the proof of convergence for the policy iteration algorithm for the ACPS problem (see [22, pp. 237-239]). Note that the proof is the same except that when the algorithm terminates at step 11 in Alg. 1, µk satisfies (32) instead of the optimality conditions for the ACPS problem ((9) and (10)). 
If we obtain the optimal policy for each C ∈A, then we use (31) to obtain the optimal solution for Prob. III.1. If for some C, Alg. 1 returns “not optimal”, then the policy returned by Alg. 1 is only sub-optimal. We can then apply this algorithm to each AMEC in A and use (31) to obtain a sub-optimal solution for Prob. III.1. Note that similar to policy iteration algorithms for ACPS problems, either the gain or the bias strictly decreases every time when µ is updated, so policy µ is improved in each iteration. In both cases, the satisfaction constraint is always enforced. Remark V.6 (Complexity). The complexity of our proposed algorithm is dictated by the size of the generated MDPs. We use | · | to denote cardinality of a set. The size of the DRA (|Q|) is in the worst case, doubly exponential with respect to |Σ|. However, empirical studies such as [20] have shown that in practice, the sizes of the DRAs for many LTL formulas are generally much lower and manageable. The size of product MDP P is at most |S| × |Q|. The complexity for the algorithm generating AMECs is at most quadratic in the size of P [13]. The complexity of Alg. 1 depends on the size of C. The policy evaluation (step 3) requires solving a system of 3 × |SC| linear equation with 3 × |SC| unknowns. The optimization step (step 4 and 6) each requires at most |UC| × |SC| evaluations. Checking the recurrent classes of µ is linear in |SC|. Therefore, assuming that |UC| is dominated by |SC|2 (which is usually true) and the number of policies satisfying (35) and (36) for all i is also dominated by |SC|2, for each iteration, the computational complexity is O(|SC|3). VI. CASE STUDY The algorithmic framework developed in this paper is implemented in MATLAB, and here we provide an example as a case study. Consider the MDP M shown in Fig. 4, which can be viewed as the dynamics of a robot navigat- ing in an environment with the set of atomic propositions {pickup, dropoff}. In practice, this MDP can be obtained via an abstraction process (see [1]) from the environment, where its probabilities of transitions can be obtained from experimental data or accurate simulations. 1 1 0.2 0.8 1 1 0.9 1 1 1 1 α α α α α β 1 β γ 0.7 0.1 1 dropoff β 1 γ β 0.3 α α 1 α α pickup 0 1 2 3 4 5 6 7 8 9 Fig. 4. MDP capturing a robot navigating in an environment. {α, β, γ} is the set of controls at states. The cost of applying α, β, γ at a state where the control is available is 5, 10, 1, respectively. (e.g., g(i, α) = 5 if α ∈U(i)) The goal of the robot is to continuously perform a pickup- delivery task. The robot is required to pick up items at the state marked by pickup (see Fig. 4), and drop them off at the state marked by dropoff. It is then required to go back to pickup and this process is repeated. This task can be written as the following LTL formula: φ = 23pickup ∧2(pickup ⇒⃝(¬pickup Udropoff)). The first part of φ, 23pickup, enforces that the robot repeatedly pick up items. The remaining part of φ ensures that new items cannot be picked up until the current items are dropped off. We denote pickup as the optimizing proposition, and the goal is to find a policy that satisfies φ with probability 1 and minimizes the expected cost in between visiting the pickup state (i.e., we aim to minimize the expected cost in between picking up items). We generated the DRA Rφ using the ltl2dstar tool [21] with 13 states and 1 pair (L, K) ∈F. 
The product MDP P after removing unreachable states contains 31 states (note that P has 130 states without removing unreachable states). There is one AMEC C corresponding to the only pair in FP and it contains 20 states. We tested Alg. 1 with a number of different initial policies and Alg. 1 produced the optimal policy within 2 or 3 policy updates in each case (note that C is not unichain). For one initial policy, the ACPC was initially 330 at each state of C, and it was reduced to 62.4 at each state when the optimal policy was found. The optimal policy is as follows: State 0 1 2 3 4 5 6 7 8 9 After pickup α β α α α γ γ α β α After dropoff α α α α α α γ α α α The first row of the above table shows the policy after pick-up but before drop-off and the second row shows the policy after drop-off and before another pick-up. VII. CONCLUSIONS We have developed a method to automatically generate a control policy for a dynamical system modelled as a Markov Decision Process (MDP), in order to satisfy specifications given as Linear Temporal Logic formulas. The control policy satisfies the given specification almost surely, if such a policy exists. In addition, the policy optimizes the average cost between satisfying instances of an “optimizing proposition”, under some conditions. The problem is motivated by robotic applications requiring persistent tasks to be performed such as environmental monitoring or data gathering. We are currently pursuing several future directions. First, we aim to solve the problem completely and find an algo- rithm that guarantees to always return the optimal policy. Second, we are interested to apply the optimization criterion of average cost per cycle to more complex models such as Partially Observable MDPs (POMDPs) and semi-MDPs. REFERENCES [1] M. Lahijanian, J. Wasniewski, S. B. Andersson, and C. Belta, “Motion planning and control from temporal logic specifications with proba- bilistic satisfaction guarantees,” in Proc ICRA, Anchorage, AK, 2010, pp. 3227 – 3232. [2] S. Temizer, M. J. Kochenderfer, L. P. Kaelbling, T. Lozano-P´erez, and J. K. Kuchar, “Collision avoidance for unmanned aircraft using Markov decision processes,” in Proc AIAA GN&C, Toronto, Canada, Aug. 2010. [3] R. Alterovitz, T. Sim´eon, and K. Goldberg, “The stochastic motion roadmap: A sampling framework for planning with Markov motion uncertainty,” in RSS, Atlanta, GA, Jun. 2007. [4] H. Kress-Gazit, G. Fainekos, and G. J. Pappas, “Where’s Waldo? Sensor-based temporal logic motion planning,” in Proc ICRA, Rome, Italy, 2007, pp. 3116–3121. [5] S. Karaman and E. Frazzoli, “Sampling-based motion planning with deterministic µ-calculus specifications,” in Proc CDC, Shanghai, China, 2009, pp. 2222–2229. [6] S. G. Loizou and K. J. Kyriakopoulos, “Automatic synthesis of multiagent motion tasks based on LTL specifications,” in Proc CDC, Paradise Island, Bahamas, 2004, pp. 153–158. [7] T. Wongpiromsarn, U. Topcu, and R. M. Murray, “Receding horizon temporal logic planning for dynamical systems,” in Proc CDC, Shang- hai, China, 2009, pp. 5997–6004. [8] E. M. Clarke, D. Peled, and O. Grumberg, Model checking. MIT Press, 1999. [9] N. Piterman, A. Pnueli, and Y. Saar, “Synthesis of reactive(1) designs,” in International Conference on Verification, Model Checking, and Abstract Interpretation, Charleston, SC, 2006, pp. 364–380. [10] M. Kloetzer and C. Belta, “A fully automated framework for control of linear systems from temporal logic specifications,” IEEE Trans Automatic Ctrl, vol. 53, no. 1, pp. 287–297, 2008. 
[11] M.Kloetzer and C. Belta, “Dealing with non-determinism in symbolic control,” in Hybrid Systems: Computation and Control, ser. Lect. Notes Comp. Science, M. Egerstedt and B. Mishra, Eds. Springer Verlag, 2008, pp. 287–300. [12] L. De Alfaro, “Formal verification of probabilistic systems,” Ph.D. dissertation, Stanford University, 1997. [13] C. Baier, J.-P. Katoen, and K. G. Larsen, Principles of Model Check- ing. MIT Press, 2008. [14] M. Vardi, “Probabilistic linear-time model checking: An overview of the automata-theoretic approach,” Formal Methods for Real-Time and Probabilistic Systems, pp. 265–276, 1999. [15] X. C. Ding, S. L. Smith, C. Belta, and D. Rus, “LTL control in uncertain environments with probabilistic satisfaction guarantees,” in Proc IFAC World C, Milan, Italy, Aug. 2011, to appear. [16] C. Courcoubetis and M. Yannakakis, “Markov decision processes and regular events,” IEEE Trans Automatic Ctrl, vol. 43, no. 10, pp. 1399– 1418, 1998. [17] C. Baier, M. Gr¨oßer, M. Leucker, B. Bollig, and F. Ciesinski, “Con- troller synthesis for probabilistic systems,” in Proceedings of IFIP TCS’2004. Kluwer, 2004. [18] S. L. Smith, J. T˚umov´a, C. Belta, and D. Rus, “Optimal path planning under temporal constraints,” in Proc IROS, Taipei, Taiwan, Oct. 2010, pp. 3288–3293. [19] E. Gradel, W. Thomas, and T. Wilke, Automata, logics, and infinite games: A guide to current research, ser. Lect. Notes Comp. Science. Springer Verlag, 2002, vol. 2500. [20] J. Klein and C. Baier, “Experiments with deterministic ω-automata for formulas of linear temporal logic,” Theoretical Computer Science, vol. 363, no. 2, pp. 182–195, 2006. [21] J. Klein, “ltl2dstar - LTL to deterministic Streett and Rabin automata,” http://www.ltl2dstar.de/, 2007, viewed September 2010. [22] D. Bertsekas, Dynamic programming and optimal control, vol. II. Athena Scientific, 2007. [23] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming. John Wiley and Sons, 1994. [24] L. Hogben, Handbook of linear algebra. CRC Press, 2007. MDP Optimal Control under Temporal Logic Constraints - Technical Report - Xu Chu Ding Stephen L. Smith Calin Belta Daniela Rus Abstract— In this paper, we develop a method to automati- cally generate a control policy for a dynamical system modeled as a Markov Decision Process (MDP). The control specification is given as a Linear Temporal Logic (LTL) formula over a set of propositions defined on the states of the MDP. We synthesize a control policy such that the MDP satisfies the given specification almost surely, if such a policy exists. In addition, we designate an “optimizing proposition” to be repeatedly satisfied, and we formulate a novel optimization criterion in terms of minimizing the expected cost in between satisfactions of this proposition. We propose a sufficient condition for a policy to be optimal, and develop a dynamic programming algorithm that synthesizes a policy that is optimal under some conditions, and sub-optimal otherwise. This problem is motivated by robotic applications requiring persistent tasks, such as environmental monitoring or data gathering, to be performed. I. INTRODUCTION In this paper, we consider the problem of controlling a (fi- nite state) Markov Decision Process (MDP). Such models are widely used in various areas including engineering, biology, and economics. 
In particular, in recent results, they have been successfully used to model and control autonomous robots subject to uncertainty in their sensing and actuation, such as for ground robots [1], unmanned aircraft [2] and surgical steering needles [3]. Several authors [4]–[7] have proposed using temporal log- ics, such as Linear Temporal Logic (LTL) and Computation Tree Logic (CTL) [8], as specification languages for control systems. Such logics are appealing because they have well defined syntax and semantics, which can be easily used to specify complex behavior. In particular, in LTL, it is possible to specify persistent tasks, e.g., “Visit regions A, then B, and then C, infinitely often. Never enter B unless coming directly from D.” In addition, off-the-shelf model checking algorithms [8] and temporal logic game strategies [9] can be used to verify the correctness of system trajectories and to synthesize provably correct control strategies. The existing works focusing on LTL assume that a finite model of the system is available and the current state can be precisely determined. If the control model is deterministic (i.e., at each state, an available control enables exactly one transition), control strategies from specifications given as This work was supported in part by ONR-MURI N00014-09-1051, ARO W911NF-09-1-0088, AFOSR YIP FA9550-09-1-020, and NSF CNS- 0834260. X. C. Ding and C. Belta are with Department of Mechanical En- gineering, Boston University, Boston, MA 02215, USA (email: {xcding; cbelta}@bu.edu). S. L. Smith is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo ON, N2L 3G1 Canada (email: stephen.smith@uwaterloo.ca). D. Rus is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (email: rus@csail.mit.edu). LTL formulas can be found through simple adaptations of off-the-shelf model checking algorithms [10]. If the control is non-deterministic (an available control at a state enables one of several transitions, and their probabilities are not known), the control problem from an LTL specification can be mapped to the solution of a B¨uchi or GR(1) game if the specification is restricted to a fragment of LTL [4], [11]. If the probabilities of the enabled transitions at each state are known, the control problem reduces to finding a control policy for an MDP such that a probabilistic temporal logic formula is satisfied [12]. By adapting methods from probabilistic model-checking [12]–[14], we have recently developed frameworks for de- riving MDP control policies from LTL formulas [15], which is related to a number of other approaches [16], [17] that address the problem of synthesizing control policies for MDPs subject to LTL satisfaction constraints. In all of the above approaches, a control policy is designed to maximize the probability of satisfying a given LTL formula. However, no attempt has been made so far to optimize the long-term behavior of the system, while enforcing LTL satisfaction guarantees. Such an objective is often critical in many applications, such as surveillance, persistent monitoring, and pickup delivery tasks, where a robot must repeatedly visit some areas in an environment and the time in between revisits should be minimized. As far as we know, this paper is the first attempt to compute an optimal control policy for a dynamical sys- tem modeled as an MDP, while satisfying temporal logic constraints. 
This work begins to bridge the gap between our prior work on MDP control policies maximizing the probability of satisfying an LTL formula [15] and optimal path planning under LTL constraints [18]. We consider LTL formulas defined over a set of propositions assigned to the states of the MDP. We synthesize a control policy such that the MDP satisfies the specification almost surely, if such a policy exists. In addition, we minimize the expected cost between satisfying instances of an “optimizing proposition.” The main contribution of this paper is two-fold. First, we formulate the above MDP optimization problem in terms of minimizing the average cost per cycle, where a cycle is defined between successive satisfaction of the optimizing proposition. We present a novel connection between this problem and the well-known average cost per stage problem. Second, we incorporate the LTL constraints and obtain a sufficient condition for a policy to be optimal. We present a dynamic programming algorithm that under some (heavy) restrictions synthesizes an optimal control policy, and a sub- optimal policy otherwise. arXiv:1103.4342v2 [cs.RO] 23 Mar 2011 The organization of this paper is as follows. In Sec. II we provide some preliminaries. We formulate the problem in Sec. III and we formalize the connection between the average cost per cycle and the average cost per stage problem in Sec. IV. In Sec. V, we provide a method for incorporating LTL constraints. We present a case study illustrating our framework in Sec. VI and we conclude in Sec. VII II. PRELIMINARIES A. Linear Temporal Logic We employ Linear Temporal Logic (LTL) to describe MDP control specifications. A detailed description of the syntax and semantics of LTL is beyond the scope of this paper and can be found in [8], [13]. Roughly, an LTL formula is built up from a set of atomic propositions Π, standard Boolean operators ¬ (negation), ∨(disjunction), ∧ (conjunction), and temporal operators ⃝(next), U (until), 3 (eventually), 2 (always). The semantics of LTL formulas are given over infinite words in 2Π. A word satisfies an LTL formula φ if φ is true at the first position of the word; 2φ means that φ is true at all positions of the word; 3φ means that φ eventually becomes true in the word; φ1 Uφ2 means that φ1 has to hold at least until φ2 is true. More expressivity can be achieved by combining the above temporal and Boolean operators (more examples are given later). An LTL formula can be represented by a deterministic Rabin automaton, which is defined as follows. Definition II.1 (Deterministic Rabin Automaton). A de- terministic Rabin automaton (DRA) is a tuple R = (Q, Σ, δ, q0, F), where (i) Q is a finite set of states; (ii) Σ is a set of inputs (alphabet); (iii) δ : Q × Σ →Q is the transition function; (iv) q0 ∈Q is the initial state; and (v) F = {(L(1), K(1)), . . . , (L(M), K(M))} is a set of pairs of sets of states such that L(i), K(i) ⊆Q for all i = 1, . . . , M. A run of a Rabin automaton R, denoted by rR = q0q1 . . ., is an infinite sequence of states in R such that for each i ≥0, qi+1 ∈δ(qi, α) for some α ∈Σ. A run rR is accepting if there exists a pair (L, K) ∈F such that 1) there exists n ≥0, such that for all m ≥n, we have qm /∈L, and 2) there exist infinitely many indices k where qk ∈K. This acceptance conditions means that rR is accepting if for a pair (L, K) ∈F, rR intersects with L finitely many times and K infinitely many times. Note that for a given pair (L, K), L can be an empty set, but K is not empty. 
For any LTL formula φ over Π, one can construct a DRA with input alphabet Σ = 2Π accepting all and only words over Π that satisfy φ (see [19]). We refer readers to [20] and references therein for algorithms and to freely available implementations, such as [21], to translate a LTL formula over Π to a corresponding DRA. B. Markov Decision Process and probability measure Definition II.2 (Labeled Markov Decision Process). A la- beled Markov decision process (MDP) is a tuple M = (S, U, P, s0, Π, L, g), where S = {1, . . . , n} is a finite set of states; U is a finite set of controls (actions) (with slight abuse of notations we also define the function U(i), where i ∈S and U(i) ⊆U to represent the available controls at state i); P : S × U × S →[0, 1] is the transition probability function such that for all i ∈S, P j∈S P(i, u, j) = 1 if u ∈U(i), and P(i, u, j) = 0 if u /∈U(i); s0 ∈S is the initial state; Π is a set of atomic propositions; L : S →2Π is a labeling function and g : S × U →R+ is such that g(i, u) is the expected (non-negative) cost when control u ∈U(i) is taken at state i. We define a control function µ : S →U such that µ(i) ∈ U(i) for all i ∈S. A infinite sequence of control functions M = {µ0, µ1, . . .} is called a policy. One can use a policy to resolve all non-deterministic choices in an MDP by applying the action µk(sk) at state sk. Given an initial state s0, an infinite sequence rM M = s0s1 . . . on M generated under a policy M is called a path on M if P(sk, µk(sk), sk+1) > 0 for all k. The index k of a path is called stage. If µk = µ for all k, then we call it a stationary policy and we denote it simply as µ. A stationary policy µ induces a Markov chain where its set of states is S and the transition probability from state i to j is P(i, µ(i), j). We define PathsM M and FPathsM M as the set of all infinite and finite paths of M under a policy M, respectively. We can then define a probability measure over the set PathsM M. For a path rM M = s0s1 . . . smsm+1 . . . ∈PathsM M, the prefix of length m of rM M is the finite subsequence s0s1 . . . sm. Let PathsM M(s0s1 . . . sm) denote the set of all paths in PathsM M with the prefix s0s1 . . . sm. (Note that s0s1 . . . sm is a finite path in FPathsM M.) Then, the probability measure PrM M on the smallest σ-algebra over PathsM M containing PathsM M(s0s1 . . . sm) for all s0s1 . . . sm ∈FPathsM M is the unique measure satisfying: PrM M{PathsM M(s0s1 . . . sm)} = Y 0≤k 0 for all i, u, and there exists at least one proper policy, the stochastic shortest path problem for Sπ admits an optimal stationary policy as a solution [22]. Hence, for each cycle k, the cycle cost can be minimized if a stationary policy is used for the cycle. Therefore, a policy which is stationary in between cycles is optimal over M, which is in turn, optimal if M = µ⋆. The proof is complete. Unfortunately, it is not clear how to find the optimal policy from (23) except by searching through all policies in M˜µ. This exhaustive search is not feasible for reasonably large problems. Instead, we would like equations in the form of (9) and (10), so that the optimizations can be carried out independently at each state. To circumvent this difficulty, we need to express the gain- bias pair (Jµ, hµ), which is equal to (Js ˜µ, hs ˜µ), in terms of µ. From (7), we have Jµ = P˜µJµ, Jµ + hµ = g˜µ + P˜µhµ. 
By manipulating the above equations, we can now show that Jµ and hµ can be expressed in terms of µ (analogous to (7)) instead of ˜µ via the following proposition: Proposition IV.10. We have Jµ = PµJµ, Jµ + hµ = gµ + Pµhµ + −→ P µJµ. (24) Moreover, we have (I −−→ P µ)hµ + vµ = Pµvµ, (25) for some vector vµ. Proof. We start from (7): Jµ = P˜µJµ, Jµ + hµ = g˜µ + P˜µhµ. (26) For the first equation in (26), using (17), we have Jµ = P˜µJµ Jµ = (I −−→ P µ)−1←− P µJµ (I −−→ P µ)Jµ = ←− P µJµ Jµ −−→ P µJµ = ←− P µJµ Jµ = (−→ P µ + ←− P µ)Jµ Jµ = PµJµ. For the second equation in (26), using (17) and (19), we have Jµ + hµ = g˜µ + P˜µhµ Jµ + hµ = (I −−→ P µ)−1(gµ + ←− P µhµ) (I −−→ P µ)(Jµ + hµ) = gµ + ←− P µhµ Jµ −−→ P µJµ + hµ −−→ P µhµ = gµ + ←− P µhµ Jµ + hµ −−→ P µJµ = gµ + (−→ P µ + ←− P µ)hµ Jµ + hµ = gµ + Pµhµ + −→ P µJµ. Thus, (26) can be expressed in terms of µ as: Jµ = PµJµ, Jµ + hµ = gµ + Pµhµ + −→ P µJµ. To compute Jµ and hµ, we need an extra equation similar to (8). Using (8), we have hµ + vµ = P˜µvµ hµ + vµ = (I −−→ P µ)−1←− P µvµ (I −−→ P µ)hµ + vµ = Pµvµ, which completes the proof. From Prop. IV.10, similar to the ACPS problem, (Jµ, hµ, vµ) can be solved together by a linear system of 3n equations and 3n unknowns. The insight gained when simplifying Jµ and hµ in terms of µ motivate us to propose the following optimality condition for an optimal policy. Proposition IV.11. The policy µ⋆with gain-bias pair (λ1, h) satisfying λ + h(i) = min u∈U(i)  g(i, u) + n X j=1 P(i, u, j)h(j) + λ n X j=m+1 P(i, u, j)  , (27) for all i = 1, . . . , n, is the optimal policy minimizing (3) over all policies in M. Proof. The optimality condition (27) can be written as: λ1 + h ≤gµ + Pµh + −→ P µλ1, (28) for all stationary policies µ. Note that, given a, b ∈Rn and a ≤b, if A is an n × n matrix with all non-negative entries, then Aa ≤Ab. Moreover, given c ∈Rn, we have a + c ≤b + c. s0 s1 1 .2 .8 .3 .7 1 u1 u1 u2 u2 1 u3 a π {a, π} u1 1 Sπ = {s2} s2 s3 (a) q0 q2 q1 q3 q4 {∅, {π}} {∅, {π}} {∅, {π}} {∅, {a}} {a} {a} {a} {π, {a, π}} {a, π} {a, π} {∅, {a}} {a, π} L = ∅ K = {q2, q3} {π, {a, π}} (b) 1 .8 .2 1 1 u2 1 .7 .3 1 1 1 .8 .2 1 u2 u1 u1 u1 u3 u2 1 .8 .2 1 1 .3 1 .7 s1, q1 s2, q0 s3, q2 s0, q3 s1, q3 s2, q3 s3, q3 u1 u3 u1 u2 u2 u1 u1 s0, q1 s1, q1 s0, q2 s2, q2 s2, q1 s3, q1 s3, q2 s1, q2 s0, q0 u1 u1 u3 u2 s0, q4 s1, q4 s2, q4 s3, q4 (c) Fig. 2. The construction of the product MDP between a labeled MDP and a DRA. In this example, the set of atomic proposition is {a, π}. (a): A labeled MDP where the label on top of a state denotes the atomic propositions assigned to the state. The number on top of an arrow pointing from a state s to s′ is the probability P(s, u, s′) associated with a control u ∈U(s). The set of states marked by ovals is Sπ. (b): The DRA Rφ corresponding to the LTL formula φ = 23π ∧23a. In this example, there is one set of accepting states F = {(L, K)} where L = ∅and K = {q2, q3} (marked by double-strokes). Thus, accepting runs of this DRA must visit q2 or q3 (or both) infinitely often. (c): The product MDP P = M × Rφ where states of KP are marked by double-strokes and states of SPπ are marked by ovals. The states with dashed borders are unreachable, and they are removed from SP. From (28) we have λ1 + h ≤ gµ + Pµh + −→ P µλ1 λ1 −−→ P µλ1 + h ≤ gµ + Pµh λ1 −−→ P µλ1 + h ≤ gµ + (←− P µ + −→ P µ)h λ1 −−→ P µλ1 + h −−→ P µh ≤ gµ + ←− P µh (I −−→ P µ)(λ1 + h) ≤ gµ + ←− P µh (29) If µ is proper, then −→ P µ is a transient matrix (see proof of Prop. 
IV.5), and all of its eigenvalues are strictly inside the unit circle. Therefore, we have (I −−→ P µ)−1 = I + −→ P µ + −→ P 2 µ + . . . . Therefore, since all entries of −→ P µ are non-negative, all entries of (I −−→ P µ)−1 are non-negative. Thus, continuing from (29), we have (I −−→ P µ)(λ1 + h) ≤ gµ + ←− P µh λ1 + h ≤ (I −−→ P µ)−1(gµ + ←− P µh) λ1 + h ≤ g˜µ + P˜µh for all proper policies µ and all ˜µ ∈M˜µ. Hence, ˜µ⋆satisfies (23) and µ⋆is optimal over all proper policies. Using Prop. IV.9, the proof is complete. We will present an algorithm that uses Prop. IV.11 to find the optimal policy in the next section. Note that, unlike (23), (27) can be checked for any policy µ by finding the minimum for all states i = 1, . . . , n, which is significantly easier than searching over all proper policies. V. SYNTHESIZING THE OPTIMAL POLICY UNDER LTL CONSTRAINTS In this section we outline an approach for Prob. III.1. We aim for a computational framework, which in combination with results of [15] produces a policy that both maximizes the satisfaction probability and optimizes the long-term per- formance of the system, using results from Sec. IV. A. Automata-theoretical approach to LTL control synthesis Our approach proceeds by converting the LTL formula φ to a DRA as defined in Def. II.1. We denote the resulting DRA as Rφ = (Q, 2Π, δ, q0, F) with F = {(L(1), K(1)), . . . , (L(M), K(M))} where L(i), K(i) ⊆Q for all i = 1, . . . , M. We now obtain an MDP as the product of a labeled MDP M and a DRA Rφ, which captures all paths of M satisfying φ. Definition V.1 (Product MDP). The product MDP M × Rφ between a labeled MDP M = (S, U, P, s0, Π, L, g) and a DRA Rφ = (Q, 2Π, δ, q0, F) is obtained from a tuple P = (SP, U, PP, sP0, FP, SPπ, gP), where (i) SP = S × Q is a set of states; (ii) U is a set of controls inherited from M and we define UP((s, q)) = U(s); (iii) PP gives the transition probabilities: PP((s, q), u, (s′, q′))= ( P(s, u, s′) if q′ = δ(q, L(s)) 0 otherwise; (iv) sP0 = (s0, q0) is the initial state; (v) FP = {(LP(1), KP(1)), . . . , (LP(M), KP(M))} where LP(i) = S × L(i), KP(i) = S × K(i), for i = 1, . . . , M; (vi) SPπ is the set of states in SP for which proposition π is satisfied. Thus, SPπ = Sπ × Q; (vii) gP((s, q), u) = g(s, u) for all (s, q) ∈SP; Note that some states of SP may be unreachable and therefore have no control available. After removing those states (via a simple graph search), P is a valid MDP and is the desired product MDP. With a slight abuse of notation we still denote the product MDP as P and always assume that unreachable states are removed. An example of a product MDP between a labeled MDP and a DRA corresponding to the LTL formula φ = 23π ∧23a is shown in Fig. 2. There is an one-to-one correspondence between a path s0s1, . . . on M and a path (s0, q0)(s1, q1) . . . on P. More- over, from the definition of gP, the costs along these two paths are the same. The product MDP is constructed so that, given a path (s0, q0)(s1, q1) . . ., the corresponding path s0s1 . . . on M generates a word satisfying φ if and only if, there exists (LP, KP) ∈FP such that the set KP is visited infinitely often and LP finitely often. A similar one-to-one correspondence exists for policies: Definition V.2 (Inducing a policy from P). Given policy MP = {µP 0 , µP 1 , . . .} on P, where µP k ((s, q)) ∈UP((s, q)), it induces policy M = {µ0, µ1, . . .} on M by setting µk(sk) = µP k ((sk, qk)) for all k. We denote MP|M as the policy induced by MP, and we use the same notation for a set of policies. 
An induced policy can be implemented on M by simply keeping track of its current state on P. Note that a stationary policy on P induces a non-stationary policy on M. From the one-to-one correspondence between paths and the equiv- alence of their costs, the expected cost in (3) from initial state s0 under MP|M is equal to the expected cost from initial state (s0, q0) under MP. For each pair of states (LP, KP) ∈FP, we can obtain a set of accepting maximal end components (AMEC): Definition V.3 (Accepting Maximal End Components). Given (LP, KP) ∈FP, an end component C is a communi- cating MDP (SC, UC, PC, KC, SCπ, gC) such that SC ⊆SP, UC ⊆UP, UC(i) ⊆U(i) for all i ∈SC, KC = SC ∩KP, SCπ = SC ∩SPπ, and gC(i, u) = gP(i, u) if i ∈SC, u ∈UC(i). If P(i, u, j) > 0 for any i ∈SC and u ∈UC(i), then j ∈SC, in which case PC(i, u, j) = P(i, u, j). An accepting maximal end components (AMEC) is the largest such end component such that KC ̸= ∅and SC ∩LP = ∅. Note that, an AMEC always contains at least one state in KP and no state in LP. Moreover, it is “absorbing” in the sense that the state does not leave an AMEC once entered. In the example shown in Fig. 2, there exists only one AMEC corresponding to (LP, KP), which is the only pair of states in FP, and the states of this AMEC are shown in Fig. 3. s1, q1 s2, q0 s3, q2 s0, q3 s1, q3 s2, q3 s3, q3 s0, q1 s1, q1 s0, q2 s2, q2 s2, q1 s3, q1 s3, q2 s1, q2 s0, q0 s0, q4 s1, q4 s2, q4 s3, q4 Fig. 3. The states of the only AMEC corresponding to the product MDP in Fig. 2. A procedure to obtain all AMECs of an MDP was pro- vided in [13]. From probabilistic model checking, a policy M = MP|M almost surely satisfies φ (i.e., M ∈Mφ) if and only if, under policy MP, there exists AMEC C such that the probability of reaching SC from initial state (s0, q0) is 1 (in this case, we call C a reachable AMEC). In [15], such an optimal policy was found by dynamic programming or solving a linear program. For states inside C, since C itself is a communicating MDP, a policy (not unique) can be easily constructed such that a state in KC is infinitely often visited, satisfying the LTL specification. B. Optimizing the long-term performance of the MDP For a control policy designed to satisfy an LTL formula, the system behavior outside an AMEC is transient, while the behavior inside an AMEC is long-term. The policies obtained in [15]–[17] essentially disregard the behavior inside an AMEC, because, from the verification point of view, the behavior inside an AMEC is for the most part irrelevant, as long as a state in KP is visited infinitely often. We now aim to optimize the long-term behavior of the MDP with respect to the ACPC cost function, while enforcing the satisfaction constraint. Since each AMEC is a communicating MDP, we can use results in Sec. IV-B to help obtaining a solution. Our approach consists of the following steps: (i) Convert formula φ to a DRA Rφ and obtain the product MDP P between M and Rφ; (ii) Obtain the set of reachable AMECs, denoted as A; (iii) For each C ∈A: Find a stationary policy µ⋆ →C(i), defined for i ∈S \ SC, that reaches SC with probability 1 (µ⋆ →C is guaranteed to exist and obtained as in [15]); Find a stationary policy µ⋆ ⟳C(i), defined for i ∈SC minimizing (3) for MDP C and set SCπ while satisfying the LTL constraint; Define µ⋆ C to be: µ⋆ C =  µ⋆ →C(i) if i /∈SC µ⋆ ⟳C(i) if i ∈SC , (30) and denote the ACPC of µ⋆ ⟳C as λC; (iv) We find the solution to Prob. 
We now provide sufficient conditions for a policy $\mu^\star_{\circlearrowleft C}$ to be optimal. Moreover, if an optimal policy $\mu^\star_{\circlearrowleft C}$ can be obtained for each $C$, we show that the above procedure indeed gives the optimal solution to Prob. III.1.

Proposition V.4. For each $C \in A$, let $\mu^\star_C$ be constructed as in (30), where $\mu^\star_{\circlearrowleft C}$ is a stationary policy satisfying two optimality conditions: (i) its ACPC gain-bias pair is $(\lambda_C \mathbf{1}, h)$, where
\[
\lambda_C + h(i) = \min_{u \in U_C(i)} \Big[ g_C(i,u) + \sum_{j \in S_C} P(i,u,j)\, h(j) + \lambda_C \sum_{j \notin S_{C\pi}} P(i,u,j) \Big] \tag{32}
\]
for all $i \in S_C$, and (ii) there exists a state of $K_C$ in each recurrent class of $\mu^\star_{\circlearrowleft C}$. Then the optimal cost for Prob. III.1 is $J^\star(s_0) = \min_{C \in A} \lambda_C$, and the optimal policy is $\mu^\star_{C^\star}|_M$, where $C^\star$ is the AMEC attaining this minimum.

Proof. Given $C \in A$, define a set of policies $M_C$ such that, for each policy in $M_C$, from initial state $(s_0, q_0)$: (i) $S_C$ is reached with probability 1, (ii) $S \setminus S_C$ is not visited thereafter, and (iii) $K_C$ is visited infinitely often. We see that, by the definition of AMECs, a policy almost surely satisfying $\phi$ belongs to $M_C|_M$ for some $C \in A$. Thus, $M_\phi = \cup_{C \in A} M_C|_M$.

Since $\mu^\star_C(i) = \mu^\star_{\to C}(i)$ if $i \notin S_C$, the state reaches $S_C$ with probability 1 and in a finite number of stages. We denote by $\widetilde{P}_C(j, \mu^\star_{\to C}, s_{P0})$ the probability that $j \in S_C$ is the first state visited in $S_C$ when $S_C$ is reached from initial state $s_{P0}$. Since the cost of the finite path from the initial state to the first state $j \in S_C$ does not contribute to the ACPC as the cycle index goes to $\infty$, the ACPC from initial state $s_{P0}$ under policy $\mu^\star_C$ is
\[
J(s_0) = \sum_{j \in S_C} \widetilde{P}_C(j, \mu^\star_{\to C}, s_{P0})\, J_{\mu^\star_{\circlearrowleft C}}(j). \tag{33}
\]
Since $C$ is communicating, the optimal cost is the same for all states of $S_C$ (and thus it does not matter which state of $S_C$ is visited first when $S_C$ is reached). We have
\[
J(s_0) = \sum_{j \in S_C} \widetilde{P}_C(j, \mu^\star_{\to C}, (s_0, q_0))\, \lambda_C = \lambda_C. \tag{34}
\]
Applying Prop. IV.11, we see that $\mu^\star_{\circlearrowleft C}$ satisfies the optimality condition for the MDP $C$ with respect to the set $S_{C\pi}$. Since there exists a state of $K_C$ in each recurrent class of $\mu^\star_{\circlearrowleft C}$, a state in $K_C$ is visited infinitely often and the LTL constraint is satisfied. Therefore, $\mu^\star_C$ as constructed in (30) is optimal over $M_C$, and $\mu^\star_C|_M$ is optimal over $M_C|_M$ (due to the equivalence of expected costs between $M_P$ and $M_P|_M$). Since $M_\phi = \cup_{C \in A} M_C|_M$, we have $J^\star(s_0) = \min_{C \in A} \lambda_C$, and the policy corresponding to the $C^\star$ attaining this minimum is the optimal policy.

We can relax the optimality conditions for $\mu^\star_{\circlearrowleft C}$ in Prop. V.4 and require only that there exist a state $i \in K_C$ in one recurrent class of $\mu^\star_{\circlearrowleft C}$. For such a policy, we can construct a policy that has a single recurrent class containing state $i$, with the same ACPC cost at each state. This construction is identical to a similar procedure for ACPS problems when the MDP is communicating (see [22, p. 203]). We can then use (30) to obtain the optimal policy $\mu^\star_C$ for $C$.
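As a concrete reading of condition (32), the following Python sketch checks whether a candidate gain-bias pair $(\lambda_C \mathbf{1}, h)$ satisfies (32) on an AMEC $C$, up to a numerical tolerance. The dictionary encoding of $C$ mirrors the hypothetical one used in the earlier sketches and is not the paper's data structure.

```python
def satisfies_condition_32(C, lam, h, tol=1e-8):
    """Check the optimality condition (32) of Prop. V.4 for a candidate
    gain-bias pair (lam * 1, h) on an AMEC C.

    Hypothetical encoding: C = {'SC': states, 'UC': state -> controls,
    'P': (i, u, j) -> prob, 'g': (i, u) -> cost,
    'SCpi': states satisfying the optimizing proposition}; h: state -> bias.
    """
    for i in C['SC']:
        rhs = min(
            C['g'][(i, u)]
            + sum(C['P'].get((i, u, j), 0.0) * h[j] for j in C['SC'])
            + lam * sum(C['P'].get((i, u, j), 0.0)
                        for j in C['SC'] if j not in C['SCpi'])
            for u in C['UC'][i]
        )
        # (32) requires lam + h(i) to equal the minimized right-hand side at every i.
        if abs(lam + h[i] - rhs) > tol:
            return False
    return True
```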
We now present an algorithm (Alg. 1) that iteratively updates the policy in an attempt to find one satisfying the optimality conditions given in Prop. V.4, for a given $C \in A$. Note that Alg. 1 is similar in nature to policy iteration algorithms for ACPS problems. (An illustrative code sketch of this iteration is given at the end of this section.)

Algorithm 1: Policy iteration algorithm for ACPC
Input: $C = (S_C, U_C, P_C, K_C, S_{C\pi}, g_C)$
Output: Policy $\mu_{\circlearrowleft C}$
1: Initialize $\mu_0$ to a proper policy containing $K_C$ in its recurrent classes (such a policy can always be constructed since $C$ is communicating)
2: repeat
3: Given $\mu_k$, compute $J_{\mu_k}$ and $h_{\mu_k}$ with (24) and (25)
4: Compute, for all $i \in S_C$:
\[
\bar{U}(i) = \arg\min_{u \in U_C(i)} \sum_{j \in S_C} P(i,u,j)\, J_{\mu_k}(j) \tag{35}
\]
5: if $\mu_k(i) \in \bar{U}(i)$ for all $i \in S_C$ then
6: Compute, for all $i \in S_C$:
\[
\bar{M}(i) = \arg\min_{u \in \bar{U}(i)} \Big[ g_C(i,u) + \sum_{j \in S_C} P(i,u,j)\, h_{\mu_k}(j) + \sum_{j \notin S_{C\pi}} P(i,u,j)\, J_{\mu_k}(j) \Big] \tag{36}
\]
7: Find $\mu_{k+1}$ such that $\mu_{k+1}(i) \in \bar{M}(i)$ for all $i \in S_C$ and $\mu_{k+1}$ contains a state of $K_C$ in its recurrent classes. If one does not exist, return $\mu_k$ with "not optimal"
8: else
9: Find $\mu_{k+1}$ such that $\mu_{k+1}(i) \in \bar{U}(i)$ for all $i \in S_C$ and $\mu_{k+1}$ contains a state of $K_C$ in its recurrent classes. If one does not exist, return $\mu_k$ with "not optimal"
10: end if
11: Set $k \leftarrow k + 1$
12: until the gain-bias pair of $\mu_k$ satisfies (32); return $\mu_k$ with "optimal"

Proposition V.5. Given $C$, Alg. 1 terminates in a finite number of iterations. If it returns policy $\mu_{\circlearrowleft C}$ with "optimal", then $\mu_{\circlearrowleft C}$ satisfies the optimality conditions in Prop. V.4. If $C$ is unichain (i.e., each stationary policy of $C$ contains one recurrent class), then Alg. 1 is guaranteed to return the optimal policy $\mu^\star_{\circlearrowleft C}$.

Proof. If $C$ is unichain then, since it is also communicating, $\mu^\star_{\circlearrowleft C}$ contains a single recurrent class (and no transient state). In this case, since $K_C$ is not empty, states in $K_C$ are recurrent and the LTL constraint is always satisfied at steps 7 and 9 of Alg. 1. The rest of the proof (for the general case, not assuming $C$ to be unichain) is similar to the proof of convergence of the policy iteration algorithm for the ACPS problem (see [22, pp. 237–239]). Note that the proof is the same, except that when the algorithm terminates (at the until condition of Alg. 1), $\mu_k$ satisfies (32) instead of the optimality conditions for the ACPS problem ((9) and (10)).

If we obtain the optimal policy for each $C \in A$, then we use (31) to obtain the optimal solution to Prob. III.1. If, for some $C$, Alg. 1 returns "not optimal", then the policy returned by Alg. 1 is only sub-optimal. We can then apply the algorithm to each AMEC in $A$ and use (31) to obtain a sub-optimal solution to Prob. III.1. Note that, similar to policy iteration algorithms for ACPS problems, either the gain or the bias strictly decreases each time $\mu$ is updated, so the policy is improved in every iteration. In both cases, the satisfaction constraint is always enforced.

Remark V.6 (Complexity). The complexity of our proposed algorithm is dictated by the sizes of the generated MDPs. We use $|\cdot|$ to denote the cardinality of a set. The size of the DRA ($|Q|$) is, in the worst case, doubly exponential with respect to $|\Sigma|$. However, empirical studies such as [20] have shown that in practice the DRAs for many LTL formulas are much smaller and manageable. The size of the product MDP $P$ is at most $|S| \times |Q|$. The complexity of the algorithm generating AMECs is at most quadratic in the size of $P$ [13]. The complexity of Alg. 1 depends on the size of $C$. The policy evaluation (step 3) requires solving a system of $3 \times |S_C|$ linear equations with $3 \times |S_C|$ unknowns. The optimization steps (steps 4 and 6) each require at most $|U_C| \times |S_C|$ evaluations. Checking the recurrent classes of $\mu$ is linear in $|S_C|$. Therefore, assuming that $|U_C|$ is dominated by $|S_C|^2$ (which is usually the case) and that the number of policies satisfying (35) and (36) for all $i$ is also dominated by $|S_C|^2$, the computational complexity of each iteration is $O(|S_C|^3)$.
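As referenced after Alg. 1, the following is an illustrative Python skeleton of the iteration. The policy evaluation of step 3 (equations (24) and (25)) and the recurrence check on $K_C$ are abstracted behind the callables evaluate and keeps_KC_recurrent, which are hypothetical; the improvement steps implement (35) and (36). Where Alg. 1 may search over all policies consistent with $\bar{U}$ or $\bar{M}$, this sketch uses a simple tie-breaking rule that keeps the current action when possible, so it is a simplification rather than a faithful implementation.

```python
def acpc_policy_iteration(C, evaluate, keeps_KC_recurrent, mu0, max_iter=10_000):
    """Skeleton of Alg. 1 (policy iteration for ACPC); illustrative only.

    C: hypothetical dict with 'SC', 'UC', 'P', 'g', 'SCpi' as in earlier sketches.
    evaluate(mu) -> (J, h): ACPC gain and bias of mu, per (24)-(25) (assumed given).
    keeps_KC_recurrent(mu) -> bool: True if some state of K_C is recurrent under mu.
    mu0: a proper initial policy with K_C in its recurrent classes (step 1).
    """
    def expected(i, u, values, skip=frozenset()):
        # sum_j P(i,u,j) * values[j], optionally skipping states j in a given set
        return sum(C['P'].get((i, u, j), 0.0) * values[j]
                   for j in C['SC'] if j not in skip)

    def argmin_set(options, key, tol=1e-9):
        best = min(key(u) for u in options)
        return [u for u in options if key(u) <= best + tol]

    mu = dict(mu0)
    for _ in range(max_iter):                                   # steps 2-12
        J, h = evaluate(mu)                                     # step 3
        U_bar = {i: argmin_set(C['UC'][i], lambda u, i=i: expected(i, u, J))
                 for i in C['SC']}                              # step 4, (35)
        if all(mu[i] in U_bar[i] for i in C['SC']):             # step 5
            M_bar = {i: argmin_set(U_bar[i],
                                   lambda u, i=i: C['g'][(i, u)] + expected(i, u, h)
                                   + expected(i, u, J, skip=C['SCpi']))
                     for i in C['SC']}                          # step 6, (36)
            new_mu = {i: mu[i] if mu[i] in M_bar[i] else M_bar[i][0] for i in C['SC']}
            if new_mu == mu:
                return mu, "optimal"                            # gain-bias pair satisfies (32)
        else:
            new_mu = {i: mu[i] if mu[i] in U_bar[i] else U_bar[i][0] for i in C['SC']}
        if not keeps_KC_recurrent(new_mu):                      # checks in steps 7 and 9
            return mu, "not optimal"                            # simplified: no further search
        mu = new_mu                                             # step 11
    return mu, "not optimal"
```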
VI. CASE STUDY

The algorithmic framework developed in this paper has been implemented in MATLAB, and here we provide an example as a case study. Consider the MDP $M$ shown in Fig. 4, which can be viewed as the dynamics of a robot navigating in an environment with the set of atomic propositions {pickup, dropoff}. In practice, such an MDP can be obtained via an abstraction process from the environment (see [1]), where the transition probabilities can be obtained from experimental data or accurate simulations.

[Fig. 4. MDP capturing a robot navigating in an environment, with states 0–9. {α, β, γ} is the set of controls at the states. The cost of applying α, β, γ at a state where the control is available is 5, 10, 1, respectively (e.g., g(i, α) = 5 if α ∈ U(i)).]

The goal of the robot is to continuously perform a pickup-delivery task. The robot is required to pick up items at the state marked by pickup (see Fig. 4) and drop them off at the state marked by dropoff. It is then required to go back to pickup, and this process is repeated. This task can be written as the following LTL formula:
\[
\phi = \Box\Diamond\,\mathrm{pickup} \,\wedge\, \Box\big(\mathrm{pickup} \Rightarrow \bigcirc(\neg\mathrm{pickup}\;\mathcal{U}\;\mathrm{dropoff})\big).
\]
The first part of $\phi$, $\Box\Diamond\,\mathrm{pickup}$, enforces that the robot repeatedly picks up items. The remaining part of $\phi$ ensures that new items cannot be picked up until the current items are dropped off. We designate pickup as the optimizing proposition, and the goal is to find a policy that satisfies $\phi$ with probability 1 and minimizes the expected cost in between visits to the pickup state (i.e., we aim to minimize the expected cost in between picking up items).

We generated the DRA $R_\phi$ using the ltl2dstar tool [21]; it has 13 states and one pair $(L, K) \in F$. The product MDP $P$ contains 31 states after unreachable states are removed (without this removal, $P$ has 130 states). There is one AMEC $C$ corresponding to the only pair in $F_P$, and it contains 20 states. We tested Alg. 1 with a number of different initial policies, and in each case it produced the optimal policy within 2 or 3 policy updates (note that $C$ is not unichain). For one initial policy, the ACPC was initially 330 at each state of $C$, and it was reduced to 62.4 at each state when the optimal policy was found. The optimal policy is as follows:

State           0   1   2   3   4   5   6   7   8   9
After pickup    α   β   α   α   α   γ   γ   α   β   α
After dropoff   α   α   α   α   α   α   γ   α   α   α

The first row of the table shows the policy after pick-up but before drop-off, and the second row shows the policy after drop-off and before the next pick-up.

VII. CONCLUSIONS

We have developed a method to automatically generate a control policy for a dynamical system modeled as a Markov Decision Process (MDP), in order to satisfy specifications given as Linear Temporal Logic formulas. The control policy satisfies the given specification almost surely, if such a policy exists. In addition, the policy optimizes the average cost between satisfying instances of an "optimizing proposition," under some conditions. The problem is motivated by robotic applications requiring persistent tasks to be performed, such as environmental monitoring or data gathering.

We are currently pursuing several future directions. First, we aim to solve the problem completely and find an algorithm that is guaranteed to always return the optimal policy. Second, we are interested in applying the average-cost-per-cycle optimization criterion to more complex models such as Partially Observable MDPs (POMDPs) and semi-MDPs.

REFERENCES

[1] M. Lahijanian, J. Wasniewski, S. B. Andersson, and C. Belta, "Motion planning and control from temporal logic specifications with probabilistic satisfaction guarantees," in Proc. ICRA, Anchorage, AK, 2010, pp. 3227–3232.
[2] S. Temizer, M. J. Kochenderfer, L. P. Kaelbling, T. Lozano-Pérez, and J. K. Kuchar, "Collision avoidance for unmanned aircraft using Markov decision processes," in Proc. AIAA GN&C, Toronto, Canada, Aug. 2010.
[3] R. Alterovitz, T. Siméon, and K. Goldberg, "The stochastic motion roadmap: A sampling framework for planning with Markov motion uncertainty," in Proc. RSS, Atlanta, GA, Jun. 2007.
[4] H. Kress-Gazit, G. Fainekos, and G. J. Pappas, "Where's Waldo? Sensor-based temporal logic motion planning," in Proc. ICRA, Rome, Italy, 2007, pp. 3116–3121.
[5] S. Karaman and E. Frazzoli, "Sampling-based motion planning with deterministic µ-calculus specifications," in Proc. CDC, Shanghai, China, 2009, pp. 2222–2229.
[6] S. G. Loizou and K. J. Kyriakopoulos, "Automatic synthesis of multiagent motion tasks based on LTL specifications," in Proc. CDC, Paradise Island, Bahamas, 2004, pp. 153–158.
[7] T. Wongpiromsarn, U. Topcu, and R. M. Murray, "Receding horizon temporal logic planning for dynamical systems," in Proc. CDC, Shanghai, China, 2009, pp. 5997–6004.
[8] E. M. Clarke, D. Peled, and O. Grumberg, Model Checking. MIT Press, 1999.
[9] N. Piterman, A. Pnueli, and Y. Saar, "Synthesis of reactive(1) designs," in International Conference on Verification, Model Checking, and Abstract Interpretation, Charleston, SC, 2006, pp. 364–380.
[10] M. Kloetzer and C. Belta, "A fully automated framework for control of linear systems from temporal logic specifications," IEEE Trans. Automatic Ctrl., vol. 53, no. 1, pp. 287–297, 2008.
[11] M. Kloetzer and C. Belta, "Dealing with non-determinism in symbolic control," in Hybrid Systems: Computation and Control, ser. Lect. Notes Comp. Science, M. Egerstedt and B. Mishra, Eds. Springer Verlag, 2008, pp. 287–300.
[12] L. De Alfaro, "Formal verification of probabilistic systems," Ph.D. dissertation, Stanford University, 1997.
[13] C. Baier, J.-P. Katoen, and K. G. Larsen, Principles of Model Checking. MIT Press, 2008.
[14] M. Vardi, "Probabilistic linear-time model checking: An overview of the automata-theoretic approach," Formal Methods for Real-Time and Probabilistic Systems, pp. 265–276, 1999.
[15] X. C. Ding, S. L. Smith, C. Belta, and D. Rus, "LTL control in uncertain environments with probabilistic satisfaction guarantees," in Proc. IFAC World Congress, Milan, Italy, Aug. 2011, to appear.
[16] C. Courcoubetis and M. Yannakakis, "Markov decision processes and regular events," IEEE Trans. Automatic Ctrl., vol. 43, no. 10, pp. 1399–1418, 1998.
[17] C. Baier, M. Größer, M. Leucker, B. Bollig, and F. Ciesinski, "Controller synthesis for probabilistic systems," in Proceedings of IFIP TCS'2004. Kluwer, 2004.
[18] S. L. Smith, J. Tůmová, C. Belta, and D. Rus, "Optimal path planning under temporal constraints," in Proc. IROS, Taipei, Taiwan, Oct. 2010, pp. 3288–3293.
[19] E. Grädel, W. Thomas, and T. Wilke, Automata, Logics, and Infinite Games: A Guide to Current Research, ser. Lect. Notes Comp. Science. Springer Verlag, 2002, vol. 2500.
[20] J. Klein and C. Baier, "Experiments with deterministic ω-automata for formulas of linear temporal logic," Theoretical Computer Science, vol. 363, no. 2, pp. 182–195, 2006.
[21] J. Klein, "ltl2dstar - LTL to deterministic Streett and Rabin automata," http://www.ltl2dstar.de/, 2007, viewed September 2010.
[22] D. Bertsekas, Dynamic Programming and Optimal Control, vol. II. Athena Scientific, 2007.
[23] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
[24] L. Hogben, Handbook of Linear Algebra. CRC Press, 2007.