Optimal Receding Horizon Control for Finite Deterministic Systems with Temporal Logic Constraints

Mária Svoreňová, Ivana Černá and Calin Belta

(M. Svoreňová and I. Černá are with the Faculty of Informatics, Masaryk University, Brno, Czech Republic, svorenova@mail.muni.cz, cerna@muni.cz. C. Belta is with the Department of Mechanical Engineering and the Division of Systems Engineering, Boston University, Boston, MA, USA, cbelta@bu.edu. This work was partially supported at MU by grants GAP202/11/0312 and LH11065, and at BU by ONR grants MURI N00014-09-1051 and MURI N00014-10-10952 and by NSF grant CNS-1035588.)

Abstract— In this paper, we develop a provably correct optimal control strategy for a finite deterministic transition system. Assuming that penalties with known probabilities of occurrence and known dynamics can be sensed locally at the states of the system, we derive a receding horizon strategy that minimizes the expected average cumulative penalty incurred between two consecutive satisfactions of a desired property. At the same time, we guarantee the satisfaction of correctness specifications expressed as Linear Temporal Logic formulas. We illustrate the approach with a persistent surveillance robotics application.

I. INTRODUCTION

Temporal logics, such as Computation Tree Logic (CTL) and Linear Temporal Logic (LTL), have been customarily used to specify the correctness of computer programs and digital circuits modeled as finite-state transition systems [1]. The problem of analyzing such a model against a temporal logic formula, known as formal analysis or model checking, has received a lot of attention during the past thirty years, and several efficient algorithms and software tools are available [2], [3], [4]. The formal synthesis problem, in which the goal is to design or control a system from a temporal logic specification, had not been studied extensively until a few years ago. Recent results include the use of model checking algorithms for controlling deterministic systems [5], automata games for controlling non-deterministic systems [6], and linear programming and value iteration for the synthesis of control policies for Markov decision processes [1], [7]. Through the use of abstractions, such techniques have also been applied to infinite systems, such as continuous- and discrete-time linear systems [8], [9], [10], [11], [12].

The connection between optimal and temporal logic control is an intriguing problem with a potentially high impact in several applications. By combining these two seemingly unrelated areas, our goal is to optimize the behavior of a system subject to correctness constraints. Consider, for example, a mobile robot involved in a persistent surveillance mission in a dangerous area and under tight fuel/time constraints. The correctness requirement is expressed as a temporal logic specification, e.g., “Alternately keep visiting A and B while always avoiding C”, while the resource constraints translate to minimizing a cost function over the feasible trajectories of the robot. While optimal control is a mature discipline and formal synthesis is fairly well understood, optimal formal synthesis is a largely open area.

In this paper, we focus on finite labeled transition systems and correctness specifications given as formulas of LTL. We assume there is a penalty associated with the states of the system, with known occurrence probability and time behavior.
Motivated by persistent surveillance robotic missions, our goal is to minimize the expected average cumulative penalty incurred between two consecutive satisfactions of a desired property associated with some states of the system, while at the same time satisfying an additional temporal logic constraint. Also from robotics comes our assumption that actual penalty values can only be sensed locally, in close proximity to the current state, during the execution of the system.

We propose two algorithms for this problem. The first operates offline, i.e., without executions of the system, and therefore only uses the known probabilities and does not exploit the actual penalties sensed during the execution. The second algorithm designs an online strategy by locally improving the offline strategy based on local sensing and simulation over a user-defined planning horizon. While both algorithms guarantee the optimal expected average penalty collection, in real executions of the system the second algorithm provides a lower actual average than the first. We illustrate these results on a robotic persistent surveillance case study.

This paper is closely related to [13], [14], [5], which also focus on optimal control for finite transition systems with temporal logic constraints. In [5], the authors developed an offline control strategy minimizing the maximum cost between two consecutive visits to a given set of states, subject to constraints expressed as LTL formulas. Time-varying, locally sensed rewards were introduced in [13], where a receding horizon control strategy maximizing locally collected rewards was shown to satisfy an LTL specification. This approach was generalized in [14] to allow for a broader class of optimization objectives and reward models. In contrast with [13], [14], we interpret the dynamic values appearing in the states of the system as penalties instead of rewards, i.e., in our case the cost function is minimized rather than maximized. This allows an optimum of the expected average penalty collection to exist, and we show how it can be achieved using an automata-based approach and results from game theory.

In Sec. II, we introduce the notation and definitions necessary throughout the paper. The problem is stated in Sec. III. The main results of the paper are in Sec. IV and Sec. V. The simulation results are presented in Sec. VI.

II. PRELIMINARIES

For a set S, we use $S^\omega$ and $S^+$ to denote the set of all infinite and all non-empty finite sequences of elements of S, respectively. For a finite or infinite sequence $\alpha = a_0 a_1 \ldots$, we use $\alpha(i) = a_i$ to denote the $i$-th element and $\alpha^{(i)} = a_0 \ldots a_i$ for the finite prefix of $\alpha$ of length $|\alpha^{(i)}| = i + 1$.

Definition 1: A weighted deterministic transition system (TS) is a tuple $\mathcal{T} = (S, T, AP, L, w)$, where $S$ is a non-empty finite set of states, $T \subseteq S \times S$ is a transition relation, $AP$ is a finite set of atomic propositions, $L\colon S \to 2^{AP}$ is a labeling function and $w\colon T \to \mathbb{N}$ is a weight function. We assume that for every $s \in S$ there exists $s' \in S$ such that $(s, s') \in T$. An initialized transition system is a TS $\mathcal{T} = (S, T, AP, L, w)$ with a distinguished initial state $s_{init} \in S$.

A run of a TS $\mathcal{T}$ is an infinite sequence $\rho = s_0 s_1 \ldots \in S^\omega$ such that for every $i \geq 0$ it holds that $(s_i, s_{i+1}) \in T$. We use $\inf(\rho)$ to denote the set of all states visited infinitely many times in the run $\rho$ and $Run^{\mathcal{T}}(s)$ for the set of all runs of $\mathcal{T}$ that start in $s \in S$. Let $Run^{\mathcal{T}} = \bigcup_{s \in S} Run^{\mathcal{T}}(s)$.
A finite run $\sigma = s_0 \ldots s_n$ of $\mathcal{T}$ is a finite prefix of a run of $\mathcal{T}$, and $Run^{\mathcal{T}}_{fin}(s)$ denotes the set of all finite runs of $\mathcal{T}$ that start in $s \in S$. Let $Run^{\mathcal{T}}_{fin} = \bigcup_{s \in S} Run^{\mathcal{T}}_{fin}(s)$. The length $|\sigma|$, or number of stages, of a finite run $\sigma = s_0 \ldots s_n$ is $n + 1$, and $last(\sigma) = s_n$ denotes the last state of $\sigma$. With a slight abuse of notation, we use $w(\sigma)$ to denote the weight of a finite run $\sigma = s_0 \ldots s_n$, i.e., $w(\sigma) = \sum_{i=0}^{n-1} w((s_i, s_{i+1}))$. Moreover, $w^*(s, s')$ denotes the minimum weight of a finite run from $s$ to $s'$. Specifically, $w^*(s, s) = 0$ for every $s \in S$, and if there does not exist a run from $s$ to $s'$, then $w^*(s, s') = \infty$. For a set $S' \subseteq S$ we let $w^*(s, S') = \min_{s' \in S'} w^*(s, s')$. We say that a state $s'$, resp. a set $S'$, is reachable from $s$ iff $w^*(s, s') \neq \infty$, resp. $w^*(s, S') \neq \infty$.

Every run $\rho = s_0 s_1 \ldots \in Run^{\mathcal{T}}$, resp. finite run $\sigma = s_0 \ldots s_n \in Run^{\mathcal{T}}_{fin}$, induces a word $z = L(s_0)L(s_1)\ldots \in (2^{AP})^\omega$, resp. $z = L(s_0)\ldots L(s_n) \in (2^{AP})^+$, over the power set of $AP$. A cycle of the TS $\mathcal{T}$ is a finite run $cyc = c_0 \ldots c_m$ of $\mathcal{T}$ for which it holds that $(c_m, c_0) \in T$.

Definition 2: A sub-system of $\mathcal{T} = (S, T, AP, L, w)$ is a TS $\mathcal{U} = (S_U, T_U, AP, L|_U, w|_U)$, where $S_U \subseteq S$ and $T_U \subseteq T \cap (S_U \times S_U)$. We use $L|_U$ to denote the labeling function $L$ restricted to the set $S_U$, and similarly $w|_U$. If the context is clear, we write $L, w$ instead of $L|_U, w|_U$. A sub-system $\mathcal{U}$ of $\mathcal{T}$ is called strongly connected if for every pair of states $s, s' \in S_U$ there exists a finite run $\sigma \in Run^{\mathcal{U}}_{fin}(s)$ such that $last(\sigma) = s'$. A strongly connected component (SCC) of $\mathcal{T}$ is a maximal strongly connected sub-system of $\mathcal{T}$. We use $SCC(\mathcal{T})$ to denote the set of all strongly connected components of $\mathcal{T}$. Strongly connected components of a TS $\mathcal{T}$ are pairwise disjoint. Hence, the cardinality of the set $SCC(\mathcal{T})$ is bounded by the number of states of $\mathcal{T}$, and the set can be computed using Tarjan's algorithm [15].

Definition 3: Let $\mathcal{T} = (S, T, AP, L, w)$ be a TS. A control strategy for $\mathcal{T}$ is a function $C\colon Run^{\mathcal{T}}_{fin} \to S$ such that for every $\sigma \in Run^{\mathcal{T}}_{fin}$ it holds that $(last(\sigma), C(\sigma)) \in T$.

A strategy $C$ for which $C(\sigma_1) = C(\sigma_2)$ for all finite runs $\sigma_1, \sigma_2 \in Run^{\mathcal{T}}_{fin}$ with $last(\sigma_1) = last(\sigma_2)$ is called memoryless. In that case, $C$ can be viewed as a function $C\colon S \to S$. A strategy is called finite-memory if it is defined as a tuple $C = (M, next, \Delta, start)$, where $M$ is a finite set of modes, $\Delta\colon M \times S \to M$ is a transition function, $next\colon M \times S \to S$ selects a state of $\mathcal{T}$ to be visited next, and $start\colon S \to M$ selects the starting mode for every $s \in S$.

A run induced by a strategy $C$ for $\mathcal{T}$ is a run $\rho_C = s_0 s_1 \ldots \in Run^{\mathcal{T}}$ for which $s_{i+1} = C(\rho_C^{(i)})$ for every $i \geq 0$. For every $s \in S$, there is exactly one run induced by $C$ that starts in $s$. A finite run induced by $C$ is a finite run $\sigma_C \in Run^{\mathcal{T}}_{fin}$ that is a finite prefix of a run $\rho_C$ induced by $C$. Let $C$ be a strategy, finite-memory or not, for a TS $\mathcal{T}$. For every state $s \in S$, the run $\rho_C \in Run^{\mathcal{T}}(s)$ induced by $C$ satisfies $\inf(\rho_C) \subseteq S_U$ for some $\mathcal{U} \in SCC(\mathcal{T})$ [1]. We say that $C$ leads $\mathcal{T}$ from the state $s$ to the SCC $\mathcal{U}$.

Definition 4: Linear Temporal Logic (LTL) formulas over the set $AP$ are formed according to the following grammar:

$$\phi ::= true \mid a \mid \neg\phi \mid \phi \vee \phi \mid \phi \wedge \phi \mid X\,\phi \mid \phi\,U\,\phi \mid G\,\phi \mid F\,\phi,$$

where $a \in AP$ is an atomic proposition, $\neg$, $\vee$ and $\wedge$ are standard Boolean connectives, and $X$ (next), $U$ (until), $G$ (always) and $F$ (eventually) are temporal operators.

The semantics of LTL is defined over words over $2^{AP}$, such as those generated by the runs of a TS $\mathcal{T}$ (for details see, e.g., [1]). For example, a word $w \in (2^{AP})^\omega$ satisfies $G\,\phi$ and $F\,\phi$ if $\phi$ holds in $w$ always and eventually, respectively.
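To make the definitions above concrete, the following is a minimal Python sketch of a weighted TS together with the all-pairs minimum run weights $w^*(s, s')$, computed here with the Floyd-Warshall algorithm (the same choice mentioned in the complexity discussion of Sec. IV). The class and function names are ours for illustration only; they are not part of the paper or its implementation [25].

```python
from itertools import product

class TransitionSystem:
    """Minimal weighted deterministic TS T = (S, T, AP, L, w) (Definition 1)."""
    def __init__(self, states, transitions, labels, weights):
        self.states = set(states)            # S
        self.transitions = set(transitions)  # T, a set of pairs (s, s')
        self.labels = labels                 # L: state -> set of atomic propositions
        self.weights = weights               # w: transition -> positive integer

    def min_weights(self):
        """All-pairs minimum run weights w*(s, s') via Floyd-Warshall."""
        INF = float("inf")
        d = {(s, t): 0 if s == t else INF for s, t in product(self.states, repeat=2)}
        for (s, t) in self.transitions:
            d[(s, t)] = min(d[(s, t)], self.weights[(s, t)])
        for k in self.states:
            for i in self.states:
                for j in self.states:
                    if d[(i, k)] + d[(k, j)] < d[(i, j)]:
                        d[(i, j)] = d[(i, k)] + d[(k, j)]
        return d  # d[(s, s')] = w*(s, s'); infinity if s' is not reachable from s

# Toy example: a 3-state TS with a surveillance proposition 'pi_sur'.
ts = TransitionSystem(
    states={"s0", "s1", "s2"},
    transitions={("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("s1", "s1")},
    labels={"s0": set(), "s1": {"pi_sur"}, "s2": set()},
    weights={("s0", "s1"): 2, ("s1", "s2"): 3, ("s2", "s0"): 2, ("s1", "s1"): 1},
)
w_star = ts.min_weights()
print(w_star[("s0", "s2")])  # minimum weight of a finite run from s0 to s2, here 5
```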
If the word induced by a run of $\mathcal{T}$ satisfies a formula $\phi$, we say that the run satisfies $\phi$. We call $\phi$ satisfiable in $\mathcal{T}$ from $s \in S$ if there exists a run $\rho \in Run^{\mathcal{T}}(s)$ that satisfies $\phi$.

Given an initialized TS $\mathcal{T}$ and an LTL formula $\phi$ over $AP$, the formal synthesis problem aims to find a strategy $C$ for $\mathcal{T}$ such that the run $\rho_C \in Run^{\mathcal{T}}(s_{init})$ induced by $C$ satisfies $\phi$. In that case we also say that the strategy $C$ satisfies $\phi$. The formal synthesis problem can be solved using principles from model checking methods [1]. Specifically, $\phi$ is translated to a Büchi automaton, and the system combining the Büchi automaton and the TS $\mathcal{T}$ is analyzed.

Definition 5: A Büchi automaton (BA) is a tuple $\mathcal{B} = (Q, 2^{AP}, \delta, q_0, F)$, where $Q$ is a non-empty finite set of states, $2^{AP}$ is the alphabet, $\delta \subseteq Q \times 2^{AP} \times Q$ is a transition relation such that for every $q \in Q$ and $a \in 2^{AP}$ there exists $q' \in Q$ with $(q, a, q') \in \delta$, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is a set of accepting states.

A run of $\mathcal{B}$ is an infinite sequence $q_0 q_1 \ldots \in Q^\omega$ such that for every $i \geq 0$ there exists $a_i \in 2^{AP}$ with $(q_i, a_i, q_{i+1}) \in \delta$. The word $a_0 a_1 \ldots \in (2^{AP})^\omega$ is called the word induced by the run $q_0 q_1 \ldots$. A run $q_0 q_1 \ldots$ of $\mathcal{B}$ is accepting if there exist infinitely many $i \geq 0$ such that $q_i$ is an accepting state. For every LTL formula $\phi$ over $AP$, one can construct a Büchi automaton $\mathcal{B}_\phi$ whose accepting runs induce all and only the words over $2^{AP}$ satisfying $\phi$ [16]. We refer the reader to [17], [18] for translation algorithms and to online implementations such as [19].

Definition 6: Let $\mathcal{T} = (S, T, AP, L, w)$ be an initialized TS and $\mathcal{B} = (Q, 2^{AP}, \delta, q_0, F)$ a Büchi automaton. The product of $\mathcal{T}$ and $\mathcal{B}$ is a tuple $\mathcal{P} = (S_P, T_P, s_{Pinit}, AP, L_P, F_P, w_P)$, where $S_P = S \times Q$; $T_P \subseteq S_P \times S_P$ is a transition relation such that for every $(s,q), (s',q') \in S_P$ it holds that $((s,q),(s',q')) \in T_P$ if and only if $(s,s') \in T$ and $(q, L(s), q') \in \delta$; $s_{Pinit} = (s_{init}, q_0)$ is the initial state; $L_P((s,q)) = L(s)$ is a labeling function; $F_P = S \times F$ is a set of accepting states; and $w_P(((s,q),(s',q'))) = w((s,s'))$ is a weight function.

The product $\mathcal{P}$ can be viewed as an initialized TS with a set of accepting states. Therefore, we adopt the definitions of a run $\rho$, a finite run $\sigma$, its weight $w_P(\sigma)$, and the sets $Run^{\mathcal{P}}((s,q))$, $Run^{\mathcal{P}}$, $Run^{\mathcal{P}}_{fin}((s,q))$ and $Run^{\mathcal{P}}_{fin}$ from above. Similarly, a cycle $cyc$ of $\mathcal{P}$, a strategy $C_P$ for $\mathcal{P}$ and the runs $\rho_{C_P}, \sigma_{C_P}$ induced by $C_P$ are defined in the same way as for a TS. On the other hand, $\mathcal{P}$ can be viewed as a weighted BA over a trivial alphabet with a labeling function, which gives us the definition of an accepting run of $\mathcal{P}$.

Using the projection to the first component, every run $(s_0,q_0)(s_1,q_1)\ldots$ and finite run $(s_0,q_0)\ldots(s_n,q_n)$ of $\mathcal{P}$ corresponds to a run $s_0 s_1 \ldots$ and a finite run $s_0 \ldots s_n$ of $\mathcal{T}$, respectively. Vice versa, for every run $s_0 s_1 \ldots$ and finite run $s_0 \ldots s_n$ of $\mathcal{T}$, there exists a run $(s_0,q_0)(s_1,q_1)\ldots$ and a finite run $(s_0,q_0)\ldots(s_n,q_n)$ of $\mathcal{P}$. Similarly, every strategy for $\mathcal{P}$ projects to a strategy for $\mathcal{T}$, and for every strategy for $\mathcal{T}$ there exists a strategy for $\mathcal{P}$ that projects to it. The projection of a finite-memory strategy for $\mathcal{P}$ is also finite-memory. Since $\mathcal{P}$ can be viewed as a TS, we also adopt the definitions of a sub-system and a strongly connected component.
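As an illustration of Definition 6, the sketch below builds the product of the TransitionSystem from the earlier sketch and a BA given as explicit sets. The representation of $\delta$ as triples $(q, \text{symbol}, q')$ with symbols encoded as frozensets of atomic propositions, as well as all identifiers, are our own assumptions and not the authors' data structures.

```python
def build_product(ts, ba_states, ba_delta, ba_q0, ba_accepting, s_init):
    """Illustrative construction of the product P of a TS and a BA (Definition 6).
    ba_delta is assumed to be a set of triples (q, symbol, q'), where symbol is a
    frozenset of atomic propositions."""
    S_P = {(s, q) for s in ts.states for q in ba_states}
    T_P, w_P = set(), {}
    for (s, q) in S_P:
        for (s1, s2) in ts.transitions:
            if s1 != s:
                continue
            for q2 in ba_states:
                # ((s,q),(s',q')) is a product transition iff (s,s') in T
                # and (q, L(s), q') in delta
                if (q, frozenset(ts.labels[s]), q2) in ba_delta:
                    edge = ((s, q), (s2, q2))
                    T_P.add(edge)
                    w_P[edge] = ts.weights[(s1, s2)]   # weights are inherited from T
    s_P_init = (s_init, ba_q0)
    F_P = {(s, q) for (s, q) in S_P if q in ba_accepting}
    L_P = {(s, q): ts.labels[s] for (s, q) in S_P}
    return S_P, T_P, s_P_init, L_P, F_P, w_P
```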
Definition 7: Let $\mathcal{P} = (S_P, T_P, s_{Pinit}, AP, L_P, F_P, w_P)$ be the product of an initialized TS $\mathcal{T}$ and a BA $\mathcal{B}$. An accepting strongly connected component (ASCC) of $\mathcal{P}$ is an SCC $\mathcal{U} = (S_U, T_U, AP, L_P, w_P)$ such that the set $S_U \cap F_P$ is non-empty; we refer to this set as the set $F_U$ of accepting states of $\mathcal{U}$. We use $ASCC(\mathcal{P})$ to denote the set of all ASCCs of $\mathcal{P}$ that are reachable from the initial state $s_{Pinit}$. In our work, we always assume that $ASCC(\mathcal{P})$ is non-empty, i.e., that the given LTL formula is satisfiable in the TS.

III. PROBLEM FORMULATION

Consider an initialized weighted transition system $\mathcal{T} = (S, T, AP, L, w)$. The weight $w((s,s'))$ of a transition $(s,s') \in T$ represents the amount of time that the transition takes, and the system starts at time 0. We use $t_n$ to denote the point in time after the $n$-th transition of a run, i.e., initially the system is in a state $s_0$ at time $t_0 = 0$, and after a finite run $\sigma \in Run^{\mathcal{T}}_{fin}(s_0)$ of length $n+1$ the time is $t_n = w(\sigma)$.

We assume there is a dynamic penalty associated with every state $s \in S$. In this paper, we address the following model of penalties. Nevertheless, as we discuss in Sec. V, the algorithms presented in the next section provide an optimal solution for a much broader class of penalty dynamics. The penalty is a rational number between 0 and 1 that increases by $\frac{1}{r}$ every time unit, where $r \in \mathbb{N}$ is a given rate. Whenever the penalty is 1, in the next time unit the penalty either remains 1 or drops to 0, according to a given probability distribution. Upon a visit of a state, the corresponding penalty is incurred. We assume that the visit of the state does not affect the penalty's value or dynamics.

Formally, the penalties are defined by a rate $r \in \mathbb{N}$ and a penalty probability function $p\colon S \to (0, 1]$, where $p(s)$ is the probability that if the penalty in a state $s$ is 1, then in the next time unit the penalty remains 1, and $1 - p(s)$ is the probability of the penalty dropping to 0. The penalties are described by a function $g\colon S \times \mathbb{N}_0 \to \{\frac{i}{r} \mid i \in \{0, 1, \ldots, r\}\}$ such that $g(s, t)$ is the penalty in a state $s \in S$ at time $t \in \mathbb{N}_0$. For $s \in S$, $g(s, 0)$ is a uniformly distributed random variable with values in the set $\{\frac{i}{r} \mid i \in \{0, 1, \ldots, r\}\}$, and for $t \geq 1$

$$g(s, t) = \begin{cases} g(s, t-1) + \frac{1}{r} & \text{if } g(s, t-1) < 1, \\ x & \text{otherwise}, \end{cases} \qquad (1)$$

where $x$ is a random variable such that $x = 1$ with probability $p(s)$ and $x = 0$ otherwise. We use

$$g_{exp}(s) = (1 - p(s)) \cdot \tfrac{1}{2} + p(s) \cdot 1 = \tfrac{1}{2}(1 + p(s)) \qquad (2)$$

to denote the expected value of the penalty in a state $s \in S$. Note that $\frac{1}{2} \leq g_{exp}(s) \leq 1$ for every $s \in S$.

In our setting, the penalties are sensed only locally, in the states in close proximity to the current state. To be specific, we assume a visibility range $v \in \mathbb{N}$ is given. If the system is in a state $s \in S$ at time $t$, the penalty $g(s', t)$ of a state $s' \in S$ is observable if and only if $s' \in Vis(s) = \{s' \in S \mid w^*(s, s') \leq v\}$. The set $Vis(s)$ is also called the set of states visible from $s$.

The problem we consider in this paper combines the formal synthesis problem with long-term optimization of the expected amount of penalties incurred during the system's execution. We assume that the specification is given as an LTL formula $\phi$ of the form

$$\phi = \varphi \wedge GF\,\pi_{sur}, \qquad (3)$$

where $\varphi$ is an LTL formula over $AP$ and $\pi_{sur} \in AP$. This formula requires that the system satisfy $\varphi$ and survey the states satisfying the property $\pi_{sur}$ infinitely often. We say that every visit of a state from the set $S_{sur} = \{s \in S \mid \pi_{sur} \in L(s)\}$ completes a surveillance cycle. Specifically, starting from the initial state, the first visit of $S_{sur}$ after the initial state completes the first surveillance cycle of a run.
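The penalty model of Eqs. (1) and (2) can be made concrete with the following small Python sketch: simulate_penalty samples one trajectory of $g(s, \cdot)$ and g_exp evaluates Eq. (2). Both function names are illustrative and not taken from the paper or its implementation [25].

```python
import random

def simulate_penalty(p_s, r, horizon, seed=None):
    """Sample a trajectory g(s,0), g(s,1), ..., g(s,horizon) for a single state s,
    following Eq. (1): the penalty grows by 1/r per time unit until it reaches 1;
    once at 1 it stays there with probability p(s), otherwise it drops to 0."""
    rng = random.Random(seed)
    i = rng.randint(0, r)                 # g(s, 0) is uniform over {0, 1/r, ..., 1}
    trajectory = [i / r]
    for _ in range(horizon):
        if i < r:
            i += 1                        # ramp up by 1/r
        else:
            i = r if rng.random() < p_s else 0
        trajectory.append(i / r)
    return trajectory

def g_exp(p_s):
    """Expected penalty value used by the offline algorithm, Eq. (2)."""
    return 0.5 * (1 + p_s)

print(simulate_penalty(p_s=0.6, r=5, horizon=12, seed=1))
print(g_exp(0.6))   # 0.8
```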
Note that a surveillance cycle is not a cycle in the sense of the definition of a cycle of a TS in Sec. II. For a finite run $\sigma$ such that $last(\sigma) \in S_{sur}$, $\sharp(\sigma)$ denotes the number of complete surveillance cycles in $\sigma$; otherwise, $\sharp(\sigma)$ is the number of complete surveillance cycles plus one. We define a function $V_{\mathcal{T},C}\colon S \to \mathbb{R}^+_0$ such that $V_{\mathcal{T},C}(s)$ is the expected average cumulative penalty per surveillance cycle (APPC) incurred under a strategy $C$ for $\mathcal{T}$ starting from a state $s \in S$:

$$V_{\mathcal{T},C}(s) = \limsup_{n\to\infty} \mathbb{E}\left(\frac{\sum_{i=0}^{n} g\big(\rho_C(i), w(\rho_C^{(i)})\big)}{\sharp\big(\rho_C^{(n)}\big)}\right), \qquad (4)$$

where $\rho_C \in Run^{\mathcal{T}}(s)$ is the run induced by $C$ starting from $s$ and $\mathbb{E}(\cdot)$ denotes the expected value.

In this paper, we consider the following problem:

Problem 1: Let $\mathcal{T} = (S, T, AP, L, w)$ be an initialized TS, with penalties defined by a rate $r \in \mathbb{N}$ and penalty probabilities $p\colon S \to (0, 1]$. Let $v \in \mathbb{N}$ be a visibility range and $\phi$ an LTL formula over the set $AP$ of the form in Eq. (3). Find a strategy $C$ for $\mathcal{T}$ such that $C$ satisfies $\phi$ and, among all strategies satisfying $\phi$, $C$ minimizes the APPC value $V_{\mathcal{T},C}(s_{init})$ defined in Eq. (4).

In the next section, we propose two algorithms solving the above problem. The first algorithm operates offline, without the deployment of the system, and therefore without taking advantage of the local sensing of penalties. The second algorithm computes the strategy in real time, by locally improving the offline strategy according to the penalties observed from the current state and their simulation over the next $h$ time units, where $h \geq 1$ is a natural number, a user-defined planning horizon.

IV. SOLUTION

The two algorithms work with the product $\mathcal{P} = (S_P, T_P, s_{Pinit}, AP, L_P, F_P, w_P)$ of the initialized TS $\mathcal{T}$ and a Büchi automaton $\mathcal{B}_\phi$ for the LTL formula $\phi$. To project the penalties from $\mathcal{T}$ to $\mathcal{P}$, we define the penalty in a state $(s,q) \in S_P$ at time $t$ as $g((s,q), t) = g(s, t)$. We also adopt the visibility range $v$, and the set $Vis((s,q))$ of all states visible from $(s,q)$ is defined as for a state of $\mathcal{T}$. The APPC function $V_{\mathcal{P},C_P}$ of a strategy $C_P$ for $\mathcal{P}$ is then defined according to Eq. (4).

We use the correspondence between the strategies for $\mathcal{P}$ and $\mathcal{T}$ to find a strategy for $\mathcal{T}$ that solves Problem 1. Let $C_P$ be a strategy for $\mathcal{P}$ such that the run induced by $C_P$ visits the set $F_P$ infinitely many times and, at the same time, the APPC value $V_{\mathcal{P},C_P}(s_{Pinit})$ is minimal among all strategies that visit $F_P$ infinitely many times. It is easy to see that $C_P$ projects to a strategy $C$ for $\mathcal{T}$ that solves Problem 1 and that $V_{\mathcal{T},C}(s_{init}) = V_{\mathcal{P},C_P}(s_{Pinit})$.

The offline algorithm leverages ideas from formal methods. Using the automata-based approach to model checking, one can construct a strategy $C^\phi_P$ for $\mathcal{P}$ that visits at least one of the accepting states infinitely many times. On the other hand, using graph theory, we can design a strategy $C^V_P$ that achieves the minimum APPC value among all strategies of $\mathcal{P}$ that do not cause an immediate, unrepairable violation of $\phi$, i.e., such that $\phi$ is satisfiable from every state of the run induced by $C^V_P$. However, we would like to have a strategy $C_P$ satisfying both properties at the same time. To achieve that, we employ a technique from game theory presented in [20]. Intuitively, we combine the two strategies $C^\phi_P$ and $C^V_P$ to create a new strategy $C_P$. The strategy $C_P$ is played in rounds, where each round consists of two phases. In the first phase, we play the strategy $C^\phi_P$ until an accepting state is reached; we say that the system works to achieve the mission subgoal. The second phase applies the strategy $C^V_P$.
The aim in this phase is to maintain the expected average cumulative penalty per surveillance cycle in the current round, and we refer to it as the average subgoal. The number of steps for which we apply $C^V_P$ is computed individually every time we enter the second phase of a round.

The online algorithm constructs a strategy $\overline{C}_P$ by locally improving the strategy $C_P$ computed by the offline algorithm. Intuitively, we compare applying $C_P$ for several steps to reach a specific state or set of states of $\mathcal{P}$, to executing different local paths to reach the same state or set. We consider a finite set of finite runs leading to the state, or set, containing the finite run induced by $C_P$, choose the one that is expected to minimize the average cumulative penalty per surveillance cycle incurred in the current round, and apply the first transition of the chosen run. The process continues until the state, or set, is reached, and then it starts over again.

A. Probability measure

Let $C_P$ be a strategy for $\mathcal{P}$ and $(s,q) \in S_P$ a state of $\mathcal{P}$. For a finite run $\sigma_{C_P} \in Run^{\mathcal{P}}_{fin}((s,q))$ induced by the strategy $C_P$ starting from the state $(s,q)$ and a sequence $\tau \in (\{\frac{i}{r} \mid 0 \leq i \leq r\})^+$ of length $|\sigma_{C_P}|$, we call $(\sigma_{C_P}, \tau)$ a finite pair. Analogously, an infinite pair $(\rho_{C_P}, \kappa)$ consists of the run $\rho_{C_P} \in Run^{\mathcal{P}}((s,q))$ induced by the strategy $C_P$ and an infinite sequence $\kappa \in (\{\frac{i}{r} \mid 0 \leq i \leq r\})^\omega$. A cylinder set $Cyl((\sigma_{C_P}, \tau))$ of a finite pair $(\sigma_{C_P}, \tau)$ is the set of all infinite pairs $(\rho_{C_P}, \kappa)$ such that $\tau$ is a prefix of $\kappa$.

Consider the $\sigma$-algebra generated by the set of cylinder sets of all finite pairs $(\sigma_{C_P}, \tau)$, where $\sigma_{C_P} \in Run^{\mathcal{P}}_{fin}((s,q))$ is a finite run induced by the strategy $C_P$ starting from the state $(s,q)$ and $\tau \in (\{\frac{i}{r} \mid 0 \leq i \leq r\})^+$ is of length $|\sigma_{C_P}|$. From classical concepts in probability theory [21], there exists a unique probability measure $Pr^{\mathcal{P},C_P}_{(s,q)}$ on this $\sigma$-algebra such that for a finite pair $(\sigma_{C_P}, \tau)$, $Pr^{\mathcal{P},C_P}_{(s,q)}(Cyl((\sigma_{C_P}, \tau)))$ is the probability that the penalties incurred in the first $|\sigma_{C_P}| + 1$ stages when applying the strategy $C_P$ in $\mathcal{P}$ from the state $(s,q)$ are given by the sequence $\tau$, i.e., $g(\sigma_{C_P}(i), w_P(\sigma_{C_P}^{(i)})) = \tau(i)$ for every $0 \leq i \leq |\sigma_{C_P}|$. This probability is determined by the penalty dynamics and can therefore be computed from the rate $r$ and the penalty probability function $p$. For a set $X$ of infinite pairs that is an element of the above $\sigma$-algebra, the probability $Pr^{\mathcal{P},C_P}_{(s,q)}(X)$ is the probability that, under $C_P$ starting from $(s,q)$, the infinite sequence of penalties incurred in the visited states is a $\kappa$ such that $(\rho_{C_P}, \kappa) \in X$.

B. Offline control

In this section, we construct a strategy $C_P$ for $\mathcal{P}$ that projects to a strategy $C$ for $\mathcal{T}$ solving Problem 1. The strategy $C_P$ has to visit $F_P$ infinitely many times and therefore must lead from $s_{Pinit}$ to an ASCC. For an ASCC $\mathcal{U} \in ASCC(\mathcal{P})$, we denote by $V^*_U((s,q))$ the minimum expected average cumulative penalty per surveillance cycle that can be achieved in $\mathcal{U}$ starting from $(s,q) \in S_U$. Since $\mathcal{U}$ is strongly connected, this value is the same for all states in $S_U$ and is referred to as $V^*_U$. It is associated with a cycle $cyc^V_U = c_0 \ldots c_m$ of $\mathcal{U}$ witnessing the value, i.e.,

$$\frac{1}{|cyc^V_U \cap S_{Usur}|} \sum_{i=0}^{m} g_{exp}(c_i) = V^*_U,$$

where $S_{Usur}$ is the set of all states of $\mathcal{U}$ labeled with $\pi_{sur}$. Since $\mathcal{U}$ is an ASCC, it holds that $S_{Usur} \neq \emptyset$.

We design an algorithm that finds the value $V^*_U$ and a cycle $cyc^V_U$ for an ASCC $\mathcal{U}$. The algorithm first reduces $\mathcal{U}$ to a TS $\overline{\mathcal{U}}$ and then applies Karp's algorithm [22] to $\overline{\mathcal{U}}$.
Fig. 1: An example of the elimination of a state during the reduction of an ASCC $\mathcal{U}$. The finite run run8 is equal to the one of the finite runs run1 and run2.run4 that minimizes the sum of expected penalties in the states of the run. Similarly, run9 is one of the finite runs run7 and run5.run6.

Karp's algorithm finds, for a directed graph with values on edges, a cycle with minimum value per edge, also called the minimum mean cycle. The value $V^*_U$ and the cycle $cyc^V_U$ are synthesized from the minimum mean cycle.

Let $\mathcal{U} = (S_U, T_U, AP, L_P, w_P)$ be an ASCC of $\mathcal{P}$. For simplicity, we use singletons such as $u, u_i$ to denote the states of $\mathcal{P}$ in this paragraph. We construct a TS $\overline{\mathcal{U}} = (S_{Usur}, \overline{T}_U, AP, L_P, \overline{w}_U)$ and a function $run\colon \overline{T}_U \to Run^{\mathcal{U}}_{fin}$ such that $(u, u') \in \overline{T}_U$ if and only if there exists a finite run in $\mathcal{U}$ from $u \in S_{Usur}$ to $u' \in S_{Usur}$ with one surveillance cycle, i.e., between $u$ and $u'$ no state labeled with $\pi_{sur}$ is visited. Moreover, the run $run((u, u')) = u_0 \ldots u_n$ is such that $u = u_0$ and $\sigma = u_0 \ldots u_n u'$ is the finite run in $\mathcal{U}$ from $u$ to $u'$ with one surveillance cycle that minimizes the expected sum of penalties incurred during $\sigma$ among all finite runs in $\mathcal{U}$ from $u$ to $u'$ with one surveillance cycle.

The TS $\overline{\mathcal{U}}$ can be constructed from $\mathcal{U}$ by an iterative algorithm that eliminates the states from $S_U \backslash S_{Usur}$ one by one, in arbitrary order. At the beginning, let $\overline{\mathcal{U}} = \mathcal{U}$, $\overline{T}_U = T_U$, and for every transition $(u, u') \in \overline{T}_U$ let $run((u, u')) = u$. The procedure for eliminating $u \in S_U \backslash S_{Usur}$ proceeds as follows. Consider every $u_1 \neq u$, $u_2 \neq u$ such that $(u_1, u), (u, u_2) \in \overline{T}_U$. If the transition $(u_1, u_2)$ is not in $\overline{T}_U$, add $(u_1, u_2)$ to $\overline{T}_U$ and define $run((u_1, u_2)) = run((u_1, u)).run((u, u_2))$, where $.$ denotes the concatenation of sequences. If $\overline{T}_U$ already contains the transition $(u_1, u_2)$ and $run((u_1, u_2)) = \sigma$, we set $run((u_1, u_2)) = run((u_1, u)).run((u, u_2))$ if

$$\textstyle\sum g_{exp}\big(run((u_1, u)).run((u, u_2))\big) \leq \sum g_{exp}(\sigma),$$

where $\sum g_{exp}(x)$ for a run $x$ is the sum of $g_{exp}(x(i))$ over all states $x(i)$ of $x$; otherwise we let $run((u_1, u_2)) = \sigma$. The weight is $\overline{w}_U((u_1, u_2)) = \sum g_{exp}(run((u_1, u_2)))$. Once all pairs $u_1, u_2$ are handled, remove $u$ from $S_U$ and all adjacent transitions from $\overline{T}_U$. Fig. 1 illustrates one iteration of the algorithm.

Subsequently, we apply Karp's algorithm to the oriented graph with vertices $S_{Usur}$, edges $\overline{T}_U$ and values $\overline{w}_U$ on edges. Let $cyc_{\overline{U}} = u_0 \ldots u_m$ be the minimum mean cycle of this graph. Then it holds that

$$V^*_U = \frac{1}{m+1} \sum_{i=0}^{m} \sum g_{exp}\big(run((u_i, u_{(i+1) \bmod (m+1)}))\big),$$

$$cyc^V_U = run((u_0, u_1)).run((u_1, u_2)).\,\ldots\,.run((u_{m-1}, u_m)).run((u_m, u_0)).$$

When the APPC value and the corresponding cycle have been computed for every ASCC of $\mathcal{P}$, we choose the ASCC that minimizes the APPC value. We denote this ASCC $\mathcal{U} = (S_U, T_U, AP, L_P, w_P)$ and its cycle $cyc^V_U = c_0 \ldots c_m$.

The mission subgoal aims to reach an accepting state from the set $F_U$. The corresponding strategy $C^\phi_P$ is such that from every state $(s,q) \in S_P \backslash F_U$ that can reach the set $F_U$, we follow one of the finite runs with minimum weight from $(s,q)$ to $F_U$. That is, $C^\phi_P$ is a memoryless strategy such that for $(s,q) \in S_P \backslash F_U$ with $w^*_P((s,q), F_U) < \infty$ it holds that $C^\phi_P((s,q)) = (s',q')$, where $w_P(((s,q),(s',q'))) = w^*_P((s,q), F_U) - w^*_P((s',q'), F_U)$.

The strategy $C^V_P$ for the average subgoal is given by the cycle $cyc^V_U = c_0 \ldots c_m$ of the ASCC $\mathcal{U}$. Similarly to the mission subgoal, for a state $(s,q) \in S_P \backslash cyc^V_U$ with $w^*_P((s,q), cyc^V_U) < \infty$, the strategy $C^V_P$ follows one of the finite runs with minimum weight to $cyc^V_U$. For a state $c_i \in cyc^V_U$, it holds that $C^V_P(c_i) = c_{(i+1) \bmod (m+1)}$. If all the states of the cycle $cyc^V_U$ are distinct, the strategy $C^V_P$ is memoryless; otherwise it is finite-memory.
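The following Python sketch illustrates Karp's characterization of the minimum mean cycle used above; it is our own illustration, not the authors' Java implementation [25], and it returns only the optimal mean value. min_mean_cycle_value, the toy graph and its weights are hypothetical.

```python
def min_mean_cycle_value(vertices, edges):
    """Karp's algorithm [22]: minimum mean weight of a cycle in a strongly
    connected directed graph. `edges` maps (u, v) -> weight. Run on the reduced
    TS (vertices S_Usur, weights w_U) this value is V*_U; recovering a witnessing
    cycle cyc^V_U additionally requires predecessor bookkeeping, omitted here."""
    verts = list(vertices)
    n = len(verts)
    INF = float("inf")
    source = verts[0]           # any vertex works in a strongly connected graph
    # d[k][v] = minimum weight of a walk with exactly k edges from source to v
    d = [{v: INF for v in verts} for _ in range(n + 1)]
    d[0][source] = 0
    for k in range(1, n + 1):
        for (u, v), w in edges.items():
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = INF
    for v in verts:
        if d[n][v] == INF:
            continue
        # Karp's characterization: min mean = min_v max_k (d_n(v) - d_k(v)) / (n - k)
        worst = max((d[n][v] - d[k][v]) / (n - k)
                    for k in range(n) if d[k][v] < INF)
        best = min(best, worst)
    return best

# Toy reduced graph over three surveillance states; each weight stands for the
# sum of expected penalties along the corresponding run of the ASCC.
edges = {("a", "b"): 1.5, ("b", "a"): 2.1, ("b", "c"): 0.9, ("c", "a"): 1.2}
print(min_mean_cycle_value({"a", "b", "c"}, edges))  # 1.2, attained by a -> b -> c -> a
```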
Proposition 1: For the strategy $C^V_P$ and every state $(s,q) \in S_U$, it holds that

$$\lim_{n\to\infty} Pr^{\mathcal{U},C^V_P}_{(s,q)}\left(\frac{\sum_{i=0}^{n} g\big(\rho_{C^V_P}(i), w_P(\rho_{C^V_P}^{(i)})\big)}{\sharp\big(\rho_{C^V_P}^{(n)}\big)} \leq V^*_U\right) = 1.$$

Equivalently, for every state $(s,q) \in S_U$ and every $\epsilon > 0$, there exists $j(\epsilon) \in \mathbb{N}$ such that if the strategy $C^V_P$ is played from the state $(s,q)$ until at least $l \geq j(\epsilon)$ surveillance cycles are completed, then the average cumulative penalty per surveillance cycle incurred in the performed finite run is at most $V^*_U + \epsilon$ with probability at least $1 - \epsilon$.

Proof: (Sketch.) The proof is based on the fact that the product $\mathcal{P}$ with dynamic penalties can be translated into a Markov decision process (MDP) (see, e.g., [23]) with static penalties. The run $\rho_{C^V_P}$ corresponds to a Markov chain (see, e.g., [24]) of the MDP. Moreover, the cycle $cyc^V_U$ corresponds to the minimum mean cycle of the reduced TS $\overline{\mathcal{U}}$. Hence, the equation in the proposition is equivalent to a property of MDPs with static penalties proved in [20] regarding the minimum expected penalty incurred per stage.

Remark 1: Assume there exists a state $(s,q) \in S_P$ with $p((s,q)) = 0$, i.e., if the penalty in $(s,q)$ is 1, it always drops to 0. The dynamics of the penalty in $(s,q)$ is then not probabilistic, and if we visit $(s,q)$ infinitely many times, the expected average penalty incurred in $(s,q)$ might differ from $g_{exp}((s,q))$. That can cause a violation of Prop. 1.

Now we describe the strategy $C_P$. It is played in rounds, where each round consists of two phases, one for each subgoal. The first round starts at the beginning of the execution of the system, in the initial state $s_{Pinit}$ of $\mathcal{P}$. Let $i$ be the current round. In the first phase of the round, the strategy $C^\phi_P$ is applied until an accepting state of the ASCC $\mathcal{U}$ is reached. We use $k_i$ to denote the number of steps for which we played the strategy $C^\phi_P$ in round $i$. Once the mission subgoal is fulfilled, the average subgoal becomes the current subgoal. In this phase, we play the strategy $C^V_P$ until the number of completed surveillance cycles in the second phase of the current round is $l_i \geq \max\{j(\frac{1}{i}), i \cdot k_i\}$.

Theorem 1: The strategy $C_P$ projects to a strategy $C$ of $\mathcal{T}$ that solves Problem 1.

Proof: From the fact that the ASCC $\mathcal{U}$ is reachable from the initial state $s_{Pinit}$ and from the construction of $C^\phi_P$, it follows that $\mathcal{U}$ is reached from $s_{Pinit}$ in finite time. In every round $i$ of the strategy $C_P$, an accepting state is visited. Moreover, from Prop. 1 and the fact that $l_i \geq \max\{j(\frac{1}{i}), i \cdot k_i\}$, it can be shown that the average cumulative penalty per surveillance cycle incurred in the $i$-th round is at most $V^*_U + \frac{2}{i}$ with probability at least $1 - \frac{1}{i}$. Therefore, in the limit, the run induced by $C_P$ satisfies the LTL specification and attains the optimal average cumulative penalty per surveillance cycle $V^*_U$ with probability 1.

Note that, in general, the strategy $C_P$ is not finite-memory. The reason is that in the modes of a finite-memory strategy we would need to store the number of steps $k_i$ spent so far in the first phase and the number $l_i$ of surveillance cycles in the second phase of a given round. Since $j(\frac{1}{i})$ generally increases with $i$, we would need infinitely many modes to count the number of surveillance cycles in the second phase. However, if there exists a cycle $cyc^V_U$ of the ASCC $\mathcal{U}$ corresponding to $V^*_U$ that contains an accepting state, then the finite-memory strategy $C^V_P$ for the average subgoal maps to a strategy of $\mathcal{T}$ solving Problem 1, which is therefore in the worst case finite-memory as well.
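A compact sketch of the round structure of $C_P$ described above, under the assumptions that the two subgoal strategies are given as memoryless maps (the paper notes $C^V_P$ may in general be finite-memory) and that $j(\epsilon)$ from Prop. 1 is available as a callable. All identifiers are ours and purely illustrative.

```python
def offline_strategy_rounds(initial_state, C_phi, C_V, accepting, surveils, j, num_rounds):
    """Sketch of the rounds of the offline strategy C_P (Sec. IV-B).

    C_phi, C_V : state -> next product state (mission / average subgoal strategies)
    accepting  : predicate, True for states in F_U
    surveils   : predicate, True for states labeled pi_sur (completes a surveillance cycle)
    j          : the function j(eps) from Prop. 1, assumed given
    Yields the sequence of visited product states."""
    state = initial_state
    for i in range(1, num_rounds + 1):
        # Phase 1 (mission subgoal): follow C_phi until an accepting state is reached.
        k_i = 0
        while not accepting(state):
            state = C_phi(state)
            k_i += 1
            yield state
        # Phase 2 (average subgoal): follow C_V for at least l_i surveillance cycles.
        l_i = max(j(1.0 / i), i * k_i)
        cycles = 0
        while cycles < l_i:
            state = C_V(state)
            if surveils(state):
                cycles += 1
            yield state
```

In the full algorithm, $j(\frac{1}{i})$ is evaluated lazily, as discussed in the complexity paragraph below.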
Complexity: The size of a BA for an LTL formula $\phi$ is in the worst case $2^{O(|\phi|)}$, where $|\phi|$ is the size of $\phi$ [17]. However, the actual size of the BA is often quite small in practice. The size of the product $\mathcal{P}$ is $O(|S| \cdot 2^{O(|\phi|)})$. To compute the minimum weights $w^*_P((s,q),(s',q'))$ between every two states of $\mathcal{P}$, we use the Floyd-Warshall algorithm, taking $O(|S_P|^3)$ time. Tarjan's algorithm [15] is used to compute the set $SCC(\mathcal{P})$ in time $O(|S_P| + |T_P|)$. The reduction of an ASCC $\mathcal{U}$ can be computed in time $O(|S_U| \cdot |T_U|^2)$. Karp's algorithm [22] finds the optimal APPC value and the corresponding cycle in time $O(|S_U| \cdot |T_U|)$. The main pitfall of the algorithm is the computation of the number $j(\frac{1}{i})$ of surveillance cycles needed in the second phase of the current round $i$ according to Prop. 1. Intuitively, we need to consider the finite run $\sigma_{C^V_P,k}$ induced by the strategy $C^V_P$ from the current state that contains $k$ surveillance cycles, starting with $k = 1$, and compute the sum of the probabilities $Pr^{\mathcal{P},C_P}_{(s,q)}(Cyl((\sigma_{C^V_P,k}, \tau)))$ over every $\tau$ with average cumulative penalty per surveillance cycle less than or equal to $V^*_U + \frac{1}{i}$. If the total probability is at least $1 - \frac{1}{i}$, we set $j(\frac{1}{i}) = k$; otherwise we increase $k$ and repeat the process. For every $k$, there exist $r^{|\sigma_{C^V_P,k}|}$ sequences $\tau$. To partially overcome this issue, we compute the number $j(\frac{1}{i})$ only at the point in time when the number of surveillance cycles in the second phase of the current round $i$ is $i \cdot k_i$ and the average cumulative penalty in this round is still above $V^*_U + \frac{2}{i}$. As the simulation results in Sec. VI show, this happens only rarely, if ever.

C. Online control

The online algorithm locally improves the strategy $C_P$ according to the values of penalties observed from the current state and their simulation over the next $h$ time units. The resulting strategy $\overline{C}_P$ is again played in rounds. However, in each step of the strategy $\overline{C}_P$, we consider a finite set of finite runs starting from the current state, choose one according to an optimization function, and apply its first transition.

Throughout the rest of the section we use the following notation. We use singletons such as $u, u_i$ to denote the states of $\mathcal{P}$. Let $\sigma_{all} \in Run^{\mathcal{P}}_{fin}(s_{Pinit})$ denote the finite run executed by $\mathcal{P}$ so far. Let $i$ be the current round of the strategy $\overline{C}_P$ and $\sigma_i = u_{i,0} \ldots u_{i,k}$ the finite run executed so far in this round, i.e., $u_{i,k}$ is the current state of $\mathcal{P}$. We use $t_{i,0}, \ldots, t_{i,k}$ to denote the points in time when the states $u_{i,0}, \ldots, u_{i,k}$ were visited, respectively.

The optimization function $f\colon Run^{\mathcal{P}}_{fin}(u_{i,k}) \to \mathbb{R}^+_0$ assigns to every finite run $\sigma = u_0 \ldots u_n$ starting from the current state a value $f(\sigma)$ that is the expected average cumulative penalty per surveillance cycle that would be incurred in the round $i$ if the run $\sigma$ were to be executed next, i.e.,

$$f(\sigma) = \frac{\sum_{j=0}^{k} g(u_{i,j}, t_{i,j}) + \sum_{j=1}^{n} g_{sim}\big(u_j, t_{i,k} + w_P(\sigma^{(j)})\big)}{\sharp\big(\sigma_i.\sigma(1)\ldots last(\sigma)\big)}, \qquad (5)$$

where $g_{sim}(u_j, t_{i,k} + w_P(\sigma^{(j)}))$ is the simulated expected penalty incurred in the state $u_j$ at the time of its visit. If the visit occurs within the next $h$ time units and the state $u_j$ is visible from the current state $u_{i,k}$, we simulate the penalty currently observed in $u_j$ over $w_P(\sigma^{(j)})$ time units; otherwise, we set the expected penalty to $g_{exp}(u_j)$. The exact definition of $g_{sim}$ can be found in Tab. I.

TABLE I: The simulated expected penalty $g_{sim}(u_j, t_{i,k} + w_P(\sigma^{(j)}))$ incurred in a state $u_j$ of the run $\sigma$ at the time of its visit $t_{i,k} + w_P(\sigma^{(j)})$, if we are to apply the run $\sigma$ from the current state $u_{i,k}$; div stands for integer division and mod for the modulus.

$$g_{sim}\big(u_j, t_{i,k} + w_P(\sigma^{(j)})\big) = \begin{cases} g(u_j, t_{i,k}) + \frac{w_P(\sigma^{(j)})}{r} & \text{if } u_j \in Vis(u_{i,k}),\ w_P(\sigma^{(j)}) \leq h \text{ and } g(u_j, t_{i,k}) + \frac{w_P(\sigma^{(j)})}{r} \leq 1, \\[4pt] \sum_{x=0}^{r} p_{st}(\tfrac{x}{r}) + p_{st}(1) & \text{if } u_j \in Vis(u_{i,k}),\ w_P(\sigma^{(j)}) \leq h \text{ and } g(u_j, t_{i,k}) + \frac{w_P(\sigma^{(j)})}{r} > 1, \\[4pt] g_{exp}(u_j) & \text{otherwise,} \end{cases}$$

where, with $z = w_P(\sigma^{(j)}) - (1 - g(u_j, t_{i,k})) \cdot r - x - 1$, $z_1 = z \text{ div } (r+1)$ and $z_2 = z \bmod (r+1)$,

$$p_{st}(\tfrac{x}{r}) = \begin{cases} \displaystyle\sum_{y=0}^{z_1} \frac{(z_1 - y + z_2 + y(r+1))!}{(z_1 - y)!\,(z_2 + y(r+1))!}\,(1 - p(u_j))^{z_1 - y}\,p(u_j)^{z_2 + y(r+1)}\,(1 - p(u_j))\,\frac{x}{r} & \text{if } z \geq 0, \\ 0 & \text{if } z < 0, \end{cases}$$

and, with $z = w_P(\sigma^{(j)}) - (1 - g(u_j, t_{i,k})) \cdot r - 1$, $z_3 = z \text{ div } (r+1)$ and $z_4 = z \bmod (r+1)$,

$$p_{st}(1) = \sum_{y=0}^{z_3} \frac{(z_3 - y + z_4 + y(r+1))!}{(z_3 - y)!\,(z_4 + y(r+1))!}\,(1 - p(u_j))^{z_3 - y}\,p(u_j)^{z_4 + y(r+1)}\,p(u_j).$$

For a set of states $X \subseteq S_P$, we define a shortening indicator function $I_X\colon T_P \to \{0, 1\}$ such that for $((s_1,q_1),(s_2,q_2)) \in T_P$

$$I_X\big(((s_1,q_1),(s_2,q_2))\big) = \begin{cases} 1 & \text{if } w^*_P((s_1,q_1), X) > w^*_P((s_2,q_2), X), \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$
Intuitively, the indicator has value 1 if the transition leads strictly closer to the set $X$, and 0 otherwise.

In the first phase of every round, we locally improve the strategy $C^\phi_P$ computed in Sec. IV-B that aims to visit an accepting state of the chosen ASCC $\mathcal{U}$. In each step of the resulting strategy $\overline{C}^\phi_P$, we consider the set $Run_\phi(u_{i,k})$ of all finite runs from the current state $u_{i,k}$ that lead to an accepting state from the set $F_U$ with all transitions shortening with respect to the indicator $I_{F_U}$ defined according to Eq. (6), i.e.,

$$Run_\phi(u_{i,k}) = \{\sigma \in Run^{\mathcal{P}}_{fin}(u_{i,k}) \mid last(\sigma) \in F_U,\ \forall\, 0 \leq j \leq |\sigma| - 1\colon I_{F_U}((\sigma(j), \sigma(j+1))) = 1\}.$$

Let $\sigma \in Run_\phi(u_{i,k})$ be the run that minimizes the optimization function $f$ from Eq. (5). Then $\overline{C}^\phi_P(\sigma_{all}) = \sigma(1)$. Just like in the offline algorithm, the strategy $\overline{C}^\phi_P$ is applied until a state from the set $F_U$ is visited.

In the second phase, we locally improve the strategy $C^V_P$ for the average subgoal computed in Sec. IV-B to obtain a strategy $\overline{C}^V_P$. However, the definition of the set of finite runs we choose from changes during the phase. At the beginning of the second phase of the current round $i$, we aim to reach the cycle $cyc^V_U = c_0 \ldots c_m$ of the ASCC $\mathcal{U}$, and we use the same idea as in the first phase above. To be specific, we define $\overline{C}^V_P(\sigma_{all}) = \sigma(1)$, where $\sigma$ is the finite run minimizing $f$ from the set

$$Run_V(u_{i,k}) = \{\sigma \in Run^{\mathcal{P}}_{fin}(u_{i,k}) \mid last(\sigma) \in cyc^V_U,\ \forall\, 0 \leq j \leq |\sigma| - 1\colon I_{cyc^V_U}((\sigma(j), \sigma(j+1))) = 1\}.$$

Once a state $c_a \in cyc^V_U$ of the cycle is reached, we continue as follows. Let $c_b \in cyc^V_U$ be the first state labeled with $\pi_{sur}$ that is visited from $c_a$ if we follow the cycle. Until we reach the state $c_b$, the optimal finite run $\sigma$ is chosen from the set

$$Run_V(u_{i,k}) = \{\sigma \in Run^{\mathcal{P}}_{fin}(u_{i,k}) \mid last(\sigma) = c_b, \text{ and } \forall\, 0 \leq j \leq |\sigma| - 1\colon I_{c_b}((\sigma(j), \sigma(j+1))) = 1 \text{ or } |\sigma_{c_a \to u_{i,k}}| + |\sigma| \leq ((b - a) \bmod (m+1)) + 2\},$$

where $\sigma_{c_a \to u_{i,k}}$ is the finite run already executed in $\mathcal{P}$ from the state $c_a$ to the current state $u_{i,k}$. Intuitively, the set contains every finite run from the current state to the state $c_b$ that either has all transitions shortening with respect to $I_{c_b}$, or whose length is such that, if we were to perform the finite run, the performed run from $c_a$ to $c_b$ would not be longer than following the cycle from $c_a$ to $c_b$. When the state $c_b$ is reached, we restart the above procedure with $c_a = c_b$.

The strategy $\overline{C}^V_P$ is performed until $l_i \geq \max\{j(\frac{1}{i}), i \cdot k_i\}$ surveillance cycles are completed in the second phase of the current round $i$, where $k_i$ is the number of steps of the first phase and $j$ is from Prop. 1. We can end the second phase sooner, specifically at any time when we complete a surveillance cycle and the average cumulative penalty per surveillance cycle incurred in the current round is less than or equal to $V^*_U + \frac{2}{i}$.
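For the simulated expected penalty $g_{sim}$ used by the optimization function $f$, the sketch below computes the same conditional expectation as the closed-form entries of Tab. I, but by forward propagation of the penalty distribution of Eq. (1); this is an equivalent reformulation offered only for illustration, not the authors' formulas or code, and the function name and arguments are our own.

```python
def g_sim(observed, p_u, r, delta, visible, horizon, g_exp_u):
    """Simulated expected penalty in a state u_j visited delta time units from now
    (cf. Tab. I). If u_j is not visible from the current state, or the visit falls
    beyond the planning horizon h, the static expectation g_exp(u_j) is used.
    Otherwise return E[g(u_j, t + delta) | g(u_j, t) = observed], computed by
    propagating the distribution of Eq. (1) forward delta steps."""
    if not visible or delta > horizon:
        return g_exp_u
    dist = [0.0] * (r + 1)          # dist[i] = probability that the penalty equals i/r
    dist[round(observed * r)] = 1.0
    for _ in range(delta):
        new = [0.0] * (r + 1)
        for i, prob in enumerate(dist):
            if prob == 0.0:
                continue
            if i < r:
                new[i + 1] += prob           # ramp up by 1/r
            else:
                new[r] += prob * p_u         # stay at 1 with probability p(u_j)
                new[0] += prob * (1 - p_u)   # drop to 0 otherwise
        dist = new
    return sum((i / r) * prob for i, prob in enumerate(dist))

# The first branch of Tab. I is recovered exactly: no randomness before saturation.
print(g_sim(observed=0.4, p_u=0.7, r=5, delta=2, visible=True, horizon=9, g_exp_u=0.85))  # 0.8
```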
Theorem 2: The strategy $\overline{C}_P$ projects to a strategy of $\mathcal{T}$ solving Problem 1.

Proof: First, we prove that Prop. 1 holds for the strategy $\overline{C}^V_P$ as well. This follows directly from the facts below. The set of finite runs we choose from always contains a finite run induced by the strategy $C^V_P$. Once the cycle $cyc^V_U$ is reached, the system optimizes the finite run from one surveillance state of the cycle to the next, which is reached after finite time. Finally, if the strategy $\overline{C}^V_P$ does not follow $C^V_P$, it is only because the chosen finite run provides a lower expected average. The correctness of the strategy $\overline{C}_P$ is then proved analogously to the correctness of the strategy computed offline.

Proposition 2: With probability 1, the strategy $\overline{C}_P$ is expected to perform in the worst case as well as the strategy $C_P$ computed offline. That means, if the average cumulative penalty per surveillance cycle incurred in the run performed so far is lower than the optimal APPC value $V^*_U$, it is expected to rise more slowly under the strategy $\overline{C}_P$ than under the strategy $C_P$. On the other hand, if the average cumulative penalty per surveillance cycle incurred in the run performed so far is higher than the optimal APPC value $V^*_U$, it is expected to decrease faster under the strategy $\overline{C}_P$ than under the strategy $C_P$.

Proof: Follows from the proof of Theorem 2.

Complexity: The cardinality of the set of finite runs $Run_\phi(u_{i,k})$ grows exponentially with the minimum weight $w^*_P(u_{i,k}, F_U)$. Analogously, the same holds for the set of finite runs $Run_V(u_{i,k})$ and the distance to the cycle $cyc^V_U$ or to one of its surveillance states. To simplify the computations and to use the algorithm effectively in real time, one can apply the following rule, which we also used in our implementation in Sec. VI. We put a threshold on the maximum weight of a finite run in $Run_\phi(u_{i,k})$ and $Run_V(u_{i,k})$. In the second phase of a round, when on the optimal cycle, we optimize the finite run from the state $c_a$ to the next surveillance state $c_b$ on the cycle. However, if the weight of the fragment of the cycle from $c_a$ to $c_b$ is too high, we can first optimize the run to some intermediate state $c'_b$. Also, the complexity of one step of the strategy $\overline{C}_P$ grows exponentially with the user-defined planning horizon $h$. Hence, $h$ should be chosen wisely; one should keep in mind that the longer the planning horizon, the better the local improvement.

V. DISCUSSION

Every LTL formula $\varphi$ over $AP$ can be converted to a formula $\phi$ of the form in Eq. (3) such that a run of the TS $\mathcal{T}$ satisfies $\phi$ if and only if it satisfies $\varphi$. The formula is $\phi = \varphi \wedge GF\,\pi_{sur}$, where $\pi_{sur} \in L(s)$ for every $s \in S$. In that case, Problem 1 requires minimizing the expected average penalty incurred per stage.

The algorithms presented in Sec. IV can also be used to correctly solve Problem 1 for systems with penalty dynamics different from the one defined in Sec. III. However, for every state we need to be able to compute the expected value of the penalty in that state, as in Eq. (2). For the online algorithm, we also require that the dynamics of the penalties allow them to be simulated for a finite number of time units. More precisely, if we observe the penalty in a state $s \in S$ at time $t$, we must be able to compute the simulated expected value of the penalty in $s$ in every following time unit, up to $h$ time units ahead, based only on the observed value.
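These two requirements can be summarized as a small interface; the sketch below is purely illustrative (PenaltyModel, RampPenaltyModel and the method names are our own), and the concrete model simply reuses the g_exp and g_sim sketches given earlier, assuming visibility is handled by the caller.

```python
from abc import ABC, abstractmethod

class PenaltyModel(ABC):
    """Illustrative interface for the requirements of Sec. V: the offline algorithm
    needs an expected penalty value per state, and the online algorithm additionally
    needs the penalty to be simulatable from an observed value for up to h time units."""

    @abstractmethod
    def expected_value(self, state):
        """Analogue of g_exp(s) in Eq. (2), used by the offline strategy."""

    @abstractmethod
    def simulate(self, state, observed_value, time_units):
        """Expected penalty in `state` after `time_units`, given only the value
        observed now; analogue of g_sim in Tab. I, used by the online strategy."""

class RampPenaltyModel(PenaltyModel):
    """The concrete model of Sec. III, assuming p maps states to p(s)."""
    def __init__(self, p, r, horizon):
        self.p, self.r, self.horizon = p, r, horizon

    def expected_value(self, state):
        return 0.5 * (1 + self.p[state])          # Eq. (2)

    def simulate(self, state, observed_value, time_units):
        # Reuses the g_sim sketch from Sec. IV-C; the state is assumed visible here.
        return g_sim(observed_value, self.p[state], self.r, time_units,
                     visible=True, horizon=self.horizon,
                     g_exp_u=self.expected_value(state))
```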
The online algorithm from Sec. IV-C is a heuristic. The sets of finite runs $Run_\phi(u_{i,k})$ and $Run_V(u_{i,k})$ can be defined differently, according to the properties of the actual problem. To guarantee the correctness of the strategy $\overline{C}_P$, the sets must satisfy the following conditions. There always exists a finite run in the set minimizing the optimization function $f$ in Eq. (5). The definition of the set $Run_\phi(u_{i,k})$ guarantees that an accepting state from $F_U$ is visited after a finite number of steps. The definition of $Run_V(u_{i,k})$ guarantees a visit of the cycle $cyc^V_U$ in finite time and, moreover, that Prop. 1 holds for the resulting strategy $\overline{C}^V_P$.

VI. CASE STUDY

We implemented the framework developed in this paper for a persistent surveillance robotics example in Java [25]. In this section, we report on the simulation results.

Fig. 2: (a) A TS modeling the robot (black dot) motion in a partitioned environment. Two stock locations are in green, a base is shown in blue, and unsafe locations are in red. There is a transition between vertically, horizontally or diagonally neighboring states. The weight of a horizontal or vertical transition is 2, of a diagonal transition 3. (b) The penalty probabilities in the states. Darker shades indicate higher probability.

We consider a mobile robot in a grid-like partitioned environment modeled as the TS depicted in Fig. 2a. The robot transports packages between two stocks, marked green in Fig. 2a. The blue state marks the robot's base location. The penalties in the states are defined by the rate $r = 5$ and the penalty probability function in Fig. 2b. The visibility range $v$ is 6. For example, in Fig. 2a the set $Vis(s)$ of states visible from the current state $s$, with corresponding penalties, is depicted as the blue-shaded area. We set the planning horizon $h = 9$.

The mission for the robot is to transport packages between the two stocks (labeled with propositions $a$ and $b$, respectively) and to return to the base (labeled with proposition $c$) infinitely many times. The red states in Fig. 2a are dangerous locations (labeled with $u$) which are to be avoided. At the same time, we wish to minimize the cumulative penalty incurred during the transport of a package, i.e., the surveillance property $\pi_{sur}$ is true in both stock states. The corresponding LTL formula is

$$G\big(a \Rightarrow X(\neg a\,U\,b)\big) \wedge G\big(b \Rightarrow X(\neg b\,U\,a)\big) \wedge GF\,c \wedge G(\neg u) \wedge GF\,\pi_{sur},$$

and the Büchi automaton has 10 states. The cycle providing the minimum expected average cumulative penalty per surveillance cycle is depicted in magenta in Fig. 2a, and the optimal APPC value is 5.4.

Fig. 3: (a) The average cumulative penalty per surveillance cycle incurred during the runs, shown at the end of each round. The red line marks the optimal APPC value. (b) The average cumulative penalty per surveillance cycle incurred in every round. The red bars indicate the threshold $V^*_U + \frac{2}{i}$.

We ran both the offline and the online algorithm for multiple rounds starting from the base state. In Fig. 3 we report on the results for 20 rounds; for more results see [25]. As illustrated in Fig. 3a, the average cumulative penalty per surveillance cycle incurred in the run induced by the offline strategy is above the optimal value and converges to it fairly fast.
For the run induced by the online strategy, the average is significantly below the minimum APPC value, due to the local improvement based on local sensing. On the other hand, Fig. 3b shows the average cumulative penalty per surveillance cycle incurred in each round separately. The number of surveillance cycles performed in the second phase of every round $i$ of the offline strategy was less than $i \cdot k_i$, i.e., the second phase always ended because the average incurred in the round dropped below the threshold $V^*_U + \frac{2}{i}$. The maximum number of surveillance cycles performed in the second phase of a round was 7. The same holds for the online strategy, where the maximum number of surveillance cycles in the second phase of a round was 3. For both algorithms, the number of surveillance cycles in the second phase of a round does not evolve monotonically, but rather randomly. Hence, we conclude that in every round $i$ we are unlikely to need to compute the value $j(\frac{1}{i})$.

REFERENCES

[1] C. Baier and J. Katoen, Principles of Model Checking. The MIT Press, 2008.
[2] J. Barnat, L. Brim, M. Češka, and P. Ročkai, "DiVinE: Parallel Distributed Model Checker," in HiBi/PDMC'10, 2010, pp. 4–7.
[3] A. Hinton, M. Kwiatkowska, G. Norman, and D. Parker, "PRISM: A Tool for Automatic Verification of Probabilistic Systems," in TACAS'06, ser. LNCS, vol. 3920, 2006, pp. 441–444.
[4] A. Cimatti, E. Clarke, E. Giunchiglia, F. Giunchiglia, M. Pistore, M. Roveri, R. Sebastiani, and A. Tacchella, "NuSMV Version 2: An OpenSource Tool for Symbolic Model Checking," in CAV'02, ser. LNCS, vol. 2404, 2002.
[5] S. L. Smith, J. Tumova, C. Belta, and D. Rus, "Optimal path planning for surveillance with temporal-logic constraints," I. J. Robotic Res., vol. 30, no. 14, pp. 1695–1708, 2011.
[6] K. Chatterjee, L. Doyen, T. A. Henzinger, and J. F. Raskin, "Algorithms for omega-regular games with imperfect information," in CSL'06, ser. LNCS, vol. 4207, 2006, pp. 287–302.
[7] X. C. Ding, S. L. Smith, C. Belta, and D. Rus, "MDP Optimal Control under Temporal Logic Constraints," in CDC'11, 2011, pp. 532–538.
[8] P. Tabuada and G. Pappas, "Linear Time Logic Control of Discrete-Time Linear Systems," Trans. on Automatic Control, vol. 51, no. 12, pp. 1862–1877, 2006.
[9] E. A. Gol, M. Lazar, and C. Belta, "Language-Guided Controller Synthesis for Discrete-Time Linear Systems," in HSCC'12, 2012, pp. 95–104.
[10] T. Wongpiromsarn, U. Topcu, and R. M. Murray, "Receding horizon temporal logic planning for dynamical systems," in CDC'09, 2009, pp. 5997–6004.
[11] M. Kloetzer and C. Belta, "A fully automated framework for control of linear systems from temporal logic specifications," Trans. on Automatic Control, vol. 53, no. 1, pp. 287–297, 2008.
[12] B. Yordanov, J. Tumova, I. Cerna, J. Barnat, and C. Belta, "Temporal Logic Control of Discrete-Time Piecewise Affine Systems," Trans. on Automatic Control, vol. 57, pp. 1491–1504, 2012.
[13] X. C. Ding, M. Lazar, and C. Belta, "Receding Horizon Temporal Logic Control for Finite Deterministic Systems," in ACC'12, 2012.
[14] M. Svoreňová, J. Tůmová, J. Barnat, and I. Černá, "Attraction-Based Receding Horizon Path Planning with Temporal Logic Constraints," in CDC'12, 2012, to appear.
[15] R. Tarjan, "Depth-first search and linear graph algorithms," in SWAT'71, 1971, pp. 114–121.
[16] M. Y. Vardi and P. Wolper, "Reasoning about Infinite Computations," Information and Computation, vol. 115, pp. 1–37, 1994.
[17] P. Gastin and D. Oddoux, "Fast LTL to Büchi Automata Translation," in CAV'01, 2001, pp. 53–65.
[18] F. Somenzi and R. Bloem, "Efficient Büchi Automata from LTL Formulae," in CAV'00, 2000, pp. 248–263.
[19] P. Gastin and D. Oddoux. (2001) LTL 2 BA: fast translation from LTL formulae to Büchi automata. [Online]. Available: http://www.lsv.ens-cachan.fr/~gastin/ltl2ba/
[20] K. Chatterjee and L. Doyen, "Energy and Mean-Payoff Parity Markov Decision Processes," in MFCS'11, 2011, pp. 206–218.
[21] R. Ash and C. Doléans-Dade, Probability & Measure Theory. Academic Press, 2000.
[22] R. M. Karp, "A characterization of the minimum cycle mean in a digraph," Discrete Mathematics, vol. 23, no. 3, pp. 309–311, 1978.
[23] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, USA: John Wiley & Sons, Inc., 1994.
[24] J. R. Norris, Markov Chains, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
[25] M. Svorenova, I. Cerna, and C. Belta. (2012) Simulation of Optimal Receding Horizon Control for Finite Deterministic Systems with Temporal Logic Specification. [Online]. Available: http://www.fi.muni.cz/~x175388/simulationLTLoptimalrhc