Optimal Receding Horizon Control for Finite Deterministic Systems with Temporal Logic Constraints

Mária Svoreňová, Ivana Černá and Calin Belta

M. Svoreňová and I. Černá are with the Faculty of Informatics, Masaryk University, Brno, Czech Republic, svorenova@mail.muni.cz, cerna@muni.cz. C. Belta is with the Department of Mechanical Engineering and the Division of Systems Engineering, Boston University, Boston, MA, USA, cbelta@bu.edu. This work was partially supported at MU by grants GAP202/11/0312, LH11065, at BU by ONR grants MURI N00014-09-1051, MURI N00014-10-10952 and by NSF grant CNS-1035588.

Abstract — In this paper, we develop a provably correct optimal control strategy for a finite deterministic transition system. By assuming that penalties with known probabilities of occurrence and dynamics can be sensed locally at the states of the system, we derive a receding horizon strategy that minimizes the expected average cumulative penalty incurred between two consecutive satisfactions of a desired property. At the same time, we guarantee the satisfaction of correctness specifications expressed as Linear Temporal Logic formulas. We illustrate the approach with a persistent surveillance robotics application.

I. INTRODUCTION

Temporal logics, such as Computation Tree Logic (CTL) and Linear Temporal Logic (LTL), have been customarily used to specify the correctness of computer programs and digital circuits modeled as finite-state transition systems [1]. The problem of analyzing such a model against a temporal logic formula, known as formal analysis or model checking, has received a lot of attention during the past thirty years, and several efficient algorithms and software tools are available [2], [3], [4]. The formal synthesis problem, in which the goal is to design or control a system from a temporal logic specification, was not studied extensively until a few years ago. Recent results include the use of model checking algorithms for controlling deterministic systems [5], automata games for controlling non-deterministic systems [6], and linear programming and value iteration for the synthesis of control policies for Markov decision processes [1], [7]. Through the use of abstractions, such techniques have also been applied to infinite systems, such as continuous and discrete-time linear systems [8], [9], [10], [11], [12].

The connection between optimal and temporal logic control is an intriguing problem with a potentially high impact in several applications. By combining these two seemingly unrelated areas, our goal is to optimize the behavior of a system subject to correctness constraints. Consider, for example, a mobile robot involved in a persistent surveillance mission in a dangerous area and under tight fuel/time constraints. The correctness requirement is expressed as a temporal logic specification, e.g., "Alternately keep visiting A and B while always avoiding C", while the resource constraints translate to minimizing a cost function over the feasible trajectories of the robot. While optimal control is a mature discipline and formal synthesis is fairly well understood, optimal formal synthesis is a largely open area.

In this paper, we focus on finite labeled transition systems and correctness specifications given as formulas of LTL. We assume there is a penalty associated with the states of the system, with a known occurrence probability and time-behavior.
Motivated by persistent surveillance robotic missions, our goal is to minimize the expected average cumulative penalty incurred between two consecutive satisfactions of a desired property associated with some states of the system, while at the same time satisfying an additional temporal logic constraint. Also from robotics comes our assumption that actual penalty values can only be sensed locally, in close proximity to the current state, during the execution of the system. We propose two algorithms for this problem. The first operates offline, i.e., without executions of the system, and therefore only uses the known probabilities but does not exploit actual penalties sensed during the execution. The second algorithm designs an online strategy by locally improving the offline strategy based on local sensing and simulation over a user-defined planning horizon. While both algorithms guarantee optimal expected average penalty collection, in real executions of the system the second algorithm yields a lower actual average than the first. We illustrate these results on a robotic persistent surveillance case study.

This paper is closely related to [13], [14], [5], which also focused on optimal control for finite transition systems with temporal logic constraints. In [5], the authors developed an offline control strategy minimizing the maximum cost between two consecutive visits to a given set of states, subject to constraints expressed as LTL formulas. Time-varying, locally sensed rewards were introduced in [13], where a receding horizon control strategy maximizing locally collected rewards was shown to satisfy an LTL specification. This approach was generalized in [14] to allow for a broader class of optimization objectives and reward models. In contrast with [13], [14], we interpret the dynamic values appearing in states of the system as penalties instead of rewards, i.e., in our case, the cost function is minimized rather than maximized. This guarantees the existence of an optimum for the expected average penalty collection. In this paper, we show how it can be achieved using an automata-based approach and results from game theory.

In Sec. II, we introduce the notation and definitions necessary throughout the paper. The problem is stated in Sec. III. The main results of the paper are in Sec. IV and Sec. V. The simulation results are presented in Sec. VI.
II. PRELIMINARIES

For a set S, we use S^ω and S^+ to denote the set of all infinite and all non-empty finite sequences of elements of S, respectively. For a finite or infinite sequence α = a_0 a_1 ..., we use α(i) = a_i to denote the i-th element and α^{(i)} = a_0 ... a_i for the finite prefix of α of length |α^{(i)}| = i + 1.

Definition 1: A weighted deterministic transition system (TS) is a tuple T = (S, T, AP, L, w), where S is a non-empty finite set of states, T ⊆ S × S is a transition relation, AP is a finite set of atomic propositions, L : S → 2^AP is a labeling function and w : T → N is a weight function. We assume that for every s ∈ S there exists s' ∈ S such that (s, s') ∈ T. An initialized transition system is a TS T = (S, T, AP, L, w) with a distinctive initial state s_init ∈ S.

A run of a TS T is an infinite sequence ρ = s_0 s_1 ... ∈ S^ω such that for every i ≥ 0 it holds that (s_i, s_{i+1}) ∈ T. We use inf(ρ) to denote the set of all states visited infinitely many times in the run ρ and Run^T(s) for the set of all runs of T that start in s ∈ S. Let Run^T = ⋃_{s ∈ S} Run^T(s). A finite run σ = s_0 ... s_n of T is a finite prefix of a run of T and Run^T_fin(s) denotes the set of all finite runs of T that start in s ∈ S. Let Run^T_fin = ⋃_{s ∈ S} Run^T_fin(s). The length |σ|, or number of stages, of a finite run σ = s_0 ... s_n is n + 1 and last(σ) = s_n denotes the last state of σ. With slight abuse of notation, we use w(σ) to denote the weight of a finite run σ = s_0 ... s_n, i.e., w(σ) = Σ_{i=0}^{n−1} w((s_i, s_{i+1})). Moreover, w*(s, s') denotes the minimum weight of a finite run from s to s'. Specifically, w*(s, s) = 0 for every s ∈ S and if there does not exist a run from s to s', then w*(s, s') = ∞. For a set S' ⊆ S we let w*(s, S') = min_{s' ∈ S'} w*(s, s'). We say that a state s' and a set S' are reachable from s iff w*(s, s') ≠ ∞ and w*(s, S') ≠ ∞, respectively.

Every run ρ = s_0 s_1 ... ∈ Run^T, resp. σ = s_0 ... s_n ∈ Run^T_fin, induces a word z = L(s_0) L(s_1) ... ∈ (2^AP)^ω, resp. z = L(s_0) ... L(s_n) ∈ (2^AP)^+, over the power set of AP. A cycle of the TS T is a finite run cyc = c_0 ... c_m of T for which it holds that (c_m, c_0) ∈ T.

Definition 2: A sub-system of a TS T = (S, T, AP, L, w) is a TS U = (S_U, T_U, AP, L|_U, w|_U), where S_U ⊆ S and T_U ⊆ T ∩ (S_U × S_U). We use L|_U to denote the labeling function L restricted to the set S_U. Similarly, we use w|_U with the obvious meaning. If the context is clear, we use L, w instead of L|_U, w|_U. A sub-system U of T is called strongly connected if for every pair of states s, s' ∈ S_U, there exists a finite run σ ∈ Run^U_fin(s) such that last(σ) = s'. A strongly connected component (SCC) of T is a maximal strongly connected sub-system of T.
We use   SCC( T   )   to denote the set of all strongly connected components of   T   . Strongly connected components of a TS   T   are pairwise disjoint. Hence, the cardinality of the set   SCC( T   )   is bounded by the number of states of   T   and the set can be computed using Tarjan’s algorithm [15].  Definition 3:   Let   T   = ( S, T, AP, L, w )   be a TS. A control strategy for   T   is a function   C   : Run T  fin   →   S   such that for every   σ   ∈   Run T  fin , it holds that   ( last ( σ ) , C ( σ ))   ∈   T   . A strategy   C   for which   C ( σ 1 ) =   C ( σ 2 ) , for all finite runs   σ 1 , σ 2   ∈   Run T  fin   with   last ( σ 1 ) =   last ( σ 2 ) , is called memoryless. In that case,   C   is a function   C   :   S   →   S . A strategy is called finite-memory if it is defined as a tuple  C   = ( M,   next ,   ∆ ,   start ) , where   M   is a finite set of modes,  ∆ :   M   ×   S   →   M   is a transition function,   next   :   M   ×   S   →   S  selects a state of   T   to be visited next, and   start   :   S   →   M  selects the starting mode for every   s   ∈   S . A run induced by a strategy   C   for   T   is a run   ρ C   =  s 0 s 1   . . .   ∈   Run T   for which   s i +1   =   C ( ρ ( i )  C   )   for every   i   ≥   0 . For every   s   ∈   S , there is exactly one run induced by   C   that starts in   s . A finite run induced by   C   is   σ C   ∈   Run T  fin , which is a finite prefix of a run   ρ C   induced by   C . Let   C   be a strategy, finite-memory or not, for a TS   T   . For every state   s   ∈   S , the run   ρ C   ∈   Run T   ( s )   induced by   C  satisfies   inf( ρ C   )   ⊆   S U   for some   U ∈   SCC( T   )   [1]. We say that   C   leads   T   from the state   s   to the SCC   U .  Definition 4:   Linear Temporal Logic (LTL) formulas over the set   AP   are formed according to the following grammar:  φ   ::=   true   |   a   | ¬ φ   |   φ   ∨   φ   |   φ   ∧   φ   |   X   φ   |   φ   U   φ   |   G   φ   |   F   φ,  where   a   ∈   AP   is an atomic proposition,   ¬ ,   ∨   and   ∧   are standard Boolean connectives, and   X   ( next ),   U   (until),   G  (always) and   F   (eventually) are temporal operators. The semantics of LTL is defined over words over   2 AP   , such as those generated by the runs of a TS   T   (for details see   e.g.,   [1]). For example, a word   w   ∈   (2 AP   ) ω   satisfies   G   φ  and   F   φ   if   φ   holds in   w   always and eventually, respectively. If the word induced by a run of   T   satisfies a formula   φ , we say that the run satisfies   φ . We call   φ   satisfiable in   T   from  s   ∈   S   if there exists a run   ρ   ∈   Run T   ( s )   that satisfies   φ . Having an initialized TS   T   and an LTL formula   φ   over  AP   , the formal synthesis problem aims to find a strategy  C   for   T   such that the run   ρ C   ∈   Run T   ( s init )   induced by  C   satisfies   φ . In that case we also say that the strategy   C  satisfies   φ . The formal synthesis problem can be solved using principles from model checking methods [1]. Specifically,   φ  is translated to a B ̈ uchi automaton and the system combining the B ̈ uchi automaton and the TS   T   is analyzed.  
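Reachability and minimum-weight computations on the TS, i.e., the quantities w*(s, s') and w*(s, S') defined above, are used repeatedly in the remainder of the paper. The following is a minimal Java sketch, under assumed class and method names (TransitionSystem, minWeight) and not the authors' implementation [25], of one way to represent a weighted TS and compute w* with Dijkstra's algorithm.

```java
import java.util.*;

// Hypothetical representation of a weighted deterministic TS T = (S, T, AP, L, w).
class TransitionSystem {
    // Adjacency map: state -> (successor -> transition weight w).
    final Map<String, Map<String, Integer>> trans = new HashMap<>();
    // Labeling function L: state -> set of atomic propositions.
    final Map<String, Set<String>> label = new HashMap<>();

    void addTransition(String s, String t, int w) {
        trans.computeIfAbsent(s, k -> new HashMap<>()).put(t, w);
        trans.computeIfAbsent(t, k -> new HashMap<>());
    }

    // Minimum weight w*(src, dst) of a finite run from src to dst, computed by
    // Dijkstra's algorithm; returns Integer.MAX_VALUE if dst is not reachable.
    int minWeight(String src, String dst) {
        Map<String, Integer> dist = new HashMap<>();
        PriorityQueue<Map.Entry<String, Integer>> queue =
            new PriorityQueue<>(Comparator.comparingInt((Map.Entry<String, Integer> en) -> en.getValue()));
        dist.put(src, 0);
        queue.add(Map.entry(src, 0));
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> top = queue.poll();
            String u = top.getKey();
            int d = top.getValue();
            if (d > dist.getOrDefault(u, Integer.MAX_VALUE)) continue; // stale queue entry
            if (u.equals(dst)) return d;                               // w*(src, dst) found
            for (Map.Entry<String, Integer> e : trans.getOrDefault(u, Map.of()).entrySet()) {
                int alt = d + e.getValue();
                if (alt < dist.getOrDefault(e.getKey(), Integer.MAX_VALUE)) {
                    dist.put(e.getKey(), alt);
                    queue.add(Map.entry(e.getKey(), alt));
                }
            }
        }
        return Integer.MAX_VALUE; // w*(src, dst) = infinity
    }
}
```

The same routine, applied to the product defined below, yields the minimum weights w*_P used by both the offline and the online strategies.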
Definition 5:   A B ̈ uchi automaton (BA) is a tuple   B   = ( Q,   2 AP   , δ, q 0 , F   ) , where   Q   is a non-empty finite set of states,   2 AP   is the alphabet,   δ   ⊆   Q   ×   2 AP   ×   Q   is a transition relation such that for every   q   ∈   Q ,   a   ∈   2 AP   , there exists  q ′   ∈   Q   such that   ( q, a, q ′ )   ∈   δ ,   q 0   ∈   Q   is the initial state, and   F   ⊆   Q   is a set of accepting states. A run   q 0 q 1   . . . Q ω   of   B   is an infinite sequence such that for every   i   ≥   0   there exists   a i   ∈   2 AP   with   ( q i , a i , q i +1 )   ∈   δ . The word   a 0 a 1   . . .   ∈   (2 AP   ) ω   is called the word induced by the run   q 0 q 1   . . . . A run   q 0 q 1   . . .   of   B   is accepting if there exist infinitely many   i   ≥   0   such that   q i   is an accepting state. For every LTL formula   φ   over   AP   , one can construct a B ̈ uchi automaton   B φ   such that the accepting runs are all and only words over   2 AP   satisfying   φ   [16]. We refer readers to [17], [18] for algorithms and to online implementations such as [19], to translate an LTL formula to a BA.  Definition 6:   Let   T   =   ( S, T, AP, L, w )   be an initial- ized   TS   and   B   =   ( Q,   2 AP   , δ, q 0 , F   )   be   a   B ̈ uchi   au- tomaton. The product   P   of   T   and   B   is a tuple   P   =
( S P   , T P   , s P init , AP, L P   , F P   , w P   ) ,   where   S P   =   S   ×   Q ,  T P   ⊆   S P   ×   S P   is a transition relation such that for every  ( s, q ) ,   ( s ′ , q ′ )   ∈   S P   it holds that   (( s, q ) ,   ( s ′ , q ′ ))   ∈   T P   if and only if   ( s, s ′ )   ∈   T   and   ( q, L ( s ) , q ′ )   ∈   δ ,   s P init   = ( s init , q 0 )   is the initial state,   L P   (( s, q )) =   L ( s )   is a labeling function,   F P   =   S   ×   F   is a set of accepting states, and  w P   ((( s, q ) ,   ( s ′ , q ′ ))) =   w (( s, s ′ ))   is a weight function. The product   P   can be viewed as an initialized TS with a set of accepting states. Therefore, we adopt the definitions of a run   ρ , a finite run   σ , its weight   w P   ( σ ) , and sets  Run P   (( s, q )) ,   Run P   ,   Run P  fin (( s, q ))   and   Run P  fin   from above. Similarly, a cycle   cyc   of   P , a strategy   C P   for   P   and runs  ρ C P   , σ C P   induced by   C P   are defined in the same way as for a TS. On the other hand,   P   can be viewed as a weighted BA over the trivial alphabet with a labeling function, which gives us the definition of an accepting run of   P . Using the projection on the first component, every run  ( s 0 , q 0 )( s 1 , q 1 )   . . .   and finite run   ( s 0 , q 0 )   . . .   ( s n , q n )   of   P  corresponds to a run   s 0 s 1   . . .   and a finite run   s 0   . . . s n   of  T   , respectively. Vice versa, for every run   s 0 s 1   . . .   and finite run   s 0   . . . s n   of   T   , there exists a run   ( s 0 , q 0 )( s 1 , q 1 )   . . .   and finite run   ( s 0 , q 0 )   . . .   ( s n , q n ) . Similarly, every strategy for   P  projects to a strategy for   T   and for every strategy for   T   there exists a strategy for   P   that projects to it. The projection of a finite-memory strategy for   P   is also finite-memory. Since   P   can be viewed as a TS, we also adopt the defini- tions of a sub-system and a strongly connected component.  Definition 7:   Let   P   = ( S P   , T P   , s P init , AP, L P   , F P   , w P   )  be the product of an initialized TS   T   and a BA   B . An accepting strongly connected component (ASCC) of   P   is an SCC   U   = ( S U   , T U   , AP, L P   , w P   )   such that the set   S U   ∩   F P  is non-empty and we refer to it as the set   F U   of accepting states of   U . We use   ASCC( P )   to denote the set of all ASCCs of   P   that are reachable from the initial state   s P init . In our work, we always assume that   ASCC( P )   is non- empty,   i.e.,   the given LTL formula is satisfiable in the TS. III. P ROBLEM   F ORMULATION  Consider an initialized weighted transition system   T   = ( S, T, AP, L, w ) .   The   weight   w (( s, s ′ ))   of   a   transition  ( s, s ′ )   ∈   T   represents the amount of time that the transition takes and the system starts at time   0 . We use   t n   to denote the point in time after the   n -th transition of a run,   i.e.,   initially the system is in a state   s 0   at time   t 0   = 0   and after a finite run   σ   ∈   Run T  fin ( s 0 )   of length   n   + 1   the time is   t n   =   w ( σ ) . We assume there is a dynamic   penalty   associated with every state   s   ∈   S . In this paper, we address the following model of penalties. Nevertheless, as we discuss in Sec.V, the algorithms presented in the next section provide optimal solution for a much broader class of penalty dynamics. 
The penalty is a rational number between 0 and 1 that increases every time unit by 1/r, where r ∈ N is a given rate. Whenever the penalty is 1, in the next time unit it either remains 1 or drops to 0, according to a given probability distribution. Upon the visit of a state, the corresponding penalty is incurred. We assume that the visit of the state does not affect the penalty's value or dynamics. Formally, the penalties are defined by a rate r ∈ N and a penalty probability function p : S → (0, 1], where p(s) is the probability that if the penalty in a state s is 1 then in the next time unit the penalty remains 1, and 1 − p(s) is the probability of the penalty dropping to 0. The penalties are described using a function g : S × N_0 → { i/r | i ∈ {0, 1, ..., r} }, such that g(s, t) is the penalty in a state s ∈ S at time t ∈ N_0. For s ∈ S, g(s, 0) is a uniformly distributed random variable with values in the set { i/r | i ∈ {0, 1, ..., r} } and for t ≥ 1

g(s, t) = g(s, t − 1) + 1/r if g(s, t − 1) < 1, and g(s, t) = x otherwise,   (1)

where x is a random variable such that x = 1 with probability p(s) and x = 0 otherwise. We use

g_exp(s) = (1 − p(s)) · 1/2 + p(s) · 1 = 1/2 (1 + p(s))   (2)

to denote the expected value of the penalty in a state s ∈ S. Note that 1/2 ≤ g_exp(s) ≤ 1 for every s ∈ S.

In our setting, the penalties are sensed only locally, in the states in close proximity to the current state. To be specific, we assume a visibility range v ∈ N is given. If the system is in a state s ∈ S at time t, the penalty g(s', t) of a state s' ∈ S is observable if and only if s' ∈ Vis(s) = { s' ∈ S | w*(s, s') ≤ v }. The set Vis(s) is also called the set of states visible from s.

The problem we consider in this paper combines the formal synthesis problem with long-term optimization of the expected amount of penalties incurred during the system's execution. We assume that the specification is given as an LTL formula φ of the form

φ = ϕ ∧ GF π_sur,   (3)

where ϕ is an LTL formula over AP and π_sur ∈ AP. This formula requires that the system satisfies ϕ and surveys the states satisfying the property π_sur infinitely often. We say that every visit of a state from the set S_sur = { s ∈ S | π_sur ∈ L(s) } completes a surveillance cycle. Specifically, starting from the initial state, the first visit of S_sur after the initial state completes the first surveillance cycle of a run. Note that a surveillance cycle is not a cycle in the sense of the definition of a cycle of a TS in Sec. II. For a finite run σ such that last(σ) ∈ S_sur, ♯(σ) denotes the number of complete surveillance cycles in σ; otherwise ♯(σ) is the number of complete surveillance cycles plus one.
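The dynamics in Eq. (1) is straightforward to simulate, which is what the online algorithm in Sec. IV-C exploits for locally visible penalties. The following Java sketch is a toy illustration of the penalty evolution of a single state rather than the authors' code; the class name and fields are hypothetical.

```java
import java.util.Random;

// Minimal sketch of the penalty dynamics of Eq. (1) for a single state s
// with rate r and penalty probability p(s).
class PenaltySimulator {
    final int r;        // rate: the penalty increases by 1/r per time unit
    final double p;     // p(s): probability that a penalty of 1 stays 1
    final Random rnd = new Random();
    double g;           // current penalty value in {0, 1/r, ..., 1}

    PenaltySimulator(int r, double p) {
        this.r = r;
        this.p = p;
        // g(s, 0) is uniformly distributed over {i/r | i = 0, ..., r}.
        this.g = (double) rnd.nextInt(r + 1) / r;
    }

    // One application of Eq. (1): advance the penalty by one time unit.
    void step() {
        if (g < 1.0) {
            g += 1.0 / r;                            // deterministic climb by 1/r
        } else {
            g = (rnd.nextDouble() < p) ? 1.0 : 0.0;  // stay at 1 w.p. p, else drop to 0
        }
    }

    // Expected penalty value used by the algorithms, as defined in Eq. (2).
    double gExp() {
        return 0.5 * (1.0 + p);
    }
}
```

The algorithms below never need sampled trajectories of this kind; the offline strategy works with the expected values g_exp(s) only, and the online strategy propagates expected values over the planning horizon (Table I).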
We define a function V_{T,C} : S → R^+_0 such that V_{T,C}(s) is the expected average cumulative penalty per surveillance cycle (APPC) incurred under a strategy C for T starting from a state s ∈ S:

V_{T,C}(s) = lim sup_{n→∞} E( Σ_{i=0}^{n} g(ρ_C(i), w(ρ_C^{(i)})) / ♯(ρ_C^{(n)}) ),   (4)

where ρ_C ∈ Run^T(s) is the run induced by C starting from s and E(·) denotes the expected value. In this paper, we consider the following problem:

Problem 1: Let T = (S, T, AP, L, w) be an initialized TS, with penalties defined by a rate r ∈ N and penalty probabilities p : S → (0, 1]. Let v ∈ N be a visibility range and φ an LTL formula over the set AP of the form in Eq. (3). Find a strategy C for T such that C satisfies φ and
among all strategies satisfying   φ ,   C   minimizes the APPC value   V T   ,C   ( s init )   defined in Eq. (4). In the next section, we propose two algorithms solving the above problem. The first algorithm operates offline, without the deployment of the system, and therefore, without taking advantage of the local sensing of penalties. On the other hand, the second algorithm computes the strategy in real-time by locally improving the offline strategy according to the penalties observed from the current state and their simulation over the next   h   time units, where   h   ≥   1   is a natural number, a user-defined   planning horizon . IV. S OLUTION  The   two   algorithms   work   with   the   product   P   = ( S P   , T P   , s P init , AP, L P   , F P   , w P   )   of the initialized TS   T  and a B ̈ uchi automaton   B φ   for the LTL formula   φ . To project the penalties from   T   to   P , we define the penalty in a state  ( s, q )   ∈   S P   at time   t   as   g (( s, q ) , t ) =   g ( s, t ) . We also adopt the visibility range   v   and the set   Vis(( s, q ))   of all states visible from   ( s, q )   is defined as for a state of   T   . The APPC function   V P ,C P   of a strategy   C P   for   P   is then defined according to Eq. (4). We use the correspondence between the strategies for   P   and   T   to find a strategy for   T   that solves Problem 1. Let   C P   be a strategy for   P   such that the run induced by   C P   visits the set   F P   infinitely many times and at the same time, the APPC value   V P ,C P   ( s P init )   is minimal among all strategies that visit   F P   infinitely many times. It is easy to see that   C P   projects to a strategy   C   for   T   that solves Problem 1 and   V T   ,C   ( s init ) =   V P ,C P   ( s P init ) . The offline algorithm leverages ideas from formal meth- ods. Using the automata-based approach to model checking, one can construct a strategy   C φ  P   for   P   that visits at least one of the accepting states infinitely many times. On the other hand, using graph theory, we can design a strategy   C V  P   that achieves the minimum APPC value among all strategies of  P   that do not cause an immediate, unrepairable violation of  φ ,   i.e.,   φ   is satisfiable from every state of the run induced by  C V  P   . However, we would like to have a strategy   C P   satisfying both properties at the same time. To achieve that, we employ a technique from game theory presented in [20]. Intuitively, we combine two strategies   C φ  P   and   C V  P   to create a new strategy   C P   . The strategy   C P   is played in rounds, where each round consists of two phases. In the first phase, we play the strategy   C φ  P   until an accepting state is reached. We say that the system is to achieve the mission subgoal. The second phase applies the strategy   C V  P   . The aim is to maintain the expected average cumulative penalty per surveillance cycle in the current round, and we refer to it as the average subgoal. The number of steps for which we apply   C V  P   is computed individually every time we enter the second phase of a round. The online algorithm constructs a strategy   C P   by locally improving the strategy   C P   computed by the offline algo- rithm. Intuitively, we compare applying   C P   for several steps to reach a specific state or set of states of   P , to executing different local paths to reach the same state or set. 
We consider a finite set of finite runs leading to the state, or set, containing the finite run induced by C_P, choose the one that is expected to minimize the average cumulative penalty per surveillance cycle incurred in the current round, and apply the first transition of the chosen run. The process continues until the state, or set, is reached, and then it starts over again.

A. Probability measure

Let C_P be a strategy for P and (s, q) ∈ S_P a state of P. For a finite run σ_{C_P} ∈ Run^P_fin((s, q)) induced by the strategy C_P starting from the state (s, q) and a sequence τ ∈ ({ i/r | 0 ≤ i ≤ r })^+ of length |σ_{C_P}|, we call (σ_{C_P}, τ) a finite pair. Analogously, an infinite pair (ρ_{C_P}, κ) consists of the run ρ_{C_P} ∈ Run^P((s, q)) induced by the strategy C_P and an infinite sequence κ ∈ ({ i/r | 0 ≤ i ≤ r })^ω. A cylinder set Cyl((σ_{C_P}, τ)) of a finite pair (σ_{C_P}, τ) is the set of all infinite pairs (ρ_{C_P}, κ) such that τ is a prefix of κ.

Consider the σ-algebra generated by the set of cylinder sets of all finite pairs (σ_{C_P}, τ), where σ_{C_P} ∈ Run^P_fin((s, q)) is a finite run induced by the strategy C_P starting from the state (s, q) and τ ∈ ({ i/r | 0 ≤ i ≤ r })^+ is of length |σ_{C_P}|. From classical concepts in probability theory [21], there exists a unique probability measure Pr^{P,C_P}_{(s,q)} on the σ-algebra such that for a finite pair (σ_{C_P}, τ),

Pr^{P,C_P}_{(s,q)}(Cyl((σ_{C_P}, τ)))

is the probability that the penalties incurred in the first |σ_{C_P}| + 1 stages when applying the strategy C_P in P from the state (s, q) are given by the sequence τ, i.e., g(σ_{C_P}(i), w_P(σ_{C_P}^{(i)})) = τ(i) for every 0 ≤ i ≤ |σ_{C_P}|. This probability is given by the penalty dynamics and therefore can be computed from the rate r and the penalty probability function p. For a set X of infinite pairs, an element of the above σ-algebra, the probability Pr^{P,C_P}_{(s,q)}(X) is the probability that under C_P starting from (s, q) the infinite sequence of penalties received in the visited states is κ, where (ρ_{C_P}, κ) ∈ X.

B. Offline control

In this section, we construct a strategy C_P for P that projects to a strategy C for T solving Problem 1. The strategy C_P has to visit F_P infinitely many times and therefore C_P must lead from s_P^init to an ASCC. For a U ∈ ASCC(P), we denote by V*_U((s, q)) the minimum expected average cumulative penalty per surveillance cycle that can be achieved in U starting from (s, q) ∈ S_U. Since U is strongly connected, this value is the same for all the states in S_U and is referred to as V*_U. It is associated with a cycle cyc^V_U = c_0 ... c_m of U witnessing the value, i.e.,

(1 / |cyc^V_U ∩ S_U^sur|) Σ_{i=0}^{m} g_exp(c_i) = V*_U,

where S_U^sur is the set of all states of U labeled with π_sur. Since U is an ASCC, it holds that S_U^sur ≠ ∅.

We design an algorithm that finds the value V*_U and a cycle cyc^V_U for an ASCC U. The algorithm first reduces U to a TS Ū and then applies Karp's algorithm [22] on Ū.
Karp's algorithm finds, for a directed graph with values on edges, a cycle with minimum value per edge, also called the minimum mean cycle. The value V*_U and cycle cyc^V_U are synthesized from the minimum mean cycle.

Fig. 1: An example of elimination of a state during the reduction of an ASCC U. The finite run run_8 is equal to whichever of the finite runs run_1 and run_2.run_4 minimizes the sum of expected penalties in the states of the run. Similarly, run_9 is one of the finite runs run_7 and run_5.run_6.

Let U = (S_U, T_U, AP, L_P, w_P) be an ASCC of P. For simplicity, we use singletons such as u, u_i to denote the states of P in this paragraph. We construct a TS

Ū = (S_U^sur, T_Ū, AP, L_P, w_Ū),

and a function run : T_Ū → Run^U_fin for which it holds that (u, u') ∈ T_Ū if and only if there exists a finite run in U from u ∈ S_U^sur to u' ∈ S_U^sur with one surveillance cycle, i.e., between u and u' no state labeled with π_sur is visited. Moreover, the run run((u, u')) = u_0 ... u_n is such that u = u_0 and σ = u_0 ... u_n u' is the finite run in U from u to u' with one surveillance cycle that minimizes the expected sum of penalties received during σ among all finite runs in U from u to u' with one surveillance cycle.

The TS Ū can be constructed from U by an iterative algorithm eliminating the states from S_U \ S_U^sur one by one, in arbitrary order. At the beginning, let Ū = U, T_Ū = T_U, and for every transition (u, u') ∈ T_Ū let run((u, u')) = u. The procedure for eliminating u ∈ S_U \ S_U^sur proceeds as follows. Consider every u_1 ≠ u, u_2 ≠ u such that (u_1, u), (u, u_2) ∈ T_Ū. If the transition (u_1, u_2) is not in T_Ū, add (u_1, u_2) to T_Ū and define run((u_1, u_2)) = run((u_1, u)).run((u, u_2)), where . is the concatenation of sequences. If T_Ū already contains the transition (u_1, u_2) and run((u_1, u_2)) = σ, we set run((u_1, u_2)) = run((u_1, u)).run((u, u_2)) if

Σ g_exp( run((u_1, u)).run((u, u_2)) ) ≤ Σ g_exp( σ ),

where Σ g_exp(x) for a run x is the sum of g_exp(x(i)) over every state x(i) of x; otherwise we let run((u_1, u_2)) = σ. The weight is w_Ū((u_1, u_2)) = Σ g_exp(run((u_1, u_2))). Once all pairs u_1, u_2 are handled, remove u from S_Ū and all adjacent transitions from T_Ū. Fig. 1 demonstrates one iteration of the algorithm.

Consequently, we apply Karp's algorithm on the oriented graph with vertices S_U^sur, edges T_Ū and values on edges w_Ū. Let cyc_Ū = u_0 ... u_m be the minimum mean cycle of this graph. Then it holds that

V*_U = (1 / (m + 1)) Σ_{i=0}^{m} Σ g_exp( run((u_i, u_{(i+1) mod (m+1)})) ),

cyc^V_U = run((u_0, u_1)). ... .run((u_{m−1}, u_m)).run((u_m, u_0)).

When the APPC value and the corresponding cycle are computed for every ASCC of P, we choose the ASCC that minimizes the APPC value. We denote this ASCC U = (S_U, T_U, AP, L_P, w_P) and cyc^V_U = c_0 ... c_m. The mission subgoal aims to reach an accepting state from the set F_U.
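Before turning to the two strategies, the following Java sketch illustrates the core of the computation just described: Karp's minimum mean cycle value on the reduced graph with vertices S_U^sur and edge values w_Ū. The data layout (edge arrays) and the method name are illustrative assumptions, not the authors' implementation.

```java
import java.util.Arrays;

// Minimal sketch of Karp's minimum mean cycle value for the reduced graph.
// Vertices 0..n-1 are the surveillance states of the ASCC; edges[i] = {from, to}
// and value[i] is the edge value w_Ubar (a sum of expected penalties). The graph
// is assumed strongly connected, as is the case for the reduction of an ASCC.
class MinMeanCycle {
    static double minMeanCycleValue(int n, int[][] edges, double[] value) {
        final double INF = Double.POSITIVE_INFINITY;
        // d[k][v] = minimum value of a walk with exactly k edges from vertex 0 to v.
        double[][] d = new double[n + 1][n];
        for (double[] row : d) Arrays.fill(row, INF);
        d[0][0] = 0.0;
        for (int k = 1; k <= n; k++) {
            for (int i = 0; i < edges.length; i++) {
                int u = edges[i][0], v = edges[i][1];
                if (d[k - 1][u] != INF && d[k - 1][u] + value[i] < d[k][v]) {
                    d[k][v] = d[k - 1][u] + value[i];
                }
            }
        }
        // Karp's characterization: min over v of max over k of (d[n][v] - d[k][v]) / (n - k).
        double best = INF;
        for (int v = 0; v < n; v++) {
            if (d[n][v] == INF) continue;           // no walk of exactly n edges ends in v
            double worst = Double.NEGATIVE_INFINITY;
            for (int k = 0; k < n; k++) {
                if (d[k][v] == INF) continue;
                worst = Math.max(worst, (d[n][v] - d[k][v]) / (n - k));
            }
            best = Math.min(best, worst);
        }
        return best;                                 // equals V*_U for the given ASCC
    }
}
```

The witnessing cycle cyc_Ū, and from it cyc^V_U, can be recovered by additionally storing predecessor links for the entries d[k][v].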
The corresponding strategy C^φ_P is such that from every state (s, q) ∈ S_P \ F_U that can reach the set F_U, we follow one of the finite runs with minimum weight from (s, q) to F_U. That means C^φ_P is a memoryless strategy such that for (s, q) ∈ S_P \ F_U with w*_P((s, q), F_U) < ∞ it holds that C^φ_P((s, q)) = (s', q'), where

w_P((s, q), (s', q')) = w*_P((s, q), F_U) − w*_P((s', q'), F_U).

The strategy C^V_P for the average subgoal is given by the cycle cyc^V_U = c_0 ... c_m of the ASCC U. Similarly to the mission subgoal, for a state (s, q) ∈ S_P \ cyc^V_U with w*_P((s, q), cyc^V_U) < ∞, the strategy C^V_P follows one of the finite runs with minimum weight to cyc^V_U. For a state c_i ∈ cyc^V_U, it holds that C^V_P(c_i) = c_{(i+1) mod (m+1)}. If all the states of the cycle cyc^V_U are distinct, the strategy C^V_P is memoryless, otherwise it is finite-memory.

Proposition 1: For the strategy C^V_P and every state (s, q) ∈ S_U, it holds that

lim_{n→∞} Pr^{U,C^V_P}_{(s,q)} ( Σ_{i=0}^{n} g(ρ_{C^V_P}(i), w_P(ρ_{C^V_P}^{(i)})) / ♯(ρ_{C^V_P}^{(n)}) ≤ V*_U ) = 1.

Equivalently, for every state (s, q) ∈ S_U and every ε > 0, there exists j(ε) ∈ N such that if the strategy C^V_P is played from the state (s, q) until at least l ≥ j(ε) surveillance cycles are completed, then the average cumulative penalty per surveillance cycle incurred in the performed finite run is at most V*_U + ε with probability at least 1 − ε.

Proof: (Sketch.) The proof is based on the fact that the product P with dynamic penalties can be translated into a Markov decision process (MDP) (see e.g., [23]) with static penalties. The run ρ_{C^V_P} corresponds to a Markov chain (see e.g., [24]) of the MDP. Moreover, the cycle cyc^V_U corresponds to the minimum mean cycle of the reduced TS Ū. Hence, the equation in the proposition is equivalent to a property of MDPs with static penalties proved in [20] regarding the minimum expected penalty incurred per stage.

Remark 1: Assume there exists a state (s, q) ∈ S_P with p((s, q)) = 0, i.e., if the penalty in (s, q) is 1, it always drops to 0. The dynamics of the penalty in (s, q) is then not probabilistic and if we visit (s, q) infinitely many times, the expected average penalty incurred in (s, q) might differ from g_exp((s, q)). That can cause violation of Prop. 1.

Now we describe the strategy C_P. It is played in rounds, where each round consists of two phases, one for each subgoal. The first round starts at the beginning of the execution of the system in the initial state s_P^init of P. Let i be the current round. In the first phase of the round, the strategy C^φ_P is applied until an accepting state of the ASCC U is reached. We use k_i to denote the number of steps we played the strategy C^φ_P in round i. Once the mission subgoal is fulfilled, the average subgoal becomes the current subgoal. In this phase, we play the strategy C^V_P until the number of completed surveillance cycles in the second phase of the current round is l_i ≥ max{ j(1/i), i · k_i }.
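The round structure just described can be summarized by the following schematic Java sketch. Strategy, ProductState and the helper predicates are hypothetical placeholders standing in for the constructions defined above, not the authors' implementation.

```java
// Schematic sketch of the offline strategy C_P played in rounds.
interface Strategy { ProductState next(ProductState current); }
class ProductState { /* a state (s, q) of the product P */ }

class OfflineController {
    final Strategy cPhi;   // mission-subgoal strategy C^phi_P (reach F_U)
    final Strategy cV;     // average-subgoal strategy C^V_P (follow cyc^V_U)

    OfflineController(Strategy cPhi, Strategy cV) { this.cPhi = cPhi; this.cV = cV; }

    void run(ProductState init) {
        ProductState current = init;
        for (int i = 1; ; i++) {                       // round i
            // Phase 1: play C^phi_P until an accepting state of the ASCC U is reached.
            int k = 0;
            while (!isAccepting(current)) { current = cPhi.next(current); k++; }
            // Phase 2: play C^V_P until at least max{ j(1/i), i*k } surveillance
            // cycles have been completed in this phase.
            long needed = Math.max(jBound(1.0 / i), (long) i * k);
            long cycles = 0;
            while (cycles < needed) {
                current = cV.next(current);
                if (completesSurveillanceCycle(current)) cycles++;
            }
        }
    }

    // Placeholders for the constructions described in the text.
    boolean isAccepting(ProductState s) { throw new UnsupportedOperationException(); }
    boolean completesSurveillanceCycle(ProductState s) { throw new UnsupportedOperationException(); }
    long jBound(double eps) { throw new UnsupportedOperationException(); }
}
```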
Theorem 1:   The strategy   C P   projects to a strategy   C   of  T   that solves Problem 1.
Proof: From the fact that the ASCC U is reachable from the initial state s_P^init and from the construction of C^φ_P, it follows that U is reached from s_P^init in finite time. In every round i of the strategy C_P, an accepting state is visited. Moreover, from Prop. 1 and the fact that l_i ≥ max{ j(1/i), i · k_i }, it can be shown that the average cumulative penalty per surveillance cycle incurred in the i-th round is at most V*_U + 2/i with probability at least 1 − 1/i. Therefore, in the limit, the run induced by C_P satisfies the LTL specification and reaches the optimal average cumulative penalty per surveillance cycle V*_U with probability 1.

Note that, in general, the strategy C_P is not finite-memory. The reason is that in the modes of a finite-memory strategy we would need to store the number of steps k_i spent so far in the first phase and the number l_i of surveillance cycles in the second phase of a given round. Since j(1/i) is generally increasing with i, we would need infinitely many modes to be able to count the number of surveillance cycles in the second phase. However, if there exists a cycle cyc^V_U of the ASCC U corresponding to V*_U that contains an accepting state, then the finite-memory strategy C^V_P for the average subgoal maps to a strategy of T solving Problem 1, which is therefore in the worst case finite-memory as well.

Complexity: The size of a BA for an LTL formula φ is in the worst case 2^O(|φ|), where |φ| is the size of φ [17]. However, the actual size of the BA is in practice often quite small. The size of the product P is O(|S| · 2^O(|φ|)). To compute the minimum weights w*_P((s, q), (s', q')) between every two states of P we use the Floyd-Warshall algorithm, taking O(|S_P|^3) time. Tarjan's algorithm [15] is used to compute the set SCC(P) in time O(|S_P| + |T_P|). The reduction of an ASCC U can be computed in time O(|S_U| · |T_U|^2). Karp's algorithm [22] finds the optimal APPC value and corresponding cycle in time O(|S_U| · |T_U|). The main pitfall of the algorithm is the computation of the number j(1/i) of surveillance cycles needed in the second phase of the current round i according to Prop. 1. Intuitively, we need to consider the finite run σ_{C^V_P,k} induced by the strategy C^V_P from the current state that contains k = 1 surveillance cycles, and compute the sum of the probabilities Pr^{P,C_P}_{(s,q)}(Cyl((σ_{C^V_P,k}, τ))) over every τ with average cumulative penalty per surveillance cycle less than or equal to V*_U + 1/i. If the total probability is at least 1 − 1/i, we set j(1/i) = k, otherwise we increase k and repeat the process. For every k, there exist r^{|σ_{C^V_P,k}|} sequences τ. To partially overcome this issue, we compute the number j(1/i) only at the point in time when the number of surveillance cycles in the second phase of the current round i is i · k_i and the average cumulative penalty in this round is still above V*_U + 2/i. As the simulation results in Sec. VI show, this happens only rarely, if ever.
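The enumeration over all penalty sequences τ described above is exponential. A pragmatic alternative, sketched below purely as an illustration and not as the authors' method, is to estimate the probability in Prop. 1 by Monte Carlo simulation of the penalty dynamics along the run induced by C^V_P; the helper names (sampleRoundAverage, estimateJ) are hypothetical.

```java
// Illustrative Monte Carlo estimate of j(eps) from Prop. 1: the number of
// surveillance cycles after which the average cumulative penalty per surveillance
// cycle under C^V_P is at most V*_U + eps with probability at least 1 - eps.
class JEstimator {
    final double vStar;   // optimal APPC value V*_U of the chosen ASCC

    JEstimator(double vStar) { this.vStar = vStar; }

    long estimateJ(double eps, int samples) {
        for (long k = 1; ; k++) {
            int good = 0;
            for (int n = 0; n < samples; n++) {
                if (sampleRoundAverage(k) <= vStar + eps) good++;
            }
            // Accept the first k for which the estimated probability reaches 1 - eps.
            if ((double) good / samples >= 1.0 - eps) return k;
        }
    }

    // Placeholder: simulate the penalty dynamics of Eq. (1) along the run induced
    // by C^V_P for k surveillance cycles and return the resulting average
    // cumulative penalty per surveillance cycle.
    double sampleRoundAverage(long k) { throw new UnsupportedOperationException(); }
}
```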
C. Online control

The online algorithm locally improves the strategy C_P according to the values of penalties observed from the current state and their simulation in the next h time units. The resulting strategy, denoted C̄_P, is again played in rounds. However, in each step of the strategy C̄_P, we consider a finite set of finite runs starting from the current state, choose one according to an optimization function, and apply its first transition.

Throughout the rest of the section we use the following notation. We use singletons such as u, u_i to denote the states of P. Let σ_all ∈ Run^P_fin(s_P^init) denote the finite run executed by P so far. Let i be the current round of the strategy C̄_P and σ_i = u_{i,0} ... u_{i,k} the finite run executed so far in this round, i.e., u_{i,k} is the current state of P. We use t_{i,0}, ..., t_{i,k} to denote the points in time when the states u_{i,0}, ..., u_{i,k} were visited, respectively.

The optimization function f : Run^P_fin(u_{i,k}) → [0, 1] assigns to every finite run σ = u_0 ... u_n starting from the current state a value f(σ) that is the expected average cumulative penalty per surveillance cycle that would be incurred in the round i if the run σ was to be executed next, i.e.,

f(σ) = ( Σ_{j=0}^{k} g(u_{i,j}, t_{i,j}) + Σ_{j=1}^{n} g_sim(u_j, t_{i,k} + w_P(σ^{(j)})) ) / ♯(σ_i . σ(1) ... last(σ)),   (5)

where g_sim(u_j, t_{i,k} + w_P(σ^{(j)})) is the simulated expected penalty incurred in the state u_j at the time of its visit. If the visit occurs within the next h time units and the state u_j is visible from the current state u_{i,k}, we simulate the penalty currently observed in u_j over w_P(σ^{(j)}) time units. Otherwise, we set the expected penalty to be g_exp(u_j). The exact definition of g_sim can be found in Tab. I.

For a set of states X ⊆ S_P, we define a shortening indicator function I_X : T_P → {0, 1} such that for ((s_1, q_1), (s_2, q_2)) ∈ T_P,

I_X(((s_1, q_1), (s_2, q_2))) = 1 if w*_P((s_1, q_1), X) > w*_P((s_2, q_2), X), and 0 otherwise.   (6)

Intuitively, the indicator has value 1 if the transition leads strictly closer to the set X, and 0 otherwise.

In the first phase of every round, we locally improve the strategy C^φ_P computed in Sec. IV-B, which aims to visit an accepting state of the chosen ASCC U. In each step of the resulting strategy C̄^φ_P, we consider the set Run_φ(u_{i,k}) of all finite runs from the current state u_{i,k} that lead to an accepting state from the set F_U with all transitions shortening in the indicator I_{F_U} defined according to Eq. (6), i.e.,

Run_φ(u_{i,k}) = { σ ∈ Run^P_fin(u_{i,k}) | last(σ) ∈ F_U, ∀ 0 ≤ j ≤ |σ| − 1 : I_{F_U}((σ(j), σ(j + 1))) = 1 }.

Let σ ∈ Run_φ(u_{i,k}) be the run that minimizes the optimization function f from Eq. (5). Then C̄^φ_P(σ_all) = σ(1). Just like in the offline algorithm, the strategy C̄^φ_P is applied until a state from the set F_U is visited.

In the second phase, we locally improve the strategy C^V_P for the average subgoal computed in Sec. IV-B to obtain a strategy C̄^V_P; a sketch of one step of this local improvement appears below.
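The following Java sketch shows one step of the local improvement used in both phases: enumerate candidate runs from the current state whose transitions are all shortening (Eq. (6)) toward a target set (F_U in the first phase, the cycle cyc^V_U at the start of the second), score each with f (Eq. (5)), and apply the first transition of the best one. The interfaces and helper names (Product, candidate enumeration, evalF) are hypothetical, and a weight bound on candidate runs is assumed, as suggested in the complexity discussion later.

```java
import java.util.ArrayList;
import java.util.List;

interface PState { }  // marker type for a product state (s, q)

// Hypothetical view of the product P needed for one online step.
interface Product {
    List<PState> successors(PState s);
    int weight(PState from, PState to);
    int distTo(PState s, List<PState> target);   // w*_P(s, target)
    double evalF(List<PState> run);              // optimization function f, Eq. (5)
}

class OnlineStep {
    final Product product;
    OnlineStep(Product product) { this.product = product; }

    // One step: returns the next state to move to, i.e., sigma(1) of the best candidate.
    // Assumes maxWeight is large enough that at least one candidate run exists.
    PState step(PState current, List<PState> targetSet, int maxWeight) {
        List<List<PState>> candidates = new ArrayList<>();
        collect(new ArrayList<>(List.of(current)), targetSet, maxWeight, candidates);
        List<PState> best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (List<PState> run : candidates) {
            double score = product.evalF(run);
            if (score < bestScore) { bestScore = score; best = run; }
        }
        return best.get(1);
    }

    // Depth-first enumeration of runs whose transitions all decrease w*_P(., targetSet).
    private void collect(List<PState> prefix, List<PState> targetSet,
                         int budget, List<List<PState>> out) {
        PState last = prefix.get(prefix.size() - 1);
        if (targetSet.contains(last)) { out.add(new ArrayList<>(prefix)); return; }
        for (PState next : product.successors(last)) {
            int w = product.weight(last, next);
            boolean shortening = product.distTo(next, targetSet) < product.distTo(last, targetSet);
            if (shortening && w <= budget) {
                prefix.add(next);
                collect(prefix, targetSet, budget - w, out);
                prefix.remove(prefix.size() - 1);
            }
        }
    }
}
```

Because every considered transition strictly decreases w*_P toward the target set, the enumeration terminates and every candidate run ends in the target set, which is what the correctness argument below relies on.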
However, the definition of the set of finite runs we choose from changes during the phase. At the beginning of the second phase of the current round i, we aim to reach the cycle cyc^V_U = c_0 ... c_m of the ASCC U and we use the same idea that is used in the first phase above. To be specific, we define C̄^V_P(σ_all) = σ(1), where σ is the finite
run minimizing f from the set

Run_V(u_{i,k}) = { σ ∈ Run^P_fin(u_{i,k}) | last(σ) ∈ cyc^V_U, ∀ 0 ≤ j ≤ |σ| − 1 : I_{cyc^V_U}((σ(j), σ(j + 1))) = 1 }.

Once a state c_a ∈ cyc^V_U of the cycle is reached, we continue as follows. Let c_b ∈ cyc^V_U be the first state labeled with π_sur that is visited from c_a if we follow the cycle. Until we reach the state c_b, the optimal finite run σ is chosen from the set

Run_V(u_{i,k}) = { σ ∈ Run^P_fin(u_{i,k}) | last(σ) = c_b, and ∀ 0 ≤ j ≤ |σ| − 1 : I_{c_b}((σ(j), σ(j + 1))) = 1 or |σ_{c_a → u_{i,k}}| + |σ| ≤ b − a + 2 mod (m + 1) },

where σ_{c_a → u_{i,k}} is the finite run already executed in P from the state c_a to the current state u_{i,k}. Intuitively, the set contains every finite run from the current state to the state c_b that either has all transitions shortening in I_{c_b} or whose length is such that, if we were to perform the finite run, the run performed from c_a to c_b would not be longer than following the cycle from c_a to c_b. When the state c_b is reached, we restart the above procedure with c_a = c_b. The strategy C̄^V_P is performed until l_i ≥ max{ j(1/i), i · k_i } surveillance cycles are completed in the second phase of the current round i, where k_i is the number of steps of the first phase and j is from Prop. 1. We can end the second phase sooner, specifically at any time when we complete a surveillance cycle and the average cumulative penalty per surveillance cycle incurred in the current round is less than or equal to V*_U + 2/i.

g_sim(u_j, t_{i,k} + w_P(σ^{(j)})) =
  g(u_j, t_{i,k}) + w_P(σ^{(j)})/r,   if u_j ∈ Vis(u_{i,k}), w_P(σ^{(j)}) ≤ h and g(u_j, t_{i,k}) + w_P(σ^{(j)})/r ≤ 1,
  Σ_{x=0}^{r} pst(x/r) + pst(1),      if u_j ∈ Vis(u_{i,k}), w_P(σ^{(j)}) ≤ h and g(u_j, t_{i,k}) + w_P(σ^{(j)})/r > 1,
  g_exp(u_j),                          otherwise.

pst(x/r) = ( Σ_{y=0}^{z_1} ( (z_1 − y + z_2 + y(r+1))! / ((z_1 − y)! · (z_2 + y(r+1))!) ) · (1 − p(u_j))^{z_1 − y} · p(u_j)^{z_2 + y(r+1)} ) · (1 − p(u_j)) · x/r,
  if z = w_P(σ^{(j)}) − (1 − g(u_j, t_{i,k})) · r − x − 1 ≥ 0, where z_1 = z div (r+1), z_2 = z mod (r+1); otherwise, if z < 0, pst(x/r) = 0.

pst(1) = Σ_{y=0}^{z_3} ( (z_3 − y + z_4 + y(r+1))! / ((z_3 − y)! · (z_4 + y(r+1))!) ) · (1 − p(u_j))^{z_3 − y} · p(u_j)^{z_4 + y(r+1)} · p(u_j),
  where z = w_P(σ^{(j)}) − (1 − g(u_j, t_{i,k})) · r − 1, z_3 = z div (r+1), z_4 = z mod (r+1).

TABLE I: The function computing the simulated expected penalty incurred in a state u_j of the run σ at the time of its visit t_{i,k} + w_P(σ^{(j)}), if we are to apply the run σ from the current state u_{i,k}; div stands for integer division and mod for modulus.

Theorem 2: The strategy C̄_P projects to a strategy C of T solving Problem 1.

Proof: First, we prove that Prop. 1 holds for the strategy C̄^V_P as well. This result follows directly from the facts below.
The set of finite runs we choose from always contains a finite run induced by the strategy C^V_P. Once the cycle cyc^V_U is reached, the system optimizes the finite run from one surveillance state of the cycle to the next, until it is reached after finite time. Finally, if the strategy C̄^V_P does not follow C^V_P, it is only because the chosen finite run provides a lower expected average. The correctness of the strategy C̄_P is now proved analogously to the correctness of the strategy computed offline.

Proposition 2: The strategy C̄_P is with probability 1 expected to perform in the worst case as well as the strategy C_P computed offline. That means, if the average cumulative penalty per surveillance cycle incurred in the so far performed run of the system is lower than the optimal APPC value V*_U, it will rise more slowly under the strategy C̄_P than under the strategy C_P. On the other hand, if the average cumulative penalty per surveillance cycle incurred in the so far performed run of the system is higher than the optimal APPC value V*_U, it is expected to decrease faster under the strategy C̄_P than under the strategy C_P.

Proof: Follows from the proof of Theorem 2.

Complexity: The cardinality of the set of finite runs Run_φ(u_{i,k}) grows exponentially with the minimum weight w*_P(u_{i,k}, F_U). Analogously, the same holds for the set of finite runs Run_V(u_{i,k}) and the cycle cyc^V_U or one of its surveillance states. To simplify the computations and effectively use the algorithm in real time, one can use the following rule, which was also applied in our implementation in Sec. VI. We put a threshold on the maximum weight of a finite run in Run_φ(u_{i,k}) and Run_V(u_{i,k}). In the second phase of a round, when on the optimal cycle, we optimize the finite run from the state c_a to the next surveillance state c_b on the cycle. However, if the weight of the fragment of the cycle from c_a to c_b is too high, we can first optimize the run to some intermediate state c'_b. Also, the complexity of one step of the strategy C̄_P grows exponentially with the user-defined planning horizon h. Hence, h should be chosen wisely. One should also keep in mind that the higher the planning horizon, the better the local improvement.

V. DISCUSSION

Every LTL formula ϕ over AP can be converted to a formula φ of the form in Eq. (3) for which it holds that a run of the TS T satisfies ϕ if and only if it satisfies φ. The formula is φ = ϕ ∧ GF π_sur, where π_sur ∈ L(s) for every s ∈ S. In that case, Problem 1 requires to minimize the expected average penalty incurred per stage.

The algorithms presented in Sec. IV can be used to correctly solve Problem 1 also for systems with penalty dynamics different from the one defined in Sec. III. However, for every state we need to be able to compute the expected value of the penalty in the state, as in Eq. (2). For the online algorithm we also require that the dynamics of penalties allows us to simulate them for a finite number of time units. More precisely, if we observe the penalty in a state s ∈ S at time t, we can compute the simulated expected value of the penalty in s in every following time unit, up to h time units, based only on the observed value.
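For the concrete penalty model of Sec. III, such a simulation can be done by propagating a distribution over the r + 1 possible penalty values. The following Java sketch is a numerical alternative to the closed-form expressions of Table I, not the authors' implementation; the class and method names are illustrative.

```java
// Expected penalty in a state after `delta` time units, given the currently
// observed value `observed` (a multiple of 1/r), under the dynamics of Eq. (1):
// below 1 the penalty rises deterministically by 1/r; at 1 it stays with
// probability p and drops to 0 with probability 1 - p.
class ExpectedPenalty {
    static double simulate(double observed, int delta, int r, double p) {
        // dist[i] = probability that the penalty currently equals i/r.
        double[] dist = new double[r + 1];
        dist[(int) Math.round(observed * r)] = 1.0;
        for (int t = 0; t < delta; t++) {
            double[] next = new double[r + 1];
            for (int i = 0; i < r; i++) next[i + 1] += dist[i];   // climb by 1/r
            next[r] += dist[r] * p;                               // stay at 1
            next[0] += dist[r] * (1.0 - p);                       // drop to 0
            dist = next;
        }
        double expected = 0.0;
        for (int i = 0; i <= r; i++) expected += dist[i] * ((double) i / r);
        return expected;
    }
}
```

For example, with r = 5, p(s) = 0.5 and an observed value of 4/5, the expected penalty two time units later is 0.5 · 1 + 0.5 · 0 = 0.5.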
Fig. 2: (a) A TS modeling the robot (black dot) motion in a partitioned environment. Two stock locations are in green, a base is shown in blue, and unsafe locations are in red. There is a transition between vertically, horizontally or diagonally neighboring states. The weight of a horizontal and vertical transition is 2, for a diagonal transition it is 3. (b) The penalty probabilities in states. Darker shade indicates higher probability.

The online algorithm from Sec. IV-C is a heuristic. The sets of finite runs Run_φ(u_{i,k}), Run_V(u_{i,k}) can be defined differently according to the properties of the actual problem. To guarantee the correctness of the strategy C̄_P, the sets must satisfy the following conditions. There always exists a finite run in the set minimizing the optimization function f in Eq. (5). The definition of the set Run_φ(u_{i,k}) guarantees that an accepting state from F_U is visited after a finite number of steps. The definition of Run_V(u_{i,k}) also guarantees a visit of the cycle cyc^V_U in finite time and, moreover, Prop. 1 holds for the resulting strategy C̄^V_P.

VI. CASE STUDY

We implemented the framework developed in this paper for a persistent surveillance robotics example in Java [25]. In this section, we report on the simulation results. We consider a mobile robot in a grid-like partitioned environment modeled as a TS depicted in Fig. 2a. The robot transports packages between two stocks, marked green in Fig. 2a. The blue state marks the robot's base location. The penalties in states are defined by rate r = 5 and the penalty probability function in Fig. 2b. The visibility range v is 6. For example, in Fig. 2a the set Vis(s) of states visible from the current state s, with corresponding penalties, is depicted as the blue-shaded area. We set the planning horizon h = 9.

The mission for the robot is to transport packages between the two stocks (labeled with propositions a and b, respectively) and infinitely many times return to the base (labeled with proposition c). The red states in Fig. 2a are dangerous locations (labeled with u) which are to be avoided. At the same time, we wish to minimize the cumulative penalty incurred during the transport of a package, i.e., the surveillance property π_sur is true in both stock states. The corresponding LTL formula is

G( a ⇒ X(¬a U b) ) ∧ G( b ⇒ X(¬b U a) ) ∧ GF c ∧ G(¬u) ∧ GF π_sur,

and the Büchi automaton has 10 states. The cycle providing the minimum expected average cumulative penalty per surveillance cycle is depicted in magenta in Fig. 2a and the optimal APPC value is 5.4.

Fig. 3: (a) The average cumulative penalty per surveillance cycle incurred during the runs, shown at the end of each round. The red line marks the optimal APPC value. (b) The average cumulative penalty per surveillance cycle incurred in every round. The red bars indicate the threshold V*_U + 2/i.

We ran both the offline and the online algorithm for multiple rounds starting from the base state. In Fig. 3 we report on the results for 20 rounds; for more results see [25]. As illustrated in Fig. 3a, the average cumulative penalty per surveillance cycle incurred in the run induced by the offline strategy is above the optimal value and converges to it fairly fast.
For the run induced by the online strategy, the average is significantly below the minimum APPC value due to the local improvement based on local sensing. On the other hand, Fig. 3b shows the average cumulative penalty per surveillance cycle incurred in each round separately. The number of surveillance cycles performed in the second phase of every round   i   of the offline strategy was less than   i   ·   k i ,   i.e.,   the second phase always ended due to the fact that the average incurred in the round was below the threshold   V   ∗ U   +   2  i   . The maximum number of surveillance cycles performed in the second phase of a round was   7 . The same is true for the online strategy and the maximum number of surveillance cycles in the second phase of a round was  3 . For both algorithms, the number of surveillance cycles in the second phase of a round does not evolve monotonically, rather randomly. Hence we conclude that in every round   i  we unlikely need to compute the value   j (   1  i   ) . R EFERENCES [1]   C. Baier and J. Katoen,   Principles of model checking .   The MIT Press, 2008. [2]   J. Barnat, L. Brim, M.   ˇ Ceˇ ska, and P. Roˇ ckai, “DiVinE: Parallel Distributed Model Checker,” in   HiBi/PDMC’10 , 2010, pp. 4–7. [3]   A. Hinton, M. Kwiatkowska, G. Norman, and D. Parker, “PRISM: A   Tool   for   Automatic   Verification   of   Probabilistic   Systems,”   in  TACAS’06 , ser. LNCS, vol. 3920, 2006, pp. 441–444. [4]   A. Cimatti, E. Clarke, E. Giunchiglia, F. Giunchiglia, M. Pistore, M. Roveri, R. Sebastiani, and A. Tacchella, “NuSMV Version 2: An OpenSource Tool for Symbolic Model Checking,” in   CAV’02 , ser. LNCS, vol. 2404, 2002. [5]   S. L. Smith, J. Tumova, C. Belta, and D. Rus, “Optimal path planning for surveillance with temporal-logic constraints,”   I. J. Robotic Res. , vol. 30, no. 14, pp. 1695–1708, 2011. [6]   K. Chatterjee, L. Doyen, T. A. Henzinger, and J. F. Raskin, “Al- gorithms for omega-regular games with imperfect information,” in  CSL06 , ser. LNCS, vol. 4207, 2006, pp. 287–302. [7]   X. C. Ding, S. Smith., C. Belta, and D. Rus, “MDP Optimal Control under Temporal Logic Constraints,” in   CDC’11 , 2011, pp. 532 –538.
[8]   P. Tabuada and G. Pappas, “Linear Time Logic Control of Discrete- Time Linear Systems,”   Trans. on Automatic Control , vol. 51, no. 12, pp. 1862–1877, 2006. [9]   E. A. Gol, M. Lazar, and C. Belta, “Language-Guided Controller Synthesis for Discrete-Time Linear Systems,” in   HSCC’12 , 2012, pp. 95–104. [10]   T. Wongpiromsarn, U. Topcu, and R. R. Murray, “Receding horizon temporal logic planning for dynamical systems,” in   CDC’09 , 2009, pp. 5997–6004. [11]   M. Kloetzer and C. Belta, “A fully automated framework for control of   linear   systems   from   temporal   logic   specifications,”   Trans.   on Automatic Control , vol. 53, no. 1, pp. 287–297, 2008. [12]   B. Yordanov, J. Tumova, I. Cerna, J. Barnat, and C. Belta, “Temporal Logic Control of Discrete-Time Piecewise Affine Systems,”   Trans. on Automatic Control , vol. 57, pp. 1491–1504, 2012. [13]   X. C. Ding, M. Lazar, and C. Belta, “Receding Horizon Temporal Logic Control for Finite Deterministic Systems,” in   ACC’12 , 2012. [14]   M. Svoreˇ nov ́ a, J. T ̊ umov ́ a, J. Barnat, and I. ˇ Cern ́ a, “Attraction-Based Receding Horizon Path Planning with Temporal Logic Constraints,” in   CDC’12 , 2012, to appear. [15]   R.   Tarjan,   “Depth-first   search   and   linear   graph   algorithms,”   in  SWAT’71 , 1971, pp. 114 –121. [16]   M. Y. Vardi and P. Wolper, “Reasoning about Infinite Computations,”  Information and Computation , vol. 115, pp. 1–37, 1994. [17]   P. Gastin and D. Oddoux, “Fast LTL to B ̈ uchi Automata Translation,” in   CAV’01 , 2001, pp. 53–65. [18]   F. Somenzi and R. Bloem, “Efficient B ̈ uchi Automata from LTL Formulae,” in   CAV’02 , 2000, pp. 248–263. [19]   P.   Gastin   and   D.   Oddoux.   (2001)   LTL   2   BA   :   fast   translation from   LTL   formulae   to   B ̈ uchi   automata.   [Online].   Available: http://www.lsv.ens-cachan.fr/ ∼ gastin/ltl2ba/ [20]   K. Chatterjee and L. Doyen, “Energy and Mean-Payoff Parity Markov Decision Processes,” in   MFCS’11 , 2011, pp. 206–218. [21]   R. Ash and C. Dol ́ eans-Dade,   Probability & Measure Theory .   Aca- demic Press, 2000. [22]   R. M. Karp, “A characterization of the minimum cycle mean in a digraph,”   Discrete Mathematics , vol. 23, no. 3, pp. 309 – 311, 1978. [23]   M. L. Puterman,   Markov Decision Processes: Discrete Stochastic Dynamic Programming .   New York, NY, USA: John Wiley & Sons, Inc., 1994. [24]   J. R. Norris,   Markov chains , ser. Cambridge series in statistical and probabilistic mathematics.   Cambridge University Press, 1998. [25]   M.   Svorenova,   I.   Cerna,   and   C.   Belta.   (2012)   Simulation   of Optimal Receding Horizon Control for Finite Deterministic Systems with   Temporal   Logic   Specification.   [Online].   Available:   http: //www.fi.muni.cz/$ \ sim$x175388/simulationLTLoptimalrhc