Robust Satisfaction of Temporal Logic Specifications via Reinforcement Learning

Austin Jones1, Derya Aksaray2, Zhaodan Kong3, Mac Schwager4, and Calin Belta2,5

*This work was partially supported at Boston University by ONR grant number N00014-14-1-0554 and by the NSF under grant numbers NRI-1426907, CMMI-1400167, and CNS-1035588.
1Author is with Mechanical Engineering and Electrical Engineering, Georgia Institute of Technology, Atlanta, GA, USA. austinjones@gatech.edu
2Authors are with Mechanical Engineering, Boston University, Boston, MA, USA. {cbelta,daksaray}@bu.edu
3Author is with Mechanical and Aerospace Engineering, University of California Davis, Davis, CA, USA. zdkong@ucdavis.edu
4Author is with Aeronautics and Astronautics, Stanford University, Stanford, CA, USA. schwager@stanford.edu
5Author is with Systems Engineering, Boston University, Boston, MA, USA.

Abstract— We consider the problem of steering a system with unknown, stochastic dynamics to satisfy a rich, temporally-layered task given as a signal temporal logic formula. We represent the system as a Markov decision process in which the states are built from a partition of the statespace and the transition probabilities are unknown. We present provably convergent reinforcement learning algorithms to maximize the probability of satisfying a given formula and to maximize the average expected robustness, i.e., a measure of how strongly the formula is satisfied. We demonstrate via a pair of robot navigation simulation case studies that reinforcement learning with robustness maximization performs better than probability maximization in terms of both probability of satisfaction and expected robustness.

I. INTRODUCTION

We consider the problem of controlling a system with unknown, stochastic dynamics, i.e., a “black box”, to achieve a complex, time-sensitive task. An example is controlling a noisy aerial vehicle with partially known dynamics to visit a pre-specified set of regions in some desired order while avoiding hazardous areas. We consider tasks given as temporal logic (TL) formulae [2], an extension of first order Boolean logic that can be used to reason about how the state of a system evolves over time. When a stochastic dynamical model is known, there exist algorithms to find control policies for maximizing the probability of achieving a given TL specification [18], [17], [23], [13] by planning over stochastic abstractions [12], [1], [17]. However, only a handful of papers have considered the problem of enforcing TL specifications for a system with unknown dynamics. Passive [3] and active [21], [9] reinforcement learning has been used to find a policy that maximizes the probability of satisfying a given linear temporal logic formula.

In this paper, in contrast to the above works on reinforcement learning, which use propositional temporal logic, we use signal temporal logic (STL), a rich predicate logic that can be used to describe tasks involving bounds on physical parameters and time intervals [7]. An example of such a property is “Within t1 seconds, a region in which y is less than π1 is reached, and regions in which y is larger than π2 are avoided for t2 seconds.” STL admits a continuous measure called robustness degree that quantifies how strongly a given sample path exhibits an STL property as a real number rather than just providing a yes or no answer [8], [7].
This measure enables the use of continuous optimization methods to solve inference (e.g., [10], [11], [14]) or formal synthesis problems (e.g., [20]) involving STL.

One of the difficulties in solving problems with TL formulae is the history-dependence of their satisfaction. For instance, if the specification requires visiting region A before region B, whether or not the system should steer towards region B depends on whether or not it has previously visited region A. For linear temporal logic (LTL) formulae with time-abstract semantics, this history-dependence can be broken by translating the formula to a deterministic Rabin automaton (DRA), a model that automatically takes care of the history-dependent “book-keeping” [4], [21]. In the case of STL, such a construction is difficult due to the time-bounded semantics. We circumvent this problem by defining a fragment of STL such that progress towards satisfaction can be checked with some finite number τ of state measurements. We thus define an MDP, called the τ-MDP, whose states correspond to the τ-step history of the system. The inputs to the τ-MDP are a finite collection of control actions. We use a reinforcement learning strategy called Q-learning [24], in which a policy is constructed by taking actions, observing outcomes, and reinforcing actions that improve a given reward. Our algorithms either maximize the probability of satisfying a given STL formula, or maximize the expected robustness with respect to the given STL formula. These procedures provably converge to the optimal policy for each case. Furthermore, we propose that maximizing expected robustness is typically more effective than maximizing probability of satisfaction. We prove that in certain cases, the policy that maximizes expected robustness also maximizes the probability of satisfaction. However, if the given specification is not satisfiable, probability maximization will return an arbitrary policy, while robustness maximization will return a policy that gets as close to satisfying the formula as possible. Finally, we demonstrate through simulation case studies that the policy that maximizes expected robustness in some cases gives better performance in terms of both probability of satisfaction and expected robustness when fewer training episodes are available.

II. SIGNAL TEMPORAL LOGIC (STL)

STL is defined with respect to continuously valued signals. Let F(A,B) denote the set of mappings from A to B and define a signal as a member of F(N,R^n). For a signal s, we denote s_t as the value of s at time t and s_{t1:t2} as the sequence of values s_{t1} s_{t1+1} ... s_{t2}. Moreover, we denote s[t] as the suffix from time t, i.e., s[t] = {s_{t'} | t' ≥ t}. In this paper, the desired mission specification is described by an STL fragment with the following syntax:

\[
\begin{aligned}
\phi &:= F_{[0,T)}\psi \;\mid\; G_{[0,T)}\psi, \\
\psi &:= f(s) < d \;\mid\; \neg\varphi \;\mid\; \varphi_1 \wedge \varphi_2 \;\mid\; \varphi_1\, U_{[a,b)}\, \varphi_2,
\end{aligned}
\tag{1}
\]

where T is a finite time bound, φ, ψ, and ϕ are STL formulae, a and b are non-negative real-valued constants, and f(s) < d is a predicate, where s is a signal, f ∈ F(R^n,R) is a function, and d ∈ R is a constant. The Boolean operators ¬ and ∧ are negation (“not”) and conjunction (“and”), respectively. The other Boolean operators are defined as usual. The temporal operators F, G, and U stand for “Finally (eventually)”, “Globally (always)”, and “Until”, respectively. Note that in this paper, we use a discrete-time version of STL rather than the typical continuous-time formulation.
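To make the fragment concrete, the following is a minimal Python sketch (illustrative only, not the authors' implementation) of one way to represent formulas in (1) as a small syntax tree; disjunction is included explicitly since it appears in the examples below.

```python
# Hypothetical representation of the STL fragment (1) as a small syntax tree.
# Names and structure are illustrative only, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Predicate:      # f(s_t) < d
    f: Callable[[object], float]
    d: float

@dataclass
class Neg:            # negation
    child: object

@dataclass
class And:            # conjunction
    left: object
    right: object

@dataclass
class Or:             # disjunction (definable from Neg/And; kept explicit here)
    left: object
    right: object

@dataclass
class Finally:        # F_[a,b) child
    a: int
    b: int
    child: object

@dataclass
class Globally:       # G_[a,b) child
    a: int
    b: int
    child: object

@dataclass
class Until:          # left U_[a,b) right
    a: int
    b: int
    left: object
    right: object
```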
The semantics of STL is defined recursively as

\[
\begin{aligned}
s[t] \models (f(s) < d) \quad &\text{iff} \quad f(s_t) < d \\
s[t] \models \phi_1 \wedge \phi_2 \quad &\text{iff} \quad s[t] \models \phi_1 \text{ and } s[t] \models \phi_2 \\
s[t] \models \phi_1 \vee \phi_2 \quad &\text{iff} \quad s[t] \models \phi_1 \text{ or } s[t] \models \phi_2 \\
s[t] \models G_{[a,b)}\phi \quad &\text{iff} \quad s[t'] \models \phi \;\; \forall t' \in [t+a, t+b) \\
s[t] \models F_{[a,b)}\phi \quad &\text{iff} \quad \exists t' \in [t+a, t+b) \text{ s.t. } s[t'] \models \phi \\
s[t] \models \phi_1 U_{[a,b)} \phi_2 \quad &\text{iff} \quad \exists t' \in [t+a, t+b) \text{ s.t. } s[t''] \models \phi_1 \;\forall t'' \in [t, t') \text{ and } s[t'] \models \phi_2.
\end{aligned}
\]

In plain English, F_[a,b)φ means “within a and b time units in the future, φ is true,” G_[a,b)φ means “for all times between a and b time units in the future, φ is true,” and φ1 U_[a,b) φ2 means “there exists a time c between a and b time units in the future such that φ1 is true until c and φ2 is true at c.”

STL is equipped with a robustness degree [8], [7] (also called “degree of satisfaction”) that quantifies how well a given signal s satisfies a given formula φ. The robustness is calculated recursively according to the quantitative semantics

\[
\begin{aligned}
r(s, (f(s) < d), t) &= d - f(s_t) \\
r(s, \phi_1 \wedge \phi_2, t) &= \min\big(r(s,\phi_1,t),\, r(s,\phi_2,t)\big) \\
r(s, \phi_1 \vee \phi_2, t) &= \max\big(r(s,\phi_1,t),\, r(s,\phi_2,t)\big) \\
r(s, G_{[a,b)}\phi, t) &= \min_{t' \in [t+a,t+b)} r(s,\phi,t') \\
r(s, F_{[a,b)}\phi, t) &= \max_{t' \in [t+a,t+b)} r(s,\phi,t') \\
r(s, \phi_1 U_{[a,b)} \phi_2, t) &= \sup_{t' \in [t+a,t+b)} \min\Big(r(s,\phi_2,t'),\, \inf_{t'' \in [t,t']} r(s,\phi_1,t'')\Big).
\end{aligned}
\]

We use r(s,φ) to denote r(s,φ,0). If r(s,φ) is large and positive, then s would have to change by a large deviation in order to violate φ. Similarly, if r(s,φ) is large in absolute value and negative, then s strongly violates φ.

Similar to [6], let hrz(φ) denote the horizon length of an STL formula φ. The horizon length is the required number of samples to resolve any (future or past) requirements of φ. The horizon length can be computed recursively as

\[
\begin{aligned}
hrz(p) &= 0, \\
hrz(\neg\phi) &= hrz(\phi), \\
hrz(\phi_1 \vee \phi_2) &= \max\{hrz(\phi_1), hrz(\phi_2)\}, \\
hrz(\phi_1 \wedge \phi_2) &= \max\{hrz(\phi_1), hrz(\phi_2)\}, \\
hrz(\phi_1 U_{[a,b]} \phi_2) &= \max\{hrz(\phi_1)+b-1,\, hrz(\phi_2)+b\},
\end{aligned}
\tag{2}
\]

where φ, φ1, φ2 are STL formulae.

Example 1: Consider the robot navigation problem illustrated in Figure 1(a). The specification is “Visit Regions A or B and visit Regions C or D every 4 time units along a mission horizon of 100 units.” Let s(t) = [x(t) y(t)]^T, where x and y are the x- and y-components of the signal s. This task can be formulated in STL as

\[
\begin{aligned}
\phi &= G_{[0,100)}\, \psi \\
\psi &= F_{[0,4)}\big( (x > 2 \wedge x < 3 \wedge y > 2 \wedge y < 3) \vee (x > 4 \wedge x < 5 \wedge y > 4 \wedge y < 5) \big) \\
&\quad \wedge F_{[0,4)}\big( (x > 2 \wedge x < 3 \wedge y > 4 \wedge y < 5) \vee (x > 4 \wedge x < 5 \wedge y > 2 \wedge y < 3) \big).
\end{aligned}
\tag{3}
\]

Figure 1(a) shows two trajectories of the system beginning at the initial location R and ending in region C, each of which satisfies the inner specification ψ given in (3). Note that s2 barely satisfies ψ, as it only slightly penetrates region A, while s1 appears to satisfy it strongly, as it passes through the center of region A and the center of region C. The robustness degrees confirm this: r(s1,ψ) = 0.3 while r(s2,ψ) = 0.05. The horizon length of the inner specification ψ of (3) is hrz(ψ) = max{4 + max(0,0), 4 + max(0,0)} = 4.
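As a companion to the quantitative semantics and to (2), the hypothetical sketch below (reusing the classes from the previous sketch) computes the discrete-time robustness degree and the horizon length by direct recursion; the rule used for F and G is obtained by expressing them through U in (2). For the inner specification ψ of (3), this recursion returns hrz(ψ) = 4, matching the value derived above.

```python
# Hypothetical recursive implementations of the robustness degree and the
# horizon length hrz for the discrete-time fragment, reusing the syntax-tree
# classes from the previous sketch. Illustrative only.
def rob(s, phi, t=0):
    """Robustness r(s, phi, t) of a finite discrete-time signal s."""
    if isinstance(phi, Predicate):
        return phi.d - phi.f(s[t])
    if isinstance(phi, Neg):
        return -rob(s, phi.child, t)
    if isinstance(phi, And):
        return min(rob(s, phi.left, t), rob(s, phi.right, t))
    if isinstance(phi, Or):
        return max(rob(s, phi.left, t), rob(s, phi.right, t))
    if isinstance(phi, Globally):
        return min(rob(s, phi.child, k) for k in range(t + phi.a, t + phi.b))
    if isinstance(phi, Finally):
        return max(rob(s, phi.child, k) for k in range(t + phi.a, t + phi.b))
    if isinstance(phi, Until):
        return max(
            min(rob(s, phi.right, k),
                min(rob(s, phi.left, j) for j in range(t, k + 1)))
            for k in range(t + phi.a, t + phi.b))
    raise TypeError("unknown formula node")

def hrz(phi):
    """Horizon length of phi, following (2); F and G are treated via U."""
    if isinstance(phi, Predicate):
        return 0
    if isinstance(phi, Neg):
        return hrz(phi.child)
    if isinstance(phi, (And, Or)):
        return max(hrz(phi.left), hrz(phi.right))
    if isinstance(phi, (Finally, Globally)):
        return hrz(phi.child) + phi.b      # since F_[a,b) p = True U_[a,b) p
    if isinstance(phi, Until):
        return max(hrz(phi.left) + phi.b - 1, hrz(phi.right) + phi.b)
    raise TypeError("unknown formula node")
```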
III. MODELS FOR REINFORCEMENT LEARNING

For a system with unknown and stochastic dynamics, a critical problem is how to synthesize control to achieve a desired behavior. A typical approach is to discretize the state and action spaces of the system and then use a reinforcement learning strategy, i.e., to learn how to take actions through trial and error interactions with an unknown environment [22]. In this section, we present models of systems that are amenable to reinforcement learning for enforcing temporal logic specifications. We start with a discussion of the widely used LTL before introducing the particular model that we will use for reinforcement learning with STL.

A. Reinforcement Learning with LTL

One approach to the problem of enforcing LTL satisfaction in a stochastic system is to partition the statespace and design control primitives that can (nominally) drive the system from one region to another. These controllers, the stochastic dynamical model of the system, and the quotient obtained from the partition are used to construct a Markov decision process (MDP), called a bounded parameter MDP or BMDP, whose transition probabilities are interval-valued [1]. These BMDPs can then be composed with a DRA constructed from a given LTL formula to form a product BMDP. Dynamic programming (DP) can then be applied over this product MDP to generate a policy that maximizes the probability of satisfaction. Other approaches to this problem include aggregating the states of a given quotient until an MDP can be constructed such that the transition probability can be considered constant (with bounded error) [16]. The optimal policy can be computed over the resulting MDP using DP [15] or approximate DP, e.g., actor-critic methods [5].

Thus, even when the stochastic dynamics of a system are known and the logic that encodes constraints has time-abstract semantics, the problem of constructing an abstraction of the system that is amenable to control policy synthesis is difficult and computationally intensive. Reinforcement learning methods for enforcing LTL constraints make the assumption that the underlying model under control is an MDP [3], [21], [9]. Implicitly, these procedures compute a frequentist approximation of the transition probabilities that asymptotically approaches the true (unknown) values as the number of observed sample paths increases. Since this algorithm does not explicitly rely on any a priori knowledge of the transition probabilities, it could be applied to an abstraction of a continuous-space system that is built from a proposition-preserving partition. In this case, the uncertainty on the motion described by intervals in the BMDP that is reduced via computation would instead be described by complete ignorance that is reduced via learning. The resulting policy would map regions of the statespace to discrete actions that optimally drive the real-valued state of the system to satisfy the given LTL specification. Different partitions will result in different policies. In the next section, we extend the above observation to derive a discrete model that is amenable to reinforcement learning for STL formulae.

B. Reinforcement learning with STL: τ-MDP

In order to reduce the search space of the problem, we partition the statespace of the system to form the quotient graph G = (Σ,E), where Σ is a set of discrete states corresponding to the regions of the statespace and E corresponds to the set of edges. An edge between two states σ and σ′ exists in E if and only if σ and σ′ are neighbors (share a boundary) in the partition. In our case, since STL has time-bounded semantics, we cannot use an automaton with a time-abstract acceptance condition (e.g., a DRA) to check its satisfaction. In general, whether or not a given trajectory s_{0:T} satisfies an STL formula is determined by directly applying the qualitative semantics. The STL fragment (1) consists of a sub-formula ψ with horizon length hrz(ψ) = τ that is modified by either an F_[0,T) or a G_[0,T) temporal operator.
This means that, in order to update at time t whether the given formula φ has been satisfied or violated, we can use the τ previous state values s_{t−τ+1:t}. For this reason, we choose to learn policies over an MDP with finite memory, called a τ-MDP, whose states correspond to sequences of length τ of regions in the defined partition.

Example 1 (cont’d): Let the robot evolve according to the discrete-time Dubins dynamics

\[
\begin{aligned}
x_{t+1} &= x_t + v\,\delta t \cos\theta_t \\
y_{t+1} &= y_t + v\,\delta t \sin\theta_t,
\end{aligned}
\tag{4}
\]

where x_t and y_t are the x and y coordinates of the robot at time t, v is its forward speed, δt is a time interval, and the robot’s orientation is given by θ_t. The control primitives in this case are given by Act = {up, down, left, right}, which correspond to the directions on the grid. Each (noisy) control primitive induces a distribution with support θ_des ± ∆θ, where θ_des is the orientation in which the robot faces the desired cell. When a motion primitive is enacted, the robot rotates to an angle θ_t drawn from the distribution and moves along that direction for δt time units. The partition of the statespace and the induced quotient G are shown in Figures 1(b) and 1(c), respectively. A state σ(i, j) in the quotient (Figure 1(c)) represents the region in the partition of the statespace (Figure 1(b)) with the point (i, j) in its lower left hand corner.

Definition 1: Given a quotient of a system G = (Σ,E) and a finite set of actions Act, a τ-Markov Decision Process (τ-MDP) is a tuple M_τ = ⟨S, Act, P⟩, where
• S ⊆ (Σ ∪ ε)^τ is the finite set of states, where ε is the empty string. Each state σ_τ ∈ S corresponds to a τ-horizon (or shorter) path in G. Shorter paths of length n < τ (representing the case in which the system has not yet evolved for τ time steps) have ε prepended τ − n times.
• P : S × Act × S → [0,1] is a probabilistic transition relation. P(σ_τ, a, σ′_τ) can be positive only if the first τ − 1 states of σ′_τ are equal to the last τ − 1 states of σ_τ and there exists an edge in G between the final state of σ_τ and the final state of σ′_τ.

We denote the state of the τ-MDP at time t as σ^t_τ.

Definition 2: Given a trajectory s_{t−τ+1:t} of the original system, we define its induced trace in the τ-MDP M_τ as Tr(s_{t−τ+1:t}) = σ_{t−τ+1:t} = σ^t_τ. That is, σ^t_τ corresponds to the previous τ regions of the statespace that the state has resided in from time t − τ + 1 to time t.

The construction of a τ-MDP from a given quotient and set of actions is straightforward. The details are omitted due to length constraints. We make the following key assumptions on the quotient and the resulting τ-MDP:
• The defined control actions Act drive the system either to a point in the current region or to a point in a neighboring region of the partition, i.e., no regions are “skipped”.
• The transition relation P is Markovian.

For every τ-MDP state σ^t_τ, there exists a continuous set of sample paths {s_{t−τ+1:t}} whose traces could be that state. The dynamics of the underlying system produce an unknown distribution p(s_{t−τ+1:t} | Tr(s_{t−τ+1:t}) = σ^t_τ). Since the robustness degree is a function of sample paths of length τ and an STL formula ψ, we can define a distribution p(r(s_{t−τ+1:t},ψ) | Tr(s_{t−τ+1:t}) = σ^t_τ).

Fig. 1: (a) Example of robot navigation problem. (b) Partitioned space. (c) Subsection of the quotient.
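The following hypothetical helpers (names are illustrative) make Definitions 1 and 2 concrete for the grid example: a τ-MDP state is the ε-padded window of the last τ visited regions, and a transition can have positive probability only when the histories overlap and the final regions coincide or are neighbors in G. Allowing the final regions to coincide reflects the assumption that an action may keep the system in its current region.

```python
# Hypothetical helpers for Definitions 1-2. A tau-MDP state is the eps-padded
# tuple of the last tau regions visited; transitions must overlap in their
# histories and end in the same or a neighboring region of the quotient G.
EPS = "eps"  # stands for the empty-string symbol in Definition 1

def trace(region_history, tau):
    """Tr(s_{t-tau+1:t}): the tau-MDP state induced by the visited regions."""
    window = list(region_history)[-tau:]
    return tuple([EPS] * (tau - len(window)) + window)

def admissible(sigma, sigma_next, quotient_edges):
    """Necessary condition for P(sigma, a, sigma_next) > 0 (Definition 1).
    Remaining in the current region is also allowed, per the assumption on Act."""
    histories_overlap = sigma[1:] == sigma_next[:-1]
    last, last_next = sigma[-1], sigma_next[-1]
    neighbors = last == last_next or (last, last_next) in quotient_edges
    return histories_overlap and neighbors

# e.g. with tau = 4, after visiting regions (2, 2) and then (2, 3):
# trace([(2, 2), (2, 3)], 4) == ("eps", "eps", (2, 2), (2, 3))
```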
Example 1 (cont’d): Figure 2 shows a portion of the τ-MDP constructed from Figure 1. The states in M_4 are labeled with the corresponding sample paths of length 4 in G. The green and blue σ’s in the states of M_4 correspond to green and blue regions from Figure 1.

IV. PROBLEM FORMULATION

In this paper, we address the following two problems.

Problem 1 (Maximizing Probability of Satisfaction): Let M_τ be a τ-MDP as described in the previous section. Given an STL formula φ with syntax (1), find a policy µ*_mp ∈ F(S × N, Act) such that

\[
\mu^*_{mp} = \operatorname*{argmax}_{\mu \in \mathcal{F}(\mathcal{S}\times\mathbb{N}, Act)} \; Pr_{s_{0:T}}\big[s_{0:T} \models \phi\big].
\tag{5}
\]

Problem 2 (Maximizing Average Robustness): Let M_τ be as defined in Problem 1. Given an STL formula φ with syntax (1), find a policy µ*_mr ∈ F(S × N, Act) such that

\[
\mu^*_{mr} = \operatorname*{argmax}_{\mu \in \mathcal{F}(\mathcal{S}\times\mathbb{N}, Act)} \; E_{s_{0:T}}\big[r(s_{0:T}, \phi)\big].
\tag{6}
\]

Fig. 2: Part of the τ-MDP constructed from the robot navigation MDP shown in Figure 1.

Problems 1 and 2 are two alternate ways to enforce a given STL specification. The policy found by Problem 1, i.e., µ*_mp, maximizes the chance that φ will be satisfied, while the policy found by Problem 2, i.e., µ*_mr, drives the system to satisfy φ as strongly as possible on average. Problems similar to (5) have already been considered in the literature (e.g., [9], [21]). However, Problem 2 is a novel formulation that provides some advantages over Problem 1. As we show in Section V, for some special systems, µ*_mr achieves the same probability of satisfaction as µ*_mp. Furthermore, if φ is not satisfiable, any arbitrary policy could be a solution to Problem 1, as all policies will result in a satisfaction probability of 0. If φ is unsatisfiable, Problem 2 yields a solution that attempts to get as close as possible to satisfying the formula, as the optimal solution will have an average robustness value that is least negative.

The forms of the objective functions differ for the two types of formulae, φ = F_[0,T)ψ and φ = G_[0,T)ψ.

Case 1: Consider an STL formula φ = F_[0,T)ψ. In this case, the objective function in (5) can be rewritten as

\[
Pr_{s_{0:T}}\big[\exists t \in \{\tau, \ldots, T-\tau\} \text{ s.t. } s_{t-\tau+1:t} \models \psi\big],
\tag{7}
\]

and the objective function in (6) can be rewritten as

\[
E_{s_{0:T}}\Big[\max_{t=\tau,\ldots,T-\tau} r(s_{t-\tau+1:t}, \psi)\Big].
\tag{8}
\]

Case 2: Now, consider an STL formula φ = G_[0,T)ψ. The objective function in (5) can be rewritten as

\[
Pr_{s_{0:T}}\big[\forall t \in \{\tau, \ldots, T-\tau\},\; s_{t-\tau+1:t} \models \psi\big],
\tag{9}
\]

and the objective function in (6) can be rewritten as

\[
E_{s_{0:T}}\Big[\min_{t=\tau,\ldots,T-\tau} r(s_{t-\tau+1:t}, \psi)\Big].
\tag{10}
\]
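The episode-level objectives (7)–(10) can be evaluated on a recorded trajectory as in the hypothetical sketch below, which reuses rob() from the earlier sketch: slide a window of length τ over the signal and aggregate the inner robustness with max for φ = F_[0,T)ψ or min for φ = G_[0,T)ψ; the satisfaction events in (7) and (9) are then approximated by the sign of the result. Window indices are simplified for illustration.

```python
# Hypothetical evaluation of the objectives in (7)-(10) on one trajectory,
# reusing rob() from the earlier sketch. Indices are simplified: every full
# window of length tau is scored.
def episode_robustness(s, psi, tau, outer="F"):
    """max_t r(s_{t-tau+1:t}, psi) for outer 'F' (8), min over t for 'G' (10)."""
    scores = [rob(s[t - tau + 1 : t + 1], psi) for t in range(tau - 1, len(s))]
    return max(scores) if outer == "F" else min(scores)

def episode_satisfies(s, psi, tau, outer="F"):
    """Sign-based analogue of the satisfaction events in (7) and (9)."""
    return episode_robustness(s, psi, tau, outer) > 0
```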
V. MAXIMIZING EXPECTED ROBUSTNESS VS. MAXIMIZING PROBABILITY OF SATISFACTION

Here, we demonstrate that the solution to (6) subsumes the solution to (5) for a certain class of systems. Due to space limitations, we only consider formulae of the type φ = F_[0,T)ψ. Let M_τ = (S_τ, P_τ, Act) be a τ-MDP. For simplicity, we make the following assumption on S_τ.

Assumption 1: For every state σ_τ ∈ S_τ, either every trajectory s_{t−τ+1:t} whose trace is σ_τ satisfies ψ, denoted σ_τ ⊨ ψ, or every trajectory that passes through the sequence of regions associated with σ_τ does not satisfy ψ, denoted σ_τ ⊭ ψ.

Assumption 1 can be enforced in practice during partitioning. We define the set

\[
A = \{\sigma_\tau \in S_\tau \mid \sigma_\tau \models \psi\}.
\tag{11}
\]

Definition 3: The signed graph distance of a τ-state σ^i_τ ∈ S to a set X ⊆ S is

\[
d(\sigma^i_\tau, X) =
\begin{cases}
\min_{\sigma^j_\tau \in X} \; l(\sigma^i_\tau, \sigma^j_\tau) & \sigma^i_\tau \notin X \\[4pt]
-\min_{\sigma^j_\tau \in S_\tau \setminus X} \; l(\sigma^j_\tau, \sigma^i_\tau) & \sigma^i_\tau \in X,
\end{cases}
\tag{12}
\]

where l(σ^i_τ, σ^j_τ) is the length of the shortest path from σ^i_τ to σ^j_τ. We also make the following two assumptions.

Assumption 2: For any signal s_{t−τ+1:t} such that Tr(s_{t−τ+1:t}) ∈ S_τ, let r(s_{t−τ+1:t},ψ) be bounded from below by R_min and from above by R_max.

Assumption 3: Let D_{σ_τ}(δ) = Pr[r(s_{t:t+τ},ψ) > δ | Tr(s_{t:t+τ}) = σ_τ]. For any two states,

\[
d(\sigma^i_\tau, A) < d(\sigma^j_\tau, A) \;\Rightarrow\; D_{\sigma^i_\tau}(\delta) \ge D_{\sigma^j_\tau}(\delta) \quad \forall \delta \in [R_{min}, R_{max}].
\tag{13}
\]

Now we define the policies µ*_mp and µ*_mr over M_τ as

\[
\mu^*_{mp} = \operatorname*{argmax}_{\mu \in \mathcal{F}(\mathcal{S}\times\mathbb{N}, Act)} Pr_{\sigma^{0:T}_\tau}\Big[\exists t \in [0,T] \text{ s.t. } \sigma^t_\tau \models \psi\Big]
\tag{14}
\]

\[
\mu^*_{mr} = \operatorname*{argmax}_{\mu \in \mathcal{F}(\mathcal{S}\times\mathbb{N}, Act)} E_{\sigma^{0:T}_\tau}\Big[\max_{t=0,\ldots,T} r(\sigma^t_\tau, \psi)\Big]
\tag{15}
\]

Proposition 1: If Assumptions 1, 2, and 3 hold, then the policy µ*_mr maximizes the expected probability of satisfaction.

Proof: Given any policy µ, its associated reachability probability can be defined as

\[
Pr^\mu(\sigma_\tau) = Pr^\mu\Big[\sigma_\tau = \operatorname*{argmin}_{\sigma^0_\tau, \ldots, \sigma^{T-\tau}_\tau} d(\sigma_\tau, A)\Big].
\tag{16}
\]

Let I(·) be the indicator function such that I(B) is 1 if B is true and 0 if B is false. By definition, the expected probability of satisfaction for a given policy µ is

\[
EPS(\mu) = E\big[I(\exists\, 0 < k < T-\tau \text{ s.t. } \sigma^k_\tau \models \psi)\big]
= \sum_{\sigma_\tau \in S_\tau} Pr^\mu(\sigma_\tau)\, I(\sigma_\tau \in A)
= \sum_{\sigma_\tau \in A} Pr^\mu(\sigma_\tau).
\tag{17}
\]

Also, the expected robustness of policy µ becomes

\[
\begin{aligned}
ER(\mu) &= E\Big[\max_{k=0,\ldots,T-\tau} r(\sigma^k_\tau, \psi)\Big] \\
&= \int_0^{R_{max}} Pr\Big[\max_{k=0,\ldots,T-\tau} r(\sigma^k_\tau,\psi) > x\Big]\, dx + \int_{R_{min}}^0 \Big(1 - Pr\Big[\max_{k=0,\ldots,T-\tau} r(\sigma^k_\tau,\psi) > x\Big]\Big)\, dx \\
&= \int_0^{R_{max}} Pr\Big[\max_{k=0,\ldots,T-\tau} r(\sigma^k_\tau,\psi) > x\Big]\, dx - \int_{R_{min}}^0 Pr\Big[\max_{k=0,\ldots,T-\tau} r(\sigma^k_\tau,\psi) > x\Big]\, dx - R_{min} \\
&= \int_0^{R_{max}} \sum_{\sigma_\tau \in S_\tau} Pr^\mu(\sigma_\tau)\, D_{\sigma_\tau}(x)\, dx - \int_{R_{min}}^0 \sum_{\sigma_\tau \in S_\tau} Pr^\mu(\sigma_\tau)\, D_{\sigma_\tau}(x)\, dx - R_{min} \\
&= \sum_{\sigma_\tau \in A} Pr^\mu(\sigma_\tau) \int_0^{R_{max}} D_{\sigma_\tau}(x)\, dx - \sum_{\sigma_\tau \notin A} Pr^\mu(\sigma_\tau) \int_{R_{min}}^0 D_{\sigma_\tau}(x)\, dx - R_{min}.
\end{aligned}
\tag{18}
\]

Since R_min is constant, maximizing (18) is equivalent to

\[
\max_\mu \Big( \sum_{\sigma_\tau \in A} Pr^\mu(\sigma_\tau) \int_0^{R_{max}} D_{\sigma_\tau}(x)\, dx \;-\; \sum_{\sigma_\tau \notin A} Pr^\mu(\sigma_\tau) \int_{R_{min}}^0 D_{\sigma_\tau}(x)\, dx \Big).
\tag{19}
\]

Let p be the satisfaction probability, i.e., p = Σ_{σ_τ ∈ A} Pr^µ(σ_τ). Then, we can rewrite the objective in (19) as

\[
\begin{aligned}
J(\mu) = \; & p \sum_{\sigma_\tau \in A} Pr^\mu\Big[\sigma_\tau = \operatorname*{argmin}_{\sigma^0_\tau,\ldots,\sigma^{T-\tau}_\tau} d(\sigma_\tau, A)\,\Big|\, \sigma_\tau \in A\Big] \int_0^{R_{max}} D_{\sigma_\tau}(x)\, dx \\
& - (1-p)\, Pr^\mu\Big[\sigma_\tau = \operatorname*{argmin}_{\sigma^0_\tau,\ldots,\sigma^{T-\tau}_\tau} d(\sigma_\tau, A)\,\Big|\, \sigma_\tau \notin A\Big] \int_{R_{min}}^0 D_{\sigma_\tau}(x)\, dx.
\end{aligned}
\tag{20}
\]

Now,

\[
\begin{aligned}
\frac{\partial J(\mu)}{\partial p} = \; & \sum_{\sigma_\tau \in A} Pr^\mu\Big[\sigma_\tau = \operatorname*{argmin}_{\sigma^0_\tau,\ldots,\sigma^{T-\tau}_\tau} d(\sigma_\tau, A)\,\Big|\, \sigma_\tau \in A\Big] \int_0^{R_{max}} D_{\sigma_\tau}(x)\, dx \\
& + Pr^\mu\Big[\sigma_\tau = \operatorname*{argmin}_{\sigma^0_\tau,\ldots,\sigma^{T-\tau}_\tau} d(\sigma_\tau, A)\,\Big|\, \sigma_\tau \notin A\Big] \int_{R_{min}}^0 D_{\sigma_\tau}(x)\, dx \;>\; 0.
\end{aligned}
\tag{21}
\]

Thus, any policy µ increasing J(µ) also leads to an increase in p. Since increasing J(µ) is equivalent to increasing ER(µ), we can conclude that the policy that maximizes the expected robustness also achieves the maximum satisfaction probability. ∎

VI. CONTROL SYNTHESIS TO MAXIMIZE ROBUSTNESS

A. Policy Generation through Q-Learning

Since we do not know the dynamics of the system under control, we cannot a priori predict how a given control action will affect the evolution of the system and hence its progress towards satisfying or violating a given specification. Thus, we use the well-known paradigm of reinforcement learning to learn policies that solve Problems 1 and 2. In reinforcement learning, the system takes actions and records the rewards associated with each state-action pair. These rewards are then used to update a feedback policy that maximizes the expected gathered reward. In our case, the rewards that we collect over M_τ are related to whether or not ψ is satisfied (Problem 1) or how robustly ψ is satisfied or violated (Problem 2).

Our solutions to these problems rely on a Q-learning formulation [24]. Let R(σ^t_τ, a) be the reward collected when action a ∈ Act is taken in state σ^t_τ ∈ S. Define the function Q : S × Act × N → R as

\[
\begin{aligned}
Q(\sigma^{T-t}_\tau, a, t) &= R(\sigma^{T-t}_\tau, a) + \max_{\{\mu_l\}_{l=T-t-1}^{T}} E\Big[\sum_{l=T-t-1}^{T} R\big(\sigma^l_\tau, \mu_l(\sigma^l_\tau)\big)\Big] \\
&= R(\sigma^{T-t}_\tau, a) + \max_{a' \in Act} Q(\sigma^{T-t+1}_\tau, a', t-1).
\end{aligned}
\tag{22}
\]

For an optimization problem with a cumulative objective function of the form

\[
\sum_{l=\tau}^{T} R(\sigma^l_\tau, a_l),
\tag{23}
\]

the optimal policy µ* ∈ F(S, Act) can be found by

\[
\mu^*(\sigma^t_\tau, T-t) = \operatorname*{argmax}_{a \in Act} Q(\sigma^t_\tau, a, T-t).
\tag{24}
\]
Applying the update rule

\[
Q_{t+1}(\sigma^t_\tau, a_t, T-t) = (1-\alpha_t)\, Q_t(\sigma^t_\tau, a_t, T-t) + \alpha_t \Big[ R(\sigma^t_\tau, a_t) + \gamma \max_{a' \in Act} Q_t(\sigma^{t+1}_\tau, a') \Big],
\tag{25}
\]

where 0 < γ < 1, causes Q_t to converge to Q with probability 1 as t goes to infinity [24].

B. Batch Q-learning

We cannot reformulate Problems 1 and 2 into the form (23) (see Section IV). Thus, we propose an alternate Q-learning formulation, called batch Q-learning, to solve these problems. Instead of updating the Q-function after each action is taken, we wait until an entire episode s_{0:T} is completed before updating the Q-function. The batch Q-learning procedure is summarized in Algorithm 1.

Algorithm 1 The batch Q-learning algorithm.
function BatchQLearn(Sys, probType, Nep, φ)
  Q ← RandomInitialization
  µ ← InitializePolicy(Q)
  for n = 1 to Nep do
    s_{0:T} ← Simulate(Sys, µ)
    Q ← UpdateQFunction(Q, µ, s_{0:T}, φ, probType)
    µ ← UpdatePolicy(µ, Q)
  return Q, µ

Algorithm 2 Function used to update the Q function in Algorithm 1.
function UpdateQFunction(Q, µ, s_{0:T}, φ, γ, probType)
  for n = T − τ − 1 down to τ do
    if probType is MaximumProbability then
      Qtmp(σ^n_τ, µ(σ^n_τ, T−n)) ← max(I(s_{n−τ+1:n} ⊨ φ), γ Qtmp(σ^{n+1}_τ, µ(σ^{n+1}_τ, T−n−1)))
    else
      Qtmp(σ^n_τ, µ(σ^n_τ, T−n)) ← max(r(s_{n−τ+1:n}, φ), γ Qtmp(σ^{n+1}_τ, µ(σ^{n+1}_τ, T−n−1)))
    Qnew(σ^n_τ, µ(σ^n_τ, T−n)) ← (1−α) Qtmp(σ^n_τ, µ(σ^n_τ, T−n)) + α Q(σ^n_τ, µ(σ^n_τ, T−n))
  return Qnew

The Q function is initialized to random values and µ is computed from the initial Q values. Then, for Nep episodes, the system is simulated using µ. Randomization is used to encourage exploration of the policy space. The observed trajectory is then used to update the Q function according to Algorithm 2. The new value of the Q function is used to update the policy µ. For compactness, Algorithm 2 as written only covers the case φ = F_[0,T)ψ. The case in which φ = G_[0,T)ψ can be addressed similarly.
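Below is a simplified, hypothetical Python sketch of the robustness-maximizing branch of the batch update, reusing trace() and rob() from the earlier sketches. For brevity it keeps a single dictionary-backed Q table and maximizes over next actions, in the spirit of the convergent rule analyzed in the next subsection, rather than maintaining the separate Qtmp table of Algorithm 2; all names are illustrative.

```python
# Hypothetical, simplified batch update for the robustness case. After an
# episode, sweep backwards in time and blend the episodic target
# max(r, gamma * best next Q) into a dictionary-backed Q table.
def batch_update(Q, episode, regions, actions_taken, actions, psi, tau,
                 gamma=1.0, alpha=0.95):
    """episode: raw states s_0..s_{T-1}; regions[n] is the partition region of s_n;
    actions_taken[n] is the action applied at time n."""
    T = len(episode)
    for n in range(T - 2, tau - 2, -1):                 # backward sweep over the episode
        sigma = trace(regions[: n + 1], tau)            # tau-MDP state at time n
        sigma_next = trace(regions[: n + 2], tau)
        a = actions_taken[n]
        r = rob(episode[n - tau + 1 : n + 1], psi)      # robustness of the last tau samples
        best_next = max(Q.get((sigma_next, b, T - n - 1), 0.0) for b in actions)
        target = max(r, gamma * best_next)
        key = (sigma, a, T - n)
        Q[key] = (1.0 - alpha) * Q.get(key, 0.0) + alpha * target
    return Q

def greedy_policy(Q, sigma, steps_to_go, actions):
    """Policy extraction as in (24): pick the action with the largest Q value."""
    return max(actions, key=lambda b: Q.get((sigma, b, steps_to_go), 0.0))
```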
C. Convergence of Batch Q-learning

Given a formula of the form φ = F_[0,T)ψ and the objective of maximizing the expected robustness (Problem 2), we show that Algorithm 1 converges to the optimal solution. The other three cases discussed in Section IV can be proven similarly. The following analysis is based on [19]. The optimal Q function derived from (8) is

\[
Q^*(\sigma^t_\tau, a, T-t) = \sum_{\sigma^{t+1}_\tau} P(\sigma^t_\tau, a, \sigma^{t+1}_\tau)\, \max\Big(r(\sigma^t_\tau, \psi),\; \max_{b \in Act} \gamma\, Q^*(\sigma^{t+1}_\tau, b, T-t-1)\Big).
\tag{26}
\]

This gives the following convergence result.

Proposition 2: The Q-learning rule given by

\[
Q_{k+1}(\sigma^t_\tau, a_t, T-t) = (1-\alpha_k)\, Q_k(\sigma^t_\tau, a_t, T-t) + \alpha_k \max\Big(r(\sigma^t_\tau, \psi),\; \max_{b \in Act} \gamma\, Q_k(\sigma^{t+1}_\tau, b, T-t-1)\Big)
\tag{27}
\]

converges to the optimal Q function (26) if the sequence {α_k}_{k=0}^∞ is such that Σ_{k=0}^∞ α_k = ∞ and Σ_{k=0}^∞ (α_k)^2 < ∞.

Proof: (Sketch) The proof of Proposition 2 relies primarily on Proposition 3. Once this is established, the rest of the proof varies only slightly from the presentation in [19]. Note that in this case, k ranges over the number of episodes and t ranges over the time coordinate of the signal.

Proposition 3: The optimal Q-function given by (26) is a fixed point of the contraction mapping H, where

\[
(Hq)(\sigma^t_\tau, a, T-t) = \sum_{\sigma^{t+1}_\tau} P(\sigma^t_\tau, a, \sigma^{t+1}_\tau)\, \max\Big(r(\sigma^t_\tau, \psi),\; \gamma \max_{b \in Act} q(\sigma^{t+1}_\tau, b, T-t-1)\Big).
\tag{28}
\]

Proof: By (26), if H is a contraction mapping, then Q* is a fixed point of H. Consider

\[
\|Hq_1 - Hq_2\|_\infty = \max_{\sigma_\tau, a} \sum_{\sigma'_\tau} P(\sigma_\tau, a, \sigma'_\tau) \Big( \max\big(r(\sigma_\tau,\psi),\, \gamma \max_{b \in Act} q_1(\sigma'_\tau, b, T-t-1)\big) - \max\big(r(\sigma_\tau,\psi),\, \gamma \max_{b \in Act} q_2(\sigma'_\tau, b, T-t-1)\big) \Big).
\tag{29}
\]

Define

\[
q^*_j(t) = \max_{b \in Act} \gamma\, q_j(\sigma'_\tau, b, t).
\tag{30}
\]

Without loss of generality, let q*_1(T−t−1) > q*_2(T−t−1). Define

\[
R(\sigma'_\tau) = \max\big(r(\sigma_\tau,\psi),\, q^*_1(T-t-1)\big) - \max\big(r(\sigma_\tau,\psi),\, q^*_2(T-t-1)\big).
\tag{31}
\]

There are three possibilities for the value of R(σ′_τ):

\[
\begin{aligned}
r(\sigma_\tau,\psi) > q^*_1(T-t-1) > q^*_2(T-t-1) \;&\Rightarrow\; R(\sigma'_\tau) = 0, \\
q^*_1(T-t-1) > r(\sigma_\tau,\psi) > q^*_2(T-t-1) \;&\Rightarrow\; R(\sigma'_\tau) = \|q^*_1(T-t-1) - r\|_\infty < \gamma \|q_1 - q_2\|_\infty, \\
q^*_1(T-t-1) > q^*_2(T-t-1) > r(\sigma_\tau,\psi) \;&\Rightarrow\; R(\sigma'_\tau) < \gamma \|q_1 - q_2\|_\infty.
\end{aligned}
\tag{32}
\]

Thus, R(σ′_τ) ≤ γ ||q_1 − q_2||_∞ for all σ′_τ. Hence,

\[
\|Hq_1 - Hq_2\|_\infty = \max_{\sigma_\tau, a} \sum_{\sigma'_\tau} P(\sigma_\tau, a, \sigma'_\tau)\, R(\sigma'_\tau) \le \max_{\sigma_\tau, a} \sum_{\sigma'_\tau} P(\sigma_\tau, a, \sigma'_\tau)\, \gamma \|q_1 - q_2\|_\infty \le \gamma \|q_1 - q_2\|_\infty.
\tag{33}
\]

Therefore, H is a contraction mapping. ∎

Fig. 3: Comparison of policies for Case Study 1. Histogram of robustness values for trained policies for the solution to (a) Problem 1 and (b) Problem 2. Trajectory generated from policies for the solution to (c) Problem 1 and (d) Problem 2.

VII. CASE STUDY

We implemented the batch Q-learning algorithm (Algorithm 1) and applied it to two case studies that adapt the robot navigation model from Example 1. For each case study, we solved Problems 1 and 2 and compared the performance of the resulting policies. All simulations were implemented in Matlab and performed on a PC with a 2.6 GHz processor and 7.8 GB RAM.

A. Case Study 1: Reachability

First, we consider a simple reachability problem. The given STL specification is

\[
\phi_{cs1} = F_{[0,20)}\big( F_{[0,1)}\, \varphi_{blue} \wedge G_{[1,4)}\, \neg\varphi_{blue} \big),
\tag{34}
\]

where ϕ_blue is the STL subformula corresponding to being in a blue region. In plain English, (34) can be stated as “Within 20 time units, reach a blue region and then don’t revisit a blue region for 4 time units.”

The results from applying Algorithm 1 are summarized in Figure 3. We used the parameters γ = 1, α_t = 0.95, N_ep = 300, and ε_t = 0.995^t, where ε_t is the probability at iteration t of selecting an action at random. (Although the conditions γ < 1 and Σ_{k=0}^∞ α_k^2 < ∞ are technically required to prove convergence, in practice these conditions can be relaxed without adverse effects on learning performance.) Constructing the τ-MDP took 17.2 s. Algorithm 1 took 161 s to solve Problem 1 and 184 s to solve Problem 2.

The two approaches perform very similarly. In the first row of Figure 3, we show a histogram of the robustness of 500 trials generated from the system simulated using each of the trained policies after learning has completed, i.e., without the randomization that is used during the learning phase. Note that both trained policies satisfied the specification with probability 1. The performance of the two algorithms is very similar, as the mean robustness is 0.2287 with standard deviation 0.1020 for probability maximization and 0.2617 and 0.1004, respectively, for robustness maximization. In the second row, we see trajectories simulated by each of the trained policies.
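Schematically, this evaluation can be reproduced as in the hypothetical sketch below, where simulate() stands in for the black-box system driven by a trained policy and episode_robustness() is from the earlier sketch.

```python
# Hypothetical sketch of the post-training evaluation: roll out a trained
# policy with exploration turned off for 500 trials and summarize the
# robustness of each run. simulate() is an assumed stand-in for the system.
from statistics import mean, stdev

def evaluate(policy, simulate, psi, tau, T, n_trials=500, outer="F"):
    scores = []
    for _ in range(n_trials):
        s = simulate(policy, T)                       # one closed-loop episode s_0..s_{T-1}
        scores.append(episode_robustness(s, psi, tau, outer=outer))
    sat_prob = sum(r > 0 for r in scores) / n_trials  # empirical satisfaction probability
    return mean(scores), stdev(scores), sat_prob
```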
The similarity of the solutions in this case study is not surprising. If the state of the system is deep within region A or B, then the probability that it will remain inside that region for the next 3 time steps (satisfying φ) is higher than if it is at the edge of the region. Trajectories that remain deeper in the interior of region A or B also have a high robustness value. Thus, for this particular problem, there is an inherent coupling between the policies that satisfy the formula with high probability and those that satisfy the formula as robustly as possible on average.

B. Case Study 2: Repeated Satisfaction

In this second case study, we look at a problem involving repeatedly satisfying a condition finitely many times. The specification of interest is

\[
\phi_{cs2} = G_{[0,12)}\big( F_{[0,4)}(\varphi_{blue}) \wedge F_{[0,4)}(\varphi_{green}) \big).
\tag{35}
\]

In plain English, (35) is “Ensure that every 4 time units over a 12 unit interval, a green region and a blue region are entered.”

Results from this case study are shown in Figure 4. We used the same parameters as listed in Section VII-A, except N_ep = 1200, α = 0.4, and ε_t = 0.9^t. Constructing the τ-MDP took 16.5 s. Applying Algorithm 1 took 257.7 s for Problem 1 and 258.3 s for Problem 2.

Fig. 4: Comparison of policies for Case Study 2. The subplots have the same meaning as in Figure 3.

In the first row, we see that the solution to Problem 1 satisfies the formula with probability 0, while the solution to Problem 2 satisfies the formula with probability 1. At first, this seems counterintuitive, as Proposition 2 indicates that a policy that maximizes probability would achieve a probability of satisfaction at least as high as the policy that maximizes the expected robustness. However, this is only guaranteed with an infinite number of learning trials. The performance in terms of robustness is clearly better for robustness maximization (mean 0.1052, standard deviation 0.0742) than for probability maximization (mean −0.6432, standard deviation 0.2081). In the second row, we see that the maximum robustness policy enforces convergence to a cycle between two regions, while the maximum probability policy deviates from this cycle.

The discrepancy between the two solutions can be explained by what happens when trajectories that almost satisfy (35) occur. If a trajectory that almost oscillates between a blue and a green region every four seconds is encountered when solving Problem 1, it collects 0 reward. On the other hand, when solving Problem 2, the policy that produces the almost oscillatory trajectory will be reinforced much more strongly, as the resulting robustness is less negative. Since the robustness degree gives “partial credit” for trajectories that are close to satisfying the formula, the reinforcement learning algorithm performs a directed search to find policies that satisfy the formula. Since probability maximization gives no partial credit, the reinforcement learning algorithm is essentially performing a random search until it encounters a trajectory that satisfies the given formula. Therefore, if the family of policies that satisfy the formula with positive probability is small, it will on average take the Q-learning algorithm solving Problem 1 a longer time to converge to a solution that enforces formula satisfaction.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a new reinforcement learning paradigm to enforce temporal logic specifications when the dynamics of the system are a priori unknown. In contrast to existing works on this topic, we use a logic (signal temporal logic) whose formulation is directly related to a system’s statespace. We present a novel, convergent Q-learning algorithm that uses the robustness degree, a continuous measure of how well a trajectory satisfies a formula, to enforce the given specification.
In certain cases, robustness maximization subsumes the established paradigm of probability maximization and, under partial training, it can perform better in terms of both probability of satisfaction and expected robustness. Future research includes formally connecting our approach to abstractions of linear stochastic systems.

REFERENCES
[1] A. Abate, A. D’Innocenzo, and M. Di Benedetto. Approximate abstractions of stochastic hybrid systems. IEEE Transactions on Automatic Control, 56(11):2688–2694, Nov 2011.
[2] C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, Cambridge, 2008.
[3] T. Brazdil, K. Chatterjee, M. Chmelik, V. Forejt, J. Kretinsky, M. Kwiatkowska, D. Parker, and M. Ujma. Verification of Markov decision processes using learning algorithms. In F. Cassez and J.-F. Raskin, editors, Automated Technology for Verification and Analysis, volume 8837 of Lecture Notes in Computer Science, pages 98–114. Springer International Publishing, 2014.
[4] X. C. Ding, S. L. Smith, C. Belta, and D. Rus. Optimal control of Markov decision processes with linear temporal logic constraints. IEEE Transactions on Automatic Control, 59(5):1244–1257, 2014.
[5] X. C. Ding, J. Wang, M. Lahijanian, I. Paschalidis, and C. Belta. Temporal logic motion control using actor-critic methods. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 4687–4692, May 2012.
[6] A. Dokhanchi, B. Hoxha, and G. Fainekos. On-line monitoring for temporal logic robustness. In Runtime Verification, pages 231–246. Springer, 2014.
[7] A. Donzé and O. Maler. Robust satisfaction of temporal logic over real-valued signals. In Formal Modeling and Analysis of Timed Systems, pages 92–106, 2010.
[8] G. E. Fainekos and G. J. Pappas. Robustness of temporal logic specifications for continuous-time signals. Theoretical Computer Science, 410(42):4262–4291, 2009.
[9] J. Fu and U. Topcu. Probably approximately correct MDP learning and control with temporal logic constraints. CoRR, abs/1404.7073, 2014.
[10] X. Jin, A. Donzé, J. V. Deshmukh, and S. A. Seshia. Mining requirements from closed-loop control models. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, pages 43–52, 2013.
[11] A. Jones, Z. Kong, and C. Belta. Anomaly detection in cyber-physical systems: A formal methods approach. In IEEE Conference on Decision and Control (CDC), pages 848–853, 2014.
[12] A. Julius and G. Pappas. Approximations of stochastic hybrid systems. IEEE Transactions on Automatic Control, 54(6):1193–1203, June 2009.
[13] M. Kamgarpour, J. Ding, S. Summers, A. Abate, J. Lygeros, and C. Tomlin. Discrete time stochastic hybrid dynamic games: Verification and controller synthesis. In Proceedings of the 50th IEEE Conference on Decision and Control and European Control Conference, pages 6122–6127, 2011.
[14] Z. Kong, A. Jones, A. Medina Ayala, E. Aydin Gol, and C. Belta. Temporal logic inference for classification and prediction from data. In Proceedings of the 17th International Conference on Hybrid Systems: Computation and Control, pages 273–282. ACM, 2014.
[15] M. Lahijanian, S. Andersson, and C. Belta. Temporal logic motion planning and control with probabilistic satisfaction guarantees. IEEE Transactions on Robotics, 28(2):396–409, April 2012.
[16] M. Lahijanian, S. B. Andersson, and C. Belta. Approximate Markovian abstractions for linear stochastic systems. In Proc. of the IEEE Conference on Decision and Control, pages 5966–5971, Maui, HI, USA, Dec. 2012.
[17] M. Lahijanian, S. B. Andersson, and C. Belta. Formal verification and synthesis for discrete-time stochastic systems. IEEE Transactions on Automatic Control, 60(8):2031–2045, 2015.
[18] R. Luna, M. Lahijanian, M. Moll, and L. E. Kavraki. Asymptotically optimal stochastic motion planning with temporal goals. In Workshop on the Algorithmic Foundations of Robotics, Istanbul, Turkey, 2014.
[19] F. S. Melo. Convergence of Q-learning: a simple proof. http://users.isr.ist.utl.pt/mtjspaan/readingGroup/ProofQlearning.pdf.
[20] V. Raman, A. Donzé, M. Maasoumy, R. M. Murray, A. Sangiovanni-Vincentelli, and S. A. Seshia. Model predictive control with signal temporal logic specifications. In Proceedings of the IEEE Conference on Decision and Control (CDC), pages 81–87, 2014.
[21] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia. A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. CoRR, abs/1409.5486, 2014.
[22] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[23] M. Svorenova, J. Křetínský, M. Chmelik, K. Chatterjee, I. Černá, and C. Belta. Temporal logic control for stochastic linear systems using abstraction refinement of probabilistic games. In Hybrid Systems: Computation and Control (HSCC), 2015.
[24] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.