arXiv:1306.0386v4 [math.OC] 10 Feb 2016

Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Bruno Scherrer
INRIA Nancy Grand Est, Team MAIA
bruno.scherrer@inria.fr

October 29, 2018

Abstract

Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O\left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al. (2013), while Simplex-PI terminates after at most $O\left(\frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye (2011). Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $\gamma$: the quantities of interest are bounds $\tau_t$ and $\tau_r$—uniform over all states and policies—respectively on the expected time spent in transient states and on the inverse of the frequency of visits to recurrent states, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most $\tilde O(n^3 m^2 \tau_t \tau_r)$ iterations. This extends a recent result for deterministic MDPs by Post & Ye (2013), in which $\tau_t \le 1$ and $\tau_r \le n$; in particular it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively of states that are transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O(m(n^2\tau_t + n\tau_r))$ iterations.

1 Introduction

We consider a discrete-time dynamic system whose state transition depends on a control, where the state space $X$ is of finite size $n$. When at state $i \in \{1,\dots,n\}$, the action is chosen from a set of admissible actions $A_i \subset A$, where the action space $A$ is of finite size $m$ and the sets $(A_i)_{1 \le i \le n}$ form a partition of $A$. The action $a \in A_i$ specifies the transition probability $p_{ij}(a) = P(i_{t+1}=j \mid i_t=i,\ a_t=a)$ to the next state $j$. At each transition, the system is given a reward $r(i,a,j) \in \mathbb{R}$, where $r$ is the instantaneous reward function. In this context, we look for a stationary deterministic policy[1], that is a function $\pi : X \to A$ mapping states to admissible actions (for all $i$, $\pi(i) \in A_i$), that maximizes the expected discounted sum of rewards from any state $i$, called the value of policy $\pi$ at state $i$:
$$v_\pi(i) := E\left[\,\sum_{k=0}^{\infty} \gamma^k r(i_k, a_k, i_{k+1}) \ \middle|\ i_0 = i,\ \forall k \ge 0,\ a_k = \pi(i_k),\ i_{k+1} \sim P(\cdot \mid i_k, a_k)\right],$$
where $\gamma \in (0,1)$ is a discount factor. The tuple $\langle X, (A_i)_{i \in X}, p, r, \gamma \rangle$ is called a Markov Decision Process (MDP) (Puterman, 1994; Bertsekas and Tsitsiklis, 1996), and the associated problem is known as stochastic optimal control. The optimal value starting from state $i$ is defined as
$$v_*(i) := \max_\pi v_\pi(i).$$

[1] Restricting our attention to stationary deterministic policies is not a limitation. Indeed, for the optimality criterion to be defined soon, it can be shown that there exists at least one stationary deterministic policy that is optimal (Puterman, 1994).
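To fix ideas, the following minimal Python sketch builds a small random instance of such an MDP and evaluates a fixed policy $\pi$, i.e. computes $v_\pi$ by solving the linear system $v_\pi = r_\pi + \gamma P_\pi v_\pi$ recalled in the next section. The toy instance, the simplification that every state has the same two actions, and all identifiers are ours, for illustration only; they are not part of the paper.

```python
import numpy as np

# Toy MDP, for illustration only: n states, the same 2 actions in every state.
rng = np.random.default_rng(0)
n, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(n, n_actions))  # P[i, a, j] = p_ij(a)
R = rng.uniform(-1.0, 1.0, size=(n, n_actions))     # R[i, a] = sum_j p_ij(a) r(i, a, j)

def evaluate(pi):
    """Value v_pi of a deterministic policy pi (an array of n actions), obtained as
    the unique solution of the linear (Bellman) equation v = r_pi + gamma * P_pi v."""
    P_pi = P[np.arange(n), pi]                       # row i is p_i.(pi[i])
    r_pi = R[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

print(evaluate(np.zeros(n, dtype=int)))              # value of "always play action 0"
```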
For any policy $\pi$, we write $P_\pi$ for the $n \times n$ stochastic matrix whose elements are $p_{ij}(\pi(i))$, and $r_\pi$ for the vector whose components are $\sum_j p_{ij}(\pi(i))\, r(i,\pi(i),j)$. The value functions $v_\pi$ and $v_*$ can be seen as vectors on $X$. It is well known that $v_\pi$ is the solution of the following Bellman equation:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
that is, $v_\pi$ is a fixed point of the affine operator $T_\pi : v \mapsto r_\pi + \gamma P_\pi v$. It is also well known that $v_*$ satisfies the following Bellman equation:
$$v_* = \max_\pi\,(r_\pi + \gamma P_\pi v_*) = \max_\pi T_\pi v_*,$$
where the max operator is taken componentwise. In other words, $v_*$ is a fixed point of the nonlinear operator $T : v \mapsto \max_\pi T_\pi v$. For any value vector $v$, we say that a policy $\pi$ is greedy with respect to the value $v$ if it satisfies
$$\pi \in \arg\max_{\pi'} T_{\pi'} v,$$
or equivalently $T_\pi v = T v$. With some slight abuse of notation, we write $G(v)$ for any policy that is greedy with respect to $v$. The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy $\pi_*$ that is greedy with respect to the optimal value $v_*$ is an optimal policy, and its value $v_{\pi_*}$ is equal to $v_*$.

Let $\pi$ be some policy. For any policy $\pi'$, we consider the quantity
$$a^{\pi'}_\pi = T_{\pi'} v_\pi - v_\pi,$$
which measures the difference in value resulting from switching the first action to $\pi'$ with respect to always using $\pi$; we shall call it the advantage of $\pi'$ with respect to $\pi$. Furthermore, we call maximal advantage with respect to $\pi$ the componentwise best such advantage:
$$a_\pi = \max_{\pi'} a^{\pi'}_\pi = T v_\pi - v_\pi,$$
where the second equality follows from the very definition of the Bellman operator $T$. While the advantage $a^{\pi'}_\pi$ may have negative values, the maximal advantage $a_\pi$ has only non-negative values. We call set of switchable states of $\pi$ the set of states for which the maximal advantage with respect to $\pi$ is positive:
$$S_\pi = \{\, i,\ a_\pi(i) > 0 \,\}.$$
Assume now that $\pi$ is non-optimal (this implies that $S_\pi$ is a non-empty set). For any non-empty subset $Y$ of $S_\pi$, we denote by $\mathrm{switch}(\pi, Y)$ a policy satisfying
$$\forall i,\quad \mathrm{switch}(\pi,Y)(i) = \begin{cases} G(v_\pi)(i) & \text{if } i \in Y, \\ \pi(i) & \text{if } i \notin Y. \end{cases}$$
The following result is well known (see for instance Puterman (1994)).

Lemma 1. Let $\pi$ be some non-optimal policy. If $\pi' = \mathrm{switch}(\pi, Y)$ for some non-empty subset $Y$ of $S_\pi$, then $v_{\pi'} \ge v_\pi$ and there exists at least one state $i$ such that $v_{\pi'}(i) > v_\pi(i)$.

This lemma is the foundation of the well-known iterative procedure, called Policy Iteration (PI), that generates a sequence of policies $(\pi_k)$ as follows:
$$\pi_{k+1} \leftarrow \mathrm{switch}(\pi_k, Y_k) \quad \text{for some set } Y_k \text{ such that } \emptyset \subsetneq Y_k \subseteq S_{\pi_k}.$$
The choice of the subsets $Y_k$ leads to different variations of PI. In this paper we will focus on two of them (both are illustrated in the code sketch below):

• When for all iterations $k$, $Y_k = S_{\pi_k}$, that is, one switches the actions in all states with positive advantage with respect to $\pi_k$, the above algorithm is known as Howard's PI; it can then be seen that $\pi_{k+1} \in G(v_{\pi_k})$.

• When for all iterations $k$, $Y_k$ is a singleton containing a state $i_k \in \arg\max_i a_{\pi_k}(i)$, that is, if we only switch one action in the state with maximal advantage with respect to $\pi_k$, we will call it Simplex-PI[2].

Since it generates a sequence of policies with increasing values, any variation of PI converges to an optimal policy in a number of iterations that is smaller than the total number of policies. In practice, PI converges in very few iterations.
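Building on the toy MDP and the `evaluate` helper of the previous sketch (again, all identifiers are ours and not the paper's), here is one possible rendering of the two variants: `greedy_step` computes the maximal advantage $a_\pi$ together with a greedy policy $G(v_\pi)$, Howard's PI switches every state of $S_{\pi_k}$, and Simplex-PI switches only one state with maximal advantage.

```python
def greedy_step(pi):
    """Return the maximal advantage a_pi = T v_pi - v_pi and a greedy policy G(v_pi)."""
    v = evaluate(pi)
    Q = R + gamma * (P @ v)            # Q[i, a] = r(i, a) + gamma * sum_j p_ij(a) v(j)
    return Q.max(axis=1) - v, Q.argmax(axis=1)

def howard_pi(pi, tol=1e-12):
    """Howard's PI: switch the action in every state with a positive advantage."""
    while True:
        adv, greedy = greedy_step(pi)
        if adv.max() <= tol:           # no switchable state: pi is optimal
            return pi
        pi = np.where(adv > tol, greedy, pi)

def simplex_pi(pi, tol=1e-12):
    """Simplex-PI: switch only the action of one state with maximal advantage."""
    while True:
        adv, greedy = greedy_step(pi)
        if adv.max() <= tol:
            return pi
        pi = pi.copy()
        pi[int(adv.argmax())] = greedy[int(adv.argmax())]
```

On the toy instance, `howard_pi(np.zeros(n, dtype=int))` and `simplex_pi(np.zeros(n, dtype=int))` both reach an optimal policy, Simplex-PI typically using more (but individually cheaper) iterations; the sketch is only meant to make the two update rules concrete.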
On random MDP instances, convergence often occurs in time sub-linear in $n$. The aim of this paper is to discuss existing and provide new upper bounds on the number of iterations required by Howard's PI and Simplex-PI that are much sharper than $m^n$.

In the next sections, we describe some known results—see also Ye (2011) for a recent and comprehensive review—about the number of iterations required by Howard's PI and Simplex-PI, along with some of our original improvements and extensions. For clarity, all proofs are deferred to the later sections.

2 Bounds with respect to a fixed discount factor γ < 1

A key observation for both algorithms, which will be central to the results we are about to discuss, is that the sequences they generate satisfy some contraction property[3]. For any vector $u \in \mathbb{R}^n$, let $\|u\|_\infty = \max_{1 \le i \le n} |u(i)|$ be the max-norm of $u$. Let $\mathbf{1}$ be the vector of which all components are equal to 1.

Lemma 2 (e.g. Puterman (1994), proof in Section 5). The sequence $(\|v_* - v_{\pi_k}\|_\infty)_{k \ge 0}$ built by Howard's PI is contracting with coefficient $\gamma$.

Lemma 3 (Ye (2011), proof in Section 6). The sequence $(\mathbf{1}^T(v_* - v_{\pi_k}))_{k \ge 0}$ built by Simplex-PI is contracting with coefficient $1 - \frac{1-\gamma}{n}$.

Contraction is a widely known property for Howard's PI, and it was to our knowledge first proved by Ye (2011) for Simplex-PI; we provide simple proofs in this paper for the sake of completeness. While the first contraction property is based on the $\|\cdot\|_\infty$-norm, the second can be equivalently expressed in terms of the $\|\cdot\|_1$-norm defined by $\|u\|_1 = \sum_{i=1}^n |u(i)|$, since the vectors $v_* - v_{\pi_k}$ are non-negative and thus satisfy $\mathbf{1}^T(v_* - v_{\pi_k}) = \|v_* - v_{\pi_k}\|_1$. Contraction has the following immediate consequence[4].

Corollary 1. Let $V_{\max} = \frac{\max_\pi \|r_\pi\|_\infty}{1-\gamma}$ be an upper bound on $\|v_\pi\|_\infty$ for all policies $\pi$. In order to get an $\epsilon$-optimal policy, that is a policy $\pi_k$ satisfying $\|v_* - v_{\pi_k}\|_\infty \le \epsilon$, Howard's PI requires at most $\left\lceil \frac{\log(V_{\max}/\epsilon)}{1-\gamma} \right\rceil$ iterations, while Simplex-PI requires at most $\left\lceil \frac{n \log(n V_{\max}/\epsilon)}{1-\gamma} \right\rceil$ iterations.

These bounds depend on the precision term $\epsilon$, which means that Howard's PI and Simplex-PI are weakly polynomial for a fixed discount factor $\gamma$. An important breakthrough was recently achieved by Ye (2011), who proved that one can remove the dependency with respect to $\epsilon$, and thus show that Howard's PI and Simplex-PI are strongly polynomial for a fixed discount factor $\gamma$.

Theorem 1 (Ye (2011)). Simplex-PI and Howard's PI both terminate after at most
$$(m-n)\left\lceil \frac{n}{1-\gamma} \log\left(\frac{n^2}{1-\gamma}\right) \right\rceil = O\left(\frac{mn}{1-\gamma}\log\frac{n}{1-\gamma}\right)$$
iterations.

The proof is based on the fact that PI corresponds to the simplex algorithm in a linear programming formulation of the MDP problem. Using a more direct proof—not based on linear programming arguments—Hansen et al. (2013) recently improved the result by a factor $O(n)$ for Howard's PI.

[2] In this case, PI is equivalent to running the simplex algorithm with the highest-pivot rule on a linear program version of the MDP problem (Ye, 2011).

[3] A sequence of non-negative numbers $(x_k)_{k \ge 0}$ is contracting with coefficient $\alpha$ if and only if for all $k \ge 0$, $x_{k+1} \le \alpha x_k$.

[4] For Howard's PI, we have $\|v_* - v_{\pi_k}\|_\infty \le \gamma^k \|v_* - v_{\pi_0}\|_\infty \le \gamma^k V_{\max}$. Thus, a sufficient condition for $\|v_* - v_{\pi_k}\|_\infty < \epsilon$ is $\gamma^k V_{\max} < \epsilon$, which is implied by $k \ge \frac{\log(V_{\max}/\epsilon)}{1-\gamma} > \frac{\log(V_{\max}/\epsilon)}{\log(1/\gamma)}$. For Simplex-PI, we have $\|v_* - v_{\pi_k}\|_\infty \le \|v_* - v_{\pi_k}\|_1 \le \left(1 - \frac{1-\gamma}{n}\right)^k \|v_* - v_{\pi_0}\|_1 \le \left(1 - \frac{1-\gamma}{n}\right)^k n V_{\max}$, and the conclusion is similar to that for Howard's PI.
Theorem 2 (Hansen et al. (2013)). Howard's PI terminates after at most
$$(m+1)\left\lceil \frac{1}{1-\gamma}\log\left(\frac{n}{1-\gamma}\right)\right\rceil = O\left(\frac{m}{1-\gamma}\log\frac{n}{1-\gamma}\right)$$
iterations.

Our first results, which are consequences of the contraction property of Howard's PI (Lemma 2), are stated in the following theorems.

Theorem 3 (Proof in Section 7). Howard's PI terminates after at most
$$(m-n)\left\lceil \frac{1}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right\rceil = O\left(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\right)$$
iterations.

Theorem 4 (Proof in Section 8). Simplex-PI terminates after at most
$$n(m-n)\left(1 + \frac{2}{1-\gamma}\log\frac{1}{1-\gamma}\right) = O\left(\frac{mn}{1-\gamma}\log\frac{1}{1-\gamma}\right)$$
iterations.

Both results are a factor $O(\log n)$ better than the previously known results provided by Hansen et al. (2013) and Ye (2011). These improvements boil down to the use of the $\|\cdot\|_\infty$-norm instead of the $\|\cdot\|_1$-norm at various points of the previous analyses. For Howard's PI, the resulting arguments constitute a rather simple extension—the overall line of analysis ends up being very simple, and we consequently believe that it could be part of an elementary course on Policy Iteration; note that a similar improvement and analysis was discovered independently by Akian and Gaubert (2013) in a slightly more general setting. For Simplex-PI, however, the line of analysis is slightly trickier: it amounts to bounding the improvement in value at individual states and requires a bit of bookkeeping; the technique we use is to our knowledge original.

The bound for Simplex-PI is a factor $O(n)$ larger than that for Howard's PI[5]. However, since one changes only one action per iteration, each iteration has a complexity that is in a worst-case sense lower by a factor $n$: the update of the value can be done in time $O(n^2)$ through the Sherman-Morrison formula, while in general each iteration of Howard's PI, which amounts to computing the value of some policy that may be arbitrarily different from the previous policy, may require $O(n^3)$ time. Thus, it is remarkable that both algorithms seem to have a similar complexity.

The linear dependency of the bound for Howard's PI with respect to $m$ is optimal (Hansen, 2012, Chapter 6.4). The linear dependency with respect to $n$ or $m$ (separately) is easy to prove for Simplex-PI; we conjecture that Simplex-PI's complexity is proportional to $nm$, and thus that our bound is tight for a fixed discount factor. The dependency with respect to the term $\frac{1}{1-\gamma}$ may be improved, but removing it is impossible for Howard's PI and very unlikely for Simplex-PI. Fearnley (2010) describes an MDP for which Howard's PI requires an exponential (in $n$) number of iterations for $\gamma = 1$, and Hollanders et al. (2012) argued that this also holds when $\gamma$ is in the vicinity of 1. Though a similar result does not seem to exist for Simplex-PI in the literature, Melekopoglou and Condon (1994) consider four variations of PI that all switch one action per iteration, and show through specifically designed MDPs that they may require an exponential (in $n$) number of iterations when $\gamma = 1$.

3 Bounds for Simplex-PI that are independent of γ

In this section, we will describe some bounds that do not depend on $\gamma$ but that will be based on some structural properties of the MDP. On this topic, Post and Ye (2013) recently showed the following result for deterministic MDPs.

Theorem 5 (Post and Ye (2013)).
If the MDP is deterministic, then Simplex-PI terminates after at most O ( n 3 m 2 log 2 n ) iterations. 5 Note that it was also the case in Corollary 1. 4 Given a policy π of a deterministic MDP, states are either on cycles or on paths induced by π . The core of the proof relies on the following lemmas that altogether show that cycles are created regularly and that significant progress is made every time a new cycle appears; in other words, significant progress is made regularly. Lemma 4 (Post and Ye (2013, Lemma 3.4)) . If the MDP is deterministic, after O ( n 2 m log n ) iterations, either Simplex-PI finishes or a new cycle appears. Lemma 5 (Post and Ye (2013, Lemma 3.5)) . If the MDP is deterministic, when Simplex-PI moves from π to π ′ where π ′ involves a new cycle, we have 1 T ( v π ∗ − v π ′ ) ≤ ( 1 − 1 n ) 1 T ( v π ∗ − v π ) . Indeed, these observations suffice to prove 6 that Simplex-PI terminates after O ( n 2 m 2 log n 1 − γ ). Com- pletely removing the dependency with respect to the discount factor γ —the term in O (log 1 1 − γ )—requires a careful extra work described in Post and Ye (2013), which incurs an extra term of order O ( n log( n )). The main result of this section is to show how these results can be extended to a more general setting. While Ye (2011) reason on states that belong to paths and cycles induced by policies on deterministic MDPs, we shall consider their natural generalization for stochastic MDPs: transient states and recurrent classes induced by policies. Precisely, we are going to consider bounds—uniform on all policies and states—of the average time 1) spent in transient states and 2) needed to revisit states in recurrent classes. For any policy π and state i , denote τ π ( i, t ) the expected cumulative time spent in state i until time t − 1 given than the process starts from the uniform distribution U on X and takes actions according to π : τ π ( i, t ) = E [ t − 1 ∑ k =0 1 i t = i | i 0 ∼ U, a t = π ( i t ) ] = t − 1 ∑ k =0 P ( i t = i | i 0 ∼ U, a t = π ( i t )) , where 1 denotes the indicator function. In addition, consider the vector μ π on X providing the asymptotic frequency in all states given that policy π is used and that the process starts from the uniform distribution U : ∀ i, μ π ( i ) = lim t →∞ 1 t τ π ( i, t ) . When the Markov chain induced by π is ergodic, and thus admits a unique stationary distribution, μ π is equal to this very stationary distribution. However, our definition is more general in that policies may induce Markov chains with aperiodicity and/or multiple recurrent classes. For any state i that is transient for the Markov chain induced by π , it is well known that lim t →∞ τ π ( i, t ) < ∞ and μ π ( i ) = 0. However, for any recurrent state i , we know that lim t →∞ τ π ( i, t ) = ∞ and μ π ( i ) > 0; in particular, if i belongs to some recurrent class R , which is reached with probability q from the uniform distribution U , then q μ π ( i ) is the expected time between two visits of the state i . We are now ready to express the structural properties with which we can provide an extension of the analysis of Post and Ye (2013). Definition 1. Let τ t and τ r be the smallest finite constants such that for all policies π and states i , if i is transient for π , then lim t →∞ τ π ( i, t ) ≤ τ t else if i is recurrent for π , then 1 μ π ( i ) ≤ τ r . Note that for any finite MDP, these finite constants always exist. With Definition 1 in hand, we can generalize Lemmas 4-5 as follows. Lemma 6. 
After at most ( m − n ) ⌈ n 2 τ t log( n 2 τ t ) ⌉ + n ⌈ n 2 τ t log( n 2 ) ⌉ iterations either Simplex-PI finishes or a new recurrent class appears. 6 This can be done by using arguments similar those for Theorem 1 (see Ye (2011) for details). 5 Lemma 7. When Simplex-PI moves from π to π ′ where π ′ involves a new recurrent class, we have 1 T ( v π ∗ − v π ′ ) ≤ ( 1 − 1 nτ r ) 1 T ( v π ∗ − v π ) . From these generalized observations, we can deduce the following original result. Theorem 6 (Proof in Section 9) . Simplex-PI terminates after at most [ m ⌈ nτ r log( n 2 τ r ) ⌉ + ( m − n ) ⌈ nτ r log( n 2 τ t ) ⌉ ] [ ( m − n ) ⌈ n 2 τ t log( n 2 τ t ) ⌉ + n ⌈ n 2 τ t log( n 2 ) ⌉ ] = ̃ O ( n 3 m 2 τ t τ r ) iterations. Remark 1. This new result extends the result obtained for deterministic MDPs by Post and Ye (2013) recalled in Theorem 5. In the deterministic case, it is easy to see that τ t = 1 and τ r ≤ n . Then, while Lemma 6 is a strict generalization of Lemma 4, Lemma 7 provides a contraction factor that is slightly weaker than that of Lemma 5— ( 1 − 1 n 2 ) instead of ( 1 − 1 n ) —, which makes the resulting bound provided in Theorem 6 a factor O ( n ) worse than that of Theorem 5. This extra term in the bound is the price paid for making the constant τ r (and the vector μ π ) independent of the discount factor γ , that is by presenting our result in a way that only depends on the dynamics of the underlying MDP. An analysis that would strictly generalizes that of Ye (2011) can be done under a variation of Definition 1 where the constants τ t and τ r depend on the discount factor 7 γ . An immediate consequence of the above result is that Simplex-PI is strongly polynomial for sets of MDPs that are much larger than the deterministic MDPs mentioned in Theorem 5. Corollary 2. For any family of MDPs indexed by n and m such that τ t and τ r are polynomial functions of n and m , Simplex-PI terminates after a number of steps that is polynomial in n and m . 4 Similar results for Howard’s PI? One may then wonder whether similar results can be derived for Howard’s PI. Unfortunately, and as briefly mentioned by Post and Ye (2013), the line of analysis developed for Simplex-PI does not seem to adapt easily to Howard’s PI, because simultaneously switching several actions can interfere in a way such that the policy improvement turns out to be small. We can be more precise on what actually breaks in the approach we have described so far. On the one hand, it is possible to write counterparts of Lemmas 4 and 6 for Howard’s PI (see Section 10 for proofs). Lemma 8. If the MDP is deterministic, after at most n iterations, either Howard’s PI finishes or a new cycle appears. Lemma 9. After at most ( m − n ) ⌈ n 2 τ t log( n 2 τ t ) ⌉ + n ⌈ n 2 τ t log( n 2 ) ⌉ iterations, either Howard’s PI finishes or a new recurrent class appears. On the other hand, we did not manage to adapt Lemma 5 nor Lemma 7. In fact, it is unlikely that a result similar to that of Lemma 5 will be shown to hold for Howard’s PI. In a recent deterministic example due to Hansen and Zwick (2010) to show that Howard’s PI may require at least Ω( n 2 ) iterations, new 7 Define the following γ -discounted variation of τ π ( i, t ): τ π γ ( i, t ) = E [∑ t − 1 k =0 γ k 1 i t = i | i 0 ∼ U, a t = π ( i t ) ] = ∑ t − 1 k =0 γ k P ( i t = i | i 0 ∼ U, a t = π ( i t )) and τ π γ ( i ) = lim t →∞ τ π γ ( i, t ). 
Assume that we have constants τ γ t , and τ γ r such that for every policy π , τ π γ ( i ) ≤ τ γ t if i is a transient state for π , and 1 (1 − γ ) τ π γ ( i ) ≤ τ γ r if i is recurrent for π . Then, one can derive a bound similar to that of Theorem 6 where τ t and τ r are respectively replaced by τ γ t and τ γ r n . At a more technical level, our analysis begins by removing the dependency with respect to γ : Lemma 11, page 11, shows that for every policy π , τ π γ ( i ) ≤ τ t if i is a transient state for π , and 1 (1 − γ ) τ π γ ( i ) ≤ nτ r if i is recurrent for π (this is where we pay the O ( n ) term because the upper bound is nτ r instead of τ γ r ); we then follow the line of arguments originally given by Post and Ye (2013), though our more general setting induces a few technicalities (in particular in the second part of the proof of Lemma 13 page 13). 6 cycles are created every single iteration but the sequence of values satisfies 8 for all iterations k < n 2 4 + n 4 and states i , v ∗ ( i ) − v π k +1 ( i ) ≥ [ 1 − ( 2 n ) k ] ( v ∗ ( i ) − v π k ( i )) . Contrary to Lemma 5, as k grows, the amount of contraction gets (exponentially) smaller and smaller. With respect to Simplex-PI, this suggests that Howard’s PI may suffer from subtle specific pathologies. In fact, the problem of determining the number of iterations required by Howard’s PI has been challenging for almost 30 years. It was originally identified as an open problem by Schmitz (1985). In the simplest— deterministic—case, the complexity is still an open problem: the currently best-known lower bound is O ( n 2 ) (Hansen and Zwick, 2010), while the best known upper bound is O ( m n n ) Mansour and Singh (1999); Hollanders et al. (2014). On the positive side, an adaptation of the line of proof we have considered so far can be carried out under the following assumption. Assumption 1. The state space X can be partitioned in two sets T and R such that for all policies π , the states of T are transient and those of R are recurrent. Under this additional assumption, we can deduce the following original bounds. Theorem 7 (Proof in Section 11) . If the MDP satisfies Assumption 1, then Howard’s PI and Simplex-PI terminate after at most ( m − n ) ( ⌈ nτ r log n 2 τ r ⌉ + ⌈ n 2 τ t log n 2 τ t ⌉ ) = ̃ O ( mn ( n 2 τ t + nτ r )) iterations. It should however be noted that Assumption 1 is rather restrictive. It implies that the algorithms converge on the recurrent states independently of the transient states, and thus the analysis can be decomposed in two phases: 1) the convergence on recurrent states and then 2) the convergence on transient states (given that recurrent states do not change anymore). The analysis of the first phase (convergence on recurrent states) is greatly facilitated by the fact that in this case, a new recurrent class appears every single iteration (this is in contrast with Lemmas 4, 6, 8 and 9 that were designed to show under which conditions cycles and recurrent classes are created). Furthermore, the analysis of the second phase (convergence on transient states) is similar to that of the discounted case of Theorems 3 and 4. In other words, this last result sheds some light on the practical efficiency of Howard’s PI and Simplex-PI, and a general analysis of Howard’s PI is still largely open, and constitutes intriguing future work. The following sections contains detailed proofs of Lemmas 2 and 3, Theorems 3, 4, and 6, Lemmas 8 and 9, and finally Theorem 7. 
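Before turning to the proofs, the following sketch (ours, not from the paper) gives a rough numerical handle on the quantities of Definition 1 for a single policy: the expected total time spent in each transient state, i.e. $\lim_{t\to\infty}\tau^\pi(i,t)$, and the inverse asymptotic frequency $1/\mu_\pi(i)$ of each recurrent state, both from the uniform initial distribution; $\tau_t$ and $\tau_r$ are then obtained by maximizing these quantities over all states and policies. The finite horizon `T`, the threshold used to classify states, and all identifiers are our own approximations for illustration, not the paper's construction; the toy MDP is the one built above.

```python
def occupation_profile(P_pi, T=50_000, tol=1e-8):
    """Approximate, for one policy pi with transition matrix P_pi, the per-state
    quantities behind Definition 1: lim_t tau_pi(i, t) on transient states and
    1 / mu_pi(i) on recurrent states, starting from the uniform distribution.
    Truncating at horizon T and thresholding at tol are crude approximations."""
    n = P_pi.shape[0]
    d = np.full(n, 1.0 / n)             # distribution of i_t, starting uniform
    occ = np.zeros(n)                   # running sum of P(i_t = i), t = 0 .. T-1
    for _ in range(T):
        occ += d
        d = d @ P_pi
    mu = occ / T                        # Cesaro average ~ asymptotic frequencies mu_pi
    recurrent = mu > tol
    return occ[~recurrent], 1.0 / mu[recurrent]

# Example on the toy MDP above, for the policy that always plays action 0
# (on this fully ergodic random instance, the transient part is typically empty):
print(occupation_profile(P[np.arange(n), np.zeros(n, dtype=int)]))
```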
Before we start, we provide a particularly useful identity relating the difference between the values of two policies π and π ′ and the relative advantage a π ′ π . Lemma 10. For all pairs of policies π and π ′ , v π ′ − v π = ( I − γP π ′ ) − 1 a π ′ π = ( I − γP π ) − 1 ( − a π π ′ ) . Proof. This first identity follows from simple linear algebra arguments: v π ′ − v π = ( I − γP π ′ ) − 1 r π ′ − v π { v π ′ = T π ′ v π ′ ⇔ v π ′ = ( I − γP π ′ ) − 1 r π ′ } = ( I − γP π ′ ) − 1 ( r π ′ + γP π ′ v π − v π ) = ( I − γP π ′ ) − 1 ( T π ′ v π − v π ) . The second identity follows by symmetry. 8 This MDP has an even number of states n = 2 p . The goal is to minimize the long term expected cost. The optimal value function satisfies v ∗ ( i ) = − p N for all i , with N = p 2 + p . The policies generated by Howard’s PI have values v π k ( i ) ∈ ( p N − k − 1 , p N − k ). We deduce that for all iterations k and states i , v ∗ ( i ) − v πk +1 ( i ) v ∗ ( i ) − v πk ( i ) ≥ 1+ p − k − 2 1+ p − k = 1 − p − k − p − k − 2 1+ p − k ≥ 1 − p − k (1 − p − 2 ) ≥ 1 − p − k . 7 We will repeatedly use the following property: since for any policy π , the matrix (1 − γ )( I − γP ) − 1 = (1 − γ ) ∑ ∞ t =0 ( γP π ) t is a stochastic matrix (as a mixture of stochastic matrices), then ‖ ( I − γP ) − 1 ‖ ∞ = 1 1 − γ , where ‖ · ‖ ∞ is the natural induced max-norm on matrices. Finally, for any vector/matrix A and any number λ , we shall use the notation “ A ≥ λ ” (respectively “ A ≤ λ ”) for denoting the fact that “all the coefficients of A are greater or equal to (respectively smaller or equal to) λ ”. 5 Contraction property for Howard’s PI (Proof of Lemma 2) For any k , we have v π ∗ − v π k = T π ∗ v π ∗ − T π ∗ v π k − 1 + T π ∗ v π k − 1 − T π k v π k − 1 + T π k v π k − 1 − T π k v π k {∀ π, T π v π = v π } ≤ γP π ∗ ( v π ∗ − v π k − 1 ) + γP π k ( v π k − 1 − v π k ) { T π ∗ v π k − 1 ≤ T π k v π k − 1 } ≤ γP π ∗ ( v π ∗ − v π k − 1 ) . { Lemma 1 and P π k ≥ 0 } Since v π ∗ − v π k is non-negative, we can take the max-norm and get: ‖ v π ∗ − v π k ‖ ∞ ≤ γ ‖ v π ∗ − v π k − 1 ‖ ∞ . 6 Contraction property for Simplex-PI (Proof of Lemma 3) The proof we provide here is very close to the one given by Ye (2011). We provide it here for completeness, and also because it resembles the proofs we will provide for the bounds that are independent of γ . On the one hand, using Lemma 10, we have for any k : v π k +1 − v π k = ( I − γP π k +1 ) − 1 a π k +1 π k ≥ a π k +1 π k , { ( I − γP π k +1 ) − 1 − I ≥ 0 and a π k +1 π k ≥ 0 } which implies, by left multiplying by the vector 1 T , that 1 T ( v π k +1 − v π k ) ≥ 1 T a π k +1 π k . (1) On the other hand, we have: v π ∗ − v π k = ( I − γP π ∗ ) − 1 a π ∗ π k { Lemma 10 } ≤ 1 1 − γ max s a π k +1 π k ( s ) {‖ ( I − γP π ∗ ) − 1 ‖ ∞ = 1 1 − γ and max s a π k +1 π k ( s ) = max s,π a π π k ( s ) ≥ 0 } ≤ 1 1 − γ 1 T a π k +1 π k , {∀ x ≥ 0 , max s x ( s ) ≤ 1 T x } which implies that 1 T a π k +1 π k ≥ (1 − γ ) ‖ v π ∗ − v π k ‖ ∞ ≥ 1 − γ n 1 T ( v π ∗ − v π k ) . {∀ x, 1 T x ≤ n ‖ x ‖ ∞ } (2) Combining Equations (1) and (2), we get: 1 T ( v π ∗ − v π k +1 ) = 1 T ( v π ∗ − v π k ) − 1 T ( v π k +1 − v π k ) ≤ 1 T ( v π ∗ − v π k ) − 1 − γ n 1 T ( v π ∗ − v π k ) = ( 1 − 1 − γ n ) 1 T ( v π ∗ − v π k ) . 8 7 A bound for Howard’s PI when γ < 1 (Proof of Theorem 3) Although the overall line or arguments follows from those given originally by Ye (2011) and adapted by Hansen et al. (2013), our proof is slightly more direct and leads to a better result. 
For any k , we have: − a π k π ∗ = ( I − γP π k )( v ∗ − v π k ) { Lemma 10 } ≤ v ∗ − v π k . { v ∗ − v π k ≥ 0 and P π k ≥ 0 } By the optimality of π ∗ , − a π k π ∗ is non-negative, and we can take the max-norm: ‖ a π k π ∗ ‖ ∞ ≤ ‖ v ∗ − v π k ‖ ∞ ≤ γ k ‖ v π ∗ − v π 0 ‖ ∞ { Lemma 2 } = γ k ‖ ( I − γP π 0 ) − 1 ( − a π 0 π ∗ ) ‖ ∞ { Lemma 10 } ≤ γ k 1 − γ ‖ a π 0 π ∗ ‖ ∞ . {‖ ( I − γP π 0 ) − 1 ‖ ∞ = 1 1 − γ } By definition of the max-norm, and as a π 0 π ∗ ≤ 0 (using again the fact that π ∗ is optimal), there exists a state s 0 such that − a π 0 π ∗ ( s 0 ) = ‖ a π 0 π ∗ ‖ ∞ . We deduce that for all k , − a π k π ∗ ( s 0 ) ≤ ‖ a π k π ∗ ‖ ∞ ≤ γ k 1 − γ ‖ a π 0 π ∗ ‖ ∞ = γ k 1 − γ ( − a π 0 π ∗ ( s 0 )) . As a consequence, the action π k ( s 0 ) must be different from π 0 ( s 0 ) when γ k 1 − γ < 1, that is for all values of k satisfying k ≥ k ∗ = ⌈ log 1 1 − γ 1 − γ ⌉ > ⌈ log 1 1 − γ log 1 γ ⌉ . In other words, if some policy π is not optimal, then one of its non-optimal actions will be eliminated for good after at most k ∗ iterations. By repeating this argument, one can eliminate all non-optimal actions (there are at most n − m of them), and the result follows. 8 A bound for Simplex-PI when γ < 1 (Proof of Theorem 4) At each iteration k , let s k be the state in which an action is switched. We have (by definition of Simplex-PI): a π k +1 π k ( s k ) = max π,s a π π k ( s ) . Starting with arguments similar to those for the contraction property of Simplex-PI, we have on the one hand: v π k +1 − v π k = ( I − γP π k +1 ) − 1 a π k +1 π k { Lemma 10 } ≥ a π k +1 π k , { ( I − γP π k +1 ) − 1 − I ≥ 0 and a π k +1 π k ≥ 0 } which implies that v π k +1 ( s k ) − v π k ( s k ) ≥ a π k +1 π k ( s k ) . (3) On the other hand, we have: v π ∗ − v π k = ( I − γP π ∗ ) − 1 a π ∗ π k { Lemma 10 } ≤ 1 1 − γ a π k +1 π k ( s k ) {‖ ( I − γP π ∗ ) − 1 ‖ ∞ = 1 1 − γ and a π k +1 π k ( s k ) = max s,π a π π k ( s ) ≥ 0 } 9 which implies that ‖ v π ∗ − v π k ‖ ∞ ≤ 1 1 − γ a π k +1 π k ( s k ) . (4) Write ∆ k = v π ∗ − v π k . From Equations (3) and (4), we deduce that: ∆ k +1 ( s k ) ≤ ∆ k ( s k ) − (1 − γ ) ‖ ∆ k ‖ ∞ = ( 1 − (1 − γ ) ‖ ∆ k ‖ ∞ ∆ k ( s k ) ) ∆ k ( s k ) . This implies—since ∆ k ( s k ) ≤ ‖ ∆ k ‖ ∞ —that ∆ k +1 ( s k ) ≤ γ ∆ k ( s k ) , but also—since ∆ k ( s k ) and ∆ k +1 ( s k ) are non-negative and thus ( 1 − (1 − γ ) ‖ ∆ k ‖ ∞ ∆ k ( s k ) ) ≥ 0—that ‖ ∆ k ‖ ∞ ≤ 1 1 − γ ∆ k ( s k ) . Now, write n k for the vector on the state space such that n k ( s ) is the number of times state s has been switched until iteration k (including k ). Since by Lemma 1 the sequence (∆ k ) k ≥ 0 is non-increasing, we have ‖ ∆ k ‖ ∞ ≤ 1 1 − γ ∆ k ( s k ) ≤ γ n k − 1 ( s k ) 1 − γ ∆ 0 ( s k ) ≤ γ n k − 1 ( s k ) 1 − γ ‖ ∆ 0 ‖ ∞ . (5) At any iteration k , let s ∗ k = arg max s n k − 1 ( s ) be the state in which actions have been switched the most. Since at each iteration k , one of the n components of n k is increased by 1, we necessarily have n k − 1 ( s ∗ k ) ≥ ⌊ k − 1 n ⌋ ≥ k − n n . (6) Write k ∗ ≤ k − 1 for the last iteration when the state s ∗ k was updated, such that we have n k − 1 ( s ∗ k ) = n k ∗ − 1 ( s k ∗ ) . (7) Since ( ‖ ∆ k ‖ ∞ ) k ≥ 0 is nonincreasing (using again Lemma 1), we have ‖ ∆ k ‖ ∞ ≤ ‖ ∆ k ∗ ‖ ∞ { k ∗ ≤ k − 1 } ≤ γ n k ∗ − 1 ( s k ∗ ) 1 − γ ‖ ∆ 0 ‖ ∞ { Equation (5) } = γ n k − 1 ( s ∗ k ) 1 − γ ‖ ∆ 0 ‖ ∞ { Equation (7) } ≤ γ k − n n 1 − γ ‖ ∆ 0 ‖ ∞ . { Equation (6) and x 7 → γ x is decreasing } We are now ready to finish the proof. 
By using arguments similar to those for Howard’s PI, we have: ‖ a π k π ∗ ‖ ∞ ≤ ‖ ∆ k ‖ ∞ ≤ γ k − n n 1 − γ ‖ ∆ 0 ‖ ∞ ≤ γ k − n n (1 − γ ) 2 ‖ a π 0 π ∗ ‖ ∞ . In particular, we can deduce from the above relation that as soon as γ k − n n (1 − γ ) 2 < 1, that is for instance when k > k ∗ = n ( 1 + 2 1 − γ log 1 1 − γ ) , one of the non-optimal actions of π 0 cannot appear in π k . Thus, every k ∗ iterations, a non-optimal action is eliminated for good, and the result follows from the fact that there are at most n − m non-optimal actions. 10 9 A general bound for Simplex-PI (Proof of Theorem 6) The proof we give here is strongly inspired by that for the deterministic case of Post and Ye (2013): the steps (a series of lemmas) are similar. There are mainly two differences. First, our arguments are more direct in the sense that we do not refer to linear programming, but only provide simple linear algebra arguments. Second, it is more general : for any policy π , we consider the set of transient states (respectively recurrent classes) instead of the set of path states (respectively cycles); it slightly complicates the arguments, the most complicated extension being the second part of the proof of the forthcoming Lemma 13. Consider the vector x π = ( I − γP T π ) − 1 1 that provides a discounted measure of state visitations along the trajectories induced by a policy π starting from the uniform distribution U on the state space X : ∀ i ∈ X, x π ( i ) = n ∞ ∑ t =0 γ t P ( i t = i | i 0 ∼ U, a t = π ( i t )) . This vector plays a crucial role in the analysis. For any policy π and state i , we trivially have x π ( i ) ∈ ( 1 , n 1 − γ ) . In the case of deterministic MDPs, Post and Ye (2013)) exploits the fact that x π ( i ) belongs to the set (1 , n ) when i is on path of π , while x π ( i ) belongs to the set ( 1 1 − γ , n 1 − γ ) when i is on a cycle of π . Our extension of their result to the case of general (stochastic) MDPs will rely on the following result. For any policy π , we shall write R ( π ) for the set of states that are recurrent for π . Lemma 11. With the constants τ t and τ r of Definition 1, we have for every discount factor γ , ∀ i 6 ∈ R ( π ) , 1 ≤ x π ( i ) ≤ nτ t (8) ∀ i ∈ R ( π ) , 1 τ r ≤ (1 − γ ) x π ( i ) ≤ n. (9) Proof. The fact that x π ( i ) belongs to ( 1 , n 1 − γ ) is obvious from the definition of x π . The upper bound on x π on the transient states i follows from the fact that for any policy π , τ t ( i ) ≥ lim t →∞ τ π ( i, t ) = ∞ ∑ k =0 P ( i t = i | i 0 ∼ U, a t = π ( i t )) ≥ ∞ ∑ k =0 γ k P ( i t = i | i 0 ∼ U, a t = π ( i t )) = 1 n x π ( i ) . Let us now consider the lower bound on (1 − γ ) x π ( i ) when i is a recurrent state of some policy π . In general, the asymptotic frequency μ π of π does not necessarily satisfy μ π T P π = P π because P π may correspond to an aperiodic or reducible chain. To deal with this issue, we consider the Ces` aro mean Q π = lim t →∞ 1 t t − 1 ∑ k =0 ( P π ) k that is well-defined (Stroock, 2005, Section 3.2). It can be shown (Fritz et al. , 1979, Proposition 3.5(a)) that Q π = Q π P π = P π Q π = Q π Q π . This implies in particular that (1 − γ ) Q π ( I − γP π ) − 1 = (1 − γ ) ∞ ∑ k =0 γ k Q π ( P π ) k = (1 − γ ) ∞ ∑ k =0 γ k Q π = Q π . 
(10) 11 Then, by using twice the fact that μ π = 1 n Q π T 1 , we can see that for all recurrent states i , 1 τ r ≤ μ π ( i ) = [ 1 n Q πT 1 ] ( i ) = [ 1 n (1 − γ )( I − γP πT ) − 1 Q π T 1 ] ( i ) { Equation (10) } = [ (1 − γ )( I − γP πT ) − 1 μ π ] ( i ) ≤ [ (1 − γ )( I − γP πT ) − 1 1 ] ( i ) { μ π ≤ 1 } = (1 − γ ) x π ( i ) . Finally, a rewriting of Lemma 10 in terms of the vector x π will be useful in the following proofs: for any pair of policies π and π ′ , 1 T ( v π ′ − v π ) = x π ′ T a π ′ π = x π T ( − a π π ′ ) . (11) We are now ready to delve into the details of the arguments. As mentioned before, the proof is structured in two steps: first, we will show that recurrent classes are created often; then we will show that significant progress is made every time a new recurrent class appears. 9.1 Part 1: Recurrent classes are created often Lemma 12. Suppose one moves from policy π to policy π ′ without creating any recurrent class . Let π † be the final policy before either a new recurrent class appears or Simplex-PI terminates. Then 1 T ( v π † − v π ′ ) ≤ ( 1 − 1 n 2 τ t ) 1 T ( v π † − v π ) . Proof. The arguments are similar to those for the proof of Theorem 4. On the one hand, we have: 1 T ( v π ′ − v π ) ≥ 1 T a π ′ π . (12) On the other hand, we have 1 T ( v π † − v π ) = x T π † a π † π { Equation (11) } = ∑ s 6 ∈R ( π † ) x π † ( s ) a π † π ( s ) + ∑ s ∈R ( π † ) x π † ( s ) a π † π ( s ) ≤ n 2 τ t max s 6 ∈R ( π † ) a π † π ( s ) + n 2 1 − γ max s ∈R ( π † ) a π † π ( s ) . { Equations (8)-(9) } Since by assumption recurrent classes of π † are also recurrent classes of π , we deduce that for all s ∈ R ( π † ), π † ( s ) = π ( s ), so that max s ∈R ( π † ) a π † π ( s ) = 0. Thus, the second term of the above r.h.s. is null and 1 T ( v π † − v π ) ≤ n 2 τ t max s a π † π ( s ) ≤ n 2 τ t max s a π ′ π ( s ) { max s T π ′ v π ( s ) = max s, ̃ π T ̃ π v π ( s ) } ≤ n 2 τ t 1 T a π ′ π . {∀ x ≥ 0 , max s x ( s ) ≤ 1 T x } (13) Combining Equations (12) and (13), we get: 1 T ( v π † − v π ′ ) = 1 T ( v π † − v π ) − 1 T ( v π ′ − v π ) ≤ ( 1 − 1 n 2 τ t ) 1 T ( v π † − v π ) . 12 Lemma 13. While Simplex-PI does not create any recurrent class nor finishes, • either an action is eliminated from policies after at most ⌈ n 2 τ t log( n 2 τ t ) ⌉ iterations, • or a recurrent class is broken after at most ⌈ n 2 τ t log( n 2 ) ⌉ iterations. Proof. Let π be the policy in some iteration. Let π † be the last policy before a new recurrent class appears, and π ′ a policy generated after k iterations from π . We shall prove that one of the two events stated of the lemma must happen. Since 0 ≤ 1 T ( v π † − v π ) { v π † ≥ v π } = x π T ( − a π π † ) { Equation (11) } = ∑ s 6 ∈R ( π ) x π ( s )( − a π π † ( s )) + ∑ C ∈R ( π ) ∑ s ∈ C x π ( s )( − a π π † ( s )) there must exist either a state s 0 6 ∈ R ( π ) such that x π ( s 0 )( − a π π † ( s 0 )) ≥ 1 n x π T ( − a π π † ) ≥ 0 . (14) or a recurrent class R 0 such that ∑ s ∈ R 0 x π ( s )( − a π π † ( s )) ≥ 1 n x π T ( − a π π † ) ≥ 0 . (15) We consider these two cases separately below. • case 1: Equation (14) holds for some s 0 6 ∈ R ( π ). Let us prove by contradiction that for k sufficiently big, π ′ ( s 0 ) 6 = π ( s 0 ): let us assume that π ′ ( s 0 ) = π ( s 0 ). 
Then 1 T ( v π † − v π ′ ) ≥ v π † ( s 0 ) − v π ′ ( s 0 ) { v π † ≥ v π ′ } = v π † ( s 0 ) − T π ′ v π ′ ( s 0 ) { v π ′ = T π ′ v π ′ } ≥ v π † ( s 0 ) − T π ′ v π † ( s 0 ) { v π † ≥ v π ′ } = − a π ′ π † ( s 0 ) = − a π π † ( s 0 ) { π ( s 0 ) = π ′ ( s 0 ) } ≥ 1 nτ t x π ( s 0 )( − a π π † ( s 0 )) { Equation (8) } ≥ 1 n 2 τ t x π T ( − a π π † ) { Equation (14) } = 1 n 2 τ t 1 T ( v π † − v π ) . { Equation (11) } If there is no recurrent class creation, the contraction property given in Lemma 12 implies that if π ′ is obtained after k = ⌈ n 2 τ t log( n 2 τ t ) ⌉ > log( n 2 τ t ) log 1 1 − 1 n 2 τt iterations, then 1 T ( v π † − v π ′ ) < 1 n 2 τ t 1 T ( v π † − v π ) , and we get a contradiction. As a conclusion, we necessarily have π ′ ( s 0 ) 6 = π ( s 0 ). • case 2: Equation (15) holds for some R 0 that is a recurrent class of π . Let us prove by contradiction that for k sufficiently big, R 0 cannot be a recurrent class of π ′ : let us thus assume that R 0 is a recurrent class of π ′ . Write T for the set of states that are transient for π (formally, T = X \R ( π )). For any subset Y of the state space X , write P Y π for the stochastic matrix of which the i th row is equal to that of P π if i ∈ Y , and is 0 otherwise, and write 1 Y the vectors of which the i th component is equal to 1 if i ∈ Y and 0 otherwise. 13 Using the fact that P R 0 π P T π = 0, one can first observe that ( I − γP R 0 π )( I − γP T π ) = I − γ ( P R 0 π + P T π ) , from which we can deduce that ∀ s ∈ R 0 , [ 1 T ∪ R 0 T ( I − γP π ) − 1 ] ( s ) = [ 1 T ∪ R 0 T ( I − γ ( P R 0 π + P T π )) − 1 ] ( s ) = [ 1 T ∪ R 0 T ( I − γP T π ) − 1 ( I − γP R 0 π ) − 1 ] ( s ) . (16) Also, let s be an arbitrary state and s ′ be a state of R 0 . Since 1 T s ( P T π ) k ( s ′ ) is the probability that the chain starting in s reaches s ′ for the first time after k iterations, then 1 T s ( I − γP T π ) − 1 ( s ′ ) ≤ ∞ ∑ t =0 1 T s ( P T π ) i ( s ′ ) ≤ 1 . and therefore, ∀ s ′ ∈ R 0 , 1 T ∪ R 0 T ( I − γP T π ) − 1 ( s ′ ) ≤ n. (17) Writing δ for the vector that equals − a π π † on R 0 and that is null everywhere else, we have ∑ s ∈ R 0 x π ( s )( − a π π † ( s )) = ∑ s ∈ R 0 [( I − γP T π ) − 1 1 ]( s ) δ ( s ) = ∑ s ∈ R 0 [( I − γP T π ) − 1 1 T ∪ R 0 ]( s ) δ ( s ) { ∀ s ∈ R 0 , [( I − γP T π ) − 1 1 X \ ( T ∪ R 0 ) ]( s ) = 0 } = ∑ s [( I − γP T π ) − 1 1 T ∪ R 0 ]( s ) δ ( s ) {∀ s 6 ∈ R 0 , δ ( s ) = 0 } = 1 T ∪ R 0 T ( I − γP π ) − 1 δ = 1 T ∪ R 0 T ( I − γP T π ) − 1 ( I − γP R 0 π ) − 1 δ { Equation (16) } = ∑ s [( I − γP T π T ) − 1 1 T ∪ R 0 ]( s )[( I − γP R 0 π ) − 1 δ ]( s ) = ∑ s ∈ R 0 [( I − γP T π T ) − 1 1 T ∪ R 0 ]( s )[( I − γP R 0 π ) − 1 δ ]( s ) {∀ s 6 ∈ R 0 , δ ( s ) = 0 } = ∑ s ∈ R 0 [( I − γP T π T ) − 1 1 T ∪ R 0 ]( s )( v π † ( s ) − v π ( s )) { Lemma 10 } ≤ n 1 R 0 T ( v π † − v π ) . { Equation (17) } (18) We assumed that R 0 is also a recurrent class of π ′ , which implies 1 R 0 T v π = 1 R 0 T v π ′ , and 1 T ( v π † − v π ′ ) ≥ 1 R 0 T ( v π † − v π ′ ) { v π † ≥ v π ′ } = 1 R 0 T ( v π † − v π ) { 1 R 0 T v π = 1 R 0 T v π ′ } ≥ 1 n ∑ s ∈ R 0 x π ( s )( − a π π † ( s )) { Equation (18) } ≥ 1 n 2 x π T ( − a π π † ) { Equation (15) } = 1 n 2 1 T ( v π † − v π ) . { Equation (11) } 14 If there is no recurrent class creation, the contraction property given in Lemma 12 implies that if π ′ is obtained after k = ⌈ n 2 τ t log( n 2 ) ⌉ > log( n 2 ) log 1 1 − 1 n 2 τt iterations, then 1 T ( v π † − v π ′ ) < 1 n 2 1 T ( v π † − v π ) , and thus we get a contradiction. 
As a conclusion, R 0 cannot be a recurrent class of π ′ . A direct consequence of the above result is Lemma 6 that we originally stated on page 5, and that we restate for clarity. Lemma 6. After at most ( m − n ) ⌈ n 2 τ t log( n 2 τ t ) ⌉ + n ⌈ n 2 τ t log( n 2 ) ⌉ iterations, either Simplex-PI finishes or a new recurrent class appears. Proof. Before a recurrent class is created, at most n recurrent classes need to be broken and ( m − n ) actions to be eliminated, and the time required by these events is bounded thanks to the previous lemma. 9.2 Part 2: A new recurrent class implies a significant step towards the optimal value We now proceed to the second part of the proof, and begin by proving Lemma 7 (originally stated page 6). Lemma 7. When Simplex-PI moves from π to π ′ where π ′ involves a new recurrent class, we have 1 T ( v π ∗ − v π ′ ) ≤ ( 1 − 1 nτ r ) 1 T ( v π ∗ − v π ) . Proof. Let s 0 be the state such that π ′ ( s 0 ) 6 = π ( s 0 ). On the one hand, since π ′ contains a new recurrent class R (necessarily containing s 0 ), we have 1 T ( v π ′ − v π ) = x π ′ T a π ′ π { Equation (11) } = x π ′ ( s 0 ) a π ( s 0 ) { Simplex-PI switches 1 action and a π ( s 0 ) = a π ′ π ( s 0 ) } ≥ 1 (1 − γ ) τ r a π ( s 0 ) . { Equation (9) with s 0 ∈ R ( π ′ ) } (19) On the other hand, ∀ s, v π ∗ ( s ) − v π ( s ) = [( I − γP π ∗ ) − 1 a π ∗ π ]( s ) { Lemma 10 } ≤ 1 1 − γ a π ( s 0 ) . {‖ ( I − γP π ∗ ) − 1 ‖ ∞ ≤ 1 1 − γ and a π ( s 0 ) = max s, ̃ π a ̃ π π ( s ) ≥ 0 } (20) Combining these two observations, we obtain 1 T ( v π ∗ − v π ′ ) = 1 T ( v π ∗ − v π ) − 1 T ( v π ′ − v π ) ≤ 1 T ( v π ∗ − v π ) − 1 (1 − γ ) τ r a π ( s 0 ) { Equation (19) } ≤ 1 T ( v π ∗ − v π ) − 1 τ r max s v π ∗ ( s ) − v π ′ ( s ) { Equation (20) } ≤ ( 1 − 1 nτ r ) 1 T ( v π ∗ − v π ) . {∀ x, 1 n 1 T x ≤ max s x ( s ) } 15 Lemma 14. While Simplex-PI does not terminate, • either some non-optimal action is eliminated from recurrent states after at most ⌈ nτ r log( n 2 τ r ) ⌉ recurrent class creations, • or some non-optimal action is eliminated from policies after at most ⌈ nτ r log( n 2 τ t ) ⌉ recurrent class creations. Proof. Let π be the policy in some iteration and π ′ the policy generated after k iterations from π (without loss of generality we assume π ′ 6 = π ∗ ). Let s 0 = arg max s x π ( s )( − a π π ∗ ( s )). We have x π ( s 0 )( − a π π ∗ ( s 0 )) ≥ 1 n x π T ( − a π π ∗ ) {∀ x, 1 T x ≤ n max s x ( s ) } = 1 n 1 T ( v π ∗ − v π ) . { Equation (11) } (21) We now consider two cases, respectively corresponding to s 0 6 ∈ R ( π ) or s 0 ∈ R ( π ). • case 1: s 0 6 ∈ R ( π ). Let us prove by contradiction that π ′ ( s 0 ) 6 = π ( s 0 ) if k is sufficiently large: let us assume that π ′ ( s 0 ) = π ( s 0 ). Then, by using repeatedly the fact that for all ̃ π, a ̃ π π ∗ ≤ 0 (by definition of the optimal policy π ∗ ), we have: 1 T ( v π ∗ − v π ′ ) = x π ′ T ( − a π ′ π ∗ ) { Equation (11) } ≥ x π ′ ( s 0 )( − a π ′ π ∗ ( s 0 )) ≥ − a π ′ π ∗ ( s 0 ) { x π ′ ( s 0 ) ≥ 1 } = − a π π ∗ ( s 0 ) { π ( s 0 ) = π ′ ( s 0 ) } ≥ 1 nτ t x π ( s 0 )( − a π π ∗ ( s 0 )) { Equation (8) } ≥ 1 n 2 τ t 1 T ( v π ∗ − v π ) . { Equation (21) } After k = ⌈ nτ r log n 2 τ t ⌉ > log n 2 τ t log 1 1 − 1 nτr recurrent classes are created, we have by the contraction property of Lemma 7 that 1 T ( v π ∗ − v π ′ ) < 1 n 2 τ t 1 T ( v π ∗ − v π ) and we get a contradiction. As a conclusion, we have π ′ ( s 0 ) 6 = π ( s 0 ). • case 2: s 0 ∈ R ( π ). 
Let us prove by contradiction that π ′ ( s 0 ) 6 = π ( s 0 ) if s 0 is recurrent for π ′ and k is sufficiently large: let us assume that π ′ ( s 0 ) = π ( s 0 ) and s 0 ∈ R ( π ′ ). Then, by using again the fact that for all ̃ π, a ̃ π π ∗ ≤ 0, we have: 1 T ( v π ∗ − v π ′ ) = x π ′ T ( − a π ′ π ∗ ) { Equation (11) } = ∑ s x π ′ ( s )( − a π ′ π ∗ ( s )) ≥ ∑ s ∈ R 0 x π ′ ( s )( − a π ′ π ∗ ( s )) ≥ 1 (1 − γ ) τ r ∑ s ∈ R 0 ( − a π ′ π ∗ ( s )) { Equation (9) } ≥ 1 (1 − γ ) τ r ( − a π ′ π ∗ ( s 0 )) = 1 (1 − γ ) τ r ( − a π π ∗ ( s 0 )) { π ( s 0 ) = π ′ ( s 0 ) } ≥ 1 nτ r x π ( s 0 )( − a π π ∗ ( s 0 )) { x π ( s 0 ) ≤ n 1 − γ } ≥ 1 n 2 τ r 1 T ( v π ∗ − v π ) . { Equation (21) } 16 After k = ⌈ nτ r log n 2 τ r ⌉ > log n 2 τ r log 1 1 − 1 nτr new recurrent classes are created, we have by the contraction property of Lemma 7 that 1 T ( v π ∗ − v π ′ ) < 1 n 2 τ r 1 T ( v π ∗ − v π ) , and we get a contradiction. As a conclusion, we know that π ′ ( s 0 ) 6 = π ( s 0 ) if s 0 is recurrent for π ′ . We are ready to conclude: At most, the ( m − n ) non-optimal actions may need to be eliminated from all states; in addition, all actions may need to be eliminated from recurrent states (some optimal actions may only be used at transient states and thus also need to be eliminated from recurrent states). Overall, convergence can thus be obtained after at most a total of m ⌈ nτ r log( n 2 τ r ) ⌉ + ( m − n ) ⌈ nτ r log( n 2 τ t ) ⌉ recurrent class creations. The result follows from the fact that each class creation requires at most ( m − n ) ⌈ n 2 τ t log( n 2 τ t ) ⌉ + n ⌈ n 2 τ t log( n 2 ) ⌉ iterations (cf. Lemma 6). 10 Cycle and recurrent classes creations for Howard’s PI (Proofs of Lemmas 8 and 9) Lemma 8. If the MDP is deterministic, after at most n iterations, either Howard’s PI finishes or a new cycle appears. Proof. Consider a sequence of l generated policies π 1 , · · · , π l from an initial policy π 0 such that no new cycle appears. By induction, we have v π l − v π k = T π l v π l − T π l v π k − 1 + T π l v π k − 1 − T π k v π k − 1 + T π k v π k − 1 − T π k v π k {∀ π, T π v π = v π } ≤ γP π l ( v π l − v π k − 1 ) + γP π k ( v π k − 1 − v π k ) { T π l v π k − 1 ≤ T π k v π k − 1 } ≤ γP π l ( v π l − v π k − 1 ) { Lemma 1 and P π k ≥ 0 } ≤ ( γP π l ) k ( v π l − v π 0 ) . { By induction on k } (22) Since the MDP is deterministic and has n states, ( P π l ) n will only have non-zero values on columns that correspond to R ( π l ). Furthermore, since no cycle is created, R ( π l ) ⊂ R ( π 0 ), which implies that v π l ( s ) − v π 0 ( s ) = 0 for all s ∈ R ( π l ). As a consequence, we have ( P π l ) n ( v π l − v π 0 ) = 0. By Equation (22), this implies that v π l = v π n . If l > n , then Howard’s PI must have terminated. Lemma 9. After at most ( m − n ) ⌈ n 2 τ t log( n 2 τ t ) ⌉ + n ⌈ n 2 τ t log( n 2 ) ⌉ iterations, either Howard’s PI finishes or a new recurrent class appears. Proof. A close examination of the proof of Lemma 6, originally designed for Simplex-PI, shows that it applies to Howard’s PI without any modification. 11 A bound for Howard’s PI and Simplex-PI under Assump- tion 1 (Proof of Theorem 7) We here consider that the state space is decomposed into 2 sets: T is the set of states that are transient under all policies, and R is the set of states that are recurrent under all policies. From this assumption, it can be seen that when running Howard’s PI or Simplex-PI, the values and actions chosen on T have no influence on the evolution of the values and policies on R . 
So we will study the convergence of both algorithms in two steps: we will first bound the number of iterations to converge on R ; we will then add the number of iterations for converging on T given that convergence has occurred on R . 17 Convergence on the set R of recurrent states: Without loss of generality, we consider here that the state space is only made of the set of recurrent states. First consider Simplex-PI. If all states are recurrent, new recurrent classes are created at every iteration, and Lemma 7 holds. Then, in a way similar to the proof of Lemma 14, it can be shown that every ⌈ nτ r log n 2 τ r ⌉ iterations, a non-optimal action can be eliminated. As there are at most ( m − n ) non-optimal actions, we deduce that Simplex-PI converges in at most ( m − n ) ⌈ nτ r log n 2 τ r ⌉ iterations on R . Consider now Howard’s PI. We can prove the following lemma. Lemma 10. If the MDP satisfies Assumption 1 and all states are recurrent under all policies, Howard’s PI generates policies ( π k ) k ≥ 0 that satisfy: 1 T ( v π ∗ − v π k +1 ) ≤ ( 1 − 1 nτ r ) 1 T ( v π ∗ − v π k ) . Proof. On the one hand, we have 1 T ( v π k +1 − v π k ) = x π k +1 T a π k +1 π k { Equation (11) } = x π k +1 T a π k { a π k +1 π k = a π k } ≥ 1 (1 − γ ) τ r 1 T a π k { Equation (9) and all states are recurrent } ≥ 1 (1 − γ ) τ r ‖ a π k ‖ ∞ . {∀ x ≥ 0 , 1 T x ≥ ‖ x ‖ ∞ } (23) On the other hand, 1 T ( v π ∗ − v π k ) = x π ∗ T a π ∗ π k { Equation (11) } ≤ x π ∗ T a π k { a π k ≥ a π ∗ π k } ≤ n 1 − γ ‖ a π k ‖ ∞ . { ∑ i x π ∗ ( i ) ≤ n 1 − γ and a π k ≥ 0 } (24) By combining Equations (23) and (24), we obtain: 1 T ( v π ∗ − v π k +1 ) = 1 T ( v π ∗ − v π k ) − 1 T ( v π k +1 − v π k ) ≤ ( 1 − 1 nτ r ) 1 T ( v π ∗ − v π k ) . Then, similarly to Simplex-PI, we can prove that after every ⌈ nτ r log n 2 τ r ⌉ iterations a non-optimal action must be eliminated. And as there are at most ( m − n ) non-optimal actions, we deduce that Howard’s PI converges in at most ( m − n ) ⌈ nτ r log n 2 τ r ⌉ iterations on R . Convergence on the set T of transient states: Consider now that convergence has occurred on the recurrent states R . A simple variation of the proof of Lemma 6/Lemma 9 (where we use the fact that we don’t need to consider the events where recurrent classes are broken since recurrent classes do not evolve anymore) allows us to show that the extra number of iterations for both algorithms to converge on the transient states is at most ( m − n )) ⌈ n 2 τ t log n 2 τ t ⌉ , and the result follows. Acknowledgements. I would like to thank Ian Post for exchanges about the proof in Post and Ye (2013), Thomas Dueholm Hansen for noticing a flaw in a claimed result for deterministic MDPs in an earlier version, Romain Aza ̈ ıs for the reference on the Cesaro mean of stochastic matrices, and the reviewers and editor for their very careful feedback, who helped improve the paper overall, and the proof of Lemma 13 in particular. 18 References Akian, M. and Gaubert, S. (2013). Policy iteration for perfect information stochastic mean payoff games with bounded first return times is strongly polynomial. Technical Report arxiv 1310.4953v1. Bertsekas, D. and Tsitsiklis, J. (1996). Neurodynamic Programming . Athena Scientific. Fearnley, J. (2010). Exponential lower bounds for policy iteration. In Proceedings of the 37th international colloquium conference on Automata, languages and programming: Part II , ICALP’10, pages 551–562, Berlin, Heidelberg. Springer-Verlag. Fritz, F., Huppert, B., and Willems, W. (1979). 
Stochastische Matrizen. Springer, Berlin.

Hansen, T. (2012). Worst-case Analysis of Strategy Iteration and the Simplex Method. Ph.D. thesis, Department of Computer Science, Aarhus University.

Hansen, T. and Zwick, U. (2010). Lower bounds for Howard's algorithm for finding minimum mean-cost cycles. In ISAAC (1), pages 415–426.

Hansen, T., Miltersen, P., and Zwick, U. (2013). Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM, 60(1), 1–16.

Hollanders, R., Delvenne, J., and Jungers, R. (2012). The complexity of policy iteration is exponential for discounted Markov decision processes. In 51st IEEE Conference on Decision and Control (CDC'12).

Hollanders, R., Gerencsér, B., Delvenne, J., and Jungers, R. (2014). Improved bound on the worst case complexity of policy iteration. Technical Report arXiv 1410.7583v1.

Mansour, Y. and Singh, S. (1999). On the complexity of policy iteration. In UAI, pages 401–408.

Melekopoglou, M. and Condon, A. (1994). On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS Journal on Computing, 6(2), 188–192.

Post, I. and Ye, Y. (2013). The simplex method is strongly polynomial for deterministic Markov decision processes. In 24th ACM-SIAM Symposium on Discrete Algorithms.

Puterman, M. (1994). Markov Decision Processes. Wiley, New York.

Schmitz, N. (1985). How good is Howard's policy improvement algorithm? Zeitschrift für Operations Research, 29(7), 315–316.

Stroock, D. (2005). An introduction to Markov processes. Springer, Berlin.

Ye, Y. (2011). The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res., 36(4), 593–603.