Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Bruno Scherrer
INRIA Nancy Grand Est, Team MAIA
bruno.scherrer@inria.fr

Abstract

Given a Markov Decision Process (MDP) with n states and a total number m of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most O((m/(1−γ)) log(1/(1−γ))) iterations, improving by a factor O(log n) a result by Hansen et al. (2013), while Simplex-PI terminates after at most O((nm/(1−γ)) log(1/(1−γ))) iterations, improving by a factor O(log n) a result by Ye (2011). Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor γ: the quantities of interest are bounds τt and τr—uniform over all states and policies—respectively on the expected time spent in transient states and on the inverse of the frequency of visits to recurrent states, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most Õ(n³m²τtτr) iterations. This extends a recent result for deterministic MDPs by Post & Ye (2013), in which τt ≤ 1 and τr ≤ n; in particular it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most Õ(m(n²τt + nτr)) iterations.

1 Introduction

We consider a discrete-time dynamic system whose state transition depends on a control, where the state space X is of finite size n. When at state i ∈ {1, ..., n}, the action is chosen from a set of admissible actions Ai ⊂ A, where the action space A is of finite size m, such that (Ai)1≤i≤n form a partition of A. The action a ∈ Ai specifies the transition probability pij(a) = P(it+1 = j | it = i, at = a) to the next state j. At each transition, the system is given a reward r(i, a, j) ∈ R, where r is the instantaneous reward function. In this context, we look for a stationary deterministic policy¹, that is a function π : X → A that maps states into admissible actions (for all i, π(i) ∈ Ai), that maximizes the expected discounted sum of rewards from any state i, called the value of policy π at state i:

vπ(i) := E[ Σ_{k=0}^∞ γ^k r(ik, ak, ik+1) | i0 = i, ∀k ≥ 0, ak = π(ik), ik+1 ∼ P(·|ik, ak) ],

where γ ∈ (0, 1) is a discount factor. The tuple ⟨X, (Ai)i∈X, p, r, γ⟩ is called a Markov Decision Process (MDP) (Puterman, 1994; Bertsekas and Tsitsiklis, 1996), and the associated problem is known as stochastic optimal control. The optimal value starting from state i is defined as

v∗(i) := max_π vπ(i).

¹Restricting our attention to stationary deterministic policies is not a limitation. Indeed, for the optimality criterion to be defined soon, it can be shown that there exists at least one stationary deterministic policy that is optimal (Puterman, 1994).

For any policy π, we write Pπ for the n × n stochastic matrix whose elements are pij(π(i)), and rπ for the vector whose components are Σ_j pij(π(i)) r(i, π(i), j).
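To make these objects concrete, the following minimal sketch (in Python with NumPy; the random MDP instance and all function names are illustrative, not part of the paper) computes vπ by solving the linear system (I − γPπ)vπ = rπ, which is the matrix form of the Bellman equation recalled in the next paragraph.

```python
import numpy as np

# Illustrative random MDP: n states, the same number of actions in every state.
# P[a, i, j] = p_ij(a);  R[a, i] = sum_j p_ij(a) * r(i, a, j).
rng = np.random.default_rng(0)
n, actions_per_state = 4, 3
P = rng.random((actions_per_state, n, n))
P /= P.sum(axis=2, keepdims=True)          # make each row a probability distribution
R = rng.random((actions_per_state, n))
gamma = 0.9

def policy_matrices(policy):
    """Return (P_pi, r_pi) for a deterministic policy given as an array of action indices."""
    P_pi = np.array([P[policy[i], i] for i in range(n)])
    r_pi = np.array([R[policy[i], i] for i in range(n)])
    return P_pi, r_pi

def policy_value(policy):
    """Solve (I - gamma * P_pi) v_pi = r_pi."""
    P_pi, r_pi = policy_matrices(policy)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

pi = np.zeros(n, dtype=int)                 # the policy that always plays action 0
print(policy_value(pi))
```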
The value functions vπ and v∗can be seen as vectors on X. It is well known that vπ is the solution of the following Bellman equation: vπ = rπ + γPπvπ, that is vπ is a fixed point of the affine operator Tπ : v 7→rπ +γPπv. It is also well known that v∗satisfies the following Bellman equation: v∗= max π (rπ + γPπv∗) = max π Tπv∗ where the max operator is taken componentwise. In other words, v∗is a fixed point of the nonlinear operator T : v 7→maxπ Tπv. For any value vector v, we say that a policy π is greedy with respect to the value v if it satisfies: π ∈arg max π′ Tπ′v or equivalently Tπv = T v. With some slight abuse of notation, we write G(v) for any policy that is greedy with respect to v. The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy π∗that is greedy with respect to the optimal value v∗is an optimal policy and its value vπ∗is equal to v∗. Let π be some policy. For any policy π′, we consider the quantity aπ′ π = Tπ′vπ −vπ that measures the difference in value resulting from switching the first action to π′ with respect to always using π; we shall call it the advantage of π′ with respect to π. Furthermore, we call maximal advantage with respect to π the componentwise best such advantage: aπ = max π′ aπ′ π = T vπ −vπ, where the second equality follows from the very definition of the Bellman operator T . While the advantage aπ′ π may have negative values, the maximal advantage aπ has only non-negative values. We call the set of switchable states of π the set of states for which the maximal advantage with respect to π is positive: Sπ = {i, aπ(i) > 0}. Assume now that π is non-optimal (this implies that Sπ is a non-empty set). For any non-empty subset Y of Sπ, we denote switch(π, Y ) a policy satisfying: ∀i, switch(π, Y )(i) =  G(vπ)(i) if i ∈Y π(i) if i ̸∈Y. The following result is well known (see for instance Puterman (1994)). Lemma 1. Let π be some non-optimal policy. If π′ = switch(π, Y ) for some non-empty subset Y of Sπ, then vπ′ ≥vπ and there exists at least one state i such that vπ′(i) > vπ(i). This lemma is the foundation of the well-known iterative procedure, called Policy Iteration (PI), that generates a sequence of policies (πk) as follows. πk+1 ←switch(πk, Yk) for some set Yk such that ∅⊊Yk ⊆Sπk. The choice for the subsets Yk leads to different variations of PI. In this paper we will focus on two of them: • When for all iterations k, Yk = Sπk, that is one switches the actions in all states with positive advantage with respect to πk, the above algorithm is known as Howard’s PI; it can be seen then that πk+1 ∈G(vπk). 2 • When for all iterations k, Yk is a singleton containing a state ik ∈arg maxi aπk(i), that is if we only switch one action in the state with maximal advantage with respect to πk, we will call it Simplex-PI2. Since it generates a sequence of policies with increasing values, any variation of PI converges to an optimal policy in a number of iterations that is smaller than the total number of policies. In practice, PI converges in very few iterations. On random MDP instances, convergence often occurs in time sub-linear in n. The aim of this paper is to discuss existing and provide new upper bounds on the number of iterations required by Howard’s PI and Simplex-PI that are much sharper than mn. 
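Both variations are only a few lines of code. The sketch below (Python/NumPy; the random instance and helper names are again purely illustrative) follows the definitions above literally: compute the maximal advantage aπ = Tvπ − vπ from the action-values, then either switch every state of Sπk to a greedy action (Howard's PI) or switch only the state with maximal advantage (Simplex-PI).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m_per_state, gamma = 5, 3, 0.9
P = rng.random((m_per_state, n, n)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((m_per_state, n))

def value(pi):
    P_pi = P[pi, np.arange(n)]
    r_pi = R[pi, np.arange(n)]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def q_values(v):
    # Q[a, i] = (T_a v)(i); greedy policies and the maximal advantage follow from it.
    return R + gamma * P @ v

def howard_pi(pi):
    while True:
        v = value(pi)
        q = q_values(v)
        adv = q.max(axis=0) - v                        # maximal advantage a_pi
        if np.all(adv <= 1e-12):
            return pi                                  # no switchable state: pi is optimal
        switch = adv > 1e-12                           # S_pi, the set of switchable states
        pi = np.where(switch, q.argmax(axis=0), pi)    # switch every state of S_pi

def simplex_pi(pi):
    while True:
        v = value(pi)
        q = q_values(v)
        adv = q.max(axis=0) - v
        if np.all(adv <= 1e-12):
            return pi
        s = int(adv.argmax())                          # single state with maximal advantage
        pi = pi.copy(); pi[s] = int(q[:, s].argmax())

pi0 = np.zeros(n, dtype=int)
print(howard_pi(pi0), simplex_pi(pi0))
```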
In the next sections, we describe some known results—see also Ye (2011) for a recent and comprehen- sive review—about the number of iterations required by Howard’s PI and Simplex-PI, along with some of our original improvements and extensions. For clarity, all proofs are deferred to the later sections. 2 Bounds with respect to a fixed discount factor γ < 1 A key observation for both algorithms, that will be central to the results we are about to discuss, is that the sequences they generate satisfy some contraction property3. For any vector u ∈Rn, let ∥u∥∞= max1≤i≤n|u(i)| be the max-norm of u. Let 1 be the vector of which all components are equal to 1. Lemma 2 (e.g. Puterman (1994), proof in Section 5). The sequence (∥v∗−vπk∥∞)k≥0 built by Howard’s PI is contracting with coefficient γ. Lemma 3 ((Ye, 2011), proof in Section 6). The sequence (1T (v∗−vπk))k≥0 built by Simplex-PI is contracting with coefficient 1 −1−γ n . Contraction is a widely known property for Howard’s PI, and it was to our knowledge first proved by (Ye, 2011) for Simplex-PI; we provide simple proofs in this paper for the sake of completeness. While the first contraction property is based on the ∥· ∥∞-norm, the second can be equivalently expressed in terms of the ∥· ∥1-norm defined by ∥u∥1 = Pn i=1 |u(i)|, since the vectors v∗−vπk are non-negative and thus satisfy 1T (v∗−vπk) = ∥v∗−vπk∥1. Contraction has the following immediate consequence4. Corollary 1. Let Vmax = maxπ ∥rπ∥∞ 1−γ be an upper bound on ∥vπ∥∞for all policies π. In order to get an ǫ-optimal policy, that is a policy πk satisfying ∥v∗−vπk∥∞≤ǫ, Howard’s PI requires at most l log Vmax ǫ 1−γ m iterations, while Simplex-PI requires at most l n log nVmax ǫ 1−γ m iterations. These bounds depend on the precision term ǫ, which means that Howard’s PI and Simplex-PI are weakly polynomial for a fixed discount factor γ. An important breakthrough was recently achieved by Ye (2011) who proved that one can remove the dependency with respect to ǫ, and thus show that Howard’s PI and Simplex-PI are strongly polynomial for a fixed discount factor γ. Theorem 1 (Ye (2011)). Simplex-PI and Howard’s PI both terminate after at most (m −n)  n 1 −γ log  n2 1 −γ  = O  mn 1 −γ log n 1 −γ  iterations. The proof is based on the fact that PI corresponds to the simplex algorithm in a linear program- ming formulation of the MDP problem. Using a more direct proof—not based on linear programming arguments—Hansen et al. (2013) recently improved the result by a factor O(n) for Howard’s PI. 2In this case, PI is equivalent to running the simplex algorithm with the highest-pivot rule on a linear program version of the MDP problem (Ye, 2011). 3A sequence of non-negative numbers (xk)k≥0 is contracting with coefficient α if and only if for all k ≥0, xk+1 ≤αxk. 4For Howard’s PI, we have: ∥v∗−vπk∥∞≤γk∥v∗−vπ0∥∞≤γkVmax. Thus, a sufficient condition for ∥v∗−vπk∥∞< ǫ is γkVmax < ǫ, which is implied by k ≥ log Vmax ǫ 1−γ > log Vmax ǫ log 1 γ . For Simplex-PI, we have ∥v∗−vπk∥∞≤∥v∗−vπk∥1 ≤ 1 −1−γ n k ∥v∗−vπ0∥1 ≤1 −1−γ n k nVmax, and the conclusion is similar to that for Howard’s PI. 3 Theorem 2 (Hansen et al. (2013)). Howard’s PI terminates after at most (m + 1)  1 1 −γ log  n 1 −γ  = O  m 1 −γ log n 1 −γ  iterations. Our first results, that are consequences of the contraction property of Howard’s PI (Lemma 2) are stated in the following theorems. Theorem 3 (Proof in Section 7). Howard’s PI terminates after at most (m −n)  1 1 −γ log  1 1 −γ  = O  m 1 −γ log 1 1 −γ  iterations. Theorem 4 (Proof in Section 8). 
Simplex-PI terminates after at most n(m −n)  1 + 2 1 −γ log 1 1 −γ  = O  mn 1 −γ log 1 1 −γ  iterations. Both results are a factor O(log n) better than the previously known results provided by Hansen et al. (2013) and Ye (2011). These improvements boil down to the use of the ∥· ∥∞-norm instead of the ∥· ∥1- norm at various points of the previous analyses. For Howard’s PI, the resulting arguments constitute a rather simple extension—the overall line of analysis ends up being very simple, and we consequently believe that it could be part of an elementary course on Policy Iteration; note that a similar improvement and analysis was discovered independently by Akian and Gaubert (2013) in a slightly more general setting. For Simplex-PI, however, the line of analysis is slightly trickier: it amounts to bound the improvement in value at individual states and requires a bit of bookkeeping; the technique we use is to our knowledge original. The bound for Simplex-PI is a factor O(n) larger than that for Howard’s PI5. However, since one changes only one action per iteration, each iteration has a complexity that is in a worst-case sense lower by a factor n: the update of the value can be done in time O(n2) through the Sherman-Morrisson formula, though in general each iteration of Howard’s PI, which amounts to compute the value of some policy that may be arbitrarily different from the previous policy, may require O(n3) time. Thus, it is remarkable that both algorithms seem to have a similar complexity. The linear dependency of the bound for Howard’s PI with respect to m is optimal (Hansen, 2012, Chapter 6.4). The linear dependency with respect to n or m (separately) is easy to prove for Simplex-PI; we conjecture that Simplex-PI’s complexity is proportional to nm, and thus that our bound is tight for a fixed discount factor. The dependency with respect to the term 1 1−γ may be improved, but removing it is impossible for Howard’s PI and very unlikely for Simplex-PI. Fearnley (2010) describes an MDP for which Howard’s PI requires an exponential (in n) number of iterations for γ = 1 and Hollanders et al. (2012) argued that this holds also when γ is in the vicinity of 1. Though a similar result does not seem to exist for Simplex-PI in the literature, Melekopoglou and Condon (1994) consider four variations of PI that all switch one action per iteration, and show through specifically designed MDPs that they may require an exponential (in n) number of iterations when γ = 1. 3 Bounds for Simplex-PI that are independent of γ In this section, we will describe some bounds that do not depend on γ but that will be based on some structural properties of the MDP. On this topic, Post and Ye (2013) recently showed the following result for deterministic MDPs. Theorem 5 (Post and Ye (2013)). If the MDP is deterministic, then Simplex-PI terminates after at most O(n3m2 log2 n) iterations. 5Note that it was also the case in Corollary 1. 4 Given a policy π of a deterministic MDP, states are either on cycles or on paths induced by π. The core of the proof relies on the following lemmas that altogether show that cycles are created regularly and that significant progress is made every time a new cycle appears; in other words, significant progress is made regularly. Lemma 4 (Post and Ye (2013, Lemma 3.4)). If the MDP is deterministic, after O(n2m log n) iterations, either Simplex-PI finishes or a new cycle appears. Lemma 5 (Post and Ye (2013, Lemma 3.5)). 
If the MDP is deterministic, when Simplex-PI moves from π to π′ where π′ involves a new cycle, we have 1T (vπ∗−vπ′) ≤  1 −1 n  1T (vπ∗−vπ). Indeed, these observations suffice to prove6 that Simplex-PI terminates after O(n2m2 log n 1−γ ). Com- pletely removing the dependency with respect to the discount factor γ—the term in O(log 1 1−γ )—requires a careful extra work described in Post and Ye (2013), which incurs an extra term of order O(n log(n)). The main result of this section is to show how these results can be extended to a more general setting. While Ye (2011) reason on states that belong to paths and cycles induced by policies on deterministic MDPs, we shall consider their natural generalization for stochastic MDPs: transient states and recurrent classes induced by policies. Precisely, we are going to consider bounds—uniform on all policies and states—of the average time 1) spent in transient states and 2) needed to revisit states in recurrent classes. For any policy π and state i, denote τ π(i, t) the expected cumulative time spent in state i until time t−1 given than the process starts from the uniform distribution U on X and takes actions according to π: τ π(i, t) = E "t−1 X k=0 1it=i | i0 ∼U, at = π(it) # = t−1 X k=0 P(it = i | i0 ∼U, at = π(it)), where 1 denotes the indicator function. In addition, consider the vector µπ on X providing the asymptotic frequency in all states given that policy π is used and that the process starts from the uniform distribution U: ∀i, µπ(i) = lim t→∞ 1 t τ π(i, t). When the Markov chain induced by π is ergodic, and thus admits a unique stationary distribution, µπ is equal to this very stationary distribution. However, our definition is more general in that policies may induce Markov chains with aperiodicity and/or multiple recurrent classes. For any state i that is transient for the Markov chain induced by π, it is well known that limt→∞τ π(i, t) < ∞and µπ(i) = 0. However, for any recurrent state i, we know that limt→∞τ π(i, t) = ∞and µπ(i) > 0; in particular, if i belongs to some recurrent class R, which is reached with probability q from the uniform distribution U, then q µπ(i) is the expected time between two visits of the state i. We are now ready to express the structural properties with which we can provide an extension of the analysis of Post and Ye (2013). Definition 1. Let τt and τr be the smallest finite constants such that for all policies π and states i, if i is transient for π, then lim t→∞τ π(i, t) ≤τt else if i is recurrent for π, then 1 µπ(i) ≤τr. Note that for any finite MDP, these finite constants always exist. With Definition 1 in hand, we can generalize Lemmas 4-5 as follows. Lemma 6. After at most (m−n)⌈n2τt log(n2τt)⌉+n⌈n2τt log(n2)⌉iterations either Simplex-PI finishes or a new recurrent class appears. 6This can be done by using arguments similar those for Theorem 1 (see Ye (2011) for details). 5 Lemma 7. When Simplex-PI moves from π to π′ where π′ involves a new recurrent class, we have 1T (vπ∗−vπ′) ≤  1 − 1 nτr  1T (vπ∗−vπ). From these generalized observations, we can deduce the following original result. Theorem 6 (Proof in Section 9). Simplex-PI terminates after at most  m⌈nτr log(n2τr)⌉+ (m −n)⌈nτr log(n2τt)⌉   (m −n)⌈n2τt log(n2τt)⌉+ n⌈n2τt log(n2)⌉  = ˜O n3m2τtτr  iterations. Remark 1. This new result extends the result obtained for deterministic MDPs by Post and Ye (2013) recalled in Theorem 5. In the deterministic case, it is easy to see that τt = 1 and τr ≤n. 
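The constants of Definition 1 depend only on the transition structure of the MDP, and for a fixed policy π the underlying quantities can be computed or approximated directly, as in the sketch below (Python/NumPy; the classification routine, the horizon parameter and the toy chain are illustrative choices, not taken from the paper): the expected total time spent in a transient state follows from the fundamental matrix of the transient block of Pπ, and μπ is approximated by a Cesàro average of the state distribution. Taking the worst case over policies and states then yields τt and τr for small MDPs.

```python
import numpy as np

def recurrent_states(P_pi, tol=1e-12):
    """A state i is recurrent for P_pi iff every state reachable from i can reach i back."""
    k = len(P_pi)
    reach = (P_pi > tol) | np.eye(k, dtype=bool)
    for _ in range(k):                                 # transitive closure of the transition graph
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    return np.array([all(reach[j, i] for j in range(k) if reach[i, j]) for i in range(k)])

def transient_occupation(P_pi):
    """Expected total time spent in each transient state when i_0 ~ Uniform (tau^pi(i, inf))."""
    k = len(P_pi)
    rec = recurrent_states(P_pi)
    T = ~rec
    N = np.linalg.inv(np.eye(T.sum()) - P_pi[np.ix_(T, T)])  # fundamental matrix of the transient block
    occ = np.zeros(k)
    occ[T] = np.full(T.sum(), 1.0 / k) @ N                   # zero on recurrent states
    return occ, rec

def asymptotic_frequency(P_pi, horizon=20000):
    """Cesaro average approximating mu_pi(i) = lim (1/t) tau^pi(i, t) with i_0 ~ Uniform."""
    k = len(P_pi)
    mu, dist = np.zeros(k), np.full(k, 1.0 / k)
    for _ in range(horizon):
        mu += dist / horizon
        dist = dist @ P_pi
    return mu

# Toy chain: state 0 is transient, {1, 2} is a recurrent class.
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.0, 0.7, 0.3]])
occ, rec = transient_occupation(P_pi)
mu = asymptotic_frequency(P_pi)
print(occ, rec, 1.0 / mu[rec])   # occ bounds tau_t-type terms, 1/mu bounds tau_r-type terms
```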
Then, while Lemma 6 is a strict generalization of Lemma 4, Lemma 7 provides a contraction factor that is slightly weaker than that of Lemma 5— 1 − 1 n2  instead of 1 −1 n  —, which makes the resulting bound provided in Theorem 6 a factor O(n) worse than that of Theorem 5. This extra term in the bound is the price paid for making the constant τr (and the vector µπ) independent of the discount factor γ, that is by presenting our result in a way that only depends on the dynamics of the underlying MDP. An analysis that would strictly generalizes that of Ye (2011) can be done under a variation of Definition 1 where the constants τt and τr depend on the discount factor7 γ. An immediate consequence of the above result is that Simplex-PI is strongly polynomial for sets of MDPs that are much larger than the deterministic MDPs mentioned in Theorem 5. Corollary 2. For any family of MDPs indexed by n and m such that τt and τr are polynomial functions of n and m, Simplex-PI terminates after a number of steps that is polynomial in n and m. 4 Similar results for Howard’s PI? One may then wonder whether similar results can be derived for Howard’s PI. Unfortunately, and as briefly mentioned by Post and Ye (2013), the line of analysis developed for Simplex-PI does not seem to adapt easily to Howard’s PI, because simultaneously switching several actions can interfere in a way such that the policy improvement turns out to be small. We can be more precise on what actually breaks in the approach we have described so far. On the one hand, it is possible to write counterparts of Lemmas 4 and 6 for Howard’s PI (see Section 10 for proofs). Lemma 8. If the MDP is deterministic, after at most n iterations, either Howard’s PI finishes or a new cycle appears. Lemma 9. After at most (m−n)⌈n2τt log(n2τt)⌉+n⌈n2τt log(n2)⌉iterations, either Howard’s PI finishes or a new recurrent class appears. On the other hand, we did not manage to adapt Lemma 5 nor Lemma 7. In fact, it is unlikely that a result similar to that of Lemma 5 will be shown to hold for Howard’s PI. In a recent deterministic example due to Hansen and Zwick (2010) to show that Howard’s PI may require at least Ω(n2) iterations, new 7 Define the following γ-discounted variation of τ π(i, t): τ π γ (i, t) = EPt−1 k=0 γk1it=i | i0 ∼U, at = π(it) = Pt−1 k=0 γkP(it = i | i0 ∼U, at = π(it)) and τ π γ (i) = limt→∞τ π γ (i, t). Assume that we have constants τ γ t , and τ γ r such that for every policy π, τ π γ (i) ≤τ γ t if i is a transient state for π, and 1 (1−γ)τπ γ (i) ≤τ γ r if i is recurrent for π. Then, one can derive a bound similar to that of Theorem 6 where τt and τr are respectively replaced by τ γ t and τγ r n . At a more technical level, our analysis begins by removing the dependency with respect to γ: Lemma 11, page 11, shows that for every policy π, τ π γ (i) ≤τt if i is a transient state for π, and 1 (1−γ)τπ γ (i) ≤nτr if i is recurrent for π (this is where we pay the O(n) term because the upper bound is nτr instead of τ γ r ); we then follow the line of arguments originally given by Post and Ye (2013), though our more general setting induces a few technicalities (in particular in the second part of the proof of Lemma 13 page 13). 6 cycles are created every single iteration but the sequence of values satisfies8 for all iterations k < n2 4 + n 4 and states i, v∗(i) −vπk+1(i) ≥ " 1 −  2 n k# (v∗(i) −vπk(i)). Contrary to Lemma 5, as k grows, the amount of contraction gets (exponentially) smaller and smaller. 
With respect to Simplex-PI, this suggests that Howard's PI may suffer from subtle specific pathologies. In fact, the problem of determining the number of iterations required by Howard's PI has been challenging for almost 30 years. It was originally identified as an open problem by Schmitz (1985). In the simplest—deterministic—case, the complexity is still an open problem: the currently best-known lower bound is Ω(n²) (Hansen and Zwick, 2010), while the best-known upper bound is O(m^n/n) (Mansour and Singh, 1999; Hollanders et al., 2014). On the positive side, an adaptation of the line of proof we have considered so far can be carried out under the following assumption.

Assumption 1. The state space X can be partitioned in two sets T and R such that for all policies π, the states of T are transient and those of R are recurrent.

Under this additional assumption, we can deduce the following original bounds.

Theorem 7 (Proof in Section 11). If the MDP satisfies Assumption 1, then Howard's PI and Simplex-PI terminate after at most

(m − n)(⌈nτr log(n²τr)⌉ + ⌈n²τt log(n²τt)⌉) = Õ(m(n²τt + nτr))

iterations.

It should however be noted that Assumption 1 is rather restrictive. It implies that the algorithms converge on the recurrent states independently of the transient states, and thus the analysis can be decomposed in two phases: 1) the convergence on recurrent states and then 2) the convergence on transient states (given that recurrent states do not change anymore). The analysis of the first phase (convergence on recurrent states) is greatly facilitated by the fact that in this case a new recurrent class appears at every single iteration (this is in contrast with Lemmas 4, 6, 8 and 9, which were designed to show under which conditions cycles and recurrent classes are created). Furthermore, the analysis of the second phase (convergence on transient states) is similar to that of the discounted case of Theorems 3 and 4. In other words, while this last result sheds some light on the practical efficiency of Howard's PI and Simplex-PI, a general analysis of Howard's PI is still largely open and constitutes intriguing future work.

The following sections contain detailed proofs of Lemmas 2 and 3, Theorems 3, 4, and 6, Lemmas 8 and 9, and finally Theorem 7. Before we start, we provide a particularly useful identity relating the difference between the values of two policies π and π′ and the relative advantage a^{π′}_π.

Lemma 10. For all pairs of policies π and π′,

vπ′ − vπ = (I − γPπ′)^{-1} a^{π′}_π = (I − γPπ)^{-1}(−a^π_{π′}).

Proof. The first identity follows from simple linear algebra arguments:

vπ′ − vπ = (I − γPπ′)^{-1} rπ′ − vπ                {vπ′ = Tπ′vπ′ ⇔ vπ′ = (I − γPπ′)^{-1} rπ′}
         = (I − γPπ′)^{-1}(rπ′ + γPπ′vπ − vπ)
         = (I − γPπ′)^{-1}(Tπ′vπ − vπ).

The second identity follows by symmetry.

8 This MDP has an even number of states n = 2p. The goal is to minimize the long-term expected cost. The optimal value function satisfies v∗(i) = −p^N for all i, with N = p² + p. The policies generated by Howard's PI have values vπk(i) ∈ (p^{N−k−1}, p^{N−k}). We deduce that for all iterations k and states i,

(v∗(i) − vπk+1(i)) / (v∗(i) − vπk(i)) ≥ (1 + p^{−k−2}) / (1 + p^{−k}) = 1 − (p^{−k} − p^{−k−2}) / (1 + p^{−k}) ≥ 1 − p^{−k}(1 − p^{−2}) ≥ 1 − p^{−k}.

We will repeatedly use the following property: since for any policy π, the matrix (1 − γ)(I − γPπ)^{-1} = (1 − γ) Σ_{t=0}^∞ (γPπ)^t is a stochastic matrix (as a mixture of stochastic matrices), we have

∥(I − γPπ)^{-1}∥∞ = 1/(1 − γ),

where ∥·∥∞ is the natural induced max-norm on matrices.
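Lemma 10 and the norm identity above are easy to check numerically, which is a convenient sanity test for any implementation. The sketch below (Python/NumPy; the random instance is illustrative) draws two random policies and verifies that vπ′ − vπ = (I − γPπ′)^{-1} a^{π′}_π and that the induced max-norm of (I − γPπ)^{-1} is exactly 1/(1 − γ).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m_per_state, gamma = 6, 3, 0.85
P = rng.random((m_per_state, n, n)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((m_per_state, n))

def matrices(pi):
    return P[pi, np.arange(n)], R[pi, np.arange(n)]

def value(pi):
    P_pi, r_pi = matrices(pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

pi  = rng.integers(m_per_state, size=n)
pi2 = rng.integers(m_per_state, size=n)
v, v2 = value(pi), value(pi2)

# Advantage of pi2 with respect to pi: a^{pi2}_pi = T_{pi2} v_pi - v_pi.
P2, r2 = matrices(pi2)
adv = r2 + gamma * P2 @ v - v

# Lemma 10: v_{pi2} - v_pi = (I - gamma * P_{pi2})^{-1} a^{pi2}_pi.
lhs = v2 - v
rhs = np.linalg.solve(np.eye(n) - gamma * P2, adv)
assert np.allclose(lhs, rhs)

# The induced max-norm of a matrix is its largest row sum of absolute values.
P1, _ = matrices(pi)
M = np.linalg.inv(np.eye(n) - gamma * P1)
assert np.isclose(np.abs(M).sum(axis=1).max(), 1.0 / (1.0 - gamma))
print("Lemma 10 and the norm identity hold on this instance.")
```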
Finally, for any vector/matrix A and any number λ, we shall use the notation “A ≥λ” (respectively “A ≤λ”) for denoting the fact that “all the coefficients of A are greater or equal to (respectively smaller or equal to) λ”. 5 Contraction property for Howard’s PI (Proof of Lemma 2) For any k, we have vπ∗−vπk = Tπ∗vπ∗−Tπ∗vπk−1 + Tπ∗vπk−1 −Tπkvπk−1 + Tπkvπk−1 −Tπkvπk {∀π, Tπvπ = vπ} ≤γPπ∗(vπ∗−vπk−1) + γPπk(vπk−1 −vπk) {Tπ∗vπk−1 ≤Tπkvπk−1} ≤γPπ∗(vπ∗−vπk−1). {Lemma 1 and Pπk ≥0} Since vπ∗−vπk is non-negative, we can take the max-norm and get: ∥vπ∗−vπk∥∞≤γ∥vπ∗−vπk−1∥∞. 6 Contraction property for Simplex-PI (Proof of Lemma 3) The proof we provide here is very close to the one given by Ye (2011). We provide it here for completeness, and also because it resembles the proofs we will provide for the bounds that are independent of γ. On the one hand, using Lemma 10, we have for any k: vπk+1 −vπk = (I −γPπk+1)−1aπk+1 πk ≥aπk+1 πk , {(I −γPπk+1)−1 −I ≥0 and aπk+1 πk ≥0} which implies, by left multiplying by the vector 1T , that 1T (vπk+1 −vπk) ≥1T aπk+1 πk . (1) On the other hand, we have: vπ∗−vπk = (I −γPπ∗)−1aπ∗ πk {Lemma 10} ≤ 1 1 −γ max s aπk+1 πk (s) {∥(I −γPπ∗)−1∥∞= 1 1 −γ and max s aπk+1 πk (s) = max s,π aπ πk(s) ≥0} ≤ 1 1 −γ 1T aπk+1 πk , {∀x ≥0, max s x(s) ≤1T x} which implies that 1T aπk+1 πk ≥(1 −γ)∥vπ∗−vπk∥∞ ≥1 −γ n 1T (vπ∗−vπk). {∀x, 1T x ≤n∥x∥∞} (2) Combining Equations (1) and (2), we get: 1T (vπ∗−vπk+1) = 1T (vπ∗−vπk) −1T (vπk+1 −vπk) ≤1T (vπ∗−vπk) −1 −γ n 1T (vπ∗−vπk) =  1 −1 −γ n  1T (vπ∗−vπk). 8 7 A bound for Howard’s PI when γ < 1 (Proof of Theorem 3) Although the overall line or arguments follows from those given originally by Ye (2011) and adapted by Hansen et al. (2013), our proof is slightly more direct and leads to a better result. For any k, we have: −aπk π∗= (I −γPπk)(v∗−vπk) {Lemma 10} ≤v∗−vπk. {v∗−vπk ≥0 and Pπk ≥0} By the optimality of π∗, −aπk π∗is non-negative, and we can take the max-norm: ∥aπk π∗∥∞≤∥v∗−vπk∥∞ ≤γk∥vπ∗−vπ0∥∞ {Lemma 2} = γk∥(I −γPπ0)−1(−aπ0 π∗)∥∞ {Lemma 10} ≤ γk 1 −γ ∥aπ0 π∗∥∞. {∥(I −γPπ0)−1∥∞= 1 1 −γ } By definition of the max-norm, and as aπ0 π∗≤0 (using again the fact that π∗is optimal), there exists a state s0 such that −aπ0 π∗(s0) = ∥aπ0 π∗∥∞. We deduce that for all k, −aπk π∗(s0) ≤∥aπk π∗∥∞≤ γk 1 −γ ∥aπ0 π∗∥∞= γk 1 −γ (−aπ0 π∗(s0)). As a consequence, the action πk(s0) must be different from π0(s0) when γk 1−γ < 1, that is for all values of k satisfying k ≥k∗= & log 1 1−γ 1 −γ ' > & log 1 1−γ log 1 γ ' . In other words, if some policy π is not optimal, then one of its non-optimal actions will be eliminated for good after at most k∗iterations. By repeating this argument, one can eliminate all non-optimal actions (there are at most n −m of them), and the result follows. 8 A bound for Simplex-PI when γ < 1 (Proof of Theorem 4) At each iteration k, let sk be the state in which an action is switched. We have (by definition of Simplex-PI): aπk+1 πk (sk) = max π,s aπ πk(s). Starting with arguments similar to those for the contraction property of Simplex-PI, we have on the one hand: vπk+1 −vπk = (I −γPπk+1)−1aπk+1 πk {Lemma 10} ≥aπk+1 πk , {(I −γPπk+1)−1 −I ≥0 and aπk+1 πk ≥0} which implies that vπk+1(sk) −vπk(sk) ≥aπk+1 πk (sk). (3) On the other hand, we have: vπ∗−vπk = (I −γPπ∗)−1aπ∗ πk {Lemma 10} ≤ 1 1 −γ aπk+1 πk (sk) {∥(I −γPπ∗)−1∥∞= 1 1 −γ and aπk+1 πk (sk) = max s,π aπ πk(s) ≥0} 9 which implies that ∥vπ∗−vπk∥∞≤ 1 1 −γ aπk+1 πk (sk). (4) Write ∆k = vπ∗−vπk. 
From Equations (3) and (4), we deduce that: ∆k+1(sk) ≤∆k(sk) −(1 −γ)∥∆k∥∞=  1 −(1 −γ)∥∆k∥∞ ∆k(sk)  ∆k(sk). This implies—since ∆k(sk) ≤∥∆k∥∞—that ∆k+1(sk) ≤γ∆k(sk), but also—since ∆k(sk) and ∆k+1(sk) are non-negative and thus  1 −(1 −γ) ∥∆k∥∞ ∆k(sk)  ≥0—that ∥∆k∥∞≤ 1 1 −γ ∆k(sk). Now, write nk for the vector on the state space such that nk(s) is the number of times state s has been switched until iteration k (including k). Since by Lemma 1 the sequence (∆k)k≥0 is non-increasing, we have ∥∆k∥∞≤ 1 1 −γ ∆k(sk) ≤γnk−1(sk) 1 −γ ∆0(sk) ≤γnk−1(sk) 1 −γ ∥∆0∥∞. (5) At any iteration k, let s∗ k = arg maxs nk−1(s) be the state in which actions have been switched the most. Since at each iteration k, one of the n components of nk is increased by 1, we necessarily have nk−1(s∗ k) ≥ k −1 n  ≥k −n n . (6) Write k∗≤k −1 for the last iteration when the state s∗ k was updated, such that we have nk−1(s∗ k) = nk∗−1(sk∗). (7) Since (∥∆k∥∞)k≥0 is nonincreasing (using again Lemma 1), we have ∥∆k∥∞≤∥∆k∗∥∞ {k∗≤k −1} ≤γnk∗−1(sk∗) 1 −γ ∥∆0∥∞ {Equation (5)} = γnk−1(s∗ k) 1 −γ ∥∆0∥∞ {Equation (7)} ≤γ k−n n 1 −γ ∥∆0∥∞. {Equation (6) and x 7→γx is decreasing} We are now ready to finish the proof. By using arguments similar to those for Howard’s PI, we have: ∥aπk π∗∥∞≤∥∆k∥∞≤γ k−n n 1 −γ ∥∆0∥∞≤ γ k−n n (1 −γ)2 ∥aπ0 π∗∥∞. In particular, we can deduce from the above relation that as soon as γ k−n n (1−γ)2 < 1, that is for instance when k > k∗= n  1 + 2 1−γ log 1 1−γ  , one of the non-optimal actions of π0 cannot appear in πk. Thus, every k∗iterations, a non-optimal action is eliminated for good, and the result follows from the fact that there are at most n −m non-optimal actions. 10 9 A general bound for Simplex-PI (Proof of Theorem 6) The proof we give here is strongly inspired by that for the deterministic case of Post and Ye (2013): the steps (a series of lemmas) are similar. There are mainly two differences. First, our arguments are more direct in the sense that we do not refer to linear programming, but only provide simple linear algebra arguments. Second, it is more general: for any policy π, we consider the set of transient states (respectively recurrent classes) instead of the set of path states (respectively cycles); it slightly complicates the arguments, the most complicated extension being the second part of the proof of the forthcoming Lemma 13. Consider the vector xπ = (I −γP T π )−11 that provides a discounted measure of state visitations along the trajectories induced by a policy π starting from the uniform distribution U on the state space X: ∀i ∈X, xπ(i) = n ∞ X t=0 γtP(it = i | i0 ∼U, at = π(it)). This vector plays a crucial role in the analysis. For any policy π and state i, we trivially have xπ(i) ∈  1, n 1−γ  . In the case of deterministic MDPs, Post and Ye (2013)) exploits the fact that xπ(i) belongs to the set (1, n) when i is on path of π, while xπ(i) belongs to the set ( 1 1−γ , n 1−γ ) when i is on a cycle of π. Our extension of their result to the case of general (stochastic) MDPs will rely on the following result. For any policy π, we shall write R(π) for the set of states that are recurrent for π. Lemma 11. With the constants τt and τr of Definition 1, we have for every discount factor γ, ∀i ̸∈R(π), 1 ≤xπ(i) ≤nτt (8) ∀i ∈R(π), 1 τr ≤(1 −γ)xπ(i) ≤n. (9) Proof. The fact that xπ(i) belongs to  1, n 1−γ  is obvious from the definition of xπ. 
The upper bound on xπ on the transient states i follows from the fact that for any policy π, τt(i) ≥lim t→∞τ π(i, t) = ∞ X k=0 P(it = i | i0 ∼U, at = π(it)) ≥ ∞ X k=0 γkP(it = i | i0 ∼U, at = π(it)) = 1 nxπ(i). Let us now consider the lower bound on (1 −γ)xπ(i) when i is a recurrent state of some policy π. In general, the asymptotic frequency µπ of π does not necessarily satisfy µπT Pπ = Pπ because Pπ may correspond to an aperiodic or reducible chain. To deal with this issue, we consider the Ces`aro mean Qπ = lim t→∞ 1 t t−1 X k=0 (Pπ)k that is well-defined (Stroock, 2005, Section 3.2). It can be shown (Fritz et al., 1979, Proposition 3.5(a)) that Qπ = QπPπ = PπQπ = QπQπ. This implies in particular that (1 −γ)Qπ(I −γPπ)−1 = (1 −γ) ∞ X k=0 γkQπ(Pπ)k = (1 −γ) ∞ X k=0 γkQπ = Qπ. (10) 11 Then, by using twice the fact that µπ = 1 nQπ T 1, we can see that for all recurrent states i, 1 τr ≤µπ(i) =  1 nQπ T 1  (i) =  1 n(1 −γ)(I −γPπ T )−1Qπ T 1  (i) {Equation (10)} =  (1 −γ)(I −γPπ T )−1µπ (i) ≤  (1 −γ)(I −γPπ T )−11  (i) {µπ ≤1} = (1 −γ)xπ(i). Finally, a rewriting of Lemma 10 in terms of the vector xπ will be useful in the following proofs: for any pair of policies π and π′, 1T (vπ′ −vπ) = xπ′T aπ′ π = xπT (−aπ π′). (11) We are now ready to delve into the details of the arguments. As mentioned before, the proof is structured in two steps: first, we will show that recurrent classes are created often; then we will show that significant progress is made every time a new recurrent class appears. 9.1 Part 1: Recurrent classes are created often Lemma 12. Suppose one moves from policy π to policy π′ without creating any recurrent class. Let π† be the final policy before either a new recurrent class appears or Simplex-PI terminates. Then 1T (vπ† −vπ′) ≤  1 − 1 n2τt  1T (vπ† −vπ). Proof. The arguments are similar to those for the proof of Theorem 4. On the one hand, we have: 1T (vπ′ −vπ) ≥1T aπ′ π . (12) On the other hand, we have 1T (vπ† −vπ) = xT π†aπ† π {Equation (11)} = X s̸∈R(π†) xπ†(s)aπ† π (s) + X s∈R(π†) xπ†(s)aπ† π (s) ≤n2τt max s̸∈R(π†) aπ† π (s) + n2 1 −γ max s∈R(π†) aπ† π (s). {Equations (8)-(9)} Since by assumption recurrent classes of π† are also recurrent classes of π, we deduce that for all s ∈R(π†), π†(s) = π(s), so that maxs∈R(π†) aπ† π (s) = 0. Thus, the second term of the above r.h.s. is null and 1T (vπ† −vπ) ≤n2τt max s aπ† π (s) ≤n2τt max s aπ′ π (s) {max s Tπ′vπ(s) = max s,˜π T˜πvπ(s)} ≤n2τt1T aπ′ π . {∀x ≥0, max s x(s) ≤1T x} (13) Combining Equations (12) and (13), we get: 1T (vπ† −vπ′) = 1T (vπ† −vπ) −1T (vπ′ −vπ) ≤  1 − 1 n2τt  1T (vπ† −vπ). 12 Lemma 13. While Simplex-PI does not create any recurrent class nor finishes, • either an action is eliminated from policies after at most ⌈n2τt log(n2τt)⌉iterations, • or a recurrent class is broken after at most ⌈n2τt log(n2)⌉iterations. Proof. Let π be the policy in some iteration. Let π† be the last policy before a new recurrent class appears, and π′ a policy generated after k iterations from π. We shall prove that one of the two events stated of the lemma must happen. Since 0 ≤1T (vπ† −vπ) {vπ† ≥vπ} = xπT (−aπ π†) {Equation (11)} = X s̸∈R(π) xπ(s)(−aπ π†(s)) + X C∈R(π) X s∈C xπ(s)(−aπ π†(s)) there must exist either a state s0 ̸∈R(π) such that xπ(s0)(−aπ π†(s0)) ≥1 nxπT (−aπ π†) ≥0. (14) or a recurrent class R0 such that X s∈R0 xπ(s)(−aπ π†(s)) ≥1 nxπ T (−aπ π†) ≥0. (15) We consider these two cases separately below. • case 1: Equation (14) holds for some s0 ̸∈R(π). 
Let us prove by contradiction that for k sufficiently big, π′(s0) ̸= π(s0): let us assume that π′(s0) = π(s0). Then 1T (vπ† −vπ′) ≥vπ†(s0) −vπ′(s0) {vπ† ≥vπ′} = vπ†(s0) −Tπ′vπ′(s0) {vπ′ = Tπ′vπ′} ≥vπ†(s0) −Tπ′vπ†(s0) {vπ† ≥vπ′} = −aπ′ π†(s0) = −aπ π†(s0) {π(s0) = π′(s0)} ≥ 1 nτt xπ(s0)(−aπ π†(s0)) {Equation (8)} ≥ 1 n2τt xπ T (−aπ π†) {Equation (14)} = 1 n2τt 1T (vπ† −vπ). {Equation (11)} If there is no recurrent class creation, the contraction property given in Lemma 12 implies that if π′ is obtained after k = ⌈n2τt log(n2τt)⌉> log(n2τt) log 1 1− 1 n2τt iterations, then 1T (vπ† −vπ′) < 1 n2τt 1T (vπ† −vπ), and we get a contradiction. As a conclusion, we necessarily have π′(s0) ̸= π(s0). • case 2: Equation (15) holds for some R0 that is a recurrent class of π. Let us prove by contradiction that for k sufficiently big, R0 cannot be a recurrent class of π′: let us thus assume that R0 is a recurrent class of π′. Write T for the set of states that are transient for π (formally, T = X\R(π)). For any subset Y of the state space X, write P Y π for the stochastic matrix of which the ith row is equal to that of Pπ if i ∈Y , and is 0 otherwise, and write 1Y the vectors of which the ith component is equal to 1 if i ∈Y and 0 otherwise. 13 Using the fact that P R0 π P T π = 0, one can first observe that (I −γP R0 π )(I −γP T π ) = I −γ(P R0 π + P T π ), from which we can deduce that ∀s ∈R0,  1T ∪R0 T (I −γPπ)−1 (s) =  1T ∪R0 T (I −γ(P R0 π + P T π ))−1 (s) =  1T ∪R0 T (I −γP T π )−1(I −γP R0 π )−1 (s). (16) Also, let s be an arbitrary state and s′ be a state of R0. Since 1T s (P T π )k(s′) is the probability that the chain starting in s reaches s′ for the first time after k iterations, then 1T s (I −γP T π )−1(s′) ≤ ∞ X t=0 1T s (P T π )i(s′) ≤1. and therefore, ∀s′ ∈R0, 1T ∪R0 T (I −γP T π )−1(s′) ≤n. (17) Writing δ for the vector that equals −aπ π† on R0 and that is null everywhere else, we have X s∈R0 xπ(s)(−aπ π†(s)) = X s∈R0 [(I −γP T π )−11](s)δ(s) = X s∈R0 [(I −γP T π )−11T ∪R0](s)δ(s)  ∀s ∈R0, [(I −γP T π )−11X\(T ∪R0)](s) = 0 = X s [(I −γP T π )−11T ∪R0](s)δ(s) {∀s ̸∈R0, δ(s) = 0} = 1T ∪R0 T (I −γPπ)−1δ = 1T ∪R0 T (I −γP T π )−1(I −γP R0 π )−1δ {Equation (16)} = X s [(I −γP T π T )−11T ∪R0](s)[(I −γP R0 π )−1δ](s) = X s∈R0 [(I −γP T π T )−11T ∪R0](s)[(I −γP R0 π )−1δ](s) {∀s ̸∈R0, δ(s) = 0} = X s∈R0 [(I −γP T π T )−11T ∪R0](s)(vπ†(s) −vπ(s)) {Lemma 10} ≤n1R0 T (vπ† −vπ). {Equation (17)} (18) We assumed that R0 is also a recurrent class of π′, which implies 1R0 T vπ = 1R0 T vπ′, and 1T (vπ† −vπ′) ≥1R0 T (vπ† −vπ′) {vπ† ≥vπ′} = 1R0 T (vπ† −vπ) {1R0 T vπ = 1R0 T vπ′} ≥1 n X s∈R0 xπ(s)(−aπ π†(s)) {Equation (18)} ≥1 n2 xπT (−aπ π†) {Equation (15)} = 1 n2 1T (vπ† −vπ). {Equation (11)} 14 If there is no recurrent class creation, the contraction property given in Lemma 12 implies that if π′ is obtained after k = ⌈n2τt log(n2)⌉> log(n2) log 1 1− 1 n2τt iterations, then 1T (vπ† −vπ′) < 1 n2 1T (vπ† −vπ), and thus we get a contradiction. As a conclusion, R0 cannot be a recurrent class of π′. A direct consequence of the above result is Lemma 6 that we originally stated on page 5, and that we restate for clarity. Lemma 6. After at most (m−n)⌈n2τt log(n2τt)⌉+n⌈n2τt log(n2)⌉iterations, either Simplex-PI finishes or a new recurrent class appears. Proof. Before a recurrent class is created, at most n recurrent classes need to be broken and (m −n) actions to be eliminated, and the time required by these events is bounded thanks to the previous lemma. 
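The objects manipulated in this part of the analysis can also be instantiated numerically. The sketch below (Python/NumPy; the random instance and helper names are illustrative) computes the visitation vector xπ = (I − γPπᵀ)^{-1}1 and the set R(π), and checks Equation (11) together with the τ-independent parts of Lemma 11 (namely 1 ≤ xπ(i) everywhere and (1 − γ)xπ(i) ≤ n); plugging in estimates of τt and τr from the earlier sketch would allow checking the remaining inequalities.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m_per_state, gamma = 6, 3, 0.95
P = rng.random((m_per_state, n, n)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((m_per_state, n))

def matrices(pi):
    return P[pi, np.arange(n)], R[pi, np.arange(n)]

def value(pi):
    P_pi, r_pi = matrices(pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def x_vector(pi):
    """x_pi = (I - gamma * P_pi^T)^{-1} 1, the discounted state-visitation measure (scaled by n)."""
    P_pi, _ = matrices(pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, np.ones(n))

def recurrent_states(P_pi, tol=1e-12):
    k = len(P_pi)
    reach = (P_pi > tol) | np.eye(k, dtype=bool)
    for _ in range(k):
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    return np.array([all(reach[j, i] for j in range(k) if reach[i, j]) for i in range(k)])

pi, pi2 = rng.integers(m_per_state, size=n), rng.integers(m_per_state, size=n)
v, v2 = value(pi), value(pi2)

# Equation (11): 1^T (v_{pi2} - v_pi) = x_{pi2}^T a^{pi2}_pi.
P2, r2 = matrices(pi2)
adv = r2 + gamma * P2 @ v - v
assert np.isclose(np.sum(v2 - v), x_vector(pi2) @ adv)

# tau-independent parts of Lemma 11: x_pi(i) >= 1 and (1 - gamma) * x_pi(i) <= n.
x, rec = x_vector(pi), recurrent_states(matrices(pi)[0])
assert np.all(x >= 1 - 1e-9) and np.all((1 - gamma) * x <= n + 1e-9)
print("recurrent states:", np.flatnonzero(rec), "| checks passed")
```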
9.2 Part 2: A new recurrent class implies a significant step towards the optimal value We now proceed to the second part of the proof, and begin by proving Lemma 7 (originally stated page 6). Lemma 7. When Simplex-PI moves from π to π′ where π′ involves a new recurrent class, we have 1T (vπ∗−vπ′) ≤  1 − 1 nτr  1T (vπ∗−vπ). Proof. Let s0 be the state such that π′(s0) ̸= π(s0). On the one hand, since π′ contains a new recurrent class R (necessarily containing s0), we have 1T (vπ′ −vπ) = xπ′T aπ′ π {Equation (11)} = xπ′(s0)aπ(s0) {Simplex-PI switches 1 action and aπ(s0) = aπ′ π (s0)} ≥ 1 (1 −γ)τr aπ(s0). {Equation (9) with s0 ∈R(π′)} (19) On the other hand, ∀s, vπ∗(s) −vπ(s) = [(I −γPπ∗)−1aπ∗ π ](s) {Lemma 10} ≤ 1 1 −γ aπ(s0). {∥(I −γPπ∗)−1∥∞≤ 1 1 −γ and aπ(s0) = max s,˜π a˜π π(s) ≥0} (20) Combining these two observations, we obtain 1T (vπ∗−vπ′) = 1T (vπ∗−vπ) −1T (vπ′ −vπ) ≤1T (vπ∗−vπ) − 1 (1 −γ)τr aπ(s0) {Equation (19)} ≤1T (vπ∗−vπ) −1 τr max s vπ∗(s) −vπ′(s) {Equation (20)} ≤  1 − 1 nτr  1T (vπ∗−vπ). {∀x, 1 n1T x ≤max s x(s)} 15 Lemma 14. While Simplex-PI does not terminate, • either some non-optimal action is eliminated from recurrent states after at most ⌈nτr log(n2τr)⌉ recurrent class creations, • or some non-optimal action is eliminated from policies after at most ⌈nτr log(n2τt)⌉recurrent class creations. Proof. Let π be the policy in some iteration and π′ the policy generated after k iterations from π (without loss of generality we assume π′ ̸= π∗). Let s0 = arg maxs xπ(s)(−aπ π∗(s)). We have xπ(s0)(−aπ π∗(s0)) ≥1 nxπT (−aπ π∗) {∀x, 1T x ≤n max s x(s)} = 1 n1T (vπ∗−vπ). {Equation (11)} (21) We now consider two cases, respectively corresponding to s0 ̸∈R(π) or s0 ∈R(π). • case 1: s0 ̸∈R(π). Let us prove by contradiction that π′(s0) ̸= π(s0) if k is sufficiently large: let us assume that π′(s0) = π(s0). Then, by using repeatedly the fact that for all ˜π, a˜π π∗≤0 (by definition of the optimal policy π∗), we have: 1T (vπ∗−vπ′) = xπ′T (−aπ′ π∗) {Equation (11)} ≥xπ′(s0)(−aπ′ π∗(s0)) ≥−aπ′ π∗(s0) {xπ′(s0) ≥1} = −aπ π∗(s0) {π(s0) = π′(s0)} ≥ 1 nτt xπ(s0)(−aπ π∗(s0)) {Equation (8)} ≥ 1 n2τt 1T (vπ∗−vπ). {Equation (21)} After k = ⌈nτr log n2τt⌉> log n2τt log 1 1− 1 nτr recurrent classes are created, we have by the contraction property of Lemma 7 that 1T (vπ∗−vπ′) < 1 n2τt 1T (vπ∗−vπ) and we get a contradiction. As a conclusion, we have π′(s0) ̸= π(s0). • case 2: s0 ∈R(π). Let us prove by contradiction that π′(s0) ̸= π(s0) if s0 is recurrent for π′ and k is sufficiently large: let us assume that π′(s0) = π(s0) and s0 ∈R(π′). Then, by using again the fact that for all ˜π, a˜π π∗≤0, we have: 1T (vπ∗−vπ′) = xπ′T (−aπ′ π∗) {Equation (11)} = X s xπ′(s)(−aπ′ π∗(s)) ≥ X s∈R0 xπ′(s)(−aπ′ π∗(s)) ≥ 1 (1 −γ)τr X s∈R0 (−aπ′ π∗(s)) {Equation (9)} ≥ 1 (1 −γ)τr (−aπ′ π∗(s0)) = 1 (1 −γ)τr (−aπ π∗(s0)) {π(s0) = π′(s0)} ≥ 1 nτr xπ(s0)(−aπ π∗(s0)) {xπ(s0) ≤ n 1 −γ } ≥ 1 n2τr 1T (vπ∗−vπ). {Equation (21)} 16 After k = ⌈nτr log n2τr⌉> log n2τr log 1 1− 1 nτr new recurrent classes are created, we have by the contraction property of Lemma 7 that 1T (vπ∗−vπ′) < 1 n2τr 1T (vπ∗−vπ), and we get a contradiction. As a conclusion, we know that π′(s0) ̸= π(s0) if s0 is recurrent for π′. We are ready to conclude: At most, the (m−n) non-optimal actions may need to be eliminated from all states; in addition, all actions may need to be eliminated from recurrent states (some optimal actions may only be used at transient states and thus also need to be eliminated from recurrent states). 
Overall, convergence can thus be obtained after at most a total of m⌈nτr log(n2τr)⌉+ (m −n)⌈nτr log(n2τt)⌉ recurrent class creations. The result follows from the fact that each class creation requires at most (m −n)⌈n2τt log(n2τt)⌉+ n⌈n2τt log(n2)⌉iterations (cf. Lemma 6). 10 Cycle and recurrent classes creations for Howard’s PI (Proofs of Lemmas 8 and 9) Lemma 8. If the MDP is deterministic, after at most n iterations, either Howard’s PI finishes or a new cycle appears. Proof. Consider a sequence of l generated policies π1, · · · , πl from an initial policy π0 such that no new cycle appears. By induction, we have vπl −vπk = Tπlvπl −Tπlvπk−1 + Tπlvπk−1 −Tπkvπk−1 + Tπkvπk−1 −Tπkvπk {∀π, Tπvπ = vπ} ≤γPπl(vπl −vπk−1) + γPπk(vπk−1 −vπk) {Tπlvπk−1 ≤Tπkvπk−1} ≤γPπl(vπl −vπk−1) {Lemma 1 and Pπk ≥0} ≤(γPπl)k(vπl −vπ0). {By induction on k} (22) Since the MDP is deterministic and has n states, (Pπl)n will only have non-zero values on columns that correspond to R(πl). Furthermore, since no cycle is created, R(πl) ⊂R(π0), which implies that vπl(s)−vπ0(s) = 0 for all s ∈R(πl). As a consequence, we have (Pπl)n(vπl −vπ0) = 0. By Equation (22), this implies that vπl = vπn. If l > n, then Howard’s PI must have terminated. Lemma 9. After at most (m−n)⌈n2τt log(n2τt)⌉+n⌈n2τt log(n2)⌉iterations, either Howard’s PI finishes or a new recurrent class appears. Proof. A close examination of the proof of Lemma 6, originally designed for Simplex-PI, shows that it applies to Howard’s PI without any modification. 11 A bound for Howard’s PI and Simplex-PI under Assump- tion 1 (Proof of Theorem 7) We here consider that the state space is decomposed into 2 sets: T is the set of states that are transient under all policies, and R is the set of states that are recurrent under all policies. From this assumption, it can be seen that when running Howard’s PI or Simplex-PI, the values and actions chosen on T have no influence on the evolution of the values and policies on R. So we will study the convergence of both algorithms in two steps: we will first bound the number of iterations to converge on R; we will then add the number of iterations for converging on T given that convergence has occurred on R. 17 Convergence on the set R of recurrent states: Without loss of generality, we consider here that the state space is only made of the set of recurrent states. First consider Simplex-PI. If all states are recurrent, new recurrent classes are created at every iteration, and Lemma 7 holds. Then, in a way similar to the proof of Lemma 14, it can be shown that every ⌈nτr log n2τr⌉iterations, a non-optimal action can be eliminated. As there are at most (m −n) non-optimal actions, we deduce that Simplex-PI converges in at most (m −n)⌈nτr log n2τr⌉iterations on R. Consider now Howard’s PI. We can prove the following lemma. Lemma 10. If the MDP satisfies Assumption 1 and all states are recurrent under all policies, Howard’s PI generates policies (πk)k≥0 that satisfy: 1T (vπ∗−vπk+1) ≤  1 − 1 nτr  1T (vπ∗−vπk). Proof. On the one hand, we have 1T (vπk+1 −vπk) = xπk+1 T aπk+1 πk {Equation (11)} = xπk+1 T aπk {aπk+1 πk = aπk} ≥ 1 (1 −γ)τr 1T aπk {Equation (9) and all states are recurrent} ≥ 1 (1 −γ)τr ∥aπk∥∞. {∀x ≥0, 1T x ≥∥x∥∞} (23) On the other hand, 1T (vπ∗−vπk) = xπ∗ T aπ∗ πk {Equation (11)} ≤xπ∗ T aπk {aπk ≥aπ∗ πk} ≤ n 1 −γ ∥aπk∥∞. { X i xπ∗(i) ≤ n 1 −γ and aπk ≥0} (24) By combining Equations (23) and (24), we obtain: 1T (vπ∗−vπk+1) = 1T (vπ∗−vπk) −1T (vπk+1 −vπk) ≤  1 − 1 nτr  1T (vπ∗−vπk). 
Then, similarly to Simplex-PI, we can prove that after every ⌈nτr log(n²τr)⌉ iterations a non-optimal action must be eliminated. And as there are at most (m − n) non-optimal actions, we deduce that Howard's PI converges in at most (m − n)⌈nτr log(n²τr)⌉ iterations on R.

Convergence on the set T of transient states: Consider now that convergence has occurred on the recurrent states R. A simple variation of the proof of Lemma 6/Lemma 9 (where we use the fact that we do not need to consider the events where recurrent classes are broken, since recurrent classes do not evolve anymore) allows us to show that the extra number of iterations for both algorithms to converge on the transient states is at most (m − n)⌈n²τt log(n²τt)⌉, and the result follows.

Acknowledgements. I would like to thank Ian Post for exchanges about the proof in Post and Ye (2013), Thomas Dueholm Hansen for noticing a flaw in a claimed result for deterministic MDPs in an earlier version, Romain Azaïs for the reference on the Cesàro mean of stochastic matrices, and the reviewers and editor for their very careful feedback, which helped improve the paper overall, and the proof of Lemma 13 in particular.

References

Akian, M. and Gaubert, S. (2013). Policy iteration for perfect information stochastic mean payoff games with bounded first return times is strongly polynomial. Technical report, arXiv:1310.4953v1.

Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Fearnley, J. (2010). Exponential lower bounds for policy iteration. In Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP'10), Part II, pages 551–562, Berlin, Heidelberg. Springer-Verlag.

Fritz, F., Huppert, B., and Willems, W. (1979). Stochastische Matrizen. Springer, Berlin.

Hansen, T. (2012). Worst-case Analysis of Strategy Iteration and the Simplex Method. Ph.D. thesis, Department of Computer Science, Aarhus University.

Hansen, T. and Zwick, U. (2010). Lower bounds for Howard's algorithm for finding minimum mean-cost cycles. In ISAAC (1), pages 415–426.

Hansen, T., Miltersen, P., and Zwick, U. (2013). Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM, 60(1), 1–16.

Hollanders, R., Delvenne, J., and Jungers, R. (2012). The complexity of policy iteration is exponential for discounted Markov decision processes. In 51st IEEE Conference on Decision and Control (CDC'12).

Hollanders, R., Gerencsér, B., Delvenne, J., and Jungers, R. (2014). Improved bound on the worst case complexity of policy iteration. Technical report, arXiv:1410.7583v1.

Mansour, Y. and Singh, S. (1999). On the complexity of policy iteration. In UAI, pages 401–408.

Melekopoglou, M. and Condon, A. (1994). On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS Journal on Computing, 6(2), 188–192.

Post, I. and Ye, Y. (2013). The simplex method is strongly polynomial for deterministic Markov decision processes. In 24th ACM-SIAM Symposium on Discrete Algorithms.

Puterman, M. (1994). Markov Decision Processes. Wiley, New York.

Schmitz, N. (1985). How good is Howard's policy improvement algorithm? Zeitschrift für Operations Research, 29(7), 315–316.

Stroock, D. (2005). An Introduction to Markov Processes. Springer, Berlin.

Ye, Y. (2011). The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res., 36(4), 593–603.