Solving Markov Decision Processes with Reachability Characterization from Mean First Passage Times

Shoubhik Debnath1, Lantao Liu2, Gaurav Sukhatme3

Abstract— A new mechanism for efficiently solving Markov decision processes (MDPs) is proposed in this paper. We introduce the notion of a reachability landscape, where we use the Mean First Passage Time (MFPT) as a means to characterize the reachability of every state in the state space. We show that such reachability characterization assesses the importance of states very well and thus provides a natural basis for effectively prioritizing states and approximating policies. Built on this observation, we design two new algorithms – Mean First Passage Time based Value Iteration (MFPT-VI) and Mean First Passage Time based Policy Iteration (MFPT-PI) – that are modified from the state-of-the-art solution methods. To validate our design, we have performed numerical evaluations in robotic decision-making scenarios, comparing the proposed new methods with the corresponding classic baseline mechanisms. The evaluation results show that MFPT-VI and MFPT-PI outperform the state-of-the-art solutions in terms of both practical runtime and number of iterations. Aside from the advantage of fast convergence, this new solution method is intuitively easy to understand and practically simple to implement.

I. INTRODUCTION AND RELATED WORK

Decision-making in uncertain environments is a basic problem in the area of artificial intelligence [1], [2], and Markov decision processes (MDPs) have become very popular for modeling non-deterministic planning problems with full observability [3], [4]. In this paper, we investigate exact solution methods for solving MDPs. We are especially interested in the MDPs that are widely applied in the artificial intelligence domain, such as decision-theoretic planning [5], where the agent following the optimal policy eventually enters a terminal state defined as a goal/destination.

Contribution: Our objective is to further improve the convergence behavior of MDP solving mechanisms. To realize that, we introduce a new MDP solution method with a novel design, and we show that this design allows us to gain insights for designing efficient heuristics to speed up the convergence of MDPs. In greater detail, we introduce the notion of a reachability landscape, which is essentially a "grid-map" measuring the degree of difficulty for each state to transit to the terminal/goal state. By "reachability" we mean how hard it is, given the current discrete stochastic system, for the agent to transit from the current state to the given goal state. In other words, reachability here refers to how easily the agent can arrive at the terminal state. To compute the reachability landscape, we use the Mean First Passage Time (MFPT), which can be formulated as a simple linear system.

1Shoubhik Debnath is with NVIDIA Corporation, Santa Clara, CA 95051, USA. E-mail: sdebnath@nvidia.com
2Lantao Liu is with the Intelligent Systems Engineering Department at Indiana University - Bloomington, Bloomington, IN 47408, USA. E-mail: lantao@iu.edu
3Gaurav Sukhatme is with the Department of Computer Science at the University of Southern California, Los Angeles, CA 90089, USA. E-mail: gaurav@usc.edu
The paper was published in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
We show that such reachability characterization of each state reflects the importance of that state, and thus provides a natural basis that can be used to upgrade and accelerate both value iteration (VI) and policy iteration (PI) mechanisms: it can be utilized for prioritizing states in the standard VI process and for approximating the policy in the standard PI procedure. More specifically, we design two novel algorithms that are modified from the state-of-the-art solution methods: the Mean First Passage Time based Value Iteration (MFPT-VI) and the Mean First Passage Time based Policy Iteration (MFPT-PI). Experimental evaluations show that both MFPT-VI and MFPT-PI are superior to the state-of-the-art methods in terms of both the practical runtime and the number of iterations needed to converge.

Related Work: The basic computational mechanisms and techniques for MDPs have been well understood and widely applied to solve many decision-theoretic planning [5], [6] and reinforcement learning problems [7], [8]. In general, the fundamental solving mechanism is to exploit the dynamic programming (DP) structure, which can be computed in polynomial time. Value iteration and policy iteration are two of the most famous and most widely used algorithms to solve MDPs [9], [10].

An important related heuristic for efficiently solving MDPs is prioritized sweeping [11], which has been broadly employed to further speed up the value iteration process. This heuristic evaluates each state and obtains a score based on the state's contribution to the convergence, and then prioritizes/sorts all states based on their scores (e.g., states with a larger difference in value between two consecutive iterations get higher scores) [12], [13]. Then, in the immediately next dynamic programming iteration, the values of states are evaluated following the newly prioritized order. The prioritized sweeping heuristic is also leveraged in our MFPT based value iteration procedure, and comparisons with baseline approaches have been conducted in our experimental section.

Important related frameworks for solving MDPs also include compact representations such as linear function representation and approximation [9], [3] used in the policy iteration algorithms. The linear equation based techniques (a detailed formulation is provided in the paper) do not exploit regions of uniformity in the value functions associated with states, but rather a compact form of state features that can somewhat reflect values [14]. Our method for computing the MFPT can also be formulated as a linear system. However, the intermediate results generated from the MFPT are more direct: they very well capture – and also allow us to visualize – the "importance" of states, and can lead to a faster convergence speed, which is demonstrated in the experiments.

Another relevant strategy is real-time dynamic programming (RTDP) [15], where states are not treated uniformly. Specifically, in each DP iteration, only a subset of the most important states might be explored, and the selection of this subset is usually based on the agent's exploration experience. For a single DP iteration, RTDP usually requires less computation time than the classic DP where all states need to be swept, and thus it can be extended as an online process and integrated into the real-time reinforcement learning framework [16].
Similar strategies also include state abstraction [17], [18], where states with similar characteristics are hierarchically and/or adaptively grouped together, either in an offline static or online dynamic aggregation style. Although we believe our proposed framework can be easily extended in the fashion of RTDP's partial state exploration and adaptive state abstraction/computation, in this work we consider the complete and full exploration of all states, and compare with state-of-the-art methods that evaluate across the entire state space.

II. PRELIMINARIES

A. Markov Decision Processes

Definition 2.1: A Markov Decision Process (MDP) is a tuple M = (S, A, T, R), where S = {s_1, ..., s_n} is a set of states and A = {a_1, ..., a_n} is a set of actions. The state transition function T : S × A × S → [0, 1] is a probability function such that T_a(s_1, s_2) is the probability that action a in state s_1 will lead to state s_2, and R : S × A × S → ℝ is a reward function where R_a(s, s') returns the immediate reward received on taking action a in state s that leads to state s'.

A Markov system is defined as absorbing if from every non-terminal state it is possible to eventually enter a terminal state such as a goal/destination state [14]. We restrict our attention to absorbing Markov systems so that the agent can arrive and stop at a goal.

A policy is of the form π = {s_1 → a_1, s_2 → a_2, ..., s_n → a_n}. We denote by π[s] the action associated with state s. If the policy of an MDP is fixed, then the MDP behaves as a Markov chain [19].

B. MDP Solution Methods

Here we discuss prevalent solution methods including value iteration and policy iteration, as well as state-of-the-art heuristics/variants.

1) Value Iteration: Value iteration (VI) is an iterative procedure that calculates the value (or utility in some literature) of each state s based on the values of the neighbouring states that s can directly transit to. The value V(s) of state s at each iteration can be calculated by the Bellman equation shown below:

V(s) = \max_{a \in A} \sum_{s' \in S} T_a(s, s') \left[ R_a(s, s') + \gamma V(s') \right],    (1)

where γ is a reward discounting parameter. The stopping criterion for the algorithm is when the values calculated in two consecutive iterations are close enough, i.e., max_{s∈S} |V(s) − V'(s)| ≤ ϵ, where ϵ is an optimization tolerance/threshold value, which determines the level of convergence accuracy.

Prioritized Sweeping: The standard Bellman recursion evaluates the values of all states in a sweeping style, following the index of the states stored in memory. To speed up the convergence, a heuristic called prioritized sweeping has been proposed and widely used as a benchmark framework for non-domain-specific applications. The algorithm labels a state as more "interesting" or more "important" during a particular iteration if the change in its value is higher when compared to the previous iteration. The essential idea is that the larger the value change, the higher the impact that updating the state will have on its dependent states, thereby taking a larger step towards convergence.
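To make the backup in Eq. (1) and the prioritized-sweeping order concrete, the following is a minimal sketch in Python/NumPy. The paper's implementation is in C++, so this is only an illustration; the dense T/R array layout, the function name, and the in-place (asynchronous) sweep are assumptions rather than the authors' code.

```python
# Minimal sketch of the Bellman backup in Eq. (1), with the optional
# value-difference ordering used by prioritized sweeping (VI-PS).
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-1, prioritized=False):
    """T: (|A|, |S|, |S|) transition tensor, T[a, s, s'].
       R: (|A|, |S|, |S|) immediate rewards, R[a, s, s']."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    order = np.arange(n_states)              # first sweep follows the stored order
    while True:
        V_old = V.copy()
        for s in order:
            # Q(s, a) = sum_{s'} T_a(s, s') [R_a(s, s') + gamma V(s')]
            q = np.einsum('ap,ap->a', T[:, s, :], R[:, s, :] + gamma * V)
            V[s] = q.max()                   # values are updated in place, so order matters
        diff = np.abs(V - V_old)
        if prioritized:                      # larger value change -> visited earlier next sweep
            order = np.argsort(-diff)
        if diff.max() <= eps:                # stopping criterion max_s |V - V'| <= eps
            return V
```

With `prioritized=False` the sweep follows the stored state order (the plain VI baseline); with `prioritized=True` it follows the value-difference order of the VI-PS variant compared against in the experiments.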
2) Policy Iteration: Policy iteration (PI) is an anytime algorithm, because at any iteration there is a feasible policy (although not necessarily an optimal one). PI involves the following two steps. The first step, also known as policy evaluation, is responsible for calculating the value V(s) of each state s given some fixed policy π(s) until convergence. This is followed by a step called policy improvement. Here, the policy π(s) is updated based on the resulting converged values of each state. The optimal policy is obtained at convergence when π_{i+1} = π_i, where i is the iteration index. Each policy improvement step is based on the following equation:

\pi(s) = \arg\max_{a \in A} \sum_{s' \in S} T_a(s, s') \left[ R_a(s, s') + \gamma V^{\pi}(s') \right].    (2)

There are two ways of computing the policy evaluation step.

1) The first approach is to perform Bellman updates iteratively until convergence given some fixed policy, using the equation below:

V^{\pi}_{i+1}(s) = \sum_{s' \in S} T_{\pi(s)}(s, s') \left[ R_{\pi(s)}(s, s') + \gamma V^{\pi}_{i}(s') \right].    (3)

We refer to this version of policy iteration, where policy evaluation is performed using Bellman updates, as Policy Iteration (PI) throughout the paper.

2) The second approach to policy evaluation is through linear equation approximation, where we solve a linear system as shown in Eq. (4). This outputs the optimal values of each state corresponding to a given policy:

\forall s, \quad V^{\pi}(s) = \sum_{s' \in S} T_{\pi(s)}(s, s') \left[ R_{\pi(s)}(s, s') + \gamma V^{\pi}(s') \right].    (4)

The variables in this system of linear equations are the V^{\pi}(s), whereas T and R are constants given a fixed policy. We refer to this version of policy iteration as Policy Iteration - Linear Equation (PI-LE) throughout the paper.

Fig. 1. Illustration of reachability landscape. (a) Demonstration of a simple simulation scenario with dark blocks as obstacles, and the goal state as a red circle; (b)-(e) MFPT based reachability landscapes; (f) Converged optimal policy shown in red arrows.

C. Mean First Passage Times

The first passage time (FPT), T_{ij}, is defined as the number of state transitions involved in reaching state s_j for the first time when starting in state s_i. The mean first passage time (MFPT), µ_{ij}, from state s_i to s_j is the expected number of hopping steps for reaching state s_j given that the process starts in state s_i [20]. The MFPT analysis is built on the Markov chain and does not depend on the agent's actions. Remember that, when an MDP is associated with a fixed policy, it behaves as a Markov chain [19].

Formally, let us define a Markov chain with n states and transition probability matrix p ∈ ℝ^{n×n}. If the transition probability matrix is regular, then each MFPT, µ_{ij} = E(T_{ij}), satisfies the conditional expectation formula below:

E(T_{ij}) = \sum_{k} E(T_{ij} | B_k)\, p_{ik},    (5)

where p_{ik} represents the transition probability from state s_i to s_k, and B_k is the event that the first state transition happens from state s_i to s_k. From the definition of mean first passage times, we have E(T_{ij} | B_k) = 1 + E(T_{kj}). So, we can rewrite Eq. (5) as follows:

E(T_{ij}) = \sum_{k} p_{ik} + \sum_{k \neq j} E(T_{kj})\, p_{ik}.    (6)

Since \sum_{k} p_{ik} = 1, Eq. (6) can be reformulated as:

\mu_{ij} = 1 + \sum_{k \neq j} p_{ik}\, \mu_{kj} \;\Rightarrow\; \sum_{k \neq j} p_{ik}\, \mu_{kj} - \mu_{ij} = -1.    (7)

Solving for all MFPT variables can be viewed as solving a system of linear equations:

\begin{bmatrix} p_{11} - 1 & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} - 1 & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} - 1 \end{bmatrix} \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \vdots \\ \mu_{nj} \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ \vdots \\ -1 \end{bmatrix}.    (8)

The values µ_{1j}, µ_{2j}, ..., µ_{nj} represent the MFPTs for state transitions from states s_1, s_2, ..., s_n to s_j, with µ_{jj} = 0. To solve the above equation, efficient decomposition methods [21] can help avoid a direct matrix inversion.
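As a concrete illustration of Eq. (8), the sketch below assembles the system for one goal column and solves it with a dense solver. It is a NumPy sketch under stated assumptions, not the authors' C++/Eigen implementation; in particular, zeroing the goal column and pinning the goal row to enforce µ_{jj} = 0 is our reading of Eqs. (7)-(8).

```python
# Sketch of building and solving the MFPT system in Eq. (8) for a fixed policy.
# P is the |S| x |S| Markov-chain transition matrix induced by that policy.
import numpy as np

def mfpt_to_goal(P, goal):
    """Return mu[i] = expected number of hopping steps from state i to `goal`."""
    n = P.shape[0]
    A = P.copy()
    A[:, goal] = 0.0            # drop the p_ik * mu_kj terms with k = j (mu_jj = 0)
    A -= np.eye(n)              # row-wise (P' - I) mu = -1, as in Eq. (8)
    b = -np.ones(n)
    A[goal, :] = 0.0            # pin the goal row so that mu[goal] = 0
    A[goal, goal] = 1.0
    b[goal] = 0.0
    return np.linalg.solve(A, b)
```

The returned vector has one entry per state and is exactly the reachability landscape used in the next section.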
III. TECHNICAL APPROACH

A. Reachability Characterization using MFPT

The notion of MFPT allows us to define the reachability of a state. By "reachability of a state" we mean how hard it is, under the current fixed policy, for the agent to transit from that state to the given goal/absorbing state. With all MFPT values µ_{ij} obtained, we can construct a reachability landscape, which is essentially a "map" measuring the degree of difficulty of all states transiting to the goal. Fig. 1 shows a series of landscapes represented as heatmaps in our simulated environment. The values in the heatmap range from 0 (cold color) to 100 (warm color). In order to better visualize the low MFPT spectrum that we are most interested in, any value greater than 100 has been clipped to 100.

Fig. 1(b)-1(e) show the change of landscapes as the MFPT-PI algorithm proceeds. Initially, all states except the goal state are initialized as unreachable, as shown by the high MFPT color in Fig. 1(b). We observe that the reachability landscape conveys very useful information on the potential impacts of different states. More specifically, a state with a better reachability (smaller MFPT value) is more likely to make a larger change during the MDP convergence procedure, leading to a bigger convergence step. With such an observation and the new metric, we can design state prioritization schemes for the VI and policy update heuristics for the PI, which are discussed next.

B. Mean First Passage Time based Value Iteration (MFPT-VI)

Classic VI converges to the solution through the Bellman update shown in Eq. (1), which essentially sweeps all states while treating them as equally important. The "prioritized sweeping" mechanism has further improved the convergence rate by exploiting and ranking the states based on their potential contributions, where the metric is based on comparing the difference in value for each state between two consecutive iterations.

The new MFPT-VI method is also built on the prioritized sweeping mechanism, but we propose a metric using reachability (MFPT values), because, as mentioned above, the reachability characterization of each state reflects the potential impact/contribution of that state, and thus provides a natural basis for prioritization. Our reachability based metric differs from the existing value-difference metric as follows: the reachability landscape can very well capture the degree of importance of all states from a global viewpoint, whereas the classic value-difference strategy evaluates states locally, and may fail to grasp the correct global "convergence direction" due to its local viewpoint. Note that, since the MFPT computation is relatively expensive, it is not necessary to compute the MFPT at every iteration, but rather after every few iterations. The computational process of MFPT-VI is pseudo-coded in Alg. 1.

Algorithm 1: Mean First Passage Time based Value Iteration (MFPT-VI)
1  Input: states S, actions A, transition probabilities T_a(s, s'), rewards R_a(s, s'), and goal state s*. Output: the optimal policy π
2  while true do
3      V = V'
4      Calculate the MFPT values µ_{1s*}, µ_{2s*}, ..., µ_{|S|s*} by solving the linear system shown in Eq. (8)
5      List L := states sorted in increasing order of MFPT value
6      foreach state s in L do
7          Compute the value update at state s: V'(s) = max_{a∈A} Σ_{s'∈S} T_a(s, s') [ R_a(s, s') + γ V(s') ]
8      if max_{s_i} |V(s_i) − V'(s_i)| ≤ ϵ then
9          break
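A compact sketch of Alg. 1 is given below. It reuses mfpt_to_goal from the Eq. (8) sketch above; the helpers greedy_policy and induced_chain, the choice of the greedy policy to fix the Markov chain whose MFPTs are computed, and the period parameter (reflecting the note that the MFPT need not be recomputed every iteration) are illustrative assumptions.

```python
# Sketch of Alg. 1 (MFPT-VI): states are swept in increasing order of their MFPT
# to the goal, and the MFPT landscape is only refreshed every `period` sweeps.
import numpy as np

def greedy_policy(T, R, V, gamma):
    # pi(s) = argmax_a sum_{s'} T_a(s, s') [R_a(s, s') + gamma V(s')]
    q = np.einsum('asp,asp->as', T, R + gamma * V)       # shape (|A|, |S|)
    return q.argmax(axis=0)

def induced_chain(T, policy):
    # Markov chain P[s, s'] = T_{pi(s)}(s, s')
    return T[policy, np.arange(T.shape[1]), :]

def mfpt_vi(T, R, goal, gamma=0.95, eps=1e-1, period=3):
    n_states = T.shape[1]
    V = np.zeros(n_states)
    order = np.arange(n_states)
    it = 0
    while True:
        if it % period == 0:                             # refresh reachability landscape
            P = induced_chain(T, greedy_policy(T, R, V, gamma))
            mu = mfpt_to_goal(P, goal)                   # Eq. (8) sketch; assumes the chain reaches the goal
            order = np.argsort(mu)                       # most reachable states first
        V_old = V.copy()
        for s in order:                                  # Bellman backup, Eq. (1)
            V[s] = np.einsum('ap,ap->a', T[:, s, :], R[:, s, :] + gamma * V).max()
        it += 1
        if np.abs(V - V_old).max() <= eps:
            return V, greedy_policy(T, R, V, gamma)
```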
C. Mean First Passage Time based Policy Iteration (MFPT-PI)

Remember that the classic PI algorithm involves two steps: policy evaluation and policy improvement. In this subsection, we propose a new policy improvement mechanism.

Policy Evaluation: Classic policy evaluation utilizes either local policy optimization (shown in Eq. (3)) or a linear equation approximation (shown in Eq. (4)) to obtain a new set of values V(s), ∀s ∈ S. Our MFPT-PI also employs the local optimization rule described in Eq. (3).

Policy Improvement: The policy improvement of MFPT-PI is based on MFPT values. To approximate a good and feasible policy, the operation can be even simpler than the classic approach. We propose an intuitively simple and computationally cheap heuristic which shows great performance: the action selection can follow (or be weighted by) the "gradient" of the reachability landscape. For instance, in the agent motion planning scenario where each state transits to some set of states in the vicinity, the locally optimal action can be selected to transit to the neighbouring state with the least MFPT value.

Algorithm 2: Mean First Passage Time based Policy Iteration (MFPT-PI)
1  Input: states S, actions A, transition probabilities T_a(s, s'), rewards R_a(s, s'), and goal state s*. Output: the optimal policy π
2  Initialize with an arbitrary policy π'
3  while true do
4      π = π'
5      foreach state s of S do
6          Compute the value V(s) of state s given the policy π using Eq. (3)
7      foreach state s of S do
8          Improve the policy π at state s using Eq. (2)
9      Calculate the MFPT values µ_{1s*}, µ_{2s*}, ..., µ_{|S|s*} by solving the linear system in Eq. (8)
10     foreach state s of S do
11         Update the policy at state s with the action a that leads to the state s' with the minimum MFPT value: π'(s) = arg min_{a∈A} µ_{s's*}
12     if π = π' then
13         break

Alternating the above two steps allows us to converge to the solution. Fig. 1 illustrates the basic idea. In the beginning, a random policy is initialized and all states except the goal are pre-set with high MFPT values, see Fig. 1(b). In the first iteration, after completing the policy evaluation, the reachability landscape is shown in Fig. 1(c), which characterizes the reachability regions at a high level. The result is more like an image segmentation solution where regions with distinct reachability levels are roughly partitioned: since in the middle area there are a few obstacles that separate the space, the reachability landscape is split into two regions, with the upper area being well reachable and the lower part unreachable. With a couple more iterations, the reachability landscape is refined until low-level details are well characterized (see Figs. 1(d)-1(e)). In essence, a major benefit of the proposed MFPT-PI algorithm is that at the earlier stages of convergence it captures the big picture, and later it focuses on improving the details once the policy becomes better.

It is worth mentioning that, if there are multiple goal/absorbing states, then each goal state requires computing its own MFPT landscape, and the final reachability landscape is normalized over the MFPTs corresponding to all terminal states.
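The sketch below mirrors Alg. 2, reusing greedy_policy and mfpt_to_goal from the earlier sketches. Because transitions are stochastic, "the state s' that an action leads to" is read here as the expected successor MFPT, matching the "weighted by the gradient of the reachability landscape" phrasing; that reading, like the NumPy layout, is an assumption.

```python
# Sketch of Alg. 2 (MFPT-PI): policy evaluation via the iterative update of Eq. (3),
# a standard improvement step (Eq. (2)), then an MFPT-based update that points each
# state toward the action whose successors have the smallest expected MFPT.
import numpy as np

def evaluate_policy(T, R, policy, gamma, eps=1e-3):
    n_states = T.shape[1]
    V = np.zeros(n_states)
    while True:
        P = T[policy, np.arange(n_states), :]            # T_{pi(s)}(s, s')
        Rpi = R[policy, np.arange(n_states), :]          # R_{pi(s)}(s, s')
        V_new = np.einsum('sp,sp->s', P, Rpi + gamma * V)
        if np.abs(V_new - V).max() <= eps:
            return V_new
        V = V_new

def mfpt_pi(T, R, goal, gamma=0.95):
    n_actions, n_states, _ = T.shape
    policy = np.zeros(n_states, dtype=int)               # arbitrary initial policy
    while True:
        V = evaluate_policy(T, R, policy, gamma)         # policy evaluation, Eq. (3)
        improved = greedy_policy(T, R, V, gamma)         # policy improvement, Eq. (2)
        P = T[improved, np.arange(n_states), :]
        mu = mfpt_to_goal(P, goal)                       # reachability landscape, Eq. (8)
        # Follow the "gradient" of the landscape: pick the action whose expected
        # successor MFPT is smallest.
        expected_mu = np.einsum('asp,p->as', T, mu)      # shape (|A|, |S|)
        new_policy = expected_mu.argmin(axis=0)
        if np.array_equal(new_policy, policy):           # terminate when pi = pi'
            return new_policy, V
        policy = new_policy
```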
D. Time Complexity Analysis

The classic VI algorithm has a time complexity of O(|A||S|^2) per iteration, where |S| represents the number of states and |A| the number of actions. The classic PI has a time complexity of O(|A||S|^2 + |S|^{2.3}) per iteration if a linear equation solver is used (the matrix decomposition has a time complexity of O(|S|^{2.3}) if state-of-the-art algorithms are employed [21], given that the size of the matrix is the number of states |S|). Since both the proposed MFPT-VI and MFPT-PI algorithms involve a key step of calculating the MFPT, which also requires solving a linear system, each iteration of MFPT-VI and MFPT-PI has a time complexity of O(|A||S|^2 + |S|^{2.3}). Although the MFPT is an extra component (and cost) introduced to the VI and PI mechanisms, our reachability characterization enables the MDP solving process to converge in much fewer steps. An important point to note here is that the MFPT values do not need to be computed at every iteration; they are only calculated when a global characterization is needed, which also reduces the practical runtime.

It is also worth mentioning that, although the time complexity of matrix decomposition is O(|S|^{2.3}) in general, for sparse matrices efficient decomposition techniques perform much faster than O(|S|^{2.3}), with a time complexity that depends on the number of non-zero elements in the matrix. For example, in certain robotic motion planning scenarios, the robot only transits to states within a certain vicinity, causing the transition probability matrix to be sparse by nature. In spite of this additional time complexity per iteration, MFPT based algorithms have a faster overall runtime than existing solution methods because they converge in much fewer steps. Results are presented in the following section.

IV. EXPERIMENTAL RESULTS

A. Experimental Setting

Task Details: We validated our method through numerical evaluations with two different simulators running on a Linux machine. We consider the MDP problem where each action can lead to transitions to all other states with certain transition probabilities. However, in many practical scenarios, the probability of transiting from a state to another state that is "weakly connected" can be small, even close to 0. This can potentially result in a non-dense transition matrix. For example, in the robotic motion planning scenario, the state transition probability from state s_i to state s_j can be correlated with the time or distance of traveling from s_i to s_j, and it is more likely for a state to transit to states within a certain vicinity.

For the first task, we developed a simulator in C++ using OpenGL. To obtain the discrete MDP states, we tessellate the agent's workspace into a 2D grid map, and represent the center of each grid cell as a state. In this way, state hopping between two states represents the corresponding motion in the workspace. Each non-boundary state has a total of nine actions, i.e., a non-boundary state can move in the directions of N, NE, E, SE, S, SW, W, NW, plus an idle action returning to itself. A demonstration is shown in Fig. 2.

Fig. 2. (a) Demonstration of the simulation environment, with the agent's initial state (blue) and the goal state (red). Grey blocks are obstacles; (b) Converged optimal policy (red arrows) and a trajectory completed by the agent to reach the goal; (c)-(f) Evolution of reachability landscapes.

For the second task, we developed a simulator in C++ using ROS and rviz. The agent's workspace is partitioned into a 3D grid map where the center of each grid cell represents an MDP state. Each non-boundary state has a total of seven actions, i.e., a non-boundary state can move in the directions of N, E, S, W, TOP, BOTTOM, plus an idle action. In both tasks, the objective of the agent is to reach a goal state from an initial start location in the presence of obstacles.
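Since the grid tasks above only allow transitions into a small neighbourhood, the policy-induced chain is sparse, which is exactly the situation highlighted in the complexity analysis. The sketch below solves the Eq. (8) system with SciPy's sparse LU solver instead of a dense factorization; the format conversions are illustrative choices, not the authors' Eigen-based implementation.

```python
# Sparse version of the Eq. (8) solve: exploits the fact that each grid state
# only transits to a handful of neighbouring states.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def mfpt_to_goal_sparse(P, goal):
    """P: scipy.sparse matrix (|S| x |S|) of the policy-induced Markov chain."""
    n = P.shape[0]
    A = P.astype(float).tolil()
    A[:, goal] = 0.0                      # drop the mu_jj terms, as in Eq. (7)
    A = (A - sp.identity(n)).tolil()      # row-wise (P' - I) mu = -1
    b = -np.ones(n)
    A[goal, :] = 0.0                      # pin mu[goal] = 0
    A[goal, goal] = 1.0
    b[goal] = 0.0
    return spla.spsolve(A.tocsc(), b)     # sparse direct solve; cost scales with non-zeros
```

For very large grids, an iterative solver such as scipy.sparse.linalg.bicgstab could be substituted, since, as noted in the experiments, the reachability characterization does not require high MFPT accuracy.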
The reward function for both setups is defined as a high positive reward at the goal state and -1 for obstacle states. All experiments were performed on a laptop computer with 8GB RAM and a 2.6 GHz quad-core Intel i7 processor.

B. 2D Grid Setup

In this setup, we compare our proposed algorithms with their corresponding baseline algorithms in terms of practical runtime performance, iterations required to converge, and convergence profile. Later, we also analyze the time costs of individual components and the MFPT invoking frequency.

Practical Runtime Performance: We first investigate the time taken by the algorithms with a changing number of states. Fig. 3(a) compares the time taken by the VI, VI-PS and MFPT-VI algorithms. In MFPT-VI, the MFPT component is computed every three iterations. Also, the convergence thresholds are set the same for all methods. The results show that VI is the slowest, and VI-PS is faster than VI due to the prioritized sweeping heuristic. Our proposed MFPT-VI is much faster than the other two algorithms. Although the time taken per iteration for MFPT-VI is higher than that of the classic VI, overall MFPT-VI takes the least time due to much fewer convergence steps and also because the MFPT values are computed only every three iterations. Below, we further elaborate on why we have decided not to calculate the MFPT values every iteration.

Fig. 3. Time comparisons between the baseline methods and our proposed algorithms, with changing numbers of states (x-axis). (a) Variants of value iteration methods. (b) Variants of policy iteration methods.

Fig. 3(b) compares the time taken by PI, PI-LE, and MFPT-PI. The results also show that MFPT-PI is the fastest compared to the other two algorithms. Note that both MFPT-PI and PI-LE require solving linear equations, and in our implementation we use the Eigen library. (Since the reachability characterization does not require high MFPT accuracy, among the many provided decomposition solvers we chose one of the fast variants, which, however, does not provide the highest accuracy.)

Number of Iterations: We then analyze the number of iterations taken by the algorithms as the number of states changes. Fig. 4(a) compares the number of iterations taken by VI, VI-PS and MFPT-VI, respectively. The results reveal that MFPT-VI converges the fastest compared to the other two algorithms. Fig. 4(b) compares the iterations taken by PI, PI-LE, and MFPT-PI. Again, we can see that MFPT-PI converges the fastest among all three algorithms.

Fig. 4. Iterations to convergence for the baseline methods and our proposed algorithms, with changing numbers of states (x-axis). (a) Variants of value iteration methods. (b) Variants of policy iteration methods.

Convergence Profile: Here, we investigate the detailed convergence profile along with the increasing number of iterations. Since one important criterion for judging convergence is whether the error/difference between two consecutive iterations is small enough, we utilize an error measure ∆S over all states as the evaluation metric. Specifically, for VI, VI-PS and MFPT-VI, the error ∆S is defined as

\Delta_S = \max_{s_i \in S} |V(s_i) - V'(s_i)|,    (9)

where V, V' represent the values of states at iteration i and i + 1.

Fig. 5. (a) The progress of VI, VI-PS and MFPT-VI across iterations. (b) The progress of PI, PI-LE and MFPT-PI across iterations.
For PI, PI-LE and MFPT-PI, ∆S is defined as the total number of mismatches between two consecutive iterations' policies:

\Delta_S = \sum_{s_i \in S} I_{\pi}(s_i), \quad \text{where } I_{\pi}(s_i) = \begin{cases} 0, & \text{if } \pi(s_i) = \pi'(s_i) \\ 1, & \text{otherwise,} \end{cases}    (10)

where π and π' represent the policy at iteration i and i + 1, respectively.

Fig. 5(a) shows that initially VI, VI-PS and MFPT-VI start with the same ∆S value. Our MFPT-VI approach then shows an extremely steep decrease in the first few iterations. For example, in our simulation scenario with around 2500 states and a threshold value of 0.1, MFPT-VI converged in only 19 iterations; in contrast, VI took 55 iterations and VI-PS took 45.

Fig. 5(b) shows the profiles for PI, PI-LE and MFPT-PI. Note that, with the policy obtained after the first iteration, MFPT-PI already produced a smaller ∆S value compared to the other two algorithms. After that, the ∆S values of MFPT-PI remain lower than those of PI and PI-LE. For instance, in the same simulation example mentioned above, our MFPT-PI method converged in 9 iterations whereas PI took 13 iterations and PI-LE took 11 iterations. The profiles also indicate that the PI based methods take much fewer iterations, but more time per iteration, than the VI based approaches.
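For reference, the two convergence metrics above amount to only a few lines; the array representation of values and policies follows the same illustrative layout as the earlier sketches.

```python
# Small sketch of the convergence metrics: the value error of Eq. (9) for the
# VI-family methods and the policy-mismatch count of Eq. (10) for the PI-family.
import numpy as np

def value_error(V, V_next):
    # Eq. (9): max_s |V(s) - V'(s)|
    return np.abs(V - V_next).max()

def policy_mismatch(policy, policy_next):
    # Eq. (10): number of states whose action changed between consecutive iterations
    return int(np.sum(policy != policy_next))
```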
Fig. 6. Time taken by individual components of the algorithms. (a)-(c) Variants of value iteration methods: VI/BE, VI-PS/BE and MFPT-VI/BE represent the time spent computing the Bellman equation for VI, VI-PS and MFPT-VI respectively, whereas MFPT-VI/FPT is the time spent computing MFPT values. (d)-(f) Variants of policy iteration methods: PI/PE, PI-LE/PE and MFPT-PI/PE represent the time spent in the policy evaluation step for PI, PI-LE and MFPT-PI respectively; PI/PI, PI-LE/PI and MFPT-PI/PI represent the time spent in the policy improvement step for PI, PI-LE and MFPT-PI respectively; MFPT-PI/FPT is the time spent computing MFPT values.

Time Costs of Components: We also looked into the detailed time taken by critical components of each algorithm. The VI related algorithms are analyzed in Fig. 6(a)-6(c). Specifically, for VI most of the time is used for Bellman backup operations, as shown in Fig. 6(a);
Fig. 6. Time taken by individual components of the algorithms. (a)-(c) Variants of value iteration methods: VI/BE, VI-PS/BE and MFPT-VI/BE represent the time spent computing the Bellman equation for VI, VI-PS and MFPT-VI, respectively, whereas MFPT-VI/FPT is the time spent computing the MFPT values. (d)-(f) Variants of policy iteration methods: PI/PE, PI-LE/PE and MFPT-PI/PE represent the time spent in the policy evaluation step for PI, PI-LE and MFPT-PI, respectively; PI/PI, PI-LE/PI and MFPT-PI/PI represent the time spent in the policy improvement step for PI, PI-LE and MFPT-PI, respectively; MFPT-PI/FPT is the time spent computing the MFPT values.

Time Costs of Components: We also looked into the detailed time taken by the critical components of each algorithm. The VI related algorithms are analyzed in Fig. 6(a)-6(c). For VI, most of the time is spent on Bellman backup operations, as shown in Fig. 6(a). For VI-PS, there is an additional step of sorting the states for prioritization, which is negligible compared with the cost of the Bellman backups, as illustrated in Fig. 6(b). In contrast, in our MFPT-VI algorithm a large share of the time is spent computing the MFPT values, as shown in Fig. 6(c).

Fig. 6(d)-6(f) show the results for the PI based algorithms. We recorded the time taken by the policy evaluation and policy improvement phases; in addition, for MFPT-PI, we also recorded the time used by the MFPT calculation. The results show that policy evaluation dominates the runtime of PI and PI-LE, whereas for MFPT-PI the MFPT calculation takes most of the time, as shown in Fig. 6(f). However, MFPT-PI compensates for the expensive MFPT calculation by converging in fewer iterations, leading to faster convergence overall.

MFPT Invoking Frequency: As mentioned earlier, the MFPT values are computed only every few iterations, when there is a need to characterize global features. Next, we present some analysis related to this. In Fig. 7(a), the number p on the x-axis indicates that the MFPT values are computed after every p VI iterations. The time taken by MFPT-VI initially decreases up to a certain point and then increases as p grows. We observe a similar behavior in Fig. 7(b) when we analyze the number of iterations needed to converge under varying p. This implies that (1) invoking the MFPT calculation at every iteration is not optimal, and (2) for different application scenarios with different patterns of transition matrices, a good range of p can be obtained by running a few trials, using probing mechanisms such as binary search. In our testing scenario, a p ranging from three to five is a good choice.

Fig. 7. Trends of time and iterations under differing MFPT invoking frequencies. (a) The overall runtime when invoking MFPT every p VI iterations. (b) The number of iterations when invoking MFPT every p VI iterations.
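The sketch below shows one way the invoking frequency p could be wired into a value iteration loop. The three hooks (computing the MFPT values, deriving a sweep order from them, and performing a prioritized Bellman sweep) are passed in as placeholders; their names and this overall structure are illustrative assumptions rather than the exact implementation of MFPT-VI.

```cpp
// Sketch: value iteration that refreshes the MFPT-based state ordering
// only every p iterations, since the reachability landscape changes
// slowly and is comparatively expensive to recompute.
#include <cstddef>
#include <functional>
#include <vector>

int mfptValueIteration(
    std::vector<double>& V,   // state values, updated in place
    int p,                    // MFPT invoking frequency
    double threshold,         // convergence threshold on the max value change
    const std::function<std::vector<double>()>& computeMFPT,                       // reachability landscape
    const std::function<std::vector<int>(const std::vector<double>&)>& prioritize, // MFPT values -> sweep order
    const std::function<double(std::vector<double>&, const std::vector<int>&)>& bellmanSweep) // returns max change
{
    std::vector<int> order(V.size());
    for (std::size_t s = 0; s < V.size(); ++s) order[s] = static_cast<int>(s);

    int iterations = 0;
    double delta = 0.0;
    do {
        if (iterations % p == 0)            // refresh the landscape every p iterations
            order = prioritize(computeMFPT());
        delta = bellmanSweep(V, order);     // sweep states in the prioritized order
        ++iterations;
    } while (delta > threshold);
    return iterations;                      // number of iterations to converge
}
```

With p = 1 this reduces to recomputing the landscape before every sweep, which, as Fig. 7 indicates, is not the most efficient choice.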
Fig. 8. (a) Demonstration of the simulation environment, with the agent's initial state (pink) and the goal state (red). Green blocks are obstacles; (b) A trajectory (blue) completed by the agent to reach the goal.

C. 3D Grid Setup

A demonstration of the trajectory taken by the agent in the 3D simulation environment is shown in Fig. 8. Next, we compare our proposed algorithms with their corresponding baseline algorithms in terms of practical runtime performance and the number of iterations required to converge.

We first investigate the time taken by the algorithms as the number of states changes. Fig. 9(a) compares the time taken by the VI, VI-PS and MFPT-VI algorithms. The results show that our proposed MFPT-VI is much faster than VI and VI-PS; also, as expected, VI-PS is faster than VI due to the prioritized sweeping heuristic. In Fig. 9(b), we compare the time taken by PI, PI-LE, and MFPT-PI. The results show that MFPT-PI is the fastest of the three algorithms. The outstanding performance of MFPT-VI and MFPT-PI indicates that the reachability characterization based framework is superior to state-of-the-art solutions, yet the implementation of our methods is as simple as that of the classic ones.

Fig. 9. Time comparisons between the baseline methods and our proposed algorithms, with changing numbers of states (x-axis). (a) Variants of value iteration methods. (b) Variants of policy iteration methods.

As in the 2D grid setup, we also analyze the number of iterations required by the algorithms to converge as the number of states changes. In Fig. 10(a), we compare the number of iterations taken by VI, VI-PS and MFPT-VI, respectively. Similar to the 2D grid setup, the results reveal that MFPT-VI converges the fastest compared to VI and VI-PS. Fig. 10(b) compares the iterations taken by PI, PI-LE, and MFPT-PI. Again, we can see that MFPT-PI converges the fastest among all three algorithms. Another interesting observation is that as the number of states increases, the number of iterations required to converge flattens out for both MFPT-VI and MFPT-PI. These results clearly show the power of the reachability characterization, which captures the convergence optimization feature very well and thereby requires much fewer iterations to reach optimality.

Fig. 10. Iteration to convergence between the baseline methods and our proposed algorithms, with changing numbers of states (x-axis). (a) Variants of value iteration methods. (b) Variants of policy iteration methods.

V. CONCLUSIONS

In this paper, we propose a new framework for efficiently solving MDPs. The proposed method explores the reachability of states using MFPT values, which characterize the degree of difficulty of reaching given goal states. Different from the classic VI and PI methods, the proposed MFPT-based solutions reflect a patterned landscape of states, which very well captures – and also allows us to visualize – the degree of importance of states. The reachability characterization enables us to design efficient heuristics such as value prioritization and policy approximation, and we propose two specific algorithms called MFPT-VI and MFPT-PI. We show that the implementation of the proposed methods is as simple as that of the classic methods, but our algorithms converge much faster, with less runtime and fewer iterations than state-of-the-art approaches.

REFERENCES

[1] Stuart Russell and Peter Norvig. Artificial intelligence: A modern approach. Accessed October 22, 2004 at http://aima.cs.berkeley.edu/, 2002.
[2] Olivier Sigaud and Olivier Buffet. Markov decision processes in artificial intelligence. John Wiley & Sons, 2013.
[3] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[4] Douglas J White. A survey of applications of Markov decision processes. Journal of the Operational Research Society, 44(11):1073–1096, 1993.
[5] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
[6] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990.
[7] Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement learning and dynamic programming using function approximators, volume 39. CRC Press, 2010.
[8] Martijn van Otterlo and Marco Wiering. Reinforcement learning and Markov decision processes. In Reinforcement Learning, pages 3–42. Springer, 2012.
[9] Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
[10] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ, 1987.
[11] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time.
Machine Learning, pages 103–130, 1993.
[12] David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. Advances in Neural Information Processing Systems, 1998.
[13] David Wingate and Kevin D Seppi. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 6(May):851–881, 2005.
[14] Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1):49–107, 2000.
[15] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, January 1995.
[16] Blai Bonet and Hector Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21, 2003.
[17] David Andre and Stuart J Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pages 119–125, 2002.
[18] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.
[19] John G Kemeny, Hazleton Mirkil, J Laurie Snell, and Gerald L Thompson. Finite Mathematical Structures. Prentice-Hall, 1959.
[20] David Assaf, Moshe Shaked, and J. George Shanthikumar. First-passage times with PFr densities. Journal of Applied Probability, 22(1):185–196, 1985.
[21] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.