Active Markov Information-Theoretic Path Planning for Robotic Environmental Sensing

Kian Hsiang Low
Department of Computer Science, National University of Singapore, Republic of Singapore
lowkh@comp.nus.edu.sg

John M. Dolan and Pradeep Khosla
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
jmd@cs.cmu.edu, pkk@ece.cmu.edu

ABSTRACT

Recent research in multi-robot exploration and mapping has focused on sampling environmental fields, which are typically modeled using the Gaussian process (GP). Existing information-theoretic exploration strategies for learning GP-based environmental field maps adopt the non-Markovian problem structure and consequently scale poorly with the length of history of observations. Hence, it becomes computationally impractical to use these strategies for in situ, real-time active sampling. To ease this computational burden, this paper presents a Markov-based approach to efficient information-theoretic path planning for active sampling of GP-based fields. We analyze the time complexity of solving the Markov-based path planning problem, and demonstrate analytically that it scales better than that of deriving the non-Markovian strategies with increasing length of planning horizon. For a class of exploration tasks called the transect sampling task, we provide theoretical guarantees on the active sampling performance of our Markov-based policy, from which ideal environmental field conditions and sampling task settings can be established to limit its performance degradation due to violation of the Markov assumption. Empirical evaluation on real-world temperature and plankton density field data shows that our Markov-based policy can generally achieve active sampling performance comparable to that of the widely-used non-Markovian greedy policies under less favorable realistic field conditions and task settings while enjoying significant computational gain over them.

Categories and Subject Descriptors

G.3 [Probability and Statistics]: Markov processes, stochastic processes; I.2.8 [Problem Solving, Control Methods, and Search]: Dynamic programming; I.2.9 [Robotics]: Autonomous vehicles

General Terms

Algorithms, Performance, Experimentation, Theory

Keywords

Multi-robot exploration and mapping, adaptive sampling, active learning, Gaussian process, non-myopic path planning

Cite as: Active Markov Information-Theoretic Path Planning for Robotic Environmental Sensing, Kian Hsiang Low, John M. Dolan, and Pradeep Khosla, Proc. of 10th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2011), Tumer, Yolum, Sonenberg and Stone (eds.), May 2-6, 2011, Taipei, Taiwan, pp. XXX-XXX. Copyright (c) 2011, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

1. INTRODUCTION

Research in multi-robot exploration and mapping has recently progressed from building occupancy grids [14] to sampling spatially varying environmental phenomena [6, 7], in particular, environmental fields (e.g., plankton density, pollutant concentration, temperature fields) that are characterized by continuous-valued, spatially correlated measurements (see Fig. 1). Exploration strategies for building occupancy grid maps usually operate under the assumptions of (a) discrete, (b) independent cell occupancies, which impose, respectively, the following limitations for learning environmental field maps: these strategies (a) cannot be fully informed by the continuous field measurements and (b) cannot exploit the spatial correlation structure of an environmental field for selecting observation paths. As a result, occupancy grid mapping strategies are not capable of selecting the most informative observation paths for learning an environmental field map. Furthermore, occupancy grid mapping strategies typically assume that range sensing is available.

In contrast, many in situ environmental and ecological sensing applications (e.g., monitoring of ocean phenomena, forest ecosystems, or pollution) permit only point-based sensing, thus making a high-resolution sampling of the entire field impractical in terms of resource costs (e.g., energy consumption, mission time). In practice, the resource cost constraints restrict the spatial coverage of the observation paths. Fortunately, the spatial correlation structure of an environmental field enables a map of the field (in particular, its unobserved areas) to be learned using the point-based observations taken along the resource-constrained paths. To learn this map, a commonly-used approach in spatial statistics [15] is to assume that the environmental field is realized from a probabilistic model called the Gaussian process (GP) (Section 3.2). More importantly, the GP model allows an environmental field to be formally characterized and consequently provides formal measures of mapping uncertainty (e.g., based on the mean-squared error [6] or entropy criterion [7]) for directing a robot team to explore highly uncertain areas of the field. In this paper, we focus on using the entropy criterion to measure mapping uncertainty.

How then does a robot team plan the most informative resource-constrained observation paths to minimize the mapping uncertainty of an environmental field? To address this, the work of [7] has proposed an information-theoretic multi-robot exploration strategy that selects non-myopic observation paths with maximum entropy. Interestingly, this work has established an equivalence result that the maximum-entropy paths selected by such a strategy can achieve the dual objective of minimizing the mapping uncertainty defined using the entropy criterion. When this strategy is applied to sampling a GP-based environmental field, it can be reduced to solving a non-Markovian, deterministic planning problem called the information-theoretic multi-robot adaptive sampling problem (iMASP) (Section 3). Due to the non-Markovian problem structure of iMASP, its state size grows exponentially with the length of planning horizon. To alleviate this computational difficulty, an anytime heuristic search algorithm called Learning Real-Time A* [2] is used to solve iMASP approximately. However, this algorithm does not guarantee the performance of its induced exploration policy. We have also observed through experiments that when the joint action space of the robot team is large or the planning horizon is long, it no longer produces a good policy fast enough. Even after incurring a huge amount of time and space to improve the search, its resulting policy still performs worse than the widely-used non-Markovian greedy policy, the latter of which can be derived efficiently by solving the myopic formulation of iMASP (Section 3.3).

Though the anytime and greedy algorithms provide some computational relief to solving iMASP (albeit approximately), they inherit iMASP's non-Markovian problem structure and consequently scale poorly with the length of history of observations. Hence, it becomes computationally impractical to use these non-Markovian path planning algorithms for in situ, real-time active sampling performed (a) at high resolution (e.g., due to high sensor sampling rate or large sampling region), (b) over dynamic features of interest (e.g., algal blooms, oil spills), (c) with resource cost constraints (e.g., energy consumption, mission time), or (d) in the presence of dynamically changing external forces translating the robots (e.g., ocean drift on autonomous boats), thus requiring fast replanning. For example, the deployment of autonomous underwater vehicles (AUVs) and boats for ocean sampling poses the above challenges among others [4].

To ease this computational burden, this paper proposes a principled Markov-based approach to efficient information-theoretic path planning for active sampling of GP-based environmental fields (Section 4), which we develop by assuming the Markov property in iMASP planning. To the probabilistic robotics community, such a move to achieve time efficiency is probably anticipated. However, the Markov property is often imposed without carefully considering or formally analyzing its consequence on the performance degradation while operating in non-Markovian environments. In particular, to what extent does the environmental structure affect the performance degradation due to violation of the Markov assumption? Motivated by this lack of treatment, our work in this paper is novel in demonstrating both theoretically and empirically the extent to which the degradation of active sampling performance depends on the spatial correlation structure of an environmental field. An important practical consequence is that of establishing environmental field conditions under which the Markov-based approach performs well relative to the non-Markovian iMASP-based policy while enjoying significant computational gain over it.
The specific contributions of our work include:

• analyzing the time complexity of solving the Markov-based information-theoretic path planning problem, and showing analytically that it scales better than that of deriving the non-Markovian strategies with increasing length of planning horizon (Section 4.1);

• providing theoretical guarantees on the active sampling performance of our Markov-based policy (Section 4.2) for a class of exploration tasks called the transect sampling task (Section 2), from which various ideal environmental field conditions and sampling task settings can be established to limit its performance degradation;

• empirically evaluating the active sampling performance and time efficiency of our Markov-based policy on real-world temperature and plankton density field data under less favorable realistic environmental field conditions and sampling task settings (Section 5).

2. TRANSECT SAMPLING TASK

Fig. 1 illustrates the transect sampling task introduced previously in [11, 13]. A temperature field is spatially distributed over a 25 m × 150 m transect that is discretized into a 5 × 30 grid of sampling locations comprising 30 columns, each of which has 5 sampling locations. It can be observed that the number of columns is much greater than the number of sampling locations in each column; this observed property is assumed to be consistent with every other transect. The robots are constrained to simultaneously explore forward one column at a time from the leftmost to the rightmost column of the transect such that each robot samples one location per column for a total of 30 locations. So, each robot's action space given its current location consists of moving to any of the 5 locations in the adjacent column on its right. The number of robots is assumed not to be larger than the number of sampling locations per column (a simple sketch of the resulting joint action space appears at the end of this section). We assume that an adversary chooses the starting robot locations in the leftmost column and the robots will only know them at the time of deployment; such an adversary can be the dynamically changing external forces translating the robots (e.g., ocean drift on autonomous boats) or the unknown obstacles occupying potential starting locations. The robots are allowed to end at any location in the rightmost column.

In practice, the constraint on exploring forward in a transect sampling task permits the planning of less complex observation paths that can be achieved more reliably, using less sophisticated control algorithms, and by robots with limited maneuverability (e.g., unmanned aerial vehicles, autonomous boats and AUVs [9]). For practical applications, while the robot is in transit from its current location to a distant planned waypoint [4, 13], this task can be performed to collect the most informative observations during transit. In monitoring of ocean phenomena and freshwater quality along rivers, the transect can span a plankton density or temperature field drifting at a constant rate from right to left and the autonomous boats are tasked to explore within a line perpendicular to the drift. As another example, the transect can be the bottom surface of a ship hull or other maritime structure to be inspected and mapped by AUVs.
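As a concrete illustration of the joint action space just described, the following minimal sketch (illustrative names only, not code from the paper) enumerates the C(r, k) unordered choices of rows in the next column for r sampling locations per column and k robots:

```python
from itertools import combinations

def joint_action_space(r, k):
    """All unordered choices of k distinct rows (one per robot) in the next column;
    |A| = C(r, k), e.g. C(5, 2) = 10 for the 5 x 30 transect with 2 robots."""
    return list(combinations(range(r), k))

print(len(joint_action_space(5, 2)), joint_action_space(5, 2)[:3])
```

This is the quantity |A| whose growth is discussed in the time complexity analysis of Section 4.1.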
3. NON-MARKOVIAN PATH PLANNING

3.1 Notations and Preliminaries

Let U be the domain of the environmental field representing a set of sampling locations in the transect such that each location u ∈ U yields a measurement z_u. The columns of the transect are indexed in increasing order from left to right with the leftmost column being indexed '0'. Each planning stage is associated with a column from which every robot in
the team selects and takes an observation (i.e., comprising a pair of location and its measurement). Let k denote the number of robots in the team. In each stage i, the team of k robots then collects from column i a total of k observations, which are denoted by a pair of vectors: x_i of k locations and z_{x_i} of the corresponding measurements. Let x_{0:i} and z_{x_{0:i}} denote vectors comprising the histories of the robots' sampling locations and corresponding measurements over stages 0 to i (i.e., concatenations of x_0, x_1, ..., x_i and z_{x_0}, z_{x_1}, ..., z_{x_i}), respectively. Let Z_u, Z_{x_i}, and Z_{x_{0:i}} be random measurements that are associated with the realizations z_u, z_{x_i}, and z_{x_{0:i}}, respectively.

Figure 1: Transect sampling task on a temperature field (measured in °C) spatially distributed over a 25 m × 150 m transect that is discretized into a 5 × 30 grid of sampling locations (white dots).

3.2 Gaussian Process-Based Environmental Field

The GP model can be used to formally characterize an environmental field as follows: the environmental field is defined to vary as a realization of a GP. Let {Z_u}_{u∈U} denote a GP, i.e., every finite subset of {Z_u}_{u∈U} has a multivariate Gaussian distribution [8]. The GP is fully specified by its mean μ_u ≜ E[Z_u] and covariance σ_{uv} ≜ cov[Z_u, Z_v] for all u, v ∈ U. We assume that the GP is second-order stationary, i.e., it has a constant mean and a stationary covariance structure (i.e., σ_{uv} is a function of u − v for all u, v ∈ U). In particular, its covariance structure is defined by the widely-used squared exponential covariance function [8]

\[
\sigma_{uv} \triangleq \sigma_s^2 \exp\Big\{-\tfrac{1}{2}(u-v)^\top M^{-2}(u-v)\Big\} + \sigma_n^2\,\delta_{uv} \qquad (1)
\]

where σ_s² is the signal variance, σ_n² is the noise variance, M is a diagonal matrix with length-scale components ℓ1 and ℓ2 in the horizontal and vertical directions of a transect, respectively, and δ_{uv} is a Kronecker delta of value 1 if u = v, and 0 otherwise. Intuitively, the signal and noise variances describe, respectively, the intensity and noise of the field measurements while the length-scale can be interpreted as the approximate distance to be traversed in a transect for the field measurement to change considerably [8]; it therefore controls the degree of spatial correlation or "similarity" between field measurements. In this paper, the mean and covariance structure of the GP are assumed to be known.

Given that the robot team has collected observations x_0, z_{x_0}, x_1, z_{x_1}, ..., x_i, z_{x_i} over stages 0 to i, the distribution of Z_u remains Gaussian with the following posterior mean and covariance

\[
\mu_{u|x_{0:i}} = \mu_u + \Sigma_{u x_{0:i}}\,\Sigma_{x_{0:i} x_{0:i}}^{-1}\,\{z_{x_{0:i}} - \mu_{x_{0:i}}\}^\top \qquad (2)
\]
\[
\sigma_{uv|x_{0:i}} = \sigma_{uv} - \Sigma_{u x_{0:i}}\,\Sigma_{x_{0:i} x_{0:i}}^{-1}\,\Sigma_{x_{0:i} v} \qquad (3)
\]

where μ_{x_{0:i}} is a row vector with mean components μ_w for every location w of x_{0:i}, Σ_{u x_{0:i}} is a row vector with covariance components σ_{uw} for every location w of x_{0:i}, Σ_{x_{0:i} v} is a column vector with covariance components σ_{wv} for every location w of x_{0:i}, and Σ_{x_{0:i} x_{0:i}} is a covariance matrix with components σ_{wy} for every pair of locations w, y of x_{0:i}.
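To make (1)-(3) concrete, the following sketch (Python with NumPy; all names and the toy inputs are illustrative, not data or code from the paper) evaluates the squared exponential covariance and the posterior mean and variance at a single unobserved location:

```python
import numpy as np

def sq_exp_cov(U, V, signal_var, noise_var, length_scales):
    """Squared exponential covariance (1) between two sets of 2-D locations."""
    U, V = np.atleast_2d(U).astype(float), np.atleast_2d(V).astype(float)
    d = (U[:, None, :] - V[None, :, :]) / np.asarray(length_scales, dtype=float)
    K = signal_var * np.exp(-0.5 * np.sum(d ** 2, axis=2))
    same = np.all(np.isclose(U[:, None, :], V[None, :, :]), axis=2)   # Kronecker delta
    return K + noise_var * same

def gp_posterior(u, X, z, mu, signal_var, noise_var, length_scales):
    """Posterior mean (2) and variance (3) at location u given observations (X, z)."""
    K_uX = sq_exp_cov(u, X, signal_var, noise_var, length_scales)     # Sigma_{u x_{0:i}}
    K_XX = sq_exp_cov(X, X, signal_var, noise_var, length_scales)     # Sigma_{x_{0:i} x_{0:i}}
    K_uu = sq_exp_cov(u, u, signal_var, noise_var, length_scales)
    w = np.linalg.solve(K_XX, np.asarray(z, dtype=float) - mu)        # Sigma^{-1} (z - mu)
    post_mean = mu + (K_uX @ w).item()
    post_var = (K_uu - K_uX @ np.linalg.solve(K_XX, K_uX.T)).item()
    return post_mean, post_var

# Toy usage: three observed locations (in metres) and one unobserved location.
X = [[0.0, 0.0], [0.0, 5.0], [5.0, 0.0]]
z = [20.1, 20.4, 19.8]
print(gp_posterior([5.0, 5.0], X, z, mu=20.0,
                   signal_var=0.15, noise_var=0.004, length_scales=(40.0, 16.0)))
```

Note that the returned posterior variance depends on the observed locations X but not on the measurements z, which is the GP property exploited in the next paragraph.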
Note that the posterior mean μ_{u|x_{0:i}} (2) is the best unbiased predictor of the measurement z_u at an unobserved location u. An important property of the GP is that the posterior covariance σ_{uv|x_{0:i}} (3) is independent of the observed measurements z_{x_{0:i}}; this property is used to reduce iMASP to a deterministic planning problem as shown later.

3.3 Deterministic iMASP Planning

Supposing the robot team starts in locations x_0 of the leftmost column 0, an exploration policy is responsible for directing it to sample locations x_1, x_2, ..., x_{t+1} of the respective columns 1, 2, ..., t + 1 to form the observation paths. Formally, a non-Markovian policy is denoted by π ≜ ⟨π_0(x_{0:0} = x_0), π_1(x_{0:1}), ..., π_t(x_{0:t})⟩ where π_i(x_{0:i}) maps the history x_{0:i} of the robots' sampling locations to a vector a_i ∈ A(x_i) of the robots' actions in stage i (i.e., a_i ← π_i(x_{0:i})), and A(x_i) is the joint action space of the robots given their current locations x_i. We assume that the transition function τ(x_i, a_i) deterministically (i.e., no localization uncertainty) moves the robots to their next locations x_{i+1} in stage i + 1 (i.e., x_{i+1} ← τ(x_i, a_i)). Putting π_i and τ together yields the assignment x_{i+1} ← τ(x_i, π_i(x_{0:i})).

The work of [7] has proposed a non-Markovian policy π* that selects non-myopic observation paths with maximum entropy for sampling a GP-based field. To know how π* is derived, we first define the value under a policy π to be the entropy of the observation paths when starting in x_0 and following π thereafter:

\[
V^{\pi}_0(x_0) \triangleq H[Z_{x_{1:t+1}} \mid Z_{x_0}, \pi]
= -\int f(z_{x_{0:t+1}} \mid \pi)\,\log f(z_{x_{1:t+1}} \mid z_{x_0}, \pi)\; dz_{x_{0:t+1}} \qquad (4)
\]

where f denotes a Gaussian probability density function. When a non-Markovian policy π is plugged into (4), the following (t + 1)-stage recursive formulation results from the chain rule for entropy and x_{i+1} ← τ(x_i, π_i(x_{0:i})):

\[
\begin{aligned}
V^{\pi}_i(x_{0:i}) &= H[Z_{x_{i+1}} \mid Z_{x_{0:i}}, \pi_i] + V^{\pi}_{i+1}(x_{0:i+1})
= H[Z_{\tau(x_i,\pi_i(x_{0:i}))} \mid Z_{x_{0:i}}] + V^{\pi}_{i+1}\big((x_{0:i}, \tau(x_i,\pi_i(x_{0:i})))\big)\\
V^{\pi}_t(x_{0:t}) &= H[Z_{x_{t+1}} \mid Z_{x_{0:t}}, \pi_t]
= H[Z_{\tau(x_t,\pi_t(x_{0:t}))} \mid Z_{x_{0:t}}]
\end{aligned} \qquad (5)
\]

for stage i = 0, ..., t − 1 such that each stagewise posterior entropy (i.e., of the measurements Z_{x_{i+1}} to be observed in stage i + 1 given the history of measurements Z_{x_{0:i}} observed from stages 0 to i) reduces to

\[
H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] = \frac{1}{2}\log\big((2\pi e)^k\,|\Sigma_{x_{i+1}|x_{0:i}}|\big) \qquad (6)
\]

where Σ_{x_{i+1}|x_{0:i}} is a covariance matrix with components σ_{uv|x_{0:i}} for every pair of locations u, v of x_{i+1}, each of which is independent of the observed measurements z_{x_{0:i}} by (3), as discussed above. So, H[Z_{x_{i+1}} | Z_{x_{0:i}}] can be evaluated in closed form, and the value functions (5) only require the history of the robots' sampling locations x_{0:i} as inputs but not that of the corresponding measurements z_{x_{0:i}}. Solving iMASP involves choosing π to maximize V^π_0(x_0) (5), which yields the optimal policy π*.
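A minimal sketch of evaluating the stagewise posterior entropy (6), assuming the posterior covariance matrix of the next-stage measurements has already been built entrywise from (3) (e.g., with the GP sketch above); names are illustrative:

```python
import numpy as np

def stagewise_entropy(post_cov):
    """Gaussian entropy (6): H = 0.5 * log((2*pi*e)^k |Sigma|), via Cholesky.
    'post_cov' is the k x k posterior covariance of the next-stage measurements;
    it depends on the sampled locations only, never on the measurements z."""
    k = post_cov.shape[0]
    L = np.linalg.cholesky(post_cov)                 # fails if Sigma is not positive definite
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (k * np.log(2.0 * np.pi * np.e) + log_det)

# Toy usage: entropy of two next-stage measurements with mild correlation.
print(stagewise_entropy(np.array([[0.15, 0.05], [0.05, 0.15]])))
```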
Plugging π* into (5) gives the (t + 1)-stage dynamic programming equations:

\[
\begin{aligned}
V^{\pi^*}_i(x_{0:i}) &= \max_{a_i \in A(x_i)} H[Z_{\tau(x_i,a_i)} \mid Z_{x_{0:i}}] + V^{\pi^*}_{i+1}\big((x_{0:i}, \tau(x_i,a_i))\big)\\
V^{\pi^*}_t(x_{0:t}) &= \max_{a_t \in A(x_t)} H[Z_{\tau(x_t,a_t)} \mid Z_{x_{0:t}}]
\end{aligned} \qquad (7)
\]
for stage i = 0, ..., t − 1. Since each stagewise posterior entropy H[Z_{τ(x_i,a_i)} | Z_{x_{0:i}}] (6) can be evaluated in closed form as explained above, iMASP for sampling the GP-based field (7) reduces to a deterministic planning problem. Furthermore, it turns out to be the well-known maximum entropy sampling problem [10] as demonstrated in [7]. Policy π* = ⟨π*_0(x_{0:0}), ..., π*_t(x_{0:t})⟩ can be determined by

\[
\begin{aligned}
\pi^*_i(x_{0:i}) &= \operatorname*{arg\,max}_{a_i \in A(x_i)} H[Z_{\tau(x_i,a_i)} \mid Z_{x_{0:i}}] + V^{\pi^*}_{i+1}\big((x_{0:i}, \tau(x_i,a_i))\big)\\
\pi^*_t(x_{0:t}) &= \operatorname*{arg\,max}_{a_t \in A(x_t)} H[Z_{\tau(x_t,a_t)} \mid Z_{x_{0:t}}]
\end{aligned} \qquad (8)
\]

for stage i = 0, ..., t − 1. Similar to the optimal value functions (7), π* only requires the history of the robots' sampling locations as inputs. So, π* can generate the maximum-entropy paths prior to exploration.

Solving the myopic formulation of iMASP (7) is often considered to ease computation (Section 4.1), which entails deriving the non-Markovian greedy policy π^G = ⟨π^G_0(x_{0:0}), ..., π^G_t(x_{0:t})⟩ where, for stage i = 0, ..., t,

\[
\pi^G_i(x_{0:i}) = \operatorname*{arg\,max}_{a_i \in A(x_i)} H[Z_{\tau(x_i,a_i)} \mid Z_{x_{0:i}}]\,. \qquad (9)
\]

The work of [3] has proposed a non-Markovian greedy policy π^M = ⟨π^M_0(x_{0:0}), ..., π^M_t(x_{0:t})⟩ to approximately maximize the closely related mutual information criterion:

\[
\pi^M_i(x_{0:i}) = \operatorname*{arg\,max}_{a_i \in A(x_i)} H[Z_{\tau(x_i,a_i)} \mid Z_{x_{0:i}}] - H[Z_{\tau(x_i,a_i)} \mid Z_{\overline{x_{0:i+1}}}] \qquad (10)
\]

for stage i = 0, ..., t where \overline{x_{0:i+1}} denotes the vector comprising locations of domain U not found in (x_{0:i}, τ(x_i, a_i)). It is shown in [3] that π^M greedily selects new sampling locations that maximize the increase in mutual information. As noted in [7], this strategy is deficient in that it may not necessarily minimize the mapping uncertainty defined using the entropy criterion. More importantly, it suffers a huge computational drawback: the time needed to derive π^M depends on the map resolution (i.e., |U|) (Section 4.1).

4. MARKOV-BASED PATH PLANNING

The Markov property assumes that the measurements Z_{x_{i+1}} to be observed next in stage i + 1 depend only on the current measurements Z_{x_i} observed in stage i and are conditionally independent of the past measurements Z_{x_{0:i−1}} observed from stages 0 to i − 1. That is, f(z_{x_{i+1}} | z_{x_{0:i}}) = f(z_{x_{i+1}} | z_{x_i}) for all z_{x_0}, z_{x_1}, ..., z_{x_{i+1}}. As a result, H[Z_{x_{i+1}} | Z_{x_{0:i}}] (6) can be approximated by H[Z_{x_{i+1}} | Z_{x_i}]. It is therefore straightforward to impose the Markov assumption on iMASP (7), which yields the following dynamic programming equations for the Markov-based path planning problem:

\[
\begin{aligned}
\tilde V_i(x_i) &= \max_{a_i \in A(x_i)} H[Z_{\tau(x_i,a_i)} \mid Z_{x_i}] + \tilde V_{i+1}(\tau(x_i,a_i))\\
\tilde V_t(x_t) &= \max_{a_t \in A(x_t)} H[Z_{\tau(x_t,a_t)} \mid Z_{x_t}]
\end{aligned} \qquad (11)
\]

for stage i = 0, ..., t − 1. Consequently, the Markov-based policy π̃ = ⟨π̃_0(x_0), ..., π̃_t(x_t)⟩ can be determined by

\[
\begin{aligned}
\tilde\pi_i(x_i) &= \operatorname*{arg\,max}_{a_i \in A(x_i)} H[Z_{\tau(x_i,a_i)} \mid Z_{x_i}] + \tilde V_{i+1}(\tau(x_i,a_i))\\
\tilde\pi_t(x_t) &= \operatorname*{arg\,max}_{a_t \in A(x_t)} H[Z_{\tau(x_t,a_t)} \mid Z_{x_t}]\,.
\end{aligned} \qquad (12)
\]
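As a rough illustration of how (11) and (12) can be solved by backward induction over the columns of a transect, the sketch below assumes the stagewise entropies H[Z_{τ(x_i,a_i)} | Z_{x_i}] have been precomputed into a table indexed by the current and next row configurations (identical for every column under the Markov assumption); all names are illustrative and not from the paper:

```python
import numpy as np

def markov_backward_induction(H, t):
    """Backward induction for (11)-(12) on a transect.
    H[c, a]: stagewise posterior entropy when the robots currently occupy
             row-configuration c and action a selects the row-configuration of
             the next column; one |A| x |A| table suffices for all columns.
    t + 1  : number of planning stages.
    Returns value tables V[i, c] and policy tables pi[i, c] (indices into A)."""
    nA = H.shape[0]
    V = np.zeros((t + 1, nA))
    pi = np.zeros((t + 1, nA), dtype=int)
    pi[t] = np.argmax(H, axis=1)                  # last stage: no future value
    V[t] = H[np.arange(nA), pi[t]]
    for i in range(t - 1, -1, -1):                # stages t-1, ..., 0
        Q = H + V[i + 1][None, :]                 # Q[c, a] = H[c, a] + V_{i+1}(a)
        pi[i] = np.argmax(Q, axis=1)
        V[i] = Q[np.arange(nA), pi[i]]
    return V, pi
```

The loop touches each of the |A|² entries of Q once per stage, which corresponds to the O(|A|² t) value-propagation term in Theorem 1 below; the O(|A|² k⁴) term would come from filling the entropy table once.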
4.1 Time Complexity: Analysis & Comparison

Theorem 1. Let A ≜ A(x_0) = ... = A(x_t). Deriving the Markov-based policy π̃ (12) for the transect sampling task requires O(|A|²(t + k⁴)) time.

Note that |A| = C(r, k) = O(r^k) where r is the number of sampling locations per column and k ≤ r as assumed in Section 2. Though |A| is exponential in the number k of robots, r is expected to be small in a transect, which prevents |A| from growing too large.

In contrast, deriving the iMASP-based policy π* (8) requires O(|A|^t t² k⁴) time. Deriving the greedy policies π^G (9) and π^M (10) incurs, respectively, O(|A| t⁴ k³ + |A|² t k⁴) and O(|A| t |U|³ + |A|² t k⁴) = O(|A| t⁴ r³ + |A|² t k⁴) time to compute the observation paths over all |A| possible choices of starting robot locations. Clearly, all the non-Markovian strategies do not scale as well as our Markov-based approach with increasing length t + 1 of planning horizon or number t + 2 of columns, which is expected to be large. As demonstrated empirically (Section 5), the Markov-based policy π̃ can be derived faster than π^G and π^M by more than an order of magnitude; this computational advantage is boosted further for transect sampling tasks with unknown starting robot locations.

4.2 Performance Guarantees

We will first provide a theoretical guarantee on how the Markov-based policy π̃ (12) performs relative to the non-Markovian iMASP-based policy π* (8) for the case of 1 robot. This key result follows from our intuition that when the horizontal spatial correlation becomes small, exploiting the past measurements for path planning should hardly improve the active sampling performance in a transect sampling task, thus favoring the Markov-based policy. Though this intuition is simple, supporting it with formal theoretical results and their corresponding proofs (Appendix A) turns out to be non-trivial as shown below.

Recall the Markov assumption that H[Z_{x_{i+1}} | Z_{x_{0:i}}] (6) is to be approximated by H[Z_{x_{i+1}} | Z_{x_i}]. This prompts us to first consider bounding the difference of these posterior entropies that ensues from the Markov property:

\[
H[Z_{x_{i+1}} \mid Z_{x_i}] - H[Z_{x_{i+1}} \mid Z_{x_{0:i}}]
= \frac{1}{2}\log\frac{\sigma^2_{x_{i+1}|x_i}}{\sigma^2_{x_{i+1}|x_{0:i}}}
= \frac{1}{2}\log\left(1 - \frac{\sigma^2_{x_{i+1}|x_i} - \sigma^2_{x_{i+1}|x_{0:i}}}{\sigma^2_{x_{i+1}|x_i}}\right)^{-1} \geq 0\,. \qquad (13)
\]

This difference can be interpreted as the reduction in uncertainty of the measurements Z_{x_{i+1}} to be observed next in stage i + 1 by observing the past measurements Z_{x_{0:i−1}} from stages 0 to i − 1 given the current measurements Z_{x_i} observed in stage i. This difference is small if Z_{x_{0:i−1}} does not contribute much to the reduction in uncertainty of Z_{x_{i+1}} given Z_{x_i}. The quantity (13) is known as the conditional mutual information of Z_{x_{i+1}} and Z_{x_{0:i−1}} given Z_{x_i}, denoted by

\[
I[Z_{x_{i+1}}; Z_{x_{0:i-1}} \mid Z_{x_i}] \triangleq H[Z_{x_{i+1}} \mid Z_{x_i}] - H[Z_{x_{i+1}} \mid Z_{x_{0:i}}]\,,
\]

which is of value 0 if the Markov property holds.

The results to follow assume that the transect is discretized into a grid of sampling locations. Let ω1 and ω2 denote the horizontal and vertical grid discretization widths (i.e., separations between adjacent sampling locations), respectively. Let ℓ'1 ≜ ℓ1/ω1 and ℓ'2 ≜ ℓ2/ω2 represent the normalized horizontal and vertical length-scale components, respectively.
The following lemma bounds the variance reduction term σ²_{x_{i+1}|x_i} − σ²_{x_{i+1}|x_{0:i}} in (13):

Lemma 2. Let ξ ≜ exp{−1/(2ℓ'1²)} and ρ ≜ 1 + σ_n²/σ_s². If ξ < ρ/i, then

\[
0 \;\leq\; \sigma^2_{x_{i+1}|x_i} - \sigma^2_{x_{i+1}|x_{0:i}} \;\leq\; \frac{\sigma_s^2\,\xi^4}{\rho/i - \xi}\,.
\]

The next lemma is fundamental to the subsequent results on the active sampling performance of the Markov-based policy π̃. It provides bounds on I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}], which follow immediately from (13), Lemma 2, and the lower bound σ²_{x_{i+1}|x_i} = σ²_{x_{i+1}} − (σ_{x_{i+1}x_i})²/σ²_{x_i} ≥ σ_s² + σ_n² − σ_s²ξ²:

Lemma 3. If ξ < ρ/i, then 0 ≤ I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}] ≤ ∆(i) where

\[
\Delta(i) \triangleq \frac{1}{2}\log\left(1 - \frac{\xi^4}{(\rho/i - \xi)(\rho - \xi^2)}\right)^{-1}.
\]

Remark. If j ≤ s, then ∆(j) ≤ ∆(s) for j, s = 0, ..., t.

From Lemma 3, since ∆(i) bounds I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}] from above, a small I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}] can be guaranteed by making ∆(i) small. From the definition of ∆(i), there are a few ways to achieve a small ∆(i): (a) ∆(i) depends on ℓ'1 through ξ. As ℓ'1 → 0⁺, ξ → 0⁺, by definition. Consequently, ∆(i) → 0⁺. A small ℓ'1 can be obtained using a small ℓ1 and/or a large ω1, by definition; (b) ∆(i) also depends on the noise-to-signal ratio σ_n²/σ_s² through ρ. Raising σ_n² or lowering σ_s² increases ρ, by definition. This, in turn, decreases ∆(i); (c) Since i indicates the length of history of observations, the remark after Lemma 3 tells us that a shorter length produces a smaller ∆(i). To summarize, (a) environmental field conditions such as smaller horizontal spatial correlation and noisy, less intense fields, and (b) sampling task settings such as larger horizontal grid discretization width and shorter length of history of observations all contribute to a smaller ∆(i), and hence a smaller I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}]. This analysis is important for understanding the practical implication of our theoretical results later. A limitation of Lemma 3 is its sufficient condition ξ < ρ/i, which will hold if the field conditions and task settings described above to make ∆(i) small are adequately satisfied.

The following theorem uses the induced optimal value Ṽ_0(x_0) from solving the Markov-based path planning problem (11) to bound the maximum entropy V^{π*}_0(x_0) of observation paths achieved by π* from solving iMASP (7):

Theorem 4. Let ε_i ≜ Σ_{s=i}^t ∆(s) ≤ (t − i + 1)∆(t). If ξ < ρ/t, then Ṽ_i(x_i) − ε_i ≤ V^{π*}_i(x_{0:i}) ≤ Ṽ_i(x_i) for i = 0, ..., t.

The above result is useful in providing an efficient way of knowing the maximum entropy V^{π*}_0(x_0), albeit approximately: the time needed to derive the two-sided bounds on V^{π*}_0(x_0) is linear in the length of planning horizon (Theorem 1) as opposed to the exponential time required to compute the exact value of V^{π*}_0(x_0).
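As a worked illustration of Lemma 3 (a sketch only; parameter names are ours), the bound ∆(i) and its sufficient condition can be evaluated directly from the field hyperparameters and the horizontal grid discretization width:

```python
import math

def delta_bound(i, l1, w1, signal_var, noise_var):
    """Delta(i) of Lemma 3, an upper bound on I[Z_{x_{i+1}}; Z_{x_{0:i-1}} | Z_{x_i}]."""
    if i == 0:
        return 0.0                                    # empty history: nothing to discard
    xi = math.exp(-1.0 / (2.0 * (l1 / w1) ** 2))      # xi = exp(-1 / (2 l1'^2))
    rho = 1.0 + noise_var / signal_var                # rho = 1 + sigma_n^2 / sigma_s^2
    if not xi < rho / i:
        raise ValueError("sufficient condition xi < rho/i of Lemma 3 is violated")
    return -0.5 * math.log(1.0 - xi ** 4 / ((rho / i - xi) * (rho - xi ** 2)))
```

For instance, with the temperature-field hyperparameters reported in Section 5 and ω1 = 5 m, ξ ≈ 0.99 and ρ ≈ 1.02, so the condition ξ < ρ/i already fails for i ≥ 2; this is one sense in which the real fields used in the experiments are "less favorable" settings.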
Since the error bound ε_i is defined as a sum of ∆(s)'s, we can rely on the above analysis of ∆(s) (see the paragraph after Lemma 3) to improve this error bound: (a) environmental field conditions such as smaller horizontal spatial correlation and noisy, less intense fields, and (b) sampling task settings such as larger horizontal grid discretization width and shorter planning horizon (i.e., fewer transect columns) all improve this error bound.

In the main result below, the Markov-based policy π̃ is guaranteed to achieve an entropy V^{π̃}_0(x_0) of observation paths (i.e., by plugging π̃ into (5)) that is not more than ε_0 from the maximum entropy V^{π*}_0(x_0) of observation paths achieved by policy π*:

Theorem 5. If ξ < ρ/t, then policy π̃ is ε_0-optimal in achieving the maximum-entropy criterion, i.e., V^{π*}_0(x_0) − V^{π̃}_0(x_0) ≤ ε_0.

Again, since the error bound ε_0 is defined as a sum of ∆(s)'s, we can use the above analysis of ∆(s) to improve this bound: (a) environmental field conditions such as smaller horizontal spatial correlation and noisy, less intense fields, and (b) sampling task settings such as larger horizontal grid discretization width and shorter planning horizon (i.e., fewer transect columns) all result in a smaller ε_0, and hence improve the active sampling performance of the Markov-based policy π̃ relative to that of the non-Markovian iMASP-based policy π*. This not only supports our prior intuition (see the first paragraph of this section) but also identifies other means of limiting the performance degradation of the Markov-based policy.

For the multi-robot case, a condition has to be imposed on the covariance structure of the GP to obtain a similar guarantee:

\[
|\sigma_{uv|x_{0:i}}| \;\leq\; |\sigma_{uv|x_m}| \qquad (14)
\]

for m = 0, ..., i and any u, v, x_0, x_1, ..., x_i ∈ U. Intuitively, (14) says that further conditioning does not make Z_u and Z_v more correlated. Note that (14) is satisfied if u = v. Similar to Lemma 3 for the 1-robot case, we can bound I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}] for the multi-robot case, but tighter conditions have to be satisfied:

Lemma 6. Let ℓ'1 = ℓ'2. If ξ < min(ρ/(ik), ρ/(4k)) and (14) is satisfied, then 0 ≤ I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}] ≤ ∆_k(i) where

\[
\Delta_k(i) \triangleq \frac{k}{2}\log\left(1 - \frac{\xi^4}{\big(\rho/(ik) - \xi\big)\big(\rho - (4k/\rho)\,\xi^2\big)}\right)^{-1}.
\]

To improve the upper bound ∆_k(i), the above analysis of ∆(i) can be applied here as these two upper bounds are largely similar: (a) environmental field conditions such as smaller spatial correlation and noisy, less intense fields, and (b) sampling task settings such as larger grid discretization width and shorter planning horizon (i.e., fewer transect columns) all entail a smaller ∆_k(i). Decreasing the number k of robots also reduces ∆_k(i), thus yielding tighter bounds on I[Z_{x_{i+1}}; Z_{x_{0:i−1}} | Z_{x_i}]. Using Lemma 6, we can derive guarantees similar to those of Theorems 4 and 5 on the performance of the Markov-based policy π̃ for the multi-robot case.
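Continuing the sketch after Lemma 3 (delta_bound is the assumed helper defined there, not code from the paper), the error bound of Theorems 4 and 5 is simply a tail sum, so checking how tight the guarantee is for a given transect costs only O(t) evaluations of ∆:

```python
def epsilon_bound(i, t, l1, w1, signal_var, noise_var):
    """epsilon_i = sum_{s=i}^{t} Delta(s): suboptimality bound of Theorems 4 and 5
    (1-robot case; raises if the sufficient condition xi < rho/t is violated)."""
    return sum(delta_bound(s, l1, w1, signal_var, noise_var) for s in range(i, t + 1))
```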
5. EXPERIMENTS AND DISCUSSION

In Section 4.2, we have highlighted the practical implication of our main theoretical result (i.e., Theorem 5), which establishes various environmental field conditions and sampling task settings to limit the performance degradation of the Markov-based policy π̃. This result, however, does not reveal whether π̃ performs well (or not) under "seemingly" less favorable field conditions and task settings that do not jointly satisfy its sufficient condition ξ < ρ/(tk). These include large spatial correlation, less noisy, highly intense fields, small grid discretization width, long planning horizon (i.e., many transect columns), and a large number of robots. So, this section evaluates the active sampling performance and time efficiency of π̃ empirically on two real-world datasets under such field conditions and task settings as detailed below: (a) May 2009 temperature field data of Panther Hollow Lake in Pittsburgh, PA spanning 25 m by 150 m, and (b) June 2009 plankton density field data of Chesapeake Bay spanning 314 m by 1765 m. Using maximum likelihood estimation (MLE) [8], the learned hyperparameters (i.e., horizontal and vertical length-scales, signal and noise variances) are, respectively, ℓ1 = 40.45 m, ℓ2 = 16.00 m, σ_s² = 0.1542, and σ_n² = 0.0036 for the temperature field, and ℓ1 = 27.53 m, ℓ2 = 134.64 m, σ_s² = 2.152, and σ_n² = 0.041 for the plankton density field. It can be observed that the temperature and plankton density fields have low noise-to-signal ratios σ_n²/σ_s² of 0.023 and 0.019, respectively. Relative to the size of the transect, both fields have large vertical spatial correlations, but only the temperature field has large horizontal spatial correlation.

The performance of the Markov-based policy π̃ is compared to non-Markovian policies produced by two state-of-the-art information-theoretic exploration strategies: greedy policies π^G (9) and π^M (10) proposed by [7] and [3], respectively. The non-Markovian policy π* that has to be derived approximately using Learning Real-Time A* is excluded from comparison due to the reason provided in Section 1.

5.1 Performance Metrics

The tested policies are evaluated using the two metrics proposed in [7], which quantify the mapping uncertainty of the unobserved areas of the field differently: (a) The ENT(π) metric measures the posterior joint entropy H[Z_{\overline{x_{0:t+1}}} | Z_{x_{0:t+1}}] of field measurements Z_{\overline{x_{0:t+1}}} at unobserved locations \overline{x_{0:t+1}}, where \overline{x_{0:t+1}} denotes the vector comprising locations of domain U not found in the sampled locations x_{0:t+1} selected by policy π. A smaller ENT(π) implies lower mapping uncertainty; (b) The ERR(π) metric measures the mean-squared relative error

\[
\mathrm{ERR}(\pi) = |U|^{-1} \sum_{u \in U} \big\{(z_u - \mu_{u|x_{0:t+1}})/\bar{\mu}\big\}^2
\]

resulting from using the observations (i.e., sampled locations x_{0:t+1} and corresponding measurements z_{x_{0:t+1}}) selected by policy π and the posterior mean μ_{u|x_{0:t+1}} (2) to predict the field, where μ̄ = |U|⁻¹ Σ_{u∈U} z_u. A smaller ERR(π) implies higher prediction accuracy. Two noteworthy differences distinguish these metrics: (a) The ENT(π) metric exploits the spatial correlation between field measurements in the unobserved areas whereas the ERR(π) metric implicitly assumes independence between them. As a result, unlike the ERR(π) metric, the ENT(π) metric does not overestimate the mapping uncertainty. To illustrate this, suppose the unknown field measurements are restricted to only two unobserved locations u and v residing in a highly uncertain area and they are highly correlated due to spatial proximity. The behavior of the ENT(π) metric can be understood upon applying the chain rule for entropy (i.e., ENT(π) = H[Z_u, Z_v | Z_{x_{0:t+1}}] = H[Z_u | Z_{x_{0:t+1}}] + H[Z_v | Z_{x_{0:t+1}}, Z_u]); the latter uncertainty term (i.e., the posterior entropy of Z_v) is significantly reduced or "discounted" due to the high spatial correlation between Z_u and Z_v. Hence, the mapping uncertainty of these two unobserved locations is not overestimated. A practical advantage of this metric is that it does not overcommit sensing resources; in the simple illustration above, a single observation at either location u or v suffices to learn both field measurements well. On the other hand, the ERR(π) metric considers each location to be of high uncertainty due to the independence assumption; (b) In contrast to the ENT(π) metric, the ERR(π) metric can use ground truth measurements to evaluate if the field is being mapped accurately.

Let ENTD(π) ≜ ENT(π̃) − ENT(π) and ERRD(π) ≜ ERR(π̃) − ERR(π). Decreasing ENTD(π) improves the ENT(π̃) performance of π̃ relative to that of π. A small |ENTD(π)| implies that π̃ achieves ENT(π̃) performance comparable to that of π. ERRD(π) can be interpreted likewise. Additionally, we will consider the time taken to derive each policy as the third metric.

5.2 Temperature Field Data

We will first investigate how varying spatial correlations (i.e., varying length-scales) of the temperature field affect the ENT(π) and ERR(π) performance of the evaluated policies. The temperature field is discretized into a 5 × 30 grid of sampling locations as shown in Figs. 1 and 2d. The horizontal and/or vertical length-scales of the original field (i.e., field 4 in Fig. 2d) are reduced to produce modified fields 1, 2, and 3 (respectively, Figs. 2a, 2b, and 2c); we fix these reduced length-scales while learning the remaining hyperparameters (i.e., signal and noise variances) through MLE.

Figure 2: Temperature fields (measured in °C) with varying horizontal length-scale ℓ1 and vertical length-scale ℓ2: (a) Field 1: ℓ1 = 5.00 m, ℓ2 = 5.00 m; (b) Field 2: ℓ1 = 5.00 m, ℓ2 = 16.00 m; (c) Field 3: ℓ1 = 40.45 m, ℓ2 = 5.00 m; (d) Field 4: ℓ1 = 40.45 m, ℓ2 = 16.00 m.

Table 1 shows the results of mean ENT(π) and ERR(π) performance of the tested policies (i.e., averaged over all possible starting robot locations) with varying length-scales and number of robots. The ENT(π) and ERR(π) for all policies generally decrease with increasing length-scales (except ERR(π̃) for 1 robot from field 2 to 4) due to increasing spatial correlation between measurements, thus resulting in lower mapping uncertainty.

Table 1: Comparison of ENT(π) and ERR(π) (×10⁻⁵) performance for temperature fields that are discretized into 5 × 30 grids (Fig. 2).

  ENT(π)
  Policy          Field 1   Field 2   Field 3   Field 4
  1 robot   π̃       -83      -246      -543      -597
            π^G     -82      -246      -554      -598
            π^M     -80      -211      -554      -596
  2 robots  π̃       -71      -190      -380      -422
            π^G     -72      -190      -382      -425
            π^M     -68      -131      -382      -421
  3 robots  π̃       -53      -109      -232      -297
            π^G     -53      -109      -215      -297
            π^M     -53       -73      -214      -255

  ERR(π) (×10⁻⁵)
  Policy          Field 1   Field 2   Field 3   Field 4
  1 robot   π̃      3.7040    0.5713    2.3680    0.5754
            π^G    1.8680    0.5713    0.0801    0.0252
            π^M    1.8433    0.5212    0.0701    0.0421
  2 robots  π̃      0.3797    0.2101    0.1171    0.0095
            π^G    0.3526    0.2101    0.0150    0.0087
            π^M    0.6714    0.1632    0.0148    0.0086
  3 robots  π̃      0.1328    0.0068    0.0063    0.0031
            π^G    0.1312    0.0068    0.0059    0.0031
            π^M    0.1080    0.1397    0.0055    0.0030

For the case of 1 robot, the observations are as follows: (a) When ℓ2 is kept constant (i.e., at 5 m or 16 m), reducing ℓ1 from 40.45 m to 5 m (i.e., from field 3 to 1 or field 4 to 2) decreases ENTD(π^G), ERRD(π^G), ENTD(π^M), and ERRD(π^M): when the horizontal correlation becomes small, it can no longer be exploited by the non-Markovian policies π^G and π^M; (b) For field 3 with large ℓ1 and small ℓ2, ENTD(π^G) and ENTD(π^M) are large as the Markov property of π̃ prevents it from exploiting the large horizontal correlation; (c) When ℓ1 is kept constant (i.e., at 5 m or 40.45 m), reducing ℓ2 from 16 m to 5 m (i.e., from field 2 to 1 or field 4 to 3) increases ERRD(π^G) and ERRD(π^M): when the vertical correlation becomes small, it can no longer be exploited by π̃, thus incurring larger ERR(π̃).

For the case of 2 robots, the observations are as follows: (a) |ENTD(π^G)| and |ENTD(π^M)| are small for all fields except for field 2 where π̃ significantly outperforms π^M. In particular, when ℓ2 is kept constant (i.e., at 5 m or 16 m), reducing ℓ1 from 40.45 m to 5 m (i.e., from field 3 to 1 or field 4 to 2) decreases ENTD(π^G), ENTD(π^M), and ERRD(π^G): this is explained in the first observation of the 1-robot case; (b) For field 3 with large ℓ1 and small ℓ2, ERRD(π^G) and ERRD(π^M) are large: this is explained in the second and third observations of the 1-robot case; (c) When ℓ1 is kept constant (i.e., at 5 m or 40.45 m), reducing ℓ2 from 16 m to 5 m (i.e., from field 2 to 1 or field 4 to 3) increases ERRD(π^G): this is explained in the third observation of the 1-robot case. This also holds for ERRD(π^M) when ℓ1 is large.

For the case of 3 robots, it can be observed that π̃ can achieve ENT(π̃) and ERR(π̃) performance comparable to (if not better than) that of π^G and π^M for all fields.

To summarize the above observations on spatial correlation conditions favoring π̃ over π^G and π^M, π̃ can achieve ENT(π̃) performance comparable to (if not better than) that of π^G and π^M for all fields with any number of robots except for field 3 (i.e., of large ℓ1 and small ℓ2) with 1 robot as explained previously. Policy π̃ can achieve comparable ERR(π̃) performance for field 2 (i.e., of small ℓ1 and large ℓ2) with 1 robot because π̃ is capable of exploiting the large vertical correlation, and the small horizontal correlation cannot be exploited by π^G and π^M. Policy π̃ can also achieve comparable ERR(π̃) performance for all fields with 2 and 3 robots except for field 3 (i.e., of large ℓ1 and small ℓ2) with 2 robots. These observations reveal that (a) small horizontal and large vertical correlations are favorable to π̃; (b) though large horizontal and small vertical correlations are not favorable to π̃, this problem can be mitigated by increasing the number of robots. For more detailed analysis (e.g., visualization of planned observation paths and their corresponding error maps), the interested reader is referred to [5].

We will now examine how the increase in resolution to a 13 × 75 grid affects the ENT(π) and ERR(π) performance of the evaluated policies; the resulting grid discretization width and planning horizon are about 0.4× smaller and 2.5× longer, respectively. Table 2 shows the results of mean ENT(π) and ERR(π) performance of the tested policies with varying number of robots, from which we can derive observations similar to those for temperature field 4 discretized into a 5 × 30 grid: π̃ can achieve ENT(π̃) and ERR(π̃) performance comparable to (if not better than) that of π^G and π^M except for ERR(π̃) performance with 1 robot. So, increasing the grid resolution does not seem to noticeably degrade the active sampling performance of π̃ relative to that of π^G and π^M.

Table 2: Comparison of ENT(π) and ERR(π) (×10⁻⁵) performance for the temperature field that is discretized into a 13 × 75 grid.

  Policy          k = 1      k = 2      k = 3
  ENT(π)   π̃      -4813      -4284      -3828
           π^G    -4813      -4286      -3841
           π^M    -4808      -4277      -3825
  ERR(π)   π̃      1.0287     0.0032     0.0015
           π^G    0.0082     0.0030     0.0024
           π^M    0.0087     0.0034     0.0019

5.3 Plankton Density Field Data

Fig. 3 illustrates the plankton density field that is discretized into an 8 × 45 grid.

Figure 3: Plankton density (chl-a) field (measured in mg m⁻³) spatially distributed over a 314 m × 1765 m transect that is discretized into an 8 × 45 grid with ℓ1 = 27.53 m and ℓ2 = 134.64 m.

Table 3 shows the results of mean ENT(π) and ERR(π) performance of the tested policies with varying number of robots. The observations are as follows: π̃ can achieve the same ENT(π̃) and ERR(π̃) performance as that of π^G and superior ENT(π̃) performance over that of π^M because small horizontal and large vertical correlations favor π̃ as explained in Section 5.2. By increasing the number of robots (i.e., k > 2), π̃ can achieve ERR(π̃) performance comparable to (if not better than) that of π^M.

Table 3: Comparison of ENT(π) and ERR(π) (×10⁻³) performance for the plankton density field that is discretized into an 8 × 45 grid.

  Policy          k = 1      k = 2      k = 3      k = 4
  ENT(π)   π̃       -359       -322       -196       -121
           π^G     -359       -322       -196       -121
           π^M     -230       -186        -70        -11
  ERR(π)   π̃      5.6124     2.2164     0.0544     0.0066
           π^G    5.6124     2.2164     0.0544     0.0066
           π^M    4.5371     0.5613     0.0472     0.0324

Table 4 shows the results of mean ENT(π) and ERR(π) performance of the tested policies after increasing the resolution to a 16 × 89 grid; the resulting grid discretization width and planning horizon are about 0.5× smaller and 2× longer, respectively. Similar observations can be obtained: π̃ can achieve ENT(π̃) performance comparable to that of π^G and superior ENT(π̃) performance over that of π^M. By deploying more than 1 robot, π̃ can achieve ERR(π̃) performance comparable to (if not better than) that of π^G and π^M. Again, we can observe that increasing the grid resolution does not seem to noticeably degrade the active sampling performance of π̃ relative to that of π^G and π^M.

Table 4: Comparison of ENT(π) and ERR(π) (×10⁻³) performance for the plankton density field that is discretized into a 16 × 89 grid.

  Policy          k = 1      k = 2      k = 3
  ENT(π)   π̃      -4278      -3949      -3681
           π^G    -4238      -3964      -3686
           π^M    -4171      -3840      -3501
  ERR(π)   π̃      3.4328     0.0970     0.0546
           π^G    1.5648     0.1073     0.0643
           π^M    0.8186     0.0859     0.0348
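For reference, the two metrics reported in Tables 1-4 (Section 5.1) reduce to a log-determinant and a normalized squared error. A hedged sketch, assuming the posterior covariance of the unobserved locations and the posterior-mean predictions have already been computed (e.g., with the GP sketch in Section 3.2); names are illustrative:

```python
import numpy as np

def ent_metric(post_cov_unobserved):
    """ENT(pi): posterior joint entropy of the field measurements at the locations
    not sampled by policy pi (smaller means lower mapping uncertainty)."""
    m = post_cov_unobserved.shape[0]
    _, log_det = np.linalg.slogdet(post_cov_unobserved)
    return 0.5 * (m * np.log(2.0 * np.pi * np.e) + log_det)

def err_metric(z_true, z_pred):
    """ERR(pi): mean-squared relative error of the posterior-mean predictions over
    the whole domain U (smaller means higher prediction accuracy)."""
    z_true, z_pred = np.asarray(z_true, float), np.asarray(z_pred, float)
    return float(np.mean(((z_true - z_pred) / z_true.mean()) ** 2))
```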
5.4 Incurred Policy Time

Fig. 4 shows the time taken to derive the tested policies for sampling the temperature and plankton density fields with varying number of robots and grid resolutions. It can be observed that the time taken to derive π̃ is shorter than that needed to derive π^G and π^M by more than 1 and 4 orders of magnitude, respectively. It is important to point out that Fig. 4 reports the average time taken to derive π^G and π^M over all possible starting robot locations. So, if the starting robot locations are unknown, the incurred time to derive π^G and π^M has to be increased by a factor of C(r, k). In contrast, π̃ caters to all possible starting robot locations, so the incurred time to derive π̃ is unaffected. These observations show a considerable computational gain of π̃ over π^G and π^M, which supports our time complexity analysis and comparison (Section 4). So, our Markov-based path planner is more time-efficient for in situ, real-time, high-resolution active sampling.

Figure 4: Graph of time taken to derive policy vs. number k of robots for temperature field 4 discretized into (a) 5 × 30 and (b) 13 × 75 grids and plankton density field discretized into (c) 8 × 45 and (d) 16 × 89 grids.

6. CONCLUSION

This paper describes an efficient Markov-based information-theoretic path planner for active sampling of GP-based environmental fields. We have provided theoretical guarantees on the active sampling performance of our Markov-based policy π̃ for the transect sampling task, from which ideal environmental field conditions (i.e., small horizontal spatial correlation and noisy, less intense fields) and sampling task settings (i.e., large grid discretization width and short planning horizon) can be established to limit its performance degradation. Empirically, we have shown that π̃ can generally achieve active sampling performance comparable to that of the widely-used non-Markovian greedy policies π^G and π^M under less favorable realistic field conditions (i.e., low noise-to-signal ratio) and task settings (i.e., small grid discretization width and long planning horizon) while enjoying huge computational gain over them. In particular, we have empirically observed that (a) small horizontal and large vertical correlations strongly favor π̃; (b) though large horizontal and small vertical correlations do not favor π̃, this problem can be mitigated by increasing the number of robots. In fact, deploying a large robot team often produces superior active sampling performance of π̃ over π^M in our experiments, not forgetting the computational gain of more than 4 orders of magnitude. Our Markov-based planner can be used to efficiently achieve more general exploration tasks (e.g., boundary tracking and those in [6, 7]), but the guarantees provided here may not apply. For our future work, we will "relax" the Markov assumption by utilizing a longer (but not entire) history of observations in path planning.
This can potentially improve the active sampling performance in fields of moderate to large horizontal correlation without incurring as much time as the non-Markovian policies.

7. REFERENCES

[1] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Univ. Press, 3rd edition, 1996.
[2] R. Korf. Real-time heuristic search. Artif. Intell., 42(2-3):189-211, 1990.
[3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235-284, 2008.
[4] N. E. Leonard, D. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R. Davis. Collective motion, sensor networks and ocean sampling. Proc. IEEE, 95(1):48-74, 2007.
[5] K. H. Low. Multi-Robot Adaptive Exploration and Mapping for Environmental Sensing Applications. Ph.D. Thesis, Technical Report CMU-ECE-2009-024, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, 2009.
[6] K. H. Low, J. M. Dolan, and P. Khosla. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, pages 23-30, 2008.
[7] K. H. Low, J. M. Dolan, and P. Khosla. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pages 233-240, 2009.
[8] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[9] D. L. Rudnick, R. E. Davis, C. C. Eriksen, D. Fratantoni, and M. J. Perry. Underwater gliders for ocean research. Mar. Technol. Soc. J., 38(2):73-84, 2004.
[10] M. C. Shewry and H. P. Wynn. Maximum entropy sampling. J. Applied Stat., 14(2):165-170, 1987.
[11] A. Ståhl, A. Ringvall, and T. Lämås. Guided transect sampling for assessing sparse populations. Forest Science, 46(1):108-115, 2000.
[12] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
[13] D. R. Thompson and D. Wettergreen. Intelligent maps for autonomous kilometer-scale science survey. In Proc. i-SAIRAS, 2008.
[14] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, 2005.
[15] R. Webster and M. Oliver. Geostatistics for Environmental Scientists. John Wiley & Sons, Inc., NY, 2nd edition, 2007.
Environmental Scientists . John Wiley & Sons, Inc., NY, 2nd edition, 2007.  APPENDIX A.   PROOFS A.1   Proof Sketch of Theorem 1  For each vector   x i   of current robot locations, the time needed to evaluate the posterior entropy   H [ Z τ   ( x i ,a i ) | Z x i   ] (i.e., using Cholesky factorization) over all possible actions  a i   ∈ A ( x i ) is   |A| × O ( k 4 ) =   O ( |A| k 4 ).   Doing this over all possible vectors of current robot locations in each column thus incurs   |A|×O ( |A| k 4 ) =   O ( |A| 2 k 4 ) time since the vector space of current robot locations in each column is of the same size as that of the joint action space   |A| .   We do not have to compute these posterior entropies again for every column because the entropies evaluated for any one column repli- cate across different columns. This computational saving is due to the Markov assumption and the problem structure of the transect sampling task. Propagating the optimal values from stages   t   to 0 takes   O ( |A| 2 t ) time. Hence, solving the Markov-based path planning problem (11) or deriving the Markov-based policy    ̃ π   (12) takes   O ( |A| 2 ( t   +   k 4 )) time for the transect sampling task.  A.2   Proof of Lemma 2  Let Σ x 0: i − 1 x 0: i − 1 | x i  4  =   C   +   E   where   C   is defined to be a matrix with diagonal components   σ 2  x k   =   σ 2  s   +   σ 2  n   for   k   = 0 , . . . , i   −   1 and off-diagonal components 0, and   E   is defined to be a matrix with diagonal components   − ( σ x k   x i   ) 2 /σ 2  x i   =  − ( σ x k   x i   ) 2 / ( σ 2  s   +   σ 2  n ) for   k   =   0 , . . . , i   −   1 and the same off-diagonal components as Σ x 0: i − 1   x 0: i − 1   | x i   (i.e.,   σ x j   x k   | x i   =  σ x j   x k   −   σ x j   x i   σ x i x k   /σ 2  x i   for   j, k   = 0 , . . . , i   −   1,   j   6   =   k ). Then,  || C − 1 || 2   =   || ( σ 2  s   +   σ 2  n ) − 1 I || 2   =   1  σ 2  s   +   σ 2  n  .   (15) The last equality follows from   σ 2  s   +   σ 2  n   being the smallest eigenvalue of   C . So, 1 / ( σ 2  s   +   σ 2  n ) is the largest eigenvalue of  C − 1 , which is equal to   || C − 1 || 2 . Note that the minimum distance between any pair of lo- cation components of   x 0: i − 1   cannot be less than   ω 1 . So, it can be observed that any component of   E   cannot have an absolute value more than   σ 2  s   ξ . Therefore,  || E || 2   ≤   iσ 2  s   ξ ,   (16) which follows from a property of the matrix 2-norm that  || E || 2   cannot be more than the largest absolute component of   E   multiplied by   i   [1]. Note that the minimum distance between locations   x i   and  x i +1   as well as between location   x i   and any location compo- nent of   x 0: i − 1   cannot be less than   ω 1 . So, it can be observed that any component of Σ x i +1 x 0: i − 1 | x i   cannot have an abso- lute value more than   σ 2  s   ξ 2 . Therefore,  | σ Z xi +1   Z xk   | x i   | ≤   σ 2  s   ξ 2   (17) for   k   = 0 , . . . , i   −   1. Now, Σ x i +1 x 0: i − 1   | x i   ( C   +   E ) − 1 Σ x 0: i − 1 x i +1 | x i   −  Σ x i +1 x 0: i − 1   | x i   C − 1 Σ x 0: i − 1   x i +1 | x i  = Σ x i +1 x 0: i − 1 | x i   { ( C   +   E ) − 1   −   C − 1 } Σ x 0: i − 1 x i +1 | x i  ≤ || Σ x i +1 x 0: i − 1   | x i   || 2 2   || ( C   +   E ) − 1   −   C − 1 || 2  ≤  i − 1 ∑  k =0  | σ Z xi +1   Z xk   | x i   | 2   || C − 1 || 2   || E || 2 1  || C − 1 || 2   − || E || 2  =   i ( σ 2  s   ) 2 ξ 4   || C − 1 || 2   || E || 2 1  || C − 1   || 2   − || E || 2  .  
A.2   Proof of Lemma 2

Let $\Sigma_{x_{0:i-1} x_{0:i-1} \mid x_i} \triangleq C + E$ where $C$ is defined to be a matrix with diagonal components $\sigma^2_{x_k} = \sigma_s^2 + \sigma_n^2$ for $k = 0, \ldots, i-1$ and off-diagonal components $0$, and $E$ is defined to be a matrix with diagonal components $-(\sigma_{x_k x_i})^2 / \sigma^2_{x_i} = -(\sigma_{x_k x_i})^2 / (\sigma_s^2 + \sigma_n^2)$ for $k = 0, \ldots, i-1$ and the same off-diagonal components as $\Sigma_{x_{0:i-1} x_{0:i-1} \mid x_i}$ (i.e., $\sigma_{x_j x_k \mid x_i} = \sigma_{x_j x_k} - \sigma_{x_j x_i} \sigma_{x_i x_k} / \sigma^2_{x_i}$ for $j, k = 0, \ldots, i-1$, $j \neq k$). Then,
$$\|C^{-1}\|_2 = \|(\sigma_s^2 + \sigma_n^2)^{-1} I\|_2 = \frac{1}{\sigma_s^2 + \sigma_n^2}. \qquad (15)$$
The last equality follows from $\sigma_s^2 + \sigma_n^2$ being the smallest eigenvalue of $C$. So, $1/(\sigma_s^2 + \sigma_n^2)$ is the largest eigenvalue of $C^{-1}$, which is equal to $\|C^{-1}\|_2$.

Note that the minimum distance between any pair of location components of $x_{0:i-1}$ cannot be less than $\omega_1$. So, it can be observed that any component of $E$ cannot have an absolute value more than $\sigma_s^2 \xi$. Therefore,
$$\|E\|_2 \le i \sigma_s^2 \xi, \qquad (16)$$
which follows from a property of the matrix 2-norm that $\|E\|_2$ cannot be more than the largest absolute component of $E$ multiplied by $i$ [1].

Note that the minimum distance between locations $x_i$ and $x_{i+1}$ as well as between location $x_i$ and any location component of $x_{0:i-1}$ cannot be less than $\omega_1$. So, it can be observed that any component of $\Sigma_{x_{i+1} x_{0:i-1} \mid x_i}$ cannot have an absolute value more than $\sigma_s^2 \xi^2$. Therefore,
$$\bigl| \sigma_{Z_{x_{i+1}} Z_{x_k} \mid x_i} \bigr| \le \sigma_s^2 \xi^2 \qquad (17)$$
for $k = 0, \ldots, i-1$. Now,
$$\begin{aligned}
& \Sigma_{x_{i+1} x_{0:i-1} \mid x_i} (C + E)^{-1} \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} - \Sigma_{x_{i+1} x_{0:i-1} \mid x_i} C^{-1} \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} \\
&= \Sigma_{x_{i+1} x_{0:i-1} \mid x_i} \bigl\{ (C + E)^{-1} - C^{-1} \bigr\} \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} \\
&\le \|\Sigma_{x_{i+1} x_{0:i-1} \mid x_i}\|_2^2 \; \|(C + E)^{-1} - C^{-1}\|_2 \\
&\le \sum_{k=0}^{i-1} \bigl| \sigma_{Z_{x_{i+1}} Z_{x_k} \mid x_i} \bigr|^2 \; \frac{\|C^{-1}\|_2 \, \|E\|_2}{\frac{1}{\|C^{-1}\|_2} - \|E\|_2} \\
&= i (\sigma_s^2)^2 \xi^4 \; \frac{\|C^{-1}\|_2 \, \|E\|_2}{\frac{1}{\|C^{-1}\|_2} - \|E\|_2}.
\end{aligned} \qquad (18)$$
The first inequality is due to the Cauchy-Schwarz inequality and submultiplicativity of the matrix norm [12]. The second inequality follows from an important result in the perturbation theory of matrix inverses (in particular, Theorem III.2.5 in [12]); it requires the assumption $\|C^{-1} E\|_2 < 1$. This assumption can be satisfied by $\|C^{-1}\|_2 \|E\|_2 < 1$ because $\|C^{-1} E\|_2 \le \|C^{-1}\|_2 \|E\|_2$. By (15) and (16), $\|C^{-1}\|_2 \|E\|_2 < 1$ translates to $\xi < \rho / i$. The last equality is due to (17).

From (18),
$$\begin{aligned}
& \Sigma_{x_{i+1} x_{0:i-1} \mid x_i} (C + E)^{-1} \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} \\
&\le \Sigma_{x_{i+1} x_{0:i-1} \mid x_i} C^{-1} \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} + i (\sigma_s^2)^2 \xi^4 \; \frac{\|C^{-1}\|_2 \, \|E\|_2}{\frac{1}{\|C^{-1}\|_2} - \|E\|_2} \\
&\le i (\sigma_s^2)^2 \xi^4 \, \|C^{-1}\|_2 \left( 1 + \frac{\|E\|_2}{\frac{1}{\|C^{-1}\|_2} - \|E\|_2} \right) \\
&= i (\sigma_s^2)^2 \xi^4 \; \frac{1}{\frac{1}{\|C^{-1}\|_2} - \|E\|_2} \\
&\le \frac{i (\sigma_s^2)^2 \xi^4}{\sigma_s^2 + \sigma_n^2 - i \sigma_s^2 \xi} = \frac{\sigma_s^2 \xi^4}{\frac{\rho}{i} - \xi}.
\end{aligned} \qquad (19)$$
The second inequality is due to
$$\Sigma_{x_{i+1} x_{0:i-1} \mid x_i} C^{-1} \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} \le i (\sigma_s^2)^2 \xi^4 \, \|C^{-1}\|_2,$$
which follows from the Cauchy-Schwarz inequality and (17). The third inequality follows from (15) and (16).

We will need the following property of posterior variance that is similar to (3):
$$\sigma^2_{x_{i+1} \mid x_{0:i}} = \sigma^2_{x_{i+1} \mid x_i} - \Sigma_{x_{i+1} x_{0:i-1} \mid x_i} \, \Sigma^{-1}_{x_{0:i-1} x_{0:i-1} \mid x_i} \, \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} \qquad (20)$$
where $\Sigma_{x_{i+1} x_{0:i-1} \mid x_i}$ is a posterior covariance vector with components $\sigma_{x_{i+1} x_k \mid x_i}$ for $k = 0, \ldots, i-1$, $\Sigma_{x_{0:i-1} x_{i+1} \mid x_i}$ is the transpose of $\Sigma_{x_{i+1} x_{0:i-1} \mid x_i}$, and $\Sigma_{x_{0:i-1} x_{0:i-1} \mid x_i}$ is a posterior covariance matrix with components $\sigma_{x_j x_k \mid x_i}$ for $j, k = 0, \ldots, i-1$.

By (19) and (20),
$$\sigma^2_{x_{i+1} \mid x_i} - \sigma^2_{x_{i+1} \mid x_{0:i}} = \Sigma_{x_{i+1} x_{0:i-1} \mid x_i} \, \Sigma^{-1}_{x_{0:i-1} x_{0:i-1} \mid x_i} \, \Sigma_{x_{0:i-1} x_{i+1} \mid x_i} \le \frac{\sigma_s^2 \xi^4}{\frac{\rho}{i} - \xi}.$$
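As an illustrative numerical check of Lemma 2 (under assumed settings, not part of the proof), consider a single robot ($k = 1$) taking one observation per column of a transect with column separation $\omega_1$ under a squared-exponential covariance; $\xi$ is then taken to be the inter-column correlation $\exp(-\omega_1^2 / (2\ell_1^2))$, an assumption made only for this sketch. The Python/NumPy fragment below compares the variance reduction $\sigma^2_{x_{i+1} \mid x_i} - \sigma^2_{x_{i+1} \mid x_{0:i}}$ against the bound $\sigma_s^2 \xi^4 / (\rho/i - \xi)$.

import numpy as np

# Assumed parameters for this sketch (not taken from the paper's experiments).
sigma_s2, sigma_n2 = 1.0, 0.25           # signal and noise variances
ell1, omega1 = 1.5, 3.0                   # horizontal length-scale and column spacing
rho = (sigma_s2 + sigma_n2) / sigma_s2
xi = np.exp(-omega1**2 / (2 * ell1**2))   # correlation between adjacent columns

i = 4                                     # index of the current column x_i
assert xi < rho / i, "Lemma 2's condition xi < rho/i must hold for the bound"

# Single-robot locations, one per column, spaced omega1 apart along the transect.
X = omega1 * np.arange(i + 2).reshape(-1, 1)        # x_0, ..., x_{i+1}
K = sigma_s2 * np.exp(-(X - X.T)**2 / (2 * ell1**2)) + sigma_n2 * np.eye(i + 2)

def post_var(target, obs):
    """Posterior variance of the noisy measurement at `target` given `obs`."""
    k_to = K[target, obs]
    K_oo = K[np.ix_(obs, obs)]
    return K[target, target] - k_to @ np.linalg.solve(K_oo, k_to)

markov_var = post_var(i + 1, [i])                   # sigma^2_{x_{i+1} | x_i}
full_var = post_var(i + 1, list(range(i + 1)))      # sigma^2_{x_{i+1} | x_{0:i}}
reduction = markov_var - full_var
bound = sigma_s2 * xi**4 / (rho / i - xi)
print(f"variance reduction {reduction:.3e} <= bound {bound:.3e}: {reduction <= bound}")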
A.3   Proof of Theorem 4

Proof by induction on $i$ that $V^{\pi^*}_i(x_{0:i}) \le \tilde{V}_i(x_i) \le V^{\pi^*}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$ for $i = t, \ldots, 0$.

Base case ($i = t$): By Lemma 3,
$$\begin{aligned}
& H[Z_{x_{t+1}} \mid Z_{x_{0:t}}] \le H[Z_{x_{t+1}} \mid Z_{x_t}] \le H[Z_{x_{t+1}} \mid Z_{x_{0:t}}] + \Delta(t) \quad \text{for any } x_{t+1} \\
\Rightarrow\ & \max_{a_t \in \mathcal{A}(x_t)} H[Z_{x_{t+1}} \mid Z_{x_{0:t}}] \le \max_{a_t \in \mathcal{A}(x_t)} H[Z_{x_{t+1}} \mid Z_{x_t}] \le \max_{a_t \in \mathcal{A}(x_t)} H[Z_{x_{t+1}} \mid Z_{x_{0:t}}] + \Delta(t) \\
\Rightarrow\ & V^{\pi^*}_t(x_{0:t}) \le \tilde{V}_t(x_t) \le V^{\pi^*}_t(x_{0:t}) + \Delta(t).
\end{aligned} \qquad (21)$$
Hence, the base case is true.

Inductive case: Suppose that
$$V^{\pi^*}_{i+1}(x_{0:i+1}) \le \tilde{V}_{i+1}(x_{i+1}) \le V^{\pi^*}_{i+1}(x_{0:i+1}) + \sum_{s=i+1}^{t} \Delta(s) \qquad (22)$$
is true. We have to prove that $V^{\pi^*}_i(x_{0:i}) \le \tilde{V}_i(x_i) \le V^{\pi^*}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$ is true.

We will first show that $\tilde{V}_i(x_i) \le V^{\pi^*}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$. By Lemma 3,
$$\begin{aligned}
& H[Z_{x_{i+1}} \mid Z_{x_i}] \le H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] + \Delta(i) \quad \text{for any } x_{i+1} \\
\Rightarrow\ & H[Z_{x_{i+1}} \mid Z_{x_i}] + \tilde{V}_{i+1}(x_{i+1}) \le H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] + V^{\pi^*}_{i+1}(x_{0:i+1}) + \sum_{s=i}^{t} \Delta(s) \quad \text{by (22), for any } x_{i+1} \\
\Rightarrow\ & \max_{a_i \in \mathcal{A}(x_i)} \Bigl( H[Z_{x_{i+1}} \mid Z_{x_i}] + \tilde{V}_{i+1}(x_{i+1}) \Bigr) \le \max_{a_i \in \mathcal{A}(x_i)} \Bigl( H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] + V^{\pi^*}_{i+1}(x_{0:i+1}) \Bigr) + \sum_{s=i}^{t} \Delta(s) \\
\Rightarrow\ & \tilde{V}_i(x_i) \le V^{\pi^*}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s).
\end{aligned}$$

We will now prove that $V^{\pi^*}_i(x_{0:i}) \le \tilde{V}_i(x_i)$. By Lemma 3,
$$\begin{aligned}
& H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] \le H[Z_{x_{i+1}} \mid Z_{x_i}] \quad \text{for any } x_{i+1} \\
\Rightarrow\ & H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] + V^{\pi^*}_{i+1}(x_{0:i+1}) \le H[Z_{x_{i+1}} \mid Z_{x_i}] + \tilde{V}_{i+1}(x_{i+1}) \quad \text{by (22), for any } x_{i+1} \\
\Rightarrow\ & \max_{a_i \in \mathcal{A}(x_i)} \Bigl( H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] + V^{\pi^*}_{i+1}(x_{0:i+1}) \Bigr) \le \max_{a_i \in \mathcal{A}(x_i)} \Bigl( H[Z_{x_{i+1}} \mid Z_{x_i}] + \tilde{V}_{i+1}(x_{i+1}) \Bigr) \\
\Rightarrow\ & V^{\pi^*}_i(x_{0:i}) \le \tilde{V}_i(x_i).
\end{aligned}$$
Hence, the inductive case is true.

A.4   Proof of Theorem 5

The following lemma is needed for this proof:

Lemma 7. $\tilde{V}_i(x_i) \le V^{\tilde{\pi}}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$ for $i = 0, \ldots, t$.

The proof of the above lemma is provided in Appendix A.6.

Proof by induction on $i$ that $V^{\pi^*}_i(x_{0:i}) \le V^{\tilde{\pi}}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$ for $i = t, \ldots, 0$.

Base case ($i = t$):
$$V^{\pi^*}_t(x_{0:t}) \le \tilde{V}_t(x_t) \le V^{\tilde{\pi}}_t(x_{0:t}) + \Delta(t).$$
The first inequality is due to Theorem 4. The second inequality follows from Lemma 7. Hence, the base case is true.

Inductive case: Suppose that
$$V^{\pi^*}_{i+1}(x_{0:i+1}) \le V^{\tilde{\pi}}_{i+1}(x_{0:i+1}) + \sum_{s=i+1}^{t} \Delta(s) \qquad (23)$$
is true. We have to prove that $V^{\pi^*}_i(x_{0:i}) \le V^{\tilde{\pi}}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$ is true.
$$\begin{aligned}
V^{\pi^*}_i(x_{0:i}) &\le \tilde{V}_i(x_i) \\
&= H[Z_{\tau(x_i, \tilde{\pi}_i(x_i))} \mid Z_{x_i}] + \tilde{V}_{i+1}(\tau(x_i, \tilde{\pi}_i(x_i))) \\
&\le H[Z_{\tau(x_i, \tilde{\pi}_i(x_i))} \mid Z_{x_{0:i}}] + \Delta(i) + \tilde{V}_{i+1}(\tau(x_i, \tilde{\pi}_i(x_i))) \\
&\le H[Z_{\tau(x_i, \tilde{\pi}_i(x_i))} \mid Z_{x_{0:i}}] + \Delta(i) + V^{\tilde{\pi}}_{i+1}\bigl( (x_{0:i}, \tau(x_i, \tilde{\pi}_i(x_i))) \bigr) + \sum_{s=i+1}^{t} \Delta(s) \\
&= V^{\tilde{\pi}}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s).
\end{aligned}$$
The first inequality is due to Theorem 4. The first equality follows from (11). The second inequality follows from Lemma 3. The third inequality is due to Lemma 7. The last equality follows from (5). Hence, the inductive case is true.
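For intuition only, the following Python/NumPy sketch probes Theorems 4 and 5 on a toy single-robot transect with assumed parameters: it brute-forces the non-Markovian optimal value $V^{\pi^*}_0$, computes the Markov-based value $\tilde{V}_0$ by backward induction, and evaluates the value $V^{\tilde{\pi}}_0$ actually achieved by the Markov-based policy under full-history conditioning. The $\Delta(s)$ terms of Lemma 3 are not reproduced here, so the sketch only checks the ordering $V^{\tilde{\pi}}_0 \le V^{\pi^*}_0 \le \tilde{V}_0$; the grid size, kernel parameters, and fully connected column-to-column transitions are assumptions made for illustration.

import itertools
import numpy as np

# Toy transect settings (assumed for this sketch only).
t, rows = 3, 3                          # stages 0..t, rows per column
omega1, omega2 = 2.0, 1.0               # column and row separations
ell1, ell2 = 3.0, 1.5                   # horizontal and vertical length-scales
sigma_s2, sigma_n2 = 1.0, 0.2

def cov(c1, r1, c2, r2):
    """Squared-exponential covariance between two grid cells."""
    d1, d2 = (c1 - c2) * omega1, (r1 - r2) * omega2
    return sigma_s2 * np.exp(-0.5 * (d1**2 / ell1**2 + d2**2 / ell2**2))

def entropy_given(cell, obs):
    """Entropy of the noisy measurement at `cell` given measurements at `obs`."""
    var = sigma_s2 + sigma_n2
    if obs:
        k = np.array([cov(*cell, *o) for o in obs])
        K = np.array([[cov(*a, *b) for b in obs] for a in obs]) + sigma_n2 * np.eye(len(obs))
        var -= k @ np.linalg.solve(K, k)
    return 0.5 * np.log(2 * np.pi * np.e * var)

x0 = (0, rows // 2)                     # starting cell in column 0

# Non-Markovian optimum V^{pi*}_0 by brute force over all row sequences.
best_val = -np.inf
for seq in itertools.product(range(rows), repeat=t + 1):
    path, val = [x0], 0.0
    for i, r in enumerate(seq):
        nxt = (i + 1, r)
        val += entropy_given(nxt, path)
        path.append(nxt)
    best_val = max(best_val, val)

# Markov-based DP: entropies depend only on row transitions, so compute once.
H = np.array([[entropy_given((1, r2), [(0, r1)]) for r2 in range(rows)]
              for r1 in range(rows)])
V = np.zeros(rows)
pol = []
for _ in range(t + 1):                  # backward over stages t, ..., 0
    Q = H + V                           # Q[r1, r2] = H[r1, r2] + V_next(r2)
    pol.append(Q.argmax(axis=1))
    V = Q.max(axis=1)
pol = pol[::-1]                         # pol[i][row] = row chosen at stage i
V_tilde = V[x0[1]]

# Value actually achieved by the Markov-based policy (with full history).
path, val_pol, row = [x0], 0.0, x0[1]
for i in range(t + 1):
    row = pol[i][row]
    nxt = (i + 1, row)
    val_pol += entropy_given(nxt, path)
    path.append(nxt)

print(f"V^pi~_0 = {val_pol:.4f} <= V^pi*_0 = {best_val:.4f} <= V~_0 = {V_tilde:.4f}")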
A.5   Proof Sketch of Lemma 6

Define $x^{[m]}_i$ to be the $m$-th component of vector $x_i$ of robot locations for $m = 1, \ldots, k$. Let $x^{[1:m]}_i$ denote a vector comprising the first $m$ components of $x_i$ (i.e., concatenation of $x^{[1]}_i, \ldots, x^{[m]}_i$).
$$\begin{aligned}
I[Z_{x_{i+1}}; Z_{x_{0:i-1}} \mid Z_{x_i}] &= H[Z_{x_{i+1}} \mid Z_{x_i}] - H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] \\
&= \sum_{m=1}^{k} \Bigl( H[Z_{x^{[m]}_{i+1}} \mid Z_{(x_i, x^{[1:m-1]}_{i+1})}] - H[Z_{x^{[m]}_{i+1}} \mid Z_{(x_{0:i}, x^{[1:m-1]}_{i+1})}] \Bigr) \\
&= \frac{1}{2} \sum_{m=1}^{k} \log \frac{\sigma^2_{x^{[m]}_{i+1} \mid (x_i, x^{[1:m-1]}_{i+1})}}{\sigma^2_{x^{[m]}_{i+1} \mid (x_{0:i}, x^{[1:m-1]}_{i+1})}} \\
&= \frac{1}{2} \sum_{m=1}^{k} \log \left[ \left( 1 - \frac{\sigma^2_{x^{[m]}_{i+1} \mid (x_i, x^{[1:m-1]}_{i+1})} - \sigma^2_{x^{[m]}_{i+1} \mid (x_{0:i}, x^{[1:m-1]}_{i+1})}}{\sigma^2_{x^{[m]}_{i+1} \mid (x_i, x^{[1:m-1]}_{i+1})}} \right)^{-1} \right] \\
&\ge 0.
\end{aligned} \qquad (24)$$
The second equality follows from the chain rule for entropy. Similar to Lemma 2, the following result bounds the variance reduction term $\sigma^2_{x^{[m]}_{i+1} \mid (x_i, x^{[1:m-1]}_{i+1})} - \sigma^2_{x^{[m]}_{i+1} \mid (x_{0:i}, x^{[1:m-1]}_{i+1})}$ in (24):

Lemma 8. If $\xi < \min\left( \frac{\rho}{ik}, \frac{\rho}{4k} \right)$ and (14) is satisfied, then
$$0 \le \sigma^2_{x^{[m]}_{i+1} \mid (x_i, x^{[1:m-1]}_{i+1})} - \sigma^2_{x^{[m]}_{i+1} \mid (x_{0:i}, x^{[1:m-1]}_{i+1})} \le \frac{\sigma_s^2 \xi^4}{\frac{\rho}{ik} - \xi}.$$

The proof of the above result is largely similar to that of Lemma 2 (Appendix A.2), and is therefore omitted here. The bounds on $I[Z_{x_{i+1}}; Z_{x_{0:i-1}} \mid Z_{x_i}]$ follow immediately from (24), Lemma 8, and the following lower bound on $\sigma^2_{x^{[m]}_{i+1} \mid (x_i, x^{[1:m-1]}_{i+1})}$:
$$\begin{aligned}
\sigma^2_{x^{[m]}_{i+1} \mid (x_i, x^{[1:m-1]}_{i+1})} &= \sigma^2_{x^{[m]}_{i+1}} - \Sigma_{x^{[m]}_{i+1} (x_i, x^{[1:m-1]}_{i+1})} \, \Sigma^{-1}_{(x_i, x^{[1:m-1]}_{i+1})(x_i, x^{[1:m-1]}_{i+1})} \, \Sigma_{(x_i, x^{[1:m-1]}_{i+1}) x^{[m]}_{i+1}} \\
&\ge \sigma_s^2 + \sigma_n^2 - \frac{\sigma_s^2 \xi^2}{\frac{\rho}{2k-1} - \xi} \\
&\ge \sigma_s^2 + \sigma_n^2 - \frac{4k}{\rho} \sigma_s^2 \xi^2.
\end{aligned}$$
The equality is due to (3). The first inequality is due to the Cauchy-Schwarz inequality, submultiplicativity of the matrix norm [12], and a result in the perturbation theory of matrix inverses (in particular, Theorem III.2.5 in [12]). The second inequality follows from the given satisfied condition $\xi < \frac{\rho}{4k}$.
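The chain-rule decomposition in (24) can also be checked numerically; the Python/NumPy sketch below (with assumed toy settings for $k = 2$ robots per column, not our field data) evaluates $I[Z_{x_{i+1}}; Z_{x_{0:i-1}} \mid Z_{x_i}]$ both directly from Gaussian entropies and as the sum of per-robot log-variance ratios, confirming that the two agree and are non-negative.

import numpy as np

# Assumed toy settings for this sketch.
k, i = 2, 3                               # robots per column, current stage index
omega1, omega2 = 2.0, 1.0
ell1, ell2, sigma_s2, sigma_n2 = 3.0, 1.5, 1.0, 0.2

cells = [(c, r) for c in range(i + 2) for r in range(k)]   # columns 0..i+1, k rows each
idx = {cell: j for j, cell in enumerate(cells)}
P = np.array([[c * omega1, r * omega2] for c, r in cells])
D1 = (P[:, None, 0] - P[None, :, 0]) / ell1
D2 = (P[:, None, 1] - P[None, :, 1]) / ell2
K = sigma_s2 * np.exp(-0.5 * (D1**2 + D2**2)) + sigma_n2 * np.eye(len(cells))

def cond_cov(B, A):
    """Posterior covariance of noisy measurements at B given those at A."""
    KBB, KAA = K[np.ix_(B, B)], K[np.ix_(A, A)]
    KBA = K[np.ix_(B, A)]
    return KBB - KBA @ np.linalg.solve(KAA, KBA.T)

def gauss_entropy(S):
    d = S.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(S)[1])

col = lambda c: [idx[(c, r)] for r in range(k)]
x_next, x_cur = col(i + 1), col(i)
x_hist = [j for c in range(i + 1) for j in col(c)]          # x_0, ..., x_i

# Direct evaluation: I = H[Z_{x_{i+1}} | Z_{x_i}] - H[Z_{x_{i+1}} | Z_{x_{0:i}}].
I_direct = gauss_entropy(cond_cov(x_next, x_cur)) - gauss_entropy(cond_cov(x_next, x_hist))

# Chain-rule evaluation, one robot of x_{i+1} at a time, as in (24).
I_chain = 0.0
for m in range(k):
    target = [x_next[m]]
    prev = x_next[:m]                                       # x^{[1:m-1]}_{i+1}
    v_markov = cond_cov(target, x_cur + prev)[0, 0]
    v_full = cond_cov(target, x_hist + prev)[0, 0]
    I_chain += 0.5 * np.log(v_markov / v_full)

print(f"I_direct = {I_direct:.6f}, I_chain = {I_chain:.6f}, nonnegative: {I_direct >= 0}")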
A.6   Proof of Lemma 7

Proof by induction on $i$ that $\tilde{V}_i(x_i) \le V^{\tilde{\pi}}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$ for $i = t, \ldots, 0$.

Base case ($i = t$):
$$\tilde{V}_t(x_t) = \max_{a_t \in \mathcal{A}(x_t)} H[Z_{\tau(x_t, a_t)} \mid Z_{x_t}] = H[Z_{\tau(x_t, \tilde{\pi}_t(x_t))} \mid Z_{x_t}] \le H[Z_{\tau(x_t, \tilde{\pi}_t(x_t))} \mid Z_{x_{0:t}}] + \Delta(t) = V^{\tilde{\pi}}_t(x_{0:t}) + \Delta(t).$$
The first equality follows from (11). The inequality follows from Lemma 3. The last equality is due to (5). So, the base case is true.

Inductive case: Suppose that
$$\tilde{V}_{i+1}(x_{i+1}) \le V^{\tilde{\pi}}_{i+1}(x_{0:i+1}) + \sum_{s=i+1}^{t} \Delta(s) \qquad (25)$$
is true. We have to prove that $\tilde{V}_i(x_i) \le V^{\tilde{\pi}}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s)$ is true. By Lemma 3,
$$\begin{aligned}
& H[Z_{x_{i+1}} \mid Z_{x_i}] \le H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] + \Delta(i) \quad \text{for any } x_{i+1} \\
\Rightarrow\ & H[Z_{x_{i+1}} \mid Z_{x_i}] + \tilde{V}_{i+1}(x_{i+1}) \le H[Z_{x_{i+1}} \mid Z_{x_{0:i}}] + V^{\tilde{\pi}}_{i+1}(x_{0:i+1}) + \sum_{s=i}^{t} \Delta(s) \quad \text{by (25), for any } x_{i+1} \\
\Rightarrow\ & H[Z_{\tau(x_i, \tilde{\pi}_i(x_i))} \mid Z_{x_i}] + \tilde{V}_{i+1}(\tau(x_i, \tilde{\pi}_i(x_i))) \le H[Z_{\tau(x_i, \tilde{\pi}_i(x_i))} \mid Z_{x_{0:i}}] + V^{\tilde{\pi}}_{i+1}\bigl( (x_{0:i}, \tau(x_i, \tilde{\pi}_i(x_i))) \bigr) + \sum_{s=i}^{t} \Delta(s) \quad \text{by } x_{i+1} \leftarrow \tau(x_i, \tilde{\pi}_i(x_i)) \\
\Rightarrow\ & \tilde{V}_i(x_i) \le V^{\tilde{\pi}}_i(x_{0:i}) + \sum_{s=i}^{t} \Delta(s) \quad \text{by (11) and (5)}.
\end{aligned}$$
Hence, the inductive case is true.