Multi-Robot Informative Path Planning for Active Sensing of Environmental Phenomena: A Tale of Two Algorithms

Nannan Cao and Kian Hsiang Low
Department of Computer Science, National University of Singapore, Republic of Singapore
{nncao, lowkh}@comp.nus.edu.sg

John M. Dolan
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
jmd@cs.cmu.edu

Appears in: Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2013), Ito, Jonker, Gini, and Shehory (eds.), May 6–10, 2013, Saint Paul, Minnesota, USA. Copyright © 2012, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

ABSTRACT
A key problem of robotic environmental sensing and monitoring is that of active sensing: How can a team of robots plan the most informative observation paths to minimize the uncertainty in modeling and predicting an environmental phenomenon? This paper presents two principled approaches to efficient information-theoretic path planning based on entropy and mutual information criteria for in situ active sensing of an important broad class of widely-occurring environmental phenomena called anisotropic fields. Our proposed algorithms are novel in addressing a trade-off between active sensing performance and time efficiency. An important practical consequence is that our algorithms can exploit the spatial correlation structure of Gaussian process-based anisotropic fields to improve time efficiency while preserving near-optimal active sensing performance. We analyze the time complexity of our algorithms and prove analytically that they scale better than state-of-the-art algorithms with increasing planning horizon length. We provide theoretical guarantees on the active sensing performance of our algorithms for a class of exploration tasks called transect sampling, which, in particular, can be improved with longer planning time and/or lower spatial correlation along the transect. Empirical evaluation on real-world anisotropic field data shows that our algorithms can perform better or at least as well as the state-of-the-art algorithms while often incurring a few orders of magnitude less computational time, even when the field conditions are less favorable.

Categories and Subject Descriptors
G.3 [Probability and Statistics]: Stochastic processes; I.2.9 [Robotics]: Autonomous vehicles

General Terms
Algorithms, Performance, Experimentation, Theory

Keywords
Multi-robot exploration and mapping, adaptive sampling, active learning, Gaussian process, non-myopic path planning

1. INTRODUCTION
Research in environmental sensing and monitoring has recently gained significant attention and practical interest, especially in supporting environmental sustainability efforts worldwide. A key direction of this research aims at sensing, modeling, and predicting the various types of environmental phenomena spatially distributed over our natural and built-up habitats so as to improve our knowledge and understanding of their economic, environmental, and health impacts and implications. This is non-trivial to achieve due to a trade-off between the quantity of sensing resources (e.g., number of deployed sensors, energy consumption, mission time) and the uncertainty in predictive modeling.
In the case of deploying a limited number of mobile robotic sensing assets, such a trade-off motivates the need to plan the most informative resource-constrained observation paths to minimize the uncertainty in modeling and predicting a spatially varying environmental phenomenon, which constitutes the active sensing problem to be addressed in this paper.

A wide multitude of natural and urban environmental phenomena is characterized by spatially correlated field measurements, which raises the following fundamental issue faced by the active sensing problem: How can the spatial correlation structure of an environmental phenomenon be exploited to improve the active sensing performance and computational efficiency of robotic path planning? The works of [11, 12, 13] have tackled this issue specifically in the context of an environmental hotspot field by studying how its spatial correlation structure affects the performance advantage of adaptivity in path planning: If the field is large with a few small hotspots exhibiting extreme measurements and much higher spatial variability than the rest of the field, then adaptivity can provide better active sensing performance. On the other hand, non-adaptive sampling techniques [2, 8, 14] suffice for smoothly-varying fields.

In this paper, we will investigate the above issue for another important broad class of environmental phenomena called anisotropic fields, which exhibit an (often much) higher spatial correlation along one direction than along its perpendicular direction. Such fields occur widely in natural and built-up environments; examples include (a) ocean and freshwater phenomena like plankton density [6], fish abundance [23], temperature and salinity [22]; (b) soil and atmospheric phenomena like peat thickness [25], surface soil moisture [26], rainfall [18]; (c) mineral deposits like radioactive ore [19]; (d) pollutant and contaminant concentrations like air pollution [1] and heavy metals [16]; and (e) ecological abundance like vegetation density [9].

The geostatistics community has examined a related issue of how the spatial correlation structure of an anisotropic field can be exploited to improve the predictive performance of a sampling design for a static sensor network. To resolve this, the following heuristic design [25] is commonly used for sampling the anisotropic fields described above: Arrange and place the static sensors in a rectangular grid such that one axis of the grid is aligned along the direction of lowest spatial correlation (i.e., highest spatial variability), and the grid spacing along this axis as compared to that along its perpendicular axis is proportional to the ratio of their respective spatial correlations. In the case of path planning for k robots, one may consider the sampling locations of the rectangular grid as cities to be visited in a k-traveling salesman problem so as to minimize the total distance traveled or mission time [15]. However, since the resulting observation paths are constrained by the heuristic sampling design, they are suboptimal in solving the active sensing problem (i.e., minimizing the predictive uncertainty). This drawback is exacerbated when the robots are capable of sampling at a higher resolution along their paths (e.g., due to high sensor sampling rate) than that of the grid, hence gathering suboptimal observations while traversing between grid locations.
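The grid-spacing heuristic above can be made concrete with a short calculation. The sketch below is a minimal illustration of ours (not taken from [25] or [15]); the function name anisotropic_grid and the use of the length-scales ℓ1, ℓ2 as proxies for the spatial correlations along each axis are our own assumptions.

```python
import numpy as np

def anisotropic_grid(length_x, length_y, ell_1, ell_2, n_sensors):
    """Heuristic rectangular sampling grid for an anisotropic field.

    The grid is denser along the direction of lower spatial correlation:
    column/row counts are chosen so that the spacing ratio roughly matches
    the ratio of the length-scales (ell_1, ell_2), used here as proxies for
    the spatial correlations along each axis.
    """
    # Number of "effective lengths" spanned along each axis.
    aspect = (length_x / ell_1) / (length_y / ell_2)
    n_cols = max(2, int(round(np.sqrt(n_sensors * aspect))))
    n_rows = max(2, int(round(n_sensors / n_cols)))
    xs = np.linspace(0.0, length_x, n_cols)
    ys = np.linspace(0.0, length_y, n_rows)
    return np.array([(x, y) for x in xs for y in ys])

# Example: a 150 m x 25 m transect whose correlation is much higher along x.
grid = anisotropic_grid(length_x=150.0, length_y=25.0,
                        ell_1=40.0, ell_2=5.0, n_sensors=30)
print(grid.shape)  # roughly 30 sensor locations, spaced wider along x than y
```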
This paper presents two principled approaches to efficient information-theoretic path planning based on entropy and mutual information criteria (Sections 3 and 4, respectively) for in situ active sensing of environmental phenomena. In contrast to the existing methods described above, our proposed path planning algorithms are novel in addressing a trade-off between active sensing performance and computational efficiency. An important practical consequence is that our algorithms can exploit the spatial correlation structure of anisotropic fields to improve time efficiency while preserving near-optimal active sensing performance. The specific contributions of our work in this paper include:
• Analyzing the time complexity of our proposed algorithms and proving analytically that they scale better than state-of-the-art information-theoretic path planning algorithms [8, 13] with increasing length of planning horizon (Sections 3 and 4);
• Providing theoretical guarantees on the active sensing performance of our proposed algorithms (Sections 3 and 4) for a class of exploration tasks called the transect sampling task (Section 2.1), which, in particular, can be improved with longer planning time and/or lower spatial correlation along the transect;
• Empirically evaluating the time efficiency and active sensing performance of our proposed algorithms on real-world temperature and plankton density field data (Section 5).

2. BACKGROUND
2.1 Transect Sampling Task
In a transect sampling task [14, 24], a team of k robots is tasked to explore and sample an environmental phenomenon spatially distributed over a transect (Fig. 1) that is discretized into an r × n grid of sampling locations, where the number n of columns is assumed to be much larger than the number r of sampling locations in each column, r is expected to be small in a transect, and k ≤ r. The columns are indexed in increasing order from left to right. The k robots are constrained to simultaneously explore forward one column at a time from the leftmost column '1' to the rightmost column 'n' such that each robot samples one location per column for a total of n locations. Hence, each robot, given its current location, can move to any of the r locations in the adjacent column on its right.

[Figure 1: Transect sampling task with 2 robots on a temperature field (measured in °C) spatially distributed over a 25 m × 150 m transect that is discretized into a 5 × 30 grid of sampling locations (white dots). Image courtesy of [14].]

In practice, the transect sampling task is especially appropriate for and widely performed by mobile robots with limited maneuverability (e.g., unmanned aerial vehicles, autonomous surface and underwater vehicles (AUVs) [21]) because it involves less complex path maneuvers that can be achieved more reliably using less sophisticated on-board control algorithms. In terms of practical applicability, transect sampling is a particularly useful exploration task to be performed during the transit from the robot's current location to a distant planned waypoint [10, 24] to collect the most informative observations. For active sensing of ocean and freshwater phenomena, the transect can span a spatial feature of interest such as a harmful algal bloom or pollutant plume to be explored and sampled by a fleet of AUVs being deployed off a ship vessel.
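To make the planning search space concrete: in each column, the team jointly picks k of the r rows, so there are χ = C(r, k) joint choices per stage, and a full joint path over n columns is one of χ^n possibilities. The snippet below is a small illustration of ours (not from the paper) that enumerates this per-stage action space for the setting of Fig. 1.

```python
from itertools import combinations

def stage_actions(r, k):
    """All joint sampling choices for k robots in one column of the transect:
    each action is a tuple of k distinct row indices in {0, ..., r-1}."""
    return list(combinations(range(r), k))

# Setting of Fig. 1: a 5 x 30 grid explored by 2 robots.
actions = stage_actions(r=5, k=2)
print(len(actions))        # chi = C(5, 2) = 10 joint choices per column
print(len(actions) ** 30)  # number of joint paths over n = 30 columns
```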
2.2 Gaussian Process-Based Anisotropic Field
An environmental phenomenon is defined to vary as a realization of a rich class of Bayesian non-parametric models called the Gaussian process (GP) [20] that can formally characterize its spatial correlation structure and be refined with an increasing number of observations. More importantly, a GP can provide formal measures of predictive uncertainty (e.g., based on an entropy or mutual information criterion) for directing the robots to explore the highly uncertain areas of the phenomenon.

Let D be a set of sampling locations representing the domain of the environmental phenomenon such that each location x ∈ D is associated with a realized measurement z_x (random measurement Z_x) if x is sampled/observed (unobserved). Let {Z_x}_{x ∈ D} denote a GP, that is, every finite subset of {Z_x}_{x ∈ D} has a multivariate Gaussian distribution [20]. The GP is fully specified by its prior mean μ_x ≜ E[Z_x] and covariance σ_{xx′} ≜ cov[Z_x, Z_{x′}] for all x, x′ ∈ D. In the experiments (Section 5), we assume that the GP is second-order stationary, i.e., it has a constant prior mean and a stationary prior covariance structure (i.e., σ_{xx′} is a function of x − x′ for all x, x′ ∈ D), both of which are assumed to be known. In particular, its covariance structure is defined by the widely-used squared exponential covariance function

\sigma_{xx'} \triangleq \sigma_s^2 \exp\left\{-\tfrac{1}{2}(x - x')^\top M^{-2}(x - x')\right\} + \sigma_n^2\,\delta_{xx'}   (1)

where σ_s² and σ_n² are, respectively, the signal and noise variances controlling the intensity and noise of the measurements, M is a diagonal matrix with length-scale components ℓ1 and ℓ2 controlling the degree of spatial correlation or "similarity" between measurements along (i.e., horizontal direction) and perpendicular to (i.e., vertical direction) the transect, respectively, and δ_{xx′} is a Kronecker delta of value 1 if x = x′, and 0 otherwise. For anisotropic fields, ℓ1 ≠ ℓ2.

An advantage of using a GP to model the environmental phenomenon is its probabilistic regression capability: Given a vector s of sampled locations and a column vector z_s of corresponding measurements, the joint distribution of the measurements at any vector u of κ unobserved locations remains Gaussian with the following posterior mean vector and covariance matrix

\mu_{u|s} = \mu_u + \Sigma_{us}\,\Sigma_{ss}^{-1}\,(z_s - \mu_s)   (2)

\Sigma_{uu|s} = \Sigma_{uu} - \Sigma_{us}\,\Sigma_{ss}^{-1}\,\Sigma_{su}   (3)

where μ_u (μ_s) is a column vector with mean components μ_x for every location x of u (s), Σ_us (Σ_ss) is a covariance matrix with covariance components σ_{xx′} for every pair of locations x of u (s) and x′ of s, and Σ_su is the transpose of Σ_us. The posterior mean vector μ_{u|s} (2) is used to predict the measurements at the vector u of κ unobserved locations. The uncertainty of these predictions can be quantified using the posterior covariance matrix Σ_{uu|s} (3), which is independent of the measurements z_s, in two ways: (a) the trace of Σ_{uu|s} yields the sum of posterior variances Σ_{xx|s} over every location x of u; (b) the determinant of Σ_{uu|s} is used in calculating the Gaussian posterior joint entropy

H(Z_u \mid Z_s) \triangleq \tfrac{1}{2}\log\left((2\pi e)^{\kappa}\,\left|\Sigma_{uu|s}\right|\right) .   (4)

Unlike the first measure of predictive uncertainty, which assumes conditional independence between the measurements at the vector u of unobserved locations, the entropy-based measure (4) accounts for their correlation, thereby not overestimating their uncertainty.
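The GP machinery in (1)–(4) can be implemented in a few lines. The sketch below is a minimal NumPy illustration of ours (not the authors' code) of the squared exponential covariance, the posterior update, and the posterior joint entropy; the variable names (ell1, ell2, sig2_s, sig2_n) mirror the symbols above, and the example hyperparameters and locations are hypothetical.

```python
import numpy as np

def sq_exp_cov(X1, X2, ell1, ell2, sig2_s, sig2_n):
    """Squared exponential covariance (1) between two sets of 2D locations
    (along-transect, cross-transect coordinates), one location per row."""
    d = (X1[:, None, :] - X2[None, :, :]) / np.array([ell1, ell2])
    sq = np.sum(d ** 2, axis=2)
    # Kronecker delta term: noise variance is added only for identical locations.
    return sig2_s * np.exp(-0.5 * sq) + sig2_n * (sq == 0)

def gp_posterior(Xs, zs, Xu, mu, ell1, ell2, sig2_s, sig2_n):
    """Posterior mean (2) and covariance (3) at unobserved locations Xu given
    measurements zs at sampled locations Xs, under a constant prior mean mu."""
    Kss = sq_exp_cov(Xs, Xs, ell1, ell2, sig2_s, sig2_n)
    Kus = sq_exp_cov(Xu, Xs, ell1, ell2, sig2_s, sig2_n)
    Kuu = sq_exp_cov(Xu, Xu, ell1, ell2, sig2_s, sig2_n)
    mu_post = mu + Kus @ np.linalg.solve(Kss, zs - mu)
    cov_post = Kuu - Kus @ np.linalg.solve(Kss, Kus.T)
    return mu_post, cov_post

def gaussian_entropy(cov):
    """Posterior joint entropy (4): 0.5 * log((2*pi*e)^kappa * |cov|)."""
    kappa = cov.shape[0]
    return 0.5 * (kappa * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

# Tiny example on part of a 5 x 30 grid with hypothetical hyperparameters.
rng = np.random.default_rng(0)
grid = np.array([(c, r) for c in range(30) for r in range(5)], dtype=float)
Xs, Xu = grid[:10], grid[10:20]
zs = 23.0 + 0.3 * rng.standard_normal(len(Xs))
mu_post, cov_post = gp_posterior(Xs, zs, Xu, mu=23.0,
                                 ell1=8.0, ell2=3.0, sig2_s=0.15, sig2_n=0.01)
print(mu_post.round(2))
print(gaussian_entropy(cov_post))  # independent of the sampled values zs
```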
Hence, we will focus on using the entropy-based measure of uncertainty in this paper. 3. ENTROPY-BASED PATH PLANNING Notations. Each planning stage i is associated with column i of the transect for i = 1 , . . . , n . In each stage i , the team of k robots samples from column i a total of k observations (each of which comprises a pair of a location and its measure- ment) that are denoted by a pair of vectors x i of k locations and Z x i of the corresponding random measurements. Let X i denote the set of all possible robots’ sampling locations x i in stage i . It can be observed that χ , |X 1 | = . . . = |X n | = r C k . We assume that the robots can deterministically (i.e., no stochasticity in motion) move from their current locations x i − 1 in column i − 1 to the next locations x i in column i . Let x i : j and Z x i : j denote vectors concatenating robots’ sam- pling locations x i , . . . , x j and concatenating corresponding random measurements Z x i , . . . , Z x j over stages i to j , re- spectively, and X i : j denote the set of all possible x i : j . Maximum Entropy Path Planning (MEPP). The work of [13] has proposed planning non-myopic observation paths x ∗ 1: n with maximum entropy (i.e., highest uncertainty): x ∗ 1: n = arg max x 1: n ∈X 1: n H ( Z x 1: n ) (5) that, as proven in an equivalence result, minimize the pos- terior entropy/uncertainty remaining in the unobserved lo- cations of the transect. Computing the maximum entropy paths x ∗ 1: n incurs O ( χ n ( kn ) 3 ) , which is exponential in the length n of planning horizon. To mitigate this computa- tional difficulty, an anytime heuristic search algorithm [7] is used to compute (5) approximately. However, its perfor- mance cannot be guaranteed. Furthermore, as reported in [14], when χ or n is large, its computed paths perform poorly even after incurring a huge amount of search time and space. Approximate MEPP ( m ) . To establish a trade-off be- tween active sensing performance and computational effi- ciency, the key idea is to exploit a property of the covariance function (1) that the spatial correlation of measurements between any two locations decreases exponentially with in- creasing distance between them. Intuitively, such a property makes the measurements Z x i to be observed next in col- umn i near-independent of the past distant measurements Z x 1: i − m − 1 observed from columns 1 to i − m − 1 (i.e., far from column i ) for a sufficiently large m by conditioning on the closer measurements Z x i − m : i − 1 observed in columns i − m to i − 1 (i.e., closer to column i ). Consequently, H ( Z x i | Z x 1: i − 1 ) can still be closely approximated by H ( Z x i | Z x i − m : i − 1 ) after assuming a m -th order Markov property, thus yielding the following approximation of the joint entropy H ( Z x 1: n ) in (5): H ( Z x 1: n ) = H ( Z x 1: m ) + ∑ n i = m +1 H ( Z x i | Z x 1: i − 1 ) ≈ H ( Z x 1: m ) + ∑ n i = m +1 H ( Z x i | Z x i − m : i − 1 ) . (6) The first equality is due to the chain rule for entropy [3]. Using (6), MEPP (5) can be approximated by the following stage-wise dynamic programming equations, which we call MEPP( m ): V i ( x i − m : i − 1 ) = max x i ∈X i H ( Z x i | Z x i − m : i − 1 ) + V i +1 ( x i − m +1: i ) V n ( x n − m : n − 1 ) = max x n ∈X n H ( Z x n | Z x n − m : n − 1 ) (7) for stage i = m + 1 , . . . , n − 1, each of which induces a corre- sponding optimal vector x E i of k locations given the optimal vector x E i − m : i − 1 obtained from previous stages i − m to i − 1 1 . 
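The recursion (7) can be implemented as backward dynamic programming over the χ^m possible histories of rows chosen in the last m columns, followed by the first-stage maximization over the first m columns, which is formalized in (8) right after this sketch. The following is our own single-robot (k = 1) illustration with hypothetical hyperparameters, intended only to show the structure of MEPP(m); the names mepp_m and cond_entropy are ours, and the paper's algorithm additionally handles k > 1 robots and reuses entropies across stages (cf. the proof of Theorem 1).

```python
import numpy as np
from itertools import product

# Hypothetical hyperparameters (in grid units), mirroring Section 2.2.
ELL1, ELL2, SIG2_S, SIG2_N = 8.0, 3.0, 0.15, 0.01

def cov(X1, X2):
    d = (np.asarray(X1, float)[:, None, :] - np.asarray(X2, float)[None, :, :]) / [ELL1, ELL2]
    sq = (d ** 2).sum(axis=2)
    return SIG2_S * np.exp(-0.5 * sq) + SIG2_N * (sq == 0)

def cond_entropy(x_new, x_hist):
    """Gaussian conditional entropy H(Z_{x_new} | Z_{x_hist}), cf. (4)."""
    Knn = cov(x_new, x_new)
    if x_hist:
        Khh, Knh = cov(x_hist, x_hist), cov(x_new, x_hist)
        Knn = Knn - Knh @ np.linalg.solve(Khh, Knh.T)
    return 0.5 * (len(x_new) * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Knn)[1])

def mepp_m(r, n, m):
    """Backward dynamic programming for MEPP(m) with a single robot (k = 1):
    in every column the robot picks one of r rows."""
    rows = range(r)
    V = {h: 0.0 for h in product(rows, repeat=m)}            # V_{n+1} = 0
    best = {}
    for i in range(n, m, -1):                                 # stages n, ..., m+1, cf. (7)
        V_new, best_i = {}, {}
        for hist in product(rows, repeat=m):                  # rows chosen in columns i-m..i-1
            x_hist = [(i - m + j, hist[j]) for j in range(m)]
            scores = [cond_entropy([(i, y)], x_hist) + V[hist[1:] + (y,)] for y in rows]
            V_new[hist], best_i[hist] = max(scores), int(np.argmax(scores))
        V, best[i] = V_new, best_i
    # First m columns: maximize joint entropy plus value-to-go, cf. (8).
    first = max(product(rows, repeat=m),
                key=lambda h: cond_entropy([(j + 1, h[j]) for j in range(m)], []) + V[h])
    path = list(first)
    for i in range(m + 1, n + 1):
        path.append(best[i][tuple(path[-m:])])
    return path                                               # row chosen in each column 1..n

print(mepp_m(r=5, n=30, m=2))
```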
Let the optimal observation paths of MEPP( m ) be denoted by x E 1: n that concatenates x E 1: m = arg max x 1: m ∈X 1: m H ( Z x 1: m ) + V m +1 ( x 1: m ) (8) for the first m stages and x E m +1 , . . . , x E n derived using (7) for the subsequent stages m + 1 to n . Our proposed MEPP( m ) algorithm generalizes that of [14] which is essentially MEPP(1). Theorem 1 (Time Complexity). Deriving x E 1: n of MEPP( m ) requires O ( χ m +1 [ n + ( km ) 3 ] ) time. The proof of Theorem 1 is given in Appendix A.1. Unlike MEPP which scales exponentially in the planning horizon length n , our MEPP( m ) algorithm scales linearly in n . Let ω 1 and ω 2 be the horizontal and vertical separation widths between adjacent grid locations, respectively, ` ′ 1 , ` 1 /ω 1 and ` ′ 2 , ` 2 /ω 2 denote the normalized horizontal and vertical length-scale components, respectively, and η , σ 2 n /σ 2 s . The following result bounds the loss in active sens- ing performance of the MEPP( m ) algorithm (i.e., (7) and (8)) relative to that of MEPP (5): Theorem 2 (Performance Guarantee). The paths x E 1: n are  -optimal in achieving the maximum entropy crite- rion, i.e., H ( Z x ∗ 1: n ) − H ( Z x E 1: n ) ≤  where  , [ k ( n − m )] 2 log { 1 + exp { − ( m + 1) 2 / (2 ` ′ 2 1 ) } 2 η (1 + η ) } . The proof of Theorem 2 is given in Appendix A.3. Theo- rem 2 reveals that the active sensing performance of MEPP( m ) 1 In fact, solving MEPP( m ) (7) yields a policy that, in each stage i , induces an optimal vector for every possible vector x i − m : i − 1 (including possible diverged paths from x E i − m : i − 1 due to external forces) obtained from previous m stages. can be improved by decreasing  , which is achieved using higher noise-to-signal ratio η (i.e., noisy, less intense fields), smaller number k of robots, shorter planning horizon length n , larger m , and/or lower spatial correlation ` ′ 1 along the transect. Two important implications result: (a) Increasing m trades off computational efficiency (Theorem 1) for better active sensing performance, and (b) if the spatial correlation of the anisotropic field along the transect is sufficiently low to maintain a relatively tight bound  such that only a small m is needed, then MEPP( m ) can exploit this spatial cor- relation structure to gain time efficiency while preserving near-optimal active sensing performance. In practice, it is often possible to obtain prior knowledge on a direction of low spatial correlation (refer to ocean and freshwater phe- nomena in Section 1 for examples) and align it with the horizontal axis of the transect. 4. MUTUAL INFORMATION-BASED PATH PLANNING Notations. Recall that the team of k robots selects k lo- cations x i to be sampled from column i of the transect for i = 1 , . . . , n . Let u i denote a vector of remaining r − k unobserved locations in column i and Z u i denote a vector of the corresponding random measurements. Let u i : j and Z u i : j denote vectors concatenating remaining unobserved lo- cations u i , . . . , u j and concatenating corresponding random measurements Z u i , . . . , Z u j over stages i to j , respectively. Maximum Mutual Information Path Planning (M 2 IPP). An alternative to MEPP is to plan non-myopic observation paths x ? 1: n that share the maximum mutual information with the remaining unobserved locations u ? 1: n of the transect: x ? 1: n = arg max x 1: n ∈X 1: n I ( Z x 1: n ; Z u 1: n ) I ( Z x 1: n ; Z u 1: n ) , H ( Z u 1: n ) − H ( Z u 1: n | Z x 1: n ) . 
(9) From (9), I ( Z x 1: n ; Z u 1: n ) measures the reduction in entropy/ uncertainty of the measurements Z u 1: n at the remaining un- observed locations u 1: n of the transect by observing the mea- surements Z x 1: n to be sampled along the paths x 1: n . So, the path planning of M 2 IPP (9) is equivalent to the selection of remaining unobserved locations with the largest entropy reduction (i.e., determining u ? 1: n ). This may be mistakenly perceived as the selection of remaining unobserved locations with the lowest uncertainty (i.e., minimizing posterior en- tropy term H ( Z u 1: n | Z x 1: n ) in (9)), which is exactly what the path planning of MEPP (5) can achieve, as mentioned in Section 3. Note, however, that the maximum mutual infor- mation paths (9) planned by M 2 IPP can in fact induce a very large prior entropy H ( Z u 1: n ) but not necessarily the small- est posterior entropy H ( Z u 1: n | Z x 1: n ). Consequently, MEPP and M 2 IPP exhibit different path planning behaviors and resulting active sensing performances, as shown empirically in Section 5. Similar to MEPP, M 2 IPP incurs exponential time in the length of planning horizon. To relieve this computational burden, we will describe an approximation algorithm for planning maximum mutual information paths next. Approximate M 2 IPP ( m ) . We will exploit the same prop- erty of the covariance function (1) as that used by MEPP( m ) (Section 3) to establish a trade-off between active sensing performance and computational efficiency for our M 2 IPP( m ) algorithm. However, this is not as straightforward to achieve as that to derive MEPP( m ) where a m -th order Markov property can simply be imposed on each posterior entropy term in (6). To illustrate this, using the chain rule for mu- tual information [3], I ( Z x 1: n ; Z u 1: n ) = I ( Z x 1: m ; Z u 1: n ) + n − m − 1 ∑ i = m +1 I ( Z x i ; Z u 1: n | Z x 1: i − 1 ) + I ( Z x n − m : n ; Z u 1: n | Z x 1: n − m − 1 ) , after which a m -th order Markov property is assumed to yield the following approximation: I ( Z x 1: n ; Z u 1: n ) ≈ I ( Z x 1: m ; Z u 1: n ) + n − m − 1 ∑ i = m +1 I ( Z x i ; Z u 1: n | Z x i − m : i − 1 ) + I ( Z x n − m : n ; Z u 1: n | Z x n − 2 m : n − m − 1 ) . (10) From (10), note that each conditional mutual information term I ( Z x i ; Z u 1: n | Z x i − m : i − 1 ) cannot be evaluated individu- ally because the remaining unobserved locations u 1: n of the transect (specifically, u 1: i − m − 1 and u i +1: n in the respective columns 1 to i − m − 1 and i + 1 to n ) cannot be determined simply by knowing the robots’ past and current sampling locations x i − m : i − 1 and x i in columns i − m to i . To resolve this, we exploit the same property of the co- variance function (1) as that used by MEPP( m ) (Section 3) again: It makes the measurements Z x i to be observed next in column i near-independent of the distant unobserved mea- surements Z u 1: i − m − 1 and Z u i + m +1: n in the respective columns 1 to i − m − 1 and i + m + 1 to n (i.e., far from column i ) for a sufficiently large m by conditioning on the closer measurements Z x i − m : i − 1 and Z u i − m : i + m in columns i − m to i + m (i.e., closer to column i ). As a result, each term I ( Z x i ; Z u 1: n | Z x i − m : i − 1 ) in (10) can be closely approximated by I ( Z x i ; Z u i − m : i + m | Z x i − m : i − 1 ) for i = m + 1 , . . . 
, n − m − 1: I ( Z x i ; Z u 1: n | Z x i − m : i − 1 ) = H ( Z x i | Z x i − m : i − 1 ) − H ( Z x i | Z x i − m : i − 1 , Z u 1: n ) ≈ H ( Z x i | Z x i − m : i − 1 ) − H ( Z x i | Z x i − m : i − 1 , Z u i − m : i + m ) = I ( Z x i ; Z u i − m : i + m | Z x i − m : i − 1 ) where the approximation follows from the above-mentioned conditional independence assumption and the equalities are due to the definition of conditional mutual information [3]. Similarly, I ( Z x 1: m ; Z u 1: n ) and I ( Z x n − m : n ; Z u 1: n | Z x n − 2 m : n − m − 1 ) in (10) are, respectively, approximated by I ( Z x 1: m ; Z u 1:2 m ) and I ( Z x n − m : n ; Z u n − 2 m : n | Z x n − 2 m : n − m − 1 ). Then, I ( Z x 1: n ; Z u 1: n ) ≈ I ( Z x 1: m ; Z u 1:2 m ) + n − m − 1 ∑ i = m +1 I ( Z x i ; Z u i − m : i + m | Z x i − m : i − 1 ) + I ( Z x n − m : n ; Z u n − 2 m : n | Z x n − 2 m : n − m − 1 ) = I ( Z x 1: m ; Z u 1:2 m ) + n − 1 ∑ i =2 m +1 I ( Z x i − m ; Z u i − 2 m : i | Z x i − 2 m : i − m − 1 ) + I ( Z x n − m : n ; Z u n − 2 m : n | Z x n − 2 m : n − m − 1 ) . (11) Using (11), M 2 IPP (9) can be approximated by the following stage-wise dynamic programming equations, which we call M 2 IPP( m ): U i ( x i − 2 m : i − 1 ) = max x i ∈X i I ( Z x i − m ; Z u i − 2 m : i | Z x i − 2 m : i − m − 1 ) + U i +1 ( x i − 2 m +1: i ) U n ( x n − 2 m : n − 1 ) = max x n ∈X n I ( Z x n − m : n ; Z u n − 2 m : n | Z x n − 2 m : n − m − 1 ) (12) for stage i = 2 m +1 , . . . , n − 1, each of which induces a corre- sponding optimal vector x M i of k locations given the optimal vector x M i − 2 m : i − 1 obtained from previous stages i − 2 m to i − 1 2 . Note that the term I ( Z x i − m ; Z u i − 2 m : i | Z x i − 2 m : i − m − 1 ) in each stage i can be evaluated now because the remaining unobserved locations u i − 2 m : i in columns i − 2 m to i can be determined since the robots’ past and current sampling lo- cations x i − 2 m : i − 1 and x i in the same columns are given (i.e., as input to U i and under the max operator, respectively). Let the optimal observation paths of M 2 IPP( m ) be denoted by x M 1: n that concatenates x M 1:2 m = arg max x 1:2 m ∈X 1:2 m I ( Z x 1: m , Z u 1:2 m ) + U 2 m +1 ( x 1:2 m ) (13) for the first 2 m stages and x M 2 m +1 , . . . , x M n derived using (12) for the subsequent stages 2 m + 1 to n . Theorem 3 (Time Complexity). Deriving x M 1: n of M 2 IPP( m ) requires O ( χ 2 m +1 [ n + 2( r (2 m + 1)) 3 ] ) time. The proof of Theorem 3 is given in Appendix B.1. Un- like M 2 IPP that scales exponentially in the planning horizon length n , our M 2 IPP( m ) algorithm scales linearly in n . The following result bounds the loss in active sensing per- formance of the M 2 IPP( m ) algorithm (i.e., (12) and (13)) relative to that of M 2 IPP (9): Theorem 4 (Performance Guarantee). The paths x M 1: n are ε -optimal in achieving the maximum mutual infor- mation criterion, i.e., I ( Z x ? 1: n ; Z u ? 1: n ) − I ( Z x M 1: n ; Z u M 1: n ) ≤ ε where ε , k ( n − 2 m ) [ rn + 1 2 k ( n − 2 m ) ] log      1+ exp { − ( m +1) 2 2 ` ′ 2 1 } 2 η (1 + η )      . The proof of Theorem 4 is given in Appendix B.3. As shown in Theorem 4, decreasing ε improves the active sensing per- formance of M 2 IPP( m ); this can be achieved in a similar manner to that for decreasing the loss bound  of MEPP( m ) (see paragraph after Theorem 2) since the two loss bounds ε and  are similar. In addition, smaller number r of sampling locations in each column decreases ε . 
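Both loss bounds (Theorems 2 and 4) contain the same correlation-decay factor log(1 + ξ²/(η(1+η))) with ξ ≜ exp{−(m+1)²/(2ℓ′₁²)} (cf. Appendix A), scaled by polynomial prefactors in k, n, m, and r. The small snippet below, our own illustration with hypothetical parameter values, shows how sharply this factor shrinks as m grows or as the normalized horizontal length-scale ℓ′₁ decreases, which is the behavior behind the implications discussed next.

```python
import numpy as np

def decay_term(m, ell1_norm, eta):
    """Correlation-decay factor shared by the loss bounds of Theorems 2 and 4:
    log(1 + xi^2 / (eta * (1 + eta))) with xi = exp(-(m+1)^2 / (2 * ell1_norm^2))."""
    xi = np.exp(-(m + 1) ** 2 / (2.0 * ell1_norm ** 2))
    return np.log1p(xi ** 2 / (eta * (1.0 + eta)))

# Hypothetical noise-to-signal ratio eta and normalized horizontal length-scales.
for ell1_norm in (0.5, 1.0, 2.0):
    print(ell1_norm, [round(float(decay_term(m, ell1_norm, eta=0.02)), 6) for m in range(1, 6)])
```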
M²IPP(m) shares the same implications as that of MEPP(m): (a) Increasing m trades off time efficiency (Theorem 3) for improved active sensing performance, and (b) M²IPP(m) can exploit a low spatial correlation ℓ′₁ of the anisotropic field along the transect to improve time efficiency (i.e., only requiring a small m) while preserving near-optimal active sensing performance (i.e., still maintaining a relatively tight bound ε).

5. EXPERIMENTS AND DISCUSSION
This section evaluates the active sensing performance and computational efficiency of the MEPP(m) (i.e., (7) and (8)) and M²IPP(m) (i.e., (12) and (13)) algorithms empirically on two real-world datasets: (a) May 2009 temperature field data of Panther Hollow Lake in Pittsburgh, PA spatially distributed over a 25 m by 150 m transect that is discretized into a 5 × 30 grid [17], and (b) June 2009 plankton density field data of Chesapeake Bay spatially distributed over a 314 m by 1765 m transect that is discretized into an 8 × 45 grid [5]. These environmental phenomena are modeled by GPs with hyperparameters (i.e., horizontal and vertical length-scales, signal and noise variances) (Section 2.2) learned using maximum likelihood estimation (MLE) [20]: (a) ℓ1 = 40.45 m, ℓ2 = 16.00 m, σ_s² = 0.1542, and σ_n² = 0.0036 for the temperature field, and (b) ℓ1 = 27.53 m, ℓ2 = 134.64 m, σ_s² = 2.152, and σ_n² = 0.041 for the plankton density field. It can be observed that the temperature and plankton density fields have low noise-to-signal ratios η of 0.023 and 0.019, respectively. Also, though both fields are observed to be highly anisotropic, the spatial correlation of the temperature field is much higher along the transect than perpendicular to it. According to Theorems 2 and 4, such field conditions lead to loose performance loss bounds for both algorithms, which does not necessarily imply their poor performance. So, the empirical evaluation here complements our theoretical results by assessing their performance-efficiency trade-off (i.e., by varying m) under these less favorable field conditions. To further investigate our algorithms' trade-off behaviors under different horizontal and vertical spatial correlations, the corresponding length-scales ℓ1 and ℓ2 of the original temperature field (Fig. 2d) are reduced and fixed to produce three other modified fields (Figs. 2a, 2b, 2c) with the signal and noise variances σ_s² and σ_n² learned using MLE.

² Similar to MEPP(m), solving M²IPP(m) (12) yields a policy that, in each stage i, induces an optimal vector for every possible vector x_{i−2m:i−1} (including possible diverged paths from x^M_{i−2m:i−1}) obtained from the previous 2m stages.

[Figure 2: Temperature fields (measured in °C) discretized into 5 × 30 grids with varying horizontal and vertical length-scales: (a) ℓ1 = 5 m, ℓ2 = 5 m; (b) ℓ1 = 5 m, ℓ2 = 16 m; (c) ℓ1 = 40.45 m, ℓ2 = 5 m; (d) ℓ1 = 40.45 m, ℓ2 = 16 m.]

Comparison with Active Sensing Algorithms. The performance of our proposed algorithms is compared to that of state-of-the-art information-theoretic path planning algorithms for active sensing: The work of [13] has proposed the following greedy maximum entropy path planning (gMEPP) algorithm:

V^g_i(x_{1:i−1}) = max_{x_i ∈ X_i} H(Z_{x_i} | Z_{x_{1:i−1}})   (14)

for stage i = 1, . . .
, n , each of which induces a corresponding optimal vector x E i of k locations given the optimal vector x E 1: i − 1 obtained from previous stages 1 to i − 1. A greedy maximum mutual information path planning (gM 2 IPP) al- gorithm is devised by [8] as follows: U g i ( x 1: i − 1 ) = max x i ∈X i I ( Z x 1: i ; Z x 1: i ) (15) for stage i = 1 , . . . , n , each of which induces a corresponding optimal vector x M i of k locations given the optimal vector x M 1: i − 1 obtained from previous stages 1 to i − 1, and x 1: i denotes a vector of all sampling locations in the domain D excluding those of x 1: i . As mentioned earlier in Section 3, the work of [14] has developed MEPP(1), which is a special case of our MEPP( m ) algorithm. In contrast to our MEPP( m ) and M 2 IPP( m ) algorithms that scale linearly in the length n of planning horizon (The- orems 1 and 3), deriving x E 1: n of gMEPP and x M 1: n of gM 2 IPP incurs quartic time in n . Hence, if the required value of m is sufficiently small, then MEPP( m ) and M 2 IPP( m ) can be more efficient than the greedy algorithms, as shown below. Performance Metrics. The tested algorithms are eval- uated using three different metrics: The (a) entropy met- ric EN( x 1: n ) , H ( Z u 1: n | Z x 1: n ) and (b) mutual information metric MI( x 1: n ) , I ( Z x 1: n ; Z u 1: n ) measure, respectively, the Table 1: Comparison of EN ( x 1: n ) , MI ( x 1: n ) , and ER ( x 1: n ) ( × 10 − 5 ) performance for different temperature fields shown in Fig. 2 with varying number of robots. For our proposed M 2 IPP ( m ) and MEPP ( m ) algorithms, every performance result is preceded by the value of m (in round brackets) used. EN( x 1: n ) MI( x 1: n ) ER( x 1: n ) 1 robot Field Field Field Algorithm a b c d a b c d a b c d gM2IPP: x M 1: n [8] -64.4 -123.9 -173.3 -182.2 27.9 48.4 46.0 39.5 1.764 0.581 0.088 0.042 gMEPP: x E 1: n [13] -64.8 -128.4 -173.3 -182.4 26.5 44.7 46.0 39.5 2.792 0.572 0.077 0.037 M 2 IPP ( m ): x M 1: n (1) -64.5 (1) -123.9 (1) -167.2 (1) -182.0 (1) 27.9 (1) 48.4 (1) 39.6 (1) 39.4 (1) 1.764 (1) 0.581 (1) 0.488 (1) 0.049 (2) -173.2 (2) 45.8 (2) 0.110 (2) 0.042 (3) 0.034 MEPP ( m ): x E 1: n (1) -64.8 (1) -128.4 (1) -161.2 (1) -180.4 (1) 23.9 (1) 44.7 (1) 33.2 (1) 36.9 (1) 5.115 (1) 0.572 (1) 3.765 (1) 0.757 (2) -64.9 (2) -167.2 (2) -182.4 (2) 26.3 (2) 39.6 (2) 39.5 (2) 2.315 (2) 0.501 (2) 0.026 (3) -171.6 (3) 44.2 (3) 2.080 (3) 0.241 (4) -173.4 (4) 46.1 (4) 0.068 2 robots Field Field Field Algorithm a b c d a b c d a b c d gM2IPP: x M 1: n [8] -57.8 -100.5 -132.9 -138.0 41.7 62.0 45.8 36.9 1.153 0.265 0.019 0.016 gMEPP: x E 1: n [13] -59.8 -112.2 -132.9 -138.8 41.2 55.8 45.9 36.2 0.521 0.439 0.033 0.018 M 2 IPP ( m ): x M 1: n (1) -57.8 (1) -100.5 (1) -132.9 (1) -138.2 (1) 41.2 (1) 62.0 (1) 45.9 (1) 36.9 (1) 0.605 (1) 0.265 (1) 0.020 (1) 0.018 (2) 41.8 (2) 0.014 MEPP ( m ): x E 1: n (1) -59.8 (1) -113.0 (1) -129.3 (1) -138.4 (1) 41.6 (1) 56.4 (1) 41.8 (1) 36.9 (1) 0.662 (1) 0.378 (1) 0.286 (1) 0.012 (2) -60.0 (2) -132.9 (2) 45.9 (2) 0.018 3 robots Field Field Field Algorithm a b c d a b c d a b c d gM2 IPP: x M 1: n [8] -46.5 -80.5 -89.5 -92.8 40.8 61.3 41.4 31.6 0.272 0.012 0.018 0.008 gMEPP: x E 1: n [13] -46.3 -80.6 -89.5 -93.2 40.5 60.6 41.3 28.6 0.257 0.024 0.017 0.009 M 2 IPP ( m ): x M 1: n (1) -46.5 (1) -72.0 (1) -89.4 (1) -92.1 (1) 40.8 (1) 60.0 (1) 38.8 (1) 32.0 (1) 0.272 (1) 0.123 (1) 0.016 (1) 0.008 (2) -89.5 (2) 41.3 (2) 0.229 (2) 0.014 MEPP ( m ): x E 1: n (1) -45.9 (1) -81.3 (1) -89.4 (1) -93.5 (1) 40.2 (1) 61.6 (1) 38.7 
(1) 28.2 (1) 0.231 (1) 0.014 (1) 0.013 (1) 0.007 (2) -46.5 (2) 40.8 (4) 41.1 (3) 28.6 (4) 29.0 posterior entropy/uncertainty and the reduction in entropy/ uncertainty at the remaining unobserved locations u 1: n of the transect given the observation paths x 1: n . The differ- ence between the entropy and mutual information metrics has been explained in the paragraph after (9) in Section 4. The (c) ER( x 1: n ) , || z u 1: n − μ u 1: n | x 1: n || 2 2 / { μ 2 n ( r − k ) } metric measures the mean-squared relative prediction error resulting from using the posterior mean μ u | x 1: n (2) to pre- dict the measurements at the remaining n ( r − k ) unobserved locations u 1: n of the transect given the measurements sam- pled along the observation paths x 1: n where μ = 1 > z u 1: n / { n ( r − k ) } . It has an advantage over the two information- theoretic metrics of using ground truth measurements to evaluate if the phenomenon is being predicted accurately. However, unlike the EN( x 1: n ) and MI( x 1: n ) metrics that ac- count for the spatial correlation between measurements at the unobserved locations u 1: n , the ER( x 1: n ) metric assumes conditional independence between them. In contrast to the ER( x 1: n ) metric, the EN( x 1: n ) and MI( x 1: n ) metrics conse- quently do not overestimate their uncertainty. 5.1 Temperature Field Data Table 1 shows the results of EN( x 1: n ), MI( x 1: n ), and ER( x 1: n ) performance of tested algorithms for temperature fields with different horizontal and vertical length-scales (Fig. 2) and with varying number of robots. For our proposed M 2 IPP( m ) and MEPP( m ) algorithms, the results are reported in an in- creasing order of m until the performance has stabilized. It can be observed from Table 1 that MEPP( m ) with m > 1 or M 2 IPP( m ) often outperforms MEPP(1) [14] in the three metrics, as discussed and explained later. Note that every increment of m increases the length of history of sampling locations considered in each stage by two for M 2 IPP( m ) in- stead of by one for MEPP( m ); this can be seen from the inputs to U i (12) and V i (7), respectively. The observations of the results are detailed in the rest of this subsection. 5.1.1 Entropy Metric EN( x 1: n ) As expected, the entropy-based MEPP( m ) and gMEPP algorithms generally perform better than or at least as well 1 2 3 4 5 6 7 10 −1 10 0 10 1 10 2 10 3 10 4 m Time(s) 1 2 3 4 5 10 −1 10 0 10 1 10 2 10 3 10 4 m Time(s) gMEPP gM2IPP MEPP(m) M2IPP(m) 1 2 3 4 5 10 −1 10 0 10 1 10 2 10 3 10 4 m Time(s) (a) 1 robot. (b) 2 robots. (c) 3 robots. Figure 3: Graphs of incurred time by different active sensing algorithms vs. m for temperature fields with varying number of robots. as the mutual information-based M 2 IPP( m ) and gM 2 IPP algorithms in this metric. For fields a, b, and d (i.e., of small ` 1 or large ` 2 ) with any number of robots, MEPP( m ) can produce EN( x E 1: n ) values lower than or comparable to that achieved by gMEPP and gM 2 IPP using small values of m (i.e., m = 1 or 2), hence in- curring 1 to 4 orders of magnitude less computational time, as shown in Fig. 3. 
This can be explained by one of the following reasons: (a) A low spatial correlation along the transect cannot be exploited by gMEPP and gM 2 IPP, which consider the entire history of past measurements for improv- ing active sensing performance; (b) a high correlation per- pendicular to the transect can be exploited by MEPP( m ) for better active sensing performance; and (c) unlike the greedy gMEPP and gM 2 IPP algorithms, MEPP( m ) is capable of non-myopic planning to improve active sensing performance. For field c (i.e., of large ` 1 and small ` 2 ) with 1 robot, MEPP( m ) cannot exploit the low spatial correlation perpen- dicular to the transect for improving active sensing perfor- mance. Therefore, it needs to raise the value of m up to 4 in order to better exploit the high spatial correlation along the transect. Consequently, MEPP( m ) can achieve EN( x E 1: n ) performance comparable to that achieved by gMEPP and gM 2 IPP while incurring similar computational time as gMEPP and about 2 orders of magnitude less time than gM 2 IPP. In- creasing the number of robots allows MEPP( m ) to achieve EN( x E 1: n ) performance comparable to that of gMEPP and gM 2 IPP using smaller values of m (i.e., m = 1 or 2), hence incurring 1 to 4 orders of magnitude less time. Figure 4: Plankton density (chl-a) field (measured in mg m − 3 ) discretized into a 8 × 45 grid. 5.1.2 Mutual Information Metric MI( x 1: n ) The mutual information-based M 2 IPP( m ) and gM 2 IPP algorithms often perform better than or at least as well as the entropy-based MEPP( m ) and gMEPP in this metric. For fields a, b, and d (i.e., of small ` 1 or large ` 2 ) with any number of robots, M 2 IPP( m ) can generally yield MI( x M 1: n ) values higher than or comparable to that achieved by gM 2 IPP and gMEPP using a small m value of 1, hence incurring less computational time (in particular, about 2 orders of magni- tude less time than gM 2 IPP), as shown in Fig. 3. This can be explained by the same reasons as that discussed previ- ously in Section 5.1.1. For field c (i.e., of large ` 1 and small ` 2 ) with 1 or 3 robots, M 2 IPP( m ) cannot exploit the low spatial correlation per- pendicular to the transect for improving active sensing per- formance. So, it has to increase the value of m to 2 in or- der to better exploit the high correlation along the transect. As a result, M 2 IPP( m ) can achieve MI( x M 1: n ) performance comparable to that achieved by gM 2 IPP and gMEPP while incurring less time with 1 robot and slightly more time with 3 robots than gM 2 IPP. With 2 robots, m = 1 suffices for M 2 IPP( m ) to achieve MI( x M 1: n ) performance comparable to that achieved by gM 2 IPP and gMEPP while incurring less time (Fig. 3). A computationally cheaper alternative for ac- tive sensing of field c is to consider using MEPP( m ) with larger m : When the values of m are raised to 4, 2, and 4 for the respective 1-, 2-, and 3-robot cases, it can pro- duce MI( x E 1: n ) performance comparable to that achieved by gM 2 IPP and gMEPP while incurring similar or less time. 5.1.3 Prediction Error Metric ER( x 1: n ) For field c (i.e., of large ` 1 and small ` 2 ) with any num- ber of robots, MEPP( m ) and M 2 IPP( m ) cannot exploit the low spatial correlation perpendicular to the transect for im- proving active sensing performance. Hence, their values of m need to be raised in order to exploit the high corre- lation along the transect. Compared to M 2 IPP( m ), it is computationally cheaper (Fig. 
3) and offers greater perfor- mance improvement (Table 1) to increase the value of m of MEPP( m ), which can then produce ER( x E 1: n ) values lower than that achieved by gMEPP and gM 2 IPP while incurring similar computational time to gMEPP and about 2 orders of magnitude less time than gM 2 IPP with 1 robot and 1 to 4 orders of magnitude less time than both with 2 or 3 robots. For field d (i.e., of large ` 1 and large ` 2 ) with any num- ber of robots, MEPP( m ) can now exploit the high spatial correlation perpendicular to the transect for better active sensing performance. As a result, MEPP( m ) can yield bet- ter ER( x E 1: n ) performance than gMEPP and gM 2 IPP using smaller values of m (i.e., m = 1 or 2), hence incurring 1 to 4 orders of magnitude less time. For fields a and b (i.e., of small ` 1 ) with 1 or 2 robots, M 2 IPP( m ) can produce ER( x M 1: n ) values lower than or com- parable to that achieved by gM 2 IPP and gMEPP using a small m value of 1, hence incurring less time (in particu- lar, about 2 orders of magnitude less time than gM 2 IPP), Table 2: Comparison of EN ( x 1: n ) , MI ( x 1: n ) , and ER ( x 1: n ) ( × 10 − 2 ) performance for plankton density field shown in Fig. 4 with varying number of robots. EN( x 1: n ) MI( x 1: n ) ER( x 1: n ) No. of robots k No. of robots k No. of robots k Algorithm 1 2 3 1 2 3 1 2 3 gM2IPP: x M 1: n [8] 124 55 28 83 162 201 0.65 0.09 0.01 gMEPP: x E 1: n [13] 117 42 -6 65 126 184 1.35 0.44 0.04 M 2 IPP ( m ): x M 1: n 124 55 28 83 162 201 0.65 0.09 0.01 MEPP ( m ): x E 1: n 117 41 -8 65 128 187 1.35 0.41 0.01 as shown in Fig. 3. Increasing to 3 robots allows MEPP( m ) to achieve ER( x E 1: n ) performance better than or compara- ble to that of gMEPP and gM 2 IPP using a small m value of 1, hence incurring 3 to 4 orders of magnitude less time (Fig. 3). These can be explained by the same reasons as that discussed previously in Section 5.1.1. 5.2 Plankton Density Field Data Table 2 shows the results of EN( x 1: n ), MI( x 1: n ), and ER( x 1: n ) performance of tested algorithms for the plankton density field (Fig. 4) with varying number of robots. For our pro- posed M 2 IPP( m ) and MEPP( m ) algorithms, the results are only reported for m = 1, at which their performance has al- ready stabilized. As mentioned earlier in the first paragraph of Section 5, the plankton density field exhibits low and high spatial correlations, respectively, along and perpendicular to the transect, which resemble that of temperature field b. The observations are as follows: With any number of robots, MEPP(1) can produce EN( x E 1: n ) values lower than that achieved by gMEPP and gM 2 IPP while incurring 2 to 5 orders of magnitude less time, as shown in Fig. 5. On the other hand, M 2 IPP(1) can yield MI( x M 1: n ) and ER( x M 1: n ) performance better than or comparable to that achieved by gM 2 IPP and gMEPP while incurring less time (in particu- lar, about 2 orders of magnitude less time than gM 2 IPP) (Fig. 5). These can be explained by the same reasons as that discussed previously in Section 5.1.1. 5.3 Summary of Test Results The observations of the above results are summarized be- low: For anisotropic fields with low spatial correlation along the transect (e.g., temperature fields a and b and plankton density field), MEPP( m ) can perform better or at least as well as gMEPP and gM 2 IPP in the prediction error (i.e., with 3 robots) and entropy metrics using small m values of 1 or 2, hence incurring 1 to 4 orders of magnitude less time. 
M 2 IPP( m ) can generally perform likewise in the prediction error (i.e., with 1 or 2 robots) and mutual information met- rics using a small m value of 1, hence incurring less time as well (in particular, 2 orders of magnitude less time than gM 2 IPP). These observations are previously explained in Section 5.1.1. Note that they corroborate the second impli- cations of Theorems 2 and 4 on the performance guarantees of MEPP( m ) and M 2 IPP( m ). For anisotropic fields with high spatial correlation along the transect (e.g., temperature fields c and d), a larger m value is needed in order for MEPP( m ) and M 2 IPP( m ) to ex- ploit it if the correlation perpendicular to the transect is low (i.e., field c). Compared to M 2 IPP( m ), it is computationally cheaper to increase the value of m of MEPP( m ) such that it performs better or at least as well as gMEPP and gM 2 IPP in all three metrics while incurring similar time to gMEPP and about 2 orders of magnitude less time than gM 2 IPP with 1 robot and often 1 to 4 orders of magnitude less time than both with 2 or 3 robots. If the correlation perpendicular to 1 2 3 4 10 −1 10 0 10 1 10 2 10 3 10 4 10 5 10 6 m Time(s) 1 2 3 10 −1 10 0 10 1 10 2 10 3 10 4 10 5 10 6 m Time(s) 1 2 3 10 0 10 1 10 2 10 3 10 4 10 5 10 6 m Time(s) gMEPP gM2IPP MEPP(m) M2IPP(m) (a) 1 robot. (b) 2 robots. (c) 3 robots. Figure 5: Graphs of incurred time by different active sensing algorithms vs. m for plankton density field with varying number of robots. the transect is high (i.e., field d) instead, it can be exploited by MEPP( m ) and M 2 IPP( m ) to improve active sensing per- formance and consequently allow m to be reduced to small values of 1 or 2: MEPP( m ) can perform better or, if not, at least as well as gMEPP and gM 2 IPP in the prediction error and entropy metrics while incurring 1 to 4 orders of magnitude less time. M 2 IPP( m ) can perform likewise in the mutual information metric while incurring less time (in particular, 2 orders of magnitude less time than gM 2 IPP). 6. CONCLUSION This paper describes two principled information-theoretic path planning algorithms based on entropy and mutual in- formation criteria (respectively, MEPP( m ) and M 2 IPP( m )) for active sensing of GP-based anisotropic fields. Two im- portant practical implications result from the theoretical guarantees on the active sensing performance of our algo- rithms (Theorems 2 and 4): Increasing m trades off com- putational efficiency (Theorems 1 and 3) for better active sensing performance, and our algorithms can exploit a low spatial correlation along the transect to improve time effi- ciency (i.e., only needing a small m ) while preserving near- optimal active sensing performance. This motivates the use of prior knowledge, if available, on a direction of low spatial correlation in order to align it with the horizontal axis of the transect. Empirical evaluation of real-world anisotropic temperature and plankton density field data reveals that our algorithms can perform better or at least as well as gMEPP and gM 2 IPP while often incurring a few orders of magnitude less time. In particular, it can be observed that anisotropic fields with low spatial correlation along the transect or high correlation perpendicular to the transect allow our algorithms to perform well using small values of m , thus yielding significant computational gain over gMEPP and gM 2 IPP. 
To perform well in a field with high correla- tion along the transect and low correlation perpendicular to the transect (i.e., less favorable conditions), our algorithms have to increase the value of m or the number of robots but can still achieve comparable or better time efficiency than gMEPP and gM 2 IPP. 7. REFERENCES [1] J. B. Boisvert and C. V. Deutsch. Modeling locally varying anisotropy of CO 2 emissions in the United States. Stoch. Environ. Res. Risk Assess. , 25:1077–1084, 2011. [2] J. Chen, K. H. Low, C. K.-Y. Tan, A. Oran, P. Jaillet, J. M. Dolan, and G. S. Sukhatme. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI , pages 163–173, 2012. [3] T. Cover and J. Thomas. Elements of Information Theory . John Wiley & Sons, NY, 1991. [4] A. Das and D. Kempe. Algorithms for subset selection in linear regression. In Proc. STOC , pages 45–54, 2008. [5] J. M. Dolan, G. Podnar, S. Stancliff, K. H. Low, A. Elfes, J. Higinbotham, J. C. Hosler, T. A. Moisan, and J. Moisan. Cooperative aquatic sensing using the telesupervised adaptive ocean sensor fleet. In Proc. SPIE Conference on Remote Sensing of the Ocean, Sea Ice, and Large Water Regions , volume 7473, 2009. [6] D. Kitsiou, G. Tsirtsis, and M. Karydis. Developing an optimal sampling design: A case study in a coastal marine ecosystem. Environmental Monitoring and Assessment , 71(1):1–12, 2001. [7] R. Korf. Real-time heuristic search. Artif. Intell. , 42(2-3):189–211, 1990. [8] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR , 9:235–284, 2008. [9] P. Legendre and M.-J. Fortin. Spatial pattern and ecological analysis. Vegetatio , 80:107–138, 1989. [10] N. E. Leonard, D. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R. Davis. Collective motion, sensor networks and ocean sampling. Proc. IEEE , 95(1):48–74, 2007. [11] K. H. Low, J. Chen, J. M. Dolan, S. Chien, and D. R. Thompson. Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In Proc. AAMAS , pages 105–112, 2012. [12] K. H. Low, J. M. Dolan, and P. Khosla. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS , pages 23–30, 2008. [13] K. H. Low, J. M. Dolan, and P. Khosla. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS , pages 233–240, 2009. [14] K. H. Low, J. M. Dolan, and P. Khosla. Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS , pages 753–760, 2011. [15] K. H. Low, G. J. Gordon, J. M. Dolan, and P. Khosla. Adaptive sampling for multi-robot wide-area exploration. In Proc. IEEE ICRA , 2007. [16] D. McGrath, C. Zhang, and O. T. Carton. Geostatistical analyses and hazard assessment on soil lead in Silvermines area, Ireland. Environmental Pollution , 127:239–248, 2004. [17] G. Podnar, J. M. Dolan, K. H. Low, and A. Elfes. Telesupervised remote surface water quality sensing. In Proc. IEEE Aerospace Conference , 2010. [18] C. Prudhomme and D. W. Reed. Mapping extreme rainfall in a mountainous region using geostatistical techniques: A case study in Scotland. Int. J. Climatol. , 19:1337–1356, 1999. [19] N. Rabesiranana, M. Rasolonirina, A. F. Solonjara, and R. Andriambololona. Investigating the spatial anisotropy of soil radioactivity in the region of Vinaninkarena, Antsirabe - Madagascar. 
In Proc. 4th High-Energy Physics International Conference , 2009. [20] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning . MIT Press, 2006. [21] D. L. Rudnick, R. E. Davis, C. C. Eriksen, D. Fratantoni, and M. J. Perry. Underwater gliders for ocean research. Mar. Technol. Soc. J. , 38:73–84, 2004. [22] S. Sokolov and S. R. Rintoul. Some remarks on interpolation of nonstationary oceanographic fields. J. Atmos. Oceanic Technol. , 16:1434–1449, 1999. [23] J. C. Taylor, J. S. Thompson, P. S. Rand, and M. Fuentes. Sampling and statistical considerations for hydroacoustic surveys used in estimating abundance of forage fishes in reservoirs. North American Journal of Fisheries Management , 25:73–85, 2005. [24] D. R. Thompson and D. Wettergreen. Intelligent maps for autonomous kilometer-scale science survey. In Proc. i-SAIRAS , 2008. [25] R. Webster and M. Oliver. Geostatistics for Environmental Scientists . John Wiley & Sons, Inc., NY, 2nd edition, 2007. [26] J. G. Zhang, H. S. Chen, Y. R. Su, X. L. Kong, W. Zhang, Y. Shi, H. B. Liang, and G. M. Shen. Spatial variability and patterns of surface soil moisture in a field plot of karst area in southwest China. Plant Soil. Environ. , 57(9):409–417, 2011. APPENDIX Notations. Let σ 2 x , σ xx and σ 2 x | s , Σ xx | s in (3) for any location x . Let ξ , exp {− ( m + 1) 2 / (2 ` ′ 2 1 ) } . A. ENTROPY-BASED PATH PLANNING A.1 Proof of Theorem 1 Given each vector x i − m : i − 1 , the time needed to evalu- ate the posterior entropy H ( Z x i | Z x i − m : i − 1 ) over all possible x i ∈ X i is χ × O (( km ) 3 ) = O ( χ ( km ) 3 ). The time needed to perform this over all χ m possible vectors x i − m : i − 1 in each stage i is χ m × O ( χ ( km ) 3 ) = O ( χ m +1 ( km ) 3 ). Since the covariance function is stationary (i.e., it only depends on the distance between locations), the entropies calculated in a stage are the same as those in every other stage. The time needed to propagate the optimal values from stages n − 1 to m + 1 is O ( χ m +1 ( n − m − 1)). To obtain the optimal vector x E 1: m , the joint entropy H ( Z x 1: m ) has to be evaluated over all possible vectors x 1: m . Hence, the time needed to solve for the optimal vector x E 1: m is O ( χ m ( km ) 3 ). As a result, the time complexity of the MEPP( m ) algorithm is O ( χ m +1 [( n − m − 1) + ( km ) 3 ] + χ m ( km ) 3 ) = O ( χ m +1 [ n + ( km ) 3 ]). A.2 Proof of Some Lemmas Before giving the proof of Theorem 2, the following lem- mas are needed. Lemma 5. For any observation paths x 1: n , H ( Z x E 1: m ) + n ∑ i = m +1 H ( Z x E i | Z x E i − m : i − 1 ) ≥ H ( Z x 1: m ) + n ∑ i = m +1 H ( Z x i | Z x i − m : i − 1 ) . Proof. Using (7), V m +1 ( x 1: m ) = max x m +1 ∈X m +1 H ( Z x m +1 | Z x 1: m ) + V m +2 ( x 2: m +1 ) = max x m +1 ∈X m +1 H ( Z x m +1 | Z x 1: m ) + max x m +2 ∈X m +2 H ( Z x m +2 | Z x 2: m +1 ) + V m +3 ( x 3: m +2 ) = max x m +1 ∈X m +1 ,x m +2 ∈X m +2 H ( Z x m +1 | Z x 1: m ) + H ( Z x m +2 | Z x 2: m +1 ) + V m +3 ( x 3: m +2 ) . . . = max x m +1 ∈X m +1 ,...,x n ∈X n n ∑ i = m +1 H ( Z x i | Z x i − m : i − 1 ) . (16) Given x 1: m , the vectors x m +1 , . . . , x n that maximize the term ∑ n i = m +1 H ( Z x i | Z x i − m : i − 1 ) in (16) can be obtained. Using (8), the observation paths x 1: n that maximize H ( Z x 1: m ) + ∑ n i = m +1 H ( Z x i | Z x i − m : i − 1 ) can be obtained. Therefore, Lemma 5 holds. Lemma 6. In a GP, given an unobserved location y and a vector A of sampled locations, σ 2 y | A ≥ σ 2 n . Proof. 
We know that σ 2 y | A ≥ 0. So, if σ 2 n > 0, σ 2 y | A = σ 2 s + σ 2 n − Σ yA Σ − 1 AA Σ Ay ≥ 0 (17) where the covariance components in the diagonal of Σ AA are σ 2 s + σ 2 n . On the other hand, if σ 2 n = 0, σ 2 y | A = σ 2 s − Σ yA Σ − 1 BB Σ Ay ≥ 0 (18) where Σ BB , Σ AA − σ 2 n I . Let A , Σ AA , B , Σ BB , E , σ 2 n I, Y , Σ Ay , Y > , Σ yA , and W , Σ − 1 AA Σ Ay = A − 1 Y . Then, Y = AW . W > EW + W > E > B − 1 EW ≥ 0 (19) ⇒ W > B > B − 1 EW + W > E > B − 1 EW ≥ 0 (20) ⇒ W > ( B + E ) > B − 1 EW ≥ 0 ⇒ W > A > B − 1 EW + W > A > W ≥ W > A > W ⇒ W > A > ( B − 1 E + I ) W ≥ W > A > W ⇒ W > A > B − 1 ( E + B ) W ≥ W > A > A − 1 AW ⇒ ( AW ) > B − 1 AW ≥ ( AW ) > A − 1 AW ⇒ Y > B − 1 Y ≥ Y > A − 1 Y ⇒ Σ yA Σ − 1 BB Σ Ay ≥ Σ yA Σ − 1 AA Σ Ay . (21) To derive (19), since W is a vector and E = σ 2 n I , W > EW ≥ 0. Since B is a covariance matrix that is invertible and positive semi-definite, B − 1 is positive semi-definite. Hence, W > E > B − 1 EW ≥ 0 and (19) therefore holds. Since B is symmetric, B > = B . Hence, (20) can be obtained from (19). The rest of the derivation from (20) to (21) is straightfor- ward. From (17), σ 2 y | A = σ 2 s + σ 2 n − Σ yA Σ − 1 AA Σ Ay ≥ σ 2 s + σ 2 n − Σ yA Σ − 1 BB Σ Ay (22) ≥ σ 2 n . (23) Note that (22) and (23) follow from (21) and (18), respec- tively. Therefore, Lemma 6 holds. Lemma 7. H ( Z x i − m − 1 | Z x i − m : i − 1 ) − H ( Z x i − m − 1 | Z x i − m : i − 1 , Z x i ) ≤ k 2 log { 1 + ξ 2 η (1 + η ) } . Proof. We will first prove for the single-robot case. This result will be used later for the multi-robot case. Let x A , x i − m : i − 1 and x p , x i − m − 1 . σ 2 x i − m − 1 | x i − m : i − 1 − σ 2 x i − m − 1 | ( x i − m : i − 1 ,x i ) = σ 2 x p | x A − σ 2 x p | ( x A ,x i ) ≤ σ 2 x p − σ 2 x p | x i = σ 2 x p − ( σ 2 x p − σ x p x i σ x i x p σ 2 x i ) ≤ σ 4 s exp { − ( m +1) 2 2 ` ′ 2 1 } 2 σ 2 s + σ 2 n = σ 2 s ξ 2 1 + η . (24) The first inequality follows from the property that variance reduction is submodular [4] in many practical cases (e.g., further conditioning on Z x A does not make Z x p and Z x i more correlated). To intuitively understand this notion of submodularity, observing a new location x i will reduce the variance at location x p more if few or no observations are made, and less if many observations are already taken (e.g., at locations x A ). The second equality is due to (3). The second inequality follows from the fact that the distance between any two locations from stage i and stage i − m − 1 is at least ω 1 ( m + 1). So, σ x p x i ≤ σ 2 s exp {− ( m + 1) 2 / (2 ` ′ 2 1 ) } . H ( Z x i − m − 1 | Z x i − m : i − 1 ) − H ( Z x i − m − 1 | Z x i − m : i − 1 , Z x i ) = H ( Z x p | Z x A ) − H ( Z x p | Z x A , Z x i ) = 1 2 log σ 2 x p | x A σ 2 x p | ( x A ,x i ) (25) ≤ 1 2 log σ 2 x p | ( x A ,x i ) + σ 2 s ξ 2 1+ η σ 2 x p | ( x A ,x i ) (26) = 1 2 log { 1 + σ 2 s ξ 2 σ 2 x p | ( x A ,x i ) (1 + η ) } ≤ 1 2 log { 1 + σ 2 s ξ 2 σ 2 n (1 + η ) } (27) ≤ log { 1 + ξ 2 η (1 + η ) } . (28) Using (4), (25) can be obtained. Inequality (26) results from (24). Inequality (27) can be obtained using Lemma 6. We will now prove for the k -robot case where k > 1. Then, vectors x i and x p comprise k locations each. Let x j i ( x j p ) de- note the j -th location component in vector x i ( x p ). Let x j : t i ( x j : t p ) denote a vector comprising the j -th to t -th location components in vector x i ( x p ). 
Lemma 7.
\[
H(Z_{x_{i-m-1}}\mid Z_{x_{i-m:i-1}}) - H(Z_{x_{i-m-1}}\mid Z_{x_{i-m:i-1}}, Z_{x_i}) \le k^2 \log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]

Proof. We will first prove for the single-robot case. This result will be used later for the multi-robot case. Let $x_A \triangleq x_{i-m:i-1}$ and $x_p \triangleq x_{i-m-1}$.
\[
\begin{aligned}
\sigma^2_{x_{i-m-1}|x_{i-m:i-1}} - \sigma^2_{x_{i-m-1}|(x_{i-m:i-1},x_i)} &= \sigma^2_{x_p|x_A} - \sigma^2_{x_p|(x_A,x_i)}\\
&\le \sigma^2_{x_p} - \sigma^2_{x_p|x_i}\\
&= \sigma^2_{x_p} - \Big(\sigma^2_{x_p} - \frac{\sigma_{x_p x_i}\,\sigma_{x_i x_p}}{\sigma^2_{x_i}}\Big)\\
&\le \frac{\sigma_s^4 \exp\{-(m+1)^2/(2\ell_1'^2)\}^2}{\sigma^2_s + \sigma^2_n} = \frac{\sigma^2_s\,\xi^2}{1+\eta} . \qquad (24)
\end{aligned}
\]
The first inequality follows from the property that variance reduction is submodular [4] in many practical cases (e.g., further conditioning on $Z_{x_A}$ does not make $Z_{x_p}$ and $Z_{x_i}$ more correlated). To intuitively understand this notion of submodularity, observing a new location $x_i$ will reduce the variance at location $x_p$ more if few or no observations are made, and less if many observations are already taken (e.g., at locations $x_A$). The second equality is due to (3). The second inequality follows from the fact that the distance between any two locations from stage $i$ and stage $i-m-1$ is at least $\omega_1(m+1)$. So, $\sigma_{x_p x_i} \le \sigma^2_s \exp\{-(m+1)^2/(2\ell_1'^2)\}$.
\[
\begin{aligned}
H(Z_{x_{i-m-1}}\mid Z_{x_{i-m:i-1}}) - H(Z_{x_{i-m-1}}\mid Z_{x_{i-m:i-1}}, Z_{x_i}) &= H(Z_{x_p}\mid Z_{x_A}) - H(Z_{x_p}\mid Z_{x_A}, Z_{x_i})\\
&= \frac{1}{2}\log\frac{\sigma^2_{x_p|x_A}}{\sigma^2_{x_p|(x_A,x_i)}} \qquad (25)\\
&\le \frac{1}{2}\log\frac{\sigma^2_{x_p|(x_A,x_i)} + \frac{\sigma^2_s\xi^2}{1+\eta}}{\sigma^2_{x_p|(x_A,x_i)}} \qquad (26)\\
&= \frac{1}{2}\log\Big\{1 + \frac{\sigma^2_s\xi^2}{\sigma^2_{x_p|(x_A,x_i)}(1+\eta)}\Big\}\\
&\le \frac{1}{2}\log\Big\{1 + \frac{\sigma^2_s\xi^2}{\sigma^2_n(1+\eta)}\Big\} \qquad (27)\\
&\le \log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (28)
\end{aligned}
\]
Using (4), (25) can be obtained. Inequality (26) results from (24). Inequality (27) can be obtained using Lemma 6.

We will now prove for the $k$-robot case where $k > 1$. Then, vectors $x_i$ and $x_p$ comprise $k$ locations each. Let $x^j_i$ ($x^j_p$) denote the $j$-th location component in vector $x_i$ ($x_p$). Let $x^{j:t}_i$ ($x^{j:t}_p$) denote a vector comprising the $j$-th to $t$-th location components in vector $x_i$ ($x_p$). Using the chain rule for entropy [3],
\[
\begin{aligned}
H(Z_{x_p}\mid Z_{x_A}) - H(Z_{x_p}\mid Z_{x_A}, Z_{x_i})
= \;& H(Z_{x^1_p}\mid Z_{x_A}) - H(Z_{x^1_p}\mid Z_{x_A}, Z_{x_i}) \qquad (29)\\
& + \sum_{j=2}^{k} H(Z_{x^j_p}\mid Z_{x^{1:j-1}_p}, Z_{x_A}) - H(Z_{x^j_p}\mid Z_{x^{1:j-1}_p}, Z_{x_A}, Z_{x_i}) . \qquad (30)
\end{aligned}
\]
For (29),
\[
\begin{aligned}
& H(Z_{x^1_p}\mid Z_{x_A}) - H(Z_{x^1_p}\mid Z_{x_A}, Z_{x^{1:k}_i})\\
=\;& H(Z_{x^1_p}\mid Z_{x_A}) - \big[H(Z_{x^1_p}\mid Z_{x_A}, Z_{x^1_i}) + H(Z_{x^{2:k}_i}\mid Z_{x^1_p}, Z_{x_A}, Z_{x^1_i}) - H(Z_{x^{2:k}_i}\mid Z_{x_A}, Z_{x^1_i})\big]\\
=\;& H(Z_{x^1_p}\mid Z_{x_A}) - H(Z_{x^1_p}\mid Z_{x_A}, Z_{x^1_i}) + H(Z_{x^{2:k}_i}\mid Z_{x_A}, Z_{x^1_i}) - H(Z_{x^{2:k}_i}\mid Z_{x^1_p}, Z_{x_A}, Z_{x^1_i})\\
=\;& H(Z_{x^1_p}\mid Z_{x_A}) - H(Z_{x^1_p}\mid Z_{x_A}, Z_{x^1_i}) + \sum_{t=2}^{k} H(Z_{x^t_i}\mid Z_{x_A}, Z_{x^{1:t-1}_i}) - H(Z_{x^t_i}\mid Z_{x^1_p}, Z_{x_A}, Z_{x^{1:t-1}_i})\\
\le\;& k\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (31)
\end{aligned}
\]
The inequality follows from a derivation similar to (28). Let $x_{A_j}$ denote a vector concatenating $x^{1:j-1}_p$ and $x_A$. For (30),
\[
\begin{aligned}
& \sum_{j=2}^{k} H(Z_{x^j_p}\mid Z_{x_{A_j}}) - H(Z_{x^j_p}\mid Z_{x_{A_j}}, Z_{x^{1:k}_i})\\
=\;& \sum_{j=2}^{k} H(Z_{x^j_p}\mid Z_{x_{A_j}}) - \big[H(Z_{x^j_p}\mid Z_{x_{A_j}}, Z_{x^1_i}) + H(Z_{x^{2:k}_i}\mid Z_{x^j_p}, Z_{x_{A_j}}, Z_{x^1_i}) - H(Z_{x^{2:k}_i}\mid Z_{x_{A_j}}, Z_{x^1_i})\big]\\
=\;& \sum_{j=2}^{k} H(Z_{x^j_p}\mid Z_{x_{A_j}}) - H(Z_{x^j_p}\mid Z_{x_{A_j}}, Z_{x^1_i}) + H(Z_{x^{2:k}_i}\mid Z_{x_{A_j}}, Z_{x^1_i}) - H(Z_{x^{2:k}_i}\mid Z_{x^j_p}, Z_{x_{A_j}}, Z_{x^1_i})\\
=\;& \sum_{j=2}^{k} H(Z_{x^j_p}\mid Z_{x_{A_j}}) - H(Z_{x^j_p}\mid Z_{x_{A_j}}, Z_{x^1_i}) + \sum_{t=2}^{k} H(Z_{x^t_i}\mid Z_{x_{A_j}}, Z_{x^{1:t-1}_i}) - H(Z_{x^t_i}\mid Z_{x^j_p}, Z_{x_{A_j}}, Z_{x^{1:t-1}_i})\\
\le\;& k(k-1)\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (32)
\end{aligned}
\]
The inequality follows from a derivation similar to (28). Combining (31) and (32), Lemma 7 results.

Corollary 8. For $t = 1, \ldots, i-m-1$,
\[
H(Z_{x_t}\mid Z_{x_{t+1:i-1}}) - H(Z_{x_t}\mid Z_{x_{t+1:i-1}}, Z_{x_i}) \le k^2 \log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]
Proof. The proof is similar to that of Lemma 7.

Lemma 9.
\[
H(Z_{x_i}\mid Z_{x_{i-m:i-1}}) - H(Z_{x_i}\mid Z_{x_{1:i-1}}) \le (i-m-1)\,k^2 \log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]
Proof. Using the chain rule for entropy [3],
\[
H(Z_{x_{1:i-m-1}}, Z_{x_i}\mid Z_{x_{i-m:i-1}}) = H(Z_{x_i}\mid Z_{x_{i-m:i-1}}) + H(Z_{x_{1:i-m-1}}\mid Z_{x_{i-m:i-1}}, Z_{x_i}) . \qquad (33)
\]
\[
\begin{aligned}
H(Z_{x_{1:i-m-1}}, Z_{x_i}\mid Z_{x_{i-m:i-1}}) &= H(Z_{x_{1:i-m-1}}\mid Z_{x_{i-m:i-1}}) + H(Z_{x_i}\mid Z_{x_{1:i-m-1}}, Z_{x_{i-m:i-1}})\\
&= H(Z_{x_{1:i-m-1}}\mid Z_{x_{i-m:i-1}}) + H(Z_{x_i}\mid Z_{x_{1:i-1}}) . \qquad (34)
\end{aligned}
\]
From (33) and (34),
\[
H(Z_{x_i}\mid Z_{x_{i-m:i-1}}) - H(Z_{x_i}\mid Z_{x_{1:i-1}}) = H(Z_{x_{1:i-m-1}}\mid Z_{x_{i-m:i-1}}) - H(Z_{x_{1:i-m-1}}\mid Z_{x_{i-m:i-1}}, Z_{x_i}) . \qquad (35)
\]
Applying the chain rule for entropy [3] to (35),
\[
\begin{aligned}
& H(Z_{x_{1:i-m-1}}\mid Z_{x_{i-m:i-1}}) - H(Z_{x_{1:i-m-1}}\mid Z_{x_{i-m:i-1}}, Z_{x_i})\\
=\;& \sum_{t=1}^{i-m-1} H(Z_{x_t}\mid Z_{x_{t+1:i-1}}) - H(Z_{x_t}\mid Z_{x_{t+1:i-1}}, Z_{x_i}) \qquad (36)\\
\le\;& (i-m-1)\,k^2\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (37)
\end{aligned}
\]
The inequality (37) follows from Corollary 8. So, Lemma 9 holds.
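Lemma 9 is the step that lets the truncated conditioning window of MEPP($m$) stand in for the full observation history. As an illustration of ours (not the paper's code), the following Python sketch compares $H(Z_{x_i}\mid Z_{x_{i-m:i-1}})$ with $H(Z_{x_i}\mid Z_{x_{1:i-1}})$ for $k = 1$ on a 1-D path whose consecutive stages are two length-scales apart; the gap is negligible, in line with the $\xi^2$ dependence of the bound.

```python
import numpy as np

sigma_s2, sigma_n2, ell = 1.0, 0.1, 1.0

def post_var(x, obs):
    obs = np.asarray(obs, float)
    Koo = sigma_s2 * np.exp(-(obs[:, None] - obs[None, :]) ** 2 / (2 * ell ** 2)) \
          + sigma_n2 * np.eye(len(obs))
    kxo = sigma_s2 * np.exp(-(x - obs) ** 2 / (2 * ell ** 2))
    return sigma_s2 + sigma_n2 - float(kxo @ np.linalg.solve(Koo, kxo))

def cond_entropy(x, obs):
    return 0.5 * np.log(2 * np.pi * np.e * post_var(x, obs))

path = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]   # one location per stage, spacing = 2 length-scales
m = 2
for i in range(m, len(path)):            # 0-based index i corresponds to stage i+1
    truncated = cond_entropy(path[i], path[i - m:i])   # H(Z_{x_i} | Z_{x_{i-m:i-1}})
    full = cond_entropy(path[i], path[:i])             # H(Z_{x_i} | Z_{x_{1:i-1}})
    print(f"stage {i + 1}: truncated - full = {truncated - full:.3e}")
```

Note that at stage $m+1$ the two quantities coincide, which matches the bound's factor $(i-m-1)$ vanishing there.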
A.3 Proof of Theorem 2

Let
\[
\theta \triangleq H(Z_{x^E_{1:m}}) + \sum_{i=m+1}^{n} H(Z_{x^E_i}\mid Z_{x^E_{i-m:i-1}}) - \Big\{H(Z_{x^*_{1:m}}) + \sum_{i=m+1}^{n} H(Z_{x^*_i}\mid Z_{x^*_{i-m:i-1}})\Big\} . \qquad (38)
\]
From Lemma 5, $\theta \ge 0$. By the chain rule for entropy [3],
\[
H(Z_{x^*_{1:n}}) - H(Z_{x^E_{1:n}}) = H(Z_{x^*_{1:m}}) + \sum_{i=m+1}^{n} H(Z_{x^*_i}\mid Z_{x^*_{1:i-1}}) - \Big\{H(Z_{x^E_{1:m}}) + \sum_{i=m+1}^{n} H(Z_{x^E_i}\mid Z_{x^E_{1:i-1}})\Big\} . \qquad (39)
\]
Let $\Delta^*_i \triangleq H(Z_{x^*_i}\mid Z_{x^*_{i-m:i-1}}) - H(Z_{x^*_i}\mid Z_{x^*_{1:i-1}})$ and $\Delta^E_i \triangleq H(Z_{x^E_i}\mid Z_{x^E_{i-m:i-1}}) - H(Z_{x^E_i}\mid Z_{x^E_{1:i-1}})$ for $i = m+1, \ldots, n$. Then, (39) can be re-written as
\[
\begin{aligned}
H(Z_{x^*_{1:n}}) - H(Z_{x^E_{1:n}})
&= H(Z_{x^*_{1:m}}) + \sum_{i=m+1}^{n} \big[H(Z_{x^*_i}\mid Z_{x^*_{i-m:i-1}}) - \Delta^*_i\big] - \Big\{H(Z_{x^E_{1:m}}) + \sum_{i=m+1}^{n} \big[H(Z_{x^E_i}\mid Z_{x^E_{i-m:i-1}}) - \Delta^E_i\big]\Big\}\\
&= H(Z_{x^*_{1:m}}) + \sum_{i=m+1}^{n} H(Z_{x^*_i}\mid Z_{x^*_{i-m:i-1}}) - \sum_{i=m+1}^{n}\Delta^*_i - \Big[H(Z_{x^E_{1:m}}) + \sum_{i=m+1}^{n} H(Z_{x^E_i}\mid Z_{x^E_{i-m:i-1}}) - \sum_{i=m+1}^{n}\Delta^E_i\Big]\\
&= \sum_{i=m+1}^{n}\big[\Delta^E_i - \Delta^*_i\big] - \theta \qquad (40)\\
&\le \sum_{i=m+1}^{n}\big[\Delta^E_i - \Delta^*_i\big] \qquad (41)\\
&\le \sum_{i=m+1}^{n}\Delta^E_i . \qquad (42)
\end{aligned}
\]
Since $\theta \ge 0$, (41) results. Since $\Delta^*_i \ge 0$ for $i = m+1, \ldots, n$, (42) follows. By Lemma 9, $\Delta^E_i \le (i-m-1)k^2\log\{1+\xi^2/(\eta(1+\eta))\}$ for $i = m+1, \ldots, n$. Then, Theorem 2 follows.

B. MUTUAL INFORMATION-BASED PATH PLANNING

B.1 Proof of Theorem 3

Given each vector $x_{i-2m:i-1}$, the time needed to evaluate $I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}})$ over all possible $x_i \in \mathcal{X}_i$ is $\chi \times O([r(2m+1)]^3) = O(\chi[r(2m+1)]^3)$. The time needed to perform this over all $\chi^{2m}$ possible vectors $x_{i-2m:i-1}$ in each stage $i$ is $\chi^{2m} \times O(\chi[r(2m+1)]^3) = O(\chi^{2m+1}[r(2m+1)]^3)$. Similar to the MEPP($m$) algorithm, the conditional mutual information terms calculated in a stage are the same as those in every other stage. The time needed to propagate the optimal values from stages $n-1$ to $2m+1$ is $O(\chi^{2m+1}(n-2m-1))$. Similarly, the time needed to evaluate $I(Z_{x_{n-m:n}}; Z_{u_{n-2m:n}}\mid Z_{x_{n-2m:n-m-1}})$ over all possible $x_n \in \mathcal{X}_n$ and all $\chi^{2m}$ possible vectors $x_{n-2m:n-1}$ in stage $n$ is $O(\chi^{2m+1}[r(2m+1)]^3)$. To obtain the optimal vector $x^M_{1:2m}$, $I(Z_{x_{1:m}}; Z_{u_{1:2m}})$ has to be evaluated over all possible $x_{1:2m}$. Hence, the time needed to solve for the optimal vector $x^M_{1:2m}$ is $O(\chi^{2m}[r(2m)]^3)$. As a result, the time complexity of the M$^2$IPP($m$) algorithm is $O(\chi^{2m+1}(n-2m-1+[r(2m+1)]^3) + \chi^{2m+1}[r(2m+1)]^3 + \chi^{2m}[r(2m)]^3) = O(\chi^{2m+1}(n + 2[r(2m+1)]^3))$.

B.2 Proof of Some Lemmas

Before giving the proof of Theorem 4, the following lemmas are needed.

Lemma 10. For any observation paths $x_{1:n}$,
\[
\begin{aligned}
& I(Z_{x^M_{1:m}}; Z_{u^M_{1:2m}}) + \sum_{i=2m+1}^{n-1} I(Z_{x^M_{i-m}}; Z_{u^M_{i-2m:i}}\mid Z_{x^M_{i-2m:i-m-1}}) + I(Z_{x^M_{n-m:n}}; Z_{u^M_{n-2m:n}}\mid Z_{x^M_{n-2m:n-m-1}})\\
\ge\;& I(Z_{x_{1:m}}; Z_{u_{1:2m}}) + \sum_{i=2m+1}^{n-1} I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}}) + I(Z_{x_{n-m:n}}; Z_{u_{n-2m:n}}\mid Z_{x_{n-2m:n-m-1}}) .
\end{aligned}
\]
Proof. Using (12),
\[
\begin{aligned}
U_{2m+1}(x_{1:2m}) &= \max_{x_{2m+1}\in\mathcal{X}_{2m+1}} I(Z_{x_{m+1}}; Z_{u_{1:2m+1}}\mid Z_{x_{1:m}}) + U_{2m+2}(x_{2:2m+1})\\
&= \max_{x_{2m+1}\in\mathcal{X}_{2m+1}} I(Z_{x_{m+1}}; Z_{u_{1:2m+1}}\mid Z_{x_{1:m}}) + \max_{x_{2m+2}\in\mathcal{X}_{2m+2}} I(Z_{x_{m+2}}; Z_{u_{2:2m+2}}\mid Z_{x_{2:m+1}}) + U_{2m+3}(x_{3:2m+2})\\
&= \max_{x_{2m+1}\in\mathcal{X}_{2m+1},\,x_{2m+2}\in\mathcal{X}_{2m+2}} I(Z_{x_{m+1}}; Z_{u_{1:2m+1}}\mid Z_{x_{1:m}}) + I(Z_{x_{m+2}}; Z_{u_{2:2m+2}}\mid Z_{x_{2:m+1}}) + U_{2m+3}(x_{3:2m+2})\\
&\;\;\vdots\\
&= \max_{x_{2m+1}\in\mathcal{X}_{2m+1},\ldots,x_n\in\mathcal{X}_n} \sum_{i=2m+1}^{n-1} I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}}) + I(Z_{x_{n-m:n}}; Z_{u_{n-2m:n}}\mid Z_{x_{n-2m:n-m-1}}) . \qquad (43)
\end{aligned}
\]
Given $x_{1:2m}$, the vectors $x_{2m+1},\ldots,x_n$ that maximize the term $\sum_{i=2m+1}^{n-1} I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}}) + I(Z_{x_{n-m:n}}; Z_{u_{n-2m:n}}\mid Z_{x_{n-2m:n-m-1}})$ in (43) can be obtained. Using (13), the paths $x_{1:n}$ that maximize $I(Z_{x_{1:m}}; Z_{u_{1:2m}}) + \sum_{i=2m+1}^{n-1} I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}}) + I(Z_{x_{n-m:n}}; Z_{u_{n-2m:n}}\mid Z_{x_{n-2m:n-m-1}})$ can be obtained. Therefore, Lemma 10 holds.
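The per-stage quantity maximized by the recursion in (12) and unrolled in (43) is a Gaussian conditional mutual information. The following Python sketch of ours shows how such a term can be computed for $k = 1$ with an illustrative squared-exponential covariance; the function names, locations, and parameter values are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

sigma_s2, sigma_n2, ell = 1.0, 0.1, 1.5

def cov(a, b, noise=False):
    a, b = np.asarray(a, float), np.asarray(b, float)
    k = sigma_s2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))
    return k + sigma_n2 * np.eye(len(a)) if noise else k

def cond_entropy(x, obs):
    """H(Z_x | Z_obs) in nats for a vector of locations x."""
    S = cov(x, x, noise=True)
    if len(obs):
        S = S - cov(x, obs) @ np.linalg.solve(cov(obs, obs, noise=True), cov(obs, x))
    return 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * S))

def cond_mutual_info(x, u, A):
    """I(Z_x ; Z_u | Z_A) = H(Z_x | Z_A) - H(Z_x | Z_A, Z_u)."""
    return cond_entropy(x, A) - cond_entropy(x, list(A) + list(u))

x = [3.0]                 # candidate location of the robot at the current stage
u = [2.5, 3.5, 4.5]       # unobserved locations the observation should inform
A = [0.0, 1.0, 2.0]       # locations already observed along the path
print("I(Z_x ; Z_u | Z_A) =", round(cond_mutual_info(x, u, A), 4))
```

Each such evaluation involves a linear solve against the covariance matrix of the conditioning and unobserved locations, which is the source of the $O([r(2m+1)]^3)$ factor counted in the proof of Theorem 3.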
Lemma 11. For $t = 1, \ldots, i-2m-1$,
\[
H(Z_{x_t}\mid Z_{x_{t+1:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{x_t}\mid Z_{x_{t+1:i-m-1}}, Z_{u_{i-2m:i}}, Z_{x_{i-m}}) \le k^2\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]
Proof. The proof is similar to that of Corollary 8.

Corollary 12. For $t = 1, \ldots, i-2m-1$,
\[
H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{t+1:i}}) - H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{t+1:i}}, Z_{x_{i-m}}) \le k(r-k)\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]
Proof. Note that the size of vector $u_t$ is $r-k$. The proof is similar to that of Corollary 8.

Corollary 13. For $t = i+1, \ldots, n$,
\[
H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:t-1}}) - H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:t-1}}, Z_{x_{i-m}}) \le k(r-k)\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]
Proof. The proof is similar to that of Corollary 12.

Lemma 14.
\[
H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{x_{i-m}}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:n}}) \le (n-2m-1)\,rk\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]
Proof. Let $x_\Delta$ ($u_\Delta$) denote a vector of all the locations of $x_{1:i-m-1}$ ($u_{1:n}$) excluding those of $x_{i-2m:i-m-1}$ ($u_{i-2m:i}$). That is, $x_\Delta \triangleq x_{1:i-2m-1}$ and $u_\Delta \triangleq (u_{1:i-2m-1}, u_{i+1:n})$.
\[
\begin{aligned}
& H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{x_{i-m}}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:n}})\\
=\;& H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}) - \big[H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}})\\
&\quad + H(Z_{x_\Delta}, Z_{u_\Delta}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}, Z_{x_{i-m}}) - H(Z_{x_\Delta}, Z_{u_\Delta}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}})\big] \qquad (44)\\
=\;& H(Z_{x_\Delta}, Z_{u_\Delta}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{x_\Delta}, Z_{u_\Delta}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}, Z_{x_{i-m}}) \qquad (45)\\
=\;& \big[H(Z_{x_\Delta}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{x_\Delta}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}, Z_{x_{i-m}})\big]\\
&+ \big[H(Z_{u_\Delta}\mid Z_{x_{1:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{u_\Delta}\mid Z_{x_{1:i-m-1}}, Z_{u_{i-2m:i}}, Z_{x_{i-m}})\big] \qquad (46)\\
=\;& \sum_{t=1}^{i-2m-1}\big[H(Z_{x_t}\mid Z_{x_{t+1:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{x_t}\mid Z_{x_{t+1:i-m-1}}, Z_{u_{i-2m:i}}, Z_{x_{i-m}})\big]\\
&+ \sum_{t=1}^{i-2m-1}\big[H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{t+1:i}}) - H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{t+1:i}}, Z_{x_{i-m}})\big]\\
&+ \sum_{t=i+1}^{n}\big[H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:t-1}}) - H(Z_{u_t}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:t-1}}, Z_{x_{i-m}})\big] \qquad (47)\\
\le\;& \big[(i-2m-1)k^2 + (n-2m-1)(r-k)k\big]\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} \qquad (48)\\
\le\;& (n-2m-1)\big[k^2 + (r-k)k\big]\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\}\\
=\;& (n-2m-1)\,rk\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (49)
\end{aligned}
\]
Using the chain rule for entropy [3], (44), (46), and (47) can be obtained. Using Lemma 11 and Corollaries 12 and 13, (48) can be obtained.

Lemma 15.
\[
I(Z_{x_{i-m}}; Z_{u_{1:n}}\mid Z_{x_{1:i-m-1}}) - I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}}) = A_{i-m} - B_{i-m}
\]
where
\[
\begin{aligned}
A_{i-m} &= H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}) - H(Z_{x_{i-m}}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:n}}) ,\\
B_{i-m} &= H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}) - H(Z_{x_{i-m}}\mid Z_{x_{1:i-m-1}})
\end{aligned}
\]
and
\[
A_{i-m} \le (n-2m-1)\,rk\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} , \qquad
B_{i-m} \le (i-2m-1)\,k^2\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\]
Proof.
By the definition of conditional mutual information,
\[
I(Z_{x_{i-m}}; Z_{u_{1:n}}\mid Z_{x_{1:i-m-1}}) = H(Z_{x_{i-m}}\mid Z_{x_{1:i-m-1}}) - H(Z_{x_{i-m}}\mid Z_{x_{1:i-m-1}}, Z_{u_{1:n}}) , \qquad (50)
\]
\[
I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}}) = H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}) - H(Z_{x_{i-m}}\mid Z_{x_{i-2m:i-m-1}}, Z_{u_{i-2m:i}}) . \qquad (51)
\]
Using (50) and (51), $I(Z_{x_{i-m}}; Z_{u_{1:n}}\mid Z_{x_{1:i-m-1}}) - I(Z_{x_{i-m}}; Z_{u_{i-2m:i}}\mid Z_{x_{i-2m:i-m-1}}) = A_{i-m} - B_{i-m}$. By Lemma 14, $A_{i-m} \le (n-2m-1)rk\log\{1+\xi^2/(\eta(1+\eta))\}$. Using Lemma 9, $B_{i-m} \le (i-2m-1)k^2\log\{1+\xi^2/(\eta(1+\eta))\}$. Therefore, Lemma 15 holds.

B.3 Proof of Theorem 4

Let
\[
\begin{aligned}
\theta \triangleq\;& I(Z_{x^M_{1:m}}; Z_{u^M_{1:2m}}) + \sum_{i=2m+1}^{n-1} I(Z_{x^M_{i-m}}; Z_{u^M_{i-2m:i}}\mid Z_{x^M_{i-2m:i-m-1}}) + I(Z_{x^M_{n-m:n}}; Z_{u^M_{n-2m:n}}\mid Z_{x^M_{n-2m:n-m-1}})\\
&- \Big[I(Z_{x^\star_{1:m}}; Z_{u^\star_{1:2m}}) + \sum_{i=2m+1}^{n-1} I(Z_{x^\star_{i-m}}; Z_{u^\star_{i-2m:i}}\mid Z_{x^\star_{i-2m:i-m-1}}) + I(Z_{x^\star_{n-m:n}}; Z_{u^\star_{n-2m:n}}\mid Z_{x^\star_{n-2m:n-m-1}})\Big] . \qquad (52)
\end{aligned}
\]
From Lemma 10, $\theta \ge 0$. By the chain rule for mutual information [3],
\[
\begin{aligned}
& I(Z_{x^\star_{1:n}}; Z_{u^\star_{1:n}}) - I(Z_{x^M_{1:n}}; Z_{u^M_{1:n}})\\
=\;& I(Z_{x^\star_{1:m}}; Z_{u^\star_{1:n}}) + \sum_{i=2m+1}^{n-1} I(Z_{x^\star_{i-m}}; Z_{u^\star_{1:n}}\mid Z_{x^\star_{1:i-m-1}}) + I(Z_{x^\star_{n-m:n}}; Z_{u^\star_{1:n}}\mid Z_{x^\star_{1:n-m-1}})\\
&- \Big[I(Z_{x^M_{1:m}}; Z_{u^M_{1:n}}) + \sum_{i=2m+1}^{n-1} I(Z_{x^M_{i-m}}; Z_{u^M_{1:n}}\mid Z_{x^M_{1:i-m-1}}) + I(Z_{x^M_{n-m:n}}; Z_{u^M_{1:n}}\mid Z_{x^M_{1:n-m-1}})\Big] . \qquad (53)
\end{aligned}
\]
By the definition of mutual information,
\[
I(Z_{x_{1:m}}; Z_{u_{1:n}}) = H(Z_{x_{1:m}}) - H(Z_{x_{1:m}}\mid Z_{u_{1:n}}) , \qquad (54)
\]
\[
I(Z_{x_{1:m}}; Z_{u_{1:2m}}) = H(Z_{x_{1:m}}) - H(Z_{x_{1:m}}\mid Z_{u_{1:2m}}) . \qquad (55)
\]
Using the chain rule for entropy [3],
\[
\begin{aligned}
& H(Z_{x_{1:m}}\mid Z_{u_{1:2m}}) - H(Z_{x_{1:m}}\mid Z_{u_{1:n}})\\
=\;& H(Z_{x_1}\mid Z_{u_{1:2m}}) - H(Z_{x_1}\mid Z_{u_{1:n}}) + \ldots + H(Z_{x_m}\mid Z_{x_{1:m-1}}, Z_{u_{1:2m}}) - H(Z_{x_m}\mid Z_{x_{1:m-1}}, Z_{u_{1:n}})\\
=\;& \sum_{t=1}^{m} H(Z_{x_t}\mid Z_{x_{1:t-1}}, Z_{u_{1:2m}}) - H(Z_{x_t}\mid Z_{x_{1:t-1}}, Z_{u_{1:n}})\\
\le\;& m(n-2m)(r-k)k\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (56)
\end{aligned}
\]
Inequality (56) can be obtained using a proof similar to Lemma 14. Applying (56) to (54) and (55),
\[
I(Z_{x_{1:m}}; Z_{u_{1:n}}) - I(Z_{x_{1:m}}; Z_{u_{1:2m}}) = A_{1:m} \qquad (57)
\]
where
\[
A_{1:m} \le m(n-2m)(r-k)k\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (58)
\]
By the definition of mutual information,
\[
I(Z_{x_{n-m:n}}; Z_{u_{1:n}}\mid Z_{x_{1:n-m-1}}) = H(Z_{x_{n-m:n}}\mid Z_{x_{1:n-m-1}}) - H(Z_{x_{n-m:n}}\mid Z_{x_{1:n-m-1}}, Z_{u_{1:n}}) , \qquad (59)
\]
\[
I(Z_{x_{n-m:n}}; Z_{u_{n-2m:n}}\mid Z_{x_{n-2m:n-m-1}}) = H(Z_{x_{n-m:n}}\mid Z_{x_{n-2m:n-m-1}}) - H(Z_{x_{n-m:n}}\mid Z_{x_{n-2m:n-m-1}}, Z_{u_{n-2m:n}}) . \qquad (60)
\]
Using the chain rule of entropy,
\[
\begin{aligned}
H(Z_{x_{n-m:n}}\mid Z_{x_{n-2m:n-m-1}}) - H(Z_{x_{n-m:n}}\mid Z_{x_{1:n-m-1}}) &= \sum_{t=n-m}^{n} H(Z_{x_t}\mid Z_{x_{n-2m:t-1}}) - H(Z_{x_t}\mid Z_{x_{1:t-1}}) \qquad (61)\\
&\le (m+1)(n-2m-1)\,k^2\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (62)
\end{aligned}
\]
Using Lemma 9, inequality (62) can be obtained. By the chain rule of entropy,
\[
\begin{aligned}
& H(Z_{x_{n-m:n}}\mid Z_{x_{n-2m:n-m-1}}, Z_{u_{n-2m:n}}) - H(Z_{x_{n-m:n}}\mid Z_{x_{1:n-m-1}}, Z_{u_{1:n}})\\
=\;& \sum_{t=n-m}^{n} H(Z_{x_t}\mid Z_{x_{n-2m:t-1}}, Z_{u_{n-2m:n}}) - H(Z_{x_t}\mid Z_{x_{1:t-1}}, Z_{u_{1:n}})\\
\le\;& (m+1)(n-2m-1)\,rk\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (63)
\end{aligned}
\]
Using a proof similar to Lemma 14, inequality (63) can be obtained.
Applying the results (62) and (63) to (59) and (60),
\[
I(Z_{x_{n-m:n}}; Z_{u_{1:n}}\mid Z_{x_{1:n-m-1}}) - I(Z_{x_{n-m:n}}; Z_{u_{n-2m:n}}\mid Z_{x_{n-2m:n-m-1}}) = A_{n-m:n} - B_{n-m:n} \qquad (64)
\]
where
\[
A_{n-m:n} \le (m+1)(n-2m-1)\,rk\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} , \qquad (65)
\]
\[
B_{n-m:n} \le (m+1)(n-2m-1)\,k^2\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} . \qquad (66)
\]
Using the above results, (53) can be rewritten as
\[
\begin{aligned}
& I(Z_{x^\star_{1:n}}; Z_{u^\star_{1:n}}) - I(Z_{x^M_{1:n}}; Z_{u^M_{1:n}})\\
=\;& A^\star_{1:m} + I(Z_{x^\star_{1:m}}; Z_{u^\star_{1:2m}}) + \sum_{i=2m+1}^{n-1}\big[A^\star_{i-m} - B^\star_{i-m} + I(Z_{x^\star_{i-m}}; Z_{u^\star_{i-2m:i}}\mid Z_{x^\star_{i-2m:i-m-1}})\big]\\
&+ A^\star_{n-m:n} - B^\star_{n-m:n} + I(Z_{x^\star_{n-m:n}}; Z_{u^\star_{n-2m:n}}\mid Z_{x^\star_{n-2m:n-m-1}})\\
&- \Big\{A^M_{1:m} + I(Z_{x^M_{1:m}}; Z_{u^M_{1:2m}}) + \sum_{i=2m+1}^{n-1}\big[A^M_{i-m} - B^M_{i-m} + I(Z_{x^M_{i-m}}; Z_{u^M_{i-2m:i}}\mid Z_{x^M_{i-2m:i-m-1}})\big]\\
&\quad + A^M_{n-m:n} - B^M_{n-m:n} + I(Z_{x^M_{n-m:n}}; Z_{u^M_{n-2m:n}}\mid Z_{x^M_{n-2m:n-m-1}})\Big\} \qquad (67)\\
=\;& A^\star_{1:m} + \sum_{i=2m+1}^{n-1}(A^\star_{i-m} - B^\star_{i-m}) + (A^\star_{n-m:n} - B^\star_{n-m:n}) - \Big[A^M_{1:m} + \sum_{i=2m+1}^{n-1}(A^M_{i-m} - B^M_{i-m}) + (A^M_{n-m:n} - B^M_{n-m:n})\Big] - \theta \qquad (68)\\
\le\;& A^\star_{1:m} + \sum_{i=2m+1}^{n-1}A^\star_{i-m} + A^\star_{n-m:n} + \sum_{i=2m+1}^{n-1}B^M_{i-m} + B^M_{n-m:n} - \Big[A^M_{1:m} + \sum_{i=2m+1}^{n-1}A^M_{i-m} + A^M_{n-m:n} + \sum_{i=2m+1}^{n-1}B^\star_{i-m} + B^\star_{n-m:n}\Big] \qquad (69)\\
\le\;& A^\star_{1:m} + \sum_{i=2m+1}^{n-1}A^\star_{i-m} + A^\star_{n-m:n} + \sum_{i=2m+1}^{n-1}B^M_{i-m} + B^M_{n-m:n} \qquad (70)\\
\le\;& \Big[m(n-2m)(r-k)k + (n-2m-1+m+1)(n-2m-1)rk + \tfrac{1}{2}(n-2m-1)(n-2m-2)k^2 + (m+1)(n-2m-1)k^2\Big]\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} \qquad (71)\\
\le\;& \Big[m(n-2m)(r-k)k + (n-m)(n-2m)rk + \tfrac{1}{2}(n-2m)(n-2m-2)k^2 + (m+1)(n-2m)k^2\Big]\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\}\\
=\;& \Big[m(n-2m)(r-k)k + (n-m)(n-2m)rk + \tfrac{1}{2}(n-2m)(n-2m)k^2 + m(n-2m)k^2\Big]\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\}\\
=\;& \Big[mr + (n-m)r + \tfrac{1}{2}(n-2m)k\Big](n-2m)k\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\}\\
=\;& \Big[nr + \tfrac{1}{2}(n-2m)k\Big](n-2m)k\log\Big\{1 + \frac{\xi^2}{\eta(1+\eta)}\Big\} .
\end{aligned}
\]
Using (57), (64), and Lemma 15, (67) can be obtained. Applying $\theta$ to (67), (68) can be obtained. Since $\theta \ge 0$, (69) can be obtained. Using (58), (65), (66), and Lemma 15, (70) and (71) can be obtained. Therefore, Theorem 4 holds.
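To see how the final bound behaves, the snippet below (ours, for illustration only) evaluates $[nr + \tfrac{1}{2}(n-2m)k](n-2m)k\log\{1+\xi^2/(\eta(1+\eta))\}$ with $\xi = \exp\{-(m+1)^2/(2\ell_1'^2)\}$ for a few assumed values of $n$, $k$, $r$, $\eta$, and $\ell_1'$. The bound shrinks as the lookahead $m$ grows and as $\ell_1'$ (hence the spatial correlation across stages) decreases; the parameter values are chosen only to show the trend and are not taken from the paper's experiments.

```python
import numpy as np

n, k, r, eta = 20, 2, 10, 0.25    # illustrative values only

for ell1 in (0.5, 1.0, 2.0):      # candidate values of ell'_1
    bounds = []
    for m in range(1, 6):
        xi = np.exp(-(m + 1) ** 2 / (2 * ell1 ** 2))
        factor = (n * r + 0.5 * (n - 2 * m) * k) * (n - 2 * m) * k
        bounds.append(factor * np.log(1 + xi ** 2 / (eta * (1 + eta))))
    print(f"ell'_1 = {ell1}:", ["%.3g" % b for b in bounds])
```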