arXiv:1609.08536v1 [cs.SY] 27 Sep 2016

Scheduling Nonlinear Sensors for Stochastic Process Estimation

Vasileios Tzoumas⋆, Nikolay A. Atanasov⋆, Ali Jadbabaie†, George J. Pappas⋆

Abstract—In this paper, we focus on activating only a few sensors, among many available, to estimate the state of a stochastic process of interest. This problem is important in applications such as target tracking and simultaneous localization and mapping (SLAM). It is challenging since it involves stochastic systems whose evolution is largely unknown, sensors with nonlinear measurements, and limited operational resources that constrain the number of active sensors at each measurement step. We provide an algorithm applicable to general stochastic processes and nonlinear measurements whose time complexity is linear in the planning horizon and whose performance is a multiplicative factor 1/2 away from the optimal performance. This is notable because the algorithm offers a significant computational advantage over the polynomial-time algorithm that achieves the best approximation factor 1/e. In addition, for important classes of Gaussian processes and nonlinear measurements corrupted with Gaussian noise, our algorithm enjoys the same time complexity as even the state-of-the-art algorithms for linear systems and measurements. We achieve our results by proving two properties for the entropy of the batch state vector conditioned on the measurements: a) it is supermodular in the choice of the sensors; b) it has a sparsity pattern (it involves block tri-diagonal matrices) that facilitates its evaluation at each sensor set.

I. INTRODUCTION

Adversarial target tracking and capturing [1], [2], robotic navigation and autonomous construction [3], active perception and simultaneous localization and mapping (SLAM) [4] are only a few of the challenging information gathering problems that benefit from the monitoring capabilities of sensor networks [5]. These problems are challenging because:
• they involve systems whose evolution is largely unknown, modeled either as a stochastic process, such as a Gaussian process [6], or as a linear or nonlinear system corrupted with process noise [1],
• they involve nonlinear sensors (e.g., cameras, radios) corrupted with noise [7],
• they involve systems that change over time [8], and as a result necessitate both spatial and temporal deployment of sensors in the environment, increasing the total number of needed sensors, and at the same time,
• they involve operational constraints, such as limited communication bandwidth and battery life, which limit the number of sensors that can simultaneously be active in the information gathering process [9].

⋆The authors are with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104-6228 USA (email: {vtzoumas, atanasov, pappasg}@seas.upenn.edu).
†The author is with the Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139 USA (email: jadbabai@mit.edu).
This work was supported in part by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, in part by the AFOSR Complex Networks Program, and in part by AFOSR MURI CHASE.
Due to these challenges, we focus on the following question: "How do we select, at each time, only a few of the available sensors so as to monitor effectively a system despite the above challenges?" In particular, we focus on the following sensor scheduling problem:

Problem 1. Consider a stochastic process, whose realization at time t is denoted by x(t), and a set of m sensors, whose measurements are nonlinear functions of x(t), evaluated at a fixed set of K measurement times t1, t2, . . . , tK. In addition, suppose that at each tk a set of at most sk ≤ m sensors can be used. Select the sensor sets so that the error of the corresponding minimum mean square error estimator of (x(t1), x(t2), . . . , x(tK)) is minimal among all possible sensor sets.

There are two classes of sensor scheduling algorithms, which trade off between the estimation accuracy of the batch state vector and their time complexity [10]: those used for Kalman filtering, and those for batch state estimation. The most relevant papers on batch state estimation are [10] and [11]. However, both of these papers focus on linear systems and measurements. The most relevant papers for Kalman filtering consider algorithms that use: myopic heuristics [12], tree pruning [13], convex optimization [14]–[17], quadratic programming [18], Monte Carlo methods [19], or submodular function maximization [20], [21]. However, these papers similarly focus on linear or nonlinear systems and measurements, and do not consider unknown dynamics.

Main contributions:
1) We prove that Problem 1 is NP-hard.
2) We prove that the best approximation factor one can achieve in polynomial time for Problem 1 is 1/e.
3) We provide Algorithm 1 for Problem 1 that:
• for all stochastic processes and nonlinear measurements, achieves a solution that is up to a multiplicative factor 1/2 from the optimal solution, with time complexity that is only linear in the planning horizon K. This is important, since it implies that Algorithm 1 offers a significant computational advantage, with negligible loss in performance, over the polynomial-time algorithm that achieves the best approximation factor of 1/e,
• for important classes of Gaussian processes and nonlinear measurements corrupted with Gaussian noise, has the same time complexity as even state-of-the-art algorithms for linear systems and measurements. For example, for Gaussian processes such as those in target tracking, or those generated by linear or nonlinear systems corrupted with Gaussian noise, Algorithm 1 has the same time complexity as the batch state estimation algorithm in [10], and lower than the Kalman filter scheduling algorithms in [14], [17].

Therefore, Algorithm 1 can enjoy both the estimation accuracy of the batch state scheduling algorithms (compared to the Kalman filtering approach, which only approximates the batch state estimation error with an upper bound [10]) and, surprisingly, even the low time complexity of the Kalman filtering scheduling algorithms for linear systems.

Technical contributions:
1) Supermodularity in Problem 1: We achieve the approximation performance of Algorithm 1, and the linear dependence of its time complexity on the planning horizon, by proving that our estimation metric is a supermodular function in the choice of the utilized sensors. This is important, since it is in contrast to the case of multi-step Kalman filtering for linear systems and measurements, where the corresponding estimation metric is neither supermodular nor submodular [20], [21].
Moreover, our supermodularity result cannot be reduced to the batch estimation problems in [22], [23]. The reasons are twofold: i) we consider sensors that measure nonlinear combinations of the elements of x(t), in contrast to [22], [23], where each sensor measures directly only one element of x(t); ii) our estimation metric is relevant to monitoring dynamical systems and different from the submodular entropy metric and information gain considered in [22] and [23], respectively.
2) Sparsity in Problem 1: We achieve the reduced time complexity of Algorithm 1 for Gaussian processes by identifying a sparsity pattern in our estimation metric. Specifically, in this case the time complexity of each evaluation of our metric is decided by the sparsity pattern of either the covariance of (x(t1), x(t2), . . . , x(tK)), or the inverse of this covariance. This is important since the two matrices are not usually sparse at the same time, even if one of them is [24]. E.g., for Gaussian processes such as those in target tracking, the first matrix is block tri-diagonal, whereas for those in SLAM, or those generated by linear or nonlinear systems corrupted with Gaussian noise, the second matrix is block tri-diagonal.

Notation: We denote the set of natural numbers {1, 2, . . .} by N, the set of real numbers by R, and the set {1, 2, . . . , n} by [n] (n ∈ N). The set of real numbers between 0 and 1 is denoted by [0, 1], and the empty set by ∅. Given a set X, |X| is its cardinality. In addition, for n ∈ N, X^n is the n-times Cartesian product X × X × · · · × X. Matrices are represented by capital letters and vectors by lower-case letters. We write A ∈ X^{n1×n2} (n1, n2 ∈ N) to denote a matrix of n1 rows and n2 columns whose elements take values in X; A⊤ is its transpose, and [A]ij is its element at the i-th row and j-th column; det(A) is its determinant. Furthermore, if A is positive definite, we write A ≻ 0. In the latter case, A−1 is its inverse. I is the identity matrix; its dimension is inferred from the context. Similarly for the zero matrix 0. The symbol ≡ denotes equivalence. Moreover, for a probability space (Ω, F, P), Ω is the sample space, F the σ-field, and P : F ↦ [0, 1] the function that assigns probabilities to events in F [25]. We write x ∼ F to denote a random variable x with probability distribution F; E(x) is its expected value, and Σ(x) its covariance. x ∼ N(µ, Σ) denotes a Gaussian random variable x with mean µ and covariance Σ; we equivalently write x ∼ N(E(x), Σ(x)). Finally, we write x|y ∼ G to denote that x's probability distribution given y is G.

II. PROBLEM FORMULATION

This section introduces the system, measurement, and scheduling models and presents the sensor scheduling problem formally.

System Model. We consider two cases:
• Continuous time model: Consider the stochastic process (along with a probability space (Ω, F, P)):

xω(t) : ω ∈ Ω, t ≥ t0 ↦ R^n,   (1)

where n ∈ N, t0 is the initial time, and xω(t) is the state vector given the sample ω.
• Discrete time model: Consider the nonlinear discrete-time system:

xk+1 = lk(x1:k), lk ∼ Lk, k ∈ N,   (2)

where xk ∈ R^n is the state vector, x1:k the batch vector (x1, x2, . . . , xk), and Lk a probability distribution over functions lk : R^{nk} ↦ R^n.

Because the system models (1) and (2) assume no characteristic structure, they are appropriate for modeling largely unknown dynamics. For example, an instance of (1) is the time-indexed Gaussian process system model:

x(t) ∼ GP(µ(t), Σ(t, t′)), t, t′ ≥ t0,   (3)

where µ(t) is the mean function and Σ(t, t′) is the covariance function.
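To make the time-indexed model (3) concrete, the following minimal sketch (ours, not from the paper) evaluates a Gaussian-process prior at K measurement times; the zero mean function and the squared-exponential kernel are illustrative stand-ins for µ(t) and Σ(t, t′), and the scalar case n = 1 is used only to keep the shapes simple.

```python
# Illustrative sketch (not from the paper): evaluating a time-indexed Gaussian
# process, as in (3), at K measurement times to obtain the batch prior of x_{1:K}.
import numpy as np

def gp_prior(times, mean_fn, kernel_fn):
    """Mean vector and covariance matrix of (x(t_1), ..., x(t_K)) for a scalar GP;
    the block structure for n > 1 is analogous."""
    mu = np.array([mean_fn(t) for t in times])
    Sigma = np.array([[kernel_fn(t, tp) for tp in times] for t in times])
    return mu, Sigma

# Hypothetical choices of mean and covariance functions.
mean_fn = lambda t: 0.0
kernel_fn = lambda t, tp: np.exp(-0.5 * (t - tp) ** 2)   # squared-exponential kernel

times = [1.0, 2.0, 3.0, 4.0]
mu_1K, Sigma_1K = gp_prior(times, mean_fn, kernel_fn)
print(Sigma_1K.shape)   # (4, 4): the covariance Sigma(x_{1:K}) of the batch state
```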
Similarly, an instance of (2) is the state-indexed Gaussian process system model:

xk+1 = l(xk), l ∼ GP(µ(x), Σ(x, x′)), x, x′ ∈ R^n.   (4)

Measurement Model. Consider m nonlinear sensors that operate in discrete time:

zi,k = gi(xk) + vi,k, i ∈ [m], k ∈ N,   (5)

where, for the continuous-time system in (1), we let xk := x(tk) at a pre-specified set of measurement times t1, t2, . . ., and vi,k is the measurement noise of sensor i at time k.

Assumption 1. The vi,k are independent across i and k. In addition, each gi is one-time differentiable.

Sensor Scheduling Model. The m sensors in (5) are used at K scheduled measurement times {t1, t2, . . . , tK}. At each k ∈ [K], only sk of the m sensors are used (sk ≤ m), resulting in the batch measurement vector yk:

yk = Sk zk, k ∈ [K],   (6)

where Sk is a sensor selection matrix, composed of sub-matrices [Sk]ij (i ∈ [sk], j ∈ [m]) such that [Sk]ij = I if sensor j is used at time k, and [Sk]ij = 0 otherwise. We assume that a sensor can be used at most once at each k; as a result, for each i there is one j such that [Sk]ij = I, while for each j there is at most one i such that [Sk]ij = I.

We now present the sensor scheduling problem formally:

Notation: For i, j ∈ N, φi:j ≡ (φi, φi+1, . . . , φj). In addition, Sk ≡ {j : there exists i ∈ [sk] such that [Sk]ij = I}: Sk is the set of indices that correspond to the sensors utilized at tk.

Problem 1 (Sensor Scheduling in Stochastic Processes with Nonlinear Observations). Select at each time k a subset of sk sensors, out of the m sensors in (5), to use in order to minimize the conditional entropy of x1:K given the measurements y1:K:

minimize_{Sk ⊆ [m], k ∈ [K]}  H(x1:K|S1:K)
subject to  |Sk| ≤ sk, k ∈ [K],

where H(x1:K|S1:K) denotes the conditional entropy H(x1:K|y1:K) of x1:K given the measurements y1:K.

The conditional entropy H(x1:K|y1:K) captures the estimation accuracy of x1:K given y1:K, as we explain in the following two propositions:

Proposition 1. H(x1:K|y1:K) is a constant factor away from the mutual information of x1:K and y1:K. In particular:

H(x1:K|y1:K) = −I(x1:K; y1:K) + H(x1:K),

where I(x1:K; y1:K) is the mutual information of x1:K and y1:K, and H(x1:K) is constant.

Proposition 2. Consider the Gaussian process (3) and suppose that the measurement noise in (5) is Gaussian, vi,k ∼ N(0, Σ(vi,k)). H(x1:K|y1:K) is a constant factor away from log det(Σ(x⋆1:K)), where Σ(x⋆1:K) is the error covariance of the minimum mean square estimator x⋆1:K of x1:K given the measurements y1:K. In particular:

H(x1:K|y1:K) = (1/2) log det(Σ(x⋆1:K)) + (nK/2) log(2πe).

Footnote 1: We explain x⋆1:K and log det(Σ(x⋆1:K)): x⋆1:K is the optimal estimator for x1:K, since it minimizes among all estimators of x1:K the mean square error E(‖x1:K − x⋆1:K‖2^2) (‖·‖2 is the Euclidean norm), where the expectation is taken with respect to y1:K [7]. log det(Σ(x⋆1:K)) is an estimation error metric related to ‖x1:K − x⋆1:K‖2^2, since when it is minimized, the probability that the estimation error ‖x1:K − x⋆1:K‖2^2 is small is maximized [10].
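As a quick numeric illustration of Propositions 1 and 2 (our own toy example, not part of the paper), consider a single batch of linear measurements y = Cx + v, a special case of (5); the script checks that the entropy computed from the minimum mean square error covariance, as in Proposition 2, coincides with H(x1:K) − I(x1:K; y1:K), as in Proposition 1. All variable names (Sigma_x, C, R) are hypothetical.

```python
# Numeric illustration of Propositions 1 and 2 for a linear-Gaussian toy example.
import numpy as np

def gauss_entropy(S):
    """Differential entropy of N(mu, S) in nats: (1/2) log[(2*pi*e)^m det(S)]."""
    return 0.5 * np.linalg.slogdet(S)[1] + 0.5 * S.shape[0] * np.log(2 * np.pi * np.e)

rng = np.random.default_rng(0)
nK = 4                                         # dimension of the batch state x_{1:K}
A = rng.standard_normal((nK, nK))
Sigma_x = A @ A.T + nK * np.eye(nK)            # prior covariance of x_{1:K}
C = rng.standard_normal((2, nK))               # two selected (linearized) sensors
R = np.diag([0.5, 0.2])                        # measurement-noise covariance

# Proposition 2: H(x|y) from the MMSE error covariance (Sigma_x^{-1} + C^T R^{-1} C)^{-1}.
Sigma_star = np.linalg.inv(np.linalg.inv(Sigma_x) + C.T @ np.linalg.inv(R) @ C)
H_cond = gauss_entropy(Sigma_star)

# Proposition 1: H(x|y) = H(x) - I(x; y), with I(x; y) = H(x) + H(y) - H(x, y).
Sigma_y = C @ Sigma_x @ C.T + R
Sigma_joint = np.block([[Sigma_x, Sigma_x @ C.T], [C @ Sigma_x, Sigma_y]])
I_xy = gauss_entropy(Sigma_x) + gauss_entropy(Sigma_y) - gauss_entropy(Sigma_joint)
print(np.allclose(H_cond, gauss_entropy(Sigma_x) - I_xy))   # True
```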
III. MAIN RESULTS

We first prove that Problem 1 is NP-hard, and then derive for it a provably near-optimal approximation algorithm.

Theorem 1. The problem of sensor scheduling in stochastic processes with nonlinear observations (Problem 1) is NP-hard.

Proof. See Appendix B. Our approach is to find an instance of Problem 1 that is equivalent to the NP-hard minimal observability problem introduced in [26], [27].

Due to Theorem 1, we need to appeal to approximation algorithms to obtain a solution to Problem 1 in polynomial time. To this end, we propose an efficient near-optimal algorithm (Algorithm 1, with a subroutine in Algorithm 2) and quantify its performance and time complexity in the following theorem.

Algorithm 1 Approximation algorithm for Problem 1.
Input: Horizon K, scheduling constraints s1, s2, . . . , sK, error metric H(x1:K|S1:K) : Sk ⊆ [m], k ∈ [K] ↦ R
Output: Sensor sets (S1, S2, . . . , SK) that approximate the solution to Problem 1, as quantified in Theorem 2
k ← 1, S1:0 ← ∅
while k ≤ K do
  1. Apply Algorithm 2 to min_{S ⊆ [m]} {H(x1:K|S1:k−1, S) : |S| ≤ sk}   (7)
  2. Denote by Sk the solution Algorithm 2 returns
  3. S1:k ← (S1:k−1, Sk)
  4. k ← k + 1
end while

Theorem 2. The theorem has two parts:
1) Approximation performance of Algorithm 1: Algorithm 1 returns sensor sets S1, S2, . . . , SK that:
i. satisfy all the feasibility constraints of Problem 1: |Sk| ≤ sk, k ∈ [K];
ii. achieve an error H(x1:K|S1:K) such that:

(H(x1:K|S1:K) − OPT) / (MAX − OPT) ≤ 1/2,   (8)

where OPT is the optimal cost of Problem 1, and MAX ≡ max_{S′1:K} H(x1:K|S′1:K) is the maximum (worst) cost in Problem 1.
2) Time complexity of Algorithm 1: Algorithm 1 has time complexity O((∑_{k=1}^K sk^2) T), where T is the time complexity of evaluating H(x1:K|S′1:K) : S′k ⊆ [m], k ∈ [K] ↦ R at an S′1:K.

In the following paragraphs, we discuss Algorithm 1's approximation quality and time complexity, and fully characterize the latter in Theorem 3 and Corollary 1 for Gaussian processes and Gaussian measurement noise.

Supermodularity and monotonicity of H(x1:K|y1:K): We state two properties of H(x1:K|y1:K) that are used to prove Theorem 2. In particular, we show that H(x1:K|y1:K) is a non-increasing and supermodular function with respect to the sequence of selected sensors. Then, Theorem 2 follows by combining these two results with results on submodular function maximization over matroid constraints [28]. These derivations are presented in Appendix C.

Approximation quality of Algorithm 1: Theorem 2 quantifies the worst-case performance of Algorithm 1 across all values of Problem 1's parameters, since the right-hand side of (8) is constant. In particular, (8) guarantees that for any instance of Problem 1, the distance of the approximate cost H(x1:K|S1:K) from OPT is at most 1/2 the distance of the worst (maximum) cost MAX from OPT. This approximation factor is close to the optimal approximation factor 1/e ≅ 0.37 one can achieve in the worst case for Problem 1 in polynomial time [29]; the reason is twofold: first, Problem 1 involves the minimization of a non-increasing and supermodular function [30], and second, as we proved in Theorem 1, Problem 1 is in the worst case equivalent to the minimal observability problem introduced in [26], which cannot be approximated in polynomial time with a factor better than 1/e [31].

Algorithm 2 Single step greedy algorithm (subroutine in Algorithm 1).
Input: Current iteration k, selected sensor sets (S1, S2, . . . , Sk−1) up to the current iteration, constraint sk, error metric H(x1:K|S1:K) : Sk ⊆ [m], k ∈ [K] ↦ R
Output: Sensor set Sk that approximates the solution to Problem 1 at time k
S0 ← ∅, X0 ← [m], and t ← 1
Iteration t:
  1. If Xt−1 = ∅, return St−1
  2. Select i(t) ∈ Xt−1 for which ρi(t)(St−1) = max_{i ∈ Xt−1} ρi(St−1), with ties settled arbitrarily, where:
     ρi(St−1) ≡ H(x1:K|S1:k−1, St−1) − H(x1:K|S1:k−1, St−1 ∪ {i})
  3.a. If |St−1 ∪ {i(t)}| > sk, Xt−1 ← Xt−1 \ {i(t)}, and go to Step 1
  3.b. If |St−1 ∪ {i(t)}| ≤ sk, St ← St−1 ∪ {i(t)} and Xt ← Xt−1 \ {i(t)}
  4. t ← t + 1 and continue
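For readers who prefer code, the sketch below is a compact Python paraphrase of Algorithms 1 and 2 (illustrative only, not the authors' implementation). It assumes a problem-specific oracle cond_entropy(schedule) that evaluates H(x1:K|S1:K) for a list of per-step sensor sets; since every sensor occupies exactly one slot, the budget test in Steps 3.a–3.b reduces here to stopping after sk selections.

```python
# Paraphrase of the greedy scheduling scheme (Algorithms 1 and 2); the oracle
# cond_entropy is assumed to be supplied by the user (see Lemmas 4 and 5 for how
# it can be evaluated in the Gaussian case).
def greedy_step(cond_entropy, prefix, m, s_k):
    """Algorithm 2: greedily pick at most s_k of the m sensors for the current step."""
    chosen, candidates = set(), set(range(m))
    while candidates and len(chosen) < s_k:
        base = cond_entropy(prefix + [chosen])
        # rho_i: entropy reduction from adding sensor i (Step 2 of Algorithm 2)
        best = max(candidates, key=lambda i: base - cond_entropy(prefix + [chosen | {i}]))
        chosen.add(best)
        candidates.remove(best)
    return chosen

def greedy_schedule(cond_entropy, K, m, s):
    """Algorithm 1: apply the single-step greedy rule for k = 1, ..., K."""
    schedule = []
    for k in range(K):
        schedule.append(greedy_step(cond_entropy, schedule, m, s[k]))
    return schedule

# Example call (with a problem-specific oracle):
#   schedule = greedy_schedule(cond_entropy, K=3, m=5, s=[2, 2, 1])
```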
Remark 1. We can improve the 1/2 approximation factor of Algorithm 1 to 1/e by utilizing the algorithm introduced in [32]. However, this algorithm has time complexity O((nK)^11 T), where T is the time complexity of evaluating H(x1:K|S′1:K) : S′k ⊆ [m], k ∈ [K] ↦ R at an S′1:K.

Time complexity of Algorithm 1: Algorithm 1's time complexity is broken down into two parts: a) the number of evaluations of H(x1:K|y1:K) required by the algorithm; b) the time complexity of each such evaluation. In more detail:

a) Number of evaluations of H(x1:K|y1:K) required by Algorithm 1: Algorithm 1 requires at most sk^2 evaluations of H(x1:K|y1:K) at each k ∈ [K]. Therefore, Algorithm 1 achieves a time complexity that is only linear in K with respect to the number of evaluations of H(x1:K|y1:K); the reason is that ∑_{k=1}^K sk^2 ≤ max_{k∈[K]}(sk^2) K. This is in contrast to the algorithm in Remark 1, which obtains the best approximation factor 1/e, but whose time complexity is of the order O((nK)^11) with respect to the number of evaluations of H(x1:K|y1:K).

Footnote 2: We can also speed up Algorithm 1 by implementing in Algorithm 2 the method of lazy evaluations [33]: this method avoids, in Step 2 of Algorithm 2, the computation of ρi(St−1) for unnecessary choices of i.

b) Time complexity of each evaluation of H(x1:K|y1:K): This time complexity depends on the properties of both the stochastic process (1) (similarly, (2)) and the measurement noise vi,k in (5). For the case of Gaussian stochastic processes and measurement noises:

Theorem 3. Consider the Gaussian process model (3) and suppose that the measurement noise is Gaussian: vi,k ∼ N(0, Σ(vi,k)) with Σ(vi,k) ≻ 0. The time complexity of evaluating H(x1:K|y1:K) depends on the sparsity pattern of Σ(x1:K) and Σ(x1:K)−1 as follows.
• Each evaluation of H(x1:K|y1:K) has time complexity O(n^2.4 K), when either Σ(x1:K) or Σ(x1:K)−1 is exactly sparse (that is, block tri-diagonal).
• Each evaluation of H(x1:K|y1:K) has time complexity O(n^2.4 K^2.4), when both Σ(x1:K) and Σ(x1:K)−1 are dense.

Theorem 3 implies that when Σ(x1:K) or Σ(x1:K)−1 is exactly sparse, the time complexity of each evaluation of H(x1:K|y1:K) is only linear in K. This is important because Σ(x1:K) or Σ(x1:K)−1 is exactly sparse for several applications and system models [34]. For example, in adversarial target tracking applications, where the target wants to avoid capture and randomizes its motion in the environment (by uncorrelating its movements), Σ(x1:K) can be considered tri-diagonal (since this implies x(tk) and x(tk′) are uncorrelated for |k − k′| ≥ 2). Similarly, in SLAM, or in system models where the Gaussian process in (3) is generated by a linear or nonlinear system corrupted with Gaussian noise, Σ(x1:K)−1 is block tri-diagonal [24]. In particular, for linear systems, Σ(x1:K)−1 is block tri-diagonal [24, Section 3.1], and for nonlinear systems, Σ(x1:K)−1 is efficiently approximated by a block tri-diagonal matrix as follows: for each k, before the k-th iteration of Step 1 in Algorithm 1, we first compute ˜µ1:K given the measurements y1:(k−1) up to k. This step has complexity O(n^2.4 K) when Σ(x1:K)−1 is sparse [24, Eq. (5)] [35, Section 3.8], and it does not increase the total time complexity of Algorithm 1. Then, we continue as in [24, Section 3.2].
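The linear-in-K cost in Theorem 3 stems from the fact that the log determinant of a block tri-diagonal matrix can be accumulated with one n × n factorization per block. The sketch below (ours; it uses a generic Schur-complement recursion rather than the specific routine of [41], and numpy's dense kernels are cubic per block rather than O(n^2.4)) checks the recursion against a dense computation on a small example.

```python
# Block tri-diagonal log-determinant in K block steps (illustrative sketch).
import numpy as np

def logdet_block_tridiag(D, B):
    """D: list of K (n x n) diagonal blocks; B: list of K-1 (n x n) sub-diagonal blocks."""
    S = D[0].copy()
    total = np.linalg.slogdet(S)[1]
    for k in range(1, len(D)):
        S = D[k] - B[k - 1] @ np.linalg.solve(S, B[k - 1].T)   # Schur complement
        total += np.linalg.slogdet(S)[1]
    return total

# Quick check against the dense computation on a random block tri-diagonal matrix.
rng, n, K = np.random.default_rng(1), 3, 5
B = [0.1 * rng.standard_normal((n, n)) for _ in range(K - 1)]
D = [np.eye(n) * (2 + k) for k in range(K)]
M = np.zeros((n * K, n * K))
for k in range(K):
    M[k*n:(k+1)*n, k*n:(k+1)*n] = D[k]
for k in range(K - 1):
    M[(k+1)*n:(k+2)*n, k*n:(k+1)*n] = B[k]
    M[k*n:(k+1)*n, (k+1)*n:(k+2)*n] = B[k].T
print(np.allclose(logdet_block_tridiag(D, B), np.linalg.slogdet(M)[1]))   # True
```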
Sparsity in H(x1:K|y1:K): We state the two properties of H(x1:K|y1:K) that lead to Theorem 3. In particular, we prove that H(x1:K|y1:K) is expressed in closed form with two different formulas, such that the time complexity of evaluating H(x1:K|y1:K) using the first formula is decided by the sparsity pattern of Σ(x1:K), whereas using the second formula it is decided by the sparsity pattern of Σ(x1:K)−1. The reason for this dependence is that the rest of the matrices in these formulas are sparser than Σ(x1:K) or Σ(x1:K)−1; in particular, they are block diagonal.

The full characterization of Algorithm 1's time complexity for Gaussian processes and Gaussian measurement noises follows.

Corollary 1. Consider the Gaussian process model (3) and suppose that the measurement noise is Gaussian: vi,k ∼ N(0, Σ(vi,k)) with Σ(vi,k) ≻ 0. The time complexity of Algorithm 1 depends on the sparsity pattern of Σ(x1:K) and Σ(x1:K)−1 as follows.
• Algorithm 1 has time complexity O(n^2.4 K ∑_{k=1}^K sk^2), when either Σ(x1:K) or Σ(x1:K)−1 is exactly sparse (that is, block tri-diagonal).
• Algorithm 1 has time complexity O(n^2.4 K^2.4 ∑_{k=1}^K sk^2), when both Σ(x1:K) and Σ(x1:K)−1 are dense.

Comparison of Algorithm 1's time complexity for Gaussian processes and Gaussian measurement noises, per Corollary 1, to that of existing scheduling algorithms: The most relevant algorithm to Algorithm 1 is the one provided in [10], where linear systems with additive process noise and measurement noises with any distribution are assumed. Algorithm 1 generalizes [10] from linear systems and measurements to Gaussian processes and nonlinear measurements. At the same time, it achieves the same time complexity as the algorithm in [10] when Σ(x1:K) or Σ(x1:K)−1 is exactly sparse. This is important since the algorithm in [10] has time complexity lower than the state-of-the-art batch estimation sensor scheduling algorithms, such as the algorithm proposed in [11], and similar to that of the state-of-the-art Kalman filter scheduling algorithms, such as those proposed in [14], [17], [21] (in particular, lower for large K).

IV. CONCLUSION

In this paper, we proposed Algorithm 1 for the NP-hard problem of sensor scheduling for stochastic process estimation. Exploiting the supermodularity and monotonicity of conditional entropy, we proved that the algorithm has an approximation factor of 1/2 and linear complexity in the scheduling horizon. It achieves both the accuracy of batch estimation scheduling algorithms and, surprisingly, when the information structure of the problem is sparse, the low time complexity of Kalman filter scheduling algorithms for linear systems. This is the case, for example, in applications such as SLAM and target tracking, and for processes generated by linear or nonlinear systems corrupted with Gaussian noise. Future work will focus on an event-triggered version of the scheduling problem, in which the measurement times are decided online based on the available measurements, and on a decentralized version, in which information is exchanged only among neighboring sensors.

APPENDIX A: PROOF OF PROPOSITION 2

Proof. We first show that the conditional probability distribution of x1:K given y1:K is Gaussian with covariance Σ(x⋆1:K), and then apply the following lemma:

Lemma 1 (Ref. [36]). Let x ∼ N(µ, Σ) with x ∈ R^m. Then:

H(x) = (1/2) log[(2πe)^m det(Σ)].
Specifically, due to Assumption 1, (x1:K, y1:K) are jointly Gaussian. This has a twofold implication: first, the minimum mean square estimator of x1:K given y1:K is linear in y1:K [37, Proposition E.2]; second, the conditional probability distribution of x1:K given y1:K is Gaussian [38], with covariance Σ(x⋆1:K). Therefore, due to [37, Proposition E.3], this is also the covariance of the minimum mean square estimator of x1:K given y1:K. As a result, due to Lemma 1:

H(x1:K|y1:K) = E_{y1:K=y′1:K}(H(x1:K|y1:K = y′1:K))
             = E_{y1:K=y′1:K}((1/2) log[(2πe)^{nK} det(Σ(x⋆1:K))])
             = (nK/2) log(2πe) + (1/2) log det(Σ(x⋆1:K)).   (9)

We derive a formula for Σ(x⋆1:K) in the proof of Lemma 4.

APPENDIX B: PROOF OF THEOREM 1

Proof. We present for the discrete time case (2) an instance of Problem 1 that is equivalent to the NP-hard minimal observability problem introduced in [26], [27], which is defined as follows (the proof for the continuous time case is similar):

Definition (Minimal Observability Problem): Consider the linear time-invariant system:

ẋ(t) = A x(t), yi(t) = ri ei⊤ x(t), i ∈ [n],   (10)

where ei is the vector with the i-th entry equal to 1 and the rest equal to 0, and ri is either zero or one; the minimal observability problem follows:

select r1, r2, . . . , rn such that r1 + r2 + · · · + rn ≤ r and (10) is observable.   (11)

The minimal observability problem is NP-hard when A is chosen as in the proof of Theorem 1 of [26], and r ≤ n. We denote this A by ANP−h.

Problem 1 is equivalent to the NP-hard minimal observability problem for the following instance: K = 1, x(t0) ∼ N(c, 0), where c ∈ R^n is an unknown constant, µ(t) = exp(ANP−h (t − t0)) x(t0), Σ(t, t′) = 0, m = n, gi(x(t)) = ei⊤ x(t), zero measurement noise, and s1 = r. This observation concludes the proof.

APPENDIX C: PROOF OF THEOREM 2

Proof. We first prove that H(x1:K|S1:K) is a non-increasing and supermodular function in the choice of the sensors. Then, we prove Theorem 2 by combining these two results and results on the maximization of submodular functions over matroid constraints [28].

Notation: Given K disjoint finite sets E1, E2, . . . , EK and Ai, Bi ∈ Ei, we write A1:K ⪯ B1:K to denote that for all i ∈ [K], Ai ⊆ Bi (Ai is a subset of Bi). Moreover, we denote that Ai ∈ Ei for all i ∈ [K] by A1:K ∈ E1:K. In addition, given A1:K, B1:K ∈ E1:K, we write A1:K ⊎ B1:K to denote the sequence of the unions Ai ∪ Bi, i ∈ [K].

Definition 1. Consider K disjoint finite sets E1, E2, . . . , EK. A function h : E1:K ↦ R is non-decreasing if and only if for all A, B ∈ E1:K such that A ⪯ B, h(A) ≤ h(B); h : E1:K ↦ R is non-increasing if −h is non-decreasing.

Proposition 3. For any finite K ∈ N, consider K distinct copies of [m], denoted by R1, R2, . . . , RK. The estimation error metric H(x1:K|S1:K) : R1:K ↦ R is a non-increasing function in the choice of the sensors S1:K.

Proof. Consider A, B ∈ R1:K such that A ⪯ B, and denote by B \ A ≡ {i | i ∈ B, i ∉ A}:

H(x1:K|B) = H(x1:K|A, B \ A) ≤ H(x1:K|A),

since conditioning can either keep constant or decrease the entropy [36].

Definition 2. Consider K disjoint finite sets E1, E2, . . . , EK. A function h : E1:K ↦ R is submodular if and only if for all A, B, C ∈ E1:K such that A ⪯ B, h(A ⊎ C) − h(A) ≥ h(B ⊎ C) − h(B); h : E1:K ↦ R is supermodular if −h is submodular.

Proposition 4. For any finite K ∈ N, consider K distinct copies of [m], denoted by R1, R2, . . . , RK; the estimation error metric H(x1:K|S1:K) : R1:K ↦ R is a set supermodular function in the choice of the sensors S1:K.

Proof. Let A, B, C ∈ E1:K such that A ⪯ B:

H(x1:K|A) − H(x1:K|A ⊎ C)   (12)
= H(x1:K|A) − H(x1:K|A, C) = I(x1:K; C|A)   (13)
= H(C|A) − H(C|x1:K, A)   (14)
≥ H(C|B) − H(C|x1:K, B)   (15)
= I(x1:K; C|B)   (16)
= H(x1:K|B) − H(x1:K|B, C)   (17)
= H(x1:K|B) − H(x1:K|B ⊎ C).   (18)

Eq. (12) and (18) follow from our definition of ⊎. (13) and (14), (15) and (16), and (16) and (17) hold due to the definition of mutual information [36]. (15) follows from (14) for two reasons: first, H(C|A) ≥ H(C|B), since A ⪯ B and conditioning can either keep constant or decrease the entropy [36]; second, H(C|x1:K, A) = H(C|x1:K, B) due to the independence of the measurements given x1:K, per Assumption 1.
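Before moving to the proof of Theorem 2, the following toy script (our own linear-Gaussian example with independent scalar sensors; all names are hypothetical) numerically illustrates Propositions 3 and 4: adding sensors never increases the conditional entropy, and the entropy drop contributed by a new sensor shrinks as the already-selected set grows.

```python
# Numeric illustration of monotonicity (Prop. 3) and supermodularity (Prop. 4)
# for a toy linear-Gaussian model with independent scalar sensors.
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 5
A = rng.standard_normal((n, n))
Sigma_x = A @ A.T + n * np.eye(n)                 # prior covariance of the batch state
C_all = rng.standard_normal((m, n))               # row i = linearized sensor i
noise = 0.3                                       # scalar measurement-noise variance

def cond_entropy(S):
    """H(x | measurements of the sensors in S) for the toy model."""
    info = np.linalg.inv(Sigma_x)
    for i in S:
        ci = C_all[i:i+1]
        info = info + ci.T @ ci / noise
    return 0.5 * np.linalg.slogdet(np.linalg.inv(info))[1] + 0.5 * n * np.log(2*np.pi*np.e)

A_set, B_set, c = {0}, {0, 1, 2}, 3               # A is a subset of B, candidate sensor c
drop_A = cond_entropy(A_set) - cond_entropy(A_set | {c})
drop_B = cond_entropy(B_set) - cond_entropy(B_set | {c})
print(drop_A >= drop_B)                            # True: diminishing returns
print(cond_entropy(B_set) <= cond_entropy(A_set))  # True: monotone non-increasing
```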
Proof of Part 1 of Theorem 2. We use the next result from the literature on the maximization of submodular functions over matroid constraints:

Definition 3. Consider a finite set E and a collection C of subsets of E. (E, C) is:
• an independence system if and only if:
– ∅ ∈ C, where ∅ denotes the empty set;
– for all X′ ⊆ X ⊆ E, if X ∈ C, then X′ ∈ C.
• a matroid if and only if, in addition to the previous two properties:
– for all X′, X ∈ C with |X′| < |X|, there exists x ∉ X′ and x ∈ X such that X′ ∪ {x} ∈ C.

Lemma 2 (Ref. [28]). Consider K independence systems {(Ek, Ck)}k∈[K], each the intersection of at most P matroids, and a submodular and non-decreasing function h : E1:K ↦ R. There exists a polynomial-time greedy algorithm that returns an (approximate) solution S1:K to:

maximize_{S1:K ⪯ E1:K}  h(S1:K)
subject to  Sk ∩ Ek ∈ Ck, k ∈ [K],   (19)

that satisfies:

(h(O) − h(S1:K)) / (h(O) − h(∅)) ≤ P/(1 + P),   (20)

where O is an (optimal) solution to (19).

Lemma 3. Problem 1 is an instance of (19) with P = 1.

Proof. We identify the instance of {Ek, Ck}k∈[K] and h, respectively, that translates (19) to Problem 1: Given K distinct copies of [m], denoted by R1, R2, . . . , RK, first consider Ek = Rk and Ck = {S | S ⊆ Rk, |S| ≤ sk}: (Ek, Ck) satisfies the first two points in part 1 of Definition 3, and as a result is an independence system. Moreover, by its definition, Sk ∩ Ek ∈ Ck if and only if |Sk| ≤ sk. Second, for all S1:K ⪯ E1:K, consider: h(S1:K) = −H(x1:K|S1:K). From Propositions 3 and 4, h(S1:K) is set submodular and non-decreasing.

In addition to Lemma 3, the independence system (Ek, Ck), where Ek = Rk and Ck = {S | S ⊆ Rk, |S| ≤ sk}, also satisfies the point in part 2 of Definition 3; thereby, it is also a matroid, and as a result P, as in Lemma 2, is 1. This observation, along with Lemmas 2 and 3, completes the proof of (8), since the adaptation of the greedy algorithm in [28, Theorem 4.1] to Problem 1 results in Algorithm 1.

Proof of Part 2 of Theorem 2. Algorithm 1 requires for each k ∈ [K] the application of Algorithm 2 to (7). In addition, each such application requires at most sk^2 evaluations of H(x1:K|y1:K). Therefore, Algorithm 1 has time complexity O((∑_{k=1}^K sk^2) T). The proof of Theorem 2 is complete.

APPENDIX D: PROOF OF THEOREM 3

Notations: We introduce five notations: first, S1:K is the block diagonal matrix with diagonal elements the sensor selection matrices S1, S2, . . . , SK; second, c(x1:K) is the batch vector [(S1g(x1))⊤, (S2g(x2))⊤, . . . , (SKg(xK))⊤]⊤, where g(xk) ≡ (g1(xk), g2(xk), . . . , gm(xk))⊤; third, C(x1:K) is the block diagonal matrix with diagonal elements the matrices S1C1, S2C2, . . . , SKCK, where Ck ≡ G(xk) and G(x(t)) ≡ ∂g(x(t))/∂x(t); fourth, vk is the batch measurement noise vector (v⊤1,k, v⊤2,k, . . . , v⊤m,k)⊤; fifth, µ1:K ≡ (µ(t1)⊤, µ(t2)⊤, . . . , µ(tK)⊤)⊤.
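As a shape check for this notation (illustrative only; the sensor outputs are taken scalar and the Jacobians are stand-ins), the snippet below assembles the block diagonal S1:K and C(x1:K) for a toy instance with m = 3 sensors and K = 2 measurement times.

```python
# Assembling the block diagonal batch matrices of Appendix D for a toy instance.
import numpy as np
from scipy.linalg import block_diag

def selection_matrix(S_k, m, p):
    """S_k: sorted list of selected sensor indices; each sensor outputs p scalars."""
    rows = [np.kron(np.eye(m)[j], np.eye(p)) for j in S_k]
    return np.vstack(rows) if rows else np.zeros((0, m * p))

# Hypothetical example: m = 3 sensors with scalar outputs (p = 1), K = 2 steps.
S1, S2 = [0, 2], [1]
S_1K = block_diag(selection_matrix(S1, 3, 1), selection_matrix(S2, 3, 1))
C1 = np.ones((3, 4))          # stand-in Jacobian G(x_1), with n = 4
C2 = 2 * np.ones((3, 4))      # stand-in Jacobian G(x_2)
C_x1K = block_diag(selection_matrix(S1, 3, 1) @ C1, selection_matrix(S2, 3, 1) @ C2)
print(S_1K.shape, C_x1K.shape)   # (3, 6) (3, 8)
```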
Proof. We first derive the two formulas for H(x1:K|y1:K): the first formula is expressed in terms of Σ(x1:K)−1, and the second formula is expressed in terms of Σ(x1:K).

Lemma 4 (Formula of H(x1:K|y1:K) in terms of Σ(x1:K)−1). Consider the start of the k-th iteration in Algorithm 1. Given the measurements y1:(k−1) up to k, H(x1:K|y1:K) is given by −T′1 + (nK/2) log(2πe), where:

T′1 ≡ (1/2) log det(Ξ + Σ(x1:K)−1),
Ξ ≡ C(˜µ1:K)⊤ S1:K Σ(v1:K)−1 S⊤1:K C(˜µ1:K),

and ˜µ1:K is the maximum a posteriori (MAP) estimate of x1:K given the measurements y1:(k−1) up to k.

Proof. Before the k-th iteration of Step 1 in Algorithm 1, we first compute ˜µ1:K given y1:(k−1) up to k. This step has complexity O(n^2.4 K) when Σ(x1:K)−1 is sparse [24, Eq. (5)] [35, Section 3.8], and it does not increase the total time complexity of Algorithm 1. Next, given ˜µ1:K, we linearise our measurement model over ˜µ1:K and compute C(˜µ1:K). Then, we continue as follows: x1:K and y1:K are jointly Gaussian:

(x1:K, y1:K) ∼ N(E(x1:K, y1:K), Σ(x1:K, y1:K)),
E(x1:K, y1:K) = (µ1:K, c(µ1:K)),
Σ(x1:K, y1:K) = [ Σ(x1:K),  Σ(x1:K) C(˜µ1:K)⊤ ; C(˜µ1:K) Σ(x1:K),  Σ(y1:K) ],

where:

Σ(y1:K) = S1:K Σ(v1:K) S⊤1:K + C(˜µ1:K) Σ(x1:K) C(˜µ1:K)⊤.

Therefore, the conditional probability distribution of x1:K given y1:K has covariance Σ(x⋆1:K) (using our notation in Proposition 2) such that:

Σ(x⋆1:K) = Σ(x1:K) − Σ(x1:K) C(˜µ1:K)⊤ Φ C(˜µ1:K) Σ(x1:K),

where:

Φ ≡ (C(˜µ1:K) Σ(x1:K) C(˜µ1:K)⊤ + S1:K Σ(v1:K) S⊤1:K)−1.

Using the Woodbury matrix identity [39, Corollary 2.8.8]:

Σ(x⋆1:K) = (Ξ + Σ(x1:K)−1)−1,

where we also used the identity (S1:K Σ(v1:K) S⊤1:K)−1 = S1:K Σ(v1:K)−1 S⊤1:K, which holds since S1:K and Σ(v1:K) are block diagonal. Using (9), the proof is complete.

Remark 2. The time complexity for the evaluation of H(x1:K|y1:K) using Lemma 4 is decided by the sparsity of Σ(x1:K)−1, since the rest of the matrices are block diagonal.
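The Woodbury step above is easy to verify numerically. In the toy check below (ours; C stands in for the linearized C(˜µ1:K) of the selected sensors and R for S1:K Σ(v1:K) S⊤1:K), the covariance (Ξ + Σ(x1:K)−1)−1 matches the Schur-complement expression for Σ(x⋆1:K), and −T′1 + (nK/2) log(2πe) matches the Gaussian entropy of that covariance.

```python
# Toy check of the Woodbury step in Lemma 4 (illustrative, not code from the paper).
import numpy as np

rng = np.random.default_rng(3)
nK, p = 4, 2
A = rng.standard_normal((nK, nK))
Sigma = A @ A.T + nK * np.eye(nK)     # Sigma(x_{1:K})
C = rng.standard_normal((p, nK))      # stands in for C(mu~_{1:K}) of the selected sensors
R = np.diag([0.4, 0.7])               # stands in for S_{1:K} Sigma(v_{1:K}) S_{1:K}^T

Xi = C.T @ np.linalg.inv(R) @ C
Sigma_star_woodbury = np.linalg.inv(Xi + np.linalg.inv(Sigma))
Sigma_star_schur = Sigma - Sigma @ C.T @ np.linalg.solve(C @ Sigma @ C.T + R, C @ Sigma)
print(np.allclose(Sigma_star_woodbury, Sigma_star_schur))        # True

# H(x|y) = -T'_1 + (nK/2) log(2*pi*e), with T'_1 = (1/2) log det(Xi + Sigma^{-1}).
T1_prime = 0.5 * np.linalg.slogdet(Xi + np.linalg.inv(Sigma))[1]
H = -T1_prime + 0.5 * nK * np.log(2 * np.pi * np.e)
H_direct = 0.5 * np.linalg.slogdet(Sigma_star_schur)[1] + 0.5 * nK * np.log(2 * np.pi * np.e)
print(np.allclose(H, H_direct))                                   # True
```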
Lemma 5 (Formula of H(x1:K|y1:K) in terms of Σ(x1:K)). Consider the start of the k-th iteration in Algorithm 1. Given the measurements y1:(k−1) up to k, H(x1:K|y1:K) is given by H(x1:K|y1:K) = T1 − T2 + H(x1:K), where:

T1 ≡ (1/2) ∑_{k=1}^K log[(2πe)^{sk} det(Sk Σ(vk) S⊤k)],   (21)
T2 ≡ (1/2) log[(2πe)^{∑_{k=1}^K sk} det(Σ(y1:K))],   (22)
Σ(y1:K) = S1:K Σ(v1:K) S⊤1:K + C(˜µ1:K) Σ(x1:K) C(˜µ1:K)⊤,

and ˜µ1:K is the maximum a posteriori (MAP) estimate of x1:K given the measurements y1:(k−1) up to k.

Proof. Before the k-th iteration of Step 1 in Algorithm 1, we first compute ˜µ1:K given y1:(k−1) up to k. This step has complexity O(n^2.4 K) when: a) Σ(x1:K) is sparse [24, Eq. (5) after multiplying both its sides with Σ(x1:K)]; b) certain invertibility conditions apply [35, Section 3.8]. In this case, this step does not increase the total time complexity of Algorithm 1. If the invertibility conditions in [35, Section 3.8] do not apply, the complexity of this computation is O(n^2.4 K^2.4). In this case, we can use µ1:K, instead of ˜µ1:K, to evaluate H(x1:K|y1:K), and keep the overall complexity of Algorithm 1 at O(n^2.4 K). Next, given ˜µ1:K, we linearise our measurement model over ˜µ1:K and compute C(˜µ1:K). Then, we continue as follows: the chain rule for conditional entropies implies [36]:

H(x1:K|y1:K) = H(y1:K|x1:K) − H(y1:K) + H(x1:K).

Thus, we derive a closed formula for H(y1:K|x1:K) and H(y1:K):

Closed form of H(y1:K|x1:K): The chain rule for conditional entropies implies [36]:

H(y1:K|x1:K) = ∑_{k=1}^K H(y(tk)|x1:K, y1:k−1)   (23)
             = ∑_{k=1}^K E_{x(tk′)}(H(y(tk)|x(tk) = x(tk′))).   (24)

Eq. (24) follows from (23) because, given x(tk), y(tk) is independent of y1:k−1, x1:(k−1) and x(k+1):K. In addition, (21) follows from (24) because y(tk)|x(tk) ∼ N(Sk g(x(tk)), Sk Σ(vk) S⊤k) and, thus, Lemma 1 applies.

Closed form of H(y1:K): To this end, we derive the marginal distribution of y1:K, denoted by f(y1:K):

f(y1:K) = ∫ f(y1:K, x1:K) dx1:K = ∫ f(y1:K|x1:K) f(x1:K) dx1:K,

where f(x1:K) denotes the probability distribution of x1:K. In particular:

y1:K|x1:K ∼ N(c(x1:K), S1:K Σ(v1:K) S⊤1:K),
x1:K ∼ N(µ1:K, Σ(x1:K)).

Therefore, the best Gaussian approximation to the marginal distribution of y1:K is:

y1:K ∼ N(E(y1:K), Σ(y1:K)),
E(y1:K) = c(µ1:K),
Σ(y1:K) = S1:K Σ(v1:K) S⊤1:K + C(˜µ1:K) Σ(x1:K) C(˜µ1:K)⊤.

Thus, from Lemma 1, (22) follows.

Remark 3. The time complexity for the evaluation of H(x1:K|y1:K) using Lemma 5 is decided by the sparsity of Σ(x1:K), since the rest of the matrices are block diagonal.

We complete the proof for each case of Theorem 3:
• Time complexity of each evaluation of H(x1:K|y1:K) when either Σ(x1:K) or Σ(x1:K)−1 is exactly sparse (that is, block tri-diagonal): We present the proof only for the case where Σ(x1:K)−1 is exactly sparse, since the proof for the case where Σ(x1:K) is exactly sparse is similar. In particular, consider the formula of H(x1:K|y1:K) in Lemma 4: T′1 involves the log determinant of a matrix that is the sum of two nK × nK sparse matrices: the first matrix is block diagonal, and the second one is block tri-diagonal. The block diagonal matrix is evaluated in O(n^2.4 K) time, since the determinant of an n × n matrix is computed in O(n^2.4) time using the Coppersmith–Winograd algorithm [40]. Then, T′1 is evaluated in O(n^2.4 K) time [41, Theorem 2].
• Time complexity of each evaluation of H(x1:K|y1:K) when both Σ(x1:K) and Σ(x1:K)−1 are dense: In this case, T′1 (and similarly T2 in Lemma 5) is the log determinant of a dense nK × nK matrix. Therefore, it is evaluated in O((nK)^2.4) time, since the determinant of an n × n matrix is computed in O(n^2.4) time using the Coppersmith–Winograd algorithm [40].

REFERENCES
[1] E. Masazade, M. Fardad, and P. K. Varshney, “Sparsity-promoting extended kalman filtering for target tracking in wireless sensor networks,” IEEE Signal Processing Letters, vol. 19, no. 12, pp. 845–848, 2012.
[2] N. Karnad, Robot Motion Planning for Tracking and Capturing Adversarial, Cooperative and Independent Targets. U. of Minnesota, 2015.
[3] M. P. Vitus, “Sensor placement for improved robotic navigation,” Robotics: Science and Systems VI, p. 217, 2011.
[4] M. Kaess, A. Ranganathan, and F. Dellaert, “isam: Incremental smoothing and mapping,” IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1365–1378, 2008.
[5] H. Rowaihy, S. Eswaran, M. Johnson, D. Verma, A. Bar-Noy, T. Brown, and T. La Porta, “A survey of sensor selection schemes in wireless sensor networks,” in Defense and Security Symposium. International Society for Optics and Photonics, 2007, pp. 65 621A–65 621A.
[6] S. Karlin, A first course in stochastic processes. Academic press, 2014.
[7] T. Kailath, A. H. Sayed, and B. Hassibi, Linear estimation. Prentice Hall Upper Saddle River, NJ, 2000, vol. 1.
[8] R. Nowak, U. Mitra, and R. Willett, “Estimating inhomogeneous fields using wireless sensor networks,” IEEE Journal on Selected Areas in Communications, vol. 22, no. 6, pp. 999–1006, 2004.
[9] A. O. Hero III and D. Cochran, “Sensor management: Past, present, and future,” IEEE Sensors Journal, vol. 11, no. 12, pp. 3064–3075, 2011.
[10] V. Tzoumas, A. Jadbabaie, and G. J.
Pappas, “Near-Optimal Sensor Scheduling for Batch State Estimation: Complexity, Algorithms, and Limits,” in 55th IEEE Conference on Decision and Control (CDC), 2016.
[11] V. Roy, A. Simonetto, and G. Leus, “Spatio-temporal sensor management for environmental field estimation,” Signal Processing, vol. 128, pp. 369–381, 2016.
[12] M. Shamaiah, S. Banerjee, and H. Vikalo, “Greedy sensor selection: Leveraging submodularity,” in 49th IEEE Conference on Decision and Control (CDC), 2010, pp. 2572–2577.
[13] M. P. Vitus, W. Zhang, A. Abate, J. Hu, and C. J. Tomlin, “On efficient sensor scheduling for linear dynamical systems,” Automatica, vol. 48, no. 10, pp. 2482–2493, 2012.
[14] S. Joshi and S. Boyd, “Sensor selection via convex optimization,” IEEE Transactions on Signal Processing, vol. 57, no. 2, pp. 451–462, 2009.
[15] J. Le Ny, E. Feron, and M. A. Dahleh, “Scheduling continuous-time kalman filters,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1381–1394, 2011.
[16] X. Shen, S. Liu, and P. K. Varshney, “Sensor selection for nonlinear systems in large sensor networks,” IEEE Transactions on Aerospace and Electronic Systems, vol. 50, no. 4, pp. 2664–2678, 2014.
[17] S. Liu, M. Fardad, P. K. Varshney, and E. Masazade, “Optimal periodic sensor scheduling in networks of dynamical systems,” IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3055–3068.
[18] Y. Mo, R. Ambrosino, and B. Sinopoli, “Sensor selection strategies for state estimation in energy constrained wireless sensor networks,” Automatica, vol. 47, no. 7, pp. 1330–1338, 2011.
[19] Y. He and E. K. P. Chong, “Sensor scheduling for target tracking: A monte carlo sampling approach,” Digital Signal Processing, vol. 16, no. 5, pp. 533–545, 2006.
[20] H. Zhang, R. Ayoub, and S. Sundaram, “Sensor selection for optimal filtering of linear dynamical systems: Complexity and approximation,” in IEEE Conference on Decision and Control (CDC), 2015.
[21] S. T. Jawaid and S. L. Smith, “Submodularity and greedy algorithms in sensor scheduling for linear dynamical systems,” Automatica, vol. 61, pp. 282–288, 2015.
[22] C. Ko, J. Lee, and M. Queyranne, “An algorithm for maximum entropy sampling,” Operations Research, vol. 43, no. 4, pp. 684–691, 1995.
[23] A. Krause, A. Singh, and C. Guestrin, “Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies,” Journal of Machine Learning Research, vol. 9, pp. 235–284, 2008.
[24] S. Anderson, T. D. Barfoot, C. H. Tong, and S. Särkkä, “Batch nonlinear continuous-time trajectory estimation as exactly sparse gaussian process regression,” Autonomous Robots, vol. 39, no. 3, pp. 221–238, 2015.
[25] R. Durrett, Probability: theory and examples. Cambridge University Press, 2010.
[26] A. Olshevsky, “Minimal controllability problems,” IEEE Transactions on Control of Network Systems, vol. 1, no. 3, pp. 249–258, 2014.
[27] S. Pequito, S. Kar, and A. Aguiar, “A framework for structural input/output and control configuration selection in large-scale systems,” IEEE Transactions on Automatic Control, vol. 61, pp. 303–318, 2015.
[28] M. L. Fisher, G. L. Nemhauser, and L. Wolsey, An analysis of approximations for maximizing submodular functions–II. Springer, 1978.
[29] J. Vondrák, “Submodularity and curvature: the optimal algorithm,” RIMS Kokyuroku Bessatsu B, vol. 23, pp. 253–266, 2010.
[30] G. L. Nemhauser and L. A. Wolsey, Integer and Combinatorial Optimization. New York, NY, USA: Wiley-Interscience, 1988.
[31] U.
Feige, “A threshold of ln n for approximating set cover,” J. ACM, vol. 45, no. 4, pp. 634–652, Jul. 1998.
[32] J. Vondrák, “Optimal approximation for the submodular welfare problem in the value oracle model,” in Proceedings of the fortieth annual ACM symposium on Theory of computing. ACM, 2008, pp. 67–74.
[33] M. Minoux, “Accelerated greedy algorithms for maximizing submodular functions,” in Optimization Techniques. Springer, 1978, pp. 234–243.
[34] F. Dellaert and M. Kaess, “Square root sam: Simultaneous localization and mapping via square root information smoothing,” The International Journal of Robotics Research, vol. 25, no. 12, pp. 1181–1203, 2006.
[35] A. Quarteroni, R. Sacco, and F. Saleri, Numerical mathematics. Springer Science & Business Media, 2010, vol. 37.
[36] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
[37] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, 3rd ed. Athena Scientific, 2005.
[38] S. Venkatesh, The Theory of Probability: Explorations and Applications. Cambridge University Press, 2012.
[39] D. S. Bernstein, Matrix mathematics: theory, facts, and formulas. Princeton University Press, 2009.
[40] D. Coppersmith and S. Winograd, “Matrix multiplication via arithmetic progressions,” in Proceedings of the nineteenth annual ACM symposium on Theory of computing, 1987, pp. 1–6.
[41] L. G. Molinari, “Determinants of block tridiagonal matrices,” Linear algebra and its applications, vol. 429, no. 8, pp. 2221–2226, 2008.