arXiv:1609.08536v1 [cs.SY] 27 Sep 2016

Scheduling Nonlinear Sensors for Stochastic Process Estimation
Vasileios Tzoumas⋆, Nikolay A. Atanasov⋆, Ali Jadbabaie†, George J. Pappas⋆

Abstract—In this paper, we focus on activating only a few sensors, among many available, to estimate the state of a stochastic process of interest. This problem is important in applications such as target tracking and simultaneous localization and mapping (SLAM). It is challenging since it involves stochastic systems whose evolution is largely unknown, sensors with nonlinear measurements, and limited operational resources that constrain the number of active sensors at each measurement step. We provide an algorithm applicable to general stochastic processes and nonlinear measurements whose time complexity is linear in the planning horizon and whose performance is a multiplicative factor 1/2 away from the optimal performance. This is notable because the algorithm offers a significant computational advantage over the polynomial-time algorithm that achieves the best approximation factor 1/e. In addition, for important classes of Gaussian processes and nonlinear measurements corrupted with Gaussian noise, our algorithm enjoys the same time complexity as even the state-of-the-art algorithms for linear systems and measurements. We achieve our results by proving two properties for the entropy of the batch state vector conditioned on the measurements: a) it is supermodular in the choice of the sensors; b) it has a sparsity pattern (involves block tri-diagonal matrices) that facilitates its evaluation at each sensor set.

I. INTRODUCTION

Adversarial target tracking and capturing [1], [2], robotic navigation and autonomous construction [3], active perception and simultaneous localization and mapping (SLAM) [4] are only a few of the challenging information gathering problems that benefit from the monitoring capabilities of sensor networks [5]. These problems are challenging because:
• they involve systems whose evolution is largely unknown, modeled either as a stochastic process, such as a Gaussian process [6], or as a linear or nonlinear system corrupted with process noise [1],
• they involve nonlinear sensors (e.g., cameras, radios) corrupted with noise [7],
• they involve systems that change over time [8], and as a result, necessitate both spatial and temporal deployment of sensors in the environment, increasing the total number of needed sensors, and at the same time,
• they involve operational constraints, such as limited communication bandwidth and battery life, which limit the number of sensors that can simultaneously be active in the information gathering process [9].

⋆ The authors are with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104-6228 USA (email: {vtzoumas, atanasov, pappasg}@seas.upenn.edu).
† The author is with the Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139 USA (email: jadbabai@mit.edu).
This work was supported in part by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, in part by the AFOSR Complex Networks Program, and in part by AFOSR MURI CHASE.
Due to these challenges, we focus on the following question: “How do we select, at each time, only a few of the available sensors so as to monitor effectively a system despite the above challenges?” In particular, we focus on the following sensor scheduling problem: Problem 1. Consider a stochastic process, whose realization at time t is denoted by x ( t ) and a set of m sensors, whose measurements are nonlinear functions of x ( t ) , evaluated at a fixed set of K measurement times t 1 , t 2 , . . . , t K . In addition, suppose that at each t k a set of at most s k ≤ m sensors can be used. Select the sensor sets so that the error of the corresponding minimum mean square error estimator of ( x ( t 1 ) , x ( t 2 ) , . . . , x ( t K )) is minimal among all possible sensor sets. There are two classes of sensor scheduling algorithms, that trade-off between the estimation accuracy of the batch state vector and their time complexity [ 10 ]: those used for Kalman filtering, and those for batch state estimation. The most relevant papers on batch state estimation are [ 10 ] and [ 11 ]. However, both of these papers focus on linear systems and measurements. The most relevant papers for Kalman filtering consider algorithms that use: myopic heuristics [ 12 ], tree pruning [ 13 ], convex optimization [ 14 ]–[ 17 ], quadratic pro- gramming [ 18 ], Monte Carlo methods [ 19 ], or submodular function maximization [ 20 ], [ 21 ]. However, these papers focus similarly on linear or nonlinear systems and measurements, and do not consider unknown dynamics. Main contributions: 1) We prove that Problem 1 is NP-hard. 2) We prove that the best approximation factor one can achieve in polynomial time for Problem 1 is 1 /e . 3) We provide Algorithm 1 for Problem 1 that: • for all stochastic processes and nonlinear measurements, achieves a solution that is up to a multiplicative factor 1 / 2 from the optimal solution with time complexity that is only linear in the planning horizon K . This is important, since it implies that Algorithm 1 offers a significant computational advantage with negligible loss in performance over the polynomial-time algorithm that achieves the best approximation factor of 1 /e , • for important classes of Gaussian processes, and nonlin- ear measurements corrupted with Gaussian noise, has the same time complexity as even state-of-the-art algorithms for linear systems and measurements. For example, for Gaussian process such as those in target tracking, or those generated by linear or nonlinear systems cor- rupted with Gaussian noise, Algorithm 1 has the same time complexity as the batch state estimation algorithm in [ 10 ], and lower than the Kalman filter scheduling algorithms in [ 14 ], [ 17 ]. Therefore, Algorithm 1 can enjoy both the estimation accu- racy of the batch state scheduling algorithms (compared to the Kalman filtering approach, that only approximates the batch state estimation error with an upper bound [ 10 ]) and, surprisingly, even the low time complexity of the Kalman filtering scheduling algorithms for linear systems. Technical contributions: 1) Supermodularity in Problem 1 : We achieve the approxi- mation performance of Algorithm 1 , and the linear dependence of its time complexity on the planning horizon, by proving that our estimation metric is a supermodular function in the choice of the utilized sensors. 
This is important, since this is in contrast to the case of multi-step Kalman filtering for linear systems and measurements, where the corresponding estima- tion metric is neither supermodular nor submodular [ 20 ] [ 21 ]. Moreover, our submodularity result cannot be reduced to the batch estimation problems in [ 22 ], [ 23 ]. The reasons are twofold: i) we consider sensors that measure nonlinear combinations of the elements of x ( t ) , in contrast to [ 22 ], [ 23 ], where each sensor measures directly only one element of x ( t ) ; ii) our estimation metric is relevant to monitoring dynamical systems and different to the submodular entropy metric and information gain considered in [ 22 ] and [ 23 ], respectively. 2) Sparsity in Problem 1 : We achieve the reduced time com- plexity of Algorithm 1 for Gaussian processes by identifying a sparsity pattern in our estimation metric. Specifically, in this case the time complexity of each evaluation of our metric is decided by the sparsity pattern of either the covariance of ( x ( t 1 ) , x ( t 2 ) , . . . , x ( t K )) , or the inverse of this covariance. This is important since the two matrices are not usually sparse at the same time, even if one of them is [ 24 ]. E.g., for Gaussian processes such as those in target tracking, the first matrix is block tri-diagonal, whereas for those in SLAM, or those generated by linear or nonlinear systems corrupted with Gaussian noise, the second matrix is block tri-diagonal. Notation: We denote the set of natural numbers { 1 , 2 , . . . } by N , the set of real numbers by R , and the set { 1 , 2 , . . . , n } by [ n ] ( n ∈ N ). The set of real numbers between 0 and 1 is denoted by [0 , 1] , and the empty set by ∅ . Given a set X , |X | is its cardinality. In addition, for n ∈ N , X n is the n -times Cartesian product X × X × · · · × X . Matrices are represented by capital letters and vectors by lower-case letters. We write A ∈ X n 1 × n 2 ( n 1 , n 2 ∈ N ) to denote a matrix of n 1 rows and n 2 columns whose elements take values in X ; A ⊤ is its transpose, and [ A ] ij is its element at the i -th row and j -th column; det( A ) is its determinant. Furthermore, if A is positive definite, we write A ≻ 0 . In the latter case, A − 1 is its inverse. I is the identity matrix; its dimension is inferred from the context. Similarly for the zero matrix 0 . The ≡ denotes equivalence. Moreover, for a probability space (Ω , F , P ) , Ω is the sample space, F the σ -field, and P : F 7 → [0 , 1] the function that assigns probabilities to events in F [ 25 ]. We write x ∼ F to denote a random variable x with probability distribution F ; E ( x ) is its expected value, and Σ( x ) its covariance. x ∼ N ( μ, Σ) denotes a Gaussian random variable x with mean μ and covariance Σ ; we equivalently write x ∼ N ( E ( x ) , Σ( x )) . Finally, we write x | y ∼ G to denote that x ’s probability distribution given y is G . II. P ROBLEM F ORMULATION This section introduces the system, measurement, and scheduling models and presents the sensor scheduling problem formally. System Model. We consider two cases: • Continuous time model : Consider the stochastic process (along with a probability space (Ω , F , P ) ): x ω ( t ) : ω ∈ Ω , t ≥ t 0 7 → R n (1) where n ∈ N , t 0 is the initial time, and x ω ( t ) the state vector given the sample ω . • Discrete time model : Consider the nonlinear discrete- time system: x k +1 = l k ( x 1: k ) , l k ∼ L k , k ∈ N (2) where x k ∈ R n is the state vector, x 1: k the batch vector ( x 1 , x 2 , . . . 
, x_k), and L_k a probability distribution over functions l_k : R^{nk} → R^n.

Because the system models (1) and (2) assume no characteristic structure, they are appropriate for modeling largely unknown dynamics. For example, an instance of (1) is the time-indexed Gaussian process system model:
x(t) ∼ GP(μ(t), Σ(t, t′)), t, t′ ≥ t_0,   (3)
where μ(t) is the mean function and Σ(t, t′) is the covariance function. Similarly, an instance of (2) is the state-indexed Gaussian process system model:
x_{k+1} = l(x_k), l ∼ GP(μ(x), Σ(x, x′)), x, x′ ∈ R^n.   (4)

Measurement Model. Consider m nonlinear sensors that operate in discrete time:
z_{i,k} = g_i(x_k) + v_{i,k}, i ∈ [m], k ∈ N,   (5)
where for the continuous-time system in (1) we let x_k := x(t_k) at a pre-specified set of measurement times t_1, t_2, . . ., and v_{i,k} is the measurement noise of sensor i at time k.

Assumption 1. The noises v_{i,k} are independent across i and k. In addition, each g_i is once differentiable.

Sensor Scheduling Model. The m sensors in (5) are used at K scheduled measurement times {t_1, t_2, . . . , t_K}. At each k ∈ [K], only s_k of the m sensors are used (s_k ≤ m), resulting in the batch measurement vector y_k:
y_k = S_k z_k, k ∈ [K],   (6)
where S_k is a sensor selection matrix, composed of sub-matrices [S_k]_{ij} (i ∈ [s_k], j ∈ [m]) such that [S_k]_{ij} = I if sensor j is used at time k, and [S_k]_{ij} = 0 otherwise. We assume that a sensor can be used at most once at each k; as a result, for each i there is one j such that [S_k]_{ij} = I, while for each j there is at most one i such that [S_k]_{ij} = I.

We now present the sensor scheduling problem formally:

Notation: For i, j ∈ N, φ_{i:j} ≡ (φ_i, φ_{i+1}, . . . , φ_j). In addition, S_k ≡ {j : there exists i ∈ [s_k], [S_k]_{ij} = I}; that is, S_k is the set of indices that correspond to the utilized sensors at t_k.

Problem 1 (Sensor Scheduling in Stochastic Processes with Nonlinear Observations). Select at each time k a subset of s_k sensors, out of the m sensors in (5), to use in order to minimize the conditional entropy of x_{1:K} given the measurements y_{1:K}:
minimize_{S_k ⊆ [m], k ∈ [K]}  H(x_{1:K} | S_{1:K})
subject to  |S_k| ≤ s_k, k ∈ [K],
where H(x_{1:K} | S_{1:K}) denotes the conditional entropy H(x_{1:K} | y_{1:K}) of x_{1:K} given the measurements y_{1:K}.

The conditional entropy H(x_{1:K} | y_{1:K}) captures the estimation accuracy of x_{1:K} given y_{1:K}, as we explain in the following two propositions:

Proposition 1. H(x_{1:K} | y_{1:K}) differs from the mutual information of x_{1:K} and y_{1:K} only by a constant term. In particular:
H(x_{1:K} | y_{1:K}) = −I(x_{1:K}; y_{1:K}) + H(x_{1:K}),
where I(x_{1:K}; y_{1:K}) is the mutual information of x_{1:K} and y_{1:K}, and H(x_{1:K}) is constant.

Proposition 2. Consider the Gaussian process (3) and suppose that the measurement noise in (5) is Gaussian, v_{i,k} ∼ N(0, Σ(v_{i,k})). Then H(x_{1:K} | y_{1:K}) equals, up to an additive constant, half of log det(Σ(x⋆_{1:K})), where Σ(x⋆_{1:K}) is the error covariance of the minimum mean square estimator x⋆_{1:K} of x_{1:K} given the measurements y_{1:K}. In particular (see footnote 1):
H(x_{1:K} | y_{1:K}) = (1/2) log det(Σ(x⋆_{1:K})) + (nK/2) log(2πe).
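For illustration, the following Python sketch evaluates the objective of Problem 1 under the linearized Gaussian setting of Proposition 2, that is, H(x_{1:K} | y_{1:K}) = (1/2) log det(Σ(x⋆_{1:K})) + (nK/2) log(2πe), for one candidate choice of sensor sets. It is not part of the paper's implementation: the toy dimensions, the random prior covariance, and the per-sensor linearized maps are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import block_diag

def conditional_entropy(Sigma_x, C_blocks, R_blocks, selection):
    """H(x_{1:K} | y_{1:K}) for one choice of sensor sets (cf. Proposition 2).

    Sigma_x        : (n*K, n*K) prior covariance of the batch state x_{1:K}
    C_blocks[k][i] : (p, n) linearized map of sensor i at time t_k
    R_blocks[k][i] : (p, p) noise covariance Sigma(v_{i,k}), positive definite
    selection[k]   : indices of the sensors used at time t_k
    """
    nK = Sigma_x.shape[0]
    K = len(C_blocks)
    n = nK // K
    C_rows, R_list = [], []
    for k, S_k in enumerate(selection):
        for i in S_k:
            row = np.zeros((C_blocks[k][i].shape[0], nK))
            row[:, k * n:(k + 1) * n] = C_blocks[k][i]  # sensor i only sees x(t_k)
            C_rows.append(row)
            R_list.append(R_blocks[k][i])
    info = np.linalg.inv(Sigma_x)                       # prior information matrix
    if C_rows:
        C, R = np.vstack(C_rows), block_diag(*R_list)
        info += C.T @ np.linalg.inv(R) @ C              # information-form measurement update
    logdet_post = -np.linalg.slogdet(info)[1]           # log det of the posterior covariance
    return 0.5 * logdet_post + 0.5 * nK * np.log(2 * np.pi * np.e)

# toy usage: K = 3 measurement times, n = 2 states, m = 2 scalar sensors per time
rng = np.random.default_rng(0)
n, K, m = 2, 3, 2
G = rng.standard_normal((n * K, n * K))
Sigma_x = G @ G.T + np.eye(n * K)                       # an arbitrary valid prior covariance
C_blocks = [[rng.standard_normal((1, n)) for _ in range(m)] for _ in range(K)]
R_blocks = [[np.eye(1) for _ in range(m)] for _ in range(K)]
print(conditional_entropy(Sigma_x, C_blocks, R_blocks, selection=[{0}, {1}, {0, 1}]))
```

Smaller values of this quantity correspond to a tighter posterior over the batch state, which is exactly what Problem 1 asks the sensor schedule to achieve.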
III. MAIN RESULTS

We first prove that Problem 1 is NP-hard, and then derive for it a provably near-optimal approximation algorithm:

Theorem 1. The problem of sensor scheduling in stochastic processes with nonlinear observations (Problem 1) is NP-hard.

Proof. See Appendix B. Our approach is to find an instance of Problem 1 that is equivalent to the NP-hard minimal observability problem introduced in [26], [27].

Due to Theorem 1, we need to appeal to approximation algorithms to obtain a solution to Problem 1 in polynomial time. To this end, we propose an efficient near-optimal algorithm (Algorithm 1, with a subroutine in Algorithm 2) and quantify its performance and time complexity in the following theorem.

Footnote 1: We explain x⋆_{1:K} and log det(Σ(x⋆_{1:K})): x⋆_{1:K} is the optimal estimator for x_{1:K}, since it minimizes among all estimators of x_{1:K} the mean square error E(‖x_{1:K} − x⋆_{1:K}‖²₂) (‖·‖₂ is the Euclidean norm), where the expectation is taken with respect to y_{1:K} [7]. log det(Σ(x⋆_{1:K})) is an estimation error metric related to ‖x_{1:K} − x⋆_{1:K}‖²₂, since when it is minimized, the probability that the estimation error ‖x_{1:K} − x⋆_{1:K}‖²₂ is small is maximized [10].

Algorithm 1 Approximation algorithm for Problem 1.
Input: Horizon K, scheduling constraints s_1, s_2, . . . , s_K, error metric H(x_{1:K} | S_{1:K}) : S_k ⊆ [m], k ∈ [K] → R
Output: Sensor sets (S_1, S_2, . . . , S_K) that approximate the solution to Problem 1, as quantified in Theorem 2
k ← 1, S_{1:0} ← ∅
while k ≤ K do
  1. Apply Algorithm 2 to
     min_{S ⊆ [m]} { H(x_{1:K} | S_{1:k−1}, S) : |S| ≤ s_k }   (7)
  2. Denote by S_k the solution Algorithm 2 returns
  3. S_{1:k} ← (S_{1:k−1}, S_k)
  4. k ← k + 1
end while

Theorem 2. The theorem has two parts:
1) Approximation performance of Algorithm 1: Algorithm 1 returns sensor sets S_1, S_2, . . . , S_K that:
i. satisfy all the feasibility constraints of Problem 1: |S_k| ≤ s_k, k ∈ [K];
ii. achieve an error H(x_{1:K} | S_{1:K}) such that
   [H(x_{1:K} | S_{1:K}) − OPT] / (MAX − OPT) ≤ 1/2,   (8)
where OPT is the optimal cost of Problem 1, and MAX ≡ max_{S′_{1:K}} H(x_{1:K} | S′_{1:K}) is the maximum (worst) cost in Problem 1.
2) Time complexity of Algorithm 1: Algorithm 1 has time complexity O(∑_{k=1}^K s_k² T), where T is the time complexity of evaluating H(x_{1:K} | S′_{1:K}) : S′_k ⊆ [m], k ∈ [K] → R at an S′_{1:K}.

In the following paragraphs, we discuss Algorithm 1's approximation quality and time complexity and fully characterize the latter in Theorem 3 and Corollary 1 for Gaussian processes and Gaussian measurement noise.

Supermodularity and monotonicity of H(x_{1:K} | y_{1:K}): We state two properties of H(x_{1:K} | y_{1:K}) that are used to prove Theorem 2. In particular, we show that H(x_{1:K} | y_{1:K}) is a non-increasing and supermodular function with respect to the sequence of selected sensors. Then, Theorem 2 follows by combining these two results with results on submodular function maximization over matroid constraints [28]. These derivations are presented in Appendix C.

Approximation quality of Algorithm 1: Theorem 2 quantifies the worst-case performance of Algorithm 1 across all values of Problem 1's parameters. The reason is that the right-hand side of (8) is constant. In particular, (8) guarantees that for any instance of Problem 1, the distance of the approximate cost H(x_{1:K} | S_{1:K}) from OPT is at most 1/2 the distance of the worst (maximum) cost MAX from OPT. This approximation factor is close to the optimal approximation factor 1/e ≈ 0.37 that one can achieve in the worst case for Problem 1 in polynomial time [29]; the reason is twofold: first, Problem 1 involves the minimization of a non-increasing and supermodular function [30], and second, as we proved in Theorem 1, Problem 1 is in the worst case equivalent to the minimal observability problem introduced in [26], which cannot be approximated in polynomial time with a factor better than 1/e [31].

Algorithm 2 Single-step greedy algorithm (subroutine in Algorithm 1).
Input: Current iteration k, selected sensor sets (S_1, S_2, . . . , S_{k−1}) up to the current iteration, constraint s_k, error metric H(x_{1:K} | S_{1:K}) : S_k ⊆ [m], k ∈ [K] → R
Output: Sensor set S_k that approximates the solution to Problem 1 at time k
S^0 ← ∅, X^0 ← [m], and t ← 1
Iteration t:
  1. If X^{t−1} = ∅, return S^{t−1}
  2. Select i(t) ∈ X^{t−1} for which ρ_{i(t)}(S^{t−1}) = max_{i ∈ X^{t−1}} ρ_i(S^{t−1}), with ties settled arbitrarily, where
     ρ_i(S^{t−1}) ≡ H(x_{1:K} | S_{1:k−1}, S^{t−1}) − H(x_{1:K} | S_{1:k−1}, S^{t−1} ∪ {i})
  3.a. If |S^{t−1} ∪ {i(t)}| > s_k, set X^{t−1} ← X^{t−1} \ {i(t)} and go to Step 1
  3.b. If |S^{t−1} ∪ {i(t)}| ≤ s_k, set S^t ← S^{t−1} ∪ {i(t)} and X^t ← X^{t−1} \ {i(t)}
  4. t ← t + 1 and continue
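The following Python sketch shows how Algorithms 1 and 2 compose: an outer loop over the measurement times and, at each time, a greedy inner loop that adds the sensor with the largest marginal decrease of the error metric. It is a simplified illustration rather than the authors' implementation: the error metric is treated as a black-box oracle H, the matroid handling is reduced to a plain per-step cardinality bound, the lazy-evaluation refinement is omitted, and the stand-in oracle toy_H at the end is made up for the usage line only.

```python
from typing import Callable, Sequence, Set, List

def greedy_schedule(H: Callable[[List[Set[int]]], float],
                    m: int, s: Sequence[int]) -> List[Set[int]]:
    """Greedy sensor scheduling (sketch in the spirit of Algorithms 1 and 2).

    H : oracle returning H(x_{1:K} | S_{1:K}) for a list of K sensor sets
    m : number of available sensors
    s : s[k] is the maximum number of sensors usable at time t_k
    """
    K = len(s)
    schedule: List[Set[int]] = [set() for _ in range(K)]
    for k in range(K):                       # outer loop: one time step at a time
        while len(schedule[k]) < s[k]:       # inner loop: single-step greedy selection
            base = H(schedule)
            best_i, best_gain = None, 0.0
            for i in range(m):
                if i in schedule[k]:
                    continue
                schedule[k].add(i)
                gain = base - H(schedule)    # marginal decrease rho_i of the error metric
                schedule[k].remove(i)
                if best_i is None or gain > best_gain:
                    best_i, best_gain = i, gain
            schedule[k].add(best_i)
    return schedule

# usage with a stand-in oracle (a real one would evaluate the conditional entropy):
toy_H = lambda sel: -sum(len(S) ** 0.5 for S in sel)   # monotone, diminishing-returns stand-in
print(greedy_schedule(toy_H, m=4, s=[2, 1, 2]))
```

Because the metric is non-increasing and supermodular in the selected sensors, this greedy structure inherits the guarantee of Theorem 2.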
Remark 1. We can improve the 1/2 approximation factor of Algorithm 1 to 1/e by utilizing the algorithm introduced in [32]. However, this algorithm has time complexity O((nK)^{11} T), where T is the time complexity of evaluating H(x_{1:K} | S′_{1:K}) : S′_k ⊆ [m], k ∈ [K] → R at an S′_{1:K}.

Time complexity of Algorithm 1: Algorithm 1's time complexity is broken down into two parts: a) the number of evaluations of H(x_{1:K} | y_{1:K}) required by the algorithm; b) the time complexity of each such evaluation. In more detail:

a) Number of evaluations of H(x_{1:K} | y_{1:K}) required by Algorithm 1: Algorithm 1 requires at most s_k² evaluations of H(x_{1:K} | y_{1:K}) at each k ∈ [K]. Therefore, Algorithm 1 achieves a time complexity that is only linear in K with respect to the number of evaluations of H(x_{1:K} | y_{1:K}); the reason is that ∑_{k=1}^K s_k² ≤ max_{k ∈ [K]}(s_k²) K. This is in contrast to the algorithm in Remark 1, which obtains the best approximation factor 1/e but whose time complexity is of the order O((nK)^{11}) with respect to the number of evaluations of H(x_{1:K} | y_{1:K}) (see footnote 2).

b) Time complexity of each evaluation of H(x_{1:K} | y_{1:K}): This time complexity depends on the properties of both the stochastic process (1) (similarly, (2)) and the measurement noise v_{i,k} in (5).

Footnote 2: We can also speed up Algorithm 1 by implementing in Algorithm 2 the method of lazy evaluations [33]: this method avoids in Step 2 of Algorithm 2 the computation of ρ_i(S^{t−1}) for unnecessary choices of i.

For the case of Gaussian stochastic processes and measurement noises:

Theorem 3. Consider the Gaussian process model (3) and suppose that the measurement noise is Gaussian: v_{i,k} ∼ N(0, Σ(v_{i,k})) such that Σ(v_{i,k}) ≻ 0. The time complexity of evaluating H(x_{1:K} | y_{1:K}) depends on the sparsity pattern of Σ(x_{1:K}) and Σ(x_{1:K})^{−1} as follows.
• Each evaluation of H(x_{1:K} | y_{1:K}) has time complexity O(n^{2.4} K), when either Σ(x_{1:K}) or Σ(x_{1:K})^{−1} is exactly sparse (that is, block tri-diagonal).
• Each evaluation of H(x_{1:K} | y_{1:K}) has time complexity O(n^{2.4} K^{2.4}), when both Σ(x_{1:K}) and Σ(x_{1:K})^{−1} are dense.
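The linear-in-K case of Theorem 3 rests on the fact that the determinant of a block tri-diagonal matrix can be computed block by block [41]. The Python sketch below illustrates this for the symmetric positive definite case using the equivalent Schur-complement recursion; the block sizes and the test matrix are made-up illustrative values, not data from the paper.

```python
import numpy as np

def logdet_block_tridiag(D, B):
    """log det of a symmetric positive definite block tri-diagonal matrix,
    in time linear in the number K of diagonal blocks.

    D : list of K (n, n) diagonal blocks
    B : list of K-1 (n, n) sub-diagonal blocks, i.e., block (k+1, k) of the matrix
    """
    logdet = 0.0
    S = D[0].copy()                       # Schur complement of the leading block
    for k in range(len(D)):
        sign, ld = np.linalg.slogdet(S)
        assert sign > 0, "matrix is not positive definite"
        logdet += ld                      # det factors as the product of the pivots' dets
        if k + 1 < len(D):                # eliminate block k, update the next pivot
            S = D[k + 1] - B[k] @ np.linalg.solve(S, B[k].T)
    return logdet

# sanity check against a dense computation on a random block tri-diagonal matrix
rng = np.random.default_rng(1)
n, K = 3, 5
D = [np.eye(n) * 4.0 for _ in range(K)]
B = [0.3 * rng.standard_normal((n, n)) for _ in range(K - 1)]
full = np.zeros((n * K, n * K))
for k in range(K):
    full[k*n:(k+1)*n, k*n:(k+1)*n] = D[k]
for k in range(K - 1):
    full[(k+1)*n:(k+2)*n, k*n:(k+1)*n] = B[k]
    full[k*n:(k+1)*n, (k+1)*n:(k+2)*n] = B[k].T
print(logdet_block_tridiag(D, B), np.linalg.slogdet(full)[1])  # the two values agree
```

Each iteration touches only n × n blocks, so the overall cost grows linearly in K, as opposed to the O((nK)^{2.4}) cost of a dense log-determinant.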
Theorem 3 implies that when Σ(x_{1:K}) or Σ(x_{1:K})^{−1} is exactly sparse, the time complexity of each evaluation of H(x_{1:K} | y_{1:K}) is only linear in K. This is important because Σ(x_{1:K}) or Σ(x_{1:K})^{−1} is exactly sparse for several applications and system models [34]. For example, in adversarial target tracking applications, where the target wants to avoid capture and randomizes its motion in the environment (by uncorrelating its movements), Σ(x_{1:K}) can be considered block tri-diagonal (since this implies that x(t_k) and x(t_{k′}) are uncorrelated for |k − k′| ≥ 2). Similarly, in SLAM, or in system models where the Gaussian process in (3) is generated by a linear or nonlinear system corrupted with Gaussian noise, Σ(x_{1:K})^{−1} is block tri-diagonal [24]. In particular, for linear systems, Σ(x_{1:K})^{−1} is block tri-diagonal [24, Section 3.1], and for nonlinear systems, Σ(x_{1:K})^{−1} is efficiently approximated by a block tri-diagonal matrix as follows: for each k, before the k-th iteration of Step 1 in Algorithm 1, we first compute μ̃_{1:K} given the measurements y_{1:(k−1)} up to k. This step has complexity O(n^{2.4} K) when Σ(x_{1:K})^{−1} is sparse [24, Eq. (5)], [35, Section 3.8], and it does not increase the total time complexity of Algorithm 1. Then, we continue as in [24, Section 3.2].

Sparsity in H(x_{1:K} | y_{1:K}): We state the two properties of H(x_{1:K} | y_{1:K}) that lead to Theorem 3. In particular, we prove that H(x_{1:K} | y_{1:K}) can be expressed in closed form with two different formulas, such that the time complexity of evaluating H(x_{1:K} | y_{1:K}) is decided by the sparsity pattern of Σ(x_{1:K}) when one formula is used, and by the sparsity pattern of Σ(x_{1:K})^{−1} when the other is used. The reason for this dependence is that the rest of the matrices in these formulas are sparser than Σ(x_{1:K}) or Σ(x_{1:K})^{−1}; in particular, they are block diagonal.

The full characterization of Algorithm 1's time complexity for Gaussian processes and Gaussian measurement noises follows.

Corollary 1. Consider the Gaussian process model (3) and suppose that the measurement noise is Gaussian: v_{i,k} ∼ N(0, Σ(v_{i,k})) such that Σ(v_{i,k}) ≻ 0. The time complexity of Algorithm 1 depends on the sparsity pattern of Σ(x_{1:K}) and Σ(x_{1:K})^{−1} as follows.
• Algorithm 1 has time complexity O(n^{2.4} K ∑_{k=1}^K s_k²), when either Σ(x_{1:K}) or Σ(x_{1:K})^{−1} is exactly sparse (that is, block tri-diagonal).
• Algorithm 1 has time complexity O(n^{2.4} K^{2.4} ∑_{k=1}^K s_k²), when both Σ(x_{1:K}) and Σ(x_{1:K})^{−1} are dense.

Comparison of Algorithm 1's time complexity for Gaussian processes and Gaussian measurement noises, per Corollary 1, to that of existing scheduling algorithms: The most relevant algorithm to Algorithm 1 is the one provided in [10], where linear systems with additive process noise and measurement noises with any distribution are assumed. Algorithm 1 generalizes [10] from linear systems and measurements to Gaussian processes and nonlinear measurements. At the same time, it achieves the same time complexity as the algorithm in [10] when Σ(x_{1:K}) or Σ(x_{1:K})^{−1} is exactly sparse. This is important since the algorithm in [10] has time complexity lower than state-of-the-art batch estimation sensor scheduling algorithms, such as the algorithm proposed in [11], and similar to that of state-of-the-art Kalman filter scheduling algorithms, such as those proposed in [14], [17], [21] (in particular, lower for large K).
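To make the sparsity claim concrete, here is a small Python sketch that builds Σ(x_{1:K})^{−1} for a process generated by a linear system driven by Gaussian noise, the case invoked above via [24, Section 3.1], and verifies that inverting this block tri-diagonal matrix recovers the dense batch covariance. The matrices A, Q, P0 and the dimensions are made-up illustrative values under the stated linear-Gaussian assumption.

```python
import numpy as np

def batch_precision_linear_gaussian(A, Q, P0, K):
    """Block tri-diagonal inverse covariance of x_{1:K} for x_{k+1} = A x_k + w_k,
    with w_k ~ N(0, Q) and x_1 ~ N(mu0, P0)."""
    n = A.shape[0]
    Qi, P0i = np.linalg.inv(Q), np.linalg.inv(P0)
    J = np.zeros((n * K, n * K))
    for k in range(K):
        blk = slice(k * n, (k + 1) * n)
        J[blk, blk] += P0i if k == 0 else Qi             # term from the factor ending at x_k
        if k + 1 < K:                                    # coupling between x_k and x_{k+1}
            nxt = slice((k + 1) * n, (k + 2) * n)
            J[blk, blk] += A.T @ Qi @ A
            J[blk, nxt] += -A.T @ Qi
            J[nxt, blk] += -Qi @ A
    return J

# check against the dense covariance obtained by propagating the moments
rng = np.random.default_rng(2)
n, K = 2, 4
A = 0.8 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
Q, P0 = np.eye(n), 2.0 * np.eye(n)
P = [P0]
for _ in range(K - 1):
    P.append(A @ P[-1] @ A.T + Q)                        # marginal covariances of x_k
Sigma = np.zeros((n * K, n * K))
for k in range(K):
    for j in range(k, K):
        Sigma[j*n:(j+1)*n, k*n:(k+1)*n] = np.linalg.matrix_power(A, j - k) @ P[k]
        Sigma[k*n:(k+1)*n, j*n:(j+1)*n] = Sigma[j*n:(j+1)*n, k*n:(k+1)*n].T
J = batch_precision_linear_gaussian(A, Q, P0, K)
print(np.allclose(np.linalg.inv(J), Sigma))              # True: J is Sigma(x_{1:K})^{-1}
```

Only the diagonal and first off-diagonal blocks of J are nonzero, which is the structure that lets each evaluation of the entropy metric run in time linear in K.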
IV. CONCLUSION

In this paper, we proposed Algorithm 1 for the NP-hard problem of sensor scheduling for stochastic process estimation. Exploiting the supermodularity and monotonicity of the conditional entropy, we proved that the algorithm has an approximation factor of 1/2 and linear complexity in the scheduling horizon. It achieves both the accuracy of batch estimation scheduling algorithms and, surprisingly, when the information structure of the problem is sparse, the low time complexity of Kalman filter scheduling algorithms for linear systems. This is the case, for example, in applications such as SLAM and target tracking, and for processes generated by linear or nonlinear systems corrupted with Gaussian noise. Future work will focus on an event-triggered version of the scheduling problem, in which the measurement times are decided online based on the available measurements, and on a decentralized version, in which information is exchanged only among neighboring sensors.

APPENDIX A: PROOF OF PROPOSITION 2

Proof. We first show that the conditional probability distribution of x_{1:K} given y_{1:K} is Gaussian with covariance Σ(x⋆_{1:K}), and then apply the following lemma:

Lemma 1 (Ref. [36]). Let x ∼ N(μ, Σ) and x ∈ R^m:
H(x) = (1/2) log[(2πe)^m det(Σ)].

Specifically, due to Assumption 1, (x_{1:K}, y_{1:K}) are jointly Gaussian. This has a twofold implication: first, the minimum mean square estimator of x_{1:K} given y_{1:K} is linear in y_{1:K} [37, Proposition E.2]; second, the conditional probability distribution of x_{1:K} given y_{1:K} is Gaussian [38], with covariance Σ(x⋆_{1:K}). Therefore, due to [37, Proposition E.3], this is also the covariance of the minimum mean square estimator of x_{1:K} given y_{1:K}. As a result, due to Lemma 1:
H(x_{1:K} | y_{1:K}) = E_{y_{1:K} = y′_{1:K}}( H(x_{1:K} | y_{1:K} = y′_{1:K}) )
= E_{y_{1:K} = y′_{1:K}}( (1/2) log[(2πe)^{nK} det(Σ(x⋆_{1:K}))] )
= [nK log(2πe) + log det(Σ(x⋆_{1:K}))] / 2.   (9)
We derive a formula for Σ(x⋆_{1:K}) in the proof of Lemma 4.

APPENDIX B: PROOF OF THEOREM 1

Proof. We present, for the discrete-time case (2), an instance of Problem 1 that is equivalent to the NP-hard minimal observability problem introduced in [26], [27], which is defined as follows (the proof for the continuous-time case is similar):

Definition (Minimal Observability Problem): Consider the linear time-invariant system:
ẋ(t) = A x(t),  y_i(t) = r_i e_i^⊤ x(t),  i ∈ [n],   (10)
where e_i is the vector with the i-th entry equal to 1 and the rest equal to 0, and r_i is either zero or one; the minimal observability problem follows:
select r_1, r_2, . . . , r_n such that r_1 + r_2 + . . . + r_n ≤ r and (10) is observable.   (11)
The minimal observability problem is NP-hard when A is chosen as in the proof of Theorem 1 of [26], and r ≤ n. We denote this A by A_{NP-h}.
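As a side illustration of the constraint in (11), the short Python sketch below checks observability of the pair (A, C) obtained by keeping only the rows e_i^⊤ that are switched on, via the rank of the observability matrix. The chain dynamics used here are an arbitrary toy example, not the hard instance A_{NP-h} constructed in [26].

```python
import numpy as np

def is_observable(A, selected):
    """Check observability of (A, C), where C stacks the rows e_i^T for i in `selected`
    (the outputs switched on by the binary variables r_i in (10))."""
    n = A.shape[0]
    C = np.eye(n)[sorted(selected)]                  # kept output rows e_i^T
    # observability matrix [C; CA; ...; CA^{n-1}]
    O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    return np.linalg.matrix_rank(O) == n

# toy example: a 4-state chain x_{i+1}' = x_i (not the NP-hardness construction of [26])
A = np.diag(np.ones(3), k=-1)
print(is_observable(A, {3}))                         # observing the last state suffices: True
print(is_observable(A, {0}))                         # observing only the first does not: False
```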
Problem 1 is equivalent to the NP-hard minimal observ- ability problem for the following instance: K = 1 , x ( t 0 ) ∼ N ( c, 0) , where c ∈ R n is an unknown constant, μ ( t ) = e A N P − h ( t − t 0 ) x ( t 0 ) , Σ( t, t ′ ) = 0 , m = n , g i ( x ( t )) = e ⊤ i x ( t ) , zero measurement noise, and s 1 = r . This observation con- cludes the proof. A PPENDIX C: P ROOF OF T HEOREM 2 Proof. We first prove that H ( x 1: K |S 1: K ) is a non-increasing and supermodular function in the choice of the sensors. Then, we prove Theorem 2 by combining these two results and results on the maximization of submodular functions over matroid constraints [ 28 ]. Notation: Given K disjoint finite sets E 1 , E 2 , . . . , E K and A i , B i ∈ E i , we write A 1: K  B 1: K to denote that for all i ∈ [ K ] , A i ⊆ B i ( A i is a subset of B i ). Moreover, we denote that A i ∈ E i for all i ∈ [ K ] by A 1: K ∈ E 1: K . In addition, given A 1: K , B 1: K ∈ E 1: K , we write A 1: K ⊎ B 1: K to denote that for all i ∈ [ K ] , A i ∪ B i ( A i union B i ). Definition 1. Consider K disjoint finite sets E 1 , E 2 , . . . , E K . A function h : E 1: K 7 → R is non-decreasing if and only if for all A, B ∈ E 1: K such that A  B , h ( A ) ≤ h ( B ); h : E 1: K 7 → R is non-increasing if − h is non-decreasing. Proposition 3. For any finite K ∈ N , consider K distinct copies of [ m ] , denoted by R 1 , R 2 , . . . , R K . The estimation error metric H ( x 1: K |S 1: K ) : R 1: K 7 → R is a non-increasing function in the choice of the sensors S 1: K . Proof. Consider A, B ∈ R 1: K such that A  B , and denote by B \ A ≡ { i | i ∈ B, i / ∈ A } : H ( x 1: K | B ) = H ( x 1: K | A, B \ A ) ≤ H ( x 1: K | A ) since conditioning can either keep constant or decrease the entropy [ 36 ]. Definition 2. Consider K disjoint finite sets E 1 , E 2 , . . . , E K . A function h : E 1: K 7 → R is submodular if and only if for all A, B, C ∈ E 1: K such that A  B , h ( A ⊎ C ) − h ( A ) ≥ h ( B ⊎ C ) − h ( B ); h : E 1: K 7 → R is supermodular if − h is submodular. Proposition 4. For any finite K ∈ N , consider K distinct copies of [ m ] , denoted by R 1 , R 2 , . . . , R K ; the estimation error metric H ( x 1: K |S 1: K ) : R 1: K 7 → R is a set supermodular function in the choice of the sensors S 1: K . Proof. Let A, B, C ∈ E 1: K such that A  B : H ( x 1: K | A ) − H ( x 1: K | A ⊎ C ) (12) = H ( x 1: K | A ) − H ( x 1: K | A, C ) = I ( x 1: K ; C | A ) (13) = H ( C | A ) − H ( C | x 1: K , A ) (14) ≥ H ( C | B ) − H ( C | x 1: K , B ) (15) = I ( x 1: K ; C | B ) (16) = H ( x 1: K | B ) − H ( x 1: K | B, C ) (17) = H ( x 1: K | B ) − H ( x 1: K | B ⊎ C ) . (18) Eq. ( 12 ) and ( 18 ) follow from our definition of ⊎ . ( 13 ) and ( 14 ), ( 15 ) and ( 16 ), and ( 16 ) and ( 17 ) hold due to the definition of mutual information [ 36 ]. ( 15 ) follows from ( 14 ) due to two reasons: first, H ( C | A ) ≥ H ( C | B ) , since A  B and conditioning can either keep constant or decrease the entropy [ 36 ]; second, H ( C | x 1: K , A ) = H ( C | x 1: K , B ) due to the independence of the measurements given x 1: K , per Assumption 1 . Proof of Part 1 of Theorem 2 . We use the next result from the literature of maximization of submodular functions over matroid constraints: Definition 3. Consider a finite set E and a collection C of subsets of E . ( E , C ) is: • an independent system if and only if: – ∅ ∈ C , where ∅ denotes the empty set – for all X ′ ⊆ X ⊆ E , if X ∈ C , X ′ ∈ C . 
• a matroid if and only if in addition to the previous two properties: – for all X ′ , X ∈ C where | X ′ | < | X | , there exists x / ∈ X ′ and x ∈ X such that X ′ ∪ { x } ∈ C . Lemma 2 (Ref. [ 28 ]) . Consider K independence systems { ( E k , C k ) } k ∈ [ K ] , each the intersection of at most P matroids, and a submodular and non-decreasing function h : E 1: K 7 → R . There exist a polynomial time greedy algorithm that returns an (approximate) solution S 1: K to: maximize S 1: K E 1: K h ( S 1: K ) subject to S k ∩ E k ∈ C k , k ∈ [ K ] , (19) that satisfies: h ( O ) − h ( S 1: K ) h ( O ) − h ( ∅ ) ≤ P 1 + P , (20) where O is an (optimal) solution to ( 19 ) . Lemma 3. Problem 1 is an instance of ( 19 ) with P = 1 . Proof. We identify the instance of {E k , C k } k ∈ [ K ] and h , re- spectively, that translate ( 19 ) to Problem 1: Given K distinct copies of [ m ] , denoted by R 1 , R 2 , . . . , R K , first consider E k = R k and C k = {S|S ⊆ R k , |S| ≤ s k } : ( E k , C k ) satisfies the first two points in part 1 of Definition 3 , and as a result is an independent system. Moreover, by its definition, S k ∩ E k ∈ C k if and only if |S k | ≤ s k . Second, for all S 1: K  E 1: K , consider: h ( S 1: K ) = − H ( x 1: K |S 1: K ) . From Propositions 3 and 4 , h ( S 1: K ) is set submodular and non-decreasing. In addition to Lemma 3 , the independence system ( E k , C k ) , where E k = R k and C k = {S|S ⊆ R k , |S| ≤ s k } , satisfies also the point in part 2 of Definition 3 ; thereby, it is also a matroid and as a result P , as in Lemma 2 , is 1 . This observation, along with Lemmas 2 and 3 complete the proof of ( 8 ), since the adaptation to Problem 1 of the greedy algorithm in [ 28 , Theorem 4.1] results to Algorithm 1 . Proof of Part 2 of Theorem 2 . Algorithm 1 requires for each k ∈ [ K ] the application of Algorithm 2 to ( 7 ). In addition, each such application requires at most s 2 k evaluations of H ( x 1: K | y 1: K ) . Therefore, Algorithm 1 has time complexity O ( ∑ K k =1 s 2 k T ) . The proof of Theorem 2 is complete. A PPENDIX D: P ROOF OF T HEOREM 3 Notations: We introduce five notations: first, S 1: K is the block diagonal matrix with diagonal elements the sensor se- lection matrices S 1 , S 2 , . . . , S K ; second, c ( x 1: K ) is the batch vector [( S 1 g ( x 1 )) ⊤ , ( S 2 g ( x 2 )) ⊤ , . . . , ( S K g ( x K )) ⊤ ] ⊤ , where g ( x k ) ≡ ( g 1 ( x k ) , g 2 ( x k ) , . . . , g m ( x k )) ⊤ ; third, C ( x 1: K ) is the block diagonal matrix with diagonal elements the ma- trices S 1 C 1 , S 2 C 2 , . . . , S K C K , where C k ≡ G ( x k ) and G ( x ( t )) ≡ ∂g ( x ( t )) /∂x ( t ) ; fourth, v k is the batch mea- surement noise vector ( v ⊤ 1 ,k , v ⊤ 2 ,k , . . . , v ⊤ m,k ) ⊤ ; fifth, μ 1: K ≡ ( μ ( t 1 ) ⊤ , μ ( t 2 ) ⊤ , . . . , μ ( t K ) ⊤ ) ⊤ . Proof. We first derive the two formulas for H ( x 1: K | y 1: K ) : the first formula is expressed in terms of Σ( x 1: K ) − 1 , and the second formula is expressed in terms of Σ( x 1: K ) . Lemma 4 (Formula of H ( x 1: K | y 1: K ) in terms of Σ( x 1: K ) − 1 ) . Consider the start of the k -th iteration in Algorithm 1 . Given the measurements y 1:( k − 1) up to k , H ( x 1: K | y 1: K ) is given by − T ′ 1 + nK log(2 πe ) / 2 , where: T ′ 1 ≡ 1 2 log det(Ξ + Σ( x 1: K ) − 1 ) Ξ ≡ C ( ̃ μ 1: K ) ⊤ S 1: K Σ( v 1: K ) − 1 S ⊤ 1: K C ( ̃ μ 1: K ) and ̃ μ 1: K is the maximum a posteriori (MAP) estimate of x 1: K given the measurements y 1:( k − 1) up to k . Proof. 
Before the k-th iteration of Step 1 in Algorithm 1, we first compute μ̃_{1:K} given the measurements y_{1:(k−1)} up to k. This step has complexity O(n^{2.4} K) when Σ(x_{1:K})^{−1} is sparse [24, Eq. (5)], [35, Section 3.8], and it does not increase the total time complexity of Algorithm 1. Next, given μ̃_{1:K}, we linearize our measurement model around μ̃_{1:K}, and compute C(μ̃_{1:K}). Then, we continue as follows: x_{1:K} and y_{1:K} are jointly Gaussian:
(x_{1:K}, y_{1:K}) ∼ N(E(x_{1:K}, y_{1:K}), Σ(x_{1:K}, y_{1:K})),
E(x_{1:K}, y_{1:K}) = (μ_{1:K}, c(μ_{1:K})),
Σ(x_{1:K}, y_{1:K}) = [ Σ(x_{1:K})               Σ(x_{1:K}) C(μ̃_{1:K})^⊤
                        C(μ̃_{1:K}) Σ(x_{1:K})   Σ(y_{1:K}) ],
where:
Σ(y_{1:K}) = S_{1:K} Σ(v_{1:K}) S_{1:K}^⊤ + C(μ̃_{1:K}) Σ(x_{1:K}) C(μ̃_{1:K})^⊤.
Therefore, the conditional probability distribution of x_{1:K} given y_{1:K} has covariance Σ(x⋆_{1:K}) (using our notation in Proposition 2) such that:
Σ(x⋆_{1:K}) = Σ(x_{1:K}) − Σ(x_{1:K}) C(μ̃_{1:K})^⊤ Φ C(μ̃_{1:K}) Σ(x_{1:K}),
where:
Φ ≡ ( C(μ̃_{1:K}) Σ(x_{1:K}) C(μ̃_{1:K})^⊤ + S_{1:K} Σ(v_{1:K}) S_{1:K}^⊤ )^{−1}.
Using the Woodbury matrix identity [39, Corollary 2.8.8]:
Σ(x⋆_{1:K}) = (Ξ + Σ(x_{1:K})^{−1})^{−1},
where we also used the identity (S_{1:K} Σ(v_{1:K}) S_{1:K}^⊤)^{−1} = S_{1:K} Σ(v_{1:K})^{−1} S_{1:K}^⊤, which holds since S_{1:K} and Σ(v_{1:K}) are block diagonal. Using (9), the proof is complete.

Remark 2. The time complexity for the evaluation of H(x_{1:K} | y_{1:K}) using Lemma 4 is decided by the sparsity of Σ(x_{1:K})^{−1}, since the rest of the matrices are block diagonal.

Lemma 5 (Formula of H(x_{1:K} | y_{1:K}) in terms of Σ(x_{1:K})). Consider the start of the k-th iteration in Algorithm 1. Given the measurements y_{1:(k−1)} up to k, H(x_{1:K} | y_{1:K}) is given by H(x_{1:K} | y_{1:K}) = T_1 − T_2 + H(x_{1:K}), where:
T_1 ≡ (1/2) ∑_{k=1}^K log[(2πe)^{s_k} det(S_k Σ(v_k) S_k^⊤)],   (21)
T_2 ≡ (1/2) log[(2πe)^{∑_{k=1}^K s_k} det(Σ(y_{1:K}))],   (22)
Σ(y_{1:K}) = S_{1:K} Σ(v_{1:K}) S_{1:K}^⊤ + C(μ̃_{1:K}) Σ(x_{1:K}) C(μ̃_{1:K})^⊤,
and μ̃_{1:K} is the maximum a posteriori (MAP) estimate of x_{1:K} given the measurements y_{1:(k−1)} up to k.

Proof. Before the k-th iteration of Step 1 in Algorithm 1, we first compute μ̃_{1:K} given the measurements y_{1:(k−1)} up to k. This step has complexity O(n^{2.4} K) when: a) Σ(x_{1:K}) is sparse [24, Eq. (5), after multiplying both of its sides with Σ(x_{1:K})]; b) certain invertibility conditions apply [35, Section 3.8]. In this case, this step does not increase the total time complexity of Algorithm 1. If the invertibility conditions in [35, Section 3.8] do not apply, the complexity of this computation is O(n^{2.4} K^{2.4}). In this case, we can use μ_{1:K}, instead of μ̃_{1:K}, to evaluate H(x_{1:K} | y_{1:K}), and keep the overall complexity of Algorithm 1 at O(n^{2.4} K). Next, given μ̃_{1:K}, we linearize our measurement model around μ̃_{1:K}, and compute C(μ̃_{1:K}). Then, we continue as follows: the chain rule for conditional entropies implies [36]:
H(x_{1:K} | y_{1:K}) = H(y_{1:K} | x_{1:K}) − H(y_{1:K}) + H(x_{1:K}).
Thus, we derive a closed formula for H(y_{1:K} | x_{1:K}) and H(y_{1:K}):
Closed form of H(y_{1:K} | x_{1:K}): The chain rule for conditional entropies implies [36]:
H(y_{1:K} | x_{1:K}) = ∑_{k=1}^K H(y(t_k) | x_{1:K}, y_{1:k−1})   (23)
= ∑_{k=1}^K E_{x(t_{k′})}( H(y(t_k) | x(t_k) = x(t_{k′})) ).   (24)
Eq.
( 24 ) follows from ( 23 ) because given x ( t k ) y ( t k ) is independent of y 1: k − 1 , x 1:( k − 1) and x ( k +1): K . In ad- dition, ( 21 ) follows from ( 24 ) because y ( t k ) | x ( t k ) ∼ N ( S k g ( x ( t k )) , S k Σ( v k ) S ⊤ k )) and, thus, Lemma 1 applies. Closed form of H ( y 1: K ) : To this end, we derive the marginal distribution of y 1: K , denoted by f ( y 1: K ) : f ( y 1: K ) = ∫ f ( y 1: K , x 1: K ) dx 1: K = ∫ f ( y 1: K | x 1: K ) f ( x 1: K ) dx 1: K , where f ( x 1: K ) denotes the probability distribution of x 1: K . In particular: y 1: K | x 1: K ∼ N ( c ( x 1: K ) , S 1: K Σ( v 1: K ) S ⊤ 1: K ) x 1: K ∼ N ( μ 1: K , Σ( x 1: K )) . Therefore, the best Gaussian approximation to the marginal distribution of y 1: K is: y 1: K ∼ N ( E ( y 1: K ) , Σ( y 1: K )) E ( y 1: K ) = c ( μ 1: K ) Σ( y 1: K ) = S 1: K Σ( v 1: K ) S ⊤ 1: K + C ( ̃ μ 1: K )Σ( x 1: K ) C ( ̃ μ 1: K ) ⊤ . Thus, from Lemma 1 , ( 22 ) follows. Remark 3. The time complexity for the evaluation of H ( x 1: K | y 1: K ) using Lemma 5 is decided by the sparsity of Σ( x 1: K ) since the rest of the matrices are block diagonal. We complete the proof for each case of Theorem 3 : • Time complexity of each evaluation of H ( x 1: K | y 1: K ) when either Σ( x 1: K ) or Σ( x 1: K ) − 1 is exactly sparse (that is, block tri-diagonal) : We present the proof only for the case where Σ( x 1: K ) − 1 is exactly sparse since the proof for the case where Σ( x 1: K ) is exactly sparse is similar. In particular, consider the formula of H ( x 1: K | y 1: K ) in Lemma 4 : T ′ 1 involves the log determinant of a matrix that is the sum of two nK × nK sparse matrices: the first matrix is block diagonal, and the second one is block tri-diagonal. The block diagonal matrix is eval- uated in O ( n 2 . 4 K ) time, since the determinant of an n × n matrix is computed in O ( n 2 . 4 ) time using the Coppersmith-Winograd algorithm [ 40 ]. Then, T ′ 1 is eval- uated in O ( n 2 . 4 K ) [ 41 , Theorem 2]. • Time complexity of each evaluation of H ( x 1: K | y 1: K ) when both Σ( x 1: K ) and Σ( x 1: K ) − 1 are dense : In this case, T ′ 1 (and similarly T 2 in Lemma 5 ) is the log determinant of a dense nK × nK matrix. Therefore, it is evaluated in O (( nK ) 2 . 4 ) time, since the determinant of an n × n matrix is computed in O ( n 2 . 4 ) time using the Coppersmith-Winograd algorithm [ 40 ]. R EFERENCES [1] E. Masazade, M. Fardad, and P. K. Varshney, “Sparsity-promoting ex- tended kalman filtering for target tracking in wireless sensor networks,” IEEE Signal Processing Letters , vol. 19, no. 12, pp. 845–848, 2012. [2] N. Karnad, Robot Motion Planning for Tracking and Capturing Adver- sarial, Cooperative and Independent Targets . U. of Minnesota, 2015. [3] M. P. Vitus, “Sensor placement for improved robotic navigation,” Robotics: Science and Systems VI , p. 217, 2011. [4] M. Kaess, A. Ranganathan, and F. Dellaert, “isam: Incremental smooth- ing and mapping,” IEEE Transactions on Robotics , vol. 24, no. 6, pp. 1365–1378, 2008. [5] H. Rowaihy, S. Eswaran, M. Johnson, D. Verma, A. Bar-Noy, T. Brown, and T. La Porta, “A survey of sensor selection schemes in wireless sensor networks,” in Defense and Security Symposium . International Society for Optics and Photonics, 2007, pp. 65 621A–65 621A. [6] S. Karlin, A first course in stochastic processes . Academic press, 2014. [7] T. Kailath, A. H. Sayed, and B. Hassibi, Linear estimation . Prentice Hall Upper Saddle River, NJ, 2000, vol. 1. [8] R. Nowak, U. Mitra, and R. 
Willett, “Estimating inhomogeneous fields using wireless sensor networks,” IEEE Journal on Selected Areas in Communications , vol. 22, no. 6, pp. 999–1006, 2004. [9] A. O. Hero III and D. Cochran, “Sensor management: Past, present, and future,” IEEE Sensors Journal , vol. 11, no. 12, pp. 3064–3075, 2011. [10] V. Tzoumas, A. Jadbabaie, and G. J. Pappas, “Near-Optimal Sensor Scheduling for Batch State Estimation: Complexity, Algorithms, and Limits,” in 55th IEEE Conference on Decision and Control (CDC) , 2016. [11] V. Roy, A. Simonetto, and G. Leus, “Spatio-temporal sensor manage- ment for environmental field estimation,” Signal Processing , vol. 128, pp. 369 – 381, 2016. [12] M. Shamaiah, S. Banerjee, and H. Vikalo, “Greedy sensor selection: Leveraging submodularity,” in 49th IEEE Conference on Decision and Control (CDC), , 2010, pp. 2572–2577. [13] M. P. Vitus, W. Zhang, A. Abate, J. Hu, and C. J. Tomlin, “On efficient sensor scheduling for linear dynamical systems,” Automatica , vol. 48, no. 10, pp. 2482–2493, 2012. [14] S. Joshi and S. Boyd, “Sensor selection via convex optimization,” IEEE Transactions on Signal Processing , vol. 57, no. 2, pp. 451–462, 2009. [15] J. Le Ny, E. Feron, and M. A. Dahleh, “Scheduling continuous-time kalman filters,” IEEE Transactions on Automatic Control , vol. 56, no. 6, pp. 1381–1394, 2011. [16] X. Shen, S. Liu, and P. K. Varshney, “Sensor selection for nonlinear systems in large sensor networks,” IEEE Transactions on Aerospace and Electronic Systems , vol. 50, no. 4, pp. 2664–2678, 2014. [17] S. Liu, M. Fardad, P. K. Varshney, and E. Masazade, “Optimal periodic sensor scheduling in networks of dynamical systems,” Signal Processing, IEEE Transactions on , vol. 62, no. 12, pp. 3055–3068. [18] Y. Mo, R. Ambrosino, and B. Sinopoli, “Sensor selection strategies for state estimation in energy constrained wireless sensor networks,” Automatica , vol. 47, no. 7, pp. 1330–1338, 2011. [19] Y. He and E. K. P. Chong, “Sensor scheduling for target tracking: A monte carlo sampling approach,” Digital Signal Processing , vol. 16, no. 5, pp. 533–545, 2006. [20] H. Zhang, R. Ayoub, and S. Sundaram, “Sensor selection for optimal filtering of linear dynamical systems: Complexity and approximation,” in IEEE Conference on Decision and Control (CDC) , 2015. [21] S. T. Jawaid and S. L. Smith, “Submodularity and greedy algorithms in sensor scheduling for linear dynamical systems,” Automatica , vol. 61, pp. 282–288, 2015. [22] C. Ko, J. Lee, and M. Queyranne, “An algorithm for maximum entropy sampling,” Operations Research , vol. 43, no. 4, pp. 684–691, 1995. [23] A. Krause, A. Singh, and C. Guestrin, “Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical stud- ies,” Journal of Machine Learning Research , vol. 9, pp. 235–284, 2008. [24] S. Anderson, T. D. Barfoot, C. H. Tong, and S. S ̈ arkk ̈ a, “Batch nonlinear continuous-time trajectory estimation as exactly sparse gaussian process regression,” Autonomous Robots , vol. 39, no. 3, pp. 221–238, 2015. [25] R. Durrett, Probability: theory and examples . Cambridge University Press, 2010. [26] A. Olshevsky, “Minimal controllability problems,” IEEE Transactions on Control of Network Systems , vol. 1, no. 3, pp. 249–258, 2014. [27] S. Pequito, S. Kar, and A. Aguiar, “A framework for structural input/out- put and control configuration selection in large-scale systems,” IEEE Transactions on Automatic Control , vol. 61, pp. 303 –318, 2015. [28] M. L. Fisher, G. L. Nemhauser, and L. 
Wolsey, An analysis of approxi- mations for maximizing submodular functions–II . Springer, 1978, 1978. [29] J. Vondr ́ ak, “Submodularity and curvature: the optimal algorithm,” RIMS Kokyuroku Bessatsu B , vol. 23, pp. 253–266, 2010. [30] G. L. Nemhauser and L. A. Wolsey, Integer and Combinatorial Opti- mization . New York, NY, USA: Wiley-Interscience, 1988. [31] U. Feige, “A threshold of ln n for approximating set cover,” J. ACM , vol. 45, no. 4, pp. 634–652, Jul. 1998. [32] J. Vondr ́ ak, “Optimal approximation for the submodular welfare problem in the value oracle model,” in Proceedings of the fortieth annual ACM symposium on Theory of computing . ACM, 2008, pp. 67–74. [33] M. Minoux, “Accelerated greedy algorithms for maximizing submodular functions,” in Optimization Techniques . Springer, 1978, pp. 234–243. [34] F. Dellaert and M. Kaess, “Square root sam: Simultaneous localization and mapping via square root information smoothing,” The International Journal of Robotics Research , vol. 25, no. 12, pp. 1181–1203, 2006. [35] A. Quarteroni, R. Sacco, and F. Saleri, Numerical mathematics . Springer Science & Business Media, 2010, vol. 37. [36] T. M. Cover and J. A. Thomas, Elements of information theory . John Wiley & Sons, 2012. [37] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I , 3rd ed. Athena Scientific, 2005. [38] S. Venkatesh, The Theory of Probability: Explorations and Applications . Cambridge University Press, 2012. [39] D. S. Bernstein, Matrix mathematics: theory, facts, and formulas . Princeton University Press, 2009. [40] D. Coppersmith and S. Winograd, “Matrix multiplication via arithmetic progressions,” in Proceedings of the nineteenth annual ACM symposium on Theory of computing , 1987, pp. 1–6. [41] L. G. Molinari, “Determinants of block tridiagonal matrices,” Linear algebra and its applications , vol. 429, no. 8, pp. 2221–2226, 2008.