arXiv:1610.05882v2 [cs.RO] 19 Oct 2021

Cognitive Indoor Positioning and Tracking using Multipath Channel Information

Erik Leitinger, Paul Meissner, and Klaus Witrisal

Abstract: This paper presents a robust and accurate positioning system that adapts its behavior to the surrounding environment, mimicking the capability of the visual brain to filter out clutter and focus attention on activity and relevant information. Especially in indoor environments, which are characterized by harsh multipath propagation, robust positioning is still hard to achieve under the constraint of reasonable infrastructural needs. In such environments it is essential to separate relevant from irrelevant information and attain an appropriate uncertainty model for measurements that are used for positioning.

Index Terms: Cognitive dynamic systems, Cramér-Rao bounds, localization, simultaneous localization and mapping, radio channel models

I. INTRODUCTION

A. Motivation and State of the Art

For radio-based positioning in indoor environments, which are characterized by harsh multipath propagation, it is still difficult to achieve the needed level of accuracy robustly$^1$ under the constraint of reasonable infrastructural needs. In such environments it is essential to separate relevant from irrelevant information and attain an appropriate uncertainty model for measurements that are being used for positioning.

To approach this objective more closely, the four basic principles of human cognition, namely the perception-action cycle (PAC), memory, attention, and intelligence [1], are implemented in the positioning system as schematically illustrated in Fig. 2. To incorporate all these principles, the concepts of multipath-assisted indoor navigation and tracking (MINT) [2]–[5] are intertwined with the principles of cognitive dynamic systems (CDS) that were developed in [6]–[10]. Evidently, a perceptive system has to reason with measurements under uncertainty [11], i.e.
it has to treat the gained information probabilistically [12], [13], but it also has to deliberately take actions on the environment and consequently influence the measurements, so that its reasoning favors relevant over irrelevant information. Hence, cognitive processing of measurement data for positioning seems to be a natural choice to overcome such severe impairments.

MINT exploits specular multipath components (MPCs) that can be associated to the local geometry as illustrated in Fig. 1. MPCs can be interpreted as signals originating from additional virtual sources, so-called virtual anchors (VAs). These VAs are mirror images of a physical anchor w.r.t. the flat surfaces as illustrated in Fig. 1 [2], [14], [15]. This additional position-related information can be extracted from the radio signals. For a proper consideration of uncertainties in the floor plan and to account for the stochastic nature of the radio signals, a geometry-based probabilistic environment model (GPEM) and a geometry-based stochastic channel model (GSCM) were introduced in [16]–[19], extending MINT to a simultaneous localization and mapping (SLAM) approach. Such a system acquires and adapts information about its surrounding environment online and is able to continuously build up a consistent memory in a Bayesian sense.

The idea of combining MINT with a CDS is to gain control over the observed environment information in order to (i) provide as much position-related information to the Bayesian state estimator as possible for achieving the highest level of reliability/robustness in position estimation, (ii) improve the separation between relevant and irrelevant information, and (iii) build up a consistent environment and action memory.

$^1$ We define robustness as the percentage of cases in which a system can achieve its given potential accuracy.
By actively planning the next control actions on the environment using the Bayesian memory, in the sense of waveform adaptation [6], [20]–[22] or mobile agent motor control [23], [24], the relevant information return contained in the signals can be maximized. The information-flow coupling between the perceptor-actor system and the surrounding environment is given by the PAC, which plays the key role when it comes to gathering relevant environment information [1], [10].

The core feedback loop of the cognitive dynamic system, the perception-action cycle, resembles the idea of optimally choosing future measurements based on a physical model while reasoning under uncertainty. The principle has been explored by the physics community under the term Bayesian experimental design [25]. This decision-theoretic process gives a mathematical justification for selecting the appropriate optimality criterion under uncertainty that maximizes the utility function of the posterior probability density function, such that new model information of the acquired measurements can be predicted. Information-theoretic measures such as the conditional entropy [26], the mutual information [26], or the determinant of the Fisher information matrix [27], [28] are suitable utility functions for this process. The active selection of measurement parameters has a lot in common with cognitive perception and control at the lowest layer. However, it lacks an explicit description of a layered memory structure that, in combination with algorithmic attention, leads to an "intelligent" behavior of the overall system.

II. MINT CONCEPTS

In this section we review basic elements of MINT [3], [29], starting with the signal model, then discussing the estimation of the MPC parameters, and finally introducing the position-related information that is of main importance for a proper weighting of the MPC-VA relations in the Bayesian tracking filter.
All propagation effects in the signals that are not modeled geometrically, so-called diffuse multipath (DM) [30], constitute interference to the useful position-related information.

A. Geometry-based Stochastic Signal Model (GSCM)

Our signal model is the following. During time step $n$, a baseband radio signal $s(t)$ is transmitted from the $j$-th physical anchor located at position $\mathbf{a}_1^{(j)} \in \mathbb{R}^2$, $j \in \{1, \dots, J\} = \mathcal{J}$, to a mobile agent at position $\mathbf{p}_n \in \mathbb{R}^2$. The corresponding received signal is given as [3]

\[ r_n^{(j)}(t) = \sum_{k=1}^{K_n^{(j)}} \alpha_{k,n}^{(j)}\, s\big(t - \tau_{k,n}^{(j)}\big) + s(t) * \nu_n^{(j)}(t) + w(t). \tag{1} \]

Here, the first term describes the contributions from $K_n^{(j)}$ specular MPCs with complex amplitudes $\alpha_{k,n}^{(j)}$ and delays $\tau_{k,n}^{(j)}$, where $k \in \{1, \dots, K_n^{(j)}\} = \mathcal{K}_n^{(j)}$. The delays $\tau_{k,n}^{(j)}$ correspond to the distances between the agent and the $j$-th physical anchor (for $k = 1$) or the VAs of the $j$-th physical anchor (for $k \in \{2, \dots, K_n^{(j)}\}$). Thus, $\tau_{k,n}^{(j)} = \|\mathbf{p}_n - \mathbf{a}_k^{(j)}\| / c$, where $\mathbf{a}_k^{(j)} \in \mathbb{R}^2$ is the position of the respective (physical or virtual) anchor and $c$ is the speed of light. The energy of the transmitted signal $s(t)$ is assumed to be normalized to one.

The second term in (1) denotes the convolution of $s(t)$ with the diffuse multipath (DM) $\nu_n^{(j)}(t)$, which is modeled as a non-stationary zero-mean Gaussian random process. Considering uncorrelated scattering along the delay axis $\tau$, the auto-correlation function of $\nu_n^{(j)}(t)$ is given by $\mathbb{E}_{\nu}\{\nu_n^{(j)}(\tau)\,[\nu_n^{(j)}(u)]^{*}\} = S_{\nu,n}^{(j)}(\tau)\,\delta(\tau - u)$, where $S_{\nu,n}^{(j)}(\tau)$ represents the power delay profile of the DM. The DM process $\nu_n^{(j)}(t)$ is assumed to be quasi-stationary in the spatial domain, which means that $S_{\nu,n}^{(j)}(\tau)$ does not change in the vicinity of $\mathbf{p}_n$ [31]. Note that the DM component interferes with the useful position-related information. The last term in (1), $w(t)$, is additive white Gaussian noise with double-sided power spectral density $N_0/2$.

B. MPC Parameter Estimation

The delays of the MPCs at agent position $\mathbf{p}_n$ are estimated from the received signals using a sparse Bayesian channel estimator [32]. The algorithm estimates up to a predefined maximum number $M$ of MPCs, yielding estimated delays $\hat{\tau}_{m,n}^{(j)}$ and the corresponding complex amplitudes $\hat{\alpha}_{m,n}^{(j)}$, with $m \in \{1, \dots, M_n^{(j)}\} = \mathcal{M}_n^{(j)}$. The estimated delays are scaled by the speed of light $c$ and used as noisy distance measurements $z_{m,n}^{(j)} = c\,\hat{\tau}_{m,n}^{(j)}$ in the proposed multipath-assisted SLAM algorithm. Furthermore, in a real-world MINT system, the amplitude estimates $\hat{\alpha}_{m,n}^{(j)}$ (after being associated with the $k$-th anchor) are fed into a higher-level, non-Bayesian algorithm that determines the signal-to-interference-plus-noise power ratio (SINR) between the useful specular MPC and the DM plus noise. This SINR is related to the range standard deviation $\sigma_{m,n}^{(j)}$ (see [29], [33] for details). Note that an extension to additional parameters besides the delay (and the corresponding amplitude), for example the angle-of-arrival and angle-of-departure of the MPCs, is straightforward.

C. Position and Range Uncertainty

As a performance measure and lower bound on the position error we use the Cramér-Rao lower bound (CRLB), defined by the inequality $\mathbb{E}\{\|\mathbf{p} - \hat{\mathbf{p}}\|^2\} \geq \mathrm{tr}\{\mathcal{I}_{\mathbf{p}}^{-1}\}$, where $\mathcal{I}_{\mathbf{p}}$ is the equivalent Fisher information matrix (EFIM) [3], [34], [35] for the position vector and $\mathrm{tr}\{\cdot\}$ is the trace operator. Assuming no path overlap between MPCs, the EFIM is formulated for a set of anchors in a canonical form by [3]

\[ \mathcal{I}_{\mathbf{p},n} = \frac{8\pi^2\beta^2}{c^2} \sum_{j=1}^{J} \sum_{k=1}^{K_n^{(j)}} \mathrm{SINR}_{k,n}^{(j)}\, \mathcal{I}_r\big(\phi_{k,n}^{(j)}\big), \tag{2} \]

where $\beta$ denotes the effective (root mean square) bandwidth of $s(t)$ and $\mathcal{I}_r(\phi_{k,n}^{(j)})$ is the ranging direction matrix, which is a rank-one matrix with an eigenvector in direction $\phi_{k,n}^{(j)}$ from the agent to the $k$-th VA.
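As a rough illustration of how (2) turns geometry and SINR values into a position error bound, the following sketch evaluates the EFIM for a 2-D agent. It assumes that the ranging direction matrix is the outer product of the unit vector pointing from the agent to the (virtual) anchor, that the SINRs are given on a linear scale, and that $\beta$ is given in Hz. The function names `ranging_direction_matrix`, `efim`, and `position_error_bound` are hypothetical and do not come from the paper.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def ranging_direction_matrix(phi):
    """Rank-one 2x2 matrix e e^T with e = [cos(phi), sin(phi)] (assumed form)."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c * c, c * s], [c * s, s * s]]


def efim(agent, anchors, sinrs, beta):
    """Evaluate (2): SINR-weighted sum of ranging direction matrices.

    agent   -- (x, y) position of the agent in meters
    anchors -- list of (x, y) positions of physical or virtual anchors
    sinrs   -- list of linear-scale SINR values, one per anchor
    beta    -- effective (RMS) bandwidth in Hz
    """
    scale = 8.0 * math.pi ** 2 * beta ** 2 / C_LIGHT ** 2
    info = [[0.0, 0.0], [0.0, 0.0]]
    for (ax, ay), sinr in zip(anchors, sinrs):
        # Angle from the agent to this anchor.
        phi = math.atan2(ay - agent[1], ax - agent[0])
        direction = ranging_direction_matrix(phi)
        for row in range(2):
            for col in range(2):
                info[row][col] += scale * sinr * direction[row][col]
    return info


def position_error_bound(info):
    """Return sqrt(tr(I^-1)) for a 2x2 EFIM, i.e. the square root of the CRLB."""
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    if det <= 0.0:
        raise ValueError("EFIM is singular; at least two ranging directions are needed")
    # For a 2x2 matrix, tr(I^-1) = tr(I) / det(I).
    return math.sqrt((info[0][0] + info[1][1]) / det)


# Usage: two anchors seen under orthogonal directions, 20 dB SINR, 1 GHz RMS bandwidth.
example_info = efim((0.0, 0.0), [(5.0, 0.0), (0.0, 5.0)], [100.0, 100.0], 1e9)
example_peb = position_error_bound(example_info)
```

With a single anchor the matrix is rank one and the bound does not exist, which is why the sketch rejects a non-positive determinant; additional VAs contribute further ranging directions and make the EFIM invertible.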
The signal-to-interference-plus-noise ratios (SINRs) are given by the ratio between the energy of the deterministic MPCs and the interfering DM plus noise,

\[ \mathrm{SINR}_{k,n}^{(j)} = \frac{|\alpha_{k,n}^{(j)}|^2}{N_0 + T_p\, S_{\nu,n}^{(j)}\big(\tau_{k,n}^{(j)}\big)}, \tag{3} \]

where $T_p$ is the pulse duration. The corresponding MPC range uncertainties $\sigma_{k,n}^{(j)2} = \mathrm{var}\{z_{k,n}^{(j)}\}$ for already associated VAs are bounded as

\[ \sigma_{k,n}^{(j)2} \geq \left( \frac{8\pi^2\beta^2}{c^2}\, \mathrm{SINR}_{k,n}^{(j)} \right)^{-1}. \tag{4} \]

D. Geometry-based Probabilistic Environment Model (GPEM)

Fig. 1 illustrates the probabilistic geometric environment model. A signal exchanged between an anchor at position $\mathbf{a}_1^{(j)}$ and an agent at $\mathbf{p}^{(m)}$ contains specular reflections at the room walls, indicated by the black lines$^2$. These reflections can be modeled geometrically using the VAs $\mathbf{a}_k^{(j)}$ with $k = 1, \dots, K^{(j)}$ that are mirror images of the $j$-th anchor w.r.t. the walls [2], [14], [15]. The number of VAs per anchor $j$ is defined as $K^{(j)}$. The VAs of all anchors are comprised in $\mathcal{A}_n = \{\mathcal{A}_n^{(j)}\}_{j=1}^{J}$, where $\mathcal{A}_n^{(j)} = \{\mathbf{a}_{k,n}^{(j)}\}_{k=1}^{K_n^{(j)}}$.

To be able to cope with uncertainties in the floor plan, the deterministic geometric model of the VA positions $\mathbf{a}_k^{(j)}$ of the $j$-th anchor is extended to a probabilistic one as shown in Fig. 1. The VA positions and the agent position $\mathbf{p}^{(m)}$ are represented by a joint PDF $p\big(\mathbf{p}^{(m)}, \mathbf{a}_1^{(j)}, \mathbf{a}_2^{(j)}, \dots, \mathbf{a}_{K_n^{(j)}}^{(j)}\big)$. If the position of the $j$-th anchor is assumed to be known exactly, the joint PDF reduces to $p\big(\mathbf{p}^{(m)}, \mathbf{a}_2^{(j)}, \dots, \mathbf{a}_{K_n^{(j)}}^{(j)}\big)$.

The joint PDF of the agent and the VA positions is represented by a multivariate Gaussian RV, where the figure shows the marginal distributions of the agent, $p\big(\mathbf{p}^{(m)}\big)$ (dashed black ellipses), and of the VA positions, $p\big(\mathbf{a}_k^{(j)}\big)$ (red ellipses). The marginal distribution $p(\mathbf{a}_{\mathrm{false}})$ (dashed red ellipse) describes a wrongly detected VA at position $\mathbf{a}_{\mathrm{false}}$. The anchor position $\mathbf{a}_1^{(j)}$ is assumed to be known perfectly.

Fig. 1. Illustration of the VAs for the $j$-th anchor and an agent with PDFs $p\big(\mathbf{a}_k^{(j)}\big)$ and $p\big(\mathbf{p}^{(m)}\big)$, respectively. The VA at position $\mathbf{a}_{\mathrm{false}}$ represents a falsely detected VA.

$^2$ Since the radio channel is reciprocal, the assignment of transmitter and receiver roles to anchors and agents is arbitrary, and this choice can be made according to the application scenario.

Uncertainty in the floor plan does not just mean that the VA positions are uncertain and thus described by RVs, but also that floor plan information may be incorrect, inconsistent, or entirely missing. Positioning and tracking algorithms based on VAs therefore have to account for this lack of knowledge.

E. Probabilistic Data Association (PDA)

The state of the agent $\mathbf{x}_n = [\mathbf{p}_n^T, \mathbf{v}_n^T]^T$, where $\mathbf{v}_n$ is the velocity, evolves over time instances $n$ according to the state transition probability density function (PDF) $p(\mathbf{x}_n | \mathbf{x}_{n-1})$. From each VA in $\mathcal{A}_n^{(j)}$ and the predicted agent position, a set of expected MPC distances $\mathcal{D}_n^{(j)}$ at time step $n$ is computed.

The MPC distances described in Section II-B are subject to a data association uncertainty, i.e., it is not known which measurement in $\mathbf{z}_n^{(j)}$ originated from which VA $k$ of the $j$-th anchor, and it is also possible that a measurement $z_{m,n}^{(j)}$ did not originate from any VA (false alarm, clutter) or that a VA did not give rise to any measurement (missed detection). The probability that a VA is detected is denoted by $P_d$. Possible associations at time instance $n$ are described by the $K_n^{(j)}$-dimensional random vector $\mathbf{b}_n^{(j)} = \big[b_{1,n}^{(j)} \cdots b_{K_n^{(j)},n}^{(j)}\big]^T$, whose $k$-th entry is defined as [5], [17]–[19], [36], [37]

\[ b_{k,n}^{(j)} = \begin{cases} m \in \{1, \dots, M\}, & \mathbf{a}_k^{(j)} \text{ generates measurement } z_{m,n}^{(j)}, \\ 0, & \mathbf{a}_k^{(j)} \text{ did not give rise to any measurement.} \end{cases} \]

We also define $\mathbf{b}_n = \big[\mathbf{b}_n^{(1)T} \cdots \mathbf{b}_n^{(J)T}\big]^T$.
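The geometric part of this model can be sketched in a few lines: a first-order VA is obtained by mirroring the physical anchor at a wall, and the expected MPC distances follow from the predicted agent position. The sketch below is illustrative only; it treats a wall as an infinite line through two points and ignores visibility (whether the reflection point actually lies on the wall segment), which a full implementation would have to check. The helper names `mirror_point` and `expected_distances` are hypothetical.

```python
import math


def mirror_point(anchor, wall_a, wall_b):
    """Mirror `anchor` at the infinite line through the points wall_a and wall_b."""
    ax, ay = anchor
    (x1, y1), (x2, y2) = wall_a, wall_b
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        raise ValueError("wall end points must differ")
    # Foot of the perpendicular from the anchor onto the wall line.
    t = ((ax - x1) * dx + (ay - y1) * dy) / length_sq
    foot_x, foot_y = x1 + t * dx, y1 + t * dy
    # The mirror image lies at the same distance on the other side of the wall.
    return (2.0 * foot_x - ax, 2.0 * foot_y - ay)


def expected_distances(agent, anchors):
    """Euclidean distances from the agent to each (physical or virtual) anchor."""
    return [math.hypot(agent[0] - ax, agent[1] - ay) for ax, ay in anchors]


# Usage: anchor at (1, 1), wall along x = 4, agent predicted at (2, 1).
physical_anchor = (1.0, 1.0)
virtual_anchor = mirror_point(physical_anchor, (4.0, 0.0), (4.0, 3.0))
distances = expected_distances((2.0, 1.0), [physical_anchor, virtual_anchor])
```

In this example the distance to the VA equals the length of the reflected path (agent to wall to anchor), which is the property that lets an MPC delay be treated as a range measurement to a virtual source.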
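False alarms are modeled by a Poisson point process with mean arrival rate $\mu$, and the distribution of each false alarm measurement is described by the PDF $f_{\mathrm{FA}}\big(z_{m,n}^{(j)}\big)$ [38], [39], which enters the likelihood whenever a measurement corresponds to a false alarm.

The statistical dependence of the distance measurement vector $\mathbf{z}_n = \big[\mathbf{z}_n^{(1)}, \cdots, \mathbf{z}_n^{(J)}\big]^T$ on the agent state vector $\mathbf{x}_n$ and the association vector $\mathbf{b}_n$ is described by the global likelihood function $f(\mathbf{z}_n | \mathbf{x}_n, \mathbf{b}_n)$. Under commonly used assumptions about the statistics of the measurements [2], [38], [39], the global likelihood function at time instance $n$ factors as

\[ f(\mathbf{z}_n | \mathbf{x}_n, \mathbf{b}_n) = \prod_{j=1}^{J} \left( \prod_{m=1}^{M} f_{\mathrm{FA}}\big(z_{m,n}^{(j)}\big) \right) \prod_{k \in \mathcal{Q}(\mathbf{x}_n, \mathbf{b}_n^{(j)})} \frac{f\big(z_{b_{k,n}^{(j)},n}^{(j)} \,\big|\, \mathbf{x}_n; \mathbf{a}_k^{(j)}, \sigma_{k,n}^{(j)}\big)}{f_{\mathrm{FA}}\big(z_{b_{k,n}^{(j)},n}^{(j)}\big)}, \]

where $\mathcal{Q}(\mathbf{x}_n, \mathbf{b}_n^{(j)}) \triangleq \big\{k \in \{1, \dots, K_n^{(j)}\} : b_{k,n}^{(j)} \neq 0\big\}$. The local likelihood function $f\big(z_{m,n}^{(j)} | \mathbf{x}_n; \mathbf{a}_k^{(j)}, \sigma_{k,n}^{(j)}\big)$ is related to a noisy measurement of the distance to the VA $\mathbf{a}_k^{(j)}$ at agent position $\mathbf{p}_n$, which is modeled as

\[ z_{k,n}^{(j)} = \|\mathbf{p}_n - \mathbf{a}_k^{(j)}\| + v_{k,n}^{(j)}, \]

where $v_{k,n}^{(j)}$ is a zero-mean Gaussian random variable with standard deviation $\sigma_{k,n}^{(j)}$ as described in (4).

Based on the factorized likelihood model, a probabilistic data association algorithm is used to compute the associations between the expected delays to the VAs and the estimated MPCs using belief propagation as described in [5], [17]–[19], [36], [37]. The most probable MPC-to-anchor associations are obtained by means of an approximation of the maximum a posteriori (MAP) detector [40]

\[ \hat{b}_{k,n}^{(j)\mathrm{MAP}} \triangleq \arg\max_{b_{k,n}^{(j)} \in \{1, \dots, M\}} p\big(b_{k,n}^{(j)} \,\big|\, \mathbf{z}\big). \tag{5} \]

After the PDA has been applied for all anchors, the following union sets are defined:
• The set of associated discovered (and optionally a-priori known) VAs $\mathcal{A}_{n,\mathrm{ass}} = \bigcup_j \mathcal{A}_{n,\mathrm{ass}}^{(j)}$.
• The corresponding set of associated measurements $\mathcal{Z}_{n,\mathrm{ass}} = \bigcup_j \mathcal{Z}_{n,\mathrm{ass}}^{(j)}$.
• The set of remaining measurements $\mathcal{Z}_{n,\overline{\mathrm{ass}}} = \bigcup_j \mathcal{Z}_{n,\overline{\mathrm{ass}}}^{(j)}$, which are not associated to VAs of $\mathcal{A}_{n,\mathrm{ass}}$.

F. MINT-SLAM

In the most generic form, and using the Markovian assumption, the prediction equation for the VAs $\mathcal{A}_n$ and the agent state $\mathbf{x}_n = [\mathbf{p}_n^T, \mathbf{v}_n^T]^T$ can be written as

\[ p(\mathbf{x}_n, \mathcal{A}_n | \mathcal{Z}_{1:n-1}) = \int p(\mathbf{x}_{n-1}, \mathcal{A}_{n-1} | \mathcal{Z}_{1:n-1})\, p(\mathbf{x}_n | \mathbf{x}_{n-1})\, p(\mathcal{A}_n | \mathcal{A}_{n-1})\, \mathrm{d}\{\mathbf{x}_{n-1}, \mathcal{A}_{n-1}\}, \tag{6} \]

where $p(\mathbf{x}_n | \mathbf{x}_{n-1})$ and $p(\mathcal{A}_n | \mathcal{A}_{n-1})$ are the state transition PDFs of the agent and the VAs, respectively. The latter can be represented by an identity function. The update equation is then

\[ p(\mathbf{x}_n, \mathcal{A}_n | \mathcal{Z}_{1:n}) = \frac{p(\mathcal{Z}_n | \mathbf{x}_n, \mathcal{A}_n)\, p(\mathbf{x}_n, \mathcal{A}_n | \mathcal{Z}_{1:n-1})}{p(\mathcal{Z}_n | \mathcal{Z}_{1:n-1})}, \tag{7} \]

where $p(\mathcal{Z}_n | \mathbf{x}_n, \mathcal{A}_n)$ is the likelihood function of the current measurements.

Assuming that the agent moves along a path according to a linear Gaussian constant-velocity motion model, the state space model is defined as

\[ \mathbf{x}_n = \mathbf{F}\mathbf{x}_{n-1} + \mathbf{G}\mathbf{n}_{a,n} = \begin{bmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \mathbf{x}_{n-1} + \begin{bmatrix} \frac{\Delta T^2}{2} & 0 \\ 0 & \frac{\Delta T^2}{2} \\ \Delta T & 0 \\ 0 & \Delta T \end{bmatrix} \mathbf{n}_{a,n}, \tag{8} \]

where $\Delta T$ is the discrete-time update interval. The driving acceleration noise term $\mathbf{n}_{a,n}$ is zero-mean and circularly symmetric with variance $\sigma_a^2$, and it models motion changes which deviate from the constant-velocity assumption. The transformed noise covariance matrix is given as $\mathbf{R}_a = \sigma_a^2 \mathbf{G}\mathbf{G}^T$. The entire state space of $\mathbf{x}_n$ and the associated VAs $\mathcal{A}_{n,\mathrm{ass}}$ described in (6) is formulated as [4], [16]

\[ \tilde{\mathbf{x}}_n = \begin{bmatrix} \mathbf{F} & \mathbf{0}_{4 \times 2K_n} \\ \mathbf{0}_{2K_n \times 4} & \mathbf{I}_{2K_n \times 2K_n} \end{bmatrix} \tilde{\mathbf{x}}_{n-1} + \begin{bmatrix} \mathbf{G} \\ \mathbf{0}_{2K_n \times 2} \end{bmatrix} \mathbf{n}_{a,n}, \tag{9} \]

where $\tilde{\mathbf{x}}_n = [\mathbf{x}_n^T, \mathbf{a}_{2,n}^T, \dots, \mathbf{a}_{K_n,n}^T]^T$ represents the stacked state vector with $\{\mathbf{a}_{k,n}^{(j)}\} \in \mathcal{A}_{n,\mathrm{ass}}$. The covariance matrix of the state vector consists of the agent covariance matrix $\mathbf{C}_{\mathbf{x}_n}$, the cross-covariances $\mathbf{C}_{\mathbf{x}_n,\mathbf{a}_{k,n}}$ between the agent state $\mathbf{x}_n$ and the VAs at positions $\mathbf{a}_{k,n}$, the cross-covariances between VAs $\mathbf{C}_{\mathbf{a}_{k,n},\mathbf{a}_{k',n}}$ with $k \neq k'$, and the covariances of the VAs $\mathbf{C}_{\mathbf{a}_{k,n}}$. The measurement model is defined as

\[ \mathbf{z}_n = \tilde{\mathbf{h}}_n(\tilde{\mathbf{x}}_n) + \tilde{\mathbf{n}}_{z,n}, \tag{10} \]

where $\mathbf{z}_n$ collects the measurements associated via (5) and $\tilde{\mathbf{n}}_{z,n}$ is the corresponding stacked measurement noise vector. The measurement model $\tilde{\mathbf{h}}_n$ contains all distance equations $\|\mathbf{a}_{n,k}^{(j)} - \mathbf{p}_n\|$ $\forall\, \mathbf{a}_{n,k}^{(j)} \in \mathcal{A}_{n,\mathrm{ass}}$ to update the agent and the VAs, respectively.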
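The constant-velocity model in (8) is compact enough to write out directly. The following minimal sketch builds $\mathbf{F}$ and $\mathbf{G}$ for a given update interval, propagates an agent state $[p_x, p_y, v_x, v_y]^T$ by one step, and forms $\mathbf{R}_a = \sigma_a^2 \mathbf{G}\mathbf{G}^T$. It uses plain Python lists to stay dependency-free; the function names are hypothetical, and the acceleration is passed in explicitly rather than sampled.

```python
def cv_matrices(dt):
    """Return the transition matrix F and noise input matrix G of (8)."""
    F = [
        [1.0, 0.0, dt, 0.0],
        [0.0, 1.0, 0.0, dt],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    G = [
        [dt * dt / 2.0, 0.0],
        [0.0, dt * dt / 2.0],
        [dt, 0.0],
        [0.0, dt],
    ]
    return F, G


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector (list)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def predict_state(x_prev, accel, dt):
    """One step of x_n = F x_{n-1} + G n_a for a given acceleration n_a."""
    F, G = cv_matrices(dt)
    return [fx + gn for fx, gn in zip(matvec(F, x_prev), matvec(G, accel))]


def process_noise_cov(dt, sigma_a):
    """Transformed noise covariance R_a = sigma_a^2 G G^T (4x4)."""
    _, G = cv_matrices(dt)
    return [
        [sigma_a ** 2 * sum(G[i][k] * G[j][k] for k in range(2)) for j in range(4)]
        for i in range(4)
    ]


# Usage: agent at the origin moving with velocity (1, 2) m/s, no acceleration.
x_next = predict_state([0.0, 0.0, 1.0, 2.0], [0.0, 0.0], 0.5)
```

In the stacked model (9) the VA positions are simply carried over unchanged, so only the first four state entries are affected by this prediction step.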
As Bayesian state estimator, an unscented Kalman filter (UKF) is used [4]. The measurement covariance matrix is written as

\[ \mathbf{R}_n = \mathrm{diag}\big\{\mathrm{var}\{z_{k,n}^{(j)}\}\big\} \quad \forall k, j : \mathbf{a}_{k,n}^{(j)} \in \mathcal{A}_{n,\mathrm{ass}}, \tag{11} \]

where the range variances are defined by (4).

III. COGNITIVE POSITIONING SYSTEM

The basic building blocks of a CDS, namely the perception-action cycle (PAC), the cognitive perceptor (CP), the information feedback, and the cognitive controller (CC), are depicted in Fig. 2. All of these blocks are reciprocally coupled and form a hierarchical structure that makes it possible to interpret the environmental observables on different abstraction layers.

A. Multipath-assisted Positioning as CDS

Fig. 2 illustrates the block diagram of a cognitive localization and tracking system with a triple-layered structure:
• First Layer: Defines (i) the direct Bayesian state estimation $p(\mathbf{x}_n | \mathcal{Z}_n, \mathbf{c}_n)$ at the CP, which holds the agent position and its velocity, and (ii) the cognitive control parameters $\mathbf{c}_n$ at the CC, based on the feedback information of the Bayesian state space filter.
• Second Layer: Represents (i) the memory for the GPEM, described by the VAs with marginal PDF $p\big(\mathcal{A}_n^{(j)} | \mathcal{Z}_n^{(j)}, \mathbf{c}_n\big)$, and the memory for the GSCM, described by the $\mathrm{SINR}_{k,n}^{(j)}$ of the MPCs, at the CP, and (ii) the memory of VA-specific waveform parameters at the CC, on which the cognitive control is based.
• Third Layer: Represents the highest layer and differs from the two layers below in the sense that it defines the application driven by the cognitive localization/tracking system. The CP memory of applications holds abstract parameters or structures of the specified application, and the CC enables the motor control for realizing higher goal planning [41].

The first and second layers describe the signal and information processing of the model parameters of the surrounding physical environment and the radio channel. On the other hand, the third layer holds higher goal parameters, i.e.
motor-control input to fulfill navigation goals, which are based on the physically related parameters [41]–[43].

B. Feedback Information

The system is able to adapt its behavior to the environment online, i.e. perceptual attention is given, through the following principles:
• At the CP side, the GSCM and GPEM memories are updated using the received signal $r_n(t, \mathbf{c}_n)$ with waveform parameters chosen by the CC.
• In the current sensing cycle, the CC uses the control parameters $\mathbf{c}_n$ to direct attention to the potential set of VAs and their parameters memorized in the GSCM and GPEM. These model parameters are shown at the CP side of Fig. 2.

Now the question is, "How to control the environment information flow through the received signal and put cognitive attention on the relevant features in the following sensing cycle?" The answer lies in the CC and in the feedback and feedforward information between the perceptor and the controller as illustrated in Fig. 2. The control parameter vector $\mathbf{c}_{n+1}$ of the next sensing cycle is chosen in order to gain the most "valuable" position-related information from the new set of measurements $\tilde{\mathcal{Z}}_{n+l}$, using the predicted posterior $p\big(\mathbf{x}_{n+l}, \mathcal{A}_{n+l} | \tilde{\mathcal{Z}}_{n+l}, \mathbf{b}_{n+l}, \mathbf{c}_{n+l}\big)$ and the predicted measurements $\tilde{\mathcal{Z}}_{n+l}$, which depend on the chosen signal model, with $l = 0, \dots, l_{\mathrm{future}}$ as future horizon. This goal can be reached by minimizing an expected cost-to-go function, yielding

\[ \mathbf{c}_{n+1} = \arg\min_{\mathbf{c}_n} \mathcal{C}\Big(p\big(\mathbf{x}_{n+l}, \mathcal{A}_{n+l} | \tilde{\mathcal{Z}}_{n+l}, \mathbf{b}_{n+l}, \mathbf{c}_{n+l}\big)\Big), \tag{12} \]

where $\mathcal{C}(\cdot)$ is the expected cost-to-go function for optimal control [25], [43] of the environmental information contained in $\tilde{\mathcal{Z}}_{n+l}$. The expected cost-to-go function is based on an information-theoretic measure that should depend on the environment parameters, like the VA-specific $\mathrm{SINR}_{k,n}^{(j)}$, and serves as feedback information in the CDS.

In general, estimation and control problems have to deal with probabilistic states and observations. As a consequence, the control also has to be probabilistic, i.e.
the cost function or utility must handle uncertainties. Since the uncertainty of the state is described by a PDF, a measure of informativeness of the measurements has to be defined on the posterior state distribution. Two commonly used information measures of an RV are the entropy [44] and the Fisher information [28].

Fig. 2. Block diagram of the cognitive indoor positioning and tracking system that uses multipath channel information. [Blocks shown: Cognitive Controller (policy $\pi_n$: waveform selection, learning and planning), Cognitive Perceptor (Bayesian state filter, Bayesian VA memory, short-term memory), Environment, feedforward and feedback information links, perception-action cycle.]

C. Information Measures for Feedback

1) Fisher Information: The Fisher information matrix (FIM) of an RV $\mathbf{r}$, dependent on the deterministic parameter $\mathbf{p}$, can also be used as a measure of information. Using the log-likelihood function $\ln f(\mathbf{r}; \mathbf{p})$, it is defined as

\[ \mathcal{I}_{\mathbf{p}} = \mathbb{E}_{\mathbf{r};\mathbf{p}}\left\{ \left[\frac{\partial}{\partial \mathbf{p}} \ln f(\mathbf{r}; \mathbf{p})\right] \left[\frac{\partial}{\partial \mathbf{p}} \ln f(\mathbf{r}; \mathbf{p})\right]^T \right\}. \tag{13} \]

2) Entropy: For a continuous-valued vector RV $\mathbf{p} \in \mathbb{R}^L$ (in the follow-up sections $\mathbf{p}$ represents the agent position), the differential entropy is given as [26]

\[ h(\mathbf{p}) \doteq -\mathbb{E}_{\mathbf{p}}\{\ln p(\mathbf{p})\} = -\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\mathbf{p}) \ln p(\mathbf{p})\, \mathrm{d}\mathbf{p}. \tag{14} \]

The entropy is directly related to the uncertainty of the corresponding RV. For a multivariate Gaussian RV $\mathcal{N}(\mathbf{m}_{\mathbf{p}}, \mathbf{C}_{\mathbf{p}})$ this means that the entropy is directly related to the covariance matrix $\mathbf{C}_{\mathbf{p}}$, yielding

\[ h(\mathbf{p}) = \frac{1}{2} \ln\big((2\pi e)^L \det \mathbf{C}_{\mathbf{p}}\big), \tag{15} \]

where $\det(\cdot)$ denotes the determinant of a matrix. The determinant of the covariance matrix $\mathbf{C}_{\mathbf{p}}$ is a measure of the "volume" of uncertainty of $\mathbf{p}$. The more compact the volume is, the smaller is the entropy $h(\mathbf{p})$ and consequently the more informative is the distribution $p(\mathbf{p})$. The inverse of the FIM is a lower bound on the covariance of an unbiased estimator $\hat{\mathbf{p}}$ of the deterministic parameter $\mathbf{p}$, i.e. $\mathbf{C}_{\hat{\mathbf{p}}} \succeq \mathcal{I}_{\mathbf{p}}^{-1}$ [28].
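To make the "volume of uncertainty" interpretation of (15) concrete, the sketch below evaluates the Gaussian entropy for a 2-D position covariance ($L = 2$). It is restricted to 2x2 matrices so that the determinant can be written out by hand; the function name `gaussian_entropy_2d` is hypothetical.

```python
import math


def gaussian_entropy_2d(cov):
    """Differential entropy (in nats) of a 2-D Gaussian with covariance `cov`, per (15)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dimension = 2
    return 0.5 * math.log((2.0 * math.pi * math.e) ** dimension * det)


# Usage: halving the variance in both coordinates shrinks det(C) by a factor of 4,
# which lowers the entropy by 0.5 * ln(4) = ln(2) nats.
h_unit = gaussian_entropy_2d([[1.0, 0.0], [0.0, 1.0]])
h_small = gaussian_entropy_2d([[0.5, 0.0], [0.0, 0.5]])
entropy_reduction = h_unit - h_small
```

Because the entropy depends on the covariance only through its determinant, any measurement that shrinks the uncertainty ellipse reduces the entropy, regardless of the ellipse orientation.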
Looking at the entropy of the estimator's distribution $\mathcal{N}(\hat{\mathbf{p}}, \mathbf{C}_{\hat{\mathbf{p}}})$, the explicit relationship between the FIM $\mathcal{I}_{\mathbf{p}}$ of $\mathbf{r}$ (dependent on $\mathbf{p}$) and the entropy $h(\hat{\mathbf{p}})$ follows from $\mathbf{C}_{\hat{\mathbf{p}}} \succeq \mathcal{I}_{\mathbf{p}}^{-1}$, which implies $\det \mathbf{C}_{\hat{\mathbf{p}}} \geq \det \mathcal{I}_{\mathbf{p}}^{-1}$, and is given as

\[ h(\hat{\mathbf{p}}) = \frac{1}{2} \ln\big((2\pi e)^L \det \mathbf{C}_{\hat{\mathbf{p}}}\big) \geq \frac{1}{2} \ln\big((2\pi e)^L \det \mathcal{I}_{\mathbf{p}}^{-1}\big) = \frac{1}{2} \ln (2\pi e)^L - \frac{1}{2} \ln \det \mathcal{I}_{\mathbf{p}}. \tag{16} \]

As the relationship in (16) shows, one can connect the FIM of a parameter vector with the entropy, resulting in a scalar measure of information that is valuable for choosing optimal waveform parameters, as is needed for a cognitive positioning system. As shown in Section II-C, the FIM $\mathcal{I}_{\mathbf{p}}$ on the position of the agent $\mathbf{p}$ contains the environment and signal parameters, e.g. the VA positions and the corresponding SINRs. With this, a direct relationship between the environment, the feedback information, and the control of the sensing is given, closing the PAC (Fig. 2). In the same manner, the system can also be expanded to information-based control of the agent state to increase the informativeness of the measurements [23], [24], [42].

IV. COGNITIVE MINT

A. Cognitive Controller: Reinforcement Learning (RL)

As already stated in Section III, the control parameters should be chosen in order to optimize the expected cost-to-go function $\mathcal{C}(\cdot)$ of the predicted posterior PDF as defined in (12). In general, the expected cost-to-go function for a Bayesian state space filter can be written as

\[ \mathcal{C}\big(p(\mathbf{x}_{n+1}, \mathcal{A}_{n+1} | \tilde{r}_{n+1}(t, \mathbf{c}_n))\big) = \bar{g}\big(\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)\big), \tag{17} \]

where $\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)$ is the predicted posterior state-estimation error, which depends on the control parameters, and $\bar{g}(\cdot)$ defines the cost-to-go function of the transmitter. The conditional entropy was discussed as a possible information measure for the feedback; thus, a possible cost-to-go function $\bar{g}(\cdot)$ of the transmitter is the conditional entropy of the predicted posterior state-estimation error $\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)$, given as $\bar{g}\big(\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)\big) = h\big(\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)\big)$ [26], [45].
This entropy, conditioned on the control parameter vector $\mathbf{c}_n$, is directly coupled with the posterior covariance matrix of the Bayesian tracking filter, which is lower bounded by the inverse of the EFIM in (2). The entropy of the predicted posterior state-estimation error (when assuming a Gaussian approximation) is given as

\[ h\big(\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)\big) \propto \det\big(\tilde{\mathbf{C}}_{\mathbf{x}_{n+1}}(\mathbf{c}_n)\big), \tag{18} \]

where $\tilde{\mathbf{C}}_{\mathbf{x}_{n+1}}(\mathbf{c}_n)$ is the predicted covariance matrix, as described in Section II-F, of the state vector provided by the Bayesian state space filter (UKF), and $\mathcal{I}_{\mathbf{x}_{n+1}}(\mathbf{c}_n)$ is the corresponding Fisher information matrix; both depend on the control parameter vector $\mathbf{c}_n$. Thus, the entropy in (18) is directly coupled with the position-related information that is contained in the measurement noise covariance matrix $\mathbf{R}_n$ described by (11). How the introduced algorithm uses the state space and measurement model equations of the Bayesian state space estimator is described in more detail in Sections IV-B2 and IV-B3.

For readability of the following derivations of the control optimization algorithm, the cost-to-go of the CC (18) is rewritten as $\bar{g}\big(\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)\big) = h(\mathbf{x}_{n+1}, \mathbf{c}_n)$ with $\mathbf{c}_n \in \mathcal{A}$, where $\mathcal{A}$ is the space of cognitive actions with size $|\mathcal{A}|$, which represents the waveform library in our case. Consequently, the next set of waveform parameters has to be chosen in order to minimize the cost-to-go of the next posterior entropy. As elaborated in [46], dynamic programming represents an optimal solution for such problems, but unfortunately it is based on the assumption that the state to be controlled is "perfectly" perceivable. Hence, methods have been introduced that are capable of handling imperfect state information [47], with the drawback that they are computationally complex. In [6], [45], approximate dynamic programming was used for optimal control. There, the trace of the posterior covariance matrix was used as cost-to-go function to reduce the computational complexity.
The policy for control parameter selection in the transmitter at time instance $n$ seeks the set of waveform parameters for which the cost-to-go function $\bar{g}\big(\boldsymbol{\epsilon}_{n+1|n+1}(\mathbf{c}_n)\big) \approx \mathrm{tr}\big[\tilde{\mathbf{C}}_{\mathbf{x}_{n+1}}(\mathbf{c}_n)\big]$ is minimized for a rolling future horizon of $l_{\mathrm{future}}$ predicted states. In practice, it is difficult to construct all state transition probabilities from one state to another that are conditioned on the selected actions, including the cost incurred as a result of each transition. RL$^3$ [48] represents an approximation of dynamic programming [46], [47] for solving such optimal control and future planning tasks with high computational efficiency.

In the RL literature the cost-to-go function is termed the value-to-go function $J_n(\mathbf{c}_n)$, which is updated online for every PAC based on the immediate rewards $r_n$. The immediate reward $r_n$ is a measure of the "quality" of an action $\mathbf{c}_n$ taken on the environment. Using the Markovian assumption and following [8], it is given by

\[ r_n = g_n\big(h(\mathbf{x}_{n-1}, \mathbf{c}_{n-1}) - h(\mathbf{x}_n, \mathbf{c}_n)\big), \tag{19} \]

where $h(\mathbf{x}_n, \mathbf{c}_n) \propto \det\big(\mathbf{C}_{\mathbf{x}_n}(\mathbf{c}_n)\big)$ and $g_n(\cdot)$ is an arbitrary scalar operator that in its most general form could also depend on the time instance $n$ [8]. A reasonable function for the reward is the scaled change in the posterior entropy from one PAC to the next, i.e.

\[ r_n = \mathrm{sign}\big(\Delta h(\mathbf{x}_n, \mathbf{c}_n)\big) \log\big|\Delta h(\mathbf{x}_n, \mathbf{c}_n)\big|, \tag{20} \]

with $\Delta h(\mathbf{x}_n, \mathbf{c}_n) = h(\mathbf{x}_{n-1}, \mathbf{c}_{n-1}) - h(\mathbf{x}_n, \mathbf{c}_n)$ as in (19). A positive reward favors the current action $\mathbf{c}_n$ for the future action $\mathbf{c}_{n+1}$, and conversely a negative one leads to a penalty for this action. As described in [8], the cognitive RL algorithm has to find the optimal future action $\mathbf{c}_{n+1}$ for the next PAC based on the immediate reward $r_n$ and the learned value-to-go function $J_n(\mathbf{c}_n)$.
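The reward in (20) can be sketched as follows, taking the entropy change as the previous entropy minus the current one, in line with (19). This is a literal reading of (20) under two assumptions that are not stated in the text: a zero entropy change yields a zero reward, and the natural logarithm is used. Note that, read literally, the logarithm is negative for $|\Delta h| < 1$, so the sign of the reward then no longer matches the sign of the entropy change; the example values below are therefore chosen with $|\Delta h| > 1$. The function name `entropy_reward` is hypothetical.

```python
import math


def entropy_reward(h_prev, h_curr):
    """Immediate reward per (20), with delta_h = h_prev - h_curr as in (19)."""
    delta_h = h_prev - h_curr
    if delta_h == 0.0:
        # Assumption: no change in entropy gives no reward (log(0) is undefined).
        return 0.0
    sign = 1.0 if delta_h > 0.0 else -1.0
    return sign * math.log(abs(delta_h))


# Usage: the entropy measure dropped from 5 to 2, so the applied action is rewarded.
reward_after_improvement = entropy_reward(5.0, 2.0)
# The entropy measure rose from 2 to 5, so the applied action is penalized.
reward_after_degradation = entropy_reward(2.0, 5.0)
```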
For computing the expected costs of future actions as is done in dynamic programming, RL divides the computation of the value-to-go function into two parts: (i) the learning phase, which incorporates the actually measured reward into the value-to-go function based on the actions $\mathbf{c}_n$ and $\mathbf{c}_{n-1}$, and (ii) the planning phase, which incorporates predicted future rewards into the value-to-go function. Whereas for learning a "real" reward is perceived from the environment, for planning only model-based predicted rewards are perceived from the internal perceptor memory using the feedforward link. A faster convergence to the optimal control policy can be achieved in this way.

B. Learning and Planning: Algorithm

The value-to-go function that is used in the cognitive controller is defined as [8]

\[ J_n(\mathbf{c}) = \mathbb{E}_{\pi_n}\big[r_n + \gamma r_{n+1} + \gamma^2 r_{n+2} + \cdots \,\big|\, \mathbf{c}_n = \mathbf{c}\big], \tag{21} \]

where $r_n$ is the actual reward for action $\mathbf{c} \in \mathcal{C}$, $r_{n+l}$ are the predicted future rewards that are based on the GPEM and GSCM parameters used by the Bayesian filter, $0 < \gamma \leq 1$ is the discount factor for future rewards based on action $\mathbf{c}_n \in \mathcal{C}$, and the expected value is calculated using the cognitive policy

\[ \pi_n(\mathbf{c}, \mathbf{c}') = P[\mathbf{c}_{n+1} = \mathbf{c}' \,|\, \mathbf{c}_n = \mathbf{c}], \quad \mathbf{c}, \mathbf{c}' \in \mathcal{C}, \tag{22} \]

where $P[\cdot|\cdot]$ defines a conditional PMF that describes the transition probabilities of all actions $\mathbf{c} \in \mathcal{C}$ over time instances $n$. Following the derivations in [8], the value-to-go function can be reformulated in an incremental recursive manner, yielding

\[ J_n(\mathbf{c}) \leftarrow J_n(\mathbf{c}) + \alpha \Big[ R(\mathbf{c}) + \gamma \sum_{\mathbf{c}'} \pi_n(\mathbf{c}, \mathbf{c}') J_n(\mathbf{c}') - J_n(\mathbf{c}) \Big], \tag{23} \]

where $R(\mathbf{c}) = \mathbb{E}_{\pi_n}\{r_n | \mathbf{c}_n = \mathbf{c}\}$ $\forall \mathbf{c} \in \mathcal{C}$ denotes the expected immediate reward and $\alpha > 0$ is the learning rate. The algorithm for updating the value-to-go function can be found in the Appendix of [4].

$^3$ RL represents an intermediate learning procedure that lies between supervised and unsupervised learning, as stated in [45].
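The incremental update in (23) can be sketched for a small, discrete action set as shown below. This is not the algorithm from the Appendix of [4]; it is a minimal illustration that assumes a synchronous sweep (all actions are updated from the same old values) and stores the policy as nested dictionaries with `policy[c][c_next]` $= P[\mathbf{c}_{n+1} = \mathbf{c}' | \mathbf{c}_n = \mathbf{c}]$. The function name and the action labels are hypothetical.

```python
def update_value_to_go(J, R, policy, alpha, gamma):
    """One synchronous sweep of (23) over all actions.

    J      -- dict: action -> current value-to-go
    R      -- dict: action -> expected immediate reward
    policy -- dict of dicts: policy[c][c_next] is the transition probability
    alpha  -- learning rate (> 0)
    gamma  -- discount factor for future rewards
    """
    new_J = {}
    for c in J:
        # Expected value-to-go of the next action under the current policy.
        expected_next = sum(policy[c][c_next] * J[c_next] for c_next in J)
        new_J[c] = J[c] + alpha * (R[c] + gamma * expected_next - J[c])
    return new_J


# Usage: two waveform settings, uniform policy, first sweep from zero values.
actions = ["narrow_pulse", "wide_pulse"]
uniform = {c: {c_next: 0.5 for c_next in actions} for c in actions}
J0 = {c: 0.0 for c in actions}
rewards = {"narrow_pulse": 1.0, "wide_pulse": -1.0}
J1 = update_value_to_go(J0, rewards, uniform, alpha=0.5, gamma=0.9)
```

Setting $\gamma = 0$ reduces the update to $J \leftarrow (1 - \alpha) J + \alpha R$, the pure learning step that uses only the immediate reward.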
The incremental recursive update in (23) means that for all actions $\mathbf{c} \in \mathcal{C}$ the value-to-go function is updated using the expected immediate reward and the policy $\pi_n(\mathbf{c}, \mathbf{c}')$ for all these actions.

1) Learning from applied Actions: With the value of the immediate reward $r_n$, a new value is learned for the value-to-go function for the currently selected action $\mathbf{c}_n$ using the part $J_n(\mathbf{c}_n) \leftarrow (1 - \alpha) J_n(\mathbf{c}_n) + \alpha R(\mathbf{c}_n)$ of (23). This accounts for the "real" physical action on the environment. Since only one parameter set can be chosen as an action per PAC, it would take at least $|\mathcal{C}|\,T$ seconds to apply all actions on the environment and collect the corresponding immediate rewards, where $T$ is the time period of a PAC. Unfortunately, this results in a poor convergence rate of the algorithm and unacceptable behavior in time-variant environments. A possible remedy is the planning of future actions based on the state space and measurement model of the Bayesian state estimator.

2) Planning for Improving Convergence Behavior: Planning is defined as predicting expected future rewards using the state and measurement model of the Bayesian state space filter to improve the convergence rate of the RL algorithm. As depicted in Fig. 2, the feedforward link is used to connect the controller with the perceptor. The feedforward information is a hypothesized future action, which is selected for a future planning stage. Inspecting (23), one can observe that for every action $\mathbf{c} \in \mathcal{M}$, where $\mathcal{M} \subset \mathcal{C}$ is a subset of $\mathcal{C}$ depending on the actual policy $\pi_n$, the predicted posterior covariance matrices $\tilde{\mathbf{C}}_{n+l}(\mathbf{c})$ and the corresponding predicted future rewards $r_{n+l}$ are computed with decreasing discount factor $\gamma^l$ for predicted future rewards, for $l = 1, \dots, l_{\mathrm{future}}$, where $l_{\mathrm{future}}$ is the future horizon. The predicted covariance matrices $\tilde{\mathbf{C}}_{n+l}(\mathbf{c})$ for a specific future action $\mathbf{c}$ are computed using the state space (e.g. (9)) and measurement model (e.g.
(10)) of the Bayesian state space estimator and the according GPEM and GSCM parameters stored in the perceptors’ memory as shown in Fig. 2. After the planning process is finished, the value-to- go function is updated for all actions c ∈M. Finally, the actual PAC is closed by updating the policy to πn+1 using the value-to-go function Jn+1 and choosing the new action, i.e. the waveform parameters, for the next PAC according to this new policy. This means that the value-to-go function Jn(cn) and the policy πn are updated iteratively from one another from one PAC to the next PAC, with one important detail which is discussed below. a) Explore/Exploit trade-off:: Both the planning process and choosing new actions are based on the policy. In planning, the chosen action-subset M is defined by sampling from the policy πn and new actions are selected based on the updated policy πn+1. Hence, the policy is responsible for the explore/exploit trade-off in the action space. A widely used method for balancing the exploration of new actions and exploiting the already learned value-to-go function Jn(cn) is the ǫ-greedy strategy, meaning that with a small probability of ǫ a random action is selected, representing pure exploration, and with probability of 1 −ǫ the action is chosen according to the maximum of the value-to-go function, representing a pure exploitation. The random selection of a new action and the action in the subspace M can either be selected from a uniform distribution over the action space A or from the policy πn. The policy is computed using the Boltzmann distribution πn+l = πn+l−1(c) exp{∆Jn+l(c)/τ} P c′ πn+l−1(c′) exp{∆Jn+l(c′)/τ}, where τ defines the exploration degree and is referred to as the system temperature [49] and ∆Jn+l(c) = Jn+l(c) − Jn−1+l(c). The cognitive action is selected according to cn =  random action ∼πn+1 ∈C if ξ < ǫ arg maxc∈C Jn(c) otherwise , (24) where 0 ≤ξ ≤1 is a uniform random number drawn at each time step n. 
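The Boltzmann policy update and the $\epsilon$-greedy selection in (24) can be illustrated with the following Python sketch. It is a simplified illustration under assumptions and not the authors' implementation: the function names `boltzmann_update` and `select_action`, the representation of the policy as a single probability vector over the actions (as in the Boltzmann expression), and all numerical values are hypothetical. Subtracting the largest exponent before exponentiating is a standard numerical safeguard that leaves the normalized result unchanged.

```python
import math
import random


def boltzmann_update(prev_policy, delta_J, tau):
    """Reweight the previous policy by exp(delta_J(c) / tau) and normalize.

    prev_policy : probabilities of the actions under the previous policy
    delta_J     : change of the value-to-go function for each action
    tau         : system temperature (exploration degree), tau > 0
    """
    # Shifting by the maximum avoids overflow; the factor cancels
    # in the normalization.
    d_max = max(delta_J)
    weights = [p * math.exp((d - d_max) / tau)
               for p, d in zip(prev_policy, delta_J)]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(J, policy, epsilon, rng):
    """Epsilon-greedy action selection in the spirit of (24)."""
    xi = rng.random()  # uniform random number in [0, 1)
    if xi < epsilon:
        # Exploration: draw a random action index from the policy.
        return rng.choices(range(len(J)), weights=policy, k=1)[0]
    # Exploitation: action index with the largest value-to-go.
    return max(range(len(J)), key=lambda c: J[c])


# Hypothetical toy example with three actions.
rng = random.Random(0)
policy = boltzmann_update([1.0 / 3.0] * 3, [0.0, math.log(2.0), 0.0], tau=1.0)
greedy_action = select_action([0.1, 0.7, 0.3], policy, epsilon=0.0, rng=rng)
```

With $\epsilon = 0$ the selection is purely greedy, and with $\epsilon = 1$ every action is drawn from the policy; a larger temperature `tau` flattens the reweighting and therefore favors exploration.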
As stated above, the new action $c_{n+1}$ is selected according to (24) and applied to the environment so that the next PAC can start. The important concept of attention, at the perceptor as well as at the actuator side of the cognitive dynamic system, can be argued as follows:
• Perceptual attention: The environment-dependent parameters, i.e. the marginal PDFs of the VAs $p(\mathbf{a}_{k,n})$ and their multipath-channel-dependent reliability measures $\mathrm{SINR}_{k,n}$, are learned and updated online, so that the perceptual Bayesian state space filter puts its attention on the relevant position-related information in the received signal.
• Control attention: The policy $\pi_n$ that is learned over time and the corresponding subset of actions $\mathcal{M}$ put the focus on the “more relevant” actions. These actions in turn focus on the relevant position-related information in the received signal.

3) Waveform Library: In its general form, the waveform library contains the control parameters $c_n = \{T_{p,n}^{j}, f_{c,n}^{j}\}_{j=1}^{J}$, consisting of the pulse duration and the carrier frequency for the $j$-th anchor. Hence, the VA-specific MPC parameters are estimated using specific sub-bands of the radio channel spectrum defined by the parameter pair $T_{p,n}^{j}$ and $f_{c,n}^{j}$, which in turn is chosen in an “optimal” manner. Optimal in this case means that the position-related information contained in the MPC parameters is maximized at the agent position $\mathbf{p}_n$ (see (2)). Equations (3) and (2), which describe the parameters $\widehat{\mathrm{SINR}}_{k,n}^{j}$, show the relation between the pulse parameter pair $T_{p,n}^{j}$ and $f_{c,n}^{j}$ and the position-related information contained in the channel. The pulse duration $T_{p,n}^{j}$ scales the amount of DM and is inversely proportional to the effective root mean square bandwidth $\beta$. The relation to $f_{c,n}^{j}$ is less obvious, since it describes the frequency dependency of the environment parameters and thus of the GSCM parameters, such as the complex amplitudes of the MPCs and the DM PDP. The set of selected VAs should lead to the highest overall SINR values (and accordingly to the smallest range variances $\mathrm{var}\{\hat{d}_{k,n}^{j}\}$) and to the smallest possible GDOP$^{4}$, i.e. to a geometrically optimal constellation of VA positions, which is reflected by the ranging direction matrix $\mathbf{I}_r(\phi_{k,n}^{j})$. In a cognitive sense this means that the actions $a \in \mathcal{A}$ are chosen to reduce the posterior entropy over time under quasi-stationary environment conditions.

$^{4}$ The GDOP is the ratio between the position variance and the range variance [50]. For positioning, a small value indicates a high level of confidence that high precision can be reached. Hence, the GDOP indicates a “good” geometry for positioning, i.e. a good geometric placement of the anchors.

Fig. 3. Scenario for probabilistic MINT using cognitive sensing in presence of additional DM interference. The anchors are at the positions $\mathbf{a}_1^{(1)}$ and $\mathbf{a}_1^{(2)}$. The black line represents the agent trajectory and the red part of the line indicates the agent positions where the DM interference is activated. (Axes: x [m], y [m].)

V. RESULTS

A. Measurement Setup

For the evaluation of this positioning approach, we use the seminar room scenario of the MeasureMINT database [51]. The measurements allow for 5 trajectories consisting of 1000 agent positions with a 1 cm spacing, as shown in Fig. 3. At each position, UWB measurements of the channel between the agent and the two anchors at the positions $\mathbf{a}_1^{(1)} = [0.5, 7]^{\mathrm{T}}$ and $\mathbf{a}_1^{(2)} = [5.2, 3.2]^{\mathrm{T}}$ are available. The measurements have been performed using an M-sequence correlative channel sounder developed by Ilmsens. This sounder provides measurements over approximately the FCC frequency range, from 3 to 10 GHz. On the anchor and agent sides, dipole-like antennas made of Euro-cent coins have been used. They have an approximately uniform radiation pattern in the azimuth plane and zeros in the directions of floor and ceiling.

The chosen initial pulse duration is $T_p = 0.5$ ns (corresponding to a bandwidth of 2 GHz) and the center frequency is $f_c = 7$ GHz. The VAs for the anchors at the positions $\mathbf{a}_1^{(1)}$ and $\mathbf{a}_1^{(2)}$ were computed a priori up to order 2. The past window of agent positions for the SINR estimation is again chosen to be $w_{\mathrm{past}} = 40$. For all simulations, 30 Monte Carlo runs were conducted.

B. Initial Experiment Setup

For the sake of simplicity, we reduce the control parameters to just the carrier frequency, $c_n = f_{c,n}$, for each PAC and for all anchors, and we fix the pulse duration $T_p$. This means that the cognitive MINT system adaptively finds, from PAC to PAC, the carrier frequency $f_{c,n}$ that yields the highest reward from the environment by maximizing the position-related information. Starting from the initial value $f_{c,1} = 7$ GHz (which represents the center of the measured bandwidth), the carrier frequency is adapted over time using the posterior entropy in (18). The finite space of cognitive actions $\mathcal{C}$ contains the discrete frequency values bounded by the measured bandwidth, i.e. $f_{c,n,i} \in \mathcal{C}$, where $i = 1, \dots, |\mathcal{C}|$. The frequency bins are equidistant with spacing $\Delta f_c = f_{c,n,i+1} - f_{c,n,i}$. For the experiments, we have chosen $\Delta f_c = 50$ MHz, considering the large signal bandwidth of 2 GHz. The starting policy is defined as a uniform distribution $\pi_1(c', c) = \mathcal{U}(f_{c,n,1}, f_{c,n,|\mathcal{C}|})$ and the value-to-go function is initialized as $J_1(c) = 0$ $\forall c$. The size of the planning subspace is $|\mathcal{M}| = 20$; the size of $\mathcal{C}$ is $|\mathcal{C}| = 40$.

C. Discussion of Performance Results

1) Conventional MINT: Fig. 4 shows the overall position error CDF for “conventional” MINT (which assumes perfect floor plan knowledge) with and without cognitive waveform adaptation. To show the advantage of the cognitive MINT algorithm, a restricted set of VAs is chosen and the visibilities of the VAs are computed using the SINR instead of optical ray-tracing. As the CDF of “conventional” MINT indicates (blue line with circle markers), the tracking algorithm tends to diverge since too little position-related information is available. The black and the red lines show the overall position error CDF for cognitive MINT for a future horizon window of $l = 1$ and $l = 5$, respectively. As one can observe, the performance is significantly increased by the cognitive waveform adaptation. This means that the cognitive MINT algorithm is able to increase the amount of position-related information by changing the sensing spectrum via the carrier frequency $f_{c,n,i} \in \mathcal{A}$ to bands that carry more geometry-dependent information in the MPCs. Another interesting observation from Fig. 4 is that an increase of the planning horizon results in an increased performance, which is consistent with the intended functionality of the cognitive algorithm.

Fig. 4. Performance CDF of the cognitive MINT algorithm using a smaller restricted set of VA. Visibilities of VA are computed using the SINR instead of optical ray-tracing. (Curves: conventional MINT; cognitive MINT, win = 1; cognitive MINT, win = 5. Axes: P(p) [m], CDF.)

Fig. 5. Performance CDF of cognitive probabilistic MINT using a smaller restricted set of VA. For probabilistic MINT, the visibilities of VA are always computed using the SINR. (Curves: prob. MINT; cog. prob. MINT, win = 1; cog. prob. MINT, win = 5. Axes: P(p) [m], CDF.)

2) Probabilistic MINT: Fig. 5 shows the overall position error CDF for probabilistic MINT with and without cognitive waveform adaptation. Uncertainties in the floor plan and wrong associations can be handled robustly due to the probabilistic treatment of the VAs, and thus none of the individual trajectory runs diverges.
Because probabilistic MINT already achieves high accuracy and robustness, cognitive sensing leads to only a minor additional performance gain in this scenario. It is suspected that for lower bandwidths the performance gain induced by cognitive probabilistic MINT would be much more distinct.

3) Probabilistic MINT with additional DM Interference: In the last setup, we additionally added synthetic DM interference filtered at a carrier frequency of $f_c = 7$ GHz with a bandwidth of 2 GHz. The DM parameters are chosen according to [52], except for the DM power. The experiments were conducted with three levels of DM power, $\Omega_1 = 1.1615 \times 10^{-9}$, $\Omega_1 = 5.8076 \times 10^{-9}$ and $\Omega_1 = 1.1615 \times 10^{-8}$. Fig. 3 illustrates the scenario used for the experiment. The black line represents the agent trajectory and the red part of it indicates the agent positions where the DM interference is activated. Fig. 6 shows the signals exchanged between the agent and Anchors 1 and 2 for one sample position. The “clean” signals are shown in Fig. 6a, the noisy signal for a DM power of $\Omega_1 = 1.1615 \times 10^{-9}$ in Fig. 6b. Looking at Fig. 6b, it is quite obvious that this level of DM already represents a severe interference. The justification for using such an interference noise model lies in the fact that it can describe many kinds of measurement modeling mismatches, e.g. if the anisotropy of the antenna pattern for different angles of arrival is not considered.

Fig. 6. Signals exchanged between agent and Anchors 1 and 2 for an example agent position. The gray lines represent the estimated delays of the MPC. Fig. 6a shows the “clean” signal and Fig. 6b the noisy signal. (Panels: (a) clean signals; (b) noisy signal with DM power of $\Omega_1 = 1.1615 \times 10^{-8}$. Axes: path delay [m], $|r_n^{(1)}(t)|$ and $|r_n^{(2)}(t)|$.)

Fig. 7 illustrates the mean values of the cognitively adapted carrier frequency along one of the trajectories at DM power $\Omega_1 = 1.1615 \times 10^{-8}$. The mean is computed using the 30 Monte Carlo simulations of the experiment. The black line denotes the initial carrier frequency $f_{c,1}$ and the blue one the mean of the cognitively adapted carrier frequency $f_{c,n}$. The blue dashed lines show a few example realizations of cognitively adapted carrier frequencies along different trajectories and for different Monte Carlo runs. The figure shows quite clearly that the cognitive probabilistic MINT algorithm avoids carrier frequencies $f_{c,n}$ near the carrier of the DM at almost all agent positions where additional DM interference is present.

Fig. 7. Mean carrier frequency for DM power $\Omega_1 = 1.1615 \times 10^{-8}$. The black line denotes the initial carrier frequency $f_{c,1}$ and the blue one the mean of the cognitively adapted carrier frequency $f_{c,n}$. The blue dashed lines show a few example realizations of cognitively adapted carrier frequencies along different trajectories and for different Monte Carlo runs. (Legend: initial $f_{c,0}$; DM band at $f_{c,0}$; mean $f_{c,n}$; examples of $f_{c,n}$. Axes: time index n, f [GHz].)

Fig. 8. Mean entropy of probabilistic MINT and cognitive probabilistic MINT over time instances $n$ for DM power $\Omega_1 = 1.1615 \times 10^{-8}$. The red and black dashed lines show a few example entropy realizations along different trajectories and for different Monte Carlo runs. (Legend: mean prob. MINT; example runs prob. MINT; mean cog. prob. MINT; example runs cog. prob. MINT; DM disturbance. Axes: time index n, entropy.)

Fig. 8 shows the corresponding mean entropy values of probabilistic MINT (red line with diamond markers) and cognitive probabilistic MINT (black line with triangle markers) over time instances $n$ for DM power $\Omega_1 = 1.1615 \times 10^{-8}$. The red and black dashed lines show a few example entropy realizations along different trajectories and for different Monte Carlo runs. Before the noise disturbance starts, the entropy of the probabilistic MINT algorithm is almost the same as that of the cognitive probabilistic MINT algorithm. At the moment the disturbance is introduced, the entropy of the posterior increases. The cognitive probabilistic MINT algorithm then starts to change its carrier frequency $f_{c,n}$ (as shown in Fig. 7) until the entropy is reduced again. This leads to an almost constant or even decreasing entropy even in the presence of a very high noise level (black line with triangle markers in Fig. 8). In contrast, the probabilistic MINT algorithm without cognitive waveform adaptation starts to diverge after the disturbance is introduced and is not able to recover. This is indicated by the rapid increase of the entropy and its stagnation at a large value, shown in Fig. 8 by the red line with diamond markers. This result is confirmed by the performance CDF of the agent position error shown in Fig. 9. The comparison between probabilistic MINT and cognitive probabilistic MINT illustrates the powerful property of the cognitive algorithm to separate relevant from irrelevant information by adapting the control parameter $f_{c,n}$ to avoid the noisy frequency band of the signal. The probabilistic MINT algorithm without waveform adaptation tends to diverge under such harsh conditions, as depicted by the CDFs drawn with solid lines. In contrast to this,
the cognitive MINT algorithm overcomes these impairments, leading again to a robust behavior, as depicted by the CDFs drawn with dashed lines.

Fig. 9. Performance CDF of the cognitive probabilistic MINT algorithm when a disturbance at three different noise levels is introduced along a certain part of the trajectory. Noise 1 corresponds to DM with power $\Omega_1 = 1.1615 \times 10^{-9}$, Noise 2 to $\Omega_1 = 5.8076 \times 10^{-9}$ and Noise 3 to $\Omega_1 = 1.1615 \times 10^{-8}$. (Curves: prob. MINT and cognitive prob. MINT for noise 1, noise 2 and noise 3. Axes: P(p) [m], CDF.)

REFERENCES

[1] J. M. Fuster, Cortex and Mind - Unifying Cognition. Oxford University Press, 2003.
[2] P. Meissner, “Multipath-Assisted Indoor Positioning,” Ph.D. dissertation, Graz University of Technology, 2014.
[3] E. Leitinger, P. Meissner, C. Rudisser, G. Dumphart, and K. Witrisal, “Evaluation of Position-Related Information in Multipath Components for Indoor Positioning,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 11, pp. 2313–2328, Nov. 2015.
[4] E. Leitinger, “Cognitive Indoor Positioning and Tracking using Multipath Channel Information,” Ph.D. dissertation, Graz University of Technology, 2016.
[5] E. Leitinger, M. F., P. Meissner, K. Witrisal, and F. Hlawatsch, “Belief Propagation based Joint Probabilistic Data Association for Multipath-Assisted Indoor Navigation and Tracking,” in 2016 International Conference on Localization and GNSS (ICL-GNSS), June 2016.
[6] S. Haykin, Y. Xue, and P. Setoodeh, “Cognitive Radar: Step Toward Bridging the Gap Between Neuroscience and Engineering,” Proceedings of the IEEE, vol. 100, no. 11, pp. 3102–3130, Nov. 2012.
[7] S. Haykin, M. Fatemi, P. Setoodeh, and Y. Xue, “Cognitive Control,” Proceedings of the IEEE, vol. 100, no. 12, pp. 3156–3169, Dec. 2012.
[8] M. Fatemi and S. Haykin, “Cognitive Control: Theory and Application,” Access, IEEE, vol. 2, pp. 698–710, 2014.
[9] A. Amiri and S. Haykin, “Improved Sparse Coding Under the Influence of Perceptual Attention,” Neural Comput., vol. 26, no. 2, pp. 377–420, Feb. 2014.
[10] S. Haykin and J. Fuster, “On Cognitive Dynamic Systems: Cognitive Neuroscience and Engineering Learning From Each Other,” Proceedings of the IEEE, vol. 102, no. 4, pp. 608–628, April 2014.
[11] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1988.
[12] P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences. New York, NY, USA: Cambridge University Press, 2005.
[13] D. S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, ser. Oxford science publications. Oxford, New York: Oxford University Press, 2006.
[14] J. Borish, “Extension of the Image Model to arbitrary Polyhedra,” The Journal of the Acoustical Society of America, March 1984.
[15] J. Kunisch and J. Pamp, “An Ultra-Wideband space-variant Multipath Indoor Radio Channel Model,” in Ultra Wideband Systems and Technologies, 2003 IEEE Conference on, Nov. 2003, pp. 290–294.
[16] E. Leitinger, P. Meissner, M. Lafer, and K. Witrisal, “Simultaneous Localization and Mapping using Multipath Channel Information,” in 2015 IEEE International Conference on Communications Workshops (ICC), London, UK, June 2015, pp. 754–760.
[17] E. Leitinger, F. Meyer, F. Tufvesson, and K. Witrisal, “Factor graph based simultaneous localization and mapping using multipath channel information,” in Proc. IEEE ICCW-17, Paris, France, May 2017, pp. 652–658.
[18] E. Leitinger, F. Meyer, F. Hlawatsch, K. Witrisal, F. Tufvesson, and M. Z. Win, “A belief propagation algorithm for multipath-based SLAM,” IEEE Trans. Wireless Commun., vol. 18, no. 12, pp. 5613–5629, Dec. 2019.
[19] E. Leitinger, S. Grebien, and K. Witrisal, “Multipath-based SLAM exploiting AoA and amplitude information,” in Proc. IEEE ICCW-19, Shanghai, China, May 2019, pp. 1–7.
[20] D. Kershaw and R. Evans, “Optimal waveform selection for tracking systems,” Information Theory, IEEE Transactions on, vol. 40, no. 5, pp. 1536–1550, Sep. 1994.
[21] S. Haykin, A. Zia, Y. Xue, and I. Arasaratnam, “Control Theoretic Approach to Tracking Radar: First step towards cognition,” Digital Signal Processing, vol. 21, no. 5, pp. 576–585, 2011.
[22] K. Bell, C. Baker, G. Smith, J. Johnson, and M. Rangaswamy, “Cognitive radar framework for target detection and tracking,” Selected Topics in Signal Processing, IEEE Journal of, vol. 9, no. 8, pp. 1427–1439, Dec. 2015.
[23] G. Hoffmann and C. Tomlin, “Mobile Sensor Network Control Using Mutual Information Methods and Particle Filters,” Automatic Control, IEEE Transactions on, vol. 55, no. 1, pp. 32–47, Jan. 2010.
[24] B. J. Julian, M. Angermann, M. Schwager, and D. Rus, “Distributed Robotic Sensor Networks: An Information-theoretic Approach,” Int. J. Rob. Res., vol. 31, no. 10, pp. 1134–1154, Sep. 2012. [Online]. Available: http://dx.doi.org/10.1177/0278364912452675
[25] K. Chaloner and I. Verdinelli, “Bayesian Experimental Design: A Review,” Statistical Science, vol. 10, no. 3, pp. 273–304, 1995. [Online]. Available: http://www.jstor.org/stable/2246015
[26] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
[27] H. L. Van Trees, Detection, Estimation and Modulation, Part I. Wiley Press, 1968.
[28] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall Signal Processing Series, 1993.
[29] P. Meissner, E. Leitinger, and K. Witrisal, “UWB for robust indoor tracking: Weighting of multipath components for efficient estimation,” Wireless Communications Letters, IEEE, vol. 3, no. 5, pp. 501–504, Oct. 2014.
[30] N. Michelusi, U. Mitra, A. Molisch, and M. Zorzi, “UWB Sparse/Diffuse Channels, Part I: Channel Models and Bayesian Estimators,” Signal Processing, IEEE Transactions on, vol. 60, no. 10, pp. 5307–5319, 2012.
[31] A. Molisch, “Ultra-Wide-Band Propagation Channels,” Proceedings of the IEEE, vol. 97, no. 2, pp. 353–371, Feb. 2009.
[32] S. Grebien, E. Leitinger, K. Witrisal, and B. H. Fleury, “Super-resolution channel estimation including the dense multipath component: A sparse variational Bayesian approach,” 2021, in preparation.
[33] K. Witrisal, P. Meissner, E. Leitinger, Y. Shen, C. Gustafson, F. Tufvesson, K. Haneda, D. Dardari, A. F. Molisch, A. Conti, and M. Z. Win, “High-Accuracy Localization for Assisted Living: 5G systems will turn multipath channels from foe to friend,” IEEE Signal Processing Magazine, vol. 33, no. 2, pp. 59–70, March 2016.
[34] Y. Shen and M. Win, “Fundamental Limits of Wideband Localization; Part I: A General Framework,” Information Theory, IEEE Transactions on, vol. 56, no. 10, pp. 4956–4980, Oct. 2010.
[35] Y. Shen, H. Wymeersch, and M. Win, “Fundamental Limits of Wideband Localization; Part II: Cooperative Networks,” Information Theory, IEEE Transactions on, vol. 56, no. 10, pp. 4981–5000, Oct. 2010.
[36] F. Meyer, T. Kropfreiter, J. L. Williams, R. Lau, F. Hlawatsch, P. Braca, and M. Z. Win, “Message passing algorithms for scalable multitarget tracking,” Proc. IEEE, vol. 106, no. 2, pp. 221–259, Feb. 2018.
[37] F. Meyer, P. Braca, P. Willett, and F. Hlawatsch, “Scalable Multitarget Tracking using Multiple Sensors: A belief propagation approach,” in Information Fusion (Fusion), 2015 18th International Conference on, July 2015, pp. 1778–1785.
[38] Y. Bar-Shalom and X.-R. Li, Multitarget-Multisensor Tracking: Principles and Techniques. Storrs, CT: Yaakov Bar-Shalom, 1995.
[39] J. Vermaak, S. J. Godsill, and P. Perez, “Monte Carlo filtering for multi target tracking and data association,” vol. 41, no. 1, pp. 309–332, Jan. 2005.
[40] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Prentice Hall Signal Processing Series, 1998.
[41] H. Wymeersch, “The Impact of Cooperative Localization on Achieving higher-level Goals,” in Communications Workshops (ICC), 2013 IEEE International Conference on, June 2013, pp. 1–5.
[42] F. Meyer, H. Wymeersch, M. Frohle, and F. Hlawatsch, “Distributed Estimation With Information-Seeking Control in Agent Networks,” Selected Areas in Communications, IEEE Journal on, vol. 33, no. 11, pp. 2439–2456, Nov. 2015.
[43] B. Grocholsky, “Information-Theoretic Control of Multiple Sensor Platforms,” Ph.D. dissertation, Department of Aerospace, Mechatronic and Mechanical Engineering, 2002.
[44] C. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, The, vol. 27, no. 4, pp. 623–656, Oct. 1948.
[45] S. Haykin, Cognitive Dynamic Systems: Perception-action Cycle, Radar and Radio. New York, NY, USA: Cambridge University Press, 2012.
[46] R. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1957.
[47] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 2000.
[48] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[49] A. Lazaric, M. Restelli, and A. Bonarini, “Reinforcement learning in continuous action spaces through sequential monte carlo methods,” in Advances in Neural Information Processing Systems, 2007.
[50] Z. Sahinoglu, S. Gezici, and I. Guvenc, Ultra-wideband Positioning Systems – Theoretical Limits, Ranging Algorithms and Protocols. Cambridge University Press, 2008.
[51] P. Meissner, E. Leitinger, M. Lafer, and K. Witrisal, “MeasureMINT UWB database,” 2013, publicly available database of UWB indoor channel measurements. [Online]. Available: www.spsc.tugraz.at/tools/UWBmeasurements
[52] J. Karedal, S. Wyne, P. Almers, F. Tufvesson, and A. Molisch, “A Measurement-Based Statistical Model for Industrial Ultra-Wideband Channels,” Wireless Communications, IEEE Transactions on, vol. 6, no. 8, pp. 3028–3037, Aug. 2007.