Consistent Map-based 3D Localization on Mobile Devices

Ryan C. DuToit, Joel A. Hesch, Esha D. Nerurkar, and Stergios I. Roumeliotis

Abstract: The objective of this paper is to provide consistent, real-time 3D localization capabilities to mobile devices navigating within previously mapped areas. To this end, we introduce the Cholesky-Schmidt-Kalman filter (C-SKF), which explicitly considers the uncertainty of the prior map by employing the sparse Cholesky factor of the map's Hessian, instead of its dense covariance, as is the case for the Schmidt-Kalman filter (SKF). By doing so, the C-SKF has memory requirements typically linear in the size of the map, as opposed to quadratic for storing the map's covariance. Moreover, in order to bound the processing needs of the C-SKF (between linear and quadratic in the size of the map), we introduce a relaxation of the C-SKF algorithm, the sC-SKF, which operates on the Cholesky factors of independent sub-maps resulting from dividing the trajectory and observations used for constructing the map into overlapping segments. Lastly, we assess the processing and memory requirements of the proposed C-SKF and sC-SKF algorithms, and compare their positioning accuracy against other approximate map-based localization approaches that employ measurement-noise-covariance inflation to compensate for the map's uncertainty.

I. INTRODUCTION

In many applications (e.g., surveillance, manufacturing, virtual and augmented reality), robots or people need to accurately localize within a frequently-visited indoor space. In such cases, the accuracy and efficiency of localization can be significantly improved by using a map of the area of operation. In the context of 3D visual-inertial localization, maps computed beforehand^1 have been employed by localization algorithms,^2 such as the multi-state constraint Kalman filter (MSCKF) in [15] and [18], and parallel tracking and mapping (PTAM) in [16] and [25], to improve positioning accuracy based on visual observations of mapped features. These methods achieve real-time performance but are not consistent (in this work, we refer to a consistent estimator as one having zero-mean error with covariance equal to or larger than the estimation error's true covariance). Specifically, [16] and [25] are inconsistent not only because of the assumption of a perfect map, but also due to the approximations invoked by the optimization algorithm used for localization; thus, they cannot provide a reliable measure of their positioning uncertainty. On the other hand, [18], which also assumes that the map is perfect, and [15], which ignores the correlations between the estimated state and the map, inflate the camera measurement's noise covariance so as to reduce the effect of inconsistency (i.e., overly confident and often unreliable estimates). Nevertheless, inflating the measurement noise does not alleviate the inconsistency that arises from ignoring cross-correlations between the estimated state and the map.

^1 Besides batch least squares (BLS), pose-graphs [1] and PTAM [12] have also been used for reducing the processing cost of map building. Since such approximations yield inconsistent maps, we do not consider them further in the context of this work.
^2 Since we are interested in continuous localization, we do not consider vision-only methods that provide pose estimates only intermittently (e.g., [3]).
An alternative, approximate method, which explicitly accounts for the map's uncertainty and its correlations with the estimated state, is the Schmidt-Kalman filter (SKF) [22, 23]. The SKF has processing requirements linear in the map's size, as it only needs to update the device's state, covariance, and cross-correlation with the map. Although the map's state and covariance need not be updated, storing its covariance has cost quadratic in the number of features. This has been the main drawback of the SKF, as well as of its variants applied to simultaneous localization and mapping (e.g., [6, 11]), and has restricted its use to small-size areas.

To overcome this limitation, in this work we introduce the Cholesky (C)-SKF, which typically has memory requirements linear in the map's size, while providing the same consistency properties as the SKF. The key insight behind our approach is that most current methods employed for constructing large-scale maps, such as batch least squares (BLS), compute and use the Cholesky factor of the problem's Hessian, which is sparse [with a number of non-zero elements (nnz) typically linear in the size of the map^3], instead of the dense covariance matrix [24]. Additionally, in order to reduce the processing requirements of the C-SKF (between linear and quadratic in the map's size), we introduce a consistent relaxation of the C-SKF, the sub-map (s)C-SKF, which trades localization accuracy for processing speed by operating on the Cholesky factors of the partitioned Hessians resulting from dividing the original map into independent sub-maps. Note that the sub-maps used throughout this work are generated using the method of [8]; however, other methods that produce sub-maps (e.g., [5]) could be employed as well. This approximation allows mapping larger areas and/or operating on resource-constrained mobile devices, such as cell phones and tablets.

^3 In extreme cases, such as when the map is constructed using images taken while hovering over the same scene, the memory requirements may become quadratic. In such cases, alternative approaches, such as PTAM, should be employed instead for localization.

In summary, the main contributions of this paper are:
• We introduce the Cholesky-Schmidt-Kalman filter (C-SKF), which employs the Hessian's Cholesky factor to compactly represent the map's uncertainty and to efficiently compute consistent map-based updates.
• We introduce the sub-map (s)C-SKF, a relaxation of the C-SKF, which employs multiple, independent sub-maps of the area of interest to support real-time, consistent map-based localization on mobile devices.
• We validate the accuracy and consistency of the C-SKF and sC-SKF using visual and inertial measurements from mobile devices against VICON ground truth.

In what follows, we provide an overview of our map-based localization system (Sect. II) and then present the system state and measurement models (Sect. III). In Sect. IV, we describe the limitations of the SKF when applied to map-based localization, and then introduce the C-SKF and the sC-SKF. Our method for generating reliable 2D-3D correspondences is presented in Sect. V. Lastly, we experimentally validate the proposed algorithms in Sect. VI and provide concluding remarks with a discussion on future work in Sect. VII.
II. MAP-BASED LOCALIZATION ALGORITHM OVERVIEW

Our objective is to design a consistent estimator for computing, in real time, the 3D position and attitude (pose) of a mobile device using inertial and visual measurements, as well as a prior map of the area of operation. To do so, we require the following information from the (offline) mapping process:
• The Cholesky factor of the (sub)map's Hessian.^4
• The 3D position estimates of the mapped features, along with their descriptors (e.g., FREAKs [2] or ORBs [21]) and the vectors of their corresponding images as indexed by a vocabulary tree (VT) [20].
The former is necessary for representing the map's uncertainty, while the latter is used for recognizing mapped features and using their position estimates for updating the device's pose.

^4 In this work, we employ the cooperative mapping (CM) algorithm of [8] since, in addition to computing the Cholesky factor of the map's Hessian, it also provides a convenient mechanism for handling sub-maps and computing their corresponding Cholesky factors needed for the sC-SKF.

Given this information, Fig. 1 provides an overview of the (online) image-processing and estimation components of our (s)C-SKF map-based localization algorithms. In particular, at the core of our estimator is the MSCKF [7, 18], which processes inertial measurements for propagating the device's state and covariance estimates. Intermittently, and when feature tracks (e.g., Harris corners [9] tracked by KLT [14]) become available, the proposed (s)C-SKF adaptation of the MSCKF consistently updates only the device's state-covariance and correlations with the map, but not the map itself (Sect. IV-C). Similarly, every time the 2D-to-3D feature-matching pipeline (Sect. V) finds correspondences between the, e.g., FREAK features extracted in the current image and those found in the map, the (s)C-SKF uses this information to, again, update all estimated quantities except the map and its covariance (Sect. IV-D – IV-F).

Fig. 1. Cholesky-Schmidt-Kalman filter (sub)map-based localization algorithm overview (pipeline blocks: A. Harris + KLT local-feature tracking, B. 2D-3D correspondence pipeline, C. propagation, D. local-feature update, E. map-based update; the estimator maintains the device estimate x̂_R, its covariance P_RR, and the cross-correlation factor Γ, given the map estimate x̂_M and its Cholesky factor).

III. SYSTEM STATE AND MEASUREMENT MODELS

A. Device State

At time-step k, the estimated state is^5

x_k = [x_E^T  x_{C_{k-1}}^T  ...  x_{C_{k-N}}^T  x_\tau^T]^T    (1)

where x_E is the evolving state of the device:

x_E = [{}^{I_k}q_G^T  b_{g_k}^T  {}^G v_{I_k}^T  b_{a_k}^T  {}^G p_{I_k}^T]^T    (2)

{}^{I_k}q_G is the quaternion representation of the global frame's, {G}, orientation in the IMU's current frame, {I_k}, b_{a_k} and b_{g_k} are the accelerometer and gyroscope biases, respectively, and {}^G v_{I_k} and {}^G p_{I_k} are the velocity and position of {I_k} in {G}. In (1), x_{C_i} = [{}^{I_i}q_G^T  {}^G p_{I_i}^T  t_{s_i}]^T, i = k−N, ..., k−1, corresponds to previous IMU poses. Following [18], we maintain a sliding window of N such poses so as to process measurements of non-mapped (or local) features without incorporating them into the state vector. Finally, our problem formulation requires estimating the 4 degree-of-freedom (d.o.f.) transformation between the device's global frame, {G}, and one or more map frames of reference, {M_i}:
x_\tau = [x_{\tau_1}^T  x_{\tau_2}^T  ...  x_{\tau_L}^T]^T,    x_{\tau_i} = [{}^{M_i}\phi_G  {}^G p_{M_i}^T]^T    (3)

where {}^G p_{M_i} is the position of {M_i} in {G}, and {}^{M_i}\phi_G is the rotation about gravity between the two frames. Note that since the roll and pitch angles are observable for any vision-aided inertial navigation system (VINS) [10], we can choose the z-axis to align with gravity in both the global and map coordinate frames.

^5 To simplify the subsequent derivations, we assume the camera and IMU are time-synchronized and co-located. In practice, we estimate their extrinsic calibration parameters and time offset following the approaches of [17] and [7], respectively.

B. IMU measurement model

Following [18], we propagate the state estimate of the device [see (1)] by integrating the IMU's rotational velocity and linear acceleration measurements, u_k:

x_{k+1} = g(x_k, u_k) + w_k    (4)

where g is a nonlinear function corresponding to the IMU measurement model and w_k is zero-mean, Gaussian noise of known covariance [4].

C. Local-feature measurement model

As the device traverses its environment, it observes and tracks (via KLT [14]) point features (Harris corners [9]) that have not been mapped. These local features are used, as in [18], to provide measurement constraints between the N+1 IMU-camera poses maintained in the state vector. The nonlinear and linearized local-feature measurement models are

z = h(x, p_f) + n,    r = H_R x̃_R + H_f p̃_f + n    (5)

where z is the measurement, h is the perspective-projection camera measurement model, r is the linearized measurement residual, x_R (x̃_R) is the device state (error state), p_f (p̃_f) is the local-feature state (error state), H_R and H_f are the measurement Jacobians corresponding to the device and feature states, respectively, and n is zero-mean, Gaussian noise with known covariance, R. As in [18], we marginalize the local feature by projecting r on the left null space, U, of H_f:

r_o = H_R^o x̃_R + n_o    (6)

where r_o = U^T r, H_R^o = U^T H_R, and n_o = U^T n. The new measurement residual, r_o, and Jacobian, H_R^o, are then used (Sect. IV-C2) for updating the device state estimate.

D. Mapped-feature measurement model

When a previously-mapped point feature, f (expressed with respect to the IMU-camera frame, {I_\lambda^{M_i}}, that first observed it during mapping), is detected by the device, we can form the following geometric relationship (see Fig. 2):

{}^{I_k}p_f = {}^{I_k}_G C ({}^G p_{M_i} − {}^G p_{I_k} + {}^G_{M_i} C [{}^{M_i}p_{I_\lambda^{M_i}} + {}^{M_i}_{I_\lambda^{M_i}} C  {}^{I_\lambda^{M_i}}p_f])    (7)

where all rotation matrices, C, are parameterized by their corresponding 3 d.o.f. quaternions, except {}^G_{M_i}C, which corresponds to a rotation about gravity by an angle {}^{M_i}\phi_G. Applying the camera perspective-projection transformation, π, leads to the following measurement model:

z_map = π({}^{I_k}p_f) + n = h_map(x_k, {}^{I_\lambda^{M_i}}p_f) + n    (8)

where n is zero-mean, Gaussian noise with covariance R.

Fig. 2. The geometric relationship of a mapped-feature observation. The dotted line is a bearing measurement, solid lines correspond to quantities estimated online, and dashed lines correspond to variables determined offline during mapping.

To simplify the notation, from here on we omit time indices and denote the state x_k in (1) by x_R, while we use x_M to represent the vector comprising all mapped features and IMU-camera poses.
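Before linearizing (8), the following Python sketch evaluates (7)-(8) for a single feature, given rotation matrices and positions. It is a minimal, self-contained illustration under assumed conventions (in particular, that {}^G_{M_i}C is a z-axis rotation by the yaw offset {}^{M_i}\phi_G and that the noise-free normalized projection is returned); it is not the authors' implementation, and all names are placeholders.

```python
import numpy as np

def rot_z(phi):
    """Rotation about gravity (the z-axis) by angle phi, modeling {}^G_{M_i}C."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def h_map(C_Ik_G, p_G_Ik, phi_Mi_G, p_G_Mi, C_Mi_Ilam, p_Mi_Ilam, p_Ilam_f):
    """Mapped-feature measurement model, (7)-(8), without noise.

    C_Ik_G     : {}^{I_k}_G C, rotation from {G} to the current IMU frame {I_k}
    p_G_Ik     : {}^G p_{I_k}, device position in {G}
    phi_Mi_G   : {}^{M_i}phi_G, 1-d.o.f. rotation about gravity between {G} and {M_i}
    p_G_Mi     : {}^G p_{M_i}, map-frame origin expressed in {G}
    C_Mi_Ilam  : {}^{M_i}_{I_lambda} C, rotation of the mapping pose in {M_i}
    p_Mi_Ilam  : {}^{M_i} p_{I_lambda}, mapping-pose position in {M_i}
    p_Ilam_f   : {}^{I_lambda} p_f, feature position in the frame that first observed it
    """
    p_Mi_f = p_Mi_Ilam + C_Mi_Ilam @ p_Ilam_f                        # feature in {M_i}
    p_Ik_f = C_Ik_G @ (p_G_Mi - p_G_Ik + rot_z(phi_Mi_G) @ p_Mi_f)   # (7)
    return p_Ik_f[:2] / p_Ik_f[2]                                    # projection pi(.), (8)
```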
Linearizing (8) yields:

r = H_R x̃_R + H_M x̃_M + n    (9)

where H_R and H_M are the device and map Jacobians, respectively. Note that both H_R and H_M are sparse, as the measurement equation involves only the position and quaternion of the current and mapping IMU-camera pairs, the 4 d.o.f. map-to-global transformation, and the position of the feature. With our system state and measurement models defined, in what follows we present how the measurements (4), (6), and (8) are processed in a consistent manner.

IV. ALGORITHM DESCRIPTION

In what follows, we first review the SKF, and then introduce the C-SKF and sC-SKF. To simplify notation, we denote updated values with + and propagated values with −, rather than using time subscripts [i.e., P_{k+1|k+1} = f(P_{k+1|k}) is expressed as P^+ = f(P), and P_{k+1|k} = g(P_{k|k}) as P^- = g(P)].

A. Background: Schmidt-Kalman filter

Consider the current propagated system covariance, P, and Jacobian, H, to be:

P = [P_RR  P_RM;  P_MR  P_MM],    H = [H_R  H_M]    (10)

The state and covariance update equations for the extended Kalman filter (EKF) are:

x̂^+ = x̂ + K r    (11)
P^+ = (I − K H) P (I − K H)^T + K R K^T    (12)

where

r = z − h(x̂),    S = H P H^T + R,
K = P H^T S^{-1} = [P_RR H_R^T + P_RM H_M^T;  P_MR H_R^T + P_MM H_M^T] S^{-1} = [K̄_R;  K̄_M] S^{-1} = K̄ S^{-1}    (13)

As shown in [23], the SKF updates only part of the state by zeroing the Kalman gain associated with the state elements whose estimates are to remain the same (in our case, the map, x_M):

K_SKF = [(K̄_R S^{-1})^T  0]^T    (14)

Substituting (14) in (11) yields the following state update:

x̂_R^+ = x̂_R + K̄_R S^{-1} r,    x̂_M^+ = x̂_M    (15)

Additionally, by employing (14) in (12), we arrive at the SKF's covariance update:

P_SKF^+ = P − [K̄_R S^{-1} K̄_R^T  K̄_R S^{-1} K̄_M^T;  K̄_M S^{-1} K̄_R^T  0]    (16)

Note that if we express the updated covariance of the EKF as a function of the SKF updated covariance, it is straightforward to show that

P_EKF^+ = P_SKF^+ − [0; K̄_M] S^{-1} [0  K̄_M^T]    (17)
⇒ P_EKF^+ ⪯ P_SKF^+    (18)

since S^{-1} is positive definite, and hence [0; K̄_M] S^{-1} [0  K̄_M^T] is positive semi-definite. Thus, we have shown that the SKF is consistent, as it does not underestimate the system covariance.

Furthermore, as evident from (16), the cost of an SKF covariance update is linear in the size of the map, as only the device's covariance and cross-correlation terms need to be updated, while H in (10) is sparse. On the other hand, the SKF requires storing the covariance of the map, P_MM, which is dense. To better appreciate the challenge this poses on mobile devices, we note that the covariance of a map generated from 1.5 min of visual and inertial data requires 1.2 GB of storage, while the sparse Cholesky factor of the corresponding Hessian requires only 76 MB. This motivates us to introduce the Cholesky-SKF (C-SKF) in the next sections.
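For reference, a minimal numpy sketch of the SKF update (13)-(16) follows. The gain block for the map is zeroed, so only the device state, its covariance, and the device-map cross-correlation change, while x_M and P_MM stay fixed. This is a schematic, dense-matrix example with assumed variable names, not the paper's implementation.

```python
import numpy as np

def skf_update(x_R, P_RR, P_RM, P_MM, H_R, H_M, r, R):
    """Schmidt-Kalman update, (13)-(16); the map state and P_MM are left untouched."""
    K_R = P_RR @ H_R.T + P_RM @ H_M.T            # bar{K}_R
    K_M = P_RM.T @ H_R.T + P_MM @ H_M.T          # bar{K}_M (needed only inside S)
    S = H_R @ K_R + H_M @ K_M + R                # residual covariance, (13)
    S_inv = np.linalg.inv(S)
    x_R = x_R + K_R @ S_inv @ r                  # (15)
    P_RR = P_RR - K_R @ S_inv @ K_R.T            # (16), device block
    P_RM = P_RM - K_R @ S_inv @ K_M.T            # (16), cross-correlation block
    return x_R, P_RR, P_RM, P_MM                 # P_MM unchanged: map not updated
```

Note that K̄_M still appears in S and in the cross-correlation update, which is why the SKF must store the dense P_MM; the C-SKF introduced next removes exactly this requirement.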
B. C-SKF device-map initialization

In what follows, we start by describing the process for estimating the 4 d.o.f. transformation between the map's frame and the device's global reference frame, as well as its covariance and correlations with all estimated quantities.

Specifically, consider the first time the mobile device observes two or more previously-mapped features. An initial estimate, x̂_τ, of the unknown 4 d.o.f. transformation is obtained by employing the 2+1 pt RANSAC [13]. Furthermore, we partition the current state as

x = [x_{R'}^T  x_τ^T  x_M^T]^T    (19)

where x_{R'} comprises the remaining elements of the device's state vector [see (1)], and the corresponding covariance as

P = [P_{R'R'}  0  0;  0  P_ττ  0;  0  0  (G G^T)^{-1}],    P_ττ = lim_{μ→∞} μ I_4    (20)

where P_ττ is the covariance of the unknown 4 d.o.f. transformation, G G^T is the Hessian of the map, and G is its Cholesky factor. After linearizing the measurement model in (8) and denoting the corresponding Jacobian as

H = [H_{R'}  H_τ  H_M]    (21)

we employ (13)-(15) to update the estimates of x_{R'} and x_τ:

x̂_{R'}^+ = x̂_{R'} + P_{R'R'} H_{R'}^T S^{-1} r    (22)
x̂_τ^+ = x̂_τ + (H_τ^T A^{-1} H_τ)^{-1} H_τ^T A^{-1} r    (23)
x̂_M^+ = x̂_M    (24)

Additionally, employing (16), it can be shown that the updated SKF covariance is

P = [P_{R'R'}^+  P_{R'τ}^+  P_{R'M}^+;  P_{R'τ}^{+T}  P_ττ^+  P_{τM}^+;  P_{R'M}^{+T}  P_{τM}^{+T}  (G G^T)^{-1}] = [P_RR^+  P_RM^+;  P_MR^+  (G G^T)^{-1}]    (25)

where

P_{R'R'}^+ = P_{R'R'} − P_{R'R'} H_{R'}^T S^{-1} H_{R'} P_{R'R'}
P_{R'τ}^+ = −P_{R'R'} H_{R'}^T A^{-1} H_τ (H_τ^T A^{-1} H_τ)^{-1}
P_ττ^+ = (H_τ^T A^{-1} H_τ)^{-1}
P_RM^+ = [P_{R'M}^+;  P_{τM}^+] = −[P_{R'R'} H_{R'}^T S^{-1} J;  (H_τ^T A^{-1} H_τ)^{-1} H_τ^T A^{-1} J] G^{-1} = Γ G^{-1}    (26)

with

S^{-1} = A^{-1} − A^{-1} H_τ (H_τ^T A^{-1} H_τ)^{-1} H_τ^T A^{-1},    A = H_{R'} P_{R'R'} H_{R'}^T + J J^T + R

Finally, J is defined by:

G J^T = H_M^T    (27)

A key element of our approach is that, from this point on, instead of updating the cross-correlation term, P_RM, we will represent it in the factorized form of (26) and apply updates to its factor, Γ, as shown in Sect. IV-C and Sect. IV-D. Following this convention, we have:

P = [P_RR  Γ G^{-1};  (Γ G^{-1})^T  (G G^T)^{-1}]    (28)

Note also that we do not need to compute the inverse of the Cholesky factor, G. Instead, all update equations involving G are of the same form as (27), where a back-solve involving the sparse G is required for efficiently computing J^T.

C. C-SKF propagation and local-feature update

1) IMU-based propagation: By employing the IMU measurement model in (4), the device's state is propagated while, as is the case for the SKF, the map's state remains the same:

x̂_R^- = f(x̂_R, u),    x̂_M^- = x̂_M    (29)

On the other hand, in order to propagate the covariance of the C-SKF [see (28)], we employ the EKF covariance propagation equation:

P^- = Φ_C P Φ_C^T + Q_C = [Φ P_RR Φ^T + Q  Φ Γ G^{-1};  (Φ Γ G^{-1})^T  (G G^T)^{-1}]    (30)

with

Φ_C = [Φ  0;  0  I],    Q_C = [Q  0;  0  0]    (31)

where Φ and Q are the IMU state-transition Jacobian and the corresponding noise covariance, respectively. As evident from (30), the device's propagated covariance is computed at low cost, while the cross-correlation terms only require modifying the cross-correlation factor, Γ (i.e., Γ^- = Φ Γ), at cost linear in the size of the map.
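As a concrete illustration of (29)-(31), the sketch below propagates only the device covariance and the cross-correlation factor; the map block (G G^T)^{-1} is never formed or touched. The variable names are assumptions of this example.

```python
import numpy as np

def cskf_propagate(P_RR, Gamma, Phi, Q):
    """C-SKF covariance propagation, (30)-(31)."""
    P_RR_prop = Phi @ P_RR @ Phi.T + Q   # device covariance block
    Gamma_prop = Phi @ Gamma             # cross-correlation factor, Gamma^- = Phi Gamma
    return P_RR_prop, Gamma_prop
```

Since Γ has as many columns as the map has states, the product Φ Γ is the only operation whose cost grows (linearly) with the map size.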
2) C-SKF local-feature measurement update: When a local-feature-track measurement becomes available [see (6)], the corresponding Jacobian is

H = [H_R^o  0]    (32)

As in (14), we zero out the Kalman gain associated with the mapped states, leading to an update of the device state, while the map remains the same:

x̂_R^+ = x̂_R + P_RR H_R^{oT} S^{-1} r,    x̂_M^+ = x̂_M    (33)

For the covariance update, we first denote K̄ = P H^T:

[K̄_R;  K̄_M] = [P_RR  Γ G^{-1};  (Γ G^{-1})^T  (G G^T)^{-1}] [H_R^{oT};  0] = [P_RR H_R^{oT};  G^{-T} Γ^T H_R^{oT}]    (34)

Next, we employ (34) and (28) in (16), without evaluating K̄_M, yielding:

P_RR^+ = P_RR − P_RR H_R^{oT} S^{-1} H_R^o P_RR,    P_MM^+ = P_MM    (35)

and

P_RM^+ = Γ G^{-1} − K̄_R S^{-1} K̄_M^T = (I − P_RR H_R^{oT} S^{-1} H_R^o) Γ G^{-1} = Γ^+ G^{-1}    (36)

As evident from (33)-(36), the local-feature update has complexity linear in the size of the map, as we only need to apply a standard EKF update to the device's covariance and to update the cross-correlation factor, Γ.

D. C-SKF map-based updates

When the device observes previously-mapped features (Sect. V), we employ the methodology of Sect. IV-A [(13)-(16)] using the measurement model of (8)-(9), and operate on the system covariance with the factorized cross-correlation defined in (28). Specifically, we denote K̄ = P H^T as:

[K̄_R;  K̄_M] = [P_RR H_R^T + Γ G^{-1} H_M^T;  G^{-T} Γ^T H_R^T + (G G^T)^{-1} H_M^T]    (37)

Note that we do not explicitly compute K̄_M; instead, we first compute J from

G J^T = H_M^T    (38)

Because G is triangular, we can compute J with a back-solve operation. Substituting J into (37) yields:

[K̄_R;  K̄_M] = [P_RR H_R^T + Γ J^T;  G^{-T} Γ^T H_R^T + G^{-T} J^T]    (39)

Next, we compute the residual covariance

S = H_R P_RR H_R^T + H_R Γ J^T + J Γ^T H_R^T + J J^T + R    (40)

and the state update [see (15)]:

x̂_R^+ = x̂_R + K̄_R S^{-1} r,    x̂_M^+ = x̂_M    (41)

Finally, we update the covariance using (16) with (39). Avoiding the calculation of K̄_M, and following the same process as in (36), we obtain a factorized form of the updated cross-correlation:

P_RR^+ = P_RR − K̄_R S^{-1} K̄_R^T
P_RM^+ = Γ G^{-1} − K̄_R S^{-1} (H_R Γ G^{-1} + J G^{-1}) = [Γ − K̄_R S^{-1} (H_R Γ + J)] G^{-1} = Γ^+ G^{-1}    (42)

At this point, we should note that, unlike propagation and local-feature updates, the processing requirements of mapped-feature updates are not strictly linear in the size of the map. The main bottleneck is (38), as computing J has complexity between linear and quadratic in the size of the map (depending on the structure of G). As shown in Sect. VI, as the map grows, the time to compute J becomes unacceptable for real-time operation on a mobile device. This limitation motivates us to introduce the sub-map relaxation of Sect. IV-E and Sect. IV-F: we decrease the size of G in (38) by partitioning the map into independent sub-maps, while retaining consistency.
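The following sketch strings together (38)-(42) for a single mapped-feature update. It uses a sparse triangular back-solve so that G is never inverted. It is a schematic example under the assumptions that G is stored as a lower-triangular scipy sparse (CSR) matrix and that the small measurement Jacobians are dense; it is not the authors' code.

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def cskf_map_update(x_R, P_RR, Gamma, G, H_R, H_M, r, R):
    """C-SKF mapped-feature update, (38)-(42); the map and its factor G stay fixed."""
    J = spsolve_triangular(G, H_M.T, lower=True).T           # (38): G J^T = H_M^T
    K_R = P_RR @ H_R.T + Gamma @ J.T                         # device block of (39)
    S = H_R @ K_R + J @ (Gamma.T @ H_R.T) + J @ J.T + R      # (40)
    K = K_R @ np.linalg.inv(S)                               # bar{K}_R S^{-1}
    x_R = x_R + K @ r                                        # (41)
    P_RR = P_RR - K @ K_R.T                                  # (42), device covariance
    Gamma = Gamma - K @ (H_R @ Gamma + J)                    # (42), factor Gamma^+
    return x_R, P_RR, Gamma
```

The back-solve in the first line is the operation whose cost depends on the fill-in of G, which is exactly the bottleneck that motivates the sub-map relaxation described next.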
E. Sub-map relaxation

Before discussing the sC-SKF, we first describe the sub-mapping relaxation process. As shown in Fig. 3 and described below, we divide the trajectory and the associated features into two separate sets.^6

First, the IMU-camera poses, ξ_i, are divided into two sets ([ξ_1 to ξ_N] and [ξ_{N+1} to ξ_M]). Our current implementation evenly distributes these sets in time. As part of our future work, we seek to find an optimal partitioning into sub-maps. With this division defined, we partition the features into two sets: (i) F_α: those observed by IMU-camera poses ξ_1 to ξ_N, and (ii) F_β: those observed by IMU-camera poses ξ_{N+1} to ξ_M. Features in F_α ∩ F_β are common features appearing in both sub-maps. The cooperative mapping (CM) algorithm [8] "duplicates" these features such that the two sub-maps' individual cost functions are independent, but also introduces a non-linear constraint between these common features

k(x_a, x_τ, f_j) = 0,    f_j ∈ F_α ∩ F_β    (43)

where x_a is the state of all IMU-camera poses and x_τ is the 4 d.o.f. transformation between the two sub-maps. For each common feature, this constraint enforces its corresponding "duplicates" in F_α and F_β to have the same physical position. Finally, all camera and IMU measurements, z_{i,j} and u_{i,i+1}, are assigned to their corresponding sub-map (with the exception of u_{N,N+1}, which is discarded).

^6 To simplify the explanation, and without loss of generality, we describe the case of two sub-maps.

Fig. 3. A partitioning of a single map into two sub-maps, with a geometric constraint, k(x_a, x_τ, f_j) = 0, imposed on the common features.

With such a partitioning, by ignoring the common-feature constraints, we can form two independent cost functions corresponding to the two sub-maps, C_1 and C_2:

C_1 = Σ_{i=1}^{N} Σ_{f_j ∈ F_α} ||z_{i,j} − h(ξ_i, f_j)||_R + Σ_{i=1}^{N−1} ||ξ_{i+1} − g(ξ_i, u_{i,i+1})||_Q
C_2 = Σ_{i=N+1}^{M} Σ_{f_j ∈ F_β} ||z_{i,j} − h(ξ_i, f_j)||_R + Σ_{i=N+1}^{M−1} ||ξ_{i+1} − g(ξ_i, u_{i,i+1})||_Q    (44)

where h and g are defined in (5) and (4), respectively. Summing these two cost functions and imposing the common-feature constraints yields:

C = C_1 + C_2    s.t.    k(x_a, x_τ, f_j) = 0,  f_j ∈ F_α ∩ F_β    (45)

We minimize (45) by employing the CM algorithm of [8].

F. Cholesky-Schmidt-Kalman filter with sub-maps (sC-SKF)

In our problem, we take advantage of the high-accuracy estimate computed from the solution of (45), but relax the information attributed to each sub-map by storing the Cholesky factors, G_i, resulting from each corresponding cost function in (44), linearized at the CM solution of (45). Such a relaxation causes the sub-maps to become independent, while maintaining consistency.

Theorem 1: The covariance of each sub-map when ignoring the common-feature constraints is larger than or equal to (in the positive semi-definite sense) that computed from the CM.

Proof: See Appendix A.
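The sketch below illustrates the bookkeeping behind the sub-map partitioning of Sect. IV-E: poses are split evenly in time, features are assigned to F_α and F_β according to which poses observe them, and the features in the intersection are the ones the CM algorithm duplicates and ties together through (43). The observation-table layout is an assumption made only for this illustration.

```python
def partition_submaps(num_poses, feature_obs):
    """Split poses 1..M evenly into two sub-maps and assign features to F_alpha / F_beta.

    feature_obs: dict mapping a feature id to the set of (1-based) pose indices
    observing it. Returns the two pose ranges, the two feature sets, and the
    common features that are duplicated and constrained via (43).
    """
    N = num_poses // 2                                   # even split in time
    poses_1 = range(1, N + 1)                            # xi_1 ... xi_N
    poses_2 = range(N + 1, num_poses + 1)                # xi_{N+1} ... xi_M
    F_alpha = {f for f, obs in feature_obs.items() if any(i <= N for i in obs)}
    F_beta = {f for f, obs in feature_obs.items() if any(i > N for i in obs)}
    common = F_alpha & F_beta                            # duplicated in both sub-maps
    return poses_1, poses_2, F_alpha, F_beta, common
```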
To process mapped measurements with sub-maps, we represent the system covariance as:

P = [P_RR  Γ_1 G_1^{-1}  Γ_2 G_2^{-1};  (Γ_1 G_1^{-1})^T  (G_1 G_1^T)^{-1}  0;  (Γ_2 G_2^{-1})^T  0  (G_2 G_2^T)^{-1}]    (46)

where Γ_i and G_i are the cross-correlation factor from (28) and the Cholesky factor of the i-th sub-map, respectively. When a mapped feature in the first sub-map is observed (without loss of generality), the measurement Jacobian is:

H = [H_R  H_1  0]    (47)

where H_1 is the measurement Jacobian corresponding to the states in the first sub-map. We compute J_1 and denote K̄ in a similar fashion as in (38) and (39), respectively:

G_1 J_1^T = H_1^T,    [K̄_R;  K̄_1;  K̄_2] = [P_RR H_R^T + Γ_1 J_1^T;  G_1^{-T} (Γ_1^T H_R^T + J_1^T);  G_2^{-T} Γ_2^T H_R^T]    (48)

The residual covariance and state update are computed as:

S = H_R P_RR H_R^T + H_R Γ_1 J_1^T + J_1 Γ_1^T H_R^T + J_1 J_1^T + R
x̂_R^+ = x̂_R + K̄_R S^{-1} r,    x̂_1^+ = x̂_1,    x̂_2^+ = x̂_2    (49)

Finally, we update the device covariance and cross-correlations using the same factorization as in the single-map case [see (42)]:

P_RR^+ = P_RR − K̄_R S^{-1} K̄_R^T,    P_11^+ = P_11,    P_22^+ = P_22
P_R1^+ = Γ_1 G_1^{-1} − K̄_R S^{-1} K̄_1^T = [Γ_1 − K̄_R S^{-1} (H_R Γ_1 + J_1)] G_1^{-1} = Γ_1^+ G_1^{-1}    (50)
P_R2^+ = Γ_2 G_2^{-1} − K̄_R S^{-1} K̄_2^T = (I − K̄_R S^{-1} H_R) Γ_2 G_2^{-1} = Γ_2^+ G_2^{-1}    (51)

By employing the sub-map relaxation, we have significantly lowered the computational requirements of the C-SKF. Note again that the bottleneck of our system is the computation of J_1 [see (48)]. With the sub-map relaxation, however, we can adapt the sC-SKF to given hardware constraints by adjusting the number of sub-maps employed. Increasing the number of sub-maps decreases the size of the system solved when computing J_1, dramatically reducing computation time. It should be noted, however, that increasing the number of sub-maps also decreases the information associated with the map, which typically leads to slightly less accurate device pose estimates (Sect. VI).
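A minimal sketch of the sC-SKF update (48)-(51) for a feature observed in the first of two sub-maps follows. Only the observed sub-map's factor, G_1, enters a back-solve; the second cross-correlation factor is updated with a dense product. As before, the names and the lower-triangular sparse storage of G_1 are assumptions of this example, not the authors' implementation.

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def scskf_map_update(x_R, P_RR, Gamma1, Gamma2, G1, H_R, H_1, r, R):
    """sC-SKF update for a feature in sub-map 1, (48)-(51); both sub-maps stay fixed."""
    J1 = spsolve_triangular(G1, H_1.T, lower=True).T          # G_1 J_1^T = H_1^T, (48)
    K_R = P_RR @ H_R.T + Gamma1 @ J1.T
    S = H_R @ K_R + J1 @ (Gamma1.T @ H_R.T) + J1 @ J1.T + R   # (49)
    K = K_R @ np.linalg.inv(S)
    x_R = x_R + K @ r
    P_RR = P_RR - K @ K_R.T
    Gamma1 = Gamma1 - K @ (H_R @ Gamma1 + J1)                 # (50), observed sub-map
    Gamma2 = Gamma2 - K @ (H_R @ Gamma2)                      # (51), other sub-map(s)
    return x_R, P_RR, Gamma1, Gamma2
```

Only the back-solve for J_1 scales with the (smaller) sub-map, which is the source of the timing gains reported in Table III.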
V. 2D-3D FEATURE CORRESPONDENCE PIPELINE

Before employing map-based updates, we identify correspondences between 2D feature measurements (FREAK features [2]) in the current image and 3D features which have been previously mapped (Fig. 1, Box B). Our pipeline takes a dual-layer approach: the first layer considers the case when the estimator has no prior for the 4 d.o.f. device-to-map transformation, while the second takes advantage of such a prior to improve efficiency.

A. Pose-less correspondence generation

1) VT query: We query the saved VT with the current image, which returns up to five mapped images of similar appearance.
2) Feature matching: We apply binary descriptor matching between features in the query and returned images.
3) Outlier rejection: We apply the 3+1 pt RANSAC [19] on each image returned by the VT. If fewer than 7 correspondences remain, we classify that VT return as an outlier. Then, the 2+1 pt RANSAC [13] is applied over the remaining set of inlier correspondences. If fewer than 13 correspondences remain, we classify all feature measurements as outliers.

B. Pose-assisted correspondence generation

During nominal operation, the estimator has a reliable estimate of the device-to-map transformation. We leverage this estimate to re-project a subset of mapped features into the current image, bypassing the VT (the main bottleneck of the pose-less pipeline).
1) Pose-based matching: We find mapped images whose camera poses are near the current camera pose. A number of heuristics could be used to define nearby mapped images; in our case, we require images to be within 3 m of the current position and to have an optical axis within 45 deg of the current optical axis. Images satisfying these criteria form the initial set of mapped-image matches.
2) Co-visibility: We add to the set of mapped-image matches those mapped images that view at least 50 common features with any of the pose-based image matches. This step allows the pipeline to include images that are far from the current camera pose but still view the same scene.
3) Feature re-projection and matching: The union of all features in the set of matched mapped images (obtained through pose-based matching and co-visibility) is re-projected onto the current image. Re-projected features are matched with current features by binary descriptor matching; the projected features considered for matching are limited to a small radius (30 pixels) around the current feature.
4) Outlier rejection: We apply the 2+1 pt RANSAC on the set of matched 2D-3D correspondences to reject outliers. If fewer than 13 correspondences remain, we classify all feature measurements as outliers.

C. Additional outlier rejection

While RANSAC-based approaches generally remove the majority of 2D-3D outliers, we further improve our system's reliability by applying a Mahalanobis distance test on a per-feature basis.

VI. EXPERIMENT RESULTS

All experimental results are obtained on a Google Project Tango tablet, which is equipped with a fisheye, global-shutter, grayscale camera, a MEMS-quality IMU, a quad-core 2.3 GHz ARM Cortex-A15 CPU, and 4 GB of RAM.

A. Data collection and ground truth

To generate experimental results, we collected three datasets: the first (DS1) (2,069 images, approximately 60 m long) is the dataset on which the tested estimators run; the second (DS2) (2,627 images, approximately 75 m long) is used to generate the map of the area of interest; the third (DS3) is a larger mapping dataset used to assess storage and timing requirements as the map size grows (Tables I and III). All maps and sub-maps are generated using the CM method of [8]. In addition to each map's estimate and Cholesky factor, we save FREAK binary descriptors [2] and a VT [20], which indexes all mapped images. To acquire ground truth for the trajectory in DS1, we generate a BLS estimate, where, in addition to all camera and IMU measurements, we supply the BLS estimator with absolute pose measurements from a VICON system.

B. Results

Note that in the trajectory of DS1, the user moves within the VICON room, visiting the same scene several times. Such a trajectory emphasizes the detrimental effects of an inconsistent estimator which does not track device-map cross-correlations (the device estimate becomes strongly correlated with the map when viewing the same features multiple times). This inconsistency is illustrated in Fig. 4, which shows the error and 3σ bounds for the inflated-measurement-noise model (σ = 7.5 pixels) and for the sC-SKF estimator employing 2 sub-maps. Furthermore, the pose RMSE for each method of map-based updates can be found in Table II, where, as expected, the C-SKF and sC-SKF outperform the inconsistent methods.

The storage requirement of the map's uncertainty information is the main limitation of the SKF addressed by the C-SKF. The sizes of the Cholesky factors used for the map of DS2, its two sub-maps, as well as an additional, larger map (DS3), are listed in Table I, along with the corresponding size of each map's covariance. As expected, the sparse Cholesky factor requires much less disk space than the dense covariance, and grows at a slower rate as the map size increases.

As mentioned earlier, the motivation for partitioning the map into sub-maps is to reduce the time required to compute J [see (38)]. We report the average time for the back-solve required to compute J on a Project Tango tablet, for different map sizes, in Table III.
For large maps, it becomes impossible to perform real-time map-based updates with the C-SKF. On the other hand, by reducing the map size (i.e., increasing the number of sub-maps) and employing the sC-SKF, we significantly reduce the computation requirements.

Fig. 4. The error and 3σ bounds (x, y, and z position error, in m, versus time, in s) for the sC-SKF with 2 sub-maps and for perfect-map-based updates with inflated measurement noise.

Finally, we present the times of the most computationally demanding components of the sC-SKF pipeline (using the two sub-maps of DS2) in Table IV. Specifically, the components timed are the 2D-3D correspondence-detection pipeline, the map-based update, the local-feature tracking pipeline (Harris and KLT), and the MSCKF update. We also provide the time for mapped-feature updates using the perfect-map assumption (inflated noise) for comparison. Our MSCKF runs at approximately 6 Hz (depending on the user's motion). By summing the MSCKF update time and the feature-tracking time listed, our local-measurement estimation requires on average 324 ms of CPU time per second, which leaves 676 ms to be devoted to map-based updates. The sC-SKF map-based update pipeline (2D-3D pose-based correspondence generation and map-based update) takes 214 ms per iteration. Therefore, we can generally apply map-based updates while maintaining real-time operation. If the application or device has stricter processing requirements, we have the option to increase the number of sub-maps (decreasing the total information attributed to the map).

TABLE I
Sizes of various maps and the associated memory requirements of storing their Cholesky factor, G, and covariance, P_MM.

Map ID        | DS2 sub-map 1 | DS2 sub-map 2 | DS2 full map | DS3
Num. features | 1,838         | 1,424         | 2,203        | 12,164
Map dimension | 11,517        | 8,774         | 20,537       | 63,692
Size (G)      | 45 MB         | 18.1 MB       | 76 MB        | 457 MB
Size (P_MM)   | 530 MB        | 308 MB        | 1.2 GB       | 16.2 GB

TABLE II
Position RMSE.

Method                     | Position RMSE (cm)
C-SKF, single map          | 6.2
sC-SKF, two sub-maps       | 6.6
Perfect-map approximation  | 8.3
No map-based updates       | 14.7

TABLE III
Average time to compute J [see (38)] for a single feature measurement in various maps.

Number of features in map | Time to compute J
3,926                     | 6.7 ms
6,341                     | 24.5 ms
12,164                    | 233.3 ms

TABLE IV
Mean times for the steps of the estimator pipeline.

Item                                      | Mean time (ms)
Mapped-feature update (sC-SKF)            | 180
Mapped-feature update (perfect map)       | 7
Correspondence detection (pose-less)      | 52
Correspondence detection (pose-assisted)  | 36
Harris + KLT tracking                     | 20
Propagation + local-feature update        | 34

VII. CONCLUSION

In this paper, we focused on the problem of performing approximate, but consistent, map-based localization. Specifically, motivated by the linear (in the map's size) processing cost, but quadratic memory requirements, of the Schmidt-Kalman filter (SKF) when applied to map-based localization, we introduced the Cholesky (C)-SKF, which uses the map's Cholesky factor to model the information (and thus uncertainty) in the prior map. By doing so, and given the sparsity of the Cholesky factor, the C-SKF has only linear, in the map's size, memory requirements.
Moreover, its equations are factored in such a form so as to avoid inverting the Cholesky factor of the map's Hessian matrix. Despite these gains in efficiency, however, the processing cost of the C-SKF may grow more than linearly in the map's size. In order to bound its processing cost, we introduced a relaxation, termed the sC-SKF, which uses the sub-maps obtained by partitioning the original map, with minimal loss in accuracy. Lastly, the computational requirements of the proposed C-SKF and sC-SKF were assessed using a Project Tango tablet, while we demonstrated their superior performance against other approximate, but inconsistent, map-based approaches through real-world experiments using high-accuracy ground truth.

APPENDIX A
PROOF OF SUB-MAP CONSISTENCY

We seek to prove Theorem 1 of Sect. IV-F.

Proof: As described in [8], for the case of two sub-maps, the KKT matrix, K, resulting from the constrained optimization problem is:

K = [H_1  0  A_1^T  0;  0  H_2  A_2^T  0;  A_1  A_2  0  B;  0  0  B^T  0]

where H_i is the Hessian of sub-map i, A_i corresponds to the features common to the two sub-maps, and B is a tall matrix whose columns correspond to the 4 d.o.f. transformation between the two sub-maps. Computing the covariance of the sub-maps in the CM result requires computing the 2×2 block sub-matrix of the inverse of K, i.e.,

P = (K^{-1})_{1:2,1:2} = { [H_1  0;  0  H_2] − [A_1^T  0;  A_2^T  0] [0  B;  B^T  0]^{-1} [A_1  A_2;  0  0] }^{-1}

Employing the matrix inversion lemma yields:

P = [H_1^{-1}  0;  0  H_2^{-1}] − [H_1^{-1}  0;  0  H_2^{-1}] M [H_1^{-1}  0;  0  H_2^{-1}]

where

M = [A_1^T  0;  A_2^T  0] W [A_1  A_2;  0  0]    (52)

and

W^{-1} = [A_1  A_2;  0  0] [H_1^{-1}  0;  0  H_2^{-1}] [A_1^T  0;  A_2^T  0] − [0  B;  B^T  0] = [Θ  −B;  −B^T  0]

with Θ ≜ A_1 H_1^{-1} A_1^T + A_2 H_2^{-1} A_2^T.

On the other hand, if we treat each of the sub-maps as independent (i.e., we ignore the constraints imposed by the common features), but compute their Hessians using the CM result, their covariances are:

P̄ = [H_1^{-1}  0;  0  H_2^{-1}]    (53)

In order to prove P ⪯ P̄, it suffices to show that M in (52) is symmetric positive semi-definite (PSD). We start by computing

W = [X  Y;  Y^T  Z]    (54)

where

X = Θ^{-1} − Θ^{-1} B (B^T Θ^{-1} B)^{-1} B^T Θ^{-1}    (55)
Y = −Θ^{-1} B (B^T Θ^{-1} B)^{-1}    (56)
Z = −(B^T Θ^{-1} B)^{-1}    (57)

Substituting (54) in (52) yields

M = [A_1^T;  A_2^T] X [A_1  A_2]    (58)

Note that X in (55) is PSD, since it is the Schur complement of the (2,2) block of the PSD matrix [I  B]^T Θ^{-1} [I  B]. Thus, M is also PSD.

REFERENCES

[1] Motilal Agrawal and Kurt Konolige. FrameSLAM: From bundle adjustment to real-time visual mapping. IEEE Trans. on Robotics, 24(5):1066–1077, October 2008.
[2] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. FREAK: Fast retina keypoint. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 510–517, College Park, MD, June 16–21, 2012.
[3] Clemens Arth, Daniel Wagner, Manfred Klopschitz, Arnold Irschara, and Dieter Schmalstieg. Wide area localization on mobile phones. In 8th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 73–82, Orlando, FL, USA, October 19–22, 2009.
[4] Averil Burton Chatfield. Fundamentals of High Accuracy Inertial Navigation, volume 174. American Institute of Aeronautics and Astronautics, 1997.
[5] Siddharth Choudhary, Luca Carlone, Henrik I. Christensen, and Frank Dellaert. Exactly sparse memory efficient SLAM using the multi-block alternating direction method of multipliers.
In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany.
[6] Jose E. Guivant and Eduardo Mario Nebot. Optimization of the simultaneous localization and map-building algorithm for real-time implementation. IEEE Trans. on Robotics and Automation, 17(3):242–257, 2001.
[7] Chao Guo, Dimitrios Kottas, Ryan DuToit, Ahmed Ahmed, Ruipeng Li, and Stergios Roumeliotis. Efficient visual-inertial navigation using a rolling-shutter camera with inaccurate timestamps. In Proc. of Robotics: Science and Systems, Berkeley, CA, USA, July 12–16, 2014.
[8] Chao Guo, Kourosh Sartipi, Ryan DuToit, Georgios Georgiou, Ruipeng Li, John O'Leary, Esha Nerurkar, Joel Hesch, and Stergios Roumeliotis. Large-scale cooperative 3D visual-inertial mapping in a Manhattan world. In Proc. of the IEEE International Conference on Robotics and Automation, Stockholm, Sweden, May 16–21, 2016. URL http://mars.cs.umn.edu/papers/CM_line.pdf.
[9] Chris Harris and Mike Stephens. A combined corner and edge detector. In Proc. of the Alvey Vision Conference, pages 147–151, Manchester, UK, August 31 – September 2, 1988.
[10] Joel A. Hesch, Dimitrios G. Kottas, Sean L. Bowman, and Stergios I. Roumeliotis. Consistency analysis and improvement of vision-aided inertial navigation. IEEE Trans. on Robotics, 30(1):158–176, February 2014.
[11] Simon J. Julier. A sparse weight Kalman filter approach to simultaneous localisation and map building. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pages 1251–1256, Maui, HI, USA, October 29 – November 3, 2001.
[12] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234, Nara, Japan, November 13–16, 2007.
[13] Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla. Closed-form solutions to minimal absolute pose problems with known vertical direction. In Proc. of the Asian Conference on Computer Vision, pages 216–229, Queenstown, New Zealand, November 8–12, 2011.
[14] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of the International Joint Conference on Artificial Intelligence, pages 674–679, Vancouver, British Columbia, August 24–28, 1981.
[15] Simon Lynen, Torsten Sattler, Michael Bosse, Joel Hesch, Marc Pollefeys, and Roland Siegwart. Get out of my lab: Large-scale, real-time visual-inertial localization. In Proc. of Robotics: Science and Systems, Rome, Italy, July 13–17, 2015.
[16] Sven Middelberg, Torsten Sattler, Ole Untzelmann, and Leif Kobbelt. Scalable 6-DOF localization on mobile devices. In Proc. of the European Conference on Computer Vision, pages 649–663, Zurich, Switzerland, September 6–12, 2014.
[17] Faraz M. Mirzaei and Stergios I. Roumeliotis. A Kalman filter-based algorithm for IMU-camera calibration: Observability analysis and performance evaluation. IEEE Trans. on Robotics, 24(5):1143–1156, October 2008.
[18] Anastasios I. Mourikis, Nikolas Trawny, Stergios I. Roumeliotis, Andrew E. Johnson, Adnan Ansar, and Larry Matthies. Vision-aided inertial navigation for spacecraft entry, descent, and landing. IEEE Trans. on Robotics, 25(2):264–280, April 2009.
[19] Oleg Naroditsky, Xun S. Zhou, Jean Gallier, Stergios I. Roumeliotis, and Kostas Daniilidis. Two efficient solutions for visual odometry using directional correspondence. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 34(4):818–824, April 2012.
[20] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2161–2168, New York, NY, June 17–22, 2006.
[21] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Proc. of the IEEE International Conference on Computer Vision, pages 2564–2571, Barcelona, Spain, November 6–13, 2011.
[22] Stanley F. Schmidt. Applications of state space methods to navigation problems. In C. T. Leondes, editor, Advanced Control Systems, 3:293–340, 1966.
[23] Dan Simon. Optimal State Estimation: Kalman, H-infinity, and Nonlinear Approaches. John Wiley & Sons, 2006.
[24] Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment: A modern synthesis. In Vision Algorithms: Theory and Practice, pages 298–375. Springer-Verlag, 2000.
[25] Jonathan Ventura, Clemens Arth, Gerhard Reitmayr, and Dieter Schmalstieg. Global localization from monocular SLAM on a mobile phone. IEEE Trans. on Visualization and Computer Graphics, 20(4):531–539, April 2014.