Data-Efficient Decentralized Visual SLAM

Titus Cieslewski¹, Siddharth Choudhary² and Davide Scaramuzza¹

Abstract — Decentralized visual simultaneous localization and mapping (SLAM) is a powerful tool for multi-robot applications in environments where absolute positioning systems are not available. Being visual, it relies on cameras, which are cheap, lightweight and versatile sensors, and being decentralized, it does not rely on communication to a central ground station. In this work, we integrate state-of-the-art decentralized SLAM components into a new, complete decentralized visual SLAM system. To allow for data association and co-optimization, existing decentralized visual SLAM systems regularly exchange the full map data between all robots, incurring large data transfers at a complexity that scales quadratically with the robot count. In contrast, our method performs efficient data association in two stages: in the first stage, a compact full-image descriptor is deterministically sent to only one robot. In the second stage, which is only executed if the first stage succeeded, the data required for relative pose estimation is sent, again to only one robot. Thus, data association scales linearly with the robot count and uses highly compact place representations. For optimization, a state-of-the-art decentralized pose-graph optimization method is used. It exchanges a minimum amount of data, which is linear in the trajectory overlap. We characterize the resulting system and identify bottlenecks in its components. The system is evaluated on publicly available data and we provide open access to the code.

¹ T. Cieslewski and D. Scaramuzza are with the Robotics and Perception Group, Dep. of Informatics, University of Zurich, and Dep. of Neuroinformatics, University of Zurich and ETH Zurich, Switzerland — http://rpg.ifi.uzh.ch.
² S. Choudhary is with the College of Computing, Georgia Institute of Technology, Atlanta, GA, USA, siddharth.choudhary@gatech.edu

I. INTRODUCTION

Using several robots instead of one can accelerate many tasks such as exploration and mapping, or enable heterogeneous teams of robots, where each robot has a specialization. Multi-robot simultaneous localization and mapping (SLAM) is an essential component of any team of robots operating in an environment where absolute positioning systems are denied, as it relates the current state estimate of each robot to all present and past state estimates of all robots. Because cameras are cheap, lightweight and versatile sensors, we seek to implement a visual SLAM system. Visual multi-robot SLAM can be solved in a centralized manner, where a single entity collects all data and solves SLAM for all robots, but this relies on a central entity that must always be reachable, must never fail, and must scale to the size of the robot team, both in computation and bandwidth. Decentralized systems do not have this bottleneck, but are more challenging to implement.

Visual SLAM systems typically consist of three components: 1) a visual odometry algorithm which provides an initial state estimate, 2) a place recognition system which is able to relate the currently observed scene to previously observed scenes, and 3) an optimization back-end which consistently integrates the place matches from the place recognition system with the full stream of state estimates.

(a) Ten sub-trajectories of KITTI 00 after running our method. Each color represents an individual trajectory, place matches are marked in black, dashed lines indicate the ground truth.
(b) Data transmission over time for the system components: decentralized optimization (DOpt), decentralized visual place recognition (DVPR) and relative pose estimation (RelPose).

Fig. 1. The proposed decentralized visual SLAM system is able to build a globally consistent map from ten sub-trajectories of KITTI 00 by exchanging only around 2 MB in total – for decentralized optimization, place recognition and visual relative pose estimation, for all ten robots.

The end product is a map, and that map feeds back to place recognition and optimization. It is this feedback which makes decentralized SLAM challenging, especially if one is concerned about communication bandwidth. Visual odometry is not involved in the feedback loop and is thus trivial to distribute – it can be run on each robot independently and feed its output to the rest of the decentralized SLAM system.

In previous work, we have proposed state-of-the-art decentralized place recognition [1] and optimization [2] systems separately. Both systems focus on data efficiency: they achieve performance similar to a centralized system while minimizing the data that needs to be exchanged. In this work, we integrate both systems, along with a data-efficient method for visual feature association [3], into a full decentralized visual SLAM system. The system is evaluated on publicly available data and the code is provided open-source.

II. RELATED WORK

In the following, we independently consider related work pertaining to decentralized optimization and to decentralized place recognition. We conclude with a review of existing integrated decentralized SLAM systems.

A. Decentralized Optimization

Distributed estimation in multi-robot systems is an active field of research, with special attention being paid to communication constraints [4], heterogeneity [5], [6], consistency [7], and robust data association [8]. The literature offers distributed implementations of different estimation techniques, including Kalman filters [9], information filters [10], particle filters [11], [12], and distributed smoothers [2], [13].

In multi-robot systems, maximum-likelihood trajectory estimation can be performed by collecting all measurements at a centralized inference engine, which performs the optimization [5], [8], [14]–[16]. However, it is not practical to collect all measurements at a single inference engine, since this requires a large communication bandwidth. Furthermore, solving trajectory estimation over a large team of robots can be too demanding for a single computational unit.

These reasons triggered interest towards distributed trajectory estimation, in which the robots only exploit local communication in order to reach a consensus on the trajectory estimate [17]–[21]. Recently, Cunningham et al. [13], [21] used Gaussian elimination and developed an approach, called DDF-SAM, in which robots exchange Gaussian marginals over the separators (i.e., the variables observed by multiple robots).

While Gaussian elimination has become a popular approach, it has two major shortcomings. First, the marginals to be exchanged among the robots are dense, hence the communication cost is quadratic in the number of separators.
This motivated the use of sparsification techniques [4]. Second, Gaussian elimination is performed on a linearized version of the problem, hence these approaches require good linearization points and complex bookkeeping to ensure consistent linearization across the robots [13]. The need for a linearization point also characterizes gradient-based techniques [20]. An alternative approach to Gaussian elimination is the Gauss-Seidel approach of Choudhary et al. [2], which requires communication linear in the number of separators. The Distributed Gauss-Seidel approach only requires the latest estimates for the poses involved in inter-robot measurements to be communicated, and therefore does not require consistent linearization and complex bookkeeping. This approach also avoids double counting of information, since it does not communicate marginals.

B. Decentralized Place Recognition and Pose Estimation

In order to solve decentralized SLAM, measurements that relate poses between different robots (inter-robot measurements) need to be established. A popular type of inter-robot measurement are direct measurements of the other robot [22], such as time-of-flight distance measurements [4] or vision-based relative pose estimation [15]. The latter is typically aided with markers that are deployed on the robots. To the best of our knowledge, most types of direct measurements require specialized hardware (which can precisely measure time-of-flight, for example, or visual markers). Furthermore, many types of direct measurements require the robots to be in line of sight, which, in many environments, imposes a major limitation on the set of relative poses that can be established, see Fig. 2. Limiting relative measurements in this way translates into limitations for higher-level applications that would like to use decentralized SLAM as a tool.

Fig. 2. Restricting relative pose measurements to robots that are in line of sight (red circle) severely limits the recall of relative localization: robots A and C can establish relative localization, but A and B cannot. This often applies to direct relative measurements. By ensuring that relative localization can occur from all past measurements (green outline), we can increase recall and put fewer restrictions on the high-level application that controls the robots.

We prefer, instead, to use indirect relative measurements. These are established through registration of observations that two robots A and B make of the same scene [3], [23]. This requires no additional hardware: the sensors already used for odometry can be re-used for inter-robot measurements. Since establishing inter-robot measurements in this way is equivalent to establishing loop closures in single-robot SLAM, indirect measurements are widely used in centralized visual SLAM algorithms [24]–[26]. The establishment of indirect measurements is still limited by the communication range of the robots, but in practice it is feasible to establish a communication range that exceeds line of sight. Furthermore, if two robots cannot communicate directly, they might still communicate through a multi-hop network. A disadvantage of indirect measurements, then, is that they require bandwidth to communicate. If all robots share all data with every other robot, the amount of data exchanged is quadratic in the number of robots. One way to mitigate the bandwidth is to exchange data only with immediate neighbours, but this has the same reduced recall as discussed in Fig. 2.
Another way to reduce bandwidth is to use compact representations that allow inter-robot measurements to be established with a minimal amount of data exchange. Visual feature-based maps can be compressed by compressing feature descriptors [27], replacing feature descriptors with the corresponding cluster centers of a visual vocabulary [3], removing landmarks that prove not to be necessary for localization [28], [29], or using compact full-image descriptors [30]. In previous work, we have shown another way to reduce the bandwidth of place recognition: data exchange can be reduced by a factor of the robot count if the problem can be cast as key-value lookup [31]. In this work, we use [1], an advanced version of [31] which at the same time makes use of a compact, state-of-the-art place representation, NetVLAD [30]. We use [3] for the association of visual features in relative pose estimation.

Fig. 3. Components and interactions in our system. The diagram shows the components that run on each robot. The colored blocks indicate components that communicate with other robots. Sharp corners indicate software modules, rounded corners indicate data. DVPR: decentralized visual place recognition, VO: visual odometry, RelPose: geometric verification and relative pose estimation, DOpt: decentralized optimization.

C. Integrated Decentralized SLAM

Several systems have previously combined decentralized optimization and relative pose estimation into decentralized SLAM. Several optimization works cited in Section II-A, such as [2], [4], use direct relative measurements and can thus be considered full decentralized SLAM systems, with the recall limitation illustrated in Fig. 2. DDF-SAM [21] has been extended to a decentralized SLAM system in [32] by adding a data association system which matches planar landmark points. This system has been validated in the real world, but the landmarks consisted of manually placed poles that were detected by a laser range finder. [16] establishes relative poses by exchanging and aligning 2D laser scan edges. [2] has been extended to a decentralized SLAM system in [33], where data association is provided by identification and pose estimation of commonly observed objects. This approach relies on the existence of unique, pre-trained objects in the environment. The majority of decentralized SLAM systems we have found in our literature review are not vision-based. A first decentralized visual SLAM system has been proposed in [34], but it relies on exchanging the full maps between robots. Furthermore, it is only evaluated on a 50 m × 50 m L-shaped trajectory with only two robots, with images obtained from a simulated environment. In contrast, we evaluate our system on a large-scale scenario, with more robots, on real data, and our system uses state-of-the-art algorithms which exchange far less than the full maps.

III. METHODOLOGY

The proposed decentralized SLAM system consists of $|\Omega|$ robots $\Omega = \{\alpha, \beta, \gamma, \ldots\}$, each of which executes the system illustrated in Fig. 3. The images taken by a camera are fed to both NetVLAD [30] and a visual odometry (VO). NetVLAD produces a compact image representation used for decentralized visual place recognition (DVPR) [1], which itself produces candidate place recognition matches. The VO produces a sparse feature map that can be used for localization. The relative pose estimation module (RelPose) extracts the parts of the sparse map that are observed at the candidate place matches and uses them to establish relative poses between the robots' trajectories, or to reject candidate matches. The decentralized optimization module (DOpt) [2] obtains the initial guess from the map, the intra-robot relative pose measurements $E_I$ from the VO and the inter-robot relative pose measurements $E_S$ from RelPose, and updates the map. The system works continuously as new images are acquired.
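To make the dataflow of Fig. 3 concrete, the following minimal sketch outlines the per-robot processing loop. The class and method names are hypothetical illustrations, not part of our implementation:

```python
# Sketch of the per-robot dataflow of Fig. 3; all names are hypothetical.
class Robot:
    def __init__(self, robot_id, netvlad, vo, dvpr, relpose, dopt):
        self.id = robot_id
        self.netvlad = netvlad    # full-image descriptor network [30]
        self.vo = vo              # visual odometry (stereo ORB-SLAM [36])
        self.dvpr = dvpr          # decentralized visual place recognition [1]
        self.relpose = relpose    # geometric verification / relative poses
        self.dopt = dopt          # decentralized pose-graph optimization [2]

    def on_new_image(self, image):
        # VO extends the local map and yields an intra-robot edge (Sec. III-A).
        frame, intra_edge = self.vo.track(image)
        self.dopt.add_intra_robot_measurement(intra_edge)

        # Phase 1: a compact descriptor is routed to one robot (Sec. III-C).
        v = self.netvlad.describe(image)
        match = self.dvpr.query(self.id, frame.index, v)

        # Phase 2: only on a place-recognition hit, landmark data is sent
        # for geometric verification (Sec. III-D).
        if match is not None:
            inter_edge = self.relpose.verify_and_estimate(frame, match)
            if inter_edge is not None:
                self.dopt.add_inter_robot_measurement(inter_edge)
```

Only the DVPR query, the RelPose request and DOpt involve communication; everything else runs locally on each robot.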
A. Intra-Robot Measurements

Intra-robot measurements are obtained from a visual odometry (VO) algorithm. Any VO algorithm can be used in our system as long as it fulfills the following requirements:

1) It produces a pose graph. The VO of robot $\alpha$ defines frames $F_\alpha = \{(\alpha, 1), (\alpha, 2), (\alpha, 3), \ldots\}$ and provides the intra-robot measurements

$E_{I,\alpha} \doteq \{\bar{z}_{\alpha_i \alpha_{i+1}} \;\forall\; (\alpha, i), (\alpha, i+1) \in F_\alpha\}$ (1)

$\bar{z}_{\alpha_i \alpha_{i+1}} \doteq (\bar{R}_{\alpha_i \alpha_{i+1}}, \bar{t}_{\alpha_i \alpha_{i+1}}) \in \mathrm{SE}(3)$ (2)

describing pose transforms between subsequent frames.

2) Each pose is associated with an image to be used for place recognition. We denote this image by $I_{\alpha_i}$. Our current system does not yet take full advantage of a multi-camera setup, which could be a useful future extension. The image capture time is denoted as $t_{\alpha_i}$.¹

3) The VO provides absolute scale, as implied by (2). Monocular VO typically estimates the pose only up to scale, due to scale ambiguity. Scale can be obtained from a stereo camera or by fusing with inertial measurements.

In our implementation, we use ORB-SLAM [36] in stereo configuration for visual odometry.

¹ A common time reference is hard to establish and, strictly speaking, not well defined, so consider $t_{\alpha_i}$ a Lamport timestamp [35]. $t_{\alpha_i}$ is not actually used in the implementation, so this detail is not particularly important.

B. Inter-Robot Measurements

The inter-robot measurements are

$E_S \doteq \{\bar{z}_{\alpha_i \beta_j} \;|\; \alpha \neq \beta\}$ (3)

with $\bar{z}_{\alpha_i \beta_j}$ defined like (2). They are established in two phases: given its latest image $I_{\alpha_i}$, robot $\alpha$ first tries to determine whether there is any image $I_{\beta_j}$ of the same scene previously captured by another robot. This is achieved using decentralized visual place recognition, detailed in Section III-C. Then, if $\exists I_{\beta_j}$, $\alpha$ tries to establish $\bar{z}_{\alpha_i \beta_j}$ using the method detailed in Section III-D. This two-phase approach is pursued because, as we will see, establishing relative poses consumes more data than recognizing whether $\exists I_{\beta_j}$. The first phase reduces the amount of data to be exchanged by determining which robot to send the relative pose query to, and by determining whether it is worthwhile to send it in the first place, using a minimal amount of data.
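The following is a minimal sketch of how the two measurement sets of (1)–(3) could be represented; the field names are illustrative assumptions, not our actual data structures:

```python
# Sketch of the measurement bookkeeping of Sec. III-A/III-B.
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class FrameId:
    robot: str   # e.g. "alpha"
    index: int   # frame counter i

@dataclass
class RelativePose:   # an element of SE(3), as in Eq. (2)
    R: np.ndarray     # 3x3 rotation
    t: np.ndarray     # 3-vector translation (absolute scale, Sec. III-A)

@dataclass
class Measurement:    # \bar{z} between two frames
    src: FrameId
    dst: FrameId
    z: RelativePose

def is_intra(m: Measurement) -> bool:
    """Intra-robot edges (Eq. (1)) connect consecutive frames of one robot;
    inter-robot edges (Eq. (3)) connect frames of different robots."""
    return m.src.robot == m.dst.robot

# E_I accumulates VO edges, E_S accumulates verified place matches.
E_I: list[Measurement] = []
E_S: list[Measurement] = []
```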
C. Decentralized Visual Place Recognition

The first phase is accomplished by matching image $I_{\alpha_i}$ to the images previously captured by the other robots, $\{I_{\gamma_k} \,|\, t_{\gamma_k} < t_{\alpha_i}\}$, using the full-image descriptor NetVLAD [30]. NetVLAD uses a neural network to calculate a vector $v_{\alpha_i} \in \mathbb{R}^{D_{\mathrm{NetVLAD}}}$ from $I_{\alpha_i}$. In rough terms, it is trained such that $\|v_{\alpha_i} - v_{\gamma_k}\|_{\ell_2}$ is lower if $I_{\alpha_i}$ and $I_{\gamma_k}$ observe the same scene than if they observe different scenes. For more details, see [30]. NetVLAD is well-suited for decentralized place recognition, as $v_{\alpha_i}$ is all that robot $\alpha$ needs to send to robot $\gamma$ to establish whether a matching image exists among $\{I_{\gamma_k}\}$.

Concretely, we seek the image $I^*_{\beta_j}$ whose NetVLAD vector $v^*_{\beta_j}$ has the shortest $\ell_2$ distance to $v_{\alpha_i}$ and which satisfies

$\|v_{\alpha_i} - v_{\beta_j}\|_{\ell_2} < \tau_{\mathrm{NetVLAD}}$ (4)

where $\tau_{\mathrm{NetVLAD}}$ is a threshold parameter. $I^*_{\beta_j}$ can be found by sending $v_{\alpha_i}$ to all other robots, but we use a method that requires $|\Omega|$ times less data exchange [1], which is particularly interesting for scaling to large robot teams. In this method, each robot $\gamma$ is pre-assigned a cluster center $c_\gamma$, and each robot knows the cluster centers of all other robots. The query $v_{\alpha_i}$ is sent only to the robot $\delta = \arg\min_\gamma \|v_{\alpha_i} - c_\gamma\|_{\ell_2}$, and $\delta$ replies with the identifier of $I_{\beta_j}$,

$(\beta, j) = \arg\min_{(\gamma, k):\, v_{\gamma_k} \in V_\delta} \|v_{\alpha_i} - v_{\gamma_k}\|_{\ell_2}$, (5)

if $v_{\beta_j}$ also satisfies (4), where $V_\delta$ is the Voronoi cell $V_\delta = \{v \,|\, c_\delta = \arg\min_{c_\gamma} \|v - c_\gamma\|_{\ell_2}\}$. The query is executed for every frame of every robot, so at $t_{\alpha_i}$ this query has already been executed $\forall \gamma, k : t_{\gamma_k} < t_{\alpha_i}$, and $\delta$ is aware of all $v_{\gamma_k} \in V_\delta$ by virtue of having received the previous queries. It can thus provide $(\beta, j)$ without any extra data exchange. This method approximates $I_{\beta_j} \approx I^*_{\beta_j}$; $I_{\beta_j} = I^*_{\beta_j}$ if $\exists \delta : v^*_{\beta_j}, v_{\alpha_i} \in V_\delta$. The cluster centers are determined using k-means clustering on a training dataset, in our case the Oxford RobotCar Dataset [37]. The properties of the method are discussed in more detail in [1].
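The following sketch illustrates the query routing described above. The data layout and names are assumptions made for illustration; the reference implementation is the one described in [1]:

```python
# Sketch of the one-robot query routing of Sec. III-C (method of [1]).
import numpy as np

class DVPRNode:
    """State robot delta keeps for its pre-assigned Voronoi cell V_delta."""
    def __init__(self):
        self.vectors = []   # NetVLAD vectors received as previous queries
        self.ids = []       # corresponding (robot, frame) identifiers

    def handle_query(self, v, sender_id, tau_netvlad):
        # Nearest previously received descriptor in this cell, Eq. (5).
        # (Matches to the sender's own frames could be filtered here.)
        match = None
        if self.vectors:
            d = np.linalg.norm(np.stack(self.vectors) - v, axis=1)
            best = int(np.argmin(d))
            if d[best] < tau_netvlad:          # threshold test, Eq. (4)
                match = self.ids[best]
        # The query itself becomes part of the cell's database.
        self.vectors.append(v)
        self.ids.append(sender_id)
        return match

def route_query(v, cluster_centers):
    """Every robot knows all cluster centers, so the query goes to exactly
    one robot and DVPR traffic scales linearly with the robot count."""
    return int(np.argmin(np.linalg.norm(cluster_centers - v, axis=1)))
```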
D. Relative Pose Estimation

Once and if $(\beta, j)$ has been established, $\alpha$ sends a relative pose estimation request to $\beta$. The relative pose estimation (RelPose) can and should depend heavily on the visual odometry (VO), since it can benefit from re-using by-products of the VO such as 3D positions of landmarks. In our method, we use ORB-SLAM [36] for VO and, accordingly, our RelPose implementation imitates its loop closure method. However, it does not imitate [36] exactly, since the latter makes use of the data surrounding the match, which in decentralized RelPose would translate into a lot of data exchange. We want to avoid this and thus combine the relative pose estimation method from [36] with the visual data association method from [3], which reduces the data to be sent from $\alpha$ to $\beta$, $d^{\alpha \to \beta}_{\mathrm{RelPose}}$, to a set of landmarks and visual word identifiers:

$d^{\alpha \to \beta}_{\mathrm{RelPose}} = (\{(w_k, p_k) \;\forall\; k \in K_{\alpha_i}\}, \alpha, i, j)$ (6)

where $K_{\alpha_i}$ are the keypoints observed in $I_{\alpha_i}$, $w_k$ is the identifier of the visual word associated with keypoint $k$, and $p_k$ is the 3D position of the landmark corresponding to keypoint $k$, expressed in the camera frame of $I_{\alpha_i}$. Visual words represent Voronoi cells in the keypoint descriptor space, here the space of ORB descriptors [38]. They provide a quantization of the high-dimensional descriptor space, and since similar descriptors are likely to be assigned to the same word, word identifiers can be used for matching keypoints between images [3]. Since this is less precise than using raw descriptors, we provide an option to exchange full descriptors instead.

Given $d^{\alpha \to \beta}$, $\beta$ establishes pairs of matching keypoints $\{(k, k') \,|\, w_k = w_{k'}\}$, considering only words that are unique within each image, i.e. $\nexists\, l \in K_{\alpha_i}, l \neq k : w_l = w_k$ and $\nexists\, l' \in K_{\beta_j}, l' \neq k' : w_{l'} = w_{k'}$. Given the corresponding pairs of landmark positions $\{(p_k, p_{k'})\}$, RANSAC is used to determine an initial estimate of $\bar{z}_{\alpha_i \beta_j}$ and the corresponding inlier matches $M = \{(k, k')^*\}$. A place match is rejected if $|M| < \tau_{\mathrm{inliers}}$. Otherwise, the inlier landmark position pairs $\{(p_k, p_{k'})^*\}$ are used to refine $\bar{z}_{\alpha_i \beta_j}$ to the pose that minimizes the robust 3D registration error

$\sum_{\{(p_k, p_{k'})^*\}} \rho_{\tau_{\mathrm{loss}}}\!\left(\|p_k - (\bar{R} \cdot p_{k'} + \bar{t})\|^2_{\ell_2}\right)$. (7)

We use the robust loss function $\rho_{\tau_{\mathrm{loss}}}(s) = \arctan(s / \tau_{\mathrm{loss}})$ to reduce the weight of remaining keypoint match outliers.

To avoid outliers among the relative poses $\bar{z}_{\alpha_i \beta_j}$, we use a consistency check, where $\bar{z}_{\alpha_i \beta_j}$ is only accepted if either: a) there is another already accepted $\bar{z}_{\alpha_{i'} \beta_{j'}}$ such that $(\alpha, i')$ and $(\alpha, i)$ are within a distance of $\tau_{\mathrm{cdist}}$, and $\bar{z}_{\alpha_i \beta_j}$ and $\bar{z}_{\alpha_{i'} \beta_{j'}}$ are consistent; or b) no accepted $\bar{z}_{\alpha_{i'} \beta_{j'}}$ exists, but a previous candidate fulfills the same conditions. $\bar{z}_{\alpha_i \beta_j}$ and $\bar{z}_{\alpha_{i'} \beta_{j'}}$ are considered consistent if the two ways of calculating the relative position between $(\alpha, i')$ and $(\beta, j)$ – with $\bar{z}_{\alpha_{i'} \beta_{j'}}$ and $E_{I,\beta}$, or with $E_{I,\alpha}$ and $\bar{z}_{\alpha_i \beta_j}$ – result in positions that are within a distance $< \tau_{\mathrm{tol}}$.

RelPose still constitutes a significant amount of data, so we introduce a parameter $\tau_{\mathrm{mdg}}$ (minimum distance between geometric verifications) which allows geometric verifications to be throttled. Geometric verifications are skipped for matches which are close to already established relative poses: $d^{\alpha \to \beta}$ is only sent for geometric verification to $\beta$ if there is no other $(\alpha, i')$ matched to a frame of $\beta$ such that $\|t_{\alpha_i} - t_{\alpha_{i'}}\|_{\ell_2} < \tau_{\mathrm{mdg}}$. This assumes that the VO and RelPose are consistent enough that the resulting subset of retrievable $E_S$ suffices to produce a globally consistent state estimate.
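The core of the geometric verification – word-based matching followed by 3D-3D alignment – can be sketched as follows. RANSAC and the robust refinement of (7) are omitted for brevity, and the function names are illustrative, not taken from our implementation:

```python
# Sketch of the geometric verification of Sec. III-D.
import numpy as np

def match_by_word(words_a, words_b):
    """Match keypoints whose visual word occurs exactly once per image ([3])."""
    unique_a = {w: k for k, w in enumerate(words_a) if words_a.count(w) == 1}
    unique_b = {w: k for k, w in enumerate(words_b) if words_b.count(w) == 1}
    return [(unique_a[w], unique_b[w]) for w in unique_a if w in unique_b]

def align_3d(p_a, p_b):
    """Least-squares R, t with p_a ≈ R @ p_b + t (Horn's method via SVD);
    p_a, p_b are Nx3 arrays of matched landmark positions."""
    mu_a, mu_b = p_a.mean(axis=0), p_b.mean(axis=0)
    U, _, Vt = np.linalg.svd((p_a - mu_a).T @ (p_b - mu_b))
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflection
    R = U @ D @ Vt
    return R, mu_a - R @ mu_b
```

In the actual pipeline, an alignment of this kind would be run inside RANSAC on minimal samples and once more on the inliers, followed by the robust refinement of (7).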
E. Decentralized Optimization

In our decentralized optimization, each robot estimates its own trajectory using the available measurements, leveraging occasional communication with other robots. The pose of frame $(\alpha, i)$ is denoted by $x_{\alpha_i}$. The trajectory of robot $\alpha$ is then denoted as $x_\alpha = [x_{\alpha_1}, x_{\alpha_2}, \ldots]$, where $x_{\alpha_i} = (R_{\alpha_i}, t_{\alpha_i}) \in \mathrm{SE}(3)$, $R_{\alpha_i} \in \mathrm{SO}(3)$ represents the 3D rotation, and $t_{\alpha_i} \in \mathbb{R}^3$ represents the 3D translation.

While our classification of the measurements (inter vs. intra) is based on the robots involved in the measurement process, all relative measurements can be framed within the same measurement model. Since all measurements correspond to noisy observations of the relative pose between a pair of poses, say $x_{\alpha_i}$ and $x_{\beta_j}$, a general measurement model is:

$\bar{z}_{\alpha_i \beta_j} \doteq (\bar{R}_{\alpha_i \beta_j}, \bar{t}_{\alpha_i \beta_j})$, with $\begin{cases} \bar{R}_{\alpha_i \beta_j} = (R_{\alpha_i})^T R_{\beta_j} R_\epsilon \\ \bar{t}_{\alpha_i \beta_j} = (R_{\alpha_i})^T (t_{\beta_j} - t_{\alpha_i}) + t_\epsilon \end{cases}$ (8)

where the relative pose measurement $\bar{z}_{\alpha_i \beta_j}$ includes the relative rotation measurement $\bar{R}_{\alpha_i \beta_j}$, which describes the attitude $R_{\beta_j}$ with respect to the reference frame $(\alpha, i)$, "plus" a random rotation $R_\epsilon$ (measurement noise), and the relative position measurement $\bar{t}_{\alpha_i \beta_j}$, which describes the position $t_{\beta_j}$ in the reference frame $(\alpha, i)$, plus random noise $t_\epsilon$.

Assuming that the translation noise is distributed according to a zero-mean Gaussian with information matrix $\omega_t^2 I_3$, while the rotation noise follows a Von Mises distribution with concentration parameter $\omega_R^2$, it is possible to demonstrate [39] that the ML estimate $x \doteq \{(R_{\alpha_i}, t_{\alpha_i}), \forall \alpha \in \Omega, \forall i\}$ can be computed as the solution of the following optimization problem:

$\min_{\substack{t_{\alpha_i} \in \mathbb{R}^3,\, R_{\alpha_i} \in \mathrm{SO}(3) \\ \forall \alpha \in \Omega, \forall i}} \;\sum_{(\alpha_i, \beta_j) \in E} \omega_t^2 \left\| t_{\beta_j} - t_{\alpha_i} - R_{\alpha_i} \bar{t}_{\alpha_i \beta_j} \right\|^2 + \frac{\omega_R^2}{2} \left\| R_{\beta_j} - R_{\alpha_i} \bar{R}_{\alpha_i \beta_j} \right\|_F^2$ (9)

As proposed in [33], we use a two-stage approach to solve the optimization problem in a distributed manner: we first solve a relaxed version of (9) to get an estimate of the rotations $R_{\alpha_i}$ of all robots, and then we recover the full poses and top off the result with a Gauss-Newton (GN) iteration.

In both stages, we need to solve a linear system where the unknown vector can be partitioned into subvectors, such that each subvector contains the variables associated with a single robot in the team. For instance, in the first stage we can partition the vector $r$ as $r = [r_\alpha, r_\beta, \ldots]$, such that $r_\alpha$ describes the rotations of robot $\alpha$. Similarly, in the second stage we can partition $p = [p_\alpha, p_\beta, \ldots]$ ($p$ represents the linearized poses), such that $p_\alpha$ describes the trajectory of robot $\alpha$. Therefore, the linear systems in the two stages can be framed in the general form $Hy = g$:

$\begin{bmatrix} H_{\alpha\alpha} & H_{\alpha\beta} & \cdots \\ H_{\beta\alpha} & H_{\beta\beta} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} y_\alpha \\ y_\beta \\ \vdots \end{bmatrix} = \begin{bmatrix} g_\alpha \\ g_\beta \\ \vdots \end{bmatrix}$ (10)

where we want to compute the vector $y = [y_\alpha, y_\beta, \ldots]$ given the (known) block matrix $H$ and the (known) block vector $g$, partitioned according to the block structure of $y$. The linear system (10) can be rewritten as $\sum_{\delta \in \Omega} H_{\alpha\delta}\, y_\delta = g_\alpha \;\forall \alpha \in \Omega$. Taking the contribution of $y_\alpha$ out of the sum, we get:

$H_{\alpha\alpha}\, y_\alpha = -\sum_{\delta \in \Omega \setminus \{\alpha\}} H_{\alpha\delta}\, y_\delta + g_\alpha \quad \forall \alpha \in \Omega$ (11)

The set of equations (11) is the same as the original system (10), but clearly exposes the contribution of the variables associated with each robot. The distributed Gauss-Seidel algorithm [40] starts at an arbitrary initial estimate $y^{(0)} = [y^{(0)}_\alpha, y^{(0)}_\beta, \ldots]$ and applies the following update rule for each robot $\alpha \in \Omega$:

$y^{(k+1)}_\alpha = H_{\alpha\alpha}^{-1} \left( -\sum_{\delta \in \Omega^+_\alpha} H_{\alpha\delta}\, y^{(k+1)}_\delta - \sum_{\delta \in \Omega^-_\alpha} H_{\alpha\delta}\, y^{(k)}_\delta + g_\alpha \right)$ (12)

where $\Omega^+_\alpha$ is the set of robots that have already computed the $(k+1)$-th estimate, while $\Omega^-_\alpha$ is the set of robots that still have to perform the update (12), excluding node $\alpha$ (intuitively: each robot uses the latest available estimate). We can see that if the sequence produced by the iterations (12) converges to a fixed point, then that point satisfies (11), and indeed solves the original linear system (10).
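A toy numeric version of the update (12) illustrates the scheme. This is a sketch of the generic block iteration on an assumed two-robot example, not our optimizer:

```python
# Toy illustration of the distributed Gauss-Seidel update of Eq. (12).
import numpy as np

def distributed_gauss_seidel(H_blocks, g_blocks, iters=50):
    robots = sorted(g_blocks.keys())
    y = {a: np.zeros_like(g_blocks[a]) for a in robots}
    for _ in range(iters):
        for a in robots:
            # Robots earlier in the sweep already contribute y^{(k+1)},
            # the others still contribute y^{(k)} (Eq. (12)).
            rhs = g_blocks[a].copy()
            for b in robots:
                if b != a:
                    rhs -= H_blocks[(a, b)] @ y[b]
            y[a] = np.linalg.solve(H_blocks[(a, a)], rhs)
    return y

# Two-robot example with a diagonally dominant (hence convergent) system.
H = {('a', 'a'): np.array([[4., 1.], [1., 4.]]),
     ('a', 'b'): np.array([[1., 0.], [0., 1.]]),
     ('b', 'a'): np.array([[1., 0.], [0., 1.]]),
     ('b', 'b'): np.array([[5., 0.], [0., 5.]])}
g = {'a': np.array([1., 2.]), 'b': np.array([3., 4.])}
y = distributed_gauss_seidel(H, g)
```

Each robot only ever needs its own blocks of $H$ and $g$ and the latest estimates of the other robots' variables that it is coupled to, which is what keeps the communication linear in the number of separators.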
F. Making Optimization Work Continuously

Since the optimization consists of two stages, and the map estimate at the end of the first stage and between iterations is not a consistent map state, the map estimate handled by DOpt cannot be considered the best estimate at any given time. Furthermore, the optimization is not laid out to incorporate new measurements between iterations. As a consequence, the map estimate is concurrently modified by the VO and DOpt, which would lead to inconsistencies if unchecked.

This can be solved using optimistic concurrency control (OCC). In OCC, each of the processes that concurrently operate on some data operates on its own copy. Once a process is done, its copy is merged back into the reference state. OCC is a central concept in version control tools such as SVN and Git, and has recently been applied to distributed robotic systems in [41] and [42]. In our method, the handling of VO and place recognition is considered one process and decentralized optimization another. We let VO and place recognition operate directly on the reference state. Decentralized optimization, as described in Section III-E, is executed episodically. At the beginning of each episode, the robots agree on a reference time $t_e$. During the episode, only the poses $x_{\alpha_i}$ which satisfy $t_{\alpha_i} < t_e$ are optimized. Once the optimization has finished, the $x_{\alpha_i}$ of the reference state are replaced with the newly optimized estimates $x'_{\alpha_i}$. The remaining $x_{\alpha_j}$ for which $t_{\alpha_j} > t_e$ are corrected with

$x_{\alpha_j} \leftarrow R \cdot x_{\alpha_j} + t$ (13)

where $(R, t)$ satisfies

$x'_{\alpha_e} = R \cdot x_{\alpha_e} + t$ (14)

with $(\alpha, e)$ referring to the frame recorded just before $t_e$.

IV. EXPERIMENTS

In the experiments, we seek to find out how much data the individual components and the overall system use, and how accurate the state estimate is as time progresses. This gives us an idea of the applicability and scalability of the system, and lets us identify which components are most worthwhile to optimize in the future.

A. Data Exchange Evaluation

The components of our system that exchange data are decentralized visual place recognition, relative pose estimation and decentralized optimization. Each of these components could potentially be further developed to use less data. Thus, we record the data transfer for each component individually. The data transfers of the individual components are:

$\bigcup_{\mathrm{episode}} d^{\alpha \to \beta}_{\mathrm{DOpt}} = \{\bigcup_{\mathrm{rot.\ iters.}} R_{\alpha_i \beta_j},\; \bigcup_{\mathrm{pose\ iters.}} \bar{z}_{\alpha_i \beta_j}\}$ (15)

$d^{\alpha \to \delta}_{\mathrm{DVPR}} = (\alpha, i, v_{\alpha_i}), \quad d^{\delta \to \alpha}_{\mathrm{DVPR}} = (\beta, j)$ (16)

$d^{\beta \to \alpha}_{\mathrm{RelPose}} = (\bar{z}_{\alpha_i \beta_j},\; x^{-1}_{\beta_{j'}} x_{\beta_j})$ (17)

with $d^{\alpha \to \beta}_{\mathrm{RelPose}}$ from (6), and where $x^{-1}_{\beta_{j'}} x_{\beta_j}$ is sent for the relative pose consistency check. The size of the data is calculated accordingly, using $6 \cdot 8$ bytes for poses $\bar{z}_{\alpha_i \beta_j}$, $9 \cdot 8$ bytes for rotations $R_{\alpha_i \beta_j}$, $8 \cdot D_{\mathrm{NetVLAD}}$ bytes for NetVLAD vectors $v_{\alpha_i}$, 1 byte for robot indices, 4 bytes for frame indices, 2 bytes for visual word indices $w_k$, and $3 \cdot 4$ bytes for landmarks $p_k$.

TABLE I: Default parameter values

Parameter | Sec. | Value
$D_{\mathrm{NetVLAD}}$ | III-C | 128
$\tau_{\mathrm{NetVLAD}}$ | III-C | 0.1
$\tau_{\mathrm{inliers}}$ | III-D | 20
$\tau_{\mathrm{loss}}$ | III-D | 3 m
$\tau_{\mathrm{cdist}}$ | III-D | 20 m
$\tau_{\mathrm{tol}}$ | III-D | 4 m
$\tau_{\mathrm{mdg}}$ | III-D | 0 m

B. Accuracy Evaluation

As is typically done for SLAM systems, we evaluate the accuracy of our system. Unlike for other SLAM systems, however, we evaluate the evolution of accuracy over time. This allows us to better characterize the system. Accuracy is measured using the average trajectory error (ATE). The ATE requires alignment to the ground truth trajectory, which is only meaningful for connected components. Thus, we report the ATE for different connected components individually.

C. Parameter Studies

A parameter that we find of particular interest is $\tau_{\mathrm{mdg}}$, which determines the distance between an established relative pose $\bar{z}_{\alpha_i \beta_j}$ and the next frame of $\alpha$ to be sent to $\beta$ for geometric verification. $\tau_{\mathrm{mdg}}$ can significantly reduce the data to be transmitted for RelPose and DOpt, but it is important to verify that this does not come at the cost of accuracy. We vary $\tau_{\mathrm{mdg}}$ and observe how it affects data transmission and ATE. Another interesting parameter is the NetVLAD dimension $D_{\mathrm{NetVLAD}}$. Since the last layer of NetVLAD performs principal component analysis, $D_{\mathrm{NetVLAD}}$ can be tuned arbitrarily up to 4096 dimensions. The lower $D_{\mathrm{NetVLAD}}$, the lower the DVPR traffic, but also the lower its precision.
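To make the byte accounting of Section IV-A concrete, the following sketch computes the message sizes of (16) and (6) from the byte counts stated there; the function names are illustrative:

```python
# Back-of-the-envelope message sizes following the accounting of Sec. IV-A.
BYTES_POSE = 6 * 8          # \bar{z}: 6 DoF as doubles
BYTES_ROTATION = 9 * 8      # 3x3 rotation matrix as doubles
BYTES_ROBOT_ID = 1
BYTES_FRAME_ID = 4
BYTES_WORD_ID = 2           # visual word index w_k
BYTES_LANDMARK = 3 * 4      # p_k as three floats

def dvpr_query_bytes(d_netvlad=128):
    # Eq. (16): (alpha, i, v) out, (beta, j) back.
    return (BYTES_ROBOT_ID + BYTES_FRAME_ID + 8 * d_netvlad,
            BYTES_ROBOT_ID + BYTES_FRAME_ID)

def relpose_request_bytes(n_keypoints):
    # Eq. (6): per-keypoint word id and landmark, plus frame identifiers.
    return (n_keypoints * (BYTES_WORD_ID + BYTES_LANDMARK)
            + BYTES_ROBOT_ID + 2 * BYTES_FRAME_ID)

# A 128-D query costs about 1 kB; a 1000-keypoint verification about 14 kB.
# This gap is why phase 2 is only triggered on place-recognition hits.
print(dvpr_query_bytes(), relpose_request_bytes(1000))
```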
V. RESULTS

We evaluate our system on the KITTI dataset [43]. As in [31], we split the dataset into $|\Omega|$ parts to simulate trajectories recorded by different robots. Fig. 4 shows the used sub-trajectories of KITTI 00 and the end result of running our system with the parameter values shown in Table I. The corresponding data transmission is shown in Fig. 5. Note that we plot the cumulative data transmission up to a given time. This directly exhibits the full amount of transmitted data and makes it easy to see time-varying bandwidth (slope).

Fig. 4. Ten sub-trajectories of KITTI 00 after running our method with the parameters in Table I. Each color represents an individual robot's trajectory, place matches are marked with bold black dots. The aligned ground truth is indicated with dashed lines.

Fig. 5. Data transmission over time for the three system components: decentralized optimization (DOpt), decentralized visual place recognition (DVPR) and relative pose estimation (RelPose), using the parameters in Table I. DOpt continues to run after visual odometry (VO) has ended until it has incorporated all data produced by VO.

DOpt and RelPose require the most data transmission, but, as we will see, DOpt traffic can be significantly reduced, while RelPose traffic remains high. Fig. 6 shows the total data transmission between pairs of robots. This data transmission is highly uneven and could require consideration in multi-hop networks, but that is outside the scope of this paper. DOpt and RelPose traffic highly depends on the trajectory overlap between robots: the higher the overlap, the more data is exchanged by these components. Decentralized visual place recognition traffic is also uneven, but in a different way. This is due to the clustering that assigns NetVLAD cluster centers to robots. What we see in Fig. 6 is that robots 2, 6 and 7 are assigned the cluster centers that represent the majority of NetVLAD descriptors in KITTI 00. This unevenness can be mitigated either by randomly assigning several clusters per robot or by training the clustering on a dataset that is very similar to the data at hand, in terms of the scope of visual appearances of keypoints. A detailed study of this problem is provided in [1].

Fig. 6. Data exchanged between individual pairs of robots ($\log_{10}$ of transmitted data in kB). Row numbers indicate the sender and column numbers the receiver. DOpt, DVPR and RelPose traffic is visualized separately and separated by green lines.

Fig. 7. Average trajectory error (ATE) and size of the connected component (CC) each of the ten robots is in. As the CCs of two robots merge, so do the lines representing the frame count and ATE within these CCs. Note that accuracy does not change significantly in the optimizations executed after visual odometry (VO) has ended – the state estimate reaches its final accuracy already before all the data is considered for optimization.

Fig. 8. Effect of changing $\tau_{\mathrm{mdg}}$, the parameter that controls the distance between an established relative pose $\bar{z}_{\alpha_i \beta_j}$ and the next frame of $\alpha$ to be sent to $\beta$ for geometric verification.
As we can see, $\tau_{\mathrm{mdg}}$ can be used to significantly throttle the required bandwidth. We have verified that this happens without a significant effect on the ATE, which is around 4 m throughout. The dashed line indicates RelPose data transmission if descriptors are used for keypoint association rather than visual word indices.

Fig. 7 illustrates how the trajectories of the different robots fuse as time progresses, as well as the accuracy of the resulting connected components. Seven merge events of connected components are apparent around seconds 15, 23, 32, 33, 39, 46 and 47; the remaining two merges occur at the beginning of the experiment and are not visible in Fig. 7. At each merge event, there is a jump in connected component size and the corresponding lines in the plot are fused. In early merge events, the ATE grows as the drift of the individual components is compounded with noise in the relative pose measurements. This is most pronounced at the merge event around second 39. In later merge events, and as more relative measurements are added, loops are closed and the ATE decreases again, until it stabilizes around a value of 4 meters. This is well below 1% of the overall trajectory length.

Fig. 8 shows how increasing the minimum distance between geometric verifications $\tau_{\mathrm{mdg}}$ can significantly reduce bandwidth requirements without negatively affecting accuracy. Fig. 1a shows the place matches that remain and Fig. 1b the data transmission over time with $\tau_{\mathrm{mdg}} = 60\,\mathrm{m}$.

Fig. 9 shows how varying the NetVLAD dimension affects the size of the transmitted data, with $\tau_{\mathrm{mdg}} = 10\,\mathrm{m}$. Unsurprisingly, DVPR traffic is quasi-linear in the NetVLAD dimension. However, using too low a dimension for NetVLAD results in lower precision, which results in more failed relative pose estimates. Since $\tau_{\mathrm{mdg}}$ only suppresses relative pose queries if a previous relative pose query has been successful, this results in a significant increase in RelPose data transmission. Conversely, increasing the NetVLAD dimension beyond a certain point does not increase precision significantly. In Fig. 9 we can see that an efficient trade-off is reached with $D_{\mathrm{NetVLAD}} > 100$.

Fig. 9. Data transmission and ATE as the NetVLAD dimensionality is varied, with $\tau_{\mathrm{mdg}} = 10\,\mathrm{m}$. Dashed lines indicate RelPose data transmission if keypoints are matched with descriptors rather than words. The ATE is around 4 m throughout the samples.

Fig. 10. Ground truth trajectory for one of the sequences in the MIT Stata Center Dataset [44].

We also evaluated our system on one of the sequences of the MIT Stata Center Dataset [44]. Fig. 10 shows the ground truth trajectory. Similarly to KITTI 00, we split the sequence into $|\Omega|$ parts to simulate trajectories recorded by different robots. Fig. 11 shows the sub-trajectories estimated using our method before and after the final loop closure. As we can see, the final estimate given by our method is not visually similar to the ground truth (Fig. 10). The main reason for this mismatch is that the graph structure of the MIT Stata Center dataset is closer to a chain topology. Due to the chain topology, decentralized optimization requires more iterations to reach consensus among the robot trajectories on either end of the chain, and therefore finds it difficult to converge to the centralized estimate. Finding sub-trajectories which follow graph topologies that are favorable for decentralized optimization (grids rather than chains) could be a useful future extension.
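The topology effect can be reproduced in isolation: Gauss-Seidel-style iterations propagate information roughly one node per sweep, so chain-structured graphs converge far more slowly than better-connected ones. The following toy experiment uses a simplified scalar model with a graph Laplacian, not our actual system, purely to illustrate the claim:

```python
# Illustrative chain-vs-grid convergence experiment (scalar toy model).
import numpy as np

def gs_sweeps(L, b, tol=1e-6, max_sweeps=100000):
    """Count Gauss-Seidel sweeps until the residual of L x = b drops below tol."""
    x = np.zeros_like(b)
    for k in range(1, max_sweeps + 1):
        for i in range(len(b)):
            x[i] = (b[i] - L[i] @ x + L[i, i] * x[i]) / L[i, i]
        if np.linalg.norm(L @ x - b) < tol:
            return k
    return max_sweeps

def laplacian(edges, n):
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1; L[i, j] -= 1; L[j, i] -= 1
    L[0, 0] += 1  # anchor node 0 to make the system positive definite
    return L

n = 64
chain = [(i, i + 1) for i in range(n - 1)]
grid = [(r * 8 + c, r * 8 + c + 1) for r in range(8) for c in range(7)] + \
       [(r * 8 + c, (r + 1) * 8 + c) for r in range(7) for c in range(8)]
b = np.random.default_rng(0).standard_normal(n)
print("chain sweeps:", gs_sweeps(laplacian(chain, n), b))
print("grid sweeps:", gs_sweeps(laplacian(grid, n), b))
```

With the same number of nodes, the chain requires orders of magnitude more sweeps than the grid, mirroring the behavior observed on the MIT Stata Center sequence.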
VI. CONCLUSION

We have presented a new integrated decentralized visual SLAM algorithm that is based on state-of-the-art components. The system has been characterized using publicly available data, and we have explored how data transmission can be reduced to a minimum. Based on our results, we believe that future developments of decentralized visual SLAM will focus on the one hand on even more data-efficient data association, and on the other hand on more robust decentralized optimization.

Fig. 11. Ten sub-trajectories of the MIT Stata Center Dataset before and after the final loop closure, estimated using our method. Each color represents an individual robot's trajectory, place matches are marked with bold black lines.

REFERENCES

[1] T. Cieslewski and D. Scaramuzza, "Efficient decentralized visual place recognition from full-image descriptors," in IEEE International Symposium on Multi-Robot and Multi-Agent Systems, 2017.
[2] S. Choudhary, L. Carlone, C. Nieto, J. Rogers, H. I. Christensen, and F. Dellaert, "Distributed trajectory estimation with privacy and communication constraints: a two-stage distributed Gauss-Seidel approach," in IEEE Int. Conf. on Robotics and Automation, 2016.
[3] D. Tardioli, E. Montijano, and A. R. Mosteo, "Visual data association in narrow-bandwidth networks," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2015.
[4] L. Paull, G. Huang, M. Seto, and J. Leonard, "Communication-constrained multi-AUV cooperative SLAM," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2015.
[5] T. Bailey, M. Bryson, H. Mu, J. Vial, L. McCalman, and H. Durrant-Whyte, "Decentralised cooperative localisation for heterogeneous teams of mobile robots," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2011.
[6] V. Indelman, P. Gurfil, E. Rivlin, and H. Rotstein, "Graph-based distributed cooperative navigation for a general multi-robot measurement model," Intl. J. of Robotics Research, vol. 31, no. 9, August 2012.
[7] A. Bahr, M. Walter, and J. Leonard, "Consistent cooperative localization," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2009.
[8] J. Dong, E. Nelson, V. Indelman, N. Michael, and F. Dellaert, "Distributed real-time cooperative localization and mapping using an uncertainty-aware expectation maximization approach," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2015.
[9] S. Roumeliotis and G. Bekey, "Distributed multi-robot localization," IEEE Trans. Robot. Automat., August 2002.
[10] S. Thrun and Y. Liu, "Multi-robot SLAM with sparse extended information filters," in Proceedings of the 11th International Symposium of Robotics Research (ISRR'03). Siena, Italy: Springer, 2003.
[11] A. Howard, "Multi-robot simultaneous localization and mapping using particle filters," Intl. J. of Robotics Research, 2006.
[12] L. Carlone, M. K. Ng, J. Du, B. Bona, and M. Indri, "Simultaneous localization and mapping using Rao-Blackwellized particle filters in multi robot systems," J. of Intelligent and Robotic Systems, 2011.
[13] A. Cunningham, V. Indelman, and F. Dellaert, "DDF-SAM 2.0: Consistent distributed smoothing and mapping," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Karlsruhe, Germany, May 2013.
[14] L. Andersson and J.
Nygards, "C-SAM: Multi-robot SLAM using square root information smoothing," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2008.
[15] B. Kim, M. Kaess, L. Fletcher, J. Leonard, A. Bachrach, N. Roy, and S. Teller, "Multiple relative pose graphs for robust cooperative mapping," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2010.
[16] M. Lazaro, L. Paz, P. Pinies, J. Castellanos, and G. Grisetti, "Multi-robot SLAM using condensed measurements," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2013, pp. 1069–1076.
[17] E. Nerurkar, S. Roumeliotis, and A. Martinelli, "Distributed maximum a posteriori estimation for multi-robot cooperative localization," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2009.
[18] M. Franceschelli and A. Gasparri, "On agreement problems with Gossip algorithms in absence of common reference frames," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2010.
[19] R. Aragues, L. Carlone, G. Calafiore, and C. Sagues, "Multi-agent localization from noisy relative pose measurements," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2011, pp. 364–369.
[20] J. Knuth and P. Barooah, "Collaborative localization with heterogeneous inter-robot measurements by Riemannian optimization," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2013, pp. 1534–1539.
[21] A. Cunningham, M. Paluri, and F. Dellaert, "DDF-SAM: Fully distributed SLAM using constrained factor graphs," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2010.
[22] X. Zhou and S. Roumeliotis, "Multi-robot SLAM with unknown initial correspondence: The robot rendezvous case," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2006, pp. 1785–1792.
[23] R. Aragues, E. Montijano, and C. Sagues, "Consistent data association in multi-robot systems with limited communications," in Robotics: Science and Systems, 2011, pp. 97–104.
[24] C. Forster, S. Lynen, L. Kneip, and D. Scaramuzza, "Collaborative monocular SLAM with multiple micro aerial vehicles," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2013.
[25] L. Riazuelo, J. Civera, and J. Montiel, "C2TAM: A cloud framework for cooperative tracking and mapping," Robotics and Autonomous Systems, 2014.
[26] P. Schmuck and M. Chli, "Multi-UAV collaborative monocular SLAM," in IEEE Int. Conf. on Robotics and Automation, 2017.
[27] S. Lynen, T. Sattler, M. Bosse, J. Hesch, M. Pollefeys, and R. Siegwart, "Get out of my lab: Large-scale, real-time visual-inertial localization," in Robotics: Science and Systems, 2015.
[28] M. Dymczyk, S. Lynen, T. Cieslewski, M. Bosse, R. Siegwart, and P. Furgale, "The gist of maps – summarizing experience for lifelong localization," in IEEE Int. Conf. on Robotics and Automation, 2015.
[29] L. Contreras and W. Mayol-Cuevas, "O-POCO: Online point cloud compression mapping for visual odometry and SLAM," in IEEE Int. Conf. on Robotics and Automation, 2017.
[30] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
[31] T. Cieslewski and D. Scaramuzza, "Efficient decentralized visual place recognition using a distributed inverted index," IEEE Robotics and Automation Letters, 2017.
[32] A. Cunningham, K. Wurm, W. Burgard, and F. Dellaert, "Fully distributed scalable smoothing and mapping with robust multi-robot data association," in IEEE Intl. Conf. on Robotics and Automation (ICRA), St.
Paul, MN, 2012.
[33] S. Choudhary, L. Carlone, C. Nieto, J. Rogers, H. I. Christensen, and F. Dellaert, "Distributed mapping with privacy and communication constraints: Lightweight algorithms and object-based models," arXiv preprint arXiv:1702.03435, 2017.
[34] G. Bresson, R. Aufrère, and R. Chapuis, "Consistent multi-robot decentralized SLAM with unknown initial positions," in 16th International Conference on Information Fusion (FUSION), 2013.
[35] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Communications of the ACM, 1978.
[36] R. Mur-Artal, J. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, 2015.
[37] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "1 year, 1000 km: The Oxford RobotCar dataset," The Intl. J. of Robotics Research, 2017.
[38] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Int. Conf. on Computer Vision, 2011.
[39] L. Carlone, D. Rosen, G. Calafiore, J. Leonard, and F. Dellaert, "Lagrangian duality in 3D SLAM: Verification techniques and optimal solutions," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2015.
[40] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[41] T. Cieslewski, S. Lynen, M. Dymczyk, S. Magnenat, and R. Siegwart, "Map API – scalable decentralized map building for robots," in IEEE Int. Conf. on Robotics and Automation, 2015.
[42] M. Gadd and P. Newman, "Checkout my map: Version control for fleetwide visual localisation," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2016.
[43] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The Intl. J. of Robotics Research, 2013.
[44] M. F. Fallon, H. Johannsson, M. Kaess, and J. J. Leonard, "The MIT Stata Center dataset," The Intl. J. of Robotics Research, vol. 32, pp. 1695–1699, 2013.