Event-based Camera Pose Tracking using a Generative Event Model

Guillermo Gallego, Christian Forster, Elias Mueggler, Davide Scaramuzza

The authors are with the Robotics and Perception Group, Department of Informatics, University of Zurich, Zurich 8050, Switzerland. http://rpg.ifi.uzh.ch, e-mail: guillermo.gallego@ifi.uzh.ch. This research was supported by the National Centre of Competence in Research (NCCR) Robotics, the ERC-SNSF Starting Grant, and the Google Faculty Research Award.

Abstract—Event-based vision sensors mimic the operation of the biological retina and represent a major paradigm shift from traditional cameras. Instead of providing frames of intensity measurements synchronously, at artificially chosen rates, event-based cameras provide information on brightness changes asynchronously, as they occur. Such non-redundant pieces of information are called "events". These sensors overcome some of the limitations of traditional cameras (response time, bandwidth and dynamic range) but require new methods to deal with the data they output. We tackle the problem of event-based camera localization in a known environment, without additional sensing, using a probabilistic generative event model in a Bayesian filtering framework. Our main contribution is the design of the likelihood function used in the filter to process the observed events. Based on the physical characteristics of the sensor and on empirical evidence of the Gaussian-like distribution of spiked events with respect to the brightness change, we propose to use the contrast residual as a measure of how well the estimated pose of the event-based camera and the environment explain the observed events. The filter allows for localization in the general case of six-degree-of-freedom motion.

Index Terms—Event-based, Dynamic Vision Sensor, generative model, spiking model, robot localization, pose tracking, Kalman filter.

I. INTRODUCTION

Recently, event-based cameras such as the Dynamic Vision Sensor (DVS) [1] have attracted a lot of attention from both the robotics and vision communities [2], [3], [4], [5], [6], [7], [8], [9], [10]. These bio-inspired sensors overcome some of the limitations of traditional image sensors: they respond very quickly (within microseconds) to brightness changes, have a very high dynamic range (120 dB, compared to the 60 dB of standard cameras), and require low bandwidth [1]. Hence, they are very promising sensors for high-speed visual applications in challenging scenes with large brightness contrast. However, the output of these cameras (a stream of events) is fundamentally different from that of traditional ones, so a paradigm shift is required to design algorithms that exploit the potential of these vision sensors. Examples of such emerging event-based algorithms are: event-based optical flow [4], visual odometry [5], localization [2], [6], Simultaneous Localization and Mapping (SLAM) [3], [9], mosaicing [7], [8], and object recognition [10].

We address the localization problem of a moving event-based camera in a known environment. One of the first works in this respect is [2], which introduced a particle-filter system limited to planar motions and 2-D maps. In the experiments, an upward-looking DVS was mounted on a ground robot moving at low speed, and the map used for navigation consisted of line segments on the ceiling.
In [5], a probabilistic filtering approach was designed to localize a DVS moving on a plane with respect to the temporally closest pair of frames provided by an additional RGB-D camera attached to the DVS. An algorithm to track the 6-DOF pose of the DVS with no additional sensing during high-speed maneuvers was given in [6]; it used a map consisting of the edges of a black square of known size and minimized the event-to-line reprojection distance to estimate the DVS pose.

We propose an implicit Extended Kalman Filter (EKF) approach [11] to localize the DVS with respect to a given dense map of the 3-D scene (consisting of geometric and photometric information) without additional sensing (as in [2], [6], [8]), using only the information contained in the event stream. The map is not constrained to consist of lines, so it is more general than those in [2], [6], and it is also richer in brightness changes than the barcoded scenes in [5]. We allow for localization in the general case of 6-DOF motion of the DVS and design the filter accordingly. Our main contribution pertains to the design of the likelihood function used in the correction step of the EKF to process the observed events (Section III-B): we measure how well the system state (DVS pose and velocity) and the map explain an event from the DVS using a contrast residual. To do so, we first derive a simple yet compelling model for event generation (Section II-A). The technique is demonstrated on synthetic and real data in Section IV.

II. DYNAMIC VISION SENSOR (DVS): GENERATIVE EVENT MODEL

In contrast to standard cameras, which acquire full frames at fixed rates, event-based vision sensors such as the DVS (Fig. 1a) have independent pixels that spike events at local relative brightness changes in continuous time. A visualization of the output of the DVS is shown in Fig. 1b. Events are time-stamped with microsecond resolution and transmitted asynchronously at the time they occur. Each event is a tuple e_k = <x_k, y_k, t_k, p_k>, where (x_k, y_k) are the pixel coordinates of the event, t_k is its time-stamp, and p_k = ±1 is its polarity (the sign of the brightness change). The sensor's spatial resolution is limited¹ (128 × 128 pixels), but its 120 dB dynamic range notably exceeds the 60 dB of high-quality traditional image sensors.

¹A new generation of event-based sensors with VGA resolution (640 × 480) is being developed by the group [1].

Fig. 1. (a) The Dynamic Vision Sensor (DVS). (b) Space-time visualization of the output of a DVS viewing a rotating dot. Colored dots mark individual events; event polarity is not displayed. Noise is visible as isolated points that are not part of the spiral. (c) The contrast of the DVS events empirically follows a unimodal distribution (e.g., Gaussian-like) centered at a selected threshold C = |Δlog(I)| (six threshold settings are shown). Images (b) and (c) are courtesy of [1].

Next, we provide a generative event model for the DVS using a principled derivation of the equations that characterize an ideal sensor. The event model combines several hypotheses (constant brightness, temporal persistence, etc.) with particular characteristics of the DVS. The model is at the heart of data assimilation in our filtering approach for DVS localization.
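For concreteness, the asynchronous output of the sensor can be thought of as a stream of such tuples. The following is a minimal sketch (in Python, with names of our own choosing; it is an illustration, not part of the paper) of the event representation and of the optional grouping into packets mentioned later for the filter updates:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass(frozen=True)
class Event:
    """A single DVS event e_k = <x_k, y_k, t_k, p_k>."""
    x: int      # pixel column
    y: int      # pixel row
    t: float    # time-stamp in seconds (microsecond resolution)
    p: int      # polarity: +1 (brightness increase) or -1 (decrease)

def packets(events: Iterable[Event], size: int) -> Iterator[List[Event]]:
    """Group an asynchronous event stream into fixed-size packets.
    Events may equally well be processed one by one."""
    batch: List[Event] = []
    for e in events:
        batch.append(e)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```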
A. Scene modeling

Assume that objects in the 3-D world are represented by a surface S with geometric and radiometric properties. Typically, objects are described by a mesh or depth map and a corresponding intensity (i.e., "texture") function (in a Lambertian context). The DVS has the same optics as traditional perspective cameras; therefore, standard models (e.g., pinhole) apply. In camera coordinates, the projection operation is described by u = π(X), mapping a 3-D point X = (X, Y, Z)^T into the image point u = (u, v)^T. Assume a simplified radiance model in which each point on the surface S has an intensity, I_S : S → R, and this is the value observed by the DVS to trigger events; that is, the intensity at the image plane corresponds to the intensity defined on the surface: I(u) := I_S(X) for 3-D points X visible from the DVS. Hence, the image plane parametrizes both the image I and the surface S (geometric and photometric properties).

B. 3-D motion and apparent (2-D) motion

The motion of a moving camera is described by a smooth trajectory in the space of Euclidean transformations, SE(3). Let the relative motion between the viewing camera and the scene be described, in the camera coordinate frame, by

  \dot{X}(t) \equiv \frac{dX}{dt} = -\hat{\omega}(t)\,X(t) - v(t),    (1)

where ω and v are the body angular and linear velocities, respectively, and \hat{\omega} is the cross-product matrix: \hat{a}\,b = a × b for all a, b. The corresponding apparent motion of the 3-D point X is described by the velocity of the image point u, which comprises the image motion field. Specifically, the equation that relates surface velocity (in the camera frame) to feature velocity (in normalized coordinates) is (see, e.g., [12], [13, Eq. 5.87]), dropping the t notation:

  \dot{u} \equiv \frac{du}{dt} = B\,\xi,    (2)

where the twist coordinates ξ = (v^T, ω^T)^T encode the relative motion and

  B = \begin{pmatrix} -Z^{-1} & 0 & u Z^{-1} & uv & -(1+u^2) & v \\ 0 & -Z^{-1} & v Z^{-1} & 1+v^2 & -uv & -u \end{pmatrix}    (3)

is called the interaction matrix, the image Jacobian matrix for a point feature, or the feature sensitivity matrix [12], [14, pp. 460-462]. Typically, the surface is assumed to admit a depth map representation with respect to the camera, and so the depth of the 3-D point is parametrized in the image plane, Z ≡ Z(u, v). Consequently, B(u, v) is just a function of the surface and the image point. The motion field has two separate components, one for translation and one for rotation.

C. Deterministic generative event model

The standard hypothesis in measuring image motion is that the intensity structure of local time-varying image regions is approximately constant under motion for at least a short duration (temporal persistence). Formally, if \tilde{I}(u, t) is the space-time image intensity function measured by the DVS, the total derivative d\tilde{I}/dt vanishes for those trajectories u(t) of constant intensity values, \tilde{I}(u(t), t) = const, that is,

  \frac{d\tilde{I}}{dt} = \left\langle \nabla_u \tilde{I}, \dot{u} \right\rangle + \frac{\partial \tilde{I}}{\partial t} = 0,    (4)

where ⟨·,·⟩ is the dot product, \nabla_u \tilde{I} = (\partial\tilde{I}/\partial u, \partial\tilde{I}/\partial v)^T are the first partial derivatives with respect to the spatial coordinates, and \dot{u} = (\dot{u}, \dot{v})^T is the motion field.

The DVS senses brightness logarithmically²: \tilde{I} = log(I), and it generates an event at a location u if the amount of intensity (grey level) change Δlog(I) during an interval Δt (the time since the previous event at the same location), i.e., the contrast

  \Delta\log(I) \approx \frac{\partial \log(I)}{\partial t}\,\Delta t \overset{(4)}{=} -\left\langle \nabla_u \log(I), \dot{u}\,\Delta t \right\rangle,    (5)

is larger in magnitude than a threshold C [1], [5] (typically a 10-15% relative brightness change):

  |\Delta\log(I)| \approx \left| \left\langle \nabla_u \log(I), \dot{u}\,\Delta t \right\rangle \right| \geq C.    (6)

²Using the chain rule it is easy to verify that, if I ≠ 0, the conditions dI/dt = 0 and d\tilde{I}/dt = 0 are equivalent.
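To make the deterministic model concrete, the sketch below (Python with NumPy; the variable names and the numerical values in the example are our own assumptions, not taken from the paper) evaluates the motion field (2)-(3) for a pixel of known depth and checks the contrast condition (6):

```python
import numpy as np

def interaction_matrix(u, v, Z):
    """Interaction matrix B(u, v) of Eq. (3) for a point feature at
    normalized coordinates (u, v) with depth Z."""
    return np.array([
        [-1.0 / Z, 0.0,       u / Z, u * v,     -(1 + u**2),  v],
        [ 0.0,     -1.0 / Z,  v / Z, 1 + v**2,  -u * v,      -u],
    ])

def predicted_contrast(grad_logI, u, v, Z, xi, dt):
    """Contrast of Eq. (5): Delta log(I) ~ -<grad log I, u_dot * dt>,
    with the motion field u_dot = B(u, v) xi of Eq. (2)."""
    u_dot = interaction_matrix(u, v, Z) @ xi      # apparent motion (2-vector)
    return -float(np.dot(grad_logI, u_dot * dt))

def fires_event(grad_logI, u, v, Z, xi, dt, C):
    """Hard-decision generative model, Eq. (6): returns +1, -1, or 0."""
    dL = predicted_contrast(grad_logI, u, v, Z, xi, dt)
    if dL >= C:
        return +1
    if dL <= -C:
        return -1
    return 0

# Example with assumed values: a vertical edge (horizontal log-intensity
# gradient), camera translating along +X at 0.5 m/s, depth 1 m, C = 0.15.
xi = np.array([0.5, 0.0, 0.0, 0.0, 0.0, 0.0])    # twist (v, omega)
print(fires_event(np.array([5.0, 0.0]), 0.0, 0.0, 1.0, xi, 0.1, 0.15))  # -> +1
```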
Incorporating polarity: if the contrast Δlog(I) ≥ C, a positive event (p_k = +1) is generated; if Δlog(I) ≤ -C, a negative event (p_k = -1) is generated; otherwise, no event is fired.

D. Probabilistic generative event model

Equation (6) is a hard decision model for the generation of events. A more realistic one takes into account sensor noise and manufacturing mismatches, yielding a soft decision represented by a smooth probability function. A characterization of the corresponding probability density, averaged over all DVS pixels, is shown in Fig. 6 of [1] (see Fig. 1c); it suggests a unimodal Gaussian-like distribution, whose standard deviation is measured as a function of the threshold C. This probabilistic generative event model can be included in a Bayesian filtering approach to process the events, as shown in the next section, where we adopt the simple yet powerful Extended Kalman Filter (EKF), which assumes Gaussian probability distributions to keep a compact and manageable representation of the posterior probability of the DVS pose and velocity.

III. BAYESIAN FILTERING APPROACH

A. State-space design

In the popular Bayesian inference framework given by the EKF [11], we formulate the DVS localization problem with respect to a map M as that of estimating the state of a system defined by its state-space representation (state and measurement equations). The state equation is a non-linear function f of the state and the process noise,

  x_n = f(x_{n-1}, w_n),  with  x = (t^T, r^T, v^T, \omega^T)^T.    (7)

As usual, subscripts {n-1, n} denote temporal references. The process noise w_n is not additive and is assumed to be zero-mean multivariate Gaussian with covariance Q_n^w. The state vector describes the DVS pose (position and orientation) and its velocity: t is the position of the optical center of the DVS, in world coordinates; r is the rotation vector parametrizing the orientation of the DVS by means of the exponential coordinates (as in the filter proposed by [15]) of the rotation matrix from the world to the camera frame, R = exp(\hat{r}); and the linear and angular velocities (1), (v, ω), are given in world and camera (body) coordinates, respectively.

We chose the motion model f given by the constant-velocity model, which is typical of SLAM approaches [16]. This accounts for general smooth motions of the DVS. By integration of the continuous motion over a time interval³ Δt = t_n - t_{n-1} and discretization, (7) becomes

  t_n = t_{n-1} + (v_{n-1} + V)\,\Delta t,
  r_n = \left( \log\!\left( \exp\!\left((\omega_{n-1} + \Omega)^{\wedge} \Delta t\right) \exp(\hat{r}_{n-1}) \right) \right)^{\vee},
  v_n = v_{n-1} + V,
  \omega_n = \omega_{n-1} + \Omega,    (8)

where the noise is w_n = (V^T, Ω^T)^T. The log and exp operators refer to the rotation group SO(3): ΔR = exp((ω_{n-1} + Ω)^∧ Δt) is the incremental rotation of angle θ = ||ω_{n-1} + Ω|| Δt around the axis defined by the vector (ω_{n-1} + Ω), u^∧ is the cross-product matrix associated with a 3-vector u, and S^∨ is the 3-vector associated with a 3 × 3 skew-symmetric matrix S.

³Here Δt is the time between prediction steps in the EKF, which may or may not coincide with the time between events at the same location in (5), depending on whether events are processed in packets or individually.
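As an illustration of the prediction step, the sketch below (Python with NumPy; a minimal implementation under our own naming, not the authors' code) propagates the state mean through the constant-velocity model (8), using Rodrigues' formula for the SO(3) exponential and its logarithm:

```python
import numpy as np

def hat(u):
    """Cross-product (skew-symmetric) matrix u^ of a 3-vector u."""
    return np.array([[0, -u[2], u[1]],
                     [u[2], 0, -u[0]],
                     [-u[1], u[0], 0]])

def so3_exp(r):
    """Rotation matrix exp(r^) via Rodrigues' formula."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(r / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def so3_log(R):
    """Rotation vector (log R)^v of a rotation matrix (away from theta = pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    return theta / (2 * np.sin(theta)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def predict_mean(t, r, v, w, dt, V=np.zeros(3), W=np.zeros(3)):
    """Constant-velocity motion model of Eq. (8), applied to the state
    mean (the noise terms V, W are set to zero at the mean in the EKF
    prediction)."""
    t_new = t + (v + V) * dt
    r_new = so3_log(so3_exp((w + W) * dt) @ so3_exp(r))
    return t_new, r_new, v + V, w + W
```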
B. Implicit measurement equation

In the standard EKF, the likelihood is specified by an equation z_n = h(x_n) + η_n, where the observations z_n are explicitly written in terms of the state and the observation noise η_n. This is the formulation used in classical visual localization and SLAM, where z_n consists of the image coordinates of sensed map landmarks, and h predicts the observations by using the camera model to project the landmarks. This design choice implies Gaussian image-coordinate noise, and it may also be applied to DVS localization [2]. However, it does not take into account the generative event model (such as (6)). In a different (non-localization) context, an alternative approach is given in [8] to estimate the intensity gradient at each pixel: z_n consists of event rates and a generative model is used to write such an explicit dependency. This design choice implies that the temporal (event-rate) noise is Gaussian, which is an arbitrary choice.

We depart from the previous explicit models (spatial or temporal measurements) and propose an implicit measurement equation

  q(z_n, x_n) = 0    (9)

to quantify how well the event generation model (6) is satisfied. This leads to an implicit EKF [17], [18]. Our design choice assumes that the deviation of the contrast from the nominal value that fires events is Gaussian, as Fig. 1c suggests. A similar unimodal density function is used in [8], but only for the correction step of rotation tracking.

Assuming constant illumination and independence of the observations, each event e_n = <u_n, v_n, t_n, p_n> is caused by a brightness change at pixel p_n = (u_n, v_n)^T, which depends on both the DVS state x_n ≡ x(t_n) and the map M. Thus, a more rigorous description than (9) is q = q(z_n, x_n; M), because an event is an observation of some map point. Letting g be a shorthand for the spatial gradient \nabla_u \log(I) in (6), we define the implicit function q as the difference between the absolute contrast (5) and the nominal threshold, q = |Δlog(I)| - C. Substituting |y| = y\,\mathrm{sgn}(y) for y = Δlog(I) and replacing sgn(y) by the measured polarity p_n, we use (6) to define

  q(z_n, x_n; M) = -p_n \left\langle g, \dot{u} \right\rangle\big|_{(p_n, x_n, X_n)} \Delta t_n - C,    (10)

where Δt_n = t_n - t_prev is the time span since the previous event at the same location p_n, and the inner product between the gradient g and the motion field \dot{u} depends on the event location p_n, its corresponding 3-D point X_n, and the state x_n. Specifically, g depends on the DVS pose only (not on its velocity), via the perspective projection between the map point X_n and the pixel p_n, whereas the motion field (2) depends on both the DVS pose (through the depth Z of X_n with respect to the sensor) and the velocities (twist coordinates). The gradient g may be computed by taking the spatial derivatives of the predicted image intensities I in a neighborhood of the current event location p_n, obtained by rendering the dense map M according to the DVS pose in the current state.
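To illustrate how a single event is scored, the sketch below (Python with NumPy; the function and argument names are our own, and the gradient and depth are assumed to come from rendering and ray-tracing the map, which is not shown) evaluates the implicit measurement function (10) for one event:

```python
import numpy as np

def implicit_measurement(u, v, polarity, grad_logI, Z, xi, dt_since_prev, C):
    """Implicit measurement q of Eq. (10) for a single event.

    u, v          : event location in normalized image coordinates
    polarity      : measured polarity p_n (+1 or -1)
    grad_logI     : gradient g of the rendered log-intensity map at (u, v)
    Z             : depth of the corresponding 3-D map point (via ray-tracing)
    xi            : 6-vector twist (v, omega) taken from the state estimate
    dt_since_prev : time since the previous event at the same pixel
    C             : nominal contrast threshold
    The innovation fed to the EKF correction step is nu = -q.
    """
    B = np.array([[-1.0 / Z, 0.0,      u / Z, u * v,    -(1 + u**2),  v],   # Eq. (3)
                  [ 0.0,     -1.0 / Z, v / Z, 1 + v**2, -u * v,      -u]])
    u_dot = B @ xi                                   # motion field, Eq. (2)
    return -polarity * float(np.dot(grad_logI, u_dot)) * dt_since_prev - C
```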
Examples of the contrast function for positive and negative events are shown in Fig. 2. Patches of 15 × 15 pixels around the event location are displayed, but the local analysis of the generative event model is only reliable close to the center. Fig. 3 reports the cases of moving edges parallel or almost perpendicular to the apparent motion, which yield the largest and smallest absolute contrast, respectively.

Fig. 2. Neighborhood of an event e_n = <x_n, y_n, t_n, p_n> triggered by a moving edge. The DVS is moving horizontally to the right (positive X direction). Top row: positive event (dark-to-bright transition). Bottom row: negative event (bright-to-dark transition). (a) Rendering of the map on the DVS image plane, \tilde{I}(t_n - Δt_n); the event p_n = (x_n, y_n)^T is at the center of the patch. The motion field \dot{u} (magenta vectors) points in the negative X direction. The image gradient g = \nabla_u \tilde{I}(t_n - Δt_n) (perpendicular to the edge) is displayed with cyan vectors. (b) Predicted neighborhood \tilde{I}(t_n) ≈ \tilde{I}(t_n - Δt_n) + Δ\tilde{I}. (c) Contrast Δ\tilde{I} ≈ -⟨g, \dot{u}⟩Δt_n. (d) The implicit measurement function q in (10) has the same shape as the absolute contrast, |Δ\tilde{I}| ≈ -p_n ⟨g, \dot{u}⟩Δt_n, which defines the likelihood that the event was triggered.

Fig. 3. Neighborhood of an event triggered by a moving edge. Same notation as in Fig. 2. Top row: at the event location, the image gradient g is parallel to the motion field \dot{u}. Bottom row: g is almost perpendicular to \dot{u}. Both rows correspond to a negative event.

C. Recursive solution: Implicit EKF equations

Once the state and measurement equations have been designed, the update equations for the parameters of the posterior in the EKF are also determined. The recursive estimation carried out in the EKF is described by the equations in Algorithm 1. We follow the notation in [11] for the posteriors and their moments. The DVS pose-tracking filter also assumes that an accurate estimate of the initial configuration, with relatively small uncertainties, is given (µ_0, Σ_0). Let us further explain the steps of Algorithm 1.

Algorithm 1: Extended Kalman Filter (EKF) equations for one iteration, (µ_{n-1}, Σ_{n-1}) → (µ_n, Σ_n), with implicit measurement function q.
  1. Mean state (pred.):   \bar{µ}_n = f(µ_{n-1}, w_n)
  2. Error covar. (pred.): \bar{Σ}_n = F_n Σ_{n-1} F_n^T + L_n Q_n^w L_n^T, with F_n, L_n the Jacobians of f
  3. Innovation:           ν_n = -q(z_n, \bar{µ}_n)
  4. Innovation covar.:    S_n = H_n \bar{Σ}_n H_n^T + R_n, with H_n and R_n given by the Jacobians of q
  5. Kalman gain:          K_n = \bar{Σ}_n H_n^T S_n^{-1}
  6. Mean state:           µ_n = \bar{µ}_n + K_n ν_n
  7. Error covar.:         Σ_n = (I - K_n H_n) \bar{Σ}_n (I - K_n H_n)^T + K_n R_n K_n^T

a) Prediction: In this step, the projection of the posterior bel_{n-1} ∼ N(µ_{n-1}, Σ_{n-1}) through the kinematic model (8) gives the predicted posterior \bar{bel}_n ∼ N(\bar{µ}_n, \bar{Σ}_n), before incorporating the measurement. The state mean and error covariance are predicted according to lines 1-2 of Algorithm 1. Uncertainty is propagated through the system by means of the Jacobians of (8), F_n = ∂f/∂x_{n-1} and L_n = ∂f/∂w_n, evaluated at the current best estimate (µ_{n-1}, w_n).

b) Correction: This is the data assimilation step, where the predicted posterior \bar{bel}_n ∼ N(\bar{µ}_n, \bar{Σ}_n) is combined with the measurement z_n to yield the updated posterior bel_n ∼ N(µ_n, Σ_n). The state mean and error covariance are corrected according to lines 3-7 of Algorithm 1. Events from the DVS are fed to the generative sensor equation (10) to produce a residual that drives the update of the filter variables. With regard to Figs. 2d and 3d, the correction step changes the state such that the likelihood at the event position increases (white region). The innovation process and its covariance (lines 3-4 of Algorithm 1) are obtained by linearization of the implicit measurement function (10) around the current best estimate (z_n, \bar{µ}_n) (see [17], [18]). Uncertainty is corrected in the system (up to first order) by means of the Jacobians of (10), H_n = ∂q/∂x_n and D_n = ∂q/∂z_n, evaluated at (z_n, \bar{µ}_n; M), with measurement noise covariance [17] R_n := D_n Q_n^η D_n^T. Since q is a real value, both the noise and innovation covariances (R_n and S_n) are scalars.
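For reference, the following is a compact sketch of one iteration of Algorithm 1 (Python with NumPy; the motion model, measurement function, and their Jacobians are passed in as arguments, and the state is treated as a flat vector for simplicity, whereas the paper composes the rotation part on SO(3) as in (8)); it is an illustration under the paper's equations, not the authors' implementation:

```python
import numpy as np

def iekf_iteration(mu, Sigma, f, q, F, L, H, D, Q_w, Q_eta, z):
    """One implicit-EKF iteration (Algorithm 1).

    mu, Sigma  : prior mean (d-vector) and covariance (d x d)
    f          : motion model, Eq. (8);  q : scalar implicit measurement, Eq. (10)
    F, L       : Jacobians of f w.r.t. state and process noise
    H, D       : Jacobians of q w.r.t. state and measurement (1 x d, 1 x m rows)
    Q_w, Q_eta : process and measurement noise covariances
    z          : measurement (the event being processed)
    """
    # Prediction (lines 1-2); the process noise enters at its zero mean.
    mu_bar = f(mu, np.zeros(Q_w.shape[0]))
    Sigma_bar = F @ Sigma @ F.T + L @ Q_w @ L.T

    # Correction (lines 3-7); innovation and its covariance are scalars.
    nu = -q(z, mu_bar)
    R = (D @ Q_eta @ D.T).item()
    S = (H @ Sigma_bar @ H.T).item() + R
    K = (Sigma_bar @ H.T) / S                      # Kalman gain (d x 1)
    mu_new = mu_bar + (K * nu).ravel()
    I_KH = np.eye(len(mu)) - K @ H
    Sigma_new = I_KH @ Sigma_bar @ I_KH.T + K @ (R * K.T)   # Joseph form (line 7)
    return mu_new, Sigma_new
```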
1) Data association: An additional advantage of our approach is that there is no data association step as in the classical localization problem (associating predicted measurements with actual ones), which removes a challenging sub-problem and a common source of brittleness in localization and mapping with the EKF. This is a consequence of using a dense map (as opposed to a set of isolated landmarks) to represent the scene and of designing a measurement equation (10) that exploits such a representation. There is no data association problem because a correspondence between the event location and a map point, p_n ↔ X_n, always exists, and it can be computed via ray-tracing. The errors caused by a mismatch between the true surface point \bar{X}_n that triggered the event and the predicted one X_n are implicitly taken into account in the EKF via the innovation (10) and its covariance. For example, the value of the gradient g in the neighborhood of the event changes (with some degree of smoothness), and if the predicted value does not yield the triggering of an event, the EKF adjusts the state parameters so that a different surface point X_n becomes more likely to trigger the observed event. There is no need to artificially search for a 3-D point, close to the predicted one, that better explains the event.

IV. EXPERIMENTS

A. Synthetic data

The proposed method was tested with synthetic and real data. The synthetic data was generated using computer graphics software (Blender⁴) to render images of a given map along a specified trajectory. Adjacent images were subtracted, thresholded, and randomly sampled to simulate the events generated by a DVS. We chose a pinhole camera model with intrinsics identical to those of a lens from the real experiments: a 2.6 mm lens for a 1/3" sensor. A linear trajectory with constant acceleration was simulated. Results are reported in Fig. 4. Groups of 500 events were generated every 8 ms between adjacent images, and the algorithm processed 230k events. This experiment validated the measurement function (10), since the kinematic model (8) alone cannot predict the DVS motion. The results show that the filter successfully estimated the DVS pose and velocity, with small relative errors (Fig. 4c).

⁴https://www.blender.org/

Fig. 4. Constant acceleration experiment. (a) Estimated position. (b) Estimated velocity. (c) Relative errors in position and velocity between the simulated trajectory and the estimated one.
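The sketch below illustrates this kind of frame-differencing event simulation (Python with NumPy; the thresholding and random sampling follow the description above, but the function, its parameters, and the returned tuple format are our own illustrative choices):

```python
import numpy as np

def simulate_events(img_prev, img_curr, t_prev, t_curr,
                    C=0.15, max_events=500, rng=None):
    """Simulate DVS events between two adjacent rendered images.

    Pixels whose log-intensity change exceeds the threshold C are candidate
    events; a random subset of at most `max_events` is kept, with time-stamps
    drawn uniformly in (t_prev, t_curr]. Returns (x, y, t, p) tuples.
    """
    rng = rng or np.random.default_rng()
    dL = np.log(img_curr + 1e-6) - np.log(img_prev + 1e-6)   # contrast per pixel
    ys, xs = np.nonzero(np.abs(dL) >= C)                     # pixels above threshold
    keep = rng.permutation(len(xs))[:max_events]             # random sampling
    events = [(int(xs[i]), int(ys[i]),
               float(rng.uniform(t_prev, t_curr)),
               int(np.sign(dL[ys[i], xs[i]])))
              for i in keep]
    return sorted(events, key=lambda e: e[2])                # time-ordered stream
```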
B. Real data

For the experiment with real data, we mounted the DVS on a model train that runs on a straight track at constant velocity. The DVS faced sideways and observed a planar scene at a constant distance. The scene contains a pattern of complex black and white stripes and a set of circles at known locations; the latter were used for extrinsic calibration. The DVS was intrinsically calibrated using standard camera calibration techniques on the image points detected from the projection of an array of blinking LEDs placed in a checkerboard configuration. Horizontal edges are parallel to the apparent motion and, consequently, do not trigger events. The intensities of the map were smoothed to provide non-zero gradients in the regions near the sharp edges that generate events, and hence to smooth the response of the contrast function (10) and the corresponding likelihood in such regions.

Fig. 5. Experiment with approximately constant-velocity motion. (a) Visualization of a few events from the DVS (positive events in cyan, negative events in magenta) used for filter initialization, overlaid on the rendered map. (b) Time since the last event at each pixel (Δt_n in (10)). (c) Normalized histogram of the absolute contrast in (10) (solid line) and Gaussian fit (dashed line) (cf. Fig. 1c); the mode of the Gaussian corresponds to the threshold C. (d) Innovation sequence ν_n. Estimated position (e) and velocity (f) of the event-based camera.

Fig. 5 reports some of the results of this experiment. Fig. 5b shows, for a few hundred events (Fig. 5a), the measured absolute contrast |Δ\tilde{I}| ≈ -p_n ⟨g, \dot{u}⟩Δt_n used in the implicit measurement function (10). Since the map intensities are given in arbitrary units (log of gray levels) and physical measurements of the incoming light that the DVS used to trigger the events are lacking, the threshold values in Fig. 1c (≈0.2) are not applicable to the map, and so a few events are used to estimate the threshold C corresponding to the given map. The filter processed about 100k events and successfully estimated the pose and velocities of the DVS throughout the event stream. Figs. 2 and 3 were also obtained from this experiment.

V. CONCLUSION

We have developed an implicit EKF for event-based camera (DVS) localization based on the contrast residual (10), which provides a natural measure to define the likelihood of an event. For this, we derived a generative event model that incorporates the physical characteristics of the DVS. Our algorithm readily matches the asynchronous nature of the events and allows filter updates on an event-by-event basis. An additional advantage of our approach is that the contrast residual naturally accommodates a dense map representation of the environment, removing the data-association sub-problem. In future work, we plan to extend the developed method to event-based SLAM without additional sensing.

REFERENCES

[1] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," IEEE J. of Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
[2] D. Weikersdorfer and J. Conradt, "Event-based Particle Filtering for Robot Self-Localization," in IEEE Int. Conf. on Robotics and Biomimetics (ROBIO), 2012.
[3] D. Weikersdorfer, R. Hoffmann, and J. Conradt, "Simultaneous Localization and Mapping for event-based Vision Systems," in Int. Conf. on Computer Vision Systems (ICVS), 2013.
[4] R. Benosman, C. Clercq, X. Lagorce, S.-H. Ieng, and C. Bartolozzi, "Event-Based Visual Flow," IEEE Trans. Neural Networks and Learning Systems, vol. 25, no. 2, pp. 407–417, 2014.
[5] A. Censi and D. Scaramuzza, "Low-Latency Event-Based Visual Odometry," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2014.
[6] E. Mueggler, B. Huber, and D. Scaramuzza, "Event-based, 6-DOF Pose Tracking for High-Speed Maneuvers," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2014.
[7] M. Cook, L. Gugelmann, F. Jug, C. Krautz, and A. Steger, "Interacting maps for fast visual interpretation," in Int. Joint Conf. on Neural Networks (IJCNN), Jul. 2011, pp. 770–776.
[8] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. J. Davison, "Simultaneous Mosaicing and Tracking with an Event Camera," in British Machine Vision Conf. (BMVC), 2014.
[9] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt, "Event-based 3D SLAM with a depth-augmented dynamic vision sensor," in IEEE Int. Conf. on Robotics and Automation (ICRA), Jun. 2014, pp. 359–364.
[10] G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor, and R. Benosman, "HFirst: A Temporal Approach to Object Recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PP, no. 99, pp. 1–1, 2015.
[11] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, Cambridge, MA, 2005.
[12] F. Chaumette and S. Hutchinson, "Visual Servoing and Visual Tracking," in Springer Handbook of Robotics, B. Siciliano and O. Khatib, Eds. Springer, 2008.
[13] Y. Ma, S. Soatto, J. Košecká, and S. S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models. Springer, 2004.
[14] P. Corke, Robotics, Vision and Control: Fundamental Algorithms in MATLAB, ser. Springer Tracts in Advanced Robotics. Springer, 2011.
[15] A. Chiuso, P. Favaro, H. Jin, and S. Soatto, "Structure from motion causally integrated over time," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 523–535, 2002.
[16] A. Davison, "Real-time simultaneous localisation and mapping with a single camera," in Int. Conf. on Computer Vision (ICCV), 2003, pp. 1403–1410, vol. 2.
[17] Z. Zhang and O. D. Faugeras, "Finding clusters and planes from 3D line segments with application to 3D motion determination," in Eur. Conf. on Computer Vision (ECCV), ser. Lecture Notes in Computer Science, vol. 588. Springer, 1992, pp. 227–236.
[18] S. Soatto, R. Frezza, and P. Perona, "Recursive Motion Estimation on the Essential Manifold," California Institute of Technology, Tech. Rep. CaltechCDSTR:1993.021, 1993.