Event-based Camera Pose Tracking using a Generative Event Model

Guillermo Gallego, Christian Forster, Elias Mueggler, Davide Scaramuzza

The authors are with the Robotics and Perception Group, Department of Informatics, University of Zurich, Zurich 8050, Switzerland. http://rpg.ifi.uzh.ch, e-mail: guillermo.gallego@ifi.uzh.ch. This research was supported by the National Centre of Competence in Research (NCCR) Robotics, the ERC-SNSF Starting Grant, and a Google Faculty Research Award.

Abstract—Event-based vision sensors mimic the operation of the biological retina and represent a major paradigm shift from traditional cameras. Instead of providing frames of intensity measurements synchronously, at artificially chosen rates, event-based cameras provide information on brightness changes asynchronously, as they occur. Such non-redundant pieces of information are called "events". These sensors overcome some of the limitations of traditional cameras (response time, bandwidth and dynamic range) but require new methods to deal with the data they output. We tackle the problem of event-based camera localization in a known environment, without additional sensing, using a probabilistic generative event model in a Bayesian filtering framework. Our main contribution is the design of the likelihood function used in the filter to process the observed events. Based on the physical characteristics of the sensor and on empirical evidence of the Gaussian-like distribution of spiked events with respect to the brightness change, we propose to use the contrast residual as a measure of how well the estimated pose of the event-based camera and the environment explain the observed events. The filter allows for localization in the general case of six-degree-of-freedom motions.

Index Terms—Event-based, Dynamic Vision Sensor, generative model, spiking model, robot localization, pose tracking, Kalman filter.

I. INTRODUCTION

Recently, event-based cameras such as the Dynamic Vision Sensor (DVS) [1] have attracted a lot of attention from both the robotics and vision communities [2], [3], [4], [5], [6], [7], [8], [9], [10]. These bio-inspired sensors overcome some of the limitations of traditional image sensors: they respond very quickly (within microseconds) to brightness changes, have a very high dynamic range (120 dB compared to 60 dB of standard cameras), and require low bandwidth [1]. Hence, they are very promising sensors for high-speed visual applications in challenging scenes with large brightness contrast. However, the output of these cameras (a stream of events) is fundamentally different from that of traditional ones, so a paradigm shift is required to design algorithms that exploit the potential of these vision sensors. Examples of such emerging event-based algorithms are: event-based optical flow [4], visual odometry [5], localization [2], [6], Simultaneous Localization and Mapping (SLAM) [3], [9], mosaicing [7], [8], object recognition [10], etc.

We address the localization problem of a moving event-based camera in a known environment. One of the first works in this respect is [2], where a particle-filter system limited to planar motions and 2-D maps was introduced. In the experiments, an upward-looking DVS mounted on a ground robot moving at low speed was used, and the map used for navigation consisted of line segments on the ceiling.
In [5], a probabilistic filtering approach was designed to localize a DVS moving on a plane with respect to the temporally closest pair of frames provided by an additional RGB-D camera attached to the DVS. An algorithm to track the 6-DOF pose of the DVS with no additional sensing during high-speed maneuvers was given in [6]. They used a map consisting of the edges of a black square of known size and minimized the event-to-line reprojection distance to estimate the DVS pose.

We propose an implicit Extended Kalman Filter (EKF) approach [11] to localize the DVS with respect to a given dense map of the 3-D scene (consisting of geometric and photometric information) without additional sensing (as in [2], [6], [8]), just using the information contained in the event stream. The map is not constrained to consist only of lines, thus it is more general than those in [2], [6], and it is also richer in brightness changes than the barcoded scenes in [5]. We allow for localization in the general case of 6-DOF motion of the DVS and design the filter accordingly. Our main contribution pertains to the design of the likelihood function used in the correction step of the EKF to process the observed events (Section III-B), by measuring how well the system state (DVS pose and velocity) and the map explain an event from the DVS using a contrast residual. To do so, we first derive a simple yet compelling model for event generation (Section II-A). The technique is demonstrated on synthetic and real data in Section IV.

II. DYNAMIC VISION SENSOR (DVS): GENERATIVE EVENT MODEL

In contrast to standard cameras, which acquire full frames at fixed rates, event-based vision sensors such as the DVS (Fig. 1a) have independent pixels that spike events at local relative brightness changes in continuous time. A visualization of the output of the DVS is shown in Fig. 1b. Events are time-stamped with microsecond resolution and transmitted asynchronously at the time they occur. Each event is a tuple $e_k = \langle x_k, y_k, t_k, p_k \rangle$, where $x_k, y_k$ are the pixel coordinates of the event, $t_k$ is its time-stamp, and $p_k = \pm 1$ is its polarity (the sign of the brightness change). The sensor's spatial resolution is limited¹ ($128 \times 128$ pixels), but its 120 dB dynamic range notably exceeds the 60 dB of high-quality traditional image sensors.

¹A new generation of event-based sensors with VGA resolution ($640 \times 480$) is being developed by the group [1].
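For concreteness, the event tuple can be represented in code as a small record type. The sketch below is purely illustrative; the `Event` class and its field names are ours and not part of any DVS driver API.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One DVS event <x_k, y_k, t_k, p_k>. Illustrative type, not a DVS driver API."""
    x: int      # pixel column of the brightness change
    y: int      # pixel row of the brightness change
    t: float    # time-stamp in seconds (microsecond resolution on the sensor)
    p: int      # polarity: +1 (brightness increase) or -1 (decrease)

# A hand-made two-event stream, just to fix the data layout.
events = [Event(64, 31, 1.3e-5, +1), Event(65, 31, 4.2e-5, -1)]
```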
Fig. 1. (a) The Dynamic Vision Sensor (DVS). (b) Space-time visualization of the output of a DVS viewing a rotating dot. Colored dots mark individual events. Event polarity is not displayed. Noise is visible as isolated points that are not part of the spiral. (c) The contrast of the DVS events empirically follows a unimodal distribution (e.g., Gaussian-like) centered at a selected threshold $C = |\Delta \log(I)|$ (six threshold settings are shown). Images (b) and (c) are courtesy of [1].

Next, we provide a generative event model for the DVS using a principled derivation of the equations that characterize an ideal sensor. The event model combines several hypotheses (constant brightness, temporal persistence, etc.) with particular characteristics of the DVS. The model is at the heart of the data assimilation in our filtering approach for DVS localization.

A. Scene modeling

Assume that objects in the 3-D world are represented by a surface $S$ with geometric and radiometric properties. Typically, objects are described by a mesh or depth map and a corresponding intensity (i.e., "texture") function (in a Lambertian context). The DVS has the same optics as traditional perspective cameras; therefore, standard models (e.g., pinhole) apply. In camera coordinates, the projection operation is described by $\mathbf{u} = \pi(\mathbf{X})$, mapping a 3-D point $\mathbf{X} = (X, Y, Z)^\top$ into the image point $\mathbf{u} = (u, v)^\top$. Assume a simplified radiance model where each point on the surface $S$ has an intensity, $I_S : S \to \mathbb{R}$, and this is the value observed by the DVS to trigger events, that is, the intensity at the image plane corresponds to the intensity defined on the surface: $I(\mathbf{u}) \doteq I_S(\mathbf{X})$ for 3-D points $\mathbf{X}$ visible from the DVS. Hence, the image plane parametrizes both the image $I$ and the surface $S$ (geometric and photometric properties).

B. 3-D motion and apparent (2-D) motion

The motion of a moving camera is described by a smooth trajectory in the space of Euclidean transformations, $SE(3)$. Let the relative motion between the viewing camera and the scene be described, in the camera coordinate frame, by
$$\frac{d\mathbf{X}}{dt} \equiv \dot{\mathbf{X}}(t) = -\hat{\boldsymbol{\omega}}(t)\,\mathbf{X}(t) - \mathbf{v}(t), \qquad (1)$$
where $\boldsymbol{\omega}$ and $\mathbf{v}$ are the body angular and linear velocities, respectively, and $\hat{\boldsymbol{\omega}}$ is the cross-product matrix: $\hat{\mathbf{a}}\mathbf{b} = \mathbf{a} \times \mathbf{b}$ for all $\mathbf{a}, \mathbf{b}$. The corresponding apparent motion of the 3-D point $\mathbf{X}$ is described by the velocity of the image point $\mathbf{u}$, which comprises the image motion field. Specifically, the equation that relates surface velocity (in the camera frame) to feature velocity (in normalized coordinates) is (see, e.g., [12], [13, Eq. 5.87]), dropping the dependency on $t$,
$$\frac{d\mathbf{u}}{dt} \equiv \dot{\mathbf{u}} = B\,\boldsymbol{\xi}, \qquad (2)$$
where the twist coordinates $\boldsymbol{\xi} = (\mathbf{v}^\top, \boldsymbol{\omega}^\top)^\top$ encode the relative motion and
$$B = \begin{pmatrix} -Z^{-1} & 0 & uZ^{-1} & uv & -(1+u^2) & v \\ 0 & -Z^{-1} & vZ^{-1} & 1+v^2 & -uv & -u \end{pmatrix} \qquad (3)$$
is called the interaction matrix, the image Jacobian matrix for a point feature, or the feature sensitivity matrix [12], [14, pp. 460-462]. Typically, the surface is assumed to admit a depth map representation with respect to the camera, and so the depth of the 3-D point is parametrized in the image plane, $Z \equiv Z(u, v)$. Consequently, $B(u, v)$ is just a function of the surface and the image point.
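As an illustration of (2)-(3), the following Python/NumPy sketch evaluates the interaction matrix and the motion field for a point feature; the function names and the example values are our own, not taken from the paper.

```python
import numpy as np

def interaction_matrix(u, v, Z):
    """Interaction matrix B(u, v, Z) of Eq. (3) for a point feature in
    normalized image coordinates (u, v) with depth Z in the camera frame."""
    return np.array([
        [-1.0 / Z, 0.0,      u / Z, u * v,       -(1.0 + u**2),  v],
        [0.0,      -1.0 / Z, v / Z, 1.0 + v**2,  -u * v,        -u],
    ])

def motion_field(u, v, Z, lin_vel, ang_vel):
    """Apparent velocity of the image point, u_dot = B @ xi, Eq. (2),
    with twist coordinates xi = (v, w) from Eq. (1)."""
    xi = np.concatenate([lin_vel, ang_vel])
    return interaction_matrix(u, v, Z) @ xi

# Example: a point 2 m in front of a camera translating along +X at 0.5 m/s.
u_dot = motion_field(u=0.1, v=-0.05, Z=2.0,
                     lin_vel=np.array([0.5, 0.0, 0.0]),
                     ang_vel=np.zeros(3))
```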
The motion field thus has two separate components, one due to the translation and one due to the rotation of the camera (the first three and last three columns of $B$, respectively).

C. Deterministic generative event model

The standard hypothesis in measuring image motion is that the intensity structure of local time-varying image regions is approximately constant under motion for at least a short duration (temporal persistence). Formally, if $\tilde{I}(\mathbf{u}, t)$ is the space-time image intensity function measured by the DVS, the total derivative $d\tilde{I}/dt$ vanishes for those trajectories $\mathbf{u}(t)$ of constant intensity values, $\tilde{I}(\mathbf{u}(t), t) = \text{const}$, that is,
$$\frac{d\tilde{I}}{dt} = \langle \nabla_{\mathbf{u}} \tilde{I}, \dot{\mathbf{u}} \rangle + \frac{\partial \tilde{I}}{\partial t} = 0, \qquad (4)$$
where $\langle \cdot, \cdot \rangle$ is the dot product, $\nabla_{\mathbf{u}} \tilde{I} = (\partial \tilde{I}/\partial u,\, \partial \tilde{I}/\partial v)^\top$ are the first partial derivatives with respect to the spatial coordinates, and $\dot{\mathbf{u}} = (\dot{u}, \dot{v})^\top$ is the motion field.

The DVS senses brightness logarithmically²: $\tilde{I} = \log(I)$, and it generates an event at a location $\mathbf{u}$ if the amount of intensity (grey level) change $\Delta \log(I)$ during an interval $\Delta t$ (the time since the previous event at the same location), i.e., the contrast
$$\Delta \log(I) \approx \frac{\partial \log(I)}{\partial t}\,\Delta t \overset{(4)}{=} -\langle \nabla_{\mathbf{u}} \log(I), \dot{\mathbf{u}}\,\Delta t \rangle \qquad (5)$$
is larger than a threshold $C$ [1], [5] (typically a 10-15% relative brightness change):
$$|\Delta \log(I)| \approx |\langle \nabla_{\mathbf{u}} \log(I), \dot{\mathbf{u}}\,\Delta t \rangle| \geq C. \qquad (6)$$
Incorporating polarity, if the contrast $\Delta \log(I) \geq C$, a positive event ($p_k = +1$) is generated; if $\Delta \log(I) \leq -C$, a negative event ($p_k = -1$) is generated; otherwise, no event is fired.

²Using the chain rule it is easy to verify that, if $I \neq 0$, both conditions $dI/dt = 0$ and $d\tilde{I}/dt = 0$ are equivalent.

D. Probabilistic generative event model

Equation (6) is a hard-decision model for the generation of events. A more realistic one takes into account sensor noise and manufacturing mismatches, yielding a soft decision represented by a smooth probability function. A characterization of the corresponding probability density, averaged over all DVS pixels, is shown in Fig. 6 of [1] (see Fig. 1c), suggesting a unimodal Gaussian-like distribution, for which the standard deviation is measured in [1] as a function of the threshold $C$. This probabilistic generative event model can be included in a Bayesian filtering approach to process the events, as shown in the next section, where we adopt the simple yet powerful Extended Kalman Filter (EKF), which assumes Gaussian probability distributions to keep a compact and manageable representation of the posterior probability of the DVS pose and velocity.

III. BAYESIAN FILTERING APPROACH

A. State-space design

In the popular Bayesian inference framework given by the EKF [11], we can formulate the DVS localization problem with respect to a map $M$ as that of estimating the state of a system defined by its state-space representation (state and measurement equations). The state equation is a non-linear function $f$ of the state and the process noise,
$$\mathbf{x}_n = f(\mathbf{x}_{n-1}, \mathbf{w}_n), \quad \text{with } \mathbf{x} = (\mathbf{t}^\top, \mathbf{r}^\top, \mathbf{v}^\top, \boldsymbol{\omega}^\top)^\top. \qquad (7)$$
As usual, the subscripts $\{n-1, n\}$ denote temporal references. The process noise $\mathbf{w}_n$ is not additive and is assumed to be zero-mean multivariate Gaussian distributed with covariance $Q^w_n$. The state vector describes the DVS pose (position and orientation) and its velocity: $\mathbf{t}$ is the position of the optical center of the DVS, in world coordinates; $\mathbf{r}$ is the rotation vector parametrizing the orientation of the DVS by means of the exponential coordinates (as in the filter proposed by [15]) of the rotation matrix from the world to the camera frame, $R = \exp(\hat{\mathbf{r}})$; and the linear and angular velocities (1), $(\mathbf{v}, \boldsymbol{\omega})$, are given in world and camera (body) coordinates, respectively.

We chose the motion model $f$ given by the constant-velocity model, which is typical of SLAM approaches [16]. This accounts for general smooth motions of the DVS. By integration of the continuous motion over a time interval³ $\Delta t = t_n - t_{n-1}$ and discretization, (7) becomes
$$\begin{cases} \mathbf{t}_n = \mathbf{t}_{n-1} + (\mathbf{v}_{n-1} + \mathbf{V})\,\Delta t, \\ \mathbf{r}_n = \big(\log\big(\exp\big((\boldsymbol{\omega}_{n-1} + \boldsymbol{\Omega})^{\wedge}\,\Delta t\big)\, \exp(\hat{\mathbf{r}}_{n-1})\big)\big)^{\vee}, \\ \mathbf{v}_n = \mathbf{v}_{n-1} + \mathbf{V}, \\ \boldsymbol{\omega}_n = \boldsymbol{\omega}_{n-1} + \boldsymbol{\Omega}, \end{cases} \qquad (8)$$
where the noise is $\mathbf{w}_n = (\mathbf{V}^\top, \boldsymbol{\Omega}^\top)^\top$. The $\log$ and $\exp$ operators refer to the rotation group $SO(3)$: $\Delta R = \exp\big((\boldsymbol{\omega}_{n-1} + \boldsymbol{\Omega})^{\wedge}\,\Delta t\big)$ is the incremental rotation of angle $\theta = \|\boldsymbol{\omega}_{n-1} + \boldsymbol{\Omega}\|\,\Delta t$ around the axis defined by the vector $(\boldsymbol{\omega}_{n-1} + \boldsymbol{\Omega})$, $\mathbf{u}^{\wedge}$ is the cross-product matrix associated to a 3-vector $\mathbf{u}$, and $S^{\vee}$ is the 3-vector associated to a $3 \times 3$ skew-symmetric matrix $S$.

³Here $\Delta t$ is the time between prediction steps in the EKF, which may or may not coincide with the time between events at the same location in (5), depending on whether events are processed in packets or individually.
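A minimal sketch of this hard-decision model, Eqs. (5)-(6), under the linearization above; the default threshold value is only indicative of the 10-15% range quoted in the text, and all names are ours.

```python
import numpy as np

def predicted_contrast(grad_log_I, u_dot, dt):
    """Linearized contrast of Eq. (5): Delta log(I) ~= -<grad log(I), u_dot> * dt."""
    return -float(np.dot(grad_log_I, u_dot)) * dt

def generate_event(grad_log_I, u_dot, dt, C=0.15):
    """Hard-decision model of Eq. (6): returns the polarity +1 or -1 of the
    fired event, or None if the contrast does not reach the threshold C."""
    contrast = predicted_contrast(grad_log_I, u_dot, dt)
    if contrast >= C:
        return +1
    if contrast <= -C:
        return -1
    return None
```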
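One way to implement a single propagation step of (8) is sketched below. It relies on scipy.spatial.transform.Rotation for the SO(3) exponential and logarithm maps; the function signature (in particular how the noise samples are passed in) is our own choice, not prescribed by the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def propagate_state(t, r, v, w, dt, V=None, W=None):
    """One step of the constant-velocity model, Eq. (8).
    t: position, r: rotation vector (exponential coordinates),
    v, w: linear and angular velocity, dt: time between prediction steps,
    V, W: process-noise samples (zero by default, i.e. the noise-free mean)."""
    V = np.zeros(3) if V is None else V
    W = np.zeros(3) if W is None else W
    t_n = t + (v + V) * dt
    # R_n = exp((w + W)^ dt) R_{n-1}, mapped back to exponential coordinates.
    dR = Rotation.from_rotvec((w + W) * dt)
    r_n = (dR * Rotation.from_rotvec(r)).as_rotvec()
    return t_n, r_n, v + V, w + W
```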
B. Implicit measurement equation

In the standard EKF, the likelihood is specified by an equation $\mathbf{z}_n = h(\mathbf{x}_n) + \boldsymbol{\eta}_n$, where the observations $\mathbf{z}_n$ are explicitly written in terms of the state and the observation noise $\boldsymbol{\eta}_n$. This is the formulation used in classical visual localization and SLAM, where $\mathbf{z}_n$ consists of the image coordinates of sensed map landmarks, and $h$ predicts the observations by using the camera model to project the landmarks. This design choice implies Gaussian image-coordinate noise, and it may also be applied to DVS localization [2]. However, it does not take into account the generative event model (such as (6)). In a different (non-localization) context, an alternative approach is given in [8] to estimate the intensity gradient at each pixel: $\mathbf{z}_n$ consists of event rates and a generative model is used to write such an explicit dependency. This design choice implies that the temporal (event-rate) noise is Gaussian, which is an arbitrary choice.

We depart from the previous explicit models (spatial or temporal measurements) and propose an implicit measurement equation
$$q(\mathbf{z}_n, \mathbf{x}_n) = 0 \qquad (9)$$
to quantify how well the event generation model (6) is satisfied. This leads to an implicit EKF [17], [18]. Our design choice assumes that the deviation of the contrast from the nominal value that fires events is Gaussian, which Fig. 1c suggests to be the case. A similar unimodal density function is given in [8], but only for the correction step of rotation tracking.

Assuming constant illumination and independence of the observations, each event $e_n = \langle u_n, v_n, t_n, p_n \rangle$ is caused by a brightness change at pixel $\mathbf{p}_n = (u_n, v_n)^\top$, depending on both the DVS state $\mathbf{x}_n \equiv \mathbf{x}(t_n)$ and the map $M$. Thus, a more rigorous description than (9) is $q = q(\mathbf{z}_n, \mathbf{x}_n; M)$, because an event is an observation of some map point. Letting $\mathbf{g}$ be a shorthand notation for the spatial gradient $\nabla_{\mathbf{u}} \log(I)$ in (6), we define the implicit function $q$ as the difference between the absolute contrast (5) and the nominal threshold, $q = |\Delta \log(I)| - C$.
Fig. 2. Neighborhood of an event $e_n = \langle x_n, y_n, p_n, t_n \rangle$ triggered by a moving edge. The DVS is moving horizontally to the right (positive $X$ direction). Top row: positive event (dark-to-bright transition). Bottom row: negative event (bright-to-dark transition). (a) Rendering of the map on the DVS image plane, $\tilde{I}(t_n - \Delta t_n)$; the event $\mathbf{p}_n = (x_n, y_n)^\top$ is at the center of the patch. The motion field $\dot{\mathbf{u}}$ (magenta vectors) points toward the negative $X$ direction. The image gradient $\mathbf{g} = \nabla_{\mathbf{u}} \tilde{I}(t_n - \Delta t_n)$ (perpendicular to the edge) is displayed with cyan vectors. (b) Predicted neighborhood $\tilde{I}(t_n) \approx \tilde{I}(t_n - \Delta t_n) + \Delta\tilde{I}$. (c) Contrast $\Delta\tilde{I} \approx -\langle \mathbf{g}, \dot{\mathbf{u}} \rangle\,\Delta t_n$. (d) The implicit measurement function $q$ in (10) has the same shape as the absolute contrast, $|\Delta\tilde{I}| \approx -p_n \langle \mathbf{g}, \dot{\mathbf{u}} \rangle\,\Delta t_n$, which defines the likelihood that the event was triggered.

Substituting $|y| = y\,\mathrm{sgn}(y)$ for $y = \Delta \log(I)$ and replacing $\mathrm{sgn}(y)$ by the measured polarity $p_n$, we use (6) to define
$$q(\mathbf{z}_n, \mathbf{x}_n; M) = -p_n\, \langle \mathbf{g}, \dot{\mathbf{u}} \rangle\big|_{(\mathbf{p}_n, \mathbf{x}_n, \mathbf{X}_n)}\, \Delta t_n - C, \qquad (10)$$
where $\Delta t_n = t_n - t_{\mathrm{prev}}$ is the time span since the previous event at the same location $\mathbf{p}_n$, and the inner product between the gradient $\mathbf{g}$ and the motion field $\dot{\mathbf{u}}$ depends on the event location $\mathbf{p}_n$, its corresponding 3-D point $\mathbf{X}_n$, and the state $\mathbf{x}_n$. Specifically, $\mathbf{g}$ depends on the DVS pose only (but not on its velocity), via the perspective projection between the map point $\mathbf{X}_n$ and the point $\mathbf{p}_n$, whereas the motion field (2) depends on both the DVS pose (depth $Z$ of $\mathbf{X}_n$ with respect to the sensor) and its velocities (twist coordinates). The gradient $\mathbf{g}$ may be computed by taking the spatial derivatives of the predicted image intensities $I$ in a neighborhood of the current event location $\mathbf{p}_n$, obtained by rendering the dense map $M$ according to the DVS pose in the current state.

Examples of the contrast function for positive and negative events are shown in Fig. 2. Patches of $15 \times 15$ pixels around the event location are displayed, but the local analysis of the generative event model is only reliable close to the center. Fig. 3 reports the cases of moving edges parallel or almost perpendicular to the apparent motion, yielding the largest and smallest absolute contrast, respectively.
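A sketch of how the contrast residual (10) could be evaluated for one event is given below; the gradient helper assumes a small log-intensity patch rendered from the map at the predicted pose, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def gradient_at_center(log_I_patch):
    """Spatial gradient g = grad log(I) at the center of a small patch of the
    map rendered at the predicted DVS pose (finite differences)."""
    gy, gx = np.gradient(log_I_patch)      # derivatives along rows (y) and columns (x)
    c = log_I_patch.shape[0] // 2
    return np.array([gx[c, c], gy[c, c]])

def contrast_residual(p_n, g, u_dot, dt_n, C):
    """Implicit measurement function q of Eq. (10):
    q = -p_n * <g, u_dot> * dt_n - C, i.e. the predicted absolute contrast at
    the event pixel minus the nominal threshold C."""
    return -p_n * float(np.dot(g, u_dot)) * dt_n - C
```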
C. Recursive solution: Implicit EKF equations

Once the system state and measurement equations have been designed, the update equations of the parameters of the posterior in the EKF are also determined. The recursive estimation carried out in the EKF is described by the equations in Algorithm 1. We follow the notation in [11] for the posteriors and their moments. The DVS pose tracking filter also assumes that an accurate estimate of the initial configuration, with relatively small uncertainties, is given, $(\boldsymbol{\mu}_0, \Sigma_0)$. Let us further explain the steps of Algorithm 1.

a) Prediction: In this step, the projection of the posterior $\mathrm{bel}_{n-1} \sim \mathcal{N}(\boldsymbol{\mu}_{n-1}, \Sigma_{n-1})$ through the kinematic model (8) gives the predicted posterior $\overline{\mathrm{bel}}_n \sim \mathcal{N}(\bar{\boldsymbol{\mu}}_n, \bar{\Sigma}_n)$ before incorporating the measurement. The state mean and error covariance are predicted according to lines 1-2 in Algorithm 1. Uncertainty is propagated through the system by means of the Jacobians of (8), $F_n = \partial f / \partial \mathbf{x}_{n-1}$ and $L_n = \partial f / \partial \mathbf{w}_n$, evaluated at the current best estimate, $(\boldsymbol{\mu}_{n-1}, \mathbf{w}_n)$.

b) Correction: This is the data assimilation step, where the predicted posterior $\overline{\mathrm{bel}}_n \sim \mathcal{N}(\bar{\boldsymbol{\mu}}_n, \bar{\Sigma}_n)$ is combined with the measurement $\mathbf{z}_n$ to yield the updated posterior $\mathrm{bel}_n \sim \mathcal{N}(\boldsymbol{\mu}_n, \Sigma_n)$. The state mean and error covariance are corrected according to lines 3-7 in Algorithm 1. Events from the DVS are fed to the generative sensor equation (10) to produce a residual that drives the update of the filter variables. With regard to Figs. 2d and 3d, the correction step changes the state such that the likelihood at the event position increases (white region). The innovation process and its covariance (lines 3-4 in Algorithm 1) are obtained by linearization of the implicit measurement function (10) around the current best estimate, $(\mathbf{z}_n, \bar{\boldsymbol{\mu}}_n)$ (see [17], [18]). Uncertainty is corrected in the system (up to first order) by means of the Jacobians of (10), evaluated at $(\mathbf{z}_n, \bar{\boldsymbol{\mu}}_n; M)$: $H_n = \partial q / \partial \mathbf{x}_n$ and $D_n = \partial q / \partial \mathbf{z}_n$, with the covariance of the measurement noise [17] $R_n := D_n Q^{\eta}_n D_n^\top$. Since $q$ is a real value, both the noise and the innovation covariances ($R_n$ and $S_n$) are scalars.
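These prediction and correction steps are summarized in Algorithm 1 below. The following sketch mirrors one such iteration; the callables f and q stand for the motion model (8) and the implicit measurement function (10), F, L, H, D for their Jacobians, and their signatures are assumptions made purely for illustration.

```python
import numpy as np

def implicit_ekf_step(mu, Sigma, f, F, L, Q_w, q, H, D, Q_eta, z, dt):
    """One iteration of the implicit EKF (cf. Algorithm 1). The callables and
    their signatures are illustrative; the paper derives the Jacobians analytically."""
    # Prediction (lines 1-2): propagate mean and covariance through the motion model.
    mu_bar = f(mu, np.zeros(6), dt)                  # process noise at its zero mean
    Fn, Ln = F(mu, dt), L(mu, dt)
    Sigma_bar = Fn @ Sigma @ Fn.T + Ln @ Q_w @ Ln.T

    # Correction (lines 3-7): q is scalar, so S_n and R_n are scalars too.
    nu = -q(z, mu_bar)                               # innovation
    Hn = H(z, mu_bar).reshape(1, -1)                 # dq/dx, row vector
    Dn = D(z, mu_bar).reshape(1, -1)                 # dq/dz
    R = float(Dn @ Q_eta @ Dn.T)                     # measurement-noise variance
    S = float(Hn @ Sigma_bar @ Hn.T) + R             # innovation variance
    K = (Sigma_bar @ Hn.T) / S                       # Kalman gain, column vector
    mu_new = mu_bar + (K * nu).ravel()
    I_KH = np.eye(len(mu_new)) - K @ Hn
    Sigma_new = I_KH @ Sigma_bar @ I_KH.T + (K * R) @ K.T   # Joseph form, line 7
    return mu_new, Sigma_new
```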
Fig. 3. Neighborhood of an event triggered by a moving edge. Same notation as in Fig. 2. Top row: at the event location, the image gradient $\mathbf{g}$ is parallel to the motion field $\dot{\mathbf{u}}$. Bottom row: $\mathbf{g}$ is almost perpendicular to $\dot{\mathbf{u}}$. Both rows correspond to a negative event.

Algorithm 1: Extended Kalman Filter (EKF) equations for one iteration, $(\boldsymbol{\mu}_{n-1}, \Sigma_{n-1}) \to (\boldsymbol{\mu}_n, \Sigma_n)$, with implicit measurement function $q$.
1. Mean state (pred.): $\bar{\boldsymbol{\mu}}_n = f(\boldsymbol{\mu}_{n-1}, \mathbf{w}_n)$
2. Error covar. (pred.): $\bar{\Sigma}_n = F_n \Sigma_{n-1} F_n^\top + L_n Q^w_n L_n^\top$, with $F_n$, $L_n$ the Jacobians of $f$.
3. Innovation: $\nu_n = -q(\mathbf{z}_n, \bar{\boldsymbol{\mu}}_n)$
4. Innovation covar.: $S_n = H_n \bar{\Sigma}_n H_n^\top + R_n$, with $H_n$ and $R_n$ given by the Jacobians of $q$.
5. Kalman gain: $K_n = \bar{\Sigma}_n H_n^\top S_n^{-1}$
6. Mean state: $\boldsymbol{\mu}_n = \bar{\boldsymbol{\mu}}_n + K_n \nu_n$
7. Error covar.: $\Sigma_n = (I - K_n H_n)\bar{\Sigma}_n (I - K_n H_n)^\top + K_n R_n K_n^\top$

1) Data association: An additional advantage of our approach is that there is no data association as in the classical localization problem (associating predicted measurements to actual ones), thus removing a challenging sub-problem and common source of brittleness in localization and mapping with the EKF. This is a consequence of using a dense map (as opposed to a set of isolated landmarks) to represent the scene and of designing a measurement equation (10) that exploits such a representation. There is no data association problem because a correspondence between the event location and a map point, $\mathbf{p}_n \leftrightarrow \mathbf{X}_n$, will always exist, and it can be computed via ray-tracing. The errors caused by a mismatch between the true surface point $\bar{\mathbf{X}}_n$ that triggered the event and the predicted one $\mathbf{X}_n$ are implicitly taken into account in the EKF via the innovation (10) and its covariance. For example, the value of the gradient $\mathbf{g}$ in the neighborhood of the event will change (with some degree of smoothness), and if the predicted value does not yield the triggering of an event, the EKF adjusts the state parameters so that a different surface point $\mathbf{X}_n$ will be more likely to trigger the observed event. There is no need to artificially search for a 3-D point, close to the predicted one, that better explains the event.

IV. EXPERIMENTS

A. Synthetic data

The proposed method was tested with synthetic and real data. The synthetic data was generated using computer graphics software (Blender⁴) to render images of a given map along a specified trajectory. Adjacent images were subtracted, thresholded and randomly sampled to simulate the events generated by a DVS. We chose a pinhole camera model with intrinsics identical to those of a lens from the real experiments: a 2.6 mm lens for a 1/3" sensor. A linear trajectory with constant acceleration was simulated. Results are reported in Fig. 4. Groups of 500 events every 8 ms were generated between adjacent images. The algorithm processed 230k events. This experiment validated the measurement function (10), since the kinematic model (8) alone cannot predict the DVS motion. The results show that the filter successfully estimated the DVS pose and velocity, with small relative errors (Fig. 4c).

⁴https://www.blender.org/
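The exact thresholding and sampling parameters of this simulation are not spelled out in the text; the sketch below only illustrates the described procedure (subtract adjacent rendered log-intensity frames, threshold, randomly sample), with parameters chosen purely for illustration.

```python
import numpy as np

def simulate_events(log_I_prev, log_I_curr, t_prev, t_curr,
                    C=0.15, keep_fraction=0.5, rng=None):
    """Rough DVS simulation from two adjacent rendered log-intensity frames:
    threshold the frame difference and randomly sample the firing pixels."""
    rng = np.random.default_rng(0) if rng is None else rng
    dlogI = log_I_curr - log_I_prev
    ys, xs = np.nonzero(np.abs(dlogI) >= C)           # pixels exceeding the threshold
    keep = rng.random(xs.size) < keep_fraction        # random subsampling
    events = [(int(x), int(y), float(rng.uniform(t_prev, t_curr)),
               int(np.sign(dlogI[y, x])))
              for x, y in zip(xs[keep], ys[keep])]
    return sorted(events, key=lambda e: e[2])         # event stream ordered by time
```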
Fig. 4. Constant acceleration experiment. (a) Estimated position. (b) Estimated velocity. (c) Relative errors in position and velocity between the simulated trajectory and the estimated one.

Fig. 5. Experiment with approximately constant velocity motion. (a) Visualization of a few events from the DVS (positive events in cyan, negative events in magenta) used for filter initialization, overlaid on the rendered map. (b) Time since the last event at each pixel ($\Delta t_n$ in (10)). (c) Normalized histogram of the absolute contrast in (10) (solid line) and Gaussian fit (dashed line) (cf. Fig. 1c). The mode of the Gaussian corresponds to the threshold $C$. (d) Innovation sequence $\nu_n$. Estimated position (e) and velocity (f) of the event-based camera.

B. Real data

For the experiment with real data, we mounted the DVS on a model train that runs on a straight track with constant velocity. The DVS faced sideways and observed a planar scene at a constant distance. The scene contains a pattern of complex black and white stripes and a set of circles at known locations; the latter were used for extrinsic calibration. The DVS was intrinsically calibrated using standard camera calibration techniques on the imaged points detected from the projection of an array of blinking LEDs placed in a checkerboard configuration. Horizontal edges are parallel to the apparent motion and, consequently, do not trigger events. The intensities of the map were smoothed to provide non-zero gradients in the regions near the sharp edges that generate events, and hence to smooth the response of the contrast function (10) and the corresponding likelihood in such regions.

Fig. 5 reports some of the results of this experiment. Fig. 5b shows, for a few hundred events (Fig. 5a), the measured absolute contrast $|\Delta\tilde{I}| \approx -p_n \langle \mathbf{g}, \dot{\mathbf{u}} \rangle\,\Delta t_n$ used in the implicit measurement function (10). Since the map intensities are given in arbitrary units (log of gray levels) and physical measurements of the incoming light that the DVS used to trigger the events are lacking, the threshold values in Fig. 1c ($\approx 0.2$) are not applicable to the map, and so a few events are used to estimate the threshold $C$ corresponding to the given map. The filter processed about 100k events and successfully estimated the pose and velocities of the DVS throughout the event stream. Figs. 2 and 3 were also obtained from this experiment.
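The paper does not detail the fitting procedure for this threshold; a minimal sketch of one way to estimate C from such an initial batch of events, under the Gaussian assumption of Fig. 5c, could look as follows (names are ours).

```python
import numpy as np

def estimate_threshold(abs_contrasts):
    """Estimate the contrast threshold C from the absolute contrasts
    |<g, u_dot>| * dt_n of an initial batch of events (cf. Fig. 5c):
    under a Gaussian model, the mode coincides with the sample mean."""
    abs_contrasts = np.asarray(abs_contrasts, dtype=float)
    return abs_contrasts.mean(), abs_contrasts.std()   # (C_hat, spread)
```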
V. CONCLUSION

We have successfully developed an implicit EKF for event-based camera (DVS) localization based on the contrast residual (10), which provides a natural measure to define the likelihood of an event. For this, we derived a generative event model that incorporates the physical characteristics of the DVS. Our algorithm readily matches the asynchronous nature of the events and allows filter updates on an event-by-event basis. An additional advantage of our approach is that the contrast residual naturally takes into account a dense map representation of the environment, removing the data-association sub-problem. In future work, we plan to extend the developed method to event-based SLAM without additional sensing.

REFERENCES

[1] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor," IEEE J. of Solid-State Circuits, vol. 43, no. 2, pp. 566-576, 2008.
[2] D. Weikersdorfer and J. Conradt, "Event-based Particle Filtering for Robot Self-Localization," in IEEE Int. Conf. on Robotics and Biomimetics (ROBIO), 2012.
[3] D. Weikersdorfer, R. Hoffmann, and J. Conradt, "Simultaneous Localization and Mapping for event-based Vision Systems," in Int. Conf. on Computer Vision Systems (ICVS), 2013.
[4] R. Benosman, C. Clercq, X. Lagorce, S.-H. Ieng, and C. Bartolozzi, "Event-Based Visual Flow," IEEE Trans. Neural Networks and Learning Systems, vol. 25, no. 2, pp. 407-417, 2014.
[5] A. Censi and D. Scaramuzza, "Low-Latency Event-Based Visual Odometry," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2014.
[6] E. Mueggler, B. Huber, and D. Scaramuzza, "Event-based, 6-DOF Pose Tracking for High-Speed Maneuvers," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2014.
[7] M. Cook, L. Gugelmann, F. Jug, C. Krautz, and A. Steger, "Interacting maps for fast visual interpretation," in Int. Joint Conf. on Neural Networks (IJCNN), July 2011, pp. 770-776.
[8] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. J. Davison, "Simultaneous Mosaicing and Tracking with an Event Camera," in British Machine Vision Conf. (BMVC), 2014.
[9] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt, "Event-based 3D SLAM with a depth-augmented dynamic vision sensor," in IEEE Int. Conf. on Robotics and Automation (ICRA), Jun. 2014, pp. 359-364.
[10] G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor, and R. Benosman, "HFirst: A Temporal Approach to Object Recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PP, no. 99, pp. 1-1, 2015.
[11] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, Cambridge, MA, 2005.
[12] F. Chaumette and S. Hutchinson, "Visual Servoing and Visual Tracking," in Springer Handbook of Robotics, B. Siciliano and O. Khatib, Eds. Springer, 2008.
[13] Y. Ma, S. Soatto, J. Košecká, and S. S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models. Springer, 2004.
[14] P. Corke, Robotics, Vision and Control: Fundamental Algorithms in MATLAB, ser. Springer Tracts in Advanced Robotics. Springer, 2011.
[15] A. Chiuso, P. Favaro, H. Jin, and S. Soatto, "Structure from motion causally integrated over time," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 523-535, 2002.
[16] A. Davison, "Real-time simultaneous localisation and mapping with a single camera," in Int. Conf. on Computer Vision (ICCV), 2003, pp. 1403-1410.
[17] Z. Zhang and O. D. Faugeras, "Finding clusters and planes from 3D line segments with application to 3D motion determination," in Eur. Conf. on Computer Vision (ECCV), ser. Lecture Notes in Computer Science, vol. 588. Springer, 1992, pp. 227-236.
[18] S. Soatto, R. Frezza, and P. Perona, "Recursive Motion Estimation on the Essential Manifold," California Institute of Technology, Tech. Rep. CaltechCDSTR:1993.021, 1993.