An End-to-End System for Crowdsourced 3d Maps for Autonomous Vehicles: The Mapping Component

Onkar Dabeer 1, Radhika Gowaikar 1, Slawomir K. Grzechnik 1, Mythreya J. Lakshman 1, Gerhard Reitmayr 2, Kiran Somasundaram 1, Ravi Teja Sukhavasi 1, Xinzhou Wu 1

1 Qualcomm Technologies, Inc.   2 Qualcomm Austria Research Center GmbH
∗ The order in which authors' names appear does not signify contribution.

Abstract— Autonomous vehicles rely on precise high definition (HD) 3d maps for navigation. This paper presents the mapping component of an end-to-end system for crowdsourcing precise 3d maps with semantically meaningful landmarks such as traffic signs (6dof pose, shape and size) and traffic lanes (3d splines). The system uses consumer grade parts and, in particular, relies on a single front facing camera and a consumer grade GPS. Using real-time sign and lane triangulation on-device in the vehicle, with offline sign/lane clustering across multiple journeys and offline Bundle Adjustment across multiple journeys in the backend, we construct maps with mean absolute accuracy at sign corners of less than 20 cm from 25 journeys. To the best of our knowledge, this is the first end-to-end HD mapping pipeline in global coordinates in the automotive context using cost effective sensors.

I. INTRODUCTION

In the past decade, several autonomous vehicle prototypes have been demonstrated in highway and urban scenarios. The next decade is expected to be marked by commercial deployment of autonomous vehicles with different levels of autonomy (L1-L5, e.g., see [1]). Many current prototypes rely on precise 3D maps [2], [3], and it is expected that most commercial designs will also leverage such maps. These maps serve two main purposes:
• The landmarks/features in the map can be associated with camera/lidar landmarks/features seen by the ego vehicle, and this association can be used to precisely locate the ego vehicle in the map. Localization accuracy of 10 cm in the map is desired for navigation.
• The map also provides prior information about the static environment, which can far exceed the range of the vehicle sensors. In other words, the map is also a sensor: for example, it can warn about a sharp turn 50 m ahead even though the camera view is occluded by a truck in front of the ego vehicle.
While there are many classical techniques for 3d reconstruction of point clouds (e.g., Ch. 7 of [4]), the second purpose above calls for maps with semantically meaningful objects such as traffic signs and lanes. Moreover, the map is a live object: roads and associated signage change frequently, and reliable means of updating the map with new elements and discarding old ones are needed. For a globally scalable solution, crowdsourcing, where vehicles using the map also contribute to its update, is an attractive option. However, in contrast to the specialized vehicle fleets with high grade, expensive equipment used by many map makers, crowdsourcing inherently needs a map making solution based on consumer grade equipment to keep the system cost low. The key contribution of this paper is an end-to-end mapping solution based on a single front facing automotive grade 1MP RGB camera, consumer grade GPS, an IMU (accelerometer and gyro), a Qualcomm Snapdragon 820A SoC in the vehicle, and a backend mapping server. In our current experiments with real data, we obtain mean absolute accuracy of less than 20 cm and relative error of about 15 cm for traffic sign corners from 25 journeys.
Our system comprises several modules:
• Positioning: Precise camera 6dof pose estimation in world coordinates runs in real-time on a Snapdragon 820A. We get lane level accuracy, which allows reliable association of landmarks across multiple journeys. This module is not addressed in this paper.
• Perception: Precise detection and tracking of traffic sign corners and traffic lanes runs in real-time on a Qualcomm Snapdragon 820A. The perception module is described in [5].
• Single-journey Triangulation: The positioning and perception outputs are processed on the same Snapdragon 820A SoC for real-time triangulation of traffic signs and lanes. We do a full 3d reconstruction: 6dof pose for signs and 3d splines for lanes. The triangulation outputs and the related positioning and perception data are shipped over a commercial LTE link to the backend mapping server.
• Multi-journey Association/Clustering: In the backend mapping server, we cluster the triangulation outputs from multiple journeys across different days and vehicles. In particular, we identify the landmarks to be included in the map and their association with positioning and perception observations from individual journeys.
• Multi-journey Bundle Adjustment: Finally, we generate the map by joint optimization of camera poses, sign poses and lane spline parameters to minimize a robust cost function.

In this paper, we cover the last three modules above. Specifically, we focus on a) the rationale behind our design choices for triangulation and multi-journey processing, b) specific modules, such as road normal estimation, that are critical for good performance, and c) performance on real data collected in San Diego, CA, USA. Our main message is that we can build precise 3d maps with consumer grade equipment and a few tens of journeys.

The rest of the paper is organized as follows. Section II describes the mapping component system architecture and the rationale behind the design choices. Section III summarizes notation. Section IV describes the triangulation pipeline, and the offline multi-journey processing is described in Section V. Section VI summarizes our experiments on real world data.

II. SYSTEM ARCHITECTURE

Our system architecture is primarily driven by modularity, which allows different teams to iterate rapidly. In this section, we describe the mapping pipeline and the motivation behind some of our design choices.

A. Single-Journey Landmark Triangulation/Reconstruction

The technical details of this module are given in Section IV. Here we present an overview of some design choices. The single journey processing pipeline receives the following real-time inputs:
• Traffic sign vertex coordinates in the camera frame and the sign shape type for any sign detected or tracked in the frame (denoted by I1 in Fig. 1).
• Traffic lane point coordinates in the camera frame for any lane detected in the frame (denoted by I2 in Fig. 1).
• Camera 6dof pose for each frame of the camera (denoted by I3 in Fig. 1).
Single-journey triangulation processes these inputs and generates 6dof traffic sign poses in world coordinates (O1 in Fig. 1) and 3d spline fits for lane segments (O2 in Fig. 1). In our system, we made the choice of implementing this module in real-time, since 3d reconstruction of traffic signs and lanes is also needed by autonomous vehicles for a semantic understanding of their environment. However, for a pure mapping system, an offline implementation is also acceptable. The sign and lane triangulation pipelines are independent.
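To make these interfaces concrete, the following is a minimal sketch of the per-frame inputs (I1-I3) and the triangulation outputs (O1, O2) as simple data structures. This is an illustrative Python sketch with hypothetical names and field choices (e.g., the 4x4 pose representation), not the actual on-device implementation.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

# Per-frame inputs to single-journey triangulation (illustrative field names).
@dataclass
class SignDetection:            # I1: one sign detected/tracked in a camera frame
    shape_type: str             # e.g. "rectangle", "diamond", "octagon"
    corners_px: np.ndarray      # (num_vertices, 2) sign vertex coordinates in pixels

@dataclass
class LaneDetection:            # I2: one lane marker detected in a camera frame
    points_px: np.ndarray       # (num_points, 2) lane point coordinates in pixels

@dataclass
class FrameInput:
    timestamp: float
    camera_pose: np.ndarray     # I3: 6dof camera pose, here a 4x4 world-from-camera transform
    signs: List[SignDetection] = field(default_factory=list)
    lanes: List[LaneDetection] = field(default_factory=list)

# Outputs of single-journey triangulation.
@dataclass
class TriangulatedSign:         # O1: 6dof traffic sign pose in world coordinates
    pose_world: np.ndarray      # 4x4 pose of the (planar) sign
    corners_world: np.ndarray   # (num_vertices, 3) 3d sign corner coordinates

@dataclass
class TriangulatedLane:         # O2: 3d spline fit for a lane segment
    control_points: np.ndarray  # (num_ctrl, 3) spline control points in world coordinates
```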
For both signs and lanes, we first use the camera pose and the landmark detections for inter-frame association. In our system, the perception module also performs tracking; for modularity, we made the choice that perception tracking relies on image features, while inter-frame association in the mapping component relies on camera pose information. Section IV-A covers inter-frame association in detail.

Once we have the tracklets from inter-frame association, we run a largely classical triangulation pipeline for traffic signs. In particular, we exploit the fact that traffic signs are planar. The details of single-journey sign reconstruction are described in Section IV-B.

For lanes, the situation is more complex. The inverse perspective mapping (IPM) that converts camera lane information onto the road surface in 3d is inherently a sensitive transformation. The fact that lanes are long objects, stretching several tens of meters ahead of the vehicle, further amplifies the problem. One of our key innovations is the estimation of the road surface normal near the ego vehicle: we start with a calibrated value but perform periodic online correction of the road normal by ensuring that road points after IPM are planar. Road normal estimation is described in Section IV-C and lane reconstruction in Section IV-D.

We do not triangulate all the landmarks we detect. To ensure the reliability of triangulated landmarks, we apply several pruning rules: for traffic signs these are often based on viewing angles, and for lanes on tracklet length.

B. Multi-journey Processing

To map out a road, we drive on it several times with our vehicles. The real-time triangulation outputs, as well as the camera poses and camera frame landmark detections from each of the journeys, are stored in the backend mapping server. The aim of the multi-journey processing module is to combine all this information to form the final map. The technical details of this module are described in Section V. Here we give a high level overview, as shown in Fig. 2. Data from multiple journeys can be combined in two ways:
• Jointly process data from multiple journeys so that we are not biased towards any particular journey.
• Incrementally update an existing map using new journey data.
In practice we need both: the first approach is necessary for cold start (our case) and for periodic rationalization of all observed data to ensure an accurate map, while the second is necessary for real-time update of the map with new content such as a road construction sign. In this paper we focus on the cold start case.

The first step is to cluster landmarks triangulated from different journeys. For both signs and lanes, we rely on defining suitable distance metrics between the landmarks, followed by several rounds of spectral clustering, each with a different spatial scale parameter. Our current distance metrics are purely based on geometry (including the shape of traffic signs), though in the future we envision using some underlying image features. Once we have a distance/similarity metric, there is a rich set of clustering algorithms to choose from. We use spectral clustering since an analysis of the similarity matrix eigenvalue spectrum and the spectral gap gives us a good estimate of the number of clusters, that is, the number of true landmarks underlying our data. Once we have estimated the number of landmarks in the map, we either use K-means clustering of the similarity matrix rows or use the sign of the row entries to derive a binary cluster code.
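To illustrate this design choice, here is a minimal sketch of the cluster-count estimation and clustering step, assuming a precomputed pairwise similarity matrix between triangulated landmarks and using the standard eigengap heuristic on the normalized graph Laplacian. Function names and the max_clusters bound are hypothetical; this is not the exact backend implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def estimate_num_clusters(similarity: np.ndarray, max_clusters: int = 50) -> int:
    """Estimate the number of landmark clusters from the eigengap of the
    normalized graph Laplacian built from a pairwise similarity matrix."""
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    degree = similarity.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(degree, 1e-12)))
    laplacian = np.eye(len(similarity)) - d_inv_sqrt @ similarity @ d_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))  # ascending
    # The largest gap among the smallest eigenvalues indicates the cluster count.
    gaps = np.diff(eigvals[: max_clusters + 1])
    return int(np.argmax(gaps)) + 1

def cluster_landmarks(similarity: np.ndarray) -> np.ndarray:
    """Group triangulated landmarks from multiple journeys into clusters,
    one cluster per (estimated) true landmark."""
    k = estimate_num_clusters(similarity)
    model = SpectralClustering(n_clusters=k, affinity="precomputed",
                               assign_labels="kmeans", random_state=0)
    return model.fit_predict(similarity)  # cluster label per landmark
```

In the actual pipeline, several rounds of clustering with different spatial scale parameters are run and under-clustering is preferred, as discussed above; the sketch shows only a single round.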
The details of multi-journey clustering are given in Section V-A. The final step is to use classical Bundle Adjustment to jointly optimize camera poses, sign poses, and lane spline parameters across multiple journeys, ensuring that the landmark observations in the camera images are well approximated and some regularity constraints are also met. The details of this optimization are given in Section V-B.

For cold start, an additional final step may be necessary. In multi-journey clustering, we prefer under-clustering and we ignore small clusters. This ensures that we do not put different landmarks in the same cluster, but it also leads to some observations not being used. To harvest these, once a map is generated, we can feed it back to improve clustering and harvest any unused data. We do not describe this aspect in the paper and instead focus in Section V on landmark clustering and Bundle Adjustment.

Fig. 1. Single journey mapping pipeline
Fig. 2. Multi-journey mapping pipeline

III. NOTATION

TABLE I: NOTATION
f_n         camera frame n
t_n         timestamp of frame n
p_n         6dof camera pose corresponding to frame n
s_{n,i}     i-th sign detected within frame n
s_{n,i,j}   j-th corner of the i-th sign detected within frame n
l_{n,i}     i-th lane marker detected within frame n
l_{n,i,j}   j-th point of the i-th lane marker detected within frame n
s_i         i-th sign tracklet's final state after inter-frame association; {s_{n,k}} are the associated observations
l_i         i-th lane marker tracklet's final state after inter-frame association; {l_{n,k}} are the associated observations
S_i         i-th sign (after triangulation of s_i)
S_{i,j}     3D coordinates of the j-th corner of the i-th sign
L_i         i-th lane marker (after triangulation of l_i)
L_{i,j}     3D coordinates of the j-th control point of the i-th lane marker
S_i         i-th sign after multi-journey association; {S_i^(k)} are the associated 3D signs
L_i         i-th lane marker after multi-journey association; {L_i^(k)} are the associated 3D lane markers

The notation for single journey variables is listed in Table I. For the multi-journey case, we use the superscript (k) to denote data from the k-th journey.

IV. SINGLE JOURNEY TRIANGULATION

In this section, we detail sign and lane triangulation using 6dof camera poses and sign/lane detections in the pixel domain as input. The overall flow, as detailed in Fig. 1, consists of associating detections across frames to obtain tracklets, followed by triangulation.

A. Inter-frame Association

We rely entirely on geometry to associate detections across successive frames, since low level image features are not available at this stage of the pipeline at present. Furthermore, geometric considerations are adequate since the pose is quite accurate and sign/lane features are relatively well separated. Detections are associated with each other across frames to obtain tracklets; a tracklet is effectively a set of detections for a specific physical object. Inter-frame association is performed causally, either associating new detections with existing tracklets or spawning new tracklets. We formulate the association of new detections with existing tracklets as a weighted bipartite graph matching problem. Such a formulation naturally relies on an association cost; we spawn new tracklets for detections that have a high cost of association with all existing tracklets. Let {λ_i}_{i∈I} denote the tracklets and {y_{n,j}}_{j∈J} denote the pixel measurements for the |J| objects detected in frame n.
For each tracklet λ_i, let d_i denote the gap since the previous measurement, i.e., the last measurement for λ_i was received at frame n − d_i. Let C_{n,ij} denote the cost of associating measurement y_{n,j} with tracklet λ_i. For a given measurement, say y_{n,j}, we spawn a new tracklet if min_{i∈I} C_{n,ij} is too high. This avoids forcing an association when there is none, e.g., when the detections on the current frame are all new and do not correspond to anything seen thus far. Let J_N ⊆ J denote the subset of measurements for which new tracklets will be spawned. For the remaining measurements, we obtain the optimal assignment to tracklets by solving a weighted bipartite matching problem on the matrix C_n = [C_{n,ij}]_{i | d_i
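Below is a minimal sketch of this gated assignment step, assuming a precomputed cost matrix C_n and using the Hungarian algorithm from scipy for the weighted bipartite matching. The function name and the max_cost gating threshold are illustrative, not the exact on-vehicle implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_detections(cost: np.ndarray, max_cost: float):
    """Assign frame-n detections (columns) to existing tracklets (rows).

    cost[i, j] is the cost C_{n,ij} of associating measurement y_{n,j}
    with tracklet lambda_i. Detections whose best association cost exceeds
    max_cost spawn new tracklets instead of being matched.
    """
    num_tracklets, num_detections = cost.shape
    # Detections that are too far from every tracklet spawn new tracklets (the set J_N).
    spawn_new = [j for j in range(num_detections) if cost[:, j].min() > max_cost]
    remaining = [j for j in range(num_detections) if j not in spawn_new]

    matches = {}
    if remaining and num_tracklets > 0:
        # Optimal weighted bipartite matching on the reduced cost matrix.
        sub_cost = cost[:, remaining]
        rows, cols = linear_sum_assignment(sub_cost)
        for i, c in zip(rows, cols):
            j = remaining[c]
            if cost[i, j] <= max_cost:   # reject residual high-cost pairs
                matches[j] = i           # detection j -> tracklet i
            else:
                spawn_new.append(j)
    return matches, spawn_new
```

Matched detections update the state of their tracklets, while the detections in spawn_new start new tracklets, mirroring the set J_N defined above.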